Statistics, Histograms, and Probability

In all likelihood you have computed an average, for example, the average of all your test scores in a course. To find your average, you add your scores and divide by the number of tests. The mathematical term for this average is the mean. On the other hand, the median is the value in the middle of the sorted data if the number of data points is odd. For example, if the scores on a particular test in a class of 27 students have a median of 74, then 13 students scored below 74, 13 scored above 74, and one student obtained a grade of 74. If the number of data points is even, the median is the mean of the two values closest to the middle. The mean need not be the same as the median. For example, for the data 60, 65, 68, 74, 88, 95, the mean is 75, whereas the median is the mean of 68 and 74, or 71.
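As a quick check of the arithmetic in this example, the following minimal sketch computes both quantities for the six scores. The variable names (scores, avg, mid) are illustrative only.

```matlab
% Six test scores from the example; they need not be sorted.
scores = [60, 65, 68, 74, 88, 95];

avg = mean(scores);     % (60+65+68+74+88+95)/6
mid = median(scores);   % even count: mean of the two middle values, 68 and 74

fprintf('mean = %g, median = %g\n', avg, mid);
```

With an even number of data points, median averages the two middle values, which is why the result here is not one of the scores themselves.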

MATLAB provides the mean(x) and median(x) functions to perform these computations. If x is a vector, the mean (or median) value of the vector's values is returned. However, if x is a matrix, a row vector is returned containing the mean (or median) value of each column of x. These functions do not require the elements in x to be sorted in ascending or descending order.

In many applications, the mean and the median do not adequately describe a data set. Two data sets can have the same mean (or the same median) yet be very different. For example, the test scores 60, 65, 68, 74, 88, 95 have the same mean as the scores 71, 72, 73, 77, 78, 79, but the two sets describe very different test outcomes. The first set of scores varies over a large range, whereas in the second set the scores are tightly grouped about the mean.
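To illustrate both the column-wise behavior of mean and median and the point about spread, the sketch below stores the two score sets as the columns of a matrix. The names A, col_means, col_medians, and spread are assumptions made for this example, and the range (maximum minus minimum) is used here only as one simple measure of spread.

```matlab
% Each column holds one set of test scores.
set1 = [60; 65; 68; 74; 88; 95];
set2 = [71; 72; 73; 77; 78; 79];
A = [set1, set2];

col_means   = mean(A);          % one mean per column
col_medians = median(A);        % one median per column
spread      = max(A) - min(A);  % range of each column

disp(col_means)
disp(col_medians)
disp(spread)
```

Both columns have a mean of 75, but the first set spans 35 points and the second only 8.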

The way the data are spread around the mean can be described by a histogram plot. A histogram is a plot of the frequency of occurrence of data values versus the values themselves. For example, suppose that in a class of 20 students the 20 scores on the first test were

61 61 65 67 69 72 74 74 76 77
83 83 85 88 89 92 93 93 95 98

On this test there are five scores in the 60-69 range, five in the 70-79 range, five in the 80-89 range, and five in the 90-100 range. The histogram for these scores is shown in the top graph in Figure 7.1-1. It is a bar plot of the number of scores that occur within each range, with the bar centered in the middle of the range (for example, the bar for the range 60-69 is centered at 64.5, and the asterisk on the plot’s abscissa shows the bar’s center).

Suppose that on the second test the following 20 scores were achieved:

66 69 72 74 75 76 77 78 78 79
79 80 81 83 84 85 87 88 90 94

On this test there are two scores in the 60-69 range, nine in the 70-79 range, seven in the 80-89 range, and two in the 90-100 range. The histogram for these scores is shown in the bottom graph in Figure 7.1-1. The mean on both tests is identical and is 79.75. However, the distribution of the scores is very different. On the first test we say that the scores are evenly, or "uniformly," distributed between 60 and 100, whereas on the second test the scores are more clustered around the mean.

To plot a histogram, you must group the data into subranges, called bins. In this example the four bins are the ranges 60-69, 70-79, 80-89, and 90-100. The choice of the bin width and bin center can drastically change the shape of the histogram. If the number of data values is relatively small, the bin width cannot be small because some of the bins will contain no data and the resulting histogram might not usefully illustrate the distribution of the data.

To obtain a histogram, first sort the data if it has not yet been sorted (you can use the sort function here). Then choose the bin ranges and bin centers and count the number of values in each bin. Use the bar function to plot the number of values in each bin versus the bin centers as a bar chart. The function bar(x,y) creates a bar chart of y versus x. The MATLAB script file that generates Figure 7.1-1 follows. We have selected the bin centers to be in the middle of the ranges 60-69, 70-79, 80-89, 90-99.

MATLAB provides the hist command to generate a histogram. This command has several forms. Its basic form is hist(y), where y is a vector containing the data. This form aggregates the data into 10 bins evenly spaced between the minimum and maximum values in y. The second form is hist(y,n), where n is a user-specified scalar indicating the number of bins. The third form is hist(y,x), where x is a user-specified vector that determines the location of the bin centers; the bin widths are the distances between the centers.
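The bar-chart procedure described above (count the values in each bin, then plot the counts against the bin centers) might be sketched as follows. This is not the original script for Figure 7.1-1; the variable names, the bin limits stored in lower and upper, and the axis settings are assumptions chosen for illustration.

```matlab
% Scores from the two tests given in the text.
test1 = [61 61 65 67 69 72 74 74 76 77 83 83 85 88 89 92 93 93 95 98];
test2 = [66 69 72 74 75 76 77 78 78 79 79 80 81 83 84 85 87 88 90 94];

% Bins 60-69, 70-79, 80-89, 90-100 (assumed limits; last bin includes 100).
lower   = [60 70 80 90];
upper   = [70 80 90 101];
centers = [64.5 74.5 84.5 94.5];

% Count the scores that fall in each bin.
counts1 = zeros(1, 4);
counts2 = zeros(1, 4);
for k = 1:4
    counts1(k) = sum(test1 >= lower(k) & test1 < upper(k));
    counts2(k) = sum(test2 >= lower(k) & test2 < upper(k));
end

% Plot the counts versus the bin centers, one test per subplot.
subplot(2,1,1)
bar(centers, counts1), axis([60 100 0 10])
title('Scores for Test 1'), xlabel('Score Range'), ylabel('Frequency')
subplot(2,1,2)
bar(centers, counts2), axis([60 100 0 10])
title('Scores for Test 2'), xlabel('Score Range'), ylabel('Frequency')
```

The counts agree with the tallies given in the text: five scores per bin on the first test, and two, nine, seven, and two on the second.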

will not be satisfactory. This case occurs when you want to obtain a relative frequency histogram. In such cases you can use the bar function to generate the histogram. The following script file generates the relative frequency histogram for the 100 thread tests. Note that if you use the bar function, you must aggregate the data first. The result appears in Figure 7.1-4.

The fourth, fifth, and sixth forms of the hist function do not generate a plot, but are used to compute the frequency counts and bin locations. The bar function can then be used to plot the histogram. The syntax of the fourth form is [z,x] = hist(y), where z is the returned vector containing the frequency count and x is the returned vector containing the bin locations. The fifth and sixth forms are [z,x] = hist(y,n) and [z,x] = hist(y,x). In the latter case the returned vector x is the same as the user-supplied vector. The following script file shows how the sixth form can be used to generate a relative frequency histogram for the thread example with 100 tests. The plot generated by this M-file will be identical to that shown in Figure 7.1-4. These commands are summarized in Table 7.1-1.
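Because the thread data are not reproduced in this section, the sketch below applies the sixth form to the second set of test scores given earlier instead. It is meant only to show the pattern (obtain the counts from hist, scale them, then plot with bar); the variable names are assumptions. Note also that hist is an older function, and recent MATLAB releases recommend histogram and histcounts in its place.

```matlab
% Second-test scores from earlier in the section (stand-in for the thread data).
y = [66 69 72 74 75 76 77 78 78 79 79 80 81 83 84 85 87 88 90 94];

% Bin centers for the ranges 60-69, 70-79, 80-89, 90-99.
x = [64.5 74.5 84.5 94.5];

% Sixth form: returns the counts z and the bin centers, without plotting.
[z, x_out] = hist(y, x);

% Convert absolute frequencies to relative frequencies and plot them.
rel_freq = z / sum(z);
bar(x_out, rel_freq)
xlabel('Score Range'), ylabel('Relative Frequency')
```

Dividing by sum(z) rather than by a hard-coded count keeps the script correct if the number of data points changes.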

The Data Statistics Tool

With the Data Statistics tool you can calculate statistics for data and add plots of the statistics to a graph of the data. The tool is accessed from the Figure window after you plot the data. Click on the Tools menu, then select Data Statistics. The menu appears as shown in Figure 7.1-5. To plot the mean of the dependent variable (y), click the box in the row labeled mean under the column labeled Y, as shown in the figure. You can plot other statistics as well; these are shown in the figure. You can save the statistics to the workspace as a structure by clicking on the Save to Workspace button. This opens a dialog box that prompts you for a name for the structure containing the x data, and a name for the y data structure.

Probability

Probability is expressed as a number between 0 and 1 or as a percentage between 0 percent and 100 percent. For example, because there are six equally likely outcomes from rolling a single balanced die, the probability of obtaining a specific number on one roll is 1/6, or 16.67 percent. Thus if you roll the die a large number of times, you expect to obtain a 2 one-sixth of the time. Figure 7.1-6 shows the theoretical uniform probabilities for rolling a single die, and the relative frequency histogram for the data from 100 die rolls. The number of times a 1, 2, 3, 4, 5, or 6 occurred was 21, 14, 18, 16, 19, and 12, respectively. The plots of the theory and the data are very similar, but not identical. In general, if you had rolled the die 1000 times instead of 100 times, the histogram would look even more like the theoretical probability plot.
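The comparison between theory and experiment for the single die can be sketched as follows, using the counts quoted above. The variable names and plot styling are assumptions, and the result is only an approximation of what Figure 7.1-6 shows.

```matlab
% Number of times each face (1 through 6) occurred in 100 rolls.
counts  = [21 14 18 16 19 12];
n_rolls = sum(counts);

% Relative frequencies from the data and uniform theoretical probabilities.
rel_freq = counts / n_rolls;
theory   = ones(1, 6) / 6;

% Bars show the experiment; circles show the theory.
bar(1:6, rel_freq), hold on
plot(1:6, theory, 'o'), hold off
xlabel('Outcome'), ylabel('Relative Frequency')
legend('Data from 100 rolls', 'Theory (1/6)')
```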

Figure 7.1-6 Comparison of theory and experiment for 100 rolls of a single die.

If you roll two balanced dice, each roll has 36 possible outcomes because each die can produce six numbers. There is only one way to obtain a sum of 2, but there are two ways to obtain a sum of 3, and so on. Thus the probability of rolling a sum of 2 is 1/36, and the probability of rolling a sum of 3 is 1/36 + 1/36 = 2/36.

Continuing this line of reasoning, you can obtain the theoretical probabilities for the sum of two dice, as shown in the following table.

Probabilities for the sum of two dice
Sum                  2   3   4   5   6   7   8   9  10  11  12
Probability (x 36)   1   2   3   4   5   6   5   4   3   2   1
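The entries in this table can be reproduced by enumerating all 36 outcomes and counting how many give each sum. The following is a minimal sketch of that counting argument; the variable names are assumptions.

```matlab
% All 36 combinations of the two dice.
[d1, d2] = meshgrid(1:6, 1:6);
sums = d1 + d2;

% Count the number of ways to obtain each sum from 2 to 12.
possible_sums = 2:12;
ways = zeros(1, 11);
for k = 1:11
    ways(k) = sum(sums(:) == possible_sums(k));
end

% Theoretical probability of each sum.
prob = ways / 36;
disp([possible_sums; ways])
```

The vector ways matches the "Probability (x 36)" row, and the probabilities sum to 1.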

An experiment was performed by rolling two dice 100 times and recording the sums. The data follow.

Data for two dice
Sum          2   3   4   5   6   7   8   9  10  11  12
Frequency    5   5   8  11  20  10   8  12   7  10   4

Figure 7.1-7 shows the relative frequency histogram and the theoretical probabilities on the same plot. If you had collected more data, the histogram would likely have been closer to the theoretical probabilities.

The theoretical probabilities can be used to predict the outcome of an experiment. Note that the sum of the theoretical probabilities for two dice equals 1, because it is 100 percent certain to obtain a sum between 2 and 12. The sum of the probabilities corresponding to the outcomes 3, 4, and 5 is 2/36 + 3/36 + 4/36 = 9/36 = 1/4. This result corresponds to a probability of 25 percent. Thus if you roll two dice many times, 25 percent of the time you would expect to obtain a sum of either 3, 4, or 5.

In many applications the theoretical probabilities are not available because the underlying causes of the process are not understood well enough. In such applications you can use the histogram to make predictions. For example, if you did not have the theoretical probabilities for the sum of two dice, you could use the data to estimate the probability. Using the previously given data from 100 rolls, you can estimate the probability of obtaining a sum of either 3, 4, or 5 by summing the relative frequencies of these three outcomes. This sum is (5 + 8 + 11)/100 = 0.24, or 24 percent. Thus on the basis of the data from 100 rolls, you can estimate that you would obtain a sum of either 3, 4, or 5 about 24 percent of the time. The accuracy of the estimates so obtained is highly dependent on the number of trials used to collect the data; the more trials, the better. Many sophisticated statistical methods are available to assess the accuracy of such predictions; these methods are covered in advanced courses.
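The two estimates discussed above can be computed side by side. This sketch uses the theoretical counts and the experimental frequencies from the two tables; the names s, ways, freq, p_theory, and p_est are assumptions made for the example.

```matlab
% Possible sums, number of ways to obtain each, and observed frequencies.
s    = 2:12;
ways = [1 2 3 4 5 6 5 4 3 2 1];
freq = [5 5 8 11 20 10 8 12 7 10 4];

% Select the outcomes 3, 4, and 5.
pick = (s >= 3) & (s <= 5);

% Theoretical probability: (2 + 3 + 4)/36.
p_theory = sum(ways(pick)) / 36;

% Estimate from the data: (5 + 8 + 11)/100.
p_est = sum(freq(pick)) / sum(freq);

fprintf('theory = %.2f, estimate from data = %.2f\n', p_theory, p_est);
```

The estimate from 100 rolls (0.24) is close to the theoretical value (0.25), which is consistent with the discussion above, although a single experiment of this size does not by itself establish how accurate such an estimate is.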