The Normal Distribution
Rolling a die is an example of a process whose possible outcomes are a limited set of numbers; namely, the integers from 1 to 6. For such processes the probability is a function of a discrete-valued variable, that is, a variable having a limited number of values. For example, the following table gives the measured heights of 100 men 20 years of age. The heights were recorded to the nearest 1/2 in., so the height variable is discrete valued:
Height (in.)  Frequency    Height (in.)  Frequency
64            1            70            9
64.5          0            70.5          8
65            0            71            7
65.5          0            71.5          5
66            2            72            4
66.5          4            72.5          4
67            5            73            3
67.5          4            73.5          1
68            8            74            1
68.5          11           74.5          0
69            12           75            1
69.5          10
Scaled Frequency Histogram
You can plot the data as a histogram using either the absolute or relative frequencies. However, another useful histogram uses data scaled so that the total area under the histogram’s rectangles is 1. This scaled frequency histogram is the absolute frequency histogram divided by the total area of that histogram. The area of each rectangle on the absolute frequency histogram equals the bin width times the absolute frequency for that bin. Because all the rectangles have the same width, the total area is the bin width times the sum of the absolute frequencies. The following M-file produces the scaled histogram shown in Figure 7.2-1.
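The M-file itself is not reproduced here, but the scaling it performs can be sketched as follows. This is a Python stand-in for the MATLAB script, not the author’s code; the variable names are assumptions.

```python
# Absolute frequencies for heights 64, 64.5, ..., 75 in. (from the table above).
y_abs = [1, 0, 0, 0, 2, 4, 5, 4, 8, 11, 12, 10,
         9, 8, 7, 5, 4, 4, 3, 1, 1, 0, 1]
binwidth = 0.5
bins = [64 + binwidth * k for k in range(len(y_abs))]

# Total area of the absolute frequency histogram = bin width * sum of frequencies.
total_area = binwidth * sum(y_abs)          # 0.5 * 100 = 50.0
y_scaled = [f / total_area for f in y_abs]  # scaled so the histogram area is 1

# The area under the scaled histogram is 1 (up to round-off).
print(binwidth * sum(y_scaled))
```

Plotting `bins` against `y_scaled` as a bar chart reproduces the scaled histogram of Figure 7.2-1.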
Because the total area under the scaled histogram is 1, the fractional area corresponding to a range of heights gives the probability that a randomly selected 20-year-old man will have a height in that range. For example, the heights of the scaled histogram rectangles corresponding to heights of 67 through 69 in. are 0.1, 0.08, 0.16, 0.22, and 0.24. Because the bin width is 0.5, the total area corresponding to these rectangles is (0.1 + 0.08 + 0.16 + 0.22 + 0.24)(0.5) = 0.4. Thus 40 percent of the heights lie between 67 and 69 in.
You can use the cumsum function to calculate areas under the scaled frequency histogram, and therefore calculate probabilities. If x is a vector, cumsum(x) returns a vector the same length as x whose elements are the cumulative sums of the elements of x. For example, if x = [2, 5, 3, 8], cumsum(x) = [2, 7, 10, 18]. If A is a matrix, cumsum(A) computes the cumulative sum of each column. The result is a matrix the same size as A.
After running the previous script, the last element of cumsum(y_scaled)*binwidth is 1, which is the area under the scaled frequency histogram. To compute the probability of a height lying between 67 and 69 in. (that is, above the 6th value up through the 11th value), type

»prob = cumsum(y_scaled)*binwidth;
»prob67_69 = prob(11) - prob(6)

The result is prob67_69 = 0.4000, which agrees with our previous calculation of 40 percent.
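The same cumulative-sum calculation can be sketched outside MATLAB. Here is a Python version using itertools.accumulate in place of cumsum (variable names are assumptions; MATLAB indexing is 1-based, Python’s is 0-based):

```python
import itertools

# Scaled histogram data, as in the script above.
y_abs = [1, 0, 0, 0, 2, 4, 5, 4, 8, 11, 12, 10,
         9, 8, 7, 5, 4, 4, 3, 1, 1, 0, 1]
binwidth = 0.5
y_scaled = [f / (binwidth * sum(y_abs)) for f in y_abs]

# Running area under the scaled histogram (MATLAB: cumsum(y_scaled)*binwidth).
prob = list(itertools.accumulate(binwidth * y for y in y_scaled))

# P(67 <= height <= 69): area above the 6th value up through the 11th value.
# prob(11) and prob(6) in MATLAB are prob[10] and prob[5] in Python.
prob_67_69 = prob[10] - prob[5]
print(round(prob_67_69, 4))  # -> 0.4
```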
Continuous Approximation to the Scaled Histogram
In the height data given previously, there was a limited number of possible outcomes because the heights were measured to within 1/2 in. That is, if a particular man’s height is between 66 and 66.5, we would measure and record his height as either 66 or 66.5. The number of possible outcomes would double if we measured the heights to within 1/4 in. Other processes can have an infinite set of possible outcomes. For example, you could obtain an infinite number of possible height measurements in the human population if you could measure a person’s height to enough decimal places; an infinite number of values exist between 66 and 66.5 in.
Figure 7.2-2 shows the scaled histogram for very many height measurements taken to within 1/4 in. For many processes, as we decrease the bin width and increase the number of measurements, the tops of the rectangles in the scaled histogram often form a smooth bell-shaped curve such as the one shown in Figure 7.2-2.
For processes having an infinite number of possible outcomes, the probability is a function of a continuous variable and is plotted as a curve rather than as rectangles. It is based on the same concept as the scaled histogram; that is, the total area under the curve is 1, and the fractional area gives the probability of occurrence of a specific range of outcomes. A probability function that describes many processes is the normal or Gaussian function, which is shown in Figure 7.2-3.
This function is also known as the “bell-shaped curve.” Outcomes that can be described by this function are said to be “normally distributed.” The normal probability function is a two-parameter function; one parameter, µ, is the mean of the outcomes, and the other parameter, σ, is the standard deviation. The mean µ locates the peak of the curve and is the most likely value to occur. The width, or spread, of the curve is described by the parameter σ. Sometimes the term variance is used to describe the spread of the curve. The variance is the square of the standard deviation σ.
The normal probability function is described by the following equation:

f(x) = (1/(σ√(2π))) e^(−(x − µ)²/(2σ²))        (7.2-1)
Figure 7.2-4 is a plot of this function for three cases having the same mean, µ = 10, but different standard deviations: σ = 1, 2, and 3. Note how the peak height decreases as σ is increased. The reason is that the area under the curve must equal 1 (because the value of the random variable x must certainly lie between −∞ and +∞).
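The peak-height behavior can be checked directly from equation (7.2-1). The following Python sketch evaluates the normal probability function at its peak x = µ for the three cases (the function name is an assumption):

```python
import math

def normal_pdf(x, mu, sigma):
    # Normal probability function, Eq. (7.2-1).
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Peak heights at x = mu for mu = 10 and sigma = 1, 2, and 3.
# The peak is 1/(sigma*sqrt(2*pi)), so it shrinks as sigma grows.
for sigma in (1, 2, 3):
    print(sigma, normal_pdf(10, 10, sigma))
```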
Recall that the fractional area under a scaled histogram gives the probability that a range of outcomes will occur. The fractional area under a probability function curve also gives this probability. It can be shown that 68.3 percent, or approximately 68 percent, of the area lies between the limits of µ – σ ≤ x ≤ µ + σ.
Consequently, if a variable is normally distributed, there is a 68 percent chance that a randomly selected sample will lie within one standard deviation of the mean. In addition, 95.5 percent, or approximately 96 percent, of the area lies between the limits of µ – 2σ ≤ x ≤ µ + 2σ and 99.7 percent, or practically 100 percent, of the area lies between the limits of µ – 3σ ≤ x ≤ µ + 3σ. So there is a 96 percent chance that a randomly selected sample will lie within two
standard deviations of the mean, and a 99.7 percent chance that a randomly selected sample will lie within three standard deviations of the mean. Figure 7.2-5 illustrates the areas associated with the µ ± σ and µ ± 2σ limits. For example, if the variable is normally distributed with a mean equal to 20 and a standard deviation equal to 2, there is a 68 percent chance that a randomly selected sample will lie between 18 and 22, a 96 percent chance that it will lie between 16 and 24, and a 99.7 percent chance that it will lie between 14 and 26.
Estimating the Mean and Standard Deviation
In most applications you do not know the mean or variance of the distribution of possible outcomes, but must estimate them from experimental data. An estimate of the mean µ is denoted by x̄ and is found in the same way you compute an average, namely,

x̄ = (x₁ + x₂ + ··· + xₙ)/n        (7.2-2)

where the n data values are x₁, x₂, …, xₙ. The variance of a set of data values is the average of their squared deviations from their mean x̄. Thus an estimate s of the standard deviation σ is computed from a set of n data values as follows:

s = √[ (1/(n − 1)) Σ (xᵢ − x̄)² ]        (7.2-3)

where the sum is taken over i = 1, …, n.
You might expect the divisor to be n rather than n − 1. However, using n − 1 gives a better estimate of the standard deviation when the number of data points n is small. The MATLAB function mean(x) uses (7.2-2) to calculate the mean of the data stored in the vector x. The function std(x) uses (7.2-3) to calculate the standard deviation. Table 7.2-1 summarizes these functions.
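These formulas can be illustrated with a short Python sketch; Python’s statistics module uses the same n − 1 divisor as MATLAB’s std(x), so the hand computation and the library agree (the sample data here is made up):

```python
import statistics

x = [5.0, 7.0, 9.0, 11.0]   # a small made-up sample
n = len(x)

xbar = sum(x) / n                                         # mean, Eq. (7.2-2)
s = (sum((xi - xbar) ** 2 for xi in x) / (n - 1)) ** 0.5  # std, Eq. (7.2-3)

# The library functions (like MATLAB's mean and std) give the same values.
print(xbar, s)
print(statistics.mean(x), statistics.stdev(x))
```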
As discussed earlier, you can use the 1σ, 2σ, and 3σ points to estimate the 68.3 percent, 95.5 percent, and 99.7 percent probabilities, respectively. Thus for the preceding height data, 68.3 percent of 20-year-old men will be between µ − σ = 67.3 and µ + σ = 71.3 in. tall.
If you need to compute the probability at other points, you can use the erf function. Typing erf(x) returns the area to the left of the value t = x under the curve of the function (2/√π)e^(−t²). This area, which is a function of x, is known as the error function and is written erf(x). If the outcomes are normally distributed, the probability that the random variable x is less than or equal to b is written P(x ≤ b). This probability can be computed from the error function as follows [Kreyszig, 1998]:

P(x ≤ b) = ½ [1 + erf((b − µ)/(σ√2))]
The probability that the random variable x is no less than a and no greater than b is written P(a ≤ x ≤ b). It can be computed as follows:

P(a ≤ x ≤ b) = ½ [erf((b − µ)/(σ√2)) − erf((a − µ)/(σ√2))]
These equations are useful for computing probabilities of outcomes for which the data is scarce or missing altogether.
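One way to sanity-check these formulas is a short Python sketch using math.erf, which behaves like MATLAB’s erf (the helper function names are assumptions):

```python
import math

def p_less_than(b, mu, sigma):
    # P(x <= b) = (1/2) * (1 + erf((b - mu) / (sigma * sqrt(2))))
    return 0.5 * (1 + math.erf((b - mu) / (sigma * math.sqrt(2))))

def p_between(a, b, mu, sigma):
    # P(a <= x <= b) = P(x <= b) - P(x <= a)
    return p_less_than(b, mu, sigma) - p_less_than(a, mu, sigma)

# With mu = 20 and sigma = 2 (the earlier example), the interval 18..22
# (mu +/- sigma) should contain about 68.3 percent of the outcomes, and
# 16..24 (mu +/- 2 sigma) about 95.5 percent.
print(p_between(18, 22, 20, 2))
print(p_between(16, 24, 20, 2))
```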
Sums and Differences of Random Variables
It can be proved that the mean of the sum (or difference) of two independent normally distributed random variables equals the sum (or difference) of their means, but the variance is always the sum of the two variances. That is, if x and y are normally distributed with means µx and µy and variances σx² and σy², and if u = x + y and v = x − y, then

µu = µx + µy        µv = µx − µy
σu² = σv² = σx² + σy²
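A quick numerical experiment illustrates these rules. This is a Python sketch; the means, standard deviations, sample size, and seed are arbitrary choices:

```python
import random
import statistics

random.seed(1)
mu_x, sd_x = 10.0, 2.0
mu_y, sd_y = 4.0, 1.0
n = 50000

x = [random.gauss(mu_x, sd_x) for _ in range(n)]
y = [random.gauss(mu_y, sd_y) for _ in range(n)]
u = [a + b for a, b in zip(x, y)]  # sum of the two variables
v = [a - b for a, b in zip(x, y)]  # difference of the two variables

# Expected: mean(u) ~ 10 + 4 = 14, mean(v) ~ 10 - 4 = 6,
# and var(u) = var(v) ~ 2**2 + 1**2 = 5.
print(statistics.mean(u), statistics.mean(v))
print(statistics.variance(u), statistics.variance(v))
```

Note that the variances add for both u and v, even though the means subtract for v.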
These properties are applied in some of the homework problems.