Business Analytics

Chapter 5

Numerical descriptive measures

Chapter outline

Lecture 03

5.1 Measures of central location

5.2 Measures of variability

Lecture 04

5.3 Measures of relative standing and box plots

5.4 Approximating descriptive measures for grouped data

5.5 Measures of association

5.6 General guidelines on the exploration of data

Learning objectives

Lecture 03

LO1 Calculate mean, median and mode, and explain the relationships between them

LO2 Calculate range, variance, standard deviation and coefficient of variation

LO3 Interpret the use of standard deviation through the empirical rule and Chebyshev’s theorem

Lecture 04

LO4 Explain the concepts of percentiles, deciles, quartiles and interquartile range, and show their usefulness through the application of a box plot

LO5 Calculate the mean and variance when the data are already in grouped form

LO6 Obtain numerical measures to calculate the direction and strength of the linear relationship between two variables

LO7 Understand the use of graphical methods and numerical measures to present summary information about a data set.

Introduction

In Chapter 4, we considered graphical descriptive techniques to summarise numerical data. In this chapter, we present a number of numerical measures to summarise numerical data.

Popular Numerical Descriptive Measures

Measures of central location

Mean, median, mode

Measures of variability

Range, standard deviation, variance, coefficient of variation

Measures of relative standing and box plots

Percentiles, quartiles

Measures of linear relationship

Covariance, coefficient of correlation, coefficient of determination, least squares line

5.1 Measures of central location

Three main types of measures of central location are:

• Arithmetic mean (or average)

• Median

• Mode

Arithmetic Mean (or Average)

The mean is the most popular and useful measure of central location.

Example 1

Median

Another commonly used measure of central location is the median.

The median of a set of measurements is the value that falls in the middle when the measurements are arranged in order of magnitude.

Example 2

The median is calculated by placing all the observations in order; the observation that falls in the middle is the median.

Impact of an outlier on the Mean and Median

Example 3 – Solution

c) For the data in (a) and (b),

(a) Without the outlier (b) With the outlier

As can be seen, the median barely changed (43 vs 43.5) even with the outlier (200). However, the mean changed from 42.8 to 62.5.

The mean is strongly affected by the outlier, whereas the median is hardly affected.

Mode

Another commonly used measure of central location is the mode.

The mode of a set of observations is the value that occurs most frequently.

A set of data may have one mode (or modal class), or two or more modes.

Mode is useful for all data types, though mainly used for nominal data.

For large data sets, the modal class is much more relevant than a single-value mode.

Sample and population modes are computed the same way.

The manager of a menswear store observed the waist sizes (in centimetres) of trousers sold yesterday: 77, 85, 90, 85, 82, 70, 85, 75, 85, 80, 77, 100, 85, 70. Suggest which size of trousers the store should order more of in its next order.

Solution:

The mode for this data set, that is, the size with the highest sales, is 85 cm.

Mean = 81.9

Median = 83.5
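
The three measures for the waist-size data can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and are not part of the original example.

```python
import statistics

# Waist sizes (cm) of trousers sold yesterday, taken from the example
waist_sizes = [77, 85, 90, 85, 82, 70, 85, 75, 85, 80, 77, 100, 85, 70]

mean_size = statistics.mean(waist_sizes)      # sum of the values divided by their count
median_size = statistics.median(waist_sizes)  # middle of the sorted values
mode_size = statistics.mode(waist_sizes)      # most frequently occurring value

print(f"Mean   = {mean_size:.1f}")
print(f"Median = {median_size:.1f}")
print(f"Mode   = {mode_size}")
```

Because the number of observations is even (14), the median is the average of the two middle values once the data are sorted.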

Excel Histogram for Example 5

Relationship between Mean, Median and Mode

If a distribution is symmetrical and has a single mode, the mean, median and mode coincide.

With three measures from which to choose, which one should we use?

The mean is generally our first selection. However, there are several circumstances when the median is better (for example, if there are outliers in the dataset).

The mode is seldom the best measure of central location.

One advantage the median holds is that it is not as sensitive to extreme values as the mean is.

To illustrate, consider the data in the following example.

The numbers of hours of Internet use in the previous month for 10 primary school children were 13, 11, 12, 10, 13, 14, 11, 7, 9, 10.

The mean was 11.0 and the median was 11.0.

Now suppose that the child who reported 14 hours actually reported 114 hours (obviously an Internet addict). The data are now 13, 11, 12, 10, 13, 114, 11, 7, 9, 10.

The new mean is 21.0 and the median is still 11.0.

The median is not affected much by this outlier, but the mean is.

The new mean is exceeded by only one of the ten observations in the sample, making the mean a poor measure of central location here.

The median stays the same.
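
The effect of the single extreme value can be reproduced directly from the data. The sketch below assumes Python's standard `statistics` module; the list names are illustrative only.

```python
import statistics

# Hours of Internet use for the 10 children, as listed in the example
hours = [13, 11, 12, 10, 13, 14, 11, 7, 9, 10]

# Same data with the 14-hour observation replaced by 114 hours
hours_with_outlier = [114 if h == 14 else h for h in hours]

for label, data in [("original", hours), ("with outlier", hours_with_outlier)]:
    mean_hours = statistics.mean(data)
    median_hours = statistics.median(data)
    print(f"{label}: mean = {mean_hours:.1f}, median = {median_hours:.1f}")
```

Only the largest value changes, so the middle of the sorted data, and hence the median, stays where it was, while the mean is pulled towards the outlier.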

When there is a relatively small number of extreme observations (either very small or very large, but not both), the median usually produces a better measure of the center of the data.

Mean, Median and Mode for Ordinal and Nominal Data

For ordinal and nominal data, the calculation of the mean is not valid.

Median is appropriate for ordinal data.

For nominal data, a mode calculation is useful for determining highest frequency, but not ‘central location’.

Measures of Central Location – Summary…

Compute the mean to describe the central location of a single set of numerical (or interval) data.

Compute the median to describe the central location of a single set of numerical or ordinal (ranked) data.

Compute the mode to describe a single set of nominal (or categorical) data.

5.2 Measures of Variability

Measures of central location fail to tell the whole story about the distribution.

A question of interest still remains unanswered: how much are the observations spread out around the mean value?

Observe Two Hypothetical Data Sets

Range

The range is the simplest measure of variability, calculated as:

Range = Largest observation – Smallest observation

E.g.

Data: {4, 4, 4, 4, 50} Range = 46

Data: {4, 8, 15, 24, 39, 50} Range = 46

The range is the same in both cases, but the data sets have very different distributions…
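
The point can be illustrated numerically. In this minimal sketch, `value_range` is a hypothetical helper name, and the sample standard deviation (a measure introduced later in this section) is printed only to show that the two data sets differ in spread despite sharing the same range.

```python
import statistics

def value_range(data):
    """Range = largest observation - smallest observation."""
    return max(data) - min(data)

set_1 = [4, 4, 4, 4, 50]
set_2 = [4, 8, 15, 24, 39, 50]

for name, data in [("Set 1", set_1), ("Set 2", set_2)]:
    spread = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    print(f"{name}: range = {value_range(data)}, sample std dev = {spread:.2f}")
```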

Its major advantage is the ease with which it can be computed.

Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.

Hence we need a measure of variability that incorporates all the data, not just the two end-point observations.

Variance

Variance and its related measure, standard deviation, are arguably the most important statistics used to measure variability. They also play a vital role in almost all statistical inference procedures.

Population variance is denoted by σ² (lower case Greek letter ‘sigma’ squared).

Sample variance is denoted by s² (lower case ‘s’ squared).

Example 6

The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Find its mean and variance.

Solution:

Example 6 – Solution…
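
One way to work through this example is sketched below. It assumes the usual sample-variance convention of dividing the sum of squared deviations by n − 1; the variable names are illustrative, and the standard library's `statistics.variance` is used only as a cross-check.

```python
import statistics

# Number of jobs each of the six students applied for
jobs = [17, 15, 23, 7, 9, 13]

n = len(jobs)
mean_jobs = sum(jobs) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1
squared_deviations = [(x - mean_jobs) ** 2 for x in jobs]
sample_variance = sum(squared_deviations) / (n - 1)

print(f"Mean = {mean_jobs:.1f}")
print(f"Sum of squared deviations = {sum(squared_deviations):.1f}")
print(f"Sample variance = {sample_variance:.1f}")

# Cross-check against the standard library's sample variance
library_variance = statistics.variance(jobs)
```

For these data the mean is 14, the squared deviations sum to 166, and the sample variance is 166/5 = 33.2.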

Standard deviation

The standard deviation of a set of measurements is the square root of the variance of the measurements.

Example 7

(Example 5.8, page 148)

Rates of return over the past 10 years for two unit trusts are shown below. Which one has a higher level of risk?

Trust A: 12.3, –2.2, 24.9, 1.3, 37.6, 46.9, 28.4, 9.2, 7.1, 34.5

Trust B: 15.1, 0.2, 9.4, 15.2, 30.8, 28.3, 21.2, 13.7, 1.7, 14.4

Example 7 - Solution

Using Data > Data Analysis > Descriptive Statistics in Excel, we produce the following tables for interpretation…
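
As an alternative to the Excel output, the same summary statistics can be computed with a short script. This is a sketch assuming Python's standard `statistics` module and sample (n − 1) formulas; the dictionary layout is an illustrative choice, not a reproduction of the Excel table.

```python
import statistics

# Rates of return (%) over the past 10 years, from the example
trust_a = [12.3, -2.2, 24.9, 1.3, 37.6, 46.9, 28.4, 9.2, 7.1, 34.5]
trust_b = [15.1, 0.2, 9.4, 15.2, 30.8, 28.3, 21.2, 13.7, 1.7, 14.4]

summary = {}
for name, returns in [("Trust A", trust_a), ("Trust B", trust_b)]:
    summary[name] = {
        "mean": statistics.mean(returns),
        "variance": statistics.variance(returns),  # sample variance
        "std_dev": statistics.stdev(returns),      # sample standard deviation
    }
    stats = summary[name]
    print(f"{name}: mean = {stats['mean']:.2f}, "
          f"variance = {stats['variance']:.2f}, std dev = {stats['std_dev']:.2f}")
```

If risk is measured by the standard deviation of returns, the trust with the larger standard deviation (Trust A here) carries the higher level of risk.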

Interpreting Standard Deviation

The standard deviation can be used to compare the variability of several distributions and make a statement about the general shape of a distribution.

If the histogram is bell shaped, we can use the Empirical Rule, which states:

1) Approximately 68% of all observations fall within one standard deviation of the mean.

2) Approximately 95% of all observations fall within two standard deviations of the mean.

3) Approximately 99.7% of all observations fall within three standard deviations of the mean.

Example 8

A statistician wants to describe the way returns on investment are distributed.

The mean return = 10%

The standard deviation of the return = 3%

The histogram is bell-shaped.

How can the statistician use the mean and the standard deviation to describe the distribution?

Example 8 - Solution

The empirical rule can be applied (bell-shaped histogram).

Describing the return distribution:

Approximately 68% of the returns lie between 7% and 13% [10 – 1(3), 10 + 1(3)]

Approximately 95% of the returns lie between 4% and 16% [10 – 2(3), 10 + 2(3)]

Approximately 99.7% of the returns lie between 1% and 19% [10 – 3(3), 10 + 3(3)]
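
The three intervals follow the same pattern, mean ± k standard deviations, so they can be generated in a loop. The sketch below uses the figures from this example; the names `mean_return`, `std_return` and `empirical_share` are illustrative.

```python
mean_return = 10.0  # mean return, in percent
std_return = 3.0    # standard deviation of the return, in percent

# Approximate share of observations within k standard deviations of the mean
# (applies only when the histogram is bell shaped)
empirical_share = {1: 68.0, 2: 95.0, 3: 99.7}

intervals = {}
for k, share in empirical_share.items():
    low = mean_return - k * std_return
    high = mean_return + k * std_return
    intervals[k] = (low, high)
    print(f"Approximately {share}% of the returns lie between {low:.0f}% and {high:.0f}%")
```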

Example 9

(Example 5.10, page 152)

The durations of 30 long-distance telephone calls are shown next. Check the empirical rule for this set of measurements.

Example 9 - Solution

Therefore, the range can be approximated by 4s. In other words, s ≈ Range/4.

Chebyshev’s theorem

Given any set of measurements and a number k (greater than 1), the fraction of these measurements that lie within k standard deviations of the mean is at least 1 – 1/k².

This theorem is valid for any set of measurements (sample, population) of any shape.
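
The bound in the theorem can be evaluated for any k greater than 1. In this minimal sketch, `chebyshev_lower_bound` is a hypothetical helper name chosen for illustration.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("The theorem is stated for k greater than 1")
    return 1 - 1 / k**2

for k in (1.5, 2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1%} of the observations")
```

Unlike the empirical rule, these are guaranteed minimums rather than approximations, which is why they are lower than the bell-shaped percentages.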

k    Interval     Chebyshev                  Empirical rule
1    mean ± 1s    n/a (k must exceed 1)      approx 68%
2    mean ± 2s    at least 75%               approx 95%
3    mean ± 3s    at least 89%               approx 99.7%

Interpreting Standard Deviation

Suppose that the mean and standard deviation of last year’s mid-semester exam marks are 70 and 5, respectively.

If the histogram is bell-shaped, then we know that approximately 68% of the marks fell between 65 and 75, approximately 95% of the marks fell between 60 and 80, and approximately 99.7% of the marks fell between 55 and 85.

If the histogram is not at all bell-shaped we can say that at least 75% of the marks fell between 60 and 80, and at least 89% of the marks fell between 55 and 85. (We can use other values of k.)

Coefficient of Variation

The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.
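
As an illustration of the definition, the sketch below computes the coefficient of variation for the two unit trusts from Example 7. It assumes sample statistics (n − 1 divisor); `coefficient_of_variation` is a hypothetical helper name.

```python
import statistics

def coefficient_of_variation(data):
    """Sample standard deviation divided by the sample mean."""
    return statistics.stdev(data) / statistics.mean(data)

# Rates of return (%) for the two unit trusts in Example 7
trust_a = [12.3, -2.2, 24.9, 1.3, 37.6, 46.9, 28.4, 9.2, 7.1, 34.5]
trust_b = [15.1, 0.2, 9.4, 15.2, 30.8, 28.3, 21.2, 13.7, 1.7, 14.4]

cv_a = coefficient_of_variation(trust_a)
cv_b = coefficient_of_variation(trust_b)

print(f"Trust A: CV = {cv_a:.3f}")
print(f"Trust B: CV = {cv_b:.3f}")
```

Because the standard deviation is divided by the mean, the result expresses variability relative to the size of the typical observation, which makes data sets with different means easier to compare.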
