# Assignment: Expanding Definitions


Chapter Learning Objectives After reading this chapter, you should be able to do the following:

1. Organize measures into frequency distributions, ordered arrays, and stem-and-leaf plots.

2. Create pie charts, bar graphs, and frequency polygons using Excel.

3. Describe the components of data normality.

4. Judge data normality by performing manual calculations and by using Excel output.

5. Develop tools to identify outliers.

tan82773_02_ch02_029-060.indd 29 3/3/16 9:58 AM

Section 2.1 From Description to Display

## Introduction

People who like to organize things will especially like this chapter. What we cover here can be particularly helpful in an age when we are exposed to much more data than we can absorb. When the material is irrelevant, this data overload is not a problem, but when the information is important, we need ways to retain it. This chapter offers some solutions involving visual data displays, which an anecdote will help to illustrate.

During World War II, a British analyst was assigned to recommend to aircraft builders the points on airframes that should be reinforced with armor plating. Too much armor plating and the aircraft would lose maneuverability and range; too little and it would become too vulnerable to enemy fire. The analyst examined aircraft returning from combat, noted which areas showed damage, and drew pictures of the places where they had been hit. He recommended reinforcing the areas where the returning planes had not been damaged. How counterintuitive was that? As illogical as his approach seems, he reasoned that if the damage had been fatal to either the pilot or the aircraft’s ability to fly, the airplanes he examined would not have returned. So damage to the other areas was apparently the most serious, and those were the areas that needed the most protection.

This story is a lesson in the value of clarifying relationships with visual displays. Certainly, mathematical manipulation and statistical procedures are required at times, but often a necessary first step to understanding a data set is to arrange the data so that they can be visually analyzed. The understanding researchers gain from observation can then guide the mathematical analyses that follow.

Chapter 1 emphasized the descriptors and the statistical shorthand that allow us to classify and describe groups of data. That chapter limited descriptions to the scale of the data and the measures of central tendency and variability that allow data summaries. This chapter uses visual display for some of the same purposes and expands the applications for descriptive statistics.

## 2.1 From Description to Display

The study of statistics has an incremental nature: Each step becomes part of a more involved process later, which makes grasping the early topics important, since they are building blocks for subsequent ones. For now, we will use what we know about data scale and descriptive statistics to arrange measures into the tables and figures that reveal the multiple dimensions of numerical data. Although the stakes for us may be different from those the British warplane analyst faced, the issues are important nevertheless.

Most audiences are more engaged by a visual display than by a text presentation. When a good deal of data must be communicated in a short time, a visual display serves as a good place to begin. The discussions that follow suggest some of the more common procedures for representing different kinds of data, if only to introduce them briefly. For someone interested in a more in-depth discussion, books by authors such as Friendly (2000) and Tufte (2001) will be helpful. Tufte in particular has a reputation for innovative and informative data displays.

Data distributions of one sort or another are ubiquitous. A glance at the latest news reports indicates how unemployment numbers have changed during the year. Checking how the stock market has fluctuated over today’s trading session indicates highs, lows, and the volume of trading. The fact that data fluctuate makes them interesting. Data that either all have the same value or that always occur in the same proportions leave little to be analyzed. They interest us much less than data for which proportions and frequencies change.

### Frequency Distributions

Scores on most measures vary, but the variation will generally have some repetition. Whether college admissions test results or the scores on a statistics quiz, all scores are not equally likely; some will occur more frequently than others. Frequency distributions indicate the number of measures in a data set that have the same characteristic. They allow us to display scores in terms of both their variability and their frequency of occurrence.

Suppose a state board administers a licensing test for marriage and family counselors. Rather than report every individual score, the board finds it more economical to report test results in categories:

Meritorious

Exceeds Expectations

Pass

Pass with Exceptions

Fail

Consider the following example: A group of 25 graduates of State U’s marriage and family counseling program takes the test. Table 2.1 shows the group’s results.


Tracking the highs, lows, and trading volume of stocks on a graph allows us to concisely evaluate what would otherwise be very large quantities of data.



Table 2.1: A frequency distribution for licensing test results

| Licensing test results | f |
|---|---|
| Meritorious | 4 |
| Exceeds expectations | 6 |
| Pass | 8 |
| Pass with exceptions | 4 |
| Fail | 3 |
| Total | 25 |

Table 2.1 depicts a frequency distribution, with the symbol f indicating the number of scores that occur in a particular category. If each individual score had been entered rather than being grouped into categories, the result would have been a table with 25 discrete entries. Instead, the data in Table 2.1 represent a grouped frequency distribution. Such a table provides a compact presentation when there are many scores.
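A frequency distribution like Table 2.1 can be tallied with the standard library’s `collections.Counter`. This is a minimal sketch; the list of labeled results is reconstructed from the counts in the table rather than drawn from real candidate data:

```python
from collections import Counter

# Hypothetical list of the 25 graded results, matching Table 2.1's counts
results = (
    ["Meritorious"] * 4
    + ["Exceeds expectations"] * 6
    + ["Pass"] * 8
    + ["Pass with exceptions"] * 4
    + ["Fail"] * 3
)

# Counter tallies the frequency (f) of each category
freq = Counter(results)
for category, f in freq.items():
    print(f"{category}: {f}")
print("Total:", sum(freq.values()))  # Total: 25
```

With many hundreds of raw scores, the tallying step is the same; only the input list grows.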

### Ordered and Disordered Arrays

Table 2.1 is divided into categories, but if each of the 25 results were listed in ranked order from the four that were meritorious down to the three fails, the display would reflect an ordered array. If instead of listing them from highest to lowest, the board arbitrarily piled all the scores into the table, it would show, not surprisingly, a disordered array. In such a table, for example, although the meritorious scores would still occur as a group, they would be in no particular order. Table 2.1 is a much shorter display than either an ordered or a disordered array.
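The difference between the two arrays is just a sort. A short sketch in Python, using the 25 individual scores listed later in this section (the shuffled copy stands in for a disordered array):

```python
import random

# The 25 licensing test scores from the worked example later in this section
scores = [34, 33, 33, 29, 26, 26, 24, 23, 23, 22,
          20, 19, 19, 18, 17, 15, 15, 14, 12, 11,
          9, 8, 6, 3, 1]

disordered = scores[:]
random.shuffle(disordered)  # a disordered array: no particular order

ordered = sorted(scores, reverse=True)  # an ordered array: highest to lowest
print(ordered[:4])  # the four meritorious scores lead the list
```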

When sample sizes are comparatively small—15 or 20 scores from a larger population, for example—the type of presentation is not an issue, but presentation would be a greater issue if the frequency distribution included data for every aspiring marriage and family counselor in the state who took the licensing test. Even if hundreds of scores were being reported, a grouped frequency distribution would have the same number of rows as Table 2.1. Frequency distributions, then, can make a presentation compact. Jokela (2012) studied whether the association between individuals’ personality traits and their likelihood of having children is affected by when they were born. Table 2.2 is part of his subjects’ description. It shows the birth cohort, or particular period of birth, and gender for 6,259 subjects (2,971 men and 3,288 women) in a relatively compact display.

### Class Intervals

The “groups” in grouped frequency distributions—the birth cohorts in Table 2.2—are called class intervals. Although they provide an economical data presentation and make a great deal of data accessible to even a casual observer, some details are inevitably lost. It is not apparent from studying Table 2.1, for example, which numerical test scores belong to a particular class interval. We can address that deficiency by incorporating a list of score ranges, which might be the following:

28–34 Meritorious
21–27 Exceeds Expectations
14–20 Pass
7–13 Pass with Exceptions
0–6 Fail
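Assigning a raw score to its class interval is a simple range check. A sketch in Python, with the cut points taken from the list above (the function name `classify` is ours, for illustration):

```python
def classify(score):
    """Map a numeric test score (0-34) to its class interval label.

    Cut points follow the score ranges listed in the text.
    """
    if 28 <= score <= 34:
        return "Meritorious"
    if 21 <= score <= 27:
        return "Exceeds Expectations"
    if 14 <= score <= 20:
        return "Pass"
    if 7 <= score <= 13:
        return "Pass with Exceptions"
    if 0 <= score <= 6:
        return "Fail"
    raise ValueError("score outside the 0-34 range")

print(classify(17))  # Pass
print(classify(30))  # Meritorious
```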

With the ranges, we know how scores were classified, but it still is not apparent exactly how one individual whose score is in the “pass” interval, for example, scored. The person could have scored anywhere from 14 to 20. We know only the category. The same difficulty emerges in Table 2.2. The table shows 347 female subjects in the 1920–1929 birth cohort, but it does not make any distinction within the 1920–1929 group, a range of 9 years.

If we cannot know precisely how a particular individual scored, or the exact year in which a subject was born (Table 2.2), the data can at least be roughly ranked. Clearly, those in Table 2.1 who “exceeded expectations” did better than those in the pass category, although exactly how much better is not indicated.

### Estimating the Mean from a Class Interval

Indicating the score frequencies in the class intervals reduces the scores to values that can be ranked approximately. Even without the individual scores, we can use the categories to estimate the mean of the scores from class intervals. To estimate the mean from class intervals,

1. Determine the midpoint of each class interval.
2. Multiply each midpoint by the number of scores in its interval.
3. Sum those products.
4. Divide the sum of the products by the total number of scores.

Table 2.2: A grouped frequency distribution of subjects’ birth cohort

| Birth year | Men (2,971) | Women (3,288) |
|---|---|---|
| 1914–1919 | 0 | 0 |
| 1920–1929 | 316 | 347 |
| 1930–1939 | 498 | 614 |
| 1940–1949 | 732 | 795 |
| 1950–1959 | 816 | 802 |
| 1960–1969 | 585 | 707 |
| 1970–1979 | 24 | 23 |

Source: Jokela, M. (2012). Birth-cohort effects in the association between personality and fertility. Psychological Science, 23, 835–841.

Try It!: #1

According to the discussion of the scale of data in Chapter 1, what scale do data categories such as meritorious, exceeds expectations, and so on indicate?



To see how accurate the estimated mean is, using the data in Table 2.1, we will first calculate the actual mean. Perhaps for the licensing test data in the grouped frequency distribution above, the individual scores were the following:

Meritorious: 34, 33, 33, 29
Exceeds Expectations: 26, 26, 24, 23, 23, 22
Pass: 20, 19, 19, 18, 17, 15, 15, 14
Pass with Exceptions: 12, 11, 9, 8
Fail: 6, 3, 1

Using the formula for the mean, M = Σx/n, verify that 460/25 = 18.40.

Now, to estimate the mean based on the class intervals, follow these four steps:

1. Determine the midpoint of each class interval by adding the two possible extreme scores within each interval (not the actual scores) and then dividing by 2.

For each interval:

Meritorious: (28 + 34)/2 = 31

Exceeds Expectations: (21 + 27)/2 = 24

Pass: (14 + 20)/2 = 17

Pass with Exceptions: (7 + 13)/2 = 10

Fail: (0 + 6)/2 = 3

2. Multiply the midpoint values from Step 1 by the number of scores in the interval.

31 × 4 = 124

24 × 6 = 144

17 × 8 = 136

10 × 4 = 40

3 × 3 = 9

3. Sum Step 2’s products (the midpoints times the number of values).

124 + 144 + 136 + 40 + 9 = 453

4. Divide the sum of the products from Step 3 by the number of scores.

453/25 = 18.12

The actual mean is 18.40. The estimated mean is 18.12.
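The four steps, together with the check against the actual mean, can be sketched in Python (interval bounds and frequencies taken from Table 2.1, individual scores from the list above):

```python
# Class intervals as (low, high, frequency) triples from Table 2.1
intervals = [
    (28, 34, 4),  # Meritorious
    (21, 27, 6),  # Exceeds Expectations
    (14, 20, 8),  # Pass
    (7, 13, 4),   # Pass with Exceptions
    (0, 6, 3),    # Fail
]

# Steps 1-2: midpoint of each interval times the interval's frequency
products = [((low + high) / 2) * f for low, high, f in intervals]

# Steps 3-4: sum the products, then divide by the total number of scores
n = sum(f for _, _, f in intervals)
estimated_mean = sum(products) / n

# The actual mean, computed from the individual scores listed above
scores = [34, 33, 33, 29, 26, 26, 24, 23, 23, 22,
          20, 19, 19, 18, 17, 15, 15, 14, 12, 11,
          9, 8, 6, 3, 1]
actual_mean = sum(scores) / len(scores)

print(estimated_mean)  # 18.12
print(actual_mean)     # 18.4
```

The two values differ by only 0.28 here; the estimate loses precision only to the extent that scores are unevenly spread within their intervals.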
