Practice Exam 1 Solutions                                                     

 

1. When applying for financial aid, City U students and their families must report household income (as computed for tax purposes).  Family incomes, in thousands of dollars, for a group of 34 incoming students are shown below.

                                22.1         24.5        25.0        29.3        31.2         39.8         40.0         41.0         44.2         45.6

46.7         48.8        49.1        50.2        50.4         51.3         54.1         57.5         59.5         62.1

64.0         64.0        68.3        68.9        70.1         74.4         75.4         80.0         81.5         86.9        

98.8         110.3       129.5       191.2                      

 

(a) Make a histogram or stemplot of these data.  If you choose a histogram, be sure to specify your classes.  If you choose a stemplot, be sure to explain what your stems and leaves represent.

 

 2 | 2459
 3 | 19
 4 | 0145689
 5 | 001479
 6 | 24488
 7 | 045
 8 | 016
 9 | 8
10 |
11 | 0
12 | 9
13 |
14 |
15 |
16 |
17 |
18 |
19 | 1

Stems are tens of thousands of dollars, and leaves are thousands of dollars.  I truncated rather than rounding.

Answers will vary.
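For anyone who wants to rebuild the stemplot by machine, here is a minimal Python sketch.  The function name `stemplot` and the row layout are my own choices, not part of the exam; the only rule taken from the solution is that decimals are truncated rather than rounded.

```python
from collections import defaultdict

# Family incomes in thousands of dollars, copied from the problem statement.
incomes = [
    22.1, 24.5, 25.0, 29.3, 31.2, 39.8, 40.0, 41.0, 44.2, 45.6,
    46.7, 48.8, 49.1, 50.2, 50.4, 51.3, 54.1, 57.5, 59.5, 62.1,
    64.0, 64.0, 68.3, 68.9, 70.1, 74.4, 75.4, 80.0, 81.5, 86.9,
    98.8, 110.3, 129.5, 191.2,
]


def stemplot(values):
    """Build stemplot rows: stem = tens of thousands, leaf = thousands digit."""
    leaves = defaultdict(list)
    for value in sorted(values):
        whole = int(value)                      # truncate the decimal, do not round
        leaves[whole // 10].append(str(whole % 10))
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        # Empty stems are kept so gaps in the distribution stay visible.
        rows.append(f"{stem:>2} | {''.join(leaves.get(stem, []))}".rstrip())
    return rows


rows = stemplot(incomes)
print("\n".join(rows))
```

Keeping the empty stems (10 and 13 through 18) is what makes the gap before 191.2 visible.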

 

 

(b) Describe the overall shape of the distribution.  Is it roughly symmetric, skewed to the right, or skewed to the left?  Are there any outliers?

 

My distribution is skewed to the right, with a center between 50 and 60 thousand dollars, and an outlier of 191.2 thousand dollars. 

 

(c) Would the 5-number summary or the mean and standard deviation give a better brief summary for this distribution?  Explain your choice.  Calculate the summary you choose.

 

The 5-number summary would be better for this distribution because it is heavily skewed and has a definite outlier.  Since the mean and standard deviation are sensitive to outliers, and less well-suited for skewed distributions, we choose the 5-number summary.

 

Min = 22.1, Q1 = 44.2, M = 55.8, Q3 = 74.4, Max = 191.2
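The summary above can be checked with a short Python sketch.  The function names are my own, and I am assuming the textbook quartile rule (Q1 and Q3 are the medians of the lower and upper halves of the sorted data); software that uses a different quartile convention may report slightly different quartiles.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumption: quartiles are medians of the two halves, and when n is odd
    the overall median is left out of both halves.
    """
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]
    upper = data[(n + 1) // 2:]
    return data[0], median(lower), median(data), median(upper), data[-1]


incomes = [
    22.1, 24.5, 25.0, 29.3, 31.2, 39.8, 40.0, 41.0, 44.2, 45.6,
    46.7, 48.8, 49.1, 50.2, 50.4, 51.3, 54.1, 57.5, 59.5, 62.1,
    64.0, 64.0, 68.3, 68.9, 70.1, 74.4, 75.4, 80.0, 81.5, 86.9,
    98.8, 110.3, 129.5, 191.2,
]

summary = five_number_summary(incomes)
print(summary)
```

With n = 34, the median is the average of the 17th and 18th values (54.1 and 57.5), and each half has 17 values, so Q1 and Q3 are the 9th and 26th values.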

 

2.  The table below summarizes the accept/reject decisions which City U has made for a sample of n=3000 applicants, broken down by the type of high school attended.                                           

 

            Public    Private    Parochial
Accept      1254      336        180
Reject      1026      144        60

 

(a) What is the acceptance rate (as a %) among all City U applicants?  59% (1770/3000)

 

(b) What proportion of City U applicants are not from a public high school?  0.24 (720/3000)

 

(c) Find the conditional distribution of acceptance and rejection within each of the high school types.  (That is, find the acceptance and rejection rates for students who attended public high schools.  Then do the same for private high schools and again for parochial schools.)  Summarize the results in a table and with a bar chart.

 

 

 

            Public    Private    Parochial
Accept      55%       70%        75%
Reject      45%       30%        25%

 

 

(d) If there was no relationship between the type of school and the admissions decision, what would you expect for the count in the cell describing number accepted from public high schools?

 

row total * column total / table total = 1770*2280/3000 = approximately 1345
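The conditional rates in part (c) and the expected count in part (d) can be reproduced with a few lines of Python.  This is only a sketch; the dictionary layout and variable names are my own.

```python
# Observed counts from the two-way table, keyed by high school type.
counts = {
    "Public":    {"Accept": 1254, "Reject": 1026},
    "Private":   {"Accept": 336,  "Reject": 144},
    "Parochial": {"Accept": 180,  "Reject": 60},
}

table_total = sum(sum(col.values()) for col in counts.values())
accept_total = sum(col["Accept"] for col in counts.values())

# Conditional distribution of the decision within each school type:
# each cell divided by its own column total.
conditional = {
    school: {decision: n / sum(col.values()) for decision, n in col.items()}
    for school, col in counts.items()
}

# Expected count under "no relationship": row total * column total / table total.
public_total = sum(counts["Public"].values())
expected_public_accept = accept_total * public_total / table_total

for school, dist in conditional.items():
    print(school, {decision: f"{rate:.0%}" for decision, rate in dist.items()})
print("Expected public accepts:", expected_public_accept)
```

The observed count of 1254 public-school acceptances falls below the expected count of about 1345, which agrees with the lower public acceptance rate.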

 

(e) With a sentence or two, summarize any relationship that you see in these data between the admission decision and the type of high school.

 

Students from private and parochial schools are accepted at higher rates than students from public schools.  Parochial acceptance rates are slightly higher than private acceptance rates, but both are substantially higher than public acceptance rates.

 

3. City U has a special relationship with an inner city high school that encourages students to apply for admission.  Below are the Verbal SAT scores from a SRS of 10 applicants from that school.

 

                                510   430   600   540   420   380   620   520   490   540

 

(a) Find the sample mean and standard deviation for these SAT scores.

 

mean = 505, standard deviation = 77.2

 

(b) Find the interquartile range for these data.  [Recall that the interquartile range is the difference between the third and first quartiles.]

 

Ordering the data, we have:  380 420 430 490 510 520 540 540 600 620, and the first and third quartiles are 430 and 540, respectively (the medians of the lower and upper halves).  The interquartile range is 540-430 = 110.

 

(c) Use the 1.5*IQR criterion to decide if the minimum score of 380 is unusually low, given the other values in this distribution.  Carefully justify your decision.  [Recall that the 1.5*IQR criterion says that an observation is an outlier if it falls more than 1.5*IQR above the third quartile or below the first quartile.]

 

1.5*IQR = 1.5*110 = 165.  Data below Q1-165 = 430-165 = 265 are outliers (as are data above Q3+165 = 705).  Since 380 is clearly above 265, the 380 SAT score is not an outlier by the 1.5*IQR criterion.
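The calculations for Problem 3 can be double-checked with Python's standard `statistics` module.  This is a sketch that assumes the same quartile rule used above (medians of the lower and upper halves); the variable names are my own.

```python
import statistics

scores = [510, 430, 600, 540, 420, 380, 620, 520, 490, 540]
data = sorted(scores)
n = len(data)

mean = statistics.mean(data)
sd = statistics.stdev(data)          # sample standard deviation (divides by n - 1)

# Quartiles as medians of the lower and upper halves of the sorted data.
q1 = statistics.median(data[: n // 2])
q3 = statistics.median(data[(n + 1) // 2:])
iqr = q3 - q1

# 1.5*IQR criterion: anything outside the fences is flagged as an outlier.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(mean, round(sd, 1), q1, q3, iqr, low_fence, high_fence, outliers)
```

The list of outliers comes back empty, so neither 380 nor any other score is flagged.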

 

4.  Suppose that all City U applicants are required to submit a high school grade average (on a 100 point scale).  Past experience shows that these averages follow a normal distribution with a mean of 83.0 and a standard deviation of 6.0 points.

 

(a) What proportion of City U applicants should have a high school average below 80?  Find the appropriate z-score and use a standard normal table.

 

z = (80 - 83)/6 = -0.5, and P(z < -0.5) = 0.3085, using the table.  So about 30.85% of applicants have a high school average below 80.

 

(b) The admissions office would like to designate students in the top 10% of the high school grade distribution for a "fast track" admissions decision.  How high would a student's high school average need to be in order to make it into this special decision group?  Your work should include the relevant z-score and the relationship between the z-score and your answer.

Looking up .90 in the body of the table, we find a corresponding z-score of 1.28.  Since z = (x - 83)/6, we have 1.28 = (x - 83)/6, and solving for x gives x = 83 + 1.28*6 = 90.68.  So students with high school averages at or above about 90.7 would be in the “fast track” admissions decision group.
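Both normal calculations in Problem 4 can be checked without a table using `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later).  This is a sketch for checking the table work, not a replacement for showing the z-scores; the variable names are my own.

```python
from statistics import NormalDist

# High school averages: normal with mean 83.0 and standard deviation 6.0.
grades = NormalDist(mu=83.0, sigma=6.0)

# Part (a): z-score for an average of 80 and the proportion below it.
z_80 = (80 - grades.mean) / grades.stdev
p_below_80 = grades.cdf(80)

# Part (b): the average with 90% of applicants below it (top 10% above it).
cutoff_top_10 = grades.inv_cdf(0.90)
z_cutoff = (cutoff_top_10 - grades.mean) / grades.stdev

print(z_80, round(p_below_80, 4))
print(round(z_cutoff, 4), round(cutoff_top_10, 2))
```

With the unrounded z-score the cutoff comes out near 90.69; rounding z to 1.28 first, as a table lookup does, gives 83 + 1.28*6 = 90.68.  Either way the cutoff is about 90.7.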

 

5. City U is noted for having a top-ranked water polo team.  In order to attract the best quality players, the school is quite generous in awarding scholarships to students on the team to help defray the $18,000 tuition bill.  Suppose that the boxplot below reflects the size of the scholarships awarded to the 15 current water poloists.  All scholarships are in multiples of $1,000.

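[Boxplot not reproduced.  As read from the solutions below, in thousands of dollars:  minimum 0, first quartile 12, median 16, third quartile 18, maximum 18.]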

Determine whether each of the statements below is VALID (definitely true), INVALID (definitely false), or UNDETERMINED (could be true or false).  Explain your reasoning in each case.

 

(a) __________________ At least 4 of the water polo players are on full scholarships.

 

VALID.  Since the maximum and the third quartile are the same, and there are four students at or above the third quartile, all of those 4 must have full scholarships.

 

(b) __________________ There is at least one player with a $12,000 scholarship.

 

VALID.  The quartiles fall on actual observations (with n=15, the median is observation 8, and the quartiles are observations 4 and 12), so the first quartile, $12,000, is the scholarship of an actual player.

 

(c) __________________ None of the 15 swimmers has a scholarship worth exactly $10,000.

 

UNDETERMINED.  There is at least one swimmer with no scholarship, and at least one swimmer with a $12,000 scholarship.  This still leaves two swimmers who could have scholarships anywhere between 0 and $12,000.

 

(d) Circle the value below which is the most reasonable estimate for the sample mean of the water polo scholarships.  Briefly explain your reasoning.

 

                                $ 9,000     $ 13,500    $ 16,000    $ 18,000

 

$13,500 is the best choice.  Because this distribution is skewed to the left, we know the mean should be less than the median of $16,000.  $13,500 is reasonable, but to be sure, we do a quick check using the lowest possible numbers for the unknown scholarships (in thousands:  0,0,0,12,12,12,12,16,16,16,16,18,18,18,18), which gives a mean of 184/15, or about $12,267.  The actual mean can be no lower than this, so $9,000 is ruled out.
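The quick check above can be extended to bound the mean from both sides.  The sketch below is my own addition: it assumes the boxplot values used in these solutions (minimum 0, Q1 = 12, median = 16, Q3 = maximum = 18, in thousands) and fills the unknown positions first with their lowest and then with their highest possible values.

```python
# Scholarships in thousands of dollars for n = 15 players, sorted low to high.
# Positions fixed by the boxplot: 1st = 0 (minimum), 4th = 12 (Q1),
# 8th = 16 (median), 12th through 15th = 18 (Q3 equals the maximum).
# Every other position is pushed as low, then as high, as the ordering allows.
lowest = [0, 0, 0, 12, 12, 12, 12, 16, 16, 16, 16, 18, 18, 18, 18]
highest = [0, 12, 12, 12, 16, 16, 16, 16, 18, 18, 18, 18, 18, 18, 18]

min_mean = sum(lowest) / len(lowest)
max_mean = sum(highest) / len(highest)

print(round(min_mean, 3), round(max_mean, 3))
```

Under these assumptions the mean must lie between roughly 12.3 and 15.1 thousand dollars, and $13,500 is the only one of the four choices in that range.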

 

6. Trying to determine the best number of students to accept is a tricky admissions decision.  City U officials must assume that some students will reject an offer from City U in order to attend another school.  If too few students are accepted, they may end up with too small an incoming class, but accepting too many students may jeopardize City U's rating in college guidebooks.  Here are several years' data on the number of students accepted and the number who later enrolled.  We are interested in predicting the number enrolling from the number accepted.

 

 

Year    Accepted    Enrolled
1996    2440        611
1997    2800        708
1998    2720        637
1999    2360        584
2000    2660        614
2001    2620        625

 

(a) Find the correlation between the number of students accepted and the number that enrolled.

 

r = 0.8303, using my graphing calculator with Accepted in L1, Enrolled in L2, and the linear regression option, which also returns the correlation.

 

(b) Find the least squares regression line which best fits these 6 data points.

 

predicted enrolled = 88.90 + 0.2081*(accepted), again from the linear regression option

 

(c) Write a sentence that interprets what the value of the estimated slope of this regression line tells us about accepted and enrolled students.

 

An increase of 100 accepted students would correspond to a predicted increase of about 21 enrolled students.

 

(d) If City U accepts 2500 students in 2002, how many would you expect to enroll?

 

predicted enrolled = 88.90 + 0.2081*2500 = approximately 609, so if 2500 students are accepted, we would expect approximately 609 to enroll.

 

(e)  What is the residual for 1998?  Write a sentence interpreting the value of the residual.

 

The residual for 1998 is:  actual enrollment – predicted enrollment = 637 – 655 = -18.  Our LSR model over-predicted 1998 enrollment by approximately 18 students.

 

(f) Find the value of r^2 for this model and interpret it as a percentage.  Your statement should relate to City U admissions.

 

r^2 is about 0.6894.  About 69% of the variation in the number of students enrolled from year to year can be accounted for by the LSR model of enrolled students on accepted students.
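For readers without a graphing calculator, parts (a) through (f) can be reproduced from the usual sums of squares.  The sketch below is my own and uses plain Python rather than the calculator's linear regression option; the helper name `predict` is hypothetical.

```python
from math import sqrt

accepted = [2440, 2800, 2720, 2360, 2660, 2620]   # 1996 through 2001
enrolled = [611, 708, 637, 584, 614, 625]

n = len(accepted)
x_bar = sum(accepted) / n
y_bar = sum(enrolled) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - x_bar) ** 2 for x in accepted)
syy = sum((y - y_bar) ** 2 for y in enrolled)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(accepted, enrolled))

r = sxy / sqrt(sxx * syy)             # correlation
slope = sxy / sxx                     # least squares slope
intercept = y_bar - slope * x_bar     # the line passes through (x_bar, y_bar)


def predict(num_accepted):
    """Predicted enrollment from the least squares line."""
    return intercept + slope * num_accepted


residual_1998 = 637 - predict(2720)   # actual minus predicted

print(round(r, 4), round(slope, 4), round(intercept, 2))
print(round(predict(2500)), round(residual_1998), round(r ** 2, 4))
```

The slope of about 0.208 is the "21 more enrolled per 100 more accepted" reading in part (c).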

 

(g) Sketch a time plot of the accepted data and another of the enrolled data.  Do your time plots reveal any strong trend in the number of students accepted or enrolled from year to year?

 

There is variation in the number of accepted students, but no clear indication of a generally increasing or generally decreasing trend.  The number enrolled varies, too, with a moderately strong correlation with the number accepted, but again there is no clear indication that enrollment is generally increasing or generally decreasing over time.

 

 

 

7.   The age distribution of students at City U is modeled by the distribution shown to the right.

 

(a)  Approximate the median student age, based on the distribution.

 

The median age is the age for which half the area under the distribution is to the right, and half is to the left.  The median is about age 28.

 

(b)  Do you expect the mean student age to be higher or lower than the median?  Explain briefly.  Approximate the mean student age, based on the distribution.

 

The mean will be higher than the median.  This distribution is skewed to the right, so the mean will be pulled to the right.  Graphically, the mean is the “balancing point” for the distribution, which is approximately age 33.

 

(c)  If we took random samples of size 5 from the student population, computed the average age within the sample, and looked at the distribution of these averages, would you expect the mean for the new distribution to be larger than, smaller than, or the same as, the mean you found in Part (b)?  Explain briefly.

 

Individual random samples may give averages either above or below the distribution mean, but if we look at a bunch of them together, the mean of the distribution of averages should be the same as the mean of the distribution above.

 

(d)  If we took random samples of size 5 from the student population, computed the average age within the sample, and looked at the distribution of these averages, would you expect the standard deviation for the new distribution to be larger than, smaller than, or the same as, the standard deviation of the distribution in Part (a)?  Explain briefly.

 

The standard deviation of the distribution of averages should be smaller than the standard deviation of the distribution above.  The average of 5 numbers will be closer to the average age of all the students than will the individual ages, and since standard deviation is sort of an average distance from the mean, it will be smaller for the averaged values than for the individual values.
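The answers to parts (c) and (d) can be illustrated with a small simulation.  Everything about the population below is an assumption: the exam's density curve is not available, so I use a made-up right-skewed population (18 plus an exponential tail with mean 15) chosen only because it has a median near 28 and a mean near 33.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical right-skewed population of student ages (NOT the exam's curve).
population = [18 + random.expovariate(1 / 15) for _ in range(20_000)]

# Draw many random samples of size 5 and record each sample's average age.
sample_means = [
    statistics.mean(random.sample(population, 5)) for _ in range(5_000)
]

pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)
means_mean = statistics.mean(sample_means)
means_sd = statistics.stdev(sample_means)

print("population:   mean", round(pop_mean, 1), " sd", round(pop_sd, 1))
print("sample means: mean", round(means_mean, 1), " sd", round(means_sd, 1))
```

The averages center on about the same mean as the population, but their standard deviation is smaller, close to the population standard deviation divided by the square root of 5.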

 

8.  (a)  Tell me everything you can about correlation.  (What does it measure?  What values can it have?  How is it used?) 

 

 

Correlation is a measure of the strength and direction of a linear relationship between two quantitative variables.  It takes values between -1 and 1, with values near 1 or -1 signifying a strong linear relationship (values of exactly 1 or -1 imply all the data points lie on a line), and values near 0 signifying little or no linear relationship.  Positive correlations indicate that small values of one variable are associated with small values of the other and large values are similarly associated.  Negative correlations indicate that small values of one variable are associated with large values of the other variable.  The value of the correlation does not depend on the units of measurement of either variable (because correlation involves the standardized values, or z-scores, for the individual measurements), and correlation does not distinguish between explanatory and response variables (the formula is symmetric in x and y).  In least squares regression of y on x, the square of the correlation coefficient gives the proportion of variability in the y-values that is explained by the least squares regression line.

 

(b)  Sketch two scatterplots, one with a correlation of approximately -0.98 and the other with a correlation of approximately 0.45.  Label your plots so it is clear which is which.

 

                                       

 

9.  Explain or define the following terms as they relate to linear regression:

(a)  Influential observations

 

An influential observation is one that has a substantial effect on the regression line (so that removing that one observation changes the regression line a lot).  Observations with extremely large or small x-values have the potential to be extremely influential.  Other outliers may or may not be influential.

 

(b)  Residual

 

The residual of an observation is the difference between the actual and predicted y-value for a given x-value.  A positive residual means the observation lies above the LSR line, while a negative residual means the observation lies below the LSR line.  The LSR line tries to minimize the sum of the squared residuals.

 

10.  Overweight parents tend to have overweight children.  The results of a study of Mexican American girls aged 9 to 12 years are typical.  The investigators measured body mass index (BMI), a measure of weight relative to height, for both the girls and their mothers.  People with high BMI are overweight.  The correlation between the BMI of daughters and the BMI of their mothers was r = 0.506.  The results of this study are confounded.  Explain what the confounding is and what you may or may not conclude from the study. 

 

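Body type is determined in part by heredity, so genetics explains part of the correlation, but environmental influences also contribute to the correlation:  mothers who are overweight may also set an example of little exercise and poor diet, and these behaviors are likely to influence the behaviors of their daughters.  Because these two explanations are mixed together, the study shows an association between mothers' and daughters' BMI but cannot tell us how much of it is due to heredity and how much to environment.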

 

11.  The table below shows numbers of flights on time and delayed for two airlines at five airports in one month. 

 

                 Alaska Airlines           America West
                 On Time    Delayed        On Time    Delayed
Los Angeles      497        62 (11%)       694        117 (14%)
Phoenix          221        12 (5%)        4840       415 (8%)
San Diego        212        20 (9%)        383        65 (15%)
San Francisco    503        102 (17%)      320        129 (29%)
Seattle          1841       305 (14%)      201        61 (23%)

 

(a) What proportion of all Alaska Airlines flights were delayed?  What proportion of all America West flights were delayed?

 

Alaska Airlines:  501/3775 = about 13% delayed
America West:  787/7225 = about 11% delayed

 

(b) Find the percentage of delayed flights for Alaska Airlines at each of the five airports.  You may record your percentages in the table, next to the number of delayed flights.  Do the same for America West.

 

(c) What happens?  What is the name of the phenomenon you observe?  Explain why it occurs in this situation.  (What’s the lurking variable?)

 

Although America West has a higher percentage of delayed flights for each of the five cities, America West has a lower overall percentage of delayed flights.  This is an example of Simpson’s Paradox.  In this example, the reason for the paradox is that most of Alaska Airlines’ flights are through Seattle and San Francisco, which are much more prone to bad weather (and thus flight delays), while the majority of America West’s flights are through Phoenix, which rarely has bad weather.
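The reversal described above can be verified directly from the table.  This is a sketch with my own data layout and helper name (`delay_rate`); the counts are the (on time, delayed) pairs from the table.

```python
# (on time, delayed) counts by airline and airport, from the table.
flights = {
    "Alaska Airlines": {
        "Los Angeles": (497, 62),
        "Phoenix": (221, 12),
        "San Diego": (212, 20),
        "San Francisco": (503, 102),
        "Seattle": (1841, 305),
    },
    "America West": {
        "Los Angeles": (694, 117),
        "Phoenix": (4840, 415),
        "San Diego": (383, 65),
        "San Francisco": (320, 129),
        "Seattle": (201, 61),
    },
}


def delay_rate(on_time, delayed):
    """Proportion of flights that were delayed."""
    return delayed / (on_time + delayed)


# Overall rate per airline: pool the counts over all five airports first.
overall = {}
for airline, airports in flights.items():
    on_time = sum(o for o, _ in airports.values())
    delayed = sum(d for _, d in airports.values())
    overall[airline] = delay_rate(on_time, delayed)

# Airport-by-airport rates for both airlines.
by_airport = {
    airport: {airline: delay_rate(*flights[airline][airport]) for airline in flights}
    for airport in flights["Alaska Airlines"]
}

for airport, rates in by_airport.items():
    print(airport, {airline: f"{rate:.0%}" for airline, rate in rates.items()})
print({airline: f"{rate:.1%}" for airline, rate in overall.items()})
```

America West is worse at every airport yet better overall, because about 73% of its flights (5255 of 7225) go through Phoenix, where delays are rare for both airlines.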

 

12.  In Professor Friedman’s economics course, the correlation between the students’ total scores prior to the final exam and their final exam scores is r = 0.6.  The pre-exam totals for all students in the course have mean 280 and standard deviation 30.  The final exam scores have mean 75 and standard deviation 8. 

(a)  Professor Friedman grades on a curve so that he expects to assign A’s to approximately 15% of his students, B’s to approximately 35%, C’s to approximately 40% of his students, and D’s or F’s to the remaining 10%.  Assuming the distribution of pre-final totals is approximately normal, how many points (find the minimum) would a student need before the final exam to be earning an A?  a B?  a C?

 

A:  311                   B:  280                   C:  242                 

 

These cutoffs come from normal curve calculations, x = 280 + 30*z, using z-scores of 1.04 (85% of area to the left), 0 (50% of area to the left), and -1.28 (10% of area to the left).
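The same cutoffs can be obtained from `statistics.NormalDist` in the Python standard library (Python 3.8 and later).  This is a sketch; the dictionary of cumulative shares is my own bookkeeping: an A needs the top 15%, a B the top 15% + 35% = 50%, and a C the top 90%.

```python
from statistics import NormalDist

# Pre-final totals: approximately normal with mean 280 and sd 30.
totals = NormalDist(mu=280, sigma=30)
standard = NormalDist()               # standard normal, for the z-scores

# Share of the class at or above each grade's minimum score.
share_at_or_above = {"A": 0.15, "B": 0.50, "C": 0.90}

cutoffs = {}
for grade, share in share_at_or_above.items():
    z = standard.inv_cdf(1 - share)   # z-score with (1 - share) of the area to its left
    cutoffs[grade] = totals.mean + z * totals.stdev
    print(grade, round(z, 2), round(cutoffs[grade]))
```

The unrounded z-scores (about 1.04, 0, and -1.28) give the same cutoffs of 311, 280, and 242 after rounding to whole points.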

 

(b)  Find the least squares regression line of final exam scores on pre-final total scores for this course.

 

predicted final exam score = 30.2 + 0.16*(pre final total score)

 

(c)  Explain the meaning of the vertical intercept of your LSR line in the context of Professor Friedman’s class.   Is your interpretation reasonable?  Why or why not?

 

The intercept is the predicted final exam score for a student who had 0 points as a pre-final total.  Our interpretation is reasonable, but we should be careful.  The problem is that 0 as a pre-final total is probably far outside the range of the actual students’ pre-final totals, and we shouldn’t try to extrapolate far from the range of the data.

 

(d)  Julie’s total before the exam was 300.  What does LSR predict for her score on the final exam?

 

Julie’s predicted score = 30.2 + 0.16*300 = 78.2
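The line in part (b) and the prediction above both follow from the summary statistics alone, using slope = r*(sd of y)/(sd of x) and the fact that the least squares line passes through the point of means.  A short sketch, with variable and function names of my own choosing:

```python
r = 0.6
mean_pre, sd_pre = 280, 30        # pre-final totals (explanatory variable x)
mean_final, sd_final = 75, 8      # final exam scores (response variable y)

slope = r * sd_final / sd_pre                # 0.6 * 8 / 30
intercept = mean_final - slope * mean_pre    # line passes through (280, 75)


def predict_final(pre_total):
    """Predicted final exam score for a given pre-final total."""
    return intercept + slope * pre_total


julie = predict_final(300)
r_squared = r ** 2

print(round(slope, 2), round(intercept, 1), round(julie, 1), round(r_squared, 2))
```

Julie's total is 20 points above the mean, so her predicted final is 0.16*20 = 3.2 points above the mean final score of 75.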

 

(e)  Should we have great confidence in our ability to predict Julie’s final exam score accurately?  Explain your answer and justify it statistically.

 

No, Julie’s score could be considerably higher or lower than our prediction.  The value of r^2 in the regression is only 0.36, which means the pre-final total only accounts for 36% of the variability in the final exam scores.  The remaining variability (64%) is large and unaccounted for.  We would have greater predictive ability if r^2 were larger.