Correlation & Regression (DP IB Maths: AA HL)

Revision Note

Author: Dan

Expertise: Maths


Linear Regression

What is linear regression?

  • If strong linear correlation exists on a scatter diagram then the data can be modelled by a linear model
    • Drawing lines of best fit by eye is not the best method as it can be difficult to judge the best position for the line
  • The least squares regression line is the line of best fit that minimises the sum of the squares of the distances between the line and each data value
  • It can be calculated by looking at either:
    • vertical distances between the line and the data values
      • This is the regression line of y on x
    • horizontal distances between the line and the data values
      • This is the regression line of x on y
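
To see what "minimises the sum of the squares" means for the y on x line, the following Python sketch may help. It uses the maths and English scores from the worked example in this note. It assumes the standard textbook formulas for the least squares gradient and intercept, which this note does not state, and the function names are illustrative only.

```python
# Scores from the worked example in this note.
x = [7, 18, 37, 52, 61, 68, 75, 82]   # maths
y = [5, 3, 9, 12, 17, 41, 49, 97]     # English


def fit_y_on_x(xs, ys):
    """Least squares line y = a*x + b (standard formulas, assumed here)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    a = s_xy / s_xx
    return a, y_bar - a * x_bar


def sum_sq_vertical(a, b, xs, ys):
    """Sum of squared vertical distances from each point to y = a*x + b."""
    return sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(xs, ys))


a, b = fit_y_on_x(x, y)
best = sum_sq_vertical(a, b, x, y)

# Nudge the gradient and intercept slightly: every nudged line does worse.
nudged = [
    sum_sq_vertical(a + da, b + db, x, y)
    for da in (-0.05, 0, 0.05)
    for db in (-1, 0, 1)
    if (da, db) != (0, 0)
]
print(best, min(nudged))
```

The smallest nudged total is still larger than the total for the fitted line, which is what makes it the least squares line for vertical distances.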

How do I find the regression line of y on x?

  • The regression line of y on x is written in the form y = ax + b
  • a is the gradient of the line
    • It represents the change in y for each individual unit change in x
      • If a is positive this means y increases by a for a unit increase in x
      • If a is negative this means y decreases by |a| for a unit increase in x
  • b is the y-intercept
    • It shows the value of y when x is zero
  • You are expected to use your GDC to find the equation of the regression line
    • Enter the bivariate data and choose the model “ax + b”
    • Remember the mean point (x̄, ȳ) will lie on the regression line
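
In an exam the GDC does this calculation, but a rough Python sketch can make the meaning of a and b concrete. It again uses the scores from the worked example in this note and assumes the standard least squares formulas. The helper name fit_line is hypothetical.

```python
# Scores from the worked example in this note.
x = [7, 18, 37, 52, 61, 68, 75, 82]   # maths
y = [5, 3, 9, 12, 17, 41, 49, 97]     # English


def fit_line(us, vs):
    """Least squares line v = m*u + k, regressing v on u (assumed formulas)."""
    n = len(us)
    u_bar, v_bar = sum(us) / n, sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    m = s_uv / s_uu
    return m, v_bar - m * u_bar


a, b = fit_line(x, y)                 # regression line of y on x
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)

# The mean point lies on the line (up to floating point error).
on_line = abs((a * x_bar + b) - y_bar) < 1e-9

# A unit increase in x changes the predicted y by the gradient a.
step = (a * 41 + b) - (a * 40 + b)

print(f"y = {a:.3f}x + ({b:.3f})", on_line, round(step, 3))
```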

How do I find the regression line of x on y?

  • The regression line of x on y is written in the form x = cy + d
  • c is the gradient of the line
    • It represents the change in x for each individual unit change in y
      • If c is positive this means x increases by c for a unit increase in y
      • If c is negative this means x decreases by |c| for a unit increase in y
  • d is the x-intercept
    • It shows the value of x when y is zero
  • You are expected to use your GDC to find the equation of the regression line
    • It is found the same way as the regression line of y on x but with the two data sets switched around
    • Remember the mean point (x̄, ȳ) will lie on the regression line
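
The idea of switching the two data sets can be sketched in Python as follows. This is an illustration under the same assumptions as a GDC-style least squares fit: the helper fit_line is a hypothetical name, and the data are the scores from the worked example in this note.

```python
# Scores from the worked example in this note.
x = [7, 18, 37, 52, 61, 68, 75, 82]   # maths
y = [5, 3, 9, 12, 17, 41, 49, 97]     # English


def fit_line(us, vs):
    """Least squares line v = m*u + k, regressing v on u (assumed formulas)."""
    n = len(us)
    u_bar, v_bar = sum(us) / n, sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    m = s_uv / s_uu
    return m, v_bar - m * u_bar


a, b = fit_line(x, y)   # y on x:  y = a*x + b
c, d = fit_line(y, x)   # x on y:  same routine with the data sets switched

# Rearranging y = a*x + b for x would give gradient 1/a, which differs
# from c unless the correlation is perfect.
print(f"x = {c:.3f}y + {d:.3f}", f"1/a = {1 / a:.3f}")
```

This is why the x on y line has to be calculated separately rather than found by rearranging the y on x equation.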

How do I use a regression line?

  • The regression line can be used to decide what type of correlation there is if there is no scatter diagram
    • If the gradient is positive then the data set has positive correlation
    • If the gradient is negative then the data set has negative correlation
  • The regression line can also be used to predict the value of a dependent variable from an independent variable
    • The equation for the y on x line should only be used to make predictions for y
      • Using a y on x line to predict x is not always reliable
    • The equation for the x on y line should only be used to make predictions for x
      • Using an x on y line to predict y is not always reliable
    • Making a prediction within the range of the given data is called interpolation
      • This is usually reliable
      • The stronger the correlation the more reliable the prediction
    • Making a prediction outside of the range of the given data is called extrapolation
      • This is much less reliable
    • The prediction will be more reliable if the number of data values in the original sample set is bigger
  • The y on x and x on y regression lines intersect at the mean point (x̄, ȳ)
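
The difference between interpolation and extrapolation can be sketched in code. The function below is a hypothetical helper, not a GDC feature. It predicts x from y using the x on y line for the worked example data in this note and labels the prediction according to whether the given y value lies inside the range of the data.

```python
# Scores from the worked example in this note.
x = [7, 18, 37, 52, 61, 68, 75, 82]   # maths
y = [5, 3, 9, 12, 17, 41, 49, 97]     # English


def fit_line(us, vs):
    """Least squares line v = m*u + k, regressing v on u (assumed formulas)."""
    n = len(us)
    u_bar, v_bar = sum(us) / n, sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    m = s_uv / s_uu
    return m, v_bar - m * u_bar


def predict_x_from_y(y_value, xs, ys):
    """Predict x with the x on y line and say which kind of prediction it is."""
    c, d = fit_line(ys, xs)
    inside = min(ys) <= y_value <= max(ys)
    kind = "interpolation" if inside else "extrapolation"
    return c * y_value + d, kind


print(predict_x_from_y(63, x, y))    # y value inside the data range
print(predict_x_from_y(120, x, y))   # y value outside the data range
```

A label of "extrapolation" is a warning that the prediction should be treated with caution, not a sign that the arithmetic is wrong.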

Exam Tip

  • Once you have calculated the values of a and b (or c and d), store them in your GDC
    • This means you can use the full display values rather than the rounded values when using the linear regression equation to predict values
    • This avoids rounding errors
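
The effect of rounding too early can be illustrated with a short Python sketch. It uses the x on y line for the worked example data in this note. Rounding the coefficients to 2 significant figures is an exaggerated choice made only so that the difference is easy to see, and round_sig and fit_line are hypothetical helper names.

```python
from math import floor, log10

# Scores from the worked example in this note.
x = [7, 18, 37, 52, 61, 68, 75, 82]   # maths
y = [5, 3, 9, 12, 17, 41, 49, 97]     # English


def fit_line(us, vs):
    """Least squares line v = m*u + k, regressing v on u (assumed formulas)."""
    n = len(us)
    u_bar, v_bar = sum(us) / n, sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    m = s_uv / s_uu
    return m, v_bar - m * u_bar


def round_sig(value, sig):
    """Round a non-zero value to the given number of significant figures."""
    return round(value, sig - 1 - floor(log10(abs(value))))


c, d = fit_line(y, x)                       # full-precision coefficients
c_r, d_r = round_sig(c, 2), round_sig(d, 2)  # coefficients rounded too early

full = c * 63 + d          # prediction using stored, unrounded values
rounded = c_r * 63 + d_r   # prediction using rounded values
print(round(full, 2), round(rounded, 2))
```

The two predictions differ by more than half a mark here, which is enough to change the answer when it is rounded to the nearest whole number.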

Worked example

The table below shows the scores of eight students for a maths test and an English test.

Maths (x)     7   18   37   52   61   68   75   82
English (y)   5    3    9   12   17   41   49   97

a)
Write down the value of Pearson’s product-moment correlation coefficient, r.

[Worked solution image for part (a)]

b)
Write down the equation of the regression line of y on x, giving your answer in the form y = ax + b where a and b are constants to be found.

[Worked solution image for part (b)]

c)
Write down the equation of the regression line of x on y, giving your answer in the form x = cy + d where c and d are constants to be found.

[Worked solution image for part (c)]

d)
Use the appropriate regression line to predict the score on the maths test of a student who got a score of 63 on the English test.

[Worked solution image for part (d)]


PMCC

What is Pearson’s product-moment correlation coefficient?

  • Pearson’s product-moment correlation coefficient (PMCC) is a way of giving a numerical value to a linear relationship of bivariate data
  • The PMCC of a sample is denoted by the letter r
    • r can take any value such that -1 ≤ r ≤ 1
    • A positive value of r describes positive correlation
    • A negative value of r describes negative correlation
    • r = 0 means there is no linear correlation
    • r = 1 means perfect positive linear correlation
    • r = -1 means perfect negative linear correlation
    • The closer r is to 1 or -1, the stronger the correlation

[Diagram: PMCC]

How do I calculate Pearson’s product-moment correlation coefficient (PMCC)?

  • You will be expected to use the statistics mode on your GDC to calculate the PMCC
  • The formula can be useful to deepen your understanding

$$r = \frac{S_{xy}}{S_x S_y}$$

      • $S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)$ is linked to the covariance
      • $S_x = \sqrt{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}$ and $S_y = \sqrt{\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2}$ are linked to the variances
    • You do not need to learn this formula as you will be expected to use your GDC
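
For anyone curious how the formula behaves with real numbers, here is a minimal Python sketch that evaluates it term by term for the worked example data in this note. The variable names s_xy, s_x and s_y simply mirror the symbols in the formula.

```python
from math import sqrt

# Scores from the worked example in this note.
x = [7, 18, 37, 52, 61, 68, 75, 82]   # maths
y = [5, 3, 9, 12, 17, 41, 49, 97]     # English
n = len(x)

# S_xy = sum of x_i*y_i minus (1/n) * (sum of x_i) * (sum of y_i)
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

# S_x and S_y = square roots of (sum of squares minus (1/n) * (sum)^2)
s_x = sqrt(sum(xi ** 2 for xi in x) - sum(x) ** 2 / n)
s_y = sqrt(sum(yi ** 2 for yi in y) - sum(y) ** 2 / n)

r = s_xy / (s_x * s_y)
print(s_xy, round(r, 3))
```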

When does the PMCC suggest there is a linear relationship?

  • Critical values of r indicate when the PMCC would suggest there is a linear relationship
    • In your exam you will be given critical values where appropriate
    • Critical values will depend on the size of the sample
  • If the absolute value of the PMCC is bigger than the critical value then this suggests a linear model is appropriate
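
The comparison with a critical value amounts to a single check, sketched below in Python. The function name is hypothetical, and the critical value 0.707 is a placeholder chosen for illustration, not a value taken from this note. In an exam, use the critical value you are given.

```python
def suggests_linear_model(r, critical_value):
    """True if |r| is bigger than the critical value."""
    return abs(r) > critical_value


# Placeholder critical value (an assumption for illustration only).
critical_value = 0.707

print(suggests_linear_model(0.794, critical_value))   # strong positive r
print(suggests_linear_model(-0.9, critical_value))    # strong negative r
print(suggests_linear_model(0.3, critical_value))     # weak r
```

Taking the absolute value means a strongly negative r is treated in the same way as a strongly positive one.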



Author: Dan

Dan graduated from the University of Oxford with a First class degree in mathematics. As well as teaching maths for over 8 years, Dan has marked a range of exams for Edexcel, tutored students and taught A Level Accounting. Dan has a keen interest in statistics and probability and their real-life applications.