Plotting and Analysing Weather Data with Edexcel Large Data Set - LDS

Please use Google Chrome or Mozilla Firefox to see the animations properly.

Edexcel has made a really constructive move by introducing its Large Data Set (LDS) as part of its statistics course in Applied Mathematics for A Level. It contains a substantial amount of data that can be used to experiment with numerous statistical concepts. With it, those who show a knack for statistics can explore many of the avenues that lead into modern data science and Artificial Intelligence (AI).

The main examination boards in the United Kingdom have recognised the need for large data sets when it comes to dealing with data, a marked departure from traditional data tables with a relatively small amount of data; the data science industry needs people with real practical skills in this realm, not just an understanding of the concepts. In this context, the introduction of the Large Data Set by Edexcel is a step in the right direction.

The LDS contains data recorded in 1987 at five locations in the United Kingdom, a city in China, a city in the US and a city in Australia. Jacksonville in the US and Perth in Australia are in the Northern and Southern Hemispheres respectively.

In this tutorial, you will learn the following interactively:

  • How to plot two sets of data on the same grid against the same time period
  • How to change the time period interactively to see the changes
  • How to find the measures of location and spread for a given period
  • How to check whether the two sets of data are correlated
  • How to take a random sample of a size of your choice
  • The precautions you should take while sampling from this data set
  • Interactive boxplots
  • The units of oktas and knots, fully explained
  • Where to exercise restraint when forecasting UK-wide weather from this particular data set

 

Edexcel Large Data Set - a level statistics

 

The Edexcel large data set covers the following data:

 

Please note that some cells in the Excel file of the large data set have no data and are marked n/a - a serious challenge for a developer to overcome before plotting.

The large data set can be downloaded from the following link (2020):

Download Edexcel Large Data Set

 

The following measures of location and spread are updated interactively:

 

In the animation, you can change the period of data using the slider below the chart; not only does the chart get updated, but the measures of location and spread for that particular period are also updated accordingly.

 

Change Period
Statistics shown: Mean, Mode, Median, Standard Deviation, Maximum, Minimum, Interquartile Range
Temperature: °C
Cloud Cover: oktas

Please note that cloud cover, measured in oktas, is a discrete variable.

 

Formulae in Use

Mean: x̄ = Σx / n
Standard Deviation: σ = √(Σ(x - x̄)²/n) or σ = √(Σf(x - x̄)²/Σf)
Q1: 25% of data lies below this
Q3: 75% of data lies below this
Median: 50% of data lies below this
IQR: Q3 - Q1
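
The mean and standard deviation formulae above can be sketched in a few lines of JavaScript. This is only an illustration: the function names and the sample values are made up for this example, and the page itself uses a statistics library for these figures.

```javascript
// Mean: x̄ = Σx / n
function mean(values) {
  return values.reduce((sum, x) => sum + x, 0) / values.length;
}

// Standard deviation: σ = √(Σ(x - x̄)² / n)
function standardDeviation(values) {
  const xBar = mean(values);
  const sumOfSquares = values.reduce((sum, x) => sum + (x - xBar) ** 2, 0);
  return Math.sqrt(sumOfSquares / values.length);
}

// Standard deviation from a frequency table: σ = √(Σf(x - x̄)² / Σf)
// Each row of the table is { x: value, f: frequency }.
function standardDeviationFromFrequencies(table) {
  const totalF = table.reduce((sum, row) => sum + row.f, 0);
  const xBar = table.reduce((sum, row) => sum + row.f * row.x, 0) / totalF;
  const sumOfSquares = table.reduce(
    (sum, row) => sum + row.f * (row.x - xBar) ** 2,
    0
  );
  return Math.sqrt(sumOfSquares / totalF);
}

// Made-up values, not taken from the LDS
const sample = [2, 4, 4, 4, 5, 5, 7, 9];
console.log(mean(sample)); // 5
console.log(standardDeviation(sample)); // 2
```

Both versions divide by n (or Σf), as in the formulae above; some calculators and libraries also offer a "sample" version that divides by n - 1, so check which one you are using.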

 

Coding

When the data values are too large or too small, we use coding to make calculations easier.
E.g.
x: 111, 121, 131, 141, 151
This data set can be turned into y by coding as follows:
Let y = (x - 1)/10
y: 11, 12, 13, 14, 15
ȳ = Σy/5 = (11 + 12 + 13 + 14 + 15)/5 = 13
Now, the measures of location can be found in terms of y and then turned back into the corresponding x values.
The same process can be used if the data values in question are too small.

Turning Coded Values into Original Values
Let y = (x - a)/b, where a and b are constants; x and y are the original and coded values respectively.
ȳ = Σy/n
= Σ(x - a)/(nb)
= Σx/(nb) - Σa/(nb)
= x̄/b - na/(nb)
= x̄/b - a/b
Multiplying through by b and rearranging:
x̄ = bȳ + a
If ȳ, a and b are known, x̄ can be calculated easily.
In the above example, ȳ = 13; a = 1; b = 10
x̄ = 10ȳ + 1 = 10 × 13 + 1
x̄ = 131
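
The worked example above can be checked with a short JavaScript sketch. The helper names (codeValues, decodeMean) are invented for this illustration.

```javascript
// Mean: x̄ = Σx / n
function mean(values) {
  return values.reduce((sum, x) => sum + x, 0) / values.length;
}

// Coding: y = (x - a) / b
function codeValues(xs, a, b) {
  return xs.map((x) => (x - a) / b);
}

// Turning the coded mean back into the original mean: x̄ = bȳ + a
function decodeMean(yBar, a, b) {
  return b * yBar + a;
}

const x = [111, 121, 131, 141, 151];
const y = codeValues(x, 1, 10); // 11, 12, 13, 14, 15
const yBar = mean(y); // 13
const xBar = decodeMean(yBar, 1, 10);

console.log(xBar); // 131
console.log(mean(x)); // 131, the same value found directly
```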

 

 

Variables and Units in the LDS

Variables are characteristics, numbers or quantities that can be counted or measured.
E.g. wind speed, cloud cover, number of fish in a lake, number of girls in a class with black hair

In the large data set, LDS, the following units are used to represent the wind speed and cloud cover.

Knots

The speed of wind in knots is the number of nautical miles travelled per hour.
Nautical miles are used for navigation.
1 knot ≈ 1.15 mph
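
The conversion can be sketched as a small JavaScript function. The function name is my own choice for this illustration; the factor comes from the definitions of the two miles (1 nautical mile = 1852 m, 1 mile = 1609.344 m), which gives roughly 1.15.

```javascript
const METRES_PER_NAUTICAL_MILE = 1852;
const METRES_PER_MILE = 1609.344;

// Convert a wind speed in knots into miles per hour
function knotsToMph(knots) {
  return (knots * METRES_PER_NAUTICAL_MILE) / METRES_PER_MILE;
}

console.log(knotsToMph(1).toFixed(2)); // "1.15"
console.log(knotsToMph(10).toFixed(1)); // "11.5"
```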

You can convert knots into mph by using the following; just put the value in the text box and move the mouse out:

  

 

 

Okta

This is the unit of measurement of cloud cover. It's a discrete unit that ranges from 0 to 8, counting how many eighths of the sky are covered by clouds - hence the name, which comes from the Greek word for eight.

◯ - 0 oktas: clear sky
◔ - 2 oktas: ¼ of the sky covered by clouds
◑ - 4 oktas: ½ of the sky covered by clouds
◕ - 6 oktas: ¾ of the sky covered by clouds
⬤ - 8 oktas: the whole sky covered by clouds

Sampling with LDS

You can take random samples from the LDS, provided that you know how to avoid the cells with no data. For instance, there is no data in the first 16 cells of the Daily mean wind speed column. If you treat the whole column as the population and a random number happens to fall in that region, there is going to be an error related to that data. The same applies to systematic sampling.
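
One way to avoid the empty cells is to filter them out before sampling, so that every random number points at a real value. The JavaScript below is a minimal sketch of that idea, not the code behind this page; the function names and the example column are made up.

```javascript
// Keep only the cells that hold a number; "n/a" and empty cells are dropped
function usableValues(cells) {
  return cells
    .map((cell) => Number.parseFloat(cell))
    .filter((value) => Number.isFinite(value));
}

// Simple random sample without replacement (partial Fisher-Yates shuffle)
function simpleRandomSample(values, size, random = Math.random) {
  if (size > values.length) {
    throw new RangeError("Sample size is larger than the population");
  }
  const pool = values.slice(); // copy, so the original order is untouched
  for (let i = 0; i < size; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, size);
}

// Systematic sample: a random start, then every k-th value
function systematicSample(values, size, random = Math.random) {
  const interval = Math.floor(values.length / size);
  if (interval < 1) {
    throw new RangeError("Sample size is larger than the population");
  }
  const start = Math.floor(random() * interval);
  const sample = [];
  for (let i = 0; i < size; i++) {
    sample.push(values[start + i * interval]);
  }
  return sample;
}

// Made-up column with missing cells at the top
const column = ["n/a", "n/a", "7", "9", "n/a", "12", "5", "8"];
const population = usableValues(column); // 7, 9, 12, 5, 8
console.log(simpleRandomSample(population, 3));
console.log(systematicSample(population, 2));
```

Note that dropping the n/a cells changes the size of the population, so the sample is random among the days that actually have a reading.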

  

These samples from the LDS do not lead to an accurate or reliable forecast for the UK weather for the following reasons.

  • The data does not cover the entire United Kingdom.
  • The data covers just five areas of the country.
  • The data covers a period of 6 months of the year - a part of summer and autumn

 


 

Scatter Graphs from Edexcel LDS

The following interactive chart checks whether there is any correlation between the daily temperature and the cloud cover in the Heathrow area in the United Kingdom. The temperature and cloud cover are plotted along the x-axis and y-axis respectively; the units are °C and oktas respectively.

 

Change Period:

Data Source: Edexcel
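
If you want a number to go with the scatter graph, the product moment correlation coefficient can be calculated as r = Sxy / √(Sxx × Syy). The JavaScript below is a rough sketch of that calculation; the function name and the data pairs are invented for this illustration and are not taken from the LDS.

```javascript
// Product moment correlation coefficient: r = Sxy / √(Sxx × Syy)
function pearsonCorrelation(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new RangeError("Need two lists of equal length, with at least 2 pairs");
  }
  const n = xs.length;
  const xBar = xs.reduce((sum, x) => sum + x, 0) / n;
  const yBar = ys.reduce((sum, y) => sum + y, 0) / n;

  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - xBar) * (ys[i] - yBar);
    sxx += (xs[i] - xBar) ** 2;
    syy += (ys[i] - yBar) ** 2;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Made-up pairs: temperature in °C and cloud cover in oktas
const temperature = [12.1, 14.3, 15.0, 17.8, 19.2, 16.4];
const cloudCover = [7, 6, 6, 3, 2, 5];
console.log(pearsonCorrelation(temperature, cloudCover).toFixed(2));
```

A value close to 1 or -1 suggests a strong linear correlation, while a value close to 0 suggests little or none. The pairs must be complete, so any day with an n/a in either column has to be left out first.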

 

 

Histograms from Edexcel LDS - 9 shades of grey

The following histogram is based on the cloud cover data for Heathrow from May to October 1987. Cloud cover is measured in oktas - a discrete variable. The chart is fully interactive.

Change Period:

Data Source: Edexcel

Since the data in question is discrete, the above chart can also be described as a bar chart.
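
Because cloud cover can only take the nine values from 0 to 8, the heights of the bars are simply counts. A minimal JavaScript sketch of such a frequency table, with a made-up function name and made-up readings, could look like this:

```javascript
// Count how many readings fall on each okta value, 0 to 8
function oktaFrequencies(readings) {
  const counts = new Array(9).fill(0); // index = okta value
  for (const reading of readings) {
    // Ignore anything that is not a whole number between 0 and 8
    if (Number.isInteger(reading) && reading >= 0 && reading <= 8) {
      counts[reading] += 1;
    }
  }
  return counts;
}

// Made-up readings, not taken from the LDS
const readings = [0, 8, 8, 4, 7, 7, 7];
console.log(oktaFrequencies(readings)); // [1, 0, 0, 0, 1, 0, 0, 3, 2]
```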

 

Histograms from Edexcel LDS - Relative Humidity

The following histogram is based on continuous data, collected over a period of six months in 1987 in the Heathrow area. The data shows that relative humidity stayed above 65% most of the time. In this context, you may understand why the histogram has been restricted to just 3 classes.

Change Period:

Data Source: Edexcel

 

 

 

Boxplots from Edexcel Large Data set - interactive

The boxplot below is based on the data collected from May 1987 to October 1987 in the Heathrow area in the UK, home to one of the busiest airports in the world. As the chart shows, the relative humidity fluctuated between 70% and 90% during the six months of the summer / autumn seasons. The boxplot is fully interactive.

Change Period:

Data Source: Edexcel
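
A boxplot is drawn from five numbers: the minimum, Q1, the median, Q3 and the maximum. The JavaScript below is a sketch of how they might be found. It assumes one common textbook rule for quartiles (work out n/4, n/2 or 3n/4; if the result is whole, go halfway between that data point and the next; otherwise round up), and libraries or spreadsheets may use a different rule and give slightly different quartiles. The names and values are made up for this illustration.

```javascript
// Quartile by position in a sorted list, using the textbook rule described above
function quantileByPosition(sorted, fraction) {
  const position = sorted.length * fraction; // counted from 1
  if (Number.isInteger(position)) {
    // Whole number: halfway between this data point and the next one
    return (sorted[position - 1] + sorted[position]) / 2;
  }
  // Otherwise round up and take that data point
  return sorted[Math.ceil(position) - 1];
}

function fiveNumberSummary(values) {
  const sorted = values.slice().sort((a, b) => a - b);
  const q1 = quantileByPosition(sorted, 0.25);
  const median = quantileByPosition(sorted, 0.5);
  const q3 = quantileByPosition(sorted, 0.75);
  return {
    minimum: sorted[0],
    q1,
    median,
    q3,
    maximum: sorted[sorted.length - 1],
    iqr: q3 - q1,
  };
}

// Made-up relative humidity readings in %, not taken from the LDS
const humidity = [85, 72, 90, 78, 81, 93, 75, 88, 80, 83];
console.log(fiveNumberSummary(humidity));
// { minimum: 72, q1: 78, median: 82, q3: 88, maximum: 93, iqr: 10 }
```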

 


Comparing Two Data Sets from Edexcel Large Data Set

The following interactive animation has been made in order to compare the daily average temperatures from Camborne and Heathrow.
Change the size of the sample and keep an eye on the boxplots and the frequency tables, as they are automatically updated.

 

 

When comparing two data sets, please note the following:
1) Compare using the median together with the interquartile range
2) Or compare using the mean together with the standard deviation
3) Do not pair the median with the standard deviation
4) Do not pair the mean with the interquartile range


Solving Problems: Edexcel Large Data Set

 

 

The above histogram shows how to take a sample of daily mean temperatures at Heathrow in 1987, from May to October. Answer the following when the sample size is 110.

  1. A formula connecting the frequency and the area of a bar
  2. The frequency of the classes, 6 - 8 and 18 - 20

1) Let's take a sample of 110 data values - you can change the sample size to whatever value you want.
Since the frequency of a bar of a histogram ∝ area,
f ∝ area
f = k × area
For a bar with frequency 48 and area 24:
48 = k × 24
24k = 48
k = 2
f = 2A, where A is the area of the bar
2) For the class 6 - 8,
f = 2 × 1
= 2
For the class 18 - 20,
f = 2 × 5
= 10
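
The same working can be sketched in JavaScript: find k from one bar whose frequency and area are both known, then use f = k × area for the other bars. The function names are made up for this illustration; the numbers are the ones used in the working above.

```javascript
// Find the constant k in f = k × area from one bar that is fully known
function findConstant(knownFrequency, knownArea) {
  return knownFrequency / knownArea;
}

// Frequency of any other bar from its area
function frequencyFromArea(k, area) {
  return k * area;
}

const k = findConstant(48, 24); // 2
console.log(frequencyFromArea(k, 1)); // class 6 - 8: 2
console.log(frequencyFromArea(k, 5)); // class 18 - 20: 10
```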

 

Bonus: getting the perfect regression line - fully interactive

With the following animation, you can see how the residual sum of squares determines the line of best fit: the regression line is the one that makes this sum as small as possible. Move the data points closer to the line with your mouse and watch the equation of the regression line change. It's fun, isn't it?
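
For those who want to check the arithmetic behind such a line, here is a minimal JavaScript sketch of the least squares method, using gradient = Sxy / Sxx and intercept = ȳ - gradient × x̄. It is not the code behind the animation; the function names and the data points are made up.

```javascript
// Least squares regression line y = intercept + gradient × x
function leastSquaresLine(points) {
  const n = points.length;
  const xBar = points.reduce((sum, p) => sum + p.x, 0) / n;
  const yBar = points.reduce((sum, p) => sum + p.y, 0) / n;

  let sxy = 0;
  let sxx = 0;
  for (const p of points) {
    sxy += (p.x - xBar) * (p.y - yBar);
    sxx += (p.x - xBar) ** 2;
  }
  const gradient = sxy / sxx;
  const intercept = yBar - gradient * xBar;
  return { gradient, intercept };
}

// Residual sum of squares: Σ(actual y - predicted y)²
function residualSumOfSquares(points, gradient, intercept) {
  return points.reduce(
    (sum, p) => sum + (p.y - (intercept + gradient * p.x)) ** 2,
    0
  );
}

// Made-up points
const points = [
  { x: 1, y: 2 },
  { x: 2, y: 4 },
  { x: 3, y: 5 },
  { x: 4, y: 4 },
  { x: 5, y: 5 },
];
const line = leastSquaresLine(points);
console.log(line.gradient.toFixed(1), line.intercept.toFixed(1)); // 0.6 2.2
console.log(residualSumOfSquares(points, line.gradient, line.intercept).toFixed(1)); // 2.4
```

Any other line through the same points, for example one with a slightly different gradient, gives a larger residual sum of squares.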

 


For Developers

I used the fetch() function from the Fetch API to load the data from a .csv file; the original file was a .xls file, an Excel document. In addition, the following steps and libraries were used to produce the chart and the corresponding statistical values from the data.

  • fetch() returns a promise; once it resolved, the response text was parsed to extract the required data.
  • In order to plot the data, the Chart.js library was used.
  • In order to find the statistical values, the Simple Statistics JavaScript library was used.
  • Two JavaScript functions were created to turn knots into mph and to get a sample from the data set.
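
The loading step can be sketched roughly as below. This is an illustration only, not the code running on this page: the function names, the file name and the column name are all hypothetical, and the simple comma split assumes that no cell contains a comma inside quotation marks.

```javascript
// Pull one column out of CSV text; cells without a number (such as n/a) become null
function parseColumn(csvText, columnName) {
  const lines = csvText.trim().split(/\r?\n/);
  const headers = lines[0].split(",").map((header) => header.trim());
  const index = headers.indexOf(columnName);
  if (index === -1) {
    throw new Error(`Column not found: ${columnName}`);
  }
  return lines.slice(1).map((line) => {
    const value = Number.parseFloat(line.split(",")[index]);
    return Number.isFinite(value) ? value : null;
  });
}

// fetch() returns a promise, so the text is only available after awaiting it.
// fetch() is built into browsers and recent versions of Node.js.
async function loadColumn(url, columnName) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Could not load ${url}: HTTP ${response.status}`);
  }
  return parseColumn(await response.text(), columnName);
}

// Hypothetical usage, with a made-up file name and column name:
// loadColumn("heathrow-1987.csv", "Daily Mean Temperature").then(console.log);

// The parsing can be tried out without a file
const exampleCsv = "Date,Daily Mean Temperature,Cloud Cover\n" +
  "01/05/1987,14.6,n/a\n" +
  "02/05/1987,12.2,7\n";
console.log(parseColumn(exampleCsv, "Cloud Cover")); // [null, 7]
```

Keeping the missing cells as null, rather than dropping them at this stage, keeps each value lined up with its date; they can be filtered out later, just before plotting or sampling.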