00:00
So far we have covered graphs that represent
only one variable, but how do we represent
relationships between two variables?
In this video, we'll explore cross tables
and scatter plots.
00:12
Once again, we have a division between
categorical and numerical variables.
00:18
Let's start with categorical variables.
00:21
The most common way to represent them is
using cross tables or, as some statisticians
call them, contingency tables.
00:28
Imagine you are an investment manager, and
you manage stocks, bonds and real estate
investments for three different investors.
00:35
Each of them has a different idea of risk,
and hence their money is allocated in a
different way among the three asset classes.
A cross table representing all the data
looks like this.
00:47
You can clearly see the rows showing the type
of investment that's been made and the
columns with each investor's allocation.
00:54
It is a good practice to calculate the
totals of each row and column, as it is often
useful in further analysis.
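To make the totals concrete, here is a minimal Python sketch of such a cross table. The investor labels and the amounts are made up for illustration; they are not the figures shown in the video.

```python
# Hypothetical allocations (in thousands of dollars); not taken from the video.
# Rows are investment types, columns are investors.
cross_table = {
    "Stocks":      {"Investor A": 96, "Investor B": 185, "Investor C": 39},
    "Bonds":       {"Investor A": 181, "Investor B": 3, "Investor C": 29},
    "Real estate": {"Investor A": 88, "Investor B": 152, "Investor C": 142},
}

investors = ["Investor A", "Investor B", "Investor C"]

# Row totals: total invested in each asset class across all investors.
row_totals = {asset: sum(row.values()) for asset, row in cross_table.items()}

# Column totals: total holdings of each investor across all asset classes.
column_totals = {
    inv: sum(row[inv] for row in cross_table.values()) for inv in investors
}

# The grand total must be the same whether we add up rows or columns.
grand_total = sum(row_totals.values())
assert grand_total == sum(column_totals.values())

# Print the table with a "Total" column and a "Total" row.
header = f"{'':<12}" + "".join(f"{inv:>12}" for inv in investors) + f"{'Total':>8}"
print(header)
for asset, row in cross_table.items():
    cells = "".join(f"{row[inv]:>12}" for inv in investors)
    print(f"{asset:<12}{cells}{row_totals[asset]:>8}")
cells = "".join(f"{column_totals[inv]:>12}" for inv in investors)
print(f"{'Total':<12}{cells}{grand_total:>8}")
```

Checking that the row totals and the column totals add up to the same grand total is a quick way to catch a data-entry mistake in the table.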
01:01
Notice that the subtotals of the rows give
us total investment in stocks, bonds and real
estate. On the other hand, the subtotals of
the columns give us the holdings of
each investor. Once we have created a cross
table, we can proceed
by visualizing the data on a plane.
01:19
A very useful chart in such cases is a
variation of the bar chart called the
side-by-side bar chart.
01:26
It represents the holdings of each investor
in the different types of assets.
01:30
Stocks are in green, bonds are in red, and
real estate is in blue.
01:35
The name of this type of chart comes from the
fact that for each investor, the categories
of assets are represented side by side.
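The side-by-side layout can be sketched in Python as follows. The amounts are again hypothetical, the helper names `bar_positions` and `plot_side_by_side` are invented for this example, and the plotting step assumes matplotlib is installed. Only the colors follow the video.

```python
# Hypothetical allocations (in thousands of dollars); not taken from the video.
holdings = {
    "Investor A": {"Stocks": 96, "Bonds": 181, "Real estate": 88},
    "Investor B": {"Stocks": 185, "Bonds": 3, "Real estate": 152},
    "Investor C": {"Stocks": 39, "Bonds": 29, "Real estate": 142},
}
assets = ["Stocks", "Bonds", "Real estate"]
# Colors as described in the video.
colors = {"Stocks": "green", "Bonds": "red", "Real estate": "blue"}

BAR_WIDTH = 0.25  # three bars of this width leave a gap between investors


def bar_positions(n_groups, n_bars, width):
    """Return, for each bar series, the x position of its bar in every group.

    Group g is centred on x = g, and the n_bars bars are placed side by
    side around that centre.
    """
    offsets = [(i - (n_bars - 1) / 2) * width for i in range(n_bars)]
    return [[g + off for g in range(n_groups)] for off in offsets]


def plot_side_by_side(holdings, assets, colors, width=BAR_WIDTH):
    """Draw the chart. matplotlib is imported here so that the layout
    helper above can be used without it."""
    import matplotlib.pyplot as plt

    investors = list(holdings)
    positions = bar_positions(len(investors), len(assets), width)
    fig, ax = plt.subplots()
    for asset, xs in zip(assets, positions):
        heights = [holdings[inv][asset] for inv in investors]
        ax.bar(xs, heights, width=width, label=asset, color=colors[asset])
    ax.set_xticks(list(range(len(investors))))
    ax.set_xticklabels(investors)
    ax.set_ylabel("Holdings (thousands of dollars)")
    ax.legend()
    return fig
```

Calling `plot_side_by_side(holdings, assets, colors)` returns a figure with one group of three bars per investor.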
01:42
In this way, we can easily compare asset
holdings for a specific investor or among
investors. Easy, right?
All graphs are very easy to create and read
01:52
once you have identified the type of data
you are dealing with and decided on the best
way to visualize it.
01:58
Finally, we would like to conclude with a
very important graph:
02:02
the scatter plot.
02:04
It is used when representing two numerical
variables.
02:08
For this example, we have gathered the
reading and writing SAT scores of 100
individuals. Let me first show you the graph
before analyzing it.
02:18
All right. First, SAT scores by component
range from 200 to 800
points. That is why our data is bounded
within the range of 200 to 800.
02:28
Second, our vertical axis shows the writing
scores, while the horizontal axis
contains reading scores.
02:36
Third, there are 100 students, and each
student's results correspond to a specific
point on the graph. Each point gives us
information about
a particular student's
performance. For example, this is Jane.
02:48
She scored 300 on writing, but 550 on the
reading part.
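In code, each student is simply a pair of numbers. The sketch below is a minimal Python illustration: Jane's scores come from the video, while the other students and the helper names `in_range` and `plot_scores` are made up. The plotting step assumes matplotlib is installed.

```python
# Each student is one point: x = reading score, y = writing score.
# Jane's scores come from the video; the other students are made up.
students = {
    "Jane": (550, 300),
    "Student 2": (480, 510),
    "Student 3": (620, 640),
    "Student 4": (350, 330),
    "Student 5": (760, 780),
}

SCORE_MIN, SCORE_MAX = 200, 800  # each SAT component is scored from 200 to 800


def in_range(point):
    """Check that both coordinates lie within the 200 to 800 scoring range."""
    return all(SCORE_MIN <= score <= SCORE_MAX for score in point)


def plot_scores(students):
    """Draw the scatter plot. matplotlib is imported lazily so the rest of
    the example runs without it."""
    import matplotlib.pyplot as plt

    reading = [r for r, _ in students.values()]
    writing = [w for _, w in students.values()]
    fig, ax = plt.subplots()
    ax.scatter(reading, writing)
    ax.set_xlim(SCORE_MIN, SCORE_MAX)
    ax.set_ylim(SCORE_MIN, SCORE_MAX)
    ax.set_xlabel("Reading score")   # horizontal axis, as in the video
    ax.set_ylabel("Writing score")   # vertical axis, as in the video
    return fig


# Every point must fall inside the square bounded by 200 and 800.
assert all(in_range(p) for p in students.values())
reading_jane, writing_jane = students["Jane"]
print(f"Jane: reading={reading_jane}, writing={writing_jane}")
```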
02:54
Scatter plots usually represent lots and
lots of observations.
02:58
When interpreting a scatter plot,
03:00
a statistician is not expected to look into
single data points.
03:03
He would be much more interested in getting
the main idea of how the data is distributed.
03:09
OK. The first thing we see is that there is
an obvious uptrend.
03:13
This is because lower writing scores are
usually obtained by students with lower
reading scores, and higher writing scores
have been achieved by students with higher
reading scores. This is logical, right?
Students are more likely to do well on both
because the two tasks are closely related.
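One common way to put a number on an uptrend like this is the Pearson correlation coefficient. The video does not use it at this point, so treat the following as an optional sketch; the scores are hypothetical and the function name `pearson` is chosen for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists.

    Values near +1 indicate a strong uptrend, values near -1 a strong
    downtrend, and values near 0 no linear trend.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical reading and writing scores for eight students.
reading = [350, 420, 480, 500, 530, 620, 700, 760]
writing = [330, 400, 510, 490, 520, 640, 690, 780]

r = pearson(reading, writing)
print(f"correlation between reading and writing: {r:.3f}")
```

For data where low reading scores go with low writing scores and high with high, the coefficient comes out close to 1.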
03:29
Second, we notice a concentration of
students in the middle of the graph with
scores in the region of 450 to 550 on both
reading and writing.
03:38
Remember we said that scores can be anywhere
between 200 and 800?
Well, 500 is the average score one can get.
03:45
So it makes sense that a lot of people fall
into that area.
03:50
Third, there is this group of people with
both very high writing and reading
scores. The exceptional students tend to be
excellent at both components.
04:00
This is less true for weaker students, as
their performance tends to vary more across
different tasks. Finally, we have Jane from
a minute ago.
04:08
She is far away from every other observation
as she scored above average on reading but
poorly on writing.
04:15
This observation is called an outlier as it
goes against the logic of the whole data
set. We will learn more about outliers and
how to treat them in our analysis later on in
this course. So we have gone through the
basics.
04:28
We have covered populations, samples, types
of variables,
graphs and tables.
04:35
And it is time for us to dive into the heart
of descriptive statistics:
measures of central tendency and
variability.
04:43
Thanks for watching.