00:01
Finally, it's time for the practical example
we've been talking about.
00:06
In this lesson, we will see an actual
database of a real estate company operating
in California. All right.
00:13
We are interested in the statistical
properties of the data.
00:16
That is why we have reordered the database
and cherry-picked variables and then imported
these in a spreadsheet.
00:23
The labels of the columns have been made
friendly, even for those of you who do not
have any experience with real estate.
00:30
Finally, we have altered the names of the
customers for confidentiality reasons.
00:35
Okay. The company is launching a marketing
campaign, but it wants to target its audience
properly. The management suspects that after
some short analysis, marketing
results can be improved without the need to
invest additional resources.
00:49
We are the data analysts who are going to
crunch some numbers and identify which groups
of people are most likely to buy our
product.
00:57
Once we have done so, we will instruct the
marketing team to focus its efforts on these
groups. The first thing we have to do when
we analyze the data is to get
acquainted with the table.
01:08
It illustrates the sales of real estate
property for a specific company.
01:12
Let's call it 365 Data Science Real Estate
California.
01:17
Hopefully nobody else thought of that name.
01:20
Second, the table has two parts, left and
right.
01:25
On the left side, we have product
information.
01:28
On the right hand side, we have customer
information.
01:32
You can easily spot that all products are
listed, but customer information is only
available for some products.
01:39
This is because we input information about a
customer once the deal is done.
01:43
Logically, only sold items are associated
with a buyer.
01:49
Let's see what a row looks like.
01:51
This should clear up the logic of the table
for you.
01:55
Nora Lynch, with customer ID SI 0004, was
56 years old when she bought Apartment 43
in building one in order to
live there. She paid
$377,313 for an area of
160 square feet in June 2004.
02:16
Nora is from California, felt very satisfied
with the deal and did not get a
mortgage for the purchase.
02:23
She found out about our product through our
website.
02:28
Okay. Now that that's out of the way, we
need to dig a bit deeper into these
variables. We will identify types of data
and levels of measurement for some of them.
02:38
This is a crucial step, as we cannot analyze
the data if we don't understand its
type. Let's start from the first column ID.
02:47
ID is a value that we assign to each item,
which lets us differentiate between
products. It may look numerical to you, but
in fact, it is
categorical. That's very counterintuitive
the first time.
02:59
So let's clarify it a bit.
03:03
What if we use names like John, John two,
John three and so on
until John?
The meaning would not change.
03:12
ID variables are like names that we assign
to different products.
03:16
However, it is much easier to use numbers
because, unlike names,
03:19
we have an infinite number of numbers.
03:23
A simple way to check if a variable is
numerical or categorical is to interpret its
mean. Think about it: the mean ID number
shows
nothing. Now contrast this with the mean price,
for example.
03:36
It is clear that the mean price is a very
valuable piece of information.
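To make the mean test concrete, here is a minimal sketch in Python. The IDs and prices are hypothetical values made up for illustration, not rows from the spreadsheet. It relabels the IDs with arbitrary names to show that nothing is lost, and then compares the two means.

```python
from statistics import mean

# Hypothetical rows for illustration only; not taken from the spreadsheet.
ids = [1001, 1002, 1003, 1004]
prices = [250_000.0, 310_000.0, 190_000.0, 410_000.0]

# Relabel the IDs with arbitrary names: the products are still told apart,
# which is the sign of a categorical (nominal) variable.
relabel = {old: f"item_{i}" for i, old in enumerate(ids, start=1)}
renamed_ids = [relabel[i] for i in ids]

mean_id = mean(ids)        # computable, but it describes no product
mean_price = mean(prices)  # a meaningful summary of the price column

print(renamed_ids)
print(mean_id, mean_price)
```

The mean ID can be computed, but it cannot be interpreted, while the mean price can.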
03:42
Okay. The bottom line is that it is a
categorical variable.
03:47
What about its level of measurement?
Well, it is qualitative, nominal. Clear?
03:57
The next variable we have to examine is
age.
04:00
Age is rather interesting.
04:02
The level of measurement is quantitative
ratio.
04:06
A rule used for verifying ratios is
to ask the question: is there a
true zero point?
Well, for age,
04:14
it is obvious that when you were born, you
were exactly zero years old.
04:19
That's the true zero point.
04:21
So we are safe.
04:24
However, what's truly intriguing is whether
age is discrete or
continuous. In fact, it may be both.
04:33
In this case, we can only see age as a whole
number.
04:36
Therefore, it is discrete.
04:39
However, similar to weight, a variable we
discussed earlier in the course, age is a
continuous variable.
04:46
At the time I am recording this,
04:48
the Statue of Liberty is 131 years old.
04:52
But I may get more specific by saying it is
131 years and 11 months old,
or that its age is 131.92 years.
05:00
I could add days, minutes and seconds, but you get
the point.
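The arithmetic behind the more specific figure can be sketched as follows. The helper name age_in_years is hypothetical, and a year is simply treated as 12 equal months.

```python
def age_in_years(years: int, months: int = 0) -> float:
    """Convert an age given in whole years and months to decimal years."""
    return years + months / 12


discrete_age = 131                      # whole years only: treated as discrete
continuous_age = age_in_years(131, 11)  # finer resolution: treated as continuous

print(discrete_age, round(continuous_age, 2))
```

131 years and 11 months is 131 + 11/12, which rounds to 131.92 years.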
05:05
When you are dealing with age, you decide
its type depending on the work at hand.
05:13
The next variable we have is age interval.
05:17
This is yet another way to represent age.
05:19
Once again, it is either continuous or
discrete, as we are talking about the same
variable. This time, though, the level of
measurement is ordinal instead of
ratio. The age groups represent different
categories that are
ordered but are not numerical.
05:36
This serves to show that the same variable
can have different levels of measurement
within the same database.
05:43
All right, let's move on.
05:46
In most corporate analyses, price is
central.
05:50
No matter the data set, it is always a
numerical variable that, like age, may be
discrete or continuous, depending on your
needs.
05:58
If you are interested, banks and
corporations treat it as continuous, and so
will we. The level of measurement here is
ratio.
06:08
The next variable we want to look into is
gender.
06:11
It is of categorical type and its level of
measurement is nominal.
06:15
It is very similar to yes and no questions
that we have discussed in previous lessons.
06:21
Such variables are called binary, as there
are only two possibilities, which are always
categorical. Finally, let's check out the
location.
06:31
We will discuss State in more detail and
leave country for homework.
06:36
The state variable refers to sales in the
USA only.
06:40
Note that we have a value for state only if
the country input is USA.
06:47
State is a categorical variable, like the ID
we talked about earlier.
06:51
In fact, you can label the US states from 1
to 50 and use numbers instead.
06:56
Either way, the variable is categorical and
its level of measurement is nominal.
07:02
Ok. Excellent.
07:04
We've categorized the variables we are going
to use in this video.
07:08
This spreadsheet is available for you in the
resources section, together with the
exercises we've prepared on this data set.
07:15
You can practice the entire section about
descriptive statistics.
07:19
All right, back to our problem at hand.
07:22
We have to identify the groups of people who
buy the most of our product.
07:26
Let's start with gender.
07:29
Before we can plot the data, we have to
create the frequency distribution table.
07:34
In the course notes, you can see how that's
done in Excel.
07:37
However, in this video, I'll skip this step
and get to the frequency
distribution table.
07:44
Now we have three possibilities for gender.
07:48
Male, female, or a cell where gender is not
available.
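A frequency distribution table like this one can be sketched in a few lines of Python. The gender column below is hypothetical, and the "N/A" label for empty cells is an assumption made for this example.

```python
from collections import Counter

# Hypothetical gender column; None marks a cell where gender is not available.
gender = ["M", "F", "M", None, "M", "F", None, "M"]

# Keep the missing cells as their own category instead of dropping them.
labels = [g if g is not None else "N/A" for g in gender]
counts = Counter(labels)
total = sum(counts.values())

# Frequency distribution table: absolute and relative frequency per category.
table = {label: (n, n / total) for label, n in counts.most_common()}

for label, (n, share) in table.items():
    print(f"{label:<4}{n:>3}{share:>8.1%}")
```

The relative frequencies are what a pie chart would display.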
07:53
Since some properties were purchased by
companies, they have no gender.
07:58
Nevertheless, we have to include them in the
analysis or explain why we omitted them in the
report. Gender is categorical.
08:07
We said that a good way to represent it in
practice is with a pie chart.
08:13
Okay. We can clearly see that most clients
are male.
08:17
However, this information is biased as the
customers in this database are the people who
sign the contract.
08:24
It is very likely that a family bought the
apartment, but our data shows us only the
person who signed the contract.
08:31
Such variables are interesting to see, but
it is not a good idea to include them in the
data-driven decisions we make.
08:40
Okay. Let's carry on with location.
08:43
What chart would be useful to show this?
State is a categorical variable.
08:49
We may use a bar chart or a pie chart.
08:52
However, I prefer the Pareto diagram as it
gives additional information.
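The additional information in a Pareto diagram is the cumulative share: categories are sorted by frequency and a running percentage is added. Here is a minimal sketch of the numbers behind such a diagram. The state counts are hypothetical, and the 75% threshold is just one possible cut-off.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical state column, for illustration only.
states = (["California"] * 12 + ["Nevada"] * 4 + ["Oregon"] * 2
          + ["Arizona", "Colorado"])

counts = Counter(states).most_common()   # sorted by descending frequency
total = sum(n for _, n in counts)
running = list(accumulate(n for _, n in counts))

# One row per state: (state, frequency, cumulative share of all customers).
pareto = [(state, n, cum / total) for (state, n), cum in zip(counts, running)]

# Smallest group of top states whose cumulative share reaches the threshold.
threshold = 0.75
selected = []
for state, n, cum_share in pareto:
    selected.append(state)
    if cum_share >= threshold:
        break

print(pareto)
print(selected)
```

With these made-up counts, the first two states already cover at least 75% of the customers.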
08:58
From the graph, you can immediately see that
the majority of clients are from California.
09:04
A possible scenario is to invest
in marketing for the locations that make up the top 75% of
customers. This would mean that we can focus
on California and Nevada
alone. Next, we want to see
age. First, we have to note that age
represents the age of the buyer
when the deal was sealed.
09:24
The formula used is the year of the deal,
minus the year of birth of the
buyer. We are doing this because we want to
see the age at which
customers buy our product.
09:36
Their current age is irrelevant.
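The formula is simple enough to write down directly. In this sketch the function name is hypothetical, and the birth year 1948 is an assumed value chosen to match a 56-year-old buyer in 2004.

```python
def age_at_purchase(year_of_sale: int, year_of_birth: int) -> int:
    """Age of the buyer in the year the deal was sealed."""
    return year_of_sale - year_of_birth


# Assumed birth year, for illustration only.
age = age_at_purchase(2004, 1948)
print(age)
```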
09:39
Moreover, real estate is something people
rarely buy more than once in their life.
09:43
So we expect age to be a central variable in
our analysis.
09:48
Let's first plot the frequency distribution
of age.
09:51
This is done by creating a histogram with an
interval length of one.
09:57
Now we can move on to the age interval
representation.
10:00
The options there are 18 to 25, 26 to
35, 36 to 45,
10:07
46 to 55, 56 to 65 and 65 plus.
10:12
Most of the data falls between 25 and 60
years, which is evident from the frequency
distribution graph.
10:19
Therefore our intervals are a good fit of
the data.
10:23
Let's build a new histogram based on them.
10:27
Done. This representation is much neater,
isn't it?
We can clearly see that 36 to 45 is the age
at which most people purchase a real estate
property. Moreover, it is evident that
customers from 26
to 65 years old account for 87% of our
observations.
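Grouping ages into intervals and counting each group can be sketched like this. The ages are hypothetical. The 46 to 55 group is assumed from the 26 to 55 range used in the summary, and 65 plus is read here as "older than 65".

```python
from collections import Counter

# (lower bound, upper bound, label); bounds are inclusive.
INTERVALS = [
    (18, 25, "18-25"),
    (26, 35, "26-35"),
    (36, 45, "36-45"),
    (46, 55, "46-55"),
    (56, 65, "56-65"),
]


def age_interval(age: int) -> str:
    """Map a whole-number age to its ordered age group."""
    for low, high, label in INTERVALS:
        if low <= age <= high:
            return label
    if age > 65:
        return "65+"
    raise ValueError(f"age {age} is below the first interval")


ages = [22, 29, 34, 38, 41, 44, 45, 48, 52, 59, 63, 71]  # hypothetical
order = [label for _, _, label in INTERVALS] + ["65+"]
counts = Counter(age_interval(a) for a in ages)

# Keep the groups in their natural order: the variable is ordinal.
histogram = [(label, counts.get(label, 0)) for label in order]
for label, n in histogram:
    print(f"{label:<6}{'#' * n}")
```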
10:47
But we are better than this.
10:48
We can calculate more statistics to get an
improved idea, can't we?
Let's do it. The mean, median and mode are
the place
where we usually start.
11:01
The mean age is 46.15 years.
11:04
The median age is 45 years and the mode is
48 years.
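For anyone who wants to reproduce these three measures outside a spreadsheet, here is a minimal sketch using Python's standard library. The ages are hypothetical and are not the company data.

```python
from statistics import mean, median, multimode

ages = [28, 33, 38, 41, 45, 48, 48, 57, 77]  # hypothetical

mean_age = mean(ages)
median_age = median(ages)
modes = multimode(ages)   # a list, since a data set may have several modes

# A mean above the median hints at a longer tail on the right.
mean_above_median = mean_age > median_age

print(round(mean_age, 2), median_age, modes, mean_above_median)
```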
11:10
All right. The mean and median are pretty
close, so we don't have a lot of
outliers. As you may recall, the mean is
affected by them while the median is
not. Moreover, when the mean is higher than
the median, we have a
positive or right skew.
11:26
This is confirmed by our histogram.
11:29
Now is the time to remind you that skewness
shows on which side the longer tail is and
not where the data is concentrated.
11:38
Now for the mode we have 48 years.
11:41
You can see that from the frequency
distribution graph, but not from the
histogram. See, the histogram bundles data
together, which is good when we
want to see the main trend.
11:51
But some information like the mode in this
case is lost.
11:58
Finally, we should inspect the variability
of age. Before we can do
so, we have to see if this is sample or
population data.
12:07
The company data is the population of all
people who are our customers already.
12:12
However, our research aims to help the
marketing department in identifying future
customers. Therefore, our data set is a
sample drawn from all the
people who will eventually buy property from
our company.
12:26
Henceforth we will use sample formulas.
12:29
Let's compute both the variance and the
standard deviation.
12:35
The former is measured in squared years and
the latter in years.
12:40
So I suggest we stick with the standard
deviation then, shall we?
The result is around 13 years.
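The difference between sample and population formulas is the divisor: n - 1 for a sample, n for a population. A minimal sketch with hypothetical ages, using Python's standard library:

```python
from statistics import pvariance, stdev, variance

ages = [28, 33, 38, 41, 45, 48, 48, 57, 77]  # hypothetical
n = len(ages)

sample_var = variance(ages)       # sample variance, divides by n - 1 (years squared)
sample_sd = stdev(ages)           # sample standard deviation (years)
population_var = pvariance(ages)  # population variance, divides by n

print(round(sample_var, 2), round(sample_sd, 2), round(population_var, 2))
```

The standard deviation is the square root of the variance, which brings the unit back from squared years to years.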
12:49
This gives us an additional idea of how
dispersed the data is.
12:54
What inferences can we make from this
result?
Well, that's the topic of the next section,
so we will have to make a halt here.
13:04
As you may have guessed, our final stop is
relationship between variables.
13:09
Let's see if age determines how expensive
an apartment customers buy.
Maybe younger people have fewer funds, so
they buy cheaper apartments.
13:18
We don't know.
13:19
The data will tell us.
13:23
First things first, let's plot the data.
13:25
Both variables are numerical, so we'll have
to use a scatterplot.
13:30
Here it is. It seems that it is pretty
dispersed
and there isn't an obvious trend.
13:39
Let's confirm this observation by
calculating the covariance of the two
variables. We get this enormous value that
doesn't
tell us much. So it's suitable to
standardize it by using the
correlation coefficient.
13:55
The value that we get is -0.17.
13:59
Much better. This correlation is very low.
14:03
A common practice is to disregard
correlations that are below 0.2 in absolute value.
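To see why the covariance is hard to read while the correlation coefficient is not, here is a minimal sketch with sample formulas written out by hand. The ages and prices are hypothetical, and the function names are made up for this example.

```python
from math import sqrt
from statistics import mean


def sample_covariance(x, y):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Covariance standardized by both standard deviations."""
    return sample_covariance(x, y) / sqrt(
        sample_covariance(x, x) * sample_covariance(y, y)
    )


ages = [28, 33, 38, 41, 45, 48, 48, 57, 77]            # hypothetical
prices = [310_000, 250_000, 280_000, 400_000, 230_000,
          270_000, 350_000, 240_000, 260_000]          # hypothetical, in dollars

cov = sample_covariance(ages, prices)
r = correlation(ages, prices)

# Expressing prices in thousands shrinks the covariance but leaves r unchanged.
prices_k = [p / 1000 for p in prices]
cov_k = sample_covariance(ages, prices_k)
r_k = correlation(ages, prices_k)

print(cov, cov_k)
print(round(r, 2), round(r_k, 2), abs(r) < 0.2)
```

The covariance depends on the units of both variables, while the correlation coefficient always lies between -1 and 1.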
14:10
All right. So real estate expenditure is not
related to
age. From a previous lesson,
14:18
we know that price and size are much more
likely to be correlated, right?
You have all the tools needed to check this
on your own through the exercise after this
lesson. So we've exhausted our statistical
knowledge so
far. What can we tell the marketing team
after this short analysis?
Well, we got several insights.
14:39
First, males are more likely to sign the
contracts and are potentially a better
audience for our ads.
14:45
However, we don't have any information about
their marital status.
14:48
Thus this observation is a bit unclear.
14:51
Yet we know that 9% of sales came from
corporate clients rather than
individuals, which we didn't expect.
15:00
Second, 68% of our sales in the US come from
California with
Nevada, Oregon, Arizona and Colorado
following behind to form
93% of the US customer base.
15:14
Third, 71% of sales were made with customers
aged between 26 and
55 years old, with a mean of 46 years of age
and a standard deviation of
13 years.
15:26
Moreover, the distribution is right skewed.
15:28
So we expect younger people to buy more
property than older people.
15:35
Finally, there is no relationship between
the age of a given customer and the price
they are willing to pay.
15:43
All right. That was our practical example.
15:46
We learned a lot about this business, but we
were unable to get some truly amazing
insights. In the following sections, we will
learn about confidence intervals
and hypothesis testing.
15:57
This knowledge will provide us with the
tools we need to make predictions about the
future and make data-driven decisions.
16:06
Oh, and one last thing.
16:07
If you like the course so far, please leave
us a review.
16:10
It helps a lot.
16:13
Thanks for practicing and thanks for
watching.