Descriptive Statistics: Practical Example

by 365 Careers

    Learning Material 5
    • XLS
      2.13. Practical example. Descriptive statistics lesson.xls
    • XLS
      2.13.Practical-example.Descriptive-statistics-exercise.xls
    • XLS
      2.13.Practical-example.Descriptive-statistics-exercise-solution.xls
    • PDF
      Statistics Excel solutions.pdf
    • PDF
      Download Lecture Overview
    Transcript

    00:01 Finally, it's time for the practical example we've been talking about.

    00:06 In this lesson, we will see an actual database of a real estate company operating in California. All right.

    00:13 We are interested in the statistical properties of the data.

    00:16 That is why we have reordered the database and cherry-picked variables and then imported these in a spreadsheet.

    00:23 The labels of the columns have been made friendly, even for those of you who do not have any experience with real estate.

    00:30 Finally, we have altered the names of the customers for confidentiality reasons.

    00:35 Okay. The company is launching a marketing campaign, but it wants to target its audience properly. The management suspects that after some short analysis, marketing results can be improved without the need of investing additional resources.

    00:49 We are the data analysts who are going to crunch some numbers and identify which groups of people are most likely to buy our product.

    00:57 Once we have done so, we will instruct the marketing team to focus its efforts on these groups. The first thing we have to do when we analyze the data is to get acquainted with the table.

    01:08 It illustrates the sales of real estate property for a specific company.

    01:12 Let's call it 365 Data Science Real Estate California.

    01:17 Hopefully nobody else thought of that name.

    01:20 Second, the table has two parts, left and right.

    01:25 On the left side, we have product information.

    01:28 On the right hand side, we have customer information.

    01:32 You can easily spot that all products are listed, but customer information is only available for some products.

    01:39 This is because we input information about a customer once the deal is done.

    01:43 Logically, only sold items are associated with a buyer.

    01:49 Let's see what a row looks like.

    01:51 This should clear up the logic of the table for you.

    01:55 Nora Lynch, with customer ID SI 0004, was 56 years old when she bought Apartment 43 in building one in order to live there. She paid $377,313 for an area of 160 square feet in June 2004.

    02:16 Nora is from California, felt very satisfied with the deal and did not get a mortgage for the purchase.

    02:23 She found out about our product through our website.

    02:28 Okay. Now that that's out of the way, we need to dig a bit deeper into these variables. We will identify types of data and levels of measurement for some of them.

    02:38 This is a crucial step, as we cannot analyze the data if we don't understand its type. Let's start from the first column ID.

    02:47 ID is a value that we assign to each item, which lets us differentiate between products. It may look numerical to you, but in fact, it is categorical. That's very counterintuitive the first time.

    02:59 So let's clarify it a bit.

    03:03 What if we used names like John, John 2, John 3 and so on? The meaning would not change.

    03:12 ID variables are like names that we assign to different products.

    03:16 However, it is much easier to use numbers because,

    03:19 unlike names, we have an infinite number of numbers.

    03:23 A simple way to check if a variable is numerical or categorical is to interpret its mean. Think about it: the mean ID number shows nothing. Now contrast this with the mean price, for example.

    03:36 It is clear that the mean price is a very valuable piece of information.

    03:42 Okay. The bottom line is that it is a categorical variable.

    03:47 What about its level of measurement? Well, it is qualitative, nominal. Clear?

    03:57 The next variable we'll have to examine is age.

    04:00 Age is rather interesting.

    04:02 The level of measurement is quantitative, ratio.

    04:06 A rule used for verifying ratios is to ask the question: is there a true zero point? Well, for age,

    04:14 it is obvious that when you were born, you were exactly zero years old.

    04:19 That's the true zero point.

    04:21 So we are safe.

    04:24 However, what's truly intriguing is whether age is discrete or continuous. In fact, it may be both.

    04:33 In this case, we can only see age as a whole number.

    04:36 Therefore, it is discrete.

    04:39 However, similar to weight, a variable we discussed earlier in the course, age can also be a continuous variable.

    04:46 At the time I am recording this,

    04:48 the Statue of Liberty is 131 years old.

    04:52 But I may get more specific by saying it is 131 years and 11 months old, or that its age is 131.92 years.

    05:00 If I add days, minutes and seconds... you get the point.

    05:05 When you are dealing with age, you decide its type depending on your work at hand.

    05:13 The next variable we have is age interval.

    05:17 This is yet another way to represent age.

    05:19 Once again, it is either continuous or discrete, as we are talking about the same variable. This time, though, the level of measurement is ordinal instead of ratio. The age groups represent different categories that are ordered but are not numerical.

    05:36 This serves to show that the same variable can have different levels of measurement within the same database.

    05:43 All right, let's move on.

    05:46 In most corporate analyses, price is central.

    05:50 No matter the data set, it is always a numerical variable that, like age, may be discrete or continuous, depending on your needs.

    05:58 If you are interested, banks and corporations treat it as continuous, and so will we. The level of measurement here is ratio.

    06:08 The next variable we want to look into is gender.

    06:11 It is of categorical type and its level of measurement is nominal.

    06:15 It is very similar to yes and no questions that we have discussed in previous lessons.

    06:21 Such variables are called binary, as there are only two possibilities, and they are always categorical. Finally, let's check out the location.

    06:31 We will discuss State in more detail and leave country for homework.

    06:36 The state variable refers to sales in the USA only.

    06:40 Note that we have a value for state only if the country input is USA.

    06:47 State is a categorical variable like ID that we talked about earlier.

    06:51 In fact, you can label the US states from 1 to 50 and use numbers instead.

    06:56 Either way, the variable is categorical and its level of measurement is nominal.

    07:02 Ok. Excellent.

    07:04 We've categorized the variables we are going to use in this video.

    07:08 This spreadsheet is available for you in the resources section, together with the exercises we've prepared on this data set.

    07:15 You can practice the entire section about descriptive statistics.

    07:19 All right, back to our problem at hand.

    07:22 We have to identify the groups of people who buy the most of our product.

    07:26 Let's start with gender.

    07:29 Before we can plot the data, we have to create the frequency distribution table.

    07:34 In the course notes, you can see how that's done in Excel.

    07:37 However, in this video, I'll skip this step and get to the frequency distribution table.

    07:44 Now we have three possibilities for gender.

    07:48 Male, female, or a cell where gender is not available.

    07:53 Since some properties were purchased by companies, they have no gender.

    07:58 Nevertheless, we have to include them in the analysis or explain in the report why we omitted them. Gender is categorical.
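
The lesson builds the frequency distribution table in Excel. As a rough equivalent, here is a minimal Python sketch of the same idea. The `genders` list is invented for illustration and is not the course data; the empty string stands in for a cell with no gender (a company buyer), and the `"N/A"` label is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical gender column (NOT the course data):
# "M", "F", or "" when the buyer is a company and no gender is recorded.
genders = ["M", "F", "M", "", "M", "F", "M", "", "F", "M"]


def frequency_table(values):
    """Return (category, absolute frequency, relative frequency) rows,
    sorted from the most to the least frequent category."""
    # Keep missing values as their own category instead of silently dropping them.
    counts = Counter(v if v else "N/A" for v in values)
    total = sum(counts.values())
    return [(category, n, n / total) for category, n in counts.most_common()]


for category, count, share in frequency_table(genders):
    print(f"{category:>3}  {count:>3}  {share:.0%}")
```

Keeping the missing values as a separate row mirrors the point made above: company buyers are either shown in the analysis or their omission is explained.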

    08:07 We said that a good way to represent it in practice is with a pie chart.

    08:13 Okay. We can clearly see that most clients are male.

    08:17 However, this information is biased as the customers in this database are the people who sign the contract.

    08:24 It is very likely that a family bought the apartment, but our data shows us only the person who signed the contract.

    08:31 Such variables are interesting to see, but it is not a good idea to include them in the data driven decisions we make.

    08:40 Okay. Let's carry on with location.

    08:43 What chart would be useful to show this? State is a categorical variable.

    08:49 We may use a bar chart or a pie chart.

    08:52 However, I prefer the Pareto diagram as it gives additional information.

    08:58 From the graph, you can immediately see that the majority of clients are from California.
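
A Pareto diagram is a bar chart sorted by descending frequency, with a cumulative-percentage line on top. The numbers behind it can be sketched as follows. The client counts are made up for illustration, and the helper names `pareto_table` and `categories_up_to` are hypothetical.

```python
# Hypothetical client counts per state, invented purely for illustration.
clients_by_state = {
    "California": 12,
    "Nevada": 3,
    "Oregon": 2,
    "Arizona": 2,
    "Colorado": 1,
}


def pareto_table(counts):
    """Sort categories by descending frequency and attach the cumulative share."""
    total = sum(counts.values())
    running = 0
    rows = []
    for category, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        running += n
        rows.append((category, n, running / total))
    return rows


def categories_up_to(rows, threshold):
    """Return the leading categories needed to reach the cumulative threshold."""
    selected = []
    for category, _, cumulative in rows:
        selected.append(category)
        if cumulative >= threshold:
            break
    return selected


rows = pareto_table(clients_by_state)
for category, n, cumulative in rows:
    print(f"{category:<12}{n:>4}{cumulative:>8.0%}")
```

The cumulative column is the additional information a plain bar or pie chart does not show: it lets you read off how many categories are needed to cover a given share of clients.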

    09:04 A possible scenario is to decide to invest in marketing for the top 75% of the locations. This will mean that we can focus on California and Nevada alone. Next, we want to see age. First, we have to note that age represents the age of the buyer when the deal was sealed.

    09:24 The formula used is the year of the deal, minus the year of birth of the buyer. We are doing this because we want to see the age at which customers buy our product.

    09:36 Their current age is irrelevant.
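
The age formula is simple enough to write down directly. In this sketch the function name `age_at_purchase` is hypothetical, and the birth year 1948 is not stated in the lesson; it is only what the formula implies for Nora, who was 56 when she bought in 2004.

```python
def age_at_purchase(deal_year: int, birth_year: int) -> int:
    """Age when the deal was sealed: year of the deal minus year of birth."""
    return deal_year - birth_year


# Nora bought in 2004 at age 56, which implies a birth year of 1948 (derived, not given).
nora_age = age_at_purchase(2004, 1948)
print(nora_age)
```

Because only years are used, the result can be off by one relative to the buyer's exact age on the day of the deal.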

    09:39 Moreover, real estate is something people rarely buy more than once in their life.

    09:43 So we expect age to be a central variable in our analysis.

    09:48 Let's first plot the frequency distribution of age.

    09:51 This is done by creating a histogram with an interval length of one.

    09:57 Now we can move on to the age interval representation.

    10:00 The options there are 18 to 25, 26 to 35, 36 to 45, 46 to 55,

    10:07 56 to 65 and 65 plus.

    10:12 Most of the data falls between 25 and 60 years, which is evident from the frequency distribution graph.

    10:19 Therefore our intervals are a good fit of the data.

    10:23 Let's build a new histogram based on them.

    10:27 Done. This representation is much neater, isn't it? We can clearly see that 36 to 45 is the age at which most people purchase a real estate property. Moreover, it is evident that customers from 26 to 65 years old account for 87% of our observations.
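
Moving from single ages to age intervals amounts to binning. A minimal sketch, with several assumptions: the ages are invented, the "46-55" interval is inferred from the pattern of the other intervals, and "65+" is treated as 66 and above so that the bins do not overlap.

```python
from collections import Counter

# (low, high, label), both bounds inclusive. "46-55" is assumed from the pattern.
BINS = [
    (18, 25, "18-25"),
    (26, 35, "26-35"),
    (36, 45, "36-45"),
    (46, 55, "46-55"),
    (56, 65, "56-65"),
]


def age_interval(age):
    """Map a whole-number age to its interval label."""
    for low, high, label in BINS:
        if low <= age <= high:
            return label
    # Assumption: "65+" means 66 and above, to avoid overlapping with "56-65".
    return "65+" if age > 65 else "under 18"


# Hypothetical ages at purchase (NOT the course data).
ages = [22, 31, 38, 41, 44, 48, 48, 52, 59, 70]
histogram = Counter(age_interval(a) for a in ages)

for _, _, label in BINS:
    print(f"{label:<6}{'#' * histogram[label]}")
print(f"{'65+':<6}{'#' * histogram['65+']}")
```

Binning turns the ratio-level age into the ordinal age interval discussed earlier, which is why the grouped chart is neater but less detailed.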

    10:47 But we are better than this.

    10:48 We can calculate more statistics to get an improved idea, can't we? Let's do it. The mean, median, and mode are the place where we usually start.

    11:01 The mean age is 46.15 years.

    11:04 The median age is 45 years and the mode is 48 years.

    11:10 All right. The mean and median are pretty close, so we don't have a lot of outliers. As you may recall, the mean is affected by them while the median is not. Moreover, when the mean is higher than the median, we have a positive or right skew.

    11:26 This is confirmed by our histogram.

    11:29 Now is the time to remind you that skewness shows on which side the longer tail is, not where the data is concentrated.
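
The three measures of central tendency, and the mean-versus-median rule of thumb for skew, can be sketched with Python's standard `statistics` module. The ages are invented for illustration and are not the course data.

```python
import statistics

# Hypothetical ages at purchase (NOT the course data); 75 acts as a high outlier.
ages = [30, 35, 40, 44, 45, 48, 48, 50, 75]

mean = statistics.mean(ages)
median = statistics.median(ages)
mode = statistics.mode(ages)

# Rule of thumb from the lesson: mean above median suggests a right (positive) skew.
if mean > median:
    skew_hint = "right"
elif mean < median:
    skew_hint = "left"
else:
    skew_hint = "none"

print(f"mean={mean:.2f} median={median} mode={mode} skew hint={skew_hint}")
```

The single high value pulls the mean above the median while leaving the median unchanged, which is the outlier sensitivity described above. The comparison is only a quick heuristic, not a formal skewness measure.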

    11:38 Now for the mode we have 48 years.

    11:41 You can see that from the frequency distribution graph, but not from the histogram. See, the histogram bundles data together, which is good when we want to see the main trend.

    11:51 But some information like the mode in this case is lost.

    11:58 Finally, we should inspect the variability of age. Before we can do so, we have to see if this is sample or population data.

    12:07 The company data is the population of all people who are our customers already.

    12:12 However, our research aims to help the marketing department in identifying future customers. Therefore, our data set is a sample drawn from all the people who will eventually buy property from our company.

    12:26 Henceforth we will use sample formulas.

    12:29 Let's compute both the variance and the standard deviation.

    12:35 The former is measured in squared years and the latter in years.

    12:40 So I suggest we stick with the standard deviation then, shall we? The result is around 13 years.

    12:49 This gives us an additional idea of how dispersed the data is.
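
The difference between sample and population formulas is only the divisor: n - 1 for a sample, n for a population. A minimal sketch on invented ages (not the course data), with a hand-written `sample_variance` helper (a hypothetical name) checked against the standard `statistics` module:

```python
import math
import statistics

# Hypothetical ages at purchase (NOT the course data).
ages = [30, 35, 40, 44, 45, 48, 48, 50, 75]


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


sample_var = sample_variance(ages)        # in squared years
sample_sd = math.sqrt(sample_var)         # in years
population_var = statistics.pvariance(ages)  # divides by n instead of n - 1

print(f"sample variance:     {sample_var:.2f} years^2")
print(f"sample std dev:      {sample_sd:.2f} years")
print(f"population variance: {population_var:.2f} years^2")
```

The standard deviation is in the same unit as the data, which is why it is the easier of the two to interpret.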

    12:54 What inferences can we make from this result? Well, that's the topic of the next section, so we will have to make a halt here.

    13:04 As you may have guessed, our final stop is relationship between variables.

    13:09 Let's see if age determines how expensive an apartment customers buy. Maybe younger people have fewer funds, so they buy cheaper apartments.

    13:18 We don't know.

    13:19 The data will tell us.

    13:23 First things first, let's plot the data.

    13:25 Both variables are numerical, so we'll have to use a scatterplot.

    13:30 Here it is. It seems that it is pretty dispersed and there isn't an obvious trend.

    13:39 Let's confirm this observation by calculating the covariance of the two variables. We get this enormous value that doesn't tell us much. So it's suitable to standardize it by using the correlation coefficient.

    13:55 The value that we get is -0.17.

    13:59 Much better. This correlation is very low.

    14:03 A common practice is to disregard correlations that are below 0.2.
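
Why the covariance is "enormous" while the correlation is easy to read can be seen in a small sketch. The ages and prices are invented for illustration, and the helper names are hypothetical. Sample formulas (divisor n - 1) are used, in line with the choice made earlier.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Correlation coefficient: covariance divided by both standard deviations."""
    sd_x = math.sqrt(sample_covariance(xs, xs))
    sd_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (sd_x * sd_y)


# Hypothetical data (NOT the course data): age at purchase and price in dollars.
ages = [25, 32, 41, 48, 56, 63]
prices = [310_000, 250_000, 420_000, 280_000, 390_000, 230_000]

cov = sample_covariance(ages, prices)
r = correlation(ages, prices)
print(f"covariance:  {cov:,.2f} (years x dollars)")
print(f"correlation: {r:.2f}")
```

The covariance carries the units of both variables, so expressing price in thousands of dollars would shrink it by a factor of 1,000. The correlation stays the same under such rescaling and always lies between -1 and 1, which is what makes a threshold like 0.2 meaningful.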

    14:10 All right. So real estate expenditure is not related to age.

    14:18 From a previous lesson, we know that price and size are much more likely to be correlated, right? You have all the tools needed to check this on your own through the exercise after this lesson. So we've exhausted our statistical knowledge so far. What can we tell the marketing team after this short analysis? Well, we got several insights.

    14:39 First, males are more likely to sign the contracts and are potentially a better audience for our ads.

    14:45 However, we don't have any information about their marital status.

    14:48 Thus this observation is a bit unclear.

    14:51 Yet we know that 9% of sales came from corporate clients rather than individuals, which we didn't expect.

    15:00 Second, 68% of our sales in the US come from California with Nevada, Oregon, Arizona and Colorado following behind to form 93% of the US customer base.

    15:14 Third, 71% of sales were made with customers aged between 26 and 55 years old, with a mean of 46 years of age and a standard deviation of 13 years.

    15:26 Moreover, the distribution is right skewed.

    15:28 So we expect younger people to buy more property than older people.

    15:35 Finally, there is no relationship between the age of a given customer and the price they are willing to pay.

    15:43 All right. That was our practical example.

    15:46 We learned a lot about this business, but we were unable to get some truly amazing insights. In the following sections, we will learn about confidence intervals and hypothesis testing.

    15:57 This knowledge will provide us with the tools we need to make predictions about the future and make data driven decisions.

    16:06 Oh, and one last thing.

    16:07 If you like the course so far, please leave us a review.

    16:10 It helps a lot.

    16:13 Thanks for practicing and thanks for watching.


    About the Lecture

    The lecture Descriptive Statistics: Practical Example by 365 Careers is from the course Statistics for Data Science and Business Analysis (EN).


    Author of lecture Descriptive Statistics: Practical Example

     365 Careers


