Data Science

by Dr. Holger Aust

    Transcript

    00:08 The Harvard Business Review said that the job of "data scientist" is the "sexiest job of the 21st century." But what exactly does a "data scientist" do? How come data science is so popular? Well, the word itself says that it has to do with science and data.

    00:24 Statistics, software development, and domain knowledge are often mentioned in the same breath. But if you look at it in a more general way, it's about how to use computers to solve problems.

    00:35 So, first a problem has to be understood, then it has to be analyzed, then it has to be turned into problems that a computer can solve, and then it has to be solved.

    00:44 And that is what data science is all about.

    00:48 Because my marketing colleague doesn't come up to me and say, "Minimize the loss function for me using a polynomial kernel!" Instead, he has an ordinary problem, like, "I'm trying to figure out how to get more customers to buy." And that makes it clear how this differs from data analytics.

    01:05 Data analytics is a big part of it, but it is mostly about answering specific questions.

    01:12 On the other hand, data science is all about coming up with the right questions.

    01:18 To do this, you need to know what the problems or opportunities are in each department and then work with the people who matter on questions and solutions.

    01:29 So, the data scientist looks at the big picture.

    01:31 Communication is important.

    01:33 A project in data science can be broken into four steps.

    01:37 The first step is importing the data, the second step is preparing the data, the third step is modeling the data, and the fourth step is deployment.

    01:50 Now, I'd like to walk you through these four steps.

    01:53 First, the import of the data.

    01:57 The question that data scientists are trying to answer is brand new.

    02:01 First, the right sources need to be found.

    02:05 Most of the data for the data analysis comes from the data warehouse.

    02:10 So, the new questions in data science are about which data sources are needed or might be interesting. Most of the time, you won't find these in tables but rather as unstructured lists.

    02:22 For instance, as a PDF, a group of scientific papers or annual reports.

    02:27 It's also possible that you can only compare prices or read customer reviews on websites.

    02:32 Then you have to make a web scraper that searches these websites and pulls the information out. Last, I'd like to talk about what are called "NoSQL databases." "Not only SQL" is what NoSQL stands for.

    02:47 So these are databases that are made up of more than just tables.

    02:50 There are, for example, graph databases, which are a great fit for the way social networks are structured. Or column-oriented databases, which are often used to store sensor data.
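
    The web scraper mentioned above could look roughly like the following minimal sketch. It uses only Python's standard-library HTMLParser. The page content, the "review" class name, and the helper names ReviewScraper and extract_reviews are assumptions made up for illustration; a real scraper would first download the page and would have to match the markup of the actual website.

```python
from html.parser import HTMLParser


class ReviewScraper(HTMLParser):
    """Collects the text of every <p class="review"> element."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._in_review = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "p" and ("class", "review") in attrs:
            self._in_review = True
            self.reviews.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_review = False

    def handle_data(self, data):
        # Only keep text that sits inside a review paragraph
        if self._in_review:
            self.reviews[-1] += data


def extract_reviews(html):
    parser = ReviewScraper()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.reviews]


# Hypothetical page content (assumption); no network access is needed here.
SAMPLE_HTML = """
<html><body>
  <h1>Product X</h1>
  <p class="review">Great value for the price.</p>
  <p class="price">19.99</p>
  <p class="review">Stopped working after a week.</p>
</body></html>
"""

reviews = extract_reviews(SAMPLE_HTML)
print(reviews)
```

    The result is a plain list of review texts, which can then go into the data preparation step.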

    03:02 The second step is to get the data ready.

    03:06 After getting data from many different places, it needs to be cleaned up because it almost always has mistakes.

    03:13 These can be outliers or other values that don't make sense, duplicate values, missing values, or other mistakes.

    03:24 And unfortunately, these mistakes have a big effect on the algorithms that come next.

    03:29 So you need the cleanest data set possible.

    03:31 This part is also called "Data Cleaning" because of this.

    03:35 So, we'll clean up the data.

    03:37 And what may seem like an easy part is actually a big job for the data scientist.

    03:43 Also, the data is then converted into suitable formats.

    03:47 For example, some algorithms also need the attributes to be standardized so that they are all on the same scale.
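
    As a rough illustration of this preparation step, the sketch below removes duplicates, drops records with missing values, filters an implausible outlier, and then standardizes one attribute. The records, the column layout, and the outlier rule (more than ten times the median) are assumptions chosen for the example, not rules from the lecture.

```python
from statistics import mean, median, pstdev

# Hypothetical raw records: (customer_id, age, monthly_spend)
raw_records = [
    (1, 34, 52.0),
    (2, 45, 61.5),
    (2, 45, 61.5),     # duplicate record
    (3, None, 48.0),   # missing age
    (4, 29, 55.0),
    (5, 38, 5000.0),   # implausible outlier
    (6, 51, 58.5),
]


def clean(records, max_ratio=10.0):
    # 1. Remove exact duplicates while keeping the original order
    unique = list(dict.fromkeys(records))
    # 2. Drop records that contain a missing value
    complete = [r for r in unique if None not in r]
    # 3. Drop records whose spend exceeds max_ratio times the median spend
    typical_spend = median(r[2] for r in complete)
    return [r for r in complete if r[2] <= max_ratio * typical_spend]


def standardize(values):
    # z-score: subtract the mean, divide by the standard deviation
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


cleaned = clean(raw_records)
z_ages = standardize([age for _, age, _ in cleaned])
print(cleaned)
print([round(z, 2) for z in z_ages])
```

    After standardization, the ages have mean 0 and standard deviation 1, so they can be compared with other attributes on the same scale.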

    03:54 The next step is to model the data.

    03:56 After the data has been cleaned up, it's time to start modeling.

    04:00 To solve the problem, both well-known statistical methods like linear regression and the newest algorithms for machine learning, like artificial neural networks, are used. Most of the time, the data scientist is more interested in the future, which means that predictive or prescriptive analytics are more important to them.

    04:20 That is, forecast models or scenarios.

    04:23 They are less concerned with descriptive analytics, that is, with summarizing the past.

    04:27 The lines are, of course, not so clear.

    04:30 This modeling is by far the most interesting part of data science because it involves putting real-world things into math.

    04:37 For example, clustering algorithms are used to find fraud.

    04:41 Neural networks are used to recognize images.

    04:44 Neural networks are also needed for natural language processing, but the way these neural networks are made is very different from how they are made for image recognition.

    04:55 Overall, to get convincing results, you need a lot of experience, knowledge of statistics, and a willingness to try new things.
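
    To make the modeling step a little more concrete, here is a minimal sketch of the linear regression mentioned above, fitted with the closed-form least-squares formulas and then used for a simple forecast. The advertising and sales figures are invented for illustration and lie exactly on a line, which real data never does.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend (thousand EUR) vs. units sold
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(ad_spend, units_sold)

# Predictive use: forecast sales for a spend level not seen so far
forecast = slope * 6.0 + intercept
print(slope, intercept, forecast)
```

    Describing the five known points would be descriptive analytics; using the fitted line for the sixth point is the predictive use the lecture refers to.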

    05:05 There are three types of algorithms for machine learning.

    05:07 The first is supervised learning, the second is unsupervised learning, and the third is reinforcement learning.

    05:15 In supervised learning, you start with a set of data that includes results.

    05:21 So, if you want to analyze how customers cancel their cell phone contracts, which is measured by the "churn rate," you need a data set with the relevant attributes and also a field that says whether the customer has canceled the contract or not.
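
    A minimal sketch of supervised learning on such churn data follows, assuming logistic regression trained with plain gradient descent as the classifier. The two attributes (price level and number of support calls, both assumed to be already standardized), the tiny data set, and the hyperparameters are made up for illustration.

```python
import math

# Hypothetical standardized attributes: (price_level, support_calls)
features = [
    (-1.0, -1.2), (-0.8, -0.5), (-1.2, -0.9), (-0.5, -1.0),
    (1.0, 1.1), (0.7, 0.9), (1.3, 0.6), (0.9, 1.2),
]
# The known result for each record: 1 = canceled the contract, 0 = stayed
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(features, labels, learning_rate=0.1, epochs=500):
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # gradient of the log loss with respect to z
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


weights, bias = train(features, labels)
predictions = [predict(weights, bias, x) for x in features]
print(predictions)
```

    The label column is what makes this supervised: the algorithm is told the outcome for every training record and adjusts its weights to reproduce it.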

    05:35 Most classification algorithms, like logistic regression or decision trees, belong to the class of "supervised learning." Unsupervised learning also needs data sets, but ones that contain no results.

    05:49 The goal is to see if there are any patterns in the data.

    05:53 Principal component analysis is a statistical method for reducing the number of variables.

    05:58 Unsupervised learning is also used to find outliers, for example to detect credit card fraud.

    06:03 In clustering, the data points are put into groups based on how similar they are.

    06:07 In this way, groups of customers who are like each other can be made.
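
    Grouping similar customers can be sketched with k-means, one common clustering algorithm (the lecture does not name a specific one, so this choice is an assumption). The customer data and the simple deterministic initialization are also assumptions for the example; real implementations usually start from random or k-means++ centroids.

```python
import math


def k_means(points, k, iterations=10):
    # Simple deterministic start: pick k evenly spaced points as centroids
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return labels, centroids


# Hypothetical customers: (orders per year, average basket value in EUR)
customers = [
    (2, 20), (3, 25), (2, 22),      # occasional buyers
    (20, 80), (22, 85), (21, 78),   # frequent buyers
]

labels, centroids = k_means(customers, k=2)
print(labels)
print(centroids)
```

    Note that no results are supplied here: the two groups emerge only from the similarity of the data points.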

    06:12 Reinforcement learning means learning from what you do and what you get in return.

    06:18 An agent moves around in a real or virtual environment and performs certain actions.

    06:23 These actions are evaluated, and good behavior is reinforced.

    06:28 Robots often use a learning method called reinforcement learning.
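
    The idea of an agent whose good actions are reinforced can be sketched with tabular Q-learning, one standard reinforcement learning algorithm (chosen here as an assumption; the lecture does not name one). The environment is a made-up corridor of five cells with a reward in the last cell, and the agent simply explores at random while it learns.

```python
import random

N_STATES = 5            # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)      # move left or move right
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor (assumptions)


def step(state, action):
    # Walls at both ends: the agent cannot leave the corridor
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            a = rng.randrange(len(ACTIONS))  # explore uniformly at random
            next_state, reward = step(state, ACTIONS[a])
            # Q-learning update: move the estimate toward
            # reward + discounted value of the best next action
            target = reward + GAMMA * max(q[next_state])
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
    return q


q_table = train()
# Greedy policy for the non-terminal cells: the best-valued action per cell
policy = [ACTIONS[row.index(max(row))] for row in q_table[:-1]]
print(policy)  # prints [1, 1, 1, 1]
```

    Although the agent moves randomly during training, the learned values favor moving right in every cell, because only that direction leads to the reward.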

    06:32 The biggest pitfall with all of these algorithms is that they fit the data too well.

    06:39 In overfitting, the model is optimized too closely to the training data set.

    06:44 You can think of it as the algorithm memorizing the training data set; since it doesn't derive any general rules, it can't work well in practice. For this reason, the data is always split into a training data set and a test data set. The second one is only there to judge the quality, so it must not be used during training. The fourth step in the data science process is to put the model to use.
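
    Returning briefly to overfitting: the split into training and test data can be sketched as follows. The Memorizer class is a deliberately bad, hypothetical "model" that only stores its training records, which is the extreme case of overfitting. The data, the 25 percent test share, and all names are assumptions for illustration.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=42):
    # Shuffle a copy so the original order does not leak into the split
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


class Memorizer:
    """Stores training records verbatim and learns no general rule."""

    def fit(self, records):
        self.table = {x: y for x, y in records}

    def predict(self, x):
        # Unseen input: fall back to a fixed default guess
        return self.table.get(x, 0)


def accuracy(model, records):
    return sum(model.predict(x) == y for x, y in records) / len(records)


# Hypothetical data: the label is 1 whenever x is at least 50
data = [(x, int(x >= 50)) for x in range(100)]
train_set, test_set = train_test_split(data)

model = Memorizer()
model.fit(train_set)  # the test set is never shown to the model

train_accuracy = accuracy(model, train_set)
test_accuracy = accuracy(model, test_set)
print(train_accuracy, test_accuracy)
```

    The training accuracy is perfect by construction, while the test accuracy will typically be much lower. Only the test set reveals that nothing general has been learned.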

    07:10 This means going live or automating.

    07:15 This may be just routine reporting, where the data is simply updated once a week.

    07:21 Depending on where it is used, though, it can become much more complicated.

    07:25 Let's say we have a prediction about how much a grocery store will sell.

    07:29 Now, we want to use this model.

    07:31 Specifically, as a web service for both supply chain management and the branch managers.

    07:36 Maybe our algorithm is also good enough to trigger orders from wholesalers on its own.

    07:41 So, it intervenes directly in operational processes.

    07:44 Stability is very important for this, of course.

    07:46 And the algorithm can't go live until it has been tested many times on the development system. IT is usually in charge of the deployment process.

    07:56 In data science, it's hard because the model changes all the time.

    08:00 The machine learning algorithm is always learning.

    08:04 It updates its calculation model based on what it has learned, and then this new model is put into production.

    08:11 Again, plausibility checks are needed for this.

    08:17 As you can see, making a one-time analysis or prototype is not the same thing as changing the way things work in day-to-day operations.
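
    One possible form of such a plausibility check is sketched below: a retrained model only replaces the one in production if its error on held-out data is not noticeably worse. The function names, the 5 percent tolerance, and the toy sales models are all assumptions for illustration, not a description of any particular deployment system.

```python
def mean_absolute_error(model, samples):
    return sum(abs(model(x) - y) for x, y in samples) / len(samples)


def choose_model(current, candidate, holdout, tolerance=0.05):
    """Return the candidate only if it passes the plausibility check."""
    current_error = mean_absolute_error(current, holdout)
    candidate_error = mean_absolute_error(candidate, holdout)
    if candidate_error <= current_error * (1 + tolerance):
        return candidate
    return current  # keep the model that is already in production


# Hypothetical held-out data: (weekday index, units actually sold)
holdout = [(0, 100.0), (1, 110.0), (2, 120.0)]


def current_model(day):
    return 100.0 + 8.0 * day


def good_update(day):
    return 100.0 + 10.0 * day


def broken_update(day):
    return -50.0  # implausible: negative sales


in_production = choose_model(current_model, good_update, holdout)
after_bad_update = choose_model(current_model, broken_update, holdout)
print(in_production.__name__, after_bad_update.__name__)
```

    In this sketch, the improved model is promoted, while the broken update is rejected and the current model stays live.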

    08:24 Let us summarize again: The goal of data science is to solve problems.

    08:29 So, first, human problems are understood and analyzed.

    08:34 Then, they are turned into problems that a computer can solve, and of course, they are also solved. The data science process has four steps: importing data, preparing data, modeling data, and putting the data to use.

    08:54 I hope this little trip into the wonderful world of data has paid off for you.


    About the Lecture

    The lecture Data Science by Dr. Holger Aust is from the course Data Data Data (EN).


    Included Quiz Questions

    1. Supervised learning
    2. Unsupervised learning
    3. Mitigated learning
    4. Intelligent learning
    1. The model is too closely aligned with the training data set.
    2. The “deployment” of the model is faulty.
    3. The model provides poor results on the test data set.
    4. The model does not use reinforcement learning.

    Author of lecture Data Science

    Dr. Holger Aust
