00:08
The Harvard Business Review said that the job
of "data scientist" is the "sexiest job of
the 21st century." But what exactly does a
"data scientist" do?
How come data science is so popular?
Well, the word itself says that it has to do
with science and data.
00:24
Statistics, software development, and
domain knowledge are often mentioned in the
same breath. But if you look at it in a more
general way, it's about how to use computers
to solve problems.
00:35
So, first a problem has to be understood,
then it has to be analyzed, then it has to be
turned into problems that a computer can
solve, and then it has to be solved.
00:44
And that is what data science is all about.
00:48
After all, my marketing colleague doesn't
come up to me and say, "Minimize the loss
function for me using a polynomial kernel!"
Instead, he has an ordinary problem, like,
"I'm trying to figure out how to get more
customers to buy." And that makes it clear
how this differs from data analytics.
01:05
Data analytics is a big part of it, but it
is mostly about answering specific questions.
01:12
On the other hand, data science is all about
coming up with the right questions.
01:18
To do this, you need to know what the
problems or opportunities are in each
department and then work with the relevant
stakeholders on questions and solutions.
01:29
So, the data scientist looks at the big
picture.
01:31
Communication is important.
01:33
A project in data science can be broken into
four steps.
01:37
The first step is importing the data, the
second step is preparing the data, the third
step is modeling the data, and the fourth
step is deployment.
01:50
Now, I'd like to walk you through these four
steps.
01:53
First, the import of the data.
01:57
The questions that data scientists are
trying to answer are usually new ones.
02:01
First, the right sources need to be found.
02:05
In data analytics, most of the data comes
from the data warehouse.
02:10
With the new questions in data science, you
first have to ask which data sources are
needed or might be interesting. Most of the
time, you won't find these as tables but
rather in unstructured form.
02:22
For instance, as PDFs, such as a collection
of scientific papers or annual reports.
02:27
It's also possible that price comparisons or
customer reviews are only available on
websites.
02:32
Then you have to make a web scraper that
searches these websites and pulls the
information out. Last, I'd like to talk
about what are called "NoSQL databases." "Not
only SQL" is what NoSQL stands for.
02:47
So these are databases that are made up of
more than just tables.
02:50
There are, for example, graph databases,
which are a great fit for the way social
networks are structured. Or column-oriented
databases, which are often used to store
sensor data.
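To make the graph idea a bit more concrete, here is a minimal Python sketch of a social network stored as nodes and edges instead of table rows. The names and the `friends_of_friends` helper are made up for illustration; a real graph database would offer its own query language for this kind of lookup.

```python
from collections import defaultdict

# Hypothetical friendships; each pair is an undirected edge in the graph.
friendships = [("Ana", "Ben"), ("Ben", "Cem"), ("Cem", "Dana"), ("Ana", "Eli")]

# Adjacency sets: for every person, the set of direct friends.
graph = defaultdict(set)
for a, b in friendships:
    graph[a].add(b)
    graph[b].add(a)

def friends_of_friends(person):
    """People two hops away who are not already direct friends."""
    direct = graph[person]
    two_hops = set()
    for friend in direct:
        two_hops |= graph[friend]
    return two_hops - direct - {person}

print(sorted(friends_of_friends("Ana")))
```

Following edges like this is the kind of question that is awkward to express over plain tables and natural over a graph.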
03:02
The second step is to get the data ready.
03:06
After getting data from many different
places, it needs to be cleaned up because it
almost always has mistakes.
03:13
These can be outliers or other values that
don't make sense, duplicates, missing values,
or other mistakes.
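As a rough illustration of these cleaning steps, the following Python sketch removes duplicates, treats implausible values as missing, and fills the gaps with the median. The records, the `clean` function, and the age limits are assumptions made up for this example, and median imputation is only one of several possible strategies.

```python
from statistics import median

# Hypothetical raw records: (customer_id, age). None marks a missing value.
raw = [(1, 34), (2, 41), (2, 41), (3, None), (4, 290), (5, 29)]

def clean(records, min_age=0, max_age=120):
    # Step 1: drop exact duplicates while keeping the original order.
    seen = set()
    deduped = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            deduped.append(rec)
    # Step 2: treat implausible ages (outliers) as missing.
    plausible = [
        (cid, age if age is not None and min_age <= age <= max_age else None)
        for cid, age in deduped
    ]
    # Step 3: fill missing ages with the median of the known ones.
    known = [age for _, age in plausible if age is not None]
    fill = median(known)
    return [(cid, age if age is not None else fill) for cid, age in plausible]

cleaned = clean(raw)
print(cleaned)
```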
03:24
And unfortunately, these mistakes have a big
effect on the algorithms that come next.
03:29
So you need the cleanest data set possible.
03:31
This part is also called "Data Cleaning"
because of this.
03:35
So, we'll clean up the data.
03:37
And what may seem like an easy part is
actually a big job for the data scientist.
03:43
Also, the data is then converted into
suitable formats.
03:47
For example, some algorithms need the
attributes to be standardized so that they
are all on the same scale.
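One common way to standardize attributes is the z-score, which rescales each attribute to mean 0 and standard deviation 1. The sketch below assumes this is the kind of standardization meant here; the `standardize` helper and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

incomes = [30_000, 45_000, 60_000]   # hypothetical, in euros
ages = [25, 40, 55]                  # hypothetical, in years

print(standardize(incomes))
print(standardize(ages))
```

After this step, income and age are comparable even though their raw units differ by orders of magnitude.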
03:54
The next step is to model the data.
03:56
After the data has been cleaned up, it's
time to start modeling.
04:00
To solve the problem, both well-known
statistical methods like linear regression
and the newest algorithms for machine
learning, like artificial neural networks,
are used. Most of the time, the data
scientist is more interested in the future,
which means that predictive or prescriptive
analytics are more important to them.
04:20
That is, forecast models or scenarios.
04:23
Descriptive analytics, meaning the summary
of the past, matters less.
04:27
The lines are, of course, not so clear.
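As a small example of the predictive side, here is a sketch of linear regression used as a forecast model: fit a straight line to past values and extend it one step into the future. The sales figures and the `fit_line` helper are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical weekly sales figures (week number, units sold).
weeks = [1, 2, 3, 4, 5]
units = [102, 110, 119, 131, 138]

slope, intercept = fit_line(weeks, units)
forecast_week_6 = slope * 6 + intercept   # prediction for the coming week
print(round(forecast_week_6, 1))
```

Summarizing weeks 1 to 5 would be descriptive; extending the fitted line to week 6 is the predictive step.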
04:30
This modeling is by far the most interesting
part of data science because it involves
putting real-world things into math.
04:37
For example, clustering algorithms are used
to find fraud.
04:41
Neural networks are used to recognize
images.
04:44
Neural networks are also needed for natural
language processing, but the way these neural
networks are made is very different from how
they are made for image recognition.
04:55
Overall, to get convincing results, you need
a lot of experience, knowledge of statistics,
and a willingness to try new things.
05:05
There are three types of algorithms for
machine learning.
05:07
The first is supervised learning, the second
is unsupervised learning, and the third is
reinforcement learning.
05:15
In supervised learning, you start with a set
of data that includes results.
05:21
So, if you want to look at how customers
cancel their cell phone contracts, which is
measured by the "churn rate," you need a data
set with the right attributes and also a
field that says whether the customer has
cancelled the contract or not.
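A minimal sketch of this idea: each record carries an attribute plus the known result, and the learner picks the rule that best reproduces those results. The attribute (support calls), the data, and the one-feature threshold rule are all assumptions chosen to keep the example short; a real churn model would use many attributes.

```python
# Hypothetical labeled records: (support_calls_last_year, churned)
records = [(0, False), (1, False), (2, False), (3, True),
           (4, False), (5, True), (6, True), (7, True)]

def fit_threshold(data):
    """Find the cutoff t where 'calls >= t means churn' makes the fewest errors."""
    best_t, best_errors = None, len(data) + 1
    for t in sorted({calls for calls, _ in data}):
        errors = sum((calls >= t) != churned for calls, churned in data)
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors

# "Training": learn the cutoff from records whose outcome is already known.
threshold, training_errors = fit_threshold(records)

def predict(calls):
    """Predict churn for a new customer with the learned cutoff."""
    return calls >= threshold

print(threshold, training_errors, predict(6))
```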
05:35
Most algorithms for classification, like
logistic regression or decision trees, belong
to the class of "supervised learning."
Unsupervised learning also needs data sets,
but these do not include any results.
05:49
The goal is to see if there are any patterns
in the data.
05:53
Principal component analysis is a
statistical method for reducing the number of
variables.
05:58
Unsupervised learning is also used to find
outliers, for example in credit card fraud.
06:03
In clustering, the data points are put into
groups based on how similar they are.
06:07
In this way, groups of customers who are
like each other can be made.
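Grouping by similarity can be sketched with a tiny version of k-means on a single attribute. This is only an illustration under simplifying assumptions: the spend values are invented, the data is one-dimensional, and the starting centers are given by hand.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means on one-dimensional data with given starting centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical yearly spend per customer, in euros. No labels are given.
spend = [20, 25, 30, 200, 210, 220]
centers, clusters = kmeans_1d(spend, centers=[0, 100])
print(centers, clusters)
```

No result field is needed here; the two customer groups emerge from the data alone.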
06:12
Reinforcement learning means learning from
actions and the rewards you get in return.
06:18
An agent moves around in a real or virtual
environment and does certain things.
06:23
These actions are evaluated, and good
behavior is reinforced.
06:28
Robots often use a learning method called
reinforcement learning.
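The loop of acting, being evaluated, and reinforcing what worked can be sketched with tabular Q-learning, one common reinforcement learning method. The corridor environment, the reward, and all parameter values below are assumptions invented for this example.

```python
import random

N_STATES = 5            # positions 0..4 in a corridor; position 4 is the goal
ACTIONS = [-1, +1]      # index 0 = step left, index 1 = step right

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    # q[state][action_index] is the learned value of taking that action there.
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):                     # cap the episode length
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)        # explore, or break a tie
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
            reward = 1.0 if next_state == N_STATES - 1 else 0.0
            # Evaluate the action and nudge its value toward
            # the reward plus the discounted value of what comes next.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if state == N_STATES - 1:
                break
    return q

q = train()
# Best learned action in each non-goal position.
policy = ["left" if left > right else "right" for left, right in q[:-1]]
print(policy)
```

The agent is never told that moving right is correct; it only sees a reward at the goal, and actions that lead there gain value over time.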
06:32
The biggest pitfall with all of these
algorithms is that they fit the data too
well.
06:39
In overfitting, the model is tuned too
closely to the training data.
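A minimal sketch of what overfitting looks like in practice: a "model" that simply memorizes its training examples scores perfectly on them but does poorly on data it has not seen. The data, both toy models, and the held-out set are assumptions for illustration; the threshold rule is written by hand here only to provide a contrast.

```python
# Hypothetical data: the true rule is "label is True when x >= 50".
train_set = [(10, False), (30, False), (45, False),
             (55, True), (70, True), (90, True)]
test_set = [(20, False), (40, False), (60, True), (80, True)]

# Model A memorizes the training examples and guesses False for anything else.
memory = dict(train_set)

def memorizer(x):
    return memory.get(x, False)

# Model B uses a general rule instead of individual examples.
def threshold_rule(x):
    return x >= 50

def accuracy(model, data):
    """Share of examples where the model's answer matches the known label."""
    return sum(model(x) == label for x, label in data) / len(data)

for name, model in [("memorizer", memorizer), ("threshold", threshold_rule)]:
    print(name, accuracy(model, train_set), accuracy(model, test_set))
```

Judged on the training data alone, both models look equally good; only the held-out data reveals the difference.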
06:44
You can think of it as the algorithm
memorizing the training data set: since it
doesn't derive any general rules, it doesn't
work well in practice. For this reason, the
data is always split into a training data set
and a test data set. The latter is only there
to judge the quality, so it must not be used
for training. The fourth step in the process
of data science is to put the data to use.
07:10
This means going live or automating.
07:15
In the simplest case, this is just routine
reporting, where data is updated once a week.
07:21
Depending on where it is used, though, it can
become much more complicated.
07:25
Let's say we have a model that predicts how
much a grocery store will sell.
07:29
Now, we want to put this model to use.
07:31
Namely, as a web service for both supply
chain management and the branch managers.
07:36
Maybe our algorithm is also good enough to
trigger orders from wholesalers on its own.
07:41
So, it intervenes directly in business
processes.
07:44
Stability is very important for this, of
course.
07:46
And the algorithm can't go live until it has
been tested many times on the development
system. IT is usually in charge of the
deployment process.
07:56
In data science, it's hard because the model
changes all the time.
08:00
The machine learning algorithm is always
learning.
08:04
It updates its calculation model based on
what it has learned, and then this new model
is put into production.
08:11
Here, too, plausibility checks are needed.
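One possible form of such a check, sketched under assumptions: before a retrained model replaces the one in production, compare both on held-out data and only promote the new one if it is no worse. The function names, the tolerance parameter, and the toy models are hypothetical.

```python
def mean_abs_error(model, data):
    """Average absolute difference between prediction and actual value."""
    return sum(abs(model(x) - y) for x, y in data) / len(data)

def should_promote(current, candidate, holdout, tolerance=0.0):
    """Promote the candidate only if it is no worse than the current model."""
    return (mean_abs_error(candidate, holdout)
            <= mean_abs_error(current, holdout) + tolerance)

# Hypothetical hold-out data: (day, units_sold).
holdout = [(1, 100), (2, 110), (3, 120)]

def current_model(day):      # in production today
    return 95 + 10 * day

def good_update(day):        # retrained model that fits better
    return 90 + 10 * day

def broken_update(day):      # retrained model with implausible output
    return 10 * day

print(should_promote(current_model, good_update, holdout))
print(should_promote(current_model, broken_update, holdout))
```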
08:17
As you can see, there is a big difference
between producing a one-time analysis or
prototype and intervening in running
business processes.
08:24
Let us summarize again: The goal of data
science is to solve problems.
08:29
So, first, human problems are understood and
analyzed.
08:34
Then, they are turned into problems that a
computer can solve, and of course, they are
also solved. The data science process has
four steps: importing data, preparing data,
modeling data, and putting the data to use.
08:54
I hope this little trip into the wonderful
world of data has paid off for you.