Pragmatic Data Science | 4, Feature scaling

Feature scaling is the task of taking a parameter and rescaling it to a predefined interval. If this is done for all parameters, the data set can be said to be normalized. Feature scaling keeps the internal structure of the data: if one value was much bigger than another, it remains bigger after the scaling.

By using the min and max values over all seen examples of a variable, we can express whether a given value is the biggest value seen (100%) or the smallest value seen (0%). Thought of as percentages, the features have been scaled between 0 and 100. The interval of the scaling can differ depending on what you are trying to achieve, but the most common scenario is to scale features to [0, 1]. In that sense the scaled value can also be thought of as a percentage in decimal form.

There are different types of feature scaling, but a very common one is the min/max scaling that we have previously discussed. The new value x' is given by subtracting the min value of x and dividing by the interval of x (the max value of x minus the min value of x):

x' = (x - min(x)) / (max(x) - min(x))

Other common data normalization techniques include mean normalization.

Here, x' is instead given by how far x deviates from the mean, relative to the interval of x:

x' = (x - mean(x)) / (max(x) - min(x))

If x sits 20% of the interval below the mean this normalization yields -0.2, while 20% of the interval above the mean yields 0.2. The normalized values therefore fall in the range [-1, 1], centered around zero.
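As a minimal sketch of both techniques in Python (the function names and example values are my own, just for illustration):

def min_max_scale(values):
    """Scale values to the interval [0, 1] using min/max scaling."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def mean_normalize(values):
    """Express each value as its deviation from the mean, relative to the range."""
    lo, hi = min(values), max(values)
    mean = sum(values) / len(values)
    return [(x - mean) / (hi - lo) for x in values]

temperatures = [31, 45, 77, 102]    # example Fahrenheit readings
print(min_max_scale(temperatures))  # [0.0, 0.197..., 0.647..., 1.0]
print(mean_normalize(temperatures)) # values in [-1, 1], centered around 0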


Pragmatic Data Science | 3, Meta analysis

Before you start implementing your application it is important to do some preliminary meta-analysis of how the data is constructed. The mindset you should have is that it is fast to think but slow to implement; we therefore want to build the right kind of application, one that solves a problem that actually exists.

We also need to do some formatting of the structure of the data. We start by reading the data from the csv file into memory. This is fine for small data sets, and for our use-case, but other approaches will have to be taken for bigger data sets that do not fit in memory. We also need to convert data type formats, like dates, into something that is easier to work with when using arithmetic operators (plus, minus, multiplication, division).
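A minimal sketch of this step, assuming the data lives in a file called weather.csv with a DATE column formatted as YYYY-MM-DD (the file and column names are placeholders, not confirmed by the series):

import csv
from datetime import datetime

# Read the whole csv file into memory; fine for small data sets like ours.
with open("weather.csv") as f:
    rows = list(csv.DictReader(f))

# Convert the date strings into datetime objects so that we can use
# arithmetic on them later.
for row in rows:
    row["DATE"] = datetime.strptime(row["DATE"], "%Y-%m-%d")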

We will be using Python 3 for this implementation, so make sure that you are not using Python 2 if you intend to follow along with the series. We will also be using pyplot from matplotlib. If you don’t have it installed you can install it with pip3.

pip3 install matplotlib

If we start by plotting the temperature, in Fahrenheit, we get the following graph.
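Something along these lines produces that plot (weather.csv and the column names DATE and TMAX are assumptions carried over from the sketch above):

import csv
from datetime import datetime
import matplotlib.pyplot as plt

with open("weather.csv") as f:
    rows = list(csv.DictReader(f))

dates = [datetime.strptime(row["DATE"], "%Y-%m-%d") for row in rows]
temps = [float(row["TMAX"]) for row in rows]  # temperature in Fahrenheit

plt.plot(dates, temps)
plt.xlabel("Date")
plt.ylabel("Temperature (F)")
plt.show()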

We can see that there is a clear seasonal trend in the temperature over time, which of course is expected for anyone who has ever experienced weather seasons. In the coming tutorials we are going to exploit this temperature trend to see how many people Gunnar needs to call in to help him out in the shop.

If you want to convert the temperature to Celsius you can use the following formula.

C = (F - 32) * 5/9
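Or, as a small Python helper:

def fahrenheit_to_celsius(f):
    # C = (F - 32) * 5/9
    return (f - 32) * 5 / 9

print(fahrenheit_to_celsius(77))  # 25.0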

Pragmatic Data Science | 2, Use case

With a real data set containing weather data you can start diving into some of the basics of data science. With this historical weather data from North Carolina we will see how we can optimize the ice-cream business for Gunnar, the poor Swedish immigrant who moved away from the harsh climate to save his company.

In the video we introduce the use-case, look at the data in the csv file and see what data is available for exploitation.


Pragmatic Data Science | 1, Introduction

Data science toolbox

We are going to be using Python 3 for this course. You need some basic knowledge of how to run Python scripts, how to use pip3 to install Python packages, and basic programming concepts such as data structures, loops and control statements. If you know JavaScript it will help, but it is not mandatory. The same goes for reading academic papers: if you have previous experience of where to find them, how to read them and how to understand them it will help, but all algorithms will be handed out so you don’t need to find them yourself. If you know basic arithmetic, such as plus, minus, division and multiplication, you are good to go. The linear algebra needed for the course will be presented in a fashion where no prior knowledge is needed. However, you may be expected to spend a little time on your own making sure you understand the concepts.

Why should you care about data science?

Data science and machine learning (ML) are becoming bigger and more hyped. Python as a language is growing and has a lot of nice libraries for data science and ML.

Because of the hype, it might be expected at your current workplace that you have some kind of “in the ballpark” knowledge of the topics. You might be interested in this knowledge to boost your career and land some prestigious new work; it is no secret that data science jobs are very well paid, if salary is an interest of yours. You might also be interested in gaining knowledge of a more theoretical area than you normally spend your time in, e.g. linear algebra and its applications. I will also say that the biggest benefit of this tutorial series is that we will actually see practical implementations and use cases, so the videos will be more pragmatic than what is normally seen at universities or in online courses.

Content

The content is dynamic and I will take requests on topics that you are interested in. Just leave them in the comment section on YouTube and I will see what I can do.

Some of the topics that will be covered are linear algebra, because many algorithms are based on manipulating matrices, with operations like matrix multiplication and vector multiplication. Other topics include data normalization, so that the data behaves in a nice way that makes certain algorithms practically feasible. We will also touch on the subject of similarity measurements, so that we can make sense of our data when comparing high-dimensional data. This is not as trivial as it might seem at first glance, as illustrated by the sketch below.
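As a small taste of the similarity topic, here is a sketch of one common measure, cosine similarity (an illustrative choice on my part, not necessarily the measure the series will use):

import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0, the vectors are parallel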