In a linear combination it is easy to create arbitrarily long sequences of pairs of weights (thetas) and input values (x). But first, a quick recap of what a simple linear combination looks like.
Since each x value is multiplied by its theta, we can very easily disable a variable by setting its theta to 0. When working with real-world data, setting a theta to 0 to disable a variable is common practice, since it allows for a more optimized implementation.
In coming videos we are going to put these x values and thetas into vectors. Instead of dynamically sized vectors we can then use vectors of a static size, where some items are set to 0. Knowing the size of a vector beforehand gives some real benefits, since it lets you use tools from the Python ecosystem such as numpy arrays, and sparse vectors and matrices (available via scipy). These implementations are much more efficient and use far less RAM (roughly a tenth) than Python lists, which are composed of Python objects. But in order to use numpy arrays this way we need to set the size of the array before executing the program.
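As a small sketch of the idea (the size, weights, and values below are made up for illustration): a fixed-size numpy array where unused variables are simply weighted by 0.

```python
import numpy as np

# Fixed-size vectors: we decide on 5 variables up front.
thetas = np.zeros(5)  # all weights start out disabled (0)
x = np.array([72.0, 0.55, 12.0, 0.0, 0.0])  # e.g. temperature, humidity, wind speed, ...

# Enable only the first two variables by giving them non-zero weights.
thetas[0] = 0.8
thetas[1] = 0.2

# The linear combination is a dot product; disabled variables contribute nothing.
score = np.dot(thetas, x)
```

Because the arrays have a fixed size, numpy can store them as contiguous blocks of raw numbers instead of pointers to Python objects.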
In the previous examples we did a naive prediction from the linear dependency between temperature and staff. However, the temperature by itself makes for too simplistic a model. In our csv file with weather data we have more information available, such as humidity, wind speed etc. In order to use it we are going to need something called a linear combination.
The linear combination we are going to use for our new prediction consists of two parts: weights (theta) and values (x). Each variable we want to use for our prediction is represented by a pair of a weight and a value. To get the final score we simply sum the products.
score = (theta_1 * x_1) + (theta_2 * x_2)
The parentheses are not strictly necessary, but they show the pair for each variable we are using in our linear combination. With this model we can append any number of variable pairs, to any length we want.
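As a minimal sketch of the formula above (the weights and inputs here are made-up numbers, not values from the actual dataset):

```python
# Hypothetical weights and inputs for two variables (e.g. temperature, humidity).
thetas = [0.8, 0.2]
xs = [0.75, 0.40]

# score = (theta_1 * x_1) + (theta_2 * x_2) + ... for any number of pairs.
score = sum(theta * x for theta, x in zip(thetas, xs))
```

Appending another variable is just a matter of appending one more theta and one more x to the lists.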
But what are the thetas and x values really doing to the score? A good analogy here is a recording studio. The thetas can be thought of as the sliders on a mixing board.
They signify how much volume is let through to the speakers (the score in our case). We allow more volume through by dragging a slider up, and less by sliding it down. The slider (the theta) can therefore be seen as a weight for how much this particular channel affects the final score.
If the singer is screaming at the top of his lungs into the microphone, the volume is going to be higher than if he is whispering. The input to the channel from the microphone is our x variable. We can compensate for “screaming” by setting a lower theta for that channel if it affects the final score too much. The reverse is also true: if we want to amplify a “whispering” input we can set a higher theta for that x value.
By having these weights for each x we can make a linear combination of multiple variables to form a final scalar score.
Once you have performed feature scaling on your data all the fun stuff can begin. When the temperature data, in fahrenheit, is plotted over time on the y-axis, we can see how the temperature fluctuates over the year. When this data is scaled, its internal structure is kept. Before scaling, this is how our data looks when plotted.
After the feature scaling the data looks very similar but with a different y-axis.
The fact that the temperature started out in fahrenheit is no longer relevant; instead the relative change is preserved and can be seen as a decimal percentage, where 0 is the coldest observed day and 1 is the hottest observed day.
Intuitively, in order to map this to how many people should come and help out in the store, we can say that when the temperature is at 100% we should bring in 100% of the available people. In the naive example from the video the relationship between people and temperature is linearly dependent.
If we give the program the current temperature (data that does not exist in the dataset) it can now make its first naive prediction. The prediction consists of the following steps:
Input is given in fahrenheit
Input is normalized to [0, 1]
The normalized input is multiplied by the maximum number of people we can bring in
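The steps above can be sketched as follows (the min/max temperatures and the staff count here are made-up numbers, not values from the actual dataset):

```python
# Hypothetical bounds observed in the training data.
MIN_TEMP_F = 10.0   # coldest observed day
MAX_TEMP_F = 105.0  # hottest observed day
MAX_STAFF = 8       # maximum number of people we can bring in

def predict_staff(temp_f):
    # Step 1: input is given in fahrenheit.
    # Step 2: normalize the input to [0, 1] using min/max scaling.
    normalized = (temp_f - MIN_TEMP_F) / (MAX_TEMP_F - MIN_TEMP_F)
    # Step 3: multiply by the maximum number of people we can bring in.
    return normalized * MAX_STAFF
```

For example, a day exactly halfway between the coldest and hottest observed temperatures would call in half of the available staff.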
The last step here is particularly interesting and I suggest you take your time to really understand what is going on. It might seem trivial, but what we have effectively done is move from one scale, the fahrenheit scale, to another handcrafted scale, the employment scale. We can do this since we know the relationship between these scales. In this naive prediction it is quite easy since they are linearly dependent. Intuitively, this means e.g. that if the temperature is at 87% we should bring in 87% of our staff.
By normalizing our data we found this common ground for “muchness”, where we can go from one scale to another via the normalization procedure.
Feature scaling is the task of taking a parameter and rescaling it to a predefined interval. If this is done with all parameters, the data set can be said to be normalized. After feature scaling the internal structure of the data is kept, meaning that if one value was much bigger than another, this proportion persists after the scaling.
By using the min and max value over all seen examples of a variable we can create a sense of whether a value is the biggest seen (100%) or the smallest seen (0%). If we think of it as a percentage we can say that the features have been scaled between 0 and 100. The interval of the scaling can differ depending on what you are trying to achieve, but the most common scenario is to scale features to [0, 1]. In this sense the scaled value can also be thought of as a percentage in decimal form.
There are different types of feature scaling, but a very common one is the min/max scaling that we have previously discussed. The new value x-prime is given by subtracting the min value of x and dividing by the range of x (the max value of x minus the min value of x): x' = (x - min(x)) / (max(x) - min(x)). Another common data normalization technique is mean normalization.
Here, x-prime is instead given by how much x deviates from the mean, relative to the range: x' = (x - mean(x)) / (max(x) - min(x)). A value below the mean yields a negative result and a value above it a positive one, so this normalization has the range [-1, 1].
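Both techniques can be sketched in a few lines of Python (the function names here are my own, not from any particular library):

```python
def min_max_scale(values):
    """Rescale values to [0, 1]: x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def mean_normalize(values):
    """Center values around the mean, divided by the range; results lie in [-1, 1]."""
    lo, hi = min(values), max(values)
    mean = sum(values) / len(values)
    return [(x - mean) / (hi - lo) for x in values]
```

Note that both divide by the same range; the only difference is whether we subtract the minimum or the mean.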
Before you start implementing your application it is important to do some preliminary meta-analysis of how the data is constructed. The mindset you should have is that it is fast to think but slow to implement, and therefore we want to implement the right kind of application, so that we solve a problem that actually exists.
We also need to do some formatting of the structure of the data. We start by reading the data from the csv file into memory. This is fine for small data sets, and for our use case, but other approaches will have to be taken for bigger data sets that do not fit in memory. We also need to convert data formats, like dates, into something that is easier to work with when using arithmetic operators (plus, minus, multiplication, division).
We will be using python3 for this implementation, so make sure that you are not using python2 if you intend to follow along with the series. We will also be using pyplot from matplotlib. If you don’t have it installed you can install it with pip3.
pip3 install matplotlib
If we start by plotting the temperature, in fahrenheit, we get the following graph.
We can see that there is a common trend in the temperature over time, which of course is expected by anyone who has ever experienced weather seasons. In the coming tutorials we are going to exploit this temperature trend to see how many people Gunnar needs to call in to help him out in the shop.
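The plot itself can be produced with a few lines of pyplot (the temperatures below are made-up numbers standing in for the csv data):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display; drop this line to show a window
import matplotlib.pyplot as plt

# Made-up temperature readings standing in for the real data set.
temperatures = [31.0, 45.5, 68.0, 88.5, 70.0, 48.0]

fig, ax = plt.subplots()
ax.plot(temperatures)
ax.set_xlabel("observation")
ax.set_ylabel("temperature (fahrenheit)")
fig.savefig("temperature.png")
```

With the real data loaded from the csv file, the same `ax.plot` call produces the seasonal curve discussed above.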
If you want to convert the temperature to Celsius you can use the following formula.
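The standard conversion between the two scales (a general fact about fahrenheit and celsius, not specific to this dataset) can be sketched as:

```python
def fahrenheit_to_celsius(temp_f):
    # C = (F - 32) * 5 / 9
    return (temp_f - 32) * 5 / 9
```

For example, 32 °F is the freezing point (0 °C) and 212 °F is the boiling point (100 °C).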
With a real data set of weather data you can start diving into some of the basics of data science. With this historical weather data from North Carolina we will see how we can optimize the ice-cream business for Gunnar, the poor Swedish immigrant who moved from the harsh climate to save his company.
In the video we introduce the use-case, look at the data in the csv file and see what data is available for exploitation.
Why should you care about data science?
Data science and machine learning (ML) are becoming bigger and more hyped. Python as a language is growing and has a lot of nice libraries for data science and ML.
Because of the hype it might be expected of you at your current workplace to have some kind of “in the ballpark” knowledge of the topics. You might be interested in acquiring this knowledge to boost your career and land some prestigious new work; it is no secret that data science jobs are very well paid, if salary is an interest of yours. You might also be interested in gaining some knowledge of a more theoretical area than you normally spend your time in, e.g. linear algebra and its applications. I would say the biggest benefit of this tutorial series is that we will actually see practical implementations and use cases, so the videos will be more pragmatic than what is normally seen at universities or in online courses.
The content is dynamic and I will take requests on topics that you are interested in. Just leave them in the comment section on YouTube and I will see what I can do.
Some of the topics that will be covered are linear algebra, because many algorithms are based on manipulation of matrices, like matrix and vector multiplication; data normalization, so that the data behaves in a nice way, making certain algorithms practically feasible; and similarity measurements, so that we can make sense of our data when comparing highly dimensional data. This is not as trivial as it might seem at first glance.