Welcome to part two of our Machine Learning in Soccer series! For this project I used something called linear regression, which is not necessarily machine learning but is a super simple way to predict results using past data. Many of the “bibles” of machine learning refer to linear regression as one of the algorithms that we should all know, and that makes a lot of sense because it sets up the foundation for other regression methods. Because your computer is taking the data and then spitting out an answer, we can call this a machine learning exercise, but I think it is better to think of linear regression as an introduction to applied statistics rather than machine learning. Machine learning is mostly just super advanced applications of statistical analysis using high-level math and computer modeling techniques. Because computers can calculate things so incredibly fast, we use them to run algorithms and models far faster than a human being could, and the computer can then repeat those calculations at scale as well (hundreds or thousands of times instead of just once). There are a ton of ways to get at the same answers, so this season we are going to dive into several of them.
I based our prediction on the previous three (3) seasons of data, using 102 different data points collected from each match in Major League Soccer (regular season only at this point; perhaps we will collect playoff data at another time). These range from the identity of the referee to the elevation of the stadium, temperature at kickoff, and minutes played up or down a man, as well as shots, key passes, expected goals (xG), team touches, aerials won and lost, and so on. I used this data to train a linear regression model.
Wait. What is a linear regression model?
Linear regression is a way of modeling a direct relationship between the data we pick, called independent variables (shots, passes, fouls, weather, etc.), and the data we want to predict, called the dependent variable (goals). Essentially, a linear regression takes all of the data we give it and finds the straight-line relationship that fits it best. Linear in this sense means a straight line with no curving. I should probably explain that we are actually using a multiple linear regression here, which just means that we have more than one independent variable. If we were just using one variable, like shots, to predict goals, then that would be called a simple linear regression. By having more than one independent variable, we have a multiple linear regression. Simple enough…I think. Let me know in the comments if you want some further explanation here. Below is an example of a linear regression line (courtesy Wikipedia).
You can see here that the data points are scattered and that their layout – or distribution – follows a straight line. As the data on the X-axis (the side to side line on the bottom of the graph) grows, the data on the Y-axis (the top to bottom line on the graph) grows as well. This is what we call a linear relationship.
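To make the idea a little more concrete, here is a minimal sketch of a multiple linear regression in plain Python. It is only an illustration and not the actual 102-variable model behind the prediction. The match numbers are made up (they are built to follow goals = 0.2 + 0.1 x shots + 0.05 x key passes exactly, so the result is easy to check), and the function names fit_linear_regression and predict are hypothetical names chosen for this example.

```python
def fit_linear_regression(rows, targets):
    """Fit y = b0 + b1*x1 + ... + bk*xk by ordinary least squares.

    Solves the normal equations (X^T X) b = X^T y with Gauss-Jordan
    elimination. Returns [b0, b1, ..., bk], intercept first.
    """
    # Prepend a constant 1.0 to every row so the model gets an intercept.
    design = [[1.0] + [float(v) for v in row] for row in rows]
    n = len(design[0])

    # Build the augmented matrix [X^T X | X^T y].
    matrix = []
    for i in range(n):
        left = [sum(r[i] * r[j] for r in design) for j in range(n)]
        right = sum(r[i] * t for r, t in zip(design, targets))
        matrix.append(left + [right])

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(matrix[r][col]))
        if abs(matrix[pivot][col]) < 1e-12:
            raise ValueError("variables are collinear; no unique fit")
        matrix[col], matrix[pivot] = matrix[pivot], matrix[col]
        pivot_value = matrix[col][col]
        matrix[col] = [v / pivot_value for v in matrix[col]]
        for r in range(n):
            if r != col:
                factor = matrix[r][col]
                matrix[r] = [a - factor * b for a, b in zip(matrix[r], matrix[col])]

    return [matrix[i][n] for i in range(n)]


def predict(coefficients, features):
    """Apply the fitted line: intercept plus each coefficient times its feature."""
    intercept, slopes = coefficients[0], coefficients[1:]
    return intercept + sum(s * f for s, f in zip(slopes, features))


# Made-up data: each row is (shots, key passes) for one hypothetical match.
matches = [(10, 8), (14, 9), (8, 12), (17, 11), (12, 6)]
# Made-up targets that lie exactly on goals = 0.2 + 0.1*shots + 0.05*key_passes.
goals = [1.6, 2.05, 1.6, 2.45, 1.7]

coefficients = fit_linear_regression(matches, goals)
print("intercept and slopes:", [round(c, 3) for c in coefficients])
print("prediction for 15 shots, 10 key passes:",
      round(predict(coefficients, (15, 10)), 3))
```

With only one independent variable (say, just shots) the same function gives a simple linear regression. Real match data will not sit perfectly on a line the way these made-up numbers do, so the fitted coefficients would be the best compromise rather than an exact recovery.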
What does this model say about the game and what is the score going to be?
Well, using our model we get the predicted final score of…
FC Dallas: 1.789 goals
New England Revolution: 1.124 goals
If we round this to the nearest goal then we get a 2-1 final score for tomorrow’s game. Starting a season off with a win? Check. Using data to predict scores? Check. Using the best method to do that? No check here, we can improve it. Let’s improve it! (whispers: Next week).
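The rounding step is tiny, but for completeness here is a sketch of it in Python using the two predicted values above. The variable names are just for illustration.

```python
# Predicted goals from the model output quoted above.
predicted_goals = {"FC Dallas": 1.789, "New England Revolution": 1.124}

# Round each prediction to the nearest whole goal.
rounded_score = {team: round(goals) for team, goals in predicted_goals.items()}

for team, goals in rounded_score.items():
    print(f"{team}: {goals}")
```

One thing to keep in mind: Python's built-in round sends exact halves (like 1.5) to the nearest even number, which does not matter for these two values but could for others.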
Now, first and foremost we should discuss the inherent flaws in what I just did here with this model. I wanted to start the prediction with a linear model because it is a good foundation and is useful for a lot of things. If you are only measuring a few variables, then this is a great starting point! It can also be run without any technical know-how and without any coding skills. You can perform your own linear regression on paper (not my recommendation but whatever, it’s your time) or in Excel or Google Sheets. If you would like to know how, just ask away in the comments and I will get you some resources so that you can start to test out your own ideas and models. Always remember: your model may not be right, but it does get you to ask questions. More questions mean more insights, and that is what we want. Insightful fans dig deeper than the surface and try to find causes of bad streaks, the relationship between player pairings on the field, or even shot locations as a predictor of points per game. There is a world of data out there and we will continue to explore it. Next week we will be using a nearest-neighbor model to predict scores. This is a very different approach because it will be a classification approach (predicting categories) rather than a regression one. Check back before week 2’s game and find out more!
Seriously though, guys, let me know if there is something you want to see here in the future and we can tackle it straight away. I look forward to this season-long series and hope that we can create some really interesting models, and maybe you guys can shoot over some creative thoughts about predictive or modeling activities. If you have an idea, post it down below or find me on Twitter. I will give you credit for any ideas you have and we can collaborate on some fun stuff.