So What is Machine Learning? by Dean Van Der Laan
Disclaimer: This was written from memory as a distillation of my own experience and as such will likely contain inadvertent plagiarism. In the words of David Mitchell, if I knew how I know everything that I know, I’d only know half as much. So, in the interest of good faith, if you spot any such cases please help by commenting links.
Many of you may have heard the term “machine learning” bandied about. It’s the latest craze in marketing strategies for tech products. You hear that some product has machine learning and you immediately know that it’s doing some clever magic that must be worth paying for. But what is it? I’m hoping this will be a gentle introduction for those curious. I promise: no maths. Well, not much maths!
So, what is machine learning? As with most concepts in language, this one is bounded by a thick grey area of subjectivity, so I won’t attempt to precisely define it. Instead, here are a few tongue-in-cheek definitions I’ve picked up along the way:
- Machine learning is fitting fancy curves to data.
- Machine learning is just complicated statistics.
- Machine learning is the set of methods commonly considered to be machine learning by the machine learning community.
- Machine learning is computer programs writing computer programs.
- Machine learning is a set of methods for estimating the parameters of a mathematical model from data.
Ok, I snuck the last one in because once you trim back the jargon, it’s actually quite descriptive and representative of what most machine learning algorithms do. Example: “y = mx + c” is a mathematical model which describes all possible straight lines, where a specific line is modelled by replacing the parameters m and c with numbers. It says that as the value of x changes, the value of y changes proportionately. If you were given a set of data of pairs of x’s and y’s, you could use a machine learning algorithm called linear regression to estimate the parameters m and c of the model and in doing so, identify the line which is best aligned with the data. At which point you would, hopefully, have a mathematical model which you could use to make predictions about what y would be if you were given x.
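For the curious, here is roughly what that looks like in code. This is a minimal sketch of ordinary least squares, one common way of doing linear regression; the function name `fit_line` and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Estimate m and c in y = m*x + c by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: chosen so the line passes through the mean point.
    c = mean_y - m * mean_x
    return m, c


# Made-up data that roughly follows y = 2x + 1, with a little noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

m, c = fit_line(xs, ys)        # the learned parameters
prediction = m * 5.0 + c       # what would y be if x were 5?
print(m, c, prediction)
```

The estimated m and c come out close to the 2 and 1 the data was built around, and the fitted line can then be used to predict y for an x it has never seen.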
That concludes the maths portion. Non plus ultra. I want to explore machine learning in a more intuitive way by thinking about human learning. Although we think of ourselves as rational creatures, we’re not really all that good at thinking rationally. We learn from examples and although we may not like to admit it, our knowledge of the world is biased by our experience and we irrationally accept this knowledge as truth. Machine learning is no different. These methods learn about a very specific aspect of the world from examples, and if those examples are no good then neither are the results. Take, for example, Microsoft’s racist Twitter chatbot, which was so easily corrupted by bored teenagers (I assume).
Timmy, an avid carnothologist, is out exploring the countryside with his mum. “Look mummy, it’s a dog”, shouts Timmy, pointing off into a field. “That’s not a dog, that’s a sheep”. Timmy is momentarily confused, but soon gets distracted by something trivial, unconsciously committing the event to memory. A little later, Timmy enthusiastically points off into another field “Look, look, it’s a sheep!”. “No Timmy, that’s actually a poodle. It’s a type of dog. I wonder why the farmer has a poodle…”. Eventually they arrive back in town to find a peculiarly shorn sheep ambling down the road, oblivious to the fabulousness of its fashionably sculpted coat. Timmy raises his hand for a moment before dropping it to his side with a scowl.
To begin with, dogs are the only animal Timmy has ever encountered. His predictions are simple – “it’s always a dog” – and so far, he’s been 100% accurate. But when he encounters his first sheep, he has to find some way to distinguish them from dogs. So “if it’s in a field and it has fluffy white hair then it’s not a dog”. When he then encounters the poodle, it’s clear that he needs to include more information to make an accurate prediction, so he may add a clause to the logic that “if it also has floppy ears and a weird band of bald skin around its waist then it’s a dog”.
This set of complicated logic that Timmy is creating forms a mathematical model called a decision tree. Why tree? Well, each of these logical questions has 2 answers, or branches. “Is the creature fluffy, yes or no?”, and the answer to that question leads to another question. “Is it in a field, yes or no?”. At each new question, the tree forks into two new branches until it eventually reaches a leaf, at which point there is no need for any more questioning and the “dog or not dog” question is answered. There are many possible trees which would describe the same logic depending on how you order the questions.
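Written out as code, Timmy's tree is nothing more than nested yes/no questions. This is only a sketch of the rules from the story; the function and argument names are my own invention.

```python
def is_dog(fluffy_white, in_field, floppy_ears, bald_waist_band):
    """Timmy's hand-built rules as a tree of yes/no questions."""
    if fluffy_white and in_field:
        # Looks like a sheep, unless it has the poodle's tell-tale signs.
        if floppy_ears and bald_waist_band:
            return True    # leaf: dog (the poodle)
        return False       # leaf: not a dog (the sheep)
    return True            # leaf: Timmy's default, "it's always a dog"


print(is_dog(False, False, True, False))   # an ordinary dog in town
print(is_dog(True, True, False, False))    # a sheep in a field
print(is_dog(True, True, True, True))      # the farmer's poodle
# The shorn sheep ambling down the road slips through the rules:
print(is_dog(True, False, False, True))
```

The last call answers "dog" for the shorn sheep, which is exactly the sort of case the hand-written logic never anticipated.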
You could make a decision tree manually by carefully considering all the characteristics of dogs and non-dogs, but there are a number of reasons why this would be impractical.
- It would take ages.
- Order matters. You probably don’t know which questions are the most important to ask – which questions do the best job of differentiating between dogs and non-dogs. If you ask these first, then when you come to use the tree for prediction you will come to an answer sooner.
- You probably won’t consider cases where your logic will lead to incorrect predictions. For example, a peculiarly shorn sheep is not a poodle. This is called noise.
It would be far more efficient if a computer could learn how to do Timmy’s work for him. Maybe we could train a decision tree using machine learning. How does this work? Well, a decision tree is just a mathematical model, albeit a little more complicated than a straight line. Features like “fluffiness”, “in-a-fieldness”, and “floppy-earedness” are equivalent to the x’s in our straight line example. Labels for the type of animal – dog or not dog – are equivalent to the y’s. A machine learning algorithm can take examples of pairs of features and labels and learn an optimal decision tree which is efficient to use for prediction and is relatively robust to noise. And it doesn’t even take all that long.
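To make that a little more concrete, here is a toy sketch of how a tree might be learned from examples. It assumes yes/no features and uses Gini impurity to decide which question to ask first, which is one common choice among several. The animals, feature names and helper functions are all made up for illustration, and real libraries do this far more carefully.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 0.0 means perfectly pure."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_question(rows, labels, features):
    """Pick the feature whose yes/no split reduces impurity the most, if any does."""
    best, best_score = None, gini(labels)
    for f in features:
        yes = [l for r, l in zip(rows, labels) if r[f]]
        no = [l for r, l in zip(rows, labels) if not r[f]]
        if not yes or not no:
            continue  # this question doesn't separate anything
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
        if score < best_score:
            best, best_score = f, score
    return best


def build_tree(rows, labels, features):
    """Grow the tree recursively; a leaf is the majority label (1 = dog)."""
    question = best_question(rows, labels, features)
    if question is None:
        return int(sum(labels) * 2 >= len(labels))
    remaining = [f for f in features if f != question]
    yes = [(r, l) for r, l in zip(rows, labels) if r[question]]
    no = [(r, l) for r, l in zip(rows, labels) if not r[question]]
    return {
        "ask": question,
        "yes": build_tree([r for r, _ in yes], [l for _, l in yes], remaining),
        "no": build_tree([r for r, _ in no], [l for _, l in no], remaining),
    }


def predict(tree, row):
    """Follow the branches until a leaf is reached."""
    while isinstance(tree, dict):
        tree = tree["yes"] if row[tree["ask"]] else tree["no"]
    return tree


# Made-up examples: features are the x's, the label (1 = dog) is the y.
features = ["fluffy", "in_field", "floppy_ears"]
rows = [
    {"fluffy": 0, "in_field": 0, "floppy_ears": 1},  # dog
    {"fluffy": 0, "in_field": 0, "floppy_ears": 0},  # dog
    {"fluffy": 1, "in_field": 1, "floppy_ears": 0},  # sheep
    {"fluffy": 1, "in_field": 1, "floppy_ears": 0},  # sheep
    {"fluffy": 1, "in_field": 1, "floppy_ears": 1},  # poodle
    {"fluffy": 0, "in_field": 1, "floppy_ears": 1},  # dog
    {"fluffy": 1, "in_field": 0, "floppy_ears": 1},  # dog
]
labels = [1, 1, 0, 0, 1, 1, 1]

tree = build_tree(rows, labels, features)
print(tree)
```

On this tiny data set the algorithm decides that floppy ears is the most useful first question, which is not the order Timmy stumbled upon.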
Now, when Timmy has a little more experience with animals, it’s unlikely that he’s going to consult his ever growing decision tree to find out if the creature bounding down the road ahead of him is a dog. Most of the time, you see a dog and you just know. But this also applies to “fluffiness”, “in-a-fieldness”, and “floppy-earedness”: you just know. In the example above, the algorithm is kind of cheating. It’s skipping a fundamental step. Timmy not only has to learn to distinguish between dogs and non-dogs, but he also has to learn to distil the various creature features from his experience of the world, which at the end of the day is all just light entering his eyes.
In practice, all machine learning algorithms are just a series of computations. They require numbers. In the case of the decision tree above, you can easily convert features like “Does it have fluffy hair?” to numbers: 1 if the answer is yes and 0 if the answer is no. More commonly in practice these features are defined over a number range, in which case questions look more like “Is the value for feature x greater than some constant value a?”, for example “Is the animal’s body more than 20% bald?”
But how do you get from an image of a poodle in a field to an answer to the question “Does it have fluffy hair?” If you have a car with a camera on it which needs to detect a person in the road, you can’t have a little Timmy in the glove box enumerating all the features of all the things he sees. A computer only sees pixels. A computer sees an image as a sequence of numbers which represent pixel colour, nothing more. So, you need an additional algorithm to convert these numbers into an answer to the question “Does it have fluffy hair?”. You need another algorithm to answer the question “Does it have a weird band of bald skin around its waist?”. And you need an additional algorithm for each feature your model requires an input for. This is called feature engineering. In the straight line case, there is only one feature x, so no engineering is required. But not all problems are as straightforward. Timmy’s dog detection logic had to grow features like arms and legs and ears with only a few carefully crafted examples. So, are we going to design all these algorithms ourselves? And even if we do, maybe there are better features that we haven’t considered which would massively simplify the problem or improve the predictions. Like hooves. The answer was hooves, Timmy!
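As a rough illustration of what one of these hand-made feature algorithms might look like, here is a deliberately crude sketch. The 4x4 "image", the brightness cut-off and the idea that lots of bright pixels means "fluffy and white" are all assumptions made up for this example, and a real feature detector would need to be far cleverer.

```python
# A tiny 4x4 greyscale "image": each number is one pixel's brightness
# (0 = black, 255 = white). This is all the computer actually sees.
image = [
    [30, 200, 210, 40],
    [190, 220, 230, 205],
    [35, 215, 225, 45],
    [20, 60, 50, 25],
]


def white_fraction(pixels, bright=180):
    """Hand-engineered feature: the share of pixels brighter than `bright`."""
    flat = [p for row in pixels for p in row]
    return sum(p > bright for p in flat) / len(flat)


def looks_white_and_fluffy(pixels, a=0.4):
    """The tree's question: is the feature value greater than some constant a?"""
    return white_fraction(pixels) > a


print(white_fraction(image))
print(looks_white_and_fluffy(image))
```

Every other feature the tree asks about would need its own hand-written function like this one, which is why feature engineering gets tedious so quickly.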
So, what else could we try? Is there some way that we can teach a computer to bypass feature engineering altogether and learn directly from images like Timmy does?
Some clever person noticed that brains are pretty good at doing this classification thing and went digging. What do brains do at a cellular level, and can we just emulate that activity in a computer and be done with it? Well, brains are composed of cells called neurons. These are essentially switches. They have a ton of inputs and one output, and if enough of the inputs are on, the output switches on momentarily. Sound familiar? Maybe there’s a “dog or not dog” neuron which takes in a bunch of features about some observed creature and lights up like a slot machine jackpot when it logically combines them into a dog. This is obviously mainly nonsense, but there’s value in the analogy. So, the inputs to neurons are connected to outputs from other neurons in a huge spaghetti junction, all triggering one another to switch in what’s likely the most complex configuration of matter we’re aware of, and somehow resulting in Timmy’s decision logic. If this could be distilled into a mathematical model, then we could have machines which learn like humans!
And so, the artificial neural network was born. An artificial neural network is composed of perceptrons which emulate neurons. Rows of these perceptrons, called layers, are daisy-chained together into a network. The feature data enters at the first layer as input data, and each perceptron outputs a value based on some combination of its input data. These outputs become inputs to the next layer and so on until the last layer outputs a prediction. In terms of the initial discussion about machine learning, the parameters – the m and c in the straight line case – are the ways in which input data is combined to produce an output. These parameters, these formulae for combining input data, are what the network will learn.
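Here is a small sketch of data flowing through such a network. I'm assuming each perceptron takes a weighted sum of its inputs and squashes it to a value between 0 and 1 (a sigmoid), which is one common choice rather than the only one. The weights below are made-up numbers, not learned ones, so the prediction itself is meaningless.

```python
import math


def perceptron(inputs, weights, bias):
    """One unit: a weighted sum of its inputs, squashed to between 0 and 1."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))


def layer(inputs, weight_rows, biases):
    """A row of perceptrons that all look at the same inputs."""
    return [perceptron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def network(inputs, layers):
    """Feed the outputs of each layer in as the inputs to the next."""
    values = inputs
    for weight_rows, biases in layers:
        values = layer(values, weight_rows, biases)
    return values


# Three input features -> two hidden perceptrons -> one output perceptron.
hidden = ([[0.5, -0.4, 0.3], [-0.2, 0.1, 0.6]], [0.0, 0.1])
output = ([[1.2, -0.7]], [0.05])

features = [1.0, 1.0, 0.0]  # e.g. fluffy, in a field, no floppy ears
prediction = network(features, [hidden, output])[0]
print(prediction)  # a number between 0 and 1, read as "how dog-like?"
```

The weights and biases play the same role as m and c did for the straight line: they are the parameters that training has to find.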
The trick to training these networks to make accurate predictions is as follows:
- You start it off making any old random predictions using any old way of combining input data.
- You give it an example and see what it predicts.
- Based on how correct the prediction is, you tell each of the perceptrons how to adjust the way in which it combines input data to produce a better prediction. Small steps though, learning takes time! You guide them in the right direction. You don’t want them to completely change the way they work based on one example. It might be a bad example. It might be a peculiarly shorn sheep.
- Rinse and repeat a few thousand or million times.
- Hopefully by the end of this process you will have a neural network which takes images as inputs and predicts dog or not dog correctly…most of the time.
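The steps above can be sketched in a few lines. To keep it short this trains a single perceptron on hand-made yes/no features rather than a whole network on images, and it assumes a sigmoid output with a simple "nudge in proportion to the error" update. The data, step size and number of repetitions are all made up for illustration.

```python
import math
import random


def predict(weights, bias, features):
    """Combine the inputs into a value between 0 (not dog) and 1 (dog)."""
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-total))


def train(examples, epochs=5000, step=0.5, seed=0):
    rng = random.Random(seed)
    # Start with any old random way of combining the inputs.
    weights = [rng.uniform(-0.1, 0.1) for _ in examples[0][0]]
    bias = 0.0
    for _ in range(epochs):                       # rinse and repeat
        for features, label in examples:
            guess = predict(weights, bias, features)  # see what it predicts
            error = label - guess                 # how wrong, and in which direction?
            # Nudge each weight a small step towards a better prediction.
            weights = [w + step * error * x for w, x in zip(weights, features)]
            bias += step * error
    return weights, bias


# Features are (fluffy, in_field, floppy_ears); the label is 1 for dog.
examples = [
    ([0, 0, 1], 1),  # dog
    ([0, 0, 0], 1),  # dog
    ([1, 1, 0], 0),  # sheep
    ([1, 1, 1], 1),  # poodle
    ([0, 1, 1], 1),  # dog
    ([1, 0, 1], 1),  # dog
]

weights, bias = train(examples)
for features, label in examples:
    print(features, label, round(predict(weights, bias, features), 2))
```

A small step size is what stops one odd example from undoing everything learned so far. Training a full network works on the same principle, with the extra job of working out how much each perceptron in each layer contributed to the error.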
But what is the neural network actually learning? Of course, it’s learning how to detect dogs in images, but it isn’t just an amorphous number-churning blob. Each perceptron is learning how to solve a small piece of the puzzle such that their combined knowledge gives a clear picture. It’s as if each of the perceptrons is learning how to derive features from the image, as if neural networks are feature engineering factories, except that those features aren’t as intuitive as “Does it have fluffy hair?”, or “Is it in a field?”. They’re more abstract, more like edges or shapes or small configurations of pixels, and the deeper you delve into the network, the more detailed these features become. Each perceptron is combining the input data in weird and wonderful ways which we may never have conceived of or understood the relevance of and doing so in an optimal way to produce the best prediction.
So, we’ve looked at linear regression, decision trees, and artificial neural networks, and they all seem wildly different. There are countless more machine learning methods and even these three have many variants. So why should they all be accumulated under the mysterious “machine learning” banner? Well, if you take the input data, the features – which could just be a solitary x, or a range of ridiculous characteristics of animals, or just a set of pixel colours – convert them to numbers, and you plot each one on its own axis, each axis perpendicular to the others, add another axis for the label, and call this big multidimensional box a space, then what all of these methods have in common is that they’re just defining a curve in that space. A big, curious, twisting curve which divides the space in two, acting as a boundary between examples which are dogs and examples which are not dogs, or a very straight curve which is representative of a trend in the data. Machine learning is fitting fancy curves to data. Mostly…
Now, at this point – where readership has dwindled sufficiently – I have to reveal a subtle disingenuousness about the relationship between the way in which Timmy has learned and the way in which learning has been discussed above. These methods have assumed that all of Timmy’s knowledge occurs at once and that it’s representative of the world, but in reality, Timmy is learning incrementally as the world reveals itself to him. Timmy has some initial belief about the world based on his experience to date – he believes that all animals are dogs. But when he sees a new example his belief will change. Whether altered or reinforced, it will change. Timmy’s new belief will be relate