In the first part of our Machine Learning series, we illustrated the broad concept of machine learning by taking a little bit of Twitter, mixing it with some machine learning goodness and creating a web application that, based upon the user’s personal tweets, recommends Twitter accounts the user should follow.
What sort of machine learning wizardry allows a program to give Twitter account recommendations, understand and respond to the questions people ask (Siri), protect people against purchase fraud (PayPal, banks, credit card companies), or even allow cars to drive themselves (Google)? The answer, my muggle friend, lies in the data used and the learning style employed.
Data, Data, Data
As a child of the eighties, I have fond memories of cartoons such as Transformers, G.I. Joe and He-Man. In particular, G.I. Joe had some public service announcements at the end of the cartoons that always ended with “Knowing is half the battle”.
With machine learning, “knowing” can be reworded to “data”, and is, at a minimum, half the battle. As a general rule of thumb, when it comes to machine learning, the more data you have, the better.
We don’t have better algorithms than anyone else; we just have more data. (Peter Norvig, Director of Research at Google Inc.)
Data is used in machine learning for two primary purposes:
- Feed algorithms so data models can be created and refined. Data used for this purpose is called a training set.
- Test and validate data models. Data used for this purpose is called…wait for it…a test set.
However, having a tremendous amount of data for training and testing machine learning models isn’t enough; the data should be reliable and trustworthy. The Harvard Business Review has a simple flowchart that can help you identify whether your data can be trusted.
Simple flowchart to determine if your data is trustworthy
Learning Styles

People learn in a variety of ways. Whether you prefer pictures and videos, lectures and music, or hands-on exercises, the way a person receives, retains, and responds to information varies. Understanding a person’s learning style is a key factor for educators and important for an individual’s knowledge-building abilities.

Much like people, machines learn using a variety of styles. The choice of learning style depends on the problem you’re looking to solve and the data you have available. Now, if you find yourself at the annual Machine Learning, BBQ & Bourbon Festival*, you may overhear many spirited discussions regarding the number and variety of learning styles used in machine learning. In the interest of time, we will look at the two most commonly used styles: supervised and unsupervised.
*This isn’t a real festival, but it should be.
Supervised Learning

In supervised learning, the data given to a machine includes input and the desired output. The machine then finds patterns based upon the given relationship between the inputs and outputs.
Think of it like using flashcards to teach someone to multiply by tens. Card after card, we are given a multiplication problem that includes the number 10. This is the input.
Early on, maybe we actually work out the problem using our fingers and toes then progress to calculating it in our head. Each time we give an answer, the card is turned over and the correct answer is revealed. This is the output.
Eventually, after enough flash cards, we find a pattern and learn to simply add a zero to the end of whatever number we are multiplying by 10. This pattern recognition, given the input and output, is what supervised machine learning does.
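The flashcard drill can be sketched in a few lines of code. This is a minimal, illustrative example (the function name and the training pairs are invented for this post): we hand the machine input/output pairs, it fits a least-squares line through the origin, and the slope it learns is the "add a zero" pattern.

```python
# A toy supervised learner: fit a least-squares slope (no intercept)
# to input/output pairs, just as the flashcards pair each problem
# with its answer. Data and names here are illustrative, not from
# any real library.

def fit_slope(pairs):
    """Least-squares slope through the origin: sum(x*y) / sum(x*x)."""
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    return num / den

# Training set: multiply-by-ten flashcards as (input, desired output).
training_set = [(3, 30), (7, 70), (12, 120), (25, 250)]
slope = fit_slope(training_set)  # the learned "pattern"

print(slope)       # 10.0
print(slope * 42)  # 420.0 -- the model generalizes to an unseen input
```

Given enough (input, output) pairs, the machine recovers the rule rather than memorizing the cards, which is exactly the pattern recognition described above.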
A few common uses of supervised learning include:
- Classification: a machine determines the category something belongs in. One example is Microsoft’s How-Old.net, a web app that uses your photo to determine your age.
- Anomaly detection: a machine finds items or events that are outside a standard pattern. An example of anomaly detection is credit card fraud detection.
- Regression analysis: a machine gives a prediction. An example would be stock price forecasting.
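To make anomaly detection a bit more concrete, here is a deliberately simple sketch in the spirit of fraud screening: flag any transaction whose amount sits far from the mean, measured in standard deviations (a z-score). Real fraud systems are vastly more sophisticated; the amounts and the threshold below are made up for illustration.

```python
# Toy anomaly detector: flag amounts more than `threshold` standard
# deviations from the mean. Purely illustrative -- not how banks
# actually detect fraud.
import statistics

def find_anomalies(amounts, threshold=2.0):
    mean = statistics.mean(amounts)
    stdev = statistics.pstdev(amounts)  # population standard deviation
    return [a for a in amounts if abs(a - mean) / stdev > threshold]

transactions = [20.0, 25.0, 22.0, 24.0, 21.0, 500.0]
print(find_anomalies(transactions))  # [500.0]
```

The everyday purchases cluster tightly, so the lone $500 charge stands out as the item "outside a standard pattern."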
Unsupervised Learning

In unsupervised learning, the data given to the machine includes inputs but no desired outputs, so the machine doesn’t know the relationships among the inputs. It has to determine how to organize and structure the data in order to find the patterns contained within.
Imagine I handed you a stack of 10,000 photos of alien species found in the galaxy. You’ve never seen any of them and don’t know anything about them other than what is in the picture. This is the input.
Now I want you to arrange them into groups. How would you do it?
Perhaps you would do it by the dominant color displayed in the photograph? Maybe you cluster them by a common characteristic such as the number of legs or arms. Ultimately the choice is yours, but the final grouping becomes the output.
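That photo-sorting exercise is essentially what a clustering algorithm does. Below is a bare-bones k-means sketch: each "photo" is reduced to two invented features (limb count and body size), and the algorithm groups the data with no labels given. The features, data, and fixed starting centroids are all assumptions made for this example.

```python
# Minimal k-means: repeatedly assign points to their nearest centroid,
# then move each centroid to the mean of its cluster. Illustrative only.

def kmeans(points, centroids, iterations=10):
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared distance).
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Move each centroid to the mean of its cluster.
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            for cl in clusters if cl
        ]
    return clusters

# "Photos" as (limb count, body size) -- invented alien features.
aliens = [(2, 1.0), (2, 1.2), (3, 1.1),   # small, few limbs
          (8, 6.0), (8, 6.5), (9, 6.2)]   # large, many limbs
groups = kmeans(aliens, centroids=[aliens[0], aliens[3]])
print(groups)
```

No one told the algorithm what an alien is; it discovered the two groups purely from the structure of the data, just as you would with the stack of photos.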
Unsupervised learning enables a machine to make correlations in data that a human may not have considered. This is why one of the key uses of unsupervised learning is clustering.
One example is how researchers were able to use machine learning to determine the risk a person has of having a heart attack, based upon their tweets.
Another great example is how Google used unsupervised learning to enable a large-scale neural network to identify pictures of cats from YouTube video thumbnails, without being told anything about cats. Their training dataset included thumbnails from 10 million YouTube videos, and the resulting paper is a fascinating read.
Next Time: Putting Machine Learning To Work
In part 3 of my Machine Learning series, I will provide a number of ways that you can start taking advantage of machine learning.