What Is Machine Learning?
You have just taken a new job with a company that grows oranges. Your job is to inspect the oranges as they pass by you on a conveyor belt and to pick out the oranges that can’t be sold because of their appearance. These oranges will be sent to the juicer to make orange juice. When you start your new job, an experienced orange-picker shows you what you need to do. He tells you what kind of flaws or blemishes to look for that might prevent someone from buying the orange.
You’re able to pick out a few of the juice oranges at first, but many that you should have removed pass by you on the conveyor belt (fortunately, there’s a second orange-picker further down the line). As the day progresses, however, you get better and better at identifying and picking out the blemished oranges. By the end of the day, you’re picking about 95 percent of the juice oranges.
You learned which oranges to pick by example as the oranges passed by you on the conveyor belt. The more examples you saw, the better you performed. Machine learning is taking these same concepts and applying them to computers. It’s a process to train computers to also learn by example. The more examples we show a computer, the better it usually gets at predicting the right answer.
With a computer, however, we often have millions of examples that the computer can read from a data file, so it can learn very quickly and can produce results that are sometimes even more accurate than a human’s.
There are three major components of a machine learning project. The first is the data set that contains the examples we use to train the computer. The second is the machine learning algorithm that we tell the computer to use when learning from the data. The third is the result: a machine learning model that we apply to brand-new data as it becomes available.
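To make the three components concrete, here is a minimal sketch in Python that returns to the orange example. The blemish measurements, the threshold-search approach, and the names `train_threshold` and `is_juice_orange` are all assumptions made up for illustration; a real project would use far more data and a more capable algorithm.

```python
# Data set: hypothetical examples of (blemished fraction of the peel, label).
# A label of True means an experienced picker sent the orange to the juicer.
examples = [
    (0.02, False), (0.05, False), (0.08, False), (0.12, False),
    (0.18, True), (0.25, True), (0.30, True), (0.40, True),
]


def train_threshold(data):
    """Algorithm: try each observed value as a cutoff and keep the one
    that misclassifies the fewest training examples."""
    best_cutoff, best_errors = None, len(data) + 1
    for cutoff, _ in data:
        errors = sum((value >= cutoff) != label for value, label in data)
        if errors < best_errors:
            best_cutoff, best_errors = cutoff, errors
    return best_cutoff


# Model: the learned cutoff, wrapped in a function we can apply to new oranges.
cutoff = train_threshold(examples)


def is_juice_orange(blemished_fraction):
    return blemished_fraction >= cutoff
```

Once trained, the model can be applied to an orange it has never seen, for example `is_juice_orange(0.35)`.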
Big Data: The Lifeblood of Machine Learning
It’s difficult to even grasp the amount of data we are creating and storing every day. One widely cited estimate puts the figure at 2.5 quintillion bytes of data a day. We track everything, from the items you purchase at a grocery store to the license plates that pass through toll booths and the tweets, videos, and posts we put up on social media. By a similar estimate, ninety percent of the data we have today was created in the last two years.
Data usually tells a story. Hidden inside these databases are patterns in the data that give us the insight we can use to answer questions. The answers to those questions help us make important business decisions that improve a company’s results. Machine learning algorithms are very good at finding patterns in Big Data and can be used in virtually every industry and profession in some way. For example, machine learning is used to:
- Identify customers whose credit cards have been stolen and are being used to make fraudulent purchases.
- Trade stocks to maximize profit by buying low and selling high.
- Review x-rays to determine if a patient has a tumor.
- Monitor driving sensors to correct the steering of a car that drifts out of its lane.
- Review articles to determine the main topic and the sentiment of the author.
- Provide predefined responses to email messages that you can choose instead of creating your own.
- Identify repairs or maintenance that will prevent machines from breaking down.
- Forecast the amount of electricity needed by customers during the summer months to prevent brown-outs.
- Provide suggestions to you for what movies you might want to watch.
- Identify the customers who are most likely to purchase a new product you’re launching.
Types of Machine Learning
Machine learning always starts with a question you want to answer about the data. For example, “Is this a fraudulent credit card transaction?” or “Where are the faces in this photograph?” Once you know the question you’re trying to answer, you select one of two types of machine learning: “supervised” or “unsupervised”. With supervised learning, you have a set of data for which you already know the right answer to that question.
For example, you might have a database of credit card transactions and you know which are fraudulent and which are not. You use the examples in this data set to teach the computer to come up with the right answer. Generally speaking, the more examples you provide, the better the computer gets at predicting the right answer. Unsupervised learning is used when there is no known answer, and you want to identify groups or clusters in a data set, such as groups of customers with similar buying habits.
The objective of supervised learning is to create a model that predicts an outcome based on a set of input factors when the outcome might vary in unknown ways. The computer reads the labeled data set and learns by example.
For example, suppose you’re a real estate agent, and you’re trying to predict how much a new listing will sell for, so you know how to price it. You have a database of home sales with the sales price and various attributes about the houses such as the number of bedrooms and bathrooms, square footage, zip code, subdivision, age of the home, building material, etc. You might use supervised learning to create a model that predicts the sales price for a new listing based on these attributes.
Supervised learning uses one of two approaches to develop a model, depending on the kind of answer you want the model to produce.
The first approach is called classification. You use classification when there is a fixed set of answers the model can produce. For example:
- Is this a spam email? (yes or no)
- What color is the object in this picture? (red, green, blue, black, orange, purple, or yellow)
- Should I accept or reject this credit application? (accept or reject)
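As an illustration of classification, the sketch below uses a simple nearest-neighbor vote to answer the accept-or-reject question. The applications, the two input features, and the `classify` function are hypothetical; nearest neighbors is just one of many classification algorithms and is chosen here only because it fits in a few lines.

```python
from collections import Counter
import math

# Hypothetical labeled applications: (credit score scaled to 0-1,
# debt-to-income ratio) paired with the decision that was made.
training = [
    ((0.90, 0.10), "accept"), ((0.85, 0.20), "accept"),
    ((0.80, 0.15), "accept"), ((0.75, 0.25), "accept"),
    ((0.40, 0.60), "reject"), ((0.35, 0.55), "reject"),
    ((0.30, 0.70), "reject"), ((0.45, 0.65), "reject"),
]


def classify(features, data, k=3):
    """Return the most common label among the k nearest training examples."""
    nearest = sorted(data, key=lambda item: math.dist(features, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


decision = classify((0.82, 0.18), training)
```

Whatever application is passed in, the answer is always one of the fixed labels that appear in the training data, which is what makes this a classification problem.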
The second approach is called regression. You use regression to predict an answer that’s along a continuum. For example:
- What should the starting salary be for this job?
- How likely is it that a customer will cancel their subscription?
- How much will this item sell for at auction?
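As a rough sketch of regression, the code below fits a straight line to a handful of home sales and uses it to estimate a price for a new listing, as in the real estate example earlier. The sales figures and the names `fit_line` and `predict_price` are invented for illustration, and a realistic model would use many more attributes than square footage alone.

```python
# Hypothetical past sales: (square footage, sale price in dollars).
sales = [(1000, 200_000), (1500, 275_000), (2000, 350_000), (2500, 425_000)]


def fit_line(points):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sums of products and squares; the 1/n factors cancel in the ratio.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(sales)


def predict_price(square_feet):
    """Estimate a sale price anywhere along the continuum."""
    return slope * square_feet + intercept
```

Unlike the classifier, `predict_price` can return any value along the line, not just one of a fixed set of answers.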
Model Input and Output
Perhaps the most important element of a supervised learning model is the data you use to train the computer. Imagine you’re creating a model to predict how much federal income tax someone owes, as part of a new, fast way to file. To predict the amount of tax, your model learns from millions of tax returns from the previous year.
Suppose you create a model and it’s predicting the amount of tax owed with good accuracy. But then you realize that someone made a mistake when they compiled the data set. They forgot to include the “married filing separately” status group in the data. The impact of this is tremendous. The model the computer develops is based on the examples you provide. If you train the model without the “married filing separately” group and then you use the model to predict tax for those taxpayers, the chances are that it won’t predict the tax correctly. That’s why the data you use to train a model must be representative in depth and breadth of the population you will apply the model to. That’s also why more data is usually better.
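One simple safeguard is to check whether every group you expect the model to handle actually appears in the training data. The sketch below does this for filing status. The rows, the field name `filing_status`, and the `missing_groups` helper are hypothetical, and a check like this only catches groups that are absent entirely, not groups that are merely underrepresented.

```python
def missing_groups(training_rows, population_groups, key):
    """Return the groups in the population that never appear in the training data."""
    seen = {row[key] for row in training_rows}
    return sorted(set(population_groups) - seen)


# Hypothetical training rows and the statuses the model will face in use.
training_rows = [
    {"filing_status": "single", "tax": 4200},
    {"filing_status": "married filing jointly", "tax": 9100},
    {"filing_status": "head of household", "tax": 5300},
]
expected_statuses = [
    "single",
    "married filing jointly",
    "married filing separately",
    "head of household",
]

gaps = missing_groups(training_rows, expected_statuses, "filing_status")
```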
To measure how well a model handles data it has not seen before, most machine learning projects divide the data into a “training” set and a “test” set. For example, you might use 75 percent for the training set and 25 percent for the test set, or 80 percent for the training set and 20 percent for the test set. You use the training set to teach the computer and create the model. Then, you use the test set, which the model has never seen, to measure its accuracy. The way you separate your data should be completely random, so you don’t inadvertently include or exclude a group that shares common attributes such as filing status.
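A random split can be sketched in a few lines of Python. The `train_test_split` function below is a hypothetical helper written for this example, not a library call, and the fixed `seed` is an assumption added so the split can be reproduced.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows, then cut it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


rows = list(range(100))  # stand-in for 100 records
train, test = train_test_split(rows)
```

Passing `test_fraction=0.2` gives the 80/20 split instead. Because the rows are shuffled before the cut, no group is included or excluded because of where it happens to sit in the file.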
Unsupervised learning is for data sets where you want to discover hidden patterns or structures within the data that might not be obvious. The most common method of unsupervised learning is called clustering. Clustering identifies patterns by grouping data points that are close to one another based on the input factors. For example, given all the pixels in a photograph, clustering can group pixels of similar color into regions, which can be a first step toward locating objects such as faces. In market research, clustering is often used to group customers together based on natural patterns in the data and to identify insights that might not be obvious. For example, you might discover that customers who buy diapers are twice as likely to also buy beer, so if you put the beer next to the diapers, you might sell more.
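The sketch below shows one common clustering algorithm, k-means, applied to a tiny set of made-up customers. The data, the choice of two clusters, and the simplistic starting centroids (the first k points) are all assumptions for illustration; practical implementations pick starting points more carefully and usually rescale the inputs first.

```python
import math


def k_means(points, k, iterations=10):
    """Plain k-means: returns (centroids, cluster index for each point)."""
    centroids = list(points[:k])  # simplistic start: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


# Hypothetical customers: (store visits per month, average basket in dollars).
customers = [(1, 20), (2, 25), (1, 30), (9, 80), (10, 90), (8, 85)]
centroids, labels = k_means(customers, k=2)
```

No labels were supplied here. The algorithm separates the occasional shoppers from the frequent ones purely from how close the points are to one another.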
Machine Learning Algorithms
Once you have a data set you want to use to train the computer, you need to choose a machine learning algorithm. There are many algorithms to choose from, and each takes a slightly different approach to how the machine learns from the data. While some guidelines might direct you toward certain algorithms based on the kind of data you have and the outcomes you’re trying to predict, there are no definitive rules about which algorithms work best. Choosing an algorithm is typically based on experience and trial and error. Even very experienced data scientists often can’t tell in advance which algorithm will produce the best result. That’s why it’s common to create different models using different algorithms, and then to compare the results and select the one that performs best.
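The compare-and-select step can be sketched as follows. Two deliberately simple algorithms are trained on the same training set, scored on the same test set, and the one with the lower error is kept. The data, the two candidate algorithms, and the choice of mean absolute error as the score are assumptions for illustration only.

```python
def fit_mean(train):
    """Baseline: always predict the average training outcome."""
    average = sum(y for _, y in train) / len(train)
    return lambda x: average


def fit_linear(train):
    """Least-squares straight line through the training points."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
             / sum((x - mean_x) ** 2 for x, _ in train))
    return lambda x: mean_y + slope * (x - mean_x)


def mean_absolute_error(model, rows):
    """Average size of the prediction error over the given rows."""
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)


# Hypothetical (input, outcome) pairs, already split into train and test.
train = [(1, 12), (2, 19), (3, 32), (4, 41), (5, 48)]
test = [(6, 61), (7, 69)]

candidates = {"mean baseline": fit_mean, "straight line": fit_linear}
scores = {
    name: mean_absolute_error(fit(train), test)
    for name, fit in candidates.items()
}
best = min(scores, key=scores.get)  # name of the model with the lowest test error
```

Scoring every candidate on the same held-out test set keeps the comparison fair, since none of the models saw those rows during training.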
Can Machine Learning Help You?
Machine learning has applications in almost every industry and profession. It helps you develop a formula or equation to answer questions about large data sets. It’s also ideal for cases where the rules are much too complex to even calculate, such as recognizing faces and matching them to a database of mugshots. It also works well in cases where the rules are constantly changing, such as identifying fraudulent credit card charges. It even works well when the data itself is constantly changing, such as automated stock trading. Whenever you need to predict something based on a large number of variables, and you don’t know which variables matter and how important each variable is to determining the right outcome, that’s usually a great application for machine learning.
Machine Learning Resources
There are many free and paid resources to learn more about machine learning, depending on your objectives. Leading websites on artificial intelligence, analytics, big data, data mining, data science, and machine learning include KDNuggets.com, TowardsDataScience.com, and Education-Google AI.
For more information about AI and machine learning, view Ross Pamphilon’s website.
About Ross Pamphilon: Ross Pamphilon brings broad experience, having worked in a variety of roles across fixed income. Ross joined the newly established ECM Asset Management in 1999, which coincided with the launch of the euro currency and the development of the pan-European credit markets. Ross Pamphilon’s experience includes emerging markets, investment grade and high yield, portfolio management, and credit research.