Throw Away Your Code And Gather More Data
The Big Hype in Machine Learning Is Not Going to Stop
Why is that? Well, maybe (and this is only our humble opinion) because we realise we can now solve difficult problems with machine learning (ML) that we haven’t been able to solve in the past. This capability opens a new era in data science and technology. It feels like a brand new Age of Enlightenment: there are new discoveries every month, and people are really excited about it!
Throw Away Your Code And Gather More Data!
During the last 40 years, engineers and scientists have described and solved difficult problems with complex, hand-written algorithms. This approach takes time, effort, and state-of-the-art techniques, and progress is slow. A common example is image classification. Say you have to write an application to tell apples from oranges. You could:
- extract color information, shape information, texture information, and so on;
- combine the features and compute clever statistics; and
- guess the label from the above feature statistics.
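The steps above can be sketched in a few lines. This is a deliberately naive illustration: the feature extractor and the colour thresholds are made up for the example, not a real fruit classifier.

```python
import numpy as np

def extract_features(image):
    """Hand-crafted features: the mean intensity of each colour channel.
    A real system would also compute shape and texture descriptors."""
    return image[..., 0].mean(), image[..., 1].mean(), image[..., 2].mean()

def classify(image):
    """Hand-written rule, tuned by a human: oranges are red *and* green
    (i.e. orange), apples in our imaginary data set are mostly red."""
    r, g, b = extract_features(image)
    if r > 150 and g > 100 and b < 100:
        return "orange"
    return "apple"

# Two synthetic 8x8 RGB patches standing in for photos.
apple = np.zeros((8, 8, 3)); apple[..., 0] = 200
orange = np.zeros((8, 8, 3)); orange[..., 0] = 220; orange[..., 1] = 140
print(classify(apple), classify(orange))  # apple orange
```

The brittleness is visible immediately: add lemons and the hand-tuned thresholds in `classify` have to be rewritten.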
Excellent! You can now tell apples from oranges, and the error rate is below 5%. But what if you add lemons to the equation? Your code, and the rules written to discriminate between labels, will have to change.
With machine learning, the focus is shifting to the data. The data decides which features are important and which ones are not. Think about it in two steps:
- Training. This step requires gathering enough data and formatting it properly. The more data, the more accurate the ML model will be. When training is supervised, we need to assign each example an explicit label: this is a lemon image, so use the ‘lemon’ label; this is an orange image, so use the ‘orange’ label; and so on. The output of the training step is commonly called a model. Through training, we teach the model the correct answer to our problem.
- Inference. Once a model is available, let’s test it. Will this model infer the correct answer when it sees new lemons? This process is called inference.
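The two steps can be sketched with scikit-learn, one popular ML library among many. The iris data set stands in for our labelled fruit images, and logistic regression is an arbitrary choice of model:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # labelled data: features plus the correct answers
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1 - training: fit a model to the labelled examples.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 2 - inference: ask the trained model about samples it has never seen.
predictions = model.predict(X_test)
print(f"accuracy on unseen data: {model.score(X_test, y_test):.2f}")
```

Notice that no feature-specific rules are written by hand; swapping in a new class means gathering more labelled data and retraining, not rewriting code.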
This second step, inference, is fast and cheap. Once the model is loaded into memory, matrix multiplications and floating-point calculations do the math. Inference is now generally available and already runs inside embedded devices: Apple has released Core ML, an API that runs on iOS and lets developers run ML inference directly on the device. Other embedded vendors, ARM included, have yet to release similar API support for their SoCs unless it is available from a compatible library.
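To see why inference is cheap, here is a toy two-layer network in plain NumPy. The weights are random stand-ins (a real model would load trained weights from disk); the point is that a forward pass is nothing more than matrix multiplications and an element-wise non-linearity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up weights for a network mapping 4 input features to 3 classes.
W1, b1 = rng.standard_normal((4, 8)), np.zeros(8)
W2, b2 = rng.standard_normal((8, 3)), np.zeros(3)

def infer(x):
    h = np.maximum(x @ W1 + b1, 0.0)  # hidden layer: matmul + ReLU
    logits = h @ W2 + b2              # output layer: another matmul
    return int(logits.argmax())       # index of the predicted class

x = rng.standard_normal(4)            # one input sample
print(infer(x))
```

A handful of matrix products like these is well within reach of a phone or microcontroller-class SoC, which is why on-device inference is practical.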
The hard step is training, as it requires time and strong data science knowledge. Training is data and computation intensive: you’ll want specialised, dedicated hardware! Finding the best model to fit a given data set is therefore time consuming. Without the right hardware (or cloud ML compute power), the most difficult problems will need months of computation, and the results may not be the ones you expect.
Leaving The Past Behind
If you don’t have it yet, you don’t know what you’re missing. On image classification, entries to the ImageNet challenge have pushed accuracy close to, and even past, human-level performance. Even better, Google has released Inception, trained on a giant image dataset with its internal resources, and the model is available (Apache License 2.0) on GitHub.
The same applies to digit classification and the popular MNIST data set: inference accuracy is beyond any other technique used in the past.
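You can reproduce a small-scale version of this result in a minute. The sketch below uses scikit-learn’s built-in digits set (8×8 images, a lightweight stand-in for the full MNIST data) and a support vector machine; the `gamma` value is simply a reasonable choice for this data, not a tuned result:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a support vector classifier on the labelled digit images.
clf = SVC(gamma=0.001).fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.3f}")
```

Even this small model, trained in seconds on a laptop, classifies unseen digits far more accurately than hand-written rules ever did.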
Everywhere you look, you will find difficult problems that computer science hasn’t been able to solve in the last decades. It is time to take a fresh look at them. These are exciting times to dig deeper and create new features and new solutions, with an edge on the competition!