Collection

Getting started with machine learning

With the world’s biggest collection of open source data, GitHub’s Data Science Team has just started exploring how we can use machine learning to make the developer experience better. I see machine learning shaping experiences around me every day, and I’m excited about what’s to come in applying it to create more useful, predictive technologies.

In this collection, I'll share the basics of machine learning, along with some related resources and projects for people who are getting started with it.

What is machine learning?

Machine learning is the study of algorithms that use data to learn, generalize, and predict. What makes machine learning exciting is that with more data, an algorithm’s predictions improve. For example, I remember when my family started using voice to search instead of typing. At first, it took a while for the machine to recognize our words, but within a week of working with it, the algorithm’s speech recognition was good enough that voice is now my family’s primary mode of search.

At its core, machine learning isn’t a new concept. The term was coined in 1959 by Arthur Samuel, a computer scientist at IBM, and it’s been widely used in software since the 1980s.


Dating myself here, I remember building neural networks in the early 2000s as part of my academic training. While it was informative to learn and build these algorithms, they lacked a real commercial application. What was missing was access to vast amounts of data. As people move from the physical to the digital realm, they leave digital footprints that we can learn from. With about three billion people around the world online, these footprints add up to a staggering amount of data.

These data stores are what we refer to when we use the phrase “big data”. With the emergence of big data, machine learning algorithms were finally able to transition from academia into industry, powering products that deliver a lot of value to consumers. However, collecting and gaining access to that data is only part of the puzzle of building machine learning data products like search engines and recommender systems. Until recently, software programmers, data scientists, and statisticians lacked the tools to harness, clean, and package these massive datasets so that they could be used by other applications.

Now, with tools like Amazon Web Services and Hadoop, we have better, more cost-effective ways to manage information. Access to these tools opens a new realm of possibilities for gaining value from big data sets.

Amazon Web Services

aws

94 repositories 39 people

apache / hadoop

Mirror of Apache Hadoop

5167 3841 Java

In recent years, machine learning has expanded to include new applications and endeavors of all kinds. We’ve trained algorithms to do everything from pattern recognition to mastering games to “dreaming”.

jbhuang0604 / awesome-computer-vision

A curated list of awesome computer vision resources

4524 1331

Even with all of the exciting developments in machine learning today, we’re only at the beginning of what’s possible.

How does machine learning work?

To understand what goes into machine learning, it’s helpful to break down the process into three components: inputs, algorithms, and outputs.

Inputs: the data that powers machine learning

Inputs are the data sets you need for training an algorithm. From source code to statistics, data sets can contain just about anything:

GSA / data

Assorted data from the General Services Administration.

681 111 HTML

GoogleTrends / data

An index of all open-source data

1456 113 JavaScript

nationalparkservice / data

An unofficial repository of National Park Service data.

424 38 JavaScript

fivethirtyeight / data

Data and code behind the stories and interactives at FiveThirtyEight

7126 2896 Jupyter Notebook

beamandrew / medical-data

1722 342

src-d / awesome-machine-learning-on-source-code

Interesting links & research papers related to Machine Learning applied to source code

885 75

Because we need these inputs to train machine learning algorithms, finding and producing high-quality data sets is one of the biggest challenges in machine learning today.

Algorithms: how data is processed and analyzed

Algorithms are what turn data into insights.

A machine learning algorithm uses data to perform a specific task. The most common types of algorithms are:

Supervised learning uses training data that has already been labeled and structured. By specifying a set of inputs and desired outputs, a machine learns how to successfully recognize and map one to the other.

For example, in decision tree learning, values are predicted by applying a set of decision rules to the input data:

igrigorik / decisiontree

ID3-based implementation of the ML Decision Tree algorithm

661 66 Ruby
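
To make the idea of decision rules concrete, here is a minimal Python sketch of the simplest possible tree: a single rule (a "decision stump") learned from labeled examples. The data, the daylight-hours feature, and the names `fit_stump` and `predict` are hypothetical and chosen only for illustration. The threshold search below is a simplification and is not the ID3 algorithm used by the repository above.

```python
from typing import List


def fit_stump(xs: List[float], ys: List[int]) -> float:
    """Find the threshold t for the rule "x > t means label 1"
    that misclassifies the fewest training examples."""
    candidates = sorted(set(xs))
    best_t, best_errors = candidates[0], len(xs) + 1
    # Try the midpoint between each pair of neighboring values.
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        errors = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


def predict(threshold: float, x: float) -> int:
    """Apply the learned decision rule to a new input."""
    return int(x > threshold)


# Hypothetical labeled data: hours of daylight -> 1 for "summer", 0 otherwise.
hours = [8.0, 9.0, 10.0, 14.0, 15.0, 16.0]
labels = [0, 0, 0, 1, 1, 1]

threshold = fit_stump(hours, labels)
print(threshold, predict(threshold, 13.0), predict(threshold, 9.0))
```

A full decision tree repeats this kind of split recursively, on several features, until each branch is mostly one label.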

Unsupervised learning is the process of using unstructured, unlabeled data to discover patterns and structure. Whereas supervised learning might use a spreadsheet as its data input, unsupervised learning might be used to make sense of a book or blog.

For example, unsupervised learning is a popular approach in natural language processing (NLP):

keon / awesome-nlp

📖 A curated list of resources dedicated to Natural Language Processing (NLP)

3475 630
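
As a rough illustration of discovering structure without labels, the sketch below groups numbers into clusters with k-means, one common unsupervised technique (chosen here as an assumption; the text above does not name a specific algorithm). The function name `kmeans_1d` and the sample values are hypothetical, and the starting centers are picked deterministically so that the result is repeatable.

```python
from typing import List


def kmeans_1d(points: List[float], k: int = 2, iterations: int = 10) -> List[float]:
    """Cluster one-dimensional points into k groups (k must be at least 2)
    and return the center of each group."""
    # Deterministic start: spread the initial centers evenly from min to max.
    lo, hi = min(points), max(points)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        groups: List[List[float]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


# Hypothetical unlabeled data, e.g. sentence lengths in two styles of writing.
lengths = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers = kmeans_1d(lengths)
print(centers)
```

No one told the algorithm which points belong together; the two groups emerge from the data alone.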

Reinforcement learning gives the algorithm a goal to achieve. As the algorithm takes actions toward that goal, it learns the correct approach through rewards and punishments.

For example, reinforcement learning might be used to develop self-driving cars or teach a robot how to manufacture an item.

openai / gym

A toolkit for developing and comparing reinforcement learning algorithms.

9529 2163 Python

aikorea / awesome-rl

Reinforcement learning resources curated

2751 706
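
The reward-and-punishment loop can be sketched with tabular Q-learning, one well-known reinforcement learning method (an assumption for illustration; it does not use the gym toolkit listed above). The five-cell corridor, the reward values, and the names `step` and `train` are hypothetical. To keep the sketch deterministic, it tries every action in every state in turn instead of exploring at random, as a real agent typically would.

```python
GOAL = 4            # the corridor has cells 0..4; cell 4 is the goal
ACTIONS = (-1, 1)   # move left or move right


def step(state: int, action: int):
    """Environment: move along the corridor and return (next_state, reward)."""
    nxt = min(max(state + action, 0), GOAL)
    reward = 1.0 if nxt == GOAL else -0.1  # reward the goal, punish wasted steps
    return nxt, reward


def train(sweeps: int = 200, alpha: float = 0.5, gamma: float = 0.9):
    """Learn a value for every (state, action) pair with the Q-learning update."""
    q = {(s, a): 0.0 for s in range(GOAL) for a in ACTIONS}
    for _ in range(sweeps):
        for s in range(GOAL):
            for a in ACTIONS:
                nxt, reward = step(s, a)
                # The goal is terminal, so no future reward follows it.
                future = 0.0 if nxt == GOAL else max(q[(nxt, b)] for b in ACTIONS)
                q[(s, a)] += alpha * (reward + gamma * future - q[(s, a)])
    return q


q = train()
# The learned policy picks the highest-valued action in each state.
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]
print(policy)
```

After training, the policy moves right from every cell, which is the shortest path to the goal.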

Here are a few examples of algorithms in practice:

umutisik / Eigentechno

Principal Component Analysis on music loops

298 17 Jupyter Notebook

jpmckinney / tf-idf-similarity

Ruby gem to calculate the similarity between texts using tf*idf

303 31 Ruby

scikit-learn-contrib / lightning

Large-scale linear classification, regression and ranking in Python

699 105 Python

gwding / draw_convnet

669 172 Python

Some of the libraries and tools you’ll find to perform these analyses include:

scikit-learn / scikit-learn

scikit-learn: machine learning in Python

24711 12731 Python

tensorflow / tensorflow

Computation using data flow graphs for scalable machine learning

85791 41860 C++

Theano / Theano

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic …

7614 2375 Python

shogun-toolbox / shogun

Shōgun

1828 810 C++

davisking / dlib

A toolkit for making real world machine learning and data analysis applications in C++

3789 1194 C++

apache / predictionio

PredictionIO, a machine learning server for developers and ML engineers. Built on Apache Spark, HBase and Spray.

10897 1772 Scala

What is deep learning?

Deep learning is a subset of machine learning that uses neural networks to find connections between data. Deep learning may use supervised, unsupervised, or reinforcement learning to achieve its goal.

In this great visualization, you can actually play with neural networks right in your browser. Go ahead and give it a try.

While neural networks have existed for decades, training deep networks has only become practical since the mid-2000s thanks to graphics processing unit (GPU) innovation. GPUs were originally developed to render pixels in 3D game environments, but they’ve since found a new purpose in training neural network algorithms.
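
To give a feel for what a neural network computes, here is a minimal sketch of a forward pass through a network with one hidden layer. The weights are hand-picked for illustration so that the network computes XOR; in practice they would be learned from data, and real networks use many more units and smoother activation functions. All names here (`step`, `layer`, `xor_net`) are hypothetical.

```python
from typing import List


def step(z: float) -> int:
    """Threshold activation: the unit fires (1) when its input is positive."""
    return 1 if z > 0 else 0


def layer(inputs: List[float], weights: List[List[float]], biases: List[float]) -> List[int]:
    """One layer: each unit takes a weighted sum of the inputs plus a bias."""
    return [
        step(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hand-picked weights (an assumption, not learned):
# hidden unit 0 acts like OR, hidden unit 1 acts like AND.
HIDDEN_W = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_B = [-0.5, -1.5]
# The output unit fires when OR is on and AND is off, which is XOR.
OUT_W = [[1.0, -1.0]]
OUT_B = [-0.5]


def xor_net(x1: int, x2: int) -> int:
    hidden = layer([x1, x2], HIDDEN_W, HIDDEN_B)
    return layer(hidden, OUT_W, OUT_B)[0]


results = [xor_net(a, b) for a in (0, 1) for b in (0, 1)]
print(results)
```

A single unit cannot compute XOR on its own; stacking a hidden layer between input and output is what makes it possible, and that stacking is where the "deep" in deep learning comes from.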

Outputs

Outputs are the final results of your hard work. They might be a pattern that recognizes when a sign is red, a sentiment analysis that classifies the tone of a webpage as positive or negative, or a predictive score with a confidence interval.

In machine learning, outputs can be just about anything. A few approaches to finding outputs include:

Classification: assign each item in a data set to one of a fixed set of categories
Regression: given the data, predict the most likely value of the variable under consideration
Clustering: group similar items in the data together
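
As one small example of these output types, the sketch below performs regression by fitting a straight line with ordinary least squares and then predicting a value for an unseen input. The data points and the name `fit_line` are hypothetical, and a straight line is only one of many possible regression models.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for one variable: return (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data that follows y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predicted value for the unseen input x = 5
print(slope, intercept, prediction)
```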

Here are a few real-life examples of what people do with machine learning:

DeepMind used reinforcement learning to play StarCraft II: