Getting started. Two of the English language’s most demotivating words.

The first step is often the hardest, and when there is too much freedom over which direction to take, it can be paralysing.

How do I begin?

This post will take a beginner from having no experience of machine learning in Python to being a proficient practitioner in seven simple steps, all while utilizing freely available materials and tools.

The primary goal of this outline is to help you sort through the many free options available; there are certainly plenty, but which are the best?

Which complement one another?

What is the best order in which to use the selected resources?

Going forward, I will assume that you are not an expert in any of the following:

Python
Machine learning
Any Python library for machine learning, scientific computing, or data analysis

It would probably be beneficial to have a working knowledge of one or both of the first two topics, but this is not required; some more time spent on the earlier steps should suffice.

#1 Basic Python Skills

If we wish to use Python to perform machine learning, a working knowledge of the language is critical.

Fortunately, because Python is so widely used both as a general-purpose programming language and in scientific computing and machine learning, finding beginner’s tutorials is not difficult.

Your level of proficiency with Python and programming, in general, plays a critical role in determining a starting point.

To begin, you must install Python.

Because we will eventually need scientific computing and machine learning packages, I recommend that you install Anaconda.

It is a robust Python distribution for Linux, OS X, and Windows that includes the packages required for machine learning, such as NumPy, scikit-learn, and matplotlib.

Additionally, it contains the IPython Notebook, which provides an interactive environment for a number of our tutorials.
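Once the installation has finished, a quick sanity check is to confirm that the packages named above can be imported. The snippet below is only a minimal sketch: the helper name `check_packages` is my own, and the list holds import names (scikit-learn is imported as `sklearn`).

```python
import importlib

# Import names of the packages mentioned above.
# Note that scikit-learn is imported under the name "sklearn".
PACKAGES = ["numpy", "sklearn", "matplotlib"]


def check_packages(names):
    """Return a dict mapping each import name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


for name, found in sorted(check_packages(PACKAGES).items()):
    print("{}: {}".format(name, "installed" if found else "missing"))
```

If any package is reported as missing, the installation is probably incomplete or a different Python interpreter is being picked up.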

I would propose Python 2.7, if only because it is still the most widely installed version.

If you are unfamiliar with programming, I recommend beginning with the following free online book and then moving on to the subsequent materials:

Learn Python the Hard Way, by Zed A. Shaw

If you have programming expertise but not specifically with Python, or if your Python is elementary, I recommend one or both of the following:

Google Developers Python Course (highly recommended for visual learners)

An Introduction to Python for Scientific Computing (from UCSB Engineering), by M. Scott Shell (an excellent 60-page introduction to scientific Python)

Additionally, for anyone interested in a 30-minute crash course in Python, here it is:

Learn X in Y Minutes (X = Python)

Naturally, if you are a seasoned Python coder, you may skip this step.

Even so, I recommend keeping a copy of the extremely readable Python documentation on hand.

#2 Foundational Machine Learning Skills

According to Nearlearn, there is much variety in what constitutes a “data scientist.”

This variety mirrors the field of machine learning itself, since much of what data scientists do involves machine learning algorithms to some degree.

Is it necessary to have a thorough understanding of kernel methods in order to build support vector machine models effectively and gain insight from them?

Of course not.

As with practically everything else in life, the depth of theoretical understanding required is relative to the practical application.

Acquiring a thorough grasp of machine learning algorithms is beyond the scope of this article and typically involves significant time investment in a more academic setting or, at the very least, through extensive self-study.

The good news is that you do not need a PhD-level understanding of machine learning theory in order to practise, just as not all programmers require a theoretical computer science education in order to be competent coders.

Although the Nearlearn course frequently receives wonderful reviews for its content, my suggestion is to browse the course notes compiled by a former student of a previous offering of the online course.

Skip the Octave-specific notes (Octave is a Matlab-like language unrelated to our Python pursuits).

Please note that these are not “official” notes, but they do appear to capture the pertinent content from Andrew Ng’s course materials.

Of course, if you have the time and interest, now is the time to enroll in the Nearlearn Machine Learning course.

If you prefer, there are other video lectures available in addition to Ng’s course mentioned above.

As a fan of Tom Mitchell, I’d like to share a link to some of his recent lecture videos (co-presented with Maria-Florina Balcan), which I find particularly approachable:

Machine Learning Lectures by Tom Mitchell

At this time, you do not require all of the notes and videos.

A sound approach is to move ahead to the specific exercises below and refer back to the relevant sections of the notes and videos above as needed.

For instance, when you come across an exercise below that requires you to implement a regression model, read the corresponding regression section of the course notes.

#3 Scientific Python Packages Overview

Alright.

We are proficient in Python programming and have a working knowledge of machine learning.

Beyond Python itself, a number of open-source packages are commonly used to facilitate practical machine learning.

In general, the following are the primary so-called scientific Python libraries that we use when performing elementary machine learning tasks (note that this list is obviously subjective):

NumPy – primarily useful for its N-dimensional array objects

pandas – a Python data analysis package that provides data structures such as data frames

matplotlib – a two-dimensional plotting library for producing publication-quality figures

scikit-learn – a collection of machine learning algorithms for data analysis and data mining tasks
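To make the first two entries a little more concrete, here is a minimal sketch of a NumPy array and a pandas data frame. The column names and numbers are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: an N-dimensional array (here 2-D) with vectorised arithmetic.
heights_cm = np.array([[150.0, 160.0],
                       [170.0, 180.0]])
heights_m = heights_cm / 100.0   # element-wise division, no explicit loop
print(heights_m.shape)

# pandas: a data frame is a table with labelled columns.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],   # made-up values
    "weight_kg": [50.0, 60.0, 70.0, 80.0],       # made-up values
})

# New columns are computed from existing ones, again without loops.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100.0) ** 2
print(df.describe())
```

matplotlib and scikit-learn build on the same array objects, which is why NumPy tends to be the natural place to start.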

A smart way to learn these is to go over the following material:

Scipy Lecture Notes, by Gaël Varoquaux, Emmanuelle Gouillart, and Olav Vahtras

This pandas tutorial is concise and effective:

10 Minutes to Pandas

You’ll notice some other packages in the tutorials below, including Seaborn, which is a data visualisation library built on matplotlib.

The aforementioned packages are (again, subjectively) at the heart of a wide variety of machine learning tasks in Python, and understanding them should enable you to adapt quickly to the additional and related packages referenced in the following tutorials.

Now for the important stuff…

#4 Getting Started with Machine Learning in Python

Python.

Fundamentals of machine learning.

NumPy.

Pandas.

Matplotlib.

The moment has arrived.

Let’s begin implementing machine learning algorithms with scikit-learn, Python’s de facto standard machine learning library.

Flowchart for scikit-learn.

Many of the following tutorials and exercises are driven by the IPython (Jupyter) Notebook, an interactive environment for executing Python.

These IPython notebooks can be viewed online or downloaded to your computer and interacted with locally.

IPython Notebook Overview from Stanford

Additionally, the tutorials below are sourced from a variety of web sources.

All notebooks have been attributed to their authors; if you discover that someone has not been properly credited for their work, please contact me and I will rectify the matter as soon as possible.

I’d like to express my gratitude to Jake VanderPlas, Randal Olson, Donne Martin, Kevin Markham, and Colin Raffel in particular for their fantastic, freely available resources.

Our initial tutorials will introduce us to scikit-learn.

I recommend completing all of these stages in order before proceeding to the subsequent steps.

An overview of scikit-learn, Python’s most popular general-purpose machine learning package, with a focus on the k-nearest neighbour algorithm:

An Introduction to scikit-learn, by Jake VanderPlas
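For a taste of what k-nearest neighbours looks like in scikit-learn, here is a minimal sketch using the iris dataset that ships with the library. It is not taken from the notebook above; `n_neighbors=5` and the measurements of the new flower are arbitrary choices of mine.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target   # 150 flowers, 4 measurements each, 3 species

# n_neighbors=5 is an arbitrary choice for illustration.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# Hypothetical new flower: sepal length, sepal width, petal length, petal width (cm).
new_flower = [[5.0, 3.5, 1.4, 0.2]]
predicted = knn.predict(new_flower)
print(iris.target_names[predicted[0]])   # setosa
```

The same fit/predict pattern applies to nearly every estimator in scikit-learn, which is a large part of the library's appeal.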

A more in-depth and expanded introduction, including a starter project with a well-known dataset from beginning to end:

Example Machine Learning Notebook, by Randal Olson

A discussion of ways for evaluating various models in scikit-learn, with a particular emphasis on train/test dataset splits:

Model Evaluation, by Kevin Markham
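The idea behind a train/test split can be sketched in a few lines: hold back part of the data, fit on the rest, and measure accuracy only on the held-back part. The following is an illustrative sketch rather than code from the notebook above; the 25% test size, the fixed `random_state`, and the choice of classifier are my own assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold back 25% of the samples; stratify keeps class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)   # the model never sees the test samples

# Accuracy on unseen data is a fairer estimate than accuracy on the training data.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print("test accuracy: {:.2f}".format(test_accuracy))
```

Evaluating on the training data instead would reward a model for simply memorising its inputs.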

#5 Machine Learning Topics with Python

After laying the groundwork using scikit-learn, we can go on to more in-depth examinations of different common and useful algorithms.

We begin with k-means clustering, one of the best-known machine learning algorithms.

It is a simple and often effective method for solving unsupervised learning problems:
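As a rough sketch of k-means in scikit-learn, the snippet below clusters synthetic two-dimensional points. The data is generated with `make_blobs`, and the number of clusters is assumed to be known in advance, which is rarely the case in practice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points scattered around 3 centres (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# n_clusters must be chosen by us; k-means does not discover it.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # no target labels are passed: this is unsupervised

print(kmeans.cluster_centers_)   # one (x, y) centre per cluster
```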

Following that, we return to classification and examine one of the most historically popular classification methods:

From classification, we move on to continuous numeric prediction:
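The text does not name a specific method here, so the sketch below assumes ordinary linear regression as one common choice for continuous prediction. The data is synthetic, generated from a known line (slope 3, intercept 2) plus noise, so the fitted values can be compared with the truth.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from y = 3x + 2 plus Gaussian noise (illustrative only).
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.5, size=100)

reg = LinearRegression()
reg.fit(X, y)

# The fitted slope and intercept should land close to 3 and 2.
print("slope: {:.2f}, intercept: {:.2f}".format(reg.coef_[0], reg.intercept_))

# The prediction is a continuous number, not a class label.
prediction = reg.predict([[4.0]])
print(prediction[0])
```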

Then, via logistic regression, we can apply regression to classification problems:
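A minimal sketch of logistic regression in scikit-learn follows. The one-feature dataset is synthetic, with labels assigned by a noisy threshold at 5 that I made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: one feature; the class is 1 when a noisy copy of it exceeds 5.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 5.0).astype(int)

clf = LogisticRegression()
clf.fit(X, y)

# The model outputs a probability for each class, which is then thresholded into a label.
queries = [[1.0], [9.0]]
probs = clf.predict_proba(queries)[:, 1]   # probability of class 1
labels = clf.predict(queries)
print(probs, labels)
```

The regression part is the fitted linear function of the input; the classification part comes from squashing it into a probability and picking the more likely class.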

#6 Advanced Machine Learning Topics with Python

After getting our feet wet with scikit-learn, we’ll move on to more sophisticated topics.

To begin, there are support vector machines, a not-necessarily-linear classifier that relies on complex transformations of the data into a higher-dimensional space.
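To illustrate why the transformation into a higher-dimensional space matters, the sketch below fits a linear and an RBF-kernel support vector machine to two concentric rings of points, which no straight line can separate. The dataset, `gamma=1.0`, and scoring on the training data are simplifying assumptions of mine.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by any straight line in 2-D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=1.0).fit(X, y)   # gamma chosen arbitrarily

# Accuracy on the training data, used here only to contrast the two kernels.
linear_acc = linear_svm.score(X, y)
rbf_acc = rbf_svm.score(X, y)
print("linear: {:.2f}, rbf: {:.2f}".format(linear_acc, rbf_acc))
```

The RBF kernel implicitly maps the points into a space where the rings can be pulled apart, which the linear kernel cannot do.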

Next, random forests, an ensemble classifier, are examined through a walk-through of the Kaggle Titanic competition:
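The Titanic data has to be downloaded from Kaggle, so the sketch below substitutes a synthetic dataset from `make_classification` simply to show the shape of the API. The number of trees and the other parameters are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real tabular dataset (not the Titanic data).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An ensemble of 100 decision trees, each trained on a bootstrap sample.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

forest_acc = forest.score(X_test, y_test)
importances = forest.feature_importances_   # one score per feature, summing to 1
print("test accuracy: {:.2f}".format(forest_acc))
print(importances)
```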

Dimensionality reduction is a technique for reducing the number of variables under consideration in a problem.

Principal Component Analysis is one particular form of unsupervised dimensionality reduction:
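As a minimal sketch, the snippet below uses PCA to project the four-dimensional iris measurements down to two components. Keeping two components is an arbitrary choice, convenient for plotting.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# The class labels are discarded: PCA is unsupervised and only looks at the features.
X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)   # keep 2 of the 4 dimensions
X_2d = pca.fit_transform(X)

print(X_2d.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)   # share of variance kept by each component
```

The explained variance ratio gives a rough sense of how much information survives the reduction.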

Before proceeding to the next phase, we might take a minute to reflect on how far we have come in such a short period of time.

We covered some of the most popular and well-known machine learning algorithms (k-nearest neighbours, k-means clustering, support vector machines) using Python and its machine learning libraries, as well as a powerful ensemble technique (random forests) and some additional machine learning support tasks (dimensionality reduction, model validation techniques).

Along with some fundamental machine learning skills, we’ve begun building ourselves a useful toolkit.

Before we conclude, we will add one more in-demand tool to that kit.

#7 Deep Learning in Python

The learning is deep. Deep learning is everywhere!

While deep learning builds on decades of neural network research, advances in the last few years have significantly increased both the perceived power of deep neural networks and general interest in them.

If you’re unfamiliar with deep learning, Nearlearn features multiple articles outlining the technology’s numerous recent innovations, accomplishments, and accolades.

This final stage is not intended to be a deep learning clinic; rather, we will examine a few simple network implementations in two of the most popular Python deep learning libraries today.

I propose starting with the following free online book for people interested in delving deeper into deep learning:

Theano

Theano is the first Python deep learning library we will look at. From the authors:

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.

The following introduction to deep learning in Theano is lengthy, but it is quite good, very descriptive, and heavily commented:

Caffe

Caffe is the other library that we will evaluate.

Once more, from the authors:

Caffe is a framework for deep learning that prioritises expression, speed, and modularity.

It is a collaborative effort between the Berkeley Vision and Learning Center (BVLC) and community members.

This tutorial is the icing on the cake.

While we have included a few fascinating examples above, none likely compare to the following, which is a Caffe implementation of Google’s #DeepDream.

Enjoy this one! Once you’ve understood the tutorial, experiment with it to get your processors dreaming on their own.

I did not promise it would be quick or easy, but if you put in the time and follow the above seven steps, there is no reason why you cannot claim reasonable proficiency and understanding of a variety of machine learning algorithms and their implementation in Python using its popular libraries, including some at the cutting edge of current deep learning research.

Matthew Mayo is a PhD student in computer science who is now working on a thesis about parallelizing machine learning techniques. Additionally, he is a data mining student, a data enthusiast, and a would-be machine learning scientist.