1 Introduction
1.1 Motivation
- Scientists today face an unprecedented amount of large, complex data sets (e.g. high-throughput sequencing, GIS (Geographical Information System) data, biomedical imaging and social media data).
- These data sets are very challenging to analyse due to nonlinear dependencies, mixed data sources and high dimensionality.
- They often fail to conform to the assumptions of classical statistical methods.
- Hand-in-hand with the rise of computational power, machine learning (ML) has matured into a field of its own, specifically aimed at extracting knowledge from these challenging data sets.
1.2 What is machine learning?
A machine (an algorithm/model) improves its performance (predictive accuracy) in achieving a task (e.g. classifying the content of an image) from experience (data).
The automatic discovery of patterns and regularities in data.
1.3 What problems can machine learning solve?
- Object recognition
- Biomarker discovery in genomics
- Navigation of autonomous vehicles
- Fraud detection
- … and much more!
1.4 Types of machine learning methods
Unsupervised learning
Unsupervised learning methods uncover structure in unlabelled data. Structure here means patterns in the data that are sufficiently different from pure unstructured noise. Structure can be discovered by:
- Determining the distribution of the data using density estimation techniques
- Visualising the data after dimensionality reduction (e.g. Principal Component Analysis (PCA))
- Identifying groups of observations sharing similar attributes using clustering methods (see the sketch after this list)
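To make the last two ideas concrete, here is a minimal sketch in Python; scikit-learn and the synthetic data are assumptions of this example, not part of the workshop material:

```python
# A minimal sketch of unsupervised learning: dimensionality reduction
# followed by clustering. scikit-learn and the synthetic data are
# assumptions of this example, not part of the workshop material.
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Unlabelled data: 300 observations, 10 features, 3 latent groups
X, _ = make_blobs(n_samples=300, n_features=10, centers=3, random_state=42)

# Dimensionality reduction: project onto the first two principal components
X_2d = PCA(n_components=2).fit_transform(X)

# Clustering: identify groups of observations sharing similar attributes
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_2d)

print(labels[:10])  # cluster assignment of the first ten observations
```

Note that no labels are supplied at any point: the groups emerge from the data alone.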
Supervised learning
Akin to traditional statistical models (e.g. generalised linear models), supervised learning methods discover the relationship between an outcome and a set of explanatory variables. Using training data, the model learns the mapping (predictive model) between a set of features and a:
- Continuous outcome: regression
- Categorical variable: classification (see the sketch after this list)
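The same train-then-predict pattern covers both cases, as in this minimal sketch (again assuming Python and scikit-learn, with synthetic data):

```python
# A minimal sketch of supervised learning: the same train-then-predict
# pattern covers regression (continuous outcome) and classification
# (categorical outcome). scikit-learn is an assumption of this example.
from sklearn.datasets import make_regression, make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: learn a mapping from features to a continuous outcome
X_r, y_r = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
reg = LinearRegression().fit(X_r, y_r)
print(reg.predict(X_r[:3]))   # predicted continuous values

# Classification: learn a mapping from features to a categorical outcome
X_c, y_c = make_classification(n_samples=200, n_features=5, random_state=1)
clf = LogisticRegression().fit(X_c, y_c)
print(clf.predict(X_c[:3]))   # predicted class labels
```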
Semi-supervised learning
These methods are similar to supervised learning, but they also make use of unlabelled data to improve the model's predictive performance.
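A minimal sketch, assuming scikit-learn's semi-supervised module, where unlabelled observations are by convention marked with -1:

```python
# A minimal sketch of semi-supervised learning, assuming scikit-learn.
# By sklearn's convention, unlabelled observations carry the label -1;
# LabelPropagation spreads the known labels through the unlabelled data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation

X, y = make_classification(n_samples=200, random_state=0)

y_partial = y.copy()
rng = np.random.default_rng(0)
hidden = rng.choice(len(y), size=150, replace=False)
y_partial[hidden] = -1                        # hide 75% of the labels

model = LabelPropagation().fit(X, y_partial)  # uses labelled AND unlabelled data
print((model.predict(X) == y).mean())         # accuracy against the true labels
```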
Reinforcement learning
These methods mimic the way humans learn new games or skills (e.g. riding a unicycle). The machine/model explores different actions that maximise a reward (e.g. the score achieved in a game or the time spent upright on a unicycle) through a process of trial and error. There is an inherent trade-off between exploration (trying out new actions) and exploitation (using actions that already give reasonable results). Here's a nice example.
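As a toy illustration of this trade-off (plain Python; the reward probabilities are made up for illustration), an epsilon-greedy agent on a three-armed bandit explores with a small probability and otherwise exploits its current best estimate:

```python
# A toy illustration of the exploration/exploitation trade-off: an
# epsilon-greedy agent on a 3-armed bandit. The reward probabilities
# are made up for illustration; this is not a full RL algorithm.
import random

random.seed(0)
reward_prob = [0.2, 0.5, 0.8]   # unknown to the agent
counts = [0, 0, 0]              # how often each action has been tried
values = [0.0, 0.0, 0.0]        # running estimates of each action's reward

epsilon = 0.1                   # fraction of the time spent exploring
for _ in range(1000):
    if random.random() < epsilon:
        action = random.randrange(3)          # exploration: try a random action
    else:
        action = values.index(max(values))    # exploitation: use the best estimate
    reward = 1.0 if random.random() < reward_prob[action] else 0.0
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]  # incremental mean

print(values)  # the estimates should approach [0.2, 0.5, 0.8]
```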
In this introductory workshop we will only focus on unsupervised and supervised learning methods.
1.5 Statistics and Machine Learning
There is substantial overlap between the fields of statistics and machine learning. Some high-profile academics, such as Robert Tibshirani, argue that ML is merely "glorified statistics". He even provides a handy glossary.
We will not engage in a philosophical debate here; rather, we focus on a pragmatic comparison between these two schools of thought, which evolved from different research areas and tackled different problems.
| | Statistics | Machine learning |
|---|---|---|
| Philosophy | provide humans with a set of data analysis tools | replace humans in the processing of data |
| Focus | what is the relationship between the data and the outcome? | how can we predict the outcome using the data? |
| Inference | how was the observed data generated? what do the model parameters mean in practice? | typically only care about predictions, not what the model parameters mean |
| Learning | use all of the observed data to perform inference at the population level | use training data, then use testing data to perform individual-level predictions |
| Validation | measures of fit (\(R^2\), chi-square test, etc.) and suitability of inferred parameters | predictive performance measures (root mean squared error (RMSE), area under the ROC curve (AUC), etc.) computed on "unseen" data (generalisation) |
| Model selection | adjusted measures of fit (adjusted \(R^2\), \(C_p\) statistic, Akaike information criterion, etc.) | cross-validation and out-of-bag errors |
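To make the Validation and Model selection rows concrete, here is a small sketch (assuming Python and scikit-learn, with synthetic data) contrasting an in-sample measure of fit with a cross-validated estimate of predictive performance:

```python
# A sketch contrasting the two validation cultures, assuming scikit-learn:
# an in-sample measure of fit versus predictive performance (RMSE)
# estimated by 5-fold cross-validation on held-out folds.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=10, noise=15.0, random_state=2)
model = LinearRegression()

in_sample_r2 = model.fit(X, y).score(X, y)    # statistics-style measure of fit
neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
cv_rmse = np.sqrt(-neg_mse).mean()            # ML-style generalisation estimate

print(f"in-sample R^2: {in_sample_r2:.3f}, cross-validated RMSE: {cv_rmse:.1f}")
```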
The line between ML and statistics is blurry at best. Personally, I do not find engaging in heated debates between the two fields to be healthy. Both fields complement each other, and as the late Leo Breiman put it:

> The best solution could be an algorithmic model (machine learning), or maybe a data model, or maybe a combination. But the trick to being a scientist is to be open to using a wide variety of tools.
>
> — Leo Breiman
1.6 Terminology
The jargon used in ML can be daunting at first. The table below summarises the most commonly encountered terms and their synonyms:
| Term | Meaning |
|---|---|
| Training dataset | data used to train a set of machine learning models |
| Validation dataset | data used for model selection and validation, i.e. to choose a model that is complex enough to describe the data "well", but no more complex than necessary |
| Testing dataset | data not used when building the machine learning model, but used to evaluate the model's performance on previously unseen data (generalisation error) |
| Features | the covariates/predictors/inputs/attributes used to train the model |
| Training error | the model's performance evaluated on the training data (also known as the in-sample or resubstitution error) |
| Testing error | the model's performance evaluated on the testing data (also known as the out-of-sample or generalisation error) |
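The following sketch (assuming Python and scikit-learn, with synthetic data) shows some of this terminology in action: hold out a testing dataset, fit on the training dataset, and compare training and testing performance:

```python
# A sketch of the terminology in practice, assuming scikit-learn:
# hold out a testing dataset, fit on the training dataset, then
# compare performance on both (training vs testing error).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=3)

# Hold out 30% of the observations as the testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=3)

model = LogisticRegression().fit(X_train, y_train)

print("training accuracy:", model.score(X_train, y_train))  # in-sample
print("testing accuracy: ", model.score(X_test, y_test))    # generalisation
```

Training accuracy is typically optimistic; the gap between the two numbers is exactly why performance is reported on "unseen" data.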
1.7 A bird’s-eye view of building machine learning systems
Health warning: real-life data is very messy. You will end up spending most of your time getting your data into a format amenable to exploration and modelling. Do not despair: in this workshop you will be provided with clean data sets that can be used straight away. Nevertheless, if you have not yet attended a course on data wrangling and visualisation, I would strongly recommend doing TJ McKinley's course.