When it comes to machine learning, it is common to run several different models over the same data and compare their outputs in order to test and validate our predictions. Choosing the wrong model can lead to poor performance and wasted effort.
1. Problem Type
First, we need to identify the problem in our data that we are trying to solve. For example, you may have sale prices for homes in the same area over a 5-year time span. The problem we’ve chosen to solve is to understand what future sale prices could be, given this historical data. This would be a good example of where linear regression could be used to predict continuous values.
Once we understand the problem, we can apply the most appropriate algorithm to produce the desired outcome.
1.1. Regression
Linear Regression – Finds the best-fitting straight line (linear relationship) that predicts a continuous target variable based on one or more input features.
Polynomial Regression – Extends linear regression by allowing for curved relationships between features and the target variable. It adds polynomial terms (e.g., x^2, x^3) to the model equation.
Decision Trees – Builds a tree-like structure of decisions based on features to classify data or predict continuous values.
Random Forests – Combines multiple decision trees to improve accuracy and robustness.
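To make the first two options concrete, here is a minimal sketch using scikit-learn (one common choice, not prescribed by this post); the tiny house-price dataset is invented purely for illustration.

```python
# A minimal sketch of linear and polynomial regression with scikit-learn.
# The house-price data below is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Feature: house size in square metres; target: sale price.
X = np.array([[50], [70], [90], [110], [130], [150]])
y = np.array([150_000, 200_000, 260_000, 310_000, 380_000, 460_000])

# Plain linear regression: fits a straight line.
linear = LinearRegression().fit(X, y)

# Polynomial regression: the same linear model fitted on x and x^2 terms.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("Linear prediction for 120 m^2:", linear.predict(np.array([[120]])))
print("Polynomial prediction for 120 m^2:", poly.predict(np.array([[120]])))
```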
1.2. Classification
Logistic Regression – Predicts the probability of a categorical outcome (binary or multi-class) based on input features.
Decision Trees – Builds a tree-like structure of decisions based on features to classify data or predict continuous values. Can be prone to overfitting.
Random Forests – Combines multiple decision trees to improve accuracy and robustness. Less prone to overfitting than a single decision tree.
Support Vector Machines (SVMs) – Finds the optimal hyperplane that best separates data points into different classes (classification) or maps data points to a desired output (regression).
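A minimal sketch, again assuming scikit-learn and a synthetic dataset, of how the four classifiers above can be fitted and compared on the same train/test split:

```python
# Fit and compare the classifiers listed above on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # accuracy on held-out data
```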
1.3. Clustering
K-Means – Groups data points into *k* clusters based on their nearest mean.
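A minimal K-Means sketch with scikit-learn, using synthetic blob data for illustration:

```python
# Group synthetic points into k = 3 clusters with K-Means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # the learned cluster means
print(kmeans.labels_[:10])       # cluster assignment for the first 10 points
```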
1.4. Dimensionality Reduction
Principal Component Analysis (PCA) – Finds a new set of uncorrelated features (principal components) that capture most of the variance in the original data.
t-Distributed Stochastic Neighbor Embedding (t-SNE) – Reduces high-dimensional data into a lower-dimensional space (usually 2 or 3 dimensions) for visualization, preserving local neighborhood structures.
Linear Discriminant Analysis (LDA) – Finds linear combinations of features that best discriminate between classes.
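A minimal sketch, using scikit-learn and its bundled iris dataset, of how each of these techniques reduces four features down to two:

```python
# Project the 4-dimensional iris features down to 2 dimensions with each
# of the three techniques listed above.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)                   # unsupervised, variance-based
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X) # for visualization
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # uses class labels

print(X_pca.shape, X_tsne.shape, X_lda.shape)  # each is (150, 2)
```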
1.5. Anomaly Detection/Outlier Detection
One-Class SVM – Learns a boundary around “normal” data points and identifies outliers that lie outside this boundary.
Isolation Forest – Isolates anomalies by randomly selecting features and splitting the data based on those features. Anomalies are typically isolated with fewer splits than normal points.
Local Outlier Factor (LOF) – Identifies anomalies by comparing the local density of a data point to its neighbors. Points with significantly lower density are outliers.
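A minimal sketch, again with scikit-learn and synthetic data, showing all three detectors flagging the same planted outliers:

```python
# 200 "normal" points plus two obvious outliers; each detector labels
# points with 1 (normal) or -1 (anomaly).
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[6.0, 6.0], [-7.0, 5.0]]])

print(OneClassSVM(nu=0.05).fit_predict(X)[-2:])            # last two points are the outliers
print(IsolationForest(random_state=0).fit_predict(X)[-2:])
print(LocalOutlierFactor(n_neighbors=20).fit_predict(X)[-2:])
```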
1.6. Natural Language Processing (NLP)
Recurrent Neural Networks (RNNs) – Process sequential data (e.g., text, time series) by maintaining an internal memory (hidden state) that captures information from previous time steps.
Transformers – Process and understand sequential data (especially text) using attention mechanisms.
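A minimal sketch, assuming PyTorch as the framework, of both model families processing a random batch that stands in for embedded text; a real NLP model would add a tokenizer, an embedding layer and a training loop:

```python
import torch
import torch.nn as nn

batch, seq_len, dim = 4, 12, 32
x = torch.randn(batch, seq_len, dim)  # pretend these are embedded tokens

# RNN: reads the sequence step by step, carrying a hidden state as "memory".
rnn = nn.RNN(input_size=dim, hidden_size=64, batch_first=True)
rnn_out, hidden = rnn(x)
print(rnn_out.shape, hidden.shape)  # (4, 12, 64) and (1, 4, 64)

# Transformer encoder layer: attends to all positions of the sequence at once.
encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
print(encoder(x).shape)  # (4, 12, 32)
```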
1.7. Computer Vision
Convolutional Neural Networks (CNNs) – Analyze and classify visual data (images, videos).
Transfer Learning – Reuses models pre-trained on large datasets for new tasks with limited data.
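A minimal CNN sketch, again assuming PyTorch; the layer sizes are arbitrary and the random tensor stands in for a batch of 32x32 RGB images. For transfer learning you would typically start instead from a network pre-trained on a large dataset and retrain only the final layers.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learn 16 local filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # downsample 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # scores for 10 classes
)

images = torch.randn(4, 3, 32, 32)  # random batch standing in for real images
print(cnn(images).shape)            # (4, 10)
```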
1.8. Time Series Analysis
ARIMA Models – Forecast time series data by identifying patterns and trends in past values.
Prophet – Forecasts future time series data, accounting for seasonality and trends.
Recurrent Neural Networks (RNNs) – Process sequential data (e.g., text, time series) by maintaining an internal memory (hidden state) that captures information from previous time steps.
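A minimal forecasting sketch using the ARIMA implementation in statsmodels on a synthetic trending series; the (1, 1, 1) order is a placeholder rather than a tuned choice:

```python
# Fit an ARIMA model to a noisy upward trend and forecast the next 5 points.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
trend = np.linspace(100, 150, 120)                 # upward trend
series = trend + rng.normal(scale=2.0, size=120)   # plus noise

model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.forecast(steps=5))   # forecast the next 5 values
```

In practice the order would be chosen by inspecting the series (for example via autocorrelation plots) or by a parameter search.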
2. Data Characteristics
Size – Linear models are almost always faster than other models like random forests and decision trees, especially on larger datasets.
Complexity – For complex relationships, tree-based models are better suited.
Dimensionality – High-dimensional data will require more advanced models, such as Support Vector Machines (SVMs).
Number of Classes (for classification) – Simpler models such as logistic regression work well for binary classification; however, multi-class classification often benefits from more complex models like random forests or SVMs.
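As a rough, illustrative way to see the size/speed point above, the sketch below times a linear model against a random forest on the same synthetic regression data (exact timings will vary by machine):

```python
import time
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=20_000, n_features=20, random_state=0)

for model in (LinearRegression(), RandomForestRegressor(n_estimators=50, random_state=0)):
    start = time.perf_counter()
    model.fit(X, y)
    print(type(model).__name__, f"{time.perf_counter() - start:.2f}s to fit")
```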
3. Interpretability
If understanding how the model makes decisions is required, you should consider simpler algorithms such as linear regression or decision trees. More complex algorithms are harder to interpret and may be making decisions that cannot easily be explained or accounted for. Beginning with a simple model and gradually increasing complexity is preferred. Cross-validating the performance and decisions of multiple models on novel/unseen data can also assist here.
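A minimal sketch of that workflow with scikit-learn: fit an interpretable linear model, read its coefficients directly, then cross-validate it against a more complex model before deciding whether the extra complexity is justified (the bundled diabetes dataset is used purely for illustration):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# One weight per feature: a directly readable measure of influence.
linear = LinearRegression().fit(X, y)
print(linear.coef_)

# 5-fold cross-validation of a simple and a complex model on the same data.
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())
```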
4. Conclusion
Selecting a model or algorithm to run against your data is heavily dependent on what problem we are attempting to answer and how we are attempting to answer it. Understanding the relationships within your data will be key to identifying which algorithm will give us the most appropriate output. You may also notice that some algorithms can be applied to solving slightly different problems.
Finally, we must interpret the outputs of these algorithms in order to determine the success or failure of our model. Utilizing multiple models and validating them against previously unseen data can provide a mechanism for evaluating this success.
Next up in our machine learning concepts series we are going to discuss exploratory data analysis (EDA). EDA is a cornerstone of data science; you can think of it as [...]