Machine Learning · Justin · 16 October 2024
Previously, we gave an introduction to the steps required to deploy a machine learning model and began discussing problem definition and data collection. Now that we have a firm understanding of the problem our machine learning model is attempting to solve, we need to prepare the data for use within our model. This is a crucial step in the development of your machine learning model and, for many projects, the step you should spend the most time on.
Unless you’re working with a prepared dataset, you’ll need to “clean” the data before using it to train your machine learning model. This is necessary in order to give the model consistent data to learn from.
Missing values will probably be the most common data inconsistency that you’ll need to attend to. In general, missing values aren’t a problem as long as enough complete data remains for the model; when they are, you can either drop the affected rows or impute the gaps (for example, with a column’s mean, median, or most frequent value).
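As a minimal sketch using pandas (the dataset and column names here are invented for illustration), both approaches might look like this:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with gaps in it
df = pd.DataFrame({
    "age": [34, np.nan, 29, 41],
    "income": [52000, 61000, np.nan, 73000],
    "city": ["Ottawa", "Toronto", None, "Halifax"],
})

# Option 1: drop rows with any missing value
# (fine when plenty of complete data remains)
complete_rows = df.dropna()

# Option 2: impute instead of dropping, to preserve rows
df["age"] = df["age"].fillna(df["age"].median())         # numeric: median
df["income"] = df["income"].fillna(df["income"].mean())  # numeric: mean
df["city"] = df["city"].fillna(df["city"].mode()[0])     # categorical: most frequent
```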
Duplicates reinforce repeated patterns in the training data, leading to overfitting: the model performs well against the training data but struggles with new, unseen data. Removing duplicates is also important because repeated records skew dataset statistics and can leak the same example into both the training and test sets.
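A quick sketch of de-duplication with pandas, again on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "score":   [0.9, 0.7, 0.7, 0.5],
})

# Exact duplicate rows: keep the first occurrence, drop the rest
df = df.drop_duplicates()

# Duplicates on a key column only (e.g., the same user recorded twice)
df = df.drop_duplicates(subset=["user_id"], keep="first")
```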
Machine learning models rely on numerical representations of data or specific data structures, so inconsistent formatting can prevent our model from properly interpreting the data. Dates, text casing, units, and currency values, for example, can each be written in several different ways.
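For example, here is one way dates and text casing might be normalized with pandas (the columns are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "signup_date": ["2024-10-16", "2024/10/16", "Oct 16, 2024"],
    "country":     ["canada", " Canada ", "CANADA"],
})

# Parse mixed date strings into a single datetime dtype (pandas >= 2.0)
df["signup_date"] = pd.to_datetime(df["signup_date"], format="mixed")

# Strip stray whitespace and unify casing so "canada" == "Canada"
df["country"] = df["country"].str.strip().str.title()
```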
Identify extreme values that might be errors or anomalies. Decide whether to remove them, cap them, or transform them using techniques like logarithmic transformation. Outliers can significantly distort our model’s learning and lead to inaccurate predictions.
The following methods can assist with identifying outliers in our data: the Z-score (how many standard deviations a point lies from the mean), the modified Z-score (a median-based variant that is more robust to the outliers themselves), and the interquartile range (IQR).
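Here is a small sketch of the Z-score and IQR approaches in NumPy, run against a made-up sample with one obvious outlier:

```python
import numpy as np

values = np.array([12.0, 14.5, 13.2, 15.1, 98.0, 14.0, 13.7])

# Z-score: how many standard deviations a point lies from the mean.
# |z| > 3 is the usual cutoff; a looser 2 is used here because the
# sample is tiny and the extreme point inflates the std itself.
z = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z) > 2]

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

print(z_outliers)    # [98.]
print(iqr_outliers)  # [98.]
```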
Now that we’ve identified potential outliers, we need to decide how to treat them. When determining how best to handle them, consider the nature of the outliers, their impact on model performance, and your domain expertise (that is, your understanding of the data and the problem).
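Continuing the NumPy sketch above, the three treatments (remove, cap, transform) might look like this:

```python
import numpy as np

values = np.array([12.0, 14.5, 13.2, 15.1, 98.0, 14.0, 13.7])
q1, q3 = np.percentile(values, [25, 75])
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)

# Remove: drop points outside the IQR fences
removed = values[(values >= lower) & (values <= upper)]

# Cap (winsorize): clip extremes to the fence values instead of dropping them
capped = np.clip(values, lower, upper)

# Transform: a log transform compresses the range so extremes pull less weight
log_scaled = np.log1p(values)  # log(1 + x), safe for zero values
```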
Before we dive into transformations a little more, we should explain features. A feature is a measurable property or characteristic of your data. Features are used by your machine learning algorithm to learn patterns and make predictions. Some examples of features include the presence of a shape or color, a demographic attribute such as age or location, or anything else that you’d use to categorize your data.
Feature scaling is used to ensure that every feature in our model is on a comparable scale, so that features with large numeric ranges don’t dominate those with small ones. This is one way to help reduce bias in our model. Two common approaches are standardization (rescaling to mean 0 and standard deviation 1) and normalization (rescaling into a fixed range such as [0, 1]).
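scikit-learn ships ready-made scalers for both approaches; a minimal sketch on invented data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two hypothetical features on very different scales: age and income
X = np.array([[25, 52000],
              [32, 61000],
              [47, 73000]], dtype=float)

# Standardization: each column rescaled to mean 0, standard deviation 1
X_std = StandardScaler().fit_transform(X)

# Normalization: each column rescaled into the range [0, 1]
X_norm = MinMaxScaler().fit_transform(X)
```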
Categorical data must be converted into a numerical format that our machine learning algorithm can use. Here are a couple of methods to perform such a feat.
One-hot encoding creates a new column for each category, containing a 1 if the row’s data matches the corresponding category, or a 0 if it does not. If you have a lot of categories, you’ll end up with a lot of columns. Label encoding instead maps each category to an integer; it is generally preferred when there is a meaningful ordinal relationship between categories, but bias may result if the ordinality is not genuine.
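A short sketch of both encodings with pandas, using invented categories (the size ordering is assumed to be genuine):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size":  ["small", "large", "medium", "small"],
})

# One-hot encoding: one 0/1 column per category
# (color_blue, color_green, color_red)
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: map an ordered category to integers
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)
```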
Now that we’ve cleaned up our data and understand what features are, we need to use these features to help our algorithm learn more effectively by making the relationships between features explicit to the model.
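One common way to do this is to generate polynomial features, which add squared and interaction terms so that even a simple linear model can pick up relationships between features. A sketch with scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two hypothetical input features
X = np.array([[2.0, 3.0],
              [1.0, 4.0]])

# Degree-2 polynomial features add squares and the x1*x2 interaction term
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# Resulting columns: x1, x2, x1^2, x1*x2, x2^2
```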
Not all features are going to be equally useful. Take the time to identify the features that are most important to your model and remember: feature engineering is an iterative process!
Always document your preprocessing steps as clearly as possible, both so results can be reproduced and so it’s clear how the data was transformed. This should cover each preprocessing step performed on the data, including details of transformations and the rationale for any removals and/or caps.
An important thing to remember is that data preprocessing is an iterative process: experiment with different techniques, and combinations of techniques, to find what works best for your dataset and ML problem.
Written by: Justin
Tagged as: Standard Deviation, Data Preprocessing, Overfitting, Cleaning, Z-score, Transformation, Interquartile Range, Feature Engineering, IQR, Standardization, Modified Z-score, Normalization, Duplicates, One-Hot Encoding, Feature, Label Encoding, Polynomial Features, Machine Learning, Mean.