Machine Learning Concepts – Part 4 – Exploratory Data Analysis

Machine Learning · Justin · 18 October 2024

Background
This entry is part 4 of 7 in the series Machine Learning Concepts

Next up in our machine learning concepts series is exploratory data analysis (EDA). EDA is a cornerstone of data science; you can think of it as the process of getting to know your data by uncovering patterns, trends, relationships and potential issues. It helps you understand the data’s structure, distributions, and key characteristics, which in turn guides decisions about:

  • Feature Engineering – Identifying new features and finding opportunities to combine features.
  • Data Cleaning – Identifying missing values, outliers and inconsistencies in our data.
  • Model Selection – Choosing appropriate algorithms based on the types of data and the observed relationships within the data.
  • Gaining Insights – Discovering patterns that tell a story about our data.

1. EDA Techniques

1.1. Descriptive Statistics

Calculating basic measures such as the mean, median, standard deviation, range, and quartiles allows us to understand the central tendency, spread and shape of our numerical data.
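As a minimal sketch of these measures in Pandas (one of the tools listed later in this post), the snippet below summarizes a small made-up table. The data and the column names `age` and `income` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; the values and column names are illustrative only.
df = pd.DataFrame({
    "age": [23, 35, 31, 52, 46, 29, 38],
    "income": [32000, 54000, 48000, 91000, 77000, 41000, 60000],
})

# describe() reports count, mean, std, min, the quartiles (25%, 50%, 75%) and max
# for every numerical column.
summary = df.describe()
print(summary)

# Individual measures can also be computed directly on a column.
age_median = df["age"].median()
age_range = df["age"].max() - df["age"].min()
age_iqr = df["age"].quantile(0.75) - df["age"].quantile(0.25)
print(age_median, age_range, age_iqr)
```

Comparing the mean with the median is a quick first check on shape: a mean well above the median usually suggests a right-skewed column.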

1.2. Visualization

Using charts and graphs we can easily visualize our data to identify patterns and relationships.

  • Histograms – Show the distribution of a single variable.
  • Scatter Plots – Explore relationships between two numerical variables.
  • Box Plots – Display the distribution and potential outliers of a variable across different categories.
  • Bar Charts/Pie Charts – Compare categorical data frequencies.
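The four chart types above could be drawn with Matplotlib roughly as follows. This is a sketch on made-up data: the columns `age`, `income` and `segment`, and the output file name, are assumptions rather than anything from a real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data; the values and column names are illustrative only.
df = pd.DataFrame({
    "age": [23, 35, 31, 52, 46, 29, 38, 41, 27],
    "income": [32000, 54000, 48000, 91000, 77000, 41000, 60000, 66000, 39000],
    "segment": ["a", "b", "a", "c", "c", "a", "b", "b", "a"],
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Histogram: distribution of a single variable.
axes[0, 0].hist(df["age"], bins=5)
axes[0, 0].set_title("Histogram of age")

# Scatter plot: relationship between two numerical variables.
axes[0, 1].scatter(df["age"], df["income"])
axes[0, 1].set_title("Income vs. age")

# Box plot: distribution of income within each category.
labels, groups = [], []
for name, group in df.groupby("segment"):
    labels.append(name)
    groups.append(group["income"].to_numpy())
axes[1, 0].boxplot(groups)
axes[1, 0].set_xticks(list(range(1, len(labels) + 1)))
axes[1, 0].set_xticklabels(labels)
axes[1, 0].set_title("Income by segment")

# Bar chart: frequency of each category.
counts = df["segment"].value_counts().sort_index()
axes[1, 1].bar(counts.index.tolist(), counts.to_numpy())
axes[1, 1].set_title("Rows per segment")

fig.tight_layout()
fig.savefig("eda_overview.png")
```

In a notebook or other interactive session you would typically drop the `matplotlib.use("Agg")` line and call `plt.show()` instead of saving to a file.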

1.3. Correlation Analysis

We can also measure the strength and direction of the linear relationship between two or more numerical variables, typically with a correlation coefficient that ranges from -1 to 1.
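One common way to do this in Pandas is a pairwise correlation matrix. The sketch below uses made-up data; the column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; the values and column names are illustrative only.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 68, 70, 79],
    "hours_gaming": [9, 7, 8, 4, 3, 1],
})

# Pairwise Pearson correlation between every pair of numerical columns.
# Values near +1 or -1 indicate a strong linear relationship; values near 0 a weak one.
corr = df.corr(method="pearson")
print(corr.round(2))

study_vs_score = corr.loc["hours_studied", "exam_score"]
study_vs_gaming = corr.loc["hours_studied", "hours_gaming"]
```

Here `study_vs_score` comes out strongly positive and `study_vs_gaming` strongly negative. Keep in mind that Pearson correlation only captures linear relationships, so a scatter plot is still worth checking alongside the number.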

1.4. Univariate & Bivariate Analysis

Univariate analysis is the exploration of individual variables, one at a time, in order to identify their distributions, patterns, and outliers. Bivariate analysis explores the relationship between two variables, for example how the values of one variable change across the values of another.
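To make the distinction concrete, here is a small sketch in Pandas. The data and the column names `plan`, `monthly_spend` and `churned` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the values and column names are illustrative only.
df = pd.DataFrame({
    "plan": ["free", "pro", "free", "pro", "free", "team", "pro", "free"],
    "monthly_spend": [0, 30, 0, 45, 5, 120, 35, 0],
    "churned": [True, False, True, False, False, False, False, True],
})

# Univariate: look at one variable at a time.
plan_counts = df["plan"].value_counts()   # frequency of each category
spend_skew = df["monthly_spend"].skew()   # asymmetry of a numerical distribution

# Bivariate: look at one variable in relation to another.
spend_by_plan = df.groupby("plan")["monthly_spend"].mean()  # numerical vs. categorical
churn_rate_by_plan = df.groupby("plan")["churned"].mean()   # share of churned rows per plan

print(plan_counts)
print(spend_by_plan)
print(churn_rate_by_plan)
```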

1.5. Data Cleaning Techniques

Identify and handle missing values, outliers, duplicates, and other inconsistencies based on your understanding of the data and its domain. This differs from the cleaning done earlier in the data pre-processing phase: in the EDA phase, we’re dealing with numerical representations of the raw data that was already cleaned during pre-processing.
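A minimal sketch of these checks in Pandas follows. The data is made up, and the choices shown (the 1.5 × IQR rule for flagging outliers, dropping the flagged rows, filling missing ages with the median) are assumptions for illustration; the right handling depends on your data and its domain.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values and column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 31, np.nan, 44, 31, 29, 120],
    "city": ["Leeds", "York", "Leeds", None, "York", "Hull", "Hull"],
})

# Missing values per column and number of fully duplicated rows.
missing_per_column = df.isna().sum()
duplicate_rows = df.duplicated().sum()

# Flag outliers with the 1.5 * IQR rule (one common heuristic, not the only one).
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)

# One possible treatment: drop flagged rows and duplicates, then fill missing ages.
cleaned = df[~outlier_mask].drop_duplicates().copy()
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())

print(missing_per_column)
print(duplicate_rows, int(outlier_mask.sum()))
print(cleaned)
```

Whether a flagged value such as an age of 120 is an error or a genuine observation is a judgment call, which is why each removal or replacement should be recorded.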

2. EDA Tools

  • Python – Pandas, NumPy, Matplotlib, Seaborn, PyTorch
  • R – Base R Functions, ggplot2
  • Spreadsheet Software – Excel, Google Sheets

3. Document

As with data pre-processing, it’s important to document every action taken as part of the EDA phase. This includes the EDA techniques used and the data they were applied to, the analysis findings, and any data that was removed or replaced.

4. Conclusion

As with most steps in creating a Machine Learning model, exploratory data analysis is an iterative process and should not be rushed. The goal is to visualize and display the data in ways that allow us to interpret it meaningfully.


Written by: Justin
