Machine Learning Concepts – Part 2 – Problem Definition and Data Collection

Machine Learning | Justin | 7 September 2024

Background
This entry is part 2 of 2 in the series Machine Learning Concepts

 

In part 1 of this Machine Learning Concepts series, I introduced the steps needed to deploy a machine learning model. Check that out if you missed it. Now that we understand what deployment requires, we need to begin by defining the problem we’re trying to solve, understanding the goal of our model, and gathering the necessary data. Let’s dive in.

1. Clearly Define the Goal

It is crucial that we have a clear understanding of the goal for our machine learning model. A clearly defined goal or problem statement acts as a compass that guides decision-making throughout the deployment of the model.

1.1 Start with the ‘Why’

Before we start building a machine learning model, we should understand why we need one in the first place. Answering the following questions will help us confirm that machine learning is the right tool for the job.

What problem are you trying to solve?

Be specific! Instead of “improve customer experience”, we should aim for something like “reduce customer turnover by 10%” or “increase customer satisfaction by 15%”.

The more specific you are, the easier it will be to choose success metrics and gather the right data later on.

What business value will this bring?

How will your machine learning project impact your organization? How will it impact your bottom line? Will it improve efficiency or decision-making? How does this project align with your organization’s current goals?

Whenever possible, quantify the potential benefits, such as cost savings, increased revenue, reduced errors, or improved customer satisfaction.

1.2 Identify the Target Outcome

Identify the specific outcome that you are trying to achieve. For example, are you trying to predict your organization’s sales for next quarter, or even next year? Perhaps you want to know how similar two data points are. Consider the following outcomes:

Prediction (regression): Forecasting future numeric values
Classification: Assigning data points to categories
Clustering: Grouping similar data points together

1.3 Define Success Metrics

It’s important that we decide up front how the success of our machine learning project will be measured. Ensure that the success metrics are aligned with the business goals and value that we previously identified. Consider the following statistical measures when evaluating the success or failure of your machine learning algorithms.

Regression:

  • R-squared
  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)
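
To make the regression measures above a little more concrete, here is a minimal sketch that computes them from scratch using only Python's standard library. The function name `regression_metrics` and the sales figures are made up for illustration; in practice a library such as scikit-learn provides equivalent functions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and R-squared for paired actual/predicted values."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and the same length")
    n = len(y_true)
    # Sum of squared differences between actual and predicted values.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total variation of the actual values around their mean.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    rmse = math.sqrt(mse)
    # R-squared is undefined when every actual value is identical.
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical actual vs. predicted sales, for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Lower MSE and RMSE mean smaller errors, while an R-squared closer to 1 means the model explains more of the variation in the actual values. RMSE is often the easiest to communicate because it is in the same units as the value being predicted.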

Classification:

  • Accuracy
  • Precision
  • Recall
  • F1-Score
  • Area Under the Curve (AUC)
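
As a rough illustration of the classification measures above, the following sketch counts true and false positives and negatives for a two-class problem and derives accuracy, precision, recall, and F1-score from them. The function name `classification_metrics` and the labels are hypothetical. AUC is left out because it needs predicted scores or probabilities rather than hard labels.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary problem."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = customer left, 0 = customer stayed.
true_labels = [1, 1, 1, 0, 0, 0, 0, 0]
pred_labels = [1, 1, 0, 1, 0, 0, 0, 0]
scores = classification_metrics(true_labels, pred_labels)
print(scores)
```

Which measure matters most depends on the business goal. If missing a departing customer is costly, recall deserves more weight; if acting on a false alarm is costly, precision does.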

1.4 Consider Constraints

Every project will have constraints. Some limitations that you may want to consider include:

Data Availability: Do you have enough data of the right quality?
Time and Resources: Do you have enough skilled resources and time for development as well as continued maintenance and tuning?
Interpretability: How important is it to understand why our model makes certain predictions?

2. Gather Data

“Garbage In, Garbage Out” applies directly to machine learning. The data that we provide our model directly influences the accuracy, reliability, and ultimately, the success of our model.

2.1 Why Data Quality Matters

Accuracy: Training data that contains errors and inconsistencies may cause your model to make flawed predictions.
Bias: Biased data may lead to unfair or discriminatory results. For example, societal biases captured in the data may lead to different results for individuals with different socio-economic backgrounds.
Generalizability: Our model should be trained with data that is representative of the real world and must be able to perform accurately with new or previously unseen data.
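
One common way to check generalizability is to hold back part of the data and evaluate the model only on records it never saw during training. The sketch below shows a simple shuffled holdout split using the standard library; the function name `train_test_split`, the 20% test fraction, and the seed are assumptions chosen for illustration, not a recommendation for every dataset.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle records and split them into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    if len(records) < 2:
        raise ValueError("need at least two records to split")
    shuffled = list(records)  # copy so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)
    # Always keep at least one record in each partition.
    n_test = min(len(shuffled) - 1,
                 max(1, round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical record IDs standing in for real rows.
records = list(range(10))
train, test = train_test_split(records)
print(len(train), len(test))
```

The model is fitted on `train` only, and the success metrics are then computed on `test`, which stands in for new or previously unseen data.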

Key aspects of data quality include:

  • Accuracy
    • Data must be free of errors and inconsistencies.
    • Data sources should be verified for reliability.
    • Data validation rules should be implemented to catch errors during entry or transfer.
  • Completeness
    • Missing values should be handled (for example, filled in or removed) before the data is processed by our model.
    • Missing data patterns should be identified.
  • Consistency
    • Data should be formatted consistently.
    • Data should use standardized units of measurement, formats, and labels.
  • Relevance
    • Only data relevant to the problem we’re trying to solve should be included in our model.
    • Irrelevant features that may introduce noise or confusion should be avoided.
  • Timeliness
    • Data should be up-to-date.
    • Consider how frequently the data needs to be updated.
  • Bias
    • Identify and mitigate potential biases.
    • Ensure that your data represents diverse populations and perspectives.
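
A few of the checks above (completeness, duplicate detection, and consistent labels) can be sketched in a handful of lines. The example below works on a plain list of dictionaries; the helper names `profile_records` and `normalize_label`, the field names, and the customer rows are all hypothetical, and a real project would more likely lean on a data library for this.

```python
from collections import Counter


def profile_records(records, required_fields):
    """Count missing values per field and exact duplicate rows."""
    missing = Counter()
    for row in records:
        for field in required_fields:
            value = row.get(field)
            # Treat None and blank strings as missing.
            if value is None or (isinstance(value, str) and not value.strip()):
                missing[field] += 1
    # Two rows are duplicates when all required fields match exactly.
    seen = Counter(tuple(row.get(f) for f in required_fields)
                   for row in records)
    duplicates = sum(count - 1 for count in seen.values() if count > 1)
    return {"rows": len(records), "missing": dict(missing),
            "duplicate_rows": duplicates}


def normalize_label(value):
    """Trim whitespace and lower-case a text label; pass other types through."""
    return value.strip().lower() if isinstance(value, str) else value


# Hypothetical customer rows, for illustration only.
customers = [
    {"id": 1, "plan": "Basic ", "monthly_spend": 20.0},
    {"id": 2, "plan": "basic", "monthly_spend": None},
    {"id": 3, "plan": "", "monthly_spend": 35.5},
    {"id": 1, "plan": "Basic ", "monthly_spend": 20.0},
]
report = profile_records(customers, ["id", "plan", "monthly_spend"])
labels = sorted({normalize_label(c["plan"]) for c in customers})
print(report, labels)
```

Running a profile like this before training makes missing-data patterns and inconsistent labels visible early, when they are still cheap to fix.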

2.2 Data Collection Strategies

Existing Databases: Your organization’s internal databases and publicly available datasets are a good starting point. Hugging Face (huggingface.co) is an excellent resource that hosts over 200,000 datasets, many of them free to use under their individual licenses.
APIs: Web APIs often provide access to real-time data from various web applications.
Web Scraping: Data extracted from a web page using scripts or automated tools.
Surveys and Questionnaires: Useful for collecting data points from a specific set of users or a targeted audience.
Sensors and IoT Devices: Real-world sensor and IoT data.
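
As one example of the strategies above, collecting data from a web API usually comes down to requesting a URL, decoding the JSON response, and keeping only the fields relevant to the problem. The sketch below uses Python's standard library; the helper names `fetch_json` and `to_records`, the field names, and the sample response body are all hypothetical, and a real API will have its own URL, authentication, and response shape.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Download a URL and decode its body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def to_records(payload, fields):
    """Keep only the requested fields from each item in a JSON list."""
    return [{field: item.get(field) for field in fields} for item in payload]


# Hypothetical response body, used here so the example runs offline.
# With a real endpoint you would call fetch_json("https://...") instead.
sample_body = (
    '[{"id": 1, "temp_c": 21.5, "debug": "x"},'
    ' {"id": 2, "temp_c": 19.0}]'
)
records = to_records(json.loads(sample_body), ["id", "temp_c"])
print(records)
```

Dropping fields that are not needed at collection time keeps the dataset focused on the problem and avoids carrying irrelevant features into later steps.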

3. Conclusion

Investing the appropriate time and effort in understanding your goals, and in gathering and normalizing high-quality data, leads to more accurate, precise, and unbiased machine learning models. A high-quality dataset that matches our goals and requirements will pay off throughout the entire machine learning lifecycle.


Written by: Justin
