Machine Learning | Justin | 7 September 2024
In part 1 of this Machine Learning Concepts series, I introduced the steps required to deploy a machine learning model. Check that out if you missed it. Now that we understand the requirements for deploying machine learning models, we need to define the problem we’re trying to solve, understand the goal of our model, and start gathering the necessary data. Let’s dive in.
1. Clearly Define the Goal
It is absolutely crucial that we have a very clear understanding of the goal for our machine learning model. A clearly defined goal or problem statement acts as a compass that will help guide decision-making throughout the deployment of your machine learning model.
Before we can start building our actual machine learning model we should understand “why” it is that we need a machine learning model in the first place. Answering the following questions will help us validate that.
Be specific! Instead of “improve customer experience”, we should aim for something like, “reduce customer turnover by 10%”, or “increase customer satisfaction by 15%”.
The more specific you are, the easier every later step is going to be.
How will your machine learning project impact your organization? How will it impact your bottom line? Will it improve efficiency or decision-making? How does this project align with your organization’s current goals?
Whenever possible, attempt to quantify the potential benefits, such as cost savings, increased revenue, reduced errors, or improved customer satisfaction.
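To make the idea of quantifying a benefit concrete, here is a minimal sketch that turns a goal like “reduce customer turnover by 10%” into a rough dollar figure. The function name and every input number are hypothetical and chosen only for illustration. The sketch also assumes the 10% is a relative reduction in the churn rate, so substitute your own figures and definitions.

```python
def revenue_retained_from_churn_reduction(
    customers, annual_churn_rate, relative_reduction, revenue_per_customer
):
    """Rough estimate of yearly revenue kept if churn falls by a relative amount."""
    churned_before = customers * annual_churn_rate
    churned_after = customers * annual_churn_rate * (1 - relative_reduction)
    customers_retained = churned_before - churned_after
    return customers_retained * revenue_per_customer


# Hypothetical inputs: 10,000 customers, 20% annual churn,
# a 10% relative reduction in churn, $500 revenue per customer per year.
savings = revenue_retained_from_churn_reduction(10_000, 0.20, 0.10, 500.0)
print(f"Estimated revenue retained per year: ${savings:,.2f}")
```

Even a back-of-the-envelope number like this gives you something to weigh against the cost of building and maintaining the model.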
Identify the specific outcome that you are attempting to achieve. For example, are you attempting to predict future values of sales for your organization next quarter, or even next year? Perhaps you want to know how similar two data points are. Consider the following outcomes:
Prediction: Forecasting future values
Classification: Assigning data points to categories
Clustering: Grouping similar data points together
It’s important that we determine and understand how success will be measured related to our machine learning project. Ensure that our success metrics are aligned with the business goals and value that we previously identified. Consider the following statistical measures when evaluating the success or failure of your machine learning algorithms.
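Regression: R-squared, mean squared error (MSE), root mean squared error (RMSE)
Classification: F1-score, area under the curve (AUC)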
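Several of the measures this post is tagged with (mean squared error, root mean squared error, R-squared, and F1-score) are simple enough to compute by hand. The following is a minimal sketch in plain Python. The function names and sample values are my own and purely illustrative. In a real project you would more likely reach for an established library.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of MSE, expressed in the same units as the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


def r_squared(y_true, y_pred):
    """Share of variance explained; assumes y_true is not constant."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative values only.
sales_actual = [3.0, 5.0, 7.0, 9.0]
sales_predicted = [2.0, 5.0, 8.0, 9.0]
churn_actual = [1, 0, 1, 1, 0, 1]
churn_predicted = [1, 0, 0, 1, 1, 1]

print("MSE: ", mean_squared_error(sales_actual, sales_predicted))
print("RMSE:", root_mean_squared_error(sales_actual, sales_predicted))
print("R^2: ", r_squared(sales_actual, sales_predicted))
print("F1:  ", f1_score(churn_actual, churn_predicted))
```

Whichever measure you pick, tie it back to the business goal: an RMSE in dollars or an F1-score on churn predictions only matters if it maps to the outcome you defined earlier.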
Every project will have constraints. Some limitations that you may want to consider include:
Data Availability: Do you have enough data of the right quality?
Time and Resources: Do you have enough skilled resources and time for development as well as continued maintenance and tuning?
Interpretability: How important is it to understand why our model makes certain predictions?
“Garbage In, Garbage Out” is an applicable sentiment as it relates to machine learning. The data that we provide our model directly influences the accuracy, reliability, and ultimately, the success of our model.
Accuracy: Training data that contains errors and inconsistencies may cause your model to make flawed predictions.
Bias: Biased data may lead to unfair or discriminatory results. For example, societal biases captured in the data may produce different results for individuals with different socio-economic experiences.
Generalizability: Our model should be trained with data that is representative of the real world and must be able to perform accurately with new or previously unseen data.
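A quick audit of your dataset can surface some of these problems before any training happens. The sketch below is a rough first pass under stated assumptions. It counts missing values and exact duplicate rows (both relevant to accuracy) and shows how the labels are distributed (a heavily lopsided distribution can hint at bias or poor representativeness). The function name, field names, and sample records are hypothetical.

```python
from collections import Counter


def audit_records(records, required_fields, label_field):
    """Count missing values, exact duplicate rows, and label frequencies."""
    missing = Counter()
    for row in records:
        for field in required_fields:
            # Treat None and empty strings as missing values.
            if row.get(field) in (None, ""):
                missing[field] += 1

    seen = set()
    duplicates = 0
    for row in records:
        key = tuple(sorted(row.items()))  # order-independent row fingerprint
        if key in seen:
            duplicates += 1
        seen.add(key)

    label_counts = Counter(row.get(label_field) for row in records)
    return {
        "missing": dict(missing),
        "duplicates": duplicates,
        "label_counts": dict(label_counts),
    }


# Hypothetical customer records.
records = [
    {"age": 34, "plan": "basic", "churned": 0},
    {"age": None, "plan": "pro", "churned": 0},
    {"age": 34, "plan": "basic", "churned": 0},  # exact duplicate of the first row
    {"age": 51, "plan": "", "churned": 1},
]

report = audit_records(records, required_fields=["age", "plan"], label_field="churned")
print(report)
```

A check like this will not prove that your data is unbiased or representative, but it is a cheap way to catch obvious problems early.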
Key aspects of data quality include:
Existing Databases: Both internal and publicly available datasets are good starting points. huggingface.co is an excellent resource that provides over 200,000 datasets for you to freely use.
APIs: Web APIs often provide access to real-time data from various web applications.
Web Scraping: Data extracted from a web page using scripts or automated tools.
Surveys and Questionnaires: Useful for collecting data points from a specific set of users or a targeted audience.
Sensors and IoT Devices: Real-world sensor and IoT data.
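Data from different sources rarely arrives in the same shape, so it usually needs to be normalized into one schema before it is useful. As a minimal sketch, assume one source is a CSV export from an existing database and another is a JSON payload from a web API. The field names, sample values, and helper functions are all hypothetical, and the data is inlined as strings so the example runs without a network connection.

```python
import csv
import io
import json

# Stand-ins for a database export and an API response (hypothetical data).
CSV_EXPORT = "customer_id,monthly_spend\n101,49.99\n102,19.50\n"
API_PAYLOAD = '[{"id": 103, "spend": {"monthly": 75.0}}]'


def from_csv(text):
    """Map rows of the CSV export onto the common schema."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "customer_id": int(row["customer_id"]),
            "monthly_spend": float(row["monthly_spend"]),
            "source": "database",
        }
        for row in reader
    ]


def from_api(text):
    """Map items of the JSON payload onto the common schema."""
    return [
        {
            "customer_id": int(item["id"]),
            "monthly_spend": float(item["spend"]["monthly"]),
            "source": "api",
        }
        for item in json.loads(text)
    ]


dataset = from_csv(CSV_EXPORT) + from_api(API_PAYLOAD)
for record in dataset:
    print(record)
```

Keeping a "source" field on every record makes it easier to trace quality problems back to where the data came from.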
Investing the appropriate time and effort into understanding your goals, as well as gathering and normalizing high-quality data, is going to lead to more accurate, precise, and unbiased machine learning models. Overall, a high-quality dataset that matches our goals and requirements will pay off throughout the entire machine learning lifecycle.
Written by: Justin
Tagged as: data quality, data gathering, bias, prediction, clustering, regression, r-squared, mean squared error, root mean squared error, Machine Learning, f1-score, classification, area under the curve, availability, planning.
©Copyright roguesecurity.ca 2024. All Rights Reserved.