This is Part 1 of “Machine Learning Models with C#”. In this article, we will look at the basics of machine learning modeling.
Background
I recently started learning machine learning in C# from a book. I found it very useful and thought I would share that knowledge with you all.
Article Credits
The concepts in this article are adapted from the book C# Machine Learning Projects by Yoon Hyup Hwang.
The Basics of Machine Learning Modeling
It can be difficult to see how machine learning (ML) affects the daily lives of ordinary people. In fact, ML is everywhere! If you searched for a restaurant for dinner, you almost certainly used ML. If you searched for a dress to wear to a dinner party, you would have used ML. On your way to your dinner appointment, you probably used ML as well if you took one of the ride-sharing apps. ML has become so widely used that it is an essential part of our lives, although it usually goes unnoticed. With ever-growing data and its accessibility, the applications of and need for ML are rising rapidly across various industries. However, the number of trained data scientists has yet to catch up with the demand for ML in businesses. Abundant resources and software libraries make building ML models easier, but it still takes time and experience for data scientists and ML engineers to master the necessary skills.
In this article, we will cover the following topics:
- Key ML tasks and applications
- Steps for building ML models
In the next article, we will cover the following topics:
- Setting up a C# Environment for Machine Learning
Key ML tasks and applications
There are many areas where ML is used in our daily lives without being noticed. Media companies use ML to recommend the most relevant content, such as news articles, movies, or music, for you to read, watch, or listen to. E-commerce companies use ML to suggest items that are of interest to you and that you are most likely to purchase. Game companies use ML to detect your motion and joint movements for their motion sensor games. Other common uses of ML in industry include face detection on cameras for better focusing, automated question answering where chatbots or virtual assistants interact with customers to answer questions and requests, and detecting and preventing fraudulent transactions. In this section, we will take a look at some of the applications that we use in our daily lives and that rely heavily on ML.
Google News feed
The Google News feed uses ML to generate a personalized stream of articles based on the user’s interests and other profile data. Collaborative filtering algorithms are frequently used for such recommendation systems and are built from the view history data of their user base. Media companies use such personalized recommendation systems to attract more traffic to their websites and increase the number of subscribers.
Amazon product recommendations
Amazon uses users’ browsing and order history data to train an ML model that recommends products a user is most likely to purchase. This is a good use case for supervised learning in the e-commerce industry. These recommendation algorithms help e-commerce companies maximize their profit by displaying the items that are most relevant to each user’s interests.
Netflix movie recommendation
Netflix uses movie ratings, view history, and preference profiles to recommend other movies that a user might like. They train collaborative filtering algorithms on this data to make personalized recommendations. According to an article in Wired (http://www.wired.co.uk/article/how-do-netflixs-algorithms-work-machine-learning-helps-to-predict-what-viewers-will-like), more than 80 percent of the TV shows people watch on Netflix are discovered through the platform’s recommendation system, which makes this a very useful and profitable example of ML at a media company.
Face detection on cameras
Cameras detect faces for better focusing and light metering. This is one of the most frequently used examples of computer vision and classification. Some photo management software also uses clustering algorithms to group similar faces in your images together so that you can later search for photos by the people in them.
Alexa – Virtual Assistant
Virtual assistant systems, such as Alexa, can answer questions such as “What’s the weather in New York?” They can also complete certain tasks, such as “Turn on the living room lights.” These kinds of virtual assistant systems are typically built using speech recognition, natural language understanding (NLU), deep learning, and various other machine learning technologies.
Microsoft Xbox Kinect
Kinect can sense how far each object is from the sensor and detect joint positions. Kinect is trained with a randomized decision forest algorithm that builds lots of individual decision trees from depth images.
Steps for building ML models
Now that we have seen some examples of the ML applications that are out there, the question is, “How do we go about building such ML applications and systems?” Books about ML and ML courses taught in universities typically start by covering the mathematics and theory behind ML algorithms and then apply those algorithms to a given dataset. This approach is great for people who are completely new to the subject and want to learn the foundations of ML. However, aspiring data scientists who have some prior knowledge and experience and want to apply it to real ML projects often struggle with where to start and how to approach a given ML project. In this section, we will discuss a typical workflow for building an ML application, the same one the book follows throughout. The following figure summarizes our approach to developing an application using ML, and we will discuss it in more detail in the following subsections.
As shown in the preceding figure, the steps for building ML models are as follows.
Problem definition
The first step in any project is not only to understand the problem but also to define the problem that you are trying to solve using ML. A poorly defined problem results in a meaningless ML system, since the models will have been trained and optimized for a problem that you are not actually trying to solve. This first step is arguably the most important one in building useful ML models and applications. You should answer at least the following four questions before you jump into building ML models:
What is the problem?
This is where you describe and state the problem that you are trying to solve. For example, for a small business lending project, the problem might be that you need a system to assess a small business owner’s ability to pay back a loan.
Why is it a problem?
It is important to define why such a problem is actually a problem and why the new ML model is going to be useful. Maybe you already have a working model and have noticed that it is performing worse than before, maybe you have obtained new data sources that you can use to build a new prediction model, or maybe you want your existing model to produce predictions more quickly. There can be multiple reasons why you think this is a problem and why you need a new model. Defining why it is a problem will help you stay on the right track while you are building a new ML model.
What are some of the approaches to solving this problem?
This is where you brainstorm approaches to solving the given problem. You should think about how the model is going to be used (do you need a real-time system, or will it run as a batch process?), what type of problem it is (is it a classification problem, regression, clustering, or something else?), and what types of data you would need for your model. This will provide a good basis for the later steps in building your machine learning model.
What are the success criteria?
This is where you define your checkpoints. You should think about which metrics you will look at and what your target model performance should be. If you are building a model that is going to be used in a real-time system, then you can also set the target execution speed and data availability at runtime as part of your success criteria. Setting these success criteria will help you keep moving forward without getting stuck at a certain step.
Data collection
Data is the most essential and critical part of building an ML model, and preferably you have lots of it. No data, no model. Depending on your project, your approach to collecting data can vary. You can purchase existing data sources from other vendors, you can scrape websites and extract data from them, you can use publicly available data, or you can collect your own data. Whichever way you gather the data for your ML model, you need to keep two elements of your data in mind during data collection: the target variable and the feature variables. The target variable is the answer for your predictions, and the feature variables are the factors that your models will use to learn how to predict the target variable. Often, target variables are not present in a labeled form. For example, when you are working with Twitter data to predict the sentiment of each tweet, you might not have labeled sentiment data for each tweet. In this case, you will have to take an extra step to label your target variable. Once you have collected your data, you can move on to the data preparation step.
Data preparation
Once you have gathered all of your input data, you need to prepare it so that it is in a usable format. This step is more important than you might think. If you have messy data and do not clean it up for your learning algorithms, they will not learn well from your dataset and will not perform as expected. Likewise, even high-quality data is of no use if it is not in a format that your algorithms can be trained with. Bad data, bad model. You should handle at least the following common problems to have your data ready for the next steps.
File format
If you are getting your data from multiple data sources, you will most likely run into a different format for each source. Some data might be in CSV format, while other data is in JSON or XML format. Some data might even be stored in a relational database. In order to train your ML model, you will first need to merge all these data sources into one standard format.
Data format
Sometimes, data formats vary among different data sources. For example, some data might have the address field broken down into street address, city, state, and ZIP, while other data might not. Some data might have the date field in the American date format (mm/dd/yyyy), while other data might use the British format (dd/mm/yyyy). These discrepancies among data sources can cause issues when you are parsing the values. In order to train your ML model, you will need a uniform data format for each field.
Duplicate records
Often you will see the exact same records repeated in your dataset. This can happen in the data collection process, when a data point is recorded more than once, or in the data preparation process, when different datasets are merged. Duplicate records can adversely affect your model, so it is good to check for duplicates in your dataset before you move on to the next steps.
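To make the data format and duplicate record checks more concrete, here is a minimal sketch. It is written in Python rather than C# purely for brevity, and the rows, field names, and helper functions are all hypothetical. It assumes you already know which date convention each source uses, since a value such as 03/04/2021 is ambiguous on its own.

```python
from datetime import datetime

def normalize_date(value, source_format):
    """Parse a date string in a known source format and return yyyy-mm-dd."""
    return datetime.strptime(value, source_format).strftime("%Y-%m-%d")

def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving the original order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Hypothetical rows from two sources that use different date conventions.
us_rows = [{"customer": "A", "date": "03/04/2021"}]   # mm/dd/yyyy
uk_rows = [{"customer": "A", "date": "04/03/2021"},   # dd/mm/yyyy
           {"customer": "B", "date": "25/12/2021"}]

# Bring both sources into one uniform date format, then merge them.
merged = (
    [{**row, "date": normalize_date(row["date"], "%m/%d/%Y")} for row in us_rows]
    + [{**row, "date": normalize_date(row["date"], "%d/%m/%Y")} for row in uk_rows]
)
cleaned = drop_duplicates(merged)
print(cleaned)
```

Note that the duplicate only becomes visible after the dates are normalized, which is one reason to unify formats before checking for repeated records.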
Missing values
It is also common to see records with empty or missing values in the data. This can also have an adverse effect when you are training your ML models. There are multiple ways to handle missing values, but you have to be careful and understand your data very well, because the choice can change your model performance dramatically. Some of the options are dropping records with missing values, replacing missing values with the mean or median, replacing missing values with a constant, or replacing missing values with a placeholder and adding an indicator variable that flags the value as missing. It is worth studying your data before you decide how to deal with the missing values.
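As an illustration of two of these options combined, the following Python sketch (Python for brevity, with a hypothetical `ages` column and helper name) replaces missing values with the median and keeps an indicator variable that records which values were originally missing. It assumes missing entries are represented as `None` and that at least one value is present.

```python
from statistics import median

def impute_with_median(values):
    """Replace None entries with the median of the observed values.

    Returns the filled list plus a parallel 0/1 indicator list that marks
    which entries were originally missing.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing

ages = [34, None, 29, 41, None]  # hypothetical feature column
filled, was_missing = impute_with_median(ages)
print(filled)       # [34, 34, 29, 41, 34]
print(was_missing)  # [0, 1, 0, 0, 1]
```

Keeping the indicator alongside the filled column lets a model learn from the fact that a value was missing, rather than silently treating imputed values as real observations.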
Data analysis
Now that your data is ready, it is time to actually look at it and see whether you can recognize any patterns and draw some insights from it. Summary statistics and plots are two of the best ways to describe and understand your data. For continuous variables, looking at the minimum, maximum, mean, median, and quartiles is a good place to start. For categorical variables, you can look at the counts and percentages of the categories. As you are looking at these summary statistics, you can also start plotting graphs to visualize the structure of your data. The following figure shows some commonly used charts for data analysis. Histograms are frequently used to show and inspect the underlying distributions of variables, outliers, and skewness. Box plots are frequently used to visualize the five-number summary, outliers, and skewness. Pairwise scatter plots are frequently used to detect obvious pairwise correlations among the variables.
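The summary statistics mentioned above are quick to compute. Here is a minimal Python sketch (Python for brevity, using only the standard library; the function names and sample values are hypothetical). Quartile definitions differ between tools, so the `method="inclusive"` choice below is an assumption, not the only valid one.

```python
from collections import Counter
from statistics import mean, median, quantiles

def summarize_continuous(values):
    """Minimum, quartiles, median, mean, and maximum of a numeric column."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median(values),
        "mean": mean(values),
        "q3": q3,
        "max": max(values),
    }

def summarize_categorical(values):
    """Count and percentage for each category in a categorical column."""
    counts = Counter(values)
    total = len(values)
    return {cat: (count, 100 * count / total) for cat, count in counts.items()}

print(summarize_continuous([1, 2, 3, 4, 5]))
print(summarize_categorical(["a", "b", "a", "a"]))
```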
Feature engineering
Feature engineering is one of the most important parts of the model building process in applied ML. However, it is one of the least discussed topics in many textbooks and ML courses. Feature engineering is the process of transforming raw input data into more informative data for your algorithms to learn from. For example, for the Twitter sentiment prediction model that we will build later in Twitter Sentiment Analysis, your raw input data may only contain a list of texts in one column and a list of sentiment targets in another column. Your ML model will probably not learn how to predict each tweet’s sentiment well from this raw data. However, if you transform the data so that each column represents the number of occurrences of each word in each tweet, then your learning algorithm can learn the relationship between the presence of certain words and sentiments more easily. You can also group each word with its adjacent word (a bigram) and use the number of occurrences of each bigram in each tweet as another group of features. As you can see from this example, feature engineering is a way of making your raw data more representative and informative of the underlying problem. Feature engineering is not only a science but also an art. It requires good domain knowledge of the dataset, the creativity to build new features from raw input data, and multiple iterations for better results. The book goes on to cover how to build text features using natural language processing (NLP) techniques, how to build time series features, how to sub-select features to avoid overfitting, and how to use dimensionality reduction techniques to transform high-dimensional data into fewer dimensions.
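The word count and bigram features described above can be sketched in a few lines. This is a Python illustration for brevity, with hypothetical tweets and helper names. It splits on whitespace only, which is an assumption: real tweets would need proper tokenization and cleaning.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count the n-grams (n=1 for words, n=2 for bigrams) in one text."""
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

def to_feature_rows(texts, n):
    """Turn texts into rows of counts, one column per n-gram in the vocabulary."""
    counts = [ngram_counts(text, n) for text in texts]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

tweets = ["good good day", "bad day"]  # hypothetical raw input

word_vocab, word_rows = to_feature_rows(tweets, 1)
print(word_vocab)  # ['bad', 'day', 'good']
print(word_rows)   # [[0, 1, 2], [1, 1, 0]]

bigram_vocab, bigram_rows = to_feature_rows(tweets, 2)
print(bigram_vocab)  # ['bad day', 'good day', 'good good']
print(bigram_rows)   # [[0, 1, 1], [1, 0, 0]]
```

Each row is now a numeric vector that a learning algorithm can consume, and the word and bigram columns can be concatenated into a single feature set.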
Train and test ML algorithms
Once you have created your features, it is time to train and test some ML algorithms. Before you start training your models, it is good to think about performance metrics. Your choice of performance measure depends on the problem you are solving. For example, if you are building a stock price forecast model, you might want to minimize the difference between your prediction and the actual price and choose root mean square error (RMSE) as your performance measure. If you are building a credit model to predict whether a person should be approved for a loan, you might want to use precision (the share of predicted approvals that are correct) as your performance measure, since incorrect loan approvals (false positives) have a more negative impact than incorrect loan disapprovals (false negatives). Once you have specific performance measures in mind, you can train and test various learning algorithms and compare their performance. Your choice of learning algorithm also depends on your prediction target. The following figure illustrates some of the common machine learning problems. If you are solving a classification problem, you would want to train classifiers, such as a logistic regression model, a Naive Bayes classifier, or a random forest classifier. On the other hand, if you have a continuous target variable, you would want to train regressors, such as a linear regression model, k-nearest neighbors, or a support vector machine (SVM). If you would like to draw insights from data using unsupervised learning, you would want to use algorithms such as k-means clustering or mean shift.
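The two performance measures named above are simple to compute by hand. The following Python sketch (Python for brevity; function names and sample numbers are hypothetical) shows RMSE for a continuous target and precision for a binary target, where the label 1 is assumed to mean “approve”.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

def precision(actual, predicted, positive=1):
    """Share of positive predictions that are actually positive."""
    true_pos = sum(1 for a, p in zip(actual, predicted)
                   if p == positive and a == positive)
    false_pos = sum(1 for a, p in zip(actual, predicted)
                    if p == positive and a != positive)
    predicted_pos = true_pos + false_pos
    # With no positive predictions precision is undefined; return 0.0 here.
    return true_pos / predicted_pos if predicted_pos else 0.0

# Hypothetical prices: both predictions are off by 2.
print(rmse([0, 0], [2, 2]))  # 2.0

# Hypothetical loan decisions: three approvals predicted, two of them correct.
print(precision([1, 0, 1, 0], [1, 1, 1, 0]))
```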
Lastly, we have to think about how to test and evaluate the performance of the learning algorithms we tried. Splitting your dataset into train and test sets and running cross-validation are the two most commonly used methods for testing and comparing ML models. The purpose of splitting a dataset into two subsets, one for training and another for testing, is to train a model on the train set without exposing it to the test set, so that prediction results on the test set are indicative of the model’s general performance on unseen data. K-fold cross-validation is another way to evaluate model performance. It splits a dataset into K equally sized subsets (folds) and, in each of K iterations, holds one fold out for testing and trains on the rest. For example, in 3-fold cross-validation, a dataset is first split into three equally sized subsets. In the first iteration, we use folds #1 and #2 to train our model and test it on fold #3. In the second iteration, we use folds #1 and #3 to train our model and test it on fold #2. In the third iteration, we use folds #2 and #3 to train our model and test it on fold #1. Then, we average the performance measures to estimate the model performance.
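The K-fold procedure above can be sketched as follows. This is a Python illustration for brevity, and `k_fold_indices`, `cross_validate`, and the `fit_and_score` callback are hypothetical names. The sketch assumes the data has already been shuffled, and it holds out the folds in order (first, second, third), which does not change the averaged result.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(n_samples, k, fit_and_score):
    """Average the score returned by fit_and_score(train, test) over k folds."""
    scores = [fit_and_score(train, test)
              for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

for train, test in k_fold_indices(6, 3):
    print("train:", train, "test:", test)
```

In practice, `fit_and_score` would train a model on the rows in `train`, evaluate it on the rows in `test`, and return the chosen performance measure.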
Improve results
By now, you will have one or two candidate models that perform reasonably well, but there might still be some room to improve. Maybe you noticed that your candidate models are overfitting to some extent, maybe they do not meet your target performance, or maybe you have some more time to iterate on your models. Regardless of your reason, there are multiple ways to improve the performance of your model:
Hyperparameter tuning
You can tune the configuration of your models to potentially improve their performance. For example, for random forest models, you can tune the maximum depth of the trees or the number of trees in the forest. For SVMs, you can tune the kernel or the cost value.
Ensemble methods
Ensembling combines the results of multiple models to get better results. Bagging is where you train the same algorithm on different subsets of your dataset and combine the results. Boosting is where you train models one after another, with each new model focusing on the examples that the previous models predicted poorly. Stacking is where the outputs of several models are used as the input to a meta model that learns how to combine the results of the sub-models.
More feature engineering
Iterating on feature engineering is another way to improve model performance.
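The simplest way to combine classifiers is a majority vote, which is the combining step that bagging typically uses for classification. The Python sketch below (Python for brevity; the function name and predictions are hypothetical) shows only that combining step, not the training of the individual models, and it is not an implementation of boosting or stacking.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label predictions into one label per sample.

    predictions_per_model holds one list of labels per model, all of the
    same length. Ties go to the label that appears first among the models.
    """
    combined = []
    for sample_votes in zip(*predictions_per_model):
        label, _ = Counter(sample_votes).most_common(1)[0]
        combined.append(label)
    return combined

# Hypothetical predictions from three models on the same three samples.
model_a = [1, 0, 0]
model_b = [1, 1, 0]
model_c = [0, 1, 0]
print(majority_vote([model_a, model_b, model_c]))  # [1, 1, 0]
```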
Deploy
Time to put your models into action! Once you have your models ready, it is time to let them run in production. Make sure you test extensively before your models take full charge. It is also worth planning monitoring tools for your models, since model performance can decrease over time as the input data evolves.
These are the basics of machine learning modeling. I hope you found this article useful. In the next article, we will take a detailed look at setting up a C# environment for machine learning. Feel free to share your feedback in the comments section.