The machine learning process involves multiple steps, from problem identification to deploying to production, and even retraining the model or building pipelines. All of this sounds quite complicated, and it is, but if you’re a beginner and want to learn how to deploy a simple Machine Learning model, this is the right place to learn it.
The tools and steps we’ll be using and following are:
- Find a simple problem to solve using an ML model
- Use Google Colab to import data samples and create the EDA and model
- Create our Exploratory Data Analysis
- Build a simple linear regression model to predict the outcome
- Save/Serialize the model to use it in production
- Create Project Structure (boilerplate) to work with Flask
- Create the API endpoints with Flask
- Deploy the model to a real server using DigitalOcean
Step #1 - Find a simple problem to solve using an ML model
The goal of this post is not to solve a hard problem. It is to walk through the whole process of deploying a Machine Learning model to a production system, with an end-to-end perspective. For that reason we chose the simplest problem possible: “Predict the salary of an engineer given their years of experience”.
Step #2 - Use Google Colab to create data, EDA and Model
The best way to save time and resources is to use Google Colab to create our data, EDA and model. The simplest way to keep things organized is to go to your Google Drive, click on “Connect more apps”, search for Colaboratory, and follow the next screenshots.
Note: if you don’t know what Colab is, please refer to this link: https://research.google.com/colaboratory/faq.html
You are ready. Now let’s create our first Notebook
This will open the next window
Rename the project
Now let’s create our data samples. Go to the following link and download the CSV. Upload that file to your Google Drive, inside the same folder where you created the Google Colab file.
Now we have to connect the CSV file to Google Colab. To do this, follow the next tutorial; it’s quite simple and a necessary step to read the data.
Once everything is done you should see this folder tree (with your own files)
Copy that path; it will help us import the data. Then copy the following commands into your Google Colab notebook to check that everything is working well.
```python
import pandas as pd

csv_in_drive = "/content/drive/MyDrive/your_path/salary.csv"
df = pd.read_csv(csv_in_drive)
df.head()
```
We can now see the table with the CSV data. It works!
The problem statement is very clear about our purpose, and we want to keep it as simple as possible. For this reason we’ll be using just over 20 historical records with only 2 columns: “YearsExperience” and “Salary”.
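If you want to see what such a two-column dataset looks like, here is a short sketch that builds an equivalent CSV with pandas. The values below are made up for illustration; use the downloaded salary.csv for the actual tutorial.

```python
import pandas as pd

# Made-up sample rows mirroring the two-column layout of salary.csv
data = {
    "YearsExperience": [1.1, 2.0, 3.2, 4.5, 5.9],
    "Salary": [39343, 43525, 54445, 61111, 81363],
}
df_sample = pd.DataFrame(data)

# Write it out the same way the tutorial file is stored
df_sample.to_csv("salary_sample.csv", index=False)
print(df_sample.shape)  # (5, 2)
```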
Let’s tune our Google Colab to dark mode!
Step #3 - Create our Exploratory Data Analysis
For this step you can download my Google Colab here: https://colab.research.google.com/drive/13o5u6Hwvs1GlnTL7EVRyLovQZQiqUhN3?usp=sharing , but remember to change the paths to your own files and follow along with this description. I’ll copy and paste the Python code here; take your time to understand what I explained in the notebook.
```python
import pandas as pd

csv_in_drive = "/content/drive/MyDrive/6- Marketing/1-Blog Posts/1- Salary Prediction Blog Post/salary.csv"
df = pd.read_csv(csv_in_drive)
df.head()

"""# Importing Libraries"""

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

"""# EDAs

This is a simple Exploratory Data Analysis
"""

# To check the amount of data
df.shape

# To check if we have any null values in the dataset
df.isnull().values.any()

## Dividing data
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

## train_set copy
df_copy = train_set.copy()

## Exploratory analysis
df_copy.describe()

# Search for correlations
df_copy.corr()
df_copy.plot.scatter(x='YearsExperience', y='Salary')

"""We have intentionally created a highly correlated dataset to build a simple linear regression"""
sns.regplot(x='YearsExperience', y='Salary', data=df_copy)
```
Remember you can access my Google Colab here: https://colab.research.google.com/drive/13o5u6Hwvs1GlnTL7EVRyLovQZQiqUhN3?usp=sharing
Step #4 - Build a simple linear regression model to predict the outcome
Now I'm going to create the model.
```python
"""# Building the Model

Now that we have highly correlated data, we want to build the model
"""

## Building the model
test_set_full = test_set.copy()
test_set = test_set.drop(["Salary"], axis=1)
test_set.head()

train_labels = train_set["Salary"]
train_labels.head()

## With train data
train_set_full = train_set.copy()
train_set = train_set.drop(["Salary"], axis=1)
train_set.head()

import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

lin_reg = LinearRegression()
lin_reg.fit(train_set, train_labels)

print("Coefficients", lin_reg.coef_)
print("Intercept", lin_reg.intercept_)

salary_pred = lin_reg.predict(test_set)
print(salary_pred)
print(test_set_full['Salary'])

"""Seems like we have good results comparing 'real' data with the 'predicted' data"""

# Let's check the scores
lin_reg.score(test_set, test_set_full["Salary"])
r2_score(test_set_full["Salary"], salary_pred)

plt.scatter(test_set_full["YearsExperience"], test_set_full["Salary"], color="blue")
plt.plot(test_set_full["YearsExperience"], salary_pred, color="red", linewidth=2)
```
With this we have a very simple model; it will help us understand how to deploy one.
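Once the regression is fitted, getting a single prediction is one call to `predict`. Here is a self-contained sketch on made-up, perfectly linear data (not the tutorial's dataset), just to show the call shape:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical, perfectly linear data: salary = 30000 + 10000 * years
X = np.array([[1.0], [2.0], [3.0], [4.0]])   # years of experience
y = np.array([40000.0, 50000.0, 60000.0, 70000.0])

lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Predict the salary for an engineer with 5 years of experience;
# note the 2-D shape: one row, one feature
pred = lin_reg.predict([[5.0]])
print(round(pred[0]))  # 80000 on this synthetic data
```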
Step #5 - Save/Serialize the model to use it in production
Now I'm going to save/serialize the model.
```python
"""# Persisting the Model"""

## Model persistence
import pickle

model_path = "/content/drive/MyDrive/6- Marketing/1-Blog Posts/1- Salary Prediction Blog Post/python_lin_reg_model.pkl"

with open(model_path, "wb") as file_handler:
    pickle.dump(lin_reg, file_handler)

with open(model_path, "rb") as file_handler:
    loaded_pickle = pickle.load(file_handler)

loaded_pickle

# Just for testing purposes, let's use joblib instead of pickle:
import joblib
joblib.dump(lin_reg, "/content/drive/MyDrive/6- Marketing/1-Blog Posts/1- Salary Prediction Blog Post/linear_regression_model.pkl")

"""# Conclusion

With this we end a very simple model; this will help us understand how to deploy the result to production
"""
```
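As a sanity check, you can verify the round trip (dump, then load, then predict). This self-contained sketch uses a throwaway model rather than the notebook's `lin_reg`, so it runs anywhere:

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Throwaway model so the sketch runs outside the notebook
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([40000.0, 50000.0, 60000.0])
model = LinearRegression().fit(X, y)

# Serialize, then load it back the same way production code would
with open("lin_reg_demo.pkl", "wb") as fh:
    pickle.dump(model, fh)
with open("lin_reg_demo.pkl", "rb") as fh:
    loaded = pickle.load(fh)

# The loaded model produces identical predictions
print(np.allclose(loaded.predict(X), model.predict(X)))  # True
```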
Step #6 - Create Project Structure (boilerplate) to work with Flask
Great, we have the EDA and the model ready, and we even have the .pkl file with the result, so now we have to create the service that exposes it to the real world. But before that we have to organize the structure of the Flask project. This gives us a common pattern for where files live, and lets us extend the project to add more models under the same server.
So let's start by creating our folder structure. Open a terminal and type the following. If you don't know what a terminal is, go here to learn about it.
```shell
$ mkdir 1-simple-salary
$ cd 1-simple-salary
```
Then let’s install some packages. The first one is virtualenv. If you don’t know what a virtualenv is, please refer to this resource.
```shell
$ pip3 install virtualenv
$ virtualenv .
$ source bin/activate
```
Now let’s use Git and GitHub to track changes. It is very important to use Git now because the deployment will read the changes from our GitHub repository. So, create a repository on GitHub and then initialize Git on our machine:
```shell
$ git init
$ git add .
$ git commit -m "first commit"
$ git branch -M main
$ git remote add origin [email protected]:your_user/project_name.git
$ git push origin main
```
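With the project structure in place, the next step from the list is the Flask API itself. Here is a minimal sketch of what that service could look like; the file name (app.py), the /predict route, and the JSON field name are assumptions you can adapt. To keep the sketch self-contained it trains and pickles a tiny throwaway model inline, whereas in the real project you would only load the .pkl file you saved from Colab.

```python
# app.py -- a minimal sketch of the prediction service
import pickle

import numpy as np
from flask import Flask, jsonify, request
from sklearn.linear_model import LinearRegression

# Self-contained demo only: train and pickle a tiny model here.
# In the real project, skip this and just load your saved .pkl file.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([40000.0, 50000.0, 60000.0])
with open("linear_regression_model.pkl", "wb") as fh:
    pickle.dump(LinearRegression().fit(X, y), fh)

app = Flask(__name__)

# Load the serialized model once, at startup
with open("linear_regression_model.pkl", "rb") as fh:
    model = pickle.load(fh)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"years_experience": 5}
    years = float(request.get_json()["years_experience"])
    salary = float(model.predict([[years]])[0])
    return jsonify({"predicted_salary": salary})
```

You could then run it locally with `flask --app app run` and query it with a POST request to http://127.0.0.1:5000/predict carrying a JSON body such as {"years_experience": 5}.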
- ORIGINAL COLAB FILE: https://colab.research.google.com/drive/1wEaJNVUKb6zigl0AWYuoLjy_CkyaIiMW
- BLOG POST COLAB FILE:
- SLIDE FILE: https://docs.google.com/presentation/d/1WCwaAq4d4zowfgtCU1dVaQYy8AwnwuXk/edit#slide=id.p7