Machine learning (ML) is gaining more and more traction. Companies are increasingly willing to use machine learning models in production for their day-to-day business. Fraud detection, recommender systems and object detection are just a few use cases that are widely adopted by enterprises. With the increase in demand for machine learning, new challenges arise. How can we create, serve and maintain machine learning models in a production environment? In this blog we will take a look at how automation can play a role in the creation, delivery and maintenance of machine learning models. What steps can we automate and what tools can we use to achieve automation? Is full automation even achievable?
Manual approach
Machine learning projects generally share the same components. You start with data extraction, continue with pre-processing the data, and then perform exploratory data analysis. Once you understand the data you are working with, you start training a model. After this step you evaluate and validate the trained model and decide whether it is “good enough” to be deployed to production. If the model needs some tweaking, more data or more training, you repeat the steps until “good enough” is achieved.
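The loop described above can be sketched in a few lines of Python. This is only an illustration, not the workflow of any particular team: the synthetic data source, the tiny linear model, the function names (extract_data, train, evaluate, run_experiment) and the 0.05 error threshold are all assumptions made for the example.

```python
import random


def extract_data(n=200, seed=0):
    """Hypothetical data source: noisy points around y = 3x + 2."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        rows.append((x, 3 * x + 2 + rng.gauss(0, 0.1)))
    return rows


def train(rows, epochs, lr=0.1):
    """Fit y = w*x + b with plain gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in rows) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in rows) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def evaluate(model, rows):
    """Mean squared error of the model on held-out rows."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in rows) / len(rows)


def run_experiment(threshold=0.05, max_rounds=10):
    """Repeat train and evaluate until the model is 'good enough'."""
    rows = extract_data()
    train_rows, test_rows = rows[:150], rows[150:]
    epochs = 5
    for round_no in range(1, max_rounds + 1):
        model = train(train_rows, epochs)
        mse = evaluate(model, test_rows)
        if mse <= threshold:
            return model, mse, round_no
        epochs *= 2  # "more training" before the next attempt
    raise RuntimeError("no acceptable model found")


if __name__ == "__main__":
    model, mse, rounds = run_experiment()
    print(f"accepted after {rounds} round(s), test MSE = {mse:.4f}")
```

In a notebook, each of these functions is typically a cell that someone reruns by hand, which is exactly the part that automation later takes over.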
Usually the process described in Figure 1 consists of a data scientist or researcher manually running notebooks until an acceptable model is produced. This manual approach works great in an experimental environment where the outcome or goal is uncertain and changes occur rapidly. It allows the data scientist to adapt quickly and make changes without hurting production. Once the acceptable model is ready, it is thrown over the fence to the engineers who are going to deploy and serve the model in production.
However, this approach poses some issues which can severely halt or even hurt the models in production:
- The data scientists and ops engineers are on two different islands, each doing their own job, which can cause miscommunication between two important parts of the model lifecycle and contribute to training-serving skew (the model behaving differently in production than it did in training).
- It leads to infrequent model updates.
- There is a lack of monitoring of the model.
- Neither continuous integration (CI: automated build, test and packaging of source code) nor continuous delivery (CD: automated delivery of the application to selected infrastructure) is leveraged, which makes the development of the model and the surrounding infrastructure really hard.
- It is really difficult to promote a model to production under these circumstances.
Automated ML
The idea of automated ML is not new. In traditional business intelligence (BI), pipelines are already automated, complete with CI/CD, versioning, release schedules, performance measurement, etc.
The majority of the steps within traditional BI projects and ML projects are comparable. You need to extract the data, do some transformation and eventually load the data for further use.
The goal of an automated ML pipeline is continuous training, which enables continuous delivery of production grade models. There are a few other nice changes that will happen along the way:
- Code versioning: The code created by the data scientist will be versioned. This makes developing models in teams much easier and provides some backup and rollback capabilities.
- Modularity: Modular components make your pipeline very adaptable. For instance, if you need to change a part of the pipeline, you can remove and replace that component without having to change the other components.
- Automation triggers: With these triggers the pipeline will start the processes as needed. For instance, when new code is merged into the master branch you can start the pipeline and retrain and deploy the model.
- Continuous deployment: Because we have the ability to start the training and deployment processes when needed, we have continuous deployment. When a trigger event defined by us occurs, the pipeline automatically starts and produces a new production model.
- Model performance monitoring: An important part is to monitor model performance. There are some nice tools out there that can help with this process. Usually you will use a mix of different tools, for example MLflow to store the model and metrics, and Grafana for the monitoring part. The idea is to push the metrics and the model to a location where the model can be monitored. You can add alerts and triggers for when something unexpected happens.
- Rapid deployment: Because we have the majority of the pipeline automated, time to production is reduced significantly. Changes can be made without updating the whole pipeline. When new data arrives, it can be scored immediately. Also, when this new data changes the expected outcome, we can retrain the model.
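To make the modularity and trigger ideas above concrete, here is a minimal sketch in plain Python. It does not use a real orchestration tool; the Pipeline class, the step functions, the event dictionaries and the 0.8 F1 threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Context = Dict[str, Any]
Step = Callable[[Context], Context]
Trigger = Callable[[Dict[str, Any]], bool]


@dataclass
class Pipeline:
    """Ordered, named steps; each step can be swapped without touching the others."""
    steps: Dict[str, Step] = field(default_factory=dict)

    def replace(self, name: str, step: Step) -> None:
        if name not in self.steps:
            raise KeyError(name)
        self.steps[name] = step  # dict keeps the original position of the key

    def run(self, context: Optional[Context] = None) -> Context:
        ctx: Context = dict(context or {})
        for name, step in self.steps.items():
            ctx = step(ctx)
            ctx.setdefault("history", []).append(name)
        return ctx


# Toy steps standing in for real extraction, training and deployment code.
def extract(ctx: Context) -> Context:
    ctx["rows"] = [1, 2, 3, 4]
    return ctx


def train(ctx: Context) -> Context:
    ctx["model"] = sum(ctx["rows"]) / len(ctx["rows"])
    return ctx


def deploy(ctx: Context) -> Context:
    ctx["deployed"] = True
    return ctx


# Triggers decide whether an incoming event should start the pipeline.
def on_merge_to_master(event: Dict[str, Any]) -> bool:
    return event.get("type") == "merge" and event.get("branch") == "master"


def on_metric_drop(event: Dict[str, Any]) -> bool:
    return event.get("type") == "metric" and event.get("f1", 1.0) < 0.8


TRIGGERS: List[Trigger] = [on_merge_to_master, on_metric_drop]


def build_pipeline() -> Pipeline:
    return Pipeline({"extract": extract, "train": train, "deploy": deploy})


def handle_event(event: Dict[str, Any], pipeline: Pipeline,
                 triggers: List[Trigger]) -> Optional[Context]:
    """Run the pipeline only if at least one trigger fires for this event."""
    if any(trigger(event) for trigger in triggers):
        return pipeline.run({"event": event})
    return None
```

Calling handle_event with a merge event on the master branch runs all three steps, while a merge on any other branch does nothing. Swapping the training step is a single pipeline.replace("train", new_step) call, which is the kind of adaptability the modularity bullet refers to.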
A Client Case Study
Originally the client would build a model for their own client at the start of the onboarding process and use that model for years. The client's industry is very prone to small changes, and newly discovered events are crucial to the model (the datasets in this industry are heavily skewed). They already had a frontend tool that made it possible for a domain expert to label data. However, this data was never used because the model was built at the beginning of the process and barely updated. They also had some other difficulties, such as extracting and preparing the data so that it could be used for modeling. We proposed and delivered the following for this client:
- Automated their data extraction and data preparation; the languages of choice were .NET Core and R.
- Introduced version control.
- Built a modular training and validation script for the model.
- Set up scheduled intervals to start the data extraction process.
- Created triggers to initiate training and validation steps.
- Added monitoring to the model for various model performance metrics (recall, precision, F1 score) and metadata (amount of new labeled data, amount of new data, version of the model, etc.).
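As an illustration of the monitoring item above, the sketch below computes precision, recall and F1 for a binary classifier and bundles them with metadata into one record that could be pushed to a monitoring store. It is written in Python for readability (the client's own stack was .NET Core and R), and the function names, field names and sample labels are assumptions made for this example.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def monitoring_record(model_version, y_true, y_pred, new_rows, new_labeled_rows):
    """One record per evaluation run: performance metrics plus metadata."""
    record = {
        "model_version": model_version,
        "new_rows": new_rows,
        "new_labeled_rows": new_labeled_rows,
    }
    record.update(classification_metrics(y_true, y_pred))
    return record


if __name__ == "__main__":
    # Made-up labels purely to exercise the functions.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
    print(monitoring_record("v2", y_true, y_pred, new_rows=500, new_labeled_rows=40))
```

On skewed datasets like the ones mentioned here, precision and recall on the rare class are usually more informative than plain accuracy, which is why they are the metrics tracked.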
With this solution the client improved their ML capabilities significantly. The models are improved regularly and kept up to date, thus providing more value to their customers.
Summary: Automated ML and You
How can a company take the very first step to automating a pipeline? The first step to implementing automated ML in your own business is to introduce DevOps in your machine learning teams. It is a good first step because of the control that machine learning teams gain over their application with DevOps. This will increase the speed to production, make it easier to deploy new changes and, in the end, result in models that are kept up to date.
Hiring some software engineers and computer scientists for your machine learning teams can also improve the overall quality of the code and will result in much more efficient resource usage. They also have the knowledge to build frontends or APIs for the end users.
One of the most challenging issues I encounter is resource related. Due to the nature of machine learning (big data sets, computationally heavy workloads, etc.), memory and CPU related issues are very common. The simplest and most common solution is to increase the memory. This can be the best solution in some situations, but usually it is better to run a root cause analysis and find the memory leak in the application. There are a few tricks that you can use to mitigate this, for example using your disk storage as an extension of your memory, processing in batches, and not keeping unneeded variables in memory. There are some nice tools (code profilers) that can help with finding memory leaks in your code.
These are some suggestions to improve the capabilities of your machine learning teams and increase the chance of success for machine learning projects. Most importantly, they will leave more time for the data scientists and data engineers to work on other cool projects!
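Two of the tricks mentioned above, processing in batches and measuring memory with a profiler, can be sketched with the Python standard library alone. The helper names (in_batches, mean_in_batches, peak_memory_bytes) and the batch size are assumptions for this example; tracemalloc is the standard-library memory tracer and is just one of several profilers you could use.

```python
import tracemalloc
from itertools import islice


def in_batches(iterable, size):
    """Yield lists of at most `size` items without materialising the whole input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def mean_in_batches(values, batch_size=1000):
    """Compute a mean while holding only one batch in memory at a time."""
    total, count = 0.0, 0
    for batch in in_batches(values, batch_size):
        total += sum(batch)
        count += len(batch)
    return total / count if count else 0.0


def peak_memory_bytes(func, *args, **kwargs):
    """Run func and return (result, peak bytes allocated while it ran)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


if __name__ == "__main__":
    result, peak = peak_memory_bytes(mean_in_batches, range(1_000_000))
    print(f"mean = {result}, peak traced memory = {peak} bytes")
```

Comparing the peak for the batched version with the peak for a version that first builds the full list is a quick way to check whether batching actually helps for your workload before reaching for more memory.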
The results of this project will appear in an upcoming publication by Mohammed Dahou, so watch out for further exciting details of this work!
Mohammed Dahou is a “Data scientist with an interest in software development and technologies. Sport enthusiast that loves to watch and play soccer. Background in information management and software engineering. Bridge between data science/business and engineering/IT.”