Evaluating MLOps Tools

Machine learning lifecycle

This section describes a generic pipeline, which is a common use case for real-world modeling initiatives. We also rely on this pipeline when testing existing tools and conducting our own experiments.

It is based on Hapke, H. and Nelson, C., Building Machine Learning Pipelines: Automating Model Life Cycles with TensorFlow (O’Reilly Media, Inc., 2020).

Let’s look at the steps in this pipeline in more detail.

Data Ingestion and Data Versioning

Here, we process the data and put it into a format suitable for the next steps in the pipeline. Feature engineering is not part of this stage; it is generally performed in later steps, after data validation. The incoming data is also versioned at this stage, so that a data snapshot can be linked to the model trained on it.

Data Validation

Data validation checks the new data across multiple factors, including the range, number of categories, and distribution of categories. The data scientist is alerted if any anomalies are detected or if the class balance changes significantly, which is important for classification problems. To resolve these issues, the data scientist or machine learning (ML) engineer may take different measures, such as changing the loss function, using oversampling, undersampling, or more advanced techniques to tackle data sampling problems, starting a new modeling or model building pipeline, or even initiating a whole new lifecycle.
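The checks described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the method of any particular validation library: the function name `validate_batch`, the record layout, the reference class shares, and the 0.10 tolerance are all assumptions made for the example.

```python
from collections import Counter

def validate_batch(rows, label_key, numeric_ranges, reference_balance, tolerance=0.10):
    """Return a list of human-readable anomaly messages for a batch of records."""
    if not rows:
        return ["empty batch"]
    alerts = []
    # Range check: every numeric column must stay inside its expected interval.
    for column, (low, high) in numeric_ranges.items():
        for index, row in enumerate(rows):
            value = row[column]
            if not low <= value <= high:
                alerts.append(f"row {index}: {column}={value} outside [{low}, {high}]")
    # Category check: classes that were never seen in the reference data.
    counts = Counter(row[label_key] for row in rows)
    unseen = set(counts) - set(reference_balance)
    if unseen:
        alerts.append(f"unseen classes: {sorted(unseen)}")
    # Class balance check: compare observed shares with the reference shares.
    total = sum(counts.values())
    for label, expected in reference_balance.items():
        observed = counts.get(label, 0) / total
        if abs(observed - expected) > tolerance:
            alerts.append(f"class {label!r}: share {observed:.2f} vs. reference {expected:.2f}")
    return alerts

# Illustrative batch (hypothetical data): one out-of-range value, shifted class balance.
rows = [
    {"age": 34, "label": "b2b"},
    {"age": 210, "label": "b2c"},
    {"age": 51, "label": "b2c"},
    {"age": 29, "label": "b2c"},
]
alerts = validate_batch(rows, "label", {"age": (0, 120)}, {"b2b": 0.5, "b2c": 0.5})
```

In a pipeline, a non-empty `alerts` list would typically stop the run and notify the data scientist rather than let training continue on suspicious data.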

Data Preprocessing

Data preprocessing comprises multiple steps, the number of which depends on the modeling tasks and data characteristics. Those steps may include (but are not limited to) data cleansing, data imputation, feature extraction, vectorization, or encoding both data and labels to match the input requirements. They may range from simple scripts to extensive sequences or graphs of steps. Since preprocessing is only required before model training and not with every training round, it makes the most sense to run the preprocessing as its own life cycle step before training the model.

While most data scientists focus on the processing capabilities of their preferred tools, it is also important to link modifications made to the preprocessing steps with the processed data and vice versa. If someone modifies a processing step (e.g. allowing an additional label in a one-hot vector conversion), the previous training data is no longer valid and the entire pipeline must be updated. Such change management can be supported by manual or automated management tools.
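One simple way to link a preprocessing configuration to the data it produced is to store a fingerprint of the configuration next to the dataset. The sketch below is only an illustration under stated assumptions: the manifest layout, the configuration keys, and the helper names `fingerprint` and `is_stale` are hypothetical.

```python
import hashlib
import json

def fingerprint(obj):
    """Stable SHA-256 hex digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def is_stale(dataset_manifest, current_config):
    """True if the dataset was produced by a different preprocessing configuration."""
    return dataset_manifest["preprocessing_fingerprint"] != fingerprint(current_config)

# Hypothetical configuration used when the training data was produced.
config_v1 = {"one_hot_labels": ["b2b", "b2c"], "impute": "median"}
manifest = {
    "dataset": "train-2020-11",
    "preprocessing_fingerprint": fingerprint(config_v1),
}

# Someone later allows an additional label in the one-hot conversion.
config_v2 = {"one_hot_labels": ["b2b", "b2c", "unknown"], "impute": "median"}
```

With this in place, `is_stale(manifest, config_v2)` signals that the stored training data no longer matches the preprocessing code and the pipeline has to be rerun.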

Model Training and Tuning

In this step, we train a model to take predefined inputs and predict an output with the lowest possible error. Large models, large datasets, and memory and storage constraints can easily make this step difficult to manage. Model training should therefore be efficient; recent developments and speedups in optimization can help to achieve this without sacrificing too much predictive performance.

Model tuning is also relevant to performance improvements and has always received a significant amount of attention from both researchers and practitioners. The initial tuning can be performed at the experiment planning phase, while designing machine learning pipelines, or even during the training process. In the last case, the whole tuning process is directly integrated into the pipeline; automated machine learning can help to obtain the best architecture or set of hyperparameters automatically, or even optimize the whole pipeline. Tuning is also very scalable and can be parallelized, meaning that many models can be trained and evaluated at the same time in order to select the best-performing one.
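The idea of evaluating many candidates at the same time and keeping the best one can be sketched as a parallel grid search. This is a toy illustration: `validation_error` is a hypothetical stand-in for "train a model and return its validation error", and the grid values are arbitrary. Threads are used only to keep the sketch self-contained; real training jobs would usually run in separate processes or on a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def validation_error(params):
    """Hypothetical stand-in for training a model and measuring its validation error."""
    learning_rate, depth = params["learning_rate"], params["depth"]
    return (learning_rate - 0.1) ** 2 + (depth - 6) ** 2 / 100

def grid_search(grid, objective, max_workers=4):
    """Evaluate every combination in the grid concurrently and return the best one."""
    names = sorted(grid)
    candidates = [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        errors = list(pool.map(objective, candidates))  # results keep candidate order
    best_index = min(range(len(candidates)), key=errors.__getitem__)
    return candidates[best_index], errors[best_index]

grid = {"learning_rate": [0.01, 0.1, 0.5], "depth": [3, 6, 9]}
best_params, best_error = grid_search(grid, validation_error)
```

Exhaustive grid search is the simplest strategy; dedicated tuners replace the candidate generation step with smarter search while keeping the same evaluate-in-parallel structure.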

Model Analysis

While accuracy and/or loss functions are generally used in the training process as basic performance indicators, more extensive metrics can be used to carry out more in-depth analysis, including precision, recall, and AUC (area under the curve), or calculating performance on a testing dataset (which is not processed during previous steps). Fairness analysis can indicate if the model will perform well for different groups of users if the dataset is sliced and the performance is calculated for each slice. We can also investigate the dependence on features used in training and explore how the model’s predictions would change if we altered a single training example’s features. This is the main aim of explainability analysis and a human generally performs this step. However, it is possible to automate it to the level of a final review and confirmation step.
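Slicing a dataset and computing a metric per slice, as described above for fairness analysis, might look like the following sketch. The record layout (`y_true`, `y_pred`, and a `region` attribute) and the sample values are assumptions made for the example.

```python
from collections import defaultdict

def precision_recall(pairs):
    """Compute (precision, recall) from (y_true, y_pred) pairs, with 1 as the positive class."""
    tp = fp = fn = 0
    for y_true, y_pred in pairs:
        if y_pred == 1 and y_true == 1:
            tp += 1
        elif y_pred == 1:
            fp += 1
        elif y_true == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def metrics_by_slice(records, slice_key):
    """Group records by the value of slice_key and compute the metrics per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[slice_key]].append((record["y_true"], record["y_pred"]))
    return {name: precision_recall(pairs) for name, pairs in groups.items()}

# Hypothetical evaluation records for two user groups.
records = [
    {"region": "A", "y_true": 1, "y_pred": 1},
    {"region": "A", "y_true": 1, "y_pred": 1},
    {"region": "A", "y_true": 0, "y_pred": 0},
    {"region": "A", "y_true": 0, "y_pred": 1},
    {"region": "B", "y_true": 1, "y_pred": 0},
    {"region": "B", "y_true": 1, "y_pred": 1},
    {"region": "B", "y_true": 0, "y_pred": 0},
    {"region": "B", "y_true": 1, "y_pred": 0},
]
per_slice = metrics_by_slice(records, "region")
```

In this sample, recall is much lower for group B than for group A, which is the kind of gap a fairness review would flag even if the overall metrics look acceptable.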

Model Validation

The purpose of the model validation step is to keep track of which model, hyperparameters, and datasets have been selected for the next, deployable version. Model versioning enables incremental improvement and continuous integration processes for the given model, and semantic versioning allows you to track and manage these improvements. In software engineering, semantic versioning requires you to increase the major version number when you make an incompatible change to your API, the minor version number when you add functionality in a backward-compatible manner, and the patch version number for backward-compatible bug fixes. Model release management can also benefit from dataset information: a more extensive dataset, containing more instances or having better data quality, may significantly change the model performance without changing a single model parameter.

It is essential to document all inputs into a new model version (hyperparameters, datasets, architecture) and track them as part of this release step.
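A release record that combines a semantic version with the documented inputs could be sketched as follows. The mapping from model changes to version bumps (`POLICY`) is a hypothetical convention chosen for illustration, not an established standard, and the field names and sample values are assumptions.

```python
def bump(version, change):
    """Return the next semantic version string for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")

# Hypothetical policy: which kind of model change triggers which bump.
POLICY = {
    "input_or_output_schema_changed": "major",
    "retrained_on_new_dataset": "minor",
    "metadata_fix": "patch",
}

def release_record(previous_version, reason, hyperparameters, dataset_id, architecture):
    """Bundle everything that went into a release so it can be tracked later."""
    return {
        "version": bump(previous_version, POLICY[reason]),
        "reason": reason,
        "hyperparameters": hyperparameters,
        "dataset_id": dataset_id,
        "architecture": architecture,
    }

record = release_record(
    "1.4.2",
    "retrained_on_new_dataset",
    {"n_estimators": 200},
    "customers-2020-12",  # hypothetical dataset identifier
    "random_forest",
)
```

Storing such a record with every release makes it possible to tell later whether a change in performance came from the data, the hyperparameters, or the architecture.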

Model Deployment

Model deployment is often approached in the same way as deploying web applications. Unfortunately, this approach is often unsuitable for machine learning artifacts, which are updated often and may undergo significant changes in their inputs or outputs, making updates of the whole deployment confusing and error-prone. Modern model servers, like TensorFlow Serving or MLFlow, allow you to deploy your models without writing specific web app code. The models can be accessed through APIs based on the representational state transfer (REST) or remote procedure call (RPC) protocols. The gRPC protocol, developed by Google, was designed with similar problems in mind. It provides low-latency communication and smaller payloads by using Protocol Buffers (Protobuf) and may support even more advanced scenarios for data exchange, streaming, or model execution. However, it requires more implementation effort than conventional protocols such as REST, and its binary payloads are rather difficult to inspect. Moreover, modern ML servers allow you to host multiple versions of the same model simultaneously, with opportunities to run multiple A/B tests on your models and receive valuable feedback about your model improvements. Model servers also allow you to update a model version without redeploying your application, which reduces your application’s downtime and the communication needed between your application development and machine learning teams.
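As an illustration of REST access to a model server, the sketch below builds a predict request in the shape used by the TensorFlow Serving REST API (a `:predict` endpoint under `/v1/models/<name>`, optionally pinned to a version, with an `instances` list in the body). The host, the model name `b2b_classifier`, and the feature values are hypothetical, 8501 is only the port commonly used in the TensorFlow Serving examples, and the function builds the request without sending it.

```python
import json

def predict_request(host, model_name, instances, version=None, port=8501):
    """Build the URL and JSON body for a TensorFlow Serving REST predict call."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        # Address one specific hosted version, e.g. for an A/B test.
        path += f"/versions/{version}"
    url = f"http://{host}:{port}{path}:predict"
    body = json.dumps({"instances": instances})
    return url, body

# Two requests against two simultaneously hosted versions of the same model.
url_v1, body_v1 = predict_request("localhost", "b2b_classifier", [[0.1, 0.9]], version=1)
url_v2, body_v2 = predict_request("localhost", "b2b_classifier", [[0.1, 0.9]], version=2)
```

The resulting URL and body can be sent with any HTTP client as a POST request. Because the application only changes a version number in the URL, a new model version can be rolled out without redeploying the application itself.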

Feedback Loops

The last step is crucial to the success of any data science project. After the model is deployed and run in practice, we can measure its effectiveness and performance, identifying the problems and bottlenecks. It is also possible to update the model by using captured new training data. This may involve a human in the loop, or it may be automatic.

Ideally, the entire pipeline should be automated, except for the two steps that require human intervention for review and confirmation (the model analysis step and the feedback step). Data scientists should focus on the development of new models, not on updating and maintaining existing models.

Hidden Technical Debt in Machine Learning Systems

Data scientists can implement and train an ML model with predictive performance on an offline holdout dataset, given relevant training data for their use case. However, the real challenge is not building an ML model; it is building an integrated ML system and continuously operating it in production (Google Cloud, 2020).

As shown in the following diagram, only a small fraction of a real-world ML system is composed of the ML code. The surrounding elements are vast and complex.

Elements for ML systems. Adapted from Hidden Technical Debt in Machine Learning Systems.

In this diagram, the rest of the system is composed of configuration, automation, data collection, data verification, testing and debugging, resource management, model analysis, process and metadata management, serving infrastructure, and monitoring.

ML introduces two new assets into the software development lifecycle – data and models.

ML systems require extensive testing and monitoring. The key consideration is that, unlike a manually coded system, the behavior of an ML-based system is not easily specified in advance. This behavior depends on the dynamic qualities of the data and on various model configuration choices (Eric Breck et al., 2017).

MLOps Principles

MLOps refers to DevOps as applied to machine learning and artificial intelligence. Short for “software development” and “IT operations,” DevOps applies software engineering practices to IT operations, such as packaging and deploying your production software. MLOps aims to shorten the analytics development life cycle and increase model stability by automating repeatable steps in the workflows used by today’s software practitioners (including data engineers and data scientists). While MLOps practices vary significantly, they typically involve automating integration (the frequent checking in and testing of code) and deployment (packaging code and using it in a production setting) (Mckinsey, 2020ai).

MLOps principles

Principles	Definition
Hardening	A collection of techniques and configurations is used to make the software more robust, reduce vulnerabilities to security issues and errors in production, and ensure that the analytics solution’s results cannot be manipulated.
Continuous integration	The process supporting code collaboration. When developers share the code with the team, testing steps occur to ensure the “build” is not broken. Continuous integration automates how team members share their working copies of code, enabling it to occur frequently (multiple people sharing multiple times per day). Mature teams leverage test automation and an “integration pipeline” to enable continuous integration.
Continuous delivery	The process in which every time a software “build” passes the automated tests, it is deployable and can be released to production at any time. Environments are usually staged, going first through a development environment, then a user-acceptance environment, and then the production environment. Mature teams have “deployment pipelines” that automate delivery.
Issue tracking	Software (e.g. Jira) is used to plan work for software developers, particularly for teams working in an agile manner.
Logging	The creation of a detailed record of events that occurred within an application during its operation, which is typically stored in log files.
Monitoring	The observation and management of software components to ensure they are available and performing normally (as opposed to the monitoring of the model’s predictions that takes place while the model is in use).

Source: adapted from (Mckinsey, 2020ai)

As models are released into production and put to use, the inputs cannot be predicted and, in some cases, they might be subjected to users or programs with malicious intent. Hardening reduces any potentially negative impacts by reducing the “surface area” subject to attack. For example, hardening may ensure that all requests and responses to and from an API are encrypted even within an internal network.

Automating integration can increase developer productivity and help to identify issues rapidly. Continuous delivery aims to build, test, and release software with greater speed and frequency. This reduces deployment risks since the changes between each version are small.

The planning of software development tasks can be complicated and requires collaboration. Work is broken down from user stories to features to tasks/issues implemented by software developers. Logging allows teams to troubleshoot issues by reviewing changes in applications and systems via their respective logs.

For jobs that need to run at particular times or services/applications that need to be available, application monitoring is crucial to ensure that software performs as intended. This is particularly important in production when a model is no longer an experiment and is adopted by business users as part of normal business operations.

Tools for data science and machine learning operations (MLOps)

After initial analysis, the following groups of tools can help organize and support the work of your data scientists and your machine learning operations management:

Machine learning process management and support. Such tools serve as the main platforms for the full machine learning process lifecycle, starting with data management and ending with model versioning and deployment. These tools aim to support the full lifecycle of a machine learning project or cover most of the steps required to successfully initiate, implement, and manage a machine learning, deep learning (DL), or advanced data analytics project.

Model training. This group covers a huge number of frameworks and tools for machine learning tasks, most of which are outside the scope of this report. They cover almost any task in machine learning or artificial intelligence in general, as well as related tasks in data engineering, evaluation, interpretability, or serving. However, most of these tools focus on fulfilling a single task. Deep learning is one of the fastest-evolving fields in computer science, with dedicated frameworks like PyTorch, TensorFlow, Keras, Chainer, and MxNet, which are generally developed and supported by leading companies. These tools, however, tend to focus on the needs of the modeling process itself. There are also more general projects that cover most of the machine learning lifecycle, such as TensorFlow Extended (TFX), KubeFlow Pipelines, MetaFlow, and, for the R language, mlr3 and tidymodels. They organize the whole modeling process in the form of pipelines to enable full ML modeling workflow processing, improve reuse, and simplify reproducibility. TFX and KubeFlow will be further considered for experimental analysis.

Feature stores. Feature engineering is a key step and also one of the most iterative, time-consuming, and resource-intensive steps in the machine learning process. It raises robustness, scalability, and reuse issues, which feature stores are usually deployed to address. Aside from metadata indexing and search, they also offer rapid and unified access to processed features, which can quickly be tested or applied in practical applications. They can be deployed globally per company or locally per division or data science group.

Feature Store Comparison (featurestore.org)

Platform	Open Source	Offline	Online	Metadata	Feature Engineering	Supported Platforms	Time Travel	Training Data
Hopsworks	AGPL-V3	Hudi/Hive	MySQL Cluster	DB Tables, Elasticsearch	DB Tables, Elasticsearch	AWS, GCP, On-Premises	SQL Join or Hudi Queries	.tfrecords, .csv, .npy, .petastorm, .hf5, etc
Michelangelo	n/a	Hive	Cassandra	KV Entries	Spark, DSL	Proprietary	SQL Join	Streamed to models?
Feast	Apache V2	BigQuery	BigTable/Redis	DB Tables	Beam, Python	GCP	SQL Join	Streamed to models
Conde Nast	n/a	Kafka/Cassandra	Kafka/Cassandra	Protocol Buffers	Shared libraries	Proprietary	?	Protobuf
Zipline	n/a	Hive	KV Store	KV Entries	Flink, Spark, DSL	Proprietary	Schema	Streamed to models?
Comcast	n/a	HDFS, Cassandra	Kafka / Redis	Github	Flink, Spark	Proprietary	No?	Unknown
Netflix Metaflow	n/a	Kafka & S3	Kafka & Microservices	Protobufs	Spark, shared libraries	Proprietary	Custom	Protobuf
Twitter	n/a	HDFS	Strato / Manhattan	Scala shared feature libraries	Scala DSL, Scalding	Proprietary	No	Unknown
Facebook FBLearner	n/a	?	Yes, no details	Yes, no details	?	Proprietary	?	Unknown
Pinterest Galaxy	n/a	S3/Hive	Yes, no details	Yes, no details	DSL (Linchpin), Spark	Proprietary	?	Unknown

Here are the benefits of feature stores, as given in featurestore.org:

Track and share features between data scientists, including a version-control repository;
Process and curate feature values while preventing data leakage;
Ensure parity between training and inference data systems;
Serve features for ML-specific consumption profiles, including model training, batch and real-time predictions;
Accelerate ML innovation by reducing the data engineering process from months to days;
Monitor data quality to rapidly identify data drift and pipeline errors;
Empower legal and compliance teams to ensure compliant use of data;
Bridge the gap between data scientists and data & ML engineers;
Lower the total cost of ownership through automation and simplification;
Faster time-to-market for new model-driven products;
Improved model accuracy: the availability of features will improve model performance;
Improved data quality via data -> feature -> model lineage.

The motivation for using feature stores is outlined in the Hopsworks feature store documentation. To summarize, machine learning systems tend to accumulate technical debt. Examples of technical debt in machine learning systems include:

There is no principled way to access features during model serving.
Features cannot easily be reused between multiple machine learning pipelines.
Data science projects work in isolation without collaboration and re-use.
Features used for training and serving are inconsistent.
When new data arrives, there is no way to pin down exactly which features need to be recomputed. Rather, the entire pipeline needs to be run to update features.

Using a feature store is a best practice, which can reduce the technical debt of your machine learning workflows. When the feature store is built up with more features, it becomes easier and cheaper to build new models as the new models can re-use the existing features in the feature store.
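The core idea, namely that features are defined once and the same definitions serve both training and inference, can be illustrated with a toy in-memory store. This is a deliberately simplified sketch and not the API of Hopsworks, Feast, or any other product; the class name, the feature names, and the sample entity are hypothetical.

```python
class InMemoryFeatureStore:
    """Toy feature store: shared feature definitions plus a cache of computed values."""

    def __init__(self):
        self._definitions = {}  # feature name -> function(entity) -> value
        self._values = {}       # (feature name, entity id) -> cached value

    def register(self, name, compute):
        """Register a feature once so every pipeline reuses the same definition."""
        if name in self._definitions:
            raise ValueError(f"feature {name!r} is already registered")
        self._definitions[name] = compute

    def get_vector(self, entity_id, entity, feature_names):
        """Return feature values for one entity, computing only what is not cached."""
        vector = []
        for name in feature_names:
            key = (name, entity_id)
            if key not in self._values:
                self._values[key] = self._definitions[name](entity)
            vector.append(self._values[key])
        return vector

store = InMemoryFeatureStore()
store.register("name_length", lambda entity: len(entity["text"]))
store.register(
    "has_legal_suffix",
    lambda entity: int(entity["text"].lower().endswith(("ltd", "inc", "llc"))),
)

# The training pipeline and the serving code call the same method,
# so both see identical feature values for the same entity.
training_vector = store.get_vector("e1", {"text": "Acme Ltd"}, ["name_length", "has_legal_suffix"])
serving_vector = store.get_vector("e1", {"text": "Acme Ltd"}, ["name_length", "has_legal_suffix"])
```

A production feature store adds persistence, versioning, and separate offline and online storage, but the contract stays the same: one definition per feature, reused everywhere.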

Computation orchestration. These tools provide the functionality and flexibility required to execute developed modeling or data engineering pipelines. Some ML process management tools (like Kubeflow Pipelines or Polyaxon) provide integrated orchestration functionality. However, most data science platforms also require a specialized tool that is set up or customized to tackle more general tasks (like data engineering, modeling, or data routing). This group includes well-known tools like Apache Airflow, Luigi (initially developed by Spotify), Apache NiFi, and Pachyderm. Note that we do not consider message queues (RabbitMQ, Apache ActiveMQ, Apache Kafka), which are expected to perform data stream routing by organizing streams into queues. Likewise, we do not consider other tools that are targeted at data engineering.

Apache Airflow is arguably the most widely used framework in this group, adopted and developed by various companies.
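What orchestration tools do at their core is run tasks in an order that respects their dependencies. The sketch below shows this with the standard-library `graphlib` module (Python 3.9 or later). It is not the Airflow or Luigi API; the task names and the toy task bodies are hypothetical.

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order.

    tasks: name -> callable taking the dict of results produced so far.
    dependencies: name -> set of upstream task names.
    """
    results = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        results[name] = tasks[name](results)
    return order, results

# A toy four-step pipeline.
tasks = {
    "ingest": lambda results: [3, 1, 2],
    "validate": lambda results: all(value > 0 for value in results["ingest"]),
    "preprocess": lambda results: sorted(results["ingest"]),
    "train": lambda results: sum(results["preprocess"]) / len(results["preprocess"]),
}
dependencies = {
    "ingest": set(),
    "validate": {"ingest"},
    "preprocess": {"validate"},
    "train": {"preprocess"},
}
order, results = run_pipeline(tasks, dependencies)
```

Real orchestrators add scheduling, retries, distributed execution, and monitoring on top of this dependency resolution.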

Deployment. This group of tools is developed to provide functionality mainly for the deployment of finalized models. The number of candidate tools is also extensive, and only a small fraction will be considered for testing. However, some of the tools that are not considered in this report could be considered viable alternatives for future testing, if they meet functional requirements for the implementations.

We did not consider purely commercial tools in this analysis and chose to use vendor-agnostic tools, which can later be customized and set up to meet our needs. However, we did evaluate tools that provide both community editions and commercial or enterprise editions, since enterprise support is considered an advantage for future growth. For a more detailed comparison of commercial tools, like Neptune.ai, Weights & Biases, and Comet.ai, we refer the reader to the analysis in (Neptune.ai, 2020).

ML tools comparison

The main tools for machine learning process management and execution are summarized in the table below.

Main tools for machine learning process management and execution (part 1)

	Kubeflow	MLFlow	Seldon Core	MLRun	TFX	Polyaxon	Trains
Vendor	Google	DataBricks	Seldon	Iguazio	Google	Polyaxon	Allegro
Purpose/focus	ML pipeline execution	Experiment management	Deployment and monitoring	Experiment management, pipeline execution	ML pipeline execution	Experiment management, pipeline execution	Experiment management
Features
UI	Yes	Yes	No	Yes	No	Yes	Yes
Experiment management	Yes	Yes	No	Yes	No	Yes	Yes
Workflows/ pipelines	Argo	–	Inference graphs	–	Internal	Internal	–
Produce Docker images	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Tracking	Yes	Yes	Yes				Yes
Monitoring	No	No	Yes	No	No	No	Yes
Model registry	No	Yes
Extensibility	–	+	+/-	–	+	+/-	+
Python API	Yes	Yes	Yes	Yes	Yes	Yes	Yes
R API	No	Yes	No	No	No	No	No
Kubernetes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Supported frameworks
TensorFlow	Yes	Yes	Yes	Yes	Yes	Yes	Yes
PyTorch	Yes	Yes	Yes	Yes		Yes	Yes
Scikit-Learn	Yes	Yes	Yes	Yes		Yes	Yes
MPI	Yes	N/A	N/A	Yes
MxNet	Yes	Yes	Yes			Yes
XGBoost	Yes	Yes	Yes	Yes			Yes
Spark	Yes	Yes	Yes	Yes		Yes
Dask				Yes		Yes
Horovod				Yes
R	Yes	Yes	Yes	No	No	No	No
Additional functionality
Hyperparameter tuning	Katib		Internal	Internal	TFX Tuner	Hyperopt, etc.	Autokeras
Intelligent routing			Yes
Drift detection			Yes		Yes
Interpretability			Yes		Yes
Continuous integration			Yes

A set of deployment-oriented frameworks is listed in the next table. Again, this is not a complete set; only the most relevant options were considered. Some tools from the first table (such as Seldon Core) could also be classified as deployment tools, as they focus on model deployment alongside other features. Conversely, some of the tools in this table (like Nuclio) can be considered development-oriented, although they also provide deployment facilities.

Main tools for machine learning process management and execution (part 2)

	kfserving	Clipper	BentoML	Ray Serve	Nuclio
Vendor	Google	UC Berkeley RISE Lab	BentoML	University of California at Berkeley	Iguazio
Purpose	ML deployment in Kubernetes	Docker image build, ML deployment	Docker image build, ML deployment	Reinforcement learning, model serving	Serverless functions, deployment in Kubernetes
GPU support	Yes	Yes	Yes	Yes	Yes
Kubernetes	Yes	Yes	Yes	N/A	Yes
Monitoring	No	Yes	No		Yes
Framework support
TensorFlow	Yes	Yes	Yes	Yes
PyTorch	Yes	Yes	Yes	Yes
Keras	Yes	Yes	Yes	Yes
scikit-learn	Yes	Yes	Yes	Yes
XGBoost	Yes	Yes	Yes
ONNX	Yes	N/A	Yes
MxNet	Yes	Yes
FastAI			Yes
H2O			Yes
Spacy			Yes
statsmodels			Yes
CoreML			Yes
FastText			Yes
R	No	Yes	No	No	Yes
Supported platforms
Docker	Yes	Yes	Yes		Yes
AWS	Yes	N/A	Yes	N/A	Yes
Google Cloud	Yes	N/A	Yes	N/A	Yes
Azure	Yes	N/A	Yes	N/A	Yes
Additional features
Scalability	Autoscaling Canary rollouts	Adaptive batching Caching Improved throughput		Batching Load balancing between backends
Open API	No	Yes	Yes	No
Integrations	Anything in Kubernetes	Prometheus
Other deployment options			SQL Server ML Services Python package	Model composition

Testing of MLOps tools

In this section, we provide the findings of our experimental implementations in some of the tools which were selected for evaluation. This section also outlines the configurations of the technological components and models we chose to automate data science tasks using an MLOps approach.

Experiment setup

Because so many solutions exist, their comparative evaluation is not a trivial matter. These tools address different topics in machine learning engineering and deployment and are developed to support solutions working at different scales. They also vary in terms of underlying technology and scalability requirements, which makes it difficult to compare their performance given the uncertainty in the research and development process, real-world application development, business requirements, or future demands. Yet, several requirements that are relevant for practical applications can be identified. These include:

Flexibility – can the tool be easily adopted in multiple situations, meeting the needs for different modeling techniques?
Framework support – are the most popular ML and DL technologies and libraries integrated and supported?
Multilanguage support – can the tool support code written in multiple languages? Does it have packages for the most popular languages used by data scientists, like R and Python?
Multi-user support – can the tool be used in a multi-user environment? Does it meet security requirements?
Maturity – is the tool mature enough to be used in production? Is it still developed? Is it used by any large companies?
Community support – is the tool supported by any developer groups or backed by large companies? Does it have a commercial version?

We also determined whether the tool is easily extensible, which is an additional benefit. For example, MLFlow functionality can be extended using plugins, providing a powerful mechanism for customizing the behavior of the MLflow client and integrating third-party tools, allowing you to:

Integrate with third-party storage solutions for experiment data, artifacts, and models.
Integrate with third-party authentication providers, e.g. read HTTP authentication credentials from a special file.
Use the MLflow client to communicate with other REST APIs, e.g. your organization’s existing experiment-tracking APIs.
Automatically capture additional metadata as run tags, e.g. the git repository associated with a run.
Add a new backend to execute MLFlow Project entry points.

We chose to evaluate the robustness, flexibility, ease of use, and customizability of these tools for research, and to implement workflows of differing complexity. For the proof-of-concept implementation, we selected a classifier that tries to detect whether a given text fragment represents a business-to-business (B2B) entity (like a company) or a business-to-customer (B2C) subject (a person). The development of such a classifier requires multiple preprocessing steps and external data sources, which must also be available for serving. These steps are:

Dataset splitting – for training/testing;
Feature engineering – requires several Python libraries and external datasets (family name usage, frequency ratios for both first names and surnames, etc.);
Feature processing – using categorical (one-hot encoding) conversion of both categorical features and labels;
Model training – logistic regression and random forest classifiers from the scikit-learn library were considered. For the TensorFlow implementation, a custom Keras classifier and the Google TabNet implementation were considered;
Model evaluation.
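The first and third steps above (dataset splitting and one-hot conversion) can be sketched in plain Python as follows. This is only an illustration and not the code used in the experiment; the helper names `split_rows` and `one_hot`, the fixed seed, the 0.3 test fraction default, and the sample rows are assumptions.

```python
import random

def split_rows(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over a fixed category list."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]

# Hypothetical labeled text fragments.
rows = [(f"fragment {index}", "b2b" if index % 2 else "b2c") for index in range(10)]
train_rows, test_rows = split_rows(rows)

LABELS = ["b2b", "b2c"]
train_targets = [one_hot(label, LABELS) for _, label in train_rows]
```

Keeping the category list fixed and explicit matters for serving: the deployed model must receive vectors encoded with exactly the same categories, in the same order, as during training.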

At this stage, we considered only the Python-based modeling process. Deployment of R models was tested separately, considering the diversity of models that would have to be deployed. Note that classifier performance is not evaluated in this experiment; such evaluation is generally performed iteratively by data scientists, who continuously optimize the model and implement the required improvements. The organization and management of the whole process, however, is essential. A huge number of relevant technologies exist for practical implementations, and tools to organize and orchestrate these tasks are crucial, as they can lead to unified, reusable, and reproducible results and increase the quality of the final models.

After a thorough analysis of the tools, the following were selected for further testing:

KubeFlow Pipelines and TensorFlow Extended (TFX) to test the training process.
MLFlow and Allegro Trains for experiment management and model versioning.
BentoML, KFServing, TensorFlow Serving, Clipper, and Seldon for deployment.

In this experiment, we used KubeFlow 1.0, KFServing 0.4.1, scikit-learn 0.24, TFX 0.24, BentoML 0.11, Clipper.ai 0.4.1, MLFlow 1.13, TensorFlow Serving 2.3, and Allegro Trains 0.16. The evaluation of these tools in terms of flexibility is presented in the next section.

Evaluation results

Model training. This is one of the core tasks in the machine learning lifecycle and is performed regularly. The classifier was trained using a dataset consisting of 376,617 entries. 30% were selected as the holdout sample for testing, while the remaining data was used for training. Note that, for simplicity, we did not consider parameter tuning and used scikit-learn classifiers with the default parameters (the only exception is Random Forest, which used a total of 200 estimators instead of the default 100). The implemented TensorFlow classifier was a deep neural network with three hidden dense layers of 256, 64, and 16 units, respectively. We also tested the TabNet classifier with default settings as defined in the implementing library tf-TabNet.

For Kubeflow Pipelines, we considered only the scikit-learn-based implementation. Kubeflow has extensive support for and integration with TensorFlow Extended; however, testing deep learning performance was not one of the goals of this research.

Examples of Kubeflow Pipelines

The workflow for the Kubeflow implementation is presented in the above figure. While we identified several problems, the training itself ran smoothly. The whole process took more than 10 minutes, which could be considered excessive for a model of this size. However, this is expected, as Kubeflow is better suited to tasks of a much larger scale, where the time required to initialize, set up, and prepare the environment for each step is not significant compared to the time required to run the model training workflows.

Basically, there are two ways to implement workflows in Kubeflow:

Using the Kubeflow API, one can easily create Docker containers programmatically in Python and generate workflows for deployment;
Creating custom Docker images and orchestrating them using Kubeflow facilities. This could be the preferred option if custom components or multiple languages are used in the pipeline.

Because Python is the modeling language most relevant to our internal processes, the Kubeflow API option was selected for testing. However, the second option could be an important factor for future implementations, and it is one that we might consider for other languages and technologies, like R, Java, or Julia.

The experimental analysis identified several other Kubeflow advantages:

Logs in the dashboard are updated almost in real-time, which is very helpful for debugging specific operators.
Pipelines can be exported as Argo workflows and loaded using the dashboard.

However, we also identified multiple disadvantages while performing Kubeflow testing:

Pods created during workflow runs were not automatically destroyed, which may require additional management.
Setup is quite complex and requires additional skills in running Kubernetes. While the documentation on the website clearly outlines each step, our testing showed that specific versions of Kubernetes (particularly Minikube) may be required, and that additional setup steps may be needed to ensure that it runs smoothly with all components properly set up. We also had difficulties removing Kubeflow from the cluster, as deleting the whole Kubernetes namespace often failed and required additional manual effort.
Runs are archived and cannot be deleted. While this may be useful for reproducibility or result analysis, it may take additional storage (especially as the pods generated during runs are not deleted) or it may result in multiple entries, which are redundant and difficult to navigate or search.
It is not possible to edit experiment metadata, like names of experiments or runs.
We could not pass binary objects (like Pandas DataFrames) between tasks directly; as stated in the documentation, Kubeflow provides only minimal support for this. Therefore, intermediary outputs must be stored in a storage facility, like Minio or S3. This is mostly performed automatically by the Kubeflow engine, but the user may also need to implement this manually when required.
Enforced naming conventions in the code were not properly documented.
While outputting data for metrics visualization, we encountered multiple problems: for example, we failed to visualize the confusion matrix, and we could not use white space characters in metric names, a restriction that is not documented.
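
The workaround for intermediate data described in the list above can be sketched in plain Python. This is a minimal sketch, not Kubeflow code: the two functions stand in for pipeline tasks, a temporary directory stands in for object storage such as Minio or S3, and all names (`preprocess_step`, `train_step`, `rows.csv`) are hypothetical. The point it illustrates is that each task hands over a path to a stored artifact rather than an in-memory object.

```python
import csv
import tempfile
from pathlib import Path


def preprocess_step(output_dir: Path) -> Path:
    """Hypothetical first task: writes its result to storage and returns the path."""
    rows = [
        {"feature": 1.0, "label": 0},
        {"feature": 2.5, "label": 1},
        {"feature": 4.0, "label": 1},
    ]
    out_path = output_dir / "rows.csv"
    with out_path.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["feature", "label"])
        writer.writeheader()
        writer.writerows(rows)
    return out_path


def train_step(input_path: Path) -> float:
    """Hypothetical second task: reloads the artifact from storage by path."""
    with input_path.open(newline="") as fh:
        rows = list(csv.DictReader(fh))
    # Stand-in for training: the share of positive labels in the data.
    positives = sum(int(row["label"]) for row in rows)
    return positives / len(rows)


def run_pipeline() -> float:
    # The temporary directory plays the role of the shared storage facility.
    with tempfile.TemporaryDirectory() as storage:
        artifact_path = preprocess_step(Path(storage))
        return train_step(artifact_path)


if __name__ == "__main__":
    print(run_pipeline())
```

In a real pipeline the file format would more likely be a columnar or binary format suited to DataFrames; CSV is used here only to keep the sketch dependency-free.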

The TensorFlow Extended (TFX) application revealed more challenges than we initially expected. First, a lack of flexibility was observed during the implementation of the test model. The required functionality can probably be implemented by overriding or extending existing components, but this would require a significant investment of time.

We also encountered difficulties with conversion between sparse and dense tensor formats and with integrating preprocessing steps that do not use tensor operations, both of which would slow down the adoption of TFX.

Incompatibilities between component versions were another issue, as some of the components, like TF Transform, TF Privacy, and TF Encrypted, supported only TensorFlow 1.x or were in beta for TensorFlow 2.0.

Finally, we had issues while running some of the visualization tools provided with the platform (like TensorFlow Model Analysis), which did not work in the JupyterLab environment.

Nevertheless, testing with our classifier proved the potential of this framework for future tasks, which would require large-scale deep learning and should run using orchestration tools, like Apache Airflow. However, as in the case of Kubeflow, additional time, expertise and research would be required to conduct a more extensive evaluation.

Experiment versioning and model management. Experiment versioning tools play a similar role to code or component versioning in software engineering. However, they require more flexibility, as additional metadata (performance metrics, visualization outputs) must be stored together with the modeling output.

Therefore, we tested two tools that satisfied our requirements after the initial analysis. The test case was the development of a CatBoost classifier based on the scikit-learn library. Integration with MLFlow went smoothly, and it was easy to implement and store additional metrics using the provided API. The developed model artifacts could then be versioned and sent for deployment.
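
The kind of logging described here can be sketched as follows. This is a minimal sketch under assumptions: `log_experiment` and `InMemoryTracker` are hypothetical names, and the tracker is a stand-in so that the example runs without a tracking server. The stand-in mirrors the `log_param(key, value)` and `log_metric(key, value)` functions of the MLFlow Python API, so the `mlflow` module could in principle be passed as `tracker` instead.

```python
from typing import Any, Dict, List


class InMemoryTracker:
    """Hypothetical stand-in for an experiment tracker; keeps values in dicts."""

    def __init__(self) -> None:
        self.params: Dict[str, Any] = {}
        self.metrics: Dict[str, float] = {}

    def log_param(self, key: str, value: Any) -> None:
        self.params[key] = value

    def log_metric(self, key: str, value: float) -> None:
        self.metrics[key] = value


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def log_experiment(tracker, params: Dict[str, Any],
                   y_true: List[int], y_pred: List[int]) -> None:
    """Store hyperparameters and one evaluation metric for a single run."""
    for key, value in params.items():
        tracker.log_param(key, value)
    tracker.log_metric("accuracy", accuracy(y_true, y_pred))


tracker = InMemoryTracker()
log_experiment(
    tracker,
    params={"iterations": 200, "learning_rate": 0.1},  # illustrative values
    y_true=[1, 0, 1, 1],
    y_pred=[1, 0, 0, 1],
)
```

Keeping the tracker behind a small interface like this also makes it easier to compare tracking tools, since only the object passed in changes.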

Support for R language was also tested using a similar classifier (using a ranger and tidymodels based implementation), and no problems occurred while storing the experimental results, evaluation metrics, or generated artifacts.

Next, we experimented with versioning and model management using MLFlow.

The integration of Allegro Trains was also seamless, providing a solid user experience and options for further scalability. Moreover, it provides more flexibility and functionality than the MLFlow tools, such as hyperparameter logging and the storage of logs or plots. However, we use MLFlow because it offers R language support. We hope that the Trains environment can eventually be adopted for versioning and managing R-based models, but this would require additional research and the development of R packages to enable such functionality. It could therefore be considered for future work.

Model deployment. Finally, we tested the five tools mentioned (BentoML, KFServing, Seldon, TensorFlow Serving, MLFlow) as the deployment options. The first three tools were selected to deploy a scikit-learn based classifier (although they also support deep learning-based classifiers), and TensorFlow Serving was used to deploy the Keras-based classifier developed using TFX.

To deploy a classifier in the BentoML environment, we implemented a class extending the core BentoML service class. After implementing it, one must build and save the BentoML service (which is also versioned); it can then be served directly as a local service. Alternatively, it is possible to build a Docker container using this service together with all the artifacts (model files, external datasets, etc.) and serve it from a Docker/Kubernetes environment. The BentoML documentation states that it can scale extremely well and is capable of reaching high serving performance. The whole deployment process was executed successfully, and the service built with BentoML satisfied our requirements almost perfectly. This is also promising because BentoML can be integrated with external tools like Kubeflow and deployed in the Kubernetes environment, which might become a functional requirement in the future.
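
The general pattern of wrapping a model in a versioned service, saving it, and reloading it for serving can be illustrated with a small sketch. This is deliberately not BentoML code: `ThresholdModel`, `ModelService`, `bundle.json`, and `roundtrip_predict` are hypothetical names, and a toy model stored as JSON stands in for real model files. It only shows the shape of the workflow described above.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Tuple


class ThresholdModel:
    """Toy classifier used only for illustration."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def predict(self, values: List[float]) -> List[int]:
        return [int(v >= self.threshold) for v in values]


class ModelService:
    """Hypothetical service class bundling a model with a version tag."""

    def __init__(self, model: ThresholdModel, version: str) -> None:
        self.model = model
        self.version = version

    def predict(self, values: List[float]) -> List[int]:
        return self.model.predict(values)

    def save(self, base_dir: Path) -> Path:
        # Each saved bundle gets its own versioned directory.
        bundle_dir = base_dir / self.version
        bundle_dir.mkdir(parents=True)
        bundle = {"version": self.version, "threshold": self.model.threshold}
        (bundle_dir / "bundle.json").write_text(json.dumps(bundle))
        return bundle_dir

    @classmethod
    def load(cls, bundle_dir: Path) -> "ModelService":
        bundle = json.loads((bundle_dir / "bundle.json").read_text())
        return cls(ThresholdModel(bundle["threshold"]), bundle["version"])


def roundtrip_predict(values: List[float]) -> Tuple[str, List[int]]:
    """Save a service, reload it from disk, and predict with the reloaded copy."""
    with tempfile.TemporaryDirectory() as tmp:
        saved_dir = ModelService(ThresholdModel(0.5), version="v1").save(Path(tmp))
        restored = ModelService.load(saved_dir)
        return restored.version, restored.predict(values)
```

A saved bundle of this kind is also what would be copied into a container image when the service is deployed through Docker or Kubernetes.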

KFServing is a subproject by the Kubeflow developers that aims to build and deploy model services in the Kubernetes environment. This is also performed by building a Docker image and deploying it as a container. Due to the requirements for our test project, the scikit-learn kfserving module was selected for testing. However, the following issues were observed during implementation:

the project seemed quite outdated: it required scikit-learn 0.20, which was already several releases behind the active version at the time of writing
the implementation, which required custom preprocessing and postprocessing steps, did not work as expected or as described in the sample code.

We expect we will test Kubeflow in further iterations due to new scalability requirements. Therefore, KFServing might be tested more extensively in other configurations (deep learning support, specific models). However, it will not be considered at this moment.

Seldon is a promising tool. It offers a full range of functionality required to successfully deploy state-of-the-art ML and DL solutions on an almost unlimited scale using the Kubernetes platform. However, its setup is quite complicated, making it difficult to evaluate the solution completely. Therefore, for some functionality testing, we used examples provided together with the reference documentation. This tool provides advanced functionality that is not present in similar tools (intelligent routing for load balancing, drift detection, outlier detection, explainable machine learning) and that is implemented as separate libraries. Therefore, such components can be easily integrated into custom solutions. Seldon can be used with almost any major programming language used in machine learning research, as the underlying technology can convert provided code into a Docker image, which is then deployed in the Kubernetes cluster. Again, just as in the case of Kubeflow and KFServing, we expect to test this technology more thoroughly in the future, after our scalability requirements change.

Clipper.ai is an interesting option, despite its lack of support and recent commits. Python model deployment was rather easy (although deep learning support was not tested), and R modeling support was an additional benefit. However, it will not be considered for Python model deployment as long as there are better options. Still, it could serve as an option to deploy R models, which are poorly supported by other tools (at the time of analysis, only MLFlow had support for R model deployment). Running Clipper.ai with R language support required minor corrections (like updating the outdated R language and R package versions in the Dockerfile used to build the base Docker image). After fixing these issues, we were able to deploy the example provided with the tool in its GitHub repository and even deploy our tested classifier. Unfortunately, we could not get the deployed classifier running, because the system libraries required to compile R package dependencies were missing. Nevertheless, the tool managed to collect and deploy all the required dependencies, like R packages, data structures, or external files.

MLFlow is another tool that provides options for model deployment. This seemed a very viable option, considering its support for full model lifecycle management. Indeed, it generated Docker images for deployment using a scikit-learn based classifier, and these were easily deployed. This corresponds to the main principles of continuous integration. However, the generated image does not provide a Swagger-based OpenAPI interface, and it is heavily bloated with additional libraries, whereas BentoML seems to provide a more lightweight and flexible solution. Unfortunately, R language support turned out to be much more complicated to implement and might even require advanced R skills such as metaprogramming. MLFlow supports a single R model framework and does not look mature enough to support more advanced modeling requirements. Hence, it might even be more appropriate to use custom-built Docker images instead.

TensorFlow Serving is part of the TFX framework. It therefore supports only the TensorFlow framework, and hence was used to test the deployment of the TensorFlow-based classifier. However, as with the whole TFX framework, the setup turned out to be complicated: it even failed to install initially using the Ubuntu apt tool, as the repositories provided in the setup reference did not work. After the setup, we managed to run the classifier successfully. Unfortunately, unlike BentoML or Seldon, it does not generate a Swagger interface for OpenAPI, which is beneficial for testing and documenting services and is also one of the requirements for our implementations. Another drawback is the possible limitation of the documented service API: it provides endpoints for classification, regression, or prediction, but does not cover other tasks often encountered in modeling (like ranking). While it should still be possible to implement these using workarounds, we could not validate this at this stage. One advantage of TF Serving is that it can serve different versions of a model, with the version encoded in the REST endpoint path structure. While the options described previously in this section seem to be more viable for the deployment of TensorFlow-based models, the TensorFlow Serving module could be tested further together with the TFX framework.
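
The versioned endpoint structure mentioned above can be illustrated with a small helper. This is a minimal sketch, assuming the URL layout documented for the TensorFlow Serving REST API (`/v1/models/<name>/versions/<version>:predict`) and its commonly used REST port 8501; the host, the model name `classifier`, and the helper names are hypothetical, and no request is actually sent.

```python
import json
from typing import List, Optional


def build_predict_url(host: str, model_name: str,
                      version: Optional[int] = None, port: int = 8501) -> str:
    """Build a predict URL; the version segment is added only when requested."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        path += f"/versions/{version}"
    return f"http://{host}:{port}{path}:predict"


def build_predict_body(instances: List[List[float]]) -> str:
    """Encode inputs in the row-oriented 'instances' request format."""
    return json.dumps({"instances": instances})


# Hypothetical host and model name, for illustration only.
url_default = build_predict_url("localhost", "classifier")
url_v2 = build_predict_url("localhost", "classifier", version=2)
body = build_predict_body([[0.1, 0.2, 0.3]])
```

Because the version is part of the path, two versions of the same model can be queried side by side, for example to compare a candidate model with the one currently in use.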

Conclusions

Successful management of the machine learning lifecycle (MLOps) requires tools with extensive functionality and people with different skillsets. It must cover the management and versioning of models, experiments, experiment runs, and features, as well as distributed training, automated deployments, and scaling to support thousands or even millions of users. Unless it is properly automated using the latest available developments, MLOps management is a highly complex and time-consuming task.

This review summarizes a range of available open-source tools, which were applied to address these challenges. This review describes and evaluates the pros and cons of each tool, identifying tools that can be currently applied to implement MLOps processes in an organization.

In conclusion, the following tools were deemed to meet our requirements and were also easy to set up and/or implement:

We recommend MLFlow for versioning experiments, experiment runs, and models. However, it has some significant disadvantages, including the absence of a multi-user environment, role-based access, and advanced security features. These are important points if a large number of data scientists use this tool. Allegro Trains will be tested more thoroughly using real-world cases, considering its support for deep learning and Python-based frameworks. Testing whether it can be adapted to support the R language is also an interesting option for our future research.
Model development and testing can be performed using Python, R, and Java/Scala (if Apache Spark is one of the platforms considered for implementation). Deep learning (especially using PyTorch or TensorFlow frameworks) is supported by almost all the tools that were discussed in this review. Therefore, it should not be difficult to integrate deep learning solutions into the implemented pipelines or MLOps processes. We will also consider R targets and MetaFlow frameworks, which can be used to structure and orchestrate R processing and model training pipelines, and the H2O library, which provides industrial-strength solutions and is easy both to use and deploy.
For process orchestration, we recommend Apache Airflow. It is very robust and scalable, and it can implement all the functionality required to support our internal development processes or processing activities. As an execution engine, Apache Airflow is also supported by some of the frameworks discussed here (such as TFX), which is another advantage. It is already used for our internal computations.
A feature store is another important component, and we recommend making it part of every MLOps process. Unfortunately, our initial analysis indicated that there are only two mature open tools on the market: Feast by Gojek and Hopsworks Feature Store. The first tool is heavily integrated with the Google Compute Service and is dependent on the Python programming language. Hopsworks Feature Store is a compelling solution, but it uses Hadoop/Spark clusters, which is not a convenient option in our case. Therefore, it would currently be more feasible to implement our own simple feature store and test the solutions available on the market later.
Due to its features and ease of use when generating the final deployment solution, BentoML is a solid option for the deployment of Python-based solutions. It could be extended with additional components, like drift detection or explainable machine learning from other components. There are alternative solutions that could be tested in parallel or later. However, the situation is much more complicated for R language-based model deployment, as both implementations in MLFlow and Clipper.ai have issues that might restrict automated deployment of more complex models. Surprisingly, Clipper.ai turned out to have better support for R model deployment. Unfortunately, it is not developed further and would require corrections and improvements for production-level deployment. Therefore, the best option for R-driven workflows could be implementing an internal tool, either by extending or reusing the basis of existing tools (like Clipper.ai) or by developing it completely from scratch. However, one must be aware of the problems that will arise from the system (Linux) library dependencies that are not considered by any tool and require additional handling during deployment automation.
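
To make the idea of a simple in-house feature store from the recommendations above more concrete, here is a minimal sketch. It is an assumption about what such a component might look like, not a description of our implementation or of Feast or Hopsworks: the class `SimpleFeatureStore` and its methods are hypothetical, timestamps are plain integers, and storage is an in-memory dictionary. It records feature values per entity with a timestamp and returns the latest value known at a given point in time, which helps keep future information out of training data.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple


class SimpleFeatureStore:
    """Hypothetical in-memory feature store with point-in-time lookups."""

    def __init__(self) -> None:
        # (entity_id, feature_name) -> list of (timestamp, value), kept sorted
        self._data: Dict[Tuple[str, str], List[Tuple[int, Any]]] = defaultdict(list)

    def put(self, entity_id: str, feature: str, timestamp: int, value: Any) -> None:
        history = self._data[(entity_id, feature)]
        history.append((timestamp, value))
        history.sort(key=lambda item: item[0])

    def get(self, entity_id: str, feature: str, as_of: int) -> Optional[Any]:
        """Return the latest value recorded at or before `as_of`, if any."""
        history = self._data.get((entity_id, feature), [])
        timestamps = [ts for ts, _ in history]
        idx = bisect_right(timestamps, as_of)
        return history[idx - 1][1] if idx > 0 else None

    def get_vector(self, entity_id: str, features: List[str],
                   as_of: int) -> Dict[str, Any]:
        """Collect several features for one entity at one point in time."""
        return {name: self.get(entity_id, name, as_of) for name in features}


store = SimpleFeatureStore()
store.put("customer_1", "orders_30d", timestamp=10, value=3)
store.put("customer_1", "orders_30d", timestamp=20, value=5)
store.put("customer_1", "avg_basket", timestamp=15, value=42.0)

vector_at_17 = store.get_vector("customer_1", ["orders_30d", "avg_basket"], as_of=17)
vector_at_5 = store.get_vector("customer_1", ["orders_30d", "avg_basket"], as_of=5)
```

A production version would need persistent storage, feature versioning, and access from both training and serving code, which is where the dedicated tools mentioned above become relevant.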

Finally, we recommend that the following data science LAB architecture is used as a reference for the implementation of our future experiments:

The picture above presents the architecture, which will be used as a reference for future implementations. Going forward, our main focus will be on the Python technologies and their stacks, while the R-based framework is still under research. However, R language support may be relevant, considering its wide use and extensive support for particular domains, like time series analysis, statistics, or econometrics.

Darius Dilijonas

CTO & Head of Data Science Team; Partnership Professor at Vilnius University

d.dilijonas@intellerts.com

Paulius Danėnas

Data scientist / Software developer / Researcher / ML Engineer

p.danenas@intellerts.com