Selecting your optimal MLOps stack: advantages and challenges

MLOps Principles

In 2015, Google released an influential paper Hidden Technical Debt in Machine Learning Systems. This paper described most of the problems associated with developing, deploying, producing, and monitoring machine learning-driven systems.

The paper revealed that ML is no longer a discipline for data scientists alone. It is also relevant for any software engineering practitioner who faces challenges when deploying models in production, scaling them, and implementing automated processes. Moreover, the complexity of these ML systems gives rise to problems of the kind that were previously solved by adopting a DevOps mindset.

This is when the term MLOps was coined. In its simplest form, MLOps describes the discipline of “machine learning” + “IT operations”. It brings together people who possess skills at the intersection of data science and machine learning, as well as experience with continuous integration and the automation skills that are commonplace in the DevOps world.

Hence, MLOps aims to shorten the analytics development life cycle and increase model stability by automating repeatable steps in the workflows used by today’s software practitioners (including data engineers and data scientists).

While MLOps practices vary significantly, they typically involve automation, experimentation, integration (frequent checking in and testing of code), and deployment (packaging code and using it in a production setting). There are also supporting activities to consider, such as security hardening (for example, encryption to protect your information from malicious system hijacking attempts), planning, logging, detecting changes in data or performance, and monitoring to ensure that your software performs as intended. This is particularly important in production, when a model is no longer an experiment and has been adopted by your business users as part of normal business operations.
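As an illustration of what "detecting changes in data" can mean in practice, the sketch below computes the population stability index (PSI), one common way to compare the distribution of a numeric feature at training time with its distribution in production. It is not taken from any of the tools discussed here; the function name, the number of bins, and the smoothing constant `eps` are assumptions made for this example.

```python
import math
from typing import List, Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Compare two samples of one numeric feature; larger values mean more drift."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the training range; fall back to 1.0 for a constant feature.
    width = (hi - lo) / bins or 1.0

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so that values outside the training range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic samples, used only to exercise the function.
training_sample = [i / 100 for i in range(100)]
production_sample = [x + 0.5 for x in training_sample]
```

Identical samples yield a PSI of zero, and the value grows as the two distributions diverge. A monitoring job could compute it per feature on a schedule and raise an alert above a threshold chosen for the use case.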

Elements for ML systems

(Adapted from Hidden Technical Debt in Machine Learning Systems.)

As shown in the above diagram from Google’s paper, only a small fraction of a real-world ML system is composed of the ML code.

The vast majority of the system is composed of supporting processes, like configuration, automation, data collection, data verification, testing and debugging, resource management, model analysis, process and metadata management, serving infrastructure, and monitoring.

One must also consider the lifecycle of a machine learning process, which is generally represented as a pipeline. Here, we will borrow the definition from a reference book by H. Hapke and C. Nelson, which focuses on TensorFlow but the principles can easily be applied to any machine learning project, independent of the technology stack. The definition includes:

  • Data ingestion and data versioning;

  • Data validation, often including the detection of changes in data distribution, anomaly detection, or sampling to ensure that the model is developed using representative and quality data;

  • Data preprocessing and feature generation;

  • Model training and tuning;

  • Model analysis, including performance analysis, evaluation, fairness analysis (to ensure that the model behaves consistently across different groups), and explainability (also known as interpretability);

  • Model validation, versioning, and release management;

  • Model deployment;

  • Feedback and response.

Ideally, the entire pipeline is automated, except for the two steps that require human intervention for review and confirmation (the model analysis step and the feedback step). The authors of this book emphasize that data scientists should focus on developing new models rather than on updating and maintaining existing ones; the latter processes are addressed by the tools tested in this article.
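The idea of an automated pipeline with two human review gates can be sketched in a few lines of plain Python. This is only an illustration: the `Stage` class, the `run_pipeline` function, and the placeholder step functions are hypothetical names introduced here, and real stages would call actual ingestion, training, and deployment code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Stage:
    name: str
    run: Callable[[Dict], Dict]
    needs_review: bool = False  # True for steps that a human must confirm


def run_pipeline(
    stages: List[Stage],
    context: Dict,
    approve: Callable[[str, Dict], bool],
) -> Dict:
    """Run stages in order; stop after the first review step that is not approved."""
    for stage in stages:
        context = stage.run(context)
        if stage.needs_review and not approve(stage.name, context):
            context["stopped_at"] = stage.name
            break
    return context


def step(tag: str) -> Callable[[Dict], Dict]:
    """Placeholder step that only records that it ran."""
    def _run(ctx: Dict) -> Dict:
        return {**ctx, "done": ctx.get("done", []) + [tag]}
    return _run


STAGES = [
    Stage("data ingestion and versioning", step("ingest")),
    Stage("data validation", step("validate_data")),
    Stage("preprocessing and feature generation", step("preprocess")),
    Stage("model training and tuning", step("train")),
    Stage("model analysis", step("analyze"), needs_review=True),
    Stage("model validation and release", step("validate_model")),
    Stage("model deployment", step("deploy")),
    Stage("feedback", step("feedback"), needs_review=True),
]

# approve() stands in for a human decision; here it always accepts or always rejects.
approved_run = run_pipeline(STAGES, {}, approve=lambda name, ctx: True)
rejected_run = run_pipeline(STAGES, {}, approve=lambda name, ctx: False)
```

In the rejected run, the pipeline stops after model analysis and never reaches deployment, which is the behavior a review gate is meant to guarantee.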

Tools for data science and machine learning operations (MLOps)

Machine learning process management and support. These tools are the main platforms that host the full lifecycle of a machine learning project, starting with data management and ending with model versioning and deployment. They cover most of the steps required to successfully initiate, implement, and manage a machine learning, deep learning (DL), or advanced data analytics project.

Model training. This group encompasses a huge number of tools for developing and training machine learning models. It includes well-known frameworks like PyTorch, TensorFlow, Keras, Chainer, and MxNet, which are generally developed and supported by leading companies, as well as scikit-learn, statsmodels, shogun, and PyCaret. However, these tools tend to focus on machine learning modeling rather than on supporting full training pipelines. More general projects that cover most of the machine learning lifecycle also exist, such as TensorFlow Extended (TFX) and KubeFlow Pipelines, or drake, targets, mlr3, and tidymodels for the R language. These tools organize the whole modeling process in the form of pipelines, enabling full ML workflow processing, improving the reusability of data, and simplifying reproducibility.

Feature stores. Feature engineering is a key step, and also one of the most iterative, time-consuming, and resource-intensive steps in the machine learning process. A feature store addresses data robustness, scalability, versioning, quality, consistency, and reuse issues, while also helping data scientists iterate faster during their ML research. Aside from metadata indexing and searching, feature stores provide rapid and unified access to processed features, which can quickly be tested, applied in practical applications, or used to enrich existing implementations with additional dimensions. They can be deployed globally per company or locally per division or data science group. summarizes the main features of current feature store implementations. At the moment, two main open-source options are available (Hopsworks and Feast), with large companies tending to focus on the development of their own internal tools.

Using a feature store is a best practice, reducing the technical debt of your machine learning workflows. When a feature store is extended with more features, it becomes easier and cheaper to build new models as the new models can re-use the existing features in the feature store without recomputing the entire pipeline.
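The reuse argument can be made concrete with a toy example. The sketch below is not the API of Hopsworks or Feast; `InMemoryFeatureStore`, its methods, and the feature names are hypothetical and exist only to show that a second model requesting an already stored feature triggers no recomputation.

```python
from typing import Callable, Dict, Hashable, List


class InMemoryFeatureStore:
    """Toy feature store: computes each feature once per entity and caches it."""

    def __init__(self) -> None:
        self._definitions: Dict[str, Callable[[str], object]] = {}
        self._values: Dict[str, Dict[Hashable, object]] = {}
        self.compute_calls = 0  # counts how often a feature had to be computed

    def register(self, name: str, fn: Callable[[str], object]) -> None:
        self._definitions[name] = fn
        self._values[name] = {}

    def get(self, names: List[str], entity_id: Hashable, raw: str) -> Dict[str, object]:
        row = {}
        for name in names:
            cache = self._values[name]
            if entity_id not in cache:
                cache[entity_id] = self._definitions[name](raw)
                self.compute_calls += 1
            row[name] = cache[entity_id]
        return row


store = InMemoryFeatureStore()
store.register("text_length", len)
store.register("word_count", lambda text: len(text.split()))
store.register("has_digits", lambda text: any(ch.isdigit() for ch in text))

# The first model computes two features for entity 1.
features_model_a = store.get(["text_length", "word_count"], entity_id=1, raw="Acme Ltd 2000")
calls_after_model_a = store.compute_calls

# The second model reuses text_length and only computes the new has_digits feature.
features_model_b = store.get(["text_length", "has_digits"], entity_id=1, raw="Acme Ltd 2000")
calls_after_model_b = store.compute_calls
```

A production feature store adds persistence, versioning, and access control on top of this idea, but the cost saving comes from the same mechanism.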

Computation orchestration. These tools provide the functionality and flexibility required to execute developed modeling or data engineering pipelines. Some ML process management tools (like Kubeflow Pipelines or Polyaxon) provide integrated orchestration functionality. However, most data science platforms also require a specialized tool that is set up or customized to tackle more general tasks (like data engineering, modeling, or data routing). This group includes well-known tools like Apache Airflow, Luigi (initially developed by Spotify), Apache NiFi, and Pachyderm. Note that we do not consider tools such as message queues (RabbitMQ, Apache ActiveMQ, Apache Kafka), which route data streams by organizing them into queues. Likewise, we do not consider other tools that are targeted at data engineering. Apache Airflow is arguably the most used framework in this group, and it is used and developed by a range of companies.
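At their core, orchestration tools of this kind resolve a graph of task dependencies into a valid execution order. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9 or later). It does not use the API of Airflow, Luigi, or any other tool named above, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each task maps to the set of tasks that must finish before it can start.
DEPENDENCIES: Dict[str, Set[str]] = {
    "split_dataset": set(),
    "engineer_features": {"split_dataset"},
    "encode_features": {"engineer_features"},
    "train_model": {"encode_features"},
    "evaluate_model": {"train_model", "encode_features"},
}


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the tasks in an order that respects every dependency."""
    # static_order() raises graphlib.CycleError if the graph contains a cycle.
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
```

Real orchestrators add scheduling, retries, logging, and distributed execution on top of this ordering step.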

Deployment. This group of tools is developed to provide functionality mainly for the deployment of finalized models. The number of candidate tools is also extensive, and only a small fraction are considered for testing. However, some of the tools that are not considered in this report could be considered viable alternatives for future testing, if they meet the functional requirements for the implementations.

We did not consider purely commercial tools in this analysis and chose vendor-agnostic tools that can later be customized and set up to meet our needs. We did, however, evaluate tools that provide both a community edition and a commercial or enterprise edition, since enterprise support is considered an advantage for future growth. For a more detailed comparison of commercial tools, like, Weights & Biases,, we refer to analysis in

Testing of MLOps tools

Because of the large number of existing solutions, evaluating them comparatively is not a trivial matter. These tools address different topics in machine learning engineering and deployment, and they are developed to support solutions working at different scales. They also vary in underlying technology and scalability requirements, which makes it difficult to compare their performance given the uncertainty in research and development processes, real-world application development, business requirements, and future demands. Yet several requirements that are relevant for practical applications can be identified. These include:

  • Flexibility – can the tool be easily adopted in multiple situations, meeting the needs for different modeling techniques?

  • Framework support – are the most popular ML and DL technologies and libraries integrated and supported?

  • Multilanguage support – can the tool support code written in multiple languages? Does it have packages for the most popular languages used by data scientists, like R and Python?

  • Multi-user support – can the tool be used in a multi-user environment? Does it meet security requirements?

  • Maturity – is the tool mature enough to be used in production? Is it still developed? Is it used by any large companies?

  • Community support – is the tool supported by any developer groups or backed by large companies? Does it have a commercial version?

To test the tools, we implemented a classifier that tries to detect whether a given text fragment represents a business-to-business (B2B) entity (like a company) or a business-to-customer (B2C) subject (a person). Developing such a classifier requires multiple preprocessing steps and external data sources, so the test pipeline included the following steps:

  • Dataset splitting: for training/testing;

  • Feature engineering: requiring several Python libraries and external datasets (such as family names, frequency ratios, etc.)

  • Feature processing: using categorical (one-hot encoding) conversion of both categorical features and labels

  • Model training: logistic regression and random forest classifiers from the scikit-learn library were considered. For TensorFlow implementation, we considered custom Keras classifier and Google TabNet implementations.

  • Model evaluation.

We considered both Python and R implementations (the Keras and TabNet classifiers were not tested in the R environment). Classifier performance was not our main consideration; therefore, we did not apply any procedures to tune the hyperparameters (although frameworks like Katib, hyperopt, or TFX Tuner could certainly be useful in this situation). Instead, we focused on tools that organize and orchestrate the ML pipeline tasks, as they may lead to unified, reusable, and reproducible results and increase the quality of the final models.
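For readers unfamiliar with the feature processing step listed above, the following is a minimal sketch of one-hot encoding in plain Python. The helper names are hypothetical, and the handling of unseen categories (an all-zero row) is an assumption of this example rather than a description of the test pipeline.

```python
from typing import List, Sequence


def fit_categories(values: Sequence[str]) -> List[str]:
    """Fix the column order using the training data only."""
    return sorted(set(values))


def one_hot(value: str, categories: List[str]) -> List[int]:
    """Encode one value; a category unseen during fitting becomes an all-zero row."""
    return [1 if value == category else 0 for category in categories]


label_categories = fit_categories(["B2C", "B2B", "B2C"])
encoded_label = one_hot("B2C", label_categories)
```

Fitting the categories on the training split and reusing them for the test split keeps the column layout identical in both, which is what makes the encoded data comparable across pipeline runs.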

After a thorough analysis of the tools, the following were selected for further testing:

In this experiment, we used:

  • KubeFlow 1.0,

  • KFServing 0.4.1,

  • scikit-learn 0.24,

  • TFX 0.24,

  • BentoML 0.11,

  • Clipper 0.4.1,

  • MLFlow 1.13,

  • TensorFlow Serving 2.3,

  • ClearML 0.16

Model training. The classifier was trained using a dataset consisting of 376,617 entries. 30% were selected as the holdout sample for testing, while the remaining data was used for training. Note, for simplicity, we did not consider parameter tuning and used scikit-learn classifiers with the default parameters (the only exception is Random Forest which used a total of 200 estimators instead of the default 100). The implemented TensorFlow classifier was a deep neural network with three hidden dense layers of 256, 64, and 16 units, respectively. We also tested Google’s TabNet classifier with default settings as defined in the implementing library tf-TabNet.
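The setup described above can be summarized in a short sketch. The split function below is a plain-Python stand-in, not the code used in the experiment; the random seed and the absence of stratification are assumptions, and the `MODEL_SETTINGS` dictionary (including the `hidden_units` key) is a hypothetical way to record the non-default settings mentioned in the text.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def holdout_split(
    rows: Sequence[T],
    test_fraction: float = 0.3,
    seed: int = 42,  # assumed; the article does not state a seed
) -> Tuple[List[T], List[T]]:
    """Shuffle row indices reproducibly and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Settings that differ from library defaults, as described in the text.
MODEL_SETTINGS = {
    "logistic_regression": {},                  # scikit-learn defaults
    "random_forest": {"n_estimators": 200},     # the default is 100
    "keras_dnn": {"hidden_units": [256, 64, 16]},
    "tabnet": {},                               # library defaults
}

train_rows, test_rows = holdout_split(list(range(1000)))
```

Fixing the seed makes the split reproducible across pipeline runs, which matters when the same data is fed to several tools for comparison.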

For Kubeflow Pipelines, we considered only a scikit-learn based implementation. Kubeflow has extensive support for and integration with TensorFlow Extended; however, testing deep learning performance was not one of the goals of this research. Although we identified several problems, the training process itself ran smoothly. It took more than 10 minutes to complete, which could be considered excessive. Nevertheless, this is expected, as Kubeflow is better suited to processing at a much larger scale, where the time required to initialize, set up, and prepare the environment for each step is less significant compared to the time required to run the model training workflows.

Basically, there are two ways to implement workflows in Kubeflow:

  1. Using the Kubeflow API, one can create Docker containers programmatically in Python and generate workflows for deployment;

  2. Creating custom Docker images and orchestrating them using Kubeflow facilities. This could be the preferred option if custom components or multiple languages are used in the pipeline.

The second option could be considered an important factor for future implementations and is one that we might consider for other languages and technologies, like R, Java, or Julia.

The experimental analysis identified several other Kubeflow advantages:

  • Logs in the dashboard are updated almost in real-time, which is very helpful for debugging specific operators.

  • Pipelines can be exported as Argo workflows and loaded using the dashboard.

However, some disadvantages were identified while performing Kubeflow testing:

  • Pods created during workflow runs were not automatically destroyed, which may require additional management.

  • Setup was quite complex and required additional skills in running Kubernetes. While the documentation provided by the website clearly outlines each step, our testing showed that specific versions of Kubernetes (particularly Minikube) and additional setup steps may be required to ensure that it runs smoothly with all components properly set up. We also had difficulties removing Kubeflow from the cluster, as deleting the whole Kubernetes namespace often failed and required additional manual effort.

  • Runs are archived without any option to delete them. While this may be useful for reproducibility or result analysis, it may consume additional storage (especially as the pods generated during runs are not deleted), or it may result in multiple redundant entries that are difficult to navigate or search.

  • It is not possible to edit experiment metadata, like names of experiments or runs.

  • We could not pass binary objects (like Pandas DataFrames) between tasks directly; as stated in the documentation, Kubeflow provides only minimal support for this. Therefore, one must store the intermediary outputs in a storage facility, like Minio or S3. This is mostly performed automatically by the Kubeflow engine, but the user may also need to implement it manually when required.

  • Enforced naming conventions in the code were not properly documented, leading to rather frustrating debugging. While outputting data for metrics visualization, multiple problems were identified, like failure to visualize confusion matrices and restrictions when using white space characters in the names of the metrics, which was not documented.
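The workaround for passing binary objects between tasks, mentioned in the list above, follows a simple pattern: one task serializes its output to shared storage and hands over only the location, and the next task loads the object from there. The sketch below illustrates the pattern with a local temporary directory standing in for Minio or S3. It does not use the Kubeflow API, and all function names are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any

# A local directory stands in for an object store such as Minio or S3.
ARTIFACT_ROOT = Path(tempfile.mkdtemp(prefix="artifacts_"))


def save_artifact(obj: Any, name: str) -> str:
    """Serialize an object and return its location as a plain string."""
    path = ARTIFACT_ROOT / f"{name}.pkl"
    with path.open("wb") as handle:
        pickle.dump(obj, handle)
    return str(path)


def load_artifact(path: str) -> Any:
    """Load an object written by save_artifact (only unpickle trusted data)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def preprocess_task() -> str:
    # Only the path leaves the task, never the in-memory object.
    rows = [
        {"text": "Acme Ltd", "label": "B2B"},
        {"text": "John Smith", "label": "B2C"},
    ]
    return save_artifact(rows, "preprocessed")


def train_task(input_path: str) -> int:
    rows = load_artifact(input_path)
    return len(rows)  # stands in for actual model training


artifact_path = preprocess_task()
rows_seen_by_training = train_task(artifact_path)
```

Because each task runs in its own container, a string such as a path or object key is the only thing that can be passed safely between steps, which is why the storage layer becomes part of the pipeline design.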

The TensorFlow Extended (TFX) application introduced more challenges than we initially expected. First, we observed a lack of flexibility during the implementation of the test model. The required functionality could possibly be implemented by overriding or extending existing components, but this would require a significant investment of time.

Second, we also identified difficulties like conversion between sparse and dense tensor formats, and difficulties integrating preprocessing steps that do not use tensor operations, which would slow down the adoption of TFX.

Incompatibilities between component versions were another issue, as some of the components, like TF Transform, TF Privacy, TF Encrypted, supported only TensorFlow 1.x or were in beta for TensorFlow 2.0. Again, this is expected to change in the future as TFX and the whole TensorFlow platform continuously evolve.

Finally, we had issues while running some of the visualization tools provided with the platform (like Tensorflow Model Analysis), which did not work in our Jupyter Lab environment.

Nevertheless, testing with our classifier proved the potential of this framework for future tasks requiring large-scale deep learning, which should be run using orchestration tools like Apache Airflow. Yet we also identified the need for more extensive evaluation and a rather steep learning curve, which would require additional time and expertise to successfully execute more advanced experimental setups.

Experiment versioning and model management. Experiment versioning tools play a similar role to code or component versioning in software engineering. However, they require more flexibility, as additional metadata (performance metrics, visualization outputs) must be stored together with the modeling output.

Therefore, we tested two tools that satisfied our requirements after the initial analysis: MLFlow and ClearML. The process of developing a CatBoost-based classifier was selected to test the whole ML lifecycle. Integration with MLFlow went smoothly, and it was easy to implement and store additional metrics using the provided API. The developed model artefacts could then be versioned and sent for deployment.

We also tested support for R language using a similar classifier (ranger and tidymodels based implementation), and no problems were observed while storing the experimental results, evaluation metrics, or generated artefacts. The integration of ClearML was also seamless, providing a solid user experience and options for further scalability.

Moreover, ClearML provides more flexibility and functionality than MLFlow, such as hyperparameter logging and the storage of logs and plots. However, MLFlow is more applicable if R language support is required, as ClearML does not offer any support for R language-based workflows.
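To make the kind of metadata discussed in this section concrete, the sketch below stores parameters, metrics, and a fingerprint of the model artefact in a single run record. It is a generic illustration and not the API of MLFlow or ClearML; the `RunRecord` class and its fields are hypothetical, and the metric value is a placeholder rather than a measured result.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class RunRecord:
    """One experiment run: what was tried, how it scored, and what it produced."""
    experiment: str
    params: Dict[str, object]
    metrics: Dict[str, float] = field(default_factory=dict)
    artifact_sha256: str = ""
    started_at: float = field(default_factory=time.time)

    def attach_artifact(self, payload: bytes) -> None:
        # The hash ties the record to one exact version of the model file.
        self.artifact_sha256 = hashlib.sha256(payload).hexdigest()

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


run = RunRecord(experiment="b2b-b2c-classifier", params={"n_estimators": 200})
run.metrics["accuracy"] = 0.0  # placeholder value, not a measured result
run.attach_artifact(b"serialized-model-bytes")
stored = json.loads(run.to_json())
```

Dedicated tools add a server, a user interface, and artefact storage around records like this one, which is where the differences between them become visible.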

Model deployment. Finally, we tested the tools mentioned (BentoML, KFServing, Seldon, TensorFlow Serving, and MLFlow) as deployment options. The first three tools were selected to deploy a scikit-learn based classifier (although they also support deep learning-based classifiers), and TensorFlow Serving was used to deploy the Keras and TabNet based classifiers developed using TFX.

To deploy a classifier in the BentoML environment, we implemented a class extending the core BentoML service class. After implementing it, one must build and save the BentoML service (which is also versioned), which can then be served directly as a local service. Alternatively, it is possible to build a Docker container using this service together with all the artefacts (model files, external datasets, etc.) and serve it from Docker/Kubernetes environment. The BentoML documentation also states that it can scale extremely well and is capable of reaching high serving performance. The whole deployment process was executed successfully, and the service built with BentoML satisfied our requirements almost perfectly. This is also promising because BentoML can be integrated with external tools like Kubeflow and deployed in the Kubernetes environment, which might become a functional requirement in the future.

KFServing is a subproject by the Kubeflow developers that aims to serve models in the Kubernetes environment. This is also done by building a Docker image and deploying it as a container. Given the requirements of our test project, the KFServing module for scikit-learn based classifiers was selected for testing. However, the following issues were observed during implementation, which indicate that KFServing should be tested more thoroughly:

  • the project seemed quite outdated (it required scikit-learn 0.20, whereas scikit-learn 0.24 was the active version at the time of testing);

  • the implementation which required custom preprocess/post-process steps did not work as expected or described in the sample code.

Seldon is a promising tool. It offers a full range of functionality required to successfully deploy state-of-the-art ML and DL solutions on an almost unlimited scale using the Kubernetes platform. However, its setup is quite complicated, making it difficult to evaluate the solution completely. Therefore, for some functionality testing, we used simple examples provided together with the reference documentation and checked whether they could be integrated with our test classifiers.

This tool provides advanced functionality that is not present in similar tools (intelligent routing for load balancing, drift detection, outlier detection, and explainable machine learning). These features are implemented as separate libraries and could be easily integrated into custom solutions. Seldon can be used with almost any major programming language used in machine learning research, as the underlying technology can convert the provided code into a Docker image, which is then deployed in the Kubernetes cluster.

Because of its extensive functionality and its ongoing adoption, we expect to test this technology more intensively in the future, especially if scalability requirements increase.

Clipper is an interesting option, despite its lack of further development and support. Python model deployment was rather easy (although we did not test deep learning support), and R modeling support was an additional benefit.

While there are much better and more mature options for Python model deployment, Clipper could serve as an option for deploying R models, which are poorly supported by other tools (at the time of analysis, only MLFlow had support for R model deployment).

Running Clipper with R language support required minor corrections (such as updating the outdated R language and R package versions in the Dockerfile used to build the base Docker image). After fixing these issues, we were able to deploy the example provided together with the tool in its GitHub repository and even deploy our tested classifier.

Unfortunately, we could not get the deployed classifier running, because the system libraries required to compile its R package dependencies were missing. Nevertheless, Clipper managed to collect and deploy all the required dependencies, such as R packages, data structures, and external files; hence, it might be applicable for less complex setups.

MLFlow also provides options for model deployment. This seemed a very viable option, considering its support for full model lifecycle management. Indeed, it generated Docker images for deployment using a scikit-learn based classifier, and these were easily deployed. This corresponds to the main principles of continuous integration.

However, the generated image does not provide a Swagger-based OpenAPI interface and is highly bloated with additional libraries, while BentoML seems to provide a more lightweight and flexible solution.

Unfortunately, R language support might be rather tedious to implement and might even require advanced R skills such as metaprogramming (as demonstrated in this wonderful blog post by David Neuzerling). MLFlow supports a single R model framework and does not look mature enough to support more advanced modeling requirements. Hence, it might even be more appropriate to use custom-built Docker images instead.

TensorFlow Serving is part of the TFX framework. It therefore supports only the TensorFlow framework and was used to test the deployment of the TensorFlow-based classifiers. However, as with the whole TFX framework, the setup initially turned out to be tedious: it even failed to install using the Ubuntu apt tool, as the repositories provided in the setup reference did not work.

Fortunately, after the setup, we managed to run our classifier successfully. However, unlike BentoML or Seldon, it does not generate an OpenAPI interface for model serving, which would be beneficial for testing and documenting services.

Another drawback is the possible limitations in the documented service API – it provides endpoints for classification, regression, or prediction. It does not consider other tasks often considered in modeling (like ranking). While it still should be possible to implement using workarounds, it was impossible to validate this statement at this stage.

However, one of the nice features of TF Serving is the ability to serve different versions of the model, with the version encoded in the REST endpoint path structure. While the options described previously in this section seem to be more viable for the deployment of TensorFlow-based models, the TF Serving module could be tested further together with the TFX framework.
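The versioned endpoint structure can be illustrated with a small helper that builds the request URL and body for the TensorFlow Serving REST API. The path layout (`/v1/models/<name>[/versions/<version>]:<verb>`), the `instances` request key, and the default REST port 8501 follow the TensorFlow Serving documentation; the function names, the host, and the model name `b2b_classifier` are hypothetical. No request is sent, so the sketch runs without a server.

```python
import json
from typing import Optional, Sequence

SUPPORTED_VERBS = {"predict", "classify", "regress"}


def endpoint_url(
    host: str,
    model_name: str,
    verb: str = "predict",
    version: Optional[int] = None,
    port: int = 8501,
) -> str:
    """Build a TensorFlow Serving REST URL, optionally pinned to a model version."""
    if verb not in SUPPORTED_VERBS:
        raise ValueError(f"unsupported verb: {verb}")
    # Omitting the version segment lets the server choose the version it serves by default.
    version_part = f"/versions/{version}" if version is not None else ""
    return f"http://{host}:{port}/v1/models/{model_name}{version_part}:{verb}"


def predict_body(instances: Sequence[Sequence[float]]) -> str:
    """Serialize input rows in the row format expected by the predict endpoint."""
    return json.dumps({"instances": [list(row) for row in instances]})


pinned_url = endpoint_url("localhost", "b2b_classifier", version=2)
default_url = endpoint_url("localhost", "b2b_classifier")
body = predict_body([[1.0, 0.0, 0.0]])
```

Pinning the version in the path makes it possible to keep an older model reachable while a newer one is being evaluated, without changing the client code beyond the URL.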

Final points

Successful management of the machine learning cycle (MLOps) requires tools with extensive functionality and people with different skill sets.

MLOps management must cover the management and versioning of models, experiments, the experiment runs, features, distributed training, automated deployments, and scaling to support thousands or even millions of users.

To achieve success, a high level of automation and expertise in multiple fields is required, as well as a proper selection of tools that could be used. Our experiences reveal that this is not an easy task, requiring a significant amount of time for evaluation.

While we identified the most interesting tools among the tested ones, other tools should not be discarded, as there is no “silver bullet” for every case. While MLFlow is one of the well-known and applicable tools for versioning experiments, experiment runs, and models, it suffers from multiple limitations, including the absence of a multi-user environment, sharing capabilities, role-based access, and advanced security features, which makes its application in an enterprise environment very complicated.

ClearML seems to be a very promising solution that can be used to fully support the ML lifecycle, from experiment to deployment in production. Unfortunately, it is limited to the Python language. However, almost all tools support the most popular deep learning frameworks (like TensorFlow or PyTorch), which makes these frameworks relatively easy to support in the whole MLOps process.

KubeFlow, while neither very user-friendly nor easy to set up, is a very powerful solution for scaling processing and training models with millions of data points.

TensorFlow Extended has similar properties: although it has a very steep learning curve, it is able to scale and perform distributed computations using orchestration frameworks like Apache Airflow. Feature store tools are not widely available in the open-source space, where the selection is very limited; yet their adoption in large companies looks very promising and is one of the key driving factors for future trends in machine learning lifecycle management.

Due to its features and ease of use when generating the final deployment solution, BentoML is a solid option for the deployment of Python-based solutions. It could be extended with additional functionality, like drift detection or explainable machine learning, taken from other libraries. There are alternative solutions that could be tested in parallel or later.

The situation is much more complicated for R language-based model deployment, as both implementations in MLFlow and Clipper have issues that might restrict automated deployment of more complex models.

Therefore, the best option for R-driven workflows could be implementing an internal tool, either by extending or reusing the basis of existing tools or by developing it completely from scratch.

However, one must be aware of the problems that will arise from the system (Linux) library dependencies that are not considered by any tool and require additional handling during deployment automation.

Darius Dilijonas

CTO & Head of Data Science Team; Partnership Professor at Vilnius University

Paulius Danėnas

Data scientist / Software developer / Researcher / ML Engineer
