
How to choose MLOps tools (MLOps from first principles)

MLOps Apr 21, 2024

If you are a Marvel fan, you might know that killing Thanos does not guarantee a safer world as there are other villains to deal with.


Similarly, when you build any machine learning-based product or service, training and evaluating the model on a few real-world samples does not mean the end of your responsibilities. You need to make that model available to end users, monitor it, and retrain it for better performance when needed. A traditional machine learning (ML) pipeline is a collection of stages that include data collection, data preparation, model training and evaluation, hyperparameter tuning (if needed), model deployment and scaling, monitoring, security and compliance, and CI/CD.


A machine learning engineering team is typically responsible for the earlier stages of the ML pipeline (data collection through model training and tuning), while deployment, scaling, monitoring, and CI/CD fall under the responsibilities of the operations team. Since most organizations draw a clear line between the machine learning and operations teams, effective collaboration and communication between the two are essential for the successful development, deployment, and maintenance of ML systems. This collaboration between ML and operations teams is what we call MLOps: a practice focused on streamlining the process of deploying ML models to production and then maintaining and monitoring them. Although MLOps is an abbreviation of ML and operations, don't let the name mislead you; it also covers collaboration among data scientists, DevOps engineers, and IT teams.

The core responsibility of MLOps is to facilitate effective collaboration between ML and operations teams to speed up model development and deployment, with the help of continuous integration and continuous delivery (CI/CD) practices complemented by monitoring, validation, and governance of ML models. Tools and software that facilitate automated CI/CD, easy development, deployment at scale, streamlined workflows, and enhanced collaboration are often referred to as MLOps tools. After a lot of research, I have curated a list of MLOps tools that are used at big tech companies like Netflix, Uber, DoorDash, LUSH, and others. We will discuss all of them later in this article.

In this article, you will delve into the key principles and practices of MLOps, and examine the essential MLOps tools and technologies that underpin its implementation. By understanding the role of each tool within the MLOps ecosystem, you'll be better equipped to design and deploy robust ML pipelines that drive business impact and foster innovation.

What is MLOps?

Moving machine learning models to production is tough, especially larger deep learning models, as the journey involves many processes, from data ingestion to deployment and monitoring. Keeping all these stages in sync is difficult when you manage them as standalone processes, and it only gets harder as the project grows. By establishing standardized workflows, automating repetitive tasks, and implementing robust monitoring and governance mechanisms, MLOps enables organizations to accelerate model development, improve deployment reliability, and maximize the value derived from ML initiatives.

Now you might be wondering why you should take my word for all of this. To back it up, let me introduce some real-world use cases where MLOps plays a key role.

  • Credit Scoring in Financial Services: Banks and financial firms deploy machine learning models to assess the creditworthiness of loan applicants. MLOps helps these organizations continuously monitor those systems for accuracy and fairness, with automated processes for retraining and redeploying models as new data becomes available.
  • Autonomous Vehicles: Automotive companies use ML models for autonomous driving systems, including object detection, path planning, and decision-making algorithms. MLOps ensures the reliability and safety of these models through rigorous testing, validation, and continuous monitoring in real-world driving conditions.

These are just a few examples; MLOps finds its value in almost every industry.

What Are the Different Types of MLOps Tools?

MLOps tools play a pivotal role in every stage of the machine learning lifecycle. In this section, you will see a clear breakdown of the role MLOps tools play in each stage of the ML lifecycle.

Pipeline Orchestration Tools

Pipeline orchestration in machine learning refers to the process of managing and coordinating the various tasks and components involved in an end-to-end ML workflow, from data preprocessing and model training to model deployment and monitoring.


MLOps software is really popular in this space as it provides features like workflow management, dependency management, parallelization, version control, and deployment automation, enabling organizations to streamline their ML workflows, improve collaboration among data scientists and engineers, and accelerate the delivery of ML solutions.

Model Training Frameworks

This stage involves creating and optimizing predictive models with labeled or unlabeled data. During training, the models learn the underlying patterns and relationships in the data, adjusting their parameters to minimize the difference between predicted and actual outcomes. You can consider this the most code-intensive stage of the entire ML pipeline, which is why data scientists need to be actively involved: they have to try out different algorithms and parameter combinations.


The machine learning framework scikit-learn is quite popular for training traditional machine learning models, while TensorFlow and PyTorch are popular for training deep learning models built from different neural network architectures.

Model Deployment and Serving Platforms

Once the development team is done training the model, they need to make it available for inference in the production environment, where it can generate predictions. This typically involves deploying the model to a serving infrastructure, setting up APIs for communication, handling model versioning and management, automating scaling and load balancing, and ensuring reliability and performance.


MLOps tools offer features such as containerization, orchestration, model versioning, A/B testing, and logging, enabling organizations to deploy and serve ML models efficiently and effectively.

Monitoring and Observability Tools

Developing and deploying models is not a one-time process. When you train a model on a certain data distribution, you expect it to receive data from the same distribution in production. In practice, real-world data distributions are prone to change, which degrades the model's predictive power; this is what you call data drift. There is only one way to identify data drift: continuously monitoring your models in production.


Model monitoring and observability in machine learning include tracking key metrics such as prediction accuracy, latency, throughput, and resource utilization, as well as detecting anomalies, drift, and concept shifts in the data distribution. MLOps monitoring tools can automate the collection of telemetry data, enable real-time analysis and visualization of metrics, and trigger alerts and actions based on predefined thresholds or conditions.
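To make data drift concrete, here is a minimal sketch of a per-feature drift check using a two-sample Kolmogorov-Smirnov test from SciPy. The synthetic arrays and the 0.01 threshold are illustrative assumptions; dedicated monitoring tools run far richer checks across many features and metrics.

```python
# Minimal per-feature drift check: compare production data against the
# training distribution with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training distribution
prod_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)   # shifted production data

stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:  # the threshold is a per-feature judgment call
    print(f"Possible data drift (KS statistic={stat:.3f}, p={p_value:.4f})")
```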

Collaboration and Experiment Tracking Platforms

Suppose you are developing an ML system along with a team of fellow data scientists. If you have no mechanism that tracks which models have been tried, who is working on which part of the pipeline, and so on, it will be hard to determine which models you or others have already tried. Two developers may even end up building the same feature, which is a waste of time and resources. And since you are not tracking anything about your project, you certainly cannot reuse that knowledge in other projects, which limits reproducibility.


Collaboration and experiment-tracking MLOps tools allow data scientists and engineers to collaborate effectively, share knowledge, and reproduce experiments for model development and optimization. These tools offer features such as experiment tracking, versioning, lineage tracking, and model registry, enabling teams to log experiments, track changes, and compare results across different iterations of ML models.

Data storage and versioning

While working on ML pipelines, you make significant changes to the raw data in the preprocessing phase. If, for some reason, you cannot train your model right away, you want to store this preprocessed data to avoid repeating the work. The same goes for the code: you always want to pick up where you left off in your previous session.


MLOps data storage and versioning tools offer features such as data versioning, artifact management, metadata tracking, and data lineage, allowing teams to track changes, reproduce experiments, and ensure consistency and reproducibility across different iterations of ML models.

Compute and infrastructure

When you talk about training, deploying, and scaling models, everything comes down to compute and infrastructure, especially now that LLMs are making their way into industry generative AI projects. You can surely train a simple classifier on a machine with 8 GB of RAM and no GPU, but it would not be prudent to train an LLM on the same infrastructure.


Compute and infrastructure tools offer features such as containerization, orchestration, auto-scaling, and resource management, enabling organizations to efficiently utilize cloud resources, on-premises infrastructure, or hybrid environments for ML workloads.

Best MLOps Tools & Platforms for 2024

In this section, you will learn about the top MLOps tools and platforms that are commonly used across organizations for managing machine learning pipelines.

Data storage and versioning

Some of the most popular data storage and versioning tools are Git and DVC. Let’s explore them in detail.

Git

Git is a distributed version control system that allows developers to manage code repositories in software development projects, including machine learning projects. It is not specifically an MLOps tool, but it is often used as part of MLOps workflows to version control machine learning code, models, and other artifacts.

Git tracks changes to files and directories in a repository, enabling users to view the history of changes, revert to previous versions, and collaborate with team members on code development. It allows users to create branches to work on new features or experiments in isolation, and later merge those changes back into the main codebase. Git integrates seamlessly with continuous integration and continuous deployment (CI/CD) pipelines, enabling automated testing, building, and deployment of machine learning models and applications.

Git is a fully open-source version control tool that can be used freely by individuals and organizations.

DVC (Data Version Control)

Since Git cannot store large dataset files directly, you need a tool that can handle different versions of your dataset. Data Version Control (DVC) is an open-source version control system designed specifically for handling large datasets and machine learning models. It works alongside Git to version control both code and data, enabling reproducibility and collaboration in machine learning projects.

DVC allows users to version control datasets by storing them in a remote storage location (e.g., Amazon S3, Google Cloud Storage) and tracking changes to data files using Git. It manages dependencies between code, data, and machine learning pipelines, ensuring that changes to data or code are properly tracked and reproducible. DVC supports parallel execution of machine learning pipelines, enabling efficient utilization of compute resources and faster model training.
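As a quick illustration, here is a minimal sketch that reads a specific version of a dataset through DVC's Python API; the file path, repository URL, and tag are hypothetical placeholders.

```python
# Sketch: stream one revision of a DVC-tracked file without cloning the repo.
import dvc.api

with dvc.api.open(
    "data/train.csv",                       # hypothetical path tracked by DVC
    repo="https://github.com/org/project",  # hypothetical Git repository
    rev="v1.0",                             # Git tag, branch, or commit
) as f:
    print(f.readline())  # peek at the header row
```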

Similar to Git, DVC is also an open-source tool that has no licensing cost.

Model Training Frameworks

Let’s have a look at the model training frameworks that are widely used by data scientists for machine learning development.

TensorFlow

TensorFlow is a popular machine learning framework developed by Google that offers implementations of a wide range of neural network models. It provides features for building as well as deploying various deep learning-based solutions.

TensorFlow is a flexible and scalable platform for building a variety of ML models, including deep learning models, reinforcement learning algorithms, and traditional machine learning models. It is designed to leverage hardware acceleration (e.g., GPUs, TPUs) for fast and efficient model training and inference, making it suitable for large-scale ML tasks. It also offers multiple abstraction layers, including TensorFlow Core, TensorFlow Keras (high-level API), and TensorFlow Estimators (pre-built model architectures), allowing users to choose the level of abstraction that best fits their needs.
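To give a flavor of the API, here is a minimal Keras sketch that trains a tiny binary classifier on random data; the layer sizes and training settings are arbitrary illustrative choices.

```python
# Minimal Keras example: a small binary classifier trained on toy data.
import numpy as np
import tensorflow as tf

X = np.random.rand(256, 20).astype("float32")  # toy features
y = np.random.randint(0, 2, size=(256,))       # toy binary labels

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))  # [loss, accuracy]
```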

TensorFlow does not have any licensing cost as it is made available open source by Google.

PyTorch

Developed by Facebook’s AI Research lab (FAIR), PyTorch is a popular machine learning framework that offers a flexible and dynamic approach to building and training neural networks. Unlike TensorFlow, PyTorch gives you more control over the training process, as less of it is abstracted away. This is a key reason why companies like Tesla and Netflix use PyTorch for their ML use cases.

PyTorch uses dynamic computational graphs, allowing for more flexibility and intuitive debugging compared to static graph-based frameworks like TensorFlow. It seamlessly integrates with CUDA for GPU acceleration, enabling faster model training and inference on NVIDIA GPUs.
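Here is a minimal sketch of that imperative style: a tiny training loop on toy data, with the graph rebuilt on every forward pass. The model shape and hyperparameters are arbitrary illustrations.

```python
# Minimal PyTorch training loop showing the dynamic, imperative style.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

X = torch.randn(256, 20, device=device)                   # toy features
y = torch.randint(0, 2, (256, 1), device=device).float()  # toy labels

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(3):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)  # the graph is built on the fly each pass
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss={loss.item():.4f}")
```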

Similar to TensorFlow, PyTorch is also an open-source tool that allows you to develop deep learning models for free.

Scikit-learn

Scikit-learn is a machine learning library in Python that is primarily used for data mining and data analysis. It is built on top of NumPy, SciPy, and Matplotlib, and offers a wide range of supervised and unsupervised learning algorithms, as well as tools for model evaluation and preprocessing.

Scikit-learn provides a consistent API for training and using machine learning models, making it easy to experiment with different algorithms and techniques. It offers implementations of various machine learning algorithms, including linear and logistic regression, decision trees, random forests, support vector machines, clustering algorithms, and more. It also provides tools for model evaluation, including cross-validation, hyperparameter tuning, and metrics such as accuracy, precision, recall, and F1-score.
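That consistent API is easy to see in a short sketch; the dataset below is synthetic and the pipeline is just one illustrative choice.

```python
# Sketch: scikit-learn's fit/predict API with a preprocessing pipeline
# and 5-fold cross-validation on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```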

There is no licensing cost for Scikit-learn; you can create and use different ML models with it for free.

Collaboration and Experiment Tracking Platforms

Let’s have a look at popular collaboration and experiment tracking tools like MLflow and DagsHub.

MLflow

MLflow is a popular collaboration and experiment tracking platform for managing the end-to-end machine learning lifecycle. It streamlines the process of building, training, and deploying ML models by providing features like model packaging, experiment tracking, and model versioning.

MLflow allows users to log and track experiments, including parameters, metrics, and artifacts (e.g., trained models), enabling reproducibility and collaboration. It supports versioning of machine learning models, allowing users to track changes over time and roll back to previous versions if needed. MLflow includes a model registry for managing and organizing machine learning models, providing features such as model approval workflows, access control, and model lineage tracking.
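A minimal sketch of the tracking API looks like this; the experiment name, parameter values, and artifact path are hypothetical.

```python
# Sketch: logging parameters, metrics, and an artifact for one MLflow run.
import mlflow

mlflow.set_experiment("demo-experiment")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("val_accuracy", 0.93)
    mlflow.log_artifact("model.pkl")  # assumes this file exists locally
```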

MLflow is free to use, with no licensing cost, which means you can improve collaboration among teams at no charge.

DagsHub

DagsHub is an MLOps platform, built around popular open-source tools, for managing and collaborating on machine learning projects. It provides a suite of tools and features that streamline the development, versioning, and deployment of machine learning models. You can consider DagsHub a workflow layer that adds structure to your project and connects the different pieces in a way that makes sense.

DagsHub leverages Git for version control, allowing users to track changes to their machine learning code, data, and models over time and collaborate with team members using familiar Git workflows. It integrates with tools like DVC for versioning and managing large datasets, enabling users to track changes to data files and ensure reproducibility across different experiments and environments. It also allows users to log and track experiments with tools like MLflow, including parameters, metrics, and artifacts (e.g., trained models), providing visibility into the model development process and facilitating collaboration among team members.
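As a sketch, one common pattern is to point MLflow's tracking API at the MLflow server DagsHub hosts alongside a repository; the repository URL and credentials below are placeholders.

```python
# Sketch: sending MLflow runs to a DagsHub-hosted tracking server.
import os
import mlflow

os.environ["MLFLOW_TRACKING_USERNAME"] = "<your-dagshub-username>"  # placeholder
os.environ["MLFLOW_TRACKING_PASSWORD"] = "<your-dagshub-token>"     # placeholder

mlflow.set_tracking_uri("https://dagshub.com/<user>/<repo>.mlflow")  # placeholder repo

with mlflow.start_run():
    mlflow.log_metric("val_accuracy", 0.91)
```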

DagsHub comes in two pricing plans: a free community version that provides access to features like CI/CD integration, data versioning and lineage, and community support, while features like enhanced security and email support are available only in the paid version.

Model Deployment and Serving Platforms

Some of the most popular tools for deployment, serving, and scaling are as follows:

Amazon SageMaker

Developed by Amazon Web Services (AWS), Amazon SageMaker is a fully managed machine learning service that allows developers and data scientists to build, train, and deploy machine learning models at scale. SageMaker offers a comprehensive set of tools and capabilities for the entire machine learning lifecycle.

SageMaker provides managed Jupyter notebooks for data exploration, experimentation, and model development, with built-in support for popular libraries such as TensorFlow, PyTorch, and Scikit-learn. It offers a collection of pre-built algorithms for common machine learning tasks, such as linear regression, XGBoost, and image classification, making it easy to get started with model training. SageMaker allows users to train custom machine learning models using their own algorithms and training scripts, with support for distributed training across multiple instances. Finally, it makes it easy to deploy trained models to production environments using managed hosting services, with support for real-time inference and batch prediction.
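Here is a hedged sketch using the SageMaker Python SDK to train and deploy a scikit-learn model; the entry-point script, IAM role ARN, S3 path, and instance types are placeholders you would replace with your own.

```python
# Sketch: train a scikit-learn script on SageMaker, then deploy an endpoint.
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point="train.py",  # your training script (placeholder)
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    instance_type="ml.m5.large",
    framework_version="1.2-1",
    sagemaker_session=sagemaker.Session(),
)
estimator.fit({"train": "s3://my-bucket/train/"})  # placeholder S3 path

predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
print(predictor.endpoint_name)
predictor.delete_endpoint()  # tear down to avoid ongoing charges
```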

Amazon SageMaker pricing is based on a pay-as-you-go model, with costs calculated based on factors such as instance type, storage usage, and training hours.

Microsoft Azure ML

Provided by Microsoft, Azure Machine Learning (ML) is a cloud-based machine learning platform that enables data scientists and developers to build, train, and deploy machine learning models at scale. Similar to SageMaker, Azure ML offers a range of tools and services for the entire machine learning lifecycle, from data preparation and model development to deployment and monitoring.

Azure ML offers automated machine learning capabilities, allowing users to quickly build and deploy machine learning models without extensive manual tuning or coding. It provides tools for model interpretability and explainability, allowing users to understand how models make predictions and identify factors contributing to model performance. It also enables easy deployment of trained models to production environments using managed services such as Azure Kubernetes Service (AKS) or Azure Container Instances (ACI), with support for real-time and batch inferencing.
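A minimal sketch with the Azure ML Python SDK v2 might look like the following; the subscription, resource group, workspace, compute target, and environment names are placeholders.

```python
# Sketch: submit a training script as a command job with Azure ML SDK v2.
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",     # placeholder
    resource_group_name="<resource-group>",  # placeholder
    workspace_name="<workspace>",            # placeholder
)

job = command(
    code="./src",           # folder containing train.py (placeholder)
    command="python train.py",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",  # assumed curated env name
    compute="cpu-cluster",  # placeholder compute target
)
returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)  # link to follow the run in Azure ML studio
```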

Azure ML pricing is based on a pay-as-you-go model, with costs calculated based on factors such as compute usage, storage usage, and data transfer. There are additional charges for specific services such as model deployment and inference.

Monitoring and Observability Tools

To monitor your machine learning solutions, you can use tools like Prometheus and Grafana. Let’s take a detailed look at the features these tools provide.

Prometheus

Prometheus is a monitoring and alerting toolkit that was originally developed at SoundCloud and is now maintained by the Cloud Native Computing Foundation (CNCF). It is specially designed for monitoring highly dynamic containerized environments such as Kubernetes and provides powerful features for collecting, querying, visualizing, and alerting on time-series data.

Prometheus uses a multi-dimensional data model with key-value pairs, allowing users to efficiently label and query metrics based on various dimensions such as instance, job, and service. It uses the Prometheus Query Language (PromQL), a powerful and expressive language for querying and manipulating time-series metrics, enabling advanced analysis and visualization. Prometheus supports flexible alerting rules based on metric thresholds and conditions, with integrations for sending alerts to various alerting channels such as email, PagerDuty, and Slack.
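For model serving, a common pattern is to expose custom metrics with the official prometheus_client Python library and let Prometheus scrape them; the metric names and the fake inference function below are illustrative.

```python
# Sketch: expose prediction count and latency metrics for Prometheus to scrape.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total", "Total predictions served")
LATENCY = Histogram("model_inference_seconds", "Inference latency in seconds")

@LATENCY.time()
def predict():
    time.sleep(random.uniform(0.01, 0.05))  # stand-in for real inference
    PREDICTIONS.inc()

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        predict()
```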

Prometheus is open source without any licensing cost.

Grafana

Grafana is a popular analytics and visualization platform that is commonly used for monitoring and observability, including for machine learning systems. While tools like Loki and Fluentd act as log aggregators, Grafana allows users to query, visualize, alert on, and understand metrics from multiple data sources, including time-series databases, application logs, and cloud monitoring services.

Grafana provides a wide range of visualization options, including graphs, gauges, heatmaps, and tables, allowing users to create custom dashboards tailored to their specific monitoring needs. It supports integration with numerous data sources, including Graphite, InfluxDB, Elasticsearch, and many others, enabling users to query and visualize metrics from multiple sources in a single dashboard. It also offers built-in alerting capabilities, allowing users to define alerting rules based on metric thresholds and conditions, and receive notifications via email, Slack, PagerDuty, or other alerting channels.

Grafana is also open-source and free to use, with no licensing costs.

Compute and infrastructure

When it comes to compute and infrastructure, Docker and Kubernetes are among the most widely used tools in the MLOps space.

Docker

Docker is a platform for building, packaging, and deploying applications as lightweight, portable containers. Docker solves the classic “it works on my system” problem by packaging your application into a container that runs on any platform and produces the same result. It is commonly used in MLOps workflows for containerizing machine learning models and their dependencies, enabling consistent and reproducible deployments across different environments.

Docker enables applications to be packaged into containers, which encapsulate the application code, runtime, libraries, and dependencies, making it easy to deploy and run applications consistently across different environments. Docker containers provide process-level isolation, allowing applications to run in isolated environments without interfering with each other or the underlying host system. These containers are portable and can be run on any system that supports Docker, including development laptops, on-premises servers, and cloud-based infrastructure. Docker integrates seamlessly with container orchestration platforms such as Kubernetes, enabling automated deployment, scaling, and management of containerized applications in production environments.
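As a sketch, these steps can also be driven programmatically with the Docker SDK for Python (docker-py); the build directory, image tag, and port mapping are hypothetical.

```python
# Sketch: build a model-serving image and run it as a container.
import docker

client = docker.from_env()  # talks to the local Docker daemon

# Build an image from a Dockerfile in ./serving (hypothetical directory).
image, build_logs = client.images.build(path="./serving", tag="my-model:latest")

# Run the container detached, mapping the serving port to the host.
container = client.containers.run(
    "my-model:latest",
    ports={"8080/tcp": 8080},
    detach=True,
)
print(container.short_id, container.status)
```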

Docker is also an open-source tool that can be used freely by individuals and project teams.

Kubernetes

Once you have created a container for an application, you need a tool that can manage that container and deploy your application. This is where Kubernetes comes in: it is an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications. It is commonly used in MLOps workflows for deploying and managing machine learning models and inference services.

Kubernetes manages the deployment and scaling of containerized applications across a cluster of compute nodes, ensuring high availability and resource efficiency. It provides built-in mechanisms for service discovery and load balancing, enabling applications to communicate with each other and distribute incoming traffic across multiple instances. Kubernetes supports auto-scaling of applications based on resource usage metrics, automatically adjusting the number of replicas to meet demand and optimize resource utilization. It uses declarative configuration files to define the desired state of applications and infrastructure, enabling infrastructure as code and automated deployment pipelines.
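Here is a minimal sketch using the official Kubernetes Python client to scale a model-serving Deployment; the Deployment name and namespace are placeholders, and it assumes a configured kubeconfig.

```python
# Sketch: scale a Deployment and list pods with the Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # assumes a local ~/.kube/config

apps = client.AppsV1Api()
apps.patch_namespaced_deployment_scale(
    name="model-server",  # hypothetical Deployment
    namespace="default",
    body={"spec": {"replicas": 3}},  # desired replica count
)

for pod in client.CoreV1Api().list_namespaced_pod("default").items:
    print(pod.metadata.name, pod.status.phase)
```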

Kubernetes is often the first choice for deploying and managing distributed applications, including machine learning workloads. It can be used for free, as it is also an open-source container orchestration tool.

Pipeline Orchestration Tools

To handle end-to-end workflow orchestration, you can use popular tools like Apache Airflow and Kubeflow Pipelines.

Apache Airflow

Apache Airflow is an open-source workflow orchestration tool for managing complex workflows and data pipelines. It allows users to schedule, monitor, and manage workflows as DAGs (Directed Acyclic Graphs), enabling the automation of tasks across various systems and services.

Airflow provides a flexible and extensible framework for defining and scheduling workflows using Python code or configuration files. It allows users to define dependencies between tasks within a workflow, ensuring tasks are executed in the correct order based on their dependencies. Airflow supports various types of tasks, including Bash commands, Python functions, SQL queries, and more, allowing users to execute a wide range of tasks within their workflows.
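A minimal DAG sketch looks like the following; the DAG id, schedule, and task bodies are illustrative, and the `schedule` argument assumes Airflow 2.4+ (older versions use `schedule_interval`).

```python
# Sketch: a tiny Airflow DAG chaining a data-extraction task and a training task.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling raw data")  # stand-in for a real extraction step

def train():
    print("training the model")  # stand-in for a real training step

with DAG(
    dag_id="ml_pipeline_demo",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    train_task = PythonOperator(task_id="train", python_callable=train)
    extract_task >> train_task  # train runs only after extract succeeds
```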

It is an open-source tool that is free to use without any licensing costs. However, if you or your organization uses Airflow, you may incur costs for the infrastructure resources (e.g., compute instances, storage) used to run Airflow and store workflow data.

Kubeflow Pipelines

If you want to build and deploy scalable, portable, and reproducible machine learning workflows on Kubernetes, Kubeflow Pipelines is the answer. It provides a visual interface for designing, orchestrating, and monitoring ML pipelines, allowing users to define complex workflows as DAGs.

Kubeflow Pipelines offers a user-friendly visual interface for designing and editing ML pipelines, making it easy to define tasks, dependencies, and data flows. Pipelines are built from Docker containers, enabling seamless integration with Kubernetes for efficient resource management and scalability. They also integrate with popular ML frameworks and tools such as TensorFlow, PyTorch, and Scikit-learn, allowing users to leverage their existing workflows and models.
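Here is a minimal sketch using the kfp v2 SDK to define and compile a two-step pipeline; the component bodies and the pipeline name are illustrative.

```python
# Sketch: define two lightweight components, wire them into a pipeline,
# and compile it to YAML for a Kubeflow Pipelines deployment to run.
from kfp import compiler, dsl

@dsl.component
def preprocess() -> str:
    return "processed-data"  # stand-in for a real preprocessing step

@dsl.component
def train(data: str):
    print(f"training on {data}")  # stand-in for a real training step

@dsl.pipeline(name="demo-ml-pipeline")  # hypothetical pipeline name
def demo_pipeline():
    step1 = preprocess()
    train(data=step1.output)

compiler.Compiler().compile(demo_pipeline, "pipeline.yaml")
```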

Similar to Apache Airflow, Kubeflow is also an open-source tool that is available for free to manage ML pipelines.

How to Select the Right MLOps Tools?

Now that you have seen many MLOps tools that can ease your work, from data collection to model deployment and monitoring, it is time to understand how to decide on particular tools when working on ML pipelines. Here are a few tips to keep in mind while selecting the right MLOps tool for you.

Scalability and Performance

You might be able to train and deploy your model with almost any open-source tool, but when it comes to scalability, things get messy: you end up depending on many tools that have to work together. Evaluate MLOps tools on scalability and performance, considering factors like the ability to handle large datasets, support for distributed computing, and performance optimization features.

Integration Capabilities

The tech stack that organizations use for building ML pipelines does not remain constant; teams adopt different technologies based on their budget and the experience of their developers. While choosing an MLOps tool, assess its integration capabilities with your existing infrastructure, data sources, machine learning frameworks, and deployment environments. Look for tools that integrate seamlessly with your tech stack to minimize integration effort and ensure compatibility.

Ease of Use and Learning Curve

Organizations tend to favor tools that are easy to learn and use over tools with a steep learning curve. They do not want to spend the time and money training a developer on a tool for months when the entire project could be built in that time with simpler technology. Consider the ease of use and learning curve associated with the MLOps tools. Look for tools that provide intuitive user interfaces, comprehensive documentation, an active developer community, and educational resources to facilitate adoption and minimize training time.

Support and Documentation

While learning any technology, you have probably noticed that some technologies have well-written documentation and strong community support while others lack both. Tools with better learning resources are preferable, because tools with almost none are hard to learn and make it difficult to resolve issues that arise during development. Evaluate the level of support and documentation provided by the tool vendors or the open-source community. Look for tools with active developer communities, reliable technical support, and comprehensive documentation to address any issues or questions that may arise.

Cost and ROI

You must consider the cost of acquiring, implementing, and maintaining the MLOps tools, including licensing fees, infrastructure costs, and ongoing support expenses. Also, you should evaluate the potential return on investment (ROI) of the tools in terms of improved efficiency, productivity, and the ability to deliver value from machine learning projects.

Conclusion

After reading this article, you now know what MLOps is and the role it plays in the machine learning space. MLOps plays a key role in building, deploying, and maintaining scalable ML models. It does this with the help of several tools and frameworks, called MLOps tools, that complement each stage of the ML pipeline. You have also seen the role of MLOps tools in various ML pipeline stages and explored the best MLOps tools used across various organizations.

Managing each tool individually to maintain the entire ML pipeline can be very hard, since there are so many moving parts; tools like DagsHub make this work easier. DagsHub helps data teams build better ML models and manage end-to-end data projects by managing data, code, experiments, and models all in one place. You can learn more about DagsHub here.


Gourav Singh Bais

Senior Data Scientist & Technical Writer | Contributor at O'Reilly | Evaluator at Great Learning | Expert in scalable data science solution development & deployment.
