Pachyderm joins the Open Data Hub family!


Because data science teams need to move from research and analysis to training, maintaining, and optimizing their production models, they need a set of MLOps tools to automate their machine learning lifecycle. Operationalizing machine learning is a complex area that demands a lot of time and, most often, a separate set of skills, ranging from data and systems engineering to cloud architecture. Deploying data science in production is challenging at best.

As of February 22, 2022, the Pachyderm Community Edition is available on the Open Data Hub. Users can install the Pachyderm operator and quickly launch the platform, lowering the barrier to entry for MLOps.

Easing the transition to MLOps with the Open Data Hub

For those unfamiliar with Open Data Hub (ODH)*, ODH is an open source project that provides a blueprint for building an AI-as-a-service platform on Kubernetes-based Red Hat OpenShift and related products in the Red Hat portfolio, such as Ceph object storage.

The Open Data Hub combines various open source AI tools into a single installation. With the click of a button, it can be launched on Red Hat OpenShift through the Open Data Hub Operator. On the platform, data scientists can create models using Jupyter notebooks and choose from popular tools for model development and deployment.

As a result, data scientists can save time setting up a stable and scalable AI/ML environment with the Open Data Hub. Read “How Red Hat Data Scientists Use and Contribute to the Open Data Hub” for a better idea of what the Open Data Hub offers the world of data science.

* Note that ODH is an open source community project that inspired, and provides the technology foundation for, Red Hat OpenShift Data Science (RHODS). RHODS is a cloud service that provides a variety of the technologies offered in the Open Data Hub, along with additional support from the Red Hat team. Pachyderm is working with Red Hat to make its Enterprise product available on RHODS.

A robust MLOps stack with Pachyderm


Pachyderm provides the data foundation for the machine learning lifecycle. It is the data layer that feeds the entire ML loop, providing petabyte-scale data versioning and lineage tracking, as well as fully autoscaling, data-driven pipelines.

Using Pachyderm as the foundation of a modern MLOps stack allows you to:

  • Automate your data tasks into flexible pipelines. These pipelines are code- and platform-independent, so you can use the best tools for your specific ML applications.

  • Scale and optimize for large amounts of unstructured and structured data. Everything in Pachyderm is a file, so Pachyderm works with any type of data: images, audio, CSV, JSON, and more. It is designed to automatically parallelize your code to scale up to billions of files.

  • Process data incrementally. Pachyderm comes with unique features such as incremental processing, in which it only processes the differences or changes to your data, which can reduce processing time by an order of magnitude.

  • Version all changes to your data, including metadata, artifacts, and metrics, providing end-to-end reproducibility and immutable data lineage. This greatly reduces debugging effort and helps meet data governance and audit requirements. Please note that Pachyderm's data lineage is IMMUTABLE, ENFORCED, and AUTOMATIC. You cannot run a Pachyderm process without its lineage being recorded. All of this is tracked behind the scenes as a fundamental property of the data, and ML teams don’t need to do anything on their own.
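To make the idea of incremental processing concrete, here is a minimal, purely illustrative Python sketch. It is not Pachyderm's implementation or API: the helper names (`snapshot`, `changed_paths`, `process_incrementally`) and the toy `word_count` transform are hypothetical. It assumes each version of the data can be summarized as a map from file path to content hash, so that only files whose hash changed need to be reprocessed.

```python
from __future__ import annotations

import hashlib
from typing import Callable

# Hypothetical helpers for illustration only; this is NOT Pachyderm's API.

def snapshot(files: dict[str, bytes]) -> dict[str, str]:
    """Summarize a version of the data as {path: SHA-256 of content}."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}

def changed_paths(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Paths that are new, or whose content differs from the previous snapshot."""
    return sorted(p for p, digest in current.items() if previous.get(p) != digest)

def process_incrementally(
    files: dict[str, bytes],
    previous: dict[str, str],
    results: dict[str, int],
    transform: Callable[[bytes], int],
) -> dict[str, str]:
    """Re-run `transform` only on changed files and return the new snapshot."""
    current = snapshot(files)
    for path in changed_paths(previous, current):
        results[path] = transform(files[path])
    # Drop cached results for files that no longer exist.
    for path in set(results) - set(current):
        del results[path]
    return current

calls: list[str] = []  # records how many times the transform actually ran

def word_count(data: bytes) -> int:
    calls.append("run")
    return len(data.split())

results: dict[str, int] = {}
v1 = process_incrementally({"a.txt": b"one two", "b.txt": b"three"}, {}, results, word_count)
# Only b.txt changes in the second version, so only b.txt is reprocessed.
v2 = process_incrementally(
    {"a.txt": b"one two", "b.txt": b"three four five"}, v1, results, word_count
)
print(results, len(calls))
```

Keeping the snapshots (`v1`, `v2`) around also gives a simple record of which data version produced which result, which is the intuition behind versioning and lineage, though a real system tracks far more than this sketch does.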

Pachyderm Enterprise builds on the Community Edition to provide additional features such as Console (the Pachyderm UI), user access control, and robust support from the Pachyderm team. For more information, contact Pachyderm at info@pachyderm.io or subscribe to Pachyderm on the Red Hat Marketplace.

Pachyderm high-level architecture

Before diving into the guide for installing Pachyderm with the Pachyderm Operator, let’s take a quick look at the architectural layers at play.

  • The Open Data Hub Operator is installed on an OpenShift cluster.
  • The Open Data Hub Operator installs JupyterHub, the Pachyderm Operator, and Ceph Nano.
  • Ceph creates a new object store (an S3-compatible bucket).
  • The Pachyderm cluster uses the object store provided by Ceph.
  • Jupyter notebooks access the Pachyderm cluster.
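The layering above can be read as a dependency graph. The following sketch, which uses only Python's standard library, derives one valid bring-up order from it. The component labels are hypothetical names chosen for illustration, not real resource names, and the graph simply encodes the bullets above rather than anything the operators actually compute.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key depends on the components in its set.
# Labels are illustrative only; they are not real Kubernetes resource names.
dependencies = {
    "odh-operator": {"openshift-cluster"},
    "jupyterhub": {"odh-operator"},
    "pachyderm-operator": {"odh-operator"},
    "ceph-nano": {"odh-operator"},
    "s3-bucket": {"ceph-nano"},
    "pachyderm-cluster": {"pachyderm-operator", "s3-bucket"},
    "notebook-access": {"jupyterhub", "pachyderm-cluster"},
}

# static_order() yields every component after all of its dependencies.
install_order = list(TopologicalSorter(dependencies).static_order())

for step, component in enumerate(install_order, start=1):
    print(step, component)
```

Several orders are valid (for example, JupyterHub and Ceph Nano can come up in either order), but the cluster always comes first and notebook access to Pachyderm always comes last.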

Note that the Open Data Hub ships with many components, including Ceph Nano and JupyterHub, which makes deploying Pachyderm relatively easy.


Follow the installation instructions for more details, and then get started with Pachyderm's canonical starter demo.
