The Role of Object Storage in AI, The Modern Datalake, MinIO Days Recap: The July 2024 MinIO Newsletter
Hello, everyone! Welcome back to the July edition of the MinIO Newsletter. It is full summer here in the northern hemisphere and the temperatures are nothing compared to the heat around AI. While some predict an oncoming winter, we continue to see investment, new architecture initiatives and an insatiable appetite for learning. We have you covered this month with a ton of content—so let’s get started.
Last month we launched our MinIO Certification program, which is designed to provide a professional certification for MinIO users. The exam covers MinIO’s core features and capabilities, and it can be taken anywhere. Get more info on the certification webpage. #miniocertified
As always, feel free to opt out. Conversely, if this was forwarded to you, you can always opt in to stay current.
The AI Corner
AI/ML SME Keith Pijanowski has written some of the seminal work on AI-centric modern datalakes, both here and in The New Stack. He contributed two stellar posts this month that should be on everyone's reading list, from the C-suite to the individual contributor.
The first is a top-10 list of tools, technologies and vendors for your gen-AI stack. Gen-AI is very buzzy, and understanding what fits where is immensely important. These technologies all integrate with MinIO (in fact, all gen-AI tools do), so there is no bias, just Keith’s very well-informed opinion.
Keith’s second installment is the Architect’s Guide to MLOps. In it, he explains MLOps practices and tools for managing machine learning models in production. It covers the differences between traditional application development and ML model creation, as well as essential MLOps features such as experiment tracking, model packaging, and data pipeline capabilities. It also discusses the importance of scalable data infrastructure and why modern object storage is the key.
One key element in the world of AI is the ability to benchmark your performance. MinIO has built the best tools in the business and has open sourced them for anyone to use. Our WARP tool is full-featured, S3 performance assessment software that is purpose-built to conduct tests between WARP clients and object storage hosts. It is perfectly suited for benchmarking your AI data infrastructure. Go deep in this piece by AJ here.
The world of AI is built on a foundation of data, object storage and models. That is a highly simplified view of the world, but not inaccurate. Sidharth Rajaram explains why AI workloads are built on modern, high-performance object storage with four excellent arguments—you have to hit the link to see what they are.
Retrieval-Augmented Generation (RAG) is another area of interest for the MinIO community. It is well suited to the capabilities of a modern object store like MinIO. Dileeshvar Radhakrishnan stops by to explain how to build a RAG application using MinIO for data storage, LangChain for text processing, and LanceDB for vector storage. This is a practitioner’s guide, and it emphasizes the importance of scalable data infrastructure in supporting AI and ML workflows.
MinIO Days for days...
Did you miss MinIO Days Silicon Valley? Check out these photos and session write-ups from a superb day of learning:
A Reference Architecture For The Modern Datalake, Keith Pijanowski, AI/ML SME
As businesses strive to unlock the full potential of their data assets, the need for a flexible, scalable, and unified approach to data storage and analytics has become top of mind for enterprise architects who need to build out infrastructure that supports the needs of the business. A Modern Datalake architecture addresses this need by integrating the scalability and flexibility of a Data Lake with the structure and performance optimizations of a Data Warehouse. A Modern Datalake is ½ Data Warehouse and ½ Data Lake and uses object storage for everything.
Engineering for Performance and Scale, Will Dinyes, Training Content Developer
When working with petabytes of data, access speeds become extremely important. In this session, we look at two features of MinIO that can help enhance the speed of access at scale: erasure coding and active-active replication. Erasure coding not only ensures that your data is resilient in the face of drive or server loss, but also breaks large objects into more manageable parts and distributes the read and write load across multiple servers and drives. Active-active replication means that data can be written to any cluster in a deployment and replicated easily to other clusters, allowing data to be geographically located where it makes the most sense.
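To make the erasure coding idea concrete, here is a toy sketch in Python. It is not MinIO's implementation: it assumes a single XOR parity shard, which can repair only one lost shard, whereas production erasure coding uses more capable codes with multiple parity shards. The function names `encode` and `reconstruct` are hypothetical and exist only for this illustration.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(obj: bytes, data_shards: int) -> List[bytes]:
    """Split obj into equal-size data shards plus one XOR parity shard."""
    shard_len = -(-len(obj) // data_shards)  # ceiling division
    padded = obj.ljust(shard_len * data_shards, b"\0")
    shards = [
        padded[i * shard_len:(i + 1) * shard_len] for i in range(data_shards)
    ]
    parity = reduce(xor_bytes, shards)
    return shards + [parity]


def reconstruct(shards: List[Optional[bytes]], original_len: int) -> bytes:
    """Rebuild the object when at most one shard (data or parity) is missing."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one lost shard")
    repaired = list(shards)
    if missing:
        present = [s for s in shards if s is not None]
        # XOR of all surviving shards equals the missing shard.
        repaired[missing[0]] = reduce(xor_bytes, present)
    # Drop the parity shard and the zero padding added during encoding.
    return b"".join(repaired[:-1])[:original_len]


if __name__ == "__main__":
    obj = b"hello object storage"
    shards = encode(obj, data_shards=4)  # imagine each shard on its own drive
    shards[2] = None                     # simulate losing one drive
    assert reconstruct(shards, len(obj)) == obj
```

In this sketch each shard would live on a different drive or server, so reads and writes are spread across them and the loss of any one shard is recoverable from the rest.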
AI/ML within the MinIO Centric Modern Datalake, Keith Pijanowski, AI/ML SME
In enterprise artificial intelligence, there are two main types of models: traditional and generative. Traditional models are used to classify or predict data, while generative models are used to create new data. Even though Generative AI has dominated the news of late, organizations are still pursuing both types of AI. This talk focuses on the areas of the Modern Datalake Reference Architecture that support these AI/ML workloads.
Administrating MinIO for Data Governance, Will Dinyes, Training Content Developer
When working with large data sets, managing data access becomes a key security task. No one wants to be in the next data breach headline. To that end, this session covers the MinIO features that help prevent unauthorized access. MinIO Identity and Access Management provides fine-grained policy controls, letting administrators grant authorized users only specific actions and tasks, whether identities are managed in MinIO, LDAP, or OpenID. MinIO encryption secures data both in transit and at rest, and can prevent access to data through unauthorized channels.
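As a rough illustration of what a fine-grained policy can look like, the Python sketch below builds a read-only policy document in the AWS IAM-style JSON syntax that MinIO policies follow. The bucket name `analytics-data` and the helper `read_only_policy` are hypothetical, and the exact set of actions you grant would depend on your own requirements.

```python
import json


def read_only_policy(bucket: str) -> dict:
    """Build an S3-style policy granting read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Object-level actions apply to keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


# "analytics-data" is a placeholder bucket name for this example.
print(json.dumps(read_only_policy("analytics-data"), indent=2))
```

Because no write or delete actions are listed, a user bound to this policy could list and download objects in the bucket but not modify them.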
Migrating to MinIO: ECS and HCP
One of the hot topics these days is moving content to MinIO for modernization purposes or to create a high-performance tier for the AI/ML/Analytics teams. In some cases, the migration requirements became so significant that we built our own tooling. In other cases, we can leverage other tooling to achieve the same purpose. Here are two examples:
We built our Hitachi Content Platform (HCP) to MinIO migration tool for some large Asian customers. It was a big hit and we made some improvements. Check out Brenna Buuck’s step-by-step guide to ensure a smooth transition to MinIO.
Dell built a “Data Movement” tool for ECS that enables you to migrate your data to any S3-compatible store (MinIO, for example). Also called copy-to-cloud, it is ideal for customers and prospects who are modernizing their storage stack to support their AI data infrastructure requirements and need to move off the performance- and scalability-constrained ECS platform. Learn more here.
New and Notable—Release Notes from June 2024:
This was an exciting month for MinIO, with 7 releases and roughly 100 features and bug fixes. We continue to focus on making existing features more robust and improving their performance along the way. We made this release with the help of a new contributor, @guoard. We’ve added the env var MINIOKAFKA_DEBUG to enable sarama debug messages. ILM metrics were added and updated in metrics-v3. The latest update also allows a rebalance to start when it is stopped or completed, and replication can now pass checksum headers to the replica. We’ve also added a number of LDAP features so that more users can take advantage of LDAP. For instance, we’ve added support for LDAP public key authentication, which is enabled via SFTP. Last but not least, the console was updated to v1.5.0.
Bits and Bytes:
Kicking off with some great news—this month we were selected as a winner in the 2024 AI Breakthrough Awards program. After analyzing our ability to support a variety of AI-based analytics applications from our object store, MinIO was awarded the “Best Overall AI-based Analytics Company.” A great win.
Moving on to our awesome LinkedIn content from this month. Data Engineer Nguyễn Tuấn Dương posted a great LinkedIn article on using MinIO and a number of other tools to unlock NYC taxi data insights.
DevOps Engineer Ayman Elhussiny announced his recent project for backing up and restoring stateful applications on OKD using MinIO.
Machine Learning Engineer Florent J. posted about the open data stack highlighting MinIO as the open source storage choice for large amounts of data.
Software Developer Riffki Krisdiyanto created a live attendance application using MinIO and provides a writeup here.
Soumil S. is back with a video on how to perform basic procedures using Spark SQL, MinIO and Apache Hudi.
Web Scraping and ETL Pipeline Specialist Tauqeer Abbas posted a tutorial on building a modern and cost-effective ELT data pipeline with Airflow, MinIO, DuckDB and more.
Calling all Portuguese speakers! Data Engineer Victor Outtes, MSc. posted a very thorough LinkedIn article about creating a low-cost, cloud-agnostic data lake with MinIO, Airflow, DBT, Trino and Superset. And Luciano Borba - LR Data on YouTube posted a video in Portuguese on using a data lake with MinIO, DuckDB and Docker.
Vroom vroom. MLOps Software Developer Aleksander Zawalich announced a personal project creating a full-stack MLOps platform for Formula 1 using MinIO. Check it out.
Moving on from LinkedIn…
Have you been following Simon Thelin’s content lately? He’s been killing it. He recently released the second edition of his series covering Helm setup with Hive Metastore and MinIO integration. See his accompanying video here.
Krishiv on Steemit posted a guide to setting up MinIO as your S3 storage solution. Super quick read.
Enjoy the rest of your July, everyone. Can’t wait to see you next month.
MinIO Team