Art of Data Newsletter - Issue #19
Welcome, all data fanatics. In today's issue: open challenges in LLM research, generative AI for data engineering, machine learning for delivery time estimation, scaling Instagram's Explore recommendations, Faire's Data Team onboarding, Apache Arrow, accelerating Spark SQL with Gluten, and Apache Kafka as a workflow engine.
Let's dive in!
Open challenges in LLM research | 27mins
The article discusses ten major research directions for large language models (LLMs). These include reducing and measuring their tendency to "hallucinate" (produce incorrect or inconsistent answers), optimizing context length and context construction, and incorporating other data modalities. The author also calls for new architectures, alternatives to GPUs, and making "agents" (autonomous LLMs) more usable. Further research is needed to improve learning from human preference, increase the efficiency of the chat interface, and adapt LLMs to non-English languages. The author goes in-depth on each point, discussing current research, challenges, and possible improvements.
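Context construction, one of the directions above, is about deciding which documents make it into the model's limited window. A minimal sketch, assuming a toy embedding function and a fixed document budget (both hypothetical stand-ins for a real embedding model and tokenizer-based budgeting):

```python
import numpy as np

# A minimal sketch of "context construction": pick which documents fit in
# the model's limited context window by embedding similarity. The toy
# embeddings and the fixed budget are illustrative assumptions.
def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model; deterministic within a run.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

docs = ["shipping policy", "refund policy", "careers page", "API rate limits"]
query = "how do refunds work?"

scores = [float(embed(d) @ embed(query)) for d in docs]
budget = 2  # only two documents fit in the context window
context = [d for _, d in sorted(zip(scores, docs), reverse=True)[:budget]]
prompt = f"Context: {context}\nQuestion: {query}"
print(prompt)
```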
How Generative AI Can Revolutionize Data Engineering - Alibaba Cloud Community | 4mins
Generative AI is a class of artificial intelligence capable of creating new content, such as text, images, and audio, from existing data. Trained on large datasets, it can generate synthetic data, automate data cleaning and preparation, generate code for data pipelines, and create data visualizations. In data engineering, generative AI can work with a data lake to define the lake's specifications and constraints through a JSON template, streamline the creation of ETL pipelines, track and manage data lineage, manage data warehouses, predict future trends, and aid in data visualization. It can also enable data engineers to query data more intuitively and automate routine tasks.
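To make the JSON-template idea concrete, here is a minimal sketch: a hypothetical data lake specification (field names invented for illustration, not Alibaba's schema) plus a small validator that a generated template could be checked against:

```python
import json

# A minimal sketch of the JSON-template idea: the data lake's
# specifications and constraints are declared as data, so a generative
# model (or a human) can produce and validate them. All field names here
# are hypothetical illustrations.
data_lake_spec = {
    "name": "sales_lake",
    "format": "parquet",
    "partitioning": ["country", "order_date"],
    "constraints": {
        "order_id": {"type": "string", "nullable": False},
        "amount": {"type": "double", "min": 0},
    },
    "retention_days": 365,
}

def validate_record(record: dict, spec: dict) -> list[str]:
    """Return a list of constraint violations for one record."""
    errors = []
    for field, rules in spec["constraints"].items():
        value = record.get(field)
        if value is None and not rules.get("nullable", True):
            errors.append(f"{field} must not be null")
        if value is not None and "min" in rules and value < rules["min"]:
            errors.append(f"{field} below minimum {rules['min']}")
    return errors

print(json.dumps(data_lake_spec, indent=2))
print(validate_record({"order_id": "A1", "amount": -5.0}, data_lake_spec))
```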
Machine Learning for Delivery Time Estimation | by Enderson Santos | OLX Engineering | 17mins
Enderson Santos describes a data science project at OLX, a leading online marketplace, aimed at improving delivery time estimation with machine learning (ML). Data analysis revealed that delivery time is a critical factor for users purchasing items through Pay&Ship, a service where the seller sends the ordered item through a delivery service. The article covers the project's stages: defining the problem, exploratory data analysis, the baseline solution, A/B testing, training and modeling, and putting the solution into production. Factors affecting delivery time were identified, such as the day of the week, the region, and the distance from the seller to the buyer.
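As a minimal sketch of the modeling stage, here is a gradient-boosted regressor over the three features the article highlights, compared against a predict-the-mean baseline. The synthetic data and the model choice are illustrative assumptions, not OLX's actual pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data using the features the article mentions:
# day of week, region, and seller-to-buyer distance.
rng = np.random.default_rng(42)
n = 5_000
df = pd.DataFrame({
    "day_of_week": rng.integers(0, 7, n),
    "region": rng.integers(0, 5, n),      # encoded region id
    "distance_km": rng.uniform(1, 1_500, n),
})
# Hypothetical ground truth: longer distances and weekend orders ship slower.
df["delivery_days"] = (
    1 + df["distance_km"] / 400
    + (df["day_of_week"] >= 5) * 1.5
    + rng.normal(0, 0.5, n)
)

X, y = df[["day_of_week", "region", "distance_km"]], df["delivery_days"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor().fit(X_train, y_train)
baseline = np.full(len(y_test), y_train.mean())  # "always predict the mean"

print("baseline MAE:", mean_absolute_error(y_test, baseline))
print("model MAE:   ", mean_absolute_error(y_test, model.predict(X_test)))
```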
Scaling the Instagram Explore recommendations system - Engineering at Meta | 14mins
Explore is Instagram's recommendation system, which leverages machine learning (ML) to ensure users see the most relevant content. It relies on advanced ML models, such as Two Tower neural networks, which make the system more scalable and flexible. The multi-stage ranking approach includes retrieval, first-stage ranking, second-stage ranking, and final re-ranking. The retrieval stage produces an approximation of the content that will rank highly later; the ranking stages then order these candidates according to their value to the user. Offline and online metric tuning follows to optimize performance.
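A minimal sketch of the Two Tower idea: one network embeds the user, another embeds items, and retrieval reduces to a top-k similarity search over item embeddings. Dimensions and layers here are illustrative assumptions, not Meta's production model:

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """One tower: maps raw features to a normalized embedding."""
    def __init__(self, in_dim: int, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim)
        )

    def forward(self, x):
        # L2-normalize so the dot product behaves like cosine similarity.
        return nn.functional.normalize(self.net(x), dim=-1)

user_tower, item_tower = Tower(in_dim=32), Tower(in_dim=48)

users = torch.randn(4, 32)      # 4 user feature vectors
items = torch.randn(1000, 48)   # 1,000 candidate item feature vectors

with torch.no_grad():
    scores = user_tower(users) @ item_tower(items).T  # (4, 1000) similarities
    top_items = scores.topk(k=10, dim=-1).indices     # retrieval: top-10 per user

print(top_items.shape)  # torch.Size([4, 10])
```

Because the item tower's output can be precomputed and indexed offline, retrieval scales to very large candidate pools, which is the property the article credits for the system's flexibility.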
The building blocks of Faire’s Data Team onboarding | by Brian de Luna | Aug, 2023 | The Craft | 8mins
Faire's rapidly growing Data Team tackled the problem of inconsistent and time-consuming onboarding by developing a flexible, scalable, and consistent onboarding platform. They categorized classes into five core competencies: Data Infrastructure, Data Warehouse, Machine Learning, Experimentation, and Backend. They further categorized these into 100 Series and 200 Series, for basic and advanced level classes respectively. This organized system ensured that classes suited team members at all levels and in varied roles. For consistency and accessibility, classes were pre-recorded and followed a standardized Notion template. As a result, Faire effectively created a more efficient and meaningful onboarding experience for their Data Team.
Apache Arrow: The Future of Data Engineering | by Ravish Kumar | Jul, 2023 | Data Engineer Things | 7mins
Apache Arrow is an open-source project that makes data processing faster, easier, and more efficient. It helps manage the complexity of working with various data formats, sources, systems, and tools by letting data engineers work with data consistently across platforms. Arrow defines a language-independent columnar memory format suited to efficient analytic operations on modern hardware, eliminating the need for serialization and deserialization when moving data between systems or languages. Benefits of Apache Arrow include performance, interoperability, standardization, and flexibility. It can be used in a variety of ways depending on the use case and language of choice; libraries are available for C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, and many more.
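A minimal sketch with the pyarrow library: build a columnar table, write it in the Arrow IPC format, and read it back without row-by-row serialization. The same bytes can be consumed from any language with an Arrow implementation; with a memory-mapped file, the read would be zero-copy:

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Build a small columnar table in Arrow's in-memory format.
table = pa.table({
    "user_id": pa.array([1, 2, 3], type=pa.int64()),
    "score": pa.array([0.9, 0.7, 0.4], type=pa.float64()),
})

# Write it to an in-memory buffer in the Arrow IPC file format.
sink = pa.BufferOutputStream()
with ipc.new_file(sink, table.schema) as writer:
    writer.write_table(table)

# Read it back; another process or language could do the same.
restored = ipc.open_file(pa.BufferReader(sink.getvalue())).read_all()
print(restored.equals(table))  # True
```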
Accelerate Spark SQL Queries with Gluten | by Weiting Chen | Intel Analytics Software | Medium | 8mins
At the start of 2022, Intel and Kyligence introduced a new open-source project called Gluten that accelerates Apache Spark SQL queries by offloading execution to native engines. The project succeeds Gazelle, an Intel-developed SQL engine, but plugs in multiple native backends, including Meta's Velox and a ClickHouse execution engine developed by Kyligence, promising improved performance and efficiency. Gluten is designed to enhance Apache Spark, a framework that processes petabyte-scale datasets but whose community has had to deal with performance hurdles requiring regular optimization. While Gazelle was potent, its development was largely left to Intel, prompting the company to build on a more active open-source community that includes Meta's Velox project.
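A minimal sketch of enabling Gluten from PySpark. The plugin class name follows the early (2022-era) releases, and the backend flag is an assumption; exact names vary by Gluten version and the Gluten jar must be on the classpath, so check the project docs:

```python
from pyspark.sql import SparkSession

# Sketch of a Gluten-enabled session; plugin class and backend flag are
# assumptions that may differ in your Gluten release.
spark = (
    SparkSession.builder
    .appName("gluten-velox-demo")
    .config("spark.plugins", "io.glutenproject.GlutenPlugin")
    .config("spark.gluten.sql.columnar.backend.lib", "velox")  # assumed flag
    .config("spark.memory.offHeap.enabled", "true")  # native engines use off-heap memory
    .config("spark.memory.offHeap.size", "4g")
    .getOrCreate()
)

# Ordinary Spark SQL; qualifying operators are offloaded to the native engine.
spark.range(1_000_000).selectExpr("sum(id)").show()
```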
Apache Kafka as Workflow and Orchestration Engine - Kai Waehner | 27mins
Using Apache Kafka as the backbone of a workflow engine offers better scalability, higher availability, and a simplified architecture. A workflow engine is software that orchestrates human and automated activities, creating, managing, and monitoring their state according to defined business processes. Business Process Management (BPM) is a disciplined approach to achieving consistent, targeted results aligned with an organization's strategic goals, using both automated and non-automated processes. Kafka, a data streaming platform, can be used alongside BPM tools for human interaction and data processing. Case studies from companies across industries, including Salesforce and Swisscom, illustrate stateful workflow automation and orchestration built on Kafka.
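A minimal sketch of the pattern using the kafka-python client: each workflow step is a stream processor that consumes an order's current state, applies its part of the business process, and emits the new state to the next topic. Topic names, the broker address, and the state transition are illustrative assumptions:

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# One workflow step: payment received -> shipped. Upstream and downstream
# steps would be separate consumers/producers on their own topics.
consumer = KafkaConsumer(
    "orders.payment_received",
    bootstrap_servers="localhost:9092",
    group_id="shipping-step",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    order = message.value
    order["state"] = "SHIPPED"               # advance the workflow state
    producer.send("orders.shipped", order)   # hand off to the next step
```

A production version would add exactly-once guarantees (Kafka transactions) and durable state, for example via Kafka Streams, which is the kind of stateful orchestration the case studies describe.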