Storage: choose the right one
Storage is one of the many topics covered in the book Fundamentals of Data Engineering, and it plays a critical role in the field: the right storage is essential for storing, managing, and accessing large volumes of data effectively. In this post I will go through the types of storage available for data engineering.
Several types of storage are commonly used in data engineering, including object stores, traditional storage on disk, and in-memory storage.
Object store
Object stores, such as Amazon S3, are designed for storing and retrieving large amounts of data. They are typically accessed via APIs and are highly scalable and durable, with data automatically replicated across multiple servers. However, object stores can have slower read speeds than other types of storage, because each piece of data has to be fetched with a separate request.
Disk storage
Traditional storage on disk includes hard disk drives (HDDs) and solid-state drives (SSDs). HDDs store data on spinning disks, while SSDs use flash memory. SSDs generally have faster read speeds than HDDs because they have no moving parts, but they are also more expensive.
Random access memory (RAM)
In-memory storage, also known as RAM, is the fastest type of storage available. It holds data in the memory of a computer or server, where the processor accesses it directly. However, in-memory storage is volatile: its contents are lost when the computer or server is powered off, which makes it less suitable for long-term storage of critical data.
NVRAM
Non-volatile random-access memory (NVRAM) is another type of storage. It combines the speed of RAM with the durability of a hard drive: it keeps its contents when the power is off.
Storage for data warehouse
If you are building a data warehouse, make sure you know what hardware you are using. You can choose between HDD, SSD, and NVRAM. RAM is not an option because it is volatile storage (the data is gone once it loses power).
Based on this analysis (I knew the results but never had the chance to test them myself), I can easily say the best option is NVRAM, which has the speed of RAM and is non-volatile. However, it also costs the most. A cheaper option would be SSD, followed by HDD.
Why not Object store
The write and read speed of S3 depends on many factors. However, I remember looking at some experiments where the speed was about 70-90 MB/sec when data was read in bigger chunks. When data was read in smaller chunks, it degraded to 20 MB/sec. There are also some additional factors to look at ...
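To put those figures in perspective, here is a rough back-of-the-envelope sketch of how long a full scan would take at each speed. The 1 TB dataset size and the `scan_hours` helper are my own assumptions for illustration, and the calculation ignores parallel reads, caching, and everything else that affects real throughput.

```python
def scan_hours(dataset_gb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to read dataset_gb at a sustained throughput (decimal units)."""
    seconds = dataset_gb * 1000 / throughput_mb_per_s
    return seconds / 3600


# Hypothetical 1 TB dataset; the throughputs are the rough figures quoted above.
for label, mb_per_s in [
    ("small chunks", 20),
    ("bigger chunks, low end", 70),
    ("bigger chunks, high end", 90),
]:
    print(f"{label}: {scan_hours(1000, mb_per_s):.1f} h")
```

With these assumed numbers, a single sequential reader needs roughly 14 hours at 20 MB/sec versus 3 to 4 hours at 70-90 MB/sec, which is why chunk size matters so much here.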
In the current "everything is in the cloud" approach, we very often choose not to check what hardware we are buying. However, hardware is very often one of the biggest contributors to your data warehouse performance.
P.S. Each store has its own place. IMHO, object stores are the data lake.
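If you want a quick feel for what the storage under your machine can do, a small read test is one way to start. The sketch below is only a minimal illustration: the `make_test_file` and `read_throughput_mb_s` helpers, the 64 MB file size, and the chunk sizes are my own assumptions, and because the operating system caches recently written files, the numbers will mostly reflect the page cache rather than the raw device. A dedicated benchmarking tool is the better choice for serious measurements.

```python
import os
import tempfile
import time


def make_test_file(size_bytes: int) -> str:
    """Create a temporary file filled with random bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
    return path


def read_throughput_mb_s(path: str, chunk_size: int) -> float:
    """Read the whole file in chunk_size pieces and return MB/sec (decimal units)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 disables Python's own buffering, so each read() maps to one OS read.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / 1_000_000 / max(elapsed, 1e-9)


if __name__ == "__main__":
    path = make_test_file(64 * 1024 * 1024)  # hypothetical 64 MB test file
    try:
        for chunk_size in (4 * 1024, 64 * 1024, 1024 * 1024):
            speed = read_throughput_mb_s(path, chunk_size)
            print(f"{chunk_size:>8} B chunks: {speed:.0f} MB/sec")
    finally:
        os.remove(path)
```

Running the same test on an HDD, an SSD, and a cloud volume, with a file larger than the machine's RAM, gives a rough comparison of the options discussed above.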