The AI Data Cycle: Understanding the Optimal Storage Mix for AI Workloads at Scale

By Ghassan Azzi, Sales Director, Africa at Western Digital

While AI is transforming lives and inspiring a world of new applications, at its core, it’s fundamentally about data utilization and data generation.

As the AI industry builds out a massive new infrastructure to train AI models and offer AI services (inference), there are important implications for data storage. First, storage technology plays an important role in the cost and power efficiency of each stage of this new infrastructure. Second, as AI systems process and analyze existing data, they create new data, much of which will be stored because it’s useful or entertaining. Third, new AI use cases and ever more sophisticated models make existing repositories and additional data sources more valuable for model context and training. Together, these effects power a cycle in which increased data generation fuels expanded data storage, which fuels further data generation: a virtuous AI Data Cycle.

It’s important for enterprise data center planners to understand the dynamic interplay between AI and data storage. The AI Data Cycle outlines storage priorities for AI workloads at scale at each of its six stages. Storage component manufacturers, such as Western Digital, are tuning their product roadmaps in recognition of these accelerating AI-driven requirements to maximize performance and minimize total cost of ownership (TCO).

Let’s take a quick walk through the stages of the AI Data Cycle:

Raw Data Archives, Content Storage: Raw data is collected from various sources and stored securely and efficiently. The quality and diversity of collected data are critical, setting the foundation for everything that follows.

Storage needs: Capacity enterprise hard disk drives (eHDDs) remain the technology of choice for lowest-cost bulk data storage, continuing to deliver the highest capacity per drive and the lowest cost per bit.

Data Preparation & Ingestion: Data is processed, cleaned, and transformed for input to model training. Data center owners are implementing upgraded storage infrastructure such as fast data lakes to support preparation and ingestion.

Storage needs: All-flash storage systems incorporating high-capacity enterprise solid state drives (eSSDs) are being deployed to augment existing HDD-based repositories, or within new all-flash storage tiers.

AI Model Training: During this stage, AI models are trained iteratively to make accurate predictions based on the training data. Specifically, models are trained on high-performance supercomputers, and training efficiency relies heavily on maximizing GPU utilization.

Storage needs: Very high-bandwidth flash storage near the training server is important for maximum GPU utilization. High-performance (PCIe® Gen. 5), low-latency, compute-optimized eSSDs are designed to meet these stringent requirements.

Inference & Prompting: This stage involves creating user-friendly interfaces for AI models, including APIs, dashboards, and tools that combine context-specific data with end-user prompts. AI models will be integrated into existing internet and client applications, enhancing them without replacing current systems. This means maintaining current systems alongside new AI compute, driving further storage needs.

Storage needs: Current storage systems will be upgraded with additional data center eHDD and eSSD capacity to accommodate AI integration into existing processes. Similarly, larger and higher-performance client SSDs (cSSDs) for PCs and laptops, and higher-capacity embedded flash devices for mobile phones, IoT systems, and automotive applications, will be needed for AI enhancements to existing applications.

AI Inference Engine: This fifth stage is where the magic happens in real time. It involves deploying the trained models into production environments where they can analyze new data and provide real-time predictions or generate new content. The efficiency of the inference engine is crucial for timely and accurate AI responses.

Storage needs: High-capacity eSSDs stream context or model data to inference servers; depending on scale or response-time targets, high-performance compute eSSDs may be deployed for caching. AI-enabled edge devices call for high-capacity cSSDs and larger embedded flash modules.

New Content Generation: The final stage is where new content is created. The insights produced by the AI models often generate new data, which is stored because it proves valuable or engaging. While this stage closes the loop, it also feeds back into the data cycle, driving continuous improvement and innovation by increasing the value of data for training or analysis by future models.

Storage needs: Generated content will land back in capacity eHDDs for archival data center storage, and in high-capacity cSSDs and embedded flash devices in AI-enabled edge devices.
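
For planners who prefer to see the six stages side by side, the mapping above can be captured in a small lookup structure. The following is only an illustrative sketch: the names `Stage`, `AI_DATA_CYCLE`, `next_stage`, and `stages_using` are hypothetical, and the storage labels are shortened paraphrases of the "Storage needs" notes above, not product names or a sizing recommendation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    """One stage of the AI Data Cycle and the storage classes named for it."""
    name: str
    storage: Tuple[str, ...]


# Stage names and storage classes are taken from the article text;
# the short labels themselves are an assumption made for this sketch.
AI_DATA_CYCLE: Tuple[Stage, ...] = (
    Stage("Raw Data Archives, Content Storage", ("capacity eHDD",)),
    Stage("Data Preparation & Ingestion", ("high-capacity eSSD",)),
    Stage("AI Model Training", ("high-performance compute eSSD",)),
    Stage("Inference & Prompting",
          ("eHDD", "eSSD", "cSSD", "embedded flash")),
    Stage("AI Inference Engine",
          ("high-capacity eSSD", "high-performance compute eSSD",
           "high-capacity cSSD", "embedded flash")),
    Stage("New Content Generation",
          ("capacity eHDD", "high-capacity cSSD", "embedded flash")),
)


def next_stage(index: int) -> Stage:
    """Return the stage after position `index` (0-based).

    The last stage wraps around to the first, reflecting that
    generated content feeds back into the raw data archives.
    """
    return AI_DATA_CYCLE[(index + 1) % len(AI_DATA_CYCLE)]


def stages_using(storage_class: str) -> List[str]:
    """List the names of stages whose storage notes include `storage_class`."""
    return [s.name for s in AI_DATA_CYCLE if storage_class in s.storage]


if __name__ == "__main__":
    for number, stage in enumerate(AI_DATA_CYCLE, start=1):
        print(f"{number}. {stage.name}: {', '.join(stage.storage)}")
```

The wrap-around in `next_stage` is the point of the sketch: the sixth stage hands its output back to the first, which is what makes this a cycle and not a pipeline.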

A self-perpetuating cycle of increased data generation

This continuous loop of data generation and consumption is accelerating the need for high-performance, scalable storage technologies that can manage large AI data sets and refactor complex data efficiently, which in turn drives further innovation.

Ed Burns, research director at IDC, noted, “The implications for storage are expected to be significant as the role of storage, and access to data, influences the speed, efficiency and accuracy of AI Models, especially as larger and higher-quality data sets become more prevalent.”

There’s no doubt that AI is the next transformational technology. As AI technologies become embedded across virtually every industry sector, expect to see storage component providers increasingly tailor products to the needs of each stage in the cycle.
