Demystifying Deep Neural Networks and AI Workloads in Cloud Data Center Networks

AI has become a buzzword. One important aspect of Artificial Intelligence (AI) is its application in production, and the challenges involved in building high-performance networks to support it. Building an AI infrastructure requires careful evaluation of storage, networking, and AI data requirements, along with strategic planning. The focus of this article is networking.

This article provides an overview of Deep Neural Networks (DNNs) and their importance in various AI applications. It highlights how training DNN models requires sharing parameters in large, elephant-like message flows, which makes a short Round Trip Time (RTT) critical.

Artificial Intelligence and Networking

Artificial Intelligence (AI) long ago made its way out of research labs and into the enterprise. Deep Learning, an AI method that processes data in a way loosely modeled on the human brain, has increased the value of AI tools for businesses. Examples include:

  • Recommendation engines and predictive ads that surface news feeds, music you want to listen to, or videos you might want to watch.
  • Medical image analysis, such as tumor identification and Diabetic Retinopathy detection through eye scans, which can also predict the risk of heart disease.

AI applications are becoming increasingly important to customers, both enterprise and cloud, who are building AI networks as a service to run them. Because of their high business value, the need for efficient AI clusters and the networks that interconnect them is growing.

Most AI applications use Neural Networks as their foundation.

Neural Networks

A neuron is the basic building block of a neural network. A neural network is an AI method that mimics the functioning of the human brain by using artificial neurons to process data and solve problems. Neurons in the human brain communicate with each other through electrical signals to process information. Similarly, the interconnected artificial neurons in an artificial neural network collaborate to solve problems.

Neuron

Each neuron in a Neural Network receives inputs, which are multiplied by weights (initially random) and summed together with a bias value. The result passes through an activation function to determine the neuron's final output. A loss function is calculated by comparing the network's output to the expected output, and backpropagation is used to adjust the weights to minimize that loss. Training consists of feeding many different inputs through the network, perhaps hundreds of thousands of images, and adjusting the weights based on how the network performs on each, until the optimal values are learned.
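As a minimal sketch of the idea (pure Python, with made-up input and weight values), a single neuron's forward pass and one weight update can look like this:

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus a bias, passed through a sigmoid activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# One training step: compare the output to the expected label (here 1.0)
# and nudge each weight in the direction that reduces the error.
inputs, weights, bias = [0.5, 0.9], [0.4, -0.2], 0.1
out = neuron(inputs, weights, bias)
error = 1.0 - out
lr = 0.5  # learning rate
weights = [w + lr * error * x for w, x in zip(weights, inputs)]
```

Real frameworks compute the weight adjustments via backpropagation across all layers at once, but each individual update follows this same pattern.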

Simple Neural Network

The Simple Neural Network comprises three layers of neurons: Input, Hidden, and Output. The Input layer receives and processes the data, which is then analyzed and categorized by the Hidden layer before being passed on to the Output layer. Finally, the Output layer provides the outcome of the data processing.

Deep Learning and Deep Neural Networks

Machine Learning (ML) comprises two distinct steps: Training and Inference. The goal of training is to learn a mathematical model from an input dataset. Inference is the second step, where the trained model is put into action on live data to produce actionable output. ML has two categories: Supervised and Unsupervised Learning. Among supervised learning algorithms, Deep Learning (DL) has shown impressive accuracy and has become the primary focus. The main algorithm of DL is the Deep Neural Network (DNN). A simple Neural Network with inputs, outputs, and one hidden layer is not sufficient for tasks like self-driving cars and image identification; DL therefore uses larger networks with more hidden layers.

Relationship between AI, ML, DL and DNN

In comparison to the Simple Neural Network, a Deep Neural Network architecture contains several hidden layers with millions of interconnected artificial neurons, where a weight represents the strength of the link between two neurons. DNNs have shown improved performance as models in AI applications grow in scale and complexity, and model accuracy is enhanced by training on an increasing number of inputs. One simple example of image classification is using a feed-forward DNN to identify an image of a dog.
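A feed-forward pass through such a stack of layers is just the single-neuron computation repeated layer by layer. The sketch below uses tiny hypothetical layer sizes and arbitrary weights, purely to show the data flow:

```python
def relu(z):
    # A common activation function: pass positive values, clamp negatives to 0.
    return max(0.0, z)

def layer(inputs, weights, biases):
    # One output per neuron: weighted sum of all inputs, plus bias, then ReLU.
    return [relu(sum(x * w for x, w in zip(inputs, ws)) + b)
            for ws, b in zip(weights, biases)]

def forward(x, layers):
    # Each layer's outputs become the next layer's inputs.
    for weights, biases in layers:
        x = layer(x, weights, biases)
    return x

# Toy network: 3 inputs -> 2 hidden neurons -> 1 output score.
hidden = ([[0.2, -0.5, 0.1], [0.7, 0.3, -0.2]], [0.0, 0.1])
output = ([[0.6, 0.4]], [0.05])
score = forward([1.0, 0.5, 0.25], [hidden, output])
```

A real DNN does the same thing with many more layers and millions of weights, which is why the work is expressed as large matrix multiplications.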

Feed Forward DNN

These models are so complex that they typically contain between 50 million and 1 billion parameters (weights and biases), and the weights are trained on an extensive range of inputs, from 100,000 to millions of examples. Iterating over so many parameters, feeding the results back to identify errors, and adjusting the weights repeatedly is computationally very expensive. These gigantic matrix computations are far too big for a single simple system.
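To see how the numbers add up, here is a back-of-the-envelope parameter count for a fully connected network. The layer sizes are hypothetical, chosen only to land in the range above:

```python
def param_count(layer_sizes):
    # Each fully connected layer contributes (inputs x outputs) weights
    # plus one bias per output neuron.
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical image classifier: 224x224 RGB input, three hidden
# layers of 4096 neurons each, 1000 output classes.
sizes = [224 * 224 * 3, 4096, 4096, 4096, 1000]
total = param_count(sizes)  # on the order of hundreds of millions
```

Even this modest fully connected layout lands well above 600 million parameters, which is why training is spread across many processors.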

Example of Complex Deep Neural Networks

Thus, it is common practice to distribute AI applications using Data Parallelism, enabling clusters to run large models with extensive training sets rather than relying on a single CPU or GPU.

There are many types of Neural Networks, including Feed-Forward, Convolutional, Perceptron, Recurrent, Generative Adversarial and many more.

Distributed AI Application and Data Parallel Training

Consider image classification in radiology, where DNNs use multiple Graphics Processing Units (GPUs) to speed up the process.

GPUs typically have thousands of cores while CPUs have far fewer, allowing GPUs to break simple, repetitive tasks into smaller components for faster completion. Hence, GPUs are employed for DNNs, but how many neurons can they support? Published data suggests that an Nvidia GeForce RTX 2080 Ti GPU board can simulate nearly 77,000 neurons and 3×10^8 connections at close to real-time speed.

The training data sets, comprising thousands of images, are split into batches and trained in parallel on different clusters of GPU processors. The resulting outputs are compared to the expected results, and errors are propagated backward through the network to adjust the weights. During training, a barrier effect occurs: the participating GPUs need to share their parameters and errors before proceeding to the next step. Each GPU in a job sends its results to every other GPU, creating a full-mesh traffic pattern and requiring a full synchronization event. The GPUs wait for all their peers to receive the data before starting again. So the interconnecting network must have a short Round Trip Time (RTT).
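The synchronization step can be sketched as an all-reduce: every worker contributes its locally computed gradients, and every worker ends up with the same averaged result before the next iteration. The worker count and gradient values below are made up for illustration:

```python
def all_reduce_mean(per_worker_grads):
    # Sum each parameter's gradient across all workers, then average,
    # so every replica applies the identical weight update.
    n = len(per_worker_grads)
    return [sum(g) / n for g in zip(*per_worker_grads)]

# Three GPUs, each holding gradients for the same two parameters.
grads = [[0.2, -0.1],
         [0.4,  0.1],
         [0.0,  0.3]]
avg = all_reduce_mean(grads)  # approximately [0.2, 0.1] on every worker
```

In practice this exchange is performed by collective-communication libraries over the network fabric, which is exactly where RTT and link saturation bite.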

Data-parallelism: Many GPUs used to process extremely large mini-batches (thousands of images) in parallel.

In building large applications, weight updates through parameter sharing are crucial during model development, not execution. Training the model can be time-consuming, taking hours, days, or even weeks. Hence, it is crucial to accelerate the exchange of parameters and minimize the training phase.

In this Compute-Exchange-Optimize cycle, around 20-50% of the job time is spent communicating across the network, so the network can significantly affect the Job Completion Time (JCT).
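A rough illustration of why this matters, with invented per-iteration timings: if communication takes a third of each cycle, halving it cuts the whole job time by a sixth.

```python
def job_completion_time(compute_s, comm_s, iterations):
    # Each Compute-Exchange-Optimize cycle pays both costs in full,
    # because the barrier means no GPU proceeds until all have synced.
    return iterations * (compute_s + comm_s)

baseline = job_completion_time(compute_s=0.08, comm_s=0.04, iterations=100_000)
faster = job_completion_time(compute_s=0.08, comm_s=0.02, iterations=100_000)
# baseline: ~12000 s (~3.3 h); faster network: ~10000 s (~2.8 h)
```

The same arithmetic explains why expensive GPUs sitting idle at the barrier make network latency a first-order cost.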

Platforms for DNN Executions

AI/ML workloads represent workflows, typically run in containers. The workflow starts with gathering and preparing data, known as Data Ingestion. This process involves importing large, diverse data from multiple sources into a single storage medium. This may be an on-prem database such as MySQL, or a cloud data warehouse such as Amazon Redshift or Google BigQuery. For real-time stream processing, Kafka or Azure Event Hubs is used.

Deep Learning Dataflow

The data undergoes transformation through cleansing and pre-processing techniques, and the prepared data sets are then fed into the Deep Learning (DL) model for training, where they are multiplied by the model's weights and combined with bias values. The trained model undergoes validation for accuracy, reliability, and performance to meet production inference requirements.

Different phases of DNN execution require different types of computing platforms. Ingesting and preparing data for training requires many containers with clusters of CPUs, while distributed training requires containers with clusters of GPUs.

ML workflow: the DataPrep phase, Training phase, and model-serving phase require different sets of resources from the cluster. DataPrep typically requires CPUs, training requires GPUs, and serving requires CPUs (and possibly GPUs for low-latency model serving).

Once the model is trained, the next step is to deploy or serve it in production. Data Scientists monitor the performance of models in production, tracking predictions and performance metrics.

Containers are useful for both model development and integration into applications. They offer benefits for AI/ML workloads, such as modularity, speed, and portability, and they require fewer resources than bare-metal or VM systems.

Conclusions & Key Message

This article highlighted AI and its impact on networking. AI applications are built on neural networks, specifically Deep Learning models known as DNNs. These models use supervised ML algorithms and are widely known for their high accuracy. However, DNN models are complex and computationally expensive due to their millions of neurons and weights. To achieve accuracy, they require training on large datasets. Therefore, they are designed for distributed computing environments and use data-parallel training. To run such complex AI applications, server racks with clusters of CPUs and GPUs run containers as Kubernetes jobs.

The key challenge in networking for AI lies in sharing parameters among the GPU clusters that train DNN models. These parameters travel as very large, elephant-like message flows that must be synchronized and shared with all GPUs in a full-mesh traffic pattern. This can quickly saturate network links and cause problems. Additionally, GPUs are expensive resources that cannot sit idle, so it is crucial for the network to have a short round trip time.



More articles by Amit Godbole
