SIMD and MIMD

SIMD (Single Instruction, Multiple Data) is a type of parallel processing where a single instruction operates on multiple data points simultaneously. Imagine an assembly line where one station performs the same task on multiple items at once.

How it works:

  1. Vector Processing: SIMD processors have specialized registers that can hold multiple data elements (like integers, floating-point numbers, etc.).
  2. Single Instruction, Multiple Operations: A single instruction is fetched and decoded, but it is then applied to all the data elements within the SIMD registers in parallel.
  3. Data Parallelism: SIMD is most effective when dealing with data-parallel tasks, where the same operation needs to be performed on a large set of data, such as in image processing, multimedia, and scientific simulations.

Example: Consider adding two arrays of numbers:

array1 = [1, 2, 3, 4]
array2 = [5, 6, 7, 8]
result = [0, 0, 0, 0]        

A traditional processor would perform the addition element by element:

result[0] = array1[0] + array2[0]
result[1] = array1[1] + array2[1]
result[2] = array1[2] + array2[2]
result[3] = array1[3] + array2[3]        

A SIMD processor could load multiple elements from array1 and array2 into its vector registers and perform the addition in a single instruction, potentially processing all four additions at once.

What is AVX?

AVX (Advanced Vector Extensions) is an x86 SIMD (Single Instruction, Multiple Data) instruction set introduced by Intel (and also supported by AMD) to accelerate data-parallel computations. It allows a single CPU instruction to process multiple data elements simultaneously, leveraging wide vector registers (256-bit in AVX/AVX2, and 512-bit with AVX-512). AVX is commonly used for tasks like:

  • Matrix/vector operations (e.g., linear algebra),
  • Image/video processing,
  • Machine learning inference.


Example of AVX code (for illustration only; I did not run this anywhere):

The code below adds two float arrays eight elements at a time, which can be up to 8x faster than scalar code (assuming the CPU supports AVX).

#include <immintrin.h> // AVX intrinsics header

void add_arrays(const float* a, const float* b, float* c, int N) {
    int i = 0;
    for (; i + 8 <= N; i += 8) {                  // Process 8 floats at once (256-bit AVX)
        __m256 vecA = _mm256_loadu_ps(&a[i]);     // Load 8 floats from a (unaligned load)
        __m256 vecB = _mm256_loadu_ps(&b[i]);     // Load 8 floats from b
        __m256 vecC = _mm256_add_ps(vecA, vecB);  // Add all 8 pairs in one instruction
        _mm256_storeu_ps(&c[i], vecC);            // Store 8 results to c
    }
    for (; i < N; ++i)                            // Scalar tail when N is not a multiple of 8
        c[i] = a[i] + b[i];
}        
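
Note that compiling this requires enabling AVX code generation (e.g., -mavx with GCC/Clang, /arch:AVX with MSVC), and running it on a CPU without AVX support raises an illegal-instruction fault.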


The C++17 std::execution::par_unseq policy allows the compiler/runtime to use both multi-threading and SIMD (e.g., AVX). However:

  • Auto-Vectorization: The compiler may generate AVX instructions automatically for simple loops (e.g., std::transform on floats).
  • No Guarantee: Using par_unseq does not guarantee AVX usage. It depends on:

  1. Compiler optimizations (e.g., -O3, -mavx flags),
  2. Data alignment and loop structure,
  3. Hardware support.
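
As a concrete illustration, here is a minimal sketch of the same array addition written with std::transform and par_unseq. It assumes a C++17 toolchain with parallel-algorithm support (with GCC's libstdc++, this additionally means linking against TBB); whether the compiler actually emits AVX here still depends on the factors listed above.

#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<float> a(1024, 1.0f), b(1024, 2.0f), c(1024);
    // par_unseq permits both multi-threading (MIMD) and vectorization (SIMD);
    // the implementation is free to use either, both, or neither.
    std::transform(std::execution::par_unseq,
                   a.begin(), a.end(), b.begin(), c.begin(),
                   [](float x, float y) { return x + y; });
}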


MIMD (Multiple Instruction, Multiple Data)

MIMD is a more flexible form of parallel processing where multiple processors can execute different instructions on different data simultaneously. Think of it as having multiple independent assembly lines, each working on different tasks.

How it works:

  1. Multiple Processors: MIMD systems have multiple independent processing units (cores, CPUs, or even entire computers).
  2. Independent Execution: Each processor can fetch and execute its own sequence of instructions.
  3. Task Parallelism: MIMD is well-suited for task parallelism, where different parts of a larger task are broken down and executed independently on different processors. It's also effective for data parallelism, where the data is divided among processors and each processor works on its portion.


Example:

Consider rendering a complex 3D scene:

  • One processor could handle the geometry calculations.
  • Another processor could handle the texturing.
  • A third processor could handle the lighting and shading.

Each processor executes different instructions on different parts of the scene data concurrently.
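
Here is a minimal sketch of this idea with std::thread (compute_geometry and compute_lighting are hypothetical stand-ins for real rendering stages): each thread runs its own instruction stream on its own data, which is exactly the MIMD model.

#include <cmath>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the rendering stages described above.
void compute_geometry(std::vector<float>& vertices) {
    for (float& v : vertices) v *= 2.0f;       // transform vertices
}
void compute_lighting(std::vector<float>& pixels) {
    for (float& p : pixels) p = std::sqrt(p);  // shade pixels
}

int main() {
    std::vector<float> vertices(1000, 1.0f), pixels(1000, 4.0f);
    // Two different instruction streams on two different data sets: MIMD.
    std::thread t1(compute_geometry, std::ref(vertices));
    std::thread t2(compute_lighting, std::ref(pixels));
    t1.join();
    t2.join();
}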


Other Parallel Architectures

While SIMD and MIMD are the two primary classifications, other architectures exist:

  • SISD (Single Instruction, Single Data): This is the traditional sequential processing model, where one instruction operates on one data element at a time. Most single-core CPUs are SISD.
  • MISD (Multiple Instruction, Single Data): This architecture is less common in practice. It involves multiple processors executing different instructions on the same data stream. One potential example is a fault-tolerant system where multiple processors perform the same calculation and their results are compared for correctness.
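
To make the MISD idea concrete, here is a toy sketch of that fault-tolerant pattern (the unit_* functions are hypothetical; a real system would use independent hardware units): several units process the same input and a voter accepts the majority result.

// Toy majority-voting sketch: three "units" compute on the same datum.
int unit_a(int x) { return x * x; }
int unit_b(int x) { return x * x; }
int unit_c(int x) { return x * x; }

int vote(int x) {
    int r1 = unit_a(x), r2 = unit_b(x), r3 = unit_c(x);
    // Accept any value that at least two units agree on.
    return (r1 == r2 || r1 == r3) ? r1 : r2;
}

int main() { return vote(7) == 49 ? 0 : 1; }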


How C++17 Utilizes SIMD and MIMD

The C++17 parallel algorithms, particularly when used with the std::execution::par_unseq policy, can leverage both SIMD and MIMD parallelism:

  • MIMD: The par and par_unseq policies allow the implementation to divide the work of the algorithm among multiple threads, which can run on different cores (MIMD).
  • SIMD: The par_unseq policy explicitly allows the implementation to use vectorization (SIMD) within each thread, if the hardware and the operations are suitable. This means that within a single thread, the processor might be executing a single instruction on multiple data elements.
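
For instance, here is a minimal sketch combining both: a parallel reduction whose work may be partitioned across threads (MIMD) while each thread's chunk may be vectorized (SIMD). par_unseq permits this reordering, which is why the operation must be safe to regroup (addition of doubles is, up to rounding).

#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> v(1'000'000, 0.5);
    // std::reduce may split the sum across threads (MIMD) and
    // vectorize within each thread (SIMD) under par_unseq.
    double sum = std::reduce(std::execution::par_unseq,
                             v.begin(), v.end(), 0.0);
    return sum > 0 ? 0 : 1;
}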


Finally

  • SIMD focuses on data parallelism by applying the same operation to multiple data points simultaneously.
  • MIMD focuses on task parallelism and data parallelism by allowing multiple processors to execute different or the same instructions on different data.
  • C++17 parallel algorithms can utilize both SIMD and MIMD to achieve higher performance by leveraging the capabilities of modern multi-core processors with SIMD units.



