Multiply Accumulate Operations with SIMD Intrinsics
Let's explore using SIMD intrinsics to perform multiplyaccumulate (MAC) operations with C programming on x8664 platform.
Single Instruction Multiple Data (SIMD), the name says it all, performs same operation on all elements of given operand vectors simultaneously in one clock cycle ^{[1]}, so longer the vector length more efficient the arithmetic operation. SIMD implementations in MMX, SSE technology to AVX512, have vector lengths from 64 bits to 512 bits.
Intrinsic Functions
Intrinsic functions allows us to use SIMD and many more types of assembly instructions in C style function calls with access to C variables instead of writing assembly code but this also makes source code nonportable across architectures. It is always a good idea to cross check output assembly code generated by the compiler to meet optimization goals, so assembly programming skill set is still useful, atleast reading them.
Below I will focus only on Intrinsic functions/data types applicable to SIMD MAC operation.Naming Conventions
The data type or vector directly represents the contents of underlying SIMD register, they are named as:__m<register_width_in_bits><data_type>
 Notice the double underscores used for data types, intrinsic function names have single underscore prefix.
 register_width_in_bits depends on the technology: 64 bits in MMX (mm registers), 128 bits in SSE (xmm registers), 256 bits in AVX (ymm registers), ...
 data_type is a single letter suffix:
 d: double precision floating point
 i: integer
 no suffix: single precision floating point
__m256d vector containing quad 64bit doubles.
Intrinsics data types cannot be mixed with C arithmetic operators, they can only be passed as arguments to intrinsic functions or to get return value. We can however read data values from them. Intrinsics function are named as:
_mm<register_width_in_bits>_<operation>_<suffix>
 register_width_in_bits same as in data type naming conventions.
 operation arithmetic, logical or data operation to perform on the operands.
 suffix indicates the data type the instruction operates upon. First one or two letter:
 p: packed data
 ep: extended packed data
 s: scalar data
 s: singleprecision floating point
 d: doubleprecision floating point
 i<integer_size_in_bits>: signed integer of 8, 16, 32, 64 or 128 bits
 u<integer_size_in_bits>: unsigned integer of 8, 16, 32, 64 or 128 bits
__m256d _mm256_fmadd_pd (__m256d a, __m256d b, __m256d c)
Perform fused multiply  add operation on vectors a, b and c each containing quad doubles and return result in quad double vector, this is equivalent to below C code with the difference being all iterations will be executed simultaneously.


Notice that in the above C code two rounding operations are required to produce multiplyadd result, one rounding from multiplication of two doubles and another rounding from addition of this intermediate result with last operand. fmadd intrinsic performs only one rounding by keeping intermediate result internally in high precision, this is why the function is called "fused" multiplyadd.
Code
Let's implement dot product of arbitrarily long vectors, this kind of computation occurs in matrix multiplication or filtering operation in DSP.
Machine configuration used for testing: Intel core i3 6th generation
 Ubuntu 16.04 LTS
 GCC 5.4.0




 Line 1: Top level intrinsic header file, this will in turn include technology specific header files. Just including intrinsic header file is not sufficient architecture specific command line options are also required for GCC as shown later.
 Line 8: _mm256_load_pd intrinsic loads SIMD register with data from memory, memory address must be aligned on 32 byte boundary so malloc function won't work here, instead use posix_memalign function to allocate memory with required alignment. If 32 byte alignment is not feasible then _mm256_loadu_pd intrinsic can be used which has no alignment requirement on the memory pointer but it will generate additional instructions.
 Line 10: _mm256_fmadd_pd returns a vector of partial sums but what we need is a scalar result. Two options are available in the code.
 Line 1745: Computes vector sum with intrinsics, refer to code comments.
 Line 38: _mm256_hadd_pd intrinsic computes horizontal interleaved addition, say first operand vector contains elements a, b, c, d and second operand vector contains elements m, n, p, q then the resultant vector would be a+b, m+n, c+d, p+q. This method of addition may look odd but suppose if these elements were part of complex numbers: a+jc, b+jd, m+jp, n+jq, then the hadd operation has computed the sum of two pairs of complex numbers, with one more hadd operation we get the final sum: (a+b+m+n) + j(c+d+p+q).
 Line 4955: Computes vector sum with C arithmetic operators. Let's compile Listing2 with this block enabled and with highest possible compiler optimization to check how close the generated assembly code is to the corresponding code computing vector sum with intrinsics.
$ gcc S Ofast mavx mfma o mac_with_simd.S mac_with_simd.c
 From gcc man page, Ofast optimization flag disregards standards compliance, so it is always a good idea to apply this flag on a small piece of code with separate compilation this will also help in catching any issues during unit testing.
 mavx tells the compiler to use avx intrinsics since we are using 256bit data types and corresponding intrinsic functions.
 mfma enables fused multiplyadd intrinsic.


 Line 2732: Compiler has generated SIMD instructions from C arithmetic code! It is quite close to Line 1745 in Listing2 except that instead of 256bit ymm registers, 128bit xmm registers are being used and two horizontal and one scalar additions.
Conclusion
SIMD intrinsics can significantly increase computation performance, if instead of 64 bit doubles, 16 bit signed integers were used in above code examples then that would be a 4x performance increase. For integer vector multiplyadd operation use _mm256_madd_epi16 from AVX2 as a starting point. This performance increase comes at a cost of nonportability, so as a logical rule first try to optimize the C algorithm itself along with compiler optimization flags, next choice enable mavx, mavx2 or other technology specific options to enable auto vectorization and as a last choice use intrinsic functions.References
 Throughput can be <= 1 but latency could be high. Intel Intrinsics Guide↗
 IntelĀ® 64 and IA32 Architectures Developer's Manual: Vol. 1↗
 GCC x86 machine specific options↗
 GNU AS syntax↗
 IntelĀ® C++ Intrinsics Reference [broken link]