Last updated: 19 January 2019
First published: 29 December 2018

Multiply Accumulate Operations with SIMD Intrinsics

Let's explore using SIMD intrinsics to perform multiply-accumulate (MAC) operations in C on the x86-64 platform.

Single Instruction Multiple Data (SIMD), as the name suggests, performs the same operation on all elements of the given operand vectors simultaneously, in as little as one clock cycle [1], so the longer the vector, the more efficient the arithmetic operation. SIMD implementations, from MMX and SSE technology through AVX-512, have vector lengths from 64 bits up to 512 bits.

Intrinsic Functions

Intrinsic functions allow us to use SIMD and many other types of assembly instructions through C-style function calls with access to C variables, instead of writing assembly code, but this also makes the source code non-portable across architectures. It is always a good idea to cross-check the assembly code generated by the compiler against the optimization goals, so assembly programming skills remain useful, at least for reading it.

Below I will focus only on the intrinsic functions and data types applicable to the SIMD MAC operation.

Naming Conventions

The data types, or vectors, directly represent the contents of the underlying SIMD registers. They are named as:

__m<register_width_in_bits><data_type>

Notice the double underscore used for data type names; intrinsic function names have a single underscore prefix.
register_width_in_bits depends on the technology: 64 bits in MMX (mm registers), 128 bits in SSE (xmm registers), 256 bits in AVX (ymm registers), and so on.
data_type is a single letter suffix:
  • d: double precision floating point
  • i: integer
  • no suffix: single precision floating point
Example:

__m256d is a vector containing four 64-bit doubles.
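
For instance, here is how a few of these types appear in code (declarations shown for illustration only):

#include <immintrin.h>

__m128  f4; //SSE: four 32-bit single-precision floats (xmm register)
__m128d d2; //SSE2: two 64-bit double-precision floats (xmm register)
__m256d d4; //AVX: four 64-bit double-precision floats (ymm register)
__m256i vi; //AVX: 256 bits of packed integers (ymm register)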

Intrinsic data types cannot be mixed with C arithmetic operators; they can only be passed as arguments to intrinsic functions or receive their return values. We can, however, read individual data values from them. Intrinsic functions are named as:

_mm<register_width_in_bits>_<operation>_<suffix>

register_width_in_bits is the same as in the data type naming convention.
operation is the arithmetic, logical or data operation to perform on the operands.
suffix indicates the data type the instruction operates upon. The first one or two letters:
  • p: packed data
  • ep: extended packed data
  • s: scalar data
Remaining letters in suffix:
  • s: single-precision floating point
  • d: double-precision floating point
  • i<integer_size_in_bits>: signed integer of 8, 16, 32, 64 or 128 bits
  • u<integer_size_in_bits>: unsigned integer of 8, 16, 32, 64 or 128 bits
Example:

__m256d _mm256_fmadd_pd (__m256d a, __m256d b, __m256d c)

Performs a fused multiply-add operation on vectors a, b and c, each containing quad doubles, and returns the result in a quad-double vector. This is equivalent to the C code below, with the difference that all four iterations execute simultaneously.

for (i = 0; i < 4; i++)
	result[i] = (a[i] * b[i]) + c[i];

Notice that in the above C code two rounding operations are required to produce each multiply-add result: one from the multiplication of two doubles and another from the addition of that intermediate result to the last operand. The fmadd intrinsic performs only one rounding, keeping the intermediate result internally in higher precision, which is why the operation is called "fused" multiply-add.
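
The difference can be demonstrated in plain C with the C99 fma() function from <math.h>. Below is a minimal sketch; the operand values are chosen so that the exact product needs more bits than a double can hold. Compile without -mfma or -Ofast so the first expression is not itself contracted into an FMA, and link with -lm.

#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void) {
        double a = 1.0 + DBL_EPSILON;          //1 + 2^-52
        double c = -(1.0 + 2.0 * DBL_EPSILON); //-(1 + 2^-51)

        //a*a = 1 + 2^-51 + 2^-104; the separate multiply rounds away the 2^-104 term
        printf("a * a + c  = %g\n", a * a + c);    //prints 0
        printf("fma(a,a,c) = %g\n", fma(a, a, c)); //prints 2^-104, about 4.93e-32

        return 0;
}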

Code

Let's implement the dot product of arbitrarily long vectors; this kind of computation occurs in matrix multiplication and in filtering operations in DSP.

Machine configuration used for testing:
  • Intel Core i3, 6th generation
  • Ubuntu 16.04 LTS
  • GCC 5.4.0
Listing-1 performs the MAC operation on vectors of doubles with C arithmetic operators. Listing-2 implements the same functionality with SIMD intrinsics instead.
Listing-1: mac_without_simd.c
double mac_without_simd(double* a, double* b, unsigned int length) {
        int i;
        double y = 0;
        for (i = 0; i < length; i++) {
                y += a[i] * b[i];
        } 
        
        return y;
}
Listing-2: mac_with_simd.c
 1  #include <immintrin.h>
 2  
 3  double mac_with_simd (double* a, double* b, unsigned int length) {
 4          int i;
 5          
 6          __m256d sum = {0.0, 0.0, 0.0, 0.0}; //vector to hold partial sums
 7          for (i = 0; i < length; i += 4) {
 8                  __m256d va = _mm256_load_pd(&a[i]);
 9                  __m256d vb = _mm256_load_pd(&b[i]);
10                  sum = _mm256_fmadd_pd (va, vb, sum);
11          }
12          
13          //sum now holds the sum of all products, split into four parts
14          //we want a scalar result
15          //two options below
16          
17  #if 1 //Enable this block to perform vector sum with intrinsics
18          
19          //x86 architecture has little-endian data ordering
20          
21          //                               index: 0  1  2  3 
22          //sum contains quad 64bit doubles, say: m, n, p, q
23          //we want scalar result = m + n + p + q
24          
25          //intrinsic function to extract the upper 128 bits.
26          //if the second parameter is zero then the lower 128 bits are extracted.
27          __m128d xmm = _mm256_extractf128_pd (sum, 1); 
28          //xmm contains: p, q
29          
30          //This intrinsic is compile-time only (it generates no instruction).
31          //__m256d ymm = _mm256_zextpd128_pd256 (xmm); //But missing in GCC 5.4.0
32          
33          //zero-extend xmm to make the 256-bit vector ymm.
34          __m256d ymm = {xmm[0], xmm[1], 0, 0};
35          //ymm contains: p, q, 0, 0
36          
37          //intrinsic function to perform horizontal interleaved addition.
38          sum = _mm256_hadd_pd (sum, ymm); 
39          //sum contains: m+n, p+q, p+q, 0+0  
40          
41          //another round of horizontal interleaved addition
42          sum = _mm256_hadd_pd (sum, sum);
43          //sum contains: m+n+p+q, m+n+p+q, p+q+0, p+q+0
44  
45          return sum[0]; //scalar result = m+n+p+q
46  
47  #else //vector sum with C arithmetic operators.
48          
49          double y = 0;
50          
51          for (i = 0; i < 4; i++) {
52                  y += sum[i];
53          }
54          
55          return y; //scalar result
56  #endif
57  }
  • Line 1: The top-level intrinsic header file; it in turn includes the technology-specific header files. Including the intrinsic header alone is not sufficient: architecture-specific command line options are also required for GCC, as shown later.
  • Line 8: The _mm256_load_pd intrinsic loads a SIMD register with data from memory. The memory address must be aligned on a 32-byte boundary, so the malloc function won't work here; instead, use the posix_memalign function to allocate memory with the required alignment (a sample allocation sketch follows this list). If 32-byte alignment is not feasible, the _mm256_loadu_pd intrinsic can be used instead; it has no alignment requirement on the memory pointer but will generate additional instructions.
  • Line 10: _mm256_fmadd_pd returns a vector of partial sums, but what we need is a scalar result. Two options are available in the code.
  • Lines 17-45: Compute the vector sum with intrinsics; refer to the code comments.
  • Line 38: The _mm256_hadd_pd intrinsic computes a horizontal interleaved addition: if the first operand vector contains elements a, b, c, d and the second operand vector contains elements m, n, p, q, then the resultant vector is a+b, m+n, c+d, p+q. This method of addition may look odd, but suppose these elements were parts of the complex numbers a+jc, b+jd, m+jp, n+jq; then one hadd operation computes the sums of two pairs of complex numbers, and with one more hadd operation we get the final sum: (a+b+m+n) + j(c+d+p+q).
  • Lines 49-55: Compute the vector sum with C arithmetic operators. Let's compile Listing-2 with this block enabled and with the highest possible compiler optimization to check how close the generated assembly code is to the corresponding code computing the vector sum with intrinsics.
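Here is the allocation sketch promised in the Line 8 note: a hypothetical driver for Listing-2 (the vector length and fill values are arbitrary) showing 32-byte aligned allocation with posix_memalign:

#include <stdio.h>
#include <stdlib.h>

double mac_with_simd(double* a, double* b, unsigned int length);

int main(void) {
        unsigned int i, n = 1024; //must be a multiple of 4 for the 4-wide loop
        double *a, *b;

        //32-byte alignment as required by _mm256_load_pd
        if (posix_memalign((void**)&a, 32, n * sizeof(double)) ||
            posix_memalign((void**)&b, 32, n * sizeof(double)))
                return 1;

        for (i = 0; i < n; i++) {
                a[i] = 1.0;
                b[i] = 2.0;
        }

        printf("%f\n", mac_with_simd(a, b, n)); //expect 2048.000000

        free(a);
        free(b);
        return 0;
}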
Change the #if 1 at Line 17 of Listing-2 to #if 0 before compiling, so that the C arithmetic block is selected.
$ gcc -S -Ofast -mavx -mfma -o mac_with_simd.S mac_with_simd.c
From the gcc man page, the -Ofast optimization flag disregards standards compliance, so it is always a good idea to apply this flag to a small piece of code compiled separately; this will also help in catching any issues during unit testing.
-mavx tells the compiler to accept AVX intrinsics, since we are using 256-bit data types and the corresponding intrinsic functions.
-mfma enables the fused multiply-add intrinsic.
This gives us the assembly output in Listing-3.
Listing-3: mac_with_simd.S
 1  	.file	"mac_with_simd.c"
 2  	.section	.text.unlikely,"ax",@progbits
 3  .LCOLDB0:
 4  	.text
 5  .LHOTB0:
 6  	.p2align 4,,15
 7  	.globl	mac_with_simd
 8  	.type	mac_with_simd, @function
 9  mac_with_simd:
10  .LFB4596:
11  	.cfi_startproc
12  	testl	%edx, %edx
13  	je	.L4
14  	vxorpd	%xmm1, %xmm1, %xmm1
15  	xorl	%eax, %eax
16  	xorl	%ecx, %ecx
17  	.p2align 4,,10
18  	.p2align 3
19  .L3:
20  	vmovapd	(%rdi,%rax), %ymm2
21  	addl	$4, %ecx
22  	vfmadd231pd	(%rsi,%rax), %ymm2, %ymm1
23  	addq	$32, %rax
24  	cmpl	%ecx, %edx
25  	ja	.L3
26  .L2:
27  	vextractf128	$0x1, %ymm1, %xmm0
28  	vhaddpd	%xmm1, %xmm1, %xmm1
29  	vhaddpd	%xmm0, %xmm0, %xmm0
30  	vaddsd	%xmm1, %xmm0, %xmm0
31  	vzeroupper
32  	ret
33  .L4:
34  	vxorpd	%xmm1, %xmm1, %xmm1
35  	jmp	.L2
36  	.cfi_endproc
37  .LFE4596:
38  	.size	mac_with_simd, .-mac_with_simd
39  	.section	.text.unlikely
40  .LCOLDE0:
41  	.text
42  .LHOTE0:
43  	.ident	"GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609"
44  	.section	.note.GNU-stack,"",@progbits
  • Lines 27-32: The compiler has generated SIMD instructions from the C arithmetic code! It is quite close to Lines 17-45 in Listing-2, except that the reduction uses 128-bit xmm registers instead of 256-bit ymm registers: two horizontal additions followed by one scalar addition.

Conclusion

SIMD intrinsics can significantly increase computation performance: if 16-bit signed integers were used in the above code examples instead of 64-bit doubles, four times as many elements would fit in each vector, for up to a 4x performance increase. For integer vector multiply-add operations, use _mm256_madd_epi16 from AVX2 as a starting point; a sketch follows below. This performance increase comes at the cost of non-portability, so as a rule: first try to optimize the C algorithm itself along with compiler optimization flags; next, enable -mavx, -mavx2 or other technology-specific options so the compiler can auto-vectorize; and only as a last resort use intrinsic functions.
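
Here is a minimal sketch of such an integer MAC, a starting point rather than a tuned implementation. It assumes AVX2 (compile with -mavx2), a length that is a multiple of 16, 32-byte-aligned inputs, and no overflow of the 32-bit accumulators.

#include <immintrin.h>

int mac_with_simd_i16(short* a, short* b, unsigned int length) {
        unsigned int i;
        int tmp[8], result = 0;
        __m256i sum = _mm256_setzero_si256(); //eight 32-bit partial sums

        for (i = 0; i < length; i += 16) {
                __m256i va = _mm256_load_si256((__m256i*)&a[i]);
                __m256i vb = _mm256_load_si256((__m256i*)&b[i]);
                //vpmaddwd: 16 signed 16-bit products, added pairwise into 8 32-bit lanes
                sum = _mm256_add_epi32(sum, _mm256_madd_epi16(va, vb));
        }

        //reduce the eight 32-bit lanes to a scalar
        _mm256_storeu_si256((__m256i*)tmp, sum);
        for (i = 0; i < 8; i++)
                result += tmp[i];

        return result;
}

Each _mm256_madd_epi16 call consumes 16 element pairs versus 4 for the double-precision version, which is where the potential 4x comes from.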

References

  1. Throughput can be <= 1 cycle per instruction, but latency can be high. Intel Intrinsics Guide
  2. Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 1
  3. GCC x86 machine specific options
  4. GNU AS syntax
  5. Intel® C++ Intrinsics Reference [broken link]