

AVX2 64-bit Multiply
The AVX2 instruction set includes most of the common operations for 256-bit vectors of integers. However, it does not include an instruction that performs 64-bit low word multiplication (i64*i64->i64). This is known as low word multiplication because only the lower 64 bits of the full 128-bit product are kept; the upper word is discarded if the result can’t fit into 64 bits.
This type of multiplication is still quite useful as 64-bit numbers are pretty big. It is also common in algorithms that generate pseudo-random numbers.
We can implement this multiplication with around six AVX2 instructions. As a stand-alone operation, it is probably not as quick as doing four scalar multiplications. However, it is still worth implementing, so we can use it as part of more complex parallel calculations. Using this method keeps our values in the 256-bit registers and will hopefully allow for better compiler optimizations.
Algorithm
To perform the 64-bit multiplication, we split each 64-bit value into two 32-bit values (dwords). You can think of these 32-bit values as digits in a base 2^32 number system. We can then manipulate them the same way we would base 10 digits.
First we work out the value of each dword/digit of the result. Consider multiplying x and y, and give each digit a name (a, b, c, d).
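If: x*y = [a,b]*[c,d], where [a,b] means (2^32)a + b
Then: x*y = (2^64)ac + (2^32)(ad + bc) + bd
Because we are only performing low word multiplication, we can immediately discard the first term: it is a multiple of 2^64, so it contributes nothing to the lower 64 bits.
Lower digit (dword) = lower 32 bits of bd
Upper digit (dword) = lower 32 bits of (ad + bc + carry), where carry is the upper 32 bits of bd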
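The reasoning above can be written out modulo 2^64. This is only a restatement of the formulas in this section, under the assumption that a, b, c and d are unsigned 32-bit digits:

```latex
\begin{aligned}
x &= 2^{32}a + b, \qquad y = 2^{32}c + d, \qquad 0 \le a, b, c, d < 2^{32} \\
xy &= 2^{64}ac + 2^{32}(ad + bc) + bd \\
xy \bmod 2^{64} &= \bigl(2^{32}\,((ad + bc) \bmod 2^{32}) + bd\bigr) \bmod 2^{64}
\end{aligned}
```

The last line shows that only the lower 32 bits of ad + bc are needed, which is why a 32-bit low multiply is enough for those two products, while bd needs a full 32x32->64 multiply so that its carry is kept.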
Now we just need to use 32-bit operations to calculate the two digits.
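Before moving to vectors, it can help to see the same digit arithmetic in scalar code. The following is a minimal sketch, not part of the original implementation; the function name mul_lo64_by_digits is made up for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: low word 64-bit multiply built from 32-bit digits.
// x = [a,b] and y = [c,d], where a and c are the upper dwords.
inline std::uint64_t mul_lo64_by_digits(std::uint64_t x, std::uint64_t y) {
    const std::uint64_t a = x >> 32;           // upper digit of x
    const std::uint64_t b = x & 0xFFFFFFFFu;   // lower digit of x
    const std::uint64_t c = y >> 32;           // upper digit of y
    const std::uint64_t d = y & 0xFFFFFFFFu;   // lower digit of y

    // Full 32x32->64 product. Its upper dword is the carry into the upper digit.
    const std::uint64_t bd = b * d;

    // Only the lower 32 bits of ad + bc survive, so truncate to 32 bits.
    const std::uint32_t cross = static_cast<std::uint32_t>(a * d + b * c);

    // Place the cross terms in the upper dword. The 64-bit add applies the carry from bd.
    return bd + (static_cast<std::uint64_t>(cross) << 32);
}
```

For any pair of inputs this should agree with the plain expression x * y on uint64_t values, which makes it a convenient reference when checking a vector version.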
AVX2
Note: There is some sample code on popular websites that gives incorrect results. Always write a small test script to compare the results against scalar multiplication.
Here is a way to implement the operators for a struct containing a single 256-bit vector value “v”.
#include <immintrin.h> //AVX2 intrinsics
#include <cstdint>     //uint64_t

struct Simd256UInt64 {
    __m256i v;

    //Multiply-Assign Operator. (i64*i64->i64)
    Simd256UInt64& operator*=(const Simd256UInt64& rhs) noexcept {
        auto digit1 = _mm256_mul_epu32(v, rhs.v);           //Calculate bd as a full 64-bit product (carry in upper dword)
        auto rhs_swap = _mm256_shuffle_epi32(rhs.v, 0xB1);  //Swap the low and high dwords: [c,d] -> [d,c].
        auto ad_bc = _mm256_mullo_epi32(v, rhs_swap);       //Multiply dwords: ad in the upper dword, bc in the lower dword.
        auto bc_00 = _mm256_slli_epi64(ad_bc, 32);          //Shift left to put bc in the upper dword.
        auto ad_plus_bc = _mm256_add_epi32(ad_bc, bc_00);   //Perform addition in the upper dword
        auto digit2 = _mm256_and_si256(ad_plus_bc, _mm256_set1_epi64x(0xFFFFFFFF00000000)); //Zero lower dword
        this->v = _mm256_add_epi64(digit1, digit2);         //Add digits to get final result.
        return *this;
    }

    //Multiply every value in the vector by the same uint64_t
    Simd256UInt64& operator*=(uint64_t rhs) noexcept {
        *this *= Simd256UInt64{_mm256_set1_epi64x(static_cast<long long>(rhs))};
        return *this;
    }
};

//(i64*i64->i64) AVX2 Multiplication
inline Simd256UInt64 operator*(Simd256UInt64 lhs, const Simd256UInt64& rhs) noexcept {
    lhs *= rhs; return lhs;
}

//Multiply SIMD vector by single uint64_t
inline Simd256UInt64 operator*(Simd256UInt64 lhs, uint64_t rhs) noexcept {
    lhs *= rhs; return lhs;
}

//Multiply single uint64_t by SIMD vector
inline Simd256UInt64 operator*(uint64_t lhs, Simd256UInt64 rhs) noexcept {
    rhs *= lhs; return rhs;
}
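The note at the start of this section recommends checking results against scalar multiplication. One way such a check might look is sketched below. It repeats the same instruction sequence in a free function so that the block stands alone. The names TARGET_AVX2, cpu_has_avx2, mul64x4_avx2 and count_mismatches are made up for this sketch, and the per-function target attribute and __builtin_cpu_supports are GCC/Clang features; other compilers need their own detection.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>
#include <random>

// GCC/Clang: enable AVX2 for one function so the file builds without -mavx2.
// Assumption: other compilers (such as MSVC) accept the intrinsics without an attribute.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

// Hypothetical helper: reports whether the running CPU supports AVX2.
inline bool cpu_has_avx2() {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_cpu_supports("avx2") != 0;
#else
    return true; // Assumption: the caller has already verified AVX2 support.
#endif
}

// Multiplies four uint64_t pairs using the same steps as operator*= above.
TARGET_AVX2 inline void mul64x4_avx2(const std::uint64_t* x, const std::uint64_t* y,
                                     std::uint64_t* out) {
    const __m256i a = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(x));
    const __m256i b = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(y));
    const __m256i digit1 = _mm256_mul_epu32(a, b);        // full 64-bit product of the low dwords
    const __m256i b_swap = _mm256_shuffle_epi32(b, 0xB1); // swap dwords within each 64-bit lane
    const __m256i ad_bc = _mm256_mullo_epi32(a, b_swap);  // low 32 bits of the two cross products
    const __m256i bc_00 = _mm256_slli_epi64(ad_bc, 32);   // move the lower cross product up
    const __m256i sum = _mm256_add_epi32(ad_bc, bc_00);   // sum of cross products in upper dword
    const __m256i mask = _mm256_set1_epi64x(static_cast<long long>(0xFFFFFFFF00000000ULL));
    const __m256i digit2 = _mm256_and_si256(sum, mask);   // zero the lower dword
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), _mm256_add_epi64(digit1, digit2));
}

// Counts lanes where the vector result differs from scalar uint64_t multiplication.
inline std::size_t count_mismatches(std::size_t iterations) {
    std::mt19937_64 rng(12345); // fixed seed so any failure is reproducible
    std::size_t mismatches = 0;
    for (std::size_t n = 0; n < iterations; ++n) {
        std::uint64_t x[4], y[4], out[4];
        for (int i = 0; i < 4; ++i) { x[i] = rng(); y[i] = rng(); }
        if (n == 0) { // first pass: force boundary values into two lanes
            x[0] = ~0ULL;         y[0] = ~0ULL;
            x[1] = 0xFFFFFFFFULL; y[1] = 0x100000001ULL;
        }
        mul64x4_avx2(x, y, out);
        for (int i = 0; i < 4; ++i) {
            if (out[i] != x[i] * y[i]) ++mismatches;
        }
    }
    return mismatches;
}
```

Call cpu_has_avx2() first and only run count_mismatches() when it returns true. A result of zero means every lane matched the scalar product for the inputs tried, which is evidence rather than proof of correctness.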
AVX-512
The AVX-512DQ instruction set has a dedicated intrinsic for i64*i64->i64 multiplication. It is called _mm512_mullo_epi64.
//(i64*i64->i64) Multiplication for AVX-512.
//Requires the AVX-512DQ instruction set.
Simd512UInt64& operator*=(const Simd512UInt64& rhs) noexcept {
    v = _mm512_mullo_epi64(v, rhs.v);
    return *this;
}
AVX-512 Notes
AVX-512DQ is part of the x86-64-v4 microarchitecture level, along with AVX-512F, AVX-512BW, AVX-512CD and AVX-512VL. It is present on most consumer-level CPUs that support AVX-512. However, it is not supported on every CPU that supports the basic AVX-512F instruction set; Intel’s Xeon Phi processors (Knights Landing and Knights Mill) are examples. So you should test for AVX-512DQ support either at compile-time or run-time.
Activating AVX-512 compilation in Visual Studio (the /arch:AVX512 option) will activate both the AVX-512F and AVX-512DQ instruction sets. It will define: __AVX512F__ and __AVX512DQ__.
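As a rough illustration of the two kinds of test mentioned above, the sketch below checks the predefined macros at compile-time and asks the CPU at run-time. The function names are made up for this example. The run-time part relies on the GCC/Clang builtin __builtin_cpu_supports and conservatively reports false elsewhere, so a Visual Studio build would need a different mechanism (for example a CPUID query).

```cpp
// Compile-time: true only if the compiler was allowed to emit AVX-512F and AVX-512DQ code.
constexpr bool compiled_with_avx512dq() {
#if defined(__AVX512F__) && defined(__AVX512DQ__)
    return true;
#else
    return false;
#endif
}

// Run-time: true if the CPU executing this code reports AVX-512F and AVX-512DQ.
// Assumption: GCC or Clang on x86. Other setups fall through to "false" in this sketch.
inline bool cpu_has_avx512dq() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512dq");
#else
    return false;
#endif
}
```

A dispatcher could take the _mm512_mullo_epi64 path only when cpu_has_avx512dq() returns true and fall back to the AVX2 multiplication otherwise.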