Kyber: NTT and Efficient Polynomial Multiplication

In my previous article From Baby to Teenage Kyber I've explained how Kyber encryption/decryption works. As it was stated there, the essence of the algorithm is multiplication of polynomials in the ring R_q = Z_q[X]/(X^n + 1):

f = np.zeros(n + 1)
f[0] = 1
f[n] = 1

el = np.polymul(a, b)
el = np.polydiv(el, f)[1]
el = np.remainder(el, q)

This kind of multiplication is not efficient because each coefficient from the first polynomial is multiplied with each coefficient from the second polynomial. This multiplication has O(n^2) complexity, where n is the length of the polynomials (furthermore division adds additional complexity).

What's our goal?

We want to transform somehow the polynomials into different domain. These transformed polynomials should be multiplied element wise, i.e. the first coefficient from the first polynomial with the first coefficient from the second polynomial, second coefficient from the first with the second coefficient from the second polynomial and so on.

Let's have 2 polynomials, a and b. Straight multiplication is:

a = a_0 + a_1 * X + a_2 * X^2

b = b_0 + b_1 * X + b_2 * X^2

a * b = (a_0 + a_1 * X + a_2 * X^2) * (b_0 + b_1 * X + b_2 * X^2) = 

= a_0 * b_0 + a_0 * b_1 * X + a_0 * b_2 * X^2 + 

+ a_1 * b_0 * X + a_1 * b_1 * X^2 + a_1 * b_2 * X^3 +

+ a_2 * b_0 * X^2 + a_2 * b_1 * X^3 + a_2 * b_2 * X^4 =

= a_0 * b_0 + (a_0 * b_1 + a_1 * b_0) * X + 

+ (a_0 * b_2 + a_1 * b_1 + a_2 * b_0) * X^2 + 

+ (a_1 * b_2 + a_2 * b_1) * X^3 + a_2 * b_2 * X^4

Before we dive into different types of convolution, let's define some polynomial congruences.

X^n - 1 = 0 mod (X^n - 1)

X^n = 1 mod (X^n - 1)

X^(n+1) = X mod (X^n - 1)

X^(n+2) = X^2 mod (X^n - 1)

. . .

X^n + 1 = 0 mod (X^n + 1)

X^n = -1 mod (X^n + 1)

X^(n+1) = -X mod (X^n + 1)

X^(n+2) = -X^2 mod (X^n + 1)

Now we are going to define different types of polynomial convolutions.

Linear convolution

Linear convolution matches the straight multiplication of polynomials of length n and degree of n-1. The resulting polynomial is of length 2*n-1 (degree 2*n-2). More formally, c = a * b where c, a, b belong to Z_q[X] can be defined as:

Circular Convolution

We get Circular Convolution if we divide the resulting polynomial from the Linear Convolution with X^n - 1. If we apply the congruences specified in the example above the result is:

a * b = 

= a_0 * b_0 + (a_0 * b_1 + a_1 * b_0) * X + 

+ (a_0 * b_2 + a_1 * b_1 + a_2 * b_0) * X^2 +

+ (a_1 * b_2 + a_2 * b_1) * X^3 + a_2 * b_2 * X^4

a * b / (X^3 - 1) =

a_0 * b_0 + (a_0 * b_1 + a_1 * b_0) * X +
 
+ (a_0 * b_2 + a_1 * b_1 + a_2 * b_0) * X^2 +

+ (a_1 * b_2 + a_2 * b_1) + a_2 * b_2 * X =

= a_0 * b_0 + a_1 * b_2 + a_2 * b_1 +

+ (a_0 * b_1 + a_1 * b_0 + a_2 * b_2) * X +

+ (a_0 * b_2 + a_1 * b_1 + a_2 * b_0) * X^2

More formally, c = a * b where c, a, b belong to Z_q[X]/(X^n-1) can be defined as:

Negative Wrapped Convolution

We get Negative Wrapped Convolution if we divide the resulting polynomial from the Linear Convolution with X^n + 1. If we apply the congruences specified in the example above the result is:

a * b =

= a_0 * b_0 + (a_0 * b_1 + a_1 * b_0) * X +
 
+ (a_0 * b_2 + a_1 * b_1 + a_2 * b_0) * X^2 +

+ (a_1 * b_2 + a_2 * b_1) * X^3 + a_2 * b_2 * X^4

a * b / (X^3 + 1) =

= a_0 * b_0 + (a_0 * b_1 + a_1 * b_0) * X +
 
+ (a_0 * b_2 + a_1 * b_1 + a_2 * b_0) * X^2 +

+ (a_1 * b_2 + a_2 * b_1) * (-1) + a_2 * b_2 * (-X) =

= a_0 * b_0 - a_1 * b_2 - a_2 * b_1 +

+ (a_0 * b_1 + a_1 * b_0 - a_2 * b_2) * X +

+ (a_0 * b_2 + a_1 * b_1 + a_2 * b_0) * X^2

More formally, c = a * b where c, a, b belong to Z_q[X]/(X^n+1) can be defined as:

What we need in Kyber is efficient Negative Wrapped Convolution calculation! Why? Because Kyber polynomials belong to Z_q[X]/(X^n+1)! Whenever we multiply two polynomials a * b in the text below, it is Negative Wrapped Convolution!

Primitive Roots of the Ring Z_q

The n-th primitive root of the ring Z_q is a number w_n that belongs to Z_q and it holds w_n^n = 1 mod q. All other exponentiations w_n^k mod q from k = 0,1, 2,..., n-1 generate different numbers in Z_q. It also holds that w_n^(n/2) mod q = -1 (-1 is the same with q-1).

For example, in Kyber, for q = 3329, the 128-th root is 289 and 17 is the 256-th root. Also, 512-th root does not exist!

289^0 = 1 mod 3329

289^1 = 289 mod 3329

289^2 = 296 mod 3329

289^3 = 2319 mod 3329

. . .

289^64 = 3328 mod 3329 # -1 mod 3329

. . .

289^127 = 2419 mod 3329

289^128 = 1 mod 3329

Number Theoretic Transform (NTT)

Number Theoretic Transform of polynomial a of length n and coefficients in Z_q is another polynomial a' of length n, i.e. a' = NTT(a), whose coefficients a'_k for k = 0, 1, 2, ..., n_1 are defined as:

The Inverse Number Theoretic Transform of polynomial a' of length n and coefficients in Z_q is the original polynomial a of length n, i.e. a = INTT(a'), whose coefficients a_k for k = 0, 1, 2, ..., n_1 are defined as:

Obviously it holds that a = INTT(NTT(a)).

Ok, we have defined what NTT is. What can we do with it?

Let's have two polynomials a and b, and their corresponding NTT transformations a' and b':

a' = NTT(a)

b' = NTT(b)

Let o denotes point wise polynomial multiplication:

a' o b' = a'_0 * b'_0 + a'_1 * b'_1 + . . . + a'_n_1 * b'_n_1

So what we get with:

INTT(a' o b') = INTT(NTT(a) o NTT(b))

The result is Circular Convolution of the polynomials a and b.

How can we calculate Linear Convolution using NTT?

It's easy too! In order to calculate Linear Convolution of the polynomials a and b we shall:

Pad with zeros both polynomials a and b to the length of 2*n
Calculate 2*n point NTT on both polynomials. Note that in this calculations we shall work with 2*n-th root w_2n (w_n = w_2n^2)
Point wise multiply transformed polynomials and finally apply INTT on this product.

How can we calculate Negative Wrapped Convolution using NTT?

The straight forward approach is calculating Linear Convolution and dividing the result with X^n+1. We can calculate this division by using congruences as it has been demonstrated in the above convolution examples.

Padding polynomials to the length of 2*n and calculating 2*n point NTT is not the most efficient way of calculating Negative Wrapped Convolution. It has been proven that we can calculate this convolution with n point NTT only. Here is how we can do it.

Define polynomials w and w_inv as:

w = [1, w_2n, w_2n^2, w_2n^3, . . . w_2n^(n-1)]

w_inv = [1, w_2n^(-1), w_2n^(-2), w_2n^(-3), . . . w_2n^(-n+1)]

Calculate polynomials a'' and b'' as:

a'' = w o a

b'' = w o b

Finally calculate:

w_inv o INTT(NTT(a'') o NTT(b'')) =

= w_inv o INTT(NTT(w o a) o NTT(w o b))

The coefficients of w and w_inv can be incorporated directly into NTT and INTT simply as:

In addition, if we take into account that w_n = w_2n^2, the equations can rely on w_2n only!

Up to this moment, we have the equations that define how to calculate Negative Wrapped Convolution using NTT.

But...

If we calculate NTT using its coefficient definition, we still have n * n multiplications, i.e. O(n^2) complexity. There is no gain at all!

Before we continue, let's stress that Number Theoretic Transform is conceptually the same with the Discrete Fourier Transform - DFT. In DFT instead of w_n, we work with primitive root on the complex unit circle, i.e. exp(2Pi/n). Everything else is the same. DFT can be calculated by using the well known Fast Fourier Transform - FFT and its most famous representation, the Cooley-Tukey FFT Algorithm. It is an iterative algorithm that divides odd and even coefficients, calculates two n/2 DFTs (one for the odd coefficients and one for the even coefficients) and finally combines results. It is a Divide and Conquer strategy known as Butterfly pattern.

We are going to define the Cooley-Tukey NTT and Gentlemen-Sande INTT algorithms. Before presenting the algorithms let's us define what bit reversed order means in the algorithm specifications.

0 = 000 => 000 = 0

1 = 001 => 100 = 4

2 = 010 => 010 = 2

3 = 011 => 110 = 6

4 = 100 => 001 = 1

5 = 101 => 101 = 5

6 = 110 => 011 = 3

7 = 111 => 111 = 7

If we have polynomial a (or vector, list, array) with coefficients in standard (natural) order:

[a_0, a_1, a_2, a_3, a_4, a_5, a_6, a_7]

the bit reversed order is:

[a_0, a_4, a_2, a_6, a_1, a_5, a_3, a_7]

Cooley-Tukey NTT Algorithm

INPUT: 
A vector a = [a_0,...,a_n−1] of degree n (power of 2) 
in standard order where a_i belongs to [0,q − 1], q = 1 mod 2n 
INPUT: 
Lookup array 'psis' of 2n-th roots (w_2n) in bit reversed order 
OUTPUT: a <- NTT(a) in bit reversed order
====================

def ntt(a):
    t = n
    m = 1
    while m < n:
        t = t // 2
        for i in range(0, m):
            j1 = 2 * i * t
            j2 = j1 + t - 1
            S = psis[m + i]
            for j in range(j1, j2 + 1):
                U = a[j]
                V = a[j + t] * S
                a[j] = (U + V) % q
                a[j + t] = (U - V) % q
        m = 2 * m

Gentlemen-Sande INTT Algorithm

INPUT: 
A vector a = [a_0,...,a_n−1] of degree n (power of 2)
in bit reversed order where a_i belongs to [0,q − 1], q = 1 mod 2n 
INPUT: 
Lookup array of 'inv_psis' 
of 2n-th root (w_2n) inverses in bit reversed order
INPUT: n_inv which is n^(-1) mod q
OUTPUT: a <- INTT(a) in standard order
==================== 

def inv_ntt(a):
    t = 1
    m = n
    while m > 1:
        j1 = 0
        h = m // 2
        for i in range(0, h):
            j2 = j1 + t - 1
            S = inv_psis[h + i]
            for j in range(j1, j2 + 1):
                U = a[j]
                V = a[j + t]
                a[j] = (U + V) % q
                a[j + t] = (U - V) * S % q
            j1 = j1 + 2 * t
        t = 2 * t
        m = m // 2
    for i in range(0, n):
        a[i] = a[i] * n_inv % q

The complexity of these algorithms is O(n*log(n)).

Here is simple Python code that demonstrates the technique. It generates the lookup arrays and their inverses, calculates NTT based polynomial multiplication and compares the result with the standard Numpy based polynomial multiplication in Z_q[X]/(X^n+1). Note that the Numpy methods used require array inversion, i.e. the largest degree coefficient comes first.

params.py

q = 3329

n = 128

ntt.py

import numpy as np

from params import q, n

n_inv = pow(n, -1, q)

def reverse_bits(n):
    b = "{:0{width}b}".format(n, width=7)
    return int(b[::-1], 2)


positions = [reverse_bits(x) for x in range(0, n)]

psi = 1

root = 17

tmp = []

psis = []

inv_psis = []


for x in range(0, n):
    tmp.append(psi)
    psi = psi * root % q


for x in range(0, n):
    val = tmp[positions[x]]
    inv_val = pow(val, -1, q)
    psis.append(val)
    inv_psis.append(inv_val)


# Cooley-Tukey (TC)

def ntt(a):
    t = n
    m = 1
    while m < n:
        t = t // 2
        for i in range(0, m):
            j1 = 2 * i * t
            j2 = j1 + t - 1
            S = psis[m + i]
            for j in range(j1, j2 + 1):
                U = a[j]
                V = a[j + t] * S
                a[j] = (U + V) % q
                a[j + t] = (U - V) % q
        m = 2 * m

# Gentleman-Sande (GS)

def inv_ntt(a):
    t = 1
    m = n
    while m > 1:
        j1 = 0
        h = m // 2
        for i in range(0, h):
            j2 = j1 + t - 1
            S = inv_psis[h + i]
            for j in range(j1, j2 + 1):
                U = a[j]
                V = a[j + t]
                a[j] = (U + V) % q
                a[j + t] = (U - V) * S % q
            j1 = j1 + 2 * t
        t = 2 * t
        m = m // 2
    for i in range(0, n):
        a[i] = a[i] * n_inv % q


f = np.zeros(n + 1)
f[0] = 1
f[n] = 1

a = np.random.randint(0, q, n)

b = np.random.randint(0, q, n)

p = np.remainder(np.polydiv(np.polymul(a[::-1], b[::-1]), f)[1], q).astype(int)[::-1]

print(p)

ntt(a)
ntt(b)
c = np.multiply(a, b)

inv_ntt(c)

print(c)

print(np.array_equal(c, p))

So far so good, but we can do better :)

Consider the following modulo operations in NTT and INTT:

 a[j] = (U + V) % q
 a[j + t] = (U - V) * S % q

Adding, multiplying, shifting and other bitwise operations are relatively cheap processor operations. However, division and modulo operations are expensive processor operations.

What if we can calculate modulo without division, but cheap only processor operations only. The answer is the Barrett Reduction.

Simple Python implementation based on the article above could be:

kbr = 26
m = 2**kbr // q

def barrett_reduce(x):
    x1 = x * m >> kbr
    x -= x1 * q
    if x > q:
        return x - q
    else:
        return x

The Barrett Reduction will help us calculate modulo operation when adding or subtracting numbers. For multiplication we need yet another strategy, the Montgomery Reduction. The idea of the Montgomery Space is simple:

Pick a number r > q, r and q should be relatively prime
Transform each number a we operate with into the Montgomery Space. The transformation is a' = a*r mod q. You can always go back from the Montgomery Space with a = a' r_inv mod q, where r_inv = r^(-1) mod q.
If we have two numbers in Montgomery Space:

a' = a * r mod q

b' = b * r mod q

(a' + b') mod q = (a + b) * r mod q

=> (a + b) mod q = (a' + b') * r_inv mod q

In other words, adding and subtracting numbers in the Montgomery Space provides results that can be converted back into the ordinary space by r_inv multiplication.

For multiplication of two numbers in the Montgomery Space we need the so called reduction.

a' = a * r mod q

b' = b * r mod q

(a' * b') mod q = a * r * b * r mod q = (a * b * r) * r mod q

We have one additional r in the multiplication result. We don't need it and we shall reduce the multiplication result to one r only. If we denote x = a * b * r * r, then we shall calculate x * r_inv mod q after each multiplication in the Montgomery Space.

Simple Python implementation could be:

e = 16

r = 1 << e

k = (r * r_inv - 1) // q

def montgomery_reduce(x):
    s = x * k & (r - 1)
    t = x + s * q
    u = t >> e
    return u if u < q else u - q

Note that Montgomery reduction relies on multiplication, addition, bit masking and bit shifting that are processor efficient operations.

Converting this reduced value back into the ordinary space yields:

x * r_inv * r_inv mod q = a * r * b * r * r_inv * r_inv mod q = a * b mod q

i.e. it is the desired number product.

The whole idea with the Montgomery Space is to move all numbers, coefficients and polynomials into this space, add and multiply the numbers into this space multiple times according the algorithm in place (these operations never include the expensive division and modulo operations) and finally convert back the result into the ordinary space.

The modified Python NTT code with Montgomery Space operations is given below.

reduce.py

from params import q, n

e = 16

r = 1 << e

rr = r * r %q

mont = r % q

r_inv = pow(r, -1, q)

k = (r * r_inv - 1) // q

kbr = 26

m = 2**kbr // q

def barrett_reduce(x):
    x1 = x * m >> kbr
    x -= x1 * q
    if x > q:
        return x - q
    else:
        return x

def montgomery_reduce(x):
    s = x * k & (r - 1)
    t = x + s * q
    u = t >> e
    return u if u < q else u - q

def fqmul(a, b):
    return montgomery_reduce(a * b)

ntt_mr.py

import numpy as np

from params import q, n

from reduce import barrett_reduce, fqmul, mont, r, rr, r_inv

n_inv = pow(n, -1, q) * r % q

def reverse_bits(n):
    b = "{:0{width}b}".format(n, width=7)
    return int(b[::-1], 2)

positions = [reverse_bits(x) for x in range(0, n)]

psi = 1

root = 17

tmp = []

psis = []

inv_psis = []

for x in range(0, n):
    tmp.append(psi)
    psi = psi * root % q

for x in range(0, n):
    val = tmp[positions[x]]
    inv_val = pow(val, -1, q)
    psis.append(val * r % q)
    inv_psis.append(inv_val * r % q)


# Cooley-Tukey (TC) no(natural order) -> br (bit reversed order)

def ntt(a):
    t = n
    m = 1
    while m < n:
        t = t // 2
        for i in range(0, m):
            j1 = 2 * i * t
            j2 = j1 + t - 1
            S = psis[m + i]
            for j in range(j1, j2 + 1):
                U = a[j]
                V = fqmul(a[j + t], S)
                a[j] = barrett_reduce(U + V)
                a[j + t] = barrett_reduce(U - V)
        m = 2 * m

# Gentleman-Sande (GS)  br (bit reversed order) -> no (natural order)

def inv_ntt(a):
    t = 1
    m = n
    while m > 1:
        j1 = 0
        h = m // 2
        for i in range(0, h):
            j2 = j1 + t - 1
            S = inv_psis[h + i]
            for j in range(j1, j2 + 1):
                U = a[j]
                V = a[j + t]
                a[j] = barrett_reduce(U + V)
                a[j + t] = fqmul(S, U - V)
            j1 = j1 + 2 * t
        t = 2 * t
        m = m // 2
    for i in range(0, n):
        a[i] = fqmul(a[i], n_inv)


f = np.zeros(n + 1)
f[0] = 1
f[n] = 1

a = np.random.randint(0, q, n)

b = np.random.randint(0, q, n)

p = np.remainder(np.polydiv(np.polymul(a[::-1], b[::-1]), f)[1], q).astype(int)[::-1]

print(p)

a = np.array([montgomery_reduce(x * rr) for x in a])

b = np.array([montgomery_reduce(x * rr) for x in b])

ntt(a)
ntt(b)
c = np.array([fqmul(x, y) for (x, y) in zip(a, b)])

inv_ntt(c)

c = np.array([montgomery_reduce(x) for x in c])

print(c)

print(np.array_equal(c, p))

The comparison between results yields True.

What if we don't convert polynomials a and b into Montgomery Space, use them directly and spare some multiplications?

NTT(a) = NTT((a * r_inv) * r)

NTT(b) = NTT((b * r_inv) * r)

Note: multiplying vector with scalar means multiplying each vector's coefficient with the scalar.

After the point wise multiplication of these two NTT representations along with the Montgomery reductions within, we get:

NTT(a * b * r_inv * r_inv * r mod q)

After INTT we get:

c = a * b * r_inv * r_inv * r mod q

If we multiply the final result with r we get:

c * r mod q = a * b * r_inv * r_inv * r * r mod q = a * b mod q

We have the desired result, but instead of multiplying 3 polynomials with r or r_inv (two input and one output polynomial), we've multiplied the coefficients of the final result only, i.e. the output polynomial.

# No polynomial transformation before multiplication

ntt(a)
ntt(b)

c = np.array([fqmul(x, y) for (x, y) in zip(a, b)])

inv_ntt(c)

c = np.array([montgomery_reduce(x * rr) for x in c])

But we don't need even that only one multiplication. We can embed the multiplication with r within INTT itself. The INTT ends with the following modification that applies to the all output coefficients:

for i in range(0, n)
    a[i] = fqmul(a[i], n_inv):

If we multiply n_inv with additional r in its definition, the final multiplication steps are 'by the book':

n_inv = pow(n, -1, q) * r * r % q

. . . 

ntt(a)
ntt(b)

c = np.array([fqmul(x, y) for (x, y) in zip(a, b)])

inv_ntt(c)

print(c)

This is exactly the "trick" used in the Kyber reference implementation.

The polynomials length in the examples above is 128. In Kyber the length is 256. In order to use the above NTT based method for polynomial multiplication we need 512-th root. But, such does not exist. The aforementioned NIST reference implementation of Kyber demonstrates how to efficiently multiply polynomials with a kind of modified NTT transformation that relies on the 256-th root (17) and 128 powers of it. They still use similar "butterfly" pattern based algorithms that execute all of the rounds except the last one for which there are no more coefficients. The transformation produces 256 long vector, but the the coefficients shall be treated as pairs:

a = [(a_0,a_1),(a_2,a_3), ... ..., (a_252, a_253), (a_254, a_255)]

b = [(b_0,b_1),(b_2,b_3), ... ..., (b_252, b_253), (b_254, b_255)]

Instead of point wise multiplication of these 2 vectors, the implementation uses 128 vector wise multiplications:

c = [(a_0,a_1)x(b_0,b_1), (a_2,a_3)x(b_2,b_3) ... ]

where:

(a_0,a_1) x (b_0,b_1) = 

(a_0 + a_1*X)(b_0 + b_1 * X) mod (X^2 - psi) = 

= (a_1*b_1*X^2 + (a_0*b_1 + a_1*b_0)*X + a_0*b_0) mod (X^2 - psi) = 

= (a_0*b_1 + a_1*b_0)*X + a_0*b_0 + psi*a_1*b_1

= (a_0*b_0 + psi*a_1*b_1, (a_0*b_1 + a_1*b_0)

The code below is Python adaptation of this algorithm demonstrated in the NIST reference implementation. The vectors are of length 256.

ntt_mr_256.py

import numpy as np

from params import q, n

from reduce import barrett_reduce, montgomery_reduce, fqmul, mont, r, rr, r_inv

# np.random.seed(0)


# NTT starts here
def reverse_bits(n):
    b = "{:0{width}b}".format(n, width=7)
    return int(b[::-1], 2)


positions = [reverse_bits(x) for x in range(0, n)]

psi = mont

root = 17

tmp = []

psis = []

inv_psis = []

for x in range(0, n):
    tmp.append(psi)
    psi = fqmul(psi, mont * root % q)


for x in range(0, n):
    val = tmp[positions[x]]
    psis.append(val)

i = 64
while i >= 1:
    for j in range(i, 2 * i):
        inv_psis.append(-tmp[128 - positions[j]] % q)
    i >>= 1

inv_psis.append(mont * (mont * (q - 1) * ((q - 1) // 128) % q) % q)


def basemul(a, b, psi):
    r0 = fqmul(a[1], b[1])
    r0 = fqmul(r0, psi)
    r0 += fqmul(a[0], b[0])

    r1 = fqmul(a[0], b[1])
    r1 += fqmul(a[1], b[0])

    return (r0, r1)


def ntt(r):
    k = 1
    len = 128
    while len >= 2:
        start = 0
        while start < 256:
            j = start
            psi = psis[k]
            k += 1
            while j < start + len:
                t = fqmul(psi, r[j + len])
                r[j + len] = barrett_reduce(r[j] - t)
                r[j] = barrett_reduce(r[j] + t)
                j += 1
            start = j + len
        len >>= 1


def inv_ntt(r):
    k = 0
    len = 2
    while len <= 128:
        start = 0
        while start < 256:
            psi = inv_psis[k]
            k += 1
            j = start
            while j < start + len:
                t = r[j]
                r[j] = barrett_reduce(t + r[j + len])
                r[j + len] = t - r[j + len]
                r[j + len] = fqmul(psi, r[j + len])
                j += 1
            start = j + len
        len <<= 1
    for j in range(0, 256):
        r[j] = fqmul(r[j], inv_psis[127])


f = np.zeros(2 * n + 1)
f[0] = 1
f[2 * n] = 1

a = np.random.randint(0, q, 2 * n)

b = np.random.randint(0, q, 2 * n)

p = np.remainder(np.polydiv(np.polymul(a[::-1], b[::-1]), f)[1], q).astype(int)[::-1]

print("p:", p)

ntt(a)
ntt(b)


c = np.zeros(2 * n).astype(int)

for i in range(0, n // 2):
    r0, r1 = basemul(a[4 * i : 4 * i + 2], b[4 * i : 4 * i + 2], psis[64 + i])
    c[4 * i] = r0
    c[4 * i + 1] = r1

    r0, r1 = basemul(a[4 * i + 2 : 4 * i + 4], b[4 * i + 2 : 4 * i + 4], -psis[64 + i])
    c[4 * i + 2] = r0
    c[4 * i + 3] = r1


inv_ntt(c)

print("c:", c)

print(np.array_equal(p, c))

Optimisation is very important Cryptography concept. The algorithms shall be fast and shall run on various devices, some of them not that powerful. As I said before, understanding NTT is a nice to have skill in the post-quantum era. The algorithm is ubiquitous and I believe more and more emerging cryptography and blockchain features will rely on NTT.

Marjan

Kyber: NTT and Efficient Polynomial Multiplication

Marjan Sterjev

IT Engineer | CISSP | CCSP | CEH (Master): research | learn | do | MENTOR

Linear convolution

Circular Convolution

Negative Wrapped Convolution

Primitive Roots of the Ring Z_q

Number Theoretic Transform (NTT)

Recommended by LinkedIn

Cooley-Tukey NTT Algorithm

Gentlemen-Sande INTT Algorithm

More articles by Marjan Sterjev

Insights from the community

Others also viewed

Kalman Filter from A to Z

Hardware-friendliness of HyperPlonk

Advancements in Failure Prognosis of Discrete Event Systems

Scientific Data Compression with ZFP

Byzantine Fault Mitigation using Fuzzy Logic, Schrodinger’s Cat and the Copenhagen Interpretation to deliver a P=NP environment.

Concept of Numerology in 5G-NR

Understanding Transients via Time-Varying Harmonics

NEWSLETTER INTELIGENCIA ARTIFICIAL

MONET (Mixture of Monosemantic Experts for Transformers) to tackle polysemanticity issues

Sensor and Data Fusion: Combining Technologies for Adventageous Results

Explore topics

Linear convolution

Circular Convolution

Negative Wrapped Convolution

Primitive Roots of the Ring Z_q

Number Theoretic Transform (NTT)

Recommended by LinkedIn

Cooley-Tukey NTT Algorithm

Gentlemen-Sande INTT Algorithm

More articles by Marjan Sterjev

Decoding SQL Injection Exfiltrated Data from Network Traffic Dump

Secure Multi-Party Computation: Idea and Example

Theseus TryHackMe Room Writeup

Breaking weak Captcha: Tesseract OCR

Behaviour Analysis: Outlier Sequence Detection

RSA Optimizations: Think Twice

RSA & Chinese Remainder Theorem MEGA Exploits

A word or two about the parallelisation

Covert Tcp - Scapy Version

Network Traffic Anomaly Detection

Insights from the community

Others also viewed

Kalman Filter from A to Z

Hardware-friendliness of HyperPlonk

Advancements in Failure Prognosis of Discrete Event Systems

Scientific Data Compression with ZFP

Byzantine Fault Mitigation using Fuzzy Logic, Schrodinger’s Cat and the Copenhagen Interpretation to deliver a P=NP environment.

Concept of Numerology in 5G-NR

Understanding Transients via Time-Varying Harmonics

NEWSLETTER INTELIGENCIA ARTIFICIAL

MONET (Mixture of Monosemantic Experts for Transformers) to tackle polysemanticity issues

Sensor and Data Fusion: Combining Technologies for Adventageous Results

Explore topics