Kyber: NTT and Efficient Polynomial Multiplication
In my previous article, From Baby to Teenage Kyber, I explained how Kyber encryption/decryption works. As stated there, the essence of the algorithm is the multiplication of polynomials in the ring R_q = Z_q[X]/(X^n + 1):
f = np.zeros(n + 1)        # f represents X^n + 1 (numpy stores the highest degree coefficient first)
f[0] = 1
f[n] = 1
el = np.polymul(a, b)      # plain polynomial multiplication
el = np.polydiv(el, f)[1]  # keep the remainder of the division by X^n + 1
el = np.remainder(el, q)   # reduce the coefficients mod q
This kind of multiplication is not efficient because each coefficient of the first polynomial is multiplied by each coefficient of the second polynomial, which gives O(n^2) complexity, where n is the length of the polynomials (and the division adds further cost).
What's our goal?
We want to transform the polynomials into a different domain in which they can be multiplied element wise, i.e. the first coefficient of the first polynomial with the first coefficient of the second polynomial, the second coefficient with the second coefficient, and so on.
Let's take two polynomials, a and b. Straight multiplication is:
a = a_0 + a_1 * X + a_2 * X^2
b = b_0 + b_1 * X + b_2 * X^2
a * b = (a_0 + a_1 * X + a_2 * X^2) * (b_0 + b_1 * X + b_2 * X^2) =
= a_0 * b_0 + a_0 * b_1 * X + a_0 * b_2 * X^2 +
+ a_1 * b_0 * X + a_1 * b_1 * X^2 + a_1 * b_2 * X^3 +
+ a_2 * b_0 * X^2 + a_2 * b_1 * X^3 + a_2 * b_2 * X^4 =
= a_0 * b_0 + (a_0 * b_1 + a_1 * b_0) * X +
+ (a_0 * b_2 + a_1 * b_1 + a_2 * b_0) * X^2 +
+ (a_1 * b_2 + a_2 * b_1) * X^3 + a_2 * b_2 * X^4
Before we dive into different types of convolution, let's define some polynomial congruences.
X^n - 1 = 0 mod (X^n - 1)
X^n = 1 mod (X^n - 1)
X^(n+1) = X mod (X^n - 1)
X^(n+2) = X^2 mod (X^n - 1)
. . .
X^n + 1 = 0 mod (X^n + 1)
X^n = -1 mod (X^n + 1)
X^(n+1) = -X mod (X^n + 1)
X^(n+2) = -X^2 mod (X^n + 1)
Now we are going to define different types of polynomial convolutions.
Linear convolution
Linear Convolution matches the straight multiplication of two polynomials of length n (degree n-1). The resulting polynomial has length 2*n-1 (degree 2*n-2). More formally, c = a * b, where c, a, b belong to Z_q[X], can be defined coefficient by coefficient as:
c_k = sum of a_i * b_j mod q, taken over all index pairs with i + j = k, for k = 0, 1, ..., 2*n-2
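If it helps to see this definition as code, here is a minimal sketch (the function name and the explicit q parameter are mine, not part of Kyber):
def linear_convolution(a, b, q):
    # every a_i meets every b_j and contributes to the coefficient of X^(i+j)
    n = len(a)
    c = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            c[i + j] = (c[i + j] + a[i] * b[j]) % q
    return c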
Circular Convolution
We get the Circular Convolution if we reduce the result of the Linear Convolution modulo X^n - 1. If we apply the congruences listed above to the example, the result is:
a * b =
= a_0 * b_0 + (a_0 * b_1 + a_1 * b_0) * X +
+ (a_0 * b_2 + a_1 * b_1 + a_2 * b_0) * X^2 +
+ (a_1 * b_2 + a_2 * b_1) * X^3 + a_2 * b_2 * X^4
a * b mod (X^3 - 1) =
= a_0 * b_0 + (a_0 * b_1 + a_1 * b_0) * X +
+ (a_0 * b_2 + a_1 * b_1 + a_2 * b_0) * X^2 +
+ (a_1 * b_2 + a_2 * b_1) + a_2 * b_2 * X =
= a_0 * b_0 + a_1 * b_2 + a_2 * b_1 +
+ (a_0 * b_1 + a_1 * b_0 + a_2 * b_2) * X +
+ (a_0 * b_2 + a_1 * b_1 + a_2 * b_0) * X^2
More formally, c = a * b, where c, a, b belong to Z_q[X]/(X^n - 1), can be defined coefficient by coefficient as:
c_k = sum of a_i * b_j mod q, taken over all index pairs with i + j = k mod n, for k = 0, 1, ..., n-1
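The corresponding sketch (again with names of my own choosing) only changes the index arithmetic:
def circular_convolution(a, b, q):
    # the exponent i + j wraps around modulo n because X^n = 1
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[(i + j) % n] = (c[(i + j) % n] + a[i] * b[j]) % q
    return c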
Negative Wrapped Convolution
We get the Negative Wrapped Convolution if we reduce the result of the Linear Convolution modulo X^n + 1. If we apply the congruences listed above to the example, the result is:
a * b =
= a_0 * b_0 + (a_0 * b_1 + a_1 * b_0) * X +
+ (a_0 * b_2 + a_1 * b_1 + a_2 * b_0) * X^2 +
+ (a_1 * b_2 + a_2 * b_1) * X^3 + a_2 * b_2 * X^4
a * b mod (X^3 + 1) =
= a_0 * b_0 + (a_0 * b_1 + a_1 * b_0) * X +
+ (a_0 * b_2 + a_1 * b_1 + a_2 * b_0) * X^2 +
+ (a_1 * b_2 + a_2 * b_1) * (-1) + a_2 * b_2 * (-X) =
= a_0 * b_0 - a_1 * b_2 - a_2 * b_1 +
+ (a_0 * b_1 + a_1 * b_0 - a_2 * b_2) * X +
+ (a_0 * b_2 + a_1 * b_1 + a_2 * b_0) * X^2
More formally, c = a * b, where c, a, b belong to Z_q[X]/(X^n + 1), can be defined coefficient by coefficient as:
c_k = (sum of a_i * b_j over i + j = k) - (sum of a_i * b_j over i + j = k + n) mod q, for k = 0, 1, ..., n-1
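The sketch for this case differs only in the sign of the wrapped terms:
def negative_wrapped_convolution(a, b, q):
    # wrapped terms (i + j >= n) change sign because X^n = -1
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            sign = 1 if i + j < n else -1
            c[(i + j) % n] = (c[(i + j) % n] + sign * a[i] * b[j]) % q
    return c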
What we need in Kyber is efficient Negative Wrapped Convolution calculation! Why? Because Kyber polynomials belong to Z_q[X]/(X^n+1)! Whenever we multiply two polynomials a * b in the text below, it is Negative Wrapped Convolution!
Primitive Roots of the Ring Z_q
A primitive n-th root of the ring Z_q is a number w_n that belongs to Z_q such that w_n^n = 1 mod q, and the powers w_n^k mod q for k = 0, 1, 2, ..., n-1 are all different elements of Z_q. It also holds that w_n^(n/2) = -1 mod q (and -1 is the same as q - 1).
For example, in Kyber, for q = 3329, 289 is a primitive 128-th root and 17 is a primitive 256-th root. A 512-th root does not exist, because 512 does not divide q - 1 = 3328.
289^0 = 1 mod 3329
289^1 = 289 mod 3329
289^2 = 296 mod 3329
289^3 = 2319 mod 3329
. . .
289^64 = 3328 mod 3329 # -1 mod 3329
. . .
289^127 = 2419 mod 3329
289^128 = 1 mod 3329
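These properties are easy to check in Python (a quick verification, nothing more):
q = 3329
# 289 generates 128 distinct powers, 289^64 = -1 and 289^128 = 1, so it is a primitive 128-th root
assert len({pow(289, k, q) for k in range(128)}) == 128
assert pow(289, 64, q) == q - 1 and pow(289, 128, q) == 1
# 17 is a primitive 256-th root and 289 = 17^2
assert pow(17, 128, q) == q - 1 and pow(17, 256, q) == 1 and pow(17, 2, q) == 289
# a 512-th root cannot exist because 512 does not divide q - 1 = 3328
assert (q - 1) % 512 != 0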
Number Theoretic Transform (NTT)
The Number Theoretic Transform of a polynomial a of length n with coefficients in Z_q is another polynomial a' of length n, i.e. a' = NTT(a), whose coefficients a'_k for k = 0, 1, 2, ..., n-1 are defined as:
a'_k = a_0 + a_1 * w_n^k + a_2 * w_n^(2*k) + . . . + a_(n-1) * w_n^((n-1)*k) mod q
The Inverse Number Theoretic Transform of a polynomial a' of length n with coefficients in Z_q is the original polynomial a of length n, i.e. a = INTT(a'), whose coefficients a_k for k = 0, 1, 2, ..., n-1 are defined as:
a_k = n^(-1) * (a'_0 + a'_1 * w_n^(-k) + a'_2 * w_n^(-2*k) + . . . + a'_(n-1) * w_n^(-(n-1)*k)) mod q
Obviously it holds that a = INTT(NTT(a)).
Ok, we have defined what NTT is. What can we do with it?
Let's have two polynomials a and b, and their corresponding NTT transformations a' and b':
a' = NTT(a)
b' = NTT(b)
Let o denote point wise polynomial multiplication:
a' o b' = [a'_0 * b'_0, a'_1 * b'_1, . . ., a'_(n-1) * b'_(n-1)]
So, what do we get with:
INTT(a' o b') = INTT(NTT(a) o NTT(b))
The result is Circular Convolution of the polynomials a and b.
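Here is a small, deliberately naive check of that claim: an O(n^2) NTT taken straight from the coefficient definition, with q = 3329, n = 128 and w_n = 289 as above (helper names are mine):
import random

q, n, w = 3329, 128, 289
w_inv, n_inv = pow(w, -1, q), pow(n, -1, q)

def naive_ntt(a, root):
    # direct evaluation of the transform definition, O(n^2)
    return [sum(a[j] * pow(root, j * k, q) for j in range(n)) % q for k in range(n)]

a = [random.randrange(q) for _ in range(n)]
b = [random.randrange(q) for _ in range(n)]
A, B = naive_ntt(a, w), naive_ntt(b, w)
C = [x * y % q for x, y in zip(A, B)]               # point wise multiplication
c = [x * n_inv % q for x in naive_ntt(C, w_inv)]    # INTT: same transform with w^-1, scaled by n^-1

ref = [0] * n                                       # circular convolution from the definition
for i in range(n):
    for j in range(n):
        ref[(i + j) % n] = (ref[(i + j) % n] + a[i] * b[j]) % q

assert c == ref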
How can we calculate Linear Convolution using NTT?
It's easy too! In order to calculate the Linear Convolution of the polynomials a and b we shall:
1. pad a and b with n zeros each, so that both have length 2*n,
2. calculate the 2*n point NTT of both padded polynomials (using a 2*n-th primitive root w_2n),
3. multiply the transformed polynomials point wise,
4. calculate the 2*n point INTT of the product.
The result of length 2*n contains the Linear Convolution of a and b (its last coefficient is zero), as the sketch below shows.
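Here is that sketch, reusing the same naive O(n^2) transform (with q = 3329 and n = 128 we have 2*n = 256, and 17 is a 2*n-th root of unity):
import random

q, n = 3329, 128
N = 2 * n                                 # transform length after zero padding
w = 17                                    # primitive 256-th root of unity mod 3329
w_inv, N_inv = pow(w, -1, q), pow(N, -1, q)

def naive_ntt(a, root):
    return [sum(a[j] * pow(root, j * k, q) for j in range(N)) % q for k in range(N)]

a = [random.randrange(q) for _ in range(n)]
b = [random.randrange(q) for _ in range(n)]
A = naive_ntt(a + [0] * n, w)             # zero pad to length 2*n before transforming
B = naive_ntt(b + [0] * n, w)
C = [x * y % q for x, y in zip(A, B)]
c = [x * N_inv % q for x in naive_ntt(C, w_inv)]

ref = [0] * N                             # linear convolution from the definition (last slot stays 0)
for i in range(n):
    for j in range(n):
        ref[i + j] = (ref[i + j] + a[i] * b[j]) % q

assert c == ref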
How can we calculate Negative Wrapped Convolution using NTT?
The straightforward approach is to calculate the Linear Convolution and reduce the result modulo X^n + 1. We can perform this reduction by applying the congruences, as demonstrated in the convolution examples above.
However, padding the polynomials to length 2*n and calculating a 2*n point NTT is not the most efficient way of calculating the Negative Wrapped Convolution. It has been proven that we can calculate this convolution with an n point NTT only. Here is how we can do it.
Define polynomials w and w_inv as:
w = [1, w_2n, w_2n^2, w_2n^3, . . . w_2n^(n-1)]
w_inv = [1, w_2n^(-1), w_2n^(-2), w_2n^(-3), . . . w_2n^(-n+1)]
Calculate polynomials a'' and b'' as:
a'' = w o a
b'' = w o b
Finally calculate:
w_inv o INTT(NTT(a'') o NTT(b'')) =
= w_inv o INTT(NTT(w o a) o NTT(w o b))
The coefficients of w and w_inv can be incorporated directly into NTT and INTT simply as:
a'_k = a_0 * w_2n^0 * w_n^0 + a_1 * w_2n^1 * w_n^k + . . . + a_(n-1) * w_2n^(n-1) * w_n^((n-1)*k) mod q
a_j = n^(-1) * w_2n^(-j) * (a'_0 + a'_1 * w_n^(-j) + . . . + a'_(n-1) * w_n^(-(n-1)*j)) mod q
In addition, if we take into account that w_n = w_2n^2, the equations can rely on w_2n only, e.g. the NTT coefficients become a'_k = a_0 + a_1 * w_2n^(2*k+1) + . . . + a_(n-1) * w_2n^((n-1)*(2*k+1)) mod q.
Up to this moment, we have the equations that define how to calculate Negative Wrapped Convolution using NTT.
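Here is a sketch of those equations in action, still with the naive O(n^2) transform (q = 3329, n = 128, w_2n = 17 and w_n = w_2n^2 = 289; variable names are my own):
import random

q, n = 3329, 128
psi = 17                                   # w_2n
w = psi * psi % q                          # w_n = w_2n^2 = 289
psi_inv, w_inv, n_inv = pow(psi, -1, q), pow(w, -1, q), pow(n, -1, q)

def naive_ntt(a, root):
    return [sum(a[j] * pow(root, j * k, q) for j in range(n)) % q for k in range(n)]

a = [random.randrange(q) for _ in range(n)]
b = [random.randrange(q) for _ in range(n)]

aw = [a[i] * pow(psi, i, q) % q for i in range(n)]      # a'' = w o a
bw = [b[i] * pow(psi, i, q) % q for i in range(n)]      # b'' = w o b
C = [x * y % q for x, y in zip(naive_ntt(aw, w), naive_ntt(bw, w))]
cw = [x * n_inv % q for x in naive_ntt(C, w_inv)]
c = [cw[i] * pow(psi_inv, i, q) % q for i in range(n)]  # w_inv o INTT(...)

ref = [0] * n                              # negative wrapped convolution from the definition
for i in range(n):
    for j in range(n):
        s = 1 if i + j < n else -1
        ref[(i + j) % n] = (ref[(i + j) % n] + s * a[i] * b[j]) % q

assert c == ref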
But...
If we calculate NTT using its coefficient definition, we still have n * n multiplications, i.e. O(n^2) complexity. There is no gain at all!
Before we continue, let's stress that the Number Theoretic Transform is conceptually the same as the Discrete Fourier Transform (DFT). In the DFT, instead of w_n we work with a primitive root of unity on the complex unit circle, i.e. exp(2*Pi*i/n). Everything else is the same. The DFT can be calculated with the well known Fast Fourier Transform (FFT), most famously with the Cooley-Tukey FFT algorithm. It is an iterative algorithm that splits the coefficients into odd and even ones, calculates two n/2 point DFTs (one for the odd coefficients and one for the even coefficients) and finally combines the results. It is a divide and conquer strategy whose combine step is known as the butterfly pattern.
We are going to present the Cooley-Tukey NTT and Gentleman-Sande INTT algorithms. Before presenting them, let us define what bit reversed order means in the algorithm specifications. For example, for 3-bit indices (n = 8):
0 = 000 => 000 = 0
1 = 001 => 100 = 4
2 = 010 => 010 = 2
3 = 011 => 110 = 6
4 = 100 => 001 = 1
5 = 101 => 101 = 5
6 = 110 => 011 = 3
7 = 111 => 111 = 7
If we have polynomial a (or vector, list, array) with coefficients in standard (natural) order:
[a_0, a_1, a_2, a_3, a_4, a_5, a_6, a_7]
the bit reversed order is:
[a_0, a_4, a_2, a_6, a_1, a_5, a_3, a_7]
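A tiny helper that reproduces the table above (3 bits, so n = 8):
def reverse_bits(x, width):
    # e.g. 3 = 011 -> 110 = 6 for width = 3
    return int(format(x, "0{}b".format(width))[::-1], 2)

print([reverse_bits(x, 3) for x in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]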
Cooley-Tukey NTT Algorithm
INPUT:
A vector a = [a_0, ..., a_n-1] of length n (a power of 2)
in standard order, where a_i belongs to [0, q - 1] and q = 1 mod 2n
INPUT:
Lookup array 'psis' of the powers of the 2n-th root (w_2n), in bit reversed order
OUTPUT: a <- NTT(a) in bit reversed order
====================
def ntt(a):
t = n
m = 1
while m < n:
t = t // 2
for i in range(0, m):
j1 = 2 * i * t
j2 = j1 + t - 1
S = psis[m + i]
for j in range(j1, j2 + 1):
U = a[j]
V = a[j + t] * S
a[j] = (U + V) % q
a[j + t] = (U - V) % q
m = 2 * m
Gentleman-Sande INTT Algorithm
INPUT:
A vector a = [a_0, ..., a_n-1] of length n (a power of 2)
in bit reversed order, where a_i belongs to [0, q - 1] and q = 1 mod 2n
INPUT:
Lookup array 'inv_psis' of the inverses of the powers
of the 2n-th root (w_2n), in bit reversed order
INPUT: n_inv which is n^(-1) mod q
OUTPUT: a <- INTT(a) in standard order
====================
def inv_ntt(a):
t = 1
m = n
while m > 1:
j1 = 0
h = m // 2
for i in range(0, h):
j2 = j1 + t - 1
S = inv_psis[h + i]
for j in range(j1, j2 + 1):
U = a[j]
V = a[j + t]
a[j] = (U + V) % q
a[j + t] = (U - V) * S % q
j1 = j1 + 2 * t
t = 2 * t
m = m // 2
for i in range(0, n):
a[i] = a[i] * n_inv % q
The complexity of these algorithms is O(n*log(n)).
Here is a simple Python script that demonstrates the technique. It generates the lookup arrays and their inverses, performs the NTT based polynomial multiplication and compares the result with standard Numpy based polynomial multiplication in Z_q[X]/(X^n + 1). Note that the Numpy methods used expect the arrays in reversed order, i.e. the largest degree coefficient comes first.
params.py
q = 3329
n = 128
ntt.py
import numpy as np
from params import q, n
n_inv = pow(n, -1, q)
def reverse_bits(n):
b = "{:0{width}b}".format(n, width=7)
return int(b[::-1], 2)
positions = [reverse_bits(x) for x in range(0, n)]
psi = 1
root = 17
tmp = []
psis = []
inv_psis = []
for x in range(0, n):
tmp.append(psi)
psi = psi * root % q
for x in range(0, n):
val = tmp[positions[x]]
inv_val = pow(val, -1, q)
psis.append(val)
inv_psis.append(inv_val)
# Cooley-Tukey (CT)
def ntt(a):
t = n
m = 1
while m < n:
t = t // 2
for i in range(0, m):
j1 = 2 * i * t
j2 = j1 + t - 1
S = psis[m + i]
for j in range(j1, j2 + 1):
U = a[j]
V = a[j + t] * S
a[j] = (U + V) % q
a[j + t] = (U - V) % q
m = 2 * m
# Gentleman-Sande (GS)
def inv_ntt(a):
t = 1
m = n
while m > 1:
j1 = 0
h = m // 2
for i in range(0, h):
j2 = j1 + t - 1
S = inv_psis[h + i]
for j in range(j1, j2 + 1):
U = a[j]
V = a[j + t]
a[j] = (U + V) % q
a[j + t] = (U - V) * S % q
j1 = j1 + 2 * t
t = 2 * t
m = m // 2
for i in range(0, n):
a[i] = a[i] * n_inv % q
f = np.zeros(n + 1)
f[0] = 1
f[n] = 1
a = np.random.randint(0, q, n)
b = np.random.randint(0, q, n)
p = np.remainder(np.polydiv(np.polymul(a[::-1], b[::-1]), f)[1], q).astype(int)[::-1]
print(p)
ntt(a)
ntt(b)
c = np.multiply(a, b)
inv_ntt(c)
print(c)
print(np.array_equal(c, p))
So far so good, but we can do better :)
Consider the following modulo operations in NTT and INTT:
a[j] = (U + V) % q
a[j + t] = (U - V) * S % q
Addition, multiplication, shifts and other bitwise operations are relatively cheap processor operations. However, division and modulo operations are expensive.
What if we could calculate the modulo without division, using only cheap processor operations? The answer is the Barrett Reduction.
A simple Python implementation based on the article above could be:
kbr = 26
m = 2**kbr // q            # precomputed floor(2^kbr / q)
def barrett_reduce(x):
    x1 = x * m >> kbr      # approximate quotient x // q
    x -= x1 * q            # subtract the approximate multiple of q
    if x > q:
        return x - q
    else:
        return x
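A quick check of the inputs this reduction actually has to handle in the butterflies, i.e. sums and differences of values that are already smaller than q (assuming the barrett_reduce above and q = 3329). Note that it returns a representative of x mod q in the range [0, q], not always the canonical one:
# the result is congruent to x mod q and stays within [0, q] for all butterfly inputs
assert all(barrett_reduce(x) % q == x % q for x in range(-q + 1, 2 * q))
assert all(0 <= barrett_reduce(x) <= q for x in range(-q + 1, 2 * q))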
The Barrett Reduction will help us calculate the modulo operation when adding or subtracting numbers. For multiplication we need yet another strategy: the Montgomery Reduction. The idea of the Montgomery Space is simple:
a' = a * r mod q
b' = b * r mod q
(a' + b') mod q = (a + b) * r mod q
=> (a + b) mod q = (a' + b') * r_inv mod q
In other words, adding and subtracting numbers in the Montgomery Space provides results that can be converted back into the ordinary space by r_inv multiplication.
a' = a * r mod q
b' = b * r mod q
(a' * b') mod q = a * r * b * r mod q = (a * b * r) * r mod q
The multiplication result carries one additional factor r that we don't need, so we have to reduce it back to a single r. If we denote x = a * b * r * r, then we have to calculate x * r_inv mod q after each multiplication in the Montgomery Space.
A simple Python implementation could be:
e = 16
r = 1 << e                 # r = 2^16 > q
r_inv = pow(r, -1, q)      # r^(-1) mod q
k = (r * r_inv - 1) // q   # k = -q^(-1) mod r
def montgomery_reduce(x):
    s = x * k & (r - 1)    # s = x * k mod r
    t = x + s * q          # t is divisible by r
    u = t >> e             # u = t / r, congruent to x * r^(-1) mod q
    return u if u < q else u - q
Note that the Montgomery reduction relies only on multiplication, addition, bit masking and bit shifting, all of which are processor-efficient operations.
Converting this reduced value back into the ordinary space yields:
x * r_inv * r_inv mod q = a * r * b * r * r_inv * r_inv mod q = a * b mod q
i.e. it is the desired number product.
The whole idea of the Montgomery Space is to move all numbers, coefficients and polynomials into this space, add and multiply them there as many times as the algorithm requires (these operations never involve the expensive division and modulo operations) and finally convert the result back into the ordinary space.
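Putting it together for a single product (a toy round trip, assuming q = 3329 and the r and montgomery_reduce defined above):
a, b = 1234, 2500
a_m = a * r % q                       # move a and b into the Montgomery Space
b_m = b * r % q
c_m = montgomery_reduce(a_m * b_m)    # multiply inside the space: the result is a * b * r mod q
c = montgomery_reduce(c_m)            # leave the space: one more r_inv factor removes the remaining r
assert c == a * b % q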
The modified Python NTT code with Montgomery Space operations is given below.
reduce.py
from params import q, n
e = 16
r = 1 << e
rr = r * r % q
mont = r % q
r_inv = pow(r, -1, q)
k = (r * r_inv - 1) // q
kbr = 26
m = 2**kbr // q
def barrett_reduce(x):
x1 = x * m >> kbr
x -= x1 * q
if x > q:
return x - q
else:
return x
def montgomery_reduce(x):
s = x * k & (r - 1)
t = x + s * q
u = t >> e
return u if u < q else u - q
def fqmul(a, b):
return montgomery_reduce(a * b)
ntt_mr.py
import numpy as np
from params import q, n
from reduce import barrett_reduce, fqmul, mont, r, rr, r_inv
n_inv = pow(n, -1, q) * r % q
def reverse_bits(n):
b = "{:0{width}b}".format(n, width=7)
return int(b[::-1], 2)
positions = [reverse_bits(x) for x in range(0, n)]
psi = 1
root = 17
tmp = []
psis = []
inv_psis = []
for x in range(0, n):
tmp.append(psi)
psi = psi * root % q
for x in range(0, n):
val = tmp[positions[x]]
inv_val = pow(val, -1, q)
psis.append(val * r % q)
inv_psis.append(inv_val * r % q)
# Cooley-Tukey (CT) no (natural order) -> br (bit reversed order)
def ntt(a):
t = n
m = 1
while m < n:
t = t // 2
for i in range(0, m):
j1 = 2 * i * t
j2 = j1 + t - 1
S = psis[m + i]
for j in range(j1, j2 + 1):
U = a[j]
V = fqmul(a[j + t], S)
a[j] = barrett_reduce(U + V)
a[j + t] = barrett_reduce(U - V)
m = 2 * m
# Gentleman-Sande (GS) br (bit reversed order) -> no (natural order)
def inv_ntt(a):
t = 1
m = n
while m > 1:
j1 = 0
h = m // 2
for i in range(0, h):
j2 = j1 + t - 1
S = inv_psis[h + i]
for j in range(j1, j2 + 1):
U = a[j]
V = a[j + t]
a[j] = barrett_reduce(U + V)
a[j + t] = fqmul(S, U - V)
j1 = j1 + 2 * t
t = 2 * t
m = m // 2
for i in range(0, n):
a[i] = fqmul(a[i], n_inv)
f = np.zeros(n + 1)
f[0] = 1
f[n] = 1
a = np.random.randint(0, q, n)
b = np.random.randint(0, q, n)
p = np.remainder(np.polydiv(np.polymul(a[::-1], b[::-1]), f)[1], q).astype(int)[::-1]
print(p)
a = np.array([montgomery_reduce(x * rr) for x in a])
b = np.array([montgomery_reduce(x * rr) for x in b])
ntt(a)
ntt(b)
c = np.array([fqmul(x, y) for (x, y) in zip(a, b)])
inv_ntt(c)
c = np.array([montgomery_reduce(x) for x in c])
print(c)
print(np.array_equal(c, p))
The comparison between results yields True.
What if we don't convert polynomials a and b into Montgomery Space, use them directly and spare some multiplications?
NTT(a) = NTT((a * r_inv) * r)
NTT(b) = NTT((b * r_inv) * r)
Note: multiplying vector with scalar means multiplying each vector's coefficient with the scalar.
After the point wise multiplication of these two NTT representations along with the Montgomery reductions within, we get:
NTT(a * b * r_inv * r_inv * r mod q)
After INTT we get:
c = a * b * r_inv * r_inv * r mod q
If we multiply the final result with r we get:
c * r mod q = a * b * r_inv * r_inv * r * r mod q = a * b mod q
We have the desired result, but instead of multiplying three polynomials by r or r_inv (the two input polynomials and the output polynomial), we have multiplied only the coefficients of the final result, i.e. the output polynomial.
# No polynomial transformation before multiplication
ntt(a)
ntt(b)
c = np.array([fqmul(x, y) for (x, y) in zip(a, b)])
inv_ntt(c)
c = np.array([montgomery_reduce(x * rr) for x in c])
But we don't even need that single multiplication. We can embed the multiplication by r within the INTT itself. The INTT ends with the following step that applies to all output coefficients:
for i in range(0, n):
    a[i] = fqmul(a[i], n_inv)
If we multiply n_inv by an additional r in its definition, the final multiplication steps are 'by the book':
n_inv = pow(n, -1, q) * r * r % q
. . .
ntt(a)
ntt(b)
c = np.array([fqmul(x, y) for (x, y) in zip(a, b)])
inv_ntt(c)
print(c)
This is exactly the "trick" used in the Kyber reference implementation.
The polynomial length in the examples above is 128. In Kyber the length is 256. In order to use the above NTT based method for polynomial multiplication we would need a 512-th root, but such a root does not exist. The aforementioned NIST reference implementation of Kyber demonstrates how to efficiently multiply polynomials with a kind of modified NTT transformation that relies on the 256-th root (17) and 128 of its powers. It still uses similar butterfly pattern based algorithms, but executes all of the rounds except the last one, for which the required roots do not exist. The transformation produces a 256 long vector, but the coefficients shall be treated as pairs:
a = [(a_0,a_1),(a_2,a_3), ... ..., (a_252, a_253), (a_254, a_255)]
b = [(b_0,b_1),(b_2,b_3), ... ..., (b_252, b_253), (b_254, b_255)]
Instead of point wise multiplication of these two vectors, the implementation performs 128 pair wise multiplications:
c = [(a_0,a_1)x(b_0,b_1), (a_2,a_3)x(b_2,b_3) ... ]
where:
(a_0,a_1) x (b_0,b_1) =
(a_0 + a_1*X)(b_0 + b_1 * X) mod (X^2 - psi) =
= (a_1*b_1*X^2 + (a_0*b_1 + a_1*b_0)*X + a_0*b_0) mod (X^2 - psi) =
= a_0*b_0 + psi*a_1*b_1 + (a_0*b_1 + a_1*b_0)*X =
= (a_0*b_0 + psi*a_1*b_1, a_0*b_1 + a_1*b_0)
The code below is a Python adaptation of this algorithm as demonstrated in the NIST reference implementation. The vectors are of length 256.
ntt_mr_256.py
import numpy as np
from params import q, n
from reduce import barrett_reduce, montgomery_reduce, fqmul, mont, r, rr, r_inv
# np.random.seed(0)
# NTT starts here
def reverse_bits(n):
b = "{:0{width}b}".format(n, width=7)
return int(b[::-1], 2)
positions = [reverse_bits(x) for x in range(0, n)]
psi = mont
root = 17
tmp = []
psis = []
inv_psis = []
for x in range(0, n):
tmp.append(psi)
psi = fqmul(psi, mont * root % q)
for x in range(0, n):
val = tmp[positions[x]]
psis.append(val)
i = 64
while i >= 1:
for j in range(i, 2 * i):
inv_psis.append(-tmp[128 - positions[j]] % q)
i >>= 1
inv_psis.append(mont * (mont * (q - 1) * ((q - 1) // 128) % q) % q)
def basemul(a, b, psi):
r0 = fqmul(a[1], b[1])
r0 = fqmul(r0, psi)
r0 += fqmul(a[0], b[0])
r1 = fqmul(a[0], b[1])
r1 += fqmul(a[1], b[0])
return (r0, r1)
def ntt(r):
k = 1
len = 128
while len >= 2:
start = 0
while start < 256:
j = start
psi = psis[k]
k += 1
while j < start + len:
t = fqmul(psi, r[j + len])
r[j + len] = barrett_reduce(r[j] - t)
r[j] = barrett_reduce(r[j] + t)
j += 1
start = j + len
len >>= 1
def inv_ntt(r):
k = 0
len = 2
while len <= 128:
start = 0
while start < 256:
psi = inv_psis[k]
k += 1
j = start
while j < start + len:
t = r[j]
r[j] = barrett_reduce(t + r[j + len])
r[j + len] = t - r[j + len]
r[j + len] = fqmul(psi, r[j + len])
j += 1
start = j + len
len <<= 1
for j in range(0, 256):
r[j] = fqmul(r[j], inv_psis[127])
f = np.zeros(2 * n + 1)
f[0] = 1
f[2 * n] = 1
a = np.random.randint(0, q, 2 * n)
b = np.random.randint(0, q, 2 * n)
p = np.remainder(np.polydiv(np.polymul(a[::-1], b[::-1]), f)[1], q).astype(int)[::-1]
print("p:", p)
ntt(a)
ntt(b)
c = np.zeros(2 * n).astype(int)
for i in range(0, n // 2):
r0, r1 = basemul(a[4 * i : 4 * i + 2], b[4 * i : 4 * i + 2], psis[64 + i])
c[4 * i] = r0
c[4 * i + 1] = r1
r0, r1 = basemul(a[4 * i + 2 : 4 * i + 4], b[4 * i + 2 : 4 * i + 4], -psis[64 + i])
c[4 * i + 2] = r0
c[4 * i + 3] = r1
inv_ntt(c)
print("c:", c)
print(np.array_equal(p, c))
Optimisation is a very important concept in Cryptography. The algorithms must be fast and must run on various devices, some of them not that powerful. As I said before, understanding NTT is a nice skill to have in the post-quantum era. The algorithm is ubiquitous and I believe more and more emerging cryptography and blockchain features will rely on NTT.
Marjan