A-16.2 Forward FFT (or NTT)

We assume a polynomial ring of [X](Xn + 1) for FFT, and p[X](Xn + 1) for NTT (where Xn + 1 is a cyclotomic polynomial). The x coordinates to evaluate the target polynomial are the n distinct roots of Xn + 1, which are ω1, ω3,,ω2n1, where ω is the primitive 2n-th root of unity. Then, the point-value representation of the polynomial A(X) is ((x0,y0a),(x1,y1a),(xn1,yn1a)), where:

yia = A(ω2i+1) = j=0n1aj (ω2i+1)j = j=0n1aj ω(2i+1)j.

We call the vector ya = (y0a,y1a,,yn1a) the Discrete Fourier Transform (DFT) of the coefficient vector a = (a0,a1,,an1). We write this as ya = 𝖣𝖥𝖳(a). As explained in §A-16.1, the computation of the DFT takes O(n2), because we have to evaluate n distinct X values for a polynomial that has n terms.

A-16.2.1 High-level Idea

FFT (or NTT) is an improved method for computing the DFT, which reduces the time complexity from O(n2) to O(nlogn). The high-level idea of FFT is to split the (n 1)-degree (or lesser degree) target polynomial A(X) to evaluate into 2 half-degree polynomials A0(X) and A1(X) as follows:

A(X) = a0 + a1X + a2X2 + + an1Xn1

A(X) = A0(X2) + X A1(X2) A0(X) = a0 + a2X + a4X2 + + an2Xn 2 1

A1(X) = a1 + a3X + a5X2 + + an1Xn 2 1

The above method of splitting a polynomial into two half-degree polynomials is called the Cooley-Tukey step. As we split A(X) into two smaller-degree polynomials A0(X) and A1(X), evaluating A(X) at the odd-powered primitive 2n-th roots of unity {ω1,ω3,ω5,,ω2n1}is equivalent to evaluating A0(X) and A1(X) at n distinct squared n-th roots of unity {(ω2)1,(ω2)3,(ω2)5,,(ω2)2n1} and computing A0(X2) + X A1(X2). However, remember that the primitive 2n-th root of unity ω has order 2n (i.e., ω2n = 1 and ωm1 for all m < 2n). Therefore, the second half of {(ω2)1,(ω2)3,(ω2)5,,(ω2)2n1} is a repetition of the first half. This implies that we only need to evaluate A0(X) and A1(X) at n 2 distinct x coordinates each, instead of n distinct coordinates, because the polynomial evaluation results for the other half are the same as those of the first half (as their input x to the polynomial is the same).

We recursively split A0(X) and A1(X) into half-degree polynomials and evaluate them at half-counted (i.e., n2) n-th roots of unity. Then, the total number of rounds of splitting is logn, and the maximum number of root-to-coefficient multiplications in each round is n, which aggregates to O(nlogn).

A-16.2.2 Details

Suppose we have a polynomial ring that is either p[X](X8 + 1) (i.e., over a finite field with prime p) or [X](X8 + 1) (over complex numbers). We denote by ω a primitive (2n = 16)-th root of unity, and the 8 distinct roots of X8 + 1 are: {ω1,ω3,ω5,ω7,ω9,ω11,ω13,ω15}.

Now, we define our target polynomial to evaluate as follows:

A(X) = a0 + a1X + a2X2 + a3X3 + a4X4 + a5X5 + a6X6 + a7X7

We split this 7-degree polynomial into the following two 3-degree polynomials (using the Cooley-Tukey step):

A0(X) = a0 + a2X + a4X2 + a6X3

A1(X) = a1 + a3X + a5X2 + a7X3

A(X) = A0(X2) + X A1(X2)

We recursively split the two 3-degree polynomials above into 1-degree polynomials as follows:

A0,0(X) = a0 + a4X, ... A0,1(X) = a2 + a6X

A0(X) = A0,0(X2) + X A0,1(X2)

A1,0(X) = a1 + a5X, ... A1,1(X) = a3 + a7X

A1(X) = A1,0(X2) + X A1,1(X2)

A(X) = A0(X2) + X A1(X2)

A(X) = (A0,0(X4) FFT Level 1 + X2 A 0,1(X4) FFT Level 1)FFT Level 2 + X (A1,0(X4) FFT Level 1 + X2 A 1,1(X4) FFT Level 1)FFT Level 2FFT Level 3

To evaluate A(X) at the n distinct roots X = {ω1,ω3,,ω15}, we evaluate each FFT level of the above formula at X = {ω1,ω3,,ω15}, starting from level 1 l 3.

FFT Level 𝐥 = 1: We evaluate A0,0(X4), A0,1(X4), A1,0(X4), and A1,1(X4) at X = {ω1,ω3,,ω15}. However, notice that plugging in X = {ω1,ω3,,ω15} to X4 results in only 2 distinct values: ω0 and ω8. This is because the order of ω is 2n (i.e., ω2n = 1), and thus (ω1)4 = (ω5)4 = (ω9)4 = (ω13)4, and (ω3)4 = (ω7)4 = (ω11)4 = (ω15)4. Therefore, we only need to evaluate A0,0(X4), A0,1(X4), A1,0(X4), and A1,1(X4) at 2 distinct x values instead of 8, where each evaluation requires a constant number of arithmetic operations: computing 1 multiplication and 1 addition. As there are a total of 4 polynomials to evaluate (i.e., A0,0(X4),A0,1(X4),A1,0(X4),A1,1(X4)), we compute the FFT a total of 4 2 = 8 times.

FFT Level 𝐥 = 2: Based on the evaluation results from FFT Level 1 as building blocks, we evaluate A0(X2) and A1(X2) at X = {ω1,ω3,,ω15}. However, notice that plugging in X = {ω1,ω3,,ω15} to X2 results in only 4 distinct values: ω1, ω5, ω9, and ω13. This is because the order of ω is 2n (i.e., ω2n = 1), and thus (ω1)2 = (ω9)2, (ω3)2 = (ω11)2, (ω5)2 = (ω13)2, and (ω7)2 = (ω15)2. Therefore, we only need to evaluate A0(X2) and A1(X2) at 4 distinct x values instead of 8, where each evaluation requires a constant number of arithmetic operations: computing 1 multiplication and 1 addition (where we use the results from FFT Level 1 as building blocks, and the computational structure of FFT Level 2 is the same as that of FFT Level 1). There are a total of 2 polynomials to evaluate (i.e., A0(X2),A1(X2)); thus, we compute the FFT a total of 2 4 = 8 times.

FFT Level 𝐥 = 3: Based on the evaluation results from FFT Level 2 as building blocks, we evaluate A(X) at X = {ω1,ω3,,ω15}. For this last level of computation, we need to evaluate all 8 distinct X values, since they are all unique values, and each evaluation requires a constant number of arithmetic operations: computing 1 multiplication and 1 addition. There is a total of 1 polynomial to evaluate (i.e., A(X)); thus, we compute the FFT a total of 1 8 = 8 times.

Generalization: Suppose that the degree of the target polynomial to evaluate is at most n 1 (with n terms), and we define L = logn (i.e., the total number of FFT levels). Then, the forward FFT operation requires a total of L FFT levels, where each l-th level requires the evaluation of 2Ll polynomials at 2l distinct X values. Therefore, the total number of FFT computations for the forward FFT is: l=1L(2Ll 2l) = L 2L = nlogn . Therefore, the time complexity of the forward FFT is O(nlogn).

Using the FFT technique, we reduce the number of x points to evaluate to half as the level decreases (while the number of polynomials to evaluate doubles), and their growth and reduction cancel each other, resulting in O(n) for each level. Since there are logn such levels, the total time complexity is O(nlogn). The core enabler of this optimization is the special property of the x evaluation coordinates: its power (i.e., ωi) is cyclic. To enforce this cyclic property, FFT requires the evaluation points of x to be the odd-powered primitive 2n-th roots of unity.