Virtual Components for the Converging World

Amphion continues to expand its family of application-specific cores

See http://www.amphion.com for a current list of products

CS2412

1024-Point Pipelined FFT/IFFT

Preliminary Datasheet

The CS2412 is an on-line programmable, pipelined 1024-point FFT/IFFT core. It can process continuous
data streams at a throughput of up to 50 Msamples/s. This highly integrated application-specific silicon
core is the pipelined version of the CS2411 and is available in both ASIC and FPGA versions that have been
handcrafted by Amphion for maximum performance while minimizing power consumption and silicon area.

Figure 1: CS2412 Architecture

[Block diagram: a (768+1024) 32-bit word input buffer feeds five radix-4 stages in series. The first four stages each perform a radix-4 butterfly and twiddle multiplication; the fifth performs a radix-4 butterfly only. Buffers of (192+256), (48+64), (12+16) and (3+4) 32-bit words sit between successive stages, a 1024 32-bit word re-order buffer follows the last stage, and a control logic block supervises the pipeline.]

FEATURES

On-line programmable FFT/IFFT core
Pipelined architecture
16-bit complex input/output in two's complement format (32-bit complex word)
16-bit twiddle factors generated inside the core
18-bit internal accuracy
Programmable shift down control
Radix-4 architecture
Simultaneous loading/downloading supported
Both input and output in normal order
No external memory required
Optimized for both ASIC and FPGA technologies with the same functionality
Fully synchronous design

APPLICATIONS

Communications modulation schemes
Image processing
Atmospheric imaging
Spectral representation


FAST FOURIER TRANSFORM

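FFT (Fast Fourier Transform) and IFFT (Inverse Fast Fourier Transform) are algorithms for computing the $2^m$-point discrete Fourier transform and inverse discrete Fourier transform, as defined below.

FFT:

$$Y(k) = \sum_{n=0}^{N-1} X(n)\, W_N^{nk}, \quad k = 0, 1, 2, \ldots, N-1 \qquad [1]$$

IFFT:

$$Y(k) = \frac{1}{N} \sum_{n=0}^{N-1} X(n)\, W_N^{-nk}, \quad k = 0, 1, 2, \ldots, N-1 \qquad [2]$$

where $N = 2^m$ for a positive integer $m$ and $W_N = e^{-j2\pi/N}$.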
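As an illustration of definitions [1] and [2], the following sketch evaluates both sums directly for a short block and checks that the inverse recovers the input. It assumes the usual convention of W_N = exp(-j2π/N) with a 1/N factor on the inverse; the function name `dft` and the test vector are illustrative only, and the O(N²) loop is not how the core computes the transform.

```python
import cmath

def dft(x, inverse=False):
    """Direct evaluation of equation [1] (forward) or [2] (inverse)."""
    n_points = len(x)
    sign = 1 if inverse else -1            # W_N = exp(-j*2*pi/N); the inverse uses W_N**(-n*k)
    scale = 1.0 / n_points if inverse else 1.0
    out = []
    for k in range(n_points):
        acc = 0j
        for n, sample in enumerate(x):
            acc += sample * cmath.exp(sign * 2j * cmath.pi * n * k / n_points)
        out.append(scale * acc)
    return out

# Arbitrary 8-point complex block (illustrative values).
x = [1 + 2j, -3 + 0.5j, 0.25 - 1j, 4 + 0j, -2 - 2j, 0j, 1j, 3 - 1j]
y = dft(x)
x_back = dft(y, inverse=True)
max_err = max(abs(a - b) for a, b in zip(x, x_back))
```

Here `y[0]` equals the plain sum of the input samples, and `max_err` stays at floating-point rounding level.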

The computational complexity of the FFT and IFFT is proportional to $N \log_R N$, where R is the radix on which the FFT/IFFT is performed. A higher radix requires fewer multiplications but more simultaneous data accesses, which makes the circuit more complicated. The radix-4 algorithm offers a balance between computational and circuit complexity and is often used to construct higher-radix FFT computation units when designing high-performance FFT/IFFT hardware.
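To make the radix trade-off concrete, the short sketch below counts the number of stages, log_R N, for a 1024-point transform at radix 2 and radix 4. The helper `radix_stages` is a hypothetical name used for illustration only.

```python
def radix_stages(n_points, radix):
    """Return log_R(N), computed by repeated integer division."""
    stages = 0
    remaining = n_points
    while remaining > 1:
        if remaining % radix:
            raise ValueError("N must be a power of the radix")
        remaining //= radix
        stages += 1
    return stages

N = 1024
stages_r2 = radix_stages(N, 2)    # stages needed at radix 2
stages_r4 = radix_stages(N, 4)    # stages needed at radix 4
```

For N = 1024 the radix-4 decomposition needs half as many stages as radix 2, which matches the five radix-4 stages of the CS2412.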

CS2412 SYMBOL AND PIN DESCRIPTION

Table 1 describes input and output ports (shown graphically
in Figure 2) of the CS2412 1024-point FFT/IFFT core. Unless
otherwise stated, all signals are active high and bit(0) is the
least significant bit.

Figure 2: CS2412 Symbol

[Symbol: block labelled "CS2412 1024-pt FFT/IFFT" with ports CLK, NotRST, XRe, XIm, XBS, SDC, TType, XBIP, YRe, YIm, YOV, YBS, YAV and YSDC.]

Table 1: CS2412 1024-Point FFT/IFFT Interface Signal Definitions

Name    I/O     Width  Description
CLK     Input   1      Data clock signal, rising edge active.
NotRST  Input   1      Asynchronous global reset signal, active LOW.
TType   Input   1      Static signal specifying the transform type: 0 = FFT, 1 = IFFT.
SDC     Input   3      Number of bits for the additional scaling-down operation. Loaded when
                       XBS is active and associated with the 1024-point block indicated by XBS.
XRe     Input   16     Real component of input data X, in two's complement format.
XIm     Input   16     Imaginary component of input data X, in two's complement format.
XBS     Input   1      Input data X block start signal, active HIGH, associated with the first
                       input data of the N-point block. The remaining N-1 data of the block are
                       loaded into the core in the following N-1 data clock cycles in natural order.
XBIP    Output  1      Indicates that loading of X is in progress. XBIP goes HIGH on the clock
                       cycle after XBS is active and returns LOW when the last data of the N-point
                       block has been loaded into the core. XBS is ignored while XBIP is HIGH.
YBS     Output  1      Output data Y block start signal, active HIGH, asserted when the first data
                       of the N-point transformed block is on the output port. The remaining N-1
                       data of the result come out of the core in the following N-1 clock cycles
                       in natural order.

FUNCTIONAL DESCRIPTION

The CS2412 performs decimation in frequency (DIF) radix-4
forward or inverse Fast Fourier Transforms on complex data.
Data is loaded into its workspace in normal sequential
(natural) order. The transformed data is returned in normal
sequential order. It performs 1024-point FFT/IFFT using the
following equations:

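FFT:

$$Y(k) = \frac{1}{2^{4+SDC}} \sum_{n=0}^{N-1} X(n)\, W_N^{nk}, \quad k = 0, 1, 2, \ldots, N-1 \qquad [3]$$

IFFT:

$$Y(k) = \frac{1}{2^{4+SDC}} \sum_{n=0}^{N-1} X(n)\, W_N^{-nk}, \quad k = 0, 1, 2, \ldots, N-1 \qquad [4]$$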

Where N is equal to 1024, SDC is the scaling down control
signal, X(n) is the complex input data and Y(k) the complex
output data. Both the real and imaginary components of input
X(n) and output Y(k) are 16-bit numbers in two's complement
format.
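A small worked example may help show why the scaling matters. The sketch below assumes the overall scale factor is 2^-(4+SDC), that is, the 4 unconditional bits plus the SDC-controlled bits, and evaluates the k = 0 output for a constant full-scale input block using ideal arithmetic. It ignores rounding and per-stage saturation inside the core, and `dc_bin` is a hypothetical helper name.

```python
def dc_bin(amplitude, sdc, n_points=1024):
    """Ideal Y(0) for a constant input block, scaled by 2**-(4 + sdc)."""
    return n_points * amplitude / 2 ** (4 + sdc)

FULL_SCALE = 32767                      # largest positive 16-bit two's complement value

y0_sdc0 = dc_bin(FULL_SCALE, sdc=0)     # only the 4 mandatory bits applied
y0_sdc6 = dc_bin(FULL_SCALE, sdc=6)     # total shift of 10 bits
# SDC settings for which this particular block fits a 16-bit output
fits_16_bit = [sdc for sdc in range(8) if dc_bin(FULL_SCALE, sdc) <= FULL_SCALE]
```

Under these assumptions a full-scale constant block needs SDC of 6 or more before its k = 0 output fits in 16 bits; other inputs may tolerate smaller settings.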

The CS2412 achieves data throughput of up to 50 Msamples/s by employing a pipelined architecture with fixed-point arithmetic and a pre-scaling strategy to handle possible overflow during computation. The core applies an unconditional 4-bit scale-down and up to 7 bits of controlled scale-down specified by the input signal SDC, giving the user the gain control required in a specific application. The CS2412 core uses the radix-4 decimation in frequency (DIF) algorithm to perform the transform. It consists of five radix-4 pipelined stages with reshuffle buffers between stages and can process a continuous data stream. Both the input and the output are in normal order (ordinary time order).

The selection of the transform (FFT/IFFT) is controlled by a static signal, whereas the scaling-down control is applied on a block-by-block basis. The core detects possible overflow during computation and saturates overflowing data accordingly.

To minimize the device size, the CS2412 uses a 2x clock internally. The input data is clocked in using the data clock while the core operates on the 2x clock. The output data is also clocked out on the 2x clock, although it changes only once every two cycles of the 2x clock. When the core is implemented on FPGA devices, the 2x clock is generated by the on-chip PLL of Apex 20KE devices or the DLL of Virtex devices.
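The clock relationship can be put into numbers with a few lines of arithmetic. The sketch assumes one complex sample per data clock cycle and the maximum quoted rate of 50 Msamples/s; the variable names are illustrative.

```python
DATA_RATE_HZ = 50e6      # maximum sample rate quoted for the core
N_POINTS = 1024          # samples per transform block

internal_clock_hz = 2 * DATA_RATE_HZ                 # the internal 2x clock
block_period_us = N_POINTS / DATA_RATE_HZ * 1e6      # time to stream one block in or out
```

At the maximum rate this gives a 100 MHz internal clock and roughly 20.48 microseconds per 1024-point block.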

WORD LENGTH

The internal wordlength of each radix-4 operation of the CS2412 is specified in Figure 3. The intermediate data stored in the reshuffle buffers is 16 bits wide (32 bits for complex numbers). The wordlength grows to 18 bits after the radix-4 butterfly. The twiddle multiplier takes the 18-bit butterfly output and a 16-bit twiddle factor and generates a 34-bit product. The product is then scaled and rounded to 16 bits for the next-stage radix-4 operation.
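The 18-bit and 34-bit figures can be checked with a simple range calculation. The sketch below is a simplified bound: it treats each butterfly output component as a plain sum of four 16-bit values and considers a single real multiplier, not the full complex multiplication. The helper `signed_width` is a hypothetical name.

```python
def signed_width(lo, hi):
    """Smallest two's complement width that holds every integer in [lo, hi]."""
    width = 1
    while not (-(1 << (width - 1)) <= lo and hi <= (1 << (width - 1)) - 1):
        width += 1
    return width

IN_MIN, IN_MAX = -(1 << 15), (1 << 15) - 1        # 16-bit stage input
butterfly_bits = signed_width(4 * IN_MIN, 4 * IN_MAX)

TW_MIN, TW_MAX = -(1 << 15), (1 << 15) - 1        # 16-bit twiddle factor
BF_MIN, BF_MAX = -(1 << 17), (1 << 17) - 1        # full 18-bit butterfly output range
corners = [a * b for a in (BF_MIN, BF_MAX) for b in (TW_MIN, TW_MAX)]
product_bits = signed_width(min(corners), max(corners))
```

The extreme products occur at the corners of the two input ranges, so checking the four corner products is enough to bound the width.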

Figure 3: Wordlength Specification

[Signal flow: 16 bits into the radix-4 butterfly, 18 bits into the twiddle multiply together with a 16-bit twiddle factor, 34 bits into scaling and rounding, 16 bits into the next radix-4 butterfly.]

Table 1: CS2412 1024-Point FFT/IFFT Interface Signal Definitions (continued)

Name    I/O     Width  Description
YAV     Output  1      Output data Y available indicator, active HIGH, asserted for all data of
                       the N-point transform result.
YRe     Output  16     Real component of output data Y, in two's complement format, valid only
                       when YAV is HIGH.
YIm     Output  16     Imaginary component of output data Y, in two's complement format, valid
                       only when YAV is HIGH.
YOV     Output  1      Output data Y overflow signal, active HIGH, asserted when overflow occurs
                       during the transform of the output data block.
YSDC    Output  3      Indicates the SDC of the output data block.


FUNCTIONAL OPERATION

The core can process a continuous data stream. Loading of the input data is controlled by signal XBS, which is asserted while the output signal XBIP is de-asserted. XBS marks the first data of the 1024-point data block, which is clocked in on the rising clock edge. The remaining 1023 data points are loaded in the following 1023 clock cycles in natural order. When the last data is loaded, XBIP returns to LOW. Loading of the next data block can be started by asserting XBS at any time from the clock cycle after XBIP returns to LOW.

Signal YBS is asserted when the first result data appears on the output port. The remaining result data is clocked out continuously over the following 1023 clock cycles. Signal YAV is asserted for the whole period during which the result is being output. Figure 4 illustrates the functional timing of the I/O signals.
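The input handshake described above can be sketched as a small behavioural model, with one method call per rising clock edge. This is a reading of the text, not a cycle-accurate model of the core: the exact edge on which XBIP changes is an assumption, and the class and attribute names are hypothetical.

```python
class InputLoader:
    """Behavioural sketch of the XBS/XBIP input handshake."""

    def __init__(self, n_points=1024):
        self.n_points = n_points
        self.loaded = 0          # samples accepted for the current block
        self.xbip = False        # XBIP level after the most recent edge
        self.block = []

    def clock(self, xbs, sample):
        """Apply one rising CLK edge and return the resulting XBIP level."""
        if self.xbip:
            # Loading in progress: XBS is ignored, the sample joins the block.
            self.block.append(sample)
            self.loaded += 1
            if self.loaded == self.n_points:
                self.xbip = False        # last sample of the block loaded
        elif xbs:
            # Block start: this edge captures the first sample.
            self.block = [sample]
            self.loaded = 1
            self.xbip = True
        return self.xbip

loader = InputLoader(n_points=4)         # small block so the trace stays short
stimulus = [(True, 10), (True, 11), (False, 12), (False, 13), (False, 99), (True, 20)]
trace = [loader.clock(xbs, sample) for xbs, sample in stimulus[:4]]
first_block = list(loader.block)
trace += [loader.clock(xbs, sample) for xbs, sample in stimulus[4:]]
```

In this trace the second XBS pulse arrives while loading is in progress and is ignored, the idle sample 99 is dropped, and the final XBS pulse starts a new block.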

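Figure 4: Input/Output Functional Timing

[Timing diagram: waveforms for CLK, XBS, SDC, XRE, XIM, XBIP, YRE, YIM, YBS, YAV and YSDC, with labels n-1, n+1, n+2 and n+3, a sample index of 1023, and an interval annotated "2040 cycles".]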

SHIFTING CONTROL

The kernel operation of the 1024-point transform consists of a radix-4 butterfly followed by a twiddle multiplication. Theoretically, in the worst case the result value may grow by a factor of up to 5.657 in the first stage. This occurs when the four inputs to the radix-4 computation have the maximal absolute value and the twiddle angle is $\pi/4$. The final result reaching stage 5 may grow by a factor of up to 1303.793, which represents a possible wordlength growth of 11 bits. Because the output is a 16-bit value and the core uses fixed-point arithmetic, the result must be scaled to avoid overflow while still retaining a good dynamic range.

Since both the input and the output word lengths are 16 bits, no bit growth can be allowed. The megafunction must therefore be able to right-shift the internal result by up to 11 bits so that overflow can be avoided. The total 11-bit scaling-down operation is distributed across the stages according to Table 2. When SDC is set to its maximal value, no overflow occurs for any input data.

The first 4 bits of shift are mandatory. The remaining 7 bits are applied at the discretion of the user under the control of SDC.
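The figures in this section can be cross-checked numerically. The sketch below takes the quoted growth factors at face value, notes that 4·√2 reproduces the first-stage figure, and converts the overall factor into a bit count. It assumes SDC is a 3-bit value (codes 000 to 111); `total_shift` is a hypothetical helper name.

```python
import math

stage1_growth = 4 * math.sqrt(2)            # matches the quoted first-stage factor
TOTAL_GROWTH = 1303.793                     # overall worst-case factor quoted in the text
growth_bits = math.ceil(math.log2(TOTAL_GROWTH))

MANDATORY_SHIFT = 4                         # bits always shifted out

def total_shift(sdc):
    """Total right shift for a given SDC setting (0..7)."""
    if not 0 <= sdc <= 7:
        raise ValueError("SDC is assumed to be a 3-bit value")
    return MANDATORY_SHIFT + sdc
```

With SDC at its maximum of 7 the total shift equals the 11 bits of worst-case growth, which is consistent with the statement that no overflow occurs at that setting.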

COMPUTATION ACCURACY

A rounding technique is employed to achieve the maximum computation accuracy possible for the given word lengths. The core performs a round-to-nearest operation to keep the loss in accuracy minimal. When an intermediate value, for instance a twiddle multiplication result, has to be scaled down, the most significant bit of the portion to be rounded off is added to the word that remains. This is a compromise between true rounding and truncation. Compared with the technique that unconditionally sets the bottom bit to '1', this partial rounding scheme achieves better accuracy and guarantees an all-zero output block for an all-zero input block.
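The rounding rule described above can be sketched on plain integers as follows. This is one reading of the text, not the vendor's bit-accurate model, and the function name `round_shift` is hypothetical.

```python
def round_shift(value, shift):
    """Arithmetic right shift that adds back the MSB of the discarded bits."""
    if shift == 0:
        return value
    discarded_msb = (value >> (shift - 1)) & 1     # top bit of the part shifted out
    return (value >> shift) + discarded_msb

examples = {
    "19 >> 3": round_shift(19, 3),     # 2.375 rounds down
    "20 >> 3": round_shift(20, 3),     # 2.5 rounds up
    "-5 >> 1": round_shift(-5, 1),     # -2.5 rounds towards +infinity
    "0 >> 4": round_shift(0, 4),       # zero stays zero
}
```

Because nothing is added when the discarded bits are all zero, an all-zero input stays all-zero, unlike a scheme that forces the bottom bit to 1.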

CS2412 detects overflow at each computation stage and uses
the following procedure to saturate output overflow samples:

if (X >= 32768) X = 32767;

if (X <= -32768) X = -32767;
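A direct transcription of this saturation rule into a runnable form is shown below. Note that, as written in the procedure, the negative limit is -32767 and not -32768, so the output range is symmetric. The function name `saturate16` is hypothetical.

```python
def saturate16(x):
    """Clamp an integer to the symmetric range [-32767, 32767]."""
    if x >= 32768:
        return 32767
    if x <= -32768:
        return -32767
    return x

clamped = [saturate16(v) for v in (40000, 123, -32767, -32768, -40000)]
```

Values already inside the range pass through unchanged, while -32768 itself is mapped to -32767.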

The bit-accurate C model provided with the core can be used to check the output error with respect to the SDC signal. Table 3 presents the output error with respect to the SDC signal.

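Table 2: Number of Shifting Bits in Each Stage

[Table body: one row per SDC code (000, 001, 010, 011, 100, 101, 110, 111) with a column for each stage and a Total column; the numeric entries are missing from this copy.]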
