April 10, 2001

2001 Integrated Device Technology, Inc.

DSC 3486/2

The IDT logo is a registered trademark and RC4600, RC4650, RC3081, RC3052, RC3051, RC3041, RISController, and RISCore are trademarks of Integrated Device Technology, Inc.

Block Diagram

[Figure: block diagram. Blocks shown:
  267 MHz 64-bit CPU: 64-bit Register File, 64-bit Adder, Store Aligner,
    Logic Unit, Load Aligner, High-Performance Integer Multiply,
    Pipeline Control
  89 MFlops Single-Precision FPA: FP Register File, FP Add/Sub/Cvt/Div/Sqrt,
    Pack/Unpack, FP Multiply, Pipeline Control
  System Control Coprocessor: Address Translation/Cache Attribute Control,
    Exception Management Functions
  32-bit Synchronized System Interface
  Instruction Cache: Set A (Lockable), Set B
  Data Cache: Set A (Lockable), Set B
  Buses: Instruction Bus, Control Bus, Data Bus]

Features

High-performance embedded 64-bit microprocessor
  - 64-bit integer operations
  - 64-bit registers
  - Based on the MIPS RISC Architecture
  - 100MHz, 133MHz, 150MHz, 180MHz, 200MHz and 267MHz operating
    frequencies
  - 32-bit bus interface brings 64-bit power to 32-bit system cost

High-performance DSP capability
  - 133.5 Million Integer Mul-Accumulate operations/sec @267MHz
  - 89 MFlops floating-point operations @267MHz

High-performance microprocessor
  - 133.5 M Mul-Add/second @267MHz
  - 89 MFlops @267MHz
  - >640,000 dhrystone (2.1)/sec capability @267MHz (352 dhrystone MIPS)

High level of integration
  - 64-bit, 267 MHz integer CPU
  - 8KB instruction cache; 8KB data cache
  - Integer multiply unit with 133.5M Mul-Add/sec

Upwardly software compatible with IDT RISController Family

Easily upgradable to 64-bit system

Low-power operation
  - Active power management powers down inactive units
  - Standby mode

Large, efficient on-chip caches
  - Separate 8KB Instruction and 8KB Data caches
  - Over 3200MB/sec bandwidth from internal caches
  - 2-set associative
  - Write-back and write-through support
  - Cache locking, to facilitate deterministic response
  - High performance write protocols, for graphics and data
    communications

Bus compatible with RC4000 family
  - System interfaces to 125MHz, provides bandwidth up to 500 MB/sec
  - Direct interface to 32-bit wide systems
  - Synchronized to external reference clock for multi-master operation
  - Socket compatible with IDT RC64474 and RC64574

Improved real-time support
  - Fast interrupt decode
  - Optional cache locking

Note: "R" refers to 5V parts; "RV" refers to 3.3V parts; "RC" refers to
both

Low-Cost Embedded
64-bit RISController
w/ DSP Capability

IDT79RC4640

Description

The IDT79RC4640 is a low-cost member of the Integrated Device
Technology, Inc. RC4000 family, targeted to a variety of
performance-hungry embedded applications. The RC4640 continues the RC4000
tradition of high performance through high-speed pipelines,
high-bandwidth caches and bus interface, 64-bit architecture, and careful
attention to efficient control. The cost of this performance is reduced
by removing functional units frequently not required for many embedded
applications.

The RC4640 supports a wide variety of embedded processor-based
applications, such as internetworking equipment (routers, switches),
office automation equipment (printers, scanners), and consumer
multimedia game systems. Because it is upwardly software-compatible with
the RC32300 family, as well as bus-compatible and upwardly
software-compatible with the IDT RC4000 family, the RC4640 will serve in
many of the same applications. The RC4640 also supports applications
that require integer digital signal processing (DSP) functions.

The RC64475 and RC64575 processors offer a direct migration path for
designs based on IDT's RC4650 processors, through full pin and socket
compatibility.

The RC4640 brings 64-bit performance levels to lower cost systems. High
performance is preserved by retaining large on-chip two-way
set-associative caches, a streamlined high-speed pipeline, high
bandwidth, 64-bit execution, and facilities such as early restart for
data cache misses.

These techniques give the system designer over 3.2 GB/sec aggregate
internal bandwidth, 500 MB/sec bus bandwidth, almost 352 Dhrystone MIPS,
89 MFlops, and 133.5 M Mul-Add/sec. An array of tools facilitates rapid
development of RC4640-based systems, allowing a wide variety of
customers access to the processor's high-performance capabilities while
maintaining short time-to-market goals.

Hardware Overview

Some key elements of the RC4640 are briefly described below. More
detailed information is available in the IDT79RC4640/IDT79RC4650 RISC
Processor Hardware User's Manual.

Pipeline

The RC4640 uses a 5-stage pipeline that is similar to that of the
IDT79RC3000 and IDT79RC4700 processors. The simplicity of this pipeline
allows the RC4640 to cost less than superscalar processors and require
less power than super-pipelined processors. Unlike on superscalar
processors, applications that have large data dependencies, or that
require frequent loads and stores, can still achieve peak performance.

Integer Execution Engine

The RC4640 implements the MIPS-III Instruction Set Architecture and is
fully upward compatible with applications that run on earlier generation
parts. The RC4640 is software-compatible with the RC4650, and includes
the instruction set found in the RC4700 microprocessor, targeted at
higher performance while maintaining binary compatibility with RC32300
processors.

The extensions result in better code density, greater multiprocessing
support, improved performance for commonly used code sequences in
operating system kernels, and faster execution of floating-point
intensive applications. All resource dependencies are made transparent
to the programmer, ensuring transportability among implementations of
the MIPS instruction set architecture. In addition, MIPS-III specifies
new instructions defined to take advantage of the 64-bit architecture of
the processor.

Finally, the RC4640 also implements additional instructions, which are
considered extensions to the MIPS-III architecture. These instructions
improve the multiply and multiply-add throughput of the CPU, making it
well suited to a wide variety of imaging and DSP applications. These
extensions, which use opcodes allocated by MIPS Technologies for this
purpose, are supported by a wide variety of development tools.

The MIPS integer unit implements a load/store architecture with
single-cycle ALU operations (logical, shift, add, sub) and an autonomous
multiply/divide unit. The 64-bit register resources include 32
general-purpose orthogonal integer registers, the HI/LO result registers
for the integer multiply/divide unit, and the program counter. In
addition, the on-chip floating-point coprocessor adds 32 floating-point
registers and a floating-point control/status register.

The RC4640 has 32 general-purpose 64-bit registers. These registers are
used for scalar integer operations and address calculation. The register
file has two read ports and one write port and is fully bypassed to
minimize operation latency in the pipeline.

Arithmetic Logic Unit

The RC4640 ALU consists of the integer adder and logic unit. The adder
performs address calculations in addition to arithmetic operations; the
logic unit performs all of the logic and shift operations. Each unit is
highly optimized and can perform an operation in a single pipeline cycle.

Integer Multiply/Divide

The RC4640 uses a dedicated integer multiply/divide unit, optimized for
high-speed multiply and multiply-accumulate operation. Table 1 shows the
performance, expressed in terms of pipeline clocks, achieved by the
RC4640 integer multiply unit.

Opcode           Operand Size   Latency   Repeat   Stall
MULT/U, MAD/U    16 bit
                 32 bit
MUL              16 bit
                 32 bit
DMULT, DMULTU    any
DIV, DIVU        any
DDIV, DDIVU      any

Table 1 RC4640 Integer Multiply Operation

The MIPS-III architecture specifies that the results of a multiply or
divide operation are placed in the HI and LO registers. The values can
then be transferred to the general-purpose register file using the
MFHI/MFLO instructions.

The RC4640 adds a new multiply instruction, "MUL", which can specify
that the multiply results bypass the LO register and are placed
immediately in the primary register file. By avoiding the explicit
"Move-from-LO" instruction required when using LO, throughput of
multiply-intensive operations is increased.

An additional enhancement offered by the RC4640 is an atomic
"multiply-add" operation, MAD, used to perform multiply-accumulate
operations. This instruction multiplies two numbers and adds the product
to the current contents of the HI and LO registers. This operation is
used in numerous DSP algorithms, and allows the RC4640 to reduce the
cost of systems requiring a mix of DSP and control functions.
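The accumulate behavior of MAD can be illustrated with a small software model. This is only a sketch under stated assumptions: the operand width (32-bit signed), the accumulator width (64 bits split across 32-bit HI and LO halves), and the names `MultiplyUnit` and `dot_product` are illustrative choices, not taken from this data sheet.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(value):
    """Interpret the low 32 bits of value as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


class MultiplyUnit:
    """Toy model of the HI/LO result registers (widths are assumptions)."""

    def __init__(self):
        self.hi = 0
        self.lo = 0

    def mult(self, rs, rt):
        """MULT: place the signed product in HI (upper half) and LO (lower half)."""
        product = (to_signed32(rs) * to_signed32(rt)) & MASK64
        self.hi, self.lo = product >> 32, product & MASK32

    def mad(self, rs, rt):
        """MAD: add the signed product to the current HI/LO contents."""
        acc = (self.hi << 32) | self.lo
        acc = (acc + to_signed32(rs) * to_signed32(rt)) & MASK64
        self.hi, self.lo = acc >> 32, acc & MASK32


def dot_product(xs, ys):
    """Accumulate sum(x * y) using one MAD per pair of elements."""
    unit = MultiplyUnit()
    for x, y in zip(xs, ys):
        unit.mad(x, y)
    return unit.hi, unit.lo
```

In this model, `dot_product([1, 2, 3], [4, 5, 6])` leaves 0 in HI and 32 in LO, which is the kind of inner loop found in FIR filters and similar DSP kernels.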

Finally, aggressive implementation techniques provide low latency for
these operations, along with pipelining that allows a new operation to
be issued before the previous one has fully completed. Table 1 also
shows the repeat rate (peak issue rate), latency, and number of
processor stalls required for the various operations. The RC4640
automatically detects the operand size and implements hardware
interlocks to prevent overrun, allowing this high performance to be
achieved with simple programming.

Floating-Point Coprocessor

The RC4640 incorporates an entire single-precision floating-point
coprocessor on chip, including a floating-point register file and
execution units. The floating-point coprocessor forms a "seamless"
interface with the integer unit, decoding and executing instructions in
parallel with the integer unit.

The floating-point unit of the RC4640 directly implements
single-precision floating-point operations, which enables the RC4640 to
perform functions such as graphics rendering without requiring extensive
die area or power consumption. The single-precision unit of the RC4640
is directly compatible with the single-precision operation of the
RC4700, and features the same latencies and repeat rates.

The RC4640 does not directly implement the double-precision operations
found in the RC4700. However, to maintain software compatibility, the
RC4640 will signal a trap when a double-precision operation is
initiated, allowing the requested function to be emulated in software.
Alternatively, the system architect could use a software library
emulation of double-precision functions, selected at compile time, to
eliminate the overhead associated with trap and emulation.

Floating-Point Units

The RC4640's floating-point execution units perform single-precision
arithmetic, as specified in IEEE Standard 754. The execution unit is
divided into a separate multiply unit and a combined
add/convert/divide/square root unit. Overlap of multiply and
add/subtract is supported. The multiplier is partially pipelined,
allowing a new multiplication instruction to begin every 6 cycles.

As in the IDT79RC4700, the RC4640 maintains fully precise floating-point
exceptions while allowing both overlapped and pipelined operations.
Precise exceptions are extremely important in mission-critical
environments, such as those programmed in Ada, and highly desirable for
debugging in any environment.

The floating-point unit's operation set includes floating-point add,
subtract, multiply, divide, square root, conversion between fixed-point
and floating-point format, conversion among floating-point formats, and
floating-point compare. These operations comply with IEEE Standard 754.
Double-precision operations are not directly supported; attempts to
execute double-precision floating-point operations, or to refer directly
to double-precision registers, result in the RC4640 signalling a "trap"
to the CPU, enabling emulation of the requested function. Table 2 gives
the latencies of some of the floating-point instructions in internal
processor cycles.

Floating-Point General Register File

The floating-point register file is made up of thirty-two 32-bit
registers. These registers are used as source or target registers for
the single-precision operations.

References to these registers as 64-bit registers (as supported in the
RC4700) will cause a trap to be signalled to the integer unit.

The floating-point control register space contains two registers: one
for determining configuration and revision information for the
coprocessor and one for control and status information. These are
primarily involved with diagnostic software, exception handling, state
saving and restoring, and control of rounding modes.

Operation   Instruction Latency
ADD
SUB
MUL
DIV
SQRT
CMP
FIX
FLOAT
ABS
MOV
NEG
LWC1
SWC1

Table 2 Floating-Point Operation

System Control Coprocessor (CP0)

The system control coprocessor in the MIPS architecture is responsible
for the virtual-to-physical address translation and cache protocols, the
exception control system, and the diagnostics capability of the
processor. In the MIPS architecture, the system control coprocessor (and
thus the kernel software) is implementation dependent.

In the RC4640, significant changes in CP0 relative to the RC4600 have
been implemented. These changes are designed to simplify memory
management, facilitate debug, and speed real-time processing.

System Control Coprocessor Registers

The RC4640 incorporates all system control coprocessor (CP0) registers
on-chip. These registers provide the path through which the virtual
memory system's address translation is controlled, exceptions are
handled, and operating modes are controlled (kernel vs. user mode,
interrupts enabled or disabled, cache features). In addition, the RC4640
includes registers to implement a real-time cycle counting facility,
which aids in cache diagnostic testing, assists in data error detection,
and facilitates software debug. Alternatively, this timer can be used as
the operating system reference timer, and can signal a periodic
interrupt.

Table 3 shows the CP0 registers of the RC4640.

Number            Name       Function
0                 IBase      Instruction address space base
1                 IBound     Instruction address space bound
2                 DBase      Data address space base
3                 DBound     Data address space bound
4-7, 10, 20-25,              Not used
29, 31
8                 BadVAddr   Virtual address on address exceptions
9                 Count      Counts every other cycle
11                Compare    Generate interrupt when Count = Compare
12                Status     Miscellaneous control/status
13                Cause      Exception/Interrupt information
14                EPC        Exception PC
15                PRId       Processor ID
16                Config     Cache and system attributes
17                CAlg       Cache attributes for the 8 512MB regions of
                             the virtual address space
18                IWatch     Instruction breakpoint virtual address
19                DWatch     Data breakpoint virtual address
26                ECC        Used in cache diagnostics
27                CacheErr   Cache diagnostic information
28                TagLo      Cache index information
30                ErrorEPC   CacheError exception PC

Table 3 RC4640 CP0 Registers
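The Count and Compare entries above suggest how a periodic timer interrupt could be programmed. The following is a minimal sketch of the arithmetic, assuming 32-bit Count and Compare registers (the register width is not stated in this section) and using hypothetical helper names; it relies only on the stated facts that Count advances every other cycle and that an interrupt is generated when Count equals Compare.

```python
COUNT_BITS = 32  # assumption: Count and Compare are 32-bit registers


def count_ticks(pipeline_hz, period_s):
    """Count increments in period_s, given one increment per two cycles."""
    return round(pipeline_hz / 2 * period_s)


def next_compare(current_count, pipeline_hz, period_s):
    """Compare value that matches period_s after current_count, with wraparound."""
    ticks = count_ticks(pipeline_hz, period_s)
    return (current_count + ticks) % (1 << COUNT_BITS)
```

At 267MHz, a 1 ms period corresponds to 133,500 Count increments under these assumptions; an interrupt handler would typically compute the next Compare value relative to the previous one so that the period does not drift.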

Operation Modes

The RC4640 supports two modes of operation: user mode and kernel mode.
Kernel mode operation is typically used for exception handling and
operating system kernel functions, including CP0 management and access
to I/O devices. In kernel mode, software has access to the entire
address space and all of the coprocessor 0 registers, and can select
whether to enable coprocessor 1 accesses. The processor enters kernel
mode at reset, and whenever an exception is recognized.

User mode is typically used for application programs. User mode accesses
are limited to a subset of the virtual address space, and can be
inhibited from accessing CP0 functions.

Virtual-to-Physical Address Mapping

The 4GB virtual address space of the RC4640 is shown in Figure 1. The
4GB address space is divided into addresses accessible in either kernel
or user mode (kuseg), and addresses only accessible in kernel mode
(kseg2:0).
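As an illustration of this division, the sketch below classifies a 32-bit virtual address by segment and checks whether an access is permitted in a given mode. The segment boundaries are those of Figure 1; the function names and the table layout are hypothetical, introduced only for this example.

```python
SEGMENTS = [
    # (name, first address, last address, kernel-only?)
    ("kuseg", 0x00000000, 0x7FFFFFFF, False),
    ("kseg0", 0x80000000, 0x9FFFFFFF, True),
    ("kseg1", 0xA0000000, 0xBFFFFFFF, True),
    ("kseg2", 0xC0000000, 0xFFFFFFFF, True),
]


def segment_of(vaddr):
    """Return (segment name, kernel_only) for a 32-bit virtual address."""
    for name, first, last, kernel_only in SEGMENTS:
        if first <= vaddr <= last:
            return name, kernel_only
    raise ValueError("address is outside the 32-bit virtual address space")


def access_allowed(vaddr, kernel_mode):
    """User mode may touch only kuseg; kernel mode may touch every segment."""
    _, kernel_only = segment_of(vaddr)
    return kernel_mode or not kernel_only
```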

The RC4640 supports the use of multiple user tasks sharing common
virtual addresses, but mapped to separate physical addresses. This
facility is implemented via the "base-bounds" registers contained in CP0.

0xFFFFFFFF to 0xC0000000   Kernel virtual address space (kseg2)
                           Unmapped, 1.0GB
0xBFFFFFFF to 0xA0000000   Uncached kernel physical address space (kseg1)
                           Unmapped, 0.5GB
0x9FFFFFFF to 0x80000000   Cached kernel physical address space (kseg0)
                           Unmapped, 0.5GB
0x7FFFFFFF to 0x00000000   User virtual address space (useg)
                           Mapped, 2.0GB

Figure 1 Mode Virtual Addressing (32-bit mode)

When a user virtual address is asserted (load, store, or instruction
fetch), the RC4640 compares the virtual address with the contents of the
appropriate "bounds" register (instruction or data). If the virtual
address is "in bounds", the value of the corresponding "base" register
is added to the virtual address to form the physical address for that
reference. If the address is not within bounds, an exception is
signalled.
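The base-bounds check described above can be sketched in a few lines. This is an illustrative model only: the exact comparison used by the hardware is not given here, so "in bounds" is assumed to mean a virtual address strictly below the bound value, and the physical address is assumed to be 32 bits wide. `AddressError` and `translate_user` are hypothetical names.

```python
class AddressError(Exception):
    """Stands in for the exception signalled on an out-of-bounds reference."""


def translate_user(vaddr, base, bound):
    """Form a physical address from a user virtual address.

    Assumption: the reference is in bounds when vaddr < bound.
    """
    if vaddr >= bound:
        raise AddressError(hex(vaddr))
    # Assumption: physical addresses are 32 bits wide
    return (base + vaddr) & 0xFFFFFFFF


# Two tasks use the same virtual address but different base values,
# so their references land in separate physical regions.
task_a = translate_user(0x1000, base=0x00100000, bound=0x8000)
task_b = translate_user(0x1000, base=0x00200000, bound=0x8000)
```

On a task switch, kernel software would reload the base and bound values for the incoming task; separate instruction and data register pairs allow code and data to be placed independently.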

This facility enables multiple user processes in a single physical
memory without the use of a TLB. This type of operation is further
supported by a number of development tools for the RC4640, including
real-time operating systems and "position independent code".

Kernel mode addresses do not use the base-bounds registers, but rather
undergo a fixed virtual-to-physical address translation.

Debug Support

To facilitate software debug, the RC4640 adds a pair of "watch"
registers to CP0. When enabled, these registers will cause the CPU to
take an exception when a "watched" address is appropriately accessed.

Interrupt Vector

The RC4640 also adds the capability to speed interrupt exception
decoding. Unlike the RC4700, which utilizes a single common exception
vector for all exception types (including interrupts), the RC4640 allows
kernel software to enable a separate interrupt exception vector. When
enabled, this vector location speeds interrupt processing by allowing
software to avoid decoding interrupts from general-purpose exceptions.

Cache Memory

To keep the RC4640's high-performance pipeline full and operating
efficiently, the RC4640 incorporates on-chip instruction and data caches
that can each be accessed in a single processor cycle. Each cache has
its own 64-bit data path and can be accessed in parallel. The cache
subsystem provides the integer and floating-point units with an
aggregate bandwidth of over 3200 MB per second at a pipeline clock
frequency of 267MHz. The cache subsystem is similar in construction to
that found in the RC4700, although some changes have been implemented.
Table 4 is an overview of the caches found on the RC4640.
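One plausible reading of how the quoted bandwidth figures arise is sketched below. It assumes one transfer per clock, 1 MB = 10^6 bytes, a 32-bit instruction fetch and a 64-bit data access per pipeline cycle, and a 32-bit bus transfer per 125MHz bus clock; the helper name is hypothetical.

```python
def bandwidth_mb_per_s(clock_mhz, bits_per_cycle):
    """Peak bandwidth in MB/sec for one transfer of bits_per_cycle per clock."""
    return clock_mhz * bits_per_cycle / 8


instruction = bandwidth_mb_per_s(267, 32)  # 32-bit instruction fetch per cycle
data = bandwidth_mb_per_s(267, 64)         # 64-bit data path per cycle
aggregate = instruction + data             # caches accessed in parallel
bus = bandwidth_mb_per_s(125, 32)          # 32-bit system interface at 125MHz
```

Under these assumptions the instruction stream contributes 1068 MB/sec and the data path 2136 MB/sec, for 3204 MB/sec in aggregate, which is consistent with the "over 3200 MB per second" figure; the bus calculation gives 500 MB/sec.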

Instruction Cache

The RC4640 incorporates a two-way set associative on-chip instruction
cache. This virtually indexed, physically tagged cache is 8KB in size
and is parity protected.

Because the cache is virtually indexed, the virtual-to-physical address
translation occurs in parallel with the cache access, thus further
increasing performance by allowing these two operations to occur
simultaneously. The tag holds a 20-bit physical address and valid bit,
and is parity protected.

The instruction cache is 64 bits wide, and can be refilled or accessed
in a single processor cycle. Instruction fetches require only 32 bits
per cycle, for a peak instruction bandwidth of 1068MB/sec at 267MHz.
Sequential accesses take advantage of the 64-bit fetch to reduce power
dissipation, and cache miss refill can write 64 bits per cycle to
minimize the cache miss penalty. The line size is eight instructions (32
bytes) to maximize performance.
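The cache geometry given so far (8KB, two sets, 32-byte lines, virtual index, 20-bit physical tag) implies a particular split of an address into tag, line index and byte offset. The sketch below derives that split; the function and constant names are hypothetical, and the field boundaries are computed from the stated sizes rather than quoted from the hardware manual.

```python
CACHE_BYTES = 8 * 1024
SETS = 2          # "set A" and "set B"
LINE_BYTES = 32   # eight instructions

SET_BYTES = CACHE_BYTES // SETS                # 4096 bytes per set
LINES_PER_SET = SET_BYTES // LINE_BYTES        # 128 lines per set
OFFSET_BITS = LINE_BYTES.bit_length() - 1      # 5 bits of byte offset
INDEX_BITS = LINES_PER_SET.bit_length() - 1    # 7 bits of line index


def split_address(vaddr, paddr):
    """Return (tag, line, offset) for a virtually indexed, physically tagged access."""
    offset = vaddr & (LINE_BYTES - 1)
    line = (vaddr >> OFFSET_BITS) & (LINES_PER_SET - 1)
    tag = paddr >> (OFFSET_BITS + INDEX_BITS)  # physical address bits 31..12
    return tag, line, offset
```

Because the index and offset together use only virtual address bits 11..0, the line can be selected while the upper address bits are still being translated, and the resulting physical bits 31..12 are then compared against the stored tag in both sets.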

In addition, the contents of one set of the instruction cache (set "A")
can be "locked" by setting a bit in a CP0 register. Locking the set
prevents its contents from being overwritten by a subsequent cache miss;
refill then occurs only into set "B".

This operation effectively "locks" time-critical code into one 4KB set,
while allowing the other set to service other instruction streams in a
normal fashion. Thus, the benefits of cached performance are achieved,
while deterministic real-time response is preserved.
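The effect of locking on refill can be shown with a toy model of a single cache index. The sketch assumes a simple alternating replacement policy when set A is unlocked; that policy, like the class and method names, is hypothetical and is not described in this data sheet. Only the locked behavior (refill goes to set B alone) comes from the text above.

```python
class TwoSetCacheLine:
    """One cache index holding a tag in set A and a tag in set B."""

    def __init__(self):
        self.tags = {"A": None, "B": None}
        self.next_victim = "A"  # hypothetical policy: alternate when unlocked

    def access(self, tag, lock_a=False):
        """Return 'hit', or the name of the set refilled on a miss."""
        if tag in self.tags.values():
            return "hit"
        # With set A locked, a miss may only replace the line in set B
        victim = "B" if lock_a else self.next_victim
        self.tags[victim] = tag
        if not lock_a:
            self.next_victim = "B" if victim == "A" else "A"
        return victim
```

Once time-critical code has been brought into set A and the lock is enabled, later misses in this model displace only set B, so an access to the locked code continues to hit.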

Data Cache

For fast, single-cycle data access, the RC4640 includes an 8KB on-chip
data cache that is two-way set associative with a fixed 32-byte
(eight-word) line size. Table 4 lists the RC4640 cache attributes.

The data cache is protected with byte parity and its tag is protected
with a single parity bit. It is virtually indexed and physically tagged
to allow simultaneous address translation and data cache access.

The normal write policy is writeback, which means that a store to a
cache line does not immediately cause memory to be updated. This
increases system performance by reducing bus traffic and eliminating the
bottleneck of waiting for each store operation to finish before issuing
a subsequent memory operation. Software can, however, select
write-through for certain address ranges, using the CAlg register in
CP0. Cache protocols supported for the data cache are:

Uncached.

Addresses in a memory area indicated as uncached will not be
read from the cache. Stores to such addresses will be written
directly to main memory, without changing cache contents.

Writeback.

Loads and instruction fetches will first search the cache, reading
main memory only if the desired data is not cache resident. On
data store operations, the cache is first searched to see if the
target address is cache resident. If it is resident, the cache con-

Characteristics                  Instruction             Data
Size                             8KB                     8KB
Organization                     2-way set associative   2-way set associative
Line size                        32B                     32B
Index                            vAddr 11..0             vAddr 11..0
Tag                              pAddr 31..12            pAddr 31..12
Write policy                     n.a.                    writeback / writethru
Line transfer order              read sub-block order,   read sub-block order,
                                 write sequential        write sequential
Miss restart after transfer of   entire line             first word
Parity                           per-word                per-byte
Cache locking                    set A                   set A

Table 4 RC4640 Cache Attributes
