MIPS32™ 4Km™ Processor Core Datasheet

March 6, 2002

MIPS32™ 4Km™ Processor Core Datasheet, Revision 01.07

The MIPS32™ 4Km™ core from MIPS® Technologies is a member of the MIPS32 4K™ processor core family. It is a
high-performance, low-power, 32-bit MIPS RISC core designed for custom system-on-silicon applications. The core is
designed for semiconductor manufacturing companies, ASIC developers, and system OEMs who want to rapidly integrate
their own custom logic and peripherals with a high-performance RISC processor. It is highly portable across processes and
can be easily integrated into full system-on-silicon designs, allowing developers to focus their attention on end-user
products. The 4Km core is positioned to support new products for emerging segments of the digital consumer,
network, systems, and information management markets, enabling new tailored solutions for embedded applications.

The 4Km core implements the MIPS32 Architecture and contains all MIPS II™ instructions; special multiply-accumulate
(MAC), conditional move, prefetch, wait, and leading zero/one detect instructions; and the 32-bit privileged resource
architecture. The Memory Management Unit consists of a simple, fixed Block Address Translation (BAT) mechanism for
applications that do not require the full capabilities of a Translation Lookaside Buffer based MMU.

The synthesizable 4Km core implements single cycle MAC instructions, which enable DSP algorithms to be performed
efficiently. The Multiply/Divide Unit (MDU) allows 32-bit x 16-bit MAC instructions to be issued every cycle. A 32-bit x
32-bit MAC instruction can be issued every 2 cycles.

Instruction and data caches are fully configurable from 0 - 16 Kbytes in size. In addition, each cache can be organized as
direct-mapped or 2-way, 3-way, or 4-way set associative. Load and fetch cache misses only block until the critical word
becomes available. The pipeline resumes execution while the remaining words are being written to the cache. Both caches
are virtually indexed and physically tagged to allow them to be accessed in the same clock that the address is translated.
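As a rough illustration of how a chosen cache size and associativity translate into sets and index bits, the following sketch does the arithmetic. It assumes the 16-byte line size given in the Features list; the function names `cache_geometry` and `split_address` are hypothetical and are not part of any MIPS tool or API.

```python
LINE_BYTES = 16  # cache line size stated in the Features list


def cache_geometry(size_bytes: int, ways: int) -> dict:
    """Return the set count and address-bit split for one cache configuration."""
    if ways not in (1, 2, 3, 4):
        raise ValueError("ways must be 1 (direct-mapped), 2, 3, or 4")
    if size_bytes <= 0 or size_bytes % (ways * LINE_BYTES) != 0:
        raise ValueError("size must be a positive multiple of ways * line size")
    sets = size_bytes // (ways * LINE_BYTES)
    if sets & (sets - 1):
        # Assumption: the index is a plain bit field, so sets must be 2**n.
        raise ValueError("number of sets must be a power of two")
    return {
        "sets": sets,
        "offset_bits": LINE_BYTES.bit_length() - 1,  # byte offset within a line
        "index_bits": sets.bit_length() - 1,         # selects the set
        "way_bytes": sets * LINE_BYTES,              # capacity of one way
    }


def split_address(vaddr: int, geometry: dict) -> tuple:
    """Split a virtual address into (set index, byte offset within the line)."""
    offset = vaddr & (LINE_BYTES - 1)
    index = (vaddr >> geometry["offset_bits"]) & (geometry["sets"] - 1)
    return index, offset
```

Under these assumptions, a 16-Kbyte, 4-way cache has 256 sets, 8 index bits, and 4 Kbytes per way. The tag comparison uses the translated physical address and is not modeled here.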

An optional Enhanced JTAG (EJTAG) block allows for single-stepping of the processor as well as instruction and data
virtual address breakpoints.

Figure 1 shows a block diagram of the 4Km core. The core is divided into required and optional blocks as shown.

Figure 1 4Km Core Block Diagram

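[Block diagram: the Processor Core contains the Execution Core, Mul/Div Unit, System Coprocessor, MMU (BAT), Cache Control, BIU, Power Mgmt., EJTAG, Instruction Cache, and Data Cache. The BIU connects through a Thin I/F to the On-Chip Bus(es). The legend marks each block as Fixed/Required or Optional.]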


Features

· 32-bit Address and Data Paths

· MIPS32-Compatible Instruction Set
    - All MIPS II Instructions
    - Multiply-Accumulate and Multiply-Subtract Instructions (MADD, MADDU, MSUB, MSUBU)
    - Targeted Multiply Instruction (MUL)
    - Zero/One Detect Instructions (CLZ, CLO)
    - Wait Instruction (WAIT)
    - Conditional Move Instructions (MOVZ, MOVN)
    - Prefetch Instruction (PREF)

· Programmable Cache Sizes
    - Individually configurable instruction and data caches
    - Sizes from 0 - 16KB
    - Direct Mapped, 2-, 3-, or 4-Way Set Associative
    - Loads block only until critical word is available
    - Write-through, no write-allocate
    - 16-byte cache line size, word sectored
    - Virtually indexed, physically tagged
    - Cache line locking support
    - Non-blocking prefetches

· Scratchpad RAM Support
    - Can optionally replace 1 way of the I- and/or D-cache with a fast scratchpad RAM
    - 20 index address bits allow access of arrays up to 1MB
    - Memory-mapped registers attached to the scratchpad port can be used as a coprocessor interface

· R4000-style Privileged Resource Architecture
    - Count/Compare registers for real-time timer interrupts
    - I and D watch registers for SW breakpoints
    - Separate interrupt exception vector

· Memory Management Unit
    - Simple Block Address Translation (BAT) mechanism

· Simple Bus Interface Unit (BIU)
    - All I/Os fully registered
    - Separate unidirectional 32-bit address and data buses
    - Two 16-byte collapsing write buffers

· Multiply/Divide Unit
    - Maximum issue rate of one 32x16 multiply per clock
    - Maximum issue rate of one 32x32 multiply every other clock
    - Early-in iterative divide. Minimum 11 and maximum 34 clock latency (dividend (rs) sign extension-dependent)

· Power Control
    - Minimum frequency: 0 MHz
    - Power-down mode (triggered by WAIT instruction)
    - Support for software-controlled clock divider

· EJTAG Debug Support with single stepping, virtual instruction and data address breakpoints

Architecture Overview

The 4Km core contains both required and optional blocks.
Required blocks are the lightly shaded areas of the block
diagram in Figure 1 and must be implemented to remain
MIPS-compliant. Optional blocks can be added to the 4Km
core based on the needs of the implementation.

The required blocks are as follows:

· Execution Unit

· Multiply/Divide Unit (MDU)

· System Control Coprocessor (CP0)

· Memory Management Unit (MMU)

· Block Address Translation (BAT)

· Cache Controllers

· Bus Interface Unit (BIU)

· Power Management

Optional blocks include:

· Instruction Cache

· Data Cache

· Scratchpad RAM

· Enhanced JTAG (EJTAG) Controller

The section entitled "4Km Core Required Logic Blocks"
discusses the required blocks. The section entitled
"4Km Core Optional Logic Blocks" discusses the optional blocks.

Pipeline Flow

The 4Km core implements a 5-stage pipeline with
performance similar to the R3000 pipeline. The pipeline
allows the processor to achieve high frequency while
minimizing device complexity, reducing both cost and
power consumption.

The 4Km core pipeline consists of five stages:

· Instruction (I Stage)

· Execution (E Stage)

· Memory (M Stage)

· Align (A Stage)


· Writeback (W Stage)

The 4Km core implements a bypass mechanism that allows
the result of an operation to be forwarded directly to the
instruction that needs it without having to write the result
to the register and then read it back.

Figure 2 shows a timing diagram of the 4Km core pipeline.

Figure 2 4Km Core Pipeline
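The overlap of the five stages across consecutive instructions can be sketched in text form. The snippet below is an idealized model that assumes one instruction enters the I stage per clock and that no stalls occur; the function name `pipeline_diagram` is hypothetical, and MDU paths and bypassing are not modeled.

```python
STAGES = ["I", "E", "M", "A", "W"]  # stage order listed in this datasheet


def pipeline_diagram(num_instructions: int) -> list:
    """Return one text row per instruction showing its stage in each clock."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for n in range(num_instructions):
        cells = []
        for cycle in range(total_cycles):
            stage = cycle - n  # instruction n enters the I stage in clock n
            cells.append(STAGES[stage] if 0 <= stage < len(STAGES) else ".")
        rows.append(f"i{n}: " + " ".join(cells))
    return rows


if __name__ == "__main__":
    for row in pipeline_diagram(3):
        print(row)
```

Each row is one instruction and each column is one clock; a dot marks a clock in which that instruction is not in the pipeline.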

I Stage: Instruction Fetch

During the Instruction Fetch stage:

· An instruction is fetched from the instruction cache.

E Stage: Execution

During the Execution stage:

· Operands are fetched from the register file.

· The arithmetic logic unit (ALU) begins the arithmetic or logical operation for register-to-register instructions.

· The ALU calculates the data virtual address for load and store instructions.

· The ALU determines whether the branch condition is true and calculates the virtual branch target address for branch instructions.

· Instruction logic selects an instruction address.

· All multiply and divide operations begin in this stage.

M Stage: Memory Fetch

During the Memory Fetch stage:

· The arithmetic ALU operation completes.

· The data cache fetch and the data virtual-to-physical address translation are performed for load and store instructions.

· Data cache look-up is performed and a hit/miss determination is made.

· A 16x16 or 32x16 multiply calculation completes.

· A 32x32 multiply operation stalls for one clock in the M stage.

· A divide operation stalls for a maximum of 34 clocks in the M stage. Early-in sign extension detection on the dividend will skip 7, 15, or 23 stall clocks.

A Stage: Align

During the Align stage:

· A separate aligner aligns load data to its word boundary.

· A 16x16 or 32x16 multiply operation performs the carry-propagate-add. The actual register writeback is performed in the W stage.

· A MUL operation makes the result available for writeback. The actual register writeback is performed in the W stage.

W Stage: Writeback

· For register-to-register or load instructions, the instruction result is written back to the register file during the W stage.

4Km Core Required Logic Blocks

The 4Km core consists of the following required logic
blocks as shown in Figure 1. These logic blocks are defined
in the following subsections:

· Execution Unit

· Multiply/Divide Unit (MDU)

· System Control Coprocessor (CP0)

· Memory Management Unit (MMU)

· Block Address Translation (BAT)

· Cache Controller

· Bus Interface Unit (BIU)

· Power Management

[Figure 2 labels: I-A1, I-A2, I-Cache, I Dec, RegRd, ALU Op, D-AC, D-Cache, Align, RegW, and Bypass for the integer pipeline; Mul-16x16/32x16, Mul-32x32, and Div paths, each with Bypass, Acc, and RegW.]


Execution Unit

The 4Km core execution unit implements a load/store
architecture with single-cycle ALU operations (logical,
shift, add, subtract) and an autonomous multiply/divide
unit. The 4Km core contains thirty-two 32-bit general-
purpose registers used for integer operations and address
calculation. The register file consists of two read ports and
one write port and is fully bypassed to minimize operation
latency in the pipeline.

The execution unit includes:

· 32-bit adder used for calculating the data address

· Address unit for calculating the next instruction address

· Logic for branch determination and branch target address calculation

· Load aligner

· Bypass multiplexers used to avoid stalls when executing instruction streams where data-producing instructions are followed closely by consumers of their results

· Leading Zero/One detect unit for implementing the CLZ and CLO instructions

· Arithmetic Logic Unit (ALU) for performing bitwise logical operations

· Shifter & Store Aligner
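To make the behavior of the Leading Zero/One detect unit concrete, here is a small software model of the CLZ and CLO results as the MIPS32 architecture defines them for a 32-bit operand. It is a sketch of the instruction semantics only, not of the hardware; the helper names `clz` and `clo` are chosen for illustration.

```python
MASK32 = 0xFFFFFFFF


def clz(value: int) -> int:
    """Count leading zero bits in a 32-bit word (returns 32 for zero)."""
    value &= MASK32
    return 32 - value.bit_length()


def clo(value: int) -> int:
    """Count leading one bits in a 32-bit word (returns 32 for all ones)."""
    # Leading ones of value are the leading zeros of its bitwise complement.
    return clz(~value)
```

One common use is normalization: shifting a nonzero value left by `clz(value)` places its most significant set bit in bit 31.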

Multiply/Divide Unit (MDU)

The 4Km core contains a multiply/divide unit (MDU) that
contains a separate pipeline for multiply and divide
operations. This pipeline operates in parallel with the
integer unit (IU) pipeline and does not stall when the IU
pipeline stalls. This setup allows long-running MDU
operations, such as a divide, to be partially masked by
system stalls and/or other integer unit instructions.

The MDU consists of a 32x16 Booth-recoded multiplier,
result/accumulation registers (HI and LO), a divide state
machine, and the necessary multiplexers and control logic.
The first number shown ('32' of 32x16) represents the rs
operand. The second number ('16' of 32x16) represents the
rt operand. The 4Km core only checks the value of the
latter (rt) operand to determine how many times the
operation must pass through the multiplier. The 16x16 and
32x16 operations pass through the multiplier once. A
32x32 operation passes through the multiplier twice.

The MDU supports execution of one 16x16 or 32x16
multiply operation every clock cycle; 32x32 multiply
operations can be issued every other clock cycle.
Appropriate interlocks are implemented to stall the
issuance of back-to-back 32x32 multiply operations. The
multiply operand size is automatically determined by logic
built into the MDU.
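The pass-count rule described above can be sketched as follows. The exact detection logic inside the MDU is not given in this datasheet, so the check below is an assumption: it treats rt as a 16-bit operand when the upper bits are a pure sign extension (signed case) or all zero (unsigned case). The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def fits_in_16_bits(rt: int, signed: bool = True) -> bool:
    """Assumed test for whether the rt operand is a 16-bit value."""
    rt &= MASK32
    if signed:
        upper = rt >> 15  # bits 31..15 must be all zeros or all ones
        return upper == 0 or upper == 0x1FFFF
    return (rt >> 16) == 0


def multiplier_passes(rt: int, signed: bool = True) -> int:
    """One pass for 16x16 and 32x16 operations, two passes for 32x32."""
    return 1 if fits_in_16_bits(rt, signed) else 2


def issue_cycles(rt_values, signed: bool = True) -> int:
    """Estimate clocks needed to issue a run of back-to-back multiplies."""
    # One multiplier pass per clock: 16-bit rt issues every clock,
    # 32-bit rt every other clock.
    return sum(multiplier_passes(rt, signed) for rt in rt_values)
```

Because only rt is examined, placing the narrower operand in rt is what allows the single-pass case in this model.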

Divide operations are implemented with a simple 1 bit per
clock iterative algorithm. An early-in detection checks the
sign extension of the dividend (rs) operand. If rs is 8 bits
wide, 23 iterations are skipped. For a 16-bit-wide rs, 15
iterations are skipped, and for a 24-bit-wide rs, 7 iterations
are skipped. Any attempt to issue a subsequent MDU
instruction while a divide is still active causes an IU
pipeline stall until the divide operation is completed.
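The early-in rule can be worked through numerically. The sketch below assumes that an operand is "N bits wide" when every bit above bit N-1 is a sign extension (or zero, for the unsigned case), and it combines the skipped iterations with the 34-clock maximum stall quoted for the M stage. Both the width test and the names `skipped_iterations` and `divide_stall_clocks` are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF
MAX_DIVIDE_CLOCKS = 34  # maximum M-stage stall quoted in this datasheet


def skipped_iterations(rs: int, signed: bool = True) -> int:
    """Return the iterations skipped for a dividend of the given width."""
    rs &= MASK32
    for width, skipped in ((8, 23), (16, 15), (24, 7)):
        if signed:
            upper = rs >> (width - 1)               # sign bit and everything above
            all_ones = (1 << (32 - width + 1)) - 1  # same field filled with ones
            if upper == 0 or upper == all_ones:
                return skipped
        elif (rs >> width) == 0:
            return skipped
    return 0  # full 32-bit dividend: nothing is skipped


def divide_stall_clocks(rs: int, signed: bool = True) -> int:
    """Estimated clocks for a divide with dividend rs."""
    return MAX_DIVIDE_CLOCKS - skipped_iterations(rs, signed)
```

With these assumptions the estimate ranges from 11 clocks for an 8-bit dividend to 34 clocks for a full 32-bit dividend, which matches the minimum and maximum figures in the Features list.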

Table 1 lists the repeat rate (the peak issue rate, in cycles
until the operation can be reissued) and latency (number of
cycles until a result is available) for the 4Km core multiply
and divide instructions. The approximate latency and
repeat rates are listed in terms of pipeline clocks. For a
more detailed discussion of latencies and repeat rates, refer
to Chapter 2 of the MIPS32 4K™ Processor Core Family
Software User's Manual.

The MIPS architecture defines that the result of a multiply
or divide operation be placed in the HI and LO registers.
Using the Move-From-HI (MFHI) and Move-From-LO
(MFLO) instructions, these values can be transferred to the
general-purpose register file.

Table 1 4Km Core Integer Multiply/Divide Unit Latencies and Repeat Rates

Opcode                     Operand Size          Latency   Repeat Rate
                           (mul rt) (div rs)
MULT/MULTU, MADD/MADDU,    16 bits
MSUB/MSUBU                 32 bits
MUL                        16 bits
                           32 bits
DIV/DIVU                   8 bits
                           16 bits
                           24 bits
                           32 bits


As an enhancement to the MIPS II ISA, the 4Km core
implements an additional multiply instruction, MUL,
which specifies that multiply results be placed in the
primary register file instead of the HI/LO register pair. By
avoiding the explicit MFLO instruction, required when
using the LO register, and by supporting multiple
destination registers, the throughput of multiply-intensive
operations is increased.

Two other instructions, multiply-add (MADD) and
multiply-subtract (MSUB), are used to perform the
multiply-accumulate and multiply-subtract operations. The
MADD instruction multiplies two numbers and then adds
the product to the current contents of the HI and LO
registers. Similarly, the MSUB instruction multiplies two
operands and then subtracts the product from the HI and
LO registers. The MADD and MSUB operations are
commonly used in DSP algorithms.
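A small software model may help show how MADD and MSUB accumulate into the HI/LO pair. This is a sketch of the signed instruction semantics as the MIPS32 architecture defines them (a 64-bit accumulator formed from HI and LO); the class `HiLo` and its method names are hypothetical, and timing is not modeled.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(value: int) -> int:
    """Interpret the low 32 bits of value as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


class HiLo:
    """Model of the HI/LO register pair used as a 64-bit accumulator."""

    def __init__(self) -> None:
        self.hi = 0
        self.lo = 0

    def _accumulator(self) -> int:
        return (self.hi << 32) | self.lo

    def _store(self, value: int) -> None:
        value &= MASK64  # wrap to 64 bits, as the register pair does
        self.hi = value >> 32
        self.lo = value & MASK32

    def mult(self, rs: int, rt: int) -> None:
        """MULT: (HI, LO) = rs * rt."""
        self._store(to_signed32(rs) * to_signed32(rt))

    def madd(self, rs: int, rt: int) -> None:
        """MADD: (HI, LO) = (HI, LO) + rs * rt."""
        self._store(self._accumulator() + to_signed32(rs) * to_signed32(rt))

    def msub(self, rs: int, rt: int) -> None:
        """MSUB: (HI, LO) = (HI, LO) - rs * rt."""
        self._store(self._accumulator() - to_signed32(rs) * to_signed32(rt))
```

A dot product, the core of many DSP kernels, then becomes one MULT followed by a MADD per remaining element, with a single MFLO (and MFHI if needed) at the end to read the result.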

System Control Coprocessor (CP0)

In the MIPS architecture, CP0 is responsible for the
virtual-to-physical address translation and cache protocols, the
exception control system, the processor's diagnostics
capability, the operating modes (kernel, user, and debug),
and whether interrupts are enabled or disabled. Configuration
information such as cache size and set associativity is
available by accessing the CP0 registers, listed in Table 2.

Coprocessor 0 also contains the logic for identifying and
managing exceptions. Exceptions can be caused by a
variety of sources, including boundary cases in data,
external events, or program errors. Table 3 shows the
exception types in order of priority.

Table 2 Coprocessor 0 Registers in Numerical Order

Register Number   Register Name    Function
0                 Index            Reserved in the 4Km core.
1                 Random           Reserved in the 4Km core.
2                 EntryLo0         Reserved in the 4Km core.
3                 EntryLo1         Reserved in the 4Km core.
4                 Context          Pointer to page table entry in memory.
5                 PageMask         Reserved in the 4Km core.
6                 Wired            Reserved in the 4Km core.
7                 Reserved         Reserved.
8                 BadVAddr         Reports the address for the most recent address-related exception.
9                 Count            Processor cycle count.
10                EntryHi          Reserved in the 4Km core.
11                Compare          Timer interrupt control.
12                Status           Processor status and control.
13                Cause            Cause of last general exception.
14                EPC              Program counter at last exception.
15                PRId             Processor identification and revision.
16                Config           Configuration register.
16                Config1          Configuration register 1.
17                LLAddr           Load linked address.
18                WatchLo          Low-order watchpoint address.
19                WatchHi          High-order watchpoint address.
20 - 22           Reserved         Reserved.
23                Debug            Debug control and exception status.
24                DEPC             Program counter at last debug exception.
25 - 27           Reserved         Reserved.
28                TagLo/DataLo     Low-order portion of cache tag interface.
29                Reserved         Reserved.
30                ErrorEPC         Program counter at last error.
31                DeSave           Debug handler scratchpad register.

1. Registers used in memory management.

2. Registers used in exception processing.

3. Registers used during debug.

Table 3 4Km Core Exception Types

Exception     Description
Reset         Assertion of SI_ColdReset signal.
Soft Reset    Assertion of SI_Reset signal.
DSS           EJTAG Debug Single Step.
