A SINGLE CHIP PROCESSOR, CACHE MEMORY SYSTEM

by

Douglas Eldon McGary

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering

MONTANA STATE UNIVERSITY
Bozeman, Montana

March 1989

© Copyright by Douglas Eldon McGary (1989)

APPROVAL

of a thesis submitted by Douglas Eldon McGary

This thesis has been read by each member of the thesis committee and has been found to be satisfactory regarding content, English usage, format, citations, bibliographic style, and consistency, and is ready for submission to the College of Graduate Studies.

Date    Chairperson, Graduate Committee

Approved for the Major Department

Date    Head, Major Department

Approved for the College of Graduate Studies

Date    Graduate Dean

STATEMENT OF PERMISSION TO USE

In presenting this thesis in partial fulfillment of the requirements for a master's degree at Montana State University, I agree that the Library shall make it available to borrowers under rules of the Library.
Brief quotations from this thesis are allowable without special permission, provided that accurate acknowledgment of the source is made. Permission for extensive quotation from or reproduction of this thesis may be granted by my major professor, or in his absence, by the Dean of Libraries when, in the opinion of either, the proposed use of the material is for scholarly purposes. Any copying or use of the material in this thesis for financial gain shall not be allowed without my written permission.

Signature
Date

ACKNOWLEDGEMENTS

The author wishes to express his gratitude to Dr. Roy Johnson and Professor Kel Winters for their guidance and constructive criticism during the research and writing of this thesis. Dr. Johnson conceived the basic idea, and Professor Winters had the knowledge and background to complete the circuit design. Rick Robinson also generously provided helpful suggestions and information. Tektronix Inc. of Beaverton, Oregon kindly donated to Montana State University the circuit layout package QUICK KIC and the circuit simulation package TSPICE. Others involved include Dr. Ken Marcotte and Dr. John Hanton, whose helpful hints proved very valuable. I also wish to thank all of my friends, work associates and especially Suzanne Spitzer for their support and patience during the research and writing of this thesis.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

1. INTRODUCTION
   Problem Description and Background
   Chip Size
      Parametric Processing Problems
      Circuit Design Problems
      Point Defects
   On-Chip vs. Off-Chip Communications
   Cache Memories and System Performance
   Scope and Organization of Remaining Chapters

2. CACHE MEMORIES
   Conventional Cache Memories
      Cache Size
      Cache Block Size
      Cache Operation Scheme
      Cache Replacement
      Cache Miss Handling
   Predominant Cache Organizations
      Direct Mapped Cache
      Set Associative Mapped Cache
      Fully Associative Mapped Cache
   Cache Memory Topics
      Tag Directories and Tag Checking
      Main/Cache Memory Consistency
      General Cache vs. A Split Instruction/Data Cache

3. DESIGN METHODOLOGY
   General Operation and Data Flow
      Cache Initialization
      Memory Access Cycle
   Size and Organization
      Mapping Scheme
      Initialization
      Replacement Procedure
      Memory Consistency
      Cache Size
      Block Size
      Tag Directory
   Top-Down Design
   CPU Communications and Timing

4. FINAL DESIGN
   Circuit Layout
   Circuit Parameters
   System Size
   Alternate Design - Fully Associative Cache
      Tag Word Length
      Full/Empty Designation
      Replacement Scheme
      Communications and Timing
      Advantages of Fully Associative Mapping

5. DESIGN SIMULATION
   Design and Simulation Tools
      Layout Software
      Circuit Simulation
      Layout Technology
   Speed Estimate
   Power Consumption Estimate
   System Performance Estimate
   Comparison to Conventional Cache Systems

6. CONCLUSION
   Discussion of Results
   Future Work

REFERENCES CITED

APPENDICES
   A - Circuit Documentation
   B - Rules and Values for Circuit Parameter Calculation
   C - MOSIS Scalable Design Rules
   D - Circuit Schematic and TSPICE Input Files for the Simulated Cache Memory Word

LIST OF TABLES

1. Typical Static RAM Performance Data
2. Typical Dynamic RAM Performance Data
3. Hit Ratio and Speed Up vs. Cache Size
4. Power Dissipation of Cache Memory System
5. CMOS Physical Properties: Resistance and Capacitance

LIST OF FIGURES

1. Logic Diagram of a Long On-Chip Communication Path
2. Long On-Chip Communication Path Delay Estimate
3. Cache Memory Logical Organization
4. Maximum Speed Up vs. Hit Ratio
5. Miss Ratio vs. Cache Size
6. Direct Mapped Cache Memory
7. Two-Way Set Associative Mapped Cache Memory
8. Fully Associative Mapped Cache Memory
9. Content Addressable Memory Cell Logic Diagram
10. Miss Ratio of Split and Unified Cache vs. Memory Capacity
11. Memory Request Cycle Timing Diagram
12. Tag Directory Block Diagram
13. CPU - Cache Memory System Block Diagram
14. Cache Memory System Block Diagram
15. Cache Block Logic Diagram
16. Two-Bit Address Decoder Logic Diagram
17. Dynamic Memory Cell Logic Diagram
18. Static Memory Cell Logic Diagram
19. Sense Amplifier Logic and Timing Diagram
20. Block Address Decoder Logic Diagram
21. Tag Directory Block Diagram
22. Tag Write Logic Diagram
23. Nine-Gate CAM Cell Logic Diagram
24. Typical Cache Read Timing Diagram
25. Cache Memory System and Subsystem Sizes
26. System Diagram of a Fully Associative Mapped Cache Memory
27. Sample Parasitic Capacitance Calculation
28. Transient Analysis of a Cache Memory Request
29. Speed Up vs. Hit Ratio for the On-Chip Cache Memory System
30. Speed Up vs. Hit Ratio for the On-Chip Cache Memory vs. A Conventional Off-Chip Cache Memory
31. Circuit Schematic for TAG.CEL
32. Circuit Layout for TAG.CEL
33. Circuit Schematic for SRAM.CEL
34. Circuit Layout for SRAM.CEL
35. Circuit Schematic for WORD.CEL
36. Circuit Layout for WORD.CEL
37. Circuit Schematic for BLOCK.CEL
38. Circuit Layout for BLOCK.CEL
39. Circuit Schematic for TRASS.CEL
40. Circuit Layout for TRASS.CEL
41. Circuit Schematic for TWRTOP.CEL (Top Tag Write Cell)
42. Circuit Layout for TWRTOP.CEL (Top Tag Write Cell)
43. Circuit Schematic for TWRBOT.CEL (Bottom Tag Write Cell)
44. Circuit Layout for TWRBOT.CEL (Bottom Tag Write Cell)
45. Circuit Schematic for SPEED.CEL
46. Circuit Layout for SPEED.CEL
47. Circuit Schematic for ADDINV.CEL (Address Invertor Cell)
48. Circuit Layout for ADDINV.CEL (Address Invertor Cell)
49. Circuit Schematic for TMAPRE.CEL (Tag Match Precharge Cell)
50. Circuit Layout for TMAPRE.CEL (Tag Match Precharge Cell)
51. MOSIS Scalable Design Rules
52. TSPICE Input Files for the Simulated Cache Memory Word
53. Circuit Schematic for the Simulated Cache Memory Word

ABSTRACT

This thesis is concerned with the design and simulation of a cache memory system placed on the same chip with a central processing unit. VLSI fabrication techniques allow for the production of full 32-bit architectures on a single silicon die. Larger silicon areas are available, but a larger processor is unnecessary. A cache memory is one possible system enhancement which increases system performance. The topics of a cache memory system are discussed, and the information needed for design choices is provided. A direct mapped cache 4096 words in length is designed and simulated. The design is intended for use with 32-bit processors that contain 32-bit address and data busses. The simulation includes all of the necessary support logic for a cache read, write and block replacement. During the circuit layout phase, it was discovered that a content addressable memory cell could be used without paying a large penalty in silicon area. This led to the development of an alternate design which showed the feasibility of a fully associative mapped cache memory system. The direct mapped cache memory system was simulated to produce system performance measures. A speedup factor of 7.8 relative to a non-cached system and 1.9 relative to a conventional cache system of the same size was determined. The total system size is quite large and would be expensive to manufacture. The total number of transistors for both the processor and the cache system is 1.1 million. The total silicon die size for the combination of the processor and the cache memory system is almost one square inch (0.921 in²). The power dissipation for the cache system was estimated at 308 mW and found to be acceptable. An on-chip cache memory system proved to be valuable and cost effective. The on-chip cache memory system is already in use on a small scale, and as chip sizes increase so will the cache memory size.
The use of a fully associative cache is an easily achieved option because of the minor difference in layout size between a static RAM and a content addressable RAM when a single polysilicon technology is used.

CHAPTER 1

INTRODUCTION

Problem Description and Background

Computer designers are building faster processors to increase computer system throughput. There is a growing gap between processor speed and memory speed, and memory speed is the limiting factor for system performance. There are several ways to increase the speed of today's microprocessor systems:

1. Increase the speed of memory access.
2. Increase single-cycle performance by performing one operation per clock cycle.
3. Incorporate other system features such as memory management units, instruction pipelining, etc.

System designers must use faster, more expensive main memory components to increase memory speed. Main memory has increased in size, but not in speed. A more complex CPU is required to perform one CPU instruction per clock cycle; the CPU must have the ability to prefetch and decode instructions before execution. Microprocessor designs already encompass more than 100,000 devices, and this increase in complexity is very costly. A simple and cost effective means of increasing microprocessor system throughput is to incorporate a cache memory. It has been shown that little performance gain will be obtained when only the speed of the CPU is increased [1]. The speed of the CPU is so much faster than the memory bandwidth that the system is greatly "out of balance". Increased memory bandwidth will allow memory to keep pace with fast microprocessors.

Microprocessors are built using a two or three metal layer CMOS process. Static RAM (SRAM) is comparable in size to content addressable memory (CAM) in a CMOS environment. This allows associative cache memories built from CAM to be designed without a large penalty in silicon area.

This thesis probes the feasibility of incorporating a cache memory on the same silicon die as a microprocessor or CPU. A background in cache memory and its theory of operation is provided. Questions concerning total silicon die size, cache memory type, organization, and cache memory size are reviewed. The performance measures are compared to conventional off-chip cache memories. Three basic assumptions have been made:

1. A silicon chip large enough to handle both a microprocessor and a cache memory can be fabricated economically.
2. An on-chip cache is faster than the conventional off-chip implementation.
3. Use of a cache memory improves system performance.

Chip Size

Silicon crystal ingots are now grown in 6 inch diameters, and experimental work is being done on 8 inch silicon wafers. Device miniaturization also allows for the packing of large numbers of logic and data storage circuits into a very small area. Smaller circuits can be built because of better alignment techniques such as self-aligned gates and shorter wavelength light for smaller photoplate geometries. "Advances in integrated circuit density are permitting the single chip implementation of features, functions and performance enhancements beyond those of basic eight and sixteen bit processors" [2]. Processors being designed and built today not only include full 16 and 32-bit architectures, but also have sufficient area for performance enhancements such as instruction buffering, pipelining and cache memories.
The chip area will not be sufficient for several years to include all possible enhancement features [2]. One of the best uses for additional chip area is a cache memory. Increased chip areas result from better yield parameters: fewer total chip defects allow for production of larger chips with the required yield. The Motorola MC68000 contains 68,000 transistors and is 45 square millimeters (6.24 by 7.14 mm) in size [3], and the Intel 80386 contains 275,000 transistors and is approximately 350 mils x 350 mils in size [4]. Ideally, perfect or 100% yield is desired. If this were attainable, any size chip area could be realized within the limits of the wafer size. Causes for less than perfect yield fall into three basic categories [5]:

1. Parametric processing problems
2. Circuit design problems
3. Random point defects in the wafer

Parametric Processing Problems

One outstanding feature of a processed wafer is that distinct regions of the wafer exhibit very high yield while other regions exhibit very low to even zero yield. Effects that produce low yield regions are:

1. variations in thickness of oxide and polysilicon layers
2. variations in the resistance of implanted layers
3. variations in the width of lithographically defined features
4. variations in registration or alignment of a photomask with respect to the previous masking operations

These variations depend on one another, and a gross variation in one processing step severely reduces chip yield. A variation in oxide layer thickness can result in areas being over-etched where the oxide is thinner than average and under-etched where it is thicker than average. Polysilicon gates are shorter in the thinner than average polysilicon regions. This results in channel lengths being too short to shut off a transistor when the proper gate voltage is applied. Variations in the doping of implanted layers lead to variations in contact resistance to implanted layers. During the processing of a wafer, various operations are carried out which result in small but important changes in the size of the wafer. When a wafer is oxidized, the SiO2 formed has twice the volume of the silicon consumed in the process. This stresses the wafer, and when part of the SiO2 is removed from one side of the wafer, the wafer will bend if the resulting stress is above the elastic limit of the material. Average variations in wafer size are around 2.5 microns for a 125 mm wafer [5]. This can cause serious misalignment problems and reduce yield on regions of the wafer, especially when a transistor gate length is only 2 microns wide.

Circuit Design Problems

Regions of a wafer show low yield when the designed circuits fail to take account of variations in the processing. Designers must be careful and follow design rules. Design rules are instituted to ensure integrated device operation for a particular process. Threshold voltage (Vt) and channel length (L) are the two most important parameters in MOS design. Variations in substrate doping, ion implantation dosage, and gate oxide thickness will cause a variation in threshold voltage. Variations in gate length and in source and drain junction depth cause the channel length to vary. The variation of Vt and L may be enough to cause a faulty circuit; the operation of a circuit may be unpredictable if the threshold voltage or the channel length is not within tolerance.

Point Defects

Even after processing and circuit design problems are reduced, 100% chip yield is not obtained because of point defects.
Point defects are regions where the wafer fabrication produces faulty circuits, and the size of the faulty region is small compared to the size of the wafer. A three micron dust particle on the wafer could cause a metal conductor to break and render the circuit useless; this is an example of a point defect. Point defects also come from other sources such as dust on the photoplate, a small fault on the photoplate, or a pinhole in the silicon dioxide layer. The best procedure for reducing random point defects is to monitor them, take action to reduce them (such as cleaning the fabrication facility and processing equipment), and continue to monitor them.

To overcome the above three problems, fabrication facilities have advanced the technology of processing steps and circuit design rules. The progress they have shown allows for the manufacture of large area VLSI chips. Larger chips reduce the number of chips per wafer, and the yield must be high to produce large chips economically. Processing problems have been reduced by lowering processing temperatures; ion implantation allows for lower wafer processing temperatures. Chip yield is enhanced by better control of layer thickness and doping densities. Circuit designers strictly follow design rules to ensure device operation, and they usually use a library of circuit layouts that have been proven to function reliably. Point defects have been reduced by creating cleaner fabrication facilities and better cleaning techniques. Substrates are now grown with better purity than obtainable in the past.

On-Chip vs. Off-Chip Communications

In a typical off-chip cache design, the CPU must convert address signals to TTL compatible levels. The cache memory chips must then perform the tag search and drive the necessary data signals back to the CPU. This process is inherently slow because:

1. I/O drivers are required to drive and receive the signals, and I/O drivers are not as fast as other chip circuitry.
2. Interconnect lines on a PC board contain large amounts of resistance and capacitance.
3. A complete cache memory access cycle requires four signal transformations from internal chip levels to external chip levels.

In an on-chip cache system, all information transfer takes place on internal data paths. Transfer of information internally on a chip is faster and consumes less power than inter-chip I/O transfers. Table 1 shows typical static RAM performance; access times range from 50 to 500 ns for static RAMs. On-chip cache memory has the advantage of receiving data before an I/O driver can even signal an off-chip memory component. The static RAM access times in Table 1 are the times required for the data to be delivered after the chip receives the input signal. The time needed to deliver the input signal to the static RAM chip is about 20 ns, which is the delay associated with switching an input/output pad and tristate driver. On-chip cache memory receives the input signal in about 6 ns, which is a 14 ns advantage.

Table 1. Typical Static RAM Performance Data.

Organization   Typical Access Time   Typical Cost   Total Storage
256x4          80-150 ns             $2-4           1 Kbit
4096x1         80-150 ns             $5-10          4 Kbits
1024x4         80-150 ns             $5-10          4 Kbits
2048x4         80-150 ns             $10-15         8 Kbits
8Kx8           50-100 ns             $25-30         64 Kbits

Figure 1 shows the logic diagram for a possible long interconnect a CPU might drive to deliver an input signal to an on-chip cache memory. The time delay involved results from having to transfer the charge associated with the long interconnect capacitance.
The interconnect line is precharged for an optimum signal transfer rate.

Figure 1. Logic Diagram for a Long On-Chip Communication Path.

The circuit in Figure 1 was simulated using TSPICE, and the time delay was found to be approximately 6 ns for transistors with gates of 16 μm width and 2 μm length. The transistor technology is the same as that used for the design of the on-chip cache memory. The transient analysis response is plotted in Figure 2. This is a conservative estimate of intrachip data transfer rates and supports the assumption that intrachip transfers are faster than interchip transfers. The interconnect is 20 mm long (almost an inch) and 6 μm wide. A typical process assigns approximately 0.25E-4 picofarads of capacitance per square micron of metal interconnect, so the total capacitance associated with the interconnect line is 3.0 pF. A typical process assigns approximately 0.05 ohms of resistance per square of metal interconnect. Assuming that the metal interconnect only covers substrate material, the total resistance is 167 ohms. The resistance is broken into two halves (R1 and R2) for the circuit simulation. The interconnect line could be longer and wider and still be driven in a few nanoseconds with only slightly larger transistors.

Figure 2. Long On-Chip Communication Path Delay Estimate.
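For readers who want to reproduce the estimate, the arithmetic behind these parasitic values is summarized in the short sketch below. Python is used here purely as a calculator (it is not part of the thesis toolchain), and the single-pole RC product it reports is only a floor on the delay; the 6 ns figure above comes from the full TSPICE transient simulation, which also includes the precharge and driver transistors.

```python
# Back-of-envelope check of the parasitic estimate above, using the
# "typical process" constants quoted in the text.

length_um = 20_000.0    # interconnect length: 20 mm
width_um  = 6.0         # interconnect width: 6 um

# Capacitance: area times ~0.25E-4 pF per square micron of metal.
cap_pf = length_um * width_um * 0.25e-4          # = 3.0 pF

# Resistance: (length/width) squares times ~0.05 ohm per square.
res_ohm = (length_um / width_um) * 0.05          # ~ 167 ohms

# ohm * pF gives picoseconds; divide by 1000 for nanoseconds.
rc_ns = res_ohm * cap_pf * 1e-3                  # ~ 0.5 ns lower bound
print(f"C = {cap_pf:.1f} pF, R = {res_ohm:.0f} ohm, RC = {rc_ns:.2f} ns")
```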
Cache Memories and System Performance

Cache memories are a time tested mechanism for improving memory system performance. Caches reduce access time and memory traffic; they have proven to be useful and will continue to be used into the future [6]. Microprocessors are buss limited [7], which underlines the importance of conserving or reducing buss traffic. Total system performance is a function of memory buss bandwidth. It is conceivable that a microprocessor can grow to 10 times as large in transistor count [7], but the number of pins that connect it to the outside world will only grow by a factor of about 1.5 to 2. Buss traffic will therefore become a more pressing problem as processors become more complex. A cache memory will reduce buss traffic and allow the buss to be used for other important communication functions. All buss transfers to external main memory are made in block-mode form. The Motorola 68000 microprocessor has 64 I/O pins. The MC68000 typically utilizes 90-95% of the external buss bandwidth (6.25 Mbytes/sec at a 12.5 MHz clock frequency). Increasing the speed of the CPU will not result in a significant increase in system performance [8]; the main memory components (DRAMs and memory management units) simply are not fast enough to keep pace with the CPU I/O demands. The 12.5 MHz MC68010 has a minimum buss cycle time of 320 ns (four 80 ns clock periods). To perform a read cycle without wait states, the 12.5 MHz MC68010 requires valid data at its inputs a maximum of 135 ns after asserting the CPU address strobe. Conventional DRAMs and MMUs only deliver data in around 150 ns to 500 ns [9]. A cache system logically placed between the MC68010 and the MC68451 MMU will lower memory access times enough to allow the CPU to run without wait states.

Processors being produced today are easily outpacing the memory products associated with them. Memory products can be made fast enough to keep pace with the processor, but it is very costly. Memory designers have been concentrating on building high density memory chips. Dense memory chips allow computer systems to have several megabytes of main memory at a reasonable cost, but access times do not decrease. One reasonable way to bridge the gap between memory speed and CPU I/O speed is to incorporate a fast buffer memory.

Scope and Organization of Remaining Chapters

Chapter 2 provides a background and covers important topics of cache memories. Chapter 3 discusses the design methodology. Chapter 4 reviews the final cache memory design. Chapter 5 contains the design simulation, and Chapter 6 provides the conclusions.

CHAPTER 2

CACHE MEMORIES

"A large number of computer programs show that most of the execution time is spent in a few main routines" [10]. This phenomenon is called locality of reference. Most programs also demonstrate a sequential nature of program flow. The CPU does not generate addresses that are random in nature; they tend to be localized, and the higher order bits remain the same. The only deviation from this is a jump or branch instruction. When program flow is localized on a small number of instructions, those instructions are executed over and over again. These are usually loops, nested loops or procedures that continually call each other. The overall flow of the program is not important. The main idea is that localized instructions are accessed by the CPU most of the time while the rest of the code is accessed infrequently [11]. If a buffer memory is incorporated that contains the frequently accessed code, the CPU can access the buffer instead of main memory and save time. The buffer needs to be faster than main memory. Faster memory is expensive, and building a complete main memory out of fast buffers is very expensive.

Conventional Cache Memories

The first cache implemented was in the IBM System/360 Model 85 [12]. Since then, several high performance computers such as the ILLIAC IV, the CDC STAR, the CRAY-1 and the TI ASC have implemented a cache memory. Almost every high performance computer in use today incorporates a cache memory; computers ranging from "minis" to "supers" contain a cache memory in one form or another.

The main memory for a typical 8-bit microprocessor is 512 Kbytes to 1 Mbyte. A 16-bit processor has addressability from 20 to 24 bits, which is from 1 to 16 Mbytes. Larger main memories are being built from medium speed dynamic RAM chips. A cache memory can reduce memory access times. Typical memory access time for a DRAM memory chip is about 400 ns, as shown in Table 2. Incorporating a DRAM chip set for the main memory is the most economical choice. Large DRAM chips are slow, but inexpensive. The cost of fabricating a main memory completely out of high speed static RAMs or ECL chips would overshadow the cost of the rest of the computer.

The block diagram of a buffer memory or cache memory is shown in Figure 3. The cache memory is logically located between the CPU and main memory. The access time for cache memory is usually an order of magnitude faster than main memory and is intermediate in speed relative to CPU register cycle time and main memory cycle time.

Table 2. Typical DRAM Performance Data.

Organization            Typical Access Time   Typical Cost   Typical Power Consumption
64Kx1, 8Kx8             100-250 ns            $5-10          150-300 mW
128Kx1, 32Kx4, 16Kx8    100-250 ns            $10-15         200-400 mW
256Kx1, 64Kx4, 32Kx8    100-500 ns            $15-20         250-500 mW
1Mx1, 256Kx4, 128Kx8    100-300 ns            $60-100        250-500 mW

Figure 3. Cache Memory Logical Organization.
Cache memory and main memory create a memory hierarchy just as main memory and secondary storage do. Many of the techniques designed for virtual memory management are also valid in a cache/main memory hierarchy. Both systems try to obtain the fastest operation possible: the effective speed of the memory system is maximized to obtain optimum system performance. Ideal operation of the main memory/cache memory hierarchy is an access time very close to the access time of the cache memory.

A cache hit occurs when the cache memory satisfies a memory request. The cache hit ratio $\beta$ is the total number of memory requests serviced by the cache divided by the total number of memory requests [10]:

$$\beta = \frac{\text{number of cache hits}}{\text{total memory requests}} \qquad (1)$$

A cache miss occurs when a memory request is not serviced by the cache. The cache miss ratio is the number of memory requests satisfied by the main memory divided by the total number of memory requests [10]:

$$\text{Miss Ratio} = 1 - \text{Hit Ratio} = 1 - \beta \qquad (2)$$

The effective speedup is the ratio of the main memory access time to the effective memory access time. Let $T_c$ be the cache memory access time, $T_m$ the main memory access time and $T_e$ the effective memory access time as seen by the CPU. The effective access time is [10]:

$$T_e = \beta T_c + (1 - \beta) T_m \qquad (3)$$

This equation simply states that the effective access time is equal to the cache hit ratio times the cache access time plus the cache miss ratio times the main memory access time. The speedup due to the cache is [10]:

$$S_c = T_m / T_e \qquad (4)$$

A higher cache hit ratio produces a more efficient cache system, which leads to faster data and instruction transfers to and from the CPU. The relationship between cache hit ratio and speedup is [10]:

$$S_c = \frac{T_m}{T_e} = \frac{T_m}{\beta T_c + (1 - \beta) T_m} = \frac{1}{1 + \beta (T_c/T_m - 1)} \qquad (5)$$

Let

$$T_c / T_m = 0.1 \qquad (6)$$

Letting $T_c/T_m$ equal 0.1 assumes the cache memory access speed is an order of magnitude faster than the main memory access time, which is a very reasonable assumption. Thus,

$$S_c = \frac{1}{1 - 0.9\beta} \qquad (7)$$

Equation 7 shows that the cache hit ratio $\beta$ is the determining factor in speedup. The speedup increases as the hit ratio approaches one. Figure 4 shows speedup vs. hit ratio using equation 7. The ratio of cache memory access time ($T_c$) to main memory access time ($T_m$) is assumed to be 0.1, which simply means the cache memory is ten times faster than main memory. Note that the speedup axis is logarithmic: small increases in hit ratio will produce large changes in the speedup. The maximum speedup is 10. If the hit ratio is 0.5, the maximum speedup is 2.0 no matter how fast the cache memory is, which underlines the importance of high hit ratios.

Figure 4. Maximum Speed Up vs. Hit Ratio.
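A minimal sketch of equations (3) through (7), useful for reproducing Figure 4 and the speedup figures quoted later (illustrative Python only; the function and variable names are ours, not the thesis's):

```python
# Speedup as a function of hit ratio, eq. (5); with tc_over_tm = 0.1
# this reduces to eq. (7), Sc = 1 / (1 - 0.9*B).

def speedup(hit_ratio, tc_over_tm=0.1):
    """Sc = Tm/Te = 1 / (1 + B*(Tc/Tm - 1))."""
    return 1.0 / (1.0 + hit_ratio * (tc_over_tm - 1.0))

for beta in (0.5, 0.7, 0.8, 0.9, 0.95, 0.97, 0.98):
    print(f"hit ratio {beta:4.2f} -> speedup {speedup(beta):.2f}")

# In the limit Tc/Tm -> 0 the bound becomes 1/(1 - B): a hit ratio of
# 0.5 can never yield more than a 2.0 speedup, however fast the cache.
print(speedup(0.5, tc_over_tm=0.0))   # -> 2.0
```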
There are several factors that affect the hit ratio and consequently the memory speedup. These are:

1. Cache size
2. Cache block size
3. Cache operation scheme
4. Cache replacement scheme
5. Cache "miss" handling

Cache Size

Naturally, the larger a cache memory, the greater the chance that it contains the desired memory location, but there is a point of diminishing returns on cache size. It would be very expensive to build a cache memory comparable in size to the main memory. Figure 5 shows miss ratio vs. cache size. The figure clearly shows that beyond a cache size of about 4 Kbytes, the amount the miss ratio decreases is small compared to the increase in cache size.

Figure 5. Miss Ratio vs. Cache Size.

As an example, let $T_c = 1.0$, $T_m = 10$ and vary the hit ratio $\beta$. The results are shown in Table 3.

Table 3. Hit Ratio and Speed Up vs. Cache Size.

β      Sc     Cache memory size change required to produce desired hit ratio
0.7    2.70   0 to 512 bytes
0.8    3.57   512 to 1024 bytes (doubled)
0.9    5.26   1024 to 2048 bytes (doubled)
0.95   6.90   2048 to 4096 bytes (doubled)
0.97   7.87   4096 to 8192 bytes (doubled)
0.98   8.47   8192 to 16384 bytes (doubled)

The system designer must ask whether doubling the cache size is worth the percentage increase in hit ratio and speedup.

Cache Block Size

All cache memories use blocks of words. A block is simply several words grouped together, distinguished from one another only by the least significant address bits. Each word can contain data or instructions. The larger the block size, the more time it takes to move a block into the cache memory. A large block size does give the CPU a better chance of accessing an upcoming instruction or datum, but this gain is easily offset by the time the CPU waits for the block transfer to occur. Locality of reference plays a big part in determining the block size, and the amount of locality of reference is completely program dependent. There have been several studies into the relationship between cache hit ratio and cache block size [13], [16]. These studies show that a cache block size of around 8 words is a good choice. The optimization of a cache block size would require the testing of several hundred programs with several different sizes of cache blocks and is beyond the scope of this thesis.

Cache Operation Scheme

There are three basic types of caching schemes:

1. Direct mapped
2. Set associative mapped
3. Fully associative mapped

A direct mapped cache scheme is the simplest to use and understand, but it is the least flexible of the three types [14]. In a direct mapped cache scheme, each main memory block maps onto exactly one cache block, determined by the modulo of the number of cache blocks: if there are 128 blocks, main memory block K maps onto block K modulo 128 of the cache. Set associative mapping groups the cache memory blocks into sets of blocks. Any main memory block can map into one block set of the cache, and then into only one of the blocks in that set. This allows for better utilization of the total cache memory, but it requires more overhead to control and maintain [15]. Fully associative mapping allows any main memory block to map onto any cache memory block. This cache scheme is the most flexible but, depending upon the replacement scheme, can be the most complex. If the replacement algorithm is complex, the fully associative mapping scheme requires the most CPU overhead to control and maintain [16]. If the replacement scheme is simple, such as random replacement, the fully associative mapping scheme is not the most complex and in fact is only a little more complex than direct mapping. The three schemes are explained in detail in the next section.

The contribution of each scheme to the hit ratio is determined by the amount of overhead required to control and maintain the cache scheme. The set and fully associative cache schemes allow for more flexibility, but the required overhead decreases performance.
The direct mapped cache scheme requires very little overhead, but the total cache memory is not used. Each cache operation scheme has advantages and disadvantages. A comparison of the three types to determine the absolute best is a long and complex process and beyond the scope of this thesis; the results would be program specific and inconclusive. Through experience with computer hardware and software, the best choice, especially in the first design iteration, is the simplest choice.

Cache Replacement

The replacement algorithm determines how the cache management system replaces locations in cache memory. Cache replacement is performed after a cache miss occurs. In the direct mapped scheme, the replacement algorithm is trivial: the needed memory block is automatically moved into the cache memory. The only variation is if the system designer wishes to move more than one block, usually the next or adjacent block of main memory. The other two types of cache schemes can use replacement algorithms ranging from very simple to very complex. The simplest is first-in/first-out, which only requires that each block have a counter that is increased for each unit of time the block spends in the cache memory; the cache block that is replaced is the one that has been resident the longest. A better replacement scheme is the least-recently-used or LRU algorithm. This scheme requires tracking which block has been used most recently through least recently, and the least recently used block is swapped out of the cache when replacement is necessary. There are also replacement algorithms that statistically keep track of the memory blocks and make decisions on which block should be kept. Such an algorithm is obviously the most expensive in overhead, and the efficiency or performance gain is questionable. Each replacement scheme has its advantages and disadvantages. The one that is used is completely up to the designer, who must weigh the amount of processing overhead needed to implement a replacement algorithm against the efficiency of the cache and the increase in performance.

Cache Miss Handling

When a cache miss occurs, the cache memory must be updated to allow the continued execution of a program. Just as with replacement algorithms, miss handling can range from very simple to very complex. The goal is to replace a portion of cache memory, enable continued program execution and reduce the possibility of future cache misses.

Predominant Cache Organizations

Direct Mapped Cache

The direct mapped scheme is shown in Figure 6. This scheme checks tag bits to determine if a needed main memory word is resident in the cache. The number of tag bits needed is determined by the size of the cache blocks and the number of blocks. For example, let a cache memory consist of 4 K words, with each block containing 4 words. This means there will be 1024 blocks (4 words/block × 1024 blocks = 4 K words). Let main memory addresses contain 20 bits. The cache tag words are then 8 bits long, the block address is 10 bits long, and 2 address bits are used to select one of the four words in a block. Each main memory block maps directly onto only one cache block location; in the case described above, the main memory blocks map by modulo 1024 onto the cache blocks. When the CPU generates an address, the tag bits are checked. If the check produces a positive result, the block contains the needed memory location and the CPU proceeds with a read or write to the desired word. If the tag bit check produces a negative result, the CPU must copy the needed block from main memory into the cache block.
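A behavioral sketch of this direct mapped example (illustrative Python, not the thesis circuit; all names are ours) shows how the 20-bit address splits into tag, block and word fields and how the tag check decides hit or miss:

```python
# 20-bit word address -> 8-bit tag, 10-bit block field, 2-bit word field.
# Main memory block K lands in cache block K modulo 1024.

WORD_BITS, BLOCK_BITS = 2, 10

def split_address(addr):
    word  = addr & 0b11                         # low 2 bits: word in block
    block = (addr >> WORD_BITS) & 0x3FF         # next 10 bits: cache block
    tag   = addr >> (WORD_BITS + BLOCK_BITS)    # top 8 bits: tag
    return tag, block, word

tag_directory = [None] * 1024                   # one tag word per cache block

def access(addr):
    """Return True on a cache hit; on a miss, model the block replacement."""
    tag, block, _ = split_address(addr)
    if tag_directory[block] == tag:
        return True                             # hit: read/write proceeds
    tag_directory[block] = tag                  # miss: block copied in, tag rewritten
    return False

print(access(0x12345), access(0x12345))         # miss, then hit: False True
```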
This block replacement procedure slows down the processing of a program, but if the cache has been designed well, and the locality of reference theory holds for the program, the number of cache misses will be small and overall system performance is improved.

The direct mapping system is simple and easily managed. The only flexibility for the designer is the cache miss handling procedure. There are two ways to handle the block replacement that is necessary after a cache miss:

1. When a cache miss occurs, replace the needed block and continue program execution.
2. When a cache miss occurs, replace the needed block and one or more adjacent blocks from main memory.

Figure 6. Direct Mapped Cache Memory.

The second procedure attempts to reduce the number of misses in the future execution of the program. There are a couple of examples of known cache designs that replace adjacent blocks when a cache miss occurs. This is the only time that such large amounts of data are moved into a cache at one time; buss bandwidth limitations prohibit the transfer of large amounts of data for each cache operation.

Set Associative Mapped Cache

Set associative mapping is similar to direct mapping, but is more flexible. Blocks of cache memory are grouped into sets. A two-way set associative cache has two elements per set, and each element contains a memory block. For example, let a two-way set associative cache consist of 512 sets, 2 blocks per set, and 4 words per block; the total cache is 4 K words long. Let the main memory have 20 address lines, the same as in the direct mapped example. The lower two address bits select the particular word in a block. The next 9 address bits select the set, and the final 9 bits form the tag associated with the set. Figure 7 shows a block diagram of the logic associated with this example. The cache can contain blocks 0 through 1023 just like the direct mapped example, but cache set 0 can contain block 0 and any other main memory block that is congruent to it modulo 512. This gives the CPU an option when placing the main memory blocks. A full/empty bit can be added to signify the status of a block in a set. When a cache write is desired and one of the blocks of the set is full, the CPU just writes into the other one; but if both set blocks are full, the CPU must decide which one will be replaced.

Figure 7. Two-Way Set Associative Mapped Cache Memory.
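As with the direct mapped sketch, the set and tag arithmetic of this example can be written out behaviorally (illustrative Python; all names are ours, and the replacement choice on a miss is only a placeholder for whatever rule the designer adopts):

```python
# Two-way set associative example: 512 sets of two blocks, 4 words/block.

SETS = 512

def split_address(addr):
    word      = addr & 0b11            # low 2 bits: word within the block
    set_index = (addr >> 2) & 0x1FF    # next 9 bits: one of 512 sets
    tag       = addr >> 11             # remaining 9 bits: tag for the set
    return tag, set_index, word

sets = [[None, None] for _ in range(SETS)]   # two tag elements per set

def access(addr):
    tag, s, _ = split_address(addr)
    if tag in sets[s]:
        return True                    # hit in either element of the set
    # Miss: fill an empty element if one exists, else replace element 0.
    victim = sets[s].index(None) if None in sets[s] else 0
    sets[s][victim] = tag
    return False
```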
The set associative cache requires more logic to control than the direct mapped cache. The extra logic controls which cache set block is filled by an incoming memory block and which cache set block is replaced when the set becomes full. The two-way set associative cache reduces cache misses, and four-way is better yet, but only slightly; further increases in the degree of associativity have little effect on system performance. The upgrade from direct mapping requires another level of logic and a couple more control lines. The added logic level adds delay, and the added complexity requires more design effort and takes up more silicon area. As a general rule, the decision to go from direct mapped to set associative is governed by the cache size. For a cache of 32 Kbytes, the miss ratio will be low, and a set associative cache will not significantly reduce cache misses or improve performance. For smaller caches, where the delay from cache misses dominates, the set associative design will greatly increase performance. The final decision must include the design constraints that face the system designer.

Fully Associative Mapped Cache

Fully associative mapping is the most flexible but also the most complex of the three types discussed. This mapping system allows any main memory block to map onto any cache block, as shown in Figure 8. The cache blocks are distinguished completely by their tag bits. For the 20-bit address main memory example with 4-word cache blocks and 1024 blocks, there are 18 tag bits per tag word associated with each cache block. The main memory blocks are mapped into any available cache block. When the CPU generates a memory address, all of the tag bits are checked. If the result is positive, the needed block is resident and program execution continues. If the tag check produces a negative result, the necessary main memory block is moved into the cache memory. The fully associative mapping technique allows complete freedom in replacing a block when a cache miss occurs. However, it might not be practical to implement a complex replacement scheme. The cache management for a 1024-way associative memory is not trivial and requires a large design effort. The fully associative cache system requires the checking of a large number of tag bits, and the information that determines replacement also needs to be kept in a memory table. The replacement algorithm needs to determine which block is going to be replaced in the same amount of time that the tags can be checked, or system performance is degraded. The overhead in controlling and managing a fully associative cache is often prohibitive, and the design effort that goes into one is long and expensive, which deters most designers from implementing it.

Figure 8. Fully Associative Mapped Cache Memory.

Cache Memory Topics

Tag Directories and Tag Checking

The contents of a cache are constantly changing as a program is executed. When a memory request is issued by the CPU, the cache is checked to see if the desired location is present. This is done with a tag directory or table look-aside buffer. The tag directory must be checked as fast as possible so as not to inhibit the speed of a cache memory access. The fastest method is an associative search, performed on all tag locations at once. The content addressable memory is the most common form of tag directory.
The content addressable memory scheme provides comparison logic for each memory location and bit. This adds considerable cost to each bit in a CAM, so the size of a CAM or tag directory is kept as small as possible. Suppose the cache is direct mapped into 64 blocks and the tag for each block is 8 bits wide. The necessary CAM is 8 cells wide and 64 tag words in length. The CAM contains write logic to modify the tag directory when a new memory location is stored into the cache. A tag match line is associated with each tag word and is precharged prior to a tag directory search. If one bit of a tag word does not match the desired tag address, the tag match line is pulled down, signifying a tag word miss. All of the tag match lines are combined through OR gates to form the hit/miss line. Figure 9 shows the logic layout of a CAM cell.

Figure 9. Content Addressable Memory Cell Logic Diagram.

The tag word that leaves its tag match line precharged contains the desired memory location, and access is given to the CPU. A miss occurs when all of the tag match lines are discharged after a tag search. The CPU is notified by the hit/miss line and proceeds with a replacement procedure. The replacement procedure brings in the desired memory location and can also bring in extra memory blocks to help avoid another cache miss. The associative memory is the fastest memory scheme available for tag checking: the logic associated with each cell allows all the tag cells to be checked simultaneously. This checking scheme is non-destructive, allowing the tag directory to be checked any number of times without a loss of data.

Writing a new address into a tag word is done in parallel. The tag write logic supplies the desired tag address, address bits A2-A11 select the desired tag word, the tag write line is pulsed, and the new block address is written into the tag word.
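A behavioral model of the search and write operations just described may help fix the idea (illustrative Python; the hardware compares every tag word in parallel in a single operation, so the loop below only models the result, not the timing):

```python
# Behavioral CAM tag directory, sized for the 64-block, 8-bit-tag example.

TAG_WORDS, TAG_BITS = 64, 8

tag_directory = [0] * TAG_WORDS

def tag_search(wanted_tag):
    """Each match line stays precharged (True) only if all bits match."""
    match_lines = [stored == wanted_tag for stored in tag_directory]
    hit = any(match_lines)             # OR of all tag match lines
    return hit, match_lines

def tag_write(word_select, new_tag):
    """Address bits select one tag word; the new block address is written."""
    tag_directory[word_select] = new_tag & ((1 << TAG_BITS) - 1)
```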
Main/Cache Memory Consistency

After the CPU modifies a cache location with a write command, the modification must make its way back to the main memory; main memory must be kept consistent with modifications that take place in the cache. There are two common methods of forcing consistency between the cache and the main memory: write-through (or store-through) and write-back (or copy-back). The write-through technique writes a modified cache location out to memory every time there is a write to the cache. This system is very reliable but can tie up a microprocessor's buss structures. The write-back procedure only writes a modified cache location to memory when the location is swapped out of the cache; a cache location may be modified several times before it is moved back to main memory. The write-back procedure lowers buss traffic but is more complex to control and is not 100% reliable [17]. Write-back reduces memory traffic but requires complex logic to do so significantly; the reduction of buss traffic does not affect the miss ratio of a cache memory system, so low system buss traffic must be necessary to justify the cost of a write-back cache implementation. The easiest way to implement a write-back cache scheme is to mark a block as dirty when the CPU writes to it. The whole block is then written to main memory when the block is replaced or the program ends. The number of times that a whole cache block is modified by a CPU is very small, however, and several memory cycles are wasted writing the whole block. The only way to alleviate this problem is to assign a dirty/clean bit to each cache location. The block is then scanned for modified (dirty) locations, which are written to main memory. This procedure adds more complexity to the cache design, and the system designer must decide whether the extra logic, silicon area and design effort are worth the increase in system performance.

If write-through is used, main memory always contains an up-to-date copy of all information in the system. If the microprocessor is used in a multiple processor environment, write-through simplifies the multiple cache consistency problem. The write-back method results in the cache containing the only valid copy of data, and an error correcting code is needed to ensure high reliability.
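The two policies can be contrasted with a small behavioral sketch (illustrative Python; a flat word-indexed model with the per-word dirty/clean bit refinement described above):

```python
# Write-through vs. write-back, modeled on flat word arrays.

main_memory = [0] * 4096
cache_data  = [0] * 4096
dirty       = [False] * 4096           # per-word dirty/clean bit

def write_through(index, value):
    cache_data[index] = value
    main_memory[index] = value         # every cache write also goes to memory

def write_back(index, value):
    cache_data[index] = value
    dirty[index] = True                # main memory update is deferred

def evict(index):
    """On replacement, only dirty words are copied back to main memory."""
    if dirty[index]:
        main_memory[index] = cache_data[index]
        dirty[index] = False
```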
General Cache vs. A Split Instruction/Data Cache

Cache bandwidth and access time can be improved by splitting the cache into a data cache and an instruction cache. The bandwidth is doubled because two memory requests can be serviced at once. Fast computers are pipelined, which means that several instructions are simultaneously fetched and decoded. Most pipelines have several stages, including instruction fetch, instruction decode, operand fetch and result transmission to the proper destination. Logic pertaining to instruction fetch and decode generally does not involve the operand fetch and store except when instruction execution takes place. If the computer designer feels that it is worthwhile, the logic to handle both a data and an instruction cache can be implemented and system performance will be improved. There are several problems with a split cache scheme. The instructions must be kept separate from the data. Keeping main memory consistent with both caches is complex and requires extra overhead. A split cache generally results in inefficient use of cache memory [18]. A split cache miss ratio vs. a unified cache miss ratio is shown in Figure 10.

Figure 10. Miss Ratio of Split and Unified Cache vs. Memory Capacity.

A split cache can be divided several ways. The split could be 50% data and 50% instruction (split equal) or some other more optimum split (split unequal). The unequal split in cache size is determined by running several performance measures and adjusting the different sizes to produce an improvement in system throughput. The split equal, split unequal and unified cache systems all perform about the same with respect to miss ratio. The system designer must decide whether the increased bandwidth justifies the cost of the extra logic and CPU overhead.

CHAPTER 3

DESIGN METHODOLOGY

The design of an on-chip cache system is governed by several constraints and conditions. The constraints in this cache implementation are:

1. The cache structure is geared toward a microprocessor.

2. Cache operation must require very few CPU instruction executions and be transparent to the user. The mapping scheme must be kept manageable. The cache initialization and replacement procedures must be straightforward and require very little CPU overhead. The memory consistency procedure must keep main memory valid without CPU intervention.

3. The cache structure must not increase total silicon area to an unproducible size. The block size must be small enough not to bottleneck the microprocessor memory buss and big enough to produce low miss ratios.

4. The cache tag directory must be fast and require very little logic to maintain and control. Modifying the tag words must require very little delay or CPU overhead.

5. The layout technology of the cache memory must be compatible with the layout technology of the microprocessor. For this thesis, the layout software must be available at Montana State University. The layouts must be simulated to test system operation.

6. The memory cell used to store the cache data must have a fast access time and add a minimal amount of complexity to the microprocessor.

The above constraints were kept in mind when the on-chip cache structure was designed. The following sections explain the choices made and how they reflect the system constraints. The information needed to form design decisions originated from information on cache systems and theory, microprocessor vendors, memory vendors and conventional cache memory systems.

There are several commercially available microprocessors that could benefit from an on-chip cache memory system. Some of these are:

Intel 8086, 80286 & 80386
Motorola 6800, 68000, 68010, 68020 & 68030
National Semiconductor 3200 & 32000
Texas Instruments 4000, 4400 & 9900
Zilog Z80 & Z8000

The cache design is not geared toward a specific microprocessor. A general design is done which could be modified slightly and applied to a specific microprocessor; the search for an optimum microprocessor is beyond this thesis. The only specific design choice made was to use 32-bit data words and 32-bit addresses, which makes the cache design inapplicable to 8 or 16-bit microprocessors.

General Operation and Data Flow

Cache Initialization

The cache memory words are initialized upon CPU start-up or power-up. The CPU will run a small program from a start-up ROM that writes the first page of memory into the cache. In an MS-DOS operating environment, the cache would be initialized with 0A00H. Cache initialization can be accomplished in two ways: one way is to include a valid bit in the cache system, and another way is to initialize the cache with a page of memory. The valid bit associated with each cache word would be set to "invalid" upon system power-up. The CPU would read and write only to valid cache locations. A cache block is validated after a tag address and the memory words are written to the cache block. The valid bits would not be used often enough to justify their placement on chip, and it would not be practical to put them off-chip.

Memory Access Cycle

The CPU submits a memory request by placing the desired address on the address buss followed by a read or write control sequence. The memory request cycle is shown in Figure 11 and follows this sequence:

1. The upper 20 address bits are applied to the tag directory search logic.

2. The tag search is performed, and the hit/miss signal is validated.

The following four steps are taken when the tag search results in a cache hit. The tag hit line signals the CPU to continue the cache access and provide the read signal at the desired time.

3. The tag match line and the set address bits are decoded to access the desired cache block. This is performed almost simultaneously with the tag directory search.

4. The decoded set line and the cache block address bits are decoded to access the desired cache word.

5. The read/write signal is applied by the CPU.
Memory Access Cycle

The CPU submits a memory request by placing the desired address on the address buss, followed by a read or write control sequence. The memory request cycle is shown in Figure 11 and follows this sequence:

1. The upper 20 address bits are applied to the tag directory search logic.
2. The tag search is performed, and the hit/miss signal is validated.

The following four steps are taken when the tag search results in a cache hit. The tag hit line signals the CPU to continue the cache access and provide the read signal at the desired time.

3. The tag match line and the set address bits are decoded to access the desired cache block. This is performed almost simultaneously with the tag directory search.
4. The decoded set line and the cache block address bits are decoded to access the desired cache word.
5. The read/write signal is applied by the CPU.
6. Data is either transferred to the 32-bit data buss (in the case of a read operation) or transferred from the data buss and written into the cache word (in the case of a write operation).

The following five steps are taken when the tag search results in a cache miss. The tag hit/miss line signals the CPU to access main memory and perform a cache replacement operation.

3. The tag write line is applied, and the set address bits are decoded, allowing the upper 20 address bits to be written into the tag directory.
4. The tag match lines are applied.
5. The block address bits are decoded.
6. The write line is applied by the CPU, and data from the data buss (and main memory) is written into the cache memory word.
7. The write operation is performed three more times, filling the cache block from main memory. This is performed in a burst mode which requires only the decoding of the block address (lowest two address bits).

Figure 11. Memory Request Cycle Timing Diagram.

Size and Organization

Some computer designers feel that it is very advantageous to separate the instruction cache from the data cache. The efficiency and performance gain is questionable [18]. Logic that keeps instructions separate from data is required and adds overhead to the operation and maintenance of a cache system. The cache system designed here for a microprocessor is a generic data and instruction cache which is easily maintained and operated. The cache is direct mapped and write-through. A cache miss initiates the replacement scheme. The organization fits well onto a microprocessor, lowers the main memory cycle time and requires very little CPU overhead. The logical flow of data in and out of the cache structure is trivial and fast. The layout is only of medium complexity and fits on-chip without increasing the total chip size significantly.

Mapping Scheme

Direct mapping was chosen because it requires the least logic to control [19]. It is not the most efficient, but some of the loss in efficiency is regained through simplicity. Direct mapping requires very little logic beyond address decoding and tag checking. The direct mapping scheme requires a smaller silicon area than set associative or fully associative cache memories. If a nested loop is large compared to the cache size, then a direct mapped cache is as efficient as a fully associative one [20].
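The chosen organization can be summarized in a short Python model, given as an illustrative sketch only: a 32-bit address splits into a 20-bit tag, a 10-bit block (set) index and a 2-bit word index, a miss burst-fills the four-word block, and writes go through to main memory. All names are hypothetical.

    def split_address(addr):
        # 32-bit address = 20-bit tag | 10-bit block index | 2-bit word index.
        word = addr & 0x3                 # A1-A0: word within the 4-word block
        block = (addr >> 2) & 0x3FF       # A11-A2: one of 1024 cache blocks
        tag = (addr >> 12) & 0xFFFFF      # A31-A12: held in the tag directory
        return tag, block, word

    class DirectMappedCache:
        def __init__(self):
            self.tags = [None] * 1024
            self.data = [[0] * 4 for _ in range(1024)]

        def access(self, addr, main_memory, value=None):
            tag, block, word = split_address(addr)
            hit = self.tags[block] == tag
            if not hit:
                # Miss: burst-fill the four-word block and rewrite the tag.
                base = addr & ~0x3
                self.data[block] = [main_memory[base + i] for i in range(4)]
                self.tags[block] = tag
            if value is None:             # read
                return self.data[block][word], hit
            self.data[block][word] = value
            main_memory[addr] = value     # write-through keeps main memory valid
            return value, hit

Because the block index comes directly from the address, no placement decision is ever made, which is the source of the simplicity argued for above.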
Initialization

When a microprocessor is powered up, the cache memory and the cache tag directory will contain random data. The CPU may make a request, and the tag directory could indicate that the needed information is contained in the cache before any valid data has been written into it. One approach to eliminating this problem is to have the CPU run a procedure upon start-up that sets a valid bit on all cache memory locations. The valid bit is checked to determine whether the cache memory location is valid or not. Another and better way is to have the CPU run a tag directory initialization upon start-up. The initialization could write zeros into all tag directory cells. Memory location 0000 can then be reserved for the system or not used at all. This will ensure that data is not accessed from the cache until the cache has been written into.

Replacement Procedure

The replacement procedure is trivial in a direct mapped cache system. In fact, the only choice in replacement structures that the designer has is when cache blocks are replaced, not where they go. The system described here uses a cache-miss-generated replacement scheme. A cache block is replaced only when a miss occurs. A word of a cache block could be used to signify that the next CPU access will be in another block, but there is no simple way to tell which block will be needed next. The logic required for replacement prediction is not trivial and is too complex for application on a microprocessor.

Memory Consistency

The write-through consistency scheme is employed because of its high reliability and easy maintenance. The microprocessor address and data buss will be used more often, but the increased traffic should not create a bottleneck. One drawback to this scheme is that main memory cannot keep up with a series of writes. The main memory controller will notify the CPU to halt program execution until main memory has caught up.

Cache Size

The overall cache size is governed mainly by the size of the individual memory cell and the size of the tag directory. The I/O logic and address decoding logic require 11% of the total silicon area for the cache memory. The tag directory requires 16%, and the memory cells fill the remaining 73% of the total silicon area. A larger cache produces a higher hit ratio, but the optimum performance per size is around 4K words. The minimum useful cache size for the Intel 80386 is 4K words [21]. The complete cache system with 4K words, a 1024-word tag directory and all required support logic occupies 464.8 square millimeters, or 849 x 850 square mils. The cache memory system contains approximately 1,000,000 transistors; this would raise the total transistor count of an MC68000-based design to 1.1 million transistors and bring the total silicon area to approximately 920 x 920 square mils.

Block Size

The block size is four words. This allows relatively high hit ratios and requires only four continuous memory access cycles to replace a cache block. The small block size allows for the storage of many different possible software loops in the cache structure.

Tag Directory

The tag directory block diagram is shown in Figure 12. The tag directory is a direct mapped CAM with very simple write logic and hit/miss signaling. The decoded set address allows for the writing of a tag word without a second decoding of the memory address. The CAM directory is searched in parallel. The hit/miss signal indicates whether the desired memory word is located in the cache memory. The only overhead associated with the tag directory is writing new addresses when cache blocks are replaced.

Figure 12. Tag Directory Block Diagram.

Static RAM memory cells could be used instead of CAM memory cells. The circuit layouts, which are shown in Appendix C, show that static RAM cells are only 28% smaller than CAM cells when laid out in a two-metal CMOS process. The total area for the cache system is only 4% smaller when SRAMs are used instead of CAMs. CAM tag cells allow for easier tag checking; SRAM tag cells require that the contents of the tag word be transferred to the address buss and XORed with the desired tag address to check for a cache hit. The price in area paid for the use of CAM memory cells is small and results in gains of simplicity and speed.
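The difference between the two tag-checking styles can be made concrete with a behavioral sketch (illustrative Python only; the hardware performs these comparisons with match lines and XOR gates, not software):

    def cam_tag_check(tag_words, search_tag):
        # CAM: every tag word compares all 20 bits at once; a match line
        # stays high only if no cell in that word mismatches.
        match_lines = [stored == search_tag for stored in tag_words]
        return any(match_lines), match_lines

    def sram_tag_check(tag_words, set_index, search_tag):
        # SRAM: the one candidate tag must first be read out, then XORed
        # with the desired tag; an all-zero result means a hit.
        return (tag_words[set_index] ^ search_tag) == 0

The CAM version needs no read-out step before the comparison, which is the speed advantage cited above.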
Top-Down Design

The system block diagram is shown in Figure 13. The cache system is shown with the CPU. The data and control signals that connect the CPU to the cache are also shown.

Figure 13. CPU - Cache Memory System Diagram.

Figure 14 shows the block diagram of the cache memory. The major portions are:

1. Cache Tag Directory
2. Cache Block Address Decoder
3. Cache Memory Blocks

The tag directory contains all the tag CAM and is searched when a memory request is issued. The block address decoder decodes address bits A2 - A11 and selects the correct tag directory word during a cache replacement operation. The word address decoder selects the correct cache memory word during a memory request. The cache memory blocks store four memory words and contain the two-bit word address decoder.

Figure 14. Cache Memory System Block Diagram.

Figure 15 shows the logic diagram of a cache memory block. The cache words contain 32 memory cells and are connected to the data buss. Address lines A0 and A1 are used to select the desired cache word from a cache block. The two-bit address decoder, shown in Figure 16, is a precharge logic decoder combined with an AND gate. The address is decoded and "ANDed" with the decoded block line to select the desired cache word.

Figure 15. Cache Block Logic Diagram.

Figure 16. Two-Bit Address Decoder Logic Diagram.

The two possible memory cell designs are dynamic memory cells and static memory cells. The dynamic memory cell offers a medium access time and consumes only a few square microns. The static memory cell is faster but requires more silicon area. The dynamic memory cell must be refreshed every few milliseconds, and the stored information must be rewritten into the cell after each read operation. The static memory cell requires very little supporting logic and retains information as long as power is supplied. Static RAM memories are used almost exclusively in small buffer memories and cache memories.

The logic diagram for a dynamic memory cell is shown in Figure 17. The cell requires only 3 transistors and occupies less silicon area than a static memory cell. The information is stored on the gate of transistor #2. A charged gate represents a logic 0, and an uncharged gate is a logic 1. The read data line is precharged just before a read operation is performed. When the read strobe line is pulsed, the read data line will be discharged if the gate of transistor #2 is charged and will remain charged if the gate of transistor #2 is uncharged. The leakage resistance associated with the gate determines how often the gate must be refreshed to preserve the stored information. The required refresh and I/O logic demand a significant amount of design effort and extra silicon area. The access time is too slow to justify the cost of the extra complexity. Dynamic RAMs are used exclusively in large main memories.
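The read mechanism just described condenses into a toy model (illustrative Python only; no timing, leakage or refresh is modeled):

    def dram_3t_read(gate_charged):
        # Figure 17 read: the read data line is precharged high, then the
        # read strobe pulses; a charged storage gate (logic 0) discharges
        # the line through transistor #2, while an uncharged gate (logic 1)
        # leaves it high, so the line ends up at the stored logic value.
        read_data_line = 1                    # precharged
        if gate_charged:
            read_data_line = 0                # discharged via transistor #2
        return read_data_line

    assert dram_3t_read(gate_charged=True) == 0   # charged gate stores logic 0
    assert dram_3t_read(gate_charged=False) == 1  # uncharged gate stores logic 1

Note that the read is destructive only in the sense described above: the surrounding logic must rewrite the cell, which is part of the support-logic cost weighed next.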
The logic diagram for a static RAM memory cell is shown in Figure 18. The cell requires 6 transistors and occupies more silicon area than a dynamic memory cell. The information is stored on two cross-coupled inverters that continually drive each other. The static memory cell will retain information for as long as it is supplied with power. The word line is pulsed just prior to a read or write operation. A write operation forces the data lines to the desired signal level (0 volts for a logic 0 and +5 volts for a logic 1). A read operation allows the cross-coupled inverters to drive the data lines to the voltage levels contained on the inputs of the cross-coupled inverters.

Figure 17. Dynamic Memory Cell Logic Diagram.

The dynamic RAM cell requires more I/O logic and a refresh sequencer. Conservation of silicon area and simplicity of design are the two most important factors in this design. The addition of a cache memory module must be small and simple, and dynamic memory cells require too much support logic to justify their use. The six-transistor static RAM cell requires a sense amplifier for a read operation and a speed-up circuit to increase the speed of read and write operations. One sense amplifier is used to drive a column of static memory cells because only one cache memory word is accessed at a time. When the memory cell is pulsed for a read, the cross-coupled inverters start to modify the data and data-bar lines, but they are not large enough to drive the data lines in a reasonable amount of time. The sense amplifier "senses" the small change in the data lines and drives them quickly to the logic levels specified by the SRAM. Figure 19 shows the logic and timing diagram for the sense amplifier.

Figure 18. Static Memory Cell Logic Diagram.

Figure 19. Sense Amplifier Logic and Timing Diagram.
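The read sequence of Figure 19 can be paraphrased in a few lines (a behavioral Python sketch with made-up voltage figures, not a circuit simulation):

    def sram_read_with_sense_amp(stored_bit):
        # Both data lines are precharged high before the word line pulses.
        data, data_bar = 5.0, 5.0
        # The cross-coupled inverters begin to pull one line low, but only
        # by a small amount in the available time (0.5 V is arbitrary here).
        if stored_bit == 1:
            data_bar -= 0.5
        else:
            data -= 0.5
        # The sense amplifier detects the small differential and snaps the
        # lines to full logic levels, which is what the speed-up signal times.
        return 1 if data > data_bar else 0

    assert sram_read_with_sense_amp(1) == 1 and sram_read_with_sense_amp(0) == 0

The design choice is that a small, weak cell plus one shared sense amplifier per column is cheaper in area than making every cell strong enough to drive the long data lines itself.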
The block address decoder logic diagram is shown in Figure 20. Address bits A2 - A11 are decoded to select the correct cache memory block during a memory request and the correct tag word during a tag directory write. The decoder is a two-input AND gate. One input is the tag match line and the other is a precharged line. The precharged line is discharged through a pass transistor if the applied address is not the block's correct address.

Figure 20. Block Address Decoder Logic Diagram.

The tag directory block diagram is shown in Figure 21. The tag words are connected in parallel to the upper 20 address bits of the address buss. A tag word will leave the match line high if the searched address is contained in its memory cells. The tag match signal is passed to the hit/miss signal, notifying the CPU that the desired memory word is contained in the cache. The tag directory is updated during a cache replacement. The tag write logic diagram is shown in Figure 22. The decoded block line and the tag write line are "ANDed" to produce the tag word line that allows the tag word to be written.

Figure 21. Tag Directory Block Diagram.

Figure 22. Tag Write Logic Diagram.

The tag word contains 20 content addressable memory cells (CAM cells). The cells are all connected to the tag match line. During a tag search, any tag cell can pull the match line low, signifying that the desired tag address is not contained in this tag word. The CAM cell contains two cross-coupled inverters to store the tag bit and pass transistors that are pulsed during a tag search or a tag write. Figure 23 shows the nine-gate CAM cell logic diagram.

Figure 23. Nine-Gate CAM Cell Logic Diagram.

The read and write operations are the same as for a normal static memory cell. A search operation is performed by (1) precharging the match line and (2) applying the data to the bit-bar line and data-bar to the bit line. The match transistor will remain off if the data in the cell matches the data on the bit and bit-bar lines. The match transistor of any cell in the tag word can discharge the match line during a tag search.

CPU Communications and Timing

The CPU communicates with the cache memory through the:

1. 32-bit address buss
2. 32-bit data buss
3. precharge signal
4. read/write signal
5. perform tag search signal
6. tag write signal
7. hit/miss signal
8. data speed-up signal

The timing diagram for a typical read cycle is shown in Figure 24. The CPU generates the following signals:

1. The precharge signal is applied.
2. The address bits are applied to the cache memory system.
3. The tag search signal is applied.
4. The tag match line will stay high if there is a cache hit.
5. The tag pass signal is applied, and the correct cache word is addressed.
6. The read signal is applied, followed by the data speed-up signal.
7. The desired cache memory word is now on the data buss, and the CPU continues processing.

Figure 24. Typical Cache Read Timing Diagram.
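The required ordering of these signals can be captured in a small checker (an illustrative Python sketch; the signal names mirror the list above, the sequencing test is hypothetical):

    READ_CYCLE_ORDER = [
        "precharge",        # 1. precharge the match and data lines
        "address_valid",    # 2. address applied to the cache
        "perform_search",   # 3. parallel tag search
        "tag_match",        # 4. match line left high on a hit
        "tag_pass",         # 5. match gated through to the word decode
        "read",             # 6. read strobe ...
        "speed_up",         #    ... followed by the data speed-up pulse
        "data_valid",       # 7. word driven onto the data buss
    ]

    def check_ordering(events):
        # True if the observed events respect the required read-cycle order.
        positions = [READ_CYCLE_ORDER.index(e) for e in events if e in READ_CYCLE_ORDER]
        return positions == sorted(positions)

    assert check_ordering(["precharge", "address_valid", "perform_search",
                           "tag_match", "tag_pass", "read", "speed_up", "data_valid"])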
CHAPTER 4

FINAL DESIGN

Circuit Layout

The complete layout is done with 10 cells. The layouts are:

1. Static RAM Cell.
2. Content Addressable Memory Cell.
3. Cache Word Address Decoder.
4. Cache Block Address Decoder.
5. Tag Line Pass and Tag Word Select Logic.
6. Top Tag Write Logic.
7. Bottom Tag Write Logic.
8. Static RAM Speed Up Logic.
9. Address Inverter.
10. Tag Match Line Precharge Logic.

The logic diagrams and circuit layouts are shown in Appendix A. All of the circuits are documented with circuit specifications, circuit layouts, logic diagrams and circuit parasitic calculations.

Circuit Parameters

The circuit parameters are derived from the layout artwork. The parasitic capacitance and resistance depend upon interconnection length and location relative to the other layers of the layout. The circuit parameters determine the performance of the system. The parasitic capacitance determines the maximum speed of the system and the power consumption. In the cache system designed here, the critical circuit parameter is the capacitance associated with the data lines and address lines that connect all the cache blocks together. The capacitance values are determined by following the calculation rules explained in Appendix B.

Figure 25. Cache Memory System and Subsystem Sizes.

Subsystem                              Size (mm²)
Tag Directory                              73.5
Cache Memory Words                        338.6
Tag Match Pass and Write Logic              8.2
Block Address Decoder                      19.6
Word Address Decoder                       24.3
Speed Up                                    0.3
Tag Write Logic                             0.3
Total System Size                         464.8

System Size

The total system size is 464.8 mm². The cache memory words occupy 73% and the tag directory 16% of the total layout. The address decoding circuits and other necessary logic make up the remaining 11% of the layout. The system and subsystem sizes are shown in Figure 25. The combination of an MC68000 and the cache system would occupy almost a full square inch of silicon area and contain 1.1 million transistors. The yield for a chip this size would be low, on the order of about one chip per wafer. This yield would escalate the cost of the processor/cache memory chip.
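The percentage figures quoted above follow directly from the subsystem areas in Figure 25, as this short Python check shows (the dictionary keys are abbreviations of the subsystem names):

    subsystems_mm2 = {
        "tag_directory": 73.5,
        "cache_memory_words": 338.6,
        "tag_match_pass_and_write": 8.2,
        "block_address_decoder": 19.6,
        "word_address_decoder": 24.3,
        "speed_up": 0.3,
        "tag_write": 0.3,
    }
    total = sum(subsystems_mm2.values())          # 464.8 mm^2
    for name, area in subsystems_mm2.items():
        print(f"{name:26s} {100 * area / total:5.1f}%")
    # cache_memory_words ~72.9%, tag_directory ~15.8%, remainder ~11.3%,
    # matching the 73% / 16% / 11% split quoted above.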
Alternate Design - Fully Associative Cache

The extra silicon area required to use content addressable memory (CAM) cells in a CMOS environment is small relative to the size of static RAM (SRAM) cells, so a fully associative cache memory is not impossible to implement. The fully associative cache scheme differs only by a few hardware items from the direct mapped cache implementation; it also differs in timing and control. Four changes that would transform the direct mapped cache design into a fully associative cache are:

1. Ten more bits per tag word.
2. A more complex replacement scheme and an address generator for cache replacement.
3. A full/empty bit associated with each cache block and the logic to check for empty space when a cache block is received from main memory, or a start-up procedure that fills the cache memory.
4. Additional communication signals and a more complex timing scheme.

A system diagram of a fully associative cache memory system is shown in Figure 26. The shaded areas define the extra area and logic needed to transform the direct mapped cache into a fully associative mapped cache.

Figure 26. System Diagram of a Fully Associative Mapped Cache Memory.

Subsystem                              Size (mm²)
Tag Directory                             125.8
Cache Memory Words                        338.6
Tag Match Pass and Write Logic              8.2
Block Address Decoder                      19.6
Word Address Decoder                       24.3
Speed Up                                    0.3
Tag Write Logic                             0.3

Tag Word Length

A fully associative cache memory would require 30-bit tag words: ten more CAM cells per tag word than the direct mapped design. This requires an extra 9 mm² of silicon area but increases the total silicon area for the cache system by only 1.9%.

Full/Empty Designation

The CPU must know when the cache memory is full. A cache replacement is not necessary until the cache memory is full and a memory location other than what is contained in the cache is needed. A full/empty bit associated with each block is one possible solution. Assuming that the cache memory is filled with the first page of the operating system upon system start-up could replace the needed full/empty-bit hardware. The full/empty bit is not used enough to justify locating it on-chip and is impractical to locate off-chip. This solution requires a specific software environment that initializes the cache memory upon power-up and after any system failure.

Replacement Scheme

The fully associative organization allows complete flexibility in the placement of cache blocks, which results in more efficient use of the cache memory space. The simplest replacement scheme is random replacement. After it has been determined that a cache replacement is needed, a block is chosen at random by a random number generator. The new block is then written into the cache.

Communications and Timing

The data flow of a fully associative mapped cache memory request cycle that results in a cache hit is identical to the data flow of a direct mapped cache memory cycle. When a memory request results in a cache miss, the replacement of a cache block requires the generation of the block address. A random number generator will suffice for a random replacement algorithm. The block address (generated from the random number generator) is applied to a block address decoder which selects the location for the needed main memory block. Program execution then continues until another cache miss occurs. The extra communication signals required of the CPU are:

1. A signal to start the random number generator.
2. A signal to apply the random cache block address to the address decoder.

Advantages of Fully Associative Mapping

There are some advantages to using a fully associative mapped cache. The advantages are best displayed through the execution of benchmark programs and comparison of the execution times. Possible advantages are:

1. The cache is used more efficiently.
2. The use of CAM cells is fully justified.
3. Several different types of replacement schemes are possible.
4. The hit ratio could be higher, resulting in shorter memory access times.

The fully associative mapped cache is not implemented or simulated because the advantages are only visible after some benchmark programs are simulated. This exercise is beyond the scope of this thesis.
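For comparison with the direct mapped model given earlier, the fully associative alternative can be sketched as follows (illustrative Python only; the hardware would use a parallel CAM search and a hardware random number generator, and all names here are hypothetical):

    import random

    class FullyAssociativeCache:
        def __init__(self, num_blocks=1024, seed=0):
            self.tags = [None] * num_blocks   # 30-bit tags: any block anywhere
            self.rng = random.Random(seed)    # stands in for the RNG circuit

        def lookup(self, tag):
            # Every tag word is searched at once in the CAM directory.
            return tag in self.tags

        def victim(self):
            # Random replacement: this index would drive the block decoder.
            return self.rng.randrange(len(self.tags))

The 30-bit tag follows from the 32-bit address less the 2 word-index bits: with no set index, the entire block address must be stored and compared.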
CHAPTER 5

DESIGN SIMULATION

Design and Simulation Tools

Layout Software

The QUICK KIC layout package was used in the layout and design of the on-chip cache. This package is the only software package available at Montana State University that is capable of a layout of this size. The package includes a design rule checker and produces layouts that can be processed and sent to a fabrication facility.

Circuit Simulation

The critical signal paths are simulated to produce performance measures. The simulation software is TSPICE, a TEKTRONICS version of Berkeley SPICE. The QUICK KIC graphics editor includes an electronic schematic editor. The schematic editor was used to create the TSPICE circuit definition file. This file contains all transistor sizes, node connections and transistor types. The parasitic capacitances were calculated by hand and added to the circuit definition file to create a complete model of a cache memory word. A sample parasitic capacitance calculation is given in Figure 27; the capacitance associated with the input of an inverter is calculated:

Thin oxide: 2 x 16 x 6.67E-4 pF = 21.4 fF
Polysilicon to substrate: 2 x 20 x 0.48E-4 pF = 1.92 fF
Total Cin capacitance = 23.36 fF

Figure 27. Sample Parasitic Capacitance Calculation.

Layout Technology

The layout technology used is a two-metal CMOS process. Most of today's microprocessors are constructed with a two- or three-metal CMOS process. CMOS transistor technology has replaced bipolar TTL and NMOS technologies over the last few years. There are several reasons for this evolution:

1. Smaller, more compact layouts are easily achieved.
2. The noise immunity is better than both bipolar TTL and NMOS.
3. The power consumption is the lowest among current circuit technologies. CMOS circuits only draw current when they are switching. Very high speed CMOS circuits consume as much power as some TTL.
4. The speed of CMOS circuits rivals fast bipolar TTL.

The layout technology follows the MOSIS scalable design rules, version 6. The layout rules are given in Appendix C.

Speed Estimate

The cache word was simulated with a 50 MHz dual cycle clock. The access time is 40 ns for both a cache read and a cache write. The block replacement procedure takes 80 ns. The address and data lines all stabilize to the desired logic levels in the time specified by the timing diagram. Appendix D contains the complete circuit schematic and the SPICE input file used to simulate a cache memory access with a cache hit. All possible data and address combinations were simulated for cache reads and writes. The cache block replacement was also simulated. The data read is a 0, and the tag address checked was also a 0. The resulting transient analysis plot is shown in Figure 28.

Figure 28. Transient Analysis of a Cache Memory Request (tag match line, data-bar line and data line).

Power Consumption Estimate

The power dissipation of CMOS circuits is a function of the frequency, the voltage and the capacitance associated with the circuit. The power dissipation is calculated by

Pd = Cl * Vdd^2 * f    (8)

as described by Weste and Eshraghian [22]. Pd is the dissipated power, Cl is the load capacitance, Vdd is the source voltage, and f is the frequency at which the output is driven. The power dissipation is estimated for each cell of the layout operating at 25 MHz. Not all cells in the cache memory will be operating at one time. The total power dissipation is the combined power dissipation of:

1. all the partially active address decoding cells
2. all the active tag cells
3. one fully active data word.

The power dissipation estimate comes from driving all these cells at the maximum frequency.
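Equation (8) is easy to apply directly; the one-liner below (Python) reproduces the tag-cell figure derived in the hand calculation that follows:

    def dynamic_power(c_load_farads, vdd_volts=5.0, freq_hz=25e6):
        # Equation (8): P = C * Vdd^2 * f.
        return c_load_farads * vdd_volts**2 * freq_hz

    print(dynamic_power(21.6e-15))   # 1.35e-05 W, i.e. the 13.5 uW tag-cell value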
The power dissipated in the tag cell is calculated as follows:

C Address        = 7.5 fF
C Address Bar    = 5.8 fF
C Tag Match Line = 8.3 fF
Total capacitance = 21.6 fF

Power dissipation = 21.6 fF * (5 volts)² * 25 MHz = 13.5 µW

The power dissipation for each individual cell is given in Table 4. The total power consumption for the cache memory is 308 mW and is easily dissipated in a silicon die this size. There are no power dissipation problems, and the power dissipation for a CPU/cache memory system would be raised by 50 percent (from 0.6 to 0.9 watts).

Table 4. Power Dissipation of Cache Memory System.

Individual power dissipation (active cells):
Tag Cell:              13.5 µW x 20,480 cells = 277 mW
SRAM Cell:             27.1 µW x 32 cells     = 0.87 mW
Word Address Cell:     142 µW x 1 cell        = 0.14 mW
Block Address Cell:    341 µW x 1 cell        = 0.34 mW
Tag Pass Cell:         227 µW x 1 cell        = 0.23 mW
Top Tag Write Cell:    213 µW x 20 cells      = 4.3 mW
Bottom Tag Write Cell: 213 µW x 20 cells      = 4.3 mW
Speed Up Cell:         92.9 µW x 2 cells      = 0.19 mW
Address Inverter Cell: 90.7 µW x 12 cells     = 1.1 mW
Tag Precharge Cell:    45.7 µW x 1 cell       = 0.05 mW

Individual power dissipation (partially active cells):
SRAM Cell:             8.19 µW x 127 cells    = 1.04 mW
Word Address Cell:     44.7 µW x 127 cells    = 5.68 mW
Block Address Cell:    197 µW x 31 cells      = 6.11 mW
Tag Pass Cell:         57.8 µW x 31 cells     = 1.79 mW
Tag Precharge Cell:    45.7 µW x 127 cells    = 5.80 mW

Total power dissipation: 308.6 mW

System Performance Estimate

The system performance estimate includes an estimate of the hit ratio. This estimate comes from the graphs in Chapter 2 and carries some uncertainty, which propagates into the performance estimate. The hit ratio (H) is estimated to be 0.97, the main memory access time (Tm) is estimated to be 400 ns, and the cache memory access time (Tc) is 40 ns. The speed up factor, calculated with Equation 5, is 7.87:

Speed up = 1 / (1 - H(1 - Tc/Tm))    (5)

Figure 29 is a plot of speed up vs. hit ratio for the on-chip cache memory system. The hit ratio can vary from program to program. A true test of performance would be to fabricate the cache memory and exercise it with several benchmark programs. This has not been done because it is beyond the scope of this thesis.

Figure 29. Speed Up vs. Hit Ratio for the On-Chip Cache Memory System.
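Equation (5) in executable form, with the estimate above reproduced and a quick sensitivity check (Python):

    def speed_up(hit_ratio, t_cache_ns, t_main_ns):
        # Equation (5): effective speed up of a cached system over
        # main memory alone.
        return 1.0 / (1.0 - hit_ratio * (1.0 - t_cache_ns / t_main_ns))

    print(speed_up(0.97, 40, 400))   # ~7.87, the figure quoted above
    print(speed_up(0.90, 40, 400))   # a lower hit ratio drops this to ~5.3

The formula is just the ratio of the main memory access time to the average access time H*Tc + (1 - H)*Tm, which is why the result is so sensitive to the hit ratio.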
Comparison to Conventional Cache Systems

A conventional cache memory system must wait for interchip data transfers, which increases the cache memory access time. A typical cache memory access time for conventional cache memory systems is about 80 ns. The speed up factor of the on-chip cache memory versus a conventional cache memory is 1.94. Figure 30 is a plot of speed up factor vs. hit ratio for the on-chip cache memory system versus a conventional cache memory system.

Figure 30. Speed Up vs. Hit Ratio for an On-Chip Cache Memory vs. a Conventional Off-Chip Cache Memory.

The power consumption associated with interchip transfers and the operation of separate cache memory chips is much greater than the power consumed by an on-chip cache memory. The typical power consumption of a 16 Kbyte SRAM used for a cache memory is 700 mW. The total power consumption of the on-chip cache memory and CPU is 0.9 watts. The conventional cache memory system requires the design of a more complex circuit board and increases the chip count of the computer system. The on-chip cache memory reduces the total chip count and the glue logic needed to interface fast SRAM as a cache memory system. The mother board of most computers is already crowded, and the reduced chip count would reduce the cost, but this might not be enough to offset the increased cost of the CPU. If enough computer systems were built and sold, the on-chip cache memory system could be cheaper. The combination cache memory/CPU chip lends itself very well to parallel processing or multiprocessing. Each processor incorporated in a multiprocessor computing system would contain its own cache memory system, which would greatly reduce design time and enhance performance. Each processor could work from a shared main memory and contain a separate and secure task in local cache memory. The diagnosis and integration of a CPU/cache memory chip will be more difficult because there is no way of monitoring the cache memory contents externally.

CHAPTER 6

CONCLUSION

Discussion of Results

The on-chip cache memory system designed here displayed several interesting characteristics. It improved system performance as expected because of decreased memory access times relative to both a non-cached system and a conventional off-chip cached system. The speed up factor relative to a non-cached system is approximately 7.8, and relative to a conventional off-chip cached system it is approximately 1.9. The inclusion of a CAM is not very expensive in silicon area and paves the way for a fully associative cache memory scheme. The fully associative cache memory would require more hardware and a more complex timing scheme but would result in very efficient use of the cache memory. The hardware would increase the total system size by only about 4 percent. The cache memory system consumed more silicon area than would be feasible to put on one silicon die, but the cache is 16 Kbytes, which is quite large. The size can be reduced by layout compaction and possibly logic reduction. The power dissipation of an on-chip cache system is substantially less than that of an off-chip implementation. The processor/cache memory chip power dissipation is approximately 0.75 W; a conventional processor/cache memory system would dissipate approximately 2.6 W. Because of the processor/cache memory chip size, it would initially be quite expensive to produce and manufacture. The yield per wafer for this size of silicon die would be low, on the order of about one chip per wafer. The cost of designing and building a conventional cache memory would be less, but the chip count for the total system is increased and the circuit board layout would be more complex. The processor/cache memory chip would lend itself to use in a parallel processing environment. Each processor in a multiprocessor environment would have its own local, secure cache memory.

Future Work

The continuation of this work should include the simulation of an SRAM tag directory and the design and simulation of a fully associative cache system. It would be interesting to then perform a cache memory system simulation and check the improvement on several benchmark programs. Other work should include the reduction of the cache access time. The cache memory access times should be comparable to the access times of the CPU registers because of their location on-chip. This would be the fastest cache memory needed, because it would be almost like having 16 thousand registers. The instruction set would not support 16 thousand registers, but the access times would allow operation speeds almost as fast. The cache memory could also be pipelined: cache memory words (both data and instructions) could be prefetched to increase system performance.
REFERENCES CITED

1. Reinhart, J. and C. Serrano, "High-Speed Components and a Cache Memory Lower Access Times," Computer Technology Review, Winter 1984.
2. Hill, M. D. and A. J. Smith, "Experimental Evaluation of On-Chip Microprocessor Cache Memories," Conference Proceedings of the 11th Annual International Symposium on Computer Architecture, June 5-7, 1984.
3. Starnes, T. W., "Design Philosophy Behind Motorola's MC68000," BYTE, April 1983.
4. Introduction to the 80386, Intel Corporation, Literature Distribution, Mail Stop SC6-59, 3065 Bowers Avenue, Santa Clara, California 95051, 1985.
5. Bertram, W. J., "Yield and Reliability," in VLSI Technology, ed. S. M. Sze, p. 600, McGraw-Hill Inc., 1983.
6. Hwang, K. and F. A. Briggs, "Cache Memories and Management," Computer Architecture and Parallel Processing, McGraw-Hill Inc., 1984.
7. Goodman, J. R., "Using Cache Memory to Reduce Processor-Memory Traffic," Conference Proceedings of the 10th International Symposium on Computer Architecture, pp. 124-131, Stockholm, Sweden, 1983.
8. Reinhart, J. and C. Serrano, "High-Speed Components and a Cache Memory Lower Access Times," Computer Technology Review, Winter 1984.
9. Smith, A. J., "Cache Memory Design: An Evolving Art," IEEE Spectrum, Vol. 24, No. 12, December 1987.
10. Hamacher, V. C., Z. G. Vranesic and S. G. Zaky, "Cache Memories," Computer Organization, pp. 306-313, McGraw-Hill Inc., 1984.
11. Mano, M. M., "Cache Memory," Computer System Architecture, p. 501, Prentice-Hall Inc., 1982.
12. Strecker, W. D., "Cache Memories for the PDP-11 Family of Computers," Computer Engineering: A DEC View of Hardware Systems Design, pp. 263-270, Digital Press Inc., 1978.
13. Smith, J. E. and J. R. Goodman, "A Study of Instruction Cache Organizations and Replacement Policies," ACM Transactions on Computing, pp. 132-137, Association for Computing Machinery, 1983.
14. Mano, M. M., "Cache Memory," Computer System Architecture, p. 505, Prentice-Hall Inc., 1982.
15. Mano, M. M., "Cache Memory," Computer System Architecture, p. 507, Prentice-Hall Inc., 1982.
16. Mano, M. M., "Cache Memory," Computer System Architecture, p. 504, Prentice-Hall Inc., 1982.
17. Goodman, J. R., "Using Cache Memory to Reduce Processor-Memory Traffic," Conference Proceedings of the 10th International Symposium on Computer Architecture, pp. 124-131, Stockholm, Sweden, 1983.
18. Smith, A. J., "Cache Memories," Computing Surveys, Vol. 14, No. 3, September 1982.
19. Hamacher, V. C., Z. G. Vranesic and S. G. Zaky, "Cache Memories," Computer Organization, pp. 306-313, McGraw-Hill Inc., 1984.
20. Goodman, J. R., "Using Cache Memory to Reduce Processor-Memory Traffic," Conference Proceedings of the 10th International Symposium on Computer Architecture, pp. 124-131, Stockholm, Sweden, 1983.
21. Introduction to the 80386, Intel Corporation, Literature Distribution, Mail Stop SC6-59, 3065 Bowers Avenue, Santa Clara, California 95051, 1985.
22. Weste, N. and Eshraghian, K., Principles of CMOS VLSI Design: A Systems Perspective, Reading: Addison-Wesley, 1985, pp. 148-149.

APPENDICES

APPENDIX A

CIRCUIT DOCUMENTATION

1. TAG CELL

The tag cell stores the address information for the tag directory. Twenty tag cells combine to form one tag word. The address is checked, and if a match occurs, the precharged TAG MATCH line remains asserted. The circuit schematic is shown in Figure 31.
The circuit layout is shown in Figure 32. The inputs are ADDRESS, ADDRESS_BAR and THIS_TAG_WORD. The output is TAG_MATCH_LINE.

Cell specifications:

Inputs (capacitance):      ADDR 10 fF; ADDR_BAR 10 fF; TAG_WORD 50 fF
Outputs (capacitance):     TAG_MATCH_LINE 10 fF
Cell size:                 width 57 microns, height 63 microns
Output settling time:      6 ns
Input hold time:           12 ns
Power dissipation:         125 µW
Maximum transient current: 2.5 mA

Figure 31. Circuit Schematic for TAG.CEL (C_addr = 7.5 fF, C_addr_bar = 5.8 fF, C_tagword = 31.5 fF, C_tag_match = 8.3 fF).

Figure 32. Circuit Layout for TAG.CEL.

2. SRAM CELL

The SRAM cell stores the data for the cache memory word. Thirty-two SRAM cells combine to form one cache word. The DATA and DATA_BAR lines are precharged, then the cell is polled and, depending upon the cell's contents, one of the data lines is discharged. The circuit schematic is shown in Figure 33. The circuit layout is shown in Figure 34. The inputs are DATA, DATA_BAR and THIS_DATA_WORD. The outputs are DATA and DATA_BAR.

Cell specifications:

Inputs (capacitance):      DATA 10 fF; DATA_BAR 10 fF; DATA_WORD 50 fF
Cell size:                 width 41 microns, height 63 microns
Output settling time:      5 ns
Input hold time:           8 ns
Power dissipation:         62.5 µW
Maximum transient current: 2.75 mA

Figure 33. Circuit Schematic for SRAM.CEL (C_data = 6.8 fF, C_data_bar = 6.3 fF, C_dataword = 30.3 fF).

Figure 34. Circuit Layout for SRAM.CEL.

3. WORD CELL

The WORD cell decodes the word address and "ANDs" the decoded block line to select the desired cache word. The circuit schematic is shown in Figure 35. The circuit layout is shown in Figure 36. The inputs are ADDR0, ADDR1, ADDR0_BAR, ADDR1_BAR, DECODED_BLOCK_LINE and PRECHARGE_LINE. The output is DATA_WORD.

Cell specifications:

Inputs (capacitance):      ADDR0 50 fF; ADDR1 50 fF; ADDR0_BAR 50 fF; ADDR1_BAR 50 fF; DECODED_BLOCK 100 fF; PRECHARGE 100 fF
Outputs (capacitance):     DATA_WORD 10 fF
Cell size:                 width 94 microns, height 63 microns
Output settling time:      7 ns
Input hold time:           10 ns
Power dissipation:         94 µW
Maximum transient current: 9 mA

Figure 35. Circuit Schematic for WORD.CEL (C_A1 = 37.1 fF, C_A0 = 34.4 fF, C_pre = 77.6 fF, C_decoded_block_line = 76.0 fF, C_dataword = 2.8 fF).

Figure 36. Circuit Layout for WORD.CEL.

4. BLOCK CELL

The BLOCK cell decodes the cache block address and "ANDs" the tag match line to select the cache block. The circuit schematic is shown in Figure 37. The circuit layout is shown in Figure 38. The inputs are ADDR2 - ADDR11, ADDR2_BAR - ADDR11_BAR and BLOCK_ADDR. The outputs are DECODED_BLOCK and CACHE_BLOCK.

Cell specifications:

Inputs (capacitance):      ADDR2 - ADDR11 50 fF; ADDR2_BAR - ADDR11_BAR 50 fF; BLOCK_ADDR 100 fF
Outputs (capacitance):     DECODED_BLOCK 10 fF; CACHE_BLOCK 100 fF
Cell size:                 width 304 microns, height 63 microns
Output settling time:      4 ns
Input hold time:           8 ns
Power dissipation:         281 µW
Maximum transient current: 11 mA

Figure 37. Circuit Schematic for BLOCK.CEL (C_tag_match_line = 75.1 fF, C_pre = 77.6 fF, C_decoded_block_line = 76.0 fF, C_dataword = 2.8 fF, C_A2 - C_A11 = 31.5 fF).

Figure 38. Circuit Layout for BLOCK.CEL.
5. TPASS CELL

The TPASS cell passes the tag match line to the block address decoder upon reception of the TAG PASS signal. The TPASS cell also selects the THIS_TAG_WORD line for a tag write operation. The circuit schematic is shown in Figure 39. The circuit layout is shown in Figure 40. The inputs are TAG_MATCH, TAG_PASS, TAG_WRITE, CACHE_BLOCK and HIT/MISS_IN. The outputs are BLOCK_ADDR, THIS_TAG_WORD and HIT/MISS_OUT.

Cell specifications:

Inputs (capacitance):      TAG_MATCH 100 fF; TAG_PASS 25 fF; TAG_WRITE 100 fF; CACHE_BLOCK 100 fF; HIT/MISS_IN 75 fF
Outputs (capacitance):     BLOCK_ADDRESS 10 fF; TAG_WORD 25 fF; HIT/MISS_OUT 10 fF
Cell size:                 width 127 microns, height 63 microns
Output settling time:      5 ns
Input hold time:           8 ns
Power dissipation:         313 µW
Maximum transient current: 6 mA

Figure 39. Circuit Schematic for TPASS.CEL.

Figure 40. Circuit Layout for TPASS.CEL.

6. TOP TAG WRITE CELL

The top tag write cell applies the address and address-bar lines to a column of tag cells. The address is passed when the perform search line is asserted. The circuit schematic is shown in Figure 41. The circuit layout is shown in Figure 42. The inputs are ADDRESS, ADDRESS_BAR and PERFORM_CHECK. The outputs are ADDRESS and ADDRESS_BAR.

Cell specifications:

Inputs (capacitance):      ADDRESS 100 fF; ADDRESS_BAR 100 fF; PERFORM_CHECK 200 fF
Outputs (capacitance):     ADDRESS 100 fF; ADDRESS_BAR 100 fF
Cell size:                 width 57 microns, height 110 microns
Output settling time:      8 ns
Input hold time:           10 ns
Power dissipation:         625 µW
Maximum transient current: 20 mA

Figure 41. Circuit Schematic for TWRTOP.CEL (Top Tag Write Cell).

Figure 42. Circuit Layout for TWRTOP.CEL (Top Tag Write Cell).

7. BOTTOM TAG WRITE CELL

The bottom tag write cell applies the address and address-bar lines to a column of tag cells. The address is passed when the perform search line is asserted. The circuit schematic is shown in Figure 43. The circuit layout is shown in Figure 44. The inputs are ADDRESS, ADDRESS_BAR and PERFORM_CHECK. The outputs are ADDRESS and ADDRESS_BAR.

Cell specifications:

Inputs (capacitance):      ADDRESS 100 fF; ADDRESS_BAR 100 fF; PERFORM_CHECK 200 fF
Outputs (capacitance):     ADDRESS 100 fF; ADDRESS_BAR 100 fF
Cell size:                 width 57 microns, height 110 microns
Output settling time:      8 ns
Input hold time:           10 ns
Power dissipation:         625 µW