A SINGLE CHIP PROCESSOR, CACHE MEMORY SYSTEM

by

Douglas Eldon McGary

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering

MONTANA STATE UNIVERSITY
Bozeman, Montana

March 1989

© Copyright by Douglas Eldon McGary (1989)

APPROVAL

of a thesis submitted by Douglas Eldon McGary

This thesis has been read by each member of the thesis committee and has been found to be satisfactory regarding content, English usage, format, citations, bibliographic style, and consistency, and is ready for submission to the College of Graduate Studies.

Date    Chairperson, Graduate Committee

Approved for the Major Department

Date    Head, Major Department

Approved for the College of Graduate Studies

Date    Graduate Dean

STATEMENT OF PERMISSION TO USE

In presenting this thesis in partial fulfillment of the requirements for a master's degree at Montana State University, I agree that the Library shall make it available to borrowers under rules of the Library.
Brief quotations from this thesis are allowable without special permission, provided that accurate acknowledgment of the source is made. Permission for extensive quotation from or reproduction of this thesis may be granted by my major professor, or in his absence, by the Dean of Libraries when, in the opinion of either, the proposed use of the material is for scholarly purposes. Any copying or use of the material in this thesis for financial gain shall not be allowed without my written permission.

Signature
Date

ACKNOWLEDGEMENTS

The author wishes to express his gratitude to Dr. Roy Johnson and Professor Kel Winters for their guidance and constructive criticism during the research and writing of this thesis. Dr. Johnson conceived the basic idea, and Professor Winters had the knowledge and background to complete the circuit design. Rick Robinson also generously provided helpful suggestions and information. Tektronix Inc. of Beaverton, Oregon kindly donated to Montana State University the circuit layout package QUICK KIC and the circuit simulation package TSPICE. Others involved include Dr. Ken Marcotte and Dr. John Hanton, whose helpful hints proved very valuable. I also wish to thank all of my friends, work associates and especially Suzanne Spitzer for their support and patience during the research and writing of this thesis.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

1. INTRODUCTION
   Problem Description and Background
   Chip Size
      Parametric Processing Problems
      Circuit Design Problems
      Point Defects
   On-Chip vs. Off-Chip Communications
   Cache Memories and System Performance
   Scope and Organization of Remaining Chapters

2. CACHE MEMORIES
   Conventional Cache Memories
      Cache Size
      Cache Block Size
      Cache Operation Scheme
      Cache Replacement
      Cache Miss Handling
   Predominant Cache Organizations
      Direct Mapped Cache
      Set Associative Mapped Cache
      Fully Associative Mapped Cache
   Cache Memory Topics
      Tag Directories and Tag Checking
      Main/Cache Memory Consistency
      General Cache vs. A Split Instruction/Data Cache

3. DESIGN METHODOLOGY
   General Operation and Data Flow
      Cache Initialization
      Memory Access Cycle
   Size and Organization
      Mapping Scheme
      Initialization
      Replacement Procedure
      Memory Consistency
      Cache Size
      Block Size
      Tag Directory
   Top-Down Design
   CPU Communications and Timing

4. FINAL DESIGN
   Circuit Layout
   Circuit Parameters
   System Size
   Alternate Design - Fully Associative Cache
      Tag Word Length
      Full/Empty Designation
      Replacement Scheme
      Communications and Timing
      Advantages of Fully Associative Mapping

5. DESIGN SIMULATION
   Design and Simulation Tools
      Layout Software
      Circuit Simulation
      Layout Technology
   Speed Estimate
   Power Consumption Estimate
   System Performance Estimate
   Comparison to Conventional Cache Systems

6. CONCLUSION
   Discussion of Results
   Future Work

REFERENCES CITED

APPENDICES
   A - Circuit Documentation
   B - Rules and Values for Circuit Parameter Calculation
   C - MOSIS Scalable Design Rules
   D - Circuit Schematic and TSPICE Input Files for the Simulated Cache Memory Word

LIST OF TABLES

1. Typical Static RAM Performance Data
2. Typical Dynamic RAM Performance Data
3. Hit Ratio and Speed Up vs. Cache Size
4. Power Dissipation of Cache Memory System
5. CMOS Physical Properties: Resistance and Capacitance

LIST OF FIGURES

1. Logic Diagram of a Long On-Chip Communication Path
2. Long On-Chip Communication Path Delay Estimate
3. Cache Memory Logical Organization
4. Maximum Speed Up vs. Hit Ratio
5. Miss Ratio vs. Cache Size
6. Direct Mapped Cache Memory
7. Two-Way Set Associative Mapped Cache Memory
8. Fully Associative Mapped Cache Memory
9. Content Addressable Memory Cell Logic Diagram
10. Miss Ratio of Split and Unified Cache vs. Memory Capacity
11. Memory Request Cycle Timing Diagram
12. Tag Directory Block Diagram
13. CPU - Cache Memory System Block Diagram
14. Cache Memory System Block Diagram
15. Cache Block Logic Diagram
16. Two-Bit Address Decoder Logic Diagram
17. Dynamic Memory Cell Logic Diagram
18. Static Memory Cell Logic Diagram
19. Sense Amplifier Logic and Timing Diagram
20. Block Address Decoder Logic Diagram
21. Tag Directory Block Diagram
22. Tag Write Logic Diagram
23. Nine-Gate CAM Cell Logic Diagram
24. Typical Cache Read Timing Diagram
25. Cache Memory System and Subsystem Sizes
26. System Diagram of a Fully Associative Mapped Cache Memory
27. Sample Parasitic Capacitance Calculation
28. Transient Analysis of a Cache Memory Request
29. Speed Up vs. Hit Ratio for the On-Chip Cache Memory System
30. Speed Up vs. Hit Ratio for the On-Chip Cache Memory vs. A Conventional Off-Chip Cache Memory
31. Circuit Schematic for TAG.CEL
32. Circuit Layout for TAG.CEL
33. Circuit Schematic for SRAM.CEL
34. Circuit Layout for SRAM.CEL
35. Circuit Schematic for WORD.CEL
36. Circuit Layout for WORD.CEL
37. Circuit Schematic for BLOCK.CEL
38. Circuit Layout for BLOCK.CEL
39. Circuit Schematic for TRASS.CEL
40. Circuit Layout for TRASS.CEL
41. Circuit Schematic for TWRTOP.CEL (Top Tag Write Cell)
42. Circuit Layout for TWRTOP.CEL (Top Tag Write Cell)
43. Circuit Schematic for TWRBOT.CEL (Bottom Tag Write Cell)
44. Circuit Layout for TWRBOT.CEL (Bottom Tag Write Cell)
45. Circuit Schematic for SPEED.CEL
46. Circuit Layout for SPEED.CEL
47. Circuit Schematic for ADDINV.CEL (Address Invertor Cell)
48. Circuit Layout for ADDINV.CEL (Address Invertor Cell)
49. Circuit Schematic for TMAPRE.CEL (Tag Match Precharge Cell)
50. Circuit Layout for TMAPRE.CEL (Tag Match Precharge Cell)
51. MOSIS Scalable Design Rules
52. TSPICE Input Files for the Simulated Cache Memory Word
53. Circuit Schematic for the Simulated Cache Memory Word

ABSTRACT

This thesis is concerned with the design and simulation of a cache memory system placed on the same chip with a central processing unit. VLSI fabrication techniques allow for the production of full 32-bit architectures on a single silicon die. Larger silicon areas are available, but a larger processor is unnecessary. A cache memory is one possible system enhancement which increases system performance. The topics of a cache memory system are discussed, and the information needed for design choices is provided. A direct mapped cache 4096 words in length is designed and simulated. The design is intended for use with 32-bit processors that contain 32-bit address and data busses. The simulation includes all of the necessary support logic for a cache read, write and block replacement. During the circuit layout phase, it was discovered that a content addressable memory cell could be used without paying a large penalty in silicon area. This led to the development of an alternate design which showed the feasibility of a fully associative mapped cache memory system. The direct mapped cache memory system was simulated to produce system performance measures. A speedup factor of 7.8 relative to a non-cached system and 1.9 relative to a conventional cache system of the same size was determined. The total system size is quite large and would be expensive to manufacture. The total number of transistors for both the processor and the cache system is 1.1 million. The total silicon die size for the combination of the processor and the cache memory system is almost one square inch (0.921 in²). The power dissipation for the cache system was estimated at 308 mW and found to be acceptable. An on-chip cache memory system proved to be valuable and cost effective. The on-chip cache memory system is already in use on a small scale, and as chip sizes increase so will the cache memory size.
The use of a fully associative cache is an easily achieved option because of the minor difference in layout size between a static RAM and a content addressable RAM when a single polysilicon technology is used.

CHAPTER 1

INTRODUCTION

Problem Description and Background

Computer designers are building faster processors to increase computer system throughput. There is a growing gap between processor speed and memory speed, and memory speed is the limiting factor for system performance. There are several ways to increase the speed of today's microprocessor systems:

1. Increase the speed of memory access.
2. Increase single-cycle performance by performing one operation per clock cycle.
3. Incorporate other system features such as memory management units, instruction pipelining, etc.

System designers must use faster, more expensive main memory components to increase memory speed. Main memory has increased in size, but not in speed. A more complex CPU is required to perform one CPU instruction per clock cycle; the CPU must have the ability to prefetch and decode instructions before execution. Microprocessor designs already encompass more than 100,000 devices, and this increase in complexity is very costly. A simple and cost effective means of increasing microprocessor system throughput is to incorporate a cache memory. It has been shown that little performance gain will be obtained when only the speed of the CPU is increased [1]. The speed of the CPU is so much faster than the memory bandwidth that the system is greatly "out of balance". Increased memory bandwidth will allow memory to keep pace with fast microprocessors.

Microprocessors are built using a two or three metal layer CMOS process. Static RAM (SRAM) is comparable in size to content addressable memory (CAM) in a CMOS environment. This allows associative cache memories built from CAM to be designed without a large penalty in silicon area.

This thesis probes the feasibility of incorporating a cache memory on the same silicon die as a microprocessor or CPU. A background in cache memory and its theory of operation is provided. Questions concerning total silicon die size, cache memory type, organization, and cache memory size are reviewed. The performance measures are compared to conventional off-chip cache memories. Three basic assumptions have been made:

1. A silicon chip large enough to handle both a microprocessor and a cache memory can be fabricated economically.
2. An on-chip cache is faster than the conventional off-chip implementation.
3. Use of a cache memory improves system performance.

Chip Size

Silicon crystal ingots are now grown in 6 inch diameters, and experimental work is being done on 8 inch silicon wafers. Device miniaturization also allows for the packing of large numbers of logic and data storage circuits into a very small area. Smaller circuits can be built because of better alignment techniques such as self-aligned gates and shorter wavelength light for smaller photoplate geometries. "Advances in integrated circuit density are permitting the single chip implementation of features, functions and performance enhancements beyond those of basic eight and sixteen bit processors" [2]. Processors being designed and built today not only include full 16 and 32-bit architectures, but also have sufficient area for performance enhancements such as instruction buffering, pipelining and cache memories.
The chip area will not be sufficient for several years to include all possible enhancement features [2]. One of the best uses for additional chip area is a cache memory. Increased chip areas result from better yield parameters: fewer total chip defects allow for production of larger chips with the required yield. The Motorola MC68000 contains 68,000 transistors and is 45 square millimeters (6.24 by 7.14 mm) in size [3], and the Intel 80386 contains 275,000 transistors and is approximately 350 mils x 350 mils in size [4]. Ideally, perfect or 100% yield is desired. If this were attainable, any size chip area could be realized within the limits of the wafer size. Causes for less than perfect yield fall into three basic categories [5]:

1. Parametric processing problems
2. Circuit design problems
3. Random point defects in the wafer

Parametric Processing Problems

One outstanding feature of a processed wafer is that distinct regions of the wafer exhibit very high yield while other regions exhibit very low to even zero yield. Effects that produce low yield regions are:

1. variations in thickness of oxide and polysilicon layers
2. variations in the resistance of implanted layers
3. variations in the width of lithographically defined features
4. variations in registration or alignment of a photomask with respect to the previous masking operations

These variations depend on one another, and a gross variation in one processing step severely reduces chip yield. A variation in oxide layer thickness can result in areas being over-etched where the oxide is thinner than average and under-etched where it is thicker than average. Polysilicon gates are shorter in the thinner than average polysilicon regions. This results in channel lengths being too short to shut off a transistor when the proper gate voltage is applied. Variations in the doping of implanted layers lead to variations in contact resistance to implanted layers. During the processing of a wafer, various operations are carried out which result in small but important changes in the size of the wafer. When a wafer is oxidized, the SiO2 formed has twice the volume of the silicon consumed in the process. This stresses the wafer, and when part of the SiO2 is removed from one side of the wafer, the wafer will bend if the resulting stress is above the elastic limit of the material. Average variations in wafer size are around 2.5 microns for a 125 mm wafer [5]. This can cause serious misalignment problems and reduce yield on regions of the wafer, especially when a transistor gate length is only 2 microns wide.

Circuit Design Problems

Regions of a wafer show low yield when the designed circuits fail to take account of variations in the processing. Designers must be careful and follow design rules. Design rules are instituted to ensure integrated device operation for a particular process. Threshold voltage (Vt) and channel length (L) are the two most important parameters in MOS design. Variations in substrate doping, ion implantation dosage, and gate oxide thickness will cause a variation in threshold voltage. Variations in gate length and in source and drain junction depth cause the channel length to vary. The variation of Vt and L may be enough to cause a faulty circuit; the operation of a circuit may be unpredictable if the threshold voltage or the channel length is not within tolerance.

Point Defects

Even after processing and circuit design problems are reduced, 100% chip yield is not obtained because of point defects.
Point defects are regions where the wafer fabrication produces faulty circuits, and the size of the faulty region is small compared to the size of the wafer. A three micron dust particle on the wafer could cause a metal conductor to break and render the circuit useless; this is an example of a point defect. Point defects also come from other sources such as dust on the photoplate, a small fault on the photoplate, or a pinhole in the silicon dioxide layer. The best procedure for reducing random point defects is to monitor them, take action to reduce them (such as cleaning the fabrication facility and processing equipment), and continue to monitor them.

To overcome the above three problems, fabrication facilities have advanced the technology of processing steps and circuit design rules. The progress they have shown allows for the manufacture of large area VLSI chips. Larger chips reduce the number of chips per wafer, and the yield must be high to produce large chips economically. Processing problems have been reduced by lowering processing temperatures; ion implantation allows for lower wafer processing temperatures. Chip yield is enhanced by better control of layer thickness and doping densities. Circuit designers strictly follow design rules to ensure device operation, and they usually use a library of circuit layouts that have been proven to function reliably. Point defects have been reduced by creating cleaner fabrication facilities and better cleaning techniques. Substrates are now grown with better purity than obtainable in the past.

On-Chip vs. Off-Chip Communications

In a typical off-chip cache design, the CPU must convert address signals to TTL compatible levels. The cache memory chips must then perform the tag search and drive the necessary data signals back to the CPU. This process is inherently slow because:

1. I/O drivers are required to drive and receive the signals, and I/O drivers are not as fast as other chip circuitry.
2. Interconnect lines on a PC board contain large amounts of resistance and capacitance.
3. A complete cache memory access cycle requires four signal transformations from internal chip levels to external chip levels.

In an on-chip cache system, all information transfer takes place on internal data paths. Transfer of information internally on a chip is faster and consumes less power than inter-chip I/O transfers. Table 1 shows typical static RAM performance; access times range from 50 to 500 ns for static RAMs. On-chip cache memory has the advantage of receiving data before an I/O driver can even signal an off-chip memory component. The static RAM access times in Table 1 are the times required for the data to be delivered after the chip receives the input signal. The time needed to deliver the input signal to the static RAM chip is about 20 ns, which is the delay associated with switching an input/output pad and tristate driver. On-chip cache memory receives the input signal in about 6 ns, which is a 14 ns advantage.

Table 1. Typical Static RAM Performance Data.

Organization   Typical Access Time   Typical Cost   Total Storage
256x4          80-150 ns             $2-4           1 Kbit
4096x1         80-150 ns             $5-10          4 Kbits
1024x4         80-150 ns             $5-10          4 Kbits
2048x4         80-150 ns             $10-15         8 Kbits
8Kx8           50-100 ns             $25-30         64 Kbits

Figure 1 shows the logic diagram for a possible long interconnect a CPU might drive to deliver an input signal to an on-chip cache memory. The time delay involved results from having to transfer the charge associated with the long interconnect capacitance.
The interconnect line is precharged for an optimum signal transfer rate.

Figure 1. Logic Diagram for a Long On-Chip Communication Path.

The circuit in Figure 1 was simulated using TSPICE, and the time delay was found to be approximately 6 ns for transistors with gates of 16 μm width and 2 μm length. The transistor technology is the same as that used for the design of the on-chip cache memory. The transient analysis response is plotted in Figure 2. This is a conservative estimate of intrachip data transfer rates and supports the assumption that intrachip transfers are faster than interchip transfers. The interconnect is 20 mm long (almost an inch) and 6 μm wide. A typical process assigns approximately 0.25E-4 picofarads of capacitance per square micron of metal interconnect, so the total capacitance associated with the interconnect line is 3.0 pF. A typical process assigns approximately 0.05 ohms of resistance per square of metal interconnect. Assuming that the metal interconnect only covers substrate material, the total resistance is 167 ohms. The resistance is broken into two halves (R1 and R2) for the circuit simulation. The interconnect line could be longer and wider and still be driven in a few nanoseconds with only slightly larger transistors.

Figure 2. Long On-Chip Communication Path Delay Estimate.
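For readers who want to reproduce the estimate, the arithmetic behind these parasitic values is summarized in the short sketch below. Python is used here purely as a calculator (it is not part of the thesis toolchain), and the single-pole RC product it reports is only a floor on the delay; the 6 ns figure above comes from the full TSPICE transient simulation, which also includes the precharge and driver transistors.

```python
# Back-of-envelope check of the parasitic estimate above, using the
# "typical process" constants quoted in the text.

length_um = 20_000.0    # interconnect length: 20 mm
width_um  = 6.0         # interconnect width: 6 um

# Capacitance: area times ~0.25E-4 pF per square micron of metal.
cap_pf = length_um * width_um * 0.25e-4          # = 3.0 pF

# Resistance: (length/width) squares times ~0.05 ohm per square.
res_ohm = (length_um / width_um) * 0.05          # ~ 167 ohms

# ohm * pF gives picoseconds; divide by 1000 for nanoseconds.
rc_ns = res_ohm * cap_pf * 1e-3                  # ~ 0.5 ns lower bound
print(f"C = {cap_pf:.1f} pF, R = {res_ohm:.0f} ohm, RC = {rc_ns:.2f} ns")
```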
Cache Memories and System Performance

Cache memories are a time tested mechanism for improving memory system performance. Caches reduce access time and memory traffic; they have proven to be useful and will continue to be used into the future [6]. Microprocessors are buss limited [7], which underlines the importance of conserving or reducing buss traffic. Total system performance is a function of memory buss bandwidth. It is conceivable that a microprocessor can grow to 10 times as large in transistor count [7], but the number of pins that connect it to the outside world will only grow by a factor of about 1.5 to 2. Buss traffic will therefore become a more pressing problem as processors become more complex. A cache memory will reduce buss traffic and allow the buss to be used for other important communication functions. All buss transfers to external main memory are made in block-mode form. The Motorola 68000 microprocessor has 64 I/O pins. The MC68000 typically utilizes 90-95% of the external buss bandwidth (6.25 Mbytes/sec at a 12.5 MHz clock frequency). Increasing the speed of the CPU will not result in a significant increase in system performance [8]; the main memory components (DRAMs and memory management units) simply are not fast enough to keep pace with the CPU I/O demands. The 12.5 MHz MC68010 has a minimum buss cycle time of 320 ns (four 80 ns clock periods). To perform a read cycle without wait states, the 12.5 MHz MC68010 requires valid data at its inputs a maximum of 135 ns after asserting the CPU address strobe. Conventional DRAMs and MMUs only deliver data in around 150 ns to 500 ns [9]. A cache system logically placed between the MC68010 and the MC68451 MMU will lower memory access times enough to allow the CPU to run without wait states.

Processors being produced today are easily outpacing the memory products associated with them. Memory products can be made fast enough to keep pace with the processor, but it is very costly. Memory designers have been concentrating on building high density memory chips. Dense memory chips allow computer systems to have several megabytes of main memory at a reasonable cost, but access times do not decrease. One reasonable way to bridge the gap between memory speed and CPU I/O speed is to incorporate a fast buffer memory.

Scope and Organization of Remaining Chapters

Chapter 2 provides a background and covers important topics of cache memories. Chapter 3 discusses the design methodology. Chapter 4 reviews the final cache memory design. Chapter 5 contains the design simulation, and Chapter 6 provides the conclusions.

CHAPTER 2

CACHE MEMORIES

"A large number of computer programs show that most of the execution time is spent in a few main routines" [10]. This phenomenon is called locality of reference. Most programs also demonstrate a sequential nature of program flow. The CPU does not generate addresses that are random in nature; they tend to be localized, and the higher order bits remain the same. The only deviation from this is a jump or branch instruction. When program flow is localized on a small number of instructions, those instructions are executed over and over again. These are usually loops, nested loops or procedures that continually call each other. The overall flow of the program is not important. The main idea is that localized instructions are accessed by the CPU most of the time while the rest of the code is accessed infrequently [11]. If a buffer memory is incorporated that contains the frequently accessed code, the CPU can access the buffer instead of main memory and save time. The buffer needs to be faster than main memory. Faster memory is expensive, and building a complete main memory out of fast buffers is very expensive.

Conventional Cache Memories

The first cache implemented was in the IBM System/360 Model 85 [12]. Since then, several high performance computers such as the ILLIAC IV, the CDC STAR, the CRAY-1 and the TI ASC have implemented a cache memory. Almost every high performance computer in use today incorporates a cache memory; computers ranging from "minis" to "supers" contain a cache memory in one form or another.

The main memory for a typical 8-bit microprocessor is 512 Kbytes to 1 Mbyte. A 16-bit processor has addressability from 20 to 24 bits, which is from 1 to 16 Mbytes. Larger main memories are being built from medium speed dynamic RAM chips. A cache memory can reduce memory access times. Typical memory access time for a DRAM memory chip is about 400 ns, as shown in Table 2. Incorporating a DRAM chip set for the main memory is the most economical choice. Large DRAM chips are slow, but inexpensive. The cost of fabricating a main memory completely out of high speed static RAMs or ECL chips would overshadow the cost of the rest of the computer.

The block diagram of a buffer memory or cache memory is shown in Figure 3. The cache memory is logically located between the CPU and main memory. The access time for cache memory is usually an order of magnitude faster than main memory and is intermediate in speed relative to CPU register cycle time and main memory cycle time.

Table 2. Typical DRAM Performance Data.

Organization            Typical Access Time   Typical Cost   Typical Power Consumption
64Kx1, 8Kx8             100-250 ns            $5-10          150-300 mW
128Kx1, 32Kx4, 16Kx8    100-250 ns            $10-15         200-400 mW
256Kx1, 64Kx4, 32Kx8    100-500 ns            $15-20         250-500 mW
1Mx1, 256Kx4, 128Kx8    100-300 ns            $60-100        250-500 mW

Figure 3. Cache Memory Logical Organization.
Cache memory and main memory create a memory hierarchy just as main memory and secondary storage do. Many of the techniques designed for virtual memory management are also valid in a cache/main memory hierarchy. Both systems try to obtain the fastest operation possible: the effective speed of the memory system is maximized to obtain optimum system performance. Ideal operation of the main memory/cache memory hierarchy is an access time very close to the access time of the cache memory.

A cache hit occurs when the cache memory satisfies a memory request. The cache hit ratio $\beta$ is the total number of memory requests serviced by the cache divided by the total number of memory requests [10]:

$$\beta = \frac{\text{number of cache hits}}{\text{total memory requests}} \qquad (1)$$

A cache miss occurs when a memory request is not serviced by the cache. The cache miss ratio is the number of memory requests satisfied by the main memory divided by the total number of memory requests [10]:

$$\text{Miss Ratio} = 1 - \text{Hit Ratio} = 1 - \beta \qquad (2)$$

The effective speedup is the ratio of the main memory access time to the effective memory access time. Let $T_c$ be the cache memory access time, $T_m$ the main memory access time and $T_e$ the effective memory access time as seen by the CPU. The effective access time is [10]:

$$T_e = \beta T_c + (1 - \beta) T_m \qquad (3)$$

This equation simply states that the effective access time is equal to the cache hit ratio times the cache access time plus the cache miss ratio times the main memory access time. The speedup due to the cache is [10]:

$$S_c = T_m / T_e \qquad (4)$$

A higher cache hit ratio produces a more efficient cache system, which leads to faster data and instruction transfers to and from the CPU. The relationship between cache hit ratio and speedup is [10]:

$$S_c = \frac{T_m}{T_e} = \frac{T_m}{\beta T_c + (1 - \beta) T_m} = \frac{1}{1 + \beta (T_c/T_m - 1)} \qquad (5)$$

Let

$$T_c / T_m = 0.1 \qquad (6)$$

Letting $T_c/T_m$ equal 0.1 assumes the cache memory access speed is an order of magnitude faster than the main memory access time, which is a very reasonable assumption. Thus,

$$S_c = \frac{1}{1 - 0.9\beta} \qquad (7)$$

Equation 7 shows that the cache hit ratio $\beta$ is the determining factor in speedup. The speedup increases as the hit ratio approaches one. Figure 4 shows speedup vs. hit ratio using equation 7. The ratio of cache memory access time ($T_c$) to main memory access time ($T_m$) is assumed to be 0.1, which simply means the cache memory is ten times faster than main memory. Note that the speedup axis is logarithmic: small increases in hit ratio will produce large changes in the speedup. The maximum speedup is 10. If the hit ratio is 0.5, the maximum speedup is 2.0 no matter how fast the cache memory is, which underlines the importance of high hit ratios.

Figure 4. Maximum Speed Up vs. Hit Ratio.
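A minimal sketch of equations (3) through (7), useful for reproducing Figure 4 and the speedup figures quoted later (illustrative Python only; the function and variable names are ours, not the thesis's):

```python
# Speedup as a function of hit ratio, eq. (5); with tc_over_tm = 0.1
# this reduces to eq. (7), Sc = 1 / (1 - 0.9*B).

def speedup(hit_ratio, tc_over_tm=0.1):
    """Sc = Tm/Te = 1 / (1 + B*(Tc/Tm - 1))."""
    return 1.0 / (1.0 + hit_ratio * (tc_over_tm - 1.0))

for beta in (0.5, 0.7, 0.8, 0.9, 0.95, 0.97, 0.98):
    print(f"hit ratio {beta:4.2f} -> speedup {speedup(beta):.2f}")

# In the limit Tc/Tm -> 0 the bound becomes 1/(1 - B): a hit ratio of
# 0.5 can never yield more than a 2.0 speedup, however fast the cache.
print(speedup(0.5, tc_over_tm=0.0))   # -> 2.0
```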
There are several factors that affect the hit ratio and consequently the memory speedup. These are:

1. Cache size
2. Cache block size
3. Cache operation scheme
4. Cache replacement scheme
5. Cache "miss" handling

Cache Size

Naturally, the larger a cache memory, the greater the chance that it contains the desired memory location, but there is a point of diminishing returns on cache size. It would be very expensive to build a cache memory comparable in size to the main memory. Figure 5 shows miss ratio vs. cache size. The figure clearly shows that beyond a cache size of about 4 Kbytes, the amount the miss ratio decreases is small compared to the increase in cache size.

Figure 5. Miss Ratio vs. Cache Size.

As an example, let $T_c = 1.0$, $T_m = 10$ and vary the hit ratio $\beta$. The results are shown in Table 3.

Table 3. Hit Ratio and Speed Up vs. Cache Size.

β      Sc     Cache memory size change required to produce desired hit ratio
0.7    2.70   0 to 512 bytes
0.8    3.57   512 to 1024 bytes (doubled)
0.9    5.26   1024 to 2048 bytes (doubled)
0.95   6.90   2048 to 4096 bytes (doubled)
0.97   7.87   4096 to 8192 bytes (doubled)
0.98   8.47   8192 to 16384 bytes (doubled)

The system designer must ask whether doubling the cache size is worth the percentage increase in hit ratio and speedup.

Cache Block Size

All cache memories use blocks of words. A block is simply several words grouped together, distinguished from one another only by the least significant address bits. Each word can contain data or instructions. The larger the block size, the more time it takes to move a block into the cache memory. A large block size does give the CPU a better chance of accessing an upcoming instruction or datum, but this gain is easily offset by the time the CPU waits for the block transfer to occur. Locality of reference plays a big part in determining the block size, and the amount of locality of reference is completely program dependent. There have been several studies into the relationship between cache hit ratio and cache block size [13], [16]. These studies show that a cache block size of around 8 words is a good choice. The optimization of a cache block size would require the testing of several hundred programs with several different sizes of cache blocks and is beyond the scope of this thesis.

Cache Operation Scheme

There are three basic types of caching schemes:

1. Direct mapped
2. Set associative mapped
3. Fully associative mapped

A direct mapped cache scheme is the simplest to use and understand, but it is the least flexible of the three types [14]. In a direct mapped cache scheme, each main memory block maps onto exactly one cache block, determined by the modulo of the number of cache blocks: if there are 128 blocks, main memory block K maps onto block K modulo 128 of the cache. Set associative mapping groups the cache memory blocks into sets of blocks. Any main memory block can map into one block set of the cache, and then into only one of the blocks in that set. This allows for better utilization of the total cache memory, but it requires more overhead to control and maintain [15]. Fully associative mapping allows any main memory block to map onto any cache memory block. This cache scheme is the most flexible but, depending upon the replacement scheme, can be the most complex. If the replacement algorithm is complex, the fully associative mapping scheme requires the most CPU overhead to control and maintain [16]. If the replacement scheme is simple, such as random replacement, the fully associative mapping scheme is not the most complex and in fact is only a little more complex than direct mapping. The three schemes are explained in detail in the next section.

The contribution of each scheme to the hit ratio is determined by the amount of overhead required to control and maintain the cache scheme. The set and fully associative cache schemes allow for more flexibility, but the required overhead decreases performance.
The direct mapped cache scheme requires very little overhead, but the total cache memory is not used. Each cache operation scheme has advantages and disadvantages. A comparison of the three types to determine the absolute best is a long and complex process and beyond the scope of this thesis; the results would be program specific and inconclusive. Through experience with computer hardware and software, the best choice, especially in the first design iteration, is the simplest choice.

Cache Replacement

The replacement algorithm determines how the cache management system replaces locations in cache memory. Cache replacement is performed after a cache miss occurs. In the direct mapped scheme, the replacement algorithm is trivial: the needed memory block is automatically moved into the cache memory. The only variation is if the system designer wishes to move more than one block, usually the next or adjacent block of main memory. The other two types of cache schemes can use replacement algorithms ranging from very simple to very complex. The simplest is first-in/first-out, which only requires that each block have a counter that is increased for each unit of time the block spends in the cache memory; the cache block that is replaced is the one that has been resident the longest. A better replacement scheme is the least-recently-used or LRU algorithm. This scheme requires tracking which block has been used most recently through least recently, and the least recently used block is swapped out of the cache when replacement is necessary. There are also replacement algorithms that statistically keep track of the memory blocks and make decisions on which block should be kept. Such an algorithm is obviously the most expensive in overhead, and the efficiency or performance gain is questionable. Each replacement scheme has its advantages and disadvantages. The one that is used is completely up to the designer, who must weigh the amount of processing overhead needed to implement a replacement algorithm against the efficiency of the cache and the increase in performance.

Cache Miss Handling

When a cache miss occurs, the cache memory must be updated to allow the continued execution of a program. Just as with replacement algorithms, miss handling can range from very simple to very complex. The goal is to replace a portion of cache memory, enable continued program execution and reduce the possibility of future cache misses.

Predominant Cache Organizations

Direct Mapped Cache

The direct mapped scheme is shown in Figure 6. This scheme checks tag bits to determine if a needed main memory word is resident in the cache. The number of tag bits needed is determined by the size of the cache blocks and the number of blocks. For example, let a cache memory consist of 4 K words, with each block containing 4 words. This means there will be 1024 blocks (4 words/block × 1024 blocks = 4 K words). Let main memory addresses contain 20 bits. The cache tag words are then 8 bits long, the block address is 10 bits long, and 2 address bits are used to select one of the four words in a block. Each main memory block maps directly onto only one cache block location; in the case described above, the main memory blocks map by modulo 1024 onto the cache blocks. When the CPU generates an address, the tag bits are checked. If the check produces a positive result, the block contains the needed memory location and the CPU proceeds with a read or write to the desired word. If the tag bit check produces a negative result, the CPU must copy the needed block from main memory into the cache block.
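A behavioral sketch of this direct mapped example (illustrative Python, not the thesis circuit; all names are ours) shows how the 20-bit address splits into tag, block and word fields and how the tag check decides hit or miss:

```python
# 20-bit word address -> 8-bit tag, 10-bit block field, 2-bit word field.
# Main memory block K lands in cache block K modulo 1024.

WORD_BITS, BLOCK_BITS = 2, 10

def split_address(addr):
    word  = addr & 0b11                         # low 2 bits: word in block
    block = (addr >> WORD_BITS) & 0x3FF         # next 10 bits: cache block
    tag   = addr >> (WORD_BITS + BLOCK_BITS)    # top 8 bits: tag
    return tag, block, word

tag_directory = [None] * 1024                   # one tag word per cache block

def access(addr):
    """Return True on a cache hit; on a miss, model the block replacement."""
    tag, block, _ = split_address(addr)
    if tag_directory[block] == tag:
        return True                             # hit: read/write proceeds
    tag_directory[block] = tag                  # miss: block copied in, tag rewritten
    return False

print(access(0x12345), access(0x12345))         # miss, then hit: False True
```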
This block replacement procedure slows down the processing of a program, but if the cache has been designed well, and the locality of reference theory holds for the program, the number of cache misses will be small and overall system performance is improved.

The direct mapping system is simple and easily managed. The only flexibility for the designer is the cache miss handling procedure. There are two ways to handle the block replacement that is necessary after a cache miss:

1. When a cache miss occurs, replace the needed block and continue program execution.
2. When a cache miss occurs, replace the needed block and one or more adjacent blocks from main memory.

Figure 6. Direct Mapped Cache Memory.

The second procedure attempts to reduce the number of misses in the future execution of the program. There are a couple of examples of known cache designs that replace adjacent blocks when a cache miss occurs. This is the only time that such large amounts of data are moved into a cache at one time; buss bandwidth limitations prohibit the transfer of large amounts of data for each cache operation.

Set Associative Mapped Cache

Set associative mapping is similar to direct mapping, but is more flexible. Blocks of cache memory are grouped into sets. A two-way set associative cache has two elements per set, and each element contains a memory block. For example, let a two-way set associative cache consist of 512 sets, 2 blocks per set, and 4 words per block; the total cache is 4 K words long. Let the main memory have 20 address lines, the same as in the direct mapped example. The lower two address bits select the particular word in a block. The next 9 address bits select the set, and the final 9 bits form the tag associated with the set. Figure 7 shows a block diagram of the logic associated with this example. The cache can contain blocks 0 through 1023 just like the direct mapped example, but cache set 0 can contain block 0 and any other main memory block that is congruent to it modulo 512. This gives the CPU an option when placing the main memory blocks. A full/empty bit can be added to signify the status of a block in a set. When a cache write is desired and one of the blocks of the set is full, the CPU just writes into the other one; but if both set blocks are full, the CPU must decide which one will be replaced.

Figure 7. Two-Way Set Associative Mapped Cache Memory.
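As with the direct mapped sketch, the set and tag arithmetic of this example can be written out behaviorally (illustrative Python; all names are ours, and the replacement choice on a miss is only a placeholder for whatever rule the designer adopts):

```python
# Two-way set associative example: 512 sets of two blocks, 4 words/block.

SETS = 512

def split_address(addr):
    word      = addr & 0b11            # low 2 bits: word within the block
    set_index = (addr >> 2) & 0x1FF    # next 9 bits: one of 512 sets
    tag       = addr >> 11             # remaining 9 bits: tag for the set
    return tag, set_index, word

sets = [[None, None] for _ in range(SETS)]   # two tag elements per set

def access(addr):
    tag, s, _ = split_address(addr)
    if tag in sets[s]:
        return True                    # hit in either element of the set
    # Miss: fill an empty element if one exists, else replace element 0.
    victim = sets[s].index(None) if None in sets[s] else 0
    sets[s][victim] = tag
    return False
```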
The set associative cache requires more logic to control than the direct mapped cache. The extra logic controls which cache set block is filled by an incoming memory block and which cache set block is replaced when the set becomes full. The two-way set associative cache reduces cache misses, and four-way is better yet, but only slightly; further increases in the degree of associativity have little effect on system performance. The upgrade from direct mapping requires another level of logic and a couple more control lines. The added logic level adds delay, and the added complexity requires more design effort and takes up more silicon area. As a general rule, the decision to go from direct mapped to set associative is governed by the cache size. For a cache of 32 Kbytes, the miss ratio will be low, and a set associative cache will not significantly reduce cache misses or improve performance. For smaller caches, where the delay from cache misses dominates, the set associative design will greatly increase performance. The final decision must include the design constraints that face the system designer.

Fully Associative Mapped Cache

Fully associative mapping is the most flexible but also the most complex of the three types discussed. This mapping system allows any main memory block to map onto any cache block, as shown in Figure 8. The cache blocks are distinguished completely by their tag bits. For the 20-bit address main memory example with 4-word cache blocks and 1024 blocks, there are 18 tag bits per tag word associated with each cache block. The main memory blocks are mapped into any available cache block. When the CPU generates a memory address, all of the tag bits are checked. If the result is positive, the needed block is resident and program execution continues. If the tag check produces a negative result, the necessary main memory block is moved into the cache memory. The fully associative mapping technique allows complete freedom in replacing a block when a cache miss occurs. However, it might not be practical to implement a complex replacement scheme. The cache management for a 1024-way associative memory is not trivial and requires a large design effort. The fully associative cache system requires the checking of a large number of tag bits, and the information that determines replacement also needs to be kept in a memory table. The replacement algorithm needs to determine which block is going to be replaced in the same amount of time that the tags can be checked, or system performance is degraded. The overhead in controlling and managing a fully associative cache is often prohibitive, and the design effort that goes into one is long and expensive, which deters most designers from implementing it.

Figure 8. Fully Associative Mapped Cache Memory.

Cache Memory Topics

Tag Directories and Tag Checking

The contents of a cache are constantly changing as a program is executed. When a memory request is issued by the CPU, the cache is checked to see if the desired location is present. This is done with a tag directory or table look-aside buffer. The tag directory must be checked as fast as possible so as not to inhibit the speed of a cache memory access. The fastest method is an associative search, performed on all tag locations at once. The content addressable memory is the most common form of tag directory.
The content addressable memory scheme provides comparison logic for each memory location and bit. This adds considerable cost to each bit in a CAM, so the size of a CAM or tag directory is kept as small as possible. Suppose the cache is direct mapped into 64 blocks and the tag for each block is 8 bits wide. The necessary CAM is 8 cells wide and 64 tag words in length. The CAM contains write logic to modify the tag directory when a new memory location is stored into the cache. A tag match line is associated with each tag word and is precharged prior to a tag directory search. If one bit of a tag word does not match the desired tag address, the tag match line is pulled down, signifying a tag word miss. All of the tag match lines are combined through OR gates to form the hit/miss line. Figure 9 shows the logic layout of a CAM cell.

Figure 9. Content Addressable Memory Cell Logic Diagram.

The tag word that leaves its tag match line precharged contains the desired memory location, and access is given to the CPU. A miss occurs when all of the tag match lines are discharged after a tag search. The CPU is notified by the hit/miss line and proceeds with a replacement procedure. The replacement procedure brings in the desired memory location and can also bring in extra memory blocks to help avoid another cache miss. The associative memory is the fastest memory scheme available for tag checking: the logic associated with each cell allows all the tag cells to be checked simultaneously. This checking scheme is non-destructive, allowing the tag directory to be checked any number of times without a loss of data.

Writing a new address into a tag word is done in parallel. The tag write logic supplies the desired tag address, address bits A2-A11 select the desired tag word, the tag write line is pulsed, and the new block address is written into the tag word.
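A behavioral model of the search and write operations just described may help fix the idea (illustrative Python; the hardware compares every tag word in parallel in a single operation, so the loop below only models the result, not the timing):

```python
# Behavioral CAM tag directory, sized for the 64-block, 8-bit-tag example.

TAG_WORDS, TAG_BITS = 64, 8

tag_directory = [0] * TAG_WORDS

def tag_search(wanted_tag):
    """Each match line stays precharged (True) only if all bits match."""
    match_lines = [stored == wanted_tag for stored in tag_directory]
    hit = any(match_lines)             # OR of all tag match lines
    return hit, match_lines

def tag_write(word_select, new_tag):
    """Address bits select one tag word; the new block address is written."""
    tag_directory[word_select] = new_tag & ((1 << TAG_BITS) - 1)
```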
Main/Cache Memory Consistency

After the CPU modifies a cache location with a write command, the modification must make its way back to the main memory; main memory must be kept consistent with modifications that take place in the cache. There are two common methods of forcing consistency between the cache and the main memory: write-through (or store-through) and write-back (or copy-back). The write-through technique writes a modified cache location out to memory every time there is a write to the cache. This system is very reliable but can tie up a microprocessor's buss structures. The write-back procedure only writes a modified cache location to memory when the location is swapped out of the cache; a cache location may be modified several times before it is moved back to main memory. The write-back procedure lowers buss traffic but is more complex to control and is not 100% reliable [17]. Write-back reduces memory traffic but requires complex logic to do so significantly; the reduction of buss traffic does not affect the miss ratio of a cache memory system, so low system buss traffic must be necessary to justify the cost of a write-back cache implementation. The easiest way to implement a write-back cache scheme is to mark a block as dirty when the CPU writes to it. The whole block is then written to main memory when the block is replaced or the program ends. The number of times that a whole cache block is modified by a CPU is very small, however, and several memory cycles are wasted writing the whole block. The only way to alleviate this problem is to assign a dirty/clean bit to each cache location. The block is then scanned for modified (dirty) locations, which are written to main memory. This procedure adds more complexity to the cache design, and the system designer must decide whether the extra logic, silicon area and design effort are worth the increase in system performance.

If write-through is used, main memory always contains an up-to-date copy of all information in the system. If the microprocessor is used in a multiple processor environment, write-through simplifies the multiple cache consistency problem. The write-back method results in the cache containing the only valid copy of data, and an error correcting code is needed to ensure high reliability.
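The two policies can be contrasted with a small behavioral sketch (illustrative Python; a flat word-indexed model with the per-word dirty/clean bit refinement described above):

```python
# Write-through vs. write-back, modeled on flat word arrays.

main_memory = [0] * 4096
cache_data  = [0] * 4096
dirty       = [False] * 4096           # per-word dirty/clean bit

def write_through(index, value):
    cache_data[index] = value
    main_memory[index] = value         # every cache write also goes to memory

def write_back(index, value):
    cache_data[index] = value
    dirty[index] = True                # main memory update is deferred

def evict(index):
    """On replacement, only dirty words are copied back to main memory."""
    if dirty[index]:
        main_memory[index] = cache_data[index]
        dirty[index] = False
```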
General Cache vs. A Split Instruction/Data Cache

Cache bandwidth and access time can be improved by splitting the cache into a data cache and an instruction cache. The bandwidth is doubled because two memory requests can be serviced at once. Fast computers are pipelined, which means that several instructions are simultaneously fetched and decoded. Most pipelines have several stages, including instruction fetch, instruction decode, operand fetch and result transmission to the proper destination. Logic pertaining to instruction fetch and decode generally does not involve the operand fetch and store except when instruction execution takes place. If the computer designer feels that it is worthwhile, the logic to handle both a data and an instruction cache can be implemented and system performance will be improved. There are several problems with a split cache scheme. The instructions must be kept separate from the data. Keeping main memory consistent with both caches is complex and requires extra overhead. A split cache generally results in inefficient use of cache memory [18]. A split cache miss ratio vs. a unified cache miss ratio is shown in Figure 10.

Figure 10. Miss Ratio of Split and Unified Cache vs. Memory Capacity.

A split cache can be divided several ways. The split could be 50% data and 50% instruction (split equal) or some other more optimum split (split unequal). The unequal split in cache size is determined by running several performance measures and adjusting the different sizes to produce an improvement in system throughput. The split equal, split unequal and unified cache systems all perform about the same with respect to miss ratio. The system designer must decide whether the increased bandwidth justifies the cost of the extra logic and CPU overhead.

CHAPTER 3

DESIGN METHODOLOGY

The design of an on-chip cache system is governed by several constraints and conditions. The constraints in this cache implementation are:

1. The cache structure is geared toward a microprocessor.

2. Cache operation must require very few CPU instruction executions and be transparent to the user. The mapping scheme must be kept manageable. The cache initialization and replacement procedures must be straightforward and require very little CPU overhead. The memory consistency procedure must keep main memory valid without CPU intervention.

3. The cache structure must not increase total silicon area to an unproducible size. The block size must be small enough not to bottleneck the microprocessor memory buss and big enough to produce low miss ratios.

4. The cache tag directory must be fast and require very little logic to maintain and control. Modifying the tag words must require very little delay or CPU overhead.

5. The layout technology of the cache memory must be compatible with the layout technology of the microprocessor. For this thesis, the layout software must be available at Montana State University. The layouts must be simulated to test system operation.

6. The memory cell used to store the cache data must have a fast access time and add a minimal amount of complexity to the microprocessor.

The above constraints were kept in mind when the on-chip cache structure was designed. The following sections explain the choices made and how they reflect the system constraints. The information needed to form design decisions originated from information on cache systems and theory, microprocessor vendors, memory vendors and conventional cache memory systems.

There are several commercially available microprocessors that could benefit from an on-chip cache memory system. Some of these are:

Intel 8086, 80286 & 80386
Motorola 6800, 68000, 68010, 68020 & 68030
National Semiconductor 3200 & 32000
Texas Instruments 4000, 4400 & 9900
Zilog Z80 & Z8000

The cache design is not geared toward a specific microprocessor. A general design is done which could be modified slightly and applied to a specific microprocessor; the search for an optimum microprocessor is beyond this thesis. The only specific design choice made was to use 32-bit data words and 32-bit addresses, which makes the cache design inapplicable to 8 or 16-bit microprocessors.

General Operation and Data Flow

Cache Initialization

The cache memory words are initialized upon CPU start-up or power-up. The CPU will run a small program from a start-up ROM that writes the first page of memory into the cache. In an MS-DOS operating environment, the cache would be initialized with 0A00H. Cache initialization can be accomplished in two ways: one way is to include a valid bit in the cache system, and another way is to initialize the cache with a page of memory. The valid bit associated with each cache word would be set to "invalid" upon system power-up. The CPU would read and write only to valid cache locations. A cache block is validated after a tag address and the memory words are written to the cache block. The valid bits would not be used often enough to justify their placement on chip, and it would not be practical to put them off-chip.

Memory Access Cycle

The CPU submits a memory request by placing the desired address on the address buss followed by a read or write control sequence. The memory request cycle is shown in Figure 11 and follows this sequence:

1. The upper 20 address bits are applied to the tag directory search logic.

2. The tag search is performed, and the hit/miss signal is validated.

The following four steps are taken when the tag search results in a cache hit. The tag hit line signals the CPU to continue the cache access and provide the read signal at the desired time.

3. The tag match line and the set address bits are decoded to access the desired cache block. This is performed almost simultaneously with the tag directory search.

4. The decoded set line and the cache block address bits are decoded to access the desired cache word.

5. The read/write signal is applied by the CPU.
Memory Access Cycle

The CPU submits a memory request by placing the desired address on the address buss, followed by a read or write control sequence. The memory request cycle is shown in Figure 11 and follows this sequence:

1. The upper 20 address bits are applied to the tag directory search logic.
2. The tag search is performed, and the hit/miss signal is validated.

The following four steps are taken when the tag search results in a cache hit. The tag hit line signals the CPU to continue the cache access and provide the read signal at the desired time.

3. The tag match line and the set address bits are decoded to access the desired cache block. This is performed almost simultaneously with the tag directory search.
4. The decoded set line and the cache block address bits are decoded to access the desired cache word.
5. The read/write signal is applied by the CPU.
6. Data is either transferred to the 32-bit data buss (in the case of a read operation) or transferred from the data buss and written into the cache word (in the case of a write operation).

The following five steps are taken when the tag search results in a cache miss. The tag hit/miss line signals the CPU to access main memory and perform a cache replacement operation.

3. The tag write line is applied, and the set address bits are decoded, allowing the upper 20 address bits to be written into the tag directory.
4. The tag match lines are applied.
5. The block address bits are decoded.
6. The write line is applied by the CPU, and data from the data buss (and main memory) is written into the cache memory word.
7. The write operation is performed three more times, filling the cache block from main memory. This is performed in a burst mode which requires only the decoding of the block address (lowest two address bits).

Figure 11. Memory Request Cycle Timing Diagram.

Size and Organization

Some computer designers feel that it is very advantageous to separate the instruction cache from the data cache. The efficiency and performance gain is questionable [18]. Logic that keeps instructions separate from data is required and adds overhead to the operation and maintenance of a cache system. The cache system designed here for a microprocessor is a generic data and instruction cache which is easily maintained and operated. The cache is direct mapped and write-through. A cache miss initiates the replacement scheme. The organization fits well onto a microprocessor, lowers the main memory cycle time and requires very little CPU overhead. The logical flow of data in and out of the cache structure is trivial and fast. The layout is only of medium complexity and fits on-chip without increasing the total chip size significantly.

Mapping Scheme

Direct mapping was chosen because it requires the least logic to control [19]. It is not the most efficient, but some of the loss in efficiency is regained through simplicity. Direct mapping requires very little logic beyond address decoding and tag checking. The direct mapping scheme requires a smaller silicon area than set associative or fully associative cache memories. If a nested loop is large compared to the cache size, then a direct mapped cache is as efficient as a fully associative one [20].
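The chosen organization can be summarized in a short Python model, given as an illustrative sketch only: a 32-bit address splits into a 20-bit tag, a 10-bit block (set) index and a 2-bit word index, a miss burst-fills the four-word block, and writes go through to main memory. All names are hypothetical.

    def split_address(addr):
        # 32-bit address = 20-bit tag | 10-bit block index | 2-bit word index.
        word = addr & 0x3                 # A1-A0: word within the 4-word block
        block = (addr >> 2) & 0x3FF       # A11-A2: one of 1024 cache blocks
        tag = (addr >> 12) & 0xFFFFF      # A31-A12: held in the tag directory
        return tag, block, word

    class DirectMappedCache:
        def __init__(self):
            self.tags = [None] * 1024
            self.data = [[0] * 4 for _ in range(1024)]

        def access(self, addr, main_memory, value=None):
            tag, block, word = split_address(addr)
            hit = self.tags[block] == tag
            if not hit:
                # Miss: burst-fill the four-word block and rewrite the tag.
                base = addr & ~0x3
                self.data[block] = [main_memory[base + i] for i in range(4)]
                self.tags[block] = tag
            if value is None:             # read
                return self.data[block][word], hit
            self.data[block][word] = value
            main_memory[addr] = value     # write-through keeps main memory valid
            return value, hit

Because the block index comes directly from the address, no placement decision is ever made, which is the source of the simplicity argued for above.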
Initialization

When a microprocessor is powered up, the cache memory and the cache tag directory will contain random data. The CPU may make a request, and the tag directory could indicate that the needed information is contained in the cache before any valid data has been written into it. One approach to eliminating this problem is to have the CPU run a procedure upon start-up that sets a valid bit on all cache memory locations. The valid bit is checked to determine whether the cache memory location is valid or not. Another and better way is to have the CPU run a tag directory initialization upon start-up. The initialization could write zeros into all tag directory cells. Memory location 0000 can then be reserved for the system or not used at all. This will ensure that data is not accessed from the cache until the cache has been written into.

Replacement Procedure

The replacement procedure is trivial in a direct mapped cache system. In fact, the only choice in replacement structures that the designer has is when cache blocks are replaced, not where they go. The system described here uses a cache-miss-generated replacement scheme. A cache block is replaced only when a miss occurs. A word of a cache block could be used to signify that the next CPU access will be in another block, but there is no simple way to tell which block will be needed next. The logic required for replacement prediction is not trivial and is too complex for application on a microprocessor.

Memory Consistency

The write-through consistency scheme is employed because of its high reliability and easy maintenance. The microprocessor address and data buss will be used more often, but the increased traffic should not create a bottleneck. One drawback to this scheme is that main memory cannot keep up with a series of writes. The main memory controller will notify the CPU to halt program execution until main memory has caught up.

Cache Size

The overall cache size is governed mainly by the size of the individual memory cell and the size of the tag directory. The I/O logic and address decoding logic require 11% of the total silicon area for the cache memory. The tag directory requires 16%, and the memory cells fill the remaining 73% of the total silicon area. A larger cache produces a higher hit ratio, but the optimum performance per size is around 4K words. The minimum useful cache size for the Intel 80386 is 4K words [21]. The complete cache system with 4K words, a 1024-word tag directory and all required support logic occupies 464.8 square millimeters, or 849 x 850 square mils. The cache memory system contains approximately 1,000,000 transistors; this would raise the total transistor count of an MC68000-based design to 1.1 million transistors and bring the total silicon area to approximately 920 x 920 square mils.

Block Size

The block size is four words. This allows relatively high hit ratios and requires only four continuous memory access cycles to replace a cache block. The small block size allows for the storage of many different possible software loops in the cache structure.

Tag Directory

The tag directory block diagram is shown in Figure 12. The tag directory is a direct mapped CAM with very simple write logic and hit/miss signaling. The decoded set address allows for the writing of a tag word without a second decoding of the memory address. The CAM directory is searched in parallel. The hit/miss signal indicates whether the desired memory word is located in the cache memory. The only overhead associated with the tag directory is writing new addresses when cache blocks are replaced.

Figure 12. Tag Directory Block Diagram.

Static RAM memory cells could be used instead of CAM memory cells. The circuit layouts, which are shown in Appendix C, show that static RAM cells are only 28% smaller than CAM cells when laid out in a two-metal CMOS process. The total area for the cache system is only 4% smaller when SRAMs are used instead of CAMs. CAM tag cells allow for easier tag checking; SRAM tag cells require that the contents of the tag word be transferred to the address buss and XORed with the desired tag address to check for a cache hit. The price in area paid for the use of CAM memory cells is small and results in gains of simplicity and speed.
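The difference between the two tag-checking styles can be made concrete with a behavioral sketch (illustrative Python only; the hardware performs these comparisons with match lines and XOR gates, not software):

    def cam_tag_check(tag_words, search_tag):
        # CAM: every tag word compares all 20 bits at once; a match line
        # stays high only if no cell in that word mismatches.
        match_lines = [stored == search_tag for stored in tag_words]
        return any(match_lines), match_lines

    def sram_tag_check(tag_words, set_index, search_tag):
        # SRAM: the one candidate tag must first be read out, then XORed
        # with the desired tag; an all-zero result means a hit.
        return (tag_words[set_index] ^ search_tag) == 0

The CAM version needs no read-out step before the comparison, which is the speed advantage cited above.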
Top-Down Design

The system block diagram is shown in Figure 13. The cache system is shown with the CPU. The data and control signals that connect the CPU to the cache are also shown.

Figure 13. CPU - Cache Memory System Diagram.

Figure 14 shows the block diagram of the cache memory. The major portions are:

1. Cache Tag Directory
2. Cache Block Address Decoder
3. Cache Memory Blocks

The tag directory contains all the tag CAM and is searched when a memory request is issued. The block address decoder decodes address bits A2 - A11 and selects the correct tag directory word during a cache replacement operation. The word address decoder selects the correct cache memory word during a memory request. The cache memory blocks store four memory words and contain the two-bit word address decoder.

Figure 14. Cache Memory System Block Diagram.

Figure 15 shows the logic diagram of a cache memory block. The cache words contain 32 memory cells and are connected to the data buss. Address lines A0 and A1 are used to select the desired cache word from a cache block. The two-bit address decoder, shown in Figure 16, is a precharge logic decoder combined with an AND gate. The address is decoded and "ANDed" with the decoded block line to select the desired cache word.

Figure 15. Cache Block Logic Diagram.

Figure 16. Two-Bit Address Decoder Logic Diagram.

The two possible memory cell designs are dynamic memory cells and static memory cells. The dynamic memory cell offers a medium access time and consumes only a few square microns. The static memory cell is faster but requires more silicon area. The dynamic memory cell must be refreshed every few milliseconds, and the stored information must be rewritten into the cell after each read operation. The static memory cell requires very little supporting logic and retains information as long as power is supplied. Static RAM memories are used almost exclusively in small buffer memories and cache memories.

The logic diagram for a dynamic memory cell is shown in Figure 17. The cell requires only 3 transistors and occupies less silicon area than a static memory cell. The information is stored on the gate of transistor #2. A charged gate represents a logic 0, and an uncharged gate is a logic 1. The read data line is precharged just before a read operation is performed. When the read strobe line is pulsed, the read data line will be discharged if the gate of transistor #2 is charged and will remain charged if the gate of transistor #2 is uncharged. The leakage resistance associated with the gate determines how often the gate must be refreshed to preserve the stored information. The required refresh and I/O logic demand a significant amount of design effort and extra silicon area. The access time is too slow to justify the cost of the extra complexity. Dynamic RAMs are used exclusively in large main memories.
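The read mechanism just described condenses into a toy model (illustrative Python only; no timing, leakage or refresh is modeled):

    def dram_3t_read(gate_charged):
        # Figure 17 read: the read data line is precharged high, then the
        # read strobe pulses; a charged storage gate (logic 0) discharges
        # the line through transistor #2, while an uncharged gate (logic 1)
        # leaves it high, so the line ends up at the stored logic value.
        read_data_line = 1                    # precharged
        if gate_charged:
            read_data_line = 0                # discharged via transistor #2
        return read_data_line

    assert dram_3t_read(gate_charged=True) == 0   # charged gate stores logic 0
    assert dram_3t_read(gate_charged=False) == 1  # uncharged gate stores logic 1

Note that the read is destructive only in the sense described above: the surrounding logic must rewrite the cell, which is part of the support-logic cost weighed next.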
The logic diagram for a static RAM memory cell is shown in Figure 18. The cell requires 6 transistors and occupies more silicon area than a dynamic memory cell. The information is stored on two cross-coupled inverters that continually drive each other. The static memory cell will retain information for as long as it is supplied with power. The word line is pulsed just prior to a read or write operation. A write operation forces the data lines to the desired signal level (0 volts for a logic 0 and +5 volts for a logic 1). A read operation allows the cross-coupled inverters to drive the data lines to the voltage levels contained on the inputs of the cross-coupled inverters.

Figure 17. Dynamic Memory Cell Logic Diagram.

The dynamic RAM cell requires more I/O logic and a refresh sequencer. Conservation of silicon area and simplicity of design are the two most important factors in this design. The addition of a cache memory module must be small and simple, and dynamic memory cells require too much support logic to justify their use. The six-transistor static RAM cell requires a sense amplifier for a read operation and a speed-up circuit to increase the speed of read and write operations. One sense amplifier is used to drive a column of static memory cells because only one cache memory word is accessed at a time. When the memory cell is pulsed for a read, the cross-coupled inverters start to modify the data and data-bar lines, but they are not large enough to drive the data lines in a reasonable amount of time. The sense amplifier "senses" the small change in the data lines and drives them quickly to the logic levels specified by the SRAM. Figure 19 shows the logic and timing diagram for the sense amplifier.

Figure 18. Static Memory Cell Logic Diagram.

Figure 19. Sense Amplifier Logic and Timing Diagram.
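The read sequence of Figure 19 can be paraphrased in a few lines (a behavioral Python sketch with made-up voltage figures, not a circuit simulation):

    def sram_read_with_sense_amp(stored_bit):
        # Both data lines are precharged high before the word line pulses.
        data, data_bar = 5.0, 5.0
        # The cross-coupled inverters begin to pull one line low, but only
        # by a small amount in the available time (0.5 V is arbitrary here).
        if stored_bit == 1:
            data_bar -= 0.5
        else:
            data -= 0.5
        # The sense amplifier detects the small differential and snaps the
        # lines to full logic levels, which is what the speed-up signal times.
        return 1 if data > data_bar else 0

    assert sram_read_with_sense_amp(1) == 1 and sram_read_with_sense_amp(0) == 0

The design choice is that a small, weak cell plus one shared sense amplifier per column is cheaper in area than making every cell strong enough to drive the long data lines itself.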
The block address decoder logic diagram is shown in Figure 20. Address bits A2 - A11 are decoded to select the correct cache memory block during a memory request and the correct tag word during a tag directory write. The decoder is a two-input AND gate. One input is the tag match line and the other is a precharged line. The precharged line is discharged through a pass transistor if the applied address is not the block's correct address.

Figure 20. Block Address Decoder Logic Diagram.

The tag directory block diagram is shown in Figure 21. The tag words are connected in parallel to the upper 20 address bits of the address buss. A tag word will leave the match line high if the searched address is contained in its memory cells. The tag match signal is passed to the hit/miss signal, notifying the CPU that the desired memory word is contained in the cache. The tag directory is updated during a cache replacement. The tag write logic diagram is shown in Figure 22. The decoded block line and the tag write line are "ANDed" to produce the tag word line that allows the tag word to be written.

Figure 21. Tag Directory Block Diagram.

Figure 22. Tag Write Logic Diagram.

The tag word contains 20 content addressable memory cells (CAM cells). The cells are all connected to the tag match line. During a tag search, any tag cell can pull the match line low, signifying that the desired tag address is not contained in this tag word. The CAM cell contains two cross-coupled inverters to store the tag bit and pass transistors that are pulsed during a tag search or a tag write. Figure 23 shows the nine-gate CAM cell logic diagram.

Figure 23. Nine-Gate CAM Cell Logic Diagram.

The read and write operations are the same as for a normal static memory cell. A search operation is performed by (1) precharging the match line and (2) applying the data to the bit-bar line and data-bar to the bit line. The match transistor will remain off if the data in the cell matches the data on the bit and bit-bar lines. The match transistor of any cell in the tag word can discharge the match line during a tag search.

CPU Communications and Timing

The CPU communicates with the cache memory through the:

1. 32-bit address buss
2. 32-bit data buss
3. precharge signal
4. read/write signal
5. perform tag search signal
6. tag write signal
7. hit/miss signal
8. data speed-up signal

The timing diagram for a typical read cycle is shown in Figure 24. The CPU generates the following signals:

1. The precharge signal is applied.
2. The address bits are applied to the cache memory system.
3. The tag search signal is applied.
4. The tag match line will stay high if there is a cache hit.
5. The tag pass signal is applied, and the correct cache word is addressed.
6. The read signal is applied, followed by the data speed-up signal.
7. The desired cache memory word is now on the data buss, and the CPU continues processing.

Figure 24. Typical Cache Read Timing Diagram.
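The required ordering of these signals can be captured in a small checker (an illustrative Python sketch; the signal names mirror the list above, the sequencing test is hypothetical):

    READ_CYCLE_ORDER = [
        "precharge",        # 1. precharge the match and data lines
        "address_valid",    # 2. address applied to the cache
        "perform_search",   # 3. parallel tag search
        "tag_match",        # 4. match line left high on a hit
        "tag_pass",         # 5. match gated through to the word decode
        "read",             # 6. read strobe ...
        "speed_up",         #    ... followed by the data speed-up pulse
        "data_valid",       # 7. word driven onto the data buss
    ]

    def check_ordering(events):
        # True if the observed events respect the required read-cycle order.
        positions = [READ_CYCLE_ORDER.index(e) for e in events if e in READ_CYCLE_ORDER]
        return positions == sorted(positions)

    assert check_ordering(["precharge", "address_valid", "perform_search",
                           "tag_match", "tag_pass", "read", "speed_up", "data_valid"])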
CHAPTER 4

FINAL DESIGN

Circuit Layout

The complete layout is done with 10 cells. The layouts are:

1. Static RAM Cell.
2. Content Addressable Memory Cell.
3. Cache Word Address Decoder.
4. Cache Block Address Decoder.
5. Tag Line Pass and Tag Word Select Logic.
6. Top Tag Write Logic.
7. Bottom Tag Write Logic.
8. Static RAM Speed Up Logic.
9. Address Inverter.
10. Tag Match Line Precharge Logic.

The logic diagrams and circuit layouts are shown in Appendix A. All of the circuits are documented with circuit specifications, circuit layouts, logic diagrams and circuit parasitic calculations.

Circuit Parameters

The circuit parameters are derived from the layout artwork. The parasitic capacitance and resistance depend upon interconnection length and location relative to the other layers of the layout. The circuit parameters determine the performance of the system. The parasitic capacitance determines the maximum speed of the system and the power consumption. In the cache system designed here, the critical circuit parameter is the capacitance associated with the data lines and address lines that connect all the cache blocks together. The capacitance values are determined by following the calculation rules explained in Appendix B.

Figure 25. Cache Memory System and Subsystem Sizes.

Subsystem                              Size (mm²)
Tag Directory                              73.5
Cache Memory Words                        338.6
Tag Match Pass and Write Logic              8.2
Block Address Decoder                      19.6
Word Address Decoder                       24.3
Speed Up                                    0.3
Tag Write Logic                             0.3
Total System Size                         464.8

System Size

The total system size is 464.8 mm². The cache memory words occupy 73% and the tag directory 16% of the total layout. The address decoding circuits and other necessary logic make up the remaining 11% of the layout. The system and subsystem sizes are shown in Figure 25. The combination of an MC68000 and the cache system would occupy almost a full square inch of silicon area and contain 1.1 million transistors. The yield for a chip this size would be low, on the order of about one chip per wafer. This yield would escalate the cost of the processor/cache memory chip.
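The percentage figures quoted above follow directly from the subsystem areas in Figure 25, as this short Python check shows (the dictionary keys are abbreviations of the subsystem names):

    subsystems_mm2 = {
        "tag_directory": 73.5,
        "cache_memory_words": 338.6,
        "tag_match_pass_and_write": 8.2,
        "block_address_decoder": 19.6,
        "word_address_decoder": 24.3,
        "speed_up": 0.3,
        "tag_write": 0.3,
    }
    total = sum(subsystems_mm2.values())          # 464.8 mm^2
    for name, area in subsystems_mm2.items():
        print(f"{name:26s} {100 * area / total:5.1f}%")
    # cache_memory_words ~72.9%, tag_directory ~15.8%, remainder ~11.3%,
    # matching the 73% / 16% / 11% split quoted above.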
Alternate Design - Fully Associative Cache

The extra silicon area required to use content addressable memory (CAM) cells in a CMOS environment is small relative to the size of static RAM (SRAM) cells, so a fully associative cache memory is not impossible to implement. The fully associative cache scheme differs only by a few hardware items from the direct mapped cache implementation; it also differs in timing and control. Four changes that would transform the direct mapped cache design into a fully associative cache are:

1. Ten more bits per tag word.
2. A more complex replacement scheme and an address generator for cache replacement.
3. A full/empty bit associated with each cache block and the logic to check for empty space when a cache block is received from main memory, or a start-up procedure that fills the cache memory.
4. Additional communication signals and a more complex timing scheme.

A system diagram of a fully associative cache memory system is shown in Figure 26. The shaded areas define the extra area and logic needed to transform the direct mapped cache into a fully associative mapped cache.

Figure 26. System Diagram of a Fully Associative Mapped Cache Memory.

Subsystem                              Size (mm²)
Tag Directory                             125.8
Cache Memory Words                        338.6
Tag Match Pass and Write Logic              8.2
Block Address Decoder                      19.6
Word Address Decoder                       24.3
Speed Up                                    0.3
Tag Write Logic                             0.3

Tag Word Length

A fully associative cache memory would require 30-bit tag words: ten more CAM cells per tag word than the direct mapped design. This requires an extra 9 mm² of silicon area but increases the total silicon area for the cache system by only 1.9%.

Full/Empty Designation

The CPU must know when the cache memory is full. A cache replacement is not necessary until the cache memory is full and a memory location other than what is contained in the cache is needed. A full/empty bit associated with each block is one possible solution. Assuming that the cache memory is filled with the first page of the operating system upon system start-up could replace the needed full/empty-bit hardware. The full/empty bit is not used enough to justify locating it on-chip and is impractical to locate off-chip. This solution requires a specific software environment that initializes the cache memory upon power-up and after any system failure.

Replacement Scheme

The fully associative organization allows complete flexibility in the placement of cache blocks, which results in more efficient use of the cache memory space. The simplest replacement scheme is random replacement. After it has been determined that a cache replacement is needed, a block is chosen at random by a random number generator. The new block is then written into the cache.

Communications and Timing

The data flow of a fully associative mapped cache memory request cycle that results in a cache hit is identical to the data flow of a direct mapped cache memory cycle. When a memory request results in a cache miss, the replacement of a cache block requires the generation of the block address. A random number generator will suffice for a random replacement algorithm. The block address (generated from the random number generator) is applied to a block address decoder which selects the location for the needed main memory block. Program execution then continues until another cache miss occurs. The extra communication signals required of the CPU are:

1. A signal to start the random number generator.
2. A signal to apply the random cache block address to the address decoder.

Advantages of Fully Associative Mapping

There are some advantages to using a fully associative mapped cache. The advantages are best displayed through the execution of benchmark programs and comparison of the execution times. Possible advantages are:

1. The cache is used more efficiently.
2. The use of CAM cells is fully justified.
3. Several different types of replacement schemes are possible.
4. The hit ratio could be higher, resulting in shorter memory access times.

The fully associative mapped cache is not implemented or simulated because the advantages are only visible after some benchmark programs are simulated. This exercise is beyond the scope of this thesis.
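For comparison with the direct mapped model given earlier, the fully associative alternative can be sketched as follows (illustrative Python only; the hardware would use a parallel CAM search and a hardware random number generator, and all names here are hypothetical):

    import random

    class FullyAssociativeCache:
        def __init__(self, num_blocks=1024, seed=0):
            self.tags = [None] * num_blocks   # 30-bit tags: any block anywhere
            self.rng = random.Random(seed)    # stands in for the RNG circuit

        def lookup(self, tag):
            # Every tag word is searched at once in the CAM directory.
            return tag in self.tags

        def victim(self):
            # Random replacement: this index would drive the block decoder.
            return self.rng.randrange(len(self.tags))

The 30-bit tag follows from the 32-bit address less the 2 word-index bits: with no set index, the entire block address must be stored and compared.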
CHAPTER 5

DESIGN SIMULATION

Design and Simulation Tools

Layout Software

The QUICK KIC layout package was used in the layout and design of the on-chip cache. This package is the only software package available at Montana State University that is capable of a layout of this size. The package includes a design rule checker and produces layouts that can be processed and sent to a fabrication facility.

Circuit Simulation

The critical signal paths are simulated to produce performance measures. The simulation software is TSPICE, a TEKTRONICS version of Berkeley SPICE. The QUICK KIC graphics editor includes an electronic schematic editor. The schematic editor was used to create the TSPICE circuit definition file. This file contains all transistor sizes, node connections and transistor types. The parasitic capacitances were calculated by hand and added to the circuit definition file to create a complete model of a cache memory word. A sample parasitic capacitance calculation is given in Figure 27; the capacitance associated with the input of an inverter is calculated:

Thin oxide: 2 x 16 x 6.67E-4 pF = 21.4 fF
Polysilicon to substrate: 2 x 20 x 0.48E-4 pF = 1.92 fF
Total Cin capacitance = 23.36 fF

Figure 27. Sample Parasitic Capacitance Calculation.

Layout Technology

The layout technology used is a two-metal CMOS process. Most of today's microprocessors are constructed with a two- or three-metal CMOS process. CMOS transistor technology has replaced bipolar TTL and NMOS technologies over the last few years. There are several reasons for this evolution:

1. Smaller, more compact layouts are easily achieved.
2. The noise immunity is better than both bipolar TTL and NMOS.
3. The power consumption is the lowest among current circuit technologies. CMOS circuits only draw current when they are switching. Very high speed CMOS circuits consume as much power as some TTL.
4. The speed of CMOS circuits rivals fast bipolar TTL.

The layout technology follows the MOSIS scalable design rules, version 6. The layout rules are given in Appendix C.

Speed Estimate

The cache word was simulated with a 50 MHz dual cycle clock. The access time is 40 ns for both a cache read and a cache write. The block replacement procedure takes 80 ns. The address and data lines all stabilize to the desired logic levels in the time specified by the timing diagram. Appendix D contains the complete circuit schematic and the SPICE input file used to simulate a cache memory access with a cache hit. All possible data and address combinations were simulated for cache reads and writes. The cache block replacement was also simulated. The data read is a 0, and the tag address checked was also a 0. The resulting transient analysis plot is shown in Figure 28.

Figure 28. Transient Analysis of a Cache Memory Request (tag match line, data-bar line and data line).

Power Consumption Estimate

The power dissipation of CMOS circuits is a function of the frequency, the voltage and the capacitance associated with the circuit. The power dissipation is calculated by

Pd = Cl * Vdd^2 * f    (8)

as described by Weste and Eshraghian [22]. Pd is the dissipated power, Cl is the load capacitance, Vdd is the source voltage, and f is the frequency at which the output is driven. The power dissipation is estimated for each cell of the layout operating at 25 MHz. Not all cells in the cache memory will be operating at one time. The total power dissipation is the combined power dissipation of:

1. all the partially active address decoding cells
2. all the active tag cells
3. one fully active data word.

The power dissipation estimate comes from driving all these cells at the maximum frequency.
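Equation (8) is easy to apply directly; the one-liner below (Python) reproduces the tag-cell figure derived in the hand calculation that follows:

    def dynamic_power(c_load_farads, vdd_volts=5.0, freq_hz=25e6):
        # Equation (8): P = C * Vdd^2 * f.
        return c_load_farads * vdd_volts**2 * freq_hz

    print(dynamic_power(21.6e-15))   # 1.35e-05 W, i.e. the 13.5 uW tag-cell value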
The power dissipated in the tag cell is calculated as follows:

C Address        = 7.5 fF
C Address Bar    = 5.8 fF
C Tag Match Line = 8.3 fF
Total capacitance = 21.6 fF

Power dissipation = 21.6 fF * (5 volts)² * 25 MHz = 13.5 µW

The power dissipation for each individual cell is given in Table 4. The total power consumption for the cache memory is 308 mW and is easily dissipated in a silicon die this size. There are no power dissipation problems, and the power dissipation for a CPU/cache memory system would be raised by 50 percent (from 0.6 to 0.9 watts).

Table 4. Power Dissipation of Cache Memory System.

Individual power dissipation (active cells):
Tag Cell:              13.5 µW x 20,480 cells = 277 mW
SRAM Cell:             27.1 µW x 32 cells     = 0.87 mW
Word Address Cell:     142 µW x 1 cell        = 0.14 mW
Block Address Cell:    341 µW x 1 cell        = 0.34 mW
Tag Pass Cell:         227 µW x 1 cell        = 0.23 mW
Top Tag Write Cell:    213 µW x 20 cells      = 4.3 mW
Bottom Tag Write Cell: 213 µW x 20 cells      = 4.3 mW
Speed Up Cell:         92.9 µW x 2 cells      = 0.19 mW
Address Inverter Cell: 90.7 µW x 12 cells     = 1.1 mW
Tag Precharge Cell:    45.7 µW x 1 cell       = 0.05 mW

Individual power dissipation (partially active cells):
SRAM Cell:             8.19 µW x 127 cells    = 1.04 mW
Word Address Cell:     44.7 µW x 127 cells    = 5.68 mW
Block Address Cell:    197 µW x 31 cells      = 6.11 mW
Tag Pass Cell:         57.8 µW x 31 cells     = 1.79 mW
Tag Precharge Cell:    45.7 µW x 127 cells    = 5.80 mW

Total power dissipation: 308.6 mW

System Performance Estimate

The system performance estimate includes an estimate of the hit ratio. This estimate comes from the graphs in Chapter 2 and carries some uncertainty, which propagates into the performance estimate. The hit ratio (H) is estimated to be 0.97, the main memory access time (Tm) is estimated to be 400 ns, and the cache memory access time (Tc) is 40 ns. The speed up factor, calculated with Equation 5, is 7.87:

Speed up = 1 / (1 - H(1 - Tc/Tm))    (5)

Figure 29 is a plot of speed up vs. hit ratio for the on-chip cache memory system. The hit ratio can vary from program to program. A true test of performance would be to fabricate the cache memory and exercise it with several benchmark programs. This has not been done because it is beyond the scope of this thesis.

Figure 29. Speed Up vs. Hit Ratio for the On-Chip Cache Memory System.
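Equation (5) in executable form, with the estimate above reproduced and a quick sensitivity check (Python):

    def speed_up(hit_ratio, t_cache_ns, t_main_ns):
        # Equation (5): effective speed up of a cached system over
        # main memory alone.
        return 1.0 / (1.0 - hit_ratio * (1.0 - t_cache_ns / t_main_ns))

    print(speed_up(0.97, 40, 400))   # ~7.87, the figure quoted above
    print(speed_up(0.90, 40, 400))   # a lower hit ratio drops this to ~5.3

The formula is just the ratio of the main memory access time to the average access time H*Tc + (1 - H)*Tm, which is why the result is so sensitive to the hit ratio.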
Comparison to Conventional Cache Systems

A conventional cache memory system must wait for interchip data transfers, which increases the cache memory access time. A typical cache memory access time for conventional cache memory systems is about 80 ns. The speed up factor of the on-chip cache memory versus a conventional cache memory is 1.94. Figure 30 is a plot of speed up factor vs. hit ratio for the on-chip cache memory system versus a conventional cache memory system.

Figure 30. Speed Up vs. Hit Ratio for an On-Chip Cache Memory vs. a Conventional Off-Chip Cache Memory.

The power consumption associated with interchip transfers and the operation of separate cache memory chips is much greater than the power consumed by an on-chip cache memory. The typical power consumption of a 16 Kbyte SRAM used for a cache memory is 700 mW. The total power consumption of the on-chip cache memory and CPU is 0.9 watts. The conventional cache memory system requires the design of a more complex circuit board and increases the chip count of the computer system. The on-chip cache memory reduces the total chip count and the glue logic needed to interface fast SRAM as a cache memory system. The mother board of most computers is already crowded, and the reduced chip count would reduce the cost, but this might not be enough to offset the increased cost of the CPU. If enough computer systems were built and sold, the on-chip cache memory system could be cheaper. The combination cache memory/CPU chip lends itself very well to parallel processing or multiprocessing. Each processor incorporated in a multiprocessor computing system would contain its own cache memory system, which would greatly reduce design time and enhance performance. Each processor could work from a shared main memory and contain a separate and secure task in local cache memory. The diagnosis and integration of a CPU/cache memory chip will be more difficult because there is no way of monitoring the cache memory contents externally.

CHAPTER 6

CONCLUSION

Discussion of Results

The on-chip cache memory system designed here displayed several interesting characteristics. It improved system performance as expected because of decreased memory access times relative to both a non-cached system and a conventional off-chip cached system. The speed up factor relative to a non-cached system is approximately 7.8, and relative to a conventional off-chip cached system it is approximately 1.9. The inclusion of a CAM is not very expensive in silicon area and paves the way for a fully associative cache memory scheme. The fully associative cache memory would require more hardware and a more complex timing scheme but would result in very efficient use of the cache memory. The hardware would increase the total system size by only about 4 percent. The cache memory system consumed more silicon area than would be feasible to put on one silicon die, but the cache is 16 Kbytes, which is quite large. The size can be reduced by layout compaction and possibly logic reduction. The power dissipation of an on-chip cache system is substantially less than that of an off-chip implementation. The processor/cache memory chip power dissipation is approximately 0.75 W; a conventional processor/cache memory system would dissipate approximately 2.6 W. Because of the processor/cache memory chip size, it would initially be quite expensive to produce and manufacture. The yield per wafer for this size of silicon die would be low, on the order of about one chip per wafer. The cost of designing and building a conventional cache memory would be less, but the chip count for the total system is increased and the circuit board layout would be more complex. The processor/cache memory chip would lend itself to use in a parallel processing environment. Each processor in a multiprocessor environment would have its own local, secure cache memory.

Future Work

The continuation of this work should include the simulation of an SRAM tag directory and the design and simulation of a fully associative cache system. It would be interesting to then perform a cache memory system simulation and check the improvement on several benchmark programs. Other work should include the reduction of the cache access time. The cache memory access times should be comparable to the access times of the CPU registers because of their location on-chip. This would be the fastest cache memory needed, because it would be almost like having 16 thousand registers. The instruction set would not support 16 thousand registers, but the access times would allow operation speeds almost as fast. The cache memory could also be pipelined: cache memory words (both data and instructions) could be prefetched to increase system performance.
REFERENCES CITED

1. Reinhart, J. and C. Serrano, "High-Speed Components and a Cache Memory Lower Access Times," Computer Technology Review, Winter 1984.
2. Hill, M. D. and A. J. Smith, "Experimental Evaluation of On-Chip Microprocessor Cache Memories," Conference Proceedings of the 11th Annual International Symposium on Computer Architecture, June 5-7, 1984.
3. Starnes, T. W., "Design Philosophy Behind Motorola's MC68000," BYTE, April 1983.
4. Introduction to the 80386, Intel Corporation, Literature Distribution, Mail Stop SC6-59, 3065 Bowers Avenue, Santa Clara, California 95051, 1985.
5. Bertram, W. J., "Yield and Reliability," in VLSI Technology, ed. S. M. Sze, p. 600, McGraw-Hill Inc., 1983.
6. Hwang, K. and F. A. Briggs, "Cache Memories and Management," Computer Architecture and Parallel Processing, McGraw-Hill Inc., 1984.
7. Goodman, J. R., "Using Cache Memory to Reduce Processor-Memory Traffic," Conference Proceedings of the 10th International Symposium on Computer Architecture, pp. 124-131, Stockholm, Sweden, 1983.
8. Reinhart, J. and C. Serrano, "High-Speed Components and a Cache Memory Lower Access Times," Computer Technology Review, Winter 1984.
9. Smith, A. J., "Cache Memory Design: An Evolving Art," IEEE Spectrum, Vol. 24, No. 12, December 1987.
10. Hamacher, V. C., Z. G. Vranesic and S. G. Zaky, "Cache Memories," Computer Organization, pp. 306-313, McGraw-Hill Inc., 1984.
11. Mano, M. M., "Cache Memory," Computer System Architecture, p. 501, Prentice-Hall Inc., 1982.
12. Strecker, W. D., "Cache Memories for the PDP-11 Family of Computers," Computer Engineering: A DEC View of Hardware Systems Design, pp. 263-270, Digital Press Inc., 1978.
13. Smith, J. E. and J. R. Goodman, "A Study of Instruction Cache Organizations and Replacement Policies," ACM Transactions on Computing, pp. 132-137, Association for Computing Machinery, 1983.
14. Mano, M. M., "Cache Memory," Computer System Architecture, p. 505, Prentice-Hall Inc., 1982.
15. Mano, M. M., "Cache Memory," Computer System Architecture, p. 507, Prentice-Hall Inc., 1982.
16. Mano, M. M., "Cache Memory," Computer System Architecture, p. 504, Prentice-Hall Inc., 1982.
17. Goodman, J. R., "Using Cache Memory to Reduce Processor-Memory Traffic," Conference Proceedings of the 10th International Symposium on Computer Architecture, pp. 124-131, Stockholm, Sweden, 1983.
18. Smith, A. J., "Cache Memories," Computing Surveys, Vol. 14, No. 3, September 1982.
19. Hamacher, V. C., Z. G. Vranesic and S. G. Zaky, "Cache Memories," Computer Organization, pp. 306-313, McGraw-Hill Inc., 1984.
20. Goodman, J. R., "Using Cache Memory to Reduce Processor-Memory Traffic," Conference Proceedings of the 10th International Symposium on Computer Architecture, pp. 124-131, Stockholm, Sweden, 1983.
21. Introduction to the 80386, Intel Corporation, Literature Distribution, Mail Stop SC6-59, 3065 Bowers Avenue, Santa Clara, California 95051, 1985.
22. Weste, N. and Eshraghian, K., Principles of CMOS VLSI Design: A Systems Perspective, Reading: Addison-Wesley, 1985, pp. 148-149.

APPENDICES

APPENDIX A

CIRCUIT DOCUMENTATION

1. TAG CELL

The tag cell stores the address information for the tag directory. Twenty tag cells combine to form one tag word. The address is checked, and if a match occurs, the precharged TAG MATCH line remains asserted. The circuit schematic is shown in Figure 31.
The circuit layout is shown in Figure 32. The inputs are ADDRESS, ADDRESS_BAR and THIS_TAG_WORD. The output is TAG_MATCH_LINE.

Cell specifications:

Inputs (capacitance):      ADDR 10 fF; ADDR_BAR 10 fF; TAG_WORD 50 fF
Outputs (capacitance):     TAG_MATCH_LINE 10 fF
Cell size:                 width 57 microns, height 63 microns
Output settling time:      6 ns
Input hold time:           12 ns
Power dissipation:         125 µW
Maximum transient current: 2.5 mA

Figure 31. Circuit Schematic for TAG.CEL (C_addr = 7.5 fF, C_addr_bar = 5.8 fF, C_tagword = 31.5 fF, C_tag_match = 8.3 fF).

Figure 32. Circuit Layout for TAG.CEL.

2. SRAM CELL

The SRAM cell stores the data for the cache memory word. Thirty-two SRAM cells combine to form one cache word. The DATA and DATA_BAR lines are precharged, then the cell is polled and, depending upon the cell's contents, one of the data lines is discharged. The circuit schematic is shown in Figure 33. The circuit layout is shown in Figure 34. The inputs are DATA, DATA_BAR and THIS_DATA_WORD. The outputs are DATA and DATA_BAR.

Cell specifications:

Inputs (capacitance):      DATA 10 fF; DATA_BAR 10 fF; DATA_WORD 50 fF
Cell size:                 width 41 microns, height 63 microns
Output settling time:      5 ns
Input hold time:           8 ns
Power dissipation:         62.5 µW
Maximum transient current: 2.75 mA

Figure 33. Circuit Schematic for SRAM.CEL (C_data = 6.8 fF, C_data_bar = 6.3 fF, C_dataword = 30.3 fF).

Figure 34. Circuit Layout for SRAM.CEL.

3. WORD CELL

The WORD cell decodes the word address and "ANDs" the decoded block line to select the desired cache word. The circuit schematic is shown in Figure 35. The circuit layout is shown in Figure 36. The inputs are ADDR0, ADDR1, ADDR0_BAR, ADDR1_BAR, DECODED_BLOCK_LINE and PRECHARGE_LINE. The output is DATA_WORD.

Cell specifications:

Inputs (capacitance):      ADDR0 50 fF; ADDR1 50 fF; ADDR0_BAR 50 fF; ADDR1_BAR 50 fF; DECODED_BLOCK 100 fF; PRECHARGE 100 fF
Outputs (capacitance):     DATA_WORD 10 fF
Cell size:                 width 94 microns, height 63 microns
Output settling time:      7 ns
Input hold time:           10 ns
Power dissipation:         94 µW
Maximum transient current: 9 mA

Figure 35. Circuit Schematic for WORD.CEL (C_A1 = 37.1 fF, C_A0 = 34.4 fF, C_pre = 77.6 fF, C_decoded_block_line = 76.0 fF, C_dataword = 2.8 fF).

Figure 36. Circuit Layout for WORD.CEL.

4. BLOCK CELL

The BLOCK cell decodes the cache block address and "ANDs" the tag match line to select the cache block. The circuit schematic is shown in Figure 37. The circuit layout is shown in Figure 38. The inputs are ADDR2 - ADDR11, ADDR2_BAR - ADDR11_BAR and BLOCK_ADDR. The outputs are DECODED_BLOCK and CACHE_BLOCK.

Cell specifications:

Inputs (capacitance):      ADDR2 - ADDR11 50 fF; ADDR2_BAR - ADDR11_BAR 50 fF; BLOCK_ADDR 100 fF
Outputs (capacitance):     DECODED_BLOCK 10 fF; CACHE_BLOCK 100 fF
Cell size:                 width 304 microns, height 63 microns
Output settling time:      4 ns
Input hold time:           8 ns
Power dissipation:         281 µW
Maximum transient current: 11 mA

Figure 37. Circuit Schematic for BLOCK.CEL (C_tag_match_line = 75.1 fF, C_pre = 77.6 fF, C_decoded_block_line = 76.0 fF, C_dataword = 2.8 fF, C_A2 - C_A11 = 31.5 fF).

Figure 38. Circuit Layout for BLOCK.CEL.
5. TPASS CELL

The TPASS cell passes the tag match line to the block address decoder upon reception of the TAG PASS signal. The TPASS cell also selects the THIS_TAG_WORD line for a tag write operation. The circuit schematic is shown in Figure 39. The circuit layout is shown in Figure 40. The inputs are TAG_MATCH, TAG_PASS, TAG_WRITE, CACHE_BLOCK and HIT/MISS_IN. The outputs are BLOCK_ADDR, THIS_TAG_WORD and HIT/MISS_OUT.

Cell specifications:

Inputs (capacitance):      TAG_MATCH 100 fF; TAG_PASS 25 fF; TAG_WRITE 100 fF; CACHE_BLOCK 100 fF; HIT/MISS_IN 75 fF
Outputs (capacitance):     BLOCK_ADDRESS 10 fF; TAG_WORD 25 fF; HIT/MISS_OUT 10 fF
Cell size:                 width 127 microns, height 63 microns
Output settling time:      5 ns
Input hold time:           8 ns
Power dissipation:         313 µW
Maximum transient current: 6 mA

Figure 39. Circuit Schematic for TPASS.CEL.

Figure 40. Circuit Layout for TPASS.CEL.

6. TOP TAG WRITE CELL

The top tag write cell applies the address and address-bar lines to a column of tag cells. The address is passed when the perform search line is asserted. The circuit schematic is shown in Figure 41. The circuit layout is shown in Figure 42. The inputs are ADDRESS, ADDRESS_BAR and PERFORM_CHECK. The outputs are ADDRESS and ADDRESS_BAR.

Cell specifications:

Inputs (capacitance):      ADDRESS 100 fF; ADDRESS_BAR 100 fF; PERFORM_CHECK 200 fF
Outputs (capacitance):     ADDRESS 100 fF; ADDRESS_BAR 100 fF
Cell size:                 width 57 microns, height 110 microns
Output settling time:      8 ns
Input hold time:           10 ns
Power dissipation:         625 µW
Maximum transient current: 20 mA

Figure 41. Circuit Schematic for TWRTOP.CEL (Top Tag Write Cell).

Figure 42. Circuit Layout for TWRTOP.CEL (Top Tag Write Cell).

7. BOTTOM TAG WRITE CELL

The bottom tag write cell applies the address and address-bar lines to a column of tag cells. The address is passed when the perform search line is asserted. The circuit schematic is shown in Figure 43. The circuit layout is shown in Figure 44. The inputs are ADDRESS, ADDRESS_BAR and PERFORM_CHECK. The outputs are ADDRESS and ADDRESS_BAR.

Cell specifications:

Inputs (capacitance):      ADDRESS 100 fF; ADDRESS_BAR 100 fF; PERFORM_CHECK 200 fF
Outputs (capacitance):     ADDRESS 100 fF; ADDRESS_BAR 100 fF
Cell size:                 width 57 microns, height 110 microns
Output settling time:      8 ns
Input hold time:           10 ns
Power dissipation:         625 µW