Table of contents

  1. Introduction
    1. Locality principle
    2. Terminology
  2. Cache placement policies
    1. Direct mapping
    2. Set-associativity
      1. Exercise 1
      2. Exercise 2
      3. Exercise 3
      4. Exercise 4
  3. Additional exercises
    1. Exercise 5

Introduction

Over the years, a performance gap has formed between the processor and the memory unit. As shown in the figure below, processor performance has been increasing much faster than memory performance. Consequently, the processor has become much faster than the memory, and memory accesses have become a bottleneck for the overall performance of the computer.

Illustration of the performance gap between memory and CPU

To solve this issue, caches were introduced. Cache memories are not as fast as registers, but they can hold more data, and they are much faster than the main memory. However, they are also more expensive than the main memory and therefore a lot smaller. Most commercial CPUs offer a hierarchy of caches: the Level 1 cache (L1) is the fastest but also the smallest, the Level 2 cache (L2) is slower but larger, and the last-level cache (L3 or LLC) is the slowest but the largest.

Illustration of memory hierarchy in a computer

Caches can be local to a single processor core (this is typically the case for L1 and L2 caches), or shared across multiple cores. Finally, the L1 cache is usually split into an instruction cache (L1I), which contains program instructions, and a data cache (L1D), which contains program data.

Illustration of cache hierarchy in a computer

Locality principle

Programs usually access a relatively small portion of the address space at a time.

In particular, when a program accesses a given memory location (a variable), it is likely to access that same location again in the near future. This is called temporal locality. Hence, a memory location that is accessed by a program can be cached to speed up future accesses!

Illustration of temporal locality

Additionally, when a program accesses a memory location, it is likely to access nearby memory locations in the near future (think for instance about nearby variables on the stack or nearby array members). This is called spatial locality. Hence, when a memory location is accessed, we transfer entire blocks (multiple contiguous words in memory) into the cache at once, pulling nearby data into the cache.

Illustration of spatial locality

By exploiting these two locality principles, caches can provide a huge performance gain, even though they are small.
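
To make the two principles concrete, consider the following small Python sketch (illustrative only: Python hides the actual memory layout, whereas a C array maps directly onto consecutive memory locations):

def sum_array(data):
    total = 0                   # 'total' is reused on every iteration: temporal locality
    for i in range(len(data)):
        total += data[i]        # data[i], data[i+1], ... lie next to each other: spatial locality
    return total

print(sum_array(list(range(10))))  # prints 45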

Terminology

Consider a program that accesses a memory address: lw t0, 0x10010000.

We say that we have a cache miss if the address 0x10010000 is not in the cache (for instance if it is the first time it is accessed by the program). In that case, the value is requested from the DRAM and the memory access is slow. The value is then placed in the cache for later use (following the temporal locality principle).

We say that we have a cache hit if the address 0x10010000 is in the cache. In this case, the corresponding value is directly served from the cache and the memory access is fast.

Cache hits and misses

We can express the performance of a cache with its hit rate: the number of cache hits divided by the total number of memory accesses. Similarly, the miss rate is the number of cache misses divided by the total number of accesses, which is equal to (1 - hit rate).
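
For instance, if a program performs 20 memory accesses and 15 of them are served from the cache, the hit rate is 15/20 = 75% and the miss rate is 5/20 = 25%.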

Cache placement policies

The cache placement policy determines where a memory address should be placed in the cache.

Direct mapping

Let us start with the structure of the cache. In our examples, we consider byte-addressable memory, meaning that data is stored and accessed byte per byte (contrary to word-addressable memory, where data is stored word per word). The cache is a table that contains multiple rows, called cache sets. A cache set can contain one or more cache blocks. A block is sometimes also called a cache line, which can be slightly confusing in our representation: a cache set is drawn as a row (a line) of the cache table, yet that single row can contain multiple cache lines.

The simplest cache placement policy, called direct mapping, maps every memory address to a unique block in the cache.

Take for instance the cache model given below, where each cache set only contains a single block. Given a memory address, the index of the corresponding cache set is determined using the two least significant bits of the address (index = address % 4). Because multiple addresses map to a single cache block, the cache also needs to keep track of a tag, corresponding to the most significant bits of the address. Therefore, given an address, the index determines where to look for the data in the cache and the tag indicates whether we have a cache hit or a cache miss.

Illustration of a direct mapped cache

A memory address, composed of a tag t and an index i, is in the cache (cache hit) if the tag at index i in the cache matches t. For instance, accessing the address 0001 (i.e. tag=00, index=01) results in a cache hit because the tag in the cache at index 01 is 00. However, accessing the address 0010 (i.e. tag=00, index=10) results in a cache miss because the tag in the cache at index 10 is 01.
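
As a minimal sketch in Python (assuming the 4-bit addresses of the figure; the two sets not mentioned in the text are left invalid), the hit/miss check looks as follows:

NUM_SETS = 4  # two index bits, as in the figure above

# cache[index] holds the tag currently stored at that index (None = invalid).
# Indices 01 and 10 mirror the example above; the other two sets are left invalid.
cache = {0b00: None, 0b01: 0b00, 0b10: 0b01, 0b11: None}

def is_hit(address):
    index = address % NUM_SETS   # the two least significant bits
    tag = address // NUM_SETS    # the remaining most significant bits
    return cache[index] == tag

print(is_hit(0b0001))  # True: tag 00 matches the tag stored at index 01 (cache hit)
print(is_hit(0b0010))  # False: index 10 stores tag 01, not 00 (cache miss)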

The data in one cache block typically spans more than one byte. This enables spatial locality: when the data at a certain address is loaded, the contents of the neighboring memory locations are placed in the cache as well, in case they are accessed in the near future.

For instance, in the cache model given below, the size of a block is 4 bytes, and the lowest bits of the address give the offset of the data within a cache block: the address 001000 corresponds to the value A0, while the address 001001 corresponds to the value A1.

Illustration of a direct mapped cache where a cache set contains 2 cache blocks

:bulb: Summary
A memory address (of size 32 bits) is composed of:

  1. an offset (the b least significant bits), which determines the position of the data within one cache block;
  2. an index (the next k bits), which determines the cache set;
  3. a tag (the remaining 32-(k+b) most significant bits), which determines whether we have a cache miss or a cache hit.

Additionally, each cache line contains a valid bit, which indicates whether the line holds valid data (this is used, for example, to invalidate lines when synchronizing data across different caches).

Summary of a direct mapped cache
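
As a minimal sketch of this decomposition (the function name and the concrete values of b and k are illustrative):

def split_address(address, b, k):
    """Split an address into (tag, index, offset), given b offset bits and k index bits."""
    offset = address & ((1 << b) - 1)         # the b least significant bits
    index = (address >> b) & ((1 << k) - 1)   # the next k bits
    tag = address >> (b + k)                  # the remaining most significant bits
    return tag, index, offset

# With 4-byte blocks (b=2) and 4 sets (k=2), the address 001001 from the
# example above splits into tag 00, index 10 and offset 01:
print(split_address(0b001001, b=2, k=2))  # (0, 2, 1)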

Set-associativity

A limitation of direct-mapped caches is that each set contains only one block. Every time a new memory address that maps to an occupied set is referenced, the block in that set is replaced, which causes a cache miss. Imagine for instance a program that frequently accesses the addresses 000100 and 010100 in the above illustration. Because both addresses map to the same cache set (at index 01), accessing 010100 evicts 000100 from the cache (and vice versa). Hence, accessing both addresses in an alternating fashion results in a sequence of cache misses, which causes a performance loss.

To mitigate this problem, we can duplicate our cache structure into multiple ways, so that a given address can be placed in any of the ways. We illustrate below a 2-way set-associative cache. Now, even though 000100 and 010100 map to the same cache set (at index 01), they can be placed in two different ways and reside in the cache at the same time!

Illustration of a 2-way set-associative cache
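
This effect is easy to check with a small simulation. Below is a minimal sketch (the helper name and the FIFO eviction within a set are illustrative choices, assuming the 4-byte blocks and 4 sets of the figures above) that replays the alternating pattern against a 1-way (direct-mapped) and a 2-way cache:

def simulate(addresses, num_sets, ways):
    sets = [[] for _ in range(num_sets)]  # each set holds up to `ways` tags
    hits = 0
    for address in addresses:
        block = address // 4         # 4-byte blocks: drop the two offset bits
        index = block % num_sets
        tag = block // num_sets
        if tag in sets[index]:
            hits += 1
        else:
            if len(sets[index]) == ways:
                sets[index].pop(0)   # set is full: evict the oldest block
            sets[index].append(tag)
    return hits

pattern = [0b000100, 0b010100] * 4   # alternate between the two addresses
print(simulate(pattern, num_sets=4, ways=1))  # 0: the addresses keep evicting each other
print(simulate(pattern, num_sets=4, ways=2))  # 6: after two cold misses, every access hits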

Finally, a fully associative cache is made up of a single cache set containing multiple ways. Hence, a memory address can occupy any of the ways and is solely identified with its tag (no need for an index because there is only one set!). Fully-associative caches ensure full utilization of the cache: a block is never evicted if the cache is not full. When the cache is full, the evicted line is determined by a replacement policy (e.g., the least recently used block is replaced). However, searching for an address in a fully associative cache is expensive: it takes time (and power) to iterate through all cache blocks and find a matching tag.
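
As a sketch of such a replacement policy, here is a (hypothetical) fully associative cache with least-recently-used (LRU) eviction, using an OrderedDict to track recency:

from collections import OrderedDict

class FullyAssociativeLRU:
    """A single cache set with `ways` blocks and least-recently-used eviction."""

    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()  # ordered from least to most recently used

    def access(self, tag):
        if tag in self.tags:
            self.tags.move_to_end(tag)       # mark as most recently used
            return True                      # cache hit
        if len(self.tags) == self.ways:
            self.tags.popitem(last=False)    # evict the least recently used tag
        self.tags[tag] = None
        return False                         # cache miss

cache = FullyAssociativeLRU(ways=2)
print([cache.access(tag) for tag in [1, 2, 1, 3, 2]])
# [False, False, True, False, False]: accessing 3 evicts 2, the least recently used tag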

:bulb: Notice that a 1-way set-associative cache corresponds to a direct-mapped cache. Hence, an n-way set-associative cache provides an interesting tradeoff between a direct-mapped cache and a fully associative cache.

Illustration of an n-way set-associative cache

Exercise 1

Suppose a computer’s address size is k bits (using byte addressing), the cache data size (the total size of the stored data, excluding metadata such as tags) is S bytes, the block size is B bytes, and the cache is A-way set-associative. Assume that B is a power of two, so B = 2^b. Express the following quantities in terms of S, B, A, b and k:

  • the number of sets in the cache;
  • the number of index bits in the address;
  • the number of bits needed to implement the cache.

Exercise 2

Consider a processor with a 32-bit addressing mode and a direct-mapped cache with an 8-word block size and a total size of 1024 blocks. Calculate how many bits of the address are used to tag a block, to select a block, to select a word within a block, and to select a byte within a word. The sum of these numbers must be 32! Then calculate the total number of bits needed to implement this cache.

Exercise 3

Here is a series of address references given as word addresses: 2, 3, 11, 16, 21, 13, 64, 48, 19, 11, 3, 22, 4, 27, 11. Label each reference in the list as a hit or a miss and show the final content of the cache, assuming:

  • a direct-mapped cache with 16 1-word blocks that is initially empty;
  • a direct-mapped cache with a 4-word block size and a total size of 16 words;
  • a 2-way set-associative cache with a 2-word block size and a total size of 16 words.

The following table shows an example format of a possible solution for a direct-mapped cache with 8 2-word blocks.


+-----+--------+--------+
| Set | Word 0 | Word 1 |
+-----+--------+--------+
|   0 |     48 |     49 |
|   1 |      2 |      3 |
|   2 |      4 |      5 |
|   3 |     22 |     23 |
|   4 |     -- |     -- |
|   5 |     10 |     11 |
|   6 |     12 |     13 |
|   7 |     -- |     -- |
+-----+--------+--------+
2 (miss), 3 (hit), 11, 16, 21, 13, 64, 48, 19 (all misses), 11 (hit), 3, 22, 4, 27, 11 (all misses)
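
To sanity-check solutions in this format, one can replay the trace with a small simulator. Below is a minimal sketch (the helper name is illustrative; addresses are word addresses as in the exercise) configured for the 8-set, 2-word-block example above:

def simulate_direct_mapped(trace, num_sets, words_per_block):
    sets = [None] * num_sets             # each set remembers which block it holds
    labels = []
    for address in trace:
        block = address // words_per_block
        index = block % num_sets
        labels.append("hit" if sets[index] == block else "miss")
        sets[index] = block
    return labels, sets

trace = [2, 3, 11, 16, 21, 13, 64, 48, 19, 11, 3, 22, 4, 27, 11]
labels, final = simulate_direct_mapped(trace, num_sets=8, words_per_block=2)
print(list(zip(trace, labels)))
# Set i ends up holding words [block * 2, block * 2 + 1]:
print([(i, None if b is None else [2 * b, 2 * b + 1]) for i, b in enumerate(final)])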

Exercise 4

Associativity usually improves the hit ratio, but not always. Consider a direct-mapped cache with 16 1-word blocks and a 2-way set-associative cache with a 1-word block size and a total size of 16 words. Find a sequence of memory references for which the set-associative cache experiences more misses than the direct-mapped cache.

Additional exercises

Exercise 5

Consider a processor with a 32-bit addressing mode and a 4-way set-associative cache with an 8-word block size and a total size of 2048 blocks. Calculate how many bits of the address are used…

  • to select the set;
  • to tag a block;
  • to select a word in the block;
  • to select a byte in the word.