The Cache Kernel was also designed and implemented to minimize both logical and physical contention to provide for efficient non-blocking synchronization. By logical contention, we mean contention for access to data structures that need to be controlled to maintain the consistency and semantics of these data structures. By physical contention, we mean the contention for access to shared memory that needs to be controlled to maintain the consistency and semantics of the memory system.
Minimizing logical contention with non-blocking synchronization minimizes the overhead of conflicting operations failing and being retried. It also avoids the complexity of complex backoff mechanisms as part of the retry.
Most of our techniques for contention minimization are well-known. For example, one aspect of contention minimization is replicating data structures for each processor. In particular, there are per-processor ready and delay queues in the Cache Kernel, so contention on these structures is limited to signal/interrupt handlers and management operations to load balance, etc. being executed by a separate processor.
Similarly, there is a signal delivery cache per processor which allows a significant number of signals to be delivered by a processor without accessing the shared signal mapping data structure, which cannot be made per-processor without replicating the entire structure. This per-processor ``cache'' approach is similar to that provided by a per-processor TLB for address translation. The TLB reduces access to the real virtual address space mapping structure, which is necessarily shared among threads in the address space.
Contention on a data structure is also reduced in some cases by structuring it as a multi-level hierarchy. For example, a list that is searched frequently may be revised to be a hash table with a version number or lock per bucket. Then, searches and updates are localized to a single bucket portion of the list, reducing the conflict with other operations, assuming they hash to different buckets. The upper levels of the hierarchy are read-only or read-mostly: descriptors are only added at the leaves.
Physical contention is also reduced by using cache-aligned descriptors. TSM with its restricted allocation of descriptors can also reduce the number of pages referenced as part of scan and search operations, reducing the TLB miss rate, another source of physical contention. Finally, in this vein, commonly updated fields are placed contiguously and aligned to hopefully place them in the same cache line, thereby making the updates more efficient.
The spatial locality of data access achieved by these techniques provides significant benefit for synchronization, whether non-blocking or conventional locks. This spatial locality also minimizes the consistency overhead when the system is running across multiple processors, with each caching portions of this shared data. In general, our experience (e.g. ) suggests that it is better to (re)structure the data structures to reduce contention rather than attempt to improve the behavior of synchronization techniques under high contention. Low-contention algorithms are simpler and thus easier to get right, and faster as long as contention is actually low.