{michaelg, cheriton}@cs.stanford.edu. This work was sponsored in part by ARPA under US Army contract DABT63-91-K-0001. Michael Greenwald was supported by a Rockwell Fellowship.

An example of a TSM implementation is a collection of descriptors that are stored in a set of page frames which are allocated and released over time. When more descriptors are required, additional page frames can be allocated from the general pool and when the number of descriptors falls, the descriptors may be consolidated into a smaller number of pages and the excessive page frames returned to the general pool. However, the release of page frames to the general pool must be delayed sufficiently to ensure the type-stability property. This delay provides a useful hysteresis to the movement of pages between this descriptor collection and the general page pool.

The list is initialized with a dummy node at the head, thus deletion of the first element works correctly.

Physical contention is separate from logical contention because one can have logical contention without physical contention as well as vice versa, so called false sharing. For example, if two shared variable can reside in the same cache line unit so there can be physical contention without logical contention if two processor attempt to update the variables simultaneously, each processor updating a separate variable.

The basic Herlihy approach involves copying the entire data structure, modifying the copy, and then atomically replacing the old copy with the new copy using CAS, and retrying the entire copy and modifying if there is a conflict. Our approach reduces the allocation and copy cost to a single descriptor rather than the entire data structure but requires DCAS.

Given data structures that are protected by a version number, te DCAS is actually a Compare-And-Double-Swap (CADS) --- the second value cannot have changed if the version number is unchanged. In these cases a minor optimization is possible and line 4 can be deleted.

It was necessary to derive our own version of the algorithm, as the algorithm presented in [24] is not strictly correct. This is the natural result of the complicated contortions necessary when using only CAS. The DCAS algorithm is relatively straightforward.

The Valois simulation in Michael and Scott [18] reports better asymptotic behavior than we do. The difference appears because the authors are only simulating a FIFO queue. In the FIFO queue algorithm -- where insertion always occurs at the tail and deletion at the head -- auxiliary nodes are not traversed in general and thus don't affect completion time. In fully general lists auxiliary nodes increase the execution time and memory traffic.

Consider processors 15#15, 16#16 and 17#17. 18#18 accesses cache lines Y,Z, 19#19 X,Y, and 20#20 W,X (addressed in ascending alphabetical order). 21#21 and 22#22 should not interact. However, if 23#23 holds Y and Z and 24#24 holds X, then when 25#25 asks 26#26 for Y, 27#27 stalls, and buffers 28#28's request for X. Thus, 29#29 delays 30#30. Longer chains can be constructed.

Their scheme requires twice the memory for every possibly shared location and extra overhead of at least a factor of three for reads and writes even in the case of no contention.

Michael Greenwald
Thu Sep 19 12:18:20 PDT 1996