Most processors provide at most single Compare-and-Swap (CAS) functionality to support non-blocking synchronization. Herlihy's general methodology  shows that that single CAS is adequate in theory but appears too inefficient in practice. A few processors such as the Motorola 68040 provide a multi-word atomic instruction but that functionality is rare and is not present in any RISC processor to our knowledge. The RISC-like extension that we propose in Section 6 suggests that it is feasible to support in modern processors. The CISC approach does not appear viable with most current and future processors and seems likely to die out with the current processors that support it.
Transactional Memory  provides hardware support for multiple-address atomic memory operations. It is more general than DCAS but comes at a correspondingly higher cost. The proposed hardware implementation requires six new instructions, a second set of caches in the processor, twice the storage for cache lines actively involved in a transaction, and a more complicated ``commit'' protocol. Double LL/ SC appears to be a more practical solution because DCAS functionality is sufficient and significantly simpler to implement.
Oklahoma Update  provides an alternate implementation of multiple-address atomic memory operations. Rather than duplicating entire cache lines involved in transactions (as Transactional Memory does), Oklahoma Update requires only a reservation register per word used in their version of Load Linked. This register contains flags plus two words (and optionally two more). This contrasts with our implementation which requires a ``link address retained'' register per synchronized word and a single cache-line buffer for the delayed SCP. Our design can also work with a word register instead of an entire cache line to buffer the SCP. However, this approach adds complexity to the chip's logic, slows down the SC and increases the time the cache is locked so the savings are questionable. The Oklahoma Update attempts to implement some features in hardware (e.g. exponential backoff) which are better done in software, and which needlessly increase the complexity and size of the chip. Also, buffering of certain requests that come in during the ``pre-commit'' phase can cause two processors with non-interfering reservation sets to delay each other.
These different designs arise because of different assumptions regarding the number of memory locations that should be atomically updatable at one time. Transactional Memory paper conjectures between 10 and 100 and Oklahoma Update places the knee at 3 or 4. In general, more locations are better and more powerful. However, our implementation at 2 (DCAS) is by far the simplest extension to existing processor designs. A key contribution of our work is experience that indicates that DCAS is sufficient for practical performance, making the extra hardware complexity of the other schemes unnecessary.