A few days ago Martin Thompson published an investigation into the results of the controversial CAS (compare-and-swap) performance test he ran a few months earlier. That investigation really impressed me: it shows how badly microbenchmarking can go wrong, even when it is done by such a smart guy.
Just to recap, the test ran several threads that hammered the CPU with CAS operations. It showed that, on average, CAS on a modern Ivy Bridge processor is significantly slower than on the older Nehalem architecture. A few months later Martin found the reason for this strange behavior, and the amazing thing is that the test runs slower precisely because Ivy Bridge is actually faster.
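To make the setup concrete, here is a minimal sketch of that kind of test in Java. It is not Martin's actual benchmark: the class name `ContendedCas`, the single shared `AtomicLong` counter, the thread count, and the iteration count are all my own assumptions, and there is no warm-up or other benchmarking hygiene.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not the original benchmark: several threads
// hammer one shared counter with compare-and-set operations.
class ContendedCas {

    // Each thread performs `incrementsPerThread` successful CAS increments.
    // Returns the final counter value (threads * incrementsPerThread).
    static long run(int threads, long incrementsPerThread) {
        AtomicLong counter = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (long n = 0; n < incrementsPerThread; n++) {
                    long current;
                    do {
                        current = counter.get();
                        // Retry until our CAS wins against the other threads.
                    } while (!counter.compareAndSet(current, current + 1));
                }
            });
        }

        long start = System.nanoTime();
        for (Thread t : workers) {
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        long elapsedNs = System.nanoTime() - start;

        double opsPerMs = counter.get() / (elapsedNs / 1_000_000.0);
        System.out.printf("%d threads: %.0f successful CAS/ms%n", threads, opsPerMs);
        return counter.get();
    }

    public static void main(String[] args) {
        run(4, 1_000_000L);
    }
}
```

The number this prints is exactly the kind of figure that is easy to misread: it measures how the threads fight over one cache line, not how fast a single CAS instruction is.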
To understand why that happens, let's see what goes on when a CAS is executed. At a high level, from the point of view of a CPU core, the memory that is about to be written can be in one of two states: either the core exclusively owns the cache line that holds it, or it does not. If the core owns the line, CAS is extremely fast, because the core doesn't need to notify any other core to perform the operation. If the core doesn't own the line, the situation is very different: it has to send a request to fetch the cache line in exclusive mode, and that request requires communication with all the other cores. Such negotiation is not fast, but on Ivy Bridge it is much faster than on Nehalem. And because it is faster on Ivy Bridge, a core has less time to perform a run of fast local CAS operations while it owns the cache line before another core takes the line away, so the total throughput is lower.
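The effect can be illustrated with a deliberately crude toy model. Everything in it is my own assumption rather than something measured or taken from Martin's write-up: the owner keeps the line for a window of `ownershipWindowNs` before a rival's request wins, it performs one local CAS every `localCasNs` during that window, and each handoff costs a fixed `handoffStallNs` during which nobody makes progress. The class name and all parameter values are hypothetical.

```java
// Hypothetical toy model, not real hardware behavior.
class CasOwnershipModel {

    // Successful CAS operations per nanosecond under the assumptions above.
    static double throughput(double ownershipWindowNs,
                             double localCasNs,
                             double handoffStallNs) {
        // Fast local operations the owner squeezes in before losing the line.
        double opsPerWindow = ownershipWindowNs / localCasNs;
        // One full cycle = useful window + time lost moving the line.
        double cycleNs = ownershipWindowNs + handoffStallNs;
        return opsPerWindow / cycleNs;
    }

    public static void main(String[] args) {
        // Made-up numbers: the only difference is how quickly a rival
        // core manages to take the cache line away from the owner.
        double slowNegotiation = throughput(100.0, 5.0, 50.0);
        double fastNegotiation = throughput(30.0, 5.0, 50.0);
        System.out.printf("slow negotiation: %.4f ops/ns%n", slowNegotiation);
        System.out.printf("fast negotiation: %.4f ops/ns%n", fastNegotiation);
    }
}
```

In this model a shorter ownership window always gives lower throughput as long as the handoff stall is non-zero, which is the same shape as the result described above: the line moves between cores more quickly, so each owner gets less uninterrupted work done.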
I suppose there is a very good lesson here: microbenchmarking is tricky and not easy to do properly, and the results can easily be interpreted the wrong way. So, be careful!