I have converted the rw semaphore on ia64 from the current generic spin_lock implementation to one that uses architecture-specific atomic operations. The new scheme speeds up all semaphore operations in the fast path by using atomic instructions, and falls back to a heavier function when there is read/write contention.

I've also taken some raw measurements of how much faster it is. The most significant gain comes from parallel reader lock acquire/release, which is around 6.6X faster with the new version.

Here is a patch against 2.4.20.

<>

- Ken
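
For readers who want a picture of the fast path / slow path split described above, here is a minimal user-space sketch in portable C11. It is not the code from the patch and does not use the ia64 instructions; the names (sketch_rwsem, SKETCH_WRITER_BIAS, the sketch_* functions) and the counter layout are assumptions made for illustration only. Each "fast" function does a single atomic operation and returns false on contention, which is the point where a real implementation would call its heavier slow-path function and put the caller to sleep.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter layout, chosen for this sketch only:
 *   count >= 0 : number of active readers, no writer
 *   count <  0 : a writer holds the lock (large negative bias added) */
#define SKETCH_WRITER_BIAS (-0x40000000L)

typedef struct {
    atomic_long count;
} sketch_rwsem;

static void sketch_init(sketch_rwsem *s)
{
    atomic_init(&s->count, 0);
}

/* Reader fast path: one atomic add. Returns true if the lock was taken.
 * On false, the increment has been undone and the caller would normally
 * enter a slow path (not shown here). */
static bool sketch_down_read_fast(sketch_rwsem *s)
{
    long old = atomic_fetch_add_explicit(&s->count, 1, memory_order_acquire);
    if (old >= 0)
        return true;
    /* A writer is active: back out our increment and report contention. */
    atomic_fetch_sub_explicit(&s->count, 1, memory_order_relaxed);
    return false;
}

static void sketch_up_read(sketch_rwsem *s)
{
    atomic_fetch_sub_explicit(&s->count, 1, memory_order_release);
}

/* Writer fast path: one compare-and-swap from "completely free" (0)
 * to "writer held". Returns false if any reader or writer is present. */
static bool sketch_down_write_fast(sketch_rwsem *s)
{
    long expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &s->count, &expected, SKETCH_WRITER_BIAS,
        memory_order_acquire, memory_order_relaxed);
}

static void sketch_up_write(sketch_rwsem *s)
{
    /* Remove the writer bias; any transient reader increments remain. */
    atomic_fetch_sub_explicit(&s->count, SKETCH_WRITER_BIAS,
                              memory_order_release);
}
```

In this sketch, uncontended readers never touch a spinlock: each acquire and release is one atomic add on the counter. That is one plausible way an atomic fast path can avoid serializing parallel readers on a lock, though the layout and instructions used in the actual patch may differ.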