I have converted the rw semaphore on ia64 from the current generic spin_lock implementation to one that uses architecture-specific atomic operations.  The new scheme speeds up all semaphore operations in the fast path by using an atomic instruction, and falls back to a heavier function when there is read/write contention.  I've also taken some raw measurements of the improvement.  The most significant gain is in parallel reader lock acquire/release, which is around 6.6X faster with the new version.  Here is a patch against 2.4.20.
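
For readers who want the general idea without opening the patch, here is a rough user-space sketch of a counter-based fast path.  It is not the code in the patch: it uses C11 atomics rather than ia64 instructions, all the names (sketch_rwsem, SKETCH_WRITE_BIAS, and so on) are made up for illustration, and the contended slow path is only signalled by a false return instead of actually queueing and sleeping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical layout: count > 0 means that many readers hold the lock,
 * count < 0 means a writer holds it, count == 0 means it is free. */
#define SKETCH_WRITE_BIAS (-0x10000L)

struct sketch_rwsem {
    atomic_long count;
};

static void sketch_init(struct sketch_rwsem *sem)
{
    atomic_init(&sem->count, 0);
}

/* Reader fast path: a single atomic add.  Returns false when a writer is
 * present; a real implementation would enter the heavy slow path there. */
static bool sketch_down_read_fast(struct sketch_rwsem *sem)
{
    long old = atomic_fetch_add_explicit(&sem->count, 1, memory_order_acquire);
    if (old >= 0)
        return true;
    /* Contended: undo our increment and report failure. */
    atomic_fetch_sub_explicit(&sem->count, 1, memory_order_relaxed);
    return false;
}

static void sketch_up_read(struct sketch_rwsem *sem)
{
    long old = atomic_fetch_sub_explicit(&sem->count, 1, memory_order_release);
    assert(old > 0);    /* at least one reader (us) must have held it */
}

/* Writer fast path: succeeds only if the semaphore is completely free. */
static bool sketch_down_write_fast(struct sketch_rwsem *sem)
{
    long expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &sem->count, &expected, SKETCH_WRITE_BIAS,
        memory_order_acquire, memory_order_relaxed);
}

static void sketch_up_write(struct sketch_rwsem *sem)
{
    long old = atomic_fetch_add_explicit(&sem->count, -SKETCH_WRITE_BIAS,
                                         memory_order_release);
    assert(old < 0);    /* the write bias must have been in place */
}
```

The point of the sketch is that uncontended readers never take a spin lock: acquire and release are each one atomic add on the counter, which is where I would expect parallel readers to benefit most.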

 <<rwsem.2.4.20.patch>> 
- Ken