Hi semaphore/mutex maintainers,

It looks like rw_semaphore's down_write is not as efficient as it could be. It can have latencies in the milliseconds range, but if I wrap it in yet another mutex it becomes much faster (100 us range). One difference I noticed between the rwsem and mutex code is that the mutex does optimistic spinning; but adding something similar to the rwsem code didn't improve timings (it made things worse). My guess is that this has something to do with excessive scheduler ping-pong (spurious wakeups, scheduling a task that won't be able to take the semaphore, etc.), but I'm not sure what the best tools are to confirm or rule that out. perf sched / perf lock / ftrace?

Also, the huge slowdowns only happen if I trigger a page fault in the just-mapped area: if I remove the '*((volatile char*)h) = 0;' line from mmapsem.c, then mmap() time is back in the 50 us range. (Using MAP_POPULATE is even worse, presumably due to zero-filling, but even with MAP_POPULATE the mutex helps.)

First, some background: this all started when I was investigating why mmap()/munmap() in ClamAV is still faster when wrapped with a pthread mutex. Initially the reason was that mmap_sem was held during disk I/O, but that's supposedly been fixed, and ClamAV only uses anonymous mmap + pread now anyway. So I wrote the attached microbenchmark to illustrate the latency difference. Note that in a real app (ClamAV) the difference is not as large, only ~5-10%.

Yield Time: 0.002225s, Latency: 0.222500us
Mmap Time [nolock]: 21.647090s, Latency: 2164.709000us
Mmap Time [spinlock]: 0.649472s, Latency: 64.947200us
Mmap Time [mutex]: 0.720323s, Latency: 72.032300us

The difference is huge: switching between threads takes <1 us, and context switching between processes takes ~2 us, so I don't know what the rwsem is doing that takes 2 ms!
To track the problem further I patched the kernel slightly, wrapping down_write/up_write in a regular mutex (in a hackish way: this should be per process, not a global one); see the attached patch. Sure enough, mmap() improved:

Yield Time: 0.002289s, Latency: 0.228900us
Mmap Time [nolock]: 1.000317s, Latency: 100.031700us
Mmap Time [spinlock]: 0.618873s, Latency: 61.887300us
Mmap Time [mutex]: 0.739471s, Latency: 73.947100us

Of course the attached patch is not a solution, it is just a test. The nolock case is now very close to the userspace-locking version, and the remaining slowdown is due to the double locking. I could write a patch that adds a mutex to rwsem and wraps all writers with it, but I'd rather see the rwsem code fixed/optimized.

The .config I used for testing is attached.

Best regards,
--Edwin
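P.S. For context, the shape of the hack is roughly the following (a hypothetical sketch only, not the actual attached patch; __down_write/__up_write stand in for whatever the real write paths are on the tested kernel, and the mutex here is global rather than per-mm, which is part of why it is only a test):

```c
/* Sketch: funnel all rwsem writers through one global sleeping mutex,
 * so only one writer at a time ever contends on the rwsem itself. */
static DEFINE_MUTEX(rwsem_write_hack);

void down_write(struct rw_semaphore *sem)
{
	mutex_lock(&rwsem_write_hack);
	__down_write(sem);
}

void up_write(struct rw_semaphore *sem)
{
	__up_write(sem);
	mutex_unlock(&rwsem_write_hack);
}
```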