From: Chris Mason
Subject: Re: rw_semaphore performance, was: new metadata reader/writer locks in integration-test
Date: Fri, 22 Jul 2011 11:14:09 -0400
Message-ID: <1311347135-sup-4912@shiny>
References: <1311096438-sup-1263@shiny> <20110722150151.GA23686@infradead.org>
In-reply-to: <20110722150151.GA23686@infradead.org>
Content-Type: text/plain; charset=UTF-8
To: Christoph Hellwig
Cc: linux-btrfs , linux-kernel , david

Excerpts from Christoph Hellwig's message of 2011-07-22 11:01:51 -0400:
> On Tue, Jul 19, 2011 at 01:30:22PM -0400, Chris Mason wrote:
> > We've seen a number of benchmarks dominated by contention on the root
> > node lock. This changes our locks into a simple reader/writer lock.
> > They are based on mutexes so that we still take advantage of the mutex
> > adaptive spins for write locks (rwsemaphores were much slower).
>
> Interesting. Have you set up some artificial benchmarks for this?
>
> I wonder if the lack of adaptive spinning has something to do with the
> slightly slower XFS performance on Joern's flash testing, given that
> we extensively use the rw_semaphore as the primary I/O mutex, while
> all others rely on plain mutexes as the primary synchronization
> primitive.

For the rw locks I had three main tests.

1) dbench 10. This is interesting only because it is mostly bound by how
quickly we can do metadata operations in RAM. There's not much I/O, and
there's a good mixture of read and write btree operations (about 50/50).

rwsemaphores ran at 200MB/s, while my current code runs at 2400MB/s. The
old btrfs implementation runs at 3000MB/s.

We all love and hate dbench, so I don't put a huge amount of stock in
2400 vs 3000. But 200 vs 2400... people notice that in real world stuff.

2) fs_mark doing parallel zero byte file creates. No fsyncs here, all
metadata operations. The old btrfs locking was completely bound by
getting write locks on the root node. The new code is much better here,
overall about 30-50% faster. I didn't run the rw semaphores on this one;
I'll give it a shot.

3) A stat-hammer program. This creates a bunch of files in parallel, and
then times how long it takes us to stat all the inodes. I went from 3s
of CPU time down to 0.9s. rwsems were about the same here (very fast),
but that's because it's 100% reader locks.

My money for Joern's benchmarks is on end-io latencies. XFS and btrfs
are doing more at end-io time. But I need to sit down and run them
myself and take a look.

-chris
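
For anyone who wants to play with the idea outside the kernel, here is a
rough userspace sketch of a reader/writer lock built on a plain mutex.
The mrw_* names are made up for the example, and it is pthread-based,
reader-preferring and unfair, so treat it as an illustration of the
concept only, not the actual btrfs tree locking code:

/*
 * Sketch of a reader/writer lock built on top of a plain mutex.
 * Writers hold the mutex for the whole critical section, so they get
 * whatever adaptive behaviour the underlying mutex provides; readers
 * only take it long enough to bump a counter.  Reader-preferring and
 * unfair -- illustration only.
 */
#include <pthread.h>

struct mrw_lock {
	pthread_mutex_t mutex;	/* held by writers, briefly by readers */
	pthread_cond_t cond;	/* signalled when readers may have drained */
	int readers;		/* active readers, protected by mutex */
};

static void mrw_init(struct mrw_lock *lk)
{
	pthread_mutex_init(&lk->mutex, NULL);
	pthread_cond_init(&lk->cond, NULL);
	lk->readers = 0;
}

static void mrw_read_lock(struct mrw_lock *lk)
{
	pthread_mutex_lock(&lk->mutex);	/* excluded while a writer runs */
	lk->readers++;
	pthread_mutex_unlock(&lk->mutex);
}

static void mrw_read_unlock(struct mrw_lock *lk)
{
	pthread_mutex_lock(&lk->mutex);
	if (--lk->readers == 0)
		pthread_cond_signal(&lk->cond);
	pthread_mutex_unlock(&lk->mutex);
}

static void mrw_write_lock(struct mrw_lock *lk)
{
	pthread_mutex_lock(&lk->mutex);
	while (lk->readers)		/* wait for readers to drain */
		pthread_cond_wait(&lk->cond, &lk->mutex);
	/* mutex stays held until mrw_write_unlock() */
}

static void mrw_write_unlock(struct mrw_lock *lk)
{
	pthread_cond_signal(&lk->cond);	/* wake any other waiting writer */
	pthread_mutex_unlock(&lk->mutex);
}

The point is simply that the write path is a plain mutex acquisition
held across the whole critical section, so it benefits from whatever
adaptive spinning the mutex gives you, while readers only touch the
mutex long enough to register.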
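
The stat-hammer test is nothing fancy. Stripped down to a single thread
it is roughly the following (the real run does the creates in parallel,
and NFILES and the file naming here are arbitrary, so take this only as
the shape of what gets timed):

/*
 * Single-threaded sketch of the stat-hammer idea: create a pile of
 * files, then time how long it takes to stat them all.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

#define NFILES 100000

int main(void)
{
	struct timespec start, end;
	struct stat st;
	char name[32];
	int i;

	for (i = 0; i < NFILES; i++) {
		snprintf(name, sizeof(name), "f-%08d", i);
		close(open(name, O_CREAT | O_WRONLY, 0600));
	}

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < NFILES; i++) {
		snprintf(name, sizeof(name), "f-%08d", i);
		stat(name, &st);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("stat of %d files took %.2fs\n", NFILES,
	       (end.tv_sec - start.tv_sec) +
	       (end.tv_nsec - start.tv_nsec) / 1e9);
	return 0;
}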