From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1762322AbYFFB2C (ORCPT ); Thu, 5 Jun 2008 21:28:02 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753977AbYFFB1w (ORCPT ); Thu, 5 Jun 2008 21:27:52 -0400 Received: from cantor2.suse.de ([195.135.220.15]:51767 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752904AbYFFB1v (ORCPT ); Thu, 5 Jun 2008 21:27:51 -0400 Date: Fri, 6 Jun 2008 03:27:49 +0200 From: Nick Piggin To: Linus Torvalds Cc: Ingo Molnar , David Howells , Ulrich Drepper , Linux Kernel Mailing List , Andrew Morton Subject: Re: [PATCH 0/3] 64-bit futexes: Intro Message-ID: <20080606012749.GA12187@wotan.suse.de> References: <4840C29B.6030507@redhat.com> <4840CE51.9060109@redhat.com> <4840D63F.2090407@redhat.com> <20080602185433.GB4081@elte.hu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.9i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jun 02, 2008 at 01:22:57PM -0700, Linus Torvalds wrote: > > [ I added Nick and DavidH to the Cc, since they have at least historically > shown interest in locking algorithms ] > > On Mon, 2 Jun 2008, Ingo Molnar wrote: > > > > i suspect _any_ abstract locking functionality around a data structure > > can be implemented via atomic control over just a single user-space bit. > > Well, theory is theory, and practice is different. > > That's *especially* true when it comes to locking, which is just so > tightly coupled to the exact details of which atomics are fast on a > particular architecture. > > Also, different people will want to see different performance profiles. > > For example, it makes a *huge* difference whether you are strictly fair or > not. I can almost guarantee that a 100% fair implementation may be really > "nice" from a theoretical standpoint, but suck really badly in practice, > because if you want best performance, then you want the lock to have a > very strong CPU affinity - and the faster you can do your lock ops, the > more of a unfairness and CPU affinity they get! > > And rwlocks in particular are actually much more interesting in this > respect, because they not only have that CPU affinity fairness, but also > the reader-vs-writer fairness. You optimize for one load, and it may give > you almost perfect performance, but at the cost of sucking at another > load. > > For example, some loads are almost entirely read-read locks, with only > very occasional write locks for some uncommon config change thing. Do you > want to optimize for that? Maybe. And yes, you can make those uncontended > read-read locks go really quickly, but then (especially if you continue to > let reads through even when writers want to contend), that can slow down > writers a *lot*, to the point of starvation. > > Different CPU's will also show different patterns. > > Anyway, I was busy most of the weekend, but I've now coded up a partial > actual example implementation. Its' probably buggy. Uli is very correct in > saying that it's easy to screw up futex'es, but even in the absense of > futexes, it's just easy to screw up any threaded logic. > > But if anybody wants to look at a rough draft that at least limps along > _partially_, there's > > http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git;a=summary > > for you. > > And I'll freely admit that > (a) the above thing is pretty hacked up. No guarantees as to correctness > or anything else. > (b) I looked _mainly_ at the "all readers" case, and considered that the > primary goal. Which explains why it does really well on that > particular load. > (c) However, I refuse to starve writers. In fact, my thing will always > prefer a writer to any new readers, on the assumption that the sanest > rwlock situation is the "tons of readers, occasional writer". > (d) it does ok for me on the write-write contention case too, but the > random mixed "all threads do 5% writelocks" test-case seems to suck. > > As an exmaple: the current glibc code I have seems to be uniformly and > boringly middle-of-the-road. It really doesn't seem to be horrible at all. > That said, the above thing _is_ 2-3x faster for me on that read-read case > (from single thread up to 20 threads - but just tested on one particular > machine), so if that is what you want to aim for, it's certainly easy to > beat. > > But for the write case, while I can easily beat it for a low-thread count > with little contention, my example thing above benchmarks about equal for > bigger thread counts (caveat: again - I've only tested on one machine, > which is a single-socket dual-core thing. Caveat emptor). > > And the pthreads implementation actually beats my hacked-up one at least > for the 5% random writelocks case. It's not beating it by a huge amount, > but it's not in the noise level either (maybe 15%). And I bet the pthreads > implementation is a hell of a lot better tested ;) Had a bit of a look though this, seems pretty OK to me. Obviously with rwlocks it *is* impossible to do non-atomic unlocks, so I can't see a way to significantly improve your code there. What you *could* maybe do, to slightly speed up the reader fastpath, at the expense of the writer fastpath, is to also have the active writer add 4 to the count too, so your unlock can start with a lock xadd -4, count in order to get the write-intent on the cacheline straight up. Though that's a pretty tiny "optmisation", and not going to be an amazing win, even when it does save a state transition on the cacheline... I'd be more interested to know why this code can't be evolved into a full rwlock implementation? This is a rather standard (though neat) looking rwlock -- so my question is what can the patented 64-bit futex locks do that this can't, or what can they do faster?