From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: David Howells In-Reply-To: <17204.41449.162199.476757@cargo.ozlabs.ibm.com> References: <17204.41449.162199.476757@cargo.ozlabs.ibm.com> <606.1127460740@warthog.cambridge.redhat.com> To: Paul Mackerras Date: Mon, 26 Sep 2005 12:38:21 +0100 Message-ID: <25552.1127734701@warthog.cambridge.redhat.com> Sender: dhowells@redhat.com Cc: David Howells , linuxppc-dev@ozlabs.org, linuxppc64-dev@ozlabs.org Subject: Re: PATCH powerpc Merge asm-ppc*/rwsem.h List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Paul Mackerras wrote: > > > rwsems on ppc64 should really be using a 64-bit counter, not a 32-bit > > counter, otherwise you limit the maximum number of processes to 32K-ish. > > > > The counter should be "signed long" really. > > It has long annoyed me that we waste half the bits in the rwsem > counter, just because you assume a lowest-common-denominator set of > atomic ops. IIRC your implementation replaced the earlier ppc > implementation which had a 32-bit counter and didn't have the 32k > process limit. And also didn't work, at least not on i386. > I'd have to study it in more detail, but I strongly suspect that with > an atomic operation that did something like > > *p = max(*p, limit) + inc There is no requirement for an arch to use the XADD-based algorithm for doing this - that's optimised for use with the i386/x86_64 XADD instruction. That can be emulated by cmpxchg or load-locked/store-conditional type constructs like MIPS, Alpha and ppc have. > atomically (or alternatively a min), we could increase the limit to at > least 1G processes with a 32-bit counter, without needing to change > your common slow-path implementation. I think you'd probably need a different slow-path for what you want. My XADD optimised slow patch assumes that certain values represent certain states, and if you're using different values, then it won't necessarily work. This is why the thing is contingent on CONFIG_RWSEM_XCHGADD_ALGORITHM. > Such an atomic op is easy to implement with load-and-reserve / > store-conditional instructions. Look at __sem_update_count in > arch/ppc64/kernel/semaphore.c for an example. So that does (I think): do { int old_count = load_reserve(&sem->count); int tmp = old_count >> 31; // not sure about this: 31 or 1? tmp = old_count & ~tmp; tmp = tmp + incr; } while (!conditionally_assign(&sem->count, tmp)); PPC asm makes my head hurt; in particular the description of the SRAWI instruction I've got is a delight to behold, and is inconsistent with the description of the SRAW instruction:-/ I'm guessing it does a shift by 31 to leave tmp filled with copies of bit 0 (the sign bit) of old_count. The manual suggests the shift is 32-N (or 1 in this case) which would seem odd. I see how it works, then. What's "limit" in this formula: > *p = max(*p, limit) + inc Is it the maximum number of processes that may hold readlocks? Or maybe the negated quantity thereof. It's hard to see immediately how this will work. The counter has to represent several things: (1) The number of active readers (max N) or the number of active writers (max 1). (2) Whether or not there is potential contention. I do this with my XADD algorithm by splitting the word into two halves: MSH = #active_writers + #failed_writers + #sleepers LSH = #active_readers + #active_writers + #failed_contenders Basically, the counter is a lock additional to the spinlock. Perhaps you could do something like this: COUNT MEANING ======================= ========================================== 0 Not in use 0x00000001 - 0x7fffffff N readers 0x80000000 last active reader or writer done, contention 0x80000001 - 0xffffffff N active readers or writers, contention So up_read() and up_write() would decrement the count. If the result reaches 0x80000000 then you have to perform contention resolution. down_read() would increment the count if >= 0, or attempt to sleep otherwise. down_write() would just set the sign bit of the count and sleep if != 0, or set the count to 0x80000001 and return. We could also divide the space in two, and use the second most significant bit (0x40000000) to record the fact that there are sleeping contenders or to record which type of lock is actually active. However, I think there's a problem with doing that, and that is: COUNT PROCESS 0 PROCESS 1 ======== ====================== ================================ 00000000 down_read() 00000001 -->down_write() lwarx stwcx [set sign bit] 80000001 -->down_write_slow() -->up_read() lwarx stwcx [dec] 80000000 -->up_read_slow() -->lock sem->wait_lock <--lock sem->wait_lock -->lock sem->wait_lock count indicated contention but nothing on sem->wait_list - do we leave the count unaltered? - do we clear the count? - do we set the count to 0x80000001? If we leave the count set to 0x80000000, then the down_write_slow() when it runs won't be able to tell whether it got sem->wait_lock before or after down_read_slow() did. In one case it should go to sleep and in the other it should claim the lock. If we set the count to 0x80000001, then the case where _two_ CPUs are contending with the releaser may _both_ think they have the lock, and in any case can't tell the difference between that and a sleeping process just having been woken up and given a writelock. If we clear the count, then other processes may be able to queue jump (which isn't necessarily a problem since spinlocks aren't fair anyway, though they could be made so). down_write_slow() would have to attempt to take the lock again, though this time it would already have the spinlock. But what might happen is that: COUNT PROCESS 0 PROCESS 1 PROCESS 2 ======== ====================== =============== =============== 80000000 -->lock sem->wait_lock count indicated contention but nothing on sem->wait_list - do we clear the count? 00000000 down_write() 80000001 unlock sem->wait_lock <--up_read_slow() <--up_read() <--lock sem->wait_lock lwarx stwcx [set sign bit] 80000001 -->up_write() lwarx stwcx [dec] 80000000 -->up_write_slow() -->lock sem->wait_lock list_add_tail() unlock sem->wait_lock sleep <--lock sem->wait_lock dequeue process 1 set count 80000001 unlock sem->wait_lock <--up_write_slow() <--up_write() wake <--up_write_slow() <--up_write() Ummm... that does seem to work. I think their may be a multi-process race in there somewhere, but I can't work it out if there is; just as long as up_write() always does contention resolution. The down side of this is that you always have to get the spinlock in up_write, unlike in the XADD implementation where the counter includes a contention counter. I think downgrade_write() should simply be a matter of: whilst holding the spinlock, set the counter to the number of outstanding reads, and only set the sign bit if there's a sleeping writer left at the front of the queue. Should interruptible rwsems arrive, then I think just dequeuing the aborted down op should be enough. Leave the contention counter with the sign bit set as it'll sort itself out later. David