Subject: Re: per-cpu operation madness vs validation
From: Peter Zijlstra
To: Christoph Lameter
Cc: Tejun Heo, Linus Torvalds, Thomas Gleixner, "Paul E. McKenney",
    linux-kernel, Ingo Molnar
Date: Wed, 27 Jul 2011 14:11:33 +0200
Message-ID: <1311768693.24752.488.camel@twins>
References: <1311714410.24752.404.camel@twins>

On Tue, 2011-07-26 at 16:32 -0500, Christoph Lameter wrote:
> Those all fold to the same operation on x86. The x86 cpu operations with
> segment prefix are interrupt and preempt safe. The context issues come
> into play in fallback scenarios where other architectures do not have
> instructions that can perform the same things as x86.

Right, but that doesn't help us much, if at all.

> > Now the reason is of course that -rt changes some things, even when
> > using migrate_disable() it might be multiple processes might try and
> > access the per-cpu variable, needing serialization.
>
> On x86 these operations are rt safe by the pure fact that all these
> operations are the same interrupt safe instruction.

The number of instructions has nothing whatsoever to do with things. And
while the individual ops might be preempt/irq/nmi safe, that doesn't say
anything at all about the piece of code they're used in.

> The difficulties come in for other architectures. In those cases
> preemption or interrupts need to be disabled.

Irrelevant.

> > The point of course is, how are we going to go about doing this, I'm
> > sure adding a proper conditional to each and every per-cpu op is
> > going to be a herculean task, and most of it utterly boring.
>
> Basically we need to track the contexts in which references to a certain
> per cpu variable are established. Then constraints can be enforced. For
> example:
>
> 1. A variable used in an interrupt context with __this_cpu ops requires
>    irqsafe_cpu_xxx when used in context where interrupts are enabled.
>
> 2. A variable used in a non-preemptible context with __this_cpu ops
>    requires this_cpu ops in a preemptible context.

And that doesn't solve the larger issue of multiple per-cpu variables
forming a consistent piece of data.

Suppose there are two per-cpu variables, a and b, and we have an
invariant that says a - b == 5. Then the following code:

	preempt_disable();
	__this_cpu_inc(a);
	__this_cpu_inc(b);
	preempt_enable();

is only correct when used from task context; an IRQ user can easily
observe our invariant failing, and none of your above validations will
catch this.

Now move to -rt, where the above will likely end up looking something
like:

	migrate_disable();
	__this_cpu_inc(a);
	__this_cpu_inc(b);
	migrate_enable();

and everybody can see the invariant failing.

Now clearly my example is very artificial, but it is not far-fetched at
all; there's lots of code like this, and -rt gets to try and unravel all
of it.

Basically preempt_disable()/local_bh_disable()/local_irq_disable() are
the next BKL: they carry no context whatsoever from which to reconstruct
what invariants a given piece of code expects.

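To make the failure concrete, here is a minimal userspace sketch of the a/b example. It is a model only, not kernel code: plain globals stand in for the per-cpu variables, and fake_irq(), update_pair() and run_scenario() are hypothetical names invented for this illustration. The "interrupt" is injected by hand between the two increments, which is exactly the window that preempt_disable() leaves open.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for two per-cpu counters; the invariant is a - b == 5. */
static int a = 5, b = 0;
static bool violation_seen;

/* Hypothetical IRQ handler: reads both counters and checks the invariant. */
void fake_irq(void)
{
	if (a - b != 5)
		violation_seen = true;
}

/*
 * Models the preempt_disable() section from the mail. Disabling
 * preemption does not mask interrupts, so a handler may run between
 * the two increments; irq_between simulates that.
 */
void update_pair(bool irq_between)
{
	/* preempt_disable() would go here */
	a++;
	if (irq_between)
		fake_irq();	/* handler runs with only a updated */
	b++;
	/* preempt_enable() would go here */
}

/* Returns true if the handler ever observed a - b != 5. */
bool run_scenario(bool irq_between)
{
	a = 5;
	b = 0;
	violation_seen = false;
	update_pair(irq_between);
	fake_irq();		/* after the section the pair is consistent */
	return violation_seen;
}
```

Calling run_scenario(true) reports a violation while run_scenario(false) does not. Each increment on its own is "safe"; only the pair is not, and nothing in the code says the two belong together.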
Hence my suggestion to do something like:

	struct foo {
		percpu_lock_t lock;
		int a;
		int b;
	};
	DEFINE_PER_CPU(struct foo, foo);

	percpu_lock(&foo.lock);
	__this_cpu_inc(foo.a);
	__this_cpu_inc(foo.b);
	percpu_unlock(&foo.lock);

That would get us (aside from a shitload of work to make it so):

 - clear boundaries of where the data structure atomicity lies

 - validation, for if the above piece of code was also run from IRQ
   context we could get lockdep complaining about IRQ unsafe locks
   used from IRQ context.

Now for !-rt, percpu_lock will not emit more than
preempt_disable/local_bh_disable/local_irq_disable, depending on which
variant is used, and the data type percpu_lock_t would be empty (except
when enabling lockdep of course).

Possibly we could reduce all this percpu madness back to one form
(__this_cpu_*) and require that a lock of type percpu_lock_t is held
whenever it is used.
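As a rough illustration of the validation half of this proposal, the sketch below models a percpu_lock_t in userspace. Everything in it is an assumption made for the example: the two bookkeeping fields, the in_irq flag, the boolean return value of percpu_lock() and the update_foo() helper are hypothetical, and real lockdep tracks lock classes and much more state. It only shows the idea that once the lock names the data, inconsistent context usage becomes detectable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the proposed lock type; not a kernel API. */
typedef struct {
	bool taken_in_irq;	/* ever taken from (simulated) IRQ context */
	bool taken_irqs_on;	/* ever taken in task context, IRQs enabled */
} percpu_lock_t;

struct foo {
	percpu_lock_t lock;
	int a;
	int b;
};

static bool in_irq;		/* simulated execution context */

/*
 * Records the context of each acquisition. Returns false once the
 * lock has been taken both with IRQs enabled and from IRQ context,
 * in the spirit of lockdep's IRQ-unsafe-lock complaint.
 */
bool percpu_lock(percpu_lock_t *l)
{
	if (in_irq)
		l->taken_in_irq = true;
	else
		l->taken_irqs_on = true;
	return !(l->taken_in_irq && l->taken_irqs_on);
}

/* Nothing to release in this model; the call marks the section end. */
void percpu_unlock(percpu_lock_t *l)
{
	(void)l;
}

/* Updates the pair under the lock; from_irq selects the context. */
bool update_foo(struct foo *f, bool from_irq)
{
	in_irq = from_irq;
	bool ok = percpu_lock(&f->lock);
	f->a++;
	f->b++;
	percpu_unlock(&f->lock);
	in_irq = false;
	return ok;
}
```

With a struct foo initialised to a = 5, b = 0, a first update_foo(&f, false) passes, and a later update_foo(&f, true) on the same object is flagged. The check fires on the usage pattern itself, without needing an interrupt to actually land inside the critical section.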