* [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
From: mengcong @ 2011-12-19  3:36 UTC
To: Alexander Viro; +Cc: linux-kernel, linux-fsdevel, Nick Piggin

On a heavily loaded system that frequently takes CPUs on- and offline, the
kernel detects soft lockups on multiple CPUs. The detailed bug report is at
https://lkml.org/lkml/2011/8/24/185.

The root cause is that the brlock write-side functions, br_write_lock() and
br_write_unlock(), only lock/unlock the per-CPU spinlocks of CPUs that are
online at the time of the call. If a CPU is locked while online and then goes
offline, any unlock that happens while it is offline will not touch its
per-CPU lock; when the CPU comes back online, it still carries the stale
locked state. This has been verified on the current kernel.

I can reproduce this bug on an unmodified 3.1 kernel. With my patch applied,
I ran an 8-hour test (using the script provided by the bug reporter) and no
soft lockup happened again.

Signed-off-by: Cong Meng <mc@linux.vnet.ibm.com>
Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 include/linux/lglock.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index f549056..08b9e84 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -27,8 +27,8 @@
 #define br_lock_init(name)	name##_lock_init()
 #define br_read_lock(name)	name##_local_lock()
 #define br_read_unlock(name)	name##_local_unlock()
-#define br_write_lock(name)	name##_global_lock_online()
-#define br_write_unlock(name)	name##_global_unlock_online()
+#define br_write_lock(name)	name##_global_lock()
+#define br_write_unlock(name)	name##_global_unlock()

 #define DECLARE_BRLOCK(name)	DECLARE_LGLOCK(name)
 #define DEFINE_BRLOCK(name)	DEFINE_LGLOCK(name)
--
1.7.5.4
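For readers without lglock.h in front of them, the imbalance being described
boils down to something like the sketch below. The demo_* names are invented
for illustration; the real code is generated by the DEFINE_LGLOCK() macro in
include/linux/lglock.h.

	#include <linux/percpu.h>
	#include <linux/spinlock.h>
	#include <linux/cpumask.h>

	/*
	 * Simplified stand-in for the per-name state DEFINE_LGLOCK() generates.
	 * The write side walks cpu_online_mask separately on lock and on unlock,
	 * so the two walks can cover different sets of CPUs if a hotplug
	 * operation happens in between.
	 */
	static DEFINE_PER_CPU(arch_spinlock_t, demo_lock) = __ARCH_SPIN_LOCK_UNLOCKED;

	static void demo_global_lock_online(void)
	{
		int i;

		preempt_disable();
		for_each_online_cpu(i)		/* snapshot A of the online CPUs */
			arch_spin_lock(&per_cpu(demo_lock, i));
	}

	static void demo_global_unlock_online(void)
	{
		int i;

		for_each_online_cpu(i)		/* snapshot B: a CPU may have vanished */
			arch_spin_unlock(&per_cpu(demo_lock, i));
		preempt_enable();
		/*
		 * A CPU present in snapshot A but not in snapshot B keeps its
		 * per-CPU lock held; when it comes back online, readers on it
		 * spin forever and the soft-lockup detector fires.
		 */
	}

Al's reply below raises the cost concern with simply switching the write side
to all possible CPUs, which is why the rest of the thread looks for a scheme
that stays online-only but is hotplug-safe.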
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-19 3:36 [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs mengcong @ 2011-12-19 4:11 ` Al Viro 2011-12-19 5:00 ` Dave Chinner 2011-12-19 7:31 ` Srivatsa S. Bhat 1 sibling, 1 reply; 37+ messages in thread From: Al Viro @ 2011-12-19 4:11 UTC (permalink / raw) To: mengcong; +Cc: linux-kernel, linux-fsdevel, Nick Piggin On Mon, Dec 19, 2011 at 11:36:15AM +0800, mengcong wrote: > In a heavily loaded system, when frequently turning on and off CPUs, the > kernel will detect soft-lockups on multiple CPUs. The detailed bug report > is at https://lkml.org/lkml/2011/8/24/185. > > The root cause is that brlock functions, i.e. br_write_lock() and > br_write_unlock(), only locks/unlocks the per-CPU spinlock of CPUs that > are online, which means, if one online CPU is locked and then goes > offline, any later unlocking operation happens during its offline state > will not touch it; and when it goes online again, it has the incorrect > brlock state. This has been verified in current kernel. > > I can reproduce this bug on the intact 3.1 kernel. After my patch applied, > I've ran an 8-hours long test(test script provided by the bug reporter), > and no soft lockup happened again. Argh... OK, that's seriously nasty. I agree that this is broken, but your patch makes br_write_lock() very costly on kernels build with huge number of possible CPUs, even when it's run on a box with few CPUs ;-/ That sucks. Worse, AFAICS, the only way to prevent on-/off-line status changes is blocking (and both directions are bad - if the thing goes online between br_write_lock() and br_write_unlock(), we'll get spin_unlock without spin_lock). And I really don't want to make vfsmount_lock writers blocking - we *probably* could get away with that, but it'll suck very badly. Especially since we'll have that nested inside namespace_sem... Alternative is to do get_online_cpus/put_online_cpus around the stuff in fs/namespace.c, putting it *outside* everything but actual IO. We can do that (since right now vfsmount_lock is non-blocking and the only potentially blocking operations under namespace_sem is kmalloc()), but I'm not particulary comfortable doing that - I never played with the code in kernel/cpu.c and I don't know if there's anything subtle to watch out for. The same issue exists for lg_global_lock_online(), but that beast is never used (and the only remaining user of lg_global_lock() is hardly time-critical - with Miklos' patches it's only done on mount -o remount,force,ro). Nick, any comments? That's your code... ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-19 4:11 ` Al Viro @ 2011-12-19 5:00 ` Dave Chinner 2011-12-19 6:07 ` mengcong 0 siblings, 1 reply; 37+ messages in thread From: Dave Chinner @ 2011-12-19 5:00 UTC (permalink / raw) To: Al Viro; +Cc: mengcong, linux-kernel, linux-fsdevel, Nick Piggin On Mon, Dec 19, 2011 at 04:11:42AM +0000, Al Viro wrote: > On Mon, Dec 19, 2011 at 11:36:15AM +0800, mengcong wrote: > > In a heavily loaded system, when frequently turning on and off CPUs, the > > kernel will detect soft-lockups on multiple CPUs. The detailed bug report > > is at https://lkml.org/lkml/2011/8/24/185. > > > > The root cause is that brlock functions, i.e. br_write_lock() and > > br_write_unlock(), only locks/unlocks the per-CPU spinlock of CPUs that > > are online, which means, if one online CPU is locked and then goes > > offline, any later unlocking operation happens during its offline state > > will not touch it; and when it goes online again, it has the incorrect > > brlock state. This has been verified in current kernel. > > > > I can reproduce this bug on the intact 3.1 kernel. After my patch applied, > > I've ran an 8-hours long test(test script provided by the bug reporter), > > and no soft lockup happened again. > > Argh... OK, that's seriously nasty. I agree that this is broken, but > your patch makes br_write_lock() very costly on kernels build with > huge number of possible CPUs, even when it's run on a box with few > CPUs ;-/ I fixed this problem with the XFS per-cpu superblock counters (bit lock + counters in per-cpu structs) back in 2006. It basically uses a lglock-like local/global locking structure and iterates them using for_each_online_cpu(). I fixed it simply by registering a hotplug notifier and draining/reinitialising counters on the appropriate event under a global lock context. i.e. make CPU hotplug serialise against concurrent lock operations. See commit e8234a68 ("[XFS] Add support for hotplug CPUs...") Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 37+ messages in thread
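For reference, the XFS scheme Dave describes looks roughly like the sketch
below. This is only an illustration of the idea behind commit e8234a68, not
the actual XFS code; the demo_* names and the reinit_percpu_state() helper are
invented here.

	#include <linux/cpu.h>
	#include <linux/notifier.h>
	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(demo_global_lock);

	/* Hypothetical helper: drain/reinitialise the per-CPU state of @cpu. */
	static void reinit_percpu_state(unsigned int cpu)
	{
		/* rebalance counters, reset the per-CPU lock state, etc. */
	}

	static int demo_cpu_notify(struct notifier_block *nb,
				   unsigned long action, void *hcpu)
	{
		unsigned int cpu = (unsigned long)hcpu;

		switch (action & ~CPU_TASKS_FROZEN) {
		case CPU_ONLINE:
		case CPU_DEAD:
			/*
			 * Taking the same global lock the hot paths use means the
			 * per-CPU state is never rearranged while someone holds the
			 * "global" side of the lock.
			 */
			spin_lock(&demo_global_lock);
			reinit_percpu_state(cpu);
			spin_unlock(&demo_global_lock);
			break;
		}
		return NOTIFY_OK;
	}

	static struct notifier_block demo_cpu_notifier = {
		.notifier_call = demo_cpu_notify,
	};

	/* Registered once from the subsystem's init path:
	 *	register_hotcpu_notifier(&demo_cpu_notifier);
	 */

As Al points out further down in the thread, this cannot be transplanted onto
brlocks as-is: a reader spinning on one of the per-CPU locks may be running on
the very CPU that is being taken down.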
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-19 5:00 ` Dave Chinner @ 2011-12-19 6:07 ` mengcong 0 siblings, 0 replies; 37+ messages in thread From: mengcong @ 2011-12-19 6:07 UTC (permalink / raw) To: Al Viro; +Cc: linux-kernel, linux-fsdevel, Nick Piggin, Dave Chinner On Mon, 2011-12-19 at 16:00 +1100, Dave Chinner wrote: > On Mon, Dec 19, 2011 at 04:11:42AM +0000, Al Viro wrote: > > On Mon, Dec 19, 2011 at 11:36:15AM +0800, mengcong wrote: > > > In a heavily loaded system, when frequently turning on and off CPUs, the > > > kernel will detect soft-lockups on multiple CPUs. The detailed bug report > > > is at https://lkml.org/lkml/2011/8/24/185. > > > > > > The root cause is that brlock functions, i.e. br_write_lock() and > > > br_write_unlock(), only locks/unlocks the per-CPU spinlock of CPUs that > > > are online, which means, if one online CPU is locked and then goes > > > offline, any later unlocking operation happens during its offline state > > > will not touch it; and when it goes online again, it has the incorrect > > > brlock state. This has been verified in current kernel. > > > > > > I can reproduce this bug on the intact 3.1 kernel. After my patch applied, > > > I've ran an 8-hours long test(test script provided by the bug reporter), > > > and no soft lockup happened again. > > > > Argh... OK, that's seriously nasty. I agree that this is broken, but > > your patch makes br_write_lock() very costly on kernels build with > > huge number of possible CPUs, even when it's run on a box with few > > CPUs ;-/ > > I fixed this problem with the XFS per-cpu superblock counters > (bit lock + counters in per-cpu structs) back in 2006. It basically > uses a lglock-like local/global locking structure and iterates them > using for_each_online_cpu(). > > I fixed it simply by registering a hotplug notifier and > draining/reinitialising counters on the appropriate event under a > global lock context. i.e. make CPU hotplug serialise against > concurrent lock operations. See commit e8234a68 ("[XFS] Add support > for hotplug CPUs...") > how about to register a cpu hotplug notifier to align the brlock as what Dave did? > Cheers, > > Dave. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
From: Srivatsa S. Bhat @ 2011-12-19  7:31 UTC
To: mc
Cc: Alexander Viro, linux-kernel, linux-fsdevel, Nick Piggin, david,
    akpm@linux-foundation.org, Maciej Rutecki

Hi,

I feel the following patch is a better fix, for two reasons:

1. As Al Viro pointed out, if we use for_each_possible_cpu() we might take an
   unnecessary performance hit in some scenarios. So working with only the
   online cpus, safely (i.e. race-free) if possible, would be a good solution
   (which is what this patch implements).

2. *_global_lock_online() and *_global_unlock_online() need fixing as well,
   because the names suggest that they lock/unlock the per-CPU locks of only
   the currently online CPUs, but unfortunately they have no synchronization
   to prevent those CPUs from being offlined in between, if they happen to
   race with a CPU hotplug operation.

And if we solve issue 2 above "carefully" (as described in the changelog
below), it solves this whole thing!

---
From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Subject: [PATCH] VFS: Fix race between CPU hotplug and *_global_[un]lock_online()

The *_global_[un]lock_online() macros defined in include/linux/lglock.h can
race with CPU hotplug operations. Fix this race by wrapping them in the
get_online_cpus()/put_online_cpus() pair, so that CPU hotplug cannot happen
at the same time.

But note the semantics carefully: if we allowed CPU hotplug in between a
lock_online() and the corresponding unlock_online(), the lock states could
get messed up -- for example, we might take a lock on an online CPU but never
release it, because that CPU was offline by the time unlock_online() was
invoked (thanks to Cong Meng for debugging this issue).

[Soft lockups detected as a consequence of this issue were reported earlier
in https://lkml.org/lkml/2011/8/24/185.]

So, we are careful to allow CPU hotplug only after the lock-unlock sequence
is complete.

Debugged-by: Cong Meng <mc@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 include/linux/lglock.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index f549056..583d1a8 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -22,6 +22,7 @@
 #include <linux/spinlock.h>
 #include <linux/lockdep.h>
 #include <linux/percpu.h>
+#include <linux/cpu.h>

 /* can make br locks by using local lock for read side, global lock for write */
 #define br_lock_init(name)	name##_lock_init()
@@ -126,6 +127,7 @@
 	int i;							\
 	preempt_disable();					\
 	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);	\
+	get_online_cpus();					\
 	for_each_online_cpu(i) {				\
 		arch_spinlock_t *lock;				\
 		lock = &per_cpu(name##_lock, i);		\
@@ -142,6 +144,7 @@
 		lock = &per_cpu(name##_lock, i);		\
 		arch_spin_unlock(lock);				\
 	}							\
+	put_online_cpus();					\
 	preempt_enable();					\
 }								\
 EXPORT_SYMBOL(name##_global_unlock_online);			\
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-19 7:31 ` Srivatsa S. Bhat @ 2011-12-19 9:12 ` Stephen Boyd 2011-12-19 11:03 ` Srivatsa S. Bhat 0 siblings, 1 reply; 37+ messages in thread From: Stephen Boyd @ 2011-12-19 9:12 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: mc, Alexander Viro, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On 12/18/2011 11:31 PM, Srivatsa S. Bhat wrote: > Hi, > > I feel the following patch is a better fix for 2 reasons: > > 1. As Al Viro pointed out, if we do for_each_possible_cpus() then we might > encounter unnecessary performance hit in some scenarios. So working with > only online cpus, safely(a.k.a race-free), if possible, would be a good > solution (which this patch implements). > > 2. *_global_lock_online() and *_global_unlock_online() needs fixing as well > because, the names suggest that they lock/unlock per-CPU locks of only the > currently online CPUs, but unfortunately they do not have any synchronization > to prevent offlining those CPUs in between, if it happens to race with a CPU > hotplug operation. > > And if we solve issue 2 above "carefully" (as mentioned in the changelog below), > it solves this whole thing! We started seeing this same problem last week. I've come up with almost the same solution but you beat me to the list! > diff --git a/include/linux/lglock.h b/include/linux/lglock.h > index f549056..583d1a8 100644 > --- a/include/linux/lglock.h > +++ b/include/linux/lglock.h > @@ -126,6 +127,7 @@ > int i; \ > preempt_disable(); \ > rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_); \ > + get_online_cpus(); \ > for_each_online_cpu(i) { \ > arch_spinlock_t *lock; \ > lock =&per_cpu(name##_lock, i); \ > @@ -142,6 +144,7 @@ > lock =&per_cpu(name##_lock, i); \ > arch_spin_unlock(lock); \ > } \ > + put_online_cpus(); \ > preempt_enable(); \ > } \ > EXPORT_SYMBOL(name##_global_unlock_online); \ Don't you want to call {get,put}_online_cpus() outside the preempt_{disable,enable}()? Otherwise you are scheduling while atomic? With that fixed Acked-by: Stephen Boyd <sboyd@codeaurora.org> but I wonder if taking the hotplug mutex even for a short time reduces the effectiveness of these locks? Or is it more about fast readers and slow writers? ^ permalink raw reply [flat|nested] 37+ messages in thread
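Stephen's "scheduling while atomic" concern comes from the fact that
get_online_cpus() takes a mutex and may therefore sleep. Paraphrased from
memory of kernel/cpu.c in roughly this era (not a verbatim copy), the pair
looks something like:

	void get_online_cpus(void)
	{
		might_sleep();
		if (cpu_hotplug.active_writer == current)
			return;
		mutex_lock(&cpu_hotplug.lock);	/* may sleep: illegal under preempt_disable() */
		cpu_hotplug.refcount++;
		mutex_unlock(&cpu_hotplug.lock);
	}

	void put_online_cpus(void)
	{
		if (cpu_hotplug.active_writer == current)
			return;
		mutex_lock(&cpu_hotplug.lock);
		/* last reader wakes up a waiting hotplug writer, if any */
		if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
			wake_up_process(cpu_hotplug.active_writer);
		mutex_unlock(&cpu_hotplug.lock);
	}

Hence the v2 below simply takes the reference before disabling preemption and
drops it after re-enabling it.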
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-19 9:12 ` Stephen Boyd @ 2011-12-19 11:03 ` Srivatsa S. Bhat 2011-12-19 12:11 ` Al Viro 0 siblings, 1 reply; 37+ messages in thread From: Srivatsa S. Bhat @ 2011-12-19 11:03 UTC (permalink / raw) To: Stephen Boyd Cc: mc, Alexander Viro, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On 12/19/2011 02:42 PM, Stephen Boyd wrote: > On 12/18/2011 11:31 PM, Srivatsa S. Bhat wrote: >> Hi, >> >> I feel the following patch is a better fix for 2 reasons: >> >> 1. As Al Viro pointed out, if we do for_each_possible_cpus() then we >> might >> encounter unnecessary performance hit in some scenarios. So working with >> only online cpus, safely(a.k.a race-free), if possible, would be a good >> solution (which this patch implements). >> >> 2. *_global_lock_online() and *_global_unlock_online() needs fixing as >> well >> because, the names suggest that they lock/unlock per-CPU locks of only >> the >> currently online CPUs, but unfortunately they do not have any >> synchronization >> to prevent offlining those CPUs in between, if it happens to race with >> a CPU >> hotplug operation. >> >> And if we solve issue 2 above "carefully" (as mentioned in the >> changelog below), >> it solves this whole thing! > > We started seeing this same problem last week. I've come up with almost > the same solution but you beat me to the list! > Oh :-) >> diff --git a/include/linux/lglock.h b/include/linux/lglock.h >> index f549056..583d1a8 100644 >> --- a/include/linux/lglock.h >> +++ b/include/linux/lglock.h >> @@ -126,6 +127,7 @@ >> int i; \ >> preempt_disable(); \ >> rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_); \ >> + get_online_cpus(); \ >> for_each_online_cpu(i) { \ >> arch_spinlock_t *lock; \ >> lock =&per_cpu(name##_lock, i); \ >> @@ -142,6 +144,7 @@ >> lock =&per_cpu(name##_lock, i); \ >> arch_spin_unlock(lock); \ >> } \ >> + put_online_cpus(); \ >> preempt_enable(); \ >> } \ >> EXPORT_SYMBOL(name##_global_unlock_online); \ > > Don't you want to call {get,put}_online_cpus() outside the > preempt_{disable,enable}()? Otherwise you are scheduling while atomic? > Oh right, thanks for spotting this! (and thanks for your Ack too!) > With that fixed > > Acked-by: Stephen Boyd <sboyd@codeaurora.org> > > but I wonder if taking the hotplug mutex even for a short time reduces > the effectiveness of these locks? Or is it more about fast readers and > slow writers? > IMHO, we don't need to be concerned here because, {get,put}_online_cpus() implement a refcounting solution, and they don't really serialize stuff unnecessarily. The readers (those who prevent cpu hotplug, such as this lock- unlock code) are fast and can be concurrent, while the writers (the task that is doing the cpu hotplug) waits till all existing readers are gone/done with their work. So, since we are readers here, IMO, we don't have to worry about performance. (I know that we get serialized just for a moment while incrementing the refcount, but that should not be worrisome right?) Moreover, using for_each_online_cpu() without using {get,put}_online_cpus() around that, is plain wrong, because of the unhandled race with cpu hotplug. IOW, our primary concern here is functionality, isn't it? To summarize, in the current design of these VFS locks, using {get,put}_online_cpus() is *essential* to fix a functionality-related bug, (and not so bad performance-wise as well). 
The following patch (v2) incorporates your comments:

---
From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Subject: [PATCH v2] VFS: Fix race between CPU hotplug and *_global_[un]lock_online()

The *_global_[un]lock_online() macros defined in include/linux/lglock.h can
race with CPU hotplug operations. Fix this race by wrapping them in the
get_online_cpus()/put_online_cpus() pair, so that CPU hotplug cannot happen
at the same time.

But note the semantics carefully: if we allowed CPU hotplug in between a
lock_online() and the corresponding unlock_online(), the lock states could
get messed up -- for example, we might take a lock on an online CPU but never
release it, because that CPU was offline by the time unlock_online() was
invoked (thanks to Cong Meng for debugging this issue).

[Soft lockups detected as a consequence of this issue were reported earlier
in https://lkml.org/lkml/2011/8/24/185.]

So, we are careful to allow CPU hotplug only after the lock-unlock sequence
is complete.

v2: Moved {get,put}_online_cpus() outside of preempt_{disable,enable} to
    avoid scheduling while atomic.

Debugged-by: Cong Meng <mc@linux.vnet.ibm.com>
Acked-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 include/linux/lglock.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index f549056..7ef257d 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -22,6 +22,7 @@
 #include <linux/spinlock.h>
 #include <linux/lockdep.h>
 #include <linux/percpu.h>
+#include <linux/cpu.h>

 /* can make br locks by using local lock for read side, global lock for write */
 #define br_lock_init(name)	name##_lock_init()
@@ -124,6 +125,7 @@
									\
 void name##_global_lock_online(void) {				\
 	int i;							\
+	get_online_cpus();					\
 	preempt_disable();					\
 	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);	\
 	for_each_online_cpu(i) {				\
@@ -143,6 +145,7 @@
 		arch_spin_unlock(lock);				\
 	}							\
 	preempt_enable();					\
+	put_online_cpus();					\
 }								\
 EXPORT_SYMBOL(name##_global_unlock_online);			\
									\
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
From: Al Viro @ 2011-12-19 12:11 UTC
To: Srivatsa S. Bhat
Cc: Stephen Boyd, mc, linux-kernel, linux-fsdevel, Nick Piggin, david,
    akpm@linux-foundation.org, Maciej Rutecki

On Mon, Dec 19, 2011 at 04:33:47PM +0530, Srivatsa S. Bhat wrote:

> IMHO, we don't need to be concerned here because, {get,put}_online_cpus()
> implement a refcounting solution, and they don't really serialize stuff
> unnecessarily. The readers (those who prevent cpu hotplug, such as this
> lock-unlock code) are fast and can be concurrent, while the writers (the
> task that is doing the cpu hotplug) waits till all existing readers are
> gone/done with their work.
>
> So, since we are readers here, IMO, we don't have to worry about performance.
> (I know that we get serialized just for a moment while incrementing the
> refcount, but that should not be worrisome right?)
>
> Moreover, using for_each_online_cpu() without using {get,put}_online_cpus()
> around that, is plain wrong, because of the unhandled race with cpu hotplug.
> IOW, our primary concern here is functionality, isn't it?
>
> To summarize, in the current design of these VFS locks, using
> {get,put}_online_cpus() is *essential* to fix a functionality-related bug,
> (and not so bad performance-wise as well).
>
> The following patch (v2) incorporates your comments:

I really don't like that.  Amount of contention is not a big issue, but the
fact that br_write_lock(vfsmount_lock) now becomes blocking is really nasty.
Moreover, we suddenly get cpu_hotplug.lock nested inside namespace_sem...
BTW, it's seriously blocking - if nothing else, it waits for a cpu_down()
in progress to complete, which can involve any number of interesting locks
taken by notifiers.

Dave's variant is also no good; consider this:

	CPU1: br_write_lock(); spinlocks grabbed
	CPU2: br_read_lock(); spinning on one of them
	CPU3: try to take CPU2 down.

We *can't* proceed to the end, notifiers or no notifiers, until CPU2 gets
through the critical area.  Which can't happen until the spinlock is unlocked,
i.e. until CPU1 does br_write_unlock().  A notifier can't silently do
spin_unlock() here either, or CPU2 would be free to enter the critical area
when it's really not safe there.

That's one hell of a deadlock potential ;-/  So far I'm more or less in favor
of doing get_online_cpus() explicitly in fs/namespace.c, outside of
namespace_sem.  But I still have not convinced myself that it's really safe ;-/
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-19 12:11 ` Al Viro @ 2011-12-19 20:23 ` Srivatsa S. Bhat 2011-12-19 20:52 ` Al Viro 2011-12-19 23:56 ` Dave Chinner 1 sibling, 1 reply; 37+ messages in thread From: Srivatsa S. Bhat @ 2011-12-19 20:23 UTC (permalink / raw) To: Al Viro Cc: Stephen Boyd, mc, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On 12/19/2011 05:41 PM, Al Viro wrote: > On Mon, Dec 19, 2011 at 04:33:47PM +0530, Srivatsa S. Bhat wrote: > >> IMHO, we don't need to be concerned here because, {get,put}_online_cpus() >> implement a refcounting solution, and they don't really serialize stuff >> unnecessarily. The readers (those who prevent cpu hotplug, such as this lock- >> unlock code) are fast and can be concurrent, while the writers (the task that >> is doing the cpu hotplug) waits till all existing readers are gone/done with >> their work. >> >> So, since we are readers here, IMO, we don't have to worry about performance. >> (I know that we get serialized just for a moment while incrementing the >> refcount, but that should not be worrisome right?) >> >> Moreover, using for_each_online_cpu() without using {get,put}_online_cpus() >> around that, is plain wrong, because of the unhandled race with cpu hotplug. >> IOW, our primary concern here is functionality, isn't it? >> >> To summarize, in the current design of these VFS locks, using >> {get,put}_online_cpus() is *essential* to fix a functionality-related bug, >> (and not so bad performance-wise as well). >> >> The following patch (v2) incorporates your comments: > > I really don't like that. Amount of contention is not a big issue, but the > fact that now br_write_lock(vfsmount_lock) became blocking is really nasty. > Moreover, we suddenly get cpu_hotplug.lock nested inside namespace_sem... > BTW, it's seriously blocking - if nothing else, it waits for cpu_down() > in progress to complete. Which can involve any number of interesting > locks taken by notifiers. > > Dave's variant is also no good; consider this: > CPU1: br_write_lock(); spinlocks grabbed > CPU2: br_read_lock(); spinning on one of them > CPU3: try to take CPU2 down. We *can't* proceed to the end, notifiers or no > notifiers, until CPU2 gets through the critical area. Which can't happen > until the spinlock is unlocked, i.e. until CPU1 does br_write_unlock(). > Notifier can't silently do spin_unlock() here or we'll get CPU2 free to go > into the critical area when it's really not safe there. > > That got one hell of a deadlock potential ;-/ So far I'm more or less > in favor of doing get_online_cpus() explicitly in fs/namespace.c, outside > of namespace_sem. But I still have not convinced myself that it's > really safe ;-/ > Okay, clearly you want br_write_locks to remain non-blocking. And for reasons related to long-term code understandability/maintainability, I am not a fan of the idea of sprinkling {get,put}_online_cpus() in fs/namespace.c, away from the place where the race with CPU hotplug actually exists (though I understand that that would work just fine). So, considering the above two requirements, let me propose another approach (though it might sound a bit hacky) : We note that we can simplify our requirement in *global_[un]lock_online() to: "lock and unlock must be invoked on the same set of CPUs, and that sequence must not get missed for any CPU, even if the CPU is offline. 
We only care about the correctness of the lock-unlock operations, and not
really about CPU hotplug or 'working with only online CPUs' or anything of
that sort."

If this new definition of our requirement is acceptable (correct me if I am
wrong), then we can do something like the following patch, while still
retaining the br locks as non-blocking.

We make a copy of the current cpu_online_mask and lock the per-CPU locks of
all those cpus. Then, while unlocking, we unlock the per-CPU locks of those
same cpus (by using the temporary copy of cpu_online_mask we created earlier),
without caring about which cpus are actually online at that moment. IOW, we
do the lock-unlock on the same set of cpus, and without missing the complete
lock-unlock sequence on any of them. Guaranteed.

[I will provide the changelog later, if people are OK with this approach.]

But note that this doesn't actually synchronize with cpu hotplug; it merely
tries to sidestep the issue.

Comments/suggestions?

---
 include/linux/lglock.h |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index f549056..6351ae3 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -122,11 +122,16 @@
 }								\
 EXPORT_SYMBOL(name##_local_unlock_cpu);			\
								\
+static DECLARE_BITMAP(name##_locked_cpu_bits, CONFIG_NR_CPUS);	\
+static struct cpumask *name##_locked_cpu_mask =		\
+			to_cpumask(name##_locked_cpu_bits);	\
+								\
 void name##_global_lock_online(void) {				\
 	int i;							\
 	preempt_disable();					\
 	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);	\
-	for_each_online_cpu(i) {				\
+	cpumask_copy(name##_locked_cpu_mask, cpu_online_mask);	\
+	for_each_cpu(i, name##_locked_cpu_mask) {		\
 		arch_spinlock_t *lock;				\
 		lock = &per_cpu(name##_lock, i);		\
 		arch_spin_lock(lock);				\
@@ -137,7 +142,7 @@
 void name##_global_unlock_online(void) {			\
 	int i;							\
 	rwlock_release(&name##_lock_dep_map, 1, _RET_IP_);	\
-	for_each_online_cpu(i) {				\
+	for_each_cpu(i, name##_locked_cpu_mask) {		\
 		arch_spinlock_t *lock;				\
 		lock = &per_cpu(name##_lock, i);		\
 		arch_spin_unlock(lock);				\
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-19 20:23 ` Srivatsa S. Bhat @ 2011-12-19 20:52 ` Al Viro 2011-12-20 4:56 ` Srivatsa S. Bhat 0 siblings, 1 reply; 37+ messages in thread From: Al Viro @ 2011-12-19 20:52 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Stephen Boyd, mc, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On Tue, Dec 20, 2011 at 01:53:42AM +0530, Srivatsa S. Bhat wrote: > If this new definition of our requirement is acceptable (correct me if I am > wrong), then we can do something like the following patch, while still > retaining br locks as non-blocking. > > We make a copy of the current cpu_online_mask, and lock the per-cpu locks of > all those cpus. Then while unlocking, we unlock the per-cpu locks of these cpus > (by using that temporary copy of cpu_online_mask we created earlier), without > caring about the cpus actually online at that moment. > IOW, we do lock-unlock on the same set of cpus, and that too, without missing > the complete lock-unlock sequence in any of them. Guaranteed. And what's to stop a process on a newly added CPU from _not_ spinning in br_read_lock(), even though br_write_unlock() hadn't been done yet? ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-19 20:52 ` Al Viro @ 2011-12-20 4:56 ` Srivatsa S. Bhat 2011-12-20 6:27 ` Al Viro 0 siblings, 1 reply; 37+ messages in thread From: Srivatsa S. Bhat @ 2011-12-20 4:56 UTC (permalink / raw) To: Al Viro Cc: Stephen Boyd, mc, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On 12/20/2011 02:22 AM, Al Viro wrote: > On Tue, Dec 20, 2011 at 01:53:42AM +0530, Srivatsa S. Bhat wrote: >> If this new definition of our requirement is acceptable (correct me if I am >> wrong), then we can do something like the following patch, while still >> retaining br locks as non-blocking. >> >> We make a copy of the current cpu_online_mask, and lock the per-cpu locks of >> all those cpus. Then while unlocking, we unlock the per-cpu locks of these cpus >> (by using that temporary copy of cpu_online_mask we created earlier), without >> caring about the cpus actually online at that moment. >> IOW, we do lock-unlock on the same set of cpus, and that too, without missing >> the complete lock-unlock sequence in any of them. Guaranteed. > > And what's to stop a process on a newly added CPU from _not_ > spinning in br_read_lock(), even though br_write_unlock() hadn't been > done yet? > Oh, right, that has to be handled as well... Hmmm... How about registering a CPU hotplug notifier callback during lock init time, and then for every cpu that gets onlined (after we took a copy of the cpu_online_mask to work with), we see if that cpu is different from the ones we have already locked, and if it is, we lock it in the callback handler and update the locked_cpu_mask appropriately (so that we release the locks properly during the unlock operation). Handling the newly introduced race between the callback handler and lock-unlock code must not be difficult, I believe.. Any loopholes in this approach? Or is the additional complexity just not worth it here? Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
From: Al Viro @ 2011-12-20  6:27 UTC
To: Srivatsa S. Bhat
Cc: Stephen Boyd, mc, linux-kernel, linux-fsdevel, Nick Piggin, david,
    akpm@linux-foundation.org, Maciej Rutecki

On Tue, Dec 20, 2011 at 10:26:05AM +0530, Srivatsa S. Bhat wrote:

> Oh, right, that has to be handled as well...
>
> Hmmm... How about registering a CPU hotplug notifier callback during lock
> init time, and then for every cpu that gets onlined (after we took a copy of
> the cpu_online_mask to work with), we see if that cpu is different from the
> ones we have already locked, and if it is, we lock it in the callback handler
> and update the locked_cpu_mask appropriately (so that we release the locks
> properly during the unlock operation).
>
> Handling the newly introduced race between the callback handler and
> lock-unlock code must not be difficult, I believe..
>
> Any loopholes in this approach? Or is the additional complexity just not
> worth it here?

To summarize the modified variant of that approach hashed out on IRC:

* lglock grows three extra things: a spinlock, a cpu bitmap and a cpu hotplug
  notifier.
* foo_global_lock_online starts with grabbing that spinlock and loops over
  the cpus in that bitmap.
* foo_global_unlock_online loops over the same bitmap and then drops that
  spinlock.
* the callback of the notifier does all bitmap updates, under that spinlock.
  Events that need handling definitely include things like "was going up but
  failed", since we need the bitmap to contain all online CPUs at all times,
  preferably without too much junk beyond that.  IOW, we need to add a CPU
  there _before_ low-level __cpu_up() calls set_cpu_online(), which means we
  want to clean up on a failed attempt to bring it up.  Taking a CPU down is
  probably less of a PITA: just clear the bit on the final "the sucker's dead"
  event.
* the bitmap is initialized once, at the same time we set the notifier up.
  Just grab the spinlock and do
	for_each_online_cpu(N)
		add N to bitmap
  then release the spinlock and let the callbacks handle all further updates.

I think that'll work with relatively little pain, but I'm not familiar enough
with the cpu hotplug notifiers, so I'd rather have the folks familiar with
those supply the set of events to watch for...
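Spelled out as code, the design summarized above could look roughly like the
sketch below. This is illustrative only: the demo_* names are invented, and
the exact set of notifier events hooked here is one possible choice rather
than a settled decision (as the rest of the thread makes clear).

	#include <linux/cpu.h>
	#include <linux/cpumask.h>
	#include <linux/notifier.h>
	#include <linux/percpu.h>
	#include <linux/spinlock.h>

	static DEFINE_PER_CPU(arch_spinlock_t, demo_lock) = __ARCH_SPIN_LOCK_UNLOCKED;
	static DEFINE_SPINLOCK(demo_cpu_lock);	/* guards demo_cpus and the write side */
	static struct cpumask demo_cpus;	/* CPUs whose per-CPU lock we take      */

	void demo_global_lock_online(void)
	{
		int i;

		/* Held until unlock: bitmap updates (and hence hotplug) must wait. */
		spin_lock(&demo_cpu_lock);
		for_each_cpu(i, &demo_cpus)
			arch_spin_lock(&per_cpu(demo_lock, i));
	}

	void demo_global_unlock_online(void)
	{
		int i;

		for_each_cpu(i, &demo_cpus)
			arch_spin_unlock(&per_cpu(demo_lock, i));
		spin_unlock(&demo_cpu_lock);
	}

	static int demo_cpu_notify(struct notifier_block *nb,
				   unsigned long action, void *hcpu)
	{
		unsigned int cpu = (unsigned long)hcpu;

		/* All bitmap updates happen under the same spinlock. */
		spin_lock(&demo_cpu_lock);
		switch (action & ~CPU_TASKS_FROZEN) {
		case CPU_UP_PREPARE:		/* before set_cpu_online()      */
			cpumask_set_cpu(cpu, &demo_cpus);
			break;
		case CPU_UP_CANCELED:		/* "was going up but failed"    */
		case CPU_DEAD:			/* final "the sucker's dead"    */
			cpumask_clear_cpu(cpu, &demo_cpus);
			break;
		}
		spin_unlock(&demo_cpu_lock);
		return NOTIFY_OK;
	}

	static struct notifier_block demo_cpu_notifier = {
		.notifier_call = demo_cpu_notify,
	};

	void demo_lock_init(void)
	{
		int i;

		spin_lock(&demo_cpu_lock);
		for_each_online_cpu(i)		/* populate the bitmap once */
			cpumask_set_cpu(i, &demo_cpus);
		spin_unlock(&demo_cpu_lock);
		register_hotcpu_notifier(&demo_cpu_notifier);
	}

Which events to hook, and whether the notifier can run too late relative to
the new CPU actually scheduling tasks, is exactly the part the following
messages argue about.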
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-20 6:27 ` Al Viro @ 2011-12-20 7:28 ` Srivatsa S. Bhat 2011-12-20 9:37 ` mengcong 2011-12-20 7:30 ` mengcong 1 sibling, 1 reply; 37+ messages in thread From: Srivatsa S. Bhat @ 2011-12-20 7:28 UTC (permalink / raw) To: Al Viro Cc: Stephen Boyd, mc, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On 12/20/2011 11:57 AM, Al Viro wrote: > On Tue, Dec 20, 2011 at 10:26:05AM +0530, Srivatsa S. Bhat wrote: >> Oh, right, that has to be handled as well... >> >> Hmmm... How about registering a CPU hotplug notifier callback during lock init >> time, and then for every cpu that gets onlined (after we took a copy of the >> cpu_online_mask to work with), we see if that cpu is different from the ones >> we have already locked, and if it is, we lock it in the callback handler and >> update the locked_cpu_mask appropriately (so that we release the locks properly >> during the unlock operation). >> >> Handling the newly introduced race between the callback handler and lock-unlock >> code must not be difficult, I believe.. >> >> Any loopholes in this approach? Or is the additional complexity just not worth >> it here? > > To summarize the modified variant of that approach hashed out on IRC: > > * lglock grows three extra things: spinlock, cpu bitmap and cpu hotplug > notifier. > * foo_global_lock_online starts with grabbing that spinlock and > loops over the cpus in that bitmap. > * foo_global_unlock_online loops over the same bitmap and then drops > that spinlock > * callback of the notifier is going to do all bitmap updates. Under > that spinlock. Events that need handling definitely include the things like > "was going up but failed", since we need the bitmap to contain all online CPUs > at all time, preferably without too much junk beyond that. IOW, we need to add > it there _before_ low-level __cpu_up() calls set_cpu_online(). Which means > that we want to clean up on failed attempt to up it. Taking a CPU down is > probably less PITA; just clear bit on the final "the sucker's dead" event. > * bitmap is initialized once, at the same time we set the notifier > up. Just grab the spinlock and do > for_each_online_cpu(N) > add N to bitmap > then release the spinlock and let the callbacks handle all updates. > > I think that'll work with relatively little pain, but I'm not familiar enough > with the cpuhotplug notifiers, so I'd rather have the folks familiar with those > to supply the set of events to watch for... > We need not watch out for "up failed" events. It is enough if we handle CPU_ONLINE and CPU_DEAD events. Because, these 2 events are triggered only upon successful online or offline operation, and these notifications are more than enough for our purpose (to update our bitmaps). Also, those cpus which came online wont start running until these "success notifications" are all done, which is where we do our stuff in the callback (ie., try grabbing the spinlock..). Of course, by doing this (only looking out for CPU_ONLINE and CPU_DEAD events), our bitmap will probably be one step behind cpu_online_mask (which means, we'll still have to take the snapshot of cpu_online_mask and work with it instead of using for_each_online_cpu()). But that doesn't matter, as long as: * we don't allow the newly onlined CPU to start executing code (this is achieved by taking the spinlock in the callback) * we stick to our bitmap while taking and releasing the spinlocks. 
Both of these have been handled in the design proposed above. So we are good to go I guess. I am working on translating all these to working code.. Will post the patch as soon as I'm done. Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-20 7:28 ` Srivatsa S. Bhat @ 2011-12-20 9:37 ` mengcong 2011-12-20 10:36 ` Srivatsa S. Bhat 0 siblings, 1 reply; 37+ messages in thread From: mengcong @ 2011-12-20 9:37 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Al Viro, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On Tue, 2011-12-20 at 12:58 +0530, Srivatsa S. Bhat wrote: > On 12/20/2011 11:57 AM, Al Viro wrote: > > > On Tue, Dec 20, 2011 at 10:26:05AM +0530, Srivatsa S. Bhat wrote: > >> Oh, right, that has to be handled as well... > >> > >> Hmmm... How about registering a CPU hotplug notifier callback during lock init > >> time, and then for every cpu that gets onlined (after we took a copy of the > >> cpu_online_mask to work with), we see if that cpu is different from the ones > >> we have already locked, and if it is, we lock it in the callback handler and > >> update the locked_cpu_mask appropriately (so that we release the locks properly > >> during the unlock operation). > >> > >> Handling the newly introduced race between the callback handler and lock-unlock > >> code must not be difficult, I believe.. > >> > >> Any loopholes in this approach? Or is the additional complexity just not worth > >> it here? > > > > To summarize the modified variant of that approach hashed out on IRC: > > > > * lglock grows three extra things: spinlock, cpu bitmap and cpu hotplug > > notifier. > > * foo_global_lock_online starts with grabbing that spinlock and > > loops over the cpus in that bitmap. > > * foo_global_unlock_online loops over the same bitmap and then drops > > that spinlock > > * callback of the notifier is going to do all bitmap updates. Under > > that spinlock. Events that need handling definitely include the things like > > "was going up but failed", since we need the bitmap to contain all online CPUs > > at all time, preferably without too much junk beyond that. IOW, we need to add > > it there _before_ low-level __cpu_up() calls set_cpu_online(). Which means > > that we want to clean up on failed attempt to up it. Taking a CPU down is > > probably less PITA; just clear bit on the final "the sucker's dead" event. > > * bitmap is initialized once, at the same time we set the notifier > > up. Just grab the spinlock and do > > for_each_online_cpu(N) > > add N to bitmap > > then release the spinlock and let the callbacks handle all updates. > > > > I think that'll work with relatively little pain, but I'm not familiar enough > > with the cpuhotplug notifiers, so I'd rather have the folks familiar with those > > to supply the set of events to watch for... > > > > > We need not watch out for "up failed" events. It is enough if we handle > CPU_ONLINE and CPU_DEAD events. Because, these 2 events are triggered only > upon successful online or offline operation, and these notifications are > more than enough for our purpose (to update our bitmaps). Also, those cpus > which came online wont start running until these "success notifications" > are all done, which is where we do our stuff in the callback (ie., try > grabbing the spinlock..). > > Of course, by doing this (only looking out for CPU_ONLINE and CPU_DEAD > events), our bitmap will probably be one step behind cpu_online_mask > (which means, we'll still have to take the snapshot of cpu_online_mask and > work with it instead of using for_each_online_cpu()). 
> But that doesn't matter, as long as:
> * we don't allow the newly onlined CPU to start executing code (this
>   is achieved by taking the spinlock in the callback)

I think the cpu notifier callback doesn't always run on the UPing cpu;
actually, it rarely does. And even if I am wrong about that, there is still a
chance that lg-lock operations get scheduled on the UPing cpu before the
callback is called.

> * we stick to our bitmap while taking and releasing the spinlocks.
>
> Both of these have been handled in the design proposed above. So we are good
> to go I guess.
>
> I am working on translating all these to working code.. Will post the patch
> as soon as I'm done.
>
> Regards,
> Srivatsa S. Bhat
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-20 9:37 ` mengcong @ 2011-12-20 10:36 ` Srivatsa S. Bhat 2011-12-20 11:08 ` Srivatsa S. Bhat 0 siblings, 1 reply; 37+ messages in thread From: Srivatsa S. Bhat @ 2011-12-20 10:36 UTC (permalink / raw) To: mc Cc: Al Viro, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On 12/20/2011 03:07 PM, mengcong wrote: > On Tue, 2011-12-20 at 12:58 +0530, Srivatsa S. Bhat wrote: >> On 12/20/2011 11:57 AM, Al Viro wrote: >> >>> On Tue, Dec 20, 2011 at 10:26:05AM +0530, Srivatsa S. Bhat wrote: >>>> Oh, right, that has to be handled as well... >>>> >>>> Hmmm... How about registering a CPU hotplug notifier callback during lock init >>>> time, and then for every cpu that gets onlined (after we took a copy of the >>>> cpu_online_mask to work with), we see if that cpu is different from the ones >>>> we have already locked, and if it is, we lock it in the callback handler and >>>> update the locked_cpu_mask appropriately (so that we release the locks properly >>>> during the unlock operation). >>>> >>>> Handling the newly introduced race between the callback handler and lock-unlock >>>> code must not be difficult, I believe.. >>>> >>>> Any loopholes in this approach? Or is the additional complexity just not worth >>>> it here? >>> >>> To summarize the modified variant of that approach hashed out on IRC: >>> >>> * lglock grows three extra things: spinlock, cpu bitmap and cpu hotplug >>> notifier. >>> * foo_global_lock_online starts with grabbing that spinlock and >>> loops over the cpus in that bitmap. >>> * foo_global_unlock_online loops over the same bitmap and then drops >>> that spinlock >>> * callback of the notifier is going to do all bitmap updates. Under >>> that spinlock. Events that need handling definitely include the things like >>> "was going up but failed", since we need the bitmap to contain all online CPUs >>> at all time, preferably without too much junk beyond that. IOW, we need to add >>> it there _before_ low-level __cpu_up() calls set_cpu_online(). Which means >>> that we want to clean up on failed attempt to up it. Taking a CPU down is >>> probably less PITA; just clear bit on the final "the sucker's dead" event. >>> * bitmap is initialized once, at the same time we set the notifier >>> up. Just grab the spinlock and do >>> for_each_online_cpu(N) >>> add N to bitmap >>> then release the spinlock and let the callbacks handle all updates. >>> >>> I think that'll work with relatively little pain, but I'm not familiar enough >>> with the cpuhotplug notifiers, so I'd rather have the folks familiar with those >>> to supply the set of events to watch for... >>> >> >> >> We need not watch out for "up failed" events. It is enough if we handle >> CPU_ONLINE and CPU_DEAD events. Because, these 2 events are triggered only >> upon successful online or offline operation, and these notifications are >> more than enough for our purpose (to update our bitmaps). Also, those cpus >> which came online wont start running until these "success notifications" >> are all done, which is where we do our stuff in the callback (ie., try >> grabbing the spinlock..). >> >> Of course, by doing this (only looking out for CPU_ONLINE and CPU_DEAD >> events), our bitmap will probably be one step behind cpu_online_mask >> (which means, we'll still have to take the snapshot of cpu_online_mask and >> work with it instead of using for_each_online_cpu()). 
>> But that doesn't matter, as long as: >> * we don't allow the newly onlined CPU to start executing code (this >> is achieved by taking the spinlock in the callback) > > I think cpu notifier callback doesn't always run on the UPing cpu. > Actually, it rarely runs on the UPing cpu. > If I was wrong about the above thought, there is still a chance that lg-lock > operations are scheduled on the UPing cpu before calling the callback. > I wasn't actually banking on that, but you have raised a very good point. The scheduler uses its own set of cpu hotplug callback handlers to start using the newly added cpu (see the set of callbacks in kernel/sched.c) So, now we have a race between our callback and the scheduler's callbacks. ("Placing" our callback appropriately in a safe position using priority for notifiers doesn't appeal to me that much, since it looks like too much hackery. It should probably be our last resort). Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-20 10:36 ` Srivatsa S. Bhat @ 2011-12-20 11:08 ` Srivatsa S. Bhat 2011-12-20 12:50 ` Srivatsa S. Bhat 0 siblings, 1 reply; 37+ messages in thread From: Srivatsa S. Bhat @ 2011-12-20 11:08 UTC (permalink / raw) To: mc Cc: Al Viro, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On 12/20/2011 04:06 PM, Srivatsa S. Bhat wrote: > On 12/20/2011 03:07 PM, mengcong wrote: > >> On Tue, 2011-12-20 at 12:58 +0530, Srivatsa S. Bhat wrote: >>> On 12/20/2011 11:57 AM, Al Viro wrote: >>> >>>> On Tue, Dec 20, 2011 at 10:26:05AM +0530, Srivatsa S. Bhat wrote: >>>>> Oh, right, that has to be handled as well... >>>>> >>>>> Hmmm... How about registering a CPU hotplug notifier callback during lock init >>>>> time, and then for every cpu that gets onlined (after we took a copy of the >>>>> cpu_online_mask to work with), we see if that cpu is different from the ones >>>>> we have already locked, and if it is, we lock it in the callback handler and >>>>> update the locked_cpu_mask appropriately (so that we release the locks properly >>>>> during the unlock operation). >>>>> >>>>> Handling the newly introduced race between the callback handler and lock-unlock >>>>> code must not be difficult, I believe.. >>>>> >>>>> Any loopholes in this approach? Or is the additional complexity just not worth >>>>> it here? >>>> >>>> To summarize the modified variant of that approach hashed out on IRC: >>>> >>>> * lglock grows three extra things: spinlock, cpu bitmap and cpu hotplug >>>> notifier. >>>> * foo_global_lock_online starts with grabbing that spinlock and >>>> loops over the cpus in that bitmap. >>>> * foo_global_unlock_online loops over the same bitmap and then drops >>>> that spinlock >>>> * callback of the notifier is going to do all bitmap updates. Under >>>> that spinlock. Events that need handling definitely include the things like >>>> "was going up but failed", since we need the bitmap to contain all online CPUs >>>> at all time, preferably without too much junk beyond that. IOW, we need to add >>>> it there _before_ low-level __cpu_up() calls set_cpu_online(). Which means >>>> that we want to clean up on failed attempt to up it. Taking a CPU down is >>>> probably less PITA; just clear bit on the final "the sucker's dead" event. >>>> * bitmap is initialized once, at the same time we set the notifier >>>> up. Just grab the spinlock and do >>>> for_each_online_cpu(N) >>>> add N to bitmap >>>> then release the spinlock and let the callbacks handle all updates. >>>> >>>> I think that'll work with relatively little pain, but I'm not familiar enough >>>> with the cpuhotplug notifiers, so I'd rather have the folks familiar with those >>>> to supply the set of events to watch for... >>>> >>> >>> >>> We need not watch out for "up failed" events. It is enough if we handle >>> CPU_ONLINE and CPU_DEAD events. Because, these 2 events are triggered only >>> upon successful online or offline operation, and these notifications are >>> more than enough for our purpose (to update our bitmaps). Also, those cpus >>> which came online wont start running until these "success notifications" >>> are all done, which is where we do our stuff in the callback (ie., try >>> grabbing the spinlock..). 
>>> >>> Of course, by doing this (only looking out for CPU_ONLINE and CPU_DEAD >>> events), our bitmap will probably be one step behind cpu_online_mask >>> (which means, we'll still have to take the snapshot of cpu_online_mask and >>> work with it instead of using for_each_online_cpu()). >>> But that doesn't matter, as long as: >>> * we don't allow the newly onlined CPU to start executing code (this >>> is achieved by taking the spinlock in the callback) >> >> I think cpu notifier callback doesn't always run on the UPing cpu. >> Actually, it rarely runs on the UPing cpu. >> If I was wrong about the above thought, there is still a chance that lg-lock >> operations are scheduled on the UPing cpu before calling the callback. >> > > > I wasn't actually banking on that, but you have raised a very good point. > The scheduler uses its own set of cpu hotplug callback handlers to start > using the newly added cpu (see the set of callbacks in kernel/sched.c) > > So, now we have a race between our callback and the scheduler's callbacks. > ("Placing" our callback appropriately in a safe position using priority > for notifiers doesn't appeal to me that much, since it looks like too much > hackery. It should probably be our last resort). > Hey, actually there is a simple solution: just nip it (or rather delay it) in the bud ;) That is, we watch out for CPU_UP_PREPARE event and lock it up there itself using our spinlock.. that way, that cpu will not come up until we are done executing br_write_unlock(). In fact, we can even fail the onlining of that cpu by returning appropriate value from our callback, but that would be too harsh.. so we can settle for delaying the cpu online operation :) And we do the same thing for CPU_DOWN_PREPARE event. And we can forget about taking snapshot of cpu_online_mask, since cpu_online_mask is guaranteed to be unmodified until we are done with br_write_unlock(). Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-20 11:08 ` Srivatsa S. Bhat @ 2011-12-20 12:50 ` Srivatsa S. Bhat 2011-12-20 14:06 ` Al Viro 0 siblings, 1 reply; 37+ messages in thread From: Srivatsa S. Bhat @ 2011-12-20 12:50 UTC (permalink / raw) To: mc Cc: Al Viro, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On 12/20/2011 04:38 PM, Srivatsa S. Bhat wrote: > On 12/20/2011 04:06 PM, Srivatsa S. Bhat wrote: > >> On 12/20/2011 03:07 PM, mengcong wrote: >> >>> On Tue, 2011-12-20 at 12:58 +0530, Srivatsa S. Bhat wrote: >>>> On 12/20/2011 11:57 AM, Al Viro wrote: >>>> >>>>> On Tue, Dec 20, 2011 at 10:26:05AM +0530, Srivatsa S. Bhat wrote: >>>>>> Oh, right, that has to be handled as well... >>>>>> >>>>>> Hmmm... How about registering a CPU hotplug notifier callback during lock init >>>>>> time, and then for every cpu that gets onlined (after we took a copy of the >>>>>> cpu_online_mask to work with), we see if that cpu is different from the ones >>>>>> we have already locked, and if it is, we lock it in the callback handler and >>>>>> update the locked_cpu_mask appropriately (so that we release the locks properly >>>>>> during the unlock operation). >>>>>> >>>>>> Handling the newly introduced race between the callback handler and lock-unlock >>>>>> code must not be difficult, I believe.. >>>>>> >>>>>> Any loopholes in this approach? Or is the additional complexity just not worth >>>>>> it here? >>>>> >>>>> To summarize the modified variant of that approach hashed out on IRC: >>>>> >>>>> * lglock grows three extra things: spinlock, cpu bitmap and cpu hotplug >>>>> notifier. >>>>> * foo_global_lock_online starts with grabbing that spinlock and >>>>> loops over the cpus in that bitmap. >>>>> * foo_global_unlock_online loops over the same bitmap and then drops >>>>> that spinlock >>>>> * callback of the notifier is going to do all bitmap updates. Under >>>>> that spinlock. Events that need handling definitely include the things like >>>>> "was going up but failed", since we need the bitmap to contain all online CPUs >>>>> at all time, preferably without too much junk beyond that. IOW, we need to add >>>>> it there _before_ low-level __cpu_up() calls set_cpu_online(). Which means >>>>> that we want to clean up on failed attempt to up it. Taking a CPU down is >>>>> probably less PITA; just clear bit on the final "the sucker's dead" event. >>>>> * bitmap is initialized once, at the same time we set the notifier >>>>> up. Just grab the spinlock and do >>>>> for_each_online_cpu(N) >>>>> add N to bitmap >>>>> then release the spinlock and let the callbacks handle all updates. >>>>> >>>>> I think that'll work with relatively little pain, but I'm not familiar enough >>>>> with the cpuhotplug notifiers, so I'd rather have the folks familiar with those >>>>> to supply the set of events to watch for... >>>>> >>>> >>>> >>>> We need not watch out for "up failed" events. It is enough if we handle >>>> CPU_ONLINE and CPU_DEAD events. Because, these 2 events are triggered only >>>> upon successful online or offline operation, and these notifications are >>>> more than enough for our purpose (to update our bitmaps). Also, those cpus >>>> which came online wont start running until these "success notifications" >>>> are all done, which is where we do our stuff in the callback (ie., try >>>> grabbing the spinlock..). 
>>>> >>>> Of course, by doing this (only looking out for CPU_ONLINE and CPU_DEAD >>>> events), our bitmap will probably be one step behind cpu_online_mask >>>> (which means, we'll still have to take the snapshot of cpu_online_mask and >>>> work with it instead of using for_each_online_cpu()). >>>> But that doesn't matter, as long as: >>>> * we don't allow the newly onlined CPU to start executing code (this >>>> is achieved by taking the spinlock in the callback) >>> >>> I think cpu notifier callback doesn't always run on the UPing cpu. >>> Actually, it rarely runs on the UPing cpu. >>> If I was wrong about the above thought, there is still a chance that lg-lock >>> operations are scheduled on the UPing cpu before calling the callback. >>> >> >> >> I wasn't actually banking on that, but you have raised a very good point. >> The scheduler uses its own set of cpu hotplug callback handlers to start >> using the newly added cpu (see the set of callbacks in kernel/sched.c) >> >> So, now we have a race between our callback and the scheduler's callbacks. >> ("Placing" our callback appropriately in a safe position using priority >> for notifiers doesn't appeal to me that much, since it looks like too much >> hackery. It should probably be our last resort). >> > > > Hey, actually there is a simple solution: just nip it (or rather delay it) > in the bud ;) That is, we watch out for CPU_UP_PREPARE event and lock it > up there itself using our spinlock.. that way, that cpu will not come up > until we are done executing br_write_unlock(). In fact, we can even fail > the onlining of that cpu by returning appropriate value from our callback, > but that would be too harsh.. so we can settle for delaying the cpu online > operation :) > > And we do the same thing for CPU_DOWN_PREPARE event. And we can forget about > taking snapshot of cpu_online_mask, since cpu_online_mask is guaranteed to > be unmodified until we are done with br_write_unlock(). > Had to modify this a bit, to handle a race while using cpu_online_mask. Anyway, here is the code: The idea here is that the notifier will grab the spinlock and not release it until the entire cpu hotplug operation is complete. Hence the pairs CPU_UP_PREPARE <-> CPU_UP_CANCELED, CPU_ONLINE and CPU_DOWN_PREPARE <-> CPU_DOWN_FAILED, CPU_DEAD However, if the notifier couldn't acquire the spinlock, it will keep spinning at the CPU_UP_PREPARE or CPU_DOWN_PREPARE event, thereby preventing cpu hotplug, as well as any updates to cpu_online_mask. Hence, taking snapshot of cpu_online_mask becomes unnecessary. [Actually I have another approach, using CPU_STARTING and CPU_DYING events, which addresses Cong Meng's concern in a more direct manner, but I would rather post that patch after sometime, to avoid the risk of confusing everyone. Moreover, that approach is not a "clear winner", so it can wait until the below patch gets reviewed]. 
--- include/linux/lglock.h | 39 +++++++++++++++++++++++++++++++++++++++ 1 files changed, 39 insertions(+), 0 deletions(-) diff --git a/include/linux/lglock.h b/include/linux/lglock.h index f549056..71079b1 100644 --- a/include/linux/lglock.h +++ b/include/linux/lglock.h @@ -22,6 +22,7 @@ #include <linux/spinlock.h> #include <linux/lockdep.h> #include <linux/percpu.h> +#include <linux/cpu.h> /* can make br locks by using local lock for read side, global lock for write */ #define br_lock_init(name) name##_lock_init() @@ -75,6 +76,41 @@ DEFINE_PER_CPU(arch_spinlock_t, name##_lock); \ DEFINE_LGLOCK_LOCKDEP(name); \ \ +DEFINE_SPINLOCK(name##_notifier_lock); \ + \ +static int \ +name##_lg_cpu_callback(struct notifier_block *nb, \ + unsigned long action, void *hcpu) \ +{ \ + switch (action & ~CPU_TASKS_FROZEN) { \ + case CPU_UP_PREPARE: \ + spin_lock(&name##_notifier_lock); \ + break; \ + \ + case CPU_UP_CANCELED: \ + case CPU_ONLINE: \ + spin_unlock(&name##_notifier_lock); \ + break; \ + \ + case CPU_DOWN_PREPARE: \ + spin_lock(&name##_notifier_lock); \ + break; \ + \ + case CPU_DOWN_FAILED: \ + case CPU_DEAD: \ + spin_unlock(&name##_notifier_lock); \ + break; \ + \ + default: \ + return NOTIFY_DONE; \ + } \ + return NOTIFY_OK; \ +} \ + \ +static struct notifier_block name##_lg_cpu_notifier = { \ + .notifier_call = name##_lg_cpu_callback, \ +}; \ + \ void name##_lock_init(void) { \ int i; \ LOCKDEP_INIT_MAP(&name##_lock_dep_map, #name, &name##_lock_key, 0); \ @@ -83,6 +119,7 @@ lock = &per_cpu(name##_lock, i); \ *lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; \ } \ + register_hotcpu_notifier(&name##_lg_cpu_notifier); \ } \ EXPORT_SYMBOL(name##_lock_init); \ \ @@ -125,6 +162,7 @@ void name##_global_lock_online(void) { \ int i; \ preempt_disable(); \ + spin_lock(&name##_notifier_lock); \ rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_); \ for_each_online_cpu(i) { \ arch_spinlock_t *lock; \ @@ -142,6 +180,7 @@ lock = &per_cpu(name##_lock, i); \ arch_spin_unlock(lock); \ } \ + spin_unlock(&name##_notifier_lock); \ preempt_enable(); \ } \ EXPORT_SYMBOL(name##_global_unlock_online); \ Regards, Srivatsa S. Bhat ^ permalink raw reply related [flat|nested] 37+ messages in thread
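For reference, the CPU hotplug notifier sequences that the above patch relies on are roughly the following (a hedged summary from general kernel knowledge of this era, not something stated in the thread):

    /*
     * CPU online:   CPU_UP_PREPARE -> CPU_STARTING -> CPU_ONLINE
     *   on failure: CPU_UP_PREPARE -> CPU_UP_CANCELED
     * CPU offline:  CPU_DOWN_PREPARE -> CPU_DYING -> CPU_DEAD
     *   on failure: CPU_DOWN_PREPARE -> CPU_DOWN_FAILED
     *
     * The patch takes the per-lglock spinlock at CPU_UP_PREPARE and
     * CPU_DOWN_PREPARE and releases it at CPU_ONLINE/CPU_UP_CANCELED
     * and CPU_DEAD/CPU_DOWN_FAILED, deliberately ignoring CPU_STARTING
     * and CPU_DYING.
     */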
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-20 12:50 ` Srivatsa S. Bhat @ 2011-12-20 14:06 ` Al Viro 2011-12-20 14:35 ` Srivatsa S. Bhat 0 siblings, 1 reply; 37+ messages in thread From: Al Viro @ 2011-12-20 14:06 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On Tue, Dec 20, 2011 at 06:20:44PM +0530, Srivatsa S. Bhat wrote: > > Hey, actually there is a simple solution: just nip it (or rather delay it) > > in the bud ;) That is, we watch out for CPU_UP_PREPARE event and lock it > > up there itself using our spinlock.. that way, that cpu will not come up > > until we are done executing br_write_unlock(). In fact, we can even fail > > the onlining of that cpu by returning appropriate value from our callback, > > but that would be too harsh.. so we can settle for delaying the cpu online Eeeek... Are you serious about leaving a spinlock grabbed by notifier callback and not released until another callback call? That would be one hell of a constraint on what these notifiers can do - _nothing_ between these calls (including other notifier callbacks, etc.) would be allowed to block. That sounds extremely brittle... ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-20 14:06 ` Al Viro @ 2011-12-20 14:35 ` Srivatsa S. Bhat 2011-12-20 17:59 ` Al Viro 0 siblings, 1 reply; 37+ messages in thread From: Srivatsa S. Bhat @ 2011-12-20 14:35 UTC (permalink / raw) To: Al Viro Cc: mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On 12/20/2011 07:36 PM, Al Viro wrote: > On Tue, Dec 20, 2011 at 06:20:44PM +0530, Srivatsa S. Bhat wrote: > >>> Hey, actually there is a simple solution: just nip it (or rather delay it) >>> in the bud ;) That is, we watch out for CPU_UP_PREPARE event and lock it >>> up there itself using our spinlock.. that way, that cpu will not come up >>> until we are done executing br_write_unlock(). In fact, we can even fail >>> the onlining of that cpu by returning appropriate value from our callback, >>> but that would be too harsh.. so we can settle for delaying the cpu online > > Eeeek... Are you serious about leaving a spinlock grabbed by notifier > callback and not released until another callback call? That would be one > hell of a constraint on what these notifiers can do - _nothing_ between > these calls (including other notifier callbacks, etc.) would be allowed > to block. > > That sounds extremely brittle... > Sorry but I didn't quite get your point... No two cpu hotplug operations can race because of the cpu_hotplug lock they use. Hence, if a cpu online operation begins, it has to succeed or fail eventually. No other cpu hotplug operation can intervene. Ditto for cpu offline operations. Hence a CPU_UP_PREPARE event *will* be followed by a corresponding CPU_UP_CANCELED or CPU_ONLINE event for the same cpu. (And we ignore the CPU_STARTING event that comes in between, on purpose, so as to avoid the race with cpu_online_mask). Similar is the story for offline operation. And if the notifier grabs the spinlock and keeps it locked between these 2 points of a cpu hotplug operation, it ensures that our br locks will spin, instead of block till the cpu hotplug operation is complete. Isn't this what we desired all along? "A non-blocking way to sync br locks with cpu hotplug"? Or am I missing something? Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-20 14:35 ` Srivatsa S. Bhat @ 2011-12-20 17:59 ` Al Viro 2011-12-20 19:12 ` Srivatsa S. Bhat 0 siblings, 1 reply; 37+ messages in thread From: Al Viro @ 2011-12-20 17:59 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On Tue, Dec 20, 2011 at 08:05:32PM +0530, Srivatsa S. Bhat wrote: > Sorry but I didn't quite get your point... > No two cpu hotplug operations can race because of the cpu_hotplug lock they > use. Hence, if a cpu online operation begins, it has to succeed or fail > eventually. No other cpu hotplug operation can intervene. Ditto for cpu offline > operations. > > Hence a CPU_UP_PREPARE event *will* be followed by a corresponding > CPU_UP_CANCELED or CPU_ONLINE event for the same cpu. (And we ignore the > CPU_STARTING event that comes in between, on purpose, so as to avoid the race > with cpu_online_mask). Similar is the story for offline operation. > > And if the notifier grabs the spinlock and keeps it locked between these 2 > points of a cpu hotplug operation, it ensures that our br locks will spin, > instead of block till the cpu hotplug operation is complete. Isn't this what > we desired all along? "A non-blocking way to sync br locks with cpu hotplug"? > > Or am I missing something? The standard reason why losing the timeslice while holding a spinlock means deadlocks? CPU1: grabs spinlock CPU[2..n]: tries to grab the same spinlock, spins CPU1: does something blocking, process loses timeslice CPU1: whatever got scheduled there happens to to try and grab the same spinlock and you are stuck. At that point *all* CPUs are spinning on that spinlock and your code that would eventually unlock it has no chance to get any CPU to run on. Having the callback grab and release a spinlock is fine (as long as you don't do anything blocking between these spin_lock/spin_unlock). Having it leave with spinlock held, though, means that the area where you can't block has expanded a whole lot. As I said, brittle... A quick grep through the actual callbacks immediately shows e.g. del_timer_sync() done on CPU_DOWN_PREPARE. And sysfs_remove_group(), which leads to outright mutex_lock(). And sysfs_remove_link() (ditto). And via_cputemp_device_remove() (again, mutex_lock()). And free_irq(). And perf_event_exit_cpu() (mutex_lock()). And... IOW, there are shitloads of deadlocks right there. If your callback's position in the chain is earlier than any of those, you are screwed. No, what I had in mind was different - use the callbacks to maintain a bitmap that would contain a) all CPUs that were online at the moment b) ... and not too much else Updates protected by spinlock; in all cases it gets dropped before the callback returns. br_write_lock() grabs that spinlock and iterates over the set; it *does* leave the spinlock grabbed - that's OK, since all code between br_write_lock() and br_write_unlock() must be non-blocking anyway. br_write_unlock() iterates over the same bitmap (unchanged since br_write_lock()) and finally drops the spinlock. AFAICS, what we want in callback is CPU_DEAD, CPU_DEAD_FROZEN, CPU_UP_CANCELLED, CPU_UP_CANCELLED_FROZEN: grab spinlock remove cpu from bitmap drop spinlock CPU_UP_PREPARE, CPU_UP_PREPARE_FROZEN grab spinlock add cpu to bitmap drop spinlock That ought to keep bitmap close to cpu_online_mask, which is enough for our purposes. 
^ permalink raw reply [flat|nested] 37+ messages in thread
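To make that failure mode concrete, here is a minimal, hypothetical sketch (the lock, mutex and callback names are invented; this is not code from any patch in this thread) of a notifier that returns from CPU_DOWN_PREPARE with a spinlock still held, followed by a later callback on the same chain that can sleep:

    static DEFINE_SPINLOCK(lg_cpu_lock);
    static DEFINE_MUTEX(some_driver_mutex);

    /* Registered early in the chain: returns with lg_cpu_lock held,
     * to be released only at CPU_DEAD/CPU_DOWN_FAILED. */
    static int lg_cpu_callback(struct notifier_block *nb,
                               unsigned long action, void *hcpu)
    {
            if ((action & ~CPU_TASKS_FROZEN) == CPU_DOWN_PREPARE)
                    spin_lock(&lg_cpu_lock);
            return NOTIFY_OK;
    }

    /* Registered later in the chain: may sleep on CPU_DOWN_PREPARE,
     * just like the del_timer_sync()/mutex_lock() users listed above. */
    static int driver_cpu_callback(struct notifier_block *nb,
                                   unsigned long action, void *hcpu)
    {
            if ((action & ~CPU_TASKS_FROZEN) == CPU_DOWN_PREPARE)
                    mutex_lock(&some_driver_mutex);
            return NOTIFY_OK;
    }

    /*
     * Once the hotplug thread sleeps in driver_cpu_callback() while it
     * still holds lg_cpu_lock, any task that calls br_write_lock() now
     * spins on lg_cpu_lock with preemption disabled.  With every CPU
     * spinning, neither the mutex owner nor the hotplug thread can run
     * again, so the CPU_DEAD/CPU_DOWN_FAILED event that would release
     * lg_cpu_lock is never delivered -- the deadlock described above.
     */

The bitmap variant sketched above avoids all of this because its callback always drops the spinlock before returning.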
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-20 17:59 ` Al Viro @ 2011-12-20 19:12 ` Srivatsa S. Bhat 2011-12-20 19:58 ` Al Viro 0 siblings, 1 reply; 37+ messages in thread From: Srivatsa S. Bhat @ 2011-12-20 19:12 UTC (permalink / raw) To: Al Viro Cc: mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On 12/20/2011 11:29 PM, Al Viro wrote: > On Tue, Dec 20, 2011 at 08:05:32PM +0530, Srivatsa S. Bhat wrote: > >> Sorry but I didn't quite get your point... >> No two cpu hotplug operations can race because of the cpu_hotplug lock they >> use. Hence, if a cpu online operation begins, it has to succeed or fail >> eventually. No other cpu hotplug operation can intervene. Ditto for cpu offline >> operations. >> >> Hence a CPU_UP_PREPARE event *will* be followed by a corresponding >> CPU_UP_CANCELED or CPU_ONLINE event for the same cpu. (And we ignore the >> CPU_STARTING event that comes in between, on purpose, so as to avoid the race >> with cpu_online_mask). Similar is the story for offline operation. >> >> And if the notifier grabs the spinlock and keeps it locked between these 2 >> points of a cpu hotplug operation, it ensures that our br locks will spin, >> instead of block till the cpu hotplug operation is complete. Isn't this what >> we desired all along? "A non-blocking way to sync br locks with cpu hotplug"? >> >> Or am I missing something? > > The standard reason why losing the timeslice while holding a spinlock means > deadlocks? > CPU1: grabs spinlock > CPU[2..n]: tries to grab the same spinlock, spins > CPU1: does something blocking, process loses timeslice > CPU1: whatever got scheduled there happens to to try and grab the same > spinlock and you are stuck. At that point *all* CPUs are spinning on > that spinlock and your code that would eventually unlock it has no chance > to get any CPU to run on. > > Having the callback grab and release a spinlock is fine (as long as you > don't do anything blocking between these spin_lock/spin_unlock). Having > it leave with spinlock held, though, means that the area where you can't > block has expanded a whole lot. As I said, brittle... > Ah, now I see your point! Thanks for the explanation. > A quick grep through the actual callbacks immediately shows e.g. > del_timer_sync() done on CPU_DOWN_PREPARE. And sysfs_remove_group(), > which leads to outright mutex_lock(). And sysfs_remove_link() (ditto). > And via_cputemp_device_remove() (again, mutex_lock()). And free_irq(). > And perf_event_exit_cpu() (mutex_lock()). And... > > IOW, there are shitloads of deadlocks right there. If your callback's > position in the chain is earlier than any of those, you are screwed. > The thought makes me shudder! > No, what I had in mind was different - use the callbacks to maintain a > bitmap that would contain > a) all CPUs that were online at the moment > b) ... and not too much else > Updates protected by spinlock; in all cases it gets dropped before the > callback returns. br_write_lock() grabs that spinlock and iterates over > the set; it *does* leave the spinlock grabbed - that's OK, since all > code between br_write_lock() and br_write_unlock() must be non-blocking > anyway. br_write_unlock() iterates over the same bitmap (unchanged since > br_write_lock()) and finally drops the spinlock. > I had this same thing in mind when I started out to write the patch.. 
but after Cong raised that concern, I changed track and in the meantime, tried to get rid of maintaining our own bitmap as well... But unfortunately that turned out to be disastrous! > AFAICS, what we want in callback is > CPU_DEAD, CPU_DEAD_FROZEN, CPU_UP_CANCELLED, CPU_UP_CANCELLED_FROZEN: > grab spinlock > remove cpu from bitmap > drop spinlock > CPU_UP_PREPARE, CPU_UP_PREPARE_FROZEN > grab spinlock > add cpu to bitmap > drop spinlock > That ought to keep bitmap close to cpu_online_mask, which is enough for > our purposes. > Yes, that should do. And while initializing our bitmap, we could use get_online_cpus() make a copy of cpu_online_mask put_online_cpus() since blocking here is acceptable, as this is done in the lock_init() code. Right? That would be better than register_hotcpu_notifier(...); grab spinlock for_each_online_cpu(N) add N to bitmap release spinlock because the latter code is not fully race-free (because we don't handle CPU_DOWN_PREPARE event in the callback and hence cpu_online_mask can get updated in-between). But it would still work since cpus going down don't really pose problems for us. However the former code is race-free, and we can afford it since we are free to block at that point. Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-20 19:12 ` Srivatsa S. Bhat @ 2011-12-20 19:58 ` Al Viro 2011-12-20 22:27 ` Dave Chinner 2011-12-21 21:15 ` Srivatsa S. Bhat 0 siblings, 2 replies; 37+ messages in thread From: Al Viro @ 2011-12-20 19:58 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On Wed, Dec 21, 2011 at 12:42:04AM +0530, Srivatsa S. Bhat wrote: > register_hotcpu_notifier(...); > grab spinlock > for_each_online_cpu(N) > add N to bitmap > release spinlock > > because the latter code is not fully race-free (because we don't handle > CPU_DOWN_PREPARE event in the callback and hence cpu_online_mask can get > updated in-between). But it would still work since cpus going down don't > really pose problems for us. Um? Sure, that loop can end up adding CPUs on their way down into the set. And as soon as they get their CPU_DEAD, notifier will prune them out... Or is there something I'm missing here? Anyway, the variant I have here (untested) follows: diff --git a/include/linux/lglock.h b/include/linux/lglock.h index f549056..1951f67 100644 --- a/include/linux/lglock.h +++ b/include/linux/lglock.h @@ -22,6 +22,7 @@ #include <linux/spinlock.h> #include <linux/lockdep.h> #include <linux/percpu.h> +#include <linux/cpu.h> /* can make br locks by using local lock for read side, global lock for write */ #define br_lock_init(name) name##_lock_init() @@ -72,9 +73,31 @@ #define DEFINE_LGLOCK(name) \ \ + DEFINE_SPINLOCK(name##_cpu_lock); \ + cpumask_t name##_cpus __read_mostly; \ DEFINE_PER_CPU(arch_spinlock_t, name##_lock); \ DEFINE_LGLOCK_LOCKDEP(name); \ \ + static int \ + name##_lg_cpu_callback(struct notifier_block *nb, \ + unsigned long action, void *hcpu) \ + { \ + switch (action & ~CPU_TASKS_FROZEN) { \ + case CPU_UP_PREPARE: \ + spin_lock(&name##_cpu_lock); \ + cpu_set((unsigned long)hcpu, name##_cpus); \ + spin_unlock(&name##_cpu_lock); \ + break; \ + case CPU_UP_CANCELED: case CPU_DEAD: \ + spin_lock(&name##_cpu_lock); \ + cpu_clear((unsigned long)hcpu, name##_cpus); \ + spin_unlock(&name##_cpu_lock); \ + } \ + return NOTIFY_OK; \ + } \ + static struct notifier_block name##_lg_cpu_notifier = { \ + .notifier_call = name##_lg_cpu_callback, \ + }; \ void name##_lock_init(void) { \ int i; \ LOCKDEP_INIT_MAP(&name##_lock_dep_map, #name, &name##_lock_key, 0); \ @@ -83,6 +106,11 @@ lock = &per_cpu(name##_lock, i); \ *lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; \ } \ + register_hotcpu_notifier(&name##_lg_cpu_notifier); \ + spin_lock(&name##_cpu_lock); \ + for_each_online_cpu(i) \ + cpu_set(i, name##_cpus); \ + spin_unlock(&name##_cpu_lock); \ } \ EXPORT_SYMBOL(name##_lock_init); \ \ @@ -124,9 +152,9 @@ \ void name##_global_lock_online(void) { \ int i; \ - preempt_disable(); \ + spin_lock(&name##_cpu_lock); \ rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_); \ - for_each_online_cpu(i) { \ + for_each_cpu(i, &name##_cpus) { \ arch_spinlock_t *lock; \ lock = &per_cpu(name##_lock, i); \ arch_spin_lock(lock); \ @@ -137,12 +165,12 @@ void name##_global_unlock_online(void) { \ int i; \ rwlock_release(&name##_lock_dep_map, 1, _RET_IP_); \ - for_each_online_cpu(i) { \ + for_each_cpu(i, &name##_cpus) { \ arch_spinlock_t *lock; \ lock = &per_cpu(name##_lock, i); \ arch_spin_unlock(lock); \ } \ - preempt_enable(); \ + spin_unlock(&name##_cpu_lock); \ } \ EXPORT_SYMBOL(name##_global_unlock_online); \ \ ^ permalink raw reply related [flat|nested] 
37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-20 19:58 ` Al Viro @ 2011-12-20 22:27 ` Dave Chinner 2011-12-20 23:31 ` Al Viro 2011-12-21 21:15 ` Srivatsa S. Bhat 1 sibling, 1 reply; 37+ messages in thread From: Dave Chinner @ 2011-12-20 22:27 UTC (permalink / raw) To: Al Viro Cc: Srivatsa S. Bhat, mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, akpm@linux-foundation.org, Maciej Rutecki On Tue, Dec 20, 2011 at 07:58:06PM +0000, Al Viro wrote: > On Wed, Dec 21, 2011 at 12:42:04AM +0530, Srivatsa S. Bhat wrote: > > register_hotcpu_notifier(...); > > grab spinlock > > for_each_online_cpu(N) > > add N to bitmap > > release spinlock > > > > because the latter code is not fully race-free (because we don't handle > > CPU_DOWN_PREPARE event in the callback and hence cpu_online_mask can get > > updated in-between). But it would still work since cpus going down don't > > really pose problems for us. > > Um? Sure, that loop can end up adding CPUs on their way down into the set. > And as soon as they get their CPU_DEAD, notifier will prune them out... Or > is there something I'm missing here? Anyway, the variant I have here > (untested) follows: Only thing that concerns me about this patch is the bitmap changing between lock and unlock operations. i.e. CPU 1: lock all cpus in mask CPU 2: brings up new cpu, notifier adds CPU to bitmask CPU 1: unlock all cpus in mask And in this case the unlock tries to unlock a cpu that wasn't locked to start with. It really seems to me that while a global lock is in progress, the online bitmask cannot be allowed to change. Perhaps something can be passed between the lock and unlock operations to be able to detect a changed mask between lock/unlock operations (e.g. a generation number) and then handle that via a slow path that unlocks only locks that are active in the online bitmask? i.e. all the notifier does is bump the generation count, and the slow path on the unlock handles everything else? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-20 22:27 ` Dave Chinner @ 2011-12-20 23:31 ` Al Viro 0 siblings, 0 replies; 37+ messages in thread From: Al Viro @ 2011-12-20 23:31 UTC (permalink / raw) To: Dave Chinner Cc: Srivatsa S. Bhat, mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, akpm@linux-foundation.org, Maciej Rutecki On Wed, Dec 21, 2011 at 09:27:34AM +1100, Dave Chinner wrote: > Only thing that concerns me about this patch is the bitmap changing > between lock and unlock operations. i.e. > > CPU 1: lock all cpus in mask ... after having taken the spinlock > CPU 2: brings up new cpu, notifier adds CPU to bitmask ... which would also take the same spinlock > CPU 1: unlock all cpus in mask ... which is where that spinlock would've been released > And in this case the unlock tries to unlock a cpu that wasn't locked > to start with. It really seems to me that while a global lock is in > progress, the online bitmask cannot be allowed to change. Well, yes. See spin_lock() in br_write_lock() before the loop and spin_unlock() in br_write_unlock() after the loop... > Perhaps something can be passed between the lock and unlock > operations to be able to detect a changed mask between lock/unlock > operations (e.g. a generation number) and then handle that via a > slow path that unlocks only locks that are active in the online > bitmask? i.e. all the notifier does is bump the generation count, > and the slow path on the unlock handles everything else? ??? ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-20 19:58 ` Al Viro 2011-12-20 22:27 ` Dave Chinner @ 2011-12-21 21:15 ` Srivatsa S. Bhat 2011-12-21 22:02 ` Al Viro 2011-12-21 22:12 ` Andrew Morton 1 sibling, 2 replies; 37+ messages in thread From: Srivatsa S. Bhat @ 2011-12-21 21:15 UTC (permalink / raw) To: Al Viro Cc: mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On 12/21/2011 01:28 AM, Al Viro wrote: > On Wed, Dec 21, 2011 at 12:42:04AM +0530, Srivatsa S. Bhat wrote: >> register_hotcpu_notifier(...); >> grab spinlock >> for_each_online_cpu(N) >> add N to bitmap >> release spinlock >> >> because the latter code is not fully race-free (because we don't handle >> CPU_DOWN_PREPARE event in the callback and hence cpu_online_mask can get >> updated in-between). But it would still work since cpus going down don't >> really pose problems for us. > > Um? Sure, that loop can end up adding CPUs on their way down into the set. > And as soon as they get their CPU_DEAD, notifier will prune them out... Or > is there something I'm missing here? > No, everything is fine overall (as I mentioned earlier - it would still work, because even though the for_each_online_cpu() thing is racy, that race is harmless). Now considering that we have come to a working solution to this whole issue (though it looks a bit convoluted), IMHO it is worth trying to make this whole thing look a bit more intuitive and appear "obviously race-free", at least at places where we can afford it. IOW, I feel using {get,put}_online_cpus() as mentioned in my previous post avoids a "suspicious looking" usage of for_each_online_cpu() :-) The following patch is the same as the patch you posted, but with this small change (aimed merely at making the code a bit easier to understand) and a commit message added. Please point out if this change seems bad for any reason. And if it is fine, Viro, can you please sign-off on this patch? (since this patch has code contributions from both you and me) I tested this patch locally - the originally reported issue did not crop up (soft-lockups in VFS callpaths). --- From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Subject: [PATCH] VFS: Fix race between CPU hotplug and lglocks Currently, the *_global_[un]lock_online() routines are not at all synchronized with CPU hotplug. Soft-lockups detected as a consequence of this race was reported earlier at https://lkml.org/lkml/2011/8/24/185. (Thanks to Cong Meng for finding out that the root-cause of this issue is the race condition between br_write_[un]lock() and CPU hotplug, which results in the lock states getting messed up). Fixing this race by just adding {get,put}_online_cpus() at appropriate places in *_global_[un]lock_online() is not a good option, because, then suddenly br_write_[un]lock() would become blocking, whereas they have been kept as non-blocking all this time, and we would want to keep them that way. So, overall, we want to ensure 3 things: 1. br_write_lock() and br_write_unlock() must remain as non-blocking. 2. The corresponding lock and unlock of the per-cpu spinlocks must not happen for different sets of CPUs. 3. Either prevent any new CPU online operation in between this lock-unlock, or ensure that the newly onlined CPU does not proceed with its corresponding per-cpu spinlock unlocked. 
To achieve all this: (a) We introduce a new spinlock that is taken by the *_global_lock_online() routine and released by the *_global_unlock_online() routine. (b) We register a callback for CPU hotplug notifications, and this callback takes the same spinlock as above. (c) We maintain a bitmap which is close to the cpu_online_mask, and once it is initialized in the lock_init() code, all future updates to it are done in the callback, under the above spinlock. (d) The above bitmap is used (instead of cpu_online_mask) while locking and unlocking the per-cpu locks. The callback takes the spinlock upon the CPU_UP_PREPARE event. So, if the br_write_lock-unlock sequence is in progress, the callback keeps spinning, thus preventing the CPU online operation till the lock-unlock sequence is complete. This takes care of requirement (3). The bitmap that we maintain remains unmodified throughout the lock-unlock sequence, since all updates to it are managed by the callback, which takes the same spinlock as the one taken by the lock code and released only by the unlock routine. Combining this with (d) above, satisfies requirement (2). Overall, since we use a spinlock (mentioned in (a)) to prevent CPU hotplug operations from racing with br_write_lock-unlock, requirement (1) is also taken care of. By the way, it is to be noted that a CPU offline operation can actually run in parallel with our lock-unlock sequence, because our callback doesn't react to notifications earlier than CPU_DEAD (in order to maintain our bitmap properly). And this means, since we use our own bitmap (which is stale, on purpose) during the lock-unlock sequence, we could end up unlocking the per-cpu lock of an offline CPU (because we had locked it earlier, when the CPU was online), in order to satisfy requirement (2). But this is harmless, though it looks a bit awkward. Debugged-by: Cong Meng <mc@linux.vnet.ibm.com> Signed-off-by: Srivatsa S. 
Bhat <srivatsa.bhat@linux.vnet.ibm.com> --- include/linux/lglock.h | 36 ++++++++++++++++++++++++++++++++---- 1 files changed, 32 insertions(+), 4 deletions(-) diff --git a/include/linux/lglock.h b/include/linux/lglock.h index f549056..87f402c 100644 --- a/include/linux/lglock.h +++ b/include/linux/lglock.h @@ -22,6 +22,7 @@ #include <linux/spinlock.h> #include <linux/lockdep.h> #include <linux/percpu.h> +#include <linux/cpu.h> /* can make br locks by using local lock for read side, global lock for write */ #define br_lock_init(name) name##_lock_init() @@ -72,9 +73,31 @@ #define DEFINE_LGLOCK(name) \ \ + DEFINE_SPINLOCK(name##_cpu_lock); \ + cpumask_t name##_cpus __read_mostly; \ DEFINE_PER_CPU(arch_spinlock_t, name##_lock); \ DEFINE_LGLOCK_LOCKDEP(name); \ \ + static int \ + name##_lg_cpu_callback(struct notifier_block *nb, \ + unsigned long action, void *hcpu) \ + { \ + switch (action & ~CPU_TASKS_FROZEN) { \ + case CPU_UP_PREPARE: \ + spin_lock(&name##_cpu_lock); \ + cpu_set((unsigned long)hcpu, name##_cpus); \ + spin_unlock(&name##_cpu_lock); \ + break; \ + case CPU_UP_CANCELED: case CPU_DEAD: \ + spin_lock(&name##_cpu_lock); \ + cpu_clear((unsigned long)hcpu, name##_cpus); \ + spin_unlock(&name##_cpu_lock); \ + } \ + return NOTIFY_OK; \ + } \ + static struct notifier_block name##_lg_cpu_notifier = { \ + .notifier_call = name##_lg_cpu_callback, \ + }; \ void name##_lock_init(void) { \ int i; \ LOCKDEP_INIT_MAP(&name##_lock_dep_map, #name, &name##_lock_key, 0); \ @@ -83,6 +106,11 @@ lock = &per_cpu(name##_lock, i); \ *lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; \ } \ + register_hotcpu_notifier(&name##_lg_cpu_notifier); \ + get_online_cpus(); \ + for_each_online_cpu(i) \ + cpu_set(i, name##_cpus); \ + put_online_cpus(); \ } \ EXPORT_SYMBOL(name##_lock_init); \ \ @@ -124,9 +152,9 @@ \ void name##_global_lock_online(void) { \ int i; \ - preempt_disable(); \ + spin_lock(&name##_cpu_lock); \ rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_); \ - for_each_online_cpu(i) { \ + for_each_cpu(i, &name##_cpus) { \ arch_spinlock_t *lock; \ lock = &per_cpu(name##_lock, i); \ arch_spin_lock(lock); \ @@ -137,12 +165,12 @@ void name##_global_unlock_online(void) { \ int i; \ rwlock_release(&name##_lock_dep_map, 1, _RET_IP_); \ - for_each_online_cpu(i) { \ + for_each_cpu(i, &name##_cpus) { \ arch_spinlock_t *lock; \ lock = &per_cpu(name##_lock, i); \ arch_spin_unlock(lock); \ } \ - preempt_enable(); \ + spin_unlock(&name##_cpu_lock); \ } \ EXPORT_SYMBOL(name##_global_unlock_online); \ \ ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-21 21:15 ` Srivatsa S. Bhat @ 2011-12-21 22:02 ` Al Viro 2011-12-21 22:12 ` Andrew Morton 1 sibling, 0 replies; 37+ messages in thread From: Al Viro @ 2011-12-21 22:02 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On Thu, Dec 22, 2011 at 02:45:29AM +0530, Srivatsa S. Bhat wrote: > The following patch is the same as the patch you posted, but with this small > change (aimed merely at making the code a bit easier to understand) and a > commit message added. Please point out if this change seems bad for > any reason. And if it is fine, Viro, can you please sign-off on this patch? > (since this patch has code contributions from both you and me) I can live with that. I still think it's not particulary useful, but... Anyway, ACKed-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> I couldn't care less who ends up in the Author: fields. I can feed it to Linus tonight, if you prefer it to go through the VFS tree. But IMO that's properly a CPU-hotplug issue and VFS is only involved as it's the only current user of br-locks. Should go into -stable as well, all way back to 2.6.36... Sad that Nick is still MIA, BTW - it's his code and he'd been Cc'ed on the thread all along... ;-/ > I tested this patch locally - the originally reported issue did not crop up > (soft-lockups in VFS callpaths). > > --- > From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> > Subject: [PATCH] VFS: Fix race between CPU hotplug and lglocks > > Currently, the *_global_[un]lock_online() routines are not at all synchronized > with CPU hotplug. Soft-lockups detected as a consequence of this race was > reported earlier at https://lkml.org/lkml/2011/8/24/185. (Thanks to Cong Meng > for finding out that the root-cause of this issue is the race condition > between br_write_[un]lock() and CPU hotplug, which results in the lock states > getting messed up). > > Fixing this race by just adding {get,put}_online_cpus() at appropriate places > in *_global_[un]lock_online() is not a good option, because, then suddenly > br_write_[un]lock() would become blocking, whereas they have been kept as > non-blocking all this time, and we would want to keep them that way. > > So, overall, we want to ensure 3 things: > 1. br_write_lock() and br_write_unlock() must remain as non-blocking. > 2. The corresponding lock and unlock of the per-cpu spinlocks must not happen > for different sets of CPUs. > 3. Either prevent any new CPU online operation in between this lock-unlock, or > ensure that the newly onlined CPU does not proceed with its corresponding > per-cpu spinlock unlocked. > > To achieve all this: > (a) We introduce a new spinlock that is taken by the *_global_lock_online() > routine and released by the *_global_unlock_online() routine. > (b) We register a callback for CPU hotplug notifications, and this callback > takes the same spinlock as above. > (c) We maintain a bitmap which is close to the cpu_online_mask, and once it is > initialized in the lock_init() code, all future updates to it are done in > the callback, under the above spinlock. > (d) The above bitmap is used (instead of cpu_online_mask) while locking and > unlocking the per-cpu locks. > > The callback takes the spinlock upon the CPU_UP_PREPARE event. 
So, if the > br_write_lock-unlock sequence is in progress, the callback keeps spinning, > thus preventing the CPU online operation till the lock-unlock sequence is > complete. This takes care of requirement (3). > > The bitmap that we maintain remains unmodified throughout the lock-unlock > sequence, since all updates to it are managed by the callback, which takes > the same spinlock as the one taken by the lock code and released only by the > unlock routine. Combining this with (d) above, satisfies requirement (2). > > Overall, since we use a spinlock (mentioned in (a)) to prevent CPU hotplug > operations from racing with br_write_lock-unlock, requirement (1) is also > taken care of. > > By the way, it is to be noted that a CPU offline operation can actually run > in parallel with our lock-unlock sequence, because our callback doesn't react > to notifications earlier than CPU_DEAD (in order to maintain our bitmap > properly). And this means, since we use our own bitmap (which is stale, on > purpose) during the lock-unlock sequence, we could end up unlocking the > per-cpu lock of an offline CPU (because we had locked it earlier, when the > CPU was online), in order to satisfy requirement (2). But this is harmless, > though it looks a bit awkward. > > Debugged-by: Cong Meng <mc@linux.vnet.ibm.com> > Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> > --- > > include/linux/lglock.h | 36 ++++++++++++++++++++++++++++++++---- > 1 files changed, 32 insertions(+), 4 deletions(-) > > diff --git a/include/linux/lglock.h b/include/linux/lglock.h > index f549056..87f402c 100644 > --- a/include/linux/lglock.h > +++ b/include/linux/lglock.h > @@ -22,6 +22,7 @@ > #include <linux/spinlock.h> > #include <linux/lockdep.h> > #include <linux/percpu.h> > +#include <linux/cpu.h> > > /* can make br locks by using local lock for read side, global lock for write */ > #define br_lock_init(name) name##_lock_init() > @@ -72,9 +73,31 @@ > > #define DEFINE_LGLOCK(name) \ > \ > + DEFINE_SPINLOCK(name##_cpu_lock); \ > + cpumask_t name##_cpus __read_mostly; \ > DEFINE_PER_CPU(arch_spinlock_t, name##_lock); \ > DEFINE_LGLOCK_LOCKDEP(name); \ > \ > + static int \ > + name##_lg_cpu_callback(struct notifier_block *nb, \ > + unsigned long action, void *hcpu) \ > + { \ > + switch (action & ~CPU_TASKS_FROZEN) { \ > + case CPU_UP_PREPARE: \ > + spin_lock(&name##_cpu_lock); \ > + cpu_set((unsigned long)hcpu, name##_cpus); \ > + spin_unlock(&name##_cpu_lock); \ > + break; \ > + case CPU_UP_CANCELED: case CPU_DEAD: \ > + spin_lock(&name##_cpu_lock); \ > + cpu_clear((unsigned long)hcpu, name##_cpus); \ > + spin_unlock(&name##_cpu_lock); \ > + } \ > + return NOTIFY_OK; \ > + } \ > + static struct notifier_block name##_lg_cpu_notifier = { \ > + .notifier_call = name##_lg_cpu_callback, \ > + }; \ > void name##_lock_init(void) { \ > int i; \ > LOCKDEP_INIT_MAP(&name##_lock_dep_map, #name, &name##_lock_key, 0); \ > @@ -83,6 +106,11 @@ > lock = &per_cpu(name##_lock, i); \ > *lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; \ > } \ > + register_hotcpu_notifier(&name##_lg_cpu_notifier); \ > + get_online_cpus(); \ > + for_each_online_cpu(i) \ > + cpu_set(i, name##_cpus); \ > + put_online_cpus(); \ > } \ > EXPORT_SYMBOL(name##_lock_init); \ > \ > @@ -124,9 +152,9 @@ > \ > void name##_global_lock_online(void) { \ > int i; \ > - preempt_disable(); \ > + spin_lock(&name##_cpu_lock); \ > rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_); \ > - for_each_online_cpu(i) { \ > + for_each_cpu(i, &name##_cpus) { \ > arch_spinlock_t 
*lock; \ > lock = &per_cpu(name##_lock, i); \ > arch_spin_lock(lock); \ > @@ -137,12 +165,12 @@ > void name##_global_unlock_online(void) { \ > int i; \ > rwlock_release(&name##_lock_dep_map, 1, _RET_IP_); \ > - for_each_online_cpu(i) { \ > + for_each_cpu(i, &name##_cpus) { \ > arch_spinlock_t *lock; \ > lock = &per_cpu(name##_lock, i); \ > arch_spin_unlock(lock); \ > } \ > - preempt_enable(); \ > + spin_unlock(&name##_cpu_lock); \ > } \ > EXPORT_SYMBOL(name##_global_unlock_online); \ > \ > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-21 21:15 ` Srivatsa S. Bhat 2011-12-21 22:02 ` Al Viro @ 2011-12-21 22:12 ` Andrew Morton 2011-12-22 7:02 ` Al Viro 1 sibling, 1 reply; 37+ messages in thread From: Andrew Morton @ 2011-12-21 22:12 UTC (permalink / raw) To: Srivatsa S. Bhat Cc: Al Viro, mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, Maciej Rutecki On Thu, 22 Dec 2011 02:45:29 +0530 "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote: off-topic, but the lockdep stuff in include/linux/lglock.h (LOCKDEP_INIT_MAP and DEFINE_LGLOCK_LOCKDEP) appears to be dead code. > --- a/include/linux/lglock.h > +++ b/include/linux/lglock.h > @@ -22,6 +22,7 @@ > #include <linux/spinlock.h> > #include <linux/lockdep.h> > #include <linux/percpu.h> > +#include <linux/cpu.h> > > /* can make br locks by using local lock for read side, global lock for write */ > #define br_lock_init(name) name##_lock_init() > @@ -72,9 +73,31 @@ > > #define DEFINE_LGLOCK(name) \ > \ > + DEFINE_SPINLOCK(name##_cpu_lock); \ > + cpumask_t name##_cpus __read_mostly; \ > DEFINE_PER_CPU(arch_spinlock_t, name##_lock); \ > DEFINE_LGLOCK_LOCKDEP(name); \ > \ also off-topic: we can't define a static lglock with this macro. ^ permalink raw reply [flat|nested] 37+ messages in thread
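To make the second point concrete (an illustration, not something spelled out in the thread): because DEFINE_LGLOCK() expands to a whole sequence of file-scope definitions, there is no way to get a translation-unit-local lock out of it:

    /* What a user might try in a single .c file: */
    static DEFINE_LGLOCK(my_private_lock);

    /*
     * The expansion is a series of top-level definitions -- the per-cpu
     * spinlocks, the lockdep map, and the generated
     * my_private_lock_lock_init(), my_private_lock_global_lock_online(),
     * ... functions, each followed by EXPORT_SYMBOL().  The leading
     * "static" attaches only to the first definition in that expansion;
     * the generated functions remain global, exported symbols, so a
     * genuinely static lglock cannot be expressed with this macro.
     */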
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-21 22:12 ` Andrew Morton @ 2011-12-22 7:02 ` Al Viro 2011-12-22 7:20 ` Andrew Morton 0 siblings, 1 reply; 37+ messages in thread From: Al Viro @ 2011-12-22 7:02 UTC (permalink / raw) To: Andrew Morton Cc: Srivatsa S. Bhat, mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, Maciej Rutecki On Wed, Dec 21, 2011 at 02:12:29PM -0800, Andrew Morton wrote: > off-topic, but the lockdep stuff in include/linux/lglock.h > (LOCKDEP_INIT_MAP and DEFINE_LGLOCK_LOCKDEP) appears to be dead code. Um? See ..._lock_init(); it's used there. > > + DEFINE_SPINLOCK(name##_cpu_lock); \ > > + cpumask_t name##_cpus __read_mostly; \ > > DEFINE_PER_CPU(arch_spinlock_t, name##_lock); \ > > DEFINE_LGLOCK_LOCKDEP(name); \ > > \ > > also off-topic: we can't define a static lglock with this macro. True... I don't like the general STL feel of that code, TBH. The only thing that stops me from putting all that stuff into a struct and getting rid of those macros from hell is that we'd need to use alloc_percpu() and that means extra overhead, potentially quite painful on the "local" side of those. May be worth experimenting with later, but not at this point in cycle... Anyway, to Linus it goes. ^ permalink raw reply [flat|nested] 37+ messages in thread
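A rough sketch of the struct-plus-alloc_percpu() variant being alluded to here (purely illustrative; the structure layout and the *_dyn function names are assumptions, not code from this thread):

    struct lglock {
            arch_spinlock_t __percpu *lock;
    };

    /* Dynamic initialisation: per-cpu storage comes from alloc_percpu(),
     * so the lock could live inside another dynamically allocated object. */
    static int lg_lock_init_dyn(struct lglock *lg)
    {
            int i;

            lg->lock = alloc_percpu(arch_spinlock_t);
            if (!lg->lock)
                    return -ENOMEM;
            for_each_possible_cpu(i)
                    *per_cpu_ptr(lg->lock, i) =
                            (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
            return 0;
    }

    /* The read side now dereferences a pointer into a dynamically
     * allocated per-cpu area instead of addressing a static per-cpu
     * symbol directly -- the extra overhead on the "local" side
     * mentioned above. */
    static void lg_local_lock_dyn(struct lglock *lg)
    {
            preempt_disable();
            arch_spin_lock(this_cpu_ptr(lg->lock));
    }

    static void lg_local_unlock_dyn(struct lglock *lg)
    {
            arch_spin_unlock(this_cpu_ptr(lg->lock));
            preempt_enable();
    }

The cleanup patch in the next message takes the same struct-based shape, but keeps statically defined per-cpu storage for the existing users and only opens the door to dynamic allocation later.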
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-22 7:02 ` Al Viro @ 2011-12-22 7:20 ` Andrew Morton 2011-12-22 8:08 ` Al Viro 2011-12-22 8:22 ` Andi Kleen 0 siblings, 2 replies; 37+ messages in thread From: Andrew Morton @ 2011-12-22 7:20 UTC (permalink / raw) To: Al Viro Cc: Srivatsa S. Bhat, mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, Maciej Rutecki, Andi Kleen On Thu, 22 Dec 2011 07:02:15 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote: > On Wed, Dec 21, 2011 at 02:12:29PM -0800, Andrew Morton wrote: > > off-topic, but the lockdep stuff in include/linux/lglock.h > > (LOCKDEP_INIT_MAP and DEFINE_LGLOCK_LOCKDEP) appears to be dead code. > > Um? See ..._lock_init(); it's used there. oops, I had Andi's patch applied. Wanna take a look at it while things are fresh in your mind? From: Andi Kleen <ak@linux.intel.com> Subject: brlocks/lglocks: clean up code lglocks and brlocks are currently generated with some complicated macros in lglock.h. But there's no reason I can see to not just use common utility functions that get pointers to the lglock. Since there are at least two users it makes sense to share this code in a library. This will also make it later possible to dynamically allocate lglocks. In general the users now look more like normal function calls with pointers, not magic macros. The patch is rather large because I move over all users in one go to keep it bisectable. This impacts the VFS somewhat in terms of lines changed. But no actual behaviour change. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- fs/dcache.c | 10 +- fs/file_table.c | 23 ++--- fs/internal.h | 3 fs/namei.c | 26 +++--- fs/namespace.c | 124 +++++++++++++++---------------- fs/pnode.c | 4 - include/linux/lglock.h | 154 ++++++--------------------------------- kernel/Makefile | 2 kernel/lglock.c | 101 +++++++++++++++++++++++++ 9 files changed, 223 insertions(+), 224 deletions(-) diff -puN fs/dcache.c~brlocks-lglocks-clean-up-code fs/dcache.c --- a/fs/dcache.c~brlocks-lglocks-clean-up-code +++ a/fs/dcache.c @@ -2454,7 +2454,7 @@ static int prepend_path(const struct pat bool slash = false; int error = 0; - br_read_lock(vfsmount_lock); + br_read_lock(&vfsmount_lock); while (dentry != root->dentry || vfsmnt != root->mnt) { struct dentry * parent; @@ -2485,7 +2485,7 @@ static int prepend_path(const struct pat error = prepend(buffer, buflen, "/", 1); out: - br_read_unlock(vfsmount_lock); + br_read_unlock(&vfsmount_lock); return error; global_root: @@ -2859,11 +2859,11 @@ int path_is_under(struct path *path1, st struct dentry *dentry = path1->dentry; int res; - br_read_lock(vfsmount_lock); + br_read_lock(&vfsmount_lock); if (mnt != path2->mnt) { for (;;) { if (mnt->mnt_parent == mnt) { - br_read_unlock(vfsmount_lock); + br_read_unlock(&vfsmount_lock); return 0; } if (mnt->mnt_parent == path2->mnt) @@ -2873,7 +2873,7 @@ int path_is_under(struct path *path1, st dentry = mnt->mnt_mountpoint; } res = is_subdir(dentry, path2->dentry); - br_read_unlock(vfsmount_lock); + br_read_unlock(&vfsmount_lock); return res; } EXPORT_SYMBOL(path_is_under); diff -puN fs/file_table.c~brlocks-lglocks-clean-up-code fs/file_table.c --- a/fs/file_table.c~brlocks-lglocks-clean-up-code +++ a/fs/file_table.c @@ -34,7 +34,6 @@ struct files_stat_struct files_stat = { .max_files = NR_FILE }; 
-DECLARE_LGLOCK(files_lglock); DEFINE_LGLOCK(files_lglock); /* SLAB cache for file structures */ @@ -422,9 +421,9 @@ static inline void __file_sb_list_add(st */ void file_sb_list_add(struct file *file, struct super_block *sb) { - lg_local_lock(files_lglock); + lg_local_lock(&files_lglock); __file_sb_list_add(file, sb); - lg_local_unlock(files_lglock); + lg_local_unlock(&files_lglock); } /** @@ -437,9 +436,9 @@ void file_sb_list_add(struct file *file, void file_sb_list_del(struct file *file) { if (!list_empty(&file->f_u.fu_list)) { - lg_local_lock_cpu(files_lglock, file_list_cpu(file)); + lg_local_lock_cpu(&files_lglock, file_list_cpu(file)); list_del_init(&file->f_u.fu_list); - lg_local_unlock_cpu(files_lglock, file_list_cpu(file)); + lg_local_unlock_cpu(&files_lglock, file_list_cpu(file)); } } @@ -478,7 +477,7 @@ int fs_may_remount_ro(struct super_block { struct file *file; /* Check that no files are currently opened for writing. */ - lg_global_lock(files_lglock); + lg_global_lock(&files_lglock); do_file_list_for_each_entry(sb, file) { struct inode *inode = file->f_path.dentry->d_inode; @@ -490,10 +489,10 @@ int fs_may_remount_ro(struct super_block if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE)) goto too_bad; } while_file_list_for_each_entry; - lg_global_unlock(files_lglock); + lg_global_unlock(&files_lglock); return 1; /* Tis' cool bro. */ too_bad: - lg_global_unlock(files_lglock); + lg_global_unlock(&files_lglock); return 0; } @@ -509,7 +508,7 @@ void mark_files_ro(struct super_block *s struct file *f; retry: - lg_global_lock(files_lglock); + lg_global_lock(&files_lglock); do_file_list_for_each_entry(sb, f) { struct vfsmount *mnt; if (!S_ISREG(f->f_path.dentry->d_inode->i_mode)) @@ -526,12 +525,12 @@ retry: file_release_write(f); mnt = mntget(f->f_path.mnt); /* This can sleep, so we can't hold the spinlock. 
*/ - lg_global_unlock(files_lglock); + lg_global_unlock(&files_lglock); mnt_drop_write(mnt); mntput(mnt); goto retry; } while_file_list_for_each_entry; - lg_global_unlock(files_lglock); + lg_global_unlock(&files_lglock); } void __init files_init(unsigned long mempages) @@ -549,6 +548,6 @@ void __init files_init(unsigned long mem n = (mempages * (PAGE_SIZE / 1024)) / 10; files_stat.max_files = max_t(unsigned long, n, NR_FILE); files_defer_init(); - lg_lock_init(files_lglock); + lg_lock_init(&files_lglock, "files_lglock"); percpu_counter_init(&nr_files, 0); } diff -puN fs/internal.h~brlocks-lglocks-clean-up-code fs/internal.h --- a/fs/internal.h~brlocks-lglocks-clean-up-code +++ a/fs/internal.h @@ -77,8 +77,7 @@ extern void mnt_make_shortterm(struct vf extern void __init mnt_init(void); -DECLARE_BRLOCK(vfsmount_lock); - +extern struct lglock vfsmount_lock; /* * fs_struct.c diff -puN fs/namei.c~brlocks-lglocks-clean-up-code fs/namei.c --- a/fs/namei.c~brlocks-lglocks-clean-up-code +++ a/fs/namei.c @@ -463,7 +463,7 @@ static int unlazy_walk(struct nameidata mntget(nd->path.mnt); rcu_read_unlock(); - br_read_unlock(vfsmount_lock); + br_read_unlock(&vfsmount_lock); nd->flags &= ~LOOKUP_RCU; return 0; @@ -521,14 +521,14 @@ static int complete_walk(struct nameidat if (unlikely(!__d_rcu_to_refcount(dentry, nd->seq))) { spin_unlock(&dentry->d_lock); rcu_read_unlock(); - br_read_unlock(vfsmount_lock); + br_read_unlock(&vfsmount_lock); return -ECHILD; } BUG_ON(nd->inode != dentry->d_inode); spin_unlock(&dentry->d_lock); mntget(nd->path.mnt); rcu_read_unlock(); - br_read_unlock(vfsmount_lock); + br_read_unlock(&vfsmount_lock); } if (likely(!(nd->flags & LOOKUP_JUMPED))) @@ -693,15 +693,15 @@ int follow_up(struct path *path) struct vfsmount *parent; struct dentry *mountpoint; - br_read_lock(vfsmount_lock); + br_read_lock(&vfsmount_lock); parent = path->mnt->mnt_parent; if (parent == path->mnt) { - br_read_unlock(vfsmount_lock); + br_read_unlock(&vfsmount_lock); return 0; } mntget(parent); mountpoint = dget(path->mnt->mnt_mountpoint); - br_read_unlock(vfsmount_lock); + br_read_unlock(&vfsmount_lock); dput(path->dentry); path->dentry = mountpoint; mntput(path->mnt); @@ -833,7 +833,7 @@ static int follow_managed(struct path *p /* Something is mounted on this dentry in another * namespace and/or whatever was mounted there in this * namespace got unmounted before we managed to get the - * vfsmount_lock */ + * &vfsmount_lock */ } /* Handle an automount point */ @@ -959,7 +959,7 @@ failed: if (!(nd->flags & LOOKUP_ROOT)) nd->root.mnt = NULL; rcu_read_unlock(); - br_read_unlock(vfsmount_lock); + br_read_unlock(&vfsmount_lock); return -ECHILD; } @@ -1253,7 +1253,7 @@ static void terminate_walk(struct nameid if (!(nd->flags & LOOKUP_ROOT)) nd->root.mnt = NULL; rcu_read_unlock(); - br_read_unlock(vfsmount_lock); + br_read_unlock(&vfsmount_lock); } } @@ -1487,7 +1487,7 @@ static int path_init(int dfd, const char nd->path = nd->root; nd->inode = inode; if (flags & LOOKUP_RCU) { - br_read_lock(vfsmount_lock); + br_read_lock(&vfsmount_lock); rcu_read_lock(); nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); } else { @@ -1500,7 +1500,7 @@ static int path_init(int dfd, const char if (*name=='/') { if (flags & LOOKUP_RCU) { - br_read_lock(vfsmount_lock); + br_read_lock(&vfsmount_lock); rcu_read_lock(); set_root_rcu(nd); } else { @@ -1513,7 +1513,7 @@ static int path_init(int dfd, const char struct fs_struct *fs = current->fs; unsigned seq; - br_read_lock(vfsmount_lock); + br_read_lock(&vfsmount_lock); 
rcu_read_lock(); do { @@ -1549,7 +1549,7 @@ static int path_init(int dfd, const char if (fput_needed) *fp = file; nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq); - br_read_lock(vfsmount_lock); + br_read_lock(&vfsmount_lock); rcu_read_lock(); } else { path_get(&file->f_path); diff -puN fs/namespace.c~brlocks-lglocks-clean-up-code fs/namespace.c --- a/fs/namespace.c~brlocks-lglocks-clean-up-code +++ a/fs/namespace.c @@ -419,7 +419,7 @@ static int mnt_make_readonly(struct vfsm { int ret = 0; - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); mnt->mnt_flags |= MNT_WRITE_HOLD; /* * After storing MNT_WRITE_HOLD, we'll read the counters. This store @@ -453,15 +453,15 @@ static int mnt_make_readonly(struct vfsm */ smp_wmb(); mnt->mnt_flags &= ~MNT_WRITE_HOLD; - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); return ret; } static void __mnt_unmake_readonly(struct vfsmount *mnt) { - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); mnt->mnt_flags &= ~MNT_READONLY; - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); } static void free_vfsmnt(struct vfsmount *mnt) @@ -508,10 +508,10 @@ struct vfsmount *lookup_mnt(struct path { struct vfsmount *child_mnt; - br_read_lock(vfsmount_lock); + br_read_lock(&vfsmount_lock); if ((child_mnt = __lookup_mnt(path->mnt, path->dentry, 1))) mntget(child_mnt); - br_read_unlock(vfsmount_lock); + br_read_unlock(&vfsmount_lock); return child_mnt; } @@ -776,34 +776,34 @@ static void mntput_no_expire(struct vfsm { put_again: #ifdef CONFIG_SMP - br_read_lock(vfsmount_lock); + br_read_lock(&vfsmount_lock); if (likely(atomic_read(&mnt->mnt_longterm))) { mnt_dec_count(mnt); - br_read_unlock(vfsmount_lock); + br_read_unlock(&vfsmount_lock); return; } - br_read_unlock(vfsmount_lock); + br_read_unlock(&vfsmount_lock); - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); mnt_dec_count(mnt); if (mnt_get_count(mnt)) { - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); return; } #else mnt_dec_count(mnt); if (likely(mnt_get_count(mnt))) return; - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); #endif if (unlikely(mnt->mnt_pinned)) { mnt_add_count(mnt, mnt->mnt_pinned + 1); mnt->mnt_pinned = 0; - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); acct_auto_close_mnt(mnt); goto put_again; } - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); mntfree(mnt); } @@ -828,20 +828,20 @@ EXPORT_SYMBOL(mntget); void mnt_pin(struct vfsmount *mnt) { - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); mnt->mnt_pinned++; - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); } EXPORT_SYMBOL(mnt_pin); void mnt_unpin(struct vfsmount *mnt) { - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); if (mnt->mnt_pinned) { mnt_inc_count(mnt); mnt->mnt_pinned--; } - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); } EXPORT_SYMBOL(mnt_unpin); @@ -931,12 +931,12 @@ int mnt_had_events(struct proc_mounts *p struct mnt_namespace *ns = p->ns; int res = 0; - br_read_lock(vfsmount_lock); + br_read_lock(&vfsmount_lock); if (p->m.poll_event != ns->event) { p->m.poll_event = ns->event; res = 1; } - br_read_unlock(vfsmount_lock); + br_read_unlock(&vfsmount_lock); return res; } @@ -1157,12 +1157,12 @@ int may_umount_tree(struct vfsmount *mnt struct vfsmount *p; /* write lock needed for mnt_get_count */ - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); for (p = mnt; p; p = next_mnt(p, mnt)) { actual_refs 
+= mnt_get_count(p); minimum_refs += 2; } - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); if (actual_refs > minimum_refs) return 0; @@ -1189,10 +1189,10 @@ int may_umount(struct vfsmount *mnt) { int ret = 1; down_read(&namespace_sem); - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); if (propagate_mount_busy(mnt, 2)) ret = 0; - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); up_read(&namespace_sem); return ret; } @@ -1209,13 +1209,13 @@ void release_mounts(struct list_head *he struct dentry *dentry; struct vfsmount *m; - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); dentry = mnt->mnt_mountpoint; m = mnt->mnt_parent; mnt->mnt_mountpoint = mnt->mnt_root; mnt->mnt_parent = mnt; m->mnt_ghosts--; - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); dput(dentry); mntput(m); } @@ -1281,12 +1281,12 @@ static int do_umount(struct vfsmount *mn * probably don't strictly need the lock here if we examined * all race cases, but it's a slowpath. */ - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); if (mnt_get_count(mnt) != 2) { - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); return -EBUSY; } - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); if (!xchg(&mnt->mnt_expiry_mark, 1)) return -EAGAIN; @@ -1328,7 +1328,7 @@ static int do_umount(struct vfsmount *mn } down_write(&namespace_sem); - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); event++; if (!(flags & MNT_DETACH)) @@ -1340,7 +1340,7 @@ static int do_umount(struct vfsmount *mn umount_tree(mnt, 1, &umount_list); retval = 0; } - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); up_write(&namespace_sem); release_mounts(&umount_list); return retval; @@ -1452,19 +1452,19 @@ struct vfsmount *copy_tree(struct vfsmou q = clone_mnt(p, p->mnt_root, flag); if (!q) goto Enomem; - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); list_add_tail(&q->mnt_list, &res->mnt_list); attach_mnt(q, &path); - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); } } return res; Enomem: if (res) { LIST_HEAD(umount_list); - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); umount_tree(res, 0, &umount_list); - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); release_mounts(&umount_list); } return NULL; @@ -1483,9 +1483,9 @@ void drop_collected_mounts(struct vfsmou { LIST_HEAD(umount_list); down_write(&namespace_sem); - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); umount_tree(mnt, 0, &umount_list); - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); up_write(&namespace_sem); release_mounts(&umount_list); } @@ -1613,7 +1613,7 @@ static int attach_recursive_mnt(struct v if (err) goto out_cleanup_ids; - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); if (IS_MNT_SHARED(dest_mnt)) { for (p = source_mnt; p; p = next_mnt(p, source_mnt)) @@ -1632,7 +1632,7 @@ static int attach_recursive_mnt(struct v list_del_init(&child->mnt_hash); commit_tree(child); } - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); return 0; @@ -1729,10 +1729,10 @@ static int do_change_type(struct path *p goto out_unlock; } - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); for (m = mnt; m; m = (recurse ? 
next_mnt(m, mnt) : NULL)) change_mnt_propagation(m, type); - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); out_unlock: up_write(&namespace_sem); @@ -1779,9 +1779,9 @@ static int do_loopback(struct path *path err = graft_tree(mnt, path); if (err) { - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); umount_tree(mnt, 0, &umount_list); - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); } out2: unlock_mount(path); @@ -1838,16 +1838,16 @@ static int do_remount(struct path *path, else err = do_remount_sb(sb, flags, data, 0); if (!err) { - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); mnt_flags |= path->mnt->mnt_flags & MNT_PROPAGATION_MASK; path->mnt->mnt_flags = mnt_flags; - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); } up_write(&sb->s_umount); if (!err) { - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); touch_mnt_namespace(path->mnt->mnt_ns); - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); } return err; } @@ -2052,9 +2052,9 @@ fail: /* remove m from any expiration list it may be on */ if (!list_empty(&m->mnt_expire)) { down_write(&namespace_sem); - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); list_del_init(&m->mnt_expire); - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); up_write(&namespace_sem); } mntput(m); @@ -2070,11 +2070,11 @@ fail: void mnt_set_expiry(struct vfsmount *mnt, struct list_head *expiry_list) { down_write(&namespace_sem); - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); list_add_tail(&mnt->mnt_expire, expiry_list); - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); up_write(&namespace_sem); } EXPORT_SYMBOL(mnt_set_expiry); @@ -2094,7 +2094,7 @@ void mark_mounts_for_expiry(struct list_ return; down_write(&namespace_sem); - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); /* extract from the expiration list every vfsmount that matches the * following criteria: @@ -2113,7 +2113,7 @@ void mark_mounts_for_expiry(struct list_ touch_mnt_namespace(mnt->mnt_ns); umount_tree(mnt, 1, &umounts); } - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); up_write(&namespace_sem); release_mounts(&umounts); @@ -2376,9 +2376,9 @@ void mnt_make_shortterm(struct vfsmount #ifdef CONFIG_SMP if (atomic_add_unless(&mnt->mnt_longterm, -1, 1)) return; - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); atomic_dec(&mnt->mnt_longterm); - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); #endif } @@ -2406,9 +2406,9 @@ static struct mnt_namespace *dup_mnt_ns( kfree(new_ns); return ERR_PTR(-ENOMEM); } - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); list_add_tail(&new_ns->list, &new_ns->root->mnt_list); - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); /* * Second pass: switch the tsk->fs->* elements and mark new vfsmounts @@ -2647,7 +2647,7 @@ SYSCALL_DEFINE2(pivot_root, const char _ goto out4; } else if (!is_subdir(old.dentry, new.dentry)) goto out4; - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); detach_mnt(new.mnt, &parent_path); detach_mnt(root.mnt, &root_parent); /* mount old root on put_old */ @@ -2655,7 +2655,7 @@ SYSCALL_DEFINE2(pivot_root, const char _ /* mount new_root on / */ attach_mnt(new.mnt, &root_parent); touch_mnt_namespace(current->nsproxy->mnt_ns); - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); chroot_fs_refs(&root, &new); error = 0; out4: @@ -2718,7 +2718,7 
@@ void __init mnt_init(void) for (u = 0; u < HASH_SIZE; u++) INIT_LIST_HEAD(&mount_hashtable[u]); - br_lock_init(vfsmount_lock); + br_lock_init(&vfsmount_lock); err = sysfs_init(); if (err) @@ -2738,9 +2738,9 @@ void put_mnt_ns(struct mnt_namespace *ns if (!atomic_dec_and_test(&ns->count)) return; down_write(&namespace_sem); - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); umount_tree(ns->root, 0, &umount_list); - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); up_write(&namespace_sem); release_mounts(&umount_list); kfree(ns); diff -puN fs/pnode.c~brlocks-lglocks-clean-up-code fs/pnode.c --- a/fs/pnode.c~brlocks-lglocks-clean-up-code +++ a/fs/pnode.c @@ -273,12 +273,12 @@ int propagate_mnt(struct vfsmount *dest_ prev_src_mnt = child; } out: - br_write_lock(vfsmount_lock); + br_write_lock(&vfsmount_lock); while (!list_empty(&tmp_list)) { child = list_first_entry(&tmp_list, struct vfsmount, mnt_hash); umount_tree(child, 0, &umount_list); } - br_write_unlock(vfsmount_lock); + br_write_unlock(&vfsmount_lock); release_mounts(&umount_list); return ret; } diff -puN include/linux/lglock.h~brlocks-lglocks-clean-up-code include/linux/lglock.h --- a/include/linux/lglock.h~brlocks-lglocks-clean-up-code +++ a/include/linux/lglock.h @@ -24,26 +24,14 @@ #include <linux/percpu.h> /* can make br locks by using local lock for read side, global lock for write */ -#define br_lock_init(name) name##_lock_init() -#define br_read_lock(name) name##_local_lock() -#define br_read_unlock(name) name##_local_unlock() -#define br_write_lock(name) name##_global_lock_online() -#define br_write_unlock(name) name##_global_unlock_online() +#define br_lock_init(name) lg_lock_init(name, #name) +#define br_read_lock(name) lg_local_lock(name) +#define br_read_unlock(name) lg_local_unlock(name) +#define br_write_lock(name) lg_global_lock_online(name) +#define br_write_unlock(name) lg_global_unlock_online(name) -#define DECLARE_BRLOCK(name) DECLARE_LGLOCK(name) #define DEFINE_BRLOCK(name) DEFINE_LGLOCK(name) - -#define lg_lock_init(name) name##_lock_init() -#define lg_local_lock(name) name##_local_lock() -#define lg_local_unlock(name) name##_local_unlock() -#define lg_local_lock_cpu(name, cpu) name##_local_lock_cpu(cpu) -#define lg_local_unlock_cpu(name, cpu) name##_local_unlock_cpu(cpu) -#define lg_global_lock(name) name##_global_lock() -#define lg_global_unlock(name) name##_global_unlock() -#define lg_global_lock_online(name) name##_global_lock_online() -#define lg_global_unlock_online(name) name##_global_unlock_online() - #ifdef CONFIG_DEBUG_LOCK_ALLOC #define LOCKDEP_INIT_MAP lockdep_init_map @@ -58,115 +46,27 @@ #define DEFINE_LGLOCK_LOCKDEP(name) #endif +struct lglock { + arch_spinlock_t __percpu *lock; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lock_class_key lock_key; + struct lockdep_map lock_dep_map; +#endif +}; + +#define DEFINE_LGLOCK(name) \ + DEFINE_PER_CPU(arch_spinlock_t, name ## _lock) = __ARCH_SPIN_LOCK_UNLOCKED; \ + struct lglock name = { .lock = &name ## _lock } + +/* Only valid for statics */ +void lg_lock_init(struct lglock *lg, char *name); +void lg_local_lock(struct lglock *lg); +void lg_local_unlock(struct lglock *lg); +void lg_local_lock_cpu(struct lglock *lg, int cpu); +void lg_local_unlock_cpu(struct lglock *lg, int cpu); +void lg_global_lock_online(struct lglock *lg); +void lg_global_unlock_online(struct lglock *lg); +void lg_global_lock(struct lglock *lg); +void lg_global_unlock(struct lglock *lg); -#define DECLARE_LGLOCK(name) \ - extern void name##_lock_init(void); 
\ - extern void name##_local_lock(void); \ - extern void name##_local_unlock(void); \ - extern void name##_local_lock_cpu(int cpu); \ - extern void name##_local_unlock_cpu(int cpu); \ - extern void name##_global_lock(void); \ - extern void name##_global_unlock(void); \ - extern void name##_global_lock_online(void); \ - extern void name##_global_unlock_online(void); \ - -#define DEFINE_LGLOCK(name) \ - \ - DEFINE_PER_CPU(arch_spinlock_t, name##_lock); \ - DEFINE_LGLOCK_LOCKDEP(name); \ - \ - void name##_lock_init(void) { \ - int i; \ - LOCKDEP_INIT_MAP(&name##_lock_dep_map, #name, &name##_lock_key, 0); \ - for_each_possible_cpu(i) { \ - arch_spinlock_t *lock; \ - lock = &per_cpu(name##_lock, i); \ - *lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; \ - } \ - } \ - EXPORT_SYMBOL(name##_lock_init); \ - \ - void name##_local_lock(void) { \ - arch_spinlock_t *lock; \ - preempt_disable(); \ - rwlock_acquire_read(&name##_lock_dep_map, 0, 0, _THIS_IP_); \ - lock = &__get_cpu_var(name##_lock); \ - arch_spin_lock(lock); \ - } \ - EXPORT_SYMBOL(name##_local_lock); \ - \ - void name##_local_unlock(void) { \ - arch_spinlock_t *lock; \ - rwlock_release(&name##_lock_dep_map, 1, _THIS_IP_); \ - lock = &__get_cpu_var(name##_lock); \ - arch_spin_unlock(lock); \ - preempt_enable(); \ - } \ - EXPORT_SYMBOL(name##_local_unlock); \ - \ - void name##_local_lock_cpu(int cpu) { \ - arch_spinlock_t *lock; \ - preempt_disable(); \ - rwlock_acquire_read(&name##_lock_dep_map, 0, 0, _THIS_IP_); \ - lock = &per_cpu(name##_lock, cpu); \ - arch_spin_lock(lock); \ - } \ - EXPORT_SYMBOL(name##_local_lock_cpu); \ - \ - void name##_local_unlock_cpu(int cpu) { \ - arch_spinlock_t *lock; \ - rwlock_release(&name##_lock_dep_map, 1, _THIS_IP_); \ - lock = &per_cpu(name##_lock, cpu); \ - arch_spin_unlock(lock); \ - preempt_enable(); \ - } \ - EXPORT_SYMBOL(name##_local_unlock_cpu); \ - \ - void name##_global_lock_online(void) { \ - int i; \ - preempt_disable(); \ - rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_); \ - for_each_online_cpu(i) { \ - arch_spinlock_t *lock; \ - lock = &per_cpu(name##_lock, i); \ - arch_spin_lock(lock); \ - } \ - } \ - EXPORT_SYMBOL(name##_global_lock_online); \ - \ - void name##_global_unlock_online(void) { \ - int i; \ - rwlock_release(&name##_lock_dep_map, 1, _RET_IP_); \ - for_each_online_cpu(i) { \ - arch_spinlock_t *lock; \ - lock = &per_cpu(name##_lock, i); \ - arch_spin_unlock(lock); \ - } \ - preempt_enable(); \ - } \ - EXPORT_SYMBOL(name##_global_unlock_online); \ - \ - void name##_global_lock(void) { \ - int i; \ - preempt_disable(); \ - rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_); \ - for_each_possible_cpu(i) { \ - arch_spinlock_t *lock; \ - lock = &per_cpu(name##_lock, i); \ - arch_spin_lock(lock); \ - } \ - } \ - EXPORT_SYMBOL(name##_global_lock); \ - \ - void name##_global_unlock(void) { \ - int i; \ - rwlock_release(&name##_lock_dep_map, 1, _RET_IP_); \ - for_each_possible_cpu(i) { \ - arch_spinlock_t *lock; \ - lock = &per_cpu(name##_lock, i); \ - arch_spin_unlock(lock); \ - } \ - preempt_enable(); \ - } \ - EXPORT_SYMBOL(name##_global_unlock); #endif diff -puN kernel/Makefile~brlocks-lglocks-clean-up-code kernel/Makefile --- a/kernel/Makefile~brlocks-lglocks-clean-up-code +++ a/kernel/Makefile @@ -6,7 +6,7 @@ obj-y = fork.o exec_domain.o panic.o cpu.o exit.o itimer.o time.o softirq.o resource.o \ sysctl.o sysctl_binary.o capability.o ptrace.o timer.o user.o \ signal.o sys.o kmod.o workqueue.o pid.o \ - rcupdate.o extable.o params.o posix-timers.o \ + rcupdate.o extable.o 
params.o posix-timers.o lglock.o \ kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \ hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \ notifier.o ksysfs.o cred.o \ diff -puN /dev/null kernel/lglock.c --- /dev/null +++ a/kernel/lglock.c @@ -0,0 +1,101 @@ +/* See include/linux/lglock.h for description */ +#include <linux/module.h> +#include <linux/lglock.h> + +void lg_lock_init(struct lglock *lg, char *name) +{ + LOCKDEP_INIT_MAP(&lg->lock_dep_map, name, &lg->lock_key, 0); +} +EXPORT_SYMBOL(lg_lock_init); + +void lg_local_lock(struct lglock *lg) +{ + arch_spinlock_t *lock; + preempt_disable(); + rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_); + lock = this_cpu_ptr(lg->lock); + arch_spin_lock(lock); +} +EXPORT_SYMBOL(lg_local_lock); + +void lg_local_unlock(struct lglock *lg) +{ + arch_spinlock_t *lock; + rwlock_release(&lg->lock_dep_map, 1, _RET_IP_); + lock = this_cpu_ptr(lg->lock); + arch_spin_unlock(lock); + preempt_enable(); +} +EXPORT_SYMBOL(lg_local_unlock); + +void lg_local_lock_cpu(struct lglock *lg, int cpu) +{ + arch_spinlock_t *lock; + preempt_disable(); + rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_); + lock = per_cpu_ptr(lg->lock, cpu); + arch_spin_lock(lock); +} +EXPORT_SYMBOL(lg_local_lock_cpu); + +void lg_local_unlock_cpu(struct lglock *lg, int cpu) +{ + arch_spinlock_t *lock; + rwlock_release(&lg->lock_dep_map, 1, _RET_IP_); + lock = per_cpu_ptr(lg->lock, cpu); + arch_spin_unlock(lock); + preempt_enable(); +} +EXPORT_SYMBOL(lg_local_unlock_cpu); + +void lg_global_lock_online(struct lglock *lg) +{ + int i; + preempt_disable(); + rwlock_acquire(&lg->lock_dep_map, 0, 0, _RET_IP_); + for_each_online_cpu(i) { + arch_spinlock_t *lock; + lock = per_cpu_ptr(lg->lock, i); + arch_spin_lock(lock); + } +} +EXPORT_SYMBOL(lg_global_lock_online); + +void lg_global_unlock_online(struct lglock *lg) +{ + int i; + rwlock_release(&lg->lock_dep_map, 1, _RET_IP_); + for_each_online_cpu(i) { + arch_spinlock_t *lock; + lock = per_cpu_ptr(lg->lock, i); + arch_spin_unlock(lock); + } + preempt_enable(); +} +EXPORT_SYMBOL(lg_global_unlock_online); + +void lg_global_lock(struct lglock *lg) +{ + int i; + preempt_disable(); + rwlock_acquire(&lg->lock_dep_map, 0, 0, _RET_IP_); + for_each_possible_cpu(i) { + arch_spinlock_t *lock; + lock = per_cpu_ptr(lg->lock, i); + arch_spin_lock(lock); + } +} +EXPORT_SYMBOL(lg_global_lock); + +void lg_global_unlock(struct lglock *lg) +{ + int i; + rwlock_release(&lg->lock_dep_map, 1, _RET_IP_); + for_each_possible_cpu(i) { + arch_spinlock_t *lock; + lock = per_cpu_ptr(lg->lock, i); + arch_spin_unlock(lock); + } + preempt_enable(); +} +EXPORT_SYMBOL(lg_global_unlock); _ ^ permalink raw reply [flat|nested] 37+ messages in thread
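[With the cleanup above, the br_* wrappers simply forward to the lg_* functions and take a pointer to the struct lglock, which is what drives the vfsmount_lock -> &vfsmount_lock churn in fs/namespace.c. Below is a minimal usage sketch of the resulting interface; the lock name and surrounding functions are made up for illustration, only the brlock/lglock calls follow the patch.]

#include <linux/lglock.h>

/*
 * DEFINE_BRLOCK() now expands to DEFINE_LGLOCK(): a per-CPU
 * arch_spinlock_t plus a struct lglock that points at it.
 */
DEFINE_BRLOCK(my_files_lock);

static void my_subsys_init(void)
{
	br_lock_init(&my_files_lock);	/* lg_lock_init(): sets up the lockdep map */
}

static void my_read_side(void)
{
	br_read_lock(&my_files_lock);	/* lg_local_lock(): this CPU's lock only */
	/* ... read the shared state ... */
	br_read_unlock(&my_files_lock);
}

static void my_write_side(void)
{
	br_write_lock(&my_files_lock);	/* lg_global_lock_online(): every covered CPU in turn */
	/* ... modify the shared state ... */
	br_write_unlock(&my_files_lock);
}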
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-22 7:20 ` Andrew Morton @ 2011-12-22 8:08 ` Al Viro 2011-12-22 8:17 ` Andi Kleen 2011-12-22 8:22 ` Andi Kleen 1 sibling, 1 reply; 37+ messages in thread From: Al Viro @ 2011-12-22 8:08 UTC (permalink / raw) To: Andrew Morton Cc: Srivatsa S. Bhat, mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, Maciej Rutecki, Andi Kleen On Wed, Dec 21, 2011 at 11:20:47PM -0800, Andrew Morton wrote: > On Thu, 22 Dec 2011 07:02:15 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote: > > > On Wed, Dec 21, 2011 at 02:12:29PM -0800, Andrew Morton wrote: > > > off-topic, but the lockdep stuff in include/linux/lglock.h > > > (LOCKDEP_INIT_MAP and DEFINE_LGLOCK_LOCKDEP) appears to be dead code. > > > > Um? See ..._lock_init(); it's used there. > > oops, I had Andi's patch applied. > > Wanna take a look at it while things are fresh in your mind? a) tons of trivial conflicts with fs/namespace.c changes in my tree b) more seriously, the question of overhead - see the mail you replied to. I really, really don't like templates, so in general I would prefer to do something fairly close to what Andi seems to have done in that patch. However, we *are* talking about some fairly hot paths and I'm really not comfortable with doing that kind of changes that late in the cycle. If somebody cares to take Andi's patch and see what it does to overhead, I'd love to see the results; we might do that in the next merge window. Again, the idea itself is obviously OK; the only problem is with the performance issues. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-22 8:08 ` Al Viro @ 2011-12-22 8:17 ` Andi Kleen 2011-12-22 8:39 ` Al Viro 0 siblings, 1 reply; 37+ messages in thread From: Andi Kleen @ 2011-12-22 8:17 UTC (permalink / raw) To: Al Viro Cc: Andrew Morton, Srivatsa S. Bhat, mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, Maciej Rutecki, Andi Kleen On Thu, Dec 22, 2011 at 08:08:56AM +0000, Al Viro wrote: > On Wed, Dec 21, 2011 at 11:20:47PM -0800, Andrew Morton wrote: > > On Thu, 22 Dec 2011 07:02:15 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote: > > > > > On Wed, Dec 21, 2011 at 02:12:29PM -0800, Andrew Morton wrote: > > > > off-topic, but the lockdep stuff in include/linux/lglock.h > > > > (LOCKDEP_INIT_MAP and DEFINE_LGLOCK_LOCKDEP) appears to be dead code. > > > > > > Um? See ..._lock_init(); it's used there. > > > > oops, I had Andi's patch applied. > > > > Wanna take a look at it while things are fresh in your mind? > > a) tons of trivial conflicts with fs/namespace.c changes in my tree > b) more seriously, the question of overhead - see the mail you replied > to. > The costly operations here are the atomics and nothing really changes for them. So I don't expect any measurable difference. I actually have an idea to remove them for the common case, but not in that patchkit or cycle :) I can run a ftrace if you want, but I expect any difference to be below the measurement inaccuracy. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-22 8:17 ` Andi Kleen @ 2011-12-22 8:39 ` Al Viro 0 siblings, 0 replies; 37+ messages in thread From: Al Viro @ 2011-12-22 8:39 UTC (permalink / raw) To: Andi Kleen Cc: Andrew Morton, Srivatsa S. Bhat, mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, Maciej Rutecki On Thu, Dec 22, 2011 at 09:17:57AM +0100, Andi Kleen wrote: > The costly operations here are the atomics and nothing really changes > for them. So I don't expect any measurable difference. > > I actually have an idea to remove them for the common case, but not in > that patchkit or cycle :) > > I can run a ftrace if you want, but I expect any difference to be below > the measurement inaccuracy. What I'm concerned with is not the cost of extra dereference per se; it's more about cacheline bouncing - note that with this fix in place you'll end up with spinlock in the same cacheline as the pointer to per-cpu stuff. Hell knows; it might not matter at all, since we take it only for br_write_lock() (and definitely rare CPU up/down), but still, I'd like to see the data. In any case, that's not -stable material. The race fix, OTOH, is... ^ permalink raw reply [flat|nested] 37+ messages in thread
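[If the cacheline sharing Al is worried about did show up in the numbers, one purely illustrative mitigation would be to pad the writer-side state away from the read-mostly per-cpu pointer. The sketch below assumes the struct-based lglock plus the proposed online-mask fix; every field other than the percpu pointer and the lockdep members is hypothetical.]

#include <linux/cache.h>
#include <linux/cpumask.h>
#include <linux/lockdep.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

struct lglock {
	/* The read-side fast path only dereferences this; keep it on its own cacheline. */
	arch_spinlock_t __percpu *lock ____cacheline_aligned_in_smp;

	/* Writer-side state, bounced between CPUs on br_write_lock() and CPU hotplug. */
	spinlock_t	cpu_lock ____cacheline_aligned_in_smp;
	cpumask_t	cpus;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lock_class_key	lock_key;
	struct lockdep_map	lock_dep_map;
#endif
};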
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-22 7:20 ` Andrew Morton 2011-12-22 8:08 ` Al Viro @ 2011-12-22 8:22 ` Andi Kleen 1 sibling, 0 replies; 37+ messages in thread From: Andi Kleen @ 2011-12-22 8:22 UTC (permalink / raw) To: Andrew Morton Cc: Al Viro, Srivatsa S. Bhat, mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, Maciej Rutecki, Andi Kleen On Wed, Dec 21, 2011 at 11:20:47PM -0800, Andrew Morton wrote: > On Thu, 22 Dec 2011 07:02:15 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote: > > > On Wed, Dec 21, 2011 at 02:12:29PM -0800, Andrew Morton wrote: > > > off-topic, but the lockdep stuff in include/linux/lglock.h > > > (LOCKDEP_INIT_MAP and DEFINE_LGLOCK_LOCKDEP) appears to be dead code. > > > > Um? See ..._lock_init(); it's used there. > > oops, I had Andi's patch applied. > > Wanna take a look at it while things are fresh in your mind? Double checked: DEFINE_LGLOCK_LOCKDEP is dead now and can be removed. The stuff it declared is in the struct lglock now. I can send a patch for that tomorrow if you haven't done it already. LOCKDEP_INIT_MAP should be still used in lg_lock_init() and that is called -Andi ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-20 6:27 ` Al Viro 2011-12-20 7:28 ` Srivatsa S. Bhat @ 2011-12-20 7:30 ` mengcong 2011-12-20 7:37 ` Srivatsa S. Bhat 1 sibling, 1 reply; 37+ messages in thread From: mengcong @ 2011-12-20 7:30 UTC (permalink / raw) To: Al Viro Cc: Srivatsa S. Bhat, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On Tue, 2011-12-20 at 06:27 +0000, Al Viro wrote: > On Tue, Dec 20, 2011 at 10:26:05AM +0530, Srivatsa S. Bhat wrote: > > Oh, right, that has to be handled as well... > > > > Hmmm... How about registering a CPU hotplug notifier callback during lock init > > time, and then for every cpu that gets onlined (after we took a copy of the > > cpu_online_mask to work with), we see if that cpu is different from the ones > > we have already locked, and if it is, we lock it in the callback handler and > > update the locked_cpu_mask appropriately (so that we release the locks properly > > during the unlock operation). > > > > Handling the newly introduced race between the callback handler and lock-unlock > > code must not be difficult, I believe.. > > > > Any loopholes in this approach? Or is the additional complexity just not worth > > it here? > > To summarize the modified variant of that approach hashed out on IRC: > On which IRC do you discuss? > * lglock grows three extra things: spinlock, cpu bitmap and cpu hotplug > notifier. > * foo_global_lock_online starts with grabbing that spinlock and > loops over the cpus in that bitmap. > * foo_global_unlock_online loops over the same bitmap and then drops > that spinlock > * callback of the notifier is going to do all bitmap updates. Under > that spinlock. Events that need handling definitely include the things like > "was going up but failed", since we need the bitmap to contain all online CPUs > at all time, preferably without too much junk beyond that. IOW, we need to add > it there _before_ low-level __cpu_up() calls set_cpu_online(). Which means > that we want to clean up on failed attempt to up it. Taking a CPU down is > probably less PITA; just clear bit on the final "the sucker's dead" event. > * bitmap is initialized once, at the same time we set the notifier > up. Just grab the spinlock and do > for_each_online_cpu(N) > add N to bitmap > then release the spinlock and let the callbacks handle all updates. > > I think that'll work with relatively little pain, but I'm not familiar enough > with the cpuhotplug notifiers, so I'd rather have the folks familiar with those > to supply the set of events to watch for... > ^ permalink raw reply [flat|nested] 37+ messages in thread
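[For reference, a rough sketch of the spinlock + cpu-bitmap + notifier scheme summarized above, written against the CPU-hotplug notifier API of that era. The field and function names are illustrative and not taken from any posted patch.]

#include <linux/cpu.h>
#include <linux/cpumask.h>
#include <linux/notifier.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

struct lglock {
	arch_spinlock_t __percpu *lock;
	spinlock_t		cpu_lock;	/* protects "cpus" and serializes writers */
	cpumask_t		cpus;		/* CPUs whose per-cpu lock must be taken */
	struct notifier_block	cpu_notifier;
};

static int lg_cpu_callback(struct notifier_block *nb, unsigned long action,
			   void *hcpu)
{
	struct lglock *lg = container_of(nb, struct lglock, cpu_notifier);
	int cpu = (unsigned long)hcpu;

	spin_lock(&lg->cpu_lock);
	switch (action & ~CPU_TASKS_FROZEN) {
	case CPU_UP_PREPARE:
		/* Added before set_cpu_online(), as required above. */
		cpumask_set_cpu(cpu, &lg->cpus);
		break;
	case CPU_UP_CANCELED:	/* bringup failed: clean up again */
	case CPU_DEAD:		/* the CPU is really gone */
		cpumask_clear_cpu(cpu, &lg->cpus);
		break;
	}
	spin_unlock(&lg->cpu_lock);
	return NOTIFY_OK;
}

void lg_global_lock_online(struct lglock *lg)
{
	int i;

	spin_lock(&lg->cpu_lock);	/* also holds off bitmap updates */
	for_each_cpu(i, &lg->cpus)
		arch_spin_lock(per_cpu_ptr(lg->lock, i));
}

void lg_global_unlock_online(struct lglock *lg)
{
	int i;

	for_each_cpu(i, &lg->cpus)
		arch_spin_unlock(per_cpu_ptr(lg->lock, i));
	spin_unlock(&lg->cpu_lock);
}

[Initialization would, as described, seed the mask with for_each_online_cpu() under the same spinlock and then register the notifier, so that all later updates go through the callback.]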
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-20 7:30 ` mengcong @ 2011-12-20 7:37 ` Srivatsa S. Bhat 0 siblings, 0 replies; 37+ messages in thread From: Srivatsa S. Bhat @ 2011-12-20 7:37 UTC (permalink / raw) To: mc Cc: Al Viro, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin, david, akpm@linux-foundation.org, Maciej Rutecki On 12/20/2011 01:00 PM, mengcong wrote: > On Tue, 2011-12-20 at 06:27 +0000, Al Viro wrote: >> On Tue, Dec 20, 2011 at 10:26:05AM +0530, Srivatsa S. Bhat wrote: >>> Oh, right, that has to be handled as well... >>> >>> Hmmm... How about registering a CPU hotplug notifier callback during lock init >>> time, and then for every cpu that gets onlined (after we took a copy of the >>> cpu_online_mask to work with), we see if that cpu is different from the ones >>> we have already locked, and if it is, we lock it in the callback handler and >>> update the locked_cpu_mask appropriately (so that we release the locks properly >>> during the unlock operation). >>> >>> Handling the newly introduced race between the callback handler and lock-unlock >>> code must not be difficult, I believe.. >>> >>> Any loopholes in this approach? Or is the additional complexity just not worth >>> it here? >> >> To summarize the modified variant of that approach hashed out on IRC: >> > On which IRC do you discuss? #kernel on tinc.sekrit.org :-) Regards, Srivatsa S. Bhat ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-19 12:11 ` Al Viro 2011-12-19 20:23 ` Srivatsa S. Bhat @ 2011-12-19 23:56 ` Dave Chinner 2011-12-20 4:05 ` Al Viro 1 sibling, 1 reply; 37+ messages in thread From: Dave Chinner @ 2011-12-19 23:56 UTC (permalink / raw) To: Al Viro Cc: Srivatsa S. Bhat, Stephen Boyd, mc, linux-kernel, linux-fsdevel, Nick Piggin, akpm@linux-foundation.org, Maciej Rutecki On Mon, Dec 19, 2011 at 12:11:00PM +0000, Al Viro wrote: > On Mon, Dec 19, 2011 at 04:33:47PM +0530, Srivatsa S. Bhat wrote: > > > IMHO, we don't need to be concerned here because, {get,put}_online_cpus() > > implement a refcounting solution, and they don't really serialize stuff > > unnecessarily. The readers (those who prevent cpu hotplug, such as this lock- > > unlock code) are fast and can be concurrent, while the writers (the task that > > is doing the cpu hotplug) waits till all existing readers are gone/done with > > their work. > > > > So, since we are readers here, IMO, we don't have to worry about performance. > > (I know that we get serialized just for a moment while incrementing the > > refcount, but that should not be worrisome right?) > > > > Moreover, using for_each_online_cpu() without using {get,put}_online_cpus() > > around that, is plain wrong, because of the unhandled race with cpu hotplug. > > IOW, our primary concern here is functionality, isn't it? > > > > To summarize, in the current design of these VFS locks, using > > {get,put}_online_cpus() is *essential* to fix a functionality-related bug, > > (and not so bad performance-wise as well). > > > > The following patch (v2) incorporates your comments: > > I really don't like that. Amount of contention is not a big issue, but the > fact that now br_write_lock(vfsmount_lock) became blocking is really nasty. > Moreover, we suddenly get cpu_hotplug.lock nested inside namespace_sem... > BTW, it's seriously blocking - if nothing else, it waits for cpu_down() > in progress to complete. Which can involve any number of interesting > locks taken by notifiers. > > Dave's variant is also no good; consider this: > CPU1: br_write_lock(); spinlocks grabbed > CPU2: br_read_lock(); spinning on one of them > CPU3: try to take CPU2 down. We *can't* proceed to the end, notifiers or no > notifiers, until CPU2 gets through the critical area. Which can't happen > until the spinlock is unlocked, i.e. until CPU1 does br_write_unlock(). > Notifier can't silently do spin_unlock() here or we'll get CPU2 free to go > into the critical area when it's really not safe there. Yeah, XFS has some, er, complexity to handle this. Basically, it has global state (the on-disk superblock) that the per-cpu counters synchronised back to every so often and hence the per-cpu counters can be switched on and off. There's also a global state lock that is held through a counter modification slow path and during notifier operations and the combination of these is used to avoid such race conditions. Hence when a cpu dies, we do: case CPU_DEAD: case CPU_DEAD_FROZEN: /* Disable all the counters, then fold the dead cpu's * count into the total on the global superblock and * re-enable the counters. */ xfs_icsb_lock(mp); spin_lock(&mp->m_sb_lock); xfs_icsb_disable_counter(mp, XFS_SBS_ICOUNT); xfs_icsb_disable_counter(mp, XFS_SBS_IFREE); xfs_icsb_disable_counter(mp, XFS_SBS_FDBLOCKS); .... Which is basically: 1. take global counter state modification mutex 2. 
take in-core superblock lock (global in-core fs state that the per-cpu counters sync to) 3. disable each online per-cpu counter a. lock all online per-cpu locks for the counter b. clear counter enabled bit c. unlock all online per-cpu locks And when the counter is re-enabled after the cleanup of the per-cpu counter state on the dead CPU, it does it via a rebalancing operation: 1. disable each online per-cpu counter a. lock all online per-cpu locks for the counter b. clear counter enabled bit c. unlock all online per-cpu locks 2. balance counter across all online per-cpu structures 3. enable each online per-cpu counter: a. lock all online per-cpu locks for the counter b. set counter enabled bit c. unlock all online per-cpu locks 4. drop in-core superblock lock 5. drop global counter state modification mutex Hence, in the situation you describe above, if CPU 2 gets the lock before the notifier, all is well. In the case it doesn't, it gets blocked like this: preempt_disable() if (counter disabled) goto slow path lock_local_counter() <<<< spin here cpu notifier disables counter and unlocks it. We get the lock if (counter disabled) { unlock_local_counter() goto slow path } ..... slow_path: preempt_enable() xfs_icsb_lock(mp) <<<< serialises on global notifier lock not on any of the spin locks Like I said, there's quite a bit of complexity in all this to handle the cpu notifiers in (what I think is) a race-free manner. I've been looking at replacing all this complexity (it's close to 1000 lines of code) with the generic per-cpu counters, but that's got its own problems that involve adding lots of complexity.... > That got one hell of a deadlock potential ;-/ So far I'm more or less > in favor of doing get_online_cpus() explicitly in fs/namespace.c, outside > of namespace_sem. But I still have not convinced myself that it's > really safe ;-/ Agreed, it looks like a lot simpler solution to this problem than a notifier. But I don't think I know enough about the usage context to determine if it is safe, either, so I can't really help you there. :/ Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 37+ messages in thread
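[The fast-path/slow-path dance Dave spells out above can be written as a generic sketch to make the recheck-after-locking step explicit. This is not XFS code; all names below are invented.]

#include <linux/mutex.h>
#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct pcp_counter {
	spinlock_t	lock;
	bool		enabled;	/* cleared/set by the hotplug notifier path */
	long		count;
};

struct counter {
	struct pcp_counter __percpu *pcp;
	struct mutex	global_lock;	/* plays the "xfs_icsb_lock" role */
	long		global_count;
};

static void counter_add(struct counter *c, long delta)
{
	struct pcp_counter *p;

	preempt_disable();
	p = this_cpu_ptr(c->pcp);
	if (!p->enabled)
		goto slow_path;
	spin_lock(&p->lock);
	if (!p->enabled) {
		/* The notifier disabled us while we were spinning; fall back. */
		spin_unlock(&p->lock);
		goto slow_path;
	}
	p->count += delta;
	spin_unlock(&p->lock);
	preempt_enable();
	return;

slow_path:
	preempt_enable();
	/* Serialises against the notifier on the global mutex, not on a per-cpu lock. */
	mutex_lock(&c->global_lock);
	c->global_count += delta;
	mutex_unlock(&c->global_lock);
}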
* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs 2011-12-19 23:56 ` Dave Chinner @ 2011-12-20 4:05 ` Al Viro 0 siblings, 0 replies; 37+ messages in thread From: Al Viro @ 2011-12-20 4:05 UTC (permalink / raw) To: Dave Chinner Cc: Srivatsa S. Bhat, Stephen Boyd, mc, linux-kernel, linux-fsdevel, Nick Piggin, akpm@linux-foundation.org, Maciej Rutecki On Tue, Dec 20, 2011 at 10:56:59AM +1100, Dave Chinner wrote: > > That got one hell of a deadlock potential ;-/ So far I'm more or less > > in favor of doing get_online_cpus() explicitly in fs/namespace.c, outside > > of namespace_sem. But I still have not convinced myself that it's > > really safe ;-/ > > Agreed, it looks like a lot simpler solution to this problem than a > notifier. But I don't think I know enough about the usage context to > determine if it is safe, either, so i can't really help you there. :/ That's really nasty; mntput_no_expire() (and thus mntput()) wants br_write_lock()/br_write_unlock(). Right now we *know* that mntput() is non-blocking in situations when we are holding more than one reference. With that kind of change that isn't true anymore - one needs to have long-term refs to make it safe. And that's not going to be fun to audit... Can we get some kind of non-blocking exclusion against CPU hotplug? Note that we care about it only for writers, i.e. when we are going to cause cacheline bouncing from hell, no matter what. I *really* hate making br_write_lock() blocking and explicit get_online_cpus() around it isn't really any better. Too much PITA verifying correctness after the locking change. At that point in the cycle the original patch (loop over all CPUs, online or not) may turn out to be the only sane variant, as much as its going to hurt us. ^ permalink raw reply [flat|nested] 37+ messages in thread
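[For comparison, the get_online_cpus() ordering Al is weighing here would look roughly like the sketch below in a fs/namespace.c path. It is illustrative only, no such patch was posted, and it uses the then-current brlock API; it also shows why the whole section becomes blocking: get_online_cpus() may have to wait out a cpu_down() already in flight, with namespace_sem and the brlock nested inside it.]

#include <linux/cpu.h>

static int example_umount_path(struct vfsmount *mnt)
{
	get_online_cpus();		/* blocking: may wait for an in-flight cpu_down() */
	down_write(&namespace_sem);

	br_write_lock(vfsmount_lock);	/* now guaranteed to cover every online CPU */
	/* ... detach the tree rooted at mnt, bump event counter, etc. ... */
	br_write_unlock(vfsmount_lock);

	up_write(&namespace_sem);
	put_online_cpus();
	return 0;
}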