From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [RFC PATCH 0/2] Improve reliability of CPU hotplug Date: Wed, 11 Jan 2012 10:11:06 +0000 Message-ID: <1326276668-19932-1-git-send-email-mgorman@suse.de> Cc: Andrew Morton , Peter Zijlstra , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen , Mel Gorman To: Linux-MM , Linux-FSDevel , LKML Return-path: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Recent stress tests doing CPU online/offline in a loop revealed at least two separate bugs. They result in CPUs either being deadlocked on a spinlock or the machine halting entirely. My reproduction case used a large numbers of simultaneous kernel compiles on an 8-core machine while CPUs were continually being brought online and offline in a loop. This small series includes two patches that allow hotplug stress tests to pass for me when applied to 3.2. This does not claim to solve all CPU hotplug problems. For example, the test configuration did not have PREEMPT enabled but there is no harm in eliminating these bugs at least. Patch 1 looks at a sysfs dirent problem whereby under stress a dentry lock is taken twice. This is a sysfs-specific test but a dcache related fix also exists as an RFC. Patch 2 notes that the page allocator is sending IPIs without calling get_online_cpus() to protect the cpuonline mask from changes. In low memory situations, this allows an IPI to be sent to a CPU going offline. This patch fixes drain_all_pages() and then changes the page allocator to only drain local lists with preempt disabled instead of sending an IPI on the grounds the IPI costs while having a marginal benefit. fs/sysfs/dir.c | 4 ++-- mm/page_alloc.c | 16 ++++++++++++---- 2 files changed, 14 insertions(+), 6 deletions(-) -- 1.7.3.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Date: Wed, 11 Jan 2012 10:11:08 +0000 Message-ID: <1326276668-19932-3-git-send-email-mgorman@suse.de> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> Cc: Andrew Morton , Peter Zijlstra , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen , Mel Gorman To: Linux-MM , Linux-FSDevel , LKML Return-path: In-Reply-To: <1326276668-19932-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org While running a CPU hotplug stress test under memory pressure, it was observed that the machine would halt with no messages logged to console. This is difficult to trigger and required a machine with 8 cores and plenty of memory. Part of the problem is the page allocator is sending IPIs using on_each_cpu() without calling get_online_cpus() to prevent changes to the online cpumask. This allows IPIs to be send to CPUs that are going offline or offline already. At least one bug report has been seen on ppc64 against a 3.0 era kernel that looked like a bug receiving interrupts on a CPU being offlined. This patch starts by adding a call to get_online_cpus() to drain_all_pages() to make it safe versis CPU hotplug. In the context of the page allocator, this causes a problem. It is possible that kthreadd blocks on cpu_hotplug mutex while another process already holding the mutex is blocked waiting for kthreadd to make forward progress leading to deadlock. Additionally, it is important that cpu_hotplug mutex does not become a new hot lock while under pressure. There is also the consideration that CPU hotplug expects that get_online_cpus() is not called frequently as it can lead to livelock in exceptional circumstances (see comment above cpu_hotplug_begin()). Rather than making it safe to call get_online_cpus() from the page allocator, this patch simply removes the page allocator call to drain_all_pages(). To avoid impacting high-order allocation success rates, it still drains the local per-cpu lists for high-order allocations that failed. As a side effect, this reduces the number of IPIs sent during low memory situations. Signed-off-by: Mel Gorman --- mm/page_alloc.c | 16 ++++++++++++---- 1 files changed, 12 insertions(+), 4 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2b8ba3a..b6df6fc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1119,7 +1119,9 @@ void drain_local_pages(void *arg) */ void drain_all_pages(void) { + get_online_cpus(); on_each_cpu(drain_local_pages, NULL, 1); + put_online_cpus(); } #ifdef CONFIG_HIBERNATION @@ -1982,11 +1984,17 @@ retry: migratetype); /* - * If an allocation failed after direct reclaim, it could be because - * pages are pinned on the per-cpu lists. Drain them and try again + * If a high-order allocation failed after direct reclaim, there is a + * possibility that it is because the necessary buddies have been + * freed to the per-cpu list. Drain the local list and try again. + * drain_all_pages is not used because it is unsafe to call + * get_online_cpus from this context as it is possible that kthreadd + * would block during thread creation and the cost of sending storms + * of IPIs in low memory conditions is quite high. */ - if (!page && !drained) { - drain_all_pages(); + if (!page && order && !drained) { + drain_pages(get_cpu()); + put_cpu(); drained = true; goto retry; } -- 1.7.3.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: [PATCH 1/2] fs: sysfs: Do dcache-related updates to sysfs dentries under sysfs_mutex Date: Wed, 11 Jan 2012 10:11:07 +0000 Message-ID: <1326276668-19932-2-git-send-email-mgorman@suse.de> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> Cc: Andrew Morton , Peter Zijlstra , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen , Mel Gorman To: Linux-MM , Linux-FSDevel , LKML Return-path: In-Reply-To: <1326276668-19932-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org While running a CPU hotplug stress test under memory pressure, a spinlock lockup was detected due to a dentry lock being recursively taken. When this happens varies considerably and is difficult to trigger. [ 482.345588] BUG: spinlock lockup on CPU#2, udevd/4400 [ 482.345590] lock: ffff8803075be0d0, .magic: dead4ead, .owner: udevd/5689, .owner_cpu: 0 [ 482.345592] Pid: 4400, comm: udevd Not tainted 3.2.0-vanilla #1 [ 482.345592] Call Trace: [ 482.345595] [] spin_dump+0x88/0x8d [ 482.345597] [] do_raw_spin_lock+0xd6/0xf9 [ 482.345599] [] _raw_spin_lock+0x39/0x3d [ 482.345601] [] ? shrink_dcache_parent+0x77/0x28c [ 482.345603] [] shrink_dcache_parent+0x77/0x28c [ 482.345605] [] ? have_submounts+0x13e/0x1bd [ 482.345607] [] sysfs_dentry_revalidate+0xaa/0xbe [ 482.345608] [] do_lookup+0x263/0x2fc [ 482.345610] [] ? security_inode_permission+0x1e/0x20 [ 482.345612] [] link_path_walk+0x1e2/0x763 [ 482.345614] [] path_lookupat+0x5c/0x61a [ 482.345616] [] ? might_fault+0x89/0x8d [ 482.345618] [] ? might_fault+0x40/0x8d [ 482.345619] [] do_path_lookup+0x2a/0xa8 [ 482.345621] [] user_path_at_empty+0x5d/0x97 [ 482.345623] [] ? trace_hardirqs_off+0xd/0xf [ 482.345625] [] ? _raw_spin_unlock_irqrestore+0x44/0x5a [ 482.345627] [] user_path_at+0x11/0x13 [ 482.345629] [] vfs_fstatat+0x44/0x71 [ 482.345631] [] vfs_lstat+0x1e/0x20 [ 482.345632] [] sys_newlstat+0x1f/0x40 [ 482.345634] [] ? trace_hardirqs_on_caller+0x12d/0x164 [ 482.345636] [] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 482.345638] [] ? trace_hardirqs_off+0xd/0xf [ 482.345640] [] system_call_fastpath+0x16/0x1b [ 482.515004] [] ? trace_hardirqs_off+0xd/0xf [ 482.520870] [] system_call_fastpath+0x16/0x1b At this point, CPU hotplug stops and other processes get stuck in a similar deadlock waiting for 5689 to unlock. RCU reports stalls but it is collateral damage. The deadlocked processes have sysfs_dentry_revalidate() in common. Miklos Szeredi explained at https://lkml.org/lkml/2012/1/9/114 that the deadlock happens within dcache if two processes call shrink_dcache_parent() on the same dentry. In Miklos's case, the problem is with the bonding driver but during CPU online or offline, a number of dentries are being created and deleted and this deadlock is also being hit. Looking at sysfs, there is a global sysfs_mutex that protects the sysfs directory tree from concurrent reclaims. Almost all operations involving directory inodes and dentries take place under the sysfs_mutex - linking, unlinking, patch searching lookup, renames and readdir. d_invalidate is slightly different. It is mostly under the mutex but if the dentry has to be removed from the dcache, the mutex is dropped. Where as Miklos' patch changes dcache, this patch changes sysfs to consistently hold the mutex for dentry-related operations. Once applied, this particular bug with CPU hotadd/hotremove no longer occurs. Signed-off-by: Mel Gorman --- fs/sysfs/dir.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c index 7fdf6a7..acaf21d 100644 --- a/fs/sysfs/dir.c +++ b/fs/sysfs/dir.c @@ -279,8 +279,8 @@ static int sysfs_dentry_revalidate(struct dentry *dentry, struct nameidata *nd) if (strcmp(dentry->d_name.name, sd->s_name) != 0) goto out_bad; - mutex_unlock(&sysfs_mutex); out_valid: + mutex_unlock(&sysfs_mutex); return 1; out_bad: /* Remove the dentry from the dcache hashes. @@ -294,7 +294,6 @@ out_bad: * to the dcache hashes. */ is_dir = (sysfs_type(sd) == SYSFS_DIR); - mutex_unlock(&sysfs_mutex); if (is_dir) { /* If we have submounts we must allow the vfs caches * to lie about the state of the filesystem to prevent @@ -305,6 +304,7 @@ out_bad: shrink_dcache_parent(dentry); } d_drop(dentry); + mutex_unlock(&sysfs_mutex); return 0; } -- 1.7.3.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: ebiederm@xmission.com (Eric W. Biederman) Subject: Re: [PATCH 1/2] fs: sysfs: Do dcache-related updates to sysfs dentries under sysfs_mutex Date: Wed, 11 Jan 2012 09:11:27 -0800 Message-ID: References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-2-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Linux-MM , Linux-FSDevel , LKML , Andrew Morton , Peter Zijlstra , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , Greg KH , Gong Chen To: Mel Gorman Return-path: In-Reply-To: <1326276668-19932-2-git-send-email-mgorman@suse.de> (Mel Gorman's message of "Wed, 11 Jan 2012 10:11:07 +0000") Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Mel Gorman writes: > While running a CPU hotplug stress test under memory pressure, a > spinlock lockup was detected due to a dentry lock being recursively > taken. When this happens varies considerably and is difficult > to trigger. > > [ 482.345588] BUG: spinlock lockup on CPU#2, udevd/4400 > [ 482.345590] lock: ffff8803075be0d0, .magic: dead4ead, .owner: udevd/5689, .owner_cpu: 0 > [ 482.345592] Pid: 4400, comm: udevd Not tainted 3.2.0-vanilla #1 > [ 482.345592] Call Trace: > [ 482.345595] [] spin_dump+0x88/0x8d > [ 482.345597] [] do_raw_spin_lock+0xd6/0xf9 > [ 482.345599] [] _raw_spin_lock+0x39/0x3d > [ 482.345601] [] ? shrink_dcache_parent+0x77/0x28c > [ 482.345603] [] shrink_dcache_parent+0x77/0x28c > [ 482.345605] [] ? have_submounts+0x13e/0x1bd > [ 482.345607] [] sysfs_dentry_revalidate+0xaa/0xbe > [ 482.345608] [] do_lookup+0x263/0x2fc > [ 482.345610] [] ? security_inode_permission+0x1e/0x20 > [ 482.345612] [] link_path_walk+0x1e2/0x763 > [ 482.345614] [] path_lookupat+0x5c/0x61a > [ 482.345616] [] ? might_fault+0x89/0x8d > [ 482.345618] [] ? might_fault+0x40/0x8d > [ 482.345619] [] do_path_lookup+0x2a/0xa8 > [ 482.345621] [] user_path_at_empty+0x5d/0x97 > [ 482.345623] [] ? trace_hardirqs_off+0xd/0xf > [ 482.345625] [] ? _raw_spin_unlock_irqrestore+0x44/0x5a > [ 482.345627] [] user_path_at+0x11/0x13 > [ 482.345629] [] vfs_fstatat+0x44/0x71 > [ 482.345631] [] vfs_lstat+0x1e/0x20 > [ 482.345632] [] sys_newlstat+0x1f/0x40 > [ 482.345634] [] ? trace_hardirqs_on_caller+0x12d/0x164 > [ 482.345636] [] ? trace_hardirqs_on_thunk+0x3a/0x3f > [ 482.345638] [] ? trace_hardirqs_off+0xd/0xf > [ 482.345640] [] system_call_fastpath+0x16/0x1b > [ 482.515004] [] ? trace_hardirqs_off+0xd/0xf > [ 482.520870] [] system_call_fastpath+0x16/0x1b > > At this point, CPU hotplug stops and other processes get stuck in a > similar deadlock waiting for 5689 to unlock. RCU reports stalls but > it is collateral damage. > > The deadlocked processes have sysfs_dentry_revalidate() in > common. Miklos Szeredi explained at https://lkml.org/lkml/2012/1/9/114 > that the deadlock happens within dcache if two processes call > shrink_dcache_parent() on the same dentry. > > In Miklos's case, the problem is with the bonding driver but during > CPU online or offline, a number of dentries are being created and > deleted and this deadlock is also being hit. Looking at sysfs, there > is a global sysfs_mutex that protects the sysfs directory tree from > concurrent reclaims. Almost all operations involving directory inodes > and dentries take place under the sysfs_mutex - linking, unlinking, > patch searching lookup, renames and readdir. d_invalidate is slightly > different. It is mostly under the mutex but if the dentry has to be > removed from the dcache, the mutex is dropped. The sysfs_mutex protects the sysfs data structures not the vfs. > Where as Miklos' patch changes dcache, this patch changes sysfs to > consistently hold the mutex for dentry-related operations. Once > applied, this particular bug with CPU hotadd/hotremove no longer > occurs. After taking a quick skim over the code to reacquaint myself with it appears that the usage in sysfs is idiomatic. That is sysfs uses shrink_dcache_parent without a lock and in a context where the right race could trigger this deadlock. And in particular I expect you could trigger the same deadlock in proc, nfs, and gfs2 with if you can get the timing right. I don't think adding a work-around for the bug in shrink_dcache_parent is going to do anything except hide the bug in the VFS, and unnecessarily increase the sysfs_mutex hold times. I may be blind but I don't see a reason at this point to rush out an incomplete work-around for the bug in shrink_dcahce_parent instead of actually fixing shrink_dcache_parent. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 1/2] fs: sysfs: Do dcache-related updates to sysfs dentries under sysfs_mutex Date: Wed, 11 Jan 2012 18:07:23 +0000 Message-ID: <20120111180723.GF4118@suse.de> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-2-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Cc: Linux-MM , Linux-FSDevel , LKML , Andrew Morton , Peter Zijlstra , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , Greg KH , Gong Chen To: "Eric W. Biederman" Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed, Jan 11, 2012 at 09:11:27AM -0800, Eric W. Biederman wrote: > > In Miklos's case, the problem is with the bonding driver but during > > CPU online or offline, a number of dentries are being created and > > deleted and this deadlock is also being hit. Looking at sysfs, there > > is a global sysfs_mutex that protects the sysfs directory tree from > > concurrent reclaims. Almost all operations involving directory inodes > > and dentries take place under the sysfs_mutex - linking, unlinking, > > patch searching lookup, renames and readdir. d_invalidate is slightly > > different. It is mostly under the mutex but if the dentry has to be > > removed from the dcache, the mutex is dropped. > > The sysfs_mutex protects the sysfs data structures not the vfs. > Ok. > > Where as Miklos' patch changes dcache, this patch changes sysfs to > > consistently hold the mutex for dentry-related operations. Once > > applied, this particular bug with CPU hotadd/hotremove no longer > > occurs. > > After taking a quick skim over the code to reacquaint myself with > it appears that the usage in sysfs is idiomatic. That is sysfs > uses shrink_dcache_parent without a lock and in a context where > the right race could trigger this deadlock. > Yes. > And in particular I expect you could trigger the same deadlock in > proc, nfs, and gfs2 with if you can get the timing right. > Agreed. When the dcache-specific fix was being discussed on an external bugzilla, this came up. It's probably easiest to race in sysfs because it's possible to create/delete directories faster than is possible for proc, nfs or gfs2. > I don't think adding a work-around for the bug in shrink_dcache_parent > is going to do anything except hide the bug in the VFS, and > unnecessarily increase the sysfs_mutex hold times. > Ok. > I may be blind but I don't see a reason at this point to rush out an > incomplete work-around for the bug in shrink_dcahce_parent instead of > actually fixing shrink_dcache_parent. > Since I wrote this patch, the dcache specific fix was finished, merged and I expect it'll make it to stable. Assuming that happens, this patch will no longer be required. Thanks Eric. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: ebiederm@xmission.com (Eric W. Biederman) Subject: Re: [PATCH 1/2] fs: sysfs: Do dcache-related updates to sysfs dentries under sysfs_mutex Date: Wed, 11 Jan 2012 11:02:26 -0800 Message-ID: References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-2-git-send-email-mgorman@suse.de> <20120111180723.GF4118@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Linux-MM , Linux-FSDevel , LKML , Andrew Morton , Peter Zijlstra , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , Greg KH , Gong Chen To: Mel Gorman Return-path: In-Reply-To: <20120111180723.GF4118@suse.de> (Mel Gorman's message of "Wed, 11 Jan 2012 18:07:23 +0000") Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Mel Gorman writes: > On Wed, Jan 11, 2012 at 09:11:27AM -0800, Eric W. Biederman wrote: >> > In Miklos's case, the problem is with the bonding driver but during >> > CPU online or offline, a number of dentries are being created and >> > deleted and this deadlock is also being hit. Looking at sysfs, there >> > is a global sysfs_mutex that protects the sysfs directory tree from >> > concurrent reclaims. Almost all operations involving directory inodes >> > and dentries take place under the sysfs_mutex - linking, unlinking, >> > patch searching lookup, renames and readdir. d_invalidate is slightly >> > different. It is mostly under the mutex but if the dentry has to be >> > removed from the dcache, the mutex is dropped. >> >> The sysfs_mutex protects the sysfs data structures not the vfs. >> > > Ok. > >> > Where as Miklos' patch changes dcache, this patch changes sysfs to >> > consistently hold the mutex for dentry-related operations. Once >> > applied, this particular bug with CPU hotadd/hotremove no longer >> > occurs. >> >> After taking a quick skim over the code to reacquaint myself with >> it appears that the usage in sysfs is idiomatic. That is sysfs >> uses shrink_dcache_parent without a lock and in a context where >> the right race could trigger this deadlock. >> > > Yes. > >> And in particular I expect you could trigger the same deadlock in >> proc, nfs, and gfs2 with if you can get the timing right. >> > > Agreed. When the dcache-specific fix was being discussed on an external > bugzilla, this came up. It's probably easiest to race in sysfs because > it's possible to create/delete directories faster than is possible > for proc, nfs or gfs2. I expect we see the race in sysfs because of uevents that get triggered on hotplug. So a lot is occurring around the time of the race. You can get to shrink_dcache_parent with fork/exit in proc which is a lot easier to trigger. But usually in fork/exec you don't have the dentries cached... > Since I wrote this patch, the dcache specific fix was finished, merged > and I expect it'll make it to stable. Assuming that happens, this patch > will no longer be required. Sounds good. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Gilad Ben-Yossef Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Date: Thu, 12 Jan 2012 16:51:56 +0200 Message-ID: References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: Linux-MM , Linux-FSDevel , LKML , Andrew Morton , Peter Zijlstra , "Srivatsa S. Bhat" , Russell King - ARM Linux , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen To: Mel Gorman Return-path: In-Reply-To: <1326276668-19932-3-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed, Jan 11, 2012 at 12:11 PM, Mel Gorman wrote: > Rather than making it safe to call get_online_cpus() from the page > allocator, this patch simply removes the page allocator call to > drain_all_pages(). To avoid impacting high-order allocation success > rates, it still drains the local per-cpu lists for high-order > allocations that failed. As a side effect, this reduces the number > of IPIs sent during low memory situations. > > Signed-off-by: Mel Gorman > --- > =A0mm/page_alloc.c | =A0 16 ++++++++++++---- > =A01 files changed, 12 insertions(+), 4 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 2b8ba3a..b6df6fc 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1119,7 +1119,9 @@ void drain_local_pages(void *arg) > =A0*/ > =A0void drain_all_pages(void) > =A0{ > + =A0 =A0 =A0 get_online_cpus(); > =A0 =A0 =A0 =A0on_each_cpu(drain_local_pages, NULL, 1); > + =A0 =A0 =A0 put_online_cpus(); > =A0} > > =A0#ifdef CONFIG_HIBERNATION > @@ -1982,11 +1984,17 @@ retry: > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0migratetype); > > =A0 =A0 =A0 =A0/* > - =A0 =A0 =A0 =A0* If an allocation failed after direct reclaim, it could= be because > - =A0 =A0 =A0 =A0* pages are pinned on the per-cpu lists. Drain them and = try again > + =A0 =A0 =A0 =A0* If a high-order allocation failed after direct reclaim= , there is a > + =A0 =A0 =A0 =A0* possibility that it is because the necessary buddies h= ave been > + =A0 =A0 =A0 =A0* freed to the per-cpu list. Drain the local list and tr= y again. > + =A0 =A0 =A0 =A0* drain_all_pages is not used because it is unsafe to ca= ll > + =A0 =A0 =A0 =A0* get_online_cpus from this context as it is possible th= at kthreadd > + =A0 =A0 =A0 =A0* would block during thread creation and the cost of sen= ding storms > + =A0 =A0 =A0 =A0* of IPIs in low memory conditions is quite high. > =A0 =A0 =A0 =A0 */ > - =A0 =A0 =A0 if (!page && !drained) { > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 drain_all_pages(); > + =A0 =A0 =A0 if (!page && order && !drained) { > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 drain_pages(get_cpu()); > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 put_cpu(); > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0drained =3D true; > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0goto retry; > =A0 =A0 =A0 =A0} > -- > 1.7.3.4 > I very much like the judo like quality of relying on the fact that in memory pressure conditions most of the cpus will end up in the direct reclaim path to drain them all without IPIs. What I can't figure out is why we don't need get/put_online_cpus() pair around each and every call to on_each_cpu everywhere? and if we do, perhaps making it a part of on_each_cpu is the way to go? Something like: diff --git a/kernel/smp.c b/kernel/smp.c index f66a1b2..cfa3882 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -691,11 +691,15 @@ void on_each_cpu(void (*func) (void *info), void *info, int wait) { unsigned long flags; + BUG_ON(in_atomic()); + + get_online_cpus(); preempt_disable(); smp_call_function(func, info, wait); local_irq_save(flags); func(info); local_irq_restore(flags); preempt_enable(); + put_online_cpus(); } EXPORT_SYMBOL(on_each_cpu); Does that makes? --=20 Gilad Ben-Yossef Chief Coffee Drinker gilad@benyossef.com Israel Cell: +972-52-8260388 US Cell: +1-973-8260388 http://benyossef.com "Unfortunately, cache misses are an equal opportunity pain provider." -- Mike Galbraith, LKML -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Gilad Ben-Yossef Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Date: Thu, 12 Jan 2012 17:08:10 +0200 Message-ID: References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: Linux-MM , Linux-FSDevel , LKML , Andrew Morton , Peter Zijlstra , "Srivatsa S. Bhat" , Russell King - ARM Linux , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen To: Mel Gorman Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Thu, Jan 12, 2012 at 4:51 PM, Gilad Ben-Yossef wro= te: > On Wed, Jan 11, 2012 at 12:11 PM, Mel Gorman wrote: > >> Rather than making it safe to call get_online_cpus() from the page >> allocator, this patch simply removes the page allocator call to >> drain_all_pages(). To avoid impacting high-order allocation success >> rates, it still drains the local per-cpu lists for high-order >> allocations that failed. As a side effect, this reduces the number >> of IPIs sent during low memory situations. >> >> Signed-off-by: Mel Gorman >> --- >> =A0mm/page_alloc.c | =A0 16 ++++++++++++---- >> =A01 files changed, 12 insertions(+), 4 deletions(-) >> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index 2b8ba3a..b6df6fc 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -1119,7 +1119,9 @@ void drain_local_pages(void *arg) >> =A0*/ >> =A0void drain_all_pages(void) >> =A0{ >> + =A0 =A0 =A0 get_online_cpus(); >> =A0 =A0 =A0 =A0on_each_cpu(drain_local_pages, NULL, 1); >> + =A0 =A0 =A0 put_online_cpus(); >> =A0} >> >> =A0#ifdef CONFIG_HIBERNATION >> @@ -1982,11 +1984,17 @@ retry: >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0migratetype); >> >> =A0 =A0 =A0 =A0/* >> - =A0 =A0 =A0 =A0* If an allocation failed after direct reclaim, it coul= d be because >> - =A0 =A0 =A0 =A0* pages are pinned on the per-cpu lists. Drain them and= try again >> + =A0 =A0 =A0 =A0* If a high-order allocation failed after direct reclai= m, there is a >> + =A0 =A0 =A0 =A0* possibility that it is because the necessary buddies = have been >> + =A0 =A0 =A0 =A0* freed to the per-cpu list. Drain the local list and t= ry again. >> + =A0 =A0 =A0 =A0* drain_all_pages is not used because it is unsafe to c= all >> + =A0 =A0 =A0 =A0* get_online_cpus from this context as it is possible t= hat kthreadd >> + =A0 =A0 =A0 =A0* would block during thread creation and the cost of se= nding storms >> + =A0 =A0 =A0 =A0* of IPIs in low memory conditions is quite high. >> =A0 =A0 =A0 =A0 */ >> - =A0 =A0 =A0 if (!page && !drained) { >> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 drain_all_pages(); >> + =A0 =A0 =A0 if (!page && order && !drained) { >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 drain_pages(get_cpu()); >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 put_cpu(); >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0drained =3D true; >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0goto retry; >> =A0 =A0 =A0 =A0} >> -- >> 1.7.3.4 >> > > I very much like the judo like quality of relying on the fact that in > memory pressure conditions most > of the cpus will end up in the direct reclaim path to drain them all > without IPIs. > > What I can't figure out is why we don't need =A0get/put_online_cpus() > pair around each and every call > to on_each_cpu everywhere? and if we do, perhaps making it a part of > on_each_cpu is the way to go? > > Something like: > > diff --git a/kernel/smp.c b/kernel/smp.c > index f66a1b2..cfa3882 100644 > --- a/kernel/smp.c > +++ b/kernel/smp.c > @@ -691,11 +691,15 @@ void on_each_cpu(void (*func) (void *info), void > *info, int wait) > =A0{ > =A0 =A0 =A0 =A0unsigned long flags; > > + =A0 =A0 =A0 BUG_ON(in_atomic()); > + > + =A0 =A0 =A0 get_online_cpus(); > =A0 =A0 =A0 =A0preempt_disable(); > =A0 =A0 =A0 =A0smp_call_function(func, info, wait); > =A0 =A0 =A0 =A0local_irq_save(flags); > =A0 =A0 =A0 =A0func(info); > =A0 =A0 =A0 =A0local_irq_restore(flags); > =A0 =A0 =A0 =A0preempt_enable(); > + =A0 =A0 =A0 put_online_cpus(); > =A0} > =A0EXPORT_SYMBOL(on_each_cpu); > > Does that makes? Well, it dies on boot (after adding the needed include file), so it's obviously wrong, but I'm still guessing other users of on_each_cpu will need an =A0get/put_online_cpus() wrapper to= o. Maybe - on_each_cpu(...) { get_online_cpus(); __on_each_cpu(...); put_online_cpus(); } We'll need to audit all callers. Gilad --=20 Gilad Ben-Yossef Chief Coffee Drinker gilad@benyossef.com Israel Cell: +972-52-8260388 US Cell: +1-973-8260388 http://benyossef.com "Unfortunately, cache misses are an equal opportunity pain provider." -- Mike Galbraith, LKML -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Peter Zijlstra Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Date: Thu, 12 Jan 2012 16:08:04 +0100 Message-ID: <1326380884.2442.187.camel@twins> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: quoted-printable Cc: Mel Gorman , Linux-MM , Linux-FSDevel , LKML , Andrew Morton , "Srivatsa S. Bhat" , Russell King - ARM Linux , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen To: Gilad Ben-Yossef Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Thu, 2012-01-12 at 16:51 +0200, Gilad Ben-Yossef wrote: > What I can't figure out is why we don't need get/put_online_cpus() > pair around each and every call > to on_each_cpu everywhere? and if we do, perhaps making it a part of > on_each_cpu is the way to go? >=20 > Something like: >=20 > diff --git a/kernel/smp.c b/kernel/smp.c > index f66a1b2..cfa3882 100644 > --- a/kernel/smp.c > +++ b/kernel/smp.c > @@ -691,11 +691,15 @@ void on_each_cpu(void (*func) (void *info), void > *info, int wait) > { > unsigned long flags; >=20 > + BUG_ON(in_atomic()); > + > + get_online_cpus(); > preempt_disable(); Your preempt_disable() here serializes against hotplug.. > smp_call_function(func, info, wait); > local_irq_save(flags); > func(info); > local_irq_restore(flags); > preempt_enable(); > + put_online_cpus(); > } > EXPORT_SYMBOL(on_each_cpu);=20 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Gilad Ben-Yossef Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Date: Thu, 12 Jan 2012 17:13:32 +0200 Message-ID: References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> <1326380884.2442.187.camel@twins> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: Mel Gorman , Linux-MM , Linux-FSDevel , LKML , Andrew Morton , "Srivatsa S. Bhat" , Russell King - ARM Linux , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen To: Peter Zijlstra Return-path: In-Reply-To: <1326380884.2442.187.camel@twins> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Thu, Jan 12, 2012 at 5:08 PM, Peter Zijlstra wr= ote: > On Thu, 2012-01-12 at 16:51 +0200, Gilad Ben-Yossef wrote: >> What I can't figure out is why we don't need =A0get/put_online_cpus() >> pair around each and every call >> to on_each_cpu everywhere? and if we do, perhaps making it a part of >> on_each_cpu is the way to go? >> >> Something like: >> >> diff --git a/kernel/smp.c b/kernel/smp.c >> index f66a1b2..cfa3882 100644 >> --- a/kernel/smp.c >> +++ b/kernel/smp.c >> @@ -691,11 +691,15 @@ void on_each_cpu(void (*func) (void *info), void >> *info, int wait) >> =A0{ >> =A0 =A0 =A0 =A0 unsigned long flags; >> >> + =A0 =A0 =A0 BUG_ON(in_atomic()); >> + >> + =A0 =A0 =A0 get_online_cpus(); >> =A0 =A0 =A0 =A0 preempt_disable(); > > Your preempt_disable() here serializes against hotplug.. I'm probably daft, but why didn't it work for the page allocator then? Mel's description reads: "Part of the problem is the page allocator is sending IPIs using on_each_cpu() without calling get_online_cpus() to prevent changes to the online cpumask." on_each_cpu() disables preemption in master as well... Gilad Thanks, Gilad --=20 Gilad Ben-Yossef Chief Coffee Drinker gilad@benyossef.com Israel Cell: +972-52-8260388 US Cell: +1-973-8260388 http://benyossef.com "Unfortunately, cache misses are an equal opportunity pain provider." -- Mike Galbraith, LKML -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Peter Zijlstra Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Date: Thu, 12 Jan 2012 16:18:12 +0100 Message-ID: <1326381492.2442.188.camel@twins> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: quoted-printable Cc: Linux-MM , Linux-FSDevel , LKML , Andrew Morton , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen To: Mel Gorman Return-path: In-Reply-To: <1326276668-19932-3-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed, 2012-01-11 at 10:11 +0000, Mel Gorman wrote: > At least one bug report has > been seen on ppc64 against a 3.0 era kernel that looked like a bug > receiving interrupts on a CPU being offlined.=20 Got details on that Mel? The preempt_disable() in on_each_cpu() should serialize against the stop_machine() crap in unplug. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Date: Thu, 12 Jan 2012 15:37:12 +0000 Message-ID: <20120112153712.GL4118@suse.de> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> <1326381492.2442.188.camel@twins> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Cc: Linux-MM , Linux-FSDevel , LKML , Andrew Morton , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen To: Peter Zijlstra Return-path: Content-Disposition: inline In-Reply-To: <1326381492.2442.188.camel@twins> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Thu, Jan 12, 2012 at 04:18:12PM +0100, Peter Zijlstra wrote: > On Wed, 2012-01-11 at 10:11 +0000, Mel Gorman wrote: > > At least one bug report has > > been seen on ppc64 against a 3.0 era kernel that looked like a bug > > receiving interrupts on a CPU being offlined. > > Got details on that Mel? The preempt_disable() in on_each_cpu() should > serialize against the stop_machine() crap in unplug. I might have added 2 and 2 together and got 5. The stack trace clearly was while sending IPIs in on_each_cpu() and always when under memory pressure and stuck in direct reclaim. This was on !PREEMPT kernels where preempt_disable() is a no-op. That is why I thought get_online_cpu() would be necessary. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Peter Zijlstra Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Date: Thu, 12 Jan 2012 16:52:31 +0100 Message-ID: <1326383551.2442.203.camel@twins> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> <1326381492.2442.188.camel@twins> <20120112153712.GL4118@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT Cc: Linux-MM , Linux-FSDevel , LKML , Andrew Morton , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen To: Mel Gorman Return-path: Received: from casper.infradead.org ([85.118.1.10]:60103 "EHLO casper.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753861Ab2ALPwq convert rfc822-to-8bit (ORCPT ); Thu, 12 Jan 2012 10:52:46 -0500 In-Reply-To: <20120112153712.GL4118@suse.de> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On Thu, 2012-01-12 at 15:37 +0000, Mel Gorman wrote: > On Thu, Jan 12, 2012 at 04:18:12PM +0100, Peter Zijlstra wrote: > > On Wed, 2012-01-11 at 10:11 +0000, Mel Gorman wrote: > > > At least one bug report has > > > been seen on ppc64 against a 3.0 era kernel that looked like a bug > > > receiving interrupts on a CPU being offlined. > > > > Got details on that Mel? The preempt_disable() in on_each_cpu() should > > serialize against the stop_machine() crap in unplug. > > I might have added 2 and 2 together and got 5. > > The stack trace clearly was while sending IPIs in on_each_cpu() and > always when under memory pressure and stuck in direct reclaim. This was > on !PREEMPT kernels where preempt_disable() is a no-op. That is why I > thought get_online_cpu() would be necessary. For non-preempt the required scheduling of stop_machine() will have to wait even longer. Still there might be something funny, some of the hotplug notifiers are ran before the stop_machine thing does its thing so there might be some fun interaction. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Date: Thu, 12 Jan 2012 17:18:47 +0000 Message-ID: <20120112171847.GN4118@suse.de> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> <1326381492.2442.188.camel@twins> <20120112153712.GL4118@suse.de> <1326383551.2442.203.camel@twins> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Cc: Linux-MM , Linux-FSDevel , LKML , Andrew Morton , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen To: Peter Zijlstra Return-path: Content-Disposition: inline In-Reply-To: <1326383551.2442.203.camel@twins> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Thu, Jan 12, 2012 at 04:52:31PM +0100, Peter Zijlstra wrote: > On Thu, 2012-01-12 at 15:37 +0000, Mel Gorman wrote: > > On Thu, Jan 12, 2012 at 04:18:12PM +0100, Peter Zijlstra wrote: > > > On Wed, 2012-01-11 at 10:11 +0000, Mel Gorman wrote: > > > > At least one bug report has > > > > been seen on ppc64 against a 3.0 era kernel that looked like a bug > > > > receiving interrupts on a CPU being offlined. > > > > > > Got details on that Mel? The preempt_disable() in on_each_cpu() should > > > serialize against the stop_machine() crap in unplug. > > > > I might have added 2 and 2 together and got 5. > > > > The stack trace clearly was while sending IPIs in on_each_cpu() and > > always when under memory pressure and stuck in direct reclaim. This was > > on !PREEMPT kernels where preempt_disable() is a no-op. That is why I > > thought get_online_cpu() would be necessary. > > For non-preempt the required scheduling of stop_machine() will have to > wait even longer. Still there might be something funny, some of the > hotplug notifiers are ran before the stop_machine thing does its thing > so there might be some fun interaction. Ok, how about this as a replacement patch? ---8<--- From: Mel Gorman Subject: [PATCH] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context While running a CPU hotplug stress test under memory pressure, it was observed that the machine would halt with no messages logged to console. This is difficult to trigger and required a machine with 8 cores and plenty of memory. In at least one case on ppc64, the warning in include/linux/cpumask.h:107 triggered implying that IPIs are being sent to offline CPUs in some cases. A suspicious part of the problem is that the page allocator is sending IPIs using on_each_cpu() without calling get_online_cpus() to prevent changes to the online cpumask. It is depending on preemption being disabled to protect it which is a no-op on !PREEMPT kernels. This means that a thread can be reading the mask in smp_call_function_many() when an attempt is made to take a CPU offline. The expectation is that this is not a problem as the stop_machine() used during CPU hotplug should be able to prevent any problems as the reader of the online mask will prevent stop_machine making forward progress but it's unhelpful. On the other side, the mask can also be read while the CPU is being brought online. In this case it is the responsibility of the architecture that the CPU is able to receive and handle interrupts before being marked active but that does not mean they always get it right. Sending excessive IPIs from the page allocator is a bad idea. In low memory situations, a large number of processes can drain the per-cpu lists at the same time, in quick succession and on many CPUs which is pointless. In light of this and the unspecific CPU hotplug concerns, this patch removes the call drain_all_pages() after failing direct reclaim. To avoid impacting high-order allocation success rates, it still drains the local per-cpu lists for high-order allocations that failed. Signed-off-by: Mel Gorman --- mm/page_alloc.c | 10 ++++++---- 1 files changed, 6 insertions(+), 4 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2b8ba3a..63ea182 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1982,11 +1982,13 @@ retry: migratetype); /* - * If an allocation failed after direct reclaim, it could be because - * pages are pinned on the per-cpu lists. Drain them and try again + * If a high-order allocation failed after direct reclaim, there is a + * possibility that it is because the necessary buddies have been + * freed to the per-cpu list. Drain the local list and try again. */ - if (!page && !drained) { - drain_all_pages(); + if (!page && order && !drained) { + drain_pages(get_cpu()); + put_cpu(); drained = true; goto retry; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Gilad Ben-Yossef Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Date: Thu, 12 Jan 2012 21:14:41 +0200 Message-ID: References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> <1326381492.2442.188.camel@twins> <20120112153712.GL4118@suse.de> <1326383551.2442.203.camel@twins> <20120112171847.GN4118@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: Peter Zijlstra , Linux-MM , Linux-FSDevel , LKML , Andrew Morton , "Srivatsa S. Bhat" , Russell King - ARM Linux , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen To: Mel Gorman Return-path: In-Reply-To: <20120112171847.GN4118@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Thu, Jan 12, 2012 at 7:18 PM, Mel Gorman wrote: > On Thu, Jan 12, 2012 at 04:52:31PM +0100, Peter Zijlstra wrote: >> On Thu, 2012-01-12 at 15:37 +0000, Mel Gorman wrote: >> > On Thu, Jan 12, 2012 at 04:18:12PM +0100, Peter Zijlstra wrote: >> > > On Wed, 2012-01-11 at 10:11 +0000, Mel Gorman wrote: >> > > > At least one bug report has >> > > > been seen on ppc64 against a 3.0 era kernel that looked like a bug >> > > > receiving interrupts on a CPU being offlined. >> > > >> > > Got details on that Mel? The preempt_disable() in on_each_cpu() shou= ld >> > > serialize against the stop_machine() crap in unplug. >> > >> > I might have added 2 and 2 together and got 5. >> > >> > The stack trace clearly was while sending IPIs in on_each_cpu() and >> > always when under memory pressure and stuck in direct reclaim. This wa= s >> > on !PREEMPT kernels where preempt_disable() is a no-op. That is why I >> > thought get_online_cpu() would be necessary. >> >> For non-preempt the required scheduling of stop_machine() will have to >> wait even longer. Still there might be something funny, some of the >> hotplug notifiers are ran before the stop_machine thing does its thing >> so there might be some fun interaction. > > Ok, how about this as a replacement patch? > > ---8<--- > From: Mel Gorman > Subject: [PATCH] mm: page allocator: Do not drain per-cpu lists via IPI f= rom page allocator context > > While running a CPU hotplug stress test under memory pressure, it > was observed that the machine would halt with no messages logged > to console. This is difficult to trigger and required a machine > with 8 cores and plenty of memory. In at least one case on ppc64, > the warning in include/linux/cpumask.h:107 triggered implying that > IPIs are being sent to offline CPUs in some cases. > > A suspicious part of the problem is that the page allocator is sending > IPIs using on_each_cpu() without calling get_online_cpus() to prevent > changes to the online cpumask. It is depending on preemption being > disabled to protect it which is a no-op on !PREEMPT kernels. This means > that a thread can be reading the mask in smp_call_function_many() when > an attempt is made to take a CPU offline. The expectation is that this > is not a problem as the stop_machine() used during CPU hotplug should > be able to prevent any problems as the reader of the online mask will > prevent stop_machine making forward progress but it's unhelpful. > > On the other side, the mask can also be read while the CPU is being > brought online. In this case it is the responsibility of the > architecture that the CPU is able to receive and handle interrupts > before being marked active but that does not mean they always get it > right. > > Sending excessive IPIs from the page allocator is a bad idea. In low > memory situations, a large number of processes can drain the per-cpu > lists at the same time, in quick succession and on many CPUs which is > pointless. In light of this and the unspecific CPU hotplug concerns, > this patch removes the call drain_all_pages() after failing direct > reclaim. To avoid impacting high-order allocation success rates, > it still drains the local per-cpu lists for high-order allocations > that failed. > > Signed-off-by: Mel Gorman > --- > =A0mm/page_alloc.c | =A0 10 ++++++---- > =A01 files changed, 6 insertions(+), 4 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 2b8ba3a..63ea182 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1982,11 +1982,13 @@ retry: > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0migratetype); > > =A0 =A0 =A0 =A0/* > - =A0 =A0 =A0 =A0* If an allocation failed after direct reclaim, it could= be because > - =A0 =A0 =A0 =A0* pages are pinned on the per-cpu lists. Drain them and = try again > + =A0 =A0 =A0 =A0* If a high-order allocation failed after direct reclaim= , there is a > + =A0 =A0 =A0 =A0* possibility that it is because the necessary buddies h= ave been > + =A0 =A0 =A0 =A0* freed to the per-cpu list. Drain the local list and tr= y again. > =A0 =A0 =A0 =A0 */ > - =A0 =A0 =A0 if (!page && !drained) { > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 drain_all_pages(); > + =A0 =A0 =A0 if (!page && order && !drained) { > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 drain_pages(get_cpu()); > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 put_cpu(); > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0drained =3D true; > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0goto retry; > =A0 =A0 =A0 =A0} I like the patch, think it is better then current code and want it to go in= . I also think there is still some problems with IPIs somewhere that cause some corruption when a lot of IPIs are sent and that the patch simply lowered the very big number of IPIs that are sent via the direct reclaim code path so the problem is hidden, not solved by this patch. I've seen something related when trying to test the IPI reduction patches. Interesting enough it was not related to CPU hotplug at all - When a lot of IPIs are being sent, I sometime got an assert from low level platform code that I'm trying to send IPIs with an empty mask although the mask was NOT empty. I didn't manage to debug it then but I did manage to recreate it quite easily. I will see if I can recreate it with recent master and report. --=20 Gilad Ben-Yossef Chief Coffee Drinker gilad@benyossef.com Israel Cell: +972-52-8260388 US Cell: +1-973-8260388 http://benyossef.com "Unfortunately, cache misses are an equal opportunity pain provider." -- Mike Galbraith, LKML -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Srivatsa S. Bhat" Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Date: Fri, 20 Jan 2012 03:16:58 +0530 Message-ID: <4F188F52.1060303@linux.vnet.ibm.com> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> <1326381492.2442.188.camel@twins> <20120112153712.GL4118@suse.de> <1326383551.2442.203.camel@twins> <20120112171847.GN4118@suse.de> <20120119162057.GD3143@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Cc: Milton Miller , Gilad Ben-Yossef , linux-kernel@vger.kernel.org, Peter Zijlstra , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, "akpm@linux-foundation.org" , Russell King - ARM Linux , "Paul E. McKenney" , mszeredi@novell.com, ebiederm@xmission.com, Greg Kroah-Hartman , gong.chen@intel.com, Tony Luck , Borislav Petkov , "tglx@linutronix.de" , "mingo@redhat.com" , "hpa@zytor.com" , "x86@kernel.org" , linux-edac@vger.kernel.org, Andi Kleen To: Mel Gorman Return-path: In-Reply-To: <20120119162057.GD3143@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org [Reinstating the original Cc list] On 01/19/2012 09:50 PM, Mel Gorman wrote:> > On a different x86-64 machines with an intel-specific MCE, I have > also noted that the value of num_online_cpus() can change while > stop_machine() is running. That is expected and intentional right? Meaning, it is during the stop_machine() thing itself that a CPU is actually taken offline. And at the same time, it is removed from the cpu_online_mask. On Intel boxes, essentially, the following gets executed on the dying CPU, as set up by the stop_machine stuff. __cpu_disable() native_cpu_disable() cpu_disable_common() remove_cpu_from_maps() set_cpu_online(cpu, false) ^^^^^^ So, set_cpu_online will remove this CPU from the cpu_online_mask. And all this runs while still under the stop machine context. And this is exactly what we want right? > This is sensitive to timing and part of > the problem seems to be due to cmci_rediscover() running without the > CPU hotplug mutex held. This is not related to the IPI mess and is > unrelated to memory pressure but is just to note that CPU hotplug in > general can be fragile in parts. > For the cmci_rediscover() part, I feel a simple get/put_online_cpus() around it should work. Something like the following patch? (It is untested at the moment, but I will run it later and see if it works well). I would like the opinion of MCE/Intel maintainers as to whether this is a proper fix or something else would have been better.. ---- From: Srivatsa S. Bhat Subject: [PATCH] x86/intel mce: Fix race with CPU hotplug in cmci_rediscover() cmci_rediscover() is invoked upon the CPU_POST_DEAD notification, when the cpu_hotplug lock is no longer held. And cmci_rediscover() iterates over all the online cpus. So this can race with an ongoing CPU hotplug operation. Fix this by wrapping the iteration code within the pair get_online_cpus() / put_online_cpus(). Reported-by: Mel Gorman Signed-off-by: Srivatsa S. Bhat --- arch/x86/kernel/cpu/mcheck/mce_intel.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c index 38e49bc..1c30397 100644 --- a/arch/x86/kernel/cpu/mcheck/mce_intel.c +++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include #include @@ -179,6 +180,7 @@ void cmci_rediscover(int dying) return; cpumask_copy(old, ¤t->cpus_allowed); + get_online_cpus(); for_each_online_cpu(cpu) { if (cpu == dying) continue; @@ -188,6 +190,7 @@ void cmci_rediscover(int dying) if (cmci_supported(&banks)) cmci_discover(banks, 0); } + put_online_cpus(); set_cpus_allowed_ptr(current, old); free_cpumask_var(old); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Date: Fri, 20 Jan 2012 08:48:40 +0000 Message-ID: <20120120084840.GG3143@suse.de> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> <1326381492.2442.188.camel@twins> <20120112153712.GL4118@suse.de> <1326383551.2442.203.camel@twins> <20120112171847.GN4118@suse.de> <20120119162057.GD3143@suse.de> <4F188F52.1060303@linux.vnet.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Cc: Milton Miller , Gilad Ben-Yossef , linux-kernel@vger.kernel.org, Peter Zijlstra , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, "akpm@linux-foundation.org" , Russell King - ARM Linux , "Paul E. McKenney" , mszeredi@novell.com, ebiederm@xmission.com, Greg Kroah-Hartman , gong.chen@intel.com, Tony Luck , Borislav Petkov , "tglx@linutronix.de" , "mingo@redhat.com" , "hpa@zytor.com" , "x86@kernel.org" , linux-edac@vger.kernel.org, Andi Kleen To: "Srivatsa S. Bhat" Return-path: Received: from cantor2.suse.de ([195.135.220.15]:58137 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751953Ab2ATIss (ORCPT ); Fri, 20 Jan 2012 03:48:48 -0500 Content-Disposition: inline In-Reply-To: <4F188F52.1060303@linux.vnet.ibm.com> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On Fri, Jan 20, 2012 at 03:16:58AM +0530, Srivatsa S. Bhat wrote: > [Reinstating the original Cc list] > > On 01/19/2012 09:50 PM, Mel Gorman wrote:> > > > On a different x86-64 machines with an intel-specific MCE, I have > > also noted that the value of num_online_cpus() can change while > > stop_machine() is running. > > > That is expected and intentional right? Meaning, it is during the > stop_machine() thing itself that a CPU is actually taken offline. > And at the same time, it is removed from the cpu_online_mask. > It's intentional sometimes and no others. The machine does halt sometimes and stays there. > On Intel boxes, essentially, the following gets executed on the dying > CPU, as set up by the stop_machine stuff. > > __cpu_disable() > native_cpu_disable() > cpu_disable_common() > remove_cpu_from_maps() > set_cpu_online(cpu, false) > ^^^^^^ > So, set_cpu_online will remove this CPU from the cpu_online_mask. > And all this runs while still under the stop machine context. > And this is exactly what we want right? > We don't want it to halt in stop_machine forever waiting on acknowledges that are never received until the NMI handler fires. > > This is sensitive to timing and part of > > the problem seems to be due to cmci_rediscover() running without the > > CPU hotplug mutex held. This is not related to the IPI mess and is > > unrelated to memory pressure but is just to note that CPU hotplug in > > general can be fragile in parts. > > > > > For the cmci_rediscover() part, I feel a simple get/put_online_cpus() > around it should work. > Yeah, that's the first thing I tried first too. Doesn't work though. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx134.postini.com [74.125.245.134]) by kanga.kvack.org (Postfix) with SMTP id AAD716B004D for ; Thu, 12 Jan 2012 10:52:40 -0500 (EST) Message-ID: <1326383551.2442.203.camel@twins> Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context From: Peter Zijlstra Date: Thu, 12 Jan 2012 16:52:31 +0100 In-Reply-To: <20120112153712.GL4118@suse.de> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> <1326381492.2442.188.camel@twins> <20120112153712.GL4118@suse.de> Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: quoted-printable Mime-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Linux-MM , Linux-FSDevel , LKML , Andrew Morton , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen On Thu, 2012-01-12 at 15:37 +0000, Mel Gorman wrote: > On Thu, Jan 12, 2012 at 04:18:12PM +0100, Peter Zijlstra wrote: > > On Wed, 2012-01-11 at 10:11 +0000, Mel Gorman wrote: > > > At least one bug report has > > > been seen on ppc64 against a 3.0 era kernel that looked like a bug > > > receiving interrupts on a CPU being offlined.=20 > >=20 > > Got details on that Mel? The preempt_disable() in on_each_cpu() should > > serialize against the stop_machine() crap in unplug. >=20 > I might have added 2 and 2 together and got 5. >=20 > The stack trace clearly was while sending IPIs in on_each_cpu() and > always when under memory pressure and stuck in direct reclaim. This was > on !PREEMPT kernels where preempt_disable() is a no-op. That is why I > thought get_online_cpu() would be necessary. For non-preempt the required scheduling of stop_machine() will have to wait even longer. Still there might be something funny, some of the hotplug notifiers are ran before the stop_machine thing does its thing so there might be some fun interaction. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx173.postini.com [74.125.245.173]) by kanga.kvack.org (Postfix) with SMTP id DD99C6B004D for ; Fri, 20 Jan 2012 03:48:48 -0500 (EST) Date: Fri, 20 Jan 2012 08:48:40 +0000 From: Mel Gorman Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Message-ID: <20120120084840.GG3143@suse.de> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> <1326381492.2442.188.camel@twins> <20120112153712.GL4118@suse.de> <1326383551.2442.203.camel@twins> <20120112171847.GN4118@suse.de> <20120119162057.GD3143@suse.de> <4F188F52.1060303@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <4F188F52.1060303@linux.vnet.ibm.com> Sender: owner-linux-mm@kvack.org List-ID: To: "Srivatsa S. Bhat" Cc: Milton Miller , Gilad Ben-Yossef , linux-kernel@vger.kernel.org, Peter Zijlstra , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, "akpm@linux-foundation.org" , Russell King - ARM Linux , "Paul E. McKenney" , mszeredi@novell.com, ebiederm@xmission.com, Greg Kroah-Hartman , gong.chen@intel.com, Tony Luck , Borislav Petkov , "tglx@linutronix.de" , "mingo@redhat.com" , "hpa@zytor.com" , "x86@kernel.org" , linux-edac@vger.kernel.org, Andi Kleen On Fri, Jan 20, 2012 at 03:16:58AM +0530, Srivatsa S. Bhat wrote: > [Reinstating the original Cc list] > > On 01/19/2012 09:50 PM, Mel Gorman wrote:> > > > On a different x86-64 machines with an intel-specific MCE, I have > > also noted that the value of num_online_cpus() can change while > > stop_machine() is running. > > > That is expected and intentional right? Meaning, it is during the > stop_machine() thing itself that a CPU is actually taken offline. > And at the same time, it is removed from the cpu_online_mask. > It's intentional sometimes and no others. The machine does halt sometimes and stays there. > On Intel boxes, essentially, the following gets executed on the dying > CPU, as set up by the stop_machine stuff. > > __cpu_disable() > native_cpu_disable() > cpu_disable_common() > remove_cpu_from_maps() > set_cpu_online(cpu, false) > ^^^^^^ > So, set_cpu_online will remove this CPU from the cpu_online_mask. > And all this runs while still under the stop machine context. > And this is exactly what we want right? > We don't want it to halt in stop_machine forever waiting on acknowledges that are never received until the NMI handler fires. > > This is sensitive to timing and part of > > the problem seems to be due to cmci_rediscover() running without the > > CPU hotplug mutex held. This is not related to the IPI mess and is > > unrelated to memory pressure but is just to note that CPU hotplug in > > general can be fragile in parts. > > > > > For the cmci_rediscover() part, I feel a simple get/put_online_cpus() > around it should work. > Yeah, that's the first thing I tried first too. Doesn't work though. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757410Ab2AKKLQ (ORCPT ); Wed, 11 Jan 2012 05:11:16 -0500 Received: from cantor2.suse.de ([195.135.220.15]:46391 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751286Ab2AKKLO (ORCPT ); Wed, 11 Jan 2012 05:11:14 -0500 From: Mel Gorman To: Linux-MM , Linux-FSDevel , LKML Cc: Andrew Morton , Peter Zijlstra , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen , Mel Gorman Subject: [RFC PATCH 0/2] Improve reliability of CPU hotplug Date: Wed, 11 Jan 2012 10:11:06 +0000 Message-Id: <1326276668-19932-1-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.7.3.4 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Recent stress tests doing CPU online/offline in a loop revealed at least two separate bugs. They result in CPUs either being deadlocked on a spinlock or the machine halting entirely. My reproduction case used a large numbers of simultaneous kernel compiles on an 8-core machine while CPUs were continually being brought online and offline in a loop. This small series includes two patches that allow hotplug stress tests to pass for me when applied to 3.2. This does not claim to solve all CPU hotplug problems. For example, the test configuration did not have PREEMPT enabled but there is no harm in eliminating these bugs at least. Patch 1 looks at a sysfs dirent problem whereby under stress a dentry lock is taken twice. This is a sysfs-specific test but a dcache related fix also exists as an RFC. Patch 2 notes that the page allocator is sending IPIs without calling get_online_cpus() to protect the cpuonline mask from changes. In low memory situations, this allows an IPI to be sent to a CPU going offline. This patch fixes drain_all_pages() and then changes the page allocator to only drain local lists with preempt disabled instead of sending an IPI on the grounds the IPI costs while having a marginal benefit. fs/sysfs/dir.c | 4 ++-- mm/page_alloc.c | 16 ++++++++++++---- 2 files changed, 14 insertions(+), 6 deletions(-) -- 1.7.3.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757432Ab2AKKLS (ORCPT ); Wed, 11 Jan 2012 05:11:18 -0500 Received: from cantor2.suse.de ([195.135.220.15]:46408 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757355Ab2AKKLQ (ORCPT ); Wed, 11 Jan 2012 05:11:16 -0500 From: Mel Gorman To: Linux-MM , Linux-FSDevel , LKML Cc: Andrew Morton , Peter Zijlstra , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen , Mel Gorman Subject: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Date: Wed, 11 Jan 2012 10:11:08 +0000 Message-Id: <1326276668-19932-3-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.7.3.4 In-Reply-To: <1326276668-19932-1-git-send-email-mgorman@suse.de> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org While running a CPU hotplug stress test under memory pressure, it was observed that the machine would halt with no messages logged to console. This is difficult to trigger and required a machine with 8 cores and plenty of memory. Part of the problem is the page allocator is sending IPIs using on_each_cpu() without calling get_online_cpus() to prevent changes to the online cpumask. This allows IPIs to be send to CPUs that are going offline or offline already. At least one bug report has been seen on ppc64 against a 3.0 era kernel that looked like a bug receiving interrupts on a CPU being offlined. This patch starts by adding a call to get_online_cpus() to drain_all_pages() to make it safe versis CPU hotplug. In the context of the page allocator, this causes a problem. It is possible that kthreadd blocks on cpu_hotplug mutex while another process already holding the mutex is blocked waiting for kthreadd to make forward progress leading to deadlock. Additionally, it is important that cpu_hotplug mutex does not become a new hot lock while under pressure. There is also the consideration that CPU hotplug expects that get_online_cpus() is not called frequently as it can lead to livelock in exceptional circumstances (see comment above cpu_hotplug_begin()). Rather than making it safe to call get_online_cpus() from the page allocator, this patch simply removes the page allocator call to drain_all_pages(). To avoid impacting high-order allocation success rates, it still drains the local per-cpu lists for high-order allocations that failed. As a side effect, this reduces the number of IPIs sent during low memory situations. Signed-off-by: Mel Gorman --- mm/page_alloc.c | 16 ++++++++++++---- 1 files changed, 12 insertions(+), 4 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2b8ba3a..b6df6fc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1119,7 +1119,9 @@ void drain_local_pages(void *arg) */ void drain_all_pages(void) { + get_online_cpus(); on_each_cpu(drain_local_pages, NULL, 1); + put_online_cpus(); } #ifdef CONFIG_HIBERNATION @@ -1982,11 +1984,17 @@ retry: migratetype); /* - * If an allocation failed after direct reclaim, it could be because - * pages are pinned on the per-cpu lists. Drain them and try again + * If a high-order allocation failed after direct reclaim, there is a + * possibility that it is because the necessary buddies have been + * freed to the per-cpu list. Drain the local list and try again. + * drain_all_pages is not used because it is unsafe to call + * get_online_cpus from this context as it is possible that kthreadd + * would block during thread creation and the cost of sending storms + * of IPIs in low memory conditions is quite high. */ - if (!page && !drained) { - drain_all_pages(); + if (!page && order && !drained) { + drain_pages(get_cpu()); + put_cpu(); drained = true; goto retry; } -- 1.7.3.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757445Ab2AKKLl (ORCPT ); Wed, 11 Jan 2012 05:11:41 -0500 Received: from cantor2.suse.de ([195.135.220.15]:46395 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751745Ab2AKKLP (ORCPT ); Wed, 11 Jan 2012 05:11:15 -0500 From: Mel Gorman To: Linux-MM , Linux-FSDevel , LKML Cc: Andrew Morton , Peter Zijlstra , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen , Mel Gorman Subject: [PATCH 1/2] fs: sysfs: Do dcache-related updates to sysfs dentries under sysfs_mutex Date: Wed, 11 Jan 2012 10:11:07 +0000 Message-Id: <1326276668-19932-2-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 1.7.3.4 In-Reply-To: <1326276668-19932-1-git-send-email-mgorman@suse.de> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org While running a CPU hotplug stress test under memory pressure, a spinlock lockup was detected due to a dentry lock being recursively taken. When this happens varies considerably and is difficult to trigger. [ 482.345588] BUG: spinlock lockup on CPU#2, udevd/4400 [ 482.345590] lock: ffff8803075be0d0, .magic: dead4ead, .owner: udevd/5689, .owner_cpu: 0 [ 482.345592] Pid: 4400, comm: udevd Not tainted 3.2.0-vanilla #1 [ 482.345592] Call Trace: [ 482.345595] [] spin_dump+0x88/0x8d [ 482.345597] [] do_raw_spin_lock+0xd6/0xf9 [ 482.345599] [] _raw_spin_lock+0x39/0x3d [ 482.345601] [] ? shrink_dcache_parent+0x77/0x28c [ 482.345603] [] shrink_dcache_parent+0x77/0x28c [ 482.345605] [] ? have_submounts+0x13e/0x1bd [ 482.345607] [] sysfs_dentry_revalidate+0xaa/0xbe [ 482.345608] [] do_lookup+0x263/0x2fc [ 482.345610] [] ? security_inode_permission+0x1e/0x20 [ 482.345612] [] link_path_walk+0x1e2/0x763 [ 482.345614] [] path_lookupat+0x5c/0x61a [ 482.345616] [] ? might_fault+0x89/0x8d [ 482.345618] [] ? might_fault+0x40/0x8d [ 482.345619] [] do_path_lookup+0x2a/0xa8 [ 482.345621] [] user_path_at_empty+0x5d/0x97 [ 482.345623] [] ? trace_hardirqs_off+0xd/0xf [ 482.345625] [] ? _raw_spin_unlock_irqrestore+0x44/0x5a [ 482.345627] [] user_path_at+0x11/0x13 [ 482.345629] [] vfs_fstatat+0x44/0x71 [ 482.345631] [] vfs_lstat+0x1e/0x20 [ 482.345632] [] sys_newlstat+0x1f/0x40 [ 482.345634] [] ? trace_hardirqs_on_caller+0x12d/0x164 [ 482.345636] [] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 482.345638] [] ? trace_hardirqs_off+0xd/0xf [ 482.345640] [] system_call_fastpath+0x16/0x1b [ 482.515004] [] ? trace_hardirqs_off+0xd/0xf [ 482.520870] [] system_call_fastpath+0x16/0x1b At this point, CPU hotplug stops and other processes get stuck in a similar deadlock waiting for 5689 to unlock. RCU reports stalls but it is collateral damage. The deadlocked processes have sysfs_dentry_revalidate() in common. Miklos Szeredi explained at https://lkml.org/lkml/2012/1/9/114 that the deadlock happens within dcache if two processes call shrink_dcache_parent() on the same dentry. In Miklos's case, the problem is with the bonding driver but during CPU online or offline, a number of dentries are being created and deleted and this deadlock is also being hit. Looking at sysfs, there is a global sysfs_mutex that protects the sysfs directory tree from concurrent reclaims. Almost all operations involving directory inodes and dentries take place under the sysfs_mutex - linking, unlinking, patch searching lookup, renames and readdir. d_invalidate is slightly different. It is mostly under the mutex but if the dentry has to be removed from the dcache, the mutex is dropped. Where as Miklos' patch changes dcache, this patch changes sysfs to consistently hold the mutex for dentry-related operations. Once applied, this particular bug with CPU hotadd/hotremove no longer occurs. Signed-off-by: Mel Gorman --- fs/sysfs/dir.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c index 7fdf6a7..acaf21d 100644 --- a/fs/sysfs/dir.c +++ b/fs/sysfs/dir.c @@ -279,8 +279,8 @@ static int sysfs_dentry_revalidate(struct dentry *dentry, struct nameidata *nd) if (strcmp(dentry->d_name.name, sd->s_name) != 0) goto out_bad; - mutex_unlock(&sysfs_mutex); out_valid: + mutex_unlock(&sysfs_mutex); return 1; out_bad: /* Remove the dentry from the dcache hashes. @@ -294,7 +294,6 @@ out_bad: * to the dcache hashes. */ is_dir = (sysfs_type(sd) == SYSFS_DIR); - mutex_unlock(&sysfs_mutex); if (is_dir) { /* If we have submounts we must allow the vfs caches * to lie about the state of the filesystem to prevent @@ -305,6 +304,7 @@ out_bad: shrink_dcache_parent(dentry); } d_drop(dentry); + mutex_unlock(&sysfs_mutex); return 0; } -- 1.7.3.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757839Ab2AKRJT (ORCPT ); Wed, 11 Jan 2012 12:09:19 -0500 Received: from out02.mta.xmission.com ([166.70.13.232]:54568 "EHLO out02.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756699Ab2AKRJR (ORCPT ); Wed, 11 Jan 2012 12:09:17 -0500 From: ebiederm@xmission.com (Eric W. Biederman) To: Mel Gorman Cc: Linux-MM , Linux-FSDevel , LKML , Andrew Morton , Peter Zijlstra , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , Greg KH , Gong Chen Subject: Re: [PATCH 1/2] fs: sysfs: Do dcache-related updates to sysfs dentries under sysfs_mutex References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-2-git-send-email-mgorman@suse.de> Date: Wed, 11 Jan 2012 09:11:27 -0800 In-Reply-To: <1326276668-19932-2-git-send-email-mgorman@suse.de> (Mel Gorman's message of "Wed, 11 Jan 2012 10:11:07 +0000") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-XM-SPF: eid=;;;mid=;;;hst=in02.mta.xmission.com;;;ip=98.207.153.68;;;frm=ebiederm@xmission.com;;;spf=neutral X-XM-AID: U2FsdGVkX1+/mdqMG4rv8q0r/3GFYQixTy2fYehZxv8= X-SA-Exim-Connect-IP: 98.207.153.68 X-SA-Exim-Mail-From: ebiederm@xmission.com X-SA-Exim-Scanned: No (on in02.mta.xmission.com); SAEximRunCond expanded to false Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Mel Gorman writes: > While running a CPU hotplug stress test under memory pressure, a > spinlock lockup was detected due to a dentry lock being recursively > taken. When this happens varies considerably and is difficult > to trigger. > > [ 482.345588] BUG: spinlock lockup on CPU#2, udevd/4400 > [ 482.345590] lock: ffff8803075be0d0, .magic: dead4ead, .owner: udevd/5689, .owner_cpu: 0 > [ 482.345592] Pid: 4400, comm: udevd Not tainted 3.2.0-vanilla #1 > [ 482.345592] Call Trace: > [ 482.345595] [] spin_dump+0x88/0x8d > [ 482.345597] [] do_raw_spin_lock+0xd6/0xf9 > [ 482.345599] [] _raw_spin_lock+0x39/0x3d > [ 482.345601] [] ? shrink_dcache_parent+0x77/0x28c > [ 482.345603] [] shrink_dcache_parent+0x77/0x28c > [ 482.345605] [] ? have_submounts+0x13e/0x1bd > [ 482.345607] [] sysfs_dentry_revalidate+0xaa/0xbe > [ 482.345608] [] do_lookup+0x263/0x2fc > [ 482.345610] [] ? security_inode_permission+0x1e/0x20 > [ 482.345612] [] link_path_walk+0x1e2/0x763 > [ 482.345614] [] path_lookupat+0x5c/0x61a > [ 482.345616] [] ? might_fault+0x89/0x8d > [ 482.345618] [] ? might_fault+0x40/0x8d > [ 482.345619] [] do_path_lookup+0x2a/0xa8 > [ 482.345621] [] user_path_at_empty+0x5d/0x97 > [ 482.345623] [] ? trace_hardirqs_off+0xd/0xf > [ 482.345625] [] ? _raw_spin_unlock_irqrestore+0x44/0x5a > [ 482.345627] [] user_path_at+0x11/0x13 > [ 482.345629] [] vfs_fstatat+0x44/0x71 > [ 482.345631] [] vfs_lstat+0x1e/0x20 > [ 482.345632] [] sys_newlstat+0x1f/0x40 > [ 482.345634] [] ? trace_hardirqs_on_caller+0x12d/0x164 > [ 482.345636] [] ? trace_hardirqs_on_thunk+0x3a/0x3f > [ 482.345638] [] ? trace_hardirqs_off+0xd/0xf > [ 482.345640] [] system_call_fastpath+0x16/0x1b > [ 482.515004] [] ? trace_hardirqs_off+0xd/0xf > [ 482.520870] [] system_call_fastpath+0x16/0x1b > > At this point, CPU hotplug stops and other processes get stuck in a > similar deadlock waiting for 5689 to unlock. RCU reports stalls but > it is collateral damage. > > The deadlocked processes have sysfs_dentry_revalidate() in > common. Miklos Szeredi explained at https://lkml.org/lkml/2012/1/9/114 > that the deadlock happens within dcache if two processes call > shrink_dcache_parent() on the same dentry. > > In Miklos's case, the problem is with the bonding driver but during > CPU online or offline, a number of dentries are being created and > deleted and this deadlock is also being hit. Looking at sysfs, there > is a global sysfs_mutex that protects the sysfs directory tree from > concurrent reclaims. Almost all operations involving directory inodes > and dentries take place under the sysfs_mutex - linking, unlinking, > patch searching lookup, renames and readdir. d_invalidate is slightly > different. It is mostly under the mutex but if the dentry has to be > removed from the dcache, the mutex is dropped. The sysfs_mutex protects the sysfs data structures not the vfs. > Where as Miklos' patch changes dcache, this patch changes sysfs to > consistently hold the mutex for dentry-related operations. Once > applied, this particular bug with CPU hotadd/hotremove no longer > occurs. After taking a quick skim over the code to reacquaint myself with it appears that the usage in sysfs is idiomatic. That is sysfs uses shrink_dcache_parent without a lock and in a context where the right race could trigger this deadlock. And in particular I expect you could trigger the same deadlock in proc, nfs, and gfs2 with if you can get the timing right. I don't think adding a work-around for the bug in shrink_dcache_parent is going to do anything except hide the bug in the VFS, and unnecessarily increase the sysfs_mutex hold times. I may be blind but I don't see a reason at this point to rush out an incomplete work-around for the bug in shrink_dcahce_parent instead of actually fixing shrink_dcache_parent. Eric From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933663Ab2AKSHc (ORCPT ); Wed, 11 Jan 2012 13:07:32 -0500 Received: from cantor2.suse.de ([195.135.220.15]:46384 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932432Ab2AKSHb (ORCPT ); Wed, 11 Jan 2012 13:07:31 -0500 Date: Wed, 11 Jan 2012 18:07:23 +0000 From: Mel Gorman To: "Eric W. Biederman" Cc: Linux-MM , Linux-FSDevel , LKML , Andrew Morton , Peter Zijlstra , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , Greg KH , Gong Chen Subject: Re: [PATCH 1/2] fs: sysfs: Do dcache-related updates to sysfs dentries under sysfs_mutex Message-ID: <20120111180723.GF4118@suse.de> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-2-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jan 11, 2012 at 09:11:27AM -0800, Eric W. Biederman wrote: > > In Miklos's case, the problem is with the bonding driver but during > > CPU online or offline, a number of dentries are being created and > > deleted and this deadlock is also being hit. Looking at sysfs, there > > is a global sysfs_mutex that protects the sysfs directory tree from > > concurrent reclaims. Almost all operations involving directory inodes > > and dentries take place under the sysfs_mutex - linking, unlinking, > > patch searching lookup, renames and readdir. d_invalidate is slightly > > different. It is mostly under the mutex but if the dentry has to be > > removed from the dcache, the mutex is dropped. > > The sysfs_mutex protects the sysfs data structures not the vfs. > Ok. > > Where as Miklos' patch changes dcache, this patch changes sysfs to > > consistently hold the mutex for dentry-related operations. Once > > applied, this particular bug with CPU hotadd/hotremove no longer > > occurs. > > After taking a quick skim over the code to reacquaint myself with > it appears that the usage in sysfs is idiomatic. That is sysfs > uses shrink_dcache_parent without a lock and in a context where > the right race could trigger this deadlock. > Yes. > And in particular I expect you could trigger the same deadlock in > proc, nfs, and gfs2 with if you can get the timing right. > Agreed. When the dcache-specific fix was being discussed on an external bugzilla, this came up. It's probably easiest to race in sysfs because it's possible to create/delete directories faster than is possible for proc, nfs or gfs2. > I don't think adding a work-around for the bug in shrink_dcache_parent > is going to do anything except hide the bug in the VFS, and > unnecessarily increase the sysfs_mutex hold times. > Ok. > I may be blind but I don't see a reason at this point to rush out an > incomplete work-around for the bug in shrink_dcahce_parent instead of > actually fixing shrink_dcache_parent. > Since I wrote this patch, the dcache specific fix was finished, merged and I expect it'll make it to stable. Assuming that happens, this patch will no longer be required. Thanks Eric. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933555Ab2AKTAS (ORCPT ); Wed, 11 Jan 2012 14:00:18 -0500 Received: from out03.mta.xmission.com ([166.70.13.233]:55040 "EHLO out03.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753324Ab2AKTAO (ORCPT ); Wed, 11 Jan 2012 14:00:14 -0500 From: ebiederm@xmission.com (Eric W. Biederman) To: Mel Gorman Cc: Linux-MM , Linux-FSDevel , LKML , Andrew Morton , Peter Zijlstra , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , Greg KH , Gong Chen Subject: Re: [PATCH 1/2] fs: sysfs: Do dcache-related updates to sysfs dentries under sysfs_mutex References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-2-git-send-email-mgorman@suse.de> <20120111180723.GF4118@suse.de> Date: Wed, 11 Jan 2012 11:02:26 -0800 In-Reply-To: <20120111180723.GF4118@suse.de> (Mel Gorman's message of "Wed, 11 Jan 2012 18:07:23 +0000") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-XM-SPF: eid=;;;mid=;;;hst=in01.mta.xmission.com;;;ip=98.207.153.68;;;frm=ebiederm@xmission.com;;;spf=neutral X-XM-AID: U2FsdGVkX1/jo+NaAZ+y0uBxeiGHDuxX42V9LLg0Omg= X-SA-Exim-Connect-IP: 98.207.153.68 X-SA-Exim-Mail-From: ebiederm@xmission.com X-SA-Exim-Scanned: No (on in01.mta.xmission.com); SAEximRunCond expanded to false Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Mel Gorman writes: > On Wed, Jan 11, 2012 at 09:11:27AM -0800, Eric W. Biederman wrote: >> > In Miklos's case, the problem is with the bonding driver but during >> > CPU online or offline, a number of dentries are being created and >> > deleted and this deadlock is also being hit. Looking at sysfs, there >> > is a global sysfs_mutex that protects the sysfs directory tree from >> > concurrent reclaims. Almost all operations involving directory inodes >> > and dentries take place under the sysfs_mutex - linking, unlinking, >> > patch searching lookup, renames and readdir. d_invalidate is slightly >> > different. It is mostly under the mutex but if the dentry has to be >> > removed from the dcache, the mutex is dropped. >> >> The sysfs_mutex protects the sysfs data structures not the vfs. >> > > Ok. > >> > Where as Miklos' patch changes dcache, this patch changes sysfs to >> > consistently hold the mutex for dentry-related operations. Once >> > applied, this particular bug with CPU hotadd/hotremove no longer >> > occurs. >> >> After taking a quick skim over the code to reacquaint myself with >> it appears that the usage in sysfs is idiomatic. That is sysfs >> uses shrink_dcache_parent without a lock and in a context where >> the right race could trigger this deadlock. >> > > Yes. > >> And in particular I expect you could trigger the same deadlock in >> proc, nfs, and gfs2 with if you can get the timing right. >> > > Agreed. When the dcache-specific fix was being discussed on an external > bugzilla, this came up. It's probably easiest to race in sysfs because > it's possible to create/delete directories faster than is possible > for proc, nfs or gfs2. I expect we see the race in sysfs because of uevents that get triggered on hotplug. So a lot is occurring around the time of the race. You can get to shrink_dcache_parent with fork/exit in proc which is a lot easier to trigger. But usually in fork/exec you don't have the dentries cached... > Since I wrote this patch, the dcache specific fix was finished, merged > and I expect it'll make it to stable. Assuming that happens, this patch > will no longer be required. Sounds good. Eric From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753848Ab2ALOwC (ORCPT ); Thu, 12 Jan 2012 09:52:02 -0500 Received: from mail-vw0-f46.google.com ([209.85.212.46]:38496 "EHLO mail-vw0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753168Ab2ALOv5 convert rfc822-to-8bit (ORCPT ); Thu, 12 Jan 2012 09:51:57 -0500 MIME-Version: 1.0 X-Originating-IP: [212.179.42.66] In-Reply-To: <1326276668-19932-3-git-send-email-mgorman@suse.de> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> Date: Thu, 12 Jan 2012 16:51:56 +0200 Message-ID: Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context From: Gilad Ben-Yossef To: Mel Gorman Cc: Linux-MM , Linux-FSDevel , LKML , Andrew Morton , Peter Zijlstra , "Srivatsa S. Bhat" , Russell King - ARM Linux , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jan 11, 2012 at 12:11 PM, Mel Gorman wrote: > Rather than making it safe to call get_online_cpus() from the page > allocator, this patch simply removes the page allocator call to > drain_all_pages(). To avoid impacting high-order allocation success > rates, it still drains the local per-cpu lists for high-order > allocations that failed. As a side effect, this reduces the number > of IPIs sent during low memory situations. > > Signed-off-by: Mel Gorman > --- >  mm/page_alloc.c |   16 ++++++++++++---- >  1 files changed, 12 insertions(+), 4 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 2b8ba3a..b6df6fc 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1119,7 +1119,9 @@ void drain_local_pages(void *arg) >  */ >  void drain_all_pages(void) >  { > +       get_online_cpus(); >        on_each_cpu(drain_local_pages, NULL, 1); > +       put_online_cpus(); >  } > >  #ifdef CONFIG_HIBERNATION > @@ -1982,11 +1984,17 @@ retry: >                                        migratetype); > >        /* > -        * If an allocation failed after direct reclaim, it could be because > -        * pages are pinned on the per-cpu lists. Drain them and try again > +        * If a high-order allocation failed after direct reclaim, there is a > +        * possibility that it is because the necessary buddies have been > +        * freed to the per-cpu list. Drain the local list and try again. > +        * drain_all_pages is not used because it is unsafe to call > +        * get_online_cpus from this context as it is possible that kthreadd > +        * would block during thread creation and the cost of sending storms > +        * of IPIs in low memory conditions is quite high. >         */ > -       if (!page && !drained) { > -               drain_all_pages(); > +       if (!page && order && !drained) { > +               drain_pages(get_cpu()); > +               put_cpu(); >                drained = true; >                goto retry; >        } > -- > 1.7.3.4 > I very much like the judo like quality of relying on the fact that in memory pressure conditions most of the cpus will end up in the direct reclaim path to drain them all without IPIs. What I can't figure out is why we don't need get/put_online_cpus() pair around each and every call to on_each_cpu everywhere? and if we do, perhaps making it a part of on_each_cpu is the way to go? Something like: diff --git a/kernel/smp.c b/kernel/smp.c index f66a1b2..cfa3882 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -691,11 +691,15 @@ void on_each_cpu(void (*func) (void *info), void *info, int wait) { unsigned long flags; + BUG_ON(in_atomic()); + + get_online_cpus(); preempt_disable(); smp_call_function(func, info, wait); local_irq_save(flags); func(info); local_irq_restore(flags); preempt_enable(); + put_online_cpus(); } EXPORT_SYMBOL(on_each_cpu); Does that makes? -- Gilad Ben-Yossef Chief Coffee Drinker gilad@benyossef.com Israel Cell: +972-52-8260388 US Cell: +1-973-8260388 http://benyossef.com "Unfortunately, cache misses are an equal opportunity pain provider." -- Mike Galbraith, LKML From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753886Ab2ALPIQ (ORCPT ); Thu, 12 Jan 2012 10:08:16 -0500 Received: from mail-vw0-f46.google.com ([209.85.212.46]:33474 "EHLO mail-vw0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753703Ab2ALPIL convert rfc822-to-8bit (ORCPT ); Thu, 12 Jan 2012 10:08:11 -0500 MIME-Version: 1.0 X-Originating-IP: [212.179.42.66] In-Reply-To: References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> Date: Thu, 12 Jan 2012 17:08:10 +0200 Message-ID: Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context From: Gilad Ben-Yossef To: Mel Gorman Cc: Linux-MM , Linux-FSDevel , LKML , Andrew Morton , Peter Zijlstra , "Srivatsa S. Bhat" , Russell King - ARM Linux , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jan 12, 2012 at 4:51 PM, Gilad Ben-Yossef wrote: > On Wed, Jan 11, 2012 at 12:11 PM, Mel Gorman wrote: > >> Rather than making it safe to call get_online_cpus() from the page >> allocator, this patch simply removes the page allocator call to >> drain_all_pages(). To avoid impacting high-order allocation success >> rates, it still drains the local per-cpu lists for high-order >> allocations that failed. As a side effect, this reduces the number >> of IPIs sent during low memory situations. >> >> Signed-off-by: Mel Gorman >> --- >>  mm/page_alloc.c |   16 ++++++++++++---- >>  1 files changed, 12 insertions(+), 4 deletions(-) >> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index 2b8ba3a..b6df6fc 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -1119,7 +1119,9 @@ void drain_local_pages(void *arg) >>  */ >>  void drain_all_pages(void) >>  { >> +       get_online_cpus(); >>        on_each_cpu(drain_local_pages, NULL, 1); >> +       put_online_cpus(); >>  } >> >>  #ifdef CONFIG_HIBERNATION >> @@ -1982,11 +1984,17 @@ retry: >>                                        migratetype); >> >>        /* >> -        * If an allocation failed after direct reclaim, it could be because >> -        * pages are pinned on the per-cpu lists. Drain them and try again >> +        * If a high-order allocation failed after direct reclaim, there is a >> +        * possibility that it is because the necessary buddies have been >> +        * freed to the per-cpu list. Drain the local list and try again. >> +        * drain_all_pages is not used because it is unsafe to call >> +        * get_online_cpus from this context as it is possible that kthreadd >> +        * would block during thread creation and the cost of sending storms >> +        * of IPIs in low memory conditions is quite high. >>         */ >> -       if (!page && !drained) { >> -               drain_all_pages(); >> +       if (!page && order && !drained) { >> +               drain_pages(get_cpu()); >> +               put_cpu(); >>                drained = true; >>                goto retry; >>        } >> -- >> 1.7.3.4 >> > > I very much like the judo like quality of relying on the fact that in > memory pressure conditions most > of the cpus will end up in the direct reclaim path to drain them all > without IPIs. > > What I can't figure out is why we don't need  get/put_online_cpus() > pair around each and every call > to on_each_cpu everywhere? and if we do, perhaps making it a part of > on_each_cpu is the way to go? > > Something like: > > diff --git a/kernel/smp.c b/kernel/smp.c > index f66a1b2..cfa3882 100644 > --- a/kernel/smp.c > +++ b/kernel/smp.c > @@ -691,11 +691,15 @@ void on_each_cpu(void (*func) (void *info), void > *info, int wait) >  { >        unsigned long flags; > > +       BUG_ON(in_atomic()); > + > +       get_online_cpus(); >        preempt_disable(); >        smp_call_function(func, info, wait); >        local_irq_save(flags); >        func(info); >        local_irq_restore(flags); >        preempt_enable(); > +       put_online_cpus(); >  } >  EXPORT_SYMBOL(on_each_cpu); > > Does that makes? Well, it dies on boot (after adding the needed include file), so it's obviously wrong, but I'm still guessing other users of on_each_cpu will need an  get/put_online_cpus() wrapper too. Maybe - on_each_cpu(...) { get_online_cpus(); __on_each_cpu(...); put_online_cpus(); } We'll need to audit all callers. Gilad -- Gilad Ben-Yossef Chief Coffee Drinker gilad@benyossef.com Israel Cell: +972-52-8260388 US Cell: +1-973-8260388 http://benyossef.com "Unfortunately, cache misses are an equal opportunity pain provider." -- Mike Galbraith, LKML From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753938Ab2ALPI1 (ORCPT ); Thu, 12 Jan 2012 10:08:27 -0500 Received: from casper.infradead.org ([85.118.1.10]:59535 "EHLO casper.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753703Ab2ALPIX convert rfc822-to-8bit (ORCPT ); Thu, 12 Jan 2012 10:08:23 -0500 Message-ID: <1326380884.2442.187.camel@twins> Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context From: Peter Zijlstra To: Gilad Ben-Yossef Cc: Mel Gorman , Linux-MM , Linux-FSDevel , LKML , Andrew Morton , "Srivatsa S. Bhat" , Russell King - ARM Linux , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen Date: Thu, 12 Jan 2012 16:08:04 +0100 In-Reply-To: References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT X-Mailer: Evolution 3.2.1- Mime-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 2012-01-12 at 16:51 +0200, Gilad Ben-Yossef wrote: > What I can't figure out is why we don't need get/put_online_cpus() > pair around each and every call > to on_each_cpu everywhere? and if we do, perhaps making it a part of > on_each_cpu is the way to go? > > Something like: > > diff --git a/kernel/smp.c b/kernel/smp.c > index f66a1b2..cfa3882 100644 > --- a/kernel/smp.c > +++ b/kernel/smp.c > @@ -691,11 +691,15 @@ void on_each_cpu(void (*func) (void *info), void > *info, int wait) > { > unsigned long flags; > > + BUG_ON(in_atomic()); > + > + get_online_cpus(); > preempt_disable(); Your preempt_disable() here serializes against hotplug.. > smp_call_function(func, info, wait); > local_irq_save(flags); > func(info); > local_irq_restore(flags); > preempt_enable(); > + put_online_cpus(); > } > EXPORT_SYMBOL(on_each_cpu); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753897Ab2ALPNg (ORCPT ); Thu, 12 Jan 2012 10:13:36 -0500 Received: from mail-vw0-f46.google.com ([209.85.212.46]:45986 "EHLO mail-vw0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753703Ab2ALPNd convert rfc822-to-8bit (ORCPT ); Thu, 12 Jan 2012 10:13:33 -0500 MIME-Version: 1.0 X-Originating-IP: [212.179.42.66] In-Reply-To: <1326380884.2442.187.camel@twins> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> <1326380884.2442.187.camel@twins> Date: Thu, 12 Jan 2012 17:13:32 +0200 Message-ID: Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context From: Gilad Ben-Yossef To: Peter Zijlstra Cc: Mel Gorman , Linux-MM , Linux-FSDevel , LKML , Andrew Morton , "Srivatsa S. Bhat" , Russell King - ARM Linux , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jan 12, 2012 at 5:08 PM, Peter Zijlstra wrote: > On Thu, 2012-01-12 at 16:51 +0200, Gilad Ben-Yossef wrote: >> What I can't figure out is why we don't need  get/put_online_cpus() >> pair around each and every call >> to on_each_cpu everywhere? and if we do, perhaps making it a part of >> on_each_cpu is the way to go? >> >> Something like: >> >> diff --git a/kernel/smp.c b/kernel/smp.c >> index f66a1b2..cfa3882 100644 >> --- a/kernel/smp.c >> +++ b/kernel/smp.c >> @@ -691,11 +691,15 @@ void on_each_cpu(void (*func) (void *info), void >> *info, int wait) >>  { >>         unsigned long flags; >> >> +       BUG_ON(in_atomic()); >> + >> +       get_online_cpus(); >>         preempt_disable(); > > Your preempt_disable() here serializes against hotplug.. I'm probably daft, but why didn't it work for the page allocator then? Mel's description reads: "Part of the problem is the page allocator is sending IPIs using on_each_cpu() without calling get_online_cpus() to prevent changes to the online cpumask." on_each_cpu() disables preemption in master as well... Gilad Thanks, Gilad -- Gilad Ben-Yossef Chief Coffee Drinker gilad@benyossef.com Israel Cell: +972-52-8260388 US Cell: +1-973-8260388 http://benyossef.com "Unfortunately, cache misses are an equal opportunity pain provider." -- Mike Galbraith, LKML From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753925Ab2ALPS0 (ORCPT ); Thu, 12 Jan 2012 10:18:26 -0500 Received: from merlin.infradead.org ([205.233.59.134]:41720 "EHLO merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750970Ab2ALPSZ convert rfc822-to-8bit (ORCPT ); Thu, 12 Jan 2012 10:18:25 -0500 Message-ID: <1326381492.2442.188.camel@twins> Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context From: Peter Zijlstra To: Mel Gorman Cc: Linux-MM , Linux-FSDevel , LKML , Andrew Morton , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen Date: Thu, 12 Jan 2012 16:18:12 +0100 In-Reply-To: <1326276668-19932-3-git-send-email-mgorman@suse.de> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT X-Mailer: Evolution 3.2.1- Mime-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 2012-01-11 at 10:11 +0000, Mel Gorman wrote: > At least one bug report has > been seen on ppc64 against a 3.0 era kernel that looked like a bug > receiving interrupts on a CPU being offlined. Got details on that Mel? The preempt_disable() in on_each_cpu() should serialize against the stop_machine() crap in unplug. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754036Ab2ALPhX (ORCPT ); Thu, 12 Jan 2012 10:37:23 -0500 Received: from cantor2.suse.de ([195.135.220.15]:38592 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753977Ab2ALPhV (ORCPT ); Thu, 12 Jan 2012 10:37:21 -0500 Date: Thu, 12 Jan 2012 15:37:12 +0000 From: Mel Gorman To: Peter Zijlstra Cc: Linux-MM , Linux-FSDevel , LKML , Andrew Morton , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Message-ID: <20120112153712.GL4118@suse.de> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> <1326381492.2442.188.camel@twins> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1326381492.2442.188.camel@twins> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jan 12, 2012 at 04:18:12PM +0100, Peter Zijlstra wrote: > On Wed, 2012-01-11 at 10:11 +0000, Mel Gorman wrote: > > At least one bug report has > > been seen on ppc64 against a 3.0 era kernel that looked like a bug > > receiving interrupts on a CPU being offlined. > > Got details on that Mel? The preempt_disable() in on_each_cpu() should > serialize against the stop_machine() crap in unplug. I might have added 2 and 2 together and got 5. The stack trace clearly was while sending IPIs in on_each_cpu() and always when under memory pressure and stuck in direct reclaim. This was on !PREEMPT kernels where preempt_disable() is a no-op. That is why I thought get_online_cpu() would be necessary. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754706Ab2ALRS6 (ORCPT ); Thu, 12 Jan 2012 12:18:58 -0500 Received: from cantor2.suse.de ([195.135.220.15]:42823 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754089Ab2ALRSy (ORCPT ); Thu, 12 Jan 2012 12:18:54 -0500 Date: Thu, 12 Jan 2012 17:18:47 +0000 From: Mel Gorman To: Peter Zijlstra Cc: Linux-MM , Linux-FSDevel , LKML , Andrew Morton , "Srivatsa S. Bhat" , Russell King - ARM Linux , Gilad Ben-Yossef , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Message-ID: <20120112171847.GN4118@suse.de> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> <1326381492.2442.188.camel@twins> <20120112153712.GL4118@suse.de> <1326383551.2442.203.camel@twins> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1326383551.2442.203.camel@twins> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jan 12, 2012 at 04:52:31PM +0100, Peter Zijlstra wrote: > On Thu, 2012-01-12 at 15:37 +0000, Mel Gorman wrote: > > On Thu, Jan 12, 2012 at 04:18:12PM +0100, Peter Zijlstra wrote: > > > On Wed, 2012-01-11 at 10:11 +0000, Mel Gorman wrote: > > > > At least one bug report has > > > > been seen on ppc64 against a 3.0 era kernel that looked like a bug > > > > receiving interrupts on a CPU being offlined. > > > > > > Got details on that Mel? The preempt_disable() in on_each_cpu() should > > > serialize against the stop_machine() crap in unplug. > > > > I might have added 2 and 2 together and got 5. > > > > The stack trace clearly was while sending IPIs in on_each_cpu() and > > always when under memory pressure and stuck in direct reclaim. This was > > on !PREEMPT kernels where preempt_disable() is a no-op. That is why I > > thought get_online_cpu() would be necessary. > > For non-preempt the required scheduling of stop_machine() will have to > wait even longer. Still there might be something funny, some of the > hotplug notifiers are ran before the stop_machine thing does its thing > so there might be some fun interaction. Ok, how about this as a replacement patch? ---8<--- From: Mel Gorman Subject: [PATCH] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context While running a CPU hotplug stress test under memory pressure, it was observed that the machine would halt with no messages logged to console. This is difficult to trigger and required a machine with 8 cores and plenty of memory. In at least one case on ppc64, the warning in include/linux/cpumask.h:107 triggered implying that IPIs are being sent to offline CPUs in some cases. A suspicious part of the problem is that the page allocator is sending IPIs using on_each_cpu() without calling get_online_cpus() to prevent changes to the online cpumask. It is depending on preemption being disabled to protect it which is a no-op on !PREEMPT kernels. This means that a thread can be reading the mask in smp_call_function_many() when an attempt is made to take a CPU offline. The expectation is that this is not a problem as the stop_machine() used during CPU hotplug should be able to prevent any problems as the reader of the online mask will prevent stop_machine making forward progress but it's unhelpful. On the other side, the mask can also be read while the CPU is being brought online. In this case it is the responsibility of the architecture that the CPU is able to receive and handle interrupts before being marked active but that does not mean they always get it right. Sending excessive IPIs from the page allocator is a bad idea. In low memory situations, a large number of processes can drain the per-cpu lists at the same time, in quick succession and on many CPUs which is pointless. In light of this and the unspecific CPU hotplug concerns, this patch removes the call drain_all_pages() after failing direct reclaim. To avoid impacting high-order allocation success rates, it still drains the local per-cpu lists for high-order allocations that failed. Signed-off-by: Mel Gorman --- mm/page_alloc.c | 10 ++++++---- 1 files changed, 6 insertions(+), 4 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2b8ba3a..63ea182 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1982,11 +1982,13 @@ retry: migratetype); /* - * If an allocation failed after direct reclaim, it could be because - * pages are pinned on the per-cpu lists. Drain them and try again + * If a high-order allocation failed after direct reclaim, there is a + * possibility that it is because the necessary buddies have been + * freed to the per-cpu list. Drain the local list and try again. */ - if (!page && !drained) { - drain_all_pages(); + if (!page && order && !drained) { + drain_pages(get_cpu()); + put_cpu(); drained = true; goto retry; } From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754920Ab2ALTOq (ORCPT ); Thu, 12 Jan 2012 14:14:46 -0500 Received: from mail-vw0-f46.google.com ([209.85.212.46]:47593 "EHLO mail-vw0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752873Ab2ALTOm convert rfc822-to-8bit (ORCPT ); Thu, 12 Jan 2012 14:14:42 -0500 MIME-Version: 1.0 X-Originating-IP: [89.138.86.114] In-Reply-To: <20120112171847.GN4118@suse.de> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> <1326381492.2442.188.camel@twins> <20120112153712.GL4118@suse.de> <1326383551.2442.203.camel@twins> <20120112171847.GN4118@suse.de> Date: Thu, 12 Jan 2012 21:14:41 +0200 Message-ID: Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context From: Gilad Ben-Yossef To: Mel Gorman Cc: Peter Zijlstra , Linux-MM , Linux-FSDevel , LKML , Andrew Morton , "Srivatsa S. Bhat" , Russell King - ARM Linux , "Paul E. McKenney" , Miklos Szeredi , "Eric W. Biederman" , Greg KH , Gong Chen Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jan 12, 2012 at 7:18 PM, Mel Gorman wrote: > On Thu, Jan 12, 2012 at 04:52:31PM +0100, Peter Zijlstra wrote: >> On Thu, 2012-01-12 at 15:37 +0000, Mel Gorman wrote: >> > On Thu, Jan 12, 2012 at 04:18:12PM +0100, Peter Zijlstra wrote: >> > > On Wed, 2012-01-11 at 10:11 +0000, Mel Gorman wrote: >> > > > At least one bug report has >> > > > been seen on ppc64 against a 3.0 era kernel that looked like a bug >> > > > receiving interrupts on a CPU being offlined. >> > > >> > > Got details on that Mel? The preempt_disable() in on_each_cpu() should >> > > serialize against the stop_machine() crap in unplug. >> > >> > I might have added 2 and 2 together and got 5. >> > >> > The stack trace clearly was while sending IPIs in on_each_cpu() and >> > always when under memory pressure and stuck in direct reclaim. This was >> > on !PREEMPT kernels where preempt_disable() is a no-op. That is why I >> > thought get_online_cpu() would be necessary. >> >> For non-preempt the required scheduling of stop_machine() will have to >> wait even longer. Still there might be something funny, some of the >> hotplug notifiers are ran before the stop_machine thing does its thing >> so there might be some fun interaction. > > Ok, how about this as a replacement patch? > > ---8<--- > From: Mel Gorman > Subject: [PATCH] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context > > While running a CPU hotplug stress test under memory pressure, it > was observed that the machine would halt with no messages logged > to console. This is difficult to trigger and required a machine > with 8 cores and plenty of memory. In at least one case on ppc64, > the warning in include/linux/cpumask.h:107 triggered implying that > IPIs are being sent to offline CPUs in some cases. > > A suspicious part of the problem is that the page allocator is sending > IPIs using on_each_cpu() without calling get_online_cpus() to prevent > changes to the online cpumask. It is depending on preemption being > disabled to protect it which is a no-op on !PREEMPT kernels. This means > that a thread can be reading the mask in smp_call_function_many() when > an attempt is made to take a CPU offline. The expectation is that this > is not a problem as the stop_machine() used during CPU hotplug should > be able to prevent any problems as the reader of the online mask will > prevent stop_machine making forward progress but it's unhelpful. > > On the other side, the mask can also be read while the CPU is being > brought online. In this case it is the responsibility of the > architecture that the CPU is able to receive and handle interrupts > before being marked active but that does not mean they always get it > right. > > Sending excessive IPIs from the page allocator is a bad idea. In low > memory situations, a large number of processes can drain the per-cpu > lists at the same time, in quick succession and on many CPUs which is > pointless. In light of this and the unspecific CPU hotplug concerns, > this patch removes the call drain_all_pages() after failing direct > reclaim. To avoid impacting high-order allocation success rates, > it still drains the local per-cpu lists for high-order allocations > that failed. > > Signed-off-by: Mel Gorman > --- >  mm/page_alloc.c |   10 ++++++---- >  1 files changed, 6 insertions(+), 4 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 2b8ba3a..63ea182 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1982,11 +1982,13 @@ retry: >                                        migratetype); > >        /* > -        * If an allocation failed after direct reclaim, it could be because > -        * pages are pinned on the per-cpu lists. Drain them and try again > +        * If a high-order allocation failed after direct reclaim, there is a > +        * possibility that it is because the necessary buddies have been > +        * freed to the per-cpu list. Drain the local list and try again. >         */ > -       if (!page && !drained) { > -               drain_all_pages(); > +       if (!page && order && !drained) { > +               drain_pages(get_cpu()); > +               put_cpu(); >                drained = true; >                goto retry; >        } I like the patch, think it is better then current code and want it to go in. I also think there is still some problems with IPIs somewhere that cause some corruption when a lot of IPIs are sent and that the patch simply lowered the very big number of IPIs that are sent via the direct reclaim code path so the problem is hidden, not solved by this patch. I've seen something related when trying to test the IPI reduction patches. Interesting enough it was not related to CPU hotplug at all - When a lot of IPIs are being sent, I sometime got an assert from low level platform code that I'm trying to send IPIs with an empty mask although the mask was NOT empty. I didn't manage to debug it then but I did manage to recreate it quite easily. I will see if I can recreate it with recent master and report. -- Gilad Ben-Yossef Chief Coffee Drinker gilad@benyossef.com Israel Cell: +972-52-8260388 US Cell: +1-973-8260388 http://benyossef.com "Unfortunately, cache misses are an equal opportunity pain provider." -- Mike Galbraith, LKML From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759187Ab2AMU6u (ORCPT ); Fri, 13 Jan 2012 15:58:50 -0500 Received: from mail4.comsite.net ([205.238.176.238]:15487 "EHLO mail4.comsite.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753739Ab2AMU6p (ORCPT ); Fri, 13 Jan 2012 15:58:45 -0500 X-Default-Received-SPF: pass (skip=forwardok (res=PASS)) x-ip-name=70.249.75.62; From: Milton Miller Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context To: "Gilad Ben-Yossef" , Mel Gorman Cc: , Peter Zijlstra Message-Id: In-Reply-To: References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> <1326381492.2442.188.camel@twins> <20120112153712.GL4118@suse.de> <1326383551.2442.203.camel@twins> <20120112171847.GN4118@suse.de> Date: Fri, 13 Jan 2012 14:58:39 -0600 X-Originating-IP: 70.249.75.62 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 12 Jan 2012 about 14:14:57 -0500 (EST), Gilad Ben-Yossef wrote: > I also think there is still some problems with IPIs somewhere that > cause some corruption when a lot of IPIs are sent and that > the patch simply lowered the very big number of IPIs that are sent > via the direct reclaim code path so the problem > is hidden, not solved by this patch. > > I've seen something related when trying to test the IPI reduction > patches. Interesting enough it was not related to CPU hotplug at all - > When a lot of IPIs are being sent, I sometime got an assert from low > level platform code that I'm trying to send IPIs with an empty mask > although the mask was NOT empty. I didn't manage to debug it then but > I did manage to recreate it quite easily. That is a known scenario and expected race, the check must be removed from the platform code. > > I will see if I can recreate it with recent master and report. > The code in kernel/smp checks the mask, and looks for opportunities to convert smp_call_function_many to call_function_single. But it now tolerates callers passing a mask that other cpus are changing. So while it tries to make sure the mask has > 1 cpu, its not guaranteed. But that is not what causes the assert to fire. There are a few opportunities for the architecture code to detect an empty mask. The first case is a cpu started processing the list due to an interrupt generated by a cpu other than the requester between the time the requester put its work element in the list and when it executed the architecture code to send the interrupt. A second case is a work element is deleted by a third cpu as its ref count goes to zero and then that element gets reused by the owning cpu, so we process part of the request list twice. The solution is to tell the architecture code its not an error for the mask to be empty. Here is a walk though of how smp_call_function_many works. Each cpu has a queue element in per-cpu space that can be placed on the queue. When a cpu calls smp_call_function_many, it does a quick check to see if it could be converted to smp_call_function_single, but at this point other cpus can still be changing the mask (we only guarantee cpus that remain in the mask from the start of the call through the list copy will be called). Once it decides multiple cpus are needed, it first waits for its per-cpu data to become free (if the previous caller said not to wait, it could still be in use). After its free it copies the mask supplied by the caller to its per-cpu data work element. At that point it removes itself from the list then counts the bits. If the count is 0 it is done, but otherwise it obtains the lock, adds its work to the front of the global list, and stores the count of cpus. Starting at this point other cpus could start doing the work requested, even though they have not received an interrupt yet, nor have we unlocked the list manipulation lock). The requesting cpu proceeds to unlock the list and then calls the architecture code to request all cpus in the mask be notified there is work for them. The low level code does its processing and then iterates over the list sending IPIs to each cpu that is still in the mask in the work element. If the caller requested to wait for all (other) cpus to perform the work before returning, then the requesting cpu waits for its per-cpu-data work request element be unlocked by the last cpu who does the work, after it has removed it from the list. Whenever a cpu gets an IPI for generic_call_function_interrupt, it starts at the top of the list and works its way down. If a given element has the cpu's bit set in the cpu mask in the work element, then there will be work to do in the future. It then proceeds to check references, to make sure the element has been placed on the list and is allowed to be processed. If not, we are guaranteed the requester has not unlocked the list and therefore has not called the architecture notifier, so the cpu will get another interrupt in the future and proceeds to the next element. If the bit is set and references are set, then the code will execute the requested function with the supplied data argument. This function must not enable interrupts! (There is a defensive check, but that will only trigger if we take another call to smp_call_function_interrupt.) After the call, the cpu removes itself from the cpu mask and decrements the ref count. If it removes the last reference, it grabs the list manipulation lock, removes the element with list_del_rcu, unlock the list (flushing the list manipulation writes), and then unlocks the element, allowing the requesting cpu to reuse the element. It then goes on to the next element. If the cpu suddenly becomes slow here and the element owning cpu see the unlock and is able to reuse the element, our traversal of next will lead us to the start of the list and we will check some entries an additional time. This means we can see entries that were added to the list after we started processing the list before we get the interrupt. Eventually we wrap back to the head of the list and return from the interrupt. It would be easy to write some debugging assertions for the list, for instance: After grabbing the list walk it and mark which cpus per_cpu data is in the list. Then still under the lock verify (1) if refs is set the element is on the list (2) refs is <= the number of cpus in the cpumask (3) any element with refs set is under csd_lock. Things that are valid tranisent states: (1) refs < bits in mask (short transient), (2) refs = 0 but on list (waiting for lock to remove) (3) off list but csd_lock (either short trasnient or owning cpu preparing or waiting to put it on the list). For added saftey under the lock we could count the traversals it should never take more than one above nr_cpu_ids (or num_possible_cpus() for that matter) to get back to the list head. We could also put an assert type check before taking a cpu dead that would check all per_cpu copies that our cpu bit is not set in any cpus element. It should never be set even transiently if we are offline. milton From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932910Ab2AMU6s (ORCPT ); Fri, 13 Jan 2012 15:58:48 -0500 Received: from mail4.comsite.net ([205.238.176.238]:48740 "EHLO mail4.comsite.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753981Ab2AMU6p (ORCPT ); Fri, 13 Jan 2012 15:58:45 -0500 X-Default-Received-SPF: pass (skip=forwardok (res=PASS)) x-ip-name=70.249.75.62; From: Milton Miller Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context To: Mel Gorman , Gilad Ben-Yossef Cc: , Peter Zijlstra Message-Id: In-Reply-To: <20120112171847.GN4118@suse.de> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> <1326381492.2442.188.camel@twins> <20120112153712.GL4118@suse.de> <1326383551.2442.203.camel@twins> <20120112171847.GN4118@suse.de> Date: Fri, 13 Jan 2012 14:58:39 -0600 X-Originating-IP: 70.249.75.62 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu Jan 12 2012 about 12:18:59 EST, Mel Gorman wrote: > On Thu, Jan 12, 2012 at 04:52:31PM +0100, Peter Zijlstra wrote: > > On Thu, 2012-01-12 at 15:37 +0000, Mel Gorman wrote: > > > On Thu, Jan 12, 2012 at 04:18:12PM +0100, Peter Zijlstra wrote: > > > > On Wed, 2012-01-11 at 10:11 +0000, Mel Gorman wrote: > > > > > At least one bug report has > > > > > been seen on ppc64 against a 3.0 era kernel that looked like a bug > > > > > receiving interrupts on a CPU being offlined. > > > > > > > > Got details on that Mel? The preempt_disable() in on_each_cpu() should > > > > serialize against the stop_machine() crap in unplug. > > > > > > I might have added 2 and 2 together and got 5. > > > > > > The stack trace clearly was while sending IPIs in on_each_cpu() and > > > always when under memory pressure and stuck in direct reclaim. This was > > > on !PREEMPT kernels where preempt_disable() is a no-op. That is why I > > > thought get_online_cpu() would be necessary. Well, stop_machine has to be selected by the scheduler, so we have to get back and call schedule() at some point to switch to that thread. .. unless it is the one allocating memory. > > > > For non-preempt the required scheduling of stop_machine() will have to > > wait even longer. Still there might be something funny, some of the > > hotplug notifiers are ran before the stop_machine thing does its thing > > so there might be some fun interaction. > > Ok, how about this as a replacement patch? > > ---8<--- > From: Mel Gorman > Subject: [PATCH] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context > > While running a CPU hotplug stress test under memory pressure, it > was observed that the machine would halt with no messages logged > to console. This is difficult to trigger and required a machine > with 8 cores and plenty of memory. In at least one case on ppc64, > the warning in include/linux/cpumask.h:107 triggered implying that > IPIs are being sent to offline CPUs in some cases. That is WARN_ON_ONCE(cpu >= nr_cpumask_bits); That has nothing to do with cpus going offline! nr_cpumask_bits is set during boot based on cpu_possible_mask. If you see that triggered its a direct bug in the caller. Either its looking at random memory in a NR_CPUS loop or its assuming that there is another cpu in a mask and not checking for a cpu_mask_next returning nr_cpumask_bits. Again it has nothing to do with hotplug (unless its assuming there are n online cpus in a loop instead of looking at the return value of the function). > > A suspicious part of the problem is that the page allocator is sending > IPIs using on_each_cpu() without calling get_online_cpus() to prevent > changes to the online cpumask. It is depending on preemption being > disabled to protect it which is a no-op on !PREEMPT kernels. This means > that a thread can be reading the mask in smp_call_function_many() when > an attempt is made to take a CPU offline. The expectation is that this > is not a problem as the stop_machine() used during CPU hotplug should > be able to prevent any problems as the reader of the online mask will > prevent stop_machine making forward progress but it's unhelpful. And without CONFIG_PREEMPT, we won't be able to schedule away from the current task over to the stop_machine (migration/NN) thread. > > On the other side, the mask can also be read while the CPU is being > brought online. In this case it is the responsibility of the > architecture that the CPU is able to receive and handle interrupts > before being marked active but that does not mean they always get it > right. yes. See my other reply for some things we can to to help find bugs with smp_call_function_many (and on_each_cpu). > > Sending excessive IPIs from the page allocator is a bad idea. In low > memory situations, a large number of processes can drain the per-cpu > lists at the same time, in quick succession and on many CPUs which is > pointless. In light of this and the unspecific CPU hotplug concerns, > this patch removes the call drain_all_pages() after failing direct > reclaim. To avoid impacting high-order allocation success rates, > it still drains the local per-cpu lists for high-order allocations > that failed. "There is some bug somewhere. This seems like a big slow pain I don't think this is likely to have much impact but if I am wrong we will just OOM early." vs Gilad's "Lets reduce the pain of this slow path by doing just the required work". Lets find the real bug. milton From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753182Ab2AONx5 (ORCPT ); Sun, 15 Jan 2012 08:53:57 -0500 Received: from mail-vx0-f174.google.com ([209.85.220.174]:60429 "EHLO mail-vx0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752184Ab2AONxz convert rfc822-to-8bit (ORCPT ); Sun, 15 Jan 2012 08:53:55 -0500 MIME-Version: 1.0 X-Originating-IP: [212.179.42.66] In-Reply-To: References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> <1326381492.2442.188.camel@twins> <20120112153712.GL4118@suse.de> <1326383551.2442.203.camel@twins> <20120112171847.GN4118@suse.de> Date: Sun, 15 Jan 2012 15:53:53 +0200 Message-ID: Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context From: Gilad Ben-Yossef To: Milton Miller Cc: Mel Gorman , linux-kernel@vger.kernel.org, Peter Zijlstra Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 13, 2012 at 10:58 PM, Milton Miller wrote: > On Thu, 12 Jan 2012 about 14:14:57 -0500 (EST), Gilad Ben-Yossef wrote: >> I also think there is still some problems with IPIs somewhere that >> cause some corruption when a lot of IPIs are sent and that >> the patch simply lowered the very big number of IPIs that are sent >> via the direct reclaim code path so the problem >> is hidden, not solved by this patch. >> >> I've seen something related when trying to test the IPI reduction >> patches. Interesting enough it was not related to CPU hotplug at all - > >> When a lot of IPIs are being sent, I sometime got an assert from low >> level platform code that I'm trying to send IPIs with an empty mask >> although the mask was NOT empty. I didn't manage to debug it then but >> I did manage to recreate it quite easily. > > That is a known scenario and expected race, the check must be removed > from the platform code. OK. > >> >> I will see if I can recreate it with recent master and report. >> > > The code in kernel/smp checks the mask, and looks for opportunities to > convert smp_call_function_many to call_function_single.  But it now > tolerates callers passing a mask that other cpus are changing.  So > while it tries to make sure the mask has > 1 cpu, its not guaranteed. Yes, that much I figured out already, even counted on this behavior. > But that is not what causes the assert to fire. Yes, my test case involves on_each_cpu with hotplug disabled. > > There are a few opportunities for the architecture code to detect an > empty mask.  The first case is a cpu started processing the list due > to an interrupt generated by a cpu other than the requester between > the time the requester put its work element in the list and when it > executed the architecture code to send the interrupt.  A second > case is a work element is deleted by a third cpu as its ref count > goes to zero and then that element gets reused by the owning cpu, > so we process part of the request list twice. > OK, that makes sense. > The solution is to tell the architecture code its not an error for > the mask to be empty. I believe this patch is in order then: diff --git a/arch/x86/kernel/apic/ipi.c b/arch/x86/kernel/apic/ipi.c index cce91bf..00b68a3 100644 --- a/arch/x86/kernel/apic/ipi.c +++ b/arch/x86/kernel/apic/ipi.c @@ -106,7 +106,10 @@ void default_send_IPI_mask_logical(const struct cpumask *cpumask, int vector) unsigned long mask = cpumask_bits(cpumask)[0]; unsigned long flags; - if (WARN_ONCE(!mask, "empty IPI mask")) + if (!mask) + /* The target CPUs must have already processed the + * work items due to a concurrent IPI + */ return; local_irq_save(flags); Signed-off-by: Gilad Ben-Yossef Here is a walk though of how smp_call_function_many works. > the interrupt.  Eventually we wrap back to the head of the list and > return from the interrupt. Would you mind if I pasted the above into Documentation/ipi.txt with your name on it? It could save the next potential bloke trying to get his head around IPIs a lot of head scratching... Also, it means I have no idea about the original problem with hotplug :-) Gilad -- Gilad Ben-Yossef Chief Coffee Drinker gilad@benyossef.com Israel Cell: +972-52-8260388 US Cell: +1-973-8260388 http://benyossef.com "Unfortunately, cache misses are an equal opportunity pain provider." -- Mike Galbraith, LKML From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932553Ab2ASQVH (ORCPT ); Thu, 19 Jan 2012 11:21:07 -0500 Received: from cantor2.suse.de ([195.135.220.15]:45459 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932526Ab2ASQVC (ORCPT ); Thu, 19 Jan 2012 11:21:02 -0500 Date: Thu, 19 Jan 2012 16:20:57 +0000 From: Mel Gorman To: Milton Miller Cc: Gilad Ben-Yossef , linux-kernel@vger.kernel.org, Peter Zijlstra Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context Message-ID: <20120119162057.GD3143@suse.de> References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> <1326381492.2442.188.camel@twins> <20120112153712.GL4118@suse.de> <1326383551.2442.203.camel@twins> <20120112171847.GN4118@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 13, 2012 at 02:58:39PM -0600, Milton Miller wrote: > > > > The stack trace clearly was while sending IPIs in on_each_cpu() and > > > > always when under memory pressure and stuck in direct reclaim. This was > > > > on !PREEMPT kernels where preempt_disable() is a no-op. That is why I > > > > thought get_online_cpu() would be necessary. > > Well, stop_machine has to be selected by the scheduler, so we have to > get back and call schedule() at some point to switch to that thread. > .. unless it is the one allocating memory. > > > > > > > For non-preempt the required scheduling of stop_machine() will have to > > > wait even longer. Still there might be something funny, some of the > > > hotplug notifiers are ran before the stop_machine thing does its thing > > > so there might be some fun interaction. > > > > Ok, how about this as a replacement patch? > > > > ---8<--- > > From: Mel Gorman > > Subject: [PATCH] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context > > > > While running a CPU hotplug stress test under memory pressure, it > > was observed that the machine would halt with no messages logged > > to console. This is difficult to trigger and required a machine > > with 8 cores and plenty of memory. In at least one case on ppc64, > > the warning in include/linux/cpumask.h:107 triggered implying that > > IPIs are being sent to offline CPUs in some cases. > > That is > WARN_ON_ONCE(cpu >= nr_cpumask_bits); > > That has nothing to do with cpus going offline! > You're right. This particular warning was caused by xmon starting up while it was handling another exception. There was not enough information in the bug to tell exactly what caused the original exception. A CPU was being offlined at the time and the system was under memory pressure but they are not necessarily related. I'll drop this from the changelog. > nr_cpumask_bits is set during boot based on cpu_possible_mask. If you > see that triggered its a direct bug in the caller. Either its looking > at random memory in a NR_CPUS loop or its assuming that there is > another cpu in a mask and not checking for a cpu_mask_next returning > nr_cpumask_bits. > > Again it has nothing to do with hotplug (unless its assuming there are > n online cpus in a loop instead of looking at the return value of > the function). > xmon does some funny things around num_online_cpus() but tracking down the internals of xmon is not useful. A more serious problem had already triggered if xmon was starting at all. > > A suspicious part of the problem is that the page allocator is sending > > IPIs using on_each_cpu() without calling get_online_cpus() to prevent > > changes to the online cpumask. It is depending on preemption being > > disabled to protect it which is a no-op on !PREEMPT kernels. This means > > that a thread can be reading the mask in smp_call_function_many() when > > an attempt is made to take a CPU offline. The expectation is that this > > is not a problem as the stop_machine() used during CPU hotplug should > > be able to prevent any problems as the reader of the online mask will > > prevent stop_machine making forward progress but it's unhelpful. > > And without CONFIG_PREEMPT, we won't be able to schedule away from the > current task over to the stop_machine (migration/NN) thread. > On a different x86-64 machines with an intel-specific MCE, I have also noted that the value of num_online_cpus() can change while stop_machine() is running. This is sensitive to timing and part of the problem seems to be due to cmci_rediscover() running without the CPU hotplug mutex held. This is not related to the IPI mess and is unrelated to memory pressure but is just to note that CPU hotplug in general can be fragile in parts. > > On the other side, the mask can also be read while the CPU is being > > brought online. In this case it is the responsibility of the > > architecture that the CPU is able to receive and handle interrupts > > before being marked active but that does not mean they always get it > > right. > > yes. See my other reply for some things we can to to help find bugs > with smp_call_function_many (and on_each_cpu). > Thanks for that. > > Sending excessive IPIs from the page allocator is a bad idea. In low > > memory situations, a large number of processes can drain the per-cpu > > lists at the same time, in quick succession and on many CPUs which is > > pointless. In light of this and the unspecific CPU hotplug concerns, > > this patch removes the call drain_all_pages() after failing direct > > reclaim. To avoid impacting high-order allocation success rates, > > it still drains the local per-cpu lists for high-order allocations > > that failed. > > "There is some bug somewhere. This seems like a big slow pain I > don't think this is likely to have much impact but if I am wrong we > will just OOM early." vs Gilad's "Lets reduce the pain of this slow > path by doing just the required work". > > Lets find the real bug. > When the underlying bug related to IPIs is isolated and fixed, the patch still made sense. As noted in the changelog of the latest version, high-order allocations are still draining the local list to minimise impact. For order-0 allocations failing in this situation, we must be already on the min watermark obviously. For an IPI to help avoid an OOM for an order-0 allocation, we would need enough pages on the per-cpu lists to meet the watermark with no other allocation/freeing activity on those CPUs and that the process trying to allocate memory never gets scheduled on another CPU. This is a very improbable combination of events which is why I don't think the patch makes any difference to going OOM early. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754915Ab2ASVrS (ORCPT ); Thu, 19 Jan 2012 16:47:18 -0500 Received: from e28smtp01.in.ibm.com ([122.248.162.1]:58904 "EHLO e28smtp01.in.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753757Ab2ASVrO (ORCPT ); Thu, 19 Jan 2012 16:47:14 -0500 Message-ID: <4F188F52.1060303@linux.vnet.ibm.com> Date: Fri, 20 Jan 2012 03:16:58 +0530 From: "Srivatsa S. Bhat" User-Agent: Mozilla/5.0 (X11; Linux i686; rv:7.0) Gecko/20110927 Thunderbird/7.0 MIME-Version: 1.0 To: Mel Gorman CC: Milton Miller , Gilad Ben-Yossef , linux-kernel@vger.kernel.org, Peter Zijlstra , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, "akpm@linux-foundation.org" , Russell King - ARM Linux , "Paul E. McKenney" , mszeredi@novell.com, ebiederm@xmission.com, Greg Kroah-Hartman , gong.chen@intel.com, Tony Luck , Borislav Petkov , "tglx@linutronix.de" , "mingo@redhat.com" , "hpa@zytor.com" , "x86@kernel.org" , linux-edac@vger.kernel.org, Andi Kleen Subject: Re: [PATCH 2/2] mm: page allocator: Do not drain per-cpu lists via IPI from page allocator context References: <1326276668-19932-1-git-send-email-mgorman@suse.de> <1326276668-19932-3-git-send-email-mgorman@suse.de> <1326381492.2442.188.camel@twins> <20120112153712.GL4118@suse.de> <1326383551.2442.203.camel@twins> <20120112171847.GN4118@suse.de> <20120119162057.GD3143@suse.de> In-Reply-To: <20120119162057.GD3143@suse.de> Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit x-cbid: 12011921-4790-0000-0000-000000F461CD Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org [Reinstating the original Cc list] On 01/19/2012 09:50 PM, Mel Gorman wrote:> > On a different x86-64 machines with an intel-specific MCE, I have > also noted that the value of num_online_cpus() can change while > stop_machine() is running. That is expected and intentional right? Meaning, it is during the stop_machine() thing itself that a CPU is actually taken offline. And at the same time, it is removed from the cpu_online_mask. On Intel boxes, essentially, the following gets executed on the dying CPU, as set up by the stop_machine stuff. __cpu_disable() native_cpu_disable() cpu_disable_common() remove_cpu_from_maps() set_cpu_online(cpu, false) ^^^^^^ So, set_cpu_online will remove this CPU from the cpu_online_mask. And all this runs while still under the stop machine context. And this is exactly what we want right? > This is sensitive to timing and part of > the problem seems to be due to cmci_rediscover() running without the > CPU hotplug mutex held. This is not related to the IPI mess and is > unrelated to memory pressure but is just to note that CPU hotplug in > general can be fragile in parts. > For the cmci_rediscover() part, I feel a simple get/put_online_cpus() around it should work. Something like the following patch? (It is untested at the moment, but I will run it later and see if it works well). I would like the opinion of MCE/Intel maintainers as to whether this is a proper fix or something else would have been better.. ---- From: Srivatsa S. Bhat Subject: [PATCH] x86/intel mce: Fix race with CPU hotplug in cmci_rediscover() cmci_rediscover() is invoked upon the CPU_POST_DEAD notification, when the cpu_hotplug lock is no longer held. And cmci_rediscover() iterates over all the online cpus. So this can race with an ongoing CPU hotplug operation. Fix this by wrapping the iteration code within the pair get_online_cpus() / put_online_cpus(). Reported-by: Mel Gorman Signed-off-by: Srivatsa S. Bhat --- arch/x86/kernel/cpu/mcheck/mce_intel.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c index 38e49bc..1c30397 100644 --- a/arch/x86/kernel/cpu/mcheck/mce_intel.c +++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include #include @@ -179,6 +180,7 @@ void cmci_rediscover(int dying) return; cpumask_copy(old, ¤t->cpus_allowed); + get_online_cpus(); for_each_online_cpu(cpu) { if (cpu == dying) continue; @@ -188,6 +190,7 @@ void cmci_rediscover(int dying) if (cmci_supported(&banks)) cmci_discover(banks, 0); } + put_online_cpus(); set_cpus_allowed_ptr(current, old); free_cpumask_var(old);