Re: [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Andrew Morton <akpm@linux-foundation.org>
To: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Mel Gorman <mel@csn.ul.ie>, Shaohua Li <shaohua.li@intel.com>,
	Christoph Lameter <cl@linux.com>,
	David Rientjes <rientjes@google.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>
Subject: Re: [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu
Date: Tue, 16 Nov 2010 16:07:20 -0800	[thread overview]
Message-ID: <20101116160720.5244ea22.akpm@linux-foundation.org> (raw)
In-Reply-To: <20101114163727.BEE0.A69D9226@jp.fujitsu.com>

On Sun, 14 Nov 2010 17:53:03 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > > @@ -159,6 +165,44 @@ static void refresh_zone_stat_thresholds(void)
> > >  	}
> > >  }
> > >  
> > > +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat)
> > > +{
> > > +	struct zone *zone;
> > > +	int cpu;
> > > +	int threshold;
> > > +	int i;
> > > +
> > 
> > get_online_cpus();
> 
> 
> This caused following runtime warnings. but I don't think here is
> real lock inversion. 
> 
> =================================
> [ INFO: inconsistent lock state ]
> 2.6.37-rc1-mm1+ #150
> ---------------------------------
> inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> kswapd0/419 [HC0[0]:SC0[0]:HE1:SE1] takes:
>  (cpu_hotplug.lock){+.+.?.}, at: [<ffffffff810520d1>] get_online_cpus+0x41/0x60
> {RECLAIM_FS-ON-W} state was registered at:
>   [<ffffffff8108a1a3>] mark_held_locks+0x73/0xa0
>   [<ffffffff8108a296>] lockdep_trace_alloc+0xc6/0x100
>   [<ffffffff8113fba9>] kmem_cache_alloc+0x39/0x2b0
>   [<ffffffff812eea10>] idr_pre_get+0x60/0x90
>   [<ffffffff812ef5b7>] ida_pre_get+0x27/0xf0
>   [<ffffffff8106ebf5>] create_worker+0x55/0x190
>   [<ffffffff814fb4f4>] workqueue_cpu_callback+0xbc/0x235
>   [<ffffffff8151934c>] notifier_call_chain+0x8c/0xe0
>   [<ffffffff8107a34e>] __raw_notifier_call_chain+0xe/0x10
>   [<ffffffff81051f30>] __cpu_notify+0x20/0x40
>   [<ffffffff8150bff7>] _cpu_up+0x73/0x113
>   [<ffffffff8150c175>] cpu_up+0xde/0xf1
>   [<ffffffff81dcc81d>] kernel_init+0x21b/0x342
>   [<ffffffff81003724>] kernel_thread_helper+0x4/0x10
> irq event stamp: 27
> hardirqs last  enabled at (27): [<ffffffff815152c0>] _raw_spin_unlock_irqrestore+0x40/0x80
> hardirqs last disabled at (26): [<ffffffff81514982>] _raw_spin_lock_irqsave+0x32/0xa0
> softirqs last  enabled at (20): [<ffffffff810614c4>] del_timer_sync+0x54/0xa0
> softirqs last disabled at (18): [<ffffffff8106148c>] del_timer_sync+0x1c/0xa0
> 
> other info that might help us debug this:
> no locks held by kswapd0/419.
> 
> stack backtrace:
> Pid: 419, comm: kswapd0 Not tainted 2.6.37-rc1-mm1+ #150
> Call Trace:
>  [<ffffffff810890b1>] print_usage_bug+0x171/0x180
>  [<ffffffff8108a057>] mark_lock+0x377/0x450
>  [<ffffffff8108ab67>] __lock_acquire+0x267/0x15e0
>  [<ffffffff8107af0f>] ? local_clock+0x6f/0x80
>  [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150
>  [<ffffffff8108bf94>] lock_acquire+0xb4/0x150
>  [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60
>  [<ffffffff81512cf4>] __mutex_lock_common+0x44/0x3f0
>  [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60
>  [<ffffffff810744f0>] ? prepare_to_wait+0x60/0x90
>  [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150
>  [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60
>  [<ffffffff810868bd>] ? trace_hardirqs_off+0xd/0x10
>  [<ffffffff8107af0f>] ? local_clock+0x6f/0x80
>  [<ffffffff815131a8>] mutex_lock_nested+0x48/0x60
>  [<ffffffff810520d1>] get_online_cpus+0x41/0x60
>  [<ffffffff811138b2>] set_pgdat_percpu_threshold+0x22/0xe0
>  [<ffffffff81113970>] ? calculate_normal_threshold+0x0/0x60
>  [<ffffffff8110b552>] kswapd+0x1f2/0x360
>  [<ffffffff81074180>] ? autoremove_wake_function+0x0/0x40
>  [<ffffffff8110b360>] ? kswapd+0x0/0x360
>  [<ffffffff81073ae6>] kthread+0xa6/0xb0
>  [<ffffffff81003724>] kernel_thread_helper+0x4/0x10
>  [<ffffffff81515710>] ? restore_args+0x0/0x30
>  [<ffffffff81073a40>] ? kthread+0x0/0xb0
>  [<ffffffff81003720>] ? kernel_thread_helper+0x0/0x10

Well what's actually happening here?  Where is the alleged deadlock?

In the kernel_init() case we have a GFP_KERNEL allocation inside
get_online_cpus().  In the other case we simply have kswapd calling
get_online_cpus(), yes?

Does lockdep consider all kswapd actions to be "in reclaim context"? 
If so, why?

> 
> I think we have two option 1) call lockdep_clear_current_reclaim_state()
> every time 2) use for_each_possible_cpu instead for_each_online_cpu.
> 
> Following patch use (2) beucase removing get_online_cpus() makes good
> side effect. It reduce potentially cpu-hotplug vs memory-shortage deadlock
> risk. 

Well.  Being able to run for_each_online_cpu() is a pretty low-level
and fundamental thing.  It's something we're likely to want to do more
and more of as time passes.  It seems a bad thing to tell ourselves
that we cannot use it in reclaim context.  That blots out large chunks
of filesystem and IO-layer code as well!

> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -193,18 +193,16 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,
>  	int threshold;
>  	int i;
>  
> -	get_online_cpus();
>  	for (i = 0; i < pgdat->nr_zones; i++) {
>  		zone = &pgdat->node_zones[i];
>  		if (!zone->percpu_drift_mark)
>  			continue;
>  
>  		threshold = (*calculate_pressure)(zone);
> -		for_each_online_cpu(cpu)
> +		for_each_possible_cpu(cpu)
>  			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
>  							= threshold;
>  	}
> -	put_online_cpus();
>  }

That's a pretty sad change IMO, especially of num_possible_cpus is much
larger than num_online_cpus.

What do we need to do to make get_online_cpus() safe to use in reclaim
context?  (And in kswapd context, if that's really equivalent to
"reclaim context").

WARNING: multiple messages have this Message-ID (diff)

From: Andrew Morton <akpm@linux-foundation.org>
To: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Mel Gorman <mel@csn.ul.ie>, Shaohua Li <shaohua.li@intel.com>,
	Christoph Lameter <cl@linux.com>,
	David Rientjes <rientjes@google.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>
Subject: Re: [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu
Date: Tue, 16 Nov 2010 16:07:20 -0800	[thread overview]
Message-ID: <20101116160720.5244ea22.akpm@linux-foundation.org> (raw)
In-Reply-To: <20101114163727.BEE0.A69D9226@jp.fujitsu.com>

On Sun, 14 Nov 2010 17:53:03 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > > @@ -159,6 +165,44 @@ static void refresh_zone_stat_thresholds(void)
> > >  	}
> > >  }
> > >  
> > > +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat)
> > > +{
> > > +	struct zone *zone;
> > > +	int cpu;
> > > +	int threshold;
> > > +	int i;
> > > +
> > 
> > get_online_cpus();
> 
> 
> This caused following runtime warnings. but I don't think here is
> real lock inversion. 
> 
> =================================
> [ INFO: inconsistent lock state ]
> 2.6.37-rc1-mm1+ #150
> ---------------------------------
> inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> kswapd0/419 [HC0[0]:SC0[0]:HE1:SE1] takes:
>  (cpu_hotplug.lock){+.+.?.}, at: [<ffffffff810520d1>] get_online_cpus+0x41/0x60
> {RECLAIM_FS-ON-W} state was registered at:
>   [<ffffffff8108a1a3>] mark_held_locks+0x73/0xa0
>   [<ffffffff8108a296>] lockdep_trace_alloc+0xc6/0x100
>   [<ffffffff8113fba9>] kmem_cache_alloc+0x39/0x2b0
>   [<ffffffff812eea10>] idr_pre_get+0x60/0x90
>   [<ffffffff812ef5b7>] ida_pre_get+0x27/0xf0
>   [<ffffffff8106ebf5>] create_worker+0x55/0x190
>   [<ffffffff814fb4f4>] workqueue_cpu_callback+0xbc/0x235
>   [<ffffffff8151934c>] notifier_call_chain+0x8c/0xe0
>   [<ffffffff8107a34e>] __raw_notifier_call_chain+0xe/0x10
>   [<ffffffff81051f30>] __cpu_notify+0x20/0x40
>   [<ffffffff8150bff7>] _cpu_up+0x73/0x113
>   [<ffffffff8150c175>] cpu_up+0xde/0xf1
>   [<ffffffff81dcc81d>] kernel_init+0x21b/0x342
>   [<ffffffff81003724>] kernel_thread_helper+0x4/0x10
> irq event stamp: 27
> hardirqs last  enabled at (27): [<ffffffff815152c0>] _raw_spin_unlock_irqrestore+0x40/0x80
> hardirqs last disabled at (26): [<ffffffff81514982>] _raw_spin_lock_irqsave+0x32/0xa0
> softirqs last  enabled at (20): [<ffffffff810614c4>] del_timer_sync+0x54/0xa0
> softirqs last disabled at (18): [<ffffffff8106148c>] del_timer_sync+0x1c/0xa0
> 
> other info that might help us debug this:
> no locks held by kswapd0/419.
> 
> stack backtrace:
> Pid: 419, comm: kswapd0 Not tainted 2.6.37-rc1-mm1+ #150
> Call Trace:
>  [<ffffffff810890b1>] print_usage_bug+0x171/0x180
>  [<ffffffff8108a057>] mark_lock+0x377/0x450
>  [<ffffffff8108ab67>] __lock_acquire+0x267/0x15e0
>  [<ffffffff8107af0f>] ? local_clock+0x6f/0x80
>  [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150
>  [<ffffffff8108bf94>] lock_acquire+0xb4/0x150
>  [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60
>  [<ffffffff81512cf4>] __mutex_lock_common+0x44/0x3f0
>  [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60
>  [<ffffffff810744f0>] ? prepare_to_wait+0x60/0x90
>  [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150
>  [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60
>  [<ffffffff810868bd>] ? trace_hardirqs_off+0xd/0x10
>  [<ffffffff8107af0f>] ? local_clock+0x6f/0x80
>  [<ffffffff815131a8>] mutex_lock_nested+0x48/0x60
>  [<ffffffff810520d1>] get_online_cpus+0x41/0x60
>  [<ffffffff811138b2>] set_pgdat_percpu_threshold+0x22/0xe0
>  [<ffffffff81113970>] ? calculate_normal_threshold+0x0/0x60
>  [<ffffffff8110b552>] kswapd+0x1f2/0x360
>  [<ffffffff81074180>] ? autoremove_wake_function+0x0/0x40
>  [<ffffffff8110b360>] ? kswapd+0x0/0x360
>  [<ffffffff81073ae6>] kthread+0xa6/0xb0
>  [<ffffffff81003724>] kernel_thread_helper+0x4/0x10
>  [<ffffffff81515710>] ? restore_args+0x0/0x30
>  [<ffffffff81073a40>] ? kthread+0x0/0xb0
>  [<ffffffff81003720>] ? kernel_thread_helper+0x0/0x10

Well what's actually happening here?  Where is the alleged deadlock?

In the kernel_init() case we have a GFP_KERNEL allocation inside
get_online_cpus().  In the other case we simply have kswapd calling
get_online_cpus(), yes?

Does lockdep consider all kswapd actions to be "in reclaim context"? 
If so, why?

> 
> I think we have two option 1) call lockdep_clear_current_reclaim_state()
> every time 2) use for_each_possible_cpu instead for_each_online_cpu.
> 
> Following patch use (2) beucase removing get_online_cpus() makes good
> side effect. It reduce potentially cpu-hotplug vs memory-shortage deadlock
> risk. 

Well.  Being able to run for_each_online_cpu() is a pretty low-level
and fundamental thing.  It's something we're likely to want to do more
and more of as time passes.  It seems a bad thing to tell ourselves
that we cannot use it in reclaim context.  That blots out large chunks
of filesystem and IO-layer code as well!

> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -193,18 +193,16 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,
>  	int threshold;
>  	int i;
>  
> -	get_online_cpus();
>  	for (i = 0; i < pgdat->nr_zones; i++) {
>  		zone = &pgdat->node_zones[i];
>  		if (!zone->percpu_drift_mark)
>  			continue;
>  
>  		threshold = (*calculate_pressure)(zone);
> -		for_each_online_cpu(cpu)
> +		for_each_possible_cpu(cpu)
>  			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
>  							= threshold;
>  	}
> -	put_online_cpus();
>  }

That's a pretty sad change IMO, especially of num_possible_cpus is much
larger than num_online_cpus.

What do we need to do to make get_online_cpus() safe to use in reclaim
context?  (And in kswapd context, if that's really equivalent to
"reclaim context").

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2010-11-17  0:08 UTC|newest]

Thread overview: 60+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-10-27  8:47 [PATCH 0/2] Reduce the amount of time spent in watermark-related functions Mel Gorman
2010-10-27  8:47 ` Mel Gorman
2010-10-27  8:47 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman
2010-10-27  8:47   ` Mel Gorman
2010-10-27 20:16   ` Christoph Lameter
2010-10-27 20:16     ` Christoph Lameter
2010-10-28  1:09   ` KAMEZAWA Hiroyuki
2010-10-28  1:09     ` KAMEZAWA Hiroyuki
2010-10-28  9:49     ` Mel Gorman
2010-10-28  9:49       ` Mel Gorman
2010-10-28  9:58       ` KAMEZAWA Hiroyuki
2010-10-28  9:58         ` KAMEZAWA Hiroyuki
2010-11-14  8:53     ` [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu KOSAKI Motohiro
2010-11-14  8:53       ` KOSAKI Motohiro
2010-11-15 10:26       ` Mel Gorman
2010-11-15 10:26         ` Mel Gorman
2010-11-15 14:04         ` Christoph Lameter
2010-11-15 14:04           ` Christoph Lameter
2010-11-16  9:58           ` Mel Gorman
2010-11-16  9:58             ` Mel Gorman
2010-11-17  0:07       ` Andrew Morton [this message]
2010-11-17  0:07         ` Andrew Morton
2010-11-19 15:29         ` Christoph Lameter
2010-11-19 15:29           ` Christoph Lameter
2010-11-23  8:32         ` KOSAKI Motohiro
2010-11-23  8:32           ` KOSAKI Motohiro
2010-11-01  7:06   ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low KOSAKI Motohiro
2010-11-01  7:06     ` KOSAKI Motohiro
2010-11-26 16:06   ` Kyle McMartin
2010-11-26 16:06     ` Kyle McMartin
2010-11-29  9:56     ` Mel Gorman
2010-11-29  9:56       ` Mel Gorman
2010-11-29 13:16       ` Kyle McMartin
2010-11-29 13:16         ` Kyle McMartin
2010-11-29 15:08         ` Mel Gorman
2010-11-29 15:08           ` Mel Gorman
2010-11-29 15:22           ` Kyle McMartin
2010-11-29 15:22             ` Kyle McMartin
2010-11-29 15:26             ` Kyle McMartin
2010-11-29 15:26               ` Kyle McMartin
2010-11-29 15:58             ` Mel Gorman
2010-11-29 15:58               ` Mel Gorman
2010-12-23 22:18               ` David Rientjes
2010-12-23 22:18                 ` David Rientjes
2010-12-23 22:35                 ` Andrew Morton
2010-12-23 22:35                   ` Andrew Morton
2010-12-23 23:00                   ` Kyle McMartin
2010-12-23 23:00                     ` Kyle McMartin
2010-12-23 23:07                   ` David Rientjes
2010-12-23 23:07                     ` David Rientjes
2010-12-23 23:17                     ` Andrew Morton
2010-12-23 23:17                       ` Andrew Morton
2010-10-27  8:47 ` [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds Mel Gorman
2010-10-27  8:47   ` Mel Gorman
2010-10-27 20:13   ` Christoph Lameter
2010-10-27 20:13     ` Christoph Lameter
2010-10-28  1:10   ` KAMEZAWA Hiroyuki
2010-10-28  1:10     ` KAMEZAWA Hiroyuki
2010-11-01  7:06   ` KOSAKI Motohiro
2010-11-01  7:06     ` KOSAKI Motohiro

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20101116160720.5244ea22.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=cl@linux.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mel@csn.ul.ie \
    --cc=rientjes@google.com \
    --cc=shaohua.li@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.