Re: [PATCH] compact_pgdat: workaround lockdep warning in kswapd

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Mel Gorman <mel@csn.ul.ie>
To: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Tejun Heo <tj@kernel.org>,
	linux-mm@kvack.org
Subject: Re: [PATCH] compact_pgdat: workaround lockdep warning in kswapd
Date: Thu, 9 Feb 2012 15:40:14 +0000	[thread overview]
Message-ID: <20120209154014.GM5796@csn.ul.ie> (raw)
In-Reply-To: <alpine.LSU.2.00.1202061129040.2144@eggly.anvils>

On Mon, Feb 06, 2012 at 11:40:08AM -0800, Hugh Dickins wrote:
> I get this lockdep warning from swapping load on linux-next
> (20120201 but I expect the same from more recent days):
> 

Only getting to this now. Yes, I'm slow.

> =================================
> [ INFO: inconsistent lock state ]
> 3.3.0-rc2-next-20120201 #5 Not tainted
> ---------------------------------
> inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.

Ok, shoving that through my spidey decoder ring, that is saying we were
in reclaim context (RECLAIM_FS-ON-W) and taking a mutex that at in
the past was taken from a different reclaim context. In lockdeps mind,
this leads to a potential deadlock where a user of pcpu_alloc needs
kswapd or a reclaimer to make forward progress that can't because it
depends on the same mutex.

> kswapd0/28 [HC0[0]:SC0[0]:HE1:SE1] takes:
>  (pcpu_alloc_mutex){+.+.?.}, at: [<ffffffff810d6684>] pcpu_alloc+0x67/0x325
> {RECLAIM_FS-ON-W} state was registered at:
>   [<ffffffff81099b75>] mark_held_locks+0xd7/0x103
>   [<ffffffff8109a13c>] lockdep_trace_alloc+0x85/0x9e
>   [<ffffffff810f6bdc>] __kmalloc+0x6c/0x14b
>   [<ffffffff810d57fd>] pcpu_mem_zalloc+0x59/0x62
>   [<ffffffff810d5d16>] pcpu_extend_area_map+0x26/0xb1
>   [<ffffffff810d679f>] pcpu_alloc+0x182/0x325
>   [<ffffffff810d694d>] __alloc_percpu+0xb/0xd
>   [<ffffffff8142ebfd>] snmp_mib_init+0x1e/0x2e
>   [<ffffffff8185cd8d>] ipv4_mib_init_net+0x7a/0x184
>   [<ffffffff813dc963>] ops_init.clone.0+0x6b/0x73
>   [<ffffffff813dc9cc>] register_pernet_operations+0x61/0xa0
>   [<ffffffff813dca8e>] register_pernet_subsys+0x29/0x42
>   [<ffffffff8185d044>] inet_init+0x1ad/0x252
>   [<ffffffff810002e3>] do_one_initcall+0x7a/0x12f
>   [<ffffffff81832bc5>] kernel_init+0x9d/0x11e
>   [<ffffffff814e51e4>] kernel_thread_helper+0x4/0x10
> irq event stamp: 656613
> hardirqs last  enabled at (656613): [<ffffffff814e0ddc>] __mutex_unlock_slowpath+0x104/0x128
> hardirqs last disabled at (656612): [<ffffffff814e0d34>] __mutex_unlock_slowpath+0x5c/0x128
> softirqs last  enabled at (655568): [<ffffffff8105b4a5>] __do_softirq+0x120/0x136
> softirqs last disabled at (654757): [<ffffffff814e52dc>] call_softirq+0x1c/0x30
> 
> other info that might help us debug this:
>  Possible unsafe locking scenario:
> 
>        CPU0
>        ----
>   lock(pcpu_alloc_mutex);
>   <Interrupt>
>     lock(pcpu_alloc_mutex);
> 
>  *** DEADLOCK ***
> 
> no locks held by kswapd0/28.
> 
> stack backtrace:
> Pid: 28, comm: kswapd0 Not tainted 3.3.0-rc2-next-20120201 #5
> Call Trace:
>  [<ffffffff810981f4>] print_usage_bug+0x1bf/0x1d0
>  [<ffffffff81096c3e>] ? print_irq_inversion_bug+0x1d9/0x1d9
>  [<ffffffff810982c0>] mark_lock_irq+0xbb/0x22e
>  [<ffffffff810c5399>] ? free_hot_cold_page+0x13d/0x14f
>  [<ffffffff81098684>] mark_lock+0x251/0x331
>  [<ffffffff81098893>] mark_irqflags+0x12f/0x141
>  [<ffffffff81098e32>] __lock_acquire+0x58d/0x753
>  [<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
>  [<ffffffff81099433>] lock_acquire+0x54/0x6a
>  [<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
>  [<ffffffff8107a5b8>] ? add_preempt_count+0xa9/0xae
>  [<ffffffff814e0a21>] mutex_lock_nested+0x5e/0x315
>  [<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
>  [<ffffffff81098f81>] ? __lock_acquire+0x6dc/0x753
>  [<ffffffff810c9fb0>] ? __pagevec_release+0x2c/0x2c
>  [<ffffffff810d6684>] pcpu_alloc+0x67/0x325
>  [<ffffffff810c9fb0>] ? __pagevec_release+0x2c/0x2c
>  [<ffffffff810d694d>] __alloc_percpu+0xb/0xd
>  [<ffffffff8106c35e>] schedule_on_each_cpu+0x23/0x110
>  [<ffffffff810c9fcb>] lru_add_drain_all+0x10/0x12
>  [<ffffffff810f126f>] __compact_pgdat+0x20/0x182
>  [<ffffffff810f15c2>] compact_pgdat+0x27/0x29
>  [<ffffffff810c306b>] ? zone_watermark_ok+0x1a/0x1c
>  [<ffffffff810cdf6f>] balance_pgdat+0x732/0x751
>  [<ffffffff810ce0ed>] kswapd+0x15f/0x178
>  [<ffffffff810cdf8e>] ? balance_pgdat+0x751/0x751
>  [<ffffffff8106fd11>] kthread+0x84/0x8c
>  [<ffffffff814e51e4>] kernel_thread_helper+0x4/0x10
>  [<ffffffff810787ed>] ? finish_task_switch+0x85/0xea
>  [<ffffffff814e3861>] ? retint_restore_args+0xe/0xe
>  [<ffffffff8106fc8d>] ? __init_kthread_worker+0x56/0x56
>  [<ffffffff814e51e0>] ? gs_change+0xb/0xb
> 
> The RECLAIM_FS notations indicate that it's doing the GFP_FS checking
> that Nick hacked into lockdep a while back: I think we're intended to
> read that "<Interrupt>" in the DEADLOCK scenario as "<Direct reclaim>".
> 

It's not GFP_FS it is complaining about though. It's complaining because
that mutex is being taken from inconsistent reclaim contexts. At least,
that is my reading of it. It's not often I read lockdep reports so I
could be wrong.

> I'm hazy, I have not reached any conclusion as to whether it's right
> to complain or not; but I believe it's uneasy about kswapd now doing
> the mutex_lock(&pcpu_alloc_mutex) which lru_add_drain_all() entails.
> Nor have I reached any conclusion as to whether it's important for
> kswapd to do that draining or not.
> 

It's not important for kswapd to do this draining. Compaction via proc
does the draining to maximise the amount of compaction it is able to do.
kswapd is best effort and not even doing sync compaction, let alone
caring about draining pagevecs.

> But so as not to get blocked on this, with lockdep disabled from giving
> further reports, here's a patch which removes the lru_add_drain_all()
> from kswapd's callpath (and calls it only once from compact_nodes(),
> instead of once per node).
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Thanks

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

     prev parent reply	other threads:[~2012-02-09 15:40 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-02-06 19:40 [PATCH] compact_pgdat: workaround lockdep warning in kswapd Hugh Dickins
2012-02-06 20:49 ` Andrew Morton
2012-02-06 21:00   ` Hugh Dickins
2012-02-06 21:48   ` Rik van Riel
2012-02-06 22:14     ` Hugh Dickins
2012-02-09 15:40 ` Mel Gorman [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120209154014.GM5796@csn.ul.ie \
    --to=mel@csn.ul.ie \
    --cc=akpm@linux-foundation.org \
    --cc=hughd@google.com \
    --cc=linux-mm@kvack.org \
    --cc=riel@redhat.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).