Re: [BUG] kmemleak on __radix_tree_preload

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	"Linux Kernel, Mailing List" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [BUG] kmemleak on __radix_tree_preload
Date: Thu, 8 May 2014 08:53:30 -0700	[thread overview]
Message-ID: <20140508155330.GE8754@linux.vnet.ibm.com> (raw)
In-Reply-To: <20140508152946.GA10470@localhost>

On Thu, May 08, 2014 at 04:29:48PM +0100, Catalin Marinas wrote:
> On Thu, May 08, 2014 at 04:00:27PM +0100, Paul E. McKenney wrote:
> > On Thu, May 08, 2014 at 11:24:36AM +0100, Catalin Marinas wrote:
> > > On Thu, May 08, 2014 at 10:37:40AM +0100, Jaegeuk Kim wrote:
> > > > 2014-05-08 (ea(C)), 10:26 +0100, Catalin Marinas:
> > > > > On Thu, May 08, 2014 at 06:16:51PM +0900, Jaegeuk Kim wrote:
> > > > > > 2014-05-07 (i??), 12:39 +0100, Catalin Marinas:
> > > > > > > On Wed, May 07, 2014 at 03:58:08AM +0100, Jaegeuk Kim wrote:
> > > > > > > > unreferenced object 0xffff880004226da0 (size 576):
> > > > > > > >   comm "fsstress", pid 14590, jiffies 4295191259 (age 706.308s)
> > > > > > > >   hex dump (first 32 bytes):
> > > > > > > >     01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
> > > > > > > >     50 89 34 81 ff ff ff ff b8 6d 22 04 00 88 ff ff  P.4......m".....
> > > > > > > >   backtrace:
> > > > > > > >     [<ffffffff816c02e8>] kmemleak_update_trace+0x58/0x80
> > > > > > > >     [<ffffffff81349517>] radix_tree_node_alloc+0x77/0xa0
> > > > > > > >     [<ffffffff81349718>] __radix_tree_create+0x1d8/0x230
> > > > > > > >     [<ffffffff8113286c>] __add_to_page_cache_locked+0x9c/0x1b0
> > > > > > > >     [<ffffffff811329a8>] add_to_page_cache_lru+0x28/0x80
> > > > > > > >     [<ffffffff81132f58>] grab_cache_page_write_begin+0x98/0xf0
> > > > > > > >     [<ffffffffa02e4bf4>] f2fs_write_begin+0xb4/0x3c0 [f2fs]
> > > > > > > >     [<ffffffff81131b77>] generic_perform_write+0xc7/0x1c0
> > > > > > > >     [<ffffffff81133b7d>] __generic_file_aio_write+0x1cd/0x3f0
> > > > > > > >     [<ffffffff81133dfe>] generic_file_aio_write+0x5e/0xe0
> > > > > > > >     [<ffffffff81195c5a>] do_sync_write+0x5a/0x90
> > > > > > > >     [<ffffffff811968d2>] vfs_write+0xc2/0x1d0
> > > > > > > >     [<ffffffff81196daf>] SyS_write+0x4f/0xb0
> > > > > > > >     [<ffffffff816dead2>] system_call_fastpath+0x16/0x1b
> > > > > > > >     [<ffffffffffffffff>] 0xffffffffffffffff
> > > > > > >
> > > > > > > OK, it shows that the allocation happens via add_to_page_cache_locked()
> > > > > > > and I guess it's page_cache_tree_insert() which calls
> > > > > > > __radix_tree_create() (the latter reusing the preloaded node). I'm not
> > > > > > > familiar enough to this code (radix-tree.c and filemap.c) to tell where
> > > > > > > the node should have been freed, who keeps track of it.
> > > > > > >
> > > > > > > At a quick look at the hex dump (assuming that the above leak is struct
> > > > > > > radix_tree_node):
> > > > > > >
> > > > > > > 	.path = 1
> > > > > > > 	.count = -0x7f (or 0xffffff81 as unsigned int)
> > > > > > > 	union {
> > > > > > > 		{
> > > > > > > 			.parent = NULL
> > > > > > > 			.private_data = 0xffffffff81348950
> > > > > > > 		}
> > > > > > > 		{
> > > > > > > 			.rcu_head.next = NULL
> > > > > > > 			.rcu_head.func = 0xffffffff81348950
> > > > > > > 		}
> > > > > > > 	}
> > > > > > >
> > > > > > > The count is a bit suspicious.
> > > > > > >
> > > > > > > From the union, it looks most likely like rcu_head information. Is
> > > > > > > radix_tree_node_rcu_free() function at the above rcu_head.func?
> > > > >
> > > > > Thanks for the config. Could you please confirm that 0xffffffff81348950
> > > > > address corresponds to the radix_tree_node_rcu_free() function in your
> > > > > System.map (or something else)?
> > > > 
> > > > Yap, the address is matched to radix_tree_node_rcu_free().
> > > 
> > > Cc'ing Paul as well, not that I blame RCU ;), but maybe he could shed
> > > some light on why kmemleak can't track this object.
> > 
> > Do we have any information on how long it has been since that data
> > structure was handed to call_rcu()?  If that time is short, then it
> > is quite possible that its grace period simply has not yet completed.
> 
> kmemleak scans every 10 minutes but Jaegeuk can confirm how long he has
> waited.

OK.  If RCU was stalled for that long, you should get stall warnings,
at least assuming that they were not disabled, for example, using
the rcupdate.rcu_cpu_stall_suppress=1 boot parameter.

> > It might also be that one of the CPUs is stuck (e.g., spinning with
> > interrupts disabled), which would prevent the grace period from
> > completing, in turn preventing any memory waiting for that grace period
> > from being freed.
> 
> We should get some kernel warning if it's stuck for too long but, again,
> Jaegeuk can confirm. I haven't managed to reproduce this on ARM systems.
> 
> > > My summary so far:
> > > 
> > > - radix_tree_node reported by kmemleak as it cannot find any trace of it
> > >   when scanning the memory
> > > - at allocation time, radix_tree_node is memzero'ed by
> > >   radix_tree_node_ctor(). Given that node->rcu_head.func ==
> > >   radix_tree_node_rcu_free, my guess is that radix_tree_node_free() has
> > >   been called
> > > - some time later, kmemleak still hasn't received any callback for
> > >   kmem_cache_free(node). Possibly radix_tree_node_rcu_free() hasn't been
> > >   called either since node->count is not NULL.
> > > 
> > > For RCU queued objects, kmemleak should still track references to them
> > > via rcu_sched_state and rcu_head members. But even if this went wrong, I
> > > would expect the object to be freed eventually and kmemleak notified (so
> > > just a temporary leak report which doesn't seem to be the case here).
> > 
> > OK, so you are saying that this memory has been in this state for quite
> > some time?
> 
> These leaks don't seem to disappear (time lapsed to be confirmed) and
> the object checksum not changed either (otherwise kmemleak would not
> report it).
> 
> > If the system is responsive during this time, I recommend building with
> > CONFIG_RCU_TRACE=y, then polling the debugfs rcu/*/rcugp files.  The value
> > of "*" will be "rcu_sched" for kernels built with CONFIG_PREEMPT=n and
> > "rcu_preempt" for kernels built with CONFIG_PREEMPT=y.
> > 
> > If the number printed does not advance, then the RCU grace period is
> > stalled, which will prevent memory waiting for that grace period from
> > ever being freed.
> 
> Thanks for the suggestions
> 
> > Of course, if the value of node->count is preventing call_rcu() from
> > being invoked in the first place, then the needed grace period won't
> > start, much less finish.  ;-)
> 
> Given the rcu_head.func value, my assumption is that call_rcu() has
> already been called.

Fair point -- given that it is a union, you would expect this field to
be overwritten upon reuse.

> BTW, is it safe to have a union overlapping node->parent and
> node->rcu_head.next? I'm still staring at the radix-tree code but a
> scenario I have in mind is that call_rcu() has been raised for a few
> nodes, other CPU may have some reference to one of them and set
> node->parent to NULL (e.g. concurrent calls to radix_tree_shrink()),
> breaking the RCU linking. I can't confirm this theory yet ;)

If this were reproducible, I would suggest retrying with non-overlapping
node->parent and node->rcu_head.next, but you knew that already.  ;-)

But the usual practice would be to make node removal exclude shrinking.
And the radix-tree code seems to delegate locking to the caller.

So, is the correct locking present in the page cache?  The radix-tree
code seems to assume that all update operations for a given tree are
protected by a lock global to that tree.

Another diagnosis approach would be to build with
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, which would complain about double
call_rcu() invocations.  Rumor has it that is is necessary to turn off
other kmem debugging for this to tell you anything -- I have seen cases
where the kmem debugging obscures the debug-objects diagnostics.

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	"Linux Kernel, Mailing List" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [BUG] kmemleak on __radix_tree_preload
Date: Thu, 8 May 2014 08:53:30 -0700	[thread overview]
Message-ID: <20140508155330.GE8754@linux.vnet.ibm.com> (raw)
In-Reply-To: <20140508152946.GA10470@localhost>

On Thu, May 08, 2014 at 04:29:48PM +0100, Catalin Marinas wrote:
> On Thu, May 08, 2014 at 04:00:27PM +0100, Paul E. McKenney wrote:
> > On Thu, May 08, 2014 at 11:24:36AM +0100, Catalin Marinas wrote:
> > > On Thu, May 08, 2014 at 10:37:40AM +0100, Jaegeuk Kim wrote:
> > > > 2014-05-08 (목), 10:26 +0100, Catalin Marinas:
> > > > > On Thu, May 08, 2014 at 06:16:51PM +0900, Jaegeuk Kim wrote:
> > > > > > 2014-05-07 (수), 12:39 +0100, Catalin Marinas:
> > > > > > > On Wed, May 07, 2014 at 03:58:08AM +0100, Jaegeuk Kim wrote:
> > > > > > > > unreferenced object 0xffff880004226da0 (size 576):
> > > > > > > >   comm "fsstress", pid 14590, jiffies 4295191259 (age 706.308s)
> > > > > > > >   hex dump (first 32 bytes):
> > > > > > > >     01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00  ................
> > > > > > > >     50 89 34 81 ff ff ff ff b8 6d 22 04 00 88 ff ff  P.4......m".....
> > > > > > > >   backtrace:
> > > > > > > >     [<ffffffff816c02e8>] kmemleak_update_trace+0x58/0x80
> > > > > > > >     [<ffffffff81349517>] radix_tree_node_alloc+0x77/0xa0
> > > > > > > >     [<ffffffff81349718>] __radix_tree_create+0x1d8/0x230
> > > > > > > >     [<ffffffff8113286c>] __add_to_page_cache_locked+0x9c/0x1b0
> > > > > > > >     [<ffffffff811329a8>] add_to_page_cache_lru+0x28/0x80
> > > > > > > >     [<ffffffff81132f58>] grab_cache_page_write_begin+0x98/0xf0
> > > > > > > >     [<ffffffffa02e4bf4>] f2fs_write_begin+0xb4/0x3c0 [f2fs]
> > > > > > > >     [<ffffffff81131b77>] generic_perform_write+0xc7/0x1c0
> > > > > > > >     [<ffffffff81133b7d>] __generic_file_aio_write+0x1cd/0x3f0
> > > > > > > >     [<ffffffff81133dfe>] generic_file_aio_write+0x5e/0xe0
> > > > > > > >     [<ffffffff81195c5a>] do_sync_write+0x5a/0x90
> > > > > > > >     [<ffffffff811968d2>] vfs_write+0xc2/0x1d0
> > > > > > > >     [<ffffffff81196daf>] SyS_write+0x4f/0xb0
> > > > > > > >     [<ffffffff816dead2>] system_call_fastpath+0x16/0x1b
> > > > > > > >     [<ffffffffffffffff>] 0xffffffffffffffff
> > > > > > >
> > > > > > > OK, it shows that the allocation happens via add_to_page_cache_locked()
> > > > > > > and I guess it's page_cache_tree_insert() which calls
> > > > > > > __radix_tree_create() (the latter reusing the preloaded node). I'm not
> > > > > > > familiar enough to this code (radix-tree.c and filemap.c) to tell where
> > > > > > > the node should have been freed, who keeps track of it.
> > > > > > >
> > > > > > > At a quick look at the hex dump (assuming that the above leak is struct
> > > > > > > radix_tree_node):
> > > > > > >
> > > > > > > 	.path = 1
> > > > > > > 	.count = -0x7f (or 0xffffff81 as unsigned int)
> > > > > > > 	union {
> > > > > > > 		{
> > > > > > > 			.parent = NULL
> > > > > > > 			.private_data = 0xffffffff81348950
> > > > > > > 		}
> > > > > > > 		{
> > > > > > > 			.rcu_head.next = NULL
> > > > > > > 			.rcu_head.func = 0xffffffff81348950
> > > > > > > 		}
> > > > > > > 	}
> > > > > > >
> > > > > > > The count is a bit suspicious.
> > > > > > >
> > > > > > > From the union, it looks most likely like rcu_head information. Is
> > > > > > > radix_tree_node_rcu_free() function at the above rcu_head.func?
> > > > >
> > > > > Thanks for the config. Could you please confirm that 0xffffffff81348950
> > > > > address corresponds to the radix_tree_node_rcu_free() function in your
> > > > > System.map (or something else)?
> > > > 
> > > > Yap, the address is matched to radix_tree_node_rcu_free().
> > > 
> > > Cc'ing Paul as well, not that I blame RCU ;), but maybe he could shed
> > > some light on why kmemleak can't track this object.
> > 
> > Do we have any information on how long it has been since that data
> > structure was handed to call_rcu()?  If that time is short, then it
> > is quite possible that its grace period simply has not yet completed.
> 
> kmemleak scans every 10 minutes but Jaegeuk can confirm how long he has
> waited.

OK.  If RCU was stalled for that long, you should get stall warnings,
at least assuming that they were not disabled, for example, using
the rcupdate.rcu_cpu_stall_suppress=1 boot parameter.

> > It might also be that one of the CPUs is stuck (e.g., spinning with
> > interrupts disabled), which would prevent the grace period from
> > completing, in turn preventing any memory waiting for that grace period
> > from being freed.
> 
> We should get some kernel warning if it's stuck for too long but, again,
> Jaegeuk can confirm. I haven't managed to reproduce this on ARM systems.
> 
> > > My summary so far:
> > > 
> > > - radix_tree_node reported by kmemleak as it cannot find any trace of it
> > >   when scanning the memory
> > > - at allocation time, radix_tree_node is memzero'ed by
> > >   radix_tree_node_ctor(). Given that node->rcu_head.func ==
> > >   radix_tree_node_rcu_free, my guess is that radix_tree_node_free() has
> > >   been called
> > > - some time later, kmemleak still hasn't received any callback for
> > >   kmem_cache_free(node). Possibly radix_tree_node_rcu_free() hasn't been
> > >   called either since node->count is not NULL.
> > > 
> > > For RCU queued objects, kmemleak should still track references to them
> > > via rcu_sched_state and rcu_head members. But even if this went wrong, I
> > > would expect the object to be freed eventually and kmemleak notified (so
> > > just a temporary leak report which doesn't seem to be the case here).
> > 
> > OK, so you are saying that this memory has been in this state for quite
> > some time?
> 
> These leaks don't seem to disappear (time lapsed to be confirmed) and
> the object checksum not changed either (otherwise kmemleak would not
> report it).
> 
> > If the system is responsive during this time, I recommend building with
> > CONFIG_RCU_TRACE=y, then polling the debugfs rcu/*/rcugp files.  The value
> > of "*" will be "rcu_sched" for kernels built with CONFIG_PREEMPT=n and
> > "rcu_preempt" for kernels built with CONFIG_PREEMPT=y.
> > 
> > If the number printed does not advance, then the RCU grace period is
> > stalled, which will prevent memory waiting for that grace period from
> > ever being freed.
> 
> Thanks for the suggestions
> 
> > Of course, if the value of node->count is preventing call_rcu() from
> > being invoked in the first place, then the needed grace period won't
> > start, much less finish.  ;-)
> 
> Given the rcu_head.func value, my assumption is that call_rcu() has
> already been called.

Fair point -- given that it is a union, you would expect this field to
be overwritten upon reuse.

> BTW, is it safe to have a union overlapping node->parent and
> node->rcu_head.next? I'm still staring at the radix-tree code but a
> scenario I have in mind is that call_rcu() has been raised for a few
> nodes, other CPU may have some reference to one of them and set
> node->parent to NULL (e.g. concurrent calls to radix_tree_shrink()),
> breaking the RCU linking. I can't confirm this theory yet ;)

If this were reproducible, I would suggest retrying with non-overlapping
node->parent and node->rcu_head.next, but you knew that already.  ;-)

But the usual practice would be to make node removal exclude shrinking.
And the radix-tree code seems to delegate locking to the caller.

So, is the correct locking present in the page cache?  The radix-tree
code seems to assume that all update operations for a given tree are
protected by a lock global to that tree.

Another diagnosis approach would be to build with
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, which would complain about double
call_rcu() invocations.  Rumor has it that is is necessary to turn off
other kmem debugging for this to tell you anything -- I have seen cases
where the kmem debugging obscures the debug-objects diagnostics.

							Thanx, Paul

next prev parent reply	other threads:[~2014-05-08 15:53 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-04-25  1:45 [BUG] kmemleak on __radix_tree_preload Jaegeuk Kim
2014-04-25  1:45 ` Jaegeuk Kim
2014-05-01 17:06 ` Catalin Marinas
2014-05-01 17:06   ` Catalin Marinas
2014-05-01 18:41   ` Johannes Weiner
2014-05-01 18:41     ` Johannes Weiner
2014-05-07  2:58     ` Jaegeuk Kim
2014-05-07  2:58       ` Jaegeuk Kim
2014-05-07 11:39       ` Catalin Marinas
2014-05-07 11:39         ` Catalin Marinas
2014-05-08  9:16         ` Jaegeuk Kim
2014-05-08  9:16           ` Jaegeuk Kim
2014-05-08  9:26           ` Catalin Marinas
2014-05-08  9:26             ` Catalin Marinas
2014-05-08  9:37             ` Jaegeuk Kim
2014-05-08  9:37               ` Jaegeuk Kim
2014-05-08 10:24               ` Catalin Marinas
2014-05-08 10:24                 ` Catalin Marinas
2014-05-08 15:00                 ` Paul E. McKenney
2014-05-08 15:00                   ` Paul E. McKenney
2014-05-08 15:29                   ` Catalin Marinas
2014-05-08 15:29                     ` Catalin Marinas
2014-05-08 15:53                     ` Paul E. McKenney [this message]
2014-05-08 15:53                       ` Paul E. McKenney
2014-05-08 16:52                       ` Catalin Marinas
2014-05-08 16:52                         ` Catalin Marinas
2014-05-09  0:06                         ` Jaegeuk Kim
2014-05-09  0:06                           ` Jaegeuk Kim
2014-05-08 17:52                       ` Johannes Weiner
2014-05-08 17:52                         ` Johannes Weiner
2014-05-08 21:40                         ` Catalin Marinas
2014-05-08 21:40                           ` Catalin Marinas
2014-05-09  0:01                     ` Jaegeuk Kim
2014-05-09  0:01                       ` Jaegeuk Kim
2014-05-09  9:45                       ` Catalin Marinas
2014-05-09  9:45                         ` Catalin Marinas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140508155330.GE8754@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=catalin.marinas@arm.com \
    --cc=hannes@cmpxchg.org \
    --cc=jaegeuk.kim@samsung.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.