From: Mel Gorman <mgorman@suse.de>
To: Hugh Dickins <hughd@google.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>,
Heiko Carstens <heiko.carstens@de.ibm.com>,
Rik van Riel <riel@redhat.com>, Linux-MM <linux-mm@kvack.org>,
Linux-S390 <linux-s390@vger.kernel.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
Date: Tue, 17 Apr 2012 13:22:02 +0100 [thread overview]
Message-ID: <20120417122202.GF2359@suse.de> (raw)
In-Reply-To: <alpine.LSU.2.00.1204161332120.1675@eggly.anvils>
On Mon, Apr 16, 2012 at 02:14:09PM -0700, Hugh Dickins wrote:
> On Mon, 16 Apr 2012, Mel Gorman wrote:
>
> > This patch is horribly ugly and there has to be a better way of doing
> > it. I'm looking for suggestions on what s390 can do here that is not
> > painful or broken.
> >
> > The following bug was reported on s390
> >
> > kernel BUG at
> > /usr/src/packages/BUILD/kernel-default-3.0.13/linux-3.0/lib/radix-tree.c:477!
> > illegal operation: 0001 [#1] SMP
> > Modules linked in: ext2 dm_round_robin dm_multipath sg sd_mod crc_t10dif fuse
> > loop dm_mod ipv6 qeth_l3 ipv6_lib zfcp scsi_transport_fc scsi_tgt qeth qdio
> > ccwgroup ext3 jbd mbcache dasd_eckd_mod dasd_mod scsi_dh_rdac scsi_dh_emc
> > scsi_dh_alua scsi_dh_hp_sw scsi_dh scsi_mod
> > Supported: Yes
> > CPU: 3 Not tainted 3.0.13-0.27-default #1
> > Process kpartx_id (pid: 24381, task: 000000004d938138, ksp: 00000000733539f0)
> > Krnl PSW : 0404000180000000 000000000037935e
> > (radix_tree_tag_set+0xfe/0x118)
> > R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
> > Krnl GPRS: 0000000000000002 0000000000000018 00000000004e82a8 0000000000000000
> > 0000000000000007 0000000000000003 00000000004e82a8 00000000007280c0
> > 0000000000000000 0000000000000000 00000000fac55d5f 0000000000000000
> > 000003e000000006 0000000000529f88 0000000000000000 0000000073353ad0
> > Krnl Code: 0000000000379352: a7cafffa ahi %r12,-6
> > 0000000000379356: a7f4ffb5 brc 15,3792c0
> > 000000000037935a: a7f40001 brc 15,37935c
> > >000000000037935e: a7f40000 brc 15,37935e
> > 0000000000379362: b90200bb ltgr %r11,%r11
> > 0000000000379366: a784fff0 brc 8,379346
> > 000000000037936a: a7f4ffe1 brc 15,37932c
> > 000000000037936e: a7f40001 brc 15,379370
> > Call Trace:
> > ([<00000000002080ae>] __set_page_dirty_nobuffers+0x10a/0x1d0)
> > [<0000000000234970>] page_remove_rmap+0x100/0x104
> > [<0000000000228608>] zap_pte_range+0x194/0x608
> > [<0000000000228caa>] unmap_page_range+0x22e/0x33c
> > [<0000000000228ec2>] unmap_vmas+0x10a/0x274
> > [<000000000022f2f6>] exit_mmap+0xd2/0x254
> > [<00000000001472d6>] mmput+0x5e/0x144
> > [<000000000014d306>] exit_mm+0x196/0x1d4
> > [<000000000014f650>] do_exit+0x18c/0x408
> > [<000000000014f92c>] do_group_exit+0x60/0xec
> > [<000000000014f9ea>] SyS_exit_group+0x32/0x40
> > [<00000000004e1660>] sysc_noemu+0x16/0x1c
> > [<000003fffd23ab96>] 0x3fffd23ab96
> > Last Breaking-Event-Address:
> > [<000000000037935a>] radix_tree_tag_set+0xfa/0x118
> >
> > While this bug was reproduced on a 3.0.x kernel, there is no reason why
> > it should not happen in mainline.
> >
> > The bug was triggered because a page had a valid mapping but by the time
> > set_page_dirty() was called there was no valid entry in the radix tree.
> > This was reproduced while underlying storage was unplugged but this may
> > be indirectly related to the problem.
> >
> > This bug only triggers on s390 and may be explained by a race. Normally
> > when pages are being readahead in read_cache_pages(), an attempt is made
> > to add the page to the page cache. If that fails, the page is invalidated
> > by locking it, giving it a valid mapping, calling do_invalidatepage()
> > and then setting mapping to NULL again. It's similar with page cache is
> > being deleted. The page is locked, the tree lock is taken, it's removed
> > from the radix tree and the mapping is set to NULL.
> >
> > In many cases looking up the radix tree is protected by a combination of
> > the page lock and the tree lock. When zapping pages, the page lock is not
> > taken which does not matter as most architectures do nothing special with
> > the mapping and for file-backed pages, the mapping should be pinned by
> > the open file. However, s390 also calls set_dirty_page() for
> > PageSwapCache() pages where it is potentially racing against
> > reuse_swap_page() for example.
> >
> > This patch splits the propagation of the storage key into a separate
> > function and introduces a new variant of page_remove_rmap() called
> > page_remove_rmap_nolock() which is only used during page table zapping
> > that attempts to acquire the page lock. The approach is ugly although it
> > works in that the bug can no longer be triggered. At the time the page
> > lock is required, the PTL is held so it cannot sleep so it busy waits on
> > trylock_page(). That potentially deadlocks against reclaim which holds
> > the page table lock and is trying to acquire the pagetable lock so in
> > some cases there is no choice except to race. In the file-backed case,
> > this is ok because the address space will be valid (zap_pte_range() makes
> > the same assumption. However, s390 needs a better way of guarding against
> > PageSwapCache pages being removed from the radix tree while set_page_dirty()
> > is being called. The patch would be marginally better if in the PageSwapCache
> > case we simply tried to lock once and in the contended case just fail to
> > propogate the storage key. I lack familiarity with the s390 architecture
> > to be certain if this is safe or not. Suggestions on a better fix?
> >
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
>
> I'm very confused - and wonder if you are too ;)
>
It would not be the first time :)
> Please forgive me if I've not been patient enough in reading through
> all you've written, and knocking it into my skull: ask me to go back
> and read again.
>
> But it seems to me that you're approaching this from the wrong end
> (page_remove_rmap), instead of from the end that's giving rise to the
> problem - perhaps because you've not yet identified that other end?
>
Not as precisely as I would have like. I do not have access to the machine
in question unfortunately.
> I'm confused as to whether you see this problem with file pages,
> or with anon-swap-cache pages, or with both, or not yet determined.
>
PageSwapCache pages only.
> The obvious places which set page->mapping = NULL or ClearPageSwapCache,
> once they've been visible to the outside world, are careful to do so at
> the same time as removing the radix_tree entry, holding mapping->tree_lock.
>
The problem here is not that page->mapping is NULL but that the page->mapping
is valid while the page is not in the radix tree.
> And __set_page_dirty_nobuffers(), the set_page_dirty variant implicated
> in the trace above, is careful to recheck mapping under tree_lock, as I
> think are the others. There is an issue if mapping itself could vanish
> while in there, but that's not a danger in page_remove_rmap().
>
> (You do remind me that I meant years ago to switch swapper_space over
> to the much simpler __set_page_dirty_no_writeback(), which shmem has
> used for ages; but as far as this problem goes, that would probably
> be at best a workaround, rather than the proper fix.)
>
It would be a workaround. If in the future we wanted to treat swapper
space more like a normal file inode and writeback dirty pages from
the flusher thread then this bug would just pop its head back up.
> You indicate read_cache_pages_invalidate_page() above, and yes that's
> weird the way it sets and unsets page->mapping; I never came across it
> before, but IIUC it's safe because those pages have not yet been made
> visible to the outside world. (Or do we have a speculative pagecache
> issue here?)
>
We do not have a speculative pagecache issue here that I'm aware of.
> You compare do_invalidatepage() with deletion from page cache, but I hope
> that's mistaken: aren't such pages usually being invalidated precisely
> because they raced with pages found already there in the page cache,
> which need to be left in place?
>
True. What I was trying to do was point out that we usually handle
pages being removed from the page cache with a combination of the page
lock and tree lock. s390 does not have the same type of protection in
page_remove_rmap() when handling PageSwapCache pages.
> So, is there somewhere which is setting page->mapping = NULL, or
> doing ClearPageSwapCache, in a separate block from removing that
> page's entry from the radix_tree? If there is, then I think it
> could pose a problem on more than s390.
>
> Hmm, mm/migrate.c.
>
Migration moves the page mapping under the tree lock so
__set_page_dirty_nobuffers() I don't think that is it.
I think the race is against something like reuse_swap_page() which locks
the page and removes it from swap cache while page_remove_rmap() looks
up the same page.
--
Mel Gorman
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Mel Gorman <mgorman@suse.de>
To: Hugh Dickins <hughd@google.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>,
Heiko Carstens <heiko.carstens@de.ibm.com>,
Rik van Riel <riel@redhat.com>, Linux-MM <linux-mm@kvack.org>,
Linux-S390 <linux-s390@vger.kernel.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock
Date: Tue, 17 Apr 2012 13:22:02 +0100 [thread overview]
Message-ID: <20120417122202.GF2359@suse.de> (raw)
In-Reply-To: <alpine.LSU.2.00.1204161332120.1675@eggly.anvils>
On Mon, Apr 16, 2012 at 02:14:09PM -0700, Hugh Dickins wrote:
> On Mon, 16 Apr 2012, Mel Gorman wrote:
>
> > This patch is horribly ugly and there has to be a better way of doing
> > it. I'm looking for suggestions on what s390 can do here that is not
> > painful or broken.
> >
> > The following bug was reported on s390
> >
> > kernel BUG at
> > /usr/src/packages/BUILD/kernel-default-3.0.13/linux-3.0/lib/radix-tree.c:477!
> > illegal operation: 0001 [#1] SMP
> > Modules linked in: ext2 dm_round_robin dm_multipath sg sd_mod crc_t10dif fuse
> > loop dm_mod ipv6 qeth_l3 ipv6_lib zfcp scsi_transport_fc scsi_tgt qeth qdio
> > ccwgroup ext3 jbd mbcache dasd_eckd_mod dasd_mod scsi_dh_rdac scsi_dh_emc
> > scsi_dh_alua scsi_dh_hp_sw scsi_dh scsi_mod
> > Supported: Yes
> > CPU: 3 Not tainted 3.0.13-0.27-default #1
> > Process kpartx_id (pid: 24381, task: 000000004d938138, ksp: 00000000733539f0)
> > Krnl PSW : 0404000180000000 000000000037935e
> > (radix_tree_tag_set+0xfe/0x118)
> > R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
> > Krnl GPRS: 0000000000000002 0000000000000018 00000000004e82a8 0000000000000000
> > 0000000000000007 0000000000000003 00000000004e82a8 00000000007280c0
> > 0000000000000000 0000000000000000 00000000fac55d5f 0000000000000000
> > 000003e000000006 0000000000529f88 0000000000000000 0000000073353ad0
> > Krnl Code: 0000000000379352: a7cafffa ahi %r12,-6
> > 0000000000379356: a7f4ffb5 brc 15,3792c0
> > 000000000037935a: a7f40001 brc 15,37935c
> > >000000000037935e: a7f40000 brc 15,37935e
> > 0000000000379362: b90200bb ltgr %r11,%r11
> > 0000000000379366: a784fff0 brc 8,379346
> > 000000000037936a: a7f4ffe1 brc 15,37932c
> > 000000000037936e: a7f40001 brc 15,379370
> > Call Trace:
> > ([<00000000002080ae>] __set_page_dirty_nobuffers+0x10a/0x1d0)
> > [<0000000000234970>] page_remove_rmap+0x100/0x104
> > [<0000000000228608>] zap_pte_range+0x194/0x608
> > [<0000000000228caa>] unmap_page_range+0x22e/0x33c
> > [<0000000000228ec2>] unmap_vmas+0x10a/0x274
> > [<000000000022f2f6>] exit_mmap+0xd2/0x254
> > [<00000000001472d6>] mmput+0x5e/0x144
> > [<000000000014d306>] exit_mm+0x196/0x1d4
> > [<000000000014f650>] do_exit+0x18c/0x408
> > [<000000000014f92c>] do_group_exit+0x60/0xec
> > [<000000000014f9ea>] SyS_exit_group+0x32/0x40
> > [<00000000004e1660>] sysc_noemu+0x16/0x1c
> > [<000003fffd23ab96>] 0x3fffd23ab96
> > Last Breaking-Event-Address:
> > [<000000000037935a>] radix_tree_tag_set+0xfa/0x118
> >
> > While this bug was reproduced on a 3.0.x kernel, there is no reason why
> > it should not happen in mainline.
> >
> > The bug was triggered because a page had a valid mapping but by the time
> > set_page_dirty() was called there was no valid entry in the radix tree.
> > This was reproduced while underlying storage was unplugged but this may
> > be indirectly related to the problem.
> >
> > This bug only triggers on s390 and may be explained by a race. Normally
> > when pages are being readahead in read_cache_pages(), an attempt is made
> > to add the page to the page cache. If that fails, the page is invalidated
> > by locking it, giving it a valid mapping, calling do_invalidatepage()
> > and then setting mapping to NULL again. It's similar with page cache is
> > being deleted. The page is locked, the tree lock is taken, it's removed
> > from the radix tree and the mapping is set to NULL.
> >
> > In many cases looking up the radix tree is protected by a combination of
> > the page lock and the tree lock. When zapping pages, the page lock is not
> > taken which does not matter as most architectures do nothing special with
> > the mapping and for file-backed pages, the mapping should be pinned by
> > the open file. However, s390 also calls set_dirty_page() for
> > PageSwapCache() pages where it is potentially racing against
> > reuse_swap_page() for example.
> >
> > This patch splits the propagation of the storage key into a separate
> > function and introduces a new variant of page_remove_rmap() called
> > page_remove_rmap_nolock() which is only used during page table zapping
> > that attempts to acquire the page lock. The approach is ugly although it
> > works in that the bug can no longer be triggered. At the time the page
> > lock is required, the PTL is held so it cannot sleep so it busy waits on
> > trylock_page(). That potentially deadlocks against reclaim which holds
> > the page table lock and is trying to acquire the pagetable lock so in
> > some cases there is no choice except to race. In the file-backed case,
> > this is ok because the address space will be valid (zap_pte_range() makes
> > the same assumption. However, s390 needs a better way of guarding against
> > PageSwapCache pages being removed from the radix tree while set_page_dirty()
> > is being called. The patch would be marginally better if in the PageSwapCache
> > case we simply tried to lock once and in the contended case just fail to
> > propogate the storage key. I lack familiarity with the s390 architecture
> > to be certain if this is safe or not. Suggestions on a better fix?
> >
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
>
> I'm very confused - and wonder if you are too ;)
>
It would not be the first time :)
> Please forgive me if I've not been patient enough in reading through
> all you've written, and knocking it into my skull: ask me to go back
> and read again.
>
> But it seems to me that you're approaching this from the wrong end
> (page_remove_rmap), instead of from the end that's giving rise to the
> problem - perhaps because you've not yet identified that other end?
>
Not as precisely as I would have like. I do not have access to the machine
in question unfortunately.
> I'm confused as to whether you see this problem with file pages,
> or with anon-swap-cache pages, or with both, or not yet determined.
>
PageSwapCache pages only.
> The obvious places which set page->mapping = NULL or ClearPageSwapCache,
> once they've been visible to the outside world, are careful to do so at
> the same time as removing the radix_tree entry, holding mapping->tree_lock.
>
The problem here is not that page->mapping is NULL but that the page->mapping
is valid while the page is not in the radix tree.
> And __set_page_dirty_nobuffers(), the set_page_dirty variant implicated
> in the trace above, is careful to recheck mapping under tree_lock, as I
> think are the others. There is an issue if mapping itself could vanish
> while in there, but that's not a danger in page_remove_rmap().
>
> (You do remind me that I meant years ago to switch swapper_space over
> to the much simpler __set_page_dirty_no_writeback(), which shmem has
> used for ages; but as far as this problem goes, that would probably
> be at best a workaround, rather than the proper fix.)
>
It would be a workaround. If in the future we wanted to treat swapper
space more like a normal file inode and writeback dirty pages from
the flusher thread then this bug would just pop its head back up.
> You indicate read_cache_pages_invalidate_page() above, and yes that's
> weird the way it sets and unsets page->mapping; I never came across it
> before, but IIUC it's safe because those pages have not yet been made
> visible to the outside world. (Or do we have a speculative pagecache
> issue here?)
>
We do not have a speculative pagecache issue here that I'm aware of.
> You compare do_invalidatepage() with deletion from page cache, but I hope
> that's mistaken: aren't such pages usually being invalidated precisely
> because they raced with pages found already there in the page cache,
> which need to be left in place?
>
True. What I was trying to do was point out that we usually handle
pages being removed from the page cache with a combination of the page
lock and tree lock. s390 does not have the same type of protection in
page_remove_rmap() when handling PageSwapCache pages.
> So, is there somewhere which is setting page->mapping = NULL, or
> doing ClearPageSwapCache, in a separate block from removing that
> page's entry from the radix_tree? If there is, then I think it
> could pose a problem on more than s390.
>
> Hmm, mm/migrate.c.
>
Migration moves the page mapping under the tree lock so
__set_page_dirty_nobuffers() I don't think that is it.
I think the race is against something like reuse_swap_page() which locks
the page and removes it from swap cache while page_remove_rmap() looks
up the same page.
--
Mel Gorman
SUSE Labs
next prev parent reply other threads:[~2012-04-17 12:22 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-04-16 14:14 [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock Mel Gorman
2012-04-16 14:14 ` Mel Gorman
2012-04-16 14:53 ` Rik van Riel
2012-04-16 14:53 ` Rik van Riel
2012-04-16 15:02 ` Mel Gorman
2012-04-16 15:02 ` Mel Gorman
2012-04-16 15:50 ` Martin Schwidefsky
2012-04-16 15:50 ` Martin Schwidefsky
2012-04-17 12:29 ` Mel Gorman
2012-04-17 12:29 ` Mel Gorman
2012-04-17 13:02 ` Martin Schwidefsky
2012-04-17 13:02 ` Martin Schwidefsky
2012-04-18 4:00 ` Hugh Dickins
2012-04-18 4:00 ` Hugh Dickins
2012-04-16 21:14 ` Hugh Dickins
2012-04-16 21:14 ` Hugh Dickins
2012-04-17 12:22 ` Mel Gorman [this message]
2012-04-17 12:22 ` Mel Gorman
2012-04-18 3:52 ` Hugh Dickins
2012-04-18 3:52 ` Hugh Dickins
2012-04-18 15:28 ` Mel Gorman
2012-04-18 15:28 ` Mel Gorman
2012-04-18 17:09 ` Hugh Dickins
2012-04-18 17:09 ` Hugh Dickins
2012-04-23 12:41 ` Mel Gorman
2012-04-23 12:41 ` Mel Gorman
2012-04-23 18:09 ` Hugh Dickins
2012-04-23 18:09 ` Hugh Dickins
2012-04-23 18:14 ` [PATCH] mm: fix s390 BUG by __set_page_dirty_no_writeback on swap Hugh Dickins
2012-04-23 18:14 ` Hugh Dickins
2012-04-18 18:29 ` [RFC PATCH] s390: mm: rmap: Transfer storage key to struct page under the page lock Martin Schwidefsky
2012-04-18 18:29 ` Martin Schwidefsky
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120417122202.GF2359@suse.de \
--to=mgorman@suse.de \
--cc=heiko.carstens@de.ibm.com \
--cc=hughd@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-s390@vger.kernel.org \
--cc=riel@redhat.com \
--cc=schwidefsky@de.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.