* [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
@ 2012-10-22 15:06 Jan Kara
  2012-10-22 19:38 ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2012-10-22 15:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Jan Kara, Martin Schwidefsky, Mel Gorman, linux-s390,
	Hugh Dickins

On s390, any write to a page (even from the kernel itself) sets the
architecture-specific page dirty bit. Thus when a page is written to via a
buffered write, the HW dirty bit gets set, and when we later map and unmap
the page, page_remove_rmap() finds the dirty bit and calls set_page_dirty().

Dirtying a page which shouldn't be dirty can cause all sorts of problems for
filesystems. The bug we observed in practice is that buffers from the page
get freed, so when the page is later marked dirty and writeback writes it,
XFS crashes due to the assertion BUG_ON(!PagePrivate(page)) in page_buffers()
called from xfs_count_page_state().
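
For reference, page_buffers() is essentially the following macro from
include/linux/buffer_head.h (paraphrased here for illustration; it is not
part of this patch):

	/* Return the buffer_head list attached via page->private;
	 * the page is required to already have buffers attached. */
	#define page_buffers(page)					\
		({							\
			BUG_ON(!PagePrivate(page));			\
			((struct buffer_head *)page_private(page));	\
		})

So any page that reaches writeback marked dirty but with its buffers already
stripped trips this assertion.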

A similar problem can also happen when the zero_user_segment() call from
xfs_vm_writepage() (or block_write_full_page(), for that matter) sets the
hardware dirty bit during writeback, buffers are later freed, and the page
is then unmapped.

Fix the issue by ignoring the s390 HW dirty bit for page cache pages of
mappings with mapping_cap_account_dirty(). This is safe because for such
mappings, when a page is marked writeable in its PTE it is also marked dirty
in do_wp_page() or do_page_fault(), and when the dirty bit is cleared by
clear_page_dirty_for_io(), the page gets write-protected in page_mkclean().
So a pagecache page is writeable if and only if it is dirty.
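
(The mapping_cap_account_dirty() test is what excludes shmem/tmpfs and ramfs,
whose backing devices do not participate in dirty accounting and therefore do
not maintain the invariant above. Roughly, the 2012-era helper in
include/linux/backing-dev.h reads as follows; paraphrased, not part of this
patch:

	/* True if the mapping's backing device accounts dirty pages,
	 * i.e. the filesystem takes part in dirty tracking/writeback. */
	static inline bool mapping_cap_account_dirty(struct address_space *mapping)
	{
		return bdi_cap_account_dirty(mapping->backing_dev_info);
	}

with bdi_cap_account_dirty() testing that BDI_CAP_NO_ACCT_DIRTY is not set
in the bdi capabilities.)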

Thanks to Hugh Dickins <hughd@google.com> for pointing out that the mapping
has to have mapping_cap_account_dirty() for things to work, and for proposing
a cleaned-up variant of the patch.

The patch has survived about two hours of running fsx-linux on tmpfs while
heavily swapping, and several days of running on our build machines where
the original problem was triggered.

CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Mel Gorman <mgorman@suse.de>
CC: linux-s390@vger.kernel.org
CC: Hugh Dickins <hughd@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/rmap.c |   20 +++++++++++++++-----
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 7df7984..2ee1ef0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -56,6 +56,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
 #include <linux/hugetlb.h>
+#include <linux/backing-dev.h>
 
 #include <asm/tlbflush.h>
 
@@ -926,11 +927,8 @@ int page_mkclean(struct page *page)
 
 	if (page_mapped(page)) {
 		struct address_space *mapping = page_mapping(page);
-		if (mapping) {
+		if (mapping)
 			ret = page_mkclean_file(mapping, page);
-			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
-				ret = 1;
-		}
 	}
 
 	return ret;
@@ -1116,6 +1114,7 @@ void page_add_file_rmap(struct page *page)
  */
 void page_remove_rmap(struct page *page)
 {
+	struct address_space *mapping = page_mapping(page);
 	bool anon = PageAnon(page);
 	bool locked;
 	unsigned long flags;
@@ -1138,8 +1137,19 @@ void page_remove_rmap(struct page *page)
 	 * this if the page is anon, so about to be freed; but perhaps
 	 * not if it's in swapcache - there might be another pte slot
 	 * containing the swap entry, but page not yet written to swap.
+	 *
+	 * And we can skip it on file pages, so long as the filesystem
+	 * participates in dirty tracking; but need to catch shm and tmpfs
+	 * and ramfs pages which have been modified since creation by read
+	 * fault.
+	 *
+	 * Note that mapping must be decided above, before decrementing
+	 * mapcount (which luckily provides a barrier): once page is unmapped,
+	 * it could be truncated and page->mapping reset to NULL at any moment.
+	 * Note also that we are relying on page_mapping(page) to set mapping
+	 * to &swapper_space when PageSwapCache(page).
 	 */
-	if ((!anon || PageSwapCache(page)) &&
+	if (mapping && !mapping_cap_account_dirty(mapping) &&
 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
 		set_page_dirty(page);
 	/*
-- 
1.7.1


* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-22 15:06 [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390 Jan Kara
@ 2012-10-22 19:38 ` Andrew Morton
  2012-10-23  4:40   ` Hugh Dickins
  2012-10-23 10:21   ` Jan Kara
  0 siblings, 2 replies; 10+ messages in thread
From: Andrew Morton @ 2012-10-22 19:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-mm, Martin Schwidefsky, Mel Gorman, linux-s390,
	Hugh Dickins

On Mon, 22 Oct 2012 17:06:46 +0200
Jan Kara <jack@suse.cz> wrote:

> On s390, any write to a page (even from the kernel itself) sets the
> architecture-specific page dirty bit. Thus when a page is written to via a
> buffered write, the HW dirty bit gets set, and when we later map and unmap
> the page, page_remove_rmap() finds the dirty bit and calls set_page_dirty().
> 
> Dirtying a page which shouldn't be dirty can cause all sorts of problems for
> filesystems. The bug we observed in practice is that buffers from the page
> get freed, so when the page is later marked dirty and writeback writes it,
> XFS crashes due to the assertion BUG_ON(!PagePrivate(page)) in page_buffers()
> called from xfs_count_page_state().
> 
> A similar problem can also happen when the zero_user_segment() call from
> xfs_vm_writepage() (or block_write_full_page(), for that matter) sets the
> hardware dirty bit during writeback, buffers are later freed, and the page
> is then unmapped.
> 
> Fix the issue by ignoring the s390 HW dirty bit for page cache pages of
> mappings with mapping_cap_account_dirty(). This is safe because for such
> mappings, when a page is marked writeable in its PTE it is also marked dirty
> in do_wp_page() or do_page_fault(), and when the dirty bit is cleared by
> clear_page_dirty_for_io(), the page gets write-protected in page_mkclean().
> So a pagecache page is writeable if and only if it is dirty.
> 
> Thanks to Hugh Dickins <hughd@google.com> for pointing out that the mapping
> has to have mapping_cap_account_dirty() for things to work, and for proposing
> a cleaned-up variant of the patch.
> 
> The patch has survived about two hours of running fsx-linux on tmpfs while
> heavily swapping, and several days of running on our build machines where
> the original problem was triggered.

That seems a fairly serious problem.  To which kernel version(s) should
we apply the fix?

> diff --git a/mm/rmap.c b/mm/rmap.c

It's a bit surprising that none of the added comments mention the s390
pte-dirtying oddity.  I don't see an obvious place to mention this, but
I for one didn't know about this and it would be good if we could
capture the info _somewhere_?


* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-22 19:38 ` Andrew Morton
@ 2012-10-23  4:40   ` Hugh Dickins
  2012-10-23 10:21   ` Jan Kara
  1 sibling, 0 replies; 10+ messages in thread
From: Hugh Dickins @ 2012-10-23  4:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, linux-mm, Martin Schwidefsky, Mel Gorman, linux-s390

On Mon, 22 Oct 2012, Andrew Morton wrote:
> On Mon, 22 Oct 2012 17:06:46 +0200
> Jan Kara <jack@suse.cz> wrote:
> 
> > On s390, any write to a page (even from the kernel itself) sets the
> > architecture-specific page dirty bit. Thus when a page is written to via a
> > buffered write, the HW dirty bit gets set, and when we later map and unmap
> > the page, page_remove_rmap() finds the dirty bit and calls set_page_dirty().
> > 
> > Dirtying a page which shouldn't be dirty can cause all sorts of problems for
> > filesystems. The bug we observed in practice is that buffers from the page
> > get freed, so when the page is later marked dirty and writeback writes it,
> > XFS crashes due to the assertion BUG_ON(!PagePrivate(page)) in page_buffers()
> > called from xfs_count_page_state().
> > 
> > A similar problem can also happen when the zero_user_segment() call from
> > xfs_vm_writepage() (or block_write_full_page(), for that matter) sets the
> > hardware dirty bit during writeback, buffers are later freed, and the page
> > is then unmapped.
> > 
> > Fix the issue by ignoring the s390 HW dirty bit for page cache pages of
> > mappings with mapping_cap_account_dirty(). This is safe because for such
> > mappings, when a page is marked writeable in its PTE it is also marked dirty
> > in do_wp_page() or do_page_fault(), and when the dirty bit is cleared by
> > clear_page_dirty_for_io(), the page gets write-protected in page_mkclean().
> > So a pagecache page is writeable if and only if it is dirty.
> > 
> > Thanks to Hugh Dickins <hughd@google.com> for pointing out that the mapping
> > has to have mapping_cap_account_dirty() for things to work, and for proposing
> > a cleaned-up variant of the patch.
> > 
> > The patch has survived about two hours of running fsx-linux on tmpfs while
> > heavily swapping, and several days of running on our build machines where
> > the original problem was triggered.
> 
> That seems a fairly serious problem.  To which kernel version(s) should
> we apply the fix?

That I'll leave Jan and/or Martin to answer.

> 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> 
> It's a bit surprising that none of the added comments mention the s390
> pte-dirtying oddity.  I don't see an obvious place to mention this, but
> I for one didn't know about this and it would be good if we could
> capture the info _somewhere_?

I think it's okay: the comment you can see in Jan's patch is extending
this existing comment in page_remove_rmap(), that I added sometime in
the past (largely because "page_test_and_clear_dirty" sounds so
magisterially generic, when in actuality it's specific to s390):

	/*
	 * Now that the last pte has gone, s390 must transfer dirty
	 * flag from storage key to struct page.  We can usually skip
	 * this if the page is anon, so about to be freed; but perhaps
	 * not if it's in swapcache - there might be another pte slot
	 * containing the swap entry, but page not yet written to swap.
	 */

And one of the delights of Jan's patch is that it removes the other
callsite completely.

Hugh


* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-22 19:38 ` Andrew Morton
  2012-10-23  4:40   ` Hugh Dickins
@ 2012-10-23 10:21   ` Jan Kara
  2012-10-23 21:56     ` Andrew Morton
  1 sibling, 1 reply; 10+ messages in thread
From: Jan Kara @ 2012-10-23 10:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, linux-mm, Martin Schwidefsky, Mel Gorman, linux-s390,
	Hugh Dickins

On Mon 22-10-12 12:38:52, Andrew Morton wrote:
> On Mon, 22 Oct 2012 17:06:46 +0200
> Jan Kara <jack@suse.cz> wrote:
> 
> > On s390, any write to a page (even from the kernel itself) sets the
> > architecture-specific page dirty bit. Thus when a page is written to via a
> > buffered write, the HW dirty bit gets set, and when we later map and unmap
> > the page, page_remove_rmap() finds the dirty bit and calls set_page_dirty().
> > 
> > Dirtying a page which shouldn't be dirty can cause all sorts of problems for
> > filesystems. The bug we observed in practice is that buffers from the page
> > get freed, so when the page is later marked dirty and writeback writes it,
> > XFS crashes due to the assertion BUG_ON(!PagePrivate(page)) in page_buffers()
> > called from xfs_count_page_state().
> > 
> > A similar problem can also happen when the zero_user_segment() call from
> > xfs_vm_writepage() (or block_write_full_page(), for that matter) sets the
> > hardware dirty bit during writeback, buffers are later freed, and the page
> > is then unmapped.
> > 
> > Fix the issue by ignoring the s390 HW dirty bit for page cache pages of
> > mappings with mapping_cap_account_dirty(). This is safe because for such
> > mappings, when a page is marked writeable in its PTE it is also marked dirty
> > in do_wp_page() or do_page_fault(), and when the dirty bit is cleared by
> > clear_page_dirty_for_io(), the page gets write-protected in page_mkclean().
> > So a pagecache page is writeable if and only if it is dirty.
> > 
> > Thanks to Hugh Dickins <hughd@google.com> for pointing out that the mapping
> > has to have mapping_cap_account_dirty() for things to work, and for proposing
> > a cleaned-up variant of the patch.
> > 
> > The patch has survived about two hours of running fsx-linux on tmpfs while
> > heavily swapping, and several days of running on our build machines where
> > the original problem was triggered.
> 
> That seems a fairly serious problem.  To which kernel version(s) should
> we apply the fix?
  Well, XFS will crash starting from the 2.6.36 kernel, where the assertion
was added. Previously XFS just silently added buffers (as other filesystems
do) and wrote / redirtied the page (unnecessarily). So, looking at the
maintained -stable branches, I think pushing the patch to -stable from 3.0
on should be enough.

> > diff --git a/mm/rmap.c b/mm/rmap.c
> 
> It's a bit surprising that none of the added comments mention the s390
> pte-dirtying oddity.  I don't see an obvious place to mention this, but
> I for one didn't know about this and it would be good if we could
> capture the info _somewhere_?
  As Hugh says, the comment before page_test_and_clear_dirty() is somewhat
updated. But do you mean recording somewhere the catch that the s390 HW
dirty bit also gets set whenever we write to a page from the kernel? I guess
we could add that to the comment before page_test_and_clear_dirty() in
page_remove_rmap() and also before the definition of
page_test_and_clear_dirty(). That way most people who add / remove these
calls will be warned. OK?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-23 10:21   ` Jan Kara
@ 2012-10-23 21:56     ` Andrew Morton
  2012-10-24  8:30       ` Martin Schwidefsky
  2012-10-25 20:01       ` Jan Kara
  0 siblings, 2 replies; 10+ messages in thread
From: Andrew Morton @ 2012-10-23 21:56 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-mm, Martin Schwidefsky, Mel Gorman, linux-s390,
	Hugh Dickins

On Tue, 23 Oct 2012 12:21:53 +0200
Jan Kara <jack@suse.cz> wrote:

> > That seems a fairly serious problem.  To which kernel version(s) should
> > we apply the fix?
>   Well, XFS will crash starting from the 2.6.36 kernel, where the assertion
> was added. Previously XFS just silently added buffers (as other filesystems
> do) and wrote / redirtied the page (unnecessarily). So, looking at the
> maintained -stable branches, I think pushing the patch to -stable from 3.0
> on should be enough.

OK, thanks, I made it so.

> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > 
> > It's a bit surprising that none of the added comments mention the s390
> > pte-dirtying oddity.  I don't see an obvious place to mention this, but
> > I for one didn't know about this and it would be good if we could
> > capture the info _somewhere_?
>   As Hugh says, the comment before page_test_and_clear_dirty() is somewhat
> updated. But do you mean recording somewhere the catch that the s390 HW
> dirty bit also gets set whenever we write to a page from the kernel?

Yes, this.  It's surprising behaviour which we may trip over again, so
how do we inform developers about it?

> I guess we could
> add that to the comment before page_test_and_clear_dirty() in
> page_remove_rmap() and also before the definition of
> page_test_and_clear_dirty(). That way most people who add / remove these
> calls will be warned. OK?

Sounds good, thanks.


* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-23 21:56     ` Andrew Morton
@ 2012-10-24  8:30       ` Martin Schwidefsky
  2012-10-25 20:01       ` Jan Kara
  1 sibling, 0 replies; 10+ messages in thread
From: Martin Schwidefsky @ 2012-10-24  8:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, linux-mm, Mel Gorman, linux-s390, Hugh Dickins

On Tue, 23 Oct 2012 14:56:36 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 23 Oct 2012 12:21:53 +0200
> Jan Kara <jack@suse.cz> wrote:
> 
> > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > 
> > > It's a bit surprising that none of the added comments mention the s390
> > > pte-dirtying oddity.  I don't see an obvious place to mention this, but
> > > I for one didn't know about this and it would be good if we could
> > > capture the info _somewhere_?
> >   As Hugh says, the comment before page_test_and_clear_dirty() is somewhat
> > updated. But do you mean recording somewhere the catch that the s390 HW
> > dirty bit also gets set whenever we write to a page from the kernel?
> 
> Yes, this.  It's surprising behaviour which we may trip over again, so
> how do we inform developers about it?

That is what I worry about as well. It is not the first time we have tripped
over the per-page dirty bit, and I guess it won't be the last. Therefore I
created a patch to switch s390 over to fault-based dirty bits; the sneak
performance tests are promising. If we do not find any major performance
degradation, this would be my preferred way to fix this problem for good.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-23 21:56     ` Andrew Morton
  2012-10-24  8:30       ` Martin Schwidefsky
@ 2012-10-25 20:01       ` Jan Kara
  2012-12-14  8:45         ` Martin Schwidefsky
  1 sibling, 1 reply; 10+ messages in thread
From: Jan Kara @ 2012-10-25 20:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, linux-mm, Martin Schwidefsky, Mel Gorman, linux-s390,
	Hugh Dickins

On Tue 23-10-12 14:56:36, Andrew Morton wrote:
> On Tue, 23 Oct 2012 12:21:53 +0200
> Jan Kara <jack@suse.cz> wrote:
> 
> > > That seems a fairly serious problem.  To which kernel version(s) should
> > > we apply the fix?
> >   Well, XFS will crash starting from the 2.6.36 kernel, where the
> > assertion was added. Previously XFS just silently added buffers (as other
> > filesystems do) and wrote / redirtied the page (unnecessarily). So,
> > looking at the maintained -stable branches, I think pushing the patch to
> > -stable from 3.0 on should be enough.
> 
> OK, thanks, I made it so.
> 
> > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > 
> > > It's a bit surprising that none of the added comments mention the s390
> > > pte-dirtying oddity.  I don't see an obvious place to mention this, but
> > > I for one didn't know about this and it would be good if we could
> > > capture the info _somewhere_?
> >   As Hugh says, the comment before page_test_and_clear_dirty() is somewhat
> > updated. But do you mean recording somewhere the catch that the s390 HW
> > dirty bit also gets set whenever we write to a page from the kernel?
> 
> Yes, this.  It's surprising behaviour which we may trip over again, so
> how do we inform developers about it?
> 
> > I guess we could
> > add that to the comment before page_test_and_clear_dirty() in
> > page_remove_rmap() and also before the definition of
> > page_test_and_clear_dirty(). That way most people who add / remove these
> > calls will be warned. OK?
> 
> Sounds good, thanks.
  OK, the patch is attached. As Martin says, it may be obsolete soon but just
in case Martin's patch set gets delayed...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-mm-Comment-on-storage-key-dirty-bit-semantics.patch --]
[-- Type: text/x-patch, Size: 0 bytes --]




* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-10-25 20:01       ` Jan Kara
@ 2012-12-14  8:45         ` Martin Schwidefsky
  2012-12-17 23:31           ` Hugh Dickins
  0 siblings, 1 reply; 10+ messages in thread
From: Martin Schwidefsky @ 2012-12-14  8:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andrew Morton, linux-mm, Mel Gorman, linux-s390, Hugh Dickins

On Thu, 25 Oct 2012 22:01:41 +0200
Jan Kara <jack@suse.cz> wrote:

> On Tue 23-10-12 14:56:36, Andrew Morton wrote:
> > On Tue, 23 Oct 2012 12:21:53 +0200
> > Jan Kara <jack@suse.cz> wrote:
> > 
> > > > That seems a fairly serious problem.  To which kernel version(s) should
> > > > we apply the fix?
> > >   Well, XFS will crash starting from the 2.6.36 kernel, where the
> > > assertion was added. Previously XFS just silently added buffers (as
> > > other filesystems do) and wrote / redirtied the page (unnecessarily).
> > > So, looking at the maintained -stable branches, I think pushing the
> > > patch to -stable from 3.0 on should be enough.
> > 
> > OK, thanks, I made it so.
> > 
> > > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > 
> > > > It's a bit surprising that none of the added comments mention the s390
> > > > pte-dirtying oddity.  I don't see an obvious place to mention this, but
> > > > I for one didn't know about this and it would be good if we could
> > > > capture the info _somewhere_?
> > >   As Hugh says, the comment before page_test_and_clear_dirty() is somewhat
> > > updated. But do you mean recording somewhere the catch that the s390 HW
> > > dirty bit also gets set whenever we write to a page from the kernel?
> > 
> > Yes, this.  It's surprising behaviour which we may trip over again, so
> > how do we inform developers about it?
> > 
> > > I guess we could
> > > add that to the comment before page_test_and_clear_dirty() in
> > > page_remove_rmap() and also before the definition of
> > > page_test_and_clear_dirty(). That way most people who add / remove these
> > > calls will be warned. OK?
> > 
> > Sounds good, thanks.
>   OK, the patch is attached. As Martin says, it may be obsolete soon but just
> in case Martin's patch set gets delayed...
> 
> 								Honza

The patch got delayed a bit; the main issue was getting conclusive
performance measurements of the patch's effects. I am pretty sure that the
patch works and will not cause any major degradation, so it is time to ask
for your opinion. Here we go:
--
Subject: [PATCH] s390/mm: implement software dirty bits

From: Martin Schwidefsky <schwidefsky@de.ibm.com>

The s390 architecture is unique with respect to dirty page detection:
it uses the change bit in the per-page storage key to track page
modifications. All other architectures track dirty bits by means
of page table entries. This property of s390 has caused numerous
problems in the past, e.g. see git commit ef5d437f71afdf4a
"mm: fix XFS oops due to dirty pages without buffers on s390".

To avoid future issues with per-page dirty bits, convert s390 to a
fault-based software dirty bit detection mechanism. All user page
table entries which are marked as clean will be hardware read-only,
even if the pte is supposed to be writable. A write by the user
process will trigger a protection fault, which will cause the user
pte to be marked as dirty and the hardware read-only bit to be
removed.
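
No new fault-handling code is needed for this: the generic code already
resolves such a protection fault. Roughly, the tail of handle_pte_fault()
in mm/memory.c of that era looks like the following (paraphrased for
illustration; not part of this patch):

	if (flags & FAULT_FLAG_WRITE) {
		if (!pte_write(entry))
			return do_wp_page(mm, vma, address,
					  pte, pmd, ptl, entry);
		/* The pte is writable in software terms but was clean;
		 * the s390 pte_mkdirty() below sets _PAGE_SWC and
		 * clears the hardware read-only bit. */
		entry = pte_mkdirty(entry);
	}
	entry = pte_mkyoung(entry);
	ptep_set_access_flags(vma, address, pte, entry,
			      flags & FAULT_FLAG_WRITE);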

With this change, the dirty bit in the storage key is irrelevant
for Linux as a host, but the storage key is still required for
KVM guests. The effect is that page_test_and_clear_dirty and the
related code can be removed. The referenced bit in the storage
key is still used by the page_test_and_clear_young primitive to
provide page age information.

For page cache pages of mappings with mapping_cap_account_dirty
there will not be any change in behavior, as the dirty bit tracking
already uses read-only ptes to control the number of dirty pages.
Only swap cache pages and pages of mappings without
mapping_cap_account_dirty can incur additional protection faults.
To avoid an excessive number of additional faults, the mk_pte
primitive checks for PageDirty if the pgprot value allows writes,
and pre-dirties the pte. That avoids all additional faults for
tmpfs and shmem pages until these pages are added to the swap cache.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/include/asm/page.h    |  22 -------
 arch/s390/include/asm/pgtable.h | 131 +++++++++++++++++++++++++++-------------
 arch/s390/include/asm/sclp.h    |   1 -
 arch/s390/include/asm/setup.h   |  16 ++---
 arch/s390/kvm/kvm-s390.c        |   2 +-
 arch/s390/lib/uaccess_pt.c      |   2 +-
 arch/s390/mm/pageattr.c         |   2 +-
 arch/s390/mm/vmem.c             |  24 +++-----
 drivers/s390/char/sclp_cmd.c    |  10 +--
 include/asm-generic/pgtable.h   |  10 ---
 include/linux/page-flags.h      |   8 ---
 mm/rmap.c                       |  23 -------
 12 files changed, 112 insertions(+), 139 deletions(-)

diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
index a86ad40840..75ce9b0 100644
--- a/arch/s390/include/asm/page.h
+++ b/arch/s390/include/asm/page.h
@@ -155,28 +155,6 @@ static inline int page_reset_referenced(unsigned long addr)
 #define _PAGE_ACC_BITS		0xf0	/* HW access control bits	*/
 
 /*
- * Test and clear dirty bit in storage key.
- * We can't clear the changed bit atomically. This is a potential
- * race against modification of the referenced bit. This function
- * should therefore only be called if it is not mapped in any
- * address space.
- *
- * Note that the bit gets set whenever page content is changed. That means
- * also when the page is modified by DMA or from inside the kernel.
- */
-#define __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
-static inline int page_test_and_clear_dirty(unsigned long pfn, int mapped)
-{
-	unsigned char skey;
-
-	skey = page_get_storage_key(pfn << PAGE_SHIFT);
-	if (!(skey & _PAGE_CHANGED))
-		return 0;
-	page_set_storage_key(pfn << PAGE_SHIFT, skey & ~_PAGE_CHANGED, mapped);
-	return 1;
-}
-
-/*
  * Test and clear referenced bit in storage key.
  */
 #define __HAVE_ARCH_PAGE_TEST_AND_CLEAR_YOUNG
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 33aeb77..66d3b2a 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -29,6 +29,7 @@
 #ifndef __ASSEMBLY__
 #include <linux/sched.h>
 #include <linux/mm_types.h>
+#include <linux/page-flags.h>
 #include <asm/bug.h>
 #include <asm/page.h>
 
@@ -221,13 +222,15 @@ extern unsigned long MODULES_END;
 /* Software bits in the page table entry */
 #define _PAGE_SWT	0x001		/* SW pte type bit t */
 #define _PAGE_SWX	0x002		/* SW pte type bit x */
-#define _PAGE_SWC	0x004		/* SW pte changed bit (for KVM) */
-#define _PAGE_SWR	0x008		/* SW pte referenced bit (for KVM) */
-#define _PAGE_SPECIAL	0x010		/* SW associated with special page */
+#define _PAGE_SWC	0x004		/* SW pte changed bit */
+#define _PAGE_SWR	0x008		/* SW pte referenced bit */
+#define _PAGE_SWW	0x010		/* SW pte write bit */
+#define _PAGE_SPECIAL	0x020		/* SW associated with special page */
 #define __HAVE_ARCH_PTE_SPECIAL
 
 /* Set of bits not changed in pte_modify */
-#define _PAGE_CHG_MASK	(PAGE_MASK | _PAGE_SPECIAL | _PAGE_SWC | _PAGE_SWR)
+#define _PAGE_CHG_MASK		(PAGE_MASK | _PAGE_SPECIAL | _PAGE_CO | \
+				 _PAGE_SWC | _PAGE_SWR)
 
 /* Six different types of pages. */
 #define _PAGE_TYPE_EMPTY	0x400
@@ -321,6 +324,7 @@ extern unsigned long MODULES_END;
 
 /* Bits in the region table entry */
 #define _REGION_ENTRY_ORIGIN	~0xfffUL/* region/segment table origin	    */
+#define _REGION_ENTRY_RO	0x200	/* region protection bit	    */
 #define _REGION_ENTRY_INV	0x20	/* invalid region table entry	    */
 #define _REGION_ENTRY_TYPE_MASK	0x0c	/* region/segment table type mask   */
 #define _REGION_ENTRY_TYPE_R1	0x0c	/* region first table type	    */
@@ -382,9 +386,10 @@ extern unsigned long MODULES_END;
  */
 #define PAGE_NONE	__pgprot(_PAGE_TYPE_NONE)
 #define PAGE_RO		__pgprot(_PAGE_TYPE_RO)
-#define PAGE_RW		__pgprot(_PAGE_TYPE_RW)
+#define PAGE_RW		__pgprot(_PAGE_TYPE_RO | _PAGE_SWW)
+#define PAGE_RWC	__pgprot(_PAGE_TYPE_RW | _PAGE_SWW | _PAGE_SWC)
 
-#define PAGE_KERNEL	PAGE_RW
+#define PAGE_KERNEL	PAGE_RWC
 #define PAGE_COPY	PAGE_RO
 
 /*
@@ -625,23 +630,23 @@ static inline pgste_t pgste_update_all(pte_t *ptep, pgste_t pgste)
 	bits = skey & (_PAGE_CHANGED | _PAGE_REFERENCED);
 	/* Clear page changed & referenced bit in the storage key */
 	if (bits & _PAGE_CHANGED)
-		page_set_storage_key(address, skey ^ bits, 1);
+		page_set_storage_key(address, skey ^ bits, 0);
 	else if (bits)
 		page_reset_referenced(address);
 	/* Transfer page changed & referenced bit to guest bits in pgste */
 	pgste_val(pgste) |= bits << 48;		/* RCP_GR_BIT & RCP_GC_BIT */
 	/* Get host changed & referenced bits from pgste */
 	bits |= (pgste_val(pgste) & (RCP_HR_BIT | RCP_HC_BIT)) >> 52;
-	/* Clear host bits in pgste. */
+	/* Transfer page changed & referenced bit to kvm user bits */
+	pgste_val(pgste) |= bits << 45;		/* KVM_UR_BIT & KVM_UC_BIT */
+	/* Clear relevant host bits in pgste. */
 	pgste_val(pgste) &= ~(RCP_HR_BIT | RCP_HC_BIT);
 	pgste_val(pgste) &= ~(RCP_ACC_BITS | RCP_FP_BIT);
 	/* Copy page access key and fetch protection bit to pgste */
 	pgste_val(pgste) |=
 		(unsigned long) (skey & (_PAGE_ACC_BITS | _PAGE_FP_BIT)) << 56;
-	/* Transfer changed and referenced to kvm user bits */
-	pgste_val(pgste) |= bits << 45;		/* KVM_UR_BIT & KVM_UC_BIT */
-	/* Transfer changed & referenced to pte sofware bits */
-	pte_val(*ptep) |= bits << 1;		/* _PAGE_SWR & _PAGE_SWC */
+	/* Transfer referenced bit to pte */
+	pte_val(*ptep) |= (bits & _PAGE_REFERENCED) << 1;
 #endif
 	return pgste;
 
@@ -654,20 +659,25 @@ static inline pgste_t pgste_update_young(pte_t *ptep, pgste_t pgste)
 
 	if (!pte_present(*ptep))
 		return pgste;
+	/* Get referenced bit from storage key */
 	young = page_reset_referenced(pte_val(*ptep) & PAGE_MASK);
-	/* Transfer page referenced bit to pte software bit (host view) */
-	if (young || (pgste_val(pgste) & RCP_HR_BIT))
+	if (young)
+		pgste_val(pgste) |= RCP_GR_BIT;
+	/* Get host referenced bit from pgste */
+	if (pgste_val(pgste) & RCP_HR_BIT) {
+		pgste_val(pgste) &= ~RCP_HR_BIT;
+		young = 1;
+	}
+	/* Transfer referenced bit to kvm user bits and pte */
+	if (young) {
+		pgste_val(pgste) |= KVM_UR_BIT;
 		pte_val(*ptep) |= _PAGE_SWR;
-	/* Clear host referenced bit in pgste. */
-	pgste_val(pgste) &= ~RCP_HR_BIT;
-	/* Transfer page referenced bit to guest bit in pgste */
-	pgste_val(pgste) |= (unsigned long) young << 50; /* set RCP_GR_BIT */
+	}
 #endif
 	return pgste;
-
 }
 
-static inline void pgste_set_pte(pte_t *ptep, pgste_t pgste, pte_t entry)
+static inline void pgste_set_key(pte_t *ptep, pgste_t pgste, pte_t entry)
 {
 #ifdef CONFIG_PGSTE
 	unsigned long address;
@@ -681,10 +691,23 @@ static inline void pgste_set_pte(pte_t *ptep, pgste_t pgste, pte_t entry)
 	/* Set page access key and fetch protection bit from pgste */
 	nkey |= (pgste_val(pgste) & (RCP_ACC_BITS | RCP_FP_BIT)) >> 56;
 	if (okey != nkey)
-		page_set_storage_key(address, nkey, 1);
+		page_set_storage_key(address, nkey, 0);
 #endif
 }
 
+static inline void pgste_set_pte(pte_t *ptep, pte_t entry)
+{
+	if (!MACHINE_HAS_ESOP && (pte_val(entry) & _PAGE_SWW)) {
+		/*
+		 * Without enhanced suppression-on-protection force
+		 * the dirty bit on for all writable ptes.
+		 */
+		pte_val(entry) |= _PAGE_SWC;
+		pte_val(entry) &= ~_PAGE_RO;
+	}
+	*ptep = entry;
+}
+
 /**
  * struct gmap_struct - guest address space
  * @mm: pointer to the parent mm_struct
@@ -743,11 +766,14 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 
 	if (mm_has_pgste(mm)) {
 		pgste = pgste_get_lock(ptep);
-		pgste_set_pte(ptep, pgste, entry);
-		*ptep = entry;
+		pgste_set_key(ptep, pgste, entry);
+		pgste_set_pte(ptep, entry);
 		pgste_set_unlock(ptep, pgste);
-	} else
+	} else {
+		if (!(pte_val(entry) & _PAGE_INVALID) && MACHINE_HAS_EDAT1)
+			pte_val(entry) |= _PAGE_CO;
 		*ptep = entry;
+	}
 }
 
 /*
@@ -756,16 +782,12 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
  */
 static inline int pte_write(pte_t pte)
 {
-	return (pte_val(pte) & _PAGE_RO) == 0;
+	return (pte_val(pte) & _PAGE_SWW) != 0;
 }
 
 static inline int pte_dirty(pte_t pte)
 {
-#ifdef CONFIG_PGSTE
-	if (pte_val(pte) & _PAGE_SWC)
-		return 1;
-#endif
-	return 0;
+	return (pte_val(pte) & _PAGE_SWC) != 0;
 }
 
 static inline int pte_young(pte_t pte)
@@ -815,11 +837,14 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
 	pte_val(pte) &= _PAGE_CHG_MASK;
 	pte_val(pte) |= pgprot_val(newprot);
+	if ((pte_val(pte) & _PAGE_SWC) && (pte_val(pte) & _PAGE_SWW))
+		pte_val(pte) &= ~_PAGE_RO;
 	return pte;
 }
 
 static inline pte_t pte_wrprotect(pte_t pte)
 {
+	pte_val(pte) &= ~_PAGE_SWW;
 	/* Do not clobber _PAGE_TYPE_NONE pages!  */
 	if (!(pte_val(pte) & _PAGE_INVALID))
 		pte_val(pte) |= _PAGE_RO;
@@ -828,20 +853,26 @@ static inline pte_t pte_wrprotect(pte_t pte)
 
 static inline pte_t pte_mkwrite(pte_t pte)
 {
-	pte_val(pte) &= ~_PAGE_RO;
+	pte_val(pte) |= _PAGE_SWW;
+	if (pte_val(pte) & _PAGE_SWC)
+		pte_val(pte) &= ~_PAGE_RO;
 	return pte;
 }
 
 static inline pte_t pte_mkclean(pte_t pte)
 {
-#ifdef CONFIG_PGSTE
 	pte_val(pte) &= ~_PAGE_SWC;
-#endif
+	/* Do not clobber _PAGE_TYPE_NONE pages!  */
+	if (!(pte_val(pte) & _PAGE_INVALID))
+		pte_val(pte) |= _PAGE_RO;
 	return pte;
 }
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
+	pte_val(pte) |= _PAGE_SWC;
+	if (pte_val(pte) & _PAGE_SWW)
+		pte_val(pte) &= ~_PAGE_RO;
 	return pte;
 }
 
@@ -879,10 +910,10 @@ static inline pte_t pte_mkhuge(pte_t pte)
 		pte_val(pte) |= _SEGMENT_ENTRY_INV;
 	}
 	/*
-	 * Clear SW pte bits SWT and SWX, there are no SW bits in a segment
-	 * table entry.
+	 * Clear SW pte bits, there are no SW bits in a segment table entry.
 	 */
-	pte_val(pte) &= ~(_PAGE_SWT | _PAGE_SWX);
+	pte_val(pte) &= ~(_PAGE_SWT | _PAGE_SWX | _PAGE_SWC |
+			  _PAGE_SWR | _PAGE_SWW);
 	/*
 	 * Also set the change-override bit because we don't need dirty bit
 	 * tracking for hugetlbfs pages.
@@ -1053,9 +1084,11 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
 					   unsigned long address,
 					   pte_t *ptep, pte_t pte)
 {
-	*ptep = pte;
-	if (mm_has_pgste(mm))
+	if (mm_has_pgste(mm)) {
+		pgste_set_pte(ptep, pte);
 		pgste_set_unlock(ptep, *(pgste_t *)(ptep + PTRS_PER_PTE));
+	} else
+		*ptep = pte;
 }
 
 #define __HAVE_ARCH_PTEP_CLEAR_FLUSH
@@ -1121,10 +1154,13 @@ static inline pte_t ptep_set_wrprotect(struct mm_struct *mm,
 			pgste = pgste_get_lock(ptep);
 
 		ptep_flush_lazy(mm, address, ptep);
-		*ptep = pte_wrprotect(pte);
+		pte = pte_wrprotect(pte);
 
-		if (mm_has_pgste(mm))
+		if (mm_has_pgste(mm)) {
+			pgste_set_pte(ptep, pte);
 			pgste_set_unlock(ptep, pgste);
+		} else
+			*ptep = pte;
 	}
 	return pte;
 }
@@ -1142,10 +1178,12 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
 		pgste = pgste_get_lock(ptep);
 
 	__ptep_ipte(address, ptep);
-	*ptep = entry;
 
-	if (mm_has_pgste(vma->vm_mm))
+	if (mm_has_pgste(vma->vm_mm)) {
+		pgste_set_pte(ptep, entry);
 		pgste_set_unlock(ptep, pgste);
+	} else
+		*ptep = entry;
 	return 1;
 }
 
@@ -1163,8 +1201,13 @@ static inline pte_t mk_pte_phys(unsigned long physpage, pgprot_t pgprot)
 static inline pte_t mk_pte(struct page *page, pgprot_t pgprot)
 {
 	unsigned long physpage = page_to_phys(page);
+	pte_t __pte = mk_pte_phys(physpage, pgprot);
 
-	return mk_pte_phys(physpage, pgprot);
+	if ((pte_val(__pte) & _PAGE_SWW) && PageDirty(page)) {
+		pte_val(__pte) |= _PAGE_SWC;
+		pte_val(__pte) &= ~_PAGE_RO;
+	}
+	return __pte;
 }
 
 #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
@@ -1256,6 +1299,8 @@ static inline int pmd_trans_splitting(pmd_t pmd)
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 			      pmd_t *pmdp, pmd_t entry)
 {
+	if (!(pmd_val(entry) & _SEGMENT_ENTRY_INV) && MACHINE_HAS_EDAT1)
+		pmd_val(entry) |= _SEGMENT_ENTRY_CO;
 	*pmdp = entry;
 }
 
diff --git a/arch/s390/include/asm/sclp.h b/arch/s390/include/asm/sclp.h
index 8337886..06a1361 100644
--- a/arch/s390/include/asm/sclp.h
+++ b/arch/s390/include/asm/sclp.h
@@ -46,7 +46,6 @@ int sclp_cpu_deconfigure(u8 cpu);
 void sclp_facilities_detect(void);
 unsigned long long sclp_get_rnmax(void);
 unsigned long long sclp_get_rzm(void);
-u8 sclp_get_fac85(void);
 int sclp_sdias_blk_count(void);
 int sclp_sdias_copy(void *dest, int blk_num, int nr_blks);
 int sclp_chp_configure(struct chp_id chpid);
diff --git a/arch/s390/include/asm/setup.h b/arch/s390/include/asm/setup.h
index f69f76b..f685751 100644
--- a/arch/s390/include/asm/setup.h
+++ b/arch/s390/include/asm/setup.h
@@ -64,13 +64,14 @@ extern unsigned int s390_user_mode;
 
 #define MACHINE_FLAG_VM		(1UL << 0)
 #define MACHINE_FLAG_IEEE	(1UL << 1)
-#define MACHINE_FLAG_CSP	(1UL << 3)
-#define MACHINE_FLAG_MVPG	(1UL << 4)
-#define MACHINE_FLAG_DIAG44	(1UL << 5)
-#define MACHINE_FLAG_IDTE	(1UL << 6)
-#define MACHINE_FLAG_DIAG9C	(1UL << 7)
-#define MACHINE_FLAG_MVCOS	(1UL << 8)
-#define MACHINE_FLAG_KVM	(1UL << 9)
+#define MACHINE_FLAG_CSP	(1UL << 2)
+#define MACHINE_FLAG_MVPG	(1UL << 3)
+#define MACHINE_FLAG_DIAG44	(1UL << 4)
+#define MACHINE_FLAG_IDTE	(1UL << 5)
+#define MACHINE_FLAG_DIAG9C	(1UL << 6)
+#define MACHINE_FLAG_MVCOS	(1UL << 7)
+#define MACHINE_FLAG_KVM	(1UL << 8)
+#define MACHINE_FLAG_ESOP	(1UL << 9)
 #define MACHINE_FLAG_EDAT1	(1UL << 10)
 #define MACHINE_FLAG_EDAT2	(1UL << 11)
 #define MACHINE_FLAG_LPAR	(1UL << 12)
@@ -84,6 +85,7 @@ extern unsigned int s390_user_mode;
 #define MACHINE_IS_LPAR		(S390_lowcore.machine_flags & MACHINE_FLAG_LPAR)
 
 #define MACHINE_HAS_DIAG9C	(S390_lowcore.machine_flags & MACHINE_FLAG_DIAG9C)
+#define MACHINE_HAS_ESOP	(S390_lowcore.machine_flags & MACHINE_FLAG_ESOP)
 #define MACHINE_HAS_PFMF	MACHINE_HAS_EDAT1
 #define MACHINE_HAS_HPAGE	MACHINE_HAS_EDAT1
 
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index c9011bf..4659b62 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -147,7 +147,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 		r = KVM_MAX_VCPUS;
 		break;
 	case KVM_CAP_S390_COW:
-		r = sclp_get_fac85() & 0x2;
+		r = MACHINE_HAS_ESOP;
 		break;
 	default:
 		r = 0;
diff --git a/arch/s390/lib/uaccess_pt.c b/arch/s390/lib/uaccess_pt.c
index 9017a63..a70ee84 100644
--- a/arch/s390/lib/uaccess_pt.c
+++ b/arch/s390/lib/uaccess_pt.c
@@ -50,7 +50,7 @@ static __always_inline unsigned long follow_table(struct mm_struct *mm,
 	ptep = pte_offset_map(pmd, addr);
 	if (!pte_present(*ptep))
 		return -0x11UL;
-	if (write && !pte_write(*ptep))
+	if (write && (!pte_write(*ptep) || !pte_dirty(*ptep)))
 		return -0x04UL;
 
 	return (pte_val(*ptep) & PAGE_MASK) + (addr & ~PAGE_MASK);
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index 29ccee3..d21040e 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -127,7 +127,7 @@ void kernel_map_pages(struct page *page, int numpages, int enable)
 			pte_val(*pte) = _PAGE_TYPE_EMPTY;
 			continue;
 		}
-		*pte = mk_pte_phys(address, __pgprot(_PAGE_TYPE_RW));
+		pte_val(*pte) = __pa(address);
 	}
 }
 
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 6ed1426..79699f46 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -85,11 +85,9 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
 	pud_t *pu_dir;
 	pmd_t *pm_dir;
 	pte_t *pt_dir;
-	pte_t  pte;
 	int ret = -ENOMEM;
 
 	while (address < end) {
-		pte = mk_pte_phys(address, __pgprot(ro ? _PAGE_RO : 0));
 		pg_dir = pgd_offset_k(address);
 		if (pgd_none(*pg_dir)) {
 			pu_dir = vmem_pud_alloc();
@@ -101,9 +99,9 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
 #if defined(CONFIG_64BIT) && !defined(CONFIG_DEBUG_PAGEALLOC)
 		if (MACHINE_HAS_EDAT2 && pud_none(*pu_dir) && address &&
 		    !(address & ~PUD_MASK) && (address + PUD_SIZE <= end)) {
-			pte_val(pte) |= _REGION3_ENTRY_LARGE;
-			pte_val(pte) |= _REGION_ENTRY_TYPE_R3;
-			pud_val(*pu_dir) = pte_val(pte);
+			pud_val(*pu_dir) = __pa(address) |
+				_REGION_ENTRY_TYPE_R3 | _REGION3_ENTRY_LARGE |
+				(ro ? _REGION_ENTRY_RO : 0);
 			address += PUD_SIZE;
 			continue;
 		}
@@ -118,8 +116,9 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
 #if defined(CONFIG_64BIT) && !defined(CONFIG_DEBUG_PAGEALLOC)
 		if (MACHINE_HAS_EDAT1 && pmd_none(*pm_dir) && address &&
 		    !(address & ~PMD_MASK) && (address + PMD_SIZE <= end)) {
-			pte_val(pte) |= _SEGMENT_ENTRY_LARGE;
-			pmd_val(*pm_dir) = pte_val(pte);
+			pmd_val(*pm_dir) = __pa(address) |
+				_SEGMENT_ENTRY | _SEGMENT_ENTRY_LARGE |
+				(ro ? _SEGMENT_ENTRY_RO : 0);
 			address += PMD_SIZE;
 			continue;
 		}
@@ -132,7 +131,7 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
 		}
 
 		pt_dir = pte_offset_kernel(pm_dir, address);
-		*pt_dir = pte;
+		pte_val(*pt_dir) = __pa(address) | (ro ? _PAGE_RO : 0);
 		address += PAGE_SIZE;
 	}
 	ret = 0;
@@ -199,7 +198,6 @@ int __meminit vmemmap_populate(struct page *start, unsigned long nr, int node)
 	pud_t *pu_dir;
 	pmd_t *pm_dir;
 	pte_t *pt_dir;
-	pte_t  pte;
 	int ret = -ENOMEM;
 
 	start_addr = (unsigned long) start;
@@ -237,9 +235,8 @@ int __meminit vmemmap_populate(struct page *start, unsigned long nr, int node)
 				new_page = vmemmap_alloc_block(PMD_SIZE, node);
 				if (!new_page)
 					goto out;
-				pte = mk_pte_phys(__pa(new_page), PAGE_RW);
-				pte_val(pte) |= _SEGMENT_ENTRY_LARGE;
-				pmd_val(*pm_dir) = pte_val(pte);
+				pmd_val(*pm_dir) = __pa(new_page) |
+					_SEGMENT_ENTRY | _SEGMENT_ENTRY_LARGE;
 				address = (address + PMD_SIZE) & PMD_MASK;
 				continue;
 			}
@@ -260,8 +257,7 @@ int __meminit vmemmap_populate(struct page *start, unsigned long nr, int node)
 			new_page =__pa(vmem_alloc_pages(0));
 			if (!new_page)
 				goto out;
-			pte = pfn_pte(new_page >> PAGE_SHIFT, PAGE_KERNEL);
-			*pt_dir = pte;
+			pte_val(*pt_dir) = __pa(new_page);
 		}
 		address += PAGE_SIZE;
 	}
diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index c44d13f..30a2255 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -56,7 +56,6 @@ static int __initdata early_read_info_sccb_valid;
 
 u64 sclp_facilities;
 static u8 sclp_fac84;
-static u8 sclp_fac85;
 static unsigned long long rzm;
 static unsigned long long rnmax;
 
@@ -131,7 +130,8 @@ void __init sclp_facilities_detect(void)
 	sccb = &early_read_info_sccb;
 	sclp_facilities = sccb->facilities;
 	sclp_fac84 = sccb->fac84;
-	sclp_fac85 = sccb->fac85;
+	if (sccb->fac85 & 0x02)
+		S390_lowcore.machine_flags |= MACHINE_FLAG_ESOP;
 	rnmax = sccb->rnmax ? sccb->rnmax : sccb->rnmax2;
 	rzm = sccb->rnsize ? sccb->rnsize : sccb->rnsize2;
 	rzm <<= 20;
@@ -171,12 +171,6 @@ unsigned long long sclp_get_rzm(void)
 	return rzm;
 }
 
-u8 sclp_get_fac85(void)
-{
-	return sclp_fac85;
-}
-EXPORT_SYMBOL_GPL(sclp_get_fac85);
-
 /*
  * This function will be called after sclp_facilities_detect(), which gets
  * called from early.c code. Therefore the sccb should have valid contents.
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 83b54ed..bdd7fac 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -197,16 +197,6 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
-#ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
-#define page_test_and_clear_dirty(pfn, mapped)	(0)
-#endif
-
-#ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
-#define pte_maybe_dirty(pte)		pte_dirty(pte)
-#else
-#define pte_maybe_dirty(pte)		(1)
-#endif
-
 #ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_YOUNG
 #define page_test_and_clear_young(pfn) (0)
 #endif
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b5d1384..4c0c8eb 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -303,21 +303,13 @@ static inline void __SetPageUptodate(struct page *page)
 
 static inline void SetPageUptodate(struct page *page)
 {
-#ifdef CONFIG_S390
-	if (!test_and_set_bit(PG_uptodate, &page->flags))
-		page_set_storage_key(page_to_phys(page), PAGE_DEFAULT_KEY, 0);
-#else
 	/*
 	 * Memory barrier must be issued before setting the PG_uptodate bit,
 	 * so that all previous stores issued in order to bring the page
 	 * uptodate are actually visible before PageUptodate becomes true.
-	 *
-	 * s390 doesn't need an explicit smp_wmb here because the test and
-	 * set bit already provides full barriers.
 	 */
 	smp_wmb();
 	set_bit(PG_uptodate, &(page)->flags);
-#endif
 }
 
 CLEARPAGEFLAG(Uptodate, uptodate)
diff --git a/mm/rmap.c b/mm/rmap.c
index face808..ef75a7d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1144,29 +1144,6 @@ void page_remove_rmap(struct page *page)
 		goto out;
 
 	/*
-	 * Now that the last pte has gone, s390 must transfer dirty
-	 * flag from storage key to struct page.  We can usually skip
-	 * this if the page is anon, so about to be freed; but perhaps
-	 * not if it's in swapcache - there might be another pte slot
-	 * containing the swap entry, but page not yet written to swap.
-	 *
-	 * And we can skip it on file pages, so long as the filesystem
-	 * participates in dirty tracking (note that this is not only an
-	 * optimization but also solves problems caused by dirty flag in
-	 * storage key getting set by a write from inside kernel); but need to
-	 * catch shm and tmpfs and ramfs pages which have been modified since
-	 * creation by read fault.
-	 *
-	 * Note that mapping must be decided above, before decrementing
-	 * mapcount (which luckily provides a barrier): once page is unmapped,
-	 * it could be truncated and page->mapping reset to NULL at any moment.
-	 * Note also that we are relying on page_mapping(page) to set mapping
-	 * to &swapper_space when PageSwapCache(page).
-	 */
-	if (mapping && !mapping_cap_account_dirty(mapping) &&
-	    page_test_and_clear_dirty(page_to_pfn(page), 1))
-		set_page_dirty(page);
-	/*
 	 * Hugepages are not counted in NR_ANON_PAGES nor NR_FILE_MAPPED
 	 * and not charged by memcg for now.
 	 */
-- 
1.7.12.4

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-12-14  8:45         ` Martin Schwidefsky
@ 2012-12-17 23:31           ` Hugh Dickins
  2012-12-18  7:30             ` Martin Schwidefsky
  0 siblings, 1 reply; 10+ messages in thread
From: Hugh Dickins @ 2012-12-17 23:31 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Jan Kara, Andrew Morton, linux-mm, Mel Gorman, linux-s390

On Fri, 14 Dec 2012, Martin Schwidefsky wrote:
> 
> The patch got delayed a bit;

Thanks a lot for finding the time to do this:
I never expected it to get priority.

> the main issue was getting conclusive performance
> measurements of the patch's effects. I am pretty sure that the patch
> works and will not cause any major degradation, so it is time to ask for
> your opinion. Here we go:

If it works reliably and efficiently for you on s390, then I'm strongly in
favour of it; and I cannot imagine who would not be - it removes several
hunks of surprising and poorly understood code from the generic mm end.

I'm slightly disappointed to be reminded of page_test_and_clear_young(),
and find it still there; but it's been an order of magnitude less
troubling than the _dirty, so not worth more effort I guess.

Hugh

> --
> Subject: [PATCH] s390/mm: implement software dirty bits
> 
> From: Martin Schwidefsky <schwidefsky@de.ibm.com>
> 
> The s390 architecture is unique with respect to dirty page detection:
> it uses the change bit in the per-page storage key to track page
> modifications. All other architectures track dirty bits by means
> of page table entries. This property of s390 has caused numerous
> problems in the past, e.g. see git commit ef5d437f71afdf4a
> "mm: fix XFS oops due to dirty pages without buffers on s390".
> 
> To avoid future issues with per-page dirty bits, convert s390 to a
> fault-based software dirty bit detection mechanism. All user page
> table entries which are marked as clean will be hardware read-only,
> even if the pte is supposed to be writable. A write by the user
> process will trigger a protection fault, which will cause the user
> pte to be marked as dirty and the hardware read-only bit to be
> removed.
> 
> With this change, the dirty bit in the storage key is irrelevant
> for Linux as a host, but the storage key is still required for
> KVM guests. The effect is that page_test_and_clear_dirty and the
> related code can be removed. The referenced bit in the storage
> key is still used by the page_test_and_clear_young primitive to
> provide page age information.
> 
> For page cache pages of mappings with mapping_cap_account_dirty
> there will not be any change in behavior, as the dirty bit tracking
> already uses read-only ptes to control the number of dirty pages.
> Only swap cache pages and pages of mappings without
> mapping_cap_account_dirty can incur additional protection faults.
> To avoid an excessive number of additional faults, the mk_pte
> primitive checks for PageDirty if the pgprot value allows writes,
> and pre-dirties the pte. That avoids all additional faults for
> tmpfs and shmem pages until these pages are added to the swap cache.
> 
> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
> ---
>  arch/s390/include/asm/page.h    |  22 -------
>  arch/s390/include/asm/pgtable.h | 131 +++++++++++++++++++++++++++-------------
>  arch/s390/include/asm/sclp.h    |   1 -
>  arch/s390/include/asm/setup.h   |  16 ++---
>  arch/s390/kvm/kvm-s390.c        |   2 +-
>  arch/s390/lib/uaccess_pt.c      |   2 +-
>  arch/s390/mm/pageattr.c         |   2 +-
>  arch/s390/mm/vmem.c             |  24 +++-----
>  drivers/s390/char/sclp_cmd.c    |  10 +--
>  include/asm-generic/pgtable.h   |  10 ---
>  include/linux/page-flags.h      |   8 ---
>  mm/rmap.c                       |  23 -------
>  12 files changed, 112 insertions(+), 139 deletions(-)
> 
> diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
> index a86ad40840..75ce9b0 100644
> --- a/arch/s390/include/asm/page.h
> +++ b/arch/s390/include/asm/page.h
> @@ -155,28 +155,6 @@ static inline int page_reset_referenced(unsigned long addr)
>  #define _PAGE_ACC_BITS		0xf0	/* HW access control bits	*/
>  
>  /*
> - * Test and clear dirty bit in storage key.
> - * We can't clear the changed bit atomically. This is a potential
> - * race against modification of the referenced bit. This function
> - * should therefore only be called if it is not mapped in any
> - * address space.
> - *
> - * Note that the bit gets set whenever page content is changed. That means
> - * also when the page is modified by DMA or from inside the kernel.
> - */
> -#define __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
> -static inline int page_test_and_clear_dirty(unsigned long pfn, int mapped)
> -{
> -	unsigned char skey;
> -
> -	skey = page_get_storage_key(pfn << PAGE_SHIFT);
> -	if (!(skey & _PAGE_CHANGED))
> -		return 0;
> -	page_set_storage_key(pfn << PAGE_SHIFT, skey & ~_PAGE_CHANGED, mapped);
> -	return 1;
> -}
> -
> -/*
>   * Test and clear referenced bit in storage key.
>   */
>  #define __HAVE_ARCH_PAGE_TEST_AND_CLEAR_YOUNG
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 33aeb77..66d3b2a 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -29,6 +29,7 @@
>  #ifndef __ASSEMBLY__
>  #include <linux/sched.h>
>  #include <linux/mm_types.h>
> +#include <linux/page-flags.h>
>  #include <asm/bug.h>
>  #include <asm/page.h>
>  
> @@ -221,13 +222,15 @@ extern unsigned long MODULES_END;
>  /* Software bits in the page table entry */
>  #define _PAGE_SWT	0x001		/* SW pte type bit t */
>  #define _PAGE_SWX	0x002		/* SW pte type bit x */
> -#define _PAGE_SWC	0x004		/* SW pte changed bit (for KVM) */
> -#define _PAGE_SWR	0x008		/* SW pte referenced bit (for KVM) */
> -#define _PAGE_SPECIAL	0x010		/* SW associated with special page */
> +#define _PAGE_SWC	0x004		/* SW pte changed bit */
> +#define _PAGE_SWR	0x008		/* SW pte referenced bit */
> +#define _PAGE_SWW	0x010		/* SW pte write bit */
> +#define _PAGE_SPECIAL	0x020		/* SW associated with special page */
>  #define __HAVE_ARCH_PTE_SPECIAL
>  
>  /* Set of bits not changed in pte_modify */
> -#define _PAGE_CHG_MASK	(PAGE_MASK | _PAGE_SPECIAL | _PAGE_SWC | _PAGE_SWR)
> +#define _PAGE_CHG_MASK		(PAGE_MASK | _PAGE_SPECIAL | _PAGE_CO | \
> +				 _PAGE_SWC | _PAGE_SWR)
>  
>  /* Six different types of pages. */
>  #define _PAGE_TYPE_EMPTY	0x400
> @@ -321,6 +324,7 @@ extern unsigned long MODULES_END;
>  
>  /* Bits in the region table entry */
>  #define _REGION_ENTRY_ORIGIN	~0xfffUL/* region/segment table origin	    */
> +#define _REGION_ENTRY_RO	0x200	/* region protection bit	    */
>  #define _REGION_ENTRY_INV	0x20	/* invalid region table entry	    */
>  #define _REGION_ENTRY_TYPE_MASK	0x0c	/* region/segment table type mask   */
>  #define _REGION_ENTRY_TYPE_R1	0x0c	/* region first table type	    */
> @@ -382,9 +386,10 @@ extern unsigned long MODULES_END;
>   */
>  #define PAGE_NONE	__pgprot(_PAGE_TYPE_NONE)
>  #define PAGE_RO		__pgprot(_PAGE_TYPE_RO)
> -#define PAGE_RW		__pgprot(_PAGE_TYPE_RW)
> +#define PAGE_RW		__pgprot(_PAGE_TYPE_RO | _PAGE_SWW)
> +#define PAGE_RWC	__pgprot(_PAGE_TYPE_RW | _PAGE_SWW | _PAGE_SWC)
>  
> -#define PAGE_KERNEL	PAGE_RW
> +#define PAGE_KERNEL	PAGE_RWC
>  #define PAGE_COPY	PAGE_RO
>  
>  /*
> @@ -625,23 +630,23 @@ static inline pgste_t pgste_update_all(pte_t *ptep, pgste_t pgste)
>  	bits = skey & (_PAGE_CHANGED | _PAGE_REFERENCED);
>  	/* Clear page changed & referenced bit in the storage key */
>  	if (bits & _PAGE_CHANGED)
> -		page_set_storage_key(address, skey ^ bits, 1);
> +		page_set_storage_key(address, skey ^ bits, 0);
>  	else if (bits)
>  		page_reset_referenced(address);
>  	/* Transfer page changed & referenced bit to guest bits in pgste */
>  	pgste_val(pgste) |= bits << 48;		/* RCP_GR_BIT & RCP_GC_BIT */
>  	/* Get host changed & referenced bits from pgste */
>  	bits |= (pgste_val(pgste) & (RCP_HR_BIT | RCP_HC_BIT)) >> 52;
> -	/* Clear host bits in pgste. */
> +	/* Transfer page changed & referenced bit to kvm user bits */
> +	pgste_val(pgste) |= bits << 45;		/* KVM_UR_BIT & KVM_UC_BIT */
> +	/* Clear relevant host bits in pgste. */
>  	pgste_val(pgste) &= ~(RCP_HR_BIT | RCP_HC_BIT);
>  	pgste_val(pgste) &= ~(RCP_ACC_BITS | RCP_FP_BIT);
>  	/* Copy page access key and fetch protection bit to pgste */
>  	pgste_val(pgste) |=
>  		(unsigned long) (skey & (_PAGE_ACC_BITS | _PAGE_FP_BIT)) << 56;
> -	/* Transfer changed and referenced to kvm user bits */
> -	pgste_val(pgste) |= bits << 45;		/* KVM_UR_BIT & KVM_UC_BIT */
> -	/* Transfer changed & referenced to pte sofware bits */
> -	pte_val(*ptep) |= bits << 1;		/* _PAGE_SWR & _PAGE_SWC */
> +	/* Transfer referenced bit to pte */
> +	pte_val(*ptep) |= (bits & _PAGE_REFERENCED) << 1;
>  #endif
>  	return pgste;
>  
> @@ -654,20 +659,25 @@ static inline pgste_t pgste_update_young(pte_t *ptep, pgste_t pgste)
>  
>  	if (!pte_present(*ptep))
>  		return pgste;
> +	/* Get referenced bit from storage key */
>  	young = page_reset_referenced(pte_val(*ptep) & PAGE_MASK);
> -	/* Transfer page referenced bit to pte software bit (host view) */
> -	if (young || (pgste_val(pgste) & RCP_HR_BIT))
> +	if (young)
> +		pgste_val(pgste) |= RCP_GR_BIT;
> +	/* Get host referenced bit from pgste */
> +	if (pgste_val(pgste) & RCP_HR_BIT) {
> +		pgste_val(pgste) &= ~RCP_HR_BIT;
> +		young = 1;
> +	}
> +	/* Transfer referenced bit to kvm user bits and pte */
> +	if (young) {
> +		pgste_val(pgste) |= KVM_UR_BIT;
>  		pte_val(*ptep) |= _PAGE_SWR;
> -	/* Clear host referenced bit in pgste. */
> -	pgste_val(pgste) &= ~RCP_HR_BIT;
> -	/* Transfer page referenced bit to guest bit in pgste */
> -	pgste_val(pgste) |= (unsigned long) young << 50; /* set RCP_GR_BIT */
> +	}
>  #endif
>  	return pgste;
> -
>  }
>  
> -static inline void pgste_set_pte(pte_t *ptep, pgste_t pgste, pte_t entry)
> +static inline void pgste_set_key(pte_t *ptep, pgste_t pgste, pte_t entry)
>  {
>  #ifdef CONFIG_PGSTE
>  	unsigned long address;
> @@ -681,10 +691,23 @@ static inline void pgste_set_pte(pte_t *ptep, pgste_t pgste, pte_t entry)
>  	/* Set page access key and fetch protection bit from pgste */
>  	nkey |= (pgste_val(pgste) & (RCP_ACC_BITS | RCP_FP_BIT)) >> 56;
>  	if (okey != nkey)
> -		page_set_storage_key(address, nkey, 1);
> +		page_set_storage_key(address, nkey, 0);
>  #endif
>  }
>  
> +static inline void pgste_set_pte(pte_t *ptep, pte_t entry)
> +{
> +	if (!MACHINE_HAS_ESOP && (pte_val(entry) & _PAGE_SWW)) {
> +		/*
> +		 * Without enhanced suppression-on-protection force
> +		 * the dirty bit on for all writable ptes.
> +		 */
> +		pte_val(entry) |= _PAGE_SWC;
> +		pte_val(entry) &= ~_PAGE_RO;
> +	}
> +	*ptep = entry;
> +}
> +
>  /**
>   * struct gmap_struct - guest address space
>   * @mm: pointer to the parent mm_struct
> @@ -743,11 +766,14 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
>  
>  	if (mm_has_pgste(mm)) {
>  		pgste = pgste_get_lock(ptep);
> -		pgste_set_pte(ptep, pgste, entry);
> -		*ptep = entry;
> +		pgste_set_key(ptep, pgste, entry);
> +		pgste_set_pte(ptep, entry);
>  		pgste_set_unlock(ptep, pgste);
> -	} else
> +	} else {
> +		if (!(pte_val(entry) & _PAGE_INVALID) && MACHINE_HAS_EDAT1)
> +			pte_val(entry) |= _PAGE_CO;
>  		*ptep = entry;
> +	}
>  }
>  
>  /*
> @@ -756,16 +782,12 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
>   */
>  static inline int pte_write(pte_t pte)
>  {
> -	return (pte_val(pte) & _PAGE_RO) == 0;
> +	return (pte_val(pte) & _PAGE_SWW) != 0;
>  }
>  
>  static inline int pte_dirty(pte_t pte)
>  {
> -#ifdef CONFIG_PGSTE
> -	if (pte_val(pte) & _PAGE_SWC)
> -		return 1;
> -#endif
> -	return 0;
> +	return (pte_val(pte) & _PAGE_SWC) != 0;
>  }
>  
>  static inline int pte_young(pte_t pte)
> @@ -815,11 +837,14 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
>  {
>  	pte_val(pte) &= _PAGE_CHG_MASK;
>  	pte_val(pte) |= pgprot_val(newprot);
> +	if ((pte_val(pte) & _PAGE_SWC) && (pte_val(pte) & _PAGE_SWW))
> +		pte_val(pte) &= ~_PAGE_RO;
>  	return pte;
>  }
>  
>  static inline pte_t pte_wrprotect(pte_t pte)
>  {
> +	pte_val(pte) &= ~_PAGE_SWW;
>  	/* Do not clobber _PAGE_TYPE_NONE pages!  */
>  	if (!(pte_val(pte) & _PAGE_INVALID))
>  		pte_val(pte) |= _PAGE_RO;
> @@ -828,20 +853,26 @@ static inline pte_t pte_wrprotect(pte_t pte)
>  
>  static inline pte_t pte_mkwrite(pte_t pte)
>  {
> -	pte_val(pte) &= ~_PAGE_RO;
> +	pte_val(pte) |= _PAGE_SWW;
> +	if (pte_val(pte) & _PAGE_SWC)
> +		pte_val(pte) &= ~_PAGE_RO;
>  	return pte;
>  }
>  
>  static inline pte_t pte_mkclean(pte_t pte)
>  {
> -#ifdef CONFIG_PGSTE
>  	pte_val(pte) &= ~_PAGE_SWC;
> -#endif
> +	/* Do not clobber _PAGE_TYPE_NONE pages!  */
> +	if (!(pte_val(pte) & _PAGE_INVALID))
> +		pte_val(pte) |= _PAGE_RO;
>  	return pte;
>  }
>  
>  static inline pte_t pte_mkdirty(pte_t pte)
>  {
> +	pte_val(pte) |= _PAGE_SWC;
> +	if (pte_val(pte) & _PAGE_SWW)
> +		pte_val(pte) &= ~_PAGE_RO;
>  	return pte;
>  }
>  
> @@ -879,10 +910,10 @@ static inline pte_t pte_mkhuge(pte_t pte)
>  		pte_val(pte) |= _SEGMENT_ENTRY_INV;
>  	}
>  	/*
> -	 * Clear SW pte bits SWT and SWX, there are no SW bits in a segment
> -	 * table entry.
> +	 * Clear SW pte bits, there are no SW bits in a segment table entry.
>  	 */
> -	pte_val(pte) &= ~(_PAGE_SWT | _PAGE_SWX);
> +	pte_val(pte) &= ~(_PAGE_SWT | _PAGE_SWX | _PAGE_SWC |
> +			  _PAGE_SWR | _PAGE_SWW);
>  	/*
>  	 * Also set the change-override bit because we don't need dirty bit
>  	 * tracking for hugetlbfs pages.
> @@ -1053,9 +1084,11 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
>  					   unsigned long address,
>  					   pte_t *ptep, pte_t pte)
>  {
> -	*ptep = pte;
> -	if (mm_has_pgste(mm))
> +	if (mm_has_pgste(mm)) {
> +		pgste_set_pte(ptep, pte);
>  		pgste_set_unlock(ptep, *(pgste_t *)(ptep + PTRS_PER_PTE));
> +	} else
> +		*ptep = pte;
>  }
>  
>  #define __HAVE_ARCH_PTEP_CLEAR_FLUSH
> @@ -1121,10 +1154,13 @@ static inline pte_t ptep_set_wrprotect(struct mm_struct *mm,
>  			pgste = pgste_get_lock(ptep);
>  
>  		ptep_flush_lazy(mm, address, ptep);
> -		*ptep = pte_wrprotect(pte);
> +		pte = pte_wrprotect(pte);
>  
> -		if (mm_has_pgste(mm))
> +		if (mm_has_pgste(mm)) {
> +			pgste_set_pte(ptep, pte);
>  			pgste_set_unlock(ptep, pgste);
> +		} else
> +			*ptep = pte;
>  	}
>  	return pte;
>  }
> @@ -1142,10 +1178,12 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>  		pgste = pgste_get_lock(ptep);
>  
>  	__ptep_ipte(address, ptep);
> -	*ptep = entry;
>  
> -	if (mm_has_pgste(vma->vm_mm))
> +	if (mm_has_pgste(vma->vm_mm)) {
> +		pgste_set_pte(ptep, entry);
>  		pgste_set_unlock(ptep, pgste);
> +	} else
> +		*ptep = entry;
>  	return 1;
>  }
>  
> @@ -1163,8 +1201,13 @@ static inline pte_t mk_pte_phys(unsigned long physpage, pgprot_t pgprot)
>  static inline pte_t mk_pte(struct page *page, pgprot_t pgprot)
>  {
>  	unsigned long physpage = page_to_phys(page);
> +	pte_t __pte = mk_pte_phys(physpage, pgprot);
>  
> -	return mk_pte_phys(physpage, pgprot);
> +	if ((pte_val(__pte) & _PAGE_SWW) && PageDirty(page)) {
> +		pte_val(__pte) |= _PAGE_SWC;
> +		pte_val(__pte) &= ~_PAGE_RO;
> +	}
> +	return __pte;
>  }
>  
>  #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
> @@ -1256,6 +1299,8 @@ static inline int pmd_trans_splitting(pmd_t pmd)
>  static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
>  			      pmd_t *pmdp, pmd_t entry)
>  {
> +	if (!(pmd_val(entry) & _SEGMENT_ENTRY_INV) && MACHINE_HAS_EDAT1)
> +		pmd_val(entry) |= _SEGMENT_ENTRY_CO;
>  	*pmdp = entry;
>  }
>  
> diff --git a/arch/s390/include/asm/sclp.h b/arch/s390/include/asm/sclp.h
> index 8337886..06a1361 100644
> --- a/arch/s390/include/asm/sclp.h
> +++ b/arch/s390/include/asm/sclp.h
> @@ -46,7 +46,6 @@ int sclp_cpu_deconfigure(u8 cpu);
>  void sclp_facilities_detect(void);
>  unsigned long long sclp_get_rnmax(void);
>  unsigned long long sclp_get_rzm(void);
> -u8 sclp_get_fac85(void);
>  int sclp_sdias_blk_count(void);
>  int sclp_sdias_copy(void *dest, int blk_num, int nr_blks);
>  int sclp_chp_configure(struct chp_id chpid);
> diff --git a/arch/s390/include/asm/setup.h b/arch/s390/include/asm/setup.h
> index f69f76b..f685751 100644
> --- a/arch/s390/include/asm/setup.h
> +++ b/arch/s390/include/asm/setup.h
> @@ -64,13 +64,14 @@ extern unsigned int s390_user_mode;
>  
>  #define MACHINE_FLAG_VM		(1UL << 0)
>  #define MACHINE_FLAG_IEEE	(1UL << 1)
> -#define MACHINE_FLAG_CSP	(1UL << 3)
> -#define MACHINE_FLAG_MVPG	(1UL << 4)
> -#define MACHINE_FLAG_DIAG44	(1UL << 5)
> -#define MACHINE_FLAG_IDTE	(1UL << 6)
> -#define MACHINE_FLAG_DIAG9C	(1UL << 7)
> -#define MACHINE_FLAG_MVCOS	(1UL << 8)
> -#define MACHINE_FLAG_KVM	(1UL << 9)
> +#define MACHINE_FLAG_CSP	(1UL << 2)
> +#define MACHINE_FLAG_MVPG	(1UL << 3)
> +#define MACHINE_FLAG_DIAG44	(1UL << 4)
> +#define MACHINE_FLAG_IDTE	(1UL << 5)
> +#define MACHINE_FLAG_DIAG9C	(1UL << 6)
> +#define MACHINE_FLAG_MVCOS	(1UL << 7)
> +#define MACHINE_FLAG_KVM	(1UL << 8)
> +#define MACHINE_FLAG_ESOP	(1UL << 9)
>  #define MACHINE_FLAG_EDAT1	(1UL << 10)
>  #define MACHINE_FLAG_EDAT2	(1UL << 11)
>  #define MACHINE_FLAG_LPAR	(1UL << 12)
> @@ -84,6 +85,7 @@ extern unsigned int s390_user_mode;
>  #define MACHINE_IS_LPAR		(S390_lowcore.machine_flags & MACHINE_FLAG_LPAR)
>  
>  #define MACHINE_HAS_DIAG9C	(S390_lowcore.machine_flags & MACHINE_FLAG_DIAG9C)
> +#define MACHINE_HAS_ESOP	(S390_lowcore.machine_flags & MACHINE_FLAG_ESOP)
>  #define MACHINE_HAS_PFMF	MACHINE_HAS_EDAT1
>  #define MACHINE_HAS_HPAGE	MACHINE_HAS_EDAT1
>  
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index c9011bf..4659b62 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -147,7 +147,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>  		r = KVM_MAX_VCPUS;
>  		break;
>  	case KVM_CAP_S390_COW:
> -		r = sclp_get_fac85() & 0x2;
> +		r = MACHINE_HAS_ESOP;
>  		break;
>  	default:
>  		r = 0;
> diff --git a/arch/s390/lib/uaccess_pt.c b/arch/s390/lib/uaccess_pt.c
> index 9017a63..a70ee84 100644
> --- a/arch/s390/lib/uaccess_pt.c
> +++ b/arch/s390/lib/uaccess_pt.c
> @@ -50,7 +50,7 @@ static __always_inline unsigned long follow_table(struct mm_struct *mm,
>  	ptep = pte_offset_map(pmd, addr);
>  	if (!pte_present(*ptep))
>  		return -0x11UL;
> -	if (write && !pte_write(*ptep))
> +	if (write && (!pte_write(*ptep) || !pte_dirty(*ptep)))
>  		return -0x04UL;
>  
>  	return (pte_val(*ptep) & PAGE_MASK) + (addr & ~PAGE_MASK);
> diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
> index 29ccee3..d21040e 100644
> --- a/arch/s390/mm/pageattr.c
> +++ b/arch/s390/mm/pageattr.c
> @@ -127,7 +127,7 @@ void kernel_map_pages(struct page *page, int numpages, int enable)
>  			pte_val(*pte) = _PAGE_TYPE_EMPTY;
>  			continue;
>  		}
> -		*pte = mk_pte_phys(address, __pgprot(_PAGE_TYPE_RW));
> +		pte_val(*pte) = __pa(address);
>  	}
>  }
>  
> diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
> index 6ed1426..79699f46 100644
> --- a/arch/s390/mm/vmem.c
> +++ b/arch/s390/mm/vmem.c
> @@ -85,11 +85,9 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
>  	pud_t *pu_dir;
>  	pmd_t *pm_dir;
>  	pte_t *pt_dir;
> -	pte_t  pte;
>  	int ret = -ENOMEM;
>  
>  	while (address < end) {
> -		pte = mk_pte_phys(address, __pgprot(ro ? _PAGE_RO : 0));
>  		pg_dir = pgd_offset_k(address);
>  		if (pgd_none(*pg_dir)) {
>  			pu_dir = vmem_pud_alloc();
> @@ -101,9 +99,9 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
>  #if defined(CONFIG_64BIT) && !defined(CONFIG_DEBUG_PAGEALLOC)
>  		if (MACHINE_HAS_EDAT2 && pud_none(*pu_dir) && address &&
>  		    !(address & ~PUD_MASK) && (address + PUD_SIZE <= end)) {
> -			pte_val(pte) |= _REGION3_ENTRY_LARGE;
> -			pte_val(pte) |= _REGION_ENTRY_TYPE_R3;
> -			pud_val(*pu_dir) = pte_val(pte);
> +			pud_val(*pu_dir) = __pa(address) |
> +				_REGION_ENTRY_TYPE_R3 | _REGION3_ENTRY_LARGE |
> +				(ro ? _REGION_ENTRY_RO : 0);
>  			address += PUD_SIZE;
>  			continue;
>  		}
> @@ -118,8 +116,9 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
>  #if defined(CONFIG_64BIT) && !defined(CONFIG_DEBUG_PAGEALLOC)
>  		if (MACHINE_HAS_EDAT1 && pmd_none(*pm_dir) && address &&
>  		    !(address & ~PMD_MASK) && (address + PMD_SIZE <= end)) {
> -			pte_val(pte) |= _SEGMENT_ENTRY_LARGE;
> -			pmd_val(*pm_dir) = pte_val(pte);
> +			pmd_val(*pm_dir) = __pa(address) |
> +				_SEGMENT_ENTRY | _SEGMENT_ENTRY_LARGE |
> +				(ro ? _SEGMENT_ENTRY_RO : 0);
>  			address += PMD_SIZE;
>  			continue;
>  		}
> @@ -132,7 +131,7 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
>  		}
>  
>  		pt_dir = pte_offset_kernel(pm_dir, address);
> -		*pt_dir = pte;
> +		pte_val(*pt_dir) = __pa(address) | (ro ? _PAGE_RO : 0);
>  		address += PAGE_SIZE;
>  	}
>  	ret = 0;
> @@ -199,7 +198,6 @@ int __meminit vmemmap_populate(struct page *start, unsigned long nr, int node)
>  	pud_t *pu_dir;
>  	pmd_t *pm_dir;
>  	pte_t *pt_dir;
> -	pte_t  pte;
>  	int ret = -ENOMEM;
>  
>  	start_addr = (unsigned long) start;
> @@ -237,9 +235,8 @@ int __meminit vmemmap_populate(struct page *start, unsigned long nr, int node)
>  				new_page = vmemmap_alloc_block(PMD_SIZE, node);
>  				if (!new_page)
>  					goto out;
> -				pte = mk_pte_phys(__pa(new_page), PAGE_RW);
> -				pte_val(pte) |= _SEGMENT_ENTRY_LARGE;
> -				pmd_val(*pm_dir) = pte_val(pte);
> +				pmd_val(*pm_dir) = __pa(new_page) |
> +					_SEGMENT_ENTRY | _SEGMENT_ENTRY_LARGE;
>  				address = (address + PMD_SIZE) & PMD_MASK;
>  				continue;
>  			}
> @@ -260,8 +257,7 @@ int __meminit vmemmap_populate(struct page *start, unsigned long nr, int node)
>  			new_page =__pa(vmem_alloc_pages(0));
>  			if (!new_page)
>  				goto out;
> -			pte = pfn_pte(new_page >> PAGE_SHIFT, PAGE_KERNEL);
> -			*pt_dir = pte;
> +			pte_val(*pt_dir) = __pa(new_page);
>  		}
>  		address += PAGE_SIZE;
>  	}
> diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
> index c44d13f..30a2255 100644
> --- a/drivers/s390/char/sclp_cmd.c
> +++ b/drivers/s390/char/sclp_cmd.c
> @@ -56,7 +56,6 @@ static int __initdata early_read_info_sccb_valid;
>  
>  u64 sclp_facilities;
>  static u8 sclp_fac84;
> -static u8 sclp_fac85;
>  static unsigned long long rzm;
>  static unsigned long long rnmax;
>  
> @@ -131,7 +130,8 @@ void __init sclp_facilities_detect(void)
>  	sccb = &early_read_info_sccb;
>  	sclp_facilities = sccb->facilities;
>  	sclp_fac84 = sccb->fac84;
> -	sclp_fac85 = sccb->fac85;
> +	if (sccb->fac85 & 0x02)
> +		S390_lowcore.machine_flags |= MACHINE_FLAG_ESOP;
>  	rnmax = sccb->rnmax ? sccb->rnmax : sccb->rnmax2;
>  	rzm = sccb->rnsize ? sccb->rnsize : sccb->rnsize2;
>  	rzm <<= 20;
> @@ -171,12 +171,6 @@ unsigned long long sclp_get_rzm(void)
>  	return rzm;
>  }
>  
> -u8 sclp_get_fac85(void)
> -{
> -	return sclp_fac85;
> -}
> -EXPORT_SYMBOL_GPL(sclp_get_fac85);
> -
>  /*
>   * This function will be called after sclp_facilities_detect(), which gets
>   * called from early.c code. Therefore the sccb should have valid contents.
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 83b54ed..bdd7fac 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -197,16 +197,6 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  #endif
>  
> -#ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
> -#define page_test_and_clear_dirty(pfn, mapped)	(0)
> -#endif
> -
> -#ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
> -#define pte_maybe_dirty(pte)		pte_dirty(pte)
> -#else
> -#define pte_maybe_dirty(pte)		(1)
> -#endif
> -
>  #ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_YOUNG
>  #define page_test_and_clear_young(pfn) (0)
>  #endif
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index b5d1384..4c0c8eb 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -303,21 +303,13 @@ static inline void __SetPageUptodate(struct page *page)
>  
>  static inline void SetPageUptodate(struct page *page)
>  {
> -#ifdef CONFIG_S390
> -	if (!test_and_set_bit(PG_uptodate, &page->flags))
> -		page_set_storage_key(page_to_phys(page), PAGE_DEFAULT_KEY, 0);
> -#else
>  	/*
>  	 * Memory barrier must be issued before setting the PG_uptodate bit,
>  	 * so that all previous stores issued in order to bring the page
>  	 * uptodate are actually visible before PageUptodate becomes true.
> -	 *
> -	 * s390 doesn't need an explicit smp_wmb here because the test and
> -	 * set bit already provides full barriers.
>  	 */
>  	smp_wmb();
>  	set_bit(PG_uptodate, &(page)->flags);
> -#endif
>  }
>  
>  CLEARPAGEFLAG(Uptodate, uptodate)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index face808..ef75a7d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1144,29 +1144,6 @@ void page_remove_rmap(struct page *page)
>  		goto out;
>  
>  	/*
> -	 * Now that the last pte has gone, s390 must transfer dirty
> -	 * flag from storage key to struct page.  We can usually skip
> -	 * this if the page is anon, so about to be freed; but perhaps
> -	 * not if it's in swapcache - there might be another pte slot
> -	 * containing the swap entry, but page not yet written to swap.
> -	 *
> -	 * And we can skip it on file pages, so long as the filesystem
> -	 * participates in dirty tracking (note that this is not only an
> -	 * optimization but also solves problems caused by dirty flag in
> -	 * storage key getting set by a write from inside kernel); but need to
> -	 * catch shm and tmpfs and ramfs pages which have been modified since
> -	 * creation by read fault.
> -	 *
> -	 * Note that mapping must be decided above, before decrementing
> -	 * mapcount (which luckily provides a barrier): once page is unmapped,
> -	 * it could be truncated and page->mapping reset to NULL at any moment.
> -	 * Note also that we are relying on page_mapping(page) to set mapping
> -	 * to &swapper_space when PageSwapCache(page).
> -	 */
> -	if (mapping && !mapping_cap_account_dirty(mapping) &&
> -	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> -		set_page_dirty(page);
> -	/*
>  	 * Hugepages are not counted in NR_ANON_PAGES nor NR_FILE_MAPPED
>  	 * and not charged by memcg for now.
>  	 */
> -- 
> 1.7.12.4
> 
> -- 
> blue skies,
>    Martin.
> 
> "Reality continues to ruin my life." - Calvin.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org


* Re: [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390
  2012-12-17 23:31           ` Hugh Dickins
@ 2012-12-18  7:30             ` Martin Schwidefsky
  0 siblings, 0 replies; 10+ messages in thread
From: Martin Schwidefsky @ 2012-12-18  7:30 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Jan Kara, Andrew Morton, linux-mm, Mel Gorman, linux-s390

On Mon, 17 Dec 2012 15:31:47 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> On Fri, 14 Dec 2012, Martin Schwidefsky wrote:
> > 
> > The patch got delayed a bit,
> 
> Thanks a lot for finding the time to do this:
> I never expected it to get priority.
> 
> > the main issue is to get conclusive performance
> > measurements about the effects of the patch. I am pretty sure that the patch
> > works and will not cause any major degradation so it is time to ask for your
> > opinion. Here we go:
> 
> If it works reliably and efficiently for you on s390, then I'm strongly in
> favour of it; and I cannot imagine who would not be - it removes several
> hunks of surprising and poorly understood code from the generic mm end.
> 
> I'm slightly disappointed to be reminded of page_test_and_clear_young(),
> and find it still there; but it's been an order of magnitude less
> troubling than the _dirty, so not worth more effort I guess.

To remove the dependency on the referenced bit in the storage key we would
have to keep the invalid bit set on the pte until the first access has
happened. Only on that first fault could the referenced bit be set and a
valid pte established. That would be costly, because we would get a lot
more program checks on the invalid, old ptes. So page_test_and_clear_young
needs to stay. The situation for the referenced bits is much more relaxed
though; we can afford to lose one or the other referenced bit without ill
effect. I would not worry about page_test_and_clear_young too much.
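
For reference, a condensed sketch of the two alternatives. This is
pseudo-C for illustration only: page_reset_referenced(), _PAGE_SWR and
_PAGE_INVALID are the names used in the patch quoted earlier in the
thread, while the two wrapper functions are made up for the comparison:

/* 1) Storage-key probe, as kept by the patch: read and clear the
 *    hardware referenced bit directly, no fault needed. */
static int young_via_storage_key(unsigned long paddr)
{
	return page_reset_referenced(paddr & PAGE_MASK);
}

/* 2) Fault-based tracking, the rejected alternative: the pte stays
 *    invalid until the first access takes a program check, and only
 *    then is the referenced state recorded and the pte made valid. */
static void young_via_fault(pte_t *ptep)
{
	pte_val(*ptep) |= _PAGE_SWR;		/* note access in software */
	pte_val(*ptep) &= ~_PAGE_INVALID;	/* then establish valid pte */
}

The first variant costs a key operation per probe; the second costs an
extra program check for every first touch of an old page, which is what
makes it unattractive.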

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


end of thread (newest message: 2012-12-18  7:30 UTC)

Thread overview: 10+ messages
2012-10-22 15:06 [PATCH] mm: Fix XFS oops due to dirty pages without buffers on s390 Jan Kara
2012-10-22 19:38 ` Andrew Morton
2012-10-23  4:40   ` Hugh Dickins
2012-10-23 10:21   ` Jan Kara
2012-10-23 21:56     ` Andrew Morton
2012-10-24  8:30       ` Martin Schwidefsky
2012-10-25 20:01       ` Jan Kara
2012-12-14  8:45         ` Martin Schwidefsky
2012-12-17 23:31           ` Hugh Dickins
2012-12-18  7:30             ` Martin Schwidefsky
