linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
To: David Hildenbrand <david@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Andreas Gruenbacher <agruenba@redhat.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	linux-s390 <linux-s390@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-btrfs <linux-btrfs@vger.kernel.org>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	John Hubbard <jhubbard@nvidia.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	Heiko Carstens <hca@linux.ibm.com>,
	Alexander Gordeev <agordeev@linux.ibm.com>,
	Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Subject: Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
Date: Tue, 8 Mar 2022 14:20:47 +0100	[thread overview]
Message-ID: <20220308142047.7a725518@thinkpad> (raw)
In-Reply-To: <1bdb0184-696c-0f1a-3054-d88391c32e64@redhat.com>

On Tue, 8 Mar 2022 13:24:19 +0100
David Hildenbrand <david@redhat.com> wrote:

[...]
> 
> From 1e51e8a93894f87c0a4d0e908391e0628ae56afe Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Tue, 8 Mar 2022 12:51:26 +0100
> Subject: [PATCH] mm/gup: fix buffered I/O on s390x with pagefaults disabled
> 
> On s390x, we actually need a pte_mkyoung() / pte_mkdirty() instead of
> going via the page and leaving the PTE unmodified. E.g., if we only
> mark the page accessed via mark_page_accessed() when doing a FOLL_TOUCH,
> we'll miss to clear the HW invalid bit in the pte and subsequent accesses
> via the MMU would still require a pagefault.
> 
> Otherwise, buffered I/O will loop forever because it will keep stumling
> over the set HW invalid bit, requiring a page fault.
> 
> Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/gup.c | 32 +++++++++++++++++++++++++-------
>  1 file changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index a9d4d724aef7..de3311feb377 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -587,15 +587,33 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  		}
>  	}
>  	if (flags & FOLL_TOUCH) {
> -		if ((flags & FOLL_WRITE) &&
> -		    !pte_dirty(pte) && !PageDirty(page))
> -			set_page_dirty(page);
>  		/*
> -		 * pte_mkyoung() would be more correct here, but atomic care
> -		 * is needed to avoid losing the dirty bit: it is easier to use
> -		 * mark_page_accessed().
> +		 * We have to be careful with updating the PTE on architectures
> +		 * that have a HW dirty bit: while updating the PTE we might
> +		 * lose that bit again and we'd need an atomic update: it is
> +		 * easier to leave the PTE untouched for these architectures.
> +		 *
> +		 * s390x doesn't have a hw referenced / dirty bit and e.g., sets
> +		 * the hw invalid bit in pte_mkold(), to catch further
> +		 * references. We have to update the PTE here to e.g., clear the
> +		 * invalid bit; otherwise, callers that rely on not requiring
> +		 * an MMU fault once GUP(FOLL_TOUCH) succeeded will loop forever
> +		 * because the page won't actually be accessible via the MMU.
>  		 */
> -		mark_page_accessed(page);
> +		if (IS_ENABLED(CONFIG_S390)) {
> +			pte = pte_mkyoung(pte);
> +			if (flags & FOLL_WRITE)
> +				pte = pte_mkdirty(pte);
> +			if (!pte_same(pte, *ptep)) {
> +				set_pte_at(vma->vm_mm, address, ptep, pte);
> +				update_mmu_cache(vma, address, ptep);
> +			}
> +		} else {
> +			if ((flags & FOLL_WRITE) &&
> +			    !pte_dirty(pte) && !PageDirty(page))
> +				set_page_dirty(page);
> +			mark_page_accessed(page);
> +		}
>  	}
>  	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
>  		/* Do not mlock pte-mapped THP */

Thanks David, your analysis looks valid, at least it seems that you found
a scenario where we would have HW invalid bit set due to pte_mkold() in
ptep_clear_flush_young(), and still GUP would find and return that page, IIUC.

I think pte handling should be similar to pmd handling in follow_trans_huge_pmd()
-> touch_pmd(), or cow_user_page() (see comment on software "accessed" bits),
which is more or less what your patch does.

Some possible concerns:
- set_page_dirty() would not be done any more for s390, is that intended and ok?
- using set_pte_at() here seems a bit dangerous, as I'm not sure if this will
  always only operate on invalid PTEs. Using it on active valid PTEs could
  result in TLB issues because of missing flush. Also not sure about kvm impact.
  Using ptep_set_access_flags() seems safer, again similar to touch_pmd() and
  also cow_user_page().

Looking at cow_user_page(), I also wonder if the arch_faults_on_old_pte()
logic could be used here. I must admit that I did not really understand the
"losing the dirty bit" part of the comment, but it seems that we might need
to not only check for arch_faults_on_old_pte(), but also for something like
"arch_faults_for_dirty_pte".

Last but not least, IIUC, this issue should affect all archs that return
true on arch_faults_on_old_pte(). After all, the basic problem seems to be
that a pagefault is required for PTEs marked as old, in combination with
GUP still returning a valid page. So maybe this should not be restricted
to IS_ENABLED(CONFIG_S390).

  reply	other threads:[~2022-03-08 13:22 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-07 22:52 Buffered I/O broken on s390x with page faults disabled (gfs2) Andreas Gruenbacher
2022-03-07 23:18 ` Linus Torvalds
2022-03-08  8:21   ` David Hildenbrand
2022-03-08  8:37     ` David Hildenbrand
2022-03-08 12:11       ` David Hildenbrand
2022-03-08 12:24         ` David Hildenbrand
2022-03-08 13:20           ` Gerald Schaefer [this message]
2022-03-08 13:32             ` David Hildenbrand
2022-03-08 14:14               ` Gerald Schaefer
2022-03-08 17:23           ` David Hildenbrand
2022-03-08 17:26     ` Linus Torvalds
2022-03-08 17:40       ` Linus Torvalds
2022-03-08 19:27         ` Linus Torvalds
2022-03-08 20:03           ` Linus Torvalds
2022-03-08 23:24             ` Andreas Gruenbacher
2022-03-09  0:22               ` Linus Torvalds
2022-03-09 18:42                 ` Andreas Gruenbacher
2022-03-09 19:08                   ` Linus Torvalds
2022-03-09 20:57                     ` Andreas Gruenbacher
2022-03-09 21:08                     ` Andreas Gruenbacher
2022-03-10 12:13                       ` Filipe Manana
2022-03-09 19:21                   ` Linus Torvalds
2022-03-09 19:35                     ` Andreas Gruenbacher
2022-03-09 20:18                       ` Linus Torvalds
2022-03-09 20:36                         ` Andreas Gruenbacher
2022-03-09 20:48                           ` Linus Torvalds
2022-03-09 20:54                             ` Linus Torvalds
2022-03-10 17:13           ` David Hildenbrand
2022-03-10 18:00             ` Andreas Gruenbacher
2022-03-10 18:35             ` Linus Torvalds
2022-03-10 18:38               ` David Hildenbrand
2022-03-10 18:47               ` Andreas Gruenbacher
2022-03-10 19:22                 ` Linus Torvalds
2022-03-10 19:56                   ` Linus Torvalds
2022-03-10 20:23                     ` Andreas Gruenbacher
2022-03-08 17:47       ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220308142047.7a725518@thinkpad \
    --to=gerald.schaefer@linux.ibm.com \
    --cc=agordeev@linux.ibm.com \
    --cc=agruenba@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=borntraeger@de.ibm.com \
    --cc=david@redhat.com \
    --cc=hca@linux.ibm.com \
    --cc=jgg@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).