linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andreas Gruenbacher <agruenba@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	David Hildenbrand <david@redhat.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	linux-s390 <linux-s390@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Buffered I/O broken on s390x with page faults disabled (gfs2)
Date: Wed,  9 Mar 2022 19:42:38 +0100	[thread overview]
Message-ID: <20220309184238.1583093-1-agruenba@redhat.com> (raw)
In-Reply-To: <CAHk-=whaoxuCPg4foD_4VBVr+LVgmW7qScjYFRppvHqnni0EMA@mail.gmail.com>

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 6892 bytes --]

On Wed, Mar 9, 2022 at 1:22 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, Mar 8, 2022 at 3:25 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > Seems to be working on s390x for this test case at least; the kind of
> > trace I'm getting is:
>
> Good.
>
> > This shows bursts of successful fault-ins in gfs2_file_read_iter
> > (read_fault). The pauses in between might be caused by the storage;
> > I'm not sure.
>
> Don't know about the pauses, but the burst size might be impacted by that
>
> +       const size_t max_size = 4 * PAGE_SIZE;
>
> thing that limits how many calls to fixup_user_fault() we do per
> fault_in_safe_writeable().
>
> So it might be worth checking if that value seems to make any difference.
>
> > I'd still let the caller of fault_in_safe_writeable() decide how much
> > memory to fault in; the tight cap in fault_in_safe_writeable() doesn't
> > seem useful.
>
> Well, there are real latency concerns there - fixup_user_fault() is
> not necessarily all that low-cost.

I just don't know if making the GUP based approach work instead of
switching to fixup_user_fault(), or introducing something else, would
make enough of a performance difference to be worth it.

> And it's actually going to be worse when we have the sub-page coloring
> issues happening, and will need to probe at a 128-byte granularity
> (not on s390, but on arm64).
>
> At that point, we almost certainly will need to have a "probe user
> space non-destructibly for writability" instruction (possibly
> extending on our current arch_futex_atomic_op_inuser()
> infrastructure).

Let me add Catalin Marinas to the CC.

From what I took from the previous discussion, probing at a sub-page
granularity won't be necessary for bytewise copying: when the address
we're trying to access is poisoned, fault_in_*() will fail; when we get
a short result, that will take us to the poisoned address in the next
iteration.

The problematic case was copying objects that cross fault domains, when
we're getting an all-or-nothing result back from the copying and the
address we're trying to fault_in_*() isn't the address of the actual
fault.  The fix for those cases is to pass back the address of the
actual fault in one way or another.

> So honestly, if you do IO on areas that will get page faults on them,
> to some degree it's a "you are doing something stupid, you get what
> you deserve". This code should _work_, it should not have to worry
> about users having bad patterns (where "read data into huge cold
> mappings under enough memory pressure that it causes access bit faults
> in the middle of the read" very much counts as such a bad pattern).

With a large enough buffer, a simple malloc() will return unmapped
pages, and reading into such a buffer will result in fault-in.  So page
faults during read() are actually pretty normal, and it's not the user's
fault.

In my test case, the buffer was pre-initialized with memset() to avoid
those kinds of page faults, which meant that the page faults in
gfs2_file_read_iter() only started to happen when we were out of memory.
But that's not the common case.

> > Also, you want to put in an extra L here:
> > > Signed-off-by: linus Torvalds <torvalds@linux-foundation.org>
>
> Heh. Fixed locally.

Attached is a revised patch; only lightly tested so far.  Changes:

* Fix the function description.

* No need for untagged_addr() as that is handled in fixup_user_fault().

* Get rid of max_size: it really makes no sense to second-guess what the
  caller needs.  In cases where fault_in_safe_writeable() is used for
  buffers larger than a handful of pages, it is entirely the caller's
  responsibility to scale back the fault-in size in anticipation of or
  in reaction to page-out.

* Use the same control flow as in fault_in_readable(); there is no
  need for anything more complicated anymore.

* The same patch description still applies.

Thanks,
Andreas

diff --git a/mm/gup.c b/mm/gup.c
index f1bf3a1f6d109..5e777049bdf41 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1841,9 +1841,9 @@ EXPORT_SYMBOL(fault_in_writeable);
  * @uaddr: start of address range
  * @size: length of address range
  *
- * Faults in an address range using get_user_pages, i.e., without triggering
- * hardware page faults.  This is primarily useful when we already know that
- * some or all of the pages in the address range aren't in memory.
+ * Faults in an address range for writing.  This is primarily useful when we
+ * already know that some or all of the pages in the address range aren't in
+ * memory.
  *
  * Other than fault_in_writeable(), this function is non-destructive.
  *
@@ -1856,46 +1856,33 @@ EXPORT_SYMBOL(fault_in_writeable);
  */
 size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
 {
-	unsigned long start = (unsigned long)untagged_addr(uaddr);
-	unsigned long end, nstart, nend;
+	const unsigned int fault_flags = FAULT_FLAG_WRITE | FAULT_FLAG_KILLABLE;
+	unsigned long start = (unsigned long)uaddr, end;
 	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma = NULL;
-	int locked = 0;
 
-	nstart = start & PAGE_MASK;
+	if (unlikely(size == 0))
+		return 0;
 	end = PAGE_ALIGN(start + size);
-	if (end < nstart)
+	if (end < start)
 		end = 0;
-	for (; nstart != end; nstart = nend) {
-		unsigned long nr_pages;
-		long ret;
 
-		if (!locked) {
-			locked = 1;
-			mmap_read_lock(mm);
-			vma = find_vma(mm, nstart);
-		} else if (nstart >= vma->vm_end)
-			vma = vma->vm_next;
-		if (!vma || vma->vm_start >= end)
-			break;
-		nend = end ? min(end, vma->vm_end) : vma->vm_end;
-		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
-			continue;
-		if (nstart < vma->vm_start)
-			nstart = vma->vm_start;
-		nr_pages = (nend - nstart) / PAGE_SIZE;
-		ret = __get_user_pages_locked(mm, nstart, nr_pages,
-					      NULL, NULL, &locked,
-					      FOLL_TOUCH | FOLL_WRITE);
-		if (ret <= 0)
-			break;
-		nend = nstart + ret * PAGE_SIZE;
+	mmap_read_lock(mm);
+	if (!PAGE_ALIGNED(start)) {
+		if (fixup_user_fault(mm, start, fault_flags, NULL))
+			return size;
+		start = PAGE_ALIGN(start);
 	}
-	if (locked)
-		mmap_read_unlock(mm);
-	if (nstart == end)
-		return 0;
-	return size - min_t(size_t, nstart - start, size);
+	while (start != end) {
+		if (fixup_user_fault(mm, start, fault_flags, NULL))
+			goto out;
+		start += PAGE_SIZE;
+	}
+	mmap_read_unlock(mm);
+
+out:
+	if (size > (unsigned long)uaddr - start)
+		return size - ((unsigned long)uaddr - start);
+	return 0;
 }
 EXPORT_SYMBOL(fault_in_safe_writeable);
 


  reply	other threads:[~2022-03-09 18:42 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-07 22:52 Buffered I/O broken on s390x with page faults disabled (gfs2) Andreas Gruenbacher
2022-03-07 23:18 ` Linus Torvalds
2022-03-08  8:21   ` David Hildenbrand
2022-03-08  8:37     ` David Hildenbrand
2022-03-08 12:11       ` David Hildenbrand
2022-03-08 12:24         ` David Hildenbrand
2022-03-08 13:20           ` Gerald Schaefer
2022-03-08 13:32             ` David Hildenbrand
2022-03-08 14:14               ` Gerald Schaefer
2022-03-08 17:23           ` David Hildenbrand
2022-03-08 17:26     ` Linus Torvalds
2022-03-08 17:40       ` Linus Torvalds
2022-03-08 19:27         ` Linus Torvalds
2022-03-08 20:03           ` Linus Torvalds
2022-03-08 23:24             ` Andreas Gruenbacher
2022-03-09  0:22               ` Linus Torvalds
2022-03-09 18:42                 ` Andreas Gruenbacher [this message]
2022-03-09 19:08                   ` Linus Torvalds
2022-03-09 20:57                     ` Andreas Gruenbacher
2022-03-09 21:08                     ` Andreas Gruenbacher
2022-03-10 12:13                       ` Filipe Manana
2022-03-09 19:21                   ` Linus Torvalds
2022-03-09 19:35                     ` Andreas Gruenbacher
2022-03-09 20:18                       ` Linus Torvalds
2022-03-09 20:36                         ` Andreas Gruenbacher
2022-03-09 20:48                           ` Linus Torvalds
2022-03-09 20:54                             ` Linus Torvalds
2022-03-10 17:13           ` David Hildenbrand
2022-03-10 18:00             ` Andreas Gruenbacher
2022-03-10 18:35             ` Linus Torvalds
2022-03-10 18:38               ` David Hildenbrand
2022-03-10 18:47               ` Andreas Gruenbacher
2022-03-10 19:22                 ` Linus Torvalds
2022-03-10 19:56                   ` Linus Torvalds
2022-03-10 20:23                     ` Andreas Gruenbacher
2022-03-08 17:47       ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220309184238.1583093-1-agruenba@redhat.com \
    --to=agruenba@redhat.com \
    --cc=catalin.marinas@arm.com \
    --cc=david@redhat.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).