linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andreas Gruenbacher <agruenba@redhat.com>
To: Catalin Marinas <catalin.marinas@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Josef Bacik <josef@toxicpanda.com>,
	David Sterba <dsterba@suse.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Will Deacon <will@kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux ARM <linux-arm-kernel@lists.infradead.org>,
	linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH 3/3] btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page faults
Date: Mon, 29 Nov 2021 14:33:42 +0100	[thread overview]
Message-ID: <CAHc6FU6zVi9A2D3V3T5zE71YAdkBiJTs0ao1Q6ysSuEp=bz8fQ@mail.gmail.com> (raw)
In-Reply-To: <YaTEkAahkCwuQdPN@arm.com>

On Mon, Nov 29, 2021 at 1:22 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Sat, Nov 27, 2021 at 07:05:39PM +0100, Andreas Gruenbacher wrote:
> > On Sat, Nov 27, 2021 at 4:21 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > That's similar, somehow, to the arch-specific probing in one of my
> > > patches: [1]. We could do the above if we can guarantee that the maximum
> > > error margin in copy_to_user() is smaller than SUBPAGE_FAULT_SIZE. For
> > > arm64 copy_to_user(), it is fine, but for copy_from_user(), if we ever
> > > need to handle fault_in_readable(), it isn't (on arm64 up to 64 bytes
> > > even if aligned: reads of large blocks are done in 4 * 16 loads, and if
> > > one of them fails e.g. because of the 16-byte sub-page fault, no write
> > > is done, hence such larger than 16 delta).
> > >
> > > If you want something in the generic fault_in_writeable(), we probably
> > > need a loop over UACCESS_MAX_WRITE_ERROR in SUBPAGE_FAULT_SIZE
> > > increments. But I thought I'd rather keep this in the arch-specific code.
> >
> > I see, that's even crazier than I'd thought. The looping / probing is
> > still pretty generic, so I'd still consider putting it in the generic
> > code.
>
> In the arm64 probe_subpage_user_writeable(), the loop is skipped if
> !system_supports_mte() (static label). It doesn't make much difference
> for search_ioctl() in terms of performance but I'd like the arch code to
> dynamically decide when to probe. An arch_has_subpage_faults() static
> inline function would solve this.
>
> However, the above assumes that the only way of probing is by doing a
> get_user/put_user(). A non-destructive probe with MTE would be to read
> the actual tags in memory and compare them with the top byte of the
> pointer.
>
> There's the CHERI architecture as well. Although very early days for
> arm64, we do have an incipient port (https://www.morello-project.org/).
> The void __user * pointers are propagated inside the kernel as 128-bit
> capabilities. A fault_in() would check whether the address (bottom
> 64-bit) is within the range and permissions specified in the upper
> 64-bit of the capability. There is no notion of sub-page fault
> granularity here and no need to do a put_user() as the check is just
> done on the pointer/capability.
>
> Given the above, my preference is to keep the probing arch-specific.
>
> > We also still have fault_in_safe_writeable which is more difficult to
> > fix, and fault_in_readable which we don't want to leave behind broken,
> > either.
>
> fault_in_safe_writeable() can be done by using get_user() instead of
> put_user() for arm64 MTE and probably SPARC ADI (an alternative is to
> read the in-memory tags and compare them with the pointer).

So we'd keep the existing fault_in_safe_writeable() logic for the
actual fault-in and use get_user() to check for sub-page faults? If
so, then that should probably also be hidden in arch code.

> For CHERI, that's different again since the fault_in_safe_writeable capability
> encodes the read/write permissions independently.
>
> However, do we actually want to change the fault_in_safe_writeable() and
> fault_in_readable() functions at this stage? I could not get any of them
> to live-lock, though I only tried btrfs, ext4 and gfs2. As per the
> earlier discussion, normal files accesses are guaranteed to make
> progress. The only problematic one was O_DIRECT which seems to be
> alright for the above filesystems (the fs either bails out after several
> attempts or uses GUP to read which skips the uaccess altogether).

Only gfs2 uses fault_in_safe_writeable(). For buffered reads, progress
is guaranteed because failures are at a byte granularity.

O_DIRECT reads and writes happen in device block size granularity, but
the pages are grabbed with get_user_pages() before the copying
happens. So by the time the copying happens, the pages are guaranteed
to be resident, and we don't need to loop around fault_in_*().

You've mentioned before that copying to/from struct page bypasses
sub-page fault checking. If that is the case, then the checking
probably needs to happen in iomap_dio_bio_iter and dio_refill_pages
instead.

> Happy to address them if there is a real concern, I just couldn't trigger it.

Hopefully it should now be clear why you couldn't. One way of
reproducing with fault_in_safe_writeable() would be to use that in
btrfs instead of fault_in_writeable(), of course.

We're not doing any chunked reads from user space with page faults
disabled as far as I'm aware right now, so we probably don't have a
reproducer for fault_in_readable(). It would still be worth fixing
fault_in_readable() to prevent things from blowing up very
unexpectedly later, though.

Thanks,
Andreas

> > > Of course, the above fault_in_writeable() still needs the btrfs
> > > search_ioctl() counterpart toget_user_pages change the probing on the actual fault
> > > address or offset.
> >
> > Yes, but that change is relatively simple and it eliminates the need
> > for probing the entire buffer, so it's a good thing. Maybe you want to
> > add this though:
> >
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -2202,3 +2202,3 @@ static noinline int search_ioctl(struct inode *inode,
> >         unsigned long sk_offset = 0;
> > -       char __user *fault_in_addr;
> > +       char __user *fault_in_addr, *end;
> >
> > @@ -2230,6 +2230,6 @@ static noinline int search_ioctl(struct inode *inode,
> >         fault_in_addr = ubuf;
> > +       end = ubuf + *buf_size;
> >         while (1) {
> >                 ret = -EFAULT;
> > -               if (fault_in_writeable(fault_in_addr,
> > -                                      *buf_size - (fault_in_addr - ubuf)))
> > +               if (fault_in_writeable(fault_in_addr, end - fault_in_addr))
> >                         break;
>
> Thanks, I'll add it.
>
> --
> Catalin
>


  reply	other threads:[~2021-11-29 13:36 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-24 19:20 [PATCH 0/3] Avoid live-lock in fault-in+uaccess loops with sub-page faults Catalin Marinas
2021-11-24 19:20 ` [PATCH 1/3] mm: Introduce fault_in_exact_writeable() to probe for " Catalin Marinas
2021-11-24 19:20 ` [PATCH 2/3] arm64: Add support for sub-page faults user probing Catalin Marinas
2021-11-24 19:20 ` [PATCH 3/3] btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page faults Catalin Marinas
2021-11-24 20:03   ` Matthew Wilcox
2021-11-24 20:37     ` Catalin Marinas
2021-11-25 22:25       ` Andreas Gruenbacher
2021-11-25 22:42         ` Catalin Marinas
2021-11-26 22:29         ` Andreas Gruenbacher
2021-11-26 22:57           ` Catalin Marinas
2021-11-27  3:52             ` Andreas Gruenbacher
2021-11-27 14:33               ` Catalin Marinas
2021-11-27 12:39         ` Andreas Gruenbacher
2021-11-27 15:21           ` Catalin Marinas
2021-11-27 18:05             ` Andreas Gruenbacher
2021-11-29 12:16               ` Catalin Marinas
2021-11-29 13:33                 ` Andreas Gruenbacher [this message]
2021-11-29 15:36                   ` Catalin Marinas
2021-11-29 18:40                     ` Linus Torvalds
2021-11-29 19:31                       ` Andreas Gruenbacher
2021-11-29 20:56                       ` Catalin Marinas
2021-11-29 21:53                         ` Linus Torvalds
2021-11-29 23:12                           ` Catalin Marinas
2021-11-29 13:52               ` Catalin Marinas
2021-11-24 23:00     ` Linus Torvalds
2021-11-25 11:10       ` Catalin Marinas
2021-11-25 18:13         ` Linus Torvalds
2021-11-25 20:43           ` Catalin Marinas
2021-11-25 21:02             ` Matthew Wilcox
2021-11-25 21:29               ` Catalin Marinas
2021-11-25 21:40               ` Andreas Gruenbacher
2021-11-26 16:42   ` David Sterba
2021-11-24 21:36 ` [PATCH 0/3] Avoid live-lock in fault-in+uaccess loops " Andrew Morton
2021-11-24 22:31   ` Catalin Marinas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHc6FU6zVi9A2D3V3T5zE71YAdkBiJTs0ao1Q6ysSuEp=bz8fQ@mail.gmail.com' \
    --to=agruenba@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=catalin.marinas@arm.com \
    --cc=dsterba@suse.com \
    --cc=josef@toxicpanda.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).