From mboxrd@z Thu Jan  1 00:00:00 1970
From: Catalin Marinas <catalin.marinas@arm.com>
Date: Wed, 27 Oct 2021 20:13:01 +0100
Subject: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault
	deadlocks
In-Reply-To: <CAHk-=wjWDsB-dDj+x4yr8h8f_VSkyB7MbgGqBzDRMNz125sZxw@mail.gmail.com>
References: <CAHk-=wh0_3y5s7-G74U0Pcjm7Y_yHB608NYrQSvgogVNBxsWSQ@mail.gmail.com>
	<YXBFqD9WVuU8awIv@arm.com>
	<CAHk-=wgv=KPZBJGnx_O5-7hhST8CL9BN4wJwtVuycjhv_1MmvQ@mail.gmail.com>
	<YXCbv5gdfEEtAYo8@arm.com>
	<CAHk-=wgP058PNY8eoWW=5uRMox-PuesDMrLsrCWPS+xXhzbQxQ@mail.gmail.com>
	<YXL9tRher7QVmq6N@arm.com>
	<CAHk-=wg4t2t1AaBDyMfOVhCCOiLLjCB5TFVgZcV4Pr8X2qptJw@mail.gmail.com>
	<CAHc6FU7BEfBJCpm8wC3P+8GTBcXxzDWcp6wAcgzQtuaJLHrqZA@mail.gmail.com>
	<YXhH0sBSyTyz5Eh2@arm.com>
	<CAHk-=wjWDsB-dDj+x4yr8h8f_VSkyB7MbgGqBzDRMNz125sZxw@mail.gmail.com>
Message-ID: <YXmkvfL9B+4mQAIo@arm.com>
List-Id: <cluster-devel.redhat.com>
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On Tue, Oct 26, 2021 at 11:50:04AM -0700, Linus Torvalds wrote:
> On Tue, Oct 26, 2021 at 11:24 AM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > While more intrusive, I'd rather change copy_page_from_iter_atomic()
> > etc. to take a pointer where to write back an error code.
[...]
> That said, the fact that these sub-page faults are always
> non-recoverable might be a hint to a solution to the problem: maybe we
> could extend the existing return code with actual negative error
> numbers.
> 
> Because for _most_ cases of "copy_to/from_user()" and friends by far,
> the only thing we look for is "zero for success".
> 
> We could extend the "number of bytes _not_ copied" semantics to say
> "negative means fatal", and because there are fairly few places that
> actually look at non-zero values, we could have a coccinelle script
> that actually marks those places.

As you already replied, there are some odd places where the returned
uncopied of bytes is used. Also for some valid cases like
copy_mount_options(), it's likely that it will fall back to
byte-at-a-time with MTE since it's a good chance it would hit a fault in
a 4K page (not a fast path though). I'd have to go through all the cases
and check whether the return value is meaningful. The iter_iov.c
functions and their callers also seem to make use of the bytes copied in
case they need to call iov_iter_revert() (though I suppose the
iov_iter_iovec_advance() would skip the update in case of an error).

As an alternative, you mentioned earlier that a per-thread fault status
was not feasible on x86 due to races. Was this only for the hw poison
case? I think the uaccess is slightly different.

We can add a current->non_recoverable_uaccess variable cleared on
pagefault_disable(), only set by uaccess faults and checked by the fs
code before re-attempting the fault_in(). An interrupt shouldn't do a
uaccess (well, if it does a _nofault one, we can detect in_interrupt()
in the MTE exception handler). Last time I looked at io_uring it was
running in a separate kernel thread, not sure whether this was changed.
I don't see what else would be racing with such
current->non_recoverable_uaccess variable. If that's doable, I think
it's the least intrusive approach.

-- 
Catalin