linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Lorenzo Stoakes <lstoakes@gmail.com>
To: David Laight <David.Laight@aculab.com>
Cc: 'Baoquan He' <bhe@redhat.com>,
	David Hildenbrand <david@redhat.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Uladzislau Rezki <urezki@gmail.com>,
	Matthew Wilcox <willy@infradead.org>,
	Liu Shixin <liushixin2@huawei.com>, Jiri Olsa <jolsa@kernel.org>,
	Jens Axboe <axboe@kernel.dk>,
	Alexander Viro <viro@zeniv.linux.org.uk>
Subject: Re: [PATCH v7 4/4] mm: vmalloc: convert vread() to vread_iter()
Date: Sun, 26 Mar 2023 15:20:56 +0100	[thread overview]
Message-ID: <01d87b9f-3c8b-4f92-89c2-3e07420e9c67@lucifer.local> (raw)
In-Reply-To: <0cff573c3a344504b1b1b77486b4d853@AcuMS.aculab.com>

On Sun, Mar 26, 2023 at 01:26:57PM +0000, David Laight wrote:
> From: Baoquan He
> > Sent: 23 March 2023 13:32
> ...
> > > > > If this fails, then we fault in, and try again. We loop because there could
> > > > > be some extremely unfortunate timing with a race on e.g. swapping out or
> > > > > migrating pages between faulting in and trying to write out again.
> > > > >
> > > > > This is extremely unlikely, but to avoid any chance of breaking userland we
> > > > > repeat the operation until it completes. In nearly all real-world
> > > > > situations it'll either work immediately or loop once.
> > > >
> > > > Thanks a lot for these helpful details with patience. I got it now. I was
> > > > mainly confused by the while(true) loop in KCORE_VMALLOC case of read_kcore_iter.
> > > >
> > > > Now is there any chance that the faulted in memory is swapped out or
> > > > migrated again before vread_iter()? fault_in_iov_iter_writeable() will
> > > > pin the memory? I didn't find it from code and document. Seems it only
> > > > falults in memory. If yes, there's window between faluting in and
> > > > copy_to_user_nofault().
> > > >
> > >
> > > See the documentation of fault_in_safe_writeable():
> > >
> > > "Note that we don't pin or otherwise hold the pages referenced that we fault
> > > in.  There's no guarantee that they'll stay in memory for any duration of
> > > time."
> >
> > Thanks for the info. Then swapping out/migration could happen again, so
> > that's why while(true) loop is meaningful.
>
> One of the problems is that is the system is under severe memory
> pressure and you try to fault in (say) 20 pages, the first page
> might get unmapped in order to map the last one in.
>
> So it is quite likely better to retry 'one page at a time'.

If you look at the kcore code, it is in fact only faulting one page at a
time. tsz never exceeds PAGE_SIZE, so we never attempt to fault in or copy
more than one page at a time, e.g.:-

if ((tsz = (PAGE_SIZE - (start & ~PAGE_MASK))) > buflen)
	tsz = buflen;

...

tsz = (buflen > PAGE_SIZE ? PAGE_SIZE : buflen);

It might be a good idea to make this totally explicit in vread_iter()
(perhaps making it vread_page_iter() or such), but I think that might be
good for another patch series.

>
> There have also been cases where the instruction to copy data
> has faulted for reasons other than 'page fault'.
> ISTR an infinite loop being caused by misaligned accesses failing
> due to 'bad instruction choice' in the copy code.
> While this is rally a bug, an infinite retry in a file read/write
> didn't make it easy to spot.

I am not sure it's reasonable to not write code just in case an arch
implements buggy user copy code (do correct me if I'm misunderstanding you
about this!). By that token wouldn't a lot more be broken in that
situation? I don't imagine all other areas of the kernel would make
explicitly clear to you that this was the problem.

>
> So maybe there are cases where a dropping back to a 'bounce buffer'
> may be necessary.

One approach could be to reinstate the kernel bounce buffer, set up an
iterator that points to it and pass that in after one attempt with
userland.

But it feels a bit like overkill, as in the case of an aligment issue,
surely that would still occur and that'd just error out anyway? Again I'm
not sure bending over backwards to account for possibly buggy arch code is
sensible.

Ideally the iterator code would explicitly pass back the EFAULT error which
we could then explicitly handle but that'd require probably quite
significant rework there which feels a bit out of scope for this change.

We could implement some maximum number of attempts which statistically must
reduce the odds of repeated faults in the tiny window between fault in and
copy to effectively zero. But I'm not sure the other David would be happy
with that!

If we were to make a change to be extra careful I'd opt for simply trying X
times then giving up, given we're trying this a page at a time I don't
think X need be that large before any swap out/migrate bad luck becomes so
unlikely that we're competing with heat death of the universe timescales
before it might happen (again, I may be missing some common scenario where
the same single page swaps out/migrates over and over, please correct me if
so).

However I think there's a case to be made that it's fine as-is unless there
is another scenario we are overly concerned about?

>
> 	David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>

      reply	other threads:[~2023-03-26 14:21 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-22 18:57 [PATCH v7 0/4] convert read_kcore(), vread() to use iterators Lorenzo Stoakes
2023-03-22 18:57 ` [PATCH v7 1/4] fs/proc/kcore: avoid bounce buffer for ktext data Lorenzo Stoakes
2023-03-22 18:57 ` [PATCH v7 2/4] fs/proc/kcore: convert read_kcore() to read_kcore_iter() Lorenzo Stoakes
2023-03-22 18:57 ` [PATCH v7 3/4] iov_iter: add copy_page_to_iter_nofault() Lorenzo Stoakes
2023-03-22 18:57 ` [PATCH v7 4/4] mm: vmalloc: convert vread() to vread_iter() Lorenzo Stoakes
2023-03-23  2:52   ` Baoquan He
2023-03-23  6:44     ` Lorenzo Stoakes
2023-03-23 10:36       ` Baoquan He
2023-03-23 10:38         ` David Hildenbrand
2023-03-23 13:31           ` Baoquan He
2023-03-26 13:26             ` David Laight
2023-03-26 14:20               ` Lorenzo Stoakes [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=01d87b9f-3c8b-4f92-89c2-3e07420e9c67@lucifer.local \
    --to=lstoakes@gmail.com \
    --cc=David.Laight@aculab.com \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=bhe@redhat.com \
    --cc=david@redhat.com \
    --cc=jolsa@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=liushixin2@huawei.com \
    --cc=urezki@gmail.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).