linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* getting rid of the last memory modifitions through gup(FOLL_GET)
@ 2023-09-05 14:16 Christoph Hellwig
  2023-09-06  9:42 ` David Hildenbrand
  2023-09-08  8:41 ` David Howells
  0 siblings, 2 replies; 6+ messages in thread
From: Christoph Hellwig @ 2023-09-05 14:16 UTC (permalink / raw)
  To: Jan Kara, David Howells, David Hildenbrand, Peter Xu
  Cc: Lei Huang, miklos, Xiubo Li, Ilya Dryomov, Jeff Layton,
	Trond Myklebust, Anna Schumaker, Latchesar Ionkov,
	Dominique Martinet, Christian Schoenebeck, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Fastabend,
	Jakub Sitnicki, Boris Pismenny, linux-nfs, linux-fsdevel,
	ceph-devel, linux-mm, v9fs, netdev

Hi all,

we've made some nice progress on converting code that modifies user
memory to the pin_user_pages interface, especially though the work
from David Howells on iov_iter_extract_pages.  This thread tries to
coordinate on how to finish off this work.

The obvious next step is the remaining users of iov_iter_get_pages2
and iov_iter_get_pages_alloc2.  We have three file system direct I/O
users of those left: ceph, fuse and nfs.  Lei Huang has sent patches
to convert fuse to iov_iter_extract_pages which I'd love to see merged,
and we'd need equivalent work for ceph and nfs.

The non-file system uses are in the vmsplice code, which only reads
from the pages (but would still benefit from an iov_iter_extract_pages
conversion), and in net.  Out of the users in net, all but the 9p code
appear to be for reads from memory, so they don't pin even if a
conversion would be nice to retire iov_iter_get_pages* APIs.

After that we might have to do an audit of the raw get_user_pages APIs,
but there probably aren't many that modify file backed memory.

I'm also wondering what a good debug aid would be for detecting
writes to file backed memory without a previous reservation, but
everything either involves a page flag or file system code.  But if
someone has an idea I'm all ear as something mechanical to catch these
uses would be quite helpful.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: getting rid of the last memory modifitions through gup(FOLL_GET)
  2023-09-05 14:16 getting rid of the last memory modifitions through gup(FOLL_GET) Christoph Hellwig
@ 2023-09-06  9:42 ` David Hildenbrand
  2023-09-08  8:15   ` Christoph Hellwig
  2023-09-08  8:41 ` David Howells
  1 sibling, 1 reply; 6+ messages in thread
From: David Hildenbrand @ 2023-09-06  9:42 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara, David Howells, Peter Xu
  Cc: Lei Huang, miklos, Xiubo Li, Ilya Dryomov, Jeff Layton,
	Trond Myklebust, Anna Schumaker, Latchesar Ionkov,
	Dominique Martinet, Christian Schoenebeck, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, John Fastabend,
	Jakub Sitnicki, Boris Pismenny, linux-nfs, linux-fsdevel,
	ceph-devel, linux-mm, v9fs, netdev

On 05.09.23 16:16, Christoph Hellwig wrote:
> Hi all,

Hi,

> 
> we've made some nice progress on converting code that modifies user
> memory to the pin_user_pages interface, especially though the work
> from David Howells on iov_iter_extract_pages.  This thread tries to
> coordinate on how to finish off this work.
> 
> The obvious next step is the remaining users of iov_iter_get_pages2
> and iov_iter_get_pages_alloc2.  We have three file system direct I/O
> users of those left: ceph, fuse and nfs.  Lei Huang has sent patches
> to convert fuse to iov_iter_extract_pages which I'd love to see merged,
> and we'd need equivalent work for ceph and nfs.
> 
> The non-file system uses are in the vmsplice code, which only reads

vmsplice really has to be fixed to specify FOLL_PIN|FOLL_LONGTERM for 
good; I recall that David Howells had patches for that at one point. (at 
least to use FOLL_PIN)

> from the pages (but would still benefit from an iov_iter_extract_pages
> conversion), and in net.  Out of the users in net, all but the 9p code
> appear to be for reads from memory, so they don't pin even if a
> conversion would be nice to retire iov_iter_get_pages* APIs.
> 
> After that we might have to do an audit of the raw get_user_pages APIs,
> but there probably aren't many that modify file backed memory.

ptrace should apply that ends up doing a FOLL_GET|FOLL_WRITE.

Further, KVM ends up using FOLL_GET|FOLL_WRITE to populate the 
second-level page tables for VMs, and uses MMU notifiers to synchronize 
the second-level page tables with process page table changes. So once a 
PTE goes from writable -> r/o in the process page table, the second 
level page tables for the VM will get updated. Such MMU users are quite 
different from ordinary GUP users.

Converting the latter to FOLL_PIN is not desired (as it would implicitly 
  trigger COW-unsharing on KSM pages -- but GUP+MMU notifiers is 
different to ordinary GUP+read/write where there is no such 
synchronization).

Converting ptrace might not be desired/required as well (the reference 
is dropped immediately after the read/write access).

The end goal as discussed a couple of times would be the to limit 
FOLL_GET in general only to a couple of users that can be audited and 
keep using it for a good reason. Arbitrary drivers that perform DMA 
should stop using it (and ideally be prevented from using it) and switch 
to FOLL_PIN.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: getting rid of the last memory modifitions through gup(FOLL_GET)
  2023-09-06  9:42 ` David Hildenbrand
@ 2023-09-08  8:15   ` Christoph Hellwig
  2023-09-08 16:48     ` David Hildenbrand
  0 siblings, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2023-09-08  8:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Christoph Hellwig, Jan Kara, David Howells, Peter Xu, Lei Huang,
	miklos, Xiubo Li, Ilya Dryomov, Jeff Layton, Trond Myklebust,
	Anna Schumaker, Latchesar Ionkov, Dominique Martinet,
	Christian Schoenebeck, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, John Fastabend, Jakub Sitnicki,
	Boris Pismenny, linux-nfs, linux-fsdevel, ceph-devel, linux-mm,
	v9fs, netdev

On Wed, Sep 06, 2023 at 11:42:33AM +0200, David Hildenbrand wrote:
>> and iov_iter_get_pages_alloc2.  We have three file system direct I/O
>> users of those left: ceph, fuse and nfs.  Lei Huang has sent patches
>> to convert fuse to iov_iter_extract_pages which I'd love to see merged,
>> and we'd need equivalent work for ceph and nfs.
>>
>> The non-file system uses are in the vmsplice code, which only reads
>
> vmsplice really has to be fixed to specify FOLL_PIN|FOLL_LONGTERM for good; 
> I recall that David Howells had patches for that at one point. (at least to 
> use FOLL_PIN)

Hmm, unless I'm misreading the code vmsplace is only using
iov_iter_get_pages2 for reading from the user address space anyway.
Or am I missing something?

>> After that we might have to do an audit of the raw get_user_pages APIs,
>> but there probably aren't many that modify file backed memory.
>
> ptrace should apply that ends up doing a FOLL_GET|FOLL_WRITE.

Yes, if that ends up on file backed shared mappings we also need a pin.

> Further, KVM ends up using FOLL_GET|FOLL_WRITE to populate the second-level 
> page tables for VMs, and uses MMU notifiers to synchronize the second-level 
> page tables with process page table changes. So once a PTE goes from 
> writable -> r/o in the process page table, the second level page tables for 
> the VM will get updated. Such MMU users are quite different from ordinary 
> GUP users.

Can KVM page tables use file backed shared mappings?

> Converting ptrace might not be desired/required as well (the reference is 
> dropped immediately after the read/write access).

But the pin is needed to make sure the file system can account for
dirtying the pages.  Something we fundamentally can't do with get.

> The end goal as discussed a couple of times would be the to limit FOLL_GET 
> in general only to a couple of users that can be audited and keep using it 
> for a good reason. Arbitrary drivers that perform DMA should stop using it 
> (and ideally be prevented from using it) and switch to FOLL_PIN.

Agreed, that's where I'd like to get to.  Preferably with the non-pin
API not even beeing epxorted to modules.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: getting rid of the last memory modifitions through gup(FOLL_GET)
  2023-09-05 14:16 getting rid of the last memory modifitions through gup(FOLL_GET) Christoph Hellwig
  2023-09-06  9:42 ` David Hildenbrand
@ 2023-09-08  8:41 ` David Howells
  1 sibling, 0 replies; 6+ messages in thread
From: David Howells @ 2023-09-08  8:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Jan Kara, David Hildenbrand, Peter Xu, Lei Huang,
	miklos, Xiubo Li, Ilya Dryomov, Jeff Layton, Trond Myklebust,
	Anna Schumaker, Latchesar Ionkov, Dominique Martinet,
	Christian Schoenebeck, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, John Fastabend, Jakub Sitnicki,
	Boris Pismenny, linux-nfs, linux-fsdevel, ceph-devel, linux-mm,
	v9fs, netdev

Christoph Hellwig <hch@lst.de> wrote:

> we've made some nice progress on converting code that modifies user
> memory to the pin_user_pages interface, especially though the work
> from David Howells on iov_iter_extract_pages.  This thread tries to
> coordinate on how to finish off this work.

Right at this moment, I'm writing some kunit tests for iov_iter and I've found
at least one bug.

David


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: getting rid of the last memory modifitions through gup(FOLL_GET)
  2023-09-08  8:15   ` Christoph Hellwig
@ 2023-09-08 16:48     ` David Hildenbrand
  2023-09-09 11:18       ` Christoph Hellwig
  0 siblings, 1 reply; 6+ messages in thread
From: David Hildenbrand @ 2023-09-08 16:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, David Howells, Peter Xu, Lei Huang, miklos, Xiubo Li,
	Ilya Dryomov, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Latchesar Ionkov, Dominique Martinet, Christian Schoenebeck,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	John Fastabend, Jakub Sitnicki, Boris Pismenny, linux-nfs,
	linux-fsdevel, ceph-devel, linux-mm, v9fs, netdev

On 08.09.23 10:15, Christoph Hellwig wrote:
> On Wed, Sep 06, 2023 at 11:42:33AM +0200, David Hildenbrand wrote:
>>> and iov_iter_get_pages_alloc2.  We have three file system direct I/O
>>> users of those left: ceph, fuse and nfs.  Lei Huang has sent patches
>>> to convert fuse to iov_iter_extract_pages which I'd love to see merged,
>>> and we'd need equivalent work for ceph and nfs.
>>>
>>> The non-file system uses are in the vmsplice code, which only reads
>>
>> vmsplice really has to be fixed to specify FOLL_PIN|FOLL_LONGTERM for good;
>> I recall that David Howells had patches for that at one point. (at least to
>> use FOLL_PIN)
> 
> Hmm, unless I'm misreading the code vmsplace is only using
> iov_iter_get_pages2 for reading from the user address space anyway.
> Or am I missing something?

It's not relevant for the case you're describing here ("last memory 
modifitions through gup(FOLL_GET)").

vmsplice_to_pipe() -> iter_to_pipe() -> iov_iter_get_pages2()

So it ends up calling get_user_pages_fast()

... and not using FOLL_PIN|FOLL_LONGTERM

Why FOLL_LONGTERM? Because it's a longterm pin, where unprivileged users 
can grab a reference on a page for all eternity, breaking CMA and memory 
hotunplug (well, and harming compaction).

Why FOLL_PIN? Well FOLL_LONGTERM only applies to FOLL_PIN. But for 
anonymous memory, this will also take care of the last remaining hugetlb 
COW test (trigger COW unsharing) as commented back in:

https://lore.kernel.org/all/02063032-61e7-e1e5-cd51-a50337405159@redhat.com/


> 
>>> After that we might have to do an audit of the raw get_user_pages APIs,
>>> but there probably aren't many that modify file backed memory.
>>
>> ptrace should apply that ends up doing a FOLL_GET|FOLL_WRITE.
> 
> Yes, if that ends up on file backed shared mappings we also need a pin.

See below.

> 
>> Further, KVM ends up using FOLL_GET|FOLL_WRITE to populate the second-level
>> page tables for VMs, and uses MMU notifiers to synchronize the second-level
>> page tables with process page table changes. So once a PTE goes from
>> writable -> r/o in the process page table, the second level page tables for
>> the VM will get updated. Such MMU users are quite different from ordinary
>> GUP users.
> 
> Can KVM page tables use file backed shared mappings?

Yes, usually shmem and hugetlb. But with things like emulated 
NVDIMMs/virtio-pmem for VMs, easily also ordinary files.

But it's really not ordinary write access through GUP. It's write access 
via a secondary page table (secondary MMU), that's synchronized to the 
process page table -- just like if the CPU would be writing to the page 
using the process page tables (primary MMU).

> 
>> Converting ptrace might not be desired/required as well (the reference is
>> dropped immediately after the read/write access).
> 
> But the pin is needed to make sure the file system can account for
> dirtying the pages.  Something we fundamentally can't do with get.

ptrace will find the pagecache page writable in the page table (PTE 
write bit set), if it intends to write to the page (FOLL_WRITE). If it 
is not writable, it will trigger a page fault that informs the file system.

With an FS that wants writenotify, we will not map a page writable (PTE 
write bit not set) unless it is dirty (PTE dirty bit set) IIRC.

So are we concerned about a race between the filesystem removing the PTE 
write bit (to catch next write access before it gets dirtied again) and 
ptrace marking the page dirty?

It's a very, very small race window, staring at __access_remote_vm(). 
But it should apply if that's the concern.

> 
>> The end goal as discussed a couple of times would be the to limit FOLL_GET
>> in general only to a couple of users that can be audited and keep using it
>> for a good reason. Arbitrary drivers that perform DMA should stop using it
>> (and ideally be prevented from using it) and switch to FOLL_PIN.
> 
> Agreed, that's where I'd like to get to.  Preferably with the non-pin
> API not even beeing epxorted to modules.

Yes. However, secondary MMU users (like KVM) would need some way to keep 
making use of that; ideally, using a proper separate interface instead 
of (ab)using plain GUP and confusing people :)

[1] https://lkml.org/lkml/2023/1/24/451

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: getting rid of the last memory modifitions through gup(FOLL_GET)
  2023-09-08 16:48     ` David Hildenbrand
@ 2023-09-09 11:18       ` Christoph Hellwig
  0 siblings, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2023-09-09 11:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Christoph Hellwig, Jan Kara, David Howells, Peter Xu, Lei Huang,
	miklos, Xiubo Li, Ilya Dryomov, Jeff Layton, Trond Myklebust,
	Anna Schumaker, Latchesar Ionkov, Dominique Martinet,
	Christian Schoenebeck, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, John Fastabend, Jakub Sitnicki,
	Boris Pismenny, linux-nfs, linux-fsdevel, ceph-devel, linux-mm,
	v9fs, netdev

On Fri, Sep 08, 2023 at 06:48:05PM +0200, David Hildenbrand wrote:
> vmsplice_to_pipe() -> iter_to_pipe() -> iov_iter_get_pages2()
>
> So it ends up calling get_user_pages_fast()
>
> ... and not using FOLL_PIN|FOLL_LONGTERM
>
> Why FOLL_LONGTERM? Because it's a longterm pin, where unprivileged users 
> can grab a reference on a page for all eternity, breaking CMA and memory 
> hotunplug (well, and harming compaction).
>
> Why FOLL_PIN? Well FOLL_LONGTERM only applies to FOLL_PIN. But for 
> anonymous memory, this will also take care of the last remaining hugetlb 
> COW test (trigger COW unsharing) as commented back in:
>
> https://lore.kernel.org/all/02063032-61e7-e1e5-cd51-a50337405159@redhat.com/

Well, I'm not against it.  It just isn't required for deadling with
file system writeback vs GUP modification race this thread was started
for. 

>> Can KVM page tables use file backed shared mappings?
>
> Yes, usually shmem and hugetlb. But with things like emulated 
> NVDIMMs/virtio-pmem for VMs, easily also ordinary files.
>
> But it's really not ordinary write access through GUP. It's write access 
> via a secondary page table (secondary MMU), that's synchronized to the 
> process page table -- just like if the CPU would be writing to the page 
> using the process page tables (primary MMU).

Writing through the process page tables takes a write faul when first
writing, which calls into ->page_mkwrite in the file system.  Does the
synchronization take care of that?  If not we need to add or emulate it.

> ptrace will find the pagecache page writable in the page table (PTE write 
> bit set), if it intends to write to the page (FOLL_WRITE). If it is not 
> writable, it will trigger a page fault that informs the file system.

Yes, that case is (mostly) fine.

>
> With an FS that wants writenotify, we will not map a page writable (PTE 
> write bit not set) unless it is dirty (PTE dirty bit set) IIRC.
>
> So are we concerned about a race between the filesystem removing the PTE 
> write bit (to catch next write access before it gets dirtied again) and 
> ptrace marking the page dirty?

Yes.  This is the race that we've run into with various GUP users.

> Yes. However, secondary MMU users (like KVM) would need some way to keep 
> making use of that; ideally, using a proper separate interface instead of 
> (ab)using plain GUP and confusing people :)

I'mm all for that.


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-09-09 11:18 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-05 14:16 getting rid of the last memory modifitions through gup(FOLL_GET) Christoph Hellwig
2023-09-06  9:42 ` David Hildenbrand
2023-09-08  8:15   ` Christoph Hellwig
2023-09-08 16:48     ` David Hildenbrand
2023-09-09 11:18       ` Christoph Hellwig
2023-09-08  8:41 ` David Howells

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).