All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	linux-mm@kvack.org, Hugh Dickins <hughd@google.com>
Subject: Re: Avoiding allocation of unused shmem page
Date: Fri, 21 Oct 2022 12:01:51 -0400	[thread overview]
Message-ID: <Y1LCb7jyjJGiY9iQ@x1n> (raw)
In-Reply-To: <3b19120b-05f6-10b8-c1af-67d8eb60fea0@redhat.com>

On Fri, Oct 21, 2022 at 05:17:08PM +0200, David Hildenbrand wrote:
> On 21.10.22 17:08, Peter Xu wrote:
> > On Fri, Oct 21, 2022 at 04:45:27PM +0200, David Hildenbrand wrote:
> > > On 21.10.22 16:28, Peter Xu wrote:
> > > > On Fri, Oct 21, 2022 at 04:10:41PM +0200, David Hildenbrand wrote:
> > > > > On 21.10.22 16:01, Peter Xu wrote:
> > > > > > On Fri, Oct 21, 2022 at 09:23:00AM +0200, David Hildenbrand wrote:
> > > > > > > On 20.10.22 23:10, Peter Xu wrote:
> > > > > > > > On Thu, Oct 20, 2022 at 09:14:09PM +0100, Matthew Wilcox wrote:
> > > > > > > > > In yesterday's call, David brought up the case where we fallocate a file
> > > > > > > > > in shmem, call mmap(MAP_PRIVATE) and then store to a page which is over
> > > > > > > > > a hole.  That currently causes shmem to allocate a page, zero-fill it,
> > > > > > > > > then COW it, resulting in two pages being allocated when only the
> > > > > > > > > COW page really needs to be allocated.
> > > > > > > > > 
> > > > > > > > > The path we currently take through the MM when we take the page fault
> > > > > > > > > looks like this (correct me if I'm wrong ...):
> > > > > > > > > 
> > > > > > > > > handle_mm_fault()
> > > > > > > > > __handle_mm_fault()
> > > > > > > > > handle_pte_fault()
> > > > > > > > > do_fault()
> > > > > > > > > do_cow_fault()
> > > > > > > > > __do_fault()
> > > > > > > > > vm_ops->fault()
> > > > > > > > > 
> > > > > > > > > ... which is where we come into shmem_fault().  Apart from the
> > > > > > > > > horrendous hole-punch handling case, shmem_fault() is quite simple:
> > > > > > > > > 
> > > > > > > > >             err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
> > > > > > > > >                                       gfp, vma, vmf, &ret);
> > > > > > > > >             if (err)
> > > > > > > > >                     return vmf_error(err);
> > > > > > > > >             vmf->page = folio_file_page(folio, vmf->pgoff);
> > > > > > > > >             return ret;
> > > > > > > > > 
> > > > > > > > > What we could do here is detect this case.  Something like:
> > > > > > > > > 
> > > > > > > > > 	enum sgp_type sgp = SGP_CACHE;
> > > > > > > > > 
> > > > > > > > > 	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
> > > > > > > > > 		sgp = SGP_READ;
> > > > > > > > 
> > > > > > > > Yes this will start to save the space, but just to mention this may start
> > > > > > > > to break anything that will still depend on the pagecache to work.  E.g.,
> > > > > > > > it'll change behavior if the vma is registered with uffd missing mode;
> > > > > > > > we'll start to lose MISSING events for these private mappings.  Not sure
> > > > > > > > whether there're other side effects.
> > > > > > > 
> > > > > > > I don't follow, can you elaborate?
> > > > > > > 
> > > > > > > hugetlb doesn't perform this kind of unnecessary allocation and should be fine in regards to uffd. Why should it matter here and how exactly would a problematic sequence look like?
> > > > > > 
> > > > > > Hugetlb is special because hugetlb detects pte first and relies on pte at
> > > > > > least for uffd.  shmem is not.
> > > > > > 
> > > > > > Feel free to also reference the recent fix which relies on the stable
> > > > > > hugetlb pte with commit 2ea7ff1e39cbe375.
> > > > > 
> > > > > Sorry to be dense here, but I don't follow how that relates.
> > > > > 
> > > > > Assume we have a MAP_PRIVATE shmem mapping and someone registers uffd
> > > > > missing events on that mapping.
> > > > > 
> > > > > Assume we get a page fault on a hole. We detect no page is mapped and check
> > > > > if the page cache has a page mapped -- which is also not the case, because
> > > > > there is a hole.
> > > > > 
> > > > > So we notify uffd.
> > > > > 
> > > > > Uffd will place a page. It should *not* touch the page cache and only insert
> > > > > that page into the page table -- otherwise we'd be violating MAP_PRIVATE
> > > > > semantics.
> > > > 
> > > > That's actually exactly what we do right now... we insert into page cache
> > > > for the shmem.  See shmem_mfill_atomic_pte().
> > > > 
> > > > Why it violates MAP_PRIVATE?  Private pages only guarantee the exclusive
> > > > ownership of pages, I don't see why it should restrict uffd behavior. Uffd
> > > > missing mode (afaiu) is defined to resolve page cache missings in this
> > > > case.  Hugetlb is special but not shmem IMO comparing to most of the rest
> > > > of the file systems.
> > > 
> > > If a write (or uffd placement) via a MAP_PRIVATE mapping results in other
> > > shared/private mappings from observing these modifications, you have a clear
> > > violation of MAP_PRIVATE semantics.
> > 
> > I think I understand what you meant, but just to mention again that I think
> > it's a matter of how we defined the uffd missing semantics when shmem
> > missing mode was introduced years ago.  It does not need to be the same
> > semantic as writting directly to a private mapping.
> > 
> 
> I think uffd does exactly the right thing in mfill_atomic_pte:
> 
> 	/*
> 	 * The normal page fault path for a shmem will invoke the
> 	 * fault, fill the hole in the file and COW it right away. The
> 	 * result generates plain anonymous memory. So when we are
> 	 * asked to fill an hole in a MAP_PRIVATE shmem mapping, we'll
> 	 * generate anonymous memory directly without actually filling
> 	 * the hole. For the MAP_PRIVATE case the robustness check
> 	 * only happens in the pagetable (to verify it's still none)
> 	 * and not in the radix tree.
> 	 */
> 	if (!(dst_vma->vm_flags & VM_SHARED)) {
> 		if (mode == MCOPY_ATOMIC_NORMAL)
> 			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> 					       dst_addr, src_addr, page,
> 					       wp_copy);
> 		else
> 			err = mfill_zeropage_pte(dst_mm, dst_pmd,
> 						 dst_vma, dst_addr);
> 	} else {
> 		err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
> 					     dst_addr, src_addr,
> 					     mode != MCOPY_ATOMIC_NORMAL,
> 					     wp_copy, page);
> 	}
> 
> Unless we have a writable shared mapping, we end up not touching the pagecache.
> 
> After what I understand from your last message (maybe I understood it wrong),
> I tried exploiting uffd behavior by writing into a hole of a file without
> write permissions using uffd. I failed because it does the right thing ;)

Very interesting to learn this, thanks for the pointer, David. :)
Definitely helpful to me on knowing better on the vma security model.

Though note that it'll be a different topic if we go back to the original
problem we're discussing - the fake-read approach of shmem will still keep
the hole in file which will still change the behavior of missing messages
from generating.

Said that, I don't really know whether there can be a real impact on any
uffd users, or anything else that similarly access the file cache.

-- 
Peter Xu



  reply	other threads:[~2022-10-21 16:02 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-20 20:14 Avoiding allocation of unused shmem page Matthew Wilcox
2022-10-20 21:10 ` Peter Xu
2022-10-21  7:23   ` David Hildenbrand
2022-10-21 14:01     ` Peter Xu
2022-10-21 14:10       ` David Hildenbrand
2022-10-21 14:28         ` Peter Xu
2022-10-21 14:45           ` David Hildenbrand
2022-10-21 15:08             ` Peter Xu
2022-10-21 15:17               ` David Hildenbrand
2022-10-21 16:01                 ` Peter Xu [this message]
2022-10-21 16:19                   ` David Hildenbrand
2022-10-21 16:26                     ` David Hildenbrand
2022-10-20 22:17 ` Yang Shi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y1LCb7jyjJGiY9iQ@x1n \
    --to=peterx@redhat.com \
    --cc=david@redhat.com \
    --cc=hughd@google.com \
    --cc=linux-mm@kvack.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.