Re: [PATCH v3] mm: introduce reference pages

linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH v3] mm: introduce reference pages
       [not found] <20200814213310.42170-1-pcc@google.com>
@ 2020-08-18  2:31 ` John Hubbard
  2020-08-18  3:00   ` Matthew Wilcox
  2021-06-19  9:21   ` Peter Collingbourne
  0 siblings, 2 replies; 5+ messages in thread
From: John Hubbard @ 2020-08-18  2:31 UTC (permalink / raw)
  To: Peter Collingbourne, Matthew Wilcox, Kirill A . Shutemov,
	Andrew Morton, Catalin Marinas, Evgenii Stepanov
  Cc: Linux ARM, linux-mm, kernel test robot, Linux API, linux-doc

On 8/14/20 2:33 PM, Peter Collingbourne wrote:
> Introduce a new syscall, refpage_create, which returns a file
> descriptor which may be mapped using mmap. Such a mapping is similar

Hi,

For new syscalls, I think we need to put linux-api on CC, at the very
least. Adding them now. This would likely need man page support as well.
I'll put linux-doc on Cc, too.

> to an anonymous mapping, but instead of clean pages being backed by the
> zero page, they are instead backed by a so-called reference page, whose
> contents are specified using an argument to refpage_create. Loads from
> the mapping will load directly from the reference page, and initial
> stores to the mapping will copy-on-write from the reference page.
> 
> Reference pages are useful in circumstances where anonymous mappings
> combined with manual stores to memory would impose undesirable costs,
> either in terms of performance or RSS. Use cases are focused on heap
> allocators and include:
> 
> - Pattern initialization for the heap. This is where malloc(3) gives
>    you memory whose contents are filled with a non-zero pattern
>    byte, in order to help detect and mitigate bugs involving use
>    of uninitialized memory. Typically this is implemented by having
>    the allocator memset the allocation with the pattern byte before
>    returning it to the user, but for large allocations this can result
>    in a significant increase in RSS, especially for allocations that
>    are used sparsely. Even for dense allocations there is a needless
>    impact to startup performance when it may be better to amortize it
>    throughout the program. By creating allocations using a reference
>    page filled with the pattern byte, we can avoid these costs.
> 
> - Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
>    feature which allows for memory to be tagged in order to detect
>    certain kinds of memory errors with low overhead. In order to set
>    up an allocation to allow memory errors to be detected, the entire
>    allocation needs to have the same tag. The issue here is similar to
>    pattern initialization in the sense that large tagged allocations
>    will be expensive if the tagging is done up front. The idea is that
>    the allocator would create reference pages with each of the possible
>    memory tags, and use those reference pages for the large allocations.

That is good information, and it belongs in a man page, and/or Documentation/.

> 
> In order to measure the performance and RSS impact of reference pages,
> a version of this patch backported to kernel version 4.14 was tested on
> a Pixel 4 together with a modified [2] version of the Scudo allocator
> that uses reference pages to implement pattern initialization. A
> PDFium test program was used to collect the measurements like so:
> 
> $ wget https://static.docs.arm.com/ddi0487/fb/DDI0487F_b_armv8_arm.pdf
> $ /system/bin/time -v ./pdfium_test --pages=1-100 DDI0487F_b_armv8_arm.pdf
> 
> and the median of 100 runs measurement was taken with three variants
> of the allocator:
> 
> - "anon" is the baseline (no pattern init)
> - "memset" is with pattern init of allocator pages implemented by
>    initializing anonymous pages with memset
> - "refpage" is with pattern init of allocator pages implemented
>    by creating reference pages
> 
> All three variants are measured using the patch that I linked. "anon"
> is without the patch, "refpage" is with the patch and "memset" is
> with a previous version of the patch [3] with "#if 0" in place of
> "#if 1" in linux.cpp. The measurements are as follows:
> 
>            Real time (s)    Max RSS (KiB)
> anon        2.237081         107088
> memset      2.252241         112180
> refpage     2.243786         107128
> 
> We can see that RSS for refpage is almost the same as anon, and real
> time overhead is 44% that of memset.
> 

Are some of the numbers stale, maybe? Try as I might, I cannot combine
anything above to come up with 44%. :)


> As an alternative to introducing this syscall, I considered using
> userfaultfd to implement reference pages. However, after having taken
> a detailed look at the interface, it does not seem suitable to be
> used in the context of a general purpose allocator. For example,
> UFFD_FEATURE_FORK support would be required in order to correctly
> support fork(2) in a process that uses the allocator (although POSIX
> does not guarantee support for allocating after fork, many allocators
> including Scudo support it, and nothing stops the forked process from
> page faulting pre-existing allocations after forking anyway), but
> UFFD_FEATURE_FORK has been restricted to root by commit 3c1c24d91ffd
> ("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
> making it unsuitable for use in an allocator. Furthermore, even if
> the interface issues are resolved, I suspect (but have not measured)
> that the cost of the multiple context switches between kernel and
> userspace would be too high to be used in an allocator anyway.


That whole blurb is good for a cover letter, and perhaps an "alternatives
considered" section in Documentation/. However, it should be omitted from
the patch commit description, IMHO.

...
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 467302056e17..a1dc07ff914a 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -175,6 +175,13 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
>   
>   	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
>   		return false;
> +
> +	/*
> +	 * Transparent hugepages not currently supported for anonymous VMAs with
> +	 * reference pages
> +	 */
> +	if (unlikely(vma->vm_private_data))


This should use a helper function, such as is_reference_page_vma(). Because the
assumption that "vma->vm_private_data means a reference page vma" is much too
fragile. More below.


> +		return false;
>   	return true;
>   }
>   
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e7602a3bcef1..ac375e398690 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3122,5 +3122,15 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping,
>   
>   extern int sysctl_nr_trim_pages;
>   
> +static inline int is_zero_or_refpage_pfn(struct vm_area_struct *vma,
> +					 unsigned long pfn)
> +{
> +	if (is_zero_pfn(pfn))
> +		return true;
> +	if (unlikely(!vma->vm_ops && vma->vm_private_data))
> +		return pfn == page_to_pfn((struct page *)vma->vm_private_data);

As foreshadowed above, this needs a helper function. And the criteria for
deciding that it's a reference page needs to be more robust than just "no vm_ops,
vm_private_data is set, and it matches my page". Needs some more decisive
information.

Maybe setting vm_ops to some new "refpage" ops would be the way to go, for that.

...
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 5053439be6ab..6e9246d09e95 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2841,8 +2841,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
>   	pmd_t *pmdp;
>   	pte_t *ptep;
>   
> -	/* Only allow populating anonymous memory */
> -	if (!vma_is_anonymous(vma))
> +	/* Only allow populating anonymous memory without a reference page */
> +	if (!vma_is_anonymous(vma) || vma->private_data)

Same thing here: helper function, instead of open-coding the assumption about
what makes a refpage vma.

...

> +SYSCALL_DEFINE2(refpage_create, const void *__user, content, unsigned long,
> +		flags)
> +{
> +	unsigned long content_addr = (unsigned long)content;
> +	struct page *userpage, *refpage;
> +	int fd;
> +
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	refpage = alloc_page(GFP_KERNEL);
> +	if (!refpage)
> +		return -ENOMEM;
> +
> +	if ((content_addr & (PAGE_SIZE - 1)) != 0 ||
> +	    get_user_pages(content_addr, 1, 0, &userpage, 0) != 1) {
> +		put_page(refpage);
> +		return -EFAULT;
> +	}
> +
> +	copy_highpage(refpage, userpage);
> +	put_page(userpage);
> +
> +	fd = anon_inode_getfd("[refpage]", &refpage_file_operations, refpage,
> +			      O_RDONLY | O_CLOEXEC);

Seems like the flags argument should have an influence on these flags, rather
than hard-coding O_CLOEXEC, right?


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] mm: introduce reference pages
  2020-08-18  2:31 ` [PATCH v3] mm: introduce reference pages John Hubbard
@ 2020-08-18  3:00   ` Matthew Wilcox
  2020-08-18 18:25     ` John Hubbard
  2021-06-19  9:21   ` Peter Collingbourne
  1 sibling, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2020-08-18  3:00 UTC (permalink / raw)
  To: John Hubbard
  Cc: Peter Collingbourne, Kirill A . Shutemov, Andrew Morton,
	Catalin Marinas, Evgenii Stepanov, Linux ARM, linux-mm,
	kernel test robot, Linux API, linux-doc

On Mon, Aug 17, 2020 at 07:31:39PM -0700, John Hubbard wrote:
> >            Real time (s)    Max RSS (KiB)
> > anon        2.237081         107088
> > memset      2.252241         112180
> > refpage     2.243786         107128
> > 
> > We can see that RSS for refpage is almost the same as anon, and real
> > time overhead is 44% that of memset.
> > 
> 
> Are some of the numbers stale, maybe? Try as I might, I cannot combine
> anything above to come up with 44%. :)

You're not trying hard enough ;-)

(2.252241 - 2.237081) / 2.237081 = .00677668801442594166
(2.243786 - 2.237081) / 2.237081 = .00299720930981041812
.00299720930981041812 / .00677668801442594166 = .44228232189973614648

tadaa!

As I said last time this was posted, I'm just not excited by this.  We go
from having a 0.68% time overhead down to an 0.30% overhead, which just
doesn't move the needle for me.  Maybe there's a better benchmark than
this to show benefits from this patchset.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] mm: introduce reference pages
  2020-08-18  3:00   ` Matthew Wilcox
@ 2020-08-18 18:25     ` John Hubbard
  2021-06-19  9:20       ` Peter Collingbourne
  0 siblings, 1 reply; 5+ messages in thread
From: John Hubbard @ 2020-08-18 18:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Peter Collingbourne, Kirill A . Shutemov, Andrew Morton,
	Catalin Marinas, Evgenii Stepanov, Linux ARM, linux-mm,
	kernel test robot, Linux API, linux-doc

On 8/17/20 8:00 PM, Matthew Wilcox wrote:
> On Mon, Aug 17, 2020 at 07:31:39PM -0700, John Hubbard wrote:
>>>             Real time (s)    Max RSS (KiB)
>>> anon        2.237081         107088
>>> memset      2.252241         112180
>>> refpage     2.243786         107128
>>>
>>> We can see that RSS for refpage is almost the same as anon, and real
>>> time overhead is 44% that of memset.
>>>
>>
>> Are some of the numbers stale, maybe? Try as I might, I cannot combine
>> anything above to come up with 44%. :)
> 
> You're not trying hard enough ;-)
> 
> (2.252241 - 2.237081) / 2.237081 = .00677668801442594166
> (2.243786 - 2.237081) / 2.237081 = .00299720930981041812
> .00299720930981041812 / .00677668801442594166 = .44228232189973614648
> 
> tadaa!

haha, OK then! :) Next time I may try harder, but on the other hand my
interpretation of the results is still "this is a small effect", even
if there is a way to make it sound large by comparing the 3rd significant
digits of the results...

> 
> As I said last time this was posted, I'm just not excited by this.  We go
> from having a 0.68% time overhead down to an 0.30% overhead, which just
> doesn't move the needle for me.  Maybe there's a better benchmark than
> this to show benefits from this patchset.
> 

Yes, I wonder if there is an artificial workload that just uses refpages
really extensively, maybe we can get some good solid improvements shown
with that? Otherwise, it seems like we've just learned that memset is
actually pretty good in this case. :)

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] mm: introduce reference pages
  2020-08-18 18:25     ` John Hubbard
@ 2021-06-19  9:20       ` Peter Collingbourne
  0 siblings, 0 replies; 5+ messages in thread
From: Peter Collingbourne @ 2021-06-19  9:20 UTC (permalink / raw)
  To: John Hubbard
  Cc: Matthew Wilcox, Kirill A . Shutemov, Andrew Morton,
	Catalin Marinas, Evgenii Stepanov, Linux ARM,
	Linux Memory Management List, kernel test robot, Linux API,
	linux-doc

[Apologies for the delay in getting back to you; other work ended up
taking priority and now I'm back to looking at this.]

On Tue, Aug 18, 2020 at 11:25 AM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 8/17/20 8:00 PM, Matthew Wilcox wrote:
> > On Mon, Aug 17, 2020 at 07:31:39PM -0700, John Hubbard wrote:
> >>>             Real time (s)    Max RSS (KiB)
> >>> anon        2.237081         107088
> >>> memset      2.252241         112180
> >>> refpage     2.243786         107128
> >>>
> >>> We can see that RSS for refpage is almost the same as anon, and real
> >>> time overhead is 44% that of memset.
> >>>
> >>
> >> Are some of the numbers stale, maybe? Try as I might, I cannot combine
> >> anything above to come up with 44%. :)
> >
> > You're not trying hard enough ;-)
> >
> > (2.252241 - 2.237081) / 2.237081 = .00677668801442594166
> > (2.243786 - 2.237081) / 2.237081 = .00299720930981041812
> > .00299720930981041812 / .00677668801442594166 = .44228232189973614648
> >
> > tadaa!
>
> haha, OK then! :) Next time I may try harder, but on the other hand my
> interpretation of the results is still "this is a small effect", even
> if there is a way to make it sound large by comparing the 3rd significant
> digits of the results...
>
> >
> > As I said last time this was posted, I'm just not excited by this.  We go
> > from having a 0.68% time overhead down to an 0.30% overhead, which just
> > doesn't move the needle for me.  Maybe there's a better benchmark than
> > this to show benefits from this patchset.
> >
>

Remember that this is a "realistic" benchmark, so it's doing plenty of
other work besides faulting pages. So I don't think we should expect
to see a massive improvement here.

I ran the pdfium benchmark again but I couldn't see the same
improvements that I got last time. This seems to be because pdfium has
since switched to its own allocator, bypassing the system allocator. I
think the gains should be larger with the memset optimization that
I've implemented, but I'm still in the process of finding a suitable
realistic benchmark that uses the system allocator.

But I would find a 0.4% perf improvement convincing enough,
personally, given that the workload is realistic. Consider a certain
large company which spends $billions annually on data centers. In that
environment a 0.4% performance improvement on realistic workloads can
translate to $millions of savings. And that's not taking into account
the memory savings which are important both in mobile environments and
in data centers.

> Yes, I wonder if there is an artificial workload that just uses refpages
> really extensively, maybe we can get some good solid improvements shown
> with that? Otherwise, it seems like we've just learned that memset is
> actually pretty good in this case. :)

Yes, it's possible to see the performance improvement here more
clearly with a microbenchmark. I've updated the commit message in v4
to include a microbenchmark program and some performance numbers from
it.

Peter

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] mm: introduce reference pages
  2020-08-18  2:31 ` [PATCH v3] mm: introduce reference pages John Hubbard
  2020-08-18  3:00   ` Matthew Wilcox
@ 2021-06-19  9:21   ` Peter Collingbourne
  1 sibling, 0 replies; 5+ messages in thread
From: Peter Collingbourne @ 2021-06-19  9:21 UTC (permalink / raw)
  To: John Hubbard
  Cc: Matthew Wilcox, Kirill A . Shutemov, Andrew Morton,
	Catalin Marinas, Evgenii Stepanov, Linux ARM,
	Linux Memory Management List, kernel test robot, Linux API,
	linux-doc

[Apologies for the delay in getting back to you; other work ended up
taking priority and now I'm back to looking at this.]

On Mon, Aug 17, 2020 at 7:31 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 8/14/20 2:33 PM, Peter Collingbourne wrote:
> > Introduce a new syscall, refpage_create, which returns a file
> > descriptor which may be mapped using mmap. Such a mapping is similar
>
> Hi,
>
> For new syscalls, I think we need to put linux-api on CC, at the very
> least. Adding them now. This would likely need man page support as well.
> I'll put linux-doc on Cc, too.

Thanks.

> > to an anonymous mapping, but instead of clean pages being backed by the
> > zero page, they are instead backed by a so-called reference page, whose
> > contents are specified using an argument to refpage_create. Loads from
> > the mapping will load directly from the reference page, and initial
> > stores to the mapping will copy-on-write from the reference page.
> >
> > Reference pages are useful in circumstances where anonymous mappings
> > combined with manual stores to memory would impose undesirable costs,
> > either in terms of performance or RSS. Use cases are focused on heap
> > allocators and include:
> >
> > - Pattern initialization for the heap. This is where malloc(3) gives
> >    you memory whose contents are filled with a non-zero pattern
> >    byte, in order to help detect and mitigate bugs involving use
> >    of uninitialized memory. Typically this is implemented by having
> >    the allocator memset the allocation with the pattern byte before
> >    returning it to the user, but for large allocations this can result
> >    in a significant increase in RSS, especially for allocations that
> >    are used sparsely. Even for dense allocations there is a needless
> >    impact to startup performance when it may be better to amortize it
> >    throughout the program. By creating allocations using a reference
> >    page filled with the pattern byte, we can avoid these costs.
> >
> > - Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
> >    feature which allows for memory to be tagged in order to detect
> >    certain kinds of memory errors with low overhead. In order to set
> >    up an allocation to allow memory errors to be detected, the entire
> >    allocation needs to have the same tag. The issue here is similar to
> >    pattern initialization in the sense that large tagged allocations
> >    will be expensive if the tagging is done up front. The idea is that
> >    the allocator would create reference pages with each of the possible
> >    memory tags, and use those reference pages for the large allocations.
>
> That is good information, and it belongs in a man page, and/or Documentation/.

I plan to write a man page for refpage_create(2) once this is closer to landing.

> >
> > In order to measure the performance and RSS impact of reference pages,
> > a version of this patch backported to kernel version 4.14 was tested on
> > a Pixel 4 together with a modified [2] version of the Scudo allocator
> > that uses reference pages to implement pattern initialization. A
> > PDFium test program was used to collect the measurements like so:
> >
> > $ wget https://static.docs.arm.com/ddi0487/fb/DDI0487F_b_armv8_arm.pdf
> > $ /system/bin/time -v ./pdfium_test --pages=1-100 DDI0487F_b_armv8_arm.pdf
> >
> > and the median of 100 runs measurement was taken with three variants
> > of the allocator:
> >
> > - "anon" is the baseline (no pattern init)
> > - "memset" is with pattern init of allocator pages implemented by
> >    initializing anonymous pages with memset
> > - "refpage" is with pattern init of allocator pages implemented
> >    by creating reference pages
> >
> > All three variants are measured using the patch that I linked. "anon"
> > is without the patch, "refpage" is with the patch and "memset" is
> > with a previous version of the patch [3] with "#if 0" in place of
> > "#if 1" in linux.cpp. The measurements are as follows:
> >
> >            Real time (s)    Max RSS (KiB)
> > anon        2.237081         107088
> > memset      2.252241         112180
> > refpage     2.243786         107128
> >
> > We can see that RSS for refpage is almost the same as anon, and real
> > time overhead is 44% that of memset.
> >
>
> Are some of the numbers stale, maybe? Try as I might, I cannot combine
> anything above to come up with 44%. :)
>
> > As an alternative to introducing this syscall, I considered using
> > userfaultfd to implement reference pages. However, after having taken
> > a detailed look at the interface, it does not seem suitable to be
> > used in the context of a general purpose allocator. For example,
> > UFFD_FEATURE_FORK support would be required in order to correctly
> > support fork(2) in a process that uses the allocator (although POSIX
> > does not guarantee support for allocating after fork, many allocators
> > including Scudo support it, and nothing stops the forked process from
> > page faulting pre-existing allocations after forking anyway), but
> > UFFD_FEATURE_FORK has been restricted to root by commit 3c1c24d91ffd
> > ("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
> > making it unsuitable for use in an allocator. Furthermore, even if
> > the interface issues are resolved, I suspect (but have not measured)
> > that the cost of the multiple context switches between kernel and
> > userspace would be too high to be used in an allocator anyway.
>
>
> That whole blurb is good for a cover letter, and perhaps an "alternatives
> considered" section in Documentation/. However, it should be omitted from
> the patch commit description, IMHO.

Okay, I moved it to the notes section of the commit message.

> ...
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 467302056e17..a1dc07ff914a 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -175,6 +175,13 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
> >
> >       if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
> >               return false;
> > +
> > +     /*
> > +      * Transparent hugepages not currently supported for anonymous VMAs with
> > +      * reference pages
> > +      */
> > +     if (unlikely(vma->vm_private_data))
>
>
> This should use a helper function, such as is_reference_page_vma(). Because the
> assumption that "vma->vm_private_data means a reference page vma" is much too
> fragile. More below.

That makes sense. In v4 I've introduced a helper function.

> > +             return false;
> >       return true;
> >   }
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index e7602a3bcef1..ac375e398690 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3122,5 +3122,15 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping,
> >
> >   extern int sysctl_nr_trim_pages;
> >
> > +static inline int is_zero_or_refpage_pfn(struct vm_area_struct *vma,
> > +                                      unsigned long pfn)
> > +{
> > +     if (is_zero_pfn(pfn))
> > +             return true;
> > +     if (unlikely(!vma->vm_ops && vma->vm_private_data))
> > +             return pfn == page_to_pfn((struct page *)vma->vm_private_data);
>
> As foreshadowed above, this needs a helper function. And the criteria for
> deciding that it's a reference page needs to be more robust than just "no vm_ops,
> vm_private_data is set, and it matches my page". Needs some more decisive
> information.
>
> Maybe setting vm_ops to some new "refpage" ops would be the way to go, for that.

As I mentioned in my reply to Jann, we can't set vm_ops without
introducing some unwanted behavior as a result of following the
non-anonymous VMA code path. What I ended up doing instead was to
check whether vm_file->f_op refers to the refpage file_operations
struct.

It might be nice to introduce a VM_REFPAGE flag to make this check
more efficient, but this would first require extending vm_flags to 64
bits on 32-bit platforms since we're out of bits in vm_flags. From
looking around it looks like many people have attempted this over the
years; it looks like the most recent attempt is from this month:
https://www.spinics.net/lists/kernel/msg3961408.html

Let's see if it actually happens this time.

> ...
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 5053439be6ab..6e9246d09e95 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2841,8 +2841,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> >       pmd_t *pmdp;
> >       pte_t *ptep;
> >
> > -     /* Only allow populating anonymous memory */
> > -     if (!vma_is_anonymous(vma))
> > +     /* Only allow populating anonymous memory without a reference page */
> > +     if (!vma_is_anonymous(vma) || vma->private_data)
>
> Same thing here: helper function, instead of open-coding the assumption about
> what makes a refpage vma.

Done.

> ...
>
> > +SYSCALL_DEFINE2(refpage_create, const void *__user, content, unsigned long,
> > +             flags)
> > +{
> > +     unsigned long content_addr = (unsigned long)content;
> > +     struct page *userpage, *refpage;
> > +     int fd;
> > +
> > +     if (flags != 0)
> > +             return -EINVAL;
> > +
> > +     refpage = alloc_page(GFP_KERNEL);
> > +     if (!refpage)
> > +             return -ENOMEM;
> > +
> > +     if ((content_addr & (PAGE_SIZE - 1)) != 0 ||
> > +         get_user_pages(content_addr, 1, 0, &userpage, 0) != 1) {
> > +             put_page(refpage);
> > +             return -EFAULT;
> > +     }
> > +
> > +     copy_highpage(refpage, userpage);
> > +     put_page(userpage);
> > +
> > +     fd = anon_inode_getfd("[refpage]", &refpage_file_operations, refpage,
> > +                           O_RDONLY | O_CLOEXEC);
>
> Seems like the flags argument should have an influence on these flags, rather
> than hard-coding O_CLOEXEC, right?

I couldn't see a use case for having one of these FDs without
O_CLOEXEC. If someone really wants a non-CLOEXEC refpage FD, they can
use fcntl to clear the CLOEXEC bit.

I only added the flags argument to support future extension as described in:
https://www.kernel.org/doc/html/v5.12/process/adding-syscalls.html#designing-the-api-planning-for-extension

Peter

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-06-19  9:21 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20200814213310.42170-1-pcc@google.com>
2020-08-18  2:31 ` [PATCH v3] mm: introduce reference pages John Hubbard
2020-08-18  3:00   ` Matthew Wilcox
2020-08-18 18:25     ` John Hubbard
2021-06-19  9:20       ` Peter Collingbourne
2021-06-19  9:21   ` Peter Collingbourne

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).