[RFC PATCH 0/4] Convert perf ringbuffer to folios

linux-perf-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/4] Convert perf ringbuffer to folios
@ 2023-08-21 20:20 Matthew Wilcox (Oracle)
  2023-08-21 20:20 ` [RFC PATCH 1/4] perf: Convert perf_mmap_(alloc,free)_page " Matthew Wilcox (Oracle)
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-21 20:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox (Oracle), linux-perf-users, peterz, mingo, acme,
	urezki, hch, lstoakes

The motivation here is to remove uses of page->mapping as part of the
quest to shrink struct page.  I'm unsure about some of this as the perf
ringbuffers present some new challenges that we haven't faced before.

Perf has three different ways of allocating memory for the ring buffer.
There's an individual pages way, a vmalloc way and a physically contiguous
way (for the aux buffers).  The fault handler can map all three.

The fault handler can map the _first_ page read-write, but all subsequent
pages are mapped read-only.  That makes calling vm_insert_page() not
feasible; that would allow all pages to be mapped read-write (yes?)
So the fault handler sets vmf->page and returns, in order to let
finish_fault() set it up.

I'm not really sure why page/folio->mapping needs to be set by the
fault handler.  What goes wrong if we leave it as NULL?  Likewise, do
we really need to set ->index?  I've taken that out for now, but maybe
that'll make something fail (perhaps an rmap walk?  Do we really need to
do rmap walks on device driver mappings?)

I don't like drivers playing these games with pages/folios.  I think we
need better support for these kinds of things.  Probably in the form of
helpers, given how unusual some of perf's requirements are.

More broadly, we don't have a detailed plan for how vmalloc memory gets
handled in a memdesc world.  For the vast majority of vmalloc allocations,
the pages allocated need no descriptor; they're only accessed through
the virtual address by the CPU.  This is a case where we need to mmap
the vmalloc memory.  There are also cases where we need to do I/O to
vmalloc memory; we need to understand what fields from struct page are
really needed for each of these two requirements.

Matthew Wilcox (Oracle) (4):
  perf: Convert perf_mmap_(alloc,free)_page to folios
  mm: Add vmalloc_user_node()
  perf: Use vmalloc_to_folio()
  perf: Use folios for the aux ringbuffer & pagefault path

 include/linux/mm.h          |  5 +++
 include/linux/vmalloc.h     | 17 +++++++-
 kernel/events/core.c        | 13 ++++--
 kernel/events/ring_buffer.c | 83 +++++++++++++++++--------------------
 mm/vmalloc.c                |  9 ++--
 5 files changed, 72 insertions(+), 55 deletions(-)

-- 
2.40.1

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH 1/4] perf: Convert perf_mmap_(alloc,free)_page to folios
  2023-08-21 20:20 [RFC PATCH 0/4] Convert perf ringbuffer to folios Matthew Wilcox (Oracle)
@ 2023-08-21 20:20 ` Matthew Wilcox (Oracle)
  2023-08-23  7:28   ` Yin Fengwei
  2023-08-27  7:33   ` Lorenzo Stoakes
  2023-08-21 20:20 ` [RFC PATCH 2/4] mm: Add vmalloc_user_node() Matthew Wilcox (Oracle)
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 19+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-21 20:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox (Oracle), linux-perf-users, peterz, mingo, acme,
	urezki, hch, lstoakes

Use the folio API to alloc & free memory.  Also pass in the node ID
instead of the CPU ID since we already did that calculation in the
caller.  And call numa_mem_id() instead of leaving the node ID as
-1 and having the MM call numa_mem_id() each time.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 kernel/events/ring_buffer.c | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index fb1e180b5f0a..cc90d5299005 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -770,7 +770,7 @@ void rb_free_aux(struct perf_buffer *rb)
 #ifndef CONFIG_PERF_USE_VMALLOC
 
 /*
- * Back perf_mmap() with regular GFP_KERNEL-0 pages.
+ * Back perf_mmap() with regular GFP_KERNEL pages.
  */
 
 static struct page *
@@ -785,25 +785,23 @@ __perf_mmap_to_page(struct perf_buffer *rb, unsigned long pgoff)
 	return virt_to_page(rb->data_pages[pgoff - 1]);
 }
 
-static void *perf_mmap_alloc_page(int cpu)
+static void *perf_mmap_alloc_page(int node)
 {
-	struct page *page;
-	int node;
+	struct folio *folio;
 
-	node = (cpu == -1) ? cpu : cpu_to_node(cpu);
-	page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
-	if (!page)
+	folio = __folio_alloc_node(GFP_KERNEL | __GFP_ZERO, 0, node);
+	if (!folio)
 		return NULL;
 
-	return page_address(page);
+	return folio_address(folio);
 }
 
 static void perf_mmap_free_page(void *addr)
 {
-	struct page *page = virt_to_page(addr);
+	struct folio *folio = virt_to_folio(addr);
 
-	page->mapping = NULL;
-	__free_page(page);
+	folio->mapping = NULL;
+	folio_put(folio);
 }
 
 struct perf_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
@@ -818,17 +816,17 @@ struct perf_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
 	if (order_base_2(size) > PAGE_SHIFT+MAX_ORDER)
 		goto fail;
 
-	node = (cpu == -1) ? cpu : cpu_to_node(cpu);
+	node = (cpu == -1) ? numa_mem_id() : cpu_to_node(cpu);
 	rb = kzalloc_node(size, GFP_KERNEL, node);
 	if (!rb)
 		goto fail;
 
-	rb->user_page = perf_mmap_alloc_page(cpu);
+	rb->user_page = perf_mmap_alloc_page(node);
 	if (!rb->user_page)
 		goto fail_user_page;
 
 	for (i = 0; i < nr_pages; i++) {
-		rb->data_pages[i] = perf_mmap_alloc_page(cpu);
+		rb->data_pages[i] = perf_mmap_alloc_page(node);
 		if (!rb->data_pages[i])
 			goto fail_data_pages;
 	}
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/4] perf: Convert perf_mmap_(alloc,free)_page to folios
  2023-08-21 20:20 ` [RFC PATCH 1/4] perf: Convert perf_mmap_(alloc,free)_page " Matthew Wilcox (Oracle)
@ 2023-08-23  7:28   ` Yin Fengwei
  2023-08-27  7:33   ` Lorenzo Stoakes
  1 sibling, 0 replies; 19+ messages in thread
From: Yin Fengwei @ 2023-08-23  7:28 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), linux-mm
  Cc: linux-perf-users, peterz, mingo, acme, urezki, hch, lstoakes



On 8/22/23 04:20, Matthew Wilcox (Oracle) wrote:
> Use the folio API to alloc & free memory.  Also pass in the node ID
> instead of the CPU ID since we already did that calculation in the
> caller.  And call numa_mem_id() instead of leaving the node ID as
> -1 and having the MM call numa_mem_id() each time.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>

Regards
Yin, Fengwei

> ---
>  kernel/events/ring_buffer.c | 26 ++++++++++++--------------
>  1 file changed, 12 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index fb1e180b5f0a..cc90d5299005 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -770,7 +770,7 @@ void rb_free_aux(struct perf_buffer *rb)
>  #ifndef CONFIG_PERF_USE_VMALLOC
>  
>  /*
> - * Back perf_mmap() with regular GFP_KERNEL-0 pages.
> + * Back perf_mmap() with regular GFP_KERNEL pages.
>   */
>  
>  static struct page *
> @@ -785,25 +785,23 @@ __perf_mmap_to_page(struct perf_buffer *rb, unsigned long pgoff)
>  	return virt_to_page(rb->data_pages[pgoff - 1]);
>  }
>  
> -static void *perf_mmap_alloc_page(int cpu)
> +static void *perf_mmap_alloc_page(int node)
>  {
> -	struct page *page;
> -	int node;
> +	struct folio *folio;
>  
> -	node = (cpu == -1) ? cpu : cpu_to_node(cpu);
> -	page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
> -	if (!page)
> +	folio = __folio_alloc_node(GFP_KERNEL | __GFP_ZERO, 0, node);
> +	if (!folio)
>  		return NULL;
>  
> -	return page_address(page);
> +	return folio_address(folio);
>  }
>  
>  static void perf_mmap_free_page(void *addr)
>  {
> -	struct page *page = virt_to_page(addr);
> +	struct folio *folio = virt_to_folio(addr);
>  
> -	page->mapping = NULL;
> -	__free_page(page);
> +	folio->mapping = NULL;
> +	folio_put(folio);
>  }
>  
>  struct perf_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
> @@ -818,17 +816,17 @@ struct perf_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
>  	if (order_base_2(size) > PAGE_SHIFT+MAX_ORDER)
>  		goto fail;
>  
> -	node = (cpu == -1) ? cpu : cpu_to_node(cpu);
> +	node = (cpu == -1) ? numa_mem_id() : cpu_to_node(cpu);
>  	rb = kzalloc_node(size, GFP_KERNEL, node);
>  	if (!rb)
>  		goto fail;
>  
> -	rb->user_page = perf_mmap_alloc_page(cpu);
> +	rb->user_page = perf_mmap_alloc_page(node);
>  	if (!rb->user_page)
>  		goto fail_user_page;
>  
>  	for (i = 0; i < nr_pages; i++) {
> -		rb->data_pages[i] = perf_mmap_alloc_page(cpu);
> +		rb->data_pages[i] = perf_mmap_alloc_page(node);
>  		if (!rb->data_pages[i])
>  			goto fail_data_pages;
>  	}

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/4] perf: Convert perf_mmap_(alloc,free)_page to folios
  2023-08-21 20:20 ` [RFC PATCH 1/4] perf: Convert perf_mmap_(alloc,free)_page " Matthew Wilcox (Oracle)
  2023-08-23  7:28   ` Yin Fengwei
@ 2023-08-27  7:33   ` Lorenzo Stoakes
  2023-09-13 13:31     ` Peter Zijlstra
  1 sibling, 1 reply; 19+ messages in thread
From: Lorenzo Stoakes @ 2023-08-27  7:33 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-mm, linux-perf-users, peterz, mingo, acme, urezki, hch

On Mon, Aug 21, 2023 at 09:20:13PM +0100, Matthew Wilcox (Oracle) wrote:
> Use the folio API to alloc & free memory.  Also pass in the node ID
> instead of the CPU ID since we already did that calculation in the
> caller.  And call numa_mem_id() instead of leaving the node ID as
> -1 and having the MM call numa_mem_id() each time.

It's a bit of a nit or perhaps an aside, but with this change if you passed
-1 to the newly invoked __folio_alloc_node() you will also trigger a
VM_BUG_ON() :)

>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  kernel/events/ring_buffer.c | 26 ++++++++++++--------------
>  1 file changed, 12 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index fb1e180b5f0a..cc90d5299005 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -770,7 +770,7 @@ void rb_free_aux(struct perf_buffer *rb)
>  #ifndef CONFIG_PERF_USE_VMALLOC
>
>  /*
> - * Back perf_mmap() with regular GFP_KERNEL-0 pages.
> + * Back perf_mmap() with regular GFP_KERNEL pages.
>   */
>
>  static struct page *
> @@ -785,25 +785,23 @@ __perf_mmap_to_page(struct perf_buffer *rb, unsigned long pgoff)
>  	return virt_to_page(rb->data_pages[pgoff - 1]);
>  }
>
> -static void *perf_mmap_alloc_page(int cpu)
> +static void *perf_mmap_alloc_page(int node)

Nitty point but since we're dealing with folios here maybe rename to
perf_mmap_alloc_folio()?

>  {
> -	struct page *page;
> -	int node;
> +	struct folio *folio;
>
> -	node = (cpu == -1) ? cpu : cpu_to_node(cpu);
> -	page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
> -	if (!page)
> +	folio = __folio_alloc_node(GFP_KERNEL | __GFP_ZERO, 0, node);
> +	if (!folio)
>  		return NULL;
>
> -	return page_address(page);
> +	return folio_address(folio);
>  }
>
>  static void perf_mmap_free_page(void *addr)

Same comment re: rename -> perf_mmap_free_folio()

>  {
> -	struct page *page = virt_to_page(addr);
> +	struct folio *folio = virt_to_folio(addr);
>
> -	page->mapping = NULL;
> -	__free_page(page);
> +	folio->mapping = NULL;
> +	folio_put(folio);
>  }
>
>  struct perf_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
> @@ -818,17 +816,17 @@ struct perf_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
>  	if (order_base_2(size) > PAGE_SHIFT+MAX_ORDER)
>  		goto fail;
>
> -	node = (cpu == -1) ? cpu : cpu_to_node(cpu);
> +	node = (cpu == -1) ? numa_mem_id() : cpu_to_node(cpu);

This is a good change in general, it's pretty yucky that this code just assumes
-1 == NUMA_NO_NODE and that all invoked functions would handle this correctly.

>  	rb = kzalloc_node(size, GFP_KERNEL, node);
>  	if (!rb)
>  		goto fail;
>
> -	rb->user_page = perf_mmap_alloc_page(cpu);
> +	rb->user_page = perf_mmap_alloc_page(node);
>  	if (!rb->user_page)
>  		goto fail_user_page;
>
>  	for (i = 0; i < nr_pages; i++) {
> -		rb->data_pages[i] = perf_mmap_alloc_page(cpu);
> +		rb->data_pages[i] = perf_mmap_alloc_page(node);
>  		if (!rb->data_pages[i])
>  			goto fail_data_pages;
>  	}
> --
> 2.40.1
>

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/4] perf: Convert perf_mmap_(alloc,free)_page to folios
  2023-08-27  7:33   ` Lorenzo Stoakes
@ 2023-09-13 13:31     ` Peter Zijlstra
  2023-09-22 20:19       ` Lorenzo Stoakes
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2023-09-13 13:31 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-perf-users, mingo, acme,
	urezki, hch

On Sun, Aug 27, 2023 at 08:33:39AM +0100, Lorenzo Stoakes wrote:

> > @@ -785,25 +785,23 @@ __perf_mmap_to_page(struct perf_buffer *rb, unsigned long pgoff)
> >  	return virt_to_page(rb->data_pages[pgoff - 1]);
> >  }
> >
> > -static void *perf_mmap_alloc_page(int cpu)
> > +static void *perf_mmap_alloc_page(int node)
> 
> Nitty point but since we're dealing with folios here maybe rename to
> perf_mmap_alloc_folio()?

Since it's an explicit order-0 allocation, does that really make sense?

> >  {
> > -	struct page *page;
> > -	int node;
> > +	struct folio *folio;
> >
> > -	node = (cpu == -1) ? cpu : cpu_to_node(cpu);
> > -	page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
> > -	if (!page)
> > +	folio = __folio_alloc_node(GFP_KERNEL | __GFP_ZERO, 0, node);
> > +	if (!folio)
> >  		return NULL;
> >
> > -	return page_address(page);
> > +	return folio_address(folio);
> >  }


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/4] perf: Convert perf_mmap_(alloc,free)_page to folios
  2023-09-13 13:31     ` Peter Zijlstra
@ 2023-09-22 20:19       ` Lorenzo Stoakes
  0 siblings, 0 replies; 19+ messages in thread
From: Lorenzo Stoakes @ 2023-09-22 20:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-perf-users, mingo, acme,
	urezki, hch

On Wed, Sep 13, 2023 at 03:31:05PM +0200, Peter Zijlstra wrote:
> On Sun, Aug 27, 2023 at 08:33:39AM +0100, Lorenzo Stoakes wrote:
>
> > > @@ -785,25 +785,23 @@ __perf_mmap_to_page(struct perf_buffer *rb, unsigned long pgoff)
> > >  	return virt_to_page(rb->data_pages[pgoff - 1]);
> > >  }
> > >
> > > -static void *perf_mmap_alloc_page(int cpu)
> > > +static void *perf_mmap_alloc_page(int node)
> >
> > Nitty point but since we're dealing with folios here maybe rename to
> > perf_mmap_alloc_folio()?
>
> Since it's an explicit order-0 allocation, does that really make sense?
>

True, it does ultimately yield a single page and doesn't reference its
metadata so it's not the end of the world to keep it as-is (it was very
nitty after all!)

> > >  {
> > > -	struct page *page;
> > > -	int node;
> > > +	struct folio *folio;
> > >
> > > -	node = (cpu == -1) ? cpu : cpu_to_node(cpu);
> > > -	page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
> > > -	if (!page)
> > > +	folio = __folio_alloc_node(GFP_KERNEL | __GFP_ZERO, 0, node);
> > > +	if (!folio)
> > >  		return NULL;
> > >
> > > -	return page_address(page);
> > > +	return folio_address(folio);
> > >  }
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH 2/4] mm: Add vmalloc_user_node()
  2023-08-21 20:20 [RFC PATCH 0/4] Convert perf ringbuffer to folios Matthew Wilcox (Oracle)
  2023-08-21 20:20 ` [RFC PATCH 1/4] perf: Convert perf_mmap_(alloc,free)_page " Matthew Wilcox (Oracle)
@ 2023-08-21 20:20 ` Matthew Wilcox (Oracle)
  2023-09-13 13:32   ` Peter Zijlstra
  2023-08-21 20:20 ` [RFC PATCH 3/4] perf: Use vmalloc_to_folio() Matthew Wilcox (Oracle)
  2023-08-21 20:20 ` [RFC PATCH 4/4] perf: Use folios for the aux ringbuffer & pagefault path Matthew Wilcox (Oracle)
  3 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-21 20:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox (Oracle), linux-perf-users, peterz, mingo, acme,
	urezki, hch, lstoakes

Allow memory to be allocated on a specified node.  Use it in the
perf ring-buffer code.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/vmalloc.h     | 17 ++++++++++++++++-
 kernel/events/ring_buffer.c |  2 +-
 mm/vmalloc.c                |  9 +++++----
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..030bfe1a60ab 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -6,6 +6,7 @@
 #include <linux/init.h>
 #include <linux/list.h>
 #include <linux/llist.h>
+#include <linux/numa.h>
 #include <asm/page.h>		/* pgprot_t */
 #include <linux/rbtree.h>
 #include <linux/overflow.h>
@@ -139,7 +140,7 @@ static inline unsigned long vmalloc_nr_pages(void) { return 0; }
 
 extern void *vmalloc(unsigned long size) __alloc_size(1);
 extern void *vzalloc(unsigned long size) __alloc_size(1);
-extern void *vmalloc_user(unsigned long size) __alloc_size(1);
+extern void *vmalloc_user_node(unsigned long size, int node) __alloc_size(1);
 extern void *vmalloc_node(unsigned long size, int node) __alloc_size(1);
 extern void *vzalloc_node(unsigned long size, int node) __alloc_size(1);
 extern void *vmalloc_32(unsigned long size) __alloc_size(1);
@@ -158,6 +159,20 @@ extern void *vmalloc_array(size_t n, size_t size) __alloc_size(1, 2);
 extern void *__vcalloc(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
 extern void *vcalloc(size_t n, size_t size) __alloc_size(1, 2);
 
+/**
+ * vmalloc_user - allocate zeroed virtually contiguous memory for userspace
+ * @size: allocation size
+ *
+ * The resulting memory area is zeroed so it can be mapped to userspace
+ * without leaking data.
+ *
+ * Return: pointer to the allocated memory or %NULL on error
+ */
+static inline void *vmalloc_user(size_t size)
+{
+	return vmalloc_user_node(size, NUMA_NO_NODE);
+}
+
 extern void vfree(const void *addr);
 extern void vfree_atomic(const void *addr);
 
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index cc90d5299005..c73add132618 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -918,7 +918,7 @@ struct perf_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
 
 	INIT_WORK(&rb->work, rb_free_work);
 
-	all_buf = vmalloc_user((nr_pages + 1) * PAGE_SIZE);
+	all_buf = vmalloc_user_node((nr_pages + 1) * PAGE_SIZE, node);
 	if (!all_buf)
 		goto fail_all_buf;
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 228a4a5312f2..3616bfe4348f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3461,22 +3461,23 @@ void *vzalloc(unsigned long size)
 EXPORT_SYMBOL(vzalloc);
 
 /**
- * vmalloc_user - allocate zeroed virtually contiguous memory for userspace
+ * vmalloc_user_node - allocate zeroed virtually contiguous memory for userspace
  * @size: allocation size
+ * @node: NUMA node
  *
  * The resulting memory area is zeroed so it can be mapped to userspace
  * without leaking data.
  *
  * Return: pointer to the allocated memory or %NULL on error
  */
-void *vmalloc_user(unsigned long size)
+void *vmalloc_user_node(unsigned long size, int node)
 {
 	return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
 				    GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
-				    VM_USERMAP, NUMA_NO_NODE,
+				    VM_USERMAP, node,
 				    __builtin_return_address(0));
 }
-EXPORT_SYMBOL(vmalloc_user);
+EXPORT_SYMBOL(vmalloc_user_node);
 
 /**
  * vmalloc_node - allocate memory on a specific node
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 2/4] mm: Add vmalloc_user_node()
  2023-08-21 20:20 ` [RFC PATCH 2/4] mm: Add vmalloc_user_node() Matthew Wilcox (Oracle)
@ 2023-09-13 13:32   ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2023-09-13 13:32 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-mm, linux-perf-users, mingo, acme, urezki, hch, lstoakes

On Mon, Aug 21, 2023 at 09:20:14PM +0100, Matthew Wilcox (Oracle) wrote:
> Allow memory to be allocated on a specified node.  Use it in the
> perf ring-buffer code.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH 3/4] perf: Use vmalloc_to_folio()
  2023-08-21 20:20 [RFC PATCH 0/4] Convert perf ringbuffer to folios Matthew Wilcox (Oracle)
  2023-08-21 20:20 ` [RFC PATCH 1/4] perf: Convert perf_mmap_(alloc,free)_page " Matthew Wilcox (Oracle)
  2023-08-21 20:20 ` [RFC PATCH 2/4] mm: Add vmalloc_user_node() Matthew Wilcox (Oracle)
@ 2023-08-21 20:20 ` Matthew Wilcox (Oracle)
  2023-08-22  7:54   ` Zhu Yanjun
                     ` (2 more replies)
  2023-08-21 20:20 ` [RFC PATCH 4/4] perf: Use folios for the aux ringbuffer & pagefault path Matthew Wilcox (Oracle)
  3 siblings, 3 replies; 19+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-21 20:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox (Oracle), linux-perf-users, peterz, mingo, acme,
	urezki, hch, lstoakes

Eliminate a use of page->mapping by using vmalloc_to_folio() instead.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/mm.h          | 5 +++++
 kernel/events/ring_buffer.c | 4 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 840bae5f23b6..7d84a2843193 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1134,6 +1134,11 @@ int region_intersects(resource_size_t offset, size_t size, unsigned long flags,
 struct page *vmalloc_to_page(const void *addr);
 unsigned long vmalloc_to_pfn(const void *addr);
 
+static inline struct folio *vmalloc_to_folio(const void *addr)
+{
+	return page_folio(vmalloc_to_page(addr));
+}
+
 /*
  * Determine if an address is within the vmalloc range
  *
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index c73add132618..56939dc3bf33 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -873,9 +873,9 @@ __perf_mmap_to_page(struct perf_buffer *rb, unsigned long pgoff)
 
 static void perf_mmap_unmark_page(void *addr)
 {
-	struct page *page = vmalloc_to_page(addr);
+	struct folio *folio = vmalloc_to_folio(addr);
 
-	page->mapping = NULL;
+	folio->mapping = NULL;
 }
 
 static void rb_free_work(struct work_struct *work)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 3/4] perf: Use vmalloc_to_folio()
  2023-08-21 20:20 ` [RFC PATCH 3/4] perf: Use vmalloc_to_folio() Matthew Wilcox (Oracle)
@ 2023-08-22  7:54   ` Zhu Yanjun
  2023-08-23  7:30   ` Yin Fengwei
  2023-08-27  7:22   ` Lorenzo Stoakes
  2 siblings, 0 replies; 19+ messages in thread
From: Zhu Yanjun @ 2023-08-22  7:54 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), linux-mm
  Cc: linux-perf-users, peterz, mingo, acme, urezki, hch, lstoakes

在 2023/8/22 4:20, Matthew Wilcox (Oracle) 写道:
> Eliminate a use of page->mapping by using vmalloc_to_folio() instead.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>   include/linux/mm.h          | 5 +++++
>   kernel/events/ring_buffer.c | 4 ++--
>   2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 840bae5f23b6..7d84a2843193 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1134,6 +1134,11 @@ int region_intersects(resource_size_t offset, size_t size, unsigned long flags,
>   struct page *vmalloc_to_page(const void *addr);
>   unsigned long vmalloc_to_pfn(const void *addr);
>   
> +static inline struct folio *vmalloc_to_folio(const void *addr)
> +{
> +	return page_folio(vmalloc_to_page(addr));
> +}

Thanks a lot. This function is very wonderful. It is also what I need.

Zhu Yanjun

> +
>   /*
>    * Determine if an address is within the vmalloc range
>    *
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index c73add132618..56939dc3bf33 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -873,9 +873,9 @@ __perf_mmap_to_page(struct perf_buffer *rb, unsigned long pgoff)
>   
>   static void perf_mmap_unmark_page(void *addr)
>   {
> -	struct page *page = vmalloc_to_page(addr);
> +	struct folio *folio = vmalloc_to_folio(addr);
>   
> -	page->mapping = NULL;
> +	folio->mapping = NULL;
>   }
>   
>   static void rb_free_work(struct work_struct *work)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 3/4] perf: Use vmalloc_to_folio()
  2023-08-21 20:20 ` [RFC PATCH 3/4] perf: Use vmalloc_to_folio() Matthew Wilcox (Oracle)
  2023-08-22  7:54   ` Zhu Yanjun
@ 2023-08-23  7:30   ` Yin Fengwei
  2023-08-23 12:16     ` Matthew Wilcox
  2023-08-27  7:22   ` Lorenzo Stoakes
  2 siblings, 1 reply; 19+ messages in thread
From: Yin Fengwei @ 2023-08-23  7:30 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), linux-mm
  Cc: linux-perf-users, peterz, mingo, acme, urezki, hch, lstoakes



On 8/22/23 04:20, Matthew Wilcox (Oracle) wrote:
> Eliminate a use of page->mapping by using vmalloc_to_folio() instead.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/mm.h          | 5 +++++
>  kernel/events/ring_buffer.c | 4 ++--
>  2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 840bae5f23b6..7d84a2843193 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1134,6 +1134,11 @@ int region_intersects(resource_size_t offset, size_t size, unsigned long flags,
>  struct page *vmalloc_to_page(const void *addr);
>  unsigned long vmalloc_to_pfn(const void *addr);
>  
> +static inline struct folio *vmalloc_to_folio(const void *addr)
> +{
> +	return page_folio(vmalloc_to_page(addr));
I am wondering whether we should check the return value of vmalloc_to_page()?

> +}
> +
>  /*
>   * Determine if an address is within the vmalloc range
>   *
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index c73add132618..56939dc3bf33 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -873,9 +873,9 @@ __perf_mmap_to_page(struct perf_buffer *rb, unsigned long pgoff)
>  
>  static void perf_mmap_unmark_page(void *addr)
>  {
> -	struct page *page = vmalloc_to_page(addr);
> +	struct folio *folio = vmalloc_to_folio(addr);
>  
> -	page->mapping = NULL;
> +	folio->mapping = NULL;
>  }
>  
>  static void rb_free_work(struct work_struct *work)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 3/4] perf: Use vmalloc_to_folio()
  2023-08-23  7:30   ` Yin Fengwei
@ 2023-08-23 12:16     ` Matthew Wilcox
  0 siblings, 0 replies; 19+ messages in thread
From: Matthew Wilcox @ 2023-08-23 12:16 UTC (permalink / raw)
  To: Yin Fengwei
  Cc: linux-mm, linux-perf-users, peterz, mingo, acme, urezki, hch,
	lstoakes

On Wed, Aug 23, 2023 at 03:30:22PM +0800, Yin Fengwei wrote:
> On 8/22/23 04:20, Matthew Wilcox (Oracle) wrote:
> > +static inline struct folio *vmalloc_to_folio(const void *addr)
> > +{
> > +	return page_folio(vmalloc_to_page(addr));
> I am wondering whether we should check the return value of vmalloc_to_page()?

That's a good question.  Almost every user of vmalloc_to_page() just
assumes it works.  But it'll cost very little to check for NULL, so
I'll put that into the next version.  Thanks!


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 3/4] perf: Use vmalloc_to_folio()
  2023-08-21 20:20 ` [RFC PATCH 3/4] perf: Use vmalloc_to_folio() Matthew Wilcox (Oracle)
  2023-08-22  7:54   ` Zhu Yanjun
  2023-08-23  7:30   ` Yin Fengwei
@ 2023-08-27  7:22   ` Lorenzo Stoakes
  2 siblings, 0 replies; 19+ messages in thread
From: Lorenzo Stoakes @ 2023-08-27  7:22 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-mm, linux-perf-users, peterz, mingo, acme, urezki, hch

On Mon, Aug 21, 2023 at 09:20:15PM +0100, Matthew Wilcox (Oracle) wrote:
> Eliminate a use of page->mapping by using vmalloc_to_folio() instead.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/mm.h          | 5 +++++
>  kernel/events/ring_buffer.c | 4 ++--
>  2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 840bae5f23b6..7d84a2843193 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1134,6 +1134,11 @@ int region_intersects(resource_size_t offset, size_t size, unsigned long flags,
>  struct page *vmalloc_to_page(const void *addr);
>  unsigned long vmalloc_to_pfn(const void *addr);
>
> +static inline struct folio *vmalloc_to_folio(const void *addr)
> +{
> +	return page_folio(vmalloc_to_page(addr));
> +}
> +

Total nit, and quite possibly unnecessary, but it might be nice to have a
comment pointing out that vmalloc_to_page can and will return tail pages hence
the need for page_folio() and thus _compound_head().

This makes me wonder about a typedef or wrapper type which explicitly indicates
that a function is _intentionally_ returning something that might be a head or a
tail page. But perhaps once foliofication is complete, this is what struct page
or its memdesc replacement will become?

>  /*
>   * Determine if an address is within the vmalloc range
>   *
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index c73add132618..56939dc3bf33 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -873,9 +873,9 @@ __perf_mmap_to_page(struct perf_buffer *rb, unsigned long pgoff)
>
>  static void perf_mmap_unmark_page(void *addr)
>  {
> -	struct page *page = vmalloc_to_page(addr);
> +	struct folio *folio = vmalloc_to_folio(addr);
>
> -	page->mapping = NULL;
> +	folio->mapping = NULL;
>  }
>
>  static void rb_free_work(struct work_struct *work)
> --
> 2.40.1
>

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH 4/4] perf: Use folios for the aux ringbuffer & pagefault path
  2023-08-21 20:20 [RFC PATCH 0/4] Convert perf ringbuffer to folios Matthew Wilcox (Oracle)
                   ` (2 preceding siblings ...)
  2023-08-21 20:20 ` [RFC PATCH 3/4] perf: Use vmalloc_to_folio() Matthew Wilcox (Oracle)
@ 2023-08-21 20:20 ` Matthew Wilcox (Oracle)
  2023-08-23  7:38   ` Yin Fengwei
  2023-08-27  8:01   ` Lorenzo Stoakes
  3 siblings, 2 replies; 19+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-21 20:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox (Oracle), linux-perf-users, peterz, mingo, acme,
	urezki, hch, lstoakes

Instead of allocating a non-compound page and splitting it, allocate
a folio and make its refcount the count of the number of pages in it.
That way, when we free each page in the folio, we'll only actually free
it when the last page in the folio is freed.  Keeping the memory intact
is better for the MM system than allocating it and splitting it.

Now, instead of setting each page->mapping, we only set folio->mapping
which is better for our cacheline usage, as well as helping towards the
goal of eliminating page->mapping.  We remove the setting of page->index;
I do not believe this is needed.  And we return with the folio locked,
which the fault handler should have been doing all along.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 kernel/events/core.c        | 13 +++++++---
 kernel/events/ring_buffer.c | 51 ++++++++++++++++---------------------
 2 files changed, 31 insertions(+), 33 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4c72a41f11af..59d4f7c48c8c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -29,6 +29,7 @@
 #include <linux/vmalloc.h>
 #include <linux/hardirq.h>
 #include <linux/hugetlb.h>
+#include <linux/pagemap.h>
 #include <linux/rculist.h>
 #include <linux/uaccess.h>
 #include <linux/syscalls.h>
@@ -6083,6 +6084,7 @@ static vm_fault_t perf_mmap_fault(struct vm_fault *vmf)
 {
 	struct perf_event *event = vmf->vma->vm_file->private_data;
 	struct perf_buffer *rb;
+	struct folio *folio;
 	vm_fault_t ret = VM_FAULT_SIGBUS;
 
 	if (vmf->flags & FAULT_FLAG_MKWRITE) {
@@ -6102,12 +6104,15 @@ static vm_fault_t perf_mmap_fault(struct vm_fault *vmf)
 	vmf->page = perf_mmap_to_page(rb, vmf->pgoff);
 	if (!vmf->page)
 		goto unlock;
+	folio = page_folio(vmf->page);
 
-	get_page(vmf->page);
-	vmf->page->mapping = vmf->vma->vm_file->f_mapping;
-	vmf->page->index   = vmf->pgoff;
+	folio_get(folio);
+	rcu_read_unlock();
+	folio_lock(folio);
+	if (!folio->mapping)
+		folio->mapping = vmf->vma->vm_file->f_mapping;
 
-	ret = 0;
+	return VM_FAULT_LOCKED;
 unlock:
 	rcu_read_unlock();
 
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 56939dc3bf33..0a026e5ff4f5 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -606,39 +606,28 @@ long perf_output_copy_aux(struct perf_output_handle *aux_handle,
 
 #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
 
-static struct page *rb_alloc_aux_page(int node, int order)
+static struct folio *rb_alloc_aux_folio(int node, int order)
 {
-	struct page *page;
+	struct folio *folio;
 
 	if (order > MAX_ORDER)
 		order = MAX_ORDER;
 
 	do {
-		page = alloc_pages_node(node, PERF_AUX_GFP, order);
-	} while (!page && order--);
-
-	if (page && order) {
-		/*
-		 * Communicate the allocation size to the driver:
-		 * if we managed to secure a high-order allocation,
-		 * set its first page's private to this order;
-		 * !PagePrivate(page) means it's just a normal page.
-		 */
-		split_page(page, order);
-		SetPagePrivate(page);
-		set_page_private(page, order);
-	}
+		folio = __folio_alloc_node(PERF_AUX_GFP, order, node);
+	} while (!folio && order--);
 
-	return page;
+	if (order)
+		folio_ref_add(folio, (1 << order) - 1);
+	return folio;
 }
 
 static void rb_free_aux_page(struct perf_buffer *rb, int idx)
 {
-	struct page *page = virt_to_page(rb->aux_pages[idx]);
+	struct folio *folio = virt_to_folio(rb->aux_pages[idx]);
 
-	ClearPagePrivate(page);
-	page->mapping = NULL;
-	__free_page(page);
+	folio->mapping = NULL;
+	folio_put(folio);
 }
 
 static void __rb_free_aux(struct perf_buffer *rb)
@@ -672,7 +661,7 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
 		 pgoff_t pgoff, int nr_pages, long watermark, int flags)
 {
 	bool overwrite = !(flags & RING_BUFFER_WRITABLE);
-	int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
+	int node = (event->cpu == -1) ? numa_mem_id() : cpu_to_node(event->cpu);
 	int ret = -ENOMEM, max_order;
 
 	if (!has_aux(event))
@@ -707,17 +696,21 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
 
 	rb->free_aux = event->pmu->free_aux;
 	for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;) {
-		struct page *page;
-		int last, order;
+		struct folio *folio;
+		unsigned int i, nr, order;
+		void *addr;
 
 		order = min(max_order, ilog2(nr_pages - rb->aux_nr_pages));
-		page = rb_alloc_aux_page(node, order);
-		if (!page)
+		folio = rb_alloc_aux_folio(node, order);
+		if (!folio)
 			goto out;
+		addr = folio_address(folio);
+		nr = folio_nr_pages(folio);
 
-		for (last = rb->aux_nr_pages + (1 << page_private(page));
-		     last > rb->aux_nr_pages; rb->aux_nr_pages++)
-			rb->aux_pages[rb->aux_nr_pages] = page_address(page++);
+		for (i = 0; i < nr; i++) {
+			rb->aux_pages[rb->aux_nr_pages++] = addr;
+			addr += PAGE_SIZE;
+		}
 	}
 
 	/*
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 4/4] perf: Use folios for the aux ringbuffer & pagefault path
  2023-08-21 20:20 ` [RFC PATCH 4/4] perf: Use folios for the aux ringbuffer & pagefault path Matthew Wilcox (Oracle)
@ 2023-08-23  7:38   ` Yin Fengwei
  2023-08-23 12:23     ` Matthew Wilcox
  2023-08-27  8:01   ` Lorenzo Stoakes
  1 sibling, 1 reply; 19+ messages in thread
From: Yin Fengwei @ 2023-08-23  7:38 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), linux-mm
  Cc: linux-perf-users, peterz, mingo, acme, urezki, hch, lstoakes



On 8/22/23 04:20, Matthew Wilcox (Oracle) wrote:
> Instead of allocating a non-compound page and splitting it, allocate
> a folio and make its refcount the count of the number of pages in it.
> That way, when we free each page in the folio, we'll only actually free
> it when the last page in the folio is freed.  Keeping the memory intact
> is better for the MM system than allocating it and splitting it.
> 
> Now, instead of setting each page->mapping, we only set folio->mapping
> which is better for our cacheline usage, as well as helping towards the
> goal of eliminating page->mapping.  We remove the setting of page->index;
> I do not believe this is needed.  And we return with the folio locked,
> which the fault handler should have been doing all along.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  kernel/events/core.c        | 13 +++++++---
>  kernel/events/ring_buffer.c | 51 ++++++++++++++++---------------------
>  2 files changed, 31 insertions(+), 33 deletions(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 4c72a41f11af..59d4f7c48c8c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -29,6 +29,7 @@
>  #include <linux/vmalloc.h>
>  #include <linux/hardirq.h>
>  #include <linux/hugetlb.h>
> +#include <linux/pagemap.h>
>  #include <linux/rculist.h>
>  #include <linux/uaccess.h>
>  #include <linux/syscalls.h>
> @@ -6083,6 +6084,7 @@ static vm_fault_t perf_mmap_fault(struct vm_fault *vmf)
>  {
>  	struct perf_event *event = vmf->vma->vm_file->private_data;
>  	struct perf_buffer *rb;
> +	struct folio *folio;
>  	vm_fault_t ret = VM_FAULT_SIGBUS;
>  
>  	if (vmf->flags & FAULT_FLAG_MKWRITE) {
> @@ -6102,12 +6104,15 @@ static vm_fault_t perf_mmap_fault(struct vm_fault *vmf)
>  	vmf->page = perf_mmap_to_page(rb, vmf->pgoff);
>  	if (!vmf->page)
>  		goto unlock;
> +	folio = page_folio(vmf->page);
>  
> -	get_page(vmf->page);
> -	vmf->page->mapping = vmf->vma->vm_file->f_mapping;
> -	vmf->page->index   = vmf->pgoff;
> +	folio_get(folio);
> +	rcu_read_unlock();
> +	folio_lock(folio);
> +	if (!folio->mapping)
> +		folio->mapping = vmf->vma->vm_file->f_mapping;
>  
> -	ret = 0;
> +	return VM_FAULT_LOCKED;
In __do_fault():

        if (unlikely(!(ret & VM_FAULT_LOCKED)))                                 
                lock_page(vmf->page);                                           
        else                                                                    
                VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page); 

As we lock folio, not sure whether !PageLocked(vmf->page) can be true
here. My understanding is yes if vmf->pgoff belongs to tail pages. Did
I can miss something here?


Regards
Yin, Fengwei

>  unlock:
>  	rcu_read_unlock();
>  
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 56939dc3bf33..0a026e5ff4f5 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -606,39 +606,28 @@ long perf_output_copy_aux(struct perf_output_handle *aux_handle,
>  
>  #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
>  
> -static struct page *rb_alloc_aux_page(int node, int order)
> +static struct folio *rb_alloc_aux_folio(int node, int order)
>  {
> -	struct page *page;
> +	struct folio *folio;
>  
>  	if (order > MAX_ORDER)
>  		order = MAX_ORDER;
>  
>  	do {
> -		page = alloc_pages_node(node, PERF_AUX_GFP, order);
> -	} while (!page && order--);
> -
> -	if (page && order) {
> -		/*
> -		 * Communicate the allocation size to the driver:
> -		 * if we managed to secure a high-order allocation,
> -		 * set its first page's private to this order;
> -		 * !PagePrivate(page) means it's just a normal page.
> -		 */
> -		split_page(page, order);
> -		SetPagePrivate(page);
> -		set_page_private(page, order);
> -	}
> +		folio = __folio_alloc_node(PERF_AUX_GFP, order, node);
> +	} while (!folio && order--);
>  
> -	return page;
> +	if (order)
> +		folio_ref_add(folio, (1 << order) - 1);
> +	return folio;
>  }
>  
>  static void rb_free_aux_page(struct perf_buffer *rb, int idx)
>  {
> -	struct page *page = virt_to_page(rb->aux_pages[idx]);
> +	struct folio *folio = virt_to_folio(rb->aux_pages[idx]);
>  
> -	ClearPagePrivate(page);
> -	page->mapping = NULL;
> -	__free_page(page);
> +	folio->mapping = NULL;
> +	folio_put(folio);
>  }
>  
>  static void __rb_free_aux(struct perf_buffer *rb)
> @@ -672,7 +661,7 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
>  		 pgoff_t pgoff, int nr_pages, long watermark, int flags)
>  {
>  	bool overwrite = !(flags & RING_BUFFER_WRITABLE);
> -	int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
> +	int node = (event->cpu == -1) ? numa_mem_id() : cpu_to_node(event->cpu);
>  	int ret = -ENOMEM, max_order;
>  
>  	if (!has_aux(event))
> @@ -707,17 +696,21 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
>  
>  	rb->free_aux = event->pmu->free_aux;
>  	for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;) {
> -		struct page *page;
> -		int last, order;
> +		struct folio *folio;
> +		unsigned int i, nr, order;
> +		void *addr;
>  
>  		order = min(max_order, ilog2(nr_pages - rb->aux_nr_pages));
> -		page = rb_alloc_aux_page(node, order);
> -		if (!page)
> +		folio = rb_alloc_aux_folio(node, order);
> +		if (!folio)
>  			goto out;
> +		addr = folio_address(folio);
> +		nr = folio_nr_pages(folio);
>  
> -		for (last = rb->aux_nr_pages + (1 << page_private(page));
> -		     last > rb->aux_nr_pages; rb->aux_nr_pages++)
> -			rb->aux_pages[rb->aux_nr_pages] = page_address(page++);
> +		for (i = 0; i < nr; i++) {
> +			rb->aux_pages[rb->aux_nr_pages++] = addr;
> +			addr += PAGE_SIZE;
> +		}
>  	}
>  
>  	/*

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 4/4] perf: Use folios for the aux ringbuffer & pagefault path
  2023-08-23  7:38   ` Yin Fengwei
@ 2023-08-23 12:23     ` Matthew Wilcox
  2023-08-23 12:45       ` Yin, Fengwei
  0 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2023-08-23 12:23 UTC (permalink / raw)
  To: Yin Fengwei
  Cc: linux-mm, linux-perf-users, peterz, mingo, acme, urezki, hch,
	lstoakes

On Wed, Aug 23, 2023 at 03:38:13PM +0800, Yin Fengwei wrote:
> On 8/22/23 04:20, Matthew Wilcox (Oracle) wrote:
> > -	get_page(vmf->page);
> > -	vmf->page->mapping = vmf->vma->vm_file->f_mapping;
> > -	vmf->page->index   = vmf->pgoff;
> > +	folio_get(folio);
> > +	rcu_read_unlock();
> > +	folio_lock(folio);
> > +	if (!folio->mapping)
> > +		folio->mapping = vmf->vma->vm_file->f_mapping;
> >  
> > -	ret = 0;
> > +	return VM_FAULT_LOCKED;
> In __do_fault():
> 
>         if (unlikely(!(ret & VM_FAULT_LOCKED)))                                 
>                 lock_page(vmf->page);                                           
>         else                                                                    
>                 VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page); 
> 
> As we lock folio, not sure whether !PageLocked(vmf->page) can be true
> here. My understanding is yes if vmf->pgoff belongs to tail pages. Did
> I can miss something here?

There's only one lock bit per folio; there's no lock bit for individual
pages.  When we check PageLocked() on a tail page, it redirects to the
head page.

__PAGEFLAG(Locked, locked, PF_NO_TAIL)

#define PF_NO_TAIL(page, enforce) ({                                    \
                VM_BUG_ON_PGFLAGS(enforce && PageTail(page), page);     \
                PF_POISONED_CHECK(compound_head(page)); })

#define TESTPAGEFLAG(uname, lname, policy)                              \
static __always_inline int Page##uname(struct page *page)               \
{ return test_bit(PG_##lname, &policy(page, 0)->flags); }

and that expands out to:

static __always_inline int PageLocked(struct page *page)
{ return test_bit(PG_##locked, &compound_head(page)->flags); }


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 4/4] perf: Use folios for the aux ringbuffer & pagefault path
  2023-08-23 12:23     ` Matthew Wilcox
@ 2023-08-23 12:45       ` Yin, Fengwei
  0 siblings, 0 replies; 19+ messages in thread
From: Yin, Fengwei @ 2023-08-23 12:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, linux-perf-users, peterz, mingo, acme, urezki, hch,
	lstoakes



On 8/23/2023 8:23 PM, Matthew Wilcox wrote:
> On Wed, Aug 23, 2023 at 03:38:13PM +0800, Yin Fengwei wrote:
>> On 8/22/23 04:20, Matthew Wilcox (Oracle) wrote:
>>> -	get_page(vmf->page);
>>> -	vmf->page->mapping = vmf->vma->vm_file->f_mapping;
>>> -	vmf->page->index   = vmf->pgoff;
>>> +	folio_get(folio);
>>> +	rcu_read_unlock();
>>> +	folio_lock(folio);
>>> +	if (!folio->mapping)
>>> +		folio->mapping = vmf->vma->vm_file->f_mapping;
>>>  
>>> -	ret = 0;
>>> +	return VM_FAULT_LOCKED;
>> In __do_fault():
>>
>>         if (unlikely(!(ret & VM_FAULT_LOCKED)))                                 
>>                 lock_page(vmf->page);                                           
>>         else                                                                    
>>                 VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page); 
>>
>> As we lock folio, not sure whether !PageLocked(vmf->page) can be true
>> here. My understanding is yes if vmf->pgoff belongs to tail pages. Did
>> I can miss something here?
> 
> There's only one lock bit per folio; there's no lock bit for individual
> pages.  When we check PageLocked() on a tail page, it redirects to the
> head page.
> 
> __PAGEFLAG(Locked, locked, PF_NO_TAIL)
> 
> #define PF_NO_TAIL(page, enforce) ({                                    \
>                 VM_BUG_ON_PGFLAGS(enforce && PageTail(page), page);     \
>                 PF_POISONED_CHECK(compound_head(page)); })
> 
> #define TESTPAGEFLAG(uname, lname, policy)                              \
> static __always_inline int Page##uname(struct page *page)               \
> { return test_bit(PG_##lname, &policy(page, 0)->flags); }
> 
> and that expands out to:
> 
> static __always_inline int PageLocked(struct page *page)
> { return test_bit(PG_##locked, &compound_head(page)->flags); }
> 
Ah. Here is the trick. Thanks a lot for detail explanation.


Regards
Yin, Fengwei

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 4/4] perf: Use folios for the aux ringbuffer & pagefault path
  2023-08-21 20:20 ` [RFC PATCH 4/4] perf: Use folios for the aux ringbuffer & pagefault path Matthew Wilcox (Oracle)
  2023-08-23  7:38   ` Yin Fengwei
@ 2023-08-27  8:01   ` Lorenzo Stoakes
  2023-08-27 11:50     ` Matthew Wilcox
  1 sibling, 1 reply; 19+ messages in thread
From: Lorenzo Stoakes @ 2023-08-27  8:01 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-mm, linux-perf-users, peterz, mingo, acme, urezki, hch

On Mon, Aug 21, 2023 at 09:20:16PM +0100, Matthew Wilcox (Oracle) wrote:
> Instead of allocating a non-compound page and splitting it, allocate
> a folio and make its refcount the count of the number of pages in it.
> That way, when we free each page in the folio, we'll only actually free
> it when the last page in the folio is freed.  Keeping the memory intact
> is better for the MM system than allocating it and splitting it.
>
> Now, instead of setting each page->mapping, we only set folio->mapping
> which is better for our cacheline usage, as well as helping towards the
> goal of eliminating page->mapping.  We remove the setting of page->index;
> I do not believe this is needed.  And we return with the folio locked,
> which the fault handler should have been doing all along.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  kernel/events/core.c        | 13 +++++++---
>  kernel/events/ring_buffer.c | 51 ++++++++++++++++---------------------
>  2 files changed, 31 insertions(+), 33 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 4c72a41f11af..59d4f7c48c8c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -29,6 +29,7 @@
>  #include <linux/vmalloc.h>
>  #include <linux/hardirq.h>
>  #include <linux/hugetlb.h>
> +#include <linux/pagemap.h>
>  #include <linux/rculist.h>
>  #include <linux/uaccess.h>
>  #include <linux/syscalls.h>
> @@ -6083,6 +6084,7 @@ static vm_fault_t perf_mmap_fault(struct vm_fault *vmf)
>  {
>  	struct perf_event *event = vmf->vma->vm_file->private_data;
>  	struct perf_buffer *rb;
> +	struct folio *folio;
>  	vm_fault_t ret = VM_FAULT_SIGBUS;

Since we're explicitly returning VM_FAULT_LOCKED on success, perhaps worth
simply renaming the unlock label to error and returning VM_FAULT_SIGBUS there?

The FAULT_FLAG_MKWRITE branch can simply return vmf->pgoff == 0 ? 0 :
VM_FAULT_SIGBUS;

>
>  	if (vmf->flags & FAULT_FLAG_MKWRITE) {
> @@ -6102,12 +6104,15 @@ static vm_fault_t perf_mmap_fault(struct vm_fault *vmf)
>  	vmf->page = perf_mmap_to_page(rb, vmf->pgoff);
>  	if (!vmf->page)
>  		goto unlock;
> +	folio = page_folio(vmf->page);
>
> -	get_page(vmf->page);
> -	vmf->page->mapping = vmf->vma->vm_file->f_mapping;
> -	vmf->page->index   = vmf->pgoff;
> +	folio_get(folio);
> +	rcu_read_unlock();
> +	folio_lock(folio);
> +	if (!folio->mapping)
> +		folio->mapping = vmf->vma->vm_file->f_mapping;
>
> -	ret = 0;
> +	return VM_FAULT_LOCKED;
>  unlock:
>  	rcu_read_unlock();
>
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 56939dc3bf33..0a026e5ff4f5 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -606,39 +606,28 @@ long perf_output_copy_aux(struct perf_output_handle *aux_handle,
>
>  #define PERF_AUX_GFP	(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
>
> -static struct page *rb_alloc_aux_page(int node, int order)
> +static struct folio *rb_alloc_aux_folio(int node, int order)
>  {
> -	struct page *page;
> +	struct folio *folio;
>
>  	if (order > MAX_ORDER)
>  		order = MAX_ORDER;
>
>  	do {
> -		page = alloc_pages_node(node, PERF_AUX_GFP, order);
> -	} while (!page && order--);
> -
> -	if (page && order) {
> -		/*
> -		 * Communicate the allocation size to the driver:
> -		 * if we managed to secure a high-order allocation,
> -		 * set its first page's private to this order;
> -		 * !PagePrivate(page) means it's just a normal page.
> -		 */
> -		split_page(page, order);
> -		SetPagePrivate(page);
> -		set_page_private(page, order);

I'm guessing this was used in conjunction with the page_private() logic
that existed below and can simply be discarded now?

> -	}
> +		folio = __folio_alloc_node(PERF_AUX_GFP, order, node);
> +	} while (!folio && order--);
>
> -	return page;
> +	if (order)
> +		folio_ref_add(folio, (1 << order) - 1);

Can't order go to -1 if we continue to fail to allocate a folio?

> +	return folio;
>  }
>
>  static void rb_free_aux_page(struct perf_buffer *rb, int idx)
>  {
> -	struct page *page = virt_to_page(rb->aux_pages[idx]);
> +	struct folio *folio = virt_to_folio(rb->aux_pages[idx]);
>
> -	ClearPagePrivate(page);
> -	page->mapping = NULL;
> -	__free_page(page);
> +	folio->mapping = NULL;
> +	folio_put(folio);
>  }
>
>  static void __rb_free_aux(struct perf_buffer *rb)
> @@ -672,7 +661,7 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
>  		 pgoff_t pgoff, int nr_pages, long watermark, int flags)
>  {
>  	bool overwrite = !(flags & RING_BUFFER_WRITABLE);
> -	int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
> +	int node = (event->cpu == -1) ? numa_mem_id() : cpu_to_node(event->cpu);
>  	int ret = -ENOMEM, max_order;
>
>  	if (!has_aux(event))
> @@ -707,17 +696,21 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
>
>  	rb->free_aux = event->pmu->free_aux;
>  	for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;) {
> -		struct page *page;
> -		int last, order;
> +		struct folio *folio;
> +		unsigned int i, nr, order;
> +		void *addr;
>
>  		order = min(max_order, ilog2(nr_pages - rb->aux_nr_pages));
> -		page = rb_alloc_aux_page(node, order);
> -		if (!page)
> +		folio = rb_alloc_aux_folio(node, order);
> +		if (!folio)
>  			goto out;
> +		addr = folio_address(folio);
> +		nr = folio_nr_pages(folio);

I was going to raise the unspeakably annoying nit about this function returning
a long, but then that made me wonder why, given folio->_folio_nr_pages is an
unsigned int folio_nr_pages() returns a long in the first instance?

>
> -		for (last = rb->aux_nr_pages + (1 << page_private(page));
> -		     last > rb->aux_nr_pages; rb->aux_nr_pages++)
> -			rb->aux_pages[rb->aux_nr_pages] = page_address(page++);
> +		for (i = 0; i < nr; i++) {
> +			rb->aux_pages[rb->aux_nr_pages++] = addr;
> +			addr += PAGE_SIZE;
> +		}
>  	}
>
>  	/*
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 4/4] perf: Use folios for the aux ringbuffer & pagefault path
  2023-08-27  8:01   ` Lorenzo Stoakes
@ 2023-08-27 11:50     ` Matthew Wilcox
  0 siblings, 0 replies; 19+ messages in thread
From: Matthew Wilcox @ 2023-08-27 11:50 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-mm, linux-perf-users, peterz, mingo, acme, urezki, hch

On Sun, Aug 27, 2023 at 09:01:04AM +0100, Lorenzo Stoakes wrote:
> On Mon, Aug 21, 2023 at 09:20:16PM +0100, Matthew Wilcox (Oracle) wrote:
> > -	if (page && order) {
> > -		/*
> > -		 * Communicate the allocation size to the driver:
> > -		 * if we managed to secure a high-order allocation,
> > -		 * set its first page's private to this order;
> > -		 * !PagePrivate(page) means it's just a normal page.
> > -		 */
> > -		split_page(page, order);
> > -		SetPagePrivate(page);
> > -		set_page_private(page, order);
> 
> I'm guessing this was used in conjunction with the page_private() logic
> that existed below and can simply be discarded now?

As far as I can tell, yes.

> > -	}
> > +		folio = __folio_alloc_node(PERF_AUX_GFP, order, node);
> > +	} while (!folio && order--);
> >
> > -	return page;
> > +	if (order)
> > +		folio_ref_add(folio, (1 << order) - 1);
> 
> Can't order go to -1 if we continue to fail to allocate a folio?

Hm, yes.

> > @@ -707,17 +696,21 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
> >
> >  	rb->free_aux = event->pmu->free_aux;
> >  	for (rb->aux_nr_pages = 0; rb->aux_nr_pages < nr_pages;) {
> > -		struct page *page;
> > -		int last, order;
> > +		struct folio *folio;
> > +		unsigned int i, nr, order;
> > +		void *addr;
> >
> >  		order = min(max_order, ilog2(nr_pages - rb->aux_nr_pages));
> > -		page = rb_alloc_aux_page(node, order);
> > -		if (!page)
> > +		folio = rb_alloc_aux_folio(node, order);
> > +		if (!folio)
> >  			goto out;
> > +		addr = folio_address(folio);
> > +		nr = folio_nr_pages(folio);
> 
> I was going to raise the unspeakably annoying nit about this function returning
> a long, but then that made me wonder why, given folio->_folio_nr_pages is an
> unsigned int folio_nr_pages() returns a long in the first instance?

Futureproofing the API.  While I don't expect us to be allocating
order-32 folios any time soon, and x86 has excluded p4d-sizes from the
architecture, I don't want to have to go through and audit everywhere
when it turns out we do want to support such a thing on some architecture.

(on x86 PMD - order 9, PUD - order 18, P4D - order 27, so we'd need a
hypothetical P5D level, or a machine with, say 16k pages, which would
go PMD-11, PUD-22, P4D-33, and be managing a folio of size 2^57 bytes)

But supporting nr_pages stored directly in the otherwise wasted space
in the folio seemed like a cheap win, and there was no reason to take
up the extra 32 bits.

Anyway, here, we know we're not getting back an arbitrary folio that
somebody else allocated, we're allocating for ourselves, and we know we're
allocating something rather smaller than that, so it's fine to calculate
in terms of unsigned int.  If I were writing in rust, I'd use usize,
but since nobody's done the work to make rust types available in C yet,
I haven't done that either.

We actually use usize as a variable name in a couple of hundred places,
so turning it into a generally available type is harder than one might
like.


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-09-22 20:19 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-21 20:20 [RFC PATCH 0/4] Convert perf ringbuffer to folios Matthew Wilcox (Oracle)
2023-08-21 20:20 ` [RFC PATCH 1/4] perf: Convert perf_mmap_(alloc,free)_page " Matthew Wilcox (Oracle)
2023-08-23  7:28   ` Yin Fengwei
2023-08-27  7:33   ` Lorenzo Stoakes
2023-09-13 13:31     ` Peter Zijlstra
2023-09-22 20:19       ` Lorenzo Stoakes
2023-08-21 20:20 ` [RFC PATCH 2/4] mm: Add vmalloc_user_node() Matthew Wilcox (Oracle)
2023-09-13 13:32   ` Peter Zijlstra
2023-08-21 20:20 ` [RFC PATCH 3/4] perf: Use vmalloc_to_folio() Matthew Wilcox (Oracle)
2023-08-22  7:54   ` Zhu Yanjun
2023-08-23  7:30   ` Yin Fengwei
2023-08-23 12:16     ` Matthew Wilcox
2023-08-27  7:22   ` Lorenzo Stoakes
2023-08-21 20:20 ` [RFC PATCH 4/4] perf: Use folios for the aux ringbuffer & pagefault path Matthew Wilcox (Oracle)
2023-08-23  7:38   ` Yin Fengwei
2023-08-23 12:23     ` Matthew Wilcox
2023-08-23 12:45       ` Yin, Fengwei
2023-08-27  8:01   ` Lorenzo Stoakes
2023-08-27 11:50     ` Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).