* [PATCH v2] perf: Allocate non-contiguous AUX pages by default
@ 2025-04-29 21:31 Yabin Cui
From: Yabin Cui @ 2025-04-29 21:31 UTC (permalink / raw)
To: Suzuki K Poulose, Mike Leach, James Clark, Alexander Shishkin,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Mark Rutland, Jiri Olsa, Ian Rogers, Adrian Hunter,
Liang Kan, Thomas Gleixner, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin
Cc: coresight, linux-arm-kernel, linux-kernel, linux-perf-users,
Yabin Cui
perf always allocates contiguous AUX pages based on aux_watermark.
However, this contiguous allocation doesn't benefit all PMUs. For
instance, ARM SPE and TRBE operate with virtual pages, and CoreSight
ETR allocates a separate buffer. For these PMUs, allocating contiguous
AUX pages unnecessarily exacerbates memory fragmentation, which can
prevent their use on long-running devices.

This patch changes the perf core to allocate non-contiguous AUX
pages by default. For PMUs that can benefit from contiguous pages
(Intel PT and BTS), a new PMU capability, PERF_PMU_CAP_AUX_PREFER_LARGE,
is introduced to preserve their existing behavior.
Signed-off-by: Yabin Cui <yabinc@google.com>
---
Changes since v1:
In v1, the default was to prefer contiguous pages, with a flag to
allocate non-contiguous pages. In v2, the default is to allocate
non-contiguous pages, with a flag to prefer contiguous pages.
v1 patchset:
perf,coresight: Reduce fragmentation with non-contiguous AUX pages for
cs_etm
arch/x86/events/intel/bts.c | 3 ++-
arch/x86/events/intel/pt.c | 3 ++-
include/linux/perf_event.h | 1 +
kernel/events/ring_buffer.c | 18 +++++++++++-------
4 files changed, 16 insertions(+), 9 deletions(-)
diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index a95e6c91c4d7..9129f00e4b9f 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -625,7 +625,8 @@ static __init int bts_init(void)
return -ENOMEM;
bts_pmu.capabilities = PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_ITRACE |
- PERF_PMU_CAP_EXCLUSIVE;
+ PERF_PMU_CAP_EXCLUSIVE |
+ PERF_PMU_CAP_AUX_PREFER_LARGE;
bts_pmu.task_ctx_nr = perf_sw_context;
bts_pmu.event_init = bts_event_init;
bts_pmu.add = bts_event_add;
diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index fa37565f6418..37179e813b8c 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1866,7 +1866,8 @@ static __init int pt_init(void)
pt_pmu.pmu.capabilities |= PERF_PMU_CAP_EXCLUSIVE |
PERF_PMU_CAP_ITRACE |
- PERF_PMU_CAP_AUX_PAUSE;
+ PERF_PMU_CAP_AUX_PAUSE |
+ PERF_PMU_CAP_AUX_PREFER_LARGE;
pt_pmu.pmu.attr_groups = pt_attr_groups;
pt_pmu.pmu.task_ctx_nr = perf_sw_context;
pt_pmu.pmu.event_init = pt_event_init;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0069ba6866a4..56d77348c511 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -301,6 +301,7 @@ struct perf_event_pmu_context;
#define PERF_PMU_CAP_AUX_OUTPUT 0x0080
#define PERF_PMU_CAP_EXTENDED_HW_TYPE 0x0100
#define PERF_PMU_CAP_AUX_PAUSE 0x0200
+#define PERF_PMU_CAP_AUX_PREFER_LARGE 0x0400
/**
* pmu::scope
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 5130b119d0ae..d76249ce4f17 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -679,7 +679,7 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
{
bool overwrite = !(flags & RING_BUFFER_WRITABLE);
int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
- int ret = -ENOMEM, max_order;
+ int ret = -ENOMEM, max_order = 0;
if (!has_aux(event))
return -EOPNOTSUPP;
@@ -689,8 +689,8 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
if (!overwrite) {
/*
- * Watermark defaults to half the buffer, and so does the
- * max_order, to aid PMU drivers in double buffering.
+ * Watermark defaults to half the buffer, to aid PMU drivers
+ * in double buffering.
*/
if (!watermark)
watermark = min_t(unsigned long,
@@ -698,16 +698,20 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
(unsigned long)nr_pages << (PAGE_SHIFT - 1));
/*
- * Use aux_watermark as the basis for chunking to
+ * For PMUs that prefer large contiguous buffers,
+ * use aux_watermark as the basis for chunking to
* help PMU drivers honor the watermark.
*/
- max_order = get_order(watermark);
+ if (event->pmu->capabilities & PERF_PMU_CAP_AUX_PREFER_LARGE)
+ max_order = get_order(watermark);
} else {
/*
- * We need to start with the max_order that fits in nr_pages,
+ * For PMUs that prefer large contiguous buffers,
+ * we need to start with the max_order that fits in nr_pages,
* not the other way around, hence ilog2() and not get_order.
*/
- max_order = ilog2(nr_pages);
+ if (event->pmu->capabilities & PERF_PMU_CAP_AUX_PREFER_LARGE)
+ max_order = ilog2(nr_pages);
watermark = 0;
}
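To illustrate the resulting behavior, the following is a rough user-space model of the max_order selection above. It is not kernel code: the capability bit values are placeholders, and get_order()/ilog2() are simplified stand-ins for the kernel helpers. With the patch applied, a PMU without PERF_PMU_CAP_AUX_PREFER_LARGE always gets order-0 (non-contiguous) pages.

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define CAP_AUX_PREFER_LARGE 0x0400  /* placeholder bit value */

/* ilog2(): floor of log2, like the kernel helper. */
static int ilog2_u(unsigned long n)
{
	int r = -1;

	while (n) {
		n >>= 1;
		r++;
	}
	return r;
}

/* get_order(): smallest allocation order covering 'size' bytes. */
static int get_order_u(unsigned long size)
{
	unsigned long pages = (size + (1UL << PAGE_SHIFT) - 1) >> PAGE_SHIFT;
	int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}

/* Model of the max_order logic in rb_alloc_aux() after this patch. */
static int pick_max_order(unsigned long caps, int overwrite,
			  unsigned long watermark, unsigned long nr_pages)
{
	int max_order = 0;	/* new default: order-0 pages */

	if (!overwrite) {
		/* Chunk on the watermark only for opted-in PMUs. */
		if (caps & CAP_AUX_PREFER_LARGE)
			max_order = get_order_u(watermark);
	} else {
		/* Overwrite mode: largest order that fits in nr_pages. */
		if (caps & CAP_AUX_PREFER_LARGE)
			max_order = ilog2_u(nr_pages);
	}
	return max_order;
}
```

For example, with a 16-page buffer and an 8-page (32 KiB) watermark, a plain PMU gets max_order 0, while a PREFER_LARGE PMU gets order 3 in non-overwrite mode and order 4 in overwrite mode.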
--
2.49.0.967.g6a0df3ecc3-goog
* Re: [PATCH v2] perf: Allocate non-contiguous AUX pages by default
@ 2025-05-01 10:47 ` James Clark
From: James Clark @ 2025-05-01 10:47 UTC (permalink / raw)
To: Yabin Cui
Cc: coresight, linux-arm-kernel, linux-kernel, linux-perf-users,
Suzuki K Poulose, Mike Leach, Alexander Shishkin, Peter Zijlstra,
Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Jiri Olsa, Ian Rogers, Adrian Hunter, Liang Kan, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin
On 29/04/2025 10:31 pm, Yabin Cui wrote:
> [...]
> } else {
> /*
> - * We need to start with the max_order that fits in nr_pages,
> + * For PMUs that prefer large contiguous buffers,
> + * we need to start with the max_order that fits in nr_pages,
> * not the other way around, hence ilog2() and not get_order.
> */
> - max_order = ilog2(nr_pages);
> + if (event->pmu->capabilities & PERF_PMU_CAP_AUX_PREFER_LARGE)
> + max_order = ilog2(nr_pages);
Doesn't this one need to be 'PERF_PMU_CAP_AUX_PREFER_LARGE |
PERF_PMU_CAP_AUX_NO_SG'? Otherwise the NO_SG test further down doesn't
work for devices that have only NO_SG and not PREFER_LARGE.

NO_SG implies PREFER_LARGE behavior, except that NO_SG additionally
hard-fails if it can't do the allocation in one chunk. But I don't think
you should have to set both to get the correct behavior.
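The point above can be sketched with a small user-space model of the overwrite-mode branch (capability bit values are placeholders, not the kernel's): with the combined mask, a PMU that sets only NO_SG still gets a single high-order chunk, as its non-scatter-gather hardware requires.

```c
#include <assert.h>

#define CAP_AUX_NO_SG        0x0040  /* placeholder bit values */
#define CAP_AUX_PREFER_LARGE 0x0400

/* ilog2(): floor of log2, like the kernel helper. */
static int ilog2_u(unsigned long n)
{
	int r = -1;

	while (n) {
		n >>= 1;
		r++;
	}
	return r;
}

/*
 * Overwrite-mode max_order with the combined mask suggested above:
 * NO_SG alone must also yield a large contiguous allocation, since
 * such hardware cannot use scattered order-0 pages at all.
 */
static int overwrite_max_order(unsigned long caps, unsigned long nr_pages)
{
	int max_order = 0;

	if (caps & (CAP_AUX_PREFER_LARGE | CAP_AUX_NO_SG))
		max_order = ilog2_u(nr_pages);
	return max_order;
}
```

Checking only PREFER_LARGE here would leave a NO_SG-only PMU at max_order 0, and the later NO_SG single-allocation test could never be satisfied for multi-page buffers.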
* Re: [PATCH v2] perf: Allocate non-contiguous AUX pages by default
@ 2025-05-01 19:40 ` Yabin Cui
From: Yabin Cui @ 2025-05-01 19:40 UTC (permalink / raw)
To: James Clark
Cc: coresight, linux-arm-kernel, linux-kernel, linux-perf-users,
Suzuki K Poulose, Mike Leach, Alexander Shishkin, Peter Zijlstra,
Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Jiri Olsa, Ian Rogers, Adrian Hunter, Liang Kan, Thomas Gleixner,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin
On Thu, May 1, 2025 at 3:47 AM James Clark <james.clark@linaro.org> wrote:
> [...]
>
> Doesn't this one need to be 'PERF_PMU_CAP_AUX_PREFER_LARGE |
> PERF_PMU_CAP_AUX_NO_SG', otherwise the NO_SG test further down doesn't
> work for devices that only have NO_SG and not PREFER_LARGE.
>
> NO_SG implies PREFER_LARGE behavior, except that NO_SG additionally hard
> fails if it can't do it in one alloc. But I think you shouldn't have to
> set them both to get the correct behavior.
>
Thanks for pointing it out! I made the change in the v3 patch.