linux-perf-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
@ 2023-09-16  3:56 Ian Rogers
  0 siblings, 0 replies; 23+ messages in thread
From: Ian Rogers @ 2023-09-16  3:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, Yang Jihong, Stephane Eranian

Dummy events are created with an attribute where the period and freq
are zero. evsel__config will then see the uninitialized values and
initialize them in evsel__default_freq_period. As fequency mode is
used by default the dummy event would be set to use frequency
mode. However, this has no effect on the dummy event but does cause
unnecessary timers/interrupts. Avoid this overhead by setting the
period to 1 for dummy events.

evlist__add_aux_dummy calls evlist__add_dummy then sets freq=0 and
period=1. This isn't necessary after this change and so the setting is
removed.

Fixes: 5bae0250237f ("perf evlist: Introduce perf_evlist__new_dummy constructor")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/util/evlist.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 25c3ebe2c2f5..e36da58522ef 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -251,6 +251,9 @@ static struct evsel *evlist__dummy_event(struct evlist *evlist)
 		.type	= PERF_TYPE_SOFTWARE,
 		.config = PERF_COUNT_SW_DUMMY,
 		.size	= sizeof(attr), /* to capture ABI version */
+		/* Avoid frequency mode for dummy events to avoid associated timers. */
+		.freq = 0,
+		.sample_period = 1,
 	};
 
 	return evsel__new_idx(&attr, evlist->core.nr_entries);
@@ -277,8 +280,6 @@ struct evsel *evlist__add_aux_dummy(struct evlist *evlist, bool system_wide)
 	evsel->core.attr.exclude_kernel = 1;
 	evsel->core.attr.exclude_guest = 1;
 	evsel->core.attr.exclude_hv = 1;
-	evsel->core.attr.freq = 0;
-	evsel->core.attr.sample_period = 1;
 	evsel->core.system_wide = system_wide;
 	evsel->no_aux_samples = true;
 	evsel->name = strdup("dummy:u");
-- 
2.42.0.459.ge4e396fd5e-goog


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
@ 2023-09-16  4:09 Ian Rogers
  2023-09-17  0:45 ` Mingwei Zhang
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Ian Rogers @ 2023-09-16  4:09 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, Yang Jihong, Stephane Eranian

Dummy events are created with an attribute where the period and freq
are zero. evsel__config will then see the uninitialized values and
initialize them in evsel__default_freq_period. As fequency mode is
used by default the dummy event would be set to use frequency
mode. However, this has no effect on the dummy event but does cause
unnecessary timers/interrupts. Avoid this overhead by setting the
period to 1 for dummy events.

evlist__add_aux_dummy calls evlist__add_dummy then sets freq=0 and
period=1. This isn't necessary after this change and so the setting is
removed.

From Stephane:

The dummy event is not counting anything. It is used to collect mmap
records and avoid a race condition during the synthesize mmap phase of
perf record. As such, it should not cause any overhead during active
profiling. Yet, it did. Because of a bug the dummy event was
programmed as a sampling event in frequency mode. Events in that mode
incur more kernel overheads because on timer tick, the kernel has to
look at the number of samples for each event and potentially adjust
the sampling period to achieve the desired frequency. The dummy event
was therefore adding a frequency event to task and ctx contexts we may
otherwise not have any, e.g., perf record -a -e
cpu/event=0x3c,period=10000000/. On each timer tick the
perf_adjust_freq_unthr_context() is invoked and if ctx->nr_freq is
non-zero, then the kernel will loop over ALL the events of the context
looking for frequency mode ones. In doing, so it locks the context,
and enable/disable the PMU of each hw event. If all the events of the
context are in period mode, the kernel will have to traverse the list for
nothing incurring overhead. The overhead is multiplied by a very large
factor when this happens in a guest kernel. There is no need for the
dummy event to be in frequency mode, it does not count anything and
therefore should not cause extra overhead for no reason.

Fixes: 5bae0250237f ("perf evlist: Introduce perf_evlist__new_dummy constructor")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/util/evlist.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 25c3ebe2c2f5..e36da58522ef 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -251,6 +251,9 @@ static struct evsel *evlist__dummy_event(struct evlist *evlist)
 		.type	= PERF_TYPE_SOFTWARE,
 		.config = PERF_COUNT_SW_DUMMY,
 		.size	= sizeof(attr), /* to capture ABI version */
+		/* Avoid frequency mode for dummy events to avoid associated timers. */
+		.freq = 0,
+		.sample_period = 1,
 	};
 
 	return evsel__new_idx(&attr, evlist->core.nr_entries);
@@ -277,8 +280,6 @@ struct evsel *evlist__add_aux_dummy(struct evlist *evlist, bool system_wide)
 	evsel->core.attr.exclude_kernel = 1;
 	evsel->core.attr.exclude_guest = 1;
 	evsel->core.attr.exclude_hv = 1;
-	evsel->core.attr.freq = 0;
-	evsel->core.attr.sample_period = 1;
 	evsel->core.system_wide = system_wide;
 	evsel->no_aux_samples = true;
 	evsel->name = strdup("dummy:u");
-- 
2.42.0.459.ge4e396fd5e-goog


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-09-16  4:09 Ian Rogers
@ 2023-09-17  0:45 ` Mingwei Zhang
  2023-09-18 22:42   ` Ian Rogers
  2023-09-18  8:14 ` Adrian Hunter
  2023-10-30 19:04 ` Mingwei Zhang
  2 siblings, 1 reply; 23+ messages in thread
From: Mingwei Zhang @ 2023-09-17  0:45 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Kan Liang, linux-perf-users, linux-kernel,
	Yang Jihong, Stephane Eranian

On Fri, Sep 15, 2023 at 9:10 PM Ian Rogers <irogers@google.com> wrote:
>
> Dummy events are created with an attribute where the period and freq
> are zero. evsel__config will then see the uninitialized values and
> initialize them in evsel__default_freq_period. As fequency mode is
> used by default the dummy event would be set to use frequency
> mode. However, this has no effect on the dummy event but does cause
> unnecessary timers/interrupts. Avoid this overhead by setting the
> period to 1 for dummy events.
>
> evlist__add_aux_dummy calls evlist__add_dummy then sets freq=0 and
> period=1. This isn't necessary after this change and so the setting is
> removed.
>
> From Stephane:
>
> The dummy event is not counting anything. It is used to collect mmap
> records and avoid a race condition during the synthesize mmap phase of
> perf record. As such, it should not cause any overhead during active
> profiling. Yet, it did. Because of a bug the dummy event was
> programmed as a sampling event in frequency mode. Events in that mode
> incur more kernel overheads because on timer tick, the kernel has to
> look at the number of samples for each event and potentially adjust
> the sampling period to achieve the desired frequency. The dummy event
> was therefore adding a frequency event to task and ctx contexts we may
> otherwise not have any, e.g., perf record -a -e
> cpu/event=0x3c,period=10000000/. On each timer tick the
> perf_adjust_freq_unthr_context() is invoked and if ctx->nr_freq is
> non-zero, then the kernel will loop over ALL the events of the context
> looking for frequency mode ones. In doing, so it locks the context,
> and enable/disable the PMU of each hw event. If all the events of the
> context are in period mode, the kernel will have to traverse the list for
> nothing incurring overhead. The overhead is multiplied by a very large
> factor when this happens in a guest kernel. There is no need for the
> dummy event to be in frequency mode, it does not count anything and
> therefore should not cause extra overhead for no reason.
>
> Fixes: 5bae0250237f ("perf evlist: Introduce perf_evlist__new_dummy constructor")
> Reported-by: Stephane Eranian <eranian@google.com>
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/util/evlist.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index 25c3ebe2c2f5..e36da58522ef 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -251,6 +251,9 @@ static struct evsel *evlist__dummy_event(struct evlist *evlist)
>                 .type   = PERF_TYPE_SOFTWARE,
>                 .config = PERF_COUNT_SW_DUMMY,
>                 .size   = sizeof(attr), /* to capture ABI version */
> +               /* Avoid frequency mode for dummy events to avoid associated timers. */
> +               .freq = 0,
> +               .sample_period = 1,
>         };
>
>         return evsel__new_idx(&attr, evlist->core.nr_entries);
> @@ -277,8 +280,6 @@ struct evsel *evlist__add_aux_dummy(struct evlist *evlist, bool system_wide)
>         evsel->core.attr.exclude_kernel = 1;
>         evsel->core.attr.exclude_guest = 1;
>         evsel->core.attr.exclude_hv = 1;
> -       evsel->core.attr.freq = 0;
> -       evsel->core.attr.sample_period = 1;
>         evsel->core.system_wide = system_wide;
>         evsel->no_aux_samples = true;
>         evsel->name = strdup("dummy:u");
> --
> 2.42.0.459.ge4e396fd5e-goog
>

Thank you very much for the change. I have one quick question about
the PMU unthrottling logic. When I am looking into the function
perf_adjust_freq_unthr_context(), I see the loop with PMU stop and
start in each iteration. Is there a good way to avoid this PMU reset
operation while quickly figuring out the event in frequency mode?

Thanks.
-Mingwei

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-09-16  4:09 Ian Rogers
  2023-09-17  0:45 ` Mingwei Zhang
@ 2023-09-18  8:14 ` Adrian Hunter
  2023-09-18 21:48   ` Ian Rogers
  2023-10-30 19:04 ` Mingwei Zhang
  2 siblings, 1 reply; 23+ messages in thread
From: Adrian Hunter @ 2023-09-18  8:14 UTC (permalink / raw)
  To: Ian Rogers, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Kan Liang, linux-perf-users, linux-kernel, Yang Jihong,
	Stephane Eranian

On 16/09/23 07:09, Ian Rogers wrote:
> Dummy events are created with an attribute where the period and freq
> are zero. evsel__config will then see the uninitialized values and
> initialize them in evsel__default_freq_period. As fequency mode is
> used by default the dummy event would be set to use frequency
> mode. However, this has no effect on the dummy event but does cause
> unnecessary timers/interrupts. Avoid this overhead by setting the
> period to 1 for dummy events.
> 
> evlist__add_aux_dummy calls evlist__add_dummy then sets freq=0 and
> period=1. This isn't necessary after this change and so the setting is
> removed.
> 
> From Stephane:
> 
> The dummy event is not counting anything. It is used to collect mmap
> records and avoid a race condition during the synthesize mmap phase of
> perf record. As such, it should not cause any overhead during active
> profiling. Yet, it did. Because of a bug the dummy event was
> programmed as a sampling event in frequency mode. Events in that mode
> incur more kernel overheads because on timer tick, the kernel has to
> look at the number of samples for each event and potentially adjust
> the sampling period to achieve the desired frequency. The dummy event
> was therefore adding a frequency event to task and ctx contexts we may
> otherwise not have any, e.g., perf record -a -e
> cpu/event=0x3c,period=10000000/. On each timer tick the
> perf_adjust_freq_unthr_context() is invoked and if ctx->nr_freq is
> non-zero, then the kernel will loop over ALL the events of the context
> looking for frequency mode ones. In doing, so it locks the context,
> and enable/disable the PMU of each hw event. If all the events of the
> context are in period mode, the kernel will have to traverse the list for
> nothing incurring overhead. The overhead is multiplied by a very large
> factor when this happens in a guest kernel. There is no need for the
> dummy event to be in frequency mode, it does not count anything and
> therefore should not cause extra overhead for no reason.
> 
> Fixes: 5bae0250237f ("perf evlist: Introduce perf_evlist__new_dummy constructor")
> Reported-by: Stephane Eranian <eranian@google.com>
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/util/evlist.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index 25c3ebe2c2f5..e36da58522ef 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -251,6 +251,9 @@ static struct evsel *evlist__dummy_event(struct evlist *evlist)
>  		.type	= PERF_TYPE_SOFTWARE,
>  		.config = PERF_COUNT_SW_DUMMY,
>  		.size	= sizeof(attr), /* to capture ABI version */
> +		/* Avoid frequency mode for dummy events to avoid associated timers. */
> +		.freq = 0,
> +		.sample_period = 1,
>  	};
>  
>  	return evsel__new_idx(&attr, evlist->core.nr_entries);
> @@ -277,8 +280,6 @@ struct evsel *evlist__add_aux_dummy(struct evlist *evlist, bool system_wide)
>  	evsel->core.attr.exclude_kernel = 1;
>  	evsel->core.attr.exclude_guest = 1;
>  	evsel->core.attr.exclude_hv = 1;
> -	evsel->core.attr.freq = 0;
> -	evsel->core.attr.sample_period = 1;
>  	evsel->core.system_wide = system_wide;
>  	evsel->no_aux_samples = true;
>  	evsel->name = strdup("dummy:u");

Note that evsel__config() will put it back to freq if -F is used.

Nevertheless:

Acked-by: Adrian Hunter <adrian.hunter@intel.com>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-09-18  8:14 ` Adrian Hunter
@ 2023-09-18 21:48   ` Ian Rogers
  2023-09-19  5:59     ` Adrian Hunter
  0 siblings, 1 reply; 23+ messages in thread
From: Ian Rogers @ 2023-09-18 21:48 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Kan Liang, linux-perf-users, linux-kernel, Yang Jihong,
	Stephane Eranian

On Mon, Sep 18, 2023 at 1:14 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 16/09/23 07:09, Ian Rogers wrote:
> > Dummy events are created with an attribute where the period and freq
> > are zero. evsel__config will then see the uninitialized values and
> > initialize them in evsel__default_freq_period. As fequency mode is
> > used by default the dummy event would be set to use frequency
> > mode. However, this has no effect on the dummy event but does cause
> > unnecessary timers/interrupts. Avoid this overhead by setting the
> > period to 1 for dummy events.
> >
> > evlist__add_aux_dummy calls evlist__add_dummy then sets freq=0 and
> > period=1. This isn't necessary after this change and so the setting is
> > removed.
> >
> > From Stephane:
> >
> > The dummy event is not counting anything. It is used to collect mmap
> > records and avoid a race condition during the synthesize mmap phase of
> > perf record. As such, it should not cause any overhead during active
> > profiling. Yet, it did. Because of a bug the dummy event was
> > programmed as a sampling event in frequency mode. Events in that mode
> > incur more kernel overheads because on timer tick, the kernel has to
> > look at the number of samples for each event and potentially adjust
> > the sampling period to achieve the desired frequency. The dummy event
> > was therefore adding a frequency event to task and ctx contexts we may
> > otherwise not have any, e.g., perf record -a -e
> > cpu/event=0x3c,period=10000000/. On each timer tick the
> > perf_adjust_freq_unthr_context() is invoked and if ctx->nr_freq is
> > non-zero, then the kernel will loop over ALL the events of the context
> > looking for frequency mode ones. In doing, so it locks the context,
> > and enable/disable the PMU of each hw event. If all the events of the
> > context are in period mode, the kernel will have to traverse the list for
> > nothing incurring overhead. The overhead is multiplied by a very large
> > factor when this happens in a guest kernel. There is no need for the
> > dummy event to be in frequency mode, it does not count anything and
> > therefore should not cause extra overhead for no reason.
> >
> > Fixes: 5bae0250237f ("perf evlist: Introduce perf_evlist__new_dummy constructor")
> > Reported-by: Stephane Eranian <eranian@google.com>
> > Signed-off-by: Ian Rogers <irogers@google.com>
> > ---
> >  tools/perf/util/evlist.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> > index 25c3ebe2c2f5..e36da58522ef 100644
> > --- a/tools/perf/util/evlist.c
> > +++ b/tools/perf/util/evlist.c
> > @@ -251,6 +251,9 @@ static struct evsel *evlist__dummy_event(struct evlist *evlist)
> >               .type   = PERF_TYPE_SOFTWARE,
> >               .config = PERF_COUNT_SW_DUMMY,
> >               .size   = sizeof(attr), /* to capture ABI version */
> > +             /* Avoid frequency mode for dummy events to avoid associated timers. */
> > +             .freq = 0,
> > +             .sample_period = 1,
> >       };
> >
> >       return evsel__new_idx(&attr, evlist->core.nr_entries);
> > @@ -277,8 +280,6 @@ struct evsel *evlist__add_aux_dummy(struct evlist *evlist, bool system_wide)
> >       evsel->core.attr.exclude_kernel = 1;
> >       evsel->core.attr.exclude_guest = 1;
> >       evsel->core.attr.exclude_hv = 1;
> > -     evsel->core.attr.freq = 0;
> > -     evsel->core.attr.sample_period = 1;
> >       evsel->core.system_wide = system_wide;
> >       evsel->no_aux_samples = true;
> >       evsel->name = strdup("dummy:u");
>
> Note that evsel__config() will put it back to freq if -F is used.

Right, I was looking for a minimal fix in part for the sake of back
porting. For the -F we could do:

```
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index d5363d23f5d3..806185a39e17 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1083,11 +1083,15 @@ void __weak arch__post_evsel_config(struct
evsel *evsel __maybe_unused,
static void evsel__set_default_freq_period(struct record_opts *opts,
                                          struct perf_event_attr *attr)
{
-       if (opts->freq) {
+       bool is_dummy = attr->type == PERF_TYPE_SOFTWARE &&
+               attr->config == PERF_COUNT_SW_DUMMY;
+
+       if (opts->freq && !is_dummy) {
               attr->freq = 1;
               attr->sample_freq = opts->freq;
       } else {
-               attr->sample_period = opts->default_interval;
+               attr->freq = 0;
+               attr->sample_period = is_dummy ? 1 : opts->default_interval;
       }
}
```

But this felt like it could potentially have other side-effects.
Software events like page faults have a period, but the
dummy/side-band events appear never to check the period - but I could
be wrong. We can add the patch above but perhaps without a fixes tag.

Thanks,
Ian

> Nevertheless:
>
> Acked-by: Adrian Hunter <adrian.hunter@intel.com>
>

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-09-17  0:45 ` Mingwei Zhang
@ 2023-09-18 22:42   ` Ian Rogers
  2023-09-21  5:04     ` Mingwei Zhang
  0 siblings, 1 reply; 23+ messages in thread
From: Ian Rogers @ 2023-09-18 22:42 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Kan Liang, linux-perf-users, linux-kernel,
	Yang Jihong, Stephane Eranian

On Sat, Sep 16, 2023 at 5:46 PM Mingwei Zhang <mizhang@google.com> wrote:
>
> On Fri, Sep 15, 2023 at 9:10 PM Ian Rogers <irogers@google.com> wrote:
> >
> > Dummy events are created with an attribute where the period and freq
> > are zero. evsel__config will then see the uninitialized values and
> > initialize them in evsel__default_freq_period. As fequency mode is
> > used by default the dummy event would be set to use frequency
> > mode. However, this has no effect on the dummy event but does cause
> > unnecessary timers/interrupts. Avoid this overhead by setting the
> > period to 1 for dummy events.
> >
> > evlist__add_aux_dummy calls evlist__add_dummy then sets freq=0 and
> > period=1. This isn't necessary after this change and so the setting is
> > removed.
> >
> > From Stephane:
> >
> > The dummy event is not counting anything. It is used to collect mmap
> > records and avoid a race condition during the synthesize mmap phase of
> > perf record. As such, it should not cause any overhead during active
> > profiling. Yet, it did. Because of a bug the dummy event was
> > programmed as a sampling event in frequency mode. Events in that mode
> > incur more kernel overheads because on timer tick, the kernel has to
> > look at the number of samples for each event and potentially adjust
> > the sampling period to achieve the desired frequency. The dummy event
> > was therefore adding a frequency event to task and ctx contexts we may
> > otherwise not have any, e.g., perf record -a -e
> > cpu/event=0x3c,period=10000000/. On each timer tick the
> > perf_adjust_freq_unthr_context() is invoked and if ctx->nr_freq is
> > non-zero, then the kernel will loop over ALL the events of the context
> > looking for frequency mode ones. In doing, so it locks the context,
> > and enable/disable the PMU of each hw event. If all the events of the
> > context are in period mode, the kernel will have to traverse the list for
> > nothing incurring overhead. The overhead is multiplied by a very large
> > factor when this happens in a guest kernel. There is no need for the
> > dummy event to be in frequency mode, it does not count anything and
> > therefore should not cause extra overhead for no reason.
> >
> > Fixes: 5bae0250237f ("perf evlist: Introduce perf_evlist__new_dummy constructor")
> > Reported-by: Stephane Eranian <eranian@google.com>
> > Signed-off-by: Ian Rogers <irogers@google.com>
> > ---
> >  tools/perf/util/evlist.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> > index 25c3ebe2c2f5..e36da58522ef 100644
> > --- a/tools/perf/util/evlist.c
> > +++ b/tools/perf/util/evlist.c
> > @@ -251,6 +251,9 @@ static struct evsel *evlist__dummy_event(struct evlist *evlist)
> >                 .type   = PERF_TYPE_SOFTWARE,
> >                 .config = PERF_COUNT_SW_DUMMY,
> >                 .size   = sizeof(attr), /* to capture ABI version */
> > +               /* Avoid frequency mode for dummy events to avoid associated timers. */
> > +               .freq = 0,
> > +               .sample_period = 1,
> >         };
> >
> >         return evsel__new_idx(&attr, evlist->core.nr_entries);
> > @@ -277,8 +280,6 @@ struct evsel *evlist__add_aux_dummy(struct evlist *evlist, bool system_wide)
> >         evsel->core.attr.exclude_kernel = 1;
> >         evsel->core.attr.exclude_guest = 1;
> >         evsel->core.attr.exclude_hv = 1;
> > -       evsel->core.attr.freq = 0;
> > -       evsel->core.attr.sample_period = 1;
> >         evsel->core.system_wide = system_wide;
> >         evsel->no_aux_samples = true;
> >         evsel->name = strdup("dummy:u");
> > --
> > 2.42.0.459.ge4e396fd5e-goog
> >
>
> Thank you very much for the change. I have one quick question about
> the PMU unthrottling logic. When I am looking into the function
> perf_adjust_freq_unthr_context(), I see the loop with PMU stop and
> start in each iteration. Is there a good way to avoid this PMU reset
> operation while quickly figuring out the event in frequency mode?

Agreed. I think before the pmu_disable could be avoided for this condition:
```
if (event->hw.interrupts != MAX_INTERRUPTS &&
    (!event->attr.freq || !event->attr.sample_freq))
        continue;
```
Fixing up the event stop/start looks harder.

Thanks,
Ian

> Thanks.
> -Mingwei

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-09-18 21:48   ` Ian Rogers
@ 2023-09-19  5:59     ` Adrian Hunter
  2023-09-21 19:26       ` Namhyung Kim
  0 siblings, 1 reply; 23+ messages in thread
From: Adrian Hunter @ 2023-09-19  5:59 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Kan Liang, linux-perf-users, linux-kernel, Yang Jihong,
	Stephane Eranian

On 19/09/23 00:48, Ian Rogers wrote:
> On Mon, Sep 18, 2023 at 1:14 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>
>> On 16/09/23 07:09, Ian Rogers wrote:
>>> Dummy events are created with an attribute where the period and freq
>>> are zero. evsel__config will then see the uninitialized values and
>>> initialize them in evsel__default_freq_period. As fequency mode is
>>> used by default the dummy event would be set to use frequency
>>> mode. However, this has no effect on the dummy event but does cause
>>> unnecessary timers/interrupts. Avoid this overhead by setting the
>>> period to 1 for dummy events.
>>>
>>> evlist__add_aux_dummy calls evlist__add_dummy then sets freq=0 and
>>> period=1. This isn't necessary after this change and so the setting is
>>> removed.
>>>
>>> From Stephane:
>>>
>>> The dummy event is not counting anything. It is used to collect mmap
>>> records and avoid a race condition during the synthesize mmap phase of
>>> perf record. As such, it should not cause any overhead during active
>>> profiling. Yet, it did. Because of a bug the dummy event was
>>> programmed as a sampling event in frequency mode. Events in that mode
>>> incur more kernel overheads because on timer tick, the kernel has to
>>> look at the number of samples for each event and potentially adjust
>>> the sampling period to achieve the desired frequency. The dummy event
>>> was therefore adding a frequency event to task and ctx contexts we may
>>> otherwise not have any, e.g., perf record -a -e
>>> cpu/event=0x3c,period=10000000/. On each timer tick the
>>> perf_adjust_freq_unthr_context() is invoked and if ctx->nr_freq is
>>> non-zero, then the kernel will loop over ALL the events of the context
>>> looking for frequency mode ones. In doing, so it locks the context,
>>> and enable/disable the PMU of each hw event. If all the events of the
>>> context are in period mode, the kernel will have to traverse the list for
>>> nothing incurring overhead. The overhead is multiplied by a very large
>>> factor when this happens in a guest kernel. There is no need for the
>>> dummy event to be in frequency mode, it does not count anything and
>>> therefore should not cause extra overhead for no reason.
>>>
>>> Fixes: 5bae0250237f ("perf evlist: Introduce perf_evlist__new_dummy constructor")
>>> Reported-by: Stephane Eranian <eranian@google.com>
>>> Signed-off-by: Ian Rogers <irogers@google.com>
>>> ---
>>>  tools/perf/util/evlist.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
>>> index 25c3ebe2c2f5..e36da58522ef 100644
>>> --- a/tools/perf/util/evlist.c
>>> +++ b/tools/perf/util/evlist.c
>>> @@ -251,6 +251,9 @@ static struct evsel *evlist__dummy_event(struct evlist *evlist)
>>>               .type   = PERF_TYPE_SOFTWARE,
>>>               .config = PERF_COUNT_SW_DUMMY,
>>>               .size   = sizeof(attr), /* to capture ABI version */
>>> +             /* Avoid frequency mode for dummy events to avoid associated timers. */
>>> +             .freq = 0,
>>> +             .sample_period = 1,
>>>       };
>>>
>>>       return evsel__new_idx(&attr, evlist->core.nr_entries);
>>> @@ -277,8 +280,6 @@ struct evsel *evlist__add_aux_dummy(struct evlist *evlist, bool system_wide)
>>>       evsel->core.attr.exclude_kernel = 1;
>>>       evsel->core.attr.exclude_guest = 1;
>>>       evsel->core.attr.exclude_hv = 1;
>>> -     evsel->core.attr.freq = 0;
>>> -     evsel->core.attr.sample_period = 1;
>>>       evsel->core.system_wide = system_wide;
>>>       evsel->no_aux_samples = true;
>>>       evsel->name = strdup("dummy:u");
>>
>> Note that evsel__config() will put it back to freq if -F is used.
> 
> Right, I was looking for a minimal fix in part for the sake of back
> porting. For the -F we could do:
> 
> ```
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index d5363d23f5d3..806185a39e17 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1083,11 +1083,15 @@ void __weak arch__post_evsel_config(struct
> evsel *evsel __maybe_unused,
> static void evsel__set_default_freq_period(struct record_opts *opts,
>                                           struct perf_event_attr *attr)
> {
> -       if (opts->freq) {
> +       bool is_dummy = attr->type == PERF_TYPE_SOFTWARE &&
> +               attr->config == PERF_COUNT_SW_DUMMY;
> +
> +       if (opts->freq && !is_dummy) {
>                attr->freq = 1;
>                attr->sample_freq = opts->freq;
>        } else {
> -               attr->sample_period = opts->default_interval;
> +               attr->freq = 0;
> +               attr->sample_period = is_dummy ? 1 : opts->default_interval;
>        }
> }
> ```
> 
> But this felt like it could potentially have other side-effects.

Perhaps leave it alone, if the period has already been defined:

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index d5363d23f5d3..ad3e12f5ec88 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1166,7 +1166,8 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
 	if ((evsel->is_libpfm_event && !attr->sample_period) ||
 	    (!evsel->is_libpfm_event && (!attr->sample_period ||
 					 opts->user_freq != UINT_MAX ||
-					 opts->user_interval != ULLONG_MAX)))
+					 opts->user_interval != ULLONG_MAX) &&
+	     !(is_dummy && attr->sample_period)))
 		evsel__set_default_freq_period(opts, attr);
 
 	/*



> Software events like page faults have a period, but the
> dummy/side-band events appear never to check the period - but I could
> be wrong. We can add the patch above but perhaps without a fixes tag.
> 
> Thanks,
> Ian
> 
>> Nevertheless:
>>
>> Acked-by: Adrian Hunter <adrian.hunter@intel.com>
>>


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-09-18 22:42   ` Ian Rogers
@ 2023-09-21  5:04     ` Mingwei Zhang
  2023-10-03 20:07       ` Namhyung Kim
  0 siblings, 1 reply; 23+ messages in thread
From: Mingwei Zhang @ 2023-09-21  5:04 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Kan Liang, linux-perf-users, linux-kernel,
	Yang Jihong, Stephane Eranian

On Mon, Sep 18, 2023 at 3:43 PM Ian Rogers <irogers@google.com> wrote:
>
> On Sat, Sep 16, 2023 at 5:46 PM Mingwei Zhang <mizhang@google.com> wrote:
> >
> > On Fri, Sep 15, 2023 at 9:10 PM Ian Rogers <irogers@google.com> wrote:
> > >
> > > Dummy events are created with an attribute where the period and freq
> > > are zero. evsel__config will then see the uninitialized values and
> > > initialize them in evsel__default_freq_period. As fequency mode is
> > > used by default the dummy event would be set to use frequency
> > > mode. However, this has no effect on the dummy event but does cause
> > > unnecessary timers/interrupts. Avoid this overhead by setting the
> > > period to 1 for dummy events.
> > >
> > > evlist__add_aux_dummy calls evlist__add_dummy then sets freq=0 and
> > > period=1. This isn't necessary after this change and so the setting is
> > > removed.
> > >
> > > From Stephane:
> > >
> > > The dummy event is not counting anything. It is used to collect mmap
> > > records and avoid a race condition during the synthesize mmap phase of
> > > perf record. As such, it should not cause any overhead during active
> > > profiling. Yet, it did. Because of a bug the dummy event was
> > > programmed as a sampling event in frequency mode. Events in that mode
> > > incur more kernel overheads because on timer tick, the kernel has to
> > > look at the number of samples for each event and potentially adjust
> > > the sampling period to achieve the desired frequency. The dummy event
> > > was therefore adding a frequency event to task and ctx contexts we may
> > > otherwise not have any, e.g., perf record -a -e
> > > cpu/event=0x3c,period=10000000/. On each timer tick the
> > > perf_adjust_freq_unthr_context() is invoked and if ctx->nr_freq is
> > > non-zero, then the kernel will loop over ALL the events of the context
> > > looking for frequency mode ones. In doing, so it locks the context,
> > > and enable/disable the PMU of each hw event. If all the events of the
> > > context are in period mode, the kernel will have to traverse the list for
> > > nothing incurring overhead. The overhead is multiplied by a very large
> > > factor when this happens in a guest kernel. There is no need for the
> > > dummy event to be in frequency mode, it does not count anything and
> > > therefore should not cause extra overhead for no reason.
> > >
> > > Fixes: 5bae0250237f ("perf evlist: Introduce perf_evlist__new_dummy constructor")
> > > Reported-by: Stephane Eranian <eranian@google.com>
> > > Signed-off-by: Ian Rogers <irogers@google.com>
> > > ---
> > >  tools/perf/util/evlist.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> > > index 25c3ebe2c2f5..e36da58522ef 100644
> > > --- a/tools/perf/util/evlist.c
> > > +++ b/tools/perf/util/evlist.c
> > > @@ -251,6 +251,9 @@ static struct evsel *evlist__dummy_event(struct evlist *evlist)
> > >                 .type   = PERF_TYPE_SOFTWARE,
> > >                 .config = PERF_COUNT_SW_DUMMY,
> > >                 .size   = sizeof(attr), /* to capture ABI version */
> > > +               /* Avoid frequency mode for dummy events to avoid associated timers. */
> > > +               .freq = 0,
> > > +               .sample_period = 1,
> > >         };
> > >
> > >         return evsel__new_idx(&attr, evlist->core.nr_entries);
> > > @@ -277,8 +280,6 @@ struct evsel *evlist__add_aux_dummy(struct evlist *evlist, bool system_wide)
> > >         evsel->core.attr.exclude_kernel = 1;
> > >         evsel->core.attr.exclude_guest = 1;
> > >         evsel->core.attr.exclude_hv = 1;
> > > -       evsel->core.attr.freq = 0;
> > > -       evsel->core.attr.sample_period = 1;
> > >         evsel->core.system_wide = system_wide;
> > >         evsel->no_aux_samples = true;
> > >         evsel->name = strdup("dummy:u");
> > > --
> > > 2.42.0.459.ge4e396fd5e-goog
> > >
> >
> > Thank you very much for the change. I have one quick question about
> > the PMU unthrottling logic. When I am looking into the function
> > perf_adjust_freq_unthr_context(), I see the loop with PMU stop and
> > start in each iteration. Is there a good way to avoid this PMU reset
> > operation while quickly figuring out the event in frequency mode?
>
> Agreed. I think before the pmu_disable could be avoided for this condition:
> ```
> if (event->hw.interrupts != MAX_INTERRUPTS &&
>     (!event->attr.freq || !event->attr.sample_freq))
>         continue;
> ```
> Fixing up the event stop/start looks harder.
>

Right, I think putting the check early before pmu_disable() is already
a great optimization. The only concern I initially had was whether
event->hw.interrupts can be accessed before we disable the pmu. But
after checking this field in other locations, I don't see any problem
at all.

Thanks.
-Mingwei

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-09-19  5:59     ` Adrian Hunter
@ 2023-09-21 19:26       ` Namhyung Kim
  2023-09-22  5:36         ` Adrian Hunter
  0 siblings, 1 reply; 23+ messages in thread
From: Namhyung Kim @ 2023-09-21 19:26 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Ian Rogers, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Kan Liang,
	linux-perf-users, linux-kernel, Yang Jihong, Stephane Eranian

Hi,

On Mon, Sep 18, 2023 at 11:00 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 19/09/23 00:48, Ian Rogers wrote:
> > On Mon, Sep 18, 2023 at 1:14 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>
> >> On 16/09/23 07:09, Ian Rogers wrote:
> >>> Dummy events are created with an attribute where the period and freq
> >>> are zero. evsel__config will then see the uninitialized values and
> >>> initialize them in evsel__default_freq_period. As fequency mode is
> >>> used by default the dummy event would be set to use frequency
> >>> mode. However, this has no effect on the dummy event but does cause
> >>> unnecessary timers/interrupts. Avoid this overhead by setting the
> >>> period to 1 for dummy events.
> >>>
> >>> evlist__add_aux_dummy calls evlist__add_dummy then sets freq=0 and
> >>> period=1. This isn't necessary after this change and so the setting is
> >>> removed.
> >>>
> >>> From Stephane:
> >>>
> >>> The dummy event is not counting anything. It is used to collect mmap
> >>> records and avoid a race condition during the synthesize mmap phase of
> >>> perf record. As such, it should not cause any overhead during active
> >>> profiling. Yet, it did. Because of a bug the dummy event was
> >>> programmed as a sampling event in frequency mode. Events in that mode
> >>> incur more kernel overheads because on timer tick, the kernel has to
> >>> look at the number of samples for each event and potentially adjust
> >>> the sampling period to achieve the desired frequency. The dummy event
> >>> was therefore adding a frequency event to task and ctx contexts we may
> >>> otherwise not have any, e.g., perf record -a -e
> >>> cpu/event=0x3c,period=10000000/. On each timer tick the
> >>> perf_adjust_freq_unthr_context() is invoked and if ctx->nr_freq is
> >>> non-zero, then the kernel will loop over ALL the events of the context
> >>> looking for frequency mode ones. In doing, so it locks the context,
> >>> and enable/disable the PMU of each hw event. If all the events of the
> >>> context are in period mode, the kernel will have to traverse the list for
> >>> nothing incurring overhead. The overhead is multiplied by a very large
> >>> factor when this happens in a guest kernel. There is no need for the
> >>> dummy event to be in frequency mode, it does not count anything and
> >>> therefore should not cause extra overhead for no reason.
> >>>
> >>> Fixes: 5bae0250237f ("perf evlist: Introduce perf_evlist__new_dummy constructor")
> >>> Reported-by: Stephane Eranian <eranian@google.com>
> >>> Signed-off-by: Ian Rogers <irogers@google.com>

I'll take the original patch first.


> >>> ---
> >>>  tools/perf/util/evlist.c | 5 +++--
> >>>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> >>> index 25c3ebe2c2f5..e36da58522ef 100644
> >>> --- a/tools/perf/util/evlist.c
> >>> +++ b/tools/perf/util/evlist.c
> >>> @@ -251,6 +251,9 @@ static struct evsel *evlist__dummy_event(struct evlist *evlist)
> >>>               .type   = PERF_TYPE_SOFTWARE,
> >>>               .config = PERF_COUNT_SW_DUMMY,
> >>>               .size   = sizeof(attr), /* to capture ABI version */
> >>> +             /* Avoid frequency mode for dummy events to avoid associated timers. */
> >>> +             .freq = 0,
> >>> +             .sample_period = 1,
> >>>       };
> >>>
> >>>       return evsel__new_idx(&attr, evlist->core.nr_entries);
> >>> @@ -277,8 +280,6 @@ struct evsel *evlist__add_aux_dummy(struct evlist *evlist, bool system_wide)
> >>>       evsel->core.attr.exclude_kernel = 1;
> >>>       evsel->core.attr.exclude_guest = 1;
> >>>       evsel->core.attr.exclude_hv = 1;
> >>> -     evsel->core.attr.freq = 0;
> >>> -     evsel->core.attr.sample_period = 1;
> >>>       evsel->core.system_wide = system_wide;
> >>>       evsel->no_aux_samples = true;
> >>>       evsel->name = strdup("dummy:u");
> >>
> >> Note that evsel__config() will put it back to freq if -F is used.
> >
> > Right, I was looking for a minimal fix in part for the sake of back
> > porting. For the -F we could do:
> >
> > ```
> > diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> > index d5363d23f5d3..806185a39e17 100644
> > --- a/tools/perf/util/evsel.c
> > +++ b/tools/perf/util/evsel.c
> > @@ -1083,11 +1083,15 @@ void __weak arch__post_evsel_config(struct
> > evsel *evsel __maybe_unused,
> > static void evsel__set_default_freq_period(struct record_opts *opts,
> >                                           struct perf_event_attr *attr)
> > {
> > -       if (opts->freq) {
> > +       bool is_dummy = attr->type == PERF_TYPE_SOFTWARE &&
> > +               attr->config == PERF_COUNT_SW_DUMMY;
> > +
> > +       if (opts->freq && !is_dummy) {
> >                attr->freq = 1;
> >                attr->sample_freq = opts->freq;
> >        } else {
> > -               attr->sample_period = opts->default_interval;
> > +               attr->freq = 0;
> > +               attr->sample_period = is_dummy ? 1 : opts->default_interval;
> >        }
> > }
> > ```
> >
> > But this felt like it could potentially have other side-effects.
>
> Perhaps leave it alone, if the period has already been defined:
>
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index d5363d23f5d3..ad3e12f5ec88 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1166,7 +1166,8 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>         if ((evsel->is_libpfm_event && !attr->sample_period) ||
>             (!evsel->is_libpfm_event && (!attr->sample_period ||
>                                          opts->user_freq != UINT_MAX ||
> -                                        opts->user_interval != ULLONG_MAX)))
> +                                        opts->user_interval != ULLONG_MAX) &&
> +            !(is_dummy && attr->sample_period)))
>                 evsel__set_default_freq_period(opts, attr);
>
>         /*

Or simply like this?


diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index d5363d23f5d3..6ce832ce62f1 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1169,6 +1169,9 @@ void evsel__config(struct evsel *evsel, struct
record_opts *opts,
                                         opts->user_interval != ULLONG_MAX)))
                evsel__set_default_freq_period(opts, attr);

+       if (evsel__is_dummy_event(evsel))
+               attr->freq = 0;
+
        /*
         * If attr->freq was set (here or earlier), ask for period
         * to be sampled.

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-09-21 19:26       ` Namhyung Kim
@ 2023-09-22  5:36         ` Adrian Hunter
  2023-09-22 15:05           ` Ian Rogers
  0 siblings, 1 reply; 23+ messages in thread
From: Adrian Hunter @ 2023-09-22  5:36 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ian Rogers, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Kan Liang,
	linux-perf-users, linux-kernel, Yang Jihong, Stephane Eranian

On 21/09/23 22:26, Namhyung Kim wrote:
> Hi,
> 
> On Mon, Sep 18, 2023 at 11:00 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>
>> On 19/09/23 00:48, Ian Rogers wrote:
>>> On Mon, Sep 18, 2023 at 1:14 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>>>
>>>> On 16/09/23 07:09, Ian Rogers wrote:
>>>>> Dummy events are created with an attribute where the period and freq
>>>>> are zero. evsel__config will then see the uninitialized values and
>>>>> initialize them in evsel__default_freq_period. As fequency mode is
>>>>> used by default the dummy event would be set to use frequency
>>>>> mode. However, this has no effect on the dummy event but does cause
>>>>> unnecessary timers/interrupts. Avoid this overhead by setting the
>>>>> period to 1 for dummy events.
>>>>>
>>>>> evlist__add_aux_dummy calls evlist__add_dummy then sets freq=0 and
>>>>> period=1. This isn't necessary after this change and so the setting is
>>>>> removed.
>>>>>
>>>>> From Stephane:
>>>>>
>>>>> The dummy event is not counting anything. It is used to collect mmap
>>>>> records and avoid a race condition during the synthesize mmap phase of
>>>>> perf record. As such, it should not cause any overhead during active
>>>>> profiling. Yet, it did. Because of a bug the dummy event was
>>>>> programmed as a sampling event in frequency mode. Events in that mode
>>>>> incur more kernel overheads because on timer tick, the kernel has to
>>>>> look at the number of samples for each event and potentially adjust
>>>>> the sampling period to achieve the desired frequency. The dummy event
>>>>> was therefore adding a frequency event to task and ctx contexts we may
>>>>> otherwise not have any, e.g., perf record -a -e
>>>>> cpu/event=0x3c,period=10000000/. On each timer tick the
>>>>> perf_adjust_freq_unthr_context() is invoked and if ctx->nr_freq is
>>>>> non-zero, then the kernel will loop over ALL the events of the context
>>>>> looking for frequency mode ones. In doing, so it locks the context,
>>>>> and enable/disable the PMU of each hw event. If all the events of the
>>>>> context are in period mode, the kernel will have to traverse the list for
>>>>> nothing incurring overhead. The overhead is multiplied by a very large
>>>>> factor when this happens in a guest kernel. There is no need for the
>>>>> dummy event to be in frequency mode, it does not count anything and
>>>>> therefore should not cause extra overhead for no reason.
>>>>>
>>>>> Fixes: 5bae0250237f ("perf evlist: Introduce perf_evlist__new_dummy constructor")
>>>>> Reported-by: Stephane Eranian <eranian@google.com>
>>>>> Signed-off-by: Ian Rogers <irogers@google.com>
> 
> I'll take the original patch first.
> 
> 
>>>>> ---
>>>>>  tools/perf/util/evlist.c | 5 +++--
>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
>>>>> index 25c3ebe2c2f5..e36da58522ef 100644
>>>>> --- a/tools/perf/util/evlist.c
>>>>> +++ b/tools/perf/util/evlist.c
>>>>> @@ -251,6 +251,9 @@ static struct evsel *evlist__dummy_event(struct evlist *evlist)
>>>>>               .type   = PERF_TYPE_SOFTWARE,
>>>>>               .config = PERF_COUNT_SW_DUMMY,
>>>>>               .size   = sizeof(attr), /* to capture ABI version */
>>>>> +             /* Avoid frequency mode for dummy events to avoid associated timers. */
>>>>> +             .freq = 0,
>>>>> +             .sample_period = 1,
>>>>>       };
>>>>>
>>>>>       return evsel__new_idx(&attr, evlist->core.nr_entries);
>>>>> @@ -277,8 +280,6 @@ struct evsel *evlist__add_aux_dummy(struct evlist *evlist, bool system_wide)
>>>>>       evsel->core.attr.exclude_kernel = 1;
>>>>>       evsel->core.attr.exclude_guest = 1;
>>>>>       evsel->core.attr.exclude_hv = 1;
>>>>> -     evsel->core.attr.freq = 0;
>>>>> -     evsel->core.attr.sample_period = 1;
>>>>>       evsel->core.system_wide = system_wide;
>>>>>       evsel->no_aux_samples = true;
>>>>>       evsel->name = strdup("dummy:u");
>>>>
>>>> Note that evsel__config() will put it back to freq if -F is used.
>>>
>>> Right, I was looking for a minimal fix in part for the sake of back
>>> porting. For the -F we could do:
>>>
>>> ```
>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>> index d5363d23f5d3..806185a39e17 100644
>>> --- a/tools/perf/util/evsel.c
>>> +++ b/tools/perf/util/evsel.c
>>> @@ -1083,11 +1083,15 @@ void __weak arch__post_evsel_config(struct
>>> evsel *evsel __maybe_unused,
>>> static void evsel__set_default_freq_period(struct record_opts *opts,
>>>                                           struct perf_event_attr *attr)
>>> {
>>> -       if (opts->freq) {
>>> +       bool is_dummy = attr->type == PERF_TYPE_SOFTWARE &&
>>> +               attr->config == PERF_COUNT_SW_DUMMY;
>>> +
>>> +       if (opts->freq && !is_dummy) {
>>>                attr->freq = 1;
>>>                attr->sample_freq = opts->freq;
>>>        } else {
>>> -               attr->sample_period = opts->default_interval;
>>> +               attr->freq = 0;
>>> +               attr->sample_period = is_dummy ? 1 : opts->default_interval;
>>>        }
>>> }
>>> ```
>>>
>>> But this felt like it could potentially have other side-effects.
>>
>> Perhaps leave it alone, if the period has already been defined:
>>
>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>> index d5363d23f5d3..ad3e12f5ec88 100644
>> --- a/tools/perf/util/evsel.c
>> +++ b/tools/perf/util/evsel.c
>> @@ -1166,7 +1166,8 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>>         if ((evsel->is_libpfm_event && !attr->sample_period) ||
>>             (!evsel->is_libpfm_event && (!attr->sample_period ||
>>                                          opts->user_freq != UINT_MAX ||
>> -                                        opts->user_interval != ULLONG_MAX)))
>> +                                        opts->user_interval != ULLONG_MAX) &&
>> +            !(is_dummy && attr->sample_period)))
>>                 evsel__set_default_freq_period(opts, attr);
>>
>>         /*
> 
> Or simply like this?
> 
> 
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index d5363d23f5d3..6ce832ce62f1 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1169,6 +1169,9 @@ void evsel__config(struct evsel *evsel, struct
> record_opts *opts,
>                                          opts->user_interval != ULLONG_MAX)))
>                 evsel__set_default_freq_period(opts, attr);
> 
> +       if (evsel__is_dummy_event(evsel))
> +               attr->freq = 0;
> +
>         /*
>          * If attr->freq was set (here or earlier), ask for period
>          * to be sampled.

I thought there might be corner cases where it made a difference,
but I can't find any, so that should do.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-09-22  5:36         ` Adrian Hunter
@ 2023-09-22 15:05           ` Ian Rogers
  2023-09-25  3:35             ` Yang Jihong
  0 siblings, 1 reply; 23+ messages in thread
From: Ian Rogers @ 2023-09-22 15:05 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Namhyung Kim, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Kan Liang, linux-perf-users, linux-kernel, Yang Jihong,
	Stephane Eranian

On Thu, Sep 21, 2023 at 10:36 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 21/09/23 22:26, Namhyung Kim wrote:
> > Hi,
> >
> > On Mon, Sep 18, 2023 at 11:00 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>
> >> On 19/09/23 00:48, Ian Rogers wrote:
> >>> On Mon, Sep 18, 2023 at 1:14 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>>>
> >>>> On 16/09/23 07:09, Ian Rogers wrote:
> >>>>> Dummy events are created with an attribute where the period and freq
> >>>>> are zero. evsel__config will then see the uninitialized values and
> >>>>> initialize them in evsel__default_freq_period. As fequency mode is
> >>>>> used by default the dummy event would be set to use frequency
> >>>>> mode. However, this has no effect on the dummy event but does cause
> >>>>> unnecessary timers/interrupts. Avoid this overhead by setting the
> >>>>> period to 1 for dummy events.
> >>>>>
> >>>>> evlist__add_aux_dummy calls evlist__add_dummy then sets freq=0 and
> >>>>> period=1. This isn't necessary after this change and so the setting is
> >>>>> removed.
> >>>>>
> >>>>> From Stephane:
> >>>>>
> >>>>> The dummy event is not counting anything. It is used to collect mmap
> >>>>> records and avoid a race condition during the synthesize mmap phase of
> >>>>> perf record. As such, it should not cause any overhead during active
> >>>>> profiling. Yet, it did. Because of a bug the dummy event was
> >>>>> programmed as a sampling event in frequency mode. Events in that mode
> >>>>> incur more kernel overheads because on timer tick, the kernel has to
> >>>>> look at the number of samples for each event and potentially adjust
> >>>>> the sampling period to achieve the desired frequency. The dummy event
> >>>>> was therefore adding a frequency event to task and ctx contexts we may
> >>>>> otherwise not have any, e.g., perf record -a -e
> >>>>> cpu/event=0x3c,period=10000000/. On each timer tick the
> >>>>> perf_adjust_freq_unthr_context() is invoked and if ctx->nr_freq is
> >>>>> non-zero, then the kernel will loop over ALL the events of the context
> >>>>> looking for frequency mode ones. In doing, so it locks the context,
> >>>>> and enable/disable the PMU of each hw event. If all the events of the
> >>>>> context are in period mode, the kernel will have to traverse the list for
> >>>>> nothing incurring overhead. The overhead is multiplied by a very large
> >>>>> factor when this happens in a guest kernel. There is no need for the
> >>>>> dummy event to be in frequency mode, it does not count anything and
> >>>>> therefore should not cause extra overhead for no reason.
> >>>>>
> >>>>> Fixes: 5bae0250237f ("perf evlist: Introduce perf_evlist__new_dummy constructor")
> >>>>> Reported-by: Stephane Eranian <eranian@google.com>
> >>>>> Signed-off-by: Ian Rogers <irogers@google.com>
> >
> > I'll take the original patch first.
> >
> >
> >>>>> ---
> >>>>>  tools/perf/util/evlist.c | 5 +++--
> >>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> >>>>> index 25c3ebe2c2f5..e36da58522ef 100644
> >>>>> --- a/tools/perf/util/evlist.c
> >>>>> +++ b/tools/perf/util/evlist.c
> >>>>> @@ -251,6 +251,9 @@ static struct evsel *evlist__dummy_event(struct evlist *evlist)
> >>>>>               .type   = PERF_TYPE_SOFTWARE,
> >>>>>               .config = PERF_COUNT_SW_DUMMY,
> >>>>>               .size   = sizeof(attr), /* to capture ABI version */
> >>>>> +             /* Avoid frequency mode for dummy events to avoid associated timers. */
> >>>>> +             .freq = 0,
> >>>>> +             .sample_period = 1,
> >>>>>       };
> >>>>>
> >>>>>       return evsel__new_idx(&attr, evlist->core.nr_entries);
> >>>>> @@ -277,8 +280,6 @@ struct evsel *evlist__add_aux_dummy(struct evlist *evlist, bool system_wide)
> >>>>>       evsel->core.attr.exclude_kernel = 1;
> >>>>>       evsel->core.attr.exclude_guest = 1;
> >>>>>       evsel->core.attr.exclude_hv = 1;
> >>>>> -     evsel->core.attr.freq = 0;
> >>>>> -     evsel->core.attr.sample_period = 1;
> >>>>>       evsel->core.system_wide = system_wide;
> >>>>>       evsel->no_aux_samples = true;
> >>>>>       evsel->name = strdup("dummy:u");
> >>>>
> >>>> Note that evsel__config() will put it back to freq if -F is used.
> >>>
> >>> Right, I was looking for a minimal fix in part for the sake of back
> >>> porting. For the -F we could do:
> >>>
> >>> ```
> >>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >>> index d5363d23f5d3..806185a39e17 100644
> >>> --- a/tools/perf/util/evsel.c
> >>> +++ b/tools/perf/util/evsel.c
> >>> @@ -1083,11 +1083,15 @@ void __weak arch__post_evsel_config(struct
> >>> evsel *evsel __maybe_unused,
> >>> static void evsel__set_default_freq_period(struct record_opts *opts,
> >>>                                           struct perf_event_attr *attr)
> >>> {
> >>> -       if (opts->freq) {
> >>> +       bool is_dummy = attr->type == PERF_TYPE_SOFTWARE &&
> >>> +               attr->config == PERF_COUNT_SW_DUMMY;
> >>> +
> >>> +       if (opts->freq && !is_dummy) {
> >>>                attr->freq = 1;
> >>>                attr->sample_freq = opts->freq;
> >>>        } else {
> >>> -               attr->sample_period = opts->default_interval;
> >>> +               attr->freq = 0;
> >>> +               attr->sample_period = is_dummy ? 1 : opts->default_interval;
> >>>        }
> >>> }
> >>> ```
> >>>
> >>> But this felt like it could potentially have other side-effects.
> >>
> >> Perhaps leave it alone, if the period has already been defined:
> >>
> >> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >> index d5363d23f5d3..ad3e12f5ec88 100644
> >> --- a/tools/perf/util/evsel.c
> >> +++ b/tools/perf/util/evsel.c
> >> @@ -1166,7 +1166,8 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
> >>         if ((evsel->is_libpfm_event && !attr->sample_period) ||
> >>             (!evsel->is_libpfm_event && (!attr->sample_period ||
> >>                                          opts->user_freq != UINT_MAX ||
> >> -                                        opts->user_interval != ULLONG_MAX)))
> >> +                                        opts->user_interval != ULLONG_MAX) &&
> >> +            !(is_dummy && attr->sample_period)))
> >>                 evsel__set_default_freq_period(opts, attr);
> >>
> >>         /*
> >
> > Or simply like this?
> >
> >
> > diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> > index d5363d23f5d3..6ce832ce62f1 100644
> > --- a/tools/perf/util/evsel.c
> > +++ b/tools/perf/util/evsel.c
> > @@ -1169,6 +1169,9 @@ void evsel__config(struct evsel *evsel, struct
> > record_opts *opts,
> >                                          opts->user_interval != ULLONG_MAX)))
> >                 evsel__set_default_freq_period(opts, attr);
> >
> > +       if (evsel__is_dummy_event(evsel))
> > +               attr->freq = 0;
> > +
> >         /*
> >          * If attr->freq was set (here or earlier), ask for period
> >          * to be sampled.
>
> I thought there might be corner cases where it made a difference,
> but I can't find any, so that should do.
>

It seemed more intention revealing to do it at creation/initialization
than on a later not obviously executed code path - I'm thinking of
future me trying to understand the code. My priority is the clearing
of the flag, so I'm easy.

Thanks,
Ian

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-09-22 15:05           ` Ian Rogers
@ 2023-09-25  3:35             ` Yang Jihong
  2023-09-25 17:37               ` Stephane Eranian
  0 siblings, 1 reply; 23+ messages in thread
From: Yang Jihong @ 2023-09-25  3:35 UTC (permalink / raw)
  To: Ian Rogers, Adrian Hunter
  Cc: Namhyung Kim, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Kan Liang, linux-perf-users, linux-kernel,
	Stephane Eranian

Hello,

On 2023/9/22 23:05, Ian Rogers wrote:
> On Thu, Sep 21, 2023 at 10:36 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>
>> On 21/09/23 22:26, Namhyung Kim wrote:
>>> Hi,
>>>
>>> On Mon, Sep 18, 2023 at 11:00 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>>>
>>>> On 19/09/23 00:48, Ian Rogers wrote:
>>>>> On Mon, Sep 18, 2023 at 1:14 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>>>>>>
>>>>>> On 16/09/23 07:09, Ian Rogers wrote:
>>>>>>> Dummy events are created with an attribute where the period and freq
>>>>>>> are zero. evsel__config will then see the uninitialized values and
>>>>>>> initialize them in evsel__default_freq_period. As fequency mode is
>>>>>>> used by default the dummy event would be set to use frequency
>>>>>>> mode. However, this has no effect on the dummy event but does cause
>>>>>>> unnecessary timers/interrupts. Avoid this overhead by setting the
>>>>>>> period to 1 for dummy events.
>>>>>>>
>>>>>>> evlist__add_aux_dummy calls evlist__add_dummy then sets freq=0 and
>>>>>>> period=1. This isn't necessary after this change and so the setting is
>>>>>>> removed.
>>>>>>>
>>>>>>>  From Stephane:
>>>>>>>
>>>>>>> The dummy event is not counting anything. It is used to collect mmap
>>>>>>> records and avoid a race condition during the synthesize mmap phase of
>>>>>>> perf record. As such, it should not cause any overhead during active
>>>>>>> profiling. Yet, it did. Because of a bug the dummy event was
>>>>>>> programmed as a sampling event in frequency mode. Events in that mode
>>>>>>> incur more kernel overheads because on timer tick, the kernel has to
>>>>>>> look at the number of samples for each event and potentially adjust
>>>>>>> the sampling period to achieve the desired frequency. The dummy event
>>>>>>> was therefore adding a frequency event to task and ctx contexts we may
>>>>>>> otherwise not have any, e.g., perf record -a -e
>>>>>>> cpu/event=0x3c,period=10000000/. On each timer tick the
>>>>>>> perf_adjust_freq_unthr_context() is invoked and if ctx->nr_freq is
>>>>>>> non-zero, then the kernel will loop over ALL the events of the context
>>>>>>> looking for frequency mode ones. In doing, so it locks the context,
>>>>>>> and enable/disable the PMU of each hw event. If all the events of the
>>>>>>> context are in period mode, the kernel will have to traverse the list for
>>>>>>> nothing incurring overhead. The overhead is multiplied by a very large
>>>>>>> factor when this happens in a guest kernel. There is no need for the
>>>>>>> dummy event to be in frequency mode, it does not count anything and
>>>>>>> therefore should not cause extra overhead for no reason.
>>>>>>>
>>>>>>> Fixes: 5bae0250237f ("perf evlist: Introduce perf_evlist__new_dummy constructor")
>>>>>>> Reported-by: Stephane Eranian <eranian@google.com>
>>>>>>> Signed-off-by: Ian Rogers <irogers@google.com>
>>>
>>> I'll take the original patch first.
>>>
>>>
>>>>>>> ---
>>>>>>>   tools/perf/util/evlist.c | 5 +++--
>>>>>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
>>>>>>> index 25c3ebe2c2f5..e36da58522ef 100644
>>>>>>> --- a/tools/perf/util/evlist.c
>>>>>>> +++ b/tools/perf/util/evlist.c
>>>>>>> @@ -251,6 +251,9 @@ static struct evsel *evlist__dummy_event(struct evlist *evlist)
>>>>>>>                .type   = PERF_TYPE_SOFTWARE,
>>>>>>>                .config = PERF_COUNT_SW_DUMMY,
>>>>>>>                .size   = sizeof(attr), /* to capture ABI version */
>>>>>>> +             /* Avoid frequency mode for dummy events to avoid associated timers. */
>>>>>>> +             .freq = 0,
>>>>>>> +             .sample_period = 1,
>>>>>>>        };
>>>>>>>
>>>>>>>        return evsel__new_idx(&attr, evlist->core.nr_entries);
>>>>>>> @@ -277,8 +280,6 @@ struct evsel *evlist__add_aux_dummy(struct evlist *evlist, bool system_wide)
>>>>>>>        evsel->core.attr.exclude_kernel = 1;
>>>>>>>        evsel->core.attr.exclude_guest = 1;
>>>>>>>        evsel->core.attr.exclude_hv = 1;
>>>>>>> -     evsel->core.attr.freq = 0;
>>>>>>> -     evsel->core.attr.sample_period = 1;
>>>>>>>        evsel->core.system_wide = system_wide;
>>>>>>>        evsel->no_aux_samples = true;
>>>>>>>        evsel->name = strdup("dummy:u");
>>>>>>
>>>>>> Note that evsel__config() will put it back to freq if -F is used.
>>>>>
>>>>> Right, I was looking for a minimal fix in part for the sake of back
>>>>> porting. For the -F we could do:
>>>>>
>>>>> ```
>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>>> index d5363d23f5d3..806185a39e17 100644
>>>>> --- a/tools/perf/util/evsel.c
>>>>> +++ b/tools/perf/util/evsel.c
>>>>> @@ -1083,11 +1083,15 @@ void __weak arch__post_evsel_config(struct
>>>>> evsel *evsel __maybe_unused,
>>>>> static void evsel__set_default_freq_period(struct record_opts *opts,
>>>>>                                            struct perf_event_attr *attr)
>>>>> {
>>>>> -       if (opts->freq) {
>>>>> +       bool is_dummy = attr->type == PERF_TYPE_SOFTWARE &&
>>>>> +               attr->config == PERF_COUNT_SW_DUMMY;
>>>>> +
>>>>> +       if (opts->freq && !is_dummy) {
>>>>>                 attr->freq = 1;
>>>>>                 attr->sample_freq = opts->freq;
>>>>>         } else {
>>>>> -               attr->sample_period = opts->default_interval;
>>>>> +               attr->freq = 0;
>>>>> +               attr->sample_period = is_dummy ? 1 : opts->default_interval;
>>>>>         }
>>>>> }
>>>>> ```
>>>>>
>>>>> But this felt like it could potentially have other side-effects.
>>>>
>>>> Perhaps leave it alone, if the period has already been defined:
>>>>
>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>> index d5363d23f5d3..ad3e12f5ec88 100644
>>>> --- a/tools/perf/util/evsel.c
>>>> +++ b/tools/perf/util/evsel.c
>>>> @@ -1166,7 +1166,8 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>>>>          if ((evsel->is_libpfm_event && !attr->sample_period) ||
>>>>              (!evsel->is_libpfm_event && (!attr->sample_period ||
>>>>                                           opts->user_freq != UINT_MAX ||
>>>> -                                        opts->user_interval != ULLONG_MAX)))
>>>> +                                        opts->user_interval != ULLONG_MAX) &&
>>>> +            !(is_dummy && attr->sample_period)))
>>>>                  evsel__set_default_freq_period(opts, attr);
>>>>
>>>>          /*
>>>
>>> Or simply like this?
>>>
>>>
>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>> index d5363d23f5d3..6ce832ce62f1 100644
>>> --- a/tools/perf/util/evsel.c
>>> +++ b/tools/perf/util/evsel.c
>>> @@ -1169,6 +1169,9 @@ void evsel__config(struct evsel *evsel, struct
>>> record_opts *opts,
>>>                                           opts->user_interval != ULLONG_MAX)))
>>>                  evsel__set_default_freq_period(opts, attr);
>>>
>>> +       if (evsel__is_dummy_event(evsel))
>>> +               attr->freq = 0;
>>> +
>>>          /*
>>>           * If attr->freq was set (here or earlier), ask for period
>>>           * to be sampled.
>>
>> I thought there might be corner cases where it made a difference,
>> but I can't find any, so that should do.
>>
> 
> It seemed more intention revealing to do it at creation/initialization
> than on a later not obviously executed code path - I'm thinking of
> future me trying to understand the code. My priority is the clearing
> of the flag, so I'm easy.
> 
evsel__apply_config_terms() also sets freq. For example:

# perf record -vv -e dummy/freq=100/ true
<SNIP>
------------------------------------------------------------
perf_event_attr:
   type                             1 (PERF_TYPE_SOFTWARE)
   size                             136
   config                           0x9 (PERF_COUNT_SW_DUMMY)
   { sample_period, sample_freq }   100
   sample_type                      IP|TID|TIME|PERIOD
   read_format                      ID|LOST
   disabled                         1
   inherit                          1
   mmap                             1
   comm                             1
   freq                             1
   enable_on_exec                   1
   task                             1
   sample_id_all                    1
   exclude_guest                    1
   mmap2                            1
   comm_exec                        1
   ksymbol                          1
   bpf_event                        1
------------------------------------------------------------
<SNIP>

Therefore, do we need to perform special processing on dummy events in 
evsel__apply_config_terms?

Thanks,
Yang

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-09-25  3:35             ` Yang Jihong
@ 2023-09-25 17:37               ` Stephane Eranian
  2023-09-30  6:06                 ` Namhyung Kim
  0 siblings, 1 reply; 23+ messages in thread
From: Stephane Eranian @ 2023-09-25 17:37 UTC (permalink / raw)
  To: Yang Jihong
  Cc: Ian Rogers, Adrian Hunter, Namhyung Kim, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Kan Liang, linux-perf-users,
	linux-kernel

On Sun, Sep 24, 2023 at 8:35 PM Yang Jihong <yangjihong1@huawei.com> wrote:
>
> Hello,
>
> On 2023/9/22 23:05, Ian Rogers wrote:
> > On Thu, Sep 21, 2023 at 10:36 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>
> >> On 21/09/23 22:26, Namhyung Kim wrote:
> >>> Hi,
> >>>
> >>> On Mon, Sep 18, 2023 at 11:00 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>>>
> >>>> On 19/09/23 00:48, Ian Rogers wrote:
> >>>>> On Mon, Sep 18, 2023 at 1:14 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>>>>>
> >>>>>> On 16/09/23 07:09, Ian Rogers wrote:
> >>>>>>> Dummy events are created with an attribute where the period and freq
> >>>>>>> are zero. evsel__config will then see the uninitialized values and
> >>>>>>> initialize them in evsel__default_freq_period. As fequency mode is
> >>>>>>> used by default the dummy event would be set to use frequency
> >>>>>>> mode. However, this has no effect on the dummy event but does cause
> >>>>>>> unnecessary timers/interrupts. Avoid this overhead by setting the
> >>>>>>> period to 1 for dummy events.
> >>>>>>>
> >>>>>>> evlist__add_aux_dummy calls evlist__add_dummy then sets freq=0 and
> >>>>>>> period=1. This isn't necessary after this change and so the setting is
> >>>>>>> removed.
> >>>>>>>
> >>>>>>>  From Stephane:
> >>>>>>>
> >>>>>>> The dummy event is not counting anything. It is used to collect mmap
> >>>>>>> records and avoid a race condition during the synthesize mmap phase of
> >>>>>>> perf record. As such, it should not cause any overhead during active
> >>>>>>> profiling. Yet, it did. Because of a bug the dummy event was
> >>>>>>> programmed as a sampling event in frequency mode. Events in that mode
> >>>>>>> incur more kernel overheads because on timer tick, the kernel has to
> >>>>>>> look at the number of samples for each event and potentially adjust
> >>>>>>> the sampling period to achieve the desired frequency. The dummy event
> >>>>>>> was therefore adding a frequency event to task and ctx contexts we may
> >>>>>>> otherwise not have any, e.g., perf record -a -e
> >>>>>>> cpu/event=0x3c,period=10000000/. On each timer tick the
> >>>>>>> perf_adjust_freq_unthr_context() is invoked and if ctx->nr_freq is
> >>>>>>> non-zero, then the kernel will loop over ALL the events of the context
> >>>>>>> looking for frequency mode ones. In doing, so it locks the context,
> >>>>>>> and enable/disable the PMU of each hw event. If all the events of the
> >>>>>>> context are in period mode, the kernel will have to traverse the list for
> >>>>>>> nothing incurring overhead. The overhead is multiplied by a very large
> >>>>>>> factor when this happens in a guest kernel. There is no need for the
> >>>>>>> dummy event to be in frequency mode, it does not count anything and
> >>>>>>> therefore should not cause extra overhead for no reason.
> >>>>>>>
> >>>>>>> Fixes: 5bae0250237f ("perf evlist: Introduce perf_evlist__new_dummy constructor")
> >>>>>>> Reported-by: Stephane Eranian <eranian@google.com>
> >>>>>>> Signed-off-by: Ian Rogers <irogers@google.com>
> >>>
> >>> I'll take the original patch first.
> >>>
> >>>
> >>>>>>> ---
> >>>>>>>   tools/perf/util/evlist.c | 5 +++--
> >>>>>>>   1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> >>>>>>> index 25c3ebe2c2f5..e36da58522ef 100644
> >>>>>>> --- a/tools/perf/util/evlist.c
> >>>>>>> +++ b/tools/perf/util/evlist.c
> >>>>>>> @@ -251,6 +251,9 @@ static struct evsel *evlist__dummy_event(struct evlist *evlist)
> >>>>>>>                .type   = PERF_TYPE_SOFTWARE,
> >>>>>>>                .config = PERF_COUNT_SW_DUMMY,
> >>>>>>>                .size   = sizeof(attr), /* to capture ABI version */
> >>>>>>> +             /* Avoid frequency mode for dummy events to avoid associated timers. */
> >>>>>>> +             .freq = 0,
> >>>>>>> +             .sample_period = 1,
> >>>>>>>        };
> >>>>>>>
> >>>>>>>        return evsel__new_idx(&attr, evlist->core.nr_entries);
> >>>>>>> @@ -277,8 +280,6 @@ struct evsel *evlist__add_aux_dummy(struct evlist *evlist, bool system_wide)
> >>>>>>>        evsel->core.attr.exclude_kernel = 1;
> >>>>>>>        evsel->core.attr.exclude_guest = 1;
> >>>>>>>        evsel->core.attr.exclude_hv = 1;
> >>>>>>> -     evsel->core.attr.freq = 0;
> >>>>>>> -     evsel->core.attr.sample_period = 1;
> >>>>>>>        evsel->core.system_wide = system_wide;
> >>>>>>>        evsel->no_aux_samples = true;
> >>>>>>>        evsel->name = strdup("dummy:u");
> >>>>>>
> >>>>>> Note that evsel__config() will put it back to freq if -F is used.
> >>>>>
> >>>>> Right, I was looking for a minimal fix in part for the sake of back
> >>>>> porting. For the -F we could do:
> >>>>>
> >>>>> ```
> >>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >>>>> index d5363d23f5d3..806185a39e17 100644
> >>>>> --- a/tools/perf/util/evsel.c
> >>>>> +++ b/tools/perf/util/evsel.c
> >>>>> @@ -1083,11 +1083,15 @@ void __weak arch__post_evsel_config(struct
> >>>>> evsel *evsel __maybe_unused,
> >>>>> static void evsel__set_default_freq_period(struct record_opts *opts,
> >>>>>                                            struct perf_event_attr *attr)
> >>>>> {
> >>>>> -       if (opts->freq) {
> >>>>> +       bool is_dummy = attr->type == PERF_TYPE_SOFTWARE &&
> >>>>> +               attr->config == PERF_COUNT_SW_DUMMY;
> >>>>> +
> >>>>> +       if (opts->freq && !is_dummy) {
> >>>>>                 attr->freq = 1;
> >>>>>                 attr->sample_freq = opts->freq;
> >>>>>         } else {
> >>>>> -               attr->sample_period = opts->default_interval;
> >>>>> +               attr->freq = 0;
> >>>>> +               attr->sample_period = is_dummy ? 1 : opts->default_interval;
> >>>>>         }
> >>>>> }
> >>>>> ```
> >>>>>
> >>>>> But this felt like it could potentially have other side-effects.
> >>>>
> >>>> Perhaps leave it alone, if the period has already been defined:
> >>>>
> >>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >>>> index d5363d23f5d3..ad3e12f5ec88 100644
> >>>> --- a/tools/perf/util/evsel.c
> >>>> +++ b/tools/perf/util/evsel.c
> >>>> @@ -1166,7 +1166,8 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
> >>>>          if ((evsel->is_libpfm_event && !attr->sample_period) ||
> >>>>              (!evsel->is_libpfm_event && (!attr->sample_period ||
> >>>>                                           opts->user_freq != UINT_MAX ||
> >>>> -                                        opts->user_interval != ULLONG_MAX)))
> >>>> +                                        opts->user_interval != ULLONG_MAX) &&
> >>>> +            !(is_dummy && attr->sample_period)))
> >>>>                  evsel__set_default_freq_period(opts, attr);
> >>>>
> >>>>          /*
> >>>
> >>> Or simply like this?
> >>>
> >>>
> >>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >>> index d5363d23f5d3..6ce832ce62f1 100644
> >>> --- a/tools/perf/util/evsel.c
> >>> +++ b/tools/perf/util/evsel.c
> >>> @@ -1169,6 +1169,9 @@ void evsel__config(struct evsel *evsel, struct
> >>> record_opts *opts,
> >>>                                           opts->user_interval != ULLONG_MAX)))
> >>>                  evsel__set_default_freq_period(opts, attr);
> >>>
> >>> +       if (evsel__is_dummy_event(evsel))
> >>> +               attr->freq = 0;
> >>> +
> >>>          /*
> >>>           * If attr->freq was set (here or earlier), ask for period
> >>>           * to be sampled.
> >>
> >> I thought there might be corner cases where it made a difference,
> >> but I can't find any, so that should do.
> >>
> >
> > It seemed more intention revealing to do it at creation/initialization
> > than on a later not obviously executed code path - I'm thinking of
> > future me trying to understand the code. My priority is the clearing
> > of the flag, so I'm easy.
> >
> evsel__apply_config_terms() also sets freq. For example:
>
> # perf record -vv -e dummy/freq=100/ true

This example is NOT relevant to the issue.
Of course, if you want to explicitly use the dummy event with
frequency mode, you should be allowed to do so.
The problem we are solving here is different. We are preventing perf
record internal use of the dummy event
from using frequency mode for no good reason as any frequency mode
event adds additional overhead.
The perf record internal dummy event is used for one goal: to capture
MMAP records to avoid a race condition
between synthesize phase and processes being created and not captured
by synthesize. In that mode, it acts
as an aggregator of all MMAP records during the entire run of the
tool. This does not require any frequency mode.


> <SNIP>
> ------------------------------------------------------------
> perf_event_attr:
>    type                             1 (PERF_TYPE_SOFTWARE)
>    size                             136
>    config                           0x9 (PERF_COUNT_SW_DUMMY)
>    { sample_period, sample_freq }   100
>    sample_type                      IP|TID|TIME|PERIOD
>    read_format                      ID|LOST
>    disabled                         1
>    inherit                          1
>    mmap                             1
>    comm                             1
>    freq                             1
>    enable_on_exec                   1
>    task                             1
>    sample_id_all                    1
>    exclude_guest                    1
>    mmap2                            1
>    comm_exec                        1
>    ksymbol                          1
>    bpf_event                        1
> ------------------------------------------------------------
> <SNIP>
>
> Therefore, do we need to perform special processing on dummy events in
> evsel__apply_config_terms?
>
> Thanks,
> Yang

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-09-25 17:37               ` Stephane Eranian
@ 2023-09-30  6:06                 ` Namhyung Kim
  0 siblings, 0 replies; 23+ messages in thread
From: Namhyung Kim @ 2023-09-30  6:06 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Yang Jihong, Ian Rogers, Adrian Hunter, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Kan Liang, linux-perf-users,
	linux-kernel

Hello,

On Mon, Sep 25, 2023 at 10:37 AM Stephane Eranian <eranian@google.com> wrote:
>
> On Sun, Sep 24, 2023 at 8:35 PM Yang Jihong <yangjihong1@huawei.com> wrote:
> >
> > Hello,
> >
> > On 2023/9/22 23:05, Ian Rogers wrote:
> > > On Thu, Sep 21, 2023 at 10:36 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
> > >>
> > >> On 21/09/23 22:26, Namhyung Kim wrote:
> > >>> Hi,
> > >>>
> > >>> On Mon, Sep 18, 2023 at 11:00 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
> > >>>>
> > >>>> On 19/09/23 00:48, Ian Rogers wrote:
> > >>>>> On Mon, Sep 18, 2023 at 1:14 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
> > >>>>>>
> > >>>>>> On 16/09/23 07:09, Ian Rogers wrote:
> > >>>>>>> Dummy events are created with an attribute where the period and freq
> > >>>>>>> are zero. evsel__config will then see the uninitialized values and
> > >>>>>>> initialize them in evsel__default_freq_period. As fequency mode is
> > >>>>>>> used by default the dummy event would be set to use frequency
> > >>>>>>> mode. However, this has no effect on the dummy event but does cause
> > >>>>>>> unnecessary timers/interrupts. Avoid this overhead by setting the
> > >>>>>>> period to 1 for dummy events.
> > >>>>>>>
> > >>>>>>> evlist__add_aux_dummy calls evlist__add_dummy then sets freq=0 and
> > >>>>>>> period=1. This isn't necessary after this change and so the setting is
> > >>>>>>> removed.
> > >>>>>>>
> > >>>>>>>  From Stephane:
> > >>>>>>>
> > >>>>>>> The dummy event is not counting anything. It is used to collect mmap
> > >>>>>>> records and avoid a race condition during the synthesize mmap phase of
> > >>>>>>> perf record. As such, it should not cause any overhead during active
> > >>>>>>> profiling. Yet, it did. Because of a bug the dummy event was
> > >>>>>>> programmed as a sampling event in frequency mode. Events in that mode
> > >>>>>>> incur more kernel overheads because on timer tick, the kernel has to
> > >>>>>>> look at the number of samples for each event and potentially adjust
> > >>>>>>> the sampling period to achieve the desired frequency. The dummy event
> > >>>>>>> was therefore adding a frequency event to task and ctx contexts we may
> > >>>>>>> otherwise not have any, e.g., perf record -a -e
> > >>>>>>> cpu/event=0x3c,period=10000000/. On each timer tick the
> > >>>>>>> perf_adjust_freq_unthr_context() is invoked and if ctx->nr_freq is
> > >>>>>>> non-zero, then the kernel will loop over ALL the events of the context
> > >>>>>>> looking for frequency mode ones. In doing, so it locks the context,
> > >>>>>>> and enable/disable the PMU of each hw event. If all the events of the
> > >>>>>>> context are in period mode, the kernel will have to traverse the list for
> > >>>>>>> nothing incurring overhead. The overhead is multiplied by a very large
> > >>>>>>> factor when this happens in a guest kernel. There is no need for the
> > >>>>>>> dummy event to be in frequency mode, it does not count anything and
> > >>>>>>> therefore should not cause extra overhead for no reason.
> > >>>>>>>
> > >>>>>>> Fixes: 5bae0250237f ("perf evlist: Introduce perf_evlist__new_dummy constructor")
> > >>>>>>> Reported-by: Stephane Eranian <eranian@google.com>
> > >>>>>>> Signed-off-by: Ian Rogers <irogers@google.com>
> > >>>
> > >>> I'll take the original patch first.

It's in the perf-tools-next.

> > >>>
> > >>>
> > >>>>>>> ---
> > >>>>>>>   tools/perf/util/evlist.c | 5 +++--
> > >>>>>>>   1 file changed, 3 insertions(+), 2 deletions(-)
> > >>>>>>>
> > >>>>>>> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> > >>>>>>> index 25c3ebe2c2f5..e36da58522ef 100644
> > >>>>>>> --- a/tools/perf/util/evlist.c
> > >>>>>>> +++ b/tools/perf/util/evlist.c
> > >>>>>>> @@ -251,6 +251,9 @@ static struct evsel *evlist__dummy_event(struct evlist *evlist)
> > >>>>>>>                .type   = PERF_TYPE_SOFTWARE,
> > >>>>>>>                .config = PERF_COUNT_SW_DUMMY,
> > >>>>>>>                .size   = sizeof(attr), /* to capture ABI version */
> > >>>>>>> +             /* Avoid frequency mode for dummy events to avoid associated timers. */
> > >>>>>>> +             .freq = 0,
> > >>>>>>> +             .sample_period = 1,
> > >>>>>>>        };
> > >>>>>>>
> > >>>>>>>        return evsel__new_idx(&attr, evlist->core.nr_entries);
> > >>>>>>> @@ -277,8 +280,6 @@ struct evsel *evlist__add_aux_dummy(struct evlist *evlist, bool system_wide)
> > >>>>>>>        evsel->core.attr.exclude_kernel = 1;
> > >>>>>>>        evsel->core.attr.exclude_guest = 1;
> > >>>>>>>        evsel->core.attr.exclude_hv = 1;
> > >>>>>>> -     evsel->core.attr.freq = 0;
> > >>>>>>> -     evsel->core.attr.sample_period = 1;
> > >>>>>>>        evsel->core.system_wide = system_wide;
> > >>>>>>>        evsel->no_aux_samples = true;
> > >>>>>>>        evsel->name = strdup("dummy:u");
> > >>>>>>
> > >>>>>> Note that evsel__config() will put it back to freq if -F is used.
> > >>>>>
> > >>>>> Right, I was looking for a minimal fix in part for the sake of back
> > >>>>> porting. For the -F we could do:
> > >>>>>
> > >>>>> ```
> > >>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> > >>>>> index d5363d23f5d3..806185a39e17 100644
> > >>>>> --- a/tools/perf/util/evsel.c
> > >>>>> +++ b/tools/perf/util/evsel.c
> > >>>>> @@ -1083,11 +1083,15 @@ void __weak arch__post_evsel_config(struct
> > >>>>> evsel *evsel __maybe_unused,
> > >>>>> static void evsel__set_default_freq_period(struct record_opts *opts,
> > >>>>>                                            struct perf_event_attr *attr)
> > >>>>> {
> > >>>>> -       if (opts->freq) {
> > >>>>> +       bool is_dummy = attr->type == PERF_TYPE_SOFTWARE &&
> > >>>>> +               attr->config == PERF_COUNT_SW_DUMMY;
> > >>>>> +
> > >>>>> +       if (opts->freq && !is_dummy) {
> > >>>>>                 attr->freq = 1;
> > >>>>>                 attr->sample_freq = opts->freq;
> > >>>>>         } else {
> > >>>>> -               attr->sample_period = opts->default_interval;
> > >>>>> +               attr->freq = 0;
> > >>>>> +               attr->sample_period = is_dummy ? 1 : opts->default_interval;
> > >>>>>         }
> > >>>>> }
> > >>>>> ```
> > >>>>>
> > >>>>> But this felt like it could potentially have other side-effects.
> > >>>>
> > >>>> Perhaps leave it alone, if the period has already been defined:
> > >>>>
> > >>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> > >>>> index d5363d23f5d3..ad3e12f5ec88 100644
> > >>>> --- a/tools/perf/util/evsel.c
> > >>>> +++ b/tools/perf/util/evsel.c
> > >>>> @@ -1166,7 +1166,8 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
> > >>>>          if ((evsel->is_libpfm_event && !attr->sample_period) ||
> > >>>>              (!evsel->is_libpfm_event && (!attr->sample_period ||
> > >>>>                                           opts->user_freq != UINT_MAX ||
> > >>>> -                                        opts->user_interval != ULLONG_MAX)))
> > >>>> +                                        opts->user_interval != ULLONG_MAX) &&
> > >>>> +            !(is_dummy && attr->sample_period)))
> > >>>>                  evsel__set_default_freq_period(opts, attr);
> > >>>>
> > >>>>          /*
> > >>>
> > >>> Or simply like this?
> > >>>
> > >>>
> > >>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> > >>> index d5363d23f5d3..6ce832ce62f1 100644
> > >>> --- a/tools/perf/util/evsel.c
> > >>> +++ b/tools/perf/util/evsel.c
> > >>> @@ -1169,6 +1169,9 @@ void evsel__config(struct evsel *evsel, struct
> > >>> record_opts *opts,
> > >>>                                           opts->user_interval != ULLONG_MAX)))
> > >>>                  evsel__set_default_freq_period(opts, attr);
> > >>>
> > >>> +       if (evsel__is_dummy_event(evsel))
> > >>> +               attr->freq = 0;
> > >>> +
> > >>>          /*
> > >>>           * If attr->freq was set (here or earlier), ask for period
> > >>>           * to be sampled.
> > >>
> > >> I thought there might be corner cases where it made a difference,
> > >> but I can't find any, so that should do.
> > >>
> > >
> > > It seemed more intention revealing to do it at creation/initialization
> > > than on a later not obviously executed code path - I'm thinking of
> > > future me trying to understand the code. My priority is the clearing
> > > of the flag, so I'm easy.

Yeah I meant to keep the initialization and to prevent updating
it for -F later.  I can add a comment in this code like

/*
 * The kernel counts events with freq=1 separately to update
 * the sample frequency in the timer tick handler.  Do not
 * confuse the kernel since the dummy event never fires and
 * no need to update the frequency.
 */
if (evsel__is_dummy_event(evsel))
    attr->freq = 0;

ok?

> > >
> > evsel__apply_config_terms() also sets freq. For example:
> >
> > # perf record -vv -e dummy/freq=100/ true
>
> This example is NOT relevant to the issue.
> Of course, if you want to explicitly use the dummy event with
> frequency mode, you should be allowed to do so.

Agreed.

> The problem we are solving here is different. We are preventing perf
> record internal use of the dummy event
> from using frequency mode for no good reason as any frequency mode
> event adds additional overhead.
> The perf record internal dummy event is used for one goal: to capture
> MMAP records to avoid a race condition
> between synthesize phase and processes being created and not captured
> by synthesize. In that mode, it acts
> as an aggregator of all MMAP records during the entire run of the
> tool. This does not require any frequency mode.
>
>
> > <SNIP>
> > ------------------------------------------------------------
> > perf_event_attr:
> >    type                             1 (PERF_TYPE_SOFTWARE)
> >    size                             136
> >    config                           0x9 (PERF_COUNT_SW_DUMMY)
> >    { sample_period, sample_freq }   100
> >    sample_type                      IP|TID|TIME|PERIOD
> >    read_format                      ID|LOST
> >    disabled                         1
> >    inherit                          1
> >    mmap                             1
> >    comm                             1
> >    freq                             1
> >    enable_on_exec                   1
> >    task                             1
> >    sample_id_all                    1
> >    exclude_guest                    1
> >    mmap2                            1
> >    comm_exec                        1
> >    ksymbol                          1
> >    bpf_event                        1
> > ------------------------------------------------------------
> > <SNIP>
> >
> > Therefore, do we need to perform special processing on dummy events in
> > evsel__apply_config_terms?

As Stephane said, I don't worry about the case as it's explicitly
requested by the user.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-09-21  5:04     ` Mingwei Zhang
@ 2023-10-03 20:07       ` Namhyung Kim
  2023-10-03 22:36         ` Ian Rogers
  0 siblings, 1 reply; 23+ messages in thread
From: Namhyung Kim @ 2023-10-03 20:07 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Ian Rogers, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter,
	Kan Liang, linux-perf-users, linux-kernel, Yang Jihong,
	Stephane Eranian

Hello,

On Wed, Sep 20, 2023 at 10:05 PM Mingwei Zhang <mizhang@google.com> wrote:
>
> On Mon, Sep 18, 2023 at 3:43 PM Ian Rogers <irogers@google.com> wrote:
> >
> > On Sat, Sep 16, 2023 at 5:46 PM Mingwei Zhang <mizhang@google.com> wrote:
> > > Thank you very much for the change. I have one quick question about
> > > the PMU unthrottling logic. When I am looking into the function
> > > perf_adjust_freq_unthr_context(), I see the loop with PMU stop and
> > > start in each iteration. Is there a good way to avoid this PMU reset
> > > operation while quickly figuring out the event in frequency mode?
> >
> > Agreed. I think before the pmu_disable could be avoided for this condition:
> > ```
> > if (event->hw.interrupts != MAX_INTERRUPTS &&
> >     (!event->attr.freq || !event->attr.sample_freq))
> >         continue;
> > ```
> > Fixing up the event stop/start looks harder.
> >
>
> Right, I think putting the check early before pmu_disable() is already
> a great optimization. The only concern I initially had was whether
> event->hw.interrupts can be accessed before we disable the pmu. But
> after checking this field in other locations, I don't see any problem
> at all.

The event->hw.interrupts would be increased in the NMI handler
so there is a race between the check and the NMI.  That's why
I think it checks that after disabling the PMU.

But I think we can skip non-sampling events for sure.  Then it
would be better to set attr.sample_period = 0 rather than attr.freq.

    if (!is_sampling_event(event))
        continue;

    perf_pmu_disable(event->pmu);
    ...

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-10-03 20:07       ` Namhyung Kim
@ 2023-10-03 22:36         ` Ian Rogers
  2023-10-03 23:02           ` Namhyung Kim
  2023-10-03 23:19           ` Mingwei Zhang
  0 siblings, 2 replies; 23+ messages in thread
From: Ian Rogers @ 2023-10-03 22:36 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, Yang Jihong, Stephane Eranian

On Tue, Oct 3, 2023 at 1:08 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hello,
>
> On Wed, Sep 20, 2023 at 10:05 PM Mingwei Zhang <mizhang@google.com> wrote:
> >
> > On Mon, Sep 18, 2023 at 3:43 PM Ian Rogers <irogers@google.com> wrote:
> > >
> > > On Sat, Sep 16, 2023 at 5:46 PM Mingwei Zhang <mizhang@google.com> wrote:
> > > > Thank you very much for the change. I have one quick question about
> > > > the PMU unthrottling logic. When I am looking into the function
> > > > perf_adjust_freq_unthr_context(), I see the loop with PMU stop and
> > > > start in each iteration. Is there a good way to avoid this PMU reset
> > > > operation while quickly figuring out the event in frequency mode?
> > >
> > > Agreed. I think before the pmu_disable could be avoided for this condition:
> > > ```
> > > if (event->hw.interrupts != MAX_INTERRUPTS &&
> > >     (!event->attr.freq || !event->attr.sample_freq))
> > >         continue;
> > > ```
> > > Fixing up the event stop/start looks harder.
> > >
> >
> > Right, I think putting the check early before pmu_disable() is already
> > a great optimization. The only concern I initially had was whether
> > event->hw.interrupts can be accessed before we disable the pmu. But
> > after checking this field in other locations, I don't see any problem
> > at all.
>
> The event->hw.interrupts would be increased in the NMI handler
> so there is a race between the check and the NMI.  That's why
> I think it checks that after disabling the PMU.
>
> But I think we can skip non-sampling events for sure.  Then it
> would be better to set attr.sample_period = 0 rather than attr.freq.
>
>     if (!is_sampling_event(event))
>         continue;
>
>     perf_pmu_disable(event->pmu);
>     ...
>
> Thanks,
> Namhyung

With the PMU disabled, isn't there still a risk of an interrupt still
being in flight? In other words the disable doesn't prevent a race and
we'll catch this on the next timer call to
perf_adjust_freq_unthr_context. I think we can also improve the code
by just disabling a PMU once, we can take advantage of the
perf_event_pmu_context and disable that PMU, iterate its events and
then re-enable the PMU - i.e. no need for an enable and disable per
event. I'll put a patch together.

Thanks,
Ian

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-10-03 22:36         ` Ian Rogers
@ 2023-10-03 23:02           ` Namhyung Kim
  2023-10-11 16:14             ` Namhyung Kim
  2023-10-03 23:19           ` Mingwei Zhang
  1 sibling, 1 reply; 23+ messages in thread
From: Namhyung Kim @ 2023-10-03 23:02 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, Yang Jihong, Stephane Eranian

On Tue, Oct 3, 2023 at 3:36 PM Ian Rogers <irogers@google.com> wrote:
>
> On Tue, Oct 3, 2023 at 1:08 PM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > Hello,
> >
> > On Wed, Sep 20, 2023 at 10:05 PM Mingwei Zhang <mizhang@google.com> wrote:
> > >
> > > On Mon, Sep 18, 2023 at 3:43 PM Ian Rogers <irogers@google.com> wrote:
> > > >
> > > > On Sat, Sep 16, 2023 at 5:46 PM Mingwei Zhang <mizhang@google.com> wrote:
> > > > > Thank you very much for the change. I have one quick question about
> > > > > the PMU unthrottling logic. When I am looking into the function
> > > > > perf_adjust_freq_unthr_context(), I see the loop with PMU stop and
> > > > > start in each iteration. Is there a good way to avoid this PMU reset
> > > > > operation while quickly figuring out the event in frequency mode?
> > > >
> > > > Agreed. I think before the pmu_disable could be avoided for this condition:
> > > > ```
> > > > if (event->hw.interrupts != MAX_INTERRUPTS &&
> > > >     (!event->attr.freq || !event->attr.sample_freq))
> > > >         continue;
> > > > ```
> > > > Fixing up the event stop/start looks harder.
> > > >
> > >
> > > Right, I think putting the check early before pmu_disable() is already
> > > a great optimization. The only concern I initially had was whether
> > > event->hw.interrupts can be accessed before we disable the pmu. But
> > > after checking this field in other locations, I don't see any problem
> > > at all.
> >
> > The event->hw.interrupts would be increased in the NMI handler
> > so there is a race between the check and the NMI.  That's why
> > I think it checks that after disabling the PMU.
> >
> > But I think we can skip non-sampling events for sure.  Then it
> > would be better to set attr.sample_period = 0 rather than attr.freq.
> >
> >     if (!is_sampling_event(event))
> >         continue;
> >
> >     perf_pmu_disable(event->pmu);
> >     ...
> >
> > Thanks,
> > Namhyung
>
> With the PMU disabled, isn't there still a risk of an interrupt still
> being in flight? In other words the disable doesn't prevent a race and
> we'll catch this on the next timer call to
> perf_adjust_freq_unthr_context. I think we can also improve the code
> by just disabling a PMU once, we can take advantage of the
> perf_event_pmu_context and disable that PMU, iterate its events and
> then re-enable the PMU - i.e. no need for an enable and disable per
> event. I'll put a patch together.

Thanks, I was thinking about that too.  It's also a side effect of
the context rewrite.  Maybe we could iterate pmu_ctx's active lists
and skip pmus with PERF_PMU_CAP_NO_INTERRUPT and
individual non-sampling events.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-10-03 22:36         ` Ian Rogers
  2023-10-03 23:02           ` Namhyung Kim
@ 2023-10-03 23:19           ` Mingwei Zhang
  2023-10-11 16:09             ` Namhyung Kim
  1 sibling, 1 reply; 23+ messages in thread
From: Mingwei Zhang @ 2023-10-03 23:19 UTC (permalink / raw)
  To: Ian Rogers, Jim Mattson
  Cc: Namhyung Kim, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, Yang Jihong, Stephane Eranian

On Tue, Oct 3, 2023 at 3:36 PM Ian Rogers <irogers@google.com> wrote:
>
> On Tue, Oct 3, 2023 at 1:08 PM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > Hello,
> >
> > On Wed, Sep 20, 2023 at 10:05 PM Mingwei Zhang <mizhang@google.com> wrote:
> > >
> > > On Mon, Sep 18, 2023 at 3:43 PM Ian Rogers <irogers@google.com> wrote:
> > > >
> > > > On Sat, Sep 16, 2023 at 5:46 PM Mingwei Zhang <mizhang@google.com> wrote:
> > > > > Thank you very much for the change. I have one quick question about
> > > > > the PMU unthrottling logic. When I am looking into the function
> > > > > perf_adjust_freq_unthr_context(), I see the loop with PMU stop and
> > > > > start in each iteration. Is there a good way to avoid this PMU reset
> > > > > operation while quickly figuring out the event in frequency mode?
> > > >
> > > > Agreed. I think before the pmu_disable could be avoided for this condition:
> > > > ```
> > > > if (event->hw.interrupts != MAX_INTERRUPTS &&
> > > >     (!event->attr.freq || !event->attr.sample_freq))
> > > >         continue;
> > > > ```
> > > > Fixing up the event stop/start looks harder.
> > > >
> > >
> > > Right, I think putting the check early before pmu_disable() is already
> > > a great optimization. The only concern I initially had was whether
> > > event->hw.interrupts can be accessed before we disable the pmu. But
> > > after checking this field in other locations, I don't see any problem
> > > at all.
> >
> > The event->hw.interrupts would be increased in the NMI handler
> > so there is a race between the check and the NMI.  That's why
> > I think it checks that after disabling the PMU.
> >
> > But I think we can skip non-sampling events for sure.  Then it
> > would be better to set attr.sample_period = 0 rather than attr.freq.
> >
> >     if (!is_sampling_event(event))
> >         continue;
> >
> >     perf_pmu_disable(event->pmu);
> >     ...
> >
> > Thanks,
> > Namhyung
>
> With the PMU disabled, isn't there still a risk of an interrupt still
> being in flight? In other words the disable doesn't prevent a race and
> we'll catch this on the next timer call to
> perf_adjust_freq_unthr_context. I think we can also improve the code
> by just disabling a PMU once, we can take advantage of the
> perf_event_pmu_context and disable that PMU, iterate its events and
> then re-enable the PMU - i.e. no need for an enable and disable per
> event. I'll put a patch together.
>
> Thanks,
> Ian

+Jim Mattson

I initially thought this idea was just an alternative, or a more
professional fix in the perf subsystem. I was wrong...

This would be way better than just skipping frequency events in the
loop. Since if we just skip by event, we may still suffer from huge
overhead if the event list contains many sampling events in frequency
mode. Unfortunately, that is the general case when we do perf record
-e 'eventlist' (IIUC all events in eventlist are in frequency mode if
we don't specify period=). So the problem actually remains whenever we
do perf sampling unless we use something like Intel vtune.

On the other hand, since all of the events are presumably CPU core
events, with the fix we pay only once for the PMU reset per hrtimer
regardless of how many events are in frequency mode.

Looking forward to the patch! Please keep us posted if possible.

Thanks.
-Mingwei

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-10-03 23:19           ` Mingwei Zhang
@ 2023-10-11 16:09             ` Namhyung Kim
  0 siblings, 0 replies; 23+ messages in thread
From: Namhyung Kim @ 2023-10-11 16:09 UTC (permalink / raw)
  To: Ian Rogers, Mingwei Zhang
  Cc: Jim Mattson, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, Yang Jihong, Stephane Eranian

Hi Ian,

On Tue, Oct 3, 2023 at 4:20 PM Mingwei Zhang <mizhang@google.com> wrote:
>
> On Tue, Oct 3, 2023 at 3:36 PM Ian Rogers <irogers@google.com> wrote:
> >
> > On Tue, Oct 3, 2023 at 1:08 PM Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > Hello,
> > >
> > > On Wed, Sep 20, 2023 at 10:05 PM Mingwei Zhang <mizhang@google.com> wrote:
> > > >
> > > > On Mon, Sep 18, 2023 at 3:43 PM Ian Rogers <irogers@google.com> wrote:
> > > > >
> > > > > On Sat, Sep 16, 2023 at 5:46 PM Mingwei Zhang <mizhang@google.com> wrote:
> > > > > > Thank you very much for the change. I have one quick question about
> > > > > > the PMU unthrottling logic. When I am looking into the function
> > > > > > perf_adjust_freq_unthr_context(), I see the loop with PMU stop and
> > > > > > start in each iteration. Is there a good way to avoid this PMU reset
> > > > > > operation while quickly figuring out the event in frequency mode?
> > > > >
> > > > > Agreed. I think before the pmu_disable could be avoided for this condition:
> > > > > ```
> > > > > if (event->hw.interrupts != MAX_INTERRUPTS &&
> > > > >     (!event->attr.freq || !event->attr.sample_freq))
> > > > >         continue;
> > > > > ```
> > > > > Fixing up the event stop/start looks harder.
> > > > >
> > > >
> > > > Right, I think putting the check early before pmu_disable() is already
> > > > a great optimization. The only concern I initially had was whether
> > > > event->hw.interrupts can be accessed before we disable the pmu. But
> > > > after checking this field in other locations, I don't see any problem
> > > > at all.
> > >
> > > The event->hw.interrupts would be increased in the NMI handler
> > > so there is a race between the check and the NMI.  That's why
> > > I think it checks that after disabling the PMU.
> > >
> > > But I think we can skip non-sampling events for sure.  Then it
> > > would be better to set attr.sample_period = 0 rather than attr.freq.
> > >
> > >     if (!is_sampling_event(event))
> > >         continue;
> > >
> > >     perf_pmu_disable(event->pmu);
> > >     ...
> > >
> > > Thanks,
> > > Namhyung
> >
> > With the PMU disabled, isn't there still a risk of an interrupt still
> > being in flight? In other words the disable doesn't prevent a race and
> > we'll catch this on the next timer call to
> > perf_adjust_freq_unthr_context. I think we can also improve the code
> > by just disabling a PMU once, we can take advantage of the
> > perf_event_pmu_context and disable that PMU, iterate its events and
> > then re-enable the PMU - i.e. no need for an enable and disable per
> > event. I'll put a patch together.
> >
> > Thanks,
> > Ian
>
> +Jim Mattson
>
> I initially thought this idea was just an alternative, or a more
> professional fix in the perf subsystem. I was wrong...
>
> This would be way better than just skipping frequency events in the
> loop. Since if we just skip by event, we may still suffer from huge
> overhead if the event list contains many sampling events in frequency
> mode. Unfortunately, that is the general case when we do perf record
> -e 'eventlist' (IIUC all events in eventlist are in frequency mode if
> we don't specify period=). So the problem actually remains whenever we
> do perf sampling unless we use something like Intel vtune.
>
> On the other hand, since all of the events are presumably CPU core
> events, with the fix we pay only once for the PMU reset per hrtimer
> regardless of how many events are in frequency mode.
>
> Looking forward to the patch! Please keep us posted if possible.

Any updates?

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-10-03 23:02           ` Namhyung Kim
@ 2023-10-11 16:14             ` Namhyung Kim
  0 siblings, 0 replies; 23+ messages in thread
From: Namhyung Kim @ 2023-10-11 16:14 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Adrian Hunter, Kan Liang, linux-perf-users,
	linux-kernel, Yang Jihong, Stephane Eranian

On Tue, Oct 3, 2023 at 4:02 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> On Tue, Oct 3, 2023 at 3:36 PM Ian Rogers <irogers@google.com> wrote:
> > With the PMU disabled, isn't there still a risk of an interrupt still
> > being in flight? In other words the disable doesn't prevent a race and
> > we'll catch this on the next timer call to
> > perf_adjust_freq_unthr_context. I think we can also improve the code
> > by just disabling a PMU once, we can take advantage of the
> > perf_event_pmu_context and disable that PMU, iterate its events and
> > then re-enable the PMU - i.e. no need for an enable and disable per
> > event. I'll put a patch together.
>
> Thanks, I was thinking about that too.  It's also a side effect of
> the context rewrite.  Maybe we could iterate pmu_ctx's active lists
> and skip pmus with PERF_PMU_CAP_NO_INTERRUPT and
> individual non-sampling events.

Or we can add pmu_ctx->nr_freq and check it before accessing
pmu MSRs.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-09-16  4:09 Ian Rogers
  2023-09-17  0:45 ` Mingwei Zhang
  2023-09-18  8:14 ` Adrian Hunter
@ 2023-10-30 19:04 ` Mingwei Zhang
  2023-10-30 20:01   ` Mingwei Zhang
  2 siblings, 1 reply; 23+ messages in thread
From: Mingwei Zhang @ 2023-10-30 19:04 UTC (permalink / raw)
  To: gregkh
  Cc: Ian Rogers, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Kan Liang, linux-perf-users, linux-kernel,
	Yang Jihong, Stephane Eranian, stable

On Fri, Sep 15, 2023 at 9:10 PM Ian Rogers <irogers@google.com> wrote:
>
> Dummy events are created with an attribute where the period and freq
> are zero. evsel__config will then see the uninitialized values and
> initialize them in evsel__default_freq_period. As fequency mode is
> used by default the dummy event would be set to use frequency
> mode. However, this has no effect on the dummy event but does cause
> unnecessary timers/interrupts. Avoid this overhead by setting the
> period to 1 for dummy events.
>
> evlist__add_aux_dummy calls evlist__add_dummy then sets freq=0 and
> period=1. This isn't necessary after this change and so the setting is
> removed.
>
> From Stephane:
>
> The dummy event is not counting anything. It is used to collect mmap
> records and avoid a race condition during the synthesize mmap phase of
> perf record. As such, it should not cause any overhead during active
> profiling. Yet, it did. Because of a bug the dummy event was
> programmed as a sampling event in frequency mode. Events in that mode
> incur more kernel overheads because on timer tick, the kernel has to
> look at the number of samples for each event and potentially adjust
> the sampling period to achieve the desired frequency. The dummy event
> was therefore adding a frequency event to task and ctx contexts we may
> otherwise not have any, e.g., perf record -a -e
> cpu/event=0x3c,period=10000000/. On each timer tick the
> perf_adjust_freq_unthr_context() is invoked and if ctx->nr_freq is
> non-zero, then the kernel will loop over ALL the events of the context
> looking for frequency mode ones. In doing, so it locks the context,
> and enable/disable the PMU of each hw event. If all the events of the
> context are in period mode, the kernel will have to traverse the list for
> nothing incurring overhead. The overhead is multiplied by a very large
> factor when this happens in a guest kernel. There is no need for the
> dummy event to be in frequency mode, it does not count anything and
> therefore should not cause extra overhead for no reason.
>
> Fixes: 5bae0250237f ("perf evlist: Introduce perf_evlist__new_dummy constructor")
> Reported-by: Stephane Eranian <eranian@google.com>
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/util/evlist.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index 25c3ebe2c2f5..e36da58522ef 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -251,6 +251,9 @@ static struct evsel *evlist__dummy_event(struct evlist *evlist)
>                 .type   = PERF_TYPE_SOFTWARE,
>                 .config = PERF_COUNT_SW_DUMMY,
>                 .size   = sizeof(attr), /* to capture ABI version */
> +               /* Avoid frequency mode for dummy events to avoid associated timers. */
> +               .freq = 0,
> +               .sample_period = 1,
>         };
>
>         return evsel__new_idx(&attr, evlist->core.nr_entries);
> @@ -277,8 +280,6 @@ struct evsel *evlist__add_aux_dummy(struct evlist *evlist, bool system_wide)
>         evsel->core.attr.exclude_kernel = 1;
>         evsel->core.attr.exclude_guest = 1;
>         evsel->core.attr.exclude_hv = 1;
> -       evsel->core.attr.freq = 0;
> -       evsel->core.attr.sample_period = 1;
>         evsel->core.system_wide = system_wide;
>         evsel->no_aux_samples = true;
>         evsel->name = strdup("dummy:u");
> --
> 2.42.0.459.ge4e396fd5e-goog
>

Hi Greg,

This patch is a critical performance fix for perf and vPMU. Can you
help us dispatch the commit to all stable kernel versions?

Appreciate your help. Thanks.
-Mingwei

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-10-30 19:04 ` Mingwei Zhang
@ 2023-10-30 20:01   ` Mingwei Zhang
  2023-10-31  5:47     ` Greg KH
  0 siblings, 1 reply; 23+ messages in thread
From: Mingwei Zhang @ 2023-10-30 20:01 UTC (permalink / raw)
  To: gregkh, stable
  Cc: Ian Rogers, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Kan Liang, linux-perf-users, linux-kernel,
	Yang Jihong, Stephane Eranian

On Mon, Oct 30, 2023 at 12:04 PM Mingwei Zhang <mizhang@google.com> wrote:
>
> On Fri, Sep 15, 2023 at 9:10 PM Ian Rogers <irogers@google.com> wrote:
> >
> > Dummy events are created with an attribute where the period and freq
> > are zero. evsel__config will then see the uninitialized values and
> > initialize them in evsel__default_freq_period. As fequency mode is
> > used by default the dummy event would be set to use frequency
> > mode. However, this has no effect on the dummy event but does cause
> > unnecessary timers/interrupts. Avoid this overhead by setting the
> > period to 1 for dummy events.
> >
> > evlist__add_aux_dummy calls evlist__add_dummy then sets freq=0 and
> > period=1. This isn't necessary after this change and so the setting is
> > removed.
> >
> > From Stephane:
> >
> > The dummy event is not counting anything. It is used to collect mmap
> > records and avoid a race condition during the synthesize mmap phase of
> > perf record. As such, it should not cause any overhead during active
> > profiling. Yet, it did. Because of a bug the dummy event was
> > programmed as a sampling event in frequency mode. Events in that mode
> > incur more kernel overheads because on timer tick, the kernel has to
> > look at the number of samples for each event and potentially adjust
> > the sampling period to achieve the desired frequency. The dummy event
> > was therefore adding a frequency event to task and ctx contexts we may
> > otherwise not have any, e.g., perf record -a -e
> > cpu/event=0x3c,period=10000000/. On each timer tick the
> > perf_adjust_freq_unthr_context() is invoked and if ctx->nr_freq is
> > non-zero, then the kernel will loop over ALL the events of the context
> > looking for frequency mode ones. In doing, so it locks the context,
> > and enable/disable the PMU of each hw event. If all the events of the
> > context are in period mode, the kernel will have to traverse the list for
> > nothing incurring overhead. The overhead is multiplied by a very large
> > factor when this happens in a guest kernel. There is no need for the
> > dummy event to be in frequency mode, it does not count anything and
> > therefore should not cause extra overhead for no reason.
> >
> > Fixes: 5bae0250237f ("perf evlist: Introduce perf_evlist__new_dummy constructor")
> > Reported-by: Stephane Eranian <eranian@google.com>
> > Signed-off-by: Ian Rogers <irogers@google.com>
> > ---
> >  tools/perf/util/evlist.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> > index 25c3ebe2c2f5..e36da58522ef 100644
> > --- a/tools/perf/util/evlist.c
> > +++ b/tools/perf/util/evlist.c
> > @@ -251,6 +251,9 @@ static struct evsel *evlist__dummy_event(struct evlist *evlist)
> >                 .type   = PERF_TYPE_SOFTWARE,
> >                 .config = PERF_COUNT_SW_DUMMY,
> >                 .size   = sizeof(attr), /* to capture ABI version */
> > +               /* Avoid frequency mode for dummy events to avoid associated timers. */
> > +               .freq = 0,
> > +               .sample_period = 1,
> >         };
> >
> >         return evsel__new_idx(&attr, evlist->core.nr_entries);
> > @@ -277,8 +280,6 @@ struct evsel *evlist__add_aux_dummy(struct evlist *evlist, bool system_wide)
> >         evsel->core.attr.exclude_kernel = 1;
> >         evsel->core.attr.exclude_guest = 1;
> >         evsel->core.attr.exclude_hv = 1;
> > -       evsel->core.attr.freq = 0;
> > -       evsel->core.attr.sample_period = 1;
> >         evsel->core.system_wide = system_wide;
> >         evsel->no_aux_samples = true;
> >         evsel->name = strdup("dummy:u");
> > --
> > 2.42.0.459.ge4e396fd5e-goog
> >
>
> Hi Greg,
>
> This patch is a critical performance fix for perf and vPMU. Can you
> help us dispatch the commit to all stable kernel versions?
>
> Appreciate your help. Thanks.
> -Mingwei

Oops... Update target email to: stable@vger.kernel.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] perf evlist: Avoid frequency mode for the dummy event
  2023-10-30 20:01   ` Mingwei Zhang
@ 2023-10-31  5:47     ` Greg KH
  0 siblings, 0 replies; 23+ messages in thread
From: Greg KH @ 2023-10-31  5:47 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: stable, Ian Rogers, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Adrian Hunter, Kan Liang,
	linux-perf-users, linux-kernel, Yang Jihong, Stephane Eranian

On Mon, Oct 30, 2023 at 01:01:45PM -0700, Mingwei Zhang wrote:
> On Mon, Oct 30, 2023 at 12:04 PM Mingwei Zhang <mizhang@google.com> wrote:
> >
> > On Fri, Sep 15, 2023 at 9:10 PM Ian Rogers <irogers@google.com> wrote:
> > >
> > > Dummy events are created with an attribute where the period and freq
> > > are zero. evsel__config will then see the uninitialized values and
> > > initialize them in evsel__default_freq_period. As fequency mode is
> > > used by default the dummy event would be set to use frequency
> > > mode. However, this has no effect on the dummy event but does cause
> > > unnecessary timers/interrupts. Avoid this overhead by setting the
> > > period to 1 for dummy events.
> > >
> > > evlist__add_aux_dummy calls evlist__add_dummy then sets freq=0 and
> > > period=1. This isn't necessary after this change and so the setting is
> > > removed.
> > >
> > > From Stephane:
> > >
> > > The dummy event is not counting anything. It is used to collect mmap
> > > records and avoid a race condition during the synthesize mmap phase of
> > > perf record. As such, it should not cause any overhead during active
> > > profiling. Yet, it did. Because of a bug the dummy event was
> > > programmed as a sampling event in frequency mode. Events in that mode
> > > incur more kernel overheads because on timer tick, the kernel has to
> > > look at the number of samples for each event and potentially adjust
> > > the sampling period to achieve the desired frequency. The dummy event
> > > was therefore adding a frequency event to task and ctx contexts we may
> > > otherwise not have any, e.g., perf record -a -e
> > > cpu/event=0x3c,period=10000000/. On each timer tick the
> > > perf_adjust_freq_unthr_context() is invoked and if ctx->nr_freq is
> > > non-zero, then the kernel will loop over ALL the events of the context
> > > looking for frequency mode ones. In doing, so it locks the context,
> > > and enable/disable the PMU of each hw event. If all the events of the
> > > context are in period mode, the kernel will have to traverse the list for
> > > nothing incurring overhead. The overhead is multiplied by a very large
> > > factor when this happens in a guest kernel. There is no need for the
> > > dummy event to be in frequency mode, it does not count anything and
> > > therefore should not cause extra overhead for no reason.
> > >
> > > Fixes: 5bae0250237f ("perf evlist: Introduce perf_evlist__new_dummy constructor")
> > > Reported-by: Stephane Eranian <eranian@google.com>
> > > Signed-off-by: Ian Rogers <irogers@google.com>
> > > ---
> > >  tools/perf/util/evlist.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> > > index 25c3ebe2c2f5..e36da58522ef 100644
> > > --- a/tools/perf/util/evlist.c
> > > +++ b/tools/perf/util/evlist.c
> > > @@ -251,6 +251,9 @@ static struct evsel *evlist__dummy_event(struct evlist *evlist)
> > >                 .type   = PERF_TYPE_SOFTWARE,
> > >                 .config = PERF_COUNT_SW_DUMMY,
> > >                 .size   = sizeof(attr), /* to capture ABI version */
> > > +               /* Avoid frequency mode for dummy events to avoid associated timers. */
> > > +               .freq = 0,
> > > +               .sample_period = 1,
> > >         };
> > >
> > >         return evsel__new_idx(&attr, evlist->core.nr_entries);
> > > @@ -277,8 +280,6 @@ struct evsel *evlist__add_aux_dummy(struct evlist *evlist, bool system_wide)
> > >         evsel->core.attr.exclude_kernel = 1;
> > >         evsel->core.attr.exclude_guest = 1;
> > >         evsel->core.attr.exclude_hv = 1;
> > > -       evsel->core.attr.freq = 0;
> > > -       evsel->core.attr.sample_period = 1;
> > >         evsel->core.system_wide = system_wide;
> > >         evsel->no_aux_samples = true;
> > >         evsel->name = strdup("dummy:u");
> > > --
> > > 2.42.0.459.ge4e396fd5e-goog
> > >
> >
> > Hi Greg,
> >
> > This patch is a critical performance fix for perf and vPMU. Can you
> > help us dispatch the commit to all stable kernel versions?
> >
> > Appreciate your help. Thanks.
> > -Mingwei
> 
> Oops... Update target email to: stable@vger.kernel.org

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2023-10-31  5:48 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-16  3:56 [PATCH v1] perf evlist: Avoid frequency mode for the dummy event Ian Rogers
  -- strict thread matches above, loose matches on Subject: below --
2023-09-16  4:09 Ian Rogers
2023-09-17  0:45 ` Mingwei Zhang
2023-09-18 22:42   ` Ian Rogers
2023-09-21  5:04     ` Mingwei Zhang
2023-10-03 20:07       ` Namhyung Kim
2023-10-03 22:36         ` Ian Rogers
2023-10-03 23:02           ` Namhyung Kim
2023-10-11 16:14             ` Namhyung Kim
2023-10-03 23:19           ` Mingwei Zhang
2023-10-11 16:09             ` Namhyung Kim
2023-09-18  8:14 ` Adrian Hunter
2023-09-18 21:48   ` Ian Rogers
2023-09-19  5:59     ` Adrian Hunter
2023-09-21 19:26       ` Namhyung Kim
2023-09-22  5:36         ` Adrian Hunter
2023-09-22 15:05           ` Ian Rogers
2023-09-25  3:35             ` Yang Jihong
2023-09-25 17:37               ` Stephane Eranian
2023-09-30  6:06                 ` Namhyung Kim
2023-10-30 19:04 ` Mingwei Zhang
2023-10-30 20:01   ` Mingwei Zhang
2023-10-31  5:47     ` Greg KH

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).