linux-perf-users.vger.kernel.org archive mirror
* [PATCH v4 0/2] Fix perf adjust period
@ 2024-08-21 13:42 Luo Gengkun
  2024-08-21 13:42 ` [PATCH v4 1/2] perf/core: Fix small negative period being ignored Luo Gengkun
  2024-08-21 13:42 ` [PATCH v4 2/2] perf/core: Fix incorrect time diff in tick adjust period Luo Gengkun
  0 siblings, 2 replies; 12+ messages in thread
From: Luo Gengkun @ 2024-08-21 13:42 UTC (permalink / raw)
  To: peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
	luogengkun

v3->v4:
1. Rebase the patch
2. Tidy up the commit message
3. Modify the code style

Luo Gengkun (2):
  perf/core: Fix small negative period being ignored
  perf/core: Fix incorrect time diff in tick adjust period

 include/linux/perf_event.h |  1 +
 kernel/events/core.c       | 17 +++++++++++++----
 2 files changed, 14 insertions(+), 4 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v4 1/2] perf/core: Fix small negative period being ignored
  2024-08-21 13:42 [PATCH v4 0/2] Fix perf adjust period Luo Gengkun
@ 2024-08-21 13:42 ` Luo Gengkun
  2024-08-27 16:32   ` Liang, Kan
  2024-08-21 13:42 ` [PATCH v4 2/2] perf/core: Fix incorrect time diff in tick adjust period Luo Gengkun
  1 sibling, 1 reply; 12+ messages in thread
From: Luo Gengkun @ 2024-08-21 13:42 UTC (permalink / raw)
  To: peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
	luogengkun

In perf_adjust_period(), the period is calculated first and then used to
compute delta. However, when delta is negative, the result deviates from
the case where delta is greater than or equal to zero. For example, when
delta is in the range [-14,-1], delta + 7 falls in [-7,6], so the final
value of delta / 8 is 0. Small negative adjustments such as -1 and -2 are
therefore ignored. This is unacceptable when the target period is very
short, because we will lose a lot of samples.
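
To illustrate the asymmetry, here is a minimal standalone userspace sketch
(not part of this patch) that prints the current truncating filter next to
the symmetric variant introduced below, relying only on C integer division
truncating toward zero:

  #include <stdio.h>

  int main(void)
  {
          long delta;

          for (delta = -14; delta <= 8; delta++) {
                  long cur = (delta + 7) / 8;   /* current low pass filter */
                  long sym = ((delta >= 0) ? (delta + 7) : (delta - 7)) / 8;
                  printf("delta=%3ld current=%2ld symmetric=%2ld\n",
                         delta, cur, sym);
          }
          return 0;
  }

Every delta in [-14,-1] prints 0 for the current filter, while the symmetric
variant yields -1 or -2.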

Here are some tests and analyses:
before:
  # perf record -e cs -F 1000  ./a.out
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.022 MB perf.data (518 samples) ]

  # perf script
  ...
  a.out     396   257.956048:         23 cs:  ffffffff81f4eeec schedul>
  a.out     396   257.957891:         23 cs:  ffffffff81f4eeec schedul>
  a.out     396   257.959730:         23 cs:  ffffffff81f4eeec schedul>
  a.out     396   257.961545:         23 cs:  ffffffff81f4eeec schedul>
  a.out     396   257.963355:         23 cs:  ffffffff81f4eeec schedul>
  a.out     396   257.965163:         23 cs:  ffffffff81f4eeec schedul>
  a.out     396   257.966973:         23 cs:  ffffffff81f4eeec schedul>
  a.out     396   257.968785:         23 cs:  ffffffff81f4eeec schedul>
  a.out     396   257.970593:         23 cs:  ffffffff81f4eeec schedul>
  ...

after:
  # perf record -e cs -F 1000  ./a.out
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.058 MB perf.data (1466 samples) ]

  # perf script
  ...
  a.out     395    59.338813:         11 cs:  ffffffff81f4eeec schedul>
  a.out     395    59.339707:         12 cs:  ffffffff81f4eeec schedul>
  a.out     395    59.340682:         13 cs:  ffffffff81f4eeec schedul>
  a.out     395    59.341751:         13 cs:  ffffffff81f4eeec schedul>
  a.out     395    59.342799:         12 cs:  ffffffff81f4eeec schedul>
  a.out     395    59.343765:         11 cs:  ffffffff81f4eeec schedul>
  a.out     395    59.344651:         11 cs:  ffffffff81f4eeec schedul>
  a.out     395    59.345539:         12 cs:  ffffffff81f4eeec schedul>
  a.out     395    59.346502:         13 cs:  ffffffff81f4eeec schedul>
  ...

test.c

#include <unistd.h>

int main() {
        for (int i = 0; i < 20000; i++)
                usleep(10);

        return 0;
}

  # time ./a.out
  real    0m1.583s
  user    0m0.040s
  sys     0m0.298s

The above results were obtained on x86-64 qemu with KVM enabled, using
test.c as the test program. Ideally, we should have around 1500 samples,
but the previous algorithm produced only about 500, whereas the modified
algorithm now produces about 1400. Furthermore, the new version shows 1
sample per 0.001s, while the previous one shows 1 sample per 0.002s. This
indicates that the new algorithm is more sensitive to small negative
values than the old algorithm.
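
(For reference, the expected sample count follows from runtime multiplied by
the target frequency: roughly 1.583 s * 1000 Hz ≈ 1583 samples.)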

Fixes: bd2b5b12849a ("perf_counter: More aggressive frequency adjustment")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
---
 kernel/events/core.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index c973e3c11e03..a9395bbfd4aa 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4092,7 +4092,11 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
 	period = perf_calculate_period(event, nsec, count);
 
 	delta = (s64)(period - hwc->sample_period);
-	delta = (delta + 7) / 8; /* low pass filter */
+	if (delta >= 0)
+		delta += 7;
+	else
+		delta -= 7;
+	delta /= 8; /* low pass filter */
 
 	sample_period = hwc->sample_period + delta;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v4 2/2] perf/core: Fix incorrect time diff in tick adjust period
  2024-08-21 13:42 [PATCH v4 0/2] Fix perf adjust period Luo Gengkun
  2024-08-21 13:42 ` [PATCH v4 1/2] perf/core: Fix small negative period being ignored Luo Gengkun
@ 2024-08-21 13:42 ` Luo Gengkun
  2024-08-22 18:23   ` Adrian Hunter
  2024-08-27 16:42   ` Liang, Kan
  1 sibling, 2 replies; 12+ messages in thread
From: Luo Gengkun @ 2024-08-21 13:42 UTC (permalink / raw)
  To: peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
	luogengkun

Perf events have the notion of a sampling frequency, which is implemented in
software by dynamically adjusting the counter period so that samples occur
at approximately the target frequency. Period adjustment is done in 2
places:
 - when the counter overflows (and a sample is recorded)
 - each timer tick, when the event is active
The latter case is slightly flawed because it assumes that the time since
the last timer-tick period adjustment is 1 tick, whereas the event may not
have been active (e.g. for a task that is sleeping).

Fix by using jiffies to determine the elapsed time in that case.
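
In condensed form (a paraphrase of the hunk below, not standalone code), the
tick path changes from a fixed assumption to a measured elapsed time:

	/* before: the elapsed time is assumed to be exactly one tick */
	period = TICK_NSEC;

	/* after: measure the time since the last tick adjustment */
	tick_stamp = jiffies64_to_nsecs(get_jiffies_64());
	period = tick_stamp - hwc->freq_tick_stamp;
	hwc->freq_tick_stamp = tick_stamp;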

Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
---
 include/linux/perf_event.h |  1 +
 kernel/events/core.c       | 11 ++++++++---
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1a8942277dda..d29b7cf971a1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -265,6 +265,7 @@ struct hw_perf_event {
 	 * State for freq target events, see __perf_event_overflow() and
 	 * perf_adjust_freq_unthr_context().
 	 */
+	u64				freq_tick_stamp;
 	u64				freq_time_stamp;
 	u64				freq_count_stamp;
 #endif
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a9395bbfd4aa..86e80e3ef6ac 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -55,6 +55,7 @@
 #include <linux/pgtable.h>
 #include <linux/buildid.h>
 #include <linux/task_work.h>
+#include <linux/jiffies.h>
 
 #include "internal.h"
 
@@ -4120,7 +4121,7 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
 {
 	struct perf_event *event;
 	struct hw_perf_event *hwc;
-	u64 now, period = TICK_NSEC;
+	u64 now, period, tick_stamp;
 	s64 delta;
 
 	list_for_each_entry(event, event_list, active_list) {
@@ -4148,6 +4149,10 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
 		 */
 		event->pmu->stop(event, PERF_EF_UPDATE);
 
+		tick_stamp = jiffies64_to_nsecs(get_jiffies_64());
+		period = tick_stamp - hwc->freq_tick_stamp;
+		hwc->freq_tick_stamp = tick_stamp;
+
 		now = local64_read(&event->count);
 		delta = now - hwc->freq_count_stamp;
 		hwc->freq_count_stamp = now;
@@ -4157,9 +4162,9 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
 		 * reload only if value has changed
 		 * we have stopped the event so tell that
 		 * to perf_adjust_period() to avoid stopping it
-		 * twice.
+		 * twice. And skip if it is the first tick adjust period.
 		 */
-		if (delta > 0)
+		if (delta > 0 && likely(period != tick_stamp))
 			perf_adjust_period(event, period, delta, false);
 
 		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 2/2] perf/core: Fix incorrect time diff in tick adjust period
  2024-08-21 13:42 ` [PATCH v4 2/2] perf/core: Fix incorrect time diff in tick adjust period Luo Gengkun
@ 2024-08-22 18:23   ` Adrian Hunter
  2024-08-27 16:42   ` Liang, Kan
  1 sibling, 0 replies; 12+ messages in thread
From: Adrian Hunter @ 2024-08-22 18:23 UTC (permalink / raw)
  To: Luo Gengkun, peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, kan.liang, linux-perf-users, linux-kernel

On 21/08/24 16:42, Luo Gengkun wrote:
> Perf events has the notion of sampling frequency which is implemented in
> software by dynamically adjusting the counter period so that samples occur
> at approximately the target frequency.  Period adjustment is done in 2
> places:
>  - when the counter overflows (and a sample is recorded)
>  - each timer tick, when the event is active
> The later case is slightly flawed because it assumes that the time since
> the last timer-tick period adjustment is 1 tick, whereas the event may not
> have been active (e.g. for a task that is sleeping).
> 
> Fix by using jiffies to determine the elapsed time in that case.
> 
> Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>

Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>

> ---
>  include/linux/perf_event.h |  1 +
>  kernel/events/core.c       | 11 ++++++++---
>  2 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 1a8942277dda..d29b7cf971a1 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -265,6 +265,7 @@ struct hw_perf_event {
>  	 * State for freq target events, see __perf_event_overflow() and
>  	 * perf_adjust_freq_unthr_context().
>  	 */
> +	u64				freq_tick_stamp;
>  	u64				freq_time_stamp;
>  	u64				freq_count_stamp;
>  #endif
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index a9395bbfd4aa..86e80e3ef6ac 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -55,6 +55,7 @@
>  #include <linux/pgtable.h>
>  #include <linux/buildid.h>
>  #include <linux/task_work.h>
> +#include <linux/jiffies.h>
>  
>  #include "internal.h"
>  
> @@ -4120,7 +4121,7 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>  {
>  	struct perf_event *event;
>  	struct hw_perf_event *hwc;
> -	u64 now, period = TICK_NSEC;
> +	u64 now, period, tick_stamp;
>  	s64 delta;
>  
>  	list_for_each_entry(event, event_list, active_list) {
> @@ -4148,6 +4149,10 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>  		 */
>  		event->pmu->stop(event, PERF_EF_UPDATE);
>  
> +		tick_stamp = jiffies64_to_nsecs(get_jiffies_64());
> +		period = tick_stamp - hwc->freq_tick_stamp;
> +		hwc->freq_tick_stamp = tick_stamp;
> +
>  		now = local64_read(&event->count);
>  		delta = now - hwc->freq_count_stamp;
>  		hwc->freq_count_stamp = now;
> @@ -4157,9 +4162,9 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>  		 * reload only if value has changed
>  		 * we have stopped the event so tell that
>  		 * to perf_adjust_period() to avoid stopping it
> -		 * twice.
> +		 * twice. And skip if it is the first tick adjust period.
>  		 */
> -		if (delta > 0)
> +		if (delta > 0 && likely(period != tick_stamp))
>  			perf_adjust_period(event, period, delta, false);
>  
>  		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 1/2] perf/core: Fix small negative period being ignored
  2024-08-21 13:42 ` [PATCH v4 1/2] perf/core: Fix small negative period being ignored Luo Gengkun
@ 2024-08-27 16:32   ` Liang, Kan
  0 siblings, 0 replies; 12+ messages in thread
From: Liang, Kan @ 2024-08-27 16:32 UTC (permalink / raw)
  To: Luo Gengkun, peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, linux-perf-users, linux-kernel



On 2024-08-21 9:42 a.m., Luo Gengkun wrote:
> In perf_adjust_period, we will first calculate period, and then use
> this period to calculate delta. However, when delta is less than 0,
> there will be a deviation compared to when delta is greater than or
> equal to 0. For example, when delta is in the range of [-14,-1], the
> range of delta = delta + 7 is between [-7,6], so the final value of
> delta/8 is 0. Therefore, the impact of -1 and -2 will be ignored.
> This is unacceptable when the target period is very short, because
> we will lose a lot of samples.
> 
> Here are some tests and analyzes:
> before:
>   # perf record -e cs -F 1000  ./a.out
>   [ perf record: Woken up 1 times to write data ]
>   [ perf record: Captured and wrote 0.022 MB perf.data (518 samples) ]
> 
>   # perf script
>   ...
>   a.out     396   257.956048:         23 cs:  ffffffff81f4eeec schedul>
>   a.out     396   257.957891:         23 cs:  ffffffff81f4eeec schedul>
>   a.out     396   257.959730:         23 cs:  ffffffff81f4eeec schedul>
>   a.out     396   257.961545:         23 cs:  ffffffff81f4eeec schedul>
>   a.out     396   257.963355:         23 cs:  ffffffff81f4eeec schedul>
>   a.out     396   257.965163:         23 cs:  ffffffff81f4eeec schedul>
>   a.out     396   257.966973:         23 cs:  ffffffff81f4eeec schedul>
>   a.out     396   257.968785:         23 cs:  ffffffff81f4eeec schedul>
>   a.out     396   257.970593:         23 cs:  ffffffff81f4eeec schedul>
>   ...
> 
> after:
>   # perf record -e cs -F 1000  ./a.out
>   [ perf record: Woken up 1 times to write data ]
>   [ perf record: Captured and wrote 0.058 MB perf.data (1466 samples) ]
> 
>   # perf script
>   ...
>   a.out     395    59.338813:         11 cs:  ffffffff81f4eeec schedul>
>   a.out     395    59.339707:         12 cs:  ffffffff81f4eeec schedul>
>   a.out     395    59.340682:         13 cs:  ffffffff81f4eeec schedul>
>   a.out     395    59.341751:         13 cs:  ffffffff81f4eeec schedul>
>   a.out     395    59.342799:         12 cs:  ffffffff81f4eeec schedul>
>   a.out     395    59.343765:         11 cs:  ffffffff81f4eeec schedul>
>   a.out     395    59.344651:         11 cs:  ffffffff81f4eeec schedul>
>   a.out     395    59.345539:         12 cs:  ffffffff81f4eeec schedul>
>   a.out     395    59.346502:         13 cs:  ffffffff81f4eeec schedul>
>   ...
> 
> test.c
> 
> int main() {
>         for (int i = 0; i < 20000; i++)
>                 usleep(10);
> 
>         return 0;
> }
> 
>   # time ./a.out
>   real    0m1.583s
>   user    0m0.040s
>   sys     0m0.298s
> 
> The above results were tested on x86-64 qemu with KVM enabled using
> test.c as test program. Ideally, we should have around 1500 samples,
> but the previous algorithm had only about 500, whereas the modified
> algorithm now has about 1400. Further more, the new version shows 1
> sample per 0.001s, while the previous one is 1 sample per 0.002s.This
> indicates that the new algorithm is more sensitive to small negative
> values compared to old algorithm.
> 
> Fixes: bd2b5b12849a ("perf_counter: More aggressive frequency adjustment")
> Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
> Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>

You may want to Cc stable to backport it to the LTS kernel.

Reviewed-by: Kan Liang <kan.liang@linux.intel.com>

Thanks,
Kan
> ---
>  kernel/events/core.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index c973e3c11e03..a9395bbfd4aa 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4092,7 +4092,11 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
>  	period = perf_calculate_period(event, nsec, count);
>  
>  	delta = (s64)(period - hwc->sample_period);
> -	delta = (delta + 7) / 8; /* low pass filter */
> +	if (delta >= 0)
> +		delta += 7;
> +	else
> +		delta -= 7;
> +	delta /= 8; /* low pass filter */
>  
>  	sample_period = hwc->sample_period + delta;
>  

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 2/2] perf/core: Fix incorrect time diff in tick adjust period
  2024-08-21 13:42 ` [PATCH v4 2/2] perf/core: Fix incorrect time diff in tick adjust period Luo Gengkun
  2024-08-22 18:23   ` Adrian Hunter
@ 2024-08-27 16:42   ` Liang, Kan
  2024-08-27 17:16     ` Adrian Hunter
  1 sibling, 1 reply; 12+ messages in thread
From: Liang, Kan @ 2024-08-27 16:42 UTC (permalink / raw)
  To: Luo Gengkun, peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, linux-perf-users, linux-kernel



On 2024-08-21 9:42 a.m., Luo Gengkun wrote:
> Perf events has the notion of sampling frequency which is implemented in
> software by dynamically adjusting the counter period so that samples occur
> at approximately the target frequency.  Period adjustment is done in 2
> places:
>  - when the counter overflows (and a sample is recorded)
>  - each timer tick, when the event is active
> The later case is slightly flawed because it assumes that the time since
> the last timer-tick period adjustment is 1 tick, whereas the event may not
> have been active (e.g. for a task that is sleeping).
>

Do you have a real-world example to demonstrate how bad it is if the
algorithm doesn't take sleep into account?

I'm not sure if introducing such complexity in the critical path is
worth it.

> Fix by using jiffies to determine the elapsed time in that case.
> 
> Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
> ---
>  include/linux/perf_event.h |  1 +
>  kernel/events/core.c       | 11 ++++++++---
>  2 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 1a8942277dda..d29b7cf971a1 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -265,6 +265,7 @@ struct hw_perf_event {
>  	 * State for freq target events, see __perf_event_overflow() and
>  	 * perf_adjust_freq_unthr_context().
>  	 */
> +	u64				freq_tick_stamp;
>  	u64				freq_time_stamp;
>  	u64				freq_count_stamp;
>  #endif
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index a9395bbfd4aa..86e80e3ef6ac 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -55,6 +55,7 @@
>  #include <linux/pgtable.h>
>  #include <linux/buildid.h>
>  #include <linux/task_work.h>
> +#include <linux/jiffies.h>
>  
>  #include "internal.h"
>  
> @@ -4120,7 +4121,7 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>  {
>  	struct perf_event *event;
>  	struct hw_perf_event *hwc;
> -	u64 now, period = TICK_NSEC;
> +	u64 now, period, tick_stamp;
>  	s64 delta;
>  
>  	list_for_each_entry(event, event_list, active_list) {
> @@ -4148,6 +4149,10 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>  		 */
>  		event->pmu->stop(event, PERF_EF_UPDATE);
>  
> +		tick_stamp = jiffies64_to_nsecs(get_jiffies_64());

Seems it only needs to retrieve the time once at the beginning, not for
each event.

There is a perf_clock(). It's better to use it for consistency.

Thanks,
Kan
> +		period = tick_stamp - hwc->freq_tick_stamp;
> +		hwc->freq_tick_stamp = tick_stamp;
> +
>  		now = local64_read(&event->count);
>  		delta = now - hwc->freq_count_stamp;
>  		hwc->freq_count_stamp = now;
> @@ -4157,9 +4162,9 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>  		 * reload only if value has changed
>  		 * we have stopped the event so tell that
>  		 * to perf_adjust_period() to avoid stopping it
> -		 * twice.
> +		 * twice. And skip if it is the first tick adjust period.
>  		 */
> -		if (delta > 0)
> +		if (delta > 0 && likely(period != tick_stamp))
>  			perf_adjust_period(event, period, delta, false);>
>  		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 2/2] perf/core: Fix incorrect time diff in tick adjust period
  2024-08-27 16:42   ` Liang, Kan
@ 2024-08-27 17:16     ` Adrian Hunter
  2024-08-27 20:06       ` Liang, Kan
  0 siblings, 1 reply; 12+ messages in thread
From: Adrian Hunter @ 2024-08-27 17:16 UTC (permalink / raw)
  To: Liang, Kan, Luo Gengkun, peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, linux-perf-users, linux-kernel

On 27/08/24 19:42, Liang, Kan wrote:
> 
> 
> On 2024-08-21 9:42 a.m., Luo Gengkun wrote:
>> Perf events has the notion of sampling frequency which is implemented in
>> software by dynamically adjusting the counter period so that samples occur
>> at approximately the target frequency.  Period adjustment is done in 2
>> places:
>>  - when the counter overflows (and a sample is recorded)
>>  - each timer tick, when the event is active
>> The later case is slightly flawed because it assumes that the time since
>> the last timer-tick period adjustment is 1 tick, whereas the event may not
>> have been active (e.g. for a task that is sleeping).
>>
> 
> Do you have a real-world example to demonstrate how bad it is if the
> algorithm doesn't take sleep into account?
> 
> I'm not sure if introducing such complexity in the critical path is
> worth it.
> 
>> Fix by using jiffies to determine the elapsed time in that case.
>>
>> Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
>> ---
>>  include/linux/perf_event.h |  1 +
>>  kernel/events/core.c       | 11 ++++++++---
>>  2 files changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 1a8942277dda..d29b7cf971a1 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -265,6 +265,7 @@ struct hw_perf_event {
>>  	 * State for freq target events, see __perf_event_overflow() and
>>  	 * perf_adjust_freq_unthr_context().
>>  	 */
>> +	u64				freq_tick_stamp;
>>  	u64				freq_time_stamp;
>>  	u64				freq_count_stamp;
>>  #endif
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index a9395bbfd4aa..86e80e3ef6ac 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -55,6 +55,7 @@
>>  #include <linux/pgtable.h>
>>  #include <linux/buildid.h>
>>  #include <linux/task_work.h>
>> +#include <linux/jiffies.h>
>>  
>>  #include "internal.h"
>>  
>> @@ -4120,7 +4121,7 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>>  {
>>  	struct perf_event *event;
>>  	struct hw_perf_event *hwc;
>> -	u64 now, period = TICK_NSEC;
>> +	u64 now, period, tick_stamp;
>>  	s64 delta;
>>  
>>  	list_for_each_entry(event, event_list, active_list) {
>> @@ -4148,6 +4149,10 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>>  		 */
>>  		event->pmu->stop(event, PERF_EF_UPDATE);
>>  
>> +		tick_stamp = jiffies64_to_nsecs(get_jiffies_64());
> 
> Seems it only needs to retrieve the time once at the beginning, not for
> each event.
> 
> There is a perf_clock(). It's better to use it for the consistency.

perf_clock() is much slower, and for statistical sampling it doesn't
have to be perfect.

> 
> Thanks,
> Kan
>> +		period = tick_stamp - hwc->freq_tick_stamp;
>> +		hwc->freq_tick_stamp = tick_stamp;
>> +
>>  		now = local64_read(&event->count);
>>  		delta = now - hwc->freq_count_stamp;
>>  		hwc->freq_count_stamp = now;
>> @@ -4157,9 +4162,9 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>>  		 * reload only if value has changed
>>  		 * we have stopped the event so tell that
>>  		 * to perf_adjust_period() to avoid stopping it
>> -		 * twice.
>> +		 * twice. And skip if it is the first tick adjust period.
>>  		 */
>> -		if (delta > 0)
>> +		if (delta > 0 && likely(period != tick_stamp))
>>  			perf_adjust_period(event, period, delta, false);
>>
>>  		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 2/2] perf/core: Fix incorrect time diff in tick adjust period
  2024-08-27 17:16     ` Adrian Hunter
@ 2024-08-27 20:06       ` Liang, Kan
  2024-08-28  1:10         ` Adrian Hunter
  0 siblings, 1 reply; 12+ messages in thread
From: Liang, Kan @ 2024-08-27 20:06 UTC (permalink / raw)
  To: Adrian Hunter, Luo Gengkun, peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, linux-perf-users, linux-kernel



On 2024-08-27 1:16 p.m., Adrian Hunter wrote:
> On 27/08/24 19:42, Liang, Kan wrote:
>>
>>
>> On 2024-08-21 9:42 a.m., Luo Gengkun wrote:
>>> Perf events has the notion of sampling frequency which is implemented in
>>> software by dynamically adjusting the counter period so that samples occur
>>> at approximately the target frequency.  Period adjustment is done in 2
>>> places:
>>>  - when the counter overflows (and a sample is recorded)
>>>  - each timer tick, when the event is active
>>> The later case is slightly flawed because it assumes that the time since
>>> the last timer-tick period adjustment is 1 tick, whereas the event may not
>>> have been active (e.g. for a task that is sleeping).
>>>
>>
>> Do you have a real-world example to demonstrate how bad it is if the
>> algorithm doesn't take sleep into account?
>>
>> I'm not sure if introducing such complexity in the critical path is
>> worth it.
>>
>>> Fix by using jiffies to determine the elapsed time in that case.
>>>
>>> Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
>>> ---
>>>  include/linux/perf_event.h |  1 +
>>>  kernel/events/core.c       | 11 ++++++++---
>>>  2 files changed, 9 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>> index 1a8942277dda..d29b7cf971a1 100644
>>> --- a/include/linux/perf_event.h
>>> +++ b/include/linux/perf_event.h
>>> @@ -265,6 +265,7 @@ struct hw_perf_event {
>>>  	 * State for freq target events, see __perf_event_overflow() and
>>>  	 * perf_adjust_freq_unthr_context().
>>>  	 */
>>> +	u64				freq_tick_stamp;
>>>  	u64				freq_time_stamp;
>>>  	u64				freq_count_stamp;
>>>  #endif
>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>> index a9395bbfd4aa..86e80e3ef6ac 100644
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -55,6 +55,7 @@
>>>  #include <linux/pgtable.h>
>>>  #include <linux/buildid.h>
>>>  #include <linux/task_work.h>
>>> +#include <linux/jiffies.h>
>>>  
>>>  #include "internal.h"
>>>  
>>> @@ -4120,7 +4121,7 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>>>  {
>>>  	struct perf_event *event;
>>>  	struct hw_perf_event *hwc;
>>> -	u64 now, period = TICK_NSEC;
>>> +	u64 now, period, tick_stamp;
>>>  	s64 delta;
>>>  
>>>  	list_for_each_entry(event, event_list, active_list) {
>>> @@ -4148,6 +4149,10 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>>>  		 */
>>>  		event->pmu->stop(event, PERF_EF_UPDATE);
>>>  
>>> +		tick_stamp = jiffies64_to_nsecs(get_jiffies_64());
>>
>> Seems it only needs to retrieve the time once at the beginning, not for
>> each event.
>>
>> There is a perf_clock(). It's better to use it for the consistency.
> 
> perf_clock() is much slower, and for statistical sampling it doesn't
> have to be perfect.

Because of rdtsc?

If it is only used here, it should be fine. What I'm worried about is
that someone may use it with other timestamps in perf later. Anyway, it's
not a big deal.

The main concern I have is: do we really need the patch?
It seems it can only bring us a better guess of the period for the sleep
test. Then we have to do all the calculation for each tick.

Thanks,
Kan
> 
>>
>> Thanks,
>> Kan
>>> +		period = tick_stamp - hwc->freq_tick_stamp;
>>> +		hwc->freq_tick_stamp = tick_stamp;
>>> +
>>>  		now = local64_read(&event->count);
>>>  		delta = now - hwc->freq_count_stamp;
>>>  		hwc->freq_count_stamp = now;
>>> @@ -4157,9 +4162,9 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>>>  		 * reload only if value has changed
>>>  		 * we have stopped the event so tell that
>>>  		 * to perf_adjust_period() to avoid stopping it
>>> -		 * twice.
>>> +		 * twice. And skip if it is the first tick adjust period.
>>>  		 */
>>> -		if (delta > 0)
>>> +		if (delta > 0 && likely(period != tick_stamp))
>>>  			perf_adjust_period(event, period, delta, false);
>>>
>>>  		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 2/2] perf/core: Fix incorrect time diff in tick adjust period
  2024-08-27 20:06       ` Liang, Kan
@ 2024-08-28  1:10         ` Adrian Hunter
  2024-08-29 13:46           ` Liang, Kan
  0 siblings, 1 reply; 12+ messages in thread
From: Adrian Hunter @ 2024-08-28  1:10 UTC (permalink / raw)
  To: Liang, Kan, Luo Gengkun, peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, linux-perf-users, linux-kernel

On 27/08/24 23:06, Liang, Kan wrote:
> 
> 
> On 2024-08-27 1:16 p.m., Adrian Hunter wrote:
>> On 27/08/24 19:42, Liang, Kan wrote:
>>>
>>>
>>> On 2024-08-21 9:42 a.m., Luo Gengkun wrote:
>>>> Perf events has the notion of sampling frequency which is implemented in
>>>> software by dynamically adjusting the counter period so that samples occur
>>>> at approximately the target frequency.  Period adjustment is done in 2
>>>> places:
>>>>  - when the counter overflows (and a sample is recorded)
>>>>  - each timer tick, when the event is active
>>>> The later case is slightly flawed because it assumes that the time since
>>>> the last timer-tick period adjustment is 1 tick, whereas the event may not
>>>> have been active (e.g. for a task that is sleeping).
>>>>
>>>
>>> Do you have a real-world example to demonstrate how bad it is if the
>>> algorithm doesn't take sleep into account?
>>>
>>> I'm not sure if introducing such complexity in the critical path is
>>> worth it.
>>>
>>>> Fix by using jiffies to determine the elapsed time in that case.
>>>>
>>>> Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
>>>> ---
>>>>  include/linux/perf_event.h |  1 +
>>>>  kernel/events/core.c       | 11 ++++++++---
>>>>  2 files changed, 9 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>>> index 1a8942277dda..d29b7cf971a1 100644
>>>> --- a/include/linux/perf_event.h
>>>> +++ b/include/linux/perf_event.h
>>>> @@ -265,6 +265,7 @@ struct hw_perf_event {
>>>>  	 * State for freq target events, see __perf_event_overflow() and
>>>>  	 * perf_adjust_freq_unthr_context().
>>>>  	 */
>>>> +	u64				freq_tick_stamp;
>>>>  	u64				freq_time_stamp;
>>>>  	u64				freq_count_stamp;
>>>>  #endif
>>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>>> index a9395bbfd4aa..86e80e3ef6ac 100644
>>>> --- a/kernel/events/core.c
>>>> +++ b/kernel/events/core.c
>>>> @@ -55,6 +55,7 @@
>>>>  #include <linux/pgtable.h>
>>>>  #include <linux/buildid.h>
>>>>  #include <linux/task_work.h>
>>>> +#include <linux/jiffies.h>
>>>>  
>>>>  #include "internal.h"
>>>>  
>>>> @@ -4120,7 +4121,7 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>>>>  {
>>>>  	struct perf_event *event;
>>>>  	struct hw_perf_event *hwc;
>>>> -	u64 now, period = TICK_NSEC;
>>>> +	u64 now, period, tick_stamp;
>>>>  	s64 delta;
>>>>  
>>>>  	list_for_each_entry(event, event_list, active_list) {
>>>> @@ -4148,6 +4149,10 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>>>>  		 */
>>>>  		event->pmu->stop(event, PERF_EF_UPDATE);
>>>>  
>>>> +		tick_stamp = jiffies64_to_nsecs(get_jiffies_64());
>>>
>>> Seems it only needs to retrieve the time once at the beginning, not for
>>> each event.
>>>
>>> There is a perf_clock(). It's better to use it for the consistency.
>>
>> perf_clock() is much slower, and for statistical sampling it doesn't
>> have to be perfect.
> 
> Because of rdtsc?

Yes

> 
> If it is only used here, it should be fine. What I'm worried about is
> that someone may use it with other timestamp in perf later. Anyway, it's
> not a big deal.
> 
> The main concern I have is that do we really need the patch?

The current code is wrong.

> It seems can only bring us a better guess of the period for the sleep
> test. Then we have to do all the calculate for each tick.

Or any workload that sleeps periodically.

Another option is to remove the period adjustment on tick entirely,
although arguably the calculation at a tick is better because it
probably covers a longer period.

> 
> Thanks,
> Kan
>>
>>>
>>> Thanks,
>>> Kan
>>>> +		period = tick_stamp - hwc->freq_tick_stamp;
>>>> +		hwc->freq_tick_stamp = tick_stamp;
>>>> +
>>>>  		now = local64_read(&event->count);
>>>>  		delta = now - hwc->freq_count_stamp;
>>>>  		hwc->freq_count_stamp = now;
>>>> @@ -4157,9 +4162,9 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>>>>  		 * reload only if value has changed
>>>>  		 * we have stopped the event so tell that
>>>>  		 * to perf_adjust_period() to avoid stopping it
>>>> -		 * twice.
>>>> +		 * twice. And skip if it is the first tick adjust period.
>>>>  		 */
>>>> -		if (delta > 0)
>>>> +		if (delta > 0 && likely(period != tick_stamp))
>>>>  			perf_adjust_period(event, period, delta, false);
>>>>
>>>>  		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
>>
>>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 2/2] perf/core: Fix incorrect time diff in tick adjust period
  2024-08-28  1:10         ` Adrian Hunter
@ 2024-08-29 13:46           ` Liang, Kan
  2024-08-29 14:19             ` Luo Gengkun
  0 siblings, 1 reply; 12+ messages in thread
From: Liang, Kan @ 2024-08-29 13:46 UTC (permalink / raw)
  To: Adrian Hunter, Luo Gengkun, peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, linux-perf-users, linux-kernel



On 2024-08-27 9:10 p.m., Adrian Hunter wrote:
> On 27/08/24 23:06, Liang, Kan wrote:
>>
>>
>> On 2024-08-27 1:16 p.m., Adrian Hunter wrote:
>>> On 27/08/24 19:42, Liang, Kan wrote:
>>>>
>>>>
>>>> On 2024-08-21 9:42 a.m., Luo Gengkun wrote:
>>>>> Perf events has the notion of sampling frequency which is implemented in
>>>>> software by dynamically adjusting the counter period so that samples occur
>>>>> at approximately the target frequency.  Period adjustment is done in 2
>>>>> places:
>>>>>  - when the counter overflows (and a sample is recorded)
>>>>>  - each timer tick, when the event is active
>>>>> The later case is slightly flawed because it assumes that the time since
>>>>> the last timer-tick period adjustment is 1 tick, whereas the event may not
>>>>> have been active (e.g. for a task that is sleeping).
>>>>>
>>>>
>>>> Do you have a real-world example to demonstrate how bad it is if the
>>>> algorithm doesn't take sleep into account?
>>>>
>>>> I'm not sure if introducing such complexity in the critical path is
>>>> worth it.
>>>>
>>>>> Fix by using jiffies to determine the elapsed time in that case.
>>>>>
>>>>> Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
>>>>> ---
>>>>>  include/linux/perf_event.h |  1 +
>>>>>  kernel/events/core.c       | 11 ++++++++---
>>>>>  2 files changed, 9 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>>>> index 1a8942277dda..d29b7cf971a1 100644
>>>>> --- a/include/linux/perf_event.h
>>>>> +++ b/include/linux/perf_event.h
>>>>> @@ -265,6 +265,7 @@ struct hw_perf_event {
>>>>>  	 * State for freq target events, see __perf_event_overflow() and
>>>>>  	 * perf_adjust_freq_unthr_context().
>>>>>  	 */
>>>>> +	u64				freq_tick_stamp;
>>>>>  	u64				freq_time_stamp;
>>>>>  	u64				freq_count_stamp;
>>>>>  #endif
>>>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>>>> index a9395bbfd4aa..86e80e3ef6ac 100644
>>>>> --- a/kernel/events/core.c
>>>>> +++ b/kernel/events/core.c
>>>>> @@ -55,6 +55,7 @@
>>>>>  #include <linux/pgtable.h>
>>>>>  #include <linux/buildid.h>
>>>>>  #include <linux/task_work.h>
>>>>> +#include <linux/jiffies.h>
>>>>>  
>>>>>  #include "internal.h"
>>>>>  
>>>>> @@ -4120,7 +4121,7 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>>>>>  {
>>>>>  	struct perf_event *event;
>>>>>  	struct hw_perf_event *hwc;
>>>>> -	u64 now, period = TICK_NSEC;
>>>>> +	u64 now, period, tick_stamp;
>>>>>  	s64 delta;
>>>>>  
>>>>>  	list_for_each_entry(event, event_list, active_list) {
>>>>> @@ -4148,6 +4149,10 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>>>>>  		 */
>>>>>  		event->pmu->stop(event, PERF_EF_UPDATE);
>>>>>  
>>>>> +		tick_stamp = jiffies64_to_nsecs(get_jiffies_64());
>>>>
>>>> Seems it only needs to retrieve the time once at the beginning, not for
>>>> each event.
>>>>
>>>> There is a perf_clock(). It's better to use it for the consistency.
>>>
>>> perf_clock() is much slower, and for statistical sampling it doesn't
>>> have to be perfect.
>>
>> Because of rdtsc?
> 
> Yes

OK. I'm not worried about it too much as long as it's only invoked once
per tick.

> 
>>
>> If it is only used here, it should be fine. What I'm worried about is
>> that someone may use it with other timestamp in perf later. Anyway, it's
>> not a big deal.
>>
>> The main concern I have is that do we really need the patch?
> 
> The current code is wrong.
> 
>> It seems can only bring us a better guess of the period for the sleep
>> test. Then we have to do all the calculate for each tick.
> 
> Or any workload that sleeps periodically.
> 
> Another option is to remove the period adjust on tick entirely.
> Although arguably the calculation at a tick is better because
> it probably covers a longer period.

Or we may remove the period adjustment on overflow.

As I understand it, the period adjustment on overflow is there to handle the
case where the overflow happens very frequently (< 2 ticks). It is mainly
caused by the very low start period (1).
I'm working on a patch to set a larger start period, which should
minimize the use of the period adjustment on overflow.

Anyway, based on the current code, I agree that adding a new
freq_tick_stamp is required. But it doesn't need to read the time
for each event. I think reading the time once at the beginning should be
good enough for the period adjust/estimate algorithm.

Thanks,
Kan

> 
>>
>> Thanks,
>> Kan
>>>
>>>>
>>>> Thanks,
>>>> Kan
>>>>> +		period = tick_stamp - hwc->freq_tick_stamp;
>>>>> +		hwc->freq_tick_stamp = tick_stamp;
>>>>> +
>>>>>  		now = local64_read(&event->count);
>>>>>  		delta = now - hwc->freq_count_stamp;
>>>>>  		hwc->freq_count_stamp = now;
>>>>> @@ -4157,9 +4162,9 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>>>>>  		 * reload only if value has changed
>>>>>  		 * we have stopped the event so tell that
>>>>>  		 * to perf_adjust_period() to avoid stopping it
>>>>> -		 * twice.
>>>>> +		 * twice. And skip if it is the first tick adjust period.
>>>>>  		 */
>>>>> -		if (delta > 0)
>>>>> +		if (delta > 0 && likely(period != tick_stamp))
>>>>>  			perf_adjust_period(event, period, delta, false);
>>>>>
>>>>>  		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
>>>
>>>
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 2/2] perf/core: Fix incorrect time diff in tick adjust period
  2024-08-29 13:46           ` Liang, Kan
@ 2024-08-29 14:19             ` Luo Gengkun
  2024-08-29 14:30               ` Liang, Kan
  0 siblings, 1 reply; 12+ messages in thread
From: Luo Gengkun @ 2024-08-29 14:19 UTC (permalink / raw)
  To: Liang, Kan, Adrian Hunter, peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, linux-perf-users, linux-kernel


On 2024/8/29 21:46, Liang, Kan wrote:
>
> On 2024-08-27 9:10 p.m., Adrian Hunter wrote:
>> On 27/08/24 23:06, Liang, Kan wrote:
>>>
>>> On 2024-08-27 1:16 p.m., Adrian Hunter wrote:
>>>> On 27/08/24 19:42, Liang, Kan wrote:
>>>>>
>>>>> On 2024-08-21 9:42 a.m., Luo Gengkun wrote:
>>>>>> Perf events has the notion of sampling frequency which is implemented in
>>>>>> software by dynamically adjusting the counter period so that samples occur
>>>>>> at approximately the target frequency.  Period adjustment is done in 2
>>>>>> places:
>>>>>>   - when the counter overflows (and a sample is recorded)
>>>>>>   - each timer tick, when the event is active
>>>>>> The later case is slightly flawed because it assumes that the time since
>>>>>> the last timer-tick period adjustment is 1 tick, whereas the event may not
>>>>>> have been active (e.g. for a task that is sleeping).
>>>>>>
>>>>> Do you have a real-world example to demonstrate how bad it is if the
>>>>> algorithm doesn't take sleep into account?
>>>>>
>>>>> I'm not sure if introducing such complexity in the critical path is
>>>>> worth it.
>>>>>
>>>>>> Fix by using jiffies to determine the elapsed time in that case.
>>>>>>
>>>>>> Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
>>>>>> ---
>>>>>>   include/linux/perf_event.h |  1 +
>>>>>>   kernel/events/core.c       | 11 ++++++++---
>>>>>>   2 files changed, 9 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>>>>> index 1a8942277dda..d29b7cf971a1 100644
>>>>>> --- a/include/linux/perf_event.h
>>>>>> +++ b/include/linux/perf_event.h
>>>>>> @@ -265,6 +265,7 @@ struct hw_perf_event {
>>>>>>   	 * State for freq target events, see __perf_event_overflow() and
>>>>>>   	 * perf_adjust_freq_unthr_context().
>>>>>>   	 */
>>>>>> +	u64				freq_tick_stamp;
>>>>>>   	u64				freq_time_stamp;
>>>>>>   	u64				freq_count_stamp;
>>>>>>   #endif
>>>>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>>>>> index a9395bbfd4aa..86e80e3ef6ac 100644
>>>>>> --- a/kernel/events/core.c
>>>>>> +++ b/kernel/events/core.c
>>>>>> @@ -55,6 +55,7 @@
>>>>>>   #include <linux/pgtable.h>
>>>>>>   #include <linux/buildid.h>
>>>>>>   #include <linux/task_work.h>
>>>>>> +#include <linux/jiffies.h>
>>>>>>   
>>>>>>   #include "internal.h"
>>>>>>   
>>>>>> @@ -4120,7 +4121,7 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>>>>>>   {
>>>>>>   	struct perf_event *event;
>>>>>>   	struct hw_perf_event *hwc;
>>>>>> -	u64 now, period = TICK_NSEC;
>>>>>> +	u64 now, period, tick_stamp;
>>>>>>   	s64 delta;
>>>>>>   
>>>>>>   	list_for_each_entry(event, event_list, active_list) {
>>>>>> @@ -4148,6 +4149,10 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>>>>>>   		 */
>>>>>>   		event->pmu->stop(event, PERF_EF_UPDATE);
>>>>>>   
>>>>>> +		tick_stamp = jiffies64_to_nsecs(get_jiffies_64());
>>>>> Seems it only needs to retrieve the time once at the beginning, not for
>>>>> each event.
>>>>>
>>>>> There is a perf_clock(). It's better to use it for the consistency.
>>>> perf_clock() is much slower, and for statistical sampling it doesn't
>>>> have to be perfect.
>>> Because of rdtsc?
>> Yes
> OK. I'm not worry about it too much as long as it's only invoked once in
> each tick.
>
>>> If it is only used here, it should be fine. What I'm worried about is
>>> that someone may use it with other timestamp in perf later. Anyway, it's
>>> not a big deal.
>>>
>>> The main concern I have is that do we really need the patch?
>> The current code is wrong.
>>
>>> It seems can only bring us a better guess of the period for the sleep
>>> test. Then we have to do all the calculate for each tick.
>> Or any workload that sleeps periodically.
>>
>> Another option is to remove the period adjust on tick entirely.
>> Although arguably the calculation at a tick is better because
>> it probably covers a longer period.
> Or we may remove the period adjust on overflow.
>
> As my understanding, the period adjust on overflow is to handle the case
> while the overflow happens very frequently (< 2 ticks). It is mainly
> caused by the very low start period (1).
> I'm working on a patch to set a larger start period, which should
> minimize the usage of the period adjust on overflow.
I think it's hard to choose a nice initial period; it may require a lot
of testing. Good luck.
>
> Anyway, based on the current code, I agree that adding a new
> freq_tick_stamp should be required. But it doesn't need to read the time
> for each event. I think reading the time once at the beginning should be
> good enough for the period adjust/estimate algorithm.

That's a good idea. Do you think it's appropriate to move this line here?


Thanks,

Gengkun

@@ -4126,6 +4126,8 @@ perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)

         raw_spin_lock(&ctx->lock);

+       tick_stamp = jiffies64_to_nsecs(get_jiffies_64());
+
         list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
                 if (event->state != PERF_EVENT_STATE_ACTIVE)
                         continue;
@@ -4152,7 +4154,6 @@ perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
                  */
                 event->pmu->stop(event, PERF_EF_UPDATE);

-               tick_stamp = jiffies64_to_nsecs(get_jiffies_64());
                 period = tick_stamp - hwc->freq_tick_stamp;
                 hwc->freq_tick_stamp = tick_stamp;

>
> Thanks,
> Kan
>
>>> Thanks,
>>> Kan
>>>>> Thanks,
>>>>> Kan
>>>>>> +		period = tick_stamp - hwc->freq_tick_stamp;
>>>>>> +		hwc->freq_tick_stamp = tick_stamp;
>>>>>> +
>>>>>>   		now = local64_read(&event->count);
>>>>>>   		delta = now - hwc->freq_count_stamp;
>>>>>>   		hwc->freq_count_stamp = now;
>>>>>> @@ -4157,9 +4162,9 @@ static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>>>>>>   		 * reload only if value has changed
>>>>>>   		 * we have stopped the event so tell that
>>>>>>   		 * to perf_adjust_period() to avoid stopping it
>>>>>> -		 * twice.
>>>>>> +		 * twice. And skip if it is the first tick adjust period.
>>>>>>   		 */
>>>>>> -		if (delta > 0)
>>>>>> +		if (delta > 0 && likely(period != tick_stamp))
>>>>>>   			perf_adjust_period(event, period, delta, false);
>>>>>>
>>>>>>   		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
>>>>
>>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 2/2] perf/core: Fix incorrect time diff in tick adjust period
  2024-08-29 14:19             ` Luo Gengkun
@ 2024-08-29 14:30               ` Liang, Kan
  0 siblings, 0 replies; 12+ messages in thread
From: Liang, Kan @ 2024-08-29 14:30 UTC (permalink / raw)
  To: Luo Gengkun, Adrian Hunter, peterz
  Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
	irogers, linux-perf-users, linux-kernel



On 2024-08-29 10:19 a.m., Luo Gengkun wrote:
> 
> On 2024/8/29 21:46, Liang, Kan wrote:
>>
>> On 2024-08-27 9:10 p.m., Adrian Hunter wrote:
>>> On 27/08/24 23:06, Liang, Kan wrote:
>>>>
>>>> On 2024-08-27 1:16 p.m., Adrian Hunter wrote:
>>>>> On 27/08/24 19:42, Liang, Kan wrote:
>>>>>>
>>>>>> On 2024-08-21 9:42 a.m., Luo Gengkun wrote:
>>>>>>> Perf events has the notion of sampling frequency which is
>>>>>>> implemented in
>>>>>>> software by dynamically adjusting the counter period so that
>>>>>>> samples occur
>>>>>>> at approximately the target frequency.  Period adjustment is done
>>>>>>> in 2
>>>>>>> places:
>>>>>>>   - when the counter overflows (and a sample is recorded)
>>>>>>>   - each timer tick, when the event is active
>>>>>>> The later case is slightly flawed because it assumes that the
>>>>>>> time since
>>>>>>> the last timer-tick period adjustment is 1 tick, whereas the
>>>>>>> event may not
>>>>>>> have been active (e.g. for a task that is sleeping).
>>>>>>>
>>>>>> Do you have a real-world example to demonstrate how bad it is if the
>>>>>> algorithm doesn't take sleep into account?
>>>>>>
>>>>>> I'm not sure if introducing such complexity in the critical path is
>>>>>> worth it.
>>>>>>
>>>>>>> Fix by using jiffies to determine the elapsed time in that case.
>>>>>>>
>>>>>>> Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
>>>>>>> ---
>>>>>>>   include/linux/perf_event.h |  1 +
>>>>>>>   kernel/events/core.c       | 11 ++++++++---
>>>>>>>   2 files changed, 9 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>>>>>> index 1a8942277dda..d29b7cf971a1 100644
>>>>>>> --- a/include/linux/perf_event.h
>>>>>>> +++ b/include/linux/perf_event.h
>>>>>>> @@ -265,6 +265,7 @@ struct hw_perf_event {
>>>>>>>        * State for freq target events, see
>>>>>>> __perf_event_overflow() and
>>>>>>>        * perf_adjust_freq_unthr_context().
>>>>>>>        */
>>>>>>> +    u64                freq_tick_stamp;
>>>>>>>       u64                freq_time_stamp;
>>>>>>>       u64                freq_count_stamp;
>>>>>>>   #endif
>>>>>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>>>>>> index a9395bbfd4aa..86e80e3ef6ac 100644
>>>>>>> --- a/kernel/events/core.c
>>>>>>> +++ b/kernel/events/core.c
>>>>>>> @@ -55,6 +55,7 @@
>>>>>>>   #include <linux/pgtable.h>
>>>>>>>   #include <linux/buildid.h>
>>>>>>>   #include <linux/task_work.h>
>>>>>>> +#include <linux/jiffies.h>
>>>>>>>     #include "internal.h"
>>>>>>>   @@ -4120,7 +4121,7 @@ static void
>>>>>>> perf_adjust_freq_unthr_events(struct list_head *event_list)
>>>>>>>   {
>>>>>>>       struct perf_event *event;
>>>>>>>       struct hw_perf_event *hwc;
>>>>>>> -    u64 now, period = TICK_NSEC;
>>>>>>> +    u64 now, period, tick_stamp;
>>>>>>>       s64 delta;
>>>>>>>         list_for_each_entry(event, event_list, active_list) {
>>>>>>> @@ -4148,6 +4149,10 @@ static void
>>>>>>> perf_adjust_freq_unthr_events(struct list_head *event_list)
>>>>>>>            */
>>>>>>>           event->pmu->stop(event, PERF_EF_UPDATE);
>>>>>>>   +        tick_stamp = jiffies64_to_nsecs(get_jiffies_64());
>>>>>> Seems it only needs to retrieve the time once at the beginning,
>>>>>> not for
>>>>>> each event.
>>>>>>
>>>>>> There is a perf_clock(). It's better to use it for the consistency.
>>>>> perf_clock() is much slower, and for statistical sampling it doesn't
>>>>> have to be perfect.
>>>> Because of rdtsc?
>>> Yes
>> OK. I'm not worry about it too much as long as it's only invoked once in
>> each tick.
>>
>>>> If it is only used here, it should be fine. What I'm worried about is
>>>> that someone may use it with other timestamp in perf later. Anyway,
>>>> it's
>>>> not a big deal.
>>>>
>>>> The main concern I have is that do we really need the patch?
>>> The current code is wrong.
>>>
>>>> It seems can only bring us a better guess of the period for the sleep
>>>> test. Then we have to do all the calculate for each tick.
>>> Or any workload that sleeps periodically.
>>>
>>> Another option is to remove the period adjust on tick entirely.
>>> Although arguably the calculation at a tick is better because
>>> it probably covers a longer period.
>> Or we may remove the period adjust on overflow.
>>
>> As my understanding, the period adjust on overflow is to handle the case
>> while the overflow happens very frequently (< 2 ticks). It is mainly
>> caused by the very low start period (1).
>> I'm working on a patch to set a larger start period, which should
>> minimize the usage of the period adjust on overflow.
> I think it's hard to choose a nice initial period, it may require a lot
> of testing, good luck.
>>
>> Anyway, based on the current code, I agree that adding a new
>> freq_tick_stamp should be required. But it doesn't need to read the time
>> for each event. I think reading the time once at the beginning should be
>> good enough for the period adjust/estimate algorithm.
> 
> That's a good idea, do you think it's appropriate to move this line here?
> 
> 
> Thanks,
> 
> Gengkun
> 
> @@ -4126,6 +4126,8 @@ perf_adjust_freq_unthr_context(struct
> perf_event_context *ctx, bool unthrottle)
> 
>         raw_spin_lock(&ctx->lock);
> 
> +       tick_stamp = jiffies64_to_nsecs(get_jiffies_64());

Yes, the place looks good.

I'm still not a big fan of jiffies. Anyway, I guess we can leave it to
Peter to decide.

Thanks,
Kan
> +
>         list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
>                 if (event->state != PERF_EVENT_STATE_ACTIVE)
>                         continue;
> @@ -4152,7 +4154,6 @@ perf_adjust_freq_unthr_context(struct
> perf_event_context *ctx, bool unthrottle)
>                  */
>                 event->pmu->stop(event, PERF_EF_UPDATE);
> 
> -               tick_stamp = jiffies64_to_nsecs(get_jiffies_64());
>                 period = tick_stamp - hwc->freq_tick_stamp;
>                 hwc->freq_tick_stamp = tick_stamp;
> 
>>
>> Thanks,
>> Kan
>>
>>>> Thanks,
>>>> Kan
>>>>>> Thanks,
>>>>>> Kan
>>>>>>> +        period = tick_stamp - hwc->freq_tick_stamp;
>>>>>>> +        hwc->freq_tick_stamp = tick_stamp;
>>>>>>> +
>>>>>>>           now = local64_read(&event->count);
>>>>>>>           delta = now - hwc->freq_count_stamp;
>>>>>>>           hwc->freq_count_stamp = now;
>>>>>>> @@ -4157,9 +4162,9 @@ static void
>>>>>>> perf_adjust_freq_unthr_events(struct list_head *event_list)
>>>>>>>            * reload only if value has changed
>>>>>>>            * we have stopped the event so tell that
>>>>>>>            * to perf_adjust_period() to avoid stopping it
>>>>>>> -         * twice.
>>>>>>> +         * twice. And skip if it is the first tick adjust period.
>>>>>>>            */
>>>>>>> -        if (delta > 0)
>>>>>>> +        if (delta > 0 && likely(period != tick_stamp))
>>>>>>>               perf_adjust_period(event, period, delta, false);
>>>>>>>
>>>>>>>           event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread


Thread overview: 12+ messages
2024-08-21 13:42 [PATCH v4 0/2] Fix perf adjust period Luo Gengkun
2024-08-21 13:42 ` [PATCH v4 1/2] perf/core: Fix small negative period being ignored Luo Gengkun
2024-08-27 16:32   ` Liang, Kan
2024-08-21 13:42 ` [PATCH v4 2/2] perf/core: Fix incorrect time diff in tick adjust period Luo Gengkun
2024-08-22 18:23   ` Adrian Hunter
2024-08-27 16:42   ` Liang, Kan
2024-08-27 17:16     ` Adrian Hunter
2024-08-27 20:06       ` Liang, Kan
2024-08-28  1:10         ` Adrian Hunter
2024-08-29 13:46           ` Liang, Kan
2024-08-29 14:19             ` Luo Gengkun
2024-08-29 14:30               ` Liang, Kan
