[PATCH] Add cap_user_time aarch64
From: Michael O'Farrell @ 2018-07-13 19:24 UTC
To: linux-arm-kernel
It is useful to get the running time of a thread. Doing so in an
efficient manner can be important for the performance of user
applications. Avoiding the system call in `clock_gettime` when handling
CLOCK_THREAD_CPUTIME_ID is important: other clocks are handled in the
VDSO, but CLOCK_THREAD_CPUTIME_ID falls back on the system call.

CLOCK_THREAD_CPUTIME_ID is not handled in the VDSO because of the cost
of maintaining user-space-accessible time offsets, which would have to
be updated every time a thread is scheduled or descheduled. For
programs that regularly check the running time of a thread, however,
paying that cost is a net performance improvement.
This patch takes a middle ground and adds support for cap_user_time,
an optional feature of the perf_event API, so the costs are only
incurred when the perf_event API is in use. This is done the same way
as on x86. The patch also enables the cap_user_time_zero facility of
perf_event, since it comes for free with the stable clock.

Ultimately this allows calculating the thread running time in userspace
on aarch64 as follows (adapted from the perf_event_open manpage):

	u32 seq, time_mult, time_shift;
	u64 running, count, time_offset, quot, rem, delta;
	struct perf_event_mmap_page *pc;

	pc = buf;  // buf is the perf event mmaped page as documented in the API.

	if (pc->cap_usr_time) {
		do {
			seq = pc->lock;
			barrier();
			running = pc->time_running;

			count = readCNTVCT_EL0();  // Read ARM hardware clock.
			time_offset = pc->time_offset;
			time_mult = pc->time_mult;
			time_shift = pc->time_shift;

			barrier();
		} while (pc->lock != seq);  // Retry if the kernel touched the page.

		quot = (count >> time_shift);
		rem = count & (((u64)1 << time_shift) - 1);
		delta = time_offset + quot * time_mult +
			((rem * time_mult) >> time_shift);

		running += delta;
		// running now holds the current thread time in nanoseconds.
	}
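
The example assumes two pieces the manpage leaves implicit: how buf is
obtained and what readCNTVCT_EL0() does. Purely as an illustrative
sketch (the helper names are made up, error handling trimmed):

	#include <linux/perf_event.h>
	#include <stdint.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	// Open a dummy software event for the calling thread and map the
	// first page, which holds struct perf_event_mmap_page.
	static struct perf_event_mmap_page *open_perf_page(void)
	{
		struct perf_event_attr attr;
		void *buf;
		int fd;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_SOFTWARE;
		attr.config = PERF_COUNT_SW_DUMMY;

		// pid = 0, cpu = -1: measure this thread on any CPU.
		fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
		if (fd < 0)
			return NULL;

		buf = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ,
			   MAP_SHARED, fd, 0);
		return buf == MAP_FAILED ? NULL : buf;
	}

	// Read the virtual counter from EL0.  Linux on arm64 leaves
	// CNTVCT_EL0 accessible to userspace; the isb keeps the read
	// from being speculated ahead.
	static inline uint64_t readCNTVCT_EL0(void)
	{
		uint64_t cval;

		asm volatile("isb\n\tmrs %0, cntvct_el0" : "=r" (cval));
		return cval;
	}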
Summary of changes in the patch:

For aarch64 systems with a stable scheduler clock, make
arch_perf_update_userpage update the timing information stored in the
perf_event page. This requires the following calculations:

- Calculate the appropriate time_mult and time_shift factors to convert
  ticks to nanoseconds for the current clock frequency (a worked sketch
  follows this summary).
- Adjust the mult and shift factors to avoid shift factors of 32 bits
  (possibly unnecessary).
- Compute the time_offset userspace should apply when doing
  calculations: the negated current sched time (now), because the
  time_running and time_enabled fields of the perf_event page have just
  been updated.
- Set time_zero to zero, since we are using the stable sched_clock and
  it started at zero.

Toggle bits to the appropriate values:

- Enable cap_user_time.
- Enable cap_user_time_zero. Since the sched clock is stable and
  started at zero, this can be true.
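
For illustration, the mult/shift pair encodes the conversion
ns = (ticks * time_mult) >> time_shift, with time_mult chosen as
roughly (NSEC_PER_SEC << time_shift) / freq. A standalone sketch with
made-up values (the kernel derives the real ones with
clocks_calc_mult_shift):

	#include <stdint.h>
	#include <stdio.h>

	#define NSEC_PER_SEC 1000000000ULL

	int main(void)
	{
		// Made-up values: a 50 MHz timer and a shift of 21,
		// in the range clocks_calc_mult_shift() could pick.
		uint64_t freq = 50000000;
		uint32_t shift = 21;
		uint32_t mult = (uint32_t)((NSEC_PER_SEC << shift) / freq);

		// One second's worth of ticks should convert to ~1e9 ns.
		// The quot/rem split mirrors the userspace example above
		// and keeps the 64-bit product from overflowing for
		// large tick counts.
		uint64_t ticks = freq;
		uint64_t quot = ticks >> shift;
		uint64_t rem = ticks & (((uint64_t)1 << shift) - 1);
		uint64_t ns = quot * mult + ((rem * mult) >> shift);

		printf("%llu ticks -> %llu ns\n",
		       (unsigned long long)ticks, (unsigned long long)ns);
		return 0;
	}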
Signed-off-by: Michael O'Farrell <micpof@gmail.com>
---
arch/arm64/kernel/perf_event.c | 39 ++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index 33147aacdafd..c63f31caf7ac 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -25,6 +25,7 @@
 #include <asm/virt.h>
 
 #include <linux/acpi.h>
+#include <linux/clocksource.h>
 #include <linux/of.h>
 #include <linux/perf/arm_pmu.h>
 #include <linux/platform_device.h>
@@ -1127,3 +1128,41 @@ static int __init armv8_pmu_driver_init(void)
 	return arm_pmu_acpi_probe(armv8_pmuv3_init);
 }
 device_initcall(armv8_pmu_driver_init)
+
+#ifndef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
+void arch_perf_update_userpage(struct perf_event *event,
+			       struct perf_event_mmap_page *userpg, u64 now)
+{
+	u32 freq;
+	u32 shift;
+
+	/*
+	 * Internal timekeeping for enabled/running/stopped times
+	 * is always computed with the sched_clock.
+	 */
+	freq = arch_timer_get_rate();
+	userpg->cap_user_time = 1;
+
+	clocks_calc_mult_shift(&userpg->time_mult, &shift, freq,
+			NSEC_PER_SEC, 0);
+	/*
+	 * time_shift is not expected to be greater than 31 due to
+	 * the original published conversion algorithm shifting a
+	 * 32-bit value (now specifies a 64-bit value) - refer
+	 * perf_event_mmap_page documentation in perf_event.h.
+	 */
+	if (shift == 32) {
+		shift = 31;
+		userpg->time_mult >>= 1;
+	}
+	userpg->time_shift = (u16)shift;
+	userpg->time_offset = -now;
+
+	/*
+	 * We are always using the sched_clock as the base, so
+	 * cap_user_time_zero makes sense.
+	 */
+	userpg->cap_user_time_zero = 1;
+	userpg->time_zero = 0;
+}
+#endif /* !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
--
2.17.1
[PATCH] Add cap_user_time aarch64
From: Will Deacon @ 2018-07-24 15:14 UTC
To: linux-arm-kernel
Hi Michael,
[adding Peter]
Thanks for the patch. I think the general idea is sound, but I have a few
questions about the implementation.
On Fri, Jul 13, 2018 at 12:24:51PM -0700, Michael O'Farrell wrote:
> It is useful to get the running time of a thread. Doing so in an
> efficient manner can be important for the performance of user
> applications. Avoiding the system call in `clock_gettime` when handling
> CLOCK_THREAD_CPUTIME_ID is important: other clocks are handled in the
> VDSO, but CLOCK_THREAD_CPUTIME_ID falls back on the system call.
>
> CLOCK_THREAD_CPUTIME_ID is not handled in the VDSO because of the cost
> of maintaining user-space-accessible time offsets, which would have to
> be updated every time a thread is scheduled or descheduled. For
> programs that regularly check the running time of a thread, however,
> paying that cost is a net performance improvement.
>
> This patch takes a middle ground and adds support for cap_user_time,
> an optional feature of the perf_event API, so the costs are only
> incurred when the perf_event API is in use. This is done the same way
> as on x86. The patch also enables the cap_user_time_zero facility of
> perf_event, since it comes for free with the stable clock.
>
> Ultimately this allows calculating the thread running time in userspace
> on aarch64 as follows (adapted from the perf_event_open manpage):
>
> 	u32 seq, time_mult, time_shift;
> 	u64 running, count, time_offset, quot, rem, delta;
> 	struct perf_event_mmap_page *pc;
>
> 	pc = buf;  // buf is the perf event mmaped page as documented in the API.
>
> 	if (pc->cap_usr_time) {
> 		do {
> 			seq = pc->lock;
> 			barrier();
> 			running = pc->time_running;
>
> 			count = readCNTVCT_EL0();  // Read ARM hardware clock.
> 			time_offset = pc->time_offset;
> 			time_mult = pc->time_mult;
> 			time_shift = pc->time_shift;
>
> 			barrier();
> 		} while (pc->lock != seq);  // Retry if the kernel touched the page.
Using only compiler barriers here raises my eyebrows a little, but as long
as this is all in the context of the reader thread, then it should be ok.
I think it means that the comment in events/core.c is bogus:
	/*
	 * Disable preemption so as to not let the corresponding user-space
	 * spin too long if we get preempted.
	 */
	preempt_disable();
	++userpg->lock;
	barrier();
because user-space won't get stuck spinning unless it's running concurrently,
at which point it's hosed anyway. Hmmm.
> diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
> index 33147aacdafd..c63f31caf7ac 100644
> --- a/arch/arm64/kernel/perf_event.c
> +++ b/arch/arm64/kernel/perf_event.c
> @@ -25,6 +25,7 @@
> #include <asm/virt.h>
>
> #include <linux/acpi.h>
> +#include <linux/clocksource.h>
> #include <linux/of.h>
> #include <linux/perf/arm_pmu.h>
> #include <linux/platform_device.h>
> @@ -1127,3 +1128,41 @@ static int __init armv8_pmu_driver_init(void)
> return arm_pmu_acpi_probe(armv8_pmuv3_init);
> }
> device_initcall(armv8_pmu_driver_init)
> +
> +#ifndef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
We never select CONFIG_HAVE_UNSTABLE_SCHED_CLOCK, so you can remove the
guards.
> +void arch_perf_update_userpage(struct perf_event *event,
> +			       struct perf_event_mmap_page *userpg, u64 now)
> +{
> +	u32 freq;
> +	u32 shift;
> +
> +	/*
> +	 * Internal timekeeping for enabled/running/stopped times
> +	 * is always computed with the sched_clock.
> +	 */
> +	freq = arch_timer_get_rate();
> +	userpg->cap_user_time = 1;
> +
> +	clocks_calc_mult_shift(&userpg->time_mult, &shift, freq,
> +			NSEC_PER_SEC, 0);
> +	/*
> +	 * time_shift is not expected to be greater than 31 due to
> +	 * the original published conversion algorithm shifting a
> +	 * 32-bit value (now specifies a 64-bit value) - refer
> +	 * perf_event_mmap_page documentation in perf_event.h.
> +	 */
> +	if (shift == 32) {
> +		shift = 31;
> +		userpg->time_mult >>= 1;
> +	}
Can you explain this a bit more, please? Given that we've never enabled this
on arm64, why do we have a problem with a legacy algorithm?
> +	userpg->time_shift = (u16)shift;
> +	userpg->time_offset = -now;
> +
> +	/*
> +	 * We are always using the sched_clock as the base, so
> +	 * cap_user_time_zero makes sense.
> +	 */
How are you enforcing that we're always using sched_clock?
> +	userpg->cap_user_time_zero = 1;
> +	userpg->time_zero = 0;
I'm not sure we can rely on the architected timer being zero initialised --
our booting documentation just requires it to be consistent between CPUs.
Will
[PATCH] Add cap_user_time aarch64
From: Peter Zijlstra @ 2018-07-24 15:45 UTC
To: linux-arm-kernel
On Tue, Jul 24, 2018 at 04:14:54PM +0100, Will Deacon wrote:
> Using only compiler barriers here raises my eyebrows a little, but as long
> as this is all in the context of the reader thread, then it should be ok.
> I think it means that the comment in events/core.c is bogus:
>
> 	/*
> 	 * Disable preemption so as to not let the corresponding user-space
> 	 * spin too long if we get preempted.
> 	 */
> 	preempt_disable();
> 	++userpg->lock;
> 	barrier();
>
> because user-space won't get stuck spinning unless it's running concurrently,
> at which point it's hosed anyway. Hmmm.
Right, we should fix that. The only reason to keep that
preempt_disable() there is so that the various timestamps retain
temporal locality.
> > +void arch_perf_update_userpage(struct perf_event *event,
> > +			       struct perf_event_mmap_page *userpg, u64 now)
> > +{
> > +	u32 freq;
> > +	u32 shift;
> > +
> > +	/*
> > +	 * Internal timekeeping for enabled/running/stopped times
> > +	 * is always computed with the sched_clock.
> > +	 */
> > +	freq = arch_timer_get_rate();
> > +	userpg->cap_user_time = 1;
> > +
> > +	clocks_calc_mult_shift(&userpg->time_mult, &shift, freq,
> > +			NSEC_PER_SEC, 0);
> > +	/*
> > +	 * time_shift is not expected to be greater than 31 due to
> > +	 * the original published conversion algorithm shifting a
> > +	 * 32-bit value (now specifies a 64-bit value) - refer
> > +	 * perf_event_mmap_page documentation in perf_event.h.
> > +	 */
> > +	if (shift == 32) {
> > +		shift = 31;
> > +		userpg->time_mult >>= 1;
> > +	}
>
> Can you explain this a bit more, please? Given that we've never enabled this
> on arm64, why do we have a problem with a legacy algorithm?
The worry is, I suppose, that someone still uses the old algorithm
and then ports it to arm64 without checking.
> > +	userpg->time_shift = (u16)shift;
> > +	userpg->time_offset = -now;
> > +
> > +	/*
> > +	 * We are always using the sched_clock as the base, so
> > +	 * cap_user_time_zero makes sense.
> > +	 */
>
> How are you enforcing that we're always using sched_clock?
>
> > +	userpg->cap_user_time_zero = 1;
> > +	userpg->time_zero = 0;
>
> I'm not sure we can rely on the architected timer being zero initialised --
> our booting documentation just requires it to be consistent between CPUs.
On x86 the above is:
	/*
	 * cap_user_time_zero doesn't make sense when we're using a different
	 * time base for the records.
	 */
	if (!event->attr.use_clockid) {
		userpg->cap_user_time_zero = 1;
		userpg->time_zero = offset;
	}
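
Userspace opts into a different time base with the clockid attributes
at event-open time; as a sketch:

	struct perf_event_attr attr = { .size = sizeof(attr) };

	attr.use_clockid = 1;
	attr.clockid = CLOCK_MONOTONIC;	/* stamp records with this clock */

in which case the records are no longer in the sched_clock domain.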
[PATCH] Add cap_user_time aarch64
From: Michael O'Farrell @ 2018-07-24 22:41 UTC
To: linux-arm-kernel
Thanks Will and Peter.
So:
- Remove the ifndef for CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
- Update the comment in events/core.c to read "Disable preemption to
guarantee consistent time stamps are stored to the user page."
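
As a sketch, that comment change would look like:

-	/*
-	 * Disable preemption so as to not let the corresponding user-space
-	 * spin too long if we get preempted.
-	 */
+	/*
+	 * Disable preemption to guarantee consistent time stamps are
+	 * stored to the user page.
+	 */
 	preempt_disable();
 	++userpg->lock;
 	barrier();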
On Tue, Jul 24, 2018 at 8:45 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jul 24, 2018 at 04:14:54PM +0100, Will Deacon wrote:
> > Using only compiler barriers here raises my eyebrows a little, but as long
> > as this is all in the context of the reader thread, then it should be ok.
> > I think it means that the comment in events/core.c is bogus:
> >
> > 	/*
> > 	 * Disable preemption so as to not let the corresponding user-space
> > 	 * spin too long if we get preempted.
> > 	 */
> > 	preempt_disable();
> > 	++userpg->lock;
> > 	barrier();
> >
> > because user-space won't get stuck spinning unless it's running concurrently,
> > at which point it's hosed anyway. Hmmm.
>
> Right, we should fix that. The only reason to keep that
> preempt_disable() there is so that the various timestamps retain
> temporal locality.
>
> > > +void arch_perf_update_userpage(struct perf_event *event,
> > > +			       struct perf_event_mmap_page *userpg, u64 now)
> > > +{
> > > +	u32 freq;
> > > +	u32 shift;
> > > +
> > > +	/*
> > > +	 * Internal timekeeping for enabled/running/stopped times
> > > +	 * is always computed with the sched_clock.
> > > +	 */
> > > +	freq = arch_timer_get_rate();
> > > +	userpg->cap_user_time = 1;
> > > +
> > > +	clocks_calc_mult_shift(&userpg->time_mult, &shift, freq,
> > > +			NSEC_PER_SEC, 0);
> > > +	/*
> > > +	 * time_shift is not expected to be greater than 31 due to
> > > +	 * the original published conversion algorithm shifting a
> > > +	 * 32-bit value (now specifies a 64-bit value) - refer
> > > +	 * perf_event_mmap_page documentation in perf_event.h.
> > > +	 */
> > > +	if (shift == 32) {
> > > +		shift = 31;
> > > +		userpg->time_mult >>= 1;
> > > +	}
> >
> > Can you explain this a bit more, please? Given that we've never enabled this
> > on arm64, why do we have a problem with a legacy algorithm?
>
> The worry is, I suppose, that someone still uses the old algorithm
> and then ports it to arm64 without checking.
That's exactly what I was thinking. If we flip this bit, existing
cross-platform user code may change behavior and break if it relies on
the old API.
>
>
> > > +	userpg->time_shift = (u16)shift;
> > > +	userpg->time_offset = -now;
> > > +
> > > +	/*
> > > +	 * We are always using the sched_clock as the base, so
> > > +	 * cap_user_time_zero makes sense.
> > > +	 */
> >
> > How are you enforcing that we're always using sched_clock?
> >
> > > +	userpg->cap_user_time_zero = 1;
> > > +	userpg->time_zero = 0;
> >
> > I'm not sure we can rely on the architected timer being zero initialised --
> > our booting documentation just requires it to be consistent between CPUs.
>
> On x86 the above is:
>
> 	/*
> 	 * cap_user_time_zero doesn't make sense when we're using a different
> 	 * time base for the records.
> 	 */
> 	if (!event->attr.use_clockid) {
> 		userpg->cap_user_time_zero = 1;
> 		userpg->time_zero = offset;
> 	}
I was under the impression that the updating of the user page was always done
using perf_clock(), which uses local_clock(), which is the sched_clock.

I saw that use_clockid allows setting the clock used in other parts of the
perf event system, but I didn't think it affected the user page update. This
comment seems to support my claim:
	/*
	 * Internal timekeeping for enabled/running/stopped times
	 * is always in the local_clock domain.
	 */
Why does having a different time base for the records make cap_user_time_zero
not make sense?
I could rework the time_zero offset to be computed using ktime_get_boot_ns, or
remove cap_user_time_zero from this patch since it is a separate concept from
cap_user_time.