* [patch V2 03/11] x86/vdso: Enforce 64bit clocksource
From: Thomas Gleixner @ 2018-09-17 12:45 UTC (permalink / raw)
To: LKML
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180917124533.329334911@linutronix.de>
[-- Attachment #1: x86-vdso--Enforce-64bit-clocksource.patch --]
[-- Type: text/plain, Size: 1074 bytes --]
All VDSO clock sources are TSC based and use CLOCKSOURCE_MASK(64). There is
no point in masking with all FF. Get rid of it and enforce the mask in the
sanity checker.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
arch/x86/entry/vdso/vclock_gettime.c | 2 +-
arch/x86/kernel/time.c | 6 ++++++
2 files changed, 7 insertions(+), 1 deletion(-)
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -199,7 +199,7 @@ notrace static inline u64 vgetsns(int *m
#endif
else
return 0;
- v = (cycles - gtod->cycle_last) & gtod->mask;
+ v = cycles - gtod->cycle_last;
return v * gtod->mult;
}
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -120,4 +120,10 @@ void clocksource_arch_init(struct clocks
cs->name, cs->archdata.vclock_mode);
cs->archdata.vclock_mode = VCLOCK_NONE;
}
+
+ if (cs->mask != CLOCKSOURCE_MASK(64)) {
+ pr_warn("clocksource %s registered with invalid mask %016llx. Disabling vclock.\n",
+ cs->name, cs->mask);
+ cs->archdata.vclock_mode = VCLOCK_NONE;
+ }
}
^ permalink raw reply
* [patch V2 04/11] x86/vdso: Use unsigned int consistently for vsyscall_gtod_data::seq
From: Thomas Gleixner @ 2018-09-17 12:45 UTC (permalink / raw)
To: LKML
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180917124533.329334911@linutronix.de>
[-- Attachment #1: x86-vdso--Mintor-cleanups.patch --]
[-- Type: text/plain, Size: 2383 bytes --]
The sequence count in vgtod_data is unsigned int, but the call sites use
unsigned long, which is a pointless exercise. Fix the call sites and
replace 'unsigned' with unsinged 'int' while at it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
arch/x86/entry/vdso/vclock_gettime.c | 8 ++++----
arch/x86/include/asm/vgtod.h | 10 +++++-----
2 files changed, 9 insertions(+), 9 deletions(-)
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -206,7 +206,7 @@ notrace static inline u64 vgetsns(int *m
/* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
notrace static int __always_inline do_realtime(struct timespec *ts)
{
- unsigned long seq;
+ unsigned int seq;
u64 ns;
int mode;
@@ -227,7 +227,7 @@ notrace static int __always_inline do_re
notrace static int __always_inline do_monotonic(struct timespec *ts)
{
- unsigned long seq;
+ unsigned int seq;
u64 ns;
int mode;
@@ -248,7 +248,7 @@ notrace static int __always_inline do_mo
notrace static void do_realtime_coarse(struct timespec *ts)
{
- unsigned long seq;
+ unsigned int seq;
do {
seq = gtod_read_begin(gtod);
ts->tv_sec = gtod->wall_time_coarse_sec;
@@ -258,7 +258,7 @@ notrace static void do_realtime_coarse(s
notrace static void do_monotonic_coarse(struct timespec *ts)
{
- unsigned long seq;
+ unsigned int seq;
do {
seq = gtod_read_begin(gtod);
ts->tv_sec = gtod->monotonic_time_coarse_sec;
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -15,9 +15,9 @@ typedef unsigned long gtod_long_t;
* so be carefull by modifying this structure.
*/
struct vsyscall_gtod_data {
- unsigned seq;
+ unsigned int seq;
- int vclock_mode;
+ int vclock_mode;
u64 cycle_last;
u64 mask;
u32 mult;
@@ -44,9 +44,9 @@ static inline bool vclock_was_used(int v
return READ_ONCE(vclocks_used) & (1 << vclock);
}
-static inline unsigned gtod_read_begin(const struct vsyscall_gtod_data *s)
+static inline unsigned int gtod_read_begin(const struct vsyscall_gtod_data *s)
{
- unsigned ret;
+ unsigned int ret;
repeat:
ret = READ_ONCE(s->seq);
@@ -59,7 +59,7 @@ static inline unsigned gtod_read_begin(c
}
static inline int gtod_read_retry(const struct vsyscall_gtod_data *s,
- unsigned start)
+ unsigned int start)
{
smp_rmb();
return unlikely(s->seq != start);
^ permalink raw reply
* [patch V2 05/11] x86/vdso: Introduce and use vgtod_ts
From: Thomas Gleixner @ 2018-09-17 12:45 UTC (permalink / raw)
To: LKML
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180917124533.329334911@linutronix.de>
[-- Attachment #1: x86-vdso--Introduce-and-use-vgtod_ts.patch --]
[-- Type: text/plain, Size: 7483 bytes --]
It's desired to support more clocks in the VDSO, e.g. CLOCK_TAI. This
results either in indirect calls due to the larger switch case, which then
requires retpolines or when the compiler is forced to avoid jump tables it
results in even more conditionals.
To avoid both variants which are bad for performance the high resolution
functions and the coarse grained functions will be collapsed into one for
each. That requires to store the clock specific base time in an array.
Introcude struct vgtod_ts for storage and convert the data store, the
update function and the individual clock functions over to use it.
The new storage does not longer use gtod_long_t for seconds depending on 32
or 64 bit compile because this needs to be the full 64bit value even for
32bit when a Y2038 function is added. No point in keeping the distinction
alive in the internal representation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
arch/x86/entry/vdso/vclock_gettime.c | 24 +++++++++------
arch/x86/entry/vsyscall/vsyscall_gtod.c | 51 ++++++++++++++++----------------
arch/x86/include/asm/vgtod.h | 36 ++++++++++++----------
3 files changed, 61 insertions(+), 50 deletions(-)
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -206,6 +206,7 @@ notrace static inline u64 vgetsns(int *m
/* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
notrace static int __always_inline do_realtime(struct timespec *ts)
{
+ struct vgtod_ts *base = >od->basetime[CLOCK_REALTIME];
unsigned int seq;
u64 ns;
int mode;
@@ -213,8 +214,8 @@ notrace static int __always_inline do_re
do {
seq = gtod_read_begin(gtod);
mode = gtod->vclock_mode;
- ts->tv_sec = gtod->wall_time_sec;
- ns = gtod->wall_time_snsec;
+ ts->tv_sec = base->sec;
+ ns = base->nsec;
ns += vgetsns(&mode);
ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));
@@ -227,6 +228,7 @@ notrace static int __always_inline do_re
notrace static int __always_inline do_monotonic(struct timespec *ts)
{
+ struct vgtod_ts *base = >od->basetime[CLOCK_MONOTONIC];
unsigned int seq;
u64 ns;
int mode;
@@ -234,8 +236,8 @@ notrace static int __always_inline do_mo
do {
seq = gtod_read_begin(gtod);
mode = gtod->vclock_mode;
- ts->tv_sec = gtod->monotonic_time_sec;
- ns = gtod->monotonic_time_snsec;
+ ts->tv_sec = base->sec;
+ ns = base->nsec;
ns += vgetsns(&mode);
ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));
@@ -248,21 +250,25 @@ notrace static int __always_inline do_mo
notrace static void do_realtime_coarse(struct timespec *ts)
{
+ struct vgtod_ts *base = >od->basetime[CLOCK_REALTIME_COARSE];
unsigned int seq;
+
do {
seq = gtod_read_begin(gtod);
- ts->tv_sec = gtod->wall_time_coarse_sec;
- ts->tv_nsec = gtod->wall_time_coarse_nsec;
+ ts->tv_sec = base->sec;
+ ts->tv_nsec = base->nsec;
} while (unlikely(gtod_read_retry(gtod, seq)));
}
notrace static void do_monotonic_coarse(struct timespec *ts)
{
+ struct vgtod_ts *base = >od->basetime[CLOCK_MONOTONIC_COARSE];
unsigned int seq;
+
do {
seq = gtod_read_begin(gtod);
- ts->tv_sec = gtod->monotonic_time_coarse_sec;
- ts->tv_nsec = gtod->monotonic_time_coarse_nsec;
+ ts->tv_sec = base->sec;
+ ts->tv_nsec = base->nsec;
} while (unlikely(gtod_read_retry(gtod, seq)));
}
@@ -318,7 +324,7 @@ int gettimeofday(struct timeval *, struc
notrace time_t __vdso_time(time_t *t)
{
/* This is atomic on x86 so we don't need any locks. */
- time_t result = READ_ONCE(gtod->wall_time_sec);
+ time_t result = READ_ONCE(gtod->basetime[CLOCK_REALTIME].sec);
if (t)
*t = result;
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -31,6 +31,8 @@ void update_vsyscall(struct timekeeper *
{
int vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode;
struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data;
+ struct vgtod_ts *base;
+ u64 nsec;
/* Mark the new vclock used. */
BUILD_BUG_ON(VCLOCK_MAX >= 32);
@@ -45,34 +47,33 @@ void update_vsyscall(struct timekeeper *
vdata->mult = tk->tkr_mono.mult;
vdata->shift = tk->tkr_mono.shift;
- vdata->wall_time_sec = tk->xtime_sec;
- vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec;
-
- vdata->monotonic_time_sec = tk->xtime_sec
- + tk->wall_to_monotonic.tv_sec;
- vdata->monotonic_time_snsec = tk->tkr_mono.xtime_nsec
- + ((u64)tk->wall_to_monotonic.tv_nsec
- << tk->tkr_mono.shift);
- while (vdata->monotonic_time_snsec >=
- (((u64)NSEC_PER_SEC) << tk->tkr_mono.shift)) {
- vdata->monotonic_time_snsec -=
- ((u64)NSEC_PER_SEC) << tk->tkr_mono.shift;
- vdata->monotonic_time_sec++;
+ base = &vdata->basetime[CLOCK_REALTIME];
+ base->sec = tk->xtime_sec;
+ base->nsec = tk->tkr_mono.xtime_nsec;
+
+ base = &vdata->basetime[CLOCK_MONOTONIC];
+ base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
+ nsec = tk->tkr_mono.xtime_nsec;
+ nsec += ((u64)tk->wall_to_monotonic.tv_nsec << tk->tkr_mono.shift);
+ while (nsec >= (((u64)NSEC_PER_SEC) << tk->tkr_mono.shift)) {
+ nsec -= ((u64)NSEC_PER_SEC) << tk->tkr_mono.shift;
+ base->sec++;
}
+ base->nsec = nsec;
- vdata->wall_time_coarse_sec = tk->xtime_sec;
- vdata->wall_time_coarse_nsec = (long)(tk->tkr_mono.xtime_nsec >>
- tk->tkr_mono.shift);
-
- vdata->monotonic_time_coarse_sec =
- vdata->wall_time_coarse_sec + tk->wall_to_monotonic.tv_sec;
- vdata->monotonic_time_coarse_nsec =
- vdata->wall_time_coarse_nsec + tk->wall_to_monotonic.tv_nsec;
-
- while (vdata->monotonic_time_coarse_nsec >= NSEC_PER_SEC) {
- vdata->monotonic_time_coarse_nsec -= NSEC_PER_SEC;
- vdata->monotonic_time_coarse_sec++;
+ base = &vdata->basetime[CLOCK_REALTIME_COARSE];
+ base->sec = tk->xtime_sec;
+ base->nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+
+ base = &vdata->basetime[CLOCK_MONOTONIC_COARSE];
+ base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
+ nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+ nsec += tk->wall_to_monotonic.tv_nsec;
+ while (nsec >= NSEC_PER_SEC) {
+ nsec -= NSEC_PER_SEC;
+ base->sec++;
}
+ base->nsec = nsec;
gtod_write_end(vdata);
}
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -5,33 +5,37 @@
#include <linux/compiler.h>
#include <linux/clocksource.h>
+#include <uapi/linux/time.h>
+
#ifdef BUILD_VDSO32_64
typedef u64 gtod_long_t;
#else
typedef unsigned long gtod_long_t;
#endif
+
+struct vgtod_ts {
+ u64 sec;
+ u64 nsec;
+};
+
+#define VGTOD_BASES (CLOCK_MONOTONIC_COARSE + 1)
+#define VGTOD_HRES (BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC))
+#define VGTOD_COARSE (BIT(CLOCK_REALTIME_COARSE) | BIT(CLOCK_MONOTONIC_COARSE))
+
/*
* vsyscall_gtod_data will be accessed by 32 and 64 bit code at the same time
* so be carefull by modifying this structure.
*/
struct vsyscall_gtod_data {
- unsigned int seq;
+ unsigned int seq;
+
+ int vclock_mode;
+ u64 cycle_last;
+ u64 mask;
+ u32 mult;
+ u32 shift;
- int vclock_mode;
- u64 cycle_last;
- u64 mask;
- u32 mult;
- u32 shift;
-
- /* open coded 'struct timespec' */
- u64 wall_time_snsec;
- gtod_long_t wall_time_sec;
- gtod_long_t monotonic_time_sec;
- u64 monotonic_time_snsec;
- gtod_long_t wall_time_coarse_sec;
- gtod_long_t wall_time_coarse_nsec;
- gtod_long_t monotonic_time_coarse_sec;
- gtod_long_t monotonic_time_coarse_nsec;
+ struct vgtod_ts basetime[VGTOD_BASES];
int tz_minuteswest;
int tz_dsttime;
^ permalink raw reply
* [patch V2 06/11] x86/vdso: Collapse high resolution functions
From: Thomas Gleixner @ 2018-09-17 12:45 UTC (permalink / raw)
To: LKML
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180917124533.329334911@linutronix.de>
[-- Attachment #1: x86-vdso--Collapse-high-res-functions.patch --]
[-- Type: text/plain, Size: 2221 bytes --]
do_realtime() and do_monotonic() are now the same except for the storage
array index. Hand the index in as an argument and collapse the functions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
arch/x86/entry/vdso/vclock_gettime.c | 35 +++++++----------------------------
1 file changed, 7 insertions(+), 28 deletions(-)
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -203,35 +203,12 @@ notrace static inline u64 vgetsns(int *m
return v * gtod->mult;
}
-/* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
-notrace static int __always_inline do_realtime(struct timespec *ts)
+notrace static int do_hres(clockid_t clk, struct timespec *ts)
{
- struct vgtod_ts *base = >od->basetime[CLOCK_REALTIME];
+ struct vgtod_ts *base = >od->basetime[clk];
unsigned int seq;
- u64 ns;
int mode;
-
- do {
- seq = gtod_read_begin(gtod);
- mode = gtod->vclock_mode;
- ts->tv_sec = base->sec;
- ns = base->nsec;
- ns += vgetsns(&mode);
- ns >>= gtod->shift;
- } while (unlikely(gtod_read_retry(gtod, seq)));
-
- ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
- ts->tv_nsec = ns;
-
- return mode;
-}
-
-notrace static int __always_inline do_monotonic(struct timespec *ts)
-{
- struct vgtod_ts *base = >od->basetime[CLOCK_MONOTONIC];
- unsigned int seq;
u64 ns;
- int mode;
do {
seq = gtod_read_begin(gtod);
@@ -276,11 +253,11 @@ notrace int __vdso_clock_gettime(clockid
{
switch (clock) {
case CLOCK_REALTIME:
- if (do_realtime(ts) == VCLOCK_NONE)
+ if (do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE)
goto fallback;
break;
case CLOCK_MONOTONIC:
- if (do_monotonic(ts) == VCLOCK_NONE)
+ if (do_hres(CLOCK_MONOTONIC, ts) == VCLOCK_NONE)
goto fallback;
break;
case CLOCK_REALTIME_COARSE:
@@ -303,7 +280,9 @@ int clock_gettime(clockid_t, struct time
notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
{
if (likely(tv != NULL)) {
- if (unlikely(do_realtime((struct timespec *)tv) == VCLOCK_NONE))
+ struct timespec *ts = (struct timespec *) tv;
+
+ if (unlikely(do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE))
return vdso_fallback_gtod(tv, tz);
tv->tv_usec /= 1000;
}
^ permalink raw reply
* [patch V2 07/11] x86/vdso: Collapse coarse functions
From: Thomas Gleixner @ 2018-09-17 12:45 UTC (permalink / raw)
To: LKML
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180917124533.329334911@linutronix.de>
[-- Attachment #1: x86-vdso--Collapse-coarse-functions.patch --]
[-- Type: text/plain, Size: 1421 bytes --]
do_realtime_coarse() and do_monotonic_coarse() are now the same except for
the storage array index. Hand the index in as an argument and collapse the
functions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
arch/x86/entry/vdso/vclock_gettime.c | 20 ++++----------------
1 file changed, 4 insertions(+), 16 deletions(-)
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -225,21 +225,9 @@ notrace static int do_hres(clockid_t clk
return mode;
}
-notrace static void do_realtime_coarse(struct timespec *ts)
+notrace static void do_coarse(clockid_t clk, struct timespec *ts)
{
- struct vgtod_ts *base = >od->basetime[CLOCK_REALTIME_COARSE];
- unsigned int seq;
-
- do {
- seq = gtod_read_begin(gtod);
- ts->tv_sec = base->sec;
- ts->tv_nsec = base->nsec;
- } while (unlikely(gtod_read_retry(gtod, seq)));
-}
-
-notrace static void do_monotonic_coarse(struct timespec *ts)
-{
- struct vgtod_ts *base = >od->basetime[CLOCK_MONOTONIC_COARSE];
+ struct vgtod_ts *base = >od->basetime[clk];
unsigned int seq;
do {
@@ -261,10 +249,10 @@ notrace int __vdso_clock_gettime(clockid
goto fallback;
break;
case CLOCK_REALTIME_COARSE:
- do_realtime_coarse(ts);
+ do_coarse(CLOCK_REALTIME_COARSE, ts);
break;
case CLOCK_MONOTONIC_COARSE:
- do_monotonic_coarse(ts);
+ do_coarse(CLOCK_MONOTONIC_COARSE, ts);
break;
default:
goto fallback;
^ permalink raw reply
* [patch V2 08/11] x86/vdso: Replace the clockid switch case
From: Thomas Gleixner @ 2018-09-17 12:45 UTC (permalink / raw)
To: LKML
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180917124533.329334911@linutronix.de>
[-- Attachment #1: x86-vdso--Replace-the-clockid-switch-case.patch --]
[-- Type: text/plain, Size: 2110 bytes --]
Now that the time getter functions use the clockid as index into the
storage array for the base time access, the switch case can be replaced.
- Check for clockid >= MAX_CLOCKS and for negative clockid (CPU/FD) first
and call the fallback function right away.
- After establishing that clockid is < MAX_CLOCKS, convert the clockid to a
bitmask
- Check for the supported high resolution and coarse functions by anding
the bitmask of supported clocks and check whether a bit is set.
This completely avoids jump tables, reduces the number of conditionals and
makes the VDSO extensible for other clock ids.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
arch/x86/entry/vdso/vclock_gettime.c | 38 ++++++++++++++++-------------------
1 file changed, 18 insertions(+), 20 deletions(-)
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -239,29 +239,27 @@ notrace static void do_coarse(clockid_t
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
{
- switch (clock) {
- case CLOCK_REALTIME:
- if (do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE)
- goto fallback;
- break;
- case CLOCK_MONOTONIC:
- if (do_hres(CLOCK_MONOTONIC, ts) == VCLOCK_NONE)
- goto fallback;
- break;
- case CLOCK_REALTIME_COARSE:
- do_coarse(CLOCK_REALTIME_COARSE, ts);
- break;
- case CLOCK_MONOTONIC_COARSE:
- do_coarse(CLOCK_MONOTONIC_COARSE, ts);
- break;
- default:
- goto fallback;
- }
+ unsigned int msk;
+
+ /* Sort out negative (CPU/FD) and invalid clocks */
+ if (unlikely((unsigned int) clock >= MAX_CLOCKS))
+ return vdso_fallback_gettime(clock, ts);
- return 0;
-fallback:
+ /*
+ * Convert the clockid to a bitmask and use it to check which
+ * clocks are handled in the VDSO directly.
+ */
+ msk = 1U << clock;
+ if (likely(msk & VGTOD_HRES)) {
+ if (do_hres(clock, ts) != VCLOCK_NONE)
+ return 0;
+ } else if (msk & VGTOD_COARSE) {
+ do_coarse(clock, ts);
+ return 0;
+ }
return vdso_fallback_gettime(clock, ts);
}
+
int clock_gettime(clockid_t, struct timespec *)
__attribute__((weak, alias("__vdso_clock_gettime")));
^ permalink raw reply
* [patch V2 09/11] x86/vdso: Simplify the invalid vclock case
From: Thomas Gleixner @ 2018-09-17 12:45 UTC (permalink / raw)
To: LKML
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180917124533.329334911@linutronix.de>
[-- Attachment #1: x86-vdso--Simplify-the-invalid-vclock-case.patch --]
[-- Type: text/plain, Size: 5307 bytes --]
The code flow for the vclocks is convoluted as it requires the vclocks
which can be invalidated separately from the vsyscall_gtod_data sequence to
store the fact in a separate variable. That's inefficient.
Restructure the code so the vclock readout returns cycles and the
conversion to nanoseconds is handled at the call site.
If the clock gets invalidated or vclock is already VCLOCK_NONE, return
U64_MAX as the cycle value, which is invalid for all clocks and leave the
sequence loop immediately in that case by calling the fallback function
directly.
This allows to remove the gettimeofday fallback as it now uses the
clock_gettime() fallback and does the nanoseconds to microseconds
conversion in the same way as it does when the vclock is functional. It
does not make a difference whether the division by 1000 happens in the
kernel fallback or in userspace.
Generates way better code and gains a few cycles back.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
arch/x86/entry/vdso/vclock_gettime.c | 81 +++++++++--------------------------
1 file changed, 21 insertions(+), 60 deletions(-)
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -48,16 +48,6 @@ notrace static long vdso_fallback_gettim
return ret;
}
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
- long ret;
-
- asm("syscall" : "=a" (ret) :
- "0" (__NR_gettimeofday), "D" (tv), "S" (tz) : "memory");
- return ret;
-}
-
-
#else
notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
@@ -75,21 +65,6 @@ notrace static long vdso_fallback_gettim
return ret;
}
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
- long ret;
-
- asm(
- "mov %%ebx, %%edx \n"
- "mov %2, %%ebx \n"
- "call __kernel_vsyscall \n"
- "mov %%edx, %%ebx \n"
- : "=a" (ret)
- : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
- : "memory", "edx");
- return ret;
-}
-
#endif
#ifdef CONFIG_PARAVIRT_CLOCK
@@ -98,7 +73,7 @@ static notrace const struct pvclock_vsys
return (const struct pvclock_vsyscall_time_info *)&pvclock_page;
}
-static notrace u64 vread_pvclock(int *mode)
+static notrace u64 vread_pvclock(void)
{
const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti;
u64 ret;
@@ -130,10 +105,8 @@ static notrace u64 vread_pvclock(int *mo
do {
version = pvclock_read_begin(pvti);
- if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
- *mode = VCLOCK_NONE;
- return 0;
- }
+ if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT)))
+ return U64_MAX;
ret = __pvclock_read_cycles(pvti, rdtsc_ordered());
} while (pvclock_read_retry(pvti, version));
@@ -148,17 +121,12 @@ static notrace u64 vread_pvclock(int *mo
}
#endif
#ifdef CONFIG_HYPERV_TSCPAGE
-static notrace u64 vread_hvclock(int *mode)
+static notrace u64 vread_hvclock(void)
{
const struct ms_hyperv_tsc_page *tsc_pg =
(const struct ms_hyperv_tsc_page *)&hvclock_page;
- u64 current_tick = hv_read_tsc_page(tsc_pg);
-
- if (current_tick != U64_MAX)
- return current_tick;
- *mode = VCLOCK_NONE;
- return 0;
+ return hv_read_tsc_page(tsc_pg);
}
#endif
@@ -182,47 +150,42 @@ notrace static u64 vread_tsc(void)
return last;
}
-notrace static inline u64 vgetsns(int *mode)
+notrace static inline u64 vgetcyc(int mode)
{
- u64 v;
- cycles_t cycles;
-
- if (gtod->vclock_mode == VCLOCK_TSC)
- cycles = vread_tsc();
+ if (mode == VCLOCK_TSC)
+ return vread_tsc();
#ifdef CONFIG_PARAVIRT_CLOCK
- else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
- cycles = vread_pvclock(mode);
+ else if (mode == VCLOCK_PVCLOCK)
+ return vread_pvclock();
#endif
#ifdef CONFIG_HYPERV_TSCPAGE
- else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
- cycles = vread_hvclock(mode);
+ else if (mode == VCLOCK_HVCLOCK)
+ return vread_hvclock();
#endif
- else
- return 0;
- v = cycles - gtod->cycle_last;
- return v * gtod->mult;
+ return U64_MAX;
}
notrace static int do_hres(clockid_t clk, struct timespec *ts)
{
struct vgtod_ts *base = >od->basetime[clk];
unsigned int seq;
- int mode;
- u64 ns;
+ u64 cycles, ns;
do {
seq = gtod_read_begin(gtod);
- mode = gtod->vclock_mode;
ts->tv_sec = base->sec;
ns = base->nsec;
- ns += vgetsns(&mode);
+ cycles = vgetcyc(gtod->vclock_mode);
+ if (unlikely((s64)cycles < 0))
+ return vdso_fallback_gettime(clk, ts);
+ ns += (cycles - gtod->cycle_last) * gtod->mult;
ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));
ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
ts->tv_nsec = ns;
- return mode;
+ return 0;
}
notrace static void do_coarse(clockid_t clk, struct timespec *ts)
@@ -251,8 +214,7 @@ notrace int __vdso_clock_gettime(clockid
*/
msk = 1U << clock;
if (likely(msk & VGTOD_HRES)) {
- if (do_hres(clock, ts) != VCLOCK_NONE)
- return 0;
+ return do_hres(clock, ts);
} else if (msk & VGTOD_COARSE) {
do_coarse(clock, ts);
return 0;
@@ -268,8 +230,7 @@ notrace int __vdso_gettimeofday(struct t
if (likely(tv != NULL)) {
struct timespec *ts = (struct timespec *) tv;
- if (unlikely(do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE))
- return vdso_fallback_gtod(tv, tz);
+ do_hres(CLOCK_REALTIME, ts);
tv->tv_usec /= 1000;
}
if (unlikely(tz != NULL)) {
^ permalink raw reply
* [patch V2 10/11] x86/vdso: Move cycle_last handling into the caller
From: Thomas Gleixner @ 2018-09-17 12:45 UTC (permalink / raw)
To: LKML
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180917124533.329334911@linutronix.de>
[-- Attachment #1: x86-vdso--Move-cycle_last-handling-into-the-caller.patch --]
[-- Type: text/plain, Size: 2774 bytes --]
Dereferencing gtod->cycle_last all over the place and foing the cycles <
last comparison in the vclock read functions generates horrible code. Doing
it at the call site is much better and gains a few cycles both for TSC and
pvclock.
Caveat: This adds the comparison to the hyperv vclock as well, but I have
no way to test that.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
arch/x86/entry/vdso/vclock_gettime.c | 39 ++++++-----------------------------
1 file changed, 7 insertions(+), 32 deletions(-)
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -76,9 +76,8 @@ static notrace const struct pvclock_vsys
static notrace u64 vread_pvclock(void)
{
const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti;
- u64 ret;
- u64 last;
u32 version;
+ u64 ret;
/*
* Note: The kernel and hypervisor must guarantee that cpu ID
@@ -111,13 +110,7 @@ static notrace u64 vread_pvclock(void)
ret = __pvclock_read_cycles(pvti, rdtsc_ordered());
} while (pvclock_read_retry(pvti, version));
- /* refer to vread_tsc() comment for rationale */
- last = gtod->cycle_last;
-
- if (likely(ret >= last))
- return ret;
-
- return last;
+ return ret;
}
#endif
#ifdef CONFIG_HYPERV_TSCPAGE
@@ -130,30 +123,10 @@ static notrace u64 vread_hvclock(void)
}
#endif
-notrace static u64 vread_tsc(void)
-{
- u64 ret = (u64)rdtsc_ordered();
- u64 last = gtod->cycle_last;
-
- if (likely(ret >= last))
- return ret;
-
- /*
- * GCC likes to generate cmov here, but this branch is extremely
- * predictable (it's just a function of time and the likely is
- * very likely) and there's a data dependence, so force GCC
- * to generate a branch instead. I don't barrier() because
- * we don't actually need a barrier, and if this function
- * ever gets inlined it will generate worse code.
- */
- asm volatile ("");
- return last;
-}
-
notrace static inline u64 vgetcyc(int mode)
{
if (mode == VCLOCK_TSC)
- return vread_tsc();
+ return (u64)rdtsc_ordered();
#ifdef CONFIG_PARAVIRT_CLOCK
else if (mode == VCLOCK_PVCLOCK)
return vread_pvclock();
@@ -168,17 +141,19 @@ notrace static inline u64 vgetcyc(int mo
notrace static int do_hres(clockid_t clk, struct timespec *ts)
{
struct vgtod_ts *base = >od->basetime[clk];
+ u64 cycles, last, ns;
unsigned int seq;
- u64 cycles, ns;
do {
seq = gtod_read_begin(gtod);
ts->tv_sec = base->sec;
ns = base->nsec;
+ last = gtod->cycle_last;
cycles = vgetcyc(gtod->vclock_mode);
if (unlikely((s64)cycles < 0))
return vdso_fallback_gettime(clk, ts);
- ns += (cycles - gtod->cycle_last) * gtod->mult;
+ if (cycles > last)
+ ns += (cycles - last) * gtod->mult;
ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));
^ permalink raw reply
* [patch V2 11/11] x66/vdso: Add CLOCK_TAI support
From: Thomas Gleixner @ 2018-09-17 12:45 UTC (permalink / raw)
To: LKML
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra, x86,
virtualization, Stephen Boyd, John Stultz, Andy Lutomirski,
Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180917124533.329334911@linutronix.de>
[-- Attachment #1: x66-vdso--Add-CLOCK_TAI-support.patch --]
[-- Type: text/plain, Size: 1326 bytes --]
With the storage array in place it's now trivial to support CLOCK_TAI in
the vdso. Extend the base time storage array and add the update code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V2: Remove the masking trick
arch/x86/entry/vsyscall/vsyscall_gtod.c | 4 ++++
arch/x86/include/asm/vgtod.h | 4 ++--
2 files changed, 6 insertions(+), 2 deletions(-)
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -51,6 +51,10 @@ void update_vsyscall(struct timekeeper *
base->sec = tk->xtime_sec;
base->nsec = tk->tkr_mono.xtime_nsec;
+ base = &vdata->basetime[CLOCK_TAI];
+ base->sec = tk->xtime_sec + (s64)tk->tai_offset;
+ base->nsec = tk->tkr_mono.xtime_nsec;
+
base = &vdata->basetime[CLOCK_MONOTONIC];
base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
nsec = tk->tkr_mono.xtime_nsec;
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -18,8 +18,8 @@ struct vgtod_ts {
u64 nsec;
};
-#define VGTOD_BASES (CLOCK_MONOTONIC_COARSE + 1)
-#define VGTOD_HRES (BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC))
+#define VGTOD_BASES (CLOCK_TAI + 1)
+#define VGTOD_HRES (BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC) | BIT(CLOCK_TAI))
#define VGTOD_COARSE (BIT(CLOCK_REALTIME_COARSE) | BIT(CLOCK_MONOTONIC_COARSE))
/*
^ permalink raw reply
* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
From: Thomas Gleixner @ 2018-09-17 13:00 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Florian Weimer, Juergen Gross, Peter Zijlstra,
the arch/x86 maintainers, Linux Kernel Mailing List,
virtualization, Stephen Boyd, John Stultz, Deepa Dinamani,
Andy Lutomirski, Paolo Bonzini, devel, matt
In-Reply-To: <CAK8P3a1QfynDxkKBLXVa5Qj_z+7ct3QB5sDYuY5QV3ht-a0cTg@mail.gmail.com>
On Fri, 14 Sep 2018, Arnd Bergmann wrote:
> On Fri, Sep 14, 2018 at 2:52 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> A couple of architectures (s390, ia64, riscv, powerpc, arm64)
> implement the vdso as assembler code at the moment, so they
> won't be as easy to consolidate (other than outright replacing all
> the code).
>
> The other five:
> arch/x86/entry/vdso/vclock_gettime.c
> arch/sparc/vdso/vclock_gettime.c
> arch/nds32/kernel/vdso/gettimeofday.c
> arch/mips/vdso/gettimeofday.c
> arch/arm/vdso/vgettimeofday.c
>
> are basically all minor variations of the same code base and could be
> consolidated to some degree.
> Any suggestions here? Should we plan to do that consolitdation based on
> your new version, or just add clock_gettime64 in arm32 and x86-32, and then
> be done with it? The other ones will obviously still be fast for 32-bit time_t
> and will have a working non-vdso sys_clock_getttime64().
In principle consolidating all those implementations should be possible to
some extent and probably worthwhile. What's arch specific are the actual
accessors to the hardware clocks.
> I also wonder about clock_getres(): half the architectures seem to implement
> it in vdso, but notably arm32 and x86 don't, and I had not expected it to be
> performance critical given that the result is easily cached in user space.
getres() is not really performance critical, but adding it does not create
a huge problem either.
Thanks,
tglx
^ permalink raw reply
* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
From: Andy Lutomirski @ 2018-09-17 19:25 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra,
X86 ML, LKML, Linux Virtualization, Stephen Boyd, John Stultz,
Andy Lutomirski, Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180914125118.909646643@linutronix.de>
On Fri, Sep 14, 2018 at 5:50 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> The code flow for the vclocks is convoluted as it requires the vclocks
> which can be invalidated separately from the vsyscall_gtod_data sequence to
> store the fact in a separate variable. That's inefficient.
>
> notrace static int do_hres(clockid_t clk, struct timespec *ts)
> {
> struct vgtod_ts *base = >od->basetime[clk];
> unsigned int seq;
> - int mode;
> - u64 ns;
> + u64 cycles, ns;
>
> do {
> seq = gtod_read_begin(gtod);
> - mode = gtod->vclock_mode;
> ts->tv_sec = base->sec;
> ns = base->nsec;
> - ns += vgetsns(&mode);
> + cycles = vgetcyc(gtod->vclock_mode);
> + if (unlikely((s64)cycles < 0))
> + return vdso_fallback_gettime(clk, ts);
i was contemplating this, and I would suggest one of two optimizations:
1. have all the helpers return a struct containing a u64 and a bool,
and use that bool. The compiler should do okay.
2. Be sneaky. Later in the series, you do:
if (unlikely((s64)cycles < 0))
return vdso_fallback_gettime(clk, ts);
- ns += (cycles - gtod->cycle_last) * gtod->mult;
+ if (cycles > last)
+ ns += (cycles - last) * gtod->mult;
How about:
if (unlikely((s64)cycles <= last)) {
if (cycles < 0) [or cycles == -1 or whatever]
return vdso_fallback_gettime;
} else {
ns += (cycles - last) * gtod->mult;
}
which should actually make this whole mess be essentially free.
Also, I'm not entirely convinced that this "last" thing is needed at
all. John, what's the scenario under which we need it?
--Andy
--Andy
^ permalink raw reply
* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
From: Thomas Gleixner @ 2018-09-18 7:52 UTC (permalink / raw)
To: John Stultz
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra,
X86 ML, LKML, Linux Virtualization, Stephen Boyd, Andy Lutomirski,
Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <CALAqxLUv9Rd_eA7u9gZNFvTNrT1Z67RpHLozdeNC0W2yZm=Heg@mail.gmail.com>
On Mon, 17 Sep 2018, John Stultz wrote:
> On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > Also, I'm not entirely convinced that this "last" thing is needed at
> > all. John, what's the scenario under which we need it?
>
> So my memory is probably a bit foggy, but I recall that as we
> accelerated gettimeofday, we found that even on systems that claimed
> to have synced TSCs, they were actually just slightly out of sync.
> Enough that right after cycles_last had been updated, a read on
> another cpu could come in just behind cycles_last, resulting in a
> negative interval causing lots of havoc.
>
> So the sanity check is needed to avoid that case.
Your memory serves you right. That's indeed observable on CPUs which
lack TSC_ADJUST.
@Andy: Welcome to the wonderful world of TSC.
Thanks,
tglx
^ permalink raw reply
* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
From: Peter Zijlstra @ 2018-09-18 8:30 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Stephen Boyd,
X86 ML, LKML, Linux Virtualization, John Stultz, Andy Lutomirski,
Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <alpine.DEB.2.21.1809180948570.3558@nanos.tec.linutronix.de>
On Tue, Sep 18, 2018 at 09:52:26AM +0200, Thomas Gleixner wrote:
> On Mon, 17 Sep 2018, John Stultz wrote:
> > On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > > Also, I'm not entirely convinced that this "last" thing is needed at
> > > all. John, what's the scenario under which we need it?
> >
> > So my memory is probably a bit foggy, but I recall that as we
> > accelerated gettimeofday, we found that even on systems that claimed
> > to have synced TSCs, they were actually just slightly out of sync.
> > Enough that right after cycles_last had been updated, a read on
> > another cpu could come in just behind cycles_last, resulting in a
> > negative interval causing lots of havoc.
> >
> > So the sanity check is needed to avoid that case.
>
> Your memory serves you right. That's indeed observable on CPUs which
> lack TSC_ADJUST.
But, if the gtod code can observe this, then why doesn't the code that
checks the sync?
^ permalink raw reply
* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
From: Thomas Gleixner @ 2018-09-18 8:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Stephen Boyd,
X86 ML, LKML, Linux Virtualization, John Stultz, Andy Lutomirski,
Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180918083055.GJ24106@hirez.programming.kicks-ass.net>
On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> On Tue, Sep 18, 2018 at 09:52:26AM +0200, Thomas Gleixner wrote:
> > On Mon, 17 Sep 2018, John Stultz wrote:
> > > On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > > > Also, I'm not entirely convinced that this "last" thing is needed at
> > > > all. John, what's the scenario under which we need it?
> > >
> > > So my memory is probably a bit foggy, but I recall that as we
> > > accelerated gettimeofday, we found that even on systems that claimed
> > > to have synced TSCs, they were actually just slightly out of sync.
> > > Enough that right after cycles_last had been updated, a read on
> > > another cpu could come in just behind cycles_last, resulting in a
> > > negative interval causing lots of havoc.
> > >
> > > So the sanity check is needed to avoid that case.
> >
> > Your memory serves you right. That's indeed observable on CPUs which
> > lack TSC_ADJUST.
>
> But, if the gtod code can observe this, then why doesn't the code that
> checks the sync?
Because it depends where the involved CPUs are in the topology. The sync
code might just run on the same package an simply not see it. Yes, w/o
TSC_ADJUST the TSC sync code can just fail to see the havoc.
Thanks,
tglx
^ permalink raw reply
* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
From: Thomas Gleixner @ 2018-09-18 10:06 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Stephen Boyd,
X86 ML, LKML, Linux Virtualization, John Stultz, Andy Lutomirski,
Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <alpine.DEB.2.21.1809181050020.4167@nanos.tec.linutronix.de>
On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> > > Your memory serves you right. That's indeed observable on CPUs which
> > > lack TSC_ADJUST.
> >
> > But, if the gtod code can observe this, then why doesn't the code that
> > checks the sync?
>
> Because it depends where the involved CPUs are in the topology. The sync
> code might just run on the same package an simply not see it. Yes, w/o
> TSC_ADJUST the TSC sync code can just fail to see the havoc.
Even with TSC adjust the TSC can be slightly off by design on multi-socket
systems.
Thanks,
tglx
^ permalink raw reply
* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
From: Thomas Gleixner @ 2018-09-18 10:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Stephen Boyd,
X86 ML, LKML, Linux Virtualization, John Stultz, Andy Lutomirski,
Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <alpine.DEB.2.21.1809181204460.4167@nanos.tec.linutronix.de>
On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> > On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> > > > Your memory serves you right. That's indeed observable on CPUs which
> > > > lack TSC_ADJUST.
> > >
> > > But, if the gtod code can observe this, then why doesn't the code that
> > > checks the sync?
> >
> > Because it depends where the involved CPUs are in the topology. The sync
> > code might just run on the same package an simply not see it. Yes, w/o
> > TSC_ADJUST the TSC sync code can just fail to see the havoc.
>
> Even with TSC adjust the TSC can be slightly off by design on multi-socket
> systems.
Here are the gory details:
https://lore.kernel.org/lkml/3c1737210708230408i7a8049a9m5db49e6c4d89ab62@mail.gmail.com/
The changelog has an explanation as well.
d8bb6f4c1670 ("x86: tsc prevent time going backwards")
I still have one of the machines which is affected by this.
Thanks,
tglx
^ permalink raw reply
* Re: [PATCH 00/19] vmw_balloon: compaction, shrinker, 64-bit, etc.
From: Greg Kroah-Hartman @ 2018-09-18 12:27 UTC (permalink / raw)
To: Nadav Amit
Cc: Arnd Bergmann, Michael S. Tsirkin, Xavier Deguillard,
linux-kernel, virtualization, linux-mm
In-Reply-To: <20180918063853.198332-1-namit@vmware.com>
On Mon, Sep 17, 2018 at 11:38:34PM -0700, Nadav Amit wrote:
> This patch-set adds the following enhancements to the VMware balloon
> driver:
>
> 1. Balloon compaction support.
> 2. Report the number of inflated/deflated ballooned pages through vmstat.
> 3. Memory shrinker to avoid balloon over-inflation (and OOM).
> 4. Support VMs with memory limit that is greater than 16TB.
> 5. Faster and more aggressive inflation.
>
> To support compaction we wish to use the existing infrastructure.
> However, we need to make slight adaptions for it. We add a new list
> interface to balloon-compaction, which is more generic and efficient,
> since it does not require as many IRQ save/restore operations. We leave
> the old interface that is used by the virtio balloon.
>
> Big parts of this patch-set are cleanup and documentation. Patches 1-13
> simplify the balloon code, document its behavior and allow the balloon
> code to run concurrently. The support for concurrency is required for
> compaction and the shrinker interface.
>
> For documentation we use the kernel-doc format. We are aware that the
> balloon interface is not public, but following the kernel-doc format may
> be useful one day.
kbuild seems to not like this patch series, so I'm going to drop it from
my queue and wait for a v2 respin before looking at it.
thanks,
greg k-h
^ permalink raw reply
* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
From: Peter Zijlstra @ 2018-09-18 12:48 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Stephen Boyd,
X86 ML, LKML, Linux Virtualization, John Stultz, Andy Lutomirski,
Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <alpine.DEB.2.21.1809181237140.4167@nanos.tec.linutronix.de>
On Tue, Sep 18, 2018 at 12:41:57PM +0200, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> > On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> > > On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> > > > > Your memory serves you right. That's indeed observable on CPUs which
> > > > > lack TSC_ADJUST.
> > > >
> > > > But, if the gtod code can observe this, then why doesn't the code that
> > > > checks the sync?
> > >
> > > Because it depends where the involved CPUs are in the topology. The sync
> > > code might just run on the same package an simply not see it. Yes, w/o
> > > TSC_ADJUST the TSC sync code can just fail to see the havoc.
> >
> > Even with TSC adjust the TSC can be slightly off by design on multi-socket
> > systems.
>
> Here are the gory details:
>
> https://lore.kernel.org/lkml/3c1737210708230408i7a8049a9m5db49e6c4d89ab62@mail.gmail.com/
>
> The changelog has an explanation as well.
>
> d8bb6f4c1670 ("x86: tsc prevent time going backwards")
>
> I still have one of the machines which is affected by this.
Are we sure this isn't a load vs rdtsc reorder? Because if I look at the
current code:
notrace static u64 vread_tsc(void)
{
u64 ret = (u64)rdtsc_ordered();
u64 last = gtod->cycle_last;
if (likely(ret >= last))
return ret;
/*
* GCC likes to generate cmov here, but this branch is extremely
* predictable (it's just a function of time and the likely is
* very likely) and there's a data dependence, so force GCC
* to generate a branch instead. I don't barrier() because
* we don't actually need a barrier, and if this function
* ever gets inlined it will generate worse code.
*/
asm volatile ("");
return last;
}
That does:
lfence
rdtsc
load gtod->cycle_last
Which obviously allows us to observe a cycles_last that is later than
the rdtsc itself, and thus time can trivially go backwards.
The new code:
last = gtod->cycle_last;
cycles = vgetcyc(gtod->vclock_mode);
if (unlikely((s64)cycles < 0))
return vdso_fallback_gettime(clk, ts);
if (cycles > last)
ns += (cycles - last) * gtod->mult;
looks like:
load gtod->cycle_last
lfence
rdtsc
which avoids that possibility, the cycle_last load must have completed
before the rdtsc.
^ permalink raw reply
* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
From: Thomas Gleixner @ 2018-09-18 13:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Stephen Boyd,
X86 ML, LKML, Linux Virtualization, John Stultz, Andy Lutomirski,
Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <20180918124800.GL24106@hirez.programming.kicks-ass.net>
On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> On Tue, Sep 18, 2018 at 12:41:57PM +0200, Thomas Gleixner wrote:
> > I still have one of the machines which is affected by this.
>
> Are we sure this isn't a load vs rdtsc reorder? Because if I look at the
> current code:
The load order of last vs. rdtsc does not matter at all.
CPU0 CPU1
....
now0 = rdtsc_ordered();
...
tk->cycle_last = now0;
gtod->seq++;
gtod->cycle_last = tk->cycle_last;
...
gtod->seq++;
seq_begin(gtod->seq);
now1 = rdtsc_ordered();
So if the TSC on CPU1 is slightly behind the TSC on CPU0 then now1 can be
smaller than cycle_last. The TSC sync stuff does not catch the small delta
for unknown raisins. I'll go and find that machine and test that again.
Thanks,
tglx
^ permalink raw reply
* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
From: Peter Zijlstra @ 2018-09-18 13:38 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Stephen Boyd,
X86 ML, LKML, Linux Virtualization, John Stultz, Andy Lutomirski,
Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <alpine.DEB.2.21.1809181515170.6950@nanos.tec.linutronix.de>
On Tue, Sep 18, 2018 at 03:23:08PM +0200, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> > On Tue, Sep 18, 2018 at 12:41:57PM +0200, Thomas Gleixner wrote:
> > > I still have one of the machines which is affected by this.
> >
> > Are we sure this isn't a load vs rdtsc reorder? Because if I look at the
> > current code:
>
> The load order of last vs. rdtsc does not matter at all.
>
> CPU0 CPU1
>
> ....
> now0 = rdtsc_ordered();
> ...
> tk->cycle_last = now0;
>
> gtod->seq++;
> gtod->cycle_last = tk->cycle_last;
> ...
> gtod->seq++;
> seq_begin(gtod->seq);
> now1 = rdtsc_ordered();
>
> So if the TSC on CPU1 is slightly behind the TSC on CPU0 then now1 can be
> smaller than cycle_last. The TSC sync stuff does not catch the small delta
> for unknown raisins. I'll go and find that machine and test that again.
Yeah, somehow I forgot about rseq.. maybe I should go sleep or
something.
^ permalink raw reply
* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
From: Andy Lutomirski @ 2018-09-18 14:01 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra,
X86 ML, LKML, Linux Virtualization, Stephen Boyd, John Stultz,
Andy Lutomirski, Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <alpine.DEB.2.21.1809180948570.3558@nanos.tec.linutronix.de>
> On Sep 18, 2018, at 12:52 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> On Mon, 17 Sep 2018, John Stultz wrote:
>>> On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>> Also, I'm not entirely convinced that this "last" thing is needed at
>>> all. John, what's the scenario under which we need it?
>>
>> So my memory is probably a bit foggy, but I recall that as we
>> accelerated gettimeofday, we found that even on systems that claimed
>> to have synced TSCs, they were actually just slightly out of sync.
>> Enough that right after cycles_last had been updated, a read on
>> another cpu could come in just behind cycles_last, resulting in a
>> negative interval causing lots of havoc.
>>
>> So the sanity check is needed to avoid that case.
>
> Your memory serves you right. That's indeed observable on CPUs which
> lack TSC_ADJUST.
>
> @Andy: Welcome to the wonderful world of TSC.
>
Do we do better if we use signed arithmetic for the whole calculation? Then a small backwards movement would result in a small backwards result. Or we could offset everything so that we’d have to go back several hundred ms before we cross zero.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
From: Thomas Gleixner @ 2018-09-18 15:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Stephen Boyd,
X86 ML, LKML, Linux Virtualization, John Stultz, Andy Lutomirski,
Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <alpine.DEB.2.21.1809181515170.6950@nanos.tec.linutronix.de>
On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> So if the TSC on CPU1 is slightly behind the TSC on CPU0 then now1 can be
> smaller than cycle_last. The TSC sync stuff does not catch the small delta
> for unknown raisins. I'll go and find that machine and test that again.
Of course it does not trigger anymore. We accumulated code between the
point in timekeeping_advance() where the TSC is read and the update of the
VDSO data.
I'll might have to get an 2.6ish kernel booted on that machine and try with
that again. /me shudders
Thanks,
tglx
^ permalink raw reply
* Re: [RFC PATCH 1/2] virtio/s390: avoid race on vcdev->config
From: Cornelia Huck @ 2018-09-18 18:29 UTC (permalink / raw)
To: Halil Pasic; +Cc: linux-s390, Colin Ian King, kvm, virtualization
In-Reply-To: <20180912140202.12292-2-pasic@linux.ibm.com>
On Wed, 12 Sep 2018 16:02:01 +0200
Halil Pasic <pasic@linux.ibm.com> wrote:
> Currently we have a race on vcdev->config in virtio_ccw_get_config() and
> in virtio_ccw_set_config().
>
> This normally does not cause problems, as these are usually infrequent
> operations. But occasionally sysfs attributes are directly dependent on
> pieces of virio config and trigger a get on each read. This gives us at
> least one trigger.
So, the problem is that you might get unexpected/inconsistent config
information?
>
> Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> ---
> drivers/s390/virtio/virtio_ccw.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
> index 8f5c1d7f751a..a5e8530a3391 100644
> --- a/drivers/s390/virtio/virtio_ccw.c
> +++ b/drivers/s390/virtio/virtio_ccw.c
> @@ -828,6 +828,7 @@ static void virtio_ccw_get_config(struct virtio_device *vdev,
> int ret;
> struct ccw1 *ccw;
> void *config_area;
> + unsigned long flags;
>
> ccw = kzalloc(sizeof(*ccw), GFP_DMA | GFP_KERNEL);
> if (!ccw)
> @@ -846,11 +847,13 @@ static void virtio_ccw_get_config(struct virtio_device *vdev,
> if (ret)
> goto out_free;
>
> + spin_lock_irqsave(&vcdev->lock, flags);
I'm not sure that vcdev->lock is the right lock to use for this, but
I'm a bit unsure about what you're guarding against here.
> memcpy(vcdev->config, config_area, offset + len);
> - if (buf)
> - memcpy(buf, &vcdev->config[offset], len);
> if (vcdev->config_ready < offset + len)
> vcdev->config_ready = offset + len;
> + spin_unlock_irqrestore(&vcdev->lock, flags);
> + if (buf)
> + memcpy(buf, config_area + offset, len);
>
> out_free:
> kfree(config_area);
> @@ -864,6 +867,7 @@ static void virtio_ccw_set_config(struct virtio_device *vdev,
> struct virtio_ccw_device *vcdev = to_vc_device(vdev);
> struct ccw1 *ccw;
> void *config_area;
> + unsigned long flags;
>
> ccw = kzalloc(sizeof(*ccw), GFP_DMA | GFP_KERNEL);
> if (!ccw)
> @@ -876,9 +880,11 @@ static void virtio_ccw_set_config(struct virtio_device *vdev,
> /* Make sure we don't overwrite fields. */
> if (vcdev->config_ready < offset)
> virtio_ccw_get_config(vdev, 0, NULL, offset);
> + spin_lock_irqsave(&vcdev->lock, flags);
> memcpy(&vcdev->config[offset], buf, len);
> /* Write the config area to the host. */
> memcpy(config_area, vcdev->config, sizeof(vcdev->config));
> + spin_unlock_irqrestore(&vcdev->lock, flags);
> ccw->cmd_code = CCW_CMD_WRITE_CONF;
> ccw->flags = 0;
> ccw->count = offset + len;
One thing that might be a problem here is how reading/writing the
config stuff works for virtio-ccw: As channel devices don't have a
config space like e.g. PCI devices, I had to come up with a way to
implement something like that via dedicated channel commands. I did not
want to go via a payload that would provide offset/length, but went
with reading/writing everything before the value to be read/written as
well. That means we need to call read before write to make sure we
don't overwrite things (as the comment states), and how this is done
might be problematic.
I'm thinking what we may need is a way to make "read and then write" a
single operation and make sure that it does not run concurrently with
the simple read operation. Factor out the proper invocation of the read
command and the status update, have get_config call this with a reader
lock and have set_config call the read-then-write combo with a write
lock, maybe?
I might not understand the problem correctly, though (or I might have
spent too much time reading email today already :)
^ permalink raw reply
* Re: [RFC PATCH 2/2] virtio/s390: fix race in ccw_io_helper()
From: Cornelia Huck @ 2018-09-18 18:45 UTC (permalink / raw)
To: Halil Pasic; +Cc: linux-s390, Colin Ian King, kvm, virtualization
In-Reply-To: <20180912140202.12292-3-pasic@linux.ibm.com>
On Wed, 12 Sep 2018 16:02:02 +0200
Halil Pasic <pasic@linux.ibm.com> wrote:
> While ccw_io_helper() seems like intended to be exclusive in a sense that
> it is supposed to facilitate I/O for at most one thread at any given
> time, there is actually nothing ensuring that threads won't pile up at
> vcdev->wait_q. If they all threads get woken up and see the status that
> belongs to some other request as their own. This can lead to bugs. For an
> example see :
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1788432
>
> This normally does not cause problems, as these are usually infrequent
> operations that happen in a well defined sequence and normally do not
> fail. But occasionally sysfs attributes are directly dependent
> on pieces of virio config and trigger a get on each read. This gives us
> at least one method to trigger races.
Yes, the idea behind ccw_io_helper() was to provide a simple way to use
the inherently asynchronous channel I/O operations in a synchronous
way, as that's what the virtio callbacks expect. I did not consider
multiple callbacks for a device running at the same time; but if the
interface allows that, we obviously need to be able to handle it.
Has this only been observed for the config get/set commands? (The
read-before-write thing?)
>
> Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> Reported-by: Colin Ian King <colin.king@canonical.com>
> ---
>
> This is a big hammer -- mutex on virtio_ccw device level would more than
> suffice. But I don't think it hurts, and maybe there is a better way e.g.
> one using some common ccw/cio mechanisms to address this. That's why this
> is an RFC.
I'm for using more delicate tools, if possible :)
We basically have two options:
- Have a way to queue I/O operations and then handle them in sequence.
Creates complexity, and is likely overkill. (We already have a kind
of serialization because we re-submit the channel program until the
hypervisor accepts it; the problem comes from the wait queue usage.)
- Add serialization around the submit/wait procedure (as you did), but
with a per-device mutex. That looks like the easiest solution.
> ---
> drivers/s390/virtio/virtio_ccw.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
> index a5e8530a3391..36252f344660 100644
> --- a/drivers/s390/virtio/virtio_ccw.c
> +++ b/drivers/s390/virtio/virtio_ccw.c
> @@ -289,6 +289,8 @@ static int doing_io(struct virtio_ccw_device *vcdev, __u32 flag)
> return ret;
> }
>
> +DEFINE_MUTEX(vcio_mtx);
> +
> static int ccw_io_helper(struct virtio_ccw_device *vcdev,
> struct ccw1 *ccw, __u32 intparm)
> {
> @@ -296,6 +298,7 @@ static int ccw_io_helper(struct virtio_ccw_device *vcdev,
> unsigned long flags;
> int flag = intparm & VIRTIO_CCW_INTPARM_MASK;
>
> + mutex_lock(&vcio_mtx);
> do {
> spin_lock_irqsave(get_ccwdev_lock(vcdev->cdev), flags);
> ret = ccw_device_start(vcdev->cdev, ccw, intparm, 0, 0);
> @@ -308,7 +311,9 @@ static int ccw_io_helper(struct virtio_ccw_device *vcdev,
> cpu_relax();
> } while (ret == -EBUSY);
We probably still want to keep this while loop to be on the safe side
(unsolicited status from the hypervisor, for example.)
> wait_event(vcdev->wait_q, doing_io(vcdev, flag) == 0);
> - return ret ? ret : vcdev->err;
> + ret = ret ? ret : vcdev->err;
> + mutex_unlock(&vcio_mtx);
> + return ret;
> }
>
> static void virtio_ccw_drop_indicator(struct virtio_ccw_device *vcdev,
^ permalink raw reply
* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
From: Thomas Gleixner @ 2018-09-18 22:46 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Florian Weimer, Juergen Gross, Arnd Bergmann, Peter Zijlstra,
X86 ML, LKML, Linux Virtualization, Stephen Boyd, John Stultz,
Andy Lutomirski, Paolo Bonzini, devel, Matt Rickard
In-Reply-To: <863331ED-B04A-4B94-91A2-D34002C9CCDC@amacapital.net>
[-- Attachment #1: Type: text/plain, Size: 1734 bytes --]
On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> > On Sep 18, 2018, at 12:52 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> >> On Mon, 17 Sep 2018, John Stultz wrote:
> >>> On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >>> Also, I'm not entirely convinced that this "last" thing is needed at
> >>> all. John, what's the scenario under which we need it?
> >>
> >> So my memory is probably a bit foggy, but I recall that as we
> >> accelerated gettimeofday, we found that even on systems that claimed
> >> to have synced TSCs, they were actually just slightly out of sync.
> >> Enough that right after cycles_last had been updated, a read on
> >> another cpu could come in just behind cycles_last, resulting in a
> >> negative interval causing lots of havoc.
> >>
> >> So the sanity check is needed to avoid that case.
> >
> > Your memory serves you right. That's indeed observable on CPUs which
> > lack TSC_ADJUST.
> >
> > @Andy: Welcome to the wonderful world of TSC.
> >
>
> Do we do better if we use signed arithmetic for the whole calculation?
> Then a small backwards movement would result in a small backwards result.
> Or we could offset everything so that we’d have to go back several
> hundred ms before we cross zero.
That would be probably the better solution as signed math would be
problematic when the resulting ns value becomes negative. As the delta is
really small, otherwise the TSC sync check would have caught it, the caller
should never be able to observe time going backwards.
I'll have a look into that. It needs some thought vs. the fractional part
of the base time, but it should be not rocket science to get that
correct. Famous last words...
Thanks,
tglx
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox