* [PATCH 4/4] timekeeping: Use printk_deferred when holding timekeeping seqlock
  2014-05-02 22:09 [PATCH 0/4] Convert timekeeping core to use printk_deferred (v2) John Stultz
@ 2014-05-02 22:09 ` John Stultz
  0 siblings, 0 replies; 16+ messages in thread
From: John Stultz @ 2014-05-02 22:09 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Jan Kara, Peter Zijlstra, Jiri Bohac,
	Thomas Gleixner, Ingo Molnar, Andrew Morton, Steven Rostedt

Jiri Bohac pointed out that there are rare but real deadlock
possibilities when calling printk() while holding the timekeeping
seqlock.

This is due to printk() triggering console sem wakeup, which can
cause scheduling code to trigger hrtimers which may try to read
the time.

Specifically, as Jiri pointed out, that path is:
  printk
    vprintk_emit
      console_unlock
        up(&console_sem)
          __up
	    wake_up_process
	      try_to_wake_up
	        ttwu_do_activate
		  ttwu_activate
		    activate_task
		      enqueue_task
		        enqueue_task_fair
			  hrtick_update
			    hrtick_start_fair
			      hrtick_start_fair
			        get_time
				  ktime_get
				    --> endless loop on
				    read_seqcount_retry(&timekeeper_seq, ...)

This patch tries to avoid this issue by using printk_deferred (previously
named printk_sched), which defers the printing via an irq_work_queue.

Cc: Jan Kara <jack@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 kernel/time/ntp.c         | 15 +++++++++------
 kernel/time/timekeeping.c |  7 ++++---
 2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index 419a52c..5b0ac4d 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -786,8 +786,9 @@ static long hardpps_update_freq(struct pps_normtime freq_norm)
 		time_status |= STA_PPSERROR;
 		pps_errcnt++;
 		pps_dec_freq_interval();
-		pr_err("hardpps: PPSERROR: interval too long - %ld s\n",
-				freq_norm.sec);
+		printk_deferred(KERN_ERR
+			"hardpps: PPSERROR: interval too long - %ld s\n",
+			freq_norm.sec);
 		return 0;
 	}
 
@@ -800,7 +801,8 @@ static long hardpps_update_freq(struct pps_normtime freq_norm)
 	delta = shift_right(ftemp - pps_freq, NTP_SCALE_SHIFT);
 	pps_freq = ftemp;
 	if (delta > PPS_MAXWANDER || delta < -PPS_MAXWANDER) {
-		pr_warning("hardpps: PPSWANDER: change=%ld\n", delta);
+		printk_deferred(KERN_WARNING
+				"hardpps: PPSWANDER: change=%ld\n", delta);
 		time_status |= STA_PPSWANDER;
 		pps_stbcnt++;
 		pps_dec_freq_interval();
@@ -844,8 +846,9 @@ static void hardpps_update_phase(long error)
 	 * the time offset is updated.
 	 */
 	if (jitter > (pps_jitter << PPS_POPCORN)) {
-		pr_warning("hardpps: PPSJITTER: jitter=%ld, limit=%ld\n",
-		       jitter, (pps_jitter << PPS_POPCORN));
+		printk_deferred(KERN_WARNING
+				"hardpps: PPSJITTER: jitter=%ld, limit=%ld\n",
+				jitter, (pps_jitter << PPS_POPCORN));
 		time_status |= STA_PPSJITTER;
 		pps_jitcnt++;
 	} else if (time_status & STA_PPSTIME) {
@@ -902,7 +905,7 @@ void __hardpps(const struct timespec *phase_ts, const struct timespec *raw_ts)
 		time_status |= STA_PPSJITTER;
 		/* restart the frequency calibration interval */
 		pps_fbase = *raw_ts;
-		pr_err("hardpps: PPSJITTER: bad pulse\n");
+		printk_deferred(KERN_ERR "hardpps: PPSJITTER: bad pulse\n");
 		return;
 	}
 
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index f7df8ea..ffd3113 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -852,8 +852,9 @@ static void __timekeeping_inject_sleeptime(struct timekeeper *tk,
 							struct timespec *delta)
 {
 	if (!timespec_valid_strict(delta)) {
-		printk(KERN_WARNING "__timekeeping_inject_sleeptime: Invalid "
-					"sleep delta value!\n");
+		printk_deferred(KERN_WARNING
+				"__timekeeping_inject_sleeptime: Invalid "
+				"sleep delta value!\n");
 		return;
 	}
 	tk_xtime_add(tk, delta);
@@ -1157,7 +1158,7 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 
 	if (unlikely(tk->clock->maxadj &&
 		(tk->mult + adj > tk->clock->mult + tk->clock->maxadj))) {
-		printk_once(KERN_WARNING
+		printk_once_deferred(KERN_WARNING
 			"Adjusting %s more than 11%% (%ld vs %ld)\n",
 			tk->clock->name, (long)tk->mult + adj,
 			(long)tk->clock->mult + tk->clock->maxadj);
-- 
1.9.1



* [PATCH 4/4] timekeeping: Use printk_deferred when holding timekeeping seqlock
  2014-05-05 20:47 [PATCH 0/4] Convert timekeeping core to use printk_deferred (v3) John Stultz
@ 2014-05-05 20:47 ` John Stultz
  0 siblings, 0 replies; 16+ messages in thread
From: John Stultz @ 2014-05-05 20:47 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Jan Kara, Peter Zijlstra, Jiri Bohac,
	Thomas Gleixner, Ingo Molnar, Andrew Morton, Steven Rostedt

Jiri Bohac pointed out that there are rare but real deadlock
possibilities when calling printk() while holding the timekeeping
seqlock.

This is due to printk() triggering console sem wakeup, which can
cause scheduling code to trigger hrtimers which may try to read
the time.

Specifically, as Jiri pointed out, that path is:
  printk
    vprintk_emit
      console_unlock
        up(&console_sem)
          __up
	    wake_up_process
	      try_to_wake_up
	        ttwu_do_activate
		  ttwu_activate
		    activate_task
		      enqueue_task
		        enqueue_task_fair
			  hrtick_update
			    hrtick_start_fair
			      hrtick_start_fair
			        get_time
				  ktime_get
				    --> endless loop on
				    read_seqcount_retry(&timekeeper_seq, ...)

This patch tries to avoid this issue by using printk_deferred (previously
named printk_sched), which defers the printing via an irq_work_queue.

Cc: Jan Kara <jack@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Jiri Bohac <jbohac@suse.cz>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 kernel/time/ntp.c         | 15 +++++++++------
 kernel/time/timekeeping.c |  7 ++++---
 2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index 419a52c..5b0ac4d 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -786,8 +786,9 @@ static long hardpps_update_freq(struct pps_normtime freq_norm)
 		time_status |= STA_PPSERROR;
 		pps_errcnt++;
 		pps_dec_freq_interval();
-		pr_err("hardpps: PPSERROR: interval too long - %ld s\n",
-				freq_norm.sec);
+		printk_deferred(KERN_ERR
+			"hardpps: PPSERROR: interval too long - %ld s\n",
+			freq_norm.sec);
 		return 0;
 	}
 
@@ -800,7 +801,8 @@ static long hardpps_update_freq(struct pps_normtime freq_norm)
 	delta = shift_right(ftemp - pps_freq, NTP_SCALE_SHIFT);
 	pps_freq = ftemp;
 	if (delta > PPS_MAXWANDER || delta < -PPS_MAXWANDER) {
-		pr_warning("hardpps: PPSWANDER: change=%ld\n", delta);
+		printk_deferred(KERN_WARNING
+				"hardpps: PPSWANDER: change=%ld\n", delta);
 		time_status |= STA_PPSWANDER;
 		pps_stbcnt++;
 		pps_dec_freq_interval();
@@ -844,8 +846,9 @@ static void hardpps_update_phase(long error)
 	 * the time offset is updated.
 	 */
 	if (jitter > (pps_jitter << PPS_POPCORN)) {
-		pr_warning("hardpps: PPSJITTER: jitter=%ld, limit=%ld\n",
-		       jitter, (pps_jitter << PPS_POPCORN));
+		printk_deferred(KERN_WARNING
+				"hardpps: PPSJITTER: jitter=%ld, limit=%ld\n",
+				jitter, (pps_jitter << PPS_POPCORN));
 		time_status |= STA_PPSJITTER;
 		pps_jitcnt++;
 	} else if (time_status & STA_PPSTIME) {
@@ -902,7 +905,7 @@ void __hardpps(const struct timespec *phase_ts, const struct timespec *raw_ts)
 		time_status |= STA_PPSJITTER;
 		/* restart the frequency calibration interval */
 		pps_fbase = *raw_ts;
-		pr_err("hardpps: PPSJITTER: bad pulse\n");
+		printk_deferred(KERN_ERR "hardpps: PPSJITTER: bad pulse\n");
 		return;
 	}
 
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index f7df8ea..32d8d6a 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -852,8 +852,9 @@ static void __timekeeping_inject_sleeptime(struct timekeeper *tk,
 							struct timespec *delta)
 {
 	if (!timespec_valid_strict(delta)) {
-		printk(KERN_WARNING "__timekeeping_inject_sleeptime: Invalid "
-					"sleep delta value!\n");
+		printk_deferred(KERN_WARNING
+				"__timekeeping_inject_sleeptime: Invalid "
+				"sleep delta value!\n");
 		return;
 	}
 	tk_xtime_add(tk, delta);
@@ -1157,7 +1158,7 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 
 	if (unlikely(tk->clock->maxadj &&
 		(tk->mult + adj > tk->clock->mult + tk->clock->maxadj))) {
-		printk_once(KERN_WARNING
+		printk_deferred_once(KERN_WARNING
 			"Adjusting %s more than 11%% (%ld vs %ld)\n",
 			tk->clock->name, (long)tk->mult + adj,
 			(long)tk->clock->mult + tk->clock->maxadj);
-- 
1.9.1



* Re: [PATCH 4/4] timekeeping: Use printk_deferred when holding timekeeping seqlock
@ 2014-05-06 21:33 George Spelvin
  2014-05-12 16:21 ` [rough draft PATCH] avoid stalls on the " George Spelvin
  2014-05-12 17:49 ` [PATCH 4/4] timekeeping: Use printk_deferred when holding " John Stultz
  0 siblings, 2 replies; 16+ messages in thread
From: George Spelvin @ 2014-05-06 21:33 UTC (permalink / raw)
  To: john.stultz; +Cc: linux, linux-kernel

One misfeature in the timekeeping seqlock code I noticed is that
read_seqcount_begin returns "unsigned int", not "unsigned long".

Casting to a larger type is harmless, but inefficient.

> This is due to printk() triggering console sem wakeup, which can
> cause scheduling code to trigger hrtimers which may try to read
> the time.

An alternative solution, which avoids the need for this entire patch
series, is to make ktime_get completely nonblocking.

To do that, use a seqlock variant wherein you maintain an array of "struct
timekeeper" structures so that reading is always non-blocking, even if
the writer makes no progress.  (I.e. livelock is remotely possible, but
deadlock is not.)

In the basic version, there are two.  (You can add more to further
reduce the chance of livelock.)  The currently stable one is indicated
by (timekeeper_seq->sequence & 2).  Writers update the appropriate
one ping-pong.  The low two bits indicate four phases:

0: Both timekeepers stable, [0] preferred for reading
1: Timekeeper[1] is being written; read timekeeper[0] only
2: Both timekeepers stable, [1] preferred for reading
3: Timekeeper[0] is being written; read timekeeper[1] only

The actual writer locking code is exactly the current write_seqcount_begin
and write_seqcount_end code.
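
For concreteness, the writer might look something like this (untested
sketch; frobnicate_for_update() is a placeholder for whatever the
writer actually changes):

	int b;

	/* Caller holds timekeeper_lock */
	write_seqcount_begin(&timekeeper_seq);	/* sequence goes odd */
	b = (timekeeper_seq.sequence >> 1) & 1;	/* index of the stable copy */
	timekeeper[!b] = timekeeper[b];		/* duplicate it... */
	frobnicate_for_update(&timekeeper[!b]);	/* ...and update the copy */
	write_seqcount_end(&timekeeper_seq);	/* even again; readers flip to !b */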

A reader needs to retry only if end_seq > (start_seq & ~1u) + 2.
Accounting for wraparound, the read sequence is:

unsigned raw_read_FOO_begin(seqcount_t const *s)
{
	unsigned ret = ACCESS_ONCE(s->sequence);
	smp_rmb();
	return ret;
}

unsigned raw_read_FOO_retry(seqcount_t const *s, unsigned start)
{
	smp_rmb();
	start &= ~1u;
	return unlikely(s->sequence - start > 2);
}


A reader does:

	unsigned seq;

	do {
		struct timekeeper const *tk;

		seq = read_FOO_begin(&timekeeper_seq);
		tk = timekeeper + ((seq >> 1) & 1);
		frobnicate(tk);
	} while (read_FOO_retry(&timekeeper_seq, seq));

I haven't yet thought of a good name (to replace FOO) for this.
"seqnonblock"?


If you have more frequent updates, there's another variant that does
away with the lsbit of the sequence number, so the writer only increments
it once, after each update.  This reduces coherency traffic.  It does,
however, cause more readers to retry unless you go to an array of 4
structures, sketched below.
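
A sketch of that variant (untested; FOO4 is as arbitrary a name as FOO
above, and frobnicate_for_update() again stands in for the real update):

/* Stand-in for the real update: */
void frobnicate_for_update(struct timekeeper *tk);

/*
 * The count is now of *completed* updates: readers use the most
 * recently completed slot, seq & 3, and the writer prepares the next
 * one, (seq + 1) & 3.  Slot seq & 3 is not rewritten until the count
 * reaches seq + 3, so a reader tolerates two full concurrent updates.
 * (The retry test looks the same as above, but each tick here is a
 * whole update rather than a half-step.)
 */
unsigned raw_read_FOO4_begin(seqcount_t const *s)
{
	unsigned ret = ACCESS_ONCE(s->sequence);
	smp_rmb();
	return ret;
}

unsigned raw_read_FOO4_retry(seqcount_t const *s, unsigned start)
{
	smp_rmb();
	return unlikely(s->sequence - start > 2);
}

/* Writer, with the usual spinlock held: */
void FOO4_update(seqcount_t *s, struct timekeeper tks[4])
{
	unsigned seq = s->sequence;

	tks[(seq + 1) & 3] = tks[seq & 3];	/* copy current state */
	frobnicate_for_update(&tks[(seq + 1) & 3]);
	smp_wmb();				/* data before count  */
	s->sequence = seq + 1;			/* publish            */
}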


* [rough draft PATCH] avoid stalls on the timekeeping seqlock
  2014-05-06 21:33 [PATCH 4/4] timekeeping: Use printk_deferred when holding timekeeping seqlock George Spelvin
@ 2014-05-12 16:21 ` George Spelvin
  2014-05-12 18:23   ` John Stultz
  2014-05-12 17:49 ` [PATCH 4/4] timekeeping: Use printk_deferred when holding " John Stultz
  1 sibling, 1 reply; 16+ messages in thread
From: George Spelvin @ 2014-05-12 16:21 UTC (permalink / raw)
  To: john.stultz, linux; +Cc: linux-kernel

Here's a non-working rough draft of that idea I suggested to make
reading the time non-blocking, even if an update is in progress.

Basically, it uses the idea proposed in a comment in update_wall_time,
switching pointers so there's always one valid structure.

This is non-working because last year the NTP variables lost their
own locking and inherited the timekeeping locks I am redesigning.
I haven't updated NTP yet.

One interesting possibility is that the write side of the locking
is identical to a standard seqlock.  It would be possible to
divide the timekeeping variables into non-blocking variables which
are mirrored, and ones that require stalling during write
seqlock updates.

But that's somewhat deeper magic than I've attempted so far.
This is a demonstration of the idea.

Does it seem worth pursuing?

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index f7df8ea217..0dfa4aa6fb 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -29,15 +29,15 @@
 #include "timekeeping_internal.h"
 
 #define TK_CLEAR_NTP		(1 << 0)
-#define TK_MIRROR		(1 << 1)
 #define TK_CLOCK_WAS_SET	(1 << 2)
 
-static struct timekeeper timekeeper;
+static struct timekeeper timekeeper[2];
 static DEFINE_RAW_SPINLOCK(timekeeper_lock);
+/* The following is NOT used as a standard seqlock */
 static seqcount_t timekeeper_seq;
-static struct timekeeper shadow_timekeeper;
 
 /* flag for if timekeeping is suspended */
+/* Q: What are the locking rules for this variable? */
 int __read_mostly timekeeping_suspended;
 
 /* Flag for if there is a persistent clock on this platform */
@@ -165,7 +165,7 @@ u32 get_arch_timeoffset(void)
 static inline u32 get_arch_timeoffset(void) { return 0; }
 #endif
 
-static inline s64 timekeeping_get_ns(struct timekeeper *tk)
+static inline s64 timekeeping_get_ns(struct timekeeper const *tk)
 {
 	cycle_t cycle_now, cycle_delta;
 	struct clocksource *clock;
@@ -185,7 +185,7 @@ static inline s64 timekeeping_get_ns(struct timekeeper *tk)
 	return nsec + get_arch_timeoffset();
 }
 
-static inline s64 timekeeping_get_ns_raw(struct timekeeper *tk)
+static inline s64 timekeeping_get_ns_raw(struct timekeeper const *tk)
 {
 	cycle_t cycle_now, cycle_delta;
 	struct clocksource *clock;
@@ -217,12 +217,13 @@ static void update_pvclock_gtod(struct timekeeper *tk, bool was_set)
  */
 int pvclock_gtod_register_notifier(struct notifier_block *nb)
 {
-	struct timekeeper *tk = &timekeeper;
+	struct timekeeper *tk;
 	unsigned long flags;
 	int ret;
 
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
 	ret = raw_notifier_chain_register(&pvclock_gtod_chain, nb);
+	tk = timekeeper_current();
 	update_pvclock_gtod(tk, true);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
@@ -256,9 +257,6 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
 	}
 	update_vsyscall(tk);
 	update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
-
-	if (action & TK_MIRROR)
-		memcpy(&shadow_timekeeper, &timekeeper, sizeof(timekeeper));
 }
 
 /**
@@ -291,6 +289,89 @@ static void timekeeping_forward_now(struct timekeeper *tk)
 }
 
 /**
+ * timekeeper_write_begin: Return a timekeeper that can be updated.
+ *
+ * Must be called with the timekeeper_lock held.
+ */
+static inline struct timekeeper *timekeeper_write_begin(void)
+{
+	bool b;
+
+	write_seqcount_begin(&timekeeper_seq);
+	b = (timekeeper_seq.sequence >> 1) & 1;
+	timekeeper[!b] = timekeeper[b];
+	return timekeeper + !b;
+}
+
+/**
+ * timekeeper_write_end: Finish write, mark the modified timekeeper as current.
+ *
+ * Must be called with the timekeeper_lock held.
+ */
+static inline void timekeeper_write_end(void)
+{
+	write_seqcount_end(&timekeeper_seq);
+}
+
+/**
+ * __timekeeper_current: Return the current (for reading) timekeeper
+ * @seq: The current sequence number
+ *
+ * Return the timekeeper corresponding to the given sequence number.
+ */
+static inline struct timekeeper const *__timekeeper_current(unsigned seq)
+{
+	return timekeeper + ((seq >> 1) & 1);
+}
+
+/**
+ * timekeeper_current: Return the current (for reading) timekeeper
+ *
+ * On rare occasions, we want the current timekeeper without obtaining
+ * the seqlock.  For example, if we hold the timekeeper_lock but don't
+ * intend to write it.
+ */
+static inline struct timekeeper const *timekeeper_current(void)
+{
+	return __timekeeper_current(timekeeper_seq.sequence);
+}
+
+/**
+ * timekeeper_read_begin: Begin reading a timekeeper.
+ * @seqp: Pointer to variable to receive sequence number.
+ *	(Because this is inline, the compiler can optimize out
+ *	the memory access.)
+ *
+ * Returns a pointer to a readable timekeeper structure.
+ *
+ * Because we have two timekeeper structures that we ping-pong
+ * between, this never blocks.  Only if there are two calls
+ * to timekeeper_write_begin between read_begin and read_retry
+ * will a retry be forced.
+ */
+static inline struct timekeeper const *timekeeper_read_begin(unsigned *seqp)
+{
+	unsigned seq = ACCESS_ONCE(timekeeper_seq.sequence);
+	smp_rmb();
+	*seqp = seq &= ~1u;
+	return __timekeeper_current(seq);
+}
+
+/**
+ * timekeeper_read_retry: Return true if read was inconsistent, must retry
+ * @seq: The return value from timekeeper_read_begin
+ *
+ * Because we ping-pong between two timekeeper structures, the window
+ * of validity is wider than a normal seqlock, and a retry is very
+ * unlikely.
+ */
+static inline bool timekeeper_read_retry(unsigned seq)
+{
+	unsigned delta = timekeeper_seq.sequence - seq;
+	return unlikely(delta > 2);
+}
+
+/**
  * __getnstimeofday - Returns the time of day in a timespec.
  * @ts:		pointer to the timespec to be set
  *
@@ -299,17 +380,16 @@ static void timekeeping_forward_now(struct timekeeper *tk)
  */
 int __getnstimeofday(struct timespec *ts)
 {
-	struct timekeeper *tk = &timekeeper;
-	unsigned long seq;
+	unsigned seq;
 	s64 nsecs = 0;
 
 	do {
-		seq = read_seqcount_begin(&timekeeper_seq);
+		struct timekeeper const *tk = timekeeper_read_begin(&seq);
 
 		ts->tv_sec = tk->xtime_sec;
 		nsecs = timekeeping_get_ns(tk);
 
-	} while (read_seqcount_retry(&timekeeper_seq, seq));
+	} while (timekeeper_read_retry(seq));
 
 	ts->tv_nsec = 0;
 	timespec_add_ns(ts, nsecs);
@@ -338,18 +418,18 @@ EXPORT_SYMBOL(getnstimeofday);
 
 ktime_t ktime_get(void)
 {
-	struct timekeeper *tk = &timekeeper;
 	unsigned int seq;
 	s64 secs, nsecs;
 
 	WARN_ON(timekeeping_suspended);
 
 	do {
-		seq = read_seqcount_begin(&timekeeper_seq);
+		struct timekeeper const *tk = timekeeper_read_begin(&seq);
+
 		secs = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
 		nsecs = timekeeping_get_ns(tk) + tk->wall_to_monotonic.tv_nsec;
 
-	} while (read_seqcount_retry(&timekeeper_seq, seq));
+	} while (timekeeper_read_retry(seq));
 	/*
 	 * Use ktime_set/ktime_add_ns to create a proper ktime on
 	 * 32-bit architectures without CONFIG_KTIME_SCALAR.
@@ -368,7 +448,6 @@ EXPORT_SYMBOL_GPL(ktime_get);
  */
 void ktime_get_ts(struct timespec *ts)
 {
-	struct timekeeper *tk = &timekeeper;
 	struct timespec tomono;
 	s64 nsec;
 	unsigned int seq;
@@ -376,12 +455,13 @@ void ktime_get_ts(struct timespec *ts)
 	WARN_ON(timekeeping_suspended);
 
 	do {
-		seq = read_seqcount_begin(&timekeeper_seq);
+		struct timekeeper const *tk = timekeeper_read_begin(&seq);
+
 		ts->tv_sec = tk->xtime_sec;
 		nsec = timekeeping_get_ns(tk);
 		tomono = tk->wall_to_monotonic;
 
-	} while (read_seqcount_retry(&timekeeper_seq, seq));
+	} while (timekeeper_read_retry(seq));
 
 	ts->tv_sec += tomono.tv_sec;
 	ts->tv_nsec = 0;
@@ -398,19 +478,18 @@ EXPORT_SYMBOL_GPL(ktime_get_ts);
  */
 void timekeeping_clocktai(struct timespec *ts)
 {
-	struct timekeeper *tk = &timekeeper;
-	unsigned long seq;
+	unsigned seq;
 	u64 nsecs;
 
 	WARN_ON(timekeeping_suspended);
 
 	do {
-		seq = read_seqcount_begin(&timekeeper_seq);
+		struct timekeeper const *tk = timekeeper_read_begin(&seq);
 
 		ts->tv_sec = tk->xtime_sec + tk->tai_offset;
 		nsecs = timekeeping_get_ns(tk);
 
-	} while (read_seqcount_retry(&timekeeper_seq, seq));
+	} while (timekeeper_read_retry(seq));
 
 	ts->tv_nsec = 0;
 	timespec_add_ns(ts, nsecs);
@@ -446,14 +525,13 @@ EXPORT_SYMBOL(ktime_get_clocktai);
  */
 void getnstime_raw_and_real(struct timespec *ts_raw, struct timespec *ts_real)
 {
-	struct timekeeper *tk = &timekeeper;
-	unsigned long seq;
+	unsigned seq;
 	s64 nsecs_raw, nsecs_real;
 
 	WARN_ON_ONCE(timekeeping_suspended);
 
 	do {
-		seq = read_seqcount_begin(&timekeeper_seq);
+		struct timekeeper const *tk = timekeeper_read_begin(&seq);
 
 		*ts_raw = tk->raw_time;
 		ts_real->tv_sec = tk->xtime_sec;
@@ -462,7 +540,7 @@ void getnstime_raw_and_real(struct timespec *ts_raw, struct timespec *ts_real)
 		nsecs_raw = timekeeping_get_ns_raw(tk);
 		nsecs_real = timekeeping_get_ns(tk);
 
-	} while (read_seqcount_retry(&timekeeper_seq, seq));
+	} while (timekeeper_read_retry(seq));
 
 	timespec_add_ns(ts_raw, nsecs_raw);
 	timespec_add_ns(ts_real, nsecs_real);
@@ -495,7 +573,7 @@ EXPORT_SYMBOL(do_gettimeofday);
  */
 int do_settimeofday(const struct timespec *tv)
 {
-	struct timekeeper *tk = &timekeeper;
+	struct timekeeper *tk;
 	struct timespec ts_delta, xt;
 	unsigned long flags;
 
@@ -503,7 +581,7 @@ int do_settimeofday(const struct timespec *tv)
 		return -EINVAL;
 
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
-	write_seqcount_begin(&timekeeper_seq);
+	tk = timekeeper_write_begin();
 
 	timekeeping_forward_now(tk);
 
@@ -515,9 +593,9 @@ int do_settimeofday(const struct timespec *tv)
 
 	tk_set_xtime(tk, tv);
 
-	timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);
+	timekeeping_update(tk, TK_CLEAR_NTP | TK_CLOCK_WAS_SET);
 
-	write_seqcount_end(&timekeeper_seq);
+	timekeeper_write_end();
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
 	/* signal hrtimers about time change */
@@ -535,7 +613,7 @@ EXPORT_SYMBOL(do_settimeofday);
  */
 int timekeeping_inject_offset(struct timespec *ts)
 {
-	struct timekeeper *tk = &timekeeper;
+	struct timekeeper *tk;
 	unsigned long flags;
 	struct timespec tmp;
 	int ret = 0;
@@ -544,7 +622,7 @@ int timekeeping_inject_offset(struct timespec *ts)
 		return -EINVAL;
 
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
-	write_seqcount_begin(&timekeeper_seq);
+	tk = timekeeper_write_begin();
 
 	timekeeping_forward_now(tk);
 
@@ -559,9 +637,9 @@ int timekeeping_inject_offset(struct timespec *ts)
 	tk_set_wall_to_mono(tk, timespec_sub(tk->wall_to_monotonic, *ts));
 
 error: /* even if we error out, we forwarded the time, so call update */
-	timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);
+	timekeeping_update(tk, TK_CLEAR_NTP | TK_CLOCK_WAS_SET);
 
-	write_seqcount_end(&timekeeper_seq);
+	timekeeper_write_end();
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
 	/* signal hrtimers about time change */
@@ -578,14 +656,14 @@ EXPORT_SYMBOL(timekeeping_inject_offset);
  */
 s32 timekeeping_get_tai_offset(void)
 {
-	struct timekeeper *tk = &timekeeper;
 	unsigned int seq;
 	s32 ret;
 
 	do {
-		seq = read_seqcount_begin(&timekeeper_seq);
+		struct timekeeper const *tk = timekeeper_read_begin(&seq);
+
 		ret = tk->tai_offset;
-	} while (read_seqcount_retry(&timekeeper_seq, seq));
+	} while (timekeeper_read_retry(seq));
 
 	return ret;
 }
@@ -606,14 +684,14 @@ static void __timekeeping_set_tai_offset(struct timekeeper *tk, s32 tai_offset)
  */
 void timekeeping_set_tai_offset(s32 tai_offset)
 {
-	struct timekeeper *tk = &timekeeper;
+	struct timekeeper *tk;
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
-	write_seqcount_begin(&timekeeper_seq);
+	tk = timekeeper_write_begin();
 	__timekeeping_set_tai_offset(tk, tai_offset);
-	timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
-	write_seqcount_end(&timekeeper_seq);
+	timekeeping_update(tk, TK_CLOCK_WAS_SET);
+	timekeeper_write_end();
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 	clock_was_set();
 }
@@ -625,14 +703,14 @@ void timekeeping_set_tai_offset(s32 tai_offset)
  */
 static int change_clocksource(void *data)
 {
-	struct timekeeper *tk = &timekeeper;
+	struct timekeeper *tk;
 	struct clocksource *new, *old;
 	unsigned long flags;
 
 	new = (struct clocksource *) data;
 
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
-	write_seqcount_begin(&timekeeper_seq);
+	tk = timekeeper_write_begin();
 
 	timekeeping_forward_now(tk);
 	/*
@@ -650,9 +728,9 @@ static int change_clocksource(void *data)
 			module_put(new->owner);
 		}
 	}
-	timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);
+	timekeeping_update(tk, TK_CLEAR_NTP | TK_CLOCK_WAS_SET);
 
-	write_seqcount_end(&timekeeper_seq);
+	timekeeper_write_end();
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
 	return 0;
@@ -667,12 +745,20 @@ static int change_clocksource(void *data)
  */
 int timekeeping_notify(struct clocksource *clock)
 {
-	struct timekeeper *tk = &timekeeper;
+	struct timekeeper const *tk = timekeeper_current();
 
+	/*
+	 * Since the clock source can't change outside the clocksource_mutex,
+	 * and a write update just copies the same current value over top
+	 * of itself, even if the write is non-atomic a read should still
+	 * return the correct value for tk->clock without locking.
+	 */
 	if (tk->clock == clock)
 		return 0;
 	stop_machine(change_clocksource, clock, NULL);
 	tick_clock_notify();
+
+	tk = timekeeper_current();
 	return tk->clock == clock ? 0 : -1;
 }
 
@@ -699,16 +785,16 @@ EXPORT_SYMBOL_GPL(ktime_get_real);
  */
 void getrawmonotonic(struct timespec *ts)
 {
-	struct timekeeper *tk = &timekeeper;
-	unsigned long seq;
+	unsigned seq;
 	s64 nsecs;
 
 	do {
-		seq = read_seqcount_begin(&timekeeper_seq);
+		struct timekeeper const *tk = timekeeper_read_begin(&seq);
+
 		nsecs = timekeeping_get_ns_raw(tk);
 		*ts = tk->raw_time;
 
-	} while (read_seqcount_retry(&timekeeper_seq, seq));
+	} while (timekeeper_read_retry(seq));
 
 	timespec_add_ns(ts, nsecs);
 }
@@ -719,16 +805,15 @@ EXPORT_SYMBOL(getrawmonotonic);
  */
 int timekeeping_valid_for_hres(void)
 {
-	struct timekeeper *tk = &timekeeper;
-	unsigned long seq;
+	unsigned seq;
 	int ret;
 
 	do {
-		seq = read_seqcount_begin(&timekeeper_seq);
+		struct timekeeper const *tk = timekeeper_read_begin(&seq);
 
 		ret = tk->clock->flags & CLOCK_SOURCE_VALID_FOR_HRES;
 
-	} while (read_seqcount_retry(&timekeeper_seq, seq));
+	} while (timekeeper_read_retry(seq));
 
 	return ret;
 }
@@ -738,16 +823,15 @@ int timekeeping_valid_for_hres(void)
  */
 u64 timekeeping_max_deferment(void)
 {
-	struct timekeeper *tk = &timekeeper;
-	unsigned long seq;
+	unsigned seq;
 	u64 ret;
 
 	do {
-		seq = read_seqcount_begin(&timekeeper_seq);
+		struct timekeeper const *tk = timekeeper_read_begin(&seq);
 
 		ret = tk->clock->max_idle_ns;
 
-	} while (read_seqcount_retry(&timekeeper_seq, seq));
+	} while (timekeeper_read_retry(seq));
 
 	return ret;
 }
@@ -787,7 +871,7 @@ void __weak read_boot_clock(struct timespec *ts)
  */
 void __init timekeeping_init(void)
 {
-	struct timekeeper *tk = &timekeeper;
+	struct timekeeper *tk;
 	struct clocksource *clock;
 	unsigned long flags;
 	struct timespec now, boot, tmp;
@@ -811,7 +895,7 @@ void __init timekeeping_init(void)
 	}
 
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
-	write_seqcount_begin(&timekeeper_seq);
+	tk = timekeeper_write_begin();
 	ntp_init();
 
 	clock = clocksource_default_clock();
@@ -832,9 +916,11 @@ void __init timekeeping_init(void)
 	tmp.tv_nsec = 0;
 	tk_set_sleep_time(tk, tmp);
 
-	memcpy(&shadow_timekeeper, &timekeeper, sizeof(timekeeper));
+	timekeeper_write_end();
 
-	write_seqcount_end(&timekeeper_seq);
+	/* Set up the second copy, too */
+	(void)timekeeper_write_begin();
+	timekeeper_write_end();
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 }
 
@@ -874,7 +960,7 @@ static void __timekeeping_inject_sleeptime(struct timekeeper *tk,
  */
 void timekeeping_inject_sleeptime(struct timespec *delta)
 {
-	struct timekeeper *tk = &timekeeper;
+	struct timekeeper *tk;
 	unsigned long flags;
 
 	/*
@@ -885,15 +971,15 @@ void timekeeping_inject_sleeptime(struct timespec *delta)
 		return;
 
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
-	write_seqcount_begin(&timekeeper_seq);
+	tk = timekeeper_write_begin();
 
 	timekeeping_forward_now(tk);
 
 	__timekeeping_inject_sleeptime(tk, delta);
 
-	timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);
+	timekeeping_update(tk, TK_CLEAR_NTP | TK_CLOCK_WAS_SET);
 
-	write_seqcount_end(&timekeeper_seq);
+	timekeeper_write_end();
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
 	/* signal hrtimers about time change */
@@ -909,8 +995,8 @@ void timekeeping_inject_sleeptime(struct timespec *delta)
  */
 static void timekeeping_resume(void)
 {
-	struct timekeeper *tk = &timekeeper;
-	struct clocksource *clock = tk->clock;
+	struct timekeeper *tk;
+	struct clocksource *clock;
 	unsigned long flags;
 	struct timespec ts_new, ts_delta;
 	cycle_t cycle_now, cycle_delta;
@@ -922,7 +1008,6 @@ static void timekeeping_resume(void)
 	clocksource_resume();
 
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
-	write_seqcount_begin(&timekeeper_seq);
 
 	/*
 	 * After system resumes, we need to calculate the suspended time and
@@ -936,6 +1021,7 @@ static void timekeeping_resume(void)
 	 * The less preferred source will only be tried if there is no better
 	 * usable source. The rtc part is handled separately in rtc core code.
 	 */
+	clock = timekeeper_current()->clock;
 	cycle_now = clock->read(clock);
 	if ((clock->flags & CLOCK_SOURCE_SUSPEND_NONSTOP) &&
 		cycle_now > clock->cycle_last) {
@@ -947,14 +1033,14 @@ static void timekeeping_resume(void)
 		cycle_delta = (cycle_now - clock->cycle_last) & clock->mask;
 
 		/*
-		 * "cycle_delta * mutl" may cause 64 bits overflow, if the
+		 * "cycle_delta * mult" may cause 64 bits overflow, if the
 		 * suspended time is too long. In that case we need do the
 		 * 64 bits math carefully
 		 */
 		do_div(max, mult);
 		if (cycle_delta > max) {
 			num = div64_u64(cycle_delta, max);
-			nsec = (((u64) max * mult) >> shift) * num;
+			nsec = (max * mult >> shift) * num;
 			cycle_delta -= num * max;
 		}
 		nsec += ((u64) cycle_delta * mult) >> shift;
@@ -966,6 +1052,8 @@ static void timekeeping_resume(void)
 		suspendtime_found = true;
 	}
 
+	tk = timekeeper_write_begin();	/* Now we start making changes */
+
 	if (suspendtime_found)
 		__timekeeping_inject_sleeptime(tk, &ts_delta);
 
@@ -973,8 +1061,9 @@ static void timekeeping_resume(void)
 	tk->cycle_last = clock->cycle_last = cycle_now;
 	tk->ntp_error = 0;
 	timekeeping_suspended = 0;
-	timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
-	write_seqcount_end(&timekeeper_seq);
+
+	timekeeper_write_end();
+	timekeeping_update(tk, TK_CLOCK_WAS_SET);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
 	touch_softlockup_watchdog();
@@ -987,7 +1076,7 @@ static void timekeeping_resume(void)
 
 static int timekeeping_suspend(void)
 {
-	struct timekeeper *tk = &timekeeper;
+	struct timekeeper *tk;
 	unsigned long flags;
 	struct timespec		delta, delta_delta;
 	static struct timespec	old_delta;
@@ -1003,7 +1092,7 @@ static int timekeeping_suspend(void)
 		persistent_clock_exist = true;
 
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
-	write_seqcount_begin(&timekeeper_seq);
+	tk = timekeeper_write_begin();
 	timekeeping_forward_now(tk);
 	timekeeping_suspended = 1;
 
@@ -1027,8 +1116,8 @@ static int timekeeping_suspend(void)
 			timespec_add(timekeeping_suspend_time, delta_delta);
 	}
 
-	timekeeping_update(tk, TK_MIRROR);
-	write_seqcount_end(&timekeeper_seq);
+	timekeeper_write_end();
+	timekeeping_update(tk, 0);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
 	clockevents_notify(CLOCK_EVT_NOTIFY_SUSPEND, NULL);
@@ -1056,7 +1145,7 @@ device_initcall(timekeeping_init_ops);
  * If the error is already larger, we look ahead even further
  * to compensate for late or lost adjustments.
  */
-static __always_inline int timekeeping_bigadjust(struct timekeeper *tk,
+static __always_inline int timekeeping_bigadjust(struct timekeeper const *tk,
 						 s64 error, s64 *interval,
 						 s64 *offset)
 {
@@ -1129,7 +1218,7 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 	error = tk->ntp_error >> (tk->ntp_error_shift - 1);
 	if (error > interval) {
 		/*
-		 * We now divide error by 4(via shift), which checks if
+		 * We now divide error by 4 (via shift), which checks if
 		 * the error is greater than twice the interval.
 		 * If it is greater, we need a bigadjust, if its smaller,
 		 * we can adjust by 1.
@@ -1139,8 +1228,7 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 			adj = 1;
 		else
 			adj = timekeeping_bigadjust(tk, error, &interval, &offset);
-	} else {
-		if (error < -interval) {
+	} else if (error < -interval) {
 			/* See comment above, this is just switched for the negative */
 			error >>= 2;
 			if (likely(error >= -interval)) {
@@ -1236,7 +1324,6 @@ out_adjust:
 		tk->xtime_nsec = 0;
 		tk->ntp_error += neg << tk->ntp_error_shift;
 	}
-
 }
 
 /**
@@ -1245,7 +1332,6 @@ out_adjust:
  * Helper function that accumulates a the nsecs greater then a second
  * from the xtime_nsec field to the xtime_secs field.
  * It also calls into the NTP code to handle leapsecond processing.
- *
  */
 static inline unsigned int accumulate_nsecs_to_secs(struct timekeeper *tk)
 {
@@ -1357,8 +1443,8 @@ static inline void old_vsyscall_fixup(struct timekeeper *tk)
 void update_wall_time(void)
 {
 	struct clocksource *clock;
-	struct timekeeper *real_tk = &timekeeper;
-	struct timekeeper *tk = &shadow_timekeeper;
+	struct timekeeper *tk;
+	struct timekeeper const *tk_old;
 	cycle_t offset;
 	int shift = 0, maxshift;
 	unsigned int clock_set = 0;
@@ -1370,17 +1456,18 @@ void update_wall_time(void)
 	if (unlikely(timekeeping_suspended))
 		goto out;
 
-	clock = real_tk->clock;
+	tk_old = timekeeper_current();
+	clock = tk_old->clock;
 
 #ifdef CONFIG_ARCH_USES_GETTIMEOFFSET
-	offset = real_tk->cycle_interval;
+	offset = tk_old->cycle_interval;
 #else
 	offset = (clock->read(clock) - clock->cycle_last) & clock->mask;
-#endif
 
 	/* Check if there's really nothing to do */
-	if (offset < real_tk->cycle_interval)
+	if (offset < tk_old->cycle_interval)
 		goto out;
+#endif
 
 	/*
 	 * With NO_HZ we may have to accumulate many cycle_intervals
@@ -1402,6 +1489,9 @@ void update_wall_time(void)
 			shift--;
 	}
 
+	/* Now begin the updates */
+	tk = timekeeper_write_begin();
+
 	/* correct the clock when NTP error is too big */
 	timekeeping_adjust(tk, offset);
 
@@ -1417,22 +1507,17 @@ void update_wall_time(void)
 	 */
 	clock_set |= accumulate_nsecs_to_secs(tk);
 
-	write_seqcount_begin(&timekeeper_seq);
+	/* We are done updating tk; from here on it's read only */
+	timekeeper_write_end();
+
 	/* Update clock->cycle_last with the new value */
 	clock->cycle_last = tk->cycle_last;
 	/*
-	 * Update the real timekeeper.
-	 *
-	 * We could avoid this memcpy by switching pointers, but that
-	 * requires changes to all other timekeeper usage sites as
-	 * well, i.e. move the timekeeper pointer getter into the
-	 * spinlocked/seqcount protected sections. And we trade this
-	 * memcpy under the timekeeper_seq against one before we start
-	 * updating.
+	 * Notify users of updates.
+	 * (timekeeping_update writes *tk if clock_set & TK_CLEAR_NTP,
+	 * but that's never the case here.)
 	 */
-	memcpy(real_tk, tk, sizeof(*tk));
-	timekeeping_update(real_tk, clock_set);
-	write_seqcount_end(&timekeeper_seq);
+	timekeeping_update(tk, clock_set);
 out:
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 	if (clock_set)
@@ -1453,13 +1538,16 @@ out:
  */
 void getboottime(struct timespec *ts)
 {
-	struct timekeeper *tk = &timekeeper;
-	struct timespec boottime = {
-		.tv_sec = tk->wall_to_monotonic.tv_sec +
-				tk->total_sleep_time.tv_sec,
-		.tv_nsec = tk->wall_to_monotonic.tv_nsec +
-				tk->total_sleep_time.tv_nsec
-	};
+	unsigned seq;
+	struct timespec boottime;
+
+	do {
+		struct timekeeper const *tk = timekeeper_read_begin(&seq);
+		boottime.tv_sec = tk->wall_to_monotonic.tv_sec +
+				tk->total_sleep_time.tv_sec;
+		boottime.tv_nsec = tk->wall_to_monotonic.tv_nsec +
+				tk->total_sleep_time.tv_nsec;
+	} while (timekeeper_read_retry(seq));
 
 	set_normalized_timespec(ts, -boottime.tv_sec, -boottime.tv_nsec);
 }
@@ -1476,7 +1564,6 @@ EXPORT_SYMBOL_GPL(getboottime);
  */
 void get_monotonic_boottime(struct timespec *ts)
 {
-	struct timekeeper *tk = &timekeeper;
 	struct timespec tomono, sleep;
 	s64 nsec;
 	unsigned int seq;
@@ -1484,13 +1571,14 @@ void get_monotonic_boottime(struct timespec *ts)
 	WARN_ON(timekeeping_suspended);
 
 	do {
-		seq = read_seqcount_begin(&timekeeper_seq);
+		struct timekeeper const *tk = timekeeper_read_begin(&seq);
+
 		ts->tv_sec = tk->xtime_sec;
 		nsec = timekeeping_get_ns(tk);
 		tomono = tk->wall_to_monotonic;
 		sleep = tk->total_sleep_time;
 
-	} while (read_seqcount_retry(&timekeeper_seq, seq));
+	} while (timekeeper_read_retry(seq));
 
 	ts->tv_sec += tomono.tv_sec + sleep.tv_sec;
 	ts->tv_nsec = 0;
@@ -1521,38 +1609,31 @@ EXPORT_SYMBOL_GPL(ktime_get_boottime);
  */
 void monotonic_to_bootbased(struct timespec *ts)
 {
-	struct timekeeper *tk = &timekeeper;
-
-	*ts = timespec_add(*ts, tk->total_sleep_time);
+	*ts = timespec_add(*ts, timekeeper_current()->total_sleep_time);
 }
 EXPORT_SYMBOL_GPL(monotonic_to_bootbased);
 
 unsigned long get_seconds(void)
 {
-	struct timekeeper *tk = &timekeeper;
-
-	return tk->xtime_sec;
+	return timekeeper_current()->xtime_sec;
 }
 EXPORT_SYMBOL(get_seconds);
 
 struct timespec __current_kernel_time(void)
 {
-	struct timekeeper *tk = &timekeeper;
-
-	return tk_xtime(tk);
+	return tk_xtime(timekeeper_current());
 }
 
 struct timespec current_kernel_time(void)
 {
-	struct timekeeper *tk = &timekeeper;
 	struct timespec now;
-	unsigned long seq;
+	unsigned seq;
 
 	do {
-		seq = read_seqcount_begin(&timekeeper_seq);
+		struct timekeeper const *tk = timekeeper_read_begin(&seq);
 
 		now = tk_xtime(tk);
-	} while (read_seqcount_retry(&timekeeper_seq, seq));
+	} while (timekeeper_read_retry(seq));
 
 	return now;
 }
@@ -1560,20 +1641,17 @@ EXPORT_SYMBOL(current_kernel_time);
 
 struct timespec get_monotonic_coarse(void)
 {
-	struct timekeeper *tk = &timekeeper;
 	struct timespec now, mono;
-	unsigned long seq;
+	unsigned seq;
 
 	do {
-		seq = read_seqcount_begin(&timekeeper_seq);
+		struct timekeeper const *tk = timekeeper_read_begin(&seq);
 
 		now = tk_xtime(tk);
 		mono = tk->wall_to_monotonic;
-	} while (read_seqcount_retry(&timekeeper_seq, seq));
+	} while (timekeeper_read_retry(seq));
 
-	set_normalized_timespec(&now, now.tv_sec + mono.tv_sec,
-				now.tv_nsec + mono.tv_nsec);
-	return now;
+	return timespec_add(now, mono);
 }
 
 /*
@@ -1595,15 +1673,15 @@ void do_timer(unsigned long ticks)
 void get_xtime_and_monotonic_and_sleep_offset(struct timespec *xtim,
 				struct timespec *wtom, struct timespec *sleep)
 {
-	struct timekeeper *tk = &timekeeper;
-	unsigned long seq;
+	unsigned seq;
 
 	do {
-		seq = read_seqcount_begin(&timekeeper_seq);
+		struct timekeeper const *tk = timekeeper_read_begin(&seq);
+
 		*xtim = tk_xtime(tk);
 		*wtom = tk->wall_to_monotonic;
 		*sleep = tk->total_sleep_time;
-	} while (read_seqcount_retry(&timekeeper_seq, seq));
+	} while (timekeeper_read_retry(seq));
 }
 
 #ifdef CONFIG_HIGH_RES_TIMERS
@@ -1619,13 +1697,12 @@ void get_xtime_and_monotonic_and_sleep_offset(struct timespec *xtim,
 ktime_t ktime_get_update_offsets(ktime_t *offs_real, ktime_t *offs_boot,
 							ktime_t *offs_tai)
 {
-	struct timekeeper *tk = &timekeeper;
 	ktime_t now;
 	unsigned int seq;
 	u64 secs, nsecs;
 
 	do {
-		seq = read_seqcount_begin(&timekeeper_seq);
+		struct timekeeper const *tk = timekeeper_read_begin(&seq);
 
 		secs = tk->xtime_sec;
 		nsecs = timekeeping_get_ns(tk);
@@ -1633,7 +1710,7 @@ ktime_t ktime_get_update_offsets(ktime_t *offs_real, ktime_t *offs_boot,
 		*offs_real = tk->offs_real;
 		*offs_boot = tk->offs_boot;
 		*offs_tai = tk->offs_tai;
-	} while (read_seqcount_retry(&timekeeper_seq, seq));
+	} while (timekeeper_read_retry(seq));
 
 	now = ktime_add_ns(ktime_set(secs, 0), nsecs);
 	now = ktime_sub(now, *offs_real);
@@ -1646,14 +1723,13 @@ ktime_t ktime_get_update_offsets(ktime_t *offs_real, ktime_t *offs_boot,
  */
 ktime_t ktime_get_monotonic_offset(void)
 {
-	struct timekeeper *tk = &timekeeper;
-	unsigned long seq;
+	unsigned seq;
 	struct timespec wtom;
 
 	do {
-		seq = read_seqcount_begin(&timekeeper_seq);
+		struct timekeeper const *tk = timekeeper_read_begin(&seq);
 		wtom = tk->wall_to_monotonic;
-	} while (read_seqcount_retry(&timekeeper_seq, seq));
+	} while (timekeeper_read_retry(seq));
 
 	return timespec_to_ktime(wtom);
 }
@@ -1664,7 +1740,6 @@ EXPORT_SYMBOL_GPL(ktime_get_monotonic_offset);
  */
 int do_adjtimex(struct timex *txc)
 {
-	struct timekeeper *tk = &timekeeper;
 	unsigned long flags;
 	struct timespec ts;
 	s32 orig_tai, tai;
@@ -1691,14 +1766,15 @@ int do_adjtimex(struct timex *txc)
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
 	write_seqcount_begin(&timekeeper_seq);
 
-	orig_tai = tai = tk->tai_offset;
+	orig_tai = tai = timekeeper_current()->tai_offset;
 	ret = __do_adjtimex(txc, &ts, &tai);
 
 	if (tai != orig_tai) {
+		struct timekeeper *tk = timekeeper_write_begin();
 		__timekeeping_set_tai_offset(tk, tai);
-		timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
+		timekeeper_write_end();
+		timekeeping_update(tk, TK_CLOCK_WAS_SET);
 	}
-	write_seqcount_end(&timekeeper_seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
 	if (tai != orig_tai)
@@ -1712,17 +1788,23 @@ int do_adjtimex(struct timex *txc)
 #ifdef CONFIG_NTP_PPS
 /**
  * hardpps() - Accessor function to NTP __hardpps function
+ * FIXME: The NTP variables need to be duplicated in the same
+ * manner as struct timekeeper; the "locking" here doesn't actually
+ * do anything.  Unless... the write-locking part of
+ * timekeeper_write_begin is identical to regular seqlock.
+ * I could have the non-ntp timing use the duplicated info, but
+ * reads of the ntp variables use standard seqlocks.  Needs thought...
  */
 void hardpps(const struct timespec *phase_ts, const struct timespec *raw_ts)
 {
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
-	write_seqcount_begin(&timekeeper_seq);
+	(void)timekeeper_write_begin();
 
 	__hardpps(phase_ts, raw_ts);
 
-	write_seqcount_end(&timekeeper_seq);
+	timekeeper_write_end();
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 }
 EXPORT_SYMBOL(hardpps);


* Re: [PATCH 4/4] timekeeping: Use printk_deferred when holding timekeeping seqlock
  2014-05-06 21:33 [PATCH 4/4] timekeeping: Use printk_deferred when holding timekeeping seqlock George Spelvin
  2014-05-12 16:21 ` [rough draft PATCH] avoid stalls on the " George Spelvin
@ 2014-05-12 17:49 ` John Stultz
  2014-05-13  2:44   ` George Spelvin
                     ` (2 more replies)
  1 sibling, 3 replies; 16+ messages in thread
From: John Stultz @ 2014-05-12 17:49 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-kernel, Mathieu Desnoyers

First: My apologies, for some reason your mail got tagged as spam, so I
only saw it just now looking for another missing email.


On 05/06/2014 02:33 PM, George Spelvin wrote:
> One misfeature in the timekeeping seqlock code I noticed is that
> read_seqcount_begin returns "unsigned int", not "unsigned long".
>
> Casting to a larger type is harmless, but inefficient.
Thanks for pointing this out. If you want to spin up a patch for this,
I'd be fine applying it. Otherwise I'll try to clean this up sometime soon.

>> This is due to printk() triggering console sem wakeup, which can
>> cause scheduling code to trigger hrtimers which may try to read
>> the time.
> An alternative solution, which avoids the need for this entire patch
> series, is to make ktime_get completely nonblocking.
>
> To do that, use a seqlock variant wherein you maintain an array of "struct
> timekeeper" structures so that reading is always non-blocking, even if
> the writer makes no progress.  (I.e. livelock is remotely possible, but
> deadlock is not.)
>
> In the basic version, there are two.  (You can add more to further
> reduce the chance of livelock.)  The currently stable one is indicated
> by (timekeeper_seq.sequence & 2).  Writers update the other copy,
> ping-pong fashion.  The low two bits indicate four phases:
>
> 0: Both timekeepers stable, [0] preferred for reading
> 1: Timekeeper[1] is being written; read timekeeper[0] only
> 2: Both timekeepers stable, [1] preferred for reading
> 3: Timekeeper[0] is being written; read timekeeper[1] only


So this has been suggested before (and I've spent some time a while back
trying to implement it).  The problem is that if the update is
significantly delayed (for instance, by the host preempting a VM's cpu
for some extended period) while a frequency adjustment slowing the clock
is being applied to the new timekeeper, continued use of the old, faster
timekeeper can over-shoot the new timekeeper, causing time to go
backwards when we finally make the switch. :(  (For example, if the old
timekeeper runs 10ppm fast relative to the new one and the switch is
delayed by a second, the first read after the switch lands ~10us in the
past.)

But I believe the sched_clock code does something like this on x86,
since it doesn't have to handle frequency changes.

I would very much like to have a non-blocking implementation! We've also
looked at variants where we keep a valid-interval range in the structure
so readers can be sure the timekeeper they're using is still active
(sort of an expiration date), so we would only block if an update was
delayed. However this is problematic, since there are a number of
timekeeping calls in the hrtimer logic required to trigger the
timekeeping update. So if the update interrupt was delayed, those reads
would block, deadlocking the box.

So maybe you can come up with a tweak to make it possible, but when
discussing this at length w/ Mathieu Desnoyers, it seemed like the main
problem was there was no way to atomically validate that we hadn't been
delayed and update the pointer to the new structure without stopping all
the cpus with some sort of serializing lock.

thanks
-john



* Re: [rough draft PATCH] avoid stalls on the timekeeping seqlock
  2014-05-12 16:21 ` [rough draft PATCH] avoid stalls on the " George Spelvin
@ 2014-05-12 18:23   ` John Stultz
  0 siblings, 0 replies; 16+ messages in thread
From: John Stultz @ 2014-05-12 18:23 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-kernel, Mathieu Desnoyers

On 05/12/2014 09:21 AM, George Spelvin wrote:
> Here's a non-working rough draft of that idea I suggested to make
> reading the time non-blocking, even if an update is in progress.
>
> Basically, it uses the idea proposed in a comment in update_wall_time,
> switching pointers so there's always one valid structure.
>
> This is non-working because last year the NTP variables lost their
> own locking and inherited the timekeeping locks I am redesigning.
> I haven't updated NTP yet.

An important part here is that the NTP state is really tied to the
current timekeeping structure. When everything was updated in lockstep,
we split the locks to simplify some of the locking rules. But when we
added the mirrored update in 3.10 (which is a lighter version of what
you're proposing), we had to go back to using the same locking for
everything.

Mathieu took a similar swing last year, and in doing so moved most of
the ntp state into the timekeeper. This seemed like a nice cleanup, but
his approach ran into trouble and stalled out, so we didn't get the
cleanup patches merged either.

You can check his series out here:
    https://lkml.org/lkml/2013/9/14/136


> One interesting possibility is that the write side of the locking
> is identical to a standard seqlock.  It would be possible to
> divide the timekeeping variables into non-blocking variables which
> are mirrored, and ones that require stalling during write
> seqlock updates.
>
> But that's somewhat deeper magic than I've attempted so far.
> This is a demonstration of the idea.
>
> Does it seem worth pursuing?


So again, I'd love to find a way to make it work, but I worry that the
freq changes make the non-blocking route not very feasible (though my
concerns may be overwrought - so feel free to push back here).

There's also the extra complexity of the vdso updates, which basically
are a similar update to a separate arch specific subsystem, which is
currently done under the lock. So that would have to get unified in this
non-blocking update as well.


A few small notes below:

> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index f7df8ea217..0dfa4aa6fb 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -29,15 +29,15 @@
>  #include "timekeeping_internal.h"
>  
>  #define TK_CLEAR_NTP		(1 << 0)
> -#define TK_MIRROR		(1 << 1)
>  #define TK_CLOCK_WAS_SET	(1 << 2)
>  
> -static struct timekeeper timekeeper;
> +static struct timekeeper timekeeper[2];
>  static DEFINE_RAW_SPINLOCK(timekeeper_lock);
> +/* The following is NOT used as a standard seqlock */
>  static seqcount_t timekeeper_seq;
> -static struct timekeeper shadow_timekeeper;
>  
>  /* flag for if timekeeping is suspended */
> +/* Q: What are the locking rules for this variable? */
>  int __read_mostly timekeeping_suspended;

Really this is a leftover bit that needs to be cleaned up and moved to
the timekeeping structure. It's just a global flag that we use to make
sure nothing calls into timekeeping logic while we're suspended, and was
added to sort out a few suspend/resume issues that cropped up when the
original generic timekeeping core was added.


> @@ -291,6 +289,89 @@ static void timekeeping_forward_now(struct timekeeper *tk)
>  }
>  
>  /**
> + * timekeeper_write_begin: Return a timekeeper that can be updated.
> + *
> + * Must be called with the timekeeper_lock held.
> + */
> +static inline struct timekeeper *timekeeper_write_begin(void)
> +{
> +	bool b;
> +
> +	write_seqcount_begin(&timekeeper_seq);
> +	b = (timekeeper_seq.sequence >> 1) & 1;
> +	timekeeper[!b] = timekeeper[b];
> +	return timekeeper + !b;
> +}
> +
> +/**
> + * timekeeper_write_end: Finish write, mark the modified timekeeper as current.
> + *
> + * Must be called with the timekeeper_lock held.
> + */
> +static inline void timekeeper_write_end(void)
> +{
> +	write_seqcount_end(&timekeeper_seq);
> +}
> +
> +/**
> + * __timekeeper_current: Return the current (for reading) timekeeper
> + * @seq: The current sequence number
> + *
> + * Return the timekeeper corresponding to the given sequence number.
> + */
> +static inline struct timekeeper const *__timekeeper_current(unsigned seq)
> +{
> +	return timekeeper + ((seq >> 1) & 1);
> +}
> +
> +/**
> + * timekeeper_current: Return the current (for reading) timekeeper
> + *
> + * On rare occasions, we want the current timekeeper without obtaining
> + * the seqlock.  For example, if we hold the timekeeper_lock but don't
> + * intend to write it.
> + */
> +static inline struct timekeeper const *timekeeper_current(void)
> +{
> +	return __timekeeper_current(timekeeper_seq.sequence);
> +}
> +
> +/**
> + * timekeeper_read_begin: Begin reading a timekeeper.
> + * @seqp: Pointer to variable to receive sequence number.
> + *	(Because this is inline, the compiler can optimize out
> + *	the memory access.)
> + *
> + * Returns a pointer to a readable timekeeper structure.
> + *
> + * Because we have two timekeeper structures that we ping-pong
> + * between, this never blocks.  Only if there are two calls
> + * to timekeeper_write_begin between read_begin and read_retry
> + * will a retry be forced.
> + */
> +static inline struct timekeeper const *timekeeper_read_begin(unsigned *seqp)
> +{
> +	unsigned seq = ACCESS_ONCE(timekeeper_seq.sequence);
> +	smp_rmb();
> +	*seqp = seq &= ~1u;
> +	return __timekeeper_current(seq);
> +}
> +
> +/**
> + * timekeeper_read_retry: Return true if read was inconsistent, must retry
> + * @seq: The return value from timekeeper_read_begin
> + *
> + * Because we ping-pong between two timekeeper structures, the window
> + * of validity is wider than a normal seqlock, and a retry is very
> + * unlikely.
> + */
> +static inline bool timekeeper_read_retry(unsigned seq)
> +{
> +	unsigned delta = timekeeper_seq.sequence - seq;
> +	return unlikely(delta > 2);
> +}

This all looks very clean and nice! I suspect even if the non-blocking
logic doesn't work out, there are probably some similar style cleanups
that could be done to make the current mirroring code nicer to read.

thanks
-john




* Re: [PATCH 4/4] timekeeping: Use printk_deferred when holding timekeeping seqlock
  2014-05-12 17:49 ` [PATCH 4/4] timekeeping: Use printk_deferred when holding " John Stultz
@ 2014-05-13  2:44   ` George Spelvin
  2014-05-13  3:39     ` John Stultz
  2014-05-13  2:45   ` [PATCH 1/2] timekeeping: Use unsigned int for seqlock sequence George Spelvin
  2014-05-13  2:48   ` [PATCH 2/2] timekeeping: Mark struct timekeeper * passed to notifiers as const George Spelvin
  2 siblings, 1 reply; 16+ messages in thread
From: George Spelvin @ 2014-05-13  2:44 UTC (permalink / raw)
  To: john.stultz, linux; +Cc: linux-kernel, mathieu.desnoyers

On Mon, 12 May 2014, John Stultz wrote:
> First: My apologies, for some reason your mail got tagged as spam, so I
> only saw it just now looking for another missing email.

No problem; it happens to everyone.  Curse you, Canter & Siegel!

>> One misfeature in the timekeeping seqlock code I noticed is that
>> read_seqcount_begin returns "unsigned int", not "unsigned long".
>>
>> Casting to a larger type is harmless, but inefficient.

> Thanks for pointing this out. If you want to spin up a patch for this,
> I'd be fine applying it. Otherwise I'll try to clean this up sometime soon.

Appended.  And a second one that you may or may not like.

> So this has been suggested before (and I've spent some time a while back
> trying to implement it).  The problem is that if the update is
> significantly delayed (for instance, by the host preempting a VM's cpu
> for some extended period) while a frequency adjustment slowing the clock
> is being applied to the new timekeeper, continued use of the old, faster
> timekeeper can over-shoot the new timekeeper, causing time to go
> backwards when we finally make the switch. :(

Thank you very much!  "I'd like to, but it's very very tricky" is encouraging,
if daunting.

> I would very much like to have a non-blocking implementation! We've also
> looked at variants where we keep a valid-interval range in the structure
> so readers can be sure the timekeeper they're using is still active
> (sort of an expiration date), so we would only block if an update was
> delayed. However this is problematic, since there are a number of
> timekeeping calls in the hrtimer logic required to trigger the
> timekeeping update. So if the update interrupt was delayed, those reads
> would block, deadlocking the box.

> So maybe you can come up with a tweak to make it possible, but when
> discussing this at length w/ Mathieu Desnoyers, it seemed like the main
> problem was there was no way to atomically validate that we hadn't been
> delayed and update the pointer to the new structure without stopping all
> the cpus with some sort of serializing lock.

Okay, two approaches come to mind:

1) Dividing the "get the time" calls into "strictly monotonic, but
   blocking", and "non-blocking, but only approximately monotonic".
   Debugging/printk would use the latter.  The downside is API
   complexity pushed onto the users.
2) Using wait-free coding techniques where readers help the writer if
   they notice a stall.  This is much hairier internal code, but makes
   life easier on the callers.  The basic procedure is:

   - A conventionally locked writer decides that the frequency is to be
     adjusted, effective at the current time (or perhaps slightly in the
     future).
   - It publishes its idea of the effective time, and that an update is
     in progress.
   - A reader comes in, reads the hardware clock, and checks:
     - Is a rate update in progress, and is the effective time *before*
       the time I just read?
     - If so, I'm going to force a postponement to the time I just read,
       using compare-and-swap trickery.
     - Then it proceeds to use the old timekeeper rate information.
   - When the writer finishes figuring out the new timekeeper state,
     as part of the atomic installation of a new state, it checks to
     see if a reader has forced a postponement.  If so, it goes back and
     recomputes the state with the later effective time and tries again.

   Going down a level to how to implement this, one way (I may be biased
   because my mind is already on this track) is to extend the technique
   of using the seqlock counter to control ping-pong between 2 states:

   - The writer publishes the effective time in a global variable (and
     does an smp_wmb) before setting the lsbit of the sequence.
   - Readers, if they see the lsbit set, check the effective time,
     and if they see a conflict, atomically clear the lsbit, forcing
     the final write-unlock to abort.
   - If a reader loses the race to clear the lsbit, the reader recomputes
     using the new state.
   - When the writer tries to complete the update, it does an atomic update
     of the sequence counter and notices that someone has messed with it.
     Then it retries, recomputing the rate update with a later effective
     time.
     (Rather than a compare-and-swap, it could simply do an atomic
     increment which releases the lock if there was no race, and starts
     a new write if there was one).

   This approach has some horrible complexity inside the (already quite
   convoluted) timekeeper code, but allows all the read operations to
   be wait-free.
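
   To make that concrete, here is a rough user-space sketch of the
   lsbit ping-pong, using C11 atomics.  Every name here is hypothetical
   (this is not the kernel's seqcount API), and a real version would
   need carefully chosen memory barriers rather than the blanket
   seq_cst atomics below; it's a sketch of the idea, not an
   implementation:

#include <stdatomic.h>
#include <stdint.h>

struct tk_state { uint32_t mult, shift; };      /* rate information */

static struct tk_state tk_bank[2];              /* ping-pong state banks */
static _Atomic unsigned int tk_seq;             /* lsbit = update in flight */
static _Atomic uint64_t tk_effective;           /* proposed change-over tick */

/* Reader: wait-free; returns rate info valid for hardware tick "now". */
static struct tk_state tk_read(uint64_t now)
{
        for (;;) {
                unsigned int seq = atomic_load(&tk_seq);

                if ((seq & 1) && atomic_load(&tk_effective) < now) {
                        /*
                         * Update in flight whose effective time is already
                         * past: clear the lsbit so the writer's final
                         * unlock aborts and it recomputes with a later time.
                         */
                        atomic_compare_exchange_strong(&tk_seq, &seq,
                                                       seq & ~1u);
                        continue;
                }
                struct tk_state st = tk_bank[(seq >> 1) & 1];
                if (atomic_load(&tk_seq) == seq)
                        return st;      /* nothing completed meanwhile */
        }
}

/* Writer: install new rate info; retry if a reader postponed us. */
static void tk_write(struct tk_state next, uint64_t (*read_clock)(void))
{
        for (;;) {
                unsigned int seq = atomic_load(&tk_seq);        /* even */
                unsigned int locked = seq | 1;

                atomic_store(&tk_effective, read_clock());
                atomic_store(&tk_seq, locked);  /* announce the update */
                tk_bank[((seq >> 1) & 1) ^ 1] = next;   /* fill spare bank */
                /* Atomic unlock; fails iff a reader cleared the lsbit. */
                if (atomic_compare_exchange_strong(&tk_seq, &locked,
                                                   seq + 2))
                        return;
                /* Postponed: recompute with a later effective time. */
        }
}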

   The question is whether this can be implemented in a way that's
   comprehensible to anyone but Paul McKenney.

Does that make sense, or am I hallucinating?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/2] timekeeping: Use unsigned int for seqlock sequence
  2014-05-12 17:49 ` [PATCH 4/4] timekeeping: Use printk_deferred when holding " John Stultz
  2014-05-13  2:44   ` George Spelvin
@ 2014-05-13  2:45   ` George Spelvin
  2014-05-13 11:49     ` Mathieu Desnoyers
  2014-05-13  2:48   ` [PATCH 2/2] timekeeping: Mark struct timekeeper * passed to notifiers as const George Spelvin
  2 siblings, 1 reply; 16+ messages in thread
From: George Spelvin @ 2014-05-13  2:45 UTC (permalink / raw)
  To: john.stultz, linux; +Cc: linux-kernel, mathieu.desnoyers

read_seqcount_begin returns an unsigned int, so using "unsigned long" to
hold the return value is harmless, but a waste.

Signed-off-by: George Spelvin <linux@horizon.com>
---
 kernel/time/timekeeping.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index f7df8ea217..14e703e5bd 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -300,7 +300,7 @@ static void timekeeping_forward_now(struct timekeeper *tk)
 int __getnstimeofday(struct timespec *ts)
 {
 	struct timekeeper *tk = &timekeeper;
-	unsigned long seq;
+	unsigned int seq;
 	s64 nsecs = 0;
 
 	do {
@@ -399,7 +399,7 @@ EXPORT_SYMBOL_GPL(ktime_get_ts);
 void timekeeping_clocktai(struct timespec *ts)
 {
 	struct timekeeper *tk = &timekeeper;
-	unsigned long seq;
+	unsigned int seq;
 	u64 nsecs;
 
 	WARN_ON(timekeeping_suspended);
@@ -447,7 +447,7 @@ EXPORT_SYMBOL(ktime_get_clocktai);
 void getnstime_raw_and_real(struct timespec *ts_raw, struct timespec *ts_real)
 {
 	struct timekeeper *tk = &timekeeper;
-	unsigned long seq;
+	unsigned int seq;
 	s64 nsecs_raw, nsecs_real;
 
 	WARN_ON_ONCE(timekeeping_suspended);
@@ -700,7 +700,7 @@ EXPORT_SYMBOL_GPL(ktime_get_real);
 void getrawmonotonic(struct timespec *ts)
 {
 	struct timekeeper *tk = &timekeeper;
-	unsigned long seq;
+	unsigned int seq;
 	s64 nsecs;
 
 	do {
@@ -720,7 +720,7 @@ EXPORT_SYMBOL(getrawmonotonic);
 int timekeeping_valid_for_hres(void)
 {
 	struct timekeeper *tk = &timekeeper;
-	unsigned long seq;
+	unsigned int seq;
 	int ret;
 
 	do {
@@ -739,7 +739,7 @@ int timekeeping_valid_for_hres(void)
 u64 timekeeping_max_deferment(void)
 {
 	struct timekeeper *tk = &timekeeper;
-	unsigned long seq;
+	unsigned int seq;
 	u64 ret;
 
 	do {
@@ -1546,7 +1546,7 @@ struct timespec current_kernel_time(void)
 {
 	struct timekeeper *tk = &timekeeper;
 	struct timespec now;
-	unsigned long seq;
+	unsigned int seq;
 
 	do {
 		seq = read_seqcount_begin(&timekeeper_seq);
@@ -1562,7 +1562,7 @@ struct timespec get_monotonic_coarse(void)
 {
 	struct timekeeper *tk = &timekeeper;
 	struct timespec now, mono;
-	unsigned long seq;
+	unsigned int seq;
 
 	do {
 		seq = read_seqcount_begin(&timekeeper_seq);
@@ -1596,7 +1596,7 @@ void get_xtime_and_monotonic_and_sleep_offset(struct timespec *xtim,
 				struct timespec *wtom, struct timespec *sleep)
 {
 	struct timekeeper *tk = &timekeeper;
-	unsigned long seq;
+	unsigned int seq;
 
 	do {
 		seq = read_seqcount_begin(&timekeeper_seq);
@@ -1647,7 +1647,7 @@ ktime_t ktime_get_update_offsets(ktime_t *offs_real, ktime_t *offs_boot,
 ktime_t ktime_get_monotonic_offset(void)
 {
 	struct timekeeper *tk = &timekeeper;
-	unsigned long seq;
+	unsigned int seq;
 	struct timespec wtom;
 
 	do {
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/2] timekeeping: Mark struct timekeeper * passed to notifiers as const
  2014-05-12 17:49 ` [PATCH 4/4] timekeeping: Use printk_deferred when holding " John Stultz
  2014-05-13  2:44   ` George Spelvin
  2014-05-13  2:45   ` [PATCH 1/2] timekeeping: Use unsigned int for seqlock sequence George Spelvin
@ 2014-05-13  2:48   ` George Spelvin
  2 siblings, 0 replies; 16+ messages in thread
From: George Spelvin @ 2014-05-13  2:48 UTC (permalink / raw)
  To: john.stultz, linux; +Cc: linux-kernel, mathieu.desnoyers

Add const declarations where possible to make clear that the call chains
are not permitted to write to the structure.

It is not expected to improve generated code; the purpose is to document
the rules for who's allowed to modify the timekeeper state (and must
hold the associated locks) more clearly.

There are two code paths affected:
- update_vsyscall() (and update_vsyscall_old(), if used)
- update_pvclock_gtod() and the functions notified

This touches arch code that I have not tested, but only in a very safe
way: to add const declarations whose only effect is to add compile-time
complaints.

Signed-off-by: George Spelvin <linux@horizon.com>
---
One style issue: Some people prefer "const struct timekeeper *", when
I'm in the habit of writing "struct timekeeper const *".  This is due
to an example I had pointed out to me:
typedef char *charp;
const char *a;
char const *b;
char * const c;
const charp d;
charp const e;

Which of these variables are the same type?  The answer is that a and b
are writable pointers to const chars, while c through e are const pointers
to writable chars.  But if you're in the habit of writing const first,
a and d are easy to confuse.  If you get in the habit of writing const
last, e looks like c, which is correct.  Due to the typedef, there's
no way to write a declaration with the const "between the char and the *"
that looks like b.

I found this persuasive, and have adopted the style.  If it causes
violent reactions, I can change.

 arch/arm64/kernel/vdso.c            |  2 +-
 arch/ia64/kernel/time.c             |  2 +-
 arch/powerpc/kernel/time.c          |  2 +-
 arch/s390/kernel/time.c             |  2 +-
 arch/tile/kernel/time.c             |  2 +-
 arch/x86/kernel/vsyscall_gtod.c     |  2 +-
 arch/x86/kvm/x86.c                  |  4 ++--
 include/linux/timekeeper_internal.h | 10 +++++-----
 kernel/time/timekeeping.c           |  8 ++++----
 9 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 50384fec56..dafdc36845 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -208,7 +208,7 @@ struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
 /*
  * Update the vDSO data page to keep in sync with kernel timekeeping.
  */
-void update_vsyscall(struct timekeeper *tk)
+void update_vsyscall(struct timekeeper const *tk)
 {
 	struct timespec xtime_coarse;
 	u32 use_syscall = strcmp(tk->clock->name, "arch_sys_counter");
diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 71c52bc7c2..353d5847f1 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -440,7 +440,7 @@ void update_vsyscall_tz(void)
 {
 }
 
-void update_vsyscall_old(struct timespec *wall, struct timespec *wtm,
+void update_vsyscall_old(struct timespec *wall, struct timespec const *wtm,
 			struct clocksource *c, u32 mult)
 {
 	write_seqcount_begin(&fsyscall_gtod_data.seq);
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 122a580f73..42ffd2309d 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -740,7 +740,7 @@ static cycle_t timebase_read(struct clocksource *cs)
 	return (cycle_t)get_tb();
 }
 
-void update_vsyscall_old(struct timespec *wall_time, struct timespec *wtm,
+void update_vsyscall_old(struct timespec *wall_time, struct timespec const *wtm,
 			struct clocksource *clock, u32 mult)
 {
 	u64 new_tb_to_xs, new_stamp_xsec;
diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index 386d37a228..ea1c8d3137 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -210,7 +210,7 @@ struct clocksource * __init clocksource_default_clock(void)
 	return &clocksource_tod;
 }
 
-void update_vsyscall(struct timekeeper *tk)
+void update_vsyscall(struct timekeeper const *tk)
 {
 	u64 nsecps;
 
diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index 462dcd0c17..7202570ad8 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -258,7 +258,7 @@ void update_vsyscall_tz(void)
 	++vdso_data->tz_update_count;
 }
 
-void update_vsyscall(struct timekeeper *tk)
+void update_vsyscall(struct timekeeper const *tk)
 {
 	struct timespec wall_time = tk_xtime(tk);
 	struct timespec *wtm = &tk->wall_to_monotonic;
diff --git a/arch/x86/kernel/vsyscall_gtod.c b/arch/x86/kernel/vsyscall_gtod.c
index f9c6e56e14..fa12766b38 100644
--- a/arch/x86/kernel/vsyscall_gtod.c
+++ b/arch/x86/kernel/vsyscall_gtod.c
@@ -24,7 +24,7 @@ void update_vsyscall_tz(void)
 	vsyscall_gtod_data.tz_dsttime = sys_tz.tz_dsttime;
 }
 
-void update_vsyscall(struct timekeeper *tk)
+void update_vsyscall(struct timekeeper const *tk)
 {
 	struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8b8fc0b792..0c42561f52 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1003,7 +1003,7 @@ struct pvclock_gtod_data {
 
 static struct pvclock_gtod_data pvclock_gtod_data;
 
-static void update_pvclock_gtod(struct timekeeper *tk)
+static void update_pvclock_gtod(struct timekeeper const *tk)
 {
 	struct pvclock_gtod_data *vdata = &pvclock_gtod_data;
 
@@ -5552,7 +5552,7 @@ static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
 			       void *priv)
 {
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
-	struct timekeeper *tk = priv;
+	struct timekeeper const *tk = priv;
 
 	update_pvclock_gtod(tk);
 
diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index c1825eb436..96e39501d8 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -71,7 +71,7 @@ struct timekeeper {
 
 };
 
-static inline struct timespec tk_xtime(struct timekeeper *tk)
+static inline struct timespec tk_xtime(struct timekeeper const *tk)
 {
 	struct timespec ts;
 
@@ -83,16 +83,16 @@ static inline struct timespec tk_xtime(struct timekeeper *tk)
 
 #ifdef CONFIG_GENERIC_TIME_VSYSCALL
 
-extern void update_vsyscall(struct timekeeper *tk);
+extern void update_vsyscall(struct timekeeper const *tk);
 extern void update_vsyscall_tz(void);
 
 #elif defined(CONFIG_GENERIC_TIME_VSYSCALL_OLD)
 
-extern void update_vsyscall_old(struct timespec *ts, struct timespec *wtm,
+extern void update_vsyscall_old(struct timespec *ts, struct timespec const *wtm,
 				struct clocksource *c, u32 mult);
 extern void update_vsyscall_tz(void);
 
-static inline void update_vsyscall(struct timekeeper *tk)
+static inline void update_vsyscall(struct timekeeper const *tk)
 {
 	struct timespec xt;
 
@@ -102,7 +102,7 @@ static inline void update_vsyscall(struct timekeeper *tk)
 
 #else
 
-static inline void update_vsyscall(struct timekeeper *tk)
+static inline void update_vsyscall(struct timekeeper const *tk)
 {
 }
 static inline void update_vsyscall_tz(void)
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 14e703e5bd..2e0fb0f617 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -165,7 +165,7 @@ u32 get_arch_timeoffset(void)
 static inline u32 get_arch_timeoffset(void) { return 0; }
 #endif
 
-static inline s64 timekeeping_get_ns(struct timekeeper *tk)
+static inline s64 timekeeping_get_ns(struct timekeeper const *tk)
 {
 	cycle_t cycle_now, cycle_delta;
 	struct clocksource *clock;
@@ -185,7 +185,7 @@ static inline s64 timekeeping_get_ns(struct timekeeper *tk)
 	return nsec + get_arch_timeoffset();
 }
 
-static inline s64 timekeeping_get_ns_raw(struct timekeeper *tk)
+static inline s64 timekeeping_get_ns_raw(struct timekeeper const *tk)
 {
 	cycle_t cycle_now, cycle_delta;
 	struct clocksource *clock;
@@ -207,9 +207,9 @@ static inline s64 timekeeping_get_ns_raw(struct timekeeper *tk)
 
 static RAW_NOTIFIER_HEAD(pvclock_gtod_chain);
 
-static void update_pvclock_gtod(struct timekeeper *tk, bool was_set)
+static void update_pvclock_gtod(struct timekeeper const *tk, bool was_set)
 {
-	raw_notifier_call_chain(&pvclock_gtod_chain, was_set, tk);
+	raw_notifier_call_chain(&pvclock_gtod_chain, was_set, (void *)tk);
 }
 
 /**
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] timekeeping: Use printk_deferred when holding timekeeping seqlock
  2014-05-13  2:44   ` George Spelvin
@ 2014-05-13  3:39     ` John Stultz
  2014-05-13  5:13       ` George Spelvin
  0 siblings, 1 reply; 16+ messages in thread
From: John Stultz @ 2014-05-13  3:39 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-kernel, mathieu.desnoyers

On 05/12/2014 07:44 PM, George Spelvin wrote:
> On Mon, 12 May 2014, John Stultz wrote:
>> First: My apologies, for some reason your mail got tagged as spam, so I
>> only saw it just now looking for another missing email.
> No problem; it happens to everyone.  Curse you, Canter & Siegel!
>
>>> One misfeature in the timekeeping seqlock code I noticed is that
>>>  read_seqcount_begin returns "unsigned int", not "unsigned long".
>>>
>> Casting to a larger type is harmless, but inefficient.
>> Thanks for pointing this out. If you want to spin up a patch for this,
>> I'd be fine applying it. Otherwise I'll try to clean this up sometime soon.
> Appended.  And a second one that you may or may not like.
>
>> So this has been suggested before (and I've spent some time awhile back
>> trying to implement it).  The problem is that should the update be
>> significantly delayed (by for instance, the host preempting a VM's cpu
>> for some extended period), and a frequency adjustment slowing the clock
>> was being applied to the new timekeeper, the continued use of the old
>> timekeeper could cause the current time to over-shoot the new
>> timekeeper, causing time to go backwards when we finally made the switch. :(
> Thank you very much!  "I'd like to, but it's very very tricky" is encouraging,
> if daunting.
>
>> I would very much like to have a non-blocking implementation! We've also
>> looked at variants where we keep a valid-interval range in the structure
>> so readers can be sure the timekeeper they're using is still active
>> (sort of an expiration date), so we would only block if an update was
>> delayed. However this is problematic, since there are a number of
>> timekeeping calls in the hrtimer logic required to trigger the
>> timekeeping update. So if the update interrupt was delayed, those reads
>> would block, deadlocking the box.
>> So maybe you can come up with a tweak to make it possible, but when
>> discussing this at length w/ Mathieu Desnoyers, it seemed like the main
>> problem was there was no way to atomically validate that we hadn't been
>> delayed and update the pointer to the new structure without stopping all
>> the cpus with some sort of serializing lock.
> Okay, two approaches come to mind:
>
> 1) Dividing the "get the time" calls into "strictly monotonic, but
>    blocking", and "non-blocking, but only approximately monotonic".
>    Debugging/printk would use the latter.  The downside is API
>    complexity pushed onto the users.

So the printk/debugging case is covered by sched_clock, which uses a
completely separate code path. So if I'm following along, I think this
divide is already in place.


> 2) Using wait-free coding techniques where readers help the writer if
>    they notice a stall.  This is much hairier internal code, but makes
>    life easier on the callers.  The basic procedure is:
>
>    - A conventionally locked writer decides that the frequency is to be
>      adjusted, effective at the current time (or perhaps slightly in the
>      future).
>    - It publishes its idea of the effective time, and that an update is
>      in progress.
>    - A reader comes in, reads the hardware clock, and checks:
>      - Is a rate update in progress, and is the effective time *before*
>        the time I just read?
>      - If so, I'm going to force a postponement to the time I just read,
>        using compare-and-swap trickery.
>      - Then it proceeds to use the old timekeeper rate information.
>    - When the writer finishes figuring out the new timekeeper state,
>      as part of the atomic installation of a new state, it checks to
>      see if a reader has forced a postponement.  If so, it goes back and
>      recomputes the state with the later effective time and tries again.

Hrm.. So basically during the update you lock readers to the counter
value read at the beginning of the update. So readers don't block, but
time doesn't progress while the update is being made. Sounds promising!
Though I worry it rhymes a bit with an idea Mathieu and I were throwing
around during Plumbers last year, which didn't work out. I think the
concern with this sort of approach is: how do you prevent a race between
a reader and the update thread such that a reader doesn't sneak in and
read a counter value that is after the value the updater read, but
before the value is stored locking the structure?

You can imagine the updater reads the counter value, then is immediately
stalled for some very long time by a hyper-visor. Various readers on
other cpus read time that continues to increase. Then when the updater
finally returns, he locks the time to that value in the past. Following
readers then see time that appears to go backwards.

I have to admit it's getting late and I'm not sure I'm completely
following the bit about the update restarting depending on a forced
postponement.

>    Going down a level to how to implement this, one way (I may be biased
>    because my mind is already on this track) is to extend the technique
>    of using the seqlock counter to control ping-pong between 2 states:
>
>    - The writer publishes the effective time in a global variable (and
>      does an smp_wmb) before setting the lsbit of the sequence.
>    - Readers, if they see the lsbit set, check the effective time,
>      and if they see a conflict, atomically clear the lsbit, forcing
>      the final write-unlock to abort.
>    - If a reader loses the race to clear the lsbit, the reader recomputes
>      using the new state.
>    - When the writer tries to complete the update, it does an atomic update
>      of the sequence counter and notices that someone has messed with it.
>      Then it retries, recomputing the rate update with a later effective
>      time.
>      (Rather than a compare-and-swap, it could simply do an atomic
>      increment which releases the lock if there was no race, and starts
>      a new write if there was one).

Oh.. so it's actually more like the update is canceled if a reader can
complete a read and set a flag while the update is in flight? Hrm.. I
worry with the number and frequency of time reads on many systems, you'd
end up with update starvation (something which used to be a problem back
when timekeeping was managed with just spinlocks, and SMP wasn't that
common - so I suspect it can only be worse now).

Further, the write in the read path would cause problems for VDSO time,
where everything is computed on read-only data in userspace.



>    This approach has some horrible complexity inside the (already quite
>    convoluted) timekeeper code, but allows all the read operations to
>    be wait-free.
>
>    The question is whether this can be implemented in a way that's
>    comprehensible to anyone but Paul McKenney.
>
> Does that make sense, or am I hallucinating?

Yea. It definitely gets complex. But timekeeping is always a very
hotpath, so complexity for performance is a normal trade. So there may
be a way to make it work. But finding a way to make it easier to
comprehend (even just having a clear metaphor for what's being done)
would be excellent.

thanks
-john



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] timekeeping: Use printk_deferred when holding timekeeping seqlock
  2014-05-13  3:39     ` John Stultz
@ 2014-05-13  5:13       ` George Spelvin
  2014-05-13 12:07         ` Mathieu Desnoyers
  0 siblings, 1 reply; 16+ messages in thread
From: George Spelvin @ 2014-05-13  5:13 UTC (permalink / raw)
  To: john.stultz, linux; +Cc: linux-kernel, mathieu.desnoyers

>> 2) Using wait-free coding techniques where readers help the writer if
>>    they notice a stall.  This is much hairier internal code, but makes
>>    life easier on the callers.  The basic procedure is:
>>
>>    - A conventionally locked writer decides that the frequency is to be
>>      adjusted, effective at the current time (or perhaps slightly in the
>>      future).
>>    - It publishes its idea of the effective time, and that an update is
>>      in progress.
>>    - A reader comes in, reads the hardware clock, and checks:
>>      - Is a rate update in progress, and is the effective time *before*
>>        the time I just read?
>>      - If so, I'm going to force a postponement to the time I just read,
>>        using compare-and-swap trickery.
>>      - Then it proceeds to use the old timekeeper rate information.
>>    - When the writer finishes figuring out the new timekeeper state,
>>      as part of the atomic installation of a new state, it checks to
>>      see if a reader has forced a postponement.  If so, it goes back and
>>      recomputes the state with the later effective time and tries again.
> 
> Hrm.. So basically during the update you lock readers to the counter
> value read at the beginning of the update. So readers don't block, but
> time doesn't progress while the update is being made. Sounds promising!

Er... no.  I guess my explanation didn't work.  I thought of that (it's
quite simple, after all), but I was worried that the stuttering would
mess up fine accounting for e.g. interrupt handlers.

Also, it's not implementable.  The writer must pick a rate-change moment,
then announce it.  If it stalls between those two operations (and there's
no way to make them atomic), then it might announce a moment already in
the past, causing a later reader to return the stalled time, while an
earlier reader that didn't see the announcement has already returned
a later time.

Instead, the writer announces a *proposed* rate-change time, but a number
of conditions can cause that time to be postponed.  (Basically, the writer
loops back to the beginning and applies the rate change later.)

One condition is enforced by the writer itself: it reads the time again
after making the announcement, and if the announced time has already
passed, loops.

But this is also checked by all readers.  If an update is in progress
with a proposed time earlier than the time the reader is in the process
of reading, the writer is told to try again later.


One trick that can minimize the number of retries is to add a very small
time delta (equal to the typical write update cycle) to the writer's
effective time.  If the writer completes before that time happens (the
common case), readers happily use the old time values during the update.
Only if the writer is delayed more than expected will a reader notice
a problem.  "Hey, I read the hardware at tick x, but the writer is trying
to update tick-to-seconds translations for times later than tick y < x.
Hey, writer, recompute with a higher y!"


There is also the issue of a reader coming along after the change-over
and wanting to translate a time x < y.  There are several ways to
handle this.  Either the old parameters have to be kept around until
time y has passed, or the readers must wait until time y.

(They may not return early reporting time y, or there will be race conditions.)

> Oh.. so it's actually more like the update is canceled if a reader can
> complete a read and set a flag while the update is in flight? Hrm.. I
> worry with the number and frequency of time reads on many systems, you'd
> end up with update starvation (something which used to be a problem back
> when timekeeping was managed with just spinlocks, and SMP wasn't that
> common - so I suspect it can only be worse now).

The solution, mentioned above, is to have the effective time of the
update set to the predicted write completion time.  Then the readers will
never see an effective time less than their own time, and so never need
to postpone the writer.

Only if the writer is delayed unexpectedly does that happen.

> Further, the write in the read path would cause problems for VDSO time,
> where everything is computed on read-only data in userspace.

Ah!  Now *that* poses a potential problem.  I didn't think of that.
Arrgh, that's going to take some work.

Two possibilities:
1) The VDSO code could just spin.  It's not holding any kernel locks,
   so there's not much problem.
2) Possibly after some optimistic spinning, the VDSO code could
   trap to the kernel as a fallback.
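
Something like the following, as a sketch of possibility 2.  All of
these names are made-up stand-ins, not the actual vDSO entry points or
data-page helpers:

#include <time.h>

#define VDSO_MAX_SPINS  1000    /* bound the optimistic spinning */

/* Hypothetical helpers, assumed to exist in the vDSO code. */
extern unsigned int vdso_read_begin(void);      /* seq; lsbit = update */
extern int vdso_read_retry(unsigned int seq);   /* 0 if consistent */
extern void vdso_compute_time(struct timespec *ts);
extern int clock_gettime_syscall(clockid_t clk, struct timespec *ts);

static int vdso_clock_gettime(clockid_t clk, struct timespec *ts)
{
        int spins;

        for (spins = 0; spins < VDSO_MAX_SPINS; spins++) {
                unsigned int seq = vdso_read_begin();

                if (!(seq & 1)) {               /* no update in flight */
                        vdso_compute_time(ts);
                        if (!vdso_read_retry(seq))
                                return 0;       /* consistent snapshot */
                }
                /* Writer active or raced with us: spin and retry. */
        }
        /* Writer stalled (e.g. vCPU preempted): punt to the kernel. */
        return clock_gettime_syscall(clk, ts);
}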

> Yea. It definitely gets complex. But timekeeping is always a very
> hotpath, so complexity for performance is a normal trade. So there may
> be a way to make it work. But finding a way to make it easier to
> comprehend (even just having a clear metaphor for what's being done)
> would be excellent.

I'll try.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] timekeeping: Use unsigned int for seqlock sequence
  2014-05-13  2:45   ` [PATCH 1/2] timekeeping: Use unsigned int for seqlock sequence George Spelvin
@ 2014-05-13 11:49     ` Mathieu Desnoyers
  0 siblings, 0 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2014-05-13 11:49 UTC (permalink / raw)
  To: George Spelvin; +Cc: john stultz, linux-kernel

----- Original Message -----
> From: "George Spelvin" <linux@horizon.com>
> To: "john stultz" <john.stultz@linaro.org>, linux@horizon.com
> Cc: linux-kernel@vger.kernel.org, "mathieu desnoyers" <mathieu.desnoyers@efficios.com>
> Sent: Monday, May 12, 2014 10:45:19 PM
> Subject: [PATCH 1/2] timekeeping: Use unsigned int for seqlock sequence
> 
> read_seqcount_begin returns an unsigned int, so using "unsigned long" to
> hold the return value is harmless, but a waste.

Did you try comparing this to changing read_seqcount_begin to return
an "unsigned long" instead ? Does it change the executable
size at all ?

Seq locks correctness depends on ensuring the sequence
counter does not overflow. Strictly speaking, a 64-bit
value provides a stronger no-overflow guarantee than
a 32-bit value, even though it is already unlikely that
the 32-bit value will ever overflow in a sane system.
Therefore, I'd be tempted to use "unsigned long" if
possible if it's not significantly more expensive.

Thoughts ?

Thanks,

Mathieu

> 
> Signed-off-by: George Spelvin <linux@horizon.com>
> ---
>  kernel/time/timekeeping.c | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index f7df8ea217..14e703e5bd 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -300,7 +300,7 @@ static void timekeeping_forward_now(struct timekeeper
> *tk)
>  int __getnstimeofday(struct timespec *ts)
>  {
>  	struct timekeeper *tk = &timekeeper;
> -	unsigned long seq;
> +	unsigned int seq;
>  	s64 nsecs = 0;
>  
>  	do {
> @@ -399,7 +399,7 @@ EXPORT_SYMBOL_GPL(ktime_get_ts);
>  void timekeeping_clocktai(struct timespec *ts)
>  {
>  	struct timekeeper *tk = &timekeeper;
> -	unsigned long seq;
> +	unsigned int seq;
>  	u64 nsecs;
>  
>  	WARN_ON(timekeeping_suspended);
> @@ -447,7 +447,7 @@ EXPORT_SYMBOL(ktime_get_clocktai);
>  void getnstime_raw_and_real(struct timespec *ts_raw, struct timespec
>  *ts_real)
>  {
>  	struct timekeeper *tk = &timekeeper;
> -	unsigned long seq;
> +	unsigned int seq;
>  	s64 nsecs_raw, nsecs_real;
>  
>  	WARN_ON_ONCE(timekeeping_suspended);
> @@ -700,7 +700,7 @@ EXPORT_SYMBOL_GPL(ktime_get_real);
>  void getrawmonotonic(struct timespec *ts)
>  {
>  	struct timekeeper *tk = &timekeeper;
> -	unsigned long seq;
> +	unsigned int seq;
>  	s64 nsecs;
>  
>  	do {
> @@ -720,7 +720,7 @@ EXPORT_SYMBOL(getrawmonotonic);
>  int timekeeping_valid_for_hres(void)
>  {
>  	struct timekeeper *tk = &timekeeper;
> -	unsigned long seq;
> +	unsigned int seq;
>  	int ret;
>  
>  	do {
> @@ -739,7 +739,7 @@ int timekeeping_valid_for_hres(void)
>  u64 timekeeping_max_deferment(void)
>  {
>  	struct timekeeper *tk = &timekeeper;
> -	unsigned long seq;
> +	unsigned int seq;
>  	u64 ret;
>  
>  	do {
> @@ -1546,7 +1546,7 @@ struct timespec current_kernel_time(void)
>  {
>  	struct timekeeper *tk = &timekeeper;
>  	struct timespec now;
> -	unsigned long seq;
> +	unsigned int seq;
>  
>  	do {
>  		seq = read_seqcount_begin(&timekeeper_seq);
> @@ -1562,7 +1562,7 @@ struct timespec get_monotonic_coarse(void)
>  {
>  	struct timekeeper *tk = &timekeeper;
>  	struct timespec now, mono;
> -	unsigned long seq;
> +	unsigned int seq;
>  
>  	do {
>  		seq = read_seqcount_begin(&timekeeper_seq);
> @@ -1596,7 +1596,7 @@ void get_xtime_and_monotonic_and_sleep_offset(struct
> timespec *xtim,
>  				struct timespec *wtom, struct timespec *sleep)
>  {
>  	struct timekeeper *tk = &timekeeper;
> -	unsigned long seq;
> +	unsigned int seq;
>  
>  	do {
>  		seq = read_seqcount_begin(&timekeeper_seq);
> @@ -1647,7 +1647,7 @@ ktime_t ktime_get_update_offsets(ktime_t *offs_real,
> ktime_t *offs_boot,
>  ktime_t ktime_get_monotonic_offset(void)
>  {
>  	struct timekeeper *tk = &timekeeper;
> -	unsigned long seq;
> +	unsigned int seq;
>  	struct timespec wtom;
>  
>  	do {
> --
> 1.9.2
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] timekeeping: Use printk_deferred when holding timekeeping seqlock
  2014-05-13  5:13       ` George Spelvin
@ 2014-05-13 12:07         ` Mathieu Desnoyers
  2014-05-13 13:29           ` George Spelvin
  0 siblings, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2014-05-13 12:07 UTC (permalink / raw)
  To: George Spelvin; +Cc: john stultz, linux-kernel, Thomas Gleixner, Peter Zijlstra

----- Original Message -----
> From: "George Spelvin" <linux@horizon.com>
> To: "john stultz" <john.stultz@linaro.org>, linux@horizon.com
> Cc: linux-kernel@vger.kernel.org, "mathieu desnoyers" <mathieu.desnoyers@efficios.com>
> Sent: Tuesday, May 13, 2014 1:13:24 AM
> Subject: Re: [PATCH 4/4] timekeeping: Use printk_deferred when holding timekeeping seqlock
> 
> >> 2) Using wait-free coding techniques where readers help the writer if
> >>    they notice a stall.  This is much hairier internal code, but makes
> >>    life easier on the callers.  The basic procedure is:
> >>
> >>    - A conventionally locked writer decides that the frequency is to be
> >>      adjusted, effective at the current time (or perhaps slightly in the
> >>      future).
> >>    - It publishes its idea of the effective time, and that an update is
> >>      in progress.
> >>    - A reader comes in, reads the hardware clock, and checks:
> >>      - Is a rate update in progress, and is the effective time *before*
> >>        the time I just read?
> >>      - If so, I'm going to force a postponement to the time I just read,
> >>        using compare-and-swap trickery.
> >>      - Then it proceeds to use the old timekeeper rate information.
> >>    - When the writer finishes figuring out the new timekeeper state,
> >>      as part of the atomic installation of a new state, it checks to
> >>      see if a reader has forced a postponement.  If so, it goes back and
> >>      recomputes the state with the later effective time and tries again.
> > 
> > Hrm.. So basically during the update you lock readers to the counter
> > value read at the beginning of the update. So readers don't block, but
> > time doesn't progress while the update is being made. Sounds promising!
> 
> Er... no.  I guess my explanation didn't work.  I thought of that (it's
> quite simple, after all), but I was worried that the stuttering would
> mess up fine accounting for e.g. interrupt handlers.
> 
> Also, it's not implementable.  The writer must pick a rate-change moment,
> then announce it.  If it stalls between those two operations (and there's
> no way to make them atomic), then it might announce a moment already in
> the past, causing a later reader to return the stalled time, while an
> earlier reader that didn't see the announcement has already returned
> a later time.

Your description above summarizes well the conclusions I reached when doing
my nonblocking-read clock source implementation prototype a while back. If
we can add a new clock semantic, there is a solution I proposed last year
that might just achieve what you are looking for:

We could expose a new clock type (besides monotonic and realtime) that is
documented as non-strictly monotonic. It may return a time very slightly in
the past if readers race with clock source frequency change. The caller could
handle this situation (e.g. in user-space) by keeping its own per-cpu or
per-thread "last clock value" data structure (something we cannot do in a
vDSO) if it really cares about per-cpu/thread clock monotonicity.
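
As a sketch of what such a caller-side clamp could look like (the
clock_read_latched_ns() below is a made-up stand-in for the proposed
clock read, not an existing API):

#include <stdint.h>

/* Hypothetical read of the proposed non-strictly-monotonic clock;
 * may step very slightly backwards across a frequency change. */
extern uint64_t clock_read_latched_ns(void);

static __thread uint64_t tk_last_ns;    /* per-thread high-water mark */

/* Per-thread monotonic wrapper: hide the rare small backward step. */
static uint64_t clock_read_monotonic_ns(void)
{
        uint64_t now = clock_read_latched_ns();

        if (now < tk_last_ns)
                return tk_last_ns;      /* clamp to last value handed out */
        tk_last_ns = now;
        return now;
}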

This could be implemented with the scheme I proposed as a prototype here:

   https://lkml.org/lkml/2013/9/14/136

The idea here would be to keep both a read seqlock (for realtime and
monotonic), as well as an API/ABI that allows reading this "latched
clock" value (documented as non-strictly-monotonic).

Thoughts ?

Thanks,

Mathieu


> 
> Instead, the writer announces a *proposed* rate-change time, but a number
> of conditions can cause that time to be postponed.  (Basically, the writer
> loops back to the beginning and applies the rate change later.)
> 
> One condition is enforced by the writer itself: it reads the time again
> after making the announcement, and if the announced time has already
> passed, loops.
> 
> But this is also checked by all readers.  If an update is in progress
> with a proposed time earlier than the time the reader is in the process
> of reading, the writer is told to try again later.
> 
> 
> One trick that can minimize the number of retries is to add a very small
> time delta (equal to the typical write update cycle) to the writer's
> effective time.  If the writer completes before that time happens (the
> common case), readers happily use the old time values during the update.
> Only if the writer is delayed more than expected will a reader notice
> a problem.  "Hey, I read the hardware at tick x, but the writer is trying
> to update tick-to-seconds translations for times later than tick y < x.
> Hey, writer, recompute with a higher y!"
> 
> 
> There is also the issue of a reader coming along after the change-over
> and wanting to translate a time x < y.  There are several ways to
> handle this.  Either the old parameters have to be kept around until
> time y has passed, or the readers must wait until time y.
> 
> (They may not return early reporting time y, or there will be race conditions.)
> 
> > Oh.. so its actually more like the update is canceled if a reader can
> > complete a read and set a flag while the update is in flight? Hrm.. I
> > worry with the number and frequency of time reads on many systems, you'd
> > end up with update starvation (something which use to be a problem back
> > when timekeeping was managed with just spinlocks, and SMP wasn't that
> > common - so I suspect it can only be worse now).
> 
> The solution, mentioned above, is to have the effective time of the
> update set to the predicted write completion time.  Then the readers will
> never see an effective time less than their own time, and so never need
> to postpone the writer.
> 
> Only if the writer is delayed unexpectedly does that happen.
> 
> > Further, the write in the read path would cause problems for VDSO time,
> > where everything is computed on read-only data in userspace.
> 
> Ah!  Now *that* poses a potential problem.  I didn't think of that.
> Arrgh, that's going to take some work.
> 
> Two possibilities:
> 1) The VDSO code could just spin.  It's not holding any kernel locks,
>    so there's not much problem.
> 2) Possibly after some optimistic spinning, the VDSO code could
>    trap to the kernel as a fallback.
> 
> > Yea. It definitely gets complex. But timekeeping is always a very
> > hotpath, so complexity for performance is a normal trade. So there may
> > be a way to make it work. But finding a way to make it easier to
> > comprehend (even just having a clear metaphor for what's being done)
> > would be excellent.
> 
> I'll try.
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] timekeeping: Use printk_deferred when holding timekeeping seqlock
  2014-05-13 12:07         ` Mathieu Desnoyers
@ 2014-05-13 13:29           ` George Spelvin
  2014-05-13 13:39             ` Mathieu Desnoyers
  0 siblings, 1 reply; 16+ messages in thread
From: George Spelvin @ 2014-05-13 13:29 UTC (permalink / raw)
  To: linux, mathieu.desnoyers; +Cc: john.stultz, linux-kernel, peterz, tglx

> We could expose a new clock type (besides monotonic and realtime) that is
> documented as non-strictly monotonic. It may return a time very slightly in
> the past if readers race with clock source frequency change. The caller could
> handle this situation (e.g. in user-space) by keeping its own per-cpu or
> per-thread "last clock value" data structure (something we cannot do in a
> vDSO) if it really cares about per-cpu/thread clock monotonicity.

That's the first of two options I proposed.  The problem, with respect to
the immediate problem of debug output deadlocking during a write, is
that it makes a more complex API which callers must understand the
subtleties of.

Perhaps necessary, but definitely a minus.

> This could be implemented with the scheme I proposed as a prototype here:
> 
>    https://lkml.org/lkml/2013/9/14/136

I'm working my way through it.  I definitely like the first patch!

> Thoughts ?

I was trying to tackle the "hard problem" of making *all* time reads
non-blocking, with monotonicity guarantees.  There has to be *some* bound
on blocking times (in particular, time between reading hardware timers
and translating them to real time), but they can be reasonably long.

I think I have an idea that could work, but given the hairiness of
the timekeeping code, implementing it would be a major project.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] timekeeping: Use printk_deferred when holding timekeeping seqlock
  2014-05-13 13:29           ` George Spelvin
@ 2014-05-13 13:39             ` Mathieu Desnoyers
  2014-05-13 16:18               ` George Spelvin
  0 siblings, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2014-05-13 13:39 UTC (permalink / raw)
  To: George Spelvin; +Cc: john stultz, linux-kernel, peterz, tglx

----- Original Message -----
> From: "George Spelvin" <linux@horizon.com>
> To: linux@horizon.com, "mathieu desnoyers" <mathieu.desnoyers@efficios.com>
> Cc: "john stultz" <john.stultz@linaro.org>, linux-kernel@vger.kernel.org, peterz@infradead.org, tglx@linutronix.de
> Sent: Tuesday, May 13, 2014 9:29:18 AM
> Subject: Re: [PATCH 4/4] timekeeping: Use printk_deferred when holding timekeeping seqlock
> 
> > We could expose a new clock type (besides monotonic and realtime) that is
> > documented as non-strictly monotonic. It may return a time very slightly in
> > the past if readers race with clock source frequency change. The caller
> > could
> > handle this situation (e.g. in user-space) by keeping its own per-cpu or
> > per-thread "last clock value" data structure (something we cannot do in a
> > vDSO) if it really cares about per-cpu/thread clock monotonicity.
> 
> That's the first of two options I proposed.  The problem, with respect to
> the immediate problem of debug output deadlocking during a write, is
> that it makes a more complex API which callers must understand the
> subtleties of.
> 
> Perhaps necessary, but definitely a minus.
> 
> > This could be implemented with the scheme I proposed as a prototype here:
> > 
> >    https://lkml.org/lkml/2013/9/14/136
> 
> I'm working my way through it.  I definitely like the first patch!

Thanks! :)

> 
> > Thoughts ?
> 
> I was trying to tackle the "hard problem" of making *all* time reads
> non-blocking, with monotonicity guarantees.  There has to be *some* bound
> on blocking times (in particular, time between reading hardware timers
> and translating them to real time), but they can be reasonably long.

What I gathered from my past discussion with John on this topic is that
virtualization blows away pretty much any assumption we can make on
"update should be reasonably short". A virtualized CPU can be preempted
for a rather long time (AFAIU not possible to bound).

> I think I have an idea that could work, but given the hairiness of
> the timekeeping code, implementing it would be a major project.

Indeed, timekeeping is not for the faint of heart. ;-)

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] timekeeping: Use printk_deferred when holding timekeeping seqlock
  2014-05-13 13:39             ` Mathieu Desnoyers
@ 2014-05-13 16:18               ` George Spelvin
  0 siblings, 0 replies; 16+ messages in thread
From: George Spelvin @ 2014-05-13 16:18 UTC (permalink / raw)
  To: linux, mathieu.desnoyers; +Cc: john.stultz, linux-kernel, peterz, tglx

>> I was trying to tackle the "hard problem" of making *all* time reads
>> non-blocking, with monotonicity guarantees.  There has to be *some* bound
>> on blocking times (in particular, time between reading hardware timers
>> and translating them to real time), but they can be reasonably long.

> What I gathered from my past discussion with John on this topic is that
> virtualization blows away pretty much any assumption we can make on
> "update should be reasonably short". A virtualized CPU can be preempted
> for a rather long time (AFAIU not possible to bound).

Well, there are two kinds of time reads:

1) "What is the current time now?"  This is easy, because if we stall
   it's okay to return any time in the appropriate window.
2) "Some time in the past I read the hardware clock with value c.
   What is the time in seconds corresponding to that?"

I'll call number 2 "time conversion".  This can be handled as long as the
time between reading the clock and doing the conversion is reasonable.
Otherwise, it'll fail, with "I don't remember that part of the piecewise
clock/time conversion function."

Now, obviously number 1 can be implemented by reading the hardware clock
and converting it.  As you point out, virtualization makes it impossible
to guarantee any reasonable bound on the time between the two steps.
But in case 1, we can easily handle failure by retrying the read.  (Or,
equivalently, returning the earliest time which we know how to convert.)

All we can guarantee in case 1 is that we return the timestamp of some
instant between call and return.  If that time is made very long by
the hypervisor descheduling us, it reduces the required precision.


So it's possible to allow arbitrarily long delays in time writes and in
current-time reads, but conversion has a risk of failure.

I don't think it's possible to do better, so that's kind of the limit.
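
As a sketch of how case 2 could report that failure (the data layout
and names below are invented for illustration, not how the timekeeper
is actually structured):

#include <stdint.h>
#include <errno.h>

#define TK_HISTORY      4       /* old rate segments we remember */

struct tk_segment {
        uint64_t first_tick, last_tick; /* ticks this segment covers */
        uint64_t base_ns;               /* nanoseconds at first_tick */
        uint32_t mult, shift;           /* tick -> ns conversion rate */
};

static struct tk_segment tk_hist[TK_HISTORY];

/*
 * Convert a hardware timestamp captured some time ago.  Fails with
 * -ERANGE if the caller waited so long that we no longer remember
 * that part of the piecewise clock/time conversion function.
 */
static int tk_convert(uint64_t tick, uint64_t *ns)
{
        int i;

        for (i = 0; i < TK_HISTORY; i++) {
                const struct tk_segment *s = &tk_hist[i];

                if (tick >= s->first_tick && tick <= s->last_tick) {
                        *ns = s->base_ns +
                              (((tick - s->first_tick) * s->mult)
                               >> s->shift);
                        return 0;
                }
        }
        return -ERANGE;
}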

>> I think I have an idea that could work, but given the hairiness of
>> the timekeeping code, implementing it would be a major project.
>
> Indeed, timekeeping is not for the faint of heart. ;-)

But at least it's interesting, with problems other than undocumented bugs
in hardware.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2014-05-13 16:18 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-06 21:33 [PATCH 4/4] timekeeping: Use printk_deferred when holding timekeeping seqlock George Spelvin
2014-05-12 16:21 ` [rough draft PATCH] avoid stalls on the " George Spelvin
2014-05-12 18:23   ` John Stultz
2014-05-12 17:49 ` [PATCH 4/4] timekeeping: Use printk_deferred when holding " John Stultz
2014-05-13  2:44   ` George Spelvin
2014-05-13  3:39     ` John Stultz
2014-05-13  5:13       ` George Spelvin
2014-05-13 12:07         ` Mathieu Desnoyers
2014-05-13 13:29           ` George Spelvin
2014-05-13 13:39             ` Mathieu Desnoyers
2014-05-13 16:18               ` George Spelvin
2014-05-13  2:45   ` [PATCH 1/2] timekeeping: Use unsigned int for seqlock sequence George Spelvin
2014-05-13 11:49     ` Mathieu Desnoyers
2014-05-13  2:48   ` [PATCH 2/2] timekeeping: Mark struct timekeeper * passed to notifiers as const George Spelvin
  -- strict thread matches above, loose matches on Subject: below --
2014-05-05 20:47 [PATCH 0/4] Convert timekeeping core to use printk_deferred (v3) John Stultz
2014-05-05 20:47 ` [PATCH 4/4] timekeeping: Use printk_deferred when holding timekeeping seqlock John Stultz
2014-05-02 22:09 [PATCH 0/4] Convert timekeeping core to use printk_deferred (v2) John Stultz
2014-05-02 22:09 ` [PATCH 4/4] timekeeping: Use printk_deferred when holding timekeeping seqlock John Stultz

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).