From: Thomas Gleixner <tglx@linutronix.de>
To: LKML <linux-kernel@vger.kernel.org>
Cc: John Stultz <john.stultz@linaro.org>, Ingo Molnar <mingo@elte.hu>,
	Peter Zijlstra <peterz@infradead.org>,
	Eric Dumazet <dada1@cosmosbay.com>,
	Frederic Weisbecker <fweisbec@gmail.com>
Subject: [RFC patch 0/8] timekeeping: Implement shadow timekeeper to shorten in kernel reader side blocking
Date: Thu, 21 Feb 2013 22:51:35 -0000
Message-ID: <20130221220147.719832397@linutronix.de>

The vsyscall based timekeeping interfaces for userspace provide the
shortest possible reader side blocking (update of the vsyscall gtod
data structure), but the kernel side interfaces to timekeeping are
blocked over the full code sequence of calculating update_wall_time()
magic which can be rather "long" due to ntp, corner cases, etc...

Eric did some work a few years ago to disentangle the seqcount write
hold from the spinlock which is serializing the potential updaters of
the kernel internal timekeeper data. I couldn't be bothered to reread
the old mail thread and figure out why this got turned down, but I
remember that there were objections due to the potential inconsistency
between calculation, update and observation.

In hindsight that's nonsense, because even back at that time we did
the vsyscall update at the very last moment and unsynchronized to the
in kernel data update.

While we never got any complaints about that, there is a real issue
versus virtualization:

  VCPU0                                         VCPU1

  update_wall_time()
    write_seqlock_irqsave(&tk->lock, flags);
    ....

Host schedules out VCPU0

Arbitrary delay

Host schedules in VCPU0
                                                __vdso_clock_gettime()#1
    update_vsyscall();
                                                __vdso_clock_gettime()#2

Depending on the length of the delay which kept VCPU0 away from
executing, and depending on the direction of the NTP update of the
timekeeping variables, __vdso_clock_gettime()#2 can observe time going
backwards.
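
To make the arithmetic behind this concrete, here is a small userspace
sketch in C. It assumes the usual clocksource conversion
ns = base + (((cycles - cycle_last) * mult) >> shift). The struct, the
function names and all numbers are made up for illustration and are not
taken from the kernel or vDSO sources.

```c
#include <stdint.h>

/* Hypothetical snapshot of the data a reader uses to compute time. */
struct gtod_snapshot {
	uint64_t cycle_last;	/* clocksource value at the last update */
	uint64_t base_ns;	/* nanoseconds accumulated up to cycle_last */
	uint32_t mult;
	uint32_t shift;
};

struct stall_result {
	uint64_t t1_ns;		/* reader #1, still sees the old data */
	uint64_t t2_ns;		/* reader #2, sees the new data */
};

/* Reader side: base plus the scaled cycle delta since the last update. */
uint64_t read_ns(const struct gtod_snapshot *s, uint64_t now_cycles)
{
	return s->base_ns +
	       (((now_cycles - s->cycle_last) * s->mult) >> s->shift);
}

/*
 * The updater samples the clocksource at cycle 1000000, computes new
 * data with a lowered mult (1024 -> 1023), and is then stalled for
 * stall_cycles before it gets to publish the result.
 */
struct stall_result simulate_stall(uint64_t stall_cycles)
{
	struct gtod_snapshot published = {
		.cycle_last = 0, .base_ns = 0, .mult = 1024, .shift = 10,
	};
	const uint64_t update_cycles = 1000000;
	struct gtod_snapshot pending = {
		.cycle_last = update_cycles,
		.base_ns = read_ns(&published, update_cycles),
		.mult = 1023,	/* the NTP adjustment lowered mult */
		.shift = 10,
	};
	uint64_t now = update_cycles + stall_cycles;
	struct stall_result r;

	/* Reader #1 runs just before the publish, with the old mult. */
	r.t1_ns = read_ns(&published, now);

	/* The stalled updater finally publishes the new data. */
	published = pending;

	/* Reader #2 runs 100 cycles later, with the new mult. */
	r.t2_ns = read_ns(&published, now + 100);
	return r;
}
```

With these made-up numbers, simulate_stall(10000000) gives
t1 = 11000000 ns and t2 = 10990334 ns, a backwards step of 9666 ns,
because the whole stall was scaled with the old mult by reader #1 and
with the lower mult by reader #2. simulate_stall(0) stays monotonic.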

You can reproduce that by pinning VCPU0 to physical core 0 and VCPU1
to physical core 1. Now remove all load from physical core 1 except
VCPU1 and put massive load on physical core 0 and make sure that the
NTP adjustment lowers the mult factor. It's extremely hard to
reproduce, but it's possible.

So this patch series is going to expose the same issue to the kernel
side timekeeping. I'm not too worried about that, because

 - it's extremely hard to trigger

 - we are aware of the issue vs. vsyscalls already

 - making the kernel behave the same way as vsyscall does not make
   things worse

 - John Stultz already has an idea how to fix it.
   See https://lkml.org/lkml/2013/2/19/569

That's not in the scope of this patch series, but I want to make
sure that it's documented.

Now the obvious question whether this is worth the trouble can be
answered easily. Preempt-RT users and HPC folks have complained about
the long write hold time of the timekeeping seqcount for years, and a
quick test on a preempt-RT enabled kernel shows that this series
lowers the maximum latency on the non-timekeeping cores from 8 to 4
microseconds. That's a whopping factor of 2. Definitely worth the
trouble!
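
For illustration, the shortened write hold can be sketched in userspace
C roughly as follows. This is only a sketch of the general shadow-copy
idea named in the subject line, not the code from the patches: it
assumes a pthread mutex in place of the updater lock and a C11 atomic
counter in place of the kernel seqcount, and every name in it (tk_data,
tk_update, tk_read_ns, ...) is hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

struct tk_data {
	uint64_t cycle_last;
	uint64_t base_ns;
	uint32_t mult;
	uint32_t shift;
};

static pthread_mutex_t tk_lock = PTHREAD_MUTEX_INITIALIZER; /* serializes updaters */
static atomic_uint tk_seq;		/* odd while the copy is in flight */
static struct tk_data tk_live = { 0, 0, 1024, 10 };	/* what readers see */
static struct tk_data tk_shadow = { 0, 0, 1024, 10 };	/* updater's working copy */

void tk_update(uint64_t now_cycles, uint32_t new_mult)
{
	pthread_mutex_lock(&tk_lock);

	/* Long part: all calculations run on the shadow copy, so readers
	 * are not blocked while this is going on. */
	tk_shadow.base_ns += ((now_cycles - tk_shadow.cycle_last) *
			      tk_shadow.mult) >> tk_shadow.shift;
	tk_shadow.cycle_last = now_cycles;
	tk_shadow.mult = new_mult;

	/* Short part: readers retry only while this copy is in flight. */
	atomic_fetch_add_explicit(&tk_seq, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	tk_live = tk_shadow;
	atomic_fetch_add_explicit(&tk_seq, 1, memory_order_release);

	pthread_mutex_unlock(&tk_lock);
}

uint64_t tk_read_ns(uint64_t now_cycles)
{
	struct tk_data snap;
	unsigned int seq;

	do {
		seq = atomic_load_explicit(&tk_seq, memory_order_acquire);
		/* Simplification: a strictly conforming C11 version would
		 * read the fields with relaxed atomic loads. */
		snap = tk_live;
		atomic_thread_fence(memory_order_acquire);
	} while ((seq & 1) ||
		 seq != atomic_load_explicit(&tk_seq, memory_order_relaxed));

	return snap.base_ns +
	       (((now_cycles - snap.cycle_last) * snap.mult) >> snap.shift);
}
```

The point of the split is that a reader spins only for the duration of
the struct copy, not for the duration of the calculation. The price is
the one described above: the published data can be stale relative to
the moment it was calculated.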

Thanks,

	tglx
---
 include/linux/jiffies.h             |    1 
 include/linux/timekeeper_internal.h |    4 
 kernel/time/tick-internal.h         |    2 
 kernel/time/timekeeping.c           |  176 +++++++++++++++++++++---------------
 4 files changed, 107 insertions(+), 76 deletions(-)





Thread overview: 12+ messages
2013-02-21 22:51 Thomas Gleixner [this message]
2013-02-21 22:51 ` [RFC patch 2/8] timekeeping: Make jiffies_lock internal Thomas Gleixner
2013-02-21 22:51 ` [RFC patch 1/8] timekeeping: Calc stuff once Thomas Gleixner
2013-02-21 22:51 ` [RFC patch 3/8] timekeeping: Move lock out of timekeeper struct Thomas Gleixner
2013-02-21 22:51 ` [RFC patch 4/8] timekeeping: Split timekeeper_lock into lock and seqcount Thomas Gleixner
2013-02-21 22:51 ` [RFC patch 5/8] timekeeping: Store cycle_last value in timekeeper struct as well Thomas Gleixner
2013-02-21 22:51 ` [RFC patch 6/8] timekeeping: Delay update of clock->cycle_last Thomas Gleixner
2013-02-21 22:51 ` [RFC patch 7/8] timekeeping: Implement a shadow timekeeper Thomas Gleixner
2013-02-22 23:53   ` John Stultz
2013-02-26 12:17     ` Thomas Gleixner
2013-02-21 22:51 ` [RFC patch 8/8] timekeeping: Shorten seq_count region Thomas Gleixner
2013-02-21 23:06 ` [RFC patch 0/8] timekeeping: Implement shadow timekeeper to shorten in kernel reader side blocking Eric Dumazet
