From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>,
linux-kernel@vger.kernel.org, mingo@elte.hu,
laijs@cn.fujitsu.com, dipankar@in.ibm.com,
akpm@linux-foundation.org, mathieu.desnoyers@efficios.com,
josh@joshtriplett.org, niv@us.ibm.com, peterz@infradead.org,
rostedt@goodmis.org, dhowells@redhat.com, edumazet@google.com,
darren@dvhart.com, fweisbec@gmail.com, sbw@mit.edu
Subject: Re: [PATCH RFC nohz_full 0/8] Provide infrastructure for full-system idle
Date: Wed, 26 Jun 2013 07:31:04 -0700
Message-ID: <20130626143104.GL3828@linux.vnet.ibm.com>
In-Reply-To: <51CA3FBF.6080201@amacapital.net>
On Tue, Jun 25, 2013 at 06:11:27PM -0700, Andy Lutomirski wrote:
> On 06/25/2013 02:49 PM, Thomas Gleixner wrote:
> > On Tue, 25 Jun 2013, Paul E. McKenney wrote:
> >> Note that this version pays attention to CPUs that have taken an NMI
> >> from idle. It is not clear to me that NMI handlers can safely access
> >> the time on a system that is long-term idle. Unless someone tells me
> >> that it is somehow safe to access time from an NMI from idle, I will
> >> remove NMI support in the next version.
> >
> > NMI cannot access any time related functions independent of NOHZ, long
> > term idle or whatever you come up with:
> >
> >    write_seqcount_begin(&timekeeper_seq);
> >
> >    ---> NMI
> >         ...
> >         do {
> >             seq = read_seqcount_begin(&timekeeper_seq);
> >         } while (read_seqcount_retry(&timekeeper_seq, seq));
> >
> > Guess how well that works ....
> >
> > Thanks,
> >
> > tglx
> >
>
> Is this something worth fixing? One of the things on my infinitely long
> todo list is to replace that seqcount with a wait-free data structure,
> in which case this would be okay. I don't care about NMIs, but this
> would mean that clock_gettime would never stall just because the
> timekeeping code was running somewhere -- at worst you'd get a couple
> extra cache misses.
>
> The data structure is described here:
>
> http://link.springer.com/chapter/10.1007%2F978-3-540-92221-6_40
>
> (Sorry, this was my first paper and is therefore not so well written.
> Also, it costs $30, although I think I'm allowed to email copies out and
> probably even host them on a website somewhere.)
>
> The main downside would be a possible loss of monotonicity, like this:
>
> Thread a: read the timekeeping data
> Thread b: update the timekeeping data
> Thread c: start and finish reading the time (using new data)
> Thread a: read new raw clock value but compute using old timekeeping data
>
> This would be fixable.
>
> The data structure is essentially an array of copies of the protected
> data, which can be called bin 0, 1, 2, ..., N. The data is versioned,
> just like with seqcount. Bin i contains the most recent copy of the
> data that had a version number that's a multiple of 2^i, but any bin can
> also be marked as invalid if it's being written.
If I remember correctly, something like this was proposed some years
back, but was rejected because it was not possible to set a bound on
how long a given thread would be using a given array element, which
could result in all elements being both in use and out of date.
It is possible to avoid this, but the only way I can see to do so
re-introduces retries.
> To write: update all bins that need updating (that is 1 + num trailing
> zeros in the new version number), starting at the highest number.
>
> To read starting at bin i: try to read bin i (just like with a
> seqcount). If that fails, then recursively read starting at bin i+1.
> As a double-check, re-try bin i. If the retry fails but the recursive
> read succeeded, return the value from the recursive read.
>
> The only way this can fail is if you race with ~2^N writes. You can try
> to read in a loop to avoid this problem.
>
> Unlike a seqcount, you need to race with more than 1 write, which
> eliminates this deadlock -- writers have to make continuous progress for
> readers to get stuck. But it's extremely unlikely that a reader ever
> has to loop.
Unless I am missing something, you still have to have readers loop, but
reduce the probability of them having to do so. Assuming that you were
willing to reuse old entries, you could avoid the deadlock that Thomas
pointed out, but at the cost of reusing an arbitrarily old entry in the
case where the NMI happened at the end of a long full-system idle period.
Which might not be such a good thing!
Therefore, I still intend to remove the NMI detection.
Thanx, Paul
Thread overview: 17+ messages
2013-06-25 21:37 [PATCH RFC nohz_full 0/8] Provide infrastructure for full-system idle Paul E. McKenney
2013-06-25 21:37 ` [PATCH RFC nohz_full 1/8] nohz_full: Add Kconfig parameter for scalable detection of all-idle state Paul E. McKenney
2013-06-25 21:37 ` [PATCH RFC nohz_full 2/8] nohz_full: Add rcu_dyntick data " Paul E. McKenney
2013-06-25 21:37 ` [PATCH RFC nohz_full 3/8] nohz_full: Add per-CPU idle-state tracking Paul E. McKenney
2013-06-25 21:37 ` [PATCH RFC nohz_full 4/8] nohz_full: Add per-CPU idle-state tracking for NMIs Paul E. McKenney
2013-06-25 21:37 ` [PATCH RFC nohz_full 5/8] nohz_full: Add full-system idle states and variables Paul E. McKenney
2013-06-25 21:37 ` [PATCH RFC nohz_full 6/8] nohz_full: Add full-system-idle arguments to API Paul E. McKenney
2013-06-25 21:37 ` [PATCH RFC nohz_full 7/8] nohz_full: Add full-system-idle state machine Paul E. McKenney
2013-06-25 21:37 ` [PATCH RFC nohz_full 8/8] nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU Paul E. McKenney
2013-06-25 21:49 ` [PATCH RFC nohz_full 0/8] Provide infrastructure for full-system idle Thomas Gleixner
2013-06-25 22:01 ` Paul E. McKenney
2013-06-26 1:11 ` Andy Lutomirski
2013-06-26 14:31 ` Paul E. McKenney [this message]
2013-06-26 12:20 ` Peter Zijlstra
2013-06-26 22:24 ` Paul E. McKenney
2013-06-27 9:42 ` Peter Zijlstra
2013-06-27 12:44 ` Paul E. McKenney