public inbox for linux-kernel@vger.kernel.org
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: John Stultz <john.stultz@linaro.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Richard Cochran <richardcochran@gmail.com>,
	Prarit Bhargava <prarit@redhat.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Steven Rostedt <rostedt@goodmis.org>, Ingo Molnar <mingo@elte.hu>,
	linux-kernel@vger.kernel.org, lttng-dev@lists.lttng.org
Subject: Re: [RFC PATCH] timekeeping: introduce timekeeping_is_busy()
Date: Wed, 11 Sep 2013 13:36:18 -0700
Message-ID: <20130911203618.GH3966@linux.vnet.ibm.com>
In-Reply-To: <20130911185441.GC23532@Krystal>

On Wed, Sep 11, 2013 at 02:54:41PM -0400, Mathieu Desnoyers wrote:
> * John Stultz (john.stultz@linaro.org) wrote:
> > On 09/11/2013 08:08 AM, Mathieu Desnoyers wrote:
> [...]
> 
> Now focusing on features (the fix discussion is in a separate
> sub-thread):
> 
> > 
> > > LTTng uses ktime to have the same time-base across kernel and
> > > user-space, so traces gathered from LTTng-modules and LTTng-UST can be
> > > correlated. We plan on using ktime until a fast, scalable, and
> > > fine-grained time-source for tracing that can be used across kernel and
> > > user-space, and which does not rely on read seqlock for kernel-level
> > > synchronization, makes its way into the kernel.
> > 
> > So my fix for the issue aside, I could see cases where using timekeeping
> > for tracing could run into similar issues, so something like your
> > timekeeping_is_busy() check sounds reasonable.
> 
> Yep, it would certainly make those uses of ktime_get() more robust
> against internal changes.
> 
> > I might suggest we wrap
> > the timekeeper locking in a helper function so we don't have the
> > spinlock(); set_owner(); write_seqcount(); pattern all over the place
> > (and it avoids me forgetting to set the owner in some future change,
> > further mucking things up :).
> 
> Good idea.
> 
> > As for your waiting for "fast, scalable, and fine-grained time-source
> > for tracing that can be used across kernel and user-space, and which
> > does not rely on read seqlock for kernel-level synchronization" wish,
> > I'd be interested in hearing ideas if anyone has them.
> 
> So far, the best I could come up with is this: use an RCU (or RCU-like)
> scheme to protect in-kernel timestamp reads (possibly with RCU sched,
> which is based on preemption disabling), and use a sequence lock to
> protect reads from user-space.
> 
> Time updates within the kernel would have to handle both the RCU pointer
> update and quiescent-state tracking, and would need to hold a write
> seqlock to synchronize against concurrent user-space reads.
> 
> > After getting the recent lock-hold reduction work merged in 3.10, I had
> > some thoughts that maybe we could do some sort of rcu style timekeeper
> > switch. The down side is that there really is a time bound within which
> > the timekeeper state is valid, so there would have to be some sort of
> > seqcount style "retry if we didn't finish the calculation within the
> > valid bound" (which can run into similar deadlock problems if the
> > updater is delayed by a reader spinning waiting for an update).
> 
> What could make a reader fail to finish the calculation within the valid
> time bound, besides preemption? If it's caused by a too long
> interrupt, this will have an effect on the entire timekeeping, because
> the timer interrupt will likely be delayed, and therefore the periodic
> update changing the write seqlock value will be delayed too. So for the
> interrupt case, it looks like a "too long" interrupt (or interrupt
> disable section) will already disrupt timekeeping with the current
> design.
> 
> > 
> > Also there is the memory issue of having N timekeeper structures hanging
> > around, since there could be many readers delayed mid-calculation, but
> > that could probably be bounded by falling back to a seqcount (and again,
> > that opens up deadlock possibilities). Anyway, it all gets pretty
> > complicated pretty quickly, which makes ensuring correctness even harder. :(
> > 
> > But yea, I'd be interested in other ideas and approaches.
> 
> If we can afford a synchronize_rcu_sched() wherever the write seqlock is
> needed, we could go with the following. Please note that I use
> synchronize_rcu_sched() rather than call_rcu_sched() here because I try
> to avoid having too many timekeeper structures hanging around, and I
> think it is generally a good thing to ensure the timekeeping core does
> not depend on the memory allocator (but I could very well be wrong).

The last time I remember this being put forward, the issue called out
was that grace periods can be delayed for longer than the acceptable
gap between timekeeping updates.  But maybe something has changed
since then -- that was a few years ago.

							Thanx, Paul

> In kernel/time/timekeeping.c:
> 
> static DEFINE_MUTEX(timekeeper_mutex);
> static seqcount_t timekeeper_user_seq;
> 
> struct timekeeper_rcu {
>         struct timekeeper a[2]; /* two slots, flipped on each update */
>         struct timekeeper *p;   /* current */
> };
> 
> /* Timekeeper structure for kernel readers */
> static struct timekeeper_rcu timekeeper_rcu;
> 
> /* Timekeeper structure for userspace readers */
> static struct timekeeper timekeeper_user;
> 
> /* for updates */
> static void update_time(void)
> {
>         struct timekeeper *next_p;
> 
>         mutex_lock(&timekeeper_mutex);
> 
>         /* RCU update, for kernel readers */
>         if (timekeeper_rcu.p == &timekeeper_rcu.a[0])
>                 next_p = &timekeeper_rcu.a[1];
>         else
>                 next_p = &timekeeper_rcu.a[0];
> 
>         timekeeper_copy(next_p, timekeeper_rcu.p);
>         timekeeper_do_update(next_p, ...);
> 
>         rcu_assign_pointer(timekeeper_rcu.p, next_p);
> 
>         /*
>          * Ensure there are no more readers in flight accessing the old
>          * element.
>          */
>         synchronize_rcu_sched();
> 
>         /* seqlock update, for user-space readers */
>         write_seqcount_begin(&timekeeper_user_seq);
>         timekeeper_do_update_user(&timekeeper_user);
>         write_seqcount_end(&timekeeper_user_seq);
> 
>         mutex_unlock(&timekeeper_mutex);
> }
> 
> 
> /*
>  * Accessing time from kernel. Should be called with preemption
>  * disabled.
>  */
> u64 __read_time_kernel(void)
> {
>         struct timekeeper *p;
> 
>         p = rcu_dereference(timekeeper_rcu.p);
>         return get_value_from(p);
> }
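
For illustration, the two-slot flip quoted above can be sketched in
userspace C11. This is only a sketch under stated assumptions:
tk_sketch, tk_init(), tk_update(), tk_read(), and
wait_for_readers_sketch() are hypothetical names, the struct fields are
made up, and C11 release/acquire atomics stand in for
rcu_assign_pointer() and rcu_dereference(). The grace-period wait is an
empty placeholder, so this shows the data flow only, not the cost of
waiting for readers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for struct timekeeper; fields are illustrative. */
struct tk_sketch {
	uint64_t base_ns;	/* time as of the last update */
	uint64_t updates;	/* number of updates applied so far */
};

struct tk_rcu_sketch {
	struct tk_sketch a[2];			/* two slots */
	_Atomic(struct tk_sketch *) p;		/* current slot */
};

static struct tk_rcu_sketch tk;

static void tk_init(void)
{
	atomic_init(&tk.p, &tk.a[0]);
}

/*
 * Placeholder for the grace-period wait (synchronize_rcu_sched() in the
 * quoted proposal). Assumption: no concurrent readers exist in this
 * sketch, so there is nothing to wait for.
 */
static void wait_for_readers_sketch(void)
{
}

/* Single updater assumed (the quoted proposal serializes with a mutex). */
static void tk_update(uint64_t delta_ns)
{
	struct tk_sketch *cur =
		atomic_load_explicit(&tk.p, memory_order_relaxed);
	struct tk_sketch *next = (cur == &tk.a[0]) ? &tk.a[1] : &tk.a[0];

	assert(next != cur);

	*next = *cur;			/* copy the current state */
	next->base_ns += delta_ns;	/* update the private copy */
	next->updates++;

	/* Publish the new slot; pairs with the acquire load in tk_read(). */
	atomic_store_explicit(&tk.p, next, memory_order_release);

	/* The old slot may only be reused once old readers are done. */
	wait_for_readers_sketch();
}

static uint64_t tk_read(void)
{
	struct tk_sketch *p =
		atomic_load_explicit(&tk.p, memory_order_acquire);

	return p->base_ns;
}
```

Calling tk_init() once and then tk_update() repeatedly alternates
between the two slots, and tk_read() always sees a fully updated slot.
The point of the real grace-period wait is that the next update writes
into the slot that readers may still be using.
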
> 
> /* Accessing time from user-space (vDSO) */
> 
> u64 read_time_user(void)
> {
>         /*
>          * Read timekeeper_user within a read_seqcount loop.
>          */
> }
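
The read_seqcount loop mentioned in the comment above could look
roughly like the following userspace C11 sketch. It is not the vDSO
code: seq_time_sketch, seq_write(), and seq_read_ns() are hypothetical
names, the sec/nsec fields are made up, and a single writer is assumed
(the quoted proposal holds a mutex on the update side).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical time snapshot protected by a sequence counter. */
struct seq_time_sketch {
	atomic_uint seq;	/* even: stable, odd: write in progress */
	_Atomic uint64_t sec;
	_Atomic uint64_t nsec;
};

static struct seq_time_sketch st;

/* Writer side: assumes callers are already serialized. */
static void seq_write(uint64_t sec, uint64_t nsec)
{
	unsigned int s = atomic_load_explicit(&st.seq, memory_order_relaxed);

	assert((s & 1) == 0);	/* no write may already be in progress */

	/* Make the count odd, then order it before the data stores. */
	atomic_store_explicit(&st.seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);

	atomic_store_explicit(&st.sec, sec, memory_order_relaxed);
	atomic_store_explicit(&st.nsec, nsec, memory_order_relaxed);

	/* Make the count even again; release orders the data before it. */
	atomic_store_explicit(&st.seq, s + 2, memory_order_release);
}

/* Reader side: retry until a stable, unchanged count brackets the reads. */
static uint64_t seq_read_ns(void)
{
	unsigned int s0, s1;
	uint64_t sec, nsec;

	do {
		s0 = atomic_load_explicit(&st.seq, memory_order_acquire);
		sec = atomic_load_explicit(&st.sec, memory_order_relaxed);
		nsec = atomic_load_explicit(&st.nsec, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s1 = atomic_load_explicit(&st.seq, memory_order_relaxed);
	} while ((s0 & 1) || s0 != s1);

	return sec * 1000000000ULL + nsec;
}
```

The reader never blocks the writer, but it spins for as long as the
count is odd. That is the property behind the deadlock concern in this
thread: a reader that interrupts the writer on the same CPU would spin
forever.
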
> 
> 
> As far as I see, the write seqcount is currently taken by:
> 
> - do_settimeofday()
> - timekeeping_inject_offset()
> - timekeeping_set_tai_offset()
> - change_clocksource()
> - timekeeping_init()
> - timekeeping_inject_sleeptime()
> - timekeeping_resume()
> - timekeeping_suspend()
> - update_wall_time()
> - do_adjtimex()
> - hardpps()
> 
> It might be good to break down which of these functions could afford a
> synchronize_rcu_sched(), and which can't. Maybe part of the solution
> could be to use different data structures, and separate synchronization,
> for the adjtime part (executed in thread context) vs for
> update_wall_time, AFAIK executed in interrupt context. We might also
> want to consider that the adjtime part needs to be global, vs
> update_wall_time which could possibly be done on per-cpu data
> structures, which could simplify synchronization.
> 
> Thoughts?
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> 

