* [RFC PATCH] timekeeping: introduce timekeeping_is_busy()
@ 2013-09-11 15:08 Mathieu Desnoyers
2013-09-11 16:40 ` John Stultz
0 siblings, 1 reply; 13+ messages in thread
From: Mathieu Desnoyers @ 2013-09-11 15:08 UTC (permalink / raw)
To: Thomas Gleixner, Richard Cochran, Prarit Bhargava, John Stultz
Cc: Greg Kroah-Hartman, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
linux-kernel, lttng-dev
Starting from commit 06c017fdd4dc48451a29ac37fc1db4a3f86b7f40
"timekeeping: Hold timekeepering locks in do_adjtimex and hardpps"
(3.10 kernels), the xtime write seqlock is held across calls to
__do_adjtimex(), which includes a call to notify_cmos_timer(), and hence
schedule_delayed_work().
This introduces a side-effect for a set of tracepoints, including mainly
the workqueue tracepoints: a tracer hooking on those tracepoints and
reading current time with ktime_get() will cause hard system LOCKUP such
as:
WARNING: CPU: 6 PID: 2258 at kernel/watchdog.c:245 watchdog_overflow_callback+0x93/0x9e()
Watchdog detected hard LOCKUP on cpu 6
Modules linked in: lttng_probe_workqueue(O) lttng_probe_vmscan(O) lttng_probe_udp(O) lttng_probe_timer(O) lttng_probe_s]
CPU: 6 PID: 2258 Comm: ntpd Tainted: G O 3.11.0 #158
Hardware name: Supermicro X7DAL/X7DAL, BIOS 6.00 12/03/2007
0000000000000000 ffffffff814f83eb ffffffff813b206a ffff88042fd87c78
ffffffff8106a07c 0000000000000000 ffffffff810c94c2 0000000000000000
ffff88041f31bc00 0000000000000000 ffff88042fd87d68 ffff88042fd87ef8
Call Trace:
<NMI> [<ffffffff813b206a>] ? dump_stack+0x41/0x51
[<ffffffff8106a07c>] ? warn_slowpath_common+0x79/0x92
[<ffffffff810c94c2>] ? watchdog_overflow_callback+0x93/0x9e
[<ffffffff8106a12d>] ? warn_slowpath_fmt+0x45/0x4a
[<ffffffff810c94c2>] ? watchdog_overflow_callback+0x93/0x9e
[<ffffffff810c942f>] ? watchdog_enable_all_cpus.part.2+0x31/0x31
[<ffffffff810ecc66>] ? __perf_event_overflow+0x12c/0x1ae
[<ffffffff810eab60>] ? perf_event_update_userpage+0x13/0xc2
[<ffffffff81016820>] ? intel_pmu_handle_irq+0x26a/0x2fd
[<ffffffff813b7a0b>] ? perf_event_nmi_handler+0x24/0x3d
[<ffffffff813b728f>] ? nmi_handle.isra.3+0x58/0x12f
[<ffffffff813b7a59>] ? perf_ibs_nmi_handler+0x35/0x35
[<ffffffff813b7404>] ? do_nmi+0x9e/0x2bc
[<ffffffff813b6af7>] ? end_repeat_nmi+0x1e/0x2e
[<ffffffff810a2a33>] ? read_seqcount_begin.constprop.4+0x8/0xf
[<ffffffff810a2a33>] ? read_seqcount_begin.constprop.4+0x8/0xf
[<ffffffff810a2a33>] ? read_seqcount_begin.constprop.4+0x8/0xf
<<EOE>> [<ffffffff810a2d6c>] ? ktime_get+0x23/0x5e
[<ffffffffa0314670>] ? lib_ring_buffer_clock_read.isra.28+0x1f/0x21 [lttng_ring_buffer_client_discard]
[<ffffffffa0314786>] ? lttng_event_reserve+0x112/0x3f3 [lttng_ring_buffer_client_discard]
[<ffffffffa045b1c5>] ? __event_probe__workqueue_queue_work+0x72/0xe0 [lttng_probe_workqueue]
[<ffffffff812ef7e9>] ? sock_aio_read.part.10+0x110/0x124
[<ffffffff81133a36>] ? do_sync_readv_writev+0x50/0x76
[<ffffffff8107d514>] ? __queue_work+0x1ab/0x265
[<ffffffff8107da7e>] ? queue_delayed_work_on+0x3f/0x4e
[<ffffffff810a473d>] ? __do_adjtimex+0x408/0x413
[<ffffffff810a3e9a>] ? do_adjtimex+0x98/0xee
[<ffffffff8106cec6>] ? SYSC_adjtimex+0x32/0x5d
[<ffffffff813bb74b>] ? tracesys+0xdd/0xe2
LTTng uses ktime to have the same time-base across kernel and
user-space, so traces gathered from LTTng-modules and LTTng-UST can be
correlated. We plan on using ktime until a fast, scalable, and
fine-grained time-source for tracing that can be used across kernel and
user-space, and which does not rely on read seqlock for kernel-level
synchronization, makes its way into the kernel.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
diff --git a/include/linux/time.h b/include/linux/time.h
index d5d229b..f2cec6b 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -267,4 +267,11 @@ static __always_inline void timespec_add_ns(struct timespec *a, u64 ns)
a->tv_nsec = ns;
}
+/**
+ * timekeeping_is_busy - is timekeeping busy?
+ *
+ * Returns 0 if the current execution context can safely read time.
+ */
+extern int timekeeping_is_busy(void);
+
#endif
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 48b9fff..8ec2b5f 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -34,6 +34,7 @@
static struct timekeeper timekeeper;
static DEFINE_RAW_SPINLOCK(timekeeper_lock);
static seqcount_t timekeeper_seq;
+static int timekeeper_seq_owner = -1;
static struct timekeeper shadow_timekeeper;
/* flag for if timekeeping is suspended */
@@ -1690,6 +1691,9 @@ int do_adjtimex(struct timex *txc)
getnstimeofday(&ts);
raw_spin_lock_irqsave(&timekeeper_lock, flags);
+ timekeeper_seq_owner = smp_processor_id();
+ /* store to timekeeper_seq_owner before seqcount */
+ barrier();
write_seqcount_begin(&timekeeper_seq);
orig_tai = tai = tk->tai_offset;
@@ -1701,6 +1705,9 @@ int do_adjtimex(struct timex *txc)
clock_was_set_delayed();
}
write_seqcount_end(&timekeeper_seq);
+ /* store to seqcount before timekeeper_seq_owner */
+ barrier();
+ timekeeper_seq_owner = -1;
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
return ret;
@@ -1715,11 +1722,17 @@ void hardpps(const struct timespec *phase_ts, const struct timespec *raw_ts)
unsigned long flags;
raw_spin_lock_irqsave(&timekeeper_lock, flags);
+ timekeeper_seq_owner = smp_processor_id();
+ /* store to timekeeper_seq_owner before seqcount */
+ barrier();
write_seqcount_begin(&timekeeper_seq);
__hardpps(phase_ts, raw_ts);
write_seqcount_end(&timekeeper_seq);
+ /* store to seqcount before timekeeper_seq_owner */
+ barrier();
+ timekeeper_seq_owner = -1;
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
}
EXPORT_SYMBOL(hardpps);
@@ -1737,3 +1750,11 @@ void xtime_update(unsigned long ticks)
do_timer(ticks);
write_sequnlock(&jiffies_lock);
}
+
+int timekeeping_is_busy(void)
+{
+ if (timekeeper_seq_owner == raw_smp_processor_id())
+ return 1;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(timekeeping_is_busy);
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Re: [RFC PATCH] timekeeping: introduce timekeeping_is_busy()
2013-09-11 15:08 [RFC PATCH] timekeeping: introduce timekeeping_is_busy() Mathieu Desnoyers
@ 2013-09-11 16:40 ` John Stultz
2013-09-11 17:07 ` Steven Rostedt
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: John Stultz @ 2013-09-11 16:40 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Thomas Gleixner, Richard Cochran, Prarit Bhargava,
Greg Kroah-Hartman, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
linux-kernel, lttng-dev
On 09/11/2013 08:08 AM, Mathieu Desnoyers wrote:
> Starting from commit 06c017fdd4dc48451a29ac37fc1db4a3f86b7f40
> "timekeeping: Hold timekeepering locks in do_adjtimex and hardpps"
> (3.10 kernels), the xtime write seqlock is held across calls to
> __do_adjtimex(), which includes a call to notify_cmos_timer(), and hence
> schedule_delayed_work().
>
> This introduces a side-effect for a set of tracepoints, including mainly
> the workqueue tracepoints: a tracer hooking on those tracepoints and
> reading current time with ktime_get() will cause hard system LOCKUP such
> as:
Oh bummer. I had just worked this issue out the other day:
https://lkml.org/lkml/2013/9/9/476
Apparently it was a schroedinbug of sorts. My apologies for the time you
spent chasing this down.
My plan is to pull the notify_cmos_timer call to outside of the
timekeeper locking (see the patch at the very end of the mail in the
above link), as well as try to add lockdep support to seqcount/seqlocks
so we can catch these sorts of issues more easily.
> LTTng uses ktime to have the same time-base across kernel and
> user-space, so traces gathered from LTTng-modules and LTTng-UST can be
> correlated. We plan on using ktime until a fast, scalable, and
> fine-grained time-source for tracing that can be used across kernel and
> user-space, and which does not rely on read seqlock for kernel-level
> synchronization, makes its way into the kernel.
So my fix for the issue aside, I could see cases where using timekeeping
for tracing could run into similar issues, so something like your
timekeeping_is_busy() check sounds reasonable. I might suggest we wrap
the timekeeper locking in a helper function so we don't have the
spinlock(); set_owner(); write_seqcount(); pattern all over the place
(and it avoids me forgetting to set the owner in some future change,
further mucking things up :).
As for your waiting for "fast, scalable, and fine-grained time-source
for tracing that can be used across kernel and user-space, and which
does not rely on read seqlock for kernel-level synchronization" wish,
I'd be interested in hearing ideas if anyone has them.
After getting the recent lock-hold reduction work merged in 3.10, I had
some thoughts that maybe we could do some sort of rcu style timekeeper
switch. The down side is that there really is a time bound within which
the timekeeper state is valid, so there would have to be some sort of
seqcount style "retry if we didn't finish the calculation within the
valid bound" (which can run into similar deadlock problems if the
updater is delayed by a reader spinning waiting for an update).
Also there is the memory issue of having N timekeeper structures hanging
around, since there could be many readers delayed mid-calculation, but
that could probably be bound by falling back to a seqcount (and again,
that opens up deadlock possibilities). Anyway, it all gets pretty
complicated pretty quickly, which makes ensuring correctness even harder. :(
But yea, I'd be interested in other ideas and approaches.
thanks
-john
* Re: [RFC PATCH] timekeeping: introduce timekeeping_is_busy()
2013-09-11 16:40 ` John Stultz
@ 2013-09-11 17:07 ` Steven Rostedt
2013-09-11 17:49 ` Mathieu Desnoyers
2013-09-11 18:54 ` Mathieu Desnoyers
2 siblings, 0 replies; 13+ messages in thread
From: Steven Rostedt @ 2013-09-11 17:07 UTC (permalink / raw)
To: John Stultz
Cc: Mathieu Desnoyers, Thomas Gleixner, Richard Cochran,
Prarit Bhargava, Greg Kroah-Hartman, Peter Zijlstra, Ingo Molnar,
linux-kernel, lttng-dev
On Wed, 11 Sep 2013 09:40:51 -0700
John Stultz <john.stultz@linaro.org> wrote:
> (and it avoids me forgetting to set the owner in some future change,
> further mucking things up :).
Wow, I had to read that twice to see what you really wrote. I was
thinking "damn, John is getting vulgar!" ;-)
-- Steve
* Re: [RFC PATCH] timekeeping: introduce timekeeping_is_busy()
2013-09-11 16:40 ` John Stultz
2013-09-11 17:07 ` Steven Rostedt
@ 2013-09-11 17:49 ` Mathieu Desnoyers
2013-09-11 17:53 ` John Stultz
2013-09-11 18:54 ` Mathieu Desnoyers
2 siblings, 1 reply; 13+ messages in thread
From: Mathieu Desnoyers @ 2013-09-11 17:49 UTC (permalink / raw)
To: John Stultz
Cc: Thomas Gleixner, Richard Cochran, Prarit Bhargava,
Greg Kroah-Hartman, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
linux-kernel, lttng-dev
Hi John,
* John Stultz (john.stultz@linaro.org) wrote:
> On 09/11/2013 08:08 AM, Mathieu Desnoyers wrote:
> > Starting from commit 06c017fdd4dc48451a29ac37fc1db4a3f86b7f40
> > "timekeeping: Hold timekeepering locks in do_adjtimex and hardpps"
> > (3.10 kernels), the xtime write seqlock is held across calls to
> > __do_adjtimex(), which includes a call to notify_cmos_timer(), and hence
> > schedule_delayed_work().
> >
> > This introduces a side-effect for a set of tracepoints, including mainly
> > the workqueue tracepoints: a tracer hooking on those tracepoints and
> > reading current time with ktime_get() will cause hard system LOCKUP such
> > as:
> Oh bummer. I had just worked this issue out the other day:
> https://lkml.org/lkml/2013/9/9/476
>
> Apparently it was a schroedinbug of sorts. My apologies for time you
> spent chasing this down.
No worries. Once I was able to reproduce it on my test box (with a
serial port), the NMI watchdog gave a pretty reasonable explanation for
the issue.
> My plan is to pull the notify_cmos_timer call to outside of the
> timekeeper locking (see the patch at the very end of the mail in the
> above link), as well as try to add lockdep support to seqcount/seqlocks
> so we can catch these sorts of issues more easily.
I just tried your patch, and it indeed seems to fix the lockup I've been
experiencing with lttng-modules. Do you plan on pushing this fix into
master and submitting it for inclusion into stable 3.10 and stable 3.11?
I'm planning to handle this issue with a blacklist of kernel versions
that will prevent people from building lttng-modules against broken
kernels, as soon as the patch makes it into those trees.
I will reply to the rest of your email separately, so this thread can
focus on getting the fix upstream quickly.
Thanks!
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Re: [RFC PATCH] timekeeping: introduce timekeeping_is_busy()
2013-09-11 17:49 ` Mathieu Desnoyers
@ 2013-09-11 17:53 ` John Stultz
0 siblings, 0 replies; 13+ messages in thread
From: John Stultz @ 2013-09-11 17:53 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Thomas Gleixner, Richard Cochran, Prarit Bhargava,
Greg Kroah-Hartman, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
linux-kernel, lttng-dev
On 09/11/2013 10:49 AM, Mathieu Desnoyers wrote:
> Hi John,
>
> * John Stultz (john.stultz@linaro.org) wrote:
>> On 09/11/2013 08:08 AM, Mathieu Desnoyers wrote:
>>> Starting from commit 06c017fdd4dc48451a29ac37fc1db4a3f86b7f40
>>> "timekeeping: Hold timekeepering locks in do_adjtimex and hardpps"
>>> (3.10 kernels), the xtime write seqlock is held across calls to
>>> __do_adjtimex(), which includes a call to notify_cmos_timer(), and hence
>>> schedule_delayed_work().
>>>
>>> This introduces a side-effect for a set of tracepoints, including mainly
>>> the workqueue tracepoints: a tracer hooking on those tracepoints and
>>> reading current time with ktime_get() will cause hard system LOCKUP such
>>> as:
>> Oh bummer. I had just worked this issue out the other day:
>> https://lkml.org/lkml/2013/9/9/476
>>
>> Apparently it was a schroedinbug of sorts. My apologies for time you
>> spent chasing this down.
> No worries. As soon as I've been able to reproduce it on my test box
> (with serial port), the NMI watchdog had a pretty reasonable explanation
> for the issue.
>
>> My plan is to pull the notify_cmos_timer call to outside of the
>> timekeeper locking (see the patch at the very end of the mail in the
>> above link), as well as try to add lockdep support to seqcount/seqlocks
>> so we can catch these sorts of issues more easily.
> I just tried your patch, and it indeed seems to fix the lockup I've been
> experiencing with lttng-modules. Do you plan pushing this fix into
> master, and submitting it for inclusion into stable 3.10 and stable 3.11 ?
Yea. I was waiting on feedback from the reporter that the fix resolves
the issue, but if it fixes it for you I'll go ahead and send it out today.
thanks
-john
* Re: [RFC PATCH] timekeeping: introduce timekeeping_is_busy()
2013-09-11 16:40 ` John Stultz
2013-09-11 17:07 ` Steven Rostedt
2013-09-11 17:49 ` Mathieu Desnoyers
@ 2013-09-11 18:54 ` Mathieu Desnoyers
2013-09-11 20:36 ` Paul E. McKenney
2 siblings, 1 reply; 13+ messages in thread
From: Mathieu Desnoyers @ 2013-09-11 18:54 UTC (permalink / raw)
To: John Stultz
Cc: Thomas Gleixner, Richard Cochran, Prarit Bhargava,
Greg Kroah-Hartman, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
linux-kernel, lttng-dev, Paul E. McKenney
* John Stultz (john.stultz@linaro.org) wrote:
> On 09/11/2013 08:08 AM, Mathieu Desnoyers wrote:
[...]
Now focusing on features (the fix discussion is in a separate
sub-thread):
>
> > LTTng uses ktime to have the same time-base across kernel and
> > user-space, so traces gathered from LTTng-modules and LTTng-UST can be
> > correlated. We plan on using ktime until a fast, scalable, and
> > fine-grained time-source for tracing that can be used across kernel and
> > user-space, and which does not rely on read seqlock for kernel-level
> > synchronization, makes its way into the kernel.
>
> So my fix for the issue aside, I could see cases where using timekeeping
> for tracing could run into similar issues, so something like your
> timekeeping_is_busy() check sounds reasonable.
Yep, it would certainly make those uses of ktime_get() more robust
against internal changes.
> I might suggest we wrap
> the timekeeper locking in a helper function so we don't have the
> spinlock(); set_owner(); write_seqcount(); pattern all over the place
> (and it avoids me forgetting to set the owner in some future change,
> further mucking things up :).
Good idea.
> As for your waiting for "fast, scalable, and fine-grained time-source
> for tracing that can be used across kernel and user-space, and which
> does not rely on read seqlock for kernel-level synchronization" wish,
> I'd be interested in hearing ideas if anyone has them.
So far, the best I could come up with is this: use an RCU (or RCU-like)
scheme to protect in-kernel timestamp reads (possibly RCU sched, which
is based on disabling preemption), and use a sequence lock to protect
reads from user-space.
Time updates within the kernel would have to deal with both RCU pointer
update and track quiescent state, and would need to hold a write seqlock
to synchronize against concurrent user-space reads.
> After getting the recent lock-hold reduction work merged in 3.10, I had
> some thoughts that maybe we could do some sort of rcu style timekeeper
> switch. The down side is that there really is a time bound in which the
> timekeeper state is valid for, so there would have to be some sort of
> seqcount style "retry if we didn't finish the calculation within the
> valid bound" (which can run into similar deadlock problems if the
> updater is delayed by a reader spinning waiting for an update).
What could make a reader fail to finish the calculation within the valid
time bound, besides preemption? If it is caused by a too-long interrupt,
this will affect timekeeping as a whole, because the timer interrupt
will likely be delayed, and therefore the periodic update changing the
write seqlock value will be delayed too. So for the interrupt case, it
looks like a "too long" interrupt (or interrupt-disable section) will
already disrupt timekeeping with the current design.
>
> Also there is the memory issue of having N timekeeper structures hanging
> around, since there could be many readers delayed mid-calculation, but
> that could probably be bound by falling back to a seqcount (and again,
> that opens up deadlock possibilities). Anyway, it all gets pretty
> complicated pretty quickly, which makes ensuring correctness even harder. :(
>
> But yea, I'd be interested in other ideas and approaches.
If we can afford a synchronize_rcu_sched() wherever the write seqlock is
needed, we could go with the following. Please note that I use
synchronize_rcu_sched() rather than call_rcu_sched() here because I try
to avoid having too many timekeeper structures hanging around, and I
think it can generally be a good thing to ensure the timekeeping core
does not depend on the memory allocator (but I could very well be wrong).
In kernel/time/timekeeping.c:
static DEFINE_MUTEX(timekeeper_mutex);
static seqcount_t timekeeper_user_seq;
struct timekeeper_rcu {
struct timekeeper a[2];
struct timekeeper *p; /* current */
};
/* Timekeeper structure for kernel readers */
static struct timekeeper_rcu timekeeper_rcu;
/* Timekeeper structure for userspace readers */
static struct timekeeper timekeeper_user;
/* for updates */
void update_time(void)
{
struct timekeeper *next_p;
mutex_lock(&timekeeper_mutex);
/* RCU update, for kernel readers */
if (timekeeper_rcu.p == &timekeeper_rcu.a[0])
next_p = &timekeeper_rcu.a[1];
else
next_p = &timekeeper_rcu.a[0];
timekeeper_copy(next_p, timekeeper_rcu.p);
timekeeper_do_update(next_p, ...);
rcu_assign_pointer(timekeeper_rcu.p, next_p);
/*
* Ensure there are no more readers in flight accessing the old
* element.
*/
synchronize_rcu_sched();
/* seqlock update, for user-space readers */
write_seqcount_begin(&timekeeper_user_seq);
timekeeper_do_update_user(&timekeeper_user);
write_seqcount_end(&timekeeper_user_seq);
mutex_unlock(&timekeeper_mutex);
}
/*
* Accessing time from kernel. Should be called with preemption
* disabled.
*/
u64 __read_time_kernel()
{
struct timekeeper *p;
p = rcu_dereference(timekeeper_rcu.p);
return get_value_from(p);
}
/* Accessing time from user-space (vDSO) */
u64 read_time_user()
{
/*
* Read timekeeper_user within a read_seqcount loop.
*/
}
As far as I see, the write seqcount is currently taken by:
- do_settimeofday()
- timekeeping_inject_offset()
- timekeeping_set_tai_offset()
- change_clocksource()
- timekeeping_init()
- timekeeping_inject_sleeptime()
- timekeeping_resume()
- timekeeping_suspend()
- update_wall_time()
- do_adjtimex()
- hardpps()
It might be good to break down which of these functions could afford a
synchronize_rcu_sched(), and which cannot. Maybe part of the solution
could be to use different data structures, and separate synchronization,
for the adjtime part (executed in thread context) vs the
update_wall_time part, which AFAIK is executed in interrupt context. We
might also want to consider that the adjtime part needs to be global,
whereas update_wall_time could possibly be done on per-cpu data
structures, which could simplify synchronization.
Thoughts?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Re: [RFC PATCH] timekeeping: introduce timekeeping_is_busy()
2013-09-11 18:54 ` Mathieu Desnoyers
@ 2013-09-11 20:36 ` Paul E. McKenney
2013-09-12 0:48 ` Mathieu Desnoyers
0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2013-09-11 20:36 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: John Stultz, Thomas Gleixner, Richard Cochran, Prarit Bhargava,
Greg Kroah-Hartman, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
linux-kernel, lttng-dev
On Wed, Sep 11, 2013 at 02:54:41PM -0400, Mathieu Desnoyers wrote:
> * John Stultz (john.stultz@linaro.org) wrote:
> > On 09/11/2013 08:08 AM, Mathieu Desnoyers wrote:
> [...]
>
> Now focusing on features (the fix discussion is in a separate
> sub-thread):
>
> >
> > > LTTng uses ktime to have the same time-base across kernel and
> > > user-space, so traces gathered from LTTng-modules and LTTng-UST can be
> > > correlated. We plan on using ktime until a fast, scalable, and
> > > fine-grained time-source for tracing that can be used across kernel and
> > > user-space, and which does not rely on read seqlock for kernel-level
> > > synchronization, makes its way into the kernel.
> >
> > So my fix for the issue aside, I could see cases where using timekeeping
> > for tracing could run into similar issues, so something like your
> > timekeeping_is_busy() check sounds reasonable.
>
> Yep, it would certainly make those use of ktime_get() more robust
> against internal changes.
>
> > I might suggest we wrap
> > the timekeeper locking in a helper function so we don't have the
> > spinlock(); set_owner(); write_seqcount(); pattern all over the place
> > (and it avoids me forgetting to set the owner in some future change,
> > further mucking things up :).
>
> Good idea.
>
> > As for your waiting for "fast, scalable, and fine-grained time-source
> > for tracing that can be used across kernel and user-space, and which
> > does not rely on read seqlock for kernel-level synchronization" wish,
> > I'd be interested in hearing ideas if anyone has them.
>
> So far, the best I could come up with is this: using a RCU (or RCU-like)
> scheme to protect in-kernel timestamp reads (possibly with RCU sched,
> which is based on preemption disabling), and use a sequence lock to
> protect reads from user-space.
>
> Time updates within the kernel would have to deal with both RCU pointer
> update and track quiescent state, and would need to hold a write seqlock
> to synchronize against concurrent user-space reads.
>
> > After getting the recent lock-hold reduction work merged in 3.10, I had
> > some thoughts that maybe we could do some sort of rcu style timekeeper
> > switch. The down side is that there really is a time bound in which the
> > timekeeper state is valid for, so there would have to be some sort of
> > seqcount style "retry if we didn't finish the calculation within the
> > valid bound" (which can run into similar deadlock problems if the
> > updater is delayed by a reader spinning waiting for an update).
>
> What could make a reader fail to finish the calculation within the valid
> time bound ? Besides preemption ? If it's caused by a too long
> interrupt, this will have an effect on the entire timekeeping, because
> the timer interrupt will likely be delayed, and therefore the periodical
> update changing the write seqlock value will be delayed too. So for the
> interrupt case, it looks like a "too long" interrupt (or interrupt
> disable section) will already disrupt timekeeping with the current
> design.
>
> >
> > Also there is the memory issue of having N timekeeper structures hanging
> > around, since there could be many readers delayed mid-calculation, but
> > that could probably be bound by falling back to a seqcount (and again,
> > that opens up deadlock possibilities). Anyway, it all gets pretty
> > complicated pretty quickly, which makes ensuring correctness even harder. :(
> >
> > But yea, I'd be interested in other ideas and approaches.
>
> If we can afford a synchronize_rcu_sched() wherever the write seqlock is
> needed, we could go with the following. Please note that I use
> synchronize_rcu_sched() rather than call_rcu_sched() here because I try
> to avoid having too many timekeeper structures hanging around, and I
> think it can be generally a good think to ensure timekeeping core does
> not depend on the memory allocator (but I could very well be wrong).
The issue called out the last time I remember this being put forward was
that grace periods can be delayed for longer than the acceptable gap
between timekeeping updates. But maybe something has changed since
then -- that was a few years ago.
Thanx, Paul
> In kernel/time/timekeeper.c:
>
> static DEFINE_MUTEX(timekeeper_mutex);
> static seqcount_t timekeeper_user_seq;
>
> struct timekeeper_rcu {
> struct a[2];
> struct timekeeper *p; /* current */
> };
>
> /* Timekeeper structure for kernel readers */
> static struct timekeeper_rcu timekeeper_rcu;
>
> /* Timekeeper structure for userspace readers */
> static struct timekeeper timekeeper_user;
>
> /* for updates */
> update_time()
> {
> struct timekeeper *next_p;
>
> mutex_lock(&timekeeper_mutex);
>
> /* RCU update, for kernel readers */
> if (timekeeper_rcu.p == &timekeeper_rcu.a[0])
> next_p = &timekeeper_rcu.a[1];
> else
> next_p = &timekeeper_rcu.a[0];
>
> timekeeper_copy(next_p, timekeeper_rcu.p);
> timekeeper_do_update(next_p, ...);
>
> rcu_assign_pointer(timekeeper_rcu.p, next_p);
>
> /*
> * Ensure there are no more readers in flight accessing the old
> * element.
> */
> synchronize_rcu_sched();
>
> /* seqlock update, for user-space readers */
> write_seqcount_begin(&timekeeper_user_seq);
> timekeeper_do_update_user(&timekeeper_user);
> write_seqcount_end(&timekeeper_user_seq);
>
> mutex_unlock(&timekeeper_mutex);
> }
>
>
> /*
> * Accessing time from kernel. Should be called with preemption
> * disabled.
> */
> u64 __read_time_kernel()
> {
> struct timekeeper *p;
>
> p = rcu_dereference(timekeeper_rcu.p);
> return get_value_from(p);
> }
>
> /* Accessing time from user-space (vDSO) */
>
> u64 read_time_user()
> {
> /*
> * Read timekeeper_user within a read_seqcount loop.
> */
> }
>
>
> As far as I see, the write seqcount is currently taken by:
>
> - do_settimeofday()
> - timekeeping_inject_offset()
> - timekeeping_set_tai_offset()
> - change_clocksource()
> - timekeeping_init()
> - timekeeping_inject_sleeptime()
> - timekeeping_resume()
> - timekeeping_suspend()
> - update_wall_time()
> - do_adjtimex()
> - hardpps()
>
> It might be good to breakdown which of these functions could afford a
> synchronize_rcu_sched(), and which can't. Maybe part of the solution
> could be to use different data structures, and separate synchronization,
> for the ajdtime part (executed in thread context) vs for
> update_wall_time, AFAIK executed in interrupt context. We might also
> want to consider that the adjtime part needs to be global, vs
> update_wall_time which could possibly be done on per-cpu data
> structures, which could simplify synchronization.
>
> Thoughts ?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
>
* Re: [RFC PATCH] timekeeping: introduce timekeeping_is_busy()
2013-09-11 20:36 ` Paul E. McKenney
@ 2013-09-12 0:48 ` Mathieu Desnoyers
2013-09-12 1:25 ` Peter Zijlstra
0 siblings, 1 reply; 13+ messages in thread
From: Mathieu Desnoyers @ 2013-09-12 0:48 UTC (permalink / raw)
To: Paul E. McKenney
Cc: John Stultz, Thomas Gleixner, Richard Cochran, Prarit Bhargava,
Greg Kroah-Hartman, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
linux-kernel, lttng-dev
* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Wed, Sep 11, 2013 at 02:54:41PM -0400, Mathieu Desnoyers wrote:
[...]
> > If we can afford a synchronize_rcu_sched() wherever the write seqlock is
> > needed, we could go with the following. Please note that I use
> > synchronize_rcu_sched() rather than call_rcu_sched() here because I try
> > to avoid having too many timekeeper structures hanging around, and I
> > think it can be generally a good think to ensure timekeeping core does
> > not depend on the memory allocator (but I could very well be wrong).
>
> The issue called out with this the last time I remember it being put
> forward was that grace periods can be delayed for longer than is an
> acceptable gap between timekeeping updates. But maybe something has
> changed since then -- that was a few years ago.
This assumes we have lockdep support for seqlocks, and that we ensure
there is no deadlock between the timekeeper seqlock and other locks.
Also, given that our intent is more or less to allow execution contexts
nested over the write seqlock (e.g. NMI handlers) to read time, I think
we could do the following:
updates:
- spinlock irq save
- take a shadow copy of the old timekeeper structure.
- store the CPU number that owns the lock into a variable.
- barrier() /* store cpu nr before seqlock from nested NMI POV */
- write seqlock begin
- perform timekeeper structure update
- write seqlock end
- barrier() /* store seqlock before cpu nr from nested NMI POV */
- store -1 into lock owner variable
- spin unlock irqrestore
kernel read-side:
- try the read seqlock loop for the fast path
- if the seqlock read fails, check whether we are the CPU owning the
lock (the variable stored by the updater). If so, fall back to the
shadow copy; otherwise, retry the seqlock.
This approach would _only_ work if there is no deadlock involving a lock
nested within the write seqlock. Hence the importance of adding lockdep
support if we choose this locking design.
Thoughts?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Re: [RFC PATCH] timekeeping: introduce timekeeping_is_busy()
2013-09-12 0:48 ` Mathieu Desnoyers
@ 2013-09-12 1:25 ` Peter Zijlstra
2013-09-12 3:22 ` Mathieu Desnoyers
0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2013-09-12 1:25 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Paul E. McKenney, John Stultz, Thomas Gleixner, Richard Cochran,
Prarit Bhargava, Greg Kroah-Hartman, Steven Rostedt, Ingo Molnar,
linux-kernel, lttng-dev
On Wed, Sep 11, 2013 at 08:48:11PM -0400, Mathieu Desnoyers wrote:
> Thoughts ?
struct foo {
...
};
spinlock_t foo_lock;
unsigned int foo_head = 0;
unsigned int foo_tail = 0;
struct foo foo_array[2];
void foo_assign(struct foo f)
{
spin_lock(&foo_lock);
foo_head++;
smp_wmb();
foo_array[foo_head & 1] = f;
smp_wmb();
foo_tail++;
spin_unlock(&foo_lock);
}
struct foo foo_get(void)
{
unsigned int tail, head;
struct foo ret;
again:
tail = ACCESS_ONCE(foo_tail);
smp_rmb();
ret = foo_array[tail & 1];
smp_rmb();
head = ACCESS_ONCE(foo_head);
if (head - tail >= 2)
goto again;
return ret;
}
Should work and get you the most recent 'complete' foo even when
foo_get() is called nested inside foo_assign().
* Re: [RFC PATCH] timekeeping: introduce timekeeping_is_busy()
2013-09-12 1:25 ` Peter Zijlstra
@ 2013-09-12 3:22 ` Mathieu Desnoyers
2013-09-12 12:09 ` Peter Zijlstra
0 siblings, 1 reply; 13+ messages in thread
From: Mathieu Desnoyers @ 2013-09-12 3:22 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul E. McKenney, John Stultz, Thomas Gleixner, Richard Cochran,
Prarit Bhargava, Greg Kroah-Hartman, Steven Rostedt, Ingo Molnar,
linux-kernel, lttng-dev
* Peter Zijlstra (peterz@infradead.org) wrote:
> On Wed, Sep 11, 2013 at 08:48:11PM -0400, Mathieu Desnoyers wrote:
> > Thoughts ?
>
>
> struct foo {
> ...
> };
>
> spinlock_t foo_lock;
> unsigned int foo_head = 0;
> unsigned int foo_tail = 0;
> struct foo foo_array[2];
>
> void foo_assign(struct foo f)
> {
> spin_lock(&foo_lock);
> foo_head++;
> smp_wmb();
> foo_array[foo_head & 1] = f;
> smp_wmb();
> foo_tail++;
> spin_unlock(&foo_lock);
> }
>
> struct foo foo_get(void)
> {
> unsigned int tail, head;
> struct foo ret;
>
> again:
> tail = ACCESS_ONCE(foo_tail);
> smp_rmb();
> ret = foo_array[tail & 1];
> smp_rmb();
> head = ACCESS_ONCE(foo_head);
> if (head - tail >= 2)
> goto again;
>
> return ret;
> }
>
> Should work and get you the most recent 'complete' foo even when
> foo_get() is called nested inside foo_assign().
Cool!
Your design looks good to me. It reminds me of a latch. My only fear is
that struct timekeeper is probably too large to be copied every time on
the read path. Here is a slightly reworked version that would allow
in-place read of "foo" without copy.
struct foo {
...
};
struct latchfoo {
unsigned int head, tail;
spinlock_t write_lock;
struct foo data[2];
};
static
void foo_update(struct latchfoo *lf, void cb(struct foo *foo, void *ctx), void *ctx)
{
spin_lock(&lf->write_lock);
lf->head++;
smp_wmb();
lf->data[lf->head & 1] = lf->data[lf->tail & 1];
cb(&lf->data[lf->head & 1], ctx);
smp_wmb();
lf->tail++;
spin_unlock(&lf->write_lock);
}
static
unsigned int foo_read_begin(struct latchfoo *lf)
{
unsigned int ret;
ret = ACCESS_ONCE(lf->tail);
smp_rmb();
return ret;
}
static
struct foo *foo_read_get(struct latchfoo *lf, unsigned int tail)
{
return &lf->data[tail & 1];
}
static
int foo_read_retry(struct latchfoo *lf, unsigned int tail)
{
smp_rmb();
return (ACCESS_ONCE(lf->head) - tail >= 2);
}
Comments are welcome,
Thanks!
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Re: [RFC PATCH] timekeeping: introduce timekeeping_is_busy()
2013-09-12 3:22 ` Mathieu Desnoyers
@ 2013-09-12 12:09 ` Peter Zijlstra
2013-09-12 13:48 ` Mathieu Desnoyers
0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2013-09-12 12:09 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Paul E. McKenney, John Stultz, Thomas Gleixner, Richard Cochran,
Prarit Bhargava, Greg Kroah-Hartman, Steven Rostedt, Ingo Molnar,
linux-kernel, lttng-dev
On Wed, Sep 11, 2013 at 11:22:52PM -0400, Mathieu Desnoyers wrote:
> Cool!
>
> Your design looks good to me. It reminds me of a latch. My only fear is
> that struct timekeeper is probably too large to be copied every time on
> the read path. Here is a slightly reworked version that would allow
> in-place read of "foo" without copy.
>
> struct foo {
> ...
> };
>
> struct latchfoo {
> unsigned int head, tail;
> spinlock_t write_lock;
> struct foo data[2];
> };
>
> static
> void foo_update(struct latchfoo *lf, void cb(struct foo *foo, void *ctx), void *ctx)
> {
> spin_lock(&lf->write_lock);
> lf->head++;
> smp_wmb();
> lf->data[lf->head & 1] = lf->data[lf->tail & 1];
> cb(&lf->data[lf->head & 1], ctx);
You do that initial copy such that the cb gets the previous state to
work from and doesn't have to do a fetch/complete rewrite?
The alternative is to give the cb function both pointers, old and new
and have it do its thing.
Yet another option is to split the update side into helper functions
just like you did below for the read side.
> smp_wmb();
> lf->tail++;
> spin_unlock(&lf->write_lock);
> }
>
> static
> unsigned int foo_read_begin(struct latchfoo *lf)
> {
> unsigned int ret;
>
> ret = ACCESS_ONCE(lf->tail);
> smp_rmb();
> return ret;
> }
>
> static
> struct foo *foo_read_get(struct latchfoo *lf, unsigned int tail)
> {
> return &lf->data[tail & 1];
> }
>
> static
> int foo_read_retry(struct latchfoo *lf, unsigned int tail)
> {
> smp_rmb();
> return (ACCESS_ONCE(lf->head) - tail >= 2);
> }
>
> Comments are welcome,
Yeah this would work. The foo_read_begin() and foo_read_get() split is a
bit awkward but C doesn't really encourage us to do any better.
* Re: [RFC PATCH] timekeeping: introduce timekeeping_is_busy()
2013-09-12 12:09 ` Peter Zijlstra
@ 2013-09-12 13:48 ` Mathieu Desnoyers
2013-09-12 14:46 ` Peter Zijlstra
0 siblings, 1 reply; 13+ messages in thread
From: Mathieu Desnoyers @ 2013-09-12 13:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul E. McKenney, John Stultz, Thomas Gleixner, Richard Cochran,
Prarit Bhargava, Greg Kroah-Hartman, Steven Rostedt, Ingo Molnar,
linux-kernel, lttng-dev
* Peter Zijlstra (peterz@infradead.org) wrote:
> On Wed, Sep 11, 2013 at 11:22:52PM -0400, Mathieu Desnoyers wrote:
> > Cool!
> >
> > Your design looks good to me. It reminds me of a latch. My only fear is
> > that struct timekeeper is probably too large to be copied every time on
> > the read path. Here is a slightly reworked version that would allow
> > in-place read of "foo" without copy.
> >
> > struct foo {
> > ...
> > };
> >
> > struct latchfoo {
> > unsigned int head, tail;
> > spinlock_t write_lock;
> > struct foo data[2];
> > };
> >
> > static
> > void foo_update(struct latchfoo *lf, void cb(struct foo *foo, void *ctx), void *ctx)
> > {
> > spin_lock(&lf->write_lock);
> > lf->head++;
> > smp_wmb();
> > lf->data[lf->head & 1] = lf->data[lf->tail & 1];
> > cb(&lf->data[lf->head & 1], ctx);
>
> You do that initial copy such that the cb gets the previous state to
> work from and doesn't have to do a fetch/complete rewrite?
Yep, my original intent was to simplify life for callers.
>
> The alternative is to give the cb function both pointers, old and new
> and have it do its thing.
Good point. The caller doesn't necessarily need to copy the old entry
into the new one: it may very well want to overwrite all the fields.
>
> Yet another option is to split the update side into helper functions
> just like you did below for the read side.
OK. Updated code below.
>
> > smp_wmb();
> > lf->tail++;
> > spin_unlock(&lf->write_lock);
> > }
> >
> > static
> > unsigned int foo_read_begin(struct latchfoo *lf)
> > {
> > unsigned int ret;
> >
> > ret = ACCESS_ONCE(lf->tail);
> > smp_rmb();
> > return ret;
> > }
> >
> > static
> > struct foo *foo_read_get(struct latchfoo *lf, unsigned int tail)
> > {
> > return &lf->data[tail & 1];
> > }
> >
> > static
> > int foo_read_retry(struct latchfoo *lf, unsigned int tail)
> > {
> > smp_rmb();
> > return (ACCESS_ONCE(lf->head) - tail >= 2);
> > }
> >
> > Comments are welcome,
>
> Yeah this would work. The foo_read_begin() and foo_read_get() split is a
> bit awkward but C doesn't really encourage us to do any better.
We might be able to do better:
struct foo {
...
};
spinlock_t foo_lock;
struct latchfoo {
unsigned int head, tail;
struct foo data[2];
};
/**
* foo_write_begin - begin foo update.
*
" @lf: struct latchfoo to update.
* @prev: pointer to previous element (output parameter).
* @next: pointer to next element (output parameter).
*
* The area pointed to by "next" should be considered uninitialized.
* The caller needs to have exclusive update access to struct latchfoo.
*/
static
void foo_write_begin(struct latchfoo *lf, const struct foo **prev,
struct foo **next)
{
lf->head++;
smp_wmb();
*prev = &lf->data[lf->tail & 1];
*next = &lf->data[lf->head & 1];
}
/**
* foo_write_end - end foo update.
*
" @lf: struct latchfoo.
*
* The caller needs to have exclusive update access to struct latchfoo.
*/
static
void foo_write_end(struct latchfoo *lf)
{
smp_wmb();
lf->tail++;
}
/**
* foo_read_begin - begin foo read.
*
" @lf: struct latchfoo to read.
* @tail: pointer to unsigned int containing tail position (output).
*/
static
struct foo *foo_read_begin(struct latchfoo *lf, unsigned int *tail)
{
unsigned int ret;
ret = ACCESS_ONCE(lf->tail);
smp_rmb();
*tail = ret;
return &lf->data[ret & 1];
}
/**
* foo_read_retry - end foo read, trigger retry if needed.
*
" @lf: struct latchfoo read.
* @tail: tail position returned as output by foo_read_begin().
*
* If foo_read_retry() returns nonzero, the content of the read should
* be considered invalid, and the read should be performed again to
* reattempt reading coherent data, starting with foo_read_begin().
*/
static
int foo_read_retry(struct latchfoo *lf, unsigned int tail)
{
smp_rmb();
return (ACCESS_ONCE(lf->head) - tail >= 2);
}
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Re: [RFC PATCH] timekeeping: introduce timekeeping_is_busy()
2013-09-12 13:48 ` Mathieu Desnoyers
@ 2013-09-12 14:46 ` Peter Zijlstra
0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2013-09-12 14:46 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Paul E. McKenney, John Stultz, Thomas Gleixner, Richard Cochran,
Prarit Bhargava, Greg Kroah-Hartman, Steven Rostedt, Ingo Molnar,
linux-kernel, lttng-dev
On Thu, Sep 12, 2013 at 09:48:16AM -0400, Mathieu Desnoyers wrote:
> struct foo {
> ...
> };
>
> spinlock_t foo_lock;
>
> struct latchfoo {
> unsigned int head, tail;
> struct foo data[2];
> };
>
> /**
> * foo_write_begin - begin foo update.
> *
> " @lf: struct latchfoo to update.
> * @prev: pointer to previous element (output parameter).
> * @next: pointer to next element (output parameter).
> *
> * The area pointed to by "next" should be considered uninitialized.
> * The caller needs to have exclusive update access to struct latchfoo.
> */
> static
> void foo_write_begin(struct latchfoo *lf, const struct foo **prev,
> struct foo **next)
> {
> lf->head++;
> smp_wmb();
> *prev = &lf->data[lf->tail & 1];
> *next = &lf->data[lf->head & 1];
> }
>
> /**
> * foo_write_end - end foo update.
> *
> " @lf: struct latchfoo.
> *
> * The caller needs to have exclusive update access to struct latchfoo.
> */
> static
> void foo_write_end(struct latchfoo *lf)
> {
> smp_wmb();
> lf->tail++;
> }
>
> /**
> * foo_read_begin - begin foo read.
> *
> " @lf: struct latchfoo to read.
> * @tail: pointer to unsigned int containing tail position (output).
> */
> static
> struct foo *foo_read_begin(struct latchfoo *lf, unsigned int *tail)
> {
> unsigned int ret;
>
> ret = ACCESS_ONCE(lf->tail);
> smp_rmb();
> *tail = ret;
> return &lf->data[ret & 1];
> }
>
> /**
> * foo_read_retry - end foo read, trigger retry if needed.
> *
> " @lf: struct latchfoo read.
> * @tail: tail position returned as output by foo_read_begin().
> *
> * If foo_read_retry() returns nonzero, the content of the read should
> * be considered invalid, and the read should be performed again to
> * reattempt reading coherent data, starting with foo_read_begin().
> */
> static
> int foo_read_retry(struct latchfoo *lf, unsigned int tail)
> {
> smp_rmb();
> return (ACCESS_ONCE(lf->head) - tail >= 2);
> }
Yep, that's good. I suppose if there are multiple use sites we can jump
through another few hoops to get rid of the specific struct foo
assumptions by storing the sizeof() of whatever we do use and playing
pointer math games.
But for now, with the time stuff as the only user, this looks ok.
Thread overview: 13+ messages
2013-09-11 15:08 [RFC PATCH] timekeeping: introduce timekeeping_is_busy() Mathieu Desnoyers
2013-09-11 16:40 ` John Stultz
2013-09-11 17:07 ` Steven Rostedt
2013-09-11 17:49 ` Mathieu Desnoyers
2013-09-11 17:53 ` John Stultz
2013-09-11 18:54 ` Mathieu Desnoyers
2013-09-11 20:36 ` Paul E. McKenney
2013-09-12 0:48 ` Mathieu Desnoyers
2013-09-12 1:25 ` Peter Zijlstra
2013-09-12 3:22 ` Mathieu Desnoyers
2013-09-12 12:09 ` Peter Zijlstra
2013-09-12 13:48 ` Mathieu Desnoyers
2013-09-12 14:46 ` Peter Zijlstra