Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies

From: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
To: Jan Kiszka <jan.kiszka@siemens.com>,
	Jeroen Van den Keybus <jeroen.vandenkeybus@gmail.com>
Cc: "xenomai@xenomai.org" <xenomai@xenomai.org>
Subject: Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
Date: Thu, 18 Sep 2014 18:28:05 +0200
Message-ID: <541B0815.5010906@xenomai.org>
In-Reply-To: <541B04E6.5010808@siemens.com>

On 09/18/2014 06:14 PM, Jan Kiszka wrote:
> On 2014-09-18 15:44, Gilles Chanteperdrix wrote:
>> On 09/18/2014 03:26 PM, Jan Kiszka wrote:
>>> On 2014-09-18 15:05, Gilles Chanteperdrix wrote:
>>>> On 09/18/2014 02:20 PM, Jan Kiszka wrote:
>>>>> On 2014-09-18 14:17, Gilles Chanteperdrix wrote:
>>>>>> On 09/18/2014 01:59 PM, Jan Kiszka wrote:
>>>>>>> On 2014-09-18 13:46, Gilles Chanteperdrix wrote:
>>>>>>>> On 09/11/2014 07:19 AM, Jan Kiszka wrote:
>>>>>>>>> On 2014-09-11 07:11, Jan Kiszka wrote:
>>>>>>>>>> On 2014-09-09 23:03, Gilles Chanteperdrix wrote:
>>>>>>>>>>> On 04/25/2014 12:44 PM, Jeroen Van den Keybus wrote:
>>>>>>>>>>>> For testing, I've removed the locks from the vfile system.
>>>>>>>>>>>> Then the high latencies reliably disappear.
>>>>>>>>>>>>
>>>>>>>>>>>> To test, I made two xeno_nucleus modules: one with the
>>>>>>>>>>>> xnlock_get/put_ in place and one with dummies. Subsequently,
>>>>>>>>>>>> I use a program that simply opens and reads the stat file
>>>>>>>>>>>> 1,000 times.
>>>>>>>>>>>>
>>>>>>>>>>>> With locks:
>>>>>>>>>>>>
>>>>>>>>>>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>>>>>>>>>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>>>>>>>>>>> RTD|     -2.575|     -2.309|      9.286|       0|     0|     -2.575|      9.286
>>>>>>>>>>>> RTD|     -2.364|     -2.276|      1.600|       0|     0|     -2.575|      9.286
>>>>>>>>>>>> RTD|     -2.482|     -2.274|      2.165|       0|     0|     -2.575|      9.286
>>>>>>>>>>>> RTD|     -2.368|    135.261|   1478.154|   13008|     0|     -2.575|   1478.154
>>>>>>>>>>>> RTD|     -2.368|     -2.272|      2.602|   13008|     0|     -2.575|   1478.154
>>>>>>>>>>>> RTD|     -2.499|     -2.272|      6.933|   13008|     0|     -2.575|   1478.154
>>>>>>>>>>>>
>>>>>>>>>>>> Without locks:
>>>>>>>>>>>>
>>>>>>>>>>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>>>>>>>>>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>>>>>>>>>>> RTD|     -2.503|     -2.270|      3.310|       0|     0|     -2.503|      3.310
>>>>>>>>>>>> RTD|     -2.418|     -2.284|     -1.646|       0|     0|     -2.503|      3.310
>>>>>>>>>>>> RTD|     -2.496|     -2.275|      4.630|       0|     0|     -2.503|      4.630
>>>>>>>>>>>> RTD|     -2.374|     -2.285|     -1.458|       0|     0|     -2.503|      4.630
>>>>>>>>>>>> RTD|     -2.452|     -2.273|      3.559|       0|     0|     -2.503|      4.630
>>>>>>>>>>>> RTD|     -2.370|     -2.285|     -1.518|       0|     0|     -2.503|      4.630
>>>>>>>>>>>> RTD|     -2.458|     -2.274|      4.203|       0|     0|     -2.503|      4.630
>>>>>>>>>>>>
>>>>>>>>>>>> I'll now have a closer look into the vfile system but if the
>>>>>>>>>>>> locks are malfunctioning, I'm clueless.
>>>>>>>>>>>
>>>>>>>>>>> Answering with a "little" delay, could you try the following
>>>>>>>>>>> patch?
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/include/asm-generic/bits/pod.h b/include/asm-generic/bits/pod.h
>>>>>>>>>>> index a6be0dc..cfb0c71 100644
>>>>>>>>>>> --- a/include/asm-generic/bits/pod.h
>>>>>>>>>>> +++ b/include/asm-generic/bits/pod.h
>>>>>>>>>>> @@ -248,6 +248,7 @@ void __xnlock_spin(xnlock_t *lock /*, */ XNLOCK_DBG_CONTEXT_ARGS)
>>>>>>>>>>>  			cpu_relax();
>>>>>>>>>>>  			xnlock_dbg_spinning(lock, cpu, &spin_limit /*, */
>>>>>>>>>>>  					    XNLOCK_DBG_PASS_CONTEXT);
>>>>>>>>>>> +			xnarch_memory_barrier();
>>>>>>>>>>>  		} while(atomic_read(&lock->owner) != ~0);
>>>>>>>>>>>  }
>>>>>>>>>>>  EXPORT_SYMBOL_GPL(__xnlock_spin);
>>>>>>>>>>> diff --git a/include/asm-generic/system.h b/include/asm-generic/system.h
>>>>>>>>>>> index 25bd83f..7a8c4d0 100644
>>>>>>>>>>> --- a/include/asm-generic/system.h
>>>>>>>>>>> +++ b/include/asm-generic/system.h
>>>>>>>>>>> @@ -378,6 +378,8 @@ static inline void xnlock_put(xnlock_t *lock)
>>>>>>>>>>>  	xnarch_memory_barrier();
>>>>>>>>>>>
>>>>>>>>>>>  	atomic_set(&lock->owner, ~0);
>>>>>>>>>>> +
>>>>>>>>>>> +	xnarch_memory_barrier();
>>>>>>>>>>
>>>>>>>>>> That's pretty heavy-weighted now (it was already due to the first
>>>>>>>>>> memory barrier). Maybe it's better to look at some ticket lock
>>>>>>>>>> mechanism like Linux uses for fairness. At least on x86 (and
>>>>>>>>>> other strictly ordered archs), those require no memory barriers
>>>>>>>>>> on release.
>>>>>>>>>
>>>>>>>>> In fact, memory barriers aren't needed on strictly ordered archs
>>>>>>>>> already today, independent of the spinlock granting algorithm. So
>>>>>>>>> there are two optimization possibilities:
>>>>>>>>>
>>>>>>>>> - ticket-based granting
>>>>>>>>> - arch-specific (thus optimized) core
>>>>>>>>
>>>>>>>> Ok, no answer, so I will try to be more clear.
>>>>>>>>
>>>>>>>> I do not pretend to understand how memory barriers work at a low
>>>>>>>> level; this is a shame, I know, and I am sorry for that. My "high level"
>>>>>>>> view is that memory barriers on SMP systems act as synchronization
>>>>>>>> points, meaning that when a CPU issues a barrier, it will "see" the
>>>>>>>> state of the other CPUs at the time of their last barrier. This means
>>>>>>>> that for a CPU to see a store that occurred on another CPU, there must
>>>>>>>> have been two barriers: a barrier after the store on one cpu, and a
>>>>>>>> barrier after that before the read on the other cpu. This view of
>>>>>>>> things seems to be corroborated by the fact that the patch works, and
>>>>>>>> by the following sentence in Documentation/memory-barriers.txt:
>>>>>>>>
>>>>>>>>  (*) There is no guarantee that a CPU will see the correct order of
>>>>>>>> effects from a second CPU's accesses, even _if_ the second CPU uses a
>>>>>>>> memory barrier, unless the first CPU _also_ uses a matching memory
>>>>>>>> barrier (see the subsection on "SMP Barrier Pairing").
>>>>>>>
>>>>>>> [quick answer]
>>>>>>>
>>>>>>> ...or the architecture refrains from reordering write requests, like x86
>>>>>>> does. What may happen, though, is that the compiler reorders the writes.
>>>>>>> Therefore you need at least a (much cheaper) compiler barrier on those
>>>>>>> archs. See also linux/Documentation/memory-barriers.txt on this and more.
>>>>>>
>>>>>> quick answer: I do not believe an SMP architecture can enforce store
>>>>>> ordering across multiple cpus, with cpu-local caches and such. And the
>>>>>> fact that the patch I sent fixed the issue on x86 tends to prove me right.
>>>>>
>>>>> It's not wrong, it's just (costly, on larger machines) overkill as the
>>>>> other cores either see the lock release and all prior changes committed
>>>>> or the lock taken (and the prior changes do not matter then). They will
>>>>> never see later changes committed before the lock being visible as free.
>>>>
>>>> I agree. But this is true on all architectures, not just on strictly
>>>> ordered ones, this is just due to how barriers work on SMP systems, as I
>>>> explained.
>>>>
>>>>> That's architecturally guaranteed, and that's why you have no memory
>>>>> barriers in x86 spinlock release operations.
>>>>
>>>> I disagree, as explained in the paragraph just below the one you quote,
>>>> I believe this is an optimization, which is almost valid on any
>>>> architecture. Almost valid, because if the cpu which has done the unlock
>>>> does another lock without any time for a barrier in between to
>>>> synchronize cpus, we have a problem, because the other cpus will never
>>>> see the spinlock as free. With ticket spinlocks, you just add a store on
>>>> the cpu which spins, and you have to add a barrier after that, if you
>>>> want the barrier before the read on the cpu which will acquire the lock
>>>> to see that the spinlock is contended. So I do not see how this requires
>>>> fewer barriers.
>>>
>>> Ticket locks prevent unfair starvation without the closing barrier as
>>> they grant the next ticket to the next waiter, not the current holder.
>>> See the Linux implementation.
>>
>> Whether to put the closing barrier after the last store is orthogonal
>> to whether to implement ticket locks or not. This is all a question of
>> tradeoffs.
>>
>> Without the barrier after the last store, you increase the spinning time
>> due to time taken for the store to be visible on other cpus, but you
>> optimize the overhead of unlocking.
>>
>> With ticket spinlocks you avoid the starvation situation, at the expense
>> of increasing the overhead of spinlock operations.
>>
>> I do not know which is worse. I suspect all this does not make much of a
>> difference, and what dominates is the duration of spinlock sections anyway.
> 
> I think the way classic Linux spinlocks did this on x86 provides the answer.

The situation is completely different: Linux spinlocks are well split,
Xenomai basically has only one spinlock, so chances are that it will be
more contended, so the heavy unlock path (the one which implements the
ticket stuff) will be triggered more often. Also, the Xenomai spinlock (we
can lose the s anyway) being more contended, the "pending store
barrier" optimization has in fact chances of being detrimental. And
finally, due to the way spinlocks are split, Linux has scalability
issues that Xenomai cannot even begin to imagine tackling.
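
For readers who have not looked at the ticket scheme mentioned in this
thread, here is a minimal sketch in C11 atomics. It is not the Linux or
Xenomai implementation: the struct and function names are hypothetical,
and it omits the contended slow path, interrupt masking and debugging
hooks a real lock would need. It only shows the granting idea, where
each waiter takes a ticket and the release hands the lock to the next
ticket rather than to whoever retries first.

```c
#include <stdatomic.h>

/* Hypothetical ticket lock for illustration; not the Xenomai xnlock_t. */
struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently being served */
};

static inline void ticket_lock_init(struct ticket_lock *l)
{
	atomic_init(&l->next, 0);
	atomic_init(&l->owner, 0);
}

static inline void ticket_lock_acquire(struct ticket_lock *l)
{
	/* Take a ticket; ordering comes from the acquire load below. */
	unsigned int me = atomic_fetch_add_explicit(&l->next, 1,
						    memory_order_relaxed);

	/* Spin until our ticket is served; pairs with the release store. */
	while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
		;
}

static inline void ticket_lock_release(struct ticket_lock *l)
{
	/* Only the holder writes owner, so load + release store is enough. */
	unsigned int cur = atomic_load_explicit(&l->owner,
						memory_order_relaxed);

	atomic_store_explicit(&l->owner, cur + 1, memory_order_release);
}
```

A holder that releases and immediately tries to re-acquire draws a new,
later ticket, so a CPU that is already spinning gets served first.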

Anyway, the discussion is kind of moot, because as I said, we are not
going to change the spinlock implementation in 2.6. What we are
discussing here is whether to put the barrier after the atomic_set, or
whether to put that barrier where it is really needed: in the snapshot
code, and what to do for forge. I also agree that the barrier before the
atomic_set in xnlock_put is not needed on x86 and proposed an
architecture macro to replace it with a compiler barrier in that case.
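
A rough sketch of what such an architecture macro could look like,
written with C11 atomics instead of the Xenomai primitives. All names
here (lock_release_barrier, struct demo_lock, demo_lock_put) are
hypothetical and not taken from the Xenomai tree. The x86 branch relies
on the hardware ordering of stores, which the C11 memory model does not
itself promise, so take it as an illustration of the idea only.

```c
#include <stdatomic.h>

/* Hypothetical names throughout; this is not the Xenomai source. */
#if defined(__x86_64__) || defined(__i386__)
/* x86 does not reorder a store with older loads or stores:
 * it is enough to stop the compiler from moving accesses. */
#define lock_release_barrier()	atomic_signal_fence(memory_order_seq_cst)
#else
/* Weakly ordered archs: emit a real hardware fence. */
#define lock_release_barrier()	atomic_thread_fence(memory_order_seq_cst)
#endif

struct demo_lock {
	atomic_int owner;	/* ~0 means free, otherwise the owner CPU id */
};

static inline void demo_lock_put(struct demo_lock *lock)
{
	/* Keep the critical section's accesses before the release store. */
	lock_release_barrier();
	atomic_store_explicit(&lock->owner, ~0, memory_order_relaxed);
}
```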

I also proposed to replace the atomic_set with a cmpxchg: cmpxchg has
two barriers on ARM, but I guess on x86 it is only one barrier, so this
would solve the architecture dependency nicely.
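
To make the cmpxchg idea concrete, here is a minimal sketch using the
C11 compare-and-exchange in place of the kernel cmpxchg. The names
struct cas_lock and cas_lock_put are hypothetical, and passing the
owner CPU id as a parameter is an assumption made for the example. The
default sequentially consistent ordering gives full ordering on every
architecture, and on x86 a locked cmpxchg is itself fully ordering.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration; not the Xenomai xnlock_t. */
struct cas_lock {
	atomic_int owner;	/* ~0 means free, otherwise the owner CPU id */
};

/*
 * Release the lock with a compare-and-exchange instead of a plain store.
 * Returns false (and leaves the lock untouched) if cpu is not the owner.
 */
static inline bool cas_lock_put(struct cas_lock *lock, int cpu)
{
	int expected = cpu;

	/* seq_cst on success: ordered against earlier and later accesses. */
	return atomic_compare_exchange_strong(&lock->owner, &expected, ~0);
}
```

As a side effect, the return value can catch an unlock attempted by a
CPU that does not hold the lock.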

-- 
                                                                Gilles.
