From mboxrd@z Thu Jan 1 00:00:00 1970 Message-ID: <541B3CEE.9050903@xenomai.org> Date: Thu, 18 Sep 2014 22:13:34 +0200 From: Gilles Chanteperdrix MIME-Version: 1.0 References: <535828F6.6050308@xenomai.org> <53583DF7.3080700@xenomai.org> <540F6B15.2070201@xenomai.org> <54112EFA.4080901@web.de> <541130D0.50409@web.de> <541AC62F.2050003@xenomai.org> <541AC933.9090600@siemens.com> <541ACD66.50902@xenomai.org> <541ACE05.1050305@siemens.com> <541AD8A2.30500@xenomai.org> <541ADD8A.7020804@siemens.com> <541AE1CD.7090008@xenomai.org> <541B04E6.5010808@siemens.com> <541B0815.5010906@xenomai.org> <541B2E02.7030009@siemens.com> <541B336A.5040604@xenomai.org> <541B38F1.5030503@siemens.com> In-Reply-To: <541B38F1.5030503@siemens.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Subject: Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies List-Id: Discussions about the Xenomai project List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Jan Kiszka , Jeroen Van den Keybus Cc: "xenomai@xenomai.org" On 09/18/2014 09:56 PM, Jan Kiszka wrote: > On 2014-09-18 21:32, Gilles Chanteperdrix wrote: >> On 09/18/2014 09:09 PM, Jan Kiszka wrote: >>> On 2014-09-18 18:28, Gilles Chanteperdrix wrote: >>>> On 09/18/2014 06:14 PM, Jan Kiszka wrote: >>>>> On 2014-09-18 15:44, Gilles Chanteperdrix wrote: >>>>>> On 09/18/2014 03:26 PM, Jan Kiszka wrote: >>>>>>> On 2014-09-18 15:05, Gilles Chanteperdrix wrote: >>>>>>>> On 09/18/2014 02:20 PM, Jan Kiszka wrote: >>>>>>>>> On 2014-09-18 14:17, Gilles Chanteperdrix wrote: >>>>>>>>>> On 09/18/2014 01:59 PM, Jan Kiszka wrote: >>>>>>>>>>> On 2014-09-18 13:46, Gilles Chanteperdrix wrote: >>>>>>>>>>>> On 09/11/2014 07:19 AM, Jan Kiszka wrote: >>>>>>>>>>>>> On 2014-09-11 07:11, Jan Kiszka wrote: >>>>>>>>>>>>>> On 2014-09-09 23:03, Gilles Chanteperdrix wrote: >>>>>>>>>>>>>>> On 04/25/2014 12:44 PM, Jeroen Van den Keybus wrote: >>>>>>>>>>>>>>>> For testing, I've removed the locks from the vfile system. >>>>>>>>>>>>>>>> Then the high latencies reliably disappear. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> To test, I made two xeno_nucleus modules: one with the >>>>>>>>>>>>>>>> xnlock_get/put_ in place and one with dummies. Subsequently, >>>>>>>>>>>>>>>> I use a program that simply opens and reads the stat file >>>>>>>>>>>>>>>> 1,000 times. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> With locks: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> RTT| 00:00:01 (periodic user-mode task, 100 us period, >>>>>>>>>>>>>>>> priority 99) RTH|----lat min|----lat avg|----lat >>>>>>>>>>>>>>>> max|-overrun|---msw|---lat best|--lat worst RTD| -2.575| >>>>>>>>>>>>>>>> -2.309| 9.286| 0| 0| -2.575| 9.286 >>>>>>>>>>>>>>>> RTD| -2.364| -2.276| 1.600| 0| 0| >>>>>>>>>>>>>>>> -2.575| 9.286 RTD| -2.482| -2.274| 2.165| >>>>>>>>>>>>>>>> 0| 0| -2.575| 9.286 RTD| -2.368| 135.261| >>>>>>>>>>>>>>>> 1478.154| 13008| 0| -2.575| 1478.154 RTD| >>>>>>>>>>>>>>>> -2.368| -2.272| 2.602| 13008| 0| -2.575| >>>>>>>>>>>>>>>> 1478.154 RTD| -2.499| -2.272| 6.933| 13008| >>>>>>>>>>>>>>>> 0| -2.575| 1478.154 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Without locks: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> RTT| 00:00:01 (periodic user-mode task, 100 us period, >>>>>>>>>>>>>>>> priority 99) RTH|----lat min|----lat avg|----lat >>>>>>>>>>>>>>>> max|-overrun|---msw|---lat best|--lat worst RTD| -2.503| >>>>>>>>>>>>>>>> -2.270| 3.310| 0| 0| -2.503| 3.310 >>>>>>>>>>>>>>>> RTD| -2.418| -2.284| -1.646| 0| 0| >>>>>>>>>>>>>>>> -2.503| 3.310 RTD| -2.496| -2.275| 4.630| >>>>>>>>>>>>>>>> 0| 0| -2.503| 4.630 RTD| -2.374| -2.285| >>>>>>>>>>>>>>>> -1.458| 0| 0| -2.503| 4.630 RTD| >>>>>>>>>>>>>>>> -2.452| -2.273| 3.559| 0| 0| -2.503| >>>>>>>>>>>>>>>> 4.630 RTD| -2.370| -2.285| -1.518| 0| >>>>>>>>>>>>>>>> 0| -2.503| 4.630 RTD| -2.458| -2.274| >>>>>>>>>>>>>>>> 4.203| 0| 0| -2.503| 4.630 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I'll now have a closer look into the vfile system but if the >>>>>>>>>>>>>>>> locks are malfunctioning, I'm clueless. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Answering with a "little" delay, could you try the following >>>>>>>>>>>>>>> patch? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> diff --git a/include/asm-generic/bits/pod.h >>>>>>>>>>>>>>> b/include/asm-generic/bits/pod.h index a6be0dc..cfb0c71 100644 >>>>>>>>>>>>>>> --- a/include/asm-generic/bits/pod.h +++ >>>>>>>>>>>>>>> b/include/asm-generic/bits/pod.h @@ -248,6 +248,7 @@ void >>>>>>>>>>>>>>> __xnlock_spin(xnlock_t *lock /*, */ XNLOCK_DBG_CONTEXT_ARGS) >>>>>>>>>>>>>>> cpu_relax(); xnlock_dbg_spinning(lock, cpu, &spin_limit /*, */ >>>>>>>>>>>>>>> XNLOCK_DBG_PASS_CONTEXT); + xnarch_memory_barrier(); } >>>>>>>>>>>>>>> while(atomic_read(&lock->owner) != ~0); } >>>>>>>>>>>>>>> EXPORT_SYMBOL_GPL(__xnlock_spin); diff --git >>>>>>>>>>>>>>> a/include/asm-generic/system.h b/include/asm-generic/system.h >>>>>>>>>>>>>>> index 25bd83f..7a8c4d0 100644 --- >>>>>>>>>>>>>>> a/include/asm-generic/system.h +++ >>>>>>>>>>>>>>> b/include/asm-generic/system.h @@ -378,6 +378,8 @@ static >>>>>>>>>>>>>>> inline void xnlock_put(xnlock_t *lock) >>>>>>>>>>>>>>> xnarch_memory_barrier(); >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> atomic_set(&lock->owner, ~0); + + xnarch_memory_barrier(); >>>>>>>>>>>>>> >>>>>>>>>>>>>> That's pretty heavy-weighted now (it was already due to the first >>>>>>>>>>>>>> memory barrier). Maybe it's better to look at some ticket lock >>>>>>>>>>>>>> mechanism like Linux uses for fairness. At least on x86 (and >>>>>>>>>>>>>> other strictly ordered archs), those require no memory barriers >>>>>>>>>>>>>> on release. >>>>>>>>>>>> >>>>>>>>>>>>> In fact, memory barriers aren't needed on strictly ordered archs >>>>>>>>>>>>> already today, independent of the spinlock granting algorithm. So >>>>>>>>>>>>> there are two optimization possibilities: >>>>>>>>>>>> >>>>>>>>>>>>> - ticket-based granting - arch-specific (thus optimized) core >>>>>>>>>>>> >>>>>>>>>>>> Ok, no answer, so I will try to be more clear. >>>>>>>>>>>> >>>>>>>>>>>> I do not pretend to understand how memory barriers work at a low >>>>>>>>>>>> level, this is a shame, I know, and am sorry for that. My "high level" >>>>>>>>>>>> view, is that memory barriers on SMP systems act as synchronization >>>>>>>>>>>> points, meaning that when a CPU issues a barrier, it will "see" the >>>>>>>>>>>> state of the other CPUs at the time of their last barrier. This means >>>>>>>>>>>> that for a CPU to see a store that occured on another CPU, there must >>>>>>>>>>>> have been two barriers: a barrier after the store on one cpu, and a >>>>>>>>>>>> barrier after that before the read on the other cpu. This view of >>>>>>>>>>>> things seems to be corroborated by the fact that the patch works, and >>>>>>>>>>>> by the following sentence in Documentation/memory-barriers.txt: >>>>>>>>>>>> >>>>>>>>>>>> (*) There is no guarantee that a CPU will see the correct order of >>>>>>>>>>>> effects from a second CPU's accesses, even _if_ the second CPU uses a >>>>>>>>>>>> memory barrier, unless the first CPU _also_ uses a matching memory >>>>>>>>>>>> barrier (see the subsection on "SMP Barrier Pairing"). >>>>>>>>>>> >>>>>>>>>>> [quick answer] >>>>>>>>>>> >>>>>>>>>>> ...or the architecture refrains from reordering write requests, like x86 >>>>>>>>>>> does. What may happen, though, is that the compiler reorders the writes. >>>>>>>>>>> Therefore you need at least a (must cheaper) compiler barrier on those >>>>>>>>>>> archs. See also linux/Documentation/memory-barriers.txt on this and more. >>>>>>>>>> >>>>>>>>>> quick answer: I do not believe an SMP architecture can enforce stores >>>>>>>>>> ordering accross multiple cpus, with cpus local caches and such. And the >>>>>>>>>> fact that the patch I sent fixed the issue on x86 tend to prove me right. >>>>>>>>> >>>>>>>>> It's not wrong, it's just (costly, on larger machines) overkill as the >>>>>>>>> other cores either see the lock release and all prior changes committed >>>>>>>>> or the lock taken (and the prior changes do not matter then). They will >>>>>>>>> never see later changes committed before the lock being visible as free. >>>>>>>> >>>>>>>> I agree. But this is true on all architectures, not just on strictly >>>>>>>> ordered ones, this is just due to how barriers work on SMP systems, as I >>>>>>>> explained. >>>>>>>> >>>>>>>>> That's architecturally guaranteed, and that's why you have no memory >>>>>>>>> barriers in x86 spinlock release operations. >>>>>>>> >>>>>>>> I disagree, as explained in the paragraph just below the one you quote, >>>>>>>> I believe this is an optimization, which is almost valid on any >>>>>>>> architecture. Almost valid, because if the cpu which has done the unlock >>>>>>>> does another lock without any time for a barrier in between to >>>>>>>> synchronize cpus, we have a problem, because the other cpus will never >>>>>>>> see the spinlock as free. With ticket spinlocks, you just add a store on >>>>>>>> the cpu which spins, and you have to add a barrier after that, if you >>>>>>>> want the barrier before the read on the cpu which will acquire the lock >>>>>>>> to see that the spinlock is contended. So I do not see how this requires >>>>>>>> less barriers. >>>>>>> >>>>>>> Ticket locks prevent unfair starvation without the closing barrier as >>>>>>> they grant the next ticket to the next waiter, not the current holder. >>>>>>> See the Linux implementation. >>>>>> >>>>>> Whether to put the closing barrier after the last store is orthogonal, >>>>>> to whether implementing ticket locks or not. This is all a question of >>>>>> tradeoffs. >>>>>> >>>>>> Without the barrier after the last store, you increase the spinning time >>>>>> due to time taken for the store to be visible on other cpus, but you >>>>>> optimize the overhead of unlocking. >>>>>> >>>>>> With ticket spinlocks you avoid the starvation situation, at the expense >>>>>> of increasing the overhread of spinlock operations. >>>>>> >>>>>> I do not know which is worse. I suspect all this does not make much of a >>>>>> difference, and what dominates is the duration of spinlock sections anyway. >>>>> >>>>> I think the way classic Linux spinlock did this on x86 provide the answer. >>>> >>>> The situation is completely different: linux spinlocks are well split, >>>> xenomai basically has one only spinlock, so chances are that it will be >>>> more contended, so the heavy unlock path (the one which implements the >>>> ticket stuff) will be triggered more often. Also, xenomai spinlock (we >>>> can loose the s anyway) being more contended, the "pending store >>>> barrier" optimization has in fact chances of being detrimental. And >>>> finally, due to the way spinlocks are split, Linux has scalability >>>> issues that Xenomai can not even begin to imagine tackling. >>>> >>>> Anyway, the discussion is kind of moot, because as I said, we are not >>>> going to change the spinlock implementation in 2.6. What we are >>>> discussing here is whether to put the barrier after the atomic_set, or >>>> whether to put that barrier where it is really needed: in the snapshot >>>> code, and what to do for forge. I also agree that the barrier before the >>>> atomic_set in xnlock_put is not needed on x86 and proposed an >>>> architecture macro to replace it with a compiler barrier in that case. >>> >>> Yes, seems reasonable. >>> >>>> >>>> I also proposed to replace the atomic_set with a cmpxchg, cmpxchg has >>>> two barriers on ARM, but I guess on x86 it is only one barrier, this >>>> would solve the architecture dependency nicely. >>> >>> That saves an abstraction but I have no clue if "mfence" is equally >>> expensive as "lock cmpxchg". If it is, that's fine, but I suspect it's >>> not (due to the cacheline "lock"). >> >> Unless I misunderstand something in Linux code, it also uses the "lock" >> prefix for unlocking ticket spinlocks. Either with a lock; add or with a >> lock; cmpxchg. > > I'm looking at arch/x86/asm/spinlock.h, and there is only a non-atomic > __add. Same in the disassembly. Indeed, the lock prefix is only used for some erratum on x86_32. -- Gilles.