From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <541BC21F.20604@siemens.com>
Date: Fri, 19 Sep 2014 07:41:51 +0200
From: Jan Kiszka
MIME-Version: 1.0
References: <5357C92F.2060206@xenomai.org> <535828F6.6050308@xenomai.org>
 <53583DF7.3080700@xenomai.org> <540F6B15.2070201@xenomai.org>
 <54112EFA.4080901@web.de> <541130D0.50409@web.de>
 <541AC62F.2050003@xenomai.org> <541AC933.9090600@siemens.com>
 <541B3ED6.8090606@xenomai.org> <541B8F91.9050603@xenomai.org>
In-Reply-To: <541B8F91.9050603@xenomai.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Subject: Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
List-Id: Discussions about the Xenomai project
To: Gilles Chanteperdrix, Jeroen Van den Keybus
Cc: "xenomai@xenomai.org"

On 2014-09-19 04:06, Gilles Chanteperdrix wrote:
> On 09/18/2014 10:21 PM, Gilles Chanteperdrix wrote:
>> On 09/18/2014 01:59 PM, Jan Kiszka wrote:
>>> On 2014-09-18 13:46, Gilles Chanteperdrix wrote:
>>>> (*) There is no guarantee that a CPU will see the correct order of
>>>> effects from a second CPU's accesses, even _if_ the second CPU uses a
>>>> memory barrier, unless the first CPU _also_ uses a matching memory
>>>> barrier (see the subsection on "SMP Barrier Pairing").
>>>
>>> [quick answer]
>>>
>>> ...or the architecture refrains from reordering write requests, like x86
>>> does. What may happen, though, is that the compiler reorders the writes.
>>> Therefore you need at least a (much cheaper) compiler barrier on those
>>> archs. See also linux/Documentation/memory-barriers.txt on this and more.
>>
>> The passage you quote is quoted from memory-barriers.txt, and I find it
>> makes it pretty clear that the two barriers are needed for cache
>> synchronization in the general case. Now, I read more in
>> memory-barriers, and I do not easily find details about what the fact
>> that x86 is "strictly ordered" means, and how it relaxes the constraints
>> on what rules. Maybe you would care to give us the exact passage where
>> this is mentioned? Also, I would welcome any detail about how SMP cache
>> synchronization actually works on x86.
>
> Ok, I have read a few things, it would seem recent x86 architectures
> (nehalem, sandy bridge and probably haswell) use the MESIF cache
> coherence protocol, with a twist for haswell since it introduced
> transactional memory. A cache coherence protocol ensures in theory
> transparently the same view of cache on all cpus. MESIF itself is
> derived from the MESI cache coherence protocol, which is said (by
> the wikipedia article) to have some performance issues which are generally
> compensated by adding a store buffer, which in turn requires memory
> barriers for a store on one cpu to be visible in the cache (and so on
> other cpus). I did not find any indication that memory barriers are
> still needed for this case (which is exactly the case we are interested
> in) with MESIF, but no indication that they are not needed either.
>
> Then, I had a look at the ticket spinlocks implementations. The
> operations they do are roughly the same as the xnlock implementation,
> except that they are optimized for each architecture, and so remove the
> useless barriers. The ARM implementation has the barrier after unlock,
> and uses in addition the special "sev" instruction, allowing the spinning
> cpu to wait for this signal with the "wfe" (wait for event) instruction,
> and to not burn cpu power when spinning. In fact it does not spin.
>
> Of course, the problem is that they are not recursive, so implementing
> recursive ticket spinlocks without adding overhead seems tricky. Just
> to test if ticket spinlocks solve the issue which started this thread, I
> made the following implementation:
>
> typedef struct {
> 	unsigned owner;
> 	arch_spinlock_t alock;
> } xnlock_t;
>
> static inline int __xnlock_get(xnlock_t *lock /*, */
> 			       XNLOCK_DBG_CONTEXT_ARGS)
> {
> 	unsigned long long start;
> 	int cpu = xnarch_current_cpu();
>
> 	if (lock->owner == cpu)
> 		return 1;
>
> 	xnlock_dbg_prepare_acquire(&start);
>
> 	arch_spin_lock(&lock->alock);
> 	lock->owner = cpu;
>
> 	xnlock_dbg_acquired(lock, cpu, &start /*, */ XNLOCK_DBG_PASS_CONTEXT);
>
> 	return 0;
> }
>
> static inline void xnlock_put(xnlock_t *lock)
> {
> 	if (xnlock_dbg_release(lock))
> 		return;
>
> 	lock->owner = ~0U;
> 	arch_spin_unlock(&lock->alock);
> }
>
> And the good news is yes, this avoids the issue with /proc/xenomai/stat.
> The bad news is that it does not answer the question about visibility on
> one cpu of stores on another cpu without barrier. Because the ticket
> spinlocks work either way on x86: the atomic add at the beginning of
> arch_spin_lock ensures both the visibility of the fact that there is a
> waiter to the cpu attempting to relock, and of the fact that the spin
> lock has been unlocked to the waiting cpu. So, in the particular case of
> the concurrent cat /proc/xenomai/stat, the "two barriers needed for
> visibility" rule is respected.
>
> I have also measured latencies with a cat /proc/xenomai/stat loop
> running, with and without a memory barrier after arch_spin_unlock, and
> could not find any difference: minimum, average and maximum latency
> after a few minutes of runtime are the same, or at least differ by less
> than 100ns.
>
> I am also wondering if this xnlock implementation could be used on
> forge. It has the advantage of benefiting from architecture
> optimization, without the need for maintaining architecture dependent code.
>

Indeed, that would be very elegant!

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
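
For readers who want to experiment with the idea outside the kernel, here is a minimal userspace sketch of the recursive wrapper quoted above. It is not the Xenomai or Linux implementation. It assumes C11 atomics in place of arch_spinlock_t, takes the cpu id as a parameter instead of calling xnarch_current_cpu(), and omits the xnlock_dbg_* hooks. The names ticket_lock_t, rec_lock_t, rec_lock_get, rec_lock_put and NO_OWNER are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define NO_OWNER (~0U)

/* Hypothetical userspace stand-in for the kernel's arch_spinlock_t. */
typedef struct {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint serving;	/* ticket currently allowed to hold the lock */
} ticket_lock_t;

/* Recursive wrapper: an owner id in front of a non-recursive ticket lock. */
typedef struct {
	atomic_uint owner;	/* id of the current holder, or NO_OWNER */
	ticket_lock_t tl;
} rec_lock_t;

static inline void rec_lock_init(rec_lock_t *l)
{
	atomic_init(&l->owner, NO_OWNER);
	atomic_init(&l->tl.next, 0);
	atomic_init(&l->tl.serving, 0);
}

static inline void ticket_lock(ticket_lock_t *t)
{
	/* Atomic read-modify-write: draw a ticket, then wait for our turn. */
	unsigned me = atomic_fetch_add_explicit(&t->next, 1,
						memory_order_relaxed);

	while (atomic_load_explicit(&t->serving, memory_order_acquire) != me)
		;	/* spin */
}

static inline void ticket_unlock(ticket_lock_t *t)
{
	/* Only the holder writes 'serving', so a plain increment is enough. */
	unsigned cur = atomic_load_explicit(&t->serving, memory_order_relaxed);

	atomic_store_explicit(&t->serving, cur + 1, memory_order_release);
}

/* Returns 1 on a nested acquisition (the caller must not release), else 0. */
static inline int rec_lock_get(rec_lock_t *l, unsigned cpu)
{
	/* Only 'cpu' itself ever stores its own id, and only while holding
	 * the lock, so seeing it here means we already own the lock. */
	if (atomic_load_explicit(&l->owner, memory_order_relaxed) == cpu)
		return 1;

	ticket_lock(&l->tl);
	atomic_store_explicit(&l->owner, cpu, memory_order_relaxed);

	return 0;
}

static inline void rec_lock_put(rec_lock_t *l)
{
	/* Clear the owner before handing the lock to the next ticket. */
	atomic_store_explicit(&l->owner, NO_OWNER, memory_order_relaxed);
	ticket_unlock(&l->tl);
}
```

As with the quoted __xnlock_get, the caller is expected to call rec_lock_put only when rec_lock_get returned 0, so that a nested acquisition does not drop the lock early.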