From: jamie@shareable.org (Jamie Lokier)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH 1/1] [RFC] arm: add half-word __xchg
Date: Fri, 19 Mar 2010 03:36:01 +0000	[thread overview]
Message-ID: <20100319033601.GF14108@shareable.org> (raw)
In-Reply-To: <20100319021222.GA17388@Krystal>

Mathieu Desnoyers wrote:
> * Jamie Lokier (jamie at shareable.org) wrote:
> > Mathieu Desnoyers wrote:
> > > But.. thinking about it. It's bad to have a 2-byte xchg primitive that
> > > only works on UP and breaks the build on SMP. We should instead
> > > implement a workaround based on __cmpxchg4 to perform the 2-byte xchg().
> > 
> > This exposes why there should be __cmpxchg_bool() versions, which do
> > not loop, preferably in the generic kernel API, because the workaround
> > using __cmpxchg4 has to add yet another pointless loop nesting to all
> > cmpxchg users.
> 
> The workaround I propose is to use __cmpxchg4 in a loop for 2-byte xchg()
> fallback on arm; it is not related to cmpxchg() in any way. I don't
> think one single use-case justifies the creation of a __cmpxchg_bool()
> across all architectures.
> 
> Also, I've never seen where having a cmpxchg primitive returning a
> boolean is more efficient than returning the value read, ever. If we
> have to loop, then we can re-use the value that has been read by
> cmpxchg, while the version returning a boolean would need a read to
> re-read the next value.

Your workaround is an example of just such a case (but we are both
right; the best I can come up with is described at the end).  Here is
a version of __xchg2 as you've described it (please ignore any
irrelevant mistakes :-):

static inline u16 __xchg2(u16 *ptr, u16 new)
{
	int shift = (~(unsigned)ptr & 2) * 8; /* Assumes big-endian. */
	u32 mask = ~(0x0000ffff << shift), add = new << shift;
	u32 *ptrbig = (u32 *)((char *)ptr - ((unsigned)ptr & 2));

	u32 oldbig = *ptrbig;
	while (1) {
		u32 tmp = __cmpxchg4(ptrbig, oldbig, (oldbig & mask) | add);
		if (tmp == oldbig)
			break;
		oldbig = tmp;
	}

	return (u16)(oldbig >> shift);
}
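
For completeness, the size-dispatching wrapper could route 2-byte
operands to that emulation with something like this (rough sketch
only, not the actual switch body in <asm/system.h>; __xchg4 is just a
placeholder name for the existing 4-byte path):

static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size)
{
	switch (size) {
	case 2:
		/* Fall back to the cmpxchg-based emulation above. */
		return __xchg2((u16 *)ptr, (u16)x);
	case 4:
		return __xchg4((u32 *)ptr, (u32)x); /* placeholder for the 4-byte path */
	default:
		__bad_xchg(ptr, size);	/* link-time error, as usual */
		return 0;
	}
}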

This is __cmpxchg4() from the __cmpxchg() macro in <asm/system.h>,
types changed for sanity:

static inline u32 __cmpxchg4(u32 *ptr, u32 old, u32 new)
{
	u32 oldval, res;
	do {
		asm volatile("@ __cmpxchg4\n"
		"       ldrex   %1, [%2]\n"
		"       mov     %0, #0\n"
		"       teq     %1, %3\n"
		"       strexeq %0, %4, [%2]\n"
			: "=&r" (res), "=&r" (oldval)
			: "r" (ptr), "Ir" (old), "r" (new)
			: "memory", "cc");
	} while (res);
	return oldval;
}

Now let's imagine a __cmpxchg4_bool, which returns non-zero if the new
value was stored, and zero if the new value was not stored:

static inline int __cmpxchg4_bool(u32 *ptr, u32 old, u32 new)
{
	u32 tmp, failed;
	asm volatile("@ __cmpxchg4_bool\n"
	"       ldrex   %1, [%2]\n"
	"       mov     %0, #1\n"
	"       teq     %1, %3\n"
	"       strexeq %0, %4, [%2]\n"
		: "=&r" (res), "=&r" (tmp)
		: "r" (ptr), "Ir" (old), "r" (new)
		: "memory", "cc");
	return !failed;
}

And the alternative loop part of __xchg2 (setup and return aren't
interesting):

	do {
		oldbig = *ptrbig;
	} while (!__cmpxchg4_bool(ptrbig, oldbig, (oldbig & mask) | add));

I'm sure you can already see where this is going.

Let's compile __xchg2 using __cmpxchg4 by hand.  We're not interested
in the setup, just the loop, so let's imagine the code starting at
"oldbig = *ptrbig".  The names are all symbolic register names:

	ldr	oldbig, [ptrbig]

	@ start of loop in __xchg2
1:
	and	newbig, oldbig, mask
	orr	newbig, newbig, add
	@ start of __cmpxchg4
2:
	ldrex	tmp, [ptrbig]
	mov	res, #0
	teq	tmp, oldbig
	strexeq	res, newbig, [ptrbig]

	@ end of asm, but do..while(res) still needed
	teq	res, #0
	bne	2b
	@ end of __cmpxchg4

	@ remainder of loop in __xchg2
	teq	tmp, oldbig
	mov	oldbig, tmp
	bne	1b

The result is (oldbig >> shift).

Now for comparison, the equivalent part of __xchg2 using __cmpxchg4_bool:

	@ start of loop in __xchg2
1:
	ldr	oldbig, [ptrbig]
	and	newbig, oldbig, mask
	orr	newbig, newbig, add

	@ start of __cmpxchg4_bool
	ldrex	tmp, [ptrbig]
	mov	failed, #1
	teq	tmp, oldbig
	strexeq	failed, newbig, [ptrbig]
	@ end of __cmpxchg4_bool (!failed is optimised by caller's !)

	@ remainder of loop in __xchg2
	teq	failed, #0
	bne	1b

As you can see, the bool version has one level less of loop nesting,
and its loop is 9 instructions instead of 11 for the non-bool version
because of that.

Although the bool version is shorter and the double-loop is
pointless, you are right that the ldrex result can be recycled to
avoid a load (and is probably faster).

That means the "ideal" API for the bool-flavoured call would return
_both_ the old value and whether the new value was stored.  A tidy
way to do that would be to take &old instead, and modify the
pointed-to value.

Combining both observations:

static inline int __cmpxchg4_bool2(u32 *ptr, u32 *old, u32 new)
{
	u32 failed;
	asm volatile("@ __cmpxchg4_bool2\n"
	"       ldrex   %1, [%2]\n"
	"       mov     %0, #1\n"
	"       teq     %1, %3\n"
	"       strexeq %0, %4, [%2]\n"
		: "=&r" (res), "=&r" (*old)
		: "r" (ptr), "Ir" (*old), "r" (new)
		: "memory", "cc");
	return !failed;
}

The new loop part of __xchg2:

	u32 oldbig = *ptrbig;
	while (!__cmpxchg4_bool2(ptrbig, &oldbig, (oldbig & mask) | add)) {}

The result (assuming GCC does its job optimising *&oldbig to oldbig):

	ldr	oldbig, [ptrbig]

	@ start of loop in __xchg2
1:
	and	newbig, oldbig, mask
	orr	newbig, newbig, add

	@ start of __cmpxchg4_bool2
	ldrex	tmp, [ptrbig]
	mov	failed, #1
	teq	tmp, oldbig
	strexeq	failed, newbig, [ptrbig]
	mov	oldbig, tmp
	@ end of __cmpxchg4_bool2 (!failed is optimised by caller's !)

	@ remainder of loop in __xchg2
	teq	failed, #0
	bne	1b

That eliminates the load and the double-loop.

Quite a lot of __cmpxchg calls in the kernel are like this, with an
unnecessary extra loop, but by no means all are.  It's not just
__xchg2, but obviously all the callers would have to change to take
advantage of a __cmpxchg_bool2-like function.
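
To make the pattern concrete, a typical caller looks roughly like
this (schematic only, not any particular call site; atomic_apply is a
made-up name).  The while() is the caller's own retry loop, and
__cmpxchg4 hides a second ldrex/strex retry loop inside itself:

/* Schematic caller: atomically apply f to *ptr and return the old value.
 * The while() below is the caller's loop; __cmpxchg4 contains another,
 * unnecessary, retry loop of its own. */
static inline u32 atomic_apply(u32 *ptr, u32 (*f)(u32))
{
	u32 old = *ptr;
	while (1) {
		u32 tmp = __cmpxchg4(ptr, old, f(old));
		if (tmp == old)
			break;
		old = tmp;
	}
	return old;
}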

The difference becomes more evident when writing __cmpxchg2 emulation
using __cmpxchg4.  Then you get an unnecessary *triple* loop at many
call sites: From __cmpxchg4 itself, __cmpxchg2 wrapped around it, and
the caller is also looping.  Only the caller's loop is necessary.
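
For concreteness, a 2-byte cmpxchg emulated with __cmpxchg4 would
look roughly like this (untested sketch, same shifting trick as
__xchg2 above).  The while() here is the middle loop, which only
exists to cope with the *other* halfword changing underneath us:

static inline u16 __cmpxchg2(u16 *ptr, u16 old, u16 new)
{
	int shift = (~(unsigned)ptr & 2) * 8; /* Assumes big-endian, as above. */
	u32 mask = ~(0x0000ffff << shift);
	u32 *ptrbig = (u32 *)((char *)ptr - ((unsigned)ptr & 2));
	u32 oldbig = *ptrbig;

	while ((u16)(oldbig >> shift) == old) {
		u32 newbig = (oldbig & mask) | ((u32)new << shift);
		u32 tmp = __cmpxchg4(ptrbig, oldbig, newbig);
		if (tmp == oldbig)
			return old;	/* our halfword matched and was replaced */
		oldbig = tmp;		/* something changed, maybe only the other halfword */
	}
	return (u16)(oldbig >> shift);	/* our halfword no longer matches old */
}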

Obviously we can only go so far.  The shortest version of __xchg2 puts
the and+or between ldrex and strex, but that really isn't something
that can be exported safely in an arch-generic way:

	@ start of loop in __xchg2_smallest
1:
	ldrex	oldbig, [ptrbig]
	and	newbig, oldbig, mask
	orr	newbig, newbig, add
	strex	failed, newbig, [ptrbig]
	teq	failed, #0
	bne	1b

The difference between 6 instructions and the original's 11, and the
fact that it is simple and clear, leads me to suggest using the last
one as the ARM version of __xchg2 when {ld,st}rexh are not available.
Its only downside is that it can't go in asm-generic.  That is,
unless we had an __atomic_and_xor() arch function for asm-generic to
use :-)
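
Wrapped up in C inline asm, __xchg2_smallest would be something like
this (untested, following the style of __cmpxchg in <asm/system.h>):

static inline u16 __xchg2_smallest(u16 *ptr, u16 new)
{
	int shift = (~(unsigned)ptr & 2) * 8; /* Assumes big-endian, as above. */
	u32 mask = ~(0x0000ffff << shift), add = (u32)new << shift;
	u32 *ptrbig = (u32 *)((char *)ptr - ((unsigned)ptr & 2));
	u32 oldbig, newbig, failed;

	asm volatile("@ __xchg2_smallest\n"
	"1:     ldrex   %0, [%3]\n"
	"       and     %1, %0, %4\n"
	"       orr     %1, %1, %5\n"
	"       strex   %2, %1, [%3]\n"
	"       teq     %2, #0\n"
	"       bne     1b"
		: "=&r" (oldbig), "=&r" (newbig), "=&r" (failed)
		: "r" (ptrbig), "r" (mask), "r" (add)
		: "memory", "cc");

	return (u16)(oldbig >> shift);
}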

-- Jamie

Thread overview: 25+ messages
2010-03-10 16:22 __xchg for sizes other than 32bit Imre Deak
2010-03-10 17:35 ` Russell King - ARM Linux
2010-03-10 20:02   ` [PATCH] ARM support single byte cmpxchg and cmpxchg_local on ARMv6 Mathieu Desnoyers
2010-03-10 20:30     ` Russell King - ARM Linux
2010-03-10 21:15       ` Mathieu Desnoyers
2010-03-10 23:16   ` __xchg for sizes other than 32bit Jamie Lokier
2010-03-18  9:29   ` [PATCH 1/1] [RFC] arm: add half-word __xchg Alexander Shishkin
2010-03-18 12:32     ` Mathieu Desnoyers
2010-03-18 12:37       ` Alexander Shishkin
2010-03-18 13:33       ` Alexander Shishkin
2010-03-18 13:50         ` Mathieu Desnoyers
2010-03-18 16:33           ` Imre Deak
2010-03-18 17:21             ` Mathieu Desnoyers
2010-03-18 19:00               ` Imre Deak
2010-03-18 19:30                 ` Mathieu Desnoyers
2010-03-19  1:49           ` Jamie Lokier
2010-03-19  2:12             ` Mathieu Desnoyers
2010-03-19  3:36               ` Jamie Lokier [this message]
2010-03-25 15:52           ` [PATCH 1/1] [RFCv2] " Alexander Shishkin
2010-03-25 16:42           ` Alexander Shishkin
2010-03-27 22:52             ` Russell King - ARM Linux
2010-03-28  0:14               ` Jamie Lokier
2010-03-28  0:18                 ` Russell King - ARM Linux
2010-03-28  1:00                 ` Mathieu Desnoyers
2010-03-28 14:39                   ` Jamie Lokier
