[PATCH 1/1] [RFC] arm: add half-word __xchg

All of lore.kernel.org
 help / color / mirror / Atom feed

From: jamie@shareable.org (Jamie Lokier)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH 1/1] [RFC] arm: add half-word __xchg
Date: Fri, 19 Mar 2010 03:36:01 +0000	[thread overview]
Message-ID: <20100319033601.GF14108@shareable.org> (raw)
In-Reply-To: <20100319021222.GA17388@Krystal>

Mathieu Desnoyers wrote:
> * Jamie Lokier (jamie at shareable.org) wrote:
> > Mathieu Desnoyers wrote:
> > > But.. thinking about it. It's bad to have a 2-byte xchg primitive that
> > > only works on UP and breaks the build on SMP. We should instead
> > > implement a workaround based on __cmpxchg4 to perform the 2-byte xchg().
> > 
> > This exposes why there should be __cmpxchg_bool() versions, which do
> > not loop, preferably in the generic kernel API, because the workaround
> > using __cmpxchg4 has to add yet another pointless loop nesting to all
> > cmpxchg users.
> 
> The workaround I propose is to use __cmpxchg4 in a loop for 2-byte xchg()
> fallback on arm; it is not related to cmpxchg() in any way. I don't
> think one single use-case justifies the creation of a __cmpxchg_bool()
> across all architectures.
> 
> Also, I've never seen where having a cmpxchg primitive returning a
> boolean is more efficient than returning the value read, ever. If we
> have to loop, then we can re-use the value that has been read by
> cmpxchg, while the version returning a boolean would need a read to
> re-read the next value.

Your workaround is an example (but we are both right; the best I can
come up with is described at the end).  This is a version of __xchg2
as you've described (please ignore any irrelevant mistakes :-):

static inline u16 __xchg2(u16 *ptr, u16 new)
{
	int shift = (~(unsigned)ptr & 2) * 8; /* Assumes big-endian. */
	u32 mask = ~(0x0000ffff << shift), add = new << shift;
	u32 *ptrbig = (u32 *)((char *)ptr - ((unsigned)ptr & 2));

	u32 oldbig = *ptrbig;
	while (1) {
		u32 tmp = __cmpxchg4(ptrbig, oldbig, (oldbig & mask) | add);
		if (tmp == oldbig)
			break;
		oldbig = tmp;
	}

	return (u16)(oldbig >> shift);
}

This is __cmpxchg4() from the __cmpxchg() macro in <asm/system.h>,
types changed for sanity:

static inline u32 __cmpxchg4(u32 *ptr, u32 old, u32 new)
{
	u32 oldval, res;
	do {
		asm volatile("@ __cmpxchg4\n"
		"       ldrex   %1, [%2]\n"
		"       mov     %0, #0\n"
		"       teq     %1, %3\n"
		"       strexeq %0, %4, [%2]\n"
			: "=&r" (res), "=&r" (oldval)
			: "r" (ptr), "Ir" (old), "r" (new)
			: "memory", "cc");
	} while (res);
	return oldval;
}

Now let's imagine a __cmpxchg4_bool, which returns non-zero if the new
value was stored, and zero if the new value was not stored:

static inline int __cmpxchg4_bool(u32 *ptr, u32 old, u32 new)
{
	u32 tmp, failed;
	asm volatile("@ __cmpxchg4_bool\n"
	"       ldrex   %1, [%2]\n"
	"       mov     %0, #1\n"
	"       teq     %1, %3\n"
	"       strexeq %0, %4, [%2]\n"
		: "=&r" (res), "=&r" (tmp)
		: "r" (ptr), "Ir" (old), "r" (new)
		: "memory", "cc");
	return !failed;
}

And the alternative loop part of __xchg2 (setup and return aren't
interseting):

	do {
		oldbig = *ptrbig;
	} while (!__cmpxchg4_bool(ptrbig, oldbig, (oldbig & mask) | add);

I'm sure you can already see where this is going.

Let's compile __xchg2 using __cmpxchg4 by hand.  We're not interested
in the setup, just the loop, so let's imagine the code starting at
"oldbig = *ptrbig".  The names are all symbolic register names:

	ldr	oldbig, [ptrbig]

	@ start of loop in __xchg2
	and	new, oldbig, mask
	or	new, new, add
1:
	@ start of __cmpxchg4
2:
	ldrex	tmp, [ptrbig]
	mov	res, #0
	teq	tmp, oldbig
	strexeq	res, newbig, [ptrbig]

	@ end of asm, but do..while(res) still needed
	teq	res, #0
	bne	2b
	@ end of __cmpxchg4

	@ remainder of loop in __xchg2
	teq	tmp, oldbig
	mov	oldbig, tmp
	bne	1b

The result is (oldbig >> shift).

Now for comparison, the equivalent part of __xchg2 using __cmpxchg4_bool:

	@ start of loop in __xchg2
1:
	ldr	oldbig, [ptrbig]
	and	new, oldbig, mask
	or	new, new, add

	@ start of __cmpxchg4_bool
	ldrex	tmp, [ptrbig]
	mov	failed, #1
	teq	tmp, oldbig
	strexeq	failed, newbig, [ptrbig]
	@ end of __cmpxchg4_bool (!failed is optimised by caller's !)

	@ remainder of loop in __xchg2
	teq	failed, #0
	bne	1b

As you can see, the bool version has one less loop nesting, and is 9
instructions instead of 11 for the non-bool version because of that.

Although the bool version is shorter, and the double-loop is
pointless, 

You are right that the ldrex result can be recycled to avoid a load
(and is probably faster), although the double-loop is pointless.

Meaning the "ideal" API for the bool-flavoured call would returns
_both_ the old value and whether the new value was stored.  A tidy way
to do that would be to take &old instead, and modify the pointed-to value.

Combining both observations:

static inline int __cmpxchg4_bool2(u32 *ptr, u32 *old, u32 new)
{
	u32 failed;
	asm volatile("@ __cmpxchg4_bool2\n"
	"       ldrex   %1, [%2]\n"
	"       mov     %0, #1\n"
	"       teq     %1, %3\n"
	"       strexeq %0, %4, [%2]\n"
		: "=&r" (res), "=&r" (*old)
		: "r" (ptr), "Ir" (*old), "r" (new)
		: "memory", "cc");
	return !failed;
}

The new loop part of __xchg2:

	u32 oldbig = *ptrbig;
	while (!__cmpxchg4_bool2(ptrbig, &oldbig, (oldbig & mask) | add)) {}

The result (assuming GCC does it's job optimising *&oldbig to oldbig):

	ldr	oldbig, [ptrbig]

	@ start of loop in __xchg2
1:
	and	new, oldbig, mask
	or	new, new, add

	@ start of __cmpxchg4_bool2
	ldrex	tmp, [ptrbig]
	mov	failed, #1
	teq	tmp, oldbig
	strexeq	failed, newbig, [ptrbig]
	mov	oldbig, tmp
	@ end of __cmpxchg4_bool2 (!failed is optimised by caller's !)

	@ remainder of loop in __xchg2
	teq	failed, #0
	bne	1b

That eliminates the load and the double-loop.

Quite a lot of __cmpxchg calls in the kernel are like this, with an
unnecessary extra loop, but by no means all are.  It's not just
__xchg2, but obviously all the callers would have to change to take
advantage of a __cmpxchg_bool2-like function.

The difference becomes more evident when writing __cmpxchg2 emulation
using __cmpxchg4.  Then you get an unnecessary *triple* loop at many
call sites: From __cmpxchg4 itself, __cmpxchg2 wrapped around it, and
the caller is also looping.  Only the caller's loop is necessary.

Obviously we can only go so far.  The shortest version of __xchg2 puts
the and+or between ldrex and strex, but that really isn't something
that can be exported safely in an arch-generic way:

	@ start of loop in __xchg2_smallest
1:
	ldrex	oldbig, [ptrbig]
	and	new, oldbig, mask
	or	new, new, add
	strex	failed, newbig, [ptrbig]
	teq	failed, #0
	bne	1b

The difference between 6 instructions and the original's 11, and the
fact it is simple and clear, leads me to suggest using last one as the
ARM version of __xchg2 when {ld,st}rexh are not available.  It's only
downside is it can't go in asm-generic.  That is, unless we had an
__atomic_and_xor() arch function, for asm-generic to use :-)

-- Jamie

next prev parent reply	other threads:[~2010-03-19  3:36 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-03-10 16:22 __xchg for sizes other than 32bit Imre Deak
2010-03-10 17:35 ` Russell King - ARM Linux
2010-03-10 20:02   ` [PATCH] ARM support single byte cmpxchg and cmpxchg_local on ARMv6 Mathieu Desnoyers
2010-03-10 20:30     ` Russell King - ARM Linux
2010-03-10 21:15       ` Mathieu Desnoyers
2010-03-10 23:16   ` __xchg for sizes other than 32bit Jamie Lokier
2010-03-18  9:29   ` [PATCH 1/1] [RFC] arm: add half-word __xchg Alexander Shishkin
2010-03-18 12:32     ` Mathieu Desnoyers
2010-03-18 12:37       ` Alexander Shishkin
2010-03-18 13:33       ` Alexander Shishkin
2010-03-18 13:50         ` Mathieu Desnoyers
2010-03-18 16:33           ` Imre Deak
2010-03-18 17:21             ` Mathieu Desnoyers
2010-03-18 19:00               ` Imre Deak
2010-03-18 19:30                 ` Mathieu Desnoyers
2010-03-19  1:49           ` Jamie Lokier
2010-03-19  2:12             ` Mathieu Desnoyers
2010-03-19  3:36               ` Jamie Lokier [this message]
2010-03-25 15:52           ` [PATCH 1/1] [RFCv2] " Alexander Shishkin
2010-03-25 16:42           ` Alexander Shishkin
2010-03-27 22:52             ` Russell King - ARM Linux
2010-03-28  0:14               ` Jamie Lokier
2010-03-28  0:18                 ` Russell King - ARM Linux
2010-03-28  1:00                 ` Mathieu Desnoyers
2010-03-28 14:39                   ` Jamie Lokier

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100319033601.GF14108@shareable.org \
    --to=jamie@shareable.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.