From: Bruce Richardson <bruce.richardson@intel.com>
To: "Morten Brørup" <mb@smartsharesystems.com>
Cc: <dev@dpdk.org>
Subject: Re: [PATCH 1/2] net: ethernet address comparison optimizations
Date: Fri, 30 Jan 2026 14:32:54 +0000 [thread overview]
Message-ID: <aXzBFlU6eW5SFTwq@bricha3-mobl1.ger.corp.intel.com> (raw)
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35F656D7@smartserver.smartshare.dk>
On Fri, Jan 30, 2026 at 03:25:34PM +0100, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Friday, 30 January 2026 15.03
> >
> > On Fri, Jan 30, 2026 at 02:54:52PM +0100, Morten Brørup wrote:
> > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > Sent: Friday, 30 January 2026 12.27
> > > >
> > > > On Fri, Jan 30, 2026 at 12:16:43PM +0100, Morten Brørup wrote:
> > > > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > > > Sent: Friday, 30 January 2026 11.53
> > > > > >
> > > > > > On Fri, Jan 30, 2026 at 10:46:16AM +0000, Morten Brørup wrote:
> > > > > > > For CPU architectures without strict alignment requirements,
> > > > > > operations on
> > > > > > > 6-byte Ethernet addresses using three 2-byte operations were
> > > > replaced
> > > > > > by a
> > > > > > > 4-byte and a 2-byte operation, i.e. two operations instead of
> > > > three.
> > > > > > >
> > > > > > > Comparison functions are pure, so added __rte_pure.
> > > > > > >
> > > > > > > Removed superfluous parentheses. (No functional change.)
> > > > > > >
> > > > > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > ---
> > > > > > > lib/net/rte_ether.h | 19 ++++++++++++++++++-
> > > > > > > 1 file changed, 18 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/lib/net/rte_ether.h b/lib/net/rte_ether.h
> > > > > > > index c9a0b536c3..5552d3c1f6 100644
> > > > > > > --- a/lib/net/rte_ether.h
> > > > > > > +++ b/lib/net/rte_ether.h
> > > > > > > @@ -99,13 +99,19 @@ static_assert(alignof(struct
> > rte_ether_addr)
> > > > ==
> > > > > > 2,
> > > > > > > * True (1) if the given two ethernet address are the
> > same;
> > > > > > > * False (0) otherwise.
> > > > > > > */
> > > > > > > +__rte_pure
> > > > > > > static inline int rte_is_same_ether_addr(const struct
> > > > rte_ether_addr
> > > > > > *ea1,
> > > > > > > const struct rte_ether_addr *ea2)
> > > > > > > {
> > > > > > > +#if !defined(RTE_ARCH_STRICT_ALIGN)
> > > > > > > + return ((((const unaligned_uint32_t *)ea1)[0] ^ ((const
> > > > > > unaligned_uint32_t *)ea2)[0]) |
> > > > > > > + (((const uint16_t *)ea1)[2] ^ ((const uint16_t
> > > > > > *)ea2)[2])) == 0;
> > > > > > > +#else
> > > > > > > const uint16_t *w1 = (const uint16_t *)ea1;
> > > > > > > const uint16_t *w2 = (const uint16_t *)ea2;
> > > > > > >
> > > > > > > return ((w1[0] ^ w2[0]) | (w1[1] ^ w2[1]) | (w1[2] ^
> > > > w2[2])) ==
> > > > > > 0;
> > > > > > > +#endif
> > > > > > > }
> > > > > >
> > > > > > Is this actually faster?
> > > > >
> > > > > It's a simple micro-optimization, so I haven't benchmarked it.
> > > > > On x86, the compiled function is simplified and reduced in size
> > from
> > > > 34 to 24 bytes:
> > > > >
> > > > > 00000000004ed650 <review_rte_is_same_ether_addr>:
> > > > > 4ed650: 0f b7 07 movzwl (%rdi),%eax
> > > > > 4ed653: 0f b7 57 02 movzwl 0x2(%rdi),%edx
> > > > > 4ed657: 66 33 06 xor (%rsi),%ax
> > > > > 4ed65a: 66 33 56 02 xor 0x2(%rsi),%dx
> > > > > 4ed65e: 09 d0 or %edx,%eax
> > > > > 4ed660: 0f b7 57 04 movzwl 0x4(%rdi),%edx
> > > > > 4ed664: 66 33 56 04 xor 0x4(%rsi),%dx
> > > > > 4ed668: 66 09 d0 or %dx,%ax
> > > > > 4ed66b: 0f 94 c0 sete %al
> > > > > 4ed66e: 0f b6 c0 movzbl %al,%eax
> > > > > 4ed671: c3 ret
> > > > > 4ed672: 66 66 2e 0f 1f 84 00 data16 cs nopw
> > 0x0(%rax,%rax,1)
> > > > > 4ed679: 00 00 00 00
> > > > > 4ed67d: 0f 1f 00 nopl (%rax)
> > > > >
> > > > > 00000000004ed680 <rte_is_same_ether_addr_improved>:
> > > > > 4ed680: 0f b7 47 04 movzwl 0x4(%rdi),%eax
> > > > > 4ed684: 66 33 46 04 xor 0x4(%rsi),%ax
> > > > > 4ed688: 8b 17 mov (%rdi),%edx
> > > > > 4ed68a: 33 16 xor (%rsi),%edx
> > > > > 4ed68c: 0f b7 c0 movzwl %ax,%eax
> > > > > 4ed68f: 09 c2 or %eax,%edx
> > > > > 4ed691: 0f 94 c0 sete %al
> > > > > 4ed694: 0f b6 c0 movzbl %al,%eax
> > > > > 4ed697: c3 ret
> > > > > 4ed698: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
> > > > > 4ed69f: 00
> > > > >
> > > > > For reference, memcpy() of 6 bytes (compile time constant) also
> > > > compiles to a 4-byte and a 2-byte operation, not three 2-byte
> > > > operations.
> > > > >
> > > > What about memcmp? Does it compile similarly?
> > >
> > > memcmp(a,b,6) on Clang compiles into something very similar.
> > > memcmp(a,b,6) on GCC compiles into something with a branch after the
> > first 4-byte comparison, with the assumption (regarding static branch
> > prediction) that they are likely to differ.
> > > I guess GCC's counterproductive behavior was the reason for
> > originally implementing a manual comparison, instead of simply using
> > memcmp().
> > >
> > > BTW, GCC is clever enough to compile 8-byte and 16-byte comparisons
> > into code without branches.
> > > I guess that's why rte_ipv6_addr_eq() is implemented using memcpy()
> > [1].
> > >
> > > [1]:
> > https://elixir.bootlin.com/dpdk/v25.11/source/lib/net/rte_ip6.h#L68
> > >
> > > > Before we start adding ifdefs
> > > > like this to the code, I'd like to see some measured performance
> > > > benefits
> > > > from it. While the code may be 10 bytes shorter, does that actually
> > > > translate into a measurable difference in some app?
> > >
> > > Excellent question!
> > > Some quick rudimentary testing shows that it seems to be ~4 cycles
> > slower than what it's replacing.
> > > Reality beats expectations.
> > >
> > > I'll drop this patch.
> > >
> > If you have the test-case already prepared, can you also check what
> > memcmp() performs like? Replacing the whole function by memcmp and
> > punting
> > the optimization to the compiler would be a nice, though small, code
> > improvement.
>
> Good you asked!
>
> While setting up the test for memcmp(), I noticed that I had been testing my improved function without "inline".
> With inline (like the original), it's ~1 cycle faster than the original.
> I have restored the patch status to "New".
>
> The memcmp() test (not forgetting "inline") performs very close to the original.
>
If memcmp performs like the original, I'd be tempted to forgo the 1cycle
benefit just to have the shortest simplest code.
next prev parent reply other threads:[~2026-01-30 14:33 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-01-30 10:46 [PATCH 1/2] net: ethernet address comparison optimizations Morten Brørup
2026-01-30 10:46 ` [PATCH 2/2] [RFC] net: introduce fast ethernet address comparison function Morten Brørup
2026-01-30 14:03 ` Morten Brørup
2026-01-30 10:52 ` [PATCH 1/2] net: ethernet address comparison optimizations Bruce Richardson
2026-01-30 11:16 ` Morten Brørup
2026-01-30 11:26 ` Bruce Richardson
2026-01-30 13:54 ` Morten Brørup
2026-01-30 14:02 ` Bruce Richardson
2026-01-30 14:25 ` Morten Brørup
2026-01-30 14:32 ` Bruce Richardson [this message]
2026-01-30 14:59 ` Morten Brørup
2026-01-30 16:20 ` Stephen Hemminger
2026-01-30 16:24 ` Bruce Richardson
2026-01-30 16:31 ` Konstantin Ananyev
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aXzBFlU6eW5SFTwq@bricha3-mobl1.ger.corp.intel.com \
--to=bruce.richardson@intel.com \
--cc=dev@dpdk.org \
--cc=mb@smartsharesystems.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox