Re: [PATCH 2/3] find: micro-optimize for_each_{set,clear}_bit()

public inbox for linux-kernel@vger.kernel.org

From: Yury Norov <yury.norov@gmail.com>
To: Marc Zyngier <maz@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Lucas Stach <l.stach@pengutronix.de>,
	Russell King <linux+etnaviv@armlinux.org.uk>,
	Christian Gmeiner <christian.gmeiner@gmail.com>,
	David Airlie <airlied@linux.ie>, Daniel Vetter <daniel@ffwll.ch>,
	Jean Delvare <jdelvare@suse.com>,
	Guenter Roeck <linux@roeck-us.net>,
	Andy Shevchenko <andriy.shevchenko@linux.intel.com>,
	Rasmus Villemoes <linux@rasmusvillemoes.dk>,
	David Woodhouse <dwmw@amazon.co.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Wei Yang <richard.weiyang@linux.alibaba.com>,
	Geert Uytterhoeven <geert+renesas@glider.be>,
	Alexey Klimov <aklimov@redhat.com>,
	x86@kernel.org, linux-kernel@vger.kernel.org,
	etnaviv@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-hwmon@vger.kernel.org
Subject: Re: [PATCH 2/3] find: micro-optimize for_each_{set,clear}_bit()
Date: Sun, 27 Jun 2021 09:47:25 -0700
Message-ID: <YNirnaYw1GSxg1jK@yury-ThinkPad>
In-Reply-To: <YM4pJpNphEwvUF2F@yury-ThinkPad>

On Sat, Jun 19, 2021 at 10:28:07AM -0700, Yury Norov wrote:
> On Sat, Jun 19, 2021 at 05:24:15PM +0100, Marc Zyngier wrote:
> > On Fri, 18 Jun 2021 20:57:34 +0100,
> > Yury Norov <yury.norov@gmail.com> wrote:
> > > 
> > > The macros iterate through all set/clear bits in a bitmap. They search for
> > > the first bit using find_first_bit(), and for the remaining bits using
> > > find_next_bit().
> > > 
> > > Since find_next_bit() is called shortly after find_first_bit(), we can
> > > save a few I-cache lines by not using find_first_bit().
> > 
> > Really?
> > 
> > > 
> > > Signed-off-by: Yury Norov <yury.norov@gmail.com>
> > > ---
> > >  include/linux/find.h | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/linux/find.h b/include/linux/find.h
> > > index 4500e8ab93e2..ae9ed52b52b8 100644
> > > --- a/include/linux/find.h
> > > +++ b/include/linux/find.h
> > > @@ -280,7 +280,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
> > >  #endif
> > >  
> > >  #define for_each_set_bit(bit, addr, size) \
> > > -	for ((bit) = find_first_bit((addr), (size));		\
> > > +	for ((bit) = find_next_bit((addr), (size), 0);		\
> > 
> > On which architecture do you observe a gain? Only 32bit ARM and m68k
> > implement their own version of find_first_bit(), and everyone else
> > uses the canonical implementation:
> 
> And those that enable GENERIC_FIND_FIRST_BIT: x86, arm64, arc, mips
> and s390.
> 
> > #ifndef find_first_bit
> > #define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
> > #endif
> > 
> > These architectures explicitly have different implementations for
> > find_first_bit() and find_next_bit() because they can do better
> > (whether that is true or not is another debate). I don't think you
> > should remove this optimisation until it has been measured on these
> > two architectures.
> 
> This patch is based on a series that enables a separate implementation
> of find_first_bit() for all architectures; according to my tests,
> find_first* is roughly twice as fast as find_next* on arm64 and x86.
> 
> https://lore.kernel.org/lkml/20210612123639.329047-1-yury.norov@gmail.com/T/#t
> 
> After applying the series, I noticed that my small kernel module that
> calls for_each_set_bit() now uses find_first_bit() to find just one
> bit, and find_next_bit() for all the others. I think it's better to
> always use find_next_bit() in this case to minimize the chance of a
> cache miss. But if that's not obvious, I'll try to write a test.

The following patch to lib/find_bit_benchmark.c adds a test that compares
the running time of for_each_set_bit() and for_each_set_bit_from().

diff --git a/lib/find_bit_benchmark.c b/lib/find_bit_benchmark.c
index 5637c5711db9..1f37e99090b0 100644
--- a/lib/find_bit_benchmark.c
+++ b/lib/find_bit_benchmark.c
@@ -111,6 +111,59 @@ static int __init test_find_next_and_bit(const void *bitmap,
 	return 0;
 }
 
+#ifdef CONFIG_X86_64
+#define flush_cache_all() wbinvd()
+#endif
+
+static int __init test_for_each_set_bit(int flags)
+{
+#ifdef flush_cache_all
+	DECLARE_BITMAP(bm, BITS_PER_LONG * 2);
+	unsigned long i, cnt = 0;
+	ktime_t time;
+
+	bm[0] = 1; bm[1] = 0;
+
+	time = ktime_get();
+	while (cnt < 1000) {
+		if (flags)
+			flush_cache_all();
+		for_each_set_bit(i, bm, BITS_PER_LONG * 2)
+			cnt++;
+	}
+
+	time = ktime_get() - time;
+
+	pr_err("for_each_set_bit:   %18llu ns, %6ld iterations\n",  time, cnt);
+#endif
+	return 0;
+}
+
+static int __init test_for_each_set_bit_from(int flags)
+{
+#ifdef flush_cache_all
+	DECLARE_BITMAP(bm, BITS_PER_LONG * 2);
+	unsigned long i, cnt = 0;
+	ktime_t time;
+
+	bm[0] = 1; bm[1] = 0;
+
+	time = ktime_get();
+	while (cnt < 1000) {
+		if (flags)
+			flush_cache_all();
+		i = 0;
+		for_each_set_bit_from(i, bm, BITS_PER_LONG * 2)
+			cnt++;
+	}
+
+	time = ktime_get() - time;
+
+	pr_err("for_each_set_bit_from:%16llu ns, %6ld iterations\n", time, cnt);
+#endif
+	return 0;
+}
+
 static int __init find_bit_test(void)
 {
 	unsigned long nbits = BITMAP_LEN / SPARSE;
@@ -147,6 +200,16 @@ static int __init find_bit_test(void)
 	test_find_first_bit(bitmap, BITMAP_LEN);
 	test_find_next_and_bit(bitmap, bitmap2, BITMAP_LEN);
 
+	pr_err("\nStart testing for_each_bit()\n");
+
+	test_for_each_set_bit(0);
+	test_for_each_set_bit_from(0);
+
+	pr_err("\nStart testing for_each_bit() with cache flushing\n");
+
+	test_for_each_set_bit(1);
+	test_for_each_set_bit_from(1);
+
 	/*
 	 * Everything is OK. Return error just to let user run benchmark
 	 * again without annoying rmmod.

Since only bit 0 of bm is set, on each pass of the outer while loop:
 - for_each_set_bit() calls find_first_bit() once and find_next_bit() once;
 - for_each_set_bit_from() calls find_next_bit() twice.

On my AMD Ryzen 7 4700U, the results are:

Start testing for_each_bit()
for_each_set_bit:                15296 ns,   1000 iterations
for_each_set_bit_from:           15225 ns,   1000 iterations

Start testing for_each_bit() with cache flushing
for_each_set_bit:               547626 ns,   1000 iterations
for_each_set_bit_from:          497899 ns,   1000 iterations

With cold caches, for_each_set_bit_from() is ~10% faster than
for_each_set_bit() (497899 ns vs 547626 ns). Without flush_cache_all(),
the difference is within 0.5% (15225 ns vs 15296 ns), i.e. not
significant.

So, it looks reasonable to switch for_each_set_bit() to use
find_next_bit() only.

Thanks,
Yury


Thread overview: 14+ messages
2021-06-18 19:57 [PATCH 0/3] for_each_*_bit: move to find.h and reconsider Yury Norov
2021-06-18 19:57 ` [PATCH 1/3] include/linux: move for_each_bit() macros from bitops.h to find.h Yury Norov
2021-06-19 10:50   ` Andy Shevchenko
2021-06-18 19:57 ` [PATCH 2/3] find: micro-optimize for_each_{set,clear}_bit() Yury Norov
2021-06-19 10:50   ` Andy Shevchenko
2021-06-19 16:24   ` Marc Zyngier
2021-06-19 17:28     ` Yury Norov
2021-06-27 16:47       ` Yury Norov [this message]
2021-06-18 19:57 ` [PATCH 3/3] Replace for_each_*_bit_from() with for_each_*_bit() where appropriate Yury Norov
2021-06-19 10:49   ` Andy Shevchenko
2021-06-19 10:55     ` Andy Shevchenko
2021-06-21 20:17   ` Guenter Roeck
2021-06-21 21:34     ` Yury Norov
2021-07-28 14:57 ` [PATCH 0/3] for_each_*_bit: move to find.h and reconsider Yury Norov
