* [PATCH v2 00/35] bitops: add atomic find_bit() operations
@ 2023-12-03 19:23 Yury Norov
  2023-12-03 19:23 ` [PATCH v2 01/35] lib/find: add atomic find_bit() primitives Yury Norov
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Yury Norov @ 2023-12-03 19:23 UTC
  To: linux-kernel, David S. Miller, H. Peter Anvin,
	James E.J. Bottomley, K. Y. Srinivasan, Md. Haris Iqbal,
	Akinobu Mita, Andrew Morton, Bjorn Andersson, Borislav Petkov,
	Chaitanya Kulkarni, Christian Brauner, Damien Le Moal,
	Dave Hansen, David Disseldorp, Edward Cree, Eric Dumazet,
	Fenghua Yu, Geert Uytterhoeven, Greg Kroah-Hartman,
	Gregory Greenman, Hans Verkuil, Hans de Goede, Hugh Dickins,
	Ingo Molnar, Jakub Kicinski, Jaroslav Kysela, Jason Gunthorpe,
	Jens Axboe, Jiri Pirko, Jiri Slaby, Kalle Valo, Karsten Graul,
	Karsten Keil, Kees Cook, Leon Romanovsky, Mark Rutland,
	Martin Habets, Mauro Carvalho Chehab, Michael Ellerman,
	Michal Simek, Nicholas Piggin, Oliver Neukum, Paolo Abeni,
	Paolo Bonzini, Peter Zijlstra, Ping-Ke Shih, Rich Felker,
	Rob Herring, Robin Murphy, Sean Christopherson, Shuai Xue,
	Stanislaw Gruszka, Steven Rostedt, Thomas Bogendoerfer,
	Thomas Gleixner, Valentin Schneider, Vitaly Kuznetsov,
	Wenjia Zhang, Will Deacon, Yoshinori Sato,
	GR-QLogic-Storage-Upstream, alsa-devel, ath10k, dmaengine, iommu,
	kvm, linux-arm-kernel, linux-arm-msm, linux-block,
	linux-bluetooth, linux-hyperv, linux-m68k, linux-media,
	linux-mips, linux-net-drivers, linux-pci, linux-rdma, linux-s390,
	linux-scsi, linux-serial, linux-sh, linux-sound, linux-usb,
	linux-wireless, linuxppc-dev, mpi3mr-linuxdrv.pdl, netdev,
	sparclinux, x86
  Cc: Yury Norov, Jan Kara, Mirsad Todorovac, Matthew Wilcox,
	Rasmus Villemoes, Andy Shevchenko, Maxim Kuvyrkov, Alexey Klimov,
	Bart Van Assche, Sergey Shtylyov

Add helpers around test_and_{set,clear}_bit() that allow searching for
clear or set bits and flipping them atomically.

The target patterns may look like this:

	for (idx = 0; idx < nbits; idx++)
		if (test_and_clear_bit(idx, bitmap))
			do_something(idx);

Or like this:

	do {
		bit = find_first_bit(bitmap, nbits);
		if (bit >= nbits)
			return nbits;
	} while (!test_and_clear_bit(bit, bitmap));
	return bit;

In both cases, the open-coded loop may be converted to a single function
or iterator call, respectively:

	for_each_test_and_clear_bit(idx, bitmap, nbits)
		do_something(idx);

Or:
	return find_and_clear_bit(bitmap, nbits);

Obviously, the less routine code people have to write themselves, the
lower the probability of making a mistake.
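
To make the second pattern concrete outside the kernel, here is a minimal
userspace sketch of what a find_and_clear_bit()-style helper does. It is
not the implementation from this series: sketch_find_and_clear_bit(),
WORD_BITS and the use of C11 atomics with the GCC/Clang builtin
__builtin_ctzl() are assumptions made for illustration only.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical userspace analogue of find_and_clear_bit(): claim a set bit
 * below nbits and return its index, or return nbits if none could be claimed.
 */
static unsigned long sketch_find_and_clear_bit(atomic_ulong *map,
					       unsigned long nbits)
{
	assert(map);

	for (unsigned long base = 0; base < nbits; base += WORD_BITS) {
		atomic_ulong *word = &map[base / WORD_BITS];
		/* Single snapshot of the word; all checks below use this copy. */
		unsigned long val = atomic_load_explicit(word, memory_order_relaxed);

		/* Ignore bits at or above nbits in the last word. */
		if (nbits - base < WORD_BITS)
			val &= (1UL << (nbits - base)) - 1;

		while (val) {
			unsigned long bit = (unsigned long)__builtin_ctzl(val);
			unsigned long mask = 1UL << bit;

			/* Atomic test-and-clear: the bit is ours only if it was still set. */
			if (atomic_fetch_and(word, ~mask) & mask)
				return base + bit;

			/* Lost the race for this bit; try the next one in the snapshot. */
			val &= ~mask;
		}
	}

	return nbits;
}
```

Unlike the kernel helpers, which restart the search after a failed
test_and_clear_bit(), this sketch simply moves on to the next candidate in
its snapshot. Either way, a returned index below nbits means the caller
cleared that bit itself.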

Those are not only handy helpers; they also resolve a non-trivial
issue of using non-atomic find_bit() together with atomic
test_and_{set,clear}_bit().

The trick is that find_bit() implies that the bitmap is a regular
non-volatile piece of memory, so the compiler is allowed to use
optimization techniques such as re-fetching memory instead of caching it.

For example, find_first_bit() is implemented like this:

      for (idx = 0; idx * BITS_PER_LONG < sz; idx++) {
              val = addr[idx];
              if (val) {
                      sz = min(idx * BITS_PER_LONG + __ffs(val), sz);
                      break;
              }
      }

On register-memory architectures, like x86, the compiler may decide to
access memory twice: the first time to compare against 0, and the second
time to fetch the value to pass to __ffs().

When running find_first_bit() on volatile memory, the memory may get
changed in between, which may, for instance, lead to passing 0 to
__ffs(). The result of that call is undefined, so it is potentially
dangerous.

find_and_clear_bit(), as a wrapper around test_and_clear_bit(),
naturally treats the underlying bitmap as volatile memory and prevents
the compiler from applying such optimizations.
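
The hazard and its cure can be sketched in userspace as follows. This is
an illustration under assumptions, not code from the series:
sketch_first_set_in_word() is a hypothetical helper, the C11 relaxed
atomic load stands in for a READ_ONCE()-style single access, and
__builtin_ctzl() (a GCC/Clang builtin) stands in for __ffs().

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Hypothetical helper: index of the lowest set bit in one shared word,
 * or `none` if the word was observed as zero.
 */
static unsigned long sketch_first_set_in_word(atomic_ulong *word,
					      unsigned long none)
{
	unsigned long val;

	assert(word);

	/* Exactly one load; the zero check and the bit scan share this copy. */
	val = atomic_load_explicit(word, memory_order_relaxed);
	if (val == 0)
		return none;

	/* __builtin_ctzl(0) is undefined, but val is known to be non-zero here. */
	return (unsigned long)__builtin_ctzl(val);
}
```

Because the value is read once into a local, a concurrent writer can make
the answer stale, but it can no longer make the bit scan run on zero.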

KCSAN now catches exactly this type of situation and warns about
undercover memory modifications. We can use it to reveal improper usage
of find_bit(), and convert it to atomic find_and_*_bit() as appropriate.

The 1st patch of the series adds the following atomic primitives:

	find_and_set_bit(addr, nbits);
	find_and_set_next_bit(addr, nbits, start);
	...

Here the find_and_{set,clear} part refers to the corresponding
test_and_{set,clear}_bit() function. Suffixes like _wrap or _lock
derive their semantics from the corresponding find() or test() functions.

For brevity, the naming omits the fact that we search for a zero bit in
the find_and_set functions, and correspondingly for a set bit in the
find_and_clear functions.
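
As an illustration of the _wrap suffix, the sketch below searches for a
zero bit in [offset, nbits) first and, if none can be claimed, wraps
around to [0, offset). It is a single-word userspace approximation only:
range_mask(), sketch_set_in_range() and sketch_find_and_set_bit_wrap()
are hypothetical names, and C11 atomics plus __builtin_ctzl() replace the
kernel primitives.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Mask with bits [lo, hi) set; requires lo < hi <= WORD_BITS. */
static unsigned long range_mask(unsigned long lo, unsigned long hi)
{
	unsigned long below_hi = hi == WORD_BITS ? ~0UL : (1UL << hi) - 1;

	return below_hi & ~((1UL << lo) - 1);
}

/* Hypothetical helper: claim a zero bit in [start, end), else return end. */
static unsigned long sketch_set_in_range(atomic_ulong *word,
					 unsigned long start, unsigned long end)
{
	assert(end <= WORD_BITS);

	if (start >= end)
		return end;

	for (;;) {
		unsigned long val = atomic_load_explicit(word, memory_order_relaxed);
		unsigned long free = ~val & range_mask(start, end);
		unsigned long bit, mask;

		if (!free)
			return end;

		bit = (unsigned long)__builtin_ctzl(free);
		mask = 1UL << bit;
		/* Atomic test-and-set: the bit is ours only if it was still clear. */
		if (!(atomic_fetch_or(word, mask) & mask))
			return bit;
	}
}

/* Search [offset, nbits) first, then wrap around to [0, offset). */
static unsigned long sketch_find_and_set_bit_wrap(atomic_ulong *word,
						  unsigned long nbits,
						  unsigned long offset)
{
	unsigned long bit = sketch_set_in_range(word, offset, nbits);

	if (bit < nbits || offset == 0)
		return bit;

	bit = sketch_set_in_range(word, 0, offset);
	return bit < offset ? bit : nbits;
}
```

With an 8-bit map holding 0xF0 and offset 5, the upper range is full, so
the search wraps and claims bit 0.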

The patch also adds iterators with atomic semantics, like
for_each_test_and_set_bit(). Here, the naming rule is to simply prefix
the corresponding atomic operation with 'for_each'.
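
The shape of such an iterator can be sketched in userspace as below. This
is a single-word approximation under assumptions: sketch_clear_next(),
sketch_for_each_test_and_clear_bit() and sketch_drain() are hypothetical
names, and C11 atomics with __builtin_ctzl() replace the kernel
primitives.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: clear and return a set bit in [start, nbits), else nbits. */
static unsigned long sketch_clear_next(atomic_ulong *word, unsigned long nbits,
				       unsigned long start)
{
	assert(nbits <= WORD_BITS);

	while (start < nbits) {
		unsigned long val = atomic_load_explicit(word, memory_order_relaxed);
		unsigned long bit, mask;

		if (nbits < WORD_BITS)
			val &= (1UL << nbits) - 1;	/* drop bits at or above nbits */
		val &= ~0UL << start;			/* drop bits below start */
		if (!val)
			break;

		bit = (unsigned long)__builtin_ctzl(val);
		mask = 1UL << bit;
		if (atomic_fetch_and(word, ~mask) & mask)
			return bit;
		/* Someone else cleared it first; take a fresh snapshot. */
	}

	return nbits;
}

/* Visit every set bit, atomically clearing each one before the body runs. */
#define sketch_for_each_test_and_clear_bit(bit, word, nbits) \
	for ((bit) = 0; \
	     (bit) = sketch_clear_next((word), (nbits), (bit)), (bit) < (nbits); \
	     (bit)++)

/* Example user: drain the map, returning how many bits were claimed. */
static unsigned long sketch_drain(atomic_ulong *word, unsigned long nbits,
				  unsigned long *index_sum)
{
	unsigned long bit, count = 0;

	*index_sum = 0;
	sketch_for_each_test_and_clear_bit(bit, word, nbits) {
		*index_sum += bit;	/* stand-in for do_something(bit) */
		count++;
	}

	return count;
}
```

The loop body only ever runs for bits that this caller cleared itself,
which is the property the open-coded test_and_clear_bit() loop provides.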

This series is a result of discussion [1]. All find_bit() functions imply
exclusive access to the bitmaps. However, KCSAN reports quite a number
of warnings related to the find_bit() API. Some of them do not point
to real bugs, because in many situations people intentionally allow
concurrent bitmap operations.

If so, find_bit() can be annotated such that KCSAN will ignore it:

        bit = data_race(find_first_bit(bitmap, nbits));

This series addresses the other important case where people really need
atomic find ops. As the following patches show, the resulting code
looks safer and more verbose compared to open-coded loops followed by
atomic bit flips.

In [1] Mirsad reported a 2% slowdown in a single-thread search test when
switching the find_bit() functions to treat bitmaps as volatile arrays. On
the other hand, the kernel robot in the same thread reported a 3.7%
improvement in the will-it-scale.per_thread_ops test.

Assuming that our compilers are sane and generate better code against
properly annotated data, the above discrepancy doesn't look weird. When
running on non-volatile bitmaps, plain find_bit() outperforms atomic
find_and_bit(), and vice-versa.

So, all users of find_bit() API, where heavy concurrency is expected,
are encouraged to switch to atomic find_and_bit() as appropriate.

The 1st patch of this series adds the atomic find_and_bit() API, the 2nd
adds a basic test for the new API, and all the following patches spread it
over the kernel.

They can be applied separately from each other on a per-subsystem basis,
or I can pull them into the bitmap tree, as appropriate.

[1] https://lore.kernel.org/lkml/634f5fdf-e236-42cf-be8d-48a581c21660@alu.unizg.hr/T/#m3e7341eb3571753f3acf8fe166f3fb5b2c12e615
---
v1: https://lore.kernel.org/netdev/20231118155105.25678-29-yury.norov@gmail.com/T/
v2:
 - Add a basic test for the new API # Bart Van Assche;
 - Add collected reviewers' tags. Thank you guys!
 - Fix typos where found or pointed out;
 - Drop erroneous patch #v1-31 ("drivers/perf: optimize m1_pmu_get_event_idx()...") @ Marc Zyngier;
 - Drop unneeded patch #v1-12 ("wifi: intel: use atomic find_bit() API...") @ Johannes Berg;
 - Patch #v1-15: split SCSI changes per subsystems @ Bart Van Assche;
 - Patch  #5: keep changes inside __mm_cid_try_get() @ Mathieu Desnoyers;
 - Patch  #8: use find_and_set_next_bit() @ Will Deacon;
 - Patch #13: keep test against stimer->config.enable @ Vitaly Kuznetsov;
 - Patch #15: use find_and_set_next_bit @ Bart Van Assche;
 - Patch #31: edit commit message @ Tony Lu, Alexandra Winter;
 - Patch #35: edit tag @ John Paul Adrian Glaubitz;

Yury Norov (35):
  lib/find: add atomic find_bit() primitives
  lib/find: add test for atomic find_bit() ops
  lib/sbitmap; make __sbitmap_get_word() using find_and_set_bit()
  watch_queue: use atomic find_bit() in post_one_notification()
  sched: add cpumask_find_and_set() and use it in __mm_cid_get()
  mips: sgi-ip30: rework heart_alloc_int()
  sparc: fix opencoded find_and_set_bit() in alloc_msi()
  perf/arm: optimize opencoded atomic find_bit() API
  drivers/perf: optimize ali_drw_get_counter_idx() by using find_bit()
  dmaengine: idxd: optimize perfmon_assign_event()
  ath10k: optimize ath10k_snoc_napi_poll() by using find_bit()
  wifi: rtw88: optimize rtw_pci_tx_kick_off() by using find_bit()
  KVM: x86: hyper-v: optimize and cleanup kvm_hv_process_stimers()
  PCI: hv: switch hv_get_dom_num() to use atomic find_bit()
  scsi: core: use atomic find_bit() API where appropriate
  scsi: mpi3mr: switch to using atomic find_and_set_bit()
  scsi: qedi: rework qedi_get_task_idx()
  powerpc: use atomic find_bit() API where appropriate
  iommu: use atomic find_bit() API where appropriate
  media: radio-shark: use atomic find_bit() API where appropriate
  sfc: switch to using atomic find_bit() API where appropriate
  tty: nozomi: optimize interrupt_handler()
  usb: cdc-acm: optimize acm_softint()
  block: null_blk: fix opencoded find_and_set_bit() in get_tag()
  RDMA/rtrs: fix opencoded find_and_set_bit_lock() in
    __rtrs_get_permit()
  mISDN: optimize get_free_devid()
  media: em28xx: cx231xx: fix opencoded find_and_set_bit()
  ethernet: rocker: optimize ofdpa_port_internal_vlan_id_get()
  serial: sc12is7xx: optimize sc16is7xx_alloc_line()
  bluetooth: optimize cmtp_alloc_block_id()
  net: smc:  use find_and_set_bit() in smc_wr_tx_get_free_slot_index()
  ALSA: use atomic find_bit() functions where applicable
  m68k: rework get_mmu_context()
  microblaze: rework get_mmu_context()
  sh: mach-x3proto: rework ilsel_enable()

 arch/m68k/include/asm/mmu_context.h          |  11 +-
 arch/microblaze/include/asm/mmu_context_mm.h |  11 +-
 arch/mips/sgi-ip30/ip30-irq.c                |  12 +-
 arch/powerpc/mm/book3s32/mmu_context.c       |  10 +-
 arch/powerpc/platforms/pasemi/dma_lib.c      |  45 +--
 arch/powerpc/platforms/powernv/pci-sriov.c   |  12 +-
 arch/sh/boards/mach-x3proto/ilsel.c          |   4 +-
 arch/sparc/kernel/pci_msi.c                  |   9 +-
 arch/x86/kvm/hyperv.c                        |  39 ++-
 drivers/block/null_blk/main.c                |  41 +--
 drivers/dma/idxd/perfmon.c                   |   8 +-
 drivers/infiniband/ulp/rtrs/rtrs-clt.c       |  15 +-
 drivers/iommu/arm/arm-smmu/arm-smmu.h        |  10 +-
 drivers/iommu/msm_iommu.c                    |  18 +-
 drivers/isdn/mISDN/core.c                    |   9 +-
 drivers/media/radio/radio-shark.c            |   5 +-
 drivers/media/radio/radio-shark2.c           |   5 +-
 drivers/media/usb/cx231xx/cx231xx-cards.c    |  16 +-
 drivers/media/usb/em28xx/em28xx-cards.c      |  37 +--
 drivers/net/ethernet/rocker/rocker_ofdpa.c   |  11 +-
 drivers/net/ethernet/sfc/rx_common.c         |   4 +-
 drivers/net/ethernet/sfc/siena/rx_common.c   |   4 +-
 drivers/net/ethernet/sfc/siena/siena_sriov.c |  14 +-
 drivers/net/wireless/ath/ath10k/snoc.c       |   9 +-
 drivers/net/wireless/realtek/rtw88/pci.c     |   5 +-
 drivers/net/wireless/realtek/rtw89/pci.c     |   5 +-
 drivers/pci/controller/pci-hyperv.c          |   7 +-
 drivers/perf/alibaba_uncore_drw_pmu.c        |  10 +-
 drivers/perf/arm-cci.c                       |  24 +-
 drivers/perf/arm-ccn.c                       |  10 +-
 drivers/perf/arm_dmc620_pmu.c                |   9 +-
 drivers/perf/arm_pmuv3.c                     |   8 +-
 drivers/scsi/mpi3mr/mpi3mr_os.c              |  21 +-
 drivers/scsi/qedi/qedi_main.c                |   9 +-
 drivers/scsi/scsi_lib.c                      |   7 +-
 drivers/tty/nozomi.c                         |   5 +-
 drivers/tty/serial/sc16is7xx.c               |   8 +-
 drivers/usb/class/cdc-acm.c                  |   5 +-
 include/linux/cpumask.h                      |  12 +
 include/linux/find.h                         | 293 +++++++++++++++++++
 kernel/sched/sched.h                         |  14 +-
 kernel/watch_queue.c                         |   6 +-
 lib/find_bit.c                               |  85 ++++++
 lib/sbitmap.c                                |  46 +--
 lib/test_bitmap.c                            |  61 ++++
 net/bluetooth/cmtp/core.c                    |  10 +-
 net/smc/smc_wr.c                             |  10 +-
 sound/pci/hda/hda_codec.c                    |   7 +-
 sound/usb/caiaq/audio.c                      |  13 +-
 49 files changed, 629 insertions(+), 420 deletions(-)

-- 
2.40.1



* [PATCH v2 01/35] lib/find: add atomic find_bit() primitives
  2023-12-03 19:23 [PATCH v2 00/35] bitops: add atomic find_bit() operations Yury Norov
@ 2023-12-03 19:23 ` Yury Norov
  2023-12-03 19:32 ` [PATCH v2 02/35] lib/find: add test for atomic find_bit() ops Yury Norov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Yury Norov @ 2023-12-03 19:23 UTC

Add helpers around test_and_{set,clear}_bit() that allow searching for
clear or set bits and flipping them atomically.

The target patterns may look like this:

	for (idx = 0; idx < nbits; idx++)
		if (test_and_clear_bit(idx, bitmap))
			do_something(idx);

Or like this:

	do {
		bit = find_first_bit(bitmap, nbits);
		if (bit >= nbits)
			return nbits;
	} while (!test_and_clear_bit(bit, bitmap));
	return bit;

In both cases, the open-coded loop may be converted to a single function
or iterator call, respectively:

	for_each_test_and_clear_bit(idx, bitmap, nbits)
		do_something(idx);

Or:
	return find_and_clear_bit(bitmap, nbits);

Obviously, the less routine code people have to write themselves, the
lower the probability of making a mistake.

Those are not only handy helpers; they also resolve a non-trivial
issue of using non-atomic find_bit() together with atomic
test_and_{set,clear}_bit().

The trick is that find_bit() implies that the bitmap is a regular
non-volatile piece of memory, so the compiler is allowed to use
optimization techniques such as re-fetching memory instead of caching it.

For example, find_first_bit() is implemented like this:

      for (idx = 0; idx * BITS_PER_LONG < sz; idx++) {
              val = addr[idx];
              if (val) {
                      sz = min(idx * BITS_PER_LONG + __ffs(val), sz);
                      break;
              }
      }

On register-memory architectures, like x86, the compiler may decide to
access memory twice: the first time to compare against 0, and the second
time to fetch the value to pass to __ffs().

When running find_first_bit() on volatile memory, the memory may get
changed in between, which may, for instance, lead to passing 0 to
__ffs(). The result of that call is undefined, so it is potentially
dangerous.

find_and_clear_bit(), as a wrapper around test_and_clear_bit(),
naturally treats the underlying bitmap as volatile memory and prevents
the compiler from applying such optimizations.

KCSAN now catches exactly this type of situation and warns about
undercover memory modifications. We can use it to reveal improper usage
of find_bit(), and convert it to atomic find_and_*_bit() as appropriate.

The 1st patch of the series adds the following atomic primitives:

	find_and_set_bit(addr, nbits);
	find_and_set_next_bit(addr, nbits, start);
	...

Here the find_and_{set,clear} part refers to the corresponding
test_and_{set,clear}_bit() function. Suffixes like _wrap or _lock
derive their semantics from the corresponding find() or test() functions.

For brevity, the naming omits the fact that we search for a zero bit in
the find_and_set functions, and correspondingly for a set bit in the
find_and_clear functions.

The patch also adds iterators with atomic semantics, like
for_each_test_and_set_bit(). Here, the naming rule is to simply prefix
the corresponding atomic operation with 'for_each'.

All users of find_bit() API, where heavy concurrency is expected,
are encouraged to switch to atomic find_and_bit() as appropriate.

CC: Bart Van Assche <bvanassche@acm.org>
CC: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/linux/find.h | 293 +++++++++++++++++++++++++++++++++++++++++++
 lib/find_bit.c       |  85 +++++++++++++
 2 files changed, 378 insertions(+)

diff --git a/include/linux/find.h b/include/linux/find.h
index 5e4f39ef2e72..79b0e2589725 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -32,6 +32,16 @@ extern unsigned long _find_first_and_bit(const unsigned long *addr1,
 extern unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size);
 extern unsigned long _find_last_bit(const unsigned long *addr, unsigned long size);
 
+unsigned long _find_and_set_bit(volatile unsigned long *addr, unsigned long nbits);
+unsigned long _find_and_set_next_bit(volatile unsigned long *addr, unsigned long nbits,
+				unsigned long start);
+unsigned long _find_and_set_bit_lock(volatile unsigned long *addr, unsigned long nbits);
+unsigned long _find_and_set_next_bit_lock(volatile unsigned long *addr, unsigned long nbits,
+					  unsigned long start);
+unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long nbits);
+unsigned long _find_and_clear_next_bit(volatile unsigned long *addr, unsigned long nbits,
+				unsigned long start);
+
 #ifdef __BIG_ENDIAN
 unsigned long _find_first_zero_bit_le(const unsigned long *addr, unsigned long size);
 unsigned long _find_next_zero_bit_le(const  unsigned long *addr, unsigned
@@ -460,6 +470,267 @@ unsigned long __for_each_wrap(const unsigned long *bitmap, unsigned long size,
 	return bit < start ? bit : size;
 }
 
+/**
+ * find_and_set_bit - Find a zero bit and set it atomically
+ * @addr: The address to base the search on
+ * @nbits: The bitmap size in bits
+ *
+ * This function is designed to operate in concurrent access environment.
+ *
+ * Because of concurrency and volatile nature of underlying bitmap, it's not
+ * guaranteed that the found bit is the 1st bit in the bitmap. It's also not
+ * guaranteed that if @nbits is returned, the bitmap is full.
+ *
+ * The function does guarantee that if returned value is in range [0 .. @nbits),
+ * the acquired bit belongs to the caller exclusively.
+ *
+ * Returns: found and set bit, or @nbits if no bits found
+ */
+static inline
+unsigned long find_and_set_bit(volatile unsigned long *addr, unsigned long nbits)
+{
+	if (small_const_nbits(nbits)) {
+		unsigned long val, ret;
+
+		do {
+			val = *addr | ~GENMASK(nbits - 1, 0);
+			if (val == ~0UL)
+				return nbits;
+			ret = ffz(val);
+		} while (test_and_set_bit(ret, addr));
+
+		return ret;
+	}
+
+	return _find_and_set_bit(addr, nbits);
+}
+
+
+/**
+ * find_and_set_next_bit - Find a zero bit and set it, starting from @offset
+ * @addr: The address to base the search on
+ * @nbits: The bitmap size in bits
+ * @offset: The bitnumber to start searching at
+ *
+ * This function is designed to operate in concurrent access environment.
+ *
+ * Because of concurrency and volatile nature of underlying bitmap, it's not
+ * guaranteed that the found bit is the 1st bit in the bitmap, starting from @offset.
+ * It's also not guaranteed that if @nbits is returned, the bitmap is full.
+ *
+ * The function does guarantee that if returned value is in range [@offset .. @nbits),
+ * the acquired bit belongs to the caller exclusively.
+ *
+ * Returns: found and set bit, or @nbits if no bits found
+ */
+static inline
+unsigned long find_and_set_next_bit(volatile unsigned long *addr,
+				    unsigned long nbits, unsigned long offset)
+{
+	if (small_const_nbits(nbits)) {
+		unsigned long val, ret;
+
+		do {
+			val = *addr | ~GENMASK(nbits - 1, offset);
+			if (val == ~0UL)
+				return nbits;
+			ret = ffz(val);
+		} while (test_and_set_bit(ret, addr));
+
+		return ret;
+	}
+
+	return _find_and_set_next_bit(addr, nbits, offset);
+}
+
+/**
+ * find_and_set_bit_wrap - find and set bit starting at @offset, wrapping around zero
+ * @addr: The first address to base the search on
+ * @nbits: The bitmap size in bits
+ * @offset: The bitnumber to start searching at
+ *
+ * Returns: the bit number for the next clear bit, or first clear bit up to @offset,
+ * while atomically setting it. If no bits are found, returns @nbits.
+ */
+static inline
+unsigned long find_and_set_bit_wrap(volatile unsigned long *addr,
+					unsigned long nbits, unsigned long offset)
+{
+	unsigned long bit = find_and_set_next_bit(addr, nbits, offset);
+
+	if (bit < nbits || offset == 0)
+		return bit;
+
+	bit = find_and_set_bit(addr, offset);
+	return bit < offset ? bit : nbits;
+}
+
+/**
+ * find_and_set_bit_lock - find a zero bit, then set it atomically with lock
+ * @addr: The address to base the search on
+ * @nbits: The bitmap size in bits
+ *
+ * This function is designed to operate in concurrent access environment.
+ *
+ * Because of concurrency and volatile nature of underlying bitmap, it's not
+ * guaranteed that the found bit is the 1st bit in the bitmap. It's also not
+ * guaranteed that if @nbits is returned, the bitmap is full.
+ *
+ * The function does guarantee that if returned value is in range [0 .. @nbits),
+ * the acquired bit belongs to the caller exclusively.
+ *
+ * Returns: found and set bit, or @nbits if no bits found
+ */
+static inline
+unsigned long find_and_set_bit_lock(volatile unsigned long *addr, unsigned long nbits)
+{
+	if (small_const_nbits(nbits)) {
+		unsigned long val, ret;
+
+		do {
+			val = *addr | ~GENMASK(nbits - 1, 0);
+			if (val == ~0UL)
+				return nbits;
+			ret = ffz(val);
+		} while (test_and_set_bit_lock(ret, addr));
+
+		return ret;
+	}
+
+	return _find_and_set_bit_lock(addr, nbits);
+}
+
+/**
+ * find_and_set_next_bit_lock - find a zero bit and set it atomically with lock
+ * @addr: The address to base the search on
+ * @nbits: The bitmap size in bits
+ * @offset: The bitnumber to start searching at
+ *
+ * This function is designed to operate in concurrent access environment.
+ *
+ * Because of concurrency and volatile nature of underlying bitmap, it's not
+ * guaranteed that the found bit is the 1st bit in the range. It's also not
+ * guaranteed that if @nbits is returned, the bitmap is full.
+ *
+ * The function does guarantee that if returned value is in range [@offset .. @nbits),
+ * the acquired bit belongs to the caller exclusively.
+ *
+ * Returns: found and set bit, or @nbits if no bits found
+ */
+static inline
+unsigned long find_and_set_next_bit_lock(volatile unsigned long *addr,
+					 unsigned long nbits, unsigned long offset)
+{
+	if (small_const_nbits(nbits)) {
+		unsigned long val, ret;
+
+		do {
+			val = *addr | ~GENMASK(nbits - 1, offset);
+			if (val == ~0UL)
+				return nbits;
+			ret = ffz(val);
+		} while (test_and_set_bit_lock(ret, addr));
+
+		return ret;
+	}
+
+	return _find_and_set_next_bit_lock(addr, nbits, offset);
+}
+
+/**
+ * find_and_set_bit_wrap_lock - find zero bit starting at @offset and set it
+ *				with lock, and wrap around zero if nothing found
+ * @addr: The first address to base the search on
+ * @nbits: The bitmap size in bits
+ * @offset: The bitnumber to start searching at
+ *
+ * Returns: the bit number for the next clear bit, or first clear bit up to @offset,
+ * while atomically setting it with lock. If no bits are found, returns @nbits.
+ */
+static inline
+unsigned long find_and_set_bit_wrap_lock(volatile unsigned long *addr,
+					unsigned long nbits, unsigned long offset)
+{
+	unsigned long bit = find_and_set_next_bit_lock(addr, nbits, offset);
+
+	if (bit < nbits || offset == 0)
+		return bit;
+
+	bit = find_and_set_bit_lock(addr, offset);
+	return bit < offset ? bit : nbits;
+}
+
+/**
+ * find_and_clear_bit - Find a set bit and clear it atomically
+ * @addr: The address to base the search on
+ * @nbits: The bitmap size in bits
+ *
+ * This function is designed to operate in concurrent access environment.
+ *
+ * Because of concurrency and volatile nature of underlying bitmap, it's not
+ * guaranteed that the found bit is the 1st bit in the bitmap. It's also not
+ * guaranteed that if @nbits is returned, the bitmap is empty.
+ *
+ * The function does guarantee that if returned value is in range [0 .. @nbits),
+ * the acquired bit belongs to the caller exclusively.
+ *
+ * Returns: found and cleared bit, or @nbits if no bits found
+ */
+static inline unsigned long find_and_clear_bit(volatile unsigned long *addr, unsigned long nbits)
+{
+	if (small_const_nbits(nbits)) {
+		unsigned long val, ret;
+
+		do {
+			val = *addr & GENMASK(nbits - 1, 0);
+			if (val == 0)
+				return nbits;
+			ret = __ffs(val);
+		} while (!test_and_clear_bit(ret, addr));
+
+		return ret;
+	}
+
+	return _find_and_clear_bit(addr, nbits);
+}
+
+/**
+ * find_and_clear_next_bit - Find a set bit next after @offset, and clear it atomically
+ * @addr: The address to base the search on
+ * @nbits: The bitmap size in bits
+ * @offset: bit offset at which to start searching
+ *
+ * This function is designed to operate in concurrent access environment.
+ *
+ * Because of concurrency and volatile nature of underlying bitmap, it's not
+ * guaranteed that the found bit is the 1st bit in the range. It's also not
+ * guaranteed that if @nbits is returned, there's no set bits after @offset.
+ *
+ * The function does guarantee that if returned value is in range [@offset .. @nbits),
+ * the acquired bit belongs to the caller exclusively.
+ *
+ * Returns: found and cleared bit, or @nbits if no bits found
+ */
+static inline
+unsigned long find_and_clear_next_bit(volatile unsigned long *addr,
+					unsigned long nbits, unsigned long offset)
+{
+	if (small_const_nbits(nbits)) {
+		unsigned long val, ret;
+
+		do {
+			val = *addr & GENMASK(nbits - 1, offset);
+			if (val == 0)
+				return nbits;
+			ret = __ffs(val);
+		} while (!test_and_clear_bit(ret, addr));
+
+		return ret;
+	}
+
+	return _find_and_clear_next_bit(addr, nbits, offset);
+}
+
 /**
  * find_next_clump8 - find next 8-bit clump with set bits in a memory region
  * @clump: location to store copy of found clump
@@ -577,6 +848,28 @@ unsigned long find_next_bit_le(const void *addr, unsigned
 #define for_each_set_bit_from(bit, addr, size) \
 	for (; (bit) = find_next_bit((addr), (size), (bit)), (bit) < (size); (bit)++)
 
+/* same as for_each_set_bit() but atomically clears each found bit */
+#define for_each_test_and_clear_bit(bit, addr, size) \
+	for ((bit) = 0; \
+	     (bit) = find_and_clear_next_bit((addr), (size), (bit)), (bit) < (size); \
+	     (bit)++)
+
+/* same as for_each_set_bit_from() but atomically clears each found bit */
+#define for_each_test_and_clear_bit_from(bit, addr, size) \
+	for (; (bit) = find_and_clear_next_bit((addr), (size), (bit)), (bit) < (size); (bit)++)
+
+/* same as for_each_clear_bit() but atomically sets each found bit */
+#define for_each_test_and_set_bit(bit, addr, size) \
+	for ((bit) = 0; \
+	     (bit) = find_and_set_next_bit((addr), (size), (bit)), (bit) < (size); \
+	     (bit)++)
+
+/* same as for_each_clear_bit_from() but atomically sets each found bit */
+#define for_each_test_and_set_bit_from(bit, addr, size) \
+	for (; \
+	     (bit) = find_and_set_next_bit((addr), (size), (bit)), (bit) < (size); \
+	     (bit)++)
+
 #define for_each_clear_bit(bit, addr, size) \
 	for ((bit) = 0;									\
 	     (bit) = find_next_zero_bit((addr), (size), (bit)), (bit) < (size);		\
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 32f99e9a670e..c9b6b9f96610 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -116,6 +116,91 @@ unsigned long _find_first_and_bit(const unsigned long *addr1,
 EXPORT_SYMBOL(_find_first_and_bit);
 #endif
 
+unsigned long _find_and_set_bit(volatile unsigned long *addr, unsigned long nbits)
+{
+	unsigned long bit;
+
+	do {
+		bit = FIND_FIRST_BIT(~addr[idx], /* nop */, nbits);
+		if (bit >= nbits)
+			return nbits;
+	} while (test_and_set_bit(bit, addr));
+
+	return bit;
+}
+EXPORT_SYMBOL(_find_and_set_bit);
+
+unsigned long _find_and_set_next_bit(volatile unsigned long *addr,
+				     unsigned long nbits, unsigned long start)
+{
+	unsigned long bit;
+
+	do {
+		bit = FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start);
+		if (bit >= nbits)
+			return nbits;
+	} while (test_and_set_bit(bit, addr));
+
+	return bit;
+}
+EXPORT_SYMBOL(_find_and_set_next_bit);
+
+unsigned long _find_and_set_bit_lock(volatile unsigned long *addr, unsigned long nbits)
+{
+	unsigned long bit;
+
+	do {
+		bit = FIND_FIRST_BIT(~addr[idx], /* nop */, nbits);
+		if (bit >= nbits)
+			return nbits;
+	} while (test_and_set_bit_lock(bit, addr));
+
+	return bit;
+}
+EXPORT_SYMBOL(_find_and_set_bit_lock);
+
+unsigned long _find_and_set_next_bit_lock(volatile unsigned long *addr,
+					  unsigned long nbits, unsigned long start)
+{
+	unsigned long bit;
+
+	do {
+		bit = FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start);
+		if (bit >= nbits)
+			return nbits;
+	} while (test_and_set_bit_lock(bit, addr));
+
+	return bit;
+}
+EXPORT_SYMBOL(_find_and_set_next_bit_lock);
+
+unsigned long _find_and_clear_bit(volatile unsigned long *addr, unsigned long nbits)
+{
+	unsigned long bit;
+
+	do {
+		bit = FIND_FIRST_BIT(addr[idx], /* nop */, nbits);
+		if (bit >= nbits)
+			return nbits;
+	} while (!test_and_clear_bit(bit, addr));
+
+	return bit;
+}
+EXPORT_SYMBOL(_find_and_clear_bit);
+
+unsigned long _find_and_clear_next_bit(volatile unsigned long *addr,
+					unsigned long nbits, unsigned long start)
+{
+	do {
+		start =  FIND_NEXT_BIT(addr[idx], /* nop */, nbits, start);
+		if (start >= nbits)
+			return nbits;
+	} while (!test_and_clear_bit(start, addr));
+
+	return start;
+}
+EXPORT_SYMBOL(_find_and_clear_next_bit);
+
 #ifndef find_first_zero_bit
 /*
  * Find the first cleared bit in a memory region.
-- 
2.40.1



* [PATCH v2 02/35] lib/find: add test for atomic find_bit() ops
  2023-12-03 19:23 [PATCH v2 00/35] bitops: add atomic find_bit() operations Yury Norov
  2023-12-03 19:23 ` [PATCH v2 01/35] lib/find: add atomic find_bit() primitives Yury Norov
@ 2023-12-03 19:32 ` Yury Norov
  2023-12-03 19:32   ` [PATCH v2 21/35] sfc: switch to using atomic find_bit() API where appropriate Yury Norov
                     ` (4 more replies)
  2023-12-04 13:07 ` [PATCH v2 00/35] bitops: add atomic find_bit() operations Andy Shevchenko
  2023-12-04 18:51 ` Jan Kara
  3 siblings, 5 replies; 14+ messages in thread
From: Yury Norov @ 2023-12-03 19:32 UTC

Add a basic functionality test for the new API.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 lib/test_bitmap.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
index 65f22c2578b0..277e1ca9fd28 100644
--- a/lib/test_bitmap.c
+++ b/lib/test_bitmap.c
@@ -221,6 +221,65 @@ static void __init test_zero_clear(void)
 	expect_eq_pbl("", bmap, 1024);
 }
 
+static void __init test_find_and_bit(void)
+{
+	unsigned long w, w_part, bit, cnt = 0;
+	DECLARE_BITMAP(bmap, EXP1_IN_BITS);
+
+	/*
+	 * Test find_and_clear{_next}_bit() and corresponding
+	 * iterators
+	 */
+	bitmap_copy(bmap, exp1, EXP1_IN_BITS);
+	w = bitmap_weight(bmap, EXP1_IN_BITS);
+
+	for_each_test_and_clear_bit(bit, bmap, EXP1_IN_BITS)
+		cnt++;
+
+	expect_eq_uint(w, cnt);
+	expect_eq_uint(0, bitmap_weight(bmap, EXP1_IN_BITS));
+
+	bitmap_copy(bmap, exp1, EXP1_IN_BITS);
+	w = bitmap_weight(bmap, EXP1_IN_BITS);
+	w_part = bitmap_weight(bmap, EXP1_IN_BITS / 3);
+
+	cnt = 0;
+	bit = EXP1_IN_BITS / 3;
+	for_each_test_and_clear_bit_from(bit, bmap, EXP1_IN_BITS)
+		cnt++;
+
+	expect_eq_uint(bitmap_weight(bmap, EXP1_IN_BITS), bitmap_weight(bmap, EXP1_IN_BITS / 3));
+	expect_eq_uint(w_part, bitmap_weight(bmap, EXP1_IN_BITS));
+	expect_eq_uint(w - w_part, cnt);
+
+	/*
+	 * Test find_and_set{_next}_bit() and corresponding
+	 * iterators
+	 */
+	bitmap_copy(bmap, exp1, EXP1_IN_BITS);
+	w = bitmap_weight(bmap, EXP1_IN_BITS);
+	cnt = 0;
+
+	for_each_test_and_set_bit(bit, bmap, EXP1_IN_BITS)
+		cnt++;
+
+	expect_eq_uint(EXP1_IN_BITS - w, cnt);
+	expect_eq_uint(EXP1_IN_BITS, bitmap_weight(bmap, EXP1_IN_BITS));
+
+	bitmap_copy(bmap, exp1, EXP1_IN_BITS);
+	w = bitmap_weight(bmap, EXP1_IN_BITS);
+	w_part = bitmap_weight(bmap, EXP1_IN_BITS / 3);
+	cnt = 0;
+
+	bit = EXP1_IN_BITS / 3;
+	for_each_test_and_set_bit_from(bit, bmap, EXP1_IN_BITS)
+		cnt++;
+
+	expect_eq_uint(EXP1_IN_BITS - bitmap_weight(bmap, EXP1_IN_BITS),
+			EXP1_IN_BITS / 3 - bitmap_weight(bmap, EXP1_IN_BITS / 3));
+	expect_eq_uint(EXP1_IN_BITS * 2 / 3 - (w - w_part), cnt);
+}
+
 static void __init test_find_nth_bit(void)
 {
 	unsigned long b, bit, cnt = 0;
@@ -1273,6 +1332,8 @@ static void __init selftest(void)
 	test_for_each_clear_bitrange_from();
 	test_for_each_set_clump8();
 	test_for_each_set_bit_wrap();
+
+	test_find_and_bit();
 }
 
 KSTM_MODULE_LOADERS(test_bitmap);
-- 
2.40.1



* [PATCH v2 21/35] sfc: switch to using atomic find_bit() API where appropriate
  2023-12-03 19:32 ` [PATCH v2 02/35] lib/find: add test for atomic find_bit() ops Yury Norov
@ 2023-12-03 19:32   ` Yury Norov
  2023-12-03 19:32   ` [PATCH v2 26/35] mISDN: optimize get_free_devid() Yury Norov
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Yury Norov @ 2023-12-03 19:32 UTC (permalink / raw)
  To: linux-kernel, Edward Cree, Martin Habets, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Yury Norov, netdev,
	linux-net-drivers
  Cc: Jan Kara, Mirsad Todorovac, Matthew Wilcox, Rasmus Villemoes,
	Andy Shevchenko, Maxim Kuvyrkov, Alexey Klimov, Bart Van Assche,
	Sergey Shtylyov

SFC code traverses rps_slot_map and rxq_retry_mask bit by bit. We can do
it better by using dedicated atomic find_bit() functions, because they
skip bits that are already in the target state.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
---
 drivers/net/ethernet/sfc/rx_common.c         |  4 +---
 drivers/net/ethernet/sfc/siena/rx_common.c   |  4 +---
 drivers/net/ethernet/sfc/siena/siena_sriov.c | 14 ++++++--------
 3 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/sfc/rx_common.c b/drivers/net/ethernet/sfc/rx_common.c
index d2f35ee15eff..0112968b3fe7 100644
--- a/drivers/net/ethernet/sfc/rx_common.c
+++ b/drivers/net/ethernet/sfc/rx_common.c
@@ -950,9 +950,7 @@ int efx_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
 	int rc;
 
 	/* find a free slot */
-	for (slot_idx = 0; slot_idx < EFX_RPS_MAX_IN_FLIGHT; slot_idx++)
-		if (!test_and_set_bit(slot_idx, &efx->rps_slot_map))
-			break;
+	slot_idx = find_and_set_bit(&efx->rps_slot_map, EFX_RPS_MAX_IN_FLIGHT);
 	if (slot_idx >= EFX_RPS_MAX_IN_FLIGHT)
 		return -EBUSY;
 
diff --git a/drivers/net/ethernet/sfc/siena/rx_common.c b/drivers/net/ethernet/sfc/siena/rx_common.c
index 4579f43484c3..160b16aa7486 100644
--- a/drivers/net/ethernet/sfc/siena/rx_common.c
+++ b/drivers/net/ethernet/sfc/siena/rx_common.c
@@ -958,9 +958,7 @@ int efx_siena_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
 	int rc;
 
 	/* find a free slot */
-	for (slot_idx = 0; slot_idx < EFX_RPS_MAX_IN_FLIGHT; slot_idx++)
-		if (!test_and_set_bit(slot_idx, &efx->rps_slot_map))
-			break;
+	slot_idx = find_and_set_bit(&efx->rps_slot_map, EFX_RPS_MAX_IN_FLIGHT);
 	if (slot_idx >= EFX_RPS_MAX_IN_FLIGHT)
 		return -EBUSY;
 
diff --git a/drivers/net/ethernet/sfc/siena/siena_sriov.c b/drivers/net/ethernet/sfc/siena/siena_sriov.c
index 8353c15dc233..554b799288b8 100644
--- a/drivers/net/ethernet/sfc/siena/siena_sriov.c
+++ b/drivers/net/ethernet/sfc/siena/siena_sriov.c
@@ -722,14 +722,12 @@ static int efx_vfdi_fini_all_queues(struct siena_vf *vf)
 					     efx_vfdi_flush_wake(vf),
 					     timeout);
 		rxqs_count = 0;
-		for (index = 0; index < count; ++index) {
-			if (test_and_clear_bit(index, vf->rxq_retry_mask)) {
-				atomic_dec(&vf->rxq_retry_count);
-				MCDI_SET_ARRAY_DWORD(
-					inbuf, FLUSH_RX_QUEUES_IN_QID_OFST,
-					rxqs_count, vf_offset + index);
-				rxqs_count++;
-			}
+		for_each_test_and_clear_bit(index, vf->rxq_retry_mask, count) {
+			atomic_dec(&vf->rxq_retry_count);
+			MCDI_SET_ARRAY_DWORD(
+				inbuf, FLUSH_RX_QUEUES_IN_QID_OFST,
+				rxqs_count, vf_offset + index);
+			rxqs_count++;
 		}
 	}
 
-- 
2.40.1



* [PATCH v2 26/35] mISDN: optimize get_free_devid()
  2023-12-03 19:32 ` [PATCH v2 02/35] lib/find: add test for atomic find_bit() ops Yury Norov
  2023-12-03 19:32   ` [PATCH v2 21/35] sfc: switch to using atomic find_bit() API where appropriate Yury Norov
@ 2023-12-03 19:32   ` Yury Norov
  2023-12-03 19:33   ` [PATCH v2 28/35] ethernet: rocker: optimize ofdpa_port_internal_vlan_id_get() Yury Norov
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Yury Norov @ 2023-12-03 19:32 UTC (permalink / raw)
  To: linux-kernel, Karsten Keil, netdev
  Cc: Yury Norov, Jan Kara, Mirsad Todorovac, Matthew Wilcox,
	Rasmus Villemoes, Andy Shevchenko, Maxim Kuvyrkov, Alexey Klimov,
	Bart Van Assche, Sergey Shtylyov

get_free_devid() traverses each bit in device_ids in an open-coded loop.
We can do it faster by using dedicated find_and_set_bit().

It makes the whole function a nice one-liner, and because MAX_DEVICE_ID
is a small compile-time constant (63), on 64-bit platforms the
find_and_set_bit() call will be optimized to:

	ffs();
	test_and_set_bit().
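
To illustrate the claim above, here is a minimal userspace sketch of the
single-word case. It is not the kernel implementation: find_and_set_bit64()
is a hypothetical name, C11 atomics stand in for test_and_set_bit(), and the
GCC/Clang builtin __builtin_ctzll() stands in for the ffs step. It searches
for the first zero bit in one snapshot of the word and retries if another
thread claims that bit first.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Userspace model of find_and_set_bit() for a bitmap that fits in one
 * 64-bit word. All names here are illustrative, not kernel API.
 */
static unsigned int find_and_set_bit64(_Atomic unsigned long long *map,
				       unsigned int nbits)
{
	assert(nbits <= 64);

	for (;;) {
		/* One snapshot of the word; zero bits are free IDs. */
		unsigned long long free_bits = ~atomic_load(map);
		unsigned long long mask;
		unsigned int bit;

		if (nbits < 64)
			free_bits &= (1ULL << nbits) - 1;
		if (!free_bits)
			return nbits;	/* nothing free */

		/* The "ffs" step: index of the lowest free bit. */
		bit = (unsigned int)__builtin_ctzll(free_bits);
		mask = 1ULL << bit;

		/* The "test_and_set_bit" step: retry if the race was lost. */
		if (!(atomic_fetch_or(map, mask) & mask))
			return bit;
	}
}
```

With IDs 0..2 already taken (map == 0x7), two consecutive calls with
nbits == 64 would hand out IDs 3 and 4; a full map returns nbits.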

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 drivers/isdn/mISDN/core.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/isdn/mISDN/core.c b/drivers/isdn/mISDN/core.c
index ab8513a7acd5..3f97db006cf3 100644
--- a/drivers/isdn/mISDN/core.c
+++ b/drivers/isdn/mISDN/core.c
@@ -197,14 +197,9 @@ get_mdevice_count(void)
 static int
 get_free_devid(void)
 {
-	u_int	i;
+	u_int i = find_and_set_bit((u_long *)&device_ids, MAX_DEVICE_ID + 1);
 
-	for (i = 0; i <= MAX_DEVICE_ID; i++)
-		if (!test_and_set_bit(i, (u_long *)&device_ids))
-			break;
-	if (i > MAX_DEVICE_ID)
-		return -EBUSY;
-	return i;
+	return i <= MAX_DEVICE_ID ? i : -EBUSY;
 }
 
 int
-- 
2.40.1



* [PATCH v2 28/35] ethernet: rocker: optimize ofdpa_port_internal_vlan_id_get()
  2023-12-03 19:32 ` [PATCH v2 02/35] lib/find: add test for atomic find_bit() ops Yury Norov
  2023-12-03 19:32   ` [PATCH v2 21/35] sfc: switch to using atomic find_bit() API where appropriate Yury Norov
  2023-12-03 19:32   ` [PATCH v2 26/35] mISDN: optimize get_free_devid() Yury Norov
@ 2023-12-03 19:33   ` Yury Norov
  2023-12-03 19:33   ` [PATCH v2 30/35] bluetooth: optimize cmtp_alloc_block_id() Yury Norov
  2023-12-03 19:33   ` [PATCH v2 31/35] net: smc: use find_and_set_bit() in smc_wr_tx_get_free_slot_index() Yury Norov
  4 siblings, 0 replies; 14+ messages in thread
From: Yury Norov @ 2023-12-03 19:33 UTC (permalink / raw)
  To: linux-kernel, Jiri Pirko, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev
  Cc: Yury Norov, Jan Kara, Mirsad Todorovac, Matthew Wilcox,
	Rasmus Villemoes, Andy Shevchenko, Maxim Kuvyrkov, Alexey Klimov,
	Bart Van Assche, Sergey Shtylyov

Optimize ofdpa_port_internal_vlan_id_get() by using find_and_set_bit()
instead of testing every bit of the bitmap in a for-loop.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 drivers/net/ethernet/rocker/rocker_ofdpa.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker_ofdpa.c b/drivers/net/ethernet/rocker/rocker_ofdpa.c
index 826990459fa4..449be8af7ffc 100644
--- a/drivers/net/ethernet/rocker/rocker_ofdpa.c
+++ b/drivers/net/ethernet/rocker/rocker_ofdpa.c
@@ -2249,14 +2249,11 @@ static __be16 ofdpa_port_internal_vlan_id_get(struct ofdpa_port *ofdpa_port,
 	found = entry;
 	hash_add(ofdpa->internal_vlan_tbl, &found->entry, found->ifindex);
 
-	for (i = 0; i < OFDPA_N_INTERNAL_VLANS; i++) {
-		if (test_and_set_bit(i, ofdpa->internal_vlan_bitmap))
-			continue;
+	i = find_and_set_bit(ofdpa->internal_vlan_bitmap, OFDPA_N_INTERNAL_VLANS);
+	if (i < OFDPA_N_INTERNAL_VLANS)
 		found->vlan_id = htons(OFDPA_INTERNAL_VLAN_ID_BASE + i);
-		goto found;
-	}
-
-	netdev_err(ofdpa_port->dev, "Out of internal VLAN IDs\n");
+	else
+		netdev_err(ofdpa_port->dev, "Out of internal VLAN IDs\n");
 
 found:
 	found->ref_count++;
-- 
2.40.1



* [PATCH v2 30/35] bluetooth: optimize cmtp_alloc_block_id()
  2023-12-03 19:32 ` [PATCH v2 02/35] lib/find: add test for atomic find_bit() ops Yury Norov
                     ` (2 preceding siblings ...)
  2023-12-03 19:33   ` [PATCH v2 28/35] ethernet: rocker: optimize ofdpa_port_internal_vlan_id_get() Yury Norov
@ 2023-12-03 19:33   ` Yury Norov
  2023-12-03 19:33   ` [PATCH v2 31/35] net: smc: use find_and_set_bit() in smc_wr_tx_get_free_slot_index() Yury Norov
  4 siblings, 0 replies; 14+ messages in thread
From: Yury Norov @ 2023-12-03 19:33 UTC (permalink / raw)
  To: linux-kernel, Karsten Keil, Marcel Holtmann, Johan Hedberg,
	Luiz Augusto von Dentz, Yury Norov, netdev, linux-bluetooth
  Cc: Jan Kara, Mirsad Todorovac, Matthew Wilcox, Rasmus Villemoes,
	Andy Shevchenko, Maxim Kuvyrkov, Alexey Klimov, Bart Van Assche,
	Sergey Shtylyov

Instead of polling every bit in blockids, switch it to using a
dedicated find_and_set_bit(), and make the function a simple one-liner.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 net/bluetooth/cmtp/core.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/net/bluetooth/cmtp/core.c b/net/bluetooth/cmtp/core.c
index 90d130588a3e..b1330acbbff3 100644
--- a/net/bluetooth/cmtp/core.c
+++ b/net/bluetooth/cmtp/core.c
@@ -88,15 +88,9 @@ static void __cmtp_copy_session(struct cmtp_session *session, struct cmtp_connin
 
 static inline int cmtp_alloc_block_id(struct cmtp_session *session)
 {
-	int i, id = -1;
+	int id = find_and_set_bit(&session->blockids, 16);
 
-	for (i = 0; i < 16; i++)
-		if (!test_and_set_bit(i, &session->blockids)) {
-			id = i;
-			break;
-		}
-
-	return id;
+	return id < 16 ? id : -1;
 }
 
 static inline void cmtp_free_block_id(struct cmtp_session *session, int id)
-- 
2.40.1



* [PATCH v2 31/35] net: smc: use find_and_set_bit() in smc_wr_tx_get_free_slot_index()
  2023-12-03 19:32 ` [PATCH v2 02/35] lib/find: add test for atomic find_bit() ops Yury Norov
                     ` (3 preceding siblings ...)
  2023-12-03 19:33   ` [PATCH v2 30/35] bluetooth: optimize cmtp_alloc_block_id() Yury Norov
@ 2023-12-03 19:33   ` Yury Norov
  2023-12-04  9:40     ` Alexandra Winter
  4 siblings, 1 reply; 14+ messages in thread
From: Yury Norov @ 2023-12-03 19:33 UTC (permalink / raw)
  To: linux-kernel, Karsten Graul, Wenjia Zhang, Jan Karcher, D. Wythe,
	Tony Lu, Wen Gu, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-s390, netdev
  Cc: Yury Norov, Jan Kara, Mirsad Todorovac, Matthew Wilcox,
	Rasmus Villemoes, Andy Shevchenko, Maxim Kuvyrkov, Alexey Klimov,
	Bart Van Assche, Sergey Shtylyov, Alexandra Winter

The function open-codes find_and_set_bit() with a for_each_clear_bit()
loop. Use the helper instead, which makes the function almost a one-liner.

While here, drop the explicit initialization of *idx: in the ENOLINK case
the caller has already initialized it, and in the EBUSY case
find_and_set_bit() returns ->wr_tx_cnt when nothing is found, so *idx is
still set properly.

CC: Tony Lu <tonylu@linux.alibaba.com>
CC: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 net/smc/smc_wr.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index 0021065a600a..b6f0cfc52788 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -170,15 +170,11 @@ void smc_wr_tx_cq_handler(struct ib_cq *ib_cq, void *cq_context)
 
 static inline int smc_wr_tx_get_free_slot_index(struct smc_link *link, u32 *idx)
 {
-	*idx = link->wr_tx_cnt;
 	if (!smc_link_sendable(link))
 		return -ENOLINK;
-	for_each_clear_bit(*idx, link->wr_tx_mask, link->wr_tx_cnt) {
-		if (!test_and_set_bit(*idx, link->wr_tx_mask))
-			return 0;
-	}
-	*idx = link->wr_tx_cnt;
-	return -EBUSY;
+
+	*idx = find_and_set_bit(link->wr_tx_mask, link->wr_tx_cnt);
+	return *idx < link->wr_tx_cnt ? 0 : -EBUSY;
 }
 
 /**
-- 
2.40.1



* Re: [PATCH v2 31/35] net: smc: use find_and_set_bit() in smc_wr_tx_get_free_slot_index()
  2023-12-03 19:33   ` [PATCH v2 31/35] net: smc: use find_and_set_bit() in smc_wr_tx_get_free_slot_index() Yury Norov
@ 2023-12-04  9:40     ` Alexandra Winter
  2023-12-11 22:34       ` Yury Norov
  0 siblings, 1 reply; 14+ messages in thread
From: Alexandra Winter @ 2023-12-04  9:40 UTC (permalink / raw)
  To: Yury Norov, linux-kernel, Karsten Graul, Wenjia Zhang,
	Jan Karcher, D. Wythe, Tony Lu, Wen Gu, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, linux-s390, netdev
  Cc: Jan Kara, Mirsad Todorovac, Matthew Wilcox, Rasmus Villemoes,
	Andy Shevchenko, Maxim Kuvyrkov, Alexey Klimov, Bart Van Assche,
	Sergey Shtylyov



On 03.12.23 20:33, Yury Norov wrote:
> The function open-codes find_and_set_bit() with a for_each_clear_bit()
> loop. Use the helper instead, which makes the function almost a one-liner.
> 
> While here, drop the explicit initialization of *idx: in the ENOLINK case
> the caller has already initialized it, and in the EBUSY case
> find_and_set_bit() returns ->wr_tx_cnt when nothing is found, so *idx is
> still set properly.
> 
> CC: Tony Lu <tonylu@linux.alibaba.com>
> CC: Alexandra Winter <wintera@linux.ibm.com>
> Signed-off-by: Yury Norov <yury.norov@gmail.com>
> ---

Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>


Thanks a lot for the great helper function!
I guess the top-level maintainers will figure out how this series best finds its way upstream.


* Re: [PATCH v2 00/35] bitops: add atomic find_bit() operations
  2023-12-03 19:23 [PATCH v2 00/35] bitops: add atomic find_bit() operations Yury Norov
  2023-12-03 19:23 ` [PATCH v2 01/35] lib/find: add atomic find_bit() primitives Yury Norov
  2023-12-03 19:32 ` [PATCH v2 02/35] lib/find: add test for atomic find_bit() ops Yury Norov
@ 2023-12-04 13:07 ` Andy Shevchenko
  2023-12-04 18:51 ` Jan Kara
  3 siblings, 0 replies; 14+ messages in thread
From: Andy Shevchenko @ 2023-12-04 13:07 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, David S. Miller, H. Peter Anvin,
	James E.J. Bottomley, K. Y. Srinivasan, Md. Haris Iqbal,
	Akinobu Mita, Andrew Morton, Bjorn Andersson, Borislav Petkov,
	Chaitanya Kulkarni, Christian Brauner, Damien Le Moal,
	Dave Hansen, David Disseldorp, Edward Cree, Eric Dumazet,
	Fenghua Yu, Geert Uytterhoeven, Greg Kroah-Hartman,
	Gregory Greenman, Hans Verkuil, Hans de Goede, Hugh Dickins,
	Ingo Molnar, Jakub Kicinski, Jaroslav Kysela, Jason Gunthorpe,
	Jens Axboe, Jiri Pirko, Jiri Slaby, Kalle Valo, Karsten Graul,
	Karsten Keil, Kees Cook, Leon Romanovsky, Mark Rutland,
	Martin Habets, Mauro Carvalho Chehab, Michael Ellerman,
	Michal Simek, Nicholas Piggin, Oliver Neukum, Paolo Abeni,
	Paolo Bonzini, Peter Zijlstra, Ping-Ke Shih, Rich Felker,
	Rob Herring, Robin Murphy, Sean Christopherson, Shuai Xue,
	Stanislaw Gruszka, Steven Rostedt, Thomas Bogendoerfer,
	Thomas Gleixner, Valentin Schneider, Vitaly Kuznetsov,
	Wenjia Zhang, Will Deacon, Yoshinori Sato,
	GR-QLogic-Storage-Upstream, alsa-devel, ath10k, dmaengine, iommu,
	kvm, linux-arm-kernel, linux-arm-msm, linux-block,
	linux-bluetooth, linux-hyperv, linux-m68k, linux-media,
	linux-mips, linux-net-drivers, linux-pci, linux-rdma, linux-s390,
	linux-scsi, linux-serial, linux-sh, linux-sound, linux-usb,
	linux-wireless, linuxppc-dev, mpi3mr-linuxdrv.pdl, netdev,
	sparclinux, x86, Jan Kara, Mirsad Todorovac, Matthew Wilcox,
	Rasmus Villemoes, Maxim Kuvyrkov, Alexey Klimov, Bart Van Assche,
	Sergey Shtylyov

On Sun, Dec 03, 2023 at 11:23:47AM -0800, Yury Norov wrote:
> Add helpers around test_and_{set,clear}_bit() that allow to search for
> clear or set bits and flip them atomically.
> 
> The target patterns may look like this:
> 
> 	for (idx = 0; idx < nbits; idx++)
> 		if (test_and_clear_bit(idx, bitmap))
> 			do_something(idx);
> 
> Or like this:
> 
> 	do {
> 		bit = find_first_bit(bitmap, nbits);
> 		if (bit >= nbits)
> 			return nbits;
> 	} while (!test_and_clear_bit(bit, bitmap));
> 	return bit;
> 
> In both cases, the opencoded loop may be converted to a single function
> or iterator call. Correspondingly:
> 
> 	for_each_test_and_clear_bit(idx, bitmap, nbits)
> 		do_something(idx);
> 
> Or:
> 	return find_and_clear_bit(bitmap, nbits);
> 
> Obviously, the less routine code people have to write themselves, the
> lower the probability of making a mistake.
> 
> Those are not only handy helpers but also resolve a non-trivial
> issue of using non-atomic find_bit() together with atomic
> test_and_{set,clear}_bit().
> 
> The trick is that find_bit() implies that the bitmap is a regular
> non-volatile piece of memory, and the compiler is allowed to use
> optimization techniques such as re-fetching memory instead of caching it.
> 
> For example, find_first_bit() is implemented like this:
> 
>       for (idx = 0; idx * BITS_PER_LONG < sz; idx++) {
>               val = addr[idx];
>               if (val) {
>                       sz = min(idx * BITS_PER_LONG + __ffs(val), sz);
>                       break;
>               }
>       }
> 
> On register-memory architectures, like x86, compiler may decide to
> access memory twice - first time to compare against 0, and second time
> to fetch its value to pass it to __ffs().
> 
> When running find_first_bit() on volatile memory, the memory may get
> changed in-between, and for instance, it may lead to passing 0 to
> __ffs(), which is undefined. This is a potentially dangerous call.
> 
> find_and_clear_bit() as a wrapper around test_and_clear_bit()
> naturally treats underlying bitmap as a volatile memory and prevents
> compiler from such optimizations.
> 
> KCSAN now catches exactly this type of situation and warns on
> undercover memory modifications. We can use it to reveal improper usage
> of find_bit(), and convert it to atomic find_and_*_bit() as appropriate.
> 
> The 1st patch of the series adds the following atomic primitives:
> 
> 	find_and_set_bit(addr, nbits);
> 	find_and_set_next_bit(addr, nbits, start);
> 	...
> 
> Here find_and_{set,clear} part refers to the corresponding
> test_and_{set,clear}_bit function. Suffixes like _wrap or _lock
> derive their semantics from corresponding find() or test() functions.
> 
> For brevity, the naming omits the fact that we search for zero bit in
> find_and_set, and correspondingly search for set bit in find_and_clear
> functions.
> 
> The patch also adds iterators with atomic semantics, like
> for_each_test_and_set_bit(). Here, the naming rule is to simply prefix
> corresponding atomic operation with 'for_each'.
> 
> This series is a result of discussion [1]. All find_bit() functions imply
> exclusive access to the bitmaps. However, KCSAN reports quite a number
> of warnings related to find_bit() API. Some of them are not pointing
> to real bugs because in many situations people intentionally allow
> concurrent bitmap operations.
> 
> If so, find_bit() can be annotated such that KCSAN will ignore it:
> 
>         bit = data_race(find_first_bit(bitmap, nbits));
> 
> This series addresses the other important case where people really need
> atomic find ops. As the following patches show, the resulting code
> looks safer and more verbose comparing to opencoded loops followed by
> atomic bit flips.
> 
> In [1] Mirsad reported 2% slowdown in a single-thread search test when
> switching find_bit() function to treat bitmaps as volatile arrays. On
> the other hand, kernel robot in the same thread reported +3.7% to the
> performance of will-it-scale.per_thread_ops test.
> 
> Assuming that our compilers are sane and generate better code against
> properly annotated data, the above discrepancy doesn't look weird. When
> running on non-volatile bitmaps, plain find_bit() outperforms atomic
> find_and_bit(), and vice-versa.

...

In some cases, greater improvements can be achieved by switching
the (very) old code to use the IDA framework.

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH v2 00/35] bitops: add atomic find_bit() operations
  2023-12-03 19:23 [PATCH v2 00/35] bitops: add atomic find_bit() operations Yury Norov
                   ` (2 preceding siblings ...)
  2023-12-04 13:07 ` [PATCH v2 00/35] bitops: add atomic find_bit() operations Andy Shevchenko
@ 2023-12-04 18:51 ` Jan Kara
  2023-12-06  5:22   ` Yury Norov
  3 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2023-12-04 18:51 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, David S. Miller, H. Peter Anvin,
	James E.J. Bottomley, K. Y. Srinivasan, Md. Haris Iqbal,
	Akinobu Mita, Andrew Morton, Bjorn Andersson, Borislav Petkov,
	Chaitanya Kulkarni, Christian Brauner, Damien Le Moal,
	Dave Hansen, David Disseldorp, Edward Cree, Eric Dumazet,
	Fenghua Yu, Geert Uytterhoeven, Greg Kroah-Hartman,
	Gregory Greenman, Hans Verkuil, Hans de Goede, Hugh Dickins,
	Ingo Molnar, Jakub Kicinski, Jaroslav Kysela, Jason Gunthorpe,
	Jens Axboe, Jiri Pirko, Jiri Slaby, Kalle Valo, Karsten Graul,
	Karsten Keil, Kees Cook, Leon Romanovsky, Mark Rutland,
	Martin Habets, Mauro Carvalho Chehab, Michael Ellerman,
	Michal Simek, Nicholas Piggin, Oliver Neukum, Paolo Abeni,
	Paolo Bonzini, Peter Zijlstra, Ping-Ke Shih, Rich Felker,
	Rob Herring, Robin Murphy, Sean Christopherson, Shuai Xue,
	Stanislaw Gruszka, Steven Rostedt, Thomas Bogendoerfer,
	Thomas Gleixner, Valentin Schneider, Vitaly Kuznetsov,
	Wenjia Zhang, Will Deacon, Yoshinori Sato,
	GR-QLogic-Storage-Upstream, alsa-devel, ath10k, dmaengine, iommu,
	kvm, linux-arm-kernel, linux-arm-msm, linux-block,
	linux-bluetooth, linux-hyperv, linux-m68k, linux-media,
	linux-mips, linux-net-drivers, linux-pci, linux-rdma, linux-s390,
	linux-scsi, linux-serial, linux-sh, linux-sound, linux-usb,
	linux-wireless, linuxppc-dev, mpi3mr-linuxdrv.pdl, netdev,
	sparclinux, x86, Jan Kara, Mirsad Todorovac, Matthew Wilcox,
	Rasmus Villemoes, Andy Shevchenko, Maxim Kuvyrkov, Alexey Klimov,
	Bart Van Assche, Sergey Shtylyov

Hello Yury!

On Sun 03-12-23 11:23:47, Yury Norov wrote:
> Add helpers around test_and_{set,clear}_bit() that allow to search for
> clear or set bits and flip them atomically.
> 
> The target patterns may look like this:
> 
> 	for (idx = 0; idx < nbits; idx++)
> 		if (test_and_clear_bit(idx, bitmap))
> 			do_something(idx);
> 
> Or like this:
> 
> 	do {
> 		bit = find_first_bit(bitmap, nbits);
> 		if (bit >= nbits)
> 			return nbits;
> 	} while (!test_and_clear_bit(bit, bitmap));
> 	return bit;
> 
> In both cases, the opencoded loop may be converted to a single function
> or iterator call. Correspondingly:
> 
> 	for_each_test_and_clear_bit(idx, bitmap, nbits)
> 		do_something(idx);
> 
> Or:
> 	return find_and_clear_bit(bitmap, nbits);

These are fine cleanups but they actually don't address the case that has
triggered all these changes - namely the xarray use of find_next_bit() in
xas_find_chunk().

...
> This series is a result of discussion [1]. All find_bit() functions imply
> exclusive access to the bitmaps. However, KCSAN reports quite a number
> of warnings related to find_bit() API. Some of them are not pointing
> to real bugs because in many situations people intentionally allow
> concurrent bitmap operations.
> 
> If so, find_bit() can be annotated such that KCSAN will ignore it:
> 
>         bit = data_race(find_first_bit(bitmap, nbits));

No, this is not a correct thing to do. If concurrent bitmap changes can
happen, find_first_bit() as it is currently implemented isn't ever a safe
choice because it can call __ffs(0) which is dangerous as you properly note
above. I proposed adding READ_ONCE() into find_first_bit() / find_next_bit()
implementation to fix this issue but you disliked that. So the other option we
have is adding find_first_bit() and find_next_bit() variants that take
volatile 'addr' and we have to use these in code like xas_find_chunk()
which cannot be converted to your new helpers.
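
For reference, the READ_ONCE() idea can be sketched in userspace as
follows. This is a model under stated assumptions, not the kernel code:
READ_ONCE_UL() and find_first_bit_once() are made-up names, and the
GCC/Clang builtin __builtin_ctzl() stands in for __ffs(). The point is that
each word is loaded exactly once, so the value compared against zero is the
same one passed to the bit scan, and the __ffs(0) case cannot occur.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* Userspace stand-in for the kernel's READ_ONCE(): force a single load. */
#define READ_ONCE_UL(x)	(*(const volatile unsigned long *)&(x))

static size_t find_first_bit_once(const unsigned long *addr, size_t nbits)
{
	size_t idx;

	for (idx = 0; idx * BITS_PER_LONG < nbits; idx++) {
		/* The value tested and the value scanned are the same copy. */
		unsigned long val = READ_ONCE_UL(addr[idx]);

		if (val) {
			size_t bit = idx * BITS_PER_LONG +
				     (size_t)__builtin_ctzl(val);

			/* Clamp hits that fall beyond the bitmap size. */
			return bit < nbits ? bit : nbits;
		}
	}

	return nbits;	/* no set bit found */
}
```

Note that this only removes the undefined bit scan; the returned bit number
can still be stale by the time the caller looks at it.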

> This series addresses the other important case where people really need
> atomic find ops. As the following patches show, the resulting code
> looks safer and more verbose comparing to opencoded loops followed by
> atomic bit flips.
> 
> In [1] Mirsad reported 2% slowdown in a single-thread search test when
> switching find_bit() function to treat bitmaps as volatile arrays. On
> the other hand, kernel robot in the same thread reported +3.7% to the
> performance of will-it-scale.per_thread_ops test.

It was actually me who reported the regression here [2] but whatever :)

[2] https://lore.kernel.org/all/20231011150252.32737-1-jack@suse.cz

> Assuming that our compilers are sane and generate better code against
> properly annotated data, the above discrepancy doesn't look weird. When
> running on non-volatile bitmaps, plain find_bit() outperforms atomic
> find_and_bit(), and vice-versa.
> 
> So, all users of find_bit() API, where heavy concurrency is expected,
> are encouraged to switch to atomic find_and_bit() as appropriate.

Well, all users where any concurrency can happen should switch. Otherwise
they are prone to the (admittedly mostly theoretical) data race issue.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 00/35] bitops: add atomic find_bit() operations
  2023-12-04 18:51 ` Jan Kara
@ 2023-12-06  5:22   ` Yury Norov
  2023-12-07  9:10     ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Yury Norov @ 2023-12-06  5:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-kernel, David S. Miller, H. Peter Anvin,
	James E.J. Bottomley, K. Y. Srinivasan, Md. Haris Iqbal,
	Akinobu Mita, Andrew Morton, Bjorn Andersson, Borislav Petkov,
	Chaitanya Kulkarni, Christian Brauner, Damien Le Moal,
	Dave Hansen, David Disseldorp, Edward Cree, Eric Dumazet,
	Fenghua Yu, Geert Uytterhoeven, Greg Kroah-Hartman,
	Gregory Greenman, Hans Verkuil, Hans de Goede, Hugh Dickins,
	Ingo Molnar, Jakub Kicinski, Jaroslav Kysela, Jason Gunthorpe,
	Jens Axboe, Jiri Pirko, Jiri Slaby, Kalle Valo, Karsten Graul,
	Karsten Keil, Kees Cook, Leon Romanovsky, Mark Rutland,
	Martin Habets, Mauro Carvalho Chehab, Michael Ellerman,
	Michal Simek, Nicholas Piggin, Oliver Neukum, Paolo Abeni,
	Paolo Bonzini, Peter Zijlstra, Ping-Ke Shih, Rich Felker,
	Rob Herring, Robin Murphy, Sean Christopherson, Shuai Xue,
	Stanislaw Gruszka, Steven Rostedt, Thomas Bogendoerfer,
	Thomas Gleixner, Valentin Schneider, Vitaly Kuznetsov,
	Wenjia Zhang, Will Deacon, Yoshinori Sato,
	GR-QLogic-Storage-Upstream, alsa-devel, ath10k, dmaengine, iommu,
	kvm, linux-arm-kernel, linux-arm-msm, linux-block,
	linux-bluetooth, linux-hyperv, linux-m68k, linux-media,
	linux-mips, linux-net-drivers, linux-pci, linux-rdma, linux-s390,
	linux-scsi, linux-serial, linux-sh, linux-sound, linux-usb,
	linux-wireless, linuxppc-dev, mpi3mr-linuxdrv.pdl, netdev,
	sparclinux, x86, Mirsad Todorovac, Matthew Wilcox,
	Rasmus Villemoes, Andy Shevchenko, Maxim Kuvyrkov, Alexey Klimov,
	Bart Van Assche, Sergey Shtylyov

On Mon, Dec 04, 2023 at 07:51:01PM +0100, Jan Kara wrote:
> Hello Yury!
> 
> On Sun 03-12-23 11:23:47, Yury Norov wrote:
> > Add helpers around test_and_{set,clear}_bit() that allow to search for
> > clear or set bits and flip them atomically.
> > 
> > The target patterns may look like this:
> > 
> > 	for (idx = 0; idx < nbits; idx++)
> > 		if (test_and_clear_bit(idx, bitmap))
> > 			do_something(idx);
> > 
> > Or like this:
> > 
> > 	do {
> > 		bit = find_first_bit(bitmap, nbits);
> > 		if (bit >= nbits)
> > 			return nbits;
> > 	} while (!test_and_clear_bit(bit, bitmap));
> > 	return bit;
> > 
> > In both cases, the opencoded loop may be converted to a single function
> > or iterator call. Correspondingly:
> > 
> > 	for_each_test_and_clear_bit(idx, bitmap, nbits)
> > 		do_something(idx);
> > 
> > Or:
> > 	return find_and_clear_bit(bitmap, nbits);
> 
> These are fine cleanups but they actually don't address the case that has
> triggered all these changes - namely the xarray use of find_next_bit() in
> xas_find_chunk().
> 
> ...
> > This series is a result of discussion [1]. All find_bit() functions imply
> > exclusive access to the bitmaps. However, KCSAN reports quite a number
> > of warnings related to find_bit() API. Some of them are not pointing
> > to real bugs because in many situations people intentionally allow
> > concurrent bitmap operations.
> > 
> > If so, find_bit() can be annotated such that KCSAN will ignore it:
> > 
> >         bit = data_race(find_first_bit(bitmap, nbits));
> 
> No, this is not a correct thing to do. If concurrent bitmap changes can
> happen, find_first_bit() as it is currently implemented isn't ever a safe
> choice because it can call __ffs(0) which is dangerous as you properly note
> above. I proposed adding READ_ONCE() into find_first_bit() / find_next_bit()
> implementation to fix this issue but you disliked that. So other option we
> have is adding find_first_bit() and find_next_bit() variants that take
> volatile 'addr' and we have to use these in code like xas_find_chunk()
> which cannot be converted to your new helpers.

Here are some examples where concurrent operations with plain find_bit()
are acceptable:

 - two threads running find_*_bit(): safe wrt __ffs(0) and returns the
   correct value, because the underlying bitmap is unchanged;
 - find_next_bit() in parallel with set_bit() or clear_bit(), when the
   modified bit lies before the start bit of the search: safe and correct;
 - find_first_bit() in parallel with set_bit(): safe, but may return a
   wrong bit number;
 - find_first_zero_bit() in parallel with clear_bit(): same as above.

In the last 2 cases find_bit() may not return the correct bit number, but
that may be OK if the caller needs any (not necessarily the first) set or
clear bit, respectively.

In such cases, KCSAN may be safely silenced.
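
To illustrate the "any set bit is fine" pattern from the last two cases,
here is a minimal user-space sketch. It is not kernel code:
claim_any_set_bit() is a hypothetical name, C11 atomics stand in for
test_and_clear_bit(), and __builtin_ctzl() (a GCC/Clang builtin) stands in
for __ffs(). The racy search result is treated only as a hint, and the
atomic clear decides whether the caller really owns the bit.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical helper (not a kernel API): find some set bit below nbits
 * and clear it atomically. Returns nbits if no bit could be claimed.
 */
static size_t claim_any_set_bit(_Atomic unsigned long *bitmap, size_t nbits)
{
	size_t nwords = (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;

	assert(bitmap != NULL);

	for (size_t w = 0; w < nwords; w++) {
		/* One snapshot per word; the search never re-reads memory. */
		unsigned long snap =
			atomic_load_explicit(&bitmap[w], memory_order_relaxed);

		while (snap) {
			/* snap != 0 here, so ctz is well defined. */
			size_t bit = (size_t)__builtin_ctzl(snap);
			size_t idx = w * BITS_PER_WORD + bit;
			unsigned long mask = 1UL << bit;

			if (idx >= nbits)
				break;

			/* The bit number was only a hint: confirm atomically. */
			unsigned long old = atomic_fetch_and(&bitmap[w], ~mask);
			if (old & mask)
				return idx;

			/* Lost the race; continue with the fresher value. */
			snap = old & ~mask;
		}
	}
	return nbits;
}
```

With concurrent set_bit() callers the returned index is not guaranteed to
be the lowest set bit at any instant, which is exactly the relaxed
guarantee described above.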
 
> > This series addresses the other important case where people really need
> > atomic find ops. As the following patches show, the resulting code
> > looks safer and more verbose comparing to opencoded loops followed by
> > atomic bit flips.
> > 
> > In [1] Mirsad reported 2% slowdown in a single-thread search test when
> > switching find_bit() function to treat bitmaps as volatile arrays. On
> > the other hand, kernel robot in the same thread reported +3.7% to the
> > performance of will-it-scale.per_thread_ops test.
> 
> It was actually me who reported the regression here [2] but whatever :)
> 
> [2] https://lore.kernel.org/all/20231011150252.32737-1-jack@suse.cz

My apologies.

> > Assuming that our compilers are sane and generate better code against
> > properly annotated data, the above discrepancy doesn't look weird. When
> > running on non-volatile bitmaps, plain find_bit() outperforms atomic
> > find_and_bit(), and vice-versa.
> > 
> > So, all users of find_bit() API, where heavy concurrency is expected,
> > are encouraged to switch to atomic find_and_bit() as appropriate.
> 
> Well, all users where any concurrency can happen should switch. Otherwise
> they are prone to the (admittedly mostly theoretical) data race issue.
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

* Re: [PATCH v2 00/35] bitops: add atomic find_bit() operations
  2023-12-06  5:22   ` Yury Norov
@ 2023-12-07  9:10     ` Jan Kara
  0 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2023-12-07  9:10 UTC (permalink / raw)
  To: Yury Norov
  Cc: Jan Kara, linux-kernel, David S. Miller, H. Peter Anvin,
	James E.J. Bottomley, K. Y. Srinivasan, Md. Haris Iqbal,
	Akinobu Mita, Andrew Morton, Bjorn Andersson, Borislav Petkov,
	Chaitanya Kulkarni, Christian Brauner, Damien Le Moal,
	Dave Hansen, David Disseldorp, Edward Cree, Eric Dumazet,
	Fenghua Yu, Geert Uytterhoeven, Greg Kroah-Hartman,
	Gregory Greenman, Hans Verkuil, Hans de Goede, Hugh Dickins,
	Ingo Molnar, Jakub Kicinski, Jaroslav Kysela, Jason Gunthorpe,
	Jens Axboe, Jiri Pirko, Jiri Slaby, Kalle Valo, Karsten Graul,
	Karsten Keil, Kees Cook, Leon Romanovsky, Mark Rutland,
	Martin Habets, Mauro Carvalho Chehab, Michael Ellerman,
	Michal Simek, Nicholas Piggin, Oliver Neukum, Paolo Abeni,
	Paolo Bonzini, Peter Zijlstra, Ping-Ke Shih, Rich Felker,
	Rob Herring, Robin Murphy, Sean Christopherson, Shuai Xue,
	Stanislaw Gruszka, Steven Rostedt, Thomas Bogendoerfer,
	Thomas Gleixner, Valentin Schneider, Vitaly Kuznetsov,
	Wenjia Zhang, Will Deacon, Yoshinori Sato,
	GR-QLogic-Storage-Upstream, alsa-devel, ath10k, dmaengine, iommu,
	kvm, linux-arm-kernel, linux-arm-msm, linux-block,
	linux-bluetooth, linux-hyperv, linux-m68k, linux-media,
	linux-mips, linux-net-drivers, linux-pci, linux-rdma, linux-s390,
	linux-scsi, linux-serial, linux-sh, linux-sound, linux-usb,
	linux-wireless, linuxppc-dev, mpi3mr-linuxdrv.pdl, netdev,
	sparclinux, x86, Mirsad Todorovac, Matthew Wilcox,
	Rasmus Villemoes, Andy Shevchenko, Maxim Kuvyrkov, Alexey Klimov,
	Bart Van Assche, Sergey Shtylyov

On Tue 05-12-23 21:22:59, Yury Norov wrote:
> On Mon, Dec 04, 2023 at 07:51:01PM +0100, Jan Kara wrote:
> > > This series is a result of discussion [1]. All find_bit() functions imply
> > > exclusive access to the bitmaps. However, KCSAN reports quite a number
> > > of warnings related to find_bit() API. Some of them are not pointing
> > > to real bugs because in many situations people intentionally allow
> > > concurrent bitmap operations.
> > > 
> > > If so, find_bit() can be annotated such that KCSAN will ignore it:
> > > 
> > >         bit = data_race(find_first_bit(bitmap, nbits));
> > 
> > No, this is not a correct thing to do. If concurrent bitmap changes can
> > happen, find_first_bit() as it is currently implemented isn't ever a safe
> > choice because it can call __ffs(0) which is dangerous as you properly note
> > above. I proposed adding READ_ONCE() into find_first_bit() / find_next_bit()
> > implementation to fix this issue but you disliked that. So other option we
> > have is adding find_first_bit() and find_next_bit() variants that take
> > volatile 'addr' and we have to use these in code like xas_find_chunk()
> > which cannot be converted to your new helpers.
> 
> Here are some examples where concurrent operations with plain find_bit()
> are acceptable:
> 
>  - two threads running find_*_bit(): safe wrt __ffs(0) and returns the
>    correct value, because the underlying bitmap is unchanged;
>  - find_next_bit() in parallel with set_bit() or clear_bit(), when the
>    modified bit lies before the start bit of the search: safe and correct;
>  - find_first_bit() in parallel with set_bit(): safe, but may return a
>    wrong bit number;
>  - find_first_zero_bit() in parallel with clear_bit(): same as above.
> 
> In the last 2 cases find_bit() may not return the correct bit number, but
> that may be OK if the caller needs any (not necessarily the first) set or
> clear bit, respectively.
> 
> In such cases, KCSAN may be safely silenced.

True - but these are special cases. In particular, the case in
xas_find_chunk() is not one of them. It uses find_next_bit(), which can
race with clear_bit(). So what are your plans for such a use case?
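
For illustration only, a find_next_bit() variant that takes a volatile
'addr', as mentioned in the quoted text, could look roughly like this in
user space. This is a sketch under assumptions, not a proposed patch:
find_next_bit_snapshot() and first_set() are hypothetical names, the
volatile access merely imitates READ_ONCE(), and __builtin_ctzl() (a
GCC/Clang builtin) stands in for __ffs(). Each word is loaded exactly once
into a local, so a racing clear_bit() cannot turn the value zero between
the test and the bit scan.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* __builtin_ctzl(0) is undefined, much like the kernel's __ffs(0). */
static size_t first_set(unsigned long word)
{
	assert(word != 0);
	return (size_t)__builtin_ctzl(word);
}

/*
 * Hypothetical helper (not a kernel API): return the index of the first
 * set bit at or after 'start', or nbits if there is none. Every bitmap
 * word is read once through the volatile pointer and then examined
 * only via the local copy.
 */
static size_t find_next_bit_snapshot(const volatile unsigned long *addr,
				     size_t nbits, size_t start)
{
	if (start >= nbits)
		return nbits;

	size_t w = start / BITS_PER_WORD;
	size_t nwords = (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;

	/* Single load of the first word, with bits below 'start' masked off. */
	unsigned long snap = addr[w] & (~0UL << (start % BITS_PER_WORD));

	for (;;) {
		if (snap) {
			size_t idx = w * BITS_PER_WORD + first_set(snap);

			return idx < nbits ? idx : nbits;
		}
		if (++w >= nwords)
			return nbits;
		snap = addr[w];	/* single volatile load per word */
	}
}
```

The result can still be stale by the time the caller looks at it; the
sketch only removes the undefined bit scan on a zero word, not the race
itself.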

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH v2 31/35] net: smc: use find_and_set_bit() in smc_wr_tx_get_free_slot_index()
  2023-12-04  9:40     ` Alexandra Winter
@ 2023-12-11 22:34       ` Yury Norov
  0 siblings, 0 replies; 14+ messages in thread
From: Yury Norov @ 2023-12-11 22:34 UTC (permalink / raw)
  To: Alexandra Winter
  Cc: linux-kernel, Karsten Graul, Wenjia Zhang, Jan Karcher, D. Wythe,
	Tony Lu, Wen Gu, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-s390, netdev, Jan Kara, Mirsad Todorovac,
	Matthew Wilcox, Rasmus Villemoes, Andy Shevchenko, Maxim Kuvyrkov,
	Alexey Klimov, Bart Van Assche, Sergey Shtylyov

On Mon, Dec 04, 2023 at 10:40:20AM +0100, Alexandra Winter wrote:
> 
> 
> On 03.12.23 20:33, Yury Norov wrote:
> > The function opencodes find_and_set_bit() with a for_each() loop. Use
> > it, and make the whole function a simple almost one-liner.
> > 
> > While here, drop explicit initialization of *idx, because it's already
> > initialized by the caller in case of ENOLINK, or set properly with
> > ->wr_tx_mask, if nothing is found, in case of EBUSY.
> > 
> > CC: Tony Lu <tonylu@linux.alibaba.com>
> > CC: Alexandra Winter <wintera@linux.ibm.com>
> > Signed-off-by: Yury Norov <yury.norov@gmail.com>
> > ---
> 
> Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
> 
> 
> Thanks a lot for the great helper function!
> I guess the top-level maintainers will figure out, how this series best finds its way upstream.

Thanks, Alexandra. :)

People in this thread say to just pick up their subsystem patches together
with #1. So I'm going to send v3 with some minor tweaks and, if
everything is OK, will pull all of this into my bitmap-for-next branch.

Thanks,
Yury

Thread overview: 14+ messages
2023-12-03 19:23 [PATCH v2 00/35] bitops: add atomic find_bit() operations Yury Norov
2023-12-03 19:23 ` [PATCH v2 01/35] lib/find: add atomic find_bit() primitives Yury Norov
2023-12-03 19:32 ` [PATCH v2 02/35] lib/find: add test for atomic find_bit() ops Yury Norov
2023-12-03 19:32   ` [PATCH v2 21/35] sfc: switch to using atomic find_bit() API where appropriate Yury Norov
2023-12-03 19:32   ` [PATCH v2 26/35] mISDN: optimize get_free_devid() Yury Norov
2023-12-03 19:33   ` [PATCH v2 28/35] ethernet: rocker: optimize ofdpa_port_internal_vlan_id_get() Yury Norov
2023-12-03 19:33   ` [PATCH v2 30/35] bluetooth: optimize cmtp_alloc_block_id() Yury Norov
2023-12-03 19:33   ` [PATCH v2 31/35] net: smc: use find_and_set_bit() in smc_wr_tx_get_free_slot_index() Yury Norov
2023-12-04  9:40     ` Alexandra Winter
2023-12-11 22:34       ` Yury Norov
2023-12-04 13:07 ` [PATCH v2 00/35] bitops: add atomic find_bit() operations Andy Shevchenko
2023-12-04 18:51 ` Jan Kara
2023-12-06  5:22   ` Yury Norov
2023-12-07  9:10     ` Jan Kara
