Linux RDMA and InfiniBand development
* Re: [PATCH 2/2] tracing: Add CONFIG_TRACE_PRINTK_DEBUGGING to clean up kernel.h
From: Steven Rostedt @ 2026-06-21  9:47 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Linus Torvalds, Sebastian Andrzej Siewior, John Ogness,
	Thomas Gleixner, Peter Zijlstra, Julia Lawall, Yury Norov,
	linux-doc, linux-kbuild, linuxppc-dev, dri-devel, linux-stm32,
	linux-arm-kernel, linux-rdma, linux-usb, linux-ext4, linux-nfs,
	kvm, intel-gfx
In-Reply-To: <20260621093811.168514984@kernel.org>

On Sun, 21 Jun 2026 05:34:32 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> Instead of having trace_printk.h included in kernel.h, create a config
> TRACE_PRINTK_DEBUGGING that when set will update the CFLAGS in the
> Makefile to allow developers to add trace_printk() without the need to add
> the include for it. Having it included in the Makefile keeps it from being
> in the dependency chain and it will not waste extra CPU cycles for those
> building the kernel without using trace_printk.

Bah, I only tested with the config option enabled, and missed some
dependencies with it disabled.

For instance, rcu.h also uses ftrace_dump() so that too needs to go
into kernel.h. I also need to add a few more includes to trace_printk.h.

OK, I need to run this through all my tests to find where else I missed
adding the includes. But the idea should hopefully satisfy everyone.

-- Steve

^ permalink raw reply

* [PATCH 2/2] tracing: Add CONFIG_TRACE_PRINTK_DEBUGGING to clean up kernel.h
From: Steven Rostedt @ 2026-06-21  9:34 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Linus Torvalds, Sebastian Andrzej Siewior, John Ogness,
	Thomas Gleixner, Peter Zijlstra, Julia Lawall, Yury Norov,
	linux-doc, linux-kbuild, linuxppc-dev, dri-devel, linux-stm32,
	linux-arm-kernel, linux-rdma, linux-usb, linux-ext4, linux-nfs,
	kvm, intel-gfx
In-Reply-To: <20260621093430.264983361@kernel.org>

From: Steven Rostedt <rostedt@goodmis.org>

Instead of having trace_printk.h included in kernel.h, create a config
TRACE_PRINTK_DEBUGGING that when set will update the CFLAGS in the
Makefile to allow developers to add trace_printk() without the need to add
the include for it. Having it included in the Makefile keeps it from being
in the dependency chain and it will not waste extra CPU cycles for those
building the kernel without using trace_printk.

Link: https://lore.kernel.org/all/CAHk-=wikCBeVFjVXiY4o-oepdbjAoir5+TcAgtL12c4u1TpZLQ@mail.gmail.com/

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 .../debugging/driver_development_debugging_guide.rst   |  2 +-
 Makefile                                               |  5 +++++
 arch/powerpc/kvm/book3s_xics.c                         |  1 +
 drivers/gpu/drm/i915/gt/intel_gtt.h                    |  1 +
 drivers/gpu/drm/i915/i915_gem.h                        |  1 +
 drivers/hwtracing/stm/dummy_stm.c                      |  4 ++++
 drivers/infiniband/hw/hfi1/trace_dbg.h                 |  1 +
 drivers/usb/early/xhci-dbc.c                           |  1 +
 fs/ext4/inline.c                                       |  1 +
 include/linux/kernel.h                                 |  1 -
 include/linux/sunrpc/debug.h                           |  1 +
 include/linux/trace_printk.h                           |  5 +++--
 kernel/trace/Kconfig                                   | 10 ++++++++++
 kernel/trace/ring_buffer_benchmark.c                   |  1 +
 kernel/trace/trace.h                                   |  1 +
 samples/fprobe/fprobe_example.c                        |  1 +
 samples/ftrace/ftrace-direct-modify.c                  |  1 +
 samples/ftrace/ftrace-direct-multi-modify.c            |  1 +
 samples/ftrace/ftrace-direct-multi.c                   |  2 +-
 samples/ftrace/ftrace-direct-too.c                     |  2 +-
 samples/ftrace/ftrace-direct.c                         |  2 +-
 21 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/Documentation/process/debugging/driver_development_debugging_guide.rst b/Documentation/process/debugging/driver_development_debugging_guide.rst
index aca08f457793..3c87aa03622f 100644
--- a/Documentation/process/debugging/driver_development_debugging_guide.rst
+++ b/Documentation/process/debugging/driver_development_debugging_guide.rst
@@ -52,7 +52,7 @@ For the full documentation see :doc:`/core-api/printk-basics`
 Trace_printk
 ~~~~~~~~~~~~
 
-Prerequisite: ``CONFIG_DYNAMIC_FTRACE`` & ``#include <linux/ftrace.h>``
+Prerequisite: ``CONFIG_TRACE_PRINTK_DEBUGGING``
 
 It is a tiny bit less comfortable to use than printk(), because you will have
 to read the messages from the trace file (See: :ref:`read_ftrace_log`
diff --git a/Makefile b/Makefile
index d1c595db55c9..2f5923d5393b 100644
--- a/Makefile
+++ b/Makefile
@@ -840,6 +840,11 @@ ifdef CONFIG_FUNCTION_TRACER
   CC_FLAGS_FTRACE := -pg
 endif
 
+ifdef CONFIG_TRACE_PRINTK_DEBUGGING
+  # Allow trace_printk() to be used anywhere without including the header.
+  LINUXINCLUDE += -include $(srctree)/include/linux/trace_printk.h
+endif
+
 ifdef CONFIG_TRACEPOINTS
 # To check for unused tracepoints (tracepoints that are defined but never
 # called), run with:
diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
index 74a44fa702b0..ef5eb596a56e 100644
--- a/arch/powerpc/kvm/book3s_xics.c
+++ b/arch/powerpc/kvm/book3s_xics.c
@@ -26,6 +26,7 @@
 #if 1
 #define XICS_DBG(fmt...) do { } while (0)
 #else
+#include <linux/trace_printk.h>
 #define XICS_DBG(fmt...) trace_printk(fmt)
 #endif
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
index b54ee4f25af1..f6f223090760 100644
--- a/drivers/gpu/drm/i915/gt/intel_gtt.h
+++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
@@ -35,6 +35,7 @@
 #define I915_GFP_ALLOW_FAIL (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN)
 
 #if IS_ENABLED(CONFIG_DRM_I915_TRACE_GTT)
+#include <linux/trace_printk.h>
 #define GTT_TRACE(...) trace_printk(__VA_ARGS__)
 #else
 #define GTT_TRACE(...)
diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
index 20b3cb29cfff..5cab1836dc1d 100644
--- a/drivers/gpu/drm/i915/i915_gem.h
+++ b/drivers/gpu/drm/i915/i915_gem.h
@@ -116,6 +116,7 @@ int i915_gem_open(struct drm_i915_private *i915, struct drm_file *file);
 #endif
 
 #if IS_ENABLED(CONFIG_DRM_I915_TRACE_GEM)
+#include <linux/trace_printk.h>
 #define GEM_TRACE(...) trace_printk(__VA_ARGS__)
 #define GEM_TRACE_ERR(...) do {						\
 	pr_err(__VA_ARGS__);						\
diff --git a/drivers/hwtracing/stm/dummy_stm.c b/drivers/hwtracing/stm/dummy_stm.c
index 38528ffdc0b3..784f9af7ccba 100644
--- a/drivers/hwtracing/stm/dummy_stm.c
+++ b/drivers/hwtracing/stm/dummy_stm.c
@@ -14,6 +14,10 @@
 #include <linux/stm.h>
 #include <uapi/linux/stm.h>
 
+#ifdef DEBUG
+#include <linux/trace_printk.h>
+#endif
+
 static ssize_t notrace
 dummy_stm_packet(struct stm_data *stm_data, unsigned int master,
 		 unsigned int channel, unsigned int packet, unsigned int flags,
diff --git a/drivers/infiniband/hw/hfi1/trace_dbg.h b/drivers/infiniband/hw/hfi1/trace_dbg.h
index 58304b91380f..30df5e246586 100644
--- a/drivers/infiniband/hw/hfi1/trace_dbg.h
+++ b/drivers/infiniband/hw/hfi1/trace_dbg.h
@@ -103,6 +103,7 @@ __hfi1_trace_def(IOCTL);
  */
 
 #ifdef HFI1_EARLY_DBG
+#include <linux/trace_printk.h>
 #define hfi1_dbg_early(fmt, ...) \
 	trace_printk(fmt, ##__VA_ARGS__)
 #else
diff --git a/drivers/usb/early/xhci-dbc.c b/drivers/usb/early/xhci-dbc.c
index 41118bba9197..955c73bd601f 100644
--- a/drivers/usb/early/xhci-dbc.c
+++ b/drivers/usb/early/xhci-dbc.c
@@ -30,6 +30,7 @@ static struct xdbc_state xdbc;
 static bool early_console_keep;
 
 #ifdef XDBC_TRACE
+#include <linux/trace_printk.h>
 #define	xdbc_trace	trace_printk
 #else
 static inline void xdbc_trace(const char *fmt, ...) { }
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 8045e4ff270c..0eff4a0c6a6c 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -934,6 +934,7 @@ static int ext4_da_convert_inline_data_to_extent(struct address_space *mapping,
 }
 
 #ifdef INLINE_DIR_DEBUG
+#include <linux/trace_printk.h>
 void ext4_show_inline_dir(struct inode *dir, struct buffer_head *bh,
 			  void *inline_start, int inline_size)
 {
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index c3c68128827c..538655385089 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -31,7 +31,6 @@
 #include <linux/build_bug.h>
 #include <linux/sprintf.h>
 #include <linux/static_call_types.h>
-#include <linux/trace_printk.h>
 #include <linux/util_macros.h>
 #include <linux/wordpart.h>
 
diff --git a/include/linux/sunrpc/debug.h b/include/linux/sunrpc/debug.h
index ab61bed2f7af..7524f5d82fba 100644
--- a/include/linux/sunrpc/debug.h
+++ b/include/linux/sunrpc/debug.h
@@ -29,6 +29,7 @@ extern unsigned int		nlm_debug;
 # define ifdebug(fac)		if (unlikely(rpc_debug & RPCDBG_##fac))
 
 # if IS_ENABLED(CONFIG_SUNRPC_DEBUG_TRACE)
+#  include <linux/trace_printk.h>
 #  define __sunrpc_printk(fmt, ...)	trace_printk(fmt, ##__VA_ARGS__)
 # else
 #  define __sunrpc_printk(fmt, ...)	printk(KERN_DEFAULT fmt, ##__VA_ARGS__)
diff --git a/include/linux/trace_printk.h b/include/linux/trace_printk.h
index 879fed0805fd..66edec6d5dbf 100644
--- a/include/linux/trace_printk.h
+++ b/include/linux/trace_printk.h
@@ -1,11 +1,12 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_TRACE_PRINTK_H
 #define _LINUX_TRACE_PRINTK_H
+#if !defined(__ASSEMBLY__) && !defined(__GENKSYMS__) && !defined(BUILD_VDSO)
 
-#include <linux/compiler_attributes.h>
 #include <linux/instruction_pointer.h>
 #include <linux/stddef.h>
 #include <linux/stringify.h>
+#include <linux/stdarg.h>
 
 /*
  * General tracing related utility functions - trace_printk(),
@@ -181,5 +182,5 @@ ftrace_vprintk(const char *fmt, va_list ap)
 }
 static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { }
 #endif /* CONFIG_TRACING */
-
+#endif /* !defined(__ASSEMBLY__) && !defined(__GENKSYMS__) && !defined(BUILD_VDSO) */
 #endif
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 084f34dc6c9f..ffbd1b0ce66e 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -210,6 +210,16 @@ menuconfig FTRACE
 
 if FTRACE
 
+config TRACE_PRINTK_DEBUGGING
+	bool "Debug with trace_printk()"
+	help
+	  If you need to debug with trace_printk(), instead of adding
+	  #include <linux/trace_printk.h> to every file you add a
+	  trace_printk() to, select this option and it will add
+	  trace_printk.h to all code to allow tracing with trace_printk().
+
+	  If in doubt, select N
+
 config TRACEFS_AUTOMOUNT_DEPRECATED
 	bool "Automount tracefs on debugfs [DEPRECATED]"
 	depends on TRACING
diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c
index 593e3b59e42e..2bb25caebb75 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2009 Steven Rostedt <srostedt@redhat.com>
  */
 #include <linux/ring_buffer.h>
+#include <linux/trace_printk.h>
 #include <linux/completion.h>
 #include <linux/kthread.h>
 #include <uapi/linux/sched/types.h>
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..580a3deab1e9 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -13,6 +13,7 @@
 #include <linux/ftrace.h>
 #include <linux/trace.h>
 #include <linux/hw_breakpoint.h>
+#include <linux/trace_printk.h>
 #include <linux/trace_seq.h>
 #include <linux/trace_events.h>
 #include <linux/compiler.h>
diff --git a/samples/fprobe/fprobe_example.c b/samples/fprobe/fprobe_example.c
index bfe98ce826f3..de81b9b4ca7d 100644
--- a/samples/fprobe/fprobe_example.c
+++ b/samples/fprobe/fprobe_example.c
@@ -12,6 +12,7 @@
 
 #define pr_fmt(fmt) "%s: " fmt, __func__
 
+#include <linux/trace_printk.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/fprobe.h>
diff --git a/samples/ftrace/ftrace-direct-modify.c b/samples/ftrace/ftrace-direct-modify.c
index 1ba1927b548e..30d0f8e644c8 100644
--- a/samples/ftrace/ftrace-direct-modify.c
+++ b/samples/ftrace/ftrace-direct-modify.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0-only
+#include <linux/trace_printk.h>
 #include <linux/module.h>
 #include <linux/kthread.h>
 #include <linux/ftrace.h>
diff --git a/samples/ftrace/ftrace-direct-multi-modify.c b/samples/ftrace/ftrace-direct-multi-modify.c
index 7a7822dfeb50..f64b929e19ec 100644
--- a/samples/ftrace/ftrace-direct-multi-modify.c
+++ b/samples/ftrace/ftrace-direct-multi-modify.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0-only
+#include <linux/trace_printk.h>
 #include <linux/module.h>
 #include <linux/kthread.h>
 #include <linux/ftrace.h>
diff --git a/samples/ftrace/ftrace-direct-multi.c b/samples/ftrace/ftrace-direct-multi.c
index 3fe6ddaf0b69..d32644a49554 100644
--- a/samples/ftrace/ftrace-direct-multi.c
+++ b/samples/ftrace/ftrace-direct-multi.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
+#include <linux/trace_printk.h>
 #include <linux/module.h>
-
 #include <linux/mm.h> /* for handle_mm_fault() */
 #include <linux/ftrace.h>
 #include <linux/sched/stat.h>
diff --git a/samples/ftrace/ftrace-direct-too.c b/samples/ftrace/ftrace-direct-too.c
index bf2411aa6fd7..266fcb233301 100644
--- a/samples/ftrace/ftrace-direct-too.c
+++ b/samples/ftrace/ftrace-direct-too.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
+#include <linux/trace_printk.h>
 #include <linux/module.h>
-
 #include <linux/mm.h> /* for handle_mm_fault() */
 #include <linux/ftrace.h>
 #if !defined(CONFIG_ARM64) && !defined(CONFIG_PPC32)
diff --git a/samples/ftrace/ftrace-direct.c b/samples/ftrace/ftrace-direct.c
index 5368c8c39cbb..85e0dff9b691 100644
--- a/samples/ftrace/ftrace-direct.c
+++ b/samples/ftrace/ftrace-direct.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
+#include <linux/trace_printk.h>
 #include <linux/module.h>
-
 #include <linux/sched.h> /* for wake_up_process() */
 #include <linux/ftrace.h>
 #if !defined(CONFIG_ARM64) && !defined(CONFIG_PPC32)
-- 
2.53.0



^ permalink raw reply related

* [PATCH 1/2] tracing: Move non-trace_printk prototypes back to kernel.h
From: Steven Rostedt @ 2026-06-21  9:34 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Linus Torvalds, Sebastian Andrzej Siewior, John Ogness,
	Thomas Gleixner, Peter Zijlstra, Julia Lawall, Yury Norov,
	linux-doc, linux-kbuild, linuxppc-dev, dri-devel, linux-stm32,
	linux-arm-kernel, linux-rdma, linux-usb, linux-ext4, linux-nfs,
	kvm, intel-gfx
In-Reply-To: <20260621093430.264983361@kernel.org>

From: Steven Rostedt <rostedt@goodmis.org>

In order to remove the include of trace_printk.h from kernel.h, the tracing
control prototypes need to be moved back into kernel.h. That's because
they are used in other common header files like rcu.h. There's no point in
removing trace_printk.h from kernel.h if it just gets added back to other
common headers.

Prototypes are very cheap for the compiler and should not be an issue.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/kernel.h       | 18 ++++++++++++++++++
 include/linux/trace_printk.h | 17 -----------------
 2 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index e5570a16cbb1..c3c68128827c 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -194,4 +194,22 @@ extern enum system_states system_state;
 # define REBUILD_DUE_TO_DYNAMIC_FTRACE
 #endif
 
+#ifdef CONFIG_TRACING
+void tracing_on(void);
+void tracing_off(void);
+int tracing_is_on(void);
+void tracing_snapshot(void);
+void tracing_snapshot_alloc(void);
+void tracing_start(void);
+void tracing_stop(void);
+#else
+static inline void tracing_start(void) { }
+static inline void tracing_stop(void) { }
+static inline void tracing_on(void) { }
+static inline void tracing_off(void) { }
+static inline int tracing_is_on(void) { return 0; }
+static inline void tracing_snapshot(void) { }
+static inline void tracing_snapshot_alloc(void) { }
+#endif
+
 #endif
diff --git a/include/linux/trace_printk.h b/include/linux/trace_printk.h
index 3d54f440dccf..879fed0805fd 100644
--- a/include/linux/trace_printk.h
+++ b/include/linux/trace_printk.h
@@ -35,15 +35,6 @@ enum ftrace_dump_mode {
 };
 
 #ifdef CONFIG_TRACING
-void tracing_on(void);
-void tracing_off(void);
-int tracing_is_on(void);
-void tracing_snapshot(void);
-void tracing_snapshot_alloc(void);
-
-extern void tracing_start(void);
-extern void tracing_stop(void);
-
 static inline __printf(1, 2)
 void ____trace_printk_check_format(const char *fmt, ...)
 {
@@ -176,16 +167,8 @@ __ftrace_vprintk(unsigned long ip, const char *fmt, va_list ap);
 
 extern void ftrace_dump(enum ftrace_dump_mode oops_dump_mode);
 #else
-static inline void tracing_start(void) { }
-static inline void tracing_stop(void) { }
 static inline void trace_dump_stack(int skip) { }
 
-static inline void tracing_on(void) { }
-static inline void tracing_off(void) { }
-static inline int tracing_is_on(void) { return 0; }
-static inline void tracing_snapshot(void) { }
-static inline void tracing_snapshot_alloc(void) { }
-
 static inline __printf(1, 2)
 int trace_printk(const char *fmt, ...)
 {
-- 
2.53.0



^ permalink raw reply related
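
The prototype-or-stub split that the patch above moves into kernel.h can be tried outside the kernel. The following is only a user-space sketch: DEMO_TRACING is a hypothetical macro standing in for CONFIG_TRACING, and the demo_* functions are invented names, not kernel API. It shows why callers need no #ifdef of their own: when the feature is compiled out, the calls resolve to empty inline stubs.

```c
#include <assert.h>
#include <stdbool.h>

/* DEMO_TRACING is a hypothetical stand-in for CONFIG_TRACING. */
#ifdef DEMO_TRACING
static bool demo_enabled;
void demo_tracing_on(void)   { demo_enabled = true; }
void demo_tracing_off(void)  { demo_enabled = false; }
int  demo_tracing_is_on(void) { return demo_enabled; }
#else
/* Feature compiled out: empty inline stubs keep callers free of #ifdefs. */
static inline void demo_tracing_on(void) { }
static inline void demo_tracing_off(void) { }
static inline int  demo_tracing_is_on(void) { return 0; }
#endif

/* A caller that compiles unchanged in both configurations. */
int demo_caller(void)
{
	demo_tracing_on();
	return demo_tracing_is_on();
}
```

Built without -DDEMO_TRACING, demo_caller() returns 0 because every call is a stub; built with it, the same caller returns 1.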

* [PATCH 0/2] tracing: Move trace_printk.h out of kernel.h
From: Steven Rostedt @ 2026-06-21  9:34 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Linus Torvalds, Sebastian Andrzej Siewior, John Ogness,
	Thomas Gleixner, Peter Zijlstra, Julia Lawall, Yury Norov,
	linux-doc, linux-kbuild, linuxppc-dev, dri-devel, linux-stm32,
	linux-arm-kernel, linux-rdma, linux-usb, linux-ext4, linux-nfs,
	kvm, intel-gfx

There have been complaints about trace_printk() being defined in kernel.h,
as it can increase compilation time. As it is only used by some developers
for debugging purposes, it should not be in kernel.h causing lots of wasted
CPU cycles for those who do not ever care about it.

Instead, add a CONFIG_TRACE_PRINTK_DEBUGGING option that developers who do
use it can set, so that they do not have to remember to add
#include <linux/trace_printk.h> to the files where they add trace_printk()
while debugging. It also means that those who do not have that config set
will not have to worry about wasted CPU cycles, as it is only included in
the CFLAGS when the option is set, and it's completely ignored otherwise.

Steven Rostedt (2):
      tracing: Move non-trace_printk prototypes back to kernel.h
      tracing: Add CONFIG_TRACE_PRINTK_DEBUGGING to clean up kernel.h

----
 .../driver_development_debugging_guide.rst         |  2 +-
 Makefile                                           |  5 +++++
 arch/powerpc/kvm/book3s_xics.c                     |  1 +
 drivers/gpu/drm/i915/gt/intel_gtt.h                |  1 +
 drivers/gpu/drm/i915/i915_gem.h                    |  1 +
 drivers/hwtracing/stm/dummy_stm.c                  |  4 ++++
 drivers/infiniband/hw/hfi1/trace_dbg.h             |  1 +
 drivers/usb/early/xhci-dbc.c                       |  1 +
 fs/ext4/inline.c                                   |  1 +
 include/linux/kernel.h                             | 19 ++++++++++++++++++-
 include/linux/sunrpc/debug.h                       |  1 +
 include/linux/trace_printk.h                       | 22 +++-------------------
 kernel/trace/Kconfig                               | 10 ++++++++++
 kernel/trace/ring_buffer_benchmark.c               |  1 +
 kernel/trace/trace.h                               |  1 +
 samples/fprobe/fprobe_example.c                    |  1 +
 samples/ftrace/ftrace-direct-modify.c              |  1 +
 samples/ftrace/ftrace-direct-multi-modify.c        |  1 +
 samples/ftrace/ftrace-direct-multi.c               |  2 +-
 samples/ftrace/ftrace-direct-too.c                 |  2 +-
 samples/ftrace/ftrace-direct.c                     |  2 +-
 21 files changed, 56 insertions(+), 24 deletions(-)

^ permalink raw reply

* Re: [PATCH net v3] net/mlx5e: macsec: fix use-after-free of metadata_dst on RX SC delete
From: Simon Horman @ 2026-06-20 16:30 UTC (permalink / raw)
  To: Doruk Tan Ozturk
  Cc: saeedm, leon, tariqt, mbloch, sd, andrew+netdev, davem, edumazet,
	kuba, pabeni, borisp, raeds, ehakim, netdev, linux-rdma,
	linux-kernel, stable
In-Reply-To: <20260618145545.53035-1-doruk@0sec.ai>

On Thu, Jun 18, 2026 at 04:55:45PM +0200, Doruk Tan Ozturk wrote:
> When an offloaded MACsec RX SC is deleted, macsec_del_rxsc_ctx() released
> the per-SC metadata_dst with metadata_dst_free(), which calls kfree()
> unconditionally and ignores the dst reference count. The RX datapath in
> mlx5e_macsec_offload_handle_rx_skb() looks up the SC under rcu_read_lock()
> via xa_load() and, while still holding only the RCU read lock, takes a
> reference with dst_hold() and attaches the dst to the skb with
> skb_dst_set().
> 
> A reader that has already obtained the rx_sc pointer can therefore race
> with the delete path:
> 
>   CPU0 (del_rxsc)			CPU1 (rx datapath)
>   --------------			------------------
> 					rcu_read_lock();
> 					rx_sc = xa_load(...)->rx_sc;
>   xa_erase(...);
>   metadata_dst_free(rx_sc->md_dst);	/* kfree(), ignores refcount */
> 					dst_hold(&rx_sc->md_dst->dst); /* UAF */
> 					skb_dst_set(skb, &rx_sc->md_dst->dst);
> 
> metadata_dst_free() frees the object even though the datapath still holds
> (or is about to take) a reference, so the subsequent dst_hold() /
> skb_dst_set() and the later skb free operate on freed memory.
> 
> Fix the owner side by dropping the reference with dst_release() instead of
> freeing unconditionally. dst_release() only schedules the RCU-deferred
> dst_destroy() once the reference count reaches zero, so a concurrent reader
> that still holds a reference keeps the object alive.
> 
> Dropping the owner reference is not sufficient on its own: once the owner
> reference is the last one, dst_release() drops the count to zero and the
> destroy is merely RCU-deferred. A racing reader that runs plain dst_hold()
> on that already-dead dst gets rcuref_get() == false but dst_hold() only
> WARNs and attaches the dying dst to the skb anyway; the later skb free then
> calls dst_release() on an object whose destroy is already scheduled, again
> a use-after-free.
> 
> Convert the RX datapath to dst_hold_safe(), which returns false (without
> warning) when the dst is already dead, and only attach it to the skb when a
> reference was successfully taken. When the SC is being deleted the in-flight
> packet simply proceeds without the offload metadata_dst: skb_metadata_dst()
> returns NULL, the MACsec core sees !is_macsec_md_dst and skips this secy
> (rx_uses_md_dst path), which is the correct behaviour for a packet whose SC
> is going away.
> 
> While reworking the datapath lookup, also guard the two NULL dereferences
> on the same path that an automated review (forwarded by Simon Horman)
> flagged: xa_load() can return NULL when the fs_id has just been erased, and
> mlx5e_macsec_add_rxsc() publishes sc_xarray_element via xa_alloc() before
> rx_sc->md_dst is allocated, so a packet carrying a freshly recycled fs_id
> can observe a non-NULL rx_sc whose md_dst is still NULL. Check both before
> dereferencing.
> 
> Note: macsec_del_rxsc_ctx() also kfree()s rx_sc->sc_xarray_element without
> an RCU grace period while the same datapath reads it under rcu_read_lock();
> that is a separate pre-existing issue and is left to a follow-up patch.
> 
> Fixes: b7c9400cbc48 ("net/mlx5e: Implement MACsec Rx data path using MACsec skb_metadata_dst")
> Cc: stable@vger.kernel.org
> Signed-off-by: Doruk Tan Ozturk <doruk@0sec.ai>
> ---
> v3:
>  - Also guard the RX-datapath NULL dereferences flagged by the automated
>    review: NULL-check the xa_load() result and rx_sc->md_dst before use.

The review of this patch on sashiko.dev flags that this change doesn't
appear to be complete:

  "This is a pre-existing issue, but since xa_alloc() in mlx5e_macsec_add_rxsc()
   publishes sc_xarray_element before rx_sc->md_dst is allocated and initialized,
   is it safe to use a plain read here?
   drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c:mlx5e_macsec_add_rxsc() {
      ...
      err = xa_alloc(&macsec->sc_xarray, &sc_xarray_element->fs_id, sc_xarray_element, ...);
      ...
      rx_sc->md_dst = metadata_dst_alloc(0, METADATA_MACSEC, GFP_KERNEL);
      ...
   }
   Because there are no memory barriers around the assignment and initialization
   of md_dst, could a concurrent datapath reader observe a non-NULL md_dst
   pointer but read uninitialized memory from it in dst_hold_safe()?"

...

^ permalink raw reply
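
The "take a reference only if the object is still live" rule that the quoted commit message relies on (dst_hold_safe() instead of a plain dst_hold()) can be sketched in user space with C11 atomics. This is an illustration under assumptions, not the kernel's rcuref implementation: struct demo_obj and the demo_* helpers are hypothetical names, and the counter is a plain atomic int.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; refs == 0 means "already dead". */
struct demo_obj {
	atomic_int refs;
};

/* Take a reference only while the count is still above zero. */
bool demo_hold_safe(struct demo_obj *o)
{
	int old = atomic_load(&o->refs);

	while (old > 0) {
		/* On failure, 'old' is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&o->refs, &old, old + 1))
			return true;
	}
	return false;	/* object is dying: caller must not attach it */
}

/* Drop a reference; returns true if the caller dropped the last one. */
bool demo_release(struct demo_obj *o)
{
	return atomic_fetch_sub(&o->refs, 1) == 1;
}
```

A reader that gets false from demo_hold_safe() simply proceeds without the object, which mirrors the in-flight packet continuing without the offload metadata_dst in the description above.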

* [PATCH] RDMA/siw: publish QP after initialization
From: Ruoyu Wang @ 2026-06-20 15:53 UTC (permalink / raw)
  To: Bernard Metzler, Jason Gunthorpe, Leon Romanovsky
  Cc: linux-rdma, linux-kernel, Ruoyu Wang

siw_create_qp() allocates a QP number before the queues, CQ pointers,
state, completion, and device list entry are ready. A QPN lookup can
therefore reach a QP that is still being constructed if the object is
published at allocation time.

Reserve the QPN with an empty XArray entry first. Publish the QP object
only after the kernel-visible QP state is initialized and just before
siw_create_qp() returns it to the caller.

Fixes: f29dd55b0236 ("rdma/siw: queue pair methods")
Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com>
---
 drivers/infiniband/sw/siw/siw.h       |  1 +
 drivers/infiniband/sw/siw/siw_qp.c    | 26 ++++++++++++++++++--------
 drivers/infiniband/sw/siw/siw_verbs.c | 12 +++++++++++-
 3 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/sw/siw/siw.h b/drivers/infiniband/sw/siw/siw.h
index f5fd71717b80..ade7c96135c2 100644
--- a/drivers/infiniband/sw/siw/siw.h
+++ b/drivers/infiniband/sw/siw/siw.h
@@ -511,6 +511,7 @@ void siw_send_terminate(struct siw_qp *qp);
 void siw_qp_get_ref(struct ib_qp *qp);
 void siw_qp_put_ref(struct ib_qp *qp);
 int siw_qp_add(struct siw_device *sdev, struct siw_qp *qp);
+int siw_qp_publish(struct siw_qp *qp);
 void siw_free_qp(struct kref *ref);
 
 void siw_init_terminate(struct siw_qp *qp, enum term_elayer layer,
diff --git a/drivers/infiniband/sw/siw/siw_qp.c b/drivers/infiniband/sw/siw/siw_qp.c
index bb780e3904a2..1a9135d9a2a7 100644
--- a/drivers/infiniband/sw/siw/siw_qp.c
+++ b/drivers/infiniband/sw/siw/siw_qp.c
@@ -1281,15 +1281,25 @@ void siw_rq_flush(struct siw_qp *qp)
 
 int siw_qp_add(struct siw_device *sdev, struct siw_qp *qp)
 {
-	int rv = xa_alloc(&sdev->qp_xa, &qp->base_qp.qp_num, qp, xa_limit_32b,
-			  GFP_KERNEL);
+	qp->sdev = sdev;
 
-	if (!rv) {
-		kref_init(&qp->ref);
-		qp->sdev = sdev;
-		siw_dbg_qp(qp, "new QP\n");
-	}
-	return rv;
+	return xa_alloc(&sdev->qp_xa, &qp->base_qp.qp_num, NULL,
+			xa_limit_32b, GFP_KERNEL);
+}
+
+int siw_qp_publish(struct siw_qp *qp)
+{
+	void *old;
+
+	kref_init(&qp->ref);
+
+	old = xa_store(&qp->sdev->qp_xa, qp_id(qp), qp, GFP_KERNEL);
+	if (xa_is_err(old))
+		return xa_err(old);
+
+	siw_dbg_qp(qp, "new QP\n");
+
+	return 0;
 }
 
 void siw_free_qp(struct kref *ref)
diff --git a/drivers/infiniband/sw/siw/siw_verbs.c b/drivers/infiniband/sw/siw/siw_verbs.c
index 1e1d262a4ae2..71bc0cc59e3d 100644
--- a/drivers/infiniband/sw/siw/siw_verbs.c
+++ b/drivers/infiniband/sw/siw/siw_verbs.c
@@ -482,14 +482,24 @@ int siw_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attrs,
 		goto err_out_xa;
 	}
 	INIT_LIST_HEAD(&qp->devq);
+	init_completion(&qp->qp_free);
+
 	spin_lock_irqsave(&sdev->lock, flags);
 	list_add_tail(&qp->devq, &sdev->qp_list);
 	spin_unlock_irqrestore(&sdev->lock, flags);
 
-	init_completion(&qp->qp_free);
+	rv = siw_qp_publish(qp);
+	if (rv)
+		goto err_out_list;
 
 	return 0;
 
+err_out_list:
+	spin_lock_irqsave(&sdev->lock, flags);
+	list_del(&qp->devq);
+	spin_unlock_irqrestore(&sdev->lock, flags);
+
+	siw_put_tx_cpu(qp->tx_cpu);
 err_out_xa:
 	xa_erase(&sdev->qp_xa, qp_id(qp));
 	if (uctx) {

^ permalink raw reply related
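
The reserve-then-publish ordering described in the patch above can be sketched with a small fixed-size table in user space. This is a single-threaded illustration only: demo_table, demo_reserve(), demo_publish() and demo_lookup() are hypothetical names, and the sketch has none of the locking or memory ordering that the kernel XArray provides.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SLOTS 8

struct demo_qp {
	int id;
	int ready;
};

/* A slot is unused, reserved (ID taken, nothing visible), or published. */
static struct demo_qp *demo_table[DEMO_SLOTS];
static unsigned char demo_reserved[DEMO_SLOTS];

/* Reserve an ID without making the object visible to lookups. */
int demo_reserve(struct demo_qp *qp)
{
	for (int i = 0; i < DEMO_SLOTS; i++) {
		if (!demo_reserved[i]) {
			demo_reserved[i] = 1;
			qp->id = i;
			return 0;
		}
	}
	return -1;	/* no free ID */
}

/* Make the fully initialized object visible under its reserved ID. */
void demo_publish(struct demo_qp *qp)
{
	demo_table[qp->id] = qp;
}

/* Returns NULL for unused IDs and for reserved-but-unpublished IDs. */
struct demo_qp *demo_lookup(int id)
{
	if (id < 0 || id >= DEMO_SLOTS)
		return NULL;
	return demo_table[id];
}
```

Between demo_reserve() and demo_publish() a lookup of the new ID returns NULL, so a half-constructed object is never handed out.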

* Re: [PATCH v3 2/3] net/smc: bound the receive length to the RMB in smc_rx_recvmsg()
From: Bryam Vargas @ 2026-06-19 22:17 UTC (permalink / raw)
  To: Dust Li
  Cc: Wenjia Zhang, D . Wythe, Sidraya Jayagond, Eric Dumazet,
	David S . Miller, Mahanta Jambigi, Wen Gu, Simon Horman,
	Ursula Braun, Stefan Raspl, Tony Lu, Paolo Abeni, Jakub Kicinski,
	netdev, linux-s390, linux-rdma, linux-kernel
In-Reply-To: <ajS4BgnyzRsa7HVm@linux.alibaba.com>

On Fri, 19 Jun 2026 11:31:18 +0800, Dust Li wrote:
> I think we can decide after we see the real issue.

Here it is, as a truth table over the real smc_curs_diff. cons is fixed (the app
isn't reading), bytes_to_rcv is the running sum of per-CDC smc_curs_diff(prod_old,
prod_new), len = 65504:

  scenario                     b2r       count>=len  diff>len  occ>len  OOB no-clamp  OOB clamp
  honest steady / full / wrap  <= len    no          no        no       no            no
  attack single big diff       131007    no          yes       yes      yes           no
  attack count=len-1 wrapflip  327519    no          yes       yes      yes           no
  attack wrap++ count=0        327520    no          no        no       yes           no

Every attack row has count < len, so an input count check accepts it. The last
row is the one that matters: a peer that just increments prod.wrap with count=0
adds len to bytes_to_rcv every CDC, unbounded, and no cursor-level check sees it.
The per-CDC diff is exactly len, and smc_curs_diff(cons, prod) stays at len
because it can't see the wrap accumulation. The only thing that bounds it is
clamping bytes_to_rcv at the consumer. So #2 isn't subsumed by validating cursors
at the input -- the cursor view can't see the accumulator.

> should we also abort the connection like what we did in patch #1 ?

Yes for net-next. Two caveats. First, the detection has to be on bytes_to_rcv
itself, not on a cursor recompute -- the wrap++ row walks past every cursor
check, so an occupancy gate at the input wouldn't catch
it. Second, the abort supplements the clamp, it doesn't replace it: the clamp is
synchronous, the abort via queue_work isn't. The producer add runs in the tasklet
under bh_lock_sock, the consumer sub runs in smc_recvmsg under lock_sock which
drops the spinlock, so they race; between queue_work and abort_work running
smc_conn_kill, smc_recvmsg can read the inflated bytes_to_rcv and copy past the
RMB. The clamp at the consumer is what closes that window.

So v4: -stable keeps the consumer-side clamp on #2, and the same shape on #3 for
sndbuf_space and peer_rmbe_space -- no control-flow change. net-next keeps the
clamp and, when bytes_to_rcv goes over len (which an honest peer never does),
queues the abort the way patch #1 does. Patch #1 keeps its count-based abort for
the urgent index.

Bryam

The table above is this program (gcc -O2 -Wall -Wextra -fwrapv; self-checks, exit 0):

  #include <stdio.h>
  #include <stdint.h>
  typedef uint16_t u16; typedef uint32_t u32;
  union hc { struct { u16 reserved; u16 wrap; u32 count; }; };

  /* verbatim net/smc/smc_cdc.h:149-158 */
  static int smc_curs_diff(unsigned int size, const union hc *old, const union hc *new)
  {
          if (old->wrap != new->wrap) {
                  int v = (int)((size - old->count) + new->count);
                  return v > 0 ? v : 0;
          }
          { int v = (int)(new->count - old->count); return v > 0 ? v : 0; }
  }

  #define LEN 65504
  struct cur { u16 w; u32 c; };

  /* prod[]/cons[]: cursor positions after each CDC. honest=app drains so
   * occupancy stays <= len; attack=cons stuck. */
  static int run(const char *name, int honest, int n,
                 const struct cur *prod, const struct cur *cons)
  {
          union hc po = {0}, co = {0};
          long b2r = 0; int i, cnt_rej = 0, raw_rej = 0, occ_rej = 0, fail = 0;
          for (i = 0; i < n; i++) {
                  union hc p = { .wrap = prod[i].w, .count = prod[i].c };
                  union hc c = { .wrap = cons[i].w, .count = cons[i].c };
                  int dp = smc_curs_diff(LEN, &po, &p);
                  if (prod[i].c >= (u32)LEN) cnt_rej = 1;
                  if (dp > LEN) raw_rej = 1;
                  if (smc_curs_diff(LEN, &c, &p) > LEN) occ_rej = 1;
                  b2r += dp; b2r -= smc_curs_diff(LEN, &co, &c);
                  po = p; co = c;
          }
          int oob_noclamp = b2r > LEN;
          int oob_clamp   = (b2r > LEN ? LEN : b2r) > LEN;   /* always 0 */
          printf("  %-30s b2r=%-8ld cnt_rej=%d raw_rej=%d occ_rej=%d oob_noclamp=%d oob_clamp=%d\n",
                 name, b2r, cnt_rej, raw_rej, occ_rej, oob_noclamp, oob_clamp);
          if (honest) fail = (cnt_rej || raw_rej || occ_rej || oob_noclamp);
          else        fail = (oob_clamp || !oob_noclamp);
          return fail;
  }

  int main(void)
  {
          struct cur ps[][5] = {
                  {{0,5000}}, {{1,0}}, {{0,30000},{0,60000},{1,10000}},
                  {{1,LEN-1}},
                  {{1,LEN-1},{0,LEN-1},{1,LEN-1},{0,LEN-1}},
                  {{1,0},{2,0},{3,0},{4,0},{5,0}},
          };
          struct cur cs[][5] = {
                  {{0,4000}}, {{0,0}}, {{0,0},{0,30000},{0,50000}},
                  {{0,0}},
                  {{0,0},{0,0},{0,0},{0,0}},
                  {{0,0},{0,0},{0,0},{0,0},{0,0}},
          };
          const char *nm[] = { "honest: steady", "honest: full ring",
                  "honest: wrapping", "attack: single big diff",
                  "attack: count=len-1 wrapflip", "attack: wrap++ count=0" };
          int hon[] = { 1,1,1,0,0,0 };
          int nc[]  = { 1,1,3,1,4,5 };
          int i, fails = 0;
          for (i = 0; i < 6; i++)
                  fails += run(nm[i], hon[i], nc[i], ps[i], cs[i]);
          printf("RESULT: %s\n", fails ? "FAIL" : "PASS");
          return fails ? 1 : 0;
  }

(In-kernel KASAN confirmation of the over-read at count=65503 is available on
request: a small out-of-tree module drives the same smc_curs_diff over a real
rmb_desc->len allocation. bytes_to_rcv goes 131007 -> 327519, KASAN reports
slab-out-of-bounds in the recv copy, and the run is clean with the clamp.)



* [PATCH rdma-next v8] RDMA: Change capability fields in ib_device_attr from int to u32
From: Erni Sri Satya Vennela @ 2026-06-19 20:30 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky, mkalderon, zyjzyj2000, sagi,
	mgurtovoy, haris.iqbal, jinpu.wang, bvanassche, kbusch,
	Jens Axboe, Christoph Hellwig, kch, smfrench, linkinjeon, metze,
	tom, trondmy, anna, chuck.lever, jlayton, neil, okorniev, Dai.Ngo,
	achender, davem, edumazet, kuba, pabeni, horms, kees, markzhang,
	andriy.shevchenko, ebadger, linux-rdma, linux-kernel,
	target-devel, linux-nvme, linux-cifs, samba-technical, linux-nfs,
	netdev, rds-devel
  Cc: Erni Sri Satya Vennela, Jason Gunthorpe

The capability counter fields in struct ib_device_attr are declared
as signed int, but these values are inherently non-negative. Drivers
maintain their cached caps as u32 and assign them directly into these
int fields; if a cap exceeds INT_MAX the implicit narrowing yields a
negative value visible to the IB core.

Change the signed int capability fields to u32 to match the
underlying nature of the data. Also update consumers across the IB
core, ULPs, NVMe-oF target, RDS, and NFS/RDMA so the new u32 values
are not forced back through signed int or u8 via min()/min_t() or
narrowing local variables.
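
A small userspace sketch of the two hazards the commit message names, assuming
GCC/Clang behaviour for out-of-range unsigned-to-signed conversion (the C
standard leaves it implementation-defined). store_as_int(), min_t_u8() and
min_clamped_u8() are illustrative helpers, not kernel functions; min_t_u8()
only mimics what min_t(u8, a, b) does by casting both operands first.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t u8;
typedef uint32_t u32;

/* Old field type: a u32 cap stored in a signed int field. */
static int store_as_int(u32 cap)
{
	return (int)cap;
}

/* Mimics min_t(u8, a, b): both operands are cast to u8 before comparing. */
static u8 min_t_u8(u32 a, u32 b)
{
	u8 x = (u8)a, y = (u8)b;

	return x < y ? x : y;
}

/* Compare at full width, then clamp to what a u8 field can hold. */
static u8 min_clamped_u8(u32 a, u32 b)
{
	u32 m = a < b ? a : b;

	return (u8)(m < UINT8_MAX ? m : UINT8_MAX);
}
```

Under those assumptions a cap of 0x80000000 reads back negative through the int
field, and min_t_u8(16, 256) yields 0 because 256 truncates to 0 before the
comparison; the full-width compare followed by a U8_MAX-style clamp yields 16.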

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Acked-by: Stefan Metzmacher <metze@samba.org> # smbdirect
---
Changes in v8:
* Convert the remaining non-negative counter fields max_ee_rd_atom,
  max_ee_init_rd_atom, max_ee, max_rdd, max_raw_ipv6_qp and max_srq_wr
  to u32; keep max_srq as int (its consumer compares it against
  ib_device.num_comp_vectors, still int).
* Drop all remaining min_t() where plain min() now works.
* Make the srq_size module parameters unsigned int so the srq_size min()
  stays a plain min().
* Replace the ternary-inside-min() with the simpler "if (x) x--;".
* Reorder the send_queue_depth min() to min(value, CONST) to match the
  sibling site.
* Restore reverse xmas-tree declaration order.
* Collapse the min()/min3() assignments that now fit onto a single line
  within 100 columns.
* Print the now-u32 fields with %u instead of %d.
Changes in v7:
* Drop min_t() in all sites where a plain min() (or min3()) works cleanly.
* Guard nvme/host/rdma.c num_inline_segments computation against a
  device reporting max_send_sge == 0, so the u32 subtract cannot wrap to
  UINT_MAX.
* Use %u when printing the newly-u32 capability fields in diagnostic
  messages.
Changes in v6:
* Fix subject prefix: net-next -> rdma-next.
Changes in v5:
* Add U8_MAX clamps in iser_verbs, nvme/host, nvme/target, isert,
  rds/ib_cm, smbdirect/connect and smbdirect/accept where u32 capability
  fields were directly narrowed into u8 rdma_conn_param fields without
  clamping.
* Guard the inline_sge_count calculation in nvmet_rdma_find_get_device()
  to prevent u32 underflow when both max_sge_rd and max_recv_sge are zero.
* Expand type migration to 9 additional fields (max_mw, max_raw_ethy_qp,
  max_mcast_grp, max_mcast_qp_attach, max_total_mcast_qp_attach, max_ah,
  max_srq, max_srq_wr, max_srq_sge)
* Fix min_t(int,...) in svc_rdma_transport; min_t(u32,...) in ipoib,
  srpt, nvme/target, rds/ib, rtrs-clt, rtrs-srv, xprtrdma/verbs.
* Fix frwr_ops.c u32 underflow guard (reorder check before subtraction)
* Change sc_max_send_sges to unsigned int, inline_sge_count to u32
* Fix %d -> %u in rxe_qp, rxe_srq, ipoib_cm, ib_isert, svc_rdma_transport
* Update commit message.
Changes in v4:
* Drop clamping the values in mana_ib_query_device, instead update
  the props values from int to u32.
Changes in v3:
* Drop clamping from mana_ib_gd_query_adapter_caps(). The internal u32
  caps cache does not need to be clamped.
* Move all clamping exclusively to mana_ib_query_device(), which is the
  only place the cached u32 values are narrowed into the signed int
  fields of struct ib_device_attr.
* Reframe commit message: this is a u32-to-int type boundary fix, not a
  CVM/untrusted-hardware hardening patch.
Changes in v2:
* Update patch title.
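
For the changelog entries about guarding the u32 subtract (the "if (x) x--;"
shape), a userspace sketch of why the guard matters once the field is unsigned.
minus_one_unguarded() and minus_one_guarded() are hypothetical helper names
used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;

/* Unguarded: 0 - 1 wraps to UINT32_MAX in unsigned arithmetic. */
static u32 minus_one_unguarded(u32 max_sge)
{
	return max_sge - 1;
}

/* Guarded: decrement only when non-zero, so a zero cap stays zero. */
static u32 minus_one_guarded(u32 max_sge)
{
	if (max_sge)
		max_sge--;
	return max_sge;
}
```

With the old int field a zero cap produced -1, which a later signed comparison
could still reject; with u32 the same subtract produces a huge positive value,
so the decrement has to be guarded before any min() is applied.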
---
 drivers/infiniband/core/cq.c               |  3 +-
 drivers/infiniband/hw/qedr/verbs.c         |  2 +-
 drivers/infiniband/sw/rxe/rxe_qp.c         | 22 +++++-----
 drivers/infiniband/sw/rxe/rxe_srq.c        | 16 +++----
 drivers/infiniband/ulp/ipoib/ipoib_cm.c    | 10 ++---
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |  3 +-
 drivers/infiniband/ulp/iser/iser_verbs.c   |  5 +--
 drivers/infiniband/ulp/isert/ib_isert.c    |  7 ++-
 drivers/infiniband/ulp/rtrs/rtrs-clt.c     | 11 ++---
 drivers/infiniband/ulp/rtrs/rtrs-srv.c     | 11 ++---
 drivers/infiniband/ulp/srp/ib_srp.c        |  2 +-
 drivers/infiniband/ulp/srpt/ib_srpt.c      | 21 +++++----
 drivers/nvme/host/rdma.c                   |  8 ++--
 drivers/nvme/target/rdma.c                 | 13 +++---
 fs/smb/smbdirect/accept.c                  |  5 ++-
 fs/smb/smbdirect/connect.c                 |  5 ++-
 fs/smb/smbdirect/connection.c              |  8 ++--
 include/linux/sunrpc/svc_rdma.h            |  4 +-
 include/rdma/ib_verbs.h                    | 50 +++++++++++-----------
 net/rds/ib.c                               | 10 ++---
 net/rds/ib_cm.c                            | 10 ++---
 net/sunrpc/xprtrdma/frwr_ops.c             |  7 +--
 net/sunrpc/xprtrdma/svc_rdma_transport.c   |  5 +--
 net/sunrpc/xprtrdma/verbs.c                |  2 +-
 24 files changed, 117 insertions(+), 123 deletions(-)

diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index 3d7b6cddd131..ee98188e57fb 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -393,8 +393,7 @@ static int ib_alloc_cqs(struct ib_device *dev, unsigned int nr_cqes,
 	 * a reasonable batch size so that we can share CQs between
 	 * multiple users instead of allocating a larger number of CQs.
 	 */
-	nr_cqes = min_t(unsigned int, dev->attrs.max_cqe,
-			max(nr_cqes, IB_MAX_SHARED_CQ_SZ));
+	nr_cqes = min(dev->attrs.max_cqe, max(nr_cqes, IB_MAX_SHARED_CQ_SZ));
 	nr_cqs = min_t(unsigned int, dev->num_comp_vectors, num_online_cpus());
 	for (i = 0; i < nr_cqs; i++) {
 		cq = ib_alloc_cq(dev, NULL, nr_cqes, i, poll_ctx);
diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
index 679aa6f3a63b..a85ad0171134 100644
--- a/drivers/infiniband/hw/qedr/verbs.c
+++ b/drivers/infiniband/hw/qedr/verbs.c
@@ -151,7 +151,7 @@ int qedr_query_device(struct ib_device *ibdev,
 	attr->max_qp_init_rd_atom =
 	    1 << (fls(qattr->max_qp_req_rd_atomic_resc) - 1);
 	attr->max_qp_rd_atom =
-	    min(1 << (fls(qattr->max_qp_resp_rd_atomic_resc) - 1),
+	    min(1U << (fls(qattr->max_qp_resp_rd_atomic_resc) - 1),
 		attr->max_qp_init_rd_atom);
 
 	attr->max_srq = qattr->max_srq;
diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index f3dff1aea96a..7a0529a17992 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -67,27 +67,27 @@ static int rxe_qp_chk_cap(struct rxe_dev *rxe, struct ib_qp_cap *cap,
 			  int has_srq)
 {
 	if (cap->max_send_wr > rxe->attr.max_qp_wr) {
-		rxe_dbg_dev(rxe, "invalid send wr = %u > %d\n",
-			 cap->max_send_wr, rxe->attr.max_qp_wr);
+		rxe_dbg_dev(rxe, "invalid send wr = %u > %u\n",
+			    cap->max_send_wr, rxe->attr.max_qp_wr);
 		goto err1;
 	}
 
 	if (cap->max_send_sge > rxe->attr.max_send_sge) {
-		rxe_dbg_dev(rxe, "invalid send sge = %u > %d\n",
-			 cap->max_send_sge, rxe->attr.max_send_sge);
+		rxe_dbg_dev(rxe, "invalid send sge = %u > %u\n",
+			    cap->max_send_sge, rxe->attr.max_send_sge);
 		goto err1;
 	}
 
 	if (!has_srq) {
 		if (cap->max_recv_wr > rxe->attr.max_qp_wr) {
-			rxe_dbg_dev(rxe, "invalid recv wr = %u > %d\n",
-				 cap->max_recv_wr, rxe->attr.max_qp_wr);
+			rxe_dbg_dev(rxe, "invalid recv wr = %u > %u\n",
+				    cap->max_recv_wr, rxe->attr.max_qp_wr);
 			goto err1;
 		}
 
 		if (cap->max_recv_sge > rxe->attr.max_recv_sge) {
-			rxe_dbg_dev(rxe, "invalid recv sge = %u > %d\n",
-				 cap->max_recv_sge, rxe->attr.max_recv_sge);
+			rxe_dbg_dev(rxe, "invalid recv sge = %u > %u\n",
+				    cap->max_recv_sge, rxe->attr.max_recv_sge);
 			goto err1;
 		}
 	}
@@ -537,9 +537,9 @@ int rxe_qp_chk_attr(struct rxe_dev *rxe, struct rxe_qp *qp,
 
 	if (mask & IB_QP_MAX_QP_RD_ATOMIC) {
 		if (attr->max_rd_atomic > rxe->attr.max_qp_rd_atom) {
-			rxe_dbg_qp(qp, "invalid max_rd_atomic %d > %d\n",
-				 attr->max_rd_atomic,
-				 rxe->attr.max_qp_rd_atom);
+			rxe_dbg_qp(qp, "invalid max_rd_atomic %u > %u\n",
+				   attr->max_rd_atomic,
+				   rxe->attr.max_qp_rd_atom);
 			goto err1;
 		}
 	}
diff --git a/drivers/infiniband/sw/rxe/rxe_srq.c b/drivers/infiniband/sw/rxe/rxe_srq.c
index c9a7cd38953d..74904a6fdf2b 100644
--- a/drivers/infiniband/sw/rxe/rxe_srq.c
+++ b/drivers/infiniband/sw/rxe/rxe_srq.c
@@ -13,8 +13,8 @@ int rxe_srq_chk_init(struct rxe_dev *rxe, struct ib_srq_init_attr *init)
 	struct ib_srq_attr *attr = &init->attr;
 
 	if (attr->max_wr > rxe->attr.max_srq_wr) {
-		rxe_dbg_dev(rxe, "max_wr(%d) > max_srq_wr(%d)\n",
-			attr->max_wr, rxe->attr.max_srq_wr);
+		rxe_dbg_dev(rxe, "max_wr(%u) > max_srq_wr(%u)\n",
+			    attr->max_wr, rxe->attr.max_srq_wr);
 		goto err1;
 	}
 
@@ -27,8 +27,8 @@ int rxe_srq_chk_init(struct rxe_dev *rxe, struct ib_srq_init_attr *init)
 		attr->max_wr = RXE_MIN_SRQ_WR;
 
 	if (attr->max_sge > rxe->attr.max_srq_sge) {
-		rxe_dbg_dev(rxe, "max_sge(%d) > max_srq_sge(%d)\n",
-			attr->max_sge, rxe->attr.max_srq_sge);
+		rxe_dbg_dev(rxe, "max_sge(%u) > max_srq_sge(%u)\n",
+			    attr->max_sge, rxe->attr.max_srq_sge);
 		goto err1;
 	}
 
@@ -107,8 +107,8 @@ int rxe_srq_chk_attr(struct rxe_dev *rxe, struct rxe_srq *srq,
 
 	if (mask & IB_SRQ_MAX_WR) {
 		if (attr->max_wr > rxe->attr.max_srq_wr) {
-			rxe_dbg_srq(srq, "max_wr(%d) > max_srq_wr(%d)\n",
-				attr->max_wr, rxe->attr.max_srq_wr);
+			rxe_dbg_srq(srq, "max_wr(%u) > max_srq_wr(%u)\n",
+				    attr->max_wr, rxe->attr.max_srq_wr);
 			goto err1;
 		}
 
@@ -129,8 +129,8 @@ int rxe_srq_chk_attr(struct rxe_dev *rxe, struct rxe_srq *srq,
 
 	if (mask & IB_SRQ_LIMIT) {
 		if (attr->srq_limit > rxe->attr.max_srq_wr) {
-			rxe_dbg_srq(srq, "srq_limit(%d) > max_srq_wr(%d)\n",
-				attr->srq_limit, rxe->attr.max_srq_wr);
+			rxe_dbg_srq(srq, "srq_limit(%u) > max_srq_wr(%u)\n",
+				    attr->srq_limit, rxe->attr.max_srq_wr);
 			goto err1;
 		}
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 57fec88a1629..ed0592898384 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -1071,8 +1071,7 @@ static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ipoib_
 	struct ib_qp *tx_qp;
 
 	if (dev->features & NETIF_F_SG)
-		attr.cap.max_send_sge = min_t(u32, priv->ca->attrs.max_send_sge,
-					      MAX_SKB_FRAGS + 1);
+		attr.cap.max_send_sge = min(priv->ca->attrs.max_send_sge, MAX_SKB_FRAGS + 1);
 
 	tx_qp = ib_create_qp(priv->pd, &attr);
 	tx->max_send_sge = attr.cap.max_send_sge;
@@ -1582,7 +1581,8 @@ static void ipoib_cm_create_srq(struct net_device *dev, int max_sge)
 int ipoib_cm_dev_init(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = ipoib_priv(dev);
-	int max_srq_sge, i;
+	u32 max_srq_sge;
+	int i;
 	u8 addr;
 
 	INIT_LIST_HEAD(&priv->cm.passive_ids);
@@ -1600,9 +1600,9 @@ int ipoib_cm_dev_init(struct net_device *dev)
 
 	skb_queue_head_init(&priv->cm.skb_queue);
 
-	ipoib_dbg(priv, "max_srq_sge=%d\n", priv->ca->attrs.max_srq_sge);
+	ipoib_dbg(priv, "max_srq_sge=%u\n", priv->ca->attrs.max_srq_sge);
 
-	max_srq_sge = min_t(int, IPOIB_CM_RX_SG, priv->ca->attrs.max_srq_sge);
+	max_srq_sge = min(priv->ca->attrs.max_srq_sge, IPOIB_CM_RX_SG);
 	ipoib_cm_create_srq(dev, max_srq_sge);
 	if (ipoib_cm_has_srq(dev)) {
 		priv->cm.max_cm_mtu = max_srq_sge * PAGE_SIZE - 0x10;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index 3ed1ea566690..2490696a1aab 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -147,8 +147,7 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 		.cap = {
 			.max_send_wr  = ipoib_sendq_size,
 			.max_recv_wr  = ipoib_recvq_size,
-			.max_send_sge = min_t(u32, priv->ca->attrs.max_send_sge,
-					      MAX_SKB_FRAGS + 1),
+			.max_send_sge = min(priv->ca->attrs.max_send_sge, MAX_SKB_FRAGS + 1),
 			.max_recv_sge = IPOIB_UD_RX_SG
 		},
 		.sq_sig_type = IB_SIGNAL_ALL_WR,
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index f03b3bb3c0c4..55fe68e5b837 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -244,8 +244,7 @@ static int iser_create_ib_conn_res(struct ib_conn *ib_conn)
 		max_send_wr = ISER_QP_SIG_MAX_REQ_DTOS + 1;
 	else
 		max_send_wr = ISER_QP_MAX_REQ_DTOS + 1;
-	max_send_wr = min_t(unsigned int, max_send_wr,
-			    (unsigned int)ib_dev->attrs.max_qp_wr);
+	max_send_wr = min(max_send_wr, ib_dev->attrs.max_qp_wr);
 
 	cq_size = max_send_wr + ISER_QP_MAX_RECV_DTOS;
 	ib_conn->cq = ib_cq_pool_get(ib_dev, cq_size, -1, IB_POLL_SOFTIRQ);
@@ -589,7 +588,7 @@ static void iser_route_handler(struct rdma_cm_id *cma_id)
 		goto failure;
 
 	memset(&conn_param, 0, sizeof conn_param);
-	conn_param.responder_resources = ib_dev->attrs.max_qp_rd_atom;
+	conn_param.responder_resources = min(ib_dev->attrs.max_qp_rd_atom, U8_MAX);
 	conn_param.initiator_depth = 1;
 	conn_param.retry_count = 7;
 	conn_param.rnr_retry_count = 6;
diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index 1015a51f750a..4691845bf815 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -214,9 +214,9 @@ isert_create_device_ib_res(struct isert_device *device)
 	struct ib_device *ib_dev = device->ib_device;
 	int ret;
 
-	isert_dbg("devattr->max_send_sge: %d devattr->max_recv_sge %d\n",
+	isert_dbg("devattr->max_send_sge: %u devattr->max_recv_sge %u\n",
 		  ib_dev->attrs.max_send_sge, ib_dev->attrs.max_recv_sge);
-	isert_dbg("devattr->max_sge_rd: %d\n", ib_dev->attrs.max_sge_rd);
+	isert_dbg("devattr->max_sge_rd: %u\n", ib_dev->attrs.max_sge_rd);
 
 	device->pd = ib_alloc_pd(ib_dev, 0);
 	if (IS_ERR(device->pd)) {
@@ -381,8 +381,7 @@ isert_set_nego_params(struct isert_conn *isert_conn,
 	struct ib_device_attr *attr = &isert_conn->device->ib_device->attrs;
 
 	/* Set max inflight RDMA READ requests */
-	isert_conn->initiator_depth = min_t(u8, param->initiator_depth,
-				attr->max_qp_init_rd_atom);
+	isert_conn->initiator_depth = min(param->initiator_depth, attr->max_qp_init_rd_atom);
 	isert_dbg("Using initiator_depth: %u\n", isert_conn->initiator_depth);
 
 	if (param->private_data) {
diff --git a/drivers/infiniband/ulp/rtrs/rtrs-clt.c b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
index e351552733df..80b08697f96b 100644
--- a/drivers/infiniband/ulp/rtrs/rtrs-clt.c
+++ b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
@@ -1681,8 +1681,7 @@ static int create_con_cq_qp(struct rtrs_clt_con *con)
 		 * + 2 for drain and heartbeat
 		 * in case qp gets into error state.
 		 */
-		max_send_wr =
-			min_t(int, wr_limit, SERVICE_CON_QUEUE_DEPTH * 2 + 2);
+		max_send_wr = min(wr_limit, SERVICE_CON_QUEUE_DEPTH * 2 + 2);
 		max_recv_wr = max_send_wr;
 	} else {
 		/*
@@ -1698,11 +1697,9 @@ static int create_con_cq_qp(struct rtrs_clt_con *con)
 		wr_limit = clt_path->s.dev->ib_dev->attrs.max_qp_wr;
 		/* Shared between connections */
 		clt_path->s.dev_ref++;
-		max_send_wr = min_t(int, wr_limit,
-			      /* QD * (REQ + RSP + FR REGS or INVS) + drain */
-			      clt_path->queue_depth * 4 + 1);
-		max_recv_wr = min_t(int, wr_limit,
-			      clt_path->queue_depth * 3 + 1);
+		/* QD * (REQ + RSP + FR REGS or INVS) + drain */
+		max_send_wr = min(wr_limit, clt_path->queue_depth * 4 + 1);
+		max_recv_wr = min(wr_limit, clt_path->queue_depth * 3 + 1);
 		max_send_sge = 2;
 	}
 	atomic_set(&con->c.sq_wr_avail, max_send_wr);
diff --git a/drivers/infiniband/ulp/rtrs/rtrs-srv.c b/drivers/infiniband/ulp/rtrs/rtrs-srv.c
index 6482ad859bd1..f5a6890235bc 100644
--- a/drivers/infiniband/ulp/rtrs/rtrs-srv.c
+++ b/drivers/infiniband/ulp/rtrs/rtrs-srv.c
@@ -1731,21 +1731,16 @@ static int create_con(struct rtrs_srv_path *srv_path,
 		 * All receive and all send (each requiring invalidate)
 		 * + 2 for drain and heartbeat
 		 */
-		max_send_wr = min_t(int, wr_limit,
-				    SERVICE_CON_QUEUE_DEPTH * 2 + 2);
+		max_send_wr = min(wr_limit, SERVICE_CON_QUEUE_DEPTH * 2 + 2);
 		max_recv_wr = max_send_wr;
 		s->signal_interval = min_not_zero(srv->queue_depth,
 						  (size_t)SERVICE_CON_QUEUE_DEPTH);
 	} else {
 		/* when always_invlaidate enalbed, we need linv+rinv+mr+imm */
 		if (always_invalidate)
-			max_send_wr =
-				min_t(int, wr_limit,
-				      srv->queue_depth * (1 + 4) + 1);
+			max_send_wr = min(wr_limit, srv->queue_depth * (1 + 4) + 1);
 		else
-			max_send_wr =
-				min_t(int, wr_limit,
-				      srv->queue_depth * (1 + 2) + 1);
+			max_send_wr = min(wr_limit, srv->queue_depth * (1 + 2) + 1);
 
 		max_recv_wr = srv->queue_depth + 1;
 	}
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index acbd787de265..0caebbc2810f 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -557,7 +557,7 @@ static int srp_create_ch_ib(struct srp_rdma_ch *ch)
 	init_attr->cap.max_send_wr     = m * target->queue_size;
 	init_attr->cap.max_recv_wr     = target->queue_size + 1;
 	init_attr->cap.max_recv_sge    = 1;
-	init_attr->cap.max_send_sge    = min(SRP_MAX_SGE, attr->max_send_sge);
+	init_attr->cap.max_send_sge    = min(attr->max_send_sge, SRP_MAX_SGE);
 	init_attr->sq_sig_type         = IB_SIGNAL_REQ_WR;
 	init_attr->qp_type             = IB_QPT_RC;
 	init_attr->send_cq             = send_cq;
diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c b/drivers/infiniband/ulp/srpt/ib_srpt.c
index 9aec5d80117f..a4e4feba4a02 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.c
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
@@ -77,8 +77,8 @@ module_param(srp_max_req_size, int, 0444);
 MODULE_PARM_DESC(srp_max_req_size,
 		 "Maximum size of SRP request messages in bytes.");
 
-static int srpt_srq_size = DEFAULT_SRPT_SRQ_SIZE;
-module_param(srpt_srq_size, int, 0444);
+static unsigned int srpt_srq_size = DEFAULT_SRPT_SRQ_SIZE;
+module_param(srpt_srq_size, uint, 0444);
 MODULE_PARM_DESC(srpt_srq_size,
 		 "Shared receive queue (SRQ) size.");
 
@@ -405,8 +405,7 @@ static void srpt_get_ioc(struct srpt_port *sport, u32 slot,
 	if (sdev->use_srq)
 		send_queue_depth = sdev->srq_size;
 	else
-		send_queue_depth = min(MAX_SRPT_RQ_SIZE,
-				       sdev->device->attrs.max_qp_wr);
+		send_queue_depth = min(sdev->device->attrs.max_qp_wr, MAX_SRPT_RQ_SIZE);
 
 	memset(iocp, 0, sizeof(*iocp));
 	strcpy(iocp->id_string, SRPT_ID_STRING);
@@ -1850,7 +1849,7 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
 	struct srpt_port *sport = ch->sport;
 	struct srpt_device *sdev = sport->sdev;
 	const struct ib_device_attr *attrs = &sdev->device->attrs;
-	int sq_size = sport->port_attrib.srp_sq_size;
+	u32 sq_size = sport->port_attrib.srp_sq_size;
 	int i, ret;
 
 	WARN_ON(ch->rq_size < 1);
@@ -1911,13 +1910,13 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
 		bool retry = sq_size > MIN_SRPT_SQ_SIZE;
 
 		if (retry) {
-			pr_debug("failed to create queue pair with sq_size = %d (%d) - retrying\n",
+			pr_debug("failed to create queue pair with sq_size = %u (%d) - retrying\n",
 				 sq_size, ret);
 			ib_cq_pool_put(ch->cq, ch->cq_size);
 			sq_size = max(sq_size / 2, MIN_SRPT_SQ_SIZE);
 			goto retry;
 		} else {
-			pr_err("failed to create queue pair with sq_size = %d (%d)\n",
+			pr_err("failed to create queue pair with sq_size = %u (%d)\n",
 			       sq_size, ret);
 			goto err_destroy_cq;
 		}
@@ -1925,7 +1924,7 @@ static int srpt_create_ch_ib(struct srpt_rdma_ch *ch)
 
 	atomic_set(&ch->sq_wr_avail, qp_init->cap.max_send_wr);
 
-	pr_debug("%s: max_cqe= %d max_sge= %d sq_size = %d ch= %p\n",
+	pr_debug("%s: max_cqe= %d max_sge= %d sq_size = %u ch= %p\n",
 		 __func__, ch->cq->cqe, qp_init->cap.max_send_sge,
 		 qp_init->cap.max_send_wr, ch);
 
@@ -2298,7 +2297,7 @@ static int srpt_cm_req_recv(struct srpt_device *const sdev,
 	 * depth to avoid that the initiator driver has to report QUEUE_FULL
 	 * to the SCSI mid-layer.
 	 */
-	ch->rq_size = min(MAX_SRPT_RQ_SIZE, sdev->device->attrs.max_qp_wr);
+	ch->rq_size = min(sdev->device->attrs.max_qp_wr, MAX_SRPT_RQ_SIZE);
 	spin_lock_init(&ch->spinlock);
 	ch->state = CH_CONNECTING;
 	INIT_LIST_HEAD(&ch->cmd_wait_list);
@@ -3136,7 +3135,7 @@ static int srpt_alloc_srq(struct srpt_device *sdev)
 		return PTR_ERR(srq);
 	}
 
-	pr_debug("create SRQ #wr= %d max_allow=%d dev= %s\n", sdev->srq_size,
+	pr_debug("create SRQ #wr= %d max_allow=%u dev= %s\n", sdev->srq_size,
 		 sdev->device->attrs.max_srq_wr, dev_name(&device->dev));
 
 	sdev->req_buf_cache = srpt_cache_get(srp_max_req_size);
@@ -3951,7 +3950,7 @@ static int __init srpt_init_module(void)
 
 	if (srpt_srq_size < MIN_SRPT_SRQ_SIZE
 	    || srpt_srq_size > MAX_SRPT_SRQ_SIZE) {
-		pr_err("invalid value %d for kernel module parameter srpt_srq_size -- must be in the range [%d..%d].\n",
+		pr_err("invalid value %u for kernel module parameter srpt_srq_size -- must be in the range [%d..%d].\n",
 		       srpt_srq_size, MIN_SRPT_SRQ_SIZE, MAX_SRPT_SRQ_SIZE);
 		goto out;
 	}
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 6909e3542794..56cd228af1d5 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -394,8 +394,10 @@ nvme_rdma_find_get_device(struct rdma_cm_id *cm_id)
 		goto out_free_pd;
 	}
 
-	ndev->num_inline_segments = min(NVME_RDMA_MAX_INLINE_SEGMENTS,
-					ndev->dev->attrs.max_send_sge - 1);
+	ndev->num_inline_segments = ndev->dev->attrs.max_send_sge;
+	if (ndev->num_inline_segments)
+		ndev->num_inline_segments--;
+	ndev->num_inline_segments = min(ndev->num_inline_segments, NVME_RDMA_MAX_INLINE_SEGMENTS);
 	list_add(&ndev->entry, &device_list);
 out_unlock:
 	mutex_unlock(&device_list_mutex);
@@ -1847,7 +1849,7 @@ static int nvme_rdma_route_resolved(struct nvme_rdma_queue *queue)
 	param.qp_num = queue->qp->qp_num;
 	param.flow_control = 1;
 
-	param.responder_resources = queue->device->dev->attrs.max_qp_rd_atom;
+	param.responder_resources = min(queue->device->dev->attrs.max_qp_rd_atom, U8_MAX);
 	/* maximum retry count */
 	param.retry_count = 7;
 	param.rnr_retry_count = 7;
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index ac26f4f774c4..1c332d66222a 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -152,7 +152,7 @@ static const struct kernel_param_ops srq_size_ops = {
 	.get = param_get_int,
 };
 
-static int nvmet_rdma_srq_size = 1024;
+static unsigned int nvmet_rdma_srq_size = 1024;
 module_param_cb(srq_size, &srq_size_ops, &nvmet_rdma_srq_size, 0644);
 MODULE_PARM_DESC(srq_size, "set Shared Receive Queue (SRQ) size, should >= 256 (default: 1024)");
 
@@ -1197,7 +1197,7 @@ nvmet_rdma_find_get_device(struct rdma_cm_id *cm_id)
 	struct nvmet_port *nport = port->nport;
 	struct nvmet_rdma_device *ndev;
 	int inline_page_count;
-	int inline_sge_count;
+	u32 inline_sge_count;
 	int ret;
 
 	mutex_lock(&device_list_mutex);
@@ -1213,7 +1213,9 @@ nvmet_rdma_find_get_device(struct rdma_cm_id *cm_id)
 
 	inline_page_count = num_pages(nport->inline_data_size);
 	inline_sge_count = max(cm_id->device->attrs.max_sge_rd,
-				cm_id->device->attrs.max_recv_sge) - 1;
+				cm_id->device->attrs.max_recv_sge);
+	if (inline_sge_count)
+		inline_sge_count--;
 	if (inline_page_count > inline_sge_count) {
 		pr_warn("inline_data_size %d cannot be supported by device %s. Reducing to %lu.\n",
 			nport->inline_data_size, cm_id->device->name,
@@ -1553,8 +1555,9 @@ static int nvmet_rdma_cm_accept(struct rdma_cm_id *cm_id,
 
 	param.rnr_retry_count = 7;
 	param.flow_control = 1;
-	param.initiator_depth = min_t(u8, p->initiator_depth,
-		queue->dev->device->attrs.max_qp_init_rd_atom);
+	param.initiator_depth = min3(p->initiator_depth,
+				     queue->dev->device->attrs.max_qp_init_rd_atom,
+				     U8_MAX);
 	param.private_data = &priv;
 	param.private_data_len = sizeof(priv);
 	priv.recfmt = cpu_to_le16(NVME_RDMA_CM_FMT_1_0);
diff --git a/fs/smb/smbdirect/accept.c b/fs/smb/smbdirect/accept.c
index 529740005838..44b681a20725 100644
--- a/fs/smb/smbdirect/accept.c
+++ b/fs/smb/smbdirect/accept.c
@@ -32,8 +32,9 @@ int smbdirect_accept_connect_request(struct smbdirect_socket *sc,
 	/*
 	 * First set what the we as server are able to support
 	 */
-	sp->initiator_depth = min_t(u8, sp->initiator_depth,
-				    sc->ib.dev->attrs.max_qp_rd_atom);
+	sp->initiator_depth = min3(sp->initiator_depth,
+				   sc->ib.dev->attrs.max_qp_rd_atom,
+				   U8_MAX);
 
 	peer_initiator_depth = param->initiator_depth;
 	peer_responder_resources = param->responder_resources;
diff --git a/fs/smb/smbdirect/connect.c b/fs/smb/smbdirect/connect.c
index cd726b399afe..34a3e72c38fb 100644
--- a/fs/smb/smbdirect/connect.c
+++ b/fs/smb/smbdirect/connect.c
@@ -182,8 +182,9 @@ static int smbdirect_connect_rdma_connect(struct smbdirect_socket *sc)
 	if (sc->ib.dev->attrs.kernel_cap_flags & IBK_SG_GAPS_REG)
 		sc->mr_io.type = IB_MR_TYPE_SG_GAPS;
 
-	sp->responder_resources = min_t(u8, sp->responder_resources,
-					sc->ib.dev->attrs.max_qp_rd_atom);
+	sp->responder_resources = min3(sp->responder_resources,
+				       sc->ib.dev->attrs.max_qp_rd_atom,
+				       U8_MAX);
 	smbdirect_log_rdma_mr(sc, SMBDIRECT_LOG_INFO,
 		"responder_resources=%d\n",
 		sp->responder_resources);
diff --git a/fs/smb/smbdirect/connection.c b/fs/smb/smbdirect/connection.c
index 8adf58097534..690acb84e1b5 100644
--- a/fs/smb/smbdirect/connection.c
+++ b/fs/smb/smbdirect/connection.c
@@ -287,7 +287,7 @@ int smbdirect_connection_create_qp(struct smbdirect_socket *sc)
 	    qp_cap.max_send_wr > sc->ib.dev->attrs.max_qp_wr) {
 		pr_err("Possible CQE overrun: max_send_wr %d\n",
 		       qp_cap.max_send_wr);
-		pr_err("device %.*s reporting max_cqe %d max_qp_wr %d\n",
+		pr_err("device %.*s reporting max_cqe %u max_qp_wr %u\n",
 		       IB_DEVICE_NAME_MAX,
 		       sc->ib.dev->name,
 		       sc->ib.dev->attrs.max_cqe,
@@ -302,7 +302,7 @@ int smbdirect_connection_create_qp(struct smbdirect_socket *sc)
 	     max_send_wr >= sc->ib.dev->attrs.max_qp_wr)) {
 		pr_err("Possible CQE overrun: rdma_send_wr %d + max_send_wr %d = %d\n",
 		       rdma_send_wr, qp_cap.max_send_wr, max_send_wr);
-		pr_err("device %.*s reporting max_cqe %d max_qp_wr %d\n",
+		pr_err("device %.*s reporting max_cqe %u max_qp_wr %u\n",
 		       IB_DEVICE_NAME_MAX,
 		       sc->ib.dev->name,
 		       sc->ib.dev->attrs.max_cqe,
@@ -316,7 +316,7 @@ int smbdirect_connection_create_qp(struct smbdirect_socket *sc)
 	    qp_cap.max_recv_wr > sc->ib.dev->attrs.max_qp_wr) {
 		pr_err("Possible CQE overrun: max_recv_wr %d\n",
 		       qp_cap.max_recv_wr);
-		pr_err("device %.*s reporting max_cqe %d max_qp_wr %d\n",
+		pr_err("device %.*s reporting max_cqe %u max_qp_wr %u\n",
 		       IB_DEVICE_NAME_MAX,
 		       sc->ib.dev->name,
 		       sc->ib.dev->attrs.max_cqe,
@@ -328,7 +328,7 @@ int smbdirect_connection_create_qp(struct smbdirect_socket *sc)
 
 	if (qp_cap.max_send_sge > sc->ib.dev->attrs.max_send_sge ||
 	    qp_cap.max_recv_sge > sc->ib.dev->attrs.max_recv_sge) {
-		pr_err("device %.*s max_send_sge/max_recv_sge = %d/%d too small\n",
+		pr_err("device %.*s max_send_sge/max_recv_sge = %u/%u too small\n",
 		       IB_DEVICE_NAME_MAX,
 		       sc->ib.dev->name,
 		       sc->ib.dev->attrs.max_send_sge,
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index df6e08aaad57..217f000be5d6 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -78,8 +78,8 @@ struct svcxprt_rdma {
 	struct rdma_cm_id    *sc_cm_id;		/* RDMA connection id */
 	struct list_head     sc_accept_q;	/* Conn. waiting accept */
 	struct rpcrdma_notification sc_rn;	/* removal notification */
-	int		     sc_ord;		/* RDMA read limit */
-	int                  sc_max_send_sges;
+	u32		     sc_ord;		/* RDMA read limit */
+	unsigned int         sc_max_send_sges;
 	bool		     sc_snd_w_inv;	/* OK to use Send With Invalidate */
 
 	atomic_t             sc_sq_avail;	/* SQEs ready to be consumed */
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9dd76f489a0b..b8b221b5f564 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -406,36 +406,36 @@ struct ib_device_attr {
 	u32			vendor_id;
 	u32			vendor_part_id;
 	u32			hw_ver;
-	int			max_qp;
-	int			max_qp_wr;
+	u32			max_qp;
+	u32			max_qp_wr;
 	u64			device_cap_flags;
 	u64			kernel_cap_flags;
-	int			max_send_sge;
-	int			max_recv_sge;
-	int			max_sge_rd;
-	int			max_cq;
-	int			max_cqe;
-	int			max_mr;
-	int			max_pd;
-	int			max_qp_rd_atom;
-	int			max_ee_rd_atom;
-	int			max_res_rd_atom;
-	int			max_qp_init_rd_atom;
-	int			max_ee_init_rd_atom;
+	u32			max_send_sge;
+	u32			max_recv_sge;
+	u32			max_sge_rd;
+	u32			max_cq;
+	u32			max_cqe;
+	u32			max_mr;
+	u32			max_pd;
+	u32			max_qp_rd_atom;
+	u32			max_ee_rd_atom;
+	u32			max_res_rd_atom;
+	u32			max_qp_init_rd_atom;
+	u32			max_ee_init_rd_atom;
 	enum ib_atomic_cap	atomic_cap;
 	enum ib_atomic_cap	masked_atomic_cap;
-	int			max_ee;
-	int			max_rdd;
-	int			max_mw;
-	int			max_raw_ipv6_qp;
-	int			max_raw_ethy_qp;
-	int			max_mcast_grp;
-	int			max_mcast_qp_attach;
-	int			max_total_mcast_qp_attach;
-	int			max_ah;
+	u32			max_ee;
+	u32			max_rdd;
+	u32			max_mw;
+	u32			max_raw_ipv6_qp;
+	u32			max_raw_ethy_qp;
+	u32			max_mcast_grp;
+	u32			max_mcast_qp_attach;
+	u32			max_total_mcast_qp_attach;
+	u32			max_ah;
 	int			max_srq;
-	int			max_srq_wr;
-	int			max_srq_sge;
+	u32			max_srq_wr;
+	u32			max_srq_sge;
 	unsigned int		max_fast_reg_page_list_len;
 	unsigned int		max_pi_fast_reg_page_list_len;
 	u16			max_pkeys;
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 39f87272e071..c62684d4259c 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -162,12 +162,12 @@ static int rds_ib_add_one(struct ib_device *device)
 		   IB_ODP_SUPPORT_READ);
 
 	rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
-		min_t(unsigned int, (device->attrs.max_mr / 2),
-		      rds_ib_mr_1m_pool_size) : rds_ib_mr_1m_pool_size;
+		min(device->attrs.max_mr / 2,
+		    rds_ib_mr_1m_pool_size) : rds_ib_mr_1m_pool_size;
 
 	rds_ibdev->max_8k_mrs = device->attrs.max_mr ?
-		min_t(unsigned int, ((device->attrs.max_mr / 2) * RDS_MR_8K_SCALE),
-		      rds_ib_mr_8k_pool_size) : rds_ib_mr_8k_pool_size;
+		min((device->attrs.max_mr / 2) * RDS_MR_8K_SCALE,
+		    rds_ib_mr_8k_pool_size) : rds_ib_mr_8k_pool_size;
 
 	rds_ibdev->max_initiator_depth = device->attrs.max_qp_init_rd_atom;
 	rds_ibdev->max_responder_resources = device->attrs.max_qp_rd_atom;
@@ -204,7 +204,7 @@ static int rds_ib_add_one(struct ib_device *device)
 		goto put_dev;
 	}
 
-	rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, max_1m_mrs = %d, max_8k_mrs = %d\n",
+	rdsdebug("RDS/IB: max_mr = %u, max_wrs = %d, max_sge = %d, max_1m_mrs = %d, max_8k_mrs = %d\n",
 		 device->attrs.max_mr, rds_ibdev->max_wrs, rds_ibdev->max_sge,
 		 rds_ibdev->max_1m_mrs, rds_ibdev->max_8k_mrs);
 
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 5667f0173b47..17e587c30076 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -173,11 +173,11 @@ static void rds_ib_cm_fill_conn_param(struct rds_connection *conn,
 
 	memset(conn_param, 0, sizeof(struct rdma_conn_param));
 
-	conn_param->responder_resources =
-		min_t(u32, rds_ibdev->max_responder_resources, max_responder_resources);
-	conn_param->initiator_depth =
-		min_t(u32, rds_ibdev->max_initiator_depth, max_initiator_depth);
-	conn_param->retry_count = min_t(unsigned int, rds_ib_retry_count, 7);
+	conn_param->responder_resources = min3(rds_ibdev->max_responder_resources,
+					       max_responder_resources, U8_MAX);
+	conn_param->initiator_depth = min3(rds_ibdev->max_initiator_depth,
+					   max_initiator_depth, U8_MAX);
+	conn_param->retry_count = min(rds_ib_retry_count, 7U);
 	conn_param->rnr_retry_count = 7;
 
 	if (dp) {
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 7f79a0a2601e..b2e437afe09d 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -172,8 +172,9 @@ int frwr_mr_init(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr *mr)
 int frwr_query_device(struct rpcrdma_ep *ep, const struct ib_device *device)
 {
 	const struct ib_device_attr *attrs = &device->attrs;
-	int max_qp_wr, depth, delta;
 	unsigned int max_sge;
+	u32 max_qp_wr;
+	int depth, delta;
 
 	if (!(attrs->device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS) ||
 	    attrs->max_fast_reg_page_list_len == 0) {
@@ -229,10 +230,10 @@ int frwr_query_device(struct rpcrdma_ep *ep, const struct ib_device *device)
 	}
 
 	max_qp_wr = attrs->max_qp_wr;
+	if (max_qp_wr < RPCRDMA_BACKWARD_WRS + 1 + RPCRDMA_MIN_SLOT_TABLE)
+		return -ENOMEM;
 	max_qp_wr -= RPCRDMA_BACKWARD_WRS;
 	max_qp_wr -= 1;
-	if (max_qp_wr < RPCRDMA_MIN_SLOT_TABLE)
-		return -ENOMEM;
 	if (ep->re_max_requests > max_qp_wr)
 		ep->re_max_requests = max_qp_wr;
 	ep->re_attr.cap.max_send_wr = ep->re_max_requests * depth;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index f18bc60d9f4f..c768cda2e544 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -544,8 +544,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 	set_bit(RDMAXPRT_CONN_PENDING, &newxprt->sc_flags);
 	memset(&conn_param, 0, sizeof conn_param);
 	conn_param.responder_resources = 0;
-	conn_param.initiator_depth = min_t(int, newxprt->sc_ord,
-					   dev->attrs.max_qp_init_rd_atom);
+	conn_param.initiator_depth = min(newxprt->sc_ord, dev->attrs.max_qp_init_rd_atom);
 	if (!conn_param.initiator_depth) {
 		ret = -EINVAL;
 		trace_svcrdma_initdepth_err(newxprt, ret);
@@ -570,7 +569,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 		dprintk("    local address   : %pIS:%u\n", sap, rpc_get_port(sap));
 		sap = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.dst_addr;
 		dprintk("    remote address  : %pIS:%u\n", sap, rpc_get_port(sap));
-		dprintk("    max_sge         : %d\n", newxprt->sc_max_send_sges);
+		dprintk("    max_sge         : %u\n", newxprt->sc_max_send_sges);
 		dprintk("    sq_depth        : %d\n", newxprt->sc_sq_depth);
 		dprintk("    rdma_rw_ctxs    : %d\n", ctxts);
 		dprintk("    max_requests    : %d\n", newxprt->sc_max_requests);
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index aecf9c0a153f..8ed9da6d2d2f 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -453,7 +453,7 @@ static int rpcrdma_ep_create(struct rpcrdma_xprt *r_xprt)
 	/* Client offers RDMA Read but does not initiate */
 	ep->re_remote_cma.initiator_depth = 0;
 	ep->re_remote_cma.responder_resources =
-		min_t(int, U8_MAX, device->attrs.max_qp_rd_atom);
+		min(device->attrs.max_qp_rd_atom, U8_MAX);
 
 	/* Limit transport retries so client can detect server
 	 * GID changes quickly. RPC layer handles re-establishing
-- 
2.34.1



* Re: [PATCH rdma-next v3] RDMA/mana_ib: Clamp adapter capabilities at the ib_device_attr boundary
From: Erni Sri Satya Vennela @ 2026-06-19 19:41 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: longli, kotaranov, Jason Gunthorpe, linux-rdma, linux-hyperv,
	linux-kernel
In-Reply-To: <20260611111745.GM327369@unreal>

On Thu, Jun 11, 2026 at 02:17:45PM +0300, Leon Romanovsky wrote:
> On Mon, May 25, 2026 at 12:01:01PM -0700, Erni Sri Satya Vennela wrote:
> > mana_ib stores its adapter capabilities internally as u32 in
> > struct mana_ib_adapter_caps. The IB core, however, exposes the
> > corresponding device attributes through struct ib_device_attr, where
> > fields such as max_qp, max_qp_wr, max_send_sge, max_recv_sge,
> > max_sge_rd, max_cq, max_cqe, max_mr, max_pd, max_qp_rd_atom,
> > max_res_rd_atom and max_qp_init_rd_atom are signed int.
> > 
> > mana_ib_query_device() is the only place that copies the cached u32
> > caps into these int fields. If a cap exceeds INT_MAX, the implicit
> > u32-to-int narrowing yields a negative value. Clamp each cap to
> > INT_MAX at this boundary so the values handed to the IB core are always
> > non-negative.
> > 
> > While here, fix a related overflow in the computation of
> > max_res_rd_atom. It is derived as max_qp_rd_atom * max_qp, both of
> > which are int after the assignment above; the multiplication can
> > overflow an int even with the new clamps in place. Widen to s64
> > before multiplying and clamp the result to INT_MAX.
> > 
> > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > ---
> > Changes in v3:
> > * Drop clamping from mana_ib_gd_query_adapter_caps(). The internal u32
> >   caps cache does not need to be clamped.
> > * Move all clamping exclusively to mana_ib_query_device(), which is the
> >   only place the cached u32 values are narrowed into the signed int
> >   fields of struct ib_device_attr.
> > * Reframe commit message: this is a u32-to-int type boundary fix, not a
> >   CVM/untrusted-hardware hardening patch.
> 
> You should align all types to u32 and avoid hiding the issue behind  
> min_t().
> 
> Thanks
Yes Leon, this patch is now at v7.
I plan to drop min_t entirely in the next version.
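
For illustration, a minimal userspace sketch of why a min_t(int, ...) on a u32
capability hides the problem instead of fixing it. The min_t() below is a
simplified stand-in for the kernel macro, and both helper names are
hypothetical; neither appears in the patch.

```c
#include <stdint.h>

/* Simplified userspace stand-in for the kernel's min_t(); not the real macro. */
#define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* Hypothetical helper: caps a device-reported u32 limit at a driver default. */
static inline int cap_with_min_t_int(uint32_t hw_cap, int driver_default)
{
	/*
	 * The cast inside min_t() converts hw_cap to int first. For values
	 * above INT_MAX the result is implementation-defined; GCC and Clang
	 * wrap, so the cap becomes negative and "wins" the comparison.
	 */
	return min_t(int, hw_cap, driver_default);
}

/* Same comparison with both operands kept as u32: no cast, no sign change. */
static inline uint32_t cap_as_u32(uint32_t hw_cap, uint32_t driver_default)
{
	return hw_cap < driver_default ? hw_cap : driver_default;
}
```

With a reported cap of 0x80000000 and a default of 64, the first helper yields
a negative value on GCC and Clang, while the second yields 64.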

- Vennela


* Re: [PATCH rdma-next v7] RDMA: Change capability fields in ib_device_attr from int to u32
From: Erni Sri Satya Vennela @ 2026-06-19 19:32 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: mkalderon, Jason Gunthorpe, Leon Romanovsky, zyjzyj2000, sagi,
	mgurtovoy, haris.iqbal, jinpu.wang, bvanassche, kbusch,
	Jens Axboe, Christoph Hellwig, kch, smfrench, linkinjeon, metze,
	tom, trondmy, anna, chuck.lever, jlayton, neil, okorniev, Dai.Ngo,
	achender, davem, edumazet, kuba, pabeni, horms, kees, ebadger,
	linux-rdma, linux-kernel, target-devel, linux-nvme, linux-cifs,
	samba-technical, linux-nfs, netdev, rds-devel, Jason Gunthorpe
In-Reply-To: <aigwONAwxQx6rLef@ashevche-desk.local>

Hi Andy,

Sorry for the delayed response.

> >  	attr->max_qp_init_rd_atom =
> >  	    1 << (fls(qattr->max_qp_req_rd_atomic_resc) - 1);
> 
> FWIW, this one and the one below look like a reinvention of rounddown_pow_of_two().

Acked.
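
As a rough userspace sketch of the equivalence being pointed out: fls_sketch()
and rounddown_pow2_sketch() below are stand-ins written for this example, not
the kernel's fls() or rounddown_pow_of_two(). Both forms assume a nonzero
input.

```c
#include <stdint.h>

/* Stand-in for fls(): 1-based index of the highest set bit, 0 when x == 0. */
static inline int fls_sketch(uint32_t x)
{
	int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/* Open-coded form from the quoted hunk; x must be nonzero. */
static inline uint32_t open_coded_pow2(uint32_t x)
{
	return 1U << (fls_sketch(x) - 1);
}

/* Stand-in for rounddown_pow_of_two(): largest power of two <= x, x nonzero. */
static inline uint32_t rounddown_pow2_sketch(uint32_t x)
{
	/* Clear the lowest set bit until only the highest one remains. */
	while (x & (x - 1))
		x &= x - 1;
	return x;
}
```

For any nonzero input the two functions return the same value, for example 4
for an input of 5.
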
> 
> >  	attr->max_qp_rd_atom =
> > -	    min(1 << (fls(qattr->max_qp_resp_rd_atomic_resc) - 1),
> > +	    min(1U << (fls(qattr->max_qp_resp_rd_atomic_resc) - 1),
> >  		attr->max_qp_init_rd_atom);
> 
> ...
> 
> >  int ipoib_cm_dev_init(struct net_device *dev)
> >  {
> >  	struct ipoib_dev_priv *priv = ipoib_priv(dev);
> > -	int max_srq_sge, i;
> > +	int i;
> > +	u32 max_srq_sge;
> >  	u8 addr;
> 
> It seems the order is reversed xmas tree, why not preserve it?
> 
Right. I'll fix it in the next version.
> ...
> 
> > --- a/drivers/infiniband/ulp/rtrs/rtrs-clt.c
> > +++ b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
> 
> >  		max_send_wr =
> > -			min_t(int, wr_limit, SERVICE_CON_QUEUE_DEPTH * 2 + 2);
> > +			min(wr_limit, SERVICE_CON_QUEUE_DEPTH * 2 + 2);
> 
> Now this fits perfectly on a single line
> 
> 		max_send_wr = min(wr_limit, SERVICE_CON_QUEUE_DEPTH * 2 + 2);
> 
> >  		max_recv_wr = max_send_wr;
> 
> ...
> 
> > -		max_send_wr = min_t(int, wr_limit,
> > -			      /* QD * (REQ + RSP + FR REGS or INVS) + drain */
> > -			      clt_path->queue_depth * 4 + 1);
> > -		max_recv_wr = min_t(int, wr_limit,
> > -			      clt_path->queue_depth * 3 + 1);
> > +		max_send_wr = min_t(u32, wr_limit,
> > +				    /* QD * (REQ + RSP + FR REGS or INVS) + drain */
> > +				    clt_path->queue_depth * 4 + 1);
> > +		max_recv_wr = min_t(u32, wr_limit,
> > +				    clt_path->queue_depth * 3 + 1);
> 
> Can we rather update the type of one of them and use min() instead?
> 
I'll remove all the min_t usages in the next version.
> ...
> 
> > --- a/drivers/infiniband/ulp/rtrs/rtrs-srv.c
> > +++ b/drivers/infiniband/ulp/rtrs/rtrs-srv.c
> 
> Ditto.
> 
> ...
> 
> > -static int srpt_srq_size = DEFAULT_SRPT_SRQ_SIZE;
> > -module_param(srpt_srq_size, int, 0444);
> > +static unsigned int srpt_srq_size = DEFAULT_SRPT_SRQ_SIZE;
> > +module_param(srpt_srq_size, uint, 0444);
> 
> > Theoretically this might break ABI (if somebody uses negative values for
> > anything). I don't think that's the case, but just be aware of it.
> 
Okay. Thank you for the information.

> >  MODULE_PARM_DESC(srpt_srq_size,
> >  		 "Shared receive queue (SRQ) size.");
> 
> ...
> 
> > --- a/drivers/nvme/target/rdma.c
> > +++ b/drivers/nvme/target/rdma.c
> 
> > -	ndev->srq_size = min(ndev->device->attrs.max_srq_wr,
> > -			     nvmet_rdma_srq_size);
> > -	ndev->srq_count = min(ndev->device->num_comp_vectors,
> > -			      ndev->device->attrs.max_srq);
> > +	ndev->srq_size = min_t(u32, ndev->device->attrs.max_srq_wr,
> > +			       nvmet_rdma_srq_size);
> > +	ndev->srq_count = min_t(u32, ndev->device->num_comp_vectors,
> > +				ndev->device->attrs.max_srq);
> 
> Same question, can we change the type of the variables instead?
>
Yes, I'll do that in the next version.
 
> >  	mutex_lock(&device_list_mutex);
> 
> ...
> 
> >  	inline_page_count = num_pages(nport->inline_data_size);
> >  	inline_sge_count = max(cm_id->device->attrs.max_sge_rd,
> > -				cm_id->device->attrs.max_recv_sge) - 1;
> > +				cm_id->device->attrs.max_recv_sge);
> > +	inline_sge_count = inline_sge_count ? inline_sge_count - 1 : 0;
> 
> Simple conditional might be better
> 
> 	if (inline_sge_count)
> 		inline_sge_count--;
> 	OR
> 		inline_sge_count -= 1;
Okay. I'll update all such instances.
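
A small userspace sketch of why the guard matters once the limits are
unsigned. The function names are hypothetical and only mirror the shape of the
quoted nvmet hunk.

```c
#include <stdint.h>

static inline uint32_t max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/* Unguarded "- 1": wraps to UINT32_MAX when both limits are 0. */
static inline uint32_t inline_sges_unguarded(uint32_t max_sge_rd,
					     uint32_t max_recv_sge)
{
	return max_u32(max_sge_rd, max_recv_sge) - 1;
}

/* Guarded form, using the simple conditional suggested in review. */
static inline uint32_t inline_sges_guarded(uint32_t max_sge_rd,
					   uint32_t max_recv_sge)
{
	uint32_t count = max_u32(max_sge_rd, max_recv_sge);

	if (count)
		count--;
	return count;
}
```

With the old int fields a zero limit gave -1, which later signed comparisons
could still catch; with u32 the unguarded form silently becomes the largest
possible count.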

> 
> ...
> 
> > +++ b/include/rdma/ib_verbs.h
> 
> > -	int			max_qp;
> > -	int			max_qp_wr;
> > +	u32			max_qp;
> > +	u32			max_qp_wr;
> 
> > Nice, but please check that none of these (and beyond) were used in signed
> > multiplication or (which is more disastrous) division. Otherwise there might be
> > subtle issues that will be hard to debug.
Yes, I have checked that for all the variables I updated.
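
For reference, a minimal userspace sketch of the kind of change in behaviour
this check is looking for. The helper names are made up for the example and do
not correspond to any call site in the series.

```c
#include <stdint.h>

/* Signed field: a negative value divides toward zero and stays small. */
static inline int half_of_int(int limit)
{
	return limit / 2;
}

/* Same bit pattern read as u32: the value is a huge positive number. */
static inline uint32_t half_of_u32(uint32_t limit)
{
	return limit / 2;
}

/* Mixed operands: the signed divisor is converted to unsigned first. */
static inline uint32_t divide_mixed(uint32_t total, int step)
{
	/* The explicit cast spells out what the usual arithmetic
	 * conversions would do implicitly for "total / step". */
	return total / (uint32_t)step;
}
```

Dividing -2 by 2 gives -1 as int but 0x7FFFFFFF as u32, and 8 divided by a
step of -4 gives 0 rather than -2 once the dividend is unsigned.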

> 
> ...
> 
> >  	conn_param->responder_resources =
> > -		min_t(u32, rds_ibdev->max_responder_resources, max_responder_resources);
> > +		min3(rds_ibdev->max_responder_resources,
> > +		     max_responder_resources, U8_MAX);
> >  	conn_param->initiator_depth =
> > -		min_t(u32, rds_ibdev->max_initiator_depth, max_initiator_depth);
> > +		min3(rds_ibdev->max_initiator_depth,
> > +		     max_initiator_depth, U8_MAX);
> 
> I believe we can go a few characters over and leave them to be single lines.
> 
Okay.
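
A short userspace sketch of what the U8_MAX bound in min3() guards against,
assuming the destination field is 8 bits wide, as the bound in the hunk
suggests. The helper names are hypothetical.

```c
#include <stdint.h>

static inline uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Without a clamp, storing into a u8 keeps only the low 8 bits. */
static inline uint8_t depth_unclamped(uint32_t dev_limit, uint32_t module_limit)
{
	return (uint8_t)min_u32(dev_limit, module_limit);
}

/* With the third operand, the value saturates at 255 instead of wrapping. */
static inline uint8_t depth_clamped(uint32_t dev_limit, uint32_t module_limit)
{
	return (uint8_t)min_u32(min_u32(dev_limit, module_limit), UINT8_MAX);
}
```

A device limit of 256 with a module limit of 300 becomes 0 without the clamp
and 255 with it.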

> >  	conn_param->retry_count = min_t(unsigned int, rds_ib_retry_count, 7);
> 
> What about this one?
Sorry, I missed this one. I'll update it.

> 
> >  	conn_param->rnr_retry_count = 7;
> 
> ...
> 
> >  int frwr_query_device(struct rpcrdma_ep *ep, const struct ib_device *device)
> >  {
> >  	const struct ib_device_attr *attrs = &device->attrs;
> > -	int max_qp_wr, depth, delta;
> > +	u32 max_qp_wr;
> > +	int depth, delta;
> >  	unsigned int max_sge;
> 
> Reversed xmas tree order.
Okay

Thank you for all your suggestions.
The next version will incorporate all of these changes.

- Vennela
> 
> -- 
> With Best Regards,
> Andy Shevchenko
> 


* RE: [EXTERNAL] [PATCH v2 2/2] RDMA/mana_ib: initialize err for empty send WR lists
From: Long Li @ 2026-06-19 17:09 UTC (permalink / raw)
  To: Ruoyu Wang, Jason Gunthorpe, Leon Romanovsky
  Cc: Cheng Xu, Kai Shen, Konstantin Taranov,
	linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20260618041752.481193-2-ruoyuw560@gmail.com>

> mana_ib_post_send() returns err after walking the send work request list.
> If the caller passes an empty list, the loop is skipped and err is not assigned.
> 
> Initialize err to 0 so an empty send work request list returns success instead of
> stack data.
> 
> Fixes: c8017f5b4856 ("RDMA/mana_ib: UD/GSI work requests")
> Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com>

Reviewed-by: Long Li <longli@microsoft.com>
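
To illustrate the pattern the quoted patch fixes, here is a minimal userspace
sketch. struct wr_sketch, post_one_sketch() and the bad_wr handling are
assumptions made for the example and are not the mana_ib code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a chained work request. */
struct wr_sketch {
	const struct wr_sketch *next;
	int opcode;
};

/* Hypothetical per-request handler: rejects negative opcodes. */
static int post_one_sketch(const struct wr_sketch *wr)
{
	return wr->opcode < 0 ? -EINVAL : 0;
}

static int post_send_sketch(const struct wr_sketch *wr,
			    const struct wr_sketch **bad_wr)
{
	/* Without the initializer, an empty list skips the loop and the
	 * function returns an indeterminate value. */
	int err = 0;

	for (; wr; wr = wr->next) {
		err = post_one_sketch(wr);
		if (err) {
			*bad_wr = wr;
			break;
		}
	}
	return err;
}
```

Calling post_send_sketch() with a NULL list returns 0 and leaves bad_wr
untouched.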


> ---
> v2:
> - Split the erdma and mana_ib changes into separate patches.
> - Add a driver-specific Fixes tag.
> 
>  drivers/infiniband/hw/mana/wr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/hw/mana/wr.c b/drivers/infiniband/hw/mana/wr.c
> index 1813567d3b16c..36a1d506f08f6 100644
> --- a/drivers/infiniband/hw/mana/wr.c
> +++ b/drivers/infiniband/hw/mana/wr.c
> @@ -144,7 +144,7 @@ static int mana_ib_post_send_ud(struct mana_ib_qp *qp, const struct ib_ud_wr *wr
>  int mana_ib_post_send(struct ib_qp *ibqp, const struct ib_send_wr *wr,
>                       const struct ib_send_wr **bad_wr)
>  {
> -       int err;
> +       int err = 0;
>         struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
> 
>         for (; wr; wr = wr->next) {
> --
> 2.51.0


* [PATCH net v2] net/smc: fix out-of-bounds read when sk_user_data holds a sk_psock
From: Sechang Lim @ 2026-06-19 15:03 UTC (permalink / raw)
  To: D . Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Mahanta Jambigi, Tony Lu, Wen Gu, Simon Horman, Ursula Braun,
	Karsten Graul, Guvenc Gulce, linux-rdma, linux-s390, netdev,
	linux-kernel, bpf

SMC stores its smc_sock in the clcsock's sk_user_data tagged
SK_USER_DATA_NOCOPY and reads it back with smc_clcsock_user_data(), which
only strips that flag. sockmap stores a sk_psock in the same field tagged
SK_USER_DATA_NOCOPY | SK_USER_DATA_PSOCK. Nothing keeps both off one
socket, and SMC then casts the sk_psock to an smc_sock.

A passive-open child hits this. It inherits the listener's
smc_clcsock_data_ready(), but sk_clone_lock() clears its NOCOPY
sk_user_data, and a BPF sock_ops program then adds the child to a sockmap,
installing a sk_psock in that field. The inherited callback reads it as an
smc_sock and dereferences a clcsk_* pointer past the end of the sk_psock:

  BUG: KASAN: slab-out-of-bounds in smc_clcsock_data_ready+0x84/0x200 net/smc/af_smc.c:2637
  Read of size 8 at addr ffff8880013b8674 by task syz.6.12484/67930
   <IRQ>
   smc_clcsock_data_ready+0x84/0x200 net/smc/af_smc.c:2637
   tcp_urg+0x24d/0x360 net/ipv4/tcp_input.c:6264
   tcp_rcv_state_process+0x280d/0x4940 net/ipv4/tcp_input.c:7336
   tcp_child_process+0x371/0xa50 net/ipv4/tcp_minisocks.c:1002
   tcp_v4_rcv+0x1eaa/0x2a00 net/ipv4/tcp_ipv4.c:2186
   [...]
   </IRQ>

  Allocated by task 67930:
   sk_psock_init+0x142/0x740 net/core/skmsg.c:766
   sock_hash_update_common+0xd3/0x990 net/core/sock_map.c:1010
   bpf_sock_hash_update+0x114/0x170 net/core/sock_map.c:1229
   __cgroup_bpf_run_filter_sock_ops+0x74/0xa0 kernel/bpf/cgroup.c:1727
   tcp_init_transfer+0x1085/0x1100 net/ipv4/tcp_input.c:6693
   [...]

sk_psock() already guards the other side, returning NULL unless
SK_USER_DATA_PSOCK is set. Make smc_clcsock_user_data() and its RCU
variant return the smc_sock only when sk_user_data carries SMC's tag
alone. A sk_psock then reads back as NULL, which the data_ready and
fallback callbacks already handle.
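
The check can be sketched in userspace as follows. The TAG_* values are
assumptions chosen to mirror the low-bit flag scheme described above, and
smc_owner_or_null() is a hypothetical name; the real definitions live in the
kernel headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed low-bit tags, mirroring the SK_USER_DATA_* flag scheme. */
#define TAG_NOCOPY  ((uintptr_t)1)
#define TAG_BPF     ((uintptr_t)2)
#define TAG_PSOCK   ((uintptr_t)4)
#define TAG_PTRMASK (~(TAG_NOCOPY | TAG_BPF | TAG_PSOCK))

/* Store a pointer together with its tags; p must be 8-byte aligned. */
static inline void *tag_ptr(void *p, uintptr_t tags)
{
	return (void *)((uintptr_t)p | tags);
}

/* Return the owner only when the slot carries the NOCOPY tag alone. */
static inline void *smc_owner_or_null(void *user_data)
{
	uintptr_t data = (uintptr_t)user_data;

	if ((data & ~TAG_PTRMASK) != TAG_NOCOPY)
		return NULL;
	return (void *)(data & TAG_PTRMASK);
}
```

A slot tagged NOCOPY | PSOCK, or an empty slot, reads back as NULL, so the
caller never sees a pointer of the wrong type.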

Fixes: a60a2b1e0af1 ("net/smc: reduce active tcp_listen workers")
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
 net/smc/smc.h | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/net/smc/smc.h b/net/smc/smc.h
index 52145df83f6e..88dfb459b7cc 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -342,13 +342,25 @@ static inline void smc_init_saved_callbacks(struct smc_sock *smc)
 
 static inline struct smc_sock *smc_clcsock_user_data(const struct sock *clcsk)
 {
-	return (struct smc_sock *)
-	       ((uintptr_t)clcsk->sk_user_data & ~SK_USER_DATA_NOCOPY);
+	uintptr_t data = (uintptr_t)clcsk->sk_user_data;
+
+	/*
+	 * Return the smc_sock only if the slot carries SMC's tag alone.
+	 * sockmap stores a sk_psock here tagged SK_USER_DATA_PSOCK; it is
+	 * not an smc_sock and must not be dereferenced as one.
+	 */
+	if ((data & ~SK_USER_DATA_PTRMASK) != SK_USER_DATA_NOCOPY)
+		return NULL;
+	return (struct smc_sock *)(data & SK_USER_DATA_PTRMASK);
 }
 
 static inline struct smc_sock *smc_clcsock_user_data_rcu(const struct sock *clcsk)
 {
-	return (struct smc_sock *)rcu_dereference_sk_user_data(clcsk);
+	uintptr_t data = (uintptr_t)rcu_dereference(__sk_user_data(clcsk));
+
+	if ((data & ~SK_USER_DATA_PTRMASK) != SK_USER_DATA_NOCOPY)
+		return NULL;
+	return (struct smc_sock *)(data & SK_USER_DATA_PTRMASK);
 }
 
 /* save target_cb in saved_cb, and replace target_cb with new_cb */
-- 
2.43.0



* Re: [PATCH net] net/smc: fix out-of-bounds read in smc_clcsock_data_ready()
From: Sechang Lim @ 2026-06-19 14:59 UTC (permalink / raw)
  To: D. Wythe
  Cc: Dust Li, Sidraya Jayagond, Wenjia Zhang, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, David S . Miller, Mahanta Jambigi,
	Tony Lu, Wen Gu, Simon Horman, Ursula Braun, Karsten Graul,
	Guvenc Gulce, netdev, linux-rdma, linux-s390, bpf, linux-kernel
In-Reply-To: <20260616071639.GA104390@j66a10360.sqa.eu95>

On Tue, Jun 16, 2026 at 03:16:39PM +0800, D. Wythe wrote:
>On Sun, Jun 14, 2026 at 12:09:30PM +0000, Sechang Lim wrote:
>> smc_clcsock_data_ready() is installed on the listen socket and reads its
>> sk_user_data as an smc_sock. A passive-open child inherits this callback,
>> but sk_clone_lock() clears the child's sk_user_data because it is tagged
>> SK_USER_DATA_NOCOPY. smc_tcp_syn_recv_sock() restores the child's af_ops,
>> but the inherited sk_data_ready() is left in place until accept.
>>
>> In that window the child is established. A cgroup sock_ops program can run
>> bpf_sock_hash_update() on it from tcp_init_transfer(); sk_psock_init()
>> stores a sk_psock in the NULL sk_user_data. The inherited callback then
>> reads sk_user_data via smc_clcsock_user_data(), which masks only
>> SK_USER_DATA_NOCOPY, mistakes the sk_psock for an smc_sock, and reads a
>> callback pointer past the end of the sk_psock:
>>
>>   BUG: KASAN: slab-out-of-bounds in smc_clcsock_data_ready+0x84/0x200 net/smc/af_smc.c:2637
>>   Read of size 8 at addr ffff8880013b8674 by task syz.6.12484/67930
>>    <IRQ>
>>    smc_clcsock_data_ready+0x84/0x200 net/smc/af_smc.c:2637
>>    tcp_urg+0x24d/0x360 net/ipv4/tcp_input.c:6264
>>    tcp_rcv_state_process+0x280d/0x4940 net/ipv4/tcp_input.c:7336
>>    tcp_child_process+0x371/0xa50 net/ipv4/tcp_minisocks.c:1002
>>    tcp_v4_rcv+0x1eaa/0x2a00 net/ipv4/tcp_ipv4.c:2186
>>    ip_protocol_deliver_rcu+0x226/0x420 net/ipv4/ip_input.c:207
>>    ip_local_deliver_finish+0x35a/0x5f0 net/ipv4/ip_input.c:241
>>    __netif_receive_skb_one_core+0x1e5/0x210 net/core/dev.c:6216
>>    process_backlog+0x631/0x1470 net/core/dev.c:6682
>>    __napi_poll+0xb3/0x320 net/core/dev.c:7749
>>    net_rx_action+0x4fa/0xcb0 net/core/dev.c:7969
>>    handle_softirqs+0x236/0x800 kernel/softirq.c:622
>>    </IRQ>
>>
>>   Allocated by task 67930:
>>    sk_psock_init+0x142/0x740 net/core/skmsg.c:766
>>    sock_map_link+0x646/0xdf0 net/core/sock_map.c:279
>>    sock_hash_update_common+0xd3/0x990 net/core/sock_map.c:1010
>>    bpf_sock_hash_update+0x114/0x170 net/core/sock_map.c:1229
>>    __cgroup_bpf_run_filter_sock_ops+0x74/0xa0 kernel/bpf/cgroup.c:1727
>>    tcp_init_transfer+0x1085/0x1100 net/ipv4/tcp_input.c:6693
>>    tcp_rcv_state_process+0x241e/0x4940 net/ipv4/tcp_input.c:7231
>>    tcp_child_process+0x371/0xa50 net/ipv4/tcp_minisocks.c:1002
>>
>> Restore the inherited sk_data_ready() in smc_tcp_syn_recv_sock(), where the
>> child's sk_user_data is already cleared, rather than only at accept.
>>
>> Fixes: a60a2b1e0af1 ("net/smc: reduce active tcp_listen workers")
>> Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
>> ---
>>  net/smc/af_smc.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
>> index b5db69073e20..152971e8ad17 100644
>> --- a/net/smc/af_smc.c
>> +++ b/net/smc/af_smc.c
>> @@ -156,6 +156,12 @@ static struct sock *smc_tcp_syn_recv_sock(const struct sock *sk,
>>  	if (child) {
>>  		rcu_assign_sk_user_data(child, NULL);
>>
>> +		/*
>> +		 * the child inherited the listen-specific sk_data_ready();
>> +		 * restore it here, as sk_user_data may be reused before accept
>> +		 */
>> +		child->sk_data_ready = smc->clcsk_data_ready;
>
>One concern:
>
>smc_clcsock_user_data_rcu() together with refcount_inc_not_zero() only
>pins the smc_sock; it does not guarantee anything about the lifetime or
>consistency of smc->clcsk_data_ready. In the listen-close path,
>smc_clcsock_restore_cb() clears that field under sk_callback_lock,
>while smc_tcp_syn_recv_sock() reads it without any lock. These are
>independent protection domains. If close wins the race,
>child->sk_data_ready can end up NULL and the next data arrival will
>crash.
>

Will drop the syn_recv restore in v2. Thanks for your review.

>Also, I don't object to this fix, but I'd rather see the underlying cause
>addressed directly. The real issue seems to be the conflict between
>SMC's sk_user_data and sk_psock. Maybe there is a cleaner solution, e.g.
>always setting user_data.
>

Agreed. 

Thanks, will send v2.

Best,
Sechang


* Re: [PATCH net] net: mana: Sync page pool RX frags for CPU
From: Simon Horman @ 2026-06-19  9:05 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: kys, haiyangz, wei.liu, longli, andrew+netdev, davem, edumazet,
	kuba, pabeni, kotaranov, ernis, dipayanroy, kees, jacob.e.keller,
	ssengar, linux-hyperv, netdev, linux-kernel, linux-rdma, stable
In-Reply-To: <20260618035029.249361-1-decui@microsoft.com>

On Wed, Jun 17, 2026 at 08:50:29PM -0700, Dexuan Cui wrote:
> MANA allocates RX buffers from page pool fragments when frag_count is
> greater than 1. In that case the buffers remain DMA mapped by page pool
> and the RX completion path does not call dma_unmap_single(). As a result,
> the implicit sync-for-CPU normally performed by dma_unmap_single() is
> missing before the packet data is passed to the networking stack.
> 
> This breaks RX on configurations which require explicit DMA syncing, for
> example when booted with swiotlb=force.
> 
> Fix this by recording the page pool page and DMA sync offset when the RX
> buffer is allocated, and syncing the received packet range for CPU access
> before handing the RX buffer to the stack.
> 
> Also validate the packet length reported in the RX CQE before using it as
> a DMA sync length or passing it to skb processing. The CQE is supplied
> by the device and should not be blindly trusted by Confidential VMs.

I think this last part warrants being split out into a separate patch.

> 
> Fixes: 730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers instead of full pages to improve memory efficiency.")
> Cc: stable@vger.kernel.org
> Signed-off-by: Dexuan Cui <decui@microsoft.com>

...


* Re: [PATCH net] net/smc: avoid recursive sk_callback_lock in listen data_ready
From: Dust Li @ 2026-06-19  6:35 UTC (permalink / raw)
  To: Runyu Xiao, D. Wythe, Sidraya Jayagond, Wenjia Zhang,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Mahanta Jambigi, Tony Lu, Wen Gu, Simon Horman, Karsten Graul,
	linux-rdma, linux-s390, netdev, linux-kernel, jianhao.xu, stable
In-Reply-To: <20260617152855.1039151-1-runyu.xiao@seu.edu.cn>

On 2026-06-17 23:28:55, Runyu Xiao wrote:
>smc_listen() installs smc_clcsock_data_ready() as the underlying TCP
>listen socket's sk_data_ready callback.  smc_clcsock_data_ready() then
>immediately takes sk_callback_lock before looking up the SMC listener and
>queuing smc_tcp_listen_work().
>
>That is unsafe once the TCP listen socket is leaving TCP_LISTEN.  The TCP
>close/flush path can run the installed sk_data_ready callback with
>sk_callback_lock already held, so entering smc_clcsock_data_ready() again
>tries to take the same rwlock recursively in the same thread.  The nvmet
>TCP listener had to make the same state check before taking
>sk_callback_lock for this reason.
>
>This issue was found by our static analysis tool and then manually
>reviewed against the current tree.
>
>The grounded PoC kept the SMC listen callback installation path:
>
>  smc_listen()
>  smc_clcsock_replace_cb()
>  sk_data_ready = smc_clcsock_data_ready()
>
>It then modeled the close/flush carrier that invokes the installed
>sk_data_ready callback while sk_callback_lock is already held.  Lockdep
>reported the same-thread recursive acquisition:
>
>  WARNING: possible recursive locking detected
>  smc_clcsock_data_ready+0xa/0x4d [vuln_msv]
>  smc_close_flush_work+0x1f/0x30 [vuln_msv]
>  *** DEADLOCK ***
>
>Return before taking sk_callback_lock when the underlying TCP socket is no
>longer in TCP_LISTEN.  In that state there is no listen accept work to
>queue for SMC, and avoiding the callback lock mirrors the fix used by the
>TCP nvmet listener.

Hi Runyu,

I noticed the lockdep splat comes from your own kernel module
([vuln_msv]) that models the condition, rather than from a real
TCP code path.

Could you point me to the specific mainline TCP code path that calls
sk_data_ready() while holding sk_callback_lock? If such a path
exists, I'm happy to take this patch. But if this is based solely on
static analysis without a confirmed real call chain, I'd prefer to
focus our review bandwidth on issues that have demonstrated impact.

Thanks,
Dust



* [PATCH net v2] net/smc: avoid recursive sk_callback_lock in listen data_ready
From: Runyu Xiao @ 2026-06-19  5:48 UTC (permalink / raw)
  To: D. Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Mahanta Jambigi, Tony Lu, Wen Gu, Simon Horman, Karsten Graul,
	linux-rdma, linux-s390, netdev, linux-kernel, jianhao.xu,
	runyu.xiao, stable
In-Reply-To: <20260617152855.1039151-1-runyu.xiao@seu.edu.cn>

smc_listen() installs smc_clcsock_data_ready() as the underlying TCP
listen socket's sk_data_ready callback.  The callback takes
sk_callback_lock before looking up the SMC listener and queuing
smc_tcp_listen_work().

This can recurse when the underlying TCP listen socket is being closed.
The close/flush path may invoke the installed sk_data_ready callback with
sk_callback_lock already held, so smc_clcsock_data_ready() tries to take
the same rwlock again in the same thread.

This issue was found by our static analysis tool and then manually
reviewed against the current tree.  The reproducer keeps the SMC listen
callback installation path:

  smc_listen()
  smc_clcsock_replace_cb()
  sk_data_ready = smc_clcsock_data_ready()

It then models the close/flush carrier that invokes the installed
sk_data_ready callback while sk_callback_lock is already held.  Lockdep
reports the same-thread recursive acquisition:

  WARNING: possible recursive locking detected
  kworker/u4:3/39 is trying to acquire lock:
    (sk_callback_lock) at smc_clcsock_data_ready+0xa/0x4d

  but task is already holding lock:
    (sk_callback_lock) at smc_close_flush_work+0xc/0x30

  Possible unsafe locking scenario:

        CPU0
        ----
        lock(sk_callback_lock);
        lock(sk_callback_lock);

  *** DEADLOCK ***

  Workqueue: smc_close_wq smc_close_flush_work

  Call Trace:
    dump_stack_lvl
    __lock_acquire
    lock_acquire
    _raw_read_lock_bh
    smc_clcsock_data_ready
    smc_close_flush_work
    process_one_work
    worker_thread
    kthread
    ret_from_fork

The same pattern was fixed for nvmet TCP by checking TCP_LISTEN before
taking sk_callback_lock:

  commit 2fa8961d3a6a ("nvmet-tcp: fixup hang in
  nvmet_tcp_listen_data_ready()")

Do the same for SMC.  smc_clcsock_data_ready() is installed by
smc_listen() on the underlying TCP listen socket and only queues
smc_tcp_listen_work() for the SMC listen/accept path.  Once that socket is
no longer in TCP_LISTEN, there is no listen accept work to queue from this
callback, and avoiding sk_callback_lock also avoids the recursive locking
path.
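
The ordering can be modelled in userspace as below. This is only a sketch of
the scenario described above: the struct, the boolean standing in for a
non-recursive lock, and the function name are all assumptions, and it does not
show that a mainline path actually invokes the callback with the lock held.

```c
#include <stdbool.h>

enum sk_state_sketch { SKETCH_LISTEN, SKETCH_CLOSE };

struct sock_sketch {
	enum sk_state_sketch state;
	bool callback_lock_held;	/* models a non-recursive lock */
	int queued_work;		/* counts queued listen work items */
};

/* Returns false if the lock was already held, i.e. a self-deadlock. */
static bool data_ready_sketch(struct sock_sketch *sk)
{
	/* State check first: a non-listening socket never touches the lock. */
	if (sk->state != SKETCH_LISTEN)
		return true;

	if (sk->callback_lock_held)
		return false;

	sk->callback_lock_held = true;
	sk->queued_work++;
	sk->callback_lock_held = false;
	return true;
}
```

With the state check in front, a call made while the lock is held on a closing
socket returns without queuing work and without trying to take the lock again.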

Fixes: 0558226cebee ("net/smc: Fix slab-out-of-bounds issue in fallback")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
v2:
- Include the fuller Lockdep stack from the grounded reproducer.
- Add the related nvmet TCP fix reference.
- Explain why the TCP_LISTEN check is valid for the SMC listen callback.

 net/smc/af_smc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 6421c2e1c84d..1af4e3c333ff 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -2631,6 +2631,9 @@ static void smc_clcsock_data_ready(struct sock *listen_clcsock)
 {
 	struct smc_sock *lsmc;
 
+	if (READ_ONCE(listen_clcsock->sk_state) != TCP_LISTEN)
+		return;
+
 	read_lock_bh(&listen_clcsock->sk_callback_lock);
 	lsmc = smc_clcsock_user_data(listen_clcsock);
 	if (!lsmc)
-- 
2.34.1


* Re: [PATCH net] net/smc: avoid recursive sk_callback_lock in listen data_ready
From: Mahanta Jambigi @ 2026-06-19  5:36 UTC (permalink / raw)
  To: Runyu Xiao, D. Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Tony Lu, Wen Gu, Simon Horman, Karsten Graul, linux-rdma,
	linux-s390, netdev, linux-kernel, jianhao.xu
In-Reply-To: <20260618141629.2904071-1-runyu.xiao@seu.edu.cn>



On 18/06/26 7:46 pm, Runyu Xiao wrote:
> Hi,
> 
> Thanks for taking a look.
> 
> The exact Lockdep stack I have is from the grounded reproducer, not from
> a production SMC setup.  The reproducer keeps the same callback shape:
> the close/flush side holds sk_callback_lock and invokes the installed
> sk_data_ready callback, which re-enters smc_clcsock_data_ready() and tries
> to take sk_callback_lock again.
> 
> The relevant Lockdep report is:
> 
>   WARNING: possible recursive locking detected
>   kworker/u4:3/39 is trying to acquire lock:
>     (sk_callback_lock) at smc_clcsock_data_ready+0xa/0x4d
> 
>   but task is already holding lock:
>     (sk_callback_lock) at smc_close_flush_work+0xc/0x30
> 
>   Possible unsafe locking scenario:
> 
>         CPU0
>         ----
>         lock(sk_callback_lock);
>         lock(sk_callback_lock);
> 
>   *** DEADLOCK ***
> 
>   Workqueue: smc_close_wq smc_close_flush_work
> 
>   Call Trace:
>     dump_stack_lvl
>     __lock_acquire
>     lock_acquire
>     _raw_read_lock_bh
>     smc_clcsock_data_ready+0xa/0x4d
>     smc_close_flush_work+0x1f/0x30
>     process_one_work
>     worker_thread
>     kthread
>     ret_from_fork

Thank you for addressing the feedback. My suggestion would be to reply
to the original email thread where the review comments were given, so
that the maintainers can follow the conversation.

https://www.kernel.org/doc/html/latest/process/submitting-patches.html#respond-to-review-comments

Please include the above call stack in your next version.

> 
> The nvmet change I referred to is:
> 
>   2fa8961d3a6a ("nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()")

Please include this info in your next version.

> 
> The stable/backport patch I originally used as the reference is:
> 
>   1c90f930e7b4 ("nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()")
> 
> Its commit message says that when the socket is closed while in
> TCP_LISTEN, the flush callback can call nvmet_tcp_listen_data_ready()
> with sk_callback_lock already held, so nvmet moved the TCP_LISTEN check
> before taking sk_callback_lock.
> 
> For the TCP_LISTEN check: my reasoning was that smc_clcsock_data_ready()
> is installed by smc_listen() on the underlying TCP listen socket and only
> queues smc_tcp_listen_work() for the SMC listen/accept path.  Once that
> underlying socket is no longer in TCP_LISTEN, there should be no SMC
> listen accept work to queue from this callback.  TCP_SYN_RECV and
> TCP_ESTABLISHED are not listen-socket states for this callback path, so I
> did not intend the callback to queue listen work for those states.

I understand. Please include this info in your next version.

> 
> That said, if SMC expects smc_clcsock_data_ready() to handle a non-LISTEN
> state during fallback or another transition, then the proposed check is
> too strict and I should rework the fix.
> 
> Thanks,
> Runyu


^ permalink raw reply

* [PATCH v2] RDMA/irdma: Suppress PF reset on HMC error
From: Seyeong Kim @ 2026-06-19  5:00 UTC (permalink / raw)
  To: linux-rdma
  Cc: Krzysztof Czurylo, Tatyana Nikolova, Jason Gunthorpe,
	Leon Romanovsky, Seyeong Kim

The irdma driver currently issues an unconditional PF reset whenever the
HMC Error interrupt (PFINT_OICR bit 26) fires:

	if (event->reg & IRDMAPFINT_OICR_HMC_ERR_M) {
		ibdev_err(&iwdev->ibdev, "HMC Error\n");
		iwdev->rf->reset = true;
	}

request_reset() issues an IIDC_PFR to ice. In practice a single HMC_ERR
can trigger cascading PF resets, IOMMU faults during teardown, and
teardown of every RDMA connection on the device.

i40e handles the identically-named interrupt by reading
PFHMC_ERRORINFO and PFHMC_ERRORDATA and logging them without touching
device state; see commit 9c010ee0ea5f ("i40e: Suppress HMC error to
Interrupt message level") which removed the reset as "not necessary".
This patch mirrors that handling on irdma.

With this change, repeated HMC_ERR no longer produces a reset storm and
RDMA traffic on the device continues uninterrupted.

Signed-off-by: Seyeong Kim <seyeong.kim@canonical.com>
---
v2:
 - Drop RFC; retested on mainline v7.1-rc4 (NVM 4.51). The patch
   applies unchanged and behaves the same. No functional change.

v1: https://lore.kernel.org/linux-rdma/20260416071541.3899471-1-seyeong.kim@canonical.com/

Notes for reviewers
-------------------

Some details are inferred rather than verified against the E810
datasheet; Intel confirmation would settle them:

1. Register offsets. PFHMC_ERRORINFO (0x00520400) and PFHMC_ERRORDATA
   (0x00520500) are inferred from the same 0x00520000 PFHMC bank as
   PFHMC_PDINV (0x00520300); they have not been verified against the
   datasheet. On v7.1-rc4 both reads returned 0x00000000, which is
   consistent but not a positive confirmation of the offsets.

2. HMC error semantics. The assumption that every E810 HMC_ERR is
   safe to continue past is not datasheet-confirmed. A conditional
   reset branch analogous to the existing PE_CRITERR /
   IRDMA_Q1_RESOURCE_ERR whitelist can be added on top if needed.

3. Test methodology. The interrupt was forced via a /dev/mem bit
   write to PFINT_OICR, which exercises the handler path but does
   not reproduce firmware-triggered HMC errors directly.

Testing details (mainline v7.1-rc4)
-----------------------------------

Tested on:
  Kernel   : 7.1.0-rc4 (mainline)
  Adapter  : Intel E810-XXV for SFP [8086:159b rev02], 2-port
  NVM      : 4.51, fw.mgmt 7.5.4, DDP 1.3.43.0
  Repro    : writel(BIT(26), BAR0 + 0x0016CA00) via /dev/mem
  Workload : ib_write_bw -R -F -q 4 -D 90, 4 QPs, loopback on the
             injected PF; HMC_ERR forced three times during the run

Before the patch, a forced HMC_ERR caused a full PF reset. With an
RDMA workload running, the reset tore down the irdma aux device with a
uverbs file still open and hit a WARNING at uverbs_destroy_ufile_hw
(rdma_core.c:957, via ice_prepare_for_reset -> ice_unplug_aux_dev);
the ib_write_bw run aborted. After the patch, each forced HMC_ERR only
logged a single "HMC Error: errinfo=0x00000000 errdata=0x00000000"
line - no reset, no WARNING - and the run completed (8622 MiB/s over
4 QPs).
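
For illustration only, a userspace sketch of the before/after handler behaviour described above. The names (demo_dev, demo_handle_oicr, DEMO_OICR_HMC_ERR) are hypothetical stand-ins, the register values are passed in as plain arguments rather than read with readl(), and the log line is written to a buffer instead of going through ibdev_warn().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bit 26 of PFINT_OICR, as stated in the commit message. */
#define DEMO_OICR_HMC_ERR (UINT32_C(1) << 26)

/* Hypothetical device state: a reset request flag and the last log line. */
struct demo_dev {
	bool reset;
	char log[80];
};

/* Models the patched branch: log the two error registers, request no reset. */
static void demo_handle_oicr(struct demo_dev *dev, uint32_t oicr,
			     uint32_t errinfo, uint32_t errdata)
{
	if (!(oicr & DEMO_OICR_HMC_ERR))
		return;

	snprintf(dev->log, sizeof(dev->log),
		 "HMC Error: errinfo=0x%08x errdata=0x%08x",
		 (unsigned int)errinfo, (unsigned int)errdata);
	/* dev->reset is deliberately left untouched here. */
}

/* Usage: a forced HMC_ERR with both registers reading zero. */
static void demo_check(void)
{
	struct demo_dev dev = { false, "" };

	demo_handle_oicr(&dev, DEMO_OICR_HMC_ERR, 0, 0);
	assert(!dev.reset);
	assert(strcmp(dev.log,
		      "HMC Error: errinfo=0x00000000 errdata=0x00000000") == 0);
}
```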

The customer report this addresses was on NVM 3.10, where a single
HMC_ERR produced cascading PF resets and DMAR faults rather than the
single reset seen on NVM 4.51. The unconditional reset is the common
cause in both cases.

 drivers/infiniband/hw/irdma/i40iw_hw.c  | 4 +++-
 drivers/infiniband/hw/irdma/icrdma_hw.c | 2 ++
 drivers/infiniband/hw/irdma/icrdma_hw.h | 2 ++
 drivers/infiniband/hw/irdma/icrdma_if.c | 8 ++++++--
 drivers/infiniband/hw/irdma/irdma.h     | 2 ++
 5 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/irdma/i40iw_hw.c b/drivers/infiniband/hw/irdma/i40iw_hw.c
index 60c1f2b1811d..8301938b4543 100644
--- a/drivers/infiniband/hw/irdma/i40iw_hw.c
+++ b/drivers/infiniband/hw/irdma/i40iw_hw.c
@@ -29,7 +29,9 @@ static u32 i40iw_regs[IRDMA_MAX_REGS] = {
 	I40E_PFHMC_PDINV,
 	I40E_GLHMC_VFPDINV(0),
 	I40E_GLPE_CRITERR,
-	0xffffffff      /* PFINT_RATEN not used in FPK */
+	0xffffffff,     /* PFINT_RATEN not used in FPK */
+	0xffffffff,     /* PFHMC_ERRORINFO not used in FPK */
+	0xffffffff      /* PFHMC_ERRORDATA not used in FPK */
 };
 
 static u32 i40iw_stat_offsets[] = {
diff --git a/drivers/infiniband/hw/irdma/icrdma_hw.c b/drivers/infiniband/hw/irdma/icrdma_hw.c
index 32f26284a788..b1f1b5485762 100644
--- a/drivers/infiniband/hw/irdma/icrdma_hw.c
+++ b/drivers/infiniband/hw/irdma/icrdma_hw.c
@@ -29,6 +29,8 @@ static u32 icrdma_regs[IRDMA_MAX_REGS] = {
 	GLHMC_VFPDINV(0),
 	GLPE_CRITERR,
 	GLINT_RATE(0),
+	PFHMC_ERRORINFO,
+	PFHMC_ERRORDATA,
 };
 
 static u64 icrdma_masks[IRDMA_MAX_MASKS] = {
diff --git a/drivers/infiniband/hw/irdma/icrdma_hw.h b/drivers/infiniband/hw/irdma/icrdma_hw.h
index d97944ab45da..0acdeda1236d 100644
--- a/drivers/infiniband/hw/irdma/icrdma_hw.h
+++ b/drivers/infiniband/hw/irdma/icrdma_hw.h
@@ -40,6 +40,8 @@
 #define GLHMC_VFPDINV(_i)	(0x00528300 + ((_i) * 4)) /* _i=0...31 */
 #define GLPE_CRITERR		0x00534000
 #define GLINT_RATE(_INT)	(0x0015A000 + ((_INT) * 4)) /* _i=0...2047 */ /* Reset Source: CORER */
+#define PFHMC_ERRORINFO		0x00520400
+#define PFHMC_ERRORDATA		0x00520500
 
 #define ICRDMA_DB_ADDR_OFFSET		(8 * 1024 * 1024 - 64 * 1024)
 
diff --git a/drivers/infiniband/hw/irdma/icrdma_if.c b/drivers/infiniband/hw/irdma/icrdma_if.c
index 2172a2092e3f..4b451d8482a4 100644
--- a/drivers/infiniband/hw/irdma/icrdma_if.c
+++ b/drivers/infiniband/hw/irdma/icrdma_if.c
@@ -91,8 +91,12 @@ static void icrdma_iidc_event_handler(struct iidc_rdma_core_dev_info *cdev_info,
 			}
 		}
 		if (event->reg & IRDMAPFINT_OICR_HMC_ERR_M) {
-			ibdev_err(&iwdev->ibdev, "HMC Error\n");
-			iwdev->rf->reset = true;
+			u32 hmc_errinfo = readl(iwdev->rf->sc_dev.hw_regs[IRDMA_PFHMC_ERRORINFO]);
+			u32 hmc_errdata = readl(iwdev->rf->sc_dev.hw_regs[IRDMA_PFHMC_ERRORDATA]);
+
+			/* Log diagnostics; do not reset here. */
+			ibdev_warn(&iwdev->ibdev, "HMC Error: errinfo=0x%08x errdata=0x%08x\n",
+				   hmc_errinfo, hmc_errdata);
 		}
 		if (event->reg & IRDMAPFINT_OICR_PE_PUSH_M) {
 			ibdev_err(&iwdev->ibdev, "PE Push Error\n");
diff --git a/drivers/infiniband/hw/irdma/irdma.h b/drivers/infiniband/hw/irdma/irdma.h
index ff938a01d70c..e8cda27d7854 100644
--- a/drivers/infiniband/hw/irdma/irdma.h
+++ b/drivers/infiniband/hw/irdma/irdma.h
@@ -66,6 +66,8 @@ enum irdma_registers {
 	IRDMA_GLHMC_VFPDINV,
 	IRDMA_GLPE_CRITERR,
 	IRDMA_GLINT_RATE,
+	IRDMA_PFHMC_ERRORINFO,
+	IRDMA_PFHMC_ERRORDATA,
 	IRDMA_MAX_REGS, /* Must be last entry */
 };
 
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v3 2/3] net/smc: bound the receive length to the RMB in smc_rx_recvmsg()
From: Dust Li @ 2026-06-19  3:31 UTC (permalink / raw)
  To: Bryam Vargas
  Cc: Wenjia Zhang, D . Wythe, Sidraya Jayagond, Eric Dumazet,
	David S . Miller, Mahanta Jambigi, Wen Gu, Simon Horman,
	Ursula Braun, Stefan Raspl, Tony Lu, Paolo Abeni, Jakub Kicinski,
	netdev, linux-s390, linux-rdma, linux-kernel
In-Reply-To: <20260618221106.236699-1-hexlabsecurity@proton.me>

On 2026-06-18 22:11:12, Bryam Vargas wrote:
>On Fri, 19 Jun 2026 00:03:17 +0800, Dust Li wrote:
>> Once we validate the CDC message at the input boundary (as in the
>> previous patch), bytes_to_rcv can never exceed rmb_desc->len, so
>> this check becomes unreachable. So I don't think this patch is needed.
>
>This one I'd actually like to keep, and let me walk through why -- I don't think the
>boundary check closes it.
>
>bytes_to_rcv isn't set to a cursor count, it's a running accumulator:
>smc_cdc_msg_recv_action does atomic_add(diff_prod, &bytes_to_rcv), where
>diff_prod = smc_curs_diff(rmb_desc->len, old, new). So bounding each cursor's count at
>the boundary doesn't bound the sum of the deltas.
>
>The differing-wrap branch of smc_curs_diff returns (len - old.count) + new.count,
>which is up to 2*len-1 even when both cursors pass count <= len. With len=16, a prod
>going (0,0) -> (1,15) gives diff=31, so bytes_to_rcv is already 31 > len after one
>message; alternating wrap 0<->1 at count=15 keeps adding ~len and eventually wraps the
>atomic_t negative. I have an A/B for this -- happy to send it along.

Glad to see your A/B test. I think we can decide after we see the real
issue.

>
>So to make this truly unreachable from the boundary check, we'd need to bound
>prod - cons <= len there, not just the absolute count. The consumer-side clamp is two
>lines and race-free against the tasklet, so my preference would be to keep it as a
>backstop -- but if you'd rather fold it into a stronger boundary check instead, I'm
>open to that.

Another thing I'd worry about: if this really happens, should we also
abort the connection, as we did in patch #1?

Best regards,
Dust


^ permalink raw reply

* Re: [PATCH v3 1/3] net/smc: bound the wire-controlled producer cursor to the RMB
From: Dust Li @ 2026-06-19  3:26 UTC (permalink / raw)
  To: Bryam Vargas
  Cc: Wenjia Zhang, D . Wythe, Sidraya Jayagond, Eric Dumazet,
	David S . Miller, Mahanta Jambigi, Wen Gu, Simon Horman,
	Ursula Braun, Stefan Raspl, Tony Lu, Paolo Abeni, Jakub Kicinski,
	netdev, linux-s390, linux-rdma, linux-kernel
In-Reply-To: <20260618221057.236673-1-hexlabsecurity@proton.me>

On 2026-06-18 22:11:05, Bryam Vargas wrote:
>On Thu, 18 Jun 2026 22:29:20 +0800, Dust Li wrote:
>> once we detect that the peer is misbehaving, I think the right action is
>> to abort the connection and record the event, rather than silently clamp.
>[...]
>>         u32 prod_count = ntohs(cdc->prod.count);
>> ...
>>             cdc->prod.wrap > 1 || cdc->cons.wrap > 1) {
>
>Thanks for taking a look, Dust. I'm on board with the direction for net-next --
>aborting and recording a bad CDC is cleaner than clamping something we already know
>we can't trust, and as you say, the clamp just papers over the peer bug. So: minimal
>clamp stays for -stable, and net-next gets the wire-boundary check + abort (through
>abort_work, with an smc_stats counter and a ratelimited warn).

That's great. Then I think we can move on in this direction.

>
>A few things I ran into on the check itself, though:
>
>- count is __be32, so it wants ntohl() rather than ntohs() -- ntohs() ends up reading
>  the wrong half.

Right

>
>- I'd drop the wrap > 1 tests. wrap is a free-running counter (smc_curs_add does
>  wrap++), so a connection that legitimately wraps its RMB ends up with wrap > 1; and
>  since it's a __be16 read raw, on little-endian wrap==1 already reads as 0x0100 and
>  we'd abort on the very first wrap. I don't think there's a sane upper bound to put
>  on wrap.

Agree

>
>- the check is typed for SMC-R, but the SMC-D path hands a host-order smcd_cdc_msg to
>  smc_cdc_msg_recv() cast as smc_cdc_msg (smc_cdc.c:456), so ntohl/ntohs would
>  double-swap it there. The simplest thing I found is one check on the host cursor
>  right after smc_cdc_msg_to_host(), before the diff/atomic_add block -- that covers
>  SMC-R and SMC-D in one place.

Agree

>
>Minor: >= len rather than > len (count is an offset in [0,len)), and peer_rmbe_size
>is signed so worth guarding. The cons vs peer_rmbe_size bound looks right to me.

No problem

Best regards,
Dust


^ permalink raw reply

* Re: [PATCH net-next v2] net: rds: check cmsg_len before reading rds_rdma_args in size pass
From: patchwork-bot+netdevbpf @ 2026-06-19  1:40 UTC (permalink / raw)
  To: Michael Bommarito
  Cc: achender, davem, kuba, pabeni, edumazet, horms, netdev,
	linux-rdma, rds-devel, linux-kernel
In-Reply-To: <20260617023146.2780077-1-michael.bommarito@gmail.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 16 Jun 2026 22:31:46 -0400 you wrote:
> rds_rm_size() handles RDS_CMSG_RDMA_ARGS after only CMSG_OK() and then
> calls rds_rdma_extra_size(), which reads args->local_vec_addr and
> args->nr_local without first checking that cmsg_len covers struct
> rds_rdma_args. The other two RDS_CMSG_RDMA_ARGS consumers already guard
> this: rds_rdma_bytes() in rds_sendmsg() and rds_cmsg_rdma_args() in
> rds_cmsg_send() both reject cmsg_len < CMSG_LEN(sizeof(struct
> rds_rdma_args)). Add the same check to rds_rm_size() so all three RDMA
> args passes are consistent.
> 
> [...]
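
For illustration only, a userspace sketch of the length guard the quoted commit message describes. struct demo_rdma_args is a hypothetical stand-in for struct rds_rdma_args (only the two fields named above, with assumed types), and demo_args_len_ok() is not an RDS function; CMSG_LEN() is the standard macro from <sys/socket.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

/* Hypothetical stand-in for struct rds_rdma_args; field types are assumed. */
struct demo_rdma_args {
	uint64_t local_vec_addr;
	uint64_t nr_local;
};

/* True only if the control message is long enough to hold the args struct. */
static bool demo_args_len_ok(size_t cmsg_len)
{
	return cmsg_len >= CMSG_LEN(sizeof(struct demo_rdma_args));
}

/* Usage: an exact-size message passes, a one-byte-short message does not. */
static void demo_check(void)
{
	size_t need = CMSG_LEN(sizeof(struct demo_rdma_args));

	assert(demo_args_len_ok(need));
	assert(!demo_args_len_ok(need - 1));
}
```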

Here is the summary with links:
  - [net-next,v2] net: rds: check cmsg_len before reading rds_rdma_args in size pass
    https://git.kernel.org/netdev/net/c/e5c00023270e

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net V2 0/3] net/mlx5e: Fix crashes in dynamic per-channel stats and HV VHCA agent
From: Jakub Kicinski @ 2026-06-19  1:14 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, netdev, Paolo Abeni,
	Cosmin Ratiu, Eran Ben Elisha, Feng Liu, Haiyang Zhang,
	Lama Kayal, Leon Romanovsky, linux-kernel, linux-rdma, Mark Bloch,
	Nimrod Oren, Saeed Mahameed
In-Reply-To: <20260617140127.573117-1-tariqt@nvidia.com>

On Wed, 17 Jun 2026 17:01:24 +0300 Tariq Toukan wrote:
> Since per-channel stats were converted to be allocated and published
> lazily at first channel open in commit fa691d0c9c08 ("net/mlx5e:
> Allocate per-channel stats dynamically at first usage"),
> priv->channel_stats[] and priv->stats_nch are filled in
> incrementally during interface bring-up. This opened a window in
> which the various stats readers - most of them reachable from
> userspace via netlink/netdev stats queries - can race with
> mlx5e_open_channel() on another CPU and observe partially
> initialized state. The HV VHCA stats agent, which is created
> before the channels are opened, hits related problems of its own.
> 
> This series by Feng fixes the resulting crashes.

No longer(?) applies:

Applying: net/mlx5e: Fix HV VHCA stats zero-sized buffer allocation
Applying: net/mlx5e: Fix HV VHCA stats agent registration race
Applying: net/mlx5e: Fix publication race for priv->channel_stats[]
error: patch failed: drivers/net/ethernet/mellanox/mlx5/core/en_main.c:5533
error: drivers/net/ethernet/mellanox/mlx5/core/en_main.c: patch does not apply
Patch failed at 0003 net/mlx5e: Fix publication race for priv->channel_stats[]
-- 
pw-bot: cr

^ permalink raw reply

* Re: [PATCH v3 3/3] net/smc: bound the send length to the send buffer in smc_tx_sendmsg()
From: Bryam Vargas @ 2026-06-18 22:11 UTC (permalink / raw)
  To: Dust Li
  Cc: Wenjia Zhang, D . Wythe, Sidraya Jayagond, Eric Dumazet,
	David S . Miller, Mahanta Jambigi, Wen Gu, Simon Horman,
	Ursula Braun, Stefan Raspl, Tony Lu, Paolo Abeni, Jakub Kicinski,
	netdev, linux-s390, linux-rdma, linux-kernel
In-Reply-To: <ajQX7_9xFI9GSaq5@linux.alibaba.com>

On Fri, 19 Jun 2026 00:08:15 +0800, Dust Li wrote:
> I think this is the same as patch #2.

Same story as 2/3, just on the SMC-D send side: sndbuf_space accumulates
diff_tx = smc_curs_diff(sndbuf_desc->len, tx_curs_fin, cons) from the peer's consumer
cursor, so a cons alternating wrap 0<->1 walks it past sndbuf_desc->len (and negative
over time), and smc_tx_sendmsg's wrap-around write then runs off the end of the
buffer. The boundary count check doesn't bound diff_tx here either, so I'd keep the
same two-line bound. The same A/B covers it.

Bryam


^ permalink raw reply

* Re: [PATCH v3 2/3] net/smc: bound the receive length to the RMB in smc_rx_recvmsg()
From: Bryam Vargas @ 2026-06-18 22:11 UTC (permalink / raw)
  To: Dust Li
  Cc: Wenjia Zhang, D . Wythe, Sidraya Jayagond, Eric Dumazet,
	David S . Miller, Mahanta Jambigi, Wen Gu, Simon Horman,
	Ursula Braun, Stefan Raspl, Tony Lu, Paolo Abeni, Jakub Kicinski,
	netdev, linux-s390, linux-rdma, linux-kernel
In-Reply-To: <ajQWxQZXzM2J8kaZ@linux.alibaba.com>

On Fri, 19 Jun 2026 00:03:17 +0800, Dust Li wrote:
> Once we validate the CDC message at the input boundary (as in the
> previous patch), bytes_to_rcv can never exceed rmb_desc->len, so
> this check becomes unreachable. So I don't think this patch is needed.

This one I'd actually like to keep, and let me walk through why -- I don't think the
boundary check closes it.

bytes_to_rcv isn't set to a cursor count, it's a running accumulator:
smc_cdc_msg_recv_action does atomic_add(diff_prod, &bytes_to_rcv), where
diff_prod = smc_curs_diff(rmb_desc->len, old, new). So bounding each cursor's count at
the boundary doesn't bound the sum of the deltas.

The differing-wrap branch of smc_curs_diff returns (len - old.count) + new.count,
which is up to 2*len-1 even when both cursors pass count <= len. With len=16, a prod
going (0,0) -> (1,15) gives diff=31, so bytes_to_rcv is already 31 > len after one
message; alternating wrap 0<->1 at count=15 keeps adding ~len and eventually wraps the
atomic_t negative. I have an A/B for this -- happy to send it along.
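
To make the arithmetic above concrete, here is a small userspace sketch. demo_curs_diff() is a simplified model of the two branches of smc_curs_diff as described in this mail, not the kernel function itself, and demo_cursor and demo_accumulate() are hypothetical names.

```c
#include <assert.h>

/* Hypothetical host-order cursor: a wrap counter and an offset into the RMB. */
struct demo_cursor {
	unsigned int wrap;
	unsigned int count;
};

/* Model of the diff: differing wraps add the tail of the buffer to cur.count. */
static int demo_curs_diff(unsigned int size, struct demo_cursor old,
			  struct demo_cursor cur)
{
	int diff;

	if (old.wrap != cur.wrap)
		diff = (int)((size - old.count) + cur.count);
	else
		diff = (int)(cur.count - old.count);
	return diff > 0 ? diff : 0;
}

/* Accumulate diffs for (0,0) -> (1,15), then `flips` wrap alternations. */
static int demo_accumulate(unsigned int size, int flips)
{
	struct demo_cursor old = { 0, 0 };
	struct demo_cursor cur = { 1, size - 1 };
	int bytes_to_rcv = demo_curs_diff(size, old, cur);
	int i;

	for (i = 0; i < flips; i++) {
		old = cur;
		cur.wrap ^= 1;		/* alternate wrap 0 <-> 1, count fixed */
		bytes_to_rcv += demo_curs_diff(size, old, cur);
	}
	return bytes_to_rcv;
}

/* Usage: len=16 gives 31 after one message, then +16 per alternation. */
static void demo_check(void)
{
	assert(demo_accumulate(16, 0) == 31);
	assert(demo_accumulate(16, 2) == 63);
}
```

Each cursor in this run has count < len, yet the accumulated total exceeds len after the first message, which is the point being made.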

So to make this truly unreachable from the boundary check, we'd need to bound
prod - cons <= len there, not just the absolute count. The consumer-side clamp is two
lines and race-free against the tasklet, so my preference would be to keep it as a
backstop -- but if you'd rather fold it into a stronger boundary check instead, I'm
open to that.

Bryam


^ permalink raw reply

* Re: [PATCH v3 1/3] net/smc: bound the wire-controlled producer cursor to the RMB
From: Bryam Vargas @ 2026-06-18 22:11 UTC (permalink / raw)
  To: Dust Li
  Cc: Wenjia Zhang, D . Wythe, Sidraya Jayagond, Eric Dumazet,
	David S . Miller, Mahanta Jambigi, Wen Gu, Simon Horman,
	Ursula Braun, Stefan Raspl, Tony Lu, Paolo Abeni, Jakub Kicinski,
	netdev, linux-s390, linux-rdma, linux-kernel
In-Reply-To: <ajQAwBMzCJfO9SM1@linux.alibaba.com>

On Thu, 18 Jun 2026 22:29:20 +0800, Dust Li wrote:
> once we detect that the peer is misbehaving, I think the right action is
> to abort the connection and record the event, rather than silently clamp.
[...]
>         u32 prod_count = ntohs(cdc->prod.count);
> ...
>             cdc->prod.wrap > 1 || cdc->cons.wrap > 1) {

Thanks for taking a look, Dust. I'm on board with the direction for net-next --
aborting and recording a bad CDC is cleaner than clamping something we already know
we can't trust, and as you say, the clamp just papers over the peer bug. So: minimal
clamp stays for -stable, and net-next gets the wire-boundary check + abort (through
abort_work, with an smc_stats counter and a ratelimited warn).

A few things I ran into on the check itself, though:

- count is __be32, so it wants ntohl() rather than ntohs() -- ntohs() ends up reading
  the wrong half.

- I'd drop the wrap > 1 tests. wrap is a free-running counter (smc_curs_add does
  wrap++), so a connection that legitimately wraps its RMB ends up with wrap > 1; and
  since it's a __be16 read raw, on little-endian wrap==1 already reads as 0x0100 and
  we'd abort on the very first wrap. I don't think there's a sane upper bound to put
  on wrap.

- the check is typed for SMC-R, but the SMC-D path hands a host-order smcd_cdc_msg to
  smc_cdc_msg_recv() cast as smc_cdc_msg (smc_cdc.c:456), so ntohl/ntohs would
  double-swap it there. The simplest thing I found is one check on the host cursor
  right after smc_cdc_msg_to_host(), before the diff/atomic_add block -- that covers
  SMC-R and SMC-D in one place.

Minor: >= len rather than > len (count is an offset in [0,len)), and peer_rmbe_size
is signed so worth guarding. The cons vs peer_rmbe_size bound looks right to me.
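
A small sketch of the two byte-order points above, for illustration only. It models a little-endian CPU explicitly with byte arrays so the result does not depend on the host; all demo_* names are hypothetical, and the swab helpers stand in for what ntohs()/ntohl() do on a little-endian machine.

```c
#include <assert.h>
#include <stdint.h>

/* How a little-endian CPU sees a raw, unconverted field in memory. */
static uint32_t demo_raw_le32(const uint8_t b[4])
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

static uint16_t demo_raw_le16(const uint8_t b[2])
{
	return (uint16_t)(b[0] | (b[1] << 8));
}

/* Byte swaps, i.e. ntohs()/ntohl() as they behave on little-endian. */
static uint16_t demo_swab16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

static uint32_t demo_swab32(uint32_t v)
{
	return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
	       ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* count is an offset in [0, len), so the reject condition is count >= len. */
static int demo_count_ok(uint32_t count, uint32_t len)
{
	return count < len;
}

/* Usage: count=16 and wrap=1 as big-endian wire bytes. */
static void demo_check(void)
{
	const uint8_t count_wire[4] = { 0x00, 0x00, 0x00, 0x10 };
	const uint8_t wrap_wire[2] = { 0x00, 0x01 };
	uint32_t raw = demo_raw_le32(count_wire);

	assert(demo_swab32(raw) == 16);			/* ntohl(): correct */
	assert(demo_swab16((uint16_t)raw) == 0);	/* ntohs(): wrong half */
	assert(demo_raw_le16(wrap_wire) == 0x0100);	/* raw wrap==1 reads 256 */
	assert(!demo_count_ok(16, 16));
}
```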

Happy to spin it whichever way you prefer.

Bryam


^ permalink raw reply

