Linux RDMA and InfiniBand development


* Re: [PATCH v4 0/2] tracing: Move non-trace_printk prototypes into trace_controls.h
From: Jani Nikula @ 2026-06-25 11:05 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Linus Torvalds, Sebastian Andrzej Siewior, John Ogness,
	Thomas Gleixner, Peter Zijlstra, Julia Lawall, Yury Norov,
	linux-doc, linux-kbuild, linuxppc-dev, dri-devel, linux-stm32,
	linux-arm-kernel, linux-rdma, linux-usb, linux-ext4, linux-nfs,
	kvm, intel-gfx
In-Reply-To: <20260625104007.041432666@kernel.org>

On Thu, 25 Jun 2026, Steven Rostedt <rostedt@kernel.org> wrote:
> Remove trace_printk.h by creating a trace_controls.h for those places that
> need access to tracing prototypes like tracing_off() and for the places that
> need trace_printk() directly, to have it included directly.
>
> Changes since v3: https://lore.kernel.org/all/20260624081806.120105649@kernel.org/
>
> - Always include trace_controls.h in rcu.h (kernel test robot)
>
>   There are other configs that may include tracing_off() in rcu.h besides
>   the one that had the include of trace_controls.h. Just always include
>   it in that header to be safe.
>
> Steven Rostedt (2):
>       tracing: Move non-trace_printk prototypes into trace_controls.h
>       tracing: Remove trace_printk.h from kernel.h
>
> ----
>  arch/powerpc/kvm/book3s_xics.c         |  1 +
>  arch/powerpc/xmon/xmon.c               |  1 +
>  arch/s390/kernel/ipl.c                 |  1 +
>  arch/s390/kernel/machine_kexec.c       |  1 +
>  drivers/gpu/drm/i915/gt/intel_gtt.h    |  1 +
>  drivers/gpu/drm/i915/i915_gem.h        |  2 ++

For the i915 parts,

Acked-by: Jani Nikula <jani.nikula@intel.com>

for merging via whichever tree.

>  drivers/hwtracing/stm/dummy_stm.c      |  1 +
>  drivers/infiniband/hw/hfi1/trace_dbg.h |  1 +
>  drivers/tty/sysrq.c                    |  1 +
>  drivers/usb/early/xhci-dbc.c           |  1 +
>  fs/ext4/inline.c                       |  1 +
>  include/linux/ftrace.h                 |  2 ++
>  include/linux/kernel.h                 |  1 -
>  include/linux/sunrpc/debug.h           |  1 +
>  include/linux/trace_controls.h         | 54 ++++++++++++++++++++++++++++++++
>  include/linux/trace_printk.h           | 56 ++--------------------------------
>  kernel/debug/debug_core.c              |  1 +
>  kernel/panic.c                         |  1 +
>  kernel/rcu/rcu.h                       |  1 +
>  kernel/rcu/rcutorture.c                |  1 +
>  kernel/trace/ring_buffer_benchmark.c   |  1 +
>  kernel/trace/trace.h                   |  1 +
>  kernel/trace/trace_benchmark.c         |  1 +
>  lib/sys_info.c                         |  1 +
>  samples/fprobe/fprobe_example.c        |  1 +
>  samples/ftrace/ftrace-direct-too.c     |  1 -
>  samples/trace_printk/trace-printk.c    |  1 +
>  27 files changed, 82 insertions(+), 55 deletions(-)
>  create mode 100644 include/linux/trace_controls.h

-- 
Jani Nikula, Intel


* [PATCH v4 1/2] tracing: Move non-trace_printk prototypes into trace_controls.h
From: Steven Rostedt @ 2026-06-25 10:40 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Linus Torvalds, Sebastian Andrzej Siewior, John Ogness,
	Thomas Gleixner, Peter Zijlstra, Julia Lawall, Yury Norov,
	linux-doc, linux-kbuild, linuxppc-dev, dri-devel, linux-stm32,
	linux-arm-kernel, linux-rdma, linux-usb, linux-ext4, linux-nfs,
	kvm, intel-gfx
In-Reply-To: <20260625104007.041432666@kernel.org>

From: Steven Rostedt <rostedt@goodmis.org>

Remove the prototypes of the code that is not associated with
trace_printk() from trace_printk.h.

These control functions as well as ftrace_dump() and trace_dump_stack()
are used in cases where things go wrong.  The main use case is to do a
trace_dump_stack(); tracing_off(); ftrace_dump(); in a place that detected
that something went wrong, whereas, trace_printk() is added to normal code
during debugging and removed before committing upstream. The dump code is
fine to keep in production.

Suggested-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
Changes since v3: https://patch.msgid.link/20260624081948.147764194@kernel.org

- Move include out of #if statement in rcu.h
  kernel test robot found other configs that could require the
  control functions in rcu.h. Just always include it in that file.
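
As an illustration of the "something went wrong" use case described in the
change log, here is a minimal userspace sketch. The three tracing functions
are stubbed stand-ins so that it compiles outside the kernel, and
check_widget_count() is a hypothetical caller, not code from this series.

```c
#include <stdio.h>

/*
 * Userspace stand-ins so the sketch compiles outside the kernel. In a
 * kernel build these would come from <linux/trace_controls.h>.
 */
enum ftrace_dump_mode { DUMP_NONE, DUMP_ALL, DUMP_ORIG, DUMP_PARAM };

static int tracing_enabled = 1;	/* models the tracing_on state */
static int dumps_done;			/* counts ftrace_dump() calls */

static void trace_dump_stack(int skip)
{
	(void)skip;
	puts("stack recorded in trace");
}

static void tracing_off(void)
{
	tracing_enabled = 0;
}

static void ftrace_dump(enum ftrace_dump_mode mode)
{
	if (mode != DUMP_NONE)
		dumps_done++;
}

/* Hypothetical consistency check; returns 1 if it had to bail out. */
static int check_widget_count(int count)
{
	if (count >= 0)
		return 0;

	trace_dump_stack(0);	/* record how we got here */
	tracing_off();		/* stop recording so the evidence is kept */
	ftrace_dump(DUMP_ALL);	/* print what the buffer holds */
	return 1;
}
```

A caller of this kind only needs the control functions, which is why it can
include trace_controls.h without pulling in trace_printk().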

 arch/powerpc/xmon/xmon.c         |  1 +
 arch/s390/kernel/ipl.c           |  1 +
 arch/s390/kernel/machine_kexec.c |  1 +
 drivers/gpu/drm/i915/i915_gem.h  |  1 +
 drivers/tty/sysrq.c              |  1 +
 include/linux/trace_controls.h   | 54 ++++++++++++++++++++++++++++++++
 include/linux/trace_printk.h     | 51 ------------------------------
 kernel/debug/debug_core.c        |  1 +
 kernel/panic.c                   |  1 +
 kernel/rcu/rcu.h                 |  1 +
 kernel/rcu/rcutorture.c          |  1 +
 kernel/trace/trace.h             |  1 +
 kernel/trace/trace_benchmark.c   |  1 +
 lib/sys_info.c                   |  1 +
 14 files changed, 66 insertions(+), 51 deletions(-)
 create mode 100644 include/linux/trace_controls.h

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index cb3a3244ae6f..2135f319e0dd 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -27,6 +27,7 @@
 #include <linux/highmem.h>
 #include <linux/security.h>
 #include <linux/debugfs.h>
+#include <linux/trace_controls.h>
 
 #include <asm/ptrace.h>
 #include <asm/smp.h>
diff --git a/arch/s390/kernel/ipl.c b/arch/s390/kernel/ipl.c
index 3c346b02ceb9..baac66cc4de4 100644
--- a/arch/s390/kernel/ipl.c
+++ b/arch/s390/kernel/ipl.c
@@ -22,6 +22,7 @@
 #include <linux/debug_locks.h>
 #include <linux/vmalloc.h>
 #include <linux/secure_boot.h>
+#include <linux/trace_controls.h>
 #include <asm/asm-extable.h>
 #include <asm/machine.h>
 #include <asm/diag.h>
diff --git a/arch/s390/kernel/machine_kexec.c b/arch/s390/kernel/machine_kexec.c
index baeb3dcfc1c8..33f9a89eb3ad 100644
--- a/arch/s390/kernel/machine_kexec.c
+++ b/arch/s390/kernel/machine_kexec.c
@@ -12,6 +12,7 @@
 #include <linux/delay.h>
 #include <linux/reboot.h>
 #include <linux/ftrace.h>
+#include <linux/trace_controls.h>
 #include <linux/debug_locks.h>
 #include <linux/cpufeature.h>
 #include <asm/guarded_storage.h>
diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
index 20b3cb29cfff..1da8fb61c09e 100644
--- a/drivers/gpu/drm/i915/i915_gem.h
+++ b/drivers/gpu/drm/i915/i915_gem.h
@@ -116,6 +116,7 @@ int i915_gem_open(struct drm_i915_private *i915, struct drm_file *file);
 #endif
 
 #if IS_ENABLED(CONFIG_DRM_I915_TRACE_GEM)
+#include <linux/trace_controls.h>
 #define GEM_TRACE(...) trace_printk(__VA_ARGS__)
 #define GEM_TRACE_ERR(...) do {						\
 	pr_err(__VA_ARGS__);						\
diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index c2e4b31b699a..d3f72dc430b8 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -324,6 +324,7 @@ static const struct sysrq_key_op sysrq_showstate_blocked_op = {
 };
 
 #ifdef CONFIG_TRACING
+#include <linux/trace_controls.h>
 #include <linux/ftrace.h>
 
 static void sysrq_ftrace_dump(u8 key)
diff --git a/include/linux/trace_controls.h b/include/linux/trace_controls.h
new file mode 100644
index 000000000000..995b97e963b4
--- /dev/null
+++ b/include/linux/trace_controls.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_TRACE_CONTROLS_H
+#define _LINUX_TRACE_CONTROLS_H
+
+
+/*
+ * General tracing related utility functions - trace_printk(),
+ * tracing_on/tracing_off and tracing_start()/tracing_stop
+ *
+ * Use tracing_on/tracing_off when you want to quickly turn on or off
+ * tracing. It simply enables or disables the recording of the trace events.
+ * This also corresponds to the user space /sys/kernel/tracing/tracing_on
+ * file, which gives a means for the kernel and userspace to interact.
+ * Place a tracing_off() in the kernel where you want tracing to end.
+ * From user space, examine the trace, and then echo 1 > tracing_on
+ * to continue tracing.
+ *
+ * tracing_stop/tracing_start has slightly more overhead. It is used
+ * by things like suspend to ram where disabling the recording of the
+ * trace is not enough, but tracing must actually stop because things
+ * like calling smp_processor_id() may crash the system.
+ *
+ * Most likely, you want to use tracing_on/tracing_off.
+ */
+enum ftrace_dump_mode {
+	DUMP_NONE,
+	DUMP_ALL,
+	DUMP_ORIG,
+	DUMP_PARAM,
+};
+
+#ifdef CONFIG_TRACING
+void tracing_on(void);
+void tracing_off(void);
+int tracing_is_on(void);
+void tracing_snapshot(void);
+void tracing_snapshot_alloc(void);
+void tracing_start(void);
+void tracing_stop(void);
+void trace_dump_stack(int skip);
+void ftrace_dump(enum ftrace_dump_mode oops_dump_mode);
+#else
+static inline void tracing_start(void) { }
+static inline void tracing_stop(void) { }
+static inline void tracing_on(void) { }
+static inline void tracing_off(void) { }
+static inline int tracing_is_on(void) { return 0; }
+static inline void tracing_snapshot(void) { }
+static inline void tracing_snapshot_alloc(void) { }
+static inline void trace_dump_stack(int skip) { }
+static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { }
+#endif
+
+#endif /* _LINUX_TRACE_CONTROLS_H */
diff --git a/include/linux/trace_printk.h b/include/linux/trace_printk.h
index 3d54f440dccf..a488ea9e9f85 100644
--- a/include/linux/trace_printk.h
+++ b/include/linux/trace_printk.h
@@ -7,43 +7,7 @@
 #include <linux/stddef.h>
 #include <linux/stringify.h>
 
-/*
- * General tracing related utility functions - trace_printk(),
- * tracing_on/tracing_off and tracing_start()/tracing_stop
- *
- * Use tracing_on/tracing_off when you want to quickly turn on or off
- * tracing. It simply enables or disables the recording of the trace events.
- * This also corresponds to the user space /sys/kernel/tracing/tracing_on
- * file, which gives a means for the kernel and userspace to interact.
- * Place a tracing_off() in the kernel where you want tracing to end.
- * From user space, examine the trace, and then echo 1 > tracing_on
- * to continue tracing.
- *
- * tracing_stop/tracing_start has slightly more overhead. It is used
- * by things like suspend to ram where disabling the recording of the
- * trace is not enough, but tracing must actually stop because things
- * like calling smp_processor_id() may crash the system.
- *
- * Most likely, you want to use tracing_on/tracing_off.
- */
-
-enum ftrace_dump_mode {
-	DUMP_NONE,
-	DUMP_ALL,
-	DUMP_ORIG,
-	DUMP_PARAM,
-};
-
 #ifdef CONFIG_TRACING
-void tracing_on(void);
-void tracing_off(void);
-int tracing_is_on(void);
-void tracing_snapshot(void);
-void tracing_snapshot_alloc(void);
-
-extern void tracing_start(void);
-extern void tracing_stop(void);
-
 static inline __printf(1, 2)
 void ____trace_printk_check_format(const char *fmt, ...)
 {
@@ -149,8 +113,6 @@ int __trace_printk(unsigned long ip, const char *fmt, ...);
 extern int __trace_bputs(unsigned long ip, const char *str);
 extern int __trace_puts(unsigned long ip, const char *str);
 
-extern void trace_dump_stack(int skip);
-
 /*
  * The double __builtin_constant_p is because gcc will give us an error
  * if we try to allocate the static variable to fmt if it is not a
@@ -173,19 +135,7 @@ __ftrace_vbprintk(unsigned long ip, const char *fmt, va_list ap);
 
 extern __printf(2, 0) int
 __ftrace_vprintk(unsigned long ip, const char *fmt, va_list ap);
-
-extern void ftrace_dump(enum ftrace_dump_mode oops_dump_mode);
 #else
-static inline void tracing_start(void) { }
-static inline void tracing_stop(void) { }
-static inline void trace_dump_stack(int skip) { }
-
-static inline void tracing_on(void) { }
-static inline void tracing_off(void) { }
-static inline int tracing_is_on(void) { return 0; }
-static inline void tracing_snapshot(void) { }
-static inline void tracing_snapshot_alloc(void) { }
-
 static inline __printf(1, 2)
 int trace_printk(const char *fmt, ...)
 {
@@ -196,7 +146,6 @@ ftrace_vprintk(const char *fmt, va_list ap)
 {
 	return 0;
 }
-static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { }
 #endif /* CONFIG_TRACING */
 
 #endif
diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
index b276504c1c6b..f9c83a470c98 100644
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -27,6 +27,7 @@
 
 #define pr_fmt(fmt) "KGDB: " fmt
 
+#include <linux/trace_controls.h>
 #include <linux/pid_namespace.h>
 #include <linux/clocksource.h>
 #include <linux/serial_core.h>
diff --git a/kernel/panic.c b/kernel/panic.c
index 213725b612aa..1415e910371d 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -9,6 +9,7 @@
  * This function is used through-out the kernel (including mm and fs)
  * to indicate a major problem.
  */
+#include <linux/trace_controls.h>
 #include <linux/debug_locks.h>
 #include <linux/sched/debug.h>
 #include <linux/interrupt.h>
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index fa6d30ce73d1..735a80df0b30 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -12,6 +12,7 @@
 
 #include <linux/slab.h>
 #include <trace/events/rcu.h>
+#include <linux/trace_controls.h>
 
 /*
  * Grace-period counter management.
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 882a158ada7b..76bf0184b267 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -39,6 +39,7 @@
 #include <linux/srcu.h>
 #include <linux/slab.h>
 #include <linux/trace_clock.h>
+#include <linux/trace_controls.h>
 #include <asm/byteorder.h>
 #include <linux/torture.h>
 #include <linux/vmalloc.h>
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..2537c33ddd49 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -22,6 +22,7 @@
 #include <linux/ctype.h>
 #include <linux/once_lite.h>
 #include <linux/ftrace_regs.h>
+#include <linux/trace_controls.h>
 #include <linux/llist.h>
 
 #include "pid_list.h"
diff --git a/kernel/trace/trace_benchmark.c b/kernel/trace/trace_benchmark.c
index e19c32f2a938..69cc39008c36 100644
--- a/kernel/trace/trace_benchmark.c
+++ b/kernel/trace/trace_benchmark.c
@@ -3,6 +3,7 @@
 #include <linux/module.h>
 #include <linux/kthread.h>
 #include <linux/trace_clock.h>
+#include <linux/trace_controls.h>
 
 #define CREATE_TRACE_POINTS
 #include "trace_benchmark.h"
diff --git a/lib/sys_info.c b/lib/sys_info.c
index f32a06ec9ed4..e3c9ca05601b 100644
--- a/lib/sys_info.c
+++ b/lib/sys_info.c
@@ -8,6 +8,7 @@
 #include <linux/ftrace.h>
 #include <linux/nmi.h>
 #include <linux/sched/debug.h>
+#include <linux/trace_controls.h>
 #include <linux/string.h>
 #include <linux/sysctl.h>
 
-- 
2.53.0
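
The new header keeps the usual layout of real prototypes under
CONFIG_TRACING and empty static inlines otherwise. A reduced userspace
sketch of that layout follows; MY_CONFIG_TRACING and the my_ prefixed names
are hypothetical stand-ins, not the kernel symbols.

```c
/*
 * Miniature of the header layout: prototypes when the feature is built in,
 * empty static inlines otherwise. MY_CONFIG_TRACING is left undefined here,
 * so the stub branch is the one that gets compiled.
 */
#ifdef MY_CONFIG_TRACING
void my_tracing_off(void);
int my_tracing_is_on(void);
#else
static inline void my_tracing_off(void) { }
static inline int my_tracing_is_on(void) { return 0; }
#endif

/* The caller needs no #ifdef of its own; returns 1 if a fault was handled. */
static int handle_fault(int bad)
{
	if (!bad)
		return 0;

	my_tracing_off();	/* empty inline when tracing is compiled out */
	return 1;
}
```

With the stubs in place, callers such as panic or sysrq paths can call the
control functions unconditionally and let the compiler drop them.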




* [PATCH v4 2/2] tracing: Remove trace_printk.h from kernel.h
From: Steven Rostedt @ 2026-06-25 10:40 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Linus Torvalds, Sebastian Andrzej Siewior, John Ogness,
	Thomas Gleixner, Peter Zijlstra, Julia Lawall, Yury Norov,
	linux-doc, linux-kbuild, linuxppc-dev, dri-devel, linux-stm32,
	linux-arm-kernel, linux-rdma, linux-usb, linux-ext4, linux-nfs,
	kvm, intel-gfx
In-Reply-To: <20260625104007.041432666@kernel.org>

From: Steven Rostedt <rostedt@goodmis.org>

There have been complaints that having trace_printk.h in kernel.h increases
build time whenever it changes. There is also an effort to clean up kernel.h
so that it does not include unneeded header files. Move trace_printk.h
out of kernel.h and place it in the headers and C files that use it.

Link: https://lore.kernel.org/all/CAHk-=wikCBeVFjVXiY4o-oepdbjAoir5+TcAgtL12c4u1TpZLQ@mail.gmail.com/

Suggested-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 arch/powerpc/kvm/book3s_xics.c         | 1 +
 drivers/gpu/drm/i915/gt/intel_gtt.h    | 1 +
 drivers/gpu/drm/i915/i915_gem.h        | 1 +
 drivers/hwtracing/stm/dummy_stm.c      | 1 +
 drivers/infiniband/hw/hfi1/trace_dbg.h | 1 +
 drivers/usb/early/xhci-dbc.c           | 1 +
 fs/ext4/inline.c                       | 1 +
 include/linux/ftrace.h                 | 2 ++
 include/linux/kernel.h                 | 1 -
 include/linux/sunrpc/debug.h           | 1 +
 include/linux/trace_printk.h           | 5 +++--
 kernel/trace/ring_buffer_benchmark.c   | 1 +
 samples/fprobe/fprobe_example.c        | 1 +
 samples/ftrace/ftrace-direct-too.c     | 1 -
 samples/trace_printk/trace-printk.c    | 1 +
 15 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
index 74a44fa702b0..ef5eb596a56e 100644
--- a/arch/powerpc/kvm/book3s_xics.c
+++ b/arch/powerpc/kvm/book3s_xics.c
@@ -26,6 +26,7 @@
 #if 1
 #define XICS_DBG(fmt...) do { } while (0)
 #else
+#include <linux/trace_printk.h>
 #define XICS_DBG(fmt...) trace_printk(fmt)
 #endif
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
index b54ee4f25af1..f6f223090760 100644
--- a/drivers/gpu/drm/i915/gt/intel_gtt.h
+++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
@@ -35,6 +35,7 @@
 #define I915_GFP_ALLOW_FAIL (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN)
 
 #if IS_ENABLED(CONFIG_DRM_I915_TRACE_GTT)
+#include <linux/trace_printk.h>
 #define GTT_TRACE(...) trace_printk(__VA_ARGS__)
 #else
 #define GTT_TRACE(...)
diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
index 1da8fb61c09e..f490052e8964 100644
--- a/drivers/gpu/drm/i915/i915_gem.h
+++ b/drivers/gpu/drm/i915/i915_gem.h
@@ -117,6 +117,7 @@ int i915_gem_open(struct drm_i915_private *i915, struct drm_file *file);
 
 #if IS_ENABLED(CONFIG_DRM_I915_TRACE_GEM)
 #include <linux/trace_controls.h>
+#include <linux/trace_printk.h>
 #define GEM_TRACE(...) trace_printk(__VA_ARGS__)
 #define GEM_TRACE_ERR(...) do {						\
 	pr_err(__VA_ARGS__);						\
diff --git a/drivers/hwtracing/stm/dummy_stm.c b/drivers/hwtracing/stm/dummy_stm.c
index 38528ffdc0b3..7c5e48ebfb9f 100644
--- a/drivers/hwtracing/stm/dummy_stm.c
+++ b/drivers/hwtracing/stm/dummy_stm.c
@@ -8,6 +8,7 @@
  */
 
 #undef DEBUG
+#include <linux/trace_printk.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/slab.h>
diff --git a/drivers/infiniband/hw/hfi1/trace_dbg.h b/drivers/infiniband/hw/hfi1/trace_dbg.h
index 58304b91380f..30df5e246586 100644
--- a/drivers/infiniband/hw/hfi1/trace_dbg.h
+++ b/drivers/infiniband/hw/hfi1/trace_dbg.h
@@ -103,6 +103,7 @@ __hfi1_trace_def(IOCTL);
  */
 
 #ifdef HFI1_EARLY_DBG
+#include <linux/trace_printk.h>
 #define hfi1_dbg_early(fmt, ...) \
 	trace_printk(fmt, ##__VA_ARGS__)
 #else
diff --git a/drivers/usb/early/xhci-dbc.c b/drivers/usb/early/xhci-dbc.c
index 41118bba9197..955c73bd601f 100644
--- a/drivers/usb/early/xhci-dbc.c
+++ b/drivers/usb/early/xhci-dbc.c
@@ -30,6 +30,7 @@ static struct xdbc_state xdbc;
 static bool early_console_keep;
 
 #ifdef XDBC_TRACE
+#include <linux/trace_printk.h>
 #define	xdbc_trace	trace_printk
 #else
 static inline void xdbc_trace(const char *fmt, ...) { }
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 8045e4ff270c..0eff4a0c6a6c 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -934,6 +934,7 @@ static int ext4_da_convert_inline_data_to_extent(struct address_space *mapping,
 }
 
 #ifdef INLINE_DIR_DEBUG
+#include <linux/trace_printk.h>
 void ext4_show_inline_dir(struct inode *dir, struct buffer_head *bh,
 			  void *inline_start, int inline_size)
 {
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 02bc5027523a..b5336a81e619 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -8,6 +8,8 @@
 #define _LINUX_FTRACE_H
 
 #include <linux/trace_recursion.h>
+#include <linux/trace_controls.h>
+#include <linux/trace_printk.h>
 #include <linux/trace_clock.h>
 #include <linux/jump_label.h>
 #include <linux/kallsyms.h>
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index e5570a16cbb1..e87a40fbd152 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -31,7 +31,6 @@
 #include <linux/build_bug.h>
 #include <linux/sprintf.h>
 #include <linux/static_call_types.h>
-#include <linux/trace_printk.h>
 #include <linux/util_macros.h>
 #include <linux/wordpart.h>
 
diff --git a/include/linux/sunrpc/debug.h b/include/linux/sunrpc/debug.h
index ab61bed2f7af..7524f5d82fba 100644
--- a/include/linux/sunrpc/debug.h
+++ b/include/linux/sunrpc/debug.h
@@ -29,6 +29,7 @@ extern unsigned int		nlm_debug;
 # define ifdebug(fac)		if (unlikely(rpc_debug & RPCDBG_##fac))
 
 # if IS_ENABLED(CONFIG_SUNRPC_DEBUG_TRACE)
+#  include <linux/trace_printk.h>
 #  define __sunrpc_printk(fmt, ...)	trace_printk(fmt, ##__VA_ARGS__)
 # else
 #  define __sunrpc_printk(fmt, ...)	printk(KERN_DEFAULT fmt, ##__VA_ARGS__)
diff --git a/include/linux/trace_printk.h b/include/linux/trace_printk.h
index a488ea9e9f85..74ce4f8995c4 100644
--- a/include/linux/trace_printk.h
+++ b/include/linux/trace_printk.h
@@ -1,11 +1,12 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_TRACE_PRINTK_H
 #define _LINUX_TRACE_PRINTK_H
+#if !defined(__ASSEMBLY__) && !defined(__GENKSYMS__) && !defined(BUILD_VDSO)
 
-#include <linux/compiler_attributes.h>
 #include <linux/instruction_pointer.h>
 #include <linux/stddef.h>
 #include <linux/stringify.h>
+#include <linux/stdarg.h>
 
 #ifdef CONFIG_TRACING
 static inline __printf(1, 2)
@@ -147,5 +148,5 @@ ftrace_vprintk(const char *fmt, va_list ap)
 	return 0;
 }
 #endif /* CONFIG_TRACING */
-
+#endif /* !defined(__ASSEMBLY__) && !defined(__GENKSYMS__) && !defined(BUILD_VDSO) */
 #endif
diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c
index 593e3b59e42e..2bb25caebb75 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2009 Steven Rostedt <srostedt@redhat.com>
  */
 #include <linux/ring_buffer.h>
+#include <linux/trace_printk.h>
 #include <linux/completion.h>
 #include <linux/kthread.h>
 #include <uapi/linux/sched/types.h>
diff --git a/samples/fprobe/fprobe_example.c b/samples/fprobe/fprobe_example.c
index bfe98ce826f3..de81b9b4ca7d 100644
--- a/samples/fprobe/fprobe_example.c
+++ b/samples/fprobe/fprobe_example.c
@@ -12,6 +12,7 @@
 
 #define pr_fmt(fmt) "%s: " fmt, __func__
 
+#include <linux/trace_printk.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/fprobe.h>
diff --git a/samples/ftrace/ftrace-direct-too.c b/samples/ftrace/ftrace-direct-too.c
index bf2411aa6fd7..159190f4103f 100644
--- a/samples/ftrace/ftrace-direct-too.c
+++ b/samples/ftrace/ftrace-direct-too.c
@@ -1,6 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0-only
 #include <linux/module.h>
-
 #include <linux/mm.h> /* for handle_mm_fault() */
 #include <linux/ftrace.h>
 #if !defined(CONFIG_ARM64) && !defined(CONFIG_PPC32)
diff --git a/samples/trace_printk/trace-printk.c b/samples/trace_printk/trace-printk.c
index cfc159580263..ff37aeb8523e 100644
--- a/samples/trace_printk/trace-printk.c
+++ b/samples/trace_printk/trace-printk.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0-only
+#include <linux/trace_printk.h>
 #include <linux/module.h>
 #include <linux/kthread.h>
 #include <linux/irq_work.h>
-- 
2.53.0
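
Most hunks in this patch follow one pattern: a debug macro that maps to
trace_printk() now includes the header at the point where the macro is
defined. A userspace sketch of that pattern is below. my_trace_printk(),
FOO_DEBUG_TRACE, foo_dbg() and foo_probe() are hypothetical names used only
for illustration.

```c
#include <stdio.h>

/*
 * Stand-in for trace_printk() so the sketch builds in userspace. In the
 * kernel the declaration comes from <linux/trace_printk.h>.
 */
static int my_trace_printk_calls;
#define my_trace_printk(fmt, ...) \
	(my_trace_printk_calls++, printf(fmt, ##__VA_ARGS__))

/* Hypothetical per-driver debug switch. */
#define FOO_DEBUG_TRACE 1

#if FOO_DEBUG_TRACE
/* kernel code would place #include <linux/trace_printk.h> here */
#define foo_dbg(fmt, ...) my_trace_printk(fmt, ##__VA_ARGS__)
#else
#define foo_dbg(fmt, ...) do { } while (0)
#endif

static int foo_probe(int id)
{
	foo_dbg("probing device %d\n", id);
	return id;
}
```

Keeping the include inside the conditional means builds with the debug
switch off never see the header at all.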




* [PATCH v4 0/2] tracing: Move non-trace_printk prototypes into trace_controls.h
From: Steven Rostedt @ 2026-06-25 10:40 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Linus Torvalds, Sebastian Andrzej Siewior, John Ogness,
	Thomas Gleixner, Peter Zijlstra, Julia Lawall, Yury Norov,
	linux-doc, linux-kbuild, linuxppc-dev, dri-devel, linux-stm32,
	linux-arm-kernel, linux-rdma, linux-usb, linux-ext4, linux-nfs,
	kvm, intel-gfx

Remove trace_printk.h from kernel.h. Create a trace_controls.h for the places
that need access to tracing prototypes like tracing_off(), and include
trace_printk.h directly in the places that need trace_printk().

Changes since v3: https://lore.kernel.org/all/20260624081806.120105649@kernel.org/

- Always include trace_controls.h in rcu.h (kernel test robot)

  There are other configs that may include tracing_off() in rcu.h besides
  the one that had the include of trace_controls.h. Just always include
  it in that header to be safe.

Steven Rostedt (2):
      tracing: Move non-trace_printk prototypes into trace_controls.h
      tracing: Remove trace_printk.h from kernel.h

----
 arch/powerpc/kvm/book3s_xics.c         |  1 +
 arch/powerpc/xmon/xmon.c               |  1 +
 arch/s390/kernel/ipl.c                 |  1 +
 arch/s390/kernel/machine_kexec.c       |  1 +
 drivers/gpu/drm/i915/gt/intel_gtt.h    |  1 +
 drivers/gpu/drm/i915/i915_gem.h        |  2 ++
 drivers/hwtracing/stm/dummy_stm.c      |  1 +
 drivers/infiniband/hw/hfi1/trace_dbg.h |  1 +
 drivers/tty/sysrq.c                    |  1 +
 drivers/usb/early/xhci-dbc.c           |  1 +
 fs/ext4/inline.c                       |  1 +
 include/linux/ftrace.h                 |  2 ++
 include/linux/kernel.h                 |  1 -
 include/linux/sunrpc/debug.h           |  1 +
 include/linux/trace_controls.h         | 54 ++++++++++++++++++++++++++++++++
 include/linux/trace_printk.h           | 56 ++--------------------------------
 kernel/debug/debug_core.c              |  1 +
 kernel/panic.c                         |  1 +
 kernel/rcu/rcu.h                       |  1 +
 kernel/rcu/rcutorture.c                |  1 +
 kernel/trace/ring_buffer_benchmark.c   |  1 +
 kernel/trace/trace.h                   |  1 +
 kernel/trace/trace_benchmark.c         |  1 +
 lib/sys_info.c                         |  1 +
 samples/fprobe/fprobe_example.c        |  1 +
 samples/ftrace/ftrace-direct-too.c     |  1 -
 samples/trace_printk/trace-printk.c    |  1 +
 27 files changed, 82 insertions(+), 55 deletions(-)
 create mode 100644 include/linux/trace_controls.h


* Re: [PATCH rdma-next 1/1] RDMA/mana_ib: Adopt robust udata
From: Konstantin Taranov @ 2026-06-25  8:55 UTC (permalink / raw)
  To: Jacob Moroni, Konstantin Taranov
  Cc: shirazsaleem@microsoft.com, Long Li, jgg@ziepe.ca,
	leon@kernel.org, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <CAHYDg1RQ8vEMrKPoS3qHgtf5S+T1Wzrm=YuwdfzFEX3g22Ruhg@mail.gmail.com>

> > +struct mana_ib_uctx_req {
> > +       __aligned_u64 client_caps1;
> > +       __aligned_u64 client_caps2;
> > +       __aligned_u64 client_caps3;
> > +       __aligned_u64 client_caps4;
> > +       __aligned_u64 comp_mask;
> > +};
> > +
> 
> I am curious about the addition of these unused "client_caps" fields.
> 
> I guess the idea is to be able to reject older providers that lack support for
> some mandatory feature in the future - like if a new HW variant breaks the
> descriptor ABI or something and therefore requires a provider update?

Not really. The capability bits will be used for simpler integration of new features,
and not to reject older providers. To reject an older provider, I could use the abi version.

> 
> My main question is: how come they need to be added now as opposed to
> extending the structure later?
> 

These client caps make it easier for mana_ib to integrate client-wide optimizations or enable features.
Once rdma-core has some important code that the kernel should know about, we can declare it
as a capability of the client (meaning the client can additionally do X). The important aspect is that a capability is
an addition and not a new behavior (meaning the client stays backwards compatible). So it is handy for bug
fixes and new features, rather than burdening other udata requests with chained changes.

For example, the upcoming patch after this one makes all WQEs in rdma-core a fixed size.
The HW supports all sizes, but knowing that all WQEs are the same size allows the HW to apply optimizations.
So this fixed size can be defined as a capability. Then, in the kernel code for QP creation, we can set a HW flag based on the client
capability (a single line in the kernel: if the cap is present, add a certain flag to the HWC).
Otherwise, the change would be more complex: a new response for alloc_ucontext, and then new requests for the various QP create calls.
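
To make the fixed-size WQE example concrete, here is a rough userspace
sketch of how a capability bit could be turned into a HW flag at QP
creation. All names and bit values below are hypothetical; no capability
bits or HW flags have been defined yet. The sketch also simply ignores
unknown bits, which is an assumption rather than stated driver behavior.

```c
#include <stdint.h>

/* Hypothetical bit assignments, for illustration only. */
#define CLIENT_CAP1_FIXED_WQE_SIZE	(1ULL << 0)	/* client posts equal-size WQEs */
#define HW_QP_FLAG_FIXED_WQE		(1U << 3)	/* made-up flag passed to the HW */

/* Mirrors the quoted request layout, with plain uint64_t for userspace. */
struct uctx_req {
	uint64_t client_caps1;
	uint64_t client_caps2;
	uint64_t client_caps3;
	uint64_t client_caps4;
	uint64_t comp_mask;
};

/* Per-context state remembered from alloc_ucontext(). */
struct uctx {
	uint64_t client_caps1;
};

static void uctx_init(struct uctx *ctx, const struct uctx_req *req)
{
	/* Assumption: unknown bits are stored but never acted on. */
	ctx->client_caps1 = req->client_caps1;
}

/* QP creation: one extra check turns the capability into a HW flag. */
static uint32_t qp_create_hw_flags(const struct uctx *ctx)
{
	uint32_t flags = 0;

	if (ctx->client_caps1 & CLIENT_CAP1_FIXED_WQE_SIZE)
		flags |= HW_QP_FLAG_FIXED_WQE;
	return flags;
}
```

In this shape neither the alloc_ucontext response nor the QP create request
has to change when a new capability is introduced.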

A similar idea exists in bnxt_re (see BNXT_RE_COMP_MASK_REQ_UCNTX_POW2_SUPPORT and uctx->cmask),
but I think there was a misunderstanding: the cap field was named comp_mask, and now bnxt_re is locked to 2 capability bits.
As it is the first rdma-core ioctl, there is no way to know which comp_mask is allowed. With the wave of robust udata,
providers will be locked to one udata request format for alloc_ucontext() without a chance of extending it.
That is why I am trying to introduce the idea now.

All in all, I believe it would be beneficial for the kernel to get some initial feedback from rdma-core,
and I think that was the initial goal of having a udata request in alloc_ucontext().

- Konstantin

> I'm not proposing any changes, just trying to understand the intent.
> 
> Thanks,
> Jake


* Re: [PATCH net] net/smc: avoid recursive sk_callback_lock in listen data_ready
From: Sidraya Jayagond @ 2026-06-25  8:32 UTC (permalink / raw)
  To: Runyu Xiao, D. Wythe, Dust Li, Wenjia Zhang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Mahanta Jambigi, Tony Lu, Wen Gu, Simon Horman, Karsten Graul,
	linux-rdma, linux-s390, netdev, linux-kernel, jianhao.xu, stable
In-Reply-To: <20260617152855.1039151-1-runyu.xiao@seu.edu.cn>



On 17/06/26 8:58 pm, Runyu Xiao wrote:
> smc_listen() installs smc_clcsock_data_ready() as the underlying TCP
> listen socket's sk_data_ready callback.  smc_clcsock_data_ready() then
> immediately takes sk_callback_lock before looking up the SMC listener and
> queuing smc_tcp_listen_work().
> 
> That is unsafe once the TCP listen socket is leaving TCP_LISTEN.  The TCP
> close/flush path can run the installed sk_data_ready callback with
> sk_callback_lock already held, so entering smc_clcsock_data_ready() again
> tries to take the same rwlock recursively in the same thread.  The nvmet
> TCP listener had to make the same state check before taking
> sk_callback_lock for this reason.
> 
> This issue was found by our static analysis tool and then manually
> reviewed against the current tree.
> 
> The grounded PoC kept the SMC listen callback installation path:
> 
>   smc_listen()
>   smc_clcsock_replace_cb()
>   sk_data_ready = smc_clcsock_data_ready()
> 
> It then modeled the close/flush carrier that invokes the installed
> sk_data_ready callback while sk_callback_lock is already held.  Lockdep
> reported the same-thread recursive acquisition:
> 
>   WARNING: possible recursive locking detected
>   smc_clcsock_data_ready+0xa/0x4d [vuln_msv]
>   smc_close_flush_work+0x1f/0x30 [vuln_msv]
>   *** DEADLOCK ***
> 
> Return before taking sk_callback_lock when the underlying TCP socket is no
> longer in TCP_LISTEN.  In that state there is no listen accept work to
> queue for SMC, and avoiding the callback lock mirrors the fix used by the
> TCP nvmet listener.
> 
> Fixes: 0558226cebee ("net/smc: Fix slab-out-of-bounds issue in fallback")
> Cc: stable@vger.kernel.org
> Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
> ---
>  net/smc/af_smc.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index 6421c2e1c84d..1af4e3c333ff 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -2631,6 +2631,9 @@ static void smc_clcsock_data_ready(struct sock *listen_clcsock)
>  {
>  	struct smc_sock *lsmc;
>  
> +	if (READ_ONCE(listen_clcsock->sk_state) != TCP_LISTEN)
> +		return;
> +

In smc_close_active(), the TCP socket is still in TCP_LISTEN state while
write_lock_bh(&smc->clcsock->sk->sk_callback_lock) is held. The patch's
state check would pass during this window, so it would not prevent the
recursive lock scenario.
It is therefore unclear whether the patch fully prevents the recursive
locking described in the commit message for the smc_close_active() code
path.
Could you describe the exact deadlock scenario and how the patch
addresses it?
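
To make the concern concrete, here is a small userspace model of the
window described above. It is only a sketch: `sk_model`, `ST_LISTEN`,
`ST_CLOSE` and `data_ready_would_recurse()` are hypothetical names, not
kernel symbols, and whether any real path invokes the callback inside
this window is exactly the open question.

```c
#include <stdbool.h>

/* Hypothetical socket states for the model; not the kernel's TCP states. */
enum sk_state_model { ST_LISTEN, ST_CLOSE };

struct sk_model {
	enum sk_state_model state;
	bool cb_lock_held_by_caller;	/* caller already holds sk_callback_lock */
};

/*
 * Models the patched callback: it returns early when the socket has left
 * the listen state, otherwise it would go on to take the callback lock.
 * Returns true when that lock attempt would recurse on a held lock.
 */
static bool data_ready_would_recurse(const struct sk_model *sk)
{
	if (sk->state != ST_LISTEN)
		return false;		/* early return added by the patch */

	return sk->cb_lock_held_by_caller;
}
```

In this model the early return only helps once the state has changed; a
caller that holds the lock while the state is still `ST_LISTEN` is not
covered.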

>  	read_lock_bh(&listen_clcsock->sk_callback_lock);
>  	lsmc = smc_clcsock_user_data(listen_clcsock);
>  	if (!lsmc)


^ permalink raw reply

* Re: [PATCH v2] RDMA/core: Fix memory leak in __ib_create_cq() on invalid cqe
From: Kalesh Anakkur Purayil @ 2026-06-25  7:32 UTC (permalink / raw)
  To: Chenguang Zhao
  Cc: jgg, leon, edwards, mbloch, michaelgur, msanalla, ohartoov, jiri,
	linux-rdma
In-Reply-To: <20260625020148.224537-1-zhaochenguang@kylinos.cn>

On Thu, Jun 25, 2026 at 7:33 AM Chenguang Zhao <zhaochenguang@kylinos.cn> wrote:
>
> Move the zero CQE validation before rdma_zalloc_drv_obj() to avoid
> leaking the CQ object when returning -EINVAL.
>
> Fixes: a2917582887a ("RDMA/core: Reject zero CQE count")
> Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
LGTM,
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

-- 
Regards,
Kalesh AP

^ permalink raw reply

* [PATCH] RDMA/rxe: Check PDs for memory window binds
From: Zhiwei Zhang @ 2026-06-25  6:16 UTC (permalink / raw)
  To: Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky
  Cc: linux-rdma, linux-kernel, Zhiwei Zhang

The IBTA Software Transport Verbs specification requires the QP,
Memory Window and Memory Region for a Bind Memory Window operation
to belong to the same HCA and protection domain.

rxe only checked the QP and MW protection domain for type 2 MWs.
Move the QP/MW PD check to the common bind path and also reject
binding an MW to an MR from a different PD.

Invalid bind requests continue to fail with IB_WC_MW_BIND_ERR.

Signed-off-by: Zhiwei Zhang <202275009@qq.com>
---
 drivers/infiniband/sw/rxe/rxe_mw.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mw.c b/drivers/infiniband/sw/rxe/rxe_mw.c
index 379e65bfcd49..aa9371e4ccd5 100644
--- a/drivers/infiniband/sw/rxe/rxe_mw.c
+++ b/drivers/infiniband/sw/rxe/rxe_mw.c
@@ -72,13 +72,6 @@ static int rxe_check_bind_mw(struct rxe_qp *qp, struct rxe_send_wqe *wqe,
 			return -EINVAL;
 		}
 
-		/* C10-72 */
-		if (unlikely(qp->pd != to_rpd(mw->ibmw.pd))) {
-			rxe_dbg_mw(mw,
-				"attempt to bind type 2 MW with qp with different PD\n");
-			return -EINVAL;
-		}
-
 		/* o10-37.2.40 */
 		if (unlikely(!mr || wqe->wr.wr.mw.length == 0)) {
 			rxe_dbg_mw(mw,
@@ -87,10 +80,21 @@ static int rxe_check_bind_mw(struct rxe_qp *qp, struct rxe_send_wqe *wqe,
 		}
 	}
 
-	/* remaining checks only apply to a nonzero MR */
+	/* C10-72 */
+	if (unlikely(qp->pd != rxe_mw_pd(mw))) {
+		rxe_dbg_mw(mw, "attempt to bind MW with qp with different PD\n");
+		return -EINVAL;
+	}
+
 	if (!mr)
 		return 0;
 
+	/* remaining checks only apply to a nonzero MR */
+	if (unlikely(qp->pd != mr_pd(mr))) {
+		rxe_dbg_mw(mw, "attempt to bind MW to MR with different PD\n");
+		return -EINVAL;
+	}
+
 	if (unlikely(mr->access & IB_ZERO_BASED)) {
 		rxe_dbg_mw(mw, "attempt to bind MW to zero based MR\n");
 		return -EINVAL;
-- 
2.51.0
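
For readers skimming the diff, the resulting check order can be modelled
in plain userspace C. This is a sketch under assumptions: the `*_model`
types and `check_bind_pds_model()` are hypothetical stand-ins for the rxe
structures and for the relevant part of rxe_check_bind_mw(), and each
object is reduced to its protection domain pointer.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins: each object only carries its protection domain. */
struct pd_model { int id; };
struct qp_model { const struct pd_model *pd; };
struct mw_model { const struct pd_model *pd; };
struct mr_model { const struct pd_model *pd; };

/*
 * Order of checks after the patch:
 *  1. QP and MW must share a PD (now for every MW type).
 *  2. With no MR there is nothing more to check.
 *  3. Otherwise the MR must be in the same PD as the QP.
 */
static int check_bind_pds_model(const struct qp_model *qp,
				const struct mw_model *mw,
				const struct mr_model *mr)
{
	if (qp->pd != mw->pd)
		return -EINVAL;

	if (!mr)
		return 0;

	if (qp->pd != mr->pd)
		return -EINVAL;

	return 0;
}
```

Because the QP/MW comparison runs first, a passing MR comparison against
the QP's PD also implies that the MR and MW share a PD.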


^ permalink raw reply related

* Re: [PATCH] [net] eth: mlx5: fix macsec dependency
From: patchwork-bot+netdevbpf @ 2026-06-25  2:30 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: saeedm, leon, tariqt, mbloch, andrew+netdev, davem, edumazet,
	kuba, pabeni, sd, arnd, daniel.zahka, rrameshbabu, raeds, netdev,
	linux-rdma, linux-kernel
In-Reply-To: <20260622124229.2444502-1-arnd@kernel.org>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 22 Jun 2026 14:41:07 +0200 you wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> 
> Configurations with mlx5 built-in but macsec=m fail to link:
> 
> x86_64-linux-ld: drivers/infiniband/hw/mlx5/macsec.o: in function `mlx5r_add_gid_macsec_operations':
> macsec.c:(.text+0x77d): undefined reference to `macsec_netdev_is_offloaded'
> x86_64-linux-ld: drivers/infiniband/hw/mlx5/macsec.o: in function `mlx5r_del_gid_macsec_operations':
> macsec.c:(.text+0xe81): undefined reference to `macsec_netdev_is_offloaded'
> 
> [...]

Here is the summary with links:
  - [net] eth: mlx5: fix macsec dependency
    https://git.kernel.org/netdev/net/c/87ab8276ed24

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net 2/3] net/mlx5e: Validate bandwidth for non-ETS traffic classes
From: Jakub Kicinski @ 2026-06-25  2:10 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, netdev, Paolo Abeni,
	Alexei Lazar, Carolina Jubran, Leon Romanovsky, linux-kernel,
	linux-rdma, Mark Bloch, Saeed Mahameed, Gal Pressman
In-Reply-To: <20260622112925.624795-3-tariqt@nvidia.com>

On Mon, 22 Jun 2026 14:29:24 +0300 Tariq Toukan wrote:
> From: Alexei Lazar <alazar@nvidia.com>
> 
> The IEEE 802.1Qaz standard defines that bandwidth allocation percentages
> only apply to ETS traffic classes.
> 
> Reject ETS configurations that specify non-zero bandwidth for non-ETS
> traffic classes.
> 
> Fixes: 08fb1dacdd76 ("net/mlx5e: Support DCBNL IEEE ETS")
> Signed-off-by: Alexei Lazar <alazar@nvidia.com>
> Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c b/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c
> index 762f0a46c120..e4161603cdc0 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c
> @@ -324,6 +324,17 @@ static int mlx5e_dbcnl_validate_ets(struct net_device *netdev,
>  		}
>  	}
>  
> +	/* Validate Non ETS BW */
> +	for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
> +		if (ets->tc_tsa[i] != IEEE_8021QAZ_TSA_ETS &&
> +		    ets->tc_tx_bw[i]) {
> +			netdev_err(netdev,
> +				   "Failed to validate ETS: tc=%d BW is not 0 for non-ETS TC (tsa=%u, bw=%u)\n",
> +				   i, ets->tc_tsa[i], ets->tc_tx_bw[i]);
> +			return -EINVAL;
> +		}
> +	}

Can we pull this check out into the shared dcbnl handling?
There seems to be zero mlx5-specific logic in this patch
or in its motivation.
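
As a rough illustration of what a driver-independent version of the check
could look like, here is a userspace sketch. The names are assumptions:
`struct ets_cfg`, `MAX_TCS`, `TSA_STRICT`, `TSA_ETS` and
`ets_validate_non_ets_bw()` are hypothetical stand-ins for `struct
ieee_ets` and the IEEE_8021QAZ_* constants, and where such a helper would
live in the dcbnl core is left open.

```c
#include <errno.h>
#include <stdint.h>

/* Model constants; they stand in for the IEEE_8021QAZ_* definitions. */
#define MAX_TCS		8
#define TSA_STRICT	0
#define TSA_ETS		2

/* Hypothetical reduced view of an ETS configuration. */
struct ets_cfg {
	uint8_t tc_tsa[MAX_TCS];	/* transmission selection algorithm per TC */
	uint8_t tc_tx_bw[MAX_TCS];	/* TX bandwidth percentage per TC */
};

/* Return 0 if every non-ETS traffic class has zero bandwidth, else -EINVAL. */
static int ets_validate_non_ets_bw(const struct ets_cfg *ets)
{
	for (int i = 0; i < MAX_TCS; i++) {
		if (ets->tc_tsa[i] != TSA_ETS && ets->tc_tx_bw[i])
			return -EINVAL;
	}

	return 0;
}
```

Nothing in the loop depends on the driver, which is the point of the
question above.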

>  	/* Validate Bandwidth Sum */
>  	for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
>  		if (ets->tc_tsa[i] == IEEE_8021QAZ_TSA_ETS) {


^ permalink raw reply

* [PATCH v2] RDMA/core: Fix memory leak in __ib_create_cq() on invalid cqe
From: Chenguang Zhao @ 2026-06-25  2:01 UTC (permalink / raw)
  To: jgg, leon
  Cc: edwards, mbloch, michaelgur, msanalla, ohartoov, jiri,
	kalesh-anakkur.purayil, linux-rdma, zhaochenguang
In-Reply-To: <20260624025949.306783-1-zhaochenguang@kylinos.cn>

Move the zero CQE validation before rdma_zalloc_drv_obj() to avoid
leaking the CQ object when returning -EINVAL.

Fixes: a2917582887a ("RDMA/core: Reject zero CQE count")
Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
---
v2:
 - move validation before rdma_zalloc_drv_obj() as suggested by Kalesh.
---
 drivers/infiniband/core/verbs.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 3b613b57e269..86811d31092c 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -2196,13 +2196,13 @@ struct ib_cq *__ib_create_cq(struct ib_device *device,
 	struct ib_cq *cq;
 	int ret;
 
+	if (WARN_ON_ONCE(!cq_attr->cqe))
+		return ERR_PTR(-EINVAL);
+
 	cq = rdma_zalloc_drv_obj(device, ib_cq);
 	if (!cq)
 		return ERR_PTR(-ENOMEM);
 
-	if (WARN_ON_ONCE(!cq_attr->cqe))
-		return ERR_PTR(-EINVAL);
-
 	cq->device = device;
 	cq->comp_handler = comp_handler;
 	cq->event_handler = event_handler;
-- 
2.25.1
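
The ordering issue is easy to reproduce in a userspace model. The sketch
below is an assumption-laden illustration: `cq_model`,
`cq_create_model()`, `cq_destroy_model()` and the `live_allocs` counter
are hypothetical, and calloc() stands in for rdma_zalloc_drv_obj().

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ib_cq. */
struct cq_model {
	int cqe;
};

/* Counts objects that were allocated and not yet freed. */
static int live_allocs;

/*
 * Validate the argument before allocating, so the -EINVAL path has
 * nothing to free. On failure NULL is returned and *err is set.
 */
static struct cq_model *cq_create_model(int cqe, int *err)
{
	struct cq_model *cq;

	if (cqe == 0) {
		*err = -EINVAL;
		return NULL;
	}

	cq = calloc(1, sizeof(*cq));
	if (!cq) {
		*err = -ENOMEM;
		return NULL;
	}

	live_allocs++;
	cq->cqe = cqe;
	*err = 0;
	return cq;
}

static void cq_destroy_model(struct cq_model *cq)
{
	free(cq);
	live_allocs--;
}
```

With the check placed after the allocation instead, the zero-cqe path
would return with `live_allocs` still incremented, which is the leak the
patch removes.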


^ permalink raw reply related

* [PATCH] RDMA/bng_re: return a timeout when firmware responses stall
From: Pengpeng Hou @ 2026-06-25  0:36 UTC (permalink / raw)
  To: Siva Reddy Kallam
  Cc: pengpeng, Jason Gunthorpe, Leon Romanovsky, linux-rdma,
	linux-kernel

__wait_for_resp() documents that it returns a non-zero error when a
firmware command does not complete, and bng_re_rcfw_send_message() already
marks the firmware as stalled when the helper returns -ENODEV.

However, the helper ignores wait_event_timeout() expiry.  If the response
slot remains in use after the timeout and after the polled CREQ service
attempt, the loop starts another full timeout period and can repeat
forever.

Return -ENODEV after a timed-out wait that still has no response.  The
existing caller then marks FIRMWARE_STALL_DETECTED and returns
-ETIMEDOUT to the command issuer.

Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
 drivers/infiniband/hw/bng_re/bng_fw.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/bng_re/bng_fw.c b/drivers/infiniband/hw/bng_re/bng_fw.c
index 50156c300..ab6a2d2e9 100644
--- a/drivers/infiniband/hw/bng_re/bng_fw.c
+++ b/drivers/infiniband/hw/bng_re/bng_fw.c
@@ -401,14 +401,15 @@ static int __wait_for_resp(struct bng_re_rcfw *rcfw, u16 cookie)
 {
 	struct bng_re_cmdq_ctx *cmdq;
 	struct bng_re_crsqe *crsqe;
+	unsigned long time_left;
 
 	cmdq = &rcfw->cmdq;
 	crsqe = &rcfw->crsqe_tbl[cookie];
 
 	do {
-		wait_event_timeout(cmdq->waitq,
-				   !crsqe->is_in_used,
-				   secs_to_jiffies(rcfw->max_timeout));
+		time_left = wait_event_timeout(cmdq->waitq,
+					       !crsqe->is_in_used,
+					       secs_to_jiffies(rcfw->max_timeout));
 
 		if (!crsqe->is_in_used)
 			return 0;
@@ -417,6 +418,9 @@ static int __wait_for_resp(struct bng_re_rcfw *rcfw, u16 cookie)
 
 		if (!crsqe->is_in_used)
 			return 0;
+
+		if (!time_left)
+			return -ENODEV;
 	} while (true);
 };
 
-- 
2.50.1 (Apple Git-155)
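
The control flow after the patch can be modelled in userspace as follows.
This is a sketch, not the driver code: `slot_model`, `wait_model()`,
`poll_model()` and `wait_for_resp_model()` are hypothetical, with
`wait_model()` standing in for wait_event_timeout() (returning 0 on
timeout) and `poll_model()` standing in for the polled CREQ service
attempt.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of one response slot plus knobs for the test cases. */
struct slot_model {
	bool in_use;		/* true until a response arrives */
	bool wait_completes;	/* knob: response arrives during the wait */
	bool poll_completes;	/* knob: response is found by polling */
	int waits;		/* number of full waits performed */
};

/* Stand-in for wait_event_timeout(): 0 means the timeout expired. */
static unsigned long wait_model(struct slot_model *s)
{
	s->waits++;
	if (s->wait_completes) {
		s->in_use = false;
		return 1;
	}
	return 0;
}

/* Stand-in for the polled completion-queue service attempt. */
static void poll_model(struct slot_model *s)
{
	if (s->poll_completes)
		s->in_use = false;
}

static int wait_for_resp_model(struct slot_model *s)
{
	unsigned long time_left;

	do {
		time_left = wait_model(s);
		if (!s->in_use)
			return 0;

		poll_model(s);
		if (!s->in_use)
			return 0;

		/* Timed out and polling found nothing: stop looping. */
		if (!time_left)
			return -ENODEV;
	} while (true);
}
```

Without the final `time_left` check, a slot that never completes would
keep the loop running one timeout period after another.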


^ permalink raw reply related

* RE: [EXTERNAL] Re: [PATCH net] net: mana: Sync page pool RX frags for CPU
From: Dexuan Cui @ 2026-06-24 22:50 UTC (permalink / raw)
  To: Simon Horman
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Long Li,
	andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, Konstantin Taranov,
	ernis@linux.microsoft.com, dipayanroy@linux.microsoft.com,
	kees@kernel.org, jacob.e.keller@intel.com,
	ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <20260619090514.GT827683@horms.kernel.org>

> From: Simon Horman <horms@kernel.org>
> Sent: Friday, June 19, 2026 2:05 AM
> > ...
> > Also validate the packet length reported in the RX CQE before using it as
> > a DMA sync length or passing it to skb processing. The CQE is supplied
> > by the device and should not be blindly trusted by Confidential VMs.
> 
> I think this last part warrants being split out into a separate patch.

Sorry for the late reply. I split v1 into two patches in v2, which I just posted:
https://lwn.net/ml/linux-kernel/20260624222605.1794719-1-decui@microsoft.com/
 
Thanks,
Dexuan

^ permalink raw reply

* [PATCH net v2 1/2] net: mana: Sync page pool RX frags for CPU
From: Dexuan Cui @ 2026-06-24 22:26 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
	jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
	linux-rdma
  Cc: stable
In-Reply-To: <20260624222605.1794719-1-decui@microsoft.com>

MANA allocates RX buffers from page pool fragments when frag_count is
greater than 1. In that case the buffers remain DMA mapped by page pool
and the RX completion path does not call dma_unmap_single(). As a result,
the implicit sync-for-CPU normally performed by dma_unmap_single() is
missing before the packet data is passed to the networking stack.

This breaks RX on configurations which require explicit DMA syncing, for
example when booted with swiotlb=force.

Fix this by recording the page pool page and DMA sync offset when the RX
buffer is allocated, and syncing the received packet range for CPU access
before handing the RX buffer to the stack.

Fixes: 730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers instead of full pages to improve memory efficiency.")
Cc: stable@vger.kernel.org
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---

Changes since v1:
    v1 is split into two patches in v2.
    Add Haiyang's Reviewed-by.

 drivers/net/ethernet/microsoft/mana/mana_en.c | 39 +++++++++++++++----
 include/net/mana/mana.h                       |  8 ++++
 2 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index c9b1df1ed109..1875bffd82b7 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2044,12 +2044,16 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
 }
 
 static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
-			     dma_addr_t *da, bool *from_pool)
+			     dma_addr_t *da, bool *from_pool,
+			     struct page **pp_page, u32 *dma_sync_offset)
 {
 	struct page *page;
 	u32 offset;
 	void *va;
+
 	*from_pool = false;
+	*pp_page = NULL;
+	*dma_sync_offset = 0;
 
 	/* Don't use fragments for jumbo frames or XDP where it's 1 fragment
 	 * per page.
@@ -2087,31 +2091,47 @@ static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
 	va  = page_to_virt(page) + offset;
 	*da = page_pool_get_dma_addr(page) + offset + rxq->headroom;
 	*from_pool = true;
+	*pp_page = page;
+	*dma_sync_offset = offset + rxq->headroom;
 
 	return va;
 }
 
 /* Allocate frag for rx buffer, and save the old buf */
 static void mana_refill_rx_oob(struct device *dev, struct mana_rxq *rxq,
-			       struct mana_recv_buf_oob *rxoob, void **old_buf,
-			       bool *old_fp)
+			       struct mana_recv_buf_oob *rxoob, u32 pktlen,
+			       void **old_buf, bool *old_fp)
 {
+	struct page *pp_page;
+	u32 dma_sync_offset;
 	bool from_pool;
 	dma_addr_t da;
 	void *va;
 
-	va = mana_get_rxfrag(rxq, dev, &da, &from_pool);
+	va = mana_get_rxfrag(rxq, dev, &da, &from_pool, &pp_page,
+			     &dma_sync_offset);
 	if (!va)
 		return;
-	if (!rxoob->from_pool || rxq->frag_count == 1)
+	if (!rxoob->from_pool || rxq->frag_count == 1) {
 		dma_unmap_single(dev, rxoob->sgl[0].address, rxq->datasize,
 				 DMA_FROM_DEVICE);
+	} else {
+		/* The page pool maps the whole page and only syncs for device
+		 * automatically (PP_FLAG_DMA_SYNC_DEV). Sync the received bytes
+		 * for the CPU before they are read: this is required if DMA
+		 * is incoherent or bounce buffers are used.
+		 */
+		page_pool_dma_sync_for_cpu(rxq->page_pool, rxoob->pp_page,
+					   rxoob->dma_sync_offset, pktlen);
+	}
 	*old_buf = rxoob->buf_va;
 	*old_fp = rxoob->from_pool;
 
 	rxoob->buf_va = va;
 	rxoob->sgl[0].address = da;
 	rxoob->from_pool = from_pool;
+	rxoob->pp_page = pp_page;
+	rxoob->dma_sync_offset = dma_sync_offset;
 }
 
 static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
@@ -2170,7 +2190,7 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 		rxbuf_oob = &rxq->rx_oobs[curr];
 		WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
 
-		mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
+		mana_refill_rx_oob(dev, rxq, rxbuf_oob, pktlen, &old_buf, &old_fp);
 
 		/* Unsuccessful refill will have old_buf == NULL.
 		 * In this case, mana_rx_skb() will drop the packet.
@@ -2566,6 +2586,8 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
 			    struct mana_rxq *rxq, struct device *dev)
 {
 	struct mana_port_context *mpc = netdev_priv(rxq->ndev);
+	struct page *pp_page = NULL;
+	u32 dma_sync_offset = 0;
 	bool from_pool = false;
 	dma_addr_t da;
 	void *va;
@@ -2573,13 +2595,16 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
 	if (mpc->rxbufs_pre)
 		va = mana_get_rxbuf_pre(rxq, &da);
 	else
-		va = mana_get_rxfrag(rxq, dev, &da, &from_pool);
+		va = mana_get_rxfrag(rxq, dev, &da, &from_pool, &pp_page,
+				     &dma_sync_offset);
 
 	if (!va)
 		return -ENOMEM;
 
 	rx_oob->buf_va = va;
 	rx_oob->from_pool = from_pool;
+	rx_oob->pp_page = pp_page;
+	rx_oob->dma_sync_offset = dma_sync_offset;
 
 	rx_oob->sgl[0].address = da;
 	rx_oob->sgl[0].size = rxq->datasize;
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 8f721cd4e4a7..4111b93169d2 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -305,6 +305,14 @@ struct mana_recv_buf_oob {
 
 	void *buf_va;
 	bool from_pool; /* allocated from a page pool */
+	/* head page of the page_pool fragment; valid only when
+	 * from_pool && frag_count > 1.
+	 */
+	struct page *pp_page;
+	/* Fragment offset plus rxq->headroom, passed to
+	 * page_pool_dma_sync_for_cpu().
+	 */
+	u32 dma_sync_offset;
 
 	/* SGL of the buffer going to be sent as part of the work request. */
 	u32 num_sge;
-- 
2.34.1
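
Why the missing sync matters with bounce buffering can be shown with a
deliberately simplified userspace model. Everything here is an assumption
made for illustration: `bounce_model`, `device_write_model()` and
`sync_for_cpu_model()` are hypothetical, and real swiotlb handling is
considerably more involved than two arrays and a memcpy().

```c
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 64

/* Hypothetical model: the device and the CPU see different memory. */
struct bounce_model {
	unsigned char bounce[BUF_SIZE];	/* what the device DMAs into */
	unsigned char cpu[BUF_SIZE];	/* what the driver reads */
};

/* The device writes a received packet into the bounce buffer only. */
static void device_write_model(struct bounce_model *b, size_t off,
			       const void *data, size_t len)
{
	memcpy(b->bounce + off, data, len);
}

/*
 * Sync-for-CPU step: copy just the received range, starting at the
 * recorded offset, so the CPU-visible buffer holds the packet bytes.
 */
static void sync_for_cpu_model(struct bounce_model *b, size_t off, size_t len)
{
	memcpy(b->cpu + off, b->bounce + off, len);
}
```

In the model, reading `cpu` before calling `sync_for_cpu_model()` returns
stale bytes, which corresponds to the broken RX path described in the
commit message; the offset and length play the role of
`dma_sync_offset` and `pktlen`.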


^ permalink raw reply related

* [PATCH net v2 2/2] net: mana: Validate the packet length reported by the NIC
From: Dexuan Cui @ 2026-06-24 22:26 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
	jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
	linux-rdma
  Cc: stable
In-Reply-To: <20260624222605.1794719-1-decui@microsoft.com>

Validate the packet length reported in the RX CQE before using it as a DMA
sync length or passing it to skb processing. The CQE is supplied by the
NIC device and should not be blindly trusted.

Cc: stable@vger.kernel.org
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---

Changes since v1:
    v1 is split into two patches in v2.
    Add Haiyang's Reviewed-by.

 drivers/net/ethernet/microsoft/mana/mana_en.c | 24 +++++++++++++++----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 1875bffd82b7..0b44c51ae6ec 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2190,12 +2190,26 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 		rxbuf_oob = &rxq->rx_oobs[curr];
 		WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
 
-		mana_refill_rx_oob(dev, rxq, rxbuf_oob, pktlen, &old_buf, &old_fp);
+		if (unlikely(pktlen > rxq->datasize)) {
+			/* Increase it even if mana_rx_skb() isn't called. */
+			rxq->rx_cq.work_done++;
 
-		/* Unsuccessful refill will have old_buf == NULL.
-		 * In this case, mana_rx_skb() will drop the packet.
-		 */
-		mana_rx_skb(old_buf, old_fp, oob, rxq, i);
+			++ndev->stats.rx_dropped;
+			netdev_warn_once(ndev,
+				"Dropped oversized RX packet: len=%u, datasize=%u\n",
+				pktlen, rxq->datasize);
+
+			/* Reuse the RX buffer since rxbuf_oob is unchanged. */
+		} else {
+
+			mana_refill_rx_oob(dev, rxq, rxbuf_oob, pktlen,
+					   &old_buf, &old_fp);
+
+			/* Unsuccessful refill will have old_buf == NULL.
+			 * In this case, mana_rx_skb() will drop the packet.
+			 */
+			mana_rx_skb(old_buf, old_fp, oob, rxq, i);
+		}
 
 		mana_move_wq_tail(rxq->gdma_rq,
 				  rxbuf_oob->wqe_inf.wqe_size_in_bu);
-- 
2.34.1
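
The length check itself reduces to a bounds test plus a drop counter. The
sketch below models that in userspace; `rxq_model` and
`rx_len_ok_model()` are hypothetical names, and the real code also bumps
the CQ work counter and logs a one-time warning, which the model leaves
out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduced RX queue: buffer size and a drop counter. */
struct rxq_model {
	uint32_t datasize;	/* size of each RX buffer */
	uint64_t rx_dropped;	/* packets dropped by the length check */
};

/*
 * Return true if the device-reported length fits the RX buffer.
 * Oversized lengths are counted as drops and must not be used as a
 * DMA sync length or passed on to packet processing.
 */
static bool rx_len_ok_model(struct rxq_model *rxq, uint32_t pktlen)
{
	if (pktlen > rxq->datasize) {
		rxq->rx_dropped++;
		return false;
	}

	return true;
}
```

A length equal to `datasize` is accepted; only values strictly larger
than the buffer are rejected, matching the `pktlen > rxq->datasize` test
in the diff.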


^ permalink raw reply related

* [PATCH net v2 0/2] Fix MANA RX with bounce buffering
From: Dexuan Cui @ 2026-06-24 22:26 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
	jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
	linux-rdma

With swiotlb=force, the MANA NIC fails to work properly due to commit
730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers instead
of full pages to improve memory efficiency.")

Dipayaan tried to fix this by avoiding page pool frags when bounce
buffering is in use [1][2]. However, that is not a clean solution: no
other NIC drivers need to explicitly check whether bounce buffering is
in use. It is also not good for throughput, since
dma_map_single()/dma_unmap_single() are then called for each incoming
packet.

In fact, page pool frags can still be used with the standard MTU of
1500: all we need is to add page_pool_dma_sync_for_cpu() before the CPU
reads the incoming packet, so I implemented that in v1 [3].

As Simon suggested [4], this version splits v1 into two patches:
Patch 1 adds page_pool_dma_sync_for_cpu().
Patch 2 validates the packet length reported by the NIC.

There is no functional difference between v1 and v2, so I am keeping
Haiyang's Reviewed-by tag in v2.

Please review. Thanks!

Note that, with jumbo MTU and XDP, page pool frags are not used, and
dma_map_single()/dma_unmap_single() are still called for each incoming
packet, causing poor throughput with swiotlb=force; see
mana_get_rxbuf_cfg() and mana_refill_rx_oob() -> mana_get_rxfrag().
The jumbo MTU/XDP issue will be addressed later since that needs more
consideration if we want to use page pool with PP_FLAG_DMA_MAP there:
e.g., for XDP, the received packet can be transmitted in place, i.e. the
same RX buffer can be used as a TX buffer:
mana_rx_skb() -> mana_xdp_tx() -> mana_start_xmit() -> mana_map_skb().

In mana_create_page_pool(), we may have to set pprm.dma_dir to
DMA_BIDIRECTIONAL if XDP is in use:
pprm.dma_dir = mana_xdp_get(mpc) ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;

In the case of XDP, the next issue is that mana_rx_skb() -> ... ->
mana_map_skb() appears to call dma_map_single() on an RX buffer allocated
from a page pool created with PP_FLAG_DMA_MAP, which seems incorrect.
Any thoughts?

[1] https://lore.kernel.org/all/ae91hyrLf4n23XE6@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/#r
[2] https://lore.kernel.org/all/ae9pxvJfkAZYfKMf@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/
[3] https://lore.kernel.org/all/20260618035029.249361-1-decui@microsoft.com/
[4] https://lore.kernel.org/all/20260619090514.GT827683@horms.kernel.org/

Dexuan Cui (2):
  net: mana: Sync page pool RX frags for CPU
  net: mana: Validate the packet length reported by the NIC

 drivers/net/ethernet/microsoft/mana/mana_en.c | 61 +++++++++++++++----
 include/net/mana/mana.h                       |  8 +++
 2 files changed, 58 insertions(+), 11 deletions(-)

-- 
2.34.1

^ permalink raw reply

* [PATCH net] net: ethtool: keep rtnl_lock for ops using ethtool_op_get_link()
From: Jakub Kicinski @ 2026-06-24 19:04 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, Jakub Kicinski,
	Breno Leitao, joshwash, hramamurthy, anthony.l.nguyen,
	przemyslaw.kitszel, saeedm, tariqt, mbloch, leon, alexanderduyck,
	kernel-team, kys, haiyangz, wei.liu, decui, longli, jordanrhee,
	jacob.e.keller, nktgrg, debarghyak, mohsin.bashr, ernis, sdf, gal,
	linux-rdma, linux-hyperv

Breno reports the following splats on mlx5:

  RTNL: assertion failed at net/core/dev.c (2241)
  WARNING: net/core/dev.c:2241 at netif_state_change+0xed/0x130, CPU#5: ethtool/1335
  RIP: 0010:netif_state_change+0xf9/0x130
  Call Trace:
    <TASK>
     __linkwatch_sync_dev+0xea/0x120
     ethtool_op_get_link+0xe/0x20
     __ethtool_get_link+0x26/0x40
     linkstate_prepare_data+0x51/0x200
     ethnl_default_doit+0x213/0x470
     genl_family_rcv_msg_doit+0xdd/0x110

Looks like I missed ethtool_op_get_link() trying to sync linkwatch,
which needs rtnl_lock. Not all drivers do this - bnxt doesn't,
it just returns the link state, so add an opt-in bit.

Reported-by: Breno Leitao <leitao@debian.org>
Fixes: 45079e00133e ("net: ethtool: optionally skip rtnl_lock on Netlink path for GET ops")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: joshwash@google.com
CC: hramamurthy@google.com
CC: anthony.l.nguyen@intel.com
CC: przemyslaw.kitszel@intel.com
CC: saeedm@nvidia.com
CC: tariqt@nvidia.com
CC: mbloch@nvidia.com
CC: leon@kernel.org
CC: alexanderduyck@fb.com
CC: kernel-team@meta.com
CC: kys@microsoft.com
CC: haiyangz@microsoft.com
CC: wei.liu@kernel.org
CC: decui@microsoft.com
CC: longli@microsoft.com
CC: jordanrhee@google.com
CC: jacob.e.keller@intel.com
CC: nktgrg@google.com
CC: debarghyak@google.com
CC: leitao@debian.org
CC: mohsin.bashr@gmail.com
CC: ernis@linux.microsoft.com
CC: sdf@fomichev.me
CC: gal@nvidia.com
CC: linux-rdma@vger.kernel.org
CC: linux-hyperv@vger.kernel.org
---
 include/linux/ethtool.h                                 | 2 ++
 net/ethtool/common.h                                    | 4 ++++
 drivers/net/ethernet/google/gve/gve_ethtool.c           | 3 ++-
 drivers/net/ethernet/intel/iavf/iavf_ethtool.c          | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c    | 3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c        | 3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c | 4 +++-
 drivers/net/ethernet/meta/fbnic/fbnic_ethtool.c         | 3 ++-
 drivers/net/ethernet/microsoft/mana/mana_ethtool.c      | 3 ++-
 9 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 1b834e2a522e..5d491a98265e 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -942,6 +942,7 @@ struct kernel_ethtool_ts_info {
 #define ETHTOOL_OP_NEEDS_RTNL_GPAUSEPARAM	BIT(5)
 #define ETHTOOL_OP_NEEDS_RTNL_SPAUSEPARAM	BIT(6)
 #define ETHTOOL_OP_NEEDS_RTNL_RSS		BIT(7)
+#define ETHTOOL_OP_NEEDS_RTNL_GLINK		BIT(8)
 
 /**
  * struct ethtool_ops - optional netdev operations
@@ -978,6 +979,7 @@ struct kernel_ethtool_ts_info {
  *	 - phylink helpers (note that phydev is currently unsupported!)
  *	 - netdev_update_features()
  *	 - netif_set_real_num_tx_queues()
+ *	 - ethtool_op_get_link() (syncs link watch under rtnl_lock)
  *
  * @get_drvinfo: Report driver/device information. Modern drivers no
  *	longer have to implement this callback. Most fields are
diff --git a/net/ethtool/common.h b/net/ethtool/common.h
index 2b3847f00801..4e5356e26f40 100644
--- a/net/ethtool/common.h
+++ b/net/ethtool/common.h
@@ -113,6 +113,8 @@ ethtool_nl_msg_needs_rtnl(const struct net_device *dev, u8 cmd)
 		return ops->op_needs_rtnl & ETHTOOL_OP_NEEDS_RTNL_SPAUSEPARAM;
 	case ETHTOOL_MSG_RSS_SET:
 		return ops->op_needs_rtnl & ETHTOOL_OP_NEEDS_RTNL_RSS;
+	case ETHTOOL_MSG_LINKSTATE_GET:
+		return ops->op_needs_rtnl & ETHTOOL_OP_NEEDS_RTNL_GLINK;
 	case ETHTOOL_MSG_TSCONFIG_GET:
 	case ETHTOOL_MSG_TSCONFIG_SET:
 		/* tsconfig calls ndos (ndo_hwtstamp_set/get), not ethtool ops.
@@ -159,6 +161,8 @@ ethtool_ioctl_needs_rtnl(const struct net_device *dev, u32 ethcmd)
 	case ETHTOOL_SRXFH:
 	case ETHTOOL_SRXFHINDIR:
 		return ops->op_needs_rtnl & ETHTOOL_OP_NEEDS_RTNL_RSS;
+	case ETHTOOL_GLINK:
+		return ops->op_needs_rtnl & ETHTOOL_OP_NEEDS_RTNL_GLINK;
 	}
 	return false;
 }
diff --git a/drivers/net/ethernet/google/gve/gve_ethtool.c b/drivers/net/ethernet/google/gve/gve_ethtool.c
index 7cc22916852f..8199738ba979 100644
--- a/drivers/net/ethernet/google/gve/gve_ethtool.c
+++ b/drivers/net/ethernet/google/gve/gve_ethtool.c
@@ -984,7 +984,8 @@ const struct ethtool_ops gve_ethtool_ops = {
 	.supported_ring_params = ETHTOOL_RING_USE_TCP_DATA_SPLIT |
 				 ETHTOOL_RING_USE_RX_BUF_LEN,
 	.op_needs_rtnl = ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
-			 ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM,
+			 ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
+			 ETHTOOL_OP_NEEDS_RTNL_GLINK,
 	.get_drvinfo = gve_get_drvinfo,
 	.get_strings = gve_get_strings,
 	.get_sset_count = gve_get_sset_count,
diff --git a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
index a615d599b88e..e7cf12eaa268 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
@@ -1855,6 +1855,7 @@ static const struct ethtool_ops iavf_ethtool_ops = {
 	.supported_coalesce_params = ETHTOOL_COALESCE_USECS |
 				     ETHTOOL_COALESCE_USE_ADAPTIVE,
 	.supported_input_xfrm	= RXH_XFRM_SYM_XOR,
+	.op_needs_rtnl		= ETHTOOL_OP_NEEDS_RTNL_GLINK,
 	.get_drvinfo		= iavf_get_drvinfo,
 	.get_link		= ethtool_op_get_link,
 	.get_ringparam		= iavf_get_ringparam,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 2f5b626ba33f..112926d07634 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -2721,7 +2721,8 @@ const struct ethtool_ops mlx5e_ethtool_ops = {
 	.rxfh_max_num_contexts	= MLX5E_MAX_NUM_RSS,
 	.op_needs_rtnl		= ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
 				  ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
-				  ETHTOOL_OP_NEEDS_RTNL_SPFLAGS,
+				  ETHTOOL_OP_NEEDS_RTNL_SPFLAGS |
+				  ETHTOOL_OP_NEEDS_RTNL_GLINK,
 	.supported_coalesce_params = ETHTOOL_COALESCE_USECS |
 				     ETHTOOL_COALESCE_MAX_FRAMES |
 				     ETHTOOL_COALESCE_USE_ADAPTIVE |
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 1a8a19f980d3..c8b76d301c92 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -419,7 +419,8 @@ static const struct ethtool_ops mlx5e_rep_ethtool_ops = {
 				     ETHTOOL_COALESCE_MAX_FRAMES |
 				     ETHTOOL_COALESCE_USE_ADAPTIVE,
 	.op_needs_rtnl	   = ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
-			     ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM,
+			     ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
+			     ETHTOOL_OP_NEEDS_RTNL_GLINK,
 	.get_drvinfo	   = mlx5e_rep_get_drvinfo,
 	.get_link	   = ethtool_op_get_link,
 	.get_strings       = mlx5e_rep_get_strings,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c
index 9b3b32408c64..01ddc3def9ac 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c
@@ -286,7 +286,8 @@ const struct ethtool_ops mlx5i_ethtool_ops = {
 				     ETHTOOL_COALESCE_MAX_FRAMES |
 				     ETHTOOL_COALESCE_USE_ADAPTIVE,
 	.op_needs_rtnl	    = ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
-			      ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM,
+			      ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
+			      ETHTOOL_OP_NEEDS_RTNL_GLINK,
 	.get_drvinfo        = mlx5i_get_drvinfo,
 	.get_strings        = mlx5i_get_strings,
 	.get_sset_count     = mlx5i_get_sset_count,
@@ -309,6 +310,7 @@ const struct ethtool_ops mlx5i_ethtool_ops = {
 };
 
 const struct ethtool_ops mlx5i_pkey_ethtool_ops = {
+	.op_needs_rtnl	    = ETHTOOL_OP_NEEDS_RTNL_GLINK,
 	.get_drvinfo        = mlx5i_get_drvinfo,
 	.get_link           = ethtool_op_get_link,
 	.get_ts_info        = mlx5i_get_ts_info,
diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_ethtool.c b/drivers/net/ethernet/meta/fbnic/fbnic_ethtool.c
index cb34fc166ef9..0e47088ec44b 100644
--- a/drivers/net/ethernet/meta/fbnic/fbnic_ethtool.c
+++ b/drivers/net/ethernet/meta/fbnic/fbnic_ethtool.c
@@ -2024,7 +2024,8 @@ static const struct ethtool_ops fbnic_ethtool_ops = {
 					  ETHTOOL_OP_NEEDS_RTNL_GPAUSEPARAM |
 					  ETHTOOL_OP_NEEDS_RTNL_SPAUSEPARAM |
 					  ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
-					  ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM,
+					  ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
+					  ETHTOOL_OP_NEEDS_RTNL_GLINK,
 	.get_drvinfo			= fbnic_get_drvinfo,
 	.get_regs_len			= fbnic_get_regs_len,
 	.get_regs			= fbnic_get_regs,
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 94e658d07a27..881df597d7f9 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -597,7 +597,8 @@ static int mana_get_link_ksettings(struct net_device *ndev,
 const struct ethtool_ops mana_ethtool_ops = {
 	.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
 	.op_needs_rtnl		= ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
-				  ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM,
+				  ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
+				  ETHTOOL_OP_NEEDS_RTNL_GLINK,
 	.get_ethtool_stats	= mana_get_ethtool_stats,
 	.get_sset_count		= mana_get_sset_count,
 	.get_strings		= mana_get_strings,
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH rdma v1 1/2] RDMA/zrdma: Add basic framework for ZTE Dinghai Ethernet Protocol Driver for RDMA
From: Julian Braha @ 2026-06-24 17:41 UTC (permalink / raw)
  To: zhang.yanze, jgg, leon
  Cc: linux-kernel, linux-rdma, wei.quan, han.junyang, ran.ming,
	han.chengfei
In-Reply-To: <20260624164852120pLCX6txujHU8n4GMakGbe@zte.com.cn>

Hi Yanze,

On 6/24/26 09:48, zhang.yanze@zte.com.cn wrote:
> +++ b/drivers/infiniband/hw/zrdma/Kconfig
> @@ -0,0 +1,10 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config INFINIBAND_ZRDMA
> +   tristate "ZTE Ethernet Protocol Driver for RDMA"
> +   depends on INFINIBAND
> +   help
> +       Say Y or M here to enable support for the ZTE DingHai (ZXDH) Ethernet
> +       Protocol Driver for RDMA. This driver provides RDMA over Converged
> +       Ethernet (RoCE) functionality for ZTE DingHai network adapters.
> +       If you choose to build this driver as a module, it will be built as
> +       a module named zrdma.

Your INFINIBAND_ZRDMA config option has a duplicate dependency on
INFINIBAND. There's already an 'if INFINIBAND .. endif' block that wraps
the Kconfig file import in drivers/infiniband/Kconfig, so the explicit
'depends on INFINIBAND' is redundant.

- Julian Braha

^ permalink raw reply

* [PATCH for-next v2 2/2] RDMA/bnxt_re: Add uverbs object handle path for CQ/SRQ toggle page
From: Selvin Xavier @ 2026-06-24 22:39 UTC (permalink / raw)
  To: leon, jgg
  Cc: linux-rdma, andrew.gospodarek, kalesh-anakkur.purayil,
	sriharsha.basavapatna, Selvin Xavier, Jason Gunthorpe
In-Reply-To: <20260624223927.521882-1-selvin.xavier@broadcom.com>

The current GET_TOGGLE_MEM ioctl requires the caller to supply
a type enum and a raw hardware queue ID (RES_ID). The kernel
looks up the CQ or SRQ by that ID without verifying that the
caller owns the resource.

Add a new, preferred code path that accepts standard uverbs
object handles (BNXT_RE_TOGGLE_MEM_CQ_HANDLE /
BNXT_RE_TOGGLE_MEM_SRQ_HANDLE) instead.

Only newer rdma-core versions support this path. Capability is
negotiated during context creation using the req mask
(BNXT_RE_COMP_MASK_REQ_UCNTX_TOGGLE_MEM_UOBJ_SUPPORT) and resp
mask (BNXT_RE_UCNTX_CMASK_TOGGLE_MEM_UOBJ_SUPPORT).
The existing TYPE + RES_ID path is retained for backward
compatibility with older rdma-core.
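
The negotiation described above can be modelled in plain C. This is only a
userspace sketch, not the driver code: the two mask values are copied from
the bnxt_re-abi.h hunk in this patch, while the shortened macro names, the
toggle_path enum, and both helper functions are hypothetical names used for
illustration. How rdma-core actually chooses between the paths is an
assumption here.

```c
#include <stdint.h>

/* Values as added to include/uapi/rdma/bnxt_re-abi.h by this patch;
 * the short macro names are illustrative only. */
#define REQ_TOGGLE_MEM_UOBJ  0x20ULL      /* BNXT_RE_COMP_MASK_REQ_UCNTX_TOGGLE_MEM_UOBJ_SUPPORT */
#define RESP_TOGGLE_MEM_UOBJ 0x400000ULL  /* BNXT_RE_UCNTX_CMASK_TOGGLE_MEM_UOBJ_SUPPORT */

/* Hypothetical: which GET_TOGGLE_MEM attributes a library would send. */
enum toggle_path {
	TOGGLE_PATH_LEGACY,	/* TYPE + RES_ID */
	TOGGLE_PATH_UOBJ,	/* CQ/SRQ uverbs object handle */
};

/* Kernel side of context creation: advertise the capability in the
 * response mask only when the request mask asked for it. */
static uint64_t kernel_negotiate(uint64_t req_comp_mask)
{
	uint64_t resp_comp_mask = 0;

	if (req_comp_mask & REQ_TOGGLE_MEM_UOBJ)
		resp_comp_mask |= RESP_TOGGLE_MEM_UOBJ;
	return resp_comp_mask;
}

/* Library side (assumed behaviour): use the handle path only when the
 * kernel confirmed it, otherwise fall back to the legacy attributes. */
static enum toggle_path choose_path(uint64_t resp_comp_mask)
{
	return (resp_comp_mask & RESP_TOGGLE_MEM_UOBJ) ?
		TOGGLE_PATH_UOBJ : TOGGLE_PATH_LEGACY;
}
```

With this model, a library that never sets the request bit gets a response
without the capability bit and stays on the TYPE + RES_ID path, which is
the backward-compatible case the commit message describes.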

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
---
 drivers/infiniband/hw/bnxt_re/ib_verbs.c |  7 +-
 drivers/infiniband/hw/bnxt_re/ib_verbs.h |  1 +
 drivers/infiniband/hw/bnxt_re/uapi.c     | 99 +++++++++++++++---------
 include/uapi/rdma/bnxt_re-abi.h          |  4 +
 4 files changed, 74 insertions(+), 37 deletions(-)

diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index d1eebd7b56f4..423c8f3184bb 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -4846,7 +4846,8 @@ int bnxt_re_alloc_ucontext(struct ib_ucontext *ctx, struct ib_udata *udata)
 		rc = ib_copy_validate_udata_in_cm(
 			udata, ureq, comp_mask,
 			BNXT_RE_COMP_MASK_REQ_UCNTX_POW2_SUPPORT |
-				BNXT_RE_COMP_MASK_REQ_UCNTX_VAR_WQE_SUPPORT);
+				BNXT_RE_COMP_MASK_REQ_UCNTX_VAR_WQE_SUPPORT |
+				BNXT_RE_COMP_MASK_REQ_UCNTX_TOGGLE_MEM_UOBJ_SUPPORT);
 		if (rc)
 			goto cfail;
 		if (ureq.comp_mask & BNXT_RE_COMP_MASK_REQ_UCNTX_POW2_SUPPORT) {
@@ -4859,6 +4860,10 @@ int bnxt_re_alloc_ucontext(struct ib_ucontext *ctx, struct ib_udata *udata)
 			if (resp.mode == BNXT_QPLIB_WQE_MODE_VARIABLE)
 				uctx->cmask |= BNXT_RE_UCNTX_CAP_VAR_WQE_ENABLED;
 		}
+		if (ureq.comp_mask & BNXT_RE_COMP_MASK_REQ_UCNTX_TOGGLE_MEM_UOBJ_SUPPORT) {
+			uctx->cmask |= BNXT_RE_UCNTX_CAP_TOGGLE_MEM_UOBJ;
+			resp.comp_mask |= BNXT_RE_UCNTX_CMASK_TOGGLE_MEM_UOBJ_SUPPORT;
+		}
 	}
 
 	xa_init(&uctx->cq_xa);
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.h b/drivers/infiniband/hw/bnxt_re/ib_verbs.h
index 76f407cd3435..85e594a25448 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.h
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.h
@@ -193,6 +193,7 @@ static inline u16 bnxt_re_get_rwqe_size(int nsge)
 enum {
 	BNXT_RE_UCNTX_CAP_POW2_DISABLED = 0x1ULL,
 	BNXT_RE_UCNTX_CAP_VAR_WQE_ENABLED = 0x2ULL,
+	BNXT_RE_UCNTX_CAP_TOGGLE_MEM_UOBJ = 0x4ULL,
 };
 
 static inline u32 bnxt_re_init_depth(u32 ent, u32 max,
diff --git a/drivers/infiniband/hw/bnxt_re/uapi.c b/drivers/infiniband/hw/bnxt_re/uapi.c
index 7e2acd0933f7..45dcaa49d6a8 100644
--- a/drivers/infiniband/hw/bnxt_re/uapi.c
+++ b/drivers/infiniband/hw/bnxt_re/uapi.c
@@ -216,57 +216,76 @@ static int UVERBS_HANDLER(BNXT_RE_METHOD_GET_TOGGLE_MEM)(struct uverbs_attr_bund
 {
 	struct ib_uobject *uobj = uverbs_attr_get_uobject(attrs, BNXT_RE_TOGGLE_MEM_HANDLE);
 	enum bnxt_re_mmap_flag mmap_flag = BNXT_RE_MMAP_TOGGLE_PAGE;
-	enum bnxt_re_get_toggle_mem_type res_type;
 	struct bnxt_re_user_mmap_entry *entry;
 	struct bnxt_re_ucontext *uctx;
 	struct ib_ucontext *ib_uctx;
+	struct ib_uobject *res_uobj;
 	u32 length = PAGE_SIZE;
 	u64 mem_offset;
 	u32 offset = 0;
 	u64 addr = 0;
-	u32 res_id;
 	int err;
 
 	ib_uctx = ib_uverbs_get_ucontext(attrs);
 	if (IS_ERR(ib_uctx))
 		return PTR_ERR(ib_uctx);
 
-	err = uverbs_get_const(&res_type, attrs, BNXT_RE_TOGGLE_MEM_TYPE);
-	if (err)
-		return err;
-
 	uctx = container_of(ib_uctx, struct bnxt_re_ucontext, ib_uctx);
-	err = uverbs_copy_from(&res_id, attrs, BNXT_RE_TOGGLE_MEM_RES_ID);
-	if (err)
-		return err;
-
-	switch (res_type) {
-	case BNXT_RE_CQ_TOGGLE_MEM:
-		struct bnxt_re_cq *cq;
+	res_uobj = uverbs_attr_get_uobject(attrs, BNXT_RE_TOGGLE_MEM_CQ_HANDLE);
+	if (!IS_ERR(res_uobj)) {
+		struct bnxt_re_cq *cq =
+			container_of((struct ib_cq *)res_uobj->object,
+				     struct bnxt_re_cq, ib_cq);
+
+		addr = (u64)cq->uctx_cq_page;
+	} else {
+		res_uobj = uverbs_attr_get_uobject(attrs, BNXT_RE_TOGGLE_MEM_SRQ_HANDLE);
+		if (!IS_ERR(res_uobj)) {
+			struct bnxt_re_srq *srq =
+				container_of((struct ib_srq *)res_uobj->object,
+					     struct bnxt_re_srq, ib_srq);
 
-		xa_lock(&uctx->cq_xa);
-		cq = xa_load(&uctx->cq_xa, res_id);
-		if (cq)
-			addr = (u64)cq->uctx_cq_page;
-		xa_unlock(&uctx->cq_xa);
-		if (!addr)
-			return -EINVAL;
-		break;
-	case BNXT_RE_SRQ_TOGGLE_MEM:
-		struct bnxt_re_srq *srq;
-
-		xa_lock(&uctx->srq_xa);
-		srq = xa_load(&uctx->srq_xa, res_id);
-		if (srq)
 			addr = (u64)srq->uctx_srq_page;
-		xa_unlock(&uctx->srq_xa);
-		if (!addr)
-			return -EINVAL;
-		break;
-	default:
-		return -EOPNOTSUPP;
+		} else {
+			/*
+			 * Legacy path: old libbnxt_re sends TYPE + RES_ID.
+			 * Look up the CQ or SRQ in the per-context XArray
+			 */
+			enum bnxt_re_get_toggle_mem_type res_type;
+			u32 res_id;
+
+			err = uverbs_get_const(&res_type, attrs,
+					       BNXT_RE_TOGGLE_MEM_TYPE);
+			if (err)
+				return err;
+			err = uverbs_copy_from(&res_id, attrs,
+					       BNXT_RE_TOGGLE_MEM_RES_ID);
+			if (err)
+				return err;
+
+			if (res_type == BNXT_RE_CQ_TOGGLE_MEM) {
+				struct bnxt_re_cq *cq;
+
+				xa_lock(&uctx->cq_xa);
+				cq = xa_load(&uctx->cq_xa, res_id);
+				if (cq)
+					addr = (u64)cq->uctx_cq_page;
+				xa_unlock(&uctx->cq_xa);
+			} else if (res_type == BNXT_RE_SRQ_TOGGLE_MEM) {
+				struct bnxt_re_srq *srq;
+
+				xa_lock(&uctx->srq_xa);
+				srq = xa_load(&uctx->srq_xa, res_id);
+				if (srq)
+					addr = (u64)srq->uctx_srq_page;
+				xa_unlock(&uctx->srq_xa);
+			}
+		}
 	}
 
+	if (!addr)
+		return -EOPNOTSUPP;
+
 	entry = bnxt_re_mmap_entry_insert(uctx, addr, mmap_flag, &mem_offset);
 	if (!entry)
 		return -ENOMEM;
@@ -308,10 +327,10 @@ DECLARE_UVERBS_NAMED_METHOD(BNXT_RE_METHOD_GET_TOGGLE_MEM,
 					    UA_MANDATORY),
 			    UVERBS_ATTR_CONST_IN(BNXT_RE_TOGGLE_MEM_TYPE,
 						 enum bnxt_re_get_toggle_mem_type,
-						 UA_MANDATORY),
+						 UA_OPTIONAL),
 			    UVERBS_ATTR_PTR_IN(BNXT_RE_TOGGLE_MEM_RES_ID,
 					       UVERBS_ATTR_TYPE(u32),
-					       UA_MANDATORY),
+					       UA_OPTIONAL),
 			    UVERBS_ATTR_PTR_OUT(BNXT_RE_TOGGLE_MEM_MMAP_PAGE,
 						UVERBS_ATTR_TYPE(u64),
 						UA_MANDATORY),
@@ -320,7 +339,15 @@ DECLARE_UVERBS_NAMED_METHOD(BNXT_RE_METHOD_GET_TOGGLE_MEM,
 						UA_MANDATORY),
 			    UVERBS_ATTR_PTR_OUT(BNXT_RE_TOGGLE_MEM_MMAP_LENGTH,
 						UVERBS_ATTR_TYPE(u32),
-						UA_MANDATORY));
+						UA_MANDATORY),
+			    UVERBS_ATTR_IDR(BNXT_RE_TOGGLE_MEM_CQ_HANDLE,
+					    UVERBS_OBJECT_CQ,
+					    UVERBS_ACCESS_READ,
+					    UA_OPTIONAL),
+			    UVERBS_ATTR_IDR(BNXT_RE_TOGGLE_MEM_SRQ_HANDLE,
+					    UVERBS_OBJECT_SRQ,
+					    UVERBS_ACCESS_READ,
+					    UA_OPTIONAL));
 
 DECLARE_UVERBS_NAMED_METHOD_DESTROY(BNXT_RE_METHOD_RELEASE_TOGGLE_MEM,
 				    UVERBS_ATTR_IDR(BNXT_RE_RELEASE_TOGGLE_MEM_HANDLE,
diff --git a/include/uapi/rdma/bnxt_re-abi.h b/include/uapi/rdma/bnxt_re-abi.h
index a4599d7b736a..a6cfd68ed8f5 100644
--- a/include/uapi/rdma/bnxt_re-abi.h
+++ b/include/uapi/rdma/bnxt_re-abi.h
@@ -57,6 +57,7 @@ enum {
 	BNXT_RE_UCNTX_CMASK_POW2_DISABLED = 0x10ULL,
 	BNXT_RE_UCNTX_CMASK_MSN_TABLE_ENABLED = 0x40,
 	BNXT_RE_UCNTX_CMASK_QP_RATE_LIMIT_ENABLED = 0x80ULL,
+	BNXT_RE_UCNTX_CMASK_TOGGLE_MEM_UOBJ_SUPPORT = 0x400000ULL,
 };
 
 enum bnxt_re_wqe_mode {
@@ -68,6 +69,7 @@ enum bnxt_re_wqe_mode {
 enum {
 	BNXT_RE_COMP_MASK_REQ_UCNTX_POW2_SUPPORT = 0x01,
 	BNXT_RE_COMP_MASK_REQ_UCNTX_VAR_WQE_SUPPORT = 0x02,
+	BNXT_RE_COMP_MASK_REQ_UCNTX_TOGGLE_MEM_UOBJ_SUPPORT = 0x20,
 };
 
 struct bnxt_re_uctx_req {
@@ -218,6 +220,8 @@ enum bnxt_re_var_toggle_mem_attrs {
 	BNXT_RE_TOGGLE_MEM_MMAP_PAGE,
 	BNXT_RE_TOGGLE_MEM_MMAP_OFFSET,
 	BNXT_RE_TOGGLE_MEM_MMAP_LENGTH,
+	BNXT_RE_TOGGLE_MEM_CQ_HANDLE,
+	BNXT_RE_TOGGLE_MEM_SRQ_HANDLE,
 };
 
 enum bnxt_re_toggle_mem_attrs {
-- 
2.39.3


^ permalink raw reply related

* [PATCH for-next v2 1/2] RDMA/bnxt_re: Replace per-device hash tables with per-context XArray
From: Selvin Xavier @ 2026-06-24 22:39 UTC (permalink / raw)
  To: leon, jgg
  Cc: linux-rdma, andrew.gospodarek, kalesh-anakkur.purayil,
	sriharsha.basavapatna, Selvin Xavier
In-Reply-To: <20260624223927.521882-1-selvin.xavier@broadcom.com>


The CQ and SRQ hash tables (cq_hash, srq_hash) on struct bnxt_re_dev
were used exclusively to look up a toggle-page pointer from a
user-space-supplied hardware queue ID in the GET_TOGGLE_MEM
ioctl handler. This approach has a couple of problems. First,
because the tables are per-device, any user can look up another
user's CQ or SRQ by guessing the hardware queue ID. Second,
concurrent add and remove operations on the hash table are not
protected by any lock, leaving a race window.

The correct fix is to retrieve the CQ and SRQ objects via the uverbs
object handle, which gives built-in ownership verification and reference
pinning for the duration of the ioctl. That is added in the next patch of
this series.

To maintain backward compatibility with older rdma-core versions that
do not send a uverbs object handle, the driver must continue to support
the existing TYPE + RES_ID lookup path. This patch replaces the per-device
hash tables with per-ucontext XArrays (cq_xa and srq_xa on struct
bnxt_re_ucontext), which narrows the lookup scope to the calling context,
eliminating the cross-user visibility. It also adds XArray locking for
synchronization.
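
The effect of narrowing the lookup scope can be shown with a small userspace
model. This is a sketch under stated assumptions, not kernel code: a
fixed-size pointer array stands in for the per-context XArray, locking is
left out, and every name (toy_cq, toy_ctx, toy_store, toy_load, MAX_IDS) is
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_IDS 8	/* hypothetical, tiny ID space for the model */

struct toy_cq {
	void *toggle_page;
};

/* One table per user context, standing in for cq_xa on the ucontext. */
struct toy_ctx {
	struct toy_cq *cq_by_id[MAX_IDS];
};

/* Record a CQ under its hardware queue ID in the owning context. */
static int toy_store(struct toy_ctx *ctx, uint32_t id, struct toy_cq *cq)
{
	if (id >= MAX_IDS)
		return -1;
	ctx->cq_by_id[id] = cq;
	return 0;
}

/* Look up an ID in the caller's own context only. An ID that belongs
 * to a different context simply is not present here. */
static struct toy_cq *toy_load(const struct toy_ctx *ctx, uint32_t id)
{
	return id < MAX_IDS ? ctx->cq_by_id[id] : NULL;
}
```

A CQ stored in one context is found only through that context; guessing the
same ID from another context returns NULL, which is the cross-user
visibility the per-device hash tables could not prevent.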

The GET_TOGGLE_MEM ioctl handler is updated to call xa_load()
in place of the now-removed bnxt_re_search_for_cq()/
bnxt_re_search_for_srq() helpers. No ABI changes are required.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
---
 drivers/infiniband/hw/bnxt_re/bnxt_re.h  |  6 --
 drivers/infiniband/hw/bnxt_re/ib_verbs.c | 84 ++++++++++++++++++------
 drivers/infiniband/hw/bnxt_re/ib_verbs.h |  4 +-
 drivers/infiniband/hw/bnxt_re/main.c     |  5 --
 drivers/infiniband/hw/bnxt_re/uapi.c     | 55 ++++------------
 5 files changed, 80 insertions(+), 74 deletions(-)

diff --git a/drivers/infiniband/hw/bnxt_re/bnxt_re.h b/drivers/infiniband/hw/bnxt_re/bnxt_re.h
index 3a7ce4729fcf..a43e678151d3 100644
--- a/drivers/infiniband/hw/bnxt_re/bnxt_re.h
+++ b/drivers/infiniband/hw/bnxt_re/bnxt_re.h
@@ -41,7 +41,6 @@
 #define __BNXT_RE_H__
 #include <rdma/uverbs_ioctl.h>
 #include "hw_counters.h"
-#include <linux/hashtable.h>
 #define ROCE_DRV_MODULE_NAME		"bnxt_re"
 
 #define BNXT_RE_DESC	"Broadcom NetXtreme-C/E RoCE Driver"
@@ -158,9 +157,6 @@ struct bnxt_re_nq_record {
 	struct mutex		load_lock;
 };
 
-#define MAX_CQ_HASH_BITS		(16)
-#define MAX_SRQ_HASH_BITS		(16)
-
 static inline bool bnxt_re_chip_gen_p7(u16 chip_num)
 {
 	return (chip_num == CHIP_NUM_58818 ||
@@ -215,8 +211,6 @@ struct bnxt_re_dev {
 	struct bnxt_re_pacing pacing;
 	struct work_struct dbq_fifo_check_work;
 	struct delayed_work dbq_pacing_work;
-	DECLARE_HASHTABLE(cq_hash, MAX_CQ_HASH_BITS);
-	DECLARE_HASHTABLE(srq_hash, MAX_SRQ_HASH_BITS);
 	struct dentry			*dbg_root;
 	struct dentry			*qp_debugfs;
 	unsigned long			event_bitmap;
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index 565762529007..d1eebd7b56f4 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -38,6 +38,7 @@
 
 #include <linux/interrupt.h>
 #include <linux/types.h>
+#include <linux/xarray.h>
 #include <linux/pci.h>
 #include <linux/netdevice.h>
 #include <linux/if_ether.h>
@@ -51,8 +52,6 @@
 #include <rdma/ib_cache.h>
 #include <rdma/ib_pma.h>
 #include <rdma/uverbs_ioctl.h>
-#include <linux/hashtable.h>
-
 #include "roce_hsi.h"
 #include "qplib_res.h"
 #include "qplib_sp.h"
@@ -2152,11 +2151,19 @@ int bnxt_re_destroy_srq(struct ib_srq *ib_srq, struct ib_udata *udata)
 	if (ret)
 		return ret;
 
-	if (rdev->chip_ctx->modes.toggle_bits & BNXT_QPLIB_SRQ_TOGGLE_BIT)
-		hash_del(&srq->hash_entry);
 	bnxt_qplib_destroy_srq(&rdev->qplib_res, qplib_srq);
-	if (rdev->chip_ctx->modes.toggle_bits & BNXT_QPLIB_SRQ_TOGGLE_BIT)
-		free_page((unsigned long)srq->uctx_srq_page);
+	if (rdev->chip_ctx->modes.toggle_bits & BNXT_QPLIB_SRQ_TOGGLE_BIT) {
+		struct bnxt_re_ucontext *uctx =
+			rdma_udata_to_drv_context(udata, struct bnxt_re_ucontext, ib_uctx);
+
+		if (uctx) {
+			/* similar to cq, use __xa_erase() with the lock already held */
+			xa_lock(&uctx->srq_xa);
+			__xa_erase(&uctx->srq_xa, srq->qplib_srq.id);
+			xa_unlock(&uctx->srq_xa);
+			free_page((unsigned long)srq->uctx_srq_page);
+		}
+	}
 	ib_umem_release(srq->umem);
 	atomic_dec(&rdev->stats.res.srq_count);
 	return ib_respond_empty_udata(udata);
@@ -2263,20 +2270,21 @@ int bnxt_re_create_srq(struct ib_srq *ib_srq,
 
 		resp.srqid = srq->qplib_srq.id;
 		if (rdev->chip_ctx->modes.toggle_bits & BNXT_QPLIB_SRQ_TOGGLE_BIT) {
-			hash_add(rdev->srq_hash, &srq->hash_entry, srq->qplib_srq.id);
 			srq->uctx_srq_page = (void *)get_zeroed_page(GFP_KERNEL);
 			if (!srq->uctx_srq_page) {
 				rc = -ENOMEM;
-				goto fail;
+				goto fail_destroy_srq;
+			}
+			if (xa_is_err(xa_store(&uctx->srq_xa, srq->qplib_srq.id,
+					       srq, GFP_KERNEL))) {
+				rc = -ENOMEM;
+				goto fail_free_toggle;
 			}
 			resp.comp_mask |= BNXT_RE_SRQ_TOGGLE_PAGE_SUPPORT;
 		}
 		rc = ib_respond_udata(udata, resp);
-		if (rc) {
-			bnxt_qplib_destroy_srq(&rdev->qplib_res,
-					       &srq->qplib_srq);
-			goto fail;
-		}
+		if (rc)
+			goto fail_respond;
 	}
 	active_srqs = atomic_inc_return(&rdev->stats.res.srq_count);
 	if (active_srqs > rdev->stats.res.srq_watermark)
@@ -2285,6 +2293,16 @@ int bnxt_re_create_srq(struct ib_srq *ib_srq,
 
 	return 0;
 
+fail_respond:
+	if (rdev->chip_ctx->modes.toggle_bits & BNXT_QPLIB_SRQ_TOGGLE_BIT) {
+		xa_lock(&uctx->srq_xa);
+		__xa_erase(&uctx->srq_xa, srq->qplib_srq.id);
+		xa_unlock(&uctx->srq_xa);
+	}
+fail_free_toggle:
+	free_page((unsigned long)srq->uctx_srq_page);
+fail_destroy_srq:
+	bnxt_qplib_destroy_srq(&rdev->qplib_res, &srq->qplib_srq);
 fail:
 	ib_umem_release(srq->umem);
 exit:
@@ -3475,11 +3493,24 @@ int bnxt_re_destroy_cq(struct ib_cq *ib_cq, struct ib_udata *udata)
 	if (ret)
 		return ret;
 
-	if (cctx->modes.toggle_bits & BNXT_QPLIB_CQ_TOGGLE_BIT)
-		hash_del(&cq->hash_entry);
 	bnxt_qplib_destroy_cq(&rdev->qplib_res, &cq->qplib_cq);
-	if (cctx->modes.toggle_bits & BNXT_QPLIB_CQ_TOGGLE_BIT)
-		free_page((unsigned long)cq->uctx_cq_page);
+	if (cctx->modes.toggle_bits & BNXT_QPLIB_CQ_TOGGLE_BIT) {
+		struct bnxt_re_ucontext *uctx =
+			rdma_udata_to_drv_context(udata, struct bnxt_re_ucontext, ib_uctx);
+
+		if (uctx) {
+		/*
+		 * Hold xa_lock across the erase so that GET_TOGGLE_MEM's
+		 * xa_lock + xa_load + dereference region is atomic with respect
+		 * to removal. xa_erase() would re-acquire the same lock and
+		 * deadlock; use __xa_erase() with the lock already held.
+		 */
+			xa_lock(&uctx->cq_xa);
+			__xa_erase(&uctx->cq_xa, cq->qplib_cq.id);
+			xa_unlock(&uctx->cq_xa);
+			free_page((unsigned long)cq->uctx_cq_page);
+		}
+	}
 
 	bnxt_re_put_nq(rdev, nq);
 
@@ -3554,14 +3585,15 @@ int bnxt_re_create_user_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *att
 	spin_lock_init(&cq->cq_lock);
 
 	if (cctx->modes.toggle_bits & BNXT_QPLIB_CQ_TOGGLE_BIT) {
-		hash_add(rdev->cq_hash, &cq->hash_entry, cq->qplib_cq.id);
-		/* Allocate a page */
 		cq->uctx_cq_page = (void *)get_zeroed_page(GFP_KERNEL);
 		if (!cq->uctx_cq_page) {
 			rc = -ENOMEM;
 			goto destroy_cq;
 		}
-
+		if (xa_is_err(xa_store(&uctx->cq_xa, cq->qplib_cq.id, cq, GFP_KERNEL))) {
+			rc = -ENOMEM;
+			goto free_toggle_page;
+		}
 		resp.comp_mask |= BNXT_RE_CQ_TOGGLE_PAGE_SUPPORT;
 	}
 	resp.cqid = cq->qplib_cq.id;
@@ -3574,6 +3606,12 @@ int bnxt_re_create_user_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *att
 	return 0;
 
 free_mem:
+	if (cctx->modes.toggle_bits & BNXT_QPLIB_CQ_TOGGLE_BIT) {
+		xa_lock(&uctx->cq_xa);
+		__xa_erase(&uctx->cq_xa, cq->qplib_cq.id);
+		xa_unlock(&uctx->cq_xa);
+	}
+free_toggle_page:
 	free_page((unsigned long)cq->uctx_cq_page);
 destroy_cq:
 	bnxt_qplib_destroy_cq(&rdev->qplib_res, &cq->qplib_cq);
@@ -4823,6 +4861,9 @@ int bnxt_re_alloc_ucontext(struct ib_ucontext *ctx, struct ib_udata *udata)
 		}
 	}
 
+	xa_init(&uctx->cq_xa);
+	xa_init(&uctx->srq_xa);
+
 	rc = ib_respond_udata(udata, resp);
 	if (rc)
 		goto cfail;
@@ -4848,6 +4889,9 @@ void bnxt_re_dealloc_ucontext(struct ib_ucontext *ib_uctx)
 	if (uctx->shpg)
 		free_page((unsigned long)uctx->shpg);
 
+	xa_destroy(&uctx->cq_xa);
+	xa_destroy(&uctx->srq_xa);
+
 	if (uctx->dpi.dbr) {
 		/* Free DPI only if this is the first PD allocated by the
 		 * application and mark the context dpi as NULL
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.h b/drivers/infiniband/hw/bnxt_re/ib_verbs.h
index 22bf81668cfb..76f407cd3435 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.h
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.h
@@ -78,7 +78,6 @@ struct bnxt_re_srq {
 	struct ib_umem		*umem;
 	spinlock_t		lock;		/* protect srq */
 	void			*uctx_srq_page;
-	struct hlist_node       hash_entry;
 };
 
 struct bnxt_re_qp {
@@ -113,7 +112,6 @@ struct bnxt_re_cq {
 	struct ib_umem		*resize_umem;
 	int			resize_cqe;
 	void			*uctx_cq_page;
-	struct hlist_node	hash_entry;
 };
 
 struct bnxt_re_mr {
@@ -147,6 +145,8 @@ struct bnxt_re_ucontext {
 	void			*shpg;
 	spinlock_t		sh_lock;	/* protect shpg */
 	struct rdma_user_mmap_entry *shpage_mmap;
+	struct xarray		cq_xa;		/* cqid → bnxt_re_cq, for toggle page lookup */
+	struct xarray		srq_xa;		/* srqid → bnxt_re_srq, for toggle page lookup */
 	u64 cmask;
 };
 
diff --git a/drivers/infiniband/hw/bnxt_re/main.c b/drivers/infiniband/hw/bnxt_re/main.c
index d25fdc458120..637f023b18ac 100644
--- a/drivers/infiniband/hw/bnxt_re/main.c
+++ b/drivers/infiniband/hw/bnxt_re/main.c
@@ -54,7 +54,6 @@
 #include <rdma/ib_user_verbs.h>
 #include <rdma/ib_umem.h>
 #include <rdma/ib_addr.h>
-#include <linux/hashtable.h>
 #include <linux/bnxt/ulp.h>
 
 #include "roce_hsi.h"
@@ -2337,10 +2336,6 @@ static int bnxt_re_dev_init(struct bnxt_re_dev *rdev, u8 op_type)
 		if (!(rdev->qplib_res.en_dev->flags & BNXT_EN_FLAG_ROCE_VF_RES_MGMT))
 			bnxt_re_vf_res_config(rdev);
 	}
-	hash_init(rdev->cq_hash);
-	if (rdev->chip_ctx->modes.toggle_bits & BNXT_QPLIB_SRQ_TOGGLE_BIT)
-		hash_init(rdev->srq_hash);
-
 	bnxt_re_debugfs_add_pdev(rdev);
 
 	bnxt_re_init_dcb_wq(rdev);
diff --git a/drivers/infiniband/hw/bnxt_re/uapi.c b/drivers/infiniband/hw/bnxt_re/uapi.c
index 263238a6e4cd..7e2acd0933f7 100644
--- a/drivers/infiniband/hw/bnxt_re/uapi.c
+++ b/drivers/infiniband/hw/bnxt_re/uapi.c
@@ -22,32 +22,6 @@
 #include "bnxt_re.h"
 #include "ib_verbs.h"
 
-static struct bnxt_re_cq *bnxt_re_search_for_cq(struct bnxt_re_dev *rdev, u32 cq_id)
-{
-	struct bnxt_re_cq *cq = NULL, *tmp_cq;
-
-	hash_for_each_possible(rdev->cq_hash, tmp_cq, hash_entry, cq_id) {
-		if (tmp_cq->qplib_cq.id == cq_id) {
-			cq = tmp_cq;
-			break;
-		}
-	}
-	return cq;
-}
-
-static struct bnxt_re_srq *bnxt_re_search_for_srq(struct bnxt_re_dev *rdev, u32 srq_id)
-{
-	struct bnxt_re_srq *srq = NULL, *tmp_srq;
-
-	hash_for_each_possible(rdev->srq_hash, tmp_srq, hash_entry, srq_id) {
-		if (tmp_srq->qplib_srq.id == srq_id) {
-			srq = tmp_srq;
-			break;
-		}
-	}
-	return srq;
-}
-
 static int UVERBS_HANDLER(BNXT_RE_METHOD_NOTIFY_DRV)(struct uverbs_attr_bundle *attrs)
 {
 	struct bnxt_re_ucontext *uctx;
@@ -246,10 +220,7 @@ static int UVERBS_HANDLER(BNXT_RE_METHOD_GET_TOGGLE_MEM)(struct uverbs_attr_bund
 	struct bnxt_re_user_mmap_entry *entry;
 	struct bnxt_re_ucontext *uctx;
 	struct ib_ucontext *ib_uctx;
-	struct bnxt_re_dev *rdev;
-	struct bnxt_re_srq *srq;
 	u32 length = PAGE_SIZE;
-	struct bnxt_re_cq *cq;
 	u64 mem_offset;
 	u32 offset = 0;
 	u64 addr = 0;
@@ -265,31 +236,33 @@ static int UVERBS_HANDLER(BNXT_RE_METHOD_GET_TOGGLE_MEM)(struct uverbs_attr_bund
 		return err;
 
 	uctx = container_of(ib_uctx, struct bnxt_re_ucontext, ib_uctx);
-	rdev = uctx->rdev;
 	err = uverbs_copy_from(&res_id, attrs, BNXT_RE_TOGGLE_MEM_RES_ID);
 	if (err)
 		return err;
 
 	switch (res_type) {
 	case BNXT_RE_CQ_TOGGLE_MEM:
-		cq = bnxt_re_search_for_cq(rdev, res_id);
-		if (!cq)
-			return -EINVAL;
+		struct bnxt_re_cq *cq;
 
-		addr = (u64)cq->uctx_cq_page;
+		xa_lock(&uctx->cq_xa);
+		cq = xa_load(&uctx->cq_xa, res_id);
+		if (cq)
+			addr = (u64)cq->uctx_cq_page;
+		xa_unlock(&uctx->cq_xa);
 		if (!addr)
-			return -EOPNOTSUPP;
+			return -EINVAL;
 		break;
 	case BNXT_RE_SRQ_TOGGLE_MEM:
-		srq = bnxt_re_search_for_srq(rdev, res_id);
-		if (!srq)
-			return -EINVAL;
+		struct bnxt_re_srq *srq;
 
-		addr = (u64)srq->uctx_srq_page;
+		xa_lock(&uctx->srq_xa);
+		srq = xa_load(&uctx->srq_xa, res_id);
+		if (srq)
+			addr = (u64)srq->uctx_srq_page;
+		xa_unlock(&uctx->srq_xa);
 		if (!addr)
-			return -EOPNOTSUPP;
+			return -EINVAL;
 		break;
-
 	default:
 		return -EOPNOTSUPP;
 	}
-- 
2.39.3


^ permalink raw reply related

* [PATCH for-next v2 0/2] RDMA/bnxt_re: Update the toggle page handling of CQ and SRQ
From: Selvin Xavier @ 2026-06-24 22:39 UTC (permalink / raw)
  To: leon, jgg
  Cc: linux-rdma, andrew.gospodarek, kalesh-anakkur.purayil,
	sriharsha.basavapatna, Selvin Xavier

Based on the suggestion from Jason
(https://patchwork.kernel.org/project/linux-rdma/patch/20260615224751.232802-5-selvin.xavier@broadcom.com/),
add uverbs object handles to retrieve the CQ and SRQ structures when getting
the toggle mem. To work with older rdma-core, retain the existing code path
with modifications.

The rdma-core pull request is here: https://github.com/linux-rdma/rdma-core/pull/1761

Please review and apply the series.

Thanks,
Selvin Xavier

v1->v2 :
    - Fix the error cleanup for SRQ and CQ create paths
    - Fix a synchronization issue for the legacy path which can cause a
      UAF

Selvin Xavier (2):
  RDMA/bnxt_re: Replace per-device hash tables with per-context XArray
  RDMA/bnxt_re: Add uverbs object handle path for CQ/SRQ toggle page

 drivers/infiniband/hw/bnxt_re/bnxt_re.h  |   6 --
 drivers/infiniband/hw/bnxt_re/ib_verbs.c |  91 +++++++++++++----
 drivers/infiniband/hw/bnxt_re/ib_verbs.h |   5 +-
 drivers/infiniband/hw/bnxt_re/main.c     |   5 -
 drivers/infiniband/hw/bnxt_re/uapi.c     | 124 +++++++++++------------
 include/uapi/rdma/bnxt_re-abi.h          |   4 +
 6 files changed, 139 insertions(+), 96 deletions(-)

-- 
2.39.3


^ permalink raw reply

* [PATCH] RDMA/irdma: Prevent overflows in memory contiguity checks
From: Aleksandrova Alyona @ 2026-06-24 14:48 UTC (permalink / raw)
  To: Krzysztof Czurylo, Tatyana Nikolova
  Cc: Jason Gunthorpe, Leon Romanovsky, Mustafa Ismail, Shiraz Saleem,
	linux-rdma, linux-kernel, lvc-project

irdma_check_mem_contiguous() and irdma_check_mr_contiguous() verify that
PBL entries describe physically contiguous memory ranges.

Both functions calculate byte offsets using 32-bit operands. For example,
with 4 KiB pages, pg_size * pg_idx overflows 32-bit arithmetic when
pg_idx reaches 1048576. In the level-2 check, PBLE_PER_PAGE is 512, so
i * pg_size * PBLE_PER_PAGE overflows when i reaches 2048.

These values are reachable in the driver. For MRs, palloc->total_cnt
comes from iwmr->page_cnt, which is calculated by
ib_umem_num_dma_blocks(). The MR size is limited by IRDMA_MAX_MR_SIZE,
so a 4 GiB MR with 4 KiB pages can reach page_cnt of 1048576. PBLE
resources do not exclude this value either: for gen3, the limit is based
on avail_sds * MAX_PBLE_PER_SD, and MAX_PBLE_PER_SD is 0x40000, so 4 SDs
are enough for 1048576 PBLEs.

Cast one operand to u64 before the multiplications so that the offset
calculations are performed in 64-bit arithmetic.
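
The wraparound and the effect of the cast can be reproduced in userspace.
This is a minimal sketch, assuming a platform where unsigned int is 32 bits
wide: uint32_t and uint64_t stand in for the kernel's u32 and u64, the
function names are hypothetical, and PBLE_PER_PAGE is set to 512 as stated
above.

```c
#include <stdint.h>

#define PBLE_PER_PAGE 512

/* Level-1 offset as computed before the patch: both operands are
 * 32-bit, so the product wraps modulo 2^32 before it is widened. */
static uint64_t page_offset_32bit(uint32_t pg_size, uint32_t pg_idx)
{
	return pg_size * pg_idx;
}

/* Level-1 offset with one operand cast first, as in the patch. */
static uint64_t page_offset_64bit(uint32_t pg_size, uint32_t pg_idx)
{
	return (uint64_t)pg_size * pg_idx;
}

/* Level-2 leaf offset before the patch: still 32-bit arithmetic. */
static uint64_t leaf_offset_32bit(uint32_t i, uint32_t pg_size)
{
	return i * pg_size * PBLE_PER_PAGE;
}

/* Level-2 leaf offset with the leading operand cast to 64 bits, so
 * every multiplication in the chain is done in 64-bit arithmetic. */
static uint64_t leaf_offset_64bit(uint32_t i, uint32_t pg_size)
{
	return (uint64_t)i * pg_size * PBLE_PER_PAGE;
}
```

With pg_size = 4096, the 32-bit variants return 0 at pg_idx = 1048576 and at
i = 2048, while the 64-bit variants return 4294967296 in both cases. Below
those thresholds the two variants agree, which is why the problem only
shows up for large regions.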

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs")
Signed-off-by: Aleksandrova Alyona <aga@itb.spb.ru>
---
 drivers/infiniband/hw/irdma/verbs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/irdma/verbs.c b/drivers/infiniband/hw/irdma/verbs.c
index 17086048d2d7..ab55f674cb63 100644
--- a/drivers/infiniband/hw/irdma/verbs.c
+++ b/drivers/infiniband/hw/irdma/verbs.c
@@ -2819,7 +2819,7 @@ static bool irdma_check_mem_contiguous(u64 *arr, u32 npages, u32 pg_size)
 	u32 pg_idx;

 	for (pg_idx = 0; pg_idx < npages; pg_idx++) {
-		if ((*arr + (pg_size * pg_idx)) != arr[pg_idx])
+		if ((*arr + ((u64)pg_size * pg_idx)) != arr[pg_idx])
 			return false;
 	}

@@ -2852,7 +2852,7 @@ static bool irdma_check_mr_contiguous(struct irdma_pble_alloc *palloc,

 	for (i = 0; i < lvl2->leaf_cnt; i++, leaf++) {
 		arr = leaf->addr;
-		if ((*start_addr + (i * pg_size * PBLE_PER_PAGE)) != *arr)
+		if ((*start_addr + ((u64)i * pg_size * PBLE_PER_PAGE)) != *arr)
 			return false;
 		ret = irdma_check_mem_contiguous(arr, leaf->cnt, pg_size);
 		if (!ret)
-- 
2.26.2

^ permalink raw reply related

* Re: [PATCH] RDMA/siw: publish QP after initialization
From: Bernard Metzler @ 2026-06-24 14:16 UTC (permalink / raw)
  To: Ruoyu Wang, Jason Gunthorpe, Leon Romanovsky; +Cc: linux-rdma, linux-kernel
In-Reply-To: <20260620155306.78919-1-ruoyuw560@gmail.com>

On 20.06.2026 17:53, Ruoyu Wang wrote:
> siw_create_qp() allocates a QP number before the queues, CQ pointers,
> state, completion, and device list entry are ready. A QPN lookup can
> therefore reach a QP that is still being constructed if the object is
> published at allocation time.
> 
> Reserve the QPN with an empty XArray entry first. Publish the QP object
> only after the kernel-visible QP state is initialized and just before
> siw_create_qp() returns it to the caller.
> 
> Fixes: f29dd55b0236 ("rdma/siw: queue pair methods")
> Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com>
> ---
>   drivers/infiniband/sw/siw/siw.h       |  1 +
>   drivers/infiniband/sw/siw/siw_qp.c    | 26 ++++++++++++++++++--------
>   drivers/infiniband/sw/siw/siw_verbs.c | 12 +++++++++++-
>   3 files changed, 30 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/siw/siw.h b/drivers/infiniband/sw/siw/siw.h
> index f5fd71717b80..ade7c96135c2 100644
> --- a/drivers/infiniband/sw/siw/siw.h
> +++ b/drivers/infiniband/sw/siw/siw.h
> @@ -511,6 +511,7 @@ void siw_send_terminate(struct siw_qp *qp);
>   void siw_qp_get_ref(struct ib_qp *qp);
>   void siw_qp_put_ref(struct ib_qp *qp);
>   int siw_qp_add(struct siw_device *sdev, struct siw_qp *qp);
> +int siw_qp_publish(struct siw_qp *qp);
>   void siw_free_qp(struct kref *ref);
>   
>   void siw_init_terminate(struct siw_qp *qp, enum term_elayer layer,
> diff --git a/drivers/infiniband/sw/siw/siw_qp.c b/drivers/infiniband/sw/siw/siw_qp.c
> index bb780e3904a2..1a9135d9a2a7 100644
> --- a/drivers/infiniband/sw/siw/siw_qp.c
> +++ b/drivers/infiniband/sw/siw/siw_qp.c
> @@ -1281,15 +1281,25 @@ void siw_rq_flush(struct siw_qp *qp)
>   
>   int siw_qp_add(struct siw_device *sdev, struct siw_qp *qp)
>   {
> -	int rv = xa_alloc(&sdev->qp_xa, &qp->base_qp.qp_num, qp, xa_limit_32b,
> -			  GFP_KERNEL);
> +	qp->sdev = sdev;
>   
> -	if (!rv) {
> -		kref_init(&qp->ref);
> -		qp->sdev = sdev;
> -		siw_dbg_qp(qp, "new QP\n");
> -	}
> -	return rv;
> +	return xa_alloc(&sdev->qp_xa, &qp->base_qp.qp_num, NULL,
> +			xa_limit_32b, GFP_KERNEL);
> +}
> +
> +int siw_qp_publish(struct siw_qp *qp)
> +{
> +	void *old;
> +
> +	kref_init(&qp->ref);
> +
> +	old = xa_store(&qp->sdev->qp_xa, qp_id(qp), qp, GFP_KERNEL);
> +	if (xa_is_err(old))
> +		return xa_err(old);
> +
> +	siw_dbg_qp(qp, "new QP\n");
> +
> +	return 0;
>   }
>   
>   void siw_free_qp(struct kref *ref)
> diff --git a/drivers/infiniband/sw/siw/siw_verbs.c b/drivers/infiniband/sw/siw/siw_verbs.c
> index 1e1d262a4ae2..71bc0cc59e3d 100644
> --- a/drivers/infiniband/sw/siw/siw_verbs.c
> +++ b/drivers/infiniband/sw/siw/siw_verbs.c
> @@ -482,14 +482,24 @@ int siw_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attrs,
>   		goto err_out_xa;
>   	}
>   	INIT_LIST_HEAD(&qp->devq);
> +	init_completion(&qp->qp_free);
> +
>   	spin_lock_irqsave(&sdev->lock, flags);
>   	list_add_tail(&qp->devq, &sdev->qp_list);
>   	spin_unlock_irqrestore(&sdev->lock, flags);
>   
> -	init_completion(&qp->qp_free);
> +	rv = siw_qp_publish(qp);

To avoid this transient visibility of a not-yet-initialized
QP - can't we just move siw_qp_add() to the end of the
siw_create_qp() function?


> +	if (rv)
> +		goto err_out_list;
>   
>   	return 0;
>   
> +err_out_list:
> +	spin_lock_irqsave(&sdev->lock, flags);
> +	list_del(&qp->devq);
> +	spin_unlock_irqrestore(&sdev->lock, flags);
> +
> +	siw_put_tx_cpu(qp->tx_cpu);
>   err_out_xa:
>   	xa_erase(&sdev->qp_xa, qp_id(qp));
>   	if (uctx) {


^ permalink raw reply

* [recipe build #4056915] of ~linux-rdma rdma-core-daily in xenial: Dependency wait
From: noreply @ 2026-06-24 13:03 UTC (permalink / raw)
  To: Linux RDMA

 * State: Dependency wait
 * Recipe: linux-rdma/rdma-core-daily
 * Archive: ~linux-rdma/ubuntu/rdma-core-daily
 * Distroseries: xenial
 * Duration: 3 minutes
 * Build Log: https://launchpad.net/~linux-rdma/+archive/ubuntu/rdma-core-daily/+recipebuild/4056915/+files/buildlog.txt.gz
 * Upload Log: 
 * Builder: https://launchpad.net/builders/lcy02-amd64-063

-- 
https://launchpad.net/~linux-rdma/+archive/ubuntu/rdma-core-daily/+recipebuild/4056915
Your team Linux RDMA is the requester of the build.


^ permalink raw reply

* Re: [PATCH net v2] net/smc: avoid recursive sk_callback_lock in listen data_ready
From: Runyu Xiao @ 2026-06-24 10:37 UTC (permalink / raw)
  To: XIAO WU
  Cc: D. Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Mahanta Jambigi, Tony Lu, Wen Gu, Simon Horman, Karsten Graul,
	linux-rdma, linux-s390, netdev, linux-kernel, jianhao.xu
In-Reply-To: <tencent_BD4B709F8D16281265EDBC0DC9EFC8758808@qq.com>

Hi Xiao,

> the error path in smc_listen() does not restore icsk_af_ops when
> kernel_listen() fails

Thanks, this looks like a real error-path bug. I will prepare it as a
separate fix for smc_listen() rather than folding it into this
sk_callback_lock patch.

Runyu


^ permalink raw reply
