Linux CIFS filesystem development
* Problem with smbdirect rw credits and initiator_depth
@ 2025-12-03 18:18 Stefan Metzmacher
  2025-12-04  0:07 ` Namjae Jeon
  2025-12-04  9:57 ` Stefan Metzmacher
  0 siblings, 2 replies; 24+ messages in thread
From: Stefan Metzmacher @ 2025-12-03 18:18 UTC (permalink / raw)
  To: Namjae Jeon, Tom Talpey
  Cc: linux-cifs@vger.kernel.org, linux-rdma@vger.kernel.org

Hi Namjae,

I found the reason why the 6.17.9 code of transport_rdma.c deadlocks
with a Windows client when using irdma in roce mode, while the 6.18
code works fine.

irdma/roce in 6.17.9 code => deadlock in wait_for_rw_credits()
[   T8653] ksmbd: smb_direct: initiator_depth:8 peer_initiator_depth:16
[   T8653] ksmbd: smb_direct: max_rw_credits:9
[   T7013] ------------[ cut here ]------------
[   T7013] needed:31 > max:9
[   T7013] WARNING: CPU: 1 PID: 7013 at transport_rdma.c:975 wait_for_credits+0x3b8/0x430 [ksmbd]

When the client starts to send an array with a larger number of
smb2_buffer_desc_v1 elements in a single SMB2 write request (most likely 31
in the above example), wait_for_rw_credits() simply deadlocks, as only 9
credits are available while 31 are requested.
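The failure mode above can be shown with a minimal userspace model (this is
an illustration only, not the kernel code; struct rw_credits, the enum and
try_wait_for_rw_credits() are hypothetical names, and the assumption is one
credit per smb2_buffer_desc_v1 element as the logs suggest):

```c
#include <assert.h>

/* Hypothetical model of the ksmbd rw-credit accounting:
 * wait_for_rw_credits() blocks until "needed" credits are free, but if
 * needed exceeds the connection-wide maximum it can never succeed,
 * because completions can never free more than "max" credits. */
struct rw_credits {
	int max;	/* e.g. sc->rw_io.credits.max, 9 in the failing trace */
	int used;
};

enum wait_result { WAIT_OK, WAIT_WOULD_DEADLOCK };

static enum wait_result try_wait_for_rw_credits(struct rw_credits *c, int needed)
{
	if (needed > c->max)
		return WAIT_WOULD_DEADLOCK;	/* corresponds to "needed:31 > max:9" */
	/* the real code would sleep here until enough credits are free */
	c->used += needed;
	return WAIT_OK;
}
```

With max:9 a request for 31 credits can never be satisfied, matching the
WARNING above; with the 159 credits seen after the 6.18 change it succeeds.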

In the 6.18 code we have commit 0bd73ae09ba1b73137d0830b21820d24700e09b1
smb: server: allocate enough space for RW WRs and ib_drain_qp()

It makes sure we allocate qp_attr.cap.max_rdma_ctxs and qp_attr.cap.max_send_wr
correctly. qp_attr.cap.max_rdma_ctxs was filled from sc->rw_io.credits.max before,
so I changed sc->rw_io.credits.max, but those two values might need to be split
from each other.

After that change we no longer deadlock when the client starts sending
larger SMB2 writes with a larger number of smb2_buffer_desc_v1 elements;
the resulting 159 credits are more than enough.

irdma/roce:
[   T6505] ksmbd: smb_direct: initiator_depth:8 peer_initiator_depth:16
...
[   T6505] ksmbd: smb_direct: sc->rw_io.credits.num_pages=13 sc->rw_io.credits.max:159

My current theory about the Mellanox problem is that the number of pending
RDMA Read operations should be limited by the negotiated initiator_depth,
which is at most SMB_DIRECT_CM_INITIATOR_DEPTH (8), and that we're
overflowing the hardware limits by posting too many RDMA Read SQEs.
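If that theory holds, the responder side must never have more RDMA Reads
in flight than the negotiated initiator depth. A minimal sketch of such a
limit (struct read_limiter and its functions are hypothetical, only
SMB_DIRECT_CM_INITIATOR_DEPTH and the value 8 come from the code above):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the responder-side limit: at most
 * "initiator_depth" RDMA Reads may be outstanding at once; posting
 * more risks overflowing the hardware's IRD/ORD resources. */
struct read_limiter {
	int initiator_depth;	/* negotiated, e.g. SMB_DIRECT_CM_INITIATOR_DEPTH (8) */
	int inflight;		/* RDMA Reads currently posted */
};

static bool read_limiter_try_post(struct read_limiter *l)
{
	if (l->inflight >= l->initiator_depth)
		return false;	/* must wait for a completion first */
	l->inflight++;
	return true;
}

static void read_limiter_complete(struct read_limiter *l)
{
	l->inflight--;
}
```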

The change in 0bd73ae09ba1b73137d0830b21820d24700e09b1 didn't change the
resulting values of sc->rw_io.credits.max for iwarp devices; it only adjusted
the number for qp_attr.cap.max_send_wr.

So for iwarp we deadlock in both versions of transport_rdma.c when
the client starts to send an array of 17 smb2_buffer_desc_v1 elements.
(I was able to see that using siw on the server, so that tcpdump was
able to capture it, see:
https://www.samba.org/~metze/caps/smb2/rdma/linux-6.18-regression/2025-12-03/rdma1-siw-r6.18-ace-fixed-hang-01-stream13.pcap.gz)
With roce the client uses 17 directly:
https://www.samba.org/~metze/caps/smb2/rdma/linux-6.18-regression/2025-12-03/rdma1-rxe-r6.18-race-fixed-rw-credits-reverted-hang-01.pcap.gz

The first few SMB2 writes use 2 smb2_buffer_desc_v1 elements and at the end
the Windows client switches to 17 smb2_buffer_desc_v1 elements.

irdma/iwarp:
[Wed Dec  3 13:45:22 2025] [   T7621] ksmbd: smb_direct: initiator_depth:8 peer_initiator_depth:127
...
[Wed Dec  3 13:45:22 2025] [   T7621] ksmbd: smb_direct: sc->rw_io.credits.num_pages=256 sc->rw_io.credits.max:9
...
[Wed Dec  3 13:45:22 2025] [   T8638] ------------[ cut here ]------------
[Wed Dec  3 13:45:22 2025] [   T8638] needed:17 > max:9


siw/iwarp:
[Wed Dec  3 13:49:30 2025] [   T7621] ksmbd: smb_direct: initiator_depth:8 peer_initiator_depth:16
...
[Wed Dec  3 13:49:30 2025] [   T7621] ksmbd: smb_direct: sc->rw_io.credits.num_pages=256 sc->rw_io.credits.max:9
...
[Wed Dec  3 13:49:30 2025] [   T9353] ------------[ cut here ]------------
[Wed Dec  3 13:49:30 2025] [   T9353] needed:17 > max:9

I've prepared 3 branches for testing:

for-6.18/ksmbd-smbdirect-regression-v1
https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/for-6.18/ksmbd-smbdirect-regression-v1

This has some pr_notice() messages and a WARN_ONCE() when the wait_for_rw_credits() deadlock happens.

for-6.18/ksmbd-smbdirect-regression-v2
https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/for-6.18/ksmbd-smbdirect-regression-v2

This is based on for-6.18/ksmbd-smbdirect-regression-v1 but reverts
commit 0bd73ae09ba1b73137d0830b21820d24700e09b1; this might fix your setup.

for-6.18/ksmbd-smbdirect-regression-v3
https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/for-6.18/ksmbd-smbdirect-regression-v3

This reverts everything to the state of v6.17.9 and adds some pr_notice()
messages and a WARN_ONCE() when the wait_for_rw_credits() deadlock happens.

Can you please test them, giving priority to
for-6.18/ksmbd-smbdirect-regression-v2, and try the others if you have
more time?

I typically use this running on a 6.18 kernel:
modprobe ksmbd
ksmbd.control -s
rmmod ksmbd
cd fs/smb/server
make -j$(getconf _NPROCESSORS_ONLN) -C /lib/modules/$(uname -r)/build M=$(pwd) KBUILD_MODPOST_WARN=1 modules
insmod ksmbd.ko
ksmbd.mountd

Then in one window:
bpftrace -e 'kprobe:smb_direct_rdma_xmit { printf("%s: %s pid=%d %s\n", strftime("%F %H:%M:%S", nsecs(sw_tai)), comm, pid, func); }'
And in another window:
dmesg -T -w


I assume the solution is to change smb_direct_rdma_xmit() so that
it doesn't try to get credits for all RDMA read/write requests at once.
Instead, after collecting all ib_send_wr structures from all rdma_rw_ctx_wrs()
calls, we chunk the list to stay within the negotiated initiator depth
before passing it to ib_post_send().

At least we need to limit this for RDMA read requests; for RDMA write requests
we may not need to chunk and could post them all together, but chunking might
still be good in order to avoid blocking concurrent RDMA sends.
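The chunking idea can be sketched with a userspace model of cutting a WR
chain (struct wr and wr_chain_cut() are made-up stand-ins for ib_send_wr
and the real posting loop, not actual verbs API; depth would be the
negotiated initiator depth):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the proposed chunking: walk a singly linked
 * WR chain (standing in for ib_send_wr) and cut it after at most
 * "depth" entries. The caller posts the current chunk, waits for its
 * completions, then continues with the returned remainder. */
struct wr { struct wr *next; };

static struct wr *wr_chain_cut(struct wr *head, int depth)
{
	struct wr *cur = head;
	struct wr *rest;
	int n;

	if (!cur)
		return NULL;
	for (n = 1; n < depth && cur->next; n++)
		cur = cur->next;
	rest = cur->next;
	cur->next = NULL;	/* this chunk ends here; post it now */
	return rest;
}
```

A chain of 17 WRs with depth 8 would then be posted as three chunks of
8, 8 and 1, so the responder never sees more reads than it negotiated.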

Tom, is this assumption correct?

Thanks!
metze

