From: Stefan Metzmacher <metze@samba.org>
To: Namjae Jeon <linkinjeon@kernel.org>
Cc: Tom Talpey <tom@talpey.com>,
"linux-cifs@vger.kernel.org" <linux-cifs@vger.kernel.org>,
"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>
Subject: Re: Problem with smbdirect rw credits and initiator_depth
Date: Thu, 4 Dec 2025 10:39:58 +0100
Message-ID: <f59e0dc7-e91c-4a13-8d49-fe183c10b6f4@samba.org>
In-Reply-To: <CAKYAXd9p=7BzmSSKi5n41OKkkw4qrr4cWpWet7rUfC+VT-6h1g@mail.gmail.com>
Hi Namjae,
> Okay, it seems like the issue has been improved in your v3 branch. If
> you send the official patches, I will test it more.
It's good to have verified that for-6.18/ksmbd-smbdirect-regression-v3
on a 6.18 kernel behaves the same as 6.17.9, as transport_rdma.c
is the same, but it doesn't really allow forward progress on
the Mellanox problem.
Can you at least post the dmesg output generated by this:
https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=7e724ebc58e986f4e101a55f4ab5e96912239918
Assuming that this wasn't triggered:
if (WARN_ONCE(needed > max_possible, "needed:%u > max:%u\n", needed, max_possible))
Did you run the bpftrace command? Did it print a lot of
'smb_direct_rdma_xmit' messages over the whole duration of the file copy?
Did you actually copy a file to or from the server?
Have you actually tested for-6.18/ksmbd-smbdirect-regression-v2,
as requested? I was hoping that it would behave the same
way as for-6.18/ksmbd-smbdirect-regression-v3,
but with only a single patch reverted.
I'll continue to fix the general problem that this only happens to work
for non-Mellanox setups, as it seems it never worked correctly at all :-(
Were you testing with RoCEv2 or InfiniBand?
I think moving forward for Mellanox setups requires these steps:
- Test v1 vs. v2 and check whether smb_direct_rdma_xmit is actually
called at all, and look at the dmesg output.
- Test with Mellanox RoCEv2 on the client and rxe on
the server, so that we can create a network capture with tcpdump.
Thanks!
metze
> Thanks.
>
> On Thu, Dec 4, 2025 at 3:18 AM Stefan Metzmacher <metze@samba.org> wrote:
>>
>> Hi Namjae,
>>
>> I found the reason why the 6.17.9 code of transport_rdma.c deadlocks
>> with a Windows client when using irdma in RoCE mode, while the 6.18
>> code works fine.
>>
>> irdma/roce in 6.17.9 code => deadlock in wait_for_rw_credits()
>> [ T8653] ksmbd: smb_direct: initiator_depth:8 peer_initiator_depth:16
>> [ T8653] ksmbd: smb_direct: max_rw_credits:9
>> [ T7013] ------------[ cut here ]------------
>> [ T7013] needed:31 > max:9
>> [ T7013] WARNING: CPU: 1 PID: 7013 at transport_rdma.c:975 wait_for_credits+0x3b8/0x430 [ksmbd]
>>
>> When the client starts to send an array with a larger number of smb2_buffer_desc_v1
>> elements in a single SMB2 write request (most likely 31 in the above example),
>> wait_for_rw_credits() will simply deadlock, as only 9 credits are possible
>> and 31 are requested.
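>>
>> As a minimal sketch of the failing mechanism (illustrative names,
>> not the literal 6.17.9 code): the waiter blocks until 'needed'
>> credits are available, but the pool can never hold more than 'max',
>> so asking for 31 out of a pool of 9 can never be satisfied:
>>
>> /* sketch only: a simple atomic credit pool as I understand it */
>> static int wait_for_rw_credits_sketch(struct smb_direct_transport *t,
>> 				      int needed)
>> {
>> 	/*
>> 	 * t->rw_credits can never exceed t->max_rw_credits (9 here),
>> 	 * so needed == 31 makes this wait block forever.
>> 	 */
>> 	return wait_event_interruptible(t->wait_rw_credits,
>> 			atomic_read(&t->rw_credits) >= needed);
>> }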
>>
>> In the 6.18 code we have commit 0bd73ae09ba1b73137d0830b21820d24700e09b1
>> ("smb: server: allocate enough space for RW WRs and ib_drain_qp()").
>>
>> It makes sure we allocate qp_attr.cap.max_rdma_ctxs and qp_attr.cap.max_send_wr
>> correctly. qp_attr.cap.max_rdma_ctxs was filled from sc->rw_io.credits.max before,
>> so I changed sc->rw_io.credits.max, but these two values might need to be split
>> from each other.
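>>
>> As a rough sketch of the sizing idea (the ib_qp_cap fields are real;
>> the right-hand sides are my simplified reading, not the exact ksmbd
>> formula, and send_credits/max_rw_wrs are illustrative names):
>>
>> /* sketch: size the QP for the real RDMA R/W load, not just sends */
>> qp_attr.cap.max_send_wr   = send_credits + max_rw_wrs + 1; /* +1 for ib_drain_qp() */
>> qp_attr.cap.max_rdma_ctxs = sc->rw_io.credits.max;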
>>
>> After that change we no longer deadlock when the client starts sending
>> larger SMB2 writes with a larger number of smb2_buffer_desc_v1 elements;
>> 159 credits are more than enough.
>>
>> irdma/roce:
>> [ T6505] ksmbd: smb_direct: initiator_depth:8 peer_initiator_depth:16
>> ...
>> [ T6505] ksmbd: smb_direct: sc->rw_io.credits.num_pages=13 sc->rw_io.credits.max:159
>>
>> My current theory about the Mellanox problem is that the number of pending
>> RDMA Read operations should be limited by the negotiated initiator_depth, which is at most
>> SMB_DIRECT_CM_INITIATOR_DEPTH (8), and that we're overflowing the hardware limits by
>> posting too many RDMA Read SQEs.
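>>
>> For reference, this is the rdma_cm negotiation I mean (the
>> rdma_conn_param fields are real; the values here are illustrative):
>>
>> struct rdma_conn_param conn_param = {
>> 	/* incoming RDMA Reads we can serve concurrently (illustrative) */
>> 	.responder_resources = 8,
>> 	/* RDMA Reads we may post towards the peer; pending RDMA Read
>> 	 * WRs have to stay below this negotiated value */
>> 	.initiator_depth = SMB_DIRECT_CM_INITIATOR_DEPTH, /* 8 */
>> };
>> rdma_accept(cm_id, &conn_param);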
>>
>> The change in 0bd73ae09ba1b73137d0830b21820d24700e09b1 didn't change the
>> resulting values of sc->rw_io.credits.max for iWARP devices; it only adjusted
>> the number for qp_attr.cap.max_send_wr.
>>
>> So for iWARP we deadlock in both versions of transport_rdma.c when
>> the client starts to send an array of 17 smb2_buffer_desc_v1 elements.
>> I was able to see that using siw on the server, so that tcpdump was
>> able to capture it, see:
>> https://www.samba.org/~metze/caps/smb2/rdma/linux-6.18-regression/2025-12-03/rdma1-siw-r6.18-ace-fixed-hang-01-stream13.pcap.gz
>> With RoCE it directly uses 17:
>> https://www.samba.org/~metze/caps/smb2/rdma/linux-6.18-regression/2025-12-03/rdma1-rxe-r6.18-race-fixed-rw-credits-reverted-hang-01.pcap.gz
>>
>> The first few SMB2 writes use 2 smb2_buffer_desc_v1 elements and at the end
>> the Windows client switches to 17 smb2_buffer_desc_v1 elements.
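>>
>> For reference, each of those elements describes one registered
>> buffer on the client (this layout matches my reading of
>> SMB_DIRECT_BUFFER_DESCRIPTOR_V1 in MS-SMBD), and each one typically
>> requires at least one RDMA Read on the server:
>>
>> struct smb2_buffer_desc_v1 {
>> 	__le64 offset; /* remote virtual address */
>> 	__le32 token;  /* rkey of the registered region */
>> 	__le32 length; /* number of bytes to read/write */
>> } __packed;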
>>
>> irdma/iwarp:
>> [Wed Dec 3 13:45:22 2025] [ T7621] ksmbd: smb_direct: initiator_depth:8 peer_initiator_depth:127
>> ..
>> [Wed Dec 3 13:45:22 2025] [ T7621] ksmbd: smb_direct: sc->rw_io.credits.num_pages=256 sc->rw_io.credits.max:9
>> ...
>> [Wed Dec 3 13:45:22 2025] [ T8638] ------------[ cut here ]------------
>> [Wed Dec 3 13:45:22 2025] [ T8638] needed:17 > max:9
>>
>>
>> siw/iwarp:
>> [Wed Dec 3 13:49:30 2025] [ T7621] ksmbd: smb_direct: initiator_depth:8 peer_initiator_depth:16
>> ...
>> [Wed Dec 3 13:49:30 2025] [ T7621] ksmbd: smb_direct: sc->rw_io.credits.num_pages=256 sc->rw_io.credits.max:9
>> ...
>> [Wed Dec 3 13:49:30 2025] [ T9353] ------------[ cut here ]------------
>> [Wed Dec 3 13:49:30 2025] [ T9353] needed:17 > max:9
>>
>> I've prepared 3 branches for testing:
>>
>> for-6.18/ksmbd-smbdirect-regression-v1
>> https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/for-6.18/ksmbd-smbdirect-regression-v1
>>
>> This has some pr_notice() messages and a WARN_ONCE() for when the wait_for_rw_credits() deadlock happens.
>>
>> for-6.18/ksmbd-smbdirect-regression-v2
>> https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/for-6.18/ksmbd-smbdirect-regression-v2
>>
>> This is based on for-6.18/ksmbd-smbdirect-regression-v1, but reverts
>> commit 0bd73ae09ba1b73137d0830b21820d24700e09b1; this might fix your setup.
>>
>> for-6.18/ksmbd-smbdirect-regression-v3
>> https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/for-6.18/ksmbd-smbdirect-regression-v3
>>
>> This reverts everything to the state of v6.17.9, plus
>> some pr_notice() messages and a WARN_ONCE() for when the wait_for_rw_credits() deadlock happens.
>>
>> Can you please test them, with priority on testing
>> for-6.18/ksmbd-smbdirect-regression-v2 first, and the others if you have
>> more time?
>>
>> I typically use this running on a 6.18 kernel:
>> modprobe ksmbd
>> ksmbd.control -s
>> rmmod ksmbd
>> cd fs/smb/server
>> make -j$(getconf _NPROCESSORS_ONLN) -C /lib/modules/$(uname -r)/build M=$(pwd) KBUILD_MODPOST_WARN=1 modules
>> insmod ksmbd.ko
>> ksmbd.mountd
>>
>> Then in one window:
>> bpftrace -e 'kprobe:smb_direct_rdma_xmit { printf("%s: %s pid=%d %s\n", strftime("%F %H:%M:%S", nsecs(sw_tai)), comm, pid, func); }'
>> And in another window:
>> dmesg -T -w
>>
>>
>> I assume the solution is to change smb_direct_rdma_xmit so that
>> it doesn't try to get credits for all RDMA read/write requests at once.
>> Instead, after collecting all ib_send_wr structures from all rdma_rw_ctx_wrs()
>> calls, we chunk the list to stay within the negotiated initiator depth
>> before passing it to ib_post_send().
>>
>> At least we need to limit this for RDMA Read requests; for RDMA Write requests
>> we may not need to chunk and could post them all together, but chunking might
>> still be good in order to avoid blocking concurrent RDMA sends (a rough sketch
>> follows below).
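>>
>> A minimal sketch of the chunked posting I have in mind (illustrative
>> only; wait_for_chunk_completion() stands in for whatever completion
>> tracking the real code would use):
>>
>> static int post_wr_chain_chunked(struct ib_qp *qp,
>> 				 struct ib_send_wr *first,
>> 				 unsigned int depth)
>> {
>> 	while (first) {
>> 		struct ib_send_wr *cur = first, *next;
>> 		const struct ib_send_wr *bad_wr;
>> 		unsigned int n;
>> 		int ret;
>>
>> 		/* find the last WR of this chunk */
>> 		for (n = 1; n < depth && cur->next; n++)
>> 			cur = cur->next;
>>
>> 		/* detach the chunk from the rest of the chain */
>> 		next = cur->next;
>> 		cur->next = NULL;
>>
>> 		ret = ib_post_send(qp, first, &bad_wr);
>> 		if (ret)
>> 			return ret;
>>
>> 		/* stay within the negotiated initiator depth */
>> 		wait_for_chunk_completion(qp);
>>
>> 		first = next;
>> 	}
>> 	return 0;
>> }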
>>
>> Tom, is this assumption correct?
>>
>> Thanks!
>> metze
>>