Linux CIFS filesystem development
From: Stefan Metzmacher <metze@samba.org>
To: Namjae Jeon <linkinjeon@kernel.org>
Cc: Tom Talpey <tom@talpey.com>,
	"linux-cifs@vger.kernel.org" <linux-cifs@vger.kernel.org>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>
Subject: Re: Problem with smbdirect rw credits and initiator_depth
Date: Thu, 4 Dec 2025 10:39:58 +0100	[thread overview]
Message-ID: <f59e0dc7-e91c-4a13-8d49-fe183c10b6f4@samba.org> (raw)
In-Reply-To: <CAKYAXd9p=7BzmSSKi5n41OKkkw4qrr4cWpWet7rUfC+VT-6h1g@mail.gmail.com>

Hi Namjae,

> Okay, It seems like the issue has been improved in your v3 branch. If
> you send the official patches, I will test it more.

It's good to have verified that for-6.18/ksmbd-smbdirect-regression-v3
on a 6.18 kernel behaves the same as 6.17.9, as transport_rdma.c
is identical, but it doesn't really allow forward progress on
the Mellanox problem.

Can you at least post the dmesg output generated by this:
https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=7e724ebc58e986f4e101a55f4ab5e96912239918
Assuming that this wasn't triggered:
if (WARN_ONCE(needed > max_possible, "needed:%u > max:%u\n", needed, max_possible))

Did you run the bpftrace command? Did it print a lot of
'smb_direct_rdma_xmit' messages over the whole time of the file copy?

Did you actually copy a file to or from the server?

Have you actually tested for-6.18/ksmbd-smbdirect-regression-v2,
as requested? I was hoping it would behave the same way as
for-6.18/ksmbd-smbdirect-regression-v3, but with only a single
patch reverted.

I'll continue fixing the general problem so that this works
for non-Mellanox setups too, as it seems it never worked at all :-(

Were you testing with RoCEv2 or InfiniBand?

I think moving forward for Mellanox setups requires these steps:
- Test v1 vs. v2 and see whether smb_direct_rdma_xmit is actually
   called at all, and check the dmesg output.
- Test with Mellanox RoCEv2 on the client and rxe on
   the server, so that we can create a network capture with tcpdump.

Thanks!
metze

> Thanks.
> 
> On Thu, Dec 4, 2025 at 3:18 AM Stefan Metzmacher <metze@samba.org> wrote:
>>
>> Hi Namjae,
>>
>> I found out why the 6.17.9 code of transport_rdma.c deadlocks
>> with a Windows client when using irdma in RoCE mode, while the 6.18
>> code works fine.
>>
>> irdma/roce in 6.17.9 code => deadlock in wait_for_rw_credits()
>> [   T8653] ksmbd: smb_direct: initiator_depth:8 peer_initiator_depth:16
>> [   T8653] ksmbd: smb_direct: max_rw_credits:9
>> [   T7013] ------------[ cut here ]------------
>> [   T7013] needed:31 > max:9
>> [   T7013] WARNING: CPU: 1 PID: 7013 at transport_rdma.c:975 wait_for_credits+0x3b8/0x430 [ksmbd]
>>
>> When the client starts to send an array with a larger number of smb2_buffer_desc_v1
>> elements in a single SMB2 write request (most likely 31 in the example above),
>> wait_for_rw_credits() will simply deadlock, as only 9 credits are possible
>> and 31 are requested.
>>
>> In the 6.18 code we have commit 0bd73ae09ba1b73137d0830b21820d24700e09b1
>> smb: server: allocate enough space for RW WRs and ib_drain_qp()
>>
>> It makes sure we allocate qp_attr.cap.max_rdma_ctxs and qp_attr.cap.max_send_wr
>> correctly. qp_attr.cap.max_rdma_ctxs was filled from sc->rw_io.credits.max before,
>> so I changed sc->rw_io.credits.max, but those might need to be split from
>> each other.
>>
>> After that change we no longer deadlock when the client starts sending
>> larger SMB2 writes with a larger number of smb2_buffer_desc_v1 elements;
>> 159 credits are more than enough.
>>
>> irdma/roce:
>> [   T6505] ksmbd: smb_direct: initiator_depth:8 peer_initiator_depth:16
>> ...
>> [   T6505] ksmbd: smb_direct: sc->rw_io.credits.num_pages=13 sc->rw_io.credits.max:159
>>
>> My current theory about the Mellanox problem is that the number of pending
>> RDMA Read operations should be limited by the negotiated initiator_depth, which is at most
>> SMB_DIRECT_CM_INITIATOR_DEPTH (8), and that we're overflowing the hardware limits by
>> posting too many RDMA Read SQEs.
>>
>> The change in 0bd73ae09ba1b73137d0830b21820d24700e09b1 didn't change the
>> resulting values of sc->rw_io.credits.max for iwarp devices, it only adjusted
>> the number for qp_attr.cap.max_send_wr.
>>
>> So for iwarp we deadlock in both versions of transport_rdma.c when
>> the client starts to send an array of 17 smb2_buffer_desc_v1 elements
>> (I was able to see that using siw on the server, so that tcpdump was
>> able to capture it, see:
>> https://www.samba.org/~metze/caps/smb2/rdma/linux-6.18-regression/2025-12-03/rdma1-siw-r6.18-ace-fixed-hang-01-stream13.pcap.gz ).
>> With roce the client directly uses 17:
>> https://www.samba.org/~metze/caps/smb2/rdma/linux-6.18-regression/2025-12-03/rdma1-rxe-r6.18-race-fixed-rw-credits-reverted-hang-01.pcap.gz
>>
>> The first few SMB2 writes use 2 smb2_buffer_desc_v1 elements and at the end
>> the Windows client switches to 17 smb2_buffer_desc_v1 elements.
>>
>> irdma/iwarp:
>> [Wed Dec  3 13:45:22 2025] [   T7621] ksmbd: smb_direct: initiator_depth:8 peer_initiator_depth:127
>> ..
>> [Wed Dec  3 13:45:22 2025] [   T7621] ksmbd: smb_direct: sc->rw_io.credits.num_pages=256 sc->rw_io.credits.max:9
>> ...
>> [Wed Dec  3 13:45:22 2025] [   T8638] ------------[ cut here ]------------
>> [Wed Dec  3 13:45:22 2025] [   T8638] needed:17 > max:9
>>
>>
>> siw/iwarp:
>> [Wed Dec  3 13:49:30 2025] [   T7621] ksmbd: smb_direct: initiator_depth:8 peer_initiator_depth:16
>> ...
>> [Wed Dec  3 13:49:30 2025] [   T7621] ksmbd: smb_direct: sc->rw_io.credits.num_pages=256 sc->rw_io.credits.max:9
>> ...
>> [Wed Dec  3 13:49:30 2025] [   T9353] ------------[ cut here ]------------
>> [Wed Dec  3 13:49:30 2025] [   T9353] needed:17 > max:9
>>
>> I've prepared 3 branches for testing:
>>
>> for-6.18/ksmbd-smbdirect-regression-v1
>> https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/for-6.18/ksmbd-smbdirect-regression-v1
>>
>> This has some pr_notice() messages and a WARN_ONCE() for when the wait in wait_for_rw_credits() happens.
>>
>> for-6.18/ksmbd-smbdirect-regression-v2
>> https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/for-6.18/ksmbd-smbdirect-regression-v2
>>
>> This is based on for-6.18/ksmbd-smbdirect-regression-v1 but reverts
>> commit 0bd73ae09ba1b73137d0830b21820d24700e09b1, this might fix your setup.
>>
>> for-6.18/ksmbd-smbdirect-regression-v3
>> https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/for-6.18/ksmbd-smbdirect-regression-v3
>>
>> This reverts everything to the state of v6.17.9, plus
>> some pr_notice() messages and a WARN_ONCE() for when the wait in wait_for_rw_credits() happens.
>>
>> Can you please test them, prioritizing
>> for-6.18/ksmbd-smbdirect-regression-v2 first, and the others if you have
>> more time?
>>
>> I typically use this running on a 6.18 kernel:
>> modprobe ksmbd
>> ksmbd.control -s
>> rmmod ksmbd
>> cd fs/smb/server
>> make -j$(getconf _NPROCESSORS_ONLN) -C /lib/modules/$(uname -r)/build M=$(pwd) KBUILD_MODPOST_WARN=1 modules
>> insmod ksmbd.ko
>> ksmbd.mountd
>>
>> Then in one window:
>> bpftrace -e 'kprobe:smb_direct_rdma_xmit { printf("%s: %s pid=%d %s\n", strftime("%F %H:%M:%S", nsecs(sw_tai)), comm, pid, func); }'
>> And in another window:
>> dmesg -T -w
>>
>>
>> I assume the solution is to change smb_direct_rdma_xmit so that
>> it doesn't try to get credits for all RDMA read/write requests at once.
>> Instead, after collecting all ib_send_wr structures from all rdma_rw_ctx_wrs()
>> calls, we chunk the list to stay within the negotiated initiator depth
>> before passing it to ib_post_send().
>>
>> At least we need to limit this for RDMA Read requests; for RDMA Write requests
>> we may not need to chunk and could post them all together, but chunking might
>> still be good in order to avoid blocking concurrent RDMA sends.
>>
>> Tom, is this assumption correct?
>>
>> Thanks!
>> metze
>>


Thread overview: 24+ messages
2025-12-03 18:18 Problem with smbdirect rw credits and initiator_depth Stefan Metzmacher
2025-12-04  0:07 ` Namjae Jeon
2025-12-04  9:39   ` Stefan Metzmacher [this message]
2025-12-05  2:33     ` Namjae Jeon
2025-12-05 12:21       ` Namjae Jeon
2025-12-08 16:13         ` Stefan Metzmacher
2025-12-10 16:42         ` Stefan Metzmacher
2025-12-11 19:38           ` Stefan Metzmacher
2025-12-12  9:58             ` Stefan Metzmacher
2025-12-12 15:35               ` Stefan Metzmacher
2025-12-13  2:14                 ` Namjae Jeon
2025-12-14 22:56                   ` Stefan Metzmacher
2025-12-15 20:17                     ` Stefan Metzmacher
2025-12-16 23:59                       ` Namjae Jeon
2026-01-14 18:13                       ` Stefan Metzmacher
2026-01-15  2:01                         ` Namjae Jeon
2026-01-15  9:50                           ` Stefan Metzmacher
2026-01-16 23:08                             ` Stefan Metzmacher
2026-01-17 13:15                               ` Stefan Metzmacher
2026-01-18  8:03                                 ` Namjae Jeon
2026-01-19 17:28                                   ` Stefan Metzmacher
2026-01-19 19:17                                     ` Stefan Metzmacher
2025-12-08 16:02       ` Stefan Metzmacher
2025-12-04  9:57 ` Stefan Metzmacher
