Date: Thu, 4 Dec 2025 10:39:58 +0100
Subject: Re: Problem with smbdirect rw credits and initiator_depth
From: Stefan Metzmacher
To: Namjae Jeon
Cc: Tom Talpey, linux-cifs@vger.kernel.org, linux-rdma@vger.kernel.org
References: <35eec2e6-bf37-43d6-a2d8-7a939a68021b@samba.org>
X-Mailing-List: linux-cifs@vger.kernel.org

Hi Namjae,

> Okay, it seems like the issue has been improved in your v3 branch. If
> you send the official patches, I will test it more.

It's good to have verified that for-6.18/ksmbd-smbdirect-regression-v3
on a 6.18 kernel behaves the same as 6.17.9 (transport_rdma.c is
identical), but that doesn't really allow forward progress on the
Mellanox problem.

Can you at least post the dmesg output generated by this:
https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=7e724ebc58e986f4e101a55f4ab5e96912239918

Assuming that this wasn't triggered:

    if (WARN_ONCE(needed > max_possible,
                  "needed:%u > max:%u\n", needed, max_possible))

Did you run the bpftrace command? Did it print a lot of
'smb_direct_rdma_xmit' messages over the whole time of the file copy?
Did you actually copy a file to or from the server?

Have you actually tested for-6.18/ksmbd-smbdirect-regression-v2, as
requested? I was hoping it would behave the same as
for-6.18/ksmbd-smbdirect-regression-v3, but with only a single patch
reverted.

I'll continue to fix the general problem, so that this also works for
non-Mellanox setups, as it seems it never worked at all :-(

Were you testing with RoCEv2 or InfiniBand?

I think moving forward for Mellanox setups requires these steps:
- Test v1 vs. v2, see whether smb_direct_rdma_xmit is actually called
  at all, and look at the dmesg output.
- Test with Mellanox RoCEv2 on the client and rxe on the server, so
  that we can create a network capture with tcpdump.

Thanks!
metze

> Thanks.
>
> On Thu, Dec 4, 2025 at 3:18 AM Stefan Metzmacher wrote:
>>
>> Hi Namjae,
>>
>> I found the problem why the 6.17.9 code of transport_rdma.c deadlocks
>> with a Windows client when using irdma in RoCE mode, while the 6.18
>> code works fine.
>>
>> irdma/roce in 6.17.9 code => deadlock in wait_for_rw_credits()
>> [ T8653] ksmbd: smb_direct: initiator_depth:8 peer_initiator_depth:16
>> [ T8653] ksmbd: smb_direct: max_rw_credits:9
>> [ T7013] ------------[ cut here ]------------
>> [ T7013] needed:31 > max:9
>> [ T7013] WARNING: CPU: 1 PID: 7013 at transport_rdma.c:975 wait_for_credits+0x3b8/0x430 [ksmbd]
>>
>> When the client starts to send an array with a larger number of
>> smb2_buffer_desc_v1 elements in a single SMB2 write request (most
>> likely 31 in the above example), wait_for_rw_credits() will simply
>> deadlock, as only 9 credits are possible and 31 are requested.
>>
>> In the 6.18 code we have commit 0bd73ae09ba1b73137d0830b21820d24700e09b1
>> smb: server: allocate enough space for RW WRs and ib_drain_qp()
>>
>> It makes sure we allocate qp_attr.cap.max_rdma_ctxs and
>> qp_attr.cap.max_send_wr correctly.
>> qp_attr.cap.max_rdma_ctxs was filled from sc->rw_io.credits.max
>> before, so I changed sc->rw_io.credits.max, but the two might need
>> to be split from each other.
>>
>> After that change we no longer deadlock when the client starts
>> sending larger SMB2 writes with a larger number of
>> smb2_buffer_desc_v1 elements; 159 credits are more than enough.
>>
>> irdma/roce:
>> [ T6505] ksmbd: smb_direct: initiator_depth:8 peer_initiator_depth:16
>> ...
>> [ T6505] ksmbd: smb_direct: sc->rw_io.credits.num_pages=13 sc->rw_io.credits.max:159
>>
>> My current theory about the Mellanox problem is that the number of
>> pending RDMA Read operations should be limited by the negotiated
>> initiator_depth, which is at most SMB_DIRECT_CM_INITIATOR_DEPTH (8),
>> and we're overflowing the hardware limits by posting too many RDMA
>> Read SQEs.
>>
>> The change in 0bd73ae09ba1b73137d0830b21820d24700e09b1 didn't change
>> the resulting value of sc->rw_io.credits.max for iWarp devices, it
>> only adjusted the number for qp_attr.cap.max_send_wr.
>>
>> So for iWarp we deadlock with both versions of transport_rdma.c,
>> when the client starts to send an array of 17 smb2_buffer_desc_v1
>> elements. (I was able to see that using siw on the server, so that
>> tcpdump was able to capture it, see:
>> https://www.samba.org/~metze/caps/smb2/rdma/linux-6.18-regression/2025-12-03/rdma1-siw-r6.18-ace-fixed-hang-01-stream13.pcap.gz
>> With roce it's directly using 17:
>> https://www.samba.org/~metze/caps/smb2/rdma/linux-6.18-regression/2025-12-03/rdma1-rxe-r6.18-race-fixed-rw-credits-reverted-hang-01.pcap.gz)
>>
>> The first few SMB2 writes use 2 smb2_buffer_desc_v1 elements, and at
>> the end the Windows client switches to 17 smb2_buffer_desc_v1 elements.
>>
>> irdma/iwarp:
>> [Wed Dec 3 13:45:22 2025] [ T7621] ksmbd: smb_direct: initiator_depth:8 peer_initiator_depth:127
>> ...
>> [Wed Dec 3 13:45:22 2025] [ T7621] ksmbd: smb_direct: sc->rw_io.credits.num_pages=256 sc->rw_io.credits.max:9
>> ...
>> [Wed Dec 3 13:45:22 2025] [ T8638] ------------[ cut here ]------------
>> [Wed Dec 3 13:45:22 2025] [ T8638] needed:17 > max:9
>>
>> siw/iwarp:
>> [Wed Dec 3 13:49:30 2025] [ T7621] ksmbd: smb_direct: initiator_depth:8 peer_initiator_depth:16
>> ...
>> [Wed Dec 3 13:49:30 2025] [ T7621] ksmbd: smb_direct: sc->rw_io.credits.num_pages=256 sc->rw_io.credits.max:9
>> ...
>> [Wed Dec 3 13:49:30 2025] [ T9353] ------------[ cut here ]------------
>> [Wed Dec 3 13:49:30 2025] [ T9353] needed:17 > max:9
>>
>> I've prepared 3 branches for testing:
>>
>> for-6.18/ksmbd-smbdirect-regression-v1
>> https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/for-6.18/ksmbd-smbdirect-regression-v1
>>
>> This adds some pr_notice() messages and a WARN_ONCE() that fires when
>> the wait_for_rw_credits() deadlock would happen.
>>
>> for-6.18/ksmbd-smbdirect-regression-v2
>> https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/for-6.18/ksmbd-smbdirect-regression-v2
>>
>> This is based on for-6.18/ksmbd-smbdirect-regression-v1 but reverts
>> commit 0bd73ae09ba1b73137d0830b21820d24700e09b1; this might fix your
>> setup.
>>
>> for-6.18/ksmbd-smbdirect-regression-v3
>> https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/for-6.18/ksmbd-smbdirect-regression-v3
>>
>> This reverts everything to the state of v6.17.9, plus the same
>> pr_notice() messages and WARN_ONCE().
>>
>> Can you please test them, giving priority to
>> for-6.18/ksmbd-smbdirect-regression-v2, and the others if you have
>> more time.
>>
>> I typically use this, running on a 6.18 kernel:
>>
>> modprobe ksmbd
>> ksmbd.control -s
>> rmmod ksmbd
>> cd fs/smb/server
>> make -j$(getconf _NPROCESSORS_ONLN) -C /lib/modules/$(uname -r)/build M=$(pwd) KBUILD_MODPOST_WARN=1 modules
>> insmod ksmbd.ko
>> ksmbd.mountd
>>
>> Then in one window:
>>
>> bpftrace -e 'kprobe:smb_direct_rdma_xmit { printf("%s: %s pid=%d %s\n", strftime("%F %H:%M:%S", nsecs(sw_tai)), comm, pid, func); }'
>>
>> And in another window:
>>
>> dmesg -T -w
>>
>> I assume the solution is to change smb_direct_rdma_xmit() so that it
>> doesn't try to get credits for all RDMA read/write requests at once.
>> Instead, after collecting all ib_send_wr structures from all
>> rdma_rw_ctx_wrs() calls, we chunk the list to stay within the
>> negotiated initiator depth before passing it to ib_post_send().
>>
>> At least we need to limit this for RDMA read requests; for RDMA
>> write requests we may not need to chunk and could post them all
>> together, but chunking might still be good in order to avoid
>> blocking concurrent RDMA sends.
>>
>> Tom, is this assumption correct?
>>
>> Thanks!
>> metze
>>