From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id C4821C41513
	for ; Wed, 18 Oct 2023 08:17:01 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S230047AbjJRIRB (ORCPT ); Wed, 18 Oct 2023 04:17:01 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48722 "EHLO
	lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S230013AbjJRIRA (ORCPT );
	Wed, 18 Oct 2023 04:17:00 -0400
Received: from out-199.mta0.migadu.com (out-199.mta0.migadu.com [IPv6:2001:41d0:1004:224b::c7])
	by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CEFF3B0
	for ; Wed, 18 Oct 2023 01:16:55 -0700 (PDT)
Message-ID: <2a5e1fb6-6c73-4d25-b29a-4ccdbf2c5678@linux.dev>
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1;
	t=1697617014;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=3bQEjPXiHHLTebOYkrYW4EJTly3aDO83LJe8sSOMjPc=;
	b=X1VMcJ7fputuUF9CLL7/Yf+ZsZbS/weac66CDri+8uH7kkSx5JY8Dw+/7NEdLjPDZ3Z8ZK
	WQkA9pW6u3XXAxzANN6786COQzq2vGqZZFc3U9v/epGhLGCeIOEefPgDVasMlotUigjC1x
	NOSwotrQS3KIzhZQYkEnlQY6Skm5Lto=
Date: Wed, 18 Oct 2023 16:16:45 +0800
MIME-Version: 1.0
Subject: Re: [bug report] blktests srp/002 hang
To: Bob Pearson, "Daisuke Matsuda (Fujitsu)", 'Bart Van Assche',
	'Rain River'
Cc: Jason Gunthorpe, "leon@kernel.org", Shinichiro Kawasaki,
	RDMA mailing list, "linux-scsi@vger.kernel.org"
References: <6fc3b524-af7d-43ce-aa05-5c44ec850b9b@acm.org>
	<02d7cbf2-b17b-488a-b6e9-ebb728b51c94@acm.org>
	<8aff9124-85c0-8e3b-dc35-1017b1540037@gmail.com>
	<3c84da83-cdbb-3326-b3f0-b2dee5f014e0@linux.dev>
	<4e7aac82-f006-aaa7-6769-d1c9691a0cec@gmail.com>
	<29c5de53-cc61-4efc-8e8d-690e27756a16@acm.org>
X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers.
From: Zhu Yanjun
In-Reply-To:
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Migadu-Flow: FLOW_OUT
Precedence: bulk
List-ID:
X-Mailing-List: linux-rdma@vger.kernel.org

On 2023/10/18 1:09, Bob Pearson wrote:
> On 9/25/23 20:17, Daisuke Matsuda (Fujitsu) wrote:
>> On Tue, Sep 26, 2023 12:01 AM Bart Van Assche:
>>> On 9/24/23 21:47, Daisuke Matsuda (Fujitsu) wrote:
>>>> As Bob wrote above, nobody has found any logical failure in rxe
>>>> driver.
>>> That's wrong. In case you would not yet have noticed my latest email in
>>> this thread, please take a look at
>>> https://lore.kernel.org/linux-rdma/e8b76fae-780a-470e-8ec4-c6b650793d10@leemhuis.info/T/#m0fd8ea8a4cbc27b37b042ae4f8e9b024f1871a73.
>>> I think the report in that email is a 100% proof that there is a
>>> use-after-free issue in the rdma_rxe driver. Use-after-free issues have
>>> security implications and also can cause data corruption. I propose to
>>> revert the commit that introduced the rdma_rxe use-after-free unless
>>> someone comes up with a fix for the rdma_rxe driver.
>>>
>>> Bart.
>> Thank you for the clarification. I see your intention.
>> I hope the hang issue will be resolved by addressing this.
>>
>> Thanks,
>> Daisuke
>>
> I have made some progress in understanding the cause of the srp/002 etc. hang.
>
> The two attached files are traces of activity for two qp's, qp#151 and qp#167. In my runs of srp/002,
> all the qp's before 167 pass and all the qp's from 167 onward fail; 167 is the first to fail.
>
> It turns out that all the passing qp's call srp_post_send() some number of times and also call
> srp_send_done() the same number of times.
> Starting at qp#167 the last call to srp_send_done() does
> not take place, leaving the srp driver waiting for the final completion and, I believe, causing the hang.

Thanks, Bob.

I will delve into your findings and the source code to find the root cause.

BTW, which Linux distribution are you using to reproduce this? Ubuntu, Fedora or Debian?

From the above, this problem is sometimes difficult to reproduce on Ubuntu, but it can be reproduced on Ubuntu and Debian. So can you let me know which Linux distribution you are using?

Thanks,
Zhu Yanjun

>
> There are four cq's involved in each pair of qp's in the srp test: two in ib_srp and two in ib_srpt
> for the two qp's. Three of them execute completion processing in a soft irq context, so the code in
> core/cq.c gathers the completions and calls back to the srp drivers. The send side cq in srp uses
> cq_direct, which requires srp to call ib_process_cq_direct() in order to collect the completions. This
> happens in __srp_get_tx_iu(), which is called in several places in the srp driver, but only as a side effect,
> since the purpose of this routine is to get an iu to start a new command.
>
> In the attached files, for qp#151 the final call to srp_post_send() is followed by the rxe requester and
> completer work queues processing the send packet and the ack before a final call to __srp_get_tx_iu(),
> which gathers the final send side completion, and the test succeeds.
>
> For qp#167 the call to srp_post_send() is followed by the rxe driver processing the send operation and
> generating a work completion which is posted to the send cq, but there is never a following call to
> __srp_get_tx_iu(), so the cqe is not received by srp, and the test fails.
>
> I don't yet understand the logic of the srp driver well enough to fix this, but the problem is not in the rxe driver
> as far as I can tell.
>
> Bob
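
For illustration only, here is a toy model of the behaviour Bob describes above. It is not the ib_srp or rxe code: the names DirectCQ, Initiator and run, the polling budget of 4, and the assumption that every send completes immediately are all made up for this sketch. It only shows how a send completion can sit unreaped on a directly polled cq when nothing asks for another iu afterwards, so that the count of post calls and the count of done calls diverge by one.

```python
from collections import deque


class DirectCQ:
    """Toy completion queue that is drained only when the consumer polls it."""

    def __init__(self):
        self.pending = deque()  # completions posted by the provider
        self.done_calls = 0     # completion handlers that have actually run

    def post_completion(self, wr_id):
        # The provider posts a completion; no handler runs at this point.
        self.pending.append(wr_id)

    def process_direct(self, budget):
        # Hypothetical stand-in for polling the cq from the consumer's context.
        handled = 0
        while self.pending and handled < budget:
            self.pending.popleft()
            self.done_calls += 1  # stand-in for the send-done callback
            handled += 1
        return handled


class Initiator:
    """Toy consumer whose send cq is reaped only as a side effect."""

    def __init__(self):
        self.send_cq = DirectCQ()
        self.post_calls = 0

    def get_tx_iu(self):
        # Asking for a new iu is the only place the send cq gets polled.
        self.send_cq.process_direct(budget=4)

    def post_send(self, wr_id):
        self.post_calls += 1
        # Assumption: the send completes immediately in this model.
        self.send_cq.post_completion(wr_id)


def run(commands, final_get_tx_iu):
    """Return (post_calls, done_calls) after issuing `commands` sends."""
    ini = Initiator()
    for wr_id in range(commands):
        ini.get_tx_iu()
        ini.post_send(wr_id)
    if final_get_tx_iu:
        ini.get_tx_iu()
    return ini.post_calls, ini.send_cq.done_calls
```

In this model, run(3, True) reaps every completion, while run(3, False) leaves the last one pending. That matches the shape of the qp#151 and qp#167 traces, but it says nothing about why the final poll is missing in the real driver.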