From: Zhu Yanjun <yanjun.zhu@linux.dev>
To: "Daisuke Matsuda (Fujitsu)" <matsuda-daisuke@fujitsu.com>,
'Rain River' <rain.1986.08.12@gmail.com>,
Bob Pearson <rpearsonhpe@gmail.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>,
"leon@kernel.org" <leon@kernel.org>,
Bart Van Assche <bvanassche@acm.org>,
Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>,
RDMA mailing list <linux-rdma@vger.kernel.org>,
"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>
Subject: Re: [bug report] blktests srp/002 hang
Date: Tue, 26 Sep 2023 14:09:28 +0800
Message-ID: <02d61fa2-9222-a071-8442-ef43a3aa74a2@linux.dev>
In-Reply-To: <OS7PR01MB11804B7BFCD8A3DF78E51DD5CE5C3A@OS7PR01MB11804.jpnprd01.prod.outlook.com>
On 2023/9/26 9:09, Daisuke Matsuda (Fujitsu) wrote:
> On Mon, Sep 25, 2023 11:31 PM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
>> On 2023/9/25 12:47, Daisuke Matsuda (Fujitsu) wrote:
>>> On Sun, Sep 24, 2023 10:18 AM Rain River wrote:
>>>> On Sat, Sep 23, 2023 at 2:14 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>>>>> On 9/21/23 10:10, Zhu Yanjun wrote:
>>>>>> On 2023/9/21 22:39, Bob Pearson wrote:
>>>>>>> On 9/21/23 09:23, Rain River wrote:
>>>>>>>> On Thu, Sep 21, 2023 at 2:53 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>>>>>>>>> On 9/20/23 12:22, Bart Van Assche wrote:
>>>>>>>>>> On 9/20/23 10:18, Bob Pearson wrote:
>>>>>>>>>>> But I have also seen the same behavior in the siw driver which is
>>>>>>>>>>> completely independent.
>>>>>>>>>> Hmm ... I haven't seen any hangs yet with the siw driver.
>>>>>>>>> I was on Ubuntu 6-9 months ago. Currently I don't see hangs on either.
>>>>>>>>>>> As mentioned above at the moment Ubuntu is failing rarely. But it used to fail reliably (srp/002 about 75% of
>>>>>>>>>>> the time and srp/011 about 99% of the time.) There haven't been any changes to rxe to explain this.
>>>>>>>>>> I think that Zhu mentioned commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue
>>>>>>>>>> support for rxe tasks")?
>>>>>>>>> That change happened well before the failures went away. I was seeing failures at the same rate with tasklets
>>>>>>>>> and wqs. But after updating Ubuntu and the kernel at some point they all went away.
>>>>>>>> I ran tests on the latest Ubuntu with the latest kernel, v6.6-rc2, with
>>>>>>>> commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks")
>>>>>>>> reverted.
>>>>>>>> I ran blktests about 30 times and this problem did not occur.
>>>>>>>>
>>>>>>>> So I confirm that this hang does not occur on Ubuntu without commit
>>>>>>>> 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks").
>>>>>>>>
>>>>>>>> Nanthan
>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Bart.
>>>>>>> This commit is very important for several reasons. It is needed for the ODP implementation
>>>>>>> that is in the works from Daisuke Matsuda and also for QP scaling of performance. The work
>>>>>>> queue implementation scales well with increasing qp number while the tasklet implementation
>>>>>>> does not. This is critical for the driver's use in large scale storage applications. So, if
>>>>>>> there is a bug in the work queue implementation it needs to be fixed not reverted.
>>>>>>>
>>>>>>> I am still hoping that someone will diagnose what is causing the ULPs to hang in terms of
>>>>>>> something missing causing it to wait.
>>>>>> Hi, Bob
>>>>>>
>>>>>>
>>>>>> You submitted this commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks").
>>>>>>
>>>>>> You should be very familiar with this commit.
>>>>>>
>>>>>> And this commit causes a regression.
>>>>>>
>>>>>> So you should delve into the source code to find the root cause, then fix it.
>>>>> Zhu,
>>>>>
>>>>> I have spent tons of time over the months trying to figure out what is happening with blktests.
>>>>> As I have mentioned several times I have seen the same exact failure in siw in the past although
>>>>> currently that doesn't seem to happen so I had been suspecting that the problem may be in the ULP.
>>>>> The challenge is that the blktests represents a huge stack of software much of which I am not
>>>>> familiar with. The bug is a hang in layers above the rxe driver and so far no one has been able to
>>>>> say with any specificity that the rxe driver failed to do something needed to make progress or violated
>>>>> expected behavior. Without any clue as to where to look it has been hard to make progress.
>>>> Bob
>>>>
>>>> The work queue will sleep. If the work queue sleeps for a long time, the
>>>> packets will not be sent to the ULP. This is why this hang occurs.
>>> In general a work queue can sleep, but the workload running in the rxe driver
>>> should not sleep because it was originally running on a tasklet and was converted
>>> to use a work queue. A task can sometimes take longer because of IRQs, but
>>> the same thing can also happen with a tasklet. If there is a difference between
>>> the two, I think it would be the overhead of scheduling the work queue.
>>>
>>>> It is difficult to handle this sleep in the work queue. It would be better
>>>> to revert this commit in RXE.
>>> I object to reverting the commit at this stage. As Bob wrote above,
>>> nobody has found any logical failure in rxe driver. It is quite possible
>>> that the patch is just revealing a latent bug in the higher layers.
>>
>> So far, on Debian and Fedora, all the tests with the work queue hang.
>> And after reverting this commit, no hang occurs.
>>
>> Until there are new test results, it is reasonable to suspect that this
>> commit causes the hang.
>
> If the hang *always* occurs, then I agree your opinion is correct,
Regarding the hang tests, please read through the whole discussion. Several
engineers ran tests on Debian, Fedora and Ubuntu and confirmed these test
results.

Zhu Yanjun
> but this one happens occasionally. It is also natural to think that
> the commit makes it easier to meet the condition of an existing bug.
>
>>
>>>
>>>> Because the work queue sleeps, the ULP cannot wait a long time for the
>>>> packets. If packets cannot reach the ULPs for a long time, many problems
>>>> will occur in the ULPs.
>>> I wonder where in the rxe driver it sleeps. BTW, most packets are
>>> processed in NET_RX_IRQ context, and work queue is scheduled only
>>
>> Do you mean NET_RX_SOFTIRQ?
>
> Yes. I am sorry for confusing you.
>
> Thanks,
> Daisuke
>
>>
>> Zhu Yanjun
>>
>>> when there is already a running context. If your speculation is to the point,
>>> the hang will occur more frequently if we change it to use work queue exclusively.
>>> My ODP patches include a change to do this.
>>> Cf. https://lore.kernel.org/lkml/7699a90bc4af10c33c0a46ef6330ed4bb7e7ace6.1694153251.git.matsuda-daisuke@fujitsu.com/
>>>
>>> Thanks,
>>> Daisuke
>>>
>>>>> My main motivation is making Lustre run on rxe and it does and it's fast enough to meet our needs.
>>>>> Lustre is similar to srp as a ULP and in all of our testing we have never seen a similar hang. Other
>>>>> hangs to be sure but not this one. I believe that this bug will never get resolved until someone with
>>>>> a good understanding of the ULP drivers makes an effort to find out where and why the hang is occurring.
>>>>> From there it should be straightforward to fix the problem. I am continuing to investigate and am learning
>>>>> the device-manager/multipath/srp/scsi stack but I have a long ways to go.
>>>>>
>>>>> Bob
>>>>>
>>>>>
>>>>>>
>>>>>> Jason && Leon, please comment on this.
>>>>>>
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Zhu Yanjun
>>>>>>
>>>>>>> Bob