public inbox for linux-rdma@vger.kernel.org
From: Zhu Yanjun <yanjun.zhu@linux.dev>
To: "Daisuke Matsuda (Fujitsu)" <matsuda-daisuke@fujitsu.com>,
	'Rain River' <rain.1986.08.12@gmail.com>,
	Bob Pearson <rpearsonhpe@gmail.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>,
	"leon@kernel.org" <leon@kernel.org>,
	Bart Van Assche <bvanassche@acm.org>,
	Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>,
	RDMA mailing list <linux-rdma@vger.kernel.org>,
	"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>
Subject: Re: [bug report] blktests srp/002 hang
Date: Tue, 26 Sep 2023 14:09:28 +0800
Message-ID: <02d61fa2-9222-a071-8442-ef43a3aa74a2@linux.dev>
In-Reply-To: <OS7PR01MB11804B7BFCD8A3DF78E51DD5CE5C3A@OS7PR01MB11804.jpnprd01.prod.outlook.com>

On 2023/9/26 9:09, Daisuke Matsuda (Fujitsu) wrote:
> On Mon, Sep 25, 2023 11:31 PM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
>> On 2023/9/25 12:47, Daisuke Matsuda (Fujitsu) wrote:
>>> On Sun, Sep 24, 2023 10:18 AM Rain River wrote:
>>>> On Sat, Sep 23, 2023 at 2:14 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>>>>> On 9/21/23 10:10, Zhu Yanjun wrote:
>>>>>> On 2023/9/21 22:39, Bob Pearson wrote:
>>>>>>> On 9/21/23 09:23, Rain River wrote:
>>>>>>>> On Thu, Sep 21, 2023 at 2:53 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>>>>>>>>> On 9/20/23 12:22, Bart Van Assche wrote:
>>>>>>>>>> On 9/20/23 10:18, Bob Pearson wrote:
>>>>>>>>>>> But I have also seen the same behavior in the siw driver which is
>>>>>>>>>>> completely independent.
>>>>>>>>>> Hmm ... I haven't seen any hangs yet with the siw driver.
>>>>>>>>> I was on Ubuntu 6-9 months ago. Currently I don't see hangs on either.
>>>>>>>>>>> As mentioned above at the moment Ubuntu is failing rarely. But it used to fail reliably (srp/002 about 75% of
>>>>>>>>>>> the time and srp/011 about 99% of the time.) There haven't been any changes to rxe to explain this.
>>>>>>>>>> I think that Zhu mentioned commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue
>>>>>>>>>> support for rxe tasks")?
>>>>>>>>> That change happened well before the failures went away. I was seeing failures at the same rate with tasklets
>>>>>>>>> and wqs. But after updating Ubuntu and the kernel at some point they all went away.
>>>>>>>> I ran tests on the latest Ubuntu with the latest kernel, v6.6-rc2, with
>>>>>>>> commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks")
>>>>>>>> reverted.
>>>>>>>> I ran the blktests about 30 times, and the problem did not occur.
>>>>>>>>
>>>>>>>> So I confirm that on Ubuntu, the hang does not occur without commit
>>>>>>>> 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks").
>>>>>>>>
>>>>>>>> Nanthan
>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Bart.
>>>>>>> This commit is very important for several reasons. It is needed for the ODP implementation
>>>>>>> that is in the works from Daisuke Matsuda and also for QP scaling of performance. The work
>>>>>>> queue implementation scales well with increasing qp number while the tasklet implementation
>>>>>>> does not. This is critical for the driver's use in large scale storage applications. So, if
>>>>>>> there is a bug in the work queue implementation it needs to be fixed not reverted.
>>>>>>>
>>>>>>> I am still hoping that someone will diagnose what is causing the ULPs to hang in terms of
>>>>>>> something missing causing it to wait.
>>>>>> Hi, Bob
>>>>>>
>>>>>>
>>>>>> You submitted this commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks").
>>>>>>
>>>>>> You should be very familiar with this commit.
>>>>>>
>>>>>> And this commit causes a regression.
>>>>>>
>>>>>> So you should delve into the source code to find the root cause, then fix it.
>>>>> Zhu,
>>>>>
>>>>> I have spent tons of time over the months trying to figure out what is happening with blktests.
>>>>> As I have mentioned several times I have seen the same exact failure in siw in the past although
>>>>> currently that doesn't seem to happen so I had been suspecting that the problem may be in the ULP.
>>>>> The challenge is that the blktests represents a huge stack of software much of which I am not
>>>>> familiar with. The bug is a hang in layers above the rxe driver and so far no one has been able to
>>>>> say with any specificity the rxe driver failed to do something needed to make progress or violated
>>>>> expected behavior. Without any clue as to where to look it has been hard to make progress.
>>>> Bob
>>>>
>>>> The work queue will sleep. If the work queue sleeps for a long time, the
>>>> packets will not be sent to the ULP. This is why this hang occurs.
>>> In general, a work queue can sleep, but the workload running in the rxe driver
>>> should not sleep, because it originally ran in a tasklet and was converted
>>> to use a work queue. A task can sometimes take longer because of IRQs, but
>>> the same thing can also happen with a tasklet. If there is a difference between
>>> the two, I think it would be the overhead of scheduling the work queue.
>>>
>>>> It is difficult to handle this sleep in the work queue. It would be better
>>>> to revert this commit in RXE.
>>> I object to reverting the commit at this stage. As Bob wrote above,
>>> nobody has found any logical failure in the rxe driver. It is quite possible
>>> that the patch is just revealing a latent bug in the higher layers.
>>
>> So far, on Debian and Fedora, all the tests with the work queue hang,
>> and after reverting this commit, no hang occurs.
>>
>> Until there are new test results, it is reasonable to suspect that this
>> commit causes the hang.
> 
> If the hang *always* occurs, then I agree your opinion is correct,

Regarding the hang tests, please read through the whole discussion. Several
engineers ran tests on Debian, Fedora and Ubuntu and confirmed these
results.

Zhu Yanjun

> but this one happens occasionally. It is also natural to think that
> the commit makes it easier to meet the condition of an existing bug.
> 
>>
>>>
>>>> Because the work queue sleeps, the ULP cannot wait a long time for the
>>>> packets. If packets cannot reach the ULPs for a long time, many problems
>>>> will occur in the ULPs.
>>> I wonder where in the rxe driver it sleeps. BTW, most packets are
>>> processed in NET_RX_IRQ context, and the work queue is scheduled only
>>
>> Do you mean NET_RX_SOFTIRQ?
> 
> Yes. I am sorry for confusing you.
> 
> Thanks,
> Daisuke
> 
>>
>> Zhu Yanjun
>>
>>> when there is already a running context. If your speculation is correct,
>>> the hang will occur more frequently if we change it to use the work queue exclusively.
>>> My ODP patches include a change to do this.
>>> Cf.
>>> https://lore.kernel.org/lkml/7699a90bc4af10c33c0a46ef6330ed4bb7e7ace6.1694153251.git.matsuda-daisuke@fujitsu.com/
>>>
>>> Thanks,
>>> Daisuke
>>>
>>>>> My main motivation is making Lustre run on rxe and it does and it's fast enough to meet our needs.
>>>>> Lustre is similar to srp as a ULP and in all of our testing we have never seen a similar hang. Other
>>>>> hangs to be sure but not this one. I believe that this bug will never get resolved until someone with
>>>>> a good understanding of the ULP drivers makes an effort to find out where and why the hang is occurring.
>>>>> From there it should be straightforward to fix the problem. I am continuing to investigate and am learning
>>>>> the device-manager/multipath/srp/scsi stack but I have a long ways to go.
>>>>>
>>>>> Bob
>>>>>
>>>>>
>>>>>>
>>>>>> Jason && Leon, please comment on this.
>>>>>>
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Zhu Yanjun
>>>>>>
>>>>>>> Bob


