From: Bob Pearson <rpearsonhpe@gmail.com>
To: Yanjun Zhu <yanjun.zhu@linux.dev>,
	Bart Van Assche <bvanassche@acm.org>,
	Zhu Yanjun <zyjzyj2000@gmail.com>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	Bernard Metzler <bmt@zurich.ibm.com>,
	Jason Gunthorpe <jgg@nvidia.com>
Subject: Re: Apparent regression in blktests since 5.18-rc1+
Date: Sat, 7 May 2022 08:40:09 -0500
Message-ID: <8a05c359-8e2d-b88d-8741-2743be2eb779@gmail.com>
In-Reply-To: <4b0153c7-a8e9-98de-26ae-d421434a116d@linux.dev>

On 5/6/22 19:29, Yanjun Zhu wrote:
> On 2022/5/7 8:10, Bart Van Assche wrote:
>> On 5/6/22 11:11, Bob Pearson wrote:
>>> Before the most recent kernel update I had blktests running OK on rdma_rxe. Since we went on to 5.18.0-rc1+
>>> I have been experiencing hangs. All of this is with the 'revert scsi-debug' patch which addressed the
>>> 3 min timeout related to modprobe -r scsi-debug.
>>>
>>> You suggested checking with siw and I finally got around to this and the behavior is exactly the same.
>>>
>>> Specifically here is a run and dmesgs from that run:
>>>
>>> root@u-22:/home/bob/src/blktests# use_siw=1 ./check srp
>>>
>>> srp/001 (Create and remove LUNs)                             [passed]
>>>
>>>      runtime  3.388s  ...  3.501s
>>>
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq))
>>>
>>>      runtime  54.689s  ...
>>>    <HANGS HERE>
>>>
>>> I had to reboot to recover.
>>>
>>> The dmesg output is attached in a long file called out.
>>> The output looks normal until line 1875 where it hangs at an "Already connected ..." message.
>>> This is the same as the other hangs I have been seeing.
>>> This is followed by a splat warning that a cpu has hung for 120 seconds.
>>>
>>> Since this behaves the same for rxe and siw, I am going to stop chasing this bug,
>>> as it is most likely outside of the rxe driver.
>>
>> Hi Bob,
>>
>> What I see on my test setup is that the SRP tests from the blktests suite pass with
>> the SoftiWARP driver (kernel v5.18-rc5 / commit 4b97bac0756a):
>>
>> # (cd blktests && use_siw=1 ./check -q srp)
>> srp/001 (Create and remove LUNs)                             [passed]
>>      runtime  5.781s  ...  5.464s
>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>      runtime  40.772s  ...  42.039s
>> srp/003 (File I/O on top of multipath concurrently with logout and login (sq)) [not run]
>>      legacy device mapper support is missing
>> srp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-mq)) [not run]
>>      legacy device mapper support is missing
>> srp/005 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
>>      runtime  17.870s  ...  17.016s
>> srp/006 (Direct I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
>>      runtime  16.369s  ...  17.315s
>> srp/007 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=4M) [passed]
>>      runtime  16.729s  ...  17.409s
>> srp/008 (Direct I/O with large transfer sizes, cmd_sg_entries=1 and bs=8M) [passed]
>>      runtime  16.823s  ...  16.453s
>> srp/009 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=4M) [passed]
>>      runtime  17.304s  ...  17.838s
>> srp/010 (Buffered I/O with large transfer sizes, cmd_sg_entries=255 and bs=8M) [passed]
>>      runtime  17.191s  ...  17.117s
>> srp/011 (Block I/O on top of multipath concurrently with logout and login) [passed]
>>      runtime  40.835s  ...  38.728s
>> srp/012 (dm-mpath on top of multiple I/O schedulers)         [passed]
>>      runtime  23.703s  ...  24.763s
>> srp/013 (Direct I/O using a discontiguous buffer)            [passed]
>>      runtime  11.279s  ...  9.265s
>> srp/014 (Run sg_reset while I/O is ongoing)                  [passed]
>>      runtime  39.110s  ...  37.929s
>> srp/015 (File I/O on top of multipath concurrently with logout and login (mq) using the SoftiWARP (siw) driver) [passed]
>>      runtime  40.027s  ...  40.220s
>>
>> If I try to run the SRP test 002 with the soft-RoCE driver, the following appears:
>>
>> [  749.901966] ================================
>> [  749.903638] WARNING: inconsistent lock state
>> [  749.905376] 5.18.0-rc5-dbg+ #1 Not tainted
>> [  749.907039] --------------------------------
>> [  749.908699] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
>> [  749.910646] ksoftirqd/5/40 [HC0[0]:SC1[1]:HE0:SE0] takes:
>> [  749.912499] ffff88818244d350 (&xa->xa_lock#14){+.?.}-{2:2}, at: rxe_pool_get_index+0x73/0x170 [rdma_rxe]
>> [  749.914691] {SOFTIRQ-ON-W} state was registered at:
>> [  749.916648]   __lock_acquire+0x45b/0xce0
>> [  749.918599]   lock_acquire+0x18a/0x450
>> [  749.920480]   _raw_spin_lock+0x34/0x50
>> [  749.922580]   __rxe_add_to_pool+0xcc/0x140 [rdma_rxe]
>> [  749.924583]   rxe_alloc_pd+0x2d/0x40 [rdma_rxe]
>> [  749.926394]   __ib_alloc_pd+0xa3/0x270 [ib_core]
>> [  749.928579]   ib_mad_port_open+0x44a/0x790 [ib_core]
>> [  749.930640]   ib_mad_init_device+0x8e/0x110 [ib_core]
>> [  749.932495]   add_client_context+0x26a/0x330 [ib_core]
>> [  749.934302]   enable_device_and_get+0x169/0x2b0 [ib_core]
>> [  749.936217]   ib_register_device+0x26f/0x330 [ib_core]
>> [  749.938020]   rxe_register_device+0x1b4/0x1d0 [rdma_rxe]
>> [  749.939794]   rxe_add+0x8c/0xc0 [rdma_rxe]
>> [  749.941552]   rxe_net_add+0x5b/0x90 [rdma_rxe]
>> [  749.943356]   rxe_newlink+0x71/0x80 [rdma_rxe]
>> [  749.945182]   nldev_newlink+0x21e/0x370 [ib_core]
>> [  749.946917]   rdma_nl_rcv_msg+0x200/0x410 [ib_core]
>> [  749.948657]   rdma_nl_rcv+0x140/0x220 [ib_core]
>> [  749.950373]   netlink_unicast+0x307/0x460
>> [  749.952063]   netlink_sendmsg+0x422/0x750
>> [  749.953672]   __sys_sendto+0x1c2/0x250
>> [  749.955281]   __x64_sys_sendto+0x7f/0x90
>> [  749.956849]   do_syscall_64+0x35/0x80
>> [  749.958353]   entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [  749.959942] irq event stamp: 1411849
>> [  749.961517] hardirqs last  enabled at (1411848): [<ffffffff810cdb28>] __local_bh_enable_ip+0x88/0xf0
>> [  749.963338] hardirqs last disabled at (1411849): [<ffffffff81ebf24d>] _raw_spin_lock_irqsave+0x5d/0x60
>> [  749.965214] softirqs last  enabled at (1411838): [<ffffffff82200467>] __do_softirq+0x467/0x6e1
>> [  749.967027] softirqs last disabled at (1411843): [<ffffffff810cd947>] run_ksoftirqd+0x37/0x60
> For this lockdep warning, please try this patch series: news://nntp.lore.kernel.org:119/20220422194416.983549-1-yanjun.zhu@linux.dev
> 
> Zhu Yanjun
>>
>> I think the above is strong evidence that there is something wrong with the
>> soft-RoCE driver.
>>
>> Thanks,
>>
>> Bart.
> 

I was showing siw results, not rxe results. When I run srp on rxe, I apply a patch similar to the
one Zhu suggested to fix the lockdep warnings.

Bob
