* blktests failures with v6.4
@ 2023-07-07 7:27 Shinichiro Kawasaki
2023-07-09 14:32 ` Sagi Grimberg
0 siblings, 1 reply; 6+ messages in thread
From: Shinichiro Kawasaki @ 2023-07-07 7:27 UTC (permalink / raw)
To: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-scsi@vger.kernel.org
Hi all,
I ran the latest blktests (git hash: 30bdcb72ad99) with the v6.4 kernel and
observed the test case failures listed below. I call for support to fix them.
The list of failures is the same as in the report for the v6.3 kernel [1].
[1] https://lore.kernel.org/linux-block/rsmmxrchy6voi5qhl4irss5sprna3f5owkqtvybxglcv2pnylm@xmrnpfu3tfpe/
List of failures
================
#1: block/011
#2: block/024
#3: nvme/003 (fabrics transport)
#4: nvme/030 or nvme/031 (rdma transport with siw)
#5: nvme/* (fc transport)
Failure description
===================
#1: block/011
This test case shows two failure symptoms.
Symptom A:
The test case fails with NVMe devices due to the lockdep WARNING "possible
circular locking dependency detected". Reported in Sep/2022 [2], and a
solution was discussed. Waiting for a fix.
[2] https://lore.kernel.org/linux-block/20220930001943.zdbvolc3gkekfmcv@shindev/
Symptom B:
The test case occasionally fails with fio assert messages. A fio bug is
suspected. I'm preparing a fix patch [3].
block/011 => nvme0n1 (disable PCI device while doing I/O) [failed]
runtime 32.250s ... 1684.324s
--- tests/block/011.out 2023-04-06 10:11:07.920670529 +0900
+++ /home/shin/Blktests/blktests/results/nvme0n1/block/011.out.bad 2023-07-07 14:34:26.123295632 +0900
@@ -1,2 +1,11 @@
Running block/011
+fio: ioengines.c:335: td_io_queue: Assertion `(io_u->flags & IO_U_F_FLIGHT) == 0' failed.
+fio: ioengines.c:335: td_io_queue: Assertion `(io_u->flags & IO_U_F_FLIGHT) == 0' failed.
+fio: ioengines.c:335: td_io_queue: Assertion `(io_u->flags & IO_U_F_FLIGHT) == 0' failed.
+fio: ioengines.c:335: td_io_queue: Assertion `(io_u->flags & IO_U_F_FLIGHT) == 0' failed.
+fio: pid=1492, got signal=6
+fio: pid=1493, got signal=6
...
(Run 'diff -u tests/block/011.out /home/shin/Blktests/blktests/results/nvme0n1/block/011.out.bad' to see the entire diff)
[3] https://github.com/kawasaki/fio/commit/f4b68b90c6780a73d4eb017449400a7891443e3b
#2: block/024
Fails on slow machines. Reported in Dec/2022. A test-case-side issue is
suspected. Still needs further investigation.
block/024 (do I/O faster than a jiffy and check iostats times) [failed]
runtime ... 4.347s
--- tests/block/024.out 2022-12-06 20:51:41.525066605 +0900
+++ /home/shin/kts/kernel-test-suite/sets/blktests/log/runlog/nodev/block/024.out.bad 2022-12-07 12:51:03.610924521 +0900
@@ -6,5 +6,5 @@
read 1 s
write 1 s
read 2 s
-write 3 s
+write 4 s
Test complete
#3: nvme/003 (fabrics transport)
When the nvme test group is run with trtype=rdma or tcp, the test case fails
due to the lockdep WARNING "possible circular locking dependency detected".
Reported in May/2023. Bart suggested a fix for trtype=rdma [4], but it
needs more discussion.
[4] https://lore.kernel.org/linux-nvme/20230511150321.103172-1-bvanassche@acm.org/
#4: nvme/030 or nvme/031 (rdma transport with siw)
When the nvme test group is run with the trtype=rdma and use_siw=1
configurations, nvme/030 or nvme/031 fails occasionally due to "BUG: KASAN:
slab-use-after-free in __mutex_lock". Reported to linux-rdma in May/2023. A
fix was suggested, but it did not fix the root cause in the rdma iwarp cm [5].
Waiting for a good fix.
[5] https://lore.kernel.org/linux-rdma/20230612054237.1855292-1-shinichiro.kawasaki@wdc.com/
#5: nvme/* (fc transport)
With the trtype=fc configuration, a test run of the nvme test group hangs.
Daniel is driving the fix work [6].
[6] https://lore.kernel.org/linux-nvme/20230620133711.22840-1-dwagner@suse.de/
* Re: blktests failures with v6.4
  2023-07-07  7:27 blktests failures with v6.4 Shinichiro Kawasaki
@ 2023-07-09 14:32 ` Sagi Grimberg
  2023-07-13  1:22 ` Shinichiro Kawasaki
  0 siblings, 1 reply; 6+ messages in thread
From: Sagi Grimberg @ 2023-07-09 14:32 UTC (permalink / raw)
To: Shinichiro Kawasaki, linux-block@vger.kernel.org,
    linux-nvme@lists.infradead.org, linux-scsi@vger.kernel.org

> #3: nvme/003 (fabrics transport)
>
>     When the nvme test group is run with trtype=rdma or tcp, the test case fails
>     due to the lockdep WARNING "possible circular locking dependency detected".
>     Reported in May/2023. Bart suggested a fix for trtype=rdma [4], but it
>     needs more discussion.
>
> [4] https://lore.kernel.org/linux-nvme/20230511150321.103172-1-bvanassche@acm.org/

This patch is unfortunately incorrect and buggy.

This will likely make the issue go away, but adds another
old issue where a client can DDoS a target by bombarding it
with connect/disconnect. When releases are async and we don't
have any back-pressure, it is likely to happen.
--
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 4597bca43a6d..8b4f4aa48206 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -1582,11 +1582,6 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
 		goto put_device;
 	}
 
-	if (queue->host_qid == 0) {
-		/* Let inflight controller teardown complete */
-		flush_workqueue(nvmet_wq);
-	}
-
 	ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
 	if (ret) {
 		/*
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 868aa4de2e4c..c8cfa19e11c7 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1844,11 +1844,6 @@ static u16 nvmet_tcp_install_queue(struct nvmet_sq *sq)
 	struct nvmet_tcp_queue *queue =
 		container_of(sq, struct nvmet_tcp_queue, nvme_sq);
 
-	if (sq->qid == 0) {
-		/* Let inflight controller teardown complete */
-		flush_workqueue(nvmet_wq);
-	}
-
 	queue->nr_cmds = sq->size * 2;
 	if (nvmet_tcp_alloc_cmds(queue))
 		return NVME_SC_INTERNAL;
--
* Re: blktests failures with v6.4
  2023-07-09 14:32 ` Sagi Grimberg
@ 2023-07-13  1:22 ` Shinichiro Kawasaki
  2023-07-13  7:48 ` Sagi Grimberg
  0 siblings, 1 reply; 6+ messages in thread
From: Shinichiro Kawasaki @ 2023-07-13 1:22 UTC (permalink / raw)
To: Sagi Grimberg
Cc: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
    linux-scsi@vger.kernel.org

On Jul 09, 2023 / 17:32, Sagi Grimberg wrote:
>
> > #3: nvme/003 (fabrics transport)
> >
> >     When the nvme test group is run with trtype=rdma or tcp, the test case fails
> >     due to the lockdep WARNING "possible circular locking dependency detected".
> >     Reported in May/2023. Bart suggested a fix for trtype=rdma [4], but it
> >     needs more discussion.
> >
> > [4] https://lore.kernel.org/linux-nvme/20230511150321.103172-1-bvanassche@acm.org/
>
> This patch is unfortunately incorrect and buggy.
>
> This will likely make the issue go away, but adds another
> old issue where a client can DDoS a target by bombarding it
> with connect/disconnect. When releases are async and we don't
> have any back-pressure, it is likely to happen.
> --
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index 4597bca43a6d..8b4f4aa48206 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -1582,11 +1582,6 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
>  		goto put_device;
>  	}
> 
> -	if (queue->host_qid == 0) {
> -		/* Let inflight controller teardown complete */
> -		flush_workqueue(nvmet_wq);
> -	}
> -
>  	ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>  	if (ret) {
>  		/*
> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
> index 868aa4de2e4c..c8cfa19e11c7 100644
> --- a/drivers/nvme/target/tcp.c
> +++ b/drivers/nvme/target/tcp.c
> @@ -1844,11 +1844,6 @@ static u16 nvmet_tcp_install_queue(struct nvmet_sq *sq)
>  	struct nvmet_tcp_queue *queue =
>  		container_of(sq, struct nvmet_tcp_queue, nvme_sq);
> 
> -	if (sq->qid == 0) {
> -		/* Let inflight controller teardown complete */
> -		flush_workqueue(nvmet_wq);
> -	}
> -
>  	queue->nr_cmds = sq->size * 2;
>  	if (nvmet_tcp_alloc_cmds(queue))
>  		return NVME_SC_INTERNAL;
> --

Thanks Sagi, I tried the patch above and confirmed that the lockdep WARN
disappears for both rdma and tcp. It indicates that the
flush_workqueue(nvmet_wq) introduced the circular lock dependency. I also
found that the two commits below record why the flush_workqueue(nvmet_wq)
was introduced.

777dc82395de ("nvmet-rdma: occasionally flush ongoing controller teardown")
8832cf922151 ("nvmet: use a private workqueue instead of the system workqueue")

The remaining question is how to avoid both the connect/disconnect bombarding
DDoS and the possibility of the circular lock dependency related to nvmet_wq
completion.
* Re: blktests failures with v6.4
  2023-07-13  1:22 ` Shinichiro Kawasaki
@ 2023-07-13  7:48 ` Sagi Grimberg
  2023-07-13  8:41 ` Hannes Reinecke
  0 siblings, 1 reply; 6+ messages in thread
From: Sagi Grimberg @ 2023-07-13 7:48 UTC (permalink / raw)
To: Shinichiro Kawasaki
Cc: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
    linux-scsi@vger.kernel.org

>>> #3: nvme/003 (fabrics transport)
>>>
>>>     When the nvme test group is run with trtype=rdma or tcp, the test case fails
>>>     due to the lockdep WARNING "possible circular locking dependency detected".
>>>     Reported in May/2023. Bart suggested a fix for trtype=rdma [4], but it
>>>     needs more discussion.
>>>
>>> [4] https://lore.kernel.org/linux-nvme/20230511150321.103172-1-bvanassche@acm.org/
>>
>> This patch is unfortunately incorrect and buggy.
>>
>> This will likely make the issue go away, but adds another
>> old issue where a client can DDoS a target by bombarding it
>> with connect/disconnect. When releases are async and we don't
>> have any back-pressure, it is likely to happen.
>> --
>> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
>> index 4597bca43a6d..8b4f4aa48206 100644
>> --- a/drivers/nvme/target/rdma.c
>> +++ b/drivers/nvme/target/rdma.c
>> @@ -1582,11 +1582,6 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
>>  		goto put_device;
>>  	}
>> 
>> -	if (queue->host_qid == 0) {
>> -		/* Let inflight controller teardown complete */
>> -		flush_workqueue(nvmet_wq);
>> -	}
>> -
>>  	ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>>  	if (ret) {
>>  		/*
>> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
>> index 868aa4de2e4c..c8cfa19e11c7 100644
>> --- a/drivers/nvme/target/tcp.c
>> +++ b/drivers/nvme/target/tcp.c
>> @@ -1844,11 +1844,6 @@ static u16 nvmet_tcp_install_queue(struct nvmet_sq *sq)
>>  	struct nvmet_tcp_queue *queue =
>>  		container_of(sq, struct nvmet_tcp_queue, nvme_sq);
>> 
>> -	if (sq->qid == 0) {
>> -		/* Let inflight controller teardown complete */
>> -		flush_workqueue(nvmet_wq);
>> -	}
>> -
>>  	queue->nr_cmds = sq->size * 2;
>>  	if (nvmet_tcp_alloc_cmds(queue))
>>  		return NVME_SC_INTERNAL;
>> --
>
> Thanks Sagi, I tried the patch above and confirmed that the lockdep WARN
> disappears for both rdma and tcp. It indicates that the
> flush_workqueue(nvmet_wq) introduced the circular lock dependency.

Thanks for confirming. This was expected.

> I also found that the two commits below record why the
> flush_workqueue(nvmet_wq) was introduced.
>
> 777dc82395de ("nvmet-rdma: occasionally flush ongoing controller teardown")
> 8832cf922151 ("nvmet: use a private workqueue instead of the system workqueue")

The second patch is unrelated; before it we used a global workqueue and
fundamentally had the same issue.

> The remaining question is how to avoid both the connect/disconnect bombarding
> DDoS and the possibility of the circular lock dependency related to nvmet_wq
> completion.

I don't see any way to synchronize connects with releases without moving
connect sequences to a dedicated thread. Which in my mind is undesirable.

The only solution I can think of is to fail a host connect expecting the
host to reconnect and throttle this way, but that would lead to spurious
connect failures (at least from the host PoV).

Maybe we can add a NOT_READY connect error code in nvme for that...
* Re: blktests failures with v6.4
  2023-07-13  7:48 ` Sagi Grimberg
@ 2023-07-13  8:41 ` Hannes Reinecke
  2023-07-13 10:16 ` Sagi Grimberg
  0 siblings, 1 reply; 6+ messages in thread
From: Hannes Reinecke @ 2023-07-13 8:41 UTC (permalink / raw)
To: Sagi Grimberg, Shinichiro Kawasaki
Cc: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
    linux-scsi@vger.kernel.org

On 7/13/23 09:48, Sagi Grimberg wrote:
>
>>>> #3: nvme/003 (fabrics transport)
>>>>
>>>>     When the nvme test group is run with trtype=rdma or tcp, the test case fails
>>>>     due to the lockdep WARNING "possible circular locking dependency detected".
>>>>     Reported in May/2023. Bart suggested a fix for trtype=rdma [4], but it
>>>>     needs more discussion.
>>>>
>>>> [4] https://lore.kernel.org/linux-nvme/20230511150321.103172-1-bvanassche@acm.org/
>>>
>>> This patch is unfortunately incorrect and buggy.
>>>
>>> This will likely make the issue go away, but adds another
>>> old issue where a client can DDoS a target by bombarding it
>>> with connect/disconnect. When releases are async and we don't
>>> have any back-pressure, it is likely to happen.
>>> --
>>> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
>>> index 4597bca43a6d..8b4f4aa48206 100644
>>> --- a/drivers/nvme/target/rdma.c
>>> +++ b/drivers/nvme/target/rdma.c
>>> @@ -1582,11 +1582,6 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
>>>  		goto put_device;
>>>  	}
>>> 
>>> -	if (queue->host_qid == 0) {
>>> -		/* Let inflight controller teardown complete */
>>> -		flush_workqueue(nvmet_wq);
>>> -	}
>>> -
>>>  	ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>>>  	if (ret) {
>>>  		/*
>>> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
>>> index 868aa4de2e4c..c8cfa19e11c7 100644
>>> --- a/drivers/nvme/target/tcp.c
>>> +++ b/drivers/nvme/target/tcp.c
>>> @@ -1844,11 +1844,6 @@ static u16 nvmet_tcp_install_queue(struct nvmet_sq *sq)
>>>  	struct nvmet_tcp_queue *queue =
>>>  		container_of(sq, struct nvmet_tcp_queue, nvme_sq);
>>> 
>>> -	if (sq->qid == 0) {
>>> -		/* Let inflight controller teardown complete */
>>> -		flush_workqueue(nvmet_wq);
>>> -	}
>>> -
>>>  	queue->nr_cmds = sq->size * 2;
>>>  	if (nvmet_tcp_alloc_cmds(queue))
>>>  		return NVME_SC_INTERNAL;
>>> --
>>
>> Thanks Sagi, I tried the patch above and confirmed that the lockdep WARN
>> disappears for both rdma and tcp. It indicates that the
>> flush_workqueue(nvmet_wq) introduced the circular lock dependency.
>
> Thanks for confirming. This was expected.
>
>> I also found that the two commits below record why the
>> flush_workqueue(nvmet_wq) was introduced.
>>
>> 777dc82395de ("nvmet-rdma: occasionally flush ongoing controller teardown")
>> 8832cf922151 ("nvmet: use a private workqueue instead of the system workqueue")
>
> The second patch is unrelated; before it we used a global workqueue and
> fundamentally had the same issue.
>
>> The remaining question is how to avoid both the connect/disconnect bombarding
>> DDoS and the possibility of the circular lock dependency related to nvmet_wq
>> completion.
>
> I don't see any way to synchronize connects with releases without moving
> connect sequences to a dedicated thread. Which in my mind is undesirable.
>
> The only solution I can think of is to fail a host connect expecting the
> host to reconnect and throttle this way, but that would lead to spurious
> connect failures (at least from the host PoV).
>
> Maybe we can add a NOT_READY connect error code in nvme for that...
>
You know, I have been seeing the very same lockdep warning during TLS
testing; I wasn't sure, though, whether it's a generic issue or something
I had introduced with the TLS logic.

I guess we can solve this by adding another NVMET_TCP_Q_INIT state
before NVMET_TCP_Q_CONNECTING; then we can flip the state from INIT to
CONNECTING in nvmet_tcp_alloc_queue():

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index ed98df72c76b..e6f699a44128 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -121,6 +121,7 @@ struct nvmet_tcp_cmd {
 };
 
 enum nvmet_tcp_queue_state {
+	NVMET_TCP_Q_INIT,
 	NVMET_TCP_Q_CONNECTING,
 	NVMET_TCP_Q_LIVE,
 	NVMET_TCP_Q_DISCONNECTING,
@@ -1274,7 +1275,8 @@ static int nvmet_tcp_try_recv(struct nvmet_tcp_queue *queue,
 static void nvmet_tcp_schedule_release_queue(struct nvmet_tcp_queue *queue)
 {
 	spin_lock(&queue->state_lock);
-	if (queue->state != NVMET_TCP_Q_DISCONNECTING) {
+	if (queue->state != NVMET_TCP_Q_INIT &&
+	    queue->state != NVMET_TCP_Q_DISCONNECTING) {
 		queue->state = NVMET_TCP_Q_DISCONNECTING;
 		queue_work(nvmet_wq, &queue->release_work);
 	}
@@ -1625,7 +1627,7 @@ static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
 	queue->port = port;
 	queue->nr_cmds = 0;
 	spin_lock_init(&queue->state_lock);
-	queue->state = NVMET_TCP_Q_CONNECTING;
+	queue->state = NVMET_TCP_Q_INIT;
 	INIT_LIST_HEAD(&queue->free_list);
 	init_llist_head(&queue->resp_list);
 	INIT_LIST_HEAD(&queue->resp_send_list);
@@ -1832,10 +1834,12 @@ static u16 nvmet_tcp_install_queue(struct nvmet_sq *sq)
 	struct nvmet_tcp_queue *queue =
 		container_of(sq, struct nvmet_tcp_queue, nvme_sq);
 
-	if (sq->qid == 0) {
-		/* Let inflight controller teardown complete */
-		flush_workqueue(nvmet_wq);
+	spin_lock(&queue->state_lock);
+	if (queue->state != NVMET_TCP_Q_INIT) {
+		spin_unlock(&queue->state_lock);
+		return NVME_SC_NOT_READY;
 	}
+	spin_unlock(&queue->state_lock);
 
 	queue->nr_cmds = sq->size * 2;
 	if (nvmet_tcp_alloc_cmds(queue))

With that we'll return 'not ready' whenever we hit this condition, but
that should be fine as we would've crashed anyway with the old code.

Hmm?

Cheers,

Hannes
* Re: blktests failures with v6.4
  2023-07-13  8:41 ` Hannes Reinecke
@ 2023-07-13 10:16 ` Sagi Grimberg
  0 siblings, 0 replies; 6+ messages in thread
From: Sagi Grimberg @ 2023-07-13 10:16 UTC (permalink / raw)
To: Hannes Reinecke, Shinichiro Kawasaki
Cc: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
    linux-scsi@vger.kernel.org

On 7/13/23 11:41, Hannes Reinecke wrote:
> On 7/13/23 09:48, Sagi Grimberg wrote:
>>
>>>>> #3: nvme/003 (fabrics transport)
>>>>>
>>>>>     When the nvme test group is run with trtype=rdma or tcp, the test case fails
>>>>>     due to the lockdep WARNING "possible circular locking dependency detected".
>>>>>     Reported in May/2023. Bart suggested a fix for trtype=rdma [4], but it
>>>>>     needs more discussion.
>>>>>
>>>>> [4] https://lore.kernel.org/linux-nvme/20230511150321.103172-1-bvanassche@acm.org/
>>>>
>>>> This patch is unfortunately incorrect and buggy.
>>>>
>>>> This will likely make the issue go away, but adds another
>>>> old issue where a client can DDoS a target by bombarding it
>>>> with connect/disconnect. When releases are async and we don't
>>>> have any back-pressure, it is likely to happen.
>>>> --
>>>> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
>>>> index 4597bca43a6d..8b4f4aa48206 100644
>>>> --- a/drivers/nvme/target/rdma.c
>>>> +++ b/drivers/nvme/target/rdma.c
>>>> @@ -1582,11 +1582,6 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
>>>>  		goto put_device;
>>>>  	}
>>>> 
>>>> -	if (queue->host_qid == 0) {
>>>> -		/* Let inflight controller teardown complete */
>>>> -		flush_workqueue(nvmet_wq);
>>>> -	}
>>>> -
>>>>  	ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>>>>  	if (ret) {
>>>>  		/*
>>>> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
>>>> index 868aa4de2e4c..c8cfa19e11c7 100644
>>>> --- a/drivers/nvme/target/tcp.c
>>>> +++ b/drivers/nvme/target/tcp.c
>>>> @@ -1844,11 +1844,6 @@ static u16 nvmet_tcp_install_queue(struct nvmet_sq *sq)
>>>>  	struct nvmet_tcp_queue *queue =
>>>>  		container_of(sq, struct nvmet_tcp_queue, nvme_sq);
>>>> 
>>>> -	if (sq->qid == 0) {
>>>> -		/* Let inflight controller teardown complete */
>>>> -		flush_workqueue(nvmet_wq);
>>>> -	}
>>>> -
>>>>  	queue->nr_cmds = sq->size * 2;
>>>>  	if (nvmet_tcp_alloc_cmds(queue))
>>>>  		return NVME_SC_INTERNAL;
>>>> --
>>>
>>> Thanks Sagi, I tried the patch above and confirmed that the lockdep WARN
>>> disappears for both rdma and tcp. It indicates that the
>>> flush_workqueue(nvmet_wq) introduced the circular lock dependency.
>>
>> Thanks for confirming. This was expected.
>>
>>> I also found that the two commits below record why the
>>> flush_workqueue(nvmet_wq) was introduced.
>>>
>>> 777dc82395de ("nvmet-rdma: occasionally flush ongoing controller teardown")
>>> 8832cf922151 ("nvmet: use a private workqueue instead of the system workqueue")
>>
>> The second patch is unrelated; before it we used a global workqueue and
>> fundamentally had the same issue.
>>
>>> The remaining question is how to avoid both the connect/disconnect bombarding
>>> DDoS and the possibility of the circular lock dependency related to nvmet_wq
>>> completion.
>>
>> I don't see any way to synchronize connects with releases without moving
>> connect sequences to a dedicated thread. Which in my mind is undesirable.
>>
>> The only solution I can think of is to fail a host connect expecting the
>> host to reconnect and throttle this way, but that would lead to spurious
>> connect failures (at least from the host PoV).
>>
>> Maybe we can add a NOT_READY connect error code in nvme for that...
>>
> You know, I have been seeing the very same lockdep warning during TLS
> testing; I wasn't sure, though, whether it's a generic issue or something
> I had introduced with the TLS logic.
>
> I guess we can solve this by adding another NVMET_TCP_Q_INIT state
> before NVMET_TCP_Q_CONNECTING; then we can flip the state from INIT to
> CONNECTING in nvmet_tcp_alloc_queue():
>
> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
> index ed98df72c76b..e6f699a44128 100644
> --- a/drivers/nvme/target/tcp.c
> +++ b/drivers/nvme/target/tcp.c
> @@ -121,6 +121,7 @@ struct nvmet_tcp_cmd {
>  };
> 
>  enum nvmet_tcp_queue_state {
> +	NVMET_TCP_Q_INIT,
>  	NVMET_TCP_Q_CONNECTING,
>  	NVMET_TCP_Q_LIVE,
>  	NVMET_TCP_Q_DISCONNECTING,
> @@ -1274,7 +1275,8 @@ static int nvmet_tcp_try_recv(struct nvmet_tcp_queue *queue,
>  static void nvmet_tcp_schedule_release_queue(struct nvmet_tcp_queue *queue)
>  {
>  	spin_lock(&queue->state_lock);
> -	if (queue->state != NVMET_TCP_Q_DISCONNECTING) {
> +	if (queue->state != NVMET_TCP_Q_INIT &&
> +	    queue->state != NVMET_TCP_Q_DISCONNECTING) {
>  		queue->state = NVMET_TCP_Q_DISCONNECTING;
>  		queue_work(nvmet_wq, &queue->release_work);
>  	}
> @@ -1625,7 +1627,7 @@ static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
>  	queue->port = port;
>  	queue->nr_cmds = 0;
>  	spin_lock_init(&queue->state_lock);
> -	queue->state = NVMET_TCP_Q_CONNECTING;
> +	queue->state = NVMET_TCP_Q_INIT;
>  	INIT_LIST_HEAD(&queue->free_list);
>  	init_llist_head(&queue->resp_list);
>  	INIT_LIST_HEAD(&queue->resp_send_list);
> @@ -1832,10 +1834,12 @@ static u16 nvmet_tcp_install_queue(struct nvmet_sq *sq)
>  	struct nvmet_tcp_queue *queue =
>  		container_of(sq, struct nvmet_tcp_queue, nvme_sq);
> 
> -	if (sq->qid == 0) {
> -		/* Let inflight controller teardown complete */
> -		flush_workqueue(nvmet_wq);
> +	spin_lock(&queue->state_lock);
> +	if (queue->state != NVMET_TCP_Q_INIT) {
> +		spin_unlock(&queue->state_lock);
> +		return NVME_SC_NOT_READY;
>  	}
> +	spin_unlock(&queue->state_lock);
> 
>  	queue->nr_cmds = sq->size * 2;
>  	if (nvmet_tcp_alloc_cmds(queue))
>
> With that we'll return 'not ready' whenever we hit this condition, but
> that should be fine as we would've crashed anyway with the old code.
>
> Hmm?

I don't understand what this patch is doing... Nor how it solves anything.