linux-block.vger.kernel.org archive mirror
* Kernel NBD client waits on wrong cookie, aborts connection
@ 2024-10-15 10:21 Kevin Wolf
  2024-10-15 12:01 ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Kevin Wolf @ 2024-10-15 10:21 UTC (permalink / raw)
  To: josef; +Cc: axboe, linux-block, nbd, eblake

Hi all,

the other day I was running some benchmarks to compare different QEMU
block exports, and one of the scenarios I was interested in was
exporting NBD from qemu-storage-daemon over a unix socket and attaching
it as a block device using the kernel NBD client. I would then run a VM
on top of it and fio inside of it.

Unfortunately, I couldn't get any numbers because the connection always
aborted with messages like "Double reply on req ..." or "Unexpected
reply ..." in the host kernel log.

Yesterday I found some time to take a closer look at why this is happening,
and I think I now have a rough understanding of what's going on. Look at
these trace events:

        qemu-img-51025   [005] ..... 19503.285423: nbd_header_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005a
[...]
        qemu-img-51025   [008] ..... 19503.285500: nbd_payload_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005d
[...]
   kworker/u49:1-47350   [004] ..... 19503.285514: nbd_header_received: nbd transport event: request 00000000b79e7443, handle 0x0000150c0000005a

This is the same request, but the handle has changed between
nbd_header_sent and nbd_payload_sent! I think this means that we hit one
of the cases where the request is requeued, and then the next time it
is executed with a different blk-mq tag, which is something the nbd
driver doesn't seem to expect.

Of course, since the cookie is transmitted in the header, the server
replies with the original handle that contains the tag from the first
call, while the kernel is only waiting for a handle with the new tag and
is confused by the server response.

I'm not sure yet which of the following options should be considered the
real problem here, so I'm only describing the situation without trying
to provide a patch:

1. Is it that blk-mq should always re-run the request with the same tag?
   I don't expect so, though in practice I was surprised to see how
   often a request that nbd requeues actually does end up with the same
   cookie again.

2. Is it that nbd should use cookies that don't depend on the tag?
   Maybe, but then we lose an easy way to identify the request from the
   server response.

3. Is it that it should never requeue requests after already starting to
   send data for them? This sounds most likely to me, but also like the
   biggest change to make in nbd.

4. Or something else entirely?

I tested this with the 6.10.12 kernel from Fedora 40, but a quick git
diff on nbd.c doesn't suggest that anything related has changed since
then. This is how I reproduced it for debugging (without a VM):

$ qemu-storage-daemon --blockdev null-co,size=$((16*(1024**3))),node-name=data --nbd-server addr.type=unix,addr.path=/tmp/nbd.sock --export nbd,id=exp0,node-name=data,writable=on
# nbd-client -unix -N data /tmp/nbd.sock /dev/nbd0
# qemu-img bench -f host_device -w -s 4k -c 1000000 -t none -i io_uring /dev/nbd0

I couldn't trigger the problem with TCP or without the io_uring backend
(i.e. using Linux AIO or the thread pool) for 'qemu-img bench'.

Kevin


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel NBD client waits on wrong cookie, aborts connection
  2024-10-15 10:21 Kernel NBD client waits on wrong cookie, aborts connection Kevin Wolf
@ 2024-10-15 12:01 ` Ming Lei
  2024-10-15 12:11   ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2024-10-15 12:01 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: josef, axboe, linux-block, nbd, eblake

On Tue, Oct 15, 2024 at 6:22 PM Kevin Wolf <kwolf@redhat.com> wrote:
>
> Hi all,
>
> the other day I was running some benchmarks to compare different QEMU
> block exports, and one of the scenarios I was interested in was
> exporting NBD from qemu-storage-daemon over a unix socket and attaching
> it as a block device using the kernel NBD client. I would then run a VM
> on top of it and fio inside of it.
>
> Unfortunately, I couldn't get any numbers because the connection always
> aborted with messages like "Double reply on req ..." or "Unexpected
> reply ..." in the host kernel log.
>
> Yesterday I found some time to have a closer look why this is happening,
> and I think I have a rough understanding of what's going on now. Look at
> these trace events:
>
>         qemu-img-51025   [005] ..... 19503.285423: nbd_header_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005a
> [...]
>         qemu-img-51025   [008] ..... 19503.285500: nbd_payload_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005d
> [...]
>    kworker/u49:1-47350   [004] ..... 19503.285514: nbd_header_received: nbd transport event: request 00000000b79e7443, handle 0x0000150c0000005a
>
> This is the same request, but the handle has changed between
> nbd_header_sent and nbd_payload_sent! I think this means that we hit one
> of the cases where the request is requeued, and then the next time it
> is executed with a different blk-mq tag, which is something the nbd
> driver doesn't seem to expect.
>
> Of course, since the cookie is transmitted in the header, the server
> replies with the original handle that contains the tag from the first
> call, while the kernel is only waiting for a handle with the new tag and
> is confused by the server response.
>
> I'm not sure yet which of the following options should be considered the
> real problem here, so I'm only describing the situation without trying
> to provide a patch:
>
> 1. Is it that blk-mq should always re-run the request with the same tag?
>    I don't expect so, though in practice I was surprised to see that it
>    happens quite often after nbd requeues a request that it actually
>    does end up with the same cookie again.

No.

request->tag can change across a requeue, but we can use ->internal_tag
(with an I/O scheduler) or ->tag (with 'none'), which won't change.

I guess was_interrupted() in nbd_send_cmd() is triggered, then the payload
is sent with a different tag.

I will try to cook one patch soon.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel NBD client waits on wrong cookie, aborts connection
  2024-10-15 12:01 ` Ming Lei
@ 2024-10-15 12:11   ` Ming Lei
  2024-10-15 12:15     ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2024-10-15 12:11 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: josef, axboe, linux-block, nbd, eblake, ming.lei

On Tue, Oct 15, 2024 at 08:01:43PM +0800, Ming Lei wrote:
> On Tue, Oct 15, 2024 at 6:22 PM Kevin Wolf <kwolf@redhat.com> wrote:
> >
> > Hi all,
> >
> > the other day I was running some benchmarks to compare different QEMU
> > block exports, and one of the scenarios I was interested in was
> > exporting NBD from qemu-storage-daemon over a unix socket and attaching
> > it as a block device using the kernel NBD client. I would then run a VM
> > on top of it and fio inside of it.
> >
> > Unfortunately, I couldn't get any numbers because the connection always
> > aborted with messages like "Double reply on req ..." or "Unexpected
> > reply ..." in the host kernel log.
> >
> > Yesterday I found some time to have a closer look why this is happening,
> > and I think I have a rough understanding of what's going on now. Look at
> > these trace events:
> >
> >         qemu-img-51025   [005] ..... 19503.285423: nbd_header_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005a
> > [...]
> >         qemu-img-51025   [008] ..... 19503.285500: nbd_payload_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005d
> > [...]
> >    kworker/u49:1-47350   [004] ..... 19503.285514: nbd_header_received: nbd transport event: request 00000000b79e7443, handle 0x0000150c0000005a
> >
> > This is the same request, but the handle has changed between
> > nbd_header_sent and nbd_payload_sent! I think this means that we hit one
> > of the cases where the request is requeued, and then the next time it
> > is executed with a different blk-mq tag, which is something the nbd
> > driver doesn't seem to expect.
> >
> > Of course, since the cookie is transmitted in the header, the server
> > replies with the original handle that contains the tag from the first
> > call, while the kernel is only waiting for a handle with the new tag and
> > is confused by the server response.
> >
> > I'm not sure yet which of the following options should be considered the
> > real problem here, so I'm only describing the situation without trying
> > to provide a patch:
> >
> > 1. Is it that blk-mq should always re-run the request with the same tag?
> >    I don't expect so, though in practice I was surprised to see that it
> >    happens quite often after nbd requeues a request that it actually
> >    does end up with the same cookie again.
> 
> No.
> 
> request->tag will change, but we may take ->internal_tag(sched) or
> ->tag(none), which won't change.
> 
> I guess was_interrupted() in nbd_send_cmd() is triggered, then the payload
> is sent with a different tag.
> 
> I will try to cook one patch soon.

Please try the following patch:


diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 2cafcf11ee8b..e3eb31c3ee75 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -682,3 +682,16 @@ u32 blk_mq_unique_tag(struct request *rq)
 		(rq->tag & BLK_MQ_UNIQUE_TAG_MASK);
 }
 EXPORT_SYMBOL(blk_mq_unique_tag);
+
+/*
+ * Same as blk_mq_unique_tag(), but uses a tag that stays constant
+ * over the request's lifetime.
+ */
+u32 blk_mq_unique_static_tag(struct request *rq)
+{
+	u32 tag = rq->q->elevator ? rq->internal_tag : rq->tag;
+
+	return (rq->mq_hctx->queue_num << BLK_MQ_UNIQUE_TAG_BITS) |
+		(tag & BLK_MQ_UNIQUE_TAG_MASK);
+}
+EXPORT_SYMBOL(blk_mq_unique_static_tag);
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index b852050d8a96..cc522a2cb9fb 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -201,7 +201,7 @@ static void nbd_requeue_cmd(struct nbd_cmd *cmd)
 static u64 nbd_cmd_handle(struct nbd_cmd *cmd)
 {
 	struct request *req = blk_mq_rq_from_pdu(cmd);
-	u32 tag = blk_mq_unique_tag(req);
+	u32 tag = blk_mq_unique_static_tag(req);
 	u64 cookie = cmd->cmd_cookie;
 
 	return (cookie << NBD_COOKIE_BITS) | tag;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 4fecf46ef681..d6266759d62d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -793,6 +793,7 @@ enum {
 };
 
 u32 blk_mq_unique_tag(struct request *rq);
+u32 blk_mq_unique_static_tag(struct request *rq);
 
 static inline u16 blk_mq_unique_tag_to_hwq(u32 unique_tag)
 {

-- 
Ming


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: Kernel NBD client waits on wrong cookie, aborts connection
  2024-10-15 12:11   ` Ming Lei
@ 2024-10-15 12:15     ` Ming Lei
  2024-10-15 12:59       ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2024-10-15 12:15 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: josef, axboe, linux-block, nbd, eblake

On Tue, Oct 15, 2024 at 8:11 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Tue, Oct 15, 2024 at 08:01:43PM +0800, Ming Lei wrote:
> > On Tue, Oct 15, 2024 at 6:22 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > >
> > > Hi all,
> > >
> > > the other day I was running some benchmarks to compare different QEMU
> > > block exports, and one of the scenarios I was interested in was
> > > exporting NBD from qemu-storage-daemon over a unix socket and attaching
> > > it as a block device using the kernel NBD client. I would then run a VM
> > > on top of it and fio inside of it.
> > >
> > > Unfortunately, I couldn't get any numbers because the connection always
> > > aborted with messages like "Double reply on req ..." or "Unexpected
> > > reply ..." in the host kernel log.
> > >
> > > Yesterday I found some time to have a closer look why this is happening,
> > > and I think I have a rough understanding of what's going on now. Look at
> > > these trace events:
> > >
> > >         qemu-img-51025   [005] ..... 19503.285423: nbd_header_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005a
> > > [...]
> > >         qemu-img-51025   [008] ..... 19503.285500: nbd_payload_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005d
> > > [...]
> > >    kworker/u49:1-47350   [004] ..... 19503.285514: nbd_header_received: nbd transport event: request 00000000b79e7443, handle 0x0000150c0000005a
> > >
> > > This is the same request, but the handle has changed between
> > > nbd_header_sent and nbd_payload_sent! I think this means that we hit one
> > > of the cases where the request is requeued, and then the next time it
> > > is executed with a different blk-mq tag, which is something the nbd
> > > driver doesn't seem to expect.
> > >
> > > Of course, since the cookie is transmitted in the header, the server
> > > replies with the original handle that contains the tag from the first
> > > call, while the kernel is only waiting for a handle with the new tag and
> > > is confused by the server response.
> > >
> > > I'm not sure yet which of the following options should be considered the
> > > real problem here, so I'm only describing the situation without trying
> > > to provide a patch:
> > >
> > > 1. Is it that blk-mq should always re-run the request with the same tag?
> > >    I don't expect so, though in practice I was surprised to see that it
> > >    happens quite often after nbd requeues a request that it actually
> > >    does end up with the same cookie again.
> >
> > No.
> >
> > request->tag will change, but we may take ->internal_tag(sched) or
> > ->tag(none), which won't change.
> >
> > I guess was_interrupted() in nbd_send_cmd() is triggered, then the payload
> > is sent with a different tag.
> >
> > I will try to cook one patch soon.
>
> Please try the following patch:

Oops, please ignore the patch. It can't work since
nbd_handle_reply() doesn't know about the static tag.

Thanks,


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel NBD client waits on wrong cookie, aborts connection
  2024-10-15 12:15     ` Ming Lei
@ 2024-10-15 12:59       ` Ming Lei
  2024-10-15 16:06         ` Kevin Wolf
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2024-10-15 12:59 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: josef, axboe, linux-block, nbd, eblake

On Tue, Oct 15, 2024 at 08:15:17PM +0800, Ming Lei wrote:
> On Tue, Oct 15, 2024 at 8:11 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Tue, Oct 15, 2024 at 08:01:43PM +0800, Ming Lei wrote:
> > > On Tue, Oct 15, 2024 at 6:22 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > the other day I was running some benchmarks to compare different QEMU
> > > > block exports, and one of the scenarios I was interested in was
> > > > exporting NBD from qemu-storage-daemon over a unix socket and attaching
> > > > it as a block device using the kernel NBD client. I would then run a VM
> > > > on top of it and fio inside of it.
> > > >
> > > > Unfortunately, I couldn't get any numbers because the connection always
> > > > aborted with messages like "Double reply on req ..." or "Unexpected
> > > > reply ..." in the host kernel log.
> > > >
> > > > Yesterday I found some time to have a closer look why this is happening,
> > > > and I think I have a rough understanding of what's going on now. Look at
> > > > these trace events:
> > > >
> > > >         qemu-img-51025   [005] ..... 19503.285423: nbd_header_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005a
> > > > [...]
> > > >         qemu-img-51025   [008] ..... 19503.285500: nbd_payload_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005d
> > > > [...]
> > > >    kworker/u49:1-47350   [004] ..... 19503.285514: nbd_header_received: nbd transport event: request 00000000b79e7443, handle 0x0000150c0000005a
> > > >
> > > > This is the same request, but the handle has changed between
> > > > nbd_header_sent and nbd_payload_sent! I think this means that we hit one
> > > > of the cases where the request is requeued, and then the next time it
> > > > is executed with a different blk-mq tag, which is something the nbd
> > > > driver doesn't seem to expect.
> > > >
> > > > Of course, since the cookie is transmitted in the header, the server
> > > > replies with the original handle that contains the tag from the first
> > > > call, while the kernel is only waiting for a handle with the new tag and
> > > > is confused by the server response.
> > > >
> > > > I'm not sure yet which of the following options should be considered the
> > > > real problem here, so I'm only describing the situation without trying
> > > > to provide a patch:
> > > >
> > > > 1. Is it that blk-mq should always re-run the request with the same tag?
> > > >    I don't expect so, though in practice I was surprised to see that it
> > > >    happens quite often after nbd requeues a request that it actually
> > > >    does end up with the same cookie again.
> > >
> > > No.
> > >
> > > request->tag will change, but we may take ->internal_tag(sched) or
> > > ->tag(none), which won't change.
> > >
> > > I guess was_interrupted() in nbd_send_cmd() is triggered, then the payload
> > > is sent with a different tag.
> > >
> > > I will try to cook one patch soon.
> >
> > Please try the following patch:
> 
> Oops, please ignore the patch, it can't work since
> nbd_handle_reply() doesn't know static tag.

Please try the v2:


diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 2cafcf11ee8b..8bad4030b2e4 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -682,3 +682,31 @@ u32 blk_mq_unique_tag(struct request *rq)
 		(rq->tag & BLK_MQ_UNIQUE_TAG_MASK);
 }
 EXPORT_SYMBOL(blk_mq_unique_tag);
+
+/*
+ * Same as blk_mq_unique_tag(), but uses a tag that stays constant
+ * over the request's lifetime.
+ */
+u32 blk_mq_unique_static_tag(struct request *rq)
+{
+	u32 tag = rq->q->elevator ? rq->internal_tag : rq->tag;
+
+	return (rq->mq_hctx->queue_num << BLK_MQ_UNIQUE_TAG_BITS) |
+		(tag & BLK_MQ_UNIQUE_TAG_MASK);
+}
+EXPORT_SYMBOL(blk_mq_unique_static_tag);
+
+struct request *blk_mq_static_tag_to_req(struct request_queue *q, u32 uniq_tag)
+{
+	unsigned long hwq = blk_mq_unique_tag_to_hwq(uniq_tag);
+	u32 tag = blk_mq_unique_tag_to_tag(uniq_tag);
+	const struct blk_mq_hw_ctx *hctx = xa_load(&q->hctx_table, hwq);
+
+	if (!hctx)
+		return NULL;
+
+	if (q->elevator)
+		return hctx->sched_tags->static_rqs[tag];
+	return hctx->tags->static_rqs[tag];
+}
+EXPORT_SYMBOL(blk_mq_static_tag_to_req);
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index b852050d8a96..5be324233c9f 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -201,7 +201,7 @@ static void nbd_requeue_cmd(struct nbd_cmd *cmd)
 static u64 nbd_cmd_handle(struct nbd_cmd *cmd)
 {
 	struct request *req = blk_mq_rq_from_pdu(cmd);
-	u32 tag = blk_mq_unique_tag(req);
+	u32 tag = blk_mq_unique_static_tag(req);
 	u64 cookie = cmd->cmd_cookie;
 
 	return (cookie << NBD_COOKIE_BITS) | tag;
@@ -818,10 +818,7 @@ static struct nbd_cmd *nbd_handle_reply(struct nbd_device *nbd, int index,
 
 	handle = be64_to_cpu(reply->cookie);
 	tag = nbd_handle_to_tag(handle);
-	hwq = blk_mq_unique_tag_to_hwq(tag);
-	if (hwq < nbd->tag_set.nr_hw_queues)
-		req = blk_mq_tag_to_rq(nbd->tag_set.tags[hwq],
-				       blk_mq_unique_tag_to_tag(tag));
+	req = blk_mq_static_tag_to_req(nbd->disk->queue, tag);
 	if (!req || !blk_mq_request_started(req)) {
 		dev_err(disk_to_dev(nbd->disk), "Unexpected reply (%d) %p\n",
 			tag, req);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 4fecf46ef681..9c4ef3f16a77 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -793,6 +793,8 @@ enum {
 };
 
 u32 blk_mq_unique_tag(struct request *rq);
+u32 blk_mq_unique_static_tag(struct request *rq);
+struct request *blk_mq_static_tag_to_req(struct request_queue *q, u32 tag);
 
 static inline u16 blk_mq_unique_tag_to_hwq(u32 unique_tag)
 {

-- 
Ming


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: Kernel NBD client waits on wrong cookie, aborts connection
  2024-10-15 12:59       ` Ming Lei
@ 2024-10-15 16:06         ` Kevin Wolf
  2024-10-16  2:20           ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Kevin Wolf @ 2024-10-15 16:06 UTC (permalink / raw)
  To: Ming Lei; +Cc: josef, axboe, linux-block, nbd, eblake

Am 15.10.2024 um 14:59 hat Ming Lei geschrieben:
> On Tue, Oct 15, 2024 at 08:15:17PM +0800, Ming Lei wrote:
> > On Tue, Oct 15, 2024 at 8:11 PM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > On Tue, Oct 15, 2024 at 08:01:43PM +0800, Ming Lei wrote:
> > > > On Tue, Oct 15, 2024 at 6:22 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > the other day I was running some benchmarks to compare different QEMU
> > > > > block exports, and one of the scenarios I was interested in was
> > > > > exporting NBD from qemu-storage-daemon over a unix socket and attaching
> > > > > it as a block device using the kernel NBD client. I would then run a VM
> > > > > on top of it and fio inside of it.
> > > > >
> > > > > Unfortunately, I couldn't get any numbers because the connection always
> > > > > aborted with messages like "Double reply on req ..." or "Unexpected
> > > > > reply ..." in the host kernel log.
> > > > >
> > > > > Yesterday I found some time to have a closer look why this is happening,
> > > > > and I think I have a rough understanding of what's going on now. Look at
> > > > > these trace events:
> > > > >
> > > > >         qemu-img-51025   [005] ..... 19503.285423: nbd_header_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005a
> > > > > [...]
> > > > >         qemu-img-51025   [008] ..... 19503.285500: nbd_payload_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005d
> > > > > [...]
> > > > >    kworker/u49:1-47350   [004] ..... 19503.285514: nbd_header_received: nbd transport event: request 00000000b79e7443, handle 0x0000150c0000005a
> > > > >
> > > > > This is the same request, but the handle has changed between
> > > > > nbd_header_sent and nbd_payload_sent! I think this means that we hit one
> > > > > of the cases where the request is requeued, and then the next time it
> > > > > is executed with a different blk-mq tag, which is something the nbd
> > > > > driver doesn't seem to expect.
> > > > >
> > > > > Of course, since the cookie is transmitted in the header, the server
> > > > > replies with the original handle that contains the tag from the first
> > > > > call, while the kernel is only waiting for a handle with the new tag and
> > > > > is confused by the server response.
> > > > >
> > > > > I'm not sure yet which of the following options should be considered the
> > > > > real problem here, so I'm only describing the situation without trying
> > > > > to provide a patch:
> > > > >
> > > > > 1. Is it that blk-mq should always re-run the request with the same tag?
> > > > >    I don't expect so, though in practice I was surprised to see that it
> > > > >    happens quite often after nbd requeues a request that it actually
> > > > >    does end up with the same cookie again.
> > > >
> > > > No.
> > > >
> > > > request->tag will change, but we may take ->internal_tag(sched) or
> > > > ->tag(none), which won't change.
> > > >
> > > > I guess was_interrupted() in nbd_send_cmd() is triggered, then the payload
> > > > is sent with a different tag.
> > > >
> > > > I will try to cook one patch soon.
> > >
> > > Please try the following patch:
> > 
> > Oops, please ignore the patch, it can't work since
> > nbd_handle_reply() doesn't know static tag.
> 
> Please try the v2:

It doesn't fully work, though it replaced the bug with a different one.
Now I get "Unexpected request" for the final flush request.

Anyway, before talking about specific patches, would this even be the
right solution or would it only paper over a bigger issue?

Is getting a different tag the only thing that can go wrong if you
handle a request partially and then requeue it?

Kevin


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel NBD client waits on wrong cookie, aborts connection
  2024-10-15 16:06         ` Kevin Wolf
@ 2024-10-16  2:20           ` Ming Lei
  2024-10-16 17:00             ` Kevin Wolf
                               ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Ming Lei @ 2024-10-16  2:20 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: josef, axboe, linux-block, nbd, eblake

[-- Attachment #1: Type: text/plain, Size: 4608 bytes --]

On Tue, Oct 15, 2024 at 06:06:01PM +0200, Kevin Wolf wrote:
> Am 15.10.2024 um 14:59 hat Ming Lei geschrieben:
> > On Tue, Oct 15, 2024 at 08:15:17PM +0800, Ming Lei wrote:
> > > On Tue, Oct 15, 2024 at 8:11 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > On Tue, Oct 15, 2024 at 08:01:43PM +0800, Ming Lei wrote:
> > > > > On Tue, Oct 15, 2024 at 6:22 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > the other day I was running some benchmarks to compare different QEMU
> > > > > > block exports, and one of the scenarios I was interested in was
> > > > > > exporting NBD from qemu-storage-daemon over a unix socket and attaching
> > > > > > it as a block device using the kernel NBD client. I would then run a VM
> > > > > > on top of it and fio inside of it.
> > > > > >
> > > > > > Unfortunately, I couldn't get any numbers because the connection always
> > > > > > aborted with messages like "Double reply on req ..." or "Unexpected
> > > > > > reply ..." in the host kernel log.
> > > > > >
> > > > > > Yesterday I found some time to have a closer look why this is happening,
> > > > > > and I think I have a rough understanding of what's going on now. Look at
> > > > > > these trace events:
> > > > > >
> > > > > >         qemu-img-51025   [005] ..... 19503.285423: nbd_header_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005a
> > > > > > [...]
> > > > > >         qemu-img-51025   [008] ..... 19503.285500: nbd_payload_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005d
> > > > > > [...]
> > > > > >    kworker/u49:1-47350   [004] ..... 19503.285514: nbd_header_received: nbd transport event: request 00000000b79e7443, handle 0x0000150c0000005a
> > > > > >
> > > > > > This is the same request, but the handle has changed between
> > > > > > nbd_header_sent and nbd_payload_sent! I think this means that we hit one
> > > > > > of the cases where the request is requeued, and then the next time it
> > > > > > is executed with a different blk-mq tag, which is something the nbd
> > > > > > driver doesn't seem to expect.
> > > > > >
> > > > > > Of course, since the cookie is transmitted in the header, the server
> > > > > > replies with the original handle that contains the tag from the first
> > > > > > call, while the kernel is only waiting for a handle with the new tag and
> > > > > > is confused by the server response.
> > > > > >
> > > > > > I'm not sure yet which of the following options should be considered the
> > > > > > real problem here, so I'm only describing the situation without trying
> > > > > > to provide a patch:
> > > > > >
> > > > > > 1. Is it that blk-mq should always re-run the request with the same tag?
> > > > > >    I don't expect so, though in practice I was surprised to see that it
> > > > > >    happens quite often after nbd requeues a request that it actually
> > > > > >    does end up with the same cookie again.
> > > > >
> > > > > No.
> > > > >
> > > > > request->tag will change, but we may take ->internal_tag(sched) or
> > > > > ->tag(none), which won't change.
> > > > >
> > > > > I guess was_interrupted() in nbd_send_cmd() is triggered, then the payload
> > > > > is sent with a different tag.
> > > > >
> > > > > I will try to cook one patch soon.
> > > >
> > > > Please try the following patch:
> > > 
> > > Oops, please ignore the patch, it can't work since
> > > nbd_handle_reply() doesn't know static tag.
> > 
> > Please try the v2:
> 
> It doesn't fully work, though it replaced the bug with a different one.
> Now I get "Unexpected request" for the final flush request.

That just shows the approach is working.

The flush request doesn't have a static tag, which is why it failed.
It shouldn't be hard to cover it; please try the attached & revised
patch.

Another solution is to add a per-nbd-device map for retrieving the
nbd_cmd by the `index` stored in the cookie; the cost is one such
array for each device.

> 
> Anyway, before talking about specific patches, would this even be the
> right solution or would it only paper over a bigger issue?
> 
> Is getting a different tag the only thing that can go wrong if you
> handle a request partially and then requeue it?

Strictly speaking, it is BLK_STS_RESOURCE.

Unlike a userspace implementation, the kernel nbd client calls one
sock_sendmsg() for the request header and one for each single data
bvec, so partial transmission can't be avoided. This kind of handling
is fine: since TCP is just a byte stream, the nbd server observes no
difference as long as the data is correct.


Thanks,
Ming

[-- Attachment #2: static-tag.patch --]
[-- Type: text/plain, Size: 3010 bytes --]

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 2cafcf11ee8b..3cc14fc76546 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -682,3 +682,51 @@ u32 blk_mq_unique_tag(struct request *rq)
 		(rq->tag & BLK_MQ_UNIQUE_TAG_MASK);
 }
 EXPORT_SYMBOL(blk_mq_unique_tag);
+
+/* Same as blk_mq_unique_tag(), but uses a tag constant over the request lifetime */
+u32 blk_mq_unique_static_tag(struct request *rq)
+{
+	bool use_sched = rq->q->elevator;
+	u32 tag;
+
+	if (rq == rq->mq_hctx->fq->flush_rq) {
+		if (use_sched)
+			tag = rq->mq_hctx->sched_tags->nr_tags;
+		else
+			tag = rq->mq_hctx->tags->nr_tags;
+	} else {
+		tag = use_sched ? rq->internal_tag : rq->tag;
+	}
+
+	return (rq->mq_hctx->queue_num << BLK_MQ_UNIQUE_TAG_BITS) |
+		(tag & BLK_MQ_UNIQUE_TAG_MASK);
+}
+EXPORT_SYMBOL(blk_mq_unique_static_tag);
+
+static struct request *
+__blk_mq_static_tag_to_rq(const struct blk_mq_hw_ctx *hctx,
+		const struct blk_mq_tags *tags, u32 tag)
+{
+	if (tag < tags->nr_tags)
+		return tags->static_rqs[tag];
+	else if (tag == tags->nr_tags)
+		return hctx->fq->flush_rq;
+	else
+		return NULL;
+}
+
+struct request *blk_mq_static_tag_to_req(struct request_queue *q, u32 uniq_tag)
+{
+	unsigned long hwq = blk_mq_unique_tag_to_hwq(uniq_tag);
+	u32 tag = blk_mq_unique_tag_to_tag(uniq_tag);
+	const struct blk_mq_hw_ctx *hctx = xa_load(&q->hctx_table, hwq);
+
+	if (!hctx)
+		return NULL;
+
+	if (q->elevator)
+		return __blk_mq_static_tag_to_rq(hctx, hctx->sched_tags, tag);
+
+	return __blk_mq_static_tag_to_rq(hctx, hctx->tags, tag);
+}
+EXPORT_SYMBOL(blk_mq_static_tag_to_req);
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index b852050d8a96..5be324233c9f 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -201,7 +201,7 @@ static void nbd_requeue_cmd(struct nbd_cmd *cmd)
 static u64 nbd_cmd_handle(struct nbd_cmd *cmd)
 {
 	struct request *req = blk_mq_rq_from_pdu(cmd);
-	u32 tag = blk_mq_unique_tag(req);
+	u32 tag = blk_mq_unique_static_tag(req);
 	u64 cookie = cmd->cmd_cookie;
 
 	return (cookie << NBD_COOKIE_BITS) | tag;
@@ -818,10 +818,7 @@ static struct nbd_cmd *nbd_handle_reply(struct nbd_device *nbd, int index,
 
 	handle = be64_to_cpu(reply->cookie);
 	tag = nbd_handle_to_tag(handle);
-	hwq = blk_mq_unique_tag_to_hwq(tag);
-	if (hwq < nbd->tag_set.nr_hw_queues)
-		req = blk_mq_tag_to_rq(nbd->tag_set.tags[hwq],
-				       blk_mq_unique_tag_to_tag(tag));
+	req = blk_mq_static_tag_to_req(nbd->disk->queue, tag);
 	if (!req || !blk_mq_request_started(req)) {
 		dev_err(disk_to_dev(nbd->disk), "Unexpected reply (%d) %p\n",
 			tag, req);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 4fecf46ef681..9c4ef3f16a77 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -793,6 +793,8 @@ enum {
 };
 
 u32 blk_mq_unique_tag(struct request *rq);
+u32 blk_mq_unique_static_tag(struct request *rq);
+struct request *blk_mq_static_tag_to_req(struct request_queue *q, u32 tag);
 
 static inline u16 blk_mq_unique_tag_to_hwq(u32 unique_tag)
 {

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: Kernel NBD client waits on wrong cookie, aborts connection
  2024-10-16  2:20           ` Ming Lei
@ 2024-10-16 17:00             ` Kevin Wolf
  2024-10-17  1:16               ` Ming Lei
  2024-10-18 14:40             ` Leon Schuermann
  2024-10-21 17:54             ` UNSUBSCRIBE Simon Fernandez
  2 siblings, 1 reply; 12+ messages in thread
From: Kevin Wolf @ 2024-10-16 17:00 UTC (permalink / raw)
  To: Ming Lei; +Cc: josef, axboe, linux-block, nbd, eblake, leon

Am 16.10.2024 um 04:20 hat Ming Lei geschrieben:
> On Tue, Oct 15, 2024 at 06:06:01PM +0200, Kevin Wolf wrote:
> > Am 15.10.2024 um 14:59 hat Ming Lei geschrieben:
> > > On Tue, Oct 15, 2024 at 08:15:17PM +0800, Ming Lei wrote:
> > > > On Tue, Oct 15, 2024 at 8:11 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > > >
> > > > > On Tue, Oct 15, 2024 at 08:01:43PM +0800, Ming Lei wrote:
> > > > > > On Tue, Oct 15, 2024 at 6:22 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > the other day I was running some benchmarks to compare different QEMU
> > > > > > > block exports, and one of the scenarios I was interested in was
> > > > > > > exporting NBD from qemu-storage-daemon over a unix socket and attaching
> > > > > > > it as a block device using the kernel NBD client. I would then run a VM
> > > > > > > on top of it and fio inside of it.
> > > > > > >
> > > > > > > Unfortunately, I couldn't get any numbers because the connection always
> > > > > > > aborted with messages like "Double reply on req ..." or "Unexpected
> > > > > > > reply ..." in the host kernel log.
> > > > > > >
> > > > > > > Yesterday I found some time to have a closer look why this is happening,
> > > > > > > and I think I have a rough understanding of what's going on now. Look at
> > > > > > > these trace events:
> > > > > > >
> > > > > > >         qemu-img-51025   [005] ..... 19503.285423: nbd_header_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005a
> > > > > > > [...]
> > > > > > >         qemu-img-51025   [008] ..... 19503.285500: nbd_payload_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005d
> > > > > > > [...]
> > > > > > >    kworker/u49:1-47350   [004] ..... 19503.285514: nbd_header_received: nbd transport event: request 00000000b79e7443, handle 0x0000150c0000005a
> > > > > > >
> > > > > > > This is the same request, but the handle has changed between
> > > > > > > nbd_header_sent and nbd_payload_sent! I think this means that we hit one
> > > > > > > of the cases where the request is requeued, and then the next time it
> > > > > > > is executed with a different blk-mq tag, which is something the nbd
> > > > > > > driver doesn't seem to expect.
> > > > > > >
> > > > > > > Of course, since the cookie is transmitted in the header, the server
> > > > > > > replies with the original handle that contains the tag from the first
> > > > > > > call, while the kernel is only waiting for a handle with the new tag and
> > > > > > > is confused by the server response.
> > > > > > >
> > > > > > > I'm not sure yet which of the following options should be considered the
> > > > > > > real problem here, so I'm only describing the situation without trying
> > > > > > > to provide a patch:
> > > > > > >
> > > > > > > 1. Is it that blk-mq should always re-run the request with the same tag?
> > > > > > >    I don't expect so, though in practice I was surprised to see that it
> > > > > > >    happens quite often after nbd requeues a request that it actually
> > > > > > >    does end up with the same cookie again.
> > > > > >
> > > > > > No.
> > > > > >
> > > > > > request->tag will change, but we may take ->internal_tag(sched) or
> > > > > > ->tag(none), which won't change.
> > > > > >
> > > > > > I guess was_interrupted() in nbd_send_cmd() is triggered, then the payload
> > > > > > is sent with a different tag.
> > > > > >
> > > > > > I will try to cook one patch soon.
> > > > >
> > > > > Please try the following patch:
> > > > 
> > > > Oops, please ignore the patch, it can't work since
> > > > nbd_handle_reply() doesn't know static tag.
> > > 
> > > Please try the v2:
> > 
> > It doesn't fully work, though it replaced the bug with a different one.
> > Now I get "Unexpected request" for the final flush request.
> 
> That just shows the approach is working.
> 
> Flush request doesn't have static tag, that is why it is failed.
> It shouldn't be hard to cover it, please try the attached & revised
> patch.

Any other request types that are unusual, or is flush the only one?

> Another solution is to add per-nbd-device map for retrieving nbd_cmd
> by the stored `index` in cookie, and the cost is one such array for
> each device.

Yes, just creating the cookie another way and having an explicit mapping
back is the obvious naive solution (my option 2). It would be nice to
avoid this.
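For readers following along, the handle layout that makes this mismatch possible can be sketched in plain userspace C. The constants mirror the kernel's (BLK_MQ_UNIQUE_TAG_BITS in include/linux/blk-mq.h, NBD_COOKIE_BITS in drivers/block/nbd.c), but the helper names and the standalone form are illustrative, not the driver code itself:

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of the kernel constants. */
#define BLK_MQ_UNIQUE_TAG_BITS 16
#define BLK_MQ_UNIQUE_TAG_MASK ((1U << BLK_MQ_UNIQUE_TAG_BITS) - 1)
#define NBD_COOKIE_BITS 32

/* Like blk_mq_unique_tag(): hw queue number in the high 16 bits, tag in the low 16. */
static uint32_t unique_tag(uint16_t hwq, uint16_t tag)
{
	return ((uint32_t)hwq << BLK_MQ_UNIQUE_TAG_BITS) |
	       ((uint32_t)tag & BLK_MQ_UNIQUE_TAG_MASK);
}

/* Like nbd_cmd_handle(): retry cookie in the high 32 bits, unique tag in the low 32. */
static uint64_t nbd_handle(uint32_t cmd_cookie, uint32_t utag)
{
	return ((uint64_t)cmd_cookie << NBD_COOKIE_BITS) | utag;
}

/* Like nbd_handle_to_tag(): the reply is matched on the low 32 bits only. */
static uint32_t handle_to_tag(uint64_t handle)
{
	return (uint32_t)handle;
}
```

With these helpers the trace handle 0x0000150c0000005a decodes to cookie 0x150c and tag 0x5a; once the requeued request re-runs with tag 0x5d, the handle the client waits for no longer equals the one the server echoes back.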

> > 
> > Anyway, before talking about specific patches, would this even be the
> > right solution or would it only paper over a bigger issue?
> > 
> > Is getting a different tag the only thing that can go wrong if you
> > handle a request partially and then requeue it?
> 
> Strictly speaking it is BLK_STS_RESOURCE.
> 
> Not like userspace implementation, kernel nbd call one sock_sendmsg()
> for sending either request header, or each single data bvec, so
> partial xmit can't be avoided. This kind of handling is fine, given
> TCP is just byte stream, nothing difference is observed from nbd
> server side if data is correct.

I wasn't questioning the partial submission, but only if it's a good
idea to return the request to the queue in this case, or if the nbd
driver should use another mechanism to keep working on the request
without returning it. But if this is accepted and a common pattern in
other drivers, too (is it?), I don't have a problem with it.

> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 2cafcf11ee8b..3cc14fc76546 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -682,3 +682,51 @@ u32 blk_mq_unique_tag(struct request *rq)
>  		(rq->tag & BLK_MQ_UNIQUE_TAG_MASK);
>  }
>  EXPORT_SYMBOL(blk_mq_unique_tag);
> +
> +/* Same as blk_mq_unique_tag(), but built from a tag that persists across requeues */
> +u32 blk_mq_unique_static_tag(struct request *rq)
> +{
> +	bool use_sched = rq->q->elevator;
> +	u32 tag;
> +
> +	if (rq == rq->mq_hctx->fq->flush_rq) {
> +		if (use_sched)
> +			tag = rq->mq_hctx->sched_tags->nr_tags;
> +		else
> +			tag = rq->mq_hctx->tags->nr_tags;
> +	} else {
> +		tag = use_sched ? rq->internal_tag : rq->tag;
> +	}
> +
> +	return (rq->mq_hctx->queue_num << BLK_MQ_UNIQUE_TAG_BITS) |
> +		(tag & BLK_MQ_UNIQUE_TAG_MASK);
> +}
> +EXPORT_SYMBOL(blk_mq_unique_static_tag);
> +
> +static struct request *
> +__blk_mq_static_tag_to_rq(const struct blk_mq_hw_ctx *hctx,
> +		const struct blk_mq_tags *tags, u32 tag)
> +{
> +	if (tag < tags->nr_tags)
> +		return tags->static_rqs[tag];
> +	else if (tag == tags->nr_tags)
> +		return hctx->fq->flush_rq;
> +	else
> +		return NULL;
> +}

There is probably little reason to have this as a separate function. It
will be more readable if you inline it and make tags just a local
variable in blk_mq_static_tag_to_req().

> +struct request *blk_mq_static_tag_to_req(struct request_queue *q, u32 uniq_tag)
> +{
> +	unsigned long hwq = blk_mq_unique_tag_to_hwq(uniq_tag);
> +	u32 tag = blk_mq_unique_tag_to_tag(uniq_tag);
> +	const struct blk_mq_hw_ctx *hctx = xa_load(&q->hctx_table, hwq);
> +
> +	if (!hctx)
> +		return NULL;
> +
> +	if (q->elevator)
> +		return __blk_mq_static_tag_to_rq(hctx, hctx->sched_tags, tag);
> +
> +	return __blk_mq_static_tag_to_rq(hctx, hctx->tags, tag);
> +}
> +EXPORT_SYMBOL(blk_mq_static_tag_to_req);
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index b852050d8a96..5be324233c9f 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -201,7 +201,7 @@ static void nbd_requeue_cmd(struct nbd_cmd *cmd)
>  static u64 nbd_cmd_handle(struct nbd_cmd *cmd)
>  {
>  	struct request *req = blk_mq_rq_from_pdu(cmd);
> -	u32 tag = blk_mq_unique_tag(req);
> +	u32 tag = blk_mq_unique_static_tag(req);
>  	u64 cookie = cmd->cmd_cookie;
>  
>  	return (cookie << NBD_COOKIE_BITS) | tag;
> @@ -818,10 +818,7 @@ static struct nbd_cmd *nbd_handle_reply(struct nbd_device *nbd, int index,
>  
>  	handle = be64_to_cpu(reply->cookie);
>  	tag = nbd_handle_to_tag(handle);
> -	hwq = blk_mq_unique_tag_to_hwq(tag);

hwq is now unused and can be removed.

> -	if (hwq < nbd->tag_set.nr_hw_queues)
> -		req = blk_mq_tag_to_rq(nbd->tag_set.tags[hwq],
> -				       blk_mq_unique_tag_to_tag(tag));
> +	req = blk_mq_static_tag_to_req(nbd->disk->queue, tag);
>  	if (!req || !blk_mq_request_started(req)) {
>  		dev_err(disk_to_dev(nbd->disk), "Unexpected reply (%d) %p\n",
>  			tag, req);

This version of the patch survives the reproducer I used for debugging
this. I'll try to give it some more testing later.

Kevin


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel NBD client waits on wrong cookie, aborts connection
  2024-10-16 17:00             ` Kevin Wolf
@ 2024-10-17  1:16               ` Ming Lei
  2024-10-17  2:52                 ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2024-10-17  1:16 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: josef, axboe, linux-block, nbd, eblake, leon

On Wed, Oct 16, 2024 at 07:00:15PM +0200, Kevin Wolf wrote:
> Am 16.10.2024 um 04:20 hat Ming Lei geschrieben:
> > On Tue, Oct 15, 2024 at 06:06:01PM +0200, Kevin Wolf wrote:
> > > Am 15.10.2024 um 14:59 hat Ming Lei geschrieben:
> > > > On Tue, Oct 15, 2024 at 08:15:17PM +0800, Ming Lei wrote:
> > > > > On Tue, Oct 15, 2024 at 8:11 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, Oct 15, 2024 at 08:01:43PM +0800, Ming Lei wrote:
> > > > > > > On Tue, Oct 15, 2024 at 6:22 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > the other day I was running some benchmarks to compare different QEMU
> > > > > > > > block exports, and one of the scenarios I was interested in was
> > > > > > > > exporting NBD from qemu-storage-daemon over a unix socket and attaching
> > > > > > > > it as a block device using the kernel NBD client. I would then run a VM
> > > > > > > > on top of it and fio inside of it.
> > > > > > > >
> > > > > > > > Unfortunately, I couldn't get any numbers because the connection always
> > > > > > > > aborted with messages like "Double reply on req ..." or "Unexpected
> > > > > > > > reply ..." in the host kernel log.
> > > > > > > >
> > > > > > > > Yesterday I found some time to have a closer look why this is happening,
> > > > > > > > and I think I have a rough understanding of what's going on now. Look at
> > > > > > > > these trace events:
> > > > > > > >
> > > > > > > >         qemu-img-51025   [005] ..... 19503.285423: nbd_header_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005a
> > > > > > > > [...]
> > > > > > > >         qemu-img-51025   [008] ..... 19503.285500: nbd_payload_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005d
> > > > > > > > [...]
> > > > > > > >    kworker/u49:1-47350   [004] ..... 19503.285514: nbd_header_received: nbd transport event: request 00000000b79e7443, handle 0x0000150c0000005a
> > > > > > > >
> > > > > > > > This is the same request, but the handle has changed between
> > > > > > > > nbd_header_sent and nbd_payload_sent! I think this means that we hit one
> > > > > > > > of the cases where the request is requeued, and then the next time it
> > > > > > > > is executed with a different blk-mq tag, which is something the nbd
> > > > > > > > driver doesn't seem to expect.
> > > > > > > >
> > > > > > > > Of course, since the cookie is transmitted in the header, the server
> > > > > > > > replies with the original handle that contains the tag from the first
> > > > > > > > call, while the kernel is only waiting for a handle with the new tag and
> > > > > > > > is confused by the server response.
> > > > > > > >
> > > > > > > > I'm not sure yet which of the following options should be considered the
> > > > > > > > real problem here, so I'm only describing the situation without trying
> > > > > > > > to provide a patch:
> > > > > > > >
> > > > > > > > 1. Is it that blk-mq should always re-run the request with the same tag?
> > > > > > > >    I don't expect so, though in practice I was surprised to see that it
> > > > > > > >    happens quite often after nbd requeues a request that it actually
> > > > > > > >    does end up with the same cookie again.
> > > > > > >
> > > > > > > No.
> > > > > > >
> > > > > > > request->tag will change, but we may take ->internal_tag(sched) or
> > > > > > > ->tag(none), which won't change.
> > > > > > >
> > > > > > > I guess was_interrupted() in nbd_send_cmd() is triggered, then the payload
> > > > > > > is sent with a different tag.
> > > > > > >
> > > > > > > I will try to cook one patch soon.
> > > > > >
> > > > > > Please try the following patch:
> > > > > 
> > > > > Oops, please ignore the patch, it can't work since
> > > > > nbd_handle_reply() doesn't know static tag.
> > > > 
> > > > Please try the v2:
> > > 
> > > It doesn't fully work, though it replaced the bug with a different one.
> > > Now I get "Unexpected request" for the final flush request.
> > 
> > That just shows the approach is working.
> > 
> > Flush request doesn't have static tag, that is why it is failed.
> > It shouldn't be hard to cover it, please try the attached & revised
> > patch.
> 
> Any other request types that are unusual, or is flush the only one?

Flush is the only one.

> 
> > Another solution is to add per-nbd-device map for retrieving nbd_cmd
> > by the stored `index` in cookie, and the cost is one such array for
> > each device.
> 
> Yes, just creating the cookie another way and having an explicit mapping
> back is the obvious naive solution (my option 2). It would be nice to
> avoid this.
> 
> > > 
> > > Anyway, before talking about specific patches, would this even be the
> > > right solution or would it only paper over a bigger issue?
> > > 
> > > Is getting a different tag the only thing that can go wrong if you
> > > handle a request partially and then requeue it?
> > 
> > Strictly speaking it is BLK_STS_RESOURCE.
> > 
> > Not like userspace implementation, kernel nbd call one sock_sendmsg()
> > for sending either request header, or each single data bvec, so
> > partial xmit can't be avoided. This kind of handling is fine, given
> > TCP is just byte stream, nothing difference is observed from nbd
> > server side if data is correct.
> 
> I wasn't questioning the partial submission, but only if it's a good
> idea to return the request to the queue in this case, or if the nbd
> driver should use another mechanism to keep working on the request
> without returning it. But if this is accepted and a common pattern in
> other drivers, too (is it?), I don't have a problem with it.

It is a common pattern to retrieve the request by its tag in many storage
drivers (scsi, nvme, ...); in fact, it is effectively the only way.

A userspace implementation doesn't need this: with async/.await, the I/O
request (or whatever represents it) can simply live in a local variable.
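To make the flush sentinel in the patch concrete: a regular request resolves through tags->static_rqs[], while the flush request, which owns no static tag, is encoded as the out-of-range value nr_tags. A userspace sketch of that lookup (NR_TAGS and the int stand-ins are illustrative placeholders, not the kernel structures):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_TAGS 4U  /* illustrative queue depth, stands in for tags->nr_tags */

/* Stand-ins for tags->static_rqs[] and the per-hctx flush request. */
static int static_rqs[NR_TAGS];
static int flush_rq;

/* Mirrors the lookup logic of the proposed __blk_mq_static_tag_to_rq(). */
static int *static_tag_to_rq(uint32_t tag)
{
	if (tag < NR_TAGS)
		return &static_rqs[tag];
	else if (tag == NR_TAGS)
		return &flush_rq;  /* flush has no static tag; nr_tags is its sentinel */
	else
		return NULL;
}
```

The sentinel works because regular static tags are always strictly below nr_tags, so the value nr_tags itself is free to name the one flush request per hardware queue.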

> 
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index 2cafcf11ee8b..3cc14fc76546 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -682,3 +682,51 @@ u32 blk_mq_unique_tag(struct request *rq)
> >  		(rq->tag & BLK_MQ_UNIQUE_TAG_MASK);
> >  }
> >  EXPORT_SYMBOL(blk_mq_unique_tag);
> > +
> > +/* Same as blk_mq_unique_tag(), but built from a tag that persists across requeues */
> > +u32 blk_mq_unique_static_tag(struct request *rq)
> > +{
> > +	bool use_sched = rq->q->elevator;
> > +	u32 tag;
> > +
> > +	if (rq == rq->mq_hctx->fq->flush_rq) {
> > +		if (use_sched)
> > +			tag = rq->mq_hctx->sched_tags->nr_tags;
> > +		else
> > +			tag = rq->mq_hctx->tags->nr_tags;
> > +	} else {
> > +		tag = use_sched ? rq->internal_tag : rq->tag;
> > +	}
> > +
> > +	return (rq->mq_hctx->queue_num << BLK_MQ_UNIQUE_TAG_BITS) |
> > +		(tag & BLK_MQ_UNIQUE_TAG_MASK);
> > +}
> > +EXPORT_SYMBOL(blk_mq_unique_static_tag);
> > +
> > +static struct request *
> > +__blk_mq_static_tag_to_rq(const struct blk_mq_hw_ctx *hctx,
> > +		const struct blk_mq_tags *tags, u32 tag)
> > +{
> > +	if (tag < tags->nr_tags)
> > +		return tags->static_rqs[tag];
> > +	else if (tag == tags->nr_tags)
> > +		return hctx->fq->flush_rq;
> > +	else
> > +		return NULL;
> > +}
> 
> There is probably little reason to have this as a separate function. It
> will be more readable if you inline it and make tags just a local
> variable in blk_mq_static_tag_to_req().

Looks doable.

> 
> > +struct request *blk_mq_static_tag_to_req(struct request_queue *q, u32 uniq_tag)
> > +{
> > +	unsigned long hwq = blk_mq_unique_tag_to_hwq(uniq_tag);
> > +	u32 tag = blk_mq_unique_tag_to_tag(uniq_tag);
> > +	const struct blk_mq_hw_ctx *hctx = xa_load(&q->hctx_table, hwq);
> > +
> > +	if (!hctx)
> > +		return NULL;
> > +
> > +	if (q->elevator)
> > +		return __blk_mq_static_tag_to_rq(hctx, hctx->sched_tags, tag);
> > +
> > +	return __blk_mq_static_tag_to_rq(hctx, hctx->tags, tag);
> > +}
> > +EXPORT_SYMBOL(blk_mq_static_tag_to_req);
> > diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> > index b852050d8a96..5be324233c9f 100644
> > --- a/drivers/block/nbd.c
> > +++ b/drivers/block/nbd.c
> > @@ -201,7 +201,7 @@ static void nbd_requeue_cmd(struct nbd_cmd *cmd)
> >  static u64 nbd_cmd_handle(struct nbd_cmd *cmd)
> >  {
> >  	struct request *req = blk_mq_rq_from_pdu(cmd);
> > -	u32 tag = blk_mq_unique_tag(req);
> > +	u32 tag = blk_mq_unique_static_tag(req);
> >  	u64 cookie = cmd->cmd_cookie;
> >  
> >  	return (cookie << NBD_COOKIE_BITS) | tag;
> > @@ -818,10 +818,7 @@ static struct nbd_cmd *nbd_handle_reply(struct nbd_device *nbd, int index,
> >  
> >  	handle = be64_to_cpu(reply->cookie);
> >  	tag = nbd_handle_to_tag(handle);
> > -	hwq = blk_mq_unique_tag_to_hwq(tag);
> 
> hwq is now unused and can be removed.

OK, will kill it.

> 
> > -	if (hwq < nbd->tag_set.nr_hw_queues)
> > -		req = blk_mq_tag_to_rq(nbd->tag_set.tags[hwq],
> > -				       blk_mq_unique_tag_to_tag(tag));
> > +	req = blk_mq_static_tag_to_req(nbd->disk->queue, tag);
> >  	if (!req || !blk_mq_request_started(req)) {
> >  		dev_err(disk_to_dev(nbd->disk), "Unexpected reply (%d) %p\n",
> >  			tag, req);
> 
> This version of the patch survives the reproducer I used for debugging
> this. I'll try to give it some more testing later.

Ok, thanks for the test.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel NBD client waits on wrong cookie, aborts connection
  2024-10-17  1:16               ` Ming Lei
@ 2024-10-17  2:52                 ` Ming Lei
  0 siblings, 0 replies; 12+ messages in thread
From: Ming Lei @ 2024-10-17  2:52 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: josef, axboe, linux-block, nbd, eblake, leon

On Thu, Oct 17, 2024 at 09:16:07AM +0800, Ming Lei wrote:
> On Wed, Oct 16, 2024 at 07:00:15PM +0200, Kevin Wolf wrote:
> > Am 16.10.2024 um 04:20 hat Ming Lei geschrieben:
> > > On Tue, Oct 15, 2024 at 06:06:01PM +0200, Kevin Wolf wrote:
> > > > Am 15.10.2024 um 14:59 hat Ming Lei geschrieben:
> > > > > On Tue, Oct 15, 2024 at 08:15:17PM +0800, Ming Lei wrote:
> > > > > > On Tue, Oct 15, 2024 at 8:11 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, Oct 15, 2024 at 08:01:43PM +0800, Ming Lei wrote:
> > > > > > > > On Tue, Oct 15, 2024 at 6:22 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > the other day I was running some benchmarks to compare different QEMU
> > > > > > > > > block exports, and one of the scenarios I was interested in was
> > > > > > > > > exporting NBD from qemu-storage-daemon over a unix socket and attaching
> > > > > > > > > it as a block device using the kernel NBD client. I would then run a VM
> > > > > > > > > on top of it and fio inside of it.
> > > > > > > > >
> > > > > > > > > Unfortunately, I couldn't get any numbers because the connection always
> > > > > > > > > aborted with messages like "Double reply on req ..." or "Unexpected
> > > > > > > > > reply ..." in the host kernel log.
> > > > > > > > >
> > > > > > > > > Yesterday I found some time to have a closer look why this is happening,
> > > > > > > > > and I think I have a rough understanding of what's going on now. Look at
> > > > > > > > > these trace events:
> > > > > > > > >
> > > > > > > > >         qemu-img-51025   [005] ..... 19503.285423: nbd_header_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005a
> > > > > > > > > [...]
> > > > > > > > >         qemu-img-51025   [008] ..... 19503.285500: nbd_payload_sent: nbd transport event: request 000000002df03708, handle 0x0000150c0000005d
> > > > > > > > > [...]
> > > > > > > > >    kworker/u49:1-47350   [004] ..... 19503.285514: nbd_header_received: nbd transport event: request 00000000b79e7443, handle 0x0000150c0000005a
> > > > > > > > >
> > > > > > > > > This is the same request, but the handle has changed between
> > > > > > > > > nbd_header_sent and nbd_payload_sent! I think this means that we hit one
> > > > > > > > > of the cases where the request is requeued, and then the next time it
> > > > > > > > > is executed with a different blk-mq tag, which is something the nbd
> > > > > > > > > driver doesn't seem to expect.
> > > > > > > > >
> > > > > > > > > Of course, since the cookie is transmitted in the header, the server
> > > > > > > > > replies with the original handle that contains the tag from the first
> > > > > > > > > call, while the kernel is only waiting for a handle with the new tag and
> > > > > > > > > is confused by the server response.
> > > > > > > > >
> > > > > > > > > I'm not sure yet which of the following options should be considered the
> > > > > > > > > real problem here, so I'm only describing the situation without trying
> > > > > > > > > to provide a patch:
> > > > > > > > >
> > > > > > > > > 1. Is it that blk-mq should always re-run the request with the same tag?
> > > > > > > > >    I don't expect so, though in practice I was surprised to see that it
> > > > > > > > >    happens quite often after nbd requeues a request that it actually
> > > > > > > > >    does end up with the same cookie again.
> > > > > > > >
> > > > > > > > No.
> > > > > > > >
> > > > > > > > request->tag will change, but we may take ->internal_tag(sched) or
> > > > > > > > ->tag(none), which won't change.
> > > > > > > >
> > > > > > > > I guess was_interrupted() in nbd_send_cmd() is triggered, then the payload
> > > > > > > > is sent with a different tag.
> > > > > > > >
> > > > > > > > I will try to cook one patch soon.
> > > > > > >
> > > > > > > Please try the following patch:
> > > > > > 
> > > > > > Oops, please ignore the patch, it can't work since
> > > > > > nbd_handle_reply() doesn't know static tag.
> > > > > 
> > > > > Please try the v2:
> > > > 
> > > > It doesn't fully work, though it replaced the bug with a different one.
> > > > Now I get "Unexpected request" for the final flush request.
> > > 
> > > That just shows the approach is working.
> > > 
> > > Flush request doesn't have static tag, that is why it is failed.
> > > It shouldn't be hard to cover it, please try the attached & revised
> > > patch.
> > 
> > Any other request types that are unusual, or is flush the only one?
> 
> Flush is the only one.
> 
> > 
> > > Another solution is to add per-nbd-device map for retrieving nbd_cmd
> > > by the stored `index` in cookie, and the cost is one such array for
> > > each device.
> > 
> > Yes, just creating the cookie another way and having an explicit mapping
> > back is the obvious naive solution (my option 2). It would be nice to
> > avoid this.
> > 
> > > > 
> > > > Anyway, before talking about specific patches, would this even be the
> > > > right solution or would it only paper over a bigger issue?
> > > > 
> > > > Is getting a different tag the only thing that can go wrong if you
> > > > handle a request partially and then requeue it?
> > > 
> > > Strictly speaking it is BLK_STS_RESOURCE.
> > > 
> > > Not like userspace implementation, kernel nbd call one sock_sendmsg()
> > > for sending either request header, or each single data bvec, so
> > > partial xmit can't be avoided. This kind of handling is fine, given
> > > TCP is just byte stream, nothing difference is observed from nbd
> > > server side if data is correct.
> > 
> > I wasn't questioning the partial submission, but only if it's a good
> > idea to return the request to the queue in this case, or if the nbd
> > driver should use another mechanism to keep working on the request
> > without returning it. But if this is accepted and a common pattern in
> > other drivers, too (is it?), I don't have a problem with it.
> 
> It is one common pattern to retrieve request with tag in many storage
> drivers(scsi, nvme, ...), also it should be the only way.
> 
> But userspace implementation needn't it, with async/.await the io
> request or whatever can be defined as one local variable.

Thinking about it further, we shouldn't burden blk-mq with static-tag
APIs: they not only expose implementation details, but also aren't
strictly necessary.

Instead of returning BLK_STS_RESOURCE and asking blk-mq to retry after a
partial send, nbd should handle the retry itself.
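The driver-side alternative being suggested here, remembering how many bytes were already sent and resuming from that offset instead of handing the request back to blk-mq, can be illustrated with a userspace sketch. The fake socket and the struct cmd fields are contrived for illustration (the real driver keeps similar resume state, but this is not its code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fake byte-stream socket that accepts at most `cap` bytes per call,
 * standing in for a sock_sendmsg() that returns a short write. */
static char sockbuf[64];
static size_t sockpos;

static size_t fake_send(const char *data, size_t len, size_t cap)
{
	size_t n = len < cap ? len : cap;

	memcpy(sockbuf + sockpos, data, n);
	sockpos += n;
	return n;
}

/* Illustrative per-command state: resume from `sent` on the next attempt
 * rather than requeuing (which could hand out a different tag). */
struct cmd {
	const char *payload;
	size_t len;
	size_t sent;
};

/* One transmission attempt; returns 0 when fully sent, -1 on partial xmit. */
static int send_step(struct cmd *c, size_t cap)
{
	c->sent += fake_send(c->payload + c->sent, c->len - c->sent, cap);
	return c->sent == c->len ? 0 : -1;
}

static struct cmd demo = { "0123456789", 10, 0 };
```

Because the command keeps its own offset, each retry picks up exactly where the previous short write stopped, and the server still sees one contiguous byte stream for the request.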

Thanks,
Ming


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel NBD client waits on wrong cookie, aborts connection
  2024-10-16  2:20           ` Ming Lei
  2024-10-16 17:00             ` Kevin Wolf
@ 2024-10-18 14:40             ` Leon Schuermann
  2024-10-21 17:54             ` UNSUBSCRIBE Simon Fernandez
  2 siblings, 0 replies; 12+ messages in thread
From: Leon Schuermann @ 2024-10-18 14:40 UTC (permalink / raw)
  To: Ming Lei; +Cc: Kevin Wolf, josef, axboe, linux-block, nbd, eblake, Vincent Chen

Thanks for CCing me on this thread. Indeed this seems to be the exact
same issue that I (and Vincent Chen) reported in the other thread [1].

Ming Lei <ming.lei@redhat.com> writes:
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 2cafcf11ee8b..3cc14fc76546 100644

I can confirm that this patch fixes these issues.

Prior to this patch, I managed to reproduce both the "Unexpected reply"
and "Double reply" bugs reliably via a TCP/IP connection between two
AMD64 VMs on different hosts, two aarch64 VMs, and an aarch64 Raspberry
Pi 5 SBC connecting to an NBD server on another host.

With this patch, at least in the AMD64 VM setup, it survived an
hour-long stress test using the `stress-ng` command from [1]. Also,
sending a SIGINT to a process actively using the disk often manifested
this bug. I could not reproduce this faulty behavior with this patch.

I'm still working on backporting this to the v6.6 Raspberry Pi vendor
kernel and compiling it, and will report my observations there. I'll
also try Ming's v2 from [2].

Thanks!

-Leon

[1]: https://lore.kernel.org/linux-block/87jzea4w2n.fsf@silicon.host.schuermann.io/
[2]: https://lore.kernel.org/linux-block/20241018140831.3064135-1-ming.lei@redhat.com/

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-10-21 17:54 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-15 10:21 Kernel NBD client waits on wrong cookie, aborts connection Kevin Wolf
2024-10-15 12:01 ` Ming Lei
2024-10-15 12:11   ` Ming Lei
2024-10-15 12:15     ` Ming Lei
2024-10-15 12:59       ` Ming Lei
2024-10-15 16:06         ` Kevin Wolf
2024-10-16  2:20           ` Ming Lei
2024-10-16 17:00             ` Kevin Wolf
2024-10-17  1:16               ` Ming Lei
2024-10-17  2:52                 ` Ming Lei
2024-10-18 14:40             ` Leon Schuermann
2024-10-21 17:54             ` UNSUBSCRIBE Simon Fernandez
