* [PATCH 1/3] block/nbd: only detach existing iochannel from aio_context
From: Roman Kagan @ 2021-01-28 20:14 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-block, Max Reitz
When a reconnect is in progress in the NBD client, the iochannel used
for the NBD connection doesn't exist. Therefore an attempt to detach
it from the aio_context of the parent BlockDriverState results in a
NULL pointer dereference.
The problem is triggerable, in particular, when an outgoing migration is
about to finish, and stopping the dataplane tries to move the
BlockDriverState from the iothread aio_context to the main loop. If the
NBD connection is lost before this point, and the NBD client has entered
the reconnect procedure, QEMU crashes:
at /build/qemu-gYtjVn/qemu-5.0.1/io/channel.c:452
at /build/qemu-gYtjVn/qemu-5.0.1/block.c:6151
new_context=new_context@entry=0x562a260c9580,
ignore=ignore@entry=0x7feeadc9b780)
at /build/qemu-gYtjVn/qemu-5.0.1/block.c:6230
(bs=bs@entry=0x562a268d6a00, ctx=0x562a260c9580,
ignore_child=<optimized out>, errp=<optimized out>)
at /build/qemu-gYtjVn/qemu-5.0.1/block.c:6332
new_context=0x562a260c9580,
update_root_node=update_root_node@entry=true, errp=errp@entry=0x0)
at /build/qemu-gYtjVn/qemu-5.0.1/block/block-backend.c:1989
new_context=<optimized out>, errp=errp@entry=0x0)
at /build/qemu-gYtjVn/qemu-5.0.1/block/block-backend.c:2010
out>)
at /build/qemu-gYtjVn/qemu-5.0.1/hw/block/dataplane/virtio-blk.c:292
at /build/qemu-gYtjVn/qemu-5.0.1/hw/virtio/virtio-bus.c:245
running=0, state=<optimized out>)
at /build/qemu-gYtjVn/qemu-5.0.1/hw/virtio/virtio.c:3220
state=state@entry=RUN_STATE_FINISH_MIGRATE)
at /build/qemu-gYtjVn/qemu-5.0.1/softmmu/vl.c:1275
send_stop=<optimized out>)
at /build/qemu-gYtjVn/qemu-5.0.1/cpus.c:1032
at /build/qemu-gYtjVn/qemu-5.0.1/migration/migration.c:2914
at /build/qemu-gYtjVn/qemu-5.0.1/migration/migration.c:3275
at /build/qemu-gYtjVn/qemu-5.0.1/migration/migration.c:3439
at /build/qemu-gYtjVn/qemu-5.0.1/util/qemu-thread-posix.c:519
at pthread_create.c:333
oldval=0x562a2452b138, oldlenp=0x0, newval=0x562a2452c5e0
<__func__.28102>, newlen=0)
at ../sysdeps/unix/sysv/linux/sysctl.c:30
Fix it by checking that the iochannel is non-null before trying to
detach it from the aio_context. If it is null, no detaching is needed,
and it will get reattached in the proper aio_context once the connection
is reestablished.
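
For context, qio_channel_detach_aio_context() dereferences its argument
unconditionally before doing anything else. Roughly (a paraphrased
sketch of io/channel.c in 5.0.1, not a verbatim quote):

    void qio_channel_detach_aio_context(QIOChannel *ioc)
    {
        ioc->read_coroutine = NULL;    /* NULL 'ioc' crashes right here */
        ioc->write_coroutine = NULL;
        qio_channel_set_aio_fd_handlers(ioc);
        ioc->ctx = NULL;
    }

so the NULL check has to live in the caller, which is what this patch
does.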
Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
---
block/nbd.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/block/nbd.c b/block/nbd.c
index 42e10c7c93..bcd6641e90 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -235,7 +235,14 @@ static void nbd_client_detach_aio_context(BlockDriverState *bs)
/* Timer is deleted in nbd_client_co_drain_begin() */
assert(!s->reconnect_delay_timer);
- qio_channel_detach_aio_context(QIO_CHANNEL(s->ioc));
+ /*
+ * If reconnect is in progress we may have no ->ioc. It will be
+ * re-instantiated in the proper aio context once the connection is
+ * reestablished.
+ */
+ if (s->ioc) {
+ qio_channel_detach_aio_context(QIO_CHANNEL(s->ioc));
+ }
}
static void nbd_client_attach_aio_context_bh(void *opaque)
--
2.29.2
* Re: [PATCH 1/3] block/nbd: only detach existing iochannel from aio_context
From: Vladimir Sementsov-Ogievskiy @ 2021-01-29 5:37 UTC (permalink / raw)
To: Roman Kagan, qemu-devel; +Cc: Kevin Wolf, qemu-block, Max Reitz
28.01.2021 23:14, Roman Kagan wrote:
> When a reconnect is in progress in the NBD client, the iochannel used
> for the NBD connection doesn't exist. Therefore an attempt to detach
> it from the aio_context of the parent BlockDriverState results in a
> NULL pointer dereference.
>
> The problem is triggerable, in particular, when an outgoing migration is
> about to finish, and stopping the dataplane tries to move the
> BlockDriverState from the iothread aio_context to the main loop. If the
> NBD connection is lost before this point, and the NBD client has entered
> the reconnect procedure, QEMU crashes:
>
> at /build/qemu-gYtjVn/qemu-5.0.1/io/channel.c:452
> at /build/qemu-gYtjVn/qemu-5.0.1/block.c:6151
> new_context=new_context@entry=0x562a260c9580,
> ignore=ignore@entry=0x7feeadc9b780)
> at /build/qemu-gYtjVn/qemu-5.0.1/block.c:6230
> (bs=bs@entry=0x562a268d6a00, ctx=0x562a260c9580,
> ignore_child=<optimized out>, errp=<optimized out>)
> at /build/qemu-gYtjVn/qemu-5.0.1/block.c:6332
> new_context=0x562a260c9580,
> update_root_node=update_root_node@entry=true, errp=errp@entry=0x0)
> at /build/qemu-gYtjVn/qemu-5.0.1/block/block-backend.c:1989
> new_context=<optimized out>, errp=errp@entry=0x0)
> at /build/qemu-gYtjVn/qemu-5.0.1/block/block-backend.c:2010
> out>)
> at /build/qemu-gYtjVn/qemu-5.0.1/hw/block/dataplane/virtio-blk.c:292
> at /build/qemu-gYtjVn/qemu-5.0.1/hw/virtio/virtio-bus.c:245
> running=0, state=<optimized out>)
> at /build/qemu-gYtjVn/qemu-5.0.1/hw/virtio/virtio.c:3220
> state=state@entry=RUN_STATE_FINISH_MIGRATE)
> at /build/qemu-gYtjVn/qemu-5.0.1/softmmu/vl.c:1275
> send_stop=<optimized out>)
> at /build/qemu-gYtjVn/qemu-5.0.1/cpus.c:1032
> at /build/qemu-gYtjVn/qemu-5.0.1/migration/migration.c:2914
> at /build/qemu-gYtjVn/qemu-5.0.1/migration/migration.c:3275
> at /build/qemu-gYtjVn/qemu-5.0.1/migration/migration.c:3439
> at /build/qemu-gYtjVn/qemu-5.0.1/util/qemu-thread-posix.c:519
> at pthread_create.c:333
> oldval=0x562a2452b138, oldlenp=0x0, newval=0x562a2452c5e0
> <__func__.28102>, newlen=0)
> at ../sysdeps/unix/sysv/linux/sysctl.c:30
function names are somehow lost :(
>
> Fix it by checking that the iochannel is non-null before trying to
> detach it from the aio_context. If it is null, no detaching is needed,
> and it will get reattached in the proper aio_context once the connection
> is reestablished.
>
> Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
Thanks!
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
> block/nbd.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/block/nbd.c b/block/nbd.c
> index 42e10c7c93..bcd6641e90 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -235,7 +235,14 @@ static void nbd_client_detach_aio_context(BlockDriverState *bs)
>
> /* Timer is deleted in nbd_client_co_drain_begin() */
> assert(!s->reconnect_delay_timer);
> - qio_channel_detach_aio_context(QIO_CHANNEL(s->ioc));
> + /*
> + * If reconnect is in progress we may have no ->ioc. It will be
> + * re-instantiated in the proper aio context once the connection is
> + * reestablished.
> + */
> + if (s->ioc) {
> + qio_channel_detach_aio_context(QIO_CHANNEL(s->ioc));
> + }
> }
>
> static void nbd_client_attach_aio_context_bh(void *opaque)
>
--
Best regards,
Vladimir
* Re: [PATCH 1/3] block/nbd: only detach existing iochannel from aio_context
From: Roman Kagan @ 2021-01-29 7:03 UTC (permalink / raw)
To: Vladimir Sementsov-Ogievskiy
Cc: Kevin Wolf, qemu-devel, qemu-block, Max Reitz
On Fri, Jan 29, 2021 at 08:37:13AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 28.01.2021 23:14, Roman Kagan wrote:
> > When a reconnect is in progress in the NBD client, the iochannel used
> > for the NBD connection doesn't exist. Therefore an attempt to detach
> > it from the aio_context of the parent BlockDriverState results in a
> > NULL pointer dereference.
> >
> > The problem is triggerable, in particular, when an outgoing migration is
> > about to finish, and stopping the dataplane tries to move the
> > BlockDriverState from the iothread aio_context to the main loop. If the
> > NBD connection is lost before this point, and the NBD client has entered
> > the reconnect procedure, QEMU crashes:
> >
> > at /build/qemu-gYtjVn/qemu-5.0.1/io/channel.c:452
> > at /build/qemu-gYtjVn/qemu-5.0.1/block.c:6151
> > new_context=new_context@entry=0x562a260c9580,
> > ignore=ignore@entry=0x7feeadc9b780)
> > at /build/qemu-gYtjVn/qemu-5.0.1/block.c:6230
> > (bs=bs@entry=0x562a268d6a00, ctx=0x562a260c9580,
> > ignore_child=<optimized out>, errp=<optimized out>)
> > at /build/qemu-gYtjVn/qemu-5.0.1/block.c:6332
> > new_context=0x562a260c9580,
> > update_root_node=update_root_node@entry=true, errp=errp@entry=0x0)
> > at /build/qemu-gYtjVn/qemu-5.0.1/block/block-backend.c:1989
> > new_context=<optimized out>, errp=errp@entry=0x0)
> > at /build/qemu-gYtjVn/qemu-5.0.1/block/block-backend.c:2010
> > out>)
> > at /build/qemu-gYtjVn/qemu-5.0.1/hw/block/dataplane/virtio-blk.c:292
> > at /build/qemu-gYtjVn/qemu-5.0.1/hw/virtio/virtio-bus.c:245
> > running=0, state=<optimized out>)
> > at /build/qemu-gYtjVn/qemu-5.0.1/hw/virtio/virtio.c:3220
> > state=state@entry=RUN_STATE_FINISH_MIGRATE)
> > at /build/qemu-gYtjVn/qemu-5.0.1/softmmu/vl.c:1275
> > send_stop=<optimized out>)
> > at /build/qemu-gYtjVn/qemu-5.0.1/cpus.c:1032
> > at /build/qemu-gYtjVn/qemu-5.0.1/migration/migration.c:2914
> > at /build/qemu-gYtjVn/qemu-5.0.1/migration/migration.c:3275
> > at /build/qemu-gYtjVn/qemu-5.0.1/migration/migration.c:3439
> > at /build/qemu-gYtjVn/qemu-5.0.1/util/qemu-thread-posix.c:519
> > at pthread_create.c:333
> > oldval=0x562a2452b138, oldlenp=0x0, newval=0x562a2452c5e0
> > <__func__.28102>, newlen=0)
> > at ../sysdeps/unix/sysv/linux/sysctl.c:30
>
> function names are somehow lost :(
Oops. gdb prefixes each backtrace frame with a hash ('#0', '#1', ...),
so git commit interpreted those lines as comments and stripped them,
leaving only the wrapped continuation lines; I should have indented the
backtrace when pasting.
Will respin with fixed log messages, thanks.
Roman.
* [PATCH 2/3] block/nbd: only enter connection coroutine if it's present
From: Roman Kagan @ 2021-01-28 20:14 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-block, Max Reitz
When an NBD block driver state is moved from one aio_context to another
(e.g. when doing a drain in a migration thread),
nbd_client_attach_aio_context_bh is executed; it enters the connection
coroutine.
However, the assumption that ->connection_co is always present here
appears incorrect: the client may have encountered an error other than
-EIO in the underlying transport and decided to quit rather than keep
trying to reconnect, terminating the connection coroutine. As a
result, an attempt to reassign
the client in this state (NBD_CLIENT_QUIT) to a different aio_context
leads to a null pointer dereference:
at /build/qemu-6MF7tq/qemu-5.0.1/util/qemu-coroutine.c:109
opaque=0x561805ed4c00) at /build/qemu-6MF7tq/qemu-5.0.1/block/nbd.c:164
at /build/qemu-6MF7tq/qemu-5.0.1/util/aio-wait.c:55
at /build/qemu-6MF7tq/qemu-5.0.1/util/async.c:136
at /build/qemu-6MF7tq/qemu-5.0.1/util/async.c:164
blocking=blocking@entry=true)
at /build/qemu-6MF7tq/qemu-5.0.1/util/aio-posix.c:650
cb=<optimized out>, opaque=<optimized out>)
at /build/qemu-6MF7tq/qemu-5.0.1/util/aio-wait.c:71
bs=0x561805ed4c00) at /build/qemu-6MF7tq/qemu-5.0.1/block.c:6172
new_context=new_context@entry=0x5618056c7580,
ignore=ignore@entry=0x7f60e1e63780)
at /build/qemu-6MF7tq/qemu-5.0.1/block.c:6237
bs=bs@entry=0x561805ed4c00, ctx=0x5618056c7580,
ignore_child=<optimized out>, errp=<optimized out>)
at /build/qemu-6MF7tq/qemu-5.0.1/block.c:6332
new_context=0x5618056c7580, update_root_node=update_root_node@entry=true,
errp=errp@entry=0x0)
at /build/qemu-6MF7tq/qemu-5.0.1/block/block-backend.c:1989
new_context=<optimized out>, errp=errp@entry=0x0)
at /build/qemu-6MF7tq/qemu-5.0.1/block/block-backend.c:2010
at /build/qemu-6MF7tq/qemu-5.0.1/hw/block/dataplane/virtio-blk.c:292
at /build/qemu-6MF7tq/qemu-5.0.1/hw/virtio/virtio-bus.c:245
running=0, state=<optimized out>)
at /build/qemu-6MF7tq/qemu-5.0.1/hw/virtio/virtio.c:3220
state=state@entry=RUN_STATE_FINISH_MIGRATE)
at /build/qemu-6MF7tq/qemu-5.0.1/softmmu/vl.c:1275
send_stop=<optimized out>) at /build/qemu-6MF7tq/qemu-5.0.1/cpus.c:1032
at /build/qemu-6MF7tq/qemu-5.0.1/migration/migration.c:2914
at /build/qemu-6MF7tq/qemu-5.0.1/migration/migration.c:3275
at /build/qemu-6MF7tq/qemu-5.0.1/migration/migration.c:3439
at /build/qemu-6MF7tq/qemu-5.0.1/util/qemu-thread-posix.c:519
from /lib/x86_64-linux-gnu/libpthread.so.0
Fix it by checking that the connection coroutine is non-null before
trying to enter it. If it is null, no entering is needed, as the
connection is probably going down anyway.
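
For illustration, the exit path of the connection coroutine looks
roughly like this (a simplified sketch of block/nbd.c with the reply
serving and teardown details elided, not the verbatim code):

    static coroutine_fn void nbd_connection_entry(void *opaque)
    {
        BDRVNBDState *s = opaque;

        while (s->state != NBD_CLIENT_QUIT) {
            /* ... serve replies; on -EIO, try to reconnect ... */
        }

        /* ... teardown of the iochannel etc. elided ... */

        /*
         * The coroutine is gone for good from this point on, so
         * anything that wants to enter it must check for NULL first.
         */
        s->connection_co = NULL;
    }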
Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
---
block/nbd.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/block/nbd.c b/block/nbd.c
index bcd6641e90..b3cbbeb4b0 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -250,13 +250,15 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
BlockDriverState *bs = opaque;
BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
- /*
- * The node is still drained, so we know the coroutine has yielded in
- * nbd_read_eof(), the only place where bs->in_flight can reach 0, or it is
- * entered for the first time. Both places are safe for entering the
- * coroutine.
- */
- qemu_aio_coroutine_enter(bs->aio_context, s->connection_co);
+ if (s->connection_co) {
+ /*
+ * The node is still drained, so we know the coroutine has yielded in
+ * nbd_read_eof(), the only place where bs->in_flight can reach 0, or
+ * it is entered for the first time. Both places are safe for entering
+ * the coroutine.
+ */
+ qemu_aio_coroutine_enter(bs->aio_context, s->connection_co);
+ }
bdrv_dec_in_flight(bs);
}
--
2.29.2
* Re: [PATCH 2/3] block/nbd: only enter connection coroutine if it's present
From: Vladimir Sementsov-Ogievskiy @ 2021-01-29 5:40 UTC (permalink / raw)
To: Roman Kagan, qemu-devel; +Cc: Kevin Wolf, qemu-block, Max Reitz
28.01.2021 23:14, Roman Kagan wrote:
> When an NBD block driver state is moved from one aio_context to another
> (e.g. when doing a drain in a migration thread),
> nbd_client_attach_aio_context_bh is executed; it enters the connection
> coroutine.
>
> However, the assumption that ->connection_co is always present here
> appears incorrect: the client may have encountered an error other than
> -EIO in the underlying transport and decided to quit rather than keep
> trying to reconnect, terminating the connection coroutine. As a
> result, an attempt to reassign
> the client in this state (NBD_CLIENT_QUIT) to a different aio_context
> leads to a null pointer dereference:
>
> at /build/qemu-6MF7tq/qemu-5.0.1/util/qemu-coroutine.c:109
> opaque=0x561805ed4c00) at /build/qemu-6MF7tq/qemu-5.0.1/block/nbd.c:164
> at /build/qemu-6MF7tq/qemu-5.0.1/util/aio-wait.c:55
> at /build/qemu-6MF7tq/qemu-5.0.1/util/async.c:136
> at /build/qemu-6MF7tq/qemu-5.0.1/util/async.c:164
> blocking=blocking@entry=true)
> at /build/qemu-6MF7tq/qemu-5.0.1/util/aio-posix.c:650
> cb=<optimized out>, opaque=<optimized out>)
> at /build/qemu-6MF7tq/qemu-5.0.1/util/aio-wait.c:71
> bs=0x561805ed4c00) at /build/qemu-6MF7tq/qemu-5.0.1/block.c:6172
> new_context=new_context@entry=0x5618056c7580,
> ignore=ignore@entry=0x7f60e1e63780)
> at /build/qemu-6MF7tq/qemu-5.0.1/block.c:6237
> bs=bs@entry=0x561805ed4c00, ctx=0x5618056c7580,
> ignore_child=<optimized out>, errp=<optimized out>)
> at /build/qemu-6MF7tq/qemu-5.0.1/block.c:6332
> new_context=0x5618056c7580, update_root_node=update_root_node@entry=true,
> errp=errp@entry=0x0)
> at /build/qemu-6MF7tq/qemu-5.0.1/block/block-backend.c:1989
> new_context=<optimized out>, errp=errp@entry=0x0)
> at /build/qemu-6MF7tq/qemu-5.0.1/block/block-backend.c:2010
> at /build/qemu-6MF7tq/qemu-5.0.1/hw/block/dataplane/virtio-blk.c:292
> at /build/qemu-6MF7tq/qemu-5.0.1/hw/virtio/virtio-bus.c:245
> running=0, state=<optimized out>)
> at /build/qemu-6MF7tq/qemu-5.0.1/hw/virtio/virtio.c:3220
> state=state@entry=RUN_STATE_FINISH_MIGRATE)
> at /build/qemu-6MF7tq/qemu-5.0.1/softmmu/vl.c:1275
> send_stop=<optimized out>) at /build/qemu-6MF7tq/qemu-5.0.1/cpus.c:1032
> at /build/qemu-6MF7tq/qemu-5.0.1/migration/migration.c:2914
> at /build/qemu-6MF7tq/qemu-5.0.1/migration/migration.c:3275
> at /build/qemu-6MF7tq/qemu-5.0.1/migration/migration.c:3439
> at /build/qemu-6MF7tq/qemu-5.0.1/util/qemu-thread-posix.c:519
> from /lib/x86_64-linux-gnu/libpthread.so.0
>
> Fix it by checking that the connection coroutine is non-null before
> trying to enter it. If it is null, no entering is needed, as the
> connection is probably going down anyway.
>
> Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
> block/nbd.c | 16 +++++++++-------
> 1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/block/nbd.c b/block/nbd.c
> index bcd6641e90..b3cbbeb4b0 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -250,13 +250,15 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
> BlockDriverState *bs = opaque;
> BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
>
> - /*
> - * The node is still drained, so we know the coroutine has yielded in
> - * nbd_read_eof(), the only place where bs->in_flight can reach 0, or it is
> - * entered for the first time. Both places are safe for entering the
> - * coroutine.
> - */
> - qemu_aio_coroutine_enter(bs->aio_context, s->connection_co);
> + if (s->connection_co) {
> + /*
> + * The node is still drained, so we know the coroutine has yielded in
> + * nbd_read_eof(), the only place where bs->in_flight can reach 0, or
> + * it is entered for the first time. Both places are safe for entering
> + * the coroutine.
> + */
> + qemu_aio_coroutine_enter(bs->aio_context, s->connection_co);
> + }
> bdrv_dec_in_flight(bs);
> }
>
>
--
Best regards,
Vladimir
* [PATCH 3/3] nbd: make nbd_read* return -EIO on error
From: Roman Kagan @ 2021-01-28 20:14 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-block, Max Reitz
The NBD reconnect logic considers the error code from the functions
that read NBD messages to tell whether a reconnect should be attempted:
it is attempted on -EIO; on any other error the client transitions to
the NBD_CLIENT_QUIT state (see nbd_channel_error). This error code is
propagated from primitives like nbd_read.
The problem, however, is that nbd_read itself turns every error into -1
rather than -EIO. As a result, if the NBD server happens to die while
sending the message, the client in QEMU receives less data than it
expects, considers this a fatal error, and doesn't attempt to
reestablish the connection.
Fix it by turning every negative return from qio_channel_read_all into
-EIO returned from nbd_read. Apparently that was the original behavior,
but got broken later. Also adjust nbd_readXX to follow.
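
For context, the error code matters because nbd_channel_error() treats
-EIO as the only error worth reconnecting on. Roughly (a paraphrased
sketch of block/nbd.c, not a verbatim quote):

    static void nbd_channel_error(BDRVNBDState *s, int ret)
    {
        if (ret == -EIO) {
            /* Transport error: attempt a reconnect if configured. */
            if (s->state == NBD_CLIENT_CONNECTED) {
                s->state = s->reconnect_delay ?
                    NBD_CLIENT_CONNECTING_WAIT : NBD_CLIENT_CONNECTING_NOWAIT;
            }
        } else {
            /* Anything else is fatal: quit and don't reconnect. */
            if (s->state == NBD_CLIENT_CONNECTED) {
                qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
            }
            s->state = NBD_CLIENT_QUIT;
        }
    }

With nbd_read() returning -1 instead of -EIO, the else branch is always
taken and the reconnect logic never engages.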
Fixes: e6798f06a6 ("nbd: generalize usage of nbd_read")
Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
---
include/block/nbd.h | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/include/block/nbd.h b/include/block/nbd.h
index 4a52a43ef5..5f34d23bb0 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -364,7 +364,7 @@ static inline int nbd_read(QIOChannel *ioc, void *buffer, size_t size,
if (desc) {
error_prepend(errp, "Failed to read %s: ", desc);
}
- return -1;
+ return ret;
}
return 0;
@@ -375,8 +375,9 @@ static inline int nbd_read##bits(QIOChannel *ioc, \
uint##bits##_t *val, \
const char *desc, Error **errp) \
{ \
- if (nbd_read(ioc, val, sizeof(*val), desc, errp) < 0) { \
- return -1; \
+ int ret = nbd_read(ioc, val, sizeof(*val), desc, errp); \
+ if (ret < 0) { \
+ return ret; \
} \
*val = be##bits##_to_cpu(*val); \
return 0; \
--
2.29.2
* Re: [PATCH 3/3] nbd: make nbd_read* return -EIO on error
From: Vladimir Sementsov-Ogievskiy @ 2021-01-29 5:48 UTC (permalink / raw)
To: Roman Kagan, qemu-devel; +Cc: Kevin Wolf, qemu-block, Max Reitz
28.01.2021 23:14, Roman Kagan wrote:
> The NBD reconnect logic considers the error code from the functions
> that read NBD messages to tell whether a reconnect should be attempted:
> it is attempted on -EIO; on any other error the client transitions to
> the NBD_CLIENT_QUIT state (see nbd_channel_error). This error code is
> propagated from primitives like nbd_read.
>
> The problem, however, is that nbd_read itself turns every error into -1
> rather than -EIO. As a result, if the NBD server happens to die while
> sending the message, the client in QEMU receives less data than it
> expects, considers this a fatal error, and doesn't attempt to
> reestablish the connection.
>
> Fix it by turning every negative return from qio_channel_read_all into
> -EIO returned from nbd_read. Apparently that was the original behavior,
> but got broken later. Also adjust nbd_readXX to follow.
>
> Fixes: e6798f06a6 ("nbd: generalize usage of nbd_read")
> Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Really looks like a bug in e6798f06a6: it changes the error code from -EIO to -1 without any reasoning.
> ---
> include/block/nbd.h | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index 4a52a43ef5..5f34d23bb0 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -364,7 +364,7 @@ static inline int nbd_read(QIOChannel *ioc, void *buffer, size_t size,
> if (desc) {
> error_prepend(errp, "Failed to read %s: ", desc);
> }
> - return -1;
> + return ret;
> }
>
> return 0;
> @@ -375,8 +375,9 @@ static inline int nbd_read##bits(QIOChannel *ioc, \
> uint##bits##_t *val, \
> const char *desc, Error **errp) \
> { \
> - if (nbd_read(ioc, val, sizeof(*val), desc, errp) < 0) { \
> - return -1; \
> + int ret = nbd_read(ioc, val, sizeof(*val), desc, errp); \
> + if (ret < 0) { \
> + return ret; \
> } \
> *val = be##bits##_to_cpu(*val); \
> return 0; \
>
--
Best regards,
Vladimir
* Re: [PATCH 0/3] block/nbd: fix crashers in reconnect while migrating
From: Vladimir Sementsov-Ogievskiy @ 2021-01-29 5:51 UTC (permalink / raw)
To: Roman Kagan, qemu-devel; +Cc: Kevin Wolf, qemu-block, Max Reitz
28.01.2021 23:14, Roman Kagan wrote:
> During the final phase of migration the NBD reconnection logic may
> encounter situations it doesn't expect during regular operation.
>
> This series addresses some of them that make qemu crash. They are
> reproducible when a vm with a secondary drive attached via nbd with
> non-zero "reconnect-delay" runs a stress load (fio with big queue depth)
> in the guest on that drive and is migrated (e.g. to a file), while the
> nbd server is SIGKILL-ed and restarted every second.
>
> See the individual patches for specific crash conditions and more
> detailed explanations.
>
> Roman Kagan (3):
> block/nbd: only detach existing iochannel from aio_context
> block/nbd: only enter connection coroutine if it's present
> nbd: make nbd_read* return -EIO on error
>
> include/block/nbd.h | 7 ++++---
> block/nbd.c | 25 +++++++++++++++++--------
> 2 files changed, 21 insertions(+), 11 deletions(-)
>
Thanks a lot for fixing!
Do you have some reproducer scripts? Could you post them or maybe add an iotest?
--
Best regards,
Vladimir
* Re: [PATCH 0/3] block/nbd: fix crashers in reconnect while migrating
From: Roman Kagan @ 2021-01-29 7:35 UTC (permalink / raw)
To: Vladimir Sementsov-Ogievskiy
Cc: Kevin Wolf, qemu-devel, qemu-block, Max Reitz
On Fri, Jan 29, 2021 at 08:51:39AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 28.01.2021 23:14, Roman Kagan wrote:
> > During the final phase of migration the NBD reconnection logic may
> > encounter situations it doesn't expect during regular operation.
> >
> > This series addresses some of them that make qemu crash. They are
> > reproducible when a vm with a secondary drive attached via nbd with
> > non-zero "reconnect-delay" runs a stress load (fio with big queue depth)
> > in the guest on that drive and is migrated (e.g. to a file), while the
> > nbd server is SIGKILL-ed and restarted every second.
> >
> > See the individual patches for specific crash conditions and more
> > detailed explanations.
> >
> > Roman Kagan (3):
> > block/nbd: only detach existing iochannel from aio_context
> > block/nbd: only enter connection coroutine if it's present
> > nbd: make nbd_read* return -EIO on error
> >
> > include/block/nbd.h | 7 ++++---
> > block/nbd.c | 25 +++++++++++++++++--------
> > 2 files changed, 21 insertions(+), 11 deletions(-)
> >
>
> Thanks a lot for fixing!
>
> Do you have some reproducer scripts? Could you post them or maybe add
> an iotest?
I don't have it scripted, just ad hoc command lines. I'll look into
making up a test. Can you perhaps suggest an existing test to base it
on?
Thanks,
Roman.
* Re: [PATCH 0/3] block/nbd: fix crashers in reconnect while migrating
From: Vladimir Sementsov-Ogievskiy @ 2021-01-29 8:19 UTC (permalink / raw)
To: Roman Kagan, qemu-devel, Eric Blake, Max Reitz, qemu-block, Kevin Wolf
29.01.2021 10:35, Roman Kagan wrote:
> On Fri, Jan 29, 2021 at 08:51:39AM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> 28.01.2021 23:14, Roman Kagan wrote:
>>> During the final phase of migration the NBD reconnection logic may
>>> encounter situations it doesn't expect during regular operation.
>>>
>>> This series addresses some of them that make qemu crash. They are
>>> reproducible when a vm with a secondary drive attached via nbd with
>>> non-zero "reconnect-delay" runs a stress load (fio with big queue depth)
>>> in the guest on that drive and is migrated (e.g. to a file), while the
>>> nbd server is SIGKILL-ed and restarted every second.
>>>
>>> See the individual patches for specific crash conditions and more
>>> detailed explanations.
>>>
>>> Roman Kagan (3):
>>> block/nbd: only detach existing iochannel from aio_context
>>> block/nbd: only enter connection coroutine if it's present
>>> nbd: make nbd_read* return -EIO on error
>>>
>>> include/block/nbd.h | 7 ++++---
>>> block/nbd.c | 25 +++++++++++++++++--------
>>> 2 files changed, 21 insertions(+), 11 deletions(-)
>>>
>>
>> Thanks a lot for fixing!
>>
>> Do you have some reproducer scripts? Could you post them or maybe add
>> an iotest?
>
> I don't have it scripted, just ad hoc command lines. I'll look into
> making up a test. Can you perhaps suggest an existing test to base it
> on?
>
For now the reconnect feature is covered by only two tests:
tests/qemu-iotests/264 and tests/qemu-iotests/277.
Also note that since commit f203080bbd ("iotests: rewrite check into
python") you should add new iotests with human-readable file names
under the tests/qemu-iotests/tests subdirectory. You also don't need to
update the tests/qemu-iotests/group file (it no longer exists); test
groups are now declared in the tests themselves in a comment, like
"# group: rw quick".
--
Best regards,
Vladimir