* [PATCH 0/2] dm mpath: Interface for explicit probing of active paths
@ 2025-04-29 16:50 Kevin Wolf

From: Kevin Wolf @ 2025-04-29 16:50 UTC (permalink / raw)
To: dm-devel; +Cc: kwolf, hreitz, mpatocka, snitzer, bmarzins, linux-kernel

Multipath cannot directly provide failover for ioctls in the kernel
because it doesn't know what each ioctl means and which result could
indicate a path error. Userspace generally knows what the ioctl it
issued means and if it might be a path error, but neither does it know
which path the ioctl took nor does it necessarily have the privileges to
fail a path using the control device.

This series adds an interface that userspace can use to probe paths and
fail the bad ones after seeing a potential path error in an ioctl
result. Once the bad paths are eliminated, the ioctl can be retried.

While the fundamental problem is relatively broad and can affect any
sort of ioctl, the immediate motivation for this is the use of SG_IO in
QEMU for SCSI passthrough.

A preliminary QEMU side patch that makes use of the new interface to
support multipath failover with SCSI passthrough can be found at:
https://repo.or.cz/qemu/kevin.git/commitdiff/78a474da3b39469b11fbb1d4e0ddf4797b637d35

Kevin Wolf (2):
  dm: Allow .prepare_ioctl to handle ioctls directly
  dm mpath: Interface for explicit probing of active paths

 include/linux/device-mapper.h |  9 +++-
 include/uapi/linux/dm-ioctl.h |  9 +++-
 drivers/md/dm-dust.c          |  4 +-
 drivers/md/dm-ebs-target.c    |  3 +-
 drivers/md/dm-flakey.c        |  4 +-
 drivers/md/dm-ioctl.c         |  1 +
 drivers/md/dm-linear.c        |  4 +-
 drivers/md/dm-log-writes.c    |  4 +-
 drivers/md/dm-mpath.c         | 95 ++++++++++++++++++++++++++++++++++-
 drivers/md/dm-switch.c        |  4 +-
 drivers/md/dm-verity-target.c |  4 +-
 drivers/md/dm-zoned-target.c  |  3 +-
 drivers/md/dm.c               | 17 ++++---
 13 files changed, 142 insertions(+), 19 deletions(-)

-- 
2.49.0

^ permalink raw reply	[flat|nested] 52+ messages in thread
* [PATCH 1/2] dm: Allow .prepare_ioctl to handle ioctls directly
@ 2025-04-29 16:50 Kevin Wolf

From: Kevin Wolf @ 2025-04-29 16:50 UTC (permalink / raw)
To: dm-devel; +Cc: kwolf, hreitz, mpatocka, snitzer, bmarzins, linux-kernel

This adds a 'bool *forward' parameter to .prepare_ioctl, which allows
device mapper targets to accept ioctls to themselves instead of the
underlying device. If the target already fully handled the ioctl, it
sets *forward to false and device mapper won't forward it to the
underlying device any more.

In order for targets to actually know what the ioctl is about and how to
handle it, pass also cmd and arg.

As long as targets restrict themselves to interpreting ioctls of type
DM_IOCTL, this is a backwards compatible change because previously, any
such ioctl would have been passed down through all device mapper layers
until it reached a device that can't understand the ioctl and would
return an error.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/linux/device-mapper.h |  9 ++++++++-
 drivers/md/dm-dust.c          |  4 +++-
 drivers/md/dm-ebs-target.c    |  3 ++-
 drivers/md/dm-flakey.c        |  4 +++-
 drivers/md/dm-linear.c        |  4 +++-
 drivers/md/dm-log-writes.c    |  4 +++-
 drivers/md/dm-mpath.c         |  4 +++-
 drivers/md/dm-switch.c        |  4 +++-
 drivers/md/dm-verity-target.c |  4 +++-
 drivers/md/dm-zoned-target.c  |  3 ++-
 drivers/md/dm.c               | 17 +++++++++++------
 11 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index bcc6d7b69470..cb95951547ab 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -93,7 +93,14 @@ typedef void (*dm_status_fn) (struct dm_target *ti, status_type_t status_type,
 typedef int (*dm_message_fn) (struct dm_target *ti, unsigned int argc, char **argv,
			       char *result, unsigned int maxlen);
 
-typedef int (*dm_prepare_ioctl_fn) (struct dm_target *ti, struct block_device **bdev);
+/*
+ * Called with *forward == true. If it remains true, the ioctl should be
+ * forwarded to bdev. If it is reset to false, the target already fully handled
+ * the ioctl and the return value is the return value for the whole ioctl.
+ */
+typedef int (*dm_prepare_ioctl_fn) (struct dm_target *ti, struct block_device **bdev,
+				    unsigned int cmd, unsigned long arg,
+				    bool *forward);
 
 #ifdef CONFIG_BLK_DEV_ZONED
 typedef int (*dm_report_zones_fn) (struct dm_target *ti,
diff --git a/drivers/md/dm-dust.c b/drivers/md/dm-dust.c
index 1a33820c9f46..e75310232bbf 100644
--- a/drivers/md/dm-dust.c
+++ b/drivers/md/dm-dust.c
@@ -534,7 +534,9 @@ static void dust_status(struct dm_target *ti, status_type_t type,
 	}
 }
 
-static int dust_prepare_ioctl(struct dm_target *ti, struct block_device **bdev)
+static int dust_prepare_ioctl(struct dm_target *ti, struct block_device **bdev,
+			      unsigned int cmd, unsigned long arg,
+			      bool *forward)
 {
 	struct dust_device *dd = ti->private;
 	struct dm_dev *dev = dd->dev;
diff --git a/drivers/md/dm-ebs-target.c b/drivers/md/dm-ebs-target.c
index b19b0142a690..6abb31ca9662 100644
--- a/drivers/md/dm-ebs-target.c
+++ b/drivers/md/dm-ebs-target.c
@@ -415,7 +415,8 @@ static void ebs_status(struct dm_target *ti, status_type_t type,
 	}
 }
 
-static int ebs_prepare_ioctl(struct dm_target *ti, struct block_device **bdev)
+static int ebs_prepare_ioctl(struct dm_target *ti, struct block_device **bdev,
+			     unsigned int cmd, unsigned long arg, bool *forward)
 {
 	struct ebs_c *ec = ti->private;
 	struct dm_dev *dev = ec->dev;
diff --git a/drivers/md/dm-flakey.c b/drivers/md/dm-flakey.c
index b690905ab89f..0fceb08f4622 100644
--- a/drivers/md/dm-flakey.c
+++ b/drivers/md/dm-flakey.c
@@ -638,7 +638,9 @@ static void flakey_status(struct dm_target *ti, status_type_t type,
 	}
 }
 
-static int flakey_prepare_ioctl(struct dm_target *ti, struct block_device **bdev)
+static int flakey_prepare_ioctl(struct dm_target *ti, struct block_device **bdev,
+				unsigned int cmd, unsigned long arg,
+				bool *forward)
 {
 	struct flakey_c *fc = ti->private;
 
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 66318aba4bdb..15538ec58f8e 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -119,7 +119,9 @@ static void linear_status(struct dm_target *ti, status_type_t type,
 	}
 }
 
-static int linear_prepare_ioctl(struct dm_target *ti, struct block_device **bdev)
+static int linear_prepare_ioctl(struct dm_target *ti, struct block_device **bdev,
+				unsigned int cmd, unsigned long arg,
+				bool *forward)
 {
 	struct linear_c *lc = ti->private;
 	struct dm_dev *dev = lc->dev;
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index 8d7df8303d0a..d484e8e1d48a 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -818,7 +818,9 @@ static void log_writes_status(struct dm_target *ti, status_type_t type,
 }
 
 static int log_writes_prepare_ioctl(struct dm_target *ti,
-				    struct block_device **bdev)
+				    struct block_device **bdev,
+				    unsigned int cmd, unsigned long arg,
+				    bool *forward)
 {
 	struct log_writes_c *lc = ti->private;
 	struct dm_dev *dev = lc->dev;
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 6c98f4ae5ea9..909ed6890ba5 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -2022,7 +2022,9 @@ static int multipath_message(struct dm_target *ti, unsigned int argc, char **arg
 }
 
 static int multipath_prepare_ioctl(struct dm_target *ti,
-				   struct block_device **bdev)
+				   struct block_device **bdev,
+				   unsigned int cmd, unsigned long arg,
+				   bool *forward)
 {
 	struct multipath *m = ti->private;
 	struct pgpath *pgpath;
diff --git a/drivers/md/dm-switch.c b/drivers/md/dm-switch.c
index dfd9fb52a6f3..bb1a70b5a215 100644
--- a/drivers/md/dm-switch.c
+++ b/drivers/md/dm-switch.c
@@ -517,7 +517,9 @@ static void switch_status(struct dm_target *ti, status_type_t type,
  *
  * Passthrough all ioctls to the path for sector 0
  */
-static int switch_prepare_ioctl(struct dm_target *ti, struct block_device **bdev)
+static int switch_prepare_ioctl(struct dm_target *ti, struct block_device **bdev,
+				unsigned int cmd, unsigned long arg,
+				bool *forward)
 {
 	struct switch_ctx *sctx = ti->private;
 	unsigned int path_nr;
diff --git a/drivers/md/dm-verity-target.c b/drivers/md/dm-verity-target.c
index 4de2c226ac9d..34a9f9fbd0d1 100644
--- a/drivers/md/dm-verity-target.c
+++ b/drivers/md/dm-verity-target.c
@@ -994,7 +994,9 @@ static void verity_status(struct dm_target *ti, status_type_t type,
 	}
 }
 
-static int verity_prepare_ioctl(struct dm_target *ti, struct block_device **bdev)
+static int verity_prepare_ioctl(struct dm_target *ti, struct block_device **bdev,
+				unsigned int cmd, unsigned long arg,
+				bool *forward)
 {
 	struct dm_verity *v = ti->private;
 
diff --git a/drivers/md/dm-zoned-target.c b/drivers/md/dm-zoned-target.c
index 6141fc25d842..5da3db06da10 100644
--- a/drivers/md/dm-zoned-target.c
+++ b/drivers/md/dm-zoned-target.c
@@ -1015,7 +1015,8 @@ static void dmz_io_hints(struct dm_target *ti, struct queue_limits *limits)
 /*
  * Pass on ioctl to the backend device.
  */
-static int dmz_prepare_ioctl(struct dm_target *ti, struct block_device **bdev)
+static int dmz_prepare_ioctl(struct dm_target *ti, struct block_device **bdev,
+			     unsigned int cmd, unsigned long arg, bool *forward)
 {
 	struct dmz_target *dmz = ti->private;
 	struct dmz_dev *dev = &dmz->dev[0];
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ccccc098b30e..1726f0f828cc 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -411,7 +411,8 @@ static int dm_blk_getgeo(struct block_device *bdev, struct hd_geometry *geo)
 }
 
 static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
-			    struct block_device **bdev)
+			    struct block_device **bdev, unsigned int cmd,
+			    unsigned long arg, bool *forward)
 {
 	struct dm_target *ti;
 	struct dm_table *map;
@@ -434,8 +435,8 @@ static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
 	if (dm_suspended_md(md))
 		return -EAGAIN;
 
-	r = ti->type->prepare_ioctl(ti, bdev);
-	if (r == -ENOTCONN && !fatal_signal_pending(current)) {
+	r = ti->type->prepare_ioctl(ti, bdev, cmd, arg, forward);
+	if (r == -ENOTCONN && *forward && !fatal_signal_pending(current)) {
 		dm_put_live_table(md, *srcu_idx);
 		fsleep(10000);
 		goto retry;
@@ -454,9 +455,10 @@ static int dm_blk_ioctl(struct block_device *bdev, blk_mode_t mode,
 {
 	struct mapped_device *md = bdev->bd_disk->private_data;
 	int r, srcu_idx;
+	bool forward = true;
 
-	r = dm_prepare_ioctl(md, &srcu_idx, &bdev);
-	if (r < 0)
+	r = dm_prepare_ioctl(md, &srcu_idx, &bdev, cmd, arg, &forward);
+	if (!forward || r < 0)
 		goto out;
 
 	if (r > 0) {
@@ -3630,10 +3632,13 @@ static int dm_pr_clear(struct block_device *bdev, u64 key)
 	struct mapped_device *md = bdev->bd_disk->private_data;
 	const struct pr_ops *ops;
 	int r, srcu_idx;
+	bool forward = true;
 
-	r = dm_prepare_ioctl(md, &srcu_idx, &bdev);
+	/* Not a real ioctl, but targets must not interpret non-DM ioctls */
+	r = dm_prepare_ioctl(md, &srcu_idx, &bdev, 0, 0, &forward);
 	if (r < 0)
 		goto out;
+	WARN_ON_ONCE(!forward);
 
 	ops = bdev->bd_disk->fops->pr_ops;
 	if (ops && ops->pr_clear)
-- 
2.49.0
* Re: [PATCH 1/2] dm: Allow .prepare_ioctl to handle ioctls directly
@ 2025-04-29 23:22 Benjamin Marzinski

From: Benjamin Marzinski @ 2025-04-29 23:22 UTC (permalink / raw)
To: Kevin Wolf; +Cc: dm-devel, hreitz, mpatocka, snitzer, linux-kernel

On Tue, Apr 29, 2025 at 06:50:17PM +0200, Kevin Wolf wrote:
> This adds a 'bool *forward' parameter to .prepare_ioctl, which allows
> device mapper targets to accept ioctls to themselves instead of the
> underlying device. If the target already fully handled the ioctl, it
> sets *forward to false and device mapper won't forward it to the
> underlying device any more.
> 
> In order for targets to actually know what the ioctl is about and how to
> handle it, pass also cmd and arg.
> 
> As long as targets restrict themselves to interpreting ioctls of type
> DM_IOCTL, this is a backwards compatible change because previously, any
> such ioctl would have been passed down through all device mapper layers
> until it reached a device that can't understand the ioctl and would
> return an error.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>

Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
* Re: [PATCH 1/2] dm: Allow .prepare_ioctl to handle ioctls directly
@ 2025-05-08 13:50 Martin Wilck

From: Martin Wilck @ 2025-05-08 13:50 UTC (permalink / raw)
To: Kevin Wolf, dm-devel
Cc: hreitz, mpatocka, snitzer, bmarzins, linux-kernel, Hannes Reinecke

On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote:
> This adds a 'bool *forward' parameter to .prepare_ioctl, which allows
> device mapper targets to accept ioctls to themselves instead of the
> underlying device. If the target already fully handled the ioctl, it
> sets *forward to false and device mapper won't forward it to the
> underlying device any more.
> 
> In order for targets to actually know what the ioctl is about and how
> to handle it, pass also cmd and arg.
> 
> As long as targets restrict themselves to interpreting ioctls of type
> DM_IOCTL, this is a backwards compatible change because previously,
> any such ioctl would have been passed down through all device mapper
> layers until it reached a device that can't understand the ioctl and
> would return an error.
[...]
* [PATCH 2/2] dm mpath: Interface for explicit probing of active paths
@ 2025-04-29 16:50 Kevin Wolf

From: Kevin Wolf @ 2025-04-29 16:50 UTC (permalink / raw)
To: dm-devel; +Cc: kwolf, hreitz, mpatocka, snitzer, bmarzins, linux-kernel

Multipath cannot directly provide failover for ioctls in the kernel
because it doesn't know what each ioctl means and which result could
indicate a path error. Userspace generally knows what the ioctl it
issued means and if it might be a path error, but neither does it know
which path the ioctl took nor does it necessarily have the privileges to
fail a path using the control device.

In order to allow userspace to address this situation, implement a
DM_MPATH_PROBE_PATHS ioctl that prompts the dm-mpath driver to probe all
active paths in the current path group to see whether they still work,
and fail them if not. If this returns success, userspace can retry the
ioctl and expect that the previously hit bad path is now failed (or
working again).

The immediate motivation for this is the use of SG_IO in QEMU for SCSI
passthrough. Following a failed SG_IO ioctl, QEMU will trigger probing
to ensure that all active paths are actually alive, so that retrying
SG_IO at least has a lower chance of failing due to a path error.
However, the problem is broader than just SG_IO (it affects any ioctl),
and if applications need failover support for other ioctls, the same
probing can be used.

This is not implemented on the DM control device, but on the DM mpath
block devices, to allow all users who have access to such a block device
to make use of this interface, specifically to implement failover for
ioctls. For the same reason, it is also unprivileged. Its implementation
is effectively just a bunch of reads, which could already be issued by
userspace, just without any guarantee that all the right paths are
selected.

The probing implemented here is done fully synchronously path by path;
probing all paths concurrently is left as an improvement for the future.

Co-developed-by: Hanna Czenczek <hreitz@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/uapi/linux/dm-ioctl.h |  9 +++-
 drivers/md/dm-ioctl.c         |  1 +
 drivers/md/dm-mpath.c         | 91 ++++++++++++++++++++++++++++++++++-
 3 files changed, 98 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/dm-ioctl.h b/include/uapi/linux/dm-ioctl.h
index b08c7378164d..3225e025e30e 100644
--- a/include/uapi/linux/dm-ioctl.h
+++ b/include/uapi/linux/dm-ioctl.h
@@ -258,10 +258,12 @@ enum {
 	DM_DEV_SET_GEOMETRY_CMD,
 	DM_DEV_ARM_POLL_CMD,
 	DM_GET_TARGET_VERSION_CMD,
+	DM_MPATH_PROBE_PATHS_CMD,
 };
 
 #define DM_IOCTL 0xfd
 
+/* Control device ioctls */
 #define DM_VERSION	_IOWR(DM_IOCTL, DM_VERSION_CMD, struct dm_ioctl)
 #define DM_REMOVE_ALL	_IOWR(DM_IOCTL, DM_REMOVE_ALL_CMD, struct dm_ioctl)
 #define DM_LIST_DEVICES	_IOWR(DM_IOCTL, DM_LIST_DEVICES_CMD, struct dm_ioctl)
@@ -285,10 +287,13 @@ enum {
 #define DM_TARGET_MSG	_IOWR(DM_IOCTL, DM_TARGET_MSG_CMD, struct dm_ioctl)
 #define DM_DEV_SET_GEOMETRY	_IOWR(DM_IOCTL, DM_DEV_SET_GEOMETRY_CMD, struct dm_ioctl)
 
+/* Block device ioctls */
+#define DM_MPATH_PROBE_PATHS _IO(DM_IOCTL, DM_MPATH_PROBE_PATHS_CMD)
+
 #define DM_VERSION_MAJOR	4
-#define DM_VERSION_MINOR	49
+#define DM_VERSION_MINOR	50
 #define DM_VERSION_PATCHLEVEL	0
-#define DM_VERSION_EXTRA	"-ioctl (2025-01-17)"
+#define DM_VERSION_EXTRA	"-ioctl (2025-04-28)"
 
 /* Status bits */
 #define DM_READONLY_FLAG	(1 << 0) /* In/Out */
diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index d42eac944eb5..4165fef4c170 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1885,6 +1885,7 @@ static ioctl_fn lookup_ioctl(unsigned int cmd, int *ioctl_flags)
 		{DM_DEV_SET_GEOMETRY_CMD, 0, dev_set_geometry},
 		{DM_DEV_ARM_POLL_CMD, IOCTL_FLAGS_NO_PARAMS, dev_arm_poll},
 		{DM_GET_TARGET_VERSION_CMD, 0, get_target_version},
+		{DM_MPATH_PROBE_PATHS_CMD, 0, NULL}, /* block device ioctl */
 	};
 
 	if (unlikely(cmd >= ARRAY_SIZE(_ioctls)))
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 909ed6890ba5..af17a35c6457 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -2021,6 +2021,85 @@ static int multipath_message(struct dm_target *ti, unsigned int argc, char **arg
 	return r;
 }
 
+/*
+ * Perform a minimal read from the given path to find out whether the
+ * path still works. If a path error occurs, fail it.
+ */
+static int probe_path(struct pgpath *pgpath)
+{
+	struct block_device *bdev = pgpath->path.dev->bdev;
+	unsigned int read_size = bdev_logical_block_size(bdev);
+	struct page *page;
+	struct bio *bio;
+	blk_status_t status;
+	int r = 0;
+
+	if (WARN_ON_ONCE(read_size > PAGE_SIZE))
+		return -EINVAL;
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+
+	/* Perform a minimal read: Sector 0, length read_size */
+	bio = bio_alloc(bdev, 1, REQ_OP_READ, GFP_KERNEL);
+	if (!bio) {
+		r = -ENOMEM;
+		goto out;
+	}
+
+	bio->bi_iter.bi_sector = 0;
+	__bio_add_page(bio, page, read_size, 0);
+	submit_bio_wait(bio);
+	status = bio->bi_status;
+	bio_put(bio);
+
+	if (status && blk_path_error(status))
+		fail_path(pgpath);
+
+out:
+	__free_page(page);
+	return r;
+}
+
+/*
+ * Probe all active paths in current_pg to find out whether they still work.
+ * Fail all paths that do not work.
+ *
+ * Return -ENOTCONN if no valid path is left (even outside of current_pg). We
+ * cannot probe paths in other pgs without switching current_pg, so if valid
+ * paths are only in different pgs, they may or may not work. Userspace can
+ * submit a request and we'll switch. If the request fails, it may need to
+ * probe again.
+ */
+static int probe_active_paths(struct multipath *m)
+{
+	struct pgpath *pgpath;
+	struct priority_group *pg;
+	int r = 0;
+
+	mutex_lock(&m->work_mutex);
+
+	pg = READ_ONCE(m->current_pg);
+	if (pg) {
+		list_for_each_entry(pgpath, &pg->pgpaths, list) {
+			if (!pgpath->is_active)
+				continue;
+
+			r = probe_path(pgpath);
+			if (r < 0)
+				goto out;
+		}
+	}
+
+	if (!atomic_read(&m->nr_valid_paths))
+		r = -ENOTCONN;
+
+out:
+	mutex_unlock(&m->work_mutex);
+	return r;
+}
+
 static int multipath_prepare_ioctl(struct dm_target *ti,
 				   struct block_device **bdev,
 				   unsigned int cmd, unsigned long arg,
@@ -2031,6 +2110,16 @@ static int multipath_prepare_ioctl(struct dm_target *ti,
 	unsigned long flags;
 	int r;
 
+	if (_IOC_TYPE(cmd) == DM_IOCTL) {
+		*forward = false;
+		switch (cmd) {
+		case DM_MPATH_PROBE_PATHS:
+			return probe_active_paths(m);
+		default:
+			return -ENOTTY;
+		}
+	}
+
 	pgpath = READ_ONCE(m->current_pgpath);
 	if (!pgpath || !mpath_double_check_test_bit(MPATHF_QUEUE_IO, m))
 		pgpath = choose_pgpath(m, 0);
@@ -2182,7 +2271,7 @@ static int multipath_busy(struct dm_target *ti)
  */
 static struct target_type multipath_target = {
 	.name = "multipath",
-	.version = {1, 14, 0},
+	.version = {1, 15, 0},
 	.features = DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE |
 		    DM_TARGET_PASSES_INTEGRITY,
 	.module = THIS_MODULE,
-- 
2.49.0
* Re: [PATCH 2/2] dm mpath: Interface for explicit probing of active paths 2025-04-29 16:50 ` [PATCH 2/2] dm mpath: Interface for explicit probing of active paths Kevin Wolf @ 2025-04-29 23:22 ` Benjamin Marzinski 0 siblings, 0 replies; 52+ messages in thread From: Benjamin Marzinski @ 2025-04-29 23:22 UTC (permalink / raw) To: Kevin Wolf; +Cc: dm-devel, hreitz, mpatocka, snitzer, linux-kernel On Tue, Apr 29, 2025 at 06:50:18PM +0200, Kevin Wolf wrote: > Multipath cannot directly provide failover for ioctls in the kernel > because it doesn't know what each ioctl means and which result could > indicate a path error. Userspace generally knows what the ioctl it > issued means and if it might be a path error, but neither does it know > which path the ioctl took nor does it necessarily have the privileges to > fail a path using the control device. > > In order to allow userspace to address this situation, implement a > DM_MPATH_PROBE_PATHS ioctl that prompts the dm-mpath driver to probe all > active paths in the current path group to see whether they still work, > and fail them if not. If this returns success, userspace can retry the > ioctl and expect that the previously hit bad path is now failed (or > working again). > > The immediate motivation for this is the use of SG_IO in QEMU for SCSI > passthrough. Following a failed SG_IO ioctl, QEMU will trigger probing > to ensure that all active paths are actually alive, so that retrying > SG_IO at least has a lower chance of failing due to a path error. > However, the problem is broader than just SG_IO (it affects any ioctl), > and if applications need failover support for other ioctls, the same > probing can be used. > > This is not implemented on the DM control device, but on the DM mpath > block devices, to allow all users who have access to such a block device > to make use of this interface, specifically to implement failover for > ioctls. For the same reason, it is also unprivileged. 
> Its implementation
> is effectively just a bunch of reads, which could already be issued by
> userspace, just without any guarantee that all the right paths are
> selected.
> 
> The probing implemented here is done fully synchronously path by path;
> probing all paths concurrently is left as an improvement for the future.
> 
> Co-developed-by: Hanna Czenczek <hreitz@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>

Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-04-29 16:50 [PATCH 0/2] dm mpath: Interface for explicit probing of active paths Kevin Wolf 2025-04-29 16:50 ` [PATCH 1/2] dm: Allow .prepare_ioctl to handle ioctls directly Kevin Wolf 2025-04-29 16:50 ` [PATCH 2/2] dm mpath: Interface for explicit probing of active paths Kevin Wolf @ 2025-05-08 13:51 ` Martin Wilck 2025-05-12 13:46 ` Mikulas Patocka 2025-05-12 15:18 ` Kevin Wolf 2 siblings, 2 replies; 52+ messages in thread From: Martin Wilck @ 2025-05-08 13:51 UTC (permalink / raw) To: Kevin Wolf, dm-devel; +Cc: hreitz, mpatocka, snitzer, bmarzins, linux-kernel Hello Kevin, [I'm sorry for the previous email. It seems that I clicked "send" in the wrong window]. On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: > Multipath cannot directly provide failover for ioctls in the kernel > because it doesn't know what each ioctl means and which result could > indicate a path error. Userspace generally knows what the ioctl it > issued means and if it might be a path error, but neither does it > know > which path the ioctl took nor does it necessarily have the privileges > to > fail a path using the control device. Thanks for this effort. I have some remarks about your approach. The most important one is that the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous command. It sends IO to possibly broken paths and waits for it to complete. In the common error case of a device not responding, this might cause the code to hang for a long time in the kernel ioctl code path, in uninterruptible sleep (note that these probing commands will be queued after other possibly hanging IO). In the past we have put a lot of effort into other code paths in multipath-tools and elsewhere to avoid this kind of hang to the extent possible. It seems to me that your set re-introduces this risk. 
Apart from that, minor observations are that in patch 2/2 you don't check whether the map is in queueing state, and I don't quite understand why you only check paths in the map's active path group without attempting a PG failover in the case where all paths in the current PG fail. I am wondering whether the DM_MPATH_PROBE_PATHS_CMD ioctl is necessary at all. Rather than triggering explicit path probing, you could have dm-mpath "handle" IO errors by failing the active path for "path errors". That would be similar to my patch set from 2021 [1], but it would avoid the extra IO and thus the additional risk of hanging in the kernel. Contrary to my set, you wouldn't attempt retries in the kernel, but leave this part to qemu instead, like in the current set. That would avoid Christoph's main criticism that "Failing over SG_IO does not make sense" [2]. Doing the failover in qemu has the general disadvantage that qemu has no notion about the number of available and/or healthy paths; it can thus only guess the reasonable number of retries for any given I/O. But that's unavoidable, given that the idea to do kernel-level failover on SG_IO is rejected. Please let me know your thoughts. Best Regards Martin [1] https://lore.kernel.org/all/20210628151558.2289-1-mwilck@suse.com/ [2] https://lore.kernel.org/all/20210701075629.GA25768@lst.de/ ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-08 13:51 ` [PATCH 0/2] " Martin Wilck @ 2025-05-12 13:46 ` Mikulas Patocka 2025-05-13 7:06 ` Martin Wilck 2025-05-12 15:18 ` Kevin Wolf 1 sibling, 1 reply; 52+ messages in thread From: Mikulas Patocka @ 2025-05-12 13:46 UTC (permalink / raw) To: Martin Wilck Cc: Kevin Wolf, dm-devel, hreitz, snitzer, bmarzins, linux-kernel [-- Attachment #1: Type: text/plain, Size: 3510 bytes --] Hi On Thu, 8 May 2025, Martin Wilck wrote: > Hello Kevin, > > [I'm sorry for the previous email. It seems that I clicked "send" in > the wrong window]. > > On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: > > Multipath cannot directly provide failover for ioctls in the kernel > > because it doesn't know what each ioctl means and which result could > > indicate a path error. Userspace generally knows what the ioctl it > > issued means and if it might be a path error, but neither does it > > know > > which path the ioctl took nor does it necessarily have the privileges > > to > > fail a path using the control device. > > Thanks for this effort. > > I have some remarks about your approach. The most important one is that > the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous command. > It sends IO to possibly broken paths and waits for it to complete. In > the common error case of a device not responding, this might cause the > code to hang for a long time in the kernel ioctl code path, in > uninterruptible sleep (note that these probing commands will be queued Normal reads and writes would also hang in an uninterruptible sleep if a path stops responding. > after other possibly hanging IO). In the past we have put a lot of > effort into other code paths in multipath-tools and elsewhere to avoid > this kind of hang to the extent possible. It seems to me that your set > re-introduces this risk. 
> 
> Apart from that, minor observations are that in patch 2/2 you don't
> check whether the map is in queueing state, and I don't quite
> understand why you only check paths in the map's active path group
> without attempting a PG failover in the case where all paths in the
> current PG fail.
> 
> I am wondering whether the DM_MPATH_PROBE_PATHS_CMD ioctl is necessary
> at all. Rather than triggering explicit path probing, you could have
> dm-mpath "handle" IO errors by failing the active path for "path
> errors". That would be similar to my patch set from 2021 [1], but it
> would avoid the extra IO and thus the additional risk of hanging in the
> kernel.

What exactly do you mean? You say "you could have dm-mpath 'handle' IO
errors by failing the active path for 'path errors'". Christoph doesn't
want dm-mpath to look at the error code - so dm-mpath doesn't know if a
path error occurred or not. qemu could look at the error code, but qemu
doesn't know which path the SG_IO went through - so it can't instruct
dm-mpath to fail that path.

One of the possibilities that we discussed was to add a path-id to
SG_IO, so that dm-mpath would mark which path the SG_IO went through
and qemu could instruct dm-mpath to fail that path. But we rejected it
as being more complicated than the current approach.

> Contrary to my set, you wouldn't attempt retries in the kernel, but
> leave this part to qemu instead, like in the current set. That would
> avoid Christoph's main criticism that "Failing over SG_IO does not make
> sense" [2].
> 
> Doing the failover in qemu has the general disadvantage that qemu has
> no notion about the number of available and/or healthy paths; it can
> thus only guess the reasonable number of retries for any given I/O. But
> that's unavoidable, given that the idea to do kernel-level failover on
> SG_IO is rejected.
> 
> Please let me know your thoughts.
> > Best Regards > Martin > > [1] https://lore.kernel.org/all/20210628151558.2289-1-mwilck@suse.com/ > [2] https://lore.kernel.org/all/20210701075629.GA25768@lst.de/ Mikulas ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-12 13:46 ` Mikulas Patocka @ 2025-05-13 7:06 ` Martin Wilck 0 siblings, 0 replies; 52+ messages in thread From: Martin Wilck @ 2025-05-13 7:06 UTC (permalink / raw) To: Mikulas Patocka, Christoph Hellwig Cc: Kevin Wolf, dm-devel, hreitz, snitzer, bmarzins, linux-kernel Hello Mikulas, On Mon, 2025-05-12 at 15:46 +0200, Mikulas Patocka wrote: > Hi > > > On Thu, 8 May 2025, Martin Wilck wrote: > > > Hello Kevin, > > > > [I'm sorry for the previous email. It seems that I clicked "send" > > in > > the wrong window]. > > > > On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: > > > Multipath cannot directly provide failover for ioctls in the > > > kernel > > > because it doesn't know what each ioctl means and which result > > > could > > > indicate a path error. Userspace generally knows what the ioctl > > > it > > > issued means and if it might be a path error, but neither does it > > > know > > > which path the ioctl took nor does it necessarily have the > > > privileges > > > to > > > fail a path using the control device. > > > > Thanks for this effort. > > > > I have some remarks about your approach. The most important one is > > that > > the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous > > command. > > It sends IO to possibly broken paths and waits for it to complete. > > In > > the common error case of a device not responding, this might cause > > the > > code to hang for a long time in the kernel ioctl code path, in > > uninterruptible sleep (note that these probing commands will be > > queued > > Normal reads and writes would also hang in an uninterruptible sleep > if a > path stops responding. Right. That's why multipathd uses TEST UNIT READY if supported by the storage, and either aio or separate I/O threads to avoid the main thread being blocked, and generally goes to great lengths to make sure that misbehaving storage hardware can cause no freeze. 
The way I read Kevin's code, you'd queue up additional IO in the same
context that had submitted the original failing IO. I realize that qemu
uses asynchronous IO, but AFAIK the details depend on the qemu options
for the individual VM, and it isn't clear to me whether or not
DM_MPATH_PROBE_PATHS_CMD can bring the entire VM to a halt.

> > [...]
> > 
> > I am wondering whether the DM_MPATH_PROBE_PATHS_CMD ioctl is
> > necessary
> > at all. Rather than triggering explicit path probing, you could
> > have
> > dm-mpath "handle" IO errors by failing the active path for "path
> > errors". That would be similar to my patch set from 2021 [1], but
> > it
> > would avoid the extra IO and thus the additional risk of hanging in
> > the
> > kernel.
> 
> What exactly do you mean? You say "you could have dm-mpath 'handle'
> IO
> errors by failing the active path for 'path errors'".

What I meant was that dm-mpath could call fail_path() in case of an
error, using a similar mechanism to the block IO code path.

> Christoph doesn't want dm-mpath to look at the error code - so
> dm-mpath doesn't know if a path error occurred or not.

My impression from the previous discussion was that Christoph mainly
objected to SG_IO requests being retried in the kernel [1], not so much
to the inspection of the error code by device mapper. I may have
misunderstood this of course.

Adding Christoph into the loop so that he can clarify what he meant.

> qemu could look at the error
> code, but qemu doesn't know which path the SG_IO went through - so
> it
> can't instruct dm-mpath to fail that path.
> 
> One of the possibilities that we discussed was to add a path-id to
> SG_IO, so
> that dm-mpath would mark which path the SG_IO went through and qemu
> could instruct dm-mpath to fail that path. But we rejected it as
> being
> more complicated than the current approach.

If we discuss extending SG_IO itself, it might be easier to have it
store the blk_status_t somewhere in the sg_io_hdr.
More about that idea in my reply to Kevin. Regards, Martin [1] https://lore.kernel.org/all/20210701075629.GA25768@lst.de/ > > Contrary to my set, you wouldn't attempt retries in the kernel, but > > leave this part to qemu instead, like in the current set. That > > would > > avoid Christoph's main criticism that "Failing over SG_IO does not > > make > > sense" [2]. > > > > Doing the failover in qemu has the general disadvantage that qemu > > has > > no notion about the number of available and/or healthy paths; it > > can > > thus only guess the reasonable number of retries for any given I/O. > > But > > that's unavoidable, given that the idea to do kernel-level failover > > on > > SG_IO is rejected. > > > > Please let me know your thoughts. > > > > Best Regards > > Martin > > > > [1] > > https://lore.kernel.org/all/20210628151558.2289-1-mwilck@suse.com/ > > [2] https://lore.kernel.org/all/20210701075629.GA25768@lst.de/ > > Mikulas ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-08 13:51 ` [PATCH 0/2] " Martin Wilck 2025-05-12 13:46 ` Mikulas Patocka @ 2025-05-12 15:18 ` Kevin Wolf 2025-05-13 5:55 ` Christoph Hellwig ` (2 more replies) 1 sibling, 3 replies; 52+ messages in thread From: Kevin Wolf @ 2025-05-12 15:18 UTC (permalink / raw) To: Martin Wilck; +Cc: dm-devel, hreitz, mpatocka, snitzer, bmarzins, linux-kernel Am 08.05.2025 um 15:51 hat Martin Wilck geschrieben: > Hello Kevin, > > [I'm sorry for the previous email. It seems that I clicked "send" in > the wrong window]. > > On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: > > Multipath cannot directly provide failover for ioctls in the kernel > > because it doesn't know what each ioctl means and which result could > > indicate a path error. Userspace generally knows what the ioctl it > > issued means and if it might be a path error, but neither does it > > know > > which path the ioctl took nor does it necessarily have the privileges > > to > > fail a path using the control device. > > Thanks for this effort. > > I have some remarks about your approach. The most important one is that > the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous command. > It sends IO to possibly broken paths and waits for it to complete. In > the common error case of a device not responding, this might cause the > code to hang for a long time in the kernel ioctl code path, in > uninterruptible sleep (note that these probing commands will be queued > after other possibly hanging IO). In the past we have put a lot of > effort into other code paths in multipath-tools and elsewhere to avoid > this kind of hang to the extent possible. It seems to me that your set > re-introduces this risk. That's a fair point. 
I don't expect this code to be fully final; another thing that isn't
really optimal (as mentioned in the commit message) is that we're not
probing paths in parallel, potentially adding up timeouts from multiple
paths. I don't think this is a problem of the interface, though, but we
can improve the implementation while keeping the same interface.

Maybe I'm missing something, but I think the reason why it has to be
uninterruptible is just that submit_bio_wait() has the completion on the
stack? So I suppose we could just kmalloc() some state, submit all the
bios, let the completion callback free it, and then we can just stop
waiting early (i.e. wait_for_completion_interruptible/killable) and let
the requests complete on their own in the background.

Would this address your concerns or is there another part to it?

> Apart from that, minor observations are that in patch 2/2 you don't
> check whether the map is in queueing state, and I don't quite
> understand why you only check paths in the map's active path group
> without attempting a PG failover in the case where all paths in the
> current PG fail.

Ben already fixed up the missing check. Not attempting PG failover was
also his suggestion, on the basis that it would be additional work for
no real benefit when you can only submit requests for the current PG
anyway. If userspace retries the SG_IO, it will already pick a
different PG, so having the kernel do the same doesn't really improve
anything.

> I am wondering whether the DM_MPATH_PROBE_PATHS_CMD ioctl is necessary
> at all. Rather than triggering explicit path probing, you could have
> dm-mpath "handle" IO errors by failing the active path for "path
> errors". That would be similar to my patch set from 2021 [1], but it
> would avoid the extra IO and thus the additional risk of hanging in the
> kernel.
> 
> Contrary to my set, you wouldn't attempt retries in the kernel, but
> leave this part to qemu instead, like in the current set.
That would > avoid Christoph's main criticism that "Failing over SG_IO does not make > sense" [2]. Maybe I misunderstood, but my understanding from the feedback you got back then was that no SCSI-specific code was wanted in device mapper. This is why we kept interpreting the status codes in userspace. While discussing the approaches with Mikuláš and Ben, one option that was briefly discussed was a pair of ioctls: One that wraps ioctls and returns which path the ioctl took, and another one to fail that path if inspection of the result in userspace comes to the conclusion that it's a path error. I didn't think this could be implemented without also allowing an unprivileged process to DoS the device by just failing all paths even if they are still good, so we didn't continue there. Anyway, if it actually were acceptable for the kernel to check SG_IO results for path errors and fail the path while still returning the result unchanged, then that would work for us for the specific case, of course. But I don't expect this really addresses Christoph's concerns. If you think it does, is there another reason why you didn't try this before? (And it wouldn't be generic, so would we potentially have to repeat the exercise for other ioctls in the future?) > Doing the failover in qemu has the general disadvantage that qemu has > no notion about the number of available and/or healthy paths; it can > thus only guess the reasonable number of retries for any given I/O. But > that's unavoidable, given that the idea to do kernel-level failover on > SG_IO is rejected. Yes, it's a bit unfortunate, but we have to work with what we have. QEMU doesn't even necessarily know that it's dealing with a multipath device, so it just has to blindly try the ioctl and see if it works. Kevin ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-12 15:18 ` Kevin Wolf @ 2025-05-13 5:55 ` Christoph Hellwig 2025-05-13 6:09 ` Hannes Reinecke 2025-05-13 9:29 ` Kevin Wolf 2025-05-13 6:30 ` Hannes Reinecke 2025-05-13 8:00 ` Martin Wilck 2 siblings, 2 replies; 52+ messages in thread From: Christoph Hellwig @ 2025-05-13 5:55 UTC (permalink / raw) To: Kevin Wolf Cc: Martin Wilck, dm-devel, hreitz, mpatocka, snitzer, bmarzins, linux-kernel On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: > Yes, it's a bit unfortunate, but we have to work with what we have. QEMU > doesn't even necessarily know that it's dealing with a multipath device, > so it just has to blindly try the ioctl and see if it works. Why is qemu even using SG_IO to start with? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-13 5:55 ` Christoph Hellwig @ 2025-05-13 6:09 ` Hannes Reinecke 2025-05-13 6:14 ` Christoph Hellwig 2025-05-13 9:29 ` Kevin Wolf 1 sibling, 1 reply; 52+ messages in thread From: Hannes Reinecke @ 2025-05-13 6:09 UTC (permalink / raw) To: Christoph Hellwig, Kevin Wolf Cc: Martin Wilck, dm-devel, hreitz, mpatocka, snitzer, bmarzins, linux-kernel On 5/13/25 07:55, Christoph Hellwig wrote: > On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: >> Yes, it's a bit unfortunate, but we have to work with what we have. QEMU >> doesn't even necessarily know that it's dealing with a multipath device, >> so it just has to blindly try the ioctl and see if it works. > > Why is qemu even using SG_IO to start with? > To be able to forward SCSI commands. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-13 6:09 ` Hannes Reinecke @ 2025-05-13 6:14 ` Christoph Hellwig 2025-05-13 6:32 ` Hannes Reinecke 0 siblings, 1 reply; 52+ messages in thread From: Christoph Hellwig @ 2025-05-13 6:14 UTC (permalink / raw) To: Hannes Reinecke Cc: Christoph Hellwig, Kevin Wolf, Martin Wilck, dm-devel, hreitz, mpatocka, snitzer, bmarzins, linux-kernel On Tue, May 13, 2025 at 08:09:45AM +0200, Hannes Reinecke wrote: > On 5/13/25 07:55, Christoph Hellwig wrote: > > On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: > > > Yes, it's a bit unfortunate, but we have to work with what we have. QEMU > > > doesn't even necessarily know that it's dealing with a multipath device, > > > so it just has to blindly try the ioctl and see if it works. > > > > Why is qemu even using SG_IO to start with? > > > To be able to forward SCSI commands. Why? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-13 6:14 ` Christoph Hellwig @ 2025-05-13 6:32 ` Hannes Reinecke 2025-05-13 6:49 ` Christoph Hellwig 0 siblings, 1 reply; 52+ messages in thread From: Hannes Reinecke @ 2025-05-13 6:32 UTC (permalink / raw) To: Christoph Hellwig Cc: Kevin Wolf, Martin Wilck, dm-devel, hreitz, mpatocka, snitzer, bmarzins, linux-kernel On 5/13/25 08:14, Christoph Hellwig wrote: > On Tue, May 13, 2025 at 08:09:45AM +0200, Hannes Reinecke wrote: >> On 5/13/25 07:55, Christoph Hellwig wrote: >>> On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: >>>> Yes, it's a bit unfortunate, but we have to work with what we have. QEMU >>>> doesn't even necessarily know that it's dealing with a multipath device, >>>> so it just has to blindly try the ioctl and see if it works. >>> >>> Why is qemu even using SG_IO to start with? >>> >> To be able to forward SCSI commands. > > Why? > Reservations and stuff. There are customers who use GPFS ... But yes, I know. We are working on it; it should be possible to hook in the generic block reservation code here. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-13 6:32 ` Hannes Reinecke @ 2025-05-13 6:49 ` Christoph Hellwig 2025-05-13 8:17 ` Martin Wilck 2025-05-13 16:29 ` Benjamin Marzinski 0 siblings, 2 replies; 52+ messages in thread From: Christoph Hellwig @ 2025-05-13 6:49 UTC (permalink / raw) To: Hannes Reinecke Cc: Christoph Hellwig, Kevin Wolf, Martin Wilck, dm-devel, hreitz, mpatocka, snitzer, bmarzins, linux-kernel On Tue, May 13, 2025 at 08:32:12AM +0200, Hannes Reinecke wrote: > Reservations and stuff. They should use the kernel persistent reservation API. > There are customers who use GPFS ... Supporting illegal binary only modules is already enough of a reason to NAK this. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-13 6:49 ` Christoph Hellwig @ 2025-05-13 8:17 ` Martin Wilck 2025-05-14 4:53 ` Christoph Hellwig 2025-05-13 16:29 ` Benjamin Marzinski 1 sibling, 1 reply; 52+ messages in thread From: Martin Wilck @ 2025-05-13 8:17 UTC (permalink / raw) To: Christoph Hellwig, Hannes Reinecke Cc: Kevin Wolf, dm-devel, hreitz, mpatocka, snitzer, bmarzins, linux-kernel On Mon, 2025-05-12 at 23:49 -0700, Christoph Hellwig wrote: > On Tue, May 13, 2025 at 08:32:12AM +0200, Hannes Reinecke wrote: > > Reservations and stuff. > > They should use the kernel persistent reservation API. > > > There are customer who use GPFS ... > > Supporting illegal binary only modules that is already enough of a > reason to NAK this. GPFS is not the only use case. The reason why we have "scsi-block" in qemu is simply SCSI emulation. If we emulate a SCSI device to a VM, the device should support commands like persistent reservations properly. "scsi-block" supports this, while still offering decent IO performance to VMs. In the multipath case, the idea is to be able to hide from the VM the fact that the device in question is actually a multipath device, and still allow this sort of SCSI commands to be passed through to the storage. And no, passing the SCSI devices to the VM and doing multipath in the the guest doesn't work. The transport layer isn't properly emulated (bluntly speaking, we have no FC emulation). Regards Martin ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-13 8:17 ` Martin Wilck @ 2025-05-14 4:53 ` Christoph Hellwig 2025-05-15 11:14 ` Paolo Bonzini 0 siblings, 1 reply; 52+ messages in thread From: Christoph Hellwig @ 2025-05-14 4:53 UTC (permalink / raw) To: Martin Wilck Cc: Christoph Hellwig, Hannes Reinecke, Kevin Wolf, dm-devel, hreitz, mpatocka, snitzer, bmarzins, linux-kernel On Tue, May 13, 2025 at 10:17:58AM +0200, Martin Wilck wrote: > If we emulate a SCSI device to a VM, the device should support commands > like persistent reservations properly. Then point it to an actual SCSI device, and not a multipath-device. Trying to split responsibility for handling the initiator side work is not in any way support in the SCSI spec, and by trying it anyway you are just creating tons of of problems. > And no, passing the SCSI devices to the VM and doing multipath in the > the guest doesn't work. The transport layer isn't properly emulated > (bluntly speaking, we have no FC emulation). Then fix that. Because without it you will be in never ending pain due to impedance mismatch. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-14 4:53 ` Christoph Hellwig @ 2025-05-15 11:14 ` Paolo Bonzini 0 siblings, 0 replies; 52+ messages in thread From: Paolo Bonzini @ 2025-05-15 11:14 UTC (permalink / raw) To: Christoph Hellwig, Martin Wilck Cc: Hannes Reinecke, Kevin Wolf, dm-devel, hreitz, mpatocka, snitzer, bmarzins, linux-kernel On 5/14/25 06:53, Christoph Hellwig wrote: >> And no, passing the SCSI devices to the VM and doing multipath in the >> the guest doesn't work. The transport layer isn't properly emulated >> (bluntly speaking, we have no FC emulation). > > Then fix that. Because without it you will be in never ending pain > due to impedance mismatch. FC HBAs don't even expose enough of the protocol to actually do it sensibly, so that can't really be done. Hannes tried. Paolo ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-13 6:49 ` Christoph Hellwig 2025-05-13 8:17 ` Martin Wilck @ 2025-05-13 16:29 ` Benjamin Marzinski 2025-05-14 4:56 ` Christoph Hellwig ` (2 more replies) 1 sibling, 3 replies; 52+ messages in thread From: Benjamin Marzinski @ 2025-05-13 16:29 UTC (permalink / raw) To: Christoph Hellwig Cc: Hannes Reinecke, Kevin Wolf, Martin Wilck, dm-devel, hreitz, mpatocka, snitzer, linux-kernel On Mon, May 12, 2025 at 11:49:19PM -0700, Christoph Hellwig wrote: > On Tue, May 13, 2025 at 08:32:12AM +0200, Hannes Reinecke wrote: > > Reservations and stuff. > > They should use the kernel persistent reservation API. Currently QEMU isn't sending Persistent Reservation requests to multipath devices at all. It's sending those directly to the underlying scsi devices. The issue here is with all the other SCSI commands that users want to send to their SCSI passthrough device that is actually a multipath device on top of a number of SCSI paths. They expect to get back the actual sense and status information, so QEMU needs to send them via SG_IOs. Without reading that sense and status information in kernel, the multipath target can't know if it needs to fail a path and retry the ioctl down a different path. QEMU can read this information, but it doesn't know what path the multipath device send the ioctl down. This patch just gives users a way to check the paths in the active pathgroup (which all should be able to handle IO) and fail those that can't. While QEMU is the driver of this, it's completely general functionality. -Ben > > > There are customer who use GPFS ... > > Supporting illegal binary only modules that is already enough of a > reason to NAK this. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-13 16:29 ` Benjamin Marzinski @ 2025-05-14 4:56 ` Christoph Hellwig 2025-05-14 6:39 ` Hannes Reinecke 2025-05-16 5:52 ` Christoph Hellwig 2 siblings, 0 replies; 52+ messages in thread From: Christoph Hellwig @ 2025-05-14 4:56 UTC (permalink / raw) To: Benjamin Marzinski Cc: Christoph Hellwig, Hannes Reinecke, Kevin Wolf, Martin Wilck, dm-devel, hreitz, mpatocka, snitzer, linux-kernel On Tue, May 13, 2025 at 12:29:51PM -0400, Benjamin Marzinski wrote: > Without reading that sense and status information in kernel, the > multipath target can't know if it needs to fail a path and retry the > ioctl down a different path. QEMU can read this information, but it > doesn't know what path the multipath device send the ioctl down. This > patch just gives users a way to check the paths in the active pathgroup > (which all should be able to handle IO) and fail those that can't. > While QEMU is the driver of this, it's completely general functionality. As just replied to Martin the problem is that this setup fundamentally can't work. Either you pass a SCSI devices through, which should (mostly, there are a few warts) work. Or you want host side multipathing, in which case you must pass through the block device abstraction and not a SCSI one, or at least do a full emulation of the SCSI interfaces instead of pretending to pass it through. The root of all evil here is that dm-multipath tries to pass through SG_IO, which is dangerous for all but the most trivial commands. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-13 16:29 ` Benjamin Marzinski 2025-05-14 4:56 ` Christoph Hellwig @ 2025-05-14 6:39 ` Hannes Reinecke 2025-05-14 16:01 ` Benjamin Marzinski 2025-05-16 5:52 ` Christoph Hellwig 2 siblings, 1 reply; 52+ messages in thread From: Hannes Reinecke @ 2025-05-14 6:39 UTC (permalink / raw) To: Benjamin Marzinski, Christoph Hellwig Cc: Kevin Wolf, Martin Wilck, dm-devel, hreitz, mpatocka, snitzer, linux-kernel On 5/13/25 18:29, Benjamin Marzinski wrote: > On Mon, May 12, 2025 at 11:49:19PM -0700, Christoph Hellwig wrote: >> On Tue, May 13, 2025 at 08:32:12AM +0200, Hannes Reinecke wrote: >>> Reservations and stuff. >> >> They should use the kernel persistent reservation API. > > Currently QEMU isn't sending Persistent Reservation requests to > multipath devices at all. It's sending those directly to the underlying > scsi devices. The issue here is with all the other SCSI commands that > users want to send to their SCSI passthrough device that is actually a > multipath device on top of a number of SCSI paths. They expect to > get back the actual sense and status information, so QEMU needs to > send them via SG_IOs. > > Without reading that sense and status information in kernel, the > multipath target can't know if it needs to fail a path and retry the > ioctl down a different path. QEMU can read this information, but it > doesn't know what path the multipath device send the ioctl down. This > patch just gives users a way to check the paths in the active pathgroup > (which all should be able to handle IO) and fail those that can't. > While QEMU is the driver of this, it's completely general functionality. > Ouch. Different reservations on different paths? Really? How do you manage to keep the reservations up-to-date? My recommendation is to use the _same_ reservation on all paths, precisely to avoid having to re-do reservations on path failure. 
And for that scenario the kernel persistent reservation API would work fine. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-14 6:39 ` Hannes Reinecke @ 2025-05-14 16:01 ` Benjamin Marzinski 0 siblings, 0 replies; 52+ messages in thread From: Benjamin Marzinski @ 2025-05-14 16:01 UTC (permalink / raw) To: Hannes Reinecke Cc: Christoph Hellwig, Kevin Wolf, Martin Wilck, dm-devel, hreitz, mpatocka, snitzer, linux-kernel On Wed, May 14, 2025 at 08:39:25AM +0200, Hannes Reinecke wrote: > On 5/13/25 18:29, Benjamin Marzinski wrote: > > On Mon, May 12, 2025 at 11:49:19PM -0700, Christoph Hellwig wrote: > > > On Tue, May 13, 2025 at 08:32:12AM +0200, Hannes Reinecke wrote: > > > > Reservations and stuff. > > > > > > They should use the kernel persistent reservation API. > > > > Currently QEMU isn't sending Persistent Reservation requests to > > multipath devices at all. It's sending those directly to the underlying > > scsi devices. The issue here is with all the other SCSI commands that > > users want to send to their SCSI passthrough device that is actually a > > multipath device on top of a number of SCSI paths. They expect to > > get back the actual sense and status information, so QEMU needs to > > send them via SG_IOs. > > > > Without reading that sense and status information in kernel, the > > multipath target can't know if it needs to fail a path and retry the > > ioctl down a different path. QEMU can read this information, but it > > doesn't know what path the multipath device send the ioctl down. This > > patch just gives users a way to check the paths in the active pathgroup > > (which all should be able to handle IO) and fail those that can't. > > While QEMU is the driver of this, it's completely general functionality. > > > Ouch. > Different reservations on different paths? Really? > How do you manage to keep the reservations up-to-date? multipath's libmpathpersist library. You contributed some code to it IIRC. 
It predates the kernel interface, and even if we switched to using the kernel interface to send persistent reservation commands to the paths instead of doing that in userspace (the dm_pr_ops functions would need some reworking to handle multipath, where the rules are different than for other dm-devices. For instance, you only want to reserve on one path, and it's not necessarily a failure if you can't register a key down all paths) we would still probably want to use the library, since it also coordinates with multipathd so that it can register keys on paths that are later added to the multipath device or were down when the registration initially happened. > My recommendation is to use the _same_ reservation on all paths, > precisely to avoid having to re-do reservations on path failure. > And for that scenario the kernel persistent reservation API > would work fine. Libmpathpersist does register the same key for all paths of a multipath device. Doing anything else would be crazy. Sorry if my last email was misleading. But the kernel API cannot help in the case where a path is later added to the multipath device, or was down when the initial registration happened. But again, all this is orthogonal to the patches in question, which have nothing to do with Persistent Reservations. -Ben > Cheers, > > Hannes > -- > Dr. Hannes Reinecke Kernel Storage Architect > hare@suse.de +49 911 74053 688 > SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg > HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-13 16:29 ` Benjamin Marzinski 2025-05-14 4:56 ` Christoph Hellwig 2025-05-14 6:39 ` Hannes Reinecke @ 2025-05-16 5:52 ` Christoph Hellwig 2 siblings, 0 replies; 52+ messages in thread From: Christoph Hellwig @ 2025-05-16 5:52 UTC (permalink / raw) To: Benjamin Marzinski Cc: Christoph Hellwig, Hannes Reinecke, Kevin Wolf, Martin Wilck, dm-devel, hreitz, mpatocka, snitzer, linux-kernel On Tue, May 13, 2025 at 12:29:51PM -0400, Benjamin Marzinski wrote: > On Mon, May 12, 2025 at 11:49:19PM -0700, Christoph Hellwig wrote: > > On Tue, May 13, 2025 at 08:32:12AM +0200, Hannes Reinecke wrote: > > > Reservations and stuff. > > > > They should use the kernel persistent reservation API. > > Currently QEMU isn't sending Persistent Reservation requests to > multipath devices at all. It's sending those directly to the underlying > scsi devices. The issue here is with all the other SCSI commands that > users want to send to their SCSI passthrough device that is actually a > multipath device on top of a number of SCSI paths. They expect to > get back the actual sense and status information, so QEMU needs to > send them via SG_IOs. And that's the problem. There is plenty of path (I_T_L) nexus specific information in SCSI, and just trying to glob it back together above means you'll get path specific information in something that doesn't claim to be multi-pathing and will get random and changing results depending on the underlying path selection. > Without reading that sense and status information in kernel, the > multipath target can't know if it needs to fail a path and retry the > ioctl down a different path. The blk_status_t has the information to fail over. But the whole idea of implementing SG_IO in dm-mpath or any other stacking layer is just really, really dangerous. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-13 5:55 ` Christoph Hellwig 2025-05-13 6:09 ` Hannes Reinecke @ 2025-05-13 9:29 ` Kevin Wolf 2025-05-13 15:43 ` Paolo Bonzini 2025-05-14 4:57 ` Christoph Hellwig 1 sibling, 2 replies; 52+ messages in thread From: Kevin Wolf @ 2025-05-13 9:29 UTC (permalink / raw) To: Christoph Hellwig Cc: Martin Wilck, dm-devel, hreitz, mpatocka, snitzer, bmarzins, linux-kernel, pbonzini Am 13.05.2025 um 07:55 hat Christoph Hellwig geschrieben: > On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: > > Yes, it's a bit unfortunate, but we have to work with what we have. QEMU > > doesn't even necessarily know that it's dealing with a multipath device, > > so it just has to blindly try the ioctl and see if it works. > > Why is qemu even using SG_IO to start with? How else would you do SCSI passthrough? Ok, from your replies to Hannes I understand an implicit message, you wouldn't. But I don't think that's really an answer, at least not for all users. Yes, I'll give you that in the long term, we might be able to move some passthrough users away from it by using more specific interfaces like for persistent reservations (though see below). But for example, it's also used for vendor-specific commands and I don't see how Linux could ever provide more specific interfaces than SG_IO for those. But it's even about more than just accessing commands that aren't otherwise accessible. Mapping a SCSI response from the device to a single errno and back into SCSI status for the guest isn't lossless. QEMU's scsi-block device actually started off using normal I/O for reads and writes and using SG_IO only for things that aren't covered by normal I/O. But then those had to be changed to SG_IO, too, so that the guest would actually see the full SCSI status. 
Things the commit message mentions are unit attention codes and handling RESERVATION CONFLICT correctly (which made me unsure above if the more specific interface for reservations could actually be used to fully get rid of SG_IO). For more context, I'm adding Paolo who actually made that change back then. He may remember the concrete bug(s) this fixed. So if you want the guest device to behave mostly the same as your host device, I don't really see any way around SCSI passthrough and therefore SG_IO. Kevin ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-13 9:29 ` Kevin Wolf @ 2025-05-13 15:43 ` Paolo Bonzini 2025-05-14 4:57 ` Christoph Hellwig 1 sibling, 0 replies; 52+ messages in thread From: Paolo Bonzini @ 2025-05-13 15:43 UTC (permalink / raw) To: Kevin Wolf Cc: Christoph Hellwig, Martin Wilck, dm-devel, Hanna Czenczek, Mikulas Patocka, snitzer, Ben Marzinski, Kernel Mailing List, Linux Il mar 13 mag 2025, 11:29 Kevin Wolf <kwolf@redhat.com> ha scritto: > QEMU's scsi-block device actually started off using normal I/O for reads > and writes and using SG_IO only for things that aren't covered by normal > I/O. But then those had to be changed to SG_IO, too, so that the guest > would actually see the full SCSI status. Things the commit message > mentions are unit attention codes and handling RESERVATION CONFLICT > correctly (which made me unsure above if the more specific interface for > reservations could actually be used to fully get rid of SG_IO). For more > context, I'm adding Paolo who actually made that change back then. He > may remember the concrete bug(s) this fixed. The original reason to avoid SG_IO for reads and writes was purely performance (using the host scheduler), but it turned out to be a bad idea. RESERVATION CONFLICT indeed was one of the reasons why I moved QEMU away from SG_IO. It has since been fixed, because these days Linux uses EBADE for it and likewise there are errno values for some other statuses or sense codes, but it helps more in general to have the precise SCSI status and sense data. Having the real status and sense for example lets QEMU decide which errors to pass to the guest and which should be handled in the host (for example by pausing the VM). Some HBAs also have equivalents of Linux's host_status, and passing that down also makes for more accurate pass through. 
Also, specifically for reservations, I also didn't like the idea that a guest command could be split in multiple host commands and the reservation conflict would appear in the middle due to a concurrent PR OUT command. To be honest I don't know how physical disks and controllers handle that, but I didn't want to make it worse. Thanks, Paolo > So if you want the guest device to behave mostly the same as your host > device, I don't really see any way around SCSI passthrough and therefore > SG_IO. > > Kevin > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-13 9:29 ` Kevin Wolf 2025-05-13 15:43 ` Paolo Bonzini @ 2025-05-14 4:57 ` Christoph Hellwig 2025-05-14 16:23 ` Benjamin Marzinski 1 sibling, 1 reply; 52+ messages in thread From: Christoph Hellwig @ 2025-05-14 4:57 UTC (permalink / raw) To: Kevin Wolf Cc: Christoph Hellwig, Martin Wilck, dm-devel, hreitz, mpatocka, snitzer, bmarzins, linux-kernel, pbonzini On Tue, May 13, 2025 at 11:29:09AM +0200, Kevin Wolf wrote: > Am 13.05.2025 um 07:55 hat Christoph Hellwig geschrieben: > > On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: > > > Yes, it's a bit unfortunate, but we have to work with what we have. QEMU > > > doesn't even necessarily know that it's dealing with a multipath device, > > > so it just has to blindly try the ioctl and see if it works. > > > > Why is qemu even using SG_IO to start with? > > How else would you do SCSI passthrough? > > Ok, from your replies to Hannes I understand an implicit message, you > wouldn't. But I don't think that's really an answer, at least not for > all users. SG_IO is fine and the only way for SCSI passthrough. But doing SCSI passthrough through md-multipath just doesn't work. SCSI isn't built for layering, and ALUA and its vendor-specific variants and alternatives certainly aren't. If you try that you're playing with fire and there is no chance of it ever working properly. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-14 4:57 ` Christoph Hellwig @ 2025-05-14 16:23 ` Benjamin Marzinski 2025-05-14 17:37 ` Martin Wilck 0 siblings, 1 reply; 52+ messages in thread From: Benjamin Marzinski @ 2025-05-14 16:23 UTC (permalink / raw) To: Christoph Hellwig Cc: Kevin Wolf, Martin Wilck, dm-devel, hreitz, mpatocka, snitzer, linux-kernel, pbonzini On Tue, May 13, 2025 at 09:57:51PM -0700, Christoph Hellwig wrote: > On Tue, May 13, 2025 at 11:29:09AM +0200, Kevin Wolf wrote: > > Am 13.05.2025 um 07:55 hat Christoph Hellwig geschrieben: > > > On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: > > > > Yes, it's a bit unfortunate, but we have to work with what we have. QEMU > > > > doesn't even necessarily know that it's dealing with a multipath device, > > > > so it just has to blindly try the ioctl and see if it works. > > > > > > Why is qemu even using SG_IO to start with? > > > > How else would you do SCSI passthrough? > > > > Ok, from your replies to Hannes I understand an implicit message, you > > wouldn't. But I don't think that's really an answer, at least not for > > all users. > > SG_IO is fine and the only way for SCSI passthrough. But doing > SCSI passthrough through md-multipath just doesn't work. SCSI isn't > built for layering, and ALUA and it's vendor-specific variants and > alternatives certainly isn't. If you try that you're playing with > fire and is not chance of ever moving properly. Could you be a bit more specific? All multipath is doing here is forwarding the ioctls to an underlying scsi device, and passing back up the result. Admittedly, it doesn't always make sense to pass the ioctl on from the multipath device to just one scsi device. Persistent Reservations are a perfect example of this, and that's why QEMU doesn't use DM's ioctl passthrough code to handle them. Also, when you have ALUA setups, not all the scsi devices are equal. But multipath isn't naively assuming that they are.
It's only passing ioctls to the highest priority activated paths, just like it does for IO, and multipath is in charge of handling explicit alua devices. This hasn't proved to be problematic in practice. The reality of the situation is that customers have been using this for a while, and the only issue that they run into is that multipath can't tell when a SG_IO has failed due to a retryable error. Currently, they're left with waiting for multipathd's preemptive path checking to fail the path so they can retry down a new one. The purpose of this patchset and Martin's previous one is to handle this problem. If there are unavoidable critical problems that you see with this setup, it would be really helpful to know what they are. -Ben ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-14 16:23 ` Benjamin Marzinski @ 2025-05-14 17:37 ` Martin Wilck 2025-05-15 2:53 ` Paolo Bonzini 0 siblings, 1 reply; 52+ messages in thread From: Martin Wilck @ 2025-05-14 17:37 UTC (permalink / raw) To: Benjamin Marzinski, Christoph Hellwig Cc: Kevin Wolf, dm-devel, hreitz, mpatocka, snitzer, linux-kernel, pbonzini, Hannes Reinecke Hello Ben, hello Christoph, On Wed, 2025-05-14 at 12:23 -0400, Benjamin Marzinski wrote: > On Tue, May 13, 2025 at 09:57:51PM -0700, Christoph Hellwig wrote: > > > > SG_IO is fine and the only way for SCSI passthrough. But doing > > SCSI passthrough through md-multipath just doesn't work. SCSI > > isn't > > built for layering, and ALUA and it's vendor-specific variants and > > alternatives certainly isn't. If you try that you're playing with > > fire and is not chance of ever moving properly. > > Could you be a bit more specific. All multipath is doing here is > forwarding the ioctls to an underlying scsi device, and passing back > up > the result. Admittedly, it doesn't always make sense to pass the > ioctl > on from the multipath device to just one scsi device. Persistent > Reservations are perfect example of this, and that's why QEMU doesn't > use DMs ioctl passthrough code to handle them. I'd go one step further. Christoph is right to say that what we're currently doing in qemu – passing through every command except the PRIN/PROUT to a multipath device – is a dangerous thing to do. Passthrough from a dm-multipath device to a SCSI device makes sense only for a small subset of the SCSI command set. Basically just for the regular IO commands like the various READ and WRITE variants and the occasional UNMAP. However, in practice these commands account for 99.y% of the actual commands sent to devices.
The fact that customers have been running these setups in large deployments over many years suggests that, if other commands ever get passed through to member devices, it has rarely had fatal consequences. Nobody would seriously consider sending ALUA commands to the multipath devices. TUR and REQUEST SENSE are other examples for commands that can't be reasonably passed through to random member devices of a multipath map. There are certainly many more examples. I guess it would make sense to review the command set and add some filtering in the qemu passthrough code. AFAIK the only commands that we really need to pass through (except the standard ones) are the reservation commands, which get special handling by qemu anyway. @Ben, @Kevin, are you aware of anything else? So: admittedly we're using a framework for passing through any command, where we actually need to pass through only a tiny subset of commands. Thinking about it this way, it really doesn't look like the perfect tool for the job, and we may want to look into a different approach for the future. > Also, when you have ALUA > setups, not all the scsi devices are equal. But multipath isn't > naievely > assuming that they are. It's only passing ioctls to the highest > priority > activated paths, just like it does for IO, and multipath is in charge > of > handling explicit alua devices. This hasn't proved to be problematic > in > practice. > > The reality of the situation is that customers have been using this > for > a while, and the only issue that they run into is that multipath > can't > tell when a SG_IO has failed due to a retryable error. Currently, > they're left with waiting for multipathd's preemptive path checking > to > fail the path so they can retry down a new one. The purpose of this > patchset and Martin's previous one is to handle this problem. If > there > are unavoidable critical problems that you see with this setup, it > would > be really helpful to know what they are. 
I'd also be interested in understanding this better. As noted above, I'm aware that passing through everything is dangerous and wrong in principle. But in practice, we haven't observed anything serious except (as Ben already said) the failure to do path failover in the SG_IO code path, which both this patch set and my set from the past are intended to fix. While I am open for looking for better alternatives, I still hope that we can find an agreement for a short/mid-term solution that would allow us to serve our customers who currently use SCSI passthrough setups. That would not just benefit us (the enterprise distros), because it would also help us fund upstream contributions. Regards Martin ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-14 17:37 ` Martin Wilck @ 2025-05-15 2:53 ` Paolo Bonzini 2025-05-15 10:34 ` Martin Wilck 0 siblings, 1 reply; 52+ messages in thread From: Paolo Bonzini @ 2025-05-15 2:53 UTC (permalink / raw) To: Martin Wilck Cc: Benjamin Marzinski, Christoph Hellwig, Kevin Wolf, dm-devel, Hanna Czenczek, Mikulas Patocka, snitzer, Kernel Mailing List, Linux, Hannes Reinecke Il mer 14 mag 2025, 13:37 Martin Wilck <mwilck@suse.com> ha scritto: > > I'd go one step further. Christoph is right to say that what we're > currently doing in qemu – passing through every command except the > PRIN/PROUT to a multipath device – is a dangerous thing to do. > > Passthrough from a dm-multipath device to a SCSI device makes sense > only for a small subset of the SCSI command set. Basically just for the > regular IO commands like the various READ and WRITE variants and the > occasional UNMAP. The fact that customers > have been running these setups in large deployments over many years > suggests that, if other commands ever get passed through to member > devices, it has rarely had fatal consequences. > > Nobody would seriously consider sending ALUA commands to the multipath > devices. TUR and REQUEST SENSE are other examples for commands that > can't be reasonably passed through to random member devices of a > multipath map. Yes, as usual things are a bit more complicated. First, a handful of commands are special (REQUEST SENSE would be for HBAs that don't use auto sense, but that is fortunately not something you encounter). Second, there's already a filter in the kernel in drivers/scsi/scsi_ioctl.c for commands that are allowed without CAP_SYS_RAWIO. QEMU is subject to that so the commands you'll see are mostly I/O, INQUIRY, TUR, MODE SENSE/SELECT and that's it. 
Any command that the kernel doesn't filter would be rejected, or handled specially in the case of PR commands (PR commands use a separate privileged helper to send them down to the device; the helper also knows about multipath and uses the userspace libmpathpersist if it receives a dm-mpath file descriptor via SCM_RIGHTS). > AFAIK the only commands that we really need to pass through (except the standard ones) are the reservation commands, which get special handling > by qemu anyway. @Ben, @Kevin, are you aware of anything else? Of the ones that aren't simple I/O, mode parameters and TUR are the important cases. A TUR failure would be handled by the ioctl that Kevin proposed here by forcing a path switch. Mode parameters might not be shared(*) and would need to be sent down all the paths in that case; that can be fixed in userspace if necessary. > I'd also be interested in understanding this better. As noted above, > I'm aware that passing through everything is dangerous and wrong in > principle. But in practice, we haven't observed anything serious except > (as Ben already said) the failure to do path failover in the SG_IO code > path, which both this patch set and my set from the past are intended > to fix. Yes, the kernel filter is a PITA in the normal single path case but here it helps not doing something overly wrong. Paolo (*) in theory there's a Mode Page Policy VPD page to tell the initiator whether they are. I'm not sure if anyone supports it in the real world... ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-15 2:53 ` Paolo Bonzini @ 2025-05-15 10:34 ` Martin Wilck 2025-05-15 10:51 ` Paolo Bonzini 2025-05-15 14:29 ` Benjamin Marzinski 0 siblings, 2 replies; 52+ messages in thread From: Martin Wilck @ 2025-05-15 10:34 UTC (permalink / raw) To: Paolo Bonzini Cc: Benjamin Marzinski, Christoph Hellwig, Kevin Wolf, dm-devel, Hanna Czenczek, Mikulas Patocka, snitzer, Kernel Mailing List, Linux, Hannes Reinecke On Thu, 2025-05-15 at 04:53 +0200, Paolo Bonzini wrote: > Il mer 14 mag 2025, 13:37 Martin Wilck <mwilck@suse.com> ha scritto: > > > > I'd go one step further. Christoph is right to say that what we're > > currently doing in qemu – passing through every command except the > > PRIN/PROUT to a multipath device – is a dangerous thing to do. > > > > Passthrough from a dm-multipath device to a SCSI device makes sense > > only for a small subset of the SCSI command set. Basically just for > > the > > regular IO commands like the various READ and WRITE variants and > > the > > occasional UNMAP. The fact that customers > > have been running these setups in large deployments over many years > > suggests that, if other commands ever get passed through to member > > devices, it has rarely had fatal consequences. > > > > Nobody would seriously consider sending ALUA commands to the > > multipath > > devices. TUR and REQUEST SENSE are other examples for commands that > > can't be reasonably passed through to random member devices of a > > multipath map. > > Yes, as usual things are a bit more complicated. First, a handful of > commands are special (REQUEST SENSE would be for HBAs that don't use > auto sense, but that is fortunately not something you encounter). > Second, there's already a filter in the kernel in > drivers/scsi/scsi_ioctl.c for commands that are allowed without > CAP_SYS_RAWIO. 
QEMU is subject to that so the commands you'll see are > mostly I/O, INQUIRY, TUR, MODE SENSE/SELECT and that's it. Thanks for mentioning this. However, I suppose that depends on the permissions with which the qemu process is started, no? Wouldn't qemu need CAP_SYS_RAWIO for PCI passthrough as well? I admit that I'm confused by the many indirections in qemu's scsi-block code flow. AFAICS qemu forwards everything except PRIN/PROUT to the kernel block device in "scsi-block" mode. Correct me if I'm wrong. Anyway, let's not forget that we're talking about the kernel here. While qemu is the main target application this feature is created for, any application can use it, and it may or may not run with CAP_SYS_RAWIO. > Any command that the kernel doesn't filter would be rejected, or > handled specially in the case of PR commands (PR commands use a > separate privileged helper to send them down to the device; the > helper > also knows about multipath and uses the userspace libmpathpersist if > it receives a dm-mpath file descriptor via SCM_RIGHTS). > > > AFAIK the only commands that we really need to pass through (except > > the standard ones) are the reservation commands, which get special > > handling > > by qemu anyway. @Ben, @Kevin, are you aware of anything else? > > .Of the ones that aren't simple I/O, mode parameters and TUR are the > important cases. A TUR failure would be handled by the ioctl that > Kevin proposed here by forcing a path switch. Mode parameters might > not be shared(*) and would need to be sent down all the paths in that > case; that can be fixed in userspace if necessary. Passing TUR from a multipath device to a random member doesn't make much sense to me. qemu would need to implement some logic to determine whether the map has any usable paths. > > I'd also be interested in understanding this better. As noted > > above, > > I'm aware that passing through everything is dangerous and wrong in > > principle.
But in practice, we haven't observed anything serious > > except > > (as Ben already said) the failure to do path failover in the SG_IO > > code > > path, which both this patch set and my set from the past are > > intended > > to fix. > > Yes, the kernel filter is a PITA in the normal single path case but > here it helps not doing something overly wrong. This seems coincidental to me. Filtering by permissions and filtering for commands that make sense on multipath devices are orthogonal problems. Regards, Martin ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-15 10:34 ` Martin Wilck @ 2025-05-15 10:51 ` Paolo Bonzini 2025-05-15 14:50 ` Martin Wilck 2025-05-15 14:29 ` Benjamin Marzinski 1 sibling, 1 reply; 52+ messages in thread From: Paolo Bonzini @ 2025-05-15 10:51 UTC (permalink / raw) To: Martin Wilck Cc: Benjamin Marzinski, Christoph Hellwig, Kevin Wolf, dm-devel, Hanna Czenczek, Mikulas Patocka, snitzer, Kernel Mailing List, Linux, Hannes Reinecke On Thu, May 15, 2025 at 12:34 PM Martin Wilck <mwilck@suse.com> wrote: > On Thu, 2025-05-15 at 04:53 +0200, Paolo Bonzini wrote: > > Il mer 14 mag 2025, 13:37 Martin Wilck <mwilck@suse.com> ha scritto: > > Yes, as usual things are a bit more complicated. First, a handful of > > commands are special (REQUEST SENSE would be for HBAs that don't use > > auto sense, but that is fortunately not something you encounter). > > Second, there's already a filter in the kernel in > > drivers/scsi/scsi_ioctl.c for commands that are allowed without > > CAP_SYS_RAWIO. QEMU is subject to that so the commands you'll see are > > mostly I/O, INQUIRY, TUR, MODE SENSE/SELECT and that's it. > > Thanks for mentioning this. However, I suppose that depends on the > permissions with which the qemu process is started, no? Wouldn't > qemu need CAP_SYS_RAWIO for PCI passthrough as well? Generally you want to assume that the VM is hostile and run QEMU with as few privileges as possible (not just capabilities, but also in separate namespaces and with restrictions from device cgroups, SELinux, etc.). PCI passthrough is not an issue, it only needs access to the VFIO inodes and you can do it by setting the appropriate file permissions without extra capabilities. The actual privileged part is binding the device to VFIO, which is done outside QEMU anyway. > I admit that I'm confused by the many indirections in qemu's scsi-block > code flow. 
AFAICS qemu forwards everything except PRIN/PROUT to the > kernel block device in "scsi-block" mode. Correct me if I'm wrong. Yes, that's correct. The code for PRIN/PROUT calls out to a separate privileged process (in scsi/qemu-pr-helper.c if you're curious) which is aware of multipath and can be extended if needed. > Anwyway, let's not forget that we're talking about the kernel here. > While qemu is the main target application for this feature is created, > any application can use it, and it may or may not run with > CAP_SYS_RAWIO. Yes, but once you have CAP_SYS_RAWIO all bets for sanity are off... it even lets you do SG_IO on partitions. > > .Of the ones that aren't simple I/O, mode parameters and TUR are the > > important cases. A TUR failure would be handled by the ioctl that > > Kevin proposed here by forcing a path switch. Mode parameters might > > not be shared(*) and would need to be sent down all the paths in that > > case; that can be fixed in userspace if necessary. > > Passing TUR from a multipath device to a random member doesn't make > much sense to me. qemu would need to implement some logic to determine > whether the map has any usable paths. As long as one path replies to a TUR and the host is able to (eventually, somehow) steer I/O transparently to that path, that should be good enough. If the one path that the kernel tries is down, QEMU can probe which paths are up and retry. That seems consistent with what you want from TUR but maybe I'm missing something. > > Yes, the kernel filter is a PITA in the normal single path case but > > here it helps not doing something overly wrong. > > This seems coincidental to me. Filtering by permissions and filtering > for commands that make sense on multipath devices are orthogonal > problems. I agree (it helps, it doesn't solve the problem), however there is a large overlap between the two. Paolo ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-15 10:51 ` Paolo Bonzini @ 2025-05-15 14:50 ` Martin Wilck 0 siblings, 0 replies; 52+ messages in thread From: Martin Wilck @ 2025-05-15 14:50 UTC (permalink / raw) To: Paolo Bonzini Cc: Benjamin Marzinski, Christoph Hellwig, Kevin Wolf, dm-devel, Hanna Czenczek, Mikulas Patocka, snitzer, Kernel Mailing List, Linux, Hannes Reinecke On Thu, 2025-05-15 at 12:51 +0200, Paolo Bonzini wrote: > On Thu, May 15, 2025 at 12:34 PM Martin Wilck <mwilck@suse.com> > > > > Thanks for mentioning this. However, I suppose that depends on the > > permissions with which the qemu process is started, no? Wouldn't > > qemu need CAP_SYS_RAWIO for PCI passthrough as well? > > Generally you want to assume that the VM is hostile and run QEMU with > as few privileges as possible (not just capabilities, but also in > separate namespaces and with restrictions from device cgroups, > SELinux, etc.). PCI passthrough is not an issue, it only needs access > to the VFIO inodes and you can do it by setting the appropriate file > permissions without extra capabilities. The actual privileged part is > binding the device to VFIO, which is done outside QEMU anyway. Thanks for the clarification. > > I admit that I'm confused by the many indirections in qemu's scsi- > > block > > code flow. AFAICS qemu forwards everything except PRIN/PROUT to the > > kernel block device in "scsi-block" mode. Correct me if I'm wrong. > > Yes, that's correct. The code for PRIN/PROUT calls out to a separate > privileged process (in scsi/qemu-pr-helper.c if you're curious) which > is aware of multipath and can be extended if needed. Sure, I was aware of the helper. I just wasn't 100% clear about how it gets called. Found the code in the meantime [1]. 
[1] https://github.com/qemu/qemu/blob/864813878951b44e964eb4c012d832fd21f8cc0c/block/file-posix.c#L4286 > > > Of the ones that aren't simple I/O, mode parameters and TUR are > > > the > > > important cases. A TUR failure would be handled by the ioctl that > > > Kevin proposed here by forcing a path switch. Mode parameters > > > might > > > not be shared(*) and would need to be sent down all the paths in > > > that > > > case; that can be fixed in userspace if necessary. > > > > Passing TUR from a multipath device to a random member doesn't make > > much sense to me. qemu would need to implement some logic to > > determine > > whether the map has any usable paths. > > As long as one path replies to a TUR and the host is able to > (eventually, somehow) steer I/O transparently to that path, that > should be good enough. If the one path that the kernel tries is down, > QEMU can probe which paths are up and retry. That seems consistent > with what you want from TUR but maybe I'm missing something. It's ok-ish, in particular in combination with Kevin's patch. But using an equivalent of "multipath -C" would be closer to the real thing for TUR. Regards Martin ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-15 10:34 ` Martin Wilck 2025-05-15 10:51 ` Paolo Bonzini @ 2025-05-15 14:29 ` Benjamin Marzinski 2025-05-15 15:00 ` Martin Wilck 1 sibling, 1 reply; 52+ messages in thread From: Benjamin Marzinski @ 2025-05-15 14:29 UTC (permalink / raw) To: Martin Wilck Cc: Paolo Bonzini, Christoph Hellwig, Kevin Wolf, dm-devel, Hanna Czenczek, Mikulas Patocka, snitzer, Kernel Mailing List, Linux, Hannes Reinecke On Thu, May 15, 2025 at 12:34:13PM +0200, Martin Wilck wrote: > On Thu, 2025-05-15 at 04:53 +0200, Paolo Bonzini wrote: > > Il mer 14 mag 2025, 13:37 Martin Wilck <mwilck@suse.com> ha scritto: > > > > > > I'd go one step further. Christoph is right to say that what we're > > > currently doing in qemu – passing through every command except the > > > PRIN/PROUT to a multipath device – is a dangerous thing to do. > > > > > > Passthrough from a dm-multipath device to a SCSI device makes sense > > > only for a small subset of the SCSI command set. Basically just for > > > the > > > regular IO commands like the various READ and WRITE variants and > > > the > > > occasional UNMAP. The fact that customers > > > have been running these setups in large deployments over many years > > > suggests that, if other commands ever get passed through to member > > > devices, it has rarely had fatal consequences. > > > > > > Nobody would seriously consider sending ALUA commands to the > > > multipath > > > devices. TUR and REQUEST SENSE are other examples for commands that > > > can't be reasonably passed through to random member devices of a > > > multipath map. > > > > Yes, as usual things are a bit more complicated. First, a handful of > > commands are special (REQUEST SENSE would be for HBAs that don't use > > auto sense, but that is fortunately not something you encounter). 
> > Second, there's already a filter in the kernel in > > drivers/scsi/scsi_ioctl.c for commands that are allowed without > > CAP_SYS_RAWIO. QEMU is subject to that so the commands you'll see are > > mostly I/O, INQUIRY, TUR, MODE SENSE/SELECT and that's it. > > Thanks for mentioning this. > > However, I suppose that depends on the permissions with which the qemu > process is started, no? Wouldn't qemu need CAP_SYS_RAWIO for PCI > passthrough as well? > > I admit that I'm confused by the many indirections in qemu's scsi-block > code flow. AFAICS qemu forwards everything except PRIN/PROUT to the > kernel block device in "scsi-block" mode. Correct me if I'm wrong. > > Anyway, let's not forget that we're talking about the kernel here. > While qemu is the main target application for which this feature is created, > any application can use it, and it may or may not run with > CAP_SYS_RAWIO. Maybe I'm missing your issue, but QEMU is just using DM's existing ioctl forwarding code, which was already designed for general use, and I don't see any capability issues with this patchset itself. If the caller has the capability to issue this ioctl, they already have the capability to do reads to the device directly, which will cause any unusable paths to be failed, exactly like the ioctl does. The only difference is that the user has no way to know how many reads they would need to issue to the multipath device directly to force it to try all the paths. This gives any application the ability to verify the paths just like doing enough IO to the multipath device would, but it guarantees that it will use the minimum IO. -Ben > > Any command that the kernel doesn't filter would be rejected, or > > handled specially in the case of PR commands (PR commands use a > > separate privileged helper to send them down to the device; the > > helper > > also knows about multipath and uses the userspace libmpathpersist if > > it receives a dm-mpath file descriptor via SCM_RIGHTS).
> > > > > AFAIK the only commands that we really need to pass through (except > > > the standard ones) are the reservation commands, which get special > > > handling > > > by qemu anyway. @Ben, @Kevin, are you aware of anything else? > > > > .Of the ones that aren't simple I/O, mode parameters and TUR are the > > important cases. A TUR failure would be handled by the ioctl that > > Kevin proposed here by forcing a path switch. Mode parameters might > > not be shared(*) and would need to be sent down all the paths in that > > case; that can be fixed in userspace if necessary. > > Passing TUR from a multipath device to a random member doesn't make > much sense to me. qemu would need to implement some logic to determine > whether the map has any usable paths. > > > > I'd also be interested in understanding this better. As noted > > > above, > > > I'm aware that passing through everything is dangerous and wrong in > > > principle. But in practice, we haven't observed anything serious > > > except > > > (as Ben already said) the failure to do path failover in the SG_IO > > > code > > > path, which both this patch set and my set from the past are > > > intended > > > to fix. > > > > Yes, the kernel filter is a PITA in the normal single path case but > > here it helps not doing something overly wrong. > > This seems coincidental to me. Filtering by permissions and filtering > for commands that make sense on multipath devices are orthogonal > problems. > > Regards, > Martin ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-15 14:29 ` Benjamin Marzinski @ 2025-05-15 15:00 ` Martin Wilck 2025-05-16 5:57 ` Christoph Hellwig 0 siblings, 1 reply; 52+ messages in thread From: Martin Wilck @ 2025-05-15 15:00 UTC (permalink / raw) To: Benjamin Marzinski Cc: Paolo Bonzini, Christoph Hellwig, Kevin Wolf, dm-devel, Hanna Czenczek, Mikulas Patocka, snitzer, Kernel Mailing List, Linux, Hannes Reinecke On Thu, 2025-05-15 at 10:29 -0400, Benjamin Marzinski wrote: > On Thu, May 15, 2025 at 12:34:13PM +0200, Martin Wilck wrote: > > > > However, I suppose that depends on the permissions with which the > > qemu > > process is started, no? Wouldn't qemu need CAP_SYS_RAWIO for PCI > > passthrough as well? > > > > I admit that I'm confused by the many indirections in qemu's scsi- > > block > > code flow. AFAICS qemu forwards everything except PRIN/PROUT to the > > kernel block device in "scsi-block" mode. Correct me if I'm wrong. > > > > Anyway, let's not forget that we're talking about the kernel here. > > While qemu is the main target application for which this feature is > > created, > > any application can use it, and it may or may not run with > > CAP_SYS_RAWIO. > > Maybe I'm missing your issue, but QEMU is just using DM's existing > ioctl > forwarding code, which was already designed for general use, and I > don't > see any capability issues with this patchset itself. I didn't mean it this way. I was rather musing about the question whether doing SG_IO on multipath devices by forwarding them to the current path makes sense. Sorry for the confusing post. Regards Martin ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-15 15:00 ` Martin Wilck @ 2025-05-16 5:57 ` Christoph Hellwig 0 siblings, 0 replies; 52+ messages in thread From: Christoph Hellwig @ 2025-05-16 5:57 UTC (permalink / raw) To: Martin Wilck Cc: Benjamin Marzinski, Paolo Bonzini, Christoph Hellwig, Kevin Wolf, dm-devel, Hanna Czenczek, Mikulas Patocka, snitzer, Kernel Mailing List, Linux, Hannes Reinecke On Thu, May 15, 2025 at 05:00:41PM +0200, Martin Wilck wrote: > I didn't mean it this way. I was rather musing about the question > whether doing SG_IO on multipath devices by forwarding them to the > current path makes sense. It doesn't, and that's the core of the problem. Someone back in the day thought just forwarding SG_IO would solve whatever thing they had to do back then which didn't have a proper kernel abstraction, and it keeps adding problems. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-12 15:18 ` Kevin Wolf 2025-05-13 5:55 ` Christoph Hellwig @ 2025-05-13 6:30 ` Hannes Reinecke 2025-05-13 18:09 ` Benjamin Marzinski 2025-05-13 8:00 ` Martin Wilck 2 siblings, 1 reply; 52+ messages in thread From: Hannes Reinecke @ 2025-05-13 6:30 UTC (permalink / raw) To: Kevin Wolf, Martin Wilck Cc: dm-devel, hreitz, mpatocka, snitzer, bmarzins, linux-kernel On 5/12/25 17:18, Kevin Wolf wrote: > Am 08.05.2025 um 15:51 hat Martin Wilck geschrieben: >> Hello Kevin, >> >> [I'm sorry for the previous email. It seems that I clicked "send" in >> the wrong window]. >> >> On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: >>> Multipath cannot directly provide failover for ioctls in the kernel >>> because it doesn't know what each ioctl means and which result could >>> indicate a path error. Userspace generally knows what the ioctl it >>> issued means and if it might be a path error, but neither does it >>> know >>> which path the ioctl took nor does it necessarily have the privileges >>> to >>> fail a path using the control device. >> >> Thanks for this effort. >> >> I have some remarks about your approach. The most important one is that >> the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous command. >> It sends IO to possibly broken paths and waits for it to complete. In >> the common error case of a device not responding, this might cause the >> code to hang for a long time in the kernel ioctl code path, in >> uninterruptible sleep (note that these probing commands will be queued >> after other possibly hanging IO). In the past we have put a lot of >> effort into other code paths in multipath-tools and elsewhere to avoid >> this kind of hang to the extent possible. It seems to me that your set >> re-introduces this risk. > > That's a fair point. 
I don't expect this code to be fully final, another > thing that isn't really optimal (as mentioned in the comment message) is > that we're not probing paths in parallel, potentially adding up timeouts > from multiple paths. > > I don't think this is a problem of the interface, though, but we can > improve the implementation keeping the same interface. > > Maybe I'm missing something, but I think the reason why it has to be > uninterruptible is just that submit_bio_wait() has the completion on the > stack? So I suppose we could just kmalloc() some state, submit all the > bios, let the completion callback free it, and then we can just stop > waiting early (i.e. wait_for_completion_interruptible/killable) and let > the requests complete on their own in the background. > > Would this address your concerns or is there another part to it? > Not really. There are two issues which ultimately made us move away from read I/O as path checkers: - Spurious timeouts due to blocked or congested submission - Error handling delays and stalls When using normal read/write functions for path checking you are subjected to normal I/O processing rules, ie I/O is inserted at the end of the submission queue. If the system is busy the path checker will time out prematurely without ever having reached the target. One could avoid that by extending the timeout, but that would make the path checker unreliable. But the real issue is error handling. Path checkers are precisely there to check if the path is healthy, and as such _will_ be subjected to error handling (or might even trigger error handling itself). The thing about error handling is that you can only return affected commands once error handling is complete, as only then you can be sure that no DMA is pending on the buffers and one can free/re-use the associated pages. On SCSI error handling can be an _extremely_ lengthy affair (we had reports of over one hour), and the last thing you want is to delay your path checker for that time.
(And besides, I didn't even mention the third problem with I/O-based path checkers, namely that they check entirely the wrong thing; we are not interested in whether we can do I/O, we are interested in whether we can send commands. In the light of eg ALUA these are two very different things.) >> Apart from that, minor observations are that in patch 2/2 you don't >> check whether the map is in queueing state, and I don't quite >> understand why you only check paths in the map's active path group >> without attempting a PG failover in the case where all paths in the >> current PG fail. > > Ben already fixed up the missing check. > > Not attempting PG failover was also his suggestion, on the basis that it > would be additional work for no real benefit when you can only submit > requests for the current PG anyway. If userspace retries the SG_IO, it > will already pick a different PG, so having the kernel do the same > doesn't really improve anything. > Precisely. I think the best way forward here is to have a 'post_ioctl' handler (much like we have a pre_ioctl handler nowadays). This would check the return code from the ioctl, and invalidate the current path if there was an I/O error. That way the user can resubmit the ioctl, until all paths are exhausted. For that to work we need to agree on two error codes, one for 'path failed, retry' and one for 'path failed, no retry possible'. Resetting path status will be delegated to multipathd, but then that's its main task anyway. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-13 6:30 ` Hannes Reinecke @ 2025-05-13 18:09 ` Benjamin Marzinski 0 siblings, 0 replies; 52+ messages in thread From: Benjamin Marzinski @ 2025-05-13 18:09 UTC (permalink / raw) To: Hannes Reinecke Cc: Kevin Wolf, Martin Wilck, dm-devel, hreitz, mpatocka, snitzer, linux-kernel On Tue, May 13, 2025 at 08:30:55AM +0200, Hannes Reinecke wrote: > On 5/12/25 17:18, Kevin Wolf wrote: > > Am 08.05.2025 um 15:51 hat Martin Wilck geschrieben: > > > Hello Kevin, > > > > > > [I'm sorry for the previous email. It seems that I clicked "send" in > > > the wrong window]. > > > > > > On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: > > > > Multipath cannot directly provide failover for ioctls in the kernel > > > > because it doesn't know what each ioctl means and which result could > > > > indicate a path error. Userspace generally knows what the ioctl it > > > > issued means and if it might be a path error, but neither does it > > > > know > > > > which path the ioctl took nor does it necessarily have the privileges > > > > to > > > > fail a path using the control device. > > > > > > Thanks for this effort. > > > > > > I have some remarks about your approach. The most important one is that > > > the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous command. > > > It sends IO to possibly broken paths and waits for it to complete. In > > > the common error case of a device not responding, this might cause the > > > code to hang for a long time in the kernel ioctl code path, in > > > uninterruptible sleep (note that these probing commands will be queued > > > after other possibly hanging IO). In the past we have put a lot of > > > effort into other code paths in multipath-tools and elsewhere to avoid > > > this kind of hang to the extent possible. It seems to me that your set > > > re-introduces this risk. > > > > That's a fair point. 
I don't expect this code to be fully final, another > > thing that isn't really optimal (as mentioned in the comment message) is > > that we're not probing paths in parallel, potentially adding up timeouts > > from multiple paths. > > > > I don't think this is a problem of the interface, though, but we can > > improve the implementation keeping the same interface. > > > > Maybe I'm missing something, but I think the reason why it has to be > > uninterruptible is just that submit_bio_wait() has the completion on the > > stack? So I suppose we could just kmalloc() some state, submit all the > > bios, let the completion callback free it, and then we can just stop > > waiting early (i.e. wait_for_completion_interruptible/killable) and let > > the requests complete on their own in the background. > > > > Would this address your concerns or is there another part to it? > > > Not really. There are two issues which ultimately made us move away > from read I/O as path checkers: > > - Spurious timeouts due to blocked or congested submission > - Error handling delays and stalls > > When using normal read/write functions for path checking you are > subjected to normal I/O processing rules, ie I/O is inserted > at the end of the submission queue. If the system is busy the > path checker will time out prematurely without ever having reached > the target. One could avoid that by extending the timeout, but that > would make the path checker unreliable. I get your complaint in general, but this interface is just giving users the ability to probe the active pathgroup by sending normal IOs to its paths. Presumably, they won't use it if they are already sending normal IOs to the multipath device, since this is just duplicating the effect of those IOs. > But the real issue is error handling. Path checker are precisely there > to check if the path is healthy, and as such _will_ be subjected > to error handling (or might even trigger error handling itself). 
> The thing about error handling is that you can only return affected > commands once error handling is complete, as only then you can be > sure that no DMA is pending on the buffers and one can free/re-use > the associated pages. > On SCSI error handling can be an _extremely_ lengthy affair > (we had reports of over one hour), and the last thing you want is > to delay your path checker for that time. I can send a patch that will keep probe_active_paths() from holding the work_mutex. This means that probe_active_paths() won't delay multipath_message(), so the path checkers will not be affected by this. > (And besides, I didn't even mention the third problem with I/O-based > path checkers, namely that they check entirely the wrong thing; we > are not interested in whether we can do I/O, we are interested in whether > we can send commands. In the light of eg ALUA these are two very > different things.) I actually disagree with this. There are issues with IO based checkers, but in general, what we care about is paths being able to do IO, not sending commands. All the improvements on the directio checker were made because we still need it. For instance, some SCSI devices will claim they are usable when you run a SCSI Test Unit Ready command, but they aren't able to actually handle IO. These paths will have their state ping-pong back and forth as multipathd restores them and the kernel uses them and immediately fails them again. The solution is often to switch multipathd to the directio checker, since it verifies that the path can actually handle IO. It's problematic when dealing with paths that need to be initialized before accepting IO, but that's why this patch only checks the paths in the active pathgroup.
> > > Apart from that, minor observations are that in patch 2/2 you don't > > > check whether the map is in queueing state, and I don't quite > > > understand why you only check paths in the map's active path group > > > without attempting a PG failover in the case where all paths in the > > > current PG fail. > > > > Ben already fixed up the missing check. > > > > Not attempting PG failover was also his suggestion, on the basis that it > > would be additional work for no real benefit when you can only submit > > requests for the current PG anyway. If userspace retries the SG_IO, it > > will already pick a different PG, so having the kernel do the same > > doesn't really improve anything. > > > Precisely. > > I think the best way forward here is to have a 'post_ioctl' handler > (much like we have a pre_ioctl handler nowadays). > This would check the return code from the ioctl, and invalidate the > current path if there was an I/O error. > That way the user can resubmit the ioctl, until all paths are exhausted. > For that to work we need to agree on two error codes, one for > 'path failed, retry' and one for 'path failed, no retry possible'. > Resetting path status will be delegated to multipathd, but then that's > its main task anyway. Isn't this basically the idea that Martin had, here: https://lore.kernel.org/all/20210628151558.2289-1-mwilck@suse.com/ If this was acceptable, I would be all for it. But we often can't even tell if an SG_IO has failed without decoding the scsi status/sense data, and that got shot down when Martin posted patches to do it. Thus, we have a general solution that has nothing to do with SG_IO ioctls and will work for any type of path device, any time a user wants to verify the state of the active paths. -Ben > Cheers, > > Hannes > -- > Dr. Hannes Reinecke Kernel Storage Architect > hare@suse.de +49 911 74053 688 > SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg > HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. 
Knoblich ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-12 15:18 ` Kevin Wolf 2025-05-13 5:55 ` Christoph Hellwig 2025-05-13 6:30 ` Hannes Reinecke @ 2025-05-13 8:00 ` Martin Wilck 2025-05-13 10:06 ` Martin Wilck 2025-05-14 21:21 ` Martin Wilck 2 siblings, 2 replies; 52+ messages in thread From: Martin Wilck @ 2025-05-13 8:00 UTC (permalink / raw) To: Kevin Wolf, Christoph Hellwig Cc: dm-devel, hreitz, mpatocka, snitzer, bmarzins, linux-kernel Hi Kevin, On Mon, 2025-05-12 at 17:18 +0200, Kevin Wolf wrote: > Am 08.05.2025 um 15:51 hat Martin Wilck geschrieben: > > Hello Kevin, > > > > [I'm sorry for the previous email. It seems that I clicked "send" > > in > > the wrong window]. > > > > On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: > > > Multipath cannot directly provide failover for ioctls in the > > > kernel > > > because it doesn't know what each ioctl means and which result > > > could > > > indicate a path error. Userspace generally knows what the ioctl > > > it > > > issued means and if it might be a path error, but neither does it > > > know > > > which path the ioctl took nor does it necessarily have the > > > privileges > > > to > > > fail a path using the control device. > > > > Thanks for this effort. > > > > I have some remarks about your approach. The most important one is > > that > > the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous > > command. > > It sends IO to possibly broken paths and waits for it to complete. > > In > > the common error case of a device not responding, this might cause > > the > > code to hang for a long time in the kernel ioctl code path, in > > uninterruptible sleep (note that these probing commands will be > > queued > > after other possibly hanging IO). In the past we have put a lot of > > effort into other code paths in multipath-tools and elsewhere to > > avoid > > this kind of hang to the extent possible. It seems to me that your > > set > > re-introduces this risk. 
> > That's a fair point. I don't expect this code to be fully final, > another > thing that isn't really optimal (as mentioned in the comment message) > is > that we're not probing paths in parallel, potentially adding up > timeouts > from multiple paths. > > I don't think this is a problem of the interface, though, but we can > improve the implementation keeping the same interface. > Maybe I'm missing something, but I think the reason why it has to be > uninterruptible is just that submit_bio_wait() has the completion on > the > stack? So I suppose we could just kmalloc() some state, submit all > the > bios, let the completion callback free it, and then we can just stop > waiting early (i.e. wait_for_completion_interruptible/killable) and > let > the requests complete on their own in the background. > > Would this address your concerns or is there another part to it? It'd be an improvement. Not sure if it'd solve the problem entirely. > > Apart from that, minor observations are that in patch 2/2 you don't > > check whether the map is in queueing state, and I don't quite > > understand why you only check paths in the map's active path group > > without attempting a PG failover in the case where all paths in the > > current PG fail. > > Ben already fixed up the missing check. I missed that. > Not attempting PG failover was also his suggestion, on the basis that > it > would be additional work for no real benefit when you can only submit > requests for the current PG anyway. If userspace retries the SG_IO, > it > will already pick a different PG, so having the kernel do the same > doesn't really improve anything. Ok. But continuing along this line of reasoning, I'd conclude that it would be even more useful to fail the path in device-mapper directly after a failed IO request (given that we find an agreeable way to do this, see below). Qemu could then retry, depending on the error code. 
Paths would be failed one after the other, and eventually a PG failover would occur. > > I am wondering whether the DM_MPATH_PROBE_PATHS_CMD ioctl is > > necessary > > at all. Rather than triggering explicit path probing, you could > > have > > dm-mpath "handle" IO errors by failing the active path for "path > > errors". That would be similar to my patch set from 2021 [1], but > > it > > would avoid the extra IO and thus the additional risk of hanging in > > the > > kernel. > > > > Contrary to my set, you wouldn't attempt retries in the kernel, but > > leave this part to qemu instead, like in the current set. That > > would > > avoid Christoph's main criticism that "Failing over SG_IO does not > > make > > sense" [2]. > > Maybe I misunderstood, but my understanding from the feedback you got > back then was that no SCSI-specific code was wanted in device mapper. > This is why we kept interpreting the status codes in userspace. As I wrote in my reply to Mikulas already, my understanding was that Christoph's main objection was against retrying SG_IO ioctls in the kernel on different paths, not against inspecting the error code. > While discussing the approaches with Mikuláš and Ben, one option that > was briefly discussed was a pair of ioctls: One that wraps ioctls and > returns which path the ioctl took, and another one to fail that path > if > inspection of the result in userspace comes to the conclusion that > it's > a path error. I didn't think this could be implemented without also > allowing an unprivileged process to DoS the device by just failing > all > paths even if they are still good, so we didn't continue there. This doesn't sound too promising IMO. If we want to find a solution that's focused on user space, I'd suggest to use multipathd rather than additional ioctls. After all, all necessary information is available in multipathd. Already today, qemu can use the multipath socket to query the current state of paths in a map. 
I can imagine a new multipathd CLI command that would instruct multipathd to check all paths of a given map immediately (@Ben: set pp->tick to 1 for all paths of the map). This command could be sent to multipathd in case of a suspected path failure. This would have similar semantics to the DM_MPATH_PROBE_PATHS_CMD ioctl, with some advantages, as it would take the multipath configuration of the specific storage array into account. It'd also avoid blocking qemu. multipathd would allow this command for non-root connections, but to avoid DoS, accept it only once every N seconds. > Anyway, if it actually were acceptable for the kernel to check SG_IO > results for path errors and fail the path while still returning the > result unchanged, then that would work for us for the specific case, > of > course. But I don't expect this really addresses Christoph's > concerns. For regular block IO, the "path error" logic is in blk_path_error() [1] and scsi_result_to_blk_status() [2]. blk_path_error() can be called from dm; the problem in the SG_IO code path is that we don't have a blk_status_t to examine. Therefore, in my old code, I replicated the logic of scsi_result_to_blk_status() in the dm layer. It works, but it's admittedly not the cleanest possible approach. OTOH, SG_IO on dm devices isn't the cleanest thing to do in general ;-) > If you think it does, is there another reason why you didn't try this > before? It didn't occur to me back then that we could fail paths without retrying in the kernel. Perhaps we could have the sg driver pass the blk_status_t (which is available on the sg level) to device mapper somehow in the sg_io_hdr structure? That way we could entirely avoid the layering violation between SCSI and dm. Not sure if that would be acceptable to Christoph, as blk_status_t is supposed to be exclusive to the kernel. Can we find a way to make sure it's passed to DM, but not to user space?
Alternatively (if maintainers strictly object to the error code inspection by dm), I wonder whether we could just _always_ fail the current path in case of errors in the dm-mpath SG_IO code path, without inspecting the error code. Rationale: a) most potential error conditions are treated as "path error" in block IO code path, b) qemu can still inspect the error code and avoid retrying for errors that aren't path errors, and c) if multipathd is running and the path has been failed mistakenly, it will reinstate that path after just a few seconds. In the worst case, an unintentional failover would occur. That isn't as bad as it used to be, as active-passive configurations with explicit TPGS are less common than in the past. > (And it wouldn't be generic, so would we potentially have to repeat > the > exercise for other ioctls in the future?) I don't think so. Do you have anything specific in mind? > > Doing the failover in qemu has the general disadvantage that qemu > > has > > no notion about the number of available and/or healthy paths; it > > can > > thus only guess the reasonable number of retries for any given I/O. > > But > > that's unavoidable, given that the idea to do kernel-level failover > > on > > SG_IO is rejected. > > Yes, it's a bit unfortunate, but we have to work with what we have. > QEMU > doesn't even necessarily know that it's dealing with a multipath > device, > so it just has to blindly try the ioctl and see if it works. See the paragraph about multipathd above. Regards Martin [1] https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/blk_types.h#L185 [2] https://elixir.bootlin.com/linux/v6.14.6/source/drivers/scsi/scsi_lib.c#L685 PS: Yet another option that Christoph has suggested in the past would be to move away from qemu's "scsi-block", and use the generic block reservation mechanism to allow passthrough of reservation commands from VMs to the host. Nobody seems to have explored this option seriously so far.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-13 8:00 ` Martin Wilck @ 2025-05-13 10:06 ` Martin Wilck 2025-05-14 21:21 ` Martin Wilck 1 sibling, 0 replies; 52+ messages in thread From: Martin Wilck @ 2025-05-13 10:06 UTC (permalink / raw) To: Kevin Wolf, Christoph Hellwig Cc: dm-devel, hreitz, mpatocka, snitzer, bmarzins, linux-kernel On Tue, 2025-05-13 at 10:00 +0200, Martin Wilck wrote: > On Mon, 2025-05-12 at 17:18 +0200, Kevin Wolf wrote: > > > > > Ben already fixed up the missing check. > > I missed that. Saw it now. Martin ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-13 8:00 ` Martin Wilck 2025-05-13 10:06 ` Martin Wilck @ 2025-05-14 21:21 ` Martin Wilck 2025-05-15 10:11 ` Kevin Wolf 1 sibling, 1 reply; 52+ messages in thread From: Martin Wilck @ 2025-05-14 21:21 UTC (permalink / raw) To: Kevin Wolf, Christoph Hellwig, Benjamin Marzinski Cc: dm-devel, hreitz, mpatocka, snitzer, linux-kernel On Tue, 2025-05-13 at 10:00 +0200, Martin Wilck wrote: > > > If you think it does, is there another reason why you didn't try > > this > > before? > > It didn't occur to me back then that we could fail paths without > retrying in the kernel. > > Perhaps we could have the sg driver pass the blk_status_t (which is > available on the sg level) to device mapper somehow in the sg_io_hdr > structure? That way we could entirely avoid the layering violation > between SCSI and dm. Not sure if that would be acceptible to > Christoph, > as blk_status_t is supposed to be exclusive to the kernel. Can we > find > a way to make sure it's passed to DM, but not to user space? I have to correct myself. I was confused by my old patches which contain special casing for SG_IO. The current upstream code does of course not support special-casing SG_IO in any way. device-mapper neither looks at the ioctl `cmd` value nor at any arguments, and has only the Unix error code to examine when the ioctl returns. The device mapper layer has access to *less* information than the user space process that issued the ioctl. Adding hooks to the sg driver wouldn't buy us anything in this situation. If we can't change this, we can't fail paths in the SG_IO error code path, end of story. With Kevin's patch 1/2 applied, it would in principle be feasible to special-case SG_IO, handle it in the dm-multipath, retrieve the blk_status_t somehow, and possibly initiate path failover. This way we'd at least keep the generic dm layer clean of SCSI specific code. 
But still, the end result would look very similar to my attempt from 2021 and would therefore lead us nowhere, probably. I'm still not too fond of DM_MPATH_PROBE_PATHS_CMD, but I can't offer a better solution at this time. If the side issues are fixed, it will be an improvement over the current upstream situation, where we can do no path failover at all. In the long term, we should evaluate alternatives. If my conjecture in my previous post is correct that we need only PRIN/PROUT commands, there might be a better solution than scsi-block for our customers. Using regular block IO should actually also improve performance. Regards Martin ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-14 21:21 ` Martin Wilck @ 2025-05-15 10:11 ` Kevin Wolf 2025-05-15 11:09 ` Paolo Bonzini ` (2 more replies) 0 siblings, 3 replies; 52+ messages in thread From: Kevin Wolf @ 2025-05-15 10:11 UTC (permalink / raw) To: Martin Wilck Cc: Christoph Hellwig, Benjamin Marzinski, dm-devel, hreitz, mpatocka, snitzer, linux-kernel Am 14.05.2025 um 23:21 hat Martin Wilck geschrieben: > On Tue, 2025-05-13 at 10:00 +0200, Martin Wilck wrote: > > > If you think it does, is there another reason why you didn't try > > > this > > > before? > > > > It didn't occur to me back then that we could fail paths without > > retrying in the kernel. > > > > Perhaps we could have the sg driver pass the blk_status_t (which is > > available on the sg level) to device mapper somehow in the sg_io_hdr > > structure? That way we could entirely avoid the layering violation > > between SCSI and dm. Not sure if that would be acceptible to > > Christoph, > > as blk_status_t is supposed to be exclusive to the kernel. Can we > > find > > a way to make sure it's passed to DM, but not to user space? > > I have to correct myself. I was confused by my old patches which > contain special casing for SG_IO. The current upstream code does of > course not support special-casing SG_IO in any way. device-mapper > neither looks at the ioctl `cmd` value nor at any arguments, and has > only the Unix error code to examine when the ioctl returns. The device > mapper layer has access to *less* information than the user space > process that issued the ioctl. Adding hooks to the sg driver wouldn't > buy us anything in this situation. > > If we can't change this, we can't fail paths in the SG_IO error code > path, end of story. Yes, as long as we can't look at the sg_io_hdr, there is no way to figure out if we got a path error. 
> With Kevin's patch 1/2 applied, it would in principle be feasible to > special-case SG_IO, handle it in the dm-multipath, retrieve the > blk_status_t somehow, and possibly initiate path failover. This way > we'd at least keep the generic dm layer clean of SCSI specific code. > But still, the end result would look very similar attempt from 2021 and > would therefore lead us nowhere, probably. Right, that was my impression, too. The interfaces could be made to look a bit different, and we could return -EAGAIN to userspace instead of retrying immediately (not that it makes sense to me, but if that were really the issue, fine with me), but the core logic with copying the sg_io_hdr, calling sg_io() directly and then inspecting the status and possibly failing paths would have to be pretty much the same as you had. > I'm still not too fond of DM_MPATH_PROBE_PATHS_CMD, but I can't offer a > better solution at this time. If the side issues are fixed, it will be > an improvement over the current upstream, situation where we can do no > path failover at all. Yes, I agree we should focus on improving what we have, rather than trying to find another radically different approach that none of us have thought of before. > In the long term, we should evaluate alternatives. If my conjecture in > my previous post is correct we need only PRIN/PROUT commands, there > might be a better solution than scsi-block for our customers. Using > regular block IO should actually also improved performance. If you're talking about SG_IO in dm-mpath, then PRIN/PROUT commands are actually the one thing that we don't need. libmpathpersist sends the commands to the individual path devices, so dm-mpath will never see those. It's mostly about getting the full results on the SCSI level for normal I/O commands.
There has actually been a patch series on qemu-devel last year (that I haven't found the time to review properly yet) that would add explicit persistent reservation operations to QEMU's block layer that could then be used with the emulated scsi-hd device. On the backend, it only implemented it for iscsi, but I suppose we could implement it for file-posix, too (using the same libmpathpersist code as for passthrough). If that works, maybe at least some users can move away from SCSI passthrough. The thing that we need to make sure, though, is that the emulated status we can expose to the guest is actually good enough. That Paolo said that the problem with reservation conflicts was mostly because -EBADE wasn't a thing yet gives me some hope that at least this wouldn't be a problem any more today. We would still lose other parts of the SCSI status, so I'm still a bit cautious here with making a prediction for how many users could eventually (I expect years) use the emulated device instead and how many would keep using passthrough even in the long term. Kevin ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-15 10:11 ` Kevin Wolf @ 2025-05-15 11:09 ` Paolo Bonzini 2025-05-15 15:18 ` Martin Wilck 2025-05-15 15:05 ` Martin Wilck 2025-05-16 6:00 ` Christoph Hellwig 2 siblings, 1 reply; 52+ messages in thread From: Paolo Bonzini @ 2025-05-15 11:09 UTC (permalink / raw) To: Kevin Wolf, Martin Wilck Cc: Christoph Hellwig, Benjamin Marzinski, dm-devel, hreitz, mpatocka, snitzer, linux-kernel On 5/15/25 12:11, Kevin Wolf wrote: > The thing that we need to make sure, though, is that the emulated status > we can expose to the guest is actually good enough. That Paolo said that > the problem with reservation conflicts was mostly because -EBADE wasn't > a thing yet gives me some hope that at least this wouldn't be a problem > any more today. Hmm I'm not sure about that anymore... I was confused because some of the refactoring of block errors was done in 2017, which is around the same time I was working on persistent reservations in QEMU. However, EBADE handling dates back to 2011 (commit 63583cca745f, "[SCSI] Add detailed SCSI I/O errors", 2011-02-12) and yet the Windows tests for PR were failing before QEMU switched to SG_IO for reads and writes. I guess I have to try reverting that and retest, though. Paolo ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-15 11:09 ` Paolo Bonzini @ 2025-05-15 15:18 ` Martin Wilck 0 siblings, 0 replies; 52+ messages in thread From: Martin Wilck @ 2025-05-15 15:18 UTC (permalink / raw) To: Paolo Bonzini, Kevin Wolf Cc: Christoph Hellwig, Benjamin Marzinski, dm-devel, hreitz, mpatocka, snitzer, linux-kernel On Thu, 2025-05-15 at 13:09 +0200, Paolo Bonzini wrote: > However, EBADE handling dates back to 2011 (commit 63583cca745f, > "[SCSI] > Add detailed SCSI I/O errors", 2011-02-12) and yet the Windows tests > for > PR were failing before QEMU switched to SG_IO for reads and writes. > I > guess I have to try reverting that and retest, though. Thanks! This makes me realize that we could summarize the goal for future efforts (independent of the current patch set) roughly like this: "Emulate a SCSI disk on top of a (host) multipath device in a way that 1) failover works properly (like it would work for regular IO from the host itself), 2) Windows tests for PR (plus test case X, Y, ...) can be run successfully". Does this make sense? It implies that PR commands don't just need to be forwarded appropriately, we also need to pass meaningful error codes back to the guest. Martin ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-15 10:11 ` Kevin Wolf 2025-05-15 11:09 ` Paolo Bonzini @ 2025-05-15 15:05 ` Martin Wilck 2025-05-16 6:00 ` Christoph Hellwig 2 siblings, 0 replies; 52+ messages in thread From: Martin Wilck @ 2025-05-15 15:05 UTC (permalink / raw) To: Kevin Wolf Cc: Christoph Hellwig, Benjamin Marzinski, dm-devel, hreitz, mpatocka, snitzer, linux-kernel On Thu, 2025-05-15 at 12:11 +0200, Kevin Wolf wrote: > Am 14.05.2025 um 23:21 hat Martin Wilck geschrieben: > > > In the long term, we should evaluate alternatives. If my conjecture > > in > > my previous post is correct we need only PRIN/PROUT commands, there > > might be a better solution than scsi-block for our customers. Using > > regular block IO should actually also improved performance. > > If you're talking about SG_IO in dm-mpath, then PRIN/PROUT commands > are > actually the one thing that we don't need. libmpathpersist sends > the > commands to the individual path devices, so dm-mpath will never see > those. It's mostly about getting the full results on the SCSI level > for > normal I/O commands. I know. I meant "need" from the PoV of the guest, in the sense "which commands need to be passed from the guest to the device (in some reasonable way) except the normal READ and WRITE commands?". Regards Martin ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-15 10:11 ` Kevin Wolf 2025-05-15 11:09 ` Paolo Bonzini 2025-05-15 15:05 ` Martin Wilck @ 2025-05-16 6:00 ` Christoph Hellwig 2025-05-16 16:06 ` Benjamin Marzinski ` (2 more replies) 2 siblings, 3 replies; 52+ messages in thread From: Christoph Hellwig @ 2025-05-16 6:00 UTC (permalink / raw) To: Kevin Wolf Cc: Martin Wilck, Christoph Hellwig, Benjamin Marzinski, dm-devel, hreitz, mpatocka, snitzer, linux-kernel On Thu, May 15, 2025 at 12:11:49PM +0200, Kevin Wolf wrote: > If you're talking about SG_IO in dm-mpath, then PRIN/PROUT commands are > actually the one thing that we don't need. libmpathpersist sends the > commands to the individual path devices, so dm-mpath will never see > those. It's mostly about getting the full results on the SCSI level for > normal I/O commands. > > There has actually been a patch series on qemu-devel last year (that I > haven't found the time to review properly yet) that would add explicit > persistent reservation operations to QEMU's block layer that could then > be used with the emulated scsi-hd device. On the backend, it only > implemented it for iscsi, but I suppose we could implement it for > file-posix, too (using the same libmpathpersist code as for > passthrough). If that works, maybe at least some users can move away > from SCSI passthrough. Please call into the kernel PR code instead of hacking up more of this, which will just run into problems again. > The thing that we need to make sure, though, is that the emulated status > we can expose to the guest is actually good enough. That Paolo said that > the problem with reservation conflicts was mostly because -EBADE wasn't > a thing yet gives me some hope that at least this wouldn't be a problem > any more today. And if you need more information we can add that to the kernel PR API. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-16 6:00 ` Christoph Hellwig @ 2025-05-16 16:06 ` Benjamin Marzinski 2025-05-19 5:32 ` Christoph Hellwig 2025-05-19 10:06 ` Kevin Wolf 2025-05-19 17:33 ` Martin Wilck 2 siblings, 1 reply; 52+ messages in thread From: Benjamin Marzinski @ 2025-05-16 16:06 UTC (permalink / raw) To: Christoph Hellwig Cc: Kevin Wolf, Martin Wilck, dm-devel, hreitz, mpatocka, snitzer, linux-kernel On Thu, May 15, 2025 at 11:00:17PM -0700, Christoph Hellwig wrote: > On Thu, May 15, 2025 at 12:11:49PM +0200, Kevin Wolf wrote: > > If you're talking about SG_IO in dm-mpath, then PRIN/PROUT commands are > > actually the one thing that we don't need. libmpathpersist sends the > > commands to the individual path devices, so dm-mpath will never see > > those. It's mostly about getting the full results on the SCSI level for > > normal I/O commands. > > > > There has actually been a patch series on qemu-devel last year (that I > > haven't found the time to review properly yet) that would add explicit > > persistent reservation operations to QEMU's block layer that could then > > be used with the emulated scsi-hd device. On the backend, it only > > implemented it for iscsi, but I suppose we could implement it for > > file-posix, too (using the same libmpathpersist code as for > > passthrough). If that works, maybe at least some users can move away > > from SCSI passthrough. > > Please call into the kernel PR code instead of hacking up more of > this, which will just run into problems again. > > > The thing that we need to make sure, though, is that the emulated status > > we can expose to the guest is actually good enough. That Paolo said that > > the problem with reservation conflicts was mostly because -EBADE wasn't > > a thing yet gives me some hope that at least this wouldn't be a problem > > any more today. > > And if you need more information we can add that to the kernel PR API. 
I've run into SCSI arrays that always act like they were called with the ALL_TG_PT flag set (or perhaps they were just configured that way, I never could get a straight answer about that). libmpathpersist accepts this flag, and only sends one registration per initiator & target device when it's set, instead of one per initiator & target port. dm-multipath currently doesn't have the knowledge necessary to figure out which devices it needs to send reservations to, even if the kernel PR API supported this. But I suppose it could just send the registration normally down one path and if that works, it could just ignore the existing key when it sends the registration down all the other paths. -Ben ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-16 16:06 ` Benjamin Marzinski @ 2025-05-19 5:32 ` Christoph Hellwig 2025-05-19 18:24 ` Benjamin Marzinski 2025-05-28 20:44 ` Martin Wilck 0 siblings, 2 replies; 52+ messages in thread From: Christoph Hellwig @ 2025-05-19 5:32 UTC (permalink / raw) To: Benjamin Marzinski Cc: Christoph Hellwig, Kevin Wolf, Martin Wilck, dm-devel, hreitz, mpatocka, snitzer, linux-kernel On Fri, May 16, 2025 at 12:06:21PM -0400, Benjamin Marzinski wrote: > I've run into SCSI arrays that always act like they were called with the > ALL_TG_PT flag set (or perhaps they were just configured that way, I > never could get a straight answer about that). Hmm, that's pretty strange. > libmpathpersist accepts > this flag, and only sends one registration per initiator & target device > when it's set, instead of one per initator & target port. With "this flag" you mean ALL_TG_PT? > dm-multipath > currently doesn't have the knowledge necessary to figure out which > devices it needs to send reservations to, even if the kernel PR API > supported this. > > But I supposed it could just send the registration normally down one > path and if that works, it could just ignore the existing key when it > sends the registration down all the other paths. We could add support for a similar flag to the kernel PR API. The problem is to figure out how to discover support for it. What does the library do for that currently? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-19 5:32 ` Christoph Hellwig @ 2025-05-19 18:24 ` Benjamin Marzinski 0 siblings, 0 replies; 52+ messages in thread From: Benjamin Marzinski @ 2025-05-19 18:24 UTC (permalink / raw) To: Christoph Hellwig Cc: Kevin Wolf, Martin Wilck, dm-devel, hreitz, mpatocka, snitzer, linux-kernel On Sun, May 18, 2025 at 10:32:17PM -0700, Christoph Hellwig wrote: > On Fri, May 16, 2025 at 12:06:21PM -0400, Benjamin Marzinski wrote: > > I've run into SCSI arrays that always act like they were called with the > > ALL_TG_PT flag set (or perhaps they were just configured that way, I > > never could get a straight answer about that). > > Hmm, that's pretty strange. > > > libmpathpersist accepts > > this flag, and only sends one registration per initiator & target device > > when it's set, instead of one per initator & target port. > > With "this flag" you mean ALL_TG_PT? Yes. > > > dm-multipath > > currently doesn't have the knowledge necessary to figure out which > > devices it needs to send reservations to, even if the kernel PR API > > supported this. > > > > But I supposed it could just send the registration normally down one > > path and if that works, it could just ignore the existing key when it > > sends the registration down all the other paths. > > We could add support for a similar flag to the kernel PR API. The > problem is to figure out how to discover support for it. What does > the library do for that currently? > libmpathpersist knows which scsi path devices map to the same initiator, but different target ports. When ALL_TG_PT is set, it only sends one registration command to each group, instead of sending one to each path device (like when ALL_TG_PT isn't set).
If it sent a registration to every SCSI device on these arrays that act like ALL_TG_PT is always set, it would get reservation conflicts when sending the command to devices using ports after the first, since they would already have a registered key. That's why I mentioned the possibility of using Register and Ignore, which doesn't care if there's a key already registered. I don't think it would take much work for the kernel API to support ALL_TG_PT, if the SCSI implementation just passed the flag down to the device. The problem really only exists for things like dm-multipath, which are composed of a number of different devices that all point to the same underlying target LUN. To handle this without relying on something like Register and Ignore, it would need to be able to group all the devices that come from the same initiator but different target ports together, and only send a registration to one device per group. This takes more SCSI-specific knowledge than it currently has, or really wants to have. -Ben ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-19 5:32 ` Christoph Hellwig 2025-05-19 18:24 ` Benjamin Marzinski @ 2025-05-28 20:44 ` Martin Wilck 1 sibling, 0 replies; 52+ messages in thread From: Martin Wilck @ 2025-05-28 20:44 UTC (permalink / raw) To: Christoph Hellwig, Benjamin Marzinski Cc: Kevin Wolf, dm-devel, hreitz, mpatocka, snitzer, linux-kernel, Hannes Reinecke On Sun, 2025-05-18 at 22:32 -0700, Christoph Hellwig wrote: > On Fri, May 16, 2025 at 12:06:21PM -0400, Benjamin Marzinski wrote: > > I've run into SCSI arrays that always act like they were called > > with the > > ALL_TG_PT flag set (or perhaps they were just configured that way, > > I > > never could get a straight answer about that). > > Hmm, that's pretty strange. > > > libmpathpersist accepts > > this flag, and only sends one registration per initiator & target > > device > > when it's set, instead of one per initator & target port. > > With "this flag" you mean ALL_TG_PT? multipath-tools allows setting the flag "all_tg_pt" in multipath.conf for certain storage devices or multipath maps. If this flag is set, mpathpersist will always act as if the --param-alltgpt flag was given. This means that the REGISTER commands are sent once per initiator port (SCSI host) only, instead of once per I_T nexus. By default, the all_tg_pt flag is not set for any device, not even for EMC VNX, where Ben observed the described behavior. It seems that devices that behave like this are a rare exception. > > dm-multipath > > currently doesn't have the knowledge necessary to figure out which > > devices it needs to send reservations to, even if the kernel PR API > > supported this. > > > > But I supposed it could just send the registration normally down > > one > > path and if that works, it could just ignore the existing key when > > it > > sends the registration down all the other paths. > > We could add support for a similar flag to the kernel PR API. 
The > problem is to figure out how to discover support for it. What does > the library do for that currently? dm-multipath has no knowledge about target or initiator ports, and I fail to see how we could provide this knowledge to it without adding yet another layering violation. Just sending the command once will not be sufficient if there are multiple local ports. Regards Martin ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-16 6:00 ` Christoph Hellwig 2025-05-16 16:06 ` Benjamin Marzinski @ 2025-05-19 10:06 ` Kevin Wolf 2025-05-19 17:33 ` Martin Wilck 2 siblings, 0 replies; 52+ messages in thread From: Kevin Wolf @ 2025-05-19 10:06 UTC (permalink / raw) To: Christoph Hellwig Cc: Martin Wilck, Benjamin Marzinski, dm-devel, hreitz, mpatocka, snitzer, linux-kernel Am 16.05.2025 um 08:00 hat Christoph Hellwig geschrieben: > On Thu, May 15, 2025 at 12:11:49PM +0200, Kevin Wolf wrote: > > If you're talking about SG_IO in dm-mpath, then PRIN/PROUT commands are > > actually the one thing that we don't need. libmpathpersist sends the > > commands to the individual path devices, so dm-mpath will never see > > those. It's mostly about getting the full results on the SCSI level for > > normal I/O commands. > > > > There has actually been a patch series on qemu-devel last year (that I > > haven't found the time to review properly yet) that would add explicit > > persistent reservation operations to QEMU's block layer that could then > > be used with the emulated scsi-hd device. On the backend, it only > > implemented it for iscsi, but I suppose we could implement it for > > file-posix, too (using the same libmpathpersist code as for > > passthrough). If that works, maybe at least some users can move away > > from SCSI passthrough. > > Please call into the kernel PR code instead of hacking up more of > this, which will just run into problems again. I agree that using kernel code is preferable to doing things behind the kernel's back. However, libmpathpersist is the official interface for doing these things with multipath devices, so I think the necessary work to make this happen should primarily be done in the library (and possibly the kernel if the existing interfaces aren't good enough). QEMU could directly call the kernel when qemu-pr-helper isn't in use. 
I don't know enough about how libmpathpersist works internally to tell if running it this way would be a good idea for multipath devices. Can multipathd still do its job with reservations being made behind its back? (It would probably be good to allow this eventually, but is it the case today?) Kevin ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-16 6:00 ` Christoph Hellwig 2025-05-16 16:06 ` Benjamin Marzinski 2025-05-19 10:06 ` Kevin Wolf @ 2025-05-19 17:33 ` Martin Wilck 2025-05-20 13:46 ` Christoph Hellwig 2 siblings, 1 reply; 52+ messages in thread From: Martin Wilck @ 2025-05-19 17:33 UTC (permalink / raw) To: Christoph Hellwig, Kevin Wolf Cc: Benjamin Marzinski, dm-devel, hreitz, mpatocka, snitzer, linux-kernel On Thu, 2025-05-15 at 23:00 -0700, Christoph Hellwig wrote: > On Thu, May 15, 2025 at 12:11:49PM +0200, Kevin Wolf wrote: > > If you're talking about SG_IO in dm-mpath, then PRIN/PROUT commands > > are > > actually the one thing that we don't need. libmpathpersist sends > > the > > commands to the individual path devices, so dm-mpath will never see > > those. It's mostly about getting the full results on the SCSI level > > for > > normal I/O commands. > > > > There has actually been a patch series on qemu-devel last year > > (that I > > haven't found the time to review properly yet) that would add > > explicit > > persistent reservation operations to QEMU's block layer that could > > then > > be used with the emulated scsi-hd device. On the backend, it only > > implemented it for iscsi, but I suppose we could implement it for > > file-posix, too (using the same libmpathpersist code as for > > passthrough). If that works, maybe at least some users can move > > away > > from SCSI passthrough. > x > Please call into the kernel PR code instead of hacking up more of > this, which will just run into problems again. I still don't get what this buys us. The guest (which might be Windows or whatever else) sends SCSI reservation commands. qemu will have to intercept these anyway, unless the backend device is a plain SCSI device (in which case transformation into generic PR command would be a strange thing to do). 
If the backend device is multipath on SCSI, qemu-pr-helper would take the most appropriate action and return the most appropriate result code. The dm-multipath layer can't do it as well, as it doesn't have the full information about the target that's available in user space (see Ben's note about ALL_TG_PT for example). I don't see any benefit from using a generic reservation on dm-mpath instead of using qemu-pr-helper for this important use case. I also don't see why this way of handling SCSI PR commands would be dangerous. You are of course right to say that passthrough of other SCSI commands (except regular IO and PR) is highly dangerous, but in the concept that Kevin described that wouldn't happen. Transforming the SCSI reservation commands into generic reservation commands makes sense for _other_ types of backend devices. NVMe comes to mind, but (for real-world applications) not much else. (But does it make sense to present NVMe devices to guests as SCSI devices?). Regards Martin ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths 2025-05-19 17:33 ` Martin Wilck @ 2025-05-20 13:46 ` Christoph Hellwig 0 siblings, 0 replies; 52+ messages in thread From: Christoph Hellwig @ 2025-05-20 13:46 UTC (permalink / raw) To: Martin Wilck Cc: Christoph Hellwig, Kevin Wolf, Benjamin Marzinski, dm-devel, hreitz, mpatocka, snitzer, linux-kernel On Mon, May 19, 2025 at 07:33:50PM +0200, Martin Wilck wrote: > I still don't get what this buys us. You get one layer to deal with instead of poking into a leaky abstraction. Qemu sees a block device, checks if it supports persistent reservations, and everything is taken care of underneath instead of having to try to understand what is below and working around it. ^ permalink raw reply [flat|nested] 52+ messages in thread
end of thread, other threads:[~2025-05-28 20:44 UTC | newest] Thread overview: 52+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-04-29 16:50 [PATCH 0/2] dm mpath: Interface for explicit probing of active paths Kevin Wolf 2025-04-29 16:50 ` [PATCH 1/2] dm: Allow .prepare_ioctl to handle ioctls directly Kevin Wolf 2025-04-29 23:22 ` Benjamin Marzinski 2025-05-08 13:50 ` Martin Wilck 2025-04-29 16:50 ` [PATCH 2/2] dm mpath: Interface for explicit probing of active paths Kevin Wolf 2025-04-29 23:22 ` Benjamin Marzinski 2025-05-08 13:51 ` [PATCH 0/2] " Martin Wilck 2025-05-12 13:46 ` Mikulas Patocka 2025-05-13 7:06 ` Martin Wilck 2025-05-12 15:18 ` Kevin Wolf 2025-05-13 5:55 ` Christoph Hellwig 2025-05-13 6:09 ` Hannes Reinecke 2025-05-13 6:14 ` Christoph Hellwig 2025-05-13 6:32 ` Hannes Reinecke 2025-05-13 6:49 ` Christoph Hellwig 2025-05-13 8:17 ` Martin Wilck 2025-05-14 4:53 ` Christoph Hellwig 2025-05-15 11:14 ` Paolo Bonzini 2025-05-13 16:29 ` Benjamin Marzinski 2025-05-14 4:56 ` Christoph Hellwig 2025-05-14 6:39 ` Hannes Reinecke 2025-05-14 16:01 ` Benjamin Marzinski 2025-05-16 5:52 ` Christoph Hellwig 2025-05-13 9:29 ` Kevin Wolf 2025-05-13 15:43 ` Paolo Bonzini 2025-05-14 4:57 ` Christoph Hellwig 2025-05-14 16:23 ` Benjamin Marzinski 2025-05-14 17:37 ` Martin Wilck 2025-05-15 2:53 ` Paolo Bonzini 2025-05-15 10:34 ` Martin Wilck 2025-05-15 10:51 ` Paolo Bonzini 2025-05-15 14:50 ` Martin Wilck 2025-05-15 14:29 ` Benjamin Marzinski 2025-05-15 15:00 ` Martin Wilck 2025-05-16 5:57 ` Christoph Hellwig 2025-05-13 6:30 ` Hannes Reinecke 2025-05-13 18:09 ` Benjamin Marzinski 2025-05-13 8:00 ` Martin Wilck 2025-05-13 10:06 ` Martin Wilck 2025-05-14 21:21 ` Martin Wilck 2025-05-15 10:11 ` Kevin Wolf 2025-05-15 11:09 ` Paolo Bonzini 2025-05-15 15:18 ` Martin Wilck 2025-05-15 15:05 ` Martin Wilck 2025-05-16 6:00 ` Christoph Hellwig 2025-05-16 16:06 ` Benjamin Marzinski 2025-05-19 5:32 ` Christoph Hellwig 2025-05-19 
18:24 ` Benjamin Marzinski 2025-05-28 20:44 ` Martin Wilck 2025-05-19 10:06 ` Kevin Wolf 2025-05-19 17:33 ` Martin Wilck 2025-05-20 13:46 ` Christoph Hellwig
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).