* [PATCH ndctl 0/2] daxctl: Fix error handling and propagation in daxctl/device.c
@ 2024-04-12 21:05 Vishal Verma
2024-04-12 21:05 ` [PATCH ndctl 1/2] daxctl/device.c: Handle special case of destroying daxX.0 Vishal Verma
2024-04-12 21:05 ` [PATCH ndctl 2/2] daxctl/device.c: Fix error propagation in do_xaction_device() Vishal Verma
0 siblings, 2 replies; 8+ messages in thread
From: Vishal Verma @ 2024-04-12 21:05 UTC (permalink / raw)
To: nvdimm, linux-cxl
Cc: Dan Williams, Alison Schofield, Dave Jiang, Ira Weiny,
Vishal Verma
An intermittently failing daxctl-create.sh test revealed two things:
1/ The kernel handles the 0th DAX device under a region specially, and
refuses to delete it, returning an EBUSY. The daxctl-destroy-device
failed to account for this, even if other aspects of the destroy
operation succeeded (i.e. setting the size to zero). Patch 1 fixes this
by expecting the EBUSY on the 0th device, and not failing for it.
2/ When looping over multiple DAX devices, do_xaction_device() just
returned the status from the action on the last device. Since this order
can be effectively random, so would be the status returned. Patch 2
makes this behavior more consistent by saving any non-zero status from
the device iterations, and returning that instead of the last action's
status.
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
Vishal Verma (2):
daxctl/device.c: Handle special case of destroying daxX.0
daxctl/device.c: Fix error propagation in do_xaction_device()
daxctl/device.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
---
base-commit: e0d0680bd3e554bd5f211e989480c5a13a023b2d
change-id: 20240412-vv-daxctl-fixes-bd7992ea229d
Best regards,
--
Vishal Verma <vishal.l.verma@intel.com>
^ permalink raw reply [flat|nested] 8+ messages in thread* [PATCH ndctl 1/2] daxctl/device.c: Handle special case of destroying daxX.0 2024-04-12 21:05 [PATCH ndctl 0/2] daxctl: Fix error handling and propagation in daxctl/device.c Vishal Verma @ 2024-04-12 21:05 ` Vishal Verma 2024-04-12 21:09 ` Dan Williams 2024-04-15 16:17 ` Dave Jiang 2024-04-12 21:05 ` [PATCH ndctl 2/2] daxctl/device.c: Fix error propagation in do_xaction_device() Vishal Verma 1 sibling, 2 replies; 8+ messages in thread From: Vishal Verma @ 2024-04-12 21:05 UTC (permalink / raw) To: nvdimm, linux-cxl Cc: Dan Williams, Alison Schofield, Dave Jiang, Ira Weiny, Vishal Verma The kernel has special handling for destroying the 0th dax device under any given DAX region (daxX.0). It ensures the size is set to 0, but doesn't actually remove the device, instead it returns an EBUSY, indicating that this device cannot be removed. Add an expectation in daxctl's dev_destroy() helper to handle this case instead of returning the error - as far as the user is concerned, the size has been set to zero, and the destroy operation has been completed, even if the kernel indicated an EBUSY. Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Reported-by: Ira Weiny <ira.weiny@intel.com> Reported-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> --- daxctl/device.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/daxctl/device.c b/daxctl/device.c index 83913430..83c61389 100644 --- a/daxctl/device.c +++ b/daxctl/device.c @@ -675,6 +675,13 @@ static int dev_destroy(struct daxctl_dev *dev) return rc; rc = daxctl_region_destroy_dev(daxctl_dev_get_region(dev), dev); + /* + * The kernel treats daxX.0 specially. It can't be deleted to ensure + * there is always a /sys/bus/dax/ present. If this happens, an + * EBUSY is returned. Expect it and don't treat it as an error. + */ + if (daxctl_dev_get_id(dev) == 0 && rc == -EBUSY) + return 0; if (rc < 0) return rc; -- 2.44.0 ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH ndctl 1/2] daxctl/device.c: Handle special case of destroying daxX.0 2024-04-12 21:05 ` [PATCH ndctl 1/2] daxctl/device.c: Handle special case of destroying daxX.0 Vishal Verma @ 2024-04-12 21:09 ` Dan Williams 2024-04-15 16:17 ` Dave Jiang 1 sibling, 0 replies; 8+ messages in thread From: Dan Williams @ 2024-04-12 21:09 UTC (permalink / raw) To: Vishal Verma, nvdimm, linux-cxl Cc: Dan Williams, Alison Schofield, Dave Jiang, Ira Weiny, Vishal Verma Vishal Verma wrote: > The kernel has special handling for destroying the 0th dax device under > any given DAX region (daxX.0). It ensures the size is set to 0, but > doesn't actually remove the device, instead it returns an EBUSY, > indicating that this device cannot be removed. > > Add an expectation in daxctl's dev_destroy() helper to handle this case > instead of returning the error - as far as the user is concerned, the > size has been set to zero, and the destroy operation has been completed, > even if the kernel indicated an EBUSY. > > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Alison Schofield <alison.schofield@intel.com> > Reported-by: Ira Weiny <ira.weiny@intel.com> > Reported-by: Dave Jiang <dave.jiang@intel.com> > Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Looks good, Reviewed-by: Dan Williams <dan.j.williams@intel.com> ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH ndctl 1/2] daxctl/device.c: Handle special case of destroying daxX.0 2024-04-12 21:05 ` [PATCH ndctl 1/2] daxctl/device.c: Handle special case of destroying daxX.0 Vishal Verma 2024-04-12 21:09 ` Dan Williams @ 2024-04-15 16:17 ` Dave Jiang 1 sibling, 0 replies; 8+ messages in thread From: Dave Jiang @ 2024-04-15 16:17 UTC (permalink / raw) To: Vishal Verma, nvdimm, linux-cxl; +Cc: Dan Williams, Alison Schofield, Ira Weiny On 4/12/24 2:05 PM, Vishal Verma wrote: > The kernel has special handling for destroying the 0th dax device under > any given DAX region (daxX.0). It ensures the size is set to 0, but > doesn't actually remove the device, instead it returns an EBUSY, > indicating that this device cannot be removed. > > Add an expectation in daxctl's dev_destroy() helper to handle this case > instead of returning the error - as far as the user is concerned, the > size has been set to zero, and the destroy operation has been completed, > even if the kernel indicated an EBUSY. > > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Alison Schofield <alison.schofield@intel.com> > Reported-by: Ira Weiny <ira.weiny@intel.com> > Reported-by: Dave Jiang <dave.jiang@intel.com> > Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> > --- > daxctl/device.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/daxctl/device.c b/daxctl/device.c > index 83913430..83c61389 100644 > --- a/daxctl/device.c > +++ b/daxctl/device.c > @@ -675,6 +675,13 @@ static int dev_destroy(struct daxctl_dev *dev) > return rc; > > rc = daxctl_region_destroy_dev(daxctl_dev_get_region(dev), dev); > + /* > + * The kernel treats daxX.0 specially. It can't be deleted to ensure > + * there is always a /sys/bus/dax/ present. If this happens, an > + * EBUSY is returned. Expect it and don't treat it as an error. > + */ > + if (daxctl_dev_get_id(dev) == 0 && rc == -EBUSY) > + return 0; > if (rc < 0) > return rc; > > ^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH ndctl 2/2] daxctl/device.c: Fix error propagation in do_xaction_device() 2024-04-12 21:05 [PATCH ndctl 0/2] daxctl: Fix error handling and propagation in daxctl/device.c Vishal Verma 2024-04-12 21:05 ` [PATCH ndctl 1/2] daxctl/device.c: Handle special case of destroying daxX.0 Vishal Verma @ 2024-04-12 21:05 ` Vishal Verma 2024-04-12 21:30 ` Dan Williams 2024-04-15 16:19 ` Dave Jiang 1 sibling, 2 replies; 8+ messages in thread From: Vishal Verma @ 2024-04-12 21:05 UTC (permalink / raw) To: nvdimm, linux-cxl Cc: Dan Williams, Alison Schofield, Dave Jiang, Ira Weiny, Vishal Verma The loop through the provided list of devices in do_xaction_device() returns the status based on whatever the last device did. Since the order of processing devices, especially in cases like the 'all' keyword, can be effectively random, this can lead to the same command, and same effects, exiting with a different error code based on device ordering. This was noticed with flakiness in the daxctl-create.sh unit test. Its 'destroy-device all' command would either pass or fail based on the order it tried to destroy devices in. (Recall that until now, destroying a daxX.0 device would result in a failure). Make this slightly more consistent by saving a failed status in do_xaction_device if any iteration of the loop produces a failure. Return this saved status instead of returning the status of the last device processed. Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> --- daxctl/device.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/daxctl/device.c b/daxctl/device.c index 83c61389..14d62148 100644 --- a/daxctl/device.c +++ b/daxctl/device.c @@ -1012,7 +1012,7 @@ static int do_xaction_device(const char *device, enum device_action action, struct json_object *jdevs = NULL; struct daxctl_region *region; struct daxctl_dev *dev; - int rc = -ENXIO; + int rc = -ENXIO, saved_rc = 0; *processed = 0; @@ -1059,6 +1059,8 @@ static int do_xaction_device(const char *device, enum device_action action, rc = -EINVAL; break; } + if (rc) + saved_rc = rc; } } @@ -1070,7 +1072,7 @@ static int do_xaction_device(const char *device, enum device_action action, if (jdevs) util_display_json_array(stdout, jdevs, flags); - return rc; + return saved_rc; } int cmd_create_device(int argc, const char **argv, struct daxctl_ctx *ctx) -- 2.44.0 ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH ndctl 2/2] daxctl/device.c: Fix error propagation in do_xaction_device() 2024-04-12 21:05 ` [PATCH ndctl 2/2] daxctl/device.c: Fix error propagation in do_xaction_device() Vishal Verma @ 2024-04-12 21:30 ` Dan Williams 2024-04-12 21:42 ` Verma, Vishal L 2024-04-15 16:19 ` Dave Jiang 1 sibling, 1 reply; 8+ messages in thread From: Dan Williams @ 2024-04-12 21:30 UTC (permalink / raw) To: Vishal Verma, nvdimm, linux-cxl Cc: Dan Williams, Alison Schofield, Dave Jiang, Ira Weiny, Vishal Verma Vishal Verma wrote: > The loop through the provided list of devices in do_xaction_device() > returns the status based on whatever the last device did. Since the > order of processing devices, especially in cases like the 'all' keyword, > can be effectively random, this can lead to the same command, and same > effects, exiting with a different error code based on device ordering. > > This was noticed with flakiness in the daxctl-create.sh unit test. Its > 'destroy-device all' command would either pass or fail based on the > order it tried to destroy devices in. (Recall that until now, destroying > a daxX.0 device would result in a failure). > > Make this slightly more consistent by saving a failed status in > do_xaction_device if any iteration of the loop produces a failure. > Return this saved status instead of returning the status of the last > device processed. I think "this is the way", at least it follows what cxl/memdev.c is doing. However we have ended up with an error scheme per tool when it comes to reporting errors for multi-device operations. cxl/memdev.c: report the first error daxctl/device.c: now fixed to report the first error ndctl/namespace.c: reports last result (same daxctl/device.c issue), unless in the greedy-create case cxl/region.c: reports the last error even if that is not the last result, immune to the above bug, but why different? The struggle here is that all of these tools continue on error, so it has always been the case that the only way to get a reliable error code vs action carried out is to not use the "all" or "multi-device" ways to specify the devices to operate upon. I don't have a good answer besides, be careful when using "all". It might make sense to bring ndctl/namespace.c in line to guarantee "unless 100% of the attempts are successful the command reports failure". However, it might be too late to make that change there if it breaks people's scripts. ndctl/namespace.c does not suffer from needing to know that namspaceX.0 can not be deleted since the deletion there is exclusively done by setting namespace size to zero. I think this daxctl change has a low risk of breaking folks because the primary failure case is fixed to swallow the error. Reviewed-by: Dan Williams <dan.j.williams@intel.com> ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH ndctl 2/2] daxctl/device.c: Fix error propagation in do_xaction_device() 2024-04-12 21:30 ` Dan Williams @ 2024-04-12 21:42 ` Verma, Vishal L 0 siblings, 0 replies; 8+ messages in thread From: Verma, Vishal L @ 2024-04-12 21:42 UTC (permalink / raw) To: Williams, Dan J, linux-cxl@vger.kernel.org, nvdimm@lists.linux.dev Cc: Schofield, Alison, Jiang, Dave, Weiny, Ira On Fri, 2024-04-12 at 14:30 -0700, Dan Williams wrote: > Vishal Verma wrote: > > The loop through the provided list of devices in do_xaction_device() > > returns the status based on whatever the last device did. Since the > > order of processing devices, especially in cases like the 'all' keyword, > > can be effectively random, this can lead to the same command, and same > > effects, exiting with a different error code based on device ordering. > > > > This was noticed with flakiness in the daxctl-create.sh unit test. Its > > 'destroy-device all' command would either pass or fail based on the > > order it tried to destroy devices in. (Recall that until now, destroying > > a daxX.0 device would result in a failure). > > > > Make this slightly more consistent by saving a failed status in > > do_xaction_device if any iteration of the loop produces a failure. > > Return this saved status instead of returning the status of the last > > device processed. > > I think "this is the way", at least it follows what cxl/memdev.c is > doing. However we have ended up with an error scheme per tool when it > comes to reporting errors for multi-device operations. > > cxl/memdev.c: report the first error > > daxctl/device.c: now fixed to report the first error With this patch it's actually reporting the last error, not first. > > ndctl/namespace.c: reports last result (same daxctl/device.c issue), unless in the greedy-create case > > cxl/region.c: reports the last error even if that is not the last > result, immune to the above bug, but why different? I guess I always preferred last error instead of first, but happy to change these to last to match cxl/memdev.c. I can see how first error is probably slightly better (we'd normally want to know about the first thing that fails). > > The struggle here is that all of these tools continue on error, so it > has always been the case that the only way to get a reliable error code > vs action carried out is to not use the "all" or "multi-device" ways to > specify the devices to operate upon. > > I don't have a good answer besides, be careful when using "all". > > It might make sense to bring ndctl/namespace.c in line to guarantee > "unless 100% of the attempts are successful the command reports > failure". However, it might be too late to make that change there if it > breaks people's scripts. ndctl/namespace.c does not suffer from needing > to know that namspaceX.0 can not be deleted since the deletion there is > exclusively done by setting namespace size to zero. Agreed, I'm onboard with holding off changes to ndctl/namespace.c unless there's a reported problem, or we hit a test suite failure or something down the line. > > I think this daxctl change has a low risk of breaking folks because the > primary failure case is fixed to swallow the error. > > Reviewed-by: Dan Williams <dan.j.williams@intel.com> Thanks Dan! ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH ndctl 2/2] daxctl/device.c: Fix error propagation in do_xaction_device() 2024-04-12 21:05 ` [PATCH ndctl 2/2] daxctl/device.c: Fix error propagation in do_xaction_device() Vishal Verma 2024-04-12 21:30 ` Dan Williams @ 2024-04-15 16:19 ` Dave Jiang 1 sibling, 0 replies; 8+ messages in thread From: Dave Jiang @ 2024-04-15 16:19 UTC (permalink / raw) To: Vishal Verma, nvdimm, linux-cxl; +Cc: Dan Williams, Alison Schofield, Ira Weiny On 4/12/24 2:05 PM, Vishal Verma wrote: > The loop through the provided list of devices in do_xaction_device() > returns the status based on whatever the last device did. Since the > order of processing devices, especially in cases like the 'all' keyword, > can be effectively random, this can lead to the same command, and same > effects, exiting with a different error code based on device ordering. > > This was noticed with flakiness in the daxctl-create.sh unit test. Its > 'destroy-device all' command would either pass or fail based on the > order it tried to destroy devices in. (Recall that until now, destroying > a daxX.0 device would result in a failure). > > Make this slightly more consistent by saving a failed status in > do_xaction_device if any iteration of the loop produces a failure. > Return this saved status instead of returning the status of the last > device processed. > > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Alison Schofield <alison.schofield@intel.com> > Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> > --- > daxctl/device.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/daxctl/device.c b/daxctl/device.c > index 83c61389..14d62148 100644 > --- a/daxctl/device.c > +++ b/daxctl/device.c > @@ -1012,7 +1012,7 @@ static int do_xaction_device(const char *device, enum device_action action, > struct json_object *jdevs = NULL; > struct daxctl_region *region; > struct daxctl_dev *dev; > - int rc = -ENXIO; > + int rc = -ENXIO, saved_rc = 0; > > *processed = 0; > > @@ -1059,6 +1059,8 @@ static int do_xaction_device(const char *device, enum device_action action, > rc = -EINVAL; > break; > } > + if (rc) > + saved_rc = rc; > } > } > > @@ -1070,7 +1072,7 @@ static int do_xaction_device(const char *device, enum device_action action, > if (jdevs) > util_display_json_array(stdout, jdevs, flags); > > - return rc; > + return saved_rc; > } > > int cmd_create_device(int argc, const char **argv, struct daxctl_ctx *ctx) > ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2024-04-15 16:19 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-04-12 21:05 [PATCH ndctl 0/2] daxctl: Fix error handling and propagation in daxctl/device.c Vishal Verma 2024-04-12 21:05 ` [PATCH ndctl 1/2] daxctl/device.c: Handle special case of destroying daxX.0 Vishal Verma 2024-04-12 21:09 ` Dan Williams 2024-04-15 16:17 ` Dave Jiang 2024-04-12 21:05 ` [PATCH ndctl 2/2] daxctl/device.c: Fix error propagation in do_xaction_device() Vishal Verma 2024-04-12 21:30 ` Dan Williams 2024-04-12 21:42 ` Verma, Vishal L 2024-04-15 16:19 ` Dave Jiang
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox