* [PATCH 1/2] devcoredump: Remove devcoredump device if failing device is gone
@ 2024-01-26 15:11 Rodrigo Vivi
2024-01-26 15:11 ` [PATCH 2/2] devcoredump: Remove the mutex serialization Rodrigo Vivi
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Rodrigo Vivi @ 2024-01-26 15:11 UTC (permalink / raw)
To: linux-kernel
Cc: Rodrigo Vivi, Jose Souza, Maarten Lankhorst, Johannes Berg,
Greg Kroah-Hartman, Rafael J . Wysocki
Make dev_coredumpm a real device managed helper, that not only
frees the device after a scheduled delay (DEVCD_TIMEOUT), but
also when the failing/crashed device is gone.
The module remove for the drivers using devcoredump are currently
broken if attempted between the crash and the DEVCD_TIMEOUT, since
the symbolic sysfs link won't be deleted.
On top of that, for PCI devices, the unbind of the device will
call the pci .remove void function, that cannot fail. At that
time, our device is pretty much gone, but the read and free
functions are alive trough the devcoredump device and they
can get some NULL dereferences or use after free.
So, if the failing-device is gone let's also request for the
devcoredump-device removal using the same mod_delayed_work
as when writing anything through data. The flush cannot be
used since it is synchronous and the devcd would be surely
gone right before the mutex_unlock on the next line.
Cc: Jose Souza <jose.souza@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
---
drivers/base/devcoredump.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index 7e2d1f0d903a..678ecc2fa242 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -304,6 +304,19 @@ static ssize_t devcd_read_from_sgtable(char *buffer, loff_t offset,
offset);
}
+static void devcd_remove(void *data)
+{
+ struct devcd_entry *devcd = data;
+
+ mutex_lock(&devcd->mutex);
+ if (!devcd->delete_work) {
+ devcd->delete_work = true;
+ /* XXX: Cannot flush otherwise the mutex below will hit a UAF */
+ mod_delayed_work(system_wq, &devcd->del_wk, 0);
+ }
+ mutex_unlock(&devcd->mutex);
+}
+
/**
* dev_coredumpm - create device coredump with read/free methods
* @dev: the struct device for the crashed device
@@ -381,6 +394,8 @@ void dev_coredumpm(struct device *dev, struct module *owner,
kobject_uevent(&devcd->devcd_dev.kobj, KOBJ_ADD);
INIT_DELAYED_WORK(&devcd->del_wk, devcd_del);
schedule_delayed_work(&devcd->del_wk, DEVCD_TIMEOUT);
+ if (devm_add_action(dev, devcd_remove, devcd))
+ dev_warn(dev, "devcoredump managed auto-removal registration failed\n");
mutex_unlock(&devcd->mutex);
return;
put_device:
--
2.43.0
^ permalink raw reply related [flat|nested] 15+ messages in thread* [PATCH 2/2] devcoredump: Remove the mutex serialization 2024-01-26 15:11 [PATCH 1/2] devcoredump: Remove devcoredump device if failing device is gone Rodrigo Vivi @ 2024-01-26 15:11 ` Rodrigo Vivi 2024-01-29 15:50 ` Souza, Jose 2024-01-30 12:02 ` Mukesh Ojha 2024-01-29 15:50 ` [PATCH 1/2] devcoredump: Remove devcoredump device if failing device is gone Souza, Jose 2024-01-29 17:48 ` Johannes Berg 2 siblings, 2 replies; 15+ messages in thread From: Rodrigo Vivi @ 2024-01-26 15:11 UTC (permalink / raw) To: linux-kernel Cc: Rodrigo Vivi, Mukesh Ojha, Johannes Berg, Greg Kroah-Hartman, Rafael J . Wysocki, Jose Souza The commit 01daccf74832 ("devcoredump : Serialize devcd_del work") introduced the mutex to protect the case where mod_delayed_work could be called before the delayed work even existed. Instead, we can simply initialize the delayed work before the device is added, so the race condition doesn't exist at first place. The mutex_unlock is very problematic here. Although mod_delayed_work is async, we have no warranty that the work is not finished before the mutex_unlock(devcd->mutex), and if that happen 'devcd' is used after freed. Cc: Mukesh Ojha <quic_mojha@quicinc.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Jose Souza <jose.souza@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> --- drivers/base/devcoredump.c | 97 +++----------------------------------- 1 file changed, 6 insertions(+), 91 deletions(-) diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c index 678ecc2fa242..0e26b1273920 100644 --- a/drivers/base/devcoredump.c +++ b/drivers/base/devcoredump.c @@ -25,47 +25,6 @@ struct devcd_entry { struct device devcd_dev; void *data; size_t datalen; - /* - * Here, mutex is required to serialize the calls to del_wk work between - * user/kernel space which happens when devcd is added with device_add() - * and that sends uevent to user space. User space reads the uevents, - * and calls to devcd_data_write() which try to modify the work which is - * not even initialized/queued from devcoredump. - * - * - * - * cpu0(X) cpu1(Y) - * - * dev_coredump() uevent sent to user space - * device_add() ======================> user space process Y reads the - * uevents writes to devcd fd - * which results into writes to - * - * devcd_data_write() - * mod_delayed_work() - * try_to_grab_pending() - * del_timer() - * debug_assert_init() - * INIT_DELAYED_WORK() - * schedule_delayed_work() - * - * - * Also, mutex alone would not be enough to avoid scheduling of - * del_wk work after it get flush from a call to devcd_free() - * mentioned as below. - * - * disabled_store() - * devcd_free() - * mutex_lock() devcd_data_write() - * flush_delayed_work() - * mutex_unlock() - * mutex_lock() - * mod_delayed_work() - * mutex_unlock() - * So, delete_work flag is required. - */ - struct mutex mutex; - bool delete_work; struct module *owner; ssize_t (*read)(char *buffer, loff_t offset, size_t count, void *data, size_t datalen); @@ -125,13 +84,8 @@ static ssize_t devcd_data_write(struct file *filp, struct kobject *kobj, struct device *dev = kobj_to_dev(kobj); struct devcd_entry *devcd = dev_to_devcd(dev); - mutex_lock(&devcd->mutex); - if (!devcd->delete_work) { - devcd->delete_work = true; - mod_delayed_work(system_wq, &devcd->del_wk, 0); - } - mutex_unlock(&devcd->mutex); - + /* This file needs to be closed before devcd can be deleted */ + mod_delayed_work(system_wq, &devcd->del_wk, 0); return count; } @@ -158,12 +112,7 @@ static int devcd_free(struct device *dev, void *data) { struct devcd_entry *devcd = dev_to_devcd(dev); - mutex_lock(&devcd->mutex); - if (!devcd->delete_work) - devcd->delete_work = true; - flush_delayed_work(&devcd->del_wk); - mutex_unlock(&devcd->mutex); return 0; } @@ -173,30 +122,6 @@ static ssize_t disabled_show(const struct class *class, const struct class_attri return sysfs_emit(buf, "%d\n", devcd_disabled); } -/* - * - * disabled_store() worker() - * class_for_each_device(&devcd_class, - * NULL, NULL, devcd_free) - * ... - * ... - * while ((dev = class_dev_iter_next(&iter)) - * devcd_del() - * device_del() - * put_device() <- last reference - * error = fn(dev, data) devcd_dev_release() - * devcd_free(dev, data) kfree(devcd) - * mutex_lock(&devcd->mutex); - * - * - * In the above diagram, It looks like disabled_store() would be racing with parallely - * running devcd_del() and result in memory abort while acquiring devcd->mutex which - * is called after kfree of devcd memory after dropping its last reference with - * put_device(). However, this will not happens as fn(dev, data) runs - * with its own reference to device via klist_node so it is not its last reference. - * so, above situation would not occur. - */ - static ssize_t disabled_store(const struct class *class, const struct class_attribute *attr, const char *buf, size_t count) { @@ -308,13 +233,7 @@ static void devcd_remove(void *data) { struct devcd_entry *devcd = data; - mutex_lock(&devcd->mutex); - if (!devcd->delete_work) { - devcd->delete_work = true; - /* XXX: Cannot flush otherwise the mutex below will hit a UAF */ - mod_delayed_work(system_wq, &devcd->del_wk, 0); - } - mutex_unlock(&devcd->mutex); + flush_delayed_work(&devcd->del_wk); } /** @@ -365,16 +284,15 @@ void dev_coredumpm(struct device *dev, struct module *owner, devcd->read = read; devcd->free = free; devcd->failing_dev = get_device(dev); - devcd->delete_work = false; - mutex_init(&devcd->mutex); device_initialize(&devcd->devcd_dev); dev_set_name(&devcd->devcd_dev, "devcd%d", atomic_inc_return(&devcd_count)); devcd->devcd_dev.class = &devcd_class; - mutex_lock(&devcd->mutex); + INIT_DELAYED_WORK(&devcd->del_wk, devcd_del); + schedule_delayed_work(&devcd->del_wk, DEVCD_TIMEOUT); dev_set_uevent_suppress(&devcd->devcd_dev, true); if (device_add(&devcd->devcd_dev)) goto put_device; @@ -392,15 +310,12 @@ void dev_coredumpm(struct device *dev, struct module *owner, dev_set_uevent_suppress(&devcd->devcd_dev, false); kobject_uevent(&devcd->devcd_dev.kobj, KOBJ_ADD); - INIT_DELAYED_WORK(&devcd->del_wk, devcd_del); - schedule_delayed_work(&devcd->del_wk, DEVCD_TIMEOUT); if (devm_add_action(dev, devcd_remove, devcd)) dev_warn(dev, "devcoredump managed auto-removal registration failed\n"); - mutex_unlock(&devcd->mutex); return; put_device: + cancel_delayed_work(&devcd->del_wk); put_device(&devcd->devcd_dev); - mutex_unlock(&devcd->mutex); put_module: module_put(owner); free: -- 2.43.0 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] devcoredump: Remove the mutex serialization 2024-01-26 15:11 ` [PATCH 2/2] devcoredump: Remove the mutex serialization Rodrigo Vivi @ 2024-01-29 15:50 ` Souza, Jose 2024-01-30 12:02 ` Mukesh Ojha 1 sibling, 0 replies; 15+ messages in thread From: Souza, Jose @ 2024-01-29 15:50 UTC (permalink / raw) To: Vivi, Rodrigo, linux-kernel@vger.kernel.org Cc: quic_mojha@quicinc.com, gregkh@linuxfoundation.org, johannes@sipsolutions.net, rafael@kernel.org On Fri, 2024-01-26 at 10:11 -0500, Rodrigo Vivi wrote: > The commit 01daccf74832 ("devcoredump : Serialize devcd_del work") > introduced the mutex to protect the case where mod_delayed_work > could be called before the delayed work even existed. > > Instead, we can simply initialize the delayed work before the device > is added, so the race condition doesn't exist at first place. > > The mutex_unlock is very problematic here. Although mod_delayed_work > is async, we have no warranty that the work is not finished before > the mutex_unlock(devcd->mutex), and if that happen 'devcd' is used > after freed. > Reviewed-by: José Roberto de Souza <jose.souza@intel.com> > Cc: Mukesh Ojha <quic_mojha@quicinc.com> > Cc: Johannes Berg <johannes@sipsolutions.net> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Cc: Rafael J. Wysocki <rafael@kernel.org> > Cc: Jose Souza <jose.souza@intel.com> > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> > --- > drivers/base/devcoredump.c | 97 +++----------------------------------- > 1 file changed, 6 insertions(+), 91 deletions(-) > > diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c > index 678ecc2fa242..0e26b1273920 100644 > --- a/drivers/base/devcoredump.c > +++ b/drivers/base/devcoredump.c > @@ -25,47 +25,6 @@ struct devcd_entry { > struct device devcd_dev; > void *data; > size_t datalen; > - /* > - * Here, mutex is required to serialize the calls to del_wk work between > - * user/kernel space which happens when devcd is added with device_add() > - * and that sends uevent to user space. User space reads the uevents, > - * and calls to devcd_data_write() which try to modify the work which is > - * not even initialized/queued from devcoredump. > - * > - * > - * > - * cpu0(X) cpu1(Y) > - * > - * dev_coredump() uevent sent to user space > - * device_add() ======================> user space process Y reads the > - * uevents writes to devcd fd > - * which results into writes to > - * > - * devcd_data_write() > - * mod_delayed_work() > - * try_to_grab_pending() > - * del_timer() > - * debug_assert_init() > - * INIT_DELAYED_WORK() > - * schedule_delayed_work() > - * > - * > - * Also, mutex alone would not be enough to avoid scheduling of > - * del_wk work after it get flush from a call to devcd_free() > - * mentioned as below. > - * > - * disabled_store() > - * devcd_free() > - * mutex_lock() devcd_data_write() > - * flush_delayed_work() > - * mutex_unlock() > - * mutex_lock() > - * mod_delayed_work() > - * mutex_unlock() > - * So, delete_work flag is required. > - */ > - struct mutex mutex; > - bool delete_work; > struct module *owner; > ssize_t (*read)(char *buffer, loff_t offset, size_t count, > void *data, size_t datalen); > @@ -125,13 +84,8 @@ static ssize_t devcd_data_write(struct file *filp, struct kobject *kobj, > struct device *dev = kobj_to_dev(kobj); > struct devcd_entry *devcd = dev_to_devcd(dev); > > - mutex_lock(&devcd->mutex); > - if (!devcd->delete_work) { > - devcd->delete_work = true; > - mod_delayed_work(system_wq, &devcd->del_wk, 0); > - } > - mutex_unlock(&devcd->mutex); > - > + /* This file needs to be closed before devcd can be deleted */ > + mod_delayed_work(system_wq, &devcd->del_wk, 0); > return count; > } > > @@ -158,12 +112,7 @@ static int devcd_free(struct device *dev, void *data) > { > struct devcd_entry *devcd = dev_to_devcd(dev); > > - mutex_lock(&devcd->mutex); > - if (!devcd->delete_work) > - devcd->delete_work = true; > - > flush_delayed_work(&devcd->del_wk); > - mutex_unlock(&devcd->mutex); > return 0; > } > > @@ -173,30 +122,6 @@ static ssize_t disabled_show(const struct class *class, const struct class_attri > return sysfs_emit(buf, "%d\n", devcd_disabled); > } > > -/* > - * > - * disabled_store() worker() > - * class_for_each_device(&devcd_class, > - * NULL, NULL, devcd_free) > - * ... > - * ... > - * while ((dev = class_dev_iter_next(&iter)) > - * devcd_del() > - * device_del() > - * put_device() <- last reference > - * error = fn(dev, data) devcd_dev_release() > - * devcd_free(dev, data) kfree(devcd) > - * mutex_lock(&devcd->mutex); > - * > - * > - * In the above diagram, It looks like disabled_store() would be racing with parallely > - * running devcd_del() and result in memory abort while acquiring devcd->mutex which > - * is called after kfree of devcd memory after dropping its last reference with > - * put_device(). However, this will not happens as fn(dev, data) runs > - * with its own reference to device via klist_node so it is not its last reference. > - * so, above situation would not occur. > - */ > - > static ssize_t disabled_store(const struct class *class, const struct class_attribute *attr, > const char *buf, size_t count) > { > @@ -308,13 +233,7 @@ static void devcd_remove(void *data) > { > struct devcd_entry *devcd = data; > > - mutex_lock(&devcd->mutex); > - if (!devcd->delete_work) { > - devcd->delete_work = true; > - /* XXX: Cannot flush otherwise the mutex below will hit a UAF */ > - mod_delayed_work(system_wq, &devcd->del_wk, 0); > - } > - mutex_unlock(&devcd->mutex); > + flush_delayed_work(&devcd->del_wk); > } > > /** > @@ -365,16 +284,15 @@ void dev_coredumpm(struct device *dev, struct module *owner, > devcd->read = read; > devcd->free = free; > devcd->failing_dev = get_device(dev); > - devcd->delete_work = false; > > - mutex_init(&devcd->mutex); > device_initialize(&devcd->devcd_dev); > > dev_set_name(&devcd->devcd_dev, "devcd%d", > atomic_inc_return(&devcd_count)); > devcd->devcd_dev.class = &devcd_class; > > - mutex_lock(&devcd->mutex); > + INIT_DELAYED_WORK(&devcd->del_wk, devcd_del); > + schedule_delayed_work(&devcd->del_wk, DEVCD_TIMEOUT); > dev_set_uevent_suppress(&devcd->devcd_dev, true); > if (device_add(&devcd->devcd_dev)) > goto put_device; > @@ -392,15 +310,12 @@ void dev_coredumpm(struct device *dev, struct module *owner, > > dev_set_uevent_suppress(&devcd->devcd_dev, false); > kobject_uevent(&devcd->devcd_dev.kobj, KOBJ_ADD); > - INIT_DELAYED_WORK(&devcd->del_wk, devcd_del); > - schedule_delayed_work(&devcd->del_wk, DEVCD_TIMEOUT); > if (devm_add_action(dev, devcd_remove, devcd)) > dev_warn(dev, "devcoredump managed auto-removal registration failed\n"); > - mutex_unlock(&devcd->mutex); > return; > put_device: > + cancel_delayed_work(&devcd->del_wk); > put_device(&devcd->devcd_dev); > - mutex_unlock(&devcd->mutex); > put_module: > module_put(owner); > free: ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] devcoredump: Remove the mutex serialization 2024-01-26 15:11 ` [PATCH 2/2] devcoredump: Remove the mutex serialization Rodrigo Vivi 2024-01-29 15:50 ` Souza, Jose @ 2024-01-30 12:02 ` Mukesh Ojha 2024-01-30 15:34 ` Rodrigo Vivi 1 sibling, 1 reply; 15+ messages in thread From: Mukesh Ojha @ 2024-01-30 12:02 UTC (permalink / raw) To: Rodrigo Vivi, linux-kernel Cc: Johannes Berg, Greg Kroah-Hartman, Rafael J . Wysocki, Jose Souza On 1/26/2024 8:41 PM, Rodrigo Vivi wrote: > The commit 01daccf74832 ("devcoredump : Serialize devcd_del work") > introduced the mutex to protect the case where mod_delayed_work > could be called before the delayed work even existed. > > Instead, we can simply initialize the delayed work before the device > is added, so the race condition doesn't exist at first place. > > The mutex_unlock is very problematic here. Although mod_delayed_work > is async, we have no warranty that the work is not finished before > the mutex_unlock(devcd->mutex), and if that happen 'devcd' is used > after freed. I agree, Mutex is bad and last time there was only a situation of UAF from disable_store() and that can not occur as it keeps its ref, so we went ahead with the change., > > Cc: Mukesh Ojha <quic_mojha@quicinc.com> > Cc: Johannes Berg <johannes@sipsolutions.net> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Cc: Rafael J. Wysocki <rafael@kernel.org> > Cc: Jose Souza <jose.souza@intel.com> > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> > --- > drivers/base/devcoredump.c | 97 +++----------------------------------- > 1 file changed, 6 insertions(+), 91 deletions(-) > > diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c > index 678ecc2fa242..0e26b1273920 100644 > --- a/drivers/base/devcoredump.c > +++ b/drivers/base/devcoredump.c > @@ -25,47 +25,6 @@ struct devcd_entry { > struct device devcd_dev; > void *data; > size_t datalen; > - /* > - * Here, mutex is required to serialize the calls to del_wk work between > - * user/kernel space which happens when devcd is added with device_add() > - * and that sends uevent to user space. User space reads the uevents, > - * and calls to devcd_data_write() which try to modify the work which is > - * not even initialized/queued from devcoredump. > - * > - * > - * > - * cpu0(X) cpu1(Y) > - * > - * dev_coredump() uevent sent to user space > - * device_add() ======================> user space process Y reads the > - * uevents writes to devcd fd > - * which results into writes to > - * > - * devcd_data_write() > - * mod_delayed_work() > - * try_to_grab_pending() > - * del_timer() > - * debug_assert_init() > - * INIT_DELAYED_WORK() > - * schedule_delayed_work() > - * > - * > - * Also, mutex alone would not be enough to avoid scheduling of > - * del_wk work after it get flush from a call to devcd_free() > - * mentioned as below. > - * > - * disabled_store() > - * devcd_free() > - * mutex_lock() devcd_data_write() > - * flush_delayed_work() > - * mutex_unlock() > - * mutex_lock() > - * mod_delayed_work() > - * mutex_unlock() > - * So, delete_work flag is required. > - */ > - struct mutex mutex; > - bool delete_work; > struct module *owner; > ssize_t (*read)(char *buffer, loff_t offset, size_t count, > void *data, size_t datalen); > @@ -125,13 +84,8 @@ static ssize_t devcd_data_write(struct file *filp, struct kobject *kobj, > struct device *dev = kobj_to_dev(kobj); > struct devcd_entry *devcd = dev_to_devcd(dev); > > - mutex_lock(&devcd->mutex); > - if (!devcd->delete_work) { > - devcd->delete_work = true; > - mod_delayed_work(system_wq, &devcd->del_wk, 0); > - } > - mutex_unlock(&devcd->mutex); > - > + /* This file needs to be closed before devcd can be deleted */ > + mod_delayed_work(system_wq, &devcd->del_wk, 0); > return count; > } > > @@ -158,12 +112,7 @@ static int devcd_free(struct device *dev, void *data) > { > struct devcd_entry *devcd = dev_to_devcd(dev); > > - mutex_lock(&devcd->mutex); > - if (!devcd->delete_work) > - devcd->delete_work = true; > - > flush_delayed_work(&devcd->del_wk); > - mutex_unlock(&devcd->mutex); > return 0; > } > > @@ -173,30 +122,6 @@ static ssize_t disabled_show(const struct class *class, const struct class_attri > return sysfs_emit(buf, "%d\n", devcd_disabled); > } > > -/* > - * > - * disabled_store() worker() > - * class_for_each_device(&devcd_class, > - * NULL, NULL, devcd_free) > - * ... > - * ... > - * while ((dev = class_dev_iter_next(&iter)) > - * devcd_del() > - * device_del() > - * put_device() <- last reference > - * error = fn(dev, data) devcd_dev_release() > - * devcd_free(dev, data) kfree(devcd) > - * mutex_lock(&devcd->mutex); > - * > - * > - * In the above diagram, It looks like disabled_store() would be racing with parallely > - * running devcd_del() and result in memory abort while acquiring devcd->mutex which > - * is called after kfree of devcd memory after dropping its last reference with > - * put_device(). However, this will not happens as fn(dev, data) runs > - * with its own reference to device via klist_node so it is not its last reference. > - * so, above situation would not occur. > - */ > - > static ssize_t disabled_store(const struct class *class, const struct class_attribute *attr, > const char *buf, size_t count) > { > @@ -308,13 +233,7 @@ static void devcd_remove(void *data) > { > struct devcd_entry *devcd = data; > > - mutex_lock(&devcd->mutex); > - if (!devcd->delete_work) { > - devcd->delete_work = true; > - /* XXX: Cannot flush otherwise the mutex below will hit a UAF */ > - mod_delayed_work(system_wq, &devcd->del_wk, 0); > - } > - mutex_unlock(&devcd->mutex); > + flush_delayed_work(&devcd->del_wk); > } > > /** > @@ -365,16 +284,15 @@ void dev_coredumpm(struct device *dev, struct module *owner, > devcd->read = read; > devcd->free = free; > devcd->failing_dev = get_device(dev); > - devcd->delete_work = false; > > - mutex_init(&devcd->mutex); > device_initialize(&devcd->devcd_dev); > > dev_set_name(&devcd->devcd_dev, "devcd%d", > atomic_inc_return(&devcd_count)); > devcd->devcd_dev.class = &devcd_class; > > - mutex_lock(&devcd->mutex); > + INIT_DELAYED_WORK(&devcd->del_wk, devcd_del); > + schedule_delayed_work(&devcd->del_wk, DEVCD_TIMEOUT); Last time, we discussed [1] here, involves a assumption about timeout can not happen before device_add() succeeds. It is rare but it is there. https://lore.kernel.org/all/87ilr15ekx.ffs@tglx/ -Mukesh > dev_set_uevent_suppress(&devcd->devcd_dev, true); > if (device_add(&devcd->devcd_dev)) > goto put_device; > @@ -392,15 +310,12 @@ void dev_coredumpm(struct device *dev, struct module *owner, > > dev_set_uevent_suppress(&devcd->devcd_dev, false); > kobject_uevent(&devcd->devcd_dev.kobj, KOBJ_ADD); > - INIT_DELAYED_WORK(&devcd->del_wk, devcd_del); > - schedule_delayed_work(&devcd->del_wk, DEVCD_TIMEOUT); > if (devm_add_action(dev, devcd_remove, devcd)) > dev_warn(dev, "devcoredump managed auto-removal registration failed\n"); > - mutex_unlock(&devcd->mutex); > return; > put_device: > + cancel_delayed_work(&devcd->del_wk); > put_device(&devcd->devcd_dev); > - mutex_unlock(&devcd->mutex); > put_module: > module_put(owner); > free: ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] devcoredump: Remove the mutex serialization 2024-01-30 12:02 ` Mukesh Ojha @ 2024-01-30 15:34 ` Rodrigo Vivi 2024-01-31 16:15 ` Mukesh Ojha 0 siblings, 1 reply; 15+ messages in thread From: Rodrigo Vivi @ 2024-01-30 15:34 UTC (permalink / raw) To: Mukesh Ojha Cc: linux-kernel, Johannes Berg, Greg Kroah-Hartman, Rafael J . Wysocki, Jose Souza On Tue, Jan 30, 2024 at 05:32:24PM +0530, Mukesh Ojha wrote: > > > On 1/26/2024 8:41 PM, Rodrigo Vivi wrote: > > The commit 01daccf74832 ("devcoredump : Serialize devcd_del work") > > introduced the mutex to protect the case where mod_delayed_work > > could be called before the delayed work even existed. > > > > Instead, we can simply initialize the delayed work before the device > > is added, so the race condition doesn't exist at first place. > > > > The mutex_unlock is very problematic here. Although mod_delayed_work > > is async, we have no warranty that the work is not finished before > > the mutex_unlock(devcd->mutex), and if that happen 'devcd' is used > > after freed. > > I agree, Mutex is bad and last time there was only a situation of UAF from > disable_store() and that can not occur as it keeps its ref, so > we went ahead with the change., my concern was with: flush_delayed_work(&devcd->del_wk); mutex_unlock(&devcd->mutex); at devcd_free(). flush_work always wait for the work to finish it's execution, which will delete the device. The with the release, the devcd should be gone soon and this is at risk of the UAF, no? maybe I'm missing something. > > > > > Cc: Mukesh Ojha <quic_mojha@quicinc.com> > > Cc: Johannes Berg <johannes@sipsolutions.net> > > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > > Cc: Rafael J. Wysocki <rafael@kernel.org> > > Cc: Jose Souza <jose.souza@intel.com> > > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> > > --- > > drivers/base/devcoredump.c | 97 +++----------------------------------- > > 1 file changed, 6 insertions(+), 91 deletions(-) > > > > diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c > > index 678ecc2fa242..0e26b1273920 100644 > > --- a/drivers/base/devcoredump.c > > +++ b/drivers/base/devcoredump.c > > @@ -25,47 +25,6 @@ struct devcd_entry { > > struct device devcd_dev; > > void *data; > > size_t datalen; > > - /* > > - * Here, mutex is required to serialize the calls to del_wk work between > > - * user/kernel space which happens when devcd is added with device_add() > > - * and that sends uevent to user space. User space reads the uevents, > > - * and calls to devcd_data_write() which try to modify the work which is > > - * not even initialized/queued from devcoredump. > > - * > > - * > > - * > > - * cpu0(X) cpu1(Y) > > - * > > - * dev_coredump() uevent sent to user space > > - * device_add() ======================> user space process Y reads the > > - * uevents writes to devcd fd > > - * which results into writes to > > - * > > - * devcd_data_write() > > - * mod_delayed_work() > > - * try_to_grab_pending() > > - * del_timer() > > - * debug_assert_init() > > - * INIT_DELAYED_WORK() > > - * schedule_delayed_work() > > - * > > - * > > - * Also, mutex alone would not be enough to avoid scheduling of > > - * del_wk work after it get flush from a call to devcd_free() > > - * mentioned as below. > > - * > > - * disabled_store() > > - * devcd_free() > > - * mutex_lock() devcd_data_write() > > - * flush_delayed_work() > > - * mutex_unlock() > > - * mutex_lock() > > - * mod_delayed_work() > > - * mutex_unlock() > > - * So, delete_work flag is required. > > - */ > > - struct mutex mutex; > > - bool delete_work; > > struct module *owner; > > ssize_t (*read)(char *buffer, loff_t offset, size_t count, > > void *data, size_t datalen); > > @@ -125,13 +84,8 @@ static ssize_t devcd_data_write(struct file *filp, struct kobject *kobj, > > struct device *dev = kobj_to_dev(kobj); > > struct devcd_entry *devcd = dev_to_devcd(dev); > > - mutex_lock(&devcd->mutex); > > - if (!devcd->delete_work) { > > - devcd->delete_work = true; > > - mod_delayed_work(system_wq, &devcd->del_wk, 0); > > - } > > - mutex_unlock(&devcd->mutex); > > - > > + /* This file needs to be closed before devcd can be deleted */ > > + mod_delayed_work(system_wq, &devcd->del_wk, 0); > > return count; > > } > > @@ -158,12 +112,7 @@ static int devcd_free(struct device *dev, void *data) > > { > > struct devcd_entry *devcd = dev_to_devcd(dev); > > - mutex_lock(&devcd->mutex); > > - if (!devcd->delete_work) > > - devcd->delete_work = true; > > - > > flush_delayed_work(&devcd->del_wk); > > - mutex_unlock(&devcd->mutex); > > return 0; > > } > > @@ -173,30 +122,6 @@ static ssize_t disabled_show(const struct class *class, const struct class_attri > > return sysfs_emit(buf, "%d\n", devcd_disabled); > > } > > -/* > > - * > > - * disabled_store() worker() > > - * class_for_each_device(&devcd_class, > > - * NULL, NULL, devcd_free) > > - * ... > > - * ... > > - * while ((dev = class_dev_iter_next(&iter)) > > - * devcd_del() > > - * device_del() > > - * put_device() <- last reference > > - * error = fn(dev, data) devcd_dev_release() > > - * devcd_free(dev, data) kfree(devcd) > > - * mutex_lock(&devcd->mutex); > > - * > > - * > > - * In the above diagram, It looks like disabled_store() would be racing with parallely > > - * running devcd_del() and result in memory abort while acquiring devcd->mutex which > > - * is called after kfree of devcd memory after dropping its last reference with > > - * put_device(). However, this will not happens as fn(dev, data) runs > > - * with its own reference to device via klist_node so it is not its last reference. > > - * so, above situation would not occur. > > - */ > > - > > static ssize_t disabled_store(const struct class *class, const struct class_attribute *attr, > > const char *buf, size_t count) > > { > > @@ -308,13 +233,7 @@ static void devcd_remove(void *data) > > { > > struct devcd_entry *devcd = data; > > - mutex_lock(&devcd->mutex); > > - if (!devcd->delete_work) { > > - devcd->delete_work = true; > > - /* XXX: Cannot flush otherwise the mutex below will hit a UAF */ > > - mod_delayed_work(system_wq, &devcd->del_wk, 0); > > - } > > - mutex_unlock(&devcd->mutex); > > + flush_delayed_work(&devcd->del_wk); > > } > > /** > > @@ -365,16 +284,15 @@ void dev_coredumpm(struct device *dev, struct module *owner, > > devcd->read = read; > > devcd->free = free; > > devcd->failing_dev = get_device(dev); > > - devcd->delete_work = false; > > - mutex_init(&devcd->mutex); > > device_initialize(&devcd->devcd_dev); > > dev_set_name(&devcd->devcd_dev, "devcd%d", > > atomic_inc_return(&devcd_count)); > > devcd->devcd_dev.class = &devcd_class; > > - mutex_lock(&devcd->mutex); > > + INIT_DELAYED_WORK(&devcd->del_wk, devcd_del); > > + schedule_delayed_work(&devcd->del_wk, DEVCD_TIMEOUT); > > Last time, we discussed [1] here, involves a assumption > about timeout can not happen before device_add() succeeds. > It is rare but it is there. > > https://lore.kernel.org/all/87ilr15ekx.ffs@tglx/ hmm... I couldn't imagine a case where a device_add could take longer then 5 minutes, at least not without other bigger problems... I'm wondering that multiple subsequent calls of dev_coredumpm() would fail to find the failing_device with the class_find_device and all, but maybe I'm overthinking here or missing something else. > > -Mukesh > > > dev_set_uevent_suppress(&devcd->devcd_dev, true); > > if (device_add(&devcd->devcd_dev)) > > goto put_device; > > @@ -392,15 +310,12 @@ void dev_coredumpm(struct device *dev, struct module *owner, > > dev_set_uevent_suppress(&devcd->devcd_dev, false); > > kobject_uevent(&devcd->devcd_dev.kobj, KOBJ_ADD); > > - INIT_DELAYED_WORK(&devcd->del_wk, devcd_del); > > - schedule_delayed_work(&devcd->del_wk, DEVCD_TIMEOUT); > > if (devm_add_action(dev, devcd_remove, devcd)) > > dev_warn(dev, "devcoredump managed auto-removal registration failed\n"); > > - mutex_unlock(&devcd->mutex); > > return; > > put_device: > > + cancel_delayed_work(&devcd->del_wk); > > put_device(&devcd->devcd_dev); > > - mutex_unlock(&devcd->mutex); > > put_module: > > module_put(owner); > > free: ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] devcoredump: Remove the mutex serialization 2024-01-30 15:34 ` Rodrigo Vivi @ 2024-01-31 16:15 ` Mukesh Ojha 0 siblings, 0 replies; 15+ messages in thread From: Mukesh Ojha @ 2024-01-31 16:15 UTC (permalink / raw) To: Rodrigo Vivi Cc: linux-kernel, Johannes Berg, Greg Kroah-Hartman, Rafael J . Wysocki, Jose Souza On 1/30/2024 9:04 PM, Rodrigo Vivi wrote: > On Tue, Jan 30, 2024 at 05:32:24PM +0530, Mukesh Ojha wrote: >> >> >> On 1/26/2024 8:41 PM, Rodrigo Vivi wrote: >>> The commit 01daccf74832 ("devcoredump : Serialize devcd_del work") >>> introduced the mutex to protect the case where mod_delayed_work >>> could be called before the delayed work even existed. >>> >>> Instead, we can simply initialize the delayed work before the device >>> is added, so the race condition doesn't exist at first place. >>> >>> The mutex_unlock is very problematic here. Although mod_delayed_work >>> is async, we have no warranty that the work is not finished before >>> the mutex_unlock(devcd->mutex), and if that happen 'devcd' is used >>> after freed. >> >> I agree, Mutex is bad and last time there was only a situation of UAF from >> disable_store() and that can not occur as it keeps its ref, so >> we went ahead with the change., > > my concern was with: > > flush_delayed_work(&devcd->del_wk); > mutex_unlock(&devcd->mutex); > > at devcd_free(). > > flush_work always wait for the work to finish it's execution, > which will delete the device. > The with the release, the devcd should be gone soon and > this is at risk of the UAF, no? > > maybe I'm missing something. Before your patch, the only place where flush_delayed_work() used was devcd_free() and that is getting called from disabled_store() is taking its own reference and due to which above UAF issue would not happen from above path. > > >> >>> >>> Cc: Mukesh Ojha <quic_mojha@quicinc.com> >>> Cc: Johannes Berg <johannes@sipsolutions.net> >>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> >>> Cc: Rafael J. Wysocki <rafael@kernel.org> >>> Cc: Jose Souza <jose.souza@intel.com> >>> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> >>> --- >>> drivers/base/devcoredump.c | 97 +++----------------------------------- >>> 1 file changed, 6 insertions(+), 91 deletions(-) >>> >>> diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c >>> index 678ecc2fa242..0e26b1273920 100644 >>> --- a/drivers/base/devcoredump.c >>> +++ b/drivers/base/devcoredump.c >>> @@ -25,47 +25,6 @@ struct devcd_entry { >>> struct device devcd_dev; >>> void *data; >>> size_t datalen; >>> - /* >>> - * Here, mutex is required to serialize the calls to del_wk work between >>> - * user/kernel space which happens when devcd is added with device_add() >>> - * and that sends uevent to user space. User space reads the uevents, >>> - * and calls to devcd_data_write() which try to modify the work which is >>> - * not even initialized/queued from devcoredump. >>> - * >>> - * >>> - * >>> - * cpu0(X) cpu1(Y) >>> - * >>> - * dev_coredump() uevent sent to user space >>> - * device_add() ======================> user space process Y reads the >>> - * uevents writes to devcd fd >>> - * which results into writes to >>> - * >>> - * devcd_data_write() >>> - * mod_delayed_work() >>> - * try_to_grab_pending() >>> - * del_timer() >>> - * debug_assert_init() >>> - * INIT_DELAYED_WORK() >>> - * schedule_delayed_work() >>> - * >>> - * >>> - * Also, mutex alone would not be enough to avoid scheduling of >>> - * del_wk work after it get flush from a call to devcd_free() >>> - * mentioned as below. >>> - * >>> - * disabled_store() >>> - * devcd_free() >>> - * mutex_lock() devcd_data_write() >>> - * flush_delayed_work() >>> - * mutex_unlock() >>> - * mutex_lock() >>> - * mod_delayed_work() >>> - * mutex_unlock() >>> - * So, delete_work flag is required. >>> - */ >>> - struct mutex mutex; >>> - bool delete_work; >>> struct module *owner; >>> ssize_t (*read)(char *buffer, loff_t offset, size_t count, >>> void *data, size_t datalen); >>> @@ -125,13 +84,8 @@ static ssize_t devcd_data_write(struct file *filp, struct kobject *kobj, >>> struct device *dev = kobj_to_dev(kobj); >>> struct devcd_entry *devcd = dev_to_devcd(dev); >>> - mutex_lock(&devcd->mutex); >>> - if (!devcd->delete_work) { >>> - devcd->delete_work = true; >>> - mod_delayed_work(system_wq, &devcd->del_wk, 0); >>> - } >>> - mutex_unlock(&devcd->mutex); >>> - >>> + /* This file needs to be closed before devcd can be deleted */ >>> + mod_delayed_work(system_wq, &devcd->del_wk, 0); >>> return count; >>> } >>> @@ -158,12 +112,7 @@ static int devcd_free(struct device *dev, void *data) >>> { >>> struct devcd_entry *devcd = dev_to_devcd(dev); >>> - mutex_lock(&devcd->mutex); >>> - if (!devcd->delete_work) >>> - devcd->delete_work = true; >>> - >>> flush_delayed_work(&devcd->del_wk); >>> - mutex_unlock(&devcd->mutex); >>> return 0; >>> } >>> @@ -173,30 +122,6 @@ static ssize_t disabled_show(const struct class *class, const struct class_attri >>> return sysfs_emit(buf, "%d\n", devcd_disabled); >>> } >>> -/* >>> - * >>> - * disabled_store() worker() >>> - * class_for_each_device(&devcd_class, >>> - * NULL, NULL, devcd_free) >>> - * ... >>> - * ... >>> - * while ((dev = class_dev_iter_next(&iter)) >>> - * devcd_del() >>> - * device_del() >>> - * put_device() <- last reference >>> - * error = fn(dev, data) devcd_dev_release() >>> - * devcd_free(dev, data) kfree(devcd) >>> - * mutex_lock(&devcd->mutex); >>> - * >>> - * >>> - * In the above diagram, It looks like disabled_store() would be racing with parallely >>> - * running devcd_del() and result in memory abort while acquiring devcd->mutex which >>> - * is called after kfree of devcd memory after dropping its last reference with >>> - * put_device(). However, this will not happens as fn(dev, data) runs >>> - * with its own reference to device via klist_node so it is not its last reference. >>> - * so, above situation would not occur. >>> - */ >>> - >>> static ssize_t disabled_store(const struct class *class, const struct class_attribute *attr, >>> const char *buf, size_t count) >>> { >>> @@ -308,13 +233,7 @@ static void devcd_remove(void *data) >>> { >>> struct devcd_entry *devcd = data; >>> - mutex_lock(&devcd->mutex); >>> - if (!devcd->delete_work) { >>> - devcd->delete_work = true; >>> - /* XXX: Cannot flush otherwise the mutex below will hit a UAF */ >>> - mod_delayed_work(system_wq, &devcd->del_wk, 0); >>> - } >>> - mutex_unlock(&devcd->mutex); >>> + flush_delayed_work(&devcd->del_wk); >>> } >>> /** >>> @@ -365,16 +284,15 @@ void dev_coredumpm(struct device *dev, struct module *owner, >>> devcd->read = read; >>> devcd->free = free; >>> devcd->failing_dev = get_device(dev); >>> - devcd->delete_work = false; >>> - mutex_init(&devcd->mutex); >>> device_initialize(&devcd->devcd_dev); >>> dev_set_name(&devcd->devcd_dev, "devcd%d", >>> atomic_inc_return(&devcd_count)); >>> devcd->devcd_dev.class = &devcd_class; >>> - mutex_lock(&devcd->mutex); >>> + INIT_DELAYED_WORK(&devcd->del_wk, devcd_del); >>> + schedule_delayed_work(&devcd->del_wk, DEVCD_TIMEOUT); >> >> Last time, we discussed [1] here, involves a assumption >> about timeout can not happen before device_add() succeeds. >> It is rare but it is there. >> >> https://lore.kernel.org/all/87ilr15ekx.ffs@tglx/ > > hmm... I couldn't imagine a case where a device_add could > take longer then 5 minutes, at least not without other bigger > problems... > > I'm wondering that multiple subsequent calls of dev_coredumpm() > would fail to find the failing_device with the class_find_device > and all, but maybe I'm overthinking here or missing something else. There are two issue here which is described here, 1. /* * * * cpu0(X) cpu1(Y) * * dev_coredump() uevent sent to user space * device_add() ======================> user space process Y reads the * uevents writes to devcd fd * which results into writes to * * devcd_data_write() * mod_delayed_work() * try_to_grab_pending() * del_timer() * * INIT_DELAYED_WORK() * schedule_delayed_work() * 2. * * * disabled_store() * devcd_free() * flush_delayed_work() * devcd_data_write() * mod_delayed_work() * * But i think, we can further optimize the existing change to only protect delete_work flag, diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c index 7e2d1f0d903a..af2448da00f4 100644 --- a/drivers/base/devcoredump.c +++ b/drivers/base/devcoredump.c @@ -126,12 +126,14 @@ static ssize_t devcd_data_write(struct file *filp, struct kobject *kobj, struct devcd_entry *devcd = dev_to_devcd(dev); mutex_lock(&devcd->mutex); - if (!devcd->delete_work) { - devcd->delete_work = true; - mod_delayed_work(system_wq, &devcd->del_wk, 0); + if (devcd->delete_work) { + mutex_unlock(&devcd->mutex); + goto out; } + devcd->delete_work = true; mutex_unlock(&devcd->mutex); - + mod_delayed_work(system_wq, &devcd->del_wk, 0); +out: return count; } @@ -161,9 +163,9 @@ static int devcd_free(struct device *dev, void *data) mutex_lock(&devcd->mutex); if (!devcd->delete_work) devcd->delete_work = true; - - flush_delayed_work(&devcd->del_wk); mutex_unlock(&devcd->mutex); + flush_delayed_work(&devcd->del_wk); + return 0; } @@ -361,7 +363,6 @@ void dev_coredumpm(struct device *dev, struct module *owner, atomic_inc_return(&devcd_count)); devcd->devcd_dev.class = &devcd_class; - mutex_lock(&devcd->mutex); dev_set_uevent_suppress(&devcd->devcd_dev, true); if (device_add(&devcd->devcd_dev)) goto put_device; @@ -377,15 +378,13 @@ void dev_coredumpm(struct device *dev, struct module *owner, "devcoredump")) dev_warn(dev, "devcoredump create_link failed\n"); - dev_set_uevent_suppress(&devcd->devcd_dev, false); - kobject_uevent(&devcd->devcd_dev.kobj, KOBJ_ADD); INIT_DELAYED_WORK(&devcd->del_wk, devcd_del); schedule_delayed_work(&devcd->del_wk, DEVCD_TIMEOUT); - mutex_unlock(&devcd->mutex); + dev_set_uevent_suppress(&devcd->devcd_dev, false); + kobject_uevent(&devcd->devcd_dev.kobj, KOBJ_ADD); return; put_device: put_device(&devcd->devcd_dev); - mutex_unlock(&devcd->mutex); put_module: module_put(owner); free: -Mukesh > >> >> -Mukesh >> >>> dev_set_uevent_suppress(&devcd->devcd_dev, true); >>> if (device_add(&devcd->devcd_dev)) >>> goto put_device; >>> @@ -392,15 +310,12 @@ void dev_coredumpm(struct device *dev, struct module *owner, >>> dev_set_uevent_suppress(&devcd->devcd_dev, false); >>> kobject_uevent(&devcd->devcd_dev.kobj, KOBJ_ADD); >>> - INIT_DELAYED_WORK(&devcd->del_wk, devcd_del); >>> - schedule_delayed_work(&devcd->del_wk, DEVCD_TIMEOUT); >>> if (devm_add_action(dev, devcd_remove, devcd)) >>> dev_warn(dev, "devcoredump managed auto-removal registration failed\n"); >>> - mutex_unlock(&devcd->mutex); >>> return; >>> put_device: >>> + cancel_delayed_work(&devcd->del_wk); >>> put_device(&devcd->devcd_dev); >>> - mutex_unlock(&devcd->mutex); >>> put_module: >>> module_put(owner); >>> free: ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] devcoredump: Remove devcoredump device if failing device is gone 2024-01-26 15:11 [PATCH 1/2] devcoredump: Remove devcoredump device if failing device is gone Rodrigo Vivi 2024-01-26 15:11 ` [PATCH 2/2] devcoredump: Remove the mutex serialization Rodrigo Vivi @ 2024-01-29 15:50 ` Souza, Jose 2024-01-29 17:48 ` Johannes Berg 2 siblings, 0 replies; 15+ messages in thread From: Souza, Jose @ 2024-01-29 15:50 UTC (permalink / raw) To: Vivi, Rodrigo, linux-kernel@vger.kernel.org Cc: maarten.lankhorst@linux.intel.com, johannes@sipsolutions.net, rafael@kernel.org, gregkh@linuxfoundation.org On Fri, 2024-01-26 at 10:11 -0500, Rodrigo Vivi wrote: > Make dev_coredumpm a real device managed helper, that not only > frees the device after a scheduled delay (DEVCD_TIMEOUT), but > also when the failing/crashed device is gone. > > The module remove for the drivers using devcoredump are currently > broken if attempted between the crash and the DEVCD_TIMEOUT, since > the symbolic sysfs link won't be deleted. > > On top of that, for PCI devices, the unbind of the device will > call the pci .remove void function, that cannot fail. At that > time, our device is pretty much gone, but the read and free > functions are alive trough the devcoredump device and they > can get some NULL dereferences or use after free. > > So, if the failing-device is gone let's also request for the > devcoredump-device removal using the same mod_delayed_work > as when writing anything through data. The flush cannot be > used since it is synchronous and the devcd would be surely > gone right before the mutex_unlock on the next line. > > > Reviewed-by: José Roberto de Souza <jose.souza@intel.com> > Cc: Jose Souza <jose.souza@intel.com> > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> > Cc: Johannes Berg <johannes@sipsolutions.net> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Cc: Rafael J. Wysocki <rafael@kernel.org> > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> > --- > drivers/base/devcoredump.c | 15 +++++++++++++++ > 1 file changed, 15 insertions(+) > > diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c > index 7e2d1f0d903a..678ecc2fa242 100644 > --- a/drivers/base/devcoredump.c > +++ b/drivers/base/devcoredump.c > @@ -304,6 +304,19 @@ static ssize_t devcd_read_from_sgtable(char *buffer, loff_t offset, > offset); > } > > +static void devcd_remove(void *data) > +{ > + struct devcd_entry *devcd = data; > + > + mutex_lock(&devcd->mutex); > + if (!devcd->delete_work) { > + devcd->delete_work = true; > + /* XXX: Cannot flush otherwise the mutex below will hit a UAF */ > + mod_delayed_work(system_wq, &devcd->del_wk, 0); > + } > + mutex_unlock(&devcd->mutex); > +} > + > /** > * dev_coredumpm - create device coredump with read/free methods > * @dev: the struct device for the crashed device > @@ -381,6 +394,8 @@ void dev_coredumpm(struct device *dev, struct module *owner, > kobject_uevent(&devcd->devcd_dev.kobj, KOBJ_ADD); > INIT_DELAYED_WORK(&devcd->del_wk, devcd_del); > schedule_delayed_work(&devcd->del_wk, DEVCD_TIMEOUT); > + if (devm_add_action(dev, devcd_remove, devcd)) > + dev_warn(dev, "devcoredump managed auto-removal registration failed\n"); > mutex_unlock(&devcd->mutex); > return; > put_device: ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] devcoredump: Remove devcoredump device if failing device is gone 2024-01-26 15:11 [PATCH 1/2] devcoredump: Remove devcoredump device if failing device is gone Rodrigo Vivi 2024-01-26 15:11 ` [PATCH 2/2] devcoredump: Remove the mutex serialization Rodrigo Vivi 2024-01-29 15:50 ` [PATCH 1/2] devcoredump: Remove devcoredump device if failing device is gone Souza, Jose @ 2024-01-29 17:48 ` Johannes Berg 2024-01-29 21:29 ` Rodrigo Vivi 2024-01-31 17:22 ` Mukesh Ojha 2 siblings, 2 replies; 15+ messages in thread From: Johannes Berg @ 2024-01-29 17:48 UTC (permalink / raw) To: Rodrigo Vivi, linux-kernel Cc: Jose Souza, Maarten Lankhorst, Greg Kroah-Hartman, Rafael J . Wysocki On Fri, 2024-01-26 at 10:11 -0500, Rodrigo Vivi wrote: > Make dev_coredumpm a real device managed helper, that not only > frees the device after a scheduled delay (DEVCD_TIMEOUT), but > also when the failing/crashed device is gone. > > The module remove for the drivers using devcoredump are currently > broken if attempted between the crash and the DEVCD_TIMEOUT, since > the symbolic sysfs link won't be deleted. Hmm, is it a problem to remove a whole dev when it still has some link here? Maybe we could just make the link be managed/auto-removed? Probably regardless of that you should change the comment in devcd_dev_release() since it's no longer a concern? > On top of that, for PCI devices, the unbind of the device will > call the pci .remove void function, that cannot fail. At that > time, our device is pretty much gone, but the read and free > functions are alive trough the devcoredump device and they ^ through, I guess > can get some NULL dereferences or use after free. Not sure I understand this part, how's this related to PCI's .remove? > So, if the failing-device is gone let's also request for the > devcoredump-device removal using the same mod_delayed_work > as when writing anything through data. The flush cannot be > used since it is synchronous and the devcd would be surely > gone right before the mutex_unlock on the next line. Can we just decouple it instead and remove the symlink? Which is kind of what the comment in devcd_dev_release() says but at the time I wasn't aware of all the devm mechanics etc. I'm thinking this might be annoying in certain recovery cases, e.g. iwlwifi uses this but may sometimes unbind/rebind itself to recover from certain errors, and that'd make the FW dumps disappear. johannes ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] devcoredump: Remove devcoredump device if failing device is gone 2024-01-29 17:48 ` Johannes Berg @ 2024-01-29 21:29 ` Rodrigo Vivi 2024-01-29 21:51 ` Johannes Berg 2024-01-31 17:22 ` Mukesh Ojha 1 sibling, 1 reply; 15+ messages in thread From: Rodrigo Vivi @ 2024-01-29 21:29 UTC (permalink / raw) To: Johannes Berg Cc: linux-kernel, Jose Souza, Maarten Lankhorst, Greg Kroah-Hartman, Rafael J . Wysocki On Mon, Jan 29, 2024 at 06:48:12PM +0100, Johannes Berg wrote: > On Fri, 2024-01-26 at 10:11 -0500, Rodrigo Vivi wrote: > > Make dev_coredumpm a real device managed helper, that not only > > frees the device after a scheduled delay (DEVCD_TIMEOUT), but > > also when the failing/crashed device is gone. > > > > The module remove for the drivers using devcoredump are currently > > broken if attempted between the crash and the DEVCD_TIMEOUT, since > > the symbolic sysfs link won't be deleted. > > Hmm, is it a problem to remove a whole dev when it still has some link > here? Well, the big problem is that with link there, the base sysfs dir is not deleted/removed. So, the next reload fails to recreate beucase file exists. > Maybe we could just make the link be managed/auto-removed? this would help this angle indeed. > > Probably regardless of that you should change the comment in > devcd_dev_release() since it's no longer a concern? indeed, I don't believe that that is a concern because this is exactly the only place deleting the link and it can't race with itself. > > > On top of that, for PCI devices, the unbind of the device will > > call the pci .remove void function, that cannot fail. At that > > time, our device is pretty much gone, but the read and free > > functions are alive trough the devcoredump device and they > ^ through, I guess > > > can get some NULL dereferences or use after free. > > Not sure I understand this part, how's this related to PCI's .remove? Well, this is my secondary concern that the idea of the link_auto_removal doesn't cover. If the failing_device is gone, the 'data cookie' it used to register with dev_coredumpm(... void *data,...), is also likely gone on a clean removal. And to be honest, we shouldn't even count that the registered *read() function pointer is valid anymore. I'm sorry for not being clear on this point. The other one was the immediate one blocking our CI so I ended up writing up the commit message with that in mind and without thinking about alternatives like only removing the link. > > > So, if the failing-device is gone let's also request for the > > devcoredump-device removal using the same mod_delayed_work > > as when writing anything through data. The flush cannot be > > used since it is synchronous and the devcd would be surely > > gone right before the mutex_unlock on the next line. > > Can we just decouple it instead and remove the symlink? Which is kind of > what the comment in devcd_dev_release() says but at the time I wasn't > aware of all the devm mechanics etc. Well, we could indeed. And that would unblock our CI, but I'm afraid it wouldn't protect the final user from bad memory access on a direct $ cat /sys/class/devcoredump/devcd<n>/data Shouldn't we consider this critical itself to justify this entirely removal? > > I'm thinking this might be annoying in certain recovery cases, e.g. > iwlwifi uses this but may sometimes unbind/rebind itself to recover from > certain errors, and that'd make the FW dumps disappear. I see... but it looks like dev_coredumpsg have a different handle of the data cookie and read functions with the read direct from the sgtable and might not face this bad memory access, since it allocates the sg_dump_data which is only deleted/freed by the devcoredump removal... But I'm concerned with the direct usage of the drivers using dev_coredumpm() directly. Should I then move dev_coredumpm to _dev_coredumpm() and then create a new dev_coredumpm that calls for _dev_coredump and this devm_add_action(dev, devcd_remove, devcd) ? And also an improved commit message to show the bad memory access issue? Thank you so much for the feedback, Rodrigo. > > johannes ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] devcoredump: Remove devcoredump device if failing device is gone 2024-01-29 21:29 ` Rodrigo Vivi @ 2024-01-29 21:51 ` Johannes Berg 2024-01-30 15:16 ` Rodrigo Vivi 0 siblings, 1 reply; 15+ messages in thread From: Johannes Berg @ 2024-01-29 21:51 UTC (permalink / raw) To: Rodrigo Vivi Cc: linux-kernel, Jose Souza, Maarten Lankhorst, Greg Kroah-Hartman, Rafael J . Wysocki On Mon, 2024-01-29 at 16:29 -0500, Rodrigo Vivi wrote: > > > > > On top of that, for PCI devices, the unbind of the device will > > > call the pci .remove void function, that cannot fail. At that > > > time, our device is pretty much gone, but the read and free > > > functions are alive trough the devcoredump device and they > > ^ through, I guess > > > > > can get some NULL dereferences or use after free. > > > > Not sure I understand this part, how's this related to PCI's .remove? > > Well, this is my secondary concern that the idea of the link_auto_removal > doesn't cover. > > If the failing_device is gone, the 'data cookie' it used to register with > dev_coredumpm(... void *data,...), is also likely gone on a clean removal. That's on the user. You'll always be able to shoot yourself in the foot. > And to be honest, we shouldn't even count that the registered *read() > function pointer is valid anymore. That's not true: the module cannot be removed, there's a reference to it if you're using dev_coredumpm() correctly (which is to say: pass THIS_MODULE to the struct module *owner argument). > Well, we could indeed. And that would unblock our CI, but I'm afraid > it wouldn't protect the final user from bad memory access on a direct > $ cat /sys/class/devcoredump/devcd<n>/data > > Shouldn't we consider this critical itself to justify this entirely > removal? No? IMHO that's totally on the user. If you absolutely cannot make a standalone dump 'data' pointer (why not?! you can always stick the actual data into a vmalloc chunk and use dev_coredumpv()?) then maybe we can offer ways of removing it when you need to? But I'd rather not, it feels weird to have a need for it. johannes ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] devcoredump: Remove devcoredump device if failing device is gone 2024-01-29 21:51 ` Johannes Berg @ 2024-01-30 15:16 ` Rodrigo Vivi 2024-01-30 15:19 ` Johannes Berg 0 siblings, 1 reply; 15+ messages in thread From: Rodrigo Vivi @ 2024-01-30 15:16 UTC (permalink / raw) To: Johannes Berg Cc: linux-kernel, Jose Souza, Maarten Lankhorst, Greg Kroah-Hartman, Rafael J . Wysocki > > If the failing_device is gone, the 'data cookie' it used to register with > > dev_coredumpm(... void *data,...), is also likely gone on a clean removal. > > That's on the user. You'll always be able to shoot yourself in the foot. > > > And to be honest, we shouldn't even count that the registered *read() > > function pointer is valid anymore. > > That's not true: the module cannot be removed, there's a reference to it > if you're using dev_coredumpm() correctly (which is to say: pass > THIS_MODULE to the struct module *owner argument). > > > Well, we could indeed. And that would unblock our CI, but I'm afraid > > it wouldn't protect the final user from bad memory access on a direct > > $ cat /sys/class/devcoredump/devcd<n>/data > > > > Shouldn't we consider this critical itself to justify this entirely > > removal? > > No? IMHO that's totally on the user. If you absolutely cannot make a > standalone dump 'data' pointer (why not?! you can always stick the > actual data into a vmalloc chunk and use dev_coredumpv()?) hmm... fair enough. We would be okay here indeed since devcoredump always free the data. > then maybe we > can offer ways of removing it when you need to? well, to be honest my first local version was like that, offering a dev_coredump_removal() that driver could request the removal before going away. > But I'd rather not, it > feels weird to have a need for it. We could change or CI and instruct our devs to always write something to 'data' to ensure that devcoredump is deleted before we can reload our module. Maybe that's the right approach indeed, although I would really prefer to have a direct way. > > johannes ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] devcoredump: Remove devcoredump device if failing device is gone 2024-01-30 15:16 ` Rodrigo Vivi @ 2024-01-30 15:19 ` Johannes Berg 2024-01-30 15:49 ` Rodrigo Vivi 0 siblings, 1 reply; 15+ messages in thread From: Johannes Berg @ 2024-01-30 15:19 UTC (permalink / raw) To: Rodrigo Vivi Cc: linux-kernel, Jose Souza, Maarten Lankhorst, Greg Kroah-Hartman, Rafael J . Wysocki On Tue, 2024-01-30 at 10:16 -0500, Rodrigo Vivi wrote: > > > > But I'd rather not, it > > feels weird to have a need for it. > > We could change or CI and instruct our devs to always write > something to 'data' to ensure that devcoredump is deleted > before we can reload our module. Maybe that's the right > approach indeed, although I would really prefer to have > a direct way. That's not really what I meant :-) I think we can agree that it's wrong for the kernel to be _able_ to run into some kind of use-after-free if userspace isn't doing the right thing here! What I meant though is: it's weird for 'data' to actually depend on the struct device being still around, no? Whatever you want 'data' to be, couldn't you arrange it so that it's valid as long as the module isn't removed, so that the 'data' pointer literally encapsulates the needed data, doesn't depend on anything else, and the method you pass is more like a 'format' method. johannes ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] devcoredump: Remove devcoredump device if failing device is gone 2024-01-30 15:19 ` Johannes Berg @ 2024-01-30 15:49 ` Rodrigo Vivi 2024-01-30 15:51 ` Johannes Berg 0 siblings, 1 reply; 15+ messages in thread From: Rodrigo Vivi @ 2024-01-30 15:49 UTC (permalink / raw) To: Johannes Berg Cc: linux-kernel, Jose Souza, Maarten Lankhorst, Greg Kroah-Hartman, Rafael J . Wysocki On Tue, Jan 30, 2024 at 04:19:18PM +0100, Johannes Berg wrote: > On Tue, 2024-01-30 at 10:16 -0500, Rodrigo Vivi wrote: > > > > > > But I'd rather not, it > > > feels weird to have a need for it. > > > > We could change or CI and instruct our devs to always write > > something to 'data' to ensure that devcoredump is deleted > > before we can reload our module. Maybe that's the right > > approach indeed, although I would really prefer to have > > a direct way. > > That's not really what I meant :-) I think we can agree that it's wrong > for the kernel to be _able_ to run into some kind of use-after-free if > userspace isn't doing the right thing here! > > What I meant though is: it's weird for 'data' to actually depend on the > struct device being still around, no? Whatever you want 'data' to be, > couldn't you arrange it so that it's valid as long as the module isn't > removed, so that the 'data' pointer literally encapsulates the needed > data, doesn't depend on anything else, and the method you pass is more > like a 'format' method. I'm sorry for not being clear here. I totally agree with you. I will make changes to our driver to make the 'data' a standalone memory that devcoredump will free. this ensures no uaf and no null deref. data could be read even after unbinding the driver. What I meant to userspace 'writing to 'data'' was to ensure that on our CI we run something like if /sys/.../device/devcd<n> exists, then echo 1 > /sys/.../device/devcd<n>/data before attempting the rmmod <driver> our rmmod cannot get stuck or our CI is blocked, but then ensuring the devcd is gone with module_put happening is the only current way of not blocking the rmmod. > > johannes ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] devcoredump: Remove devcoredump device if failing device is gone 2024-01-30 15:49 ` Rodrigo Vivi @ 2024-01-30 15:51 ` Johannes Berg 0 siblings, 0 replies; 15+ messages in thread From: Johannes Berg @ 2024-01-30 15:51 UTC (permalink / raw) To: Rodrigo Vivi Cc: linux-kernel, Jose Souza, Maarten Lankhorst, Greg Kroah-Hartman, Rafael J . Wysocki On Tue, 2024-01-30 at 10:49 -0500, Rodrigo Vivi wrote: > > I will make changes to our driver to make the 'data' a standalone memory > that devcoredump will free. this ensures no uaf and no null deref. > data could be read even after unbinding the driver. > > What I meant to userspace 'writing to 'data'' was to ensure that > on our CI we run something like > > if /sys/.../device/devcd<n> exists, then > echo 1 > /sys/.../device/devcd<n>/data > before attempting the rmmod <driver> > > our rmmod cannot get stuck or our CI is blocked, but then ensuring > the devcd is gone with module_put happening is the only current way > of not blocking the rmmod. > Ah, you were just concerned about the module removal, sure, that makes sense. Though depending on how you make that data pointer: if you just use *sg() or *v() then you don't have this problem in the first place. OTOH, it's probably good to have a udev rule to automatically capture the data in CI anyway (and fail the test if it happened?) johannes ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] devcoredump: Remove devcoredump device if failing device is gone 2024-01-29 17:48 ` Johannes Berg 2024-01-29 21:29 ` Rodrigo Vivi @ 2024-01-31 17:22 ` Mukesh Ojha 1 sibling, 0 replies; 15+ messages in thread From: Mukesh Ojha @ 2024-01-31 17:22 UTC (permalink / raw) To: Johannes Berg, Rodrigo Vivi, linux-kernel Cc: Jose Souza, Maarten Lankhorst, Greg Kroah-Hartman, Rafael J . Wysocki On 1/29/2024 11:18 PM, Johannes Berg wrote: > On Fri, 2024-01-26 at 10:11 -0500, Rodrigo Vivi wrote: >> Make dev_coredumpm a real device managed helper, that not only >> frees the device after a scheduled delay (DEVCD_TIMEOUT), but >> also when the failing/crashed device is gone. >> >> The module remove for the drivers using devcoredump are currently >> broken if attempted between the crash and the DEVCD_TIMEOUT, since >> the symbolic sysfs link won't be deleted. > > Hmm, is it a problem to remove a whole dev when it still has some link > here? Maybe we could just make the link be managed/auto-removed? > > Probably regardless of that you should change the comment in > devcd_dev_release() since it's no longer a concern? > >> On top of that, for PCI devices, the unbind of the device will >> call the pci .remove void function, that cannot fail. At that >> time, our device is pretty much gone, but the read and free >> functions are alive trough the devcoredump device and they > ^ through, I guess > >> can get some NULL dereferences or use after free. > > Not sure I understand this part, how's this related to PCI's .remove? > >> So, if the failing-device is gone let's also request for the >> devcoredump-device removal using the same mod_delayed_work >> as when writing anything through data. The flush cannot be >> used since it is synchronous and the devcd would be surely >> gone right before the mutex_unlock on the next line. > > Can we just decouple it instead and remove the symlink? Which is kind of > what the comment in devcd_dev_release() says but at the time I wasn't > aware of all the devm mechanics etc. Are we going to do this ? -Mukesh > > I'm thinking this might be annoying in certain recovery cases, e.g. > iwlwifi uses this but may sometimes unbind/rebind itself to recover from > certain errors, and that'd make the FW dumps disappear. > > johannes > ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2024-01-31 17:23 UTC | newest] Thread overview: 15+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-01-26 15:11 [PATCH 1/2] devcoredump: Remove devcoredump device if failing device is gone Rodrigo Vivi 2024-01-26 15:11 ` [PATCH 2/2] devcoredump: Remove the mutex serialization Rodrigo Vivi 2024-01-29 15:50 ` Souza, Jose 2024-01-30 12:02 ` Mukesh Ojha 2024-01-30 15:34 ` Rodrigo Vivi 2024-01-31 16:15 ` Mukesh Ojha 2024-01-29 15:50 ` [PATCH 1/2] devcoredump: Remove devcoredump device if failing device is gone Souza, Jose 2024-01-29 17:48 ` Johannes Berg 2024-01-29 21:29 ` Rodrigo Vivi 2024-01-29 21:51 ` Johannes Berg 2024-01-30 15:16 ` Rodrigo Vivi 2024-01-30 15:19 ` Johannes Berg 2024-01-30 15:49 ` Rodrigo Vivi 2024-01-30 15:51 ` Johannes Berg 2024-01-31 17:22 ` Mukesh Ojha
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox