Linux RAID subsystem development
 help / color / mirror / Atom feed
* [RFC PATCH 1/1] md: delete gendisk in ioctl path
@ 2025-04-22  3:24 Xiao Ni
  2025-04-22  5:51 ` Christoph Hellwig
  2025-04-22  6:21 ` Yu Kuai
  0 siblings, 2 replies; 8+ messages in thread
From: Xiao Ni @ 2025-04-22  3:24 UTC (permalink / raw)
  To: linux-raid; +Cc: song, yukuai1, hch, ming.lei, ncroxon

commit 777d0961ff95b ("fs: move the bdex_statx call to vfs_getattr_nosec")
introduces a regression problem, like this:
[ 1479.474440] INFO: task kdevtmpfs:506 blocked for more than 491 seconds.
[ 1479.478569]  __wait_for_common+0x99/0x1c0
[ 1479.478823]  ? __pfx_schedule_timeout+0x10/0x10
[ 1479.479466]  __flush_workqueue+0x138/0x400
[ 1479.479684]  md_alloc+0x21/0x370 [md_mod]
[ 1479.479926]  md_probe+0x24/0x50 [md_mod]
[ 1479.480150]  blk_probe_dev+0x62/0x90
[ 1479.480368]  blk_request_module+0xf/0x70
[ 1479.480604]  blkdev_get_no_open+0x5e/0xa0
[ 1479.480809]  bdev_statx+0x1f/0xf0
[ 1479.481364]  vfs_getattr_nosec+0x10a/0x120
[ 1479.481602]  handle_remove+0x68/0x290
[ 1479.481812]  ? __update_idle_core+0x23/0xb0
[ 1479.482023]  devtmpfs_work_loop+0x10d/0x2a0
[ 1479.482231]  ? __pfx_devtmpfsd+0x10/0x10
[ 1479.482464]  devtmpfsd+0x2f/0x30

[ 1479.485397] INFO: task kworker/33:1:532 blocked for more than 491 seconds.
[ 1479.535380]  __wait_for_common+0x99/0x1c0
[ 1479.566876]  ? __pfx_schedule_timeout+0x10/0x10
[ 1479.591156]  devtmpfs_submit_req+0x6e/0x90
[ 1479.591389]  devtmpfs_delete_node+0x60/0x90
[ 1479.591631]  ? process_sysctl_arg+0x270/0x2f0
[ 1479.592256]  device_del+0x315/0x3d0
[ 1479.592839]  ? mutex_lock+0xe/0x30
[ 1479.593455]  del_gendisk+0x216/0x320
[ 1479.593672]  md_kobj_release+0x34/0x40 [md_mod]
[ 1479.594317]  kobject_cleanup+0x3a/0x130
[ 1479.594562]  process_one_work+0x19d/0x3d0

Now del_gendisk and put_disk are called asynchronously in workqueue work.
del_gendisk deletes device node by devtmpfs. devtmpfs tries to open this
array again and it flush the workqueue at the bigging of open process. So
a deadlock happens.

The asynchronous way also has a problem that the device node can still
exist after mdadm --stop command returns in a short window. So udev rule
can open this device node and create the struct mddev in kernel again.

So put del_gendisk in ioctl path and still leave put_disk in
md_kobj_release to avoid uaf.

Fixes: 777d0961ff95 ("fs: move the bdex_statx call to vfs_getattr_nosec")
Signed-off-by: Xiao Ni <xni@redhat.com>
---
 drivers/md/md.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 9daa78c5fe33..c3f793296ccc 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5746,7 +5746,6 @@ static void md_kobj_release(struct kobject *ko)
 	if (mddev->sysfs_level)
 		sysfs_put(mddev->sysfs_level);
 
-	del_gendisk(mddev->gendisk);
 	put_disk(mddev->gendisk);
 }
 
@@ -6597,6 +6596,8 @@ static int do_md_stop(struct mddev *mddev, int mode)
 		md_clean(mddev);
 		if (mddev->hold_active == UNTIL_STOP)
 			mddev->hold_active = 0;
+
+		del_gendisk(mddev->gendisk);
 	}
 	md_new_event();
 	sysfs_notify_dirent_safe(mddev->sysfs_state);
-- 
2.32.0 (Apple Git-132)


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 1/1] md: delete gendisk in ioctl path
  2025-04-22  3:24 [RFC PATCH 1/1] md: delete gendisk in ioctl path Xiao Ni
@ 2025-04-22  5:51 ` Christoph Hellwig
  2025-04-22  8:31   ` Xiao Ni
  2025-04-22  9:24   ` Xiao Ni
  2025-04-22  6:21 ` Yu Kuai
  1 sibling, 2 replies; 8+ messages in thread
From: Christoph Hellwig @ 2025-04-22  5:51 UTC (permalink / raw)
  To: Xiao Ni; +Cc: linux-raid, song, yukuai1, hch, ming.lei, ncroxon

On Tue, Apr 22, 2025 at 11:24:03AM +0800, Xiao Ni wrote:
> Now del_gendisk and put_disk are called asynchronously in workqueue work.
> del_gendisk deletes device node by devtmpfs. devtmpfs tries to open this
> array again and it flush the workqueue at the bigging of open process. So
> a deadlock happens.
> 
> The asynchronous way also has a problem that the device node can still
> exist after mdadm --stop command returns in a short window. So udev rule
> can open this device node and create the struct mddev in kernel again.
> 
> So put del_gendisk in ioctl path and still leave put_disk in
> md_kobj_release to avoid uaf.

The md lifetime rules are complicated enough as-is.  So while I won't
object to this change per-see I'd rather have it reviewed by the md
maintainers independently.

In the meantime this should ensure devtmpfs doesn't call into
blkdev_get_no_open and thus put_disk:

diff --git a/block/bdev.c b/block/bdev.c
index 6a34179192c9..97d4c0ab1670 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1274,18 +1274,23 @@ void sync_bdevs(bool wait)
  */
 void bdev_statx(const struct path *path, struct kstat *stat, u32 request_mask)
 {
-	struct inode *backing_inode;
 	struct block_device *bdev;
 
-	backing_inode = d_backing_inode(path->dentry);
-
 	/*
-	 * Note that backing_inode is the inode of a block device node file,
-	 * not the block device's internal inode.  Therefore it is *not* valid
-	 * to use I_BDEV() here; the block device has to be looked up by i_rdev
-	 * instead.
+	 * Note that d_backing_inode() returnsthe inode of a block device node
+	 * file, not the block device's internal inode.
+	 *
+	 * Therefore it is *not* valid to use I_BDEV() here; the block device
+	 * has to be looked up by i_rdev instead.
+	 *
+	 * Only do this lookup if actually needed to avoid the performance
+	 * overhead of the lookup, and to avoid injecting bdev lifetime issues
+	 * into devtmpfs.
 	 */
-	bdev = blkdev_get_no_open(backing_inode->i_rdev);
+	if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC)))
+		return;
+
+	bdev = blkdev_get_no_open(d_backing_inode(path->dentry)->i_rdev);
 	if (!bdev)
 		return;
 

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 1/1] md: delete gendisk in ioctl path
  2025-04-22  3:24 [RFC PATCH 1/1] md: delete gendisk in ioctl path Xiao Ni
  2025-04-22  5:51 ` Christoph Hellwig
@ 2025-04-22  6:21 ` Yu Kuai
  2025-04-22  7:31   ` Xiao Ni
  1 sibling, 1 reply; 8+ messages in thread
From: Yu Kuai @ 2025-04-22  6:21 UTC (permalink / raw)
  To: Xiao Ni, linux-raid; +Cc: song, yukuai1, hch, ming.lei, ncroxon, yukuai (C)

Hi,

在 2025/04/22 11:24, Xiao Ni 写道:
> commit 777d0961ff95b ("fs: move the bdex_statx call to vfs_getattr_nosec")
> introduces a regression problem, like this:
> [ 1479.474440] INFO: task kdevtmpfs:506 blocked for more than 491 seconds.
> [ 1479.478569]  __wait_for_common+0x99/0x1c0
> [ 1479.478823]  ? __pfx_schedule_timeout+0x10/0x10
> [ 1479.479466]  __flush_workqueue+0x138/0x400
> [ 1479.479684]  md_alloc+0x21/0x370 [md_mod]
> [ 1479.479926]  md_probe+0x24/0x50 [md_mod]
> [ 1479.480150]  blk_probe_dev+0x62/0x90
> [ 1479.480368]  blk_request_module+0xf/0x70
> [ 1479.480604]  blkdev_get_no_open+0x5e/0xa0
> [ 1479.480809]  bdev_statx+0x1f/0xf0
> [ 1479.481364]  vfs_getattr_nosec+0x10a/0x120
> [ 1479.481602]  handle_remove+0x68/0x290
> [ 1479.481812]  ? __update_idle_core+0x23/0xb0
> [ 1479.482023]  devtmpfs_work_loop+0x10d/0x2a0
> [ 1479.482231]  ? __pfx_devtmpfsd+0x10/0x10
> [ 1479.482464]  devtmpfsd+0x2f/0x30
> 
> [ 1479.485397] INFO: task kworker/33:1:532 blocked for more than 491 seconds.
> [ 1479.535380]  __wait_for_common+0x99/0x1c0
> [ 1479.566876]  ? __pfx_schedule_timeout+0x10/0x10
> [ 1479.591156]  devtmpfs_submit_req+0x6e/0x90
> [ 1479.591389]  devtmpfs_delete_node+0x60/0x90
> [ 1479.591631]  ? process_sysctl_arg+0x270/0x2f0
> [ 1479.592256]  device_del+0x315/0x3d0
> [ 1479.592839]  ? mutex_lock+0xe/0x30
> [ 1479.593455]  del_gendisk+0x216/0x320
> [ 1479.593672]  md_kobj_release+0x34/0x40 [md_mod]
> [ 1479.594317]  kobject_cleanup+0x3a/0x130
> [ 1479.594562]  process_one_work+0x19d/0x3d0
> 
> Now del_gendisk and put_disk are called asynchronously in workqueue work.
> del_gendisk deletes device node by devtmpfs. devtmpfs tries to open this
> array again and it flush the workqueue at the bigging of open process. So
> a deadlock happens.
> 
> The asynchronous way also has a problem that the device node can still
> exist after mdadm --stop command returns in a short window. So udev rule
> can open this device node and create the struct mddev in kernel again.
> 
> So put del_gendisk in ioctl path and still leave put_disk in
> md_kobj_release to avoid uaf.
> 
> Fixes: 777d0961ff95 ("fs: move the bdex_statx call to vfs_getattr_nosec")
> Signed-off-by: Xiao Ni <xni@redhat.com>
> ---
>   drivers/md/md.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 9daa78c5fe33..c3f793296ccc 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5746,7 +5746,6 @@ static void md_kobj_release(struct kobject *ko)
>   	if (mddev->sysfs_level)
>   		sysfs_put(mddev->sysfs_level);
>   
> -	del_gendisk(mddev->gendisk);
>   	put_disk(mddev->gendisk);
>   }
>   
> @@ -6597,6 +6596,8 @@ static int do_md_stop(struct mddev *mddev, int mode)
>   		md_clean(mddev);
>   		if (mddev->hold_active == UNTIL_STOP)
>   			mddev->hold_active = 0;
> +
> +		del_gendisk(mddev->gendisk);

I'm still trying to understand the problem here, however, this change
will break sysfs api md/array_state, do_md_stop will be called as well,
del_gendisk in this context will deadlock, because it will wait for all
sysfs writers, include itself, to be done.

Thanks,
Kuai

>   	}
>   	md_new_event();
>   	sysfs_notify_dirent_safe(mddev->sysfs_state);
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 1/1] md: delete gendisk in ioctl path
  2025-04-22  6:21 ` Yu Kuai
@ 2025-04-22  7:31   ` Xiao Ni
  0 siblings, 0 replies; 8+ messages in thread
From: Xiao Ni @ 2025-04-22  7:31 UTC (permalink / raw)
  To: Yu Kuai; +Cc: linux-raid, song, hch, ming.lei, ncroxon, yukuai (C)

On Tue, Apr 22, 2025 at 2:21 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2025/04/22 11:24, Xiao Ni 写道:
> > commit 777d0961ff95b ("fs: move the bdex_statx call to vfs_getattr_nosec")
> > introduces a regression problem, like this:
> > [ 1479.474440] INFO: task kdevtmpfs:506 blocked for more than 491 seconds.
> > [ 1479.478569]  __wait_for_common+0x99/0x1c0
> > [ 1479.478823]  ? __pfx_schedule_timeout+0x10/0x10
> > [ 1479.479466]  __flush_workqueue+0x138/0x400
> > [ 1479.479684]  md_alloc+0x21/0x370 [md_mod]
> > [ 1479.479926]  md_probe+0x24/0x50 [md_mod]
> > [ 1479.480150]  blk_probe_dev+0x62/0x90
> > [ 1479.480368]  blk_request_module+0xf/0x70
> > [ 1479.480604]  blkdev_get_no_open+0x5e/0xa0
> > [ 1479.480809]  bdev_statx+0x1f/0xf0
> > [ 1479.481364]  vfs_getattr_nosec+0x10a/0x120
> > [ 1479.481602]  handle_remove+0x68/0x290
> > [ 1479.481812]  ? __update_idle_core+0x23/0xb0
> > [ 1479.482023]  devtmpfs_work_loop+0x10d/0x2a0
> > [ 1479.482231]  ? __pfx_devtmpfsd+0x10/0x10
> > [ 1479.482464]  devtmpfsd+0x2f/0x30
> >
> > [ 1479.485397] INFO: task kworker/33:1:532 blocked for more than 491 seconds.
> > [ 1479.535380]  __wait_for_common+0x99/0x1c0
> > [ 1479.566876]  ? __pfx_schedule_timeout+0x10/0x10
> > [ 1479.591156]  devtmpfs_submit_req+0x6e/0x90
> > [ 1479.591389]  devtmpfs_delete_node+0x60/0x90
> > [ 1479.591631]  ? process_sysctl_arg+0x270/0x2f0
> > [ 1479.592256]  device_del+0x315/0x3d0
> > [ 1479.592839]  ? mutex_lock+0xe/0x30
> > [ 1479.593455]  del_gendisk+0x216/0x320
> > [ 1479.593672]  md_kobj_release+0x34/0x40 [md_mod]
> > [ 1479.594317]  kobject_cleanup+0x3a/0x130
> > [ 1479.594562]  process_one_work+0x19d/0x3d0
> >
> > Now del_gendisk and put_disk are called asynchronously in workqueue work.
> > del_gendisk deletes device node by devtmpfs. devtmpfs tries to open this
> > array again and it flush the workqueue at the bigging of open process. So
> > a deadlock happens.
> >
> > The asynchronous way also has a problem that the device node can still
> > exist after mdadm --stop command returns in a short window. So udev rule
> > can open this device node and create the struct mddev in kernel again.
> >
> > So put del_gendisk in ioctl path and still leave put_disk in
> > md_kobj_release to avoid uaf.
> >
> > Fixes: 777d0961ff95 ("fs: move the bdex_statx call to vfs_getattr_nosec")
> > Signed-off-by: Xiao Ni <xni@redhat.com>
> > ---
> >   drivers/md/md.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 9daa78c5fe33..c3f793296ccc 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -5746,7 +5746,6 @@ static void md_kobj_release(struct kobject *ko)
> >       if (mddev->sysfs_level)
> >               sysfs_put(mddev->sysfs_level);
> >
> > -     del_gendisk(mddev->gendisk);
> >       put_disk(mddev->gendisk);
> >   }
> >
> > @@ -6597,6 +6596,8 @@ static int do_md_stop(struct mddev *mddev, int mode)
> >               md_clean(mddev);
> >               if (mddev->hold_active == UNTIL_STOP)
> >                       mddev->hold_active = 0;
> > +
> > +             del_gendisk(mddev->gendisk);
>
> I'm still trying to understand the problem here, however, this change
> will break sysfs api md/array_state, do_md_stop will be called as well,
> del_gendisk in this context will deadlock, because it will wait for all
> sysfs writers, include itself, to be done.

Thanks for pointing out this.

Is it ok to remove the support of md_probe now? Now all the things
that are difficult to handle is that array can be created again when
opening the block device node.

78b6350dcaadb03b4a2970b16387227ba6744876 ("md: support disabling of
create-on-open semantics.") tries to disable create-on-open. And the
probe method will be removed too
(451f0b6f4c44d7b649ae609157b114b71f6d7875 block: default
BLOCK_LEGACY_AUTOLOAD to y). To make things easy, we can remove the
support of md_probe. md_probe is used for system without udev
environment. So it needs to create a device node by mknod and open the
device node to create the array.  I guess no system doesn't have udev
environment, right?

Regards
Xiao
>
> Thanks,
> Kuai
>
> >       }
> >       md_new_event();
> >       sysfs_notify_dirent_safe(mddev->sysfs_state);
> >
>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 1/1] md: delete gendisk in ioctl path
  2025-04-22  5:51 ` Christoph Hellwig
@ 2025-04-22  8:31   ` Xiao Ni
  2025-04-22  8:40     ` Xiao Ni
  2025-04-22 13:48     ` Christoph Hellwig
  2025-04-22  9:24   ` Xiao Ni
  1 sibling, 2 replies; 8+ messages in thread
From: Xiao Ni @ 2025-04-22  8:31 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-raid, song, yukuai1, ming.lei, ncroxon

On Tue, Apr 22, 2025 at 1:51 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Apr 22, 2025 at 11:24:03AM +0800, Xiao Ni wrote:
> > Now del_gendisk and put_disk are called asynchronously in workqueue work.
> > del_gendisk deletes device node by devtmpfs. devtmpfs tries to open this
> > array again and it flush the workqueue at the bigging of open process. So
> > a deadlock happens.
> >
> > The asynchronous way also has a problem that the device node can still
> > exist after mdadm --stop command returns in a short window. So udev rule
> > can open this device node and create the struct mddev in kernel again.
> >
> > So put del_gendisk in ioctl path and still leave put_disk in
> > md_kobj_release to avoid uaf.
>
> The md lifetime rules are complicated enough as-is.  So while I won't
> object to this change per-see I'd rather have it reviewed by the md
> maintainers independently.
>
> In the meantime this should ensure devtmpfs doesn't call into
> blkdev_get_no_open and thus put_disk:
>
> diff --git a/block/bdev.c b/block/bdev.c
> index 6a34179192c9..97d4c0ab1670 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -1274,18 +1274,23 @@ void sync_bdevs(bool wait)
>   */
>  void bdev_statx(const struct path *path, struct kstat *stat, u32 request_mask)
>  {
> -       struct inode *backing_inode;
>         struct block_device *bdev;
>
> -       backing_inode = d_backing_inode(path->dentry);
> -
>         /*
> -        * Note that backing_inode is the inode of a block device node file,
> -        * not the block device's internal inode.  Therefore it is *not* valid
> -        * to use I_BDEV() here; the block device has to be looked up by i_rdev
> -        * instead.
> +        * Note that d_backing_inode() returnsthe inode of a block device node
> +        * file, not the block device's internal inode.
> +        *
> +        * Therefore it is *not* valid to use I_BDEV() here; the block device
> +        * has to be looked up by i_rdev instead.
> +        *
> +        * Only do this lookup if actually needed to avoid the performance
> +        * overhead of the lookup, and to avoid injecting bdev lifetime issues
> +        * into devtmpfs.
>          */
> -       bdev = blkdev_get_no_open(backing_inode->i_rdev);
> +       if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC)))
> +               return;
> +
> +       bdev = blkdev_get_no_open(d_backing_inode(path->dentry)->i_rdev);
>         if (!bdev)
>                 return;
>
>

Hi Christoph

Thanks for the point. I tried this patch and it's stuck on the device
mapper device during shutdown.

[  848.520483] systemd-shutdow[-- MARK -- Tue Apr 22 08:20:00 2025]
[  989.175210] INFO: task systemd-shutdow:1 blocked for more than 122 seconds.
[  989.175667]       Not tainted 6.15.0-rc3+ #3
[  989.176327] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  989.176728] task:systemd-shutdow state:D stack:0     pid:1
tgid:1     ppid:0      task_flags:0x480100 flags:0x00004002
[  989.177287] Call Trace:
[  989.177429]  <TASK>
[  989.177910]  __schedule+0x2c7/0x690
[  989.178501]  schedule+0x23/0x80
[  989.179265]  schedule_timeout+0xe8/0x100
[  989.179488]  ? sched_clock_cpu+0xb/0x190
[  989.179703]  ? __smp_call_single_queue+0xac/0x120
[  989.180436]  ? ttwu_queue_wakelist+0xd0/0xf0
[  989.181065]  __wait_for_common+0x99/0x1c0
[  989.181287]  ? __pfx_schedule_timeout+0x10/0x10
[  989.181935]  devtmpfs_submit_req+0x6e/0x90
[  989.182154]  devtmpfs_delete_node+0x60/0x90
[  989.182374]  ? kernfs_remove+0x5e/0x70
[  989.182579]  ? sysfs_remove_group+0x46/0x90
[  989.182787]  device_del+0x315/0x3d0
[  989.183389]  ? mutex_lock+0xe/0x30
[  989.183963]  del_gendisk+0x216/0x320
[  989.184180]  cleanup_mapped_device+0x13f/0x150 [dm_mod]
[  989.184502]  __dm_destroy+0x12e/0x1f0 [dm_mod]
[  989.185162]  dev_remove+0x119/0x1a0 [dm_mod]
[  989.185830]  ctl_ioctl+0x21c/0x370 [dm_mod]
[  989.186090]  dm_ctl_ioctl+0xa/0x20 [dm_mod]
[  989.186325]  __x64_sys_ioctl+0x92/0xc0
[  989.186533]  do_syscall_64+0x79/0x160
[  989.186738]  ? blkg_alloc+0x10f/0x1c0
[  989.186976]  ? __memcg_slab_free_hook+0xf2/0x140
[  989.187616]  ? __x64_sys_close+0x39/0x80
[  989.187836]  ? kmem_cache_free+0x33c/0x450
[  989.188057]  ? syscall_exit_to_user_mode+0xc/0x1d0
[  989.188696]  ? do_syscall_64+0x85/0x160
[  989.188904]  ? __pfx_submit_bio_wait_endio+0x10/0x10
[  989.189222]  ? blkdev_fsync+0x41/0x60
[  989.189468]  ? syscall_exit_to_user_mode+0xc/0x1d0
[  989.190096]  ? do_syscall_64+0x85/0x160
[  989.190316]  ? do_sys_openat2+0x8c/0xd0
[  989.190524]  ? syscall_exit_to_user_mode+0xc/0x1d0
[  989.191160]  ? do_syscall_64+0x85/0x160
[  989.191374]  ? syscall_exit_to_user_mode+0xc/0x1d0
[  989.192001]  ? do_syscall_64+0x85/0x160
[  989.192226]  ? do_syscall_64+0x85/0x160
[  989.192434]  ? do_syscall_64+0x85/0x160
[  989.192641]  ? syscall_exit_to_user_mode+0xc/0x1d0
[  989.193287]  ? do_syscall_64+0x85/0x160
[  989.193509]  ? do_syscall_64+0x85/0x160
[  989.193719]  ? exc_page_fault+0x64/0x140
[  989.2: 00000246 ORIG_RAX: 0000000000000010
[  989.294592] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fef13903b0b
[  989.295388] RDX: 00007ffef60fbe50 RSI: 00000000c138fd04 RDI: 0000000000000005
[  989.296153] RBP: 00007ffef60fbfc0 R08: 0000000000000000 R09: 00007ffef60fb190
[  989.296936] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
[  989.297708] R13: 0000000000000000 R14: 00007ffef60fbe10 R15: 0000000000000000
[  989.298512]  </TASK>
[  989.298700] INFO: task kdevtmpfs:506 blocked for more than 123 seconds.
[  989.299139]       Not tainted 6.15.0-rc3+ #3
[  989.299795] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  989.300279] task:kdevtmpfs       state:D stack:0     pid:506
tgid:506   ppid:2      task_flags:0x208140 flags:0x00004000
[  989.300837] Call Trace:
[  989.300994]  <TASK>
[  68/0x290
[  989.701395]  ? __update_idle_core+0x23/0xb0
[  989.701611]  devtmpfs_work_loop+0x10d/0x2a0
[  989.701979]  ? __pfx_devtmpfsd+0x10/0x10
[  989.700x10/0x10
[  989.802290]  ret_from_fork+0x30/0x50
[  989.802498]  ? __pfx_kthread+0x10/0x10
[  989.802694]  ret_from_fork_asm+0x1a/0x30
[  989.802926]  </TASK>
[  989.803103] INFO: task kworker/38:1:586 blocked for more than 123 seconds.
[  989.803453]       Not tainted 6.15.0-rc3+ #3


There is another reason that I want to put del_gendisk in the ioctl
path. Because sometimes device node can still exist after command
mdadm --stop. del_gendisk removes inode first and then removes device
node (e.g. /dev/md0). So there is a small window that device node can
be open again. Then some strange things happen. Sometimes the array is
created but device node can't be created (I guess it's removed by
devtempfs?). Sometimes the kernel message prints "block device
autoloading is deprecated and will be removed".

If your patch can work well, md array can call
flush_workqueue(md_misc_wq) in do_md_stop to avoid the deadlock that
del_gendisk needs to wait all sysfs calls to return.


Regards
Xiao

Xiao


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 1/1] md: delete gendisk in ioctl path
  2025-04-22  8:31   ` Xiao Ni
@ 2025-04-22  8:40     ` Xiao Ni
  2025-04-22 13:48     ` Christoph Hellwig
  1 sibling, 0 replies; 8+ messages in thread
From: Xiao Ni @ 2025-04-22  8:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-raid, song, yukuai1, ming.lei, ncroxon

On Tue, Apr 22, 2025 at 4:31 PM Xiao Ni <xni@redhat.com> wrote:
>
> On Tue, Apr 22, 2025 at 1:51 PM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Tue, Apr 22, 2025 at 11:24:03AM +0800, Xiao Ni wrote:
> > > Now del_gendisk and put_disk are called asynchronously in workqueue work.
> > > del_gendisk deletes device node by devtmpfs. devtmpfs tries to open this
> > > array again and it flush the workqueue at the bigging of open process. So
> > > a deadlock happens.
> > >
> > > The asynchronous way also has a problem that the device node can still
> > > exist after mdadm --stop command returns in a short window. So udev rule
> > > can open this device node and create the struct mddev in kernel again.
> > >
> > > So put del_gendisk in ioctl path and still leave put_disk in
> > > md_kobj_release to avoid uaf.
> >
> > The md lifetime rules are complicated enough as-is.  So while I won't
> > object to this change per-see I'd rather have it reviewed by the md
> > maintainers independently.
> >
> > In the meantime this should ensure devtmpfs doesn't call into
> > blkdev_get_no_open and thus put_disk:
> >
> > diff --git a/block/bdev.c b/block/bdev.c
> > index 6a34179192c9..97d4c0ab1670 100644
> > --- a/block/bdev.c
> > +++ b/block/bdev.c
> > @@ -1274,18 +1274,23 @@ void sync_bdevs(bool wait)
> >   */
> >  void bdev_statx(const struct path *path, struct kstat *stat, u32 request_mask)
> >  {
> > -       struct inode *backing_inode;
> >         struct block_device *bdev;
> >
> > -       backing_inode = d_backing_inode(path->dentry);
> > -
> >         /*
> > -        * Note that backing_inode is the inode of a block device node file,
> > -        * not the block device's internal inode.  Therefore it is *not* valid
> > -        * to use I_BDEV() here; the block device has to be looked up by i_rdev
> > -        * instead.
> > +        * Note that d_backing_inode() returnsthe inode of a block device node
> > +        * file, not the block device's internal inode.
> > +        *
> > +        * Therefore it is *not* valid to use I_BDEV() here; the block device
> > +        * has to be looked up by i_rdev instead.
> > +        *
> > +        * Only do this lookup if actually needed to avoid the performance
> > +        * overhead of the lookup, and to avoid injecting bdev lifetime issues
> > +        * into devtmpfs.
> >          */
> > -       bdev = blkdev_get_no_open(backing_inode->i_rdev);
> > +       if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC)))
> > +               return;
> > +
> > +       bdev = blkdev_get_no_open(d_backing_inode(path->dentry)->i_rdev);
> >         if (!bdev)
> >                 return;
> >
> >
>
> Hi Christoph
>
> Thanks for the point. I tried this patch and it's stuck on the device
> mapper device during shutdown.

Sorry, please ignore the following error. I made a mistake when
building the kernel.

Regards
Xiao
>
> [  848.520483] systemd-shutdow[-- MARK -- Tue Apr 22 08:20:00 2025]
> [  989.175210] INFO: task systemd-shutdow:1 blocked for more than 122 seconds.
> [  989.175667]       Not tainted 6.15.0-rc3+ #3
> [  989.176327] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  989.176728] task:systemd-shutdow state:D stack:0     pid:1
> tgid:1     ppid:0      task_flags:0x480100 flags:0x00004002
> [  989.177287] Call Trace:
> [  989.177429]  <TASK>
> [  989.177910]  __schedule+0x2c7/0x690
> [  989.178501]  schedule+0x23/0x80
> [  989.179265]  schedule_timeout+0xe8/0x100
> [  989.179488]  ? sched_clock_cpu+0xb/0x190
> [  989.179703]  ? __smp_call_single_queue+0xac/0x120
> [  989.180436]  ? ttwu_queue_wakelist+0xd0/0xf0
> [  989.181065]  __wait_for_common+0x99/0x1c0
> [  989.181287]  ? __pfx_schedule_timeout+0x10/0x10
> [  989.181935]  devtmpfs_submit_req+0x6e/0x90
> [  989.182154]  devtmpfs_delete_node+0x60/0x90
> [  989.182374]  ? kernfs_remove+0x5e/0x70
> [  989.182579]  ? sysfs_remove_group+0x46/0x90
> [  989.182787]  device_del+0x315/0x3d0
> [  989.183389]  ? mutex_lock+0xe/0x30
> [  989.183963]  del_gendisk+0x216/0x320
> [  989.184180]  cleanup_mapped_device+0x13f/0x150 [dm_mod]
> [  989.184502]  __dm_destroy+0x12e/0x1f0 [dm_mod]
> [  989.185162]  dev_remove+0x119/0x1a0 [dm_mod]
> [  989.185830]  ctl_ioctl+0x21c/0x370 [dm_mod]
> [  989.186090]  dm_ctl_ioctl+0xa/0x20 [dm_mod]
> [  989.186325]  __x64_sys_ioctl+0x92/0xc0
> [  989.186533]  do_syscall_64+0x79/0x160
> [  989.186738]  ? blkg_alloc+0x10f/0x1c0
> [  989.186976]  ? __memcg_slab_free_hook+0xf2/0x140
> [  989.187616]  ? __x64_sys_close+0x39/0x80
> [  989.187836]  ? kmem_cache_free+0x33c/0x450
> [  989.188057]  ? syscall_exit_to_user_mode+0xc/0x1d0
> [  989.188696]  ? do_syscall_64+0x85/0x160
> [  989.188904]  ? __pfx_submit_bio_wait_endio+0x10/0x10
> [  989.189222]  ? blkdev_fsync+0x41/0x60
> [  989.189468]  ? syscall_exit_to_user_mode+0xc/0x1d0
> [  989.190096]  ? do_syscall_64+0x85/0x160
> [  989.190316]  ? do_sys_openat2+0x8c/0xd0
> [  989.190524]  ? syscall_exit_to_user_mode+0xc/0x1d0
> [  989.191160]  ? do_syscall_64+0x85/0x160
> [  989.191374]  ? syscall_exit_to_user_mode+0xc/0x1d0
> [  989.192001]  ? do_syscall_64+0x85/0x160
> [  989.192226]  ? do_syscall_64+0x85/0x160
> [  989.192434]  ? do_syscall_64+0x85/0x160
> [  989.192641]  ? syscall_exit_to_user_mode+0xc/0x1d0
> [  989.193287]  ? do_syscall_64+0x85/0x160
> [  989.193509]  ? do_syscall_64+0x85/0x160
> [  989.193719]  ? exc_page_fault+0x64/0x140
> [  989.2: 00000246 ORIG_RAX: 0000000000000010
> [  989.294592] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fef13903b0b
> [  989.295388] RDX: 00007ffef60fbe50 RSI: 00000000c138fd04 RDI: 0000000000000005
> [  989.296153] RBP: 00007ffef60fbfc0 R08: 0000000000000000 R09: 00007ffef60fb190
> [  989.296936] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
> [  989.297708] R13: 0000000000000000 R14: 00007ffef60fbe10 R15: 0000000000000000
> [  989.298512]  </TASK>
> [  989.298700] INFO: task kdevtmpfs:506 blocked for more than 123 seconds.
> [  989.299139]       Not tainted 6.15.0-rc3+ #3
> [  989.299795] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  989.300279] task:kdevtmpfs       state:D stack:0     pid:506
> tgid:506   ppid:2      task_flags:0x208140 flags:0x00004000
> [  989.300837] Call Trace:
> [  989.300994]  <TASK>
> [  68/0x290
> [  989.701395]  ? __update_idle_core+0x23/0xb0
> [  989.701611]  devtmpfs_work_loop+0x10d/0x2a0
> [  989.701979]  ? __pfx_devtmpfsd+0x10/0x10
> [  989.700x10/0x10
> [  989.802290]  ret_from_fork+0x30/0x50
> [  989.802498]  ? __pfx_kthread+0x10/0x10
> [  989.802694]  ret_from_fork_asm+0x1a/0x30
> [  989.802926]  </TASK>
> [  989.803103] INFO: task kworker/38:1:586 blocked for more than 123 seconds.
> [  989.803453]       Not tainted 6.15.0-rc3+ #3
>
>
> There is another reason that I want to put del_gendisk in the ioctl
> path. Because sometimes device node can still exist after command
> mdadm --stop. del_gendisk removes inode first and then removes device
> node (e.g. /dev/md0). So there is a small window that device node can
> be open again. Then some strange things happen. Sometimes the array is
> created but device node can't be created (I guess it's removed by
> devtempfs?). Sometimes the kernel message prints "block device
> autoloading is deprecated and will be removed".
>
> If your patch can work well, md array can call
> flush_workqueue(md_misc_wq) in do_md_stop to avoid the deadlock that
> del_gendisk needs to wait all sysfs calls to return.
>
>
> Regards
> Xiao
>
> Xiao


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 1/1] md: delete gendisk in ioctl path
  2025-04-22  5:51 ` Christoph Hellwig
  2025-04-22  8:31   ` Xiao Ni
@ 2025-04-22  9:24   ` Xiao Ni
  1 sibling, 0 replies; 8+ messages in thread
From: Xiao Ni @ 2025-04-22  9:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-raid, song, yukuai1, ming.lei, ncroxon

On Tue, Apr 22, 2025 at 1:51 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Apr 22, 2025 at 11:24:03AM +0800, Xiao Ni wrote:
> > Now del_gendisk and put_disk are called asynchronously in workqueue work.
> > del_gendisk deletes device node by devtmpfs. devtmpfs tries to open this
> > array again and it flush the workqueue at the bigging of open process. So
> > a deadlock happens.
> >
> > The asynchronous way also has a problem that the device node can still
> > exist after mdadm --stop command returns in a short window. So udev rule
> > can open this device node and create the struct mddev in kernel again.
> >
> > So put del_gendisk in ioctl path and still leave put_disk in
> > md_kobj_release to avoid uaf.
>
> The md lifetime rules are complicated enough as-is.  So while I won't
> object to this change per-see I'd rather have it reviewed by the md
> maintainers independently.
>
> In the meantime this should ensure devtmpfs doesn't call into
> blkdev_get_no_open and thus put_disk:
>
> diff --git a/block/bdev.c b/block/bdev.c
> index 6a34179192c9..97d4c0ab1670 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -1274,18 +1274,23 @@ void sync_bdevs(bool wait)
>   */
>  void bdev_statx(const struct path *path, struct kstat *stat, u32 request_mask)
>  {
> -       struct inode *backing_inode;
>         struct block_device *bdev;
>
> -       backing_inode = d_backing_inode(path->dentry);
> -
>         /*
> -        * Note that backing_inode is the inode of a block device node file,
> -        * not the block device's internal inode.  Therefore it is *not* valid
> -        * to use I_BDEV() here; the block device has to be looked up by i_rdev
> -        * instead.
> +        * Note that d_backing_inode() returnsthe inode of a block device node
> +        * file, not the block device's internal inode.
> +        *
> +        * Therefore it is *not* valid to use I_BDEV() here; the block device
> +        * has to be looked up by i_rdev instead.
> +        *
> +        * Only do this lookup if actually needed to avoid the performance
> +        * overhead of the lookup, and to avoid injecting bdev lifetime issues
> +        * into devtmpfs.
>          */
> -       bdev = blkdev_get_no_open(backing_inode->i_rdev);
> +       if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC)))
> +               return;
> +
> +       bdev = blkdev_get_no_open(d_backing_inode(path->dentry)->i_rdev);
>         if (!bdev)
>                 return;
>
>

This patch resolves this deadlock problem.
Tested-by: Xiao Ni <xni@redhat.com>

Regards
Xiao


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 1/1] md: delete gendisk in ioctl path
  2025-04-22  8:31   ` Xiao Ni
  2025-04-22  8:40     ` Xiao Ni
@ 2025-04-22 13:48     ` Christoph Hellwig
  1 sibling, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2025-04-22 13:48 UTC (permalink / raw)
  To: Xiao Ni; +Cc: linux-raid, song, yukuai1, ming.lei, ncroxon

On Tue, Apr 22, 2025 at 04:31:57PM +0800, Xiao Ni wrote:
> There is another reason that I want to put del_gendisk in the ioctl
> path. Because sometimes device node can still exist after command
> mdadm --stop. del_gendisk removes inode first and then removes device
> node (e.g. /dev/md0). So there is a small window that device node can
> be open again. Then some strange things happen. Sometimes the array is
> created but device node can't be created (I guess it's removed by
> devtempfs?). Sometimes the kernel message prints "block device
> autoloading is deprecated and will be removed".

FYI, I'm all for sorting out the md gendisk lifetime and probing
mess.  But let's do that separately from a regression fix as changes
in the area are intricate and might cause further regressions.


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-04-22 13:48 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-22  3:24 [RFC PATCH 1/1] md: delete gendisk in ioctl path Xiao Ni
2025-04-22  5:51 ` Christoph Hellwig
2025-04-22  8:31   ` Xiao Ni
2025-04-22  8:40     ` Xiao Ni
2025-04-22 13:48     ` Christoph Hellwig
2025-04-22  9:24   ` Xiao Ni
2025-04-22  6:21 ` Yu Kuai
2025-04-22  7:31   ` Xiao Ni

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox