add error handling to add_disk / device_add

public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed

* add error handling to add_disk / device_add_disk
@ 2021-08-18 14:45 Christoph Hellwig
  2021-08-18 14:45 ` [PATCH 01/11] block: add a sanity check for a live disk in del_gendisk Christoph Hellwig
                   ` (11 more replies)
  0 siblings, 12 replies; 42+ messages in thread
From: Christoph Hellwig @ 2021-08-18 14:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

Hi Jens,

this series does some refactoring and then adds support to return errors
from add_disk (rebasing a patch from Luis).  I think that alone is a huge
improvement as it leaves a disk for which add_disk failed in a defined
status, but the real improvement will be actually handling the errors in
the drivers.  This series contains two trivial conversions.  Luis has
a tree with conversions for all drivers in the tree, which will be fed
incrementally once this goes in.  Hopefully we can convert all the
commonly used drivers in this merge window.

This series sits on top of:

   "ensure each gendisk always has a request_queue reference v2"

A git tree is available here:

    git://git.infradead.org/users/hch/block.git

Gitweb:

    http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/add-disk-error-handling

Diffstat:
 block/blk-integrity.c         |   12 +-
 block/blk-sysfs.c             |    9 --
 block/blk.h                   |    7 -
 block/disk-events.c           |    7 -
 block/genhd.c                 |  186 ++++++++++++++++++++++--------------------
 drivers/block/null_blk/main.c |    3 
 drivers/block/virtio_blk.c    |    7 +
 include/linux/genhd.h         |    8 -
 8 files changed, 125 insertions(+), 114 deletions(-)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 01/11] block: add a sanity check for a live disk in del_gendisk
  2021-08-18 14:45 add error handling to add_disk / device_add_disk Christoph Hellwig
@ 2021-08-18 14:45 ` Christoph Hellwig
  2021-08-19 10:41   ` Hannes Reinecke
  2021-08-18 14:45 ` [PATCH 02/11] block: fold register_disk into device_add_disk Christoph Hellwig
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2021-08-18 14:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

Add a sanity check to del_gendisk to do nothing when the disk wasn't
successfully added.  This papers over the complete lack of add_disk
error handling, which is about to get fixed gradually.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/genhd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index 02cd9ec93e52..935f74c652f1 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -579,7 +579,7 @@ void del_gendisk(struct gendisk *disk)
 {
 	might_sleep();
 
-	if (WARN_ON_ONCE(!disk->queue))
+	if (WARN_ON_ONCE(!disk_live(disk)))
 		return;
 
 	blk_integrity_del(disk);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 02/11] block: fold register_disk into device_add_disk
  2021-08-18 14:45 add error handling to add_disk / device_add_disk Christoph Hellwig
  2021-08-18 14:45 ` [PATCH 01/11] block: add a sanity check for a live disk in del_gendisk Christoph Hellwig
@ 2021-08-18 14:45 ` Christoph Hellwig
  2021-08-19 10:48   ` Hannes Reinecke
  2021-08-18 14:45 ` [PATCH 03/11] block: call bdev_add later in device_add_disk Christoph Hellwig
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2021-08-18 14:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

There is no real reason these should be separate.  Also simplify the
groups assignment a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/genhd.c | 131 +++++++++++++++++++++++---------------------------
 1 file changed, 60 insertions(+), 71 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 935f74c652f1..ec4be5889fbf 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -409,71 +409,6 @@ static void disk_scan_partitions(struct gendisk *disk)
 		blkdev_put(bdev, FMODE_READ);
 }
 
-static void register_disk(struct device *parent, struct gendisk *disk,
-			  const struct attribute_group **groups)
-{
-	struct device *ddev = disk_to_dev(disk);
-	int err;
-
-	ddev->parent = parent;
-
-	dev_set_name(ddev, "%s", disk->disk_name);
-
-	/* delay uevents, until we scanned partition table */
-	dev_set_uevent_suppress(ddev, 1);
-
-	if (groups) {
-		WARN_ON(ddev->groups);
-		ddev->groups = groups;
-	}
-	if (device_add(ddev))
-		return;
-	if (!sysfs_deprecated) {
-		err = sysfs_create_link(block_depr, &ddev->kobj,
-					kobject_name(&ddev->kobj));
-		if (err) {
-			device_del(ddev);
-			return;
-		}
-	}
-
-	/*
-	 * avoid probable deadlock caused by allocating memory with
-	 * GFP_KERNEL in runtime_resume callback of its all ancestor
-	 * devices
-	 */
-	pm_runtime_set_memalloc_noio(ddev, true);
-
-	disk->part0->bd_holder_dir =
-		kobject_create_and_add("holders", &ddev->kobj);
-	disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj);
-
-	/*
-	 * XXX: this is a mess, can't wait for real error handling in add_disk.
-	 * Make sure ->slave_dir is NULL if we failed some of the registration
-	 * so that the cleanup in bd_unlink_disk_holder works properly.
-	 */
-	if (bd_register_pending_holders(disk) < 0) {
-		kobject_put(disk->slave_dir);
-		disk->slave_dir = NULL;
-	}
-
-	if (disk->flags & GENHD_FL_HIDDEN)
-		return;
-
-	disk_scan_partitions(disk);
-
-	/* announce the disk and partitions after all partitions are created */
-	dev_set_uevent_suppress(ddev, 0);
-	disk_uevent(disk, KOBJ_ADD);
-
-	if (disk->bdi->dev) {
-		err = sysfs_create_link(&ddev->kobj, &disk->bdi->dev->kobj,
-					"bdi");
-		WARN_ON(err);
-	}
-}
-
 /**
  * device_add_disk - add disk information to kernel list
  * @parent: parent device for the disk
@@ -490,6 +425,7 @@ void device_add_disk(struct device *parent, struct gendisk *disk,
 		     const struct attribute_group **groups)
 
 {
+	struct device *ddev = disk_to_dev(disk);
 	int ret;
 
 	/*
@@ -538,17 +474,70 @@ void device_add_disk(struct device *parent, struct gendisk *disk,
 		disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO;
 		disk->flags |= GENHD_FL_NO_PART_SCAN;
 	} else {
-		struct device *dev = disk_to_dev(disk);
-
 		/* Register BDI before referencing it from bdev */
-		dev->devt = MKDEV(disk->major, disk->first_minor);
+		ddev->devt = MKDEV(disk->major, disk->first_minor);
 		ret = bdi_register(disk->bdi, "%u:%u",
 				   disk->major, disk->first_minor);
 		WARN_ON(ret);
-		bdi_set_owner(disk->bdi, dev);
-		bdev_add(disk->part0, dev->devt);
+		bdi_set_owner(disk->bdi, ddev);
+		bdev_add(disk->part0, ddev->devt);
+	}
+
+	/* delay uevents, until we scanned partition table */
+	dev_set_uevent_suppress(ddev, 1);
+
+	ddev->parent = parent;
+	ddev->groups = groups;
+	dev_set_name(ddev, "%s", disk->disk_name);
+	if (device_add(ddev))
+		return;
+	if (!sysfs_deprecated) {
+		ret = sysfs_create_link(block_depr, &ddev->kobj,
+					kobject_name(&ddev->kobj));
+		if (ret) {
+			device_del(ddev);
+			return;
+		}
 	}
-	register_disk(parent, disk, groups);
+
+	/*
+	 * avoid probable deadlock caused by allocating memory with
+	 * GFP_KERNEL in runtime_resume callback of its all ancestor
+	 * devices
+	 */
+	pm_runtime_set_memalloc_noio(ddev, true);
+
+	disk->part0->bd_holder_dir =
+		kobject_create_and_add("holders", &ddev->kobj);
+	disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj);
+
+	/*
+	 * XXX: this is a mess, can't wait for real error handling in add_disk.
+	 * Make sure ->slave_dir is NULL if we failed some of the registration
+	 * so that the cleanup in bd_unlink_disk_holder works properly.
+	 */
+	if (bd_register_pending_holders(disk) < 0) {
+		kobject_put(disk->slave_dir);
+		disk->slave_dir = NULL;
+	}
+
+	if (!(disk->flags & GENHD_FL_HIDDEN)) {
+		disk_scan_partitions(disk);
+
+		/*
+		 * Announce the disk and partitions after all partitions are
+		 * created.
+		 */
+		dev_set_uevent_suppress(ddev, 0);
+		disk_uevent(disk, KOBJ_ADD);
+
+		if (disk->bdi->dev) {
+			ret = sysfs_create_link(&ddev->kobj,
+						&disk->bdi->dev->kobj, "bdi");
+			WARN_ON(ret);
+		}
+	}
+
 	blk_register_queue(disk);
 
 	disk_add_events(disk);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 03/11] block: call bdev_add later in device_add_disk
  2021-08-18 14:45 add error handling to add_disk / device_add_disk Christoph Hellwig
  2021-08-18 14:45 ` [PATCH 01/11] block: add a sanity check for a live disk in del_gendisk Christoph Hellwig
  2021-08-18 14:45 ` [PATCH 02/11] block: fold register_disk into device_add_disk Christoph Hellwig
@ 2021-08-18 14:45 ` Christoph Hellwig
  2021-08-20  6:06   ` Hannes Reinecke
  2025-10-31  7:46   ` question about bd_inode hashing against device_add() // " Gao Xiang
  2021-08-18 14:45 ` [PATCH 04/11] block: create the bdi link earlier " Christoph Hellwig
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 42+ messages in thread
From: Christoph Hellwig @ 2021-08-18 14:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

Once bdev_add is called userspace can open the block device.  Ensure
that the struct device, which is used for refcounting of the disk
besides various other things, is fully setup at that point.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/genhd.c | 36 +++++++++++++++++-------------------
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index ec4be5889fbf..ab455f110be2 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -466,29 +466,14 @@ void device_add_disk(struct device *parent, struct gendisk *disk,
 
 	disk_alloc_events(disk);
 
-	if (disk->flags & GENHD_FL_HIDDEN) {
-		/*
-		 * Don't let hidden disks show up in /proc/partitions,
-		 * and don't bother scanning for partitions either.
-		 */
-		disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO;
-		disk->flags |= GENHD_FL_NO_PART_SCAN;
-	} else {
-		/* Register BDI before referencing it from bdev */
-		ddev->devt = MKDEV(disk->major, disk->first_minor);
-		ret = bdi_register(disk->bdi, "%u:%u",
-				   disk->major, disk->first_minor);
-		WARN_ON(ret);
-		bdi_set_owner(disk->bdi, ddev);
-		bdev_add(disk->part0, ddev->devt);
-	}
-
 	/* delay uevents, until we scanned partition table */
 	dev_set_uevent_suppress(ddev, 1);
 
 	ddev->parent = parent;
 	ddev->groups = groups;
 	dev_set_name(ddev, "%s", disk->disk_name);
+	if (!(disk->flags & GENHD_FL_HIDDEN))
+		ddev->devt = MKDEV(disk->major, disk->first_minor);
 	if (device_add(ddev))
 		return;
 	if (!sysfs_deprecated) {
@@ -521,12 +506,25 @@ void device_add_disk(struct device *parent, struct gendisk *disk,
 		disk->slave_dir = NULL;
 	}
 
-	if (!(disk->flags & GENHD_FL_HIDDEN)) {
+	if (disk->flags & GENHD_FL_HIDDEN) {
+		/*
+		 * Don't let hidden disks show up in /proc/partitions,
+		 * and don't bother scanning for partitions either.
+		 */
+		disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO;
+		disk->flags |= GENHD_FL_NO_PART_SCAN;
+	} else {
+		ret = bdi_register(disk->bdi, "%u:%u",
+				   disk->major, disk->first_minor);
+		WARN_ON(ret);
+		bdi_set_owner(disk->bdi, ddev);
+		bdev_add(disk->part0, ddev->devt);
+
 		disk_scan_partitions(disk);
 
 		/*
 		 * Announce the disk and partitions after all partitions are
-		 * created.
+		 * created. (for hidden disks uevents remain suppressed forever)
 		 */
 		dev_set_uevent_suppress(ddev, 0);
 		disk_uevent(disk, KOBJ_ADD);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 04/11] block: create the bdi link earlier in device_add_disk
  2021-08-18 14:45 add error handling to add_disk / device_add_disk Christoph Hellwig
                   ` (2 preceding siblings ...)
  2021-08-18 14:45 ` [PATCH 03/11] block: call bdev_add later in device_add_disk Christoph Hellwig
@ 2021-08-18 14:45 ` Christoph Hellwig
  2021-08-20  6:08   ` Hannes Reinecke
  2021-08-18 14:45 ` [PATCH 05/11] block: call blk_integrity_add " Christoph Hellwig
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2021-08-18 14:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

This will simplify error handling going forward.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/genhd.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index ab455f110be2..f05e58f214d2 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -518,8 +518,13 @@ void device_add_disk(struct device *parent, struct gendisk *disk,
 				   disk->major, disk->first_minor);
 		WARN_ON(ret);
 		bdi_set_owner(disk->bdi, ddev);
-		bdev_add(disk->part0, ddev->devt);
+		if (disk->bdi->dev) {
+			ret = sysfs_create_link(&ddev->kobj,
+						&disk->bdi->dev->kobj, "bdi");
+			WARN_ON(ret);
+		}
 
+		bdev_add(disk->part0, ddev->devt);
 		disk_scan_partitions(disk);
 
 		/*
@@ -528,12 +533,6 @@ void device_add_disk(struct device *parent, struct gendisk *disk,
 		 */
 		dev_set_uevent_suppress(ddev, 0);
 		disk_uevent(disk, KOBJ_ADD);
-
-		if (disk->bdi->dev) {
-			ret = sysfs_create_link(&ddev->kobj,
-						&disk->bdi->dev->kobj, "bdi");
-			WARN_ON(ret);
-		}
 	}
 
 	blk_register_queue(disk);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 05/11] block: call blk_integrity_add earlier in device_add_disk
  2021-08-18 14:45 add error handling to add_disk / device_add_disk Christoph Hellwig
                   ` (3 preceding siblings ...)
  2021-08-18 14:45 ` [PATCH 04/11] block: create the bdi link earlier " Christoph Hellwig
@ 2021-08-18 14:45 ` Christoph Hellwig
  2021-08-20  6:08   ` Hannes Reinecke
  2021-08-18 14:45 ` [PATCH 06/11] block: call blk_register_queue " Christoph Hellwig
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2021-08-18 14:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

Doing all the sysfs file creation before adding the bdev and thus
allowing it to be opened will simplify the about to be added error
handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/genhd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index f05e58f214d2..75d900e4c82f 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -492,6 +492,8 @@ void device_add_disk(struct device *parent, struct gendisk *disk,
 	 */
 	pm_runtime_set_memalloc_noio(ddev, true);
 
+	blk_integrity_add(disk);
+
 	disk->part0->bd_holder_dir =
 		kobject_create_and_add("holders", &ddev->kobj);
 	disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj);
@@ -538,7 +540,6 @@ void device_add_disk(struct device *parent, struct gendisk *disk,
 	blk_register_queue(disk);
 
 	disk_add_events(disk);
-	blk_integrity_add(disk);
 }
 EXPORT_SYMBOL(device_add_disk);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 06/11] block: call blk_register_queue earlier in device_add_disk
  2021-08-18 14:45 add error handling to add_disk / device_add_disk Christoph Hellwig
                   ` (4 preceding siblings ...)
  2021-08-18 14:45 ` [PATCH 05/11] block: call blk_integrity_add " Christoph Hellwig
@ 2021-08-18 14:45 ` Christoph Hellwig
  2021-08-20  6:09   ` Hannes Reinecke
  2021-08-18 14:45 ` [PATCH 07/11] block: return errors from blk_integrity_add Christoph Hellwig
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2021-08-18 14:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

Ensure that all the sysfs bits are set up before bdev_add is called,
as that will make the upcomding error handling much easier.  However
this means the call to disk_update_readahead has to be split as that
requires a bdi.  Also remove various sanity checks that don't make
sense now that blk_register_queue only has a single caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-sysfs.c | 9 ---------
 block/genhd.c     | 5 +++--
 2 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 7fd99487300c..614d9d47de36 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -856,15 +856,6 @@ int blk_register_queue(struct gendisk *disk)
 	struct device *dev = disk_to_dev(disk);
 	struct request_queue *q = disk->queue;
 
-	if (WARN_ON(!q))
-		return -ENXIO;
-
-	WARN_ONCE(blk_queue_registered(q),
-		  "%s is registering an already registered queue\n",
-		  kobject_name(&dev->kobj));
-
-	disk_update_readahead(disk);
-
 	ret = blk_trace_init_sysfs(dev);
 	if (ret)
 		return ret;
diff --git a/block/genhd.c b/block/genhd.c
index 75d900e4c82f..a54b4849242c 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -508,6 +508,8 @@ void device_add_disk(struct device *parent, struct gendisk *disk,
 		disk->slave_dir = NULL;
 	}
 
+	blk_register_queue(disk);
+
 	if (disk->flags & GENHD_FL_HIDDEN) {
 		/*
 		 * Don't let hidden disks show up in /proc/partitions,
@@ -537,8 +539,7 @@ void device_add_disk(struct device *parent, struct gendisk *disk,
 		disk_uevent(disk, KOBJ_ADD);
 	}
 
-	blk_register_queue(disk);
-
+	disk_update_readahead(disk);
 	disk_add_events(disk);
 }
 EXPORT_SYMBOL(device_add_disk);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 07/11] block: return errors from blk_integrity_add
  2021-08-18 14:45 add error handling to add_disk / device_add_disk Christoph Hellwig
                   ` (5 preceding siblings ...)
  2021-08-18 14:45 ` [PATCH 06/11] block: call blk_register_queue " Christoph Hellwig
@ 2021-08-18 14:45 ` Christoph Hellwig
  2021-08-20  6:10   ` Hannes Reinecke
  2021-08-18 14:45 ` [PATCH 08/11] block: return errors from disk_alloc_events Christoph Hellwig
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2021-08-18 14:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

From: Luis Chamberlain <mcgrof@kernel.org>

Prepare for proper error handling in add_disk.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-integrity.c | 12 +++++++-----
 block/blk.h           |  5 +++--
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 410da060d1f5..69a12177dfb6 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -431,13 +431,15 @@ void blk_integrity_unregister(struct gendisk *disk)
 }
 EXPORT_SYMBOL(blk_integrity_unregister);
 
-void blk_integrity_add(struct gendisk *disk)
+int blk_integrity_add(struct gendisk *disk)
 {
-	if (kobject_init_and_add(&disk->integrity_kobj, &integrity_ktype,
-				 &disk_to_dev(disk)->kobj, "%s", "integrity"))
-		return;
+	int ret;
 
-	kobject_uevent(&disk->integrity_kobj, KOBJ_ADD);
+	ret = kobject_init_and_add(&disk->integrity_kobj, &integrity_ktype,
+				   &disk_to_dev(disk)->kobj, "%s", "integrity");
+	if (!ret)
+		kobject_uevent(&disk->integrity_kobj, KOBJ_ADD);
+	return ret;
 }
 
 void blk_integrity_del(struct gendisk *disk)
diff --git a/block/blk.h b/block/blk.h
index 148bdcd3aa08..c9727dd56da7 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -132,7 +132,7 @@ static inline bool integrity_req_gap_front_merge(struct request *req,
 				bip_next->bip_vec[0].bv_offset);
 }
 
-void blk_integrity_add(struct gendisk *);
+int blk_integrity_add(struct gendisk *disk);
 void blk_integrity_del(struct gendisk *);
 #else /* CONFIG_BLK_DEV_INTEGRITY */
 static inline bool blk_integrity_merge_rq(struct request_queue *rq,
@@ -166,8 +166,9 @@ static inline bool bio_integrity_endio(struct bio *bio)
 static inline void bio_integrity_free(struct bio *bio)
 {
 }
-static inline void blk_integrity_add(struct gendisk *disk)
+static inline int blk_integrity_add(struct gendisk *disk)
 {
+	return 0;
 }
 static inline void blk_integrity_del(struct gendisk *disk)
 {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 08/11] block: return errors from disk_alloc_events
  2021-08-18 14:45 add error handling to add_disk / device_add_disk Christoph Hellwig
                   ` (6 preceding siblings ...)
  2021-08-18 14:45 ` [PATCH 07/11] block: return errors from blk_integrity_add Christoph Hellwig
@ 2021-08-18 14:45 ` Christoph Hellwig
  2021-08-20  6:10   ` Hannes Reinecke
  2021-08-18 14:45 ` [PATCH 09/11] block: add error handling for device_add_disk / add_disk Christoph Hellwig
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2021-08-18 14:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

From: Luis Chamberlain <mcgrof@kernel.org>

Prepare for proper error handling in add_disk.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk.h         | 2 +-
 block/disk-events.c | 7 ++++---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index c9727dd56da7..bbbcc1a64a2d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -362,7 +362,7 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 
 struct request_queue *blk_alloc_queue(int node_id);
 
-void disk_alloc_events(struct gendisk *disk);
+int disk_alloc_events(struct gendisk *disk);
 void disk_add_events(struct gendisk *disk);
 void disk_del_events(struct gendisk *disk);
 void disk_release_events(struct gendisk *disk);
diff --git a/block/disk-events.c b/block/disk-events.c
index 7445b8ff2775..8d5496e7592a 100644
--- a/block/disk-events.c
+++ b/block/disk-events.c
@@ -444,17 +444,17 @@ module_param_cb(events_dfl_poll_msecs, &disk_events_dfl_poll_msecs_param_ops,
 /*
  * disk_{alloc|add|del|release}_events - initialize and destroy disk_events.
  */
-void disk_alloc_events(struct gendisk *disk)
+int disk_alloc_events(struct gendisk *disk)
 {
 	struct disk_events *ev;
 
 	if (!disk->fops->check_events || !disk->events)
-		return;
+		return 0;
 
 	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
 	if (!ev) {
 		pr_warn("%s: failed to initialize events\n", disk->disk_name);
-		return;
+		return -ENOMEM;
 	}
 
 	INIT_LIST_HEAD(&ev->node);
@@ -466,6 +466,7 @@ void disk_alloc_events(struct gendisk *disk)
 	INIT_DELAYED_WORK(&ev->dwork, disk_events_workfn);
 
 	disk->ev = ev;
+	return 0;
 }
 
 void disk_add_events(struct gendisk *disk)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 09/11] block: add error handling for device_add_disk / add_disk
  2021-08-18 14:45 add error handling to add_disk / device_add_disk Christoph Hellwig
                   ` (7 preceding siblings ...)
  2021-08-18 14:45 ` [PATCH 08/11] block: return errors from disk_alloc_events Christoph Hellwig
@ 2021-08-18 14:45 ` Christoph Hellwig
  2021-08-20  6:12   ` Hannes Reinecke
  2021-08-18 14:45 ` [PATCH 10/11] virtio_blk: add error handling support for add_disk() Christoph Hellwig
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2021-08-18 14:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

From: Luis Chamberlain <mcgrof@kernel.org>

Properly unwind on errors in device_add_disk.  This is the initial work
as drivers are not converted yet, which will follow in separate patches.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
[hch: major rebase.  All bugs are probably mine]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/genhd.c         | 92 +++++++++++++++++++++++++++----------------
 include/linux/genhd.h |  8 ++--
 2 files changed, 62 insertions(+), 38 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index a54b4849242c..a925f773145f 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -417,11 +417,8 @@ static void disk_scan_partitions(struct gendisk *disk)
  *
  * This function registers the partitioning information in @disk
  * with the kernel.
- *
- * FIXME: error handling
  */
-
-void device_add_disk(struct device *parent, struct gendisk *disk,
+int device_add_disk(struct device *parent, struct gendisk *disk,
 		     const struct attribute_group **groups)
 
 {
@@ -444,7 +441,8 @@ void device_add_disk(struct device *parent, struct gendisk *disk,
 	 * and all partitions from the extended dev_t space.
 	 */
 	if (disk->major) {
-		WARN_ON(!disk->minors);
+		if (WARN_ON(!disk->minors))
+			return -EINVAL;
 
 		if (disk->minors > DISK_MAX_PARTS) {
 			pr_err("block: can't allocate more than %d partitions\n",
@@ -452,19 +450,20 @@ void device_add_disk(struct device *parent, struct gendisk *disk,
 			disk->minors = DISK_MAX_PARTS;
 		}
 	} else {
-		WARN_ON(disk->minors);
+		if (WARN_ON(disk->minors))
+			return -EINVAL;
 
 		ret = blk_alloc_ext_minor();
-		if (ret < 0) {
-			WARN_ON(1);
-			return;
-		}
+		if (ret < 0)
+			return ret;
 		disk->major = BLOCK_EXT_MAJOR;
 		disk->first_minor = MINOR(ret);
 		disk->flags |= GENHD_FL_EXT_DEVT;
 	}
 
-	disk_alloc_events(disk);
+	ret = disk_alloc_events(disk);
+	if (ret)
+		goto out_free_ext_minor;
 
 	/* delay uevents, until we scanned partition table */
 	dev_set_uevent_suppress(ddev, 1);
@@ -474,15 +473,14 @@ void device_add_disk(struct device *parent, struct gendisk *disk,
 	dev_set_name(ddev, "%s", disk->disk_name);
 	if (!(disk->flags & GENHD_FL_HIDDEN))
 		ddev->devt = MKDEV(disk->major, disk->first_minor);
-	if (device_add(ddev))
-		return;
+	ret = device_add(ddev);
+	if (ret)
+		goto out_disk_release_events;
 	if (!sysfs_deprecated) {
 		ret = sysfs_create_link(block_depr, &ddev->kobj,
 					kobject_name(&ddev->kobj));
-		if (ret) {
-			device_del(ddev);
-			return;
-		}
+		if (ret)
+			goto out_device_del;
 	}
 
 	/*
@@ -492,23 +490,25 @@ void device_add_disk(struct device *parent, struct gendisk *disk,
 	 */
 	pm_runtime_set_memalloc_noio(ddev, true);
 
-	blk_integrity_add(disk);
+	ret = blk_integrity_add(disk);
+	if (ret)
+		goto out_del_block_link;
 
 	disk->part0->bd_holder_dir =
 		kobject_create_and_add("holders", &ddev->kobj);
+	if (!disk->part0->bd_holder_dir)
+		goto out_del_integrity;
 	disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj);
+	if (!disk->slave_dir)
+		goto out_put_holder_dir;
 
-	/*
-	 * XXX: this is a mess, can't wait for real error handling in add_disk.
-	 * Make sure ->slave_dir is NULL if we failed some of the registration
-	 * so that the cleanup in bd_unlink_disk_holder works properly.
-	 */
-	if (bd_register_pending_holders(disk) < 0) {
-		kobject_put(disk->slave_dir);
-		disk->slave_dir = NULL;
-	}
+	ret = bd_register_pending_holders(disk);
+	if (ret < 0)
+		goto out_put_slave_dir;
 
-	blk_register_queue(disk);
+	ret = blk_register_queue(disk);
+	if (ret)
+		goto out_put_slave_dir;
 
 	if (disk->flags & GENHD_FL_HIDDEN) {
 		/*
@@ -520,13 +520,13 @@ void device_add_disk(struct device *parent, struct gendisk *disk,
 	} else {
 		ret = bdi_register(disk->bdi, "%u:%u",
 				   disk->major, disk->first_minor);
-		WARN_ON(ret);
+		if (ret)
+			goto out_unregister_queue;
 		bdi_set_owner(disk->bdi, ddev);
-		if (disk->bdi->dev) {
-			ret = sysfs_create_link(&ddev->kobj,
-						&disk->bdi->dev->kobj, "bdi");
-			WARN_ON(ret);
-		}
+		ret = sysfs_create_link(&ddev->kobj,
+					&disk->bdi->dev->kobj, "bdi");
+		if (ret)
+			goto out_unregister_bdi;
 
 		bdev_add(disk->part0, ddev->devt);
 		disk_scan_partitions(disk);
@@ -541,6 +541,30 @@ void device_add_disk(struct device *parent, struct gendisk *disk,
 
 	disk_update_readahead(disk);
 	disk_add_events(disk);
+	return 0;
+
+out_unregister_bdi:
+	if (!(disk->flags & GENHD_FL_HIDDEN))
+		bdi_unregister(disk->bdi);
+out_unregister_queue:
+	blk_unregister_queue(disk);
+out_put_slave_dir:
+	kobject_put(disk->slave_dir);
+out_put_holder_dir:
+	kobject_put(disk->part0->bd_holder_dir);
+out_del_integrity:
+	blk_integrity_del(disk);
+out_del_block_link:
+	if (!sysfs_deprecated)
+		sysfs_remove_link(block_depr, dev_name(ddev));
+out_device_del:
+	device_del(ddev);
+out_disk_release_events:
+	disk_release_events(disk);
+out_free_ext_minor:
+	if (disk->major == BLOCK_EXT_MAJOR)
+		blk_free_ext_minor(disk->first_minor);
+	return WARN_ON_ONCE(ret); /* keep until all callers handle errors */
 }
 EXPORT_SYMBOL(device_add_disk);
 
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 55acefdd8a20..c68d83c87f83 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -214,11 +214,11 @@ static inline dev_t disk_devt(struct gendisk *disk)
 void disk_uevent(struct gendisk *disk, enum kobject_action action);
 
 /* block/genhd.c */
-extern void device_add_disk(struct device *parent, struct gendisk *disk,
-			    const struct attribute_group **groups);
-static inline void add_disk(struct gendisk *disk)
+int device_add_disk(struct device *parent, struct gendisk *disk,
+		const struct attribute_group **groups);
+static inline int add_disk(struct gendisk *disk)
 {
-	device_add_disk(NULL, disk, NULL);
+	return device_add_disk(NULL, disk, NULL);
 }
 extern void del_gendisk(struct gendisk *gp);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 10/11] virtio_blk: add error handling support for add_disk()
  2021-08-18 14:45 add error handling to add_disk / device_add_disk Christoph Hellwig
                   ` (8 preceding siblings ...)
  2021-08-18 14:45 ` [PATCH 09/11] block: add error handling for device_add_disk / add_disk Christoph Hellwig
@ 2021-08-18 14:45 ` Christoph Hellwig
  2021-08-20  6:12   ` Hannes Reinecke
  2021-08-18 14:45 ` [PATCH 11/11] null_blk: " Christoph Hellwig
  2021-08-23 18:57 ` add error handling to add_disk / device_add_disk Jens Axboe
  11 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2021-08-18 14:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

From: Luis Chamberlain <mcgrof@kernel.org>

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/block/virtio_blk.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 767b4f72a54d..63dc121a4c43 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -875,9 +875,14 @@ static int virtblk_probe(struct virtio_device *vdev)
 	virtblk_update_capacity(vblk, false);
 	virtio_device_ready(vdev);
 
-	device_add_disk(&vdev->dev, vblk->disk, virtblk_attr_groups);
+	err = device_add_disk(&vdev->dev, vblk->disk, virtblk_attr_groups);
+	if (err)
+		goto out_cleanup_disk;
+
 	return 0;
 
+out_cleanup_disk:
+	blk_cleanup_disk(vblk->disk);
 out_free_tags:
 	blk_mq_free_tag_set(&vblk->tag_set);
 out_free_vq:
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 11/11] null_blk: add error handling support for add_disk()
  2021-08-18 14:45 add error handling to add_disk / device_add_disk Christoph Hellwig
                   ` (9 preceding siblings ...)
  2021-08-18 14:45 ` [PATCH 10/11] virtio_blk: add error handling support for add_disk() Christoph Hellwig
@ 2021-08-18 14:45 ` Christoph Hellwig
  2021-08-20  6:13   ` Hannes Reinecke
  2021-08-23 18:57 ` add error handling to add_disk / device_add_disk Jens Axboe
  11 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2021-08-18 14:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

From: Luis Chamberlain <mcgrof@kernel.org>

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling. The actual cleanup in case of error is
already handled by the caller of null_gendisk_register().

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/block/null_blk/main.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index f128242d1170..187d779c8ca0 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -1717,8 +1717,7 @@ static int null_gendisk_register(struct nullb *nullb)
 			return ret;
 	}
 
-	add_disk(disk);
-	return 0;
+	return add_disk(disk);
 }
 
 static int null_init_tag_set(struct nullb *nullb, struct blk_mq_tag_set *set)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/11] block: add a sanity check for a live disk in del_gendisk
  2021-08-18 14:45 ` [PATCH 01/11] block: add a sanity check for a live disk in del_gendisk Christoph Hellwig
@ 2021-08-19 10:41   ` Hannes Reinecke
  0 siblings, 0 replies; 42+ messages in thread
From: Hannes Reinecke @ 2021-08-19 10:41 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

On 8/18/21 4:45 PM, Christoph Hellwig wrote:
> Add a sanity check to del_gendisk to do nothing when the disk wasn't
> successfully added.  This papers over the complete lack of add_disk
> error handling, which is about to get fixed gradually.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/genhd.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 02cd9ec93e52..935f74c652f1 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -579,7 +579,7 @@ void del_gendisk(struct gendisk *disk)
>   {
>   	might_sleep();
>   
> -	if (WARN_ON_ONCE(!disk->queue))
> +	if (WARN_ON_ONCE(!disk_live(disk)))
>   		return;
>   
>   	blk_integrity_del(disk);
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/11] block: fold register_disk into device_add_disk
  2021-08-18 14:45 ` [PATCH 02/11] block: fold register_disk into device_add_disk Christoph Hellwig
@ 2021-08-19 10:48   ` Hannes Reinecke
  0 siblings, 0 replies; 42+ messages in thread
From: Hannes Reinecke @ 2021-08-19 10:48 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

On 8/18/21 4:45 PM, Christoph Hellwig wrote:
> There is no real reason these should be separate.  Also simplify the
> groups assignment a bit.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/genhd.c | 131 +++++++++++++++++++++++---------------------------
>   1 file changed, 60 insertions(+), 71 deletions(-)
> Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2021-08-18 14:45 ` [PATCH 03/11] block: call bdev_add later in device_add_disk Christoph Hellwig
@ 2021-08-20  6:06   ` Hannes Reinecke
  2025-10-31  7:46   ` question about bd_inode hashing against device_add() // " Gao Xiang
  1 sibling, 0 replies; 42+ messages in thread
From: Hannes Reinecke @ 2021-08-20  6:06 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

On 8/18/21 4:45 PM, Christoph Hellwig wrote:
> Once bdev_add is called userspace can open the block device.  Ensure
> that the struct device, which is used for refcounting of the disk
> besides various other things, is fully setup at that point.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/genhd.c | 36 +++++++++++++++++-------------------
>   1 file changed, 17 insertions(+), 19 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 04/11] block: create the bdi link earlier in device_add_disk
  2021-08-18 14:45 ` [PATCH 04/11] block: create the bdi link earlier " Christoph Hellwig
@ 2021-08-20  6:08   ` Hannes Reinecke
  0 siblings, 0 replies; 42+ messages in thread
From: Hannes Reinecke @ 2021-08-20  6:08 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

On 8/18/21 4:45 PM, Christoph Hellwig wrote:
> This will simplify error handling going forward.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/genhd.c | 13 ++++++-------
>   1 file changed, 6 insertions(+), 7 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 05/11] block: call blk_integrity_add earlier in device_add_disk
  2021-08-18 14:45 ` [PATCH 05/11] block: call blk_integrity_add " Christoph Hellwig
@ 2021-08-20  6:08   ` Hannes Reinecke
  0 siblings, 0 replies; 42+ messages in thread
From: Hannes Reinecke @ 2021-08-20  6:08 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

On 8/18/21 4:45 PM, Christoph Hellwig wrote:
> Doing all the sysfs file creation before adding the bdev and thus
> allowing it to be opened will simplify the about to be added error
> handling.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/genhd.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/11] block: call blk_register_queue earlier in device_add_disk
  2021-08-18 14:45 ` [PATCH 06/11] block: call blk_register_queue " Christoph Hellwig
@ 2021-08-20  6:09   ` Hannes Reinecke
  0 siblings, 0 replies; 42+ messages in thread
From: Hannes Reinecke @ 2021-08-20  6:09 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

On 8/18/21 4:45 PM, Christoph Hellwig wrote:
> Ensure that all the sysfs bits are set up before bdev_add is called,
> as that will make the upcomding error handling much easier.  However
> this means the call to disk_update_readahead has to be split as that
> requires a bdi.  Also remove various sanity checks that don't make
> sense now that blk_register_queue only has a single caller.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-sysfs.c | 9 ---------
>   block/genhd.c     | 5 +++--
>   2 files changed, 3 insertions(+), 11 deletions(-)
> Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 07/11] block: return errors from blk_integrity_add
  2021-08-18 14:45 ` [PATCH 07/11] block: return errors from blk_integrity_add Christoph Hellwig
@ 2021-08-20  6:10   ` Hannes Reinecke
  0 siblings, 0 replies; 42+ messages in thread
From: Hannes Reinecke @ 2021-08-20  6:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

On 8/18/21 4:45 PM, Christoph Hellwig wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Prepare for proper error handling in add_disk.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> [hch: split from a larger patch]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-integrity.c | 12 +++++++-----
>   block/blk.h           |  5 +++--
>   2 files changed, 10 insertions(+), 7 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 08/11] block: return errors from disk_alloc_events
  2021-08-18 14:45 ` [PATCH 08/11] block: return errors from disk_alloc_events Christoph Hellwig
@ 2021-08-20  6:10   ` Hannes Reinecke
  0 siblings, 0 replies; 42+ messages in thread
From: Hannes Reinecke @ 2021-08-20  6:10 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

On 8/18/21 4:45 PM, Christoph Hellwig wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Prepare for proper error handling in add_disk.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> [hch: split from a larger patch]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk.h         | 2 +-
>   block/disk-events.c | 7 ++++---
>   2 files changed, 5 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 09/11] block: add error handling for device_add_disk / add_disk
  2021-08-18 14:45 ` [PATCH 09/11] block: add error handling for device_add_disk / add_disk Christoph Hellwig
@ 2021-08-20  6:12   ` Hannes Reinecke
  0 siblings, 0 replies; 42+ messages in thread
From: Hannes Reinecke @ 2021-08-20  6:12 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

On 8/18/21 4:45 PM, Christoph Hellwig wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Properly unwind on errors in device_add_disk.  This is the initial work
> as drivers are not converted yet, which will follow in separate patches.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> [hch: major rebase.  All bugs are probably mine]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/genhd.c         | 92 +++++++++++++++++++++++++++----------------
>   include/linux/genhd.h |  8 ++--
>   2 files changed, 62 insertions(+), 38 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/11] virtio_blk: add error handling support for add_disk()
  2021-08-18 14:45 ` [PATCH 10/11] virtio_blk: add error handling support for add_disk() Christoph Hellwig
@ 2021-08-20  6:12   ` Hannes Reinecke
  0 siblings, 0 replies; 42+ messages in thread
From: Hannes Reinecke @ 2021-08-20  6:12 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

On 8/18/21 4:45 PM, Christoph Hellwig wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> We never checked for errors on add_disk() as this function
> returned void. Now that this is fixed, use the shiny new
> error handling.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>   drivers/block/virtio_blk.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 11/11] null_blk: add error handling support for add_disk()
  2021-08-18 14:45 ` [PATCH 11/11] null_blk: " Christoph Hellwig
@ 2021-08-20  6:13   ` Hannes Reinecke
  0 siblings, 0 replies; 42+ messages in thread
From: Hannes Reinecke @ 2021-08-20  6:13 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

On 8/18/21 4:45 PM, Christoph Hellwig wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> We never checked for errors on add_disk() as this function
> returned void. Now that this is fixed, use the shiny new
> error handling. The actual cleanup in case of error is
> already handled by the caller of null_gendisk_register().
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/block/null_blk/main.c | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: add error handling to add_disk / device_add_disk
  2021-08-18 14:45 add error handling to add_disk / device_add_disk Christoph Hellwig
                   ` (10 preceding siblings ...)
  2021-08-18 14:45 ` [PATCH 11/11] null_blk: " Christoph Hellwig
@ 2021-08-23 18:57 ` Jens Axboe
  2021-08-23 19:44   ` Luis Chamberlain
  11 siblings, 1 reply; 42+ messages in thread
From: Jens Axboe @ 2021-08-23 18:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block

On 8/18/21 8:45 AM, Christoph Hellwig wrote:
> Hi Jens,
> 
> this series does some refactoring and then adds support to return errors
> from add_disk (rebasing a patch from Luis).  I think that alone is a huge
> improvement as it leaves a disk for which add_disk failed in a defined
> status, but the real improvement will be actually handling the errors in
> the drivers.  This series contains two trivial conversions.  Luis has
> a tree with conversions for all drivers in the tree, which will be fed
> incrementally once this goes in.  Hopefully we can convert all the
> commonly used drivers in this merge window.

Applied, thanks.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: add error handling to add_disk / device_add_disk
  2021-08-23 18:57 ` add error handling to add_disk / device_add_disk Jens Axboe
@ 2021-08-23 19:44   ` Luis Chamberlain
  2021-08-23 19:50     ` Jens Axboe
  0 siblings, 1 reply; 42+ messages in thread
From: Luis Chamberlain @ 2021-08-23 19:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Michael S. Tsirkin, Jason Wang,
	Martin K. Petersen, linux-block

On Mon, Aug 23, 2021 at 12:57:41PM -0600, Jens Axboe wrote:
> On 8/18/21 8:45 AM, Christoph Hellwig wrote:
> > Hi Jens,
> > 
> > this series does some refactoring and then adds support to return errors
> > from add_disk (rebasing a patch from Luis).  I think that alone is a huge
> > improvement as it leaves a disk for which add_disk failed in a defined
> > status, but the real improvement will be actually handling the errors in
> > the drivers.  This series contains two trivial conversions.  Luis has
> > a tree with conversions for all drivers in the tree, which will be fed
> > incrementally once this goes in.  Hopefully we can convert all the
> > commonly used drivers in this merge window.
> 
> Applied, thanks.

Do you have a branch published which has this by any chance? I checked
but can't see anything obvious.

  Luis

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: add error handling to add_disk / device_add_disk
  2021-08-23 19:44   ` Luis Chamberlain
@ 2021-08-23 19:50     ` Jens Axboe
  0 siblings, 0 replies; 42+ messages in thread
From: Jens Axboe @ 2021-08-23 19:50 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, Michael S. Tsirkin, Jason Wang,
	Martin K. Petersen, linux-block

On 8/23/21 1:44 PM, Luis Chamberlain wrote:
> On Mon, Aug 23, 2021 at 12:57:41PM -0600, Jens Axboe wrote:
>> On 8/18/21 8:45 AM, Christoph Hellwig wrote:
>>> Hi Jens,
>>>
>>> this series does some refactoring and then adds support to return errors
>>> from add_disk (rebasing a patch from Luis).  I think that alone is a huge
>>> improvement as it leaves a disk for which add_disk failed in a defined
>>> status, but the real improvement will be actually handling the errors in
>>> the drivers.  This series contains two trivial conversions.  Luis has
>>> a tree with conversions for all drivers in the tree, which will be fed
>>> incrementally once this goes in.  Hopefully we can convert all the
>>> commonly used drivers in this merge window.
>>
>> Applied, thanks.
> 
> Do you have a branch published which has this by any chance? I checked
> but can't see anything obvious.

It's in the core branch, just hadn't been pushed out yet. Now it is,
find it in for-5.15/block

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 42+ messages in thread

* question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2021-08-18 14:45 ` [PATCH 03/11] block: call bdev_add later in device_add_disk Christoph Hellwig
  2021-08-20  6:06   ` Hannes Reinecke
@ 2025-10-31  7:46   ` Gao Xiang
  2025-10-31  9:09     ` Christoph Hellwig
  1 sibling, 1 reply; 42+ messages in thread
From: Gao Xiang @ 2025-10-31  7:46 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Jan Kara, Christian Brauner
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff

Hi Christoph,

On 2021/8/18 22:45, Christoph Hellwig wrote:
> Once bdev_add is called userspace can open the block device.  Ensure
> that the struct device, which is used for refcounting of the disk
> besides various other things, is fully setup at that point.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Sorry for bring up an old commit.

 From my understanding, before this commit, bdev_add() will
be called prior to device_add() so insert_inode_hash()
in bdev_add() will be called before users can see blkdev
in the devtmpfs.

But after this change, blkdev can be seen before
insert_inode_hash(), which opens a race (although I'm not
sure if it's an expected behavior) blkdev_get_no_open() will
find nothing (e.g. mounting) even blkdev in devtmpfs is
there.

Also before commit 22ae8ce8b892 ("block: simplify bdev/disk
lookup in blkdev_get"), the corresponding bd_inode was created
by bdget() so at least it doesn't have such race window too.

One use case is that some userspace applications expect that
once a blkdev is visible in devtmpfs, it can be mounted
immediately. I'm not sure if it's a correct expectation
(or if there is some better way to know when bd_inode is
ready instead of just retring mount operations) or it's
just needed to be fixed.

Such race can be often observed if a virtio-blk device is
hotpluged into a VM and mount immediately at least on
Linux 6.6.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31  7:46   ` question about bd_inode hashing against device_add() // " Gao Xiang
@ 2025-10-31  9:09     ` Christoph Hellwig
  2025-10-31  9:36       ` Gao Xiang
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2025-10-31  9:09 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Christoph Hellwig, Jens Axboe, Jan Kara, Christian Brauner,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff

On Fri, Oct 31, 2025 at 03:46:02PM +0800, Gao Xiang wrote:
> From my understanding, before this commit, bdev_add() will
> be called prior to device_add() so insert_inode_hash()
> in bdev_add() will be called before users can see blkdev
> in the devtmpfs.
>
> But after this change, blkdev can be seen before
> insert_inode_hash(), which opens a race (although I'm not
> sure if it's an expected behavior) blkdev_get_no_open() will
> find nothing (e.g. mounting) even blkdev in devtmpfs is
> there.

We're not supposed to see the uevent notification before the
block device is ready.  Do you see that earlier, or do you have
code busy polling for a node?


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31  9:09     ` Christoph Hellwig
@ 2025-10-31  9:36       ` Gao Xiang
  2025-10-31  9:45         ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Gao Xiang @ 2025-10-31  9:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Jan Kara, Christian Brauner, Michael S. Tsirkin,
	Jason Wang, Martin K. Petersen, Luis Chamberlain, linux-block,
	Joseph Qi, guanghuifeng, zongyong.wzy, zyfjeff

Hi Christoph,

On 2025/10/31 17:09, Christoph Hellwig wrote:
> On Fri, Oct 31, 2025 at 03:46:02PM +0800, Gao Xiang wrote:
>>  From my understanding, before this commit, bdev_add() will
>> be called prior to device_add() so insert_inode_hash()
>> in bdev_add() will be called before users can see blkdev
>> in the devtmpfs.
>>
>> But after this change, blkdev can be seen before
>> insert_inode_hash(), which opens a race (although I'm not
>> sure if it's an expected behavior) blkdev_get_no_open() will
>> find nothing (e.g. mounting) even blkdev in devtmpfs is
>> there.
> 
> We're not supposed to see the uevent notification before the
> block device is ready.

Thanks for the quick response.

Right, sorry yes, disk_uevent(KOBJ_ADD) is in the end.

>  Do you see that earlier, or do you have
> code busy polling for a node?

Personally I think it will break many userspace programs
(although I also don't think it's a correct expectation.)

After recheck internally, the userspace program logic is:
   - stat /dev/vdX;
   - if exists, mount directly;
   - if non-exists, listen uevent disk_add instead.

Previously, for devtmpfs blkdev files, such stat/mount
assumption is always valid.

After this change, it's only valid for udev generated blkdev
files (because they're generated by uevent).  again I'm not
saying it's a good assumption, but I guess many developpers
who don't know this will write such user code, including our
internal users.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31  9:36       ` Gao Xiang
@ 2025-10-31  9:45         ` Christoph Hellwig
  2025-10-31  9:54           ` Gao Xiang
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2025-10-31  9:45 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Christoph Hellwig, Jens Axboe, Jan Kara, Christian Brauner,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Greg Kroah-Hartman, Rafael J. Wysocki,
	Danilo Krummrich, linux-kernel

On Fri, Oct 31, 2025 at 05:36:45PM +0800, Gao Xiang wrote:
> Right, sorry yes, disk_uevent(KOBJ_ADD) is in the end.
>
>>  Do you see that earlier, or do you have
>> code busy polling for a node?
>
> Personally I think it will break many userspace programs
> (although I also don't think it's a correct expectation.)

We've had this behavior for a few years, and this is the first report
I've seen.

> After recheck internally, the userspace program logic is:
>   - stat /dev/vdX;
>   - if exists, mount directly;
>   - if non-exists, listen uevent disk_add instead.
>
> Previously, for devtmpfs blkdev files, such stat/mount
> assumption is always valid.

That assumption doesn't seem wrong.  But why does the device node
get created earlier?  My assumption was that it would only be
created by the KOBJ_ADD uevent.  Adding the device model maintainers
as my little dig through the core drivers/base/ code doesn't find
anything to the contrary, but maybe I don't fully understand it.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31  9:45         ` Christoph Hellwig
@ 2025-10-31  9:54           ` Gao Xiang
  2025-10-31  9:58             ` Greg Kroah-Hartman
  0 siblings, 1 reply; 42+ messages in thread
From: Gao Xiang @ 2025-10-31  9:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Jan Kara, Christian Brauner, Michael S. Tsirkin,
	Jason Wang, Martin K. Petersen, Luis Chamberlain, linux-block,
	Joseph Qi, guanghuifeng, zongyong.wzy, zyfjeff,
	Greg Kroah-Hartman, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel



On 2025/10/31 17:45, Christoph Hellwig wrote:
> On Fri, Oct 31, 2025 at 05:36:45PM +0800, Gao Xiang wrote:
>> Right, sorry yes, disk_uevent(KOBJ_ADD) is in the end.
>>
>>>   Do you see that earlier, or do you have
>>> code busy polling for a node?
>>
>> Personally I think it will break many userspace programs
>> (although I also don't think it's a correct expectation.)
> 
> We've had this behavior for a few years, and this is the first report
> I've seen.
> 
>> After recheck internally, the userspace program logic is:
>>    - stat /dev/vdX;
>>    - if exists, mount directly;
>>    - if non-exists, listen uevent disk_add instead.
>>
>> Previously, for devtmpfs blkdev files, such stat/mount
>> assumption is always valid.
> 
> That assumption doesn't seem wrong.

;-) I was thought UNIX mknod doesn't imply the device is
ready or valid in any case (but dev files in devtmpfs
might be an exception but I didn't find some formal words)...
so uevent is clearly a right way, but..

> But why does the device node
> get created earlier?  My assumption was that it would only be
> created by the KOBJ_ADD uevent.  Adding the device model maintainers
> as my little dig through the core drivers/base/ code doesn't find
> anything to the contrary, but maybe I don't fully understand it.

AFAIK, device_add() is used to trigger devtmpfs file
creation, and it can be observed if frequently
hotpluging device in the VM and mount.  Currently
I don't have time slot to build an easy reproducer,
but I think it's a real issue anyway.

Thanks,
Gao Xiang


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31  9:54           ` Gao Xiang
@ 2025-10-31  9:58             ` Greg Kroah-Hartman
  2025-10-31 10:12               ` Gao Xiang
  0 siblings, 1 reply; 42+ messages in thread
From: Greg Kroah-Hartman @ 2025-10-31  9:58 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Christoph Hellwig, Jens Axboe, Jan Kara, Christian Brauner,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel

On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
> 
> 
> On 2025/10/31 17:45, Christoph Hellwig wrote:
> > On Fri, Oct 31, 2025 at 05:36:45PM +0800, Gao Xiang wrote:
> > > Right, sorry yes, disk_uevent(KOBJ_ADD) is in the end.
> > > 
> > > >   Do you see that earlier, or do you have
> > > > code busy polling for a node?
> > > 
> > > Personally I think it will break many userspace programs
> > > (although I also don't think it's a correct expectation.)
> > 
> > We've had this behavior for a few years, and this is the first report
> > I've seen.
> > 
> > > After recheck internally, the userspace program logic is:
> > >    - stat /dev/vdX;
> > >    - if exists, mount directly;
> > >    - if non-exists, listen uevent disk_add instead.
> > > 
> > > Previously, for devtmpfs blkdev files, such stat/mount
> > > assumption is always valid.
> > 
> > That assumption doesn't seem wrong.
> 
> ;-) I was thought UNIX mknod doesn't imply the device is
> ready or valid in any case (but dev files in devtmpfs
> might be an exception but I didn't find some formal words)...
> so uevent is clearly a right way, but..

Yes, anyone can do a mknod and attempt to open a device that isn't
present.

when devtmpfs creates the device node, it should be there.  Unless it
gets removed, and then added back, so you could race with userspace, but
that's not normal.

> > But why does the device node
> > get created earlier?  My assumption was that it would only be
> > created by the KOBJ_ADD uevent.  Adding the device model maintainers
> > as my little dig through the core drivers/base/ code doesn't find
> > anything to the contrary, but maybe I don't fully understand it.
> 
> AFAIK, device_add() is used to trigger devtmpfs file
> creation, and it can be observed if frequently
> hotpluging device in the VM and mount.  Currently
> I don't have time slot to build an easy reproducer,
> but I think it's a real issue anyway.

As I say above, that's not normal, and you have to be root to do this,
so I don't understand what you are trying to prevent happening?  What is
the bug and why is it just showing up now (i.e. what changed to cause
it?)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31  9:58             ` Greg Kroah-Hartman
@ 2025-10-31 10:12               ` Gao Xiang
  2025-10-31 12:23                 ` Gao Xiang
  2025-10-31 14:31                 ` Greg Kroah-Hartman
  0 siblings, 2 replies; 42+ messages in thread
From: Gao Xiang @ 2025-10-31 10:12 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Christoph Hellwig, Jens Axboe, Jan Kara, Christian Brauner,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel

Hi Greg,

On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
> On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
>>
>>
>> On 2025/10/31 17:45, Christoph Hellwig wrote:
>>> On Fri, Oct 31, 2025 at 05:36:45PM +0800, Gao Xiang wrote:
>>>> Right, sorry yes, disk_uevent(KOBJ_ADD) is in the end.
>>>>
>>>>>    Do you see that earlier, or do you have
>>>>> code busy polling for a node?
>>>>
>>>> Personally I think it will break many userspace programs
>>>> (although I also don't think it's a correct expectation.)
>>>
>>> We've had this behavior for a few years, and this is the first report
>>> I've seen.
>>>
>>>> After recheck internally, the userspace program logic is:
>>>>     - stat /dev/vdX;
>>>>     - if exists, mount directly;
>>>>     - if non-exists, listen uevent disk_add instead.
>>>>
>>>> Previously, for devtmpfs blkdev files, such stat/mount
>>>> assumption is always valid.
>>>
>>> That assumption doesn't seem wrong.
>>
>> ;-) I was thought UNIX mknod doesn't imply the device is
>> ready or valid in any case (but dev files in devtmpfs
>> might be an exception but I didn't find some formal words)...
>> so uevent is clearly a right way, but..
> 
> Yes, anyone can do a mknod and attempt to open a device that isn't
> present.
> 
> when devtmpfs creates the device node, it should be there.  Unless it
> gets removed, and then added back, so you could race with userspace, but
> that's not normal.
> 
>>> But why does the device node
>>> get created earlier?  My assumption was that it would only be
>>> created by the KOBJ_ADD uevent.  Adding the device model maintainers
>>> as my little dig through the core drivers/base/ code doesn't find
>>> anything to the contrary, but maybe I don't fully understand it.
>>
>> AFAIK, device_add() is used to trigger devtmpfs file
>> creation, and it can be observed if frequently
>> hotpluging device in the VM and mount.  Currently
>> I don't have time slot to build an easy reproducer,
>> but I think it's a real issue anyway.
> 
> As I say above, that's not normal, and you have to be root to do this,

Just thinking out if I am a random reporter, I could
report the original symptom now because we face it,
but everyone has his own internal business or even
with limited kernel ability for example, in any
case, there is no such expectation to rush someone
into build a clean reproducer.

Nevertheless, I will take time on the reproducer, and
I think it could just add some artificial delay just
after device_add(). I could try anyway, but no rush.

> so I don't understand what you are trying to prevent happening?  What is

The original report was
https://lore.kernel.org/r/43375218-2a80-4a7a-b8bb-465f6419b595@linux.alibaba.com/

> the bug and why is it just showing up now (i.e. what changed to cause
> it?)

I don't know, I think just because 6.6 is a relatively
newer kernel, and most userspace logic has retry logic
to cover this up.


Thanks,
Gao Xiang


> 
> thanks,
> 
> greg k-h


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31 10:12               ` Gao Xiang
@ 2025-10-31 12:23                 ` Gao Xiang
  2025-10-31 12:25                   ` Gao Xiang
  2025-10-31 14:34                   ` Greg Kroah-Hartman
  2025-10-31 14:31                 ` Greg Kroah-Hartman
  1 sibling, 2 replies; 42+ messages in thread
From: Gao Xiang @ 2025-10-31 12:23 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Christoph Hellwig, Jens Axboe, Jan Kara,
	Christian Brauner
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel



On 2025/10/31 18:12, Gao Xiang wrote:
> Hi Greg,
> 
> On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
>> On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
>>>
>>>
>>> On 2025/10/31 17:45, Christoph Hellwig wrote:

...

>>>> But why does the device node
>>>> get created earlier?  My assumption was that it would only be
>>>> created by the KOBJ_ADD uevent.  Adding the device model maintainers
>>>> as my little dig through the core drivers/base/ code doesn't find
>>>> anything to the contrary, but maybe I don't fully understand it.
>>>
>>> AFAIK, device_add() is used to trigger devtmpfs file
>>> creation, and it can be observed if frequently
>>> hotpluging device in the VM and mount.  Currently
>>> I don't have time slot to build an easy reproducer,
>>> but I think it's a real issue anyway.
>>
>> As I say above, that's not normal, and you have to be root to do this,
I just spent time to reproduce with dynamic loop devices and
actually it's easy if msleep() is located artificiallly,
the diff as below:

diff --git a/block/bdev.c b/block/bdev.c
index 810707cca970..a4273b5ad456 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -821,7 +821,7 @@ struct block_device *blkdev_get_no_open(dev_t dev, bool autoload)
  	struct inode *inode;

  	inode = ilookup(blockdev_superblock, dev);
-	if (!inode && autoload && IS_ENABLED(CONFIG_BLOCK_LEGACY_AUTOLOAD)) {
+	if (0) {
  		blk_request_module(dev);
  		inode = ilookup(blockdev_superblock, dev);
  		if (inode)
diff --git a/block/genhd.c b/block/genhd.c
index 9bbc38d12792..3c9116fdc1ce 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -428,6 +428,8 @@ static void add_disk_final(struct gendisk *disk)
  	set_bit(GD_ADDED, &disk->state);
  }

+#include <linux/delay.h>
+
  static int __add_disk(struct device *parent, struct gendisk *disk,
  		      const struct attribute_group **groups,
  		      struct fwnode_handle *fwnode)
@@ -497,6 +499,9 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
  	if (ret)
  		goto out_free_ext_minor;

+	if (disk->major == LOOP_MAJOR)
+		msleep(2500);           // delay 2.5s for all loops
+
  	ret = disk_alloc_events(disk);
  	if (ret)
  		goto out_device_del;


(Note that I masked off CONFIG_BLOCK_LEGACY_AUTOLOAD
  for cleaner ftrace below.)

and then

# uname -a  (patched 6.18-rc1 kernel)

```
Linux 7e5b4b5f5181 6.18.0-rc1-dirty #25 SMP PREEMPT_DYNAMIC Fri Oct 31 19:52:10 CST 2025 x86_64 GNU/Linux
```

# truncate -s 1g test.img; mkfs.ext4 -F test.img;
# losetup /dev/loop999 test.img & sleep 1; ls -l /dev/loop999; strace mount -t ext4 /dev/loop999 mnt 2>&1 | grep fsconfig

It shows

```
brw------- 1 root root 7, 999 Oct 31 20:06 /dev/loop999
fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/loop999", 0) = 0
fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = -1 ENXIO (No such device or address)  // unexpected
```

then

# losetup /dev/loop996 test.img & sleep 1; stat /dev/loop996; trace-cmd record -p function_graph mount -t ext4 /dev/loop996 mnt &> /dev/null

It shows
```
   File: /dev/loop996
   Size: 0               Blocks: 0          IO Block: 4096   block special file
Device: 0,6     Inode: 429         Links: 1     Device type: 7,996
Access: (0600/brw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2025-10-31 20:07:54.938474868 +0800
Modify: 2025-10-31 20:07:54.938474868 +0800
Change: 2025-10-31 20:07:54.938474868 +0800
  Birth: 2025-10-31 20:07:54.938474868 +0800
```

but

# trace-cmd report | grep mount | less
            mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                bdev_file_open_by_dev() {
            mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                  bdev_permission() {
            mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                    devcgroup_check_permission() {
            mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                      __rcu_read_lock() {
            mount-561   [007] ...1.   240.180514: funcgraph_exit:         0.193 us   |                      } (ret=0x1)
            mount-561   [007] ...1.   240.180514: funcgraph_entry:                   |                      match_exception_partial() {
            mount-561   [007] ...1.   240.180514: funcgraph_exit:         0.199 us   |                      } (ret=0x0)
            mount-561   [007] ...1.   240.180514: funcgraph_entry:                   |                      __rcu_read_unlock() {
            mount-561   [007] ...1.   240.180515: funcgraph_exit:         0.202 us   |                      } (ret=0x0)
            mount-561   [007] ...1.   240.180515: funcgraph_exit:         1.602 us   |                    } (ret=0x0)
            mount-561   [007] ...1.   240.180515: funcgraph_exit:         2.100 us   |                  } (ret=0x0)
            mount-561   [007] ...1.   240.180515: funcgraph_entry:                   |                  ilookup() {
            mount-561   [007] ...1.   240.180516: funcgraph_entry:                   |                    __cond_resched() {
            mount-561   [007] ...1.   240.180516: funcgraph_exit:         0.194 us   |                    } (ret=0x0)
            mount-561   [007] ...1.   240.180516: funcgraph_entry:                   |                    find_inode_fast() {
            mount-561   [007] ...1.   240.180516: funcgraph_entry:                   |                      __rcu_read_lock() {
            mount-561   [007] ...1.   240.180516: funcgraph_exit:         0.195 us   |                      } (ret=0x1)
            mount-561   [007] ...1.   240.180517: funcgraph_entry:                   |                      __rcu_read_unlock() {
            mount-561   [007] ...1.   240.180517: funcgraph_exit:         0.193 us   |                      } (ret=0x0)
            mount-561   [007] ...1.   240.180517: funcgraph_exit:         1.060 us   |                    } (ret=0x0)
            mount-561   [007] ...1.   240.180517: funcgraph_exit:         1.970 us   |                  } (ret=0x0)
            mount-561   [007] ...1.   240.180518: funcgraph_exit:         4.818 us   |                } (ret=-6)

here -6 (-ENXIO) is unexpected.

Actually the problematic code path I've said is device_add():

upstream code:

loop_control_ioctl
  loop_add
    add_disk_fwnode
      __add_disk
        devtmpfs_create_node   // here create devtmpfs blkdev file, but racy
      add_disk_final
        bdev_add
          insert_inode_hash    // just seen by bdev_file_open_by_dev()
        disk_uevent(disk, KOBJ_ADD)

I actually think it's enough to explain the root.

Thanks,
Gao Xiang

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31 12:23                 ` Gao Xiang
@ 2025-10-31 12:25                   ` Gao Xiang
  2025-10-31 14:34                   ` Greg Kroah-Hartman
  1 sibling, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2025-10-31 12:25 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Christoph Hellwig, Jens Axboe, Jan Kara,
	Christian Brauner
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel



On 2025/10/31 20:23, Gao Xiang wrote:
> 
> 
> On 2025/10/31 18:12, Gao Xiang wrote:
>> Hi Greg,
>>
>> On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
>>> On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
>>>>
>>>>
>>>> On 2025/10/31 17:45, Christoph Hellwig wrote:
> 
> ...
> 
>>>>> But why does the device node
>>>>> get created earlier?  My assumption was that it would only be
>>>>> created by the KOBJ_ADD uevent.  Adding the device model maintainers
>>>>> as my little dig through the core drivers/base/ code doesn't find
>>>>> anything to the contrary, but maybe I don't fully understand it.
>>>>
>>>> AFAIK, device_add() is used to trigger devtmpfs file
>>>> creation, and it can be observed if frequently
>>>> hotpluging device in the VM and mount.  Currently
>>>> I don't have time slot to build an easy reproducer,
>>>> but I think it's a real issue anyway.
>>>
>>> As I say above, that's not normal, and you have to be root to do this,
> I just spent time to reproduce with dynamic loop devices and
> actually it's easy if msleep() is located artificiallly,
> the diff as below:
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 810707cca970..a4273b5ad456 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -821,7 +821,7 @@ struct block_device *blkdev_get_no_open(dev_t dev, bool autoload)
>       struct inode *inode;
> 
>       inode = ilookup(blockdev_superblock, dev);
> -    if (!inode && autoload && IS_ENABLED(CONFIG_BLOCK_LEGACY_AUTOLOAD)) {
> +    if (0) {
>           blk_request_module(dev);
>           inode = ilookup(blockdev_superblock, dev);
>           if (inode)
> diff --git a/block/genhd.c b/block/genhd.c
> index 9bbc38d12792..3c9116fdc1ce 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -428,6 +428,8 @@ static void add_disk_final(struct gendisk *disk)
>       set_bit(GD_ADDED, &disk->state);
>   }
> 
> +#include <linux/delay.h>
> +
>   static int __add_disk(struct device *parent, struct gendisk *disk,
>                 const struct attribute_group **groups,
>                 struct fwnode_handle *fwnode)
> @@ -497,6 +499,9 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
>       if (ret)
>           goto out_free_ext_minor;
> 
> +    if (disk->major == LOOP_MAJOR)
> +        msleep(2500);           // delay 2.5s for all loops
> +
>       ret = disk_alloc_events(disk);
>       if (ret)
>           goto out_device_del;
> 
> 
> (Note that I masked off CONFIG_BLOCK_LEGACY_AUTOLOAD
>   for cleaner ftrace below.)
> 
> and then
> 
> # uname -a  (patched 6.18-rc1 kernel)
> 
> ```
> Linux 7e5b4b5f5181 6.18.0-rc1-dirty #25 SMP PREEMPT_DYNAMIC Fri Oct 31 19:52:10 CST 2025 x86_64 GNU/Linux
> ```
> 
> # truncate -s 1g test.img; mkfs.ext4 -F test.img;
> # losetup /dev/loop999 test.img & sleep 1; ls -l /dev/loop999; strace mount -t ext4 /dev/loop999 mnt 2>&1 | grep fsconfig
> 
> It shows
> 
> ```
> brw------- 1 root root 7, 999 Oct 31 20:06 /dev/loop999
> fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/loop999", 0) = 0
> fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = -1 ENXIO (No such device or address)  // unexpected
> ```
> 
> then
> 
> # losetup /dev/loop996 test.img & sleep 1; stat /dev/loop996; trace-cmd record -p function_graph mount -t ext4 /dev/loop996 mnt &> /dev/null
> 
> It shows
> ```
>    File: /dev/loop996
>    Size: 0               Blocks: 0          IO Block: 4096   block special file
> Device: 0,6     Inode: 429         Links: 1     Device type: 7,996
> Access: (0600/brw-------)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2025-10-31 20:07:54.938474868 +0800
> Modify: 2025-10-31 20:07:54.938474868 +0800
> Change: 2025-10-31 20:07:54.938474868 +0800
>   Birth: 2025-10-31 20:07:54.938474868 +0800
> ```
> 
> but
> 
> # trace-cmd report | grep mount | less
>             mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                bdev_file_open_by_dev() {
>             mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                  bdev_permission() {
>             mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                    devcgroup_check_permission() {
>             mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                      __rcu_read_lock() {
>             mount-561   [007] ...1.   240.180514: funcgraph_exit:         0.193 us   |                      } (ret=0x1)
>             mount-561   [007] ...1.   240.180514: funcgraph_entry:                   |                      match_exception_partial() {
>             mount-561   [007] ...1.   240.180514: funcgraph_exit:         0.199 us   |                      } (ret=0x0)
>             mount-561   [007] ...1.   240.180514: funcgraph_entry:                   |                      __rcu_read_unlock() {
>             mount-561   [007] ...1.   240.180515: funcgraph_exit:         0.202 us   |                      } (ret=0x0)
>             mount-561   [007] ...1.   240.180515: funcgraph_exit:         1.602 us   |                    } (ret=0x0)
>             mount-561   [007] ...1.   240.180515: funcgraph_exit:         2.100 us   |                  } (ret=0x0)
>             mount-561   [007] ...1.   240.180515: funcgraph_entry:                   |                  ilookup() {
>             mount-561   [007] ...1.   240.180516: funcgraph_entry:                   |                    __cond_resched() {
>             mount-561   [007] ...1.   240.180516: funcgraph_exit:         0.194 us   |                    } (ret=0x0)
>             mount-561   [007] ...1.   240.180516: funcgraph_entry:                   |                    find_inode_fast() {
>             mount-561   [007] ...1.   240.180516: funcgraph_entry:                   |                      __rcu_read_lock() {
>             mount-561   [007] ...1.   240.180516: funcgraph_exit:         0.195 us   |                      } (ret=0x1)
>             mount-561   [007] ...1.   240.180517: funcgraph_entry:                   |                      __rcu_read_unlock() {
>             mount-561   [007] ...1.   240.180517: funcgraph_exit:         0.193 us   |                      } (ret=0x0)
>             mount-561   [007] ...1.   240.180517: funcgraph_exit:         1.060 us   |                    } (ret=0x0)
>             mount-561   [007] ...1.   240.180517: funcgraph_exit:         1.970 us   |                  } (ret=0x0)
>             mount-561   [007] ...1.   240.180518: funcgraph_exit:         4.818 us   |                } (ret=-6)
> 
> here -6 (-ENXIO) is unexpected.
> 
> Actually the problematic code path I've said is device_add():
> 
> upstream code:
> 
> loop_control_ioctl
>   loop_add
>     add_disk_fwnode
>       __add_disk
>         devtmpfs_create_node   // here create devtmpfs blkdev file, but racy
>       add_disk_final
>         bdev_add
>           insert_inode_hash    // just seen by bdev_file_open_by_dev()
>         disk_uevent(disk, KOBJ_ADD)

minor revision:

  loop_control_ioctl
    loop_add
      add_disk_fwnode
        __add_disk
          device_add
            devtmpfs_create_node   // here create devtmpfs blkdev file, but racy
        add_disk_final
          bdev_add
            insert_inode_hash    // just seen by bdev_file_open_by_dev()
          disk_uevent(disk, KOBJ_ADD)

> 
> I actually think it's enough to explain the root.
> 
> Thanks,
> Gao Xiang


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31 10:12               ` Gao Xiang
  2025-10-31 12:23                 ` Gao Xiang
@ 2025-10-31 14:31                 ` Greg Kroah-Hartman
  2025-10-31 14:40                   ` Gao Xiang
  1 sibling, 1 reply; 42+ messages in thread
From: Greg Kroah-Hartman @ 2025-10-31 14:31 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Christoph Hellwig, Jens Axboe, Jan Kara, Christian Brauner,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel

On Fri, Oct 31, 2025 at 06:12:05PM +0800, Gao Xiang wrote:
> Hi Greg,
> 
> On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
> > On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
> > > 
> > > 
> > > On 2025/10/31 17:45, Christoph Hellwig wrote:
> > > > On Fri, Oct 31, 2025 at 05:36:45PM +0800, Gao Xiang wrote:
> > > > > Right, sorry yes, disk_uevent(KOBJ_ADD) is in the end.
> > > > > 
> > > > > >    Do you see that earlier, or do you have
> > > > > > code busy polling for a node?
> > > > > 
> > > > > Personally I think it will break many userspace programs
> > > > > (although I also don't think it's a correct expectation.)
> > > > 
> > > > We've had this behavior for a few years, and this is the first report
> > > > I've seen.
> > > > 
> > > > > After recheck internally, the userspace program logic is:
> > > > >     - stat /dev/vdX;
> > > > >     - if exists, mount directly;
> > > > >     - if non-exists, listen uevent disk_add instead.
> > > > > 
> > > > > Previously, for devtmpfs blkdev files, such stat/mount
> > > > > assumption is always valid.
> > > > 
> > > > That assumption doesn't seem wrong.
> > > 
> > > ;-) I was thought UNIX mknod doesn't imply the device is
> > > ready or valid in any case (but dev files in devtmpfs
> > > might be an exception but I didn't find some formal words)...
> > > so uevent is clearly a right way, but..
> > 
> > Yes, anyone can do a mknod and attempt to open a device that isn't
> > present.
> > 
> > when devtmpfs creates the device node, it should be there.  Unless it
> > gets removed, and then added back, so you could race with userspace, but
> > that's not normal.
> > 
> > > > But why does the device node
> > > > get created earlier?  My assumption was that it would only be
> > > > created by the KOBJ_ADD uevent.  Adding the device model maintainers
> > > > as my little dig through the core drivers/base/ code doesn't find
> > > > anything to the contrary, but maybe I don't fully understand it.
> > > 
> > > AFAIK, device_add() is used to trigger devtmpfs file
> > > creation, and it can be observed if frequently
> > > hotpluging device in the VM and mount.  Currently
> > > I don't have time slot to build an easy reproducer,
> > > but I think it's a real issue anyway.
> > 
> > As I say above, that's not normal, and you have to be root to do this,
> 
> Just thinking out if I am a random reporter, I could
> report the original symptom now because we face it,
> but everyone has his own internal business or even
> with limited kernel ability for example, in any
> case, there is no such expectation to rush someone
> into build a clean reproducer.
> 
> Nevertheless, I will take time on the reproducer, and
> I think it could just add some artificial delay just
> after device_add(). I could try anyway, but no rush.
> 
> > so I don't understand what you are trying to prevent happening?  What is
> 
> The original report was
> https://lore.kernel.org/r/43375218-2a80-4a7a-b8bb-465f6419b595@linux.alibaba.com/

So you see cases where the device node is present, you try to open it,
but yet there is no real block device behind it at all?

> > the bug and why is it just showing up now (i.e. what changed to cause
> > it?)
> 
> I don't know, I think just because 6.6 is a relatively
> newer kernel, and most userspace logic has retry logic
> to cover this up.

6.6 has been out for 2 years now, this is a long time in kernel
development cycles for things to just start showing up now.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31 12:23                 ` Gao Xiang
  2025-10-31 12:25                   ` Gao Xiang
@ 2025-10-31 14:34                   ` Greg Kroah-Hartman
  2025-10-31 14:44                     ` Gao Xiang
  1 sibling, 1 reply; 42+ messages in thread
From: Greg Kroah-Hartman @ 2025-10-31 14:34 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Christoph Hellwig, Jens Axboe, Jan Kara, Christian Brauner,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel

On Fri, Oct 31, 2025 at 08:23:32PM +0800, Gao Xiang wrote:
> 
> 
> On 2025/10/31 18:12, Gao Xiang wrote:
> > Hi Greg,
> > 
> > On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
> > > On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
> > > > 
> > > > 
> > > > On 2025/10/31 17:45, Christoph Hellwig wrote:
> 
> ...
> 
> > > > > But why does the device node
> > > > > get created earlier?  My assumption was that it would only be
> > > > > created by the KOBJ_ADD uevent.  Adding the device model maintainers
> > > > > as my little dig through the core drivers/base/ code doesn't find
> > > > > anything to the contrary, but maybe I don't fully understand it.
> > > > 
> > > > AFAIK, device_add() is used to trigger devtmpfs file
> > > > creation, and it can be observed if frequently
> > > > hotpluging device in the VM and mount.  Currently
> > > > I don't have time slot to build an easy reproducer,
> > > > but I think it's a real issue anyway.
> > > 
> > > As I say above, that's not normal, and you have to be root to do this,
> I just spent time to reproduce with dynamic loop devices and
> actually it's easy if msleep() is located artificiallly,
> the diff as below:
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 810707cca970..a4273b5ad456 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -821,7 +821,7 @@ struct block_device *blkdev_get_no_open(dev_t dev, bool autoload)
>  	struct inode *inode;
> 
>  	inode = ilookup(blockdev_superblock, dev);
> -	if (!inode && autoload && IS_ENABLED(CONFIG_BLOCK_LEGACY_AUTOLOAD)) {
> +	if (0) {
>  		blk_request_module(dev);
>  		inode = ilookup(blockdev_superblock, dev);
>  		if (inode)
> diff --git a/block/genhd.c b/block/genhd.c
> index 9bbc38d12792..3c9116fdc1ce 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -428,6 +428,8 @@ static void add_disk_final(struct gendisk *disk)
>  	set_bit(GD_ADDED, &disk->state);
>  }
> 
> +#include <linux/delay.h>
> +
>  static int __add_disk(struct device *parent, struct gendisk *disk,
>  		      const struct attribute_group **groups,
>  		      struct fwnode_handle *fwnode)
> @@ -497,6 +499,9 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
>  	if (ret)
>  		goto out_free_ext_minor;
> 
> +	if (disk->major == LOOP_MAJOR)
> +		msleep(2500);           // delay 2.5s for all loops
> +

Yes, so you need to watch for the uevent to happen, THEN it is safe to
access the block device.  Doing it before then isn't a good idea :)

But, if you think this is an issue, do you have a patch that passes your
testing to fix it?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31 14:31                 ` Greg Kroah-Hartman
@ 2025-10-31 14:40                   ` Gao Xiang
  0 siblings, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2025-10-31 14:40 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Christoph Hellwig, Jens Axboe, Jan Kara, Christian Brauner,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel



On 2025/10/31 22:31, Greg Kroah-Hartman wrote:
> On Fri, Oct 31, 2025 at 06:12:05PM +0800, Gao Xiang wrote:
>> Hi Greg,
>>
>> On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
>>> On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
>>>>
>>>>
>>>> On 2025/10/31 17:45, Christoph Hellwig wrote:
>>>>> On Fri, Oct 31, 2025 at 05:36:45PM +0800, Gao Xiang wrote:
>>>>>> Right, sorry yes, disk_uevent(KOBJ_ADD) is in the end.
>>>>>>
>>>>>>>     Do you see that earlier, or do you have
>>>>>>> code busy polling for a node?
>>>>>>
>>>>>> Personally I think it will break many userspace programs
>>>>>> (although I also don't think it's a correct expectation.)
>>>>>
>>>>> We've had this behavior for a few years, and this is the first report
>>>>> I've seen.
>>>>>
>>>>>> After recheck internally, the userspace program logic is:
>>>>>>      - stat /dev/vdX;
>>>>>>      - if exists, mount directly;
>>>>>>      - if non-exists, listen uevent disk_add instead.
>>>>>>
>>>>>> Previously, for devtmpfs blkdev files, such stat/mount
>>>>>> assumption is always valid.
>>>>>
>>>>> That assumption doesn't seem wrong.
>>>>
>>>> ;-) I was thought UNIX mknod doesn't imply the device is
>>>> ready or valid in any case (but dev files in devtmpfs
>>>> might be an exception but I didn't find some formal words)...
>>>> so uevent is clearly a right way, but..
>>>
>>> Yes, anyone can do a mknod and attempt to open a device that isn't
>>> present.
>>>
>>> when devtmpfs creates the device node, it should be there.  Unless it
>>> gets removed, and then added back, so you could race with userspace, but
>>> that's not normal.
>>>
>>>>> But why does the device node
>>>>> get created earlier?  My assumption was that it would only be
>>>>> created by the KOBJ_ADD uevent.  Adding the device model maintainers
>>>>> as my little dig through the core drivers/base/ code doesn't find
>>>>> anything to the contrary, but maybe I don't fully understand it.
>>>>
>>>> AFAIK, device_add() is used to trigger devtmpfs file
>>>> creation, and it can be observed if frequently
>>>> hotpluging device in the VM and mount.  Currently
>>>> I don't have time slot to build an easy reproducer,
>>>> but I think it's a real issue anyway.
>>>
>>> As I say above, that's not normal, and you have to be root to do this,
>>
>> Just thinking out if I am a random reporter, I could
>> report the original symptom now because we face it,
>> but everyone has his own internal business or even
>> with limited kernel ability for example, in any
>> case, there is no such expectation to rush someone
>> into build a clean reproducer.
>>
>> Nevertheless, I will take time on the reproducer, and
>> I think it could just add some artificial delay just
>> after device_add(). I could try anyway, but no rush.
>>
>>> so I don't understand what you are trying to prevent happening?  What is
>>
>> The original report was
>> https://lore.kernel.org/r/43375218-2a80-4a7a-b8bb-465f6419b595@linux.alibaba.com/
> 
> So you see cases where the device node is present, you try to open it,
> but yet there is no real block device behind it at all?

Roughly yes, block devices have a pseudo filesystem, briefly
it registered the block device with device_add() so the
devtmpfs file is visible then but bdev_add() is not called yet
so for example, mounting like bdev_file_open_by_dev() cannot
find this and return ENXIO.

> 
>>> the bug and why is it just showing up now (i.e. what changed to cause
>>> it?)
>>
>> I don't know, I think just because 6.6 is a relatively
>> newer kernel, and most userspace logic has retry logic
>> to cover this up.
> 
> 6.6 has been out for 2 years now, this is a long time in kernel
> development cycles for things to just start showing up now.

I think for most cases devices are added during boot so
it's hard to find, but in the stress hotplug cases, it
can be observed easily honestly.

Thanks,
Gao Xiang

> 
> thanks,
> 
> greg k-h


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31 14:34                   ` Greg Kroah-Hartman
@ 2025-10-31 14:44                     ` Gao Xiang
  2025-11-05  3:04                       ` Gao Xiang
  2025-11-05 12:30                       ` Christian Brauner
  0 siblings, 2 replies; 42+ messages in thread
From: Gao Xiang @ 2025-10-31 14:44 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Christoph Hellwig, Jens Axboe, Jan Kara, Christian Brauner,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel



On 2025/10/31 22:34, Greg Kroah-Hartman wrote:
> On Fri, Oct 31, 2025 at 08:23:32PM +0800, Gao Xiang wrote:
>>
>>
>> On 2025/10/31 18:12, Gao Xiang wrote:
>>> Hi Greg,
>>>
>>> On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
>>>> On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
>>>>>
>>>>>
>>>>> On 2025/10/31 17:45, Christoph Hellwig wrote:
>>
>> ...
>>
>>>>>> But why does the device node
>>>>>> get created earlier?  My assumption was that it would only be
>>>>>> created by the KOBJ_ADD uevent.  Adding the device model maintainers
>>>>>> as my little dig through the core drivers/base/ code doesn't find
>>>>>> anything to the contrary, but maybe I don't fully understand it.
>>>>>
>>>>> AFAIK, device_add() is used to trigger devtmpfs file
>>>>> creation, and it can be observed if frequently
>>>>> hotpluging device in the VM and mount.  Currently
>>>>> I don't have time slot to build an easy reproducer,
>>>>> but I think it's a real issue anyway.
>>>>
>>>> As I say above, that's not normal, and you have to be root to do this,
>> I just spent time to reproduce with dynamic loop devices and
>> actually it's easy if msleep() is located artificiallly,
>> the diff as below:
>>
>> diff --git a/block/bdev.c b/block/bdev.c
>> index 810707cca970..a4273b5ad456 100644
>> --- a/block/bdev.c
>> +++ b/block/bdev.c
>> @@ -821,7 +821,7 @@ struct block_device *blkdev_get_no_open(dev_t dev, bool autoload)
>>   	struct inode *inode;
>>
>>   	inode = ilookup(blockdev_superblock, dev);
>> -	if (!inode && autoload && IS_ENABLED(CONFIG_BLOCK_LEGACY_AUTOLOAD)) {
>> +	if (0) {
>>   		blk_request_module(dev);
>>   		inode = ilookup(blockdev_superblock, dev);
>>   		if (inode)
>> diff --git a/block/genhd.c b/block/genhd.c
>> index 9bbc38d12792..3c9116fdc1ce 100644
>> --- a/block/genhd.c
>> +++ b/block/genhd.c
>> @@ -428,6 +428,8 @@ static void add_disk_final(struct gendisk *disk)
>>   	set_bit(GD_ADDED, &disk->state);
>>   }
>>
>> +#include <linux/delay.h>
>> +
>>   static int __add_disk(struct device *parent, struct gendisk *disk,
>>   		      const struct attribute_group **groups,
>>   		      struct fwnode_handle *fwnode)
>> @@ -497,6 +499,9 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
>>   	if (ret)
>>   		goto out_free_ext_minor;
>>
>> +	if (disk->major == LOOP_MAJOR)
>> +		msleep(2500);           // delay 2.5s for all loops
>> +
> 
> Yes, so you need to watch for the uevent to happen, THEN it is safe to
> access the block device.  Doing it before then isn't a good idea :)
> 
> But, if you think this is an issue, do you have a patch that passes your
> testing to fix it?

I just raise it up for some ideas, and this change is
buried into the code refactor and honestly I need to
look into the codebase and related patchsets first.

Currently I have dozens of other development stuffs
on hand, if it's really a regression, I do hope
Christoph or other folks who are familiar with the code
could try to address this.

Thanks,
Gao Xiang

> 
> thanks,
> 
> greg k-h


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31 14:44                     ` Gao Xiang
@ 2025-11-05  3:04                       ` Gao Xiang
  2025-11-05 12:30                       ` Christian Brauner
  1 sibling, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2025-11-05  3:04 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Greg Kroah-Hartman, Jan Kara,
	Christian Brauner
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel

Hi Christiph,

On 2025/10/31 22:44, Gao Xiang wrote:
> 

..

>>> I just spent time to reproduce with dynamic loop devices and
>>> actually it's easy if msleep() is located artificiallly,
>>> the diff as below:
>>>
>>> diff --git a/block/bdev.c b/block/bdev.c
>>> index 810707cca970..a4273b5ad456 100644
>>> --- a/block/bdev.c
>>> +++ b/block/bdev.c
>>> @@ -821,7 +821,7 @@ struct block_device *blkdev_get_no_open(dev_t dev, bool autoload)
>>>       struct inode *inode;
>>>
>>>       inode = ilookup(blockdev_superblock, dev);
>>> -    if (!inode && autoload && IS_ENABLED(CONFIG_BLOCK_LEGACY_AUTOLOAD)) {
>>> +    if (0) {
>>>           blk_request_module(dev);
>>>           inode = ilookup(blockdev_superblock, dev);
>>>           if (inode)
>>> diff --git a/block/genhd.c b/block/genhd.c
>>> index 9bbc38d12792..3c9116fdc1ce 100644
>>> --- a/block/genhd.c
>>> +++ b/block/genhd.c
>>> @@ -428,6 +428,8 @@ static void add_disk_final(struct gendisk *disk)
>>>       set_bit(GD_ADDED, &disk->state);
>>>   }
>>>
>>> +#include <linux/delay.h>
>>> +
>>>   static int __add_disk(struct device *parent, struct gendisk *disk,
>>>                 const struct attribute_group **groups,
>>>                 struct fwnode_handle *fwnode)
>>> @@ -497,6 +499,9 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
>>>       if (ret)
>>>           goto out_free_ext_minor;
>>>
>>> +    if (disk->major == LOOP_MAJOR)
>>> +        msleep(2500);           // delay 2.5s for all loops
>>> +
>>
>> Yes, so you need to watch for the uevent to happen, THEN it is safe to
>> access the block device.  Doing it before then isn't a good idea :)
>>
>> But, if you think this is an issue, do you have a patch that passes your
>> testing to fix it?
> 
> I just raise it up for some ideas, and this change is
> buried into the code refactor and honestly I need to
> look into the codebase and related patchsets first.
> 
> Currently I have dozens of other development stuffs
> on hand, if it's really a regression, I do hope
> Christoph or other folks who are familiar with the code
> could try to address this.

I've provided a reproducible way:
https://lore.kernel.org/linux-block/ec8b1c76-c211-49a5-a056-6a147faddd3b@linux.alibaba.com

As the author of these gendisk/bdev enhancement commits, what's
your opinion on this?

In other words, do you think it's a regression, or just a behavior
change but not a regression? Also, a minor confirmation:
if it is a regression on your side, would you like to address it?

Due to further code changes, I proposed a temporary workaround
for our 6.6 kernels as below (I don't think it's clean but we will
do more tests), but due to limited time, currently I don't have
time to come up with a cleaner solution and track this until the
upstream fix lands.

Thanks,
Gao Xiang

  block/blk.h             |  1 +
  block/genhd.c           | 18 ++++++++++++++++--
  block/partitions/core.c |  6 +++++-
  3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index 475bbb40bb83..4410ae9da378 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -419,6 +419,7 @@ static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
  #endif /* CONFIG_BLK_DEV_ZONED */
  
  struct block_device *bdev_alloc(struct gendisk *disk, u8 partno);
+void bdev_inode_failed(struct block_device *bdev);
  void bdev_add(struct block_device *bdev, dev_t dev);
  
  int blk_alloc_ext_minor(void);
diff --git a/block/genhd.c b/block/genhd.c
index 039e7c17523b..cb4313a7c618 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -383,6 +383,14 @@ int disk_scan_partitions(struct gendisk *disk, blk_mode_t mode)
  	return ret;
  }
  
+void bdev_inode_failed(struct block_device *bdev)
+{
+	struct inode *inode = bdev->bd_inode;
+
+	make_bad_inode(inode);
+	unlock_new_inode(inode);
+}
+
  /**
   * device_add_disk - add disk information to kernel list
   * @parent: parent device for the disk
@@ -452,8 +460,12 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
  	ddev->parent = parent;
  	ddev->groups = groups;
  	dev_set_name(ddev, "%s", disk->disk_name);
-	if (!(disk->flags & GENHD_FL_HIDDEN))
+	if (!(disk->flags & GENHD_FL_HIDDEN)) {
  		ddev->devt = MKDEV(disk->major, disk->first_minor);
+		disk->part0->bd_inode->i_state |= I_NEW;
+		bdev_add(disk->part0, ddev->devt);
+	}
+
  	ret = device_add(ddev);
  	if (ret)
  		goto out_free_ext_minor;
@@ -505,7 +517,7 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
  		if (get_capacity(disk) && disk_has_partscan(disk))
  			set_bit(GD_NEED_PART_SCAN, &disk->state);
  
-		bdev_add(disk->part0, ddev->devt);
+		unlock_new_inode(disk->part0->bd_inode);
  		if (get_capacity(disk))
  			disk_scan_partitions(disk, BLK_OPEN_READ);
  
@@ -546,6 +558,8 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
  out_device_del:
  	device_del(ddev);
  out_free_ext_minor:
+	if (!(disk->flags & GENHD_FL_HIDDEN))
+		bdev_inode_failed(disk->part0);
  	if (disk->major == BLOCK_EXT_MAJOR)
  		blk_free_ext_minor(disk->first_minor);
  out_exit_elevator:
diff --git a/block/partitions/core.c b/block/partitions/core.c
index 549ce89a657b..c69e369955b9 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -376,6 +376,9 @@ static struct block_device *add_partition(struct gendisk *disk, int partno,
  			goto out_put;
  	}
  
+	bdev->bd_inode->i_state |= I_NEW;
+	bdev_add(bdev, devt);
+
  	/* delay uevent until 'holders' subdir is created */
  	dev_set_uevent_suppress(pdev, 1);
  	err = device_add(pdev);
@@ -398,7 +401,7 @@ static struct block_device *add_partition(struct gendisk *disk, int partno,
  	err = xa_insert(&disk->part_tbl, partno, bdev, GFP_KERNEL);
  	if (err)
  		goto out_del;
-	bdev_add(bdev, devt);
+	unlock_new_inode(bdev->bd_inode);
  
  	/* suppress uevent if the disk suppresses it */
  	if (!dev_get_uevent_suppress(ddev))
@@ -409,6 +412,7 @@ static struct block_device *add_partition(struct gendisk *disk, int partno,
  	kobject_put(bdev->bd_holder_dir);
  	device_del(pdev);
  out_put:
+	bdev_inode_failed(bdev);
  	put_device(pdev);
  	return ERR_PTR(err);
  out_put_disk:
-- 

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31 14:44                     ` Gao Xiang
  2025-11-05  3:04                       ` Gao Xiang
@ 2025-11-05 12:30                       ` Christian Brauner
  2025-11-05 14:13                         ` Gao Xiang
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Brauner @ 2025-11-05 12:30 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Greg Kroah-Hartman, Christoph Hellwig, Jens Axboe, Jan Kara,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel

On Fri, Oct 31, 2025 at 10:44:53PM +0800, Gao Xiang wrote:
> 
> 
> On 2025/10/31 22:34, Greg Kroah-Hartman wrote:
> > On Fri, Oct 31, 2025 at 08:23:32PM +0800, Gao Xiang wrote:
> > > 
> > > 
> > > On 2025/10/31 18:12, Gao Xiang wrote:
> > > > Hi Greg,
> > > > 
> > > > On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
> > > > > On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
> > > > > > 
> > > > > > 
> > > > > > On 2025/10/31 17:45, Christoph Hellwig wrote:
> > > 
> > > ...
> > > 
> > > > > > > But why does the device node
> > > > > > > get created earlier?  My assumption was that it would only be
> > > > > > > created by the KOBJ_ADD uevent.  Adding the device model maintainers
> > > > > > > as my little dig through the core drivers/base/ code doesn't find
> > > > > > > anything to the contrary, but maybe I don't fully understand it.
> > > > > > 
> > > > > > AFAIK, device_add() is used to trigger devtmpfs file
> > > > > > creation, and it can be observed if frequently
> > > > > > hotpluging device in the VM and mount.  Currently
> > > > > > I don't have time slot to build an easy reproducer,
> > > > > > but I think it's a real issue anyway.
> > > > > 
> > > > > As I say above, that's not normal, and you have to be root to do this,
> > > I just spent time to reproduce with dynamic loop devices and
> > > actually it's easy if msleep() is located artificiallly,
> > > the diff as below:
> > > 
> > > diff --git a/block/bdev.c b/block/bdev.c
> > > index 810707cca970..a4273b5ad456 100644
> > > --- a/block/bdev.c
> > > +++ b/block/bdev.c
> > > @@ -821,7 +821,7 @@ struct block_device *blkdev_get_no_open(dev_t dev, bool autoload)
> > >   	struct inode *inode;
> > > 
> > >   	inode = ilookup(blockdev_superblock, dev);
> > > -	if (!inode && autoload && IS_ENABLED(CONFIG_BLOCK_LEGACY_AUTOLOAD)) {
> > > +	if (0) {
> > >   		blk_request_module(dev);
> > >   		inode = ilookup(blockdev_superblock, dev);
> > >   		if (inode)
> > > diff --git a/block/genhd.c b/block/genhd.c
> > > index 9bbc38d12792..3c9116fdc1ce 100644
> > > --- a/block/genhd.c
> > > +++ b/block/genhd.c
> > > @@ -428,6 +428,8 @@ static void add_disk_final(struct gendisk *disk)
> > >   	set_bit(GD_ADDED, &disk->state);
> > >   }
> > > 
> > > +#include <linux/delay.h>
> > > +
> > >   static int __add_disk(struct device *parent, struct gendisk *disk,
> > >   		      const struct attribute_group **groups,
> > >   		      struct fwnode_handle *fwnode)
> > > @@ -497,6 +499,9 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
> > >   	if (ret)
> > >   		goto out_free_ext_minor;
> > > 
> > > +	if (disk->major == LOOP_MAJOR)
> > > +		msleep(2500);           // delay 2.5s for all loops
> > > +
> > 
> > Yes, so you need to watch for the uevent to happen, THEN it is safe to
> > access the block device.  Doing it before then isn't a good idea :)
> > 
> > But, if you think this is an issue, do you have a patch that passes your
> > testing to fix it?
> 
> I just raise it up for some ideas, and this change is
> buried into the code refactor and honestly I need to
> look into the codebase and related patchsets first.
> 
> Currently I have dozens of other development stuffs
> on hand, if it's really a regression, I do hope
> Christoph or other folks who are familiar with the code
> could try to address this.

If it's easy to do without much of a regression or performance risk then
the device node should only show up once the device is actually ready.
It's certainly best-practive to wait for the uevent though.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-11-05 12:30                       ` Christian Brauner
@ 2025-11-05 14:13                         ` Gao Xiang
  0 siblings, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2025-11-05 14:13 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Greg Kroah-Hartman, Christoph Hellwig, Jens Axboe, Jan Kara,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel

Hi Christian,

On 2025/11/5 20:30, Christian Brauner wrote:
> On Fri, Oct 31, 2025 at 10:44:53PM +0800, Gao Xiang wrote:
>>
>>
>> On 2025/10/31 22:34, Greg Kroah-Hartman wrote:
>>> On Fri, Oct 31, 2025 at 08:23:32PM +0800, Gao Xiang wrote:
>>>>
>>>>
>>>> On 2025/10/31 18:12, Gao Xiang wrote:
>>>>> Hi Greg,
>>>>>
>>>>> On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
>>>>>> On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2025/10/31 17:45, Christoph Hellwig wrote:
>>>>
>>>> ...
>>>>
>>>>>>>> But why does the device node
>>>>>>>> get created earlier?  My assumption was that it would only be
>>>>>>>> created by the KOBJ_ADD uevent.  Adding the device model maintainers
>>>>>>>> as my little dig through the core drivers/base/ code doesn't find
>>>>>>>> anything to the contrary, but maybe I don't fully understand it.
>>>>>>>
>>>>>>> AFAIK, device_add() is used to trigger devtmpfs file
>>>>>>> creation, and it can be observed if frequently
>>>>>>> hotpluging device in the VM and mount.  Currently
>>>>>>> I don't have time slot to build an easy reproducer,
>>>>>>> but I think it's a real issue anyway.
>>>>>>
>>>>>> As I say above, that's not normal, and you have to be root to do this,
>>>> I just spent time to reproduce with dynamic loop devices and
>>>> actually it's easy if msleep() is located artificiallly,
>>>> the diff as below:
>>>>
>>>> diff --git a/block/bdev.c b/block/bdev.c
>>>> index 810707cca970..a4273b5ad456 100644
>>>> --- a/block/bdev.c
>>>> +++ b/block/bdev.c
>>>> @@ -821,7 +821,7 @@ struct block_device *blkdev_get_no_open(dev_t dev, bool autoload)
>>>>    	struct inode *inode;
>>>>
>>>>    	inode = ilookup(blockdev_superblock, dev);
>>>> -	if (!inode && autoload && IS_ENABLED(CONFIG_BLOCK_LEGACY_AUTOLOAD)) {
>>>> +	if (0) {
>>>>    		blk_request_module(dev);
>>>>    		inode = ilookup(blockdev_superblock, dev);
>>>>    		if (inode)
>>>> diff --git a/block/genhd.c b/block/genhd.c
>>>> index 9bbc38d12792..3c9116fdc1ce 100644
>>>> --- a/block/genhd.c
>>>> +++ b/block/genhd.c
>>>> @@ -428,6 +428,8 @@ static void add_disk_final(struct gendisk *disk)
>>>>    	set_bit(GD_ADDED, &disk->state);
>>>>    }
>>>>
>>>> +#include <linux/delay.h>
>>>> +
>>>>    static int __add_disk(struct device *parent, struct gendisk *disk,
>>>>    		      const struct attribute_group **groups,
>>>>    		      struct fwnode_handle *fwnode)
>>>> @@ -497,6 +499,9 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
>>>>    	if (ret)
>>>>    		goto out_free_ext_minor;
>>>>
>>>> +	if (disk->major == LOOP_MAJOR)
>>>> +		msleep(2500);           // delay 2.5s for all loops
>>>> +
>>>
>>> Yes, so you need to watch for the uevent to happen, THEN it is safe to
>>> access the block device.  Doing it before then isn't a good idea :)
>>>
>>> But, if you think this is an issue, do you have a patch that passes your
>>> testing to fix it?
>>
>> I just raise it up for some ideas, and this change is
>> buried into the code refactor and honestly I need to
>> look into the codebase and related patchsets first.
>>
>> Currently I have dozens of other development stuffs
>> on hand, if it's really a regression, I do hope
>> Christoph or other folks who are familiar with the code
>> could try to address this.
> 
> If it's easy to do without much of a regression or performance risk then
> the device node should only show up once the device is actually ready.

Yeah, agreed.

> It's certainly best-practive to wait for the uevent though.

Currently our internal applications will try to adapt uevent
detection too, but as a public cloud provider, we may still
need a fallback to avoid users' potential blame on this,
anyway.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2025-11-05 14:13 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-08-18 14:45 add error handling to add_disk / device_add_disk Christoph Hellwig
2021-08-18 14:45 ` [PATCH 01/11] block: add a sanity check for a live disk in del_gendisk Christoph Hellwig
2021-08-19 10:41   ` Hannes Reinecke
2021-08-18 14:45 ` [PATCH 02/11] block: fold register_disk into device_add_disk Christoph Hellwig
2021-08-19 10:48   ` Hannes Reinecke
2021-08-18 14:45 ` [PATCH 03/11] block: call bdev_add later in device_add_disk Christoph Hellwig
2021-08-20  6:06   ` Hannes Reinecke
2025-10-31  7:46   ` question about bd_inode hashing against device_add() // " Gao Xiang
2025-10-31  9:09     ` Christoph Hellwig
2025-10-31  9:36       ` Gao Xiang
2025-10-31  9:45         ` Christoph Hellwig
2025-10-31  9:54           ` Gao Xiang
2025-10-31  9:58             ` Greg Kroah-Hartman
2025-10-31 10:12               ` Gao Xiang
2025-10-31 12:23                 ` Gao Xiang
2025-10-31 12:25                   ` Gao Xiang
2025-10-31 14:34                   ` Greg Kroah-Hartman
2025-10-31 14:44                     ` Gao Xiang
2025-11-05  3:04                       ` Gao Xiang
2025-11-05 12:30                       ` Christian Brauner
2025-11-05 14:13                         ` Gao Xiang
2025-10-31 14:31                 ` Greg Kroah-Hartman
2025-10-31 14:40                   ` Gao Xiang
2021-08-18 14:45 ` [PATCH 04/11] block: create the bdi link earlier " Christoph Hellwig
2021-08-20  6:08   ` Hannes Reinecke
2021-08-18 14:45 ` [PATCH 05/11] block: call blk_integrity_add " Christoph Hellwig
2021-08-20  6:08   ` Hannes Reinecke
2021-08-18 14:45 ` [PATCH 06/11] block: call blk_register_queue " Christoph Hellwig
2021-08-20  6:09   ` Hannes Reinecke
2021-08-18 14:45 ` [PATCH 07/11] block: return errors from blk_integrity_add Christoph Hellwig
2021-08-20  6:10   ` Hannes Reinecke
2021-08-18 14:45 ` [PATCH 08/11] block: return errors from disk_alloc_events Christoph Hellwig
2021-08-20  6:10   ` Hannes Reinecke
2021-08-18 14:45 ` [PATCH 09/11] block: add error handling for device_add_disk / add_disk Christoph Hellwig
2021-08-20  6:12   ` Hannes Reinecke
2021-08-18 14:45 ` [PATCH 10/11] virtio_blk: add error handling support for add_disk() Christoph Hellwig
2021-08-20  6:12   ` Hannes Reinecke
2021-08-18 14:45 ` [PATCH 11/11] null_blk: " Christoph Hellwig
2021-08-20  6:13   ` Hannes Reinecke
2021-08-23 18:57 ` add error handling to add_disk / device_add_disk Jens Axboe
2021-08-23 19:44   ` Luis Chamberlain
2021-08-23 19:50     ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox