public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
       [not found]       ` <ae38c5dc-da90-4fb3-bb72-61b66ab5a0d2@linux.alibaba.com>
@ 2025-10-31  9:45         ` Christoph Hellwig
  2025-10-31  9:54           ` Gao Xiang
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2025-10-31  9:45 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Christoph Hellwig, Jens Axboe, Jan Kara, Christian Brauner,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Greg Kroah-Hartman, Rafael J. Wysocki,
	Danilo Krummrich, linux-kernel

On Fri, Oct 31, 2025 at 05:36:45PM +0800, Gao Xiang wrote:
> Right, sorry yes, disk_uevent(KOBJ_ADD) is in the end.
>
>>  Do you see that earlier, or do you have
>> code busy polling for a node?
>
> Personally I think it will break many userspace programs
> (although I also don't think it's a correct expectation.)

We've had this behavior for a few years, and this is the first report
I've seen.

> After recheck internally, the userspace program logic is:
>   - stat /dev/vdX;
>   - if exists, mount directly;
>   - if non-exists, listen uevent disk_add instead.
>
> Previously, for devtmpfs blkdev files, such stat/mount
> assumption is always valid.

That assumption doesn't seem wrong.  But why does the device node
get created earlier?  My assumption was that it would only be
created by the KOBJ_ADD uevent.  Adding the device model maintainers
as my little dig through the core drivers/base/ code doesn't find
anything to the contrary, but maybe I don't fully understand it.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31  9:45         ` question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk Christoph Hellwig
@ 2025-10-31  9:54           ` Gao Xiang
  2025-10-31  9:58             ` Greg Kroah-Hartman
  0 siblings, 1 reply; 13+ messages in thread
From: Gao Xiang @ 2025-10-31  9:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Jan Kara, Christian Brauner, Michael S. Tsirkin,
	Jason Wang, Martin K. Petersen, Luis Chamberlain, linux-block,
	Joseph Qi, guanghuifeng, zongyong.wzy, zyfjeff,
	Greg Kroah-Hartman, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel



On 2025/10/31 17:45, Christoph Hellwig wrote:
> On Fri, Oct 31, 2025 at 05:36:45PM +0800, Gao Xiang wrote:
>> Right, sorry yes, disk_uevent(KOBJ_ADD) is in the end.
>>
>>>   Do you see that earlier, or do you have
>>> code busy polling for a node?
>>
>> Personally I think it will break many userspace programs
>> (although I also don't think it's a correct expectation.)
> 
> We've had this behavior for a few years, and this is the first report
> I've seen.
> 
>> After recheck internally, the userspace program logic is:
>>    - stat /dev/vdX;
>>    - if exists, mount directly;
>>    - if non-exists, listen uevent disk_add instead.
>>
>> Previously, for devtmpfs blkdev files, such stat/mount
>> assumption is always valid.
> 
> That assumption doesn't seem wrong.

;-) I was thought UNIX mknod doesn't imply the device is
ready or valid in any case (but dev files in devtmpfs
might be an exception but I didn't find some formal words)...
so uevent is clearly a right way, but..

> But why does the device node
> get created earlier?  My assumption was that it would only be
> created by the KOBJ_ADD uevent.  Adding the device model maintainers
> as my little dig through the core drivers/base/ code doesn't find
> anything to the contrary, but maybe I don't fully understand it.

AFAIK, device_add() is used to trigger devtmpfs file
creation, and it can be observed if frequently
hotpluging device in the VM and mount.  Currently
I don't have time slot to build an easy reproducer,
but I think it's a real issue anyway.

Thanks,
Gao Xiang


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31  9:54           ` Gao Xiang
@ 2025-10-31  9:58             ` Greg Kroah-Hartman
  2025-10-31 10:12               ` Gao Xiang
  0 siblings, 1 reply; 13+ messages in thread
From: Greg Kroah-Hartman @ 2025-10-31  9:58 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Christoph Hellwig, Jens Axboe, Jan Kara, Christian Brauner,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel

On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
> 
> 
> On 2025/10/31 17:45, Christoph Hellwig wrote:
> > On Fri, Oct 31, 2025 at 05:36:45PM +0800, Gao Xiang wrote:
> > > Right, sorry yes, disk_uevent(KOBJ_ADD) is in the end.
> > > 
> > > >   Do you see that earlier, or do you have
> > > > code busy polling for a node?
> > > 
> > > Personally I think it will break many userspace programs
> > > (although I also don't think it's a correct expectation.)
> > 
> > We've had this behavior for a few years, and this is the first report
> > I've seen.
> > 
> > > After recheck internally, the userspace program logic is:
> > >    - stat /dev/vdX;
> > >    - if exists, mount directly;
> > >    - if non-exists, listen uevent disk_add instead.
> > > 
> > > Previously, for devtmpfs blkdev files, such stat/mount
> > > assumption is always valid.
> > 
> > That assumption doesn't seem wrong.
> 
> ;-) I was thought UNIX mknod doesn't imply the device is
> ready or valid in any case (but dev files in devtmpfs
> might be an exception but I didn't find some formal words)...
> so uevent is clearly a right way, but..

Yes, anyone can do a mknod and attempt to open a device that isn't
present.

when devtmpfs creates the device node, it should be there.  Unless it
gets removed, and then added back, so you could race with userspace, but
that's not normal.

> > But why does the device node
> > get created earlier?  My assumption was that it would only be
> > created by the KOBJ_ADD uevent.  Adding the device model maintainers
> > as my little dig through the core drivers/base/ code doesn't find
> > anything to the contrary, but maybe I don't fully understand it.
> 
> AFAIK, device_add() is used to trigger devtmpfs file
> creation, and it can be observed if frequently
> hotpluging device in the VM and mount.  Currently
> I don't have time slot to build an easy reproducer,
> but I think it's a real issue anyway.

As I say above, that's not normal, and you have to be root to do this,
so I don't understand what you are trying to prevent happening?  What is
the bug and why is it just showing up now (i.e. what changed to cause
it?)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31  9:58             ` Greg Kroah-Hartman
@ 2025-10-31 10:12               ` Gao Xiang
  2025-10-31 12:23                 ` Gao Xiang
  2025-10-31 14:31                 ` Greg Kroah-Hartman
  0 siblings, 2 replies; 13+ messages in thread
From: Gao Xiang @ 2025-10-31 10:12 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Christoph Hellwig, Jens Axboe, Jan Kara, Christian Brauner,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel

Hi Greg,

On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
> On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
>>
>>
>> On 2025/10/31 17:45, Christoph Hellwig wrote:
>>> On Fri, Oct 31, 2025 at 05:36:45PM +0800, Gao Xiang wrote:
>>>> Right, sorry yes, disk_uevent(KOBJ_ADD) is in the end.
>>>>
>>>>>    Do you see that earlier, or do you have
>>>>> code busy polling for a node?
>>>>
>>>> Personally I think it will break many userspace programs
>>>> (although I also don't think it's a correct expectation.)
>>>
>>> We've had this behavior for a few years, and this is the first report
>>> I've seen.
>>>
>>>> After recheck internally, the userspace program logic is:
>>>>     - stat /dev/vdX;
>>>>     - if exists, mount directly;
>>>>     - if non-exists, listen uevent disk_add instead.
>>>>
>>>> Previously, for devtmpfs blkdev files, such stat/mount
>>>> assumption is always valid.
>>>
>>> That assumption doesn't seem wrong.
>>
>> ;-) I was thought UNIX mknod doesn't imply the device is
>> ready or valid in any case (but dev files in devtmpfs
>> might be an exception but I didn't find some formal words)...
>> so uevent is clearly a right way, but..
> 
> Yes, anyone can do a mknod and attempt to open a device that isn't
> present.
> 
> when devtmpfs creates the device node, it should be there.  Unless it
> gets removed, and then added back, so you could race with userspace, but
> that's not normal.
> 
>>> But why does the device node
>>> get created earlier?  My assumption was that it would only be
>>> created by the KOBJ_ADD uevent.  Adding the device model maintainers
>>> as my little dig through the core drivers/base/ code doesn't find
>>> anything to the contrary, but maybe I don't fully understand it.
>>
>> AFAIK, device_add() is used to trigger devtmpfs file
>> creation, and it can be observed if frequently
>> hotpluging device in the VM and mount.  Currently
>> I don't have time slot to build an easy reproducer,
>> but I think it's a real issue anyway.
> 
> As I say above, that's not normal, and you have to be root to do this,

Just thinking out if I am a random reporter, I could
report the original symptom now because we face it,
but everyone has his own internal business or even
with limited kernel ability for example, in any
case, there is no such expectation to rush someone
into build a clean reproducer.

Nevertheless, I will take time on the reproducer, and
I think it could just add some artificial delay just
after device_add(). I could try anyway, but no rush.

> so I don't understand what you are trying to prevent happening?  What is

The original report was
https://lore.kernel.org/r/43375218-2a80-4a7a-b8bb-465f6419b595@linux.alibaba.com/

> the bug and why is it just showing up now (i.e. what changed to cause
> it?)

I don't know, I think just because 6.6 is a relatively
newer kernel, and most userspace logic has retry logic
to cover this up.


Thanks,
Gao Xiang


> 
> thanks,
> 
> greg k-h


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31 10:12               ` Gao Xiang
@ 2025-10-31 12:23                 ` Gao Xiang
  2025-10-31 12:25                   ` Gao Xiang
  2025-10-31 14:34                   ` Greg Kroah-Hartman
  2025-10-31 14:31                 ` Greg Kroah-Hartman
  1 sibling, 2 replies; 13+ messages in thread
From: Gao Xiang @ 2025-10-31 12:23 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Christoph Hellwig, Jens Axboe, Jan Kara,
	Christian Brauner
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel



On 2025/10/31 18:12, Gao Xiang wrote:
> Hi Greg,
> 
> On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
>> On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
>>>
>>>
>>> On 2025/10/31 17:45, Christoph Hellwig wrote:

...

>>>> But why does the device node
>>>> get created earlier?  My assumption was that it would only be
>>>> created by the KOBJ_ADD uevent.  Adding the device model maintainers
>>>> as my little dig through the core drivers/base/ code doesn't find
>>>> anything to the contrary, but maybe I don't fully understand it.
>>>
>>> AFAIK, device_add() is used to trigger devtmpfs file
>>> creation, and it can be observed if frequently
>>> hotpluging device in the VM and mount.  Currently
>>> I don't have time slot to build an easy reproducer,
>>> but I think it's a real issue anyway.
>>
>> As I say above, that's not normal, and you have to be root to do this,
I just spent time to reproduce with dynamic loop devices and
actually it's easy if msleep() is located artificiallly,
the diff as below:

diff --git a/block/bdev.c b/block/bdev.c
index 810707cca970..a4273b5ad456 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -821,7 +821,7 @@ struct block_device *blkdev_get_no_open(dev_t dev, bool autoload)
  	struct inode *inode;

  	inode = ilookup(blockdev_superblock, dev);
-	if (!inode && autoload && IS_ENABLED(CONFIG_BLOCK_LEGACY_AUTOLOAD)) {
+	if (0) {
  		blk_request_module(dev);
  		inode = ilookup(blockdev_superblock, dev);
  		if (inode)
diff --git a/block/genhd.c b/block/genhd.c
index 9bbc38d12792..3c9116fdc1ce 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -428,6 +428,8 @@ static void add_disk_final(struct gendisk *disk)
  	set_bit(GD_ADDED, &disk->state);
  }

+#include <linux/delay.h>
+
  static int __add_disk(struct device *parent, struct gendisk *disk,
  		      const struct attribute_group **groups,
  		      struct fwnode_handle *fwnode)
@@ -497,6 +499,9 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
  	if (ret)
  		goto out_free_ext_minor;

+	if (disk->major == LOOP_MAJOR)
+		msleep(2500);           // delay 2.5s for all loops
+
  	ret = disk_alloc_events(disk);
  	if (ret)
  		goto out_device_del;


(Note that I masked off CONFIG_BLOCK_LEGACY_AUTOLOAD
  for cleaner ftrace below.)

and then

# uname -a  (patched 6.18-rc1 kernel)

```
Linux 7e5b4b5f5181 6.18.0-rc1-dirty #25 SMP PREEMPT_DYNAMIC Fri Oct 31 19:52:10 CST 2025 x86_64 GNU/Linux
```

# truncate -s 1g test.img; mkfs.ext4 -F test.img;
# losetup /dev/loop999 test.img & sleep 1; ls -l /dev/loop999; strace mount -t ext4 /dev/loop999 mnt 2>&1 | grep fsconfig

It shows

```
brw------- 1 root root 7, 999 Oct 31 20:06 /dev/loop999
fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/loop999", 0) = 0
fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = -1 ENXIO (No such device or address)  // unexpected
```

then

# losetup /dev/loop996 test.img & sleep 1; stat /dev/loop996; trace-cmd record -p function_graph mount -t ext4 /dev/loop996 mnt &> /dev/null

It shows
```
   File: /dev/loop996
   Size: 0               Blocks: 0          IO Block: 4096   block special file
Device: 0,6     Inode: 429         Links: 1     Device type: 7,996
Access: (0600/brw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2025-10-31 20:07:54.938474868 +0800
Modify: 2025-10-31 20:07:54.938474868 +0800
Change: 2025-10-31 20:07:54.938474868 +0800
  Birth: 2025-10-31 20:07:54.938474868 +0800
```

but

# trace-cmd report | grep mount | less
            mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                bdev_file_open_by_dev() {
            mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                  bdev_permission() {
            mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                    devcgroup_check_permission() {
            mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                      __rcu_read_lock() {
            mount-561   [007] ...1.   240.180514: funcgraph_exit:         0.193 us   |                      } (ret=0x1)
            mount-561   [007] ...1.   240.180514: funcgraph_entry:                   |                      match_exception_partial() {
            mount-561   [007] ...1.   240.180514: funcgraph_exit:         0.199 us   |                      } (ret=0x0)
            mount-561   [007] ...1.   240.180514: funcgraph_entry:                   |                      __rcu_read_unlock() {
            mount-561   [007] ...1.   240.180515: funcgraph_exit:         0.202 us   |                      } (ret=0x0)
            mount-561   [007] ...1.   240.180515: funcgraph_exit:         1.602 us   |                    } (ret=0x0)
            mount-561   [007] ...1.   240.180515: funcgraph_exit:         2.100 us   |                  } (ret=0x0)
            mount-561   [007] ...1.   240.180515: funcgraph_entry:                   |                  ilookup() {
            mount-561   [007] ...1.   240.180516: funcgraph_entry:                   |                    __cond_resched() {
            mount-561   [007] ...1.   240.180516: funcgraph_exit:         0.194 us   |                    } (ret=0x0)
            mount-561   [007] ...1.   240.180516: funcgraph_entry:                   |                    find_inode_fast() {
            mount-561   [007] ...1.   240.180516: funcgraph_entry:                   |                      __rcu_read_lock() {
            mount-561   [007] ...1.   240.180516: funcgraph_exit:         0.195 us   |                      } (ret=0x1)
            mount-561   [007] ...1.   240.180517: funcgraph_entry:                   |                      __rcu_read_unlock() {
            mount-561   [007] ...1.   240.180517: funcgraph_exit:         0.193 us   |                      } (ret=0x0)
            mount-561   [007] ...1.   240.180517: funcgraph_exit:         1.060 us   |                    } (ret=0x0)
            mount-561   [007] ...1.   240.180517: funcgraph_exit:         1.970 us   |                  } (ret=0x0)
            mount-561   [007] ...1.   240.180518: funcgraph_exit:         4.818 us   |                } (ret=-6)

here -6 (-ENXIO) is unexpected.

Actually the problematic code path I've said is device_add():

upstream code:

loop_control_ioctl
  loop_add
    add_disk_fwnode
      __add_disk
        devtmpfs_create_node   // here create devtmpfs blkdev file, but racy
      add_disk_final
        bdev_add
          insert_inode_hash    // just seen by bdev_file_open_by_dev()
        disk_uevent(disk, KOBJ_ADD)

I actually think it's enough to explain the root.

Thanks,
Gao Xiang

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31 12:23                 ` Gao Xiang
@ 2025-10-31 12:25                   ` Gao Xiang
  2025-10-31 14:34                   ` Greg Kroah-Hartman
  1 sibling, 0 replies; 13+ messages in thread
From: Gao Xiang @ 2025-10-31 12:25 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Christoph Hellwig, Jens Axboe, Jan Kara,
	Christian Brauner
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel



On 2025/10/31 20:23, Gao Xiang wrote:
> 
> 
> On 2025/10/31 18:12, Gao Xiang wrote:
>> Hi Greg,
>>
>> On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
>>> On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
>>>>
>>>>
>>>> On 2025/10/31 17:45, Christoph Hellwig wrote:
> 
> ...
> 
>>>>> But why does the device node
>>>>> get created earlier?  My assumption was that it would only be
>>>>> created by the KOBJ_ADD uevent.  Adding the device model maintainers
>>>>> as my little dig through the core drivers/base/ code doesn't find
>>>>> anything to the contrary, but maybe I don't fully understand it.
>>>>
>>>> AFAIK, device_add() is used to trigger devtmpfs file
>>>> creation, and it can be observed if frequently
>>>> hotpluging device in the VM and mount.  Currently
>>>> I don't have time slot to build an easy reproducer,
>>>> but I think it's a real issue anyway.
>>>
>>> As I say above, that's not normal, and you have to be root to do this,
> I just spent time to reproduce with dynamic loop devices and
> actually it's easy if msleep() is located artificiallly,
> the diff as below:
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 810707cca970..a4273b5ad456 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -821,7 +821,7 @@ struct block_device *blkdev_get_no_open(dev_t dev, bool autoload)
>       struct inode *inode;
> 
>       inode = ilookup(blockdev_superblock, dev);
> -    if (!inode && autoload && IS_ENABLED(CONFIG_BLOCK_LEGACY_AUTOLOAD)) {
> +    if (0) {
>           blk_request_module(dev);
>           inode = ilookup(blockdev_superblock, dev);
>           if (inode)
> diff --git a/block/genhd.c b/block/genhd.c
> index 9bbc38d12792..3c9116fdc1ce 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -428,6 +428,8 @@ static void add_disk_final(struct gendisk *disk)
>       set_bit(GD_ADDED, &disk->state);
>   }
> 
> +#include <linux/delay.h>
> +
>   static int __add_disk(struct device *parent, struct gendisk *disk,
>                 const struct attribute_group **groups,
>                 struct fwnode_handle *fwnode)
> @@ -497,6 +499,9 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
>       if (ret)
>           goto out_free_ext_minor;
> 
> +    if (disk->major == LOOP_MAJOR)
> +        msleep(2500);           // delay 2.5s for all loops
> +
>       ret = disk_alloc_events(disk);
>       if (ret)
>           goto out_device_del;
> 
> 
> (Note that I masked off CONFIG_BLOCK_LEGACY_AUTOLOAD
>   for cleaner ftrace below.)
> 
> and then
> 
> # uname -a  (patched 6.18-rc1 kernel)
> 
> ```
> Linux 7e5b4b5f5181 6.18.0-rc1-dirty #25 SMP PREEMPT_DYNAMIC Fri Oct 31 19:52:10 CST 2025 x86_64 GNU/Linux
> ```
> 
> # truncate -s 1g test.img; mkfs.ext4 -F test.img;
> # losetup /dev/loop999 test.img & sleep 1; ls -l /dev/loop999; strace mount -t ext4 /dev/loop999 mnt 2>&1 | grep fsconfig
> 
> It shows
> 
> ```
> brw------- 1 root root 7, 999 Oct 31 20:06 /dev/loop999
> fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/loop999", 0) = 0
> fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = -1 ENXIO (No such device or address)  // unexpected
> ```
> 
> then
> 
> # losetup /dev/loop996 test.img & sleep 1; stat /dev/loop996; trace-cmd record -p function_graph mount -t ext4 /dev/loop996 mnt &> /dev/null
> 
> It shows
> ```
>    File: /dev/loop996
>    Size: 0               Blocks: 0          IO Block: 4096   block special file
> Device: 0,6     Inode: 429         Links: 1     Device type: 7,996
> Access: (0600/brw-------)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2025-10-31 20:07:54.938474868 +0800
> Modify: 2025-10-31 20:07:54.938474868 +0800
> Change: 2025-10-31 20:07:54.938474868 +0800
>   Birth: 2025-10-31 20:07:54.938474868 +0800
> ```
> 
> but
> 
> # trace-cmd report | grep mount | less
>             mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                bdev_file_open_by_dev() {
>             mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                  bdev_permission() {
>             mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                    devcgroup_check_permission() {
>             mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                      __rcu_read_lock() {
>             mount-561   [007] ...1.   240.180514: funcgraph_exit:         0.193 us   |                      } (ret=0x1)
>             mount-561   [007] ...1.   240.180514: funcgraph_entry:                   |                      match_exception_partial() {
>             mount-561   [007] ...1.   240.180514: funcgraph_exit:         0.199 us   |                      } (ret=0x0)
>             mount-561   [007] ...1.   240.180514: funcgraph_entry:                   |                      __rcu_read_unlock() {
>             mount-561   [007] ...1.   240.180515: funcgraph_exit:         0.202 us   |                      } (ret=0x0)
>             mount-561   [007] ...1.   240.180515: funcgraph_exit:         1.602 us   |                    } (ret=0x0)
>             mount-561   [007] ...1.   240.180515: funcgraph_exit:         2.100 us   |                  } (ret=0x0)
>             mount-561   [007] ...1.   240.180515: funcgraph_entry:                   |                  ilookup() {
>             mount-561   [007] ...1.   240.180516: funcgraph_entry:                   |                    __cond_resched() {
>             mount-561   [007] ...1.   240.180516: funcgraph_exit:         0.194 us   |                    } (ret=0x0)
>             mount-561   [007] ...1.   240.180516: funcgraph_entry:                   |                    find_inode_fast() {
>             mount-561   [007] ...1.   240.180516: funcgraph_entry:                   |                      __rcu_read_lock() {
>             mount-561   [007] ...1.   240.180516: funcgraph_exit:         0.195 us   |                      } (ret=0x1)
>             mount-561   [007] ...1.   240.180517: funcgraph_entry:                   |                      __rcu_read_unlock() {
>             mount-561   [007] ...1.   240.180517: funcgraph_exit:         0.193 us   |                      } (ret=0x0)
>             mount-561   [007] ...1.   240.180517: funcgraph_exit:         1.060 us   |                    } (ret=0x0)
>             mount-561   [007] ...1.   240.180517: funcgraph_exit:         1.970 us   |                  } (ret=0x0)
>             mount-561   [007] ...1.   240.180518: funcgraph_exit:         4.818 us   |                } (ret=-6)
> 
> here -6 (-ENXIO) is unexpected.
> 
> Actually the problematic code path I've said is device_add():
> 
> upstream code:
> 
> loop_control_ioctl
>   loop_add
>     add_disk_fwnode
>       __add_disk
>         devtmpfs_create_node   // here create devtmpfs blkdev file, but racy
>       add_disk_final
>         bdev_add
>           insert_inode_hash    // just seen by bdev_file_open_by_dev()
>         disk_uevent(disk, KOBJ_ADD)

minor revision:

  loop_control_ioctl
    loop_add
      add_disk_fwnode
        __add_disk
          device_add
            devtmpfs_create_node   // here create devtmpfs blkdev file, but racy
        add_disk_final
          bdev_add
            insert_inode_hash    // just seen by bdev_file_open_by_dev()
          disk_uevent(disk, KOBJ_ADD)

> 
> I actually think it's enough to explain the root.
> 
> Thanks,
> Gao Xiang


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31 10:12               ` Gao Xiang
  2025-10-31 12:23                 ` Gao Xiang
@ 2025-10-31 14:31                 ` Greg Kroah-Hartman
  2025-10-31 14:40                   ` Gao Xiang
  1 sibling, 1 reply; 13+ messages in thread
From: Greg Kroah-Hartman @ 2025-10-31 14:31 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Christoph Hellwig, Jens Axboe, Jan Kara, Christian Brauner,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel

On Fri, Oct 31, 2025 at 06:12:05PM +0800, Gao Xiang wrote:
> Hi Greg,
> 
> On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
> > On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
> > > 
> > > 
> > > On 2025/10/31 17:45, Christoph Hellwig wrote:
> > > > On Fri, Oct 31, 2025 at 05:36:45PM +0800, Gao Xiang wrote:
> > > > > Right, sorry yes, disk_uevent(KOBJ_ADD) is in the end.
> > > > > 
> > > > > >    Do you see that earlier, or do you have
> > > > > > code busy polling for a node?
> > > > > 
> > > > > Personally I think it will break many userspace programs
> > > > > (although I also don't think it's a correct expectation.)
> > > > 
> > > > We've had this behavior for a few years, and this is the first report
> > > > I've seen.
> > > > 
> > > > > After recheck internally, the userspace program logic is:
> > > > >     - stat /dev/vdX;
> > > > >     - if exists, mount directly;
> > > > >     - if non-exists, listen uevent disk_add instead.
> > > > > 
> > > > > Previously, for devtmpfs blkdev files, such stat/mount
> > > > > assumption is always valid.
> > > > 
> > > > That assumption doesn't seem wrong.
> > > 
> > > ;-) I was thought UNIX mknod doesn't imply the device is
> > > ready or valid in any case (but dev files in devtmpfs
> > > might be an exception but I didn't find some formal words)...
> > > so uevent is clearly a right way, but..
> > 
> > Yes, anyone can do a mknod and attempt to open a device that isn't
> > present.
> > 
> > when devtmpfs creates the device node, it should be there.  Unless it
> > gets removed, and then added back, so you could race with userspace, but
> > that's not normal.
> > 
> > > > But why does the device node
> > > > get created earlier?  My assumption was that it would only be
> > > > created by the KOBJ_ADD uevent.  Adding the device model maintainers
> > > > as my little dig through the core drivers/base/ code doesn't find
> > > > anything to the contrary, but maybe I don't fully understand it.
> > > 
> > > AFAIK, device_add() is used to trigger devtmpfs file
> > > creation, and it can be observed if frequently
> > > hotpluging device in the VM and mount.  Currently
> > > I don't have time slot to build an easy reproducer,
> > > but I think it's a real issue anyway.
> > 
> > As I say above, that's not normal, and you have to be root to do this,
> 
> Just thinking out if I am a random reporter, I could
> report the original symptom now because we face it,
> but everyone has his own internal business or even
> with limited kernel ability for example, in any
> case, there is no such expectation to rush someone
> into build a clean reproducer.
> 
> Nevertheless, I will take time on the reproducer, and
> I think it could just add some artificial delay just
> after device_add(). I could try anyway, but no rush.
> 
> > so I don't understand what you are trying to prevent happening?  What is
> 
> The original report was
> https://lore.kernel.org/r/43375218-2a80-4a7a-b8bb-465f6419b595@linux.alibaba.com/

So you see cases where the device node is present, you try to open it,
but yet there is no real block device behind it at all?

> > the bug and why is it just showing up now (i.e. what changed to cause
> > it?)
> 
> I don't know, I think just because 6.6 is a relatively
> newer kernel, and most userspace logic has retry logic
> to cover this up.

6.6 has been out for 2 years now, this is a long time in kernel
development cycles for things to just start showing up now.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31 12:23                 ` Gao Xiang
  2025-10-31 12:25                   ` Gao Xiang
@ 2025-10-31 14:34                   ` Greg Kroah-Hartman
  2025-10-31 14:44                     ` Gao Xiang
  1 sibling, 1 reply; 13+ messages in thread
From: Greg Kroah-Hartman @ 2025-10-31 14:34 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Christoph Hellwig, Jens Axboe, Jan Kara, Christian Brauner,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel

On Fri, Oct 31, 2025 at 08:23:32PM +0800, Gao Xiang wrote:
> 
> 
> On 2025/10/31 18:12, Gao Xiang wrote:
> > Hi Greg,
> > 
> > On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
> > > On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
> > > > 
> > > > 
> > > > On 2025/10/31 17:45, Christoph Hellwig wrote:
> 
> ...
> 
> > > > > But why does the device node
> > > > > get created earlier?  My assumption was that it would only be
> > > > > created by the KOBJ_ADD uevent.  Adding the device model maintainers
> > > > > as my little dig through the core drivers/base/ code doesn't find
> > > > > anything to the contrary, but maybe I don't fully understand it.
> > > > 
> > > > AFAIK, device_add() is used to trigger devtmpfs file
> > > > creation, and it can be observed if frequently
> > > > hotpluging device in the VM and mount.  Currently
> > > > I don't have time slot to build an easy reproducer,
> > > > but I think it's a real issue anyway.
> > > 
> > > As I say above, that's not normal, and you have to be root to do this,
> I just spent time to reproduce with dynamic loop devices and
> actually it's easy if msleep() is located artificiallly,
> the diff as below:
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 810707cca970..a4273b5ad456 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -821,7 +821,7 @@ struct block_device *blkdev_get_no_open(dev_t dev, bool autoload)
>  	struct inode *inode;
> 
>  	inode = ilookup(blockdev_superblock, dev);
> -	if (!inode && autoload && IS_ENABLED(CONFIG_BLOCK_LEGACY_AUTOLOAD)) {
> +	if (0) {
>  		blk_request_module(dev);
>  		inode = ilookup(blockdev_superblock, dev);
>  		if (inode)
> diff --git a/block/genhd.c b/block/genhd.c
> index 9bbc38d12792..3c9116fdc1ce 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -428,6 +428,8 @@ static void add_disk_final(struct gendisk *disk)
>  	set_bit(GD_ADDED, &disk->state);
>  }
> 
> +#include <linux/delay.h>
> +
>  static int __add_disk(struct device *parent, struct gendisk *disk,
>  		      const struct attribute_group **groups,
>  		      struct fwnode_handle *fwnode)
> @@ -497,6 +499,9 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
>  	if (ret)
>  		goto out_free_ext_minor;
> 
> +	if (disk->major == LOOP_MAJOR)
> +		msleep(2500);           // delay 2.5s for all loops
> +

Yes, so you need to watch for the uevent to happen, THEN it is safe to
access the block device.  Doing it before then isn't a good idea :)

But, if you think this is an issue, do you have a patch that passes your
testing to fix it?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31 14:31                 ` Greg Kroah-Hartman
@ 2025-10-31 14:40                   ` Gao Xiang
  0 siblings, 0 replies; 13+ messages in thread
From: Gao Xiang @ 2025-10-31 14:40 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Christoph Hellwig, Jens Axboe, Jan Kara, Christian Brauner,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel



On 2025/10/31 22:31, Greg Kroah-Hartman wrote:
> On Fri, Oct 31, 2025 at 06:12:05PM +0800, Gao Xiang wrote:
>> Hi Greg,
>>
>> On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
>>> On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
>>>>
>>>>
>>>> On 2025/10/31 17:45, Christoph Hellwig wrote:
>>>>> On Fri, Oct 31, 2025 at 05:36:45PM +0800, Gao Xiang wrote:
>>>>>> Right, sorry yes, disk_uevent(KOBJ_ADD) is in the end.
>>>>>>
>>>>>>>     Do you see that earlier, or do you have
>>>>>>> code busy polling for a node?
>>>>>>
>>>>>> Personally I think it will break many userspace programs
>>>>>> (although I also don't think it's a correct expectation.)
>>>>>
>>>>> We've had this behavior for a few years, and this is the first report
>>>>> I've seen.
>>>>>
>>>>>> After recheck internally, the userspace program logic is:
>>>>>>      - stat /dev/vdX;
>>>>>>      - if exists, mount directly;
>>>>>>      - if non-exists, listen uevent disk_add instead.
>>>>>>
>>>>>> Previously, for devtmpfs blkdev files, such stat/mount
>>>>>> assumption is always valid.
>>>>>
>>>>> That assumption doesn't seem wrong.
>>>>
>>>> ;-) I was thought UNIX mknod doesn't imply the device is
>>>> ready or valid in any case (but dev files in devtmpfs
>>>> might be an exception but I didn't find some formal words)...
>>>> so uevent is clearly a right way, but..
>>>
>>> Yes, anyone can do a mknod and attempt to open a device that isn't
>>> present.
>>>
>>> when devtmpfs creates the device node, it should be there.  Unless it
>>> gets removed, and then added back, so you could race with userspace, but
>>> that's not normal.
>>>
>>>>> But why does the device node
>>>>> get created earlier?  My assumption was that it would only be
>>>>> created by the KOBJ_ADD uevent.  Adding the device model maintainers
>>>>> as my little dig through the core drivers/base/ code doesn't find
>>>>> anything to the contrary, but maybe I don't fully understand it.
>>>>
>>>> AFAIK, device_add() is used to trigger devtmpfs file
>>>> creation, and it can be observed if frequently
>>>> hotpluging device in the VM and mount.  Currently
>>>> I don't have time slot to build an easy reproducer,
>>>> but I think it's a real issue anyway.
>>>
>>> As I say above, that's not normal, and you have to be root to do this,
>>
>> Just thinking out if I am a random reporter, I could
>> report the original symptom now because we face it,
>> but everyone has his own internal business or even
>> with limited kernel ability for example, in any
>> case, there is no such expectation to rush someone
>> into build a clean reproducer.
>>
>> Nevertheless, I will take time on the reproducer, and
>> I think it could just add some artificial delay just
>> after device_add(). I could try anyway, but no rush.
>>
>>> so I don't understand what you are trying to prevent happening?  What is
>>
>> The original report was
>> https://lore.kernel.org/r/43375218-2a80-4a7a-b8bb-465f6419b595@linux.alibaba.com/
> 
> So you see cases where the device node is present, you try to open it,
> but yet there is no real block device behind it at all?

Roughly yes, block devices have a pseudo filesystem, briefly
it registered the block device with device_add() so the
devtmpfs file is visible then but bdev_add() is not called yet
so for example, mounting like bdev_file_open_by_dev() cannot
find this and return ENXIO.

> 
>>> the bug and why is it just showing up now (i.e. what changed to cause
>>> it?)
>>
>> I don't know, I think just because 6.6 is a relatively
>> newer kernel, and most userspace logic has retry logic
>> to cover this up.
> 
> 6.6 has been out for 2 years now, this is a long time in kernel
> development cycles for things to just start showing up now.

I think for most cases devices are added during boot so
it's hard to find, but in the stress hotplug cases, it
can be observed easily honestly.

Thanks,
Gao Xiang

> 
> thanks,
> 
> greg k-h


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31 14:34                   ` Greg Kroah-Hartman
@ 2025-10-31 14:44                     ` Gao Xiang
  2025-11-05  3:04                       ` Gao Xiang
  2025-11-05 12:30                       ` Christian Brauner
  0 siblings, 2 replies; 13+ messages in thread
From: Gao Xiang @ 2025-10-31 14:44 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Christoph Hellwig, Jens Axboe, Jan Kara, Christian Brauner,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel



On 2025/10/31 22:34, Greg Kroah-Hartman wrote:
> On Fri, Oct 31, 2025 at 08:23:32PM +0800, Gao Xiang wrote:
>>
>>
>> On 2025/10/31 18:12, Gao Xiang wrote:
>>> Hi Greg,
>>>
>>> On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
>>>> On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
>>>>>
>>>>>
>>>>> On 2025/10/31 17:45, Christoph Hellwig wrote:
>>
>> ...
>>
>>>>>> But why does the device node
>>>>>> get created earlier?  My assumption was that it would only be
>>>>>> created by the KOBJ_ADD uevent.  Adding the device model maintainers
>>>>>> as my little dig through the core drivers/base/ code doesn't find
>>>>>> anything to the contrary, but maybe I don't fully understand it.
>>>>>
>>>>> AFAIK, device_add() is used to trigger devtmpfs file
>>>>> creation, and it can be observed if frequently
>>>>> hotpluging device in the VM and mount.  Currently
>>>>> I don't have time slot to build an easy reproducer,
>>>>> but I think it's a real issue anyway.
>>>>
>>>> As I say above, that's not normal, and you have to be root to do this,
>> I just spent time to reproduce with dynamic loop devices and
>> actually it's easy if msleep() is located artificiallly,
>> the diff as below:
>>
>> diff --git a/block/bdev.c b/block/bdev.c
>> index 810707cca970..a4273b5ad456 100644
>> --- a/block/bdev.c
>> +++ b/block/bdev.c
>> @@ -821,7 +821,7 @@ struct block_device *blkdev_get_no_open(dev_t dev, bool autoload)
>>   	struct inode *inode;
>>
>>   	inode = ilookup(blockdev_superblock, dev);
>> -	if (!inode && autoload && IS_ENABLED(CONFIG_BLOCK_LEGACY_AUTOLOAD)) {
>> +	if (0) {
>>   		blk_request_module(dev);
>>   		inode = ilookup(blockdev_superblock, dev);
>>   		if (inode)
>> diff --git a/block/genhd.c b/block/genhd.c
>> index 9bbc38d12792..3c9116fdc1ce 100644
>> --- a/block/genhd.c
>> +++ b/block/genhd.c
>> @@ -428,6 +428,8 @@ static void add_disk_final(struct gendisk *disk)
>>   	set_bit(GD_ADDED, &disk->state);
>>   }
>>
>> +#include <linux/delay.h>
>> +
>>   static int __add_disk(struct device *parent, struct gendisk *disk,
>>   		      const struct attribute_group **groups,
>>   		      struct fwnode_handle *fwnode)
>> @@ -497,6 +499,9 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
>>   	if (ret)
>>   		goto out_free_ext_minor;
>>
>> +	if (disk->major == LOOP_MAJOR)
>> +		msleep(2500);           // delay 2.5s for all loops
>> +
> 
> Yes, so you need to watch for the uevent to happen, THEN it is safe to
> access the block device.  Doing it before then isn't a good idea :)
> 
> But, if you think this is an issue, do you have a patch that passes your
> testing to fix it?

I just raise it up for some ideas, and this change is
buried into the code refactor and honestly I need to
look into the codebase and related patchsets first.

Currently I have dozens of other development stuffs
on hand, if it's really a regression, I do hope
Christoph or other folks who are familiar with the code
could try to address this.

Thanks,
Gao Xiang

> 
> thanks,
> 
> greg k-h


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31 14:44                     ` Gao Xiang
@ 2025-11-05  3:04                       ` Gao Xiang
  2025-11-05 12:30                       ` Christian Brauner
  1 sibling, 0 replies; 13+ messages in thread
From: Gao Xiang @ 2025-11-05  3:04 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Greg Kroah-Hartman, Jan Kara,
	Christian Brauner
  Cc: Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel

Hi Christiph,

On 2025/10/31 22:44, Gao Xiang wrote:
> 

..

>>> I just spent time to reproduce with dynamic loop devices and
>>> actually it's easy if msleep() is located artificiallly,
>>> the diff as below:
>>>
>>> diff --git a/block/bdev.c b/block/bdev.c
>>> index 810707cca970..a4273b5ad456 100644
>>> --- a/block/bdev.c
>>> +++ b/block/bdev.c
>>> @@ -821,7 +821,7 @@ struct block_device *blkdev_get_no_open(dev_t dev, bool autoload)
>>>       struct inode *inode;
>>>
>>>       inode = ilookup(blockdev_superblock, dev);
>>> -    if (!inode && autoload && IS_ENABLED(CONFIG_BLOCK_LEGACY_AUTOLOAD)) {
>>> +    if (0) {
>>>           blk_request_module(dev);
>>>           inode = ilookup(blockdev_superblock, dev);
>>>           if (inode)
>>> diff --git a/block/genhd.c b/block/genhd.c
>>> index 9bbc38d12792..3c9116fdc1ce 100644
>>> --- a/block/genhd.c
>>> +++ b/block/genhd.c
>>> @@ -428,6 +428,8 @@ static void add_disk_final(struct gendisk *disk)
>>>       set_bit(GD_ADDED, &disk->state);
>>>   }
>>>
>>> +#include <linux/delay.h>
>>> +
>>>   static int __add_disk(struct device *parent, struct gendisk *disk,
>>>                 const struct attribute_group **groups,
>>>                 struct fwnode_handle *fwnode)
>>> @@ -497,6 +499,9 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
>>>       if (ret)
>>>           goto out_free_ext_minor;
>>>
>>> +    if (disk->major == LOOP_MAJOR)
>>> +        msleep(2500);           // delay 2.5s for all loops
>>> +
>>
>> Yes, so you need to watch for the uevent to happen, THEN it is safe to
>> access the block device.  Doing it before then isn't a good idea :)
>>
>> But, if you think this is an issue, do you have a patch that passes your
>> testing to fix it?
> 
> I just raise it up for some ideas, and this change is
> buried into the code refactor and honestly I need to
> look into the codebase and related patchsets first.
> 
> Currently I have dozens of other development stuffs
> on hand, if it's really a regression, I do hope
> Christoph or other folks who are familiar with the code
> could try to address this.

I've provided a reproducible way:
https://lore.kernel.org/linux-block/ec8b1c76-c211-49a5-a056-6a147faddd3b@linux.alibaba.com

As the author of these gendisk/bdev enhancement commits, what's
your opinion on this?

In other words, do you think it's a regression, or just a behavior
change but not a regression? Also, a minor confirmation:
if it is a regression on your side, would you like to address it?

Due to further code changes, I proposed a temporary workaround
for our 6.6 kernels as below (I don't think it's clean but we will
do more tests), but due to limited time, currently I don't have
time to come up with a cleaner solution and track this until the
upstream fix lands.

Thanks,
Gao Xiang

  block/blk.h             |  1 +
  block/genhd.c           | 18 ++++++++++++++++--
  block/partitions/core.c |  6 +++++-
  3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index 475bbb40bb83..4410ae9da378 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -419,6 +419,7 @@ static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
  #endif /* CONFIG_BLK_DEV_ZONED */
  
  struct block_device *bdev_alloc(struct gendisk *disk, u8 partno);
+void bdev_inode_failed(struct block_device *bdev);
  void bdev_add(struct block_device *bdev, dev_t dev);
  
  int blk_alloc_ext_minor(void);
diff --git a/block/genhd.c b/block/genhd.c
index 039e7c17523b..cb4313a7c618 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -383,6 +383,14 @@ int disk_scan_partitions(struct gendisk *disk, blk_mode_t mode)
  	return ret;
  }
  
+void bdev_inode_failed(struct block_device *bdev)
+{
+	struct inode *inode = bdev->bd_inode;
+
+	make_bad_inode(inode);
+	unlock_new_inode(inode);
+}
+
  /**
   * device_add_disk - add disk information to kernel list
   * @parent: parent device for the disk
@@ -452,8 +460,12 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
  	ddev->parent = parent;
  	ddev->groups = groups;
  	dev_set_name(ddev, "%s", disk->disk_name);
-	if (!(disk->flags & GENHD_FL_HIDDEN))
+	if (!(disk->flags & GENHD_FL_HIDDEN)) {
  		ddev->devt = MKDEV(disk->major, disk->first_minor);
+		disk->part0->bd_inode->i_state |= I_NEW;
+		bdev_add(disk->part0, ddev->devt);
+	}
+
  	ret = device_add(ddev);
  	if (ret)
  		goto out_free_ext_minor;
@@ -505,7 +517,7 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
  		if (get_capacity(disk) && disk_has_partscan(disk))
  			set_bit(GD_NEED_PART_SCAN, &disk->state);
  
-		bdev_add(disk->part0, ddev->devt);
+		unlock_new_inode(disk->part0->bd_inode);
  		if (get_capacity(disk))
  			disk_scan_partitions(disk, BLK_OPEN_READ);
  
@@ -546,6 +558,8 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
  out_device_del:
  	device_del(ddev);
  out_free_ext_minor:
+	if (!(disk->flags & GENHD_FL_HIDDEN))
+		bdev_inode_failed(disk->part0);
  	if (disk->major == BLOCK_EXT_MAJOR)
  		blk_free_ext_minor(disk->first_minor);
  out_exit_elevator:
diff --git a/block/partitions/core.c b/block/partitions/core.c
index 549ce89a657b..c69e369955b9 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -376,6 +376,9 @@ static struct block_device *add_partition(struct gendisk *disk, int partno,
  			goto out_put;
  	}
  
+	bdev->bd_inode->i_state |= I_NEW;
+	bdev_add(bdev, devt);
+
  	/* delay uevent until 'holders' subdir is created */
  	dev_set_uevent_suppress(pdev, 1);
  	err = device_add(pdev);
@@ -398,7 +401,7 @@ static struct block_device *add_partition(struct gendisk *disk, int partno,
  	err = xa_insert(&disk->part_tbl, partno, bdev, GFP_KERNEL);
  	if (err)
  		goto out_del;
-	bdev_add(bdev, devt);
+	unlock_new_inode(bdev->bd_inode);
  
  	/* suppress uevent if the disk suppresses it */
  	if (!dev_get_uevent_suppress(ddev))
@@ -409,6 +412,7 @@ static struct block_device *add_partition(struct gendisk *disk, int partno,
  	kobject_put(bdev->bd_holder_dir);
  	device_del(pdev);
  out_put:
+	bdev_inode_failed(bdev);
  	put_device(pdev);
  	return ERR_PTR(err);
  out_put_disk:
-- 

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-10-31 14:44                     ` Gao Xiang
  2025-11-05  3:04                       ` Gao Xiang
@ 2025-11-05 12:30                       ` Christian Brauner
  2025-11-05 14:13                         ` Gao Xiang
  1 sibling, 1 reply; 13+ messages in thread
From: Christian Brauner @ 2025-11-05 12:30 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Greg Kroah-Hartman, Christoph Hellwig, Jens Axboe, Jan Kara,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel

On Fri, Oct 31, 2025 at 10:44:53PM +0800, Gao Xiang wrote:
> 
> 
> On 2025/10/31 22:34, Greg Kroah-Hartman wrote:
> > On Fri, Oct 31, 2025 at 08:23:32PM +0800, Gao Xiang wrote:
> > > 
> > > 
> > > On 2025/10/31 18:12, Gao Xiang wrote:
> > > > Hi Greg,
> > > > 
> > > > On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
> > > > > On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
> > > > > > 
> > > > > > 
> > > > > > On 2025/10/31 17:45, Christoph Hellwig wrote:
> > > 
> > > ...
> > > 
> > > > > > > But why does the device node
> > > > > > > get created earlier?  My assumption was that it would only be
> > > > > > > created by the KOBJ_ADD uevent.  Adding the device model maintainers
> > > > > > > as my little dig through the core drivers/base/ code doesn't find
> > > > > > > anything to the contrary, but maybe I don't fully understand it.
> > > > > > 
> > > > > > AFAIK, device_add() is used to trigger devtmpfs file
> > > > > > creation, and it can be observed if frequently
> > > > > > hotpluging device in the VM and mount.  Currently
> > > > > > I don't have time slot to build an easy reproducer,
> > > > > > but I think it's a real issue anyway.
> > > > > 
> > > > > As I say above, that's not normal, and you have to be root to do this,
> > > I just spent time to reproduce with dynamic loop devices and
> > > actually it's easy if msleep() is located artificiallly,
> > > the diff as below:
> > > 
> > > diff --git a/block/bdev.c b/block/bdev.c
> > > index 810707cca970..a4273b5ad456 100644
> > > --- a/block/bdev.c
> > > +++ b/block/bdev.c
> > > @@ -821,7 +821,7 @@ struct block_device *blkdev_get_no_open(dev_t dev, bool autoload)
> > >   	struct inode *inode;
> > > 
> > >   	inode = ilookup(blockdev_superblock, dev);
> > > -	if (!inode && autoload && IS_ENABLED(CONFIG_BLOCK_LEGACY_AUTOLOAD)) {
> > > +	if (0) {
> > >   		blk_request_module(dev);
> > >   		inode = ilookup(blockdev_superblock, dev);
> > >   		if (inode)
> > > diff --git a/block/genhd.c b/block/genhd.c
> > > index 9bbc38d12792..3c9116fdc1ce 100644
> > > --- a/block/genhd.c
> > > +++ b/block/genhd.c
> > > @@ -428,6 +428,8 @@ static void add_disk_final(struct gendisk *disk)
> > >   	set_bit(GD_ADDED, &disk->state);
> > >   }
> > > 
> > > +#include <linux/delay.h>
> > > +
> > >   static int __add_disk(struct device *parent, struct gendisk *disk,
> > >   		      const struct attribute_group **groups,
> > >   		      struct fwnode_handle *fwnode)
> > > @@ -497,6 +499,9 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
> > >   	if (ret)
> > >   		goto out_free_ext_minor;
> > > 
> > > +	if (disk->major == LOOP_MAJOR)
> > > +		msleep(2500);           // delay 2.5s for all loops
> > > +
> > 
> > Yes, so you need to watch for the uevent to happen, THEN it is safe to
> > access the block device.  Doing it before then isn't a good idea :)
> > 
> > But, if you think this is an issue, do you have a patch that passes your
> > testing to fix it?
> 
> I just raise it up for some ideas, and this change is
> buried into the code refactor and honestly I need to
> look into the codebase and related patchsets first.
> 
> Currently I have dozens of other development stuffs
> on hand, if it's really a regression, I do hope
> Christoph or other folks who are familiar with the code
> could try to address this.

If it's easy to do without much of a regression or performance risk then
the device node should only show up once the device is actually ready.
It's certainly best-practive to wait for the uevent though.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
  2025-11-05 12:30                       ` Christian Brauner
@ 2025-11-05 14:13                         ` Gao Xiang
  0 siblings, 0 replies; 13+ messages in thread
From: Gao Xiang @ 2025-11-05 14:13 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Greg Kroah-Hartman, Christoph Hellwig, Jens Axboe, Jan Kara,
	Michael S. Tsirkin, Jason Wang, Martin K. Petersen,
	Luis Chamberlain, linux-block, Joseph Qi, guanghuifeng,
	zongyong.wzy, zyfjeff, Rafael J. Wysocki, Danilo Krummrich,
	linux-kernel

Hi Christian,

On 2025/11/5 20:30, Christian Brauner wrote:
> On Fri, Oct 31, 2025 at 10:44:53PM +0800, Gao Xiang wrote:
>>
>>
>> On 2025/10/31 22:34, Greg Kroah-Hartman wrote:
>>> On Fri, Oct 31, 2025 at 08:23:32PM +0800, Gao Xiang wrote:
>>>>
>>>>
>>>> On 2025/10/31 18:12, Gao Xiang wrote:
>>>>> Hi Greg,
>>>>>
>>>>> On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
>>>>>> On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2025/10/31 17:45, Christoph Hellwig wrote:
>>>>
>>>> ...
>>>>
>>>>>>>> But why does the device node
>>>>>>>> get created earlier?  My assumption was that it would only be
>>>>>>>> created by the KOBJ_ADD uevent.  Adding the device model maintainers
>>>>>>>> as my little dig through the core drivers/base/ code doesn't find
>>>>>>>> anything to the contrary, but maybe I don't fully understand it.
>>>>>>>
>>>>>>> AFAIK, device_add() is used to trigger devtmpfs file
>>>>>>> creation, and it can be observed if frequently
>>>>>>> hotpluging device in the VM and mount.  Currently
>>>>>>> I don't have time slot to build an easy reproducer,
>>>>>>> but I think it's a real issue anyway.
>>>>>>
>>>>>> As I say above, that's not normal, and you have to be root to do this,
>>>> I just spent time to reproduce with dynamic loop devices and
>>>> actually it's easy if msleep() is located artificiallly,
>>>> the diff as below:
>>>>
>>>> diff --git a/block/bdev.c b/block/bdev.c
>>>> index 810707cca970..a4273b5ad456 100644
>>>> --- a/block/bdev.c
>>>> +++ b/block/bdev.c
>>>> @@ -821,7 +821,7 @@ struct block_device *blkdev_get_no_open(dev_t dev, bool autoload)
>>>>    	struct inode *inode;
>>>>
>>>>    	inode = ilookup(blockdev_superblock, dev);
>>>> -	if (!inode && autoload && IS_ENABLED(CONFIG_BLOCK_LEGACY_AUTOLOAD)) {
>>>> +	if (0) {
>>>>    		blk_request_module(dev);
>>>>    		inode = ilookup(blockdev_superblock, dev);
>>>>    		if (inode)
>>>> diff --git a/block/genhd.c b/block/genhd.c
>>>> index 9bbc38d12792..3c9116fdc1ce 100644
>>>> --- a/block/genhd.c
>>>> +++ b/block/genhd.c
>>>> @@ -428,6 +428,8 @@ static void add_disk_final(struct gendisk *disk)
>>>>    	set_bit(GD_ADDED, &disk->state);
>>>>    }
>>>>
>>>> +#include <linux/delay.h>
>>>> +
>>>>    static int __add_disk(struct device *parent, struct gendisk *disk,
>>>>    		      const struct attribute_group **groups,
>>>>    		      struct fwnode_handle *fwnode)
>>>> @@ -497,6 +499,9 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
>>>>    	if (ret)
>>>>    		goto out_free_ext_minor;
>>>>
>>>> +	if (disk->major == LOOP_MAJOR)
>>>> +		msleep(2500);           // delay 2.5s for all loops
>>>> +
>>>
>>> Yes, so you need to watch for the uevent to happen, THEN it is safe to
>>> access the block device.  Doing it before then isn't a good idea :)
>>>
>>> But, if you think this is an issue, do you have a patch that passes your
>>> testing to fix it?
>>
>> I just raise it up for some ideas, and this change is
>> buried into the code refactor and honestly I need to
>> look into the codebase and related patchsets first.
>>
>> Currently I have dozens of other development stuffs
>> on hand, if it's really a regression, I do hope
>> Christoph or other folks who are familiar with the code
>> could try to address this.
> 
> If it's easy to do without much of a regression or performance risk then
> the device node should only show up once the device is actually ready.

Yeah, agreed.

> It's certainly best-practive to wait for the uevent though.

Currently our internal applications will try to adapt uevent
detection too, but as a public cloud provider, we may still
need a fallback to avoid users' potential blame on this,
anyway.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-11-05 14:13 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20210818144542.19305-1-hch@lst.de>
     [not found] ` <20210818144542.19305-4-hch@lst.de>
     [not found]   ` <43375218-2a80-4a7a-b8bb-465f6419b595@linux.alibaba.com>
     [not found]     ` <20251031090925.GA9379@lst.de>
     [not found]       ` <ae38c5dc-da90-4fb3-bb72-61b66ab5a0d2@linux.alibaba.com>
2025-10-31  9:45         ` question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk Christoph Hellwig
2025-10-31  9:54           ` Gao Xiang
2025-10-31  9:58             ` Greg Kroah-Hartman
2025-10-31 10:12               ` Gao Xiang
2025-10-31 12:23                 ` Gao Xiang
2025-10-31 12:25                   ` Gao Xiang
2025-10-31 14:34                   ` Greg Kroah-Hartman
2025-10-31 14:44                     ` Gao Xiang
2025-11-05  3:04                       ` Gao Xiang
2025-11-05 12:30                       ` Christian Brauner
2025-11-05 14:13                         ` Gao Xiang
2025-10-31 14:31                 ` Greg Kroah-Hartman
2025-10-31 14:40                   ` Gao Xiang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox