From: Gao Xiang <hsiangkao@linux.alibaba.com>
To: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Christoph Hellwig <hch@lst.de>, Jens Axboe <axboe@kernel.dk>,
	Jan Kara <jack@suse.cz>, Christian Brauner <brauner@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	Luis Chamberlain <mcgrof@kernel.org>,
	linux-block@vger.kernel.org,
	Joseph Qi <joseph.qi@linux.alibaba.com>,
	guanghuifeng@linux.alibaba.com, zongyong.wzy@alibaba-inc.com,
	zyfjeff@linux.alibaba.com,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Danilo Krummrich <dakr@kernel.org>,
	linux-kernel@vger.kernel.org
Subject: Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk
Date: Fri, 31 Oct 2025 20:25:46 +0800
Message-ID: <bc738580-4e1f-411f-af7b-f76a4ce7b7ea@linux.alibaba.com>
In-Reply-To: <ec8b1c76-c211-49a5-a056-6a147faddd3b@linux.alibaba.com>



On 2025/10/31 20:23, Gao Xiang wrote:
> 
> 
> On 2025/10/31 18:12, Gao Xiang wrote:
>> Hi Greg,
>>
>> On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
>>> On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:
>>>>
>>>>
>>>> On 2025/10/31 17:45, Christoph Hellwig wrote:
> 
> ...
> 
>>>>> But why does the device node
>>>>> get created earlier?  My assumption was that it would only be
>>>>> created by the KOBJ_ADD uevent.  Adding the device model maintainers
>>>>> as my little dig through the core drivers/base/ code doesn't find
>>>>> anything to the contrary, but maybe I don't fully understand it.
>>>>
>>>> AFAIK, device_add() triggers devtmpfs node
>>>> creation, and the race can be observed when
>>>> frequently hotplugging devices in a VM and mounting
>>>> them.  Currently I don't have a time slot to build
>>>> an easy reproducer, but I think it's a real issue anyway.
>>>
>>> As I say above, that's not normal, and you have to be root to do this,
> I just spent some time reproducing this with dynamic loop devices,
> and it's actually easy if an msleep() is inserted artificially;
> the diff is below:
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 810707cca970..a4273b5ad456 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -821,7 +821,7 @@ struct block_device *blkdev_get_no_open(dev_t dev, bool autoload)
>       struct inode *inode;
> 
>       inode = ilookup(blockdev_superblock, dev);
> -    if (!inode && autoload && IS_ENABLED(CONFIG_BLOCK_LEGACY_AUTOLOAD)) {
> +    if (0) {
>           blk_request_module(dev);
>           inode = ilookup(blockdev_superblock, dev);
>           if (inode)
> diff --git a/block/genhd.c b/block/genhd.c
> index 9bbc38d12792..3c9116fdc1ce 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -428,6 +428,8 @@ static void add_disk_final(struct gendisk *disk)
>       set_bit(GD_ADDED, &disk->state);
>   }
> 
> +#include <linux/delay.h>
> +
>   static int __add_disk(struct device *parent, struct gendisk *disk,
>                 const struct attribute_group **groups,
>                 struct fwnode_handle *fwnode)
> @@ -497,6 +499,9 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
>       if (ret)
>           goto out_free_ext_minor;
> 
> +    if (disk->major == LOOP_MAJOR)
> +        msleep(2500);           // delay 2.5s for all loops
> +
>       ret = disk_alloc_events(disk);
>       if (ret)
>           goto out_device_del;
> 
> 
> (Note that I disabled the CONFIG_BLOCK_LEGACY_AUTOLOAD
>   path for a cleaner ftrace below.)
> 
> and then
> 
> # uname -a  (patched 6.18-rc1 kernel)
> 
> ```
> Linux 7e5b4b5f5181 6.18.0-rc1-dirty #25 SMP PREEMPT_DYNAMIC Fri Oct 31 19:52:10 CST 2025 x86_64 GNU/Linux
> ```
> 
> # truncate -s 1g test.img; mkfs.ext4 -F test.img;
> # losetup /dev/loop999 test.img & sleep 1; ls -l /dev/loop999; strace mount -t ext4 /dev/loop999 mnt 2>&1 | grep fsconfig
> 
> It shows
> 
> ```
> brw------- 1 root root 7, 999 Oct 31 20:06 /dev/loop999
> fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/loop999", 0) = 0
> fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = -1 ENXIO (No such device or address)  // unexpected
> ```
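For readers who prefer C to the shell one-liner above, the same check can be sketched as a small userspace probe. This is a minimal sketch, assuming only POSIX stat()/open(); `probe_blkdev()` is a hypothetical helper, not part of util-linux or any existing tool. It distinguishes "node not created yet" from "node exists but open fails", the latter being the symptom shown above.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: report what a userspace opener sees for a block
 * device path. Returns 0 if open() succeeded, otherwise a negative errno.
 */
static int probe_blkdev(const char *path)
{
	struct stat st;
	int fd;

	if (stat(path, &st) != 0)
		return -errno;          /* e.g. -ENOENT: devtmpfs node not there yet */
	if (!S_ISBLK(st.st_mode))
		return -ENOTBLK;        /* path exists but is not a block special file */

	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -errno;          /* -ENXIO on an existing node matches the symptom above */
	close(fd);
	return 0;
}
```

Run as root right after `losetup /dev/loop999 test.img &` with the msleep() debug patch applied: -ENXIO on an existing node would match the fsconfig() failure above, whereas -ENOENT only means the node has not been created yet.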
> 
> then
> 
> # losetup /dev/loop996 test.img & sleep 1; stat /dev/loop996; trace-cmd record -p function_graph mount -t ext4 /dev/loop996 mnt &> /dev/null
> 
> It shows
> ```
>    File: /dev/loop996
>    Size: 0               Blocks: 0          IO Block: 4096   block special file
> Device: 0,6     Inode: 429         Links: 1     Device type: 7,996
> Access: (0600/brw-------)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2025-10-31 20:07:54.938474868 +0800
> Modify: 2025-10-31 20:07:54.938474868 +0800
> Change: 2025-10-31 20:07:54.938474868 +0800
>   Birth: 2025-10-31 20:07:54.938474868 +0800
> ```
> 
> but
> 
> # trace-cmd report | grep mount | less
>             mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                bdev_file_open_by_dev() {
>             mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                  bdev_permission() {
>             mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                    devcgroup_check_permission() {
>             mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                      __rcu_read_lock() {
>             mount-561   [007] ...1.   240.180514: funcgraph_exit:         0.193 us   |                      } (ret=0x1)
>             mount-561   [007] ...1.   240.180514: funcgraph_entry:                   |                      match_exception_partial() {
>             mount-561   [007] ...1.   240.180514: funcgraph_exit:         0.199 us   |                      } (ret=0x0)
>             mount-561   [007] ...1.   240.180514: funcgraph_entry:                   |                      __rcu_read_unlock() {
>             mount-561   [007] ...1.   240.180515: funcgraph_exit:         0.202 us   |                      } (ret=0x0)
>             mount-561   [007] ...1.   240.180515: funcgraph_exit:         1.602 us   |                    } (ret=0x0)
>             mount-561   [007] ...1.   240.180515: funcgraph_exit:         2.100 us   |                  } (ret=0x0)
>             mount-561   [007] ...1.   240.180515: funcgraph_entry:                   |                  ilookup() {
>             mount-561   [007] ...1.   240.180516: funcgraph_entry:                   |                    __cond_resched() {
>             mount-561   [007] ...1.   240.180516: funcgraph_exit:         0.194 us   |                    } (ret=0x0)
>             mount-561   [007] ...1.   240.180516: funcgraph_entry:                   |                    find_inode_fast() {
>             mount-561   [007] ...1.   240.180516: funcgraph_entry:                   |                      __rcu_read_lock() {
>             mount-561   [007] ...1.   240.180516: funcgraph_exit:         0.195 us   |                      } (ret=0x1)
>             mount-561   [007] ...1.   240.180517: funcgraph_entry:                   |                      __rcu_read_unlock() {
>             mount-561   [007] ...1.   240.180517: funcgraph_exit:         0.193 us   |                      } (ret=0x0)
>             mount-561   [007] ...1.   240.180517: funcgraph_exit:         1.060 us   |                    } (ret=0x0)
>             mount-561   [007] ...1.   240.180517: funcgraph_exit:         1.970 us   |                  } (ret=0x0)
>             mount-561   [007] ...1.   240.180518: funcgraph_exit:         4.818 us   |                } (ret=-6)
> 
> here -6 (-ENXIO) is unexpected.
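Reading the trace: bdev_permission() (via devcgroup_check_permission()) returns 0, but ilookup() on blockdev_superblock returns NULL for the dev_t, and bdev_file_open_by_dev() then returns -6. A toy C model of just that lookup step follows, assuming a flat array in place of the real inode hash; all `model_*` names are hypothetical, and only the dev_t packing mirrors the kernel's internal MKDEV() with 20 minor bits. Permission checks are not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Kernel-internal dev_t packing: major above a 20-bit minor (MINORBITS = 20). */
#define MODEL_MINORBITS 20
#define MODEL_MKDEV(ma, mi) (((uint32_t)(ma) << MODEL_MINORBITS) | (uint32_t)(mi))
#define MODEL_MAX_HASHED 16

/* Hypothetical stand-in for the bdev inodes hashed on blockdev_superblock. */
struct model_bdev_hash {
	uint32_t devs[MODEL_MAX_HASHED];
	size_t nr;
};

/* Stand-in for insert_inode_hash() as done by bdev_add(). */
static bool model_insert_hash(struct model_bdev_hash *h, uint32_t dev)
{
	if (h->nr == MODEL_MAX_HASHED)
		return false;
	h->devs[h->nr++] = dev;
	return true;
}

/* Stand-in for the lookup step: an ilookup() miss turns into -ENXIO. */
static int model_open_by_dev(const struct model_bdev_hash *h, uint32_t dev)
{
	for (size_t i = 0; i < h->nr; i++)
		if (h->devs[i] == dev)
			return 0;
	return -ENXIO;
}
```

With an empty table, opening 7,996 fails with -ENXIO (-6 on Linux); after model_insert_hash() for the same dev_t, it succeeds.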
> 
> The problematic code path I mentioned is in device_add():
>
> Upstream code:
> 
> loop_control_ioctl
>   loop_add
>     add_disk_fwnode
>       __add_disk
>         devtmpfs_create_node   // creates the devtmpfs blkdev node here (racy)
>       add_disk_final
>         bdev_add
>           insert_inode_hash    // bdev only becomes visible to bdev_file_open_by_dev() here
>         disk_uevent(disk, KOBJ_ADD)

Minor revision (devtmpfs_create_node() is called from device_add()):

  loop_control_ioctl
    loop_add
      add_disk_fwnode
        __add_disk
          device_add
            devtmpfs_create_node   // creates the devtmpfs blkdev node here (racy)
        add_disk_final
          bdev_add
            insert_inode_hash    // bdev only becomes visible to bdev_file_open_by_dev() here
          disk_uevent(disk, KOBJ_ADD)
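To make the window in this call chain explicit, here is a deterministic replay of the publication order as a toy C model, assuming each step can be reduced to one boolean that an opener can observe. `enum model_step`, `model_replay()` and `model_open()` are hypothetical names, not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Steps of the call chain above, in upstream order. */
enum model_step {
	STEP_START,            /* loop_add() entered, nothing visible yet */
	STEP_DEVNODE_CREATED,  /* device_add() -> devtmpfs_create_node() */
	STEP_INODE_HASHED,     /* add_disk_final() -> bdev_add() -> insert_inode_hash() */
	STEP_UEVENT_SENT,      /* disk_uevent(disk, KOBJ_ADD) */
};

struct model_state {
	bool devnode_visible;
	bool inode_hashed;
	bool uevent_sent;
};

/* Return the state after all steps up to and including @upto have run. */
static struct model_state model_replay(enum model_step upto)
{
	struct model_state s = { false, false, false };

	if (upto >= STEP_DEVNODE_CREATED)
		s.devnode_visible = true;
	if (upto >= STEP_INODE_HASHED)
		s.inode_hashed = true;
	if (upto >= STEP_UEVENT_SENT)
		s.uevent_sent = true;
	return s;
}

/* What an opener such as mount(8) observes in a given state. */
static int model_open(struct model_state s)
{
	if (!s.devnode_visible)
		return -ENOENT;    /* /dev/loopN does not exist yet */
	if (!s.inode_hashed)
		return -ENXIO;     /* node exists, but the bdev inode is not hashed */
	return 0;
}
```

In this model, an opener that waits for the KOBJ_ADD uevent only acts at STEP_UEVENT_SENT, after the inode is hashed, while an opener that simply finds the devtmpfs node (like the `sleep 1; mount` in the reproducer) can act at STEP_DEVNODE_CREATED and gets -ENXIO.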

> 
> I think this is enough to explain the root cause.
> 
> Thanks,
> Gao Xiang

