linux-raid.vger.kernel.org archive mirror
* REGRESSION: [PATCH 4/4] block: freeze the queue earlier in del_gendisk
       [not found] ` <20220614074827.458955-5-hch@lst.de>
@ 2022-07-08  5:41   ` Logan Gunthorpe
  2022-07-08  6:01     ` Christoph Hellwig
  0 siblings, 1 reply; 6+ messages in thread
From: Logan Gunthorpe @ 2022-07-08  5:41 UTC (permalink / raw)
  To: Christoph Hellwig, axboe
  Cc: shinichiro.kawasaki, dan.j.williams, yukuai3, ming.lei,
	linux-block, linux-raid, Song Liu

Hi,

On 2022-06-14 01:48, Christoph Hellwig wrote:
> Freeze the queue earlier in del_gendisk so that the state does not
> change while we remove debugfs and sysfs files.
> 
> Ming mentioned that being able to observe requests in debugfs might
> be useful while the queue is being frozen in del_gendisk, which is
> made possible by this change.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

I'm not really sure why this is yet, but this patch in rc4 causes some
random failures with mdadm tests.

It seems the 11spare-migration test starts failing roughly every other
run because the block device is not quite cleaned up after mdadm --stop
by the time the next mdadm --create command starts, or rather there
appears to be a race now between the newly created device and the one
being cleaned up. This results in an infrequent sysfs panic with a
duplicate filename error (see the end of this email).

I managed to bisect this and found a09b314005f3a09 to be the problematic
commit.

Reverting seems to fix it.

Thanks,

Logan

> ---
>  block/genhd.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index e0675772178b0..278227ba1d531 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -623,6 +623,7 @@ void del_gendisk(struct gendisk *disk)
>  	 * Prevent new I/O from crossing bio_queue_enter().
>  	 */
>  	blk_queue_start_drain(q);
> +	blk_mq_freeze_queue_wait(q);
>  
>  	if (!(disk->flags & GENHD_FL_HIDDEN)) {
>  		sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
> @@ -646,8 +647,6 @@ void del_gendisk(struct gendisk *disk)
>  	pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
>  	device_del(disk_to_dev(disk));
>  
> -	blk_mq_freeze_queue_wait(q);
> -
>  	blk_throtl_cancel_bios(disk->queue);
>  
>  	blk_sync_queue(q);


[ 1026.373014] sysfs: cannot create duplicate filename
'/devices/virtual/block/md124'
[ 1026.374616] CPU: 1 PID: 11046 Comm: mdadm Not tainted
5.19.0-rc4-eid-vmlocalyes-dbg-00065-gff4ec5f79108 #2430
[ 1026.376546] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.14.0-2 04/01/2014
[ 1026.378000] Call Trace:
[ 1026.378461]  <TASK>
[ 1026.378867]  dump_stack_lvl+0x5a/0x74
[ 1026.379558]  dump_stack+0x10/0x12
[ 1026.380178]  sysfs_warn_dup.cold+0x17/0x27
[ 1026.380944]  sysfs_create_dir_ns+0x17d/0x190
[ 1026.381765]  ? sysfs_create_mount_point+0x80/0x80
[ 1026.382638]  ? __kasan_check_read+0x11/0x20
[ 1026.383401]  ? class_dir_child_ns_type+0x23/0x30
[ 1026.384267]  kobject_add_internal+0x145/0x460
[ 1026.385084]  kobject_add+0xf3/0x150
[ 1026.385733]  ? kset_create_and_add+0xe0/0xe0
[ 1026.386529]  ? __kasan_check_read+0x11/0x20
[ 1026.387306]  ? mutex_unlock+0x12/0x20
[ 1026.387986]  ? device_add+0x1da/0xf20
[ 1026.388676]  device_add+0x224/0xf20
[ 1026.389316]  ? kobject_set_name_vargs+0x95/0xb0
[ 1026.390163]  ? __fw_devlink_link_to_suppliers+0x180/0x180
[ 1026.391157]  ? sprintf+0xae/0xe0
[ 1026.391784]  device_add_disk+0x1b8/0x5f0
[ 1026.392520]  md_alloc+0x4c9/0x800
[ 1026.393131]  ? __kasan_check_read+0x11/0x20
[ 1026.393921]  md_probe+0x24/0x30
[ 1026.394506]  blk_request_module+0x9a/0x100
[ 1026.395268]  blkdev_get_no_open+0x66/0xa0
[ 1026.395993]  blkdev_get_by_dev.part.0+0x24/0x570
[ 1026.396854]  ? devcgroup_check_permission+0xed/0x240
[ 1026.397770]  blkdev_get_by_dev+0x51/0x60
[ 1026.398497]  blkdev_open+0xa4/0x140
[ 1026.399146]  do_dentry_open+0x2a7/0x6e0
[ 1026.399854]  ? blkdev_close+0x50/0x50
[ 1026.400546]  vfs_open+0x58/0x60
[ 1026.401125]  path_openat+0x77e/0x13f0
[ 1026.401830]  ? lookup_open.isra.0+0xaf0/0xaf0
[ 1026.402615]  ? kvm_sched_clock_read+0x18/0x40
[ 1026.403441]  ? sched_autogroup_detach+0x20/0x20
[ 1026.404267]  ? __this_cpu_preempt_check+0x13/0x20
[ 1026.405141]  do_filp_open+0x154/0x280
[ 1026.405833]  ? may_open_dev+0x60/0x60
[ 1026.406558]  ? __kasan_check_read+0x11/0x20
[ 1026.407302]  ? do_raw_spin_unlock+0x98/0x100
[ 1026.408067]  ? alloc_fd+0x183/0x340
[ 1026.408718]  do_sys_openat2+0x119/0x2c0
[ 1026.409437]  ? kmem_cache_free+0x156/0x690
[ 1026.410167]  ? dput+0x29/0x750
[ 1026.410730]  ? build_open_flags+0x280/0x280
[ 1026.411607]  ? putname+0x7c/0x90
[ 1026.412164]  __x64_sys_openat+0xe7/0x160
[ 1026.412919]  ? __ia32_compat_sys_open+0x130/0x130
[ 1026.413630]  ? syscall_enter_from_user_mode+0x21/0x60
[ 1026.414271]  ? lockdep_hardirqs_on+0x82/0x110
[ 1026.414828]  ? trace_hardirqs_on+0x3d/0x100
[ 1026.415376]  do_syscall_64+0x3b/0x90
[ 1026.415837]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 1026.416473] RIP: 0033:0x7fab048f3be7
[ 1026.416938] Code: 25 00 00 41 00 3d 00 00 41 00 74 47 64 8b 04 25 18
00 00 00 85 c0 75 6b 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f
05 <48> 3d 00 f0 ff ff 0f 87 95 00 00 00 48 8b 4c 24 28 64 48 2b 0c 25
[ 1026.419219] RSP: 002b:00007ffc3c3d7ae0 EFLAGS: 00000246 ORIG_RAX:
0000000000000101
[ 1026.420162] RAX: ffffffffffffffda RBX: 00000000000003e8 RCX:
00007fab048f3be7
[ 1026.421045] RDX: 0000000000004082 RSI: 00007ffc3c3d7b70 RDI:
00000000ffffff9c
[ 1026.421928] RBP: 00007ffc3c3d7b70 R08: 0000000000000000 R09:
00007ffc3c3d79f0
[ 1026.422809] R10: 0000000000000000 R11: 0000000000000246 R12:
0000000000004082
[ 1026.423696] R13: 0000000000000009 R14: 00007ffc3c3d7b68 R15:
0000556054ddd970
[ 1026.424608]  </TASK>
[ 1026.424982] kobject_add_internal failed for md124 with -EEXIST, don't
try to register things with the same name in the same directory.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: REGRESSION: [PATCH 4/4] block: freeze the queue earlier in del_gendisk
  2022-07-08  5:41   ` REGRESSION: [PATCH 4/4] block: freeze the queue earlier in del_gendisk Logan Gunthorpe
@ 2022-07-08  6:01     ` Christoph Hellwig
  2022-07-08 15:55       ` Logan Gunthorpe
  0 siblings, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2022-07-08  6:01 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, axboe, shinichiro.kawasaki, dan.j.williams,
	yukuai3, ming.lei, linux-block, linux-raid, Song Liu

On Thu, Jul 07, 2022 at 11:41:40PM -0600, Logan Gunthorpe wrote:
> I'm not really sure why this is yet, but this patch in rc4 causes some
> random failures with mdadm tests.
> 
> It seems the 11spare-migration test starts failing roughly every other
> run because the block device is not quite cleaned up after mdadm --stop
> by the time the next mdadm --create command starts, or rather there
> appears to be a race now between the newly created device and the one
> being cleaned up. This results in an infrequent sysfs panic with a
> duplicate filename error (see the end of this email).
> 
> I managed to bisect this and found a09b314005f3a09 to be the problematic
> commit.

Taking a look at the mddev code, this commit just seems to increase the
race window of hitting horrible lifetime problems in md, but I'll also
try to reproduce and verify it myself.

Take a look at how md searches for a duplicate name in md_alloc,
mddev_alloc_unit and mddev_find_locked based on the all_mddevs list,
and how the mddev gets dropped from all_mddevs very early and long
before the gendisk is gone in mddev_put.  I think what needs to be
done is to implement a free_disk method and drop the mddev (and free it)
from that.  But given how much intricate mess is based on all_mddevs
we'll have to be very careful about that.


* Re: REGRESSION: [PATCH 4/4] block: freeze the queue earlier in del_gendisk
  2022-07-08  6:01     ` Christoph Hellwig
@ 2022-07-08 15:55       ` Logan Gunthorpe
  2022-07-09  8:17         ` Christoph Hellwig
  0 siblings, 1 reply; 6+ messages in thread
From: Logan Gunthorpe @ 2022-07-08 15:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, shinichiro.kawasaki, dan.j.williams, yukuai3, ming.lei,
	linux-block, linux-raid, Song Liu



On 2022-07-08 00:01, Christoph Hellwig wrote:
> On Thu, Jul 07, 2022 at 11:41:40PM -0600, Logan Gunthorpe wrote:
>> I'm not really sure why this is yet, but this patch in rc4 causes some
>> random failures with mdadm tests.
>>
>> It seems the 11spare-migration test starts failing roughly every other
>> run because the block device is not quite cleaned up after mdadm --stop
>> by the time the next mdadm --create command starts, or rather there
>> appears to be a race now between the newly created device and the one
>> being cleaned up. This results in an infrequent sysfs panic with a
>> duplicate filename error (see the end of this email).
>>
>> I managed to bisect this and found a09b314005f3a09 to be the problematic
>> commit.
> 
> Taking a look at the mddev code, this commit just seems to increase the
> race window of hitting horrible lifetime problems in md, but I'll also
> try to reproduce and verify it myself.
> 
> Take a look at how md searches for a duplicate name in md_alloc,
> mddev_alloc_unit and mddev_find_locked based on the all_mddevs list,
> and how the mddev gets dropped from all_mddevs very early and long
> before the gendisk is gone in mddev_put.  I think what needs to be
> done is to implement a free_disk method and drop the mddev (and free it)
> from that.  But given how much intricate mess is based on all_mddevs
> we'll have to be very careful about that.

I agree it's a mess, probably buggy and could use a cleanup with a
free_disk method. But I'm not sure the all_mddevs lifetime issues are the
problem here. If the entry in all_mddevs outlasts the disk, then
md_alloc() will just fail earlier. Many test scripts rely on the fact
that you can stop an mddev and recreate it immediately after. We need
some way of ensuring any deleted disks are fully deleted before trying
to make a new mddev, in case the new one has the same name as one being
deleted.

The md code deletes the disk in mddev_delayed_delete(), a work queue
item on md_misc_wq. That queue is flushed first in md_alloc(), but
somehow some of the disk is still not fully deleted by the time
flush_workqueue() returns. I'm not sure why that would be.

Logan


* Re: REGRESSION: [PATCH 4/4] block: freeze the queue earlier in del_gendisk
  2022-07-08 15:55       ` Logan Gunthorpe
@ 2022-07-09  8:17         ` Christoph Hellwig
  2022-07-11  3:33           ` Logan Gunthorpe
  0 siblings, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2022-07-09  8:17 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, axboe, shinichiro.kawasaki, dan.j.williams,
	yukuai3, ming.lei, linux-block, linux-raid, Song Liu

On Fri, Jul 08, 2022 at 09:55:55AM -0600, Logan Gunthorpe wrote:
> I agree it's a mess, probably buggy and could use a cleanup with a
> free_disk method. But I'm not sure the all_mddevs lifetime issues are the
> problem here. If the entry in all_mddevs outlasts the disk, then
> md_alloc() will just fail earlier. Many test scripts rely on the fact
> that you can stop an mddev and recreate it immediately after. We need
> some way of ensuring any deleted disks are fully deleted before trying
> to make a new mddev, in case the new one has the same name as one being
> deleted.

I think those tests are broken.  But fortunately that is just an
assumption in the tests, while device name reuse is a real problem.

I could not reproduce your problem, but on the for-5.20/block
tree I see a hang in 10ddf-geometry when running the tests.

The branch here:

    git://git.infradead.org/users/hch/block.git md-lifetime-fixes
    http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/md-lifetime-fixes

fixes that for me and does not introduce new regressions.  Can you
check if that helps your problem?


* Re: REGRESSION: [PATCH 4/4] block: freeze the queue earlier in del_gendisk
  2022-07-09  8:17         ` Christoph Hellwig
@ 2022-07-11  3:33           ` Logan Gunthorpe
  2022-07-11  4:33             ` Christoph Hellwig
  0 siblings, 1 reply; 6+ messages in thread
From: Logan Gunthorpe @ 2022-07-11  3:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, shinichiro.kawasaki, dan.j.williams, yukuai3, ming.lei,
	linux-block, linux-raid, Song Liu




On 2022-07-09 02:17, Christoph Hellwig wrote:
> On Fri, Jul 08, 2022 at 09:55:55AM -0600, Logan Gunthorpe wrote:
>> I agree it's a mess, probably buggy and could use a cleanup with a
>> free_disk method. But I'm not sure the all_mddevs lifetime issues are the
>> problem here. If the entry in all_mddevs outlasts the disk, then
>> md_alloc() will just fail earlier. Many test scripts rely on the fact
>> that you can stop an mddev and recreate it immediately after. We need
>> some way of ensuring any deleted disks are fully deleted before trying
>> to make a new mddev, in case the new one has the same name as one being
>> deleted.
> 
> I think those tests are broken.  But fortunately that is just an
> assumption in the tests, while device name reuse is a real problem.
> 
> I could not reproduce your problem, but on the for-5.20/block
> tree I see a hang in 10ddf-geometry when running the tests.

Ah, strange. I never saw an issue with that test, though I didn't run
that one repeatedly with the latest branch. So maybe it was an
intermittent failure like the one I saw with 11spare-migration and I
just missed it.

> The branch here:
> 
>     git://git.infradead.org/users/hch/block.git md-lifetime-fixes
>     http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/md-lifetime-fixes
> 
> fixes that for me and does not introduce new regressions.  Can you
> check if that helps your problem?

Yes, I can confirm those patches fix the bug I was seeing too.


I did a fairly quick review of the patches:

 - In the patch that introduces md_free_disk() it looks like md_free()
can still be called from the error path of md_alloc() before
mddev->gendisk is set... which seems to make things rather complicated,
seeing we then can't use free_disk to finish the cleanup if the disk
hasn't been created yet. I probably need to take a closer look at this,
but it might make more sense for the cleanup to remain in md_free() but
call kobject_put() in md_free_disk() and del_gendisk() in
mddev_delayed_delete(). Then md_alloc() can still use kobject_put() in
the error path and it makes a little more sense seeing we'd still be
freeing the kobject stuff in its own release method instead of the
disk's free method.

 - In the patch with md_rdevs_overlap, it looks like we remove the
rcu_read_lock(), which definitely seems out of place and probably isn't
correct. But the comment that was recreated still references RCU, so
it probably should be changed.

 - The last patch has a typo in the title (disk is *a* freed).

Thanks!

Logan


* Re: REGRESSION: [PATCH 4/4] block: freeze the queue earlier in del_gendisk
  2022-07-11  3:33           ` Logan Gunthorpe
@ 2022-07-11  4:33             ` Christoph Hellwig
  0 siblings, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2022-07-11  4:33 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, axboe, shinichiro.kawasaki, dan.j.williams,
	yukuai3, ming.lei, linux-block, linux-raid, Song Liu

On Sun, Jul 10, 2022 at 09:33:51PM -0600, Logan Gunthorpe wrote:
> I did a fairly quick review of the patches:
> 
>  - In the patch that introduces md_free_disk() it looks like md_free()
> can still be called from the error path of md_alloc() before
> mddev->gendisk is set... which seems to make things rather complicated,
> seeing we then can't use free_disk to finish the cleanup if the disk
> hasn't been created yet. I probably need to take a closer look at this,
> but it might make more sense for the cleanup to remain in md_free() but
> call kobject_put() in md_free_disk() and del_gendisk() in
> mddev_delayed_delete(). Then md_alloc() can still use kobject_put() in
> the error path and it makes a little more sense seeing we'd still be
> freeing the kobject stuff in its own release method instead of the
> disk's free method.

Uww, yes.  I suspect the best fix is to actually stop the kobject
from taking part in the md lifetime directly.  Because the kobject
contributes a reference to the disk until it is deleted, we might
as well stop messing with the refcounts entirely and just call
kobject_del on it just before del_gendisk.  Let me see what I can do
there.

> 
>  - In the patch with md_rdevs_overlap, it looks like we remove the
> rcu_read_lock(), which definitely seems out of place and probably isn't
> correct. But the comment that was recreated still references RCU, so
> it probably should be changed.

Fixed.

>  - The last patch has a typo in the title (disk is *a* freed).

Fixed.

