* Memory leak when raid10 takeover raid0
@ 2022-10-08  9:06 Xiao Ni
  2022-10-10  0:40 ` Guoqing Jiang
  0 siblings, 1 reply; 8+ messages in thread

From: Xiao Ni @ 2022-10-08  9:06 UTC
To: linux-raid; +Cc: Song Liu, Guoqing Jiang

Hi all

Sometimes a memory leak happens when raid10 takes over raid0, and the
kernel then panics.

[ 6973.704097] BUG bio-176 (Not tainted): Objects remaining in bio-176 on __kmem_cache_shutdown()
[ 6973.705627] -----------------------------------------------------------------------------
[ 6973.705627]
[ 6973.707331] Slab 0x0000000067cd8c6c objects=21 used=1 fp=0x000000005bf568c8 flags=0xfffffc0000100(slab|node=0|zone=1|lastcpupid=0x1fffff)
[ 6973.709498] CPU: 0 PID: 735535 Comm: mdadm Kdump: loaded Not tainted 4.18.0-423.el8.x86_64 #1
[ 6973.711027] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 6973.712068] Call Trace:
[ 6973.712550]  dump_stack+0x41/0x60
[ 6973.713185]  slab_err.cold.117+0x53/0x67
[ 6973.713878]  ? __unfreeze_partials+0x1a0/0x1a0
[ 6973.714659]  ? cpumask_next+0x17/0x20
[ 6973.715309]  __kmem_cache_shutdown+0x16e/0x300
[ 6973.716097]  kmem_cache_destroy+0x4d/0x120
[ 6973.716843]  bioset_exit+0xb0/0x100
[ 6973.717486]  level_store+0x280/0x650
[ 6973.718147]  ? security_capable+0x38/0x60
[ 6973.718856]  md_attr_store+0x7c/0xc0
[ 6973.719489]  kernfs_fop_write+0x11e/0x1a0
[ 6973.720200]  vfs_write+0xa5/0x1b0
[ 6973.720785]  ksys_write+0x4f/0xb0
[ 6973.721374]  do_syscall_64+0x5b/0x1b0
[ 6973.722028]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 6973.723005] RIP: 0033:0x7fb15d2b5bc8
[ 6973.723660] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 55 4b 2a 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[ 6973.726926] RSP: 002b:00007ffed8ebc8a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 6973.728278] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fb15d2b5bc8
[ 6973.729515] RDX: 0000000000000006 RSI: 0000560330046dd7 RDI: 0000000000000004
[ 6973.730748] RBP: 0000560330046dd7 R08: 000056033003b927 R09: 00007fb15d315d40
[ 6973.731986] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 6973.733240] R13: 00007ffed8ebcc80 R14: 00007ffed8ebe8ee R15: 0000000000000000
[ 6973.734520] Disabling lock debugging due to kernel taint
[ 6973.735452] Object 0x000000000923869f @offset=3264
[ 6973.736317] kmem_cache_destroy bio-176: Slab cache still has objects
[ 6973.737479] CPU: 0 PID: 735535 Comm: mdadm Kdump: loaded Tainted: G B --------- -  - 4.18.0-423.el8.x86_64 #1
[ 6973.739438] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 6973.740437] Call Trace:
[ 6973.740877]  dump_stack+0x41/0x60
[ 6973.741473]  kmem_cache_destroy+0x116/0x120
[ 6973.742209]  bioset_exit+0xb0/0x100
[ 6973.742824]  level_store+0x280/0x650
[ 6973.743463]  ? security_capable+0x38/0x60
[ 6973.744177]  md_attr_store+0x7c/0xc0
[ 6973.744807]  kernfs_fop_write+0x11e/0x1a0
[ 6973.745516]  vfs_write+0xa5/0x1b0
[ 6973.746112]  ksys_write+0x4f/0xb0
[ 6973.746695]  do_syscall_64+0x5b/0x1b0
[ 6973.747373]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 6973.748293] RIP: 0033:0x7fb15d2b5bc8
[ 6973.748929] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 55 4b 2a 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[ 6973.752255] RSP: 002b:00007ffed8ebc8a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 6973.753628] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fb15d2b5bc8
[ 6973.754900] RDX: 0000000000000006 RSI: 0000560330046dd7 RDI: 0000000000000004
[ 6973.756142] RBP: 0000560330046dd7 R08: 000056033003b927 R09: 00007fb15d315d40
[ 6973.757420] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 6973.758702] R13: 00007ffed8ebcc80 R14: 00007ffed8ebe8ee R15: 0000000000000000
[ 6973.762429] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 6973.763818] PGD 0 P4D 0
[ 6973.764277] Oops: 0002 [#1] SMP PTI
[ 6973.764896] CPU: 0 PID: 12 Comm: ksoftirqd/0 Kdump: loaded Tainted: G B --------- -  - 4.18.0-423.el8.x86_64 #1
[ 6973.766944] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 6973.767999] RIP: 0010:mempool_free+0x52/0x80
[ 6973.768779] Code: e4 6f 96 00 e9 0f 72 96 00 48 89 f7 e8 a7 5e 75 00 48 63 53 08 3b 53 04 7d 30 48 8b 4b 10 8d 72 01 48 89 df 89 73 08 48 89 c6 <48> 89 2c d1 e8 f5 5e 75 00 48 8d 7b 30 31 c9 5b ba 01 00 00 00 be
[ 6973.772088] RSP: 0018:ffffbc6c4068fdf8 EFLAGS: 00010093
[ 6973.773026] RAX: 0000000000000293 RBX: ffff9b5844fed7d8 RCX: 0000000000000000
[ 6973.774318] RDX: 0000000000000000 RSI: 0000000000000293 RDI: ffff9b5844fed7d8
[ 6973.775560] RBP: ffff9b5759046cc0 R08: 000000000000003d R09: ffff9b57420e62a0
[ 6973.776794] R10: 0000000000000008 R11: ffff9b574a602a00 R12: ffff9b5845004ed0
[ 6973.778096] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000000
[ 6973.779357] FS:  0000000000000000(0000) GS:ffff9b5870000000(0000) knlGS:0000000000000000
[ 6973.780757] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6973.781806] CR2: 0000000000000000 CR3: 0000000043410002 CR4: 00000000007706f0
[ 6973.783076] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6973.784313] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6973.785593] PKRU: 55555554
[ 6973.786098] Call Trace:
[ 6973.786549]  md_end_io_acct+0x31/0x40
[ 6973.787227]  blk_update_request+0x224/0x380
[ 6973.787994]  blk_mq_end_request+0x1a/0x130
[ 6973.788739]  blk_complete_reqs+0x35/0x50
[ 6973.789456]  __do_softirq+0xd7/0x2c8
[ 6973.790114]  ? sort_range+0x20/0x20
[ 6973.790763]  run_ksoftirqd+0x2a/0x40
[ 6973.791400]  smpboot_thread_fn+0xb5/0x150
[ 6973.792114]  kthread+0x10b/0x130
[ 6973.792724]  ? set_kthread_struct+0x50/0x50
[ 6973.793491]  ret_from_fork+0x1f/0x40

The reason is that raid0 doesn't do anything in raid0_quiesce. During
the level change in level_store, it needs to make sure all inflight
bios have finished. We could add a counter in raid0 to do this. Is
that a good way?

There is a counter, mddev->active_io, in mddev, but it's decreased in
md_handle_request rather than in the endio callback. Would it be right
to decrease active_io in the endio callback instead?

Best Regards
Xiao
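For reference, the two code paths under discussion look roughly like
this (a simplified sketch paraphrased from drivers/md of that era, not
a verbatim quote): raid0's quiesce hook is a no-op, and active_io is
dropped as soon as make_request() returns, which can be before the
bios submitted to the member disks have completed.

/* drivers/md/raid0.c: the quiesce hook is empty, so mddev_suspend()
 * has nothing raid0-specific to wait for. */
static void raid0_quiesce(struct mddev *mddev, int quiesce)
{
}

/* drivers/md/md.c (simplified sketch): active_io is decremented once
 * make_request() returns, not when the bios it submitted to the
 * member disks complete. */
void md_handle_request(struct mddev *mddev, struct bio *bio)
{
	atomic_inc(&mddev->active_io);

	if (!mddev->pers->make_request(mddev, bio)) {
		atomic_dec(&mddev->active_io);
		/* suspended: requeue path elided */
		return;
	}

	if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
		wake_up(&mddev->sb_wait);
}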
* Re: Memory leak when raid10 takeover raid0
  2022-10-08  9:06 Memory leak when raid10 takeover raid0 Xiao Ni
@ 2022-10-10  0:40 ` Guoqing Jiang
  2022-10-10  0:55   ` Guoqing Jiang
  2022-10-10  5:48   ` Xiao Ni
  0 siblings, 2 replies; 8+ messages in thread

From: Guoqing Jiang @ 2022-10-10  0:40 UTC
To: Xiao Ni, linux-raid; +Cc: Song Liu

Hi Xiao,

On 10/8/22 5:06 PM, Xiao Ni wrote:
> Hi all
>
> Sometimes a memory leak happens when raid10 takes over raid0, and the
> kernel then panics.
>
> [ 6973.704097] BUG bio-176 (Not tainted): Objects remaining in bio-176 on __kmem_cache_shutdown()
> [...]
> [ 6973.709498] CPU: 0 PID: 735535 Comm: mdadm Kdump: loaded Not tainted 4.18.0-423.el8.x86_64 #1

It is an old kernel; does it also happen with the latest mainline
version? I'm not sure the mdadm test suite covers this case.

> [... rest of the traces trimmed ...]
>
> The reason is that raid0 doesn't do anything in raid0_quiesce. During
> the level change in level_store, it needs to make sure all inflight
> bios have finished. We could add a counter in raid0 to do this. Is
> that a good way?
>
> There is a counter, mddev->active_io, in mddev, but it's decreased in
> md_handle_request rather than in the endio callback. Would it be right
> to decrease active_io in the endio callback instead?

I think it is a race between ios and the takeover action, since
mddev_suspend, called by level_store, should ensure no io is submitted
to the array at that time by the below.

    wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0)

However, it can't guarantee new io comes after mddev_suspend because of
the empty raid0 quiesce. Maybe we can do something like this.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index afaf36b2f6ab..0fdc07008bcf 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3891,6 +3891,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	if (rv)
 		return rv;
 
+	set_bit(MD_TAKENOVER, &mddev->flags);
 	if (mddev->pers == NULL) {
 		strncpy(mddev->clevel, buf, slen);
 		if (mddev->clevel[slen-1] == '\n')
@@ -4074,6 +4075,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	rv = len;
 out_unlock:
 	mddev_unlock(mddev);
+	clear_bit(MD_TAKENOVER, &mddev->flags);
 	return rv;
 }
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index b4e2d8b87b61..e77da3bc6a72 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -255,6 +255,7 @@ struct md_cluster_info;
  *			array is ready yet.
  * @MD_BROKEN: This is used to stop writes and mark array as failed.
  * @MD_DELETED: This device is being deleted
+ * @MD_TAKENOVER: This device is being taken over to another personality.
  *
  * change UNSUPPORTED_MDDEV_FLAGS for each array type if new flag is added
  */
@@ -272,6 +273,7 @@ enum mddev_flags {
 	MD_NOT_READY,
 	MD_BROKEN,
 	MD_DELETED,
+	MD_TAKENOVER,
 };
 
 enum mddev_sb_flags {
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 78addfe4a0c9..b00ae87c42d3 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -523,6 +523,9 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
 	unsigned chunk_sects;
 	unsigned sectors;
 
+	if (test_bit(MD_TAKENOVER, &mddev->flags))
+		return true;
+
 	if (unlikely(bio->bi_opf & REQ_PREFLUSH)
 	    && md_flush_request(mddev, bio))
 		return true;

Another way is to set mddev->pers to NULL before mddev_suspend in
level_store, then move "mddev->pers = pers" after mddev_resume; this
makes md_submit_bio return earlier. I guess this way is simpler and
more straightforward.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index afaf36b2f6ab..5a7134e3c182 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3979,6 +3979,9 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 		goto out_unlock;
 	}
 
+	/* Only accept IO after takeover is done */
+	mddev->pers = NULL;
+
 	/* Looks like we have a winner */
 	mddev_suspend(mddev);
 	mddev_detach(mddev);
@@ -3986,7 +3989,6 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	spin_lock(&mddev->lock);
 	oldpers = mddev->pers;
 	oldpriv = mddev->private;
-	mddev->pers = pers;
 	mddev->private = priv;
 	strscpy(mddev->clevel, pers->name, sizeof(mddev->clevel));
 	mddev->level = mddev->new_level;
@@ -4067,6 +4069,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	pers->run(mddev);
 	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 	mddev_resume(mddev);
+	mddev->pers = pers;
 	if (!mddev->thread)
 		md_update_sb(mddev, 1);
 	sysfs_notify_dirent_safe(mddev->sysfs_level);

Thanks,
Guoqing
* Re: Memory leak when raid10 takeover raid0
  2022-10-10  0:40 ` Guoqing Jiang
@ 2022-10-10  0:55   ` Guoqing Jiang
  0 siblings, 0 replies; 8+ messages in thread

From: Guoqing Jiang @ 2022-10-10  0:55 UTC
To: Xiao Ni, linux-raid; +Cc: Song Liu

On 10/10/22 8:40 AM, Guoqing Jiang wrote:
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index afaf36b2f6ab..5a7134e3c182 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -3979,6 +3979,9 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
>  		goto out_unlock;
>  	}
>
> +	/* Only accept IO after takeover is done */
> +	mddev->pers = NULL;
> +
>  	/* Looks like we have a winner */
>  	mddev_suspend(mddev);

mddev->pers->quiesce is called in suspend/resume, so we would need the
relevant checking there accordingly, and maybe in other places as well.

Thanks,
Guoqing
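For context, a condensed sketch of mddev_suspend from that era (details
elided, not verbatim); it shows both the active_io wait and the quiesce
call that the "clear mddev->pers first" proposal would have to guard:

/* drivers/md/md.c (condensed sketch) */
void mddev_suspend(struct mddev *mddev)
{
	if (mddev->suspended++)
		return;
	wake_up(&mddev->sb_wait);
	/* Waits only for ios the md core itself still tracks... */
	wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
	/* ...then relies on the personality to drain the rest. This
	 * line would oops if mddev->pers had already been cleared. */
	mddev->pers->quiesce(mddev, 1);
	del_timer_sync(&mddev->safemode_timer);
}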
* Re: Memory leak when raid10 takeover raid0
  2022-10-10  0:40 ` Guoqing Jiang
  2022-10-10  0:55   ` Guoqing Jiang
@ 2022-10-10  5:48   ` Xiao Ni
  2022-10-10  8:13     ` Xiao Ni
  2022-10-11 13:38     ` Guoqing Jiang
  1 sibling, 2 replies; 8+ messages in thread

From: Xiao Ni @ 2022-10-10  5:48 UTC
To: Guoqing Jiang; +Cc: linux-raid, Song Liu

On Mon, Oct 10, 2022 at 8:40 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
> Hi Xiao,
>
> [... quoted report trimmed ...]
>
> It is an old kernel; does it also happen with the latest mainline
> version? I'm not sure the mdadm test suite covers this case.

Hi Guoqing

I haven't tried with the latest upstream kernel. It's hard to reproduce
this problem. So far it can only be triggered by our QE automated
tests. I tried to extract the test commands and put them into a script,
but it couldn't be reproduced that way.

But from the code analysis, raid0 does have this problem.

> I think it is a race between ios and the takeover action, since
> mddev_suspend, called by level_store, should ensure no io is submitted
> to the array at that time by the below.
>
>     wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0)
>
> However, it can't guarantee new io comes after mddev_suspend because of
> the empty raid0 quiesce. Maybe we can do something like this.

I agree with this. But I think the method below can't resolve the
problem.

> [... md.c/md.h hunks trimmed ...]
>
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index 78addfe4a0c9..b00ae87c42d3 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -523,6 +523,9 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
>  	unsigned chunk_sects;
>  	unsigned sectors;
>
> +	if (test_bit(MD_TAKENOVER, &mddev->flags))
> +		return true;
> +
>  	if (unlikely(bio->bi_opf & REQ_PREFLUSH)
>  	    && md_flush_request(mddev, bio))
>  		return true;

The problem here is that we need to wait until all bios come back from
the member disks. As you pointed out above, mddev_suspend can already
make sure no new bios are handled (they can't get into
raid0_make_request).

> Another way is to set mddev->pers to NULL before mddev_suspend in
> level_store, then move "mddev->pers = pers" after mddev_resume; this
> makes md_submit_bio return earlier. I guess this way is simpler and
> more straightforward.
>
> [... level_store hunks trimmed ...]

It looks dangerous. If the callback function dereferences mddev->pers
when bios come back from the member disks, it can panic. And we don't
want to return an io error to the upper layer during the takeover; the
io only needs to wait until md can handle it again.

Best Regards
Xiao
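To make the ordering concrete, the suspected interleaving reads like
this (an illustration based on the analysis above, not taken from a
trace):

/*
 * CPU A: mdadm writes "raid10"        CPU B: writer / softirq
 * to the md level attribute
 * ------------------------------      -------------------------------
 *                                     md_handle_request()
 *                                       atomic_inc(&active_io)
 *                                       raid0_make_request()
 *                                         submit_bio_noacct() to
 *                                         the member disk
 *                                       atomic_dec(&active_io) -> 0
 * level_store()
 *   mddev_suspend()
 *     wait_event(active_io == 0)      (passes; the bio to the member
 *     pers->quiesce(mddev, 1)          disk is still in flight, and
 *                                      raid0's quiesce is a no-op)
 *   bioset_exit()                     (destroys the bio-176 cache)
 *                                     the member-disk bio completes:
 *                                       md_end_io_acct()
 *                                         mempool_free() on a freed
 *                                         pool -> leak report / oops
 */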
* Re: Memory leak when raid10 takeover raid0
  2022-10-10  5:48   ` Xiao Ni
@ 2022-10-10  8:13     ` Xiao Ni
  0 siblings, 0 replies; 8+ messages in thread

From: Xiao Ni @ 2022-10-10  8:13 UTC
To: Guoqing Jiang; +Cc: linux-raid, Song Liu

How about something like this?

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 857c49399c28..c263173a142f 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -393,6 +393,9 @@ static int raid0_run(struct mddev *mddev)
 		mddev->private = conf;
 	}
 	conf = mddev->private;
+
+	init_waitqueue_head(&conf->wait_for_quiescent);
+
 	if (mddev->queue) {
 		struct md_rdev *rdev;
 
@@ -512,6 +515,20 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
 	bio_endio(bio);
 }
 
+static void raid0_end_request(struct bio *bio)
+{
+	struct bio *orig_bio = bio->bi_private;
+	struct gendisk *disk = orig_bio->bi_bdev->bd_disk;
+	struct mddev *mddev = disk->private_data;
+	struct r0conf *conf = mddev->private;
+
+	/* Wake the quiesce waiter when the last bio comes back */
+	if (atomic_dec_and_test(&conf->bios))
+		wake_up(&conf->wait_for_quiescent);
+	bio_put(bio);
+	bio_endio(orig_bio);
+}
+
 static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct r0conf *conf = mddev->private;
@@ -522,6 +537,7 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
 	sector_t orig_sector;
 	unsigned chunk_sects;
 	unsigned sectors;
+	struct bio *r0_bio;
 
 	if (unlikely(bio->bi_opf & REQ_PREFLUSH)
 	    && md_flush_request(mddev, bio))
@@ -575,15 +591,21 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
 		return true;
 	}
 
-	bio_set_dev(bio, tmp_dev->bdev);
-	bio->bi_iter.bi_sector = sector + zone->dev_start +
+	r0_bio = bio_alloc_clone(tmp_dev->bdev, bio, GFP_NOIO,
+				 &mddev->bio_set);
+
+	r0_bio->bi_private = bio;
+	r0_bio->bi_end_io = raid0_end_request;
+	r0_bio->bi_iter.bi_sector = sector + zone->dev_start +
 		tmp_dev->data_offset;
 
 	if (mddev->gendisk)
-		trace_block_bio_remap(bio, disk_devt(mddev->gendisk),
+		trace_block_bio_remap(r0_bio, disk_devt(mddev->gendisk),
 				      bio_sector);
-	mddev_check_write_zeroes(mddev, bio);
-	submit_bio_noacct(bio);
+	mddev_check_write_zeroes(mddev, r0_bio);
+	atomic_inc(&conf->bios);
+	submit_bio_noacct(r0_bio);
 	return true;
 }
 
@@ -754,6 +775,11 @@ static void *raid0_takeover(struct mddev *mddev)
 
 static void raid0_quiesce(struct mddev *mddev, int quiesce)
 {
+	struct r0conf *conf = mddev->private;
+
+	if (quiesce)
+		wait_event(conf->wait_for_quiescent,
+			   atomic_read(&conf->bios) == 0);
 }
 
 static struct md_personality raid0_personality=
diff --git a/drivers/md/raid0.h b/drivers/md/raid0.h
index 3816e5477db1..a0ec5c235962 100644
--- a/drivers/md/raid0.h
+++ b/drivers/md/raid0.h
@@ -27,6 +27,9 @@ struct r0conf {
 					 * by strip_zone->dev */
 	int			nr_strip_zones;
 	enum r0layout		layout;
+	atomic_t		bios;	/* bios in flight to the
+					 * member disks */
+	wait_queue_head_t	wait_for_quiescent;
 };

Regards
Xiao
* Re: Memory leak when raid10 takeover raid0
  2022-10-10  5:48   ` Xiao Ni
  2022-10-10  8:13     ` Xiao Ni
@ 2022-10-11 13:38     ` Guoqing Jiang
  2022-10-11 15:49       ` Xiao Ni
  1 sibling, 1 reply; 8+ messages in thread

From: Guoqing Jiang @ 2022-10-11 13:38 UTC
To: Xiao Ni; +Cc: linux-raid, Song Liu

On 10/10/22 1:48 PM, Xiao Ni wrote:
> I haven't tried with the latest upstream kernel. It's hard to reproduce
> this problem. So far it can only be triggered by our QE automated
> tests. I tried to extract the test commands and put them into a script,
> but it couldn't be reproduced that way.
>
> But from the code analysis, raid0 does have this problem.

I finally reproduced it with the 6.0 kernel, and it is a NULL
dereference.

[  238.455386] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  238.470775] #PF: supervisor write access in kernel mode
[  238.470787] #PF: error_code(0x0002) - not-present page
[  238.470795] PGD 0 P4D 0
[  238.470808] Oops: 0002 [#1] PREEMPT SMP KASAN NOPTI
[  238.470821] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G B OE 6.0.0-57-default #36
[  238.470834] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
[  238.470843] RIP: 0010:mempool_free+0xe7/0x170
[  238.470859] Code: 72 48 8d 7b 48 e8 49 4a 10 00 4c 89 e7 4c 8b 6b 48 e8 9d 49 10 00 41 8d 46 01 89 43 44 4f 8d 64 f5 00 4c 89 e7 e8 c9 4a 10 00 <49> 89 2c 24 4c 89 fe 48 89 df e8 ba 24 da 00 48 8d 7b 68 31 c9 5b
[  238.470871] RSP: 0018:ffff88805b80fe40 EFLAGS: 00010082
[  238.470883] RAX: 0000000000000001 RBX: ffff888009c54da0 RCX: ffffffff902fd306
[  238.470892] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffff93461a00
[  238.470900] RBP: ffff88801fa26800 R08: 0000000000000001 R09: ffffffff93461a07
[  238.470908] R10: fffffbfff268c340 R11: 3e4b5341542f3c20 R12: 0000000000000000
[  238.470916] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000202
[  238.470924] FS:  0000000000000000(0000) GS:ffff88805b800000(0000) knlGS:0000000000000000
[  238.470934] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  238.470942] CR2: 0000000000000000 CR3: 0000000002478001 CR4: 0000000000370ee0
[  238.470955] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  238.470962] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  238.470969] Call Trace:
[  238.470976]  <IRQ>
[  238.470993]  md_end_io_acct+0x71/0xa0 [md_mod]
[  238.471089]  blk_update_request+0x205/0x820
[  238.471110]  ? ktime_get+0x21/0xc0
[  238.471132]  blk_mq_end_request+0x2e/0x220
[  238.471152]  blk_complete_reqs+0x78/0x90
[  238.471171]  __do_softirq+0x121/0x66d
[  238.471188]  ? lock_is_held_type+0xe3/0x140
[  238.471209]  __irq_exit_rcu+0x12c/0x1a0
[  238.471224]  irq_exit_rcu+0xa/0x30
[  238.471237]  sysvec_call_function_single+0x8f/0xb0
[  238.471252]  </IRQ>
[  238.471256]  <TASK>
[  238.471263]  asm_sysvec_call_function_single+0x16/0x20
[  238.471275] RIP: 0010:native_safe_halt+0xb/0x10

> [... quoted traces and discussion trimmed ...]
>
>> However, it can't guarantee new io comes after mddev_suspend because of
>> the empty raid0 quiesce. Maybe we can do something like this.

I was wrong, given that md_end_io_acct can happen after
raid0_make_request has returned; so it can be a previous io that has
not finished before mddev_suspend. That means we would need to find a
way to drain the member disks somehow, but I doubt it is reasonable.

Also, changing personality from raid0 seems to not obey the second
rule.

/* request to change the personality. Need to ensure:
 * - array is not engaged in resync/recovery/reshape
 * - old personality can be suspended
 * - new personality will access other array.
 */

And I don't like the idea of adding another clone in the io path for
this special case, which hurts performance. I would just warn users not
to change personality from raid0 unless the QUEUE_FLAG_IO_STAT flag has
been cleared for a while. Or do something like.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 729be2c5296c..55e975233f66 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3979,6 +3979,12 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 		goto out_unlock;
 	}
 
+	if (blk_queue_io_stat(mddev->queue)) {
+		blk_queue_flag_clear(QUEUE_FLAG_IO_STAT, mddev->queue);
+		/* We want the previous bios to be finished */
+		msleep(1000);
+	}
+
 	/* Looks like we have a winner */
 	mddev_suspend(mddev);
 	mddev_detach(mddev);
@@ -4067,6 +4073,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	pers->run(mddev);
 	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 	mddev_resume(mddev);
+	blk_queue_flag_set(QUEUE_FLAG_IO_STAT, mddev->queue);

If you really want to fix it, maybe we have to ensure all inflight IOs
are finished by increasing an inflight count in md_account_bio, then
decreasing it in md_end_io_acct, after adding a new inflight_num to
mddev and also adding "struct mddev *mddev" to md_io_acct. Just FYI.

Thanks,
Guoqing
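A rough sketch of that last idea might look like this (hypothetical and
untested; the inflight_num field and the exact hook points are
assumptions, although md_account_bio and md_end_io_acct are the real
accounting hooks). mddev_suspend would then also have to wait for
inflight_num to reach zero before the bioset is torn down.

/* Sketch only: count the accounting clones and let suspend wait for
 * them to drain. inflight_num is a hypothetical new mddev field. */
struct md_io_acct {
	struct mddev	*mddev;		/* new: back-pointer for endio */
	struct bio	*orig_bio;
	unsigned long	start_time;
	struct bio	bio_clone;
};

void md_account_bio(struct mddev *mddev, struct bio **bio)
{
	struct md_io_acct *md_io_acct;
	struct bio *clone;

	if (!blk_queue_io_stat(mddev->queue))
		return;

	clone = bio_alloc_clone((*bio)->bi_bdev, *bio, GFP_NOIO,
				&mddev->io_acct_set);
	md_io_acct = container_of(clone, struct md_io_acct, bio_clone);
	md_io_acct->orig_bio = *bio;
	md_io_acct->start_time = bio_start_io_acct(*bio);
	md_io_acct->mddev = mddev;
	atomic_inc(&mddev->inflight_num);	/* hypothetical counter */

	clone->bi_end_io = md_end_io_acct;
	clone->bi_private = md_io_acct;
	*bio = clone;
}

static void md_end_io_acct(struct bio *bio)
{
	struct md_io_acct *md_io_acct = bio->bi_private;
	struct bio *orig_bio = md_io_acct->orig_bio;
	struct mddev *mddev = md_io_acct->mddev;

	orig_bio->bi_status = bio->bi_status;
	bio_end_io_acct(orig_bio, md_io_acct->start_time);
	bio_put(bio);
	bio_endio(orig_bio);

	/* Last accounting clone is gone: let mddev_suspend proceed. */
	if (atomic_dec_and_test(&mddev->inflight_num))
		wake_up(&mddev->sb_wait);
}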
* Re: Memory leak when raid10 takeover raid0
  2022-10-11 13:38     ` Guoqing Jiang
@ 2022-10-11 15:49       ` Xiao Ni
  2022-10-12 10:58         ` Guoqing Jiang
  0 siblings, 1 reply; 8+ messages in thread

From: Xiao Ni @ 2022-10-11 15:49 UTC
To: Guoqing Jiang; +Cc: linux-raid, Song Liu

On Tue, Oct 11, 2022 at 9:39 PM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
> I finally reproduced it with the 6.0 kernel, and it is a NULL
> dereference.

That's cool. Could you share the steps?

> [... trace trimmed ...]
>
> I was wrong, given that md_end_io_acct can happen after
> raid0_make_request has returned; so it can be a previous io that has
> not finished before mddev_suspend. That means we would need to find a
> way to drain the member disks somehow, but I doubt it is reasonable.

It's what I wanted to mention in the original email; sorry for not
describing it clearly. The inflight ios mentioned in the first email
are the bios submitted to the member disks. Because md_end_io_acct has
been introduced, we need to ensure all the inflight ios come back
before freeing the memory. The patch in my last email tries to resolve
this.

> Also, changing personality from raid0 seems to not obey the second
> rule.
>
> [...]
>
> And I don't like the idea of adding another clone in the io path for
> this special case, which hurts performance.

That is what my last patch does. I didn't figure out a better way.

> I would just warn users not to change personality from raid0 unless
> the QUEUE_FLAG_IO_STAT flag has been cleared for a while. Or do
> something like.
>
> [... QUEUE_FLAG_IO_STAT diff trimmed ...]
>
> If you really want to fix it, maybe we have to ensure all inflight IOs
> are finished by increasing an inflight count in md_account_bio, then
> decreasing it in md_end_io_acct, after adding a new inflight_num to
> mddev and also adding "struct mddev *mddev" to md_io_acct. Just FYI.

I thought about this method too, but since it's only a raid0 problem, I
didn't choose it. Thanks for all the suggestions.

Best Regards
Xiao
* Re: Memory leak when raid10 takeover raid0
  2022-10-11 15:49       ` Xiao Ni
@ 2022-10-12 10:58         ` Guoqing Jiang
  0 siblings, 0 replies; 8+ messages in thread

From: Guoqing Jiang @ 2022-10-12 10:58 UTC
To: Xiao Ni; +Cc: linux-raid, Song Liu

On 10/11/22 11:49 PM, Xiao Ni wrote:
>> I finally reproduced it with the 6.0 kernel, and it is a NULL
>> dereference.
> That's cool. Could you share the steps?

Do the two things simultaneously:

1. write to the raid0 array
2. echo raid10 > /sys/block/mdX/md/level

Thanks,
Guoqing