* [PATCH] [RFC] md raid10: use rcu to avoid NULL dereference in raid10d() @ 2017-10-17 16:36 Coly Li 2017-10-17 20:03 ` NeilBrown 0 siblings, 1 reply; 8+ messages in thread From: Coly Li @ 2017-10-17 16:36 UTC (permalink / raw) To: shli, neilb, linux-raid; +Cc: Coly Li We have a bug report about NULL dereference in md raid10 code. The operations are, - Create a raid10 device - add a spare disk - disconnect one of the online disks from the raid10 device - wait to the recovery happens on the spare disk - remove the spare disk which is recovering And sometimes a kernel oops of NULL dereference in md raid10 module can be observed, and crash tool reports the following information: > (gdb) list *(raid10d+0xad6) > 0x5de6 is in raid10d (../drivers/md/raid10.c:2300). > 2295 * the latter is free to free wbio2. > 2296 */ > 2297 if (wbio2 && !wbio2->bi_end_io) > 2298 wbio2 = NULL; > 2299 if (wbio->bi_end_io) { > 2300 atomic_inc(&conf->mirrors[d].rdev->nr_pending); > 2301 md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(wbio)); > 2302 generic_make_request(wbio); > 2303 } > 2304 if (wbio2) { At line 2300, conf->mirrors[d].rdev is NULL, this causes a NULL pointer dereference to conf->mirrors[d].rdev->nr_pending. After reading previous raid10 patches, I find conf->mirrors[d].rdev can be NULL at any point unless, - a counted reference is held - ->reconfig_mutex is held, or - rcu_read_lock() is held In context of kernel thread raid10d, non of the above conditions happens, therefore using rcu to protect the access to conf->mirrors[].rdev and conf->mirrors[].replacement pointers is necessary. This patch is an effort to add rcu protection in raid10d() code path, which including the following sub-routines, - handle_write_completed() - reshape_request_write() - sync_request_write() - recovery_request_write() Because the reported issue can not be always reproduced, I am still working on verification. This RFC patch is sent out for early feedback in case there is something I missed in the fix. Thanks in advance for the review and comments. Signed-off-by: Coly Li <colyli@suse.de> Reported-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> --- drivers/md/raid10.c | 99 +++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 80 insertions(+), 19 deletions(-) diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index 374df5796649..fe9ce25ffc08 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -2041,13 +2041,25 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio) tpages = get_resync_pages(tbio)->pages; d = r10_bio->devs[i].devnum; - rdev = conf->mirrors[d].rdev; + rcu_read_lock(); + rdev = rcu_dereference(conf->mirrors[d].rdev); + if (rdev == NULL) { + rcu_read_unlock(); + continue; + } + if (!r10_bio->devs[i].bio->bi_status) { /* We know that the bi_io_vec layout is the same for * both 'first' and 'i', so we just compare them. * All vec entries are PAGE_SIZE; */ int sectors = r10_bio->sectors; + + /* + * rdev is not referenced here, don't need rcu lock + * any more. + */ + rcu_read_unlock(); for (j = 0; j < vcnt; j++) { int len = PAGE_SIZE; if (sectors < (len / 512)) @@ -2067,8 +2079,10 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio) } else if (test_bit(FailFast, &rdev->flags)) { /* Just give up on this device */ md_error(rdev->mddev, rdev); + rcu_read_unlock(); continue; } + rcu_read_unlock(); /* Ok, we need to write this bio, either to correct an * inconsistency or to correct an unreadable block. * First we need to fixup bv_offset, bv_len and @@ -2087,14 +2101,21 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio) bio_copy_data(tbio, fbio); - atomic_inc(&conf->mirrors[d].rdev->nr_pending); + rcu_read_lock(); + rdev = rcu_dereference(conf->mirrors[d].rdev); + if (rdev == NULL) { + rcu_read_unlock(); + continue; + } + atomic_inc(&rdev->nr_pending); atomic_inc(&r10_bio->remaining); - md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(tbio)); + md_sync_acct(rdev->bdev, bio_sectors(tbio)); - if (test_bit(FailFast, &conf->mirrors[d].rdev->flags)) + if (test_bit(FailFast, &rdev->flags)) tbio->bi_opf |= MD_FAILFAST; - tbio->bi_iter.bi_sector += conf->mirrors[d].rdev->data_offset; - bio_set_dev(tbio, conf->mirrors[d].rdev->bdev); + tbio->bi_iter.bi_sector += rdev->data_offset; + bio_set_dev(tbio, rdev->bdev); + rcu_read_unlock(); generic_make_request(tbio); } @@ -2222,6 +2243,7 @@ static void recovery_request_write(struct mddev *mddev, struct r10bio *r10_bio) struct r10conf *conf = mddev->private; int d; struct bio *wbio, *wbio2; + struct md_rdev *rdev, *replacement; if (!test_bit(R10BIO_Uptodate, &r10_bio->state)) { fix_recovery_read_error(r10_bio); @@ -2236,6 +2258,8 @@ static void recovery_request_write(struct mddev *mddev, struct r10bio *r10_bio) d = r10_bio->devs[1].devnum; wbio = r10_bio->devs[1].bio; wbio2 = r10_bio->devs[1].repl_bio; + + /* Need to test wbio2->bi_end_io before we call * generic_make_request as if the former is NULL, * the latter is free to free wbio2. @@ -2243,15 +2267,25 @@ static void recovery_request_write(struct mddev *mddev, struct r10bio *r10_bio) if (wbio2 && !wbio2->bi_end_io) wbio2 = NULL; if (wbio->bi_end_io) { - atomic_inc(&conf->mirrors[d].rdev->nr_pending); - md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(wbio)); - generic_make_request(wbio); + rcu_read_lock(); + rdev = rcu_dereference(conf->mirrors[d].rdev); + if (rdev) { + atomic_inc(&rdev->nr_pending); + md_sync_acct(rdev->bdev, bio_sectors(wbio)); + generic_make_request(wbio); + } + rcu_read_unlock(); } if (wbio2) { - atomic_inc(&conf->mirrors[d].replacement->nr_pending); - md_sync_acct(conf->mirrors[d].replacement->bdev, + rcu_read_lock(); + replacement = rcu_dereference(conf->mirrors[d].replacement); + if (replacement) { + atomic_inc(&replacement->nr_pending); + md_sync_acct(replacement->bdev, bio_sectors(wbio2)); - generic_make_request(wbio2); + generic_make_request(wbio2); + } + rcu_read_unlock(); } } @@ -2620,9 +2654,17 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio) test_bit(R10BIO_IsRecover, &r10_bio->state)) { for (m = 0; m < conf->copies; m++) { int dev = r10_bio->devs[m].devnum; - rdev = conf->mirrors[dev].rdev; - if (r10_bio->devs[m].bio == NULL) + + rcu_read_lock(); + rdev = rcu_dereference(conf->mirrors[dev].rdev); + if (rdev == NULL) { + rcu_read_unlock(); + continue; + } + if (r10_bio->devs[m].bio == NULL) { + rcu_read_unlock(); continue; + } if (!r10_bio->devs[m].bio->bi_status) { rdev_clear_badblocks( rdev, @@ -2635,9 +2677,16 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio) r10_bio->sectors, 0)) md_error(conf->mddev, rdev); } - rdev = conf->mirrors[dev].replacement; - if (r10_bio->devs[m].repl_bio == NULL) + + rdev = rcu_dereference(conf->mirrors[dev].replacement); + if (rdev == NULL) { + rcu_read_unlock(); continue; + } + if (r10_bio->devs[m].repl_bio == NULL) { + rcu_read_unlock(); + continue; + } if (!r10_bio->devs[m].repl_bio->bi_status) { rdev_clear_badblocks( @@ -2651,6 +2700,7 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio) r10_bio->sectors, 0)) md_error(conf->mddev, rdev); } + rcu_read_unlock(); } put_buf(r10_bio); } else { @@ -2658,7 +2708,13 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio) for (m = 0; m < conf->copies; m++) { int dev = r10_bio->devs[m].devnum; struct bio *bio = r10_bio->devs[m].bio; - rdev = conf->mirrors[dev].rdev; + + rcu_read_lock(); + rdev = rcu_dereference(conf->mirrors[dev].rdev); + if (rdev == NULL) { + rcu_read_unlock(); + continue; + } if (bio == IO_MADE_GOOD) { rdev_clear_badblocks( rdev, @@ -2675,14 +2731,19 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio) rdev_dec_pending(rdev, conf->mddev); } bio = r10_bio->devs[m].repl_bio; - rdev = conf->mirrors[dev].replacement; - if (rdev && bio == IO_MADE_GOOD) { + rdev = rcu_dereference(conf->mirrors[dev].replacement); + if (rdev == NULL) { + rcu_read_unlock(); + continue; + } + if (bio == IO_MADE_GOOD) { rdev_clear_badblocks( rdev, r10_bio->devs[m].addr, r10_bio->sectors, 0); rdev_dec_pending(rdev, conf->mddev); } + rcu_read_unlock(); } if (fail) { spin_lock_irq(&conf->device_lock); -- 2.13.6 ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH] [RFC] md raid10: use rcu to avoid NULL dereference in raid10d() 2017-10-17 16:36 [PATCH] [RFC] md raid10: use rcu to avoid NULL dereference in raid10d() Coly Li @ 2017-10-17 20:03 ` NeilBrown 2017-10-18 23:06 ` NeilBrown 0 siblings, 1 reply; 8+ messages in thread From: NeilBrown @ 2017-10-17 20:03 UTC (permalink / raw) To: shli, linux-raid; +Cc: Coly Li [-- Attachment #1: Type: text/plain, Size: 10247 bytes --] On Wed, Oct 18 2017, Coly Li wrote: > We have a bug report about NULL dereference in md raid10 code. The > operations are, > - Create a raid10 device > - add a spare disk > - disconnect one of the online disks from the raid10 device > - wait to the recovery happens on the spare disk > - remove the spare disk which is recovering > And sometimes a kernel oops of NULL dereference in md raid10 module can > be observed, and crash tool reports the following information: >> (gdb) list *(raid10d+0xad6) >> 0x5de6 is in raid10d (../drivers/md/raid10.c:2300). >> 2295 * the latter is free to free wbio2. >> 2296 */ >> 2297 if (wbio2 && !wbio2->bi_end_io) >> 2298 wbio2 = NULL; >> 2299 if (wbio->bi_end_io) { >> 2300 atomic_inc(&conf->mirrors[d].rdev->nr_pending); >> 2301 md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(wbio)); >> 2302 generic_make_request(wbio); >> 2303 } >> 2304 if (wbio2) { > At line 2300, conf->mirrors[d].rdev is NULL, this causes a NULL pointer > dereference to conf->mirrors[d].rdev->nr_pending. After reading previous > raid10 patches, I find conf->mirrors[d].rdev can be NULL at any point > unless, > - a counted reference is held > - ->reconfig_mutex is held, or > - rcu_read_lock() is held There is one other condition: MD_RECOVERY_RUNNING is set (without MD_RECOVERY_DONE). mirrors[d].rdev is only set to NULL by ->hot_remove_disk() which is only called from remove_and_add_spares(), and that is never called while resync/recovery/reshape is happening. So there is no need for RCU protection here. Only ... remove_and_add_spares() *can* sometimes be called during resync - Commit: 8430e7e0af9a ("md: disconnect device from personality before trying to remove it.") added a called to remove_and_add_spares() when "remove" is written to the device/state attribute. That was wrong. This: } else if (cmd_match(buf, "remove")) { - if (rdev->mddev->pers) { + if (rdev->mddev->pers && + !test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) { clear_bit(Blocked, &rdev->flags); remove_and_add_spares(rdev->mddev, rdev); should fix it. Thanks, NeilBrown > In context of kernel thread raid10d, non of the above conditions happens, > therefore using rcu to protect the access to conf->mirrors[].rdev and > conf->mirrors[].replacement pointers is necessary. > > This patch is an effort to add rcu protection in raid10d() code path, > which including the following sub-routines, > - handle_write_completed() > - reshape_request_write() > - sync_request_write() > - recovery_request_write() > > Because the reported issue can not be always reproduced, I am still > working on verification. This RFC patch is sent out for early feedback > in case there is something I missed in the fix. > > Thanks in advance for the review and comments. > > Signed-off-by: Coly Li <colyli@suse.de> > Reported-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> > --- > drivers/md/raid10.c | 99 +++++++++++++++++++++++++++++++++++++++++++---------- > 1 file changed, 80 insertions(+), 19 deletions(-) > > diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c > index 374df5796649..fe9ce25ffc08 100644 > --- a/drivers/md/raid10.c > +++ b/drivers/md/raid10.c > @@ -2041,13 +2041,25 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio) > > tpages = get_resync_pages(tbio)->pages; > d = r10_bio->devs[i].devnum; > - rdev = conf->mirrors[d].rdev; > + rcu_read_lock(); > + rdev = rcu_dereference(conf->mirrors[d].rdev); > + if (rdev == NULL) { > + rcu_read_unlock(); > + continue; > + } > + > if (!r10_bio->devs[i].bio->bi_status) { > /* We know that the bi_io_vec layout is the same for > * both 'first' and 'i', so we just compare them. > * All vec entries are PAGE_SIZE; > */ > int sectors = r10_bio->sectors; > + > + /* > + * rdev is not referenced here, don't need rcu lock > + * any more. > + */ > + rcu_read_unlock(); > for (j = 0; j < vcnt; j++) { > int len = PAGE_SIZE; > if (sectors < (len / 512)) > @@ -2067,8 +2079,10 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio) > } else if (test_bit(FailFast, &rdev->flags)) { > /* Just give up on this device */ > md_error(rdev->mddev, rdev); > + rcu_read_unlock(); > continue; > } > + rcu_read_unlock(); > /* Ok, we need to write this bio, either to correct an > * inconsistency or to correct an unreadable block. > * First we need to fixup bv_offset, bv_len and > @@ -2087,14 +2101,21 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio) > > bio_copy_data(tbio, fbio); > > - atomic_inc(&conf->mirrors[d].rdev->nr_pending); > + rcu_read_lock(); > + rdev = rcu_dereference(conf->mirrors[d].rdev); > + if (rdev == NULL) { > + rcu_read_unlock(); > + continue; > + } > + atomic_inc(&rdev->nr_pending); > atomic_inc(&r10_bio->remaining); > - md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(tbio)); > + md_sync_acct(rdev->bdev, bio_sectors(tbio)); > > - if (test_bit(FailFast, &conf->mirrors[d].rdev->flags)) > + if (test_bit(FailFast, &rdev->flags)) > tbio->bi_opf |= MD_FAILFAST; > - tbio->bi_iter.bi_sector += conf->mirrors[d].rdev->data_offset; > - bio_set_dev(tbio, conf->mirrors[d].rdev->bdev); > + tbio->bi_iter.bi_sector += rdev->data_offset; > + bio_set_dev(tbio, rdev->bdev); > + rcu_read_unlock(); > generic_make_request(tbio); > } > > @@ -2222,6 +2243,7 @@ static void recovery_request_write(struct mddev *mddev, struct r10bio *r10_bio) > struct r10conf *conf = mddev->private; > int d; > struct bio *wbio, *wbio2; > + struct md_rdev *rdev, *replacement; > > if (!test_bit(R10BIO_Uptodate, &r10_bio->state)) { > fix_recovery_read_error(r10_bio); > @@ -2236,6 +2258,8 @@ static void recovery_request_write(struct mddev *mddev, struct r10bio *r10_bio) > d = r10_bio->devs[1].devnum; > wbio = r10_bio->devs[1].bio; > wbio2 = r10_bio->devs[1].repl_bio; > + > + > /* Need to test wbio2->bi_end_io before we call > * generic_make_request as if the former is NULL, > * the latter is free to free wbio2. > @@ -2243,15 +2267,25 @@ static void recovery_request_write(struct mddev *mddev, struct r10bio *r10_bio) > if (wbio2 && !wbio2->bi_end_io) > wbio2 = NULL; > if (wbio->bi_end_io) { > - atomic_inc(&conf->mirrors[d].rdev->nr_pending); > - md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(wbio)); > - generic_make_request(wbio); > + rcu_read_lock(); > + rdev = rcu_dereference(conf->mirrors[d].rdev); > + if (rdev) { > + atomic_inc(&rdev->nr_pending); > + md_sync_acct(rdev->bdev, bio_sectors(wbio)); > + generic_make_request(wbio); > + } > + rcu_read_unlock(); > } > if (wbio2) { > - atomic_inc(&conf->mirrors[d].replacement->nr_pending); > - md_sync_acct(conf->mirrors[d].replacement->bdev, > + rcu_read_lock(); > + replacement = rcu_dereference(conf->mirrors[d].replacement); > + if (replacement) { > + atomic_inc(&replacement->nr_pending); > + md_sync_acct(replacement->bdev, > bio_sectors(wbio2)); > - generic_make_request(wbio2); > + generic_make_request(wbio2); > + } > + rcu_read_unlock(); > } > } > > @@ -2620,9 +2654,17 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio) > test_bit(R10BIO_IsRecover, &r10_bio->state)) { > for (m = 0; m < conf->copies; m++) { > int dev = r10_bio->devs[m].devnum; > - rdev = conf->mirrors[dev].rdev; > - if (r10_bio->devs[m].bio == NULL) > + > + rcu_read_lock(); > + rdev = rcu_dereference(conf->mirrors[dev].rdev); > + if (rdev == NULL) { > + rcu_read_unlock(); > + continue; > + } > + if (r10_bio->devs[m].bio == NULL) { > + rcu_read_unlock(); > continue; > + } > if (!r10_bio->devs[m].bio->bi_status) { > rdev_clear_badblocks( > rdev, > @@ -2635,9 +2677,16 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio) > r10_bio->sectors, 0)) > md_error(conf->mddev, rdev); > } > - rdev = conf->mirrors[dev].replacement; > - if (r10_bio->devs[m].repl_bio == NULL) > + > + rdev = rcu_dereference(conf->mirrors[dev].replacement); > + if (rdev == NULL) { > + rcu_read_unlock(); > continue; > + } > + if (r10_bio->devs[m].repl_bio == NULL) { > + rcu_read_unlock(); > + continue; > + } > > if (!r10_bio->devs[m].repl_bio->bi_status) { > rdev_clear_badblocks( > @@ -2651,6 +2700,7 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio) > r10_bio->sectors, 0)) > md_error(conf->mddev, rdev); > } > + rcu_read_unlock(); > } > put_buf(r10_bio); > } else { > @@ -2658,7 +2708,13 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio) > for (m = 0; m < conf->copies; m++) { > int dev = r10_bio->devs[m].devnum; > struct bio *bio = r10_bio->devs[m].bio; > - rdev = conf->mirrors[dev].rdev; > + > + rcu_read_lock(); > + rdev = rcu_dereference(conf->mirrors[dev].rdev); > + if (rdev == NULL) { > + rcu_read_unlock(); > + continue; > + } > if (bio == IO_MADE_GOOD) { > rdev_clear_badblocks( > rdev, > @@ -2675,14 +2731,19 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio) > rdev_dec_pending(rdev, conf->mddev); > } > bio = r10_bio->devs[m].repl_bio; > - rdev = conf->mirrors[dev].replacement; > - if (rdev && bio == IO_MADE_GOOD) { > + rdev = rcu_dereference(conf->mirrors[dev].replacement); > + if (rdev == NULL) { > + rcu_read_unlock(); > + continue; > + } > + if (bio == IO_MADE_GOOD) { > rdev_clear_badblocks( > rdev, > r10_bio->devs[m].addr, > r10_bio->sectors, 0); > rdev_dec_pending(rdev, conf->mddev); > } > + rcu_read_unlock(); > } > if (fail) { > spin_lock_irq(&conf->device_lock); > -- > 2.13.6 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] [RFC] md raid10: use rcu to avoid NULL dereference in raid10d() 2017-10-17 20:03 ` NeilBrown @ 2017-10-18 23:06 ` NeilBrown 2017-11-17 20:03 ` Shaohua Li 0 siblings, 1 reply; 8+ messages in thread From: NeilBrown @ 2017-10-18 23:06 UTC (permalink / raw) To: shli, linux-raid; +Cc: Coly Li [-- Attachment #1: Type: text/plain, Size: 4064 bytes --] On Wed, Oct 18 2017, NeilBrown wrote: > On Wed, Oct 18 2017, Coly Li wrote: > >> We have a bug report about NULL dereference in md raid10 code. The >> operations are, >> - Create a raid10 device >> - add a spare disk >> - disconnect one of the online disks from the raid10 device >> - wait to the recovery happens on the spare disk >> - remove the spare disk which is recovering >> And sometimes a kernel oops of NULL dereference in md raid10 module can >> be observed, and crash tool reports the following information: >>> (gdb) list *(raid10d+0xad6) >>> 0x5de6 is in raid10d (../drivers/md/raid10.c:2300). >>> 2295 * the latter is free to free wbio2. >>> 2296 */ >>> 2297 if (wbio2 && !wbio2->bi_end_io) >>> 2298 wbio2 = NULL; >>> 2299 if (wbio->bi_end_io) { >>> 2300 atomic_inc(&conf->mirrors[d].rdev->nr_pending); >>> 2301 md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(wbio)); >>> 2302 generic_make_request(wbio); >>> 2303 } >>> 2304 if (wbio2) { >> At line 2300, conf->mirrors[d].rdev is NULL, this causes a NULL pointer >> dereference to conf->mirrors[d].rdev->nr_pending. After reading previous >> raid10 patches, I find conf->mirrors[d].rdev can be NULL at any point >> unless, >> - a counted reference is held >> - ->reconfig_mutex is held, or >> - rcu_read_lock() is held > > There is one other condition: MD_RECOVERY_RUNNING is set (without > MD_RECOVERY_DONE). > mirrors[d].rdev is only set to NULL by ->hot_remove_disk() which is only > called from remove_and_add_spares(), and that is never called while > resync/recovery/reshape is happening. > So there is no need for RCU protection here. > > Only ... remove_and_add_spares() *can* sometimes be called during > resync - > Commit: 8430e7e0af9a ("md: disconnect device from personality before trying to remove it.") > added a called to remove_and_add_spares() when "remove" is written to > the device/state attribute. That was wrong. > This: > > } else if (cmd_match(buf, "remove")) { > - if (rdev->mddev->pers) { > + if (rdev->mddev->pers && > + !test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) { > clear_bit(Blocked, &rdev->flags); > remove_and_add_spares(rdev->mddev, rdev); > > should fix it. Actually, that doesn't even compile :-( I think this is a better fix. Thanks, NeilBrown From: NeilBrown <neilb@suse.com> Date: Thu, 19 Oct 2017 09:58:19 +1100 Subject: [PATCH] md: only allow remove_and_add_spares() when no sync_thread running. The locking protocols in md assume that a device will never be removed from an array during resync/recovery/reshape. When that isn't happening, rcu or reconfig_mutex is needed to protect an rdev pointer while taking a refcount. When it is happening, that protection isn't needed. Unfortuantely there is a case were remove_and_add_spares() is called when recovery might be happening: when "remove" is written to the device/state sysfs attribute. This call is an optimization and not essential so it doesn't matter if it fails. So change remove_and_add_spares() to abort early if resync/recovery/reshape is happening. As this can result in a NULL dereference, the fix is suitable for -stable. Cc: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Fixes: 8430e7e0af9a ("md: disconnect device from personality before trying to remove it.") Cc: stable@ver.kernel.org (v4.8+) Signed-off-by: NeilBrown <neilb@suse.com> --- drivers/md/md.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/md/md.c b/drivers/md/md.c index b3192943de7d..01eac0dbca98 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -8563,6 +8563,10 @@ static int remove_and_add_spares(struct mddev *mddev, int removed = 0; bool remove_some = false; + if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) + /* Mustn't remove devices when resync thread is running */ + return 0; + rdev_for_each(rdev, mddev) { if ((this == NULL || rdev == this) && rdev->raid_disk >= 0 && -- 2.14.0.rc0.dirty [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH] [RFC] md raid10: use rcu to avoid NULL dereference in raid10d() 2017-10-18 23:06 ` NeilBrown @ 2017-11-17 20:03 ` Shaohua Li [not found] ` <45cfa4a5-fe04-8b02-531d-c642ad6e7b4c@huawei.com> 0 siblings, 1 reply; 8+ messages in thread From: Shaohua Li @ 2017-11-17 20:03 UTC (permalink / raw) To: NeilBrown; +Cc: Coly Li, linux-raid On Thu, Oct 19, 2017 at 10:06:56AM +1100, Neil Brown wrote: > On Wed, Oct 18 2017, NeilBrown wrote: > > > On Wed, Oct 18 2017, Coly Li wrote: > > > >> We have a bug report about NULL dereference in md raid10 code. The > >> operations are, > >> - Create a raid10 device > >> - add a spare disk > >> - disconnect one of the online disks from the raid10 device > >> - wait to the recovery happens on the spare disk > >> - remove the spare disk which is recovering > >> And sometimes a kernel oops of NULL dereference in md raid10 module can > >> be observed, and crash tool reports the following information: > >>> (gdb) list *(raid10d+0xad6) > >>> 0x5de6 is in raid10d (../drivers/md/raid10.c:2300). > >>> 2295 * the latter is free to free wbio2. > >>> 2296 */ > >>> 2297 if (wbio2 && !wbio2->bi_end_io) > >>> 2298 wbio2 = NULL; > >>> 2299 if (wbio->bi_end_io) { > >>> 2300 atomic_inc(&conf->mirrors[d].rdev->nr_pending); > >>> 2301 md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(wbio)); > >>> 2302 generic_make_request(wbio); > >>> 2303 } > >>> 2304 if (wbio2) { > >> At line 2300, conf->mirrors[d].rdev is NULL, this causes a NULL pointer > >> dereference to conf->mirrors[d].rdev->nr_pending. After reading previous > >> raid10 patches, I find conf->mirrors[d].rdev can be NULL at any point > >> unless, > >> - a counted reference is held > >> - ->reconfig_mutex is held, or > >> - rcu_read_lock() is held > > > > There is one other condition: MD_RECOVERY_RUNNING is set (without > > MD_RECOVERY_DONE). > > mirrors[d].rdev is only set to NULL by ->hot_remove_disk() which is only > > called from remove_and_add_spares(), and that is never called while > > resync/recovery/reshape is happening. > > So there is no need for RCU protection here. > > > > Only ... remove_and_add_spares() *can* sometimes be called during > > resync - > > Commit: 8430e7e0af9a ("md: disconnect device from personality before trying to remove it.") > > added a called to remove_and_add_spares() when "remove" is written to > > the device/state attribute. That was wrong. > > This: > > > > } else if (cmd_match(buf, "remove")) { > > - if (rdev->mddev->pers) { > > + if (rdev->mddev->pers && > > + !test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) { > > clear_bit(Blocked, &rdev->flags); > > remove_and_add_spares(rdev->mddev, rdev); > > > > should fix it. > > Actually, that doesn't even compile :-( > > I think this is a better fix. > Thanks, > NeilBrown > > From: NeilBrown <neilb@suse.com> > Date: Thu, 19 Oct 2017 09:58:19 +1100 > Subject: [PATCH] md: only allow remove_and_add_spares() when no sync_thread running. > > The locking protocols in md assume that a device will > never be removed from an array during resync/recovery/reshape. > When that isn't happening, rcu or reconfig_mutex is needed > to protect an rdev pointer while taking a refcount. When > it is happening, that protection isn't needed. > > Unfortuantely there is a case were remove_and_add_spares() is > called when recovery might be happening: when "remove" is > written to the device/state sysfs attribute. > > This call is an optimization and not essential so it doesn't > matter if it fails. > So change remove_and_add_spares() to abort early if > resync/recovery/reshape is happening. > > As this can result in a NULL dereference, the fix is suitable > for -stable. > > Cc: Tomasz Majchrzak <tomasz.majchrzak@intel.com> > Fixes: 8430e7e0af9a ("md: disconnect device from personality before trying to remove it.") > Cc: stable@ver.kernel.org (v4.8+) > Signed-off-by: NeilBrown <neilb@suse.com> > --- > drivers/md/md.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/drivers/md/md.c b/drivers/md/md.c > index b3192943de7d..01eac0dbca98 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -8563,6 +8563,10 @@ static int remove_and_add_spares(struct mddev *mddev, > int removed = 0; > bool remove_some = false; Sorry, I missed this patch. I'd really appreciate if you can add the locking protocol into comments here, digging changelog is much painful. > + if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) > + /* Mustn't remove devices when resync thread is running */ > + return 0; > + This will bypass hotadd too, is it what we want? Thanks, Shaohua > rdev_for_each(rdev, mddev) { > if ((this == NULL || rdev == this) && > rdev->raid_disk >= 0 && > -- > 2.14.0.rc0.dirty > > ^ permalink raw reply [flat|nested] 8+ messages in thread
[parent not found: <45cfa4a5-fe04-8b02-531d-c642ad6e7b4c@huawei.com>]
* Re: [PATCH] [RFC] md raid10: use rcu to avoid NULL dereference in raid10d() [not found] ` <45cfa4a5-fe04-8b02-531d-c642ad6e7b4c@huawei.com> @ 2018-02-01 12:26 ` yuyufen 2018-02-01 22:22 ` NeilBrown 0 siblings, 1 reply; 8+ messages in thread From: yuyufen @ 2018-02-01 12:26 UTC (permalink / raw) To: neilb, colyli, shli; +Cc: linux-raid hi. > On Thu, Oct 19, 2017 at 10:06:56AM +1100, Neil Brown wrote: >> On Wed, Oct 18 2017, NeilBrown wrote: >> >>> On Wed, Oct 18 2017, Coly Li wrote: >>> >>>> We have a bug report about NULL dereference in md raid10 code. The >>>> operations are, >>>> - Create a raid10 device >>>> - add a spare disk >>>> - disconnect one of the online disks from the raid10 device >>>> - wait to the recovery happens on the spare disk >>>> - remove the spare disk which is recovering >>>> And sometimes a kernel oops of NULL dereference in md raid10 module can >>>> be observed, and crash tool reports the following information: >>>>> (gdb) list *(raid10d+0xad6) >>>>> 0x5de6 is in raid10d (../drivers/md/raid10.c:2300). >>>>> 2295 * the latter is free to free wbio2. >>>>> 2296 */ >>>>> 2297 if (wbio2 && !wbio2->bi_end_io) >>>>> 2298 wbio2 = NULL; >>>>> 2299 if (wbio->bi_end_io) { >>>>> 2300 atomic_inc(&conf->mirrors[d].rdev->nr_pending); >>>>> 2301 md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(wbio)); >>>>> 2302 generic_make_request(wbio); >>>>> 2303 } >>>>> 2304 if (wbio2) { >>>> At line 2300, conf->mirrors[d].rdev is NULL, this causes a NULL pointer >>>> dereference to conf->mirrors[d].rdev->nr_pending. After reading previous >>>> raid10 patches, I find conf->mirrors[d].rdev can be NULL at any point >>>> unless, >>>> - a counted reference is held >>>> - ->reconfig_mutex is held, or >>>> - rcu_read_lock() is held We have the same bug report about NULL dereference in md raid10 code. operations are: - create raid10, including 4 disk and 2 spare disk mdadm -C /dev/md1 -l 10 -n 4 /dev/sda /dev/sdb /dev/sdc /dev/sdd -x 2 /dev/sde /dev/sdf -R - block and offline a disk in raid10 - hot remove a disk by mdadm --manage /dev/md1 -r sdX - add disk mdadm --manage /dev/md1 -r sdX (gdb) l *(0xffffffff815daafd) 0xffffffff815daafd is in rdev_clear_badblocks (drivers/md/md.c:8688). 8683 int is_new) 8684 { 8685 if (is_new) 8686 s += rdev->new_data_offset; 8687 else 8688 s += rdev->data_offset; <==================== 8689 return md_clear_badblocks(&rdev->badblocks, 8690 s, sectors); 8691 } 8692 EXPORT_SYMBOL_GPL(rdev_clear_badblocks); Call Trace and crash tools show it is triggered as following: raid10d() => handle_write_completed() => rdev_clear_badblocks() >>> There is one other condition: MD_RECOVERY_RUNNING is set (without >>> MD_RECOVERY_DONE). >>> mirrors[d].rdev is only set to NULL by ->hot_remove_disk() which is only >>> called from remove_and_add_spares(), and that is never called while >>> resync/recovery/reshape is happening. >>> So there is no need for RCU protection here. >>> >>> Only ... remove_and_add_spares() *can* sometimes be called during >>> resync - >>> Commit: 8430e7e0af9a ("md: disconnect device from personality before trying to remove it.") >>> added a called to remove_and_add_spares() when "remove" is written to >>> the device/state attribute. That was wrong. >>> This: >>> >>> } else if (cmd_match(buf, "remove")) { >>> - if (rdev->mddev->pers) { >>> + if (rdev->mddev->pers && >>> + !test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) { >>> clear_bit(Blocked, &rdev->flags); >>> remove_and_add_spares(rdev->mddev, rdev); >>> >>> should fix it. >> Actually, that doesn't even compile :-( >> >> I think this is a better fix. >> Thanks, >> NeilBrown >> >> From: NeilBrown <neilb@suse.com> >> Date: Thu, 19 Oct 2017 09:58:19 +1100 >> Subject: [PATCH] md: only allow remove_and_add_spares() when no sync_thread running. >> >> The locking protocols in md assume that a device will >> never be removed from an array during resync/recovery/reshape. >> When that isn't happening, rcu or reconfig_mutex is needed >> to protect an rdev pointer while taking a refcount. When >> it is happening, that protection isn't needed. >> >> Unfortuantely there is a case were remove_and_add_spares() is >> called when recovery might be happening: when "remove" is >> written to the device/state sysfs attribute. >> >> This call is an optimization and not essential so it doesn't >> matter if it fails. >> So change remove_and_add_spares() to abort early if >> resync/recovery/reshape is happening. >> >> As this can result in a NULL dereference, the fix is suitable >> for -stable. >> >> Cc: Tomasz Majchrzak <tomasz.majchrzak@intel.com> >> Fixes: 8430e7e0af9a ("md: disconnect device from personality before trying to remove it.") >> Cc: stable@ver.kernel.org (v4.8+) >> Signed-off-by: NeilBrown <neilb@suse.com> >> --- >> drivers/md/md.c | 4 ++++ >> 1 file changed, 4 insertions(+) >> >> diff --git a/drivers/md/md.c b/drivers/md/md.c >> index b3192943de7d..01eac0dbca98 100644 >> --- a/drivers/md/md.c >> +++ b/drivers/md/md.c >> @@ -8563,6 +8563,10 @@ static int remove_and_add_spares(struct mddev *mddev, >> int removed = 0; >> bool remove_some = false; > Sorry, I missed this patch. > > I'd really appreciate if you can add the locking protocol into comments here, > digging changelog is much painful. > > + if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) >> + /* Mustn't remove devices when resync thread is running */ >> + return 0; >> + > This will bypass hotadd too, is it what we want? After applying this patch, there is no more Oops, but the number of disks will become less and less, which is not expected. I think it's caused by bypassing of hotadd. Can we just add spare disk without removing disk? Any suggestions for this fix or any progress on this patch? Thanks a lot, Yufen Yu > > Thanks, > Shaohua >> rdev_for_each(rdev, mddev) { >> if ((this == NULL || rdev == this) && >> rdev->raid_disk >= 0 && >> -- >> 2.14.0.rc0.dirty >> >> > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > . > > > > . > . ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] [RFC] md raid10: use rcu to avoid NULL dereference in raid10d() 2018-02-01 12:26 ` yuyufen @ 2018-02-01 22:22 ` NeilBrown 2018-02-02 13:49 ` yuyufen 0 siblings, 1 reply; 8+ messages in thread From: NeilBrown @ 2018-02-01 22:22 UTC (permalink / raw) To: yuyufen, colyli, shli; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 6810 bytes --] On Thu, Feb 01 2018, yuyufen wrote: > hi. > >> On Thu, Oct 19, 2017 at 10:06:56AM +1100, Neil Brown wrote: >>> On Wed, Oct 18 2017, NeilBrown wrote: >>> >>>> On Wed, Oct 18 2017, Coly Li wrote: >>>> >>>>> We have a bug report about NULL dereference in md raid10 code. The >>>>> operations are, >>>>> - Create a raid10 device >>>>> - add a spare disk >>>>> - disconnect one of the online disks from the raid10 device >>>>> - wait to the recovery happens on the spare disk >>>>> - remove the spare disk which is recovering >>>>> And sometimes a kernel oops of NULL dereference in md raid10 module can >>>>> be observed, and crash tool reports the following information: >>>>>> (gdb) list *(raid10d+0xad6) >>>>>> 0x5de6 is in raid10d (../drivers/md/raid10.c:2300). >>>>>> 2295 * the latter is free to free wbio2. >>>>>> 2296 */ >>>>>> 2297 if (wbio2 && !wbio2->bi_end_io) >>>>>> 2298 wbio2 = NULL; >>>>>> 2299 if (wbio->bi_end_io) { >>>>>> 2300 atomic_inc(&conf->mirrors[d].rdev->nr_pending); >>>>>> 2301 md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(wbio)); >>>>>> 2302 generic_make_request(wbio); >>>>>> 2303 } >>>>>> 2304 if (wbio2) { >>>>> At line 2300, conf->mirrors[d].rdev is NULL, this causes a NULL pointer >>>>> dereference to conf->mirrors[d].rdev->nr_pending. After reading previous >>>>> raid10 patches, I find conf->mirrors[d].rdev can be NULL at any point >>>>> unless, >>>>> - a counted reference is held >>>>> - ->reconfig_mutex is held, or >>>>> - rcu_read_lock() is held > > We have the same bug report about NULL dereference in md raid10 code. > operations are: > > - create raid10, including 4 disk and 2 spare disk > mdadm -C /dev/md1 -l 10 -n 4 /dev/sda /dev/sdb /dev/sdc > /dev/sdd -x 2 /dev/sde /dev/sdf -R > - block and offline a disk in raid10 > - hot remove a disk by > mdadm --manage /dev/md1 -r sdX > - add disk > mdadm --manage /dev/md1 -r sdX > > (gdb) l *(0xffffffff815daafd) > 0xffffffff815daafd is in rdev_clear_badblocks (drivers/md/md.c:8688). > 8683 int is_new) > 8684 { > 8685 if (is_new) > 8686 s += rdev->new_data_offset; > 8687 else > 8688 s += rdev->data_offset; <==================== > 8689 return md_clear_badblocks(&rdev->badblocks, > 8690 s, sectors); > 8691 } > 8692 EXPORT_SYMBOL_GPL(rdev_clear_badblocks); > > Call Trace and crash tools show it is triggered as following: > raid10d() => handle_write_completed() => rdev_clear_badblocks() > >>>> There is one other condition: MD_RECOVERY_RUNNING is set (without >>>> MD_RECOVERY_DONE). >>>> mirrors[d].rdev is only set to NULL by ->hot_remove_disk() which is only >>>> called from remove_and_add_spares(), and that is never called while >>>> resync/recovery/reshape is happening. >>>> So there is no need for RCU protection here. >>>> >>>> Only ... remove_and_add_spares() *can* sometimes be called during >>>> resync - >>>> Commit: 8430e7e0af9a ("md: disconnect device from personality before trying to remove it.") >>>> added a called to remove_and_add_spares() when "remove" is written to >>>> the device/state attribute. That was wrong. >>>> This: >>>> >>>> } else if (cmd_match(buf, "remove")) { >>>> - if (rdev->mddev->pers) { >>>> + if (rdev->mddev->pers && >>>> + !test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) { >>>> clear_bit(Blocked, &rdev->flags); >>>> remove_and_add_spares(rdev->mddev, rdev); >>>> >>>> should fix it. >>> Actually, that doesn't even compile :-( >>> >>> I think this is a better fix. >>> Thanks, >>> NeilBrown >>> >>> From: NeilBrown <neilb@suse.com> >>> Date: Thu, 19 Oct 2017 09:58:19 +1100 >>> Subject: [PATCH] md: only allow remove_and_add_spares() when no sync_thread running. >>> >>> The locking protocols in md assume that a device will >>> never be removed from an array during resync/recovery/reshape. >>> When that isn't happening, rcu or reconfig_mutex is needed >>> to protect an rdev pointer while taking a refcount. When >>> it is happening, that protection isn't needed. >>> >>> Unfortuantely there is a case were remove_and_add_spares() is >>> called when recovery might be happening: when "remove" is >>> written to the device/state sysfs attribute. >>> >>> This call is an optimization and not essential so it doesn't >>> matter if it fails. >>> So change remove_and_add_spares() to abort early if >>> resync/recovery/reshape is happening. >>> >>> As this can result in a NULL dereference, the fix is suitable >>> for -stable. >>> >>> Cc: Tomasz Majchrzak <tomasz.majchrzak@intel.com> >>> Fixes: 8430e7e0af9a ("md: disconnect device from personality before trying to remove it.") >>> Cc: stable@ver.kernel.org (v4.8+) >>> Signed-off-by: NeilBrown <neilb@suse.com> >>> --- >>> drivers/md/md.c | 4 ++++ >>> 1 file changed, 4 insertions(+) >>> >>> diff --git a/drivers/md/md.c b/drivers/md/md.c >>> index b3192943de7d..01eac0dbca98 100644 >>> --- a/drivers/md/md.c >>> +++ b/drivers/md/md.c >>> @@ -8563,6 +8563,10 @@ static int remove_and_add_spares(struct mddev *mddev, >>> int removed = 0; >>> bool remove_some = false; >> Sorry, I missed this patch. >> >> I'd really appreciate if you can add the locking protocol into comments here, >> digging changelog is much painful. >> > + if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) >>> + /* Mustn't remove devices when resync thread is running */ >>> + return 0; >>> + >> This will bypass hotadd too, is it what we want? > > After applying this patch, there is no more Oops, but the number of > disks will become less and less, which is not expected. Please explain what you mean by "the number of disks will become less and less"?? What is the sequence of events? > I think it's caused by bypassing of hotadd. Can we just add spare disk > without removing disk? > > Any suggestions for this fix or any progress on this patch? I had forgotten about this patch and will resend. However I think the only change needs is adding some comments. I still think the code is correct. Thanks, NeilBrown > > Thanks a lot, > Yufen Yu > >> >> Thanks, >> Shaohua >>> rdev_for_each(rdev, mddev) { >>> if ((this == NULL || rdev == this) && >>> rdev->raid_disk >= 0 && >>> -- >>> 2.14.0.rc0.dirty >>> >>> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> . >> >> >> >> . >> > . [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] [RFC] md raid10: use rcu to avoid NULL dereference in raid10d() 2018-02-01 22:22 ` NeilBrown @ 2018-02-02 13:49 ` yuyufen 2018-02-02 22:14 ` NeilBrown 0 siblings, 1 reply; 8+ messages in thread From: yuyufen @ 2018-02-02 13:49 UTC (permalink / raw) To: NeilBrown, colyli, shli; +Cc: linux-raid On 2018/2/2 6:22, NeilBrown wrote: > On Thu, Feb 01 2018, yuyufen wrote: > >> hi. >> >>> On Thu, Oct 19, 2017 at 10:06:56AM +1100, Neil Brown wrote: >>>> On Wed, Oct 18 2017, NeilBrown wrote: >>>> >>>>> On Wed, Oct 18 2017, Coly Li wrote: >>>>> >>>>>> We have a bug report about NULL dereference in md raid10 code. The >>>>>> operations are, >>>>>> - Create a raid10 device >>>>>> - add a spare disk >>>>>> - disconnect one of the online disks from the raid10 device >>>>>> - wait to the recovery happens on the spare disk >>>>>> - remove the spare disk which is recovering >>>>>> And sometimes a kernel oops of NULL dereference in md raid10 module can >>>>>> be observed, and crash tool reports the following information: >>>>>>> (gdb) list *(raid10d+0xad6) >>>>>>> 0x5de6 is in raid10d (../drivers/md/raid10.c:2300). >>>>>>> 2295 * the latter is free to free wbio2. >>>>>>> 2296 */ >>>>>>> 2297 if (wbio2 && !wbio2->bi_end_io) >>>>>>> 2298 wbio2 = NULL; >>>>>>> 2299 if (wbio->bi_end_io) { >>>>>>> 2300 atomic_inc(&conf->mirrors[d].rdev->nr_pending); >>>>>>> 2301 md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(wbio)); >>>>>>> 2302 generic_make_request(wbio); >>>>>>> 2303 } >>>>>>> 2304 if (wbio2) { >>>>>> At line 2300, conf->mirrors[d].rdev is NULL, this causes a NULL pointer >>>>>> dereference to conf->mirrors[d].rdev->nr_pending. After reading previous >>>>>> raid10 patches, I find conf->mirrors[d].rdev can be NULL at any point >>>>>> unless, >>>>>> - a counted reference is held >>>>>> - ->reconfig_mutex is held, or >>>>>> - rcu_read_lock() is held >> We have the same bug report about NULL dereference in md raid10 code. >> operations are: >> >> - create raid10, including 4 disk and 2 spare disk >> mdadm -C /dev/md1 -l 10 -n 4 /dev/sda /dev/sdb /dev/sdc >> /dev/sdd -x 2 /dev/sde /dev/sdf -R >> - block and offline a disk in raid10 >> - hot remove a disk by >> mdadm --manage /dev/md1 -r sdX >> - add disk >> mdadm --manage /dev/md1 -r sdX >> >> (gdb) l *(0xffffffff815daafd) >> 0xffffffff815daafd is in rdev_clear_badblocks (drivers/md/md.c:8688). >> 8683 int is_new) >> 8684 { >> 8685 if (is_new) >> 8686 s += rdev->new_data_offset; >> 8687 else >> 8688 s += rdev->data_offset; <==================== >> 8689 return md_clear_badblocks(&rdev->badblocks, >> 8690 s, sectors); >> 8691 } >> 8692 EXPORT_SYMBOL_GPL(rdev_clear_badblocks); >> >> Call Trace and crash tools show it is triggered as following: >> raid10d() => handle_write_completed() => rdev_clear_badblocks() >> >>>>> There is one other condition: MD_RECOVERY_RUNNING is set (without >>>>> MD_RECOVERY_DONE). >>>>> mirrors[d].rdev is only set to NULL by ->hot_remove_disk() which is only >>>>> called from remove_and_add_spares(), and that is never called while >>>>> resync/recovery/reshape is happening. >>>>> So there is no need for RCU protection here. >>>>> >>>>> Only ... remove_and_add_spares() *can* sometimes be called during >>>>> resync - >>>>> Commit: 8430e7e0af9a ("md: disconnect device from personality before trying to remove it.") >>>>> added a called to remove_and_add_spares() when "remove" is written to >>>>> the device/state attribute. That was wrong. >>>>> This: >>>>> >>>>> } else if (cmd_match(buf, "remove")) { >>>>> - if (rdev->mddev->pers) { >>>>> + if (rdev->mddev->pers && >>>>> + !test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) { >>>>> clear_bit(Blocked, &rdev->flags); >>>>> remove_and_add_spares(rdev->mddev, rdev); >>>>> >>>>> should fix it. >>>> Actually, that doesn't even compile :-( >>>> >>>> I think this is a better fix. >>>> Thanks, >>>> NeilBrown >>>> >>>> From: NeilBrown <neilb@suse.com> >>>> Date: Thu, 19 Oct 2017 09:58:19 +1100 >>>> Subject: [PATCH] md: only allow remove_and_add_spares() when no sync_thread running. >>>> >>>> The locking protocols in md assume that a device will >>>> never be removed from an array during resync/recovery/reshape. >>>> When that isn't happening, rcu or reconfig_mutex is needed >>>> to protect an rdev pointer while taking a refcount. When >>>> it is happening, that protection isn't needed. >>>> >>>> Unfortuantely there is a case were remove_and_add_spares() is >>>> called when recovery might be happening: when "remove" is >>>> written to the device/state sysfs attribute. >>>> >>>> This call is an optimization and not essential so it doesn't >>>> matter if it fails. >>>> So change remove_and_add_spares() to abort early if >>>> resync/recovery/reshape is happening. >>>> >>>> As this can result in a NULL dereference, the fix is suitable >>>> for -stable. >>>> >>>> Cc: Tomasz Majchrzak <tomasz.majchrzak@intel.com> >>>> Fixes: 8430e7e0af9a ("md: disconnect device from personality before trying to remove it.") >>>> Cc: stable@ver.kernel.org (v4.8+) >>>> Signed-off-by: NeilBrown <neilb@suse.com> >>>> --- >>>> drivers/md/md.c | 4 ++++ >>>> 1 file changed, 4 insertions(+) >>>> >>>> diff --git a/drivers/md/md.c b/drivers/md/md.c >>>> index b3192943de7d..01eac0dbca98 100644 >>>> --- a/drivers/md/md.c >>>> +++ b/drivers/md/md.c >>>> @@ -8563,6 +8563,10 @@ static int remove_and_add_spares(struct mddev *mddev, >>>> int removed = 0; >>>> bool remove_some = false; >>> Sorry, I missed this patch. >>> >>> I'd really appreciate if you can add the locking protocol into comments here, >>> digging changelog is much painful. >>> > + if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) >>>> + /* Mustn't remove devices when resync thread is running */ >>>> + return 0; >>>> + >>> This will bypass hotadd too, is it what we want? >> After applying this patch, there is no more Oops, but the number of >> disks will become less and less, which is not expected. > Please explain what you mean by "the number of disks will become less > and less"?? What is the sequence of events? > > >> I think it's caused by bypassing of hotadd. Can we just add spare disk >> without removing disk? >> >> Any suggestions for this fix or any progress on this patch? > I had forgotten about this patch and will resend. However I think the > only change needs is adding some comments. I still think the code is > correct. (1) When raid10 create, there are 4 disks in raid and 2 spare disks (sda5, sdd5) [ 100.473118] RAID10 conf printout: [ 100.473121] --- wd:3 rd:4 [ 100.473123] disk 0, wo:0, o:1, dev:sdb5 [ 100.473124] disk 1, wo:1, o:0, dev:sdc5 [ 100.473126] disk 2, wo:0, o:1, dev:sde5 [ 100.473127] disk 3, wo:0, o:1, dev:sdf5 (2) mdadm: hot removed /dev/sdc5 from /dev/md5 [ 100.473118] RAID10 conf printout: [ 110.390027] --- wd:3 rd:4 [ 110.390028] disk 0, wo:0, o:1, dev:sdb5 [ 110.390029] disk 2, wo:0, o:1, dev:sde5 [ 110.390030] disk 3, wo:0, o:1, dev:sdf5 However, as there are 2 spare disks, we expect raid10 can hot add a disk. (3) mdadm: added /dev/sdc5 mdadm: hot removed /dev/sdc5 from /dev/md5 [ 148.680032] RAID10 conf printout: [ 148.680035] --- wd:2 rd:4 [ 148.680037] disk 0, wo:0, o:1, dev:sdb5 [ 148.680038] disk 2, wo:0, o:1, dev:sde5 Now, there are 3 spare disks (sda5, sdc5, sdd5) Also, we expect hot add 2 disk, not just 2 disk in raid10. I try to explain, but not sure: The only place of the runtime to invoke mddev->pers->hot_add_disk(mddev, rdev) is md_check_recovery(). But, it will set MD_RECOVERY_RUNNING before invoking remove_and_add_spares(): 8702 void md_check_recovery(struct mddev *mddev) 8703 { …… 8800 mddev->curr_resync_completed = 0; 8801 spin_lock(&mddev->lock); 8802 set_bit(MD_RECOVERY_RUNNING, &mddev->recovery); <======================= 8803 spin_unlock(&mddev->lock); 8804 /* Clear some bits that don't mean anything, but 8805 * might be left set 8806 */ 8807 clear_bit(MD_RECOVERY_INTR, &mddev->recovery); 8808 clear_bit(MD_RECOVERY_DONE, &mddev->recovery); 8809 8810 if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) || 8811 test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) 8812 goto not_running; 8813 /* no recovery is running. 8814 * remove any failed drives, then 8815 * add spares if possible. 8816 * Spares are also removed and re-added, to allow 8817 * the personality to fail the re-add. 8818 */ 8819 8820 if (mddev->reshape_position != MaxSector) { 8821 if (mddev->pers->check_reshape == NULL || 8822 mddev->pers->check_reshape(mddev) != 0) 8823 /* Cannot proceed */ 8824 goto not_running; 8825 set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery); 8826 clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery); 8827 } else if ((spares = remove_and_add_spares(mddev, NULL))) { <======================= 8828 clear_bit(MD_RECOVERY_SYNC, &mddev->recovery); …… 8874 } 8875 EXPORT_SYMBOL(md_check_recovery); That means, test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) always be true. So it will bypass hot add. + if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) + /* Mustn't remove devices when resync thread is running */ + return 0; What's more, I found there some D status fio processes (I use fio to test). They seems to waiting for IO complete. But I can't explain reason. [ 463.085649] sysrq: SysRq : Show Blocked State [ 463.087764] task PC stack pid father [ 463.090151] fio D ffff88012e45ba68 0 1420 1398 0x00000000 [ 463.095138] ffff88012e45b9e8 ffff8800b5893ba0 ffff8800a55bf200 0000000000015c40 [ 463.098142] ffffffff810d8698 ffff8800b58940f0 225383e5f11e4e2e ffff88012e45b9d8 [ 463.102101] ffff88012e45c000 ffff88013fcd5c40 7fffffffffffffff ffff88012e45b9f8 [ 463.105651] Call Trace: [ 463.107430] [<ffffffff810d8698>] ? __wake_up+0x48/0x60 [ 463.110126] [<ffffffff81745567>] schedule+0x37/0x90 [ 463.112817] [<ffffffff81747dd9>] schedule_timeout+0x1a9/0x200 [ 463.116131] [<ffffffff81745567>] ? schedule+0x37/0x90 [ 463.119738] [<ffffffff81747dd9>] ? schedule_timeout+0x1a9/0x200 [ 463.123879] [<ffffffff81108470>] ? ktime_get+0x40/0xb0 [ 463.126737] [<ffffffff81248896>] ? __blockdev_direct_IO+0xf46/0x1720 [ 463.130833] [<ffffffff81353b11>] ? bio_put+0x51/0x90 [ 463.134180] [<ffffffff817449a4>] ? io_schedule_timeout+0xa4/0x110 [ 463.138207] [<ffffffff81248896>] ? __blockdev_direct_IO+0xf46/0x1720 [ 463.142740] [<ffffffff81243340>] ? I_BDEV+0x10/0x10 [ 463.146066] [<ffffffff81243c7c>] ? blkdev_direct_IO+0x4c/0x50 [ 463.149205] [<ffffffff811942f0>] ? generic_file_read_iter+0x540/0x6f0 [ 463.154051] [<ffffffff8135dd48>] ? blk_finish_plug+0x18/0x50 [ 463.156457] [<ffffffff81244bc5>] ? blkdev_read_iter+0x35/0x50 [ 463.159293] [<ffffffff81207e0d>] ? __vfs_read+0xcd/0x110 [ 463.162703] [<ffffffff812086bc>] ? vfs_write+0x9c/0x1b0 [ 463.166147] [<ffffffff8120855a>] ? vfs_read+0x7a/0x140 [ 463.168477] [<ffffffff8120924a>] ? SyS_pread64+0x9a/0xc0 [ 463.171970] [<ffffffff811058d1>] ? SyS_clock_gettime+0x81/0xd0 [ 463.175796] [<ffffffff81748c57>] ? system_call_fastpath+0x12/0x6a Thanks, Yufen Yu > Thanks, > NeilBrown > >> Thanks a lot, >> Yufen Yu >> >>> Thanks, >>> Shaohua >>>> rdev_for_each(rdev, mddev) { >>>> if ((this == NULL || rdev == this) && >>>> rdev->raid_disk >= 0 && >>>> -- >>>> 2.14.0.rc0.dirty >>>> >>>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >>> . >>> >>> >>> >>> . >>> >> . ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] [RFC] md raid10: use rcu to avoid NULL dereference in raid10d() 2018-02-02 13:49 ` yuyufen @ 2018-02-02 22:14 ` NeilBrown 0 siblings, 0 replies; 8+ messages in thread From: NeilBrown @ 2018-02-02 22:14 UTC (permalink / raw) To: yuyufen, colyli, shli; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 1701 bytes --] On Fri, Feb 02 2018, yuyufen@huawei.com wrote: >> I had forgotten about this patch and will resend. However I think the >> only change needs is adding some comments. I still think the code is >> correct. > (1) When raid10 create, there are 4 disks in raid and 2 spare disks > (sda5, sdd5) > > [ 100.473118] RAID10 conf printout: > [ 100.473121] --- wd:3 rd:4 > [ 100.473123] disk 0, wo:0, o:1, dev:sdb5 > [ 100.473124] disk 1, wo:1, o:0, dev:sdc5 > [ 100.473126] disk 2, wo:0, o:1, dev:sde5 > [ 100.473127] disk 3, wo:0, o:1, dev:sdf5 > > (2) mdadm: hot removed /dev/sdc5 from /dev/md5 > [ 100.473118] RAID10 conf printout: > [ 110.390027] --- wd:3 rd:4 > [ 110.390028] disk 0, wo:0, o:1, dev:sdb5 > [ 110.390029] disk 2, wo:0, o:1, dev:sde5 > [ 110.390030] disk 3, wo:0, o:1, dev:sdf5 > > However, as there are 2 spare disks, we expect raid10 can hot add a disk. > > (3) > mdadm: added /dev/sdc5 > mdadm: hot removed /dev/sdc5 from /dev/md5 > [ 148.680032] RAID10 conf printout: > [ 148.680035] --- wd:2 rd:4 > [ 148.680037] disk 0, wo:0, o:1, dev:sdb5 > [ 148.680038] disk 2, wo:0, o:1, dev:sde5 > > Now, there are 3 spare disks (sda5, sdc5, sdd5) > Also, we expect hot add 2 disk, not just 2 disk in raid10. > > I try to explain, but not sure: > The only place of the runtime to invoke mddev->pers->hot_add_disk(mddev, > rdev) is md_check_recovery(). > But, it will set MD_RECOVERY_RUNNING before invoking > remove_and_add_spares(): This is the piece I was missing. Thanks for pointing that out, definitely a bug in my patch. I'll send a new patch which tests 'this' as well as the flag. Thanks a lot, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2018-02-02 22:14 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2017-10-17 16:36 [PATCH] [RFC] md raid10: use rcu to avoid NULL dereference in raid10d() Coly Li 2017-10-17 20:03 ` NeilBrown 2017-10-18 23:06 ` NeilBrown 2017-11-17 20:03 ` Shaohua Li [not found] ` <45cfa4a5-fe04-8b02-531d-c642ad6e7b4c@huawei.com> 2018-02-01 12:26 ` yuyufen 2018-02-01 22:22 ` NeilBrown 2018-02-02 13:49 ` yuyufen 2018-02-02 22:14 ` NeilBrown
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).