* [PATCH] md linear: fix a race between linear_add() and linear_congested()
From: colyli @ 2017-01-25 11:15 UTC
To: linux-raid; +Cc: Coly Li, Shaohua Li, Neil Brown, stable

Recently I received a report that on a Linux v3.0 based kernel, hot-adding
a disk to an md linear device causes a kernel crash in linear_congested().
From the crash image analysis, I find that in linear_congested(),
mddev->raid_disks contains the value N, but conf->disks[] only has N-1
entries available. A dereference of a NULL pointer then crashes the
kernel.

There is a race between linear_add() and linear_congested(); the RCU
constructs used in these two functions cannot avoid the race. Since Linux
v4.0 the RCU code has been replaced by the introduction of
mddev_suspend(). After checking the upstream code, it seems
linear_congested() is not called in the generic_make_request() code path,
so mddev_suspend() cannot prevent it from being called. The race still
exists.

Here I explain how the race still exists in the current code. On a
machine with many CPUs, linear_add() is called on one CPU to add a hard
disk to an md linear device; at the same time, linear_congested() is
called on another CPU to detect whether this md linear device is
congested before issuing an I/O request to it.

A possible code execution time sequence demonstrates how the race
happens,

 seq   linear_add()              linear_congested()
  0                              conf=mddev->private
  1    oldconf=mddev->private
  2    mddev->raid_disks++
  3                              for (i=0; i<mddev->raid_disks;i++)
  4                              bdev_get_queue(conf->disks[i].rdev->bdev)
  5    mddev->private=newconf

In linear_add(), mddev->raid_disks is increased at time seq 2, and on
another CPU the for-loop in linear_congested() iterates over
conf->disks[i] using the increased mddev->raid_disks at time seq 3,4. But
conf, which has one more struct dev_info element in conf->disks[], is not
published yet; accessing its structure member at time seq 4 causes a NULL
pointer dereference fault.

The fix is to update mddev->private with the new value before increasing
mddev->raid_disks, and to make sure other CPUs observe the two updates in
the same order as linear_add() performs them (otherwise the race may
still happen); a smp_mb() between them is necessary for this.

One remaining question: with this fix, if mddev->private is updated to
the new value in linear_add() but the for-loop in linear_congested()
still reads the old value of mddev->raid_disks, the iteration will miss
the last element of conf->disks[]. This is acceptable, for the following
reasons,
- When mddev->private is updated, the md linear device is suspended and
  no I/O may happen, so it is safe to miss the congestion status of the
  last newly added hard disk.
- In the worst case, linear_congested() returns 0 and I/O is sent to this
  md linear device while the newly added hard disk is congested; the I/O
  request will then be blocked for a while if it happens to hit the newly
  added hard disk. linear_congested() is in the code path of
  wb_congested(), which is quite hot in the writeback code path. Compared
  to adding locking code in linear_congested(), the cost of this worst
  case is acceptable.

The bug was reported on a Linux v3.0 based kernel; the fix can and should
be applied to all kernels since Linux v3.0. linear_add() was merged into
mainline in Linux v2.6.18, so stable kernel maintainers of versions after
that one may consider picking this fix as well.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Neil Brown <neilb@suse.com>
Cc: stable@vger.kernel.org
---
 drivers/md/linear.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 5975c99..48ccfad 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -196,10 +196,22 @@ static int linear_add(struct mddev *mddev, struct md_rdev *rdev)
 	if (!newconf)
 		return -ENOMEM;
 
+	/* In linear_congested(), mddev->raid_disks and mddev->private
+	 * are accessed without protection by mddev_suspend(). If on
+	 * another CPU linear_congested() still sees the old value of
+	 * mddev->private but the increased value of mddev->raid_disks,
+	 * the last iteration over conf->disks[i].rdev will trigger a
+	 * NULL pointer dereference. To avoid this race,
+	 * mddev->private must be updated before increasing
+	 * mddev->raid_disks, and a smp_mb() is required between them.
+	 * Then in linear_congested() we are sure the updated
+	 * mddev->private is seen when iterating over conf->disks[i].
+	 */
 	mddev_suspend(mddev);
 	oldconf = mddev->private;
-	mddev->raid_disks++;
 	mddev->private = newconf;
+	smp_mb();
+	mddev->raid_disks++;
 	md_set_array_sectors(mddev, linear_size(mddev, 0, 0));
 	set_capacity(mddev->gendisk, mddev->array_sectors);
 	mddev_resume(mddev);
-- 
2.6.6
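For context, the reader side being raced against looked roughly like the
sketch below at the time. This is a paraphrase rather than an exact copy
of any particular tree; in particular, the backing_dev_info access
details changed across kernel versions around this era.

	static int linear_congested(struct mddev *mddev, int bits)
	{
		struct linear_conf *conf;
		int i, ret = 0;

		conf = mddev->private;	/* may still be the old conf ... */

		/* ... while mddev->raid_disks may already be the new,
		 * larger count: the final iteration then indexes past
		 * the end of the old conf->disks[] and dereferences a
		 * NULL rdev.
		 */
		for (i = 0; i < mddev->raid_disks && !ret; i++) {
			struct request_queue *q =
				bdev_get_queue(conf->disks[i].rdev->bdev);
			ret |= bdi_congested(&q->backing_dev_info, bits);
		}
		return ret;
	}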
* Re: [PATCH] md linear: fix a race between linear_add() and linear_congested()
From: Shaohua Li @ 2017-01-25 18:02 UTC
To: colyli; +Cc: linux-raid, Shaohua Li, Neil Brown, stable

On Wed, Jan 25, 2017 at 07:15:43PM +0800, colyli@suse.de wrote:
> [...]
>
> A possible code execution time sequence demonstrates how the race
> happens,
>
>  seq   linear_add()              linear_congested()
>   0                              conf=mddev->private
>   1    oldconf=mddev->private
>   2    mddev->raid_disks++
>   3                              for (i=0; i<mddev->raid_disks;i++)
>   4                              bdev_get_queue(conf->disks[i].rdev->bdev)
>   5    mddev->private=newconf

Good catch, this makes a lot of sense. However, this looks like an
incomplete fix. Step 0 will get the old conf, and after step 5
linear_add() will free the old conf. So it's possible linear_congested()
will use the freed old conf; I think this is even more likely to happen.
The easiest fix may be to take the RCU read lock in linear_congested()
and free the old conf in an RCU callback.

Thanks,
Shaohua
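What Shaohua describes is the standard RCU publish/reclaim pattern. A
sketch of how it could look here, assuming struct linear_conf regains a
struct rcu_head member (the free_conf() callback name matches the v2
patch quoted later in this thread; the bdi access details are paraphrased
as above):

	static void free_conf(struct rcu_head *head)
	{
		struct linear_conf *conf =
			container_of(head, struct linear_conf, rcu);

		kfree(conf);
	}

	/* Reader: pin conf for the whole loop so the writer cannot
	 * free it underneath us.
	 */
	static int linear_congested(struct mddev *mddev, int bits)
	{
		struct linear_conf *conf;
		int i, ret = 0;

		rcu_read_lock();
		conf = rcu_dereference(mddev->private);
		for (i = 0; i < mddev->raid_disks && !ret; i++) {
			struct request_queue *q =
				bdev_get_queue(conf->disks[i].rdev->bdev);
			ret |= bdi_congested(&q->backing_dev_info, bits);
		}
		rcu_read_unlock();
		return ret;
	}

	/* Writer (in linear_add()): publish the new conf, then defer
	 * freeing the old one until every pre-existing reader has left
	 * its RCU read-side critical section.
	 */
	rcu_assign_pointer(mddev->private, newconf);
	...
	call_rcu(&oldconf->rcu, free_conf);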
* Re: [PATCH] md linear: fix a race between linear_add() and linear_congested()
From: NeilBrown @ 2017-01-26 0:04 UTC
To: Shaohua Li, colyli; +Cc: linux-raid, Shaohua Li, stable

On Wed, Jan 25 2017, Shaohua Li wrote:
> On Wed, Jan 25, 2017 at 07:15:43PM +0800, colyli@suse.de wrote:
>> [...]
>>
>>  seq   linear_add()              linear_congested()
>>   0                              conf=mddev->private
>>   1    oldconf=mddev->private
>>   2    mddev->raid_disks++
>>   3                              for (i=0; i<mddev->raid_disks;i++)
>>   4                              bdev_get_queue(conf->disks[i].rdev->bdev)
>>   5    mddev->private=newconf
>
> Good catch, this makes a lot of sense. However, this looks like an
> incomplete fix. Step 0 will get the old conf, and after step 5
> linear_add() will free the old conf. So it's possible
> linear_congested() will use the freed old conf; I think this is even
> more likely to happen. The easiest fix may be to take the RCU read
> lock in linear_congested() and free the old conf in an RCU callback.

We used to use kfree_rcu() but removed it in

Commit: 3be260cc18f8 ("md/linear: remove rcu protections in favour
of suspend/resume")

when we changed to suspend/resume the device. That stops all IO, but
doesn't stop the ->congested call.

So we probably should re-introduce kfree_rcu() to free oldconf.

It might also be good to store a copy of raid_disks in linear_conf, like
we do with r5conf, to ensure we never use inconsistent ->raid_disks and
->disks[].

Thanks,
NeilBrown
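Storing the disk count inside the conf, as Neil suggests, also sidesteps
the barrier question entirely: the count and the array then travel inside
the same RCU-published object, so a reader that obtains conf via
rcu_dereference() can never observe them out of sync. A sketch of the
idea follows; the exact field layout here is an assumption, not the final
upstream change:

	struct linear_conf {
		struct rcu_head		rcu;
		sector_t		array_sectors;
		int			raid_disks; /* copy of
						     * mddev->raid_disks,
						     * consistent with
						     * disks[] below */
		struct dev_info		disks[0];
	};

	static int linear_congested(struct mddev *mddev, int bits)
	{
		struct linear_conf *conf;
		int i, ret = 0;

		rcu_read_lock();
		conf = rcu_dereference(mddev->private);

		/* No explicit read barrier needed: the loop bound is
		 * loaded through the same pointer as the array, a data
		 * dependency that rcu_dereference() already orders.
		 */
		for (i = 0; i < conf->raid_disks && !ret; i++) {
			struct request_queue *q =
				bdev_get_queue(conf->disks[i].rdev->bdev);
			ret |= bdi_congested(&q->backing_dev_info, bits);
		}

		rcu_read_unlock();
		return ret;
	}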
* Re: [PATCH] md linear: fix a race between linear_add() and linear_congested()
From: Coly Li @ 2017-01-27 17:45 UTC
To: NeilBrown; +Cc: Shaohua Li, linux-raid, Shaohua Li, stable

On 2017/1/26 8:04 AM, NeilBrown wrote:
> [...]
>
> We used to use kfree_rcu() but removed it in
>
> Commit: 3be260cc18f8 ("md/linear: remove rcu protections in favour
> of suspend/resume")
>
> when we changed to suspend/resume the device. That stops all IO, but
> doesn't stop the ->congested call.
>
> So we probably should re-introduce kfree_rcu() to free oldconf.
>
> It might also be good to store a copy of raid_disks in linear_conf,
> like we do with r5conf, to ensure we never use inconsistent
> ->raid_disks and ->disks[].

Hi Neil,

I just sent out the v2 patch, which adds the RCU stuff back. I tested it
on my small server, and it survives.

One thing I want to confirm here is the memory barrier in linear_add():

219         mddev_suspend(mddev);
220         oldconf = rcu_dereference(mddev->private);
221         rcu_assign_pointer(mddev->private, newconf);
222         smp_mb();
223         mddev->raid_disks++;
224         md_set_array_sectors(mddev, linear_size(mddev, 0, 0));
225         set_capacity(mddev->gendisk, mddev->array_sectors);
226         mddev_resume(mddev);
227         revalidate_disk(mddev->gendisk);
228         call_rcu(&oldconf->rcu, free_conf);

At LINE 222 I add a smp_mb(); from Documentation/memory-barriers.txt, my
understanding is that here I need a smp_wmb() or smp_mb(). I see other
places all use smp_mb(), so I chose the stronger one -- smp_mb().
But Documentation/whatisRCU.txt says about rcu_assign_pointer():
"This function returns the new value, and also executes any
memory-barrier instructions required for a given CPU architecture."

So it seems the smp_mb() at LINE 222 is unnecessary. In the v2 patch I
keep the smp_mb() although I think it is unnecessary; I will remove it if
you or Shaohua can confirm it is unnecessary, as I suspect.

Another question: I tried to look at the r5conf code, but I still have no
idea how to store a copy of raid_disks in linear_conf the way r5conf
does. Could you please give me more hints?

Thanks.

Coly
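One subtlety worth noting about the question above: the memory barrier
implied by rcu_assign_pointer() is a store-release, which orders the
writes *before* it (the initialization of newconf) against the pointer
publication. It says nothing about writes that come *after* it, such as
the mddev->raid_disks++ at LINE 223, so the release alone arguably cannot
replace the smp_mb() at LINE 222. A commented sketch of the distinction,
offered as a reasoning aid rather than a claim about the final patch:

	/* Store-release: all writes above this line (filling in
	 * newconf) are guaranteed visible before the new pointer
	 * value. A release is a one-way barrier, though: it does NOT
	 * stop the store below from being reordered ahead of the
	 * pointer publication.
	 */
	rcu_assign_pointer(mddev->private, newconf);

	smp_mb();	/* keeps ->private strictly before ->raid_disks */

	mddev->raid_disks++;

A writer-side barrier also only helps if the reader orders its two loads
in turn; since linear_congested() loads mddev->private and
mddev->raid_disks independently, Neil's suggestion of reading the count
out of conf itself is the cleaner resolution.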
* Re: [PATCH] md linear: fix a race between linear_add() and linear_congested()
From: Coly Li @ 2017-01-27 17:32 UTC
To: Shaohua Li; +Cc: linux-raid, Neil Brown, stable

On 2017/1/26 2:02 AM, Shaohua Li wrote:
> [...]
>
> Good catch, this makes a lot of sense. However, this looks like an
> incomplete fix. Step 0 will get the old conf, and after step 5
> linear_add() will free the old conf. So it's possible
> linear_congested() will use the freed old conf; I think this is even
> more likely to happen. The easiest fix may be to take the RCU read
> lock in linear_congested() and free the old conf in an RCU callback.

Yes, RCU is still necessary here; I have just composed and sent out the
second version. Thanks for pointing this out :-)

Coly
end of thread, other threads:[~2017-01-27 17:45 UTC | newest]

Thread overview: 5+ messages
2017-01-25 11:15 [PATCH] md linear: fix a race between linear_add() and linear_congested() colyli
2017-01-25 18:02 ` Shaohua Li
2017-01-26  0:04   ` NeilBrown
2017-01-27 17:45     ` Coly Li
2017-01-27 17:32   ` Coly Li