* [PATCH] md: raid10: wake up frozen array
From: Arthur Jones @ 2008-07-25 19:03 UTC
To: Neil Brown; +Cc: linux-raid

When rescheduling a bio in raid10, we wake up
the md thread, but if the array is frozen, this
will have no effect.  This causes the array to
remain frozen for eternity.  We add a wake_up
to allow the array to de-freeze.  This code is
nearly identical to the raid1 code, which has
this fix already.

Signed-off-by: Arthur Jones <ajones@riverbed.com>
---
 drivers/md/raid10.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 159535d..d41bebb 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -215,6 +215,9 @@ static void reschedule_retry(r10bio_t *r10_bio)
 	conf->nr_queued ++;
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
+	/* wake up frozen array... */
+	wake_up(&conf->wait_barrier);
+
 	md_wakeup_thread(mddev->thread);
 }
 
-- 
1.5.4.3
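[Editorial sketch, not part of the original message.] The missed wakeup described above can be modelled in user space. This is a single-threaded toy under stated assumptions, not the kernel code: `struct toy_conf`, every `toy_*` function, the `nr_pending` counter and the "quiet" condition are hypothetical simplifications. Only `nr_queued`, `wait_barrier`, `wake_up`, `md_wakeup_thread` and `reschedule_retry` are names taken from the patch. The model assumes the freezer sleeps on `wait_barrier` and re-checks its condition only when that queue is woken.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the counters involved; NOT the kernel's conf struct. */
struct toy_conf {
    int nr_pending;      /* I/Os issued and not yet completed (assumed name) */
    int nr_queued;       /* failed I/Os parked for retry */
    bool freezer_asleep; /* true while a freezer sleeps on "wait_barrier" */
};

/* Assumed wait condition: every pending I/O is parked for retry. */
static bool toy_array_quiet(const struct toy_conf *c)
{
    return c->nr_pending == c->nr_queued;
}

/* The freezer goes to sleep unless the array is already quiet. */
static void toy_freeze_begin(struct toy_conf *c)
{
    c->freezer_asleep = !toy_array_quiet(c);
}

/* Models wake_up(&conf->wait_barrier): the sleeper re-checks its condition. */
static void toy_wake_up(struct toy_conf *c)
{
    if (c->freezer_asleep && toy_array_quiet(c))
        c->freezer_asleep = false;
}

/* Models reschedule_retry(); with_fix selects whether the added wake_up runs. */
static void toy_reschedule_retry(struct toy_conf *c, bool with_fix)
{
    assert(c->nr_queued < c->nr_pending); /* only a pending I/O can be parked */
    c->nr_queued++;
    if (with_fix)
        toy_wake_up(c);
    /* md_wakeup_thread() would be called here; in this model it signals a
     * different queue, so it cannot rouse the sleeping freezer. */
}

/* Returns true if the freezer is still asleep after one I/O failure. */
static bool toy_freezer_stuck_after_failure(bool with_fix)
{
    struct toy_conf c = { .nr_pending = 1, .nr_queued = 0, .freezer_asleep = false };

    toy_freeze_begin(&c);               /* one I/O in flight: freezer sleeps */
    toy_reschedule_retry(&c, with_fix); /* that I/O fails and is parked */
    return c.freezer_asleep;
}
```

Calling `toy_freezer_stuck_after_failure(false)` leaves the freezer asleep even though its condition has become true, while passing `true` lets it proceed. That is the behaviour the three added lines are meant to restore.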
* Re: [PATCH] md: raid10: wake up frozen array
From: Neil Brown @ 2008-08-01 3:03 UTC
To: Arthur Jones; +Cc: linux-raid

On Friday July 25, ajones@riverbed.com wrote:
> When rescheduling a bio in raid10, we wake up
> the md thread, but if the array is frozen, this
> will have no effect.  This causes the array to
> remain frozen for eternity.  We add a wake_up
> to allow the array to de-freeze.  This code is
> nearly identical to the raid1 code, which has
> this fix already.

Thanks for this!
It is "obviously correct" based on the similarity with raid1.
It is on its way to Linus.

NeilBrown

>
> Signed-off-by: Arthur Jones <ajones@riverbed.com>
> ---
>  drivers/md/raid10.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 159535d..d41bebb 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -215,6 +215,9 @@ static void reschedule_retry(r10bio_t *r10_bio)
>  	conf->nr_queued ++;
>  	spin_unlock_irqrestore(&conf->device_lock, flags);
>
> +	/* wake up frozen array... */
> +	wake_up(&conf->wait_barrier);
> +
>  	md_wakeup_thread(mddev->thread);
>  }
>
> --
> 1.5.4.3
* Re: [PATCH] md: raid10: wake up frozen array
From: Clive Messer @ 2008-08-30 21:30 UTC
To: linux-raid

On Fri, 2008-07-25 at 12:03 -0700, Arthur Jones wrote:
> When rescheduling a bio in raid10, we wake up
> the md thread, but if the array is frozen, this
> will have no effect.  This causes the array to
> remain frozen for eternity.  We add a wake_up
> to allow the array to de-freeze.  This code is
> nearly identical to the raid1 code, which has
> this fix already.

Can someone explain this to me in simple terms?

What will cause a rescheduling of bio?

Frozen for eternity - what will be the effect assuming my root file
system is on raid10?

I have a Fedora Core 9 box using a 4 disk f2 raid10 array. This is the
main partition and root file system. Every couple of days the machine
would hard lock. Sometimes I could ssh in. Most of the time not. I never
managed to catch anything to the logs with SysRq. With the benefit of
hindsight - if the kernel was 'jammed' writing to logfiles on a frozen
raid10 array that could explain it. I assumed faulty hardware. I have
actually replaced one at a time, (and at considerable expense), the
power supply, motherboard, processor, all 4 disks in the array. Still
the machine would lock-up. What is interesting is that I have managed 5
days uptime since I added this one line patch to
2.6.25.14-108.fc9.x86_64. Could someone confirm for me that it is more
than likely that the hard locks I experienced on this machine could be
resolved by this one line patch? Has this patch now made it into an
official kernel release?

> Signed-off-by: Arthur Jones <ajones@riverbed.com>
> ---
>  drivers/md/raid10.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 159535d..d41bebb 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -215,6 +215,9 @@ static void reschedule_retry(r10bio_t *r10_bio)
>  	conf->nr_queued ++;
>  	spin_unlock_irqrestore(&conf->device_lock, flags);
>
> +	/* wake up frozen array... */
> +	wake_up(&conf->wait_barrier);
> +
>  	md_wakeup_thread(mddev->thread);
>  }
>

Regards

Clive

-
Clive Messer <clive@vacuumtube.org.uk>
* Re: [PATCH] md: raid10: wake up frozen array
From: Arthur Jones @ 2008-09-02 15:07 UTC
To: Clive Messer; +Cc: linux-raid@vger.kernel.org

Hi Clive, ...

On Sat, Aug 30, 2008 at 02:30:52PM -0700, Clive Messer wrote:
>
> On Fri, 2008-07-25 at 12:03 -0700, Arthur Jones wrote:
> > When rescheduling a bio in raid10, we wake up
> > the md thread, but if the array is frozen, this
> > will have no effect.  This causes the array to
> > remain frozen for eternity.  We add a wake_up
> > to allow the array to de-freeze.  This code is
> > nearly identical to the raid1 code, which has
> > this fix already.
>
> Can someone explain this to me in simple terms?

The RAID sub-system needs to be able to synchronize
certain operations.  To do this, it "freezes" the
array, i.e. no I/O will complete until it is un-frozen.
This bug hit when we failed an I/O while the array
was frozen.  In this case, we would never tell the
frozen array that it was time to wake up and get back
to work, and the retry would not make progress.

> What will cause a rescheduling of bio?

If the first bio read attempt failed (e.g. broken
disk -- or, in my case, using fault injection),
then raid10 will retry the block I/O.

> Frozen for eternity - what will be the effect assuming my root file
> system is on raid10?

The failed I/O will not complete, and the process which
started the I/O will be stuck in an unkillable state
forever.  Future I/O to the device would be put on
hold (I guess, I never looked at this directly).

> I have a Fedora Core 9 box using a 4 disk f2 raid10 array. This is the
> main partition and root file system. Every couple of days the machine
> would hard lock. Sometimes I could ssh in. Most of the time not. I never
> managed to catch anything to the logs with SysRq. With the benefit of
> hindsight - if the kernel was 'jammed' writing to logfiles on a frozen
> raid10 array that could explain it. I assumed faulty hardware. I have
> actually replaced one at a time, (and at considerable expense), the
> power supply, motherboard, processor, all 4 disks in the array. Still
> the machine would lock-up. What is interesting is that I have managed 5
> days uptime since I added this one line patch to
> 2.6.25.14-108.fc9.x86_64. Could someone confirm for me that it is more
> than likely that the hard locks I experienced on this machine could be
> resolved by this one line patch? Has this patch now made it into an
> official kernel release?

It could be, but since you changed the drives
and controller, it doesn't seem too likely.  You
need some sort of failure to trigger this bug.
Also, Sys-rq still worked fine for me when I
triggered this bug...

This patch is now in Linus' git tree, but it
looks like it missed 2.6.26, so it won't be in
an "official" release until 2.6.27...

Arthur
* Re: [PATCH] md: raid10: wake up frozen array
From: Bill Davidsen @ 2008-09-05 16:58 UTC
To: Arthur Jones; +Cc: Clive Messer, linux-raid@vger.kernel.org

Arthur Jones wrote:
> Hi Clive, ...
>
> On Sat, Aug 30, 2008 at 02:30:52PM -0700, Clive Messer wrote:
>
>> On Fri, 2008-07-25 at 12:03 -0700, Arthur Jones wrote:
>>
>>> When rescheduling a bio in raid10, we wake up
>>> the md thread, but if the array is frozen, this
>>> will have no effect.  This causes the array to
>>> remain frozen for eternity.  We add a wake_up
>>> to allow the array to de-freeze.  This code is
>>> nearly identical to the raid1 code, which has
>>> this fix already.
>>>
>> Can someone explain this to me in simple terms?
>>
>
> The RAID sub-system needs to be able to synchronize
> certain operations.  To do this, it "freezes" the
> array, i.e. no I/O will complete until it is un-frozen.
> This bug hit when we failed an I/O while the array
> was frozen.  In this case, we would never tell the
> frozen array that it was time to wake up and get back
> to work, and the retry would not make progress.
>
>
>> What will cause a rescheduling of bio?
>>
>
> If the first bio read attempt failed (e.g. broken
> disk -- or, in my case, using fault injection),
> then raid10 will retry the block I/O.
>
>
>> Frozen for eternity - what will be the effect assuming my root file
>> system is on raid10?
>>
>
> The failed I/O will not complete, and the process which
> started the I/O will be stuck in an unkillable state
> forever.  Future I/O to the device would be put on
> hold (I guess, I never looked at this directly).
>
>
>> I have a Fedora Core 9 box using a 4 disk f2 raid10 array. This is the
>> main partition and root file system. Every couple of days the machine
>> would hard lock. Sometimes I could ssh in. Most of the time not. I never
>> managed to catch anything to the logs with SysRq. With the benefit of
>> hindsight - if the kernel was 'jammed' writing to logfiles on a frozen
>> raid10 array that could explain it. I assumed faulty hardware. I have
>> actually replaced one at a time, (and at considerable expense), the
>> power supply, motherboard, processor, all 4 disks in the array. Still
>> the machine would lock-up. What is interesting is that I have managed 5
>> days uptime since I added this one line patch to
>> 2.6.25.14-108.fc9.x86_64. Could someone confirm for me that it is more
>> than likely that the hard locks I experienced on this machine could be
>> resolved by this one line patch? Has this patch now made it into an
>> official kernel release?
>>
>
> It could be, but since you changed the drives
> and controller, it doesn't seem too likely.  You
> need some sort of failure to trigger this bug.
> Also, Sys-rq still worked fine for me when I
> triggered this bug...
>
> This patch is now in Linus' git tree, but it
> looks like it missed 2.6.26, so it won't be in
> an "official" release until 2.6.27...
>

I would hope that you or Neil would get it into the -stable series ASAP.
While rare, this bug is a killer when it strikes.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark
* Re: [PATCH] md: raid10: wake up frozen array
From: Arthur Jones @ 2008-09-05 17:04 UTC
To: Bill Davidsen; +Cc: Clive Messer, linux-raid@vger.kernel.org, Neil Brown

Hi Bill, ...

On Fri, Sep 05, 2008 at 09:58:20AM -0700, Bill Davidsen wrote:
> [...]
> > This patch is now in linus' git tree, but it
> > looks like it missed 2.6.26, so it won't be in
> > an "official" release until 2.6.27...
> >
>
> I would hope that you or Neil would get it into the -stable series ASAP.
> While rare, this bug is a killer when it strikes.

That sounds like a good idea to me...

I think Neil should send it off if he agrees.  Neil,
what do you think?

Arthur