* [Lustre-devel] Disk rebuild
@ 2009-02-03 14:09 Eric Barton
[not found] ` <6BAB26C7-8230-4D93-B41D-37B8AE31D7FF@sun.com>
0 siblings, 1 reply; 8+ messages in thread
From: Eric Barton @ 2009-02-03 14:09 UTC (permalink / raw)
To: lustre-devel
Andreas,
When we have some estimate of the overall HPCS filesystem size and shape,
can we do some calculations to show how frequently we expect drives to
fail, and get our heads round the rebuild performance / 2nd failure
vulnerability tradeoff? This obviously raises the question of whether
RAID 6 changes this tradeoff significantly by allowing rebuild to be so
slow that performance isn't impacted, and if so, whether it's viable with
a DMU backend.
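A minimal sketch of the kind of calculation being asked for, assuming independent drive failures at a constant rate. The drive count, annual failure rate (AFR), RAID group size, and rebuild times are placeholder values, not HPCS estimates, and the function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365


def expected_failures_per_year(num_drives, afr):
    """Expected number of drive failures per year across the filesystem."""
    return num_drives * afr


def prob_additional_failure(surviving_drives, afr, rebuild_hours):
    """Probability that at least one surviving drive in the same RAID group
    fails before the rebuild completes.

    Assumes independent failures with a constant (exponential) rate derived
    from the annual failure rate.
    """
    rate_per_hour = -math.log(1.0 - afr) / HOURS_PER_YEAR
    return 1.0 - math.exp(-surviving_drives * rate_per_hour * rebuild_hours)


if __name__ == "__main__":
    # Placeholder inputs (assumptions, not measured values).
    num_drives = 10_000
    afr = 0.03
    group_size = 10

    print("expected failures/year:", expected_failures_per_year(num_drives, afr))
    for rebuild_hours in (6, 24, 96):
        p = prob_additional_failure(group_size - 1, afr, rebuild_hours)
        print(f"rebuild {rebuild_hours:3d} h -> P(2nd failure in group) = {p:.5f}")
```

Slowing the rebuild lengthens the window in which a further failure can occur, which is the tradeoff in question. With RAID 6 a single additional failure in that window is still survivable.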
Cheers,
Eric
* [Lustre-devel] Fwd: Disk rebuild
[not found] ` <6BAB26C7-8230-4D93-B41D-37B8AE31D7FF@sun.com>
@ 2009-02-03 16:41 ` Jody McIntyre
0 siblings, 0 replies; 8+ messages in thread
From: Jody McIntyre @ 2009-02-03 16:41 UTC (permalink / raw)
To: lustre-devel
Hi Eric,
>> When we have some estimate of the overall HPCS filesystem size and
>> shape, can we do some calculations to show how frequently we expect
>> drives to fail, and get our heads round the rebuild performance / 2nd
>> failure vulnerability tradeoff? This obviously raises the question of
>> whether RAID 6 changes this tradeoff significantly by allowing
>> rebuild to be so slow that performance isn't impacted, and if so,
>> whether it's viable with a DMU backend.
Bryon asked me to clarify the RAID 6 vulnerability situation in resync
vs. recovery. First some definitions, since I don't know how widely
accepted these terms are outside the Linux software RAID community:
recovery: This occurs when a disk fails and is replaced. The entire
array must be read so that the new disk can be reconstructed from the
data and parity blocks on the existing disks. Recovery is also done on
new arrays, because it's faster than resync.
resync: When a system crashes during a write, resync must be done to
repair the parity blocks. All data blocks and parity blocks must be
read, and if the parity blocks are incorrect they must be rewritten.
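The two operations can be sketched as follows. This is a simplified illustration that assumes a single XOR parity block per stripe (RAID 5 style); RAID 6 adds a second, differently computed parity block, which is not modelled here. The function names are hypothetical and are not taken from the Linux md code.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_block(surviving_blocks):
    """Recovery: rebuild the one missing block of a stripe (data or parity)
    from all surviving blocks of that stripe."""
    return xor_blocks(surviving_blocks)


def resync_stripe(data_blocks, stored_parity):
    """Resync: recompute parity from the data blocks.

    Returns (correct_parity, needs_rewrite).
    """
    correct_parity = xor_blocks(data_blocks)
    return correct_parity, correct_parity != stored_parity


if __name__ == "__main__":
    data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
    parity = xor_blocks(data)

    # Recovery: the disk holding data[1] was replaced.
    print(recover_block([data[0], data[2], parity]) == data[1])

    # Resync: the stored parity is stale after a crash and must be rewritten.
    print(resync_stripe(data, b"\x00\x00"))
```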
With RAID 6, we are not vulnerable to a disk failure during recovery.
If a second disk fails while the first disk is being recovered, we can
replace it as well - recovery can reconstruct the data and parity blocks
on both new disks.
Unfortunately, we are vulnerable to _even one_ disk failing during
resync. When a machine crashes during a write the parity could be
completely wrong and unsuitable for recovery.
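To illustrate why stale parity makes a disk failure during resync dangerous, here is a toy example. It assumes single XOR parity for brevity (the same reasoning applies to a stripe whose parity blocks are out of date under RAID 6), and the block contents and function names are made up.

```python
def xor_blocks(blocks):
    """Bytewise XOR of a list of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def lose_disk_after_crash(resync_first):
    """Return True if the lost block is reconstructed correctly."""
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)

    # Crash during a write: the new data reached disk 0,
    # but the matching parity update did not.
    data[0] = b"ZZZZ"

    if resync_first:
        # Resync completed: parity is recomputed from the data blocks.
        parity = xor_blocks(data)

    # Disk 1 now fails; rebuild its block from the survivors.
    lost = data[1]
    rebuilt = xor_blocks([data[0], data[2], parity])
    return rebuilt == lost


if __name__ == "__main__":
    print("disk lost before resync:", lose_disk_after_crash(False))
    print("disk lost after resync: ", lose_disk_after_crash(True))
```

In the first case the rebuilt block is silently wrong, even though the block on the failed disk was never being written at the time of the crash.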
It is possible to significantly reduce resync (but not recovery) times
using bitmaps, but these have been shown to hurt performance
significantly. Another approach, journal-guided resynchronization, was
studied in a 2005 paper but has never been merged into the kernel. The
paper shows improvements in resync times from 254 seconds to 0.21
seconds (for a 1 GB test array) with under 5% performance impact. This
is an option if we're willing to develop and maintain the patches to do
it.
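As a rough illustration of the bitmap approach, the sketch below tracks which regions of the array have writes in flight, so that only those regions need resync after a crash. It is a toy model under stated assumptions: the class and method names are hypothetical, the region size is arbitrary, and a real implementation would persist the bitmap to disk (which is where the performance cost comes from).

```python
from collections import Counter


class WriteIntentBitmap:
    """Toy write-intent bitmap: one dirty flag per fixed-size region."""

    def __init__(self, array_blocks, region_blocks):
        self.region_blocks = region_blocks
        self.num_regions = -(-array_blocks // region_blocks)  # ceiling division
        self.in_flight = Counter()  # region -> number of writes in progress
        self.bitmap_writes = 0      # extra synchronous bitmap updates

    def begin_write(self, block):
        region = block // self.region_blocks
        if self.in_flight[region] == 0:
            # The region must be marked dirty on stable storage before the
            # data write is issued; this is the extra cost per clean region.
            self.bitmap_writes += 1
        self.in_flight[region] += 1

    def end_write(self, block):
        region = block // self.region_blocks
        self.in_flight[region] -= 1
        if self.in_flight[region] == 0:
            del self.in_flight[region]

    def regions_to_resync(self):
        """Regions that would need a parity check after a crash right now."""
        return sorted(self.in_flight)


if __name__ == "__main__":
    bitmap = WriteIntentBitmap(array_blocks=1000, region_blocks=100)
    bitmap.begin_write(5)
    bitmap.begin_write(250)
    bitmap.end_write(5)
    # Crash here: only region 2 needs resync instead of all 10 regions.
    print(bitmap.regions_to_resync(), "of", bitmap.num_regions, "regions")
```

Journal-guided resynchronization follows the same idea, but records the locations in the filesystem journal rather than in a separate bitmap.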
Cheers,
Jody
* [Lustre-devel] Fwd: Disk rebuild
@ 2009-12-01 15:00 Nikita Danilov
2009-12-01 23:57 ` Jody McIntyre
0 siblings, 1 reply; 8+ messages in thread
From: Nikita Danilov @ 2009-12-01 15:00 UTC (permalink / raw)
To: lustre-devel
Hello,
On 03 Feb 2009 Jody McIntyre wrote:
> It is possible to significantly reduce resync (but not recovery) times
> using bitmaps, but these have been shown to hurt performance
> significantly. Another approach, journal-guided resynchronization, was
> studied in a 2005 paper but has never been merged into the kernel. The
> paper shows improvements in resync times from 254 seconds to 0.21
> seconds (for a 1 GB test array) with under 5% performance impact. This
> is an option if we're willing to develop and maintain the patches to do
> it.
what is the status of this? Is ext3 guided resync code (RHEL 5 version
was posted on lkml in October) used by Lustre?
>
> Cheers,
> Jody
Thank you,
Nikita.
* [Lustre-devel] Fwd: Disk rebuild
2009-12-01 15:00 Nikita Danilov
@ 2009-12-01 23:57 ` Jody McIntyre
2009-12-02 14:13 ` Nikita Danilov
0 siblings, 1 reply; 8+ messages in thread
From: Jody McIntyre @ 2009-12-01 23:57 UTC (permalink / raw)
To: lustre-devel
Hi Nikita,
On Tue, Dec 01, 2009 at 06:00:39PM +0300, Nikita Danilov wrote:
> what is the status of this? Is ext3 guided resync code (RHEL 5 version
> was posted on lkml in October) used by Lustre?
This is covered by bug 19932.
Cheers,
Jody
>
> >
> > Cheers,
> > Jody
>
> Thank you,
> Nikita.
* [Lustre-devel] Fwd: Disk rebuild
2009-12-01 23:57 ` Jody McIntyre
@ 2009-12-02 14:13 ` Nikita Danilov
2009-12-02 19:43 ` Andreas Dilger
0 siblings, 1 reply; 8+ messages in thread
From: Nikita Danilov @ 2009-12-02 14:13 UTC (permalink / raw)
To: lustre-devel
2009/12/2 Jody McIntyre <scjody@sun.com>:
> Hi Nikita,
Hello Jody,
>
> On Tue, Dec 01, 2009 at 06:00:39PM +0300, Nikita Danilov wrote:
>
>> what is the status of this? Is ext3 guided resync code (RHEL 5 version
>> was posted on lkml in October) used by Lustre?
>
> This is covered by bug 19932.
The last (43rd) comment there is rather intriguing. Can you elaborate
on why guided resync cannot work with the Lustre IO stack?
>
> Cheers,
> Jody
Thank you,
Nikita.
* [Lustre-devel] Fwd: Disk rebuild
2009-12-02 14:13 ` Nikita Danilov
@ 2009-12-02 19:43 ` Andreas Dilger
2009-12-02 20:58 ` Nikita Danilov
0 siblings, 1 reply; 8+ messages in thread
From: Andreas Dilger @ 2009-12-02 19:43 UTC (permalink / raw)
To: lustre-devel
Hi Nikita!
On 2009-12-02, at 07:13, Nikita Danilov wrote:
>> On Tue, Dec 01, 2009 at 06:00:39PM +0300, Nikita Danilov wrote:
>>> what is the status of this? Is ext3 guided resync code (RHEL 5
>>> version was posted on lkml in October) used by Lustre?
>>
>> This is covered by bug 19932.
>
> The last (43rd) comment there is rather intriguing. Can you elaborate
> on why guided resync cannot work with the Lustre IO stack?
The problem lies in the way that obdfilter submits IO. Since it is
not using the normal buffer cache to track "data=ordered" (or in the
case of this patch "data=declared") mode, submit_bio() will likely
start modifying the MD device before the corresponding declare blocks
are committed to the journal.
This breaks the whole validity of declared mode in case of a crash,
since we can no longer be certain that the declare blocks contain all
of the locations in the MD RAID that may need to have parity rebuilt.
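The ordering requirement can be sketched with a toy model: after a crash, every location the data path has touched on the MD device must already appear in a committed declare block. The event names and functions below are hypothetical and do not reflect the actual structures of the patch or of obdfilter.

```python
def run_until_crash(events, crash_after):
    """Replay the first `crash_after` events and report what reached disk.

    Each event is ("declare_commit", block) or ("data_write", block).
    """
    committed_declares = set()
    touched = set()
    for kind, block in events[:crash_after]:
        if kind == "declare_commit":
            committed_declares.add(block)
        elif kind == "data_write":
            touched.add(block)
    return committed_declares, touched


def resync_is_safe(events, crash_after):
    """Guided resync is only valid if every touched location was declared
    in the journal before the crash."""
    declared, touched = run_until_crash(events, crash_after)
    return touched <= declared


if __name__ == "__main__":
    # Data path ordered through the journal: declare first, then write.
    ordered = [("declare_commit", 7), ("data_write", 7)]
    # Direct bio submission: the write may reach the device first.
    unordered = [("data_write", 7), ("declare_commit", 7)]

    print(all(resync_is_safe(ordered, n) for n in range(3)))
    print(all(resync_is_safe(unordered, n) for n in range(3)))
```

In the unordered case, a crash between the two events leaves a modified stripe that the journal knows nothing about, so a guided resync would skip it.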
It would be possible to fix this by having the OST use the normal VFS
methods to order the IO to disk, but I'm sure you're well aware of the
performance impact of this. It wouldn't be so bad with older versions
of Lustre, where we had to wait for the journal commit before
returning to the client anyway, but in 1.8.2 there is a (disabled by
default) async journal commit option that allows the client to get RPC
replies before the bulk IO is committed.
Accommodating declared mode would mean implementing full write-cached
IO on the OST. That wouldn't be impossible, given that 1.8 already
uses the page cache for reading, but given the amount of change and
risk this would introduce, it wasn't thought worthwhile to implement
for the short lifespan it would have.
It wouldn't be practical to introduce such a major change any sooner
than the DMU OSD in the 2.1 release, at which point it is largely
obsolete.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* [Lustre-devel] Fwd: Disk rebuild
2009-12-02 19:43 ` Andreas Dilger
@ 2009-12-02 20:58 ` Nikita Danilov
2009-12-02 22:48 ` Andreas Dilger
0 siblings, 1 reply; 8+ messages in thread
From: Nikita Danilov @ 2009-12-02 20:58 UTC (permalink / raw)
To: lustre-devel
2009/12/2 Andreas Dilger <adilger@sun.com>:
> Hi Nikita!
Hello Andreas!
>
> On 2009-12-02, at 07:13, Nikita Danilov wrote:
>>>
>>> On Tue, Dec 01, 2009 at 06:00:39PM +0300, Nikita Danilov wrote:
>>>>
>>>> what is the status of this? Is ext3 guided resync code (RHEL 5 version
>>>> was posted on lkml in October) used by Lustre?
>>>
>>> This is covered by bug 19932.
>>
>> The last (43rd) comment there is rather intriguing. Can you elaborate
>> on why guided resync cannot work with the Lustre IO stack?
>
>
> The problem lies in the way that obdfilter submits IO. Since it is not
> using the normal buffer cache to track "data=ordered" (or in the case of
> this patch "data=declared") mode, submit_bio() will likely start
> modifying the MD device before the corresponding declare blocks are
> committed to the journal.
Thank you for the detailed explanation, data-path completely escaped
my mind. Still, on the mdt side, osd goes through the normal VFS paths
and data=declared should work, right?
[...]
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>
Thank you,
Nikita.
* [Lustre-devel] Fwd: Disk rebuild
2009-12-02 20:58 ` Nikita Danilov
@ 2009-12-02 22:48 ` Andreas Dilger
0 siblings, 0 replies; 8+ messages in thread
From: Andreas Dilger @ 2009-12-02 22:48 UTC (permalink / raw)
To: lustre-devel
On 2009-12-02, at 13:58, Nikita Danilov wrote:
>> The problem lies in the way that obdfilter submits IO. Since it is
>> not using the normal buffer cache to track "data=ordered" (or in the
>> case of this patch "data=declared") mode, submit_bio() will likely
>> start modifying the MD device before the corresponding declare blocks
>> are committed to the journal.
>
> Thank you for the detailed explanation, data-path completely escaped
> my mind. Still, on the mdt side, osd goes through the normal VFS paths
> and data=declared should work, right?
Yes, though in general the MDT is a lot smaller than the OSTs, fails
less often, has RAID-1 instead of RAID-6 (so the rebuild goes
considerably faster), and has metadata journaling for everything (so
it doesn't get inconsistent in the first place).
There would likely be some improvement, but we haven't benchmarked it
- the main concern was for the OSTs.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.