From: David Greaves <david@dgreaves.com>
To: Bill Davidsen <davidsen@tmr.com>
Cc: david@lang.hm, Neil Brown <neilb@suse.de>,
	linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org
Subject: Re: limits on raid
Date: Fri, 22 Jun 2007 17:55:46 +0100
Message-ID: <467BFF12.60200@dgreaves.com>
In-Reply-To: <467BF233.7020700@tmr.com>

Bill Davidsen wrote:
> David Greaves wrote:
>> david@lang.hm wrote:
>>> On Fri, 22 Jun 2007, David Greaves wrote:
>> If you end up 'fiddling' in md because someone specified 
>> --assume-clean on a raid5 [in this case just to save a few minutes 
>> *testing time* on a system with a heavily choked bus!] then that adds 
>> *even more* complexity and exception cases into all the stuff you 
>> described.
> 
> A "few minutes?" Are you reading the times people are seeing with 
> multi-TB arrays? Let's see, 5TB at a rebuild rate of 20MB/s... three days. 
Yes. But we are talking initial creation here.
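(And the arithmetic does say three days: 5TB / 20MB/s = 250,000 seconds, 
or about 2.9 days.)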

> And as soon as you believe that the array is actually "usable" you cut 
> that rebuild rate, perhaps in half, and get dog-slow performance from 
> the array. It's usable in the sense that reads and writes work, but for 
> useful work it's pretty painful. You either fail to understand the 
> magnitude of the problem or wish to trivialize it for some reason.
I do understand the problem and I'm not trying to trivialise it :)

I _suggested_ that it's worth thinking things through rather than jumping in to 
say "oh, we can code up a clever algorithm that keeps track of which stripes have 
valid parity and which don't, optimise the read/modify/write path for valid 
stripes, use the raid6-style read-all/write-all for invalid stripes, and then 
add a bit extra to the check code to set the bitmap..."
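
To make the scale of that concrete, here is roughly the shape such a scheme 
takes - a sketch in userspace C, every name invented for illustration; nothing 
like this exists in md today:

  /* Illustrative only: one bit per stripe, set = parity known-good.
   * read_modify_write()/reconstruct_write() stand in for the two
   * existing raid5 write paths. */
  #include <stdint.h>

  #define NSTRIPES (1UL << 20)                 /* hypothetical array size */

  static uint64_t parity_valid[NSTRIPES / 64]; /* the extra state to carry */

  extern void read_modify_write(uint64_t stripe_nr);  /* normal raid5 path */
  extern void reconstruct_write(uint64_t stripe_nr);  /* write whole stripe */

  static void handle_write(uint64_t stripe_nr)
  {
          if (parity_valid[stripe_nr / 64] & (1ULL << (stripe_nr % 64))) {
                  /* parity known-good: read old data + old parity,
                   * xor the change in, write both back */
                  read_modify_write(stripe_nr);
          } else {
                  /* parity unknown: compute it from the whole stripe
                   * (the raid6-style write-all), then mark it valid */
                  reconstruct_write(stripe_nr);
                  parity_valid[stripe_nr / 64] |= 1ULL << (stripe_nr % 64);
          }
  }

...and the bitmap still has to be made persistent, resized with the array, 
and kept coherent across crashes.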

Phew - and that lets us run the array at semi-degraded performance (raid6-like) 
for 3 days rather than either waiting before we put it into production or 
running it very slowly.
Now we run this system for 3 years and we saved 3 days - hmmm IS IT WORTH IT?

What happens during those 3 years when a disk fails? The solution doesn't 
apply then - it's 3 days to rebuild, like it or not.

> By delaying parity computation until the first write to a stripe only 
> the growth of a filesystem is slowed, and all data are protected without 
> waiting for the lengthy check. The rebuild speed can be set very low, 
> because on-demand rebuild will do most of the work.
I am not saying you are wrong.
I merely ask whether the benefit outweighs the added complexity.

If the benefit applied 24x7 then sure - e.g. hardware assist for the raid 
calculations would be very useful indeed.

>> I'm very much for the fs layer reading the lower block structure so I 
>> don't have to fiddle with arcane tuning parameters - yes, *please* 
>> help make xfs self-tuning!
>>
>> Keeping life as straightforward as possible low down makes the upwards 
>> interface more manageable and that goal more realistic... 
> 
> Those two paragraphs are mutually exclusive. The fs can be simple 
> because it rests on a simple device, even if the "simple device" is 
> provided by LVM or md. And LVM and md can stay simple because they rest 
> on simple devices, even if they are provided by PATA, SATA, nbd, etc. 
> Independent layers make each layer more robust. If you want to 
> compromise the layer separation, some approach like ZFS with full 
> integration would seem to be promising. Note that layers allow 
> specialized features at each point, trading integration for flexibility.

That's a simplistic summary.
You *can* loosely couple the layers. But you can also enrich the interface and 
couple them tightly - XFS is capable (I guess) of understanding md more 
fully than, say, ext2 does.
XFS would still work on a less 'talkative' block device where performance wasn't 
as important (USB flash maybe, dunno).
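(The kind of thing I mean: today you tell XFS about md's geometry by hand 
with mkfs options like "mkfs.xfs -d su=64k,sw=3 /dev/md0" - a made-up 
example for a 4-drive raid5 with 64k chunks - and that's exactly the sort 
of arcane parameter the fs could read from the block device itself.)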


> My feeling is that full integration and independent layers each have 
> benefits; as you connect the layers to expose operational details, you 
> need to handle changes in those details, which would seem to make the 
> layers more complex.
Agreed.

> What I'm looking for here is better performance in one 
> particular layer, the md RAID5 layer. I like to avoid unnecessary 
> complexity, but I feel that the current performance suggests room for 
> improvement.

I agree there is room for improvement.
I suggest that it may be more fruitful to write a tool called "raid5prepare" 
that writes zeroes (or ones, as appropriate) to all the component devices, after 
which --assume-clean can be used without concern. Such a tool could check 
whether the devices are SCSI, or whatever, and take advantage of the very fast 
block writes they can do.
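
A sketch of the core of such a tool, in plain userspace C (names and structure 
invented; this is the idea, not an implementation). The point is simply that 
the xor of all-zero data blocks is zero, so a zeroed set of components is 
already parity-consistent:

  /* raid5prepare sketch (hypothetical tool): zero each component so the
   * array is parity-consistent before it is created with --assume-clean. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  static char buf[1 << 20];       /* 1MiB buffer, zero-initialized */

  static int zero_device(const char *dev)
  {
          int fd = open(dev, O_WRONLY);

          if (fd < 0) {
                  perror(dev);
                  return -1;
          }
          /* a short write (or ENOSPC) marks the end of the block device */
          while (write(fd, buf, sizeof(buf)) == (ssize_t)sizeof(buf))
                  ;
          fsync(fd);
          close(fd);
          return 0;
  }

  int main(int argc, char **argv)
  {
          int i;

          for (i = 1; i < argc; i++)
                  if (zero_device(argv[i]) < 0)
                          return 1;
          return 0;
  }

After that, creating the array with mdadm --create ... --assume-clean is safe. 
On SCSI the zeroing itself could presumably use something like WRITE SAME 
instead of shovelling a buffer of zeroes over the bus.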

David
