From: Neil Brown <neilb@suse.de>
To: Avi Kivity <avi@argo.co.il>
Cc: david@lang.hm, linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org
Subject: Re: limits on raid
Date: Sat, 16 Jun 2007 07:59:29 +1000
Message-ID: <18035.3009.568832.785308@notabene.brown>
In-Reply-To: message from Avi Kivity on Friday June 15

On Friday June 15, avi@argo.co.il wrote:
> Neil Brown wrote:
> >
> >> while I consider zfs to be ~80% hype, one advantage it could have (but I
> >> don't know if it has) is that since the filesystem and raid are integrated
> >> into one layer they can optimize the case where files are being written
> >> onto unallocated space and instead of reading blocks from disk to
> >> calculate the parity they could just put zeros in the unallocated space,
> >> potentially speeding up the system by reducing the amount of disk I/O.
> >>
> >
> > Certainly. But the raid doesn't need to be tightly integrated
> > into the filesystem to achieve this. The filesystem need only know
> > the geometry of the RAID and when it comes to write, it tries to write
> > full stripes at a time. If that means writing some extra blocks full
> > of zeros, it can try to do that. This would require a little bit
> > better communication between filesystem and raid, but not much. If
> > anyone has a filesystem that they want to be able to talk to raid
> > better, they need only ask...
> >
>
> Some things are not achievable with block-level raid. For example, with
> redundancy integrated into the filesystem, you can have three copies for
> metadata, two copies for small files, and parity blocks for large files,
> effectively using different raid levels for different types of data on
> the same filesystem.

Absolutely. And doing that is a very good idea quite independent of
underlying RAID. Even ext2 stores multiple copies of the superblock.
Having the filesystem duplicate data, store checksums, and be able to
find a different copy if the first one it chose was bad is very
sensible and cannot be done by just putting the filesystem on RAID.
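
As a rough illustration of the duplicate-plus-checksum idea (a sketch
only: the dict-backed store and the names write_replicated and
read_verified are made up for this example, and CRC32 merely stands in
for whatever checksum a real filesystem would use):

```python
import zlib


def write_replicated(store, key, data, copies=2):
    """Store `copies` replicas of `data`, each paired with its CRC32."""
    checksum = zlib.crc32(data)
    store[key] = [(bytes(data), checksum) for _ in range(copies)]


def read_verified(store, key):
    """Return the first replica whose checksum matches; fail if none do."""
    for payload, checksum in store[key]:
        if zlib.crc32(payload) == checksum:
            return payload
    raise IOError(f"all replicas of {key!r} failed checksum verification")


store = {}  # hypothetical stand-in for on-disk storage
write_replicated(store, "inode-7", b"metadata")

# Simulate silent corruption of the first replica.
store["inode-7"][0] = (b"metadatX", store["inode-7"][0][1])

recovered = read_verified(store, "inode-7")
```

The point is that the reader, not the storage below it, decides
whether a copy is good, which is why plain RAID underneath cannot
provide this on its own.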

Having the filesystem keep multiple copies of each data block so that
when one drive dies, another block is used does not excite me quite so
much. If you are going to do that, then you want to be able to
reconstruct the data that should be on a failed drive onto a new
drive.

For a RAID system, that reconstruction can go at the full speed of the
drive subsystem - but needs to copy every block, whether used or not.
For in-filesystem duplication, it is easy to imagine that being quite
slow and complex. It would depend a lot on how you arrange data,
and maybe there is some clever approach to data layout that I haven't
thought of. But I think that sort of thing is much easier to do in a
RAID layer below the filesystem.
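
To make the RAID-layer case concrete, here is a minimal sketch of a
parity rebuild. It assumes a dedicated parity drive (RAID-4 style, for
simplicity) and models each drive as a list of equal-sized byte
blocks; the function names are invented for the example and this is
not how md is actually implemented.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a sequence of equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity_array(data_drives):
    """Append a parity drive holding the XOR of each stripe's data blocks."""
    parity = [xor_blocks(stripe) for stripe in zip(*data_drives)]
    return data_drives + [parity]


def rebuild_drive(drives, failed):
    """Recompute every block of drive `failed` from the surviving drives.

    The array has no idea which blocks the filesystem is using, so it
    simply walks every stripe in order.
    """
    survivors = [d for i, d in enumerate(drives) if i != failed]
    return [xor_blocks(stripe) for stripe in zip(*survivors)]


data = [
    [b"\x01\x02", b"\x00\x00"],  # drive 0: second block is unused (zeros)
    [b"\x10\x20", b"\x00\x00"],  # drive 1
]
array = make_parity_array(data)
rebuilt = rebuild_drive(array, failed=0)
```

The loop is a straight sequential pass with no lookups, which is why
it can run at drive speed, and also why it cannot skip the unused
second block.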

Combining these thoughts, it would make a lot of sense for the
filesystem to be able to say to the block device "That block looks
wrong - can you find me another copy to try?". That is an example of
the sort of closer integration between filesystem and RAID that would
make sense.
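
One way such a request could look, purely as a hypothetical interface
(MirrorDevice, its read() signature, and fs_read are assumptions for
this sketch; no such call exists in md today):

```python
import zlib


class MirrorDevice:
    """Hypothetical mirror that lets the caller reject specific copies."""

    def __init__(self, members):
        # One list of blocks per member drive.
        self.members = members

    def read(self, block, rejected=()):
        """Return (member, data) from the first member not yet rejected."""
        for member, drive in enumerate(self.members):
            if member not in rejected:
                return member, drive[block]
        return None  # no copies left to try


def fs_read(device, block, expected_crc):
    """Filesystem-side read: ask for another copy until one verifies."""
    rejected = []
    while True:
        result = device.read(block, rejected)
        if result is None:
            raise IOError(f"no good copy of block {block}")
        member, data = result
        if zlib.crc32(data) == expected_crc:
            return data
        # "That block looks wrong - can you find me another copy to try?"
        rejected.append(member)


good = b"superblock"
device = MirrorDevice([[b"garbage!!!"], [good]])
data = fs_read(device, 0, zlib.crc32(good))
```

The filesystem supplies the knowledge of what "wrong" means (here a
checksum it stored elsewhere), and the RAID layer supplies the
knowledge of where the other copies live.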
NeilBrown