public inbox for linux-kernel@vger.kernel.org
* Re: ZFS with Linux: An Open Plea
@ 2007-04-15  8:54 David R. Litwin
  2007-04-16  0:50 ` Rik van Riel
  0 siblings, 1 reply; 5+ messages in thread
From: David R. Litwin @ 2007-04-15  8:54 UTC (permalink / raw)
  To: linux-kernel

On 14/04/07, Neil Brown <neilb@suse.de> wrote:

> It is generally expected that email conversations started on-list will
> remain on-list, unless there is a special reason to take it off
> list... though maybe it was an accident on your part.

It very much was. I'm not used to not being subscribed to a mailing list.

> Example of odd commands?
>     mkfs -j /dev/whatever
> usually does me.  Admittedly it might be nice to avoid the -j, but
> that doesn't seem like a big issue.

Fair enough.

> > 2: ZFS provides near-platter speeds. A hard-drive should not be
> > hampered performance-wise by its file system.

> That is claimed of XFS too.

Really? I must have missed that one.... Anyway, I use XFS, so this news
makes me like it even more.

> Immediate backups to tape?  Seems unlikely.
> Or are you talking about online snapshots?  I believe LVM supports
> those.  Maybe the commands there are odd...

Oh, fine, be that way with your commands. :-) As I said, though, I'm not an
expert; merely a Linux user. You know far more about this sort of thing
than I ever shall.

> > 4: ZFS has a HUGE capacity. I don't have 30 exabytes, but I might some
> > day....

> ext4 will probably cope with that.  XFS definitely has very high
> limits though I admit I don't know what they are.

XFS is also a few exabytes.

> > 5: ZFS has little over-head. I don't want my file system to take up
> > space; that's for the data to do.

> I doubt space-overhead is a substantial differentiator between
> filesystems.  All filesystems need to use space for metadata.  They
> might report it in different ways.

Again, I'm simply reporting what I've heard. Well, read.

> > > It is possible that that functionality can be
> > > incorporated into Linux without trying to clone or copy ZFS.
> >
> > I don't deny this in the least. But, there's good code sitting, waiting
> > to be used. Why bother starting from scratch or trying to
> > re-do what is already done?

> Imagine someone wanting some cheap furniture and going to a 'garage
> sale' at a light-house.  All this nice second-hand furniture, but you
> can tell it won't fit in your house as it all has rounded edges...
>
> It is a bit like that with software.  It might have great features and
> functionality, but that doesn't mean it will fit.
>
> XFS is a prime example.  It was ported to Linux by creating a fairly
> substantial compatibility interface so that code written for IRIX
> would work in Linux.  That layer makes it a lot harder for other
> people to maintain the software (I know, I've tried to understand it
> and never got very far).

I've heard of the horrors of XFS's code. But, is there really that much
work to be done to port ZFS to Linux? This is one area for which I have no
information, as no one has tried (save the FUSEy folk) due to licences. Do
inform me!

-- 
—A watched bread-crumb never boils.
—My hover-craft is full of eels.
—[...]and that's the he and the she of it.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: ZFS with Linux: An Open Plea
  2007-04-15  8:54 ZFS with Linux: An Open Plea David R. Litwin
@ 2007-04-16  0:50 ` Rik van Riel
  2007-04-16  3:07   ` David Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Rik van Riel @ 2007-04-16  0:50 UTC (permalink / raw)
  To: David R. Litwin; +Cc: linux-kernel

David R. Litwin wrote:

>> 4: ZFS has a HUGE capacity. I don't have 30 exabytes, but I might some
>> day....
> 
> ext4 will probably cope with that.  XFS definitely has very high
> limits though I admit I don't know what they are.
> 
> XFS is also a few exabytes.

None of these filesystems has an fsck that will be able to deal with
a filesystem that big.  Unless, of course, you have a few weeks
to wait for fsck to complete.

Backup and restore are similar problems.  When part of the filesystem
is lost, you don't want to have to wait for a full restore.

Sounds simple?  Well, the hard part is figuring out exactly which
part of the filesystem you need to restore...

I don't see ZFS, ext4 or XFS addressing these issues.

IMHO chunkfs could provide a much more promising approach.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: ZFS with Linux: An Open Plea
  2007-04-16  0:50 ` Rik van Riel
@ 2007-04-16  3:07   ` David Chinner
  2007-04-16 22:34     ` Repair-driven file system design (was Re: ZFS with Linux: An Open Plea) Valerie Henson
  0 siblings, 1 reply; 5+ messages in thread
From: David Chinner @ 2007-04-16  3:07 UTC (permalink / raw)
  To: Rik van Riel; +Cc: David R. Litwin, linux-kernel

On Sun, Apr 15, 2007 at 08:50:25PM -0400, Rik van Riel wrote:
> David R. Litwin wrote:
> 
> >>4: ZFS has a HUGE capacity. I don't have 30 exabytes, but I might some
> >>day....
> >
> >ext4 will probably cope with that.  XFS definitely has very high
> >limits though I admit I don't know what they are.
> >
> >XFS is also a few exabytes.
> 
> None of these filesystems has an fsck that will be able to deal with
> a filesystem that big.  Unless, of course, you have a few weeks
> to wait for fsck to complete.

Which is why I want to be able to partially offline a chunk of
a filesystem and repair it while the rest is still online.....

> Backup and restore are similar problems.  When part of the filesystem
> is lost, you don't want to have to wait for a full restore.
> 
> Sounds simple?  Well, the hard part is figuring out exactly which
> part of the filesystem you need to restore...
> 
> I don't see ZFS, ext4 or XFS addressing these issues.

These sorts of issues are directly in our cross-hairs for XFS.

The major scaling problem XFS has right now is to do with
how long repair/backup/restore take when you have hundreds
of terabytes of storage.

> IMHO chunkfs could provide a much more promising approach.

Agreed, that's one method of compartmentalising the problem.....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Repair-driven file system design (was Re: ZFS with Linux: An Open Plea)
  2007-04-16  3:07   ` David Chinner
@ 2007-04-16 22:34     ` Valerie Henson
  2007-04-17  1:09       ` David Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Valerie Henson @ 2007-04-16 22:34 UTC (permalink / raw)
  To: David Chinner; +Cc: Rik van Riel, David R. Litwin, linux-kernel

On Mon, Apr 16, 2007 at 01:07:05PM +1000, David Chinner wrote:
> On Sun, Apr 15, 2007 at 08:50:25PM -0400, Rik van Riel wrote:
>
> > IMHO chunkfs could provide a much more promising approach.
> 
> Agreed, that's one method of compartmentalising the problem.....

Agreed, the chunkfs design is only one way to implement repair-driven
file system design - designing your file system to make file system
check and repair fast and easy.  I've written a paper on this idea,
which includes some interesting projections estimating that fsck will
take 10 times as long on the 2013 equivalent of a 2006 file system,
due entirely to changes in disk hardware.  So if your server currently
takes 2 hours to fsck, an equivalent server in 2013 will take about 20
hours.  Eek!  Paper here:

http://infohost.nmt.edu/~val/review/repair.pdf
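The projection quoted above is easy to reproduce; the numbers below are
taken from this paragraph (the 10x growth factor is the paper's, not mine):

```python
# Sketch of the fsck-time projection described above: the paper's net
# factor from 2006 -> 2013 disk-hardware trends is roughly 10x.
fsck_2006_hours = 2.0      # today's fsck time for the example server
hardware_factor = 10.0     # paper's projected slowdown factor
fsck_2013_hours = fsck_2006_hours * hardware_factor
print(fsck_2013_hours)     # prints 20.0 -- the "20 hours" above
```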

While I'm working on chunkfs, I also think that all file systems
should strive for repair-driven design.  XFS has already made big
strides in this area (multi-threading fsck for multi-disk file
systems, for example) and I'm excited to see what comes next.

-VAL


* Re: Repair-driven file system design (was Re: ZFS with Linux: An Open Plea)
  2007-04-16 22:34     ` Repair-driven file system design (was Re: ZFS with Linux: An Open Plea) Valerie Henson
@ 2007-04-17  1:09       ` David Chinner
  0 siblings, 0 replies; 5+ messages in thread
From: David Chinner @ 2007-04-17  1:09 UTC (permalink / raw)
  To: Valerie Henson
  Cc: David Chinner, Rik van Riel, David R. Litwin, linux-kernel,
	bnaujok

On Mon, Apr 16, 2007 at 03:34:42PM -0700, Valerie Henson wrote:
> On Mon, Apr 16, 2007 at 01:07:05PM +1000, David Chinner wrote:
> > On Sun, Apr 15, 2007 at 08:50:25PM -0400, Rik van Riel wrote:
> >
> > > IMHO chunkfs could provide a much more promising approach.
> > 
> > Agreed, that's one method of compartmentalising the problem.....
> 
> Agreed, the chunkfs design is only one way to implement repair-driven
> file system design - designing your file system to make file system
> check and repair fast and easy.  I've written a paper on this idea,
> which includes some interesting projections estimating that fsck will
> take 10 times as long on the 2013 equivalent of a 2006 file system,
> due entirely to changes in disk hardware.

That's assuming that repair doesn't get any more efficient. ;)

> So if your server currently
> takes 2 hours to fsck, an equivalent server in 2013 will take about 20
> hours.  Eek!  Paper here:
> 
> http://infohost.nmt.edu/~val/review/repair.pdf
> 
> While I'm working on chunkfs, I also think that all file systems
> should strive for repair-driven design.  XFS has already made big
> strides in this area (multi-threading fsck for multi-disk file
> systems, for example) and I'm excited to see what comes next.

Two steps forward, one step back.

We found that our original approach to multithreading doesn't always
work, and doesn't work at all for single disks. Under some test cases,
it goes *much* slower due to increased seeking of the disks.

This patch from the folk at Agami:

http://oss.sgi.com/archives/xfs/2007-01/msg00135.html

used a different threading approach to speeding up the repair
process - it basically did object path walking in separate threads
to prime the block device page cache so that when the real
repair thread needed the block it came from the blockdev cache
rather than from disk.

This sped up several phases of the repair process because of
re-reads needed in the different phases. What we found interesting
about this approach is that it showed that prefetching gave as good
or better results than simple parallelisation with a rudimentary
caching system. In most cases it was superior (lower runtime) to
the existing multithreaded xfs_repair.

However, the Agami object based prefetch does not speed up phase 3
on a single disk - like strided AG parallelism it increases disk
seeks and, as we discovered, causes lots of little backwards seeks
to occur. It also performs very poorly when there is not enough
memory to cache sufficient objects in the block dev cache (whose
size cannot be controlled). It sped things up by using prefetch to
speed up (repeated) I/O, not by using intelligent caching.....

However, this patch has been very instructive on how we could
further improve the threading of xfs_repair: intelligent prefetch
is better than simple parallelism (from the Agami patch), caching is
far better than rereading (from the SGI repair-level caching), and
prefetching complements simple parallelism on volumes that can
take advantage of it.

We've ended up combining a threaded, two phase object walking
prefetch with spatial analysis of the inode and object layouts
and integration into a smarter internal cache. This cache is now
similar to the xfs_buf cache in the kernel and uses direct I/O
so if you have enough memory you only need to read objects from
disk once.
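In spirit, a read-once object cache like the one described above can be
sketched in a few lines (this is a toy model, not the xfs_repair code; the
class and names here are hypothetical):

```python
class MetadataCache:
    """Toy read-once cache: the first get() for an address does real I/O,
    later gets are served from memory, so each object is read from disk
    at most once (assuming enough memory, as in the paragraph above)."""

    def __init__(self, read_fn):
        self._read = read_fn   # function performing the actual (direct) I/O
        self._cache = {}       # addr -> data already read
        self.reads = 0         # count of real device reads issued

    def get(self, addr):
        if addr not in self._cache:
            self._cache[addr] = self._read(addr)
            self.reads += 1
        return self._cache[addr]

cache = MetadataCache(lambda addr: b"block@%d" % addr)
cache.get(8); cache.get(8); cache.get(16)
print(cache.reads)  # prints 2: the repeated read of block 8 hit the cache
```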

Spatial analysis of the metadata is used to determine the relative
density of the metadata in an area of disk before we read it. Using
a density function, we determine if we want to do lots of small I/Os
or one large I/O to read the entire region in one go and then split
it up in memory. Hence as metadata density increases, the number of
I/Os decrease and we pull enough data in to (hopefully) keep the
CPUs busy.
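The density heuristic can be sketched roughly like this (a simplification
of what is described above; the function, threshold, and block size are
illustrative assumptions, not the actual xfs_repair logic):

```python
def plan_reads(block_addrs, block_size=4096, density_threshold=0.25):
    """If metadata blocks cover enough of a disk region, issue one large
    I/O for the whole region and split it up in memory; otherwise issue
    individual small reads. Returns a list of (offset, length) I/Os."""
    if not block_addrs:
        return []
    addrs = sorted(block_addrs)
    span = addrs[-1] - addrs[0] + block_size     # region covering all blocks
    density = len(addrs) * block_size / span     # fraction that is metadata
    if density >= density_threshold:
        # dense region: one large sequential read, fewer seeks
        return [(addrs[0], span)]
    # sparse region: lots of small I/Os is cheaper than reading the gap
    return [(a, block_size) for a in addrs]

print(plan_reads([0, 4096, 8192, 16384]))  # prints [(0, 20480)]
print(plan_reads([0, 10_000_000]))         # two separate small reads
```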

We still walk objects, but any blocks behind where we are currently
reading go into a secondary I/O queue to be issued later. Hence we
keep moving in one direction across the disk. Once the first pass is
complete, we then do the same analysis on the secondary list and run
that I/O all in a single pass across the disk.
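The two-pass, forward-only scheduling described above amounts to something
like this (a hypothetical sketch; names are mine, not xfs_repair's):

```python
def schedule_io(requests, start_pos=0):
    """Keep the disk head moving in one direction: blocks at or ahead of
    the current position are read now; blocks behind it are deferred to a
    secondary queue, which is then sorted and issued as a second forward
    sweep across the disk."""
    primary, secondary = [], []
    pos = start_pos
    for addr in requests:
        if addr >= pos:
            primary.append(addr)
            pos = addr                 # head keeps advancing
        else:
            secondary.append(addr)     # behind us: defer to second pass
    return primary, sorted(secondary)  # second pass is one forward sweep

print(schedule_io([10, 50, 30, 80, 20]))  # prints ([10, 50, 80], [20, 30])
```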

This is effectively a result of observing that repair is typically seek
bound, using only 2-3MB/s of the bandwidth a disk has to offer.
Where metadata density is high, we are now seeing luns max out on
bandwidth rather than being seek bound. Effectively we are hiding
latency by using more bandwidth and that is a good tradeoff to
make for a seek bound app....

The result of this is that even on single disks the reading of all
the metadata goes faster with this multithreaded prefetch model.  A
full 250GB SATA disk with a clean filesystem containing ~1.6 million
inodes is now taking less than 5 minutes to repair. A 5.5TB RAID5
volume with 30 million inodes is now taking about 4.5 minutes to
repair instead of 20 minutes. We're currently creating a
multi-hundred million inode filesystem to determine scalability to
the current bleeding edge.

One thing this makes me consider is changing the way inodes and
metadata get laid out in XFS - clumping metadata together will lead
to better scan times for repair because of the density increase.
Dualfs has already proven that this can be good for performance when
done correctly; I think it also has merit for improving repair times
substantially as well.

FWIW, I've already told Barry he's going to have to write a
white paper about all this once he's finished.....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

