* stable xfs
@ 2006-07-17 15:30 Ming Zhang
2006-07-17 16:20 ` Peter Grandi
2006-07-18 23:54 ` Nathan Scott
0 siblings, 2 replies; 33+ messages in thread
From: Ming Zhang @ 2006-07-17 15:30 UTC (permalink / raw)
To: linux-xfs
Hi All
We want to use XFS in all of our production servers but feel a little
scared about the corruption problems seen in this list. I wonder which
2.6.16+ kernel we can use in order to get a stable XFS? Thanks!
ps, one friend mentioned that XFS has some issue with LVM+MD under it.
Is this true?
Ming
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs
2006-07-17 15:30 stable xfs Ming Zhang
@ 2006-07-17 16:20 ` Peter Grandi
2006-07-18 22:36 ` Ming Zhang
2006-07-18 23:54 ` Nathan Scott
1 sibling, 1 reply; 33+ messages in thread
From: Peter Grandi @ 2006-07-17 16:20 UTC (permalink / raw)
To: Linux XFS
>>> On Mon, 17 Jul 2006 11:30:23 -0400, Ming Zhang
>>> <mingz@ele.uri.edu> said:
mingz> Hi All We want to use XFS in all of our production
mingz> servers but feel a little scared about the corruption
mingz> problems seen in this list. [ ... ]
XFS is complex but quite stable code. Most of the reports about
''corruption'' are consequences of not being aware of what it
was designed for, how it works and how it should be used...
* Re: stable xfs
2006-07-17 16:20 ` Peter Grandi
@ 2006-07-18 22:36 ` Ming Zhang
2006-07-18 23:14 ` Peter Grandi
0 siblings, 1 reply; 33+ messages in thread
From: Ming Zhang @ 2006-07-18 22:36 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux XFS
Thanks for your response.
But could you give me an example of what is an improper use?
Ming
On Mon, 2006-07-17 at 17:20 +0100, Peter Grandi wrote:
> >>> On Mon, 17 Jul 2006 11:30:23 -0400, Ming Zhang
> >>> <mingz@ele.uri.edu> said:
>
> mingz> Hi All We want to use XFS in all of our production
> mingz> servers but feel a little scared about the corruption
> mingz> problems seen in this list. [ ... ]
>
> XFS is complex but quite stable code. Most of the reports about
> ''corruption'' are consequences of not being aware of what it
> was designed for, how it works and how it should be used...
>
>
* Re: stable xfs
2006-07-18 22:36 ` Ming Zhang
@ 2006-07-18 23:14 ` Peter Grandi
2006-07-19 1:20 ` Ming Zhang
0 siblings, 1 reply; 33+ messages in thread
From: Peter Grandi @ 2006-07-18 23:14 UTC (permalink / raw)
To: Linux XFS
>>> On Tue, 18 Jul 2006 18:36:06 -0400, Ming Zhang
>>> <mingz@ele.uri.edu> said:
mingz> [ .. ] example of what is an improper use?
Well, this mailing list is full of them :-). However it is
easier to say what is an optimal use:
* A 64 bit system.
* With a large, parallel storage system.
* The block IO system handles all storage errors.
* With backups of the contents of the storage system.
In other words, an Altix in an enterprise computing room... :-)
Something like 64 bit systems running a UNIX-like OS, one system
for production and one for backup, each with some TiB of RAID10
storage, both with UPSes giving a significant amount of uptime,
and extensive hot swapping abilities. If you got that, XFS can
give really good performance quite safely.
My impression is that the design of XFS was based on a focus on
performance, at the file system level, via on-disk layout,
massive ''transactions'', and parallel IO requests, assuming
that the block IO subsystem handles every storage error issue
both transparently and gracefully.
It is _possible_, and may even be appropriate after carefully
thinking it through, to use XFS in a 32 bit system without UPS,
and with no storage system redundancy, and with device errors
not handled by the block IO system, and with little parallelism
in the storage subsystem; e.g. a SOHO desktop or server.
But then I have seen people building RAIDs stuffing in a couple
dozen drives from the same shipping box, so improper use of XFS
is definitely a second order issue at that kind of level :-).
* Re: stable xfs
2006-07-17 15:30 stable xfs Ming Zhang
2006-07-17 16:20 ` Peter Grandi
@ 2006-07-18 23:54 ` Nathan Scott
2006-07-19 1:15 ` Ming Zhang
2006-07-19 7:40 ` Martin Steigerwald
1 sibling, 2 replies; 33+ messages in thread
From: Nathan Scott @ 2006-07-18 23:54 UTC (permalink / raw)
To: Ming Zhang; +Cc: xfs
On Mon, Jul 17, 2006 at 11:30:23AM -0400, Ming Zhang wrote:
> Hi All
>
> We want to use XFS in all of our production servers but feel a little
> scared about the corruption problems seen in this list. I wonder which
> 2.6.16+ kernel we can use in order to get a stable XFS? Thanks!
Use the latest 2.6.17 -stable release, or a vendor kernel (SLES is
particularly good with XFS, as SGI works closely with SUSE).
The current batch of corruption reports is due to one unfortunate
bug that has slipped through our QA testing net, which happily is
a fairly rare occurrence (it was a very subtle bug).
XFS also tends to get a bad rap (IMO) from the way it reports on-disk
corruption and I/O errors in critical data structures, which is quite
different to many other filesystems - it dumps a stack trace into the
system log (a lot of people mistake that for a panic) and "shuts down"
the filesystem, with subsequent accesses returning errors until the
problem is resolved.
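When that happens, getting back to a working filesystem is routine. A sketch of the usual sequence (device name and mount point are hypothetical):

```shell
# After an XFS shutdown, accesses return errors until the fs is
# unmounted and checked.  Device and mount point are examples only.
umount /data                  # clear the shutdown state
mount /dev/sdb1 /data         # mounting replays the journal
umount /data                  # unmount again so xfs_repair can run
xfs_repair -n /dev/sdb1       # dry run: report problems, change nothing
xfs_repair /dev/sdb1          # repair for real if the dry run found damage
mount /dev/sdb1 /data
```

Mounting once before the repair matters because xfs_repair refuses to run against a dirty log.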
> ps, one friend mentioned that XFS has some issue with LVM+MD under it.
> Is this true?
No.
cheers.
--
Nathan
* Re: stable xfs
2006-07-18 23:54 ` Nathan Scott
@ 2006-07-19 1:15 ` Ming Zhang
2006-07-19 7:40 ` Martin Steigerwald
1 sibling, 0 replies; 33+ messages in thread
From: Ming Zhang @ 2006-07-19 1:15 UTC (permalink / raw)
To: Nathan Scott; +Cc: xfs
thanks a lot for this detailed explanation!
i will check both the 2.6.17 -stable release and the sles kernel. unfortunately,
i have only worked with RHEL so far.
Ming
On Wed, 2006-07-19 at 09:54 +1000, Nathan Scott wrote:
> On Mon, Jul 17, 2006 at 11:30:23AM -0400, Ming Zhang wrote:
> > Hi All
> >
> > We want to use XFS in all of our production servers but feel a little
> > scared about the corruption problems seen in this list. I wonder which
> > 2.6.16+ kernel we can use in order to get a stable XFS? Thanks!
>
> Use the latest 2.6.17 -stable release, or a vendor kernel (SLES is
> particularly good with XFS, as SGI works closely with SUSE).
>
> The current batch of corruption reports is due to one unfortunate
> bug that has slipped through our QA testing net, which happily is
> a fairly rare occurrence (it was a very subtle bug).
>
> XFS also tends to get a bad rap (IMO) from the way it reports on-disk
> corruption and I/O errors in critical data structures, which is quite
> different to many other filesystems - it dumps a stack trace into the
> system log (a lot of people mistake that for a panic) and "shuts down"
> the filesystem, with subsequent accesses returning errors until the
> problem is resolved.
>
> > ps, one friend mentioned that XFS has some issue with LVM+MD under it.
> > Is this true?
>
> No.
>
> cheers.
>
* Re: stable xfs
2006-07-18 23:14 ` Peter Grandi
@ 2006-07-19 1:20 ` Ming Zhang
2006-07-19 5:56 ` Chris Wedgwood
2006-07-19 10:24 ` Peter Grandi
0 siblings, 2 replies; 33+ messages in thread
From: Ming Zhang @ 2006-07-19 1:20 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux XFS
On Wed, 2006-07-19 at 00:14 +0100, Peter Grandi wrote:
> >>> On Tue, 18 Jul 2006 18:36:06 -0400, Ming Zhang
> >>> <mingz@ele.uri.edu> said:
>
> mingz> [ .. ] example of what is an improper use?
>
> Well, this mailing list is full of them :-). However it is
> easier to say what is an optimal use:
>
> * A 64 bit system.
> * With a large, parallel storage system.
when u say large parallel storage system, you mean independent spindles
right? but most people will have all disks configured in one RAID5/6 and
thus it is not parallel any more.
> * The block IO system handles all storage errors.
so current MD/LVM/SATA/SCSI layers are not good enough?
> * With backups of the contents of the storage system.
>
> In other words, an Altix in an enterprise computing room... :-)
just kidding, are you an SGI salesman? ;)
>
> Something like 64 bit systems running a UNIX-like OS, one system
> for production and one for backup, each with some TiB of RAID10
> storage, both with UPSes giving a significant amount of uptime,
> and extensive hot swapping abilities. If you got that, XFS can
> give really good performance quite safely.
>
> My impression is that the design of XFS was based on a focus on
> performance, at the file system level, via on-disk layout,
> massive ''transactions'', and parallel IO requests, assuming
> that the block IO subsystem handles every storage error issue
> both transparently and gracefully.
>
> It is _possible_, and may even be appropriate after carefully
> thinking it through, to use XFS in a 32 bit system without UPS,
> and with no storage system redundancy, and with device errors
> not handled by the block IO system, and with little parallelism
> in the storage subsystem; e.g. a SOHO desktop or server.
i think with write barrier support, a system without UPS should be ok.
considering even if u have UPS, a kernel oops in other parts still can
take the FS down.
>
> But then I have seen people building RAIDs stuffing in a couple
> dozen drives from the same shipping box, so improper use of XFS
> is definitely a second order issue at that kind of level :-).
>
>
* Re: stable xfs
2006-07-19 1:20 ` Ming Zhang
@ 2006-07-19 5:56 ` Chris Wedgwood
2006-07-19 10:53 ` Peter Grandi
2006-07-19 14:10 ` Ming Zhang
2006-07-19 10:24 ` Peter Grandi
1 sibling, 2 replies; 33+ messages in thread
From: Chris Wedgwood @ 2006-07-19 5:56 UTC (permalink / raw)
To: Ming Zhang; +Cc: Peter Grandi, Linux XFS
On Tue, Jul 18, 2006 at 09:20:44PM -0400, Ming Zhang wrote:
> when u say large parallel storage system, you mean independent
> spindles right? but most people will have all disks configured in
> one RAID5/6 and thus it is not parallel any more.
it depends, you might have 100s of spindles in groups, you don't make
a giant raid5/6 array with that many disks, you make a number of
smaller arrays
> i think with write barrier support, a system without UPS should be ok.
with barrier support a UPS shouldn't be necessary
> considering even if u have UPS, a kernel oops in other parts still can
> take the FS down.
but a crash won't cause writes to be 'reordered'
reordering is bad because the fs pushes writes down in a manner that
means when it comes back it will be able to make itself consistent,
so if you have a number of writes pending and some of them are lost,
and those that are lost are not the most recent writes because of
reordering, you can end up with a corrupt fs
* Re: stable xfs
2006-07-18 23:54 ` Nathan Scott
2006-07-19 1:15 ` Ming Zhang
@ 2006-07-19 7:40 ` Martin Steigerwald
2006-07-19 14:11 ` Ming Zhang
1 sibling, 1 reply; 33+ messages in thread
From: Martin Steigerwald @ 2006-07-19 7:40 UTC (permalink / raw)
To: Nathan Scott; +Cc: Ming Zhang, xfs
Am Mittwoch, 19. Juli 2006 01:54 schrieb Nathan Scott:
> On Mon, Jul 17, 2006 at 11:30:23AM -0400, Ming Zhang wrote:
> > Hi All
> >
> > We want to use XFS in all of our production servers but feel a little
> > scared about the corruption problems seen in this list. I wonder which
> > 2.6.16+ kernel we can use in order to get a stable XFS? Thanks!
>
> Use the latest 2.6.17 -stable release, or a vendor kernel (SLES is
> particularly good with XFS, as SGI works closely with SUSE).
Hello Nathan,
as far as I can see the fix for kernel bug #6757 has not yet made it into a
stable kernel release up to 2.6.17.6 and thus should be applied manually:
http://bugzilla.kernel.org/show_bug.cgi?id=6757
It probably doesn't happen for lots of people but I would still apply that
patch until it is finally put into a stable point release.
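For anyone who has not carried a standalone fix before, the mechanics are just diff and patch. A self-contained toy demonstration (file names and the one-line change are invented, not the actual #6757 fix):

```shell
# Build a fake "upstream" (a/) and "fixed" (b/) pair, then a kernel-style
# -p1 patch, and apply it to a checkout.  All names are illustrative.
mkdir -p a b tree
printf 'int threshold = 1;\n' > a/xfs_buf.c
printf 'int threshold = 2;\n' > b/xfs_buf.c
diff -u a/xfs_buf.c b/xfs_buf.c > xfs-fix.patch || true  # diff exits 1 when files differ
cp a/xfs_buf.c tree/                         # "tree" stands in for the kernel source
( cd tree && patch -p1 < ../xfs-fix.patch )  # -p1 strips the a/ and b/ prefixes
cat tree/xfs_buf.c                           # now contains the fixed line
```

Against a real kernel tree the last step would be `cd linux-2.6.17.6 && patch -p1 < fix.patch`, followed by a rebuild.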
Regards,
--
Martin Steigerwald - team(ix) GmbH - http://www.teamix.de
gpg: 19E3 8D42 896F D004 08AC A0CA 1E10 C593 0399 AE90
* Re: stable xfs
2006-07-19 1:20 ` Ming Zhang
2006-07-19 5:56 ` Chris Wedgwood
@ 2006-07-19 10:24 ` Peter Grandi
2006-07-19 13:11 ` Ming Zhang
1 sibling, 1 reply; 33+ messages in thread
From: Peter Grandi @ 2006-07-19 10:24 UTC (permalink / raw)
To: Linux XFS
>>> On Tue, 18 Jul 2006 21:20:44 -0400, Ming Zhang <mingz@ele.uri.edu> said:
[ ... ]
mingz> when u say large parallel storage system, you mean
mingz> independent spindles right? but most people will have all
mingz> disks configured in one RAID5/6 and thus it is not
mingz> parallel any more.
As I was saying...
pg> Most of the reports about ''corruption'' are consequences
pg> of not being aware of what it was designed for, how it
pg> works and how it should be used...
mingz> [ .. ] example of what is an improper use?
pg> Well, this mailing list is full of them :-).
pg> But then I have seen people building RAIDs stuffing in a
pg> couple dozen drives from the same shipping box, [ ... ]
:-)
BTW as to these:
* A 64 bit system.
* With a large, parallel storage system.
* The block IO system handles all storage errors.
* With backups of the contents of the storage system.
I forgot a very essential one:
* With lots of RAM, size proportional to that of the largest filesystem.
[ ... ]
* Re: stable xfs
2006-07-19 5:56 ` Chris Wedgwood
@ 2006-07-19 10:53 ` Peter Grandi
2006-07-19 14:45 ` Ming Zhang
2006-07-20 6:12 ` Chris Wedgwood
2006-07-19 14:10 ` Ming Zhang
1 sibling, 2 replies; 33+ messages in thread
From: Peter Grandi @ 2006-07-19 10:53 UTC (permalink / raw)
To: Linux XFS
[ ... ]
mingz> when u say large parallel storage system, you mean
mingz> independent spindles right? but most people will have all
mingz> disks configured in one RAID5/6 and thus it is not parallel
mingz> any more.
cw> it depends, you might have 100s of spindles in groups, you
cw> don't make a giant raid5/6 array with that many disks, you
cw> make a number of smaller arrays
Perhaps you are underestimating the ''if it can be done''
mindset...
Also, if one does a number of smaller RAID5s, is each one a
separate filesystem or do they get aggregated, for example with
LVM with ''concat''? Either way, how likely is it that the
consequences have been thought through?
I would personally hesitate to recommend either, especially a
two-level arrangement where the base level is a RAID5.
[I am making an effort in this discussion to use euphemisms]
mingz> i think with write barrier support, system without UPS
mingz> should be ok.
cw> with barrier support a UPS shouldn't be necessary
Sure, «should» and «shouldn't» are nice hopeful concepts.
But write barriers are difficult to achieve, and when achieved
they are often unreliable, except on enterprise level hardware,
because many disks/host adapters/... simply lie as to whether
they have actually started writing (never mind finished writing,
or written correctly) stuff.
To get reliable write barriers one often has to source special
cards or disks with custom firmware; or leave system integration
to the big expensive guys and buy an Altix or equivalent system
from Sun or IBM.
Besides I have seen many reports of ''corruption'' that cannot
be fixed by write barriers: many have the expectation that
*data* should not be lost, even if no 'fsync' is done, *as if*
'mount -o sync' or 'mount -o data=ordered' were in effect.
Of course that is a bit of an inflated expectation, but all that
the vast majority of sysadms care about is whether it ''just
works'', without ''wasting time'' figuring things out.
mingz> considering even u have UPS, kernel oops in other parts
mingz> still can take the FS down.
cw> but a crash won't cause writes to be 'reordered' [ ... ]
The metadata will be consistent, but metadata and data may well
be lost. So the filesystem is still ''corrupted'', at least
from the point of view of a sysadm who just wants the filesystem
to be effortlessly foolproof. Anyhow, if a crash happens all
bets are off, because who knows *what* gets written.
Look at it from the point of view of a ''practitioner'' sysadm:
''who cares if the metadata is consistent, if my 3TiB
application database is unusable (and I don't do backups
because after all it is a concat of RAID5s, backups are not
necessary) as there is a huge gap in some data file, and my
users are yelling at me, and it is not my fault''
The tradeoff in XFS is that if you know exactly what you are
doing you get extra performance...
* Re: stable xfs
2006-07-19 10:24 ` Peter Grandi
@ 2006-07-19 13:11 ` Ming Zhang
2006-07-20 6:15 ` Chris Wedgwood
2006-07-22 15:37 ` Peter Grandi
0 siblings, 2 replies; 33+ messages in thread
From: Ming Zhang @ 2006-07-19 13:11 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux XFS
On Wed, 2006-07-19 at 11:24 +0100, Peter Grandi wrote:
> >>> On Tue, 18 Jul 2006 21:20:44 -0400, Ming Zhang <mingz@ele.uri.edu> said:
>
> [ ... ]
>
> mingz> when u say large parallel storage system, you mean
> mingz> independent spindles right? but most people will have all
> mingz> disks configured in one RAID5/6 and thus it is not
> mingz> parallel any more.
>
> As I was saying...
>
> pg> Most of the reports about ''corruption'' are consequences
> pg> of not being aware of what it was designed for, how it
> pg> works and how it should be used...
>
> mingz> [ .. ] example of what is an improper use?
> pg> Well, this mailing list is full of them :-).
>
> pg> But then I have seen people building RAIDs stuffing in a
> pg> couple dozen drives from the same shipping box, [ ... ]
>
> :-)
>
> BTW as to these:
>
> * A 64 bit system.
> * With a large, parallel storage system.
> * The block IO system handles all storage errors.
> * With backups of the contents of the storage system.
>
> I forgot a very essential one:
>
> * With lots of RAM, size proportional to that of the largest filesystem.
>
> [ ... ]
>
what kind of "ram vs fs" size ratio here will be a safe/good/proper one?
any rule of thumb? thanks!
hope not 1:1. :)
Ming
* Re: stable xfs
2006-07-19 5:56 ` Chris Wedgwood
2006-07-19 10:53 ` Peter Grandi
@ 2006-07-19 14:10 ` Ming Zhang
1 sibling, 0 replies; 33+ messages in thread
From: Ming Zhang @ 2006-07-19 14:10 UTC (permalink / raw)
To: Chris Wedgwood; +Cc: Peter Grandi, Linux XFS
On Tue, 2006-07-18 at 22:56 -0700, Chris Wedgwood wrote:
> On Tue, Jul 18, 2006 at 09:20:44PM -0400, Ming Zhang wrote:
>
> > when u say large parallel storage system, you mean independent
> > spindles right? but most people will have all disks configured in
> > one RAID5/6 and thus it is not parallel any more.
>
> it depends, you might have 100s of spindles in groups, you don't make
> a giant raid5/6 array with that many disks, you make a number of
> smaller arrays
right
>
> > i think with write barrier support, a system without UPS should be ok.
>
> with barrier support a UPS shouldn't be necessary
>
> > considering even if u have UPS, a kernel oops in other parts still can
> > take the FS down.
>
i mean with UPS and huge write cache, but no write barrier.
> but a crash won't cause writes to be 'reordered'
>
>
> reordering is bad because the fs pushes writes down in a manner that
> means when it comes back it will be able to make itself consistent,
> so if you have a number of writes pending and some of them are lost,
> and those that are lost are not the most recent writes because of
> reordering, you can end up with a corrupt fs
* Re: stable xfs
2006-07-19 7:40 ` Martin Steigerwald
@ 2006-07-19 14:11 ` Ming Zhang
0 siblings, 0 replies; 33+ messages in thread
From: Ming Zhang @ 2006-07-19 14:11 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: Nathan Scott, xfs
yes. thx for reminding.
Ming
On Wed, 2006-07-19 at 09:40 +0200, Martin Steigerwald wrote:
> Am Mittwoch, 19. Juli 2006 01:54 schrieb Nathan Scott:
> > On Mon, Jul 17, 2006 at 11:30:23AM -0400, Ming Zhang wrote:
> > > Hi All
> > >
> > > We want to use XFS in all of our production servers but feel a little
> > > scared about the corruption problems seen in this list. I wonder which
> > > 2.6.16+ kernel we can use in order to get a stable XFS? Thanks!
> >
> > Use the latest 2.6.17 -stable release, or a vendor kernel (SLES is
> > particularly good with XFS, as SGI works closely with SUSE).
>
> Hello Nathan,
>
> as far as I can see the fix for kernel bug #6757 has not yet made it into a
> stable kernel release up to 2.6.17.6 and thus should be applied manually:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=6757
>
> It probably doesn't happen for lots of people but I would still apply that
> patch until it is finally put into a stable point release.
>
> Regards,
* Re: stable xfs
2006-07-19 10:53 ` Peter Grandi
@ 2006-07-19 14:45 ` Ming Zhang
2006-07-22 17:13 ` Peter Grandi
2006-07-20 6:12 ` Chris Wedgwood
1 sibling, 1 reply; 33+ messages in thread
From: Ming Zhang @ 2006-07-19 14:45 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux XFS
On Wed, 2006-07-19 at 11:53 +0100, Peter Grandi wrote:
> [ ... ]
>
> mingz> when u say large parallel storage system, you mean
> mingz> independent spindles right? but most people will have all
> mingz> disks configured in one RAID5/6 and thus it is not parallel
> mingz> any more.
>
> cw> it depends, you might have 100s of spindles in groups, you
> cw> don't make a giant raid5/6 array with that many disks, you
> cw> make a number of smaller arrays
>
> Perhaps you are underestimating the ''if it can be done''
> mindset...
>
> Also, if one does a number of smaller RAID5s, is each one a
> separate filesystem or do they get aggregated, for example with
> LVM with ''concat''? Either way, how likely is it that the
> consequences have been thought through?
>
> I would personally hesitate to recommend either, especially a
> two-level arrangement where the base level is a RAID5.
could u give us some hints on this? since it is really popular to have a
FS/LV/MD structure and I believe LVM is designed for this purpose.
>
> [I am making an effort in this discussion to use euphemisms]
>
> mingz> i think with write barrier support, a system without UPS
> mingz> should be ok.
>
> cw> with barrier support a UPS shouldn't be necessary
>
> Sure, «should» and «shouldn't» are nice hopeful concepts.
>
> But write barriers are difficult to achieve, and when achieved
> they are often unreliable, except on enterprise level hardware,
> because many disks/host adapters/... simply lie as to whether
> they have actually started writing (never mind finished writing,
> or written correctly) stuff.
>
> To get reliable write barriers one often has to source special
> cards or disks with custom firmware; or leave system integration
> to the big expensive guys and buy an Altix or equivalent system
> from Sun or IBM.
>
> Besides I have seen many reports of ''corruption'' that cannot
> be fixed by write barriers: many have the expectation that
> *data* should not be lost, even if no 'fsync' is done, *as if*
> 'mount -o sync' or 'mount -o data=ordered' were in effect.
>
> Of course that is a bit of an inflated expectation, but all that
> the vast majority of sysadms care about is whether it ''just
> works'', without ''wasting time'' figuring things out.
>
> mingz> considering even if u have UPS, a kernel oops in other parts
> mingz> still can take the FS down.
>
> cw> but a crash won't cause writes to be 'reordered' [ ... ]
>
> The metadata will be consistent, but metadata and data may well
> be lost. So the filesystem is still ''corrupted'', at least
> from the point of view of a sysadm who just wants the filesystem
> to be effortlessly foolproof. Anyhow, if a crash happens all
> bets are off, because who knows *what* gets written.
>
> Look at it from the point of view of a ''practitioner'' sysadm:
>
> ''who cares if the metadata is consistent, if my 3TiB
> application database is unusable (and I don't do backups
> because after all it is a concat of RAID5s, backups are not
> necessary) as there is a huge gap in some data file, and my
> users are yelling at me, and it is not my fault''
>
> The tradeoff in XFS is that if you know exactly what you are
> doing you get extra performance...
then i think unless you disable all write cache, none of the file
systems can achieve this goal. or maybe ext3 with both data and metadata
journaled might do this?
Ming
* Re: stable xfs
2006-07-19 10:53 ` Peter Grandi
2006-07-19 14:45 ` Ming Zhang
@ 2006-07-20 6:12 ` Chris Wedgwood
2006-07-22 17:31 ` Peter Grandi
1 sibling, 1 reply; 33+ messages in thread
From: Chris Wedgwood @ 2006-07-20 6:12 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux XFS
On Wed, Jul 19, 2006 at 11:53:24AM +0100, Peter Grandi wrote:
> But write barriers are difficult to achieve, and when achieved they
> are often unreliable, except on enterprise level hardware, because
> many disks/host adapters/... simply lie as to whether they have
> actually started writing (never mind finished writing, or written
> correctly) stuff.
IDE/SATA doesn't have a barrier operation to lie about (the kernel has to flush
and wait in those cases).
> The metadata will be consistent, but metadata and data may well will
> be lost. So the filesystem is still ''corrupted'', at least from the
> point of view of a sysadm who just wants the filesystem to be
> effortlessly foolproof.
Sanely written applications shouldn't lose data.
> Look at it from the point of view of a ''practitioner'' sysadm:
>
> ''who cares if the metadata is consistent, if my 3TiB
> application database is unusable (and I don't do backups
any sane database should be safe, it will fsync or similar as needed
this is also true for sane MTAs
i've actually tested situations where transactions were in flight and
i've dropped power on a rack of disks and verified that when it came
up all transactions that we claimed to have completed really did
i've also done lesser things with SATA disks and email and it usually
turns out to also be reliable for the most part
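The property Chris is relying on can be shown even at the shell level; a small sketch (file names invented). dd's conv=fsync calls fsync() on the output file before exiting, the same call a database or MTA makes before acknowledging a transaction:

```shell
# Write a "transaction record" and force it to stable storage before
# claiming it is committed; only then is it safe to acknowledge.
printf 'txn 42 committed\n' > wal.tmp
dd if=wal.tmp of=wal.log conv=fsync 2>/dev/null
grep 'txn 42' wal.log   # the record has been flushed (barring lying drives)
```

If power drops after the dd returns, the record is expected to survive, which is exactly the guarantee the power-drop tests above were verifying.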
* Re: stable xfs
2006-07-19 13:11 ` Ming Zhang
@ 2006-07-20 6:15 ` Chris Wedgwood
2006-07-20 14:08 ` Ming Zhang
2006-07-22 15:37 ` Peter Grandi
1 sibling, 1 reply; 33+ messages in thread
From: Chris Wedgwood @ 2006-07-20 6:15 UTC (permalink / raw)
To: Ming Zhang; +Cc: Peter Grandi, Linux XFS
On Wed, Jul 19, 2006 at 09:11:10AM -0400, Ming Zhang wrote:
> what kind of "ram vs fs" size ratio here will be a safe/good/proper
> one?
it depends very much on what you are doing
> any rule of thumb? thanks!
>
> hope not 1:1. :)
i recently dealt with a corrupted filesystem that xfs_repair needed over
1GB to deal with --- the kicker is the filesystem was only 20GB, so
that's 20:1 for xfs_repair
i suspect that was anomalous though and that some bug or quirk of
their fs caused xfs_repair to behave badly (that said, i'd hate to have
to repair an 8TB fs full of maildir email boxes, which i know some
people have)
* Re: stable xfs
2006-07-20 6:15 ` Chris Wedgwood
@ 2006-07-20 14:08 ` Ming Zhang
2006-07-20 16:17 ` Chris Wedgwood
2006-07-22 17:47 ` Peter Grandi
0 siblings, 2 replies; 33+ messages in thread
From: Ming Zhang @ 2006-07-20 14:08 UTC (permalink / raw)
To: Chris Wedgwood; +Cc: Peter Grandi, Linux XFS
On Wed, 2006-07-19 at 23:15 -0700, Chris Wedgwood wrote:
> On Wed, Jul 19, 2006 at 09:11:10AM -0400, Ming Zhang wrote:
>
> > what kind of "ram vs fs" size ratio here will be a safe/good/proper
> > one?
>
> it depends very much on what you are doing
we mainly handle large media files like 20-50GB. so the number of files
is not too large, but the file sizes are.
hope i never need to run repair, but i do need to defrag from time to
time.
>
> > any rule of thumb? thanks!
> >
> > hope not 1:1. :)
>
> i recent dealt with a corrupted filesystem that xfs_repair needed over
> 1GB to deal with --- the kicker is the filesystem was only 20GB, so
> that's 20:1 for xfs_repair
hope this does not hold true for a 15x750GB SATA raid5. ;)
>
> i suspect that was anomalous though and that some bug or quirk of
> their fs cause xfs_repair to behave badly (that said, i'd had to have
> to repair an 8TB fs fill of maildir email boxes, which i know some
> people have)
ps, also another question brought up while reading this thread.
say XFS can make use of parallel storage by using multiple allocation
groups. but XFS needs to be built over one block device. so if i have 4
smaller raids, i have to use LVM to glue them together before i create
XFS over them, right? but then u said XFS over LVM or N MDs is not good?
Ming
* Re: stable xfs
2006-07-20 14:08 ` Ming Zhang
@ 2006-07-20 16:17 ` Chris Wedgwood
2006-07-20 16:38 ` Ming Zhang
2006-07-22 17:47 ` Peter Grandi
1 sibling, 1 reply; 33+ messages in thread
From: Chris Wedgwood @ 2006-07-20 16:17 UTC (permalink / raw)
To: Ming Zhang; +Cc: Peter Grandi, Linux XFS
On Thu, Jul 20, 2006 at 10:08:22AM -0400, Ming Zhang wrote:
> we mainly handle large media files like 20-50GB. so the number of
> files is not too large, but the file sizes are.
xfs_repair usually deals with that fairly well in reality (much better
than lots of small files anyhow)
> hope i never need to run repair, but i do need to defrag from time
> to time.
if you preallocate you can avoid that (this is what i do, i
preallocate in the replication daemon)
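For applications one cannot modify, the same effect is available from outside via xfsprogs; a hedged sketch (path and size invented, and resvsp only works on XFS):

```shell
# Reserve 2 GiB of on-disk extents for the file up front, without
# writing any data; later appends then fill the reservation instead
# of fragmenting across the filesystem.
xfs_io -f -c 'resvsp 0 2g' /media/incoming/feed.mpg
```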
> hope this does not hold true for a 15x750GB SATA raid5. ;)
that's ~10TB or so, my guess is that a repair there would take some
GBs of ram
it would be interesting to test it if you had the time
there is a 'formula' for working out how much ram is needed roughly
(steve lord posted it a long time ago, hopefully someone can find that
and repost it)
> say XFS can make use of parallel storage by using multiple allocation
> groups. but XFS needs to be built over one block device. so if i have 4
> smaller raids, i have to use LVM to glue them together before i create
> XFS over them, right? but then u said XFS over LVM or N MDs is not
> good?
with recent kernels it shouldn't be a problem, the recursive nature of
the block layer changed so you no longer blow up as badly as people
did in the past (also, XFS tends to use less stack these days)
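So the layering Ming asks about would look roughly like this; a sketch only, with hypothetical device names (in practice one would also pick stripe-aligned mkfs.xfs options):

```shell
# Four smaller RAID5 arrays, glued with a linear LVM volume, one XFS on top.
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
# ... create /dev/md1, /dev/md2, /dev/md3 the same way ...
pvcreate /dev/md0 /dev/md1 /dev/md2 /dev/md3
vgcreate mediavg /dev/md0 /dev/md1 /dev/md2 /dev/md3
lvcreate -l 100%FREE -n data mediavg       # linear ("concat") allocation by default
mkfs.xfs -d agcount=16 /dev/mediavg/data   # several AGs per underlying array
mount /dev/mediavg/data /data
```

With multiple allocation groups spread across the arrays, XFS can keep the four RAID5s busy in parallel even though it sees a single block device.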
* Re: stable xfs
2006-07-20 16:17 ` Chris Wedgwood
@ 2006-07-20 16:38 ` Ming Zhang
2006-07-20 19:04 ` Chris Wedgwood
2006-07-22 18:09 ` Peter Grandi
0 siblings, 2 replies; 33+ messages in thread
From: Ming Zhang @ 2006-07-20 16:38 UTC (permalink / raw)
To: Chris Wedgwood; +Cc: Peter Grandi, Linux XFS
On Thu, 2006-07-20 at 09:17 -0700, Chris Wedgwood wrote:
> On Thu, Jul 20, 2006 at 10:08:22AM -0400, Ming Zhang wrote:
>
> > we mainly handle large media files like 20-50GB. so the number of
> > files is not too large, but the file sizes are.
>
> xfs_repair usually deals with that fairly well in reality (much better
> than lots of small files anyhow)
sounds cool. yes, large # of small files are always painful.
>
> > hope i never need to run repair, but i do need to defrag from time
> > to time.
>
> if you preallocate you can avoid that (this is what i do, i
> preallocate in the replication daemon)
i cannot control my application, so i still need to defrag from time
to time.
>
> > hope this does not hold true for a 15x750GB SATA raid5. ;)
>
> that's ~10TB or so, my guess is that a repair there would take some
> GBs of ram
>
> it would be interesting to test it if you had the time
yes. i should find out. how to force a repair? unplug my power cord? ;)
>
> there is a 'formula' for working out how much ram is needed roughly
> (steve lord posted it a long time ago, hopefully someone can find that
> and repost it)
>
> > say XFS can make use of parallel storage by using multiple
> > allocation groups. but XFS needs to be built over one block
> > device. so if i have 4 smaller raids, i have to use LVM to glue them
> > before i create XFS over it, right? but then you said XFS over LVM or N
> > MD is not good?
>
> with recent kernels it shouldn't be a problem, the recursive nature of
> the block layer changed so you no longer blow up as badly as people
> did in the past (also, XFS tends to use less stack these days)
sounds cool.
* Re: stable xfs
2006-07-20 16:38 ` Ming Zhang
@ 2006-07-20 19:04 ` Chris Wedgwood
2006-07-21 0:19 ` Ming Zhang
2006-07-22 18:09 ` Peter Grandi
1 sibling, 1 reply; 33+ messages in thread
From: Chris Wedgwood @ 2006-07-20 19:04 UTC (permalink / raw)
To: Ming Zhang; +Cc: Peter Grandi, Linux XFS
On Thu, Jul 20, 2006 at 12:38:01PM -0400, Ming Zhang wrote:
> i could not control my application. so i still need to do defrag
> some time.
one thing that irks me about fsr is that unless it's given path
elements, the files created to replace the fragmented file are
usually not allocated close to the original file (they are opened by
handle after a bulkstat pass), so you tend to scatter your files about
if you're not careful
also, fsr implies doing a lot more work on the whole, writing, reading
and rewriting the files in most cases, and because it uses dio it will
invalidate the page-cache of any files that might be being read from
when it's running
> yes. i should find out. how do i force a repair?
umount cleanly and run xfs_repair, check to see how much memory it
uses with ps/top/whatever as it's running
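fwiw, the ps/top watching can be scripted; a minimal sketch (assumes
linux /proc, and the `peak_rss_kib` helper name and its polling
interval are made up for illustration):

```python
import subprocess
import time

def peak_rss_kib(cmd):
    """Run cmd and report its peak resident set size in KiB.

    Polls /proc/<pid>/status for VmHWM, the kernel's own high-water
    mark, so a coarse polling interval is enough to catch the value
    before the process exits.
    """
    proc = subprocess.Popen(cmd)
    peak = 0
    while proc.poll() is None:
        try:
            with open("/proc/%d/status" % proc.pid) as f:
                for line in f:
                    if line.startswith("VmHWM:"):
                        peak = max(peak, int(line.split()[1]))  # value is in kB
        except FileNotFoundError:
            break  # process exited between poll() and open()
        time.sleep(0.2)
    proc.wait()
    return peak

# e.g. peak_rss_kib(["xfs_repair", "-n", "/dev/sdX1"]) on an unmounted volume
```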
> unplug my power cord? ;)
raid protects against failed disks; it usually doesn't protect well
against corruption from lost/bad writes as a result of dropping power.
so, if you have backups, sure, go for it
* Re: stable xfs
2006-07-20 19:04 ` Chris Wedgwood
@ 2006-07-21 0:19 ` Ming Zhang
2006-07-21 3:26 ` Chris Wedgwood
0 siblings, 1 reply; 33+ messages in thread
From: Ming Zhang @ 2006-07-21 0:19 UTC (permalink / raw)
To: Chris Wedgwood; +Cc: Peter Grandi, Linux XFS
On Thu, 2006-07-20 at 12:04 -0700, Chris Wedgwood wrote:
> On Thu, Jul 20, 2006 at 12:38:01PM -0400, Ming Zhang wrote:
>
> > i could not control my application. so i still need to do defrag
> > some time.
>
> one thing that irks me about fsr is that unless it's given path
> elements, the files created to replace the fragmented file are
> usually not allocated close to the original file (they are opened by
> handle after a bulkstat pass), so you tend to scatter your files about
> if you're not careful
what will be the side effect of this scattering? do you want particular
files in particular places?
>
> also, fsr implies doing a lot more work on the whole, writing, reading
> and rewriting the files in most cases and because it uses dio it will
> invalidate the page-cache of any files that might be being read-from
> when it's running
one thing i worry about with fsr: if a power loss happens while fsr is
running, can xfs handle this well?
i will backup before trying these. need some time. ;)
>
> > yes. i should find out. how do i force a repair?
>
> umount cleanly and run xfs_repair, check to see how much memory it
> uses with ps/top/whatever as it's running
>
> > unplug my power cord? ;)
>
> raid protects against failed disks; it usually doesn't protect well
> against corruption from lost/bad writes as a result of dropping power.
> so, if you have backups, sure, go for it
* Re: stable xfs
2006-07-21 0:19 ` Ming Zhang
@ 2006-07-21 3:26 ` Chris Wedgwood
2006-07-21 13:10 ` Ming Zhang
0 siblings, 1 reply; 33+ messages in thread
From: Chris Wedgwood @ 2006-07-21 3:26 UTC (permalink / raw)
To: Ming Zhang; +Cc: Peter Grandi, Linux XFS
On Thu, Jul 20, 2006 at 08:19:38PM -0400, Ming Zhang wrote:
> what will be the side effect of this scattering?
there is a desire in some cases to have files in the same directory
close to each other on disk
> one thing i worry about with fsr: if a power loss happens while fsr
> is running, can xfs handle this well?
yes, fsr creates a temporary file, unlinks it, copies the extents over,
and does an atomic swap-extents-if-nothing-changed operation
* Re: stable xfs
2006-07-21 3:26 ` Chris Wedgwood
@ 2006-07-21 13:10 ` Ming Zhang
2006-07-21 16:07 ` Chris Wedgwood
0 siblings, 1 reply; 33+ messages in thread
From: Ming Zhang @ 2006-07-21 13:10 UTC (permalink / raw)
To: Chris Wedgwood; +Cc: Peter Grandi, Linux XFS
On Thu, 2006-07-20 at 20:26 -0700, Chris Wedgwood wrote:
> On Thu, Jul 20, 2006 at 08:19:38PM -0400, Ming Zhang wrote:
>
> > what will be the side effect about this scattering?
>
> there is a desire in some cases to have files in the same directory
> close to each other on disk
then what is the benefit? because files under the same dir tend to be
accessed together, so placing them close reduces disk head seeks? other
than this, what other benefit is there?
>
> > one thing i worry about fsr is when do fsr and some power loss
> > events happen, can xfs handle this well?
>
> yes, fsr creates a temporary file, unlinks it, copies the extents over,
> and does an atomic swap-extents-if-nothing-changed operation
so if i have a 500GB file, will it be copied to another 500GB temp file?
sounds scary to me.
Ming
* Re: stable xfs
2006-07-21 13:10 ` Ming Zhang
@ 2006-07-21 16:07 ` Chris Wedgwood
2006-07-21 17:00 ` Ming Zhang
0 siblings, 1 reply; 33+ messages in thread
From: Chris Wedgwood @ 2006-07-21 16:07 UTC (permalink / raw)
To: Ming Zhang; +Cc: Peter Grandi, Linux XFS
On Fri, Jul 21, 2006 at 09:10:31AM -0400, Ming Zhang wrote:
> then what is the benefit? because files under the same dir tend to be
> accessed together, so placing them close reduces disk head seeks?
yes
> other than this, what else benefit?
that alone has a measurable benefit to me (i have an overlay
filesystem over many smaller 400 to 500GB filesystems so i don't get
the benefit of many spindles to reduce average seek times)
> so if i have 500GB file, will it be copied to another 500GB temp
> file?
yes, which in many cases isn't always desirable because:
* if the file had a small number of extents in the first place,
reducing them slightly more isn't much of a gain (ie. going from
say 11 to 10 is arguably pointless) (i have a patch to specify
the minimum gains before doing the copy somewhere)
* if the file changes during the copy, then it will be skipped until
next time, for larger files this is problematic, you could
argue attempting to fsr a file that is less than <n> seconds old
is pointless as it has a high chance of being active (i have a
patch for that too)
* fsr has no global overview of what it's doing, so it never does
things like 'move this file out of the way to make room for this
one' (it can't do this w/o assistance right now), and of course it
can't move inodes w/o changing them so there are limits to what
can be done anyhow
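the 'minimum gains' idea in the first point is just a policy predicate;
a sketch of the decision (not the actual patch -- the function name and
thresholds here are invented for illustration):

```python
def worth_defragmenting(extents_now, extents_after, min_ratio=2.0, min_extents=4):
    """Decide whether a copy-based defrag pays off.

    Skip files that are already nearly contiguous (going from 11 extents
    to 10 is arguably pointless): require both a minimum extent count and
    a minimum improvement ratio before doing the expensive full copy.
    """
    if extents_now < min_extents:
        return False
    return extents_now / max(extents_after, 1) >= min_ratio

# 11 -> 10 extents: not worth rewriting the whole file
# 40 -> 3 extents: clearly worth it
```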
* Re: stable xfs
2006-07-21 16:07 ` Chris Wedgwood
@ 2006-07-21 17:00 ` Ming Zhang
2006-07-21 18:07 ` Chris Wedgwood
0 siblings, 1 reply; 33+ messages in thread
From: Ming Zhang @ 2006-07-21 17:00 UTC (permalink / raw)
To: Chris Wedgwood; +Cc: Peter Grandi, Linux XFS
On Fri, 2006-07-21 at 09:07 -0700, Chris Wedgwood wrote:
> On Fri, Jul 21, 2006 at 09:10:31AM -0400, Ming Zhang wrote:
>
> > then what is the benefit? because files under same dir can be accessed
> > with locality so put close will reduce disk head seek?
>
> yes
>
> > other than this, what else benefit?
>
> that alone has a measurable benefit to me (i have an overlay
> filesystem over many smaller 400 to 500GB filesystems so i don't get
> the benefit of many spindles to reduce average seek times)
what do you mean by an overlay fs over small fs? like a unionfs?
>
> > so if i have 500GB file, will it be copied to another 500GB temp
> > file?
>
but other than fsr, there is no better way for this, right?
of course, preallocation is always good. but i do not have control over
applications.
> yes, which in many cases isn't always desirable because:
>
> * if the file had a small number of extents in the first place,
> reducing them slightly more isn't much of a gain (ie. going from
> say 11 to 10 is arguably pointless) (i have a patch to specify
> the minimum gains before doing the copy somewhere)
>
> * if the file changes during the copy, then it will be skipped until
> next time, for larger files this is problematic, you could
> argue attempting to fsr a file that is less than <n> seconds old
> is pointless as it has a high chance of being active (i have a
> patch for that too)
sounds like a useful patch. :P will it be merged into fsr code?
>
> * fsr has no global overview of what it's doing, so it never does
> things like 'move this file out of the way to make room for this
> one' (it can't do this w/o assistance right now), and of course it
> can't move inodes w/o changing them so there are limits to what
> can be done anyhow
what kind of assistance do you mean?
>
* Re: stable xfs
2006-07-21 17:00 ` Ming Zhang
@ 2006-07-21 18:07 ` Chris Wedgwood
2006-07-24 1:14 ` Ming Zhang
0 siblings, 1 reply; 33+ messages in thread
From: Chris Wedgwood @ 2006-07-21 18:07 UTC (permalink / raw)
To: Ming Zhang; +Cc: Peter Grandi, Linux XFS
On Fri, Jul 21, 2006 at 01:00:44PM -0400, Ming Zhang wrote:
> what u mean overlay fs over small fs? like a unionfs?
sorta not really, it's userspace libraries which create a virtual
filesystem over real filesystems with some database (berkeley db).
it sorta evolved from an attempt to unify several filesystems spread
over cheap PCs into something that pretended to be one larger fs
> but other than fsr. there is no better way for this right?
not publicly, you could patch fsr or nag me for my patches if that
helps
> of course, preallocate is always good. but i do not have control
> over applications.
well, in some cases you could use LD_PRELOAD and influence things, it
depends on the application and what you need from it
fwiw, most modern p2p applications have terrible access patterns which
cause horrible fragmentation (on all fs's, not just XFS)
> sounds like a useful patch. :P will it be merged into fsr code?
no, because it's ugly and i don't think i ever decoupled it from other
changes and posted it
> what kind of assistance you mean?
[WARNING: lots of hand waving ahead, plenty of minor, but important,
details ignored]
if you wanted much smarter defragmentation semantics, it would
probably make sense to
* bulkstat the entire volume, this will give you the inode cluster
locations and enough information to start building a tree of where
all the files are (XFS_IOC_FSGEOMETRY details obviously)
* opendir/read to build a full directory tree
* use XFS_IOC_GETBMAP & XFS_IOC_GETBMAPA to figure out which blocks
are occupied by which files
you would now have a pretty good idea of what is using what parts of
the disk, except of course it could be constantly changing underneath
you to make things harder
also, doing this using the existing interfaces is (when i tried it)
really really painfully slow if you have a large filesystem with a lot
of small files (even when you try to optimize your accesses to
minimize seeking by sorting by inode number and submitting several
requests in parallel to try and help the elevator merge accesses)
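the sort-by-inode-number trick is easy to sketch generically (an
illustration, not XFS-specific code; it relies on inode numbers roughly
tracking on-disk location, which holds on XFS):

```python
import os

def paths_in_inode_order(root):
    """Collect all file paths under root, sorted by inode number, so a
    subsequent stat/scan pass touches inode clusters roughly in on-disk
    order instead of directory-traversal order (fewer long seeks)."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            entries.append((os.lstat(path).st_ino, path))
    entries.sort()
    return [path for _ino, path in entries]
```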
once you have some overall picture of the disk, you can decide what you
want to move to achieve your goal, typically this would be to reduce
the fragmentation of the largest files, and this would be
relocating some or all of those blocks to another place
if you want to allocate space in a given AG, you open/creat a
temporary file in a directory in that AG (create multiple dirs as
needed to ensure you have one or more of these), and preallocate the
space --- then you can copy the file over
we could also add ioctls to further bias XFS's allocation strategies,
like telling it to never allocate in some AGs (needed for an online
shrink if someone wanted to make such a thing) or simply bias strongly
away from some places, then add other ioctls to allow you to
specifically allocate space in those AGs so you can bias what is
allocated where
another useful ioctl would be a variation of XFS_IOC_SWAPEXT which
would swap only some extents. there is no internal support for this
now except we do have code for XFS_IOC_UNRESVSP64 and XFS_IOC_RESVSP64
so perhaps the idea would be to swap some (but not all) blocks of a
file by creating a function that does the equivalent of 'punch a hole'
where we want to replace the blocks, and then 'allocate new blocks
given some i already have elsewhere' (however, making that all work as
one transaction might be very very difficult)
it's a lot of effort for something that, for many people, would only
have marginal gains
* Re: stable xfs
2006-07-19 13:11 ` Ming Zhang
2006-07-20 6:15 ` Chris Wedgwood
@ 2006-07-22 15:37 ` Peter Grandi
1 sibling, 0 replies; 33+ messages in thread
From: Peter Grandi @ 2006-07-22 15:37 UTC (permalink / raw)
To: Linux XFS
>>> On Wed, 19 Jul 2006 09:11:10 -0400, Ming Zhang
>>> <mingz@ele.uri.edu> said:
[ ... ]
mingz> what kind of "ram vs fs" size ratio here will be a
mingz> safe/good/proper one? any rule of thumb? thanks! hope
mingz> not 1:1. :)
This is driven mostly by the space required by check/repair
(which can well be above 4GiB, so 64 bit systems are often
required):
http://OSS.SGI.com/archives/linux-xfs/2005-08/msg00045.html
«e.g. it took 1.5GiB RAM for 32bit xfs_check and 2.7GiB RAM
for a 64bit xfs_check on a 1.1TiB filesystem with 3million
inodes in it.»
It suggests that a 10TB filesystem might need about 15
gigabytes of RAM (or swap, with a corresponding slowdown), which is,
after all, less than 0.2% of its size.
Anyhow, a system with lots of RAM to speedily check/repair an
XFS filesystem also benefits from the same RAM for caching and
delayed writing, so it is all for good (as long as one has a
perfectly reliable block IO subsystem).
Note that the 15 gigabytes in the example above are well above
what a 32 bit process can address, thus for multi-terabyte
filesystem one should really have a 64 bit system (from the same
article mentioned above):
http://OSS.SGI.com/archives/linux-xfs/2005-08/msg00045.html
«> > Your filesystem (8TiB) may simply be too large for your
> > system to be able to repair. Try mounting it on a 64bit
> > system with more RAM in it and repairing it from there.
>
> Sorry, but is this a joke?
A joke? Absolutely not.
Achievable XFS filesystem sizes outgrew the capability of 32
bit Irix systems to repair them several years ago. Now that
linux supports larger than 2TiB filesystems on 32 bit
systems, this is true for Linux as well.»
* Re: stable xfs
2006-07-19 14:45 ` Ming Zhang
@ 2006-07-22 17:13 ` Peter Grandi
0 siblings, 0 replies; 33+ messages in thread
From: Peter Grandi @ 2006-07-22 17:13 UTC (permalink / raw)
To: Linux XFS
>>> On Wed, 19 Jul 2006 10:45:04 -0400, Ming Zhang
>>> <mingz@ele.uri.edu> said:
[ ... ]
>> Also, if one does a number of smaller RAID5s, is each one a
>> separate filesystem or they get aggregated, for example with
>> LVM with ''concat''? Either way, how likely is it that the
>> consequences have been thought through?
>>
>> I would personally hesitate to recommend either, especially a
>> two-level arrangement where the base level is a RAID5.
mingz> could u give us some hints on this?
Well, RAID5 itself is in general a very bad idea, as well argued
here: <URL:http://WWW.BAARF.com/> and an LVM-based concat (which is
the slow version of RAID0) of RAID5 volumes has quite terrible
performance and redundancy aspects that nicely match those of
RAID5.
Imagine a 4TB volume built as a concat/span of 4 RAID5 volumes,
each done as a 1TB RAID5 of 4+1 250GB disks. Under which
conditions do you lose the whole lot?
Compare the same with a RAID0 of RAID1 pairs...
mingz> since it is really popular to have a FS/LV/MD structure
Sure, and it is also really popular to do 5+1 or 11+1 RAID5s and
to stuff them all with disks of the same model, and even from the
same shipping carton...
mingz> and I believe LVM is designed for this purpose.
Yes and no. LVM's main purpose, if any, is to outgrow the
limitation on the number of partitions in most, and PC-based
in particular, partitioning schemes. This means that LVM is
of benefit only in very few cases, those where one needs a lot
of partitions (as such, not as a cheap quota scheme).
[ ... ]
>> ''who cares if the metadata is consistent, if my 3TiB
>> application database is unusable (and I don't do backups
>> because after all it is a concat of RAID5s, backups are not
>> necessary) as there is a huge gap in some data file, and my
>> users are yelling at me, and it is not my fault''
>> The tradeoff in XFS is that if you know exactly what you are
>> doing you get extra performance...
mingz> then i think unless you disable all write cache,
Not even then, because storage subsystems often do lie about
that. Only very clever system integrators and usually only those
with a big wallet can manage to build storage subsystems with
reliable caching semantics (including write barriers).
mingz> none of the file systems can achieve this goal.
Well, some people might want to argue that a filesystem *should
not* be designed to achieve that goal, because it is a goal that
does not make sense in an ideal world in which people know exactly
what they are doing.
mingz> or maybe ext3 with both data and metadata in the log might
mingz> do this?
Well, 'data=ordered' and especially 'data=journal' (and the low
default value of 'commit=5') most often give at a moderate cost
the illusion that the file system and storage system ''just
work'', when they don't. This creates issues when discussing the
relative merits of 'ext3' vs. other filesystems which are less
forgiving.
Eventually the XFS and 'ext3' designers seem to have chosen very
different assumptions about their user base:
* the XFS designers probably assumed that their user based would
be big iron people with a high degree of understanding of
storage systems and optimal hardware conditions, and interested
in maximally scalable performance (e.g. Altix customers in HPC);
* the 'ext3' guys seem to have assumed their user base would be
general users slamming together stuff on the cheap without much
awareness or thought as to storage system engineering, and
interested in ''just works, most of the time''.
* Re: stable xfs
2006-07-20 6:12 ` Chris Wedgwood
@ 2006-07-22 17:31 ` Peter Grandi
0 siblings, 0 replies; 33+ messages in thread
From: Peter Grandi @ 2006-07-22 17:31 UTC (permalink / raw)
To: Linux XFS
>>> On Wed, 19 Jul 2006 23:12:09 -0700, Chris Wedgwood
>>> <cw@f00f.org> said:
[ ... ]
pg> But write barriers are difficult to achieve, and when
pg> achieved they are often unreliable, except on enterprise
pg> level hardware, because many disks/host adapters/... simply
pg> lie as to whether they have actually started writing (never
pg> mind finished writing, or written correctly) stuff.
cw> IDE/SATA doesn't have barrier to lie about
Actually a very few ATA/SATA devices do have write barriers, but that
is just a nitpick, because it is hard to get at them, and anyhow
Linux does not take advantage much :-).
cw> (the kernel has to flush and wait in those cases).
But ATA/SATA flush and wait have the same problems as write
barriers, except worse: disks and ATA/SATA cards do lie too as
to cache flushing. Just getting an ATA/SATA driver or card
manufacturer to tell whether completion of cache flush is
reported when the command is received, or when writing has
started, or when writing has ended, is pretty difficult.
cw> [ ... ] Sanely written applications shouldn't lose data. [
cw> ... ] any sane database should be safe, it will fsync or
cw> similar as needed; this is also true for sane MTAs
Sure, in optimal conditions where people running the system and
writing applications know exactly what they are doing and the
storage subsystem has the right semantics, then things are
good. Problem is, ''sanity'' is not entirely common in IT, as
the archives of this mailing list show abundantly.
cw> i've actually tested situations where transactions were in
cw> flight and i've dropped power on a rack of disks and
cw> verified that when it came up all transactions that we
cw> claimed to have completed really did
I hope that this was with an Altix or equivalently robustly and
advisedly engineered system and storage subsystem... (and I
don't get any commission from SGI :->).
cw> i've also done lesser things with SATA disks and email and
cw> it usually turns out to also be reliable for the most part
Ehehehe here :-). I like the «usually» and «most part». But my
argument is that I guess that is what the 'ext3' designers, but
not the XFS ones, have targeted.
The difference here between XFS and 'ext3' is that with 'ext3'
(and similar) even a not very aware sysadm running on a not very
well chosen system can get ''just works''. Just the 'commit=5'
default of 'ext3' makes *a very large* difference.
My overall message is that using XFS on a system that «usually»
and for the «most part» ''just works'' is not very appropriate...
* Re: stable xfs
2006-07-20 14:08 ` Ming Zhang
2006-07-20 16:17 ` Chris Wedgwood
@ 2006-07-22 17:47 ` Peter Grandi
1 sibling, 0 replies; 33+ messages in thread
From: Peter Grandi @ 2006-07-22 17:47 UTC (permalink / raw)
To: Linux XFS
>>> On Thu, 20 Jul 2006 10:08:22 -0400, Ming Zhang
>>> <mingz@ele.uri.edu> said:
[ ... ]
mingz> hope i never need to run repair,
A ''strategic'' attitude :-).
mingz> but i do need to defrag from time to time.
As to defrag, I reckon that defrag-in-place is a very bad idea,
but I have to admit that contrary evidence exists, and I was
rather surprised to read this:
http://OSS.SGI.com/archives/xfs/2006-03/msg00110.html
«> How many people defrag their filesystems using xfs_fsr
> /dev/PARTITION if their fragmentation is > 50% etc? Does
> anyone regularly defrag their production filesystems or
> just defrag their filesystems on a regular basis?
We have several hundred production filesystems defragmented
every night.»
Even so I think that defragment-by-copy is a much better option.
mingz> [ ... ] we mainly handle large media files like 20-50GB.
mingz> [ ....] hope this does not hold true for a 15x750GB SATA
mingz> raid5. ;)
mingz> [ ... ] say XFS can make use of parallel storage by using
mingz> multiple allocation groups. but XFS need to be built over
mingz> one block device. so if i have 4 smaller raid, i have to
mingz> use LVM to glue them before i create XFS over it right?
Well, I'll just hint that I cannot find euphemisms suitable for
expressing what I think of this setup :-).
mingz> but then u said XFS over LVM or N MD is not good?
It was me saying that [euphemism alert!] I would not recommend a
setup like that without understanding very well the consequences.
* Re: stable xfs
2006-07-20 16:38 ` Ming Zhang
2006-07-20 19:04 ` Chris Wedgwood
@ 2006-07-22 18:09 ` Peter Grandi
1 sibling, 0 replies; 33+ messages in thread
From: Peter Grandi @ 2006-07-22 18:09 UTC (permalink / raw)
To: Linux XFS
>>> On Thu, 20 Jul 2006 12:38:01 -0400, Ming Zhang
>>> <mingz@ele.uri.edu> said:
[ ... ]
>>> we mainly handle large media files like 20-50GB. so file
>>> number is not too much. but file size is large.
>> xfs_repair usually deals with that fairly well in reality
>> (much better than lots of small files anyhow)
> sounds cool. yes, large # of small files are always painful.
It is not just number of inodes, it is also number of
extents. That is total number of metadata items.
[ ... ]
* Re: stable xfs
2006-07-21 18:07 ` Chris Wedgwood
@ 2006-07-24 1:14 ` Ming Zhang
0 siblings, 0 replies; 33+ messages in thread
From: Ming Zhang @ 2006-07-24 1:14 UTC (permalink / raw)
To: Chris Wedgwood; +Cc: Peter Grandi, Linux XFS
On Fri, 2006-07-21 at 11:07 -0700, Chris Wedgwood wrote:
> On Fri, Jul 21, 2006 at 01:00:44PM -0400, Ming Zhang wrote:
>
> > what u mean overlay fs over small fs? like a unionfs?
>
> sorta not really, it's userspace libraries which create a virtual
> filesystem over real filesystems with some database (berkeley db).
> it sorta evolved from an attempt to unify several filesystems spread
> over cheap PCs into something that pretended to be one larger fs
fancy word for this is NAS virtualization i guess.
>
> > but other than fsr. there is no better way for this right?
>
> not publicly, you could patch fsr or nag me for my patches if that
> helps
i will run some tests about fsr and see if i need to bug you about
patches.
>
> > of course, preallocate is always good. but i do not have control
> > over applications.
>
> well, in some cases you could use LD_PRELOAD and influence things, it
> depends on the application and what you need from it
>
> fwiw, most modern p2p applications have terrible access patterns which
> cause horrible fragmentation (on all fs's, not just XFS)
>
> > sounds like a useful patch. :P will it be merged into fsr code?
>
> no, because it's ugly and i don't think i ever decoupled it from other
> changes and posted it
>
> > what kind of assistance you mean?
>
> [WARNING: lots of hand waving ahead, plenty of minor, but important,
> details ignored]
>
i read about this and feel it will be VERY hard to build, especially
considering the transaction issue.
could this be easier?
* analyze the fs to find out which file(s) to defrag;
* create a temp file and begin to copy, preallocating the space so it
is contiguous;
* after the first round of copying, keep a trace table of changed
blocks and do a second round on the changed blocks;
* lock and swap the old file with the new file.
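the copy and swap steps of this scheme can be sketched with plain POSIX
calls (a simplification: the helper names are invented, posix_fallocate
encourages but does not guarantee contiguity, os.replace is an atomic
rename rather than XFS's swap-extents ioctl so the inode changes, and
the changed-block tracking of step three is omitted entirely):

```python
import os
import shutil

def copy_with_preallocation(src, dst):
    """Copy src into a new file whose full size is reserved up front,
    so the allocator can pick one large region for the replacement."""
    size = os.path.getsize(src)
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        if size:
            try:
                os.posix_fallocate(fdst.fileno(), 0, size)  # reserve blocks
            except OSError:
                pass  # filesystem without fallocate support: plain copy
        shutil.copyfileobj(fsrc, fdst)

def swap_in(tmp, original):
    """In real xfs_fsr the final step is an atomic swap-extents ioctl;
    os.replace (an atomic rename) is the closest portable stand-in."""
    os.replace(tmp, original)
```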
> if you wanted much smarter defragmentation semantics, it would
> probably make sense to
>
> * bulkstat the entire volume, this will give you the inode cluster
> locations and enough information to start building a tree of where
> all the files are (XFS_IOC_FSGEOMETRY details obviously)
>
> * opendir/read to build a full directory tree
>
> * use XFS_IOC_GETBMAP & XFS_IOC_GETBMAPA to figure out which blocks
> are occupied by which files
>
> you would now have a pretty good idea of what is using what parts of
> the disk, except of course it could be constantly changing underneath
> you to make things harder
>
> also, doing this using the existing interfaces is (when i tried it)
> really really painfully slow if you have a large filesystem with a lot
> of small files (even when you try to optimize your accesses to
> minimize seeking by sorting by inode number and submitting several
> requests in parallel to try and help the elevator merge accesses)
>
>
> once you have some overall picture of the disk, you can decide what you
> want to move to achieve your goal, typically this would be to reduce
> the fragmentation of the largest files, and this would be
> relocating some or all of those blocks to another place
>
> if you want to allocate space in a given AG, you open/creat a
> temporary file in a directory in that AG (create multiple dirs as
> needed to ensure you have one or more of these), and preallocate the
> space --- there you can copy the file over
>
> we could also add ioctls to further bias XFSs allocation strategies,
> like telling it to never allocate in some AGs (needed for an online
> shrink if someone wanted to make such a thing) or simply bias strongly
> away from some places, then add other ioctls to allow you to
> specifically allocate space in those AGs so you can bias what is
> allocated where
>
> another useful ioctl would be a variation of XFS_IOC_SWAPEXT which
> would swap only some extents. there is no internal support for this
> now except we do have code for XFS_IOC_UNRESVSP64 and XFS_IOC_RESVSP64
> so perhaps the idea would be to swap some (but not all) blocks of a
> file by creating a function that does the equivalent of 'punch a hole'
> where we want to replace the blocks, and then 'allocate new blocks
> given some i already have elsewhere' (however, making that all work as
> one transaction might be very very difficult)
>
> it's a lot of effort for something that, for many people, would only
> have marginal gains
end of thread
Thread overview: 33+ messages
2006-07-17 15:30 stable xfs Ming Zhang
2006-07-17 16:20 ` Peter Grandi
2006-07-18 22:36 ` Ming Zhang
2006-07-18 23:14 ` Peter Grandi
2006-07-19 1:20 ` Ming Zhang
2006-07-19 5:56 ` Chris Wedgwood
2006-07-19 10:53 ` Peter Grandi
2006-07-19 14:45 ` Ming Zhang
2006-07-22 17:13 ` Peter Grandi
2006-07-20 6:12 ` Chris Wedgwood
2006-07-22 17:31 ` Peter Grandi
2006-07-19 14:10 ` Ming Zhang
2006-07-19 10:24 ` Peter Grandi
2006-07-19 13:11 ` Ming Zhang
2006-07-20 6:15 ` Chris Wedgwood
2006-07-20 14:08 ` Ming Zhang
2006-07-20 16:17 ` Chris Wedgwood
2006-07-20 16:38 ` Ming Zhang
2006-07-20 19:04 ` Chris Wedgwood
2006-07-21 0:19 ` Ming Zhang
2006-07-21 3:26 ` Chris Wedgwood
2006-07-21 13:10 ` Ming Zhang
2006-07-21 16:07 ` Chris Wedgwood
2006-07-21 17:00 ` Ming Zhang
2006-07-21 18:07 ` Chris Wedgwood
2006-07-24 1:14 ` Ming Zhang
2006-07-22 18:09 ` Peter Grandi
2006-07-22 17:47 ` Peter Grandi
2006-07-22 15:37 ` Peter Grandi
2006-07-18 23:54 ` Nathan Scott
2006-07-19 1:15 ` Ming Zhang
2006-07-19 7:40 ` Martin Steigerwald
2006-07-19 14:11 ` Ming Zhang