* stable xfs @ 2006-07-17 15:30 Ming Zhang 2006-07-17 16:20 ` Peter Grandi 2006-07-18 23:54 ` Nathan Scott 0 siblings, 2 replies; 33+ messages in thread From: Ming Zhang @ 2006-07-17 15:30 UTC (permalink / raw) To: linux-xfs Hi All We want to use XFS in all of our production servers but feel a little scared about the corruption problems seen in this list. I wonder which 2.6.16+ kernel we can use in order to get a stable XFS? Thanks! ps, one friend mentioned that XFS has some issues with LVM+MD under it. Is this true? Ming ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-17 15:30 stable xfs Ming Zhang @ 2006-07-17 16:20 ` Peter Grandi 2006-07-18 22:36 ` Ming Zhang 2006-07-18 23:54 ` Nathan Scott 1 sibling, 1 reply; 33+ messages in thread From: Peter Grandi @ 2006-07-17 16:20 UTC (permalink / raw) To: Linux XFS >>> On Mon, 17 Jul 2006 11:30:23 -0400, Ming Zhang >>> <mingz@ele.uri.edu> said: mingz> Hi All We want to use XFS in all of our production mingz> servers but feel a little scary about the corruption mingz> problems seen in this list. [ ... ] XFS is complex but quite stable code. Most of the reports about ''corruption'' are consequences of not being aware of what it was designed for, how it works and how it should be used... ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-17 16:20 ` Peter Grandi @ 2006-07-18 22:36 ` Ming Zhang 2006-07-18 23:14 ` Peter Grandi 0 siblings, 1 reply; 33+ messages in thread From: Ming Zhang @ 2006-07-18 22:36 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux XFS Thanks for your response. But could you give me an example on what is an improper use? Ming On Mon, 2006-07-17 at 17:20 +0100, Peter Grandi wrote: > >>> On Mon, 17 Jul 2006 11:30:23 -0400, Ming Zhang > >>> <mingz@ele.uri.edu> said: > > mingz> Hi All We want to use XFS in all of our production > mingz> servers but feel a little scary about the corruption > mingz> problems seen in this list. [ ... ] > > XFS is complex but quite stable code. Most of the reports about > ''corruption'' are consequences of not being aware of what it > was designed for, how it works and how it should be used... > > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-18 22:36 ` Ming Zhang @ 2006-07-18 23:14 ` Peter Grandi 2006-07-19 1:20 ` Ming Zhang 0 siblings, 1 reply; 33+ messages in thread From: Peter Grandi @ 2006-07-18 23:14 UTC (permalink / raw) To: Linux XFS >>> On Tue, 18 Jul 2006 18:36:06 -0400, Ming Zhang >>> <mingz@ele.uri.edu> said: mingz> [ .. ] example on what is an improper use? Well, this mailing list is full of them :-). However it is easier to say what is an optimal use: * A 64 bit system. * With a large, parallel storage system. * The block IO system handles all storage errors. * With backups of the contents of the storage system. In other words, an Altix in an enterprise computing room... :-) Something like 64 bit systems running a UNIX-like OS, one system production and one for backup, each with some TiB of RAID10 storage, both with UPSes giving a significant amount of uptime, and extensive hot swapping abilities. If you got that, XFS can give really good performance quite safely. My impression is that the design of XFS was based on a focus on performance, at the file system level, via on-disk layout, massive ''transactions'', and parallel IO requests, assuming that the block IO subsystem handles every storage error issue both transparently and gracefully. It is _possible_, and may even be appropriate after carefully thinking it through, to use XFS in a 32 bit system without UPS, and with no storage system redundancy, and with device errors not handled by the block IO system, and with little parallelism in the storage subsystem; e.g. a SOHO desktop or server. But then I have seen people building RAIDs stuffing in a couple dozen drives from the same shipping box, so improper use of XFS is definitely a second order issue at that kind of level :-). ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-18 23:14 ` Peter Grandi @ 2006-07-19 1:20 ` Ming Zhang 2006-07-19 5:56 ` Chris Wedgwood 2006-07-19 10:24 ` Peter Grandi 0 siblings, 2 replies; 33+ messages in thread From: Ming Zhang @ 2006-07-19 1:20 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux XFS On Wed, 2006-07-19 at 00:14 +0100, Peter Grandi wrote: > >>> On Tue, 18 Jul 2006 18:36:06 -0400, Ming Zhang > >>> <mingz@ele.uri.edu> said: > > mingz> [ .. ] example on what is an improper use? > > Well, this mailing list is full of them :-). However it is > easier to say what is an optimal use: > > * A 64 bit system. > * With a large, parallel storage system. when u say large parallel storage system, you mean independent spindles right? but most people will have all disks configured in one RAID5/6 and thus it is not parallel any more. > * The block IO system handles all storage errors. so current MD/LVM/SATA/SCSI layers are not good enough? > * With backups of the contents of the storage system. > > In other words, an Altix in an enterprise computing room... :-) just kidding, are you a SGI sales? ;) > > Something like 64 bit systems running a UNIX-like OS, one system > production and one for backup, each with some TiB of RAID10 > storage, both with UPSes giving a significant amount of uptime, > and extensive hot swapping abilities. If you got that, XFS can > give really good performance quite safely. > > My impression is that the design of XFS was based on a focus on > performance, at the file system level, via on-disk layout, > massive ''transactions'', and parallel IO requests, assuming > that the block IO subsystem handles every storage error issue > both transparently and gracefully. > > It is _possible_, and may even be appropriate after carefully > thinking it through, to use XFS in a 32 bit system without UPS, > and with no storage system redundancy, and with device errors > not handled by the block IO system, and with little parallelism > in the storage subsystem; e.g. 
a SOHO desktop or server. i think with write barrier support, system without UPS should be ok. considering even u have UPS, kernel oops in other parts still can take the FS down. > > But then I have seen people building RAIDs stuffing in a couple > dozen drives from the same shipping box, so improper use of XFS > is definitely a second order issue at that kind of level :-). > > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-19 1:20 ` Ming Zhang @ 2006-07-19 5:56 ` Chris Wedgwood 2006-07-19 10:53 ` Peter Grandi 2006-07-19 14:10 ` Ming Zhang 2006-07-19 10:24 ` Peter Grandi 1 sibling, 2 replies; 33+ messages in thread From: Chris Wedgwood @ 2006-07-19 5:56 UTC (permalink / raw) To: Ming Zhang; +Cc: Peter Grandi, Linux XFS On Tue, Jul 18, 2006 at 09:20:44PM -0400, Ming Zhang wrote: > when u say large parallel storage system, you mean independent > spindles right? but most people will have all disks configured in > one RAID5/6 and thus it is not parallel any more. it depends, you might have 100s of spindles in groups, you don't make a giant raid5/6 array with that many disks, you make a number of smaller arrays > i think with write barrier support, system without UPS should be ok. with barrier support a UPS shouldn't be necessary > considering even u have UPS, kernel oops in other parts still can > take the FS down. but a crash won't cause writes to be 'reordered' reordering is bad because the fs pushes writes down in a manner that means when it comes back it will be able to make it self consistent, so if you have a number of writes pending and some of them are lost, and those that are lost are not the most recent writes because of reordering, you can end up with a corrupt fs ^ permalink raw reply [flat|nested] 33+ messages in thread
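[Editorial note] Chris's reordering argument in the message above can be sketched with a tiny simulation. This is a toy model of log recovery, not XFS's actual journal code: if the device completes writes in issue order, whatever survives a crash is a prefix of the log and recovery can simply truncate to the last complete record; if the device reorders, a record can survive beyond a lost one, and naive replay reconstructs an inconsistent image.

```python
import random

def crash(pending, reorder, rng):
    """Return the subset of pending writes that reached disk before
    power was lost."""
    if reorder:
        # device may complete the pending writes in any order
        return {w for w in pending if rng.random() < 0.5}
    # ordering preserved: some prefix of the issue order got to disk
    k = rng.randrange(len(pending) + 1)
    return set(pending[:k])

def recoverable(survived, pending):
    """Naive log recovery: replay records in order, stop at the first
    missing one.  Safe only if no later record made it to disk past a
    missing one."""
    hole = False
    for w in pending:
        if w not in survived:
            hole = True
        elif hole:
            return False  # record beyond a hole: inconsistent image
    return True

rng = random.Random(42)
pending = list(range(8))  # eight writes queued at crash time

# In-order completion: every crash leaves a recoverable prefix.
assert all(recoverable(crash(pending, False, rng), pending)
           for _ in range(1000))

# Reordered completion: many crashes leave records beyond a hole.
bad = sum(not recoverable(crash(pending, True, rng), pending)
          for _ in range(1000))
print("inconsistent crash images out of 1000:", bad)
```

With reordering, only a tiny fraction of surviving subsets happen to be prefixes, which is why a barrier (or flush-and-wait) at each commit point matters.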
* Re: stable xfs 2006-07-19 5:56 ` Chris Wedgwood @ 2006-07-19 10:53 ` Peter Grandi 2006-07-19 14:45 ` Ming Zhang 2006-07-20 6:12 ` Chris Wedgwood 2006-07-19 14:10 ` Ming Zhang 1 sibling, 2 replies; 33+ messages in thread From: Peter Grandi @ 2006-07-19 10:53 UTC (permalink / raw) To: Linux XFS [ ... ] mingz> when u say large parallel storage system, you mean mingz> independent spindles right? but most people will have all mingz> disks configured in one RAID5/6 and thus it is not parallel mingz> any more. cw> it depends, you might have 100s of spindles in groups, you cw> don't make a giant raid5/6 array with that many disks, you cw> make a number of smaller arrays Perhaps you are underestimating the ''if it can be done'' mindset... Also, if one does a number of smaller RAID5s, is each one a separate filesystem or do they get aggregated, for example with LVM with ''concat''? Either way, how likely is it that the consequences have been thought through? I would personally hesitate to recommend either, especially a two-level arrangement where the base level is a RAID5. [I am making an effort in this discussion to use euphemisms] mingz> i think with write barrier support, system without UPS mingz> should be ok. cw> with barrier support a UPS shouldn't be necessary Sure, «should» and «shouldn't» are nice hopeful concepts. But write barriers are difficult to achieve, and when achieved they are often unreliable, except on enterprise level hardware, because many disks/host adapters/... simply lie as to whether they have actually started writing (never mind finished writing, or written correctly) stuff. To get reliable write barriers, one often has to source special cards or disks with custom firmware; or leave system integration to the big expensive guys and buy an Altix or equivalent system from Sun or IBM. 
Besides I have seen many reports of ''corruption'' that cannot be fixed by write barriers: many have the expectation that *data* should not be lost, even if no 'fsync' is done, *as if* 'mount -o sync' or 'mount -o data=ordered'. Of course that is a bit of an inflated expectation, but all that the vast majority of sysadms care about is whether it ''just works'', without ''wasting time'' figuring things out. mingz> considering even u have UPS, kernel oops in other parts mingz> still can take the FS down. cw> but a crash won't cause writes to be 'reordered' [ ... ] The metadata will be consistent, but metadata and data may well be lost. So the filesystem is still ''corrupted'', at least from the point of view of a sysadm who just wants the filesystem to be effortlessly foolproof. Anyhow, if a crash happens all bets are off, because who knows *what* gets written. Look at it from the point of view of a ''practitioner'' sysadm: ''who cares if the metadata is consistent, if my 3TiB application database is unusable (and I don't do backups because after all it is a concat of RAID5s, backups are not necessary) as there is a huge gap in some data file, and my users are yelling at me, and it is not my fault'' The tradeoff in XFS is that if you know exactly what you are doing you get extra performance... ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-19 10:53 ` Peter Grandi @ 2006-07-19 14:45 ` Ming Zhang 2006-07-22 17:13 ` Peter Grandi 2006-07-20 6:12 ` Chris Wedgwood 1 sibling, 1 reply; 33+ messages in thread From: Ming Zhang @ 2006-07-19 14:45 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux XFS On Wed, 2006-07-19 at 11:53 +0100, Peter Grandi wrote: > [ ... ] > > mingz> when u say large parallel storage system, you mean > mingz> independent spindles right? but most people will have all > mingz> disks configured in one RAID5/6 and thus it is not parallel > mingz> any more. > > cw> it depends, you might have 100s of spindles in groups, you > cw> don't make a giant raid5/6 array with that many disks, you > cw> make a number of smaller arrays > > Perhaps you are undestimating the ''if it can be done'' > mindset... > > Also, if one does a number of smaller RAID5s, is each one a > separate filesystem or they get aggregated, for example with > LVM with ''concat''? Either way, how likely is is that the > consequences have been thought through? > > I would personally hesitate to recommend either, especially a > two-level arrangement where the base level is a RAID5. could u give us some hints on this? since it is really popular to have a FS/LV/MD structure and I believe LVM is designed for this purpose. > > [I am making an effort in this discussion to use euphemisms] > > mingz> i think with write barrier support, system without UPS > mingz> should be ok. > > cw> with barrier support a UPS shouldn't be necessary > > Sure, «should» and «shouldn't» are nice hopeful concepts. > > But write barriers are difficult to achieve, and when achieved > they are often unreliable, except on enterprise level hardware, > because many disks/host adapters/... simply lie as to whether > they have actually started writing (never mind finished writing, > or written correctly) stuff. 
> > To get reliable write barrier often one has to source special > cards or disks with custom firmware; or leave system integration > to the big expensive guys and buy an Altix or equivalent system > from Sun or IBM. > > Besides I have seen many reports of ''corruption'' that cannot > be fixed by write barriers: many have the expectation that > *data* should not be lost, even if no 'fsync' is done, *as if* > 'mount -o sync' or 'mount -o data=ordered'. > > Of course that is a bit of an inflated expectation, but all that > the vast majority of sysadms care about is whether it ''just > works'', without ''wasting time'' figuring things out. > > mingz> considering even u have UPS, kernel oops in other parts > mingz> still can take the FS down. > > cw> but a crash won't cause writes to be 'reordered' [ ... ] > > The metadata will be consistent, but metadata and data may well > will be lost. So the filesystem is still ''corrupted'', at least > from the point of view of a sysadm who just wants the filesystem > to be effortlessly foolproof. Anyhow, if a crash happens all > bets are off, because who knows *what* gets written. > > Look at it from the point of view of a ''practitioner'' sysadm: > > ''who cares if the metadata is consistent, if my 3TiB > application database is unusable (and I don't do backups > because after all it is a concat of RAID5s, backups are not > necessary) as there is a huge gap in some data file, and my > users are yelling at me, and it is not my fault'' > > The tradeoff in XFS is that if you know exactly what you are > doing you get extra performance... then i think unless you disable all write cache, none of the file system can achieve this goal. or maybe ext3 with both data and metadata into log might do this? Ming ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-19 14:45 ` Ming Zhang @ 2006-07-22 17:13 ` Peter Grandi 0 siblings, 0 replies; 33+ messages in thread From: Peter Grandi @ 2006-07-22 17:13 UTC (permalink / raw) To: Linux XFS >>> On Wed, 19 Jul 2006 10:45:04 -0400, Ming Zhang >>> <mingz@ele.uri.edu> said: [ ... ] >> Also, if one does a number of smaller RAID5s, is each one a >> separate filesystem or they get aggregated, for example with >> LVM with ''concat''? Either way, how likely is is that the >> consequences have been thought through? >> >> I would personally hesitate to recommend either, especially a >> two-level arrangement where the base level is a RAID5. mingz> could u give us some hints on this? Well, RAID5 itself is in general a very bad idea, as well argued here: <URL:http://WWW.BAARF.com/> and an LVM-based concat (which is the slow version of RAID0) of RAID5 volumes has quite terrible performance and redundancy aspects that nicely match those of RAID5. Imagine a 4TB volume built as a concat/span of 4 RAID5 volumes, each done as a 1TB RAID5 of 4+1 250GB disks. Under which conditions do you lose the whole lot? Compare the same with a RAID0 of RAID1 pairs... mingz> since it is really popular to have a FS/LV/MD structure Sure, and it is also really popular to do 5+1 or 11+1 RAID5s and to stuff them all with disks of the same model, and even from the same shipping carton... mingz> and I believe LVM is designed for this purpose. Yes and no. LVM's main purpose, if any, is to overcome the limitation on the number of partitions in most, and PC-based in particular, partitioning schemes. This means that LVM is of benefit only in very few cases, those where one needs a lot of partitions (as such, not as a cheap quota scheme). [ ... 
] >> ''who cares if the metadata is consistent, if my 3TiB >> application database is unusable (and I don't do backups >> because after all it is a concat of RAID5s, backups are not >> necessary) as there is a huge gap in some data file, and my >> users are yelling at me, and it is not my fault'' >> The tradeoff in XFS is that if you know exactly what you are >> doing you get extra performance... mingz> then i think unless you disable all write cache, Not even then, because storage subsystems often do lie about that. Only very clever system integrators and usually only those with a big wallet can manage to build storage subsystems with reliable caching semantics (including write barriers). mingz> none of the file system can achieve this goal. Well, some people might want to argue that a filesystem *should not* be designed to achieve that goal, because it is a goal that does not make sense in an ideal world in which people know exactly what they are doing. mingz> or maybe ext3 with both data and metadata into log might mingz> do this? Well, 'data=ordered' and especially 'data=journal' (and the low default value of 'commit=5') most often give at a moderate cost the illusion that the file system and storage system ''just work'', when they don't. This creates issues when discussing the relative merits of 'ext3' vs. other filesystems which are less forgiving. Ultimately the XFS and 'ext3' designers seem to have chosen very different assumptions about their user base: * the XFS designers probably assumed that their user base would be big iron people with a high degree of understanding of storage systems and optimal hardware conditions, and interested in maximally scalable performance (e.g. Altix customers in HPC); * the 'ext3' guys seem to have assumed their user base would be general users slamming together stuff on the cheap without much awareness or thought as to storage system engineering, and interested in ''just works, most of the time''. 
^ permalink raw reply [flat|nested] 33+ messages in thread
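[Editorial note] Peter's rhetorical question in the message above ("under which conditions do you lose the whole lot?") can be made concrete with a small counting exercise on his 20-disk example, assuming exactly two disk failures close enough together that no rebuild completes in between:

```python
from math import comb

DISKS = 20
pairs = comb(DISKS, 2)  # 190 equally likely two-disk failure combinations

# Concat of 4 x RAID5 (4+1 disks each): a group dies when it loses two
# of its five disks, and losing any one group loses the whole
# concatenated volume, so every same-group pair of failures is fatal.
concat_raid5_fatal = 4 * comb(5, 2)   # 40 fatal pairs

# RAID0 over 10 RAID1 mirror pairs: fatal only when both failures land
# in the same two-disk mirror.
raid10_fatal = 10 * comb(2, 2)        # 10 fatal pairs

print(f"concat of RAID5s: {concat_raid5_fatal}/{pairs}"
      f" = {concat_raid5_fatal / pairs:.1%} of double failures are fatal")
print(f"RAID0 of RAID1s:  {raid10_fatal}/{pairs}"
      f" = {raid10_fatal / pairs:.1%} of double failures are fatal")
```

Under this (deliberately simplified) model the concat of RAID5s loses everything about four times as often (21.1% vs 5.3% of double failures), which is the redundancy half of the complaint; correlated failures from same-batch disks make the real numbers worse.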
* Re: stable xfs 2006-07-19 10:53 ` Peter Grandi 2006-07-19 14:45 ` Ming Zhang @ 2006-07-20 6:12 ` Chris Wedgwood 2006-07-22 17:31 ` Peter Grandi 1 sibling, 1 reply; 33+ messages in thread From: Chris Wedgwood @ 2006-07-20 6:12 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux XFS On Wed, Jul 19, 2006 at 11:53:24AM +0100, Peter Grandi wrote: > But write barriers are difficult to achieve, and when achieved they > are often unreliable, except on enterprise level hardware, because > many disks/host adapters/... simply lie as to whether they have > actually started writing (never mind finished writing, or written > correctly) stuff. IDE/SATA doesn't have barriers to lie about (the kernel has to flush and wait in those cases). > The metadata will be consistent, but metadata and data may well will > be lost. So the filesystem is still ''corrupted'', at least from the > point of view of a sysadm who just wants the filesystem to be > effortlessly foolproof. Sanely written applications shouldn't lose data. > Look at it from the point of view of a ''practitioner'' sysadm: > > ''who cares if the metadata is consistent, if my 3TiB > application database is unusable (and I don't do backups any sane database should be safe, it will fsync or similar as needed; this is also true for sane MTAs i've actually tested situations where transactions were in flight and i've dropped power on a rack of disks and verified that when it came up all transactions that we claimed to have completed really did i've also done lesser things with SATA disks and email and it usually turns out to also be reliable for the most part ^ permalink raw reply [flat|nested] 33+ messages in thread
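[Editorial note] The "sanely written application" Chris describes follows roughly the write, fsync, rename, fsync-the-directory discipline, so that after any crash a reader sees either the old contents or the new, never a torn mix. A minimal sketch (the file name is made up):

```python
import os

def durable_replace(path, data: bytes):
    """Atomically replace `path` with `data`, so a crash at any point
    leaves either the old or the new contents visible."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)          # data must be durable before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)      # atomic switch of the name
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)         # make the rename itself durable
    finally:
        os.close(dfd)

durable_replace("mailbox.idx", b"record 1\n")
```

Without the explicit fsync calls, the application is relying on whatever the filesystem and write cache happen to do, which is exactly the gap between "metadata consistent" and "data intact" discussed above.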
* Re: stable xfs 2006-07-20 6:12 ` Chris Wedgwood @ 2006-07-22 17:31 ` Peter Grandi 0 siblings, 0 replies; 33+ messages in thread From: Peter Grandi @ 2006-07-22 17:31 UTC (permalink / raw) To: Linux XFS >>> On Wed, 19 Jul 2006 23:12:09 -0700, Chris Wedgwood >>> <cw@f00f.org> said: [ ... ] pg> But write barriers are difficult to achieve, and when pg> achieved they are often unreliable, except on enterprise pg> level hardware, because many disks/host adapters/... simply pg> lie as to whether they have actually started writing (never pg> mind finished writing, or written correctly) stuff. cw> IDE/SATA doesn't have barrier to lie about Actually a very few ATA/SATA do have write barriers, but that is a just a nitpick, because it is hard to get to them, and anyhow Linux does not take advantage much :-). cw> (the kernel has to flush and wait in those cases). But ATA/SATA flush and wait have the same problems as write barriers, except worse: disks and ATA/SATA cards do lie too as to cache flushing. Just getting an ATA/SATA driver or card manufacturer to tell whether completion of cache flush is reported when the command is received, or when writing has started, or when writing has ended, is pretty difficult. cw> [ ... ] Sanely written applications shouldn't lose data. [ cw> ... ] any sane database should be safe, it will fsync or cw> similar as needed this is also true for sane MTAs Sure, in optimal conditions where people running the system and writing applications know exactly what they are doing and the storage subsystem has the right semantics, then things are good. Problem is, ''sanity'' is not entirely common in IT, as the archives of this mailing list show abundantly. 
cw> i've actually tested sitations where transactions were in cw> flight and i've dropped power on a rack of disks and cw> verified that when it came up all transactions that we cw> claimed to have completed really did I hope that this was with an Altix or equivalently robustly and advisedly engineered system and storage subsystem... (and I don't get any commission from SGI :->). cw> i've also done lesser things will SATA disks and email and cw> it usually turns out to also be reliable for the most part Ehehehe here :-). I like the «usually» and «most part». But my argument is that I guess that is what the 'ext3' designers, but not the XFS ones, have targeted. The difference here between XFS and 'ext3' is that with 'ext3' (and similar) even a not very aware sysadm running on a not very well chosen system can get ''just works''. Just the 'commit=5' default of 'ext3' makes *a very large* difference. My overall message is that using XFS on a system that «usually» and for the «most part» ''just works'' is not very appropriate... ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-19 5:56 ` Chris Wedgwood 2006-07-19 10:53 ` Peter Grandi @ 2006-07-19 14:10 ` Ming Zhang 1 sibling, 0 replies; 33+ messages in thread From: Ming Zhang @ 2006-07-19 14:10 UTC (permalink / raw) To: Chris Wedgwood; +Cc: Peter Grandi, Linux XFS On Tue, 2006-07-18 at 22:56 -0700, Chris Wedgwood wrote: > On Tue, Jul 18, 2006 at 09:20:44PM -0400, Ming Zhang wrote: > > > when u say large parallel storage system, you mean independent > > spindles right? but most people will have all disks configured in > > one RAID5/6 and thus it is not parallel any more. > > it depends, you might have 100s of spindles in groups, you don't make > a giant raid5/6 array with that many disks, you make a number of > smaller arrays right > > > i think with write barrier support, system without UPS should be ok. > > with barrier support a UPS shouldn't be necessary > > > considering even u have UPS, kernel oops in other parts still can > > take the FS down. > i mean with UPS and huge write cache, but no write barrier. > but a crash won't cause writes to be 'reordered' > > > reordering is bad because the fs pushes writes down in a manner that > means when it comes back it will be able to make it self consistent, > so if you have a number of writes pending and some of them are lost, > and those that are lost are not the most recent writes because of > reordering, you can end up with a corrupt fs ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-19 1:20 ` Ming Zhang 2006-07-19 5:56 ` Chris Wedgwood @ 2006-07-19 10:24 ` Peter Grandi 2006-07-19 13:11 ` Ming Zhang 1 sibling, 1 reply; 33+ messages in thread From: Peter Grandi @ 2006-07-19 10:24 UTC (permalink / raw) To: Linux XFS >>> On Tue, 18 Jul 2006 21:20:44 -0400, Ming Zhang <mingz@ele.uri.edu> said: [ ... ] mingz> when u say large parallel storage system, you mean mingz> independent spindles right? but most people will have all mingz> disks configured in one RAID5/6 and thus it is not mingz> parallel any more. As I was saying... pg> Most of the reports about ''corruption'' are consequences pg> of not being aware of what it was designed for, how it pg> works and how it should be used... mingz> [ .. ] example on what is an improper use? pg> Well, this mailing list is full of them :-). pg> But then I have seen people building RAIDs stuffing in a pg> couple dozen drives from the same shipping box, [ ... ] :-) BTW as to these: * A 64 bit system. * With a large, parallel storage system. * The block IO system handles all storage errors. * With backups of the contents of the storage system. I forgot a very essential one: * With lots of RAM, size proportional to that of the largest filesystem. [ ... ] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-19 10:24 ` Peter Grandi @ 2006-07-19 13:11 ` Ming Zhang 2006-07-20 6:15 ` Chris Wedgwood 2006-07-22 15:37 ` Peter Grandi 0 siblings, 2 replies; 33+ messages in thread From: Ming Zhang @ 2006-07-19 13:11 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux XFS On Wed, 2006-07-19 at 11:24 +0100, Peter Grandi wrote: > >>> On Tue, 18 Jul 2006 21:20:44 -0400, Ming Zhang <mingz@ele.uri.edu> said: > > [ ... ] > > mingz> when u say large parallel storage system, you mean > mingz> independent spindles right? but most people will have all > mingz> disks configured in one RAID5/6 and thus it is not > mingz> parallel any more. > > As I was saying... > > pg> Most of the reports about ''corruption'' are consequences > pg> of not being aware of what it was designed for, how it > pg> works and how it should be used... > > mingz> [ .. ] example on what is an improper use? > pg> Well, this mailing list is full of them :-). > > pg> But then I have seen people building RAIDs stuffing in a > pg> couple dozen drives from the same shipping box, [ ... ] > > :-) > > BTW as to these: > > * A 64 bit system. > * With a large, parallel storage system. > * The block IO system handles all storage errors. > * With backups of the contents of the storage system. > > I forgot a very essential one: > > * With lots of RAM, size proportional to that of the largest filesystem. > > [ ... ] > what kind of "ram vs fs" size ratio here will be a safe/good/proper one? any rule of thumb? thanks! hope not 1:1. :) Ming ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-19 13:11 ` Ming Zhang @ 2006-07-20 6:15 ` Chris Wedgwood 2006-07-20 14:08 ` Ming Zhang 1 sibling, 1 reply; 33+ messages in thread From: Chris Wedgwood @ 2006-07-20 6:15 UTC (permalink / raw) To: Ming Zhang; +Cc: Peter Grandi, Linux XFS On Wed, Jul 19, 2006 at 09:11:10AM -0400, Ming Zhang wrote: > what kind of "ram vs fs" size ratio here will be a safe/good/proper > one? it depends very much on what you are doing > any rule of thumb? thanks! > > hope not 1:1. :) i recently dealt with a corrupted filesystem that xfs_repair needed over 1GB to deal with --- the kicker is the filesystem was only 20GB, so that's 20:1 for xfs_repair i suspect that was anomalous though and that some bug or quirk of their fs caused xfs_repair to behave badly (that said, i'd hate to have to repair an 8TB fs full of maildir email boxes, which i know some people have) ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-20 6:15 ` Chris Wedgwood @ 2006-07-20 14:08 ` Ming Zhang 2006-07-20 16:17 ` Chris Wedgwood 2006-07-22 17:47 ` Peter Grandi 0 siblings, 2 replies; 33+ messages in thread From: Ming Zhang @ 2006-07-20 14:08 UTC (permalink / raw) To: Chris Wedgwood; +Cc: Peter Grandi, Linux XFS On Wed, 2006-07-19 at 23:15 -0700, Chris Wedgwood wrote: > On Wed, Jul 19, 2006 at 09:11:10AM -0400, Ming Zhang wrote: > > > what kind of "ram vs fs" size ratio here will be a safe/good/proper > > one? > > it depends very much on what you are doing we mainly handle large media files like 20-50GB. so file number is not too much. but file size is large. hope i never need to run repair, but i do need to defrag from time to time. > > > any rule of thumb? thanks! > > > > hope not 1:1. :) > > i recent dealt with a corrupted filesystem that xfs_repair needed over > 1GB to deal with --- the kicker is the filesystem was only 20GB, so > that's 20:1 for xfs_repair hope this does not hold true for a 15x750GB SATA raid5. ;) > > i suspect that was anomalous though and that some bug or quirk of > their fs cause xfs_repair to behave badly (that said, i'd had to have > to repair an 8TB fs fill of maildir email boxes, which i know some > people have) ps, also another question brought up while reading this thread. say XFS can make use of parallel storage by using multiple allocation groups. but XFS need to be built over one block device. so if i have 4 smaller raid, i have to use LVM to glue them before i create XFS over it right? but then u said XFS over LVM or N MD is not good? Ming ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-20 14:08 ` Ming Zhang @ 2006-07-20 16:17 ` Chris Wedgwood 2006-07-20 16:38 ` Ming Zhang 0 siblings, 1 reply; 33+ messages in thread From: Chris Wedgwood @ 2006-07-20 16:17 UTC (permalink / raw) To: Ming Zhang; +Cc: Peter Grandi, Linux XFS On Thu, Jul 20, 2006 at 10:08:22AM -0400, Ming Zhang wrote: > we mainly handle large media files like 20-50GB. so file number is > not too much. but file size is large. xfs_repair usually deals with that fairly well in reality (much better than lots of small files anyhow) > hope i never need to run repair, but i do need to defrag from time > to time. if you preallocate you can avoid that (this is what i do, i preallocate in the replication daemon) > hope this does not hold true for a 15x750GB SATA raid5. ;) that's ~10TB or so, my guess is that a repair there would take some GBs of ram it would be interesting to test it if you had the time there is a 'formula' for working out roughly how much ram is needed (steve lord posted it a long time ago, hopefully someone can find that and repost it) > say XFS can make use of parallel storage by using multiple > allocation groups. but XFS need to be built over one block > device. so if i have 4 smaller raid, i have to use LVM to glue them > before i create XFS over it right? but then u said XFS over LVM or N > MD is not good? with recent kernels it shouldn't be a problem, the recursive nature of the block layer changed so you no longer blow up as badly as people did in the past (also, XFS tends to use less stack these days) ^ permalink raw reply [flat|nested] 33+ messages in thread
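[Editorial note] The preallocation Chris mentions can be done from an application with posix_fallocate(): reserving the file's final size up front lets the filesystem hand out one large extent instead of growing the file piecemeal as data arrives. A sketch of the idea, not Chris's actual replication daemon; the file name and size are illustrative:

```python
import os

SIZE = 16 * 1024 * 1024   # e.g. one 16 MiB segment of a larger media file

fd = os.open("segment.dat", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
try:
    # Allocate the blocks now; the file size becomes SIZE immediately.
    os.posix_fallocate(fd, 0, SIZE)
    # ... the real data would then be streamed into place with os.pwrite() ...
finally:
    os.close(fd)

print(os.path.getsize("segment.dat"))  # 16777216
```

XFS also supports preallocation natively (e.g. via the xfs_io "resvsp"/"allocsp" commands), which is the same idea exposed as a tool.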
* Re: stable xfs 2006-07-20 16:17 ` Chris Wedgwood @ 2006-07-20 16:38 ` Ming Zhang 2006-07-20 19:04 ` Chris Wedgwood 2006-07-22 18:09 ` Peter Grandi 0 siblings, 2 replies; 33+ messages in thread From: Ming Zhang @ 2006-07-20 16:38 UTC (permalink / raw) To: Chris Wedgwood; +Cc: Peter Grandi, Linux XFS On Thu, 2006-07-20 at 09:17 -0700, Chris Wedgwood wrote: > On Thu, Jul 20, 2006 at 10:08:22AM -0400, Ming Zhang wrote: > > > we mainly handle large media files like 20-50GB. so file number is > > not too much. but file size is large. > > xfs_repair usually deals with that fairly well in reality (much better > than lots of small files anyhow) sounds cool. yes, large # of small files are always painful. > > > hope i never need to run repair, but i do need to defrag from time > > to time. > > if you preallocate you can avoid that (this is what i do, i > preallocate in the replication daemon) i could not control my application. so i still need to do defrag some time. > > > hope this does not hold true for a 15x750GB SATA raid5. ;) > > that's ~10TB or so, my guess is that a repair there would take some > GBs of ram > > it would be interesting to test it if you had the time yes. i should find out. hope to force a repair? unplug my power cord? ;) > > there is a 'formular' for working out how much ram is needed roughly > (steve lord posted it a long time ago, hopefully someone can find that > and repost is) > > > say XFS can make use of parallel storage by using multiple > > allocation groups. but XFS need to be built over one block > > device. so if i have 4 smaller raid, i have to use LVM to glue them > > before i create XFS over it right? but then u said XFS over LVM or N > > MD is not good? > > with recent kernels it shouldn't be a problem, the recursive nature of > the block layer changed so you no longer blow up as badly as people > did in the past (also, XFS tends to use less stack these days) sounds cool. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-20 16:38 ` Ming Zhang @ 2006-07-20 19:04 ` Chris Wedgwood 2006-07-21 0:19 ` Ming Zhang 2006-07-22 18:09 ` Peter Grandi 1 sibling, 1 reply; 33+ messages in thread From: Chris Wedgwood @ 2006-07-20 19:04 UTC (permalink / raw) To: Ming Zhang; +Cc: Peter Grandi, Linux XFS On Thu, Jul 20, 2006 at 12:38:01PM -0400, Ming Zhang wrote: > i could not control my application. so i still need to do defrag > some time. one thing that irks me about fsr is that unless it's given path elements, the files created to replace the fragmented file are usually not allocated close to the original file (they are opened by handle after a bulkstat pass) so you tend to scatter your files about if you're not careful also, fsr implies doing a lot more work on the whole, writing, reading and rewriting the files in most cases and because it uses dio it will invalidate the page-cache of any files that might be being read-from when it's running > yes. i should find out. hope to force a repair? umount cleanly and run xfs_repair, check to see how much memory it uses with ps/top/whatever as it's running > unplug my power cord? ;) raid protects against failed disks, it usually doesn't protect well against corruption from lost/bad writes as a result of dropping power. so, if you have backups, sure, go for it ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-20 19:04 ` Chris Wedgwood @ 2006-07-21 0:19 ` Ming Zhang 2006-07-21 3:26 ` Chris Wedgwood 0 siblings, 1 reply; 33+ messages in thread From: Ming Zhang @ 2006-07-21 0:19 UTC (permalink / raw) To: Chris Wedgwood; +Cc: Peter Grandi, Linux XFS On Thu, 2006-07-20 at 12:04 -0700, Chris Wedgwood wrote: > On Thu, Jul 20, 2006 at 12:38:01PM -0400, Ming Zhang wrote: > > > i could not control my application. so i still need to do defrag > > some time. > > one thing that irks me about fsr is that unless it's given path > elements it that the files created to replace the fragmented file are > usually not allocated close the original file (they are openned by > handle after a bulkstat pass) so you tend to scatter your files about > if you're not careful what will be the side effect of this scattering? you want a particular file in a particular place? > > also, fsr implies doing a lot more work on the whole, writing, reading > and rewriting the files in most cases and because it uses dio it will > invalidate the page-cache of any files that might be being read-from > when it's running one thing i worry about with fsr: if a power loss happens while fsr is running, can xfs handle this well? i will backup before trying these. need some time. ;) > > > yes. i should find out. hope to force a repair? > > umount cleanly and run xfs_repair, check to see how much memory it > uses with ps/top/whatever as it's running > > > unplug my power cord? ;) > > raid protects against failed disks, it usually doesn't protect well > against corruption from lost/bad writes as a result of dropping power > so well, if you have backups, sure, go for it ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-21 0:19 ` Ming Zhang @ 2006-07-21 3:26 ` Chris Wedgwood 2006-07-21 13:10 ` Ming Zhang 0 siblings, 1 reply; 33+ messages in thread From: Chris Wedgwood @ 2006-07-21 3:26 UTC (permalink / raw) To: Ming Zhang; +Cc: Peter Grandi, Linux XFS On Thu, Jul 20, 2006 at 08:19:38PM -0400, Ming Zhang wrote: > what will be the side effect about this scattering? there is a desire in some cases to have files in the same directory close to each other on disk > one thing i worry about fsr is when do fsr and some power loss > events happen, can xfs handle this well? yes, fsr creates a temporary file, unlinks it, copies the extents over, and does an atomic swap-extents-if-nothing-changed operation ^ permalink raw reply [flat|nested] 33+ messages in thread
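The copy-then-atomic-swap sequence can be sketched at the file level. This is a hedged illustration only: the real fsr copies with direct I/O and swaps extents via the `XFS_IOC_SWAPEXT` ioctl so the inode number is preserved, whereas this sketch falls back to an atomic `rename`, which creates a new inode:

```python
import os
import shutil

def defrag_by_copy(path):
    """Crude user-space analogue of the fsr sequence: copy the file
    (the copy gets freshly allocated, hopefully contiguous, space),
    then swap it in only if the original did not change meanwhile."""
    before = os.stat(path)
    tmp = path + ".defrag_tmp"
    shutil.copy2(path, tmp)            # rewrite the data into new extents
    with open(tmp, "rb") as f:
        os.fsync(f.fileno())           # make sure the copy is on disk first
    after = os.stat(path)
    if (after.st_mtime_ns, after.st_size) != (before.st_mtime_ns, before.st_size):
        os.unlink(tmp)                 # file changed during the copy: skip it
        return False
    os.replace(tmp, path)              # atomic swap (rename, not SWAPEXT)
    return True
```

Because the swap only happens when the source is unchanged, a crash mid-copy leaves the original file intact, which is why fsr survives power loss.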
* Re: stable xfs 2006-07-21 3:26 ` Chris Wedgwood @ 2006-07-21 13:10 ` Ming Zhang 2006-07-21 16:07 ` Chris Wedgwood 0 siblings, 1 reply; 33+ messages in thread From: Ming Zhang @ 2006-07-21 13:10 UTC (permalink / raw) To: Chris Wedgwood; +Cc: Peter Grandi, Linux XFS On Thu, 2006-07-20 at 20:26 -0700, Chris Wedgwood wrote: > On Thu, Jul 20, 2006 at 08:19:38PM -0400, Ming Zhang wrote: > > > what will be the side effect about this scattering? > > there is a desire in some cases to have files in the same directory > close to each other on disk then what is the benefit? because files under the same dir tend to be accessed together, so putting them close will reduce disk head seeks? other than this, what other benefit is there? > > > one thing i worry about fsr is when do fsr and some power loss > > events happen, can xfs handle this well? > > yes, fsr create a temporary file, unlinks it, copies the extents over, > and does an atomic swap-extents-if-nothing-changed operation so if i have a 500GB file, will it be copied to another 500GB temp file? sounds scary to me. Ming ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-21 13:10 ` Ming Zhang @ 2006-07-21 16:07 ` Chris Wedgwood 2006-07-21 17:00 ` Ming Zhang 0 siblings, 1 reply; 33+ messages in thread From: Chris Wedgwood @ 2006-07-21 16:07 UTC (permalink / raw) To: Ming Zhang; +Cc: Peter Grandi, Linux XFS On Fri, Jul 21, 2006 at 09:10:31AM -0400, Ming Zhang wrote: > then what is the benefit? because files under same dir can be accessed > with locality so put close will reduce disk head seek? yes > other than this, what else benefit? that alone has a measurable benefit to me (i have an overlay filesystem over many smaller 400 to 500GB filesystems so i don't get the benefit of many spindles to reduce average seek times) > so if i have 500GB file, will it be copied to another 500GB temp > file? yes, which isn't always desirable because: * if the file had a small number of extents in the first place, reducing them slightly more isn't much of a gain (ie. going from say 11 to 10 is arguably pointless) (i have a patch to specify the minimum gain required before doing the copy somewhere) * if the file changes during the copy, then it will be skipped until next time, for larger files this is problematic, you could argue attempting to fsr a file that is less than <n> seconds old is pointless as it has a high chance of being active (i have a patch for that too) * fsr has no global overview of what it's doing, so it never does things like 'move this file out of the way to make room for this one' (it can't do this w/o assistance right now), and of course it can't move inodes w/o changing them so there are limits to what can be done anyhow ^ permalink raw reply [flat|nested] 33+ messages in thread
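The two skip heuristics above (minimum extent gain, minimum file age) are easy to express as a candidate filter. A hedged sketch; `extent_count` is a caller-supplied function (it would wrap `xfs_bmap` or the `XFS_IOC_GETBMAP` ioctl, not shown here), and the names and thresholds are illustrative, not fsr's actual code:

```python
import os
import time

def fsr_candidates(paths, extent_count, min_extent_gain=2, min_age_secs=60):
    """Filter files worth defragmenting, per the two patches described
    above: skip files already near the ideal single-extent layout, and
    skip recently modified files that are likely still being written."""
    now = time.time()
    picked = []
    for p in paths:
        st = os.stat(p)
        if now - st.st_mtime < min_age_secs:
            continue                 # written <n> seconds ago: probably active
        if extent_count(p) - 1 < min_extent_gain:
            continue                 # e.g. 11 -> 10 extents: not worth the copy
        picked.append(p)
    return picked
```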
* Re: stable xfs 2006-07-21 16:07 ` Chris Wedgwood @ 2006-07-21 17:00 ` Ming Zhang 2006-07-21 18:07 ` Chris Wedgwood 0 siblings, 1 reply; 33+ messages in thread From: Ming Zhang @ 2006-07-21 17:00 UTC (permalink / raw) To: Chris Wedgwood; +Cc: Peter Grandi, Linux XFS On Fri, 2006-07-21 at 09:07 -0700, Chris Wedgwood wrote: > On Fri, Jul 21, 2006 at 09:10:31AM -0400, Ming Zhang wrote: > > > then what is the benefit? because files under same dir can be accessed > > with locality so put close will reduce disk head seek? > > yes > > > other than this, what else benefit? > > that alone has a measurable benefit to me (i have an overlay > filesystem over many smaller 400 to 500GB filesystems so i don't get > the benefit of many spindles to reduce average seek times) what u mean overlay fs over small fs? like a unionfs? > > > so if i have 500GB file, will it be copied to another 500GB temp > > file? > but other than fsr. there is no better way for this right? of course, preallocate is always good. but i do not have control over applications. > yes, which in many cases isn't always derisable because: > > * if the file had a small number of extents in the first place, > reducing them slightly more isn't much of a gain (ie. going from > say 11 to 10 is argubly pointless) (i have a patch to specifiy > the miniumum gains before doing the copy somewhere) > > * if the file changes during the copy, then it will be skipped until > next time, for larger files this is problematic, you could > argue attemtping to fsr a file that is less than <n> seconds old > is pointless as it has a high chance of being active (i have a > patch for that too)) sounds like a useful patch. :P will it be merged into fsr code? 
> > * fsr has no global overview of what it's doing, so it never does > things like 'move this file out of the way to make room for this > one' (it can't do this w/o assistance right now), and of course it > can't move inodes w/o changing them so there are limits to what > can be done anyhow what kind of assistance do you mean? > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-21 17:00 ` Ming Zhang @ 2006-07-21 18:07 ` Chris Wedgwood 2006-07-24 1:14 ` Ming Zhang 0 siblings, 1 reply; 33+ messages in thread From: Chris Wedgwood @ 2006-07-21 18:07 UTC (permalink / raw) To: Ming Zhang; +Cc: Peter Grandi, Linux XFS On Fri, Jul 21, 2006 at 01:00:44PM -0400, Ming Zhang wrote: > what u mean overlay fs over small fs? like a unionfs? sorta not really, it's userspace libraries which create a virtual filesystem over real filesystems with some database (Berkeley DB). it sorta evolved from an attempt to unify several filesystems spread over cheap PCs into something that pretended to be one larger fs > but other than fsr. there is no better way for this right? not publicly, you could patch fsr or nag me for my patches if that helps > of course, preallocate is always good. but i do not have control > over applications. well, in some cases you could use LD_PRELOAD and influence things, it depends on the application and what you need from it fwiw, most modern p2p applications have terrible access patterns which cause horrible fragmentation (on all fs's, not just XFS) > sounds like a useful patch. :P will it be merged into fsr code? no, because it's ugly and i don't think i ever decoupled it from other changes and posted it > what kind of assistance you mean? 
[WARNING: lots of hand waving ahead, plenty of minor, but important, details ignored] if you wanted much smarter defragmentation semantics, it would probably make sense to * bulkstat the entire volume, this will give you the inode cluster locations and enough information to start building a tree of where all the files are (XFS_IOC_FSGEOMETRY details obviously) * opendir/read to build a full directory tree * use XFS_IOC_GETBMAP & XFS_IOC_GETBMAPA to figure out which blocks are occupied by which files you would now have a pretty good idea of what is using what parts of the disk, except of course it could be constantly changing underneath you to make things harder also, doing this using the existing interfaces is (when i tried it) really really painfully slow if you have a large filesystem with a lot of small files (even when you try to optimize your accesses to minimize seeking by sorting by inode number and submitting several requests in parallel to try and help the elevator merge accesses) once you have some overall picture of the disk, you can decide what you want to move to achieve your goal, typically this would be to reduce the fragmentation of the largest files, and this would mean relocating some or all of those blocks to another place if you want to allocate space in a given AG, you open/creat a temporary file in a directory in that AG (create multiple dirs as needed to ensure you have one or more of these), and preallocate the space --- then you can copy the file over we could also add ioctls to further bias XFS's allocation strategies, like telling it to never allocate in some AGs (needed for an online shrink if someone wanted to make such a thing) or simply bias strongly away from some places, then add other ioctls to allow you to specifically allocate space in those AGs so you can bias what is allocated where another useful ioctl would be a variation of XFS_IOC_SWAPEXT which would swap only some extents. 
there is no internal support for this now except we do have code for XFS_IOC_UNRESVSP64 and XFS_IOC_RESVSP64 so perhaps the idea would be to swap some (but not all) blocks of a file by creating a function that does the equivalent of 'punch a hole' where we want to replace the blocks, and then 'allocate new blocks given some i already have elsewhere' (however, making that all work as one transaction might be very very difficult) it's a lot of effort for something that, for many people, would have only marginal gains ^ permalink raw reply [flat|nested] 33+ messages in thread
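The "sort by inode number" scan optimization Chris mentions is straightforward to sketch. This is an illustration of the ordering idea only, assuming Linux; a real tool would use XFS bulkstat (`XFS_IOC_FSBULKSTAT`) rather than readdir and stat:

```python
import os

def stat_in_inode_order(root):
    """Collect (inode, path) pairs cheaply from readdir, then do the
    expensive stat pass sorted by inode number so the on-disk inode
    clusters are read roughly in disk order rather than directory
    order, minimizing seeks on the scan phase."""
    entries = []
    for dirpath, _, _ in os.walk(root):
        with os.scandir(dirpath) as it:
            for e in it:
                if e.is_file(follow_symlinks=False):
                    # DirEntry.inode() comes straight from readdir: no stat yet
                    entries.append((e.inode(), e.path))
    entries.sort()  # ascending inode number ~= on-disk order
    return [(p, os.lstat(p)) for _, p in entries]
```

On a filesystem with millions of small files, issuing the stats in inode order (and several at a time, in a real implementation) gives the elevator a chance to merge the reads.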
* Re: stable xfs 2006-07-21 18:07 ` Chris Wedgwood @ 2006-07-24 1:14 ` Ming Zhang 0 siblings, 0 replies; 33+ messages in thread From: Ming Zhang @ 2006-07-24 1:14 UTC (permalink / raw) To: Chris Wedgwood; +Cc: Peter Grandi, Linux XFS On Fri, 2006-07-21 at 11:07 -0700, Chris Wedgwood wrote: > On Fri, Jul 21, 2006 at 01:00:44PM -0400, Ming Zhang wrote: > > > what u mean overlay fs over small fs? like a unionfs? > > sorta not really, it's userspace libraries which create a virtual > filesystem over real filesystems with some database (bezerkely db). > it sorta evolved from an attempt to unify several filesystems spread > over cheap PCs into something that pretended to be one larger fs the fancy word for this is NAS virtualization, i guess. > > > but other than fsr. there is no better way for this right? > > not publicly, you could patch fsr or nag me for my patches if that > helps i will run some tests with fsr and see if i need to bug you about patches. > > > of course, preallocate is always good. but i do not have control > > over applications. > > well, in some cases you could use LD_PRELOAD and influence things, it > depends on the application and what you need from it > > fwiw, most modern p2p applicaitons have terribly access patterns which > cause cause horrible fragmentation (on all fs's, not just XFS) > > > sounds like a useful patch. :P will it be merged into fsr code? > > no, because it's ugly and i don't think i ever decoupled it from other > changes and posted it > > > what kind of assistance you mean? > > [WARNING: lots of hand waving ahead, plenty of minor, but important, > details ignored] > i read this and feel it will be VERY hard to build, especially considering the transaction issue. can this be easier? * analyze the fs to find out which file(s) to defrag; * create a temp file and begin to copy, reserving the space so it is contiguous; * after the first round of copy, keep a trace table of changed blocks and do a second round on them. 
* lock and switch the old file with new file. > if you wanted much smarter defragmentation semantics, it would > probably make sense to > > * bulkstat the entire volume, this will give you the inode cluster > locations and enough information to start building a tree of where > all the files are (XFS_IOC_FSGEOMETRY details obviously) > > * opendir/read to build a full directory tree > > * use XFS_IOC_GETBMAP & XFS_IOC_GETBMAPA to figure out which blocks > are occupied by which files > > you would now have a pretty good idea of what is using what parts of > the disk, except of course it could be constantly changing underneath > you to make things harder > > also, doing this using the existing interfaces is (when i tried it) > really really painfully slow if you have a large filesystem with a lot > of small files (even when you try to optimized you accesses for > minimize seeking by sorting by inode number and submitting several > requests in parallel to try and help the elevator merge accesses) > > > one you have some overall picture of the disk, you can decide what you > want to move to achieve your goal, typically this would be to reduce > the fragmentation of the largest files, and this would be be > relocating some of all of those blocks to another place > > if you want to allocate space in a given AG, you open/creat a > temporary file in a directory in that AG (create multiple dirs as > needed to ensure you have one or more of these), and preallocate the > space --- there you can copy the file over > > we could also add ioctls to further bias XFSs allocation strategies, > like telling it to never allocate in some AGs (needed for an online > shrink if someone wanted to make such a thing) or simply bias strongly > away from some places, then add other ioctls to allow you to > specifically allocate space in those AGs so you can bias what is > allocated where > > another useful ioctl would be a variation of XFS_IOC_SWAPEXT which > would swap only some extents. 
there is no internal support for this > now except we do have code for XFS_IOC_UNRESVSP64 and XFS_IOC_RESVSP64 > so perhaps the idea would be to swap some (but not all) blocks of a > file by creating a function that do the equivalent of 'punch a hole' > where we want to replace the blocks, and then 'allocate new blocks > given some i already have elsewhere' (however, making that all work as > one transaction might be very very difficult) > > it's a lot of effort for what for many people wouldn't only have > marginal gains ^ permalink raw reply [flat|nested] 33+ messages in thread
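Ming's copy-then-catch-up proposal can be sketched naively. A hedged illustration: it detects changed blocks by re-reading and comparing, whereas the whole point of his trace table is filesystem help to *track* dirty blocks cheaply, and the final lock-and-switch step is omitted:

```python
import shutil

BLOCK = 1 << 16  # 64 KiB comparison granularity (illustrative choice)

def copy_with_catchup(src, dst, max_rounds=5):
    """Bulk-copy once, then re-copy only blocks that differ, repeating
    until a round finds no changes; the caller would then briefly lock
    the file and switch old for new.  Assumes src does not change size."""
    shutil.copyfile(src, dst)                  # round 1: full copy
    for _ in range(max_rounds):
        dirty = 0
        with open(src, "rb") as s, open(dst, "r+b") as d:
            while True:
                off = s.tell()
                block = s.read(BLOCK)
                if not block:
                    break
                d.seek(off)
                if d.read(len(block)) != block:   # changed since last round
                    d.seek(off)
                    d.write(block)
                    dirty += 1
        if dirty == 0:
            return True        # converged: safe to lock and switch
    return False               # still churning after max_rounds
```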
* Re: stable xfs 2006-07-20 16:38 ` Ming Zhang 2006-07-20 19:04 ` Chris Wedgwood @ 2006-07-22 18:09 ` Peter Grandi 1 sibling, 0 replies; 33+ messages in thread From: Peter Grandi @ 2006-07-22 18:09 UTC (permalink / raw) To: Linux XFS >>> On Thu, 20 Jul 2006 12:38:01 -0400, Ming Zhang >>> <mingz@ele.uri.edu> said: [ ... ] >>> we mainly handle large media files like 20-50GB. so file >>> number is not too much. but file size is large. >> xfs_repair usually deals with that fairly well in reality >> (much better than lots of small files anyhow) > sounds cool. yes, large # of small files are always painful. It is not just number of inodes, it is also number of extents. That is total number of metadata items. [ ... ] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-20 14:08 ` Ming Zhang 2006-07-20 16:17 ` Chris Wedgwood @ 2006-07-22 17:47 ` Peter Grandi 1 sibling, 0 replies; 33+ messages in thread From: Peter Grandi @ 2006-07-22 17:47 UTC (permalink / raw) To: Linux XFS >>> On Thu, 20 Jul 2006 10:08:22 -0400, Ming Zhang >>> <mingz@ele.uri.edu> said: [ ... ] mingz> hope i never need to run repair, A ''strategic'' attitude :-). mingz> but i do need to defrag from time to time. As to defrag, I reckon that defrag-in-place is a very bad idea, but I have to admit that contrary evidence exists, and I was rather surprised to read this: http://OSS.SGI.com/archives/xfs/2006-03/msg00110.html «> How many people defrag their filesystems using xfs_fsr > /dev/PARTITION if their fragmentation is > 50% etc? Does > anyone regularly defrag their production filesystems or > just defrag their filesystems on a regular basis? We have several hundred production filesystems defragmented every night.» Even so I think that defragment-by-copy is a much better option. mingz> [ ... ] we mainly handle large media files like 20-50GB. mingz> [ ....] hope this does not hold true for a 15x750GB SATA mingz> raid5. ;) mingz> [ ... ] say XFS can make use of parallel storage by using mingz> multiple allocation groups. but XFS need to be built over mingz> one block device. so if i have 4 smaller raid, i have to mingz> use LVM to glue them before i create XFS over it right? Well, I'll just hint that I cannot find euphemisms suitable for expressing what I think of this setup :-). mingz> but then u said XFS over LVM or N MD is not good? It was me saying that [euphemism alert!] I would not recommend a setup like that without understanding very well the consequences. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-19 13:11 ` Ming Zhang 2006-07-20 6:15 ` Chris Wedgwood @ 2006-07-22 15:37 ` Peter Grandi 1 sibling, 0 replies; 33+ messages in thread From: Peter Grandi @ 2006-07-22 15:37 UTC (permalink / raw) To: Linux XFS >>> On Wed, 19 Jul 2006 09:11:10 -0400, Ming Zhang >>> <mingz@ele.uri.edu> said: [ ... ] mingz> what kind of "ram vs fs" size ratio here will be a mingz> safe/good/proper one? any rule of thumb? thanks! hope mingz> not 1:1. :) This is driven mostly by the space required by check/repair (which can well be above 4GiB, so 64 bit systems are often required): http://OSS.SGI.com/archives/linux-xfs/2005-08/msg00045.html «e.g. it took 1.5GiB RAM for 32bit xfs_check and 2.7GiB RAM for a 64bit xfs_check on a 1.1TiB filesystem with 3million inodes in it.» It suggests that a 10TB filesystem might need about 15 gigabytes of RAM (or swap, with corresponding slowdown), after all less than 0.2% of its size. Anyhow, a system with lots of RAM to speedily check/repair an XFS filesystem also benefits from the same RAM for caching and delayed writing, so it is all for good (as long as one has a perfectly reliable block IO subsystem). Note that the 15 gigabytes in the example above are well above what a 32 bit process can address, thus for multi-terabyte filesystems one should really have a 64 bit system (from the same article mentioned above): http://OSS.SGI.com/archives/linux-xfs/2005-08/msg00045.html «> > Your filesystem (8TiB) may simply bee too large for your > > system to be able to repair. Try mounting it on a 64bit > > system with more RAM in it and repairing it from there. > > Sorry, but is this a joke? A joke? Absolutely not. Acheivable XFS filesystem sizes outgrew the capability of 32 bit Irix systems to repair them several years ago. Now that linux supports larger than 2TiB filesystems on 32 bit systems, this is true for Linux as well.» ^ permalink raw reply [flat|nested] 33+ messages in thread
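The 15 GB for 10 TB data point above reduces to a rough rule of thumb. Treat it as a floor, not a guarantee: actual xfs_repair memory scales with metadata (inode and extent counts), not raw capacity, and the ratio here is just read off the single example quoted:

```python
def repair_ram_gb(fs_size_tb, gb_per_tb=1.5):
    """Back-of-the-envelope xfs_repair RAM estimate: ~1.5 GB per TB,
    from the 15 GB / 10 TB figure quoted above.  Metadata-heavy
    filesystems (many inodes or extents) will need more."""
    return fs_size_tb * gb_per_tb

print(repair_ram_gb(10))   # 15.0 -- far beyond a 32-bit address space
```

Any estimate above roughly 3 GB already argues for running the repair on a 64-bit system, per the article quoted above.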
* Re: stable xfs 2006-07-17 15:30 stable xfs Ming Zhang 2006-07-17 16:20 ` Peter Grandi @ 2006-07-18 23:54 ` Nathan Scott 2006-07-19 1:15 ` Ming Zhang 2006-07-19 7:40 ` Martin Steigerwald 1 sibling, 2 replies; 33+ messages in thread From: Nathan Scott @ 2006-07-18 23:54 UTC (permalink / raw) To: Ming Zhang; +Cc: xfs On Mon, Jul 17, 2006 at 11:30:23AM -0400, Ming Zhang wrote: > Hi All > > We want to use XFS in all of our production servers but feel a little > scary about the corruption problems seen in this list. I wonder which > 2.6.16+ kernel we can use in order to get a stable XFS? Thanks! Use the latest 2.6.17 -stable release, or a vendor kernel (SLES is particularly good with XFS, as SGI works closely with SUSE). The current batch of corruption reports is due to one unfortunate bug that slipped through our QA testing net, which happily is a fairly rare occurrence (it was a very subtle bug). XFS also tends to get a bad rap (IMO) from the way it reports on-disk corruption and I/O errors in critical data structures, which is quite different to many other filesystems - it dumps a stack trace into the system log (a lot of people mistake that for a panic) and "shuts down" the filesystem, with subsequent accesses returning errors until the problem is resolved. > ps, one friend mentioned that XFS has some issue with LVM+MD under it. > Is this true? No. cheers. -- Nathan ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-18 23:54 ` Nathan Scott @ 2006-07-19 1:15 ` Ming Zhang 2006-07-19 7:40 ` Martin Steigerwald 1 sibling, 0 replies; 33+ messages in thread From: Ming Zhang @ 2006-07-19 1:15 UTC (permalink / raw) To: Nathan Scott; +Cc: xfs thanks a lot for this detailed explanation! i will check both the 2.6.17 -stable release and the sles kernel. unfortunately, i only play with RHEL so far. Ming On Wed, 2006-07-19 at 09:54 +1000, Nathan Scott wrote: > On Mon, Jul 17, 2006 at 11:30:23AM -0400, Ming Zhang wrote: > > Hi All > > > > We want to use XFS in all of our production servers but feel a little > > scary about the corruption problems seen in this list. I wonder which > > 2.6.16+ kernel we can use in order to get a stable XFS? Thanks! > > Use the latest 2.6.17 -stable release, or a vendor kernel (SLES is > particularly good with XFS, as SGI works closely with SUSE). > > The current batch of corruption reports is due to one unfortunate > bug that has slipped through our QA testing net, which happily is > a fairly rare occurence (it was a very subtle bug). > > XFS also tends to get a bad rap (IMO) from the way it reports on-disk > corruption and I/O errors in critical data structures, which is quite > different to many other filesystems - it dumps a stack trace into the > system log (alot of people mistake that for a panic) and "shuts down" > the filesystem, with subsequent accesses returning errors until the > problem is resolved. > > > ps, one friend mentioned that XFS has some issue with LVM+MD under it. > > Is this true? > > No. > > cheers. > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-18 23:54 ` Nathan Scott 2006-07-19 1:15 ` Ming Zhang @ 2006-07-19 7:40 ` Martin Steigerwald 2006-07-19 14:11 ` Ming Zhang 1 sibling, 1 reply; 33+ messages in thread From: Martin Steigerwald @ 2006-07-19 7:40 UTC (permalink / raw) To: Nathan Scott; +Cc: Ming Zhang, xfs On Wednesday, 19 July 2006 01:54, Nathan Scott wrote: > On Mon, Jul 17, 2006 at 11:30:23AM -0400, Ming Zhang wrote: > > Hi All > > > > We want to use XFS in all of our production servers but feel a little > > scary about the corruption problems seen in this list. I wonder which > > 2.6.16+ kernel we can use in order to get a stable XFS? Thanks! > > Use the latest 2.6.17 -stable release, or a vendor kernel (SLES is > particularly good with XFS, as SGI works closely with SUSE). Hello Nathan, as far as I can see the fix for kernel bug #6757 has not yet made it into a stable kernel release up to 2.6.17.6 and thus should be applied manually: http://bugzilla.kernel.org/show_bug.cgi?id=6757 It probably doesn't happen for lots of people but I would still apply that patch until it is finally put into a stable point release. Regards, -- Martin Steigerwald - team(ix) GmbH - http://www.teamix.de gpg: 19E3 8D42 896F D004 08AC A0CA 1E10 C593 0399 AE90 ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: stable xfs 2006-07-19 7:40 ` Martin Steigerwald @ 2006-07-19 14:11 ` Ming Zhang 0 siblings, 0 replies; 33+ messages in thread From: Ming Zhang @ 2006-07-19 14:11 UTC (permalink / raw) To: Martin Steigerwald; +Cc: Nathan Scott, xfs yes. thx for reminding. Ming On Wed, 2006-07-19 at 09:40 +0200, Martin Steigerwald wrote: > Am Mittwoch, 19. Juli 2006 01:54 schrieb Nathan Scott: > > On Mon, Jul 17, 2006 at 11:30:23AM -0400, Ming Zhang wrote: > > > Hi All > > > > > > We want to use XFS in all of our production servers but feel a little > > > scary about the corruption problems seen in this list. I wonder which > > > 2.6.16+ kernel we can use in order to get a stable XFS? Thanks! > > > > Use the latest 2.6.17 -stable release, or a vendor kernel (SLES is > > particularly good with XFS, as SGI works closely with SUSE). > > Hello Nathan, > > as far as I can see the fix for kernel bug #6757 has not yet made it in a > stable kernel release upto 2.6.17.6 and thus should manually be applied: > > http://bugzilla.kernel.org/show_bug.cgi?id=6757 > > It probably doesn't happen for lots of people but I would still apply that > patch unless it is finally put into a stable point release. > > Regards, ^ permalink raw reply [flat|nested] 33+ messages in thread