Subject: Re: stable xfs
From: Ming Zhang
Reply-To: mingz@ele.uri.edu
Date: Wed, 19 Jul 2006 10:45:04 -0400
To: Peter Grandi
Cc: Linux XFS

On Wed, 2006-07-19 at 11:53 +0100, Peter Grandi wrote:
> [ ... ]
>
> mingz> when u say large parallel storage system, you mean
> mingz> independent spindles right? but most people will have all
> mingz> disks configured in one RAID5/6 and thus it is not parallel
> mingz> any more.
>
> cw> it depends, you might have 100s of spindles in groups, you
> cw> don't make a giant raid5/6 array with that many disks, you
> cw> make a number of smaller arrays
>
> Perhaps you are underestimating the ''if it can be done''
> mindset...
>
> Also, if one does a number of smaller RAID5s, is each one a
> separate filesystem or do they get aggregated, for example with
> LVM ''concat''? Either way, how likely is it that the
> consequences have been thought through?
>
> I would personally hesitate to recommend either, especially a
> two-level arrangement where the base level is a RAID5.

could you give us some hints on this? it is really popular to have an
FS/LV/MD structure, and I believe LVM is designed for this purpose.
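just to make the question concrete, this is the sort of back-of-envelope
arithmetic I had in mind (plain Python; the per-disk failure rate and
rebuild time are guesses for illustration, not measurements), comparing
one big RAID5 against several smaller RAID5 groups concatenated with LVM:

# back-of-envelope model: double-disk-failure events per year.
# DISK_AFR and REBUILD_HOURS are made-up illustrative numbers.
DISK_AFR = 0.03          # assumed annualised failure rate per disk
REBUILD_HOURS = 24.0     # assumed rebuild time after one disk dies
HOURS_PER_YEAR = 24 * 365

def raid5_loss_rate(disks_per_group, groups=1):
    # rate of a first disk failure somewhere in one group (per year)
    first = disks_per_group * DISK_AFR
    # chance a second disk in the same group dies during the rebuild
    second = (disks_per_group - 1) * DISK_AFR * (REBUILD_HOURS / HOURS_PER_YEAR)
    # with LVM concat on top, a double failure in *any* group hits the
    # whole filesystem, so multiply by the number of groups
    return groups * first * second

print("one 24-disk RAID5      :", raid5_loss_rate(24))
print("4 x 6-disk RAID5 concat:", raid5_loss_rate(6, groups=4))

even this toy model makes the single giant array look several times
worse, and it ignores that rebuilding a 24-disk group takes longer than
rebuilding a 6-disk one; but either way the concat means a double
failure in any one group is the whole filesystem's problem.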
> [I am making an effort in this discussion to use euphemisms]
>
> mingz> i think with write barrier support, system without UPS
> mingz> should be ok.
>
> cw> with barrier support a UPS shouldn't be necessary
>
> Sure, «should» and «shouldn't» are nice hopeful concepts.
>
> But write barriers are difficult to achieve, and when achieved
> they are often unreliable, except on enterprise-level hardware,
> because many disks/host adapters/... simply lie as to whether
> they have actually started writing stuff (never mind finished
> writing it, or written it correctly).
>
> To get reliable write barriers one often has to source special
> cards or disks with custom firmware; or leave system integration
> to the big expensive guys and buy an Altix or equivalent system
> from Sun or IBM.
>
> Besides, I have seen many reports of ''corruption'' that cannot
> be fixed by write barriers: many have the expectation that
> *data* should not be lost, even if no 'fsync' is done, *as if*
> 'mount -o sync' or 'mount -o data=ordered' had been used.
>
> Of course that is a bit of an inflated expectation, but all that
> the vast majority of sysadms care about is whether it ''just
> works'', without ''wasting time'' figuring things out.
>
> mingz> considering even u have UPS, kernel oops in other parts
> mingz> still can take the FS down.
>
> cw> but a crash won't cause writes to be 'reordered' [ ... ]
>
> The metadata will be consistent, but metadata and data may well
> be lost. So the filesystem is still ''corrupted'', at least
> from the point of view of a sysadm who just wants the filesystem
> to be effortlessly foolproof. Anyhow, if a crash happens all
> bets are off, because who knows *what* gets written.
>
> Look at it from the point of view of a ''practitioner'' sysadm:
>
>   ''who cares if the metadata is consistent, if my 3TiB
>   application database is unusable (and I don't do backups
>   because after all it is a concat of RAID5s, backups are not
>   necessary) as there is a huge gap in some data file, and my
>   users are yelling at me, and it is not my fault''
>
> The tradeoff in XFS is that if you know exactly what you are
> doing you get extra performance...

then I think unless you disable all write caches, no file system can
achieve this goal. or maybe ext3 with both data and metadata in the
journal (data=journal) might do this?

Ming
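ps: to illustrate what ''not losing data without fsync'' would actually
take on the application side, a minimal sketch (plain Python, nothing
XFS specific; and whether the final flush really reaches the platters
still depends on the drive/controller not lying about its write cache):

import os

def durable_write(path, data):
    # write the data and push it out of the OS page cache; with working
    # barriers/cache flushes this should also flush the drive cache
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)
    # fsync the containing directory so the name itself survives a crash
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

durable_write("/tmp/example.dat", b"some payload")

anything that skips this and just expects 'mount -o sync' semantics for
free is the inflated expectation Peter is talking about.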