RE: XFS corruption during power-blackout

linux-fsdevel.vger.kernel.org archive mirror

* RE: XFS corruption during power-blackout
       [not found] <20050629001847.GB850@frodo>
@ 2005-06-29  4:53 ` Al Boldi
  2005-06-29 16:38   ` Christian Rice
  2005-06-29 17:02   ` Chris Wedgwood
  0 siblings, 2 replies; 36+ messages in thread
From: Al Boldi @ 2005-06-29  4:53 UTC (permalink / raw)
  To: 'Nathan Scott'
  Cc: linux-xfs, linux-kernel, linux-fsdevel, reiserfs-list

Hi Nathan,
You wrote: {
On Tue, Jun 28, 2005 at 12:08:05PM +0300, Al Boldi wrote:
> True now, not so around 2.4.20 when XFS was rock-solid. I think they 
> tried to improve on performance and broke something. I wish they would 
> fix that because it forced me back to ext3, as in consistency over 
> performance any time.

Can you provide any details...
}

Specifically, in 2.4.20 I did an acid test:
Spawn 10 cp -a on some big dir like /usr.
Let it run for a few seconds, then pull the plug.
Don't reset-button, reset is different than pulling the plug.
Don't poweroff-button, poweroff is different than pulling the plug.
On reboot diff the dirs spawned.

What I found were 4 things in the dest dir:
1. Missing Dirs,Files. That's OK.
2. Files of size 0. That's acceptable.
3. Corrupted Files. That's unacceptable.
4. Corrupted Files with original fingerprint. That's ABSOLUTELY
unacceptable.

Ext3 performed best, with minimal files of size 0.
XFS was second, with more files of size 0.
Reiser and JFS were worst, with corruptions.

When XFS was added into the vanilla-Kernel it caused corruptions like Reiser
and JFS, which forced me back to Ext3.
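
A minimal sketch of the post-reboot comparison step, assuming Python 3 and
that each copy is compared against the untouched source tree. The function
name classify() and the category names are hypothetical, and reading
"original fingerprint" as "same size and mtime as the source" is an
assumption, not something stated above.

```python
import filecmp
import os


def classify(src_root, dst_root):
    """Sort every regular file under src_root into one of the outcome
    categories by looking at its counterpart under dst_root."""
    result = {
        "intact": [],
        "missing": [],
        "zero_size": [],
        "corrupted": [],
        "corrupted_same_fingerprint": [],
    }
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in sorted(filenames):
            src = os.path.join(dirpath, name)
            if os.path.islink(src) or not os.path.isfile(src):
                continue  # this sketch only compares regular files
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            if not os.path.exists(dst):
                result["missing"].append(rel)
            elif os.path.getsize(dst) == 0 and os.path.getsize(src) > 0:
                result["zero_size"].append(rel)
            elif filecmp.cmp(src, dst, shallow=False):
                result["intact"].append(rel)
            else:
                # Content differs. Assumption: "original fingerprint" means
                # the size and mtime still match the source file.
                s, d = os.stat(src), os.stat(dst)
                same = (s.st_size == d.st_size
                        and int(s.st_mtime) == int(d.st_mtime))
                key = "corrupted_same_fingerprint" if same else "corrupted"
                result[key].append(rel)
    return result
```

Running classify("/usr", "/mnt/test/copy1") for each of the ten copies (both
paths are placeholders) would give per-category counts to compare between
filesystems.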


* Re: XFS corruption during power-blackout
  2005-06-29  4:53 ` XFS corruption during power-blackout Al Boldi
@ 2005-06-29 16:38   ` Christian Rice
  2005-06-29 17:02   ` Chris Wedgwood
  1 sibling, 0 replies; 36+ messages in thread
From: Christian Rice @ 2005-06-29 16:38 UTC (permalink / raw)
  To: Al Boldi
  Cc: 'Nathan Scott', linux-xfs, linux-kernel, linux-fsdevel,
	reiserfs-list

Al Boldi wrote:

>Hi Nathan,
>You wrote: {
>On Tue, Jun 28, 2005 at 12:08:05PM +0300, Al Boldi wrote:
>  
>
>>True now, not so around 2.4.20 when XFS was rock-solid. I think they 
>>tried to improve on performance and broke something. I wish they would 
>>fix that because it forced me back to ext3, as in consistency over 
>>performance any time.
>>    
>>
>
>Can you provide any details...
>}
>
>Specifically, in 2.4.20 I did an acid test:
>Spawn 10 cp -a on some big dir like /usr.
>Let it run for a few seconds, then pull the plug.
>Don't reset-button, reset is different than pulling the plug.
>Don't poweroff-button, poweroff is different than pulling the plug.
>On reboot diff the dirs spawned.
>
>What I found were 4 things in the dest dir:
>1. Missing Dirs,Files. That's OK.
>2. Files of size 0. That's acceptable.
>3. Corrupted Files. That's unacceptable.
>4. Corrupted Files with original fingerprint. That's ABSOLUTELY
>unacceptable.
>
>Ext3 performed best, with minimal files of size 0.
>XFS was second, with more files of size 0.
>Reiser and JFS were worst, with corruptions.
>
>When XFS was added into the vanilla-Kernel it caused corruptions like Reiser
>and JFS, which forced me back to Ext3.
>
>
>
>  
>
Pardon me if I haven't seen the whole thread.

Do you have hard drive write cache turned off or, if it's a RAID card, a 
battery backup on the write cache?  That makes a big difference when 
operators begin doing things like pulling plugs and hitting reset.

Again, no offense, just one of those "have you taken it out of the box, 
plugged it in and turned it on" kind of questions.


* Re: XFS corruption during power-blackout
  2005-06-29  4:53 ` XFS corruption during power-blackout Al Boldi
  2005-06-29 16:38   ` Christian Rice
@ 2005-06-29 17:02   ` Chris Wedgwood
  2005-06-29 17:56     ` Steve Lord
  2005-07-01  8:17     ` David Masover
  1 sibling, 2 replies; 36+ messages in thread
From: Chris Wedgwood @ 2005-06-29 17:02 UTC (permalink / raw)
  To: Al Boldi
  Cc: 'Nathan Scott', linux-xfs, linux-kernel, linux-fsdevel,
	reiserfs-list

On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:

> What I found were 4 things in the dest dir:
> 1. Missing Dirs,Files. That's OK.
> 2. Files of size 0. That's acceptable.
> 3. Corrupted Files. That's unacceptable.
> 4. Corrupted Files with original fingerprint. That's ABSOLUTELY
> unacceptable.

Disks usually default to write caching these days and can lose data as a
result; disable that.
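
A small sketch of one way to switch the drive's write cache off, assuming
the hdparm utility is installed and the script runs as root. hdparm's -W
option gets or sets the drive write-caching feature; the helper names and
the device path in the usage note are placeholders.

```python
import subprocess


def write_cache_command(device, enable):
    """Build the hdparm command line that turns the drive's write cache
    on (-W 1) or off (-W 0)."""
    return ["hdparm", "-W", "1" if enable else "0", device]


def set_write_cache(device, enable):
    """Run hdparm; raises CalledProcessError if hdparm reports failure."""
    subprocess.run(write_cache_command(device, enable), check=True)
```

For example, set_write_cache("/dev/hda", False) would disable caching on
that drive. Whether the setting survives a power cycle depends on the drive,
so it may need to be reapplied at boot.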


* Re: XFS corruption during power-blackout
  2005-06-29 17:02   ` Chris Wedgwood
@ 2005-06-29 17:56     ` Steve Lord
  2005-06-29 20:56       ` Chris Wedgwood
  2005-06-29 21:10       ` Nathan Scott
  2005-07-01  8:17     ` David Masover
  1 sibling, 2 replies; 36+ messages in thread
From: Steve Lord @ 2005-06-29 17:56 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Al Boldi, 'Nathan Scott', linux-xfs, linux-kernel,
	linux-fsdevel, reiserfs-list

Chris Wedgwood wrote:
> On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
> 
> 
>>What I found were 4 things in the dest dir:
>>1. Missing Dirs,Files. That's OK.
>>2. Files of size 0. That's acceptable.
>>3. Corrupted Files. That's unacceptable.
>>4. Corrupted Files with original fingerprint. That's ABSOLUTELY
>>unacceptable.
> 
> 
> Disks usually default to write caching these days and can lose data as a
> result; disable that.
> 

There are IDE drives where the vendor will tell you that you will
drastically shorten the life of a drive if you turn off caching.
There are also cool bits of technology which use the rotational
energy of the spinning down drive to dump the cache out to a
special track (or this may be an urban legend, not sure). Problem
is, no one but the vendors really knows what any particular
disk is going to do when you pull the plug.

I did spend a bunch of time once ensuring that when you typed
sync on xfs you could pull the power right after that and
everything from before the sync survived. There have been a
lot of changes both in xfs and the surrounding kernel since
then. I do not know if anyone has attempted this effort
again recently.

If you care sufficiently about your data to want to do power fail
testing then, even assuming the filesystem works perfectly:

a) have a working, tested, regular backup policy
b) keep the backups in a different building
c) buy a UPS.

Steve


* Re: XFS corruption during power-blackout
  2005-06-29 17:56     ` Steve Lord
@ 2005-06-29 20:56       ` Chris Wedgwood
  2005-06-30 16:30         ` Bryan Henderson
  2005-06-29 21:10       ` Nathan Scott
  1 sibling, 1 reply; 36+ messages in thread
From: Chris Wedgwood @ 2005-06-29 20:56 UTC (permalink / raw)
  To: Steve Lord
  Cc: Al Boldi, 'Nathan Scott', linux-xfs, linux-kernel,
	linux-fsdevel, reiserfs-list

On Wed, Jun 29, 2005 at 12:56:12PM -0500, Steve Lord wrote:

> There are also cool bits of technology which use the rotational
> energy of the spinning down drive to dump the cache out to a special
> track (or this may be an urban legend, not sure).

This seems only to be true for very small writes.  I suspect on power
loss a drive can finish writing the current sector.

Anyhow, I've tested power loss on drives with caching enabled and they
definitely do lose data.  Sometimes a couple of MBs worth.

I don't know if this is true for all drives but NONE of the ones I had
access to when testing did anything like save the cache --- pretty
much all data that was inflight got lost.

> I did spend a bunch of time once ensuring that when you typed sync
> on xfs you could pull the power right after that and everything from
> before the sync survived.

I think this is probably still true.  If I sync then drop power I
don't seem to have any problems provided caching is off.

If caching is enabled I still lose data.  Linux does have a concept of
write barriers but these are not yet implemented for XFS.  Once they
are, I assume sync + poweroff will be reliable with caching enabled.


* Re: XFS corruption during power-blackout
  2005-06-29 20:56       ` Chris Wedgwood
@ 2005-06-30 16:30         ` Bryan Henderson
  2005-06-30 18:46           ` Chris Wedgwood
  0 siblings, 1 reply; 36+ messages in thread
From: Bryan Henderson @ 2005-06-30 16:30 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Al Boldi, linux-fsdevel, linux-xfs, Steve Lord,
	'Nathan Scott', reiserfs-list

>I don't know if this is true for all drives but NONE of the ones I had
>access to when testing did anything like save the cache --- pretty
>much all data that was inflight got lost.

For another point of reference - were these ATA (personal class) or SCSI 
(commercial class) drives or both?

Is write caching the default on typical SCSI devices?

>Linux does have a concept of
>write barriers but these are not yet implemented for XFS.  Once they
>are, I assume sync + poweroff will be reliable with caching enabled.

But be careful with the 'sync' program/system call.  As defined by POSIX, 
it is not a synchronizing operation.  It's supposed to cause buffered 
writes to get hardened some time soon, not right now.  So in theory, you 
can't pull the plug after typing "sync."  In Linux, the implementation has 
changed a few times in this respect.  In some versions, it at least 
_tries_ to implement "everything that was buffered when sync() started is 
hardened before sync() returns."  In others, it implements "everything 
that was buffered when sync() started is hardened before the next sync() 
returns," and some 'sync' programs do multiple sync()s.  And it's also 
filesystem-type-dependent.  I don't know exactly what the present state 
is.

fsync(), on the other hand, is a true synchronizing operation.
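
A minimal sketch of a per-file synchronizing write, assuming Python 3 on
Linux. The helper name write_durably() is hypothetical. The extra fsync()
on the parent directory is an addition not discussed above; it is there so
that the new directory entry is also pushed out.

```python
import os


def write_durably(path, data):
    """Write data to path and return only after fsync() has succeeded."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may write less than requested
            view = view[written:]
        os.fsync(fd)  # blocks until the kernel reports the data as written
    finally:
        os.close(fd)

    # Assumption: also fsync the containing directory so the file's name
    # survives, not just its contents.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

This only covers what the kernel and filesystem do. If the drive itself
acknowledges writes from a volatile cache, fsync() returning is still not
proof that the data is on the platter.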

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Filesystems


* Re: XFS corruption during power-blackout
  2005-06-30 16:30         ` Bryan Henderson
@ 2005-06-30 18:46           ` Chris Wedgwood
  2005-06-30 19:44             ` Jörn Engel
                               ` (3 more replies)
  0 siblings, 4 replies; 36+ messages in thread
From: Chris Wedgwood @ 2005-06-30 18:46 UTC (permalink / raw)
  To: Bryan Henderson
  Cc: Al Boldi, linux-fsdevel, linux-xfs, Steve Lord,
	'Nathan Scott', reiserfs-list

On Thu, Jun 30, 2005 at 12:30:20PM -0400, Bryan Henderson wrote:

> For another point of reference - were these ATA (personal class) or
> SCSI (commercial class) drives or both?

The IDE disks were some old Maxtor 60GB disks and some not-so-old 200GB
WD drives.  The Maxtors have 2MB of cache, the WDs 8MB.

The SCSI disks were 10K RPM SCA somethings.  I think they were Seagate
(they've since been taken or else I would check).  I have no idea what
the cache is on those.

> Is write caching the default on typical SCSI devices?

I'm not sure.  It seemed to be off by default for the SCSI disks and
on by default for IDE when I checked.  I can't rule out the
bios/controller doing something there though.

> But be careful with the 'sync' program/system call.  As defined by
> POSIX, it is not a synchronizing operation.

Yes, but POSIX is broken in places.  The Linux implementation (now and
for some time, but not always) won't return until all dirty data is
flushed.

POSIX is a bit more sane about fsync():

      The fsync() function can be used by an application to indicate
      that all data for the open file description named by fildes is
      to be transferred to the storage device associated with the file
      described by fildes in an implementation-dependent manner. The
      fsync() function does not return until the system has completed
      that action or until an error is detected.

> It's supposed to cause buffered writes to get hardened some time
> soon, not right now.  So in theory, you can't pull the plug after
> typing "sync."

Data loss internal to the disks aside, you can under Linux.  I do it all
the time.  Various other people do too, and this is something some people
do test.

> In others, it implements "everything that was buffered when sync()
> started is hardened before the next sync() returns,"

That is what happens now.  I'm not sure any other behavior makes sense
does it?

If it happens differently I would call that a bug.

> and some 'sync' programs do multiple sync()s.

Such programs are arguably broken (grub maybe?).  If one doesn't work,
then why should doing it <n>-times?

> And it's also filesystem-type-dependent.

If a filesystem doesn't flush reliably with sync, I would call that a
bug.

> fsync(), on the other hand, is a true synchronizing operation.

Again that requires the fs to behave correctly so if it fails it
should be reported as a bug.


* Re: XFS corruption during power-blackout
  2005-06-30 18:46           ` Chris Wedgwood
@ 2005-06-30 19:44             ` Jörn Engel
  2005-06-30 20:32               ` Chris Wedgwood
  2005-06-30 20:49             ` Bryan Henderson
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 36+ messages in thread
From: Jörn Engel @ 2005-06-30 19:44 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Bryan Henderson, Al Boldi, linux-fsdevel, linux-xfs, Steve Lord,
	'Nathan Scott', reiserfs-list

On Thu, 30 June 2005 11:46:27 -0700, Chris Wedgwood wrote:
> On Thu, Jun 30, 2005 at 12:30:20PM -0400, Bryan Henderson wrote:
> 
> > In others, it implements "everything that was buffered when sync()
> > started is hardened before the next sync() returns,"
> 
> That is what happens now.  I'm not sure any other behavior makes sense
> does it?
> 
> If it happens differently I would call that a bug.

While I agree with all the rest, this part confuses me.  Do you mean
that sync() should actually return immediately, but the second sync()
should block until all data present at the time of the previous sync()
is hardened?

Or do you rather mean that a single sync() should block until all data
currently present is hardened?

Jörn

-- 
It is better to die of hunger having lived without grief and fear,
than to live with a troubled spirit amid abundance.
-- Epictetus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: XFS corruption during power-blackout
  2005-06-30 19:44             ` Jörn Engel
@ 2005-06-30 20:32               ` Chris Wedgwood
  2005-06-30 21:07                 ` Jörn Engel
  2005-07-01 12:36                 ` Ric Wheeler
  0 siblings, 2 replies; 36+ messages in thread
From: Chris Wedgwood @ 2005-06-30 20:32 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Bryan Henderson, Al Boldi, linux-fsdevel, linux-xfs, Steve Lord,
	'Nathan Scott', reiserfs-list

On Thu, Jun 30, 2005 at 09:44:37PM +0200, Jörn Engel wrote:

> Or do you rather mean that a single sync() should block until all data
> currently present is hardened?

Logically sync() should return only after all dirty buffers that
existed before sync() was called are flushed.

Anything more than this (i.e. waiting on newly (since sync was called
but before it returns) dirtied buffers) could live-lock (actually,
this used to happen sometimes, I don't know if that's the case).
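
A toy model of that rule, assuming Python 3. It is not kernel code: dirty
buffers are plain integers in a set, and sync_snapshot() and its parameters
are hypothetical names. It only illustrates why waiting on a snapshot taken
at call time always terminates, even if writers keep dirtying new buffers.

```python
def sync_snapshot(dirty, flush, on_each_flush=None):
    """Flush only the buffers that were dirty when the call started.

    dirty          -- set of buffer ids, mutated in place
    flush          -- callable invoked once per buffer in the snapshot
    on_each_flush  -- optional hook simulating concurrent writers that
                      dirty more buffers while the sync is running
    """
    snapshot = sorted(dirty)  # fixed at call time, so the loop is bounded
    for buf in snapshot:
        flush(buf)
        dirty.discard(buf)
        if on_each_flush is not None:
            on_each_flush(dirty)  # new dirt is left for the next sync
    return snapshot
```

A variant that looped "until the dirty set is empty" would never return
under the same hook, which is the live-lock case.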


* Re: XFS corruption during power-blackout
  2005-06-30 20:32               ` Chris Wedgwood
@ 2005-06-30 21:07                 ` Jörn Engel
  2005-07-01 12:36                 ` Ric Wheeler
  1 sibling, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2005-06-30 21:07 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Bryan Henderson, Al Boldi, linux-fsdevel, linux-xfs, Steve Lord,
	'Nathan Scott', reiserfs-list

On Thu, 30 June 2005 13:32:23 -0700, Chris Wedgwood wrote:
> On Thu, Jun 30, 2005 at 09:44:37PM +0200, Jörn Engel wrote:
> 
> > Or do you rather mean that a single sync() should block until all data
> > currently present is hardened?
> 
> Logically sync() should return only after all dirty buffers that
> existed before sync() was called are flushed.

That's what I thought.  Thanks for the confirmation.

> Anything more than this (i.e. waiting on newly (since sync was called
> but before it returns) dirtied buffers) could live-lock (actually,
> this used to happen sometimes, I don't know if that's the case).

... and would be totally useless anyway, yep.

Jörn

-- 
The strong give up and move away, while the weak give up and stay.
-- unknown


* Re: XFS corruption during power-blackout
  2005-06-30 20:32               ` Chris Wedgwood
  2005-06-30 21:07                 ` Jörn Engel
@ 2005-07-01 12:36                 ` Ric Wheeler
  2005-07-01 12:56                   ` Jens Axboe
  1 sibling, 1 reply; 36+ messages in thread
From: Ric Wheeler @ 2005-07-01 12:36 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Jörn Engel, Bryan Henderson, Al Boldi, linux-fsdevel, linux-xfs,
	Steve Lord, 'Nathan Scott', reiserfs-list

Chris Wedgwood wrote:

>On Thu, Jun 30, 2005 at 09:44:37PM +0200, Jörn Engel wrote:
>
>  
>
>>Or do you rather mean that a single sync() should block until all data
>>currently present is hardened?
>>    
>>
>
>Logically sync() should return only after all dirty buffers that
>existed before sync() was called are flushed.
>
>Anything more than this (i.e. waiting on newly (since sync was called
>but before it returns) dirtied buffers) could live-lock (actually,
>this used to happen sometimes, I don't know if that's the case).
>  
>
I think that we need one more stage in sync() behavior to make sure that 
the data is safely on the platter.  For file systems with supported 
write barriers, the last IO should be wrapped with a barrier to flush 
the disk cache.

It doesn't seem that sync() does that in today's code.


* Re: XFS corruption during power-blackout
  2005-07-01 12:36                 ` Ric Wheeler
@ 2005-07-01 12:56                   ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2005-07-01 12:56 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Chris Wedgwood, Jörn Engel, Bryan Henderson, Al Boldi,
	linux-fsdevel, linux-xfs, Steve Lord, 'Nathan Scott',
	reiserfs-list

On Fri, Jul 01 2005, Ric Wheeler wrote:
> Chris Wedgwood wrote:
> 
> >On Thu, Jun 30, 2005 at 09:44:37PM +0200, Jörn Engel wrote:
> >
> > 
> >
> >>Or do you rather mean that a single sync() should block until all data
> >>currently present is hardened?
> >>   
> >>
> >
> >Logically sync() should return only after all dirty buffers that
> >existed before sync() was called are flushed.
> >
> >Anything more than this (i.e. waiting on newly (since sync was called
> >but before it returns) dirtied buffers) could live-lock (actually,
> >this used to happen sometimes, I don't know if that's the case).
> > 
> >
> I think that we need one more stage in sync() behavior to make sure that 
> the data is safely on the platter.  For file systems with supported 
> write barriers, the last IO should be wrapped with a barrier to flush 
> the disk cache.
> 
> It doesn't seem that sync() does that in today's code.

That is true.  sync() really only guarantees that the I/O has been issued
and the drive has signalled completion; with write-back caching on, the
data might not be on the platter yet.

-- 
Jens Axboe


* Re: XFS corruption during power-blackout
  2005-06-30 18:46           ` Chris Wedgwood
  2005-06-30 19:44             ` Jörn Engel
@ 2005-06-30 20:49             ` Bryan Henderson
  2005-07-01 12:53               ` Ric Wheeler
  2005-07-01  1:09             ` Stewart Smith
  2005-07-05 15:53             ` Sonny Rao
  3 siblings, 1 reply; 36+ messages in thread
From: Bryan Henderson @ 2005-06-30 20:49 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Al Boldi, linux-fsdevel, linux-xfs, Steve Lord,
	'Nathan Scott', reiserfs-list

>POSIX is broken in places

...

>If it happens differently I would call that a bug.

I think you're confusing goodness with correctness.  POSIX is a 
definition; it can't be broken.  A bug is where you don't meet your own 
specification.  So if the spec doesn't say you have to be synchronous, 
it's not a bug not to be synchronous.  Call it a design flaw if you want.

>> In others, it implements "everything that was buffered when sync()
>> started is hardened before the next sync() returns,"
>
>That is what happens now.  I'm not sure any other behavior makes sense
>does it?

I think you quoted the wrong part.  From context, I think you meant 
"everything that was buffered when sync() started is hardened before 
sync() returns."  And it's also my understanding that current Linux does 
that.

Another Linux sync() behavior is that it keeps syncing super blocks until 
every super block is clean at the same moment.  That has given me fits.  I 
don't know what the goal of that is -- it came in around 2.4.10.

>POSIX is a bit more sane about fsync():
>
>      The fsync() function can be used by an application to indicate
>      that all data for the open file description named by fildes is
>      to be transferred to the storage device associated with the file
>      described by fildes in an implementation-dependent manner. The
>      fsync() function does not return until the system has completed
>      that action or until an error is detected.

Strange; that's not the way I remember it.  I remember it being much more 
vague;  in particular, I remember a specification that did not assume that 
a file is associated with a particular device and referred instead to 
"stable storage," the definition of which was entirely up to the 
implementation.  In other words, the definition I've been working from was 
more grown-up.  I wonder what the difference is.

>> and some 'sync' programs do multiple sync()s.
>
>Such programs are arguably broken (grub maybe?).  If one doesn't work,
>then why should doing it <n>-times?

It's because of the words before that:  "everything that was buffered when 
sync() started is hardened before the next sync() returns."  The point is that 
the second sync() is the one that waits (it actually waits for the 
previous one to finish before it starts).  By the way, I'm not talking 
about Linux at this point.  I'm talking about so-called POSIX systems in 
general.

But it does sound like Linux has a pretty firm philosophy of synchronous 
sync (I see it documented in an old man page), so I guess it's OK to rely 
on it.

There are scenarios where you'd rather not have a process tied up while 
syncing takes place.  Stepping back, I would guess the primary original 
purpose of sync() was to allow you to make a sync daemon.  Early Unix 
systems did not have in-kernel safety clean timers.  A user space process 
did that.

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Filesystems


* Re: XFS corruption during power-blackout
  2005-06-30 20:49             ` Bryan Henderson
@ 2005-07-01 12:53               ` Ric Wheeler
  2005-07-01 18:24                 ` Bryan Henderson
  0 siblings, 1 reply; 36+ messages in thread
From: Ric Wheeler @ 2005-07-01 12:53 UTC (permalink / raw)
  To: Bryan Henderson
  Cc: Chris Wedgwood, Al Boldi, linux-fsdevel, linux-xfs, Steve Lord,
	'Nathan Scott', reiserfs-list

Bryan Henderson wrote:

>
>It's because of the words before that:  "everything that was buffered when 
>sync() started is hardened before the next sync() returns."  The point is that 
>the second sync() is the one that waits (it actually waits for the 
>previous one to finish before it starts).  By the way, I'm not talking 
>about Linux at this point.  I'm talking about so-called POSIX systems in 
>general.
>
>But it does sound like Linux has a pretty firm philosophy of synchronous 
>sync (I see it documented in an old man page), so I guess it's OK to rely 
>on it.
>
>There are scenarios where you'd rather not have a process tied up while 
>syncing takes place.  Stepping back, I would guess the primary original 
>purpose of sync() was to allow you to make a sync daemon.  Early Unix 
>systems did not have in-kernel safety clean timers.  A user space process 
>did that.
>
>--
>Bryan Henderson                     IBM Almaden Research Center
>San Jose CA                         Filesystems
>  
>
We have been playing around with various sync techniques that allow you 
to get good data safety for a large batch of files (think of a restore 
of a file system or a migration of lots of files from one server to 
another).  You can always restart a restore if the box goes down in the 
middle, but once you are done, you want a hard promise that all files 
are safely on the disk platter.

Using system level sync() has all  of the disadvantages that you mention 
along with the lack of a per-file system barrier flush.

You can try to hack in a flush by issuing an fsync() call on one file 
per file system after the sync() completes, but whether or not the file 
system issues a barrier operation is file system dependent.

Doing an fsync() per file is slow but safe. Writing the files without 
syncing and then reopening and fsync()'ing each one in  reasonable batch 
size is much faster, but still kludgey.
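
A rough sketch of that batched approach, assuming Python 3 on Linux (where
fsync() on a descriptor opened read-only is accepted). The function name
write_in_batches() and the default batch size are placeholders, and
"batch" is read here as "write N files unsynced, then reopen and fsync
those N", which is one interpretation of the description above.

```python
import os


def write_in_batches(files, batch_size=64):
    """files is a list of (path, bytes) pairs.

    Each batch is written with ordinary buffered I/O first, then every
    file in the batch is reopened and fsync()ed. Returns the number of
    files synced.
    """
    synced = 0
    for start in range(0, len(files), batch_size):
        batch = files[start:start + batch_size]

        # Phase 1: write without syncing, so the filesystem can schedule
        # the I/O however it likes.
        for path, data in batch:
            with open(path, "wb") as f:
                f.write(data)

        # Phase 2: reopen and fsync each file in the batch.
        for path, _data in batch:
            fd = os.open(path, os.O_RDONLY)
            try:
                os.fsync(fd)
                synced += 1
            finally:
                os.close(fd)
    return synced
```

By the time phase 2 runs, much of the batch may already have been written
back, so each fsync() has less to wait for than it would in a strict
write-then-fsync loop per file.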

An attractive, but as far as I can see missing feature, would be the 
ability to do a file system specific sync() command.  Another option 
would be a batched AIO like fsync() with a bit vector of descriptors to 
sync.  Not surprisingly, the best performance is reached when you let 
the writing phase work asynchronously, let the underlying file 
system do its thing, and wrap it up with a group cache-to-disk sync and a 
single disk write cache invalidate (barrier) at the end.


* Re: XFS corruption during power-blackout
  2005-07-01 12:53               ` Ric Wheeler
@ 2005-07-01 18:24                 ` Bryan Henderson
  2005-07-01 19:58                   ` David Masover
  0 siblings, 1 reply; 36+ messages in thread
From: Bryan Henderson @ 2005-07-01 18:24 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Al Boldi, Chris Wedgwood, linux-fsdevel, linux-xfs, Steve Lord,
	'Nathan Scott', reiserfs-list

>We have been playing around with various sync techniques that allow you 
>to get good data safety for a large batch of files (think of a restore 
>of a file system or a migration of lots of files from one server to 
>another).  You can always restart a restore if the box goes down in the 
>middle, but once you are done, you want a hard promise that all files 
>are safely on the disk platter.
>
>Using system level sync() has all  of the disadvantages that you mention 
>along with the lack of a per-file system barrier flush.
>
>You can try to hack in a flush by issuing an fsync() call on one file 
>per file system after the sync() completes, but whether or not the file 
>system issues a barrier operation is file system dependent.
>
>Doing an fsync() per file is slow but safe. Writing the files without 
>syncing and then reopening and fsync()'ing each one in  reasonable batch 
>size is much faster, but still kludgey.
>
>An attractive, but as far as I can see missing feature, would be the 
>ability to do a file system specific sync() command.  Another option
>would be a batched AIO like fsync() with a bit vector of descriptors to 
>sync.  Not surprisingly, the best performance is reached when you let 
>the writing phase work asynchronously, let the underlying file 
>system do its thing, and wrap it up with a group cache-to-disk sync and a 
>single disk write cache invalidate (barrier) at the end.

Hear, hear to all of that.  sync() has gotten to be really old-fashioned.

You can sync an individual filesystem image if the filesystem is on a block 
device or a suitable simulation of one, by opening a block device special 
file for the device and doing fsync().
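
A short sketch of that trick, assuming Python 3 on Linux and enough
privilege to open the device node. The helper name fsync_path() is
hypothetical and the device name in the usage note is a placeholder.

```python
import os


def fsync_path(path):
    """Open path (a regular file or a block device special file) and
    fsync() it, closing the descriptor afterwards."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```

Calling fsync_path("/dev/sdb1") would then act on the device backing one
filesystem rather than on the whole system. Whether that also covers data
the filesystem has not yet handed to the block layer is
filesystem-dependent.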

What you'd really like is to fsync a multi-file unit of work (transaction) 
-- and not just among open files.  You'd like to open, write, and close 
1000 files in a single transaction and then commit that transaction, with 
no syncing due to timers in the meantime.  If you're really greedy, you'd 
also ask for complete rollback if the system fails before the commit.
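
For comparison, a user-space approximation that is possible without any new
kernel interface, assuming Python 3 on a POSIX filesystem where rename() of
a directory is atomic. This is a staging-directory-plus-rename technique,
not the filesystem-level transaction described above, and it does sync each
file before the commit point. The names commit_files() and _fsync_dir() are
hypothetical.

```python
import os
import shutil
import tempfile


def _fsync_dir(path):
    """fsync() a directory so its entries reach stable storage."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def commit_files(final_dir, files):
    """Create final_dir holding files (name -> bytes), all or nothing.

    final_dir must not exist yet. Until the rename, nothing is visible
    under the final name; a crash before it leaves only a .staging-*
    directory to clean up.
    """
    parent = os.path.dirname(os.path.abspath(final_dir))
    staging = tempfile.mkdtemp(prefix=".staging-", dir=parent)
    try:
        for name, data in files.items():
            with open(os.path.join(staging, name), "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
        _fsync_dir(staging)
        os.rename(staging, final_dir)  # the commit point
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)  # roll back
        raise
    _fsync_dir(parent)  # make the rename itself durable
```

The rollback here is only as good as the cleanup of leftover staging
directories after a crash, which is exactly the kind of bookkeeping a
filesystem-level transaction would make unnecessary.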

I've always found it awkward that any user can do a sync(), when it's a 
system-wide control operation.

In the Storage Tank Linux filesystem driver I designed, you could turn off 
safety cleaning with a mount option (and could mount the filesystem 
multiple times in order to work with multiple options).  You could also 
turn it off for a particular file with a "temporary file" attribute, and a 
file which was not linked to a directory was also understood to be 
temporary.  Safety cleaning is what sync() and the internal timers do.

Safety cleaning doesn't make much sense unless it goes down inside the 
storage device as well.

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Filesystems



* Re: XFS corruption during power-blackout
  2005-07-01 18:24                 ` Bryan Henderson
@ 2005-07-01 19:58                   ` David Masover
  2005-07-01 21:10                     ` Jörn Engel
  0 siblings, 1 reply; 36+ messages in thread
From: David Masover @ 2005-07-01 19:58 UTC (permalink / raw)
  To: Bryan Henderson
  Cc: Ric Wheeler, Al Boldi, Chris Wedgwood, linux-fsdevel, linux-xfs,
	Steve Lord, 'Nathan Scott', reiserfs-list

Bryan Henderson wrote:
[...]
> What you'd really like is to fsync a multi-file unit of work (transaction) 
> -- and not just among open files.  You'd like to open, write, and close 
> 1000 files in a single transaction and then commit that transaction, with 
> no syncing due to timers in the meantime.  If you're really greedy, you'd 
> also ask for complete rollback if the system fails before the commit.

Both of these are planned for Reiser4.  Or is it 4.1?

I would like said interface to be able to not necessarily flush to disk 
right away, though.  It should certainly be an option (I'm sure MySQL 
would use that option), but sometimes you want the performance, 
especially if there are dozens of these transactions firing all at once 
-- better to let RAM fill up and then flush them all.


* Re: XFS corruption during power-blackout
  2005-07-01 19:58                   ` David Masover
@ 2005-07-01 21:10                     ` Jörn Engel
  2005-07-01 21:39                       ` David Masover
  0 siblings, 1 reply; 36+ messages in thread
From: Jörn Engel @ 2005-07-01 21:10 UTC (permalink / raw)
  To: David Masover
  Cc: Bryan Henderson, Ric Wheeler, Al Boldi, Chris Wedgwood,
	linux-fsdevel, linux-xfs, Steve Lord, 'Nathan Scott',
	reiserfs-list

On Fri, 1 July 2005 14:58:39 -0500, David Masover wrote:
> Bryan Henderson wrote:
> [...]
> >What you'd really like is to fsync a multi-file unit of work (transaction) 
> >-- and not just among open files.  You'd like to open, write, and close 
> >1000 files in a single transaction and then commit that transaction, with 
> >no syncing due to timers in the meantime.  If you're really greedy, you'd 
> >also ask for complete rollback if the system fails before the commit.
> 
> Both of these are planned for Reiser4.  Or is it 4.1?

Both are pretty trivial to implement for a tree-based fs like
reiserfs.  Non-trivial is the user interface.  Not sure if sys_reiser
is the answer to that.

Jörn

-- 
When people work hard for you for a pat on the back, you've got
to give them that pat.
-- Robert Heinlein

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: XFS corruption during power-blackout
  2005-07-01 21:10                     ` Jörn Engel
@ 2005-07-01 21:39                       ` David Masover
  0 siblings, 0 replies; 36+ messages in thread
From: David Masover @ 2005-07-01 21:39 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Bryan Henderson, Ric Wheeler, Al Boldi, Chris Wedgwood,
	linux-fsdevel, linux-xfs, Steve Lord, 'Nathan Scott',
	reiserfs-list

Jörn Engel wrote:
> On Fri, 1 July 2005 14:58:39 -0500, David Masover wrote:
> 
>>Bryan Henderson wrote:
>>[...]
>>
>>>What you'd really like is to fsync a multi-file unit of work (transaction) 
>>>-- and not just among open files.  You'd like to open, write, and close 
>>>1000 files in a single transaction and then commit that transaction, with 
>>>no syncing due to timers in the meantime.  If you're really greedy, you'd 
>>>also ask for complete rollback if the system fails before the commit.
>>
>>Both of these are planned for Reiser4.  Or is it 4.1?
> 
> 
> Both are pretty trivial to implement for a tree-based fs like
> reiserfs.  Non-trivial is the user interface.  Not sure if sys_reiser
> is the answer to that.

It is intended to be, I think.  But sys_reiser has been pushed off to 
4.1, last I checked.

From the general attitude here, I'm guessing that it should *not* be 
called sys_reiser.  We're already doing the meta-files interface for 
doing anything we want to do with reiser, which means sys_reiser 
currently only does two things:  allows simultaneous access to lots of 
small files efficiently (versus open()-ing each of them), and 
transactions.  While the two may or may not belong in the same system 
call, I don't believe they should be Reiser-specific.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: XFS corruption during power-blackout
  2005-06-30 18:46           ` Chris Wedgwood
  2005-06-30 19:44             ` Jörn Engel
  2005-06-30 20:49             ` Bryan Henderson
@ 2005-07-01  1:09             ` Stewart Smith
  2005-07-05 15:53             ` Sonny Rao
  3 siblings, 0 replies; 36+ messages in thread
From: Stewart Smith @ 2005-07-01  1:09 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Bryan Henderson, Al Boldi, linux-fsdevel, linux-xfs, Steve Lord,
	'Nathan Scott', reiserfs-list

[-- Attachment #1: Type: text/plain, Size: 2238 bytes --]

On Thu, 2005-06-30 at 11:46 -0700, Chris Wedgwood wrote:
> Yes, but POSIX is broken in places.  The Linux implementation (now and
> for some time, but not always) won't return until all dirty data is
> flushed.

POSIX, in regard to fsync() provides "flexibility for the
implementation" - maybe your environment is special and you don't buffer
anything, so fsync() is null. Or perhaps you cannot control some of the
disk caches, so fsync() is null.

In newer systems, you can check for the flag _POSIX_SYNCHRONIZED_IO
that, if set, guarantees that fsync() is synchronously flushing
buffers to disk. However, this only came into the spec in '99 or 2000, I
think, so there are still a lot of systems in which you have to know the
behaviour.
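
The flag can also be queried at run time through sysconf(). The sketch below assumes a Python build that exposes the SC_SYNCHRONIZED_IO name (the function name is made up for illustration). A positive answer only says that the operating system implements the option; it says nothing about what the drive does with its own cache.

```python
import os


def synchronized_io_supported():
    """Return True/False for the POSIX synchronized I/O option, or None
    if this platform cannot answer the question."""
    name = "SC_SYNCHRONIZED_IO"
    if name not in getattr(os, "sysconf_names", {}):
        return None
    try:
        value = os.sysconf(name)
    except (OSError, ValueError):
        return None
    # sysconf() reports -1 when the option is not supported.
    return value > 0
```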

> > and some 'sync' programs do multiple sync()s.
> 
> Such programs are arguably broken (grub maybe?).  If one doesn't work,
> then why should doing it <n>-times?

It's a legacy from the days when it was an async operation. The idea
went: that the time it took to type sync and press enter three times
(note, no using up-arrow, enter - typing) would be long enough for the
buffers that started to get flushed on the first sync to have hit disk.

> > And it's also filesystem-type-dependent.
> 
> If a filesystem doesn't flush reliably with sync, I would call that a
> bug.
> 
> > fsync(), on the other hand, is a true synchronizing operation.
> 
> Again that requires the fs to behave correctly so if it fails it
> should be reported as a bug.

It's all fun and games - reliably getting data to disk is not fun. If
Linux can reliably follow the idea that fsync() is synchronous and
really does flush everything to disk, then it will be a lot better off
than a lot of other platforms.

Also, it'd be useful to have a list of where bugs affecting this have
been found and in what kernels - it is not out of the question to
explicitly code in exceptions (read: big warnings to users) for these
systems.

I guess a list of known-bad drives and controllers could be useful too.
Doubly useful if the kernel could report this, but a userspace list
would also be good. 

-- 
Stewart Smith (stewart@flamingspork.com)
http://www.flamingspork.com/

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 307 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: XFS corruption during power-blackout
  2005-06-30 18:46           ` Chris Wedgwood
                               ` (2 preceding siblings ...)
  2005-07-01  1:09             ` Stewart Smith
@ 2005-07-05 15:53             ` Sonny Rao
  3 siblings, 0 replies; 36+ messages in thread
From: Sonny Rao @ 2005-07-05 15:53 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Bryan Henderson, Al Boldi, linux-fsdevel, linux-xfs, Steve Lord,
	'Nathan Scott', reiserfs-list

On Thu, Jun 30, 2005 at 11:46:27AM -0700, Chris Wedgwood wrote:
> On Thu, Jun 30, 2005 at 12:30:20PM -0400, Bryan Henderson wrote:
> 
> > For another point of reference - were these ATA (personal class) or
> > SCSI (commercial class) drives or both?
> 
> The IDE disks were some old Maxtor 60GB disks and some not-so-old 200GB
> WD drives.  The Maxtors have 2MB cache, the WDs 8MB.
> 
> The SCSI disks were 10K RPM SCA somethings.  I think they were Seagate
> (they've since been taken or else I would check).  I have no idea what
> the cache is on those.
> 
> > Is write caching the default on typical SCSI devices?
> 
> I'm not sure.  It seemed to be off by default for the SCSI disks and
> on by default for IDE when I checked.  I can't rule out the
> bios/controller doing something there though.

On all the SCSI drives shipped w/ servers write-caching is turned off
for this very reason.  This is true of all the IBM equipment I've
seen, not sure about the smaller mom & pop outfits or drives sold
through retail channels though.
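
On kernels much newer than the ones discussed in this thread, the block layer exports its view of a device's cache mode in sysfs, so the default can be checked without vendor tools. A minimal sketch, assuming the queue/write_cache attribute is present; the function name and the device name "sda" are only examples. It reports what the kernel believes, which is not proof of what the drive actually does.

```python
import os


def write_cache_mode(device, sysfs_root="/sys/block"):
    """Return the kernel's view of the device's cache mode, for example
    "write back" or "write through", or None if it is not exported."""
    path = os.path.join(sysfs_root, device, "queue", "write_cache")
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None
```

For example, write_cache_mode("sda") returning "write back" means the kernel treats the drive as having a volatile write cache.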

Sonny

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: XFS corruption during power-blackout
  2005-06-29 17:56     ` Steve Lord
  2005-06-29 20:56       ` Chris Wedgwood
@ 2005-06-29 21:10       ` Nathan Scott
  1 sibling, 0 replies; 36+ messages in thread
From: Nathan Scott @ 2005-06-29 21:10 UTC (permalink / raw)
  To: Steve Lord
  Cc: Chris Wedgwood, Al Boldi, linux-xfs, linux-kernel, linux-fsdevel,
	reiserfs-list

On Wed, Jun 29, 2005 at 12:56:12PM -0500, Steve Lord wrote:
> I did spend a bunch of time once ensuring that when you typed
> sync on xfs you could pull the power right after that and
> everything from before the sync survived. There have been a
> lot of changes both in xfs and the surrounding kernel since
> then. I do not know if anyone has attempted this effort
> again recently.

Yep, someone has, a number of times.  And as Homer would say
"it's still good!".

cheers.

-- 
Nathan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: XFS corruption during power-blackout
  2005-06-29 17:02   ` Chris Wedgwood
  2005-06-29 17:56     ` Steve Lord
@ 2005-07-01  8:17     ` David Masover
  2005-07-01  9:24       ` Jens Axboe
  1 sibling, 1 reply; 36+ messages in thread
From: David Masover @ 2005-07-01  8:17 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Al Boldi, 'Nathan Scott', linux-xfs, linux-kernel,
	linux-fsdevel, reiserfs-list

Chris Wedgwood wrote:
> On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
> 
> 
>>What I found were 4 things in the dest dir:
>>1. Missing Dirs,Files. That's OK.
>>2. Files of size 0. That's acceptable.
>>3. Corrupted Files. That's unacceptable.
>>4. Corrupted Files with original fingerprint. That's ABSOLUTELY
>>unacceptable.
> 
> 
> disks usually default to caching these days and can lose data as a
> result; disable that

Not always possible.  Some disks lie and leave caching on anyway.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: XFS corruption during power-blackout
  2005-07-01  8:17     ` David Masover
@ 2005-07-01  9:24       ` Jens Axboe
       [not found]         ` <20050701131950.GA15180@ime.usp.br>
  2005-07-01 14:05         ` Al Boldi
  0 siblings, 2 replies; 36+ messages in thread
From: Jens Axboe @ 2005-07-01  9:24 UTC (permalink / raw)
  To: David Masover
  Cc: Chris Wedgwood, Al Boldi, 'Nathan Scott', linux-xfs,
	linux-kernel, linux-fsdevel, reiserfs-list

On Fri, Jul 01 2005, David Masover wrote:
> Chris Wedgwood wrote:
> >On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
> >
> >
> >>What I found were 4 things in the dest dir:
> >>1. Missing Dirs,Files. That's OK.
> >>2. Files of size 0. That's acceptable.
> >>3. Corrupted Files. That's unacceptable.
> >>4. Corrupted Files with original fingerprint. That's ABSOLUTELY
> >>unacceptable.
> >
> >
> >disks usually default to caching these days and can lose data as a
> >result; disable that
> 
> Not always possible.  Some disks lie and leave caching on anyway.

And the same (and others) disks will not honor a flush anyways. Moral of
that story - avoid bad hardware.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 36+ messages in thread

[parent not found: <20050701131950.GA15180@ime.usp.br>]

* Re: XFS corruption during power-blackout
       [not found]         ` <20050701131950.GA15180@ime.usp.br>
@ 2005-07-01 13:57           ` Ric Wheeler
  2005-07-01 18:37             ` Bryan Henderson
  0 siblings, 1 reply; 36+ messages in thread
From: Ric Wheeler @ 2005-07-01 13:57 UTC (permalink / raw)
  To: Rogério Brito; +Cc: linux-kernel, Brett Russ, linux-fsdevel

Rogério Brito wrote:

>On Jul 01 2005, Jens Axboe wrote:
>  
>
>>On Fri, Jul 01 2005, David Masover wrote:
>>    
>>
>>>Not always possible.  Some disks lie and leave caching on anyway.
>>>      
>>>
>>And the same (and others) disks will not honor a flush anyways.
>>Moral of that story - avoid bad hardware.
>>    
>>
>
>But how does the end-user know what hardware is "good hardware"? Which
>vendors don't lie (or, at least, lie less than others) regarding HDs?
>
>
>Thanks, Rogério Brito.
>
>  
>
The only real way is to test the drive (and retest when you get a new 
version of firmware) and the whole fsync -> write barrier code path.

We use a bus analyzer to make sure that when you fsync() a file, you 
will see a cache flush command coming across the bus. Of course, that is 
the easy step ;-)

The second step is to test your system across power failures.  We have a 
"wbtest" code that we have used to catch bugs. The basic idea is to 
write a file to a disk with the cache turned off, write the same file to 
the disk with the write barrier (and working cache flush command) and 
then randomly drop power to the box.  It is important to really drop 
power to the whole box since a "reset button" push often does not drop 
power to the drives and will give you false passes.

Our wbtest used to be good at finding holes in the write barrier code 
using 2.4 kernels and PATA drives, but we have had no luck yet in 
catching known bugs with this test on 2.6 with S-ATA drives.

Ideas on how to get a more effective test are welcome - it is a very 
small window that you need to hit to catch a misbehaving drive (i.e., 
your write cache flush command has returned, you want to drop power and 
on reboot, validate that the platter contains that last IO correctly).  
If you had enough NVRAM in a test system, you might be able to 
substitute a NVRAM backed file system for the write-cache disabled drive 
and get closer to catching the window.

The alternative is to either run with the write cache disabled (again, 
you will need to validate that the drive really disabled the cache) or 
to buy a mid-range or better storage array that provides a non-volatile 
(battery backed) write cache.
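
The wbtest code itself is not shown in this thread. The following is only a sketch of the general idea under stated assumptions: a writer appends checksummed, sequence-numbered records and calls fsync() after each one, the last acknowledged sequence number is reported to a second machine that keeps its power, and after the reboot a verifier checks that every acknowledged record is intact on the platter. The record layout and both function names are invented for illustration.

```python
import os
import struct
import zlib

RECORD = struct.Struct("<QI")   # sequence number, CRC32 of the payload
PAYLOAD_SIZE = 4096


def append_record(f, seq):
    """Append one checksummed record and return only after fsync()."""
    payload = seq.to_bytes(8, "little") * (PAYLOAD_SIZE // 8)
    f.write(RECORD.pack(seq, zlib.crc32(payload)) + payload)
    f.flush()
    os.fsync(f.fileno())
    return seq   # report this to the other machine as "acknowledged"


def last_intact_record(path):
    """After reboot: return the highest sequence number whose record is
    complete, in order, and passes its checksum, or -1 if there is none."""
    last = -1
    with open(path, "rb") as f:
        while True:
            header = f.read(RECORD.size)
            payload = f.read(PAYLOAD_SIZE)
            if len(header) < RECORD.size or len(payload) < PAYLOAD_SIZE:
                break
            seq, crc = RECORD.unpack(header)
            if seq != last + 1 or zlib.crc32(payload) != crc:
                break
            last = seq
    return last
```

A run fails if last_intact_record() comes back lower than the last sequence number that was acknowledged before power was dropped.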


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: XFS corruption during power-blackout
  2005-07-01 13:57           ` Ric Wheeler
@ 2005-07-01 18:37             ` Bryan Henderson
  2005-07-01 18:41               ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Bryan Henderson @ 2005-07-01 18:37 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: linux-fsdevel, Rogério Brito, Brett Russ

>>But how does the end-user know what hardware is "good hardware"? Which
>>vendors don't lie (or, at least, lie less than others) regarding HDs?
>>
>
>The only real way is to test the drive (and retest when you get a new 
>version of firmware) and the whole fsync -> write barrier code path.

Wouldn't a commercial class drive that ignores explicit flushes be 
infamous?  I'm ready to accept that there are SCSI drives that cache 
writes in volatile storage by default (but frankly, I'm still skeptical), 
but I'm not ready to accept that there are drives out there secretly 
ignoring explicit commands to harden data, thus jeopardizing millions of 
dollars' worth of data.  I'd need more evidence.

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Filesystems

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: XFS corruption during power-blackout
  2005-07-01 18:37             ` Bryan Henderson
@ 2005-07-01 18:41               ` Jens Axboe
  2005-07-11 12:53                 ` Ric Wheeler
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2005-07-01 18:41 UTC (permalink / raw)
  To: Bryan Henderson
  Cc: Ric Wheeler, linux-fsdevel, Rogério Brito, Brett Russ

On Fri, Jul 01 2005, Bryan Henderson wrote:
> >>But how does the end-user know what hardware is "good hardware"? Which
> >>vendors don't lie (or, at least, lie less than others) regarding HDs?
> >>
> >
> >The only real way is to test the drive (and retest when you get a new 
> >version of firmware) and the whole fsync -> write barrier code path.
> 
> Wouldn't a commercial class drive that ignores explicit flushes be 
> infamous?  I'm ready to accept that there are SCSI drives that cache 
> writes in volatile storage by default (but frankly, I'm still skeptical), 
> but I'm not ready to accept that there are drives out there secretly 
> ignoring explicit commands to harden data, thus jeopardizing millions of 
> dollars' worth of data.  I'd need more evidence.

I'm pretty sure I have an IBM drive that does so (its flush cache
command is _really_ fast), as a matter of fact :-) I need to locate it
and put it in a test box to re-ensure this.
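
A rough way to look for the same symptom from user space is to time write+fsync pairs on the drive. This is a sketch of a heuristic, not a proof: a 7200 RPM disk needs 60/7200 s, about 8.3 ms, per revolution, so a mean far below that for repeated rewrites of the same block hints that the flush is not waiting for the platter. Drives with non-volatile caches are legitimately fast, and the filesystem adds its own I/O. The function name is made up for illustration.

```python
import os
import time


def mean_fsync_seconds(path, rounds=50):
    """Overwrite one small block `rounds` times, calling fsync() each
    time, and return the mean wall-clock seconds per write+fsync."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.perf_counter()
        for i in range(rounds):
            os.pwrite(fd, i.to_bytes(8, "little") * 64, 0)
            os.fsync(fd)
        return (time.perf_counter() - start) / rounds
    finally:
        os.close(fd)
```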

I'm not sure such drives would necessarily be infamous, hardly anyone
would notice anything wrong in a desktop type machine. Which is what
these drives were made for.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: XFS corruption during power-blackout
  2005-07-01 18:41               ` Jens Axboe
@ 2005-07-11 12:53                 ` Ric Wheeler
  0 siblings, 0 replies; 36+ messages in thread
From: Ric Wheeler @ 2005-07-11 12:53 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Bryan Henderson, linux-fsdevel, Rogério Brito, Brett Russ



Jens Axboe wrote:

>On Fri, Jul 01 2005, Bryan Henderson wrote:
>  
>
>>Wouldn't a commercial class drive that ignores explicit flushes be 
>>infamous?  I'm ready to accept that there are SCSI drives that cache 
>>writes in volatile storage by default (but frankly, I'm still skeptical), 
>>but I'm not ready to accept that there are drives out there secretly 
>>ignoring explicit commands to harden data, thus jeopardizing millions of 
>>dollars' worth of data.  I'd need more evidence.
>>    
>>
>
>I'm pretty sure I have an IBM drive that does so (its flush cache
>command is _really_ fast), as a matter of fact :-) I need to locate it
>and put it in a test box to re-ensure this.
>
>I'm not sure such drives would necessarily be infamous, hardly anyone
>would notice anything wrong in a desktop type machine. Which is what
>these drives were made for.
>  
>
One other thing to keep in mind is that drive firmware can have bugs 
just like any other bit of code, so a drive may have a bug in one 
firmware revision that gets fixed in a following one. 

I am not sure how much that other operating system uses flush cache 
commands, but until the write barrier patch,  it has been a relatively 
rarely issued command for Linux and breakage would not be noticed.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: XFS corruption during power-blackout
  2005-07-01  9:24       ` Jens Axboe
       [not found]         ` <20050701131950.GA15180@ime.usp.br>
@ 2005-07-01 14:05         ` Al Boldi
  2005-07-01 16:35           ` Alistair John Strachan
  2005-07-05 15:49           ` Sonny Rao
  1 sibling, 2 replies; 36+ messages in thread
From: Al Boldi @ 2005-07-01 14:05 UTC (permalink / raw)
  To: 'Jens Axboe', 'David Masover'
  Cc: 'Chris Wedgwood', 'Nathan Scott', linux-xfs,
	linux-kernel, linux-fsdevel, reiserfs-list

Jens Axboe wrote: {
On Fri, Jul 01 2005, David Masover wrote:
> Chris Wedgwood wrote:
> >On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
> >
> >
> >>What I found were 4 things in the dest dir:
> >>1. Missing Dirs,Files. That's OK.
> >>2. Files of size 0. That's acceptable.
> >>3. Corrupted Files. That's unacceptable.
> >>4. Corrupted Files with original fingerprint. That's ABSOLUTELY 
> >>unacceptable.
> >
> >
> >disks usually default to caching these days and can lose data as a 
> >result; disable that
> 
> Not always possible.  Some disks lie and leave caching on anyway.

And the same (and others) disks will not honor a flush anyways. 
Moral of that story - avoid bad hardware.
}

1. Sync is not the issue. The issue is whether a journaled FS can detect
corrupted files and flag them after a power-blackout!
2. Moral of the story is: What's ext3 doing the others aren't?
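
None of the filesystems discussed here flags corrupted file data by itself, so the check has to be done from user space, much like the "diff the dirs" step of the original test. The sketch below sorts the files of a copied tree into the four categories listed above. It assumes that "original fingerprint" means same size and modification time but different content, which is a guess at the intended meaning, and all names are invented for illustration.

```python
import hashlib
import os


def _digest(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def classify_copy(src_root, dst_root):
    """Compare each regular file under src_root with its copy under
    dst_root and sort the relative paths into categories."""
    result = {"ok": [], "missing": [], "empty": [],
              "corrupt": [], "corrupt_same_stat": []}
    for dirpath, _dirs, files in os.walk(src_root):
        for name in files:
            src = os.path.join(dirpath, name)
            if os.path.islink(src):
                continue                      # keep the sketch simple
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            if not os.path.isfile(dst):
                result["missing"].append(rel)
                continue
            s, d = os.stat(src), os.stat(dst)
            if d.st_size == 0 and s.st_size != 0:
                result["empty"].append(rel)
            elif _digest(src) == _digest(dst):
                result["ok"].append(rel)
            elif (s.st_size, int(s.st_mtime)) == (d.st_size, int(d.st_mtime)):
                # Assumed meaning of "original fingerprint".
                result["corrupt_same_stat"].append(rel)
            else:
                result["corrupt"].append(rel)
    return result
```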


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: XFS corruption during power-blackout
  2005-07-01 14:05         ` Al Boldi
@ 2005-07-01 16:35           ` Alistair John Strachan
  2005-07-05 15:49           ` Sonny Rao
  1 sibling, 0 replies; 36+ messages in thread
From: Alistair John Strachan @ 2005-07-01 16:35 UTC (permalink / raw)
  To: Al Boldi
  Cc: 'Jens Axboe', 'David Masover',
	'Chris Wedgwood', 'Nathan Scott', linux-xfs,
	linux-kernel, linux-fsdevel, reiserfs-list

On Friday 01 Jul 2005 15:05, Al Boldi wrote:
> Jens Axboe wrote: {
>
> On Fri, Jul 01 2005, David Masover wrote:
> > Chris Wedgwood wrote:
> > >On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
> > >>What I found were 4 things in the dest dir:
> > >>1. Missing Dirs,Files. That's OK.
> > >>2. Files of size 0. That's acceptable.
> > >>3. Corrupted Files. That's unacceptable.
> > >>4. Corrupted Files with original fingerprint. That's ABSOLUTELY
> > >>unacceptable.
> > >
> > >disks usually default to caching these days and can lose data as a
> > >result; disable that
> >
> > Not always possible.  Some disks lie and leave caching on anyway.
>
> And the same (and others) disks will not honor a flush anyways.
> Moral of that story - avoid bad hardware.
> }
>
> 1. Sync is not the issue. The issue is whether a journaled FS can detect
> corrupted files and flag them after a power-blackout!
> 2. Moral of the story is: What's ext3 doing the others aren't?
>

I agree, I've used XFS for about three years on Linux now, and whilst I love 
the performance and self-repair attributes of the filesystem, I do think it 
leaves a lot to be desired when it comes to file corruption.

In my experience, using a standard XFS log/volume setup on the same physical, 
cheap IDE HD, any files open at the time of a power-down or hardware lockup 
end up being filled either with zeros, or garbage.

However, I'd far rather lose a few files once in a blue moon than have to sit 
through 10 minute fsck's every time the kernel crashes or I kick out the 
plugs.

-- 
Cheers,
Alistair.

personal:   alistair()devzero!co!uk
university: s0348365()sms!ed!ac!uk
student:    CS/CSim Undergraduate
contact:    1F2 55 South Clerk Street,
            Edinburgh. EH8 9PP.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: XFS corruption during power-blackout
  2005-07-01 14:05         ` Al Boldi
  2005-07-01 16:35           ` Alistair John Strachan
@ 2005-07-05 15:49           ` Sonny Rao
  2005-07-05 17:25             ` Al Boldi
  1 sibling, 1 reply; 36+ messages in thread
From: Sonny Rao @ 2005-07-05 15:49 UTC (permalink / raw)
  To: Al Boldi
  Cc: 'Jens Axboe', 'David Masover',
	'Chris Wedgwood', 'Nathan Scott', linux-xfs,
	linux-kernel, linux-fsdevel, reiserfs-list

On Fri, Jul 01, 2005 at 05:05:11PM +0300, Al Boldi wrote:
> Jens Axboe wrote: {
> On Fri, Jul 01 2005, David Masover wrote:
> > Chris Wedgwood wrote:
> > >On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
> > >
> > >
> > >>What I found were 4 things in the dest dir:
> > >>1. Missing Dirs,Files. That's OK.
> > >>2. Files of size 0. That's acceptable.
> > >>3. Corrupted Files. That's unacceptable.
> > >>4. Corrupted Files with original fingerprint. That's ABSOLUTELY 
> > >>unacceptable.
> > >
> > >
> > >disks usually default to caching these days and can lose data as a 
> > >result; disable that
> > 
> > Not always possible.  Some disks lie and leave caching on anyway.
> 
> And the same (and others) disks will not honor a flush anyways. 
> Moral of that story - avoid bad hardware.
> }
> 
> 1. Sync is not the issue. The issue is whether a journaled FS can detect
> corrupted files and flag them after a power-blackout!

Journaling implies filesystem consistency, not data integrity, AFAIK.

> 2. Moral of the story is: What's ext3 doing the others aren't?

Ext3 has stronger guaranties than basic filesystem consistency.

I.e. in ordered mode, file data is always written before metadata, so
the worst that could happen is a growing file's new data is written
but the metadata isn't updated before a power failure... so the new
writes wouldn't be seen afterwards.  You should try the same test w/
ext3 in "writeback" mode and see if it fares better or worse in terms
of file corruption.
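
The difference can be illustrated with a toy model. This is not ext3 code; it only replays the two write orderings and shows what a reader would see if power failed between steps. "data" stands for the file blocks reaching disk, "meta" for the size update reaching disk, and "?" for whatever stale bytes were in the newly allocated blocks. All names are invented for illustration.

```python
STALE = "?"   # stale contents of newly allocated blocks


def visible_after_crash(steps, new_data="abcd"):
    """Replay `steps` and return what a reader would see if power failed
    before the first step and after each following step."""
    blocks = [STALE] * len(new_data)
    size = 0
    seen = ["".join(blocks[:size])]
    for step in steps:
        if step == "data":
            blocks = list(new_data)
        elif step == "meta":
            size = len(new_data)
        else:
            raise ValueError(step)
        seen.append("".join(blocks[:size]))
    return seen


ordered = visible_after_crash(["data", "meta"])     # ['', '', 'abcd']
writeback = visible_after_crash(["meta", "data"])   # ['', '????', 'abcd']
```

With data written first, a crash exposes either nothing or the complete new data. If the metadata update can reach disk first, there is a window in which the file has its new size but stale contents.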

Sonny

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: XFS corruption during power-blackout
  2005-07-05 15:49           ` Sonny Rao
@ 2005-07-05 17:25             ` Al Boldi
  2005-07-05 18:10               ` Sonny Rao
  0 siblings, 1 reply; 36+ messages in thread
From: Al Boldi @ 2005-07-05 17:25 UTC (permalink / raw)
  To: 'Sonny Rao'
  Cc: 'Jens Axboe', 'David Masover',
	'Chris Wedgwood', 'Nathan Scott', linux-xfs,
	linux-kernel, linux-fsdevel, reiserfs-list

Sonny Rao wrote: {
> > >On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
> > >>What I found were 4 things in the dest dir:
> > >>1. Missing Dirs,Files. That's OK.
> > >>2. Files of size 0. That's acceptable.
> > >>3. Corrupted Files. That's unacceptable.
> > >>4. Corrupted Files with original fingerprint. That's ABSOLUTELY 
> > >>unacceptable.
> > >
> 2. Moral of the story is: What's ext3 doing the others aren't?

Ext3 has stronger guaranties than basic filesystem consistency.
I.e. in ordered mode, file data is always written before metadata, so the
worst that could happen is a growing file's new data is written but the
metadata isn't updated before a power failure... so the new writes wouldn't
be seen afterwards.

}

Sonny,
Thanks for your input!
Is there an option in XFS,ReiserFS,JFS to enable ordered mode?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: XFS corruption during power-blackout
  2005-07-05 17:25             ` Al Boldi
@ 2005-07-05 18:10               ` Sonny Rao
  2005-07-05 19:24                 ` Dieter Nützel
  2005-07-06  4:24                 ` Al Boldi
  0 siblings, 2 replies; 36+ messages in thread
From: Sonny Rao @ 2005-07-05 18:10 UTC (permalink / raw)
  To: Al Boldi
  Cc: 'Jens Axboe', 'David Masover',
	'Chris Wedgwood', 'Nathan Scott', linux-xfs,
	linux-kernel, linux-fsdevel, reiserfs-list

On Tue, Jul 05, 2005 at 08:25:11PM +0300, Al Boldi wrote:
> Sonny Rao wrote: {
> > > >On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
> > > >>What I found were 4 things in the dest dir:
> > > >>1. Missing Dirs,Files. That's OK.
> > > >>2. Files of size 0. That's acceptable.
> > > >>3. Corrupted Files. That's unacceptable.
> > > >>4. Corrupted Files with original fingerprint. That's ABSOLUTELY 
> > > >>unacceptable.
> > > >
> > 2. Moral of the story is: What's ext3 doing the others aren't?
> 
> Ext3 has stronger guaranties than basic filesystem consistency.
> I.e. in ordered mode, file data is always written before metadata, so the
> worst that could happen is a growing file's new data is written but the
> metadata isn't updated before a power failure... so the new writes wouldn't
> be seen afterwards.
> 
> }
> 
> Sonny,
> Thanks for your input!
> Is there an option in XFS,ReiserFS,JFS to enable ordered mode?

I believe in newer 2.6 kernels that Reiser has ordered mode (IIRC, courtesy
of Chris Mason), but XFS and JFS do not support it.  I seem to remember
Shaggy (JFS maintainer) saying in older 2.4 kernels he tried to write
file data before metadata but had to change that behavior in 2.6, not
really sure why or anything beyond that.
 
Sonny

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: XFS corruption during power-blackout
  2005-07-05 18:10               ` Sonny Rao
@ 2005-07-05 19:24                 ` Dieter Nützel
  2005-07-06  4:24                 ` Al Boldi
  1 sibling, 0 replies; 36+ messages in thread
From: Dieter Nützel @ 2005-07-05 19:24 UTC (permalink / raw)
  To: reiserfs-list
  Cc: Sonny Rao, Al Boldi, 'Jens Axboe',
	'David Masover', 'Chris Wedgwood',
	'Nathan Scott', linux-xfs, linux-kernel, linux-fsdevel

Am Dienstag, 5. Juli 2005 20:10 schrieb Sonny Rao:
> On Tue, Jul 05, 2005 at 08:25:11PM +0300, Al Boldi wrote:
> > Sonny Rao wrote: {
> >
> > > > >On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
> > > > >>What I found were 4 things in the dest dir:
> > > > >>1. Missing Dirs,Files. That's OK.
> > > > >>2. Files of size 0. That's acceptable.
> > > > >>3. Corrupted Files. That's unacceptable.
> > > > >>4. Corrupted Files with original fingerprint. That's ABSOLUTELY
> > > > >>unacceptable.
> > >
> > > 2. Moral of the story is: What's ext3 doing the others aren't?
> >
> > Ext3 has stronger guaranties than basic filesystem consistency.
> > I.e. in ordered mode, file data is always written before metadata, so the
> > worst that could happen is a growing file's new data is written but the
> > metadata isn't updated before a power failure... so the new writes
> > wouldn't be seen afterwards.
> >
> > }
> >
> > Sonny,
> > Thanks for your input!
> > Is there an option in XFS,ReiserFS,JFS to enable ordered mode?
>
> I believe in newer 2.6 kernels that Reiser has ordered mode (IIRC, courtesy
> of Chris Mason),

And SuSE, ack.

ftp://ftp.suse.com/pub/people/mason/patches/data-logging

They have been around for some time ;-)

> but XFS and JFS do not support it.  I seem to remember 
> Shaggy (JFS maintainer) saying in older 2.4 kernels he tried to write
> file data before metadata but had to change that behavior in 2.6, not
> really sure why or anything beyond that.

Greetings,
	Dieter

-- 
Dieter Nützel
@home: <Dieter () nuetzel-hh ! de>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: XFS corruption during power-blackout
  2005-07-05 18:10               ` Sonny Rao
  2005-07-05 19:24                 ` Dieter Nützel
@ 2005-07-06  4:24                 ` Al Boldi
  2005-07-06  4:46                   ` Nathan Scott
  1 sibling, 1 reply; 36+ messages in thread
From: Al Boldi @ 2005-07-06  4:24 UTC (permalink / raw)
  To: 'Sonny Rao'
  Cc: 'Jens Axboe', 'David Masover',
	'Chris Wedgwood', 'Nathan Scott', linux-xfs,
	linux-kernel, linux-fsdevel, reiserfs-list

Sonny Rao wrote: {
> > > >On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
> > > >>What I found were 4 things in the dest dir:
> > > >>1. Missing Dirs,Files. That's OK.
> > > >>2. Files of size 0. That's acceptable.
> > > >>3. Corrupted Files. That's unacceptable.
> > > >>4. Corrupted Files with original fingerprint. That's ABSOLUTELY 
> > > >>unacceptable.
> > > >
> > 2. Moral of the story is: What's ext3 doing the others aren't?
> 
> Ext3 has stronger guaranties than basic filesystem consistency.
> I.e. in ordered mode, file data is always written before metadata, so 
> the worst that could happen is a growing file's new data is written 
> but the metadata isn't updated before a power failure... so the new 
> writes wouldn't be seen afterwards.
> 
I believe in newer 2.6 kernels that Reiser has ordered mode (IIRC, courtesy
of Chris Mason), but XFS and JFS do not support it.
}

Was ordered mode disabled/removed when XFS was added to the vanilla-kernel?



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: XFS corruption during power-blackout
  2005-07-06  4:24                 ` Al Boldi
@ 2005-07-06  4:46                   ` Nathan Scott
  0 siblings, 0 replies; 36+ messages in thread
From: Nathan Scott @ 2005-07-06  4:46 UTC (permalink / raw)
  To: Al Boldi
  Cc: 'Sonny Rao', 'Jens Axboe',
	'David Masover', 'Chris Wedgwood', linux-xfs,
	linux-kernel, linux-fsdevel, reiserfs-list

On Wed, Jul 06, 2005 at 07:24:03AM +0300, Al Boldi wrote:
> Was ordered mode disabled/removed when XFS was added to the vanilla-kernel?

No, XFS has never supported such a mode.

cheers.

-- 
Nathan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: XFS corruption during power-blackout
@ 2005-07-16  7:02 Al Boldi
  0 siblings, 0 replies; 36+ messages in thread
From: Al Boldi @ 2005-07-16  7:02 UTC (permalink / raw)
  To: rhowe; +Cc: linux-kernel, linux-fsdevel, linux-xfs, 'Nathan Scott'

Russell Howe wrote: {

XFS only journals metadata, not data.

So, you are supposed to get a consistent filesystem structure, but your
data consistency isn't guaranteed.
}

What did XFS do to detect filedata-corruption before it was added to the
vanilla-kernel?

Maybe it did not update the metadata before the fs was sync'd?

Really, it should wait for fs sync and then update metadata!

This would imply 2 syncs in succession to ensure updated filedata/metadata
consistency, which is OK.

Is it possible to instruct XFS to delay metadata update until after a
filedata sync?
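
Whatever the filesystem offers, an application can impose this ordering itself for a single file: write the data to a temporary name, fsync() it, and only then make the metadata change with rename(). This is a sketch of that user-space pattern, not an XFS option; the function name is made up, and it assumes the drive honours cache flushes.

```python
import os


def replace_file_durably(path, data):
    """Replace `path` so that after a crash it holds either the old
    content or all of `data`."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())      # file data first ...
    os.replace(tmp, path)         # ... then the metadata change
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)          # make the rename itself durable
    finally:
        os.close(dir_fd)
```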

Thanks!

		Al

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2005-07-16  7:04 UTC | newest]

Thread overview: 36+ messages
     [not found] <20050629001847.GB850@frodo>
2005-06-29  4:53 ` XFS corruption during power-blackout Al Boldi
2005-06-29 16:38   ` Christian Rice
2005-06-29 17:02   ` Chris Wedgwood
2005-06-29 17:56     ` Steve Lord
2005-06-29 20:56       ` Chris Wedgwood
2005-06-30 16:30         ` Bryan Henderson
2005-06-30 18:46           ` Chris Wedgwood
2005-06-30 19:44             ` Jörn Engel
2005-06-30 20:32               ` Chris Wedgwood
2005-06-30 21:07                 ` Jörn Engel
2005-07-01 12:36                 ` Ric Wheeler
2005-07-01 12:56                   ` Jens Axboe
2005-06-30 20:49             ` Bryan Henderson
2005-07-01 12:53               ` Ric Wheeler
2005-07-01 18:24                 ` Bryan Henderson
2005-07-01 19:58                   ` David Masover
2005-07-01 21:10                     ` Jörn Engel
2005-07-01 21:39                       ` David Masover
2005-07-01  1:09             ` Stewart Smith
2005-07-05 15:53             ` Sonny Rao
2005-06-29 21:10       ` Nathan Scott
2005-07-01  8:17     ` David Masover
2005-07-01  9:24       ` Jens Axboe
     [not found]         ` <20050701131950.GA15180@ime.usp.br>
2005-07-01 13:57           ` Ric Wheeler
2005-07-01 18:37             ` Bryan Henderson
2005-07-01 18:41               ` Jens Axboe
2005-07-11 12:53                 ` Ric Wheeler
2005-07-01 14:05         ` Al Boldi
2005-07-01 16:35           ` Alistair John Strachan
2005-07-05 15:49           ` Sonny Rao
2005-07-05 17:25             ` Al Boldi
2005-07-05 18:10               ` Sonny Rao
2005-07-05 19:24                 ` Dieter Nützel
2005-07-06  4:24                 ` Al Boldi
2005-07-06  4:46                   ` Nathan Scott
2005-07-16  7:02 Al Boldi
