* Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
       [not found] ` <47A7188A.4070005@msgid.tls.msk.ru>
@ 2008-02-04 14:09   ` Justin Piszcz
  2008-02-04 14:25     ` Eric Sandeen
  0 siblings, 1 reply; 11+ messages in thread
From: Justin Piszcz @ 2008-02-04 14:09 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Moshe Yudkowsky, linux-raid, xfs, sandeen

On Mon, 4 Feb 2008, Michael Tokarev wrote:

> Moshe Yudkowsky wrote:
> []
>> If I'm reading the man pages, Wikis, READMEs and mailing lists correctly
>> -- not necessarily the case -- the ext3 file system uses the equivalent
>> of data=journal as a default.
>
> ext3 defaults to data=ordered, not data=journal.  ext2 doesn't have a
> journal at all.
>
>> The question then becomes what data scheme to use with reiserfs on the
>
> I'd say don't use reiserfs in the first place ;)
>
>> Another way to phrase this: unless you're running data-center grade
>> hardware and have absolute confidence in your UPS, you should use
>> data=journal for reiserfs and perhaps avoid XFS entirely.
>
> By the way, even if you do have a good UPS, there should be some
> control program for it, to properly shut down your system when the
> UPS loses AC power.  So far, I've seen no such programs...
>
> /mjt

Why avoid XFS entirely?

esandeen, any comments here?

Justin.

^ permalink raw reply  [flat|nested] 11+ messages in thread
* Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
  2008-02-04 14:09 ` RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash) Justin Piszcz
@ 2008-02-04 14:25   ` Eric Sandeen
  2008-02-04 14:42     ` Eric Sandeen
                       ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Eric Sandeen @ 2008-02-04 14:25 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: Michael Tokarev, Moshe Yudkowsky, linux-raid, xfs

Justin Piszcz wrote:

> Why avoid XFS entirely?
>
> esandeen, any comments here?

Heh; well, it's the meme.  See:

http://oss.sgi.com/projects/xfs/faq.html#nulls

and note that recent fixes have been made in this area (also noted in
the FAQ).

Also - the above all assumes that when a drive says it has written/flushed
data, it truly has.  Modern write-caching drives can wreak havoc with any
journaling filesystem, so that's one good reason for a UPS.  If the drive
claims to have metadata safe on disk but actually does not, and you lose
power, the data claimed safe will evaporate; there's not much the fs can
do.  IO write barriers address this by forcing the drive to flush
order-critical data before continuing.  xfs has them on by default,
although they are tested at mount time, and if you have something in
between xfs and the disks which does not support barriers (e.g. lvm...)
then they are disabled again, with a notice in the logs.

Note also that ext3 has the barrier option as well, but it is not enabled
by default due to performance concerns.  Barriers also affect xfs
performance, but enabling them in the non-battery-backed-write-cache
scenario is the right thing to do for filesystem integrity.

-Eric

> Justin.

^ permalink raw reply  [flat|nested] 11+ messages in thread
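[Editor's note: the failure mode Eric describes - a write cache acknowledging data that then evaporates on power loss, breaking the ordering a journal depends on - can be illustrated with a toy simulation. This is a deliberately simplified sketch, not real filesystem or block-layer code; the class and function names are invented for illustration.]

```python
# Toy model of a drive write cache that loses power, showing why a
# journaling fs needs barriers/flushes: without them, metadata can
# reach the platter while the journal commit it depends on does not.
import random

class Disk:
    def __init__(self):
        self.platter = {}   # what actually survives a power cut
        self.cache = []     # writes acknowledged but not yet durable

    def write(self, block, data):
        self.cache.append((block, data))  # drive ACKs immediately

    def flush(self):
        # A barrier/cache flush forces everything acknowledged so far
        # onto stable media before any later write is accepted.
        for block, data in self.cache:
            self.platter[block] = data
        self.cache = []

    def power_cut(self):
        # An arbitrary subset of cached writes survives, in any order.
        random.shuffle(self.cache)
        for block, data in self.cache[:random.randrange(len(self.cache) + 1)]:
            self.platter[block] = data
        self.cache = []

def journaled_update(disk, use_barriers):
    disk.write("journal", "commit record")
    if use_barriers:
        disk.flush()          # commit is durable before metadata is touched
    disk.write("metadata", "new contents")
    disk.power_cut()
    # Invariant a journaling fs relies on: metadata on disk implies
    # the journal commit is on disk too.
    if disk.platter.get("metadata") == "new contents":
        return disk.platter.get("journal") == "commit record"
    return True

random.seed(0)
with_barriers = all(journaled_update(Disk(), True) for _ in range(1000))
without_barriers = all(journaled_update(Disk(), False) for _ in range(1000))
print(with_barriers, without_barriers)  # barriers preserve the invariant
```

With the flush in place the invariant holds in every trial; without it, some trials persist the metadata but lose the commit record, which is exactly the "nulls after power loss" symptom the FAQ entry discusses.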
* Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
  2008-02-04 14:25 ` Eric Sandeen
@ 2008-02-04 14:42   ` Eric Sandeen
  2008-02-04 15:31   ` Moshe Yudkowsky
  2008-02-04 16:38   ` Michael Tokarev
  2 siblings, 0 replies; 11+ messages in thread
From: Eric Sandeen @ 2008-02-04 14:42 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: Michael Tokarev, Moshe Yudkowsky, linux-raid, xfs

Eric Sandeen wrote:
> Justin Piszcz wrote:
>
>> Why avoid XFS entirely?
>>
>> esandeen, any comments here?
>
> Heh; well, it's the meme.
>
> see:
>
> http://oss.sgi.com/projects/xfs/faq.html#nulls
>
> and note that recent fixes have been made in this area (also noted in
> the faq)

Actually, continue reading past that specific entry to the next several;
it covers all this quite well.

-Eric

^ permalink raw reply  [flat|nested] 11+ messages in thread
* Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
  2008-02-04 14:25 ` Eric Sandeen
  2008-02-04 14:42   ` Eric Sandeen
@ 2008-02-04 15:31   ` Moshe Yudkowsky
  2008-02-04 16:45     ` Eric Sandeen
  2008-02-04 16:38   ` Michael Tokarev
  2 siblings, 1 reply; 11+ messages in thread
From: Moshe Yudkowsky @ 2008-02-04 15:31 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Justin Piszcz, Michael Tokarev, linux-raid, xfs

Eric,

Thanks very much for your note.  I'm becoming very leery of reiserfs at
the moment...  I'm about to run another series of crash tests.

Eric Sandeen wrote:
> Justin Piszcz wrote:
>
>> Why avoid XFS entirely?
>>
>> esandeen, any comments here?
>
> Heh; well, it's the meme.

Well, yeah...

> Note also that ext3 has the barrier option as well, but it is not
> enabled by default due to performance concerns.  Barriers also affect
> xfs performance, but enabling them in the non-battery-backed-write-cache
> scenario is the right thing to do for filesystem integrity.

So if I understand you correctly, you're stating that currently the most
reliable fs in its default configuration, in terms of protection against
power-loss scenarios, is XFS?

-- 
Moshe Yudkowsky * moshe@pobox.com * www.pobox.com/~moshe
 "There is something fundamentally wrong with a country [USSR]
  where the citizens want to buy your underwear."
        -- Paul Theroux

^ permalink raw reply  [flat|nested] 11+ messages in thread
* Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
  2008-02-04 15:31 ` Moshe Yudkowsky
@ 2008-02-04 16:45   ` Eric Sandeen
  2008-02-04 17:22     ` Michael Tokarev
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Sandeen @ 2008-02-04 16:45 UTC (permalink / raw)
  To: Moshe Yudkowsky; +Cc: Justin Piszcz, Michael Tokarev, linux-raid, xfs

Moshe Yudkowsky wrote:

> So if I understand you correctly, you're stating that currently the most
> reliable fs in its default configuration, in terms of protection against
> power-loss scenarios, is XFS?

I wouldn't go that far without some real-world poweroff testing, because
various fs's are probably more or less tolerant of a write-cache
evaporation.  I suppose it'd depend on the size of the write cache as well.

-Eric

^ permalink raw reply  [flat|nested] 11+ messages in thread
* Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
  2008-02-04 16:45 ` Eric Sandeen
@ 2008-02-04 17:22   ` Michael Tokarev
  2008-02-05 12:31     ` Linda Walsh
  0 siblings, 1 reply; 11+ messages in thread
From: Michael Tokarev @ 2008-02-04 17:22 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Moshe Yudkowsky, Justin Piszcz, linux-raid, xfs

Eric Sandeen wrote:
> Moshe Yudkowsky wrote:
>> So if I understand you correctly, you're stating that currently the most
>> reliable fs in its default configuration, in terms of protection against
>> power-loss scenarios, is XFS?
>
> I wouldn't go that far without some real-world poweroff testing, because
> various fs's are probably more or less tolerant of a write-cache
> evaporation.  I suppose it'd depend on the size of the write cache as well.

I know of no filesystem which is, as you say, tolerant of write-cache
evaporation.  If a drive says the data is written but in fact it's not,
it's a Bad Drive (tm) and it should be thrown away immediately.
Fortunately, almost all modern disk drives don't lie this way.

The only thing needed from the filesystem is to tell the drive to flush
its cache at the appropriate time, and actually wait for the flush to
complete.  Barriers (mentioned in this thread) are just another, somewhat
more efficient, way to do so, but a normal cache flush will do as well.
IFF the write caching is enabled in the first place - note that with some
workloads, write caching in the drive actually makes write speed worse,
not better - namely, in the case of massive writes.

Speaking of XFS (and of ext3fs with write barriers enabled) - I'm
confused here as well, and the answers to my questions didn't help either.
As far as I understand, XFS only uses barriers, not regular cache
flushes; hence without write barrier support (which is not available for
Linux software raid, as explained elsewhere) it's unsafe -- and probably
the same applies to ext3 with barrier support enabled.  But I'm not sure
I got it all correctly.

/mjt

^ permalink raw reply  [flat|nested] 11+ messages in thread
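[Editor's note: for readers wanting to check the state of barriers on their own systems, the following command fragment sketches the usual steps. It is illustrative only - option names reflect 2008-era kernels and should be verified against your local mount(8), hdparm(8), and filesystem documentation before use; the quoted log message is the form XFS printed at the time, and your kernel's wording may differ.]

```
# Did xfs disable barriers at mount time?  It logs the fact, e.g.:
#   "Filesystem "md0": Disabling barriers, not supported by the underlying device"
dmesg | grep -i barrier

# ext3 needs barriers requested explicitly (they are off by default):
mount -o remount,barrier=1 /var

# If barriers cannot be used, the conservative fallback is to turn off
# the drive's write cache so acknowledged writes are actually durable
# (at a real performance cost, as discussed later in this thread):
hdparm -W0 /dev/sda
```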
* Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
  2008-02-04 17:22 ` Michael Tokarev
@ 2008-02-05 12:31   ` Linda Walsh
  0 siblings, 0 replies; 11+ messages in thread
From: Linda Walsh @ 2008-02-05 12:31 UTC (permalink / raw)
  To: xfs; +Cc: linux-raid

Michael Tokarev wrote:
> note that with some workloads, write caching in
> the drive actually makes write speed worse, not better - namely,
> in case of massive writes.
----
With write barriers enabled, I did a quick test of a large copy from one
backup filesystem to another.  I'm not sure what you refer to when you
say large, but this disk has 387G used in 975 files, averaging about
406MB/file.  I was copying from /hde (ATA100-750G) to /sdb (SATA-300-750G)
(both basically the same underlying model).

Of course your 'mileage may vary', and these were averages over 12 runs
each (with and without write caching):

(write cache on)
               write   read
dev       TPS   MB/s   MB/s
hde ave  64.67  30.94    0.0
sdb ave 249.51   0.24  30.93

(write cache off)
               write   read
dev       TPS   MB/s   MB/s
hde ave  45.63  21.81    0.0
xx: ave 177.76   0.24  21.96

with write cache    = (30.94-21.81)/21.81   => ~42% faster
without write cache = 100-(100*21.81/30.94) => ~30% slower

These disks have barrier support, so I'd guess the differences would have
been greater if you didn't worry about losing write-cache contents.  If
barrier support doesn't work and one has to disable write caching, that
is a noticeable performance penalty.  All writes with noatime,
nodiratime, logbufs=8.

FWIW... slightly OT, the rates under Windows for its write-through
(FAT32) vs. write-back caching (NTFS) were: FAT about 60% faster than
NTFS, or NTFS ~40% slower than FAT32 (with options for no-last-access
and no 3.1-filename creation).

^ permalink raw reply  [flat|nested] 11+ messages in thread
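[Editor's note: the percentage figures in the message above can be recomputed directly from the quoted MB/s numbers - a quick sanity check, since the two directions ("X% faster" vs. "Y% slower") use different baselines and are easy to mix up.]

```python
# Recompute the cache-on vs cache-off comparison from the hde numbers
# quoted above (30.94 MB/s with write cache, 21.81 MB/s without).
cache_on = 30.94   # MB/s, write cache enabled
cache_off = 21.81  # MB/s, write cache disabled

# Speedup relative to the cache-off baseline:
faster = (cache_on - cache_off) / cache_off * 100
# Slowdown relative to the cache-on baseline:
slower = (1 - cache_off / cache_on) * 100

print(f"~{faster:.1f}% faster with cache, ~{slower:.1f}% slower without")
```

This yields roughly 42% faster with the cache and roughly 30% slower without it, matching the corrected figures in the message; the asymmetry is just the change of baseline, not an inconsistency in the data.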
* Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
  2008-02-04 14:25 ` Eric Sandeen
  2008-02-04 14:42   ` Eric Sandeen
  2008-02-04 15:31   ` Moshe Yudkowsky
@ 2008-02-04 16:38   ` Michael Tokarev
  2008-02-04 22:27     ` Justin Piszcz
  2008-02-06  1:12     ` Linda Walsh
  2 siblings, 2 replies; 11+ messages in thread
From: Michael Tokarev @ 2008-02-04 16:38 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Justin Piszcz, Moshe Yudkowsky, linux-raid, xfs

Eric Sandeen wrote:
[]
> http://oss.sgi.com/projects/xfs/faq.html#nulls
>
> and note that recent fixes have been made in this area (also noted in
> the faq)
>
> Also - the above all assumes that when a drive says it's written/flushed
> data, that it truly has.  Modern write-caching drives can wreak havoc
> with any journaling filesystem, so that's one good reason for a UPS.  If

Unfortunately a UPS does not *really* help here.  Because unless it has
a control program which properly shuts the system down on loss of input
power, and the battery really has the capacity to power the system while
it's shutting down (anyone tested this?  With a new UPS?  And after a
year of use, when the battery is not new?), -- unless the UPS actually
has the capacity to shut the system down, it will cut the power at an
unexpected time, while the disk(s) still have dirty caches...

> the drive claims to have metadata safe on disk but actually does not,
> and you lose power, the data claimed safe will evaporate, there's not
> much the fs can do.  IO write barriers address this by forcing the drive
> to flush order-critical data before continuing; xfs has them on by
> default, although they are tested at mount time and if you have
> something in between xfs and the disks which does not support barriers
> (i.e. lvm...) then they are disabled again, with a notice in the logs.

Note also that with Linux software raid, barriers are NOT supported.
/mjt

^ permalink raw reply  [flat|nested] 11+ messages in thread
* Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
  2008-02-04 16:38 ` Michael Tokarev
@ 2008-02-04 22:27   ` Justin Piszcz
  2008-02-06  1:12   ` Linda Walsh
  1 sibling, 0 replies; 11+ messages in thread
From: Justin Piszcz @ 2008-02-04 22:27 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Eric Sandeen, Moshe Yudkowsky, linux-raid, xfs

On Mon, 4 Feb 2008, Michael Tokarev wrote:

> Eric Sandeen wrote:
> []
>> http://oss.sgi.com/projects/xfs/faq.html#nulls
>>
>> and note that recent fixes have been made in this area (also noted in
>> the faq)
>>
>> Also - the above all assumes that when a drive says it's written/flushed
>> data, that it truly has.  Modern write-caching drives can wreak havoc
>> with any journaling filesystem, so that's one good reason for a UPS.  If
>
> Unfortunately a UPS does not *really* help here.  Because unless
> it has a control program which properly shuts the system down on loss
> of input power, and the battery really has the capacity to power the
> system while it's shutting down (anyone tested this?  With a new UPS?
> And after a year of use, when the battery is not new?), -- unless
> the UPS actually has the capacity to shut the system down, it will cut
> the power at an unexpected time, while the disk(s) still have dirty
> caches...

You use nut and a large enough UPS to handle the load of the system; it
shuts the machine down just fine.

>
>> the drive claims to have metadata safe on disk but actually does not,
>> and you lose power, the data claimed safe will evaporate, there's not
>> much the fs can do.  IO write barriers address this by forcing the drive
>> to flush order-critical data before continuing; xfs has them on by
>> default, although they are tested at mount time and if you have
>> something in between xfs and the disks which does not support barriers
>> (i.e. lvm...) then they are disabled again, with a notice in the logs.
>
> Note also that with linux software raid barriers are NOT supported.
>
> /mjt
>

^ permalink raw reply  [flat|nested] 11+ messages in thread
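[Editor's note: the "nut" Justin mentions is Network UPS Tools. A minimal sketch of the relevant `upsmon.conf` lines follows, illustrating the shutdown-on-power-loss behavior he describes. The UPS name, user, and password are placeholders; consult the nut documentation for the directives your version supports.]

```
# /etc/nut/upsmon.conf -- minimal sketch (placeholder credentials)
# Watch one locally attached UPS; "1" = it powers this machine.
MONITOR myups@localhost 1 monuser secretpass master

# At least one monitored UPS must be supplying power.
MINSUPPLIES 1

# Command upsmon runs when the UPS goes critical (on battery + low charge),
# giving the OS time to flush dirty caches and halt cleanly.
SHUTDOWNCMD "/sbin/shutdown -h +0"
```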
* Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
  2008-02-04 16:38 ` Michael Tokarev
  2008-02-04 22:27   ` Justin Piszcz
@ 2008-02-06  1:12   ` Linda Walsh
  2008-02-06  2:12     ` Michael Tokarev
  1 sibling, 1 reply; 11+ messages in thread
From: Linda Walsh @ 2008-02-06 1:12 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Eric Sandeen, Justin Piszcz, Moshe Yudkowsky, linux-raid, xfs

Michael Tokarev wrote:
> Unfortunately a UPS does not *really* help here.  Because unless
> it has a control program which properly shuts the system down on loss
> of input power, and the battery really has the capacity to power the
> system while it's shutting down (anyone tested this?
----
Yes.  I must say, I am not connected to or paid by APC.

> With a new UPS?
> And after a year of use, when the battery is not new?), -- unless
> the UPS actually has the capacity to shut the system down, it will cut
> the power at an unexpected time, while the disk(s) still have dirty
> caches...
--------
If you have a "SmartUPS" by "APC", there is a freeware daemon that
monitors its status.  The UPS has USB and serial connections.  It's
included in some distributions (SuSE).  The config file is pretty
straightforward.

I recommend the "1000XL" (1000 peak volt-amp load -- usually at startup;
note, this is not the same as watts, as some of us were taught in basic
electronics class, since the unit isn't a simple resistor like a light
bulb) over the 1500XL, because with the 1000XL you can buy several
"add-on batteries" that plug into the back.

One minor (but not fatal) design flaw: the add-on batteries give no
indication that they are "live" (I knocked a cord on one, and only got 7
minutes of uptime before things shut down instead of my expected 20).  I
have 3 cells total (controller & 1 extra pack).  So why is my run time so
short?  I am being lazy in buying more extension packs.
The UPS is running 3 computers, the house phone (answering machine and
wireless handsets), a digital clock, and 1 LCD (usually off).  The real
killer is a new workstation with 2x2-Core-II chips and other comparable
equipment.

The "1500XL" doesn't allow for adding more power packs.  The "2200XL"
does allow extra packs but comes in a rack-mount format.

It's not just a battery backup -- it conditions the power to filter out
spikes and emit a pure sine wave.  It will kick in during over- or
under-voltage conditions (you can set the sensitivity).  Adjustable alarm
when on battery, setting of output volts (115, 230, 120, 240).  It
self-tests at least every 2 weeks or sooner (to your fancy).  It also has
a network feature (that I haven't gotten to work yet -- they just changed
the format) that allows other computers on the same net to also be
notified and take action.  You specify what scripts to run at what times
(power off, power on, getting critically low, etc.).

Hasn't failed me 'yet' -- 'cept when a charger died and was replaced free
of cost (within warranty).  I have a separate setup in another room for
another computer.  The upspowerd runs on linux or windows (under cygwin,
I think).  You can specify when to shut down -- like "5 minutes of
battery life left".

The controller unit has 1 battery, but the add-ons have 2 batteries each,
so the first add-on adds 3x to the run-time.  When my system did shut
down "prematurely", it went through the full "halt" sequence, which I'd
presume flushes disk caches.

>
>> the drive claims to have metadata safe on disk but actually does not,
>> and you lose power, the data claimed safe will evaporate, there's not
>> much the fs can do.  IO write barriers address this by forcing the drive
>> to flush order-critical data before continuing; xfs has them on by
>> default, although they are tested at mount time and if you have
>> something in between xfs and the disks which does not support barriers
>> (i.e. lvm...)
>> then they are disabled again, with a notice in the logs.

> Note also that with linux software raid barriers are NOT supported.
------
Are you sure about this?  When my system boots, I used to have 3 new
IDEs, and one older one.  XFS checked each drive for barriers and turned
off barriers for a disk that didn't support it.  ... or are you referring
specifically to linux-raid setups?

Would it be possible on boot to have xfs probe the RAID array,
physically, to see if barriers are really supported (or not), and disable
them if they are not (and optionally disable write caching, but that's a
major performance hit in my experience)?

Linda

^ permalink raw reply  [flat|nested] 11+ messages in thread
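[Editor's note: the "freeware daemon" for APC SmartUPS units that Linda describes is most likely apcupsd - an assumption, since she never names it. The fragment below sketches the shutdown-threshold settings she alludes to ("you can specify when to shut down"), using directive names from apcupsd's configuration format; verify them against your installed version's documentation.]

```
# /etc/apcupsd/apcupsd.conf -- illustrative fragment (assumed daemon: apcupsd)
UPSCABLE usb
UPSTYPE usb
DEVICE

# Shut down when EITHER threshold is crossed while on battery:
BATTERYLEVEL 5     # remaining charge falls to 5%
MINUTES 5          # estimated runtime falls to 5 minutes

TIMEOUT 0          # 0 = rely on the two thresholds above, not a fixed timer
```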
* Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
  2008-02-06  1:12 ` Linda Walsh
@ 2008-02-06  2:12   ` Michael Tokarev
  0 siblings, 0 replies; 11+ messages in thread
From: Michael Tokarev @ 2008-02-06 2:12 UTC (permalink / raw)
  To: Linda Walsh; +Cc: Eric Sandeen, Justin Piszcz, Moshe Yudkowsky, linux-raid, xfs

Linda Walsh wrote:
>
> Michael Tokarev wrote:
>> Unfortunately a UPS does not *really* help here.  Because unless
>> it has a control program which properly shuts the system down on loss
>> of input power, and the battery really has the capacity to power the
>> system while it's shutting down (anyone tested this?
> ----
> Yes.  I must say, I am not connected to or paid by APC.
>
>> With a new UPS?
>> And after a year of use, when the battery is not new?), -- unless
>> the UPS actually has the capacity to shut the system down, it will cut
>> the power at an unexpected time, while the disk(s) still have dirty
>> caches...
> --------
> If you have a "SmartUPS" by "APC", there is a freeware daemon that
> monitors [...]

Good stuff.  I knew at least SOME UPSes are good... ;)  Too bad I rarely
see such stuff in use by regular home users...

[]
>> Note also that with linux software raid barriers are NOT supported.
> ------
> Are you sure about this?  When my system boots, I used to have
> 3 new IDEs, and one older one.  XFS checked each drive for barriers
> and turned off barriers for a disk that didn't support it.  ... or
> are you referring specifically to linux-raid setups?

I'm referring specifically to linux-raid setups (software raid).  md
devices don't support barriers, for a very simple reason: once more than
one disk drive is involved, the md layer can't guarantee ordering ACROSS
the drives.
The problem is that in case of power loss during writes, when an array
needs recovery/resync (at least of the parts which were being written, if
bitmaps are in use), the md layer will choose an arbitrary drive as the
"master" and will copy data to the other drive (speaking of the simplest
case of a 2-drive raid1 array).  But one drive may have the last two
barriers written (I mean the data that was "associated" with the
barriers), and the other neither of the two - in two different places.
Hence we may see quite some inconsistency here.  This is regardless of
whether the underlying component devices support barriers or not.

> Would it be possible on boot to have xfs probe the RAID array,
> physically, to see if barriers are really supported (or not), and disable
> them if they are not (and optionally disable write caching, but that's
> a major performance hit in my experience)?

Xfs already probes the devices as you describe, exactly the same way as
you've seen with your ide disks, and disables barriers.  The question and
confusion was about what happens when the barriers are disabled
(provided, again, that we don't rely on a UPS and other external things).
As far as I understand, when barriers are working properly, xfs should be
safe wrt power losses (still a bit unsure about this).  Now, when
barriers are turned off (for whatever reason), is it still as safe?  I
don't know.  Does it use regular cache flushes in place of barriers in
that case (which ARE supported by the md layer)?

Generally, it has been said numerous times that XFS is not
"powercut-friendly", and should only be used where everything is stable,
including power.  Hence I'm afraid to deploy it where I know the power is
not stable (we have about 70 such places here, with servers in each,
where they don't always replace UPS batteries in time - ext3fs never
crashed so far, while ext2 did).

Thanks.

/mjt

^ permalink raw reply  [flat|nested] 11+ messages in thread
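[Editor's note: Michael's raid1 scenario - each mirror leg persisting a different subset of acknowledged writes, then an arbitrary leg being chosen as resync master - can be made concrete with a small sketch. This is a toy model for illustration only, not md code; the block names and the two-leg state are invented.]

```python
# Toy model of the raid1 resync hazard described above: after a power
# cut the two mirror legs hold different subsets of the acknowledged
# writes, and copying an arbitrary "master" leg over the other may
# break the ordering the filesystem relied on.

def resync(master, slave):
    # md copies the chosen master wholesale over the other leg.
    return dict(master)

# The fs wrote a journal commit, then (intending a barrier between them)
# wrote metadata.  Without barrier support through md, each leg's drive
# was free to persist the cached writes in any order, so the legs ended
# up with different subsets when the power failed:
leg_a = {"journal": "commit", "metadata": None}   # commit only
leg_b = {"journal": None, "metadata": "new"}      # metadata only

def ordering_ok(state):
    # Journaling invariant: metadata on disk implies the commit is too.
    return state["metadata"] is None or state["journal"] == "commit"

ok_if_a_master = ordering_ok(resync(leg_a, leg_b))  # rolls back cleanly
ok_if_b_master = ordering_ok(resync(leg_b, leg_a))  # metadata w/o commit
print(ok_if_a_master, ok_if_b_master)
```

Since md picks the master arbitrarily, the array can end up in the `leg_b` state - metadata present without its journal commit - which is exactly the cross-drive ordering violation Michael describes, and it occurs regardless of whether each component drive individually honors barriers.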
end of thread, other threads: [~2008-02-06 2:12 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <47A612BE.5050707@pobox.com>
[not found] ` <47A623EE.4050305@msgid.tls.msk.ru>
[not found] ` <47A62A17.70101@pobox.com>
[not found] ` <47A6DA81.3030008@msgid.tls.msk.ru>
[not found] ` <47A6EFCF.9080906@pobox.com>
[not found] ` <47A7188A.4070005@msgid.tls.msk.ru>
2008-02-04 14:09 ` RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash) Justin Piszcz
2008-02-04 14:25 ` Eric Sandeen
2008-02-04 14:42 ` Eric Sandeen
2008-02-04 15:31 ` Moshe Yudkowsky
2008-02-04 16:45 ` Eric Sandeen
2008-02-04 17:22 ` Michael Tokarev
2008-02-05 12:31 ` Linda Walsh
2008-02-04 16:38 ` Michael Tokarev
2008-02-04 22:27 ` Justin Piszcz
2008-02-06 1:12 ` Linda Walsh
2008-02-06 2:12 ` Michael Tokarev
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox