From: Ric Wheeler <rwheeler@redhat.com>
To: Theodore Tso <tytso@mit.edu>, Ric Wheeler <rwheeler@redhat.com>,
Pavel Machek <pavel@ucw.cz>, Florian Weimer <fweimer@bfk.de>,
Goswin von Brederlow <goswin-v-b@web.de>,
Rob Landley <rob@landley.net>,
kernel list <linux-kernel@vger.kernel.org>,
Andrew Morton <akpm@osdl.org>,
mtk.manpages@gmail.com, rdunlap@xenotime.net,
linux-doc@vger.kernel.org, linux-ext4@vger.kernel.org,
corbet@lwn.net
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible
Date: Tue, 25 Aug 2009 21:15:00 -0400 [thread overview]
Message-ID: <4A948C94.7040103@redhat.com> (raw)
In-Reply-To: <20090826010018.GA17684@mit.edu>
On 08/25/2009 09:00 PM, Theodore Tso wrote:
> On Tue, Aug 25, 2009 at 08:31:21PM -0400, Ric Wheeler wrote:
>
>>>> You are simply incorrect, Ted did not say that ext3 does not work
>>>> with MD raid5.
>>>>
>>> http://lkml.org/lkml/2009/8/25/312
>>> Pavel
>>>
>> I will let Ted clarify his text on his own, but the quoted text says "...
>> have potential...".
>>
>> Why not ask Neil if he designed MD to not work properly with ext3?
>>
> So let me clarify by saying the following things.
>
> 1) Filesystems are designed to expect that storage devices have
> certain properties. These include returning the same data that you
> wrote, and that an error when writing a sector, or a power failure
> when writing sector, should not be amplified to cause collateral
> damage with previously succfessfully written sectors.
>
> 2) Degraded RAID 5/6 filesystems do not meet these properties.
> Neither to cheap flash drives. This increases the chances you can
> lose, bigtime.
>
>
I agree with the whole write up outside of the above - degraded RAID
does meet this requirement unless you have a second (or third, counting
the split write) failure during the rebuild.
Note that the window of exposure during a RAID rebuild is linear with
the size of your disk and how much you detune the rebuild...
ric
> 3) Does that mean that you shouldn't use ext3 on RAID drives? Of
> course not! First of all, Ext3 still saves you against kernel panics
> and hangs caused by device driver bugs or other kernel hangs. You
> will lose less data, and avoid needing to run a long and painful fsck
> after a forced reboot, compared to if you used ext2. You are making
> an assumption that the only time running the journal takes place is
> after a power failure. But if the system hangs, and you need to hit
> the Big Red Switch, or if you using the system in a Linux High
> Availability setup and the ethernet card fails, so the STONITH ("shoot
> the other node in the head") system forces a hard reset of the system,
> or you get a kernel panic which forces a reboot, in all of these cases
> ext3 will save you from a long fsck, and it will do so safely.
>
> Secondly, what's the probability of a failure causes the RAID array to
> become degraded, followed by a power failure, versus a power failure
> while the RAID array is not running in degraded mode? Hopefully you
> are running with the RAID array in full, proper running order a much
> larger percentage of the time than running with the RAID array in
> degraded mode. If not, the bug is with the system administrator!
>
> If you are someone who tends to run for long periods of time in
> degraded mode --- then better get a UPS. And certainly if you want to
> avoid the chances of failure, periodically scrubbing the disks so you
> detect hard drive failures early, instead of waiting until a disk
> fails before letting the rebuild find the dreaded "second failure"
> which causes data loss, is a d*mned good idea.
>
> Maybe a random OS engineer doesn't know these things --- but trust me
> when I say a competent system administrator had better be familiar
> with these concepts. And someone who wants their data to be reliably
> stored needs to do some basic storage engineering if they want to have
> long-term data reliability. (That, or maybe they should outsource
> their long-term reliable storage some service such as Amazon S3 ---
> see Jeremy Zawodny's analysis about how it can be cheaper, here:
> http://jeremy.zawodny.com/blog/archives/007624.html)
>
> But we *do* need to be careful that we don't write documentation which
> is ends up giving users the wrong impression. The bottom line is that
> you're better off using ext3 over ext2, even on a RAID array, for the
> reasons listed above.
>
> Are you better off using ext3 over ext2 on a crappy flash drive?
> Maybe --- if you are also using crappy proprietary video drivers, such
> as Ubuntu ships, where every single time you exit a 3d game the system
> crashes (and Ubuntu users accept this as normal?!?), then ext3 might
> be a better choice since you'll reduce the chance of data loss when
> the system locks up or crashes thanks to the aforemention crappy
> proprietary video drivers from Nvidia. On the other hand, crappy
> flash drives *do* have really bad write amplification effects, where a
> 4K write can cause 128k or more worth of flash to be rewritten, such
> that using ext3 could seriously degrade the lifetime of said crappy
> flash drive; furthermore, the crappy flash drives have such terribly
> write performance that using ext3 can be a performance nightmare.
> This of course, doesn't apply to well-implemented SSD's, such as the
> Intel's X25-M and X18-M. So here your mileage may vary. Still, if
> you are using crappy proprietary drivers which cause system hangs and
> crashes at a far greater rate than power fail-induced unclean
> shutdowns, ext3 *still* might be the better choice, even with crappy
> flash drives.
>
> The best thing to do, of course, is to improve your storage stack; use
> competently implemented SSD's instead of crap flash cards. If your
> hardware RAID card supports a battery option, *get* the battery. Add
> a UPS to your system. Provision your RAID array with hot spares, and
> regularly scrub (read-test) your array so that failed drives can be
> detected early. Make sure you configure your MD setup so that you get
> e-mail when a hard drive fails and the array starts running in
> degraded mode, so you can replace the failed drive ASAP.
>
> At the end of the day, filesystems are not magic. They can't
> compensate for crap hardware, or incompetently administered machines.
>
> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
next prev parent reply other threads:[~2009-08-26 1:16 UTC|newest]
Thread overview: 272+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-03-12 9:21 ext2/3: document conditions when reliable operation is possible Pavel Machek
2009-03-12 11:40 ` Jochen Voß
2009-03-21 11:24 ` Pavel Machek
2009-03-12 19:13 ` Rob Landley
2009-03-16 12:28 ` Pavel Machek
2009-03-16 19:26 ` Rob Landley
2009-03-23 10:45 ` Pavel Machek
2009-03-30 15:06 ` Goswin von Brederlow
2009-08-24 9:26 ` Pavel Machek
2009-08-24 9:31 ` [patch] " Pavel Machek
2009-08-24 11:19 ` Florian Weimer
2009-08-24 13:01 ` Theodore Tso
2009-08-24 14:55 ` Artem Bityutskiy
2009-08-24 22:30 ` Rob Landley
2009-08-24 19:52 ` Pavel Machek
2009-08-24 20:24 ` Ric Wheeler
2009-08-24 20:52 ` Pavel Machek
2009-08-24 21:08 ` Ric Wheeler
2009-08-24 21:25 ` Pavel Machek
2009-08-24 22:05 ` Ric Wheeler
2009-08-24 22:22 ` Zan Lynx
2009-08-24 22:44 ` Pavel Machek
2009-08-25 0:34 ` Ric Wheeler
2009-08-24 23:42 ` david
2009-08-24 22:41 ` Pavel Machek
2009-08-24 22:39 ` Theodore Tso
2009-08-24 23:00 ` Pavel Machek
2009-08-25 0:02 ` david
2009-08-25 9:32 ` Pavel Machek
2009-08-25 0:06 ` Ric Wheeler
2009-08-25 9:34 ` Pavel Machek
2009-08-25 15:34 ` david
2009-08-26 3:32 ` Rik van Riel
2009-08-26 11:17 ` Pavel Machek
2009-08-26 11:29 ` david
2009-08-26 13:10 ` Pavel Machek
2009-08-26 13:43 ` david
2009-08-26 18:02 ` Theodore Tso
2009-08-27 6:28 ` Eric Sandeen
2009-11-09 8:53 ` periodic fsck was " Pavel Machek
2009-11-09 14:05 ` Theodore Tso
2009-11-09 15:58 ` Andreas Dilger
2009-08-30 7:03 ` Pavel Machek
2009-08-26 12:28 ` Theodore Tso
2009-08-27 6:06 ` Rob Landley
2009-08-27 6:54 ` david
2009-08-27 7:34 ` Rob Landley
2009-08-28 14:37 ` david
2009-08-30 7:19 ` Pavel Machek
2009-08-30 12:48 ` david
2009-08-27 5:27 ` Rob Landley
2009-08-25 0:08 ` Theodore Tso
2009-08-25 9:42 ` Pavel Machek
2009-08-25 13:37 ` Ric Wheeler
2009-08-25 13:42 ` Alan Cox
2009-08-27 3:16 ` Rob Landley
2009-08-25 21:15 ` Pavel Machek
2009-08-25 22:42 ` Ric Wheeler
2009-08-25 22:51 ` Pavel Machek
2009-08-25 23:03 ` david
2009-08-25 23:29 ` Pavel Machek
2009-08-25 23:03 ` Ric Wheeler
2009-08-25 23:26 ` Pavel Machek
2009-08-25 23:40 ` Ric Wheeler
2009-08-25 23:48 ` david
2009-08-25 23:53 ` Pavel Machek
2009-08-26 0:11 ` Ric Wheeler
2009-08-26 0:16 ` Pavel Machek
2009-08-26 0:31 ` Ric Wheeler
2009-08-26 1:00 ` Theodore Tso
2009-08-26 1:15 ` Ric Wheeler [this message]
2009-08-26 2:58 ` Theodore Tso
2009-08-26 10:39 ` Ric Wheeler
2009-08-26 11:12 ` Pavel Machek
2009-08-26 11:28 ` david
2009-08-29 9:49 ` [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Pavel Machek
2009-08-29 11:28 ` Ric Wheeler
2009-09-02 20:12 ` Pavel Machek
2009-09-02 20:42 ` Ric Wheeler
2009-09-02 23:00 ` Rob Landley
2009-09-02 23:09 ` david
2009-09-03 8:55 ` Pavel Machek
2009-09-03 0:36 ` jim owens
2009-09-03 2:41 ` Rob Landley
2009-09-03 14:14 ` jim owens
2009-09-04 7:44 ` Rob Landley
2009-09-04 11:49 ` Ric Wheeler
2009-09-05 10:28 ` Pavel Machek
2009-09-05 12:20 ` Ric Wheeler
2009-09-05 13:54 ` Jonathan Corbet
2009-09-05 21:27 ` Pavel Machek
2009-09-05 21:56 ` Theodore Tso
2009-09-02 22:45 ` Rob Landley
2009-09-02 22:49 ` [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case Rob Landley
2009-09-03 9:08 ` Pavel Machek
2009-09-03 12:05 ` Ric Wheeler
2009-09-03 12:31 ` Pavel Machek
2009-08-29 16:35 ` [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) david
2009-08-30 7:07 ` Pavel Machek
2009-08-26 12:01 ` [patch] ext2/3: document conditions when reliable operation is possible Ric Wheeler
2009-08-26 12:23 ` Theodore Tso
2009-08-30 7:01 ` Pavel Machek
2009-08-27 5:19 ` Rob Landley
2009-08-27 12:24 ` Theodore Tso
2009-08-27 13:10 ` Ric Wheeler
2009-08-27 16:54 ` MD/DM and barriers (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Jeff Garzik
2009-08-27 18:09 ` Alasdair G Kergon
2009-09-01 14:01 ` Pavel Machek
2009-09-02 16:17 ` Michael Tokarev
2009-08-29 10:02 ` [patch] ext2/3: document conditions when reliable operation is possible Pavel Machek
2009-08-26 1:16 ` Pavel Machek
2009-08-26 2:55 ` Theodore Tso
2009-08-26 13:37 ` Ric Wheeler
2009-08-26 2:53 ` Henrique de Moraes Holschuh
2009-09-03 9:47 ` Pavel Machek
2009-08-26 3:50 ` Rik van Riel
2009-08-27 3:53 ` Rob Landley
2009-08-27 11:43 ` Ric Wheeler
2009-08-27 20:51 ` Rob Landley
2009-08-27 22:00 ` Ric Wheeler
2009-08-28 14:49 ` david
2009-08-29 10:05 ` Pavel Machek
2009-08-29 20:22 ` Rob Landley
2009-08-29 21:34 ` Pavel Machek
2009-09-03 16:56 ` what fsck can (and can't) do was " david
2009-09-03 19:27 ` Theodore Tso
2009-08-27 22:13 ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Pavel Machek
2009-08-28 1:32 ` Ric Wheeler
2009-08-28 6:44 ` Pavel Machek
2009-08-28 7:31 ` NeilBrown
2009-11-09 10:50 ` Pavel Machek
2009-08-28 11:16 ` Ric Wheeler
2009-09-01 13:58 ` Pavel Machek
2009-08-28 12:08 ` Theodore Tso
2009-08-30 7:51 ` Pavel Machek
2009-08-30 9:01 ` Christian Kujau
2009-09-02 20:55 ` Pavel Machek
2009-08-30 12:55 ` david
2009-08-30 14:12 ` Ric Wheeler
2009-08-30 14:44 ` Michael Tokarev
2009-08-30 16:10 ` Ric Wheeler
2009-08-30 16:35 ` Christoph Hellwig
2009-08-31 13:15 ` Ric Wheeler
2009-08-31 13:16 ` Christoph Hellwig
2009-08-31 13:19 ` Mark Lord
2009-08-31 13:21 ` Christoph Hellwig
2009-08-31 15:14 ` jim owens
2009-09-03 1:59 ` Ric Wheeler
2009-09-03 11:12 ` Krzysztof Halasa
2009-09-03 11:18 ` Ric Wheeler
2009-09-03 13:34 ` Krzysztof Halasa
2009-09-03 13:50 ` Ric Wheeler
2009-09-03 13:59 ` Krzysztof Halasa
2009-09-03 14:15 ` wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage Ric Wheeler
2009-09-03 14:26 ` Florian Weimer
2009-09-03 15:09 ` Ric Wheeler
2009-09-03 23:50 ` Krzysztof Halasa
2009-09-04 0:39 ` Ric Wheeler
2009-09-04 21:21 ` Mark Lord
2009-09-04 21:29 ` Ric Wheeler
2009-09-05 12:57 ` Mark Lord
2009-09-05 13:40 ` Ric Wheeler
2009-09-05 21:43 ` NeilBrown
2009-09-07 11:45 ` Pavel Machek
2009-09-07 13:10 ` Theodore Tso
2010-04-04 13:47 ` fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage) Pavel Machek
2010-04-04 17:39 ` tytso
2010-04-04 17:59 ` Rob Landley
2010-04-04 18:45 ` Pavel Machek
2010-04-04 19:35 ` tytso
2010-04-04 19:29 ` tytso
2010-04-04 23:58 ` Rob Landley
2009-09-03 14:35 ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) david
2009-08-31 13:22 ` Ric Wheeler
2009-08-31 15:50 ` david
2009-08-31 16:21 ` Ric Wheeler
2009-08-31 18:31 ` Christoph Hellwig
2009-08-31 19:11 ` david
2009-08-30 15:05 ` Pavel Machek
2009-08-30 15:20 ` Theodore Tso
2009-08-31 17:49 ` Jesse Brandeburg
2009-08-31 18:01 ` Ric Wheeler
2009-08-31 18:07 ` martin f krafft
2009-08-31 22:26 ` Jesse Brandeburg
2009-09-01 5:45 ` martin f krafft
2009-09-05 10:34 ` Pavel Machek
2009-08-28 7:11 ` raid is dangerous but that's secret Florian Weimer
2009-08-28 7:23 ` NeilBrown
2009-08-25 23:46 ` [patch] ext2/3: document conditions when reliable operation is possible david
2009-08-25 23:08 ` Neil Brown
2009-08-25 23:44 ` Pavel Machek
2009-08-26 4:08 ` Rik van Riel
2009-08-26 11:15 ` Pavel Machek
2009-08-27 3:29 ` Rik van Riel
2009-08-25 16:11 ` Theodore Tso
2009-08-25 22:21 ` [patch] document flash/RAID dangers Pavel Machek
2009-08-25 22:33 ` david
2009-08-25 22:40 ` Pavel Machek
2009-08-25 22:59 ` david
2009-08-25 23:37 ` Pavel Machek
2009-08-25 23:48 ` Ric Wheeler
2009-08-26 0:06 ` Pavel Machek
2009-08-26 0:12 ` Ric Wheeler
2009-08-26 0:20 ` Pavel Machek
2009-08-26 0:26 ` david
2009-08-26 0:28 ` Ric Wheeler
2009-08-26 0:38 ` Pavel Machek
2009-08-26 0:45 ` Ric Wheeler
2009-08-26 11:21 ` Pavel Machek
2009-08-26 11:58 ` Ric Wheeler
2009-08-26 12:40 ` Theodore Tso
2009-08-26 13:11 ` Ric Wheeler
2009-08-26 13:44 ` david
2009-08-26 13:40 ` Chris Adams
2009-08-26 13:47 ` Alan Cox
2009-08-26 14:11 ` Chris Adams
2009-08-27 21:50 ` Pavel Machek
2009-08-29 9:38 ` Pavel Machek
2009-08-26 4:24 ` Rik van Riel
2009-08-26 11:22 ` Pavel Machek
2009-08-26 14:45 ` Rik van Riel
2009-08-29 9:39 ` Pavel Machek
2009-08-25 23:56 ` david
2009-08-26 0:12 ` Pavel Machek
2009-08-26 0:20 ` david
2009-08-26 0:39 ` Pavel Machek
2009-08-26 1:17 ` david
2009-08-26 0:26 ` Ric Wheeler
2009-08-26 0:44 ` Pavel Machek
2009-08-26 0:50 ` Ric Wheeler
2009-08-26 1:19 ` david
2009-08-26 11:25 ` Pavel Machek
2009-08-26 12:37 ` Theodore Tso
2009-08-30 6:49 ` Pavel Machek
2009-08-26 4:20 ` Rik van Riel
2009-08-25 22:27 ` [patch] document that ext2 can't handle barriers Pavel Machek
2009-08-27 3:34 ` [patch] ext2/3: document conditions when reliable operation is possible Rob Landley
2009-08-27 8:46 ` David Woodhouse
2009-08-28 14:46 ` david
2009-08-29 10:09 ` Pavel Machek
2009-08-29 16:27 ` david
2009-08-29 21:33 ` Pavel Machek
2009-08-25 13:57 ` Chris Adams
2009-08-25 22:58 ` Neil Brown
2009-08-25 23:10 ` Ric Wheeler
2009-08-25 23:32 ` NeilBrown
2009-08-24 21:11 ` Greg Freemyer
2009-08-25 20:56 ` Rob Landley
2009-08-25 21:08 ` david
2009-08-25 18:52 ` Rob Landley
2009-08-25 14:43 ` Florian Weimer
2009-08-24 13:50 ` Theodore Tso
2009-08-24 18:48 ` Pavel Machek
2009-08-24 18:39 ` Pavel Machek
2009-08-24 13:21 ` Greg Freemyer
2009-08-24 18:44 ` Pavel Machek
2009-08-25 23:28 ` Neil Brown
2009-08-26 1:34 ` david
2009-08-24 21:11 ` Rob Landley
2009-08-24 21:33 ` Pavel Machek
2009-08-25 18:45 ` Jan Kara
2009-03-16 12:30 ` Pavel Machek
2009-03-16 19:03 ` Theodore Tso
2009-03-23 18:23 ` Pavel Machek
2009-03-16 19:40 ` Sitsofe Wheeler
2009-03-16 21:43 ` Rob Landley
2009-03-17 4:55 ` Kyle Moffett
2009-03-23 11:00 ` Pavel Machek
2009-08-29 1:33 ` Robert Hancock
2009-08-29 13:04 ` Alan Cox
2009-03-16 19:45 ` Greg Freemyer
2009-03-16 21:48 ` Pavel Machek
[not found] <dddMw-7Xt-57@gated-at.bofh.it>
[not found] ` <dddWc-83W-51@gated-at.bofh.it>
[not found] ` <ddefx-5R-37@gated-at.bofh.it>
[not found] ` <ddep7-bI-7@gated-at.bofh.it>
[not found] ` <ddeyR-iy-11@gated-at.bofh.it>
[not found] ` <ddeyR-iy-9@gated-at.bofh.it>
[not found] ` <ddeIu-n9-9@gated-at.bofh.it>
[not found] ` <ddeSb-th-29@gated-at.bofh.it>
[not found] ` <ddoRI-11a-39@gated-at.bofh.it>
[not found] <ciPTu-7f2-47@gated-at.bofh.it>
[not found] ` <clrhX-3R1-13@gated-at.bofh.it>
[not found] ` <dcEc1-33Z-17@gated-at.bofh.it>
[not found] ` <dcFUz-4yN-9@gated-at.bofh.it>
[not found] ` <dcHtf-5XC-17@gated-at.bofh.it>
[not found] ` <dcNSc-2DU-11@gated-at.bofh.it>
[not found] ` <dcOl9-3aC-19@gated-at.bofh.it>
[not found] ` <dcOOd-3qM-33@gated-at.bofh.it>
[not found] ` <dcOXN-3M5-15@gated-at.bofh.it>
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4A948C94.7040103@redhat.com \
--to=rwheeler@redhat.com \
--cc=akpm@osdl.org \
--cc=corbet@lwn.net \
--cc=fweimer@bfk.de \
--cc=goswin-v-b@web.de \
--cc=linux-doc@vger.kernel.org \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mtk.manpages@gmail.com \
--cc=pavel@ucw.cz \
--cc=rdunlap@xenotime.net \
--cc=rob@landley.net \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).