From: Ric Wheeler <ric@emc.com>
To: Tejun Heo <htejun@gmail.com>
Cc: Mark Hahn <hahn@physics.mcmaster.ca>,
David.Ronis@McGill.CA, linux-ide@vger.kernel.org, neilb@suse.de
Subject: Re: Problem with disk
Date: Mon, 08 May 2006 10:33:22 -0400
Message-ID: <445F56B2.9070300@emc.com>
In-Reply-To: <445DF911.1020408@gmail.com>

Tejun Heo wrote:
> Ric Wheeler wrote:
>>
>>
>> Tejun Heo wrote:
>>>
>>>
>>> Unfortunately, this can result in *massive* destruction of the
>>> filesystem. I lost my RAID-1 array earlier this year this way. The
>>> FS code systematically destroyed metadata of the filesystem and, on
>>> the following reboot, fsck did the final blow, I think. I ended up
>>> with 100+Gbytes of unorganized data and I had to recover data by
>>> grep + bvi.
>>
>> Were you running with Neil's fixes that make MD devices properly
>> handle write barrier requests? Until fairly recently (not sure when
>> this was fixed), MD devices more or less dropped the barrier requests.
>>
>> With properly working barriers, any journal file system should get
>> you back to a consistent state after a power drop (although there are
>> many less common ways that drives can potentially drop data).
>
> I'm not sure whether the barrier was working or not. Ummm.. Are you
> saying that MD is capable of recovering from data drop *during*
> operation? ie. the system didn't go out, just the harddrives. Data
> is lost no matter what MD does and MD and the filesystem don't have
> any way to tell which bits made it to the media and which are lost
> whether barriers are working or not.
I think that MD will do the right thing if the IO terminates with an
error condition. If the error is silent (and that can happen during a
write), then it clearly cannot recover.
>
> To handle such conditions, device driver should tell upper layer that
> PHY status has changed (or something weird happened which could lead
> to data loss) and the fs, in return, perform journal replay while
> still online. I'm pretty sure that isn't implemented in the current
> kernel.
>
>>>
>>> This is an extreme case but it shows turning off writeback has its
>>> advantages. After the initial stress & panic attack subsided, I
>>> tried to think about how to prevent such catastrophes, but there
>>> doesn't seem to be a good way. There's no way to tell 1. if the
>>> harddrive actually lost the writeback cache content 2. if so, how
>>> much it has lost. So, unless the OS halts the system everytime
>>> something seems weird with the disk, turning off writeback cache
>>> seems to be the only solution.
>>>
>>
>> Turning off the writeback cache is definitely the safe and
>> conservative way to go for mission critical data unless you can be
>> very certain that your barriers are properly working on the drive &
>> IO stack. We validate the cache flush commands with a SATA analyzer
>> (making sure that we see them on sync/transaction commits) and that
>> they take a reasonable amount of time at the drive...
>>
>
> One thing I'm curious about is how much performance benefit can be
> obtained from write-back caching. With NCQ/TCQ, latency is much less
> of an issue and I don't think scheduling and/or buffering inside the
> drive would result in significant performance increase when so much is
> done by the vm and block layer (aside from scheduling of currently
> queued commands).
>
> Some linux elevators try pretty hard to not mix read and write
> requests as they mess up statistics (write back cache absorbs write
> requests very fast then affect following read requests). So, they
> basically try to eliminate the effect of write-back caching.
>
> Well, benchmark time, it seems. :)
My own benchmarks showed a clear win for a write-intensive workload
with the write cache + barriers enabled using reiserfs. I think that
NCQ/TCQ wins mostly in the read case.
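
For anyone who wants to repeat this kind of comparison, here is a minimal
sketch of a commit-latency test. It is not the benchmark referred to above;
the function names, block size, and commit count are all assumptions chosen
for illustration. It writes small blocks and calls fsync after each one,
which is roughly what a journal commit asks of the IO stack. Run it once
with the drive's write cache on and once with it off, on the filesystem
under test.

```python
import os
import tempfile
import time


def time_fsync_writes(directory, count=100, block_size=4096):
    """Write `count` blocks to a scratch file in `directory`, calling
    fsync after each one. Returns the per-commit latencies in seconds."""
    block = b"\0" * block_size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, block)
            # Ask the kernel to push the data to stable storage. Whether
            # this reaches the platters depends on barrier/flush support
            # in the filesystem, MD, and the drive.
            os.fsync(fd)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


def summarize(latencies):
    """Reduce a list of latencies to a few headline numbers."""
    ordered = sorted(latencies)
    return {
        "commits": len(ordered),
        "total_s": sum(ordered),
        "median_s": ordered[len(ordered) // 2],
        "max_s": ordered[-1],
    }
```

Calling summarize(time_fsync_writes("/mnt/test", count=200)) (the mount
point is hypothetical) gives a rough per-commit cost. As a sanity check, a
median far below the rotational latency of a spinning disk with the write
cache enabled may mean the flushes are not reaching the media, which is
the situation the analyzer traces are meant to rule out.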
ric