From: Bill Davidsen <davidsen@tmr.com>
To: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
device-mapper development <dm-devel@redhat.com>,
mingo@redhat.com, agk@redhat.com
Subject: Re: Data corruption on software RAID
Date: Thu, 10 Apr 2008 10:21:01 -0400
Message-ID: <47FE224D.2020309@tmr.com>
In-Reply-To: <Pine.LNX.4.64.0804100503160.26713@artax.karlin.mff.cuni.cz>
Mikulas Patocka wrote:
>>> Possibilities how to fix it:
>>>
>>> 1. lock the buffers and pages while they are being written --- this would
>>> cause performance degradation (the most severe degradation would occur
>>> when one process repeatedly calls sync() and another, unrelated process
>>> repeatedly writes to some file).
>>>
>>> Lock the buffers and pages only for RAID --- would create many special cases
>>> and possible bugs.
>>>
>>> 2. never turn the region dirty bit off until the filesystem is unmounted.
>>> --- this is the simplest fix. If the computer crashes after a long time, it
>>> resynchronizes the whole device. But it won't cause application-visible
>>> or filesystem-visible data corruption.
>>>
>>> 3. turn off the region bit if the region wasn't written in one pdflush
>>> period --- requires an interaction with pdflush, rather complex. The problem
>>> here is that pdflush makes its best effort to write data in
>>> dirty_writeback_centisecs interval, but it is not guaranteed to do it.
>>>
>>> 4. make more region states: Region has in-memory states CLEAN, DIRTY,
>>> MAYBE_DIRTY, CLEAN_CANDIDATE.
>>>
>>> When you start writing to the region, it is always moved to DIRTY state (and
>>> on-disk bit is turned on).
>>>
>>> When you finish all writes to the region, move it to MAYBE_DIRTY state, but
>>> leave the bit on disk on. We now don't know if the region is dirty or not.
>>>
>>> Run a helper thread that periodically does the following:
>>>   Change MAYBE_DIRTY regions to CLEAN_CANDIDATE
>>>   Issue sync()
>>>   Change CLEAN_CANDIDATE regions to CLEAN state and clear their on-disk bit.
>>>
>>> The rationale is that if the above write-while-modify scenario happens, the
>>> page is always dirty. Thus, sync() will write the page, kick the region back
>>> from CLEAN_CANDIDATE to MAYBE_DIRTY state and we won't mark the region as
>>> clean on disk.
>>>
>>>
>>> I'd like to know your ideas on this, before we start coding a solution.
>>>
>>>
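The four-state scheme in option 4 above can be sketched as a small state
machine. This is only an illustration of the transitions as described, not
md or dm-raid1 code: the Region class, its field names, and the `sync`
callable passed to helper_pass() are all hypothetical stand-ins.

```python
from enum import Enum, auto


class State(Enum):
    CLEAN = auto()
    DIRTY = auto()
    MAYBE_DIRTY = auto()
    CLEAN_CANDIDATE = auto()


class Region:
    """One mirror region: in-memory state plus the on-disk dirty bit."""

    def __init__(self):
        self.state = State.CLEAN
        self.on_disk_dirty = False
        self.writes_in_flight = 0

    def start_write(self):
        # Any write moves the region to DIRTY and turns the on-disk bit on.
        self.writes_in_flight += 1
        self.state = State.DIRTY
        self.on_disk_dirty = True

    def end_write(self):
        # Once the last write finishes we no longer know whether the
        # mirrors agree, so the on-disk bit stays set.
        self.writes_in_flight -= 1
        if self.writes_in_flight == 0:
            self.state = State.MAYBE_DIRTY


def helper_pass(regions, sync):
    """One iteration of the periodic helper thread.

    `sync` is a hypothetical callable standing in for sync(). It is
    assumed to write out every dirty page, calling start_write() and
    end_write() on each region it touches.
    """
    for r in regions:
        if r.state is State.MAYBE_DIRTY:
            r.state = State.CLEAN_CANDIDATE
    sync()
    for r in regions:
        # Regions rewritten during sync() were kicked back to
        # MAYBE_DIRTY and are skipped here.
        if r.state is State.CLEAN_CANDIDATE:
            r.state = State.CLEAN
            r.on_disk_dirty = False
```

A region whose page was modified during an earlier write is still dirty in
the page cache, so the sync() step rewrites it and the region never reaches
CLEAN in that pass.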
>> I looked at just this problem a while ago, and came to the conclusion that
>> what was needed was a COW bit, to show that there was i/o in flight, and that
>> before modification it needed to be copied. Since you don't want to let that
>> recurse, you don't start writing the copy until the original is written and
>> freed. Ideally you wouldn't bother to finish writing the original, but that
>> doesn't seem possible. That allows at most two copies of a chunk to take up
>> memory space at once, although it's still ugly and can be a bottleneck.
>>
>
> Copying the data would be performance overkill. You can really write
> different data to different disks, you just must not forget to resync them
> after a crash. The filesystem/application will recover with either old or
> new data --- it just won't recover when it's reading old and new data from
> the same location.
>
>
Currently you can go for hours without ever reaching a clean state on
active files. By deliberately not allowing the buffer to change during a
write, the chances of getting consistent data on the disk should be
significantly improved.
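A minimal sketch of the COW-bit idea quoted above, assuming a chunk is
just a byte buffer. The Chunk class and its method names are hypothetical,
invented only to show the bookkeeping: the buffer handed to the mirrors
never changes while the write is in flight, and at most two copies exist.

```python
class Chunk:
    """A chunk whose buffer is frozen while a write is in flight."""

    def __init__(self, data):
        self.buf = bytearray(data)
        self.cow = False   # set while self.buf is being written out
        self.copy = None   # private copy, taken on first change during I/O

    def start_io(self):
        # Hand the same stable buffer to every mirror.
        if self.cow:
            raise RuntimeError("I/O already in flight on this chunk")
        self.cow = True
        return self.buf

    def modify(self, offset, data):
        if self.cow:
            # Copy once, then keep changing the copy. Not recursing
            # here is what bounds memory use to two copies.
            if self.copy is None:
                self.copy = bytearray(self.buf)
            target = self.copy
        else:
            target = self.buf
        target[offset:offset + len(data)] = data

    def complete_io(self):
        """Finish the write; return True if the copy now needs writing."""
        self.cow = False
        if self.copy is None:
            return False
        # The original is written and can be freed; the copy takes over.
        self.buf, self.copy = self.copy, None
        return True
```

The write of the copy only starts after complete_io() returns True, which
matches "you don't start writing the copy until the original is written
and freed."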
> From my point of view that trick with thread doing sync() and turning off
> region bits looks best. I'd like to know if that solution doesn't have any
> other flaw.
>
>
>> For reliable operation I would want all copies (and/or CRCs) to be written on
>> an fsync, by the time I bother to fsync I really, really, want the data on the
>> disk.
>>
>
> fsync already works this way.
>
The point I was making is that after you change the code I would still
want that to happen. And your comment above seems to indicate a goal of
getting consistent data after a crash, with less concern that it be the
most recent data written. Sorry in advance if that's a misreading of
"you just must not forget to resync them after a crash."
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck