From: Bill Davidsen <davidsen@tmr.com>
To: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	device-mapper development <dm-devel@redhat.com>,
	mingo@redhat.com, agk@redhat.com
Subject: Re: Data corruption on software RAID
Date: Thu, 10 Apr 2008 10:21:01 -0400
Message-ID: <47FE224D.2020309@tmr.com>
In-Reply-To: <Pine.LNX.4.64.0804100503160.26713@artax.karlin.mff.cuni.cz>


Mikulas Patocka wrote:
>>> Possibilities how to fix it:
>>>
>>> 1. lock the buffers and pages while they are being written --- this would
>>> cause performance degradation (the most severe degradation would be in case
>>> when one process does repeatedly sync() and other unrelated process
>>> repeatedly writes to some file).
>>>
>>> Lock the buffers and pages only for RAID --- would create many special cases
>>> and possible bugs.
>>>
>>> 2. never turn the region dirty bit off until the filesystem is unmounted.
>>> --- this is the simplest fix. If the computer crashes after a long time, it
>>> resynchronizes the whole device. But it won't cause application-visible
>>> or filesystem-visible data corruption.
>>>
>>> 3. turn off the region bit if the region wasn't written in one pdflush
>>> period --- requires an interaction with pdflush, rather complex. The problem
>>> here is that pdflush makes its best effort to write data in
>>> dirty_writeback_centisecs interval, but it is not guaranteed to do it.
>>>
>>> 4. make more region states: Region has in-memory states CLEAN, DIRTY,
>>> MAYBE_DIRTY, CLEAN_CANDIDATE.
>>>
>>> When you start writing to the region, it is always moved to DIRTY state (and
>>> on-disk bit is turned on).
>>>
>>> When you finish all writes to the region, move it to MAYBE_DIRTY state, but
>>> leave bit on disk on. We now don't know if the region is dirty or not.
>>>
>>> Run a helper thread that does periodically:
>>> Change MAYBE_DIRTY regions to CLEAN_CANDIDATE
>>> Issue sync()
>>> Change CLEAN_CANDIDATE regions to CLEAN state and clear their on-disk bit.
>>>
>>> The rationale is that if the above write-while-modify scenario happens, the
>>> page is always dirty. Thus, sync() will write the page, kick the region back
>>> from CLEAN_CANDIDATE to MAYBE_DIRTY state and we won't mark the region as
>>> clean on disk.
>>>
>>>
>>> I'd like to know your ideas on this, before we start coding a solution.
>>>   
>>>       
>> I looked at just this problem a while ago, and came to the conclusion that
>> what was needed was a COW bit, to show that there was i/o in flight, and that
>> before modification it needed to be copied. Since you don't want to let that
>> recurse, you don't start writing the copy until the original is written and
>> freed. Ideally you wouldn't bother to finish writing the original, but that
>> doesn't seem possible. That allows at most two copies of a chunk to take up
>> memory space at once, although it's still ugly and can be a bottleneck.
>>     
>
> Copying the data would be performance overkill. You can really write 
> different data to different disks, you just must not forget to resync them 
> after a crash. The filesystem/application will recover with either old or 
> new data --- it just won't recover when it's reading old and new data from 
> the same location.
>
>   
Currently you can go for hours without ever reaching a clean state on 
active files. By deliberately not allowing the buffer to change during 
a write, the chances of getting consistent data on the disk should be 
significantly improved.
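
To make the COW idea concrete, here is a minimal Python sketch of a
chunk whose buffer is frozen while a write is in flight. It is only a
toy model under stated assumptions: the names CowChunk, modify,
start_write, finish_write and buffers_held are hypothetical, nothing
here is existing md or dm code, and real kernel code would work on
pages and bios rather than bytearrays.

```python
class CowChunk:
    """Toy model of a cached chunk with a copy-on-write (COW) bit.

    Hypothetical illustration only; not taken from md or dm.
    """

    def __init__(self, data: bytes):
        self.current = bytearray(data)  # buffer the filesystem modifies
        self.in_flight = None           # buffer handed to the disks, if any
        self.dirty = False              # current holds unwritten changes

    def modify(self, offset: int, payload: bytes) -> None:
        # COW step: if the buffer being written is the one we are about
        # to change, copy it first so the disks keep seeing stable data.
        if self.in_flight is self.current:
            self.current = bytearray(self.in_flight)
        self.current[offset:offset + len(payload)] = payload
        self.dirty = True

    def start_write(self) -> bytearray:
        # No recursion: the copy is not written until the original
        # write has completed and its buffer has been released.
        if self.in_flight is not None:
            raise RuntimeError("original write still in flight")
        self.in_flight = self.current
        self.dirty = False
        return self.in_flight

    def finish_write(self) -> bool:
        # Release the original; report whether a copy still needs writing.
        self.in_flight = None
        return self.dirty

    def buffers_held(self) -> int:
        # At most two buffers per chunk: the original and one copy.
        if self.in_flight is None or self.in_flight is self.current:
            return 1
        return 2
```

In this model every mirror leg reads the same frozen buffer, so both
legs receive identical data even if the chunk is modified while the
write is in progress. The cost is the extra copy and the second write,
which is the bottleneck mentioned earlier.
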
> From my point of view that trick with thread doing sync() and turning off 
> region bits looks best. I'd like to know if that solution doesn't have any 
> other flaw.
>
>   
>> For reliable operation I would want all copies (and/or CRCs) to be written on
>> an fsync; by the time I bother to fsync I really, really want the data on the
>> disk.
>>     
>
> fsync already works this way.
>   

The point I was making is that after you change the code I would still 
want that to happen. And your comment above seems to indicate a goal of 
getting consistent data after a crash, with less concern that it be the 
most recent data written. Sorry in advance if that's a misreading of 
"you just must not forget to resync them after a crash."
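
For reference, the four-state region scheme quoted above (option 4)
can be sketched as a small Python model. This is a sketch under
assumptions, not a proposed implementation: Region, helper_pass and the
sync callback are hypothetical names, the in_flight counter is my own
addition to detect "all writes finished", and a real version would live
in the kernel's dirty-log code.

```python
from enum import Enum


class State(Enum):
    CLEAN = "CLEAN"
    DIRTY = "DIRTY"
    MAYBE_DIRTY = "MAYBE_DIRTY"
    CLEAN_CANDIDATE = "CLEAN_CANDIDATE"


class Region:
    """In-memory state of one mirror region plus its on-disk dirty bit."""

    def __init__(self):
        self.state = State.CLEAN
        self.on_disk_dirty = False
        self.in_flight = 0  # assumed counter of outstanding writes

    def start_write(self):
        # Any write moves the region to DIRTY and sets the on-disk bit.
        self.in_flight += 1
        self.state = State.DIRTY
        self.on_disk_dirty = True

    def end_write(self):
        # When the last write finishes we no longer know whether the
        # mirrors match, so the on-disk bit stays set.
        self.in_flight -= 1
        if self.in_flight == 0:
            self.state = State.MAYBE_DIRTY


def helper_pass(regions, sync):
    """One iteration of the periodic helper thread."""
    for r in regions:
        if r.state is State.MAYBE_DIRTY:
            r.state = State.CLEAN_CANDIDATE
    # sync() rewrites any page modified during an earlier write, which
    # kicks that region back through DIRTY to MAYBE_DIRTY.
    sync()
    for r in regions:
        if r.state is State.CLEAN_CANDIDATE:
            r.state = State.CLEAN
            r.on_disk_dirty = False
```

A region whose page was modified during a write is written again by
sync(), leaves CLEAN_CANDIDATE, and keeps its on-disk bit. Only regions
that stay untouched across the sync() are marked clean.
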

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismarck




