* Data corruption on software RAID
@ 2008-04-07 23:43 Mikulas Patocka
2008-04-08 10:22 ` Helge Hafting
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Mikulas Patocka @ 2008-04-07 23:43 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-raid, device-mapper development, agk, mingo, neilb
Hi
During source code review, I found an improbable but possible data
corruption on RAID-1 and on DM-RAID-1. (I'm not sure about RAID-4,5,6.)
The RAID code was enhanced with bitmaps in 2.6.13.
The bitmap tracks regions on the device that may be out-of-sync.
The purpose of the bitmap is to avoid resynchronizing the whole array in
the case of a crash. DM-RAID uses a similar bitmap too.
The write sequence is usually:
1. turn on the bit in the bitmap (if it wasn't on before).
2. update the data.
3. when writes to all devices finish, the bit may be turned off.
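The three steps above can be sketched as a toy model (hypothetical names,
not the actual md/dm code):

```python
# Illustrative sketch of the per-region dirty-bit protocol. The names
# (RegionBitmap, write_start, write_end) are assumptions for this toy
# model, not the kernel's actual identifiers.

class RegionBitmap:
    """Tracks which regions may be out-of-sync across mirror legs."""
    def __init__(self, num_regions):
        self.on_disk_bit = [False] * num_regions  # persisted region bitmap
        self.in_flight = [0] * num_regions        # writes not yet completed

    def write_start(self, region):
        # Step 1: turn the bit on (it would be persisted to disk
        # before the data write is allowed to proceed).
        self.on_disk_bit[region] = True
        self.in_flight[region] += 1

    def write_end(self, region):
        # Step 3: when all writes to the region finish, the bit may be
        # cleared -- the (flawed) assumption being the region is in-sync.
        self.in_flight[region] -= 1
        if self.in_flight[region] == 0:
            self.on_disk_bit[region] = False

bm = RegionBitmap(4)
bm.write_start(2)
assert bm.on_disk_bit[2]        # region marked possibly out-of-sync
bm.write_end(2)
assert not bm.on_disk_bit[2]    # all writes done -> bit cleared
```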
The developers assume that when all writes to the region finish, the
region is in-sync.
This assumption is wrong.
The kernel writes data while it may still be modified, in many places. For
example, the pdflush daemon periodically writes pages and buffers without
locking them. Similarly, pages may be written out while they are mapped
writable into processes.
Normally, there is no problem with modify-while-write. The write sequence
is something like:
* turn off Dirty bit
* write the buffer or page
--- and if the buffer or page is modified while it's being written, the
Dirty bit is turned on again and the correct data are written later.
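A minimal sketch of this ordinary writeback cycle (names are illustrative,
not kernel structures):

```python
# Toy model of clear-Dirty-then-write: a concurrent modification simply
# re-dirties the page, so the correct data go out on the next pass.

def writeback(page):
    page['dirty'] = False      # clear the Dirty bit first
    data_sent = page['data']   # snapshot handed to the device
    return data_sent

page = {'data': 'v1', 'dirty': True}
sent = writeback(page)

# A racing writer modifies the page while the write is in flight:
page['data'] = 'v2'
page['dirty'] = True           # re-dirtied -> rewritten later

assert sent == 'v1'            # stale data went out this time...
assert page['dirty']           # ...but 'v2' is queued for the next pass
```

On a single disk this is harmless; the next writeback pass fixes it up.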
But with RAID (since 2.6.13), it can produce corruption because when the
buffer is modified while being written, different versions of data can be
written to devices in the RAID array. For example:
1. pdflush turns off the dirty bit on an Ext2 bitmap buffer and starts
writing the buffer to RAID-1
2. the kernel allocates some blocks in that Ext2 bitmap. One of RAID-1
devices writes new data, the other one gets old data.
3. The kernel turns on the buffer dirty bit, so this buffer is scheduled
for next write.
4. The RAID-1 subsystem sees that both writes have finished, thinks that
this region is in-sync, turns off its dirty bit in its region bitmap and
writes the bitmap to disk.
5. before pdflush writes the Ext2 bitmap buffer again, the system CRASHES
6. after reboot, RAID-1 sees the bit for this region off, so it doesn't
resynchronize it.
7. during fsck, RAID-1 reads the Ext2 bitmap from the device where the bit
is on. fsck sees that the bitmap is correct and doesn't touch it.
8. some time later, the kernel reads the Ext2 bitmap from the other
device. It sees the bit off, allocates some data there and creates
cross-linked files.
The same corruption may happen with some journaled filesystems (probably
not Ext3) or applications that do their own crash recovery (databases,
etc.). The key point is that an application expects that after a crash it
reads old data or new data, but it doesn't expect that subsequent reads to
the same place may alternatively return old or new data --- which may
happen on RAID-1.
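The scenario in steps 1-8 can be replayed as a toy simulation (purely
illustrative; no kernel structures are modeled):

```python
# Replay of the race: the page is modified between the submissions to
# the two mirror legs, the legs diverge, yet the region bit is cleared
# because "both writes finished".

page = {'data': 'bitmap-v1', 'dirty': True}
region_bit_on_disk = False

# 1. pdflush clears Dirty and starts the RAID-1 write
page['dirty'] = False
region_bit_on_disk = True
leg_a = page['data']               # first leg receives v1

# 2-3. kernel allocates blocks in the Ext2 bitmap mid-write
page['data'] = 'bitmap-v2'
page['dirty'] = True
leg_b = page['data']               # second leg receives v2

# 4. both writes "finished" -> RAID clears the region bit
region_bit_on_disk = False

# 5. CRASH here, before the re-dirtied page is written again.
# 6. After reboot, resync is skipped (bit off): the legs stay divergent.
assert leg_a != leg_b and not region_bit_on_disk
```

Reads now return `bitmap-v1` or `bitmap-v2` depending on which leg
serves them, which is exactly what recovery code does not expect.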
Possible ways to fix it:
1. lock the buffers and pages while they are being written --- this would
cause performance degradation (the most severe degradation would be in
case when one process does repeatedly sync() and other unrelated
process repeatedly writes to some file).
Lock the buffers and pages only for RAID --- would create many special
cases and possible bugs.
2. never turn the region dirty bit off until the filesystem is unmounted
--- this is the simplest fix. If the computer crashes after a long
uptime, it resynchronizes the whole device, but it won't cause
application-visible or filesystem-visible data corruption.
3. turn off the region bit if the region wasn't written in one pdflush
period --- requires interaction with pdflush and is rather complex. The
problem here is that pdflush makes a best effort to write data within the
dirty_writeback_centisecs interval, but it is not guaranteed to do it.
4. add more region states: a region has in-memory states CLEAN, DIRTY,
MAYBE_DIRTY and CLEAN_CANDIDATE.
When you start writing to the region, it is always moved to DIRTY state
(and on-disk bit is turned on).
When you finish all writes to the region, move it to MAYBE_DIRTY state,
but leave the bit on disk on. We now don't know whether the region is
dirty or not.
Run a helper thread that periodically does:
Change MAYBE_DIRTY regions to CLEAN_CANDIDATE
Issue sync()
Change CLEAN_CANDIDATE regions to CLEAN state and clear their on-disk bit.
The rationale is that if the above write-while-modify scenario happens,
the page is always dirty. Thus, sync() will write the page, kick the
region back from CLEAN_CANDIDATE to MAYBE_DIRTY state and we won't mark
the region as clean on disk.
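A sketch of this four-state scheme with a hypothetical helper-thread pass
(names and structure are assumptions, not the eventual implementation):

```python
# Toy model of option 4. A region re-dirtied during the sync() pass is
# kicked back to MAYBE_DIRTY and never falsely marked clean on disk.

from enum import Enum, auto

class State(Enum):
    CLEAN = auto()
    DIRTY = auto()
    MAYBE_DIRTY = auto()
    CLEAN_CANDIDATE = auto()

class Region:
    def __init__(self):
        self.state = State.CLEAN
        self.bit_on_disk = False

    def write_start(self):
        self.state = State.DIRTY        # demotes CLEAN_CANDIDATE too
        self.bit_on_disk = True

    def write_end(self):                # all writes to region finished
        self.state = State.MAYBE_DIRTY  # on-disk bit stays on

def helper_pass(regions, sync):
    # Change MAYBE_DIRTY regions to CLEAN_CANDIDATE
    for r in regions:
        if r.state is State.MAYBE_DIRTY:
            r.state = State.CLEAN_CANDIDATE
    sync()  # re-dirtied pages get flushed, demoting their regions
    # Surviving CLEAN_CANDIDATEs are provably clean: clear on-disk bits
    for r in regions:
        if r.state is State.CLEAN_CANDIDATE:
            r.state = State.CLEAN
            r.bit_on_disk = False

quiet, racy = Region(), Region()
for r in (quiet, racy):
    r.write_start()
    r.write_end()

def sync():
    # sync() flushes the page that was modified-while-written,
    # which shows up as a fresh write to the racy region
    racy.write_start()
    racy.write_end()

helper_pass([quiet, racy], sync)
assert quiet.state is State.CLEAN and not quiet.bit_on_disk
assert racy.state is State.MAYBE_DIRTY and racy.bit_on_disk
```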
I'd like to know your ideas on this, before we start coding a solution.
Mikulas
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Data corruption on software RAID
2008-04-07 23:43 Data corruption on software RAID Mikulas Patocka
@ 2008-04-08 10:22 ` Helge Hafting
2008-04-08 11:14 ` Mikulas Patocka
2008-04-09 18:33 ` Bill Davidsen
2008-04-10 6:14 ` Mario 'BitKoenig' Holbe
2 siblings, 1 reply; 7+ messages in thread
From: Helge Hafting @ 2008-04-08 10:22 UTC (permalink / raw)
To: Mikulas Patocka
Cc: linux-kernel, linux-raid, device-mapper development, agk, mingo,
neilb
Mikulas Patocka wrote:
> Hi
>
> During source code review, I found an improbable but possible data
> corruption on RAID-1 and on DM-RAID-1. (I'm not sure about RAID-4,5,6.)
>
> The RAID code was enhanced with bitmaps in 2.6.13.
>
> The bitmap tracks regions on the device that may be possibly out-of-sync.
> The purpose of the bitmap is to avoid resynchronizing the whole array in
> the case of crash. DM-raid uses similar bitmap too.
>
> The write sequence is usually:
> 1. turn on bit in the bitmap (if it hasn't been on before).
> 2. update the data.
> 3. when writes to all devices finish, the bit may be turned off.
>
> The developers assume that when all writes to the region finish, the
> region is in-sync.
>
> This assumption is wrong.
>
> Kernel writes data while they may be modified in many places. For example,
> the pdflush daemon writes periodically pages and buffers without locking
> them. Similarly, pages may be written while they are mapped for write to
> the processes.
>
> Normally, there is no problem with modify-while-write. The write sequence
> is something like:
> * turn off Dirty bit
> * write the buffer or page
> --- and if the buffer or page is modified while it's being written, the
> Dirty bit is turned on again and the correct data are written later.
>
> But with RAID (since 2.6.13), it can produce corruption because when the
> buffer is modified while being written, different versions of data can be
> written to devices in the RAID array. For example:
>
> 1. pdflush turns off a dirty bit on Ext2 bitmap buffer and starts writing
> the buffer to RAID-1
> 2. the kernel allocates some blocks in that Ext2 bitmap. One of RAID-1
> devices writes new data, the other one gets old data.
> 3. The kernel turns on the buffer dirty bit, so this buffer is scheduled
> for next write.
> 4. RAID-1 subsystem sees that both writes finished, it thinks that this
> region is in-sync, turns off its dirty bit in its region bitmap and writes
> the bitmap to disk.
>
Would this help:
RAID-1 sees that both writes finished. It checks the dirty bits on all
relevant buffers/pages. If none got re-dirtied, then it is ok to
turn off the dirty bit in the region bitmap and write that. Otherwise,
it is not!
Or is such a check too time-consuming?
Helge Hafting
* Re: Data corruption on software RAID
2008-04-08 10:22 ` Helge Hafting
@ 2008-04-08 11:14 ` Mikulas Patocka
0 siblings, 0 replies; 7+ messages in thread
From: Mikulas Patocka @ 2008-04-08 11:14 UTC (permalink / raw)
To: Helge Hafting
Cc: linux-kernel, linux-raid, device-mapper development, agk, mingo,
neilb
> > But with RAID (since 2.6.13), it can produce corruption because when the
> > buffer is modified while being written, different versions of data can be
> > written to devices in the RAID array. For example:
> >
> > 1. pdflush turns off a dirty bit on Ext2 bitmap buffer and starts writing
> > the buffer to RAID-1
> > 2. the kernel allocates some blocks in that Ext2 bitmap. One of RAID-1
> > devices writes new data, the other one gets old data.
> > 3. The kernel turns on the buffer dirty bit, so this buffer is scheduled for
> > next write.
> > 4. RAID-1 subsystem sees that both writes finished, it thinks that this
> > region is in-sync, turns off its dirty bit in its region bitmap and writes
> > the bitmap to disk.
> >
> Would this help:
> RAID-1 sees that both writes finished. It checks the dirty bits on all
> relevant buffers/pages. If none got re-dirtied, then it is ok to
> turn off the dirty bit in the region bitmap and write that. Otherwise, it is
> not!
>
> Or is such a check too time-consuming?
That is impossible. The page cache can answer questions like "where is
page 0x1234 from inode 0x5678 located on disk?" But it can't answer the
reverse question: "which inode and which page is using disk block
0x12345678?"
Furthermore, with device mapper you can stack several mapping tables on
top of each other --- and again --- device mapper can't solve the reverse
problem: it can't tell you which filesystem is using block X.
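A toy illustration of why only the forward mapping is cheap (hypothetical
keys; the real page cache is of course not a Python dict):

```python
# The page cache indexes by (inode, page index); nothing indexes by
# disk block, so the reverse question requires a full scan.

page_cache = {
    ('inode-5678', 0x1234): 'block-0x12345678',
    ('inode-5678', 0x1235): 'block-0x12345679',
}

# Forward lookup -- "where is page 0x1234 of inode 0x5678 on disk?"
assert page_cache[('inode-5678', 0x1234)] == 'block-0x12345678'

# Reverse lookup -- "which inode/page uses block 0x12345678?"
# There is no index for this; you must scan every mapping.
owners = [key for key, blk in page_cache.items()
          if blk == 'block-0x12345678']
assert owners == [('inode-5678', 0x1234)]
```

Doing that scan for every completed write, across stacked device-mapper
tables, is what makes Helge's check impractical.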
Mikulas
> Helge Hafting
* Re: Data corruption on software RAID
2008-04-07 23:43 Data corruption on software RAID Mikulas Patocka
2008-04-08 10:22 ` Helge Hafting
@ 2008-04-09 18:33 ` Bill Davidsen
2008-04-10 3:07 ` Mikulas Patocka
2008-04-10 6:14 ` Mario 'BitKoenig' Holbe
2 siblings, 1 reply; 7+ messages in thread
From: Bill Davidsen @ 2008-04-09 18:33 UTC (permalink / raw)
To: Mikulas Patocka
Cc: linux-kernel, linux-raid, device-mapper development, agk, mingo,
neilb
Mikulas Patocka wrote:
> Hi
>
> During source code review, I found an improbable but possible data
> corruption on RAID-1 and on DM-RAID-1. (I'm not sure about RAID-4,5,6.)
>
> The RAID code was enhanced with bitmaps in 2.6.13.
>
> The bitmap tracks regions on the device that may be possibly out-of-sync.
> The purpose of the bitmap is to avoid resynchronizing the whole array in
> the case of crash. DM-raid uses similar bitmap too.
>
> The write sequence is usually:
> 1. turn on bit in the bitmap (if it hasn't been on before).
> 2. update the data.
> 3. when writes to all devices finish, the bit may be turned off.
>
> The developers assume that when all writes to the region finish, the
> region is in-sync.
>
> This assumption is wrong.
>
> Kernel writes data while they may be modified in many places. For example,
> the pdflush daemon writes periodically pages and buffers without locking
> them. Similarly, pages may be written while they are mapped for write to
> the processes.
>
> Normally, there is no problem with modify-while-write. The write sequence
> is something like:
> * turn off Dirty bit
> * write the buffer or page
> --- and if the buffer or page is modified while it's being written, the
> Dirty bit is turned on again and the correct data are written later.
>
> But with RAID (since 2.6.13), it can produce corruption because when the
> buffer is modified while being written, different versions of data can be
> written to devices in the RAID array. For example:
>
> 1. pdflush turns off a dirty bit on Ext2 bitmap buffer and starts writing
> the buffer to RAID-1
> 2. the kernel allocates some blocks in that Ext2 bitmap. One of RAID-1
> devices writes new data, the other one gets old data.
> 3. The kernel turns on the buffer dirty bit, so this buffer is scheduled
> for next write.
> 4. RAID-1 subsystem sees that both writes finished, it thinks that this
> region is in-sync, turns off its dirty bit in its region bitmap and writes
> the bitmap to disk.
> 5. before pdflush writes the Ext2 bitmap buffer again, the system CRASHES
>
> 6. after new boot, RAID-1 sees the bit for this region off, so it doesn't
> resynchronize it.
> 7. during fsck, RAID-1 reads the Ext2 bitmap from the device where the bit
> is on. fsck sees that the bitmap is correct and doesn't touch it.
> 8. some time later, the kernel reads the Ext2 bitmap from the other device. It
> sees the bit off, allocates some data there and creates cross-linked
> files.
>
> The same corruption may happen with some journaled filesystems (probably
> not Ext3) or applications that do their own crash recovery (databases,
> etc.). The key point is that an application expects that after a crash it
> reads old data or new data, but it doesn't expect that subsequent reads to
> the same place may alternatively return old or new data --- which may
> happen on RAID-1.
>
>
> Possibilities how to fix it:
>
> 1. lock the buffers and pages while they are being written --- this would
> cause performance degradation (the most severe degradation would be in
> case when one process does repeatedly sync() and other unrelated
> process repeatedly writes to some file).
>
> Lock the buffers and pages only for RAID --- would create many special
> cases and possible bugs.
>
> 2. never turn the region dirty bit off until the filesystem is unmounted.
> --- this is the simplest fix. If the computer crashes after a long
> time, it resynchronizes the whole device, but it won't cause
> application-visible or filesystem-visible data corruption.
>
> 3. turn off the region bit if the region wasn't written in one pdflush
> period --- requires an interaction with pdflush, rather complex. The
> problem here is that pdflush makes its best effort to write data in
> dirty_writeback_centisecs interval, but it is not guaranteed to do it.
>
> 4. make more region states: Region has in-memory states CLEAN, DIRTY,
> MAYBE_DIRTY, CLEAN_CANDIDATE.
>
> When you start writing to the region, it is always moved to DIRTY state
> (and on-disk bit is turned on).
>
> When you finish all writes to the region, move it to MAYBE_DIRTY state,
> but leave the bit on disk on. We now don't know whether the region is dirty or not.
>
> Run a helper thread that does periodically:
> Change MAYBE_DIRTY regions to CLEAN_CANDIDATE
> Issue sync()
> Change CLEAN_CANDIDATE regions to CLEAN state and clear their on-disk bit.
>
> The rationale is that if the above write-while-modify scenario happens,
> the page is always dirty. Thus, sync() will write the page, kick the
> region back from CLEAN_CANDIDATE to MAYBE_DIRTY state and we won't mark
> the region as clean on disk.
>
>
> I'd like to know your ideas on this, before we start coding a solution.
>
I looked at just this problem a while ago, and came to the conclusion
that what was needed was a COW bit, to show that there was i/o in
flight, and that before modification it needed to be copied. Since you
don't want to let that recurse, you don't start writing the copy until
the original is written and freed. Ideally you wouldn't bother to finish
writing the original, but that doesn't seem possible. That allows at
most two copies of a chunk to take up memory space at once, although
it's still ugly and can be a bottleneck.
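One possible reading of this COW-bit scheme as a sketch (all names are
hypothetical; this is Bill's proposal, not existing kernel code):

```python
# Toy model of the COW bit: a page with i/o in flight is not modified
# in place; the new data go into a copy, and the copy's write is only
# submitted after the original's write completes. At most two versions
# of a chunk exist at once.

class CowPage:
    def __init__(self, data):
        self.data = data
        self.in_flight = False     # COW bit: write submitted, not done
        self.pending_copy = None   # deferred modification, if any

    def start_write(self):
        self.in_flight = True
        return self.data           # snapshot seen by all mirror legs

    def modify(self, new_data):
        if self.in_flight:
            # copy-on-write: don't touch the buffer the device sees
            self.pending_copy = new_data
        else:
            self.data = new_data

    def end_write(self):
        self.in_flight = False
        if self.pending_copy is not None:
            self.data, self.pending_copy = self.pending_copy, None
            return self.start_write()  # now submit the deferred copy
        return None

p = CowPage('v1')
sent = p.start_write()
p.modify('v2')                     # lands in the copy, not the original
assert sent == 'v1' and p.data == 'v1'   # all legs get the same bytes
resubmit = p.end_write()
assert resubmit == 'v2'            # deferred copy goes out next
```

Because every mirror leg reads the same frozen buffer, the legs can
never diverge; the cost is the extra copy and the serialized second
write, which is the bottleneck mentioned above.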
For reliable operation I would want all copies (and/or CRCs) to be
written on an fsync, by the time I bother to fsync I really, really,
want the data on the disk.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck
* Re: Data corruption on software RAID
2008-04-09 18:33 ` Bill Davidsen
@ 2008-04-10 3:07 ` Mikulas Patocka
[not found] ` <47FE224D.2020309@tmr.com>
0 siblings, 1 reply; 7+ messages in thread
From: Mikulas Patocka @ 2008-04-10 3:07 UTC (permalink / raw)
To: Bill Davidsen
Cc: linux-kernel, linux-raid, device-mapper development, agk, mingo,
neilb
> > Possibilities how to fix it:
> >
> > 1. lock the buffers and pages while they are being written --- this would
> > cause performance degradation (the most severe degradation would be in case
> > when one process does repeatedly sync() and other unrelated process
> > repeatedly writes to some file).
> >
> > Lock the buffers and pages only for RAID --- would create many special cases
> > and possible bugs.
> >
> > 2. never turn the region dirty bit off until the filesystem is unmounted.
> > --- this is the simplest fix. If the computer crashes after a long time, it
> > resynchronizes the whole device, but it won't cause application-visible
> > or filesystem-visible data corruption.
> >
> > 3. turn off the region bit if the region wasn't written in one pdflush
> > period --- requires an interaction with pdflush, rather complex. The problem
> > here is that pdflush makes its best effort to write data in
> > dirty_writeback_centisecs interval, but it is not guaranteed to do it.
> >
> > 4. make more region states: Region has in-memory states CLEAN, DIRTY,
> > MAYBE_DIRTY, CLEAN_CANDIDATE.
> >
> > When you start writing to the region, it is always moved to DIRTY state (and
> > on-disk bit is turned on).
> >
> > When you finish all writes to the region, move it to MAYBE_DIRTY state, but
> > leave the bit on disk on. We now don't know whether the region is dirty or not.
> >
> > Run a helper thread that does periodically:
> > Change MAYBE_DIRTY regions to CLEAN_CANDIDATE
> > Issue sync()
> > Change CLEAN_CANDIDATE regions to CLEAN state and clear their on-disk bit.
> >
> > The rationale is that if the above write-while-modify scenario happens, the
> > page is always dirty. Thus, sync() will write the page, kick the region back
> > from CLEAN_CANDIDATE to MAYBE_DIRTY state and we won't mark the region as
> > clean on disk.
> >
> >
> > I'd like to know your ideas on this, before we start coding a solution.
> >
>
> I looked at just this problem a while ago, and came to the conclusion that
> what was needed was a COW bit, to show that there was i/o in flight, and that
> before modification it needed to be copied. Since you don't want to let that
> recurse, you don't start writing the copy until the original is written and
> freed. Ideally you wouldn't bother to finish writing the original, but that
> doesn't seem possible. That allows at most two copies of a chunk to take up
> memory space at once, although it's still ugly and can be a bottleneck.
Copying the data would be performance overkill. You really can write
different data to different disks; you just must not forget to resync them
after a crash. The filesystem/application will recover with either old or
new data --- it just won't recover when it reads old and new data from
the same location.
From my point of view, the trick with a thread doing sync() and turning
off region bits looks best. I'd like to know whether that solution has any
other flaw.
> For reliable operation I would want all copies (and/or CRCs) to be written on
> an fsync, by the time I bother to fsync I really, really, want the data on the
> disk.
fsync already works this way.
Mikulas
* Re: Data corruption on software RAID
2008-04-07 23:43 Data corruption on software RAID Mikulas Patocka
2008-04-08 10:22 ` Helge Hafting
2008-04-09 18:33 ` Bill Davidsen
@ 2008-04-10 6:14 ` Mario 'BitKoenig' Holbe
2 siblings, 0 replies; 7+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2008-04-10 6:14 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-raid, dm-devel
Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote:
> During source code review, I found an improbable but possible data
> corruption on RAID-1 and on DM-RAID-1. (I'm not sure about RAID-4,5,6.)
>
> The RAID code was enhanced with bitmaps in 2.6.13.
...
> The developers assume that when all writes to the region finish, the
> region is in-sync.
Just for the records: You don't need bitmaps for that, this happens on
plain non-bitmapped devices as well.
I had an interesting discussion about this with Heinz Mauelshagen on the
linux-raid list back in early 2006 starting with
Message-ID: <du6t39$be5$1@sea.gmane.org>
And it's not that unlikely at all. I experience such inconsistencies
regularly on ext2/3 filesystems with heavy inode fluctuation (for
example via cp -al; rsync, like rsnapshot does). I periodically repair
these inconsistencies manually. However, it always seems to happen with
inode removal only, which is rather harmless.
regards
Mario
--
I've never been certain whether the moral of the Icarus story should
only be, as is generally accepted, "Don't try to fly too high," or
whether it might also be thought of as, "Forget the wax and feathers
and do a better job on the wings." -- Stanley Kubrick
* Re: Data corruption on software RAID
[not found] ` <47FE224D.2020309@tmr.com>
@ 2008-04-11 2:55 ` Mikulas Patocka
0 siblings, 0 replies; 7+ messages in thread
From: Mikulas Patocka @ 2008-04-11 2:55 UTC (permalink / raw)
To: Bill Davidsen
Cc: linux-kernel, linux-raid, device-mapper development, agk, mingo,
neilb
> Currently you can go for hours without ever reaching a clean state on active
> files. By not deliberately allowing the buffer to change during a write the
> chances for getting consistent data on the disk should be significantly
> improved.
It can already happen that one device writes the sector and the other
does not, if power is interrupted. And all RAID implementations already
deal with it by resynchronizing the modified areas after a crash. So they
could resynchronize modify-while-write cases as well, with the same code.
... or I don't know if MM maintainers want to add locking to the pages
that are under a write. Personally, I wouldn't do it.
> > From my point of view that trick with thread doing sync() and turning off
> > region bits looks best. I'd like to know if that solution doesn't have any
> > other flaw.
> >
> >
> > > For reliable operation I would want all copies (and/or CRCs) to be
> > > written on an fsync, by the time I bother to fsync I really, really,
> > > want the data on the disk.
> > >
> >
> > fsync already works this way.
> >
>
> The point I was making is that after you change the code I would still want
> that to happen. And your comment above seems to indicate a goal of getting
> consistent data after a crash, with less concern that it be the most recent
> data written. Sorry in advance if that's a misreading of "you just must not
> forget to resync them after a crash."
There would be no problem with fsync. Fsync writes the synced data to both
devices. So after a crash you can select any of the devices as a resync
master copy, and you get the data that you wrote before sync() or fsync().
Mikulas
end of thread, other threads:[~2008-04-11 2:55 UTC | newest]
Thread overview: 7+ messages
2008-04-07 23:43 Data corruption on software RAID Mikulas Patocka
2008-04-08 10:22 ` Helge Hafting
2008-04-08 11:14 ` Mikulas Patocka
2008-04-09 18:33 ` Bill Davidsen
2008-04-10 3:07 ` Mikulas Patocka
[not found] ` <47FE224D.2020309@tmr.com>
2008-04-11 2:55 ` Mikulas Patocka
2008-04-10 6:14 ` Mario 'BitKoenig' Holbe