linux-raid.vger.kernel.org archive mirror
* Data corruption on software raid.
@ 2007-03-18 13:16 Sander Smeenk
  2007-03-18 14:02 ` Justin Piszcz
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Sander Smeenk @ 2007-03-18 13:16 UTC (permalink / raw)
  To: linux-raid

Hello!  Long story. Get some coke.

I'm having an odd problem using software raid on two Western Digital
WD2500JD-00F (250 GB) disks connected to a Silicon Image Sil3112 PCI
SATA controller, running Linux 2.6.20 and mdadm 2.5.6.

When these disks are in a raid1 set, downloading data to the raid1 set
using scp or ftp causes some blocks of the data to end up corrupted on
disk. Only the downloaded data gets corrupted, not the data that was
already on the set. But when the data is first downloaded to another
disk and then moved locally to the raid1 set, the data stays just fine.

This alone is weird enough.

But I decided to dig deeper: I stopped the raid1 set and mounted both
disks directly. Writing data to the disks directly works perfectly fine;
no corruption anymore. The data written to the disks earlier, while using
raid, is still corrupted, so the corruption really is on disk.

Then I ran 'mke2fs -c -c' (read/write badblock check) on both disks,
which reported zero errors on the disks themselves. I stored ~240 GB of
data on disk1 and verify-copied it to disk2. The contents stayed the same.

I also tried writing data to disk1 and disk2 simultaneously to 'emulate'
raid1 disk activity, but no corruption occurred. I even moved the SATA
PCI controller to a different slot to rule out IRQ problems. This made
no difference to the whole situation.

So for all I know the disks are fine and the controller is fine, so it
must be something in the software raid code, right?

Wrong. My system also runs a raid1 set on IDE disks. That set works
perfectly: no corruption when downloading data, no corruption when
moving data about, no problems at all...

My /proc/mdstat is one pool of happiness. It now reads:

| Personalities : [raid1] 
| md0 : active raid1 hda2[0] hdb1[1]
|       120060736 blocks [2/2] [UU]
|       
| unused devices: <none>

With the SATA set active it also has:

| md1 : active raid1 sdb1[0] sda1[1]
|       244198584 blocks [2/2] [UU]
(NOTE: sdb1 is listed first and sda1 second; this should not cause
problems, I've had this in other setups before.)

No problems are reported while rebuilding the md1 SATA set, although I
think the disk-to-disk speed is rather slow: ~17 MiB/sec as measured
from /proc/mdstat's output while rebuilding.

| md: data-check of RAID array md1
| md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
| md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) 
| md: using 128k window, over a total of 244195904 blocks.
| md: md1: data-check done.
| RAID1 conf printout:
|  --- wd:2 rd:2
|  disk 0, wo:0, o:1, dev:sdb1
|  disk 1, wo:0, o:1, dev:sda1

When /using/ the disks in the raid1 set, my dmesg did show signs of badness:

| ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
| ata2.00: (BMDMA2 stat 0xc0009)
| ata2.00: cmd c8/00:00:3f:43:47/00:00:00:00:00/e2 tag 0 cdb 0x0 data 131072 in
|          res 51/40:00:86:43:47/00:00:00:00:00/e2 Emask 0x9 (media error)
| ata2.00: configured for UDMA/100
| ata2: EH complete
| ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
| ata2.00: (BMDMA2 stat 0xc0009)
| ata2.00: cmd c8/00:00:3f:43:47/00:00:00:00:00/e2 tag 0 cdb 0x0 data 131072 in
|          res 51/40:00:86:43:47/00:00:00:00:00/e2 Emask 0x9 (media error)
| ata2.00: configured for UDMA/100
| ata2: EH complete
| ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
| ata2.00: (BMDMA2 stat 0xc0009)
| ata2.00: cmd c8/00:00:3f:43:47/00:00:00:00:00/e2 tag 0 cdb 0x0 data 131072 in
|          res 51/40:00:86:43:47/00:00:00:00:00/e2 Emask 0x9 (media error)
| ata2.00: configured for UDMA/100
| ata2: EH complete

But what amazes me is that no media errors can be detected by doing a
write/read check on every sector of the disk with mke2fs, and no data
corruption occurs when moving data to the set locally!

Can anyone shed some light on what I can try next to isolate what is
causing all this?  It's not the software raid code: the IDE set is
working fine.  It's not the SATA controller: the disks are okay when
used separately.  And it's not the disks themselves: they show no errors
with extensive testing.

Weird, eh?  Any comments appreciated!

Kind regards,
Sander.
-- 
| Just remember -- if the world didn't suck, we would all fall off.
| 1024D/08CEC94D - 34B3 3314 B146 E13C 70C8  9BDB D463 7E41 08CE C94D

* Data corruption on software RAID
@ 2008-04-07 23:43 Mikulas Patocka
  2008-04-08 10:22 ` Helge Hafting
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Mikulas Patocka @ 2008-04-07 23:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-raid, device-mapper development, agk, mingo, neilb

Hi

During source code review, I found an improbable but possible data 
corruption scenario on RAID-1 and on DM-RAID-1 (I'm not sure about RAID-4/5/6).

The RAID code was enhanced with bitmaps in 2.6.13.

The bitmap tracks regions on the device that may possibly be out of sync. 
The purpose of the bitmap is to avoid resynchronizing the whole array in 
the case of a crash. DM-raid uses a similar bitmap too.

The write sequence is usually:
1. turn on the bit in the bitmap (if it wasn't on before).
2. update the data.
3. when the writes to all devices finish, the bit may be turned off.

The developers assume that when all writes to the region finish, the 
region is in-sync.
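
Sketched as a user-space toy (purely illustrative -- this is not the
actual md or dm-raid code, and all names are made up), the bookkeeping
and the assumption behind step 3 look roughly like this:

#include <stdbool.h>
#include <stdio.h>

/* One region of the array, as tracked by the write-intent bitmap. */
struct region {
	bool bit_on_disk;    /* persistent bit in the bitmap          */
	int  writes_pending; /* in-flight writes touching this region */
};

/* 1. turn the bit on before the first write to the region */
static void region_write_start(struct region *r)
{
	if (!r->bit_on_disk) {
		r->bit_on_disk = true;
		printf("bitmap: region marked dirty on disk\n");
	}
	r->writes_pending++;
}

/* 3. when all writes to the region have finished, the bit may be
 *    cleared -- on the assumption that the mirrors now agree. */
static void region_write_end(struct region *r)
{
	if (--r->writes_pending == 0) {
		r->bit_on_disk = false;
		printf("bitmap: region marked clean on disk\n");
	}
}

int main(void)
{
	struct region r = { false, 0 };

	region_write_start(&r);
	/* 2. here the data itself is written to every mirror */
	region_write_end(&r);
	return 0;
}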

This assumption is wrong.

In many places, the kernel writes data while they may still be modified. 
For example, the pdflush daemon periodically writes pages and buffers 
without locking them. Similarly, pages may be written out while they are 
mapped for writing into processes.

Normally, there is no problem with modify-while-write. The write sequence 
is something like:
* turn off the Dirty bit
* write the buffer or page
--- and if the buffer or page is modified while it's being written, the 
Dirty bit is turned on again and the correct data are written later.

But with RAID (since 2.6.13), this can produce corruption, because when 
the buffer is modified while being written, different versions of the data 
can be written to the devices in the RAID array. For example:

1. pdflush turns off the dirty bit on an Ext2 bitmap buffer and starts 
writing the buffer to RAID-1.
2. the kernel allocates some blocks in that Ext2 bitmap. One of the RAID-1 
devices gets the new data, the other one gets the old data.
3. the kernel turns the buffer's dirty bit back on, so the buffer is 
scheduled for the next write.
4. the RAID-1 subsystem sees that both writes finished, thinks that this 
region is in-sync, turns off its dirty bit in its region bitmap and writes 
the bitmap to disk.
5. before pdflush writes the Ext2 bitmap buffer again, the system CRASHES.

6. after the new boot, RAID-1 sees the bit for this region off, so it 
doesn't resynchronize it.
7. during fsck, RAID-1 happens to read the Ext2 bitmap from the device 
where the bit is on. fsck sees that the bitmap is correct and doesn't 
touch it.
8. some time later the kernel reads the Ext2 bitmap from the other device. 
It sees the bit off, allocates some data there and creates cross-linked 
files.

The same corruption may happen with some journaled filesystems (probably 
not Ext3) or with applications that do their own crash recovery (databases, 
etc.). The key point is that an application expects that after a crash it 
reads either old data or new data; it doesn't expect that subsequent reads 
of the same place may alternately return old or new data --- which may 
happen on RAID-1.
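
The interleaving above is easy to replay in user space. The toy program
below (again purely illustrative; it models the two mirror legs as
in-memory buffers and has nothing to do with the real kernel code) walks
through the same steps and ends with the two legs holding different data
while the region bit on disk is clear:

#include <stdio.h>
#include <string.h>

struct mirror { char data[16]; };

int main(void)
{
	char page[16] = "old bitmap";   /* an Ext2 bitmap buffer in memory */
	struct mirror leg[2];
	int page_dirty = 1;
	int region_bit_on_disk = 0;

	/* Both legs start in sync with the page. */
	memcpy(leg[0].data, page, sizeof(page));
	memcpy(leg[1].data, page, sizeof(page));

	/* 1. pdflush turns off the page's dirty bit and starts the write;
	 *    the region bit in the write-intent bitmap is turned on first. */
	page_dirty = 0;
	region_bit_on_disk = 1;
	memcpy(leg[0].data, page, sizeof(page));      /* leg 0: old contents */

	/* 2. the page is modified while the write is still in flight,
	 * 3. and its dirty bit is turned on again. */
	strcpy(page, "new bitmap");
	page_dirty = 1;
	memcpy(leg[1].data, page, sizeof(page));      /* leg 1: new contents */

	/* 4. both writes have finished, so the region bit is cleared. */
	region_bit_on_disk = 0;

	/* 5. the system crashes before the dirty page is written again,
	 * 6. and after reboot the region bit is off, so no resync happens. */
	(void)page_dirty;

	/* 7./8. reads of the same block now depend on which leg serves them. */
	printf("region bit on disk  : %d\n", region_bit_on_disk);
	printf("one read returns    : \"%s\"\n", leg[0].data);
	printf("another read returns: \"%s\"\n", leg[1].data);
	return 0;
}

Running it prints two different answers for the same logical block, which
is exactly the old-or-new alternation described above.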


Possible ways to fix it:

1. lock the buffers and pages while they are being written --- this would 
cause performance degradation (the most severe degradation would be in the 
case where one process repeatedly calls sync() and another, unrelated 
process repeatedly writes to some file).

Locking the buffers and pages only for RAID would create many special 
cases and possible bugs.

2. never turn the region dirty bit off until the filesystem is unmounted 
--- this is the simplest fix. If the computer crashes after a long 
uptime, it resynchronizes the whole device. But it won't cause 
application-visible or filesystem-visible data corruption.

3. turn off the region bit only if the region wasn't written to in one 
pdflush period --- this requires interaction with pdflush and is rather 
complex. The problem here is that pdflush makes a best effort to write 
data within the dirty_writeback_centisecs interval, but it is not 
guaranteed to do so.

4. introduce more region states: a region has the in-memory states CLEAN, 
DIRTY, MAYBE_DIRTY and CLEAN_CANDIDATE.

When you start writing to the region, it is always moved to the DIRTY 
state (and the on-disk bit is turned on).

When you finish all writes to the region, move it to the MAYBE_DIRTY 
state, but leave the bit on disk on. At this point we don't know whether 
the region is dirty or not.

Run a helper thread that periodically does the following:
* change MAYBE_DIRTY regions to CLEAN_CANDIDATE
* issue sync()
* change CLEAN_CANDIDATE regions to the CLEAN state and clear their on-disk bit

The rationale is that if the above modify-while-write scenario happens, 
the page is still dirty. Thus, sync() will write the page, kick the 
region back from CLEAN_CANDIDATE to the MAYBE_DIRTY state, and we won't 
mark the region as clean on disk.
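
A user-space sketch of option 4 (again purely illustrative -- the state
names come from the proposal above, everything else is made up) could
look like the following. The point to notice is that a region whose page
was re-dirtied during the in-flight write gets kicked back to MAYBE_DIRTY
by the sync() pass, so its on-disk bit is never cleared prematurely:

#include <stdbool.h>
#include <stdio.h>

enum region_state { CLEAN, DIRTY, MAYBE_DIRTY, CLEAN_CANDIDATE };

struct region {
	enum region_state state;
	bool bit_on_disk;
};

/* First write to the region: always go to DIRTY and set the disk bit. */
static void region_write_start(struct region *r)
{
	r->state = DIRTY;
	r->bit_on_disk = true;
}

/* All writes finished: we do NOT know that the mirrors agree, so only
 * downgrade to MAYBE_DIRTY and leave the on-disk bit set. */
static void region_write_end(struct region *r)
{
	r->state = MAYBE_DIRTY;
}

/* One pass of the helper thread.  'page_was_redirtied' stands for the
 * effect of sync(): a page modified while it was being written is still
 * dirty, so syncing it writes the region again. */
static void helper_thread_pass(struct region *r, bool page_was_redirtied)
{
	if (r->state == MAYBE_DIRTY)
		r->state = CLEAN_CANDIDATE;

	if (page_was_redirtied && r->state == CLEAN_CANDIDATE) {
		/* sync() rewrites the page, which kicks the region back. */
		region_write_start(r);
		region_write_end(r);        /* back to MAYBE_DIRTY */
	}

	if (r->state == CLEAN_CANDIDATE) {
		r->state = CLEAN;
		r->bit_on_disk = false;     /* only now clear the disk bit */
	}
}

int main(void)
{
	struct region r = { CLEAN, false };

	region_write_start(&r);
	region_write_end(&r);

	helper_thread_pass(&r, true);   /* page was modified in flight  */
	printf("after pass 1: bit on disk = %d\n", r.bit_on_disk);

	helper_thread_pass(&r, false);  /* nothing re-dirtied this time */
	printf("after pass 2: bit on disk = %d\n", r.bit_on_disk);
	return 0;
}

Only on a later pass, when nothing was re-dirtied, does the region reach
CLEAN and the on-disk bit get cleared.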


I'd like to hear your ideas on this before we start coding a solution.

Mikulas



Thread overview: 17+ messages
2007-03-18 13:16 Data corruption on software raid Sander Smeenk
2007-03-18 14:02 ` Justin Piszcz
2007-03-18 16:50   ` Bill Davidsen
2007-03-18 17:38     ` Sander Smeenk
     [not found]       ` <45FD870C.3020403@tmr.com>
2007-03-18 22:00         ` Sander Smeenk
2007-03-18 15:17 ` Wolfgang Denk
2007-03-18 17:09 ` Bill Davidsen
2007-03-18 22:16   ` Neil Brown
2007-03-18 22:19 ` Neil Brown
  -- strict thread matches above, loose matches on Subject: below --
2008-04-07 23:43 Data corruption on software RAID Mikulas Patocka
2008-04-08 10:22 ` Helge Hafting
2008-04-08 11:14   ` Mikulas Patocka
2008-04-09 18:33 ` Bill Davidsen
2008-04-10  3:07   ` Mikulas Patocka
2008-04-10 14:21     ` Bill Davidsen
2008-04-11  2:55       ` Mikulas Patocka
2008-04-10  6:14 ` Mario 'BitKoenig' Holbe
