* Data corruption on software raid.
@ 2007-03-18 13:16 Sander Smeenk
2007-03-18 14:02 ` Justin Piszcz
` (3 more replies)
0 siblings, 4 replies; 17+ messages in thread
From: Sander Smeenk @ 2007-03-18 13:16 UTC (permalink / raw)
To: linux-raid
Hello! Long story. Get some coke.
I'm having an odd problem using software RAID on two Western Digital
WD2500JD-00F (250 GB) disks connected to a Silicon Image Sil3112 PCI SATA
controller, running Linux 2.6.20 and mdadm 2.5.6.
When these disks are in a raid1 set, downloading data to the set using scp
or ftp causes some blocks of that data to end up corrupted on disk. Only the
downloaded data gets corrupted, not the data that was already on the set.
But when the data is first downloaded to another disk and then moved to the
raid1 set locally, it stays just fine.
This alone is weird enough.
But I decided to dig deeper, stopped the raid1 set and mounted both disks
directly. Writing data to the disks directly works perfectly fine; no
corruption any more. The data written to the disks earlier, while they were
in the raid set, is still corrupted, so the corruption really is on disk.
Then I ran 'mke2fs -c -c' (a read/write badblock check) on both disks, which
reported zero errors on the disks themselves. I stored ~240gb of data on
disk1 and verify-copied it to disk2. The contents stayed the same.
I also tried writing data to disk1 and disk2 simultaneously to 'emulate'
raid1 disk activity, but no corruption occurred. I even moved the SATA PCI
controller to a different slot to rule out IRQ problems. None of this
changed anything.
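(Something along these lines emulates that kind of parallel write load. This
is only a minimal sketch, not exactly what I ran; the two target paths are
placeholders and whatever is on them gets overwritten.)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder targets: point these at scratch partitions or plain files. */
    const char *dst[2] = { "/dev/sda1", "/dev/sdb1" };
    int fd[2];
    char buf[1 << 16];
    ssize_t n;

    for (int i = 0; i < 2; i++) {
        fd[i] = open(dst[i], O_WRONLY);
        if (fd[i] < 0) { perror(dst[i]); return 1; }
    }

    /* Feed identical chunks to both targets, the way a RAID1 write would. */
    while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0)
        for (int i = 0; i < 2; i++)
            if (write(fd[i], buf, n) != n) { perror("write"); return 1; }

    for (int i = 0; i < 2; i++) {
        fsync(fd[i]);
        close(fd[i]);
    }
    return n < 0 ? 1 : 0;
}

Feed it a large file on stdin and compare the two targets with cmp afterwards.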
So as far as I can tell, the disks are fine and the controller is fine; it
must be something in the software raid code, right?
Wrong. My system is also running a raid1 set on IDE disks, and that set is
working perfectly normally. No corruption when downloading data, no
corruption when moving data about, no problems at all...
My /proc/mdstat is one pool of happiness. It now reads:
| Personalities : [raid1]
| md0 : active raid1 hda2[0] hdb1[1]
| 120060736 blocks [2/2] [UU]
|
| unused devices: <none>
With the SATA set active it also has:
| md1 : active raid1 sdb1[0] sda1[1]
| 244198584 blocks [2/2] [UU]
(NOTE: sdb1 is listed first and sda1 second. That should not cause problems;
I've had the same ordering in other setups before.)
No problems are reported while rebuilding the md1 SATA set, although I think
the disk-to-disk speed is rather slow: ~17MiB/sec as reported in
/proc/mdstat while rebuilding.
| md: data-check of RAID array md1
| md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
| md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec)
| md: using 128k window, over a total of 244195904 blocks.
| md: md1: data-check done.
| RAID1 conf printout:
| --- wd:2 rd:2
| disk 0, wo:0, o:1, dev:sdb1
| disk 1, wo:0, o:1, dev:sda1
When /using/ the disks in the raid1 set, my dmesg did show signs of badness:
| ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
| ata2.00: (BMDMA2 stat 0xc0009)
| ata2.00: cmd c8/00:00:3f:43:47/00:00:00:00:00/e2 tag 0 cdb 0x0 data 131072 in
| res 51/40:00:86:43:47/00:00:00:00:00/e2 Emask 0x9 (media error)
| ata2.00: configured for UDMA/100
| ata2: EH complete
| ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
| ata2.00: (BMDMA2 stat 0xc0009)
| ata2.00: cmd c8/00:00:3f:43:47/00:00:00:00:00/e2 tag 0 cdb 0x0 data 131072 in
| res 51/40:00:86:43:47/00:00:00:00:00/e2 Emask 0x9 (media error)
| ata2.00: configured for UDMA/100
| ata2: EH complete
| ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
| ata2.00: (BMDMA2 stat 0xc0009)
| ata2.00: cmd c8/00:00:3f:43:47/00:00:00:00:00/e2 tag 0 cdb 0x0 data 131072 in
| res 51/40:00:86:43:47/00:00:00:00:00/e2 Emask 0x9 (media error)
| ata2.00: configured for UDMA/100
| ata2: EH complete
But what amazes me is that no media errors can be detected by doing a
write/read check on every sector of the disk with mke2fs, and no data
corruption occurs when moving data to the set locally!
Can anyone shed some light on what I can try next to isolate what is causing
all this? It's not the software raid code: the IDE set is working fine. It's
not the SATA controller: the disks are okay when used separately. It's not
the disks themselves: they show no errors under extensive testing.
Weird, eh? Any comments appreciated!
Kind regards,
Sander.
--
| Just remember -- if the world didn't suck, we would all fall off.
| 1024D/08CEC94D - 34B3 3314 B146 E13C 70C8 9BDB D463 7E41 08CE C94D
* Re: Data corruption on software raid.
2007-03-18 13:16 Data corruption on software raid Sander Smeenk
@ 2007-03-18 14:02 ` Justin Piszcz
2007-03-18 16:50 ` Bill Davidsen
2007-03-18 15:17 ` Wolfgang Denk
` (2 subsequent siblings)
3 siblings, 1 reply; 17+ messages in thread
From: Justin Piszcz @ 2007-03-18 14:02 UTC (permalink / raw)
To: Sander Smeenk; +Cc: linux-raid
On Sun, 18 Mar 2007, Sander Smeenk wrote:
> Hello! Long story. Get some coke.
>
> I'm having an odd problem with using software raid on two Western
> Digital disks type WD2500JD-00F (250gb) connected to a Silicon Image
> Sil3112 PCI SATA conroller running with Linux 2.6.20, mdadm 2.5.6
[[ .. snip .. ]]
See comments below.
| Personalities : [raid1]
| md0 : active raid1 hda2[0] hdb1[1]
| 120060736 blocks [2/2] [UU]
My main question:
Why the 'heck' are you running a RAID1 with a master/slave combination?
That is probably the -worst- way to run it. When using any form of RAID,
make sure that no RAID member shares an IDE channel with another drive.
Advice? Hook up each drive as a master; then your rebuild speed should go
to 30-60MB/s.
Traditional troubleshooting:
What do
  fdisk -l /dev/hda
  fdisk -l /dev/hdb
report?
Also, what do
  smartctl -a /dev/hda
  smartctl -a /dev/hdb
show?
Then run
  smartctl -t short /dev/hda
  smartctl -t short /dev/hdb
wait 5-10 minutes, and re-run the -a commands above.
Then run
  smartctl -t long /dev/hda
  smartctl -t long /dev/hdb
and re-run the smartctl -a commands once more.
--------
Those errors look really weird. I would separate the two disks, each on its
own IDE channel, and see if your problem goes away.
Justin.
* Re: Data corruption on software raid.
2007-03-18 14:02 ` Justin Piszcz
@ 2007-03-18 16:50 ` Bill Davidsen
2007-03-18 17:38 ` Sander Smeenk
0 siblings, 1 reply; 17+ messages in thread
From: Bill Davidsen @ 2007-03-18 16:50 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Sander Smeenk, linux-raid
Justin Piszcz wrote:
>
>
> On Sun, 18 Mar 2007, Sander Smeenk wrote:
>
>> Hello! Long story. Get some coke.
>>
>> I'm having an odd problem with using software raid on two Western
>> Digital disks type WD2500JD-00F (250gb) connected to a Silicon Image
>> Sil3112 PCI SATA conroller running with Linux 2.6.20, mdadm 2.5.6
>
> [[ .. snip .. ]]
>
> See comments below.
>
> | Personalities : [raid1]
> | md0 : active raid1 hda2[0] hdb1[1]
> | 120060736 blocks [2/2] [UU]
Your comments below are about the PATA RAID1, not the array that is giving
trouble, the SATA one. Unless you think making the working array faster will
somehow help, he's presumably looking for ideas on fixing the array that
isn't working as expected.
>
> My main question:
>
> Why the 'heck' are you running a RAID1 with a master/slave
> combination? That is probably the -worst- way to run it. When using
> any form of RAID, make sure you do not share an IDE channel for any
> one raid device.
>
> Advice? Hook up each drive as a master, then your rebuild speed
> should go to 30-60MB/s.
>
> Traditional Troubleshooting:
>
> What does fdisk -l /dev/hda
> fdisk -l /dev/hdb
>
> Report?
>
> Also,
>
> What does:
>
> smartctl -a /dev/hda
> smartctl -a /dev/hdb
>
> show?
>
> Then,
>
> smartctl -t short /dev/hda
> smartctl -t short /dev/hdb
>
> Wait 5-10 minutes, re-run the commands (-a) above.
>
> Then,
>
> smartctl -t long /dev/hda
> smartctl -t long /dev/hdb
>
> Then re-run the (-a) smartctl above.
>
> --------
>
> Those errors look really weird, I would separate the two disks, each
> on their own IDE channel and see if your problem goes away.
>
> Justin.
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
* Re: Data corruption on software raid.
2007-03-18 16:50 ` Bill Davidsen
@ 2007-03-18 17:38 ` Sander Smeenk
[not found] ` <45FD870C.3020403@tmr.com>
0 siblings, 1 reply; 17+ messages in thread
From: Sander Smeenk @ 2007-03-18 17:38 UTC (permalink / raw)
To: linux-raid
Quoting Bill Davidsen (davidsen@tmr.com):
> >>I'm having an odd problem with using software raid on two Western
> >>Digital disks type WD2500JD-00F (250gb) connected to a Silicon Image
> >>Sil3112 PCI SATA conroller running with Linux 2.6.20, mdadm 2.5.6
> Your comments below are thoughts on the PATA RAID1, not the one which is
> giving trouble, the SATA.
Although your comment here is right (I was indeed asking about the SATA
disks), Justin did make me realise I had forgotten to check the SMART status
with smartctl.
It turns out sdb is a bit dodgy. It apparently had a read error earlier, and
it also failed during the 'long' SMART test just now.
But this still amazes me, as I ran 'mke2fs -c -c' on the disk AND copied
about 95% of the disk's capacity to it with no corruption or errors at all.
I don't know if I completely understood your comments about user buffers.
I can hardly believe the buffer would change its contents before actually
being written to disk? Or did I misunderstand?
Regards,
Sander.
--
| When she saw her first strands of gray hair, she thought she'd dye.
| 1024D/08CEC94D - 34B3 3314 B146 E13C 70C8 9BDB D463 7E41 08CE C94D
* Re: Data corruption on software raid.
2007-03-18 13:16 Data corruption on software raid Sander Smeenk
2007-03-18 14:02 ` Justin Piszcz
@ 2007-03-18 15:17 ` Wolfgang Denk
2007-03-18 17:09 ` Bill Davidsen
2007-03-18 22:19 ` Neil Brown
3 siblings, 0 replies; 17+ messages in thread
From: Wolfgang Denk @ 2007-03-18 15:17 UTC (permalink / raw)
To: Sander Smeenk; +Cc: linux-raid
In message <20070318131606.GJ6063@freshdot.net> you wrote:
>
> But what amazes me is that no media errors can be detected by doing a
> write/read check on every sector of the disk with mke2fs, and no data
> corruption occurs when moving data to the set locally!
>
> Can anyone shed some light on what i can try next to isolate what is
> causing all this? It's not the software raid code, the IDE set is
If it happens only with downloaded data, you may see data corruption
in the NIC hardware and/or driver. Try using another network card
(other vendor, other type).
Another possible culprit is memory - you may see memory errors under
certain usage patterns. Make sure to run a memory test, and/or try
changing RAM.
Best regards,
Wolfgang Denk
--
DENX Software Engineering GmbH, HRB 165235 Munich, CEO: Wolfgang Denk
Office: Kirchenstr. 5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
Time is an illusion perpetrated by the manufacturers of space.
* Re: Data corruption on software raid.
2007-03-18 13:16 Data corruption on software raid Sander Smeenk
2007-03-18 14:02 ` Justin Piszcz
2007-03-18 15:17 ` Wolfgang Denk
@ 2007-03-18 17:09 ` Bill Davidsen
2007-03-18 22:16 ` Neil Brown
2007-03-18 22:19 ` Neil Brown
3 siblings, 1 reply; 17+ messages in thread
From: Bill Davidsen @ 2007-03-18 17:09 UTC (permalink / raw)
To: linux-raid; +Cc: ssmeenk, Neil Brown
Sander Smeenk wrote:
> Hello! Long story. Get some coke.
>
> I'm having an odd problem with using software raid on two Western
> Digital disks type WD2500JD-00F (250gb) connected to a Silicon Image
> Sil3112 PCI SATA conroller running with Linux 2.6.20, mdadm 2.5.6
>
> When these disks are in a raid1 set, downloading data to the raid1 set
> using scp or ftp causes some blocks of the data to corrupt on disk. Only
> the data downloaded gets corrupted, not the data that already was on the
> set. But when the data is first downloaded to another disk and locally
> moved to the raid1 set, the data stays just fine.
>
This may be due to a characteristic of RAID1, which I believe Neil described
when discussing "check" failures when using RAID1 for swap. In some cases
the data is written from a user buffer that is still changing, and the RAID
software does two writes, one to each device, so if the buffer changes
between (or during) those writes the two devices end up with different data.
More on this at the end.
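Here is a user-space sketch of that effect (it is not the md code, and the
deliberate data race on the buffer is just a stand-in for a page being
redirtied while it is under I/O): one thread keeps scribbling on a buffer
while the main thread writes that same buffer to two "mirror" files in turn,
then reads both back and compares them. The file names are made up.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUFSZ 4096

static char buf[BUFSZ];
static volatile int done;

/* Plays the application: keeps modifying the buffer that is "under I/O". */
static void *scribbler(void *arg)
{
    (void)arg;
    while (!done)
        memset(buf, rand() & 0xff, BUFSZ);
    return NULL;
}

/* Plays one mirror: snapshots whatever happens to be in the buffer now. */
static void mirror_write(const char *path)
{
    FILE *f = fopen(path, "w");
    if (!f || fwrite(buf, 1, BUFSZ, f) != BUFSZ) { perror(path); exit(1); }
    fclose(f);
}

int main(void)
{
    pthread_t t;
    char a[BUFSZ], b[BUFSZ];
    FILE *f0, *f1;

    pthread_create(&t, NULL, scribbler, NULL);
    mirror_write("mirror0.img");   /* first copy of the "page" */
    mirror_write("mirror1.img");   /* second copy; buffer may have changed */
    done = 1;
    pthread_join(t, NULL);

    f0 = fopen("mirror0.img", "r");
    f1 = fopen("mirror1.img", "r");
    if (!f0 || !f1 || fread(a, 1, BUFSZ, f0) != BUFSZ
                   || fread(b, 1, BUFSZ, f1) != BUFSZ) {
        perror("read back");
        exit(1);
    }
    printf("mirrors %s\n", memcmp(a, b, BUFSZ) ? "DIFFER" : "match");
    return 0;
}

Build with -pthread; on most runs the two files differ, which is the
RAID1-for-swap situation in miniature.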
So when you copy from a file already on disk, the data is NOT changing,
and no problem occurs. I assume that you have tried doing slow downloads
to the md0 PATA device, and that this problem doesn't occur there. I
have ideas why that would be, but I don't want to speculate.
Do you have some non-RAID partitions on one of those drives, such that
the seek time might be markedly different on one or the other due to
activity in that partition? That would increase the possible time
between writes and therefore the possibility of differences in what's
written.
> This alone is weird enough.
>
> But i decided to dig deeper and switched off the raid1 set, mounted both
> disks directly. Writing data to the disks directly works perfectly fine.
> No corruption anymore. The data written to the disks before using raid
> is still corrupted, so the corruption is really on disk.
>
> Then i decided to 'mke2fs -c -c' (read/write badblock check) both disks
> which returned null errors on the disks themselves. I stored ~240gb data
> on disk1 and verify-copied it to disk2. The contents stay the same.
>
> I also tried simultaneously writing data to disk1 and disk2 to 'emulate'
> raid1 disk activity, but no corruption occurred. I even moved the SATA
> PCI controller to a different slot to isolate IRQ problems. This made
> no change to the whole situation.
>
> So for all i know, the disks are fine, the controller is fine, it must
> be something in the software raid code, right?
>
> Wrong. My system is also running a raid1 set on IDE disks. This set is
> working just perfectly normal. No corruption when downloading data, no
> corruption when moving data about, no problems at all...
>
> My /proc/mdstat is one pool op happiness. It now reads:
>
> | Personalities : [raid1]
> | md0 : active raid1 hda2[0] hdb1[1]
> | 120060736 blocks [2/2] [UU]
> |
> | unused devices: <none>
>
> With the SATA set active it also has:
>
> | md1 : active raid1 sdb1[0] sda1[1]
> | 244198584 blocks [2/2] [UU]
> (NOTE: sdb1 is first, sda1 is second, this should not cause problems,
> i've had this in other setups before?)
>
> No problems are reported while rebuilding the md1 SATA set, although i
> think the disk-to-disk speed is rather slow with ~17MiB/sec measured by
> /proc/mdstat's output while rebuilding.
>
> | md: data-check of RAID array md1
> | md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> | md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec)
> | md: using 128k window, over a total of 244195904 blocks.
> | md: md1: data-check done.
> | RAID1 conf printout:
> | --- wd:2 rd:2
> | disk 0, wo:0, o:1, dev:sdb1
> | disk 1, wo:0, o:1, dev:sda1
>
> When /using/ the disks in raid1 set, my dmesg did show signs of badness:
>
> | ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> | ata2.00: (BMDMA2 stat 0xc0009)
> | ata2.00: cmd c8/00:00:3f:43:47/00:00:00:00:00/e2 tag 0 cdb 0x0 data 131072 in
> | res 51/40:00:86:43:47/00:00:00:00:00/e2 Emask 0x9 (media error)
> | ata2.00: configured for UDMA/100
> | ata2: EH complete
> | ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> | ata2.00: (BMDMA2 stat 0xc0009)
> | ata2.00: cmd c8/00:00:3f:43:47/00:00:00:00:00/e2 tag 0 cdb 0x0 data 131072 in
> | res 51/40:00:86:43:47/00:00:00:00:00/e2 Emask 0x9 (media error)
> | ata2.00: configured for UDMA/100
> | ata2: EH complete
> | ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> | ata2.00: (BMDMA2 stat 0xc0009)
> | ata2.00: cmd c8/00:00:3f:43:47/00:00:00:00:00/e2 tag 0 cdb 0x0 data 131072 in
> | res 51/40:00:86:43:47/00:00:00:00:00/e2 Emask 0x9 (media error)
> | ata2.00: configured for UDMA/100
> | ata2: EH complete
>
> But what amazes me is that no media errors can be detected by doing a
> write/read check on every sector of the disk with mke2fs, and no data
> corruption occurs when moving data to the set locally!
>
> Can anyone shed some light on what i can try next to isolate what is
> causing all this? It's not the software raid code, the IDE set is
> working fine. It's not the SATA controller, the disks are okay when
> used separately. It's not the disks themselves, they show no errors
> with extensive testing.
>
> Weird 'eh? Any comments appreciated!
I do have a thought which MIGHT address this issue in a general way; perhaps
Neil will share his opinion. When writing to any array with multiple copies
which are written from user buffers, perhaps the code could mark the page(s)
copy-on-write. Then if the program tried to modify the data it could do so
safely. When the write to all drives was complete, the COW could be cleared,
and if the page had not been modified very little overhead would be
generated. If the page had been modified, the original would no longer be
mapped into any process and could be released.
Neil, what think you? This would be a general solution to the mismatched
multiple copies issue, assuming that it could be done at all.
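A user-space sketch of the detection half of that idea, under the assumption
that write-protecting the page while its I/O is in flight is acceptable: the
page is made read-only when the "write" starts, and an attempt to modify it
traps into a handler. A real implementation would remap a private copy
instead of simply unprotecting, and would live in the kernel rather than in
a SIGSEGV handler; mprotect() is also not formally async-signal-safe, so
treat this strictly as demo code.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *page;
static long pagesize;
static volatile sig_atomic_t touched_during_io;

static void on_fault(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    if ((char *)si->si_addr >= page && (char *)si->si_addr < page + pagesize) {
        touched_during_io = 1;              /* here a real COW would copy/remap */
        mprotect(page, pagesize, PROT_READ | PROT_WRITE);
    } else {
        _exit(1);                           /* unrelated fault: just bail out */
    }
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    pagesize = sysconf(_SC_PAGESIZE);
    page = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return 1;
    memset(page, 'A', pagesize);

    mprotect(page, pagesize, PROT_READ);    /* "write submitted": freeze page */
    page[0] = 'B';                          /* application redirties the page */
    /* "write completed" */

    printf("page modified while under I/O: %s\n",
           touched_during_io ? "yes" : "no");
    return 0;
}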
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
* Re: Data corruption on software raid.
2007-03-18 17:09 ` Bill Davidsen
@ 2007-03-18 22:16 ` Neil Brown
0 siblings, 0 replies; 17+ messages in thread
From: Neil Brown @ 2007-03-18 22:16 UTC (permalink / raw)
To: Bill Davidsen; +Cc: linux-raid, ssmeenk
On Sunday March 18, davidsen@tmr.com wrote:
>
> This may be due to a characteristic of RAID1, which I believe Neil
> described when discussing "check" failures in using RAID1 for swap. In
> some cases, the data is being written from a user buffer, which is
> changing. and the RAID software does two write, one to each device,
> resulting in the data in the buffer changing as the write occurs. More
> on this at the end.
While you can get different data written to different devices in a
RAID1, this difference is NEVER visible above the filesystem.
It is most likely to happen with swap and, if it does, the swap
system will never try to read that data back.
It can conceivably happen with regular filesystems, but only if a file
is being changed immediately before being truncated. Some of the blocks
that were in the file might end up different on each device. But
the data in those blocks will never be read (because the file has been
truncated).
So this could not explain the current problem.
>
> I do have a thought which MIGHT address this issue in a general way,
> perhaps Neil will share he opinion. When writing to any array with
> multiple copies which are written from user buffers, perhaps the code
> could set the page(s) as copy on write. Then if the program tried to
> modify the data it could be done safely. When the write to all drives
> was complete, the COW could be cleared, and if the page had not been
> modified very little overhead would be generated. If the page had been
> modified, then the original would no longer be mapped to a process and
> could be released.
>
> Neil, what think you? This would be e general solution to the mismatched
> multiple copies issue, assuming that it could be done at all.
Copy-on-write is not as easy as it sounds. Trying to trigger COW from
the md driver would be incredibly messy and wouldn't solve any serious
problem.
NeilBrown
* Re: Data corruption on software raid.
2007-03-18 13:16 Data corruption on software raid Sander Smeenk
` (2 preceding siblings ...)
2007-03-18 17:09 ` Bill Davidsen
@ 2007-03-18 22:19 ` Neil Brown
3 siblings, 0 replies; 17+ messages in thread
From: Neil Brown @ 2007-03-18 22:19 UTC (permalink / raw)
To: Sander Smeenk; +Cc: linux-raid
On Sunday March 18, ssmeenk@freshdot.net wrote:
> Hello! Long story. Get some coke.
And a painful story!
See also http://bugzilla.kernel.org/show_bug.cgi?id=8180
It also involves a Silicon Image PCI/SATA controller, though a
different model.
But then I have an SI PCI SATA controller that has never missed a beat
(except when one of the SATA cables wasn't quite plugged in
properly....)
I'd say it is definitely a hardware problem, though identifying which
piece of hardware can be tricky.
NeilBrown
* Data corruption on software RAID
@ 2008-04-07 23:43 Mikulas Patocka
2008-04-08 10:22 ` Helge Hafting
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Mikulas Patocka @ 2008-04-07 23:43 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-raid, device-mapper development, agk, mingo, neilb
Hi
During source code review, I found an improbable but possible data
corruption on RAID-1 and on DM-RAID-1. (I'm not sure about RAID-4/5/6.)
The RAID code was enhanced with bitmaps in 2.6.13.
The bitmap tracks regions of the device that may be out of sync.
The purpose of the bitmap is to avoid resynchronizing the whole array in
the case of a crash. DM-raid uses a similar bitmap too.
The write sequence is usually (sketched in code below):
1. turn on the bit in the bitmap (if it wasn't already on).
2. update the data.
3. when the writes to all devices finish, the bit may be turned off.
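Roughly, in code (this is only a model of the protocol; the names are
invented and it is not the actual md or dm-raid source):

#include <stdio.h>

struct region {
    int on_disk_bit;        /* the bit persisted in the on-disk bitmap */
    int writes_in_flight;   /* outstanding mirror writes for this region */
};

static void bitmap_flush(void) { /* stands in for writing the bitmap out */ }

static void region_write_start(struct region *r, int nr_mirrors)
{
    if (!r->on_disk_bit) {
        r->on_disk_bit = 1;             /* 1. mark the region possibly dirty */
        bitmap_flush();
    }
    r->writes_in_flight += nr_mirrors;  /* 2. the data writes are issued */
}

static void region_write_done(struct region *r)   /* called once per mirror */
{
    if (--r->writes_in_flight == 0) {
        r->on_disk_bit = 0;             /* 3. assumed in-sync: the flawed step,
                                           nothing checks whether the buffer
                                           was redirtied meanwhile */
        bitmap_flush();
    }
}

int main(void)
{
    struct region r = { 0, 0 };
    region_write_start(&r, 2);          /* one write, mirrored to two devices */
    region_write_done(&r);
    region_write_done(&r);
    printf("on-disk bit after completion: %d\n", r.on_disk_bit);
    return 0;
}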
The developers assume that when all writes to the region finish, the
region is in-sync.
This assumption is wrong.
In many places the kernel writes data that may still be modified while the
write is in flight. For example, the pdflush daemon periodically writes
pages and buffers without locking them. Similarly, pages may be written out
while they are mapped writable into processes.
Normally, there is no problem with modify-while-write. The write sequence
is something like:
* turn off Dirty bit
* write the buffer or page
--- and if the buffer or page is modified while it's being written, the
Dirty bit is turned on again and the correct data are written later.
But with RAID (since 2.6.13), it can produce corruption because when the
buffer is modified while being written, different versions of data can be
written to devices in the RAID array. For example:
1. pdflush turns off the dirty bit on an Ext2 bitmap buffer and starts
writing the buffer to RAID-1.
2. the kernel allocates some blocks in that Ext2 bitmap. One of the RAID-1
devices gets the new data, the other one gets the old data.
3. the kernel turns the buffer's dirty bit back on, so the buffer is
scheduled for the next write.
4. the RAID-1 subsystem sees that both writes finished, concludes that this
region is in-sync, turns off its dirty bit in its region bitmap and writes
the bitmap to disk.
5. before pdflush writes the Ext2 bitmap buffer again, the system CRASHES.
6. after the next boot, RAID-1 sees the bit for this region off, so it
doesn't resynchronize it.
7. during fsck, RAID-1 happens to read the Ext2 bitmap from the device where
the bit is on. fsck sees that the bitmap is correct and doesn't touch it.
8. some time later the kernel reads the Ext2 bitmap from the other device.
It sees the bit off, allocates some data there and creates cross-linked
files.
The same corruption may happen with some journaled filesystems (probably
not Ext3) or applications that do their own crash recovery (databases,
etc.). The key point is that an application expects that after a crash it
reads either old data or new data; it does not expect that subsequent reads
of the same place may alternately return old or new data --- which can
happen on RAID-1.
Possible ways to fix it:
1. lock the buffers and pages while they are being written --- this would
cause performance degradation (the most severe degradation would be in the
case where one process repeatedly calls sync() and another, unrelated
process repeatedly writes to some file).
Locking the buffers and pages only for RAID would create many special
cases and possible bugs.
2. never turn the region dirty bit off until the filesystem is unmounted
--- this is the simplest fix. If the computer crashes after a long uptime,
it resynchronizes the whole device, but it won't cause
application-visible or filesystem-visible data corruption.
3. turn off the region bit if the region wasn't written to within one
pdflush period --- requires interaction with pdflush, rather complex. The
problem here is that pdflush makes its best effort to write data within the
dirty_writeback_centisecs interval, but it is not guaranteed to do so.
4. make more region states: a region has in-memory states CLEAN, DIRTY,
MAYBE_DIRTY and CLEAN_CANDIDATE.
When you start writing to the region, it is always moved to the DIRTY state
(and the on-disk bit is turned on).
When you finish all writes to the region, move it to the MAYBE_DIRTY state,
but leave the on-disk bit on. At this point we don't know whether the region
is dirty or not.
Run a helper thread that periodically does:
Change MAYBE_DIRTY regions to CLEAN_CANDIDATE
Issue sync()
Change CLEAN_CANDIDATE regions to the CLEAN state and clear their on-disk bit.
The rationale is that if the above modify-while-write scenario happens, the
page is always dirty. Thus, sync() will write the page, kick the region back
from CLEAN_CANDIDATE to MAYBE_DIRTY, and we won't mark the region as clean
on disk. (A rough user-space model of this state machine follows below.)
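Here is a user-space model of that state machine, just to make the idea
concrete. The names are invented; a real implementation would live in the
md/dm bitmap code with its own locking and deferred work, not in a pthread,
and in the real thing it is the redirtied page being rewritten that kicks
the region back to DIRTY via region_write_start(), which a plain sync() in
this toy cannot do.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

enum region_state { CLEAN, DIRTY, MAYBE_DIRTY, CLEAN_CANDIDATE };

#define NREGIONS 16

static enum region_state state[NREGIONS];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* A write hits the region: the on-disk bit goes (or stays) on. */
static void region_write_start(int r)
{
    pthread_mutex_lock(&lock);
    state[r] = DIRTY;
    pthread_mutex_unlock(&lock);
}

/* Last outstanding write to the region finished: we no longer know whether
 * the copies match, so only demote to MAYBE_DIRTY and leave the on-disk bit
 * set. */
static void region_write_done(int r)
{
    pthread_mutex_lock(&lock);
    state[r] = MAYBE_DIRTY;
    pthread_mutex_unlock(&lock);
}

/* Helper thread: a region becomes CLEAN only if it survives a sync()
 * without being written again.  A page modified under I/O is still dirty,
 * so sync() rewrites it, region_write_start() runs again and the region
 * never makes it from CLEAN_CANDIDATE to CLEAN. */
static void *cleaner(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        for (int r = 0; r < NREGIONS; r++)
            if (state[r] == MAYBE_DIRTY)
                state[r] = CLEAN_CANDIDATE;
        pthread_mutex_unlock(&lock);

        sync();

        pthread_mutex_lock(&lock);
        for (int r = 0; r < NREGIONS; r++)
            if (state[r] == CLEAN_CANDIDATE)
                state[r] = CLEAN;       /* now safe to clear the on-disk bit */
        pthread_mutex_unlock(&lock);
        sleep(5);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, cleaner, NULL);
    region_write_start(3);
    region_write_done(3);
    sleep(12);                          /* give the cleaner two passes */
    printf("region 3 state: %d (0 == CLEAN)\n", state[3]);
    return 0;
}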
I'd like to know your ideas on this, before we start coding a solution.
Mikulas
* Re: Data corruption on software RAID
2008-04-07 23:43 Data corruption on software RAID Mikulas Patocka
@ 2008-04-08 10:22 ` Helge Hafting
2008-04-08 11:14 ` Mikulas Patocka
2008-04-09 18:33 ` Bill Davidsen
2008-04-10 6:14 ` Mario 'BitKoenig' Holbe
2 siblings, 1 reply; 17+ messages in thread
From: Helge Hafting @ 2008-04-08 10:22 UTC (permalink / raw)
To: Mikulas Patocka
Cc: linux-kernel, linux-raid, device-mapper development, agk, mingo,
neilb
Mikulas Patocka wrote:
> Hi
>
> During source code review, I found an unprobable but possible data
> corruption on RAID-1 and on DM-RAID-1. (I'm not sure about RAID-4,5,6).
>
> The RAID code was enhanced with bitmaps in 2.6.13.
>
> The bitmap tracks regions on the device that may be possibly out-of-sync.
> The purpose of the bitmap is to avoid resynchronizing the whole array in
> the case of crash. DM-raid uses similar bitmap too.
>
> The write sequnce is usually:
> 1. turn on bit in the bitmap (if it hasn't been on before).
> 2. update the data.
> 3. when writes to all devices finish, turn the bit may be turned off.
>
> The developers assume that when all writes to the region finish, the
> region is in-sync.
>
> This assumption is wrong.
>
> Kernel writes data while they may be modified in many places. For example,
> the pdflush daemon writes periodically pages and buffers without locking
> them. Similarly, pages may be written while they are mapped for write to
> the processes.
>
> Normally, there is no problem with modify-while-write. The write sequence
> is something like:
> * turn off Dirty bit
> * write the buffer or page
> --- and if the buffer or page is modified while it's being written, the
> Dirty bit is turned on again and the correct data are written later.
>
> But with RAID (since 2.6.13), it can produce corruption because when the
> buffer is modified while being written, different versions of data can be
> written to devices in the RAID array. For example:
>
> 1. pdflush turns off a dirty bit on Ext2 bitmap buffer and starts writing
> the buffer to RAID-1
> 2. the kernel allocates some blocks in that Ext2 bitmap. One of RAID-1
> devices writes new data, the other one gets old data.
> 3. The kernel turns on the buffer dirty bit, so this buffer is scheduled
> for next write.
> 4. RAID-1 subsystem sees that both writes finished, it thinks that this
> region is in-sync, turns off its dirty bit in its region bitmap and writes
> the bitmap to disk.
>
Would this help:
RAID-1 sees that both writes finished. It checks the dirty bits on all
relevant buffers/pages. If none got re-dirtied, then it is ok to
turn off the dirty bit in the region bitmap and write that. Otherwise,
it is not!
Or is such a check too time-consuming?
Helge Hafting
* Re: Data corruption on software RAID
2008-04-08 10:22 ` Helge Hafting
@ 2008-04-08 11:14 ` Mikulas Patocka
0 siblings, 0 replies; 17+ messages in thread
From: Mikulas Patocka @ 2008-04-08 11:14 UTC (permalink / raw)
To: Helge Hafting
Cc: linux-kernel, linux-raid, device-mapper development, agk, mingo,
neilb
> > But with RAID (since 2.6.13), it can produce corruption because when the
> > buffer is modified while being written, different versions of data can be
> > written to devices in the RAID array. For example:
> >
> > 1. pdflush turns off a dirty bit on Ext2 bitmap buffer and starts writing
> > the buffer to RAID-1
> > 2. the kernel allocates some blocks in that Ext2 bitmap. One of RAID-1
> > devices writes new data, the other one gets old data.
> > 3. The kernel turns on the buffer dirty bit, so this buffer is scheduled for
> > next write.
> > 4. RAID-1 subsystem sees that both writes finished, it thinks that this
> > region is in-sync, turns off its dirty bit in its region bitmap and writes
> > the bitmap to disk.
> >
> Would this help:
> RAID-1 sees that both writes finished. It checks the dirty bits on all
> relevant buffers/pages. If none got re-dirtied, then it is ok to
> turn off the dirty bit in the region bitmap and write that. Otherwise, it is
> not!
>
> Or is such a check too time-consuming?
That is impossible. The page cache can answer questions like "where is
page 0x1234 from inode 0x5678 located on disk?" But it can't answer the
reverse question: "which inode and which page is using disk block
0x12345678?"
Furthermore, with device mapper you can stack several mapping tables on top
of each other --- and again, device mapper can't solve the reverse problem;
it can't tell you which filesystem is using block X.
Mikulas
> Helge Hafting
* Re: Data corruption on software RAID
2008-04-07 23:43 Data corruption on software RAID Mikulas Patocka
2008-04-08 10:22 ` Helge Hafting
@ 2008-04-09 18:33 ` Bill Davidsen
2008-04-10 3:07 ` Mikulas Patocka
2008-04-10 6:14 ` Mario 'BitKoenig' Holbe
2 siblings, 1 reply; 17+ messages in thread
From: Bill Davidsen @ 2008-04-09 18:33 UTC (permalink / raw)
To: Mikulas Patocka
Cc: linux-kernel, linux-raid, device-mapper development, mingo, agk
Mikulas Patocka wrote:
> Hi
>
> During source code review, I found an unprobable but possible data
> corruption on RAID-1 and on DM-RAID-1. (I'm not sure about RAID-4,5,6).
>
> The RAID code was enhanced with bitmaps in 2.6.13.
>
> The bitmap tracks regions on the device that may be possibly out-of-sync.
> The purpose of the bitmap is to avoid resynchronizing the whole array in
> the case of crash. DM-raid uses similar bitmap too.
>
> The write sequnce is usually:
> 1. turn on bit in the bitmap (if it hasn't been on before).
> 2. update the data.
> 3. when writes to all devices finish, turn the bit may be turned off.
>
> The developers assume that when all writes to the region finish, the
> region is in-sync.
>
> This assumption is wrong.
>
> Kernel writes data while they may be modified in many places. For example,
> the pdflush daemon writes periodically pages and buffers without locking
> them. Similarly, pages may be written while they are mapped for write to
> the processes.
>
> Normally, there is no problem with modify-while-write. The write sequence
> is something like:
> * turn off Dirty bit
> * write the buffer or page
> --- and if the buffer or page is modified while it's being written, the
> Dirty bit is turned on again and the correct data are written later.
>
> But with RAID (since 2.6.13), it can produce corruption because when the
> buffer is modified while being written, different versions of data can be
> written to devices in the RAID array. For example:
>
> 1. pdflush turns off a dirty bit on Ext2 bitmap buffer and starts writing
> the buffer to RAID-1
> 2. the kernel allocates some blocks in that Ext2 bitmap. One of RAID-1
> devices writes new data, the other one gets old data.
> 3. The kernel turns on the buffer dirty bit, so this buffer is scheduled
> for next write.
> 4. RAID-1 subsystem sees that both writes finished, it thinks that this
> region is in-sync, turns off its dirty bit in its region bitmap and writes
> the bitmap to disk.
> 5. before pdflush writes the Ext2 bitmap buffer again, the system CRASHES
>
> 6. after new boot, RAID-1 sees the bit for this region off, so it doesn't
> resynchronize it.
> 7. during fsck, RAID-1 reads the Ext2 bitmap from the device where the bit
> is on. fsck sees that the bitmap is correct and doesn't touch it.
> 8. some times later kernel reads the Ext2 bitmap from the other device. It
> sees the bit off, allocates some data there and creates cross-linked
> files.
>
> The same corruption may happen with some jorunaled filesystems (probably
> not Ext3) or applications that do their own crash recovery (databases,
> etc.). The key point is that an application expects that after a crash it
> reads old data or new data, but it doesn't expect that subsequent reads to
> the same place may alternatively return old or new data --- which may
> happen on RAID-1.
>
>
> Possibilities how to fix it:
>
> 1. lock the buffers and pages while they are being written --- this would
> cause performance degradation (the most severe degradation would be in
> case when one process does repeatedly sync() and other unrelated
> process repeatedly writes to some file).
>
> Lock the buffers and pages only for RAID --- would create many special
> cases and possible bugs.
>
> 2. never turn the region dirty bit off until the filesystem is unmounted.
> --- this is the simplest fix. If the computer crashes after a long
> time, it resynchronizes the whole device. But there won't cause
> application-visible or filesystem-visible data corruption.
>
> 3. turn off the region bit if the region wasn't written in one pdflush
> period --- requires an interaction with pdflush, rather complex. The
> problem here is that pdflush makes its best effort to write data in
> dirty_writeback_centisecs interval, but it is not guaranteed to do it.
>
> 4. make more region states: Region has in-memory states CLEAN, DIRTY,
> MAYBE_DIRTY, CLEAN_CANDIDATE.
>
> When you start writing to the region, it is always moved to DIRTY state
> (and on-disk bit is turned on).
>
> When you finish all writes to the region, move it to MAYBE_DIRTY state,
> but leave bit on disk on. We now don't know if the region is dirty or no.
>
> Run a helper thread that does periodically:
> Change MAYBE_DIRTY regions to CLEAN_CANDIDATE
> Issue sync()
> Change CLEAN_CANDIDATE regions to CLEAN state and clear their on-disk bit.
>
> The rationale is that if the above write-while-modify scenario happens,
> the page is always dirty. Thus, sync() will write the page, kick the
> region back from CLEAN_CANDIDATE to MAYBE_DIRTY state and we won't mark
> the region as clean on disk.
>
>
> I'd like to know you ideas on this, before we start coding a solution.
>
I looked at just this problem a while ago, and came to the conclusion
that what was needed was a COW bit, to show that there was i/o in
flight, and that before modification it needed to be copied. Since you
don't want to let that recurse, you don't start writing the copy until
the original is written and freed. Ideally you wouldn't bother to finish
writing the original, but that doesn't seem possible. That allows at
most two copies of a chunk to take up memory space at once, although
it's still ugly and can be a bottleneck.
For reliable operation I would want all copies (and/or CRCs) to be
written on an fsync; by the time I bother to fsync I really, really
want the data on the disk.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
* Re: Data corruption on software RAID
2008-04-09 18:33 ` Bill Davidsen
@ 2008-04-10 3:07 ` Mikulas Patocka
2008-04-10 14:21 ` Bill Davidsen
0 siblings, 1 reply; 17+ messages in thread
From: Mikulas Patocka @ 2008-04-10 3:07 UTC (permalink / raw)
To: Bill Davidsen
Cc: linux-kernel, linux-raid, device-mapper development, agk, mingo,
neilb
> > Possibilities how to fix it:
> >
> > 1. lock the buffers and pages while they are being written --- this would
> > cause performance degradation (the most severe degradation would be in case
> > when one process does repeatedly sync() and other unrelated process
> > repeatedly writes to some file).
> >
> > Lock the buffers and pages only for RAID --- would create many special cases
> > and possible bugs.
> >
> > 2. never turn the region dirty bit off until the filesystem is unmounted.
> > --- this is the simplest fix. If the computer crashes after a long time, it
> > resynchronizes the whole device. But there won't cause application-visible
> > or filesystem-visible data corruption.
> >
> > 3. turn off the region bit if the region wasn't written in one pdflush
> > period --- requires an interaction with pdflush, rather complex. The problem
> > here is that pdflush makes its best effort to write data in
> > dirty_writeback_centisecs interval, but it is not guaranteed to do it.
> >
> > 4. make more region states: Region has in-memory states CLEAN, DIRTY,
> > MAYBE_DIRTY, CLEAN_CANDIDATE.
> >
> > When you start writing to the region, it is always moved to DIRTY state (and
> > on-disk bit is turned on).
> >
> > When you finish all writes to the region, move it to MAYBE_DIRTY state, but
> > leave bit on disk on. We now don't know if the region is dirty or no.
> >
> > Run a helper thread that does periodically:
> > Change MAYBE_DIRTY regions to CLEAN_CANDIDATE
> > Issue sync()
> > Change CLEAN_CANDIDATE regions to CLEAN state and clear their on-disk bit.
> >
> > The rationale is that if the above write-while-modify scenario happens, the
> > page is always dirty. Thus, sync() will write the page, kick the region back
> > from CLEAN_CANDIDATE to MAYBE_DIRTY state and we won't mark the region as
> > clean on disk.
> >
> >
> > I'd like to know you ideas on this, before we start coding a solution.
> >
>
> I looked at just this problem a while ago, and came to the conclusion that
> what was needed was a COW bit, to show that there was i/o in flight, and that
> before modification it needed to be copied. Since you don't want to let that
> recurse, you don't start writing the copy until the original is written and
> freed. Ideally you wouldn't bother to finish writing the original, but that
> doesn't seem possible. That allows at most two copies of a chunk to take up
> memory space at once, although it's still ugly and can be a bottleneck.
Copying the data would be performance overkill. You really can write
different data to different disks; you just must not forget to resync them
after a crash. The filesystem/application will recover with either old or
new data --- it just won't recover when it's reading old and new data from
the same location.
From my point of view, the trick with a thread doing sync() and turning off
region bits looks best. I'd like to know whether that solution has any
other flaws.
> For reliable operation I would want all copies (and/or CRCs) to be written on
> an fsync, by the time I bother to fsync I really, really, want the data on the
> disk.
fsync already works this way.
Mikulas
* Re: Data corruption on software RAID
2008-04-10 3:07 ` Mikulas Patocka
@ 2008-04-10 14:21 ` Bill Davidsen
2008-04-11 2:55 ` Mikulas Patocka
0 siblings, 1 reply; 17+ messages in thread
From: Bill Davidsen @ 2008-04-10 14:21 UTC (permalink / raw)
To: Mikulas Patocka
Cc: linux-kernel, linux-raid, device-mapper development, mingo, agk
Mikulas Patocka wrote:
>>> Possibilities how to fix it:
>>>
>>> 1. lock the buffers and pages while they are being written --- this would
>>> cause performance degradation (the most severe degradation would be in case
>>> when one process does repeatedly sync() and other unrelated process
>>> repeatedly writes to some file).
>>>
>>> Lock the buffers and pages only for RAID --- would create many special cases
>>> and possible bugs.
>>>
>>> 2. never turn the region dirty bit off until the filesystem is unmounted.
>>> --- this is the simplest fix. If the computer crashes after a long time, it
>>> resynchronizes the whole device. But there won't cause application-visible
>>> or filesystem-visible data corruption.
>>>
>>> 3. turn off the region bit if the region wasn't written in one pdflush
>>> period --- requires an interaction with pdflush, rather complex. The problem
>>> here is that pdflush makes its best effort to write data in
>>> dirty_writeback_centisecs interval, but it is not guaranteed to do it.
>>>
>>> 4. make more region states: Region has in-memory states CLEAN, DIRTY,
>>> MAYBE_DIRTY, CLEAN_CANDIDATE.
>>>
>>> When you start writing to the region, it is always moved to DIRTY state (and
>>> on-disk bit is turned on).
>>>
>>> When you finish all writes to the region, move it to MAYBE_DIRTY state, but
>>> leave bit on disk on. We now don't know if the region is dirty or no.
>>>
>>> Run a helper thread that does periodically:
>>> Change MAYBE_DIRTY regions to CLEAN_CANDIDATE
>>> Issue sync()
>>> Change CLEAN_CANDIDATE regions to CLEAN state and clear their on-disk bit.
>>>
>>> The rationale is that if the above write-while-modify scenario happens, the
>>> page is always dirty. Thus, sync() will write the page, kick the region back
>>> from CLEAN_CANDIDATE to MAYBE_DIRTY state and we won't mark the region as
>>> clean on disk.
>>>
>>>
>>> I'd like to know you ideas on this, before we start coding a solution.
>>>
>>>
>> I looked at just this problem a while ago, and came to the conclusion that
>> what was needed was a COW bit, to show that there was i/o in flight, and that
>> before modification it needed to be copied. Since you don't want to let that
>> recurse, you don't start writing the copy until the original is written and
>> freed. Ideally you wouldn't bother to finish writing the original, but that
>> doesn't seem possible. That allows at most two copies of a chunk to take up
>> memory space at once, although it's still ugly and can be a bottleneck.
>>
>
> Copying the data would be performance overkill. You can really write
> different data to different disks, you just must not forget to resync them
> after a crash. The filesystem/application will recover with either old or
> new data --- it just won't recover when it's reading old and new data from
> the same location.
>
>
Currently you can go for hours without ever reaching a clean state on
active files. By not allowing the buffer to change during a write, the
chances of getting consistent data on the disk should be significantly
improved.
> >From my point of view that trick with thread doing sync() and turning off
> region bits looks best. I'd like to know if that solution doesn't have any
> other flaw.
>
>
>> For reliable operation I would want all copies (and/or CRCs) to be written on
>> an fsync, by the time I bother to fsync I really, really, want the data on the
>> disk.
>>
>
> fsync already works this way.
>
The point I was making is that after you change the code I would still
want that to happen. And your comment above seems to indicate a goal of
getting consistent data after a crash, with less concern that it be the
most recent data written. Sorry in advance if that's a misreading of
"you just must not forget to resync them after a crash."
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
* Re: Data corruption on software RAID
2008-04-10 14:21 ` Bill Davidsen
@ 2008-04-11 2:55 ` Mikulas Patocka
0 siblings, 0 replies; 17+ messages in thread
From: Mikulas Patocka @ 2008-04-11 2:55 UTC (permalink / raw)
To: Bill Davidsen
Cc: linux-kernel, linux-raid, device-mapper development, mingo, agk
> Currently you can go for hours without ever reaching a clean state on active
> files. By not deliberately allowing the buffer to change during a write the
> chances for getting consistent data on the disk should be significantly
> improved.
It can already happen that one device writes the sector and the other does
not, if the power is interrupted. And all RAID implementations already deal
with it by resynchronizing the modified areas in case of a crash. So they
could resynchronize modify-while-write cases as well, with the same code.
... or I don't know if MM maintainers want to add locking to the pages
that are under a write. Personally, I wouldn't do it.
> > From my point of view that trick with thread doing sync() and turning off
> > region bits looks best. I'd like to know if that solution doesn't have any
> > other flaw.
> >
> >
> > > For reliable operation I would want all copies (and/or CRCs) to be
> > > written on an fsync, by the time I bother to fsync I really, really,
> > > want the data on the disk.
> > >
> >
> > fsync already works this way.
> >
>
> The point I was making is that after you change the code I would still want
> that to happen. And your comment above seems to indicate a goal of getting
> consistent data after a crash, with less concern that it be the most recent
> data written. Sorry in advance if that's a misreading of "you just must not
> forget to resync them after a crash."
There would be no problem with fsync. Fsync writes the synced data to both
devices. So after a crash you can select any of the devices as a resync
master copy, and you get the data that you wrote before sync() or fsync().
Mikulas
* Re: Data corruption on software RAID
2008-04-07 23:43 Data corruption on software RAID Mikulas Patocka
2008-04-08 10:22 ` Helge Hafting
2008-04-09 18:33 ` Bill Davidsen
@ 2008-04-10 6:14 ` Mario 'BitKoenig' Holbe
2 siblings, 0 replies; 17+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2008-04-10 6:14 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-raid, dm-devel
Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote:
> During source code review, I found an unprobable but possible data
> corruption on RAID-1 and on DM-RAID-1. (I'm not sure about RAID-4,5,6).
>
> The RAID code was enhanced with bitmaps in 2.6.13.
...
> The developers assume that when all writes to the region finish, the
> region is in-sync.
Just for the record: you don't need bitmaps for that; this happens on
plain non-bitmapped devices as well.
I had an interesting discussion about this with Heinz Mauelshagen on the
linux-raid list back in early 2006 starting with
Message-ID: <du6t39$be5$1@sea.gmane.org>
And it's not that unlikely at all. I experience such inconsistencies
regularly on ext2/3 filesystems with heavy inode fluctuation (for
example via cp -al; rsync, as rsnapshot does). I periodically resync
these inconsistencies manually. However, they always seem to appear with
inode removal only, which is rather harmless.
regards
Mario
--
I've never been certain whether the moral of the Icarus story should
only be, as is generally accepted, "Don't try to fly too high," or
whether it might also be thought of as, "Forget the wax and feathers
and do a better job on the wings." -- Stanley Kubrick