* filesystem corruption
From: Patrick H. @ 2011-01-03 1:58 UTC
To: linux-raid

I've been trying to track down an issue for a while now, and from digging
around it appears (though I'm not certain) that the issue lies with the md
raid device. What's happening is that after improperly shutting down a
raid-5 array, upon reassembly, a few files on the filesystem will be
corrupt. I don't think this is normal filesystem corruption from files being
modified during the shutdown, because some of the files that end up
corrupted are several hours old.

The exact details of what I'm doing:
I have a 3-node test cluster I'm doing integrity testing on. Each node in
the cluster is exporting a couple of disks via ATAoE.
I have the first disk of all 3 nodes in a raid-1 that holds the journal data
for the ext3 filesystem. The array is running with an internal bitmap as
well.
The second disk of all 3 nodes is in a raid-5 array holding the ext3
filesystem itself. This is also running with an internal bitmap.
The ext3 filesystem is mounted with 'data=journal,barrier=1,sync'.
When I power down the node which is actively running both md raid devices,
another node in the cluster takes over and starts both arrays up (in
degraded mode, of course).
Once the original node comes back up, the new master re-adds its disks back
into the raid arrays and re-syncs them.
During all this, the filesystem is exported through nfs (nfs also has sync
turned on) and a client is randomly creating, removing, and verifying
checksums on the files in the filesystem (nfs is hard mounted so operations
always retry). The client script averages about 30 creations/s, 30 deletes/s,
and 30 checksums/s.

So, as stated above, every now and then (a 1 in 50 chance or so), when the
master is hard-rebooted, the client will detect a few files with invalid md5
checksums. These files could be hours old, so they were not being actively
modified.
Another key point that leads me to believe it's an md raid issue: before, I
had the ext3 journal running internally on the raid-5 array (as part of the
filesystem itself). When I did this, there would occasionally be massive
corruption: file modification times in the future, lots of corrupt files,
thousands of files put in the 'lost+found' dir upon fsck, and so on. After I
put the journal on a separate raid-1, there are no more invalid modification
times, there hasn't been a single file added to 'lost+found', and the number
of corrupt files dropped significantly. This would seem to indicate that the
journal was getting corrupted, and when it was played back, it went horribly
wrong.

So it would seem there's something wrong with the raid-5 array, but I don't
know what it could be. Any ideas or input would be much appreciated. I can
modify the clustering scripts to obtain whatever information is needed when
they start the arrays.

-Patrick
* Re: filesystem corruption
From: Neil Brown @ 2011-01-03 3:16 UTC
To: Patrick H.; +Cc: linux-raid

On Sun, 02 Jan 2011 18:58:34 -0700 "Patrick H." <linux-raid@feystorm.net> wrote:

> I've been trying to track down an issue for a while now, and from digging
> around it appears (though I'm not certain) that the issue lies with the md
> raid device. What's happening is that after improperly shutting down a
> raid-5 array, upon reassembly, a few files on the filesystem will be
> corrupt.
> [...]
> So it would seem there's something wrong with the raid-5 array, but I
> don't know what it could be. Any ideas or input would be much appreciated.

What you are doing cannot work reliably.

If a RAID5 suffers an unclean shutdown and is restarted without a full
complement of devices, then it can corrupt data that has not been changed
recently, just as you are seeing.
This is why mdadm will not assemble that array unless you provide the --force
flag, which essentially says "I know what I am doing and accept the risk".

When md needs to update a block in your 3-drive RAID5, it will read the other
block in the same stripe (if that isn't in the cache or being written at the
same time) and then write out the data block (or blocks) and the newly
computed parity block.

If you crash after one of those writes has completed, but before all of the
writes have completed, then the parity block will not match the data blocks
on disk.

When you re-assemble the array with one device missing, md will compute the
data that was on the missing device using the other data block and the parity
block. As the parity and data blocks could be inconsistent, the result could
easily be wrong.

With RAID1 there is no similar problem. When you read after a crash you will
always get "correct" data. It may be from before the last write that was
attempted, or after, but if the data was not written recently you will read
exactly the right data.

This is why the situation improved substantially when you moved the journal
to RAID1.

To get the full improvement, you need to move the data to RAID1 (or RAID10)
as well.

NeilBrown
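To make the failure mode concrete, here is a minimal sketch of the arithmetic
(plain XOR on single bytes, not md's actual code; the 0x11/0x22/0x55 values
are made up for illustration):

# Write-hole sketch: 3-drive RAID5 stripe, one byte per "chunk".
# Parity is the XOR of the two data chunks; a lost drive is rebuilt
# by XOR-ing the surviving chunk with parity.

def parity(d0, d1):
    return d0 ^ d1

def rebuild(surviving, p):
    return surviving ^ p

# Consistent stripe: D0=0x11, D1=0x22, P = D0 ^ D1
d0, d1 = 0x11, 0x22
p = parity(d0, d1)

# md updates D0 to 0x55; the new D0 hits disk, but the crash happens
# before the recomputed parity is written (or vice versa).
d0 = 0x55                      # new data written
# p is still parity(0x11, 0x22), i.e. stale

# The array is later assembled with the drive holding D1 missing,
# so D1 must be reconstructed from D0 and P:
d1_rebuilt = rebuild(d0, p)

print(hex(d1_rebuilt))   # 0x66, not the 0x22 that was never touched
assert d1_rebuilt != 0x22

The block that comes back wrong is the one that was never touched by the torn
update, which is exactly why files that are hours old can fail their
checksums after a degraded restart.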
* Re: filesystem corruption
From: Neil Brown @ 2011-01-03 4:56 UTC
To: Patrick H.; +Cc: linux-raid

On Sun, 02 Jan 2011 21:06:52 -0700 "Patrick H." <linux-raid@feystorm.net> wrote:

> That makes sense, assuming that MD acknowledges the write once the data is
> written to the data disks but not necessarily the parity disk, which is
> what I gather you were saying happens. Is there any option that can change
> the behavior so that md won't ack the write until it's been committed to
> all disks (I'm guessing no, since you didn't mention it)?
> Also, does raid6 suffer this problem? Is it smart enough to use both parity
> disks when calculating the replacement, or will it just use one?

md/raid5 doesn't acknowledge the write until both the data and the parity
have been written. But that doesn't make any difference: if you schedule a
number of interdependent writes (data and parity) and then allow some to
complete but not all, then you have inconsistency. Recovery from losing a
single device requires consistency of parity and data.

RAID6 suffers equally from this problem. Even if it used both parity disks
to recover (which it doesn't), how would that help? It would then have two
possible values for the data and no way to know which was correct, and every
possibility that both are incorrect. This would happen if a single data block
was successfully written but neither parity block was.

The only way you can avoid this 'write hole' is by journalling in multiples
of whole stripes. No current filesystem that I know of can do this, as they
journal in blocks and the maximum block size is less than the minimum stripe
size. So you would need journalling integrated with md/raid, or you would
need a filesystem which was designed to understand this problem and write
whole stripes at a time, always to an area of the device which did not
contain live data.

NeilBrown
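To illustrate the whole-stripe idea in that last paragraph, here is a toy
copy-on-write model (not md or any real filesystem; the dict-based "disk" and
stripe map are invented for the sketch). Every write lands as a complete
stripe, parity included, in a location holding no live data, and the live
pointer only moves once the stripe is complete, so a crash mid-write can lose
the new write but cannot corrupt old stripes:

# Sketch of "whole stripe, never in place" writing, the idea behind
# closing the write hole. Stripes live in a dict keyed by location;
# a tiny stripe map says which location holds the live copy.

def parity(chunks):
    p = 0
    for c in chunks:
        p ^= c
    return p

disk = {}                 # location -> (data_chunks, parity)
stripe_map = {"A": 0}     # logical stripe -> live location
disk[0] = ([0x11, 0x22], parity([0x11, 0x22]))

def update_stripe(name, new_chunks, crash_midway=False):
    loc = max(disk) + 1                  # always a fresh location
    if crash_midway:
        disk[loc] = (new_chunks, None)   # parity never hit the disk
        return                           # stripe map not updated
    disk[loc] = (new_chunks, parity(new_chunks))
    stripe_map[name] = loc               # pointer flip happens last

update_stripe("A", [0x55, 0x22], crash_midway=True)

# After the crash, the stripe map still points at location 0, whose
# data and parity are consistent; the torn write at location 1 is
# simply garbage that never became live.
live = disk[stripe_map["A"]]
assert parity(live[0]) == live[1]
print(live)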
* Re: filesystem corruption
From: Patrick H. @ 2011-01-03 5:05 UTC
To: linux-raid

On Sun, 02 Jan 2011 21:56:30 -0700, Neil Brown <neilb@suse.de> wrote:

> md/raid5 doesn't acknowledge the write until both the data and the parity
> have been written. But that doesn't make any difference: if you schedule a
> number of interdependent writes (data and parity) and then allow some to
> complete but not all, then you have inconsistency.
> [...]

Ok, thanks for the info.
I think I'll solve it by creating 2 dedicated hosts for running the array
which don't actually export any disks themselves. This way if a master dies,
all the raid disks are still there and can be picked up by the other master.

-Patrick
* Re: filesystem corruption
From: NeilBrown @ 2011-01-04 5:33 UTC
To: Patrick H.; +Cc: linux-raid

On Sun, 02 Jan 2011 22:05:06 -0700 "Patrick H." <linux-raid@feystorm.net> wrote:

> Ok, thanks for the info.
> I think I'll solve it by creating 2 dedicated hosts for running the array
> which don't actually export any disks themselves. This way if a master
> dies, all the raid disks are still there and can be picked up by the other
> master.

That sounds like it should work OK.

NeilBrown
* Re: filesystem corruption
From: Patrick H. @ 2011-01-04 7:50 UTC
To: linux-raid

On Mon, 03 Jan 2011 22:33:24 -0700, NeilBrown <neilb@suse.de> wrote:

> > Ok, thanks for the info.
> > I think I'll solve it by creating 2 dedicated hosts for running the
> > array which don't actually export any disks themselves. This way if a
> > master dies, all the raid disks are still there and can be picked up by
> > the other master.
>
> That sounds like it should work OK.
>
> NeilBrown

Well, it didn't solve it. If I power the entire cluster down and start it
back up, I get corruption, still on old files that weren't being modified.
If I power off just a single node, it seems to handle it fine, just not the
whole cluster.

It also seems to happen fairly frequently now. In the previous setup it was
probably 1 in 50 failures that produced corruption. Now it's pretty much a
guarantee there will be corruption if I kill it.
On the last failure I did, when it came back up, it re-assembled the entire
raid-5 array with all disks active and none of them needing any sort of
re-sync. The disk controller is battery backed, so even if it was re-ordering
the writes, the battery should ensure that it all gets committed.

Any other ideas?

-Patrick
* Re: filesystem corruption
From: Patrick H. @ 2011-01-04 17:31 UTC
To: linux-raid

On Tue, 04 Jan 2011 00:50:39 -0700, Patrick H. <linux-raid@feystorm.net> wrote:

> Well, it didn't solve it. If I power the entire cluster down and start it
> back up, I get corruption, still on old files that weren't being modified.
> [...]
> Any other ideas?

Here is some info from my most recent failure simulation. This one resulted
in about 50 corrupt files, another 40 or so that can't even be opened, and
one stale nfs file handle. I had the cluster script dump out a bunch of info
before and after assembling the array.

= = = = = = = = = =
# mdadm -E /dev/etherd/e1.1p1
/dev/etherd/e1.1p1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
           Name : dm01:126  (local to host dm01)
  Creation Time : Tue Jan  4 04:45:50 2011
     Raid Level : raid5
   Raid Devices : 3
 Avail Dev Size : 2119520 (1035.10 MiB 1085.19 MB)
     Array Size : 4238848 (2.02 GiB 2.17 GB)
  Used Dev Size : 2119424 (1035.05 MiB 1085.15 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : a20adb76:af00f276:5be79a36:b4ff3a8b
Internal Bitmap : 2 sectors from superblock
    Update Time : Tue Jan  4 16:45:56 2011
       Checksum : 361041f6 - correct
         Events : 486
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 0
    Array State : AAA ('A' == active, '.' == missing)

# mdadm -X /dev/etherd/e1.1p1
        Filename : /dev/etherd/e1.1p1
           Magic : 6d746962
         Version : 4
            UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
          Events : 486
  Events Cleared : 486
           State : OK
       Chunksize : 64 KB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1059712 (1035.05 MiB 1085.15 MB)
          Bitmap : 16558 bits (chunks), 189 dirty (1.1%)

= = = = = = = = = =
# mdadm -E /dev/etherd/e2.1p1
/dev/etherd/e2.1p1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
           Name : dm01:126  (local to host dm01)
  Creation Time : Tue Jan  4 04:45:50 2011
     Raid Level : raid5
   Raid Devices : 3
 Avail Dev Size : 2119520 (1035.10 MiB 1085.19 MB)
     Array Size : 4238848 (2.02 GiB 2.17 GB)
  Used Dev Size : 2119424 (1035.05 MiB 1085.15 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : f9205ace:0796ecf5:2cca363c:c2873816
Internal Bitmap : 2 sectors from superblock
    Update Time : Tue Jan  4 16:45:56 2011
       Checksum : 9d235885 - correct
         Events : 486
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 1
    Array State : AAA ('A' == active, '.' == missing)

# mdadm -X /dev/etherd/e2.1p1
        Filename : /dev/etherd/e2.1p1
           Magic : 6d746962
         Version : 4
            UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
          Events : 486
  Events Cleared : 486
           State : OK
       Chunksize : 64 KB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1059712 (1035.05 MiB 1085.15 MB)
          Bitmap : 16558 bits (chunks), 189 dirty (1.1%)

= = = = = = = = = =
# mdadm -E /dev/etherd/e3.1p1
/dev/etherd/e3.1p1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
           Name : dm01:126  (local to host dm01)
  Creation Time : Tue Jan  4 04:45:50 2011
     Raid Level : raid5
   Raid Devices : 3
 Avail Dev Size : 2119520 (1035.10 MiB 1085.19 MB)
     Array Size : 4238848 (2.02 GiB 2.17 GB)
  Used Dev Size : 2119424 (1035.05 MiB 1085.15 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 7f90958d:22de5c08:88750ecb:5f376058
Internal Bitmap : 2 sectors from superblock
    Update Time : Tue Jan  4 16:46:13 2011
       Checksum : 3fce6b33 - correct
         Events : 487
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 2
    Array State : AAA ('A' == active, '.' == missing)

# mdadm -X /dev/etherd/e3.1p1
        Filename : /dev/etherd/e3.1p1
           Magic : 6d746962
         Version : 4
            UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
          Events : 487
  Events Cleared : 486
           State : OK
       Chunksize : 64 KB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1059712 (1035.05 MiB 1085.15 MB)
          Bitmap : 16558 bits (chunks), 249 dirty (1.5%)

- - - - - - - - - - -
# mdadm -D /dev/md/fs01
/dev/md/fs01:
        Version : 1.2
  Creation Time : Tue Jan  4 04:45:50 2011
     Raid Level : raid5
     Array Size : 2119424 (2.02 GiB 2.17 GB)
  Used Dev Size : 1059712 (1035.05 MiB 1085.15 MB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent
  Intent Bitmap : Internal
    Update Time : Tue Jan  4 16:46:13 2011
          State : active, resyncing
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 64K
 Rebuild Status : 1% complete
           Name : dm01:126  (local to host dm01)
           UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
         Events : 486

    Number   Major   Minor   RaidDevice State
       0     152      273        0      active sync   /dev/block/152:273
       1     152      529        1      active sync   /dev/block/152:529
       3     152      785        2      active sync   /dev/block/152:785
- - - - - - - - - - -

The old method *never* resulted in this much corruption, and never generated
stale nfs file handles. Why is this so much worse now, when it was supposed
to be better?
* Re: filesystem corruption
From: Patrick H. @ 2011-01-05 1:22 UTC
To: linux-raid

I think I may have found something on this. I was messing around with it more
(switched to iSCSI instead of ATAoE) and managed to create a situation where
2 of the 3 raid-5 disks had failed, yet the MD device was still active and
letting me use it. This is bad.

mdadm -D /dev/md/fs01
/dev/md/fs01:
        Version : 1.2
  Creation Time : Tue Jan  4 04:45:50 2011
     Raid Level : raid5
     Array Size : 2119424 (2.02 GiB 2.17 GB)
  Used Dev Size : 1059712 (1035.05 MiB 1085.15 MB)
   Raid Devices : 3
  Total Devices : 1
    Persistence : Superblock is persistent
  Intent Bitmap : Internal
    Update Time : Tue Jan  4 22:58:44 2011
          State : active, FAILED
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 64K
           Name : dm01:125  (local to host dm01)
           UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
         Events : 2980

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       80        1      active sync   /dev/sdf
       2       0        0        2      removed

Notice there's only one disk in the array; the other 2 failed and were
removed. Yet the state still says active. The filesystem is still up and
running, and I can even read and write to it, though it spits out tons of IO
errors. I then stopped the array and tried to reassemble it, and now it won't
reassemble.

# mdadm -A /dev/md/fs01 --uuid 9cd9ae9b:39454845:62f2b08d:a4a1ac6c -vv
mdadm: looking for devices for /dev/md/fs01
mdadm: no recogniseable superblock on /dev/md/fs01_journal
mdadm: /dev/md/fs01_journal has wrong uuid.
mdadm: cannot open device /dev/sdg: Device or resource busy
mdadm: /dev/sdg has wrong uuid.
mdadm: cannot open device /dev/sdd: Device or resource busy
mdadm: /dev/sdd has wrong uuid.
mdadm: cannot open device /dev/sdb: Device or resource busy
mdadm: /dev/sdb has wrong uuid.
mdadm: cannot open device /dev/sda2: Device or resource busy
mdadm: /dev/sda2 has wrong uuid.
mdadm: cannot open device /dev/sda1: Device or resource busy
mdadm: /dev/sda1 has wrong uuid.
mdadm: cannot open device /dev/sda: Device or resource busy
mdadm: /dev/sda has wrong uuid.
mdadm: /dev/sde is identified as a member of /dev/md/fs01, slot 2.
mdadm: /dev/sdc is identified as a member of /dev/md/fs01, slot 0.
mdadm: /dev/sdf is identified as a member of /dev/md/fs01, slot 1.
mdadm: added /dev/sdc to /dev/md/fs01 as 0
mdadm: added /dev/sde to /dev/md/fs01 as 2
mdadm: added /dev/sdf to /dev/md/fs01 as 1
mdadm: /dev/md/fs01 assembled from 1 drive - not enough to start the array.

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md125 : inactive sdf[1](S) sde[3](S) sdc[0](S)
      3179280 blocks super 1.2

md126 : active raid1 sdg[0] sdb[2] sdd[1]
      265172 blocks super 1.2 [3/3] [UUU]
      bitmap: 0/3 pages [0KB], 64KB chunk

unused devices: <none>

md126 is the ext3 journal for the filesystem. Below is mdadm info on all the
devices in the array.

# mdadm -E /dev/sdc
/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
           Name : dm01:125  (local to host dm01)
  Creation Time : Tue Jan  4 04:45:50 2011
     Raid Level : raid5
   Raid Devices : 3
 Avail Dev Size : 2119520 (1035.10 MiB 1085.19 MB)
     Array Size : 4238848 (2.02 GiB 2.17 GB)
  Used Dev Size : 2119424 (1035.05 MiB 1085.15 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : a20adb76:af00f276:5be79a36:b4ff3a8b
Internal Bitmap : 2 sectors from superblock
    Update Time : Tue Jan  4 22:44:20 2011
       Checksum : 350c988f - correct
         Events : 1150
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 0
    Array State : AA. ('A' == active, '.' == missing)

# mdadm -X /dev/sdc
        Filename : /dev/sdc
           Magic : 6d746962
         Version : 4
            UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
          Events : 1150
  Events Cleared : 1144
           State : OK
       Chunksize : 64 KB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1059712 (1035.05 MiB 1085.15 MB)
          Bitmap : 16558 bits (chunks), 93 dirty (0.6%)

# mdadm -E /dev/sdf
/dev/sdf:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
           Name : dm01:125  (local to host dm01)
  Creation Time : Tue Jan  4 04:45:50 2011
     Raid Level : raid5
   Raid Devices : 3
 Avail Dev Size : 2119520 (1035.10 MiB 1085.19 MB)
     Array Size : 4238848 (2.02 GiB 2.17 GB)
  Used Dev Size : 2119424 (1035.05 MiB 1085.15 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : f9205ace:0796ecf5:2cca363c:c2873816
Internal Bitmap : 2 sectors from superblock
    Update Time : Tue Jan  4 23:00:49 2011
       Checksum : 9c20ba71 - correct
         Events : 3062
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 1
    Array State : .A. ('A' == active, '.' == missing)

# mdadm -X /dev/sdf
        Filename : /dev/sdf
           Magic : 6d746962
         Version : 4
            UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
          Events : 3062
  Events Cleared : 1144
           State : OK
       Chunksize : 64 KB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1059712 (1035.05 MiB 1085.15 MB)
          Bitmap : 16558 bits (chunks), 150 dirty (0.9%)

# mdadm -E /dev/sde
/dev/sde:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
           Name : dm01:125  (local to host dm01)
  Creation Time : Tue Jan  4 04:45:50 2011
     Raid Level : raid5
   Raid Devices : 3
 Avail Dev Size : 2119520 (1035.10 MiB 1085.19 MB)
     Array Size : 4238848 (2.02 GiB 2.17 GB)
  Used Dev Size : 2119424 (1035.05 MiB 1085.15 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 7f90958d:22de5c08:88750ecb:5f376058
Internal Bitmap : 2 sectors from superblock
    Update Time : Tue Jan  4 22:43:53 2011
       Checksum : 3ecec198 - correct
         Events : 1144
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 2
    Array State : AAA ('A' == active, '.' == missing)

# mdadm -X /dev/sde
        Filename : /dev/sde
           Magic : 6d746962
         Version : 4
            UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
          Events : 1144
  Events Cleared : 1143
           State : OK
       Chunksize : 64 KB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1059712 (1035.05 MiB 1085.15 MB)
          Bitmap : 16558 bits (chunks), 38 dirty (0.2%)
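One thing the -E dumps above already show is why assembly stopped at one
drive: the Events counters have diverged badly (sdf at 3062 versus 1150 on
sdc and 1144 on sde), so md treats only sdf as current. A quick sketch of
that comparison (a hypothetical helper script, not an mdadm feature; the
device list and the idea of parsing `mdadm -E` output are assumptions):

# Hypothetical helper to compare the Events counter in each member's
# superblock, as reported by `mdadm -E`. Run as root; the device
# names below are assumed to match the setup in this thread.
import re
import subprocess

members = ["/dev/sdc", "/dev/sdf", "/dev/sde"]   # assumed member devices

def events(dev):
    out = subprocess.run(["mdadm", "-E", dev],
                         capture_output=True, text=True, check=True).stdout
    return int(re.search(r"^\s*Events\s*:\s*(\d+)", out, re.M).group(1))

counts = {dev: events(dev) for dev in members}
newest = max(counts.values())
for dev, ev in counts.items():
    status = "current" if ev == newest else f"stale by {newest - ev} events"
    print(f"{dev}: events={ev} ({status})")

# With the dumps above this would report sdf as current (3062) and
# sdc/sde as thousands of events behind, so only one drive is usable
# without --force.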
* Re: filesystem corruption
From: CoolCold @ 2011-01-05 7:02 UTC
To: Neil Brown; +Cc: Patrick H., linux-raid

On Mon, Jan 3, 2011 at 6:16 AM, Neil Brown <neilb@suse.de> wrote:
> [...]
> When md needs to update a block in your 3-drive RAID5, it will read the
> other block in the same stripe (if that isn't in the cache or being written
> at the same time) and then write out the data block (or blocks) and the
> newly computed parity block.
>
> If you crash after one of those writes has completed, but before all of the
> writes have completed, then the parity block will not match the data blocks
> on disk.

Am I understanding right that in the case of a hardware controller with a
BBU, data and parity are going to be written properly (for locally connected
drives, of course) even in case of power loss, and that this is the only
thing hardware raid controllers can do that softraid can't? (Well, except
some nice extras like MaxIQ (SSD caching on Adaptec controllers) and the
overall write performance gain from the RAM/BBU.)

> [...]
> To get the full improvement, you need to move the data to RAID1 (or RAID10)
> as well.
>
> NeilBrown

--
Best regards,
[COOLCOLD-RIPN]
* Re: filesystem corruption
From: Patrick H. @ 2011-01-05 14:28 UTC
To: linux-raid

On Wed, 05 Jan 2011 00:00:48 -0700, CoolCold <coolthecold@gmail.com> wrote:

> Am I understanding right that in the case of a hardware controller with a
> BBU, data and parity are going to be written properly (for locally
> connected drives, of course) even in case of power loss, and that this is
> the only thing hardware raid controllers can do that softraid can't?

No, my drives are battery backed as well.
* Re: filesystem corruption
From: Spelic @ 2011-01-05 15:52 UTC
To: Patrick H.; +Cc: linux-raid

On 01/05/2011 03:28 PM, Patrick H. wrote:
> No, my drives are battery backed as well.

What drives are they, if I can ask? OCZ SSDs with a supercapacitor, maybe?

Do you know if they will really flush the whole write cache on sudden power
off? I have read vague statements about this for the OCZ drives. At certain
points it seemed like the supercapacitor was only able to provide the same
guarantees as a HDD (that is, no further data loss from the
erase-then-rewrite-32K and flash wear-levelling behaviour), but was not able
to flush the write cache.
Did you try with e.g. a stream of simple database transactions and then
disconnecting the cable suddenly, like this test?
http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/

Thank you
* Re: filesystem corruption
From: Patrick H. @ 2011-01-05 15:55 UTC
To: linux-raid

HP DL360 G6. SAS controller with a battery-backed write accelerator. I
haven't been focusing on the reliability of the drives, as this is
proof-of-concept testing. If we decide to use it, the drives will be replaced
with 2TB SSD PCIe cards.

-Patrick

On Wed, 05 Jan 2011 08:52:04 -0700, Spelic <spelic@shiftmail.org> wrote:

> What drives are they, if I can ask? OCZ SSDs with a supercapacitor, maybe?
>
> Do you know if they will really flush the whole write cache on sudden power
> off?
> [...]