linux-raid.vger.kernel.org archive mirror
* filesystem corruption
@ 2011-01-03  1:58 Patrick H.
  2011-01-03  3:16 ` Neil Brown
  0 siblings, 1 reply; 12+ messages in thread
From: Patrick H. @ 2011-01-03  1:58 UTC (permalink / raw)
  To: linux-raid

I've been trying to track down an issue for a while now, and from digging 
around it appears (though I'm not certain) that the issue lies with the md 
raid device.
What's happening is that after improperly shutting down a raid-5 array, 
upon reassembly, a few files on the filesystem will be corrupt. I don't 
think this is normal filesystem corruption from files being modified 
during the shutdown, because some of the files that end up corrupted are 
several hours old.

The exact details of what I'm doing:
I have a 3-node test cluster I'm doing integrity testing on. Each node 
in the cluster is exporting a couple of disks via ATAoE.
I have the first disk of all 3 nodes in a raid-1 that is holding the 
journal data for the ext3 filesystem. The array is running with an 
internal bitmap as well.
The second disk of all 3 nodes is a raid-5 array holding the ext3 
filesystem itself. This is also running with an internal bitmap.
The ext3 filesystem is mounted with 'data=journal,barrier=1,sync'.
When I power down the node which is actively running both md raid 
devices, another node in the cluster takes over and starts both arrays 
up (in degraded mode of course).
Once the original node comes back up, the new master re-adds its disks 
back into the raid arrays and re-syncs them.
During all this, the filesystem is exported through nfs (nfs also has 
sync turned on) and a client is randomly creating, removing, and 
verifying checksums on the files in the filesystem (nfs is hard-mounted, 
so operations always retry). The client script averages about 30 
creations/s, 30 deletes/s, and 30 checksums/s.
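
For reference, a minimal sketch of what the client loop does (the paths, 
sizes, and exact commands here are illustrative, not the real script):

#!/bin/bash
# Create files with recorded md5sums and re-verify old ones in a loop
# (the random deletes are elided for brevity).
DIR=/mnt/nfs/testdata        # hard-mounted NFS export; path is made up
SUMS=checksums.md5
cd "$DIR" || exit 1
while :; do
    f=file.$$.$RANDOM
    dd if=/dev/urandom of="$f" bs=4k count=8 2>/dev/null   # create
    md5sum "$f" >>"$SUMS"
    # re-verify one previously written file; hours-old files are the
    # ones that come back corrupt after a failover
    shuf -n 1 "$SUMS" | md5sum -c --quiet || echo "CORRUPT"
done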

So, as stated above, every now and then (a 1 in 50 chance or so), when the 
master is hard-rebooted, the client will detect a few files with invalid 
md5 checksums. These files can be hours old, so they were not being 
actively modified.
Another key point that leads me to believe it's an md raid issue: before 
this, I had the ext3 journal running internally on the raid-5 array (as 
part of the filesystem itself). With that setup, there would 
occasionally be massive corruption: file modification times in the 
future, lots of corrupt files, thousands of files put in the 
'lost+found' dir upon fsck, and so on. After I put the journal on a 
separate raid-1, there are no more invalid modification times, there 
hasn't been a single file added to 'lost+found', and the number of 
corrupt files dropped significantly. This would seem to indicate that 
the journal was getting corrupted, and when it was played back, it went 
horribly wrong.

So it would seem there's something wrong with the raid-5 array, but I 
don't know what it could be. Any ideas or input would be much 
appreciated. I can modify the clustering scripts to obtain whatever 
information is needed when they start the arrays.

-Patrick


* Re: filesystem corruption
  2011-01-03  1:58 filesystem corruption Patrick H.
@ 2011-01-03  3:16 ` Neil Brown
       [not found]   ` <4D214B5C.3010103@feystorm.net>
                     ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Neil Brown @ 2011-01-03  3:16 UTC (permalink / raw)
  To: Patrick H.; +Cc: linux-raid

On Sun, 02 Jan 2011 18:58:34 -0700 "Patrick H." <linux-raid@feystorm.net>
wrote:

> [...]

What you are doing cannot work reliably.

If a RAID5 suffers an unclean shutdown and is restarted without a full
complement of devices, then it can corrupt data that has not been changed
recently, just as you are seeing.
This is why mdadm will not assemble such an array unless you provide the
--force flag, which essentially says "I know what I am doing and accept the
risk".
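
For example (a sketch only; the member list is illustrative), forcing
assembly of a dirty, degraded 3-disk array looks something like:

# mdadm --assemble --force --run /dev/md/fs01 /dev/etherd/e1.1p1 /dev/etherd/e2.1p1

where --force marks the array clean despite the unclean shutdown, and --run
lets it start with a member missing.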

When md needs to update a block in your 3-drive RAID5, it will read the other
block in the same stripe (if that isn't in the cache or being written at the
same time) and then write out the data block (or blocks) and the newly
computed parity block.

If you crash after one of those writes has completed, but before all of the
writes have completed, then the parity block will not match the data blocks
on disk.

When you re-assemble the array with one device missing, md will compute the
data that was on the device using the other data block and the parity block.
As the parity and data blocks could be inconsistent, the result could easily
be wrong.
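
A one-byte worked example (values made up): say a stripe holds D0=0x0f and
D1=0x33, so parity P = D0 xor D1 = 0x3c. Suppose D0 is rewritten to 0xf0,
the new D0 reaches disk, but the crash happens before the new parity (0xc3)
does. If the disk holding D1 is then the missing one, md reconstructs it as
D0 xor P:

# printf '%#04x\n' $(( 0xf0 ^ 0x3c ))
0xcc

0xcc is not the 0x33 that was never rewritten, so an untouched block now
reads back corrupt.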

With RAID1 there is no similar problem.  When you read after a crash you will
always get "correct" data.  It may be from before the last write that was
attempted, or from after it, but if the data was not written recently you
will read exactly the right data.

This is why the situation improved substantially when you moved the journal
to RAID1.

To get the full improvement, you need to move the data to RAID1 (or RAID10)
as well.
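
With the same three exported disks that would be something like this
(a sketch only; verify device names and sizes first):

# mdadm --create /dev/md/fs01 --level=10 --layout=n2 --raid-devices=3 \
        --bitmap=internal /dev/etherd/e1.1p1 /dev/etherd/e2.1p1 /dev/etherd/e3.1p1

md's RAID10 accepts an odd number of devices with the near-2 layout; every
block then lives verbatim on two disks, so there is no parity to get out of
step with the data.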

NeilBrown



* Re: filesystem corruption
       [not found]   ` <4D214B5C.3010103@feystorm.net>
@ 2011-01-03  4:56     ` Neil Brown
  2011-01-03  5:05       ` Patrick H.
  0 siblings, 1 reply; 12+ messages in thread
From: Neil Brown @ 2011-01-03  4:56 UTC (permalink / raw)
  To: Patrick H.; +Cc: linux-raid

On Sun, 02 Jan 2011 21:06:52 -0700 "Patrick H." <linux-raid@feystorm.net>
wrote:


> That makes sense, assuming that md acknowledges the write once the data is 
> written to the data disks but not necessarily the parity disk, which is 
> what I gather you were saying happens. Is there any option that 
> can change the behavior so that md won't ack the write until it's been 
> committed to all disks (I'm guessing no, since you didn't mention it)?
> Also, does raid6 suffer this problem? Is it smart enough to use both 
> parity disks when calculating the replacement, or will it just use one?
> 

md/raid5 doesn't acknowledge the write until both the data and the parity
have been written.  But that doesn't make any difference.
If you schedule a number of interdependent writes (data and parity) and then
allow some to complete but not all, then you have inconsistency.
Recovery from losing a single device requires consistency of parity and data.

RAID6 suffers equally from this problem.  Even if it used both parity disks
to recover (which it doesn't), how would that help?  It would then have two
possible values for the data and no way to know which was correct, and every
possibility that both are incorrect.  This would happen if a single data
block was successfully written but neither parity block was.

The only way you can avoid this 'write hole' is by journalling in multiples
of whole stripes.  No current filesystems that I know of can do this as they
journal in blocks, and the maximum block size is less than the minimum stripe
size.  So you would need journalling integrated with md/raid, or you would
need a filesystem which was designed to understand this problem and write
whole stripes at a time, always to an area of the device which did not
contain live data.

NeilBrown


* Re: filesystem corruption
  2011-01-03  4:56     ` Neil Brown
@ 2011-01-03  5:05       ` Patrick H.
  2011-01-04  5:33         ` NeilBrown
  0 siblings, 1 reply; 12+ messages in thread
From: Patrick H. @ 2011-01-03  5:05 UTC (permalink / raw)
  To: linux-raid

Sent: Sun Jan 02 2011 21:56:30 GMT-0700 (Mountain Standard Time)
From: Neil Brown <neilb@suse.de>
To: Patrick H. <linux-raid@feystorm.net>, linux-raid@vger.kernel.org
Subject: Re: filesystem corruption
> [...]

Ok, thanks for the info.
I think I'll solve it by creating 2 dedicated hosts for running the 
array which don't actually export any disks themselves. This way, if a 
master dies, all the raid disks are still there and can be picked up by 
the other master.

-Patrick


* Re: filesystem corruption
  2011-01-03  5:05       ` Patrick H.
@ 2011-01-04  5:33         ` NeilBrown
  2011-01-04  7:50           ` Patrick H.
  0 siblings, 1 reply; 12+ messages in thread
From: NeilBrown @ 2011-01-04  5:33 UTC (permalink / raw)
  To: Patrick H.; +Cc: linux-raid

On Sun, 02 Jan 2011 22:05:06 -0700 "Patrick H." <linux-raid@feystorm.net>
wrote:

> Ok, thanks for the info.
> I think I'll solve it by creating 2 dedicated hosts for running the 
> array which don't actually export any disks themselves. This way, if a 
> master dies, all the raid disks are still there and can be picked up by 
> the other master.
> 

That sounds like it should work OK.

NeilBrown



* Re: filesystem corruption
  2011-01-04  5:33         ` NeilBrown
@ 2011-01-04  7:50           ` Patrick H.
  2011-01-04 17:31             ` Patrick H.
  0 siblings, 1 reply; 12+ messages in thread
From: Patrick H. @ 2011-01-04  7:50 UTC (permalink / raw)
  To: linux-raid

Sent: Mon Jan 03 2011 22:33:24 GMT-0700 (Mountain Standard Time)
From: NeilBrown <neilb@suse.de>
To: Patrick H. <linux-raid@feystorm.net>, linux-raid@vger.kernel.org
Subject: Re: filesystem corruption
> [...]
>
> That sounds like it should work OK.
>
> NeilBrown
Well, it didn't solve it. If I power the entire cluster down and start it 
back up, I get corruption, still on old files that weren't being modified. 
If I power off just a single node, it seems to handle it fine; it's just 
the whole cluster that doesn't.

It also seems to happen fairly frequently now. In the previous setup it 
was probably 1 in 50 failures that produced corruption. Now it's pretty 
much guaranteed there will be corruption if I kill it.
On the last failure I induced, when it came back up, it re-assembled the 
entire raid-5 array with all disks active and none of them needing any 
sort of re-sync. The disk controller is battery backed, so even if it 
was re-ordering the writes, the battery should ensure that everything 
gets committed.

Any other ideas?

-Patrick


* Re: filesystem corruption
  2011-01-04  7:50           ` Patrick H.
@ 2011-01-04 17:31             ` Patrick H.
  2011-01-05  1:22               ` Patrick H.
  0 siblings, 1 reply; 12+ messages in thread
From: Patrick H. @ 2011-01-04 17:31 UTC (permalink / raw)
  To: linux-raid

Sent: Tue Jan 04 2011 00:50:39 GMT-0700 (Mountain Standard Time)
From: Patrick H. <linux-raid@feystorm.net>
To: linux-raid@vger.kernel.org
Subject: Re: filesystem corruption
> [...]
> Any other ideas?
Here is some info from my most recent failure simulation. This one 
resulted in about 50 corrupt files, another 40 or so that can't even be 
opened, and one stale nfs file handle.
I had the cluster script dump out a bunch of info before and after 
assembling the array.

= = = = = = = = = =
# mdadm -E /dev/etherd/e1.1p1
/dev/etherd/e1.1p1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
           Name : dm01:126  (local to host dm01)
  Creation Time : Tue Jan  4 04:45:50 2011
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 2119520 (1035.10 MiB 1085.19 MB)
     Array Size : 4238848 (2.02 GiB 2.17 GB)
  Used Dev Size : 2119424 (1035.05 MiB 1085.15 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : a20adb76:af00f276:5be79a36:b4ff3a8b

Internal Bitmap : 2 sectors from superblock
    Update Time : Tue Jan  4 16:45:56 2011
       Checksum : 361041f6 - correct
         Events : 486

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 0
   Array State : AAA ('A' == active, '.' == missing)


# mdadm -X /dev/etherd/e1.1p1
        Filename : /dev/etherd/e1.1p1
           Magic : 6d746962
         Version : 4
            UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
          Events : 486
  Events Cleared : 486
           State : OK
       Chunksize : 64 KB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1059712 (1035.05 MiB 1085.15 MB)
          Bitmap : 16558 bits (chunks), 189 dirty (1.1%)
= = = = = = = = = =


= = = = = = = = = =
# mdadm -E /dev/etherd/e2.1p1
/dev/etherd/e2.1p1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
           Name : dm01:126  (local to host dm01)
  Creation Time : Tue Jan  4 04:45:50 2011
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 2119520 (1035.10 MiB 1085.19 MB)
     Array Size : 4238848 (2.02 GiB 2.17 GB)
  Used Dev Size : 2119424 (1035.05 MiB 1085.15 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : f9205ace:0796ecf5:2cca363c:c2873816

Internal Bitmap : 2 sectors from superblock
    Update Time : Tue Jan  4 16:45:56 2011
       Checksum : 9d235885 - correct
         Events : 486

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 1
   Array State : AAA ('A' == active, '.' == missing)


# mdadm -X /dev/etherd/e2.1p1
        Filename : /dev/etherd/e2.1p1
           Magic : 6d746962
         Version : 4
            UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
          Events : 486
  Events Cleared : 486
           State : OK
       Chunksize : 64 KB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1059712 (1035.05 MiB 1085.15 MB)
          Bitmap : 16558 bits (chunks), 189 dirty (1.1%)
= = = = = = = = = =


= = = = = = = = = =
# mdadm -E /dev/etherd/e3.1p1
/dev/etherd/e3.1p1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
           Name : dm01:126  (local to host dm01)
  Creation Time : Tue Jan  4 04:45:50 2011
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 2119520 (1035.10 MiB 1085.19 MB)
     Array Size : 4238848 (2.02 GiB 2.17 GB)
  Used Dev Size : 2119424 (1035.05 MiB 1085.15 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 7f90958d:22de5c08:88750ecb:5f376058

Internal Bitmap : 2 sectors from superblock
    Update Time : Tue Jan  4 16:46:13 2011
       Checksum : 3fce6b33 - correct
         Events : 487

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 2
   Array State : AAA ('A' == active, '.' == missing)


# mdadm -X /dev/etherd/e3.1p1
        Filename : /dev/etherd/e3.1p1
           Magic : 6d746962
         Version : 4
            UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
          Events : 487
  Events Cleared : 486
           State : OK
       Chunksize : 64 KB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1059712 (1035.05 MiB 1085.15 MB)
          Bitmap : 16558 bits (chunks), 249 dirty (1.5%)
= = = = = = = = = =



- - - - - - - - - - -
# mdadm -D /dev/md/fs01
/dev/md/fs01:
        Version : 1.2
  Creation Time : Tue Jan  4 04:45:50 2011
     Raid Level : raid5
     Array Size : 2119424 (2.02 GiB 2.17 GB)
  Used Dev Size : 1059712 (1035.05 MiB 1085.15 MB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Jan  4 16:46:13 2011
          State : active, resyncing
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 1% complete

           Name : dm01:126  (local to host dm01)
           UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
         Events : 486

    Number   Major   Minor   RaidDevice State
       0     152      273        0      active sync   /dev/block/152:273
       1     152      529        1      active sync   /dev/block/152:529
       3     152      785        2      active sync   /dev/block/152:785
- - - - - - - - - - -



The old method *never* resulted in this much corruption, and never 
generated stale nfs file handles. Why is this so much worse now when it 
was supposed to be better?


* Re: filesystem corruption
  2011-01-04 17:31             ` Patrick H.
@ 2011-01-05  1:22               ` Patrick H.
  0 siblings, 0 replies; 12+ messages in thread
From: Patrick H. @ 2011-01-05  1:22 UTC (permalink / raw)
  To: linux-raid

I think I may have found something on this. I was messing around with it 
more (I switched to iSCSI instead of ATAoE) and managed to create a 
situation where 2 of the 3 raid-5 disks had failed, yet the md device 
was still active and letting me use it. This is bad.
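
For what it's worth, I'll probably make the cluster scripts detect this 
state before exporting anything. A minimal check, assuming mdadm's 
documented --detail --test exit codes (0 = OK, 1 = degraded, 2 = unusable):

if ! mdadm --detail --test /dev/md/fs01 >/dev/null 2>&1; then
    echo "array degraded or failed, refusing to export" >&2
    exit 1
fi

Anyway, here's the state it was in: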

# mdadm -D /dev/md/fs01
/dev/md/fs01:
        Version : 1.2
  Creation Time : Tue Jan  4 04:45:50 2011
     Raid Level : raid5
     Array Size : 2119424 (2.02 GiB 2.17 GB)
  Used Dev Size : 1059712 (1035.05 MiB 1085.15 MB)
   Raid Devices : 3
  Total Devices : 1
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Jan  4 22:58:44 2011
          State : active, FAILED
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : dm01:125  (local to host dm01)
           UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
         Events : 2980

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       80        1      active sync   /dev/sdf
       2       0        0        2      removed


Notice there's only one disk in the array; the other 2 failed and were 
removed. Yet the state still says active. The filesystem is still up 
and running, and I can even read and write to it, though it spits out 
tons of IO errors.
I then stopped the array and tried to reassemble it, and now it won't 
reassemble.


# mdadm -A /dev/md/fs01 --uuid 9cd9ae9b:39454845:62f2b08d:a4a1ac6c -vv
mdadm: looking for devices for /dev/md/fs01
mdadm: no recogniseable superblock on /dev/md/fs01_journal
mdadm: /dev/md/fs01_journal has wrong uuid.
mdadm: cannot open device /dev/sdg: Device or resource busy
mdadm: /dev/sdg has wrong uuid.
mdadm: cannot open device /dev/sdd: Device or resource busy
mdadm: /dev/sdd has wrong uuid.
mdadm: cannot open device /dev/sdb: Device or resource busy
mdadm: /dev/sdb has wrong uuid.
mdadm: cannot open device /dev/sda2: Device or resource busy
mdadm: /dev/sda2 has wrong uuid.
mdadm: cannot open device /dev/sda1: Device or resource busy
mdadm: /dev/sda1 has wrong uuid.
mdadm: cannot open device /dev/sda: Device or resource busy
mdadm: /dev/sda has wrong uuid.
mdadm: /dev/sde is identified as a member of /dev/md/fs01, slot 2.
mdadm: /dev/sdc is identified as a member of /dev/md/fs01, slot 0.
mdadm: /dev/sdf is identified as a member of /dev/md/fs01, slot 1.
mdadm: added /dev/sdc to /dev/md/fs01 as 0
mdadm: added /dev/sde to /dev/md/fs01 as 2
mdadm: added /dev/sdf to /dev/md/fs01 as 1
mdadm: /dev/md/fs01 assembled from 1 drive - not enough to start the array.


# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md125 : inactive sdf[1](S) sde[3](S) sdc[0](S)
      3179280 blocks super 1.2

md126 : active raid1 sdg[0] sdb[2] sdd[1]
      265172 blocks super 1.2 [3/3] [UUU]
      bitmap: 0/3 pages [0KB], 64KB chunk

unused devices: <none>


md126 is the ext3 journal for the filesystem. Below is the mdadm info 
for all the devices in the array.

# mdadm -E /dev/sdc
/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
           Name : dm01:125  (local to host dm01)
  Creation Time : Tue Jan  4 04:45:50 2011
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 2119520 (1035.10 MiB 1085.19 MB)
     Array Size : 4238848 (2.02 GiB 2.17 GB)
  Used Dev Size : 2119424 (1035.05 MiB 1085.15 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : a20adb76:af00f276:5be79a36:b4ff3a8b

Internal Bitmap : 2 sectors from superblock
    Update Time : Tue Jan  4 22:44:20 2011
       Checksum : 350c988f - correct
         Events : 1150

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 0
   Array State : AA. ('A' == active, '.' == missing)

# mdadm -X /dev/sdc
        Filename : /dev/sdc
           Magic : 6d746962
         Version : 4
            UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
          Events : 1150
  Events Cleared : 1144
           State : OK
       Chunksize : 64 KB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1059712 (1035.05 MiB 1085.15 MB)
          Bitmap : 16558 bits (chunks), 93 dirty (0.6%)

# mdadm -E /dev/sdf
/dev/sdf:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
           Name : dm01:125  (local to host dm01)
  Creation Time : Tue Jan  4 04:45:50 2011
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 2119520 (1035.10 MiB 1085.19 MB)
     Array Size : 4238848 (2.02 GiB 2.17 GB)
  Used Dev Size : 2119424 (1035.05 MiB 1085.15 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : f9205ace:0796ecf5:2cca363c:c2873816

Internal Bitmap : 2 sectors from superblock
    Update Time : Tue Jan  4 23:00:49 2011
       Checksum : 9c20ba71 - correct
         Events : 3062

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 1
   Array State : .A. ('A' == active, '.' == missing)

# mdadm -X /dev/sdf
        Filename : /dev/sdf
           Magic : 6d746962
         Version : 4
            UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
          Events : 3062
  Events Cleared : 1144
           State : OK
       Chunksize : 64 KB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1059712 (1035.05 MiB 1085.15 MB)
          Bitmap : 16558 bits (chunks), 150 dirty (0.9%)

# mdadm -E /dev/sde
/dev/sde:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
           Name : dm01:125  (local to host dm01)
  Creation Time : Tue Jan  4 04:45:50 2011
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 2119520 (1035.10 MiB 1085.19 MB)
     Array Size : 4238848 (2.02 GiB 2.17 GB)
  Used Dev Size : 2119424 (1035.05 MiB 1085.15 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 7f90958d:22de5c08:88750ecb:5f376058

Internal Bitmap : 2 sectors from superblock
    Update Time : Tue Jan  4 22:43:53 2011
       Checksum : 3ecec198 - correct
         Events : 1144

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 2
   Array State : AAA ('A' == active, '.' == missing)

# mdadm -X /dev/sde
        Filename : /dev/sde
           Magic : 6d746962
         Version : 4
            UUID : 9cd9ae9b:39454845:62f2b08d:a4a1ac6c
          Events : 1144
  Events Cleared : 1143
           State : OK
       Chunksize : 64 KB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1059712 (1035.05 MiB 1085.15 MB)
          Bitmap : 16558 bits (chunks), 38 dirty (0.2%)

* Re: filesystem corruption
  2011-01-03  3:16 ` Neil Brown
       [not found]   ` <4D214B5C.3010103@feystorm.net>
@ 2011-01-05  7:02   ` CoolCold
       [not found]   ` <AANLkTinL_nz58f8rSPuhYvVwGY5jdu1XVkNLC1ky5A65@mail.gmail.com>
  2 siblings, 0 replies; 12+ messages in thread
From: CoolCold @ 2011-01-05  7:02 UTC (permalink / raw)
  To: Neil Brown; +Cc: Patrick H., linux-raid

On Mon, Jan 3, 2011 at 6:16 AM, Neil Brown <neilb@suse.de> wrote:
> On Sun, 02 Jan 2011 18:58:34 -0700 "Patrick H." <linux-raid@feystorm.net>
> wrote:
>
> [...]
>
> When md needs to update a block in your 3-drive RAID5, it will read the other
> block in the same stripe (if that isn't in the cache or being written at the
> same time) and then write out the data block (or blocks) and the newly
> computed parity block.
>
> If you crash after one of those writes has completed, but before all of the
> writes have completed, then the parity block will not match the data blocks
> on disk.
Am I understanding right that, in the case of a hardware controller with a
BBU, data and parity are going to be written properly (for locally
connected drives, of course) even in case of power loss, and that this is
the only thing hardware raid controllers can do that softraid can't?
(Well, except some nice features like MaxIQ, the SSD cache for Adaptec
controllers, and the overall write-performance gain from the RAM/BBU
cache.)

> [...]



-- 
Best regards,
[COOLCOLD-RIPN]


* Re: filesystem corruption
       [not found]   ` <AANLkTinL_nz58f8rSPuhYvVwGY5jdu1XVkNLC1ky5A65@mail.gmail.com>
@ 2011-01-05 14:28     ` Patrick H.
  2011-01-05 15:52       ` Spelic
  0 siblings, 1 reply; 12+ messages in thread
From: Patrick H. @ 2011-01-05 14:28 UTC (permalink / raw)
  To: linux-raid

Sent: Wed Jan 05 2011 00:00:48 GMT-0700 (Mountain Standard Time)
From: CoolCold <coolthecold@gmail.com>
To: Neil Brown <neilb@suse.de>, "Patrick H." <linux-raid@feystorm.net>, 
linux-raid@vger.kernel.org
Subject: Re: filesystem corruption
>
> Am I understanding right that, in the case of a hardware controller with 
> a BBU, data and parity are going to be written properly (for locally 
> connected drives, of course) even in case of power loss, and that this 
> is the only thing hardware raid controllers can do that softraid can't?
>
No, my drives are battery backed as well.


* Re: filesystem corruption
  2011-01-05 14:28     ` Patrick H.
@ 2011-01-05 15:52       ` Spelic
  2011-01-05 15:55         ` Patrick H.
  0 siblings, 1 reply; 12+ messages in thread
From: Spelic @ 2011-01-05 15:52 UTC (permalink / raw)
  To: Patrick H.; +Cc: linux-raid

On 01/05/2011 03:28 PM, Patrick H. wrote:
> No, my drives are battery backed as well.

What drives are they, if I can ask? OCZ SSDs with a supercapacitor, maybe?

Do you know if they will really flush the whole write cache on sudden 
power off? I have read vague statements about this for the OCZ drives. At 
some points it seemed like the supercapacitor was only able to provide 
the same guarantees as an HDD, that is, no further data loss due to the 
erase-then-rewrite-32K and flash wear-levelling machinery, but was not 
able to flush the write cache.
Did you try with, e.g., a stream of simple database transactions and then 
suddenly disconnecting the cable, like this test?
http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/

Thank you


* Re: filesystem corruption
  2011-01-05 15:52       ` Spelic
@ 2011-01-05 15:55         ` Patrick H.
  0 siblings, 0 replies; 12+ messages in thread
From: Patrick H. @ 2011-01-05 15:55 UTC (permalink / raw)
  To: linux-raid

HP DL360 G6, SAS controller with a battery-backed write accelerator.
I haven't been focusing on the reliability of the drives, as this is 
proof-of-concept testing. If we decide to use it, the drives will be 
replaced with 2TB PCIe SSD cards.

-Patrick

Sent: Wed Jan 05 2011 08:52:04 GMT-0700 (Mountain Standard Time)
From: Spelic <spelic@shiftmail.org>
To: Patrick H. <linux-raid@feystorm.net>, linux-raid <linux-raid@vger.kernel.org>
Subject: Re: filesystem corruption
> [...]

