* Massive RAID-1 desync
From: cau2jeaf1honoq @ 2015-04-24 20:47 UTC
To: linux-raid

Something is happening here. I don't know what, but I'm having
fun trying to guess.

The root file system (ext3) is on a 4 x 30 GB RAID-1 array. A
couple hours after boot, the kernel detected something wrong in
the file system and decided to remount it read-only.

Comparing the component partitions finds many differences with a
very uneven distribution :

- sda1 and sdb1 are identical except for 4 bytes in the last
  70 kB,

- sdd1 is identical to sda1 and sdb1 except for about 67,000
  differences in the last 70 kB.

- sdc1 is grossly out of sync with about 300 million differences
  with the others, all of them in the first 450 MB or so.

I'm not sure what to make of this. The knee-jerk thought would
be "/dev/sdc1 is the odd man out so sdc must be faulty". But
that disk participates in other arrays without problems, I don't
see anything obviously bad in its SMART data and the kernel
messages just before the remount were actually about sda.

To be honest, I don't have a clear idea of how things got where
they are. Since writing to a RAID-1 array writes the same data
to all devices, how can you have so many differences ?
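For readers who want to reproduce this kind of comparison, here is a minimal sketch of one way to count differing bytes between two component images and see where they fall. It is not the poster's actual method, which the message does not describe; the function name `diff_summary` and the image paths in the usage comment are hypothetical. It is best run on copies or on an array that is not being written to.

```python
def diff_summary(a, b, chunk_size=1 << 20):
    """Compare two binary streams chunk by chunk.

    Returns (count, first, last): the number of differing bytes and
    the offsets of the first and last difference (None if identical).
    Comparison stops at the end of the shorter stream.
    """
    count, first, last = 0, None, None
    offset = 0
    while True:
        ca, cb = a.read(chunk_size), b.read(chunk_size)
        n = min(len(ca), len(cb))
        if n == 0:
            break
        if ca[:n] != cb[:n]:  # cheap whole-chunk test before the byte loop
            for i in range(n):
                if ca[i] != cb[i]:
                    count += 1
                    if first is None:
                        first = offset + i
                    last = offset + i
        offset += n
    return count, first, last

# Usage on image copies (paths are placeholders):
# with open("sda1.img", "rb") as a, open("sdc1.img", "rb") as b:
#     print(diff_summary(a, b))
```

The byte loop is slow in pure Python when differences are dense, so on a 30 GB partition this is only practical as an illustration of the idea; the first/last offsets are what reveal an uneven distribution like the one reported above.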
* Re: Massive RAID-1 desync
From: Mikael Abrahamsson @ 2015-04-25 4:02 UTC
To: cau2jeaf1honoq; +Cc: linux-raid

On Fri, 24 Apr 2015, cau2jeaf1honoq@laposte.net wrote:

> Something is happening here. I don't know what, but I'm having
> fun trying to guess.

What kernel version are you running? Some other information would be
interesting as well, such as what /proc/mdstat is saying, and anything
from dmesg or similar logs leading up to this...

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se
* Re: Massive RAID-1 desync
From: Jean-Baptiste Thomas @ 2015-04-25 6:18 UTC
To: Mikael Abrahamsson; +Cc: linux-raid

On 2015-04-25 06:02 +0200, Mikael Abrahamsson wrote:

> What kernel version are you running?

Linux 3.0.14.
mdadm 3.1.4 (2010-08-31)

> Some other information would be
> interesting as well, such as what /proc/mdstat is saying,

Good point. Curiously, nothing :

md1 : active raid1 sdd1[3] sdb1[0] sdc1[2] sda1[1]
      31463232 blocks [4/4] [UUUU]

> and anything from dmesg or similar logs leading up to this...

Too late, /var/log was on the root file system and lockd has
since helpfully flooded the kernel ring buffer.
* Re: Massive RAID-1 desync
From: NeilBrown @ 2015-04-25 7:25 UTC
To: cau2jeaf1honoq; +Cc: linux-raid

On Fri, 24 Apr 2015 22:47:49 +0200 (CEST) cau2jeaf1honoq@laposte.net wrote:

> Something is happening here. I don't know what, but I'm having
> fun trying to guess.
>
> The root file system (ext3) is on a 4 x 30 GB RAID-1 array. A
> couple hours after boot, the kernel detected something wrong in
> the file system and decided to remount it read-only.
>
> Comparing the component partitions finds many differences with a
> very uneven distribution :
>
> - sda1 and sdb1 are identical except for 4 bytes in the last
>   70 kB,

Perfectly normal. Metadata is at the end, at least 64K from the end
and 64K aligned.

>
> - sdd1 is identical to sda1 and sdb1 except for about 67,000
>   differences in the last 70 kB.

Following the metadata is between 60K and 124K of nothing. It could
easily be completely different on different devices.

>
> - sdc1 is grossly out of sync with about 300 million differences
>   with the others, all of them in the first 450 MB or so.

sdc1 is sick. Maybe it has hardware problems. Maybe some hacker broke
into your machine and wrote garbage to it. Or maybe you triggered a bug
that no one else has ever come across (unlikely, but possible).

>
> I'm not sure what to make of this. The knee-jerk thought would
> be "/dev/sdc1 is the odd man out so sdc must be faulty". But
> that disk participates in other arrays without problems, I don't
> see anything obviously bad in its SMART data and the kernel
> messages just before the remount were actually about sda.

And what were those messages about sda?

>
> To be honest, I don't have a clear idea of how things got where
> they are. Since writing to a RAID-1 array writes the same data
> to all devices, how can you have so many differences ?

Cosmic rays? EMP?
I actually think that the most likely explanation is that someone was
careless and wrote something to sdc that they didn't mean to. But I'm
probably wrong.

I like guessing too.

NeilBrown
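The placement rule described above ("at least 64K from the end and 64K aligned", followed by "between 60K and 124K of nothing") can be written out as a small calculation. This is a sketch assuming the 0.90 superblock format with a 64 KiB reserved area and a 4 KiB superblock; the function names are hypothetical, and the example device size (70 KiB larger than the 31463232 blocks shown in /proc/mdstat earlier in the thread) is an illustrative guess, not a measured value.

```python
MD_RESERVED_BYTES = 64 * 1024  # assumed: 0.90 superblock sits on a 64 KiB boundary
MD_SB_BYTES = 4096             # assumed: size of the 0.90 superblock itself

def sb_offset_090(dev_size):
    """Byte offset of the superblock: round the device size down to a
    64 KiB boundary, then step back one more 64 KiB."""
    return (dev_size & ~(MD_RESERVED_BYTES - 1)) - MD_RESERVED_BYTES

def unused_tail_090(dev_size):
    """Bytes after the superblock that md never writes."""
    return dev_size - (sb_offset_090(dev_size) + MD_SB_BYTES)

# Hypothetical partition size: 31463232 KiB of data plus 70 KiB.
dev_size = 31463302 * 1024
print(sb_offset_090(dev_size) // 1024)    # 31463232 (KiB)
print(unused_tail_090(dev_size) // 1024)  # 66 (KiB)
```

Under these assumptions the tail after the superblock is always at least 60 KiB and less than 124 KiB, which matches the range quoted in the reply, and nothing keeps that region in sync across components.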
* Re: Massive RAID-1 desync
From: Jean-Baptiste Thomas @ 2015-04-26 8:48 UTC
To: NeilBrown; +Cc: linux-raid

On 2015-04-25 17:25 +1000, NeilBrown wrote:

> Perfectly normal. Metadata is at the end, at least 64K from the end
> and 64K aligned.

Yes. Format 0.90.

> And what were those messages about sda?

The actual messages have been displaced by lockd's rambling but as I
remember, it was this sort of thing :

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: BMDMA stat 0x4
ata1.00: failed command: READ DMA EXT
ata1.00: cmd 25/00:80:a9:54:70/00:00:74:00:00/e0 tag 0 dma 65536 in
         res 51/40:00:25:55:70/40:00:74:00:00/e0 Emask 0x9 (media error)
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1.00: configured for UDMA/133
ata1: EH complete

I ran e2fsck on copies of sda1 and sdc1. They are both heavily damaged,
not just sdc1.

Looks like I'm going to have to replace a disk and see. I'd like to
avoid replacing two, though. Or going through more crashes. Does
MD have a paranoid mode in which reading a sector from a RAID-1
device would not return successfully until it got matching data
from at least two components ?
* Re: Massive RAID-1 desync
From: NeilBrown @ 2015-04-28 21:39 UTC
To: Jean-Baptiste Thomas; +Cc: linux-raid

On Sun, 26 Apr 2015 10:48:37 +0200 (CEST) Jean-Baptiste Thomas
<cau2jeaf1honoq@laposte.net> wrote:

> On 2015-04-25 17:25 +1000, NeilBrown wrote:
>
> > Perfectly normal. Metadata is at the end, at least 64K from the end
> > and 64K aligned.
>
> Yes. Format 0.90.
>
> > And what were those messages about sda?
>
> The actual messages have been displaced by lockd's rambling but as I
> remember, it was this sort of thing :
>
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> ata1.00: BMDMA stat 0x4
> ata1.00: failed command: READ DMA EXT
> ata1.00: cmd 25/00:80:a9:54:70/00:00:74:00:00/e0 tag 0 dma 65536 in
>          res 51/40:00:25:55:70/40:00:74:00:00/e0 Emask 0x9 (media error)
> ata1.00: status: { DRDY ERR }
> ata1.00: error: { UNC }
> ata1.00: configured for UDMA/133
> ata1: EH complete

A clean "media error" on READ should involve the block being rewritten
and, if that fails, the drive being ejected.
I wonder if the controller got confused.

>
> I ran e2fsck on copies of sda1 and sdc1. They are both heavily damaged,
> not just sdc1.

That is rather sad. I'm having trouble imagining any scenario that would
result in the symptoms you are seeing. Very odd.

>
> Looks like I'm going to have to replace a disk and see. I'd like to
> avoid replacing two, though. Or going through more crashes. Does
> MD have a paranoid mode in which reading a sector from a RAID-1
> device would not return successfully until it got matching data
> from at least two components ?

As mentioned separately: no.

If it were me, I'd probably be feeling suspicious of the controller at this
point. If it is a cheap one, maybe replace it.

NeilBrown
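While md has no read-time comparison mode, it does expose an on-demand scrub through sysfs: writing `check` to `sync_action` makes it read all components and count mismatches in `mismatch_cnt`. The sketch below shows how that could be driven from a script; the helper names and the `/sys/block/md1/md` path in the usage comment are assumptions for illustration, and a check only reports that copies differ, not which copy is correct.

```python
from pathlib import Path

def request_check(md_sysfs_dir):
    """Ask md to start a read-only consistency check of the array."""
    Path(md_sysfs_dir, "sync_action").write_text("check\n")

def read_mismatch_cnt(md_sysfs_dir):
    """Return the mismatch count md reported for the last check/repair."""
    return int(Path(md_sysfs_dir, "mismatch_cnt").read_text().strip())

# Usage (needs root; path assumes the array is md1):
# request_check("/sys/block/md1/md")
# ... wait until /sys/block/md1/md/sync_action reads "idle" again ...
# print(read_mismatch_cnt("/sys/block/md1/md"))
```

Run periodically, a check like this would at least flag a desync before the file system trips over it, although it would not have prevented the damage described in this thread.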