* mdadm: failed devices become spares!
  From: Pierre Vignéras
  Date: 2010-05-16 15:40 UTC
  To: linux-raid

Hi,

I encountered a critical problem with mdadm that I first submitted to the
Debian mailing list (the system is Debian lenny/stable). They asked me to
report it here, so that is what I am doing.

To avoid duplicating the description, here is the URL of the bug report:

    http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=578352

If you prefer the full details to be copied to this mailing list, just ask.

Note: the bug happened again today, on another RAID array. The good news is
that it is somewhat reproducible! The bad news is that, unless you have a
magic solution, all my data are lost (half of it was in the backup pipe!)...

Thanks for any help, and regards.
--
Pierre Vignéras
* RE: mdadm: failed devices become spares!
  From: Leslie Rhorer
  Date: 2010-05-16 19:56 UTC
  To: 'Pierre Vignéras', linux-raid

> From: Pierre Vignéras
> Sent: Sunday, May 16, 2010 10:41 AM
> Subject: mdadm: failed devices become spares!
>
> I encountered a critical problem with mdadm that I first submitted to the
> Debian mailing list (the system is Debian lenny/stable). [...]
>
>     http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=578352
>
> Note: the bug happened again today, on another RAID array. The good news
> is that it is somewhat reproducible! The bad news is that, unless you
> have a magic solution, all my data are lost (half of it was in the backup
> pipe!)...

It's not quite clear to me from the link whether your drives are truly
toast or not. If they are, then you are hosed. Assuming not, you need to use

    `mdadm --examine /dev/sdxx` and `mdadm -Dt /dev/mdyy`

to determine precisely all the parameters and the order of the block
devices in the array. You need the chunk size, the superblock type, which
slot was occupied by each device in the array (this may not be the same as
when the array was created), the size of the array (if it did not fill the
entire partition in every case), the RAID level, etc. Once you are certain
you have all the information needed to re-create the array if need be, then
try to re-assemble the array with

    `mdadm --assemble --force /dev/mdyy`

If it works, then fsck the file system. (I think I noticed you are using
XFS. If so, do not use xfs_check. Instead, use xfs_repair with the -n
option.) After you have a clean file system, issue the command

    `echo repair > /sys/block/mdyy/md/sync_action`

to re-sync the array. If the array does not assemble, then you will need to
stop it and re-create it using the options you obtained from your research
above, adding the --assume-clean switch to prevent a resync in case
something is wrong. If the fsck won't work after re-creating the array,
then you probably got one or more of the parameters incorrect.
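As a rough illustration of the sequence Leslie describes (the device names
/dev/sd[c-f]1 and /dev/mdY are placeholders here, not values taken from the
report), the inspection and forced reassembly might look like this:

    # Record every parameter before touching anything
    mdadm --examine /dev/sd[c-f]1 > /root/md-examine.txt
    mdadm -Dt /dev/mdY >> /root/md-examine.txt 2>&1

    # A forced assembly is the least invasive first attempt
    mdadm --assemble --force /dev/mdY /dev/sd[c-f]1

    # If it assembles, check the filesystem without modifying it
    # (XFS example; point it at the LVM volume instead if one sits on top)
    xfs_repair -n /dev/mdY

    # Only once the filesystem is clean, resync the array
    echo repair > /sys/block/mdY/md/sync_action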
* Re: mdadm: failed devices become spares!
  From: Pierre Vignéras
  Date: 2010-05-17 18:10 UTC
  To: Leslie Rhorer; Cc: linux-raid

On Sunday 16 May 2010, Leslie Rhorer wrote:
> It's not quite clear to me from the link whether your drives are truly
> toast or not. If they are, then you are hosed. Assuming not, you need to
> use
>
>     `mdadm --examine /dev/sdxx` and `mdadm -Dt /dev/mdyy`
>
> to determine precisely all the parameters and the order of the block
> devices in the array. [...] Once you are certain you have all the
> information needed to re-create the array if need be, then try to
> re-assemble the array with
>
>     `mdadm --assemble --force /dev/mdyy`
>
> If it works, then fsck the file system. [...] If the array does not
> assemble, then you will need to stop it and re-create it using the options
> you obtained from your research above, adding the --assume-clean switch
> to prevent a resync in case something is wrong. If the fsck won't work
> after re-creating the array, then you probably got one or more of the
> parameters incorrect.

Thanks for your help. Here is what I did:

# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid6] [raid5] [raid4]
[...]
md2 : inactive sdc1[2](S) sdd1[5](S) sdf1[4](S) sde1[3](S)
      1250274304 blocks
[...]
# mdadm --examine /dev/sdc1 /dev/sdd1 /dev/sdf1 /dev/sde1
/dev/sdc1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : b34f4192:f823df58:24bf28c1:396de87f (local to host phobos)
  Creation Time : Thu Aug  6 01:59:44 2009
     Raid Level : raid10
  Used Dev Size : 312568576 (298.09 GiB 320.07 GB)
     Array Size : 625137152 (596.18 GiB 640.14 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 2

    Update Time : Tue Apr 13 19:22:21 2010
          State : clean
Internal Bitmap : present
 Active Devices : 2
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 2
       Checksum : 5baf7939 - correct
         Events : 90612

         Layout : near=2, far=1
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     2       8       33        2      active sync   /dev/sdc1

   0     0       0        0        0      removed
   1     1       0        0        1      faulty removed
   2     2       8       33        2      active sync   /dev/sdc1
   3     3       8       65        3      active sync   /dev/sde1
   4     4       8       81        4      spare   /dev/sdf1
   5     5       8       49        5      spare   /dev/sdd1
/dev/sdd1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : b34f4192:f823df58:24bf28c1:396de87f (local to host phobos)
  Creation Time : Thu Aug  6 01:59:44 2009
     Raid Level : raid10
  Used Dev Size : 312568576 (298.09 GiB 320.07 GB)
     Array Size : 625137152 (596.18 GiB 640.14 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 2

    Update Time : Tue Apr 13 19:22:21 2010
          State : clean
Internal Bitmap : present
 Active Devices : 2
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 2
       Checksum : 5baf7949 - correct
         Events : 90612

         Layout : near=2, far=1
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     5       8       49        5      spare   /dev/sdd1

   0     0       0        0        0      removed
   1     1       0        0        1      faulty removed
   2     2       8       33        2      active sync   /dev/sdc1
   3     3       8       65        3      active sync   /dev/sde1
   4     4       8       81        4      spare   /dev/sdf1
   5     5       8       49        5      spare   /dev/sdd1
/dev/sdf1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : b34f4192:f823df58:24bf28c1:396de87f (local to host phobos)
  Creation Time : Thu Aug  6 01:59:44 2009
     Raid Level : raid10
  Used Dev Size : 312568576 (298.09 GiB 320.07 GB)
     Array Size : 625137152 (596.18 GiB 640.14 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 2

    Update Time : Tue Apr 13 19:22:21 2010
          State : clean
Internal Bitmap : present
 Active Devices : 2
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 2
       Checksum : 5baf7967 - correct
         Events : 90612

         Layout : near=2, far=1
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     4       8       81        4      spare   /dev/sdf1

   0     0       0        0        0      removed
   1     1       0        0        1      faulty removed
   2     2       8       33        2      active sync   /dev/sdc1
   3     3       8       65        3      active sync   /dev/sde1
   4     4       8       81        4      spare   /dev/sdf1
   5     5       8       49        5      spare   /dev/sdd1
/dev/sde1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : b34f4192:f823df58:24bf28c1:396de87f (local to host phobos)
  Creation Time : Thu Aug  6 01:59:44 2009
     Raid Level : raid10
  Used Dev Size : 312568576 (298.09 GiB 320.07 GB)
     Array Size : 625137152 (596.18 GiB 640.14 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 2

    Update Time : Tue Apr 13 19:22:21 2010
          State : clean
Internal Bitmap : present
 Active Devices : 2
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 2
       Checksum : 5baf795b - correct
         Events : 90612

         Layout : near=2, far=1
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     3       8       65        3      active sync   /dev/sde1

   0     0       0        0        0      removed
   1     1       0        0        1      faulty removed
   2     2       8       33        2      active sync   /dev/sdc1
   3     3       8       65        3      active sync   /dev/sde1
   4     4       8       81        4      spare   /dev/sdf1
   5     5       8       49        5      spare   /dev/sdd1

# mdadm -Dt /dev/md2
mdadm: md device /dev/md2 does not appear to be active.
phobos:~#
# mdadm --assemble --force /dev/md2
mdadm: /dev/md2 assembled from 2 drives and 2 spares - not enough to start
the array.
#

What I don't get is how those devices, /dev/sdf1 and /dev/sdd1, came to be
marked as spares after being marked as faulty! I never asked for that. As
shown at the previous Debian bug link (repeated here for your convenience):

    http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=578352

<bug description extract>
...
Apr 12 20:10:02 phobos mdadm[3157]: Fail event detected on md device
/dev/md2, component device /dev/sdf1
Apr 12 20:11:02 phobos mdadm[3157]: SpareActive event detected on md device
/dev/md2, component device /dev/sdf1

Is that last line normal? It seems to me that this failed drive has been
made a spare! (I really hope that I have misunderstood something.) Is it
possible that the USB system (with its "plug'n play" sort-of feature) made
the behaviour of mdadm so strange?
</bug>

And the next question is: how do I activate those two spare drives? I was
expecting mdadm to use them automagically.

Did I miss something, or is there something really strange happening there?

Thanks again.
--
Pierre Vignéras
* Re: mdadm: failed devices become spares!
  From: Tim Small
  Date: 2010-05-17 21:09 UTC
  To: Pierre Vignéras; Cc: Leslie Rhorer, linux-raid, 578352

Pierre Vignéras wrote:
> And the next question is: how do I activate those two spare drives? I was
> expecting mdadm to use them automagically.

If you want to experiment with different ways of getting the data back, but
without risking writing anything to the drives, you could do this:

1. Use dmsetup to create copy-on-write "virtual drives" which "see through"
   to the content of your real drives, but don't risk writing anything at
   all to them.

2. Use mdadm --create --assume-clean ... /dev/mapper/cow_drive_1 ... to
   force mdadm to put the array back together the way you think it was (the
   output of --examine will be useful here). You'll need to specify (at
   least, from memory):

   - the chunk size
   - the metadata version (this affects the metadata location on the drives)
   - the correct device order (with or without a single failed drive)

... after that you can run a read-only (or read-write) check on the COW md
device to verify that you've got your data back, then mount it read-only,
etc. Once you're happy that your commands are going to get things running
again, you can run them "for real" on the non-COW devices.

See the recent list archives for my post on using a similar set of commands
for HW RAID data forensics, along with references.

HTH,

Tim.
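A minimal sketch of the overlay approach Tim describes, using the
device-mapper snapshot target; the file size, loop device, and member name
below are illustrative assumptions, not values taken from this thread:

    # Sparse file to absorb any writes diverted away from the real drive
    dd if=/dev/zero of=/tmp/cow_sdc1.img bs=1M count=0 seek=4096
    losetup /dev/loop0 /tmp/cow_sdc1.img

    # Non-persistent ("N") snapshot: reads fall through to /dev/sdc1,
    # writes land only on the loop device (chunk size 8 sectors = 4 KiB)
    SECTORS=$(blockdev --getsz /dev/sdc1)
    echo "0 $SECTORS snapshot /dev/sdc1 /dev/loop0 N 8" | dmsetup create cow_sdc1

    # Repeat for each member, then experiment only on /dev/mapper/cow_*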
* Re: mdadm: failed devices become spares!
  From: Neil Brown
  Date: 2010-05-18  1:30 UTC
  To: Pierre Vignéras; Cc: Leslie Rhorer, linux-raid

On Mon, 17 May 2010 20:10:36 +0200 Pierre Vignéras <pierre@vigneras.name> wrote:

> Did I miss something, or is there something really strange happening there?

Something strange... I cannot explain the 'SpareActive' messages. Most of
the rest makes sense.

You had a RAID10 - 4 drives in near=2 mode. So the first two disks contain
identical data, and the second two are also identical and contain the rest.

The second device failed due to a write error. Why it seemed to become a
spare I'm not sure. I'm not at all sure it did become a spare immediately -
your logs aren't conclusive on that point. It did eventually become a
spare, but that could be because you "removed and added the devices", which
would have changed them from 'failed' to 'spares'.

Then the first device in the array reported an error and so was failed.
After this you would not be able to read or write the even chunks of the
array; xfs noticed and complained.

By this time sdf1 seemed to be a spare, so md gave recovery a try. The
recovery process discovered there was nowhere to read good data from and
immediately gave up.

However, if the devices really are OK, then sdf1 and sdc1 should contain
identical data (except that the superblock would be slightly different).
You could check this with "cmp -l", though that might not be very
efficient. Also, sdd1 and sde1 should be identical.

I suggest that you try:

    mdadm -S /dev/md2
    mdadm -C /dev/md2 -l 10 -n 4 -c 64 -e 0.90 /dev/sdc1 missing /dev/sdd1 missing --assume-clean

and then see what the data on md2 looks like. You could equally try sdf1 in
place of sdc1, or sde1 in place of sdd1 (make sure you double-check the
device names, don't assume I got them right).

Once you have a combination that looks good, you can add the other two
devices and they will recover, and you should have your data back.

BUT be warned. Something caused some errors to be reported. Unless you find
out what that was and fix it, errors will occur again. I have no idea what
might have caused those errors. Bad media? Bad controller? Bad USB
controller? Bad luck? I wouldn't write new data, or even perform a
recovery, until you are quite confident of the devices.

NeilBrown
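One way to check the result of such a re-creation without modifying the
data, assuming (as later messages in this thread suggest) that XFS sits on
top of the array via LVM; the volume and mount paths here follow Pierre's
own commands further down and may need adjusting:

    mdadm -S /dev/md2
    mdadm -C /dev/md2 -l 10 -n 4 -c 64 -e 0.90 --assume-clean \
          /dev/sdc1 missing /dev/sdd1 missing
    vgchange -a y                     # only if LVM is layered on top
    xfs_repair -n /dev/my-vg/my-lv    # report-only check, no writes
    mount -o ro /dev/my-vg/my-lv /mnt/tmp && ls /mnt/tmp && umount /mnt/tmp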
* Re: mdadm: failed devices become spares!
  From: Neil Brown
  Date: 2010-05-18  2:06 UTC
  To: Neil Brown; Cc: Pierre Vignéras, Leslie Rhorer, linux-raid

On Tue, 18 May 2010 11:30:16 +1000 Neil Brown <neilb@suse.de> wrote:

> On Mon, 17 May 2010 20:10:36 +0200 Pierre Vignéras <pierre@vigneras.name> wrote:
>
> > Did I miss something, or is there something really strange happening there?
>
> Something strange...
> I cannot explain the 'SpareActive' messages.

Actually, I think I can explain that.

When a device fails it gets marked as faulty; then, as soon as there is no
more pending IO, it gets moved out of the array. "mdadm -D" will show it
with a larger 'Number' and a 'RaidDevice' of '-'. Normally these happen
almost as a single operation, though a lot of pending IO can slow it down.

"mdadm --monitor" identifies devices based on 'Number', so it would
normally see a working device disappear - which is reported as a failure -
and a 'faulty/spare' device appear, which it ignores.

However, if --monitor gets to check the array between the above two events,
it will first see that the working drive is now faulty, so it reports a
failure, and then see that the faulty device isn't faulty any more and in
fact isn't even there. The "isn't even there" bit doesn't register, and it
treats it as 'SpareActive'.

I should fix that.

So I'm quite sure now that your devices didn't really become spares until
you removed and added them, which is exactly the way to turn failed devices
into spares.

NeilBrown
* Re: mdadm: failed devices become spares!
  From: MRK
  Date: 2010-05-18 22:25 UTC
  Cc: linux-raid

On 05/18/2010 04:06 AM, Neil Brown wrote:
> However, if --monitor gets to check the array between the above two
> events, it will first see that the working drive is now faulty, so it
> reports a failure, and then see that the faulty device isn't faulty any
> more and in fact isn't even there. The "isn't even there" bit doesn't
> register, and it treats it as 'SpareActive'.
>
> I should fix that.

However, in one case the two events are not detected in the same polling
round:

Apr 12 20:10:02 phobos mdadm[3157]: Fail event detected on md device
/dev/md2, component device /dev/sdf1
Apr 12 20:11:02 phobos mdadm[3157]: SpareActive event detected on md device
/dev/md2, component device /dev/sdf1

One minute passes between the two entries; I suppose that's the mdadm
daemon's polling interval.

In the other case all the entries carry the same timestamp:

Apr 13 08:00:02 phobos mdadm[3157]: Fail event detected on md device
/dev/md2, component device /dev/sdd1
Apr 13 08:00:02 phobos mdadm[3157]: SpareActive event detected on md device
/dev/md2, component device /dev/sdd1
Apr 13 08:00:02 phobos last message repeated 7 times
[...many more of those messages...]

...plus, in this second case the SpareActive triggers many times within
that same second. (Pierre, you cut the log short, but are all of those
repeated messages at the exact same time, or do they span a few seconds?)

It looks to me like some kind of USB failure where the USB connection or
USB bridge momentarily fails, then immediately gets re-detected and
re-added to the system. But since there are no usb entries in dmesg, that
would also point to an issue in the usb driver. Could the problem also be a
mixture with some unwise udev triggers in Debian, maybe somehow causing the
auto-re-add of the drive to the RAID?

Pierre:
- can you post your mdadm.conf?
- USB is not good for RAID, imho. Many times I have seen problems with
  USB/SATA bridges where the drive would get disconnected under high I/O
  activity and then reconnected after a few seconds. Anyway, re-adding it
  to the RAID shouldn't have happened. Also, in my case there were "usb"
  entries in dmesg.
* Re: mdadm: failed devices become spares!
  From: Simon Matthews
  Date: 2010-05-19 19:56 UTC
  To: MRK; Cc: linux-raid

On Tue, May 18, 2010 at 3:25 PM, MRK <mrk@shiftmail.org> wrote:
> - USB is not good for RAID, imho.

I can second that. At one time I had a USB backup drive that was configured
as half of a RAID 1 set, so that the drive could immediately be used in the
event of a massive failure of the file server.

Pulling this USB drive before stopping the RAID device caused the machine
to become unresponsive. I think it was trying to do some kind of I/O; all I
know is that a hard boot was the only way I could get the machine out of
that condition.

Simon
* Re: mdadm: failed devices become spares!
  From: Pierre Vignéras
  Date: 2010-05-21 21:00 UTC
  To: MRK, linux-raid

On Wednesday 19 May 2010, MRK wrote:
> ...plus, in this second case the SpareActive triggers many times within
> that same second. (Pierre, you cut the log short, but are all of those
> repeated messages at the exact same time, or do they span a few seconds?)

Well, I was probably tired when I filtered the log for the bug report. It
seems that the 'last message repeated 7 times' refers to:

Apr 13 08:00:02 phobos kernel: [5814019.208017] nfsd: non-standard errno: 5

not to:

Apr 13 08:00:02 phobos mdadm[3157]: SpareActive event detected on md device
/dev/md2, component device /dev/sdd1

I looked into my log and can't find anything else. Sorry, sorry, sorry if
this led you to false conclusions.

> It looks to me like some kind of USB failure where the USB connection or
> USB bridge momentarily fails, then immediately gets re-detected and
> re-added to the system. [...]
>
> Pierre:
> - can you post your mdadm.conf?

Sure, but I am not certain it will be useful:

$ cat /etc/mdadm/mdadm.conf
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE partitions

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=13f4fdef:db0bd815:77e02d4f:1bda00b4
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=4a120782:2ed3053c:e99784b3:b8e5f7bf
ARRAY /dev/md4 level=raid1 num-devices=2 UUID=b3c7212a:e95c5081:24bf28c1:396de87f
ARRAY /dev/md2 level=raid10 num-devices=4 UUID=b34f4192:f823df58:24bf28c1:396de87f
ARRAY /dev/md3 level=raid5 num-devices=3 UUID=e1f30f82:0999431b:24bf28c1:396de87f

> - USB is not good for RAID, imho. Many times I have seen problems with
> USB/SATA bridges where the drive would get disconnected under high I/O
> activity and then reconnected after a few seconds. Anyway, re-adding it
> to the RAID shouldn't have happened. Also, in my case there were "usb"
> entries in dmesg.

Well, that is what I am discovering: USB and RAID do not currently work
well together (hmm, on Debian stable at least; I am not sure we can say
"currently" - the kernel is:

$ uname -a
Linux phobos 2.6.26-2-686 #1 SMP Tue Mar 9 17:35:51 UTC 2010 i686 GNU/Linux
$

). Anyway, it would be a great feature if USB could be used for a RAID
setup, at least for end users. (My setup actually uses a "special" layout
for RAID over several heterogeneous drives, which I described here:
http://www.linuxconfig.org/prouhd-raid-for-the-end-user )

Thanks for your help and regards.
--
Pierre Vignéras
* Re: mdadm: failed devices become spares! -> Solved!
  From: Pierre Vignéras
  Date: 2010-05-21 21:27 UTC
  To: linux-raid

Sorry for the delay of my reply... This small mail is to let you know that
my RAID array is currently recovering, thanks to the valuable input from
the users of this mailing list. You are great!

For the curious, here is what I did:

# ##### Do not forget the '--assume-clean', as I almost did! ;-(
# mdadm -C /dev/md2 -l 10 -n 4 -c 64 -e 0.90 --assume-clean /dev/sdd1 missing /dev/sdc1 missing
# vgchange -a y
# xfs_repair -n -t 1 -v /dev/my-vg/my-lv
# mount -o ro /dev/my-vg/my-lv /mnt/tmp
# find /mnt/tmp
# du -ks /mnt/tmp/
# umount /mnt/tmp
# #### Required: XFS asked for the log to be replayed
# mount /dev/my-vg/my-lv /mnt/tmp/
# umount /mnt/tmp
# xfs_repair -t 1 -v /dev/my-vg/my-lv
# mdadm --manage /dev/md2 --add /dev/sde1
# mdadm --manage /dev/md2 --add /dev/sdf1

The array is currently at 25% of the recovery process. A bit too soon to
say that everything is fine...

By the way, I am now quite sure that my USB controllers (or the usb driver,
or something else in the chain other than the disks themselves) are buggy:
all the other RAID arrays in my setup are gone! I will try to recover them
using the same kind of process, in order to back up all the data.

Do you think that by using BBR (since each time the trouble started with a
sector (write?) error), the problem will be "solved" (or at least postponed
until BBR itself runs out of spare sectors)?

Anyway, again, thanks a lot to all of you. Open Source rocks! ;-)
--
Pierre Vignéras
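For anyone following a rebuild like this, the recovery progress can be
watched with something along these lines (md2 as in the commands above):

    watch -d cat /proc/mdstat
    cat /sys/block/md2/md/sync_completed   # sectors resynced / total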
* Re: mdadm: failed devices become spares!
  From: Pierre Vignéras
  Date: 2010-05-18 23:07 UTC
  To: Neil Brown; Cc: Leslie Rhorer, linux-raid

On Tuesday 18 May 2010, Neil Brown wrote:
> Something strange... I cannot explain the 'SpareActive' messages. Most of
> the rest makes sense.
>
> You had a RAID10 - 4 drives in near=2 mode. So the first two disks
> contain identical data, and the second two are also identical and contain
> the rest.
> [...]
> However, if the devices really are OK, then sdf1 and sdc1 should contain
> identical data (except that the superblock would be slightly different).
> You could check this with "cmp -l", though that might not be very
> efficient. Also, sdd1 and sde1 should be identical.

Well, actually, here is what I have:

phobos:~# mdadm --examine /dev/sd[c-f]1
/dev/sdc1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : b34f4192:f823df58:24bf28c1:396de87f (local to host phobos)
  Creation Time : Thu Aug  6 01:59:44 2009
     Raid Level : raid10
  Used Dev Size : 312568576 (298.09 GiB 320.07 GB)
     Array Size : 625137152 (596.18 GiB 640.14 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 2

    Update Time : Tue Apr 13 19:22:21 2010
          State : clean
Internal Bitmap : present
 Active Devices : 2
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 2
       Checksum : 5baf7939 - correct
         Events : 90612

         Layout : near=2, far=1
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     2       8       33        2      active sync   /dev/sdc1

   0     0       0        0        0      removed
   1     1       0        0        1      faulty removed
   2     2       8       33        2      active sync   /dev/sdc1
   3     3       8       65        3      active sync   /dev/sde1
   4     4       8       81        4      spare   /dev/sdf1
   5     5       8       49        5      spare   /dev/sdd1
/dev/sdd1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : b34f4192:f823df58:24bf28c1:396de87f (local to host phobos)
  Creation Time : Thu Aug  6 01:59:44 2009
     Raid Level : raid10
  Used Dev Size : 312568576 (298.09 GiB 320.07 GB)
     Array Size : 625137152 (596.18 GiB 640.14 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 2

    Update Time : Tue Apr 13 19:22:21 2010
          State : clean
Internal Bitmap : present
 Active Devices : 2
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 2
       Checksum : 5baf7949 - correct
         Events : 90612

         Layout : near=2, far=1
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     5       8       49        5      spare   /dev/sdd1

   0     0       0        0        0      removed
   1     1       0        0        1      faulty removed
   2     2       8       33        2      active sync   /dev/sdc1
   3     3       8       65        3      active sync   /dev/sde1
   4     4       8       81        4      spare   /dev/sdf1
   5     5       8       49        5      spare   /dev/sdd1
/dev/sde1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : b34f4192:f823df58:24bf28c1:396de87f (local to host phobos)
  Creation Time : Thu Aug  6 01:59:44 2009
     Raid Level : raid10
  Used Dev Size : 312568576 (298.09 GiB 320.07 GB)
     Array Size : 625137152 (596.18 GiB 640.14 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 2

    Update Time : Tue Apr 13 19:22:21 2010
          State : clean
Internal Bitmap : present
 Active Devices : 2
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 2
       Checksum : 5baf795b - correct
         Events : 90612

         Layout : near=2, far=1
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     3       8       65        3      active sync   /dev/sde1

   0     0       0        0        0      removed
   1     1       0        0        1      faulty removed
   2     2       8       33        2      active sync   /dev/sdc1
   3     3       8       65        3      active sync   /dev/sde1
   4     4       8       81        4      spare   /dev/sdf1
   5     5       8       49        5      spare   /dev/sdd1
/dev/sdf1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : b34f4192:f823df58:24bf28c1:396de87f (local to host phobos)
  Creation Time : Thu Aug  6 01:59:44 2009
     Raid Level : raid10
  Used Dev Size : 312568576 (298.09 GiB 320.07 GB)
     Array Size : 625137152 (596.18 GiB 640.14 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 2

    Update Time : Tue Apr 13 19:22:21 2010
          State : clean
Internal Bitmap : present
 Active Devices : 2
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 2
       Checksum : 5baf7967 - correct
         Events : 90612

         Layout : near=2, far=1
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     4       8       81        4      spare   /dev/sdf1

   0     0       0        0        0      removed
   1     1       0        0        1      faulty removed
   2     2       8       33        2      active sync   /dev/sdc1
   3     3       8       65        3      active sync   /dev/sde1
   4     4       8       81        4      spare   /dev/sdf1
   5     5       8       49        5      spare   /dev/sdd1
phobos:~#

> I suggest that you try:
>
>     mdadm -S /dev/md2
>     mdadm -C /dev/md2 -l 10 -n 4 -c 64 -e 0.90 /dev/sdc1 missing /dev/sdd1 missing --assume-clean
>
> and then see what the data on md2 looks like. You could equally try sdf1
> in place of sdc1, or sde1 in place of sdd1 (make sure you double-check
> the device names, don't assume I got them right).

So, I double-checked the names. ;-)

I first tried to work out which devices were mirrors using cmp -l (thanks
for that command, I didn't know it), and here is the (strange) result:

phobos:~# time cmp -l /dev/sdc1 /dev/sdd1 > /tmp/cmp-sdc1-sdd1
^C
real    0m56.337s
user    0m52.539s
sys     0m3.016s
phobos:~# time cmp -l /dev/sdc1 /dev/sde1 > /tmp/cmp-sdc1-sde1
^C
real    0m54.733s
user    0m0.380s
sys     0m7.688s
phobos:~# time cmp -l /dev/sdc1 /dev/sdf1 > /tmp/cmp-sdc1-sdf1
^C
real    0m58.236s
user    0m54.099s
sys     0m3.216s
phobos:~# time cmp -l /dev/sdd1 /dev/sde1 > /tmp/cmp-sdd1-sde1
^C
real    0m57.932s
user    0m53.063s
sys     0m3.284s
phobos:~# time cmp -l /dev/sdd1 /dev/sdf1 > /tmp/cmp-sdd1-sdf1
^C
real    0m58.882s
user    0m26.486s
sys     0m6.152s
phobos:~# time cmp -l /dev/sde1 /dev/sdf1 > /tmp/cmp-sde1-sdf1
^C
real    0m57.996s
user    0m49.639s
sys     0m3.100s
phobos:~# ls -lh /tmp/cmp-sd*
-rw-r--r-- 1 root root 954M 2010-05-19 00:23 /tmp/cmp-sdc1-sdd1
-rw-r--r-- 1 root root    0 2010-05-19 00:25 /tmp/cmp-sdc1-sde1
-rw-r--r-- 1 root root 982M 2010-05-19 00:27 /tmp/cmp-sdc1-sdf1
-rw-r--r-- 1 root root 964M 2010-05-19 00:28 /tmp/cmp-sdd1-sde1
-rw-r--r-- 1 root root 466M 2010-05-19 00:30 /tmp/cmp-sdd1-sdf1
-rw-r--r-- 1 root root 872M 2010-05-19 00:31 /tmp/cmp-sde1-sdf1
phobos:~#

Therefore, as far as I understand, /dev/sdc1 does not hold the same data as
/dev/sdd1 or /dev/sdf1. Even if this short ~1 minute test does not prove
anything, there is a good probability that /dev/sdc1 and /dev/sde1 were
mirrors at some point.

Which should be considered strange: that sdc1 contains exactly the same
content as sde1 over that one-minute scan, or that sdd1 and sdf1 are so
different (~500 MB of differences per minute of scanning)?

Therefore, I am not sure that the command you suggested is the right one:

    mdadm -C /dev/md2 -l 10 -n 4 -c 64 -e 0.90 /dev/sdc1 missing /dev/sdd1 missing --assume-clean

It seems that I only have half of the data for sure (sdc1 and sde1), but I
don't know which is the other good half (sdd1 or sdf1)... Is there any way
to know?

Given this information, can you confirm that the above command is the one I
should execute?

> BUT be warned. Something caused some errors to be reported. Unless you
> find out what that was and fix it, errors will occur again. I have no
> idea what might have caused those errors. Bad media? Bad controller? Bad
> USB controller? Bad luck?

Well, all of those, maybe! Anyway, I will consider using BBR. I have the
feeling that on such mass-market 1 TB USB drives, even the internal
"hardware" bad-block relocation is not sufficient. There are too many
errors (at least, that is what my log suggests)... It's a shame that BBR is
not well documented and not as easy to set up with mdadm as it is with
EVMS.

> I wouldn't write new data, or even perform a recovery, until you are
> quite confident of the devices.

Sure.

Again, thanks a lot!
--
Pierre Vignéras
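A way to make that kind of spot check reproducible is to bound the
comparison explicitly instead of interrupting it; this assumes GNU cmp's
-n/--bytes option, and the 4 GiB limit and pairings are only illustrative:

    # Count differing bytes in the first 4 GiB of each candidate pair
    for pair in "sdc1 sde1" "sdd1 sdf1"; do
        set -- $pair
        printf '%s vs %s: ' "$1" "$2"
        cmp -l -n $((4 * 1024 * 1024 * 1024)) "/dev/$1" "/dev/$2" | wc -l
    done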
* Re: mdadm: failed devices become spares!
  From: Neil Brown
  Date: 2010-05-19  1:45 UTC
  To: Pierre Vignéras; Cc: Leslie Rhorer, linux-raid

On Wed, 19 May 2010 01:07:40 +0200 Pierre Vignéras <pierre@vigneras.name> wrote:

> Well, actually, here is what I have:
>
> phobos:~# mdadm --examine /dev/sd[c-f]1
> [...]
>
> I first tried to work out which devices were mirrors using cmp -l (thanks
> for that command, I didn't know it), and here is the (strange) result:
>
> [...]
>
> phobos:~# ls -lh /tmp/cmp-sd*
> -rw-r--r-- 1 root root 954M 2010-05-19 00:23 /tmp/cmp-sdc1-sdd1
> -rw-r--r-- 1 root root    0 2010-05-19 00:25 /tmp/cmp-sdc1-sde1
> -rw-r--r-- 1 root root 982M 2010-05-19 00:27 /tmp/cmp-sdc1-sdf1
> -rw-r--r-- 1 root root 964M 2010-05-19 00:28 /tmp/cmp-sdd1-sde1
> -rw-r--r-- 1 root root 466M 2010-05-19 00:30 /tmp/cmp-sdd1-sdf1
> -rw-r--r-- 1 root root 872M 2010-05-19 00:31 /tmp/cmp-sde1-sdf1
> phobos:~#

The fact that sdc1 appears to have the same content as sde1 perfectly
matches the fact that these two devices think they are devices "2" and "3"
in the array, so they still contain half of your data. This is good.

The fact that sdf1 appears to match sdd1 partly but not completely suggests
that they were devices "0" and "1", but that one of them has had other
stuff written to it. It is hard to know, based on the available
information, which is the case.

> It seems that I only have half of the data for sure (sdc1 and sde1), but
> I don't know which is the other good half (sdd1 or sdf1)... Is there any
> way to know?

The way to find out is to try and see. If you create an array following the
above pattern it will not change any data on the devices, just the
superblock, of which you now have a record in this email. So you should try
creating an array, run "fsck -n", and see if the filesystem looks OK. If it
does, mount it (-o ro) and see what it looks like. Then try the other
possibility and see how that compares.

Given the current names of the devices, the device list given to the mdadm
command should be:

    /dev/sdd1 missing /dev/sdc1 missing
or
    /dev/sdf1 missing /dev/sdc1 missing

Hopefully one of those will mount and fsck successfully.

NeilBrown
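A sketch of trying the two candidate orderings non-destructively, under the
same assumption that only the superblocks are rewritten; the LVM and mount
paths follow Pierre's own commands elsewhere in this thread and may need
adjusting to the actual layout:

    # First candidate: sdd1 as device 0
    mdadm -S /dev/md2
    mdadm -C /dev/md2 -l 10 -n 4 -c 64 -e 0.90 --assume-clean \
          /dev/sdd1 missing /dev/sdc1 missing
    vgchange -a y
    xfs_repair -n -t 1 /dev/my-vg/my-lv && mount -o ro /dev/my-vg/my-lv /mnt/tmp

    # If that does not look right: umount, stop md2, and repeat with
    # /dev/sdf1 missing /dev/sdc1 missing instead.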