No response?

linux-raid.vger.kernel.org archive mirror

* No response?
@ 2005-01-20 17:55 David Dougall
  2005-01-20 18:12 ` Peter T. Breuer
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: David Dougall @ 2005-01-20 17:55 UTC (permalink / raw)
  To: linux-raid

Perhaps I was asking a stupid question or an obvious one, but I have
received no response.
Maybe if I simplify the question...

If I am running software raid1 and a disk device starts throwing I/O
errors, is the filesystem supposed to see any indication of this?  I
thought software raid would mask all of this and just fail the drive.

I have servers with xfs as the filesystem, and xfs will start to throw I/O
errors when a disk starts acting up, even with software raid in between.
Please advise on how I can confirm my setup or, if this is possibly a bug,
how to diagnose it further.
If it makes a difference, I am running linux-2.4.26.
Thanks
--David Dougall

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: No response?
  2005-01-20 17:55 No response? David Dougall
@ 2005-01-20 18:12 ` Peter T. Breuer
  2005-01-20 18:14 ` Gordon Henderson
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Peter T. Breuer @ 2005-01-20 18:12 UTC (permalink / raw)
  To: linux-raid

David Dougall <davidd@et.byu.edu> wrote:
> If I am running software raid1 and a disk device starts throwing I/O
> errors, Is the filesystem supposed to see any indication of this?  I

No - not if the error is on only one disk. The first error will fault
the disk from the array and the driver will retry the read, and must
retry from another disk (the first is no longer there).

The actual i/o to the disks does not form part of the i/o to the raid
array itself, so there is little chance of contamination between the
two. The raid i/o is only acked back to the user when one of the disk 
i/o's (on read) has succeeded.
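
A minimal sketch of the retry behaviour described above, written in Python
purely as an illustration (it is not the md driver's code). The Disk and
Raid1 classes and their method names are hypothetical. The only point is
that a read error on one member faults that member and the read is retried
on another, so the caller sees an error only when every member has failed.

```python
class DiskError(Exception):
    """Raised by a simulated member disk when a read fails."""


class Disk:
    def __init__(self, name, blocks, bad=()):
        self.name = name
        self.blocks = blocks      # list of bytes objects, one per block
        self.bad = set(bad)       # block numbers that return an I/O error

    def read(self, n):
        if n in self.bad:
            raise DiskError(f"{self.name}: I/O error on block {n}")
        return self.blocks[n]


class Raid1:
    def __init__(self, disks):
        self.active = list(disks)  # members still in the array
        self.faulted = []          # members kicked out after an error

    def read(self, n):
        # Try the remaining members in turn. The caller only sees an
        # error if every member has failed.
        while self.active:
            disk = self.active[0]
            try:
                return disk.read(n)
            except DiskError:
                # The first error faults the disk; the retry goes to
                # the next member.
                self.faulted.append(self.active.pop(0))
        raise DiskError(f"all mirrors failed reading block {n}")


data = [b"a", b"b", b"c"]
array = Raid1([Disk("sda", data, bad={1}), Disk("sdb", data)])
print(array.read(1), [d.name for d in array.faulted])
```

In this model the filesystem above the array never sees the failure of
"sda"; it just gets the block from "sdb".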

> thought software raid would mask all of this and just fail the drive.
> 
> I have servers with xfs as the filesystem and xfs will start to throw I/O
> errors when a disk starts acting up even with software raid in between.

That's strange. But not impossible - coding for error situations is
always difficult, and more difficult to test.

> If it makes a difference, I am running linux-2.4.26

Peter


* Re: No response?
  2005-01-20 17:55 No response? David Dougall
  2005-01-20 18:12 ` Peter T. Breuer
@ 2005-01-20 18:14 ` Gordon Henderson
  2005-01-20 18:37   ` Mark Bellon
  2005-01-20 18:21 ` Mike Hardy
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Gordon Henderson @ 2005-01-20 18:14 UTC (permalink / raw)
  To: David Dougall; +Cc: linux-raid

On Thu, 20 Jan 2005, David Dougall wrote:

> Perhaps I was asking a stupid question or an obvious one, but I have
> received not response.
> Maybe if I simplify the question...
>
> If I am running software raid1 and a disk device starts throwing I/O
> errors, Is the filesystem supposed to see any indication of this?

No..

>  I
> thought software raid would mask all of this and just fail the drive.

It should.

> I have servers with xfs as the filesystem and xfs will start to throw I/O
> errors when a disk starts acting up even with software raid in between.
> Please advise on how I can confirm my setup or if this is possibly a bug
> how to diagnose further.

I've experienced long delays (30 seconds? It seemed longer) in a system
when a disk fails for a genuine reason. (I've deliberately run badblocks
on an md device when I knew one of the underlying devices had genuine bad
blocks.) Maybe the md code tries really hard to read the block, or maybe
the underlying device driver does, but in these cases I've seen the system
more or less freeze (all processes accessing that device, anyway) until
the raid code decided to kick the device out of the array.

Maybe XFS has a timer and doesn't like devices to "go away" for a long
period of time?

> If it makes a difference, I am running linux-2.4.26

I've used 2.4.x for a long time - I did try xfs about a year ago, but
wasn't happy with it all (for various reasons).

Gordon


* Re: No response?
  2005-01-20 17:55 No response? David Dougall
  2005-01-20 18:12 ` Peter T. Breuer
  2005-01-20 18:14 ` Gordon Henderson
@ 2005-01-20 18:21 ` Mike Hardy
  2005-01-20 18:30 ` Mario Holbe
  2005-01-20 18:49 ` Kanoa Withington
  4 siblings, 0 replies; 22+ messages in thread
From: Mike Hardy @ 2005-01-20 18:21 UTC (permalink / raw)
  To: linux-raid

This hasn't been my experience, and I just had a drive journey across
the River Styx a few days back. It was in a RAID1 mirror, and while I
got some log messages about it, and smartd and mdadm both sent me email,
the software raid device and the machine in general both kept ticking along.

That was 2.6.10 and ext3, but I've also had this experience in the 2.4 
series, with ext3.

Either it's an xfs interaction, or there's something else going on.

There was mention recently of RAID1 corruption, but I believe that was
in reference to 2.6.10, and there have only been two reports, where I'd
expect a storm of reports if it was occurring to more people.

Regardless, perhaps it's possible to test it yourself by assembling a
couple of disk files into loopback bindings with LVM's faulty block
support on them, and finally a raid mirror on top. Sounds a bit like a
house of cards, but you'd be able to simulate a failing drive that way
with xfs on it, and see how the mirror and filesystem react.
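
One way the "failing drive" layer in such a test might be built is with a
device-mapper table that maps most of a loop device linearly and a few
sectors to the "error" target. That is an assumption about how to do it,
not necessarily the LVM feature meant above. The Python sketch below only
generates the table text; the function name, /dev/loop0 and the sector
numbers are placeholders.

```python
def dm_table_with_bad_sectors(device, total_sectors, bad_ranges):
    """Return a device-mapper table that maps `device` linearly except
    for the given (start, length) sector ranges, which use the `error`
    target and therefore fail every I/O sent to them."""
    lines = []
    pos = 0
    for start, length in sorted(bad_ranges):
        if start < pos or start + length > total_sectors:
            raise ValueError("bad ranges must be disjoint and inside the device")
        if start > pos:
            # Healthy stretch before the bad range.
            lines.append(f"{pos} {start - pos} linear {device} {pos}")
        lines.append(f"{start} {length} error")
        pos = start + length
    if pos < total_sectors:
        # Healthy stretch after the last bad range.
        lines.append(f"{pos} {total_sectors - pos} linear {device} {pos}")
    return "\n".join(lines)


print(dm_table_with_bad_sectors("/dev/loop0", 204800, [(1000, 8)]))
```

The printed table could then be given to dmsetup to create the faulty
device, which would serve as one half of a two-device raid1 created with
mdadm, with xfs made on the resulting md device.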

-Mike

David Dougall wrote:
> Perhaps I was asking a stupid question or an obvious one, but I have
> received not response.
> Maybe if I simplify the question...
> 
> If I am running software raid1 and a disk device starts throwing I/O
> errors, Is the filesystem supposed to see any indication of this?  I
> thought software raid would mask all of this and just fail the drive.
> 
> I have servers with xfs as the filesystem and xfs will start to throw I/O
> errors when a disk starts acting up even with software raid in between.
> Please advise on how I can confirm my setup or if this is possibly a bug
> how to diagnose further.
> If it makes a difference, I am running linux-2.4.26
> Thanks
> --David Dougall
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: No response?
  2005-01-20 17:55 No response? David Dougall
                   ` (2 preceding siblings ...)
  2005-01-20 18:21 ` Mike Hardy
@ 2005-01-20 18:30 ` Mario Holbe
  2005-01-20 18:57   ` David Dougall
  2005-01-20 18:49 ` Kanoa Withington
  4 siblings, 1 reply; 22+ messages in thread
From: Mario Holbe @ 2005-01-20 18:30 UTC (permalink / raw)
  To: linux-raid

David Dougall <davidd@et.byu.edu> wrote:
> If I am running software raid1 and a disk device starts throwing I/O
> errors, Is the filesystem supposed to see any indication of this?  I

Usually this should not happen, provided that a) this device is not the
only active device in this RAID1 and b) this device is the only
failing one.

> I have servers with xfs as the filesystem and xfs will start to throw I/O
> errors when a disk starts acting up even with software raid in between.

It could be helpful to show the messages appearing (dmesg), the
RAID setup (cat /proc/mdstat) and the mount (cat /etc/fstab /etc/mtab
or /proc/mounts).
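
As a rough aid for reading the /proc/mdstat output, the Python sketch
below picks out arrays that have a member flagged (F) or a missing slot in
the [UU]-style status. It assumes the usual two-line layout per array; the
function name and the sample text are made up for illustration.

```python
import re


def degraded_arrays(mdstat_text):
    """Return {array: {"failed": [...], "status": "[U_]"}} for arrays
    that have a failed member or a missing slot."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if m:
            current = m.group(1)
            # Members marked as failed look like "sdb1[1](F)".
            failed = re.findall(r"(\S+?)\[\d+\]\(F\)", m.group(2))
            result[current] = {"failed": failed, "status": None}
            continue
        # Status line, e.g. "104320 blocks [2/1] [U_]".
        m = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if m and current:
            result[current]["status"] = "[" + m.group(3) + "]"
    return {
        name: info
        for name, info in result.items()
        if info["failed"] or (info["status"] and "_" in info["status"])
    }


SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb1[1](F) sda1[0]
      104320 blocks [2/1] [U_]
md1 : active raid1 sdd1[1] sdc1[0]
      104320 blocks [2/2] [UU]
unused devices: <none>
"""
print(degraded_arrays(SAMPLE))
```

If md had failed a member, the array would be expected to show up here; an
empty result alongside I/O errors in the filesystem would suggest md never
reacted.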


regards,
   Mario
-- 
<jv> Oh well, config
<jv> one actually wonders what force in the universe is holding it
<jv> and makes it working
<Beeth> chances and accidents :)



* Re: No response?
  2005-01-20 18:14 ` Gordon Henderson
@ 2005-01-20 18:37   ` Mark Bellon
  2005-01-20 19:15     ` David Dougall
  2005-01-20 19:37     ` Gordon Henderson
  0 siblings, 2 replies; 22+ messages in thread
From: Mark Bellon @ 2005-01-20 18:37 UTC (permalink / raw)
  To: Gordon Henderson; +Cc: David Dougall, linux-raid

Gordon Henderson wrote:

>On Thu, 20 Jan 2005, David Dougall wrote:
>
>  
>
>>Perhaps I was asking a stupid question or an obvious one, but I have
>>received not response.
>>Maybe if I simplify the question...
>>
>>If I am running software raid1 and a disk device starts throwing I/O
>>errors, Is the filesystem supposed to see any indication of this?
>>    
>>
>
>No..
>
>  
>
>> I
>>thought software raid would mask all of this and just fail the drive.
>>    
>>
>
>It should.
>
>  
>
>>I have servers with xfs as the filesystem and xfs will start to throw I/O
>>errors when a disk starts acting up even with software raid in between.
>>Please advise on how I can confirm my setup or if this is possibly a bug
>>how to diagnose further.
>>    
>>
>
>I've experienced long delays (30 seconds? It seemed longer) in a system
>when a disk fails for a genuine reason - (I've deliberately run badblocks
>on an md device when I knew one of the underlying devices had genuine bad
>blocks) maybe the md code really tries hard to read the block, maybe the
>underlying device driver tries really hard), but in these cases, I've seen
>the system more or less freeze (all processes accessing that device
>anyway) until the raid code decided to kick the device out of the array.
>  
>
I've seen this too. The worst case can actually last for over 2 minutes.

We've been running with a patch to the RAID 1 driver that handles this
so critical applications do not hang for too long. Basically it uses
timers in the RAID 1 driver to force the disk to be treated as actually
having failed if it doesn't respond within a reasonable time (tunable
but usually ~3 seconds). It then handles the I/O requests coming back
asynchronously and does the cleanup.
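
The idea can be illustrated in user space with a short Python sketch. This
is not the patch itself, just a model of "treat a mirror as failed if it
does not answer within a deadline and go to the next one". The function
name, the (name, read_fn) pairs and the default of 3 seconds are
assumptions made for the example.

```python
import concurrent.futures
import time


def read_with_deadline(mirrors, block, deadline=3.0):
    """Try each mirror in turn. A mirror that has not answered within
    `deadline` seconds is treated as failed and the next one is tried.

    `mirrors` is a list of (name, read_fn) pairs, where read_fn(block)
    returns the data. Returns (data, names_of_failed_mirrors).
    """
    failed = []
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(mirrors)))
    try:
        for name, read_fn in mirrors:
            future = pool.submit(read_fn, block)
            try:
                return future.result(timeout=deadline), failed
            except concurrent.futures.TimeoutError:
                failed.append(name)   # too slow: treat as failed
            except OSError:
                failed.append(name)   # explicit I/O error
        raise OSError(f"no mirror answered for block {block}")
    finally:
        # Do not wait for stalled reads; they complete in the background.
        pool.shutdown(wait=False)


def slow_disk(block):
    time.sleep(0.5)                   # stands in for a disk stuck in retries
    return b"late"


def good_disk(block):
    return b"data"


print(read_with_deadline([("sda", slow_disk), ("sdb", good_disk)], 7, deadline=0.05))
```

The stalled read is left to finish on its own, which mirrors the part of
the description about handling requests that come back later.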

>Maybe XFS has a timer and doesn't like devices to "go away" for a long period of time?
>  
>
Not that I know of, but I would need to look. Any comments from XFS wizards?

mark

>  
>
>>If it makes a difference, I am running linux-2.4.26
>>    
>>
>
>I've used 2.4.x for a long time - I did try xfs about a year ago, but
>wasn't happy with it all (for various reasons).
>
>Gordon

* Re: No response?
  2005-01-20 17:55 No response? David Dougall
                   ` (3 preceding siblings ...)
  2005-01-20 18:30 ` Mario Holbe
@ 2005-01-20 18:49 ` Kanoa Withington
  4 siblings, 0 replies; 22+ messages in thread
From: Kanoa Withington @ 2005-01-20 18:49 UTC (permalink / raw)
  To: David Dougall; +Cc: linux-raid

Hi David,

I have several systems with similar kernels, raid1 mirrors and XFS
filesystems and I don't have the problem you are talking about so
there is no inherent incompatibility.

XFS will, however, shut down a filesystem if an I/O timeout is
reached. I have seen this on a system with SCSI drives where the host
controller keeps resetting itself to try to get a failing disk
back. If both disks are on the same SCSI channel, the card resets can
sometimes be long enough to trip the XFS timeout. The solution is to
make sure your mirror elements are on different host controllers, no
matter what type of interface (SCSI/IDE/ETC) they are using.

After an XFS timeout you can unmount and remount the filesystem,
unless of course it is the root filesystem.
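
A small Python sketch of the check suggested above: given the SCSI address
of each mirror member, report any host adapter that carries more than one
member. The function name and the "host:channel:id:lun" strings are
assumptions for the example, and how the addresses are obtained is left
open.

```python
def shared_hosts(members):
    """`members` maps a device name to its SCSI address written as
    "host:channel:id:lun". Return {host: [devices]} for every host
    number that is used by more than one member."""
    by_host = {}
    for dev, addr in members.items():
        host = int(addr.split(":")[0])
        by_host.setdefault(host, []).append(dev)
    return {host: sorted(devs) for host, devs in by_host.items() if len(devs) > 1}


# Hypothetical addresses: both halves of the mirror behind host 0.
print(shared_hosts({"sda": "0:0:0:46", "sdb": "0:0:0:47"}))
# Hypothetical addresses: one half on each of two hosts.
print(shared_hosts({"sda": "0:0:0:46", "sdb": "1:0:0:47"}))
```

An empty result means no two members share a host adapter, which is the
layout recommended above.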

-Kanoa

On Thu, 20 Jan 2005, David Dougall wrote:

> Perhaps I was asking a stupid question or an obvious one, but I have
> received not response.
> Maybe if I simplify the question...
>
> If I am running software raid1 and a disk device starts throwing I/O
> errors, Is the filesystem supposed to see any indication of this?  I
> thought software raid would mask all of this and just fail the drive.
>
> I have servers with xfs as the filesystem and xfs will start to throw I/O
> errors when a disk starts acting up even with software raid in between.
> Please advise on how I can confirm my setup or if this is possibly a bug
> how to diagnose further.
> If it makes a difference, I am running linux-2.4.26
> Thanks
> --David Dougall
>

* Re: No response?
  2005-01-20 18:30 ` Mario Holbe
@ 2005-01-20 18:57   ` David Dougall
  2005-01-20 19:12     ` Kanoa Withington
                       ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: David Dougall @ 2005-01-20 18:57 UTC (permalink / raw)
  To: Mario Holbe; +Cc: linux-raid

The following appears to be relevant information from the syslog file:

Jan 10 11:56:06 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0 lun 47 return code = 8000002
Jan 10 11:56:06 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10: sense key Hardware Error
Jan 10 11:56:06 linux-sg2 kernel:  I/O error: dev 08:10, sector 314179976
Jan 10 11:56:06 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0 lun 47 return code = 8000002
Jan 10 11:56:06 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10: sense key Hardware Error
Jan 10 11:56:06 linux-sg2 kernel:  I/O error: dev 08:10, sector 314179969
Jan 10 11:56:06 linux-sg2 kernel: XFS: device device-mapper(254,1)- XFS write error in file system meta-data block 0x2bb20008 in device-mapper(254,1)
Jan 10 11:56:07 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0 lun 47 return code = 8000002
Jan 10 11:56:07 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10: sense key Hardware Error
Jan 10 11:56:07 linux-sg2 kernel:  I/O error: dev 08:10, sector 144067
Jan 10 11:56:07 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0 lun 47 return code = 8000002
Jan 10 11:56:07 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10: sense key Hardware Error
Jan 10 11:56:07 linux-sg2 kernel:  I/O error: dev 08:10, sector 62129592
Jan 10 11:56:07 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0 lun 47 return code = 8000002
Jan 10 11:56:07 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10: sense key Hardware Error
Jan 10 11:56:07 linux-sg2 kernel:  I/O error: dev 08:10, sector 144131
Jan 10 11:56:07 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0 lun 47 return code = 8000002
Jan 10 11:56:07 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10: sense key Hardware Error
Jan 10 11:56:07 linux-sg2 kernel:  I/O error: dev 08:10, sector 104726920
Jan 10 11:56:07 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0 lun 47 return code = 8000002
Jan 10 11:56:07 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10: sense key Hardware Error
Jan 10 11:56:07 linux-sg2 kernel:  I/O error: dev 08:10, sector 385
Jan 10 11:56:07 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0 lun 47 return code = 8000002
Jan 10 11:56:07 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10: sense key Hardware Error
Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 157090184
Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0 lun 47 return code = 8000002
Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10: sense key Hardware Error
Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 209453448
Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0 lun 47 return code = 8000002
Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10: sense key Hardware Error
Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 343219280
Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0 lun 47 return code = 8000002
Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10: sense key Hardware Error
Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 392
Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0 lun 47 return code = 8000002
Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10: sense key Hardware Error
Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 104726913
Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0 lun 47 return code = 8000002
Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10: sense key Hardware Error
Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 144143
Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0 lun 47 return code = 8000002
Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10: sense key Hardware Error
Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 157090177
Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0 lun 47 return code = 8000002
Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10: sense key Hardware Error
Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 209453441
Jan 10 11:56:08 linux-sg2 kernel: I/O error in filesystem ("device-mapper(254,1)") meta-data dev device-mapper(254,1) block 0x18fa318f ("xlog_iodone") error 5 buf count 2048
Jan 10 11:56:08 linux-sg2 kernel: xfs_force_shutdown(device-mapper(254,1),0x2) called from line 966 of file xfs_log.c.  Return address = 0xc0246d9b
Jan 10 11:56:08 linux-sg2 kernel: Filesystem "device-mapper(254,1)": Log I/O Error Detected.  Shutting down filesystem: device-mapper(254,1)
Jan 10 11:56:08 linux-sg2 kernel: Please umount the filesystem, and rectify the problem(s)


I don't see any error messages from md in any of these logs.
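
For anyone wanting to summarise a log like this, here is a hedged Python
sketch that counts the "I/O error: dev MM:mm" lines per device. It assumes
the pair after "dev" is the major and minor number in hexadecimal, and that
SCSI disks on major 8 get 16 minors each, so 08:10 would be the second
whole disk. The function names are invented for the example.

```python
import re
from collections import Counter

IO_ERROR = re.compile(r"I/O error: dev ([0-9a-f]{2}):([0-9a-f]{2}), sector (\d+)")


def summarize_io_errors(log_text):
    """Count "I/O error: dev ..." lines per (major, minor) device."""
    counts = Counter()
    for major_hex, minor_hex, _sector in IO_ERROR.findall(log_text):
        counts[(int(major_hex, 16), int(minor_hex, 16))] += 1
    return counts


def sd_name(major, minor):
    """Map a major-8 SCSI disk number to a conventional sdX name.
    Assumes 16 minors per disk; returns None for other majors."""
    if major != 8:
        return None
    disk, part = divmod(minor, 16)
    return "sd" + chr(ord("a") + disk) + (str(part) if part else "")


SAMPLE = (
    "Jan 10 11:56:06 linux-sg2 kernel:  I/O error: dev 08:10, sector 314179976\n"
    "Jan 10 11:56:06 linux-sg2 kernel:  I/O error: dev 08:10, sector 314179969\n"
)
for (major, minor), n in summarize_io_errors(SAMPLE).items():
    print(sd_name(major, minor), n)
```

Under those assumptions every disk error in the excerpt points at a single
device, while the XFS errors are reported against device-mapper(254,1).
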
--David Dougall




On Thu, 20 Jan 2005, Mario Holbe wrote:

> David Dougall <davidd@et.byu.edu> wrote:
> > If I am running software raid1 and a disk device starts throwing I/O
> > errors, Is the filesystem supposed to see any indication of this?  I
>
> Usually this should not happen. Presumed a) this device is not the
> only active device in this RAID1 and b) this device is the only
> failing one.
>
> > I have servers with xfs as the filesystem and xfs will start to throw I/O
> > errors when a disk starts acting up even with software raid in between.
>
> It could be helpful to show the messages appearing (dmesg), the
> RAID setup (cat /proc/mdstat) and the mount (cat /etc/fstab /etc/mtab
> or /proc/mounts).
>
>
> regards,
>    Mario
> --
> <jv> Oh well, config
> <jv> one actually wonders what force in the universe is holding it
> <jv> and makes it working
> <Beeth> chances and accidents :)
>

* Re: No response?
  2005-01-20 18:57   ` David Dougall
@ 2005-01-20 19:12     ` Kanoa Withington
  2005-01-20 19:17       ` David Dougall
  2005-01-20 19:18     ` Guy
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Kanoa Withington @ 2005-01-20 19:12 UTC (permalink / raw)
  To: David Dougall; +Cc: Mario Holbe, linux-raid


Yes, that's a standard XFS timeout and shutdown. If your second disk
is on the same SCSI channel try moving it to a different one,
preferably a different controller altogether.

Your disk 08:10 does have real problems, but they are separate from
the XFS shutdown which should be prevented by the MD layer.

-Kanoa

On Thu, 20 Jan 2005, David Dougall wrote:


>  return code = 8000002
> Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
> sense key
>  Hardware Error
> Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 209453441
> Jan 10 11:56:08 linux-sg2 kernel: I/O error in filesystem
> ("device-mapper(254,1)
> ") meta-data dev device-mapper(254,1) block 0x18fa318f
> ("xlog_iodone") err
> or 5 buf count 2048
> Jan 10 11:56:08 linux-sg2 kernel:
> xfs_force_shutdown(device-mapper(254,1),0x2) c
> alled from line 966 of file xfs_log.c.  Return address = 0xc0246d9b
> Jan 10 11:56:08 linux-sg2 kernel: Filesystem "device-mapper(254,1)": Log
> I/O Err
> or Detected.  Shutting down filesystem: device-mapper(254,1)
> Jan 10 11:56:08 linux-sg2 kernel: Please umount the filesystem, and
> rectify the
> problem(s)
>
>
> I don't see any error messages from md in any of these logs.
> --David Dougall
>
>


* Re: No response?
  2005-01-20 18:37   ` Mark Bellon
@ 2005-01-20 19:15     ` David Dougall
  2005-01-20 19:35       ` Mark Bellon
  2005-01-20 19:37     ` Gordon Henderson
  1 sibling, 1 reply; 22+ messages in thread
From: David Dougall @ 2005-01-20 19:15 UTC (permalink / raw)
  To: Mark Bellon; +Cc: Gordon Henderson, linux-raid

Oooh, that ~3 second patch sounds very interesting.  I actually think that
the theory about timeouts causing the problem is correct.  I didn't
realize that applications/fs calls could stall for that long.  My NFS
servers have a timeout themselves of about 10 seconds before they start to
try to shut things down.
--David Dougall


On Thu, 20 Jan 2005, Mark Bellon wrote:

> Gordon Henderson wrote:
>
> >On Thu, 20 Jan 2005, David Dougall wrote:
> >
> >
> >
> >>Perhaps I was asking a stupid question or an obvious one, but I have
> >>received not response.
> >>Maybe if I simplify the question...
> >>
> >>If I am running software raid1 and a disk device starts throwing I/O
> >>errors, Is the filesystem supposed to see any indication of this?
> >>
> >>
> >
> >No..
> >
> >
> >
> >> I
> >>thought software raid would mask all of this and just fail the drive.
> >>
> >>
> >
> >It should.
> >
> >
> >
> >>I have servers with xfs as the filesystem and xfs will start to throw I/O
> >>errors when a disk starts acting up even with software raid in between.
> >>Please advise on how I can confirm my setup or if this is possibly a bug
> >>how to diagnose further.
> >>
> >>
> >
> >I've experienced long delays (30 seconds? It seemed longer) in a system
> >when a disk fails for a genuine reason - (I've deliberately run badblocks
> >on an md device when I knew one of the underlying devices had genuine bad
> >blocks) maybe the md code really tries hard to read the block, maybe the
> >underlying device driver tries really hard), but in these cases, I've seen
> >the system more or less freeze (all processes accessing that device
> >anyway) until the raid code decided to kick the device out of the array.
> >
> >
> I've seen this too. The worst case can actually last for over 2 minutes.
>
> We've been running with a patch to the RAID 1 driver that handles this
> so critical applications do not hang for too long. Basically it uses
> timers in the RAID 1 driver to force the disk to be treated as actually
> having failed if it doesn't respond within a reasonable time (tunable
> but usually ~3 seconds). It then handles the I/O requests coming back
> async. and does the clean up.
>
> >Maybe XFS has a timer and doesn't like devices to "go away" for a long period of time?
> >
> >
> Not that I know of but I would need to look. Any XFS wizard's comments?
>
> mark
>
> >
> >
> >>If it makes a difference, I am running linux-2.4.26
> >>
> >>
> >
> >I've used 2.4.x for a long time - I did try xfs about a year ago, but
> >wasn't happy with it all (for various reasons).
> >
> >Gordon

* Re: No response?
  2005-01-20 19:12     ` Kanoa Withington
@ 2005-01-20 19:17       ` David Dougall
  2005-01-20 19:23         ` Guy
  2005-01-20 19:34         ` Kanoa Withington
  0 siblings, 2 replies; 22+ messages in thread
From: David Dougall @ 2005-01-20 19:17 UTC (permalink / raw)
  To: Kanoa Withington; +Cc: Mario Holbe, linux-raid

By "different controller" do you mean HBA controller or disk controller?
The disk devices are on completely different JBODs.  They are both through
the same HBA (the server only has 1 PCI slot).
--David Dougall


On Thu, 20 Jan 2005, Kanoa Withington wrote:

>
> Yes, that's a standard XFS timeout and shutdown. If your second disk
> is on the sme SCSI channel try moving it to a different one,
> preferably a different controller alotgether.
>
> Your disk 08:10 does have real problems, but they are separate from
> the XFS shutdown which should be prevented by the MD layer.
>
> -Kanoa
>
> On Thu, 20 Jan 2005, David Dougall wrote:
>
>
> >  return code = 8000002
> > Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
> > sense key
> >  Hardware Error
> > Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 209453441
> > Jan 10 11:56:08 linux-sg2 kernel: I/O error in filesystem
> > ("device-mapper(254,1)
> > ") meta-data dev device-mapper(254,1) block 0x18fa318f
> > ("xlog_iodone") err
> > or 5 buf count 2048
> > Jan 10 11:56:08 linux-sg2 kernel:
> > xfs_force_shutdown(device-mapper(254,1),0x2) c
> > alled from line 966 of file xfs_log.c.  Return address = 0xc0246d9b
> > Jan 10 11:56:08 linux-sg2 kernel: Filesystem "device-mapper(254,1)": Log
> > I/O Err
> > or Detected.  Shutting down filesystem: device-mapper(254,1)
> > Jan 10 11:56:08 linux-sg2 kernel: Please umount the filesystem, and
> > rectify the
> > problem(s)
> >
> >
> > I don't see any error messages from md in any of these logs.
> > --David Dougall
> >
> >
>
>
>


* RE: No response?
  2005-01-20 18:57   ` David Dougall
  2005-01-20 19:12     ` Kanoa Withington
@ 2005-01-20 19:18     ` Guy
  2005-01-20 19:24     ` Peter T. Breuer
  2005-01-20 19:28     ` Mark Bellon
  3 siblings, 0 replies; 22+ messages in thread
From: Guy @ 2005-01-20 19:18 UTC (permalink / raw)
  To: 'David Dougall', 'Mario Holbe'; +Cc: linux-raid

Are you sure it is RAID?  Maybe hardware RAID?

Send the output of these commands:
cat /proc/mdstat
df
mdadm -D /dev/md?

If using LVM:
vgdisplay -v
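
If it helps to gather all of that in one go, a Python sketch along these
lines could run the commands and keep their output together. The helper
name is made up, and /dev/md0 is only a placeholder for whichever md
device is in use.

```python
import subprocess


def collect(commands):
    """Run each command (a list of argv strings) and return
    {command line: combined output}. A missing binary or a timeout is
    recorded in the report instead of being raised."""
    report = {}
    for argv in commands:
        label = " ".join(argv)
        try:
            done = subprocess.run(argv, capture_output=True, text=True, timeout=30)
            report[label] = done.stdout + done.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            report[label] = f"<not available: {exc}>"
    return report


COMMANDS = [
    ["cat", "/proc/mdstat"],
    ["df"],
    ["mdadm", "-D", "/dev/md0"],   # placeholder device name
    ["vgdisplay", "-v"],           # only meaningful if LVM is in use
]

# Example use:
#   for label, text in collect(COMMANDS).items():
#       print("==", label, "==")
#       print(text)
```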

Guy

-----Original Message-----
From: linux-raid-owner@vger.kernel.org
[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of David Dougall
Sent: Thursday, January 20, 2005 1:57 PM
To: Mario Holbe
Cc: linux-raid@vger.kernel.org
Subject: Re: No response?

The following appears to be relavent information from the syslog file:

Jan 10 11:56:06 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
lun 47
 return code = 8000002
Jan 10 11:56:06 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
sense key
 Hardware Error
Jan 10 11:56:06 linux-sg2 kernel:  I/O error: dev 08:10, sector 314179976
Jan 10 11:56:06 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
lun 47
 return code = 8000002
Jan 10 11:56:06 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
sense key
 Hardware Error
Jan 10 11:56:06 linux-sg2 kernel:  I/O error: dev 08:10, sector 314179969
Jan 10 11:56:06 linux-sg2 kernel: XFS: device device-mapper(254,1)- XFS
write er
ror in file system meta-data block 0x2bb20008 in device-mapper(254,1)
Jan 10 11:56:07 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
lun 47
 return code = 8000002
Jan 10 11:56:07 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
sense key
 Hardware Error
Jan 10 11:56:07 linux-sg2 kernel:  I/O error: dev 08:10, sector 144067
Jan 10 11:56:07 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
lun 47
 return code = 8000002
Jan 10 11:56:07 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
sense key
 Hardware Error
Jan 10 11:56:07 linux-sg2 kernel:  I/O error: dev 08:10, sector 62129592
Jan 10 11:56:07 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
lun 47
 return code = 8000002
Jan 10 11:56:07 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
sense key
 Hardware Error
Jan 10 11:56:07 linux-sg2 kernel:  I/O error: dev 08:10, sector 144131
Jan 10 11:56:07 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
lun 47
 return code = 8000002
Jan 10 11:56:07 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
sense key
 Hardware Error
Jan 10 11:56:07 linux-sg2 kernel:  I/O error: dev 08:10, sector 104726920
Jan 10 11:56:07 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
lun 47
 return code = 8000002
Jan 10 11:56:07 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
sense key
 Hardware Error
Jan 10 11:56:07 linux-sg2 kernel:  I/O error: dev 08:10, sector 385
Jan 10 11:56:07 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
lun 47
 return code = 8000002
Jan 10 11:56:07 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
sense key
 Hardware Error
Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 157090184
Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
lun 47
 return code = 8000002
Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
sense key
 Hardware Error
Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 209453448
Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
lun 47
 return code = 8000002
Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
sense key
 Hardware Error
Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 343219280
Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
lun 47
 return code = 8000002
Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
sense key
 Hardware Error
Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 392
Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
lun 47
 return code = 8000002
Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
sense key
 Hardware Error
Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 104726913
Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
lun 47
 return code = 8000002
Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
sense key
 Hardware Error
Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 144143
Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
lun 47
 return code = 8000002
Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
sense key
 Hardware Error
Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 157090177
Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
lun 47
 return code = 8000002
Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
sense key
 Hardware Error
Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 209453441
Jan 10 11:56:08 linux-sg2 kernel: I/O error in filesystem
("device-mapper(254,1)
") meta-data dev device-mapper(254,1) block 0x18fa318f
("xlog_iodone") err
or 5 buf count 2048
Jan 10 11:56:08 linux-sg2 kernel:
xfs_force_shutdown(device-mapper(254,1),0x2) c
alled from line 966 of file xfs_log.c.  Return address = 0xc0246d9b
Jan 10 11:56:08 linux-sg2 kernel: Filesystem "device-mapper(254,1)": Log
I/O Err
or Detected.  Shutting down filesystem: device-mapper(254,1)
Jan 10 11:56:08 linux-sg2 kernel: Please umount the filesystem, and
rectify the
problem(s)


I don't see any error messages from md in any of these logs.
--David Dougall




On Thu, 20 Jan 2005, Mario Holbe wrote:

> David Dougall <davidd@et.byu.edu> wrote:
> > If I am running software raid1 and a disk device starts throwing I/O
> > errors, Is the filesystem supposed to see any indication of this?  I
>
> Usually this should not happen. Presumed a) this device is not the
> only active device in this RAID1 and b) this device is the only
> failing one.
>
> > I have servers with xfs as the filesystem and xfs will start to throw I/O
> > errors when a disk starts acting up even with software raid in between.
>
> It could be helpful to show the messages appearing (dmesg), the
> RAID setup (cat /proc/mdstat) and the mount (cat /etc/fstab /etc/mtab
> or /proc/mounts).
>
>
> regards,
>    Mario
> --
> <jv> Oh well, config
> <jv> one actually wonders what force in the universe is holding it
> <jv> and makes it working
> <Beeth> chances and accidents :)
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: No response?
  2005-01-20 19:17       ` David Dougall
@ 2005-01-20 19:23         ` Guy
  2005-01-20 19:34         ` Kanoa Withington
  1 sibling, 0 replies; 22+ messages in thread
From: Guy @ 2005-01-20 19:23 UTC (permalink / raw)
  To: 'David Dougall', 'Kanoa Withington'
  Cc: 'Mario Holbe', linux-raid

At least:
Different SCSI or IDE bus.

Guy

-----Original Message-----
From: linux-raid-owner@vger.kernel.org
[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of David Dougall
Sent: Thursday, January 20, 2005 2:18 PM
To: Kanoa Withington
Cc: Mario Holbe; linux-raid@vger.kernel.org
Subject: Re: No response?

By "different controller" do you mean HBA controller or disk controller?
The disk devices are on completely different jbods.  They are both through
the same HBA(the server only has 1 PCI slot)
--David Dougall


On Thu, 20 Jan 2005, Kanoa Withington wrote:

>
> Yes, that's a standard XFS timeout and shutdown. If your second disk
> is on the same SCSI channel try moving it to a different one,
> preferably a different controller altogether.
>
> Your disk 08:10 does have real problems, but they are separate from
> the XFS shutdown which should be prevented by the MD layer.
>
> -Kanoa
>
> On Thu, 20 Jan 2005, David Dougall wrote:
>
>
> >  return code = 8000002
> > Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10: sense key Hardware Error
> > Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 209453441
> > Jan 10 11:56:08 linux-sg2 kernel: I/O error in filesystem ("device-mapper(254,1)") meta-data dev device-mapper(254,1) block 0x18fa318f ("xlog_iodone") error 5 buf count 2048
> > Jan 10 11:56:08 linux-sg2 kernel: xfs_force_shutdown(device-mapper(254,1),0x2) called from line 966 of file xfs_log.c.  Return address = 0xc0246d9b
> > Jan 10 11:56:08 linux-sg2 kernel: Filesystem "device-mapper(254,1)": Log I/O Error Detected.  Shutting down filesystem: device-mapper(254,1)
> > Jan 10 11:56:08 linux-sg2 kernel: Please umount the filesystem, and rectify the problem(s)
> >
> >
> > I don't see any error messages from md in any of these logs.
> > --David Dougall
> >
> >
>
>
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: No response?
  2005-01-20 18:57   ` David Dougall
  2005-01-20 19:12     ` Kanoa Withington
  2005-01-20 19:18     ` Guy
@ 2005-01-20 19:24     ` Peter T. Breuer
  2005-01-20 19:51       ` David Dougall
  2005-01-20 19:28     ` Mark Bellon
  3 siblings, 1 reply; 22+ messages in thread
From: Peter T. Breuer @ 2005-01-20 19:24 UTC (permalink / raw)
  To: linux-raid

David Dougall <davidd@et.byu.edu> wrote:
> Jan 10 11:56:06 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
> lun 47
>  return code = 8000002

That is sda.

> Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 343219280

Well, I don't really understand - that is sdb, no? No? (or sda10 if the
numbers are in decimal instead of hex).

I don't know why sda and sdb seem to be confused! The "lun 47" also has
me wondering!
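
The two readings of "08:10" can be worked through with a short sketch. It assumes the usual SCSI disk numbering (major 8, 16 minors per disk, minor 0 of each group being the whole disk); `sd_name` is a hypothetical helper, not a kernel or library function.

```python
def sd_name(dev, base=16):
    """Map a 'major:minor' string to an sd device name.

    Assumes SCSI disk major 8 with 16 minors per disk.
    """
    major_s, minor_s = dev.split(":")
    major, minor = int(major_s, base), int(minor_s, base)
    if major != 8:
        raise ValueError("not the first SCSI disk major: %d" % major)
    disk, part = divmod(minor, 16)
    name = "sd" + chr(ord("a") + disk)
    return name if part == 0 else "%s%d" % (name, part)


print(sd_name("08:10"))           # hex reading
print(sd_name("08:10", base=10))  # decimal reading
```

Read as hex, minor 0x10 is 16, the whole second disk (sdb); read as decimal, minor 10 would be the tenth partition of the first disk (sda10).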

> Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
> sense key
>  Hardware Error
> Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 392

sdb, very insistently, at a completely different sector. Looks belly
up.

> Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
> lun 47
>  return code = 8000002

But 0:0:0 should be sda!

> Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 209453441
> Jan 10 11:56:08 linux-sg2 kernel: I/O error in filesystem ("device-mapper(254,1)") meta-data dev device-mapper(254,1) block 0x18fa318f ("xlog_iodone") error 5 buf count 2048
> Jan 10 11:56:08 linux-sg2 kernel: xfs_force_shutdown(device-mapper(254,1),0x2) called from line 966 of file xfs_log.c.  Return address = 0xc0246d9b
> Jan 10 11:56:08 linux-sg2 kernel: Filesystem "device-mapper(254,1)": Log I/O Error Detected.  Shutting down filesystem: device-mapper(254,1)
> Jan 10 11:56:08 linux-sg2 kernel: Please umount the filesystem, and rectify the problem(s)
> 
> 
> I don't see any error messages from md in any of these logs.

Why is the device mapper implicated? Doesn't that mean that you are
using LVM and not raid? 2.4.26, you said? Hmm.
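
Whether md is really underneath, and what state it thinks its members are in, can be read from /proc/mdstat. The sketch below parses text in that format; the sample contents are hypothetical and are not output from the system in this thread, and `parse_mdstat` is an illustrative helper name.

```python
import re

# Hypothetical /proc/mdstat contents, for illustration only.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1](F) sda1[0]
      104320 blocks [2/1] [U_]

unused devices: <none>
"""

ARRAY_LINE = re.compile(r"^(md\d+) : (\w+) (\w+) (.+)$")
MEMBER = re.compile(r"(\w+)\[(\d+)\](\(F\))?")


def parse_mdstat(text):
    """Return {array: {"state", "level", "members": {device: is_failed}}}."""
    arrays = {}
    for line in text.splitlines():
        m = ARRAY_LINE.match(line)
        if not m:
            continue
        name, state, level, rest = m.groups()
        members = {dev: bool(failed) for dev, _, failed in MEMBER.findall(rest)}
        arrays[name] = {"state": state, "level": level, "members": members}
    return arrays


for name, info in parse_mdstat(SAMPLE_MDSTAT).items():
    print(name, info["level"], info["members"])
```

If no md array lists the disks behind the device-mapper volume, the filesystem is not sitting on a mirror at all.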

Peter


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: No response?
  2005-01-20 18:57   ` David Dougall
                       ` (2 preceding siblings ...)
  2005-01-20 19:24     ` Peter T. Breuer
@ 2005-01-20 19:28     ` Mark Bellon
  3 siblings, 0 replies; 22+ messages in thread
From: Mark Bellon @ 2005-01-20 19:28 UTC (permalink / raw)
  To: David Dougall; +Cc: Mario Holbe, linux-raid

This looks like MD did its thing properly and there is a timeout within
XFS.

mark

>The following appears to be relevant information from the syslog file:
>
>Jan 10 11:56:06 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
>lun 47
> return code = 8000002
>Jan 10 11:56:06 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
>sense key
> Hardware Error
>Jan 10 11:56:06 linux-sg2 kernel:  I/O error: dev 08:10, sector 314179976
>Jan 10 11:56:06 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
>lun 47
> return code = 8000002
>Jan 10 11:56:06 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
>sense key
> Hardware Error
>Jan 10 11:56:06 linux-sg2 kernel:  I/O error: dev 08:10, sector 314179969
>Jan 10 11:56:06 linux-sg2 kernel: XFS: device device-mapper(254,1)- XFS write error in file system meta-data block 0x2bb20008 in device-mapper(254,1)
>Jan 10 11:56:07 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
>lun 47
> return code = 8000002
>Jan 10 11:56:07 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
>sense key
> Hardware Error
>Jan 10 11:56:07 linux-sg2 kernel:  I/O error: dev 08:10, sector 144067
>Jan 10 11:56:07 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
>lun 47
> return code = 8000002
>Jan 10 11:56:07 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
>sense key
> Hardware Error
>Jan 10 11:56:07 linux-sg2 kernel:  I/O error: dev 08:10, sector 62129592
>Jan 10 11:56:07 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
>lun 47
> return code = 8000002
>Jan 10 11:56:07 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
>sense key
> Hardware Error
>Jan 10 11:56:07 linux-sg2 kernel:  I/O error: dev 08:10, sector 144131
>Jan 10 11:56:07 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
>lun 47
> return code = 8000002
>Jan 10 11:56:07 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
>sense key
> Hardware Error
>Jan 10 11:56:07 linux-sg2 kernel:  I/O error: dev 08:10, sector 104726920
>Jan 10 11:56:07 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
>lun 47
> return code = 8000002
>Jan 10 11:56:07 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
>sense key
> Hardware Error
>Jan 10 11:56:07 linux-sg2 kernel:  I/O error: dev 08:10, sector 385
>Jan 10 11:56:07 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
>lun 47
> return code = 8000002
>Jan 10 11:56:07 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
>sense key
> Hardware Error
>Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 157090184
>Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
>lun 47
> return code = 8000002
>Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
>sense key
> Hardware Error
>Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 209453448
>Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
>lun 47
> return code = 8000002
>Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
>sense key
> Hardware Error
>Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 343219280
>Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
>lun 47
> return code = 8000002
>Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
>sense key
> Hardware Error
>Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 392
>Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
>lun 47
> return code = 8000002
>Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
>sense key
> Hardware Error
>Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 104726913
>Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
>lun 47
> return code = 8000002
>Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
>sense key
> Hardware Error
>Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 144143
>Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
>lun 47
> return code = 8000002
>Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
>sense key
> Hardware Error
>Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 157090177
>Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
>lun 47
> return code = 8000002
>Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
>sense key
> Hardware Error
>Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 209453441
>Jan 10 11:56:08 linux-sg2 kernel: I/O error in filesystem ("device-mapper(254,1)") meta-data dev device-mapper(254,1) block 0x18fa318f ("xlog_iodone") error 5 buf count 2048
>Jan 10 11:56:08 linux-sg2 kernel: xfs_force_shutdown(device-mapper(254,1),0x2) called from line 966 of file xfs_log.c.  Return address = 0xc0246d9b
>Jan 10 11:56:08 linux-sg2 kernel: Filesystem "device-mapper(254,1)": Log I/O Error Detected.  Shutting down filesystem: device-mapper(254,1)
>Jan 10 11:56:08 linux-sg2 kernel: Please umount the filesystem, and rectify the problem(s)
>
>
>I don't see any error messages from md in any of these logs.
>--David Dougall
>
>
>
>
>On Thu, 20 Jan 2005, Mario Holbe wrote:
>
>  
>
>>David Dougall <davidd@et.byu.edu> wrote:
>>    
>>
>>>If I am running software raid1 and a disk device starts throwing I/O
>>>errors, Is the filesystem supposed to see any indication of this?  I
>>>      
>>>
>>Usually this should not happen. Presumed a) this device is not the
>>only active device in this RAID1 and b) this device is the only
>>failing one.
>>
>>    
>>
>>>I have servers with xfs as the filesystem and xfs will start to throw I/O
>>>errors when a disk starts acting up even with software raid in between.
>>>      
>>>
>>It could be helpful to show the messages appearing (dmesg), the
>>RAID setup (cat /proc/mdstat) and the mount (cat /etc/fstab /etc/mtab
>>or /proc/mounts).
>>
>>
>>regards,
>>   Mario
>>--
>><jv> Oh well, config
>><jv> one actually wonders what force in the universe is holding it
>><jv> and makes it working
>><Beeth> chances and accidents :)
>>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: No response?
  2005-01-20 19:17       ` David Dougall
  2005-01-20 19:23         ` Guy
@ 2005-01-20 19:34         ` Kanoa Withington
  2005-01-20 19:44           ` Mark Bellon
  1 sibling, 1 reply; 22+ messages in thread
From: Kanoa Withington @ 2005-01-20 19:34 UTC (permalink / raw)
  To: David Dougall; +Cc: Mario Holbe, linux-raid



Ideally a different HBA altogether, but a different channel on a
multichannel HBA at a minimum. If your SCSI card is not a multichannel
card, think about getting one or think about a completely different
arrangement.

It may be possible to tune the HBA reset behavior or the XFS timeout
threshold but as a matter of principle when constructing disk mirrors
you should try to keep the disks as separate as possible. You should
only need to tune, tweak or patch if you are trying to do something
unusual - which you are not.

In the short term, unplug the failing disk:

Jan 10 11:56:06 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0 lun 47

You are better off without it if your system is crashing.

-Kanoa



On Thu, 20 Jan 2005, David Dougall wrote:

> By "different controller" do you mean HBA controller or disk controller?
> The disk devices are on completely different jbods.  They are both through
> the same HBA(the server only has 1 PCI slot)
> --David Dougall
>
>
> On Thu, 20 Jan 2005, Kanoa Withington wrote:
>
> >
> > Yes, that's a standard XFS timeout and shutdown. If your second disk
> > is on the same SCSI channel try moving it to a different one,
> > preferably a different controller altogether.
> >
> > Your disk 08:10 does have real problems, but they are separate from
> > the XFS shutdown which should be prevented by the MD layer.
> >
> > -Kanoa
> >
> > On Thu, 20 Jan 2005, David Dougall wrote:
> >
> >
> > >  return code = 8000002
> > > Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10: sense key Hardware Error
> > > Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 209453441
> > > Jan 10 11:56:08 linux-sg2 kernel: I/O error in filesystem ("device-mapper(254,1)") meta-data dev device-mapper(254,1) block 0x18fa318f ("xlog_iodone") error 5 buf count 2048
> > > Jan 10 11:56:08 linux-sg2 kernel: xfs_force_shutdown(device-mapper(254,1),0x2) called from line 966 of file xfs_log.c.  Return address = 0xc0246d9b
> > > Jan 10 11:56:08 linux-sg2 kernel: Filesystem "device-mapper(254,1)": Log I/O Error Detected.  Shutting down filesystem: device-mapper(254,1)
> > > Jan 10 11:56:08 linux-sg2 kernel: Please umount the filesystem, and rectify the problem(s)
> > >
> > >
> > > I don't see any error messages from md in any of these logs.
> > > --David Dougall
> > >
> > >
> >
> >
> >
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: No response?
  2005-01-20 19:15     ` David Dougall
@ 2005-01-20 19:35       ` Mark Bellon
  0 siblings, 0 replies; 22+ messages in thread
From: Mark Bellon @ 2005-01-20 19:35 UTC (permalink / raw)
  To: David Dougall; +Cc: Gordon Henderson, linux-raid

David Dougall wrote:

>Oooh, that ~3 second patch sounds very interesting.  I actually think that
>the theory about timeouts causing the problem is correct.  I didn't
>realize that applications/fs calls could stall for that long.  My NFS
>servers have a timeout themselves of about 10 seconds before they start to
>try to shut things down.
>  
>
I could generate one for 2.4.26 for you but I need a bit of time - I'm 
running a 2.4.20 with a great many enhancements and there are a few 
differences. If there is interest I can post it to linux-raid too.

mark

>--David Dougall
>
>
>On Thu, 20 Jan 2005, Mark Bellon wrote:
>
>  
>
>>Gordon Henderson wrote:
>>
>>    
>>
>>>On Thu, 20 Jan 2005, David Dougall wrote:
>>>
>>>
>>>
>>>      
>>>
>>>>Perhaps I was asking a stupid question or an obvious one, but I have
>>>>received no response.
>>>>Maybe if I simplify the question...
>>>>
>>>>If I am running software raid1 and a disk device starts throwing I/O
>>>>errors, Is the filesystem supposed to see any indication of this?
>>>>
>>>>
>>>>        
>>>>
>>>No..
>>>
>>>
>>>
>>>      
>>>
>>>>I
>>>>thought software raid would mask all of this and just fail the drive.
>>>>
>>>>
>>>>        
>>>>
>>>It should.
>>>
>>>
>>>
>>>      
>>>
>>>>I have servers with xfs as the filesystem and xfs will start to throw I/O
>>>>errors when a disk starts acting up even with software raid in between.
>>>>Please advise on how I can confirm my setup or if this is possibly a bug
>>>>how to diagnose further.
>>>>
>>>>
>>>>        
>>>>
>>>I've experienced long delays (30 seconds? It seemed longer) in a system
>>>when a disk fails for a genuine reason - (I've deliberately run badblocks
>>>on an md device when I knew one of the underlying devices had genuine bad
>>>blocks) maybe the md code really tries hard to read the block, maybe the
>>>underlying device driver tries really hard), but in these cases, I've seen
>>>the system more or less freeze (all processes accessing that device
>>>anyway) until the raid code decided to kick the device out of the array.
>>>
>>>
>>>      
>>>
>>I've seen this too. The worst case can actually last for over 2 minutes.
>>
>>We've been running with a patch to the RAID 1 driver that handles this
>>so critical applications do not hang for too long. Basically it uses
>>timers in the RAID 1 driver to force the disk to be treated as actually
>>having failed if it doesn't respond within a reasonable time (tunable
>>but usually ~3 seconds). It then handles the I/O requests coming back
>>async. and does the clean up.
>>
>>    
>>
>>>Maybe XFS has a timer and doesn't like devices to "go away" for a long period of time?
>>>
>>>
>>>      
>>>
>>Not that I know of but I would need to look. Any XFS wizard's comments?
>>
>>mark
>>
>>    
>>
>>>      
>>>
>>>>If it makes a difference, I am running linux-2.4.26
>>>>
>>>>
>>>>        
>>>>
>>>I've used 2.4.x for a long time - I did try xfs about a year ago, but
>>>wasn't happy with it all (for various reasons).
>>>
>>>Gordon


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: No response?
  2005-01-20 18:37   ` Mark Bellon
  2005-01-20 19:15     ` David Dougall
@ 2005-01-20 19:37     ` Gordon Henderson
  2005-01-20 19:41       ` Mark Bellon
  1 sibling, 1 reply; 22+ messages in thread
From: Gordon Henderson @ 2005-01-20 19:37 UTC (permalink / raw)
  To: Mark Bellon; +Cc: linux-raid

On Thu, 20 Jan 2005, Mark Bellon wrote:

> I've seen this too. The worst case can actually last for over 2 minutes.
>
> We've been running with a patch to the RAID 1 driver that handles this
> so critical applications do not hang for too long. Basically it uses
> timers in the RAID 1 driver to force the disk to be treated as actually
> having failed if it doesn't respond within a reasonable time (tunable
> but usually ~3 seconds). It then handles the I/O requests coming back
> async. and does the clean up.

This is interesting, but make it an option (kernel compile, sysctl,
etc.)... I have a small home server/firewall that I run with the disks
spun down (noflushd) and spinning up a disk sometimes takes 8 seconds -
it's a RAID-1 set and seems to cope OK with the disks spinning down & up
again as required...

Gordon

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: No response?
  2005-01-20 19:37     ` Gordon Henderson
@ 2005-01-20 19:41       ` Mark Bellon
  2005-01-20 19:49         ` David Dougall
  0 siblings, 1 reply; 22+ messages in thread
From: Mark Bellon @ 2005-01-20 19:41 UTC (permalink / raw)
  To: Gordon Henderson; +Cc: linux-raid

Gordon Henderson wrote:

>On Thu, 20 Jan 2005, Mark Bellon wrote:
>
>  
>
>>I've seen this too. The worst case can actually last for over 2 minutes.
>>
>>We've been running with a patch to the RAID 1 driver that handles this
>>so critical applications do not hang for too long. Basically it uses
>>timers in the RAID 1 driver to force the disk to be treated as actually
>>having failed if it doesn't respond within a reasonable time (tunable
>>but usually ~3 seconds). It then handles the I/O requests coming back
>>async. and does the clean up.
>>    
>>
>
>This is interesting, but make it an option (kernel compile, sysctl,
>etc.)... I have a small home server/firewall that I run with the disks
>spun down (noflushd) and spinning up a disk sometimes takes 8 seconds -
>it's a RAID-1 set and seems to cope OK with the disks spinning down & up
>again as required...
>  
>
The current patch has config options to adjust the
Non-Responsive-Disk-Timer. A value of zero specifies no timeout; a
non-zero value is the timeout in seconds.
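
A user-space sketch of the idea follows. It is not the patch itself, which has not been posted in this thread. Each mirror read gets a deadline, a mirror that misses it is treated as failed, and a timeout of zero disables the timer as described above. The names `mirrored_read`, `slow_disk` and `good_disk` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def mirrored_read(mirrors, timeout_s):
    """Try each (name, read_fn) mirror in turn.

    A mirror that raises OSError or does not answer within timeout_s
    seconds is recorded as failed. timeout_s == 0 means wait forever.
    Returns (data, list of failed mirror names).
    """
    failed = []
    pool = ThreadPoolExecutor(max_workers=len(mirrors))
    try:
        for name, read_fn in mirrors:
            future = pool.submit(read_fn)
            try:
                return future.result(timeout=timeout_s or None), failed
            except (FutureTimeout, OSError):
                failed.append(name)
        raise OSError("all mirrors failed")
    finally:
        # Do not block on a stuck read; it finishes in the background.
        pool.shutdown(wait=False)


def slow_disk():
    time.sleep(0.3)  # stands in for a disk stuck in retries
    return b"late"


def good_disk():
    return b"data"


print(mirrored_read([("sdb", slow_disk), ("sda", good_disk)], 0.05))
print(mirrored_read([("sdb", slow_disk), ("sda", good_disk)], 0))
```

With a timer shorter than a disk's spin-up time, a sleeping but healthy disk would be marked failed in the same way, which is why the value needs to be tunable.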

Let me pull a 2.4.26 kernel source and see how fast I can work up a 
patch. Or would it be better to generate it against 2.4.29?

mark

>Gordon
>  
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: No response?
  2005-01-20 19:34         ` Kanoa Withington
@ 2005-01-20 19:44           ` Mark Bellon
  0 siblings, 0 replies; 22+ messages in thread
From: Mark Bellon @ 2005-01-20 19:44 UTC (permalink / raw)
  To: Kanoa Withington; +Cc: David Dougall, Mario Holbe, linux-raid

Kanoa Withington wrote:

>Ideally a different HBA altogether, but a different channel on a
>multichannel HBA at a minimum. If your SCSI card is not a multichannel
>card, think about getting one or think about a completely different
>arrangement.
>
>It may be possible to tune the HBA reset behavior or the XFS timeout
>threshold but as a matter of principle when constructing disk mirrors
>you should try to keep the disks as separate as possible. You should
>only need to tune, tweak or patch if you are trying to do something
>unusual - which you are not.
>  
>
Very true.

The default parameters for SCSI (5 retries as I recall) can take a very 
long time when a SCSI bus reset is called for (settle times and such) - 
I've seen 2+ minutes. Even with totally redundant controllers a logical
I/O (to the RAID) could be held up waiting for a physical I/O by this
long. The XFS parameter would need to be raised above the threshold.
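
As a rough worked example of how the stall adds up: the attempt count comes from the "5 retries" recalled above and the 10 second figure is the NFS timeout mentioned earlier in the thread, while the 30 second per-command timeout is an assumption, not something stated here. `worst_case_stall` is a hypothetical helper.

```python
def worst_case_stall(attempts, per_attempt_timeout_s, reset_settle_s=0.0):
    """Upper bound on how long one physical I/O can hold up the mirror."""
    return attempts * (per_attempt_timeout_s + reset_settle_s)


SCSI_ATTEMPTS = 5        # "5 retries as I recall"
COMMAND_TIMEOUT_S = 30   # assumption: per-command timeout in seconds
NFS_TIMEOUT_S = 10       # "about 10 seconds" from earlier in the thread

stall = worst_case_stall(SCSI_ATTEMPTS, COMMAND_TIMEOUT_S)
print(stall, stall > NFS_TIMEOUT_S)
```

Under those assumptions the stall is 150 seconds, consistent with the 2+ minutes observed and far beyond a 10 second timeout higher up the stack.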

mark

>In the short term, unplug the failing disk:
>
>Jan 10 11:56:06 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0 lun 47
>
>You are better off without it if your system is crashing.
>
>-Kanoa
>
>
>
>On Thu, 20 Jan 2005, David Dougall wrote:
>
>  
>
>>By "different controller" do you mean HBA controller or disk controller?
>>The disk devices are on completely different jbods.  They are both through
>>the same HBA(the server only has 1 PCI slot)
>>--David Dougall
>>
>>
>>On Thu, 20 Jan 2005, Kanoa Withington wrote:
>>
>>    
>>
>>>Yes, that's a standard XFS timeout and shutdown. If your second disk
>>>is on the same SCSI channel try moving it to a different one,
>>>preferably a different controller altogether.
>>>
>>>Your disk 08:10 does have real problems, but they are separate from
>>>the XFS shutdown which should be prevented by the MD layer.
>>>
>>>-Kanoa
>>>
>>>On Thu, 20 Jan 2005, David Dougall wrote:
>>>
>>>
>>>      
>>>
>>>> return code = 8000002
>>>>Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10: sense key Hardware Error
>>>>Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 209453441
>>>>Jan 10 11:56:08 linux-sg2 kernel: I/O error in filesystem ("device-mapper(254,1)") meta-data dev device-mapper(254,1) block 0x18fa318f ("xlog_iodone") error 5 buf count 2048
>>>>Jan 10 11:56:08 linux-sg2 kernel: xfs_force_shutdown(device-mapper(254,1),0x2) called from line 966 of file xfs_log.c.  Return address = 0xc0246d9b
>>>>Jan 10 11:56:08 linux-sg2 kernel: Filesystem "device-mapper(254,1)": Log I/O Error Detected.  Shutting down filesystem: device-mapper(254,1)
>>>>Jan 10 11:56:08 linux-sg2 kernel: Please umount the filesystem, and rectify the problem(s)
>>>>
>>>>
>>>>I don't see any error messages from md in any of these logs.
>>>>--David Dougall
>>>>
>>>>
>>>>        
>>>>
>>>
>>>      
>>>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: No response?
  2005-01-20 19:41       ` Mark Bellon
@ 2005-01-20 19:49         ` David Dougall
  0 siblings, 0 replies; 22+ messages in thread
From: David Dougall @ 2005-01-20 19:49 UTC (permalink / raw)
  To: Mark Bellon; +Cc: Gordon Henderson, linux-raid

It would certainly be easier for me if it were 2.4.26.  The other 2
patches I have to apply have only been released for 2.4.26 and 2.4.25-pre9
or something like that.  If it is better for everyone else to do 2.4.29, I
can just backport it to my kernel.
Thanks
--David Dougall


On Thu, 20 Jan 2005, Mark Bellon wrote:

> Gordon Henderson wrote:
>
> >On Thu, 20 Jan 2005, Mark Bellon wrote:
> >
> >
> >
> >>I've seen this too. The worst case can actually last for over 2 minutes.
> >>
> >>We've been running with a patch to the RAID 1 driver that handles this
> >>so critical applications do not hang for too long. Basically it uses
> >>timers in the RAID 1 driver to force the disk to be treated as actually
> >>having failed if it doesn't respond within a reasonable time (tunable
> >>but usually ~3 seconds). It then handles the I/O requests coming back
> >>async. and does the clean up.
> >>
> >>
> >
> >This is interesting, but make it an option (kernel compile, sysctl,
> >etc.)... I have a small home server/firewall that I run with the disks
> >spun down (noflushd) and spinning up a disk sometimes takes 8 seconds -
> >it's a RAID-1 set and seems to cope OK with the disks spinning down & up
> >again as required...
> >
> >
> The current patch has config options to adjust the
> Non-Responsive-Disk-Timer. A value of zero specifies no timeout; a
> non-zero value is the timeout in seconds.
>
> Let me pull a 2.4.26 kernel source and see how fast I can work up a
> patch. Or would it be better to generate it against 2.4.29?
>
> mark
>
> >Gordon
> >
> >
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: No response?
  2005-01-20 19:24     ` Peter T. Breuer
@ 2005-01-20 19:51       ` David Dougall
  0 siblings, 0 replies; 22+ messages in thread
From: David Dougall @ 2005-01-20 19:51 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: linux-raid

Unfortunately my email client wrapped the line.  It is not sda.  It is
actually sdb.
--David Dougall


On Thu, 20 Jan 2005, Peter T. Breuer wrote:

> David Dougall <davidd@et.byu.edu> wrote:
> > Jan 10 11:56:06 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
> > lun 47
> >  return code = 8000002
>
> That is sda.
>
> > Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 343219280
>
> Well, I don't really understand - that is sdb, no? No? (or sda10 if the
> numbers are in decimal instead of hex).
>
> I don't know why sda and sdb seem to be confused! The "lun 47" also has
> me wondering!
>
> > Jan 10 11:56:08 linux-sg2 kernel: Info fld=0xc7c0181, Current sd08:10:
> > sense key
> >  Hardware Error
> > Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 392
>
> sdb, very insistently, at a completely different sector. Looks belly
> up.
>
> > Jan 10 11:56:08 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
> > lun 47
> >  return code = 8000002
>
> But 0:0:0 should be sda!
>
> > Jan 10 11:56:08 linux-sg2 kernel:  I/O error: dev 08:10, sector 209453441
> > Jan 10 11:56:08 linux-sg2 kernel: I/O error in filesystem ("device-mapper(254,1)") meta-data dev device-mapper(254,1) block 0x18fa318f ("xlog_iodone") error 5 buf count 2048
> > Jan 10 11:56:08 linux-sg2 kernel: xfs_force_shutdown(device-mapper(254,1),0x2) called from line 966 of file xfs_log.c.  Return address = 0xc0246d9b
> > Jan 10 11:56:08 linux-sg2 kernel: Filesystem "device-mapper(254,1)": Log I/O Error Detected.  Shutting down filesystem: device-mapper(254,1)
> > Jan 10 11:56:08 linux-sg2 kernel: Please umount the filesystem, and rectify the problem(s)
> >
> >
> > I don't see any error messages from md in any of these logs.
>
> Why is the device mapper implicated? Doesn't that mean that you are
> using LVM and not raid? 2.4.26, you said? Hmm.
>
> Peter
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2005-01-20 19:51 UTC | newest]

Thread overview: 22+ messages
2005-01-20 17:55 No response? David Dougall
2005-01-20 18:12 ` Peter T. Breuer
2005-01-20 18:14 ` Gordon Henderson
2005-01-20 18:37   ` Mark Bellon
2005-01-20 19:15     ` David Dougall
2005-01-20 19:35       ` Mark Bellon
2005-01-20 19:37     ` Gordon Henderson
2005-01-20 19:41       ` Mark Bellon
2005-01-20 19:49         ` David Dougall
2005-01-20 18:21 ` Mike Hardy
2005-01-20 18:30 ` Mario Holbe
2005-01-20 18:57   ` David Dougall
2005-01-20 19:12     ` Kanoa Withington
2005-01-20 19:17       ` David Dougall
2005-01-20 19:23         ` Guy
2005-01-20 19:34         ` Kanoa Withington
2005-01-20 19:44           ` Mark Bellon
2005-01-20 19:18     ` Guy
2005-01-20 19:24     ` Peter T. Breuer
2005-01-20 19:51       ` David Dougall
2005-01-20 19:28     ` Mark Bellon
2005-01-20 18:49 ` Kanoa Withington
