* raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Pallai Roland @ 2007-05-24 11:18 UTC
  To: Linux-Raid

Hi,

I'm wondering why md raid5 accepts writes after 2 disks have failed. I have an
array built from 7 drives, and the filesystem on it is XFS. Yesterday an IDE
cable failed (my friend kicked it out of the box on the floor :) and 2 disks
were kicked from the array, but my download (yafc) did not stop - it kept
writing to the file system for the whole night!

I have now replaced the cable and tried to reassemble the array
(mdadm -f --run); the event counter increased from 4908158 to 4929612 on the
failed disks, but I cannot mount the file system and 'xfs_repair -n' shows a
lot of errors. This is explainable by the partially successful writes. Ext3
and JFS have an "errors=" mount option to switch the filesystem read-only on
any error, but XFS doesn't: why? That is a good question too, but I think the
md layer could save dumb filesystems like XFS if it denied writes after 2
disks have failed, and I cannot see a good reason why it doesn't behave that
way.

Do you have a better idea how I can avoid this kind of filesystem corruption
in the future? No, I don't want to use ext3 on this box. :)

My mount error:
  XFS: Log inconsistent (didn't find previous header)
  XFS: failed to find log head
  XFS: log mount/recovery failed: error 5
  XFS: log mount failed

--
 d
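For anyone who lands in the same spot, a conservative first response is to
check the array state and the filesystem read-only before letting anything
write to it again. A minimal sketch, with device names and mount point
invented for illustration rather than taken from the report above:

    # Inspect what md thinks of the array before touching anything
    cat /proc/mdstat
    mdadm --detail /dev/md1          # look at "State" and "Failed Devices"

    # Reassemble from the surviving superblocks, but don't start writing yet
    mdadm --assemble --force --run /dev/md1 /dev/hd[a-g]1

    # Check the filesystem without modifying it; note that a plain ro mount
    # of XFS still replays the log, ro,norecovery (suggested later in the
    # thread) avoids even that
    mount -o ro /dev/md1 /mnt/recovery
    # or, with the filesystem unmounted, a report-only check:
    xfs_repair -n /dev/md1
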
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Justin Piszcz @ 2007-05-24 11:20 UTC
  To: Pallai Roland; +Cc: Linux-Raid, xfs

Including XFS mailing list on this one.

On Thu, 24 May 2007, Pallai Roland wrote:
> I'm wondering why md raid5 accepts writes after 2 disks have failed. I have
> an array built from 7 drives, and the filesystem on it is XFS. Yesterday an
> IDE cable failed (my friend kicked it out of the box on the floor :) and 2
> disks were kicked from the array, but my download (yafc) did not stop - it
> kept writing to the file system for the whole night!
> [...]
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: David Chinner @ 2007-05-25 0:05 UTC
  To: Justin Piszcz; +Cc: Pallai Roland, Linux-Raid, xfs

On Thu, May 24, 2007 at 07:20:35AM -0400, Justin Piszcz wrote:
> Including XFS mailing list on this one.

Thanks Justin.

> On Thu, 24 May 2007, Pallai Roland wrote:
> > I'm wondering why md raid5 accepts writes after 2 disks have failed. [...]
> > I have now replaced the cable and tried to reassemble the array
> > (mdadm -f --run); the event counter increased from 4908158 to 4929612 on
> > the failed disks, but I cannot mount the file system and 'xfs_repair -n'
> > shows a lot of errors. This is explainable by the partially successful
> > writes. Ext3 and JFS have an "errors=" mount option to switch the
> > filesystem read-only on any error, but XFS doesn't: why?

"-o ro,norecovery" will allow you to mount the filesystem and get any
uncorrupted data off it.

You still may get shutdowns if you trip across corrupted metadata in
the filesystem, though.

> > That is a good question too, but I think the md layer could save dumb
> > filesystems like XFS if it denied writes after 2 disks have failed, and I
> > cannot see a good reason why it doesn't behave that way.

How is *any* filesystem supposed to know that the underlying block
device has gone bad if it is not returning errors?

I did mention this exact scenario in the filesystems workshop back in
February - we'd *really* like to know if a RAID block device has gone
into degraded mode (i.e. lost a disk) so we can throttle new writes
until the rebuild has been completed. Stopping writes completely on a
fatal error (like 2 lost disks in RAID5, and 3 lost disks in RAID6)
would also be possible if only we could get the information out of the
block layer.

> > Do you have a better idea how I can avoid this kind of filesystem
> > corruption in the future? No, I don't want to use ext3 on this box. :)

Well, the problem is a bug in MD - it should have detected drives going
away and stopped access to the device until it was repaired. You would
have had the same problem with ext3, or JFS, or reiser or any other
filesystem, too.

> > My mount error:
> >   XFS: Log inconsistent (didn't find previous header)
> >   XFS: failed to find log head
> >   XFS: log mount/recovery failed: error 5
> >   XFS: log mount failed

Your MD device is still hosed - error 5 = EIO; the md device is
reporting errors back to the filesystem now. You need to fix that
before trying to recover any data...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Pallai Roland @ 2007-05-25 1:35 UTC
  To: David Chinner; +Cc: Linux-Raid, xfs

On Fri, 2007-05-25 at 10:05 +1000, David Chinner wrote:
> "-o ro,norecovery" will allow you to mount the filesystem and get any
> uncorrupted data off it.
>
> You still may get shutdowns if you trip across corrupted metadata in
> the filesystem, though.

Thanks, I'll try it.

> How is *any* filesystem supposed to know that the underlying block
> device has gone bad if it is not returning errors?

It is returning errors, I think. If I write to a raid5 with 2 failed disks
using dd, I get errors on the missing chunks.

The difference between ext3 and XFS is that ext3 will remount read-only on
the first write error but XFS won't; XFS only fails the current operation,
IMHO. The ext3 method isn't perfect, but in practice it works well.

> I did mention this exact scenario in the filesystems workshop back in
> February - we'd *really* like to know if a RAID block device has gone
> into degraded mode (i.e. lost a disk) so we can throttle new writes
> until the rebuild has been completed. Stopping writes completely on a
> fatal error (like 2 lost disks in RAID5, and 3 lost disks in RAID6)
> would also be possible if only we could get the information out of the
> block layer.

It would be nice, but as I mentioned above, ext3 already does this well in
practice.

> Well, the problem is a bug in MD - it should have detected drives going
> away and stopped access to the device until it was repaired. You would
> have had the same problem with ext3, or JFS, or reiser or any other
> filesystem, too.
>
> Your MD device is still hosed - error 5 = EIO; the md device is
> reporting errors back to the filesystem now. You need to fix that
> before trying to recover any data...

I'll play with it tomorrow, thanks for your help.

--
 d
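For reference, the ext3 behaviour being compared here is controlled by the
errors= mount option; a minimal illustration (device and mount point are made
up for the example):

    # /etc/fstab entry: remount read-only on the first detected error
    /dev/md1  /data  ext3  defaults,errors=remount-ro  0  2

    # equivalent one-off mount; the other accepted values are "continue"
    # and "panic"
    mount -o errors=remount-ro /dev/md1 /data
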
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: David Chinner @ 2007-05-25 4:55 UTC
  To: Pallai Roland; +Cc: David Chinner, Linux-Raid, xfs

On Fri, May 25, 2007 at 03:35:48AM +0200, Pallai Roland wrote:
> On Fri, 2007-05-25 at 10:05 +1000, David Chinner wrote:
> > How is *any* filesystem supposed to know that the underlying block
> > device has gone bad if it is not returning errors?
>
> It is returning errors, I think. If I write to a raid5 with 2 failed disks
> using dd, I get errors on the missing chunks.

Oh, did you look at your logs and find that XFS had spammed them about
writes that were failing?

> The difference between ext3 and XFS is that ext3 will remount read-only on
> the first write error but XFS won't; XFS only fails the current operation,
> IMHO. The ext3 method isn't perfect, but in practice it works well.

XFS will shut down the filesystem if metadata corruption would occur due to
a failed write. We don't immediately fail the filesystem on data write errors
because on large systems you can get *transient* I/O errors (e.g. FC path
failover), so retrying failed data writes is useful for preventing
unnecessary shutdowns of the filesystem.

Different design criteria, different solutions...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Alberto Alonso @ 2007-05-25 5:43 UTC
  To: David Chinner; +Cc: Pallai Roland, Linux-Raid, xfs

> > The difference between ext3 and XFS is that ext3 will remount read-only
> > on the first write error but XFS won't; XFS only fails the current
> > operation, IMHO. The ext3 method isn't perfect, but in practice it works
> > well.
>
> XFS will shut down the filesystem if metadata corruption would occur due to
> a failed write. We don't immediately fail the filesystem on data write
> errors because on large systems you can get *transient* I/O errors (e.g. FC
> path failover), so retrying failed data writes is useful for preventing
> unnecessary shutdowns of the filesystem.
>
> Different design criteria, different solutions...

I think his point was that going into a read-only mode causes a less
catastrophic situation (i.e. a web server can still serve pages). I think
that is a valid point: rather than shutting down the file system completely,
an automatic switch to whatever causes the least disruption of service is
always desired. Maybe the automatic failure mode could be something that is
configurable via the mount options.

I personally have found the XFS file system to be great for my needs (except
for issues with NFS interaction, where the bug report never got answered),
but that doesn't mean it cannot be improved.

Just my 2 cents,

Alberto
--
Alberto Alonso
Global Gate Systems LLC.
(512) 351-7233
http://www.ggsys.net
Hardware, consulting, sysadmin, monitoring and remote backups
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: David Chinner @ 2007-05-25 8:36 UTC
  To: Alberto Alonso; +Cc: David Chinner, Pallai Roland, Linux-Raid, xfs

On Fri, May 25, 2007 at 12:43:51AM -0500, Alberto Alonso wrote:
> I think his point was that going into a read-only mode causes a less
> catastrophic situation (i.e. a web server can still serve pages).

Sure - but once you've detected one corruption or had metadata I/O errors,
can you trust the rest of the filesystem?

> I think that is a valid point: rather than shutting down the file system
> completely, an automatic switch to whatever causes the least disruption of
> service is always desired.

I consider the possibility of serving out bad data (i.e. after a remount to
read-only) to be the worst possible disruption of service that can happen ;)

> Maybe the automatic failure mode could be something that is configurable
> via the mount options.

If only it were that simple. Have you looked to see how many hooks there are
in XFS to shut down without causing further damage?

  % grep FORCED_SHUTDOWN fs/xfs/*.[ch] fs/xfs/*/*.[ch] | wc -l
  116

Changing the way we handle shutdowns would take a lot of time, effort and
testing. When can I expect a patch? ;)

> I personally have found the XFS file system to be great for my needs
> (except for issues with NFS interaction, where the bug report never got
> answered), but that doesn't mean it cannot be improved.

Got a pointer?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Alberto Alonso @ 2007-05-28 22:45 UTC
  To: David Chinner; +Cc: Pallai Roland, Linux-Raid, xfs

On Fri, 2007-05-25 at 18:36 +1000, David Chinner wrote:
> Sure - but once you've detected one corruption or had metadata I/O errors,
> can you trust the rest of the filesystem?
>
> I consider the possibility of serving out bad data (i.e. after a remount to
> read-only) to be the worst possible disruption of service that can happen ;)

I guess it does depend on the nature of the failure. A write failure on block
2000 does not imply corruption of the other 2TB of data.

I wish I knew more about the internals of file systems; since I don't, I was
just commenting on features that would be nice, though maybe there is no way
to implement them. I figured that a dynamic table of bad blocks could be
kept: if an attempt is made to access one of those blocks (read or write), an
I/O error is returned; if the block is not on the list, the access is
processed. This would let a server with large file systems continue operating
for most users.

> > I personally have found the XFS file system to be great for my needs
> > (except for issues with NFS interaction, where the bug report never got
> > answered), but that doesn't mean it cannot be improved.
>
> Got a pointer?

I can't seem to find it. I'm pretty sure I used bugzilla to report it. I did
find the kernel dump file though, so here it is:

Oct  3 15:34:07 localhost kernel: xfs_iget_core: ambiguous vns: vp/0xd1e69c80, invp/0xc989e380
Oct  3 15:34:07 localhost kernel: ------------[ cut here ]------------
Oct  3 15:34:07 localhost kernel: kernel BUG at fs/xfs/support/debug.c:106!
Oct 3 15:34:07 localhost kernel: invalid operand: 0000 [#1] Oct 3 15:34:07 localhost kernel: PREEMPT SMP Oct 3 15:34:07 localhost kernel: Modules linked in: af_packet iptable_filter ip_tables nfsd exportfs lockd sunrpc ipv6xfs capability commoncap ext3 jbd mbc ache aic7xxx i2c_dev tsdev floppy mousedev parport_pc parport psmouse evdev pcspkrhw_random shpchp pciehp pci_hotplug intel_agp intel_mch_agp agpgart uhci_h cd usbcore piix ide_core e1000 cfi_cmdset_0001 cfi_util mtdpart mtdcore jedec_probe gen_probe chipreg dm_mod w83781d i2c_sensor i2c_i801 i2c_core raid5 xor genrtc sd_mod aic79xx scsi_mod raid1 md unix font vesafb cfbcopyarea cfbimgblt cfbfillrect Oct 3 15:34:07 localhost kernel: CPU: 0 Oct 3 15:34:07 localhost kernel: EIP: 0060:[__crc_pm_idle +3334982/5290900] Not tainted Oct 3 15:34:07 localhost kernel: EFLAGS: 00010246 (2.6.8-2-686-smp) Oct 3 15:34:07 localhost kernel: EIP is at cmn_err+0xc5/0xe0 [xfs] Oct 3 15:34:07 localhost kernel: eax: 00000000 ebx: f602c000 ecx: c02dcfbc edx: c02dcfbc Oct 3 15:34:07 localhost kernel: esi: f8c40e28 edi: f8c56a3e ebp: 00000293 esp: f602da08 Oct 3 15:34:07 localhost kernel: ds: 007b es: 007b ss: 0068 Oct 3 15:34:07 localhost kernel: Process nfsd (pid: 2740, threadinfo=f602c000 task=f71a7210) Oct 3 15:34:07 localhost kernel: Stack: f8c40e28 f8c40def f8c56a00 00000000 f602c000 074aa1aa f8c41700 ea2f0a40 Oct 3 15:34:07 localhost kernel: f8c0a745 00000000 f8c41700 d1e69c80 c989e380 f7d4cc00 c2934754 074aa1aa Oct 3 15:34:07 localhost kernel: 00000000 f6555624 074aa1aa f7d4cc00 c017d6bd f6555620 00000000 00000000 Oct 3 15:34:07 localhost kernel: Call Trace: Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3123398/5290900] xfs_iget_core+0x565/0x6b0 [xfs] Oct 3 15:34:07 localhost kernel: [iget_locked+189/256] iget_locked +0xbd/0x100 Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3124083/5290900] xfs_iget+0x162/0x1a0 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3252484/5290900] xfs_vget+0x63/0x100 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3331204/5290900] vfs_vget+0x43/0x50 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3329570/5290900] linvfs_get_dentry+0x51/0x90 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+1536451/5290900] find_exported_dentry+0x42/0x830 [exportfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3234969/5290900] xfs_trans_tail_ail+0x38/0x80 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3174595/5290900] xlog_write+0x102/0x580 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3234969/5290900] xfs_trans_tail_ail+0x38/0x80 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3170617/5290900] xlog_assign_tail_lsn+0x18/0x90 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3234969/5290900] xfs_trans_tail_ail+0x38/0x80 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3174595/5290900] xlog_write+0x102/0x580 [xfs] Oct 3 15:34:07 localhost kernel: [alloc_skb+71/240] alloc_skb +0x47/0xf0 Oct 3 15:34:07 localhost kernel: [sock_alloc_send_pskb+197/464] sock_alloc_send_pskb+0xc5/0x1d0 Oct 3 15:34:07 localhost kernel: [sock_alloc_send_skb+45/64] sock_alloc_send_skb+0x2d/0x40 Oct 3 15:34:07 localhost kernel: [ip_append_data+1810/2016] ip_append_data+0x712/0x7e0 Oct 3 15:34:07 localhost kernel: [recalc_task_prio+168/416] recalc_task_prio+0xa8/0x1a0 Oct 3 15:34:07 localhost kernel: [__ip_route_output_key+47/288] __ip_route_output_key+0x2f/0x120 Oct 3 15:34:07 localhost kernel: [udp_sendmsg+831/1888] udp_sendmsg +0x33f/0x760 Oct 3 15:34:07 localhost kernel: [ip_generic_getfrag+0/192] 
ip_generic_getfrag+0x0/0xc0 Oct 3 15:34:07 localhost kernel: [qdisc_restart+23/560] qdisc_restart +0x17/0x230 Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+1539451/5290900] export_decode_fh+0x5a/0x7a [exportfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4695505/5290900] nfsd_acceptable+0x0/0x140 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4696349/5290900] fh_verify+0x20c/0x5a0 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4695505/5290900] nfsd_acceptable+0x0/0x140 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4702954/5290900] nfsd_open+0x39/0x1a0 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4704974/5290900] nfsd_write+0x5d/0x360 [nfsd] Oct 3 15:34:07 localhost kernel: [skb_copy_and_csum_bits+102/784] skb_copy_and_csum_bits+0x66/0x310 Oct 3 15:34:07 localhost kernel: [resched_task+83/144] resched_task +0x53/0x90 Oct 3 15:34:07 localhost kernel: [skb_copy_and_csum_bits+556/784] skb_copy_and_csum_bits+0x22c/0x310 Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+2136279/5290900] skb_read_and_csum_bits+0x46/0x90 [sunrpc] Oct 3 15:34:07 localhost kernel: [kfree_skbmem+36/48] kfree_skbmem +0x24/0x30 Oct 3 15:34:07 localhost kernel: [__kfree_skb+173/336] __kfree_skb +0xad/0x150 Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+2184090/5290900] xdr_partial_copy_from_skb+0x169/0x180 [sunrpc] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+2180355/5290900] svcauth_unix_accept+0x272/0x2c0 [sunrpc] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4735417/5290900] nfsd3_proc_write+0xb8/0x120 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4688328/5290900] nfsd_dispatch+0xd7/0x1e0 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4688113/5290900] nfsd_dispatch+0x0/0x1e0 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+2162754/5290900] svc_process+0x4b1/0x619 [sunrpc] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4687545/5290900] nfsd +0x248/0x480 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4686961/5290900] nfsd +0x0/0x480 [nfsd] Oct 3 15:34:07 localhost kernel: [kernel_thread_helper+5/16] kernel_thread_helper+0x5/0x10 Oct 3 15:34:07 localhost kernel: Code: 0f 0b 6a 00 0f 0e c4 f8 83 c4 10 5b 5e 5f 5d c3 e8 c6 03 66 Oct 3 15:34:07 localhost kernel: <6>note: nfsd[2740] exited with preempt_count 1 Oct 3 15:51:23 localhost kernel: klogd 1.4.1#17, log source = /proc/kmsg started. Oct 3 15:51:23 localhost kernel: Inspecting /boot/System.map-2.6.8-2-686-smp Oct 3 15:51:24 localhost kernel: Loaded 27755 symbols from /boot/System.map-2.6.8-2-686-smp. Oct 3 15:51:24 localhost kernel: Symbols match kernel version 2.6.8. Oct 3 15:51:24 localhost kernel: No module symbols loaded - kernel modules not enabled. Oct 3 15:51:24 localhost kernel: fef0000 (usable) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000bfef0000 - 00000000bfefc000 (ACPI data) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000bfefc000 - 00000000bff00000 (ACPI NVS) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000bff00000 - 00000000bff80000 (usable) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000bff80000 - 00000000c0000000 (reserved) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000ff800000 - 00000000ffc00000 (reserved) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved) Oct 3 15:51:24 localhost kernel: 2175MB HIGHMEM available. 
Oct 3 15:51:24 localhost kernel: 896MB LOWMEM available. Oct 3 15:51:24 localhost kernel: found SMP MP-table at 000f6810 Oct 3 15:51:24 localhost kernel: On node 0 totalpages: 786304 Oct 3 15:51:24 localhost kernel: DMA zone: 4096 pages, LIFO batch:1 Oct 3 15:51:24 localhost kernel: Normal zone: 225280 pages, LIFO batch:16 Oct 3 15:51:24 localhost kernel: HighMem zone: 556928 pages, LIFO batch:16 Oct 3 15:51:24 localhost kernel: DMI present. Thanks, Alberto ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: David Chinner @ 2007-05-29 3:28 UTC
  To: Alberto Alonso; +Cc: David Chinner, Pallai Roland, Linux-Raid, xfs

On Mon, May 28, 2007 at 05:45:27PM -0500, Alberto Alonso wrote:
> On Fri, 2007-05-25 at 18:36 +1000, David Chinner wrote:
> > I consider the possibility of serving out bad data (i.e. after a remount
> > to read-only) to be the worst possible disruption of service that can
> > happen ;)
>
> I guess it does depend on the nature of the failure. A write failure on
> block 2000 does not imply corruption of the other 2TB of data.

The rest might not be corrupted, but if block 2000 is an index of some sort
(i.e. metadata), you could reference any of that 2TB incorrectly and get the
wrong data, write to the wrong spot on disk, etc.

> > > I personally have found the XFS file system to be great for my needs
> > > (except for issues with NFS interaction, where the bug report never got
> > > answered), but that doesn't mean it cannot be improved.
> >
> > Got a pointer?
>
> I can't seem to find it. I'm pretty sure I used bugzilla to report it. I
> did find the kernel dump file though, so here it is:
>
> Oct  3 15:34:07 localhost kernel: xfs_iget_core: ambiguous vns:
> vp/0xd1e69c80, invp/0xc989e380

Oh, I haven't seen any of those problems for quite some time.

> = /proc/kmsg started.
> Oct  3 15:51:23 localhost kernel:
> Inspecting /boot/System.map-2.6.8-2-686-smp

Oh, well, yes, kernels that old did have that problem. It got fixed some time
around 2.6.12 or 2.6.13 IIRC....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Alberto Alonso @ 2007-05-29 3:37 UTC
  To: David Chinner; +Cc: Pallai Roland, Linux-Raid, xfs

On Tue, 2007-05-29 at 13:28 +1000, David Chinner wrote:
> The rest might not be corrupted, but if block 2000 is an index of some sort
> (i.e. metadata), you could reference any of that 2TB incorrectly and get
> the wrong data, write to the wrong spot on disk, etc.

Forgive my ignorance, but if block 2000 is an index, then to access the data
it references you would go through block 2000, which would return an error
without continuing on to any data pointed to by it. Isn't that how things
work?

> Oh, well, yes, kernels that old did have that problem. It got fixed some
> time around 2.6.12 or 2.6.13 IIRC....

Time for a kernel upgrade then :-)

Thanks for all your enlightenment, I think I am learning quite a few things.

Alberto
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Pallai Roland @ 2007-05-25 14:35 UTC
  To: David Chinner; +Cc: Linux-Raid, xfs

On Friday 25 May 2007 06:55:00 David Chinner wrote:
> Oh, did you look at your logs and find that XFS had spammed them about
> writes that were failing?

The first message after the incident:

May 24 01:53:50 hq kernel: Filesystem "loop1": XFS internal error xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c.  Caller 0xf8ac14f8
May 24 01:53:50 hq kernel:  <f8adae69> xfs_btree_check_sblock+0x4f/0xc2 [xfs]  <f8ac14f8> xfs_alloc_lookup+0x34e/0x47b [xfs]
May 24 01:53:50 HF kernel:  <f8ac14f8> xfs_alloc_lookup+0x34e/0x47b [xfs]  <f8b1a9c7> kmem_zone_zalloc+0x1b/0x43 [xfs]
May 24 01:53:50 hq kernel:  <f8abe645> xfs_alloc_ag_vextent+0x24d/0x1110 [xfs]  <f8ac0647> xfs_alloc_vextent+0x3bd/0x53b [xfs]
May 24 01:53:50 hq kernel:  <f8ad2f7e> xfs_bmapi+0x1ac4/0x23cd [xfs]  <f8acab97> xfs_bmap_search_multi_extents+0x8e/0xd8 [xfs]
May 24 01:53:50 hq kernel:  <f8b00001> xlog_dealloc_log+0x49/0xea [xfs]  <f8afdaee> xfs_iomap_write_allocate+0x2d9/0x58b [xfs]
May 24 01:53:50 hq kernel:  <f8afc3ae> xfs_iomap+0x60e/0x82d [xfs]  <c0113bc8> __wake_up_common+0x39/0x59
May 24 01:53:50 hq kernel:  <f8b1ae11> xfs_map_blocks+0x39/0x6c [xfs]  <f8b1bd7b> xfs_page_state_convert+0x644/0xf9c [xfs]
May 24 01:53:50 hq kernel:  <c036f384> schedule+0x5d1/0xf4d  <f8b1c780> xfs_vm_writepage+0x0/0xe0 [xfs]
May 24 01:53:50 hq kernel:  <f8b1c7d7> xfs_vm_writepage+0x57/0xe0 [xfs]  <c01830e8> mpage_writepages+0x1fb/0x3bb
May 24 01:53:50 hq kernel:  <c0183020> mpage_writepages+0x133/0x3bb  <f8b1c780> xfs_vm_writepage+0x0/0xe0 [xfs]
May 24 01:53:50 hq kernel:  <c0147bb3> do_writepages+0x35/0x3b  <c018135c> __writeback_single_inode+0x88/0x387
May 24 01:53:50 hq kernel:  <c01819b7> sync_sb_inodes+0x1b4/0x2a8  <c0181c63> writeback_inodes+0x63/0xdc
May 24 01:53:50 hq kernel:  <c0147943> background_writeout+0x66/0x9f  <c01482b3> pdflush+0x0/0x1ad
May 24 01:53:50 hq kernel:  <c01483a2> pdflush+0xef/0x1ad  <c01478dd> background_writeout+0x0/0x9f
May 24 01:53:50 hq kernel:  <c012d10b> kthread+0xc2/0xc6  <c012d049> kthread+0x0/0xc6
May 24 01:53:50 hq kernel:  <c0100dd5> kernel_thread_helper+0x5/0xb

...and my logs are full of such messages. Isn't this "internal error" a good
reason to shut down the file system? I think that if there's a sign of a
corrupted file system, the first thing we should do is stop writes (or the
entire FS) and let the admin examine the situation. I'm not talking about my
case, where the md raid5 was braindead; I'm talking about the general
situation.

--
 d
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: David Chinner @ 2007-05-28 0:30 UTC
  To: Pallai Roland; +Cc: David Chinner, Linux-Raid, xfs

On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote:
> On Friday 25 May 2007 06:55:00 David Chinner wrote:
> > Oh, did you look at your logs and find that XFS had spammed them about
> > writes that were failing?
>
> The first message after the incident:
>
> May 24 01:53:50 hq kernel: Filesystem "loop1": XFS internal error
> xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c.
> Caller 0xf8ac14f8
> [stack trace snipped]
>
> ...and my logs are full of such messages. Isn't this "internal error" a
> good reason to shut down the file system?

Actually, that error does shut the filesystem down in most cases. When you
see that output, the function is returning -EFSCORRUPTED. You've got a
corrupted freespace btree.

The reason why you get spammed is that this is happening during background
writeback, and there is no one to return the -EFSCORRUPTED error to. The
background writeback path doesn't specifically detect shut-down filesystems
or trigger shutdowns on errors because that happens in different layers, so
you just end up with failed data writes. These errors will occur on the next
foreground data or metadata allocation, and that will shut the filesystem
down at that point.

I'm not sure that we should be ignoring EFSCORRUPTED errors here; maybe in
this case we should be shutting down the filesystem. That would certainly cut
down on the spamming and would not appear to change any other behaviour....

> I think that if there's a sign of a corrupted file system, the first thing
> we should do is stop writes (or the entire FS) and let the admin examine
> the situation.

Yes, that's *exactly* what a shutdown does. In this case, your writes are
being stopped - hence the error messages - but the filesystem has not yet
been shut down.....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
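To see what a forced shutdown looks like from userspace, xfs_io can trigger
one deliberately; a rough sketch, assuming /mnt/scratch is an expendable XFS
mount and the device name is a placeholder:

    xfs_io -x -c "shutdown" /mnt/scratch     # force an immediate shutdown
    touch /mnt/scratch/foo                   # further operations fail with EIO
    umount /mnt/scratch
    mount /dev/sdX1 /mnt/scratch             # remount (with log recovery) clears it
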
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Pallai Roland @ 2007-05-28 1:50 UTC
  To: David Chinner; +Cc: Linux-Raid, xfs

On Monday 28 May 2007 02:30:11 David Chinner wrote:
> Actually, that error does shut the filesystem down in most cases. When you
> see that output, the function is returning -EFSCORRUPTED. You've got a
> corrupted freespace btree.
>
> The reason why you get spammed is that this is happening during background
> writeback, and there is no one to return the -EFSCORRUPTED error to. [...]
>
> I'm not sure that we should be ignoring EFSCORRUPTED errors here; maybe in
> this case we should be shutting down the filesystem. That would certainly
> cut down on the spamming and would not appear to change any other
> behaviour....

If I remember correctly, my file system wasn't shut down at all; it stayed
"writeable" for the whole night and yafc slowly "wrote" files to it. Maybe
all the write operations actually failed, but yafc doesn't warn about that.

Spamming is just annoying when we need to find out what went wrong (my
kernel.log is 300MB), but for data safety it's important to react to the
EFSCORRUPTED error in any case, I think. Please consider this.

> > I think that if there's a sign of a corrupted file system, the first
> > thing we should do is stop writes (or the entire FS) and let the admin
> > examine the situation.
>
> Yes, that's *exactly* what a shutdown does. In this case, your writes are
> being stopped - hence the error messages - but the filesystem has not yet
> been shut down.....

All writes that touched the freespace btree were stopped, but a few
operations were still executed (on the corrupted FS), right? Ignoring
EFSCORRUPTED isn't a good idea in this case.

--
 d
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: David Chinner @ 2007-05-28 2:17 UTC
  To: Pallai Roland; +Cc: David Chinner, Linux-Raid, xfs

On Mon, May 28, 2007 at 03:50:17AM +0200, Pallai Roland wrote:
> If I remember correctly, my file system wasn't shut down at all; it stayed
> "writeable" for the whole night and yafc slowly "wrote" files to it. Maybe
> all the write operations actually failed, but yafc doesn't warn about that.

So you never created new files or directories, unlinked files or directories,
did synchronous writes, etc? Just had slowly growing files?

> Spamming is just annoying when we need to find out what went wrong (my
> kernel.log is 300MB), but for data safety it's important to react to the
> EFSCORRUPTED error in any case, I think. Please consider this.

The filesystem has responded correctly to the corruption in terms of data
security (i.e. failed the data write and warned noisily about it), but it
probably hasn't done everything it should....

Hmmmm. A quick look at the Linux code makes me think that background
writeback on Linux has never been able to cause a shutdown in this case.
However, the same error on Irix will definitely cause a shutdown....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Pallai Roland @ 2007-05-28 11:17 UTC
  To: David Chinner; +Cc: Linux-Raid, xfs

On Monday 28 May 2007 04:17:18 David Chinner wrote:
> So you never created new files or directories, unlinked files or
> directories, did synchronous writes, etc? Just had slowly growing files?

I just overwrote badly downloaded files.

> Hmmmm. A quick look at the Linux code makes me think that background
> writeback on Linux has never been able to cause a shutdown in this case.
> However, the same error on Irix will definitely cause a shutdown....

I hope Linux will follow Irix; that would be a consistent standpoint.

David, do you have a plan to implement your "reporting raid5 block layer"
idea? As far as I can see, no one else cares about this silent data loss on
temporarily (cable, power) failed raid5 arrays, so I really hope you do at
least!

--
 d
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: David Chinner @ 2007-05-28 23:06 UTC
  To: Pallai Roland; +Cc: David Chinner, Linux-Raid, xfs

On Mon, May 28, 2007 at 01:17:31PM +0200, Pallai Roland wrote:
> On Monday 28 May 2007 04:17:18 David Chinner wrote:
> > Hmmmm. A quick look at the Linux code makes me think that background
> > writeback on Linux has never been able to cause a shutdown in this case.
> > However, the same error on Irix will definitely cause a shutdown....
>
> I hope Linux will follow Irix; that would be a consistent standpoint.

I raised a bug for this yesterday when writing that reply. It won't get
forgotten now....

> David, do you have a plan to implement your "reporting raid5 block layer"
> idea? As far as I can see, no one else cares about this silent data loss on
> temporarily (cable, power) failed raid5 arrays, so I really hope you do at
> least!

Yeah, I'd love to get something like this happening, but given it's about
half way down my list of "stuff to do when I have some spare time", I'd say
it will be about 2015 before I get to it.....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Pallai Roland @ 2007-05-25 14:01 UTC
  To: David Chinner; +Cc: Linux-Raid, xfs

On Friday 25 May 2007 03:35:48 Pallai Roland wrote:
> On Fri, 2007-05-25 at 10:05 +1000, David Chinner wrote:
> > How is *any* filesystem supposed to know that the underlying block
> > device has gone bad if it is not returning errors?
>
> It is returning errors, I think. If I write to a raid5 with 2 failed disks
> using dd, I get errors on the missing chunks.
>
> The difference between ext3 and XFS is that ext3 will remount read-only on
> the first write error but XFS won't; XFS only fails the current operation,
> IMHO. The ext3 method isn't perfect, but in practice it works well.

Sorry, I was wrong: md really isn't returning errors! That's madness, IMHO.

The reason ext3 is safer on raid5 in practice is that ext3 also remounts
read-only on read errors, so when a raid5 array has lost 2 drives and some
read happens, the errors= behaviour of ext3 kicks in and stops further
writes. You're right, it's not a good solution: even on ext3, read operations
have to happen for data loss to be prevented in this case. Raid5 *must deny
all writes* when 2 disks have failed: I still can't see a good reason why
not, and the current behaviour is braindead!

> > I did mention this exact scenario in the filesystems workshop back in
> > February - we'd *really* like to know if a RAID block device has gone
> > into degraded mode (i.e. lost a disk) so we can throttle new writes
> > until the rebuild has been completed. Stopping writes completely on a
> > fatal error (like 2 lost disks in RAID5, and 3 lost disks in RAID6)
> > would also be possible if only we could get the information out of the
> > block layer.

Yes, that sounds good, but I think we need a quick fix now; it's a real
problem and can easily lead to mass data loss.

--
 d
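Until md (or the block layer) reports this properly, one stop-gap is to watch
the array from userspace and force the filesystem read-only as soon as more
than one member has failed. A rough sketch only: the array name, mount point
and threshold are assumptions, and mdadm --monitor with a --program hook is
the more usual way to receive these events.

    #!/bin/sh
    # Remount read-only if the raid5 array has lost more than one member.
    FAILED=$(mdadm --detail /dev/md1 | awk '/Failed Devices/ {print $4}')
    if [ "${FAILED:-0}" -gt 1 ]; then
        logger "md1: $FAILED failed devices, forcing /data read-only"
        mount -o remount,ro /data
    fi
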
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Pallai Roland @ 2007-05-28 12:53 UTC
  To: David Chinner; +Cc: Linux-Raid, xfs

On Friday 25 May 2007 02:05:47 David Chinner wrote:
> "-o ro,norecovery" will allow you to mount the filesystem and get any
> uncorrupted data off it.
>
> You still may get shutdowns if you trip across corrupted metadata in
> the filesystem, though.

This filesystem is completely dead.

hq:~# mount -o ro,norecovery /dev/loop1 /mnt/r5
May 28 13:41:50 hq kernel: Mounting filesystem "loop1" in no-recovery mode.  Filesystem will be inconsistent.
May 28 13:41:50 hq kernel: XFS: failed to read root inode

hq:~# xfs_db /dev/loop1
xfs_db: cannot read root inode (22)
xfs_db: cannot read realtime bitmap inode (22)
Segmentation fault

hq:~# strace xfs_db /dev/loop1
_llseek(4, 0, [0], SEEK_SET)            = 0
read(4, "XFSB\0\0\20\0\0\0\0\0\6\374\253\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
pread(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 480141901312) = 512
pread(4, "\30G$L\203\33OE \256=\207@\340\264O\"\324\2074DY\323\6"..., 8192, 131072) = 8192
write(2, "xfs_db: cannot read root inode ("..., 36xfs_db: cannot read root inode (22)
) = 36
pread(4, "\30G$L\203\33OE \256=\207@\340\264O\"\324\2074DY\323\6"..., 8192, 131072) = 8192
write(2, "xfs_db: cannot read realtime bit"..., 47xfs_db: cannot read realtime bitmap inode (22)
) = 47
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV +++

Browsing with hexdump -C, it looks like part of a PDF file sits at 128KB, in
place of the root inode. :(

--
 d
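When xfs_db dies like this, about all that can still be done read-only is to
ask the superblock where the root inode should live and inspect that region
by hand. An illustrative sequence; the inode number is a placeholder, and on
a filesystem this badly damaged these commands may fail just as above:

    xfs_db -r /dev/loop1 -c "sb 0" -c "p rootino"     # root inode number per the superblock
    xfs_db -r /dev/loop1 -c "inode <rootino>" -c "p"  # dump that inode, if readable
    xfs_db -r /dev/loop1 -c "sb 0" -c "p"             # full superblock dump for sanity checking
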
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Pallai Roland @ 2007-05-28 15:30 UTC
  To: David Chinner; +Cc: Linux-Raid, xfs

On Monday 28 May 2007 14:53:55 Pallai Roland wrote:
> On Friday 25 May 2007 02:05:47 David Chinner wrote:
> > "-o ro,norecovery" will allow you to mount the filesystem and get any
> > uncorrupted data off it.
> >
> > You still may get shutdowns if you trip across corrupted metadata in
> > the filesystem, though.
>
> This filesystem is completely dead.
> [...]

I tried to write an md patch to stop writes if a raid5 array has 2+ failed
drives, but I found it's already done, oops. :) handle_stripe5() quietly
ignores writes in this case; I tried it and it works. So how did I lose my
file system? My first guess about partially successful writes wasn't right:
there were no real writes to the disks after the second disk was kicked, so
from this point of view the scenario is the same as a simple power loss. Am I
thinking right?

There's another layer on this box between md and XFS: loop-aes. I've used it
for years and it has been rock stable, but now it's my first suspect, because
I found a bug in it today: I assembled my array from n-1 disks, then failed a
second disk for a test, and I found that /dev/loop1 still provides *random*
data where /dev/md1 serves nothing. It's definitely a loop-aes bug:

/dev/loop1: [0700]:180907 (/dev/md1) encryption=AES128 multi-key-v3

hq:~# dd if=/dev/md1 bs=1k count=128 skip=128 >/dev/null
dd: reading `/dev/md1': Input/output error
0+0 records in
0+0 records out

hq:~# dd if=/dev/loop1 bs=1k count=128 skip=128 | md5sum
128+0 records in
128+0 records out
131072 bytes (131 kB) copied, 0.027775 seconds, 4.7 MB/s
e2548a924a0e835bb45fb50058acba98  -      (!!!)

hq:~# dd if=/dev/loop1 bs=1k count=128 skip=128 | md5sum
128+0 records in
128+0 records out
131072 bytes (131 kB) copied, 0.030311 seconds, 4.3 MB/s
c6a23412fb75eb5a7eb1d6a7813eb86b  -      (!!!)

It doesn't explain my screwed-up file system by itself, but for me it's
enough to drop loop-aes. Eh.

--
 d
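For anyone who wants to reproduce this kind of double-failure test without
risking real disks, it can be staged entirely on loopback files. A sketch
only: sizes, file paths, loop numbers and the md device name are arbitrary
choices, not taken from the thread.

    for i in 0 1 2 3; do
        dd if=/dev/zero of=/tmp/r5-$i bs=1M count=64
        losetup /dev/loop$((i + 2)) /tmp/r5-$i
    done
    mdadm --create /dev/md9 --level=5 --raid-devices=4 /dev/loop[2-5]
    mdadm /dev/md9 --fail /dev/loop4
    mdadm /dev/md9 --fail /dev/loop5            # second failure: the array is now dead
    dd if=/dev/md9 of=/dev/null bs=1k count=16  # should fail with an I/O error
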
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: David Chinner @ 2007-05-28 23:36 UTC
  To: Pallai Roland; +Cc: David Chinner, Linux-Raid, xfs

On Mon, May 28, 2007 at 05:30:52PM +0200, Pallai Roland wrote:
> I tried to write an md patch to stop writes if a raid5 array has 2+ failed
> drives, but I found it's already done, oops. :) handle_stripe5() quietly
> ignores writes in this case; I tried it and it works.

Hmmm - it clears the uptodate bit on the bio, which is supposed to make the
bio return EIO. That looks to be doing the right thing...

> There's another layer on this box between md and XFS: loop-aes.

Oh, that's a kind of important thing to forget to mention....

> I've used it for years and it has been rock stable, but now it's my first
> suspect, because I found a bug in it today: I assembled my array from n-1
> disks, then failed a second disk for a test, and I found that /dev/loop1
> still provides *random* data where /dev/md1 serves nothing. It's definitely
> a loop-aes bug:
> [...]
> It doesn't explain my screwed-up file system by itself, but for me it's
> enough to drop loop-aes. Eh.

If you can get random data back instead of an error from the block device,
then I'm not surprised your filesystem is toast. If it's one sector in a
larger block that is corrupted, then the only thing that will protect you
from this sort of corruption causing problems is metadata checksums (yet
another thing on my list of stuff to do).

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group