* raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Pallai Roland @ 2007-05-24 11:18 UTC
  To: Linux-Raid

Hi,

I'm wondering why md raid5 accepts writes after 2 disks have failed. I have an
array built from 7 drives, and the filesystem on it is XFS. Yesterday an IDE
cable failed (my friend kicked it out of the box on the floor :) and 2 disks
were kicked from the array, but my download (yafc) did not stop - it kept
writing to the file system for the whole night!

I have now replaced the cable and tried to reassemble the array
(mdadm -f --run); the event counter increased from 4908158 to 4929612 on the
failed disks, but I cannot mount the file system and 'xfs_repair -n' shows a
lot of errors. This is explainable by the partially successful writes. Ext3
and JFS have an "errors=" mount option to switch the filesystem read-only on
any error, but XFS doesn't: why? That is a good question too, but I think the
md layer could save dumb filesystems like XFS if it denied writes after 2
disks have failed, and I cannot see a good reason why it doesn't behave that
way.

Do you have a better idea how I can avoid this kind of filesystem corruption
in the future? No, I don't want to use ext3 on this box. :)

My mount error:
  XFS: Log inconsistent (didn't find previous header)
  XFS: failed to find log head
  XFS: log mount/recovery failed: error 5
  XFS: log mount failed

--
 d
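For anyone who lands in the same spot, a conservative first response is to
check the array state and the filesystem read-only before letting anything
write to it again. A minimal sketch, with device names and mount point
invented for illustration rather than taken from the report above:

    # Inspect what md thinks of the array before touching anything
    cat /proc/mdstat
    mdadm --detail /dev/md1          # look at "State" and "Failed Devices"

    # Reassemble from the surviving superblocks, but don't start writing yet
    mdadm --assemble --force --run /dev/md1 /dev/hd[a-g]1

    # Check the filesystem without modifying it; note that a plain ro mount
    # of XFS still replays the log, ro,norecovery (suggested later in the
    # thread) avoids even that
    mount -o ro /dev/md1 /mnt/recovery
    # or, with the filesystem unmounted, a report-only check:
    xfs_repair -n /dev/md1
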
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Justin Piszcz @ 2007-05-24 11:20 UTC
  To: Pallai Roland; +Cc: Linux-Raid, xfs

Including XFS mailing list on this one.

On Thu, 24 May 2007, Pallai Roland wrote:
> I'm wondering why md raid5 accepts writes after 2 disks have failed. I have
> an array built from 7 drives, and the filesystem on it is XFS. Yesterday an
> IDE cable failed (my friend kicked it out of the box on the floor :) and 2
> disks were kicked from the array, but my download (yafc) did not stop - it
> kept writing to the file system for the whole night!
> [...]
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: David Chinner @ 2007-05-25 0:05 UTC
  To: Justin Piszcz; +Cc: Pallai Roland, Linux-Raid, xfs

On Thu, May 24, 2007 at 07:20:35AM -0400, Justin Piszcz wrote:
> Including XFS mailing list on this one.

Thanks Justin.

> On Thu, 24 May 2007, Pallai Roland wrote:
> > I'm wondering why md raid5 accepts writes after 2 disks have failed. [...]
> > I have now replaced the cable and tried to reassemble the array
> > (mdadm -f --run); the event counter increased from 4908158 to 4929612 on
> > the failed disks, but I cannot mount the file system and 'xfs_repair -n'
> > shows a lot of errors. This is explainable by the partially successful
> > writes. Ext3 and JFS have an "errors=" mount option to switch the
> > filesystem read-only on any error, but XFS doesn't: why?

"-o ro,norecovery" will allow you to mount the filesystem and get any
uncorrupted data off it.

You still may get shutdowns if you trip across corrupted metadata in
the filesystem, though.

> > That is a good question too, but I think the md layer could save dumb
> > filesystems like XFS if it denied writes after 2 disks have failed, and I
> > cannot see a good reason why it doesn't behave that way.

How is *any* filesystem supposed to know that the underlying block
device has gone bad if it is not returning errors?

I did mention this exact scenario in the filesystems workshop back in
February - we'd *really* like to know if a RAID block device has gone
into degraded mode (i.e. lost a disk) so we can throttle new writes
until the rebuild has been completed. Stopping writes completely on a
fatal error (like 2 lost disks in RAID5, and 3 lost disks in RAID6)
would also be possible if only we could get the information out of the
block layer.

> > Do you have a better idea how I can avoid this kind of filesystem
> > corruption in the future? No, I don't want to use ext3 on this box. :)

Well, the problem is a bug in MD - it should have detected drives going
away and stopped access to the device until it was repaired. You would
have had the same problem with ext3, or JFS, or reiser or any other
filesystem, too.

> > My mount error:
> >   XFS: Log inconsistent (didn't find previous header)
> >   XFS: failed to find log head
> >   XFS: log mount/recovery failed: error 5
> >   XFS: log mount failed

Your MD device is still hosed - error 5 = EIO; the md device is
reporting errors back to the filesystem now. You need to fix that
before trying to recover any data...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Pallai Roland @ 2007-05-25 1:35 UTC
  To: David Chinner; +Cc: Linux-Raid, xfs

On Fri, 2007-05-25 at 10:05 +1000, David Chinner wrote:
> "-o ro,norecovery" will allow you to mount the filesystem and get any
> uncorrupted data off it.
>
> You still may get shutdowns if you trip across corrupted metadata in
> the filesystem, though.

Thanks, I'll try it.

> How is *any* filesystem supposed to know that the underlying block
> device has gone bad if it is not returning errors?

It is returning errors, I think. If I write to a raid5 with 2 failed disks
using dd, I get errors on the missing chunks.

The difference between ext3 and XFS is that ext3 will remount read-only on
the first write error but XFS won't; XFS only fails the current operation,
IMHO. The ext3 method isn't perfect, but in practice it works well.

> I did mention this exact scenario in the filesystems workshop back in
> February - we'd *really* like to know if a RAID block device has gone
> into degraded mode (i.e. lost a disk) so we can throttle new writes
> until the rebuild has been completed. Stopping writes completely on a
> fatal error (like 2 lost disks in RAID5, and 3 lost disks in RAID6)
> would also be possible if only we could get the information out of the
> block layer.

It would be nice, but as I mentioned above, ext3 already does this well in
practice.

> Well, the problem is a bug in MD - it should have detected drives going
> away and stopped access to the device until it was repaired. You would
> have had the same problem with ext3, or JFS, or reiser or any other
> filesystem, too.
>
> Your MD device is still hosed - error 5 = EIO; the md device is
> reporting errors back to the filesystem now. You need to fix that
> before trying to recover any data...

I'll play with it tomorrow, thanks for your help.

--
 d
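For reference, the ext3 behaviour being compared here is controlled by the
errors= mount option; a minimal illustration (device and mount point are made
up for the example):

    # /etc/fstab entry: remount read-only on the first detected error
    /dev/md1  /data  ext3  defaults,errors=remount-ro  0  2

    # equivalent one-off mount; the other accepted values are "continue"
    # and "panic"
    mount -o errors=remount-ro /dev/md1 /data
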
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: David Chinner @ 2007-05-25 4:55 UTC
  To: Pallai Roland; +Cc: David Chinner, Linux-Raid, xfs

On Fri, May 25, 2007 at 03:35:48AM +0200, Pallai Roland wrote:
> On Fri, 2007-05-25 at 10:05 +1000, David Chinner wrote:
> > How is *any* filesystem supposed to know that the underlying block
> > device has gone bad if it is not returning errors?
>
> It is returning errors, I think. If I write to a raid5 with 2 failed disks
> using dd, I get errors on the missing chunks.

Oh, did you look at your logs and find that XFS had spammed them about
writes that were failing?

> The difference between ext3 and XFS is that ext3 will remount read-only on
> the first write error but XFS won't; XFS only fails the current operation,
> IMHO. The ext3 method isn't perfect, but in practice it works well.

XFS will shut down the filesystem if metadata corruption would occur due to
a failed write. We don't immediately fail the filesystem on data write errors
because on large systems you can get *transient* I/O errors (e.g. FC path
failover), so retrying failed data writes is useful for preventing
unnecessary shutdowns of the filesystem.

Different design criteria, different solutions...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Alberto Alonso @ 2007-05-25 5:43 UTC
  To: David Chinner; +Cc: Pallai Roland, Linux-Raid, xfs

> > The difference between ext3 and XFS is that ext3 will remount read-only
> > on the first write error but XFS won't; XFS only fails the current
> > operation, IMHO. The ext3 method isn't perfect, but in practice it works
> > well.
>
> XFS will shut down the filesystem if metadata corruption would occur due to
> a failed write. We don't immediately fail the filesystem on data write
> errors because on large systems you can get *transient* I/O errors (e.g. FC
> path failover), so retrying failed data writes is useful for preventing
> unnecessary shutdowns of the filesystem.
>
> Different design criteria, different solutions...

I think his point was that going into a read-only mode causes a less
catastrophic situation (i.e. a web server can still serve pages). I think
that is a valid point: rather than shutting down the file system completely,
an automatic switch to whatever causes the least disruption of service is
always desired. Maybe the automatic failure mode could be something that is
configurable via the mount options.

I personally have found the XFS file system to be great for my needs (except
for issues with NFS interaction, where the bug report never got answered),
but that doesn't mean it cannot be improved.

Just my 2 cents,

Alberto
--
Alberto Alonso
Global Gate Systems LLC.
(512) 351-7233
http://www.ggsys.net
Hardware, consulting, sysadmin, monitoring and remote backups
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: David Chinner @ 2007-05-25 8:36 UTC
  To: Alberto Alonso; +Cc: David Chinner, Pallai Roland, Linux-Raid, xfs

On Fri, May 25, 2007 at 12:43:51AM -0500, Alberto Alonso wrote:
> I think his point was that going into a read-only mode causes a less
> catastrophic situation (i.e. a web server can still serve pages).

Sure - but once you've detected one corruption or had metadata I/O errors,
can you trust the rest of the filesystem?

> I think that is a valid point: rather than shutting down the file system
> completely, an automatic switch to whatever causes the least disruption of
> service is always desired.

I consider the possibility of serving out bad data (i.e. after a remount to
read-only) to be the worst possible disruption of service that can happen ;)

> Maybe the automatic failure mode could be something that is configurable
> via the mount options.

If only it were that simple. Have you looked to see how many hooks there are
in XFS to shut down without causing further damage?

  % grep FORCED_SHUTDOWN fs/xfs/*.[ch] fs/xfs/*/*.[ch] | wc -l
  116

Changing the way we handle shutdowns would take a lot of time, effort and
testing. When can I expect a patch? ;)

> I personally have found the XFS file system to be great for my needs
> (except for issues with NFS interaction, where the bug report never got
> answered), but that doesn't mean it cannot be improved.

Got a pointer?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Alberto Alonso @ 2007-05-28 22:45 UTC
  To: David Chinner; +Cc: Pallai Roland, Linux-Raid, xfs

On Fri, 2007-05-25 at 18:36 +1000, David Chinner wrote:
> Sure - but once you've detected one corruption or had metadata I/O errors,
> can you trust the rest of the filesystem?
>
> I consider the possibility of serving out bad data (i.e. after a remount to
> read-only) to be the worst possible disruption of service that can happen ;)

I guess it does depend on the nature of the failure. A write failure on block
2000 does not imply corruption of the other 2TB of data.

I wish I knew more about the internals of file systems; since I don't, I was
just commenting on features that would be nice, though maybe there is no way
to implement them. I figured that a dynamic table of bad blocks could be
kept: if an attempt is made to access one of those blocks (read or write), an
I/O error is returned; if the block is not on the list, the access is
processed. This would let a server with large file systems continue operating
for most users.

> > I personally have found the XFS file system to be great for my needs
> > (except for issues with NFS interaction, where the bug report never got
> > answered), but that doesn't mean it cannot be improved.
>
> Got a pointer?

I can't seem to find it. I'm pretty sure I used bugzilla to report it. I did
find the kernel dump file though, so here it is:

Oct  3 15:34:07 localhost kernel: xfs_iget_core: ambiguous vns: vp/0xd1e69c80, invp/0xc989e380
Oct  3 15:34:07 localhost kernel: ------------[ cut here ]------------
Oct  3 15:34:07 localhost kernel: kernel BUG at fs/xfs/support/debug.c:106!
Oct 3 15:34:07 localhost kernel: invalid operand: 0000 [#1] Oct 3 15:34:07 localhost kernel: PREEMPT SMP Oct 3 15:34:07 localhost kernel: Modules linked in: af_packet iptable_filter ip_tables nfsd exportfs lockd sunrpc ipv6xfs capability commoncap ext3 jbd mbc ache aic7xxx i2c_dev tsdev floppy mousedev parport_pc parport psmouse evdev pcspkrhw_random shpchp pciehp pci_hotplug intel_agp intel_mch_agp agpgart uhci_h cd usbcore piix ide_core e1000 cfi_cmdset_0001 cfi_util mtdpart mtdcore jedec_probe gen_probe chipreg dm_mod w83781d i2c_sensor i2c_i801 i2c_core raid5 xor genrtc sd_mod aic79xx scsi_mod raid1 md unix font vesafb cfbcopyarea cfbimgblt cfbfillrect Oct 3 15:34:07 localhost kernel: CPU: 0 Oct 3 15:34:07 localhost kernel: EIP: 0060:[__crc_pm_idle +3334982/5290900] Not tainted Oct 3 15:34:07 localhost kernel: EFLAGS: 00010246 (2.6.8-2-686-smp) Oct 3 15:34:07 localhost kernel: EIP is at cmn_err+0xc5/0xe0 [xfs] Oct 3 15:34:07 localhost kernel: eax: 00000000 ebx: f602c000 ecx: c02dcfbc edx: c02dcfbc Oct 3 15:34:07 localhost kernel: esi: f8c40e28 edi: f8c56a3e ebp: 00000293 esp: f602da08 Oct 3 15:34:07 localhost kernel: ds: 007b es: 007b ss: 0068 Oct 3 15:34:07 localhost kernel: Process nfsd (pid: 2740, threadinfo=f602c000 task=f71a7210) Oct 3 15:34:07 localhost kernel: Stack: f8c40e28 f8c40def f8c56a00 00000000 f602c000 074aa1aa f8c41700 ea2f0a40 Oct 3 15:34:07 localhost kernel: f8c0a745 00000000 f8c41700 d1e69c80 c989e380 f7d4cc00 c2934754 074aa1aa Oct 3 15:34:07 localhost kernel: 00000000 f6555624 074aa1aa f7d4cc00 c017d6bd f6555620 00000000 00000000 Oct 3 15:34:07 localhost kernel: Call Trace: Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3123398/5290900] xfs_iget_core+0x565/0x6b0 [xfs] Oct 3 15:34:07 localhost kernel: [iget_locked+189/256] iget_locked +0xbd/0x100 Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3124083/5290900] xfs_iget+0x162/0x1a0 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3252484/5290900] xfs_vget+0x63/0x100 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3331204/5290900] vfs_vget+0x43/0x50 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3329570/5290900] linvfs_get_dentry+0x51/0x90 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+1536451/5290900] find_exported_dentry+0x42/0x830 [exportfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3234969/5290900] xfs_trans_tail_ail+0x38/0x80 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3174595/5290900] xlog_write+0x102/0x580 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3234969/5290900] xfs_trans_tail_ail+0x38/0x80 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3170617/5290900] xlog_assign_tail_lsn+0x18/0x90 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3234969/5290900] xfs_trans_tail_ail+0x38/0x80 [xfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+3174595/5290900] xlog_write+0x102/0x580 [xfs] Oct 3 15:34:07 localhost kernel: [alloc_skb+71/240] alloc_skb +0x47/0xf0 Oct 3 15:34:07 localhost kernel: [sock_alloc_send_pskb+197/464] sock_alloc_send_pskb+0xc5/0x1d0 Oct 3 15:34:07 localhost kernel: [sock_alloc_send_skb+45/64] sock_alloc_send_skb+0x2d/0x40 Oct 3 15:34:07 localhost kernel: [ip_append_data+1810/2016] ip_append_data+0x712/0x7e0 Oct 3 15:34:07 localhost kernel: [recalc_task_prio+168/416] recalc_task_prio+0xa8/0x1a0 Oct 3 15:34:07 localhost kernel: [__ip_route_output_key+47/288] __ip_route_output_key+0x2f/0x120 Oct 3 15:34:07 localhost kernel: [udp_sendmsg+831/1888] udp_sendmsg +0x33f/0x760 Oct 3 15:34:07 localhost kernel: [ip_generic_getfrag+0/192] 
ip_generic_getfrag+0x0/0xc0 Oct 3 15:34:07 localhost kernel: [qdisc_restart+23/560] qdisc_restart +0x17/0x230 Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+1539451/5290900] export_decode_fh+0x5a/0x7a [exportfs] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4695505/5290900] nfsd_acceptable+0x0/0x140 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4696349/5290900] fh_verify+0x20c/0x5a0 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4695505/5290900] nfsd_acceptable+0x0/0x140 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4702954/5290900] nfsd_open+0x39/0x1a0 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4704974/5290900] nfsd_write+0x5d/0x360 [nfsd] Oct 3 15:34:07 localhost kernel: [skb_copy_and_csum_bits+102/784] skb_copy_and_csum_bits+0x66/0x310 Oct 3 15:34:07 localhost kernel: [resched_task+83/144] resched_task +0x53/0x90 Oct 3 15:34:07 localhost kernel: [skb_copy_and_csum_bits+556/784] skb_copy_and_csum_bits+0x22c/0x310 Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+2136279/5290900] skb_read_and_csum_bits+0x46/0x90 [sunrpc] Oct 3 15:34:07 localhost kernel: [kfree_skbmem+36/48] kfree_skbmem +0x24/0x30 Oct 3 15:34:07 localhost kernel: [__kfree_skb+173/336] __kfree_skb +0xad/0x150 Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+2184090/5290900] xdr_partial_copy_from_skb+0x169/0x180 [sunrpc] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+2180355/5290900] svcauth_unix_accept+0x272/0x2c0 [sunrpc] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4735417/5290900] nfsd3_proc_write+0xb8/0x120 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4688328/5290900] nfsd_dispatch+0xd7/0x1e0 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4688113/5290900] nfsd_dispatch+0x0/0x1e0 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+2162754/5290900] svc_process+0x4b1/0x619 [sunrpc] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4687545/5290900] nfsd +0x248/0x480 [nfsd] Oct 3 15:34:07 localhost kernel: [__crc_pm_idle+4686961/5290900] nfsd +0x0/0x480 [nfsd] Oct 3 15:34:07 localhost kernel: [kernel_thread_helper+5/16] kernel_thread_helper+0x5/0x10 Oct 3 15:34:07 localhost kernel: Code: 0f 0b 6a 00 0f 0e c4 f8 83 c4 10 5b 5e 5f 5d c3 e8 c6 03 66 Oct 3 15:34:07 localhost kernel: <6>note: nfsd[2740] exited with preempt_count 1 Oct 3 15:51:23 localhost kernel: klogd 1.4.1#17, log source = /proc/kmsg started. Oct 3 15:51:23 localhost kernel: Inspecting /boot/System.map-2.6.8-2-686-smp Oct 3 15:51:24 localhost kernel: Loaded 27755 symbols from /boot/System.map-2.6.8-2-686-smp. Oct 3 15:51:24 localhost kernel: Symbols match kernel version 2.6.8. Oct 3 15:51:24 localhost kernel: No module symbols loaded - kernel modules not enabled. Oct 3 15:51:24 localhost kernel: fef0000 (usable) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000bfef0000 - 00000000bfefc000 (ACPI data) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000bfefc000 - 00000000bff00000 (ACPI NVS) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000bff00000 - 00000000bff80000 (usable) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000bff80000 - 00000000c0000000 (reserved) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000ff800000 - 00000000ffc00000 (reserved) Oct 3 15:51:24 localhost kernel: BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved) Oct 3 15:51:24 localhost kernel: 2175MB HIGHMEM available. 
Oct 3 15:51:24 localhost kernel: 896MB LOWMEM available. Oct 3 15:51:24 localhost kernel: found SMP MP-table at 000f6810 Oct 3 15:51:24 localhost kernel: On node 0 totalpages: 786304 Oct 3 15:51:24 localhost kernel: DMA zone: 4096 pages, LIFO batch:1 Oct 3 15:51:24 localhost kernel: Normal zone: 225280 pages, LIFO batch:16 Oct 3 15:51:24 localhost kernel: HighMem zone: 556928 pages, LIFO batch:16 Oct 3 15:51:24 localhost kernel: DMI present. Thanks, Alberto ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: David Chinner @ 2007-05-29 3:28 UTC
  To: Alberto Alonso; +Cc: David Chinner, Pallai Roland, Linux-Raid, xfs

On Mon, May 28, 2007 at 05:45:27PM -0500, Alberto Alonso wrote:
> On Fri, 2007-05-25 at 18:36 +1000, David Chinner wrote:
> > I consider the possibility of serving out bad data (i.e. after a remount
> > to read-only) to be the worst possible disruption of service that can
> > happen ;)
>
> I guess it does depend on the nature of the failure. A write failure on
> block 2000 does not imply corruption of the other 2TB of data.

The rest might not be corrupted, but if block 2000 is an index of some sort
(i.e. metadata), you could reference any of that 2TB incorrectly and get the
wrong data, write to the wrong spot on disk, etc.

> > > I personally have found the XFS file system to be great for my needs
> > > (except for issues with NFS interaction, where the bug report never got
> > > answered), but that doesn't mean it cannot be improved.
> >
> > Got a pointer?
>
> I can't seem to find it. I'm pretty sure I used bugzilla to report it. I
> did find the kernel dump file though, so here it is:
>
> Oct  3 15:34:07 localhost kernel: xfs_iget_core: ambiguous vns:
> vp/0xd1e69c80, invp/0xc989e380

Oh, I haven't seen any of those problems for quite some time.

> = /proc/kmsg started.
> Oct  3 15:51:23 localhost kernel:
> Inspecting /boot/System.map-2.6.8-2-686-smp

Oh, well, yes, kernels that old did have that problem. It got fixed some time
around 2.6.12 or 2.6.13 IIRC....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Alberto Alonso @ 2007-05-29 3:37 UTC
  To: David Chinner; +Cc: Pallai Roland, Linux-Raid, xfs

On Tue, 2007-05-29 at 13:28 +1000, David Chinner wrote:
> The rest might not be corrupted, but if block 2000 is an index of some sort
> (i.e. metadata), you could reference any of that 2TB incorrectly and get
> the wrong data, write to the wrong spot on disk, etc.

Forgive my ignorance, but if block 2000 is an index, then to access the data
it references you would go through block 2000, which would return an error
without continuing on to any data pointed to by it. Isn't that how things
work?

> Oh, well, yes, kernels that old did have that problem. It got fixed some
> time around 2.6.12 or 2.6.13 IIRC....

Time for a kernel upgrade then :-)

Thanks for all your enlightenment, I think I am learning quite a few things.

Alberto
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Pallai Roland @ 2007-05-25 14:35 UTC
  To: David Chinner; +Cc: Linux-Raid, xfs

On Friday 25 May 2007 06:55:00 David Chinner wrote:
> Oh, did you look at your logs and find that XFS had spammed them about
> writes that were failing?

The first message after the incident:

May 24 01:53:50 hq kernel: Filesystem "loop1": XFS internal error xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c.  Caller 0xf8ac14f8
May 24 01:53:50 hq kernel:  <f8adae69> xfs_btree_check_sblock+0x4f/0xc2 [xfs]  <f8ac14f8> xfs_alloc_lookup+0x34e/0x47b [xfs]
May 24 01:53:50 HF kernel:  <f8ac14f8> xfs_alloc_lookup+0x34e/0x47b [xfs]  <f8b1a9c7> kmem_zone_zalloc+0x1b/0x43 [xfs]
May 24 01:53:50 hq kernel:  <f8abe645> xfs_alloc_ag_vextent+0x24d/0x1110 [xfs]  <f8ac0647> xfs_alloc_vextent+0x3bd/0x53b [xfs]
May 24 01:53:50 hq kernel:  <f8ad2f7e> xfs_bmapi+0x1ac4/0x23cd [xfs]  <f8acab97> xfs_bmap_search_multi_extents+0x8e/0xd8 [xfs]
May 24 01:53:50 hq kernel:  <f8b00001> xlog_dealloc_log+0x49/0xea [xfs]  <f8afdaee> xfs_iomap_write_allocate+0x2d9/0x58b [xfs]
May 24 01:53:50 hq kernel:  <f8afc3ae> xfs_iomap+0x60e/0x82d [xfs]  <c0113bc8> __wake_up_common+0x39/0x59
May 24 01:53:50 hq kernel:  <f8b1ae11> xfs_map_blocks+0x39/0x6c [xfs]  <f8b1bd7b> xfs_page_state_convert+0x644/0xf9c [xfs]
May 24 01:53:50 hq kernel:  <c036f384> schedule+0x5d1/0xf4d  <f8b1c780> xfs_vm_writepage+0x0/0xe0 [xfs]
May 24 01:53:50 hq kernel:  <f8b1c7d7> xfs_vm_writepage+0x57/0xe0 [xfs]  <c01830e8> mpage_writepages+0x1fb/0x3bb
May 24 01:53:50 hq kernel:  <c0183020> mpage_writepages+0x133/0x3bb  <f8b1c780> xfs_vm_writepage+0x0/0xe0 [xfs]
May 24 01:53:50 hq kernel:  <c0147bb3> do_writepages+0x35/0x3b  <c018135c> __writeback_single_inode+0x88/0x387
May 24 01:53:50 hq kernel:  <c01819b7> sync_sb_inodes+0x1b4/0x2a8  <c0181c63> writeback_inodes+0x63/0xdc
May 24 01:53:50 hq kernel:  <c0147943> background_writeout+0x66/0x9f  <c01482b3> pdflush+0x0/0x1ad
May 24 01:53:50 hq kernel:  <c01483a2> pdflush+0xef/0x1ad  <c01478dd> background_writeout+0x0/0x9f
May 24 01:53:50 hq kernel:  <c012d10b> kthread+0xc2/0xc6  <c012d049> kthread+0x0/0xc6
May 24 01:53:50 hq kernel:  <c0100dd5> kernel_thread_helper+0x5/0xb

...and my logs are full of such messages. Isn't this "internal error" a good
reason to shut down the file system? I think that if there's a sign of a
corrupted file system, the first thing we should do is stop writes (or the
entire FS) and let the admin examine the situation. I'm not talking about my
case, where the md raid5 was braindead; I'm talking about the general
situation.

--
 d
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: David Chinner @ 2007-05-28 0:30 UTC
  To: Pallai Roland; +Cc: David Chinner, Linux-Raid, xfs

On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote:
> On Friday 25 May 2007 06:55:00 David Chinner wrote:
> > Oh, did you look at your logs and find that XFS had spammed them about
> > writes that were failing?
>
> The first message after the incident:
>
> May 24 01:53:50 hq kernel: Filesystem "loop1": XFS internal error
> xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c.
> Caller 0xf8ac14f8
> [stack trace snipped]
>
> ...and my logs are full of such messages. Isn't this "internal error" a
> good reason to shut down the file system?

Actually, that error does shut the filesystem down in most cases. When you
see that output, the function is returning -EFSCORRUPTED. You've got a
corrupted freespace btree.

The reason why you get spammed is that this is happening during background
writeback, and there is no one to return the -EFSCORRUPTED error to. The
background writeback path doesn't specifically detect shut-down filesystems
or trigger shutdowns on errors because that happens in different layers, so
you just end up with failed data writes. These errors will occur on the next
foreground data or metadata allocation, and that will shut the filesystem
down at that point.

I'm not sure that we should be ignoring EFSCORRUPTED errors here; maybe in
this case we should be shutting down the filesystem. That would certainly cut
down on the spamming and would not appear to change any other behaviour....

> I think that if there's a sign of a corrupted file system, the first thing
> we should do is stop writes (or the entire FS) and let the admin examine
> the situation.

Yes, that's *exactly* what a shutdown does. In this case, your writes are
being stopped - hence the error messages - but the filesystem has not yet
been shut down.....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
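To see what a forced shutdown looks like from userspace, xfs_io can trigger
one deliberately; a rough sketch, assuming /mnt/scratch is an expendable XFS
mount and the device name is a placeholder:

    xfs_io -x -c "shutdown" /mnt/scratch     # force an immediate shutdown
    touch /mnt/scratch/foo                   # further operations fail with EIO
    umount /mnt/scratch
    mount /dev/sdX1 /mnt/scratch             # remount (with log recovery) clears it
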
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Pallai Roland @ 2007-05-28 1:50 UTC
  To: David Chinner; +Cc: Linux-Raid, xfs

On Monday 28 May 2007 02:30:11 David Chinner wrote:
> Actually, that error does shut the filesystem down in most cases. When you
> see that output, the function is returning -EFSCORRUPTED. You've got a
> corrupted freespace btree.
>
> The reason why you get spammed is that this is happening during background
> writeback, and there is no one to return the -EFSCORRUPTED error to. [...]
>
> I'm not sure that we should be ignoring EFSCORRUPTED errors here; maybe in
> this case we should be shutting down the filesystem. That would certainly
> cut down on the spamming and would not appear to change any other
> behaviour....

If I remember correctly, my file system wasn't shut down at all; it stayed
"writeable" for the whole night and yafc slowly "wrote" files to it. Maybe
all the write operations actually failed, but yafc doesn't warn about that.

Spamming is just annoying when we need to find out what went wrong (my
kernel.log is 300MB), but for data safety it's important to react to the
EFSCORRUPTED error in any case, I think. Please consider this.

> > I think that if there's a sign of a corrupted file system, the first
> > thing we should do is stop writes (or the entire FS) and let the admin
> > examine the situation.
>
> Yes, that's *exactly* what a shutdown does. In this case, your writes are
> being stopped - hence the error messages - but the filesystem has not yet
> been shut down.....

All writes that touched the freespace btree were stopped, but a few
operations were still executed (on the corrupted FS), right? Ignoring
EFSCORRUPTED isn't a good idea in this case.

--
 d
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: David Chinner @ 2007-05-28 2:17 UTC
  To: Pallai Roland; +Cc: David Chinner, Linux-Raid, xfs

On Mon, May 28, 2007 at 03:50:17AM +0200, Pallai Roland wrote:
> If I remember correctly, my file system wasn't shut down at all; it stayed
> "writeable" for the whole night and yafc slowly "wrote" files to it. Maybe
> all the write operations actually failed, but yafc doesn't warn about that.

So you never created new files or directories, unlinked files or directories,
did synchronous writes, etc? Just had slowly growing files?

> Spamming is just annoying when we need to find out what went wrong (my
> kernel.log is 300MB), but for data safety it's important to react to the
> EFSCORRUPTED error in any case, I think. Please consider this.

The filesystem has responded correctly to the corruption in terms of data
security (i.e. failed the data write and warned noisily about it), but it
probably hasn't done everything it should....

Hmmmm. A quick look at the Linux code makes me think that background
writeback on Linux has never been able to cause a shutdown in this case.
However, the same error on Irix will definitely cause a shutdown....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Pallai Roland @ 2007-05-28 11:17 UTC
  To: David Chinner; +Cc: Linux-Raid, xfs

On Monday 28 May 2007 04:17:18 David Chinner wrote:
> So you never created new files or directories, unlinked files or
> directories, did synchronous writes, etc? Just had slowly growing files?

I just overwrote badly downloaded files.

> Hmmmm. A quick look at the Linux code makes me think that background
> writeback on Linux has never been able to cause a shutdown in this case.
> However, the same error on Irix will definitely cause a shutdown....

I hope Linux will follow Irix; that would be a consistent standpoint.

David, do you have a plan to implement your "reporting raid5 block layer"
idea? As far as I can see, no one else cares about this silent data loss on
temporarily (cable, power) failed raid5 arrays, so I really hope you do at
least!

--
 d
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: David Chinner @ 2007-05-28 23:06 UTC
  To: Pallai Roland; +Cc: David Chinner, Linux-Raid, xfs

On Mon, May 28, 2007 at 01:17:31PM +0200, Pallai Roland wrote:
> On Monday 28 May 2007 04:17:18 David Chinner wrote:
> > Hmmmm. A quick look at the Linux code makes me think that background
> > writeback on Linux has never been able to cause a shutdown in this case.
> > However, the same error on Irix will definitely cause a shutdown....
>
> I hope Linux will follow Irix; that would be a consistent standpoint.

I raised a bug for this yesterday when writing that reply. It won't get
forgotten now....

> David, do you have a plan to implement your "reporting raid5 block layer"
> idea? As far as I can see, no one else cares about this silent data loss on
> temporarily (cable, power) failed raid5 arrays, so I really hope you do at
> least!

Yeah, I'd love to get something like this happening, but given it's about
half way down my list of "stuff to do when I have some spare time", I'd say
it will be about 2015 before I get to it.....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Pallai Roland @ 2007-05-25 14:01 UTC
  To: David Chinner; +Cc: Linux-Raid, xfs

On Friday 25 May 2007 03:35:48 Pallai Roland wrote:
> On Fri, 2007-05-25 at 10:05 +1000, David Chinner wrote:
> > How is *any* filesystem supposed to know that the underlying block
> > device has gone bad if it is not returning errors?
>
> It is returning errors, I think. If I write to a raid5 with 2 failed disks
> using dd, I get errors on the missing chunks.
>
> The difference between ext3 and XFS is that ext3 will remount read-only on
> the first write error but XFS won't; XFS only fails the current operation,
> IMHO. The ext3 method isn't perfect, but in practice it works well.

Sorry, I was wrong: md really isn't returning errors! That's madness, IMHO.

The reason ext3 is safer on raid5 in practice is that ext3 also remounts
read-only on read errors, so when a raid5 array has lost 2 drives and some
read happens, the errors= behaviour of ext3 kicks in and stops further
writes. You're right, it's not a good solution: even on ext3, read operations
have to happen for data loss to be prevented in this case. Raid5 *must deny
all writes* when 2 disks have failed: I still can't see a good reason why
not, and the current behaviour is braindead!

> > I did mention this exact scenario in the filesystems workshop back in
> > February - we'd *really* like to know if a RAID block device has gone
> > into degraded mode (i.e. lost a disk) so we can throttle new writes
> > until the rebuild has been completed. Stopping writes completely on a
> > fatal error (like 2 lost disks in RAID5, and 3 lost disks in RAID6)
> > would also be possible if only we could get the information out of the
> > block layer.

Yes, that sounds good, but I think we need a quick fix now; it's a real
problem and can easily lead to mass data loss.

--
 d
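Until md (or the block layer) reports this properly, one stop-gap is to watch
the array from userspace and force the filesystem read-only as soon as more
than one member has failed. A rough sketch only: the array name, mount point
and threshold are assumptions, and mdadm --monitor with a --program hook is
the more usual way to receive these events.

    #!/bin/sh
    # Remount read-only if the raid5 array has lost more than one member.
    FAILED=$(mdadm --detail /dev/md1 | awk '/Failed Devices/ {print $4}')
    if [ "${FAILED:-0}" -gt 1 ]; then
        logger "md1: $FAILED failed devices, forcing /data read-only"
        mount -o remount,ro /data
    fi
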
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Pallai Roland @ 2007-05-28 12:53 UTC
  To: David Chinner; +Cc: Linux-Raid, xfs

On Friday 25 May 2007 02:05:47 David Chinner wrote:
> "-o ro,norecovery" will allow you to mount the filesystem and get any
> uncorrupted data off it.
>
> You still may get shutdowns if you trip across corrupted metadata in
> the filesystem, though.

This filesystem is completely dead.

hq:~# mount -o ro,norecovery /dev/loop1 /mnt/r5
May 28 13:41:50 hq kernel: Mounting filesystem "loop1" in no-recovery mode.  Filesystem will be inconsistent.
May 28 13:41:50 hq kernel: XFS: failed to read root inode

hq:~# xfs_db /dev/loop1
xfs_db: cannot read root inode (22)
xfs_db: cannot read realtime bitmap inode (22)
Segmentation fault

hq:~# strace xfs_db /dev/loop1
_llseek(4, 0, [0], SEEK_SET)            = 0
read(4, "XFSB\0\0\20\0\0\0\0\0\6\374\253\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
pread(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 480141901312) = 512
pread(4, "\30G$L\203\33OE \256=\207@\340\264O\"\324\2074DY\323\6"..., 8192, 131072) = 8192
write(2, "xfs_db: cannot read root inode ("..., 36xfs_db: cannot read root inode (22)
) = 36
pread(4, "\30G$L\203\33OE \256=\207@\340\264O\"\324\2074DY\323\6"..., 8192, 131072) = 8192
write(2, "xfs_db: cannot read realtime bit"..., 47xfs_db: cannot read realtime bitmap inode (22)
) = 47
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV +++

Browsing with hexdump -C, it looks like part of a PDF file sits at 128KB, in
place of the root inode. :(

--
 d
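When xfs_db dies like this, about all that can still be done read-only is to
ask the superblock where the root inode should live and inspect that region
by hand. An illustrative sequence; the inode number is a placeholder, and on
a filesystem this badly damaged these commands may fail just as above:

    xfs_db -r /dev/loop1 -c "sb 0" -c "p rootino"     # root inode number per the superblock
    xfs_db -r /dev/loop1 -c "inode <rootino>" -c "p"  # dump that inode, if readable
    xfs_db -r /dev/loop1 -c "sb 0" -c "p"             # full superblock dump for sanity checking
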
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: Pallai Roland @ 2007-05-28 15:30 UTC
  To: David Chinner; +Cc: Linux-Raid, xfs

On Monday 28 May 2007 14:53:55 Pallai Roland wrote:
> On Friday 25 May 2007 02:05:47 David Chinner wrote:
> > "-o ro,norecovery" will allow you to mount the filesystem and get any
> > uncorrupted data off it.
> >
> > You still may get shutdowns if you trip across corrupted metadata in
> > the filesystem, though.
>
> This filesystem is completely dead.
> [...]

I tried to write an md patch to stop writes if a raid5 array has 2+ failed
drives, but I found it's already done, oops. :) handle_stripe5() quietly
ignores writes in this case; I tried it and it works. So how did I lose my
file system? My first guess about partially successful writes wasn't right:
there were no real writes to the disks after the second disk was kicked, so
from this point of view the scenario is the same as a simple power loss. Am I
thinking right?

There's another layer on this box between md and XFS: loop-aes. I've used it
for years and it has been rock stable, but now it's my first suspect, because
I found a bug in it today: I assembled my array from n-1 disks, then failed a
second disk for a test, and I found that /dev/loop1 still provides *random*
data where /dev/md1 serves nothing. It's definitely a loop-aes bug:

/dev/loop1: [0700]:180907 (/dev/md1) encryption=AES128 multi-key-v3

hq:~# dd if=/dev/md1 bs=1k count=128 skip=128 >/dev/null
dd: reading `/dev/md1': Input/output error
0+0 records in
0+0 records out

hq:~# dd if=/dev/loop1 bs=1k count=128 skip=128 | md5sum
128+0 records in
128+0 records out
131072 bytes (131 kB) copied, 0.027775 seconds, 4.7 MB/s
e2548a924a0e835bb45fb50058acba98  -      (!!!)

hq:~# dd if=/dev/loop1 bs=1k count=128 skip=128 | md5sum
128+0 records in
128+0 records out
131072 bytes (131 kB) copied, 0.030311 seconds, 4.3 MB/s
c6a23412fb75eb5a7eb1d6a7813eb86b  -      (!!!)

It doesn't explain my screwed-up file system by itself, but for me it's
enough to drop loop-aes. Eh.

--
 d
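For anyone who wants to reproduce this kind of double-failure test without
risking real disks, it can be staged entirely on loopback files. A sketch
only: sizes, file paths, loop numbers and the md device name are arbitrary
choices, not taken from the thread.

    for i in 0 1 2 3; do
        dd if=/dev/zero of=/tmp/r5-$i bs=1M count=64
        losetup /dev/loop$((i + 2)) /tmp/r5-$i
    done
    mdadm --create /dev/md9 --level=5 --raid-devices=4 /dev/loop[2-5]
    mdadm /dev/md9 --fail /dev/loop4
    mdadm /dev/md9 --fail /dev/loop5            # second failure: the array is now dead
    dd if=/dev/md9 of=/dev/null bs=1k count=16  # should fail with an I/O error
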
* Re: raid5: I lost a XFS file system due to a minor IDE cable problem
  From: David Chinner @ 2007-05-28 23:36 UTC
  To: Pallai Roland; +Cc: David Chinner, Linux-Raid, xfs

On Mon, May 28, 2007 at 05:30:52PM +0200, Pallai Roland wrote:
> I tried to write an md patch to stop writes if a raid5 array has 2+ failed
> drives, but I found it's already done, oops. :) handle_stripe5() quietly
> ignores writes in this case; I tried it and it works.

Hmmm - it clears the uptodate bit on the bio, which is supposed to make the
bio return EIO. That looks to be doing the right thing...

> There's another layer on this box between md and XFS: loop-aes.

Oh, that's a kind of important thing to forget to mention....

> I've used it for years and it has been rock stable, but now it's my first
> suspect, because I found a bug in it today: I assembled my array from n-1
> disks, then failed a second disk for a test, and I found that /dev/loop1
> still provides *random* data where /dev/md1 serves nothing. It's definitely
> a loop-aes bug:
> [...]
> It doesn't explain my screwed-up file system by itself, but for me it's
> enough to drop loop-aes. Eh.

If you can get random data back instead of an error from the block device,
then I'm not surprised your filesystem is toast. If it's one sector in a
larger block that is corrupted, then the only thing that will protect you
from this sort of corruption causing problems is metadata checksums (yet
another thing on my list of stuff to do).

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group