From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 28 May 2007 10:30:11 +1000
From: David Chinner
To: Pallai Roland
Cc: David Chinner, Linux-Raid, xfs@oss.sgi.com
Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem
Message-ID: <20070528003010.GS85884050@sgi.com>
In-Reply-To: <200705251635.36533.dap@mail.index.hu>
References: <200705241318.30711.dap@mail.index.hu> <1180056948.6183.10.camel@daptopfc.localdomain> <20070525045500.GF86004887@sgi.com> <200705251635.36533.dap@mail.index.hu>
List-Id: xfs

On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote:
> 
> On Friday 25 May 2007 06:55:00 David Chinner wrote:
> > Oh, did you look at your logs and find that XFS had spammed them
> > about writes that were failing?
> 
> The first message after the incident:
> 
> May 24 01:53:50 hq kernel: Filesystem "loop1": XFS internal error xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c.
> Caller 0xf8ac14f8
> May 24 01:53:50 hq kernel:  xfs_btree_check_sblock+0x4f/0xc2 [xfs]  xfs_alloc_lookup+0x34e/0x47b [xfs]
> May 24 01:53:50 hq kernel:  xfs_alloc_lookup+0x34e/0x47b [xfs]  kmem_zone_zalloc+0x1b/0x43 [xfs]
> May 24 01:53:50 hq kernel:  xfs_alloc_ag_vextent+0x24d/0x1110 [xfs]  xfs_alloc_vextent+0x3bd/0x53b [xfs]
> May 24 01:53:50 hq kernel:  xfs_bmapi+0x1ac4/0x23cd [xfs]  xfs_bmap_search_multi_extents+0x8e/0xd8 [xfs]
> May 24 01:53:50 hq kernel:  xlog_dealloc_log+0x49/0xea [xfs]  xfs_iomap_write_allocate+0x2d9/0x58b [xfs]
> May 24 01:53:50 hq kernel:  xfs_iomap+0x60e/0x82d [xfs]  __wake_up_common+0x39/0x59
> May 24 01:53:50 hq kernel:  xfs_map_blocks+0x39/0x6c [xfs]  xfs_page_state_convert+0x644/0xf9c [xfs]
> May 24 01:53:50 hq kernel:  schedule+0x5d1/0xf4d  xfs_vm_writepage+0x0/0xe0 [xfs]
> May 24 01:53:50 hq kernel:  xfs_vm_writepage+0x57/0xe0 [xfs]  mpage_writepages+0x1fb/0x3bb
> May 24 01:53:50 hq kernel:  mpage_writepages+0x133/0x3bb  xfs_vm_writepage+0x0/0xe0 [xfs]
> May 24 01:53:50 hq kernel:  do_writepages+0x35/0x3b  __writeback_single_inode+0x88/0x387
> May 24 01:53:50 hq kernel:  sync_sb_inodes+0x1b4/0x2a8  writeback_inodes+0x63/0xdc
> May 24 01:53:50 hq kernel:  background_writeout+0x66/0x9f  pdflush+0x0/0x1ad
> May 24 01:53:50 hq kernel:  pdflush+0xef/0x1ad  background_writeout+0x0/0x9f
> May 24 01:53:50 hq kernel:  kthread+0xc2/0xc6  kthread+0x0/0xc6
> May 24 01:53:50 hq kernel:  kernel_thread_helper+0x5/0xb
> 
> ...and I've been spammed with such messages. This "internal error" isn't a good reason to shut down
> the file system?

Actually, that error does shut the filesystem down in most cases. When you
see that output, the function is returning -EFSCORRUPTED. You've got a
corrupted freespace btree.

The reason why you get spammed is that this is happening during background
writeback, and there is no one to return the -EFSCORRUPTED error to.
The background writeback path doesn't specifically detect shut-down
filesystems or trigger shutdowns on errors, because that happens in
different layers, so you just end up with failed data writes. These errors
will occur on the next foreground data or metadata allocation, and that
will shut the filesystem down at that point.

I'm not sure that we should be ignoring EFSCORRUPTED errors here; maybe in
this case we should be shutting down the filesystem. That would certainly
cut down on the spamming and would not appear to change any other
behaviour....

> I think if there's a sign of a corrupted file system, the first thing we should do
> is to stop writes (or the entire FS) and let the admin examine the situation.

Yes, that's *exactly* what a shutdown does. In this case, your writes are
being stopped - hence the error messages - but the filesystem has not yet
been shut down.....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group