Date: Sat, 20 Jul 2013 11:48:36 +1000
From: Dave Chinner
Subject: Re: [Bisected] Corruption of root fs during git bisect of drm system hang
Message-ID: <20130720014836.GZ11674@dastard>
References: <20130713090523.GA362@x4> <20130712070721.GA359@x4>
 <20130715022841.GH5228@dastard> <20130715064734.GA361@x4>
 <20130719122235.GA360@x4>
In-Reply-To: <20130719122235.GA360@x4>
List-Id: XFS Filesystem from SGI
To: Markus Trippelsdorf
Cc: Ben Myers, Mark Tinguely, Stan Hoeppner, xfs@oss.sgi.com

On Fri, Jul 19, 2013 at 02:22:35PM +0200, Markus Trippelsdorf wrote:
> On 2013.07.15 at 08:47 +0200, Markus Trippelsdorf wrote:
> > On 2013.07.15 at 12:28 +1000, Dave Chinner wrote:
> > > On Fri, Jul 12, 2013 at 09:07:21AM +0200, Markus Trippelsdorf wrote:
> > > > On 2013.07.12 at 12:17 +1000, Dave Chinner wrote:
> > > > > On Thu, Jul 11, 2013 at 11:07:55AM +0200, Markus Trippelsdorf wrote:
> > > > > > On 2013.07.10 at 23:12 -0500, Stan Hoeppner wrote:
> > > > > > > On 7/10/2013 10:58 PM, Dave Chinner wrote:
> > > > > > > > On Thu, Jul 11, 2013 at 05:36:21AM +0200, Markus Trippelsdorf wrote:
> > > > > > > >
> > > > > > > >> I was losing my KDE settings bit by bit with every reboot during the
> > > > > > > >> bisection. First my window-rules disappeared, then my desktop background
> > > > > > > >> changed to default, then my taskbar moved from top to the bottom, etc.
> > > > > > > >> In the end I had to restore all my .files from backup.
> > > > > > > >
> > > > > > > > That's not filesystem corruption. That sounds more like someone not
> > > > > > > > using fsync in the appropriate place when overwriting a file....
> > > > > > >
> > > > > > t@ubunt:~# xfs_repair /dev/sdb
> > > > > > Phase 1 - find and verify superblock...
> > > > > > Phase 2 - using internal log
> > > > > >         - zero log...
> > > > > >         - scan filesystem freespace and inode maps...
> > > > > > agi unlinked bucket 0 is 683435008 in ag 2 (inode=4978402304)
> > > > > > agi unlinked bucket 1 is 683435009 in ag 2 (inode=4978402305)
> > > > > >         - found root inode chunk
> > > > >
> > > > > Again, these are signs that log recovery has not completed
> > > > > successfully or that for some reason it thought the log was clean.
> > > > > Can you please post the dmesg output after the crash when you go
> > > > > through the mount/unmount process before you run xfs_repair?
> > > > >
> > > > Sure.
> > > > First boot after crash:
> > > > XFS (sdb2): Mounting Filesystem
> > > > XFS (sdb2): Starting recovery (logdev: internal)
> > > > XFS (sdb2): Ending recovery (logdev: internal)
> > > >
> > > > Second boot after crash:
> > > > XFS (sdb2): Mounting Filesystem
> > > > XFS (sdb2): Ending clean mount
> > > >
> > > > I then boot Ubuntu from another disc to run xfs_repair.
> > > >
> > > That's what should have been in the initial description of your
> > > problem.
> > >
> > > > And looking through my logs I see this WARNING:
> > > >
> > > > ------------[ cut here ]------------
> > > > WARNING: CPU: 0 PID: 439 at fs/inode.c:280 drop_nlink+0x33/0x40()
> > > > CPU: 0 PID: 439 Comm: gconfd-2 Not tainted 3.10.0-08982-g6d128e1-dirty #42
> > > > Hardware name: System manufacturer System Product Name/M4A78T-E, BIOS 3503 04/13/2011
> > > > 0000000000000009 ffffffff8157d030 0000000000000000 ffffffff81060788
> > > > ffff8801f8608cc8 ffff880205998230 ffff8801f7bede58 0000000000000000
> > > > ffff8801f86083c0 ffffffff8110ce93 ffff8801f8608b40 ffffffff811b7104
> > > > Call Trace:
> > > > [] ? dump_stack+0x41/0x51
> > > > [] ? warn_slowpath_common+0x68/0x80
> > > > [] ? drop_nlink+0x33/0x40
> > > > [] ? xfs_droplink+0x24/0x60
> > > > [] ? xfs_remove+0x24d/0x380
> > > > [] ? xfs_vn_unlink+0x37/0x80
> > > > [] ? vfs_unlink+0x6e/0xe0
> > > > [] ? do_unlinkat+0x16a/0x220
> > > > [] ? SyS_faccessat+0x149/0x200
> > > > [] ? system_call_fastpath+0x16/0x1b
> > > >
> > > When did that occur? Before the crash, after the first/second mount?
> > > After you ran repair?
> >
> > After the first mount.
> >
> > > > Some further observations:
> > > >
> > > > When I boot 3.2.0 after the crash log recovery works fine.
> > > >
> > > > When I boot 3.9.0 after the crash I get the following:
> > > >
> > > > [ 2.332989] XFS (sdc2): Mounting Filesystem
> > > > [ 2.406206] XFS (sdc2): Starting recovery (logdev: internal)
> > > > [ 2.418147] XFS (sdc2): log record CRC mismatch: found 0xdbcaef48, expected 0x69e7934e.
> > > >
> > > Just informational - indicating that the log records don't have
> > > valid CRCs in them because 3.2 didn't calculate them. If you are
> > > getting them after a crash on a 3.9+ kernel, then there's a
> > > problem writing to the log....
> > >
> > The crash always occurred on the current Linus tree kernel...
> >
> > > > When I boot the current Linus tree after the crash log recovery fails silently.
> > > >
> > > dmesg output, please. Indeed, what does "fails silently" mean? The
> > > filesystem doesn't mount but no error is given?
> >
> > Again, there is no dmesg output. XFS tells me that it's "Ending recovery
> > (logdev: internal)" without any errors, when indeed it didn't recover
> > the log at all. It then mounts the filesystem normally (rw) in this
> > unclean state. That's when the WARNING I posted above happened.
>
> I've bisected this issue to the following commit:
>
> commit cca9f93a52d2ead50b5da59ca83d5f469ee4be5f
> Author: Dave Chinner
> Date: Thu Jun 27 16:04:49 2013 +1000
>
>     xfs: don't do IO when creating an new inode
>
> Reverting this commit on top of the Linus tree "solves" all problems for
> me. IOW I no longer lose my KDE and LibreOffice config files during a
> crash. Log recovery now works fine and xfs_repair shows no issues.

Thanks for bisecting this, Markus. I'll admit, right now it doesn't
make a lot of sense to me - I don't immediately see a connection
between not reading an inode during the create phase and unlinked
list and directory corruption after a crash. But now you've
identified a change that might be the cause, I have an avenue of
investigation I can follow. Indeed, in the time I've taken to write
this mail I've thought of 2-3 possible causes that I need to
investigate....

> So users of 3.11.0-rc1 beware. Only run this version if you have
> up-to-date backups handy.

Don't be so dramatic - very few people are doing what you are doing,
so let's try to understand the root cause of the problem before
jumping to rash conclusions....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs