From mboxrd@z Thu Jan  1 00:00:00 1970
From: Dave Chinner <david@fromorbit.com>
Subject: Re: [BUG?] sync writeback regression from c4a391b5 "writeback: do
 not sync data dirtied after sync start"?
Date: Wed, 19 Feb 2014 09:09:53 +1100
Message-ID: <20140218220953.GJ28666@dastard>
References: <20140217044047.GD13997@dastard>
 <20140217151642.GE3686@quack.suse.cz>
 <20140218002312.GC13647@dastard>
 <20140218093820.GA29660@quack.suse.cz>
 <20140218132924.GH28666@dastard>
 <20140218140252.GD29660@quack.suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-fsdevel@vger.kernel.org
To: Jan Kara <jack@suse.cz>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from ipmail06.adl2.internode.on.net ([150.101.137.129]:17943 "EHLO
	ipmail06.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1751151AbaBRWJ6 (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Tue, 18 Feb 2014 17:09:58 -0500
Content-Disposition: inline
In-Reply-To: <20140218140252.GD29660@quack.suse.cz>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Tue, Feb 18, 2014 at 03:02:52PM +0100, Jan Kara wrote:
> On Wed 19-02-14 00:29:24, Dave Chinner wrote:
> > OK, I suspect that there are other problems lurking here, too. I just
> > hit a problem on generic/068 on a ramdisk on XFS where a sync call
> > would never complete until the writer processes were killed. fsstress
> > got stuck here:
> > 
> > [222229.551097] fsstress        D ffff88021bc13180  4040  5898   5896 0x00000000
> > [222229.551097]  ffff8801e5c2dd68 0000000000000086 ffff880219eb1850 0000000000013180
> > [222229.551097]  ffff8801e5c2dfd8 0000000000013180 ffff88011b2b0000 ffff880219eb1850
> > [222229.551097]  ffff8801e5c2dd48 ffff8801e5c2de68 ffff8801e5c2de70 7fffffffffffffff
> > [222229.551097] Call Trace:
> > [222229.551097]  [<ffffffff811db930>] ? fdatawrite_one_bdev+0x20/0x20
> > [222229.551097]  [<ffffffff81ce35e9>] schedule+0x29/0x70
> > [222229.551097]  [<ffffffff81ce28c1>] schedule_timeout+0x171/0x1d0
> > [222229.551097]  [<ffffffff810b0eda>] ? __queue_delayed_work+0x9a/0x170
> > [222229.551097]  [<ffffffff810b0b41>] ? try_to_grab_pending+0xc1/0x180
> > [222229.551097]  [<ffffffff81ce434f>] wait_for_completion+0x9f/0x110
> > [222229.551097]  [<ffffffff810c7810>] ? try_to_wake_up+0x2c0/0x2c0
> > [222229.551097]  [<ffffffff811d3c4a>] sync_inodes_sb+0xca/0x1f0
> > [222229.551097]  [<ffffffff811db930>] ? fdatawrite_one_bdev+0x20/0x20
> > [222229.551097]  [<ffffffff811db94c>] sync_inodes_one_sb+0x1c/0x20
> > [222229.551097]  [<ffffffff811af219>] iterate_supers+0xe9/0xf0
> > [222229.551097]  [<ffffffff811dbb32>] sys_sync+0x42/0xa0
> > [222229.551097]  [<ffffffff81cf0d29>] system_call_fastpath+0x16/0x1b
> > 
> > This then held off the filesystem freeze due to holding s_umount,
> > and the two fstest processes just kept running dirtying the
> > filesystem. It wasn't until I killed the fstests processes by removing
> > the tmp file that the sync completed and the test made progress.
>   OK, so the flusher thread (or actually the corresponding kworker) was
> continuously writing the newly dirtied data? So far I haven't reproduced
> this but I'll try...

No, the flusher thread was nowhere to be found.

> > It's reproducible, and I left it for a couple of hours to see if it
> > would resolve itself. It didn't, so I had to kick it to break the
> > livelock.
>   I wonder whether it might be some incarnation of a bug fixed here:
> https://lkml.org/lkml/2014/2/14/733
> 
> The effects should be somewhat different but it's in that area. Can you try
> with that patch?

Seems to have fixed the problem. generic/068 has just passed 3 times
in a row, and it's never passed before on this ramdisk-based test
rig. Thanks for the pointer, Jan!

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com