All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mark Tinguely <tinguely@sgi.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>, xfs@oss.sgi.com
Subject: Re: [PATCH 04/10] xfs: implement freezing by emptying the AIL
Date: Wed, 18 Apr 2012 08:13:55 -0500	[thread overview]
Message-ID: <4F8EBE13.9040207@sgi.com> (raw)
In-Reply-To: <20120417082633.GH6734@dastard>

On 04/17/12 03:26, Dave Chinner wrote:
> On Tue, Apr 17, 2012 at 02:20:23PM +1000, Dave Chinner wrote:
>> On Tue, Apr 17, 2012 at 09:54:32AM +1000, Dave Chinner wrote:
>>> On Mon, Apr 16, 2012 at 08:47:00AM -0500, Mark Tinguely wrote:
>>>> On 03/27/12 11:44, Christoph Hellwig wrote:
>>>>> Now that we write back all metadata either synchronously or through the AIL
>>>>> we can simply implement metadata freezing in terms of emptying the AIL.
>>>>>
>>>>> The implementation for this is fairly simply and straight-forward:  A new
>>>>> routine is added that increments a counter that tells xfsaild to not stop
>>>>> until the AIL is empty and then waits on a wakeup from
>>>>> xfs_trans_ail_delete_bulk to signal that the AIL is empty.
>>>>>
>>>>> As usual the devil is in the details, in this case the filesystem shutdown
>>>>> code.  Currently we are a bit sloppy there and do not continue ail pushing
>>>>> in that case, and thus never reach the code in the log item implementations
>>>>> that can unwind in case of a shutdown filesystem.  Also the code to
>>>>> abort inode and dquot flushes was rather sloppy before and did not remove
>>>>> the log items from the AIL, which had to be fixed as well.
>>>>>
>>>>> Also treat unmount the same way as freeze now, except that we still keep a
>>>>> synchronous inode reclaim pass to make sure we reclaim all clean inodes, too.
>>>>>
>>>>> As an upside we can now remove the radix tree based inode writeback and
>>>>> xfs_unmountfs_writesb.
>>>>>
>>>>> Signed-off-by: Christoph Hellwig<hch@lst.de>
>>>>
>>>> Sorry for the empty email.
>>>>
>>>> This series hangs my test boxes. This patch is the first indication
>>>> of the hang. Reboot, and remove patch 4 and the test are successful.
>>>>
>>>> The machine is still responsive. Only the SCRATCH filesystem from
>>>> the test suite is hung.
>>>>
>>>> Per Dave's observation, I added a couple inode reclaims to this
>>>> patch and the test gets further (hangs on run 9 of test 068 rather
>>>> than run 3).
>>>
>>> That implies that there are dirty inodes at the VFS level leaking
>>> through the freeze.
>>>
>>> .....
>> .....
>>> So, what are the flusher threads doing - where are they stuck?
>>
>> I have an answer of sorts:
>>
>> 90580.054767]   task                        PC stack   pid father
>> [90580.056035] flush-253:16    D 0000000000000001  4136 32084      2 0x00000000
>> [90580.056035]  ffff880004c558a0 0000000000000046 ffff880068b6cd48 ffff880004c55cb0
>> [90580.056035]  ffff88007b616280 ffff880004c55fd8 ffff880004c55fd8 ffff880004c55fd8
>> [90580.056035]  ffff88000681e340 ffff88007b616280 ffff880004c558b0 ffff88007981e000
>> [90580.056035] Call Trace:
>> [90580.056035]  [<ffffffff81afcd19>] schedule+0x29/0x70
>> [90580.056035]  [<ffffffff814801fd>] xfs_trans_alloc+0x5d/0xb0
>> [90580.056035]  [<ffffffff81099eb0>] ? add_wait_queue+0x60/0x60
>> [90580.056035]  [<ffffffff81416b14>] xfs_setfilesize_trans_alloc+0x34/0xb0
>> [90580.056035]  [<ffffffff814186f5>] xfs_vm_writepage+0x4a5/0x560
>> [90580.056035]  [<ffffffff81127507>] __writepage+0x17/0x40
>> [90580.056035]  [<ffffffff81127b3d>] write_cache_pages+0x20d/0x460
>> [90580.056035]  [<ffffffff811274f0>] ? set_page_dirty_lock+0x60/0x60
>> [90580.056035]  [<ffffffff81127dda>] generic_writepages+0x4a/0x70
>> [90580.056035]  [<ffffffff814167ec>] xfs_vm_writepages+0x4c/0x60
>> [90580.056035]  [<ffffffff81129711>] do_writepages+0x21/0x40
>> [90580.056035]  [<ffffffff8118ee42>] writeback_single_inode+0x112/0x380
>> [90580.056035]  [<ffffffff8118f25e>] writeback_sb_inodes+0x1ae/0x270
>> [90580.056035]  [<ffffffff8118f4c0>] wb_writeback+0xe0/0x320
>> [90580.056035]  [<ffffffff8108724a>] ? try_to_del_timer_sync+0x8a/0x110
>> [90580.056035]  [<ffffffff81190bc8>] wb_do_writeback+0xb8/0x1d0
>> [90580.056035]  [<ffffffff81085f40>] ? usleep_range+0x50/0x50
>> [90580.056035]  [<ffffffff81190d6b>] bdi_writeback_thread+0x8b/0x280
>> [90580.056035]  [<ffffffff81190ce0>] ? wb_do_writeback+0x1d0/0x1d0
>> [90580.056035]  [<ffffffff81099403>] kthread+0x93/0xa0
>> [90580.056035]  [<ffffffff81b06f64>] kernel_thread_helper+0x4/0x10
>> [90580.056035]  [<ffffffff81099370>] ? kthread_freezable_should_stop+0x70/0x70
>> [90580.056035]  [<ffffffff81b06f60>] ? gs_change+0x13/0x13
>>
>> A dirty inode has slipped through the freeze process, and the
>> flusher thread is stuck trying to allocate a transaction for setting
>> the file size. I can reproduce this fairly easily, so a a bit of
>> tracing should tell me exactly what is going wrong....
>
> Yeah, it's pretty clear what is happening here. We don't have
> freeze protection against EOF zeroing operations. At least
> xfs_setattr_size() and xfs_change_file_space() fail to check for
> freeze, and that is initially what I though was causing this problem.
>
> However, adding freeze checks into the relevant paths didn't make
> the hangs go away, so there's more to it than that. Basically, we've
> been getting races between checking for freeze, the dirtying of the
> pages and the flusher thread syncing out the dirty data. i.e.:
>
> Thread 1		Thread 2		freeze		flusher thread
> write inode A
> check for freeze
> 					grab s_umount
> 					SB_FREEZE_WRITE
> 					writeback_inodes_sb()
> 								iterate dirty inodes
> 								inode A not in flush
> 					sync_inodes_sb()
> 								iterate dirty inodes
> 								inode A not in flush
> dirty pages
> mark inode A dirty
> write inode A done.
> 					SB_FREEZE_TRANS
> 					drop s_umount
> 					freeze done
> 			sync
> 			grab s_umount
> 								iterate dirty inodes
> 								Flush dirty inode A
>
>
> Before we added the transactional inode size updates, this race
> simply went unnoticed because nothing caused the flusher thread to
> block. All the problems I see are due to overwrites of allocated
> space - if there was real allocation then the delalloc conversion
> would have always hung. Now we see that when we need to extend the
> file size when writing, we ahve to allocate a transaction and hence
> the flusher thread now hangs.
>
> While I can "fix" the xfs_setattr_size() and xfs_change_file_space()
> triggers, they don't close the above race condition, so this problem
> is essentially unfixable in XFS. The only reason we have not tripped
> over it before is that the flusher thread didn't hang waiting for a
> transaction reservation when the race was hit.
>
> So why didn't this happen before Christoph's patch set? That's
> something I can't explain. Oh, wait, yes I can - 068 hangs even
> without this patch of Christoph's. Actually, looking at my xfstests
> logs, I can trace the start of the failures back to mid march, and
> that coincided with an update to the xfstests installed on my test
> boxes. Which coincides with when my machines first saw this change:
>
> commit 281627df3eb55e1b729b9bb06fff5ff112929646
> Author: Christoph Hellwig<hch@infradead.org>
> Date:   Tue Mar 13 08:41:05 2012 +0000
>
>      xfs: log file size updates at I/O completion time
>
> That confirms my analysis above - the problem is being exposed by new
> code in the writeback path that does transaction allocation where it
> didn't used to.
>
> Clearly the problem is not really the new code in Christoph's
> patches - it's an existing freeze problem that has previously
> resulted in data writes occuring after a freeze has completed (of
> which we have had rare complaints about). That sounds pretty dire,
> except for one thing: Jan Kara's patch set that fixes all these
> freeze problems:
>
> https://lkml.org/lkml/2012/4/16/356
>
> And now that I've run some testing with Jan's patch series, along
> with Christoph's and mine (75-odd patches ;), a couple of my test
> VMs have been running test 068 in a tight loop for about half an
> hour without a hang, so I'd consider this problem fixed by Jan's
> freeze fixes given I could reliably hang it in 2-3 minutes before
> adding Jan's patch set to my stack.
>
> So the fix for this problem is getting Jan's patch set into the
> kernel at the same time we get the inode size logging changes into
> the kernel. What do people think about that for a plan?
>
> Cheers,
>
> Dave.

Good job.

Jan's freeze patch set is at v5 and seems to be settling down. What is 
the status of Jan's freeze code getting into the kernel?

--Mark Tinguely

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  reply	other threads:[~2012-04-18 13:13 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-03-27 16:44 [PATCH 00/10] remove xfsbufd Christoph Hellwig
2012-03-27 16:44 ` [PATCH 01/10] xfs: remove log item from AIL in xfs_qm_dqflush after a shutdown Christoph Hellwig
2012-03-27 18:17   ` Mark Tinguely
2012-04-13  9:36   ` Dave Chinner
2012-03-27 16:44 ` [PATCH 02/10] xfs: remove log item from AIL in xfs_iflush " Christoph Hellwig
2012-04-13  9:37   ` Dave Chinner
2012-03-27 16:44 ` [PATCH 03/10] xfs: allow assigning the tail lsn with the AIL lock held Christoph Hellwig
2012-03-27 18:18   ` Mark Tinguely
2012-04-13  9:42   ` Dave Chinner
2012-03-27 16:44 ` [PATCH 04/10] xfs: implement freezing by emptying the AIL Christoph Hellwig
2012-04-13 10:04   ` Dave Chinner
2012-04-16 13:33   ` Mark Tinguely
2012-04-16 13:47   ` Mark Tinguely
2012-04-16 23:54     ` Dave Chinner
2012-04-17  4:20       ` Dave Chinner
2012-04-17  8:26         ` Dave Chinner
2012-04-18 13:13           ` Mark Tinguely [this message]
2012-04-18 18:14             ` Ben Myers
2012-04-18 17:53           ` Mark Tinguely
2012-03-27 16:44 ` [PATCH 05/10] xfs: do flush inodes from background inode reclaim Christoph Hellwig
2012-04-13 10:14   ` Dave Chinner
2012-04-16 19:25   ` Mark Tinguely
2012-03-27 16:44 ` [PATCH 06/10] xfs: do not write the buffer from xfs_iflush Christoph Hellwig
2012-04-13 10:31   ` Dave Chinner
2012-04-18 13:33   ` Mark Tinguely
2012-03-27 16:44 ` [PATCH 07/10] xfs: do not write the buffer from xfs_qm_dqflush Christoph Hellwig
2012-04-13 10:33   ` Dave Chinner
2012-04-18 21:11   ` Mark Tinguely
2012-03-27 16:44 ` [PATCH 08/10] xfs: do not add buffers to the delwri queue until pushed Christoph Hellwig
2012-04-13 10:35   ` Dave Chinner
2012-04-18 21:11   ` Mark Tinguely
2012-03-27 16:44 ` [PATCH 09/10] xfs: on-stack delayed write buffer lists Christoph Hellwig
2012-04-13 11:37   ` Dave Chinner
2012-04-20 18:19   ` Mark Tinguely
2012-04-21  0:42     ` Dave Chinner
2012-04-23  1:57       ` Dave Chinner
2012-03-27 16:44 ` [PATCH 10/10] xfs: remove some obsolete comments in xfs_trans_ail.c Christoph Hellwig
2012-04-13 11:37   ` Dave Chinner
2012-03-28  0:53 ` [PATCH 00/10] remove xfsbufd Dave Chinner
2012-03-28 15:10   ` Christoph Hellwig
2012-03-29  0:52     ` Dave Chinner
2012-03-29 19:38       ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4F8EBE13.9040207@sgi.com \
    --to=tinguely@sgi.com \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.