From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>, linux-xfs@vger.kernel.org
Subject: Re: [PATCH v2] xfs: rewrite the fdatasync optimization
Date: Wed, 7 Mar 2018 11:45:54 +1100 [thread overview]
Message-ID: <20180307004554.GM18129@dastard> (raw)
In-Reply-To: <20180226145503.GB51394@bfoster.bfoster>
On Mon, Feb 26, 2018 at 09:55:04AM -0500, Brian Foster wrote:
> On Thu, Feb 22, 2018 at 07:04:21AM -0800, Christoph Hellwig wrote:
> > Currently we need to the ilock over the log force in xfs_fsync so that we
> > can protect ili_fsync_fields against incorrect manipulation.
> > Unfortunately that can cause huge latency spikes for workloads that mix
> > fsync with writes from other thread or through aio on the same file.
> >
> > Instead record the last dirty lsn that was not just a timestamp update in
> > the inode log item as long as the inode is pinned, and clear it when
> > unpinning the inode. This removes the need for ili_fsync_fields and thus
> > holding the ilock over the log force, and drastically reduces latency on
> > multithreaded workloads that mix writes with fsync calls on the same file.
> >
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> >
> > Changes since V1:
> > - reomve the XFS_ILOG_VERSION optimization
> >
> > fs/xfs/xfs_file.c | 20 ++++++--------------
> > fs/xfs/xfs_inode.c | 2 --
> > fs/xfs/xfs_inode_item.c | 10 ++++++++--
> > fs/xfs/xfs_inode_item.h | 2 +-
> > fs/xfs/xfs_iomap.c | 3 +--
> > fs/xfs/xfs_trans_inode.c | 9 ---------
> > 6 files changed, 16 insertions(+), 30 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 9ea08326f876..ccb331f96a6d 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -165,27 +165,19 @@ xfs_file_fsync(
> > * All metadata updates are logged, which means that we just have to
> > * flush the log up to the latest LSN that touched the inode. If we have
> > * concurrent fsync/fdatasync() calls, we need them to all block on the
> > - * log force before we clear the ili_fsync_fields field. This ensures
> > - * that we don't get a racing sync operation that does not wait for the
> > - * metadata to hit the journal before returning. If we race with
> > - * clearing the ili_fsync_fields, then all that will happen is the log
> > - * force will do nothing as the lsn will already be on disk. We can't
> > - * race with setting ili_fsync_fields because that is done under
> > - * XFS_ILOCK_EXCL, and that can't happen because we hold the lock shared
> > - * until after the ili_fsync_fields is cleared.
> > + * log force before returning.
> > */
> > xfs_ilock(ip, XFS_ILOCK_SHARED);
> > if (xfs_ipincount(ip)) {
> > - if (!datasync ||
> > - (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > + if (datasync)
> > + lsn = ip->i_itemp->ili_datasync_lsn;
> > + else
> > lsn = ip->i_itemp->ili_last_lsn;
> > }
> > + xfs_iunlock(ip, XFS_ILOCK_SHARED);
> >
> > - if (lsn) {
> > + if (lsn)
> > error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
> > - ip->i_itemp->ili_fsync_fields = 0;
> > - }
> > - xfs_iunlock(ip, XFS_ILOCK_SHARED);
> >
> > /*
> > * If we only have a single device, and the log force about was
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 604ee384a00a..a57772553a2a 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -2385,7 +2385,6 @@ xfs_ifree_cluster(
> >
> > iip->ili_last_fields = iip->ili_fields;
> > iip->ili_fields = 0;
> > - iip->ili_fsync_fields = 0;
> > iip->ili_logged = 1;
> > xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
> > &iip->ili_item.li_lsn);
> > @@ -3649,7 +3648,6 @@ xfs_iflush_int(
> > */
> > iip->ili_last_fields = iip->ili_fields;
> > iip->ili_fields = 0;
> > - iip->ili_fsync_fields = 0;
> > iip->ili_logged = 1;
> >
> > xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
> > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > index d5037f060d6f..b60592e82833 100644
> > --- a/fs/xfs/xfs_inode_item.c
> > +++ b/fs/xfs/xfs_inode_item.c
> > @@ -630,6 +630,9 @@ xfs_inode_item_committed(
> > struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> > struct xfs_inode *ip = iip->ili_inode;
> >
> > + iip->ili_last_lsn = 0;
> > + iip->ili_datasync_lsn = 0;
> > +
>
> Hmm, it's not clear to me why we're messing with ili_last_lsn here as
> well since I don't see any mention of it anywhere. That aside, I'm not
> quite sure this is safe even with regard to ->ili_datasync_lsn.
Actually, it's not safe with regard to iip->ili_last_lsn, either:
[ 6158.067488] run fstests generic/476 at 2018-03-05 16:26:36
[ 6159.036562] XFS (sdg): Mounting V5 Filesystem
[ 6159.074800] XFS (sdg): Ending clean mount
[ 6370.748497] kworker/dying (24679) used greatest stack depth: 11168 bytes left
[ 6527.664801] XFS: Assertion failed: lsn != 0, file: fs/xfs/xfs_log.c, line: 3458
[ 6527.667593] ------------[ cut here ]------------
[ 6527.669286] kernel BUG at fs/xfs/xfs_message.c:114!
[ 6527.670253] invalid opcode: 0000 [#1] PREEMPT SMP
[ 6527.673091] CPU: 2 PID: 697 Comm: kswapd0 Not tainted 4.16.0-rc2-dgc #383
[ 6527.675120] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 6527.677410] RIP: 0010:assfail+0x28/0x30
[ 6527.678402] RSP: 0018:ffffc900016b7a88 EFLAGS: 00010202
[ 6527.679852] RAX: 00000000ffffffea RBX: 0000000000000000 RCX: 0000000000000000
[ 6527.681579] RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff82271753
[ 6527.683235] RBP: ffff8801276e1400 R08: 0000000000000000 R09: 0000000000000000
[ 6527.684467] R10: ffff8800ba50f028 R11: f000000000000000 R12: ffff880121420000
[ 6527.685549] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000002079f51
[ 6527.686604] FS: 0000000000000000(0000) GS:ffff88013fd00000(0000) knlGS:0000000000000000
[ 6527.687896] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6527.688739] CR2: 00007f5ac5bbb000 CR3: 000000013a3e4000 CR4: 00000000000006e0
[ 6527.689759] Call Trace:
[ 6527.690141] _xfs_log_force_lsn+0x323/0x340
[ 6527.690758] ? xfs_reclaim_inode+0x71/0x400
[ 6527.691365] __xfs_iunpin_wait+0xa4/0x170
[ 6527.691981] ? woken_wake_function+0x10/0x10
[ 6527.692602] xfs_reclaim_inode+0x71/0x400
[ 6527.693271] xfs_reclaim_inodes_ag+0x1a5/0x2e0
[ 6527.693904] xfs_reclaim_inodes_nr+0x31/0x40
[ 6527.694515] super_cache_scan+0x14d/0x1a0
[ 6527.695086] shrink_slab.part.66+0x171/0x370
[ 6527.695739] shrink_node+0x7e/0x1a0
[ 6527.696238] kswapd+0x270/0x6b0
[ 6527.696723] ? node_reclaim+0x220/0x220
[ 6527.697271] kthread+0x111/0x130
[ 6527.697737] ? __kthread_create_worker+0x120/0x120
[ 6527.698421] ? call_usermodehelper_exec_async+0x11c/0x150
[ 6527.699196] ret_from_fork+0x24/0x30
[ 6527.699741] Code: 0b c3 90 0f 1f 44 00 00 48 89 f1 41 89 d0 48 c7 c6 d0 64 2d 82 48 89 fa 31 ff e8 84 fa ff ff 80 3d d1 e3 0d 01 00 75 03 0f 0b c3 <0f> 0b 66 0f 1f 44 00 00 0f 1f 44 00 00 4
[ 6527.702441] RIP: assfail+0x28/0x30 RSP: ffffc900016b7a88
[ 6527.704099] ---[ end trace bbf263c1abbd8799 ]---
On log checkpoint completion, there is a big window between
->iop_committed and ->iop_unpin calls where we are in the state
i_pincount > 0 and ili_last_lsn = zero.
If anything calls xfs_iunpin()/xfs_iunpin_wait() in this window,
then we run this:
xfs_log_force_lsn(ip->i_mount, ip->i_itemp->ili_last_lsn, 0);
Which then fires the above assert because the force lsn is zero.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
prev parent reply other threads:[~2018-03-07 0:45 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-02-22 15:04 [PATCH v2] xfs: rewrite the fdatasync optimization Christoph Hellwig
2018-02-26 14:55 ` Brian Foster
2018-03-07 0:45 ` Dave Chinner [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180307004554.GM18129@dastard \
--to=david@fromorbit.com \
--cc=bfoster@redhat.com \
--cc=hch@lst.de \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).