From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>,
Alex Lyakas <alex@zadarastorage.com>,
linux-xfs@vger.kernel.org, libor.klepac@bcom.cz
Subject: Re: [PATCH] xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of an attribute
Date: Fri, 11 Aug 2017 10:30:41 -0400 [thread overview]
Message-ID: <20170811143039.GB18609@bfoster.bfoster> (raw)
In-Reply-To: <20170811020956.GP21024@dastard>
On Fri, Aug 11, 2017 at 12:09:56PM +1000, Dave Chinner wrote:
> On Thu, Aug 10, 2017 at 10:55:48AM -0700, Darrick J. Wong wrote:
> > On Thu, Aug 10, 2017 at 10:52:49AM -0400, Brian Foster wrote:
> > > On Thu, Aug 10, 2017 at 03:09:09PM +0300, Alex Lyakas wrote:
> > > > Hi Dave,
> > > >
> > > > Thanks for the explanation. So it seems we cannot move forward with this
> > > > fix.
> > > >
> > >
> > > I don't think this completely invalidates the fix.. Dave is pointing out
> > > a flaw that the deferred ops infrastructure doesn't properly handle the
> > > technique we want to use here. IOW, it means there's a dependency that
> > > needs to be implemented first.
> > >
> > > FWIW, I also think this means that your approach on the older kernel to
> > > join/hold the buffer to the finished transaction may be the right
> > > approach (depending on whether I follow the perm transaction code
> > > correctly or not, see below), but I think you'd need to relog the buffer
> > > as well.
>
> Yes, the problem exists in 3.18 via the roll in xfs_bmap_finish()
> so it would also need to be done there, too.
>
The argument to fix the deferred ops problem in the current code first
because its fairly straightforward makes sense to me.
That said, what I wrote below in 1. suggests that this is not a problem
in the v3.18 xfs_attr_set() code. The argument is basically that the old
xfs_bmap_finish() only committed the transaction once at most and so
this codepath never waits on log reservation after the initial
xfs_trans_reserve(). IOW, the problem in this case (in principle) is
that the xfs_defer_finish() can roll the transaction an arbitrary number
of times. Am I missing something?
Brian
> > >
> > > > Will somebody else in XFS community be working on fixing this issue? As you
> > > > pointed out, it exists for over two decades. Our production systems hit this
> > > > every couple of days, and shutting down the filesystem causes outage.
> > > >
> > >
> > > I'm guessing the defer infrastructure needs to handle relogging a buffer
> > > similar to how it currently handles joining/relogging inodes..?
>
> Yup, pretty much identical, and only a 10-20 lines of new code, I
> think.
>
> > > > The problem is that the locked buffer is not joined and logged in
> > > > the rolling transactions run in xfs_defer_ops. Hence it can pin the
> > > > tail of the AIL, and this can prevent the transaction roll from
> > > > regranting the log space necessary to continue rolling the
> > > > transaction for the required number of transactions to complete the
> > > > deferred ops. If this happens, we end up with a log space deadlock.
> > > >
> > > > Hence if we are holding an item that we logged in a transaction
> > > > locked and we roll the transaction, we have to join, hold and log it
> > > > in each subsequent transaction until we have finished with the item
> > > > and can unlock and release it.
> > > >
> > > > This is documented in xfs_trans_roll():
> > > >
> > > > /*
> > > > * Reserve space in the log for th next transaction.
> > > > * This also pushes items in the "AIL", the list of logged items,
> > > > * out to disk if they are taking up space at the tail of the log
> > > > * that we want to use. This requires that either nothing be locked
> > > > * across this call, or that anything that is locked be logged in
> > > > * the prior and the next transactions.
> > > > */
> > > >
> > >
> > > Good catch, though I'm wondering whether it's a real enough problem atm
> > > to block this fix. A few thoughts/questions:
> > >
> > > 1.) The transaction in this case has a t_log_count of 3, presumably to
> > > cover the commits by the historical bmap_finish, the trans roll and the
> > > final commit. If I'm following the permanent transaction code correctly,
> > > doesn't that mean that we have room for at least 2 rolls (and 3 commits)
> > > before this task would actually block on log reservation? AFAICT it
> > > looks like the commit would dec ticket->t_cnt and replenish the current
> > > log reservation. The subsequent xfs_trans_reserve() would just return if
> > > t_cnt > 0.
> > >
> > > This of course doesn't accommodate the fact the xfs_defer_finish() can
> > > now roll a transaction an indeterminate number of times, which probably
> > > needs to be handled in general, but...
> >
> > I'd been wondering if tr_logcount needed upward adjusting, but so far
> > haven't observed any problems.
>
> That won't avoid the general problem, though, just increase log
> reservation pressure from active transactions.
>
> > > 2.) It doesn't look like we actually defer any ops in this situation
> > > unless rmapbt is enabled, assuming that we limit holding the buffer to
> > > the empty leaf case, at least (see my comment on the previous version).
> > > I also don't see where a deferred rmapbt update would itself ever roll
> > > the transaction.
> >
> > rmapbt split causes the agfl to hit the low water mark and refresh,
> > requiring an allocation ... but I think that's all stuffed in the same
> > transaction. (So yeah I think I agree with you)
>
> I haven't looked that far, but I'd prefer we fix the problem now
> while we are looking at it because it doesn't seem that hard to
> fix...
>
> > > 3.) The buffer in this case is a new allocation, which I think means the
> > > risk of it pinning the tail and causing a log deadlock here means that
> > > on top of somehow depleting the initial permanent reservation, we'd have
> > > to exhaust the log in the time between the first commit and the last
> > > reservation.
> > >
> > > Given the above, it seems reasonably safe enough to me to merge this
> > > change as is and fix up the deferred ops stuff after the fact
> > > (considering we know we need to rework the xattr stuff as such anyways).
> > > This is still a landmine that should be fixed up, but I wouldn't be
> > > against an ASSERT() or something for the time being if we could somehow
> > > verify that the transaction ticket didn't require any extra reservation.
> > >
> > > OTOH, just adding deferred ops buffer relogging might not be too much
> > > trouble either. ;) Anyways, thoughts?
> >
> > I don't think it'd be difficult to add a _defer_bjoin operation that
> > maintains a list of buffers that we need to bhold across rolls.
>
> Just a small array like inodes currently use would be sufficient.
> We only need to hold one buffer right now....
>
> > I think xfs_buf->b_list is only used for delwri buffers, and a buffer
> > cannot be part of a transaction /and/ on a delwri list at the same time,
> > right? So it shouldn't be hard to whip something up and couple this
> > patch to that.
>
> Reading xfs_buf_item_push() answers that question:
>
> if (!xfs_buf_delwri_queue(bp, buffer_list))
> rval = XFS_ITEM_FLUSHING;
> xfs_buf_unlock(bp);
> return rval;
>
> So, yes, a buffer can be on the delwri queue and be part of a
> transaction at the same time because the buffers on the delwri queue
> get unlocked once they are queued. If a transaction locks and joins
> the buffer while it is on the delwri queue, the commit will pin the
> buffer in memory before unlocking it and
> xfs_buf_delwri_submit_nowait() will see it pinned and skip over it.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2017-08-11 14:30 UTC|newest]
Thread overview: 40+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-08-09 11:06 [PATCH] xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of an attribute alex
2017-08-09 13:17 ` Brian Foster
2017-08-09 21:33 ` Dave Chinner
2017-08-10 8:02 ` Alex Lyakas
2017-08-10 11:33 ` Dave Chinner
2017-08-10 12:09 ` Alex Lyakas
2017-08-10 14:52 ` Brian Foster
2017-08-10 17:55 ` Darrick J. Wong
2017-08-10 18:32 ` Brian Foster
2017-08-11 2:22 ` Dave Chinner
2017-08-11 14:27 ` Brian Foster
2017-08-12 0:16 ` Dave Chinner
2017-08-12 14:04 ` Brian Foster
2017-08-14 0:28 ` Dave Chinner
2017-08-14 8:11 ` Alex Lyakas
2017-08-14 12:22 ` Brian Foster
2017-08-14 16:04 ` Alex Lyakas
2017-08-14 21:33 ` Darrick J. Wong
2019-03-22 9:12 ` Shyam Kaushik
2019-03-22 16:08 ` Darrick J. Wong
2019-03-25 13:49 ` Shyam Kaushik
2019-03-25 18:17 ` Darrick J. Wong
2019-03-27 16:03 ` Alex Lyakas
2019-03-27 20:46 ` Dave Chinner
2019-03-28 11:26 ` Alex Lyakas
2017-08-17 20:38 ` Brian Foster
2017-08-17 22:31 ` Darrick J. Wong
2017-08-18 11:39 ` Brian Foster
2017-08-18 15:37 ` Darrick J. Wong
2017-08-18 2:04 ` Dave Chinner
2017-08-18 11:42 ` Brian Foster
2017-08-11 2:09 ` Dave Chinner
2017-08-11 14:30 ` Brian Foster [this message]
2017-08-11 12:53 ` Christoph Hellwig
2017-08-11 16:52 ` Darrick J. Wong
2017-08-12 7:37 ` Christoph Hellwig
2017-11-21 15:31 ` Libor Klepáč
2017-11-21 16:24 ` Brian Foster
2017-11-21 18:50 ` Darrick J. Wong
2017-11-30 17:55 ` Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170811143039.GB18609@bfoster.bfoster \
--to=bfoster@redhat.com \
--cc=alex@zadarastorage.com \
--cc=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=libor.klepac@bcom.cz \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.