From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Martin Svec <martin.svec@zoner.cz>,
	Brian Foster <bfoster@redhat.com>,
	"Darrick J. Wong" <darrick.wong@oracle.com>
Subject: [PATCH 4.9 33/78] xfs: push buffer of flush locked dquot to avoid quotacheck deadlock
Date: Mon, 18 Sep 2017 11:11:42 +0200
Message-ID: <20170918091131.242027785@linuxfoundation.org>
In-Reply-To: <20170918091126.077483037@linuxfoundation.org>

4.9-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Brian Foster <bfoster@redhat.com>

commit 7912e7fef2aebe577f0b46d3cba261f2783c5695 upstream.

Reclaim during quotacheck can lead to deadlocks on the dquot flush
lock:

 - Quotacheck populates a local delwri queue with the physical dquot
   buffers.
 - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
   dirties all of the dquots.
 - Reclaim kicks in and attempts to flush a dquot whose buffer is
   already queued on the quotacheck queue. The flush succeeds but
   queueing to the reclaim delwri queue fails as the backing buffer is
   already queued. The flush unlock is now deferred to I/O completion
   of the buffer from the quotacheck queue.
 - The dqadjust bulkstat continues and dirties the recently flushed
   dquot once again.
 - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires
   the flush lock to update the backing buffers with the in-core
   recalculated values. It deadlocks on the redirtied dquot as the
   flush lock was already acquired by reclaim, but the buffer resides
   on the local delwri queue which isn't submitted until the end of
   quotacheck.
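
The sequence above can be modelled in userspace as a rough sketch. This
is not XFS code: the model_* types and functions are hypothetical, and
the flush lock is reduced to a boolean that only buffer I/O completion
clears. The point is that once reclaim has flushed the dquot into a
buffer that quotacheck has queued but not submitted, nothing can release
the flush lock until that queue is submitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not XFS types or functions. */
struct model_dquot {
	bool dirty;		/* in-core counters differ from the buffer */
	bool flush_locked;	/* held from flush until buffer I/O completes */
};

struct model_buf {
	bool on_quotacheck_queue;	/* queued, but not yet submitted */
	struct model_dquot *flushed;	/* dquot whose flush lock I/O drops */
};

/* Reclaim flushes the dquot into its buffer and keeps the flush lock. */
static void model_reclaim_flush(struct model_dquot *dq, struct model_buf *bp)
{
	assert(!dq->flush_locked);
	dq->flush_locked = true;
	dq->dirty = false;
	bp->flushed = dq;
	/* bp already sits on the quotacheck queue, so reclaim cannot submit it. */
}

/* Non-blocking attempt to take the flush lock. */
static bool model_flush_trylock(struct model_dquot *dq)
{
	if (dq->flush_locked)
		return false;
	dq->flush_locked = true;
	return true;
}

/* Submitting the queue completes the I/O, which releases the flush lock. */
static void model_submit_queue(struct model_buf *bp)
{
	bp->on_quotacheck_queue = false;
	if (bp->flushed) {
		bp->flushed->flush_locked = false;
		bp->flushed = NULL;
	}
}
```

In this model, calling model_reclaim_flush(), redirtying the dquot and
then calling model_flush_trylock() fails until model_submit_queue() has
run. A blocking lock at that point would wait forever, because
quotacheck submits the queue only after the flush walk finishes.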

This is reproduced by running quotacheck on a filesystem with a
couple million inodes in low memory (512MB-1GB) situations. This is
a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write
buffer lists"), which removed a trylock and buffer I/O submission
from the quotacheck dquot flush sequence.

Quotacheck first resets and collects the physical dquot buffers in a
delwri queue. Then, it traverses the filesystem inodes via bulkstat,
updates the in-core dquots, flushes the corrected dquots to the
backing buffers and finally submits the delwri queue for I/O. Since
the backing buffers are queued across the entire quotacheck
operation, dquot reclaim cannot possibly complete a dquot flush
before quotacheck completes.

Therefore, quotacheck must submit the buffer for I/O in order to
cycle the flush lock and flush the dirty in-core dquot to the
buffer. Add a delwri queue buffer push mechanism to submit an
individual buffer for I/O without losing the delwri queue status and
use it from quotacheck to avoid the deadlock. This restores the
quotacheck behavior from before the regression was introduced.
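
The trylock-then-push approach described above can be illustrated with
a small userspace sketch. It is a simplified model, not the kernel
implementation: the sketch_* names are hypothetical, I/O is treated as
completing synchronously, and the caller is assumed to retry a dquot
when the flush helper returns -EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not XFS types or functions. */
struct sketch_dquot {
	bool dirty;
	bool flush_locked;
};

struct sketch_buf {
	bool queued;			/* still on the delwri queue */
	struct sketch_dquot *flushed;	/* flush lock dropped on I/O completion */
	int io_count;			/* number of times the buffer was written */
};

/* Push one buffer: do its I/O now, but leave it on the original queue. */
static void sketch_pushbuf(struct sketch_buf *bp)
{
	assert(bp->queued);
	bp->io_count++;
	if (bp->flushed) {
		/* I/O completion cycles the flush lock. */
		bp->flushed->flush_locked = false;
		bp->flushed = NULL;
	}
	/* bp->queued is deliberately left set. */
}

/* Flush one dquot without ever blocking on the flush lock. */
static int sketch_flush_one(struct sketch_dquot *dq, struct sketch_buf *bp)
{
	if (!dq->dirty)
		return 0;
	if (dq->flush_locked) {
		/* Trylock failed: push the buffer and ask for a retry. */
		sketch_pushbuf(bp);
		return -EAGAIN;
	}
	dq->flush_locked = true;
	dq->dirty = false;
	bp->flushed = dq;
	return 0;
}

/* Assumed caller behaviour: retry for as long as -EAGAIN is returned. */
static int sketch_flush_retry(struct sketch_dquot *dq, struct sketch_buf *bp)
{
	int error;

	do {
		error = sketch_flush_one(dq, bp);
	} while (error == -EAGAIN);
	return error;
}
```

Starting from a dirty dquot whose flush lock is held by an earlier
flush, the first sketch_flush_one() call pushes the buffer and returns
-EAGAIN, and the retry then succeeds with the buffer still queued.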

Reported-by: Martin Svec <martin.svec@zoner.cz>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 fs/xfs/xfs_buf.c   |   60 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_buf.h   |    1 
 fs/xfs/xfs_qm.c    |   28 +++++++++++++++++++++++-
 fs/xfs/xfs_trace.h |    1 
 4 files changed, 89 insertions(+), 1 deletion(-)

--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2022,6 +2022,66 @@ xfs_buf_delwri_submit(
 	return error;
 }
 
+/*
+ * Push a single buffer on a delwri queue.
+ *
+ * The purpose of this function is to submit a single buffer of a delwri queue
+ * and return with the buffer still on the original queue. The waiting delwri
+ * buffer submission infrastructure guarantees transfer of the delwri queue
+ * buffer reference to a temporary wait list. We reuse this infrastructure to
+ * transfer the buffer back to the original queue.
+ *
+ * Note the buffer transitions from the queued state, to the submitted and wait
+ * listed state and back to the queued state during this call. The buffer
+ * locking and queue management logic between _delwri_pushbuf() and
+ * _delwri_queue() guarantee that the buffer cannot be queued to another list
+ * before returning.
+ */
+int
+xfs_buf_delwri_pushbuf(
+	struct xfs_buf		*bp,
+	struct list_head	*buffer_list)
+{
+	LIST_HEAD		(submit_list);
+	int			error;
+
+	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
+
+	trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_);
+
+	/*
+	 * Isolate the buffer to a new local list so we can submit it for I/O
+	 * independently from the rest of the original list.
+	 */
+	xfs_buf_lock(bp);
+	list_move(&bp->b_list, &submit_list);
+	xfs_buf_unlock(bp);
+
+	/*
+	 * Delwri submission clears the DELWRI_Q buffer flag and returns with
+	 * the buffer on the wait list with an associated reference. Rather than
+	 * bounce the buffer from a local wait list back to the original list
+	 * after I/O completion, reuse the original list as the wait list.
+	 */
+	xfs_buf_delwri_submit_buffers(&submit_list, buffer_list);
+
+	/*
+	 * The buffer is now under I/O and wait listed as during typical delwri
+	 * submission. Lock the buffer to wait for I/O completion. Rather than
+	 * remove the buffer from the wait list and release the reference, we
+	 * want to return with the buffer queued to the original list. The
+	 * buffer already sits on the original list with a wait list reference,
+	 * however. If we let the queue inherit that wait list reference, all we
+	 * need to do is reset the DELWRI_Q flag.
+	 */
+	xfs_buf_lock(bp);
+	error = bp->b_error;
+	bp->b_flags |= _XBF_DELWRI_Q;
+	xfs_buf_unlock(bp);
+
+	return error;
+}
+
 int __init
 xfs_buf_init(void)
 {
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -333,6 +333,7 @@ extern void xfs_buf_delwri_cancel(struct
 extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
 extern int xfs_buf_delwri_submit(struct list_head *);
 extern int xfs_buf_delwri_submit_nowait(struct list_head *);
+extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *);
 
 /* Buffer Daemon Setup Routines */
 extern int xfs_buf_init(void);
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -1247,6 +1247,7 @@ xfs_qm_flush_one(
 	struct xfs_dquot	*dqp,
 	void			*data)
 {
+	struct xfs_mount	*mp = dqp->q_mount;
 	struct list_head	*buffer_list = data;
 	struct xfs_buf		*bp = NULL;
 	int			error = 0;
@@ -1257,7 +1258,32 @@ xfs_qm_flush_one(
 	if (!XFS_DQ_IS_DIRTY(dqp))
 		goto out_unlock;
 
-	xfs_dqflock(dqp);
+	/*
+	 * The only way the dquot is already flush locked by the time quotacheck
+	 * gets here is if reclaim flushed it before the dqadjust walk dirtied
+	 * it for the final time. Quotacheck collects all dquot bufs in the
+	 * local delwri queue before dquots are dirtied, so reclaim can't have
+	 * possibly queued it for I/O. The only way out is to push the buffer to
+	 * cycle the flush lock.
+	 */
+	if (!xfs_dqflock_nowait(dqp)) {
+		/* buf is pinned in-core by delwri list */
+		DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno,
+				      mp->m_quotainfo->qi_dqchunklen);
+		bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL);
+		if (!bp) {
+			error = -EINVAL;
+			goto out_unlock;
+		}
+		xfs_buf_unlock(bp);
+
+		xfs_buf_delwri_pushbuf(bp, buffer_list);
+		xfs_buf_rele(bp);
+
+		error = -EAGAIN;
+		goto out_unlock;
+	}
+
 	error = xfs_qm_dqflush(dqp, &bp);
 	if (error)
 		goto out_unlock;
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -366,6 +366,7 @@ DEFINE_BUF_EVENT(xfs_buf_iowait_done);
 DEFINE_BUF_EVENT(xfs_buf_delwri_queue);
 DEFINE_BUF_EVENT(xfs_buf_delwri_queued);
 DEFINE_BUF_EVENT(xfs_buf_delwri_split);
+DEFINE_BUF_EVENT(xfs_buf_delwri_pushbuf);
 DEFINE_BUF_EVENT(xfs_buf_get_uncached);
 DEFINE_BUF_EVENT(xfs_bdstrat_shut);
 DEFINE_BUF_EVENT(xfs_buf_item_relse);
