From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Brian Foster <bfoster@redhat.com>,
	"Darrick J. Wong" <darrick.wong@oracle.com>
Subject: [PATCH 4.9 58/78] xfs: fix log recovery corruption error due to tail overwrite
Date: Mon, 18 Sep 2017 11:12:07 +0200
Message-ID: <20170918091135.156142250@linuxfoundation.org>
In-Reply-To: <20170918091126.077483037@linuxfoundation.org>

4.9-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Brian Foster <bfoster@redhat.com>

commit 4a4f66eac4681378996a1837ad1ffec3a2e2981f upstream.

If we consider the case where the tail (T) of the log is pinned long
enough for the head (H) to push and block behind the tail, we can
end up blocked in the following state without enough free space (f)
in the log to satisfy a transaction reservation:

	0	phys. log	N
	[-------HffT---H'--T'---]

The last good record in the log (before H) refers to T. The tail
eventually pushes forward (T') leaving more free space in the log
for writes to H. At this point, suppose space frees up in the log
for the maximum of 8 in-core log buffers to start flushing out to
the log. If this pushes the head from H to H', these next writes
overwrite the previous tail T. This is safe because the items logged
from T to T' have been written back and removed from the AIL.
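
The wrap-around arithmetic behind this scenario can be sketched in a few
lines of C. This is only an illustrative model, not kernel code: the
helper names log_free_blocks() and write_covers_block() and the block
numbers used with them are assumptions made up for the example. The
free-space formula follows the same shape as the xlog_tail_distance()
helper added by this patch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of a circular log of log_size blocks.
 * Returns the number of blocks that can be written at head before
 * reaching tail (the free space "f" in the diagram).
 */
static long long
log_free_blocks(long long log_size, long long head, long long tail)
{
	assert(log_size > 0);
	assert(head >= 0 && head < log_size);
	assert(tail >= 0 && tail < log_size);

	if (head < tail)
		return tail - head;

	/* head is at or past the tail: free space wraps around the end */
	return tail + (log_size - head);
}

/*
 * Returns true if writing nblocks starting at head (wrapping at
 * log_size) overwrites block blk.
 */
static bool
write_covers_block(long long log_size, long long head, long long nblocks,
		   long long blk)
{
	long long off = blk - head;

	if (off < 0)
		off += log_size;
	return off < nblocks;
}
```

With an assumed 24-block log, H = 7, T = 10 and T' = 18, the free space
grows from 3 to 11 blocks once the tail moves, and a 7-block write at H
covers the old tail T but not T'.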

If the next log writes (H -> H') happen to fail and result in
partial records in the log, the filesystem shuts down having
overwritten T with invalid data. Log recovery correctly locates H on
the subsequent mount, but H still refers to the now corrupted tail
T. This results in log corruption errors and recovery failure.

Since the tail overwrite results from otherwise correct runtime
behavior, it is up to log recovery to try to deal with this
situation. Update log recovery tail verification to run a CRC pass
from the first record past the tail to the head. This facilitates
error detection at T and moves the recovery tail to the first good
record past H' (similar to truncating the head on torn write
detection). If corruption is detected beyond the range possibly
affected by the max number of iclogs, the log is legitimately
corrupted and log recovery failure is expected.
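
The retry logic described above can be modelled roughly as follows. This
is a simplified sketch under stated assumptions, not the kernel
implementation: it treats every log block as one record, replaces the
CRC pass with a lookup in a good[] array, and uses the made-up constants
LOG_BLOCKS and MAX_RANGE in place of the real log size and the
XLOG_MAX_ICLOGS * hsize limit. All function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define LOG_BLOCKS	16	/* assumed log size, one record per block */
#define MAX_RANGE	4	/* assumed stand-in for the max iclog span */

/* Distance from head forward to blk in the circular log. */
static int
dist_from_head(int head, int blk)
{
	if (head < blk)
		return blk - head;
	return blk + (LOG_BLOCKS - head);
}

/*
 * Stand-in for the CRC pass: walk from tail up to (not including) head
 * and return the first bad block, or -1 if every block verifies.
 */
static int
crc_pass(const bool good[LOG_BLOCKS], int tail, int head)
{
	for (int blk = tail; blk != head; blk = (blk + 1) % LOG_BLOCKS)
		if (!good[blk])
			return blk;
	return -1;
}

/*
 * Walk the tail forward past bad blocks that lie within MAX_RANGE of
 * the head. Returns 0 with *tail updated on success, or -1 if a bad
 * block is found outside the range a failed write could have touched.
 */
static int
verify_tail(const bool good[LOG_BLOCKS], int head, int *tail)
{
	int bad = crc_pass(good, *tail, head);

	while (bad >= 0) {
		if (dist_from_head(head, bad) > MAX_RANGE)
			return -1;	/* treated as genuine corruption */

		/* retry from the record after the bad one */
		*tail = (bad + 1) % LOG_BLOCKS;
		bad = crc_pass(good, *tail, head);
	}
	return 0;
}
```

In this model, with head at block 3, tail at block 6 and blocks 6 and 7
marked bad, verify_tail() succeeds and moves the tail to block 8. A bad
block at 12 is more than MAX_RANGE blocks ahead of the head, so the same
call fails instead.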

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/xfs/xfs_log_recover.c |  108 +++++++++++++++++++++++++++++++++--------------
 1 file changed, 77 insertions(+), 31 deletions(-)

--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1029,61 +1029,106 @@ out_error:
 }
 
 /*
- * Check the log tail for torn writes. This is required when torn writes are
- * detected at the head and the head had to be walked back to a previous record.
- * The tail of the previous record must now be verified to ensure the torn
- * writes didn't corrupt the previous tail.
+ * Calculate distance from head to tail (i.e., unused space in the log).
+ */
+static inline int
+xlog_tail_distance(
+	struct xlog	*log,
+	xfs_daddr_t	head_blk,
+	xfs_daddr_t	tail_blk)
+{
+	if (head_blk < tail_blk)
+		return tail_blk - head_blk;
+
+	return tail_blk + (log->l_logBBsize - head_blk);
+}
+
+/*
+ * Verify the log tail. This is particularly important when torn or incomplete
+ * writes have been detected near the front of the log and the head has been
+ * walked back accordingly.
+ *
+ * We also have to handle the case where the tail was pinned and the head
+ * blocked behind the tail right before a crash. If the tail had been pushed
+ * immediately prior to the crash and the subsequent checkpoint was only
+ * partially written, it's possible it overwrote the last referenced tail in the
+ * log with garbage. This is not a coherency problem because the tail must have
+ * been pushed before it can be overwritten, but appears as log corruption to
+ * recovery because we have no way to know the tail was updated if the
+ * subsequent checkpoint didn't write successfully.
  *
- * Return an error if CRC verification fails as recovery cannot proceed.
+ * Therefore, CRC check the log from tail to head. If a failure occurs and the
+ * offending record is within max iclog bufs from the head, walk the tail
+ * forward and retry until a valid tail is found or corruption is detected out
+ * of the range of a possible overwrite.
  */
 STATIC int
 xlog_verify_tail(
 	struct xlog		*log,
 	xfs_daddr_t		head_blk,
-	xfs_daddr_t		tail_blk)
+	xfs_daddr_t		*tail_blk,
+	int			hsize)
 {
 	struct xlog_rec_header	*thead;
 	struct xfs_buf		*bp;
 	xfs_daddr_t		first_bad;
-	int			count;
 	int			error = 0;
 	bool			wrapped;
-	xfs_daddr_t		tmp_head;
+	xfs_daddr_t		tmp_tail;
+	xfs_daddr_t		orig_tail = *tail_blk;
 
 	bp = xlog_get_bp(log, 1);
 	if (!bp)
 		return -ENOMEM;
 
 	/*
-	 * Seek XLOG_MAX_ICLOGS + 1 records past the current tail record to get
-	 * a temporary head block that points after the last possible
-	 * concurrently written record of the tail.
+	 * Make sure the tail points to a record (returns positive count on
+	 * success).
 	 */
-	count = xlog_seek_logrec_hdr(log, head_blk, tail_blk,
-				     XLOG_MAX_ICLOGS + 1, bp, &tmp_head, &thead,
-				     &wrapped);
-	if (count < 0) {
-		error = count;
+	error = xlog_seek_logrec_hdr(log, head_blk, *tail_blk, 1, bp,
+			&tmp_tail, &thead, &wrapped);
+	if (error < 0)
 		goto out;
-	}
+	if (*tail_blk != tmp_tail)
+		*tail_blk = tmp_tail;
 
 	/*
-	 * If the call above didn't find XLOG_MAX_ICLOGS + 1 records, we ran
-	 * into the actual log head. tmp_head points to the start of the record
-	 * so update it to the actual head block.
+	 * Run a CRC check from the tail to the head. We can't just check
+	 * MAX_ICLOGS records past the tail because the tail may point to stale
+	 * blocks cleared during the search for the head/tail. These blocks are
+	 * overwritten with zero-length records and thus record count is not a
+	 * reliable indicator of the iclog state before a crash.
 	 */
-	if (count < XLOG_MAX_ICLOGS + 1)
-		tmp_head = head_blk;
-
-	/*
-	 * We now have a tail and temporary head block that covers at least
-	 * XLOG_MAX_ICLOGS records from the tail. We need to verify that these
-	 * records were completely written. Run a CRC verification pass from
-	 * tail to head and return the result.
-	 */
-	error = xlog_do_recovery_pass(log, tmp_head, tail_blk,
+	first_bad = 0;
+	error = xlog_do_recovery_pass(log, head_blk, *tail_blk,
 				      XLOG_RECOVER_CRCPASS, &first_bad);
+	while (error == -EFSBADCRC && first_bad) {
+		int	tail_distance;
+
+		/*
+		 * Is corruption within range of the head? If so, retry from
+		 * the next record. Otherwise return an error.
+		 */
+		tail_distance = xlog_tail_distance(log, head_blk, first_bad);
+		if (tail_distance > BTOBB(XLOG_MAX_ICLOGS * hsize))
+			break;
+
+		/* skip to the next record; returns positive count on success */
+		error = xlog_seek_logrec_hdr(log, head_blk, first_bad, 2, bp,
+				&tmp_tail, &thead, &wrapped);
+		if (error < 0)
+			goto out;
+
+		*tail_blk = tmp_tail;
+		first_bad = 0;
+		error = xlog_do_recovery_pass(log, head_blk, *tail_blk,
+					      XLOG_RECOVER_CRCPASS, &first_bad);
+	}
 
+	if (!error && *tail_blk != orig_tail)
+		xfs_warn(log->l_mp,
+		"Tail block (0x%llx) overwrite detected. Updated to 0x%llx",
+			 orig_tail, *tail_blk);
 out:
 	xlog_put_bp(bp);
 	return error;
@@ -1187,7 +1232,8 @@ xlog_verify_head(
 	if (error)
 		return error;
 
-	return xlog_verify_tail(log, *head_blk, *tail_blk);
+	return xlog_verify_tail(log, *head_blk, tail_blk,
+				be32_to_cpu((*rhead)->h_size));
 }
 
 /*
