[PATCH 0/4] xfs: log recovery wrap and tail overwrite fixes

linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/4] xfs: log recovery wrap and tail overwrite fixes
@ 2017-06-27 14:40 Brian Foster
  2017-06-27 14:40 ` [PATCH 1/4] xfs: fix recovery failure when log record header wraps log end Brian Foster
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Brian Foster @ 2017-06-27 14:40 UTC (permalink / raw)
  To: linux-xfs

Hi all,

Here's a first real stab at a fix for the log tail overwrite issue. The
general approach is similar to torn write detection: move the tail
forward when corruption is detected within the range of a possible tail
overwrite.

Patch 1 fixes an independent and spurious log recovery failure when a
log record header wraps around the end of the physical log. Patch 2 is a
semi-preparatory patch that unconditionally invokes log tail
verification rather than only after torn write detection at the head.
Patch 3 introduces the core fix to move the tail forward in the event of
corruption. Patch 4 introduces an error injection tag to force log item
pinning and facilitates the test that reliably reproduces the tail
overwrite problem.

This survives the latest variant of the xfstests test that reproduces
the tail overwrite condition and otherwise hasn't shown any regressions
in my testing so far (still ongoing). This also allows the metadump
images provided by Sweet Tea[1] to mount (though those images do still
show filesystem corruption after mount, so I suspect something more is
going on there).

One other slight change worth noting in log recovery behavior is that
tail overwrite detection causes earlier reporting of legitimate log CRC
or corruption errors. Before this series, a log corruption that is not
resolved by torn write/tail overwrite detection results in log recovery
failure after a partial recovery up to the point at which the corruption
is encountered. After this series, it is very likely that the corruption
is identified during tail verification and an error returned to
userspace before real recovery begins.

An xfs_repair is necessary in either case, but I'm curious if there is a
preference towards the old or newly proposed behavior. An alternative
I've considered to preserve the old behavior, for example, is to use the
tail verification CRC pass for tail fixing only (and otherwise consider
errors at this point as nonfatal). This means that we would fix up the
tail if possible, otherwise leave errors to the real recovery sequence
such that a partial recovery can occur before the (imminent) failure.
Thoughts?

Brian

v1:
- Add patch to fix log recovery header wrapping problem.
- Replace transaction reservation rfc with log recovery based fix.
- Replace custom log pinning sysfs knob with error injection tag.
rfc: http://www.spinics.net/lists/linux-xfs/msg07623.html

[1] http://www.spinics.net/lists/linux-xfs/msg07667.html

Brian Foster (4):
  xfs: fix recovery failure when log record header wraps log end
  xfs: always verify the log tail during recovery
  xfs: fix log recovery corruption error due to tail overwrite
  xfs: add log item pinning error injection tag

 fs/xfs/xfs_error.c       |   3 +
 fs/xfs/xfs_error.h       |   4 +-
 fs/xfs/xfs_log_recover.c | 150 +++++++++++++++++++++++++++++------------------
 fs/xfs/xfs_trans_ail.c   |  17 +++++-
 4 files changed, 114 insertions(+), 60 deletions(-)

-- 
2.7.5

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/4] xfs: fix recovery failure when log record header wraps log end
  2017-06-27 14:40 [PATCH 0/4] xfs: log recovery wrap and tail overwrite fixes Brian Foster
@ 2017-06-27 14:40 ` Brian Foster
  2017-07-01  4:38   ` Darrick J. Wong
  2017-06-27 14:40 ` [PATCH 2/4] xfs: always verify the log tail during recovery Brian Foster
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2017-06-27 14:40 UTC (permalink / raw)
  To: linux-xfs

The high-level log recovery algorithm consists of two loops that
walk the physical log and process log records from the tail to the
head. The first loop handles the case where the tail is beyond the
head and processes records up to the end of the physical log. The
subsequent loop processes records from the beginning of the physical
log to the head.

Because log records can wrap around the end of the physical log, the
first loop mentioned above must handle this case appropriately.
Records are processed from in-core buffers, which means that this
algorithm must split the reads of such records into two partial
I/Os: 1.) from the beginning of the record to the end of the log and
2.) from the beginning of the log to the end of the record. This is
further complicated by the fact that the log record header and log
record data are read into independent buffers.

The current handling of each buffer correctly splits the reads when
either the header or data starts before the end of the log and wraps
around the end. The data read does not correctly handle the case
where the prior header read wrapped or ends on the physical log end
boundary. blk_no is incremented to or beyond the log end after the
header read to point to the record data, but the split data read
logic triggers, attempts to read from an invalid log block and
ultimately causes log recovery to fail. This can be reproduced
fairly reliably via xfstests tests generic/047 and generic/388 with
large iclog sizes (256k) and small (10M) logs.

If the record header read has pushed beyond the end of the physical
log, the subsequent data read is actually contiguous. Update the
data read logic to detect the case where blk_no has wrapped, mod it
against the log size to read from the correct address and issue one
contiguous read for the log data buffer. The log record is processed
as normal from the buffer(s), the loop exits after the current
iteration and the subsequent loop picks up with the first new record
after the start of the log.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log_recover.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index b6a40bd..9efcd12 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -5371,10 +5371,20 @@ xlog_do_recovery_pass(
 			bblks = (int)BTOBB(be32_to_cpu(rhead->h_len));
 			blk_no += hblks;

-			/* Read in data for log record */
-			if (blk_no + bblks <= log->l_logBBsize) {
-				error = xlog_bread(log, blk_no, bblks, dbp,
-						   &offset);
+			/*
+			 * Read the log record data in multiple reads if it
+			 * wraps around the end of the log. Note that if the
+			 * header already wrapped, blk_no could point past the
+			 * end of the log. The record data is contiguous in
+			 * that case.
+			 */
+			if (blk_no + bblks <= log->l_logBBsize ||
+			    blk_no >= log->l_logBBsize) {
+				/* mod blk_no in case the header wrapped and
+				 * pushed it beyond the end of the log */
+				error = xlog_bread(log,
+						   blk_no % log->l_logBBsize,
+						   bblks, dbp, &offset);
 				if (error)
 					goto bread_err2;
 			} else {
-- 
2.7.5

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/4] xfs: fix recovery failure when log record header wraps log end
  2017-06-27 14:40 ` [PATCH 1/4] xfs: fix recovery failure when log record header wraps log end Brian Foster
@ 2017-07-01  4:38   ` Darrick J. Wong
  2017-07-03 12:11     ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2017-07-01  4:38 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Jun 27, 2017 at 10:40:33AM -0400, Brian Foster wrote:
> The high-level log recovery algorithm consists of two loops that
> walk the physical log and process log records from the tail to the
> head. The first loop handles the case where the tail is beyond the
> head and processes records up to the end of the physical log. The
> subsequent loop processes records from the beginning of the physical
> log to the head.
> 
> Because log records can wrap around the end of the physical log, the
> first loop mentioned above must handle this case appropriately.
> Records are processed from in-core buffers, which means that this
> algorithm must split the reads of such records into two partial
> I/Os: 1.) from the beginning of the record to the end of the log and
> 2.) from the beginning of the log to the end of the record. This is
> further complicated by the fact that the log record header and log
> record data are read into independent buffers.
> 
> The current handling of each buffer correctly splits the reads when
> either the header or data starts before the end of the log and wraps
> around the end. The data read does not correctly handle the case
> where the prior header read wrapped or ends on the physical log end
> boundary. blk_no is incremented to or beyond the log end after the
> header read to point to the record data, but the split data read
> logic triggers, attempts to read from an invalid log block and
> ultimately causes log recovery to fail. This can be reproduced
> fairly reliably via xfstests tests generic/047 and generic/388 with
> large iclog sizes (256k) and small (10M) logs.
> 
> If the record header read has pushed beyond the end of the physical
> log, the subsequent data read is actually contiguous. Update the
> data read logic to detect the case where blk_no has wrapped, mod it
> against the log size to read from the correct address and issue one
> contiguous read for the log data buffer. The log record is processed
> as normal from the buffer(s), the loop exits after the current
> iteration and the subsequent loop picks up with the first new record
> after the start of the log.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_log_recover.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index b6a40bd..9efcd12 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -5371,10 +5371,20 @@ xlog_do_recovery_pass(
>  			bblks = (int)BTOBB(be32_to_cpu(rhead->h_len));
>  			blk_no += hblks;
>  
> -			/* Read in data for log record */
> -			if (blk_no + bblks <= log->l_logBBsize) {
> -				error = xlog_bread(log, blk_no, bblks, dbp,
> -						   &offset);
> +			/*
> +			 * Read the log record data in multiple reads if it
> +			 * wraps around the end of the log. Note that if the
> +			 * header already wrapped, blk_no could point past the
> +			 * end of the log. The record data is contiguous in
> +			 * that case.
> +			 */
> +			if (blk_no + bblks <= log->l_logBBsize ||
> +			    blk_no >= log->l_logBBsize) {
> +				/* mod blk_no in case the header wrapped and
> +				 * pushed it beyond the end of the log */
> +				error = xlog_bread(log,
> +						   blk_no % log->l_logBBsize,

I /think/ this is ok, though isn't this 64-bit division?  ^

--D

> +						   bblks, dbp, &offset);
>  				if (error)
>  					goto bread_err2;
>  			} else {
> -- 
> 2.7.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/4] xfs: fix recovery failure when log record header wraps log end
  2017-07-01  4:38   ` Darrick J. Wong
@ 2017-07-03 12:11     ` Brian Foster
  0 siblings, 0 replies; 18+ messages in thread
From: Brian Foster @ 2017-07-03 12:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, Jun 30, 2017 at 09:38:03PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 27, 2017 at 10:40:33AM -0400, Brian Foster wrote:
> > The high-level log recovery algorithm consists of two loops that
> > walk the physical log and process log records from the tail to the
> > head. The first loop handles the case where the tail is beyond the
> > head and processes records up to the end of the physical log. The
> > subsequent loop processes records from the beginning of the physical
> > log to the head.
> > 
> > Because log records can wrap around the end of the physical log, the
> > first loop mentioned above must handle this case appropriately.
> > Records are processed from in-core buffers, which means that this
> > algorithm must split the reads of such records into two partial
> > I/Os: 1.) from the beginning of the record to the end of the log and
> > 2.) from the beginning of the log to the end of the record. This is
> > further complicated by the fact that the log record header and log
> > record data are read into independent buffers.
> > 
> > The current handling of each buffer correctly splits the reads when
> > either the header or data starts before the end of the log and wraps
> > around the end. The data read does not correctly handle the case
> > where the prior header read wrapped or ends on the physical log end
> > boundary. blk_no is incremented to or beyond the log end after the
> > header read to point to the record data, but the split data read
> > logic triggers, attempts to read from an invalid log block and
> > ultimately causes log recovery to fail. This can be reproduced
> > fairly reliably via xfstests tests generic/047 and generic/388 with
> > large iclog sizes (256k) and small (10M) logs.
> > 
> > If the record header read has pushed beyond the end of the physical
> > log, the subsequent data read is actually contiguous. Update the
> > data read logic to detect the case where blk_no has wrapped, mod it
> > against the log size to read from the correct address and issue one
> > contiguous read for the log data buffer. The log record is processed
> > as normal from the buffer(s), the loop exits after the current
> > iteration and the subsequent loop picks up with the first new record
> > after the start of the log.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_log_recover.c | 18 ++++++++++++++----
> >  1 file changed, 14 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index b6a40bd..9efcd12 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -5371,10 +5371,20 @@ xlog_do_recovery_pass(
> >  			bblks = (int)BTOBB(be32_to_cpu(rhead->h_len));
> >  			blk_no += hblks;
> >  
> > -			/* Read in data for log record */
> > -			if (blk_no + bblks <= log->l_logBBsize) {
> > -				error = xlog_bread(log, blk_no, bblks, dbp,
> > -						   &offset);
> > +			/*
> > +			 * Read the log record data in multiple reads if it
> > +			 * wraps around the end of the log. Note that if the
> > +			 * header already wrapped, blk_no could point past the
> > +			 * end of the log. The record data is contiguous in
> > +			 * that case.
> > +			 */
> > +			if (blk_no + bblks <= log->l_logBBsize ||
> > +			    blk_no >= log->l_logBBsize) {
> > +				/* mod blk_no in case the header wrapped and
> > +				 * pushed it beyond the end of the log */
> > +				error = xlog_bread(log,
> > +						   blk_no % log->l_logBBsize,
> 
> I /think/ this is ok, though isn't this 64-bit division?  ^
> 

Yep, looks like it. Will fix.. thanks for catching that..

Brian

> --D
> 
> > +						   bblks, dbp, &offset);
> >  				if (error)
> >  					goto bread_err2;
> >  			} else {
> > -- 
> > 2.7.5
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 2/4] xfs: always verify the log tail during recovery
  2017-06-27 14:40 [PATCH 0/4] xfs: log recovery wrap and tail overwrite fixes Brian Foster
  2017-06-27 14:40 ` [PATCH 1/4] xfs: fix recovery failure when log record header wraps log end Brian Foster
@ 2017-06-27 14:40 ` Brian Foster
  2017-07-01  4:43   ` Darrick J. Wong
  2017-06-27 14:40 ` [PATCH 3/4] xfs: fix log recovery corruption error due to tail overwrite Brian Foster
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2017-06-27 14:40 UTC (permalink / raw)
  To: linux-xfs

Log tail verification currently only occurs when torn writes are
detected at the head of the log. This was introduced because a
change in the head block due to torn writes can lead to a change in
the tail block (each log record header references the current tail)
and the tail block should be verified before log recovery proceeds.

Tail corruption is possible outside of torn write scenarios,
however. For example, partial log writes can be detected and cleared
during the initial head/tail block discovery process. If the partial
write coincides with a tail overwrite, the log tail is corrupted and
recovery fails.

To facilitate correct handling of log tail overwites, update log
recovery to always perform tail verification. This is necessary to
detect potential tail overwrite conditions when torn writes may not
have occurred. This changes normal (i.e., no torn writes) recovery
behavior slightly to detect and return CRC related errors near the
tail before actual recovery starts.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log_recover.c | 24 +-----------------------
 1 file changed, 1 insertion(+), 23 deletions(-)

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 9efcd12..269d5f9 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1183,31 +1183,9 @@ xlog_verify_head(
 			ASSERT(0);
 			return 0;
 		}
-
-		/*
-		 * Now verify the tail based on the updated head. This is
-		 * required because the torn writes trimmed from the head could
-		 * have been written over the tail of a previous record. Return
-		 * any errors since recovery cannot proceed if the tail is
-		 * corrupt.
-		 *
-		 * XXX: This leaves a gap in truly robust protection from torn
-		 * writes in the log. If the head is behind the tail, the tail
-		 * pushes forward to create some space and then a crash occurs
-		 * causing the writes into the previous record's tail region to
-		 * tear, log recovery isn't able to recover.
-		 *
-		 * How likely is this to occur? If possible, can we do something
-		 * more intelligent here? Is it safe to push the tail forward if
-		 * we can determine that the tail is within the range of the
-		 * torn write (e.g., the kernel can only overwrite the tail if
-		 * it has actually been pushed forward)? Alternatively, could we
-		 * somehow prevent this condition at runtime?
-		 */
-		error = xlog_verify_tail(log, *head_blk, *tail_blk);
 	}

-	return error;
+	return xlog_verify_tail(log, *head_blk, *tail_blk);
 }

 /*
-- 
2.7.5

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] xfs: always verify the log tail during recovery
  2017-06-27 14:40 ` [PATCH 2/4] xfs: always verify the log tail during recovery Brian Foster
@ 2017-07-01  4:43   ` Darrick J. Wong
  2017-07-03 12:11     ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2017-07-01  4:43 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Jun 27, 2017 at 10:40:34AM -0400, Brian Foster wrote:
> Log tail verification currently only occurs when torn writes are
> detected at the head of the log. This was introduced because a
> change in the head block due to torn writes can lead to a change in
> the tail block (each log record header references the current tail)
> and the tail block should be verified before log recovery proceeds.
> 
> Tail corruption is possible outside of torn write scenarios,
> however. For example, partial log writes can be detected and cleared
> during the initial head/tail block discovery process. If the partial
> write coincides with a tail overwrite, the log tail is corrupted and
> recovery fails.
> 
> To facilitate correct handling of log tail overwites, update log
> recovery to always perform tail verification. This is necessary to
> detect potential tail overwrite conditions when torn writes may not
> have occurred. This changes normal (i.e., no torn writes) recovery
> behavior slightly to detect and return CRC related errors near the
> tail before actual recovery starts.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

> ---
>  fs/xfs/xfs_log_recover.c | 24 +-----------------------
>  1 file changed, 1 insertion(+), 23 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 9efcd12..269d5f9 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -1183,31 +1183,9 @@ xlog_verify_head(
>  			ASSERT(0);
>  			return 0;
>  		}
> -
> -		/*
> -		 * Now verify the tail based on the updated head. This is
> -		 * required because the torn writes trimmed from the head could
> -		 * have been written over the tail of a previous record. Return
> -		 * any errors since recovery cannot proceed if the tail is
> -		 * corrupt.
> -		 *
> -		 * XXX: This leaves a gap in truly robust protection from torn
> -		 * writes in the log. If the head is behind the tail, the tail
> -		 * pushes forward to create some space and then a crash occurs
> -		 * causing the writes into the previous record's tail region to
> -		 * tear, log recovery isn't able to recover.
> -		 *
> -		 * How likely is this to occur? If possible, can we do something
> -		 * more intelligent here? Is it safe to push the tail forward if
> -		 * we can determine that the tail is within the range of the
> -		 * torn write (e.g., the kernel can only overwrite the tail if
> -		 * it has actually been pushed forward)? Alternatively, could we
> -		 * somehow prevent this condition at runtime?
> -		 */
> -		error = xlog_verify_tail(log, *head_blk, *tail_blk);
>  	}
>  
> -	return error;
> +	return xlog_verify_tail(log, *head_blk, *tail_blk);

What if (error != 0 && error != -EFSBADCRC) here?  If the CRC checking log
recovery pass failed due to some other reason (EIO, ENOMEM, etc.) then is
there really a point to verifying the tail, vs. bubbling the error up and
(I presume) failing the mount?

--D

>  }
>  
>  /*
> -- 
> 2.7.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] xfs: always verify the log tail during recovery
  2017-07-01  4:43   ` Darrick J. Wong
@ 2017-07-03 12:11     ` Brian Foster
  0 siblings, 0 replies; 18+ messages in thread
From: Brian Foster @ 2017-07-03 12:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, Jun 30, 2017 at 09:43:49PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 27, 2017 at 10:40:34AM -0400, Brian Foster wrote:
> > Log tail verification currently only occurs when torn writes are
> > detected at the head of the log. This was introduced because a
> > change in the head block due to torn writes can lead to a change in
> > the tail block (each log record header references the current tail)
> > and the tail block should be verified before log recovery proceeds.
> > 
> > Tail corruption is possible outside of torn write scenarios,
> > however. For example, partial log writes can be detected and cleared
> > during the initial head/tail block discovery process. If the partial
> > write coincides with a tail overwrite, the log tail is corrupted and
> > recovery fails.
> > 
> > To facilitate correct handling of log tail overwites, update log
> > recovery to always perform tail verification. This is necessary to
> > detect potential tail overwrite conditions when torn writes may not
> > have occurred. This changes normal (i.e., no torn writes) recovery
> > behavior slightly to detect and return CRC related errors near the
> > tail before actual recovery starts.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> 
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> > ---
> >  fs/xfs/xfs_log_recover.c | 24 +-----------------------
> >  1 file changed, 1 insertion(+), 23 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index 9efcd12..269d5f9 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -1183,31 +1183,9 @@ xlog_verify_head(
> >  			ASSERT(0);
> >  			return 0;
> >  		}
> > -
> > -		/*
> > -		 * Now verify the tail based on the updated head. This is
> > -		 * required because the torn writes trimmed from the head could
> > -		 * have been written over the tail of a previous record. Return
> > -		 * any errors since recovery cannot proceed if the tail is
> > -		 * corrupt.
> > -		 *
> > -		 * XXX: This leaves a gap in truly robust protection from torn
> > -		 * writes in the log. If the head is behind the tail, the tail
> > -		 * pushes forward to create some space and then a crash occurs
> > -		 * causing the writes into the previous record's tail region to
> > -		 * tear, log recovery isn't able to recover.
> > -		 *
> > -		 * How likely is this to occur? If possible, can we do something
> > -		 * more intelligent here? Is it safe to push the tail forward if
> > -		 * we can determine that the tail is within the range of the
> > -		 * torn write (e.g., the kernel can only overwrite the tail if
> > -		 * it has actually been pushed forward)? Alternatively, could we
> > -		 * somehow prevent this condition at runtime?
> > -		 */
> > -		error = xlog_verify_tail(log, *head_blk, *tail_blk);
> >  	}
> >  
> > -	return error;
> > +	return xlog_verify_tail(log, *head_blk, *tail_blk);
> 
> What if (error != 0 && error != -EFSBADCRC) here?  If the CRC checking log
> recovery pass failed due to some other reason (EIO, ENOMEM, etc.) then is
> there really a point to verifying the tail, vs. bubbling the error up and
> (I presume) failing the mount?
> 

Probably not. Presumably, the full recovery attempt would hit the same
error. I suppose that's not necessarily guaranteed with transient
errors, though, so I think it's probably better and more deterministic
to stop here. I wouldn't want us to get into situations where a torn
write had occurred, but the torn write checking happened to ENOMEM while
the full recovery didn't and the end result appears as a log corruption
rather than something that should have been fixed up.

The -EFSBADCRC hunk above resets error = 0 when it attempts to fix up
the head. Therefore, we can probably add a 'if (error) return error;'
check before the tail verification to ensure any unrelated problems
at the head continue to fail the mount.

Brian

> --D
> 
> >  }
> >  
> >  /*
> > -- 
> > 2.7.5
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 3/4] xfs: fix log recovery corruption error due to tail overwrite
  2017-06-27 14:40 [PATCH 0/4] xfs: log recovery wrap and tail overwrite fixes Brian Foster
  2017-06-27 14:40 ` [PATCH 1/4] xfs: fix recovery failure when log record header wraps log end Brian Foster
  2017-06-27 14:40 ` [PATCH 2/4] xfs: always verify the log tail during recovery Brian Foster
@ 2017-06-27 14:40 ` Brian Foster
  2017-07-01  5:06   ` Darrick J. Wong
  2017-06-27 14:40 ` [PATCH 4/4] xfs: add log item pinning error injection tag Brian Foster
  2017-06-27 14:50 ` [PATCH] tests/xfs: test for log recovery failure after tail overwrite Brian Foster
  4 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2017-06-27 14:40 UTC (permalink / raw)
  To: linux-xfs

If we consider the case where the tail (T) of the log is pinned long
enough for the head (H) to push and block behind the tail, we can
end up blocked in the following state without enough free space (f)
in the log to satisfy a transaction reservation:

	0	phys. log	N
	[-------HffT---H'--T'---]

The last good record in the log (before H) refers to T. The tail
eventually pushes forward (T') leaving more free space in the log
for writes to H. At this point, suppose space frees up in the log
for the maximum of 8 in-core log buffers to start flushing out to
the log. If this pushes the head from H to H', these next writes
overwrite the previous tail T. This is safe because the items logged
from T to T' have been written back and removed from the AIL.

If the next log writes (H -> H') happen to fail and result in
partial records in the log, the filesystem shuts down having
overwritten T with invalid data. Log recovery correctly locates H on
the subsequent mount, but H still refers to the now corrupted tail
T. This results in log corruption errors and recovery failure.

Since the tail overwrite results from otherwise correct runtime
behavior, it is up to log recovery to try and deal with this
situation. Update log recovery tail verification to run a CRC pass
from the first record past the tail to the head. This facilitates
error detection at T and moves the recovery tail to the first good
record past H' (similar to truncating the head on torn write
detection). If corruption is detected beyond the range possibly
affected by the max number of iclogs, the log is legitimately
corrupted and log recovery failure is expected.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log_recover.c | 110 +++++++++++++++++++++++++++++++++--------------
 1 file changed, 78 insertions(+), 32 deletions(-)

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 269d5f9..4113252 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1029,61 +1029,106 @@ xlog_seek_logrec_hdr(
 }
 
 /*
- * Check the log tail for torn writes. This is required when torn writes are
- * detected at the head and the head had to be walked back to a previous record.
- * The tail of the previous record must now be verified to ensure the torn
- * writes didn't corrupt the previous tail.
+ * Calculate distance from head to tail (i.e., unused space in the log).
+ */
+static inline int
+xlog_tail_distance(
+	struct xlog	*log,
+	xfs_daddr_t	head_blk,
+	xfs_daddr_t	tail_blk)
+{
+	if (head_blk < tail_blk)
+		return tail_blk - head_blk;
+
+	return tail_blk + (log->l_logBBsize - head_blk);
+}
+
+/*
+ * Verify the log tail. This is particularly important when torn or incomplete
+ * writes have been detected near the front of the log and the head has been
+ * walked back accordingly.
  *
- * Return an error if CRC verification fails as recovery cannot proceed.
+ * We also have to handle the case where the tail was pinned and the head
+ * blocked behind the tail right before a crash. If the tail had been pushed
+ * immediately prior to the crash and the subsequent checkpoint was only
+ * partially written, it's possible it overwrote the last referenced tail in the
+ * log with garbage. This is not a coherency problem because the tail must have
+ * been pushed before it can be overwritten, but appears as log corruption to
+ * recovery because we have no way to know the tail was updated if the
+ * subsequent checkpoint didn't write successfully.
+ *
+ * Therefore, CRC check the log from tail to head. If a failure occurs and the
+ * offending record is within max iclog bufs from the head, walk the tail
+ * forward and retry until a valid tail is found or corruption is detected out
+ * of the range of a possible overwrite.
  */
 STATIC int
 xlog_verify_tail(
 	struct xlog		*log,
-	xfs_daddr_t		head_blk,
-	xfs_daddr_t		tail_blk)
+	xfs_daddr_t		*head_blk,
+	xfs_daddr_t		*tail_blk,
+	int			hsize)
 {
 	struct xlog_rec_header	*thead;
 	struct xfs_buf		*bp;
 	xfs_daddr_t		first_bad;
-	int			count;
 	int			error = 0;
 	bool			wrapped;
-	xfs_daddr_t		tmp_head;
+	xfs_daddr_t		tmp_tail;
+	xfs_daddr_t		orig_tail = *tail_blk;
 
 	bp = xlog_get_bp(log, 1);
 	if (!bp)
 		return -ENOMEM;
 
 	/*
-	 * Seek XLOG_MAX_ICLOGS + 1 records past the current tail record to get
-	 * a temporary head block that points after the last possible
-	 * concurrently written record of the tail.
+	 * Make sure the tail points to a record (returns positive count on
+	 * success).
 	 */
-	count = xlog_seek_logrec_hdr(log, head_blk, tail_blk,
-				     XLOG_MAX_ICLOGS + 1, bp, &tmp_head, &thead,
-				     &wrapped);
-	if (count < 0) {
-		error = count;
+	error = xlog_seek_logrec_hdr(log, *head_blk, *tail_blk, 1, bp,
+			&tmp_tail, &thead, &wrapped);
+	if (error < 0)
 		goto out;
-	}
-
-	/*
-	 * If the call above didn't find XLOG_MAX_ICLOGS + 1 records, we ran
-	 * into the actual log head. tmp_head points to the start of the record
-	 * so update it to the actual head block.
-	 */
-	if (count < XLOG_MAX_ICLOGS + 1)
-		tmp_head = head_blk;
+	if (*tail_blk != tmp_tail)
+		*tail_blk = tmp_tail;
 
 	/*
-	 * We now have a tail and temporary head block that covers at least
-	 * XLOG_MAX_ICLOGS records from the tail. We need to verify that these
-	 * records were completely written. Run a CRC verification pass from
-	 * tail to head and return the result.
+	 * Run a CRC check from the tail to the head. We can't just check
+	 * MAX_ICLOGS records past the tail because the tail may point to stale
+	 * blocks cleared during the search for the head/tail. These blocks are
+	 * overwritten with zero-length records and thus record count is not a
+	 * reliable indicator of the iclog state before a crash.
 	 */
-	error = xlog_do_recovery_pass(log, tmp_head, tail_blk,
+	first_bad = 0;
+	error = xlog_do_recovery_pass(log, *head_blk, *tail_blk,
 				      XLOG_RECOVER_CRCPASS, &first_bad);
+	while (error && first_bad) {
+		int	tail_distance;
+
+		/*
+		 * Is corruption within range of the head? If so, retry from
+		 * the next record. Otherwise return an error.
+		 */
+		tail_distance = xlog_tail_distance(log, *head_blk, first_bad);
+		if (tail_distance > BTOBB(XLOG_MAX_ICLOGS * hsize))
+			break;
+
+		/* skip to the next record; returns positive count on success */
+		error = xlog_seek_logrec_hdr(log, *head_blk, first_bad, 2, bp,
+				&tmp_tail, &thead, &wrapped);
+		if (error < 0)
+			goto out;
 
+		*tail_blk = tmp_tail;
+		first_bad = 0;
+		error = xlog_do_recovery_pass(log, *head_blk, *tail_blk,
+					      XLOG_RECOVER_CRCPASS, &first_bad);
+	}
+
+	if (!error && *tail_blk != orig_tail)
+		xfs_warn(log->l_mp,
+		"Tail block (0x%llx) overwrite detected. Updated to 0x%llx",
+			 orig_tail, *tail_blk);
 out:
 	xlog_put_bp(bp);
 	return error;
@@ -1185,7 +1230,8 @@ xlog_verify_head(
 		}
 	}
 
-	return xlog_verify_tail(log, *head_blk, *tail_blk);
+	return xlog_verify_tail(log, head_blk, tail_blk,
+				be32_to_cpu((*rhead)->h_size));
 }
 
 /*
-- 
2.7.5


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/4] xfs: fix log recovery corruption error due to tail overwrite
  2017-06-27 14:40 ` [PATCH 3/4] xfs: fix log recovery corruption error due to tail overwrite Brian Foster
@ 2017-07-01  5:06   ` Darrick J. Wong
  2017-07-03 12:13     ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2017-07-01  5:06 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Jun 27, 2017 at 10:40:35AM -0400, Brian Foster wrote:
> If we consider the case where the tail (T) of the log is pinned long
> enough for the head (H) to push and block behind the tail, we can
> end up blocked in the following state without enough free space (f)
> in the log to satisfy a transaction reservation:
> 
> 	0	phys. log	N
> 	[-------HffT---H'--T'---]
> 
> The last good record in the log (before H) refers to T. The tail
> eventually pushes forward (T') leaving more free space in the log
> for writes to H. At this point, suppose space frees up in the log
> for the maximum of 8 in-core log buffers to start flushing out to
> the log. If this pushes the head from H to H', these next writes
> overwrite the previous tail T. This is safe because the items logged
> from T to T' have been written back and removed from the AIL.
> 
> If the next log writes (H -> H') happen to fail and result in
> partial records in the log, the filesystem shuts down having
> overwritten T with invalid data. Log recovery correctly locates H on
> the subsequent mount, but H still refers to the now corrupted tail
> T. This results in log corruption errors and recovery failure.
> 
> Since the tail overwrite results from otherwise correct runtime
> behavior, it is up to log recovery to try and deal with this
> situation. Update log recovery tail verification to run a CRC pass
> from the first record past the tail to the head. This facilitates
> error detection at T and moves the recovery tail to the first good
> record past H' (similar to truncating the head on torn write
> detection). If corruption is detected beyond the range possibly
> affected by the max number of iclogs, the log is legitimately
> corrupted and log recovery failure is expected.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_log_recover.c | 110 +++++++++++++++++++++++++++++++++--------------
>  1 file changed, 78 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 269d5f9..4113252 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -1029,61 +1029,106 @@ xlog_seek_logrec_hdr(
>  }
>  
>  /*
> - * Check the log tail for torn writes. This is required when torn writes are
> - * detected at the head and the head had to be walked back to a previous record.
> - * The tail of the previous record must now be verified to ensure the torn
> - * writes didn't corrupt the previous tail.
> + * Calculate distance from head to tail (i.e., unused space in the log).
> + */
> +static inline int
> +xlog_tail_distance(
> +	struct xlog	*log,
> +	xfs_daddr_t	head_blk,
> +	xfs_daddr_t	tail_blk)
> +{
> +	if (head_blk < tail_blk)
> +		return tail_blk - head_blk;
> +
> +	return tail_blk + (log->l_logBBsize - head_blk);
> +}
> +
> +/*
> + * Verify the log tail. This is particularly important when torn or incomplete
> + * writes have been detected near the front of the log and the head has been
> + * walked back accordingly.
>   *
> - * Return an error if CRC verification fails as recovery cannot proceed.
> + * We also have to handle the case where the tail was pinned and the head
> + * blocked behind the tail right before a crash. If the tail had been pushed
> + * immediately prior to the crash and the subsequent checkpoint was only
> + * partially written, it's possible it overwrote the last referenced tail in the
> + * log with garbage. This is not a coherency problem because the tail must have
> + * been pushed before it can be overwritten, but appears as log corruption to
> + * recovery because we have no way to know the tail was updated if the
> + * subsequent checkpoint didn't write successfully.
> + *
> + * Therefore, CRC check the log from tail to head. If a failure occurs and the
> + * offending record is within max iclog bufs from the head, walk the tail
> + * forward and retry until a valid tail is found or corruption is detected out
> + * of the range of a possible overwrite.
>   */
>  STATIC int
>  xlog_verify_tail(
>  	struct xlog		*log,
> -	xfs_daddr_t		head_blk,
> -	xfs_daddr_t		tail_blk)
> +	xfs_daddr_t		*head_blk,
> +	xfs_daddr_t		*tail_blk,
> +	int			hsize)
>  {
>  	struct xlog_rec_header	*thead;
>  	struct xfs_buf		*bp;
>  	xfs_daddr_t		first_bad;
> -	int			count;
>  	int			error = 0;
>  	bool			wrapped;
> -	xfs_daddr_t		tmp_head;
> +	xfs_daddr_t		tmp_tail;
> +	xfs_daddr_t		orig_tail = *tail_blk;
>  
>  	bp = xlog_get_bp(log, 1);
>  	if (!bp)
>  		return -ENOMEM;
>  
>  	/*
> -	 * Seek XLOG_MAX_ICLOGS + 1 records past the current tail record to get
> -	 * a temporary head block that points after the last possible
> -	 * concurrently written record of the tail.
> +	 * Make sure the tail points to a record (returns positive count on
> +	 * success).
>  	 */
> -	count = xlog_seek_logrec_hdr(log, head_blk, tail_blk,
> -				     XLOG_MAX_ICLOGS + 1, bp, &tmp_head, &thead,
> -				     &wrapped);
> -	if (count < 0) {
> -		error = count;
> +	error = xlog_seek_logrec_hdr(log, *head_blk, *tail_blk, 1, bp,
> +			&tmp_tail, &thead, &wrapped);
> +	if (error < 0)
>  		goto out;
> -	}
> -
> -	/*
> -	 * If the call above didn't find XLOG_MAX_ICLOGS + 1 records, we ran
> -	 * into the actual log head. tmp_head points to the start of the record
> -	 * so update it to the actual head block.
> -	 */
> -	if (count < XLOG_MAX_ICLOGS + 1)
> -		tmp_head = head_blk;
> +	if (*tail_blk != tmp_tail)
> +		*tail_blk = tmp_tail;
>  
>  	/*
> -	 * We now have a tail and temporary head block that covers at least
> -	 * XLOG_MAX_ICLOGS records from the tail. We need to verify that these
> -	 * records were completely written. Run a CRC verification pass from
> -	 * tail to head and return the result.
> +	 * Run a CRC check from the tail to the head. We can't just check
> +	 * MAX_ICLOGS records past the tail because the tail may point to stale
> +	 * blocks cleared during the search for the head/tail. These blocks are
> +	 * overwritten with zero-length records and thus record count is not a
> +	 * reliable indicator of the iclog state before a crash.
>  	 */
> -	error = xlog_do_recovery_pass(log, tmp_head, tail_blk,
> +	first_bad = 0;
> +	error = xlog_do_recovery_pass(log, *head_blk, *tail_blk,
>  				      XLOG_RECOVER_CRCPASS, &first_bad);
> +	while (error && first_bad) {
> +		int	tail_distance;
> +
> +		/*
> +		 * Is corruption within range of the head? If so, retry from
> +		 * the next record. Otherwise return an error.
> +		 */
> +		tail_distance = xlog_tail_distance(log, *head_blk, first_bad);
> +		if (tail_distance > BTOBB(XLOG_MAX_ICLOGS * hsize))
> +			break;

(Thinking aloud...)

So if I understand this correctly, we start by checking that there's a
log record immediately after where we think is the tail T.  If there's
no record then the log is junk and we just give up, but if we find a
record then we try to check CRCs from head H to tail T.

	0	phys. log	N
	[-------HffT---H'--T'---]

If the CRC pass fails (again I'm wondering if it's appropriate to keep
trying things even if error == EIO or ENOMEM or something) then we'll
try to bump the tail ahead towards T', so long as we don't stray farther
than the head + log buffer size.

Bumping the tail forward involves checking for valid records and redoing
the CRC pass with the new tail.  If we run out of records, we bail out;
if the new CRC pass succeeds, we declare victory.  If not, then we'll
possibly try another tail bump.

I think I understand how this is suppsed to work now.  I'm concerned
that we try bumping the tail ahead even if the CRC pass runs out of
memory or hits a read error, though.  I'm also wondering why we pass
in a pointer to *head_blk, though we don't seem to update it?

--D

> +
> +		/* skip to the next record; returns positive count on success */
> +		error = xlog_seek_logrec_hdr(log, *head_blk, first_bad, 2, bp,
> +				&tmp_tail, &thead, &wrapped);
> +		if (error < 0)
> +			goto out;
>  
> +		*tail_blk = tmp_tail;
> +		first_bad = 0;
> +		error = xlog_do_recovery_pass(log, *head_blk, *tail_blk,
> +					      XLOG_RECOVER_CRCPASS, &first_bad);
> +	}
> +
> +	if (!error && *tail_blk != orig_tail)
> +		xfs_warn(log->l_mp,
> +		"Tail block (0x%llx) overwrite detected. Updated to 0x%llx",
> +			 orig_tail, *tail_blk);
>  out:
>  	xlog_put_bp(bp);
>  	return error;
> @@ -1185,7 +1230,8 @@ xlog_verify_head(
>  		}
>  	}
>  
> -	return xlog_verify_tail(log, *head_blk, *tail_blk);
> +	return xlog_verify_tail(log, head_blk, tail_blk,
> +				be32_to_cpu((*rhead)->h_size));
>  }
>  
>  /*
> -- 
> 2.7.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/4] xfs: fix log recovery corruption error due to tail overwrite
  2017-07-01  5:06   ` Darrick J. Wong
@ 2017-07-03 12:13     ` Brian Foster
  2017-07-03 16:27       ` Brian Foster
  2017-07-03 16:39       ` Darrick J. Wong
  0 siblings, 2 replies; 18+ messages in thread
From: Brian Foster @ 2017-07-03 12:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, Jun 30, 2017 at 10:06:05PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 27, 2017 at 10:40:35AM -0400, Brian Foster wrote:
> > If we consider the case where the tail (T) of the log is pinned long
> > enough for the head (H) to push and block behind the tail, we can
> > end up blocked in the following state without enough free space (f)
> > in the log to satisfy a transaction reservation:
> > 
> > 	0	phys. log	N
> > 	[-------HffT---H'--T'---]
> > 
> > The last good record in the log (before H) refers to T. The tail
> > eventually pushes forward (T') leaving more free space in the log
> > for writes to H. At this point, suppose space frees up in the log
> > for the maximum of 8 in-core log buffers to start flushing out to
> > the log. If this pushes the head from H to H', these next writes
> > overwrite the previous tail T. This is safe because the items logged
> > from T to T' have been written back and removed from the AIL.
> > 
> > If the next log writes (H -> H') happen to fail and result in
> > partial records in the log, the filesystem shuts down having
> > overwritten T with invalid data. Log recovery correctly locates H on
> > the subsequent mount, but H still refers to the now corrupted tail
> > T. This results in log corruption errors and recovery failure.
> > 
> > Since the tail overwrite results from otherwise correct runtime
> > behavior, it is up to log recovery to try and deal with this
> > situation. Update log recovery tail verification to run a CRC pass
> > from the first record past the tail to the head. This facilitates
> > error detection at T and moves the recovery tail to the first good
> > record past H' (similar to truncating the head on torn write
> > detection). If corruption is detected beyond the range possibly
> > affected by the max number of iclogs, the log is legitimately
> > corrupted and log recovery failure is expected.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_log_recover.c | 110 +++++++++++++++++++++++++++++++++--------------
> >  1 file changed, 78 insertions(+), 32 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index 269d5f9..4113252 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -1029,61 +1029,106 @@ xlog_seek_logrec_hdr(
> >  }
> >  
> >  /*
> > - * Check the log tail for torn writes. This is required when torn writes are
> > - * detected at the head and the head had to be walked back to a previous record.
> > - * The tail of the previous record must now be verified to ensure the torn
> > - * writes didn't corrupt the previous tail.
> > + * Calculate distance from head to tail (i.e., unused space in the log).
> > + */
> > +static inline int
> > +xlog_tail_distance(
> > +	struct xlog	*log,
> > +	xfs_daddr_t	head_blk,
> > +	xfs_daddr_t	tail_blk)
> > +{
> > +	if (head_blk < tail_blk)
> > +		return tail_blk - head_blk;
> > +
> > +	return tail_blk + (log->l_logBBsize - head_blk);
> > +}
> > +
> > +/*
> > + * Verify the log tail. This is particularly important when torn or incomplete
> > + * writes have been detected near the front of the log and the head has been
> > + * walked back accordingly.
> >   *
> > - * Return an error if CRC verification fails as recovery cannot proceed.
> > + * We also have to handle the case where the tail was pinned and the head
> > + * blocked behind the tail right before a crash. If the tail had been pushed
> > + * immediately prior to the crash and the subsequent checkpoint was only
> > + * partially written, it's possible it overwrote the last referenced tail in the
> > + * log with garbage. This is not a coherency problem because the tail must have
> > + * been pushed before it can be overwritten, but appears as log corruption to
> > + * recovery because we have no way to know the tail was updated if the
> > + * subsequent checkpoint didn't write successfully.
> > + *
> > + * Therefore, CRC check the log from tail to head. If a failure occurs and the
> > + * offending record is within max iclog bufs from the head, walk the tail
> > + * forward and retry until a valid tail is found or corruption is detected out
> > + * of the range of a possible overwrite.
> >   */
> >  STATIC int
> >  xlog_verify_tail(
> >  	struct xlog		*log,
> > -	xfs_daddr_t		head_blk,
> > -	xfs_daddr_t		tail_blk)
> > +	xfs_daddr_t		*head_blk,
> > +	xfs_daddr_t		*tail_blk,
> > +	int			hsize)
> >  {
> >  	struct xlog_rec_header	*thead;
> >  	struct xfs_buf		*bp;
> >  	xfs_daddr_t		first_bad;
> > -	int			count;
> >  	int			error = 0;
> >  	bool			wrapped;
> > -	xfs_daddr_t		tmp_head;
> > +	xfs_daddr_t		tmp_tail;
> > +	xfs_daddr_t		orig_tail = *tail_blk;
> >  
> >  	bp = xlog_get_bp(log, 1);
> >  	if (!bp)
> >  		return -ENOMEM;
> >  
> >  	/*
> > -	 * Seek XLOG_MAX_ICLOGS + 1 records past the current tail record to get
> > -	 * a temporary head block that points after the last possible
> > -	 * concurrently written record of the tail.
> > +	 * Make sure the tail points to a record (returns positive count on
> > +	 * success).
> >  	 */
> > -	count = xlog_seek_logrec_hdr(log, head_blk, tail_blk,
> > -				     XLOG_MAX_ICLOGS + 1, bp, &tmp_head, &thead,
> > -				     &wrapped);
> > -	if (count < 0) {
> > -		error = count;
> > +	error = xlog_seek_logrec_hdr(log, *head_blk, *tail_blk, 1, bp,
> > +			&tmp_tail, &thead, &wrapped);
> > +	if (error < 0)
> >  		goto out;
> > -	}
> > -
> > -	/*
> > -	 * If the call above didn't find XLOG_MAX_ICLOGS + 1 records, we ran
> > -	 * into the actual log head. tmp_head points to the start of the record
> > -	 * so update it to the actual head block.
> > -	 */
> > -	if (count < XLOG_MAX_ICLOGS + 1)
> > -		tmp_head = head_blk;
> > +	if (*tail_blk != tmp_tail)
> > +		*tail_blk = tmp_tail;
> >  
> >  	/*
> > -	 * We now have a tail and temporary head block that covers at least
> > -	 * XLOG_MAX_ICLOGS records from the tail. We need to verify that these
> > -	 * records were completely written. Run a CRC verification pass from
> > -	 * tail to head and return the result.
> > +	 * Run a CRC check from the tail to the head. We can't just check
> > +	 * MAX_ICLOGS records past the tail because the tail may point to stale
> > +	 * blocks cleared during the search for the head/tail. These blocks are
> > +	 * overwritten with zero-length records and thus record count is not a
> > +	 * reliable indicator of the iclog state before a crash.
> >  	 */
> > -	error = xlog_do_recovery_pass(log, tmp_head, tail_blk,
> > +	first_bad = 0;
> > +	error = xlog_do_recovery_pass(log, *head_blk, *tail_blk,
> >  				      XLOG_RECOVER_CRCPASS, &first_bad);
> > +	while (error && first_bad) {
> > +		int	tail_distance;
> > +
> > +		/*
> > +		 * Is corruption within range of the head? If so, retry from
> > +		 * the next record. Otherwise return an error.
> > +		 */
> > +		tail_distance = xlog_tail_distance(log, *head_blk, first_bad);
> > +		if (tail_distance > BTOBB(XLOG_MAX_ICLOGS * hsize))
> > +			break;
> 
> (Thinking aloud...)
> 
> So if I understand this correctly, we start by checking that there's a
> log record immediately after where we think is the tail T.  If there's
> no record then the log is junk and we just give up, but if we find a
> record then we try to check CRCs from head H to tail T.
> 

Right.. this part basically adds a full on tail->head CRC pass whereas
actual log recovery would check CRCs as records are processed (i.e., the
slight change in behavior referenced in the cover letter).

> 	0	phys. log	N
> 	[-------HffT---H'--T'---]
> 
> If the CRC pass fails (again I'm wondering if it's appropriate to keep
> trying things even if error == EIO or ENOMEM or something) then we'll
> try to bump the tail ahead towards T', so long as we don't stray farther
> than the head + log buffer size.
> 

The idea is that we can attribute corruption to a partial tail overwrite
so long as the failure occurs within the "danger zone" of the head, as
defined by the max iclog value and log buf size. If the tail happened to
be located farther ahead outside of that range, this is not a
possibility and the corruption is legitimate. The tradeoff is that this
is a policy and so the corruption at the tail could potentially be
"legitimate" regardless, but it's the same tradeoff we made previously
with torn write detection.

And yes, this should only trigger on error == -EFSBADCRC. Actually, I'm
starting to think that both here and the head verification should handle
-EFSCORRUPTED as well. I think the more recent report incident (that
motivated this patch) was actually producing logs that may have been
more likely to fail with the latter than the former (if the log record
header is not valid, for example). I will take another look at that.

> Bumping the tail forward involves checking for valid records and redoing
> the CRC pass with the new tail.  If we run out of records, we bail out;
> if the new CRC pass succeeds, we declare victory.  If not, then we'll
> possibly try another tail bump.
> 
> I think I understand how this is suppsed to work now.  I'm concerned
> that we try bumping the tail ahead even if the CRC pass runs out of
> memory or hits a read error, though.  I'm also wondering why we pass
> in a pointer to *head_blk, though we don't seem to update it?
> 

I think I just updated both parameters to pass a pointer because it
looked more consistent and/or otherwise hadn't worked out the tail
verification algorithm at that point. I can change it back to head_blk
if that is preferred.

Thanks for the review!

Brian

> --D
> 
> > +
> > +		/* skip to the next record; returns positive count on success */
> > +		error = xlog_seek_logrec_hdr(log, *head_blk, first_bad, 2, bp,
> > +				&tmp_tail, &thead, &wrapped);
> > +		if (error < 0)
> > +			goto out;
> >  
> > +		*tail_blk = tmp_tail;
> > +		first_bad = 0;
> > +		error = xlog_do_recovery_pass(log, *head_blk, *tail_blk,
> > +					      XLOG_RECOVER_CRCPASS, &first_bad);
> > +	}
> > +
> > +	if (!error && *tail_blk != orig_tail)
> > +		xfs_warn(log->l_mp,
> > +		"Tail block (0x%llx) overwrite detected. Updated to 0x%llx",
> > +			 orig_tail, *tail_blk);
> >  out:
> >  	xlog_put_bp(bp);
> >  	return error;
> > @@ -1185,7 +1230,8 @@ xlog_verify_head(
> >  		}
> >  	}
> >  
> > -	return xlog_verify_tail(log, *head_blk, *tail_blk);
> > +	return xlog_verify_tail(log, head_blk, tail_blk,
> > +				be32_to_cpu((*rhead)->h_size));
> >  }
> >  
> >  /*
> > -- 
> > 2.7.5
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/4] xfs: fix log recovery corruption error due to tail overwrite
  2017-07-03 12:13     ` Brian Foster
@ 2017-07-03 16:27       ` Brian Foster
  2017-07-03 16:39       ` Darrick J. Wong
  1 sibling, 0 replies; 18+ messages in thread
From: Brian Foster @ 2017-07-03 16:27 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Jul 03, 2017 at 08:13:33AM -0400, Brian Foster wrote:
> On Fri, Jun 30, 2017 at 10:06:05PM -0700, Darrick J. Wong wrote:
> > On Tue, Jun 27, 2017 at 10:40:35AM -0400, Brian Foster wrote:
> > > If we consider the case where the tail (T) of the log is pinned long
> > > enough for the head (H) to push and block behind the tail, we can
> > > end up blocked in the following state without enough free space (f)
> > > in the log to satisfy a transaction reservation:
> > > 
> > > 	0	phys. log	N
> > > 	[-------HffT---H'--T'---]
> > > 
> > > The last good record in the log (before H) refers to T. The tail
> > > eventually pushes forward (T') leaving more free space in the log
> > > for writes to H. At this point, suppose space frees up in the log
> > > for the maximum of 8 in-core log buffers to start flushing out to
> > > the log. If this pushes the head from H to H', these next writes
> > > overwrite the previous tail T. This is safe because the items logged
> > > from T to T' have been written back and removed from the AIL.
> > > 
> > > If the next log writes (H -> H') happen to fail and result in
> > > partial records in the log, the filesystem shuts down having
> > > overwritten T with invalid data. Log recovery correctly locates H on
> > > the subsequent mount, but H still refers to the now corrupted tail
> > > T. This results in log corruption errors and recovery failure.
> > > 
> > > Since the tail overwrite results from otherwise correct runtime
> > > behavior, it is up to log recovery to try and deal with this
> > > situation. Update log recovery tail verification to run a CRC pass
> > > from the first record past the tail to the head. This facilitates
> > > error detection at T and moves the recovery tail to the first good
> > > record past H' (similar to truncating the head on torn write
> > > detection). If corruption is detected beyond the range possibly
> > > affected by the max number of iclogs, the log is legitimately
> > > corrupted and log recovery failure is expected.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log_recover.c | 110 +++++++++++++++++++++++++++++++++--------------
> > >  1 file changed, 78 insertions(+), 32 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > > index 269d5f9..4113252 100644
> > > --- a/fs/xfs/xfs_log_recover.c
> > > +++ b/fs/xfs/xfs_log_recover.c
> > > @@ -1029,61 +1029,106 @@ xlog_seek_logrec_hdr(
> > >  }
> > >  
> > >  /*
> > > - * Check the log tail for torn writes. This is required when torn writes are
> > > - * detected at the head and the head had to be walked back to a previous record.
> > > - * The tail of the previous record must now be verified to ensure the torn
> > > - * writes didn't corrupt the previous tail.
> > > + * Calculate distance from head to tail (i.e., unused space in the log).
> > > + */
> > > +static inline int
> > > +xlog_tail_distance(
> > > +	struct xlog	*log,
> > > +	xfs_daddr_t	head_blk,
> > > +	xfs_daddr_t	tail_blk)
> > > +{
> > > +	if (head_blk < tail_blk)
> > > +		return tail_blk - head_blk;
> > > +
> > > +	return tail_blk + (log->l_logBBsize - head_blk);
> > > +}
> > > +
> > > +/*
> > > + * Verify the log tail. This is particularly important when torn or incomplete
> > > + * writes have been detected near the front of the log and the head has been
> > > + * walked back accordingly.
> > >   *
> > > - * Return an error if CRC verification fails as recovery cannot proceed.
> > > + * We also have to handle the case where the tail was pinned and the head
> > > + * blocked behind the tail right before a crash. If the tail had been pushed
> > > + * immediately prior to the crash and the subsequent checkpoint was only
> > > + * partially written, it's possible it overwrote the last referenced tail in the
> > > + * log with garbage. This is not a coherency problem because the tail must have
> > > + * been pushed before it can be overwritten, but appears as log corruption to
> > > + * recovery because we have no way to know the tail was updated if the
> > > + * subsequent checkpoint didn't write successfully.
> > > + *
> > > + * Therefore, CRC check the log from tail to head. If a failure occurs and the
> > > + * offending record is within max iclog bufs from the head, walk the tail
> > > + * forward and retry until a valid tail is found or corruption is detected out
> > > + * of the range of a possible overwrite.
> > >   */
> > >  STATIC int
> > >  xlog_verify_tail(
> > >  	struct xlog		*log,
> > > -	xfs_daddr_t		head_blk,
> > > -	xfs_daddr_t		tail_blk)
> > > +	xfs_daddr_t		*head_blk,
> > > +	xfs_daddr_t		*tail_blk,
> > > +	int			hsize)
> > >  {
> > >  	struct xlog_rec_header	*thead;
> > >  	struct xfs_buf		*bp;
> > >  	xfs_daddr_t		first_bad;
> > > -	int			count;
> > >  	int			error = 0;
> > >  	bool			wrapped;
> > > -	xfs_daddr_t		tmp_head;
> > > +	xfs_daddr_t		tmp_tail;
> > > +	xfs_daddr_t		orig_tail = *tail_blk;
> > >  
> > >  	bp = xlog_get_bp(log, 1);
> > >  	if (!bp)
> > >  		return -ENOMEM;
> > >  
> > >  	/*
> > > -	 * Seek XLOG_MAX_ICLOGS + 1 records past the current tail record to get
> > > -	 * a temporary head block that points after the last possible
> > > -	 * concurrently written record of the tail.
> > > +	 * Make sure the tail points to a record (returns positive count on
> > > +	 * success).
> > >  	 */
> > > -	count = xlog_seek_logrec_hdr(log, head_blk, tail_blk,
> > > -				     XLOG_MAX_ICLOGS + 1, bp, &tmp_head, &thead,
> > > -				     &wrapped);
> > > -	if (count < 0) {
> > > -		error = count;
> > > +	error = xlog_seek_logrec_hdr(log, *head_blk, *tail_blk, 1, bp,
> > > +			&tmp_tail, &thead, &wrapped);
> > > +	if (error < 0)
> > >  		goto out;
> > > -	}
> > > -
> > > -	/*
> > > -	 * If the call above didn't find XLOG_MAX_ICLOGS + 1 records, we ran
> > > -	 * into the actual log head. tmp_head points to the start of the record
> > > -	 * so update it to the actual head block.
> > > -	 */
> > > -	if (count < XLOG_MAX_ICLOGS + 1)
> > > -		tmp_head = head_blk;
> > > +	if (*tail_blk != tmp_tail)
> > > +		*tail_blk = tmp_tail;
> > >  
> > >  	/*
> > > -	 * We now have a tail and temporary head block that covers at least
> > > -	 * XLOG_MAX_ICLOGS records from the tail. We need to verify that these
> > > -	 * records were completely written. Run a CRC verification pass from
> > > -	 * tail to head and return the result.
> > > +	 * Run a CRC check from the tail to the head. We can't just check
> > > +	 * MAX_ICLOGS records past the tail because the tail may point to stale
> > > +	 * blocks cleared during the search for the head/tail. These blocks are
> > > +	 * overwritten with zero-length records and thus record count is not a
> > > +	 * reliable indicator of the iclog state before a crash.
> > >  	 */
> > > -	error = xlog_do_recovery_pass(log, tmp_head, tail_blk,
> > > +	first_bad = 0;
> > > +	error = xlog_do_recovery_pass(log, *head_blk, *tail_blk,
> > >  				      XLOG_RECOVER_CRCPASS, &first_bad);
> > > +	while (error && first_bad) {
> > > +		int	tail_distance;
> > > +
> > > +		/*
> > > +		 * Is corruption within range of the head? If so, retry from
> > > +		 * the next record. Otherwise return an error.
> > > +		 */
> > > +		tail_distance = xlog_tail_distance(log, *head_blk, first_bad);
> > > +		if (tail_distance > BTOBB(XLOG_MAX_ICLOGS * hsize))
> > > +			break;
> > 
> > (Thinking aloud...)
> > 
> > So if I understand this correctly, we start by checking that there's a
> > log record immediately after where we think is the tail T.  If there's
> > no record then the log is junk and we just give up, but if we find a
> > record then we try to check CRCs from head H to tail T.
> > 
> 
> Right.. this part basically adds a full on tail->head CRC pass whereas
> actual log recovery would check CRCs as records are processed (i.e., the
> slight change in behavior referenced in the cover letter).
> 
> > 	0	phys. log	N
> > 	[-------HffT---H'--T'---]
> > 
> > If the CRC pass fails (again I'm wondering if it's appropriate to keep
> > trying things even if error == EIO or ENOMEM or something) then we'll
> > try to bump the tail ahead towards T', so long as we don't stray farther
> > than the head + log buffer size.
> > 
> 
> The idea is that we can attribute corruption to a partial tail overwrite
> so long as the failure occurs within the "danger zone" of the head, as
> defined by the max iclog value and log buf size. If the tail happened to
> be located farther ahead outside of that range, this is not a
> possibility and the corruption is legitimate. The tradeoff is that this
> is a policy and so the corruption at the tail could potentially be
> "legitimate" regardless, but it's the same tradeoff we made previously
> with torn write detection.
> 
> And yes, this should only trigger on error == -EFSBADCRC. Actually, I'm
> starting to think that both here and the head verification should handle
> -EFSCORRUPTED as well. I think the more recent report incident (that
> motivated this patch) was actually producing logs that may have been
> more likely to fail with the latter than the former (if the log record
> header is not valid, for example). I will take another look at that.
> 

A few notes on the above...

First, I have to correct the statement in the cover letter that this
allows the referenced metadump images to mount. The mount still fails
due to an inode cluster corruption of some sort encountered during
recovery. I could have sworn I got those images to mount before. I
suppose it's possible I either had some nasty hacks to force the issue
or otherwise just lost track after constructing the tail overwrite
reproducer script and misremembered. Either way, I see that they don't
mount even with some earlier variants of this code and fail the same way
if I use for-next and hardcode the head/tail past the corrupted ranges
(to check whether I perhaps broke something else).

With regard to -EFSCORRUPTED, it appears to be rather difficult to
construct a log that fails due to EFSCORRUPTED at this point of recovery
because the head/tail search depends on somewhat sane log record
headers. IOW, severe enough corruption (buffer mem. corruption, for
e.g.) as such could fail log recovery much earlier before we can
actually identify a head/tail combination to work with.

That said, forcing an invalid h_len to disk does appear to allow
head/tail discovery to succeed while teasing out EFSCORRUPTED errors
during head/tail verification (and at least one minor/dependent bug in
the first_bad assignment in xlog_do_recovery_pass()). FWIW, I'm not
totally convinced one way or another whether this is a valuable change.
I may include it as a separate patch for reference, at least.

Brian

> > Bumping the tail forward involves checking for valid records and redoing
> > the CRC pass with the new tail.  If we run out of records, we bail out;
> > if the new CRC pass succeeds, we declare victory.  If not, then we'll
> > possibly try another tail bump.
> > 
> > I think I understand how this is suppsed to work now.  I'm concerned
> > that we try bumping the tail ahead even if the CRC pass runs out of
> > memory or hits a read error, though.  I'm also wondering why we pass
> > in a pointer to *head_blk, though we don't seem to update it?
> > 
> 
> I think I just updated both parameters to pass a pointer because it
> looked more consistent and/or otherwise hadn't worked out the tail
> verification algorithm at that point. I can change it back to head_blk
> if that is preferred.
> 
> Thanks for the review!
> 
> Brian
> 
> > --D
> > 
> > > +
> > > +		/* skip to the next record; returns positive count on success */
> > > +		error = xlog_seek_logrec_hdr(log, *head_blk, first_bad, 2, bp,
> > > +				&tmp_tail, &thead, &wrapped);
> > > +		if (error < 0)
> > > +			goto out;
> > >  
> > > +		*tail_blk = tmp_tail;
> > > +		first_bad = 0;
> > > +		error = xlog_do_recovery_pass(log, *head_blk, *tail_blk,
> > > +					      XLOG_RECOVER_CRCPASS, &first_bad);
> > > +	}
> > > +
> > > +	if (!error && *tail_blk != orig_tail)
> > > +		xfs_warn(log->l_mp,
> > > +		"Tail block (0x%llx) overwrite detected. Updated to 0x%llx",
> > > +			 orig_tail, *tail_blk);
> > >  out:
> > >  	xlog_put_bp(bp);
> > >  	return error;
> > > @@ -1185,7 +1230,8 @@ xlog_verify_head(
> > >  		}
> > >  	}
> > >  
> > > -	return xlog_verify_tail(log, *head_blk, *tail_blk);
> > > +	return xlog_verify_tail(log, head_blk, tail_blk,
> > > +				be32_to_cpu((*rhead)->h_size));
> > >  }
> > >  
> > >  /*
> > > -- 
> > > 2.7.5
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/4] xfs: fix log recovery corruption error due to tail overwrite
  2017-07-03 12:13     ` Brian Foster
  2017-07-03 16:27       ` Brian Foster
@ 2017-07-03 16:39       ` Darrick J. Wong
  1 sibling, 0 replies; 18+ messages in thread
From: Darrick J. Wong @ 2017-07-03 16:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Jul 03, 2017 at 08:13:33AM -0400, Brian Foster wrote:
> On Fri, Jun 30, 2017 at 10:06:05PM -0700, Darrick J. Wong wrote:
> > On Tue, Jun 27, 2017 at 10:40:35AM -0400, Brian Foster wrote:
> > > If we consider the case where the tail (T) of the log is pinned long
> > > enough for the head (H) to push and block behind the tail, we can
> > > end up blocked in the following state without enough free space (f)
> > > in the log to satisfy a transaction reservation:
> > > 
> > > 	0	phys. log	N
> > > 	[-------HffT---H'--T'---]
> > > 
> > > The last good record in the log (before H) refers to T. The tail
> > > eventually pushes forward (T') leaving more free space in the log
> > > for writes to H. At this point, suppose space frees up in the log
> > > for the maximum of 8 in-core log buffers to start flushing out to
> > > the log. If this pushes the head from H to H', these next writes
> > > overwrite the previous tail T. This is safe because the items logged
> > > from T to T' have been written back and removed from the AIL.
> > > 
> > > If the next log writes (H -> H') happen to fail and result in
> > > partial records in the log, the filesystem shuts down having
> > > overwritten T with invalid data. Log recovery correctly locates H on
> > > the subsequent mount, but H still refers to the now corrupted tail
> > > T. This results in log corruption errors and recovery failure.
> > > 
> > > Since the tail overwrite results from otherwise correct runtime
> > > behavior, it is up to log recovery to try and deal with this
> > > situation. Update log recovery tail verification to run a CRC pass
> > > from the first record past the tail to the head. This facilitates
> > > error detection at T and moves the recovery tail to the first good
> > > record past H' (similar to truncating the head on torn write
> > > detection). If corruption is detected beyond the range possibly
> > > affected by the max number of iclogs, the log is legitimately
> > > corrupted and log recovery failure is expected.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log_recover.c | 110 +++++++++++++++++++++++++++++++++--------------
> > >  1 file changed, 78 insertions(+), 32 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > > index 269d5f9..4113252 100644
> > > --- a/fs/xfs/xfs_log_recover.c
> > > +++ b/fs/xfs/xfs_log_recover.c
> > > @@ -1029,61 +1029,106 @@ xlog_seek_logrec_hdr(
> > >  }
> > >  
> > >  /*
> > > - * Check the log tail for torn writes. This is required when torn writes are
> > > - * detected at the head and the head had to be walked back to a previous record.
> > > - * The tail of the previous record must now be verified to ensure the torn
> > > - * writes didn't corrupt the previous tail.
> > > + * Calculate distance from head to tail (i.e., unused space in the log).
> > > + */
> > > +static inline int
> > > +xlog_tail_distance(
> > > +	struct xlog	*log,
> > > +	xfs_daddr_t	head_blk,
> > > +	xfs_daddr_t	tail_blk)
> > > +{
> > > +	if (head_blk < tail_blk)
> > > +		return tail_blk - head_blk;
> > > +
> > > +	return tail_blk + (log->l_logBBsize - head_blk);
> > > +}
> > > +
> > > +/*
> > > + * Verify the log tail. This is particularly important when torn or incomplete
> > > + * writes have been detected near the front of the log and the head has been
> > > + * walked back accordingly.
> > >   *
> > > - * Return an error if CRC verification fails as recovery cannot proceed.
> > > + * We also have to handle the case where the tail was pinned and the head
> > > + * blocked behind the tail right before a crash. If the tail had been pushed
> > > + * immediately prior to the crash and the subsequent checkpoint was only
> > > + * partially written, it's possible it overwrote the last referenced tail in the
> > > + * log with garbage. This is not a coherency problem because the tail must have
> > > + * been pushed before it can be overwritten, but appears as log corruption to
> > > + * recovery because we have no way to know the tail was updated if the
> > > + * subsequent checkpoint didn't write successfully.
> > > + *
> > > + * Therefore, CRC check the log from tail to head. If a failure occurs and the
> > > + * offending record is within max iclog bufs from the head, walk the tail
> > > + * forward and retry until a valid tail is found or corruption is detected out
> > > + * of the range of a possible overwrite.
> > >   */
> > >  STATIC int
> > >  xlog_verify_tail(
> > >  	struct xlog		*log,
> > > -	xfs_daddr_t		head_blk,
> > > -	xfs_daddr_t		tail_blk)
> > > +	xfs_daddr_t		*head_blk,
> > > +	xfs_daddr_t		*tail_blk,
> > > +	int			hsize)
> > >  {
> > >  	struct xlog_rec_header	*thead;
> > >  	struct xfs_buf		*bp;
> > >  	xfs_daddr_t		first_bad;
> > > -	int			count;
> > >  	int			error = 0;
> > >  	bool			wrapped;
> > > -	xfs_daddr_t		tmp_head;
> > > +	xfs_daddr_t		tmp_tail;
> > > +	xfs_daddr_t		orig_tail = *tail_blk;
> > >  
> > >  	bp = xlog_get_bp(log, 1);
> > >  	if (!bp)
> > >  		return -ENOMEM;
> > >  
> > >  	/*
> > > -	 * Seek XLOG_MAX_ICLOGS + 1 records past the current tail record to get
> > > -	 * a temporary head block that points after the last possible
> > > -	 * concurrently written record of the tail.
> > > +	 * Make sure the tail points to a record (returns positive count on
> > > +	 * success).
> > >  	 */
> > > -	count = xlog_seek_logrec_hdr(log, head_blk, tail_blk,
> > > -				     XLOG_MAX_ICLOGS + 1, bp, &tmp_head, &thead,
> > > -				     &wrapped);
> > > -	if (count < 0) {
> > > -		error = count;
> > > +	error = xlog_seek_logrec_hdr(log, *head_blk, *tail_blk, 1, bp,
> > > +			&tmp_tail, &thead, &wrapped);
> > > +	if (error < 0)
> > >  		goto out;
> > > -	}
> > > -
> > > -	/*
> > > -	 * If the call above didn't find XLOG_MAX_ICLOGS + 1 records, we ran
> > > -	 * into the actual log head. tmp_head points to the start of the record
> > > -	 * so update it to the actual head block.
> > > -	 */
> > > -	if (count < XLOG_MAX_ICLOGS + 1)
> > > -		tmp_head = head_blk;
> > > +	if (*tail_blk != tmp_tail)
> > > +		*tail_blk = tmp_tail;
> > >  
> > >  	/*
> > > -	 * We now have a tail and temporary head block that covers at least
> > > -	 * XLOG_MAX_ICLOGS records from the tail. We need to verify that these
> > > -	 * records were completely written. Run a CRC verification pass from
> > > -	 * tail to head and return the result.
> > > +	 * Run a CRC check from the tail to the head. We can't just check
> > > +	 * MAX_ICLOGS records past the tail because the tail may point to stale
> > > +	 * blocks cleared during the search for the head/tail. These blocks are
> > > +	 * overwritten with zero-length records and thus record count is not a
> > > +	 * reliable indicator of the iclog state before a crash.
> > >  	 */
> > > -	error = xlog_do_recovery_pass(log, tmp_head, tail_blk,
> > > +	first_bad = 0;
> > > +	error = xlog_do_recovery_pass(log, *head_blk, *tail_blk,
> > >  				      XLOG_RECOVER_CRCPASS, &first_bad);
> > > +	while (error && first_bad) {
> > > +		int	tail_distance;
> > > +
> > > +		/*
> > > +		 * Is corruption within range of the head? If so, retry from
> > > +		 * the next record. Otherwise return an error.
> > > +		 */
> > > +		tail_distance = xlog_tail_distance(log, *head_blk, first_bad);
> > > +		if (tail_distance > BTOBB(XLOG_MAX_ICLOGS * hsize))
> > > +			break;
> > 
> > (Thinking aloud...)
> > 
> > So if I understand this correctly, we start by checking that there's a
> > log record immediately after where we think is the tail T.  If there's
> > no record then the log is junk and we just give up, but if we find a
> > record then we try to check CRCs from head H to tail T.
> > 
> 
> Right.. this part basically adds a full on tail->head CRC pass whereas
> actual log recovery would check CRCs as records are processed (i.e., the
> slight change in behavior referenced in the cover letter).
> 
> > 	0	phys. log	N
> > 	[-------HffT---H'--T'---]
> > 
> > If the CRC pass fails (again I'm wondering if it's appropriate to keep
> > trying things even if error == EIO or ENOMEM or something) then we'll
> > try to bump the tail ahead towards T', so long as we don't stray farther
> > than the head + log buffer size.
> > 
> 
> The idea is that we can attribute corruption to a partial tail overwrite
> so long as the failure occurs within the "danger zone" of the head, as
> defined by the max iclog value and log buf size. If the tail happened to
> be located farther ahead outside of that range, this is not a
> possibility and the corruption is legitimate. The tradeoff is that this
> is a policy and so the corruption at the tail could potentially be
> "legitimate" regardless, but it's the same tradeoff we made previously
> with torn write detection.
> 
> And yes, this should only trigger on error == -EFSBADCRC. Actually, I'm
> starting to think that both here and the head verification should handle
> -EFSCORRUPTED as well. I think the more recent report incident (that
> motivated this patch) was actually producing logs that may have been
> more likely to fail with the latter than the former (if the log record
> header is not valid, for example). I will take another look at that.

Generally I expect most "should we reject this as broken?" tests to
check for -EFSBADCRC and -EFSCORRUPTED unless noted otherwise.  There's
a slim but nonzero chance the crc will just happen to match an otherwise
garbage log record. :)

> > Bumping the tail forward involves checking for valid records and redoing
> > the CRC pass with the new tail.  If we run out of records, we bail out;
> > if the new CRC pass succeeds, we declare victory.  If not, then we'll
> > possibly try another tail bump.
> > 
> > I think I understand how this is suppsed to work now.  I'm concerned
> > that we try bumping the tail ahead even if the CRC pass runs out of
> > memory or hits a read error, though.  I'm also wondering why we pass
> > in a pointer to *head_blk, though we don't seem to update it?
> > 
> 
> I think I just updated both parameters to pass a pointer because it
> looked more consistent and/or otherwise hadn't worked out the tail
> verification algorithm at that point. I can change it back to head_blk
> if that is preferred.

I don't have a strong preference either way, but when I see pointers to
scalars being passed into a function I go looking for where the update
happens.

--D

> 
> Thanks for the review!
> 
> Brian
> 
> > --D
> > 
> > > +
> > > +		/* skip to the next record; returns positive count on success */
> > > +		error = xlog_seek_logrec_hdr(log, *head_blk, first_bad, 2, bp,
> > > +				&tmp_tail, &thead, &wrapped);
> > > +		if (error < 0)
> > > +			goto out;
> > >  
> > > +		*tail_blk = tmp_tail;
> > > +		first_bad = 0;
> > > +		error = xlog_do_recovery_pass(log, *head_blk, *tail_blk,
> > > +					      XLOG_RECOVER_CRCPASS, &first_bad);
> > > +	}
> > > +
> > > +	if (!error && *tail_blk != orig_tail)
> > > +		xfs_warn(log->l_mp,
> > > +		"Tail block (0x%llx) overwrite detected. Updated to 0x%llx",
> > > +			 orig_tail, *tail_blk);
> > >  out:
> > >  	xlog_put_bp(bp);
> > >  	return error;
> > > @@ -1185,7 +1230,8 @@ xlog_verify_head(
> > >  		}
> > >  	}
> > >  
> > > -	return xlog_verify_tail(log, *head_blk, *tail_blk);
> > > +	return xlog_verify_tail(log, head_blk, tail_blk,
> > > +				be32_to_cpu((*rhead)->h_size));
> > >  }
> > >  
> > >  /*
> > > -- 
> > > 2.7.5
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 4/4] xfs: add log item pinning error injection tag
  2017-06-27 14:40 [PATCH 0/4] xfs: log recovery wrap and tail overwrite fixes Brian Foster
                   ` (2 preceding siblings ...)
  2017-06-27 14:40 ` [PATCH 3/4] xfs: fix log recovery corruption error due to tail overwrite Brian Foster
@ 2017-06-27 14:40 ` Brian Foster
  2017-07-01  3:03   ` Darrick J. Wong
  2017-06-27 14:50 ` [PATCH] tests/xfs: test for log recovery failure after tail overwrite Brian Foster
  4 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2017-06-27 14:40 UTC (permalink / raw)
  To: linux-xfs

Add an error injection tag to force log items in the AIL to the
pinned state. This option can be used by test infrastructure to
induce head behind tail conditions. Specifically, this is intended
to be used by xfstests to reproduce log recovery problems after
failed/corrupted log writes overwrite the last good tail LSN in the
log.

When enabled, AIL push attempts see log items in the AIL in the
pinned state. This stalls metadata writeback and thus prevents the
current tail of the log from moving forward. When disabled,
subsequent AIL pushes observe the log items in their appropriate
state and filesystem operation continues as normal.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_error.c     |  3 +++
 fs/xfs/xfs_error.h     |  4 +++-
 fs/xfs/xfs_trans_ail.c | 17 ++++++++++++++++-
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index 26c32bc..f2f22a8 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -57,6 +57,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_AG_RESV_CRITICAL,
 	XFS_RANDOM_DROP_WRITES,
 	XFS_RANDOM_LOG_BAD_CRC,
+	XFS_RANDOM_LOG_ITEM_PIN,
 };
 
 struct xfs_errortag_attr {
@@ -161,6 +162,7 @@ XFS_ERRORTAG_ATTR_RW(bmap_finish_one,	XFS_ERRTAG_BMAP_FINISH_ONE);
 XFS_ERRORTAG_ATTR_RW(ag_resv_critical,	XFS_ERRTAG_AG_RESV_CRITICAL);
 XFS_ERRORTAG_ATTR_RW(drop_writes,	XFS_ERRTAG_DROP_WRITES);
 XFS_ERRORTAG_ATTR_RW(log_bad_crc,	XFS_ERRTAG_LOG_BAD_CRC);
+XFS_ERRORTAG_ATTR_RW(log_item_pin,	XFS_ERRTAG_LOG_ITEM_PIN);
 
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -193,6 +195,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(ag_resv_critical),
 	XFS_ERRORTAG_ATTR_LIST(drop_writes),
 	XFS_ERRORTAG_ATTR_LIST(log_bad_crc),
+	XFS_ERRORTAG_ATTR_LIST(log_item_pin),
 	NULL,
 };
 
diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
index 0874d6b..43d349c 100644
--- a/fs/xfs/xfs_error.h
+++ b/fs/xfs/xfs_error.h
@@ -107,8 +107,9 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
  */
 #define XFS_ERRTAG_DROP_WRITES				28
 #define XFS_ERRTAG_LOG_BAD_CRC				29
+#define XFS_ERRTAG_LOG_ITEM_PIN				30
 
-#define XFS_ERRTAG_MAX					30
+#define XFS_ERRTAG_MAX					31
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -143,6 +144,7 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define XFS_RANDOM_AG_RESV_CRITICAL			4
 #define XFS_RANDOM_DROP_WRITES				1
 #define XFS_RANDOM_LOG_BAD_CRC				1
+#define XFS_RANDOM_LOG_ITEM_PIN				1
 
 #ifdef DEBUG
 extern int xfs_errortag_init(struct xfs_mount *mp);
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 9056c0f..563b315 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -325,6 +325,21 @@ xfs_ail_delete(
 	xfs_trans_ail_cursor_clear(ailp, lip);
 }
 
+static inline uint
+xfsaild_push_item(
+	struct xfs_ail		*ailp,
+	struct xfs_log_item	*lip)
+{
+	/*
+	 * If log item pinning is enabled, skip the push and track the item as
+	 * pinned. This can help induce head-behind-tail conditions.
+	 */
+	if (XFS_TEST_ERROR(false, ailp->xa_mount, XFS_ERRTAG_LOG_ITEM_PIN))
+		return XFS_ITEM_PINNED;
+
+	return lip->li_ops->iop_push(lip, &ailp->xa_buf_list);
+}
+
 static long
 xfsaild_push(
 	struct xfs_ail		*ailp)
@@ -382,7 +397,7 @@ xfsaild_push(
 		 * rely on the AIL cursor implementation to be able to deal with
 		 * the dropped lock.
 		 */
-		lock_result = lip->li_ops->iop_push(lip, &ailp->xa_buf_list);
+		lock_result = xfsaild_push_item(ailp, lip);
 		switch (lock_result) {
 		case XFS_ITEM_SUCCESS:
 			XFS_STATS_INC(mp, xs_push_ail_success);
-- 
2.7.5


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] xfs: add log item pinning error injection tag
  2017-06-27 14:40 ` [PATCH 4/4] xfs: add log item pinning error injection tag Brian Foster
@ 2017-07-01  3:03   ` Darrick J. Wong
  0 siblings, 0 replies; 18+ messages in thread
From: Darrick J. Wong @ 2017-07-01  3:03 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Jun 27, 2017 at 10:40:36AM -0400, Brian Foster wrote:
> Add an error injection tag to force log items in the AIL to the
> pinned state. This option can be used by test infrastructure to
> induce head behind tail conditions. Specifically, this is intended
> to be used by xfstests to reproduce log recovery problems after
> failed/corrupted log writes overwrite the last good tail LSN in the
> log.
> 
> When enabled, AIL push attempts see log items in the AIL in the
> pinned state. This stalls metadata writeback and thus prevents the
> current tail of the log from moving forward. When disabled,
> subsequent AIL pushes observe the log items in their appropriate
> state and filesystem operation continues as normal.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

> ---
>  fs/xfs/xfs_error.c     |  3 +++
>  fs/xfs/xfs_error.h     |  4 +++-
>  fs/xfs/xfs_trans_ail.c | 17 ++++++++++++++++-
>  3 files changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
> index 26c32bc..f2f22a8 100644
> --- a/fs/xfs/xfs_error.c
> +++ b/fs/xfs/xfs_error.c
> @@ -57,6 +57,7 @@ static unsigned int xfs_errortag_random_default[] = {
>  	XFS_RANDOM_AG_RESV_CRITICAL,
>  	XFS_RANDOM_DROP_WRITES,
>  	XFS_RANDOM_LOG_BAD_CRC,
> +	XFS_RANDOM_LOG_ITEM_PIN,
>  };
>  
>  struct xfs_errortag_attr {
> @@ -161,6 +162,7 @@ XFS_ERRORTAG_ATTR_RW(bmap_finish_one,	XFS_ERRTAG_BMAP_FINISH_ONE);
>  XFS_ERRORTAG_ATTR_RW(ag_resv_critical,	XFS_ERRTAG_AG_RESV_CRITICAL);
>  XFS_ERRORTAG_ATTR_RW(drop_writes,	XFS_ERRTAG_DROP_WRITES);
>  XFS_ERRORTAG_ATTR_RW(log_bad_crc,	XFS_ERRTAG_LOG_BAD_CRC);
> +XFS_ERRORTAG_ATTR_RW(log_item_pin,	XFS_ERRTAG_LOG_ITEM_PIN);
>  
>  static struct attribute *xfs_errortag_attrs[] = {
>  	XFS_ERRORTAG_ATTR_LIST(noerror),
> @@ -193,6 +195,7 @@ static struct attribute *xfs_errortag_attrs[] = {
>  	XFS_ERRORTAG_ATTR_LIST(ag_resv_critical),
>  	XFS_ERRORTAG_ATTR_LIST(drop_writes),
>  	XFS_ERRORTAG_ATTR_LIST(log_bad_crc),
> +	XFS_ERRORTAG_ATTR_LIST(log_item_pin),
>  	NULL,
>  };
>  
> diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
> index 0874d6b..43d349c 100644
> --- a/fs/xfs/xfs_error.h
> +++ b/fs/xfs/xfs_error.h
> @@ -107,8 +107,9 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
>   */
>  #define XFS_ERRTAG_DROP_WRITES				28
>  #define XFS_ERRTAG_LOG_BAD_CRC				29
> +#define XFS_ERRTAG_LOG_ITEM_PIN				30
>  
> -#define XFS_ERRTAG_MAX					30
> +#define XFS_ERRTAG_MAX					31
>  
>  /*
>   * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
> @@ -143,6 +144,7 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
>  #define XFS_RANDOM_AG_RESV_CRITICAL			4
>  #define XFS_RANDOM_DROP_WRITES				1
>  #define XFS_RANDOM_LOG_BAD_CRC				1
> +#define XFS_RANDOM_LOG_ITEM_PIN				1
>  
>  #ifdef DEBUG
>  extern int xfs_errortag_init(struct xfs_mount *mp);
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 9056c0f..563b315 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -325,6 +325,21 @@ xfs_ail_delete(
>  	xfs_trans_ail_cursor_clear(ailp, lip);
>  }
>  
> +static inline uint
> +xfsaild_push_item(
> +	struct xfs_ail		*ailp,
> +	struct xfs_log_item	*lip)
> +{
> +	/*
> +	 * If log item pinning is enabled, skip the push and track the item as
> +	 * pinned. This can help induce head-behind-tail conditions.
> +	 */
> +	if (XFS_TEST_ERROR(false, ailp->xa_mount, XFS_ERRTAG_LOG_ITEM_PIN))
> +		return XFS_ITEM_PINNED;
> +
> +	return lip->li_ops->iop_push(lip, &ailp->xa_buf_list);
> +}
> +
>  static long
>  xfsaild_push(
>  	struct xfs_ail		*ailp)
> @@ -382,7 +397,7 @@ xfsaild_push(
>  		 * rely on the AIL cursor implementation to be able to deal with
>  		 * the dropped lock.
>  		 */
> -		lock_result = lip->li_ops->iop_push(lip, &ailp->xa_buf_list);
> +		lock_result = xfsaild_push_item(ailp, lip);
>  		switch (lock_result) {
>  		case XFS_ITEM_SUCCESS:
>  			XFS_STATS_INC(mp, xs_push_ail_success);
> -- 
> 2.7.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] tests/xfs: test for log recovery failure after tail overwrite
  2017-06-27 14:40 [PATCH 0/4] xfs: log recovery wrap and tail overwrite fixes Brian Foster
                   ` (3 preceding siblings ...)
  2017-06-27 14:40 ` [PATCH 4/4] xfs: add log item pinning error injection tag Brian Foster
@ 2017-06-27 14:50 ` Brian Foster
  4 siblings, 0 replies; 18+ messages in thread
From: Brian Foster @ 2017-06-27 14:50 UTC (permalink / raw)
  To: linux-xfs

XFS is susceptible to log recovery problems if the fs crashes under
certain circumstances. If the tail has been pinned for long enough
to the log to fill and the next batch of log buffer submissions
happen to fail, the filesystem shutsdown having potentially
overwritten part of the range between the last good tail->head range
in the log. This causes log recovery to fail with crc mismatch or
invalid log record errors.

This problem is not yet fixed and thus known/expected to fail. At
this time, this test serves as a reminder that the problem exists
and a reproducer for future verification purposes. Note that this
problem is currently only reproducible with larger (non-default) log
buffer sizes (i.e., '-o logbsize=256k') or smaller block sizes (1k).

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Here's the latest variant of the test that can reproduce this problem.
This version depends on the recent error injection sysfs knobs and the
existence of the associated knobs for log item pinning and bad log
record crc injection.

I've left fstests off the cc list for now since much of that code is
still outstanding. This is currently posted for reference.

Brian

 tests/xfs/999     | 113 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/999.out |   2 +
 tests/xfs/group   |   1 +
 3 files changed, 116 insertions(+)
 create mode 100755 tests/xfs/999
 create mode 100644 tests/xfs/999.out

diff --git a/tests/xfs/999 b/tests/xfs/999
new file mode 100755
index 0000000..e321d73
--- /dev/null
+++ b/tests/xfs/999
@@ -0,0 +1,113 @@
+#! /bin/bash
+# FS QA Test No. 999
+#
+# Attempt to reproduce log recovery failure by writing corrupt log records over
+# the last good tail in the log. The tail is force pinned while a workload runs
+# the head as close as possible behind the tail. Once the head is pinned,
+# corrupted log records are written to the log and the filesystem shuts down.
+#
+# While log recovery should handle the corrupted log records, it has historical
+# problems dealing with the situation where the corrupted log records may have
+# overwritten the tail of the previous good record in the log. If this occurs,
+# log recovery may fail.
+#
+# This can be reproduced more reliably under non-default conditions such as with
+# the smallest supported FSB sizes and/or largest supported log buffer sizes and
+# counts (logbufs and logbsize mount options).
+#
+# Note that this test requires a DEBUG mode kernel.
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2017 Red Hat, Inc. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+	$KILLALL_PROG -9 fsstress > /dev/null 2>&1
+	[ -e /sys/fs/xfs/$sdev/errortag/log_item_pin ] &&
+		echo 0 > /sys/fs/xfs/$sdev/errortag/log_item_pin
+	wait > /dev/null 2>&1
+}
+
+rm -f $seqres.full
+
+# get standard environment, filters and checks
+. ./common/rc
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs xfs
+_supported_os Linux
+_require_xfs_sysfs $(_short_dev $TEST_DEV)/errortag/log_item_pin
+_require_xfs_sysfs $(_short_dev $TEST_DEV)/errortag/log_bad_crc
+_require_scratch
+_require_command "$KILLALL_PROG" killall
+
+echo "Silence is golden."
+
+sdev=$(_short_dev $SCRATCH_DEV)
+
+_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
+_scratch_mount || _fail "mount failed"
+
+# populate the fs with some data and cycle the mount to reset the log head/tail
+$FSSTRESS_PROG -d $SCRATCH_MNT -z -fcreat=1 -p 4 -n 100000 > /dev/null 2>&1
+_scratch_cycle_mount || _fail "mount failed"
+
+# Pin the tail and start a file removal workload. File removal tends to
+# reproduce the corruption more reliably.
+echo 1 > /sys/fs/xfs/$sdev/errortag/log_item_pin
+
+rm -rf $SCRATCH_MNT/* > /dev/null 2>&1 &
+workpid=$!
+
+# wait for the head to stop pushing forward
+prevhead=-1
+head=`cat /sys/fs/xfs/$sdev/log/log_head_lsn`
+while [ "$head" != "$prevhead" ]; do
+	sleep 5
+	prevhead=$head
+	head=`cat /sys/fs/xfs/$sdev/log/log_head_lsn`
+done
+
+# Once the head is pinned behind the tail, enable log record corruption and
+# unpin the tail. All subsequent log buffer writes end up corrupted on-disk and
+# result in log I/O errors.
+echo 1 > /sys/fs/xfs/$sdev/errortag/log_bad_crc
+echo 0 > /sys/fs/xfs/$sdev/errortag/log_item_pin
+
+# wait for fs shutdown to kill the workload
+wait $workpid
+
+# cycle mount to test log recovery
+_scratch_cycle_mount
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/999.out b/tests/xfs/999.out
new file mode 100644
index 0000000..d254382
--- /dev/null
+++ b/tests/xfs/999.out
@@ -0,0 +1,2 @@
+QA output created by 999
+Silence is golden.
diff --git a/tests/xfs/group b/tests/xfs/group
index 2a3b950..a10fd6d 100644
--- a/tests/xfs/group
+++ b/tests/xfs/group
@@ -419,3 +419,4 @@
 419 auto quick
 420 auto quick clone dedupe
 421 auto quick clone dedupe
+999 auto log
-- 
2.7.5


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH] xfs: debug mode sysfs flag to force [un]pin the log tail
@ 2017-06-16 16:44 Brian Foster
  2017-06-16 16:46 ` [PATCH] tests/xfs: test for log recovery failure after tail overwrite Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2017-06-16 16:44 UTC (permalink / raw)
  To: linux-xfs

Create a debug mode only sysfs option to force pin the tail of the
log. This option can be used by test infrastructure to induce head
behind tail conditions. Specifically, this is intended to be used by
xfstests to reproduce log recovery problems after failed/corrupted
log writes overwrite the last good tail LSN in the log.

When enabled, AIL push attempts see every log item on the AIL in the
pinned state. This stalls metadata writeback and thus prevents the
current tail of the log from moving forward. When disabled,
subsequent AIL pushes observe the log items in their appropriate
state and filesystem operation continues as normal.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Hi all,

This patch is a supporting patch for an xfstests test I'm about to post
that pins the tail of the log in order to reproduce the log recovery
problem that appears to be the root of the problem in this[1] thread.
That is the primary motivation for the patch and so should probably be
reviewed with that context. IOW, if there's a better way to reproduce
the problem in the test without the need for kernel support, I'm happy
to drop this. Thoughts, reviews, flames appreciated.

Brian

[1] http://www.spinics.net/lists/linux-xfs/msg07499.html

 fs/xfs/xfs_log_priv.h  |  2 ++
 fs/xfs/xfs_sysfs.c     | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trans_ail.c | 20 +++++++++++++++++++-
 3 files changed, 67 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index c2604a5..bfbfde12 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -413,6 +413,8 @@ struct xlog {
 	void			*l_iclog_bak[XLOG_MAX_ICLOGS];
 	/* log record crc error injection factor */
 	uint32_t		l_badcrc_factor;
+	/* force pin the log tail */
+	bool			l_pin_tail;
 #endif
 	/* log recovery lsn tracking (for buffer submission */
 	xfs_lsn_t		l_recovery_lsn;
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index ec6e0e2..b86148a 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -378,6 +378,51 @@ log_badcrc_factor_show(
 }
 
 XFS_SYSFS_ATTR_RW(log_badcrc_factor);
+
+/*
+ * DEBUG mode flag to force pin the tail of the log. Used from test
+ * infrastructure to manufacture head-behind-tail conditions. DO NOT USE
+ * DIRECTLY. This will lock up the fs!
+ *
+ * When this option is enabled, all log items present in the AIL are emulated as
+ * being in the pinned state until the option is disabled. Once disabled, log
+ * items return to their natural state and fs operation continues as normal.
+ */
+STATIC ssize_t
+log_pin_tail_store(
+	struct kobject		*kobject,
+	const char		*buf,
+	size_t			count)
+{
+	struct xlog		*log = to_xlog(kobject);
+	int			ret;
+	int			val;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	if (val == 1)
+		log->l_pin_tail = true;
+	else if (val == 0)
+		log->l_pin_tail = false;
+	else
+		return -EINVAL;
+
+	return count;
+}
+
+STATIC ssize_t
+log_pin_tail_show(
+	struct kobject		*kobject,
+	char			*buf)
+{
+	struct xlog		*log = to_xlog(kobject);
+
+	return snprintf(buf, PAGE_SIZE, "%d\n", log->l_pin_tail ? 1 : 0);
+}
+XFS_SYSFS_ATTR_RW(log_pin_tail);
+
 #endif	/* DEBUG */
 
 static struct attribute *xfs_log_attrs[] = {
@@ -387,6 +432,7 @@ static struct attribute *xfs_log_attrs[] = {
 	ATTR_LIST(write_grant_head),
 #ifdef DEBUG
 	ATTR_LIST(log_badcrc_factor),
+	ATTR_LIST(log_pin_tail),
 #endif
 	NULL,
 };
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 9056c0f..c901e61 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -27,6 +27,7 @@
 #include "xfs_trace.h"
 #include "xfs_error.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 
 #ifdef DEBUG
 /*
@@ -325,6 +326,23 @@ xfs_ail_delete(
 	xfs_trans_ail_cursor_clear(ailp, lip);
 }
 
+static inline uint
+xfsaild_push_item(
+	struct xfs_ail		*ailp,
+	struct xfs_log_item	*lip)
+{
+#ifdef DEBUG
+	/*
+	 * If tail pinning is enabled, skip the push and track all items as
+	 * pinned to force pin the log tail. This helps induce head-behind-tail
+	 * conditions.
+	 */
+	if (ailp->xa_mount->m_log->l_pin_tail)
+		return XFS_ITEM_PINNED;
+#endif
+	return lip->li_ops->iop_push(lip, &ailp->xa_buf_list);
+}
+
 static long
 xfsaild_push(
 	struct xfs_ail		*ailp)
@@ -382,7 +400,7 @@ xfsaild_push(
 		 * rely on the AIL cursor implementation to be able to deal with
 		 * the dropped lock.
 		 */
-		lock_result = lip->li_ops->iop_push(lip, &ailp->xa_buf_list);
+		lock_result = xfsaild_push_item(ailp, lip);
 		switch (lock_result) {
 		case XFS_ITEM_SUCCESS:
 			XFS_STATS_INC(mp, xs_push_ail_success);
-- 
2.7.5


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH] tests/xfs: test for log recovery failure after tail overwrite
  2017-06-16 16:44 [PATCH] xfs: debug mode sysfs flag to force [un]pin the log tail Brian Foster
@ 2017-06-16 16:46 ` Brian Foster
  2017-06-30  3:44   ` Eryu Guan
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2017-06-16 16:46 UTC (permalink / raw)
  To: linux-xfs; +Cc: fstests

XFS is susceptible to log recovery problems if the fs crashes under
certain circumstances. If the tail has been pinned for long enough
to the log to fill and the next batch of log buffer submissions
happen to fail, the filesystem shutsdown having potentially
overwritten part of the range between the last good tail->head range
in the log. This causes log recovery to fail with crc mismatch or
invalid log record errors.

This problem is not yet fixed and thus known/expected to fail. At
this time, this test serves as a reminder that the problem exists
and a reproducer for future verification purposes. Note that this
problem is currently only reproducible with larger (non-default) log
buffer sizes (i.e., '-o logbsize=256k') or smaller block sizes (1k).

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Hi all,

This patch uses the XFS debug kernel mechanism recently posted for
review[1] to reproduce an XFS log recovery problem. Note that this test
depends on the aforementioned patch and thus should not be merged
until/unless the corresponding kernel patch is merged.

Brian

[1] "xfs: debug mode sysfs flag to force [un]pin the log tail"

 tests/xfs/999     | 113 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/999.out |   2 +
 tests/xfs/group   |   1 +
 3 files changed, 116 insertions(+)
 create mode 100644 tests/xfs/999
 create mode 100644 tests/xfs/999.out

diff --git a/tests/xfs/999 b/tests/xfs/999
new file mode 100644
index 0000000..6913a43
--- /dev/null
+++ b/tests/xfs/999
@@ -0,0 +1,113 @@
+#! /bin/bash
+# FS QA Test No. 999
+#
+# Attempt to reproduce log recovery failure by writing corrupt log records over
+# the last good tail in the log. The tail is force pinned while a workload runs
+# the head as close as possible behind the tail. Once the head is pinned,
+# corrupted log records are written to the log and the filesystem shuts down.
+#
+# While log recovery should handle the corrupted log records, it has historical
+# problems dealing with the situation where the corrupted log records may have
+# overwritten the tail of the previous good record in the log. If this occurs,
+# log recovery may fail.
+#
+# This can be reproduced more reliably under non-default conditions such as with
+# the smallest supported FSB sizes and/or largest supported log buffer sizes and
+# counts (logbufs and logbsize mount options).
+#
+# Note that this test requires a DEBUG mode kernel.
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2017 Red Hat, Inc. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+	$KILLALL_PROG -9 fsstress > /dev/null 2>&1
+	[ -e /sys/fs/xfs/$sdev/log/log_pin_tail ] &&
+		echo 0 > /sys/fs/xfs/$sdev/log/log_pin_tail
+	wait > /dev/null 2>&1
+}
+
+rm -f $seqres.full
+
+# get standard environment, filters and checks
+. ./common/rc
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs xfs
+_supported_os Linux
+_require_xfs_sysfs $(_short_dev $TEST_DEV)/log/log_badcrc_factor
+_require_xfs_sysfs $(_short_dev $TEST_DEV)/log/log_pin_tail
+_require_scratch
+_require_command "$KILLALL_PROG" killall
+
+echo "Silence is golden."
+
+sdev=$(_short_dev $SCRATCH_DEV)
+
+_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
+_scratch_mount || _fail "mount failed"
+
+# populate the fs with some data and cycle the mount to reset the log head/tail
+$FSSTRESS_PROG -d $SCRATCH_MNT -z -fcreat=1 -p 4 -n 100000 > /dev/null 2>&1
+_scratch_cycle_mount || _fail "mount failed"
+
+# Pin the tail and start a file removal workload. File removal tends to
+# reproduce the corruption more reliably.
+echo 1 > /sys/fs/xfs/$sdev/log/log_pin_tail
+
+rm -rf $SCRATCH_MNT/* > /dev/null 2>&1 &
+workpid=$!
+
+# wait for the head to stop pushing forward
+prevhead=-1
+head=`cat /sys/fs/xfs/$sdev/log/log_head_lsn`
+while [ "$head" != "$prevhead" ]; do
+	sleep 5
+	prevhead=$head
+	head=`cat /sys/fs/xfs/$sdev/log/log_head_lsn`
+done
+
+# Once the head is pinned behind the tail, enable log record corruption and
+# unpin the tail. All subsequent log buffer writes end up corrupted on-disk and
+# result in log I/O errors.
+echo 1 > /sys/fs/xfs/$sdev/log/log_badcrc_factor
+echo 0 > /sys/fs/xfs/$sdev/log/log_pin_tail
+
+# wait for fs shutdown to kill the workload
+wait $workpid
+
+# cycle mount to test log recovery
+_scratch_cycle_mount
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/999.out b/tests/xfs/999.out
new file mode 100644
index 0000000..d254382
--- /dev/null
+++ b/tests/xfs/999.out
@@ -0,0 +1,2 @@
+QA output created by 999
+Silence is golden.
diff --git a/tests/xfs/group b/tests/xfs/group
index 792161a..d94f010 100644
--- a/tests/xfs/group
+++ b/tests/xfs/group
@@ -416,3 +416,4 @@
 416 dangerous_fuzzers dangerous_scrub dangerous_repair
 417 dangerous_fuzzers dangerous_scrub dangerous_online_repair
 418 dangerous_fuzzers dangerous_scrub dangerous_repair
+999 auto log
-- 
2.7.5


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] tests/xfs: test for log recovery failure after tail overwrite
  2017-06-16 16:46 ` [PATCH] tests/xfs: test for log recovery failure after tail overwrite Brian Foster
@ 2017-06-30  3:44   ` Eryu Guan
  2017-06-30  4:09     ` Darrick J. Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Eryu Guan @ 2017-06-30  3:44 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, fstests

Hi Brian,

On Fri, Jun 16, 2017 at 12:46:19PM -0400, Brian Foster wrote:
> XFS is susceptible to log recovery problems if the fs crashes under
> certain circumstances. If the tail has been pinned for long enough
> to the log to fill and the next batch of log buffer submissions
> happen to fail, the filesystem shutsdown having potentially
> overwritten part of the range between the last good tail->head range
> in the log. This causes log recovery to fail with crc mismatch or
> invalid log record errors.
> 
> This problem is not yet fixed and thus known/expected to fail. At
> this time, this test serves as a reminder that the problem exists
> and a reproducer for future verification purposes. Note that this
> problem is currently only reproducible with larger (non-default) log
> buffer sizes (i.e., '-o logbsize=256k') or smaller block sizes (1k).
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> 
> Hi all,
> 
> This patch uses the XFS debug kernel mechanism recently posted for
> review[1] to reproduce an XFS log recovery problem. Note that this test
> depends on the aforementioned patch and thus should not be merged
> until/unless the corresponding kernel patch is merged.

Any decision made on the kernel side patch? Sorry, I'm a bit lost on the
kernel side discussions.

Thanks,
Eryu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tests/xfs: test for log recovery failure after tail overwrite
  2017-06-30  3:44   ` Eryu Guan
@ 2017-06-30  4:09     ` Darrick J. Wong
  0 siblings, 0 replies; 18+ messages in thread
From: Darrick J. Wong @ 2017-06-30  4:09 UTC (permalink / raw)
  To: Eryu Guan; +Cc: Brian Foster, linux-xfs, fstests

On Fri, Jun 30, 2017 at 11:44:05AM +0800, Eryu Guan wrote:
> Hi Brian,
> 
> On Fri, Jun 16, 2017 at 12:46:19PM -0400, Brian Foster wrote:
> > XFS is susceptible to log recovery problems if the fs crashes under
> > certain circumstances. If the tail has been pinned for long enough
> > to the log to fill and the next batch of log buffer submissions
> > happen to fail, the filesystem shutsdown having potentially
> > overwritten part of the range between the last good tail->head range
> > in the log. This causes log recovery to fail with crc mismatch or
> > invalid log record errors.
> > 
> > This problem is not yet fixed and thus known/expected to fail. At
> > this time, this test serves as a reminder that the problem exists
> > and a reproducer for future verification purposes. Note that this
> > problem is currently only reproducible with larger (non-default) log
> > buffer sizes (i.e., '-o logbsize=256k') or smaller block sizes (1k).
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > Hi all,
> > 
> > This patch uses the XFS debug kernel mechanism recently posted for
> > review[1] to reproduce an XFS log recovery problem. Note that this test
> > depends on the aforementioned patch and thus should not be merged
> > until/unless the corresponding kernel patch is merged.
> 
> Any decision made on the kernel side patch? Sorry, I'm a bit lost on the
> kernel side discussions.

Uhhhh, I think this test is being redone to work with the new xfs
errortag mechanism that'll be appearing soon.  I will send patches
to fix up xfstests momentarily.

--D

> 
> Thanks,
> Eryu
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-07-03 16:39 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-06-27 14:40 [PATCH 0/4] xfs: log recovery wrap and tail overwrite fixes Brian Foster
2017-06-27 14:40 ` [PATCH 1/4] xfs: fix recovery failure when log record header wraps log end Brian Foster
2017-07-01  4:38   ` Darrick J. Wong
2017-07-03 12:11     ` Brian Foster
2017-06-27 14:40 ` [PATCH 2/4] xfs: always verify the log tail during recovery Brian Foster
2017-07-01  4:43   ` Darrick J. Wong
2017-07-03 12:11     ` Brian Foster
2017-06-27 14:40 ` [PATCH 3/4] xfs: fix log recovery corruption error due to tail overwrite Brian Foster
2017-07-01  5:06   ` Darrick J. Wong
2017-07-03 12:13     ` Brian Foster
2017-07-03 16:27       ` Brian Foster
2017-07-03 16:39       ` Darrick J. Wong
2017-06-27 14:40 ` [PATCH 4/4] xfs: add log item pinning error injection tag Brian Foster
2017-07-01  3:03   ` Darrick J. Wong
2017-06-27 14:50 ` [PATCH] tests/xfs: test for log recovery failure after tail overwrite Brian Foster
  -- strict thread matches above, loose matches on Subject: below --
2017-06-16 16:44 [PATCH] xfs: debug mode sysfs flag to force [un]pin the log tail Brian Foster
2017-06-16 16:46 ` [PATCH] tests/xfs: test for log recovery failure after tail overwrite Brian Foster
2017-06-30  3:44   ` Eryu Guan
2017-06-30  4:09     ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).