public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] xfs: prevent spurious "head behind tail" warnings
@ 2013-11-19 22:37 Dave Chinner
  2013-11-19 23:08 ` Mark Tinguely
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2013-11-19 22:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When xlog_space_left() cracks the grant head and the log tail, it
does so without locking to synchronise the sampling of the
variables. It samples the grant head first, so if there is a delay
before it smaples the log tail, there is a window where the log tail
could have moved onwards and be moved past the sampled value of the
grant head. This then leads to the "xlog_space_left: head behind
tail" warning message.

To avoid spurious output in this situation, swap the order in which
the variables are cracked. This means that the head may grant head
may move if there is a delay, but the log tail will be stable, hence
ensure the tail does not jump the head accidentally.

While this avoids the spurious head behind tail problem, it
introduces the opposite problem - the head can move more than a full
cycle past the tail. The code already handles this case by
indicating that the log is full (i.e. zero space available) but
that's still (generally) a spurious situation.

Hence, if we detect that the head is more than a cycle ahead of the
tail or the head is behind the tail, start the calculation again by
resampling the variables and trying again. If we get too many
resamples, then throw a warning and return a full or empty log
appropriately.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 102 ++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 78 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 8497a00..0cfcc20 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1113,37 +1113,91 @@ xlog_space_left(
 	struct xlog	*log,
 	atomic64_t	*head)
 {
-	int		free_bytes;
 	int		tail_bytes;
 	int		tail_cycle;
 	int		head_cycle;
 	int		head_bytes;
+	int		retries = 0;
+#define XLOG_SPACE_MAX_RETRIES	5
+
+	do {
+		/*
+		 * sample tail before head to avoid spurious "head behind tail"
+		 * warnings due to racing tail updates. We dump a memory barrier
+		 * here to make sure we pick up the latest values that have been
+		 * written by other threads each time through.
+		 */
+		smp_mb();
+		xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle,
+				      &tail_bytes);
+		tail_bytes = BBTOB(tail_bytes);
+		xlog_crack_grant_head(head, &head_cycle, &head_bytes);
+
+		/*
+		 * The tail cycle should always be the same or one less than the
+		 * head cycle. If not, restart again because we've likely had a
+		 * race with an update during sampling.
+		 */
+		if (tail_cycle + 1 < head_cycle || tail_cycle > head_cycle)
+			continue;
+
+		/*
+		 * If the tail is in previous cycle, space available is the
+		 * region between the tail and the head.
+		 *
+		 *	     H		 T
+		 *    +2222222111111111111111+
+		 *            +-- free --+
+		 */
+		if (tail_cycle < head_cycle) {
+			ASSERT(tail_cycle == (head_cycle - 1));
+			return tail_bytes - head_bytes;
+		}
+
+		/*
+		 * If the tail is in the same cycle, space available is the
+		 * regions outside the range between the tail and the head:
+		 *
+		 *	   T	       H
+		 *    +2222222222222222211111+
+		 *          +-- used --+
+		 *    +free+		+free+
+		 *
+		 * If the head and tail are the same, then the whole log is
+		 * free.
+		 */
+		if (tail_cycle == head_cycle && head_bytes >= tail_bytes)
+			return log->l_logsize - (head_bytes - tail_bytes);
 
-	xlog_crack_grant_head(head, &head_cycle, &head_bytes);
-	xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, &tail_bytes);
-	tail_bytes = BBTOB(tail_bytes);
-	if (tail_cycle == head_cycle && head_bytes >= tail_bytes)
-		free_bytes = log->l_logsize - (head_bytes - tail_bytes);
-	else if (tail_cycle + 1 < head_cycle)
-		return 0;
-	else if (tail_cycle < head_cycle) {
-		ASSERT(tail_cycle == (head_cycle - 1));
-		free_bytes = tail_bytes - head_bytes;
-	} else {
 		/*
-		 * The reservation head is behind the tail.
-		 * In this case we just want to return the size of the
-		 * log as the amount of space left.
+		 * The head is behind the tail. Invalid, so let's go around
+		 * again and resample.
 		 */
-		xfs_alert(log->l_mp,
-			"xlog_space_left: head behind tail\n"
-			"  tail_cycle = %d, tail_bytes = %d\n"
-			"  GH   cycle = %d, GH   bytes = %d",
-			tail_cycle, tail_bytes, head_cycle, head_bytes);
-		ASSERT(0);
-		free_bytes = log->l_logsize;
-	}
-	return free_bytes;
+	} while (retries++ < XLOG_SPACE_MAX_RETRIES);
+
+	/*
+	 * OK, we're in trouble now - the head and tail are out of sync. Time to
+	 * issue a warning about it and return a default free space value
+	 * depending on whether the tail is ahead or behind the log.
+	 */
+	ASSERT(retries >= XLOG_SPACE_MAX_RETRIES);
+	xfs_alert(log->l_mp,
+		  "%s: The grant head is %s the log tail.\n"
+		  "  tail_cycle = %d, tail_bytes = %d\n"
+		  "  GH   cycle = %d, GH   bytes = %d",
+		  __func__,
+		  tail_cycle + 1 < head_cycle ? "in front of" : "behind",
+		  tail_cycle, tail_bytes, head_cycle, head_bytes);
+	ASSERT(0);
+
+	/*
+	 * If the head is in front, there is not space left to advance it so we
+	 * consider the log as being full. If the tail is behind, then we assume
+	 * that the whole log is available.
+	 */
+	if (tail_cycle + 1 < head_cycle)
+		return 0;
+	return log->l_logsize;
 }
 
 
-- 
1.8.4.rc3

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: prevent spurious "head behind tail" warnings
  2013-11-19 22:37 [PATCH] xfs: prevent spurious "head behind tail" warnings Dave Chinner
@ 2013-11-19 23:08 ` Mark Tinguely
  2013-11-19 23:24   ` Eric Sandeen
  2013-11-21  1:42   ` Dave Chinner
  0 siblings, 2 replies; 9+ messages in thread
From: Mark Tinguely @ 2013-11-19 23:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 11/19/13 16:37, Dave Chinner wrote:
> From: Dave Chinner<dchinner@redhat.com>
>
> When xlog_space_left() cracks the grant head and the log tail, it
> does so without locking to synchronise the sampling of the
> variables. It samples the grant head first, so if there is a delay
> before it smaples the log tail, there is a window where the log tail
> could have moved onwards and be moved past the sampled value of the
> grant head. This then leads to the "xlog_space_left: head behind
> tail" warning message.
>
> To avoid spurious output in this situation, swap the order in which
> the variables are cracked. This means that the head may grant head
> may move if there is a delay, but the log tail will be stable, hence
> ensure the tail does not jump the head accidentally.
>
> While this avoids the spurious head behind tail problem, it
> introduces the opposite problem - the head can move more than a full
> cycle past the tail. The code already handles this case by
> indicating that the log is full (i.e. zero space available) but
> that's still (generally) a spurious situation.
>
> Hence, if we detect that the head is more than a cycle ahead of the
> tail or the head is behind the tail, start the calculation again by
> resampling the variables and trying again. If we get too many
> resamples, then throw a warning and return a full or empty log
> appropriately.
>
> Signed-off-by: Dave Chinner<dchinner@redhat.com>
> ---

I am still getting the debug message:

   xlog_verify_grant_tail: space > BBTOB(tail_blocks)

This is a real over grant. It has been a while since I did all the 
tests, but basically the only way to stop it is to have a lock between 
checking for xlog_space_left() and actually reserving the space.

I am not a fan of another band-aid on a problem that is caused because 
we are granting space without locks.

--Mark.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: prevent spurious "head behind tail" warnings
  2013-11-19 23:08 ` Mark Tinguely
@ 2013-11-19 23:24   ` Eric Sandeen
  2013-11-19 23:44     ` Mark Tinguely
  2013-11-21  1:42   ` Dave Chinner
  1 sibling, 1 reply; 9+ messages in thread
From: Eric Sandeen @ 2013-11-19 23:24 UTC (permalink / raw)
  To: Mark Tinguely, Dave Chinner; +Cc: xfs

On 11/19/13, 5:08 PM, Mark Tinguely wrote:
> On 11/19/13 16:37, Dave Chinner wrote:
>> From: Dave Chinner<dchinner@redhat.com>
>>
>> When xlog_space_left() cracks the grant head and the log tail, it
>> does so without locking to synchronise the sampling of the
>> variables. It samples the grant head first, so if there is a delay
>> before it smaples the log tail, there is a window where the log tail
>> could have moved onwards and be moved past the sampled value of the
>> grant head. This then leads to the "xlog_space_left: head behind
>> tail" warning message.
>>
>> To avoid spurious output in this situation, swap the order in which
>> the variables are cracked. This means that the head may grant head
>> may move if there is a delay, but the log tail will be stable, hence
>> ensure the tail does not jump the head accidentally.
>>
>> While this avoids the spurious head behind tail problem, it
>> introduces the opposite problem - the head can move more than a full
>> cycle past the tail. The code already handles this case by
>> indicating that the log is full (i.e. zero space available) but
>> that's still (generally) a spurious situation.
>>
>> Hence, if we detect that the head is more than a cycle ahead of the
>> tail or the head is behind the tail, start the calculation again by
>> resampling the variables and trying again. If we get too many
>> resamples, then throw a warning and return a full or empty log
>> appropriately.
>>
>> Signed-off-by: Dave Chinner<dchinner@redhat.com>
>> ---
> 
> I am still getting the debug message:
> 
>   xlog_verify_grant_tail: space > BBTOB(tail_blocks)
> 
> This is a real over grant. It has been a while since I did all the tests, but basically the only way to stop it is to have a lock between checking for xlog_space_left() and actually reserving the space.
> 
> I am not a fan of another band-aid on a problem that is caused because we are granting space without locks.

Mark, can you remind us of your testcase that produces this?
(sorry, I guess I should search for that old thread...)

Thanks,
-Eric

> --Mark.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: prevent spurious "head behind tail" warnings
  2013-11-19 23:24   ` Eric Sandeen
@ 2013-11-19 23:44     ` Mark Tinguely
  2013-11-21  2:19       ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Mark Tinguely @ 2013-11-19 23:44 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

On 11/19/13 17:24, Eric Sandeen wrote:
> On 11/19/13, 5:08 PM, Mark Tinguely wrote:
>> On 11/19/13 16:37, Dave Chinner wrote:
>>> From: Dave Chinner<dchinner@redhat.com>
>>>
>>> When xlog_space_left() cracks the grant head and the log tail, it
>>> does so without locking to synchronise the sampling of the
>>> variables. It samples the grant head first, so if there is a delay
>>> before it smaples the log tail, there is a window where the log tail
>>> could have moved onwards and be moved past the sampled value of the
>>> grant head. This then leads to the "xlog_space_left: head behind
>>> tail" warning message.
>>>
>>> To avoid spurious output in this situation, swap the order in which
>>> the variables are cracked. This means that the head may grant head
>>> may move if there is a delay, but the log tail will be stable, hence
>>> ensure the tail does not jump the head accidentally.
>>>
>>> While this avoids the spurious head behind tail problem, it
>>> introduces the opposite problem - the head can move more than a full
>>> cycle past the tail. The code already handles this case by
>>> indicating that the log is full (i.e. zero space available) but
>>> that's still (generally) a spurious situation.
>>>
>>> Hence, if we detect that the head is more than a cycle ahead of the
>>> tail or the head is behind the tail, start the calculation again by
>>> resampling the variables and trying again. If we get too many
>>> resamples, then throw a warning and return a full or empty log
>>> appropriately.
>>>
>>> Signed-off-by: Dave Chinner<dchinner@redhat.com>
>>> ---
>>
>> I am still getting the debug message:
>>
>>    xlog_verify_grant_tail: space>  BBTOB(tail_blocks)
>>
>> This is a real over grant. It has been a while since I did all the tests, but basically the only way to stop it is to have a lock between checking for xlog_space_left() and actually reserving the space.
>>
>> I am not a fan of another band-aid on a problem that is caused because we are granting space without locks.
>
> Mark, can you remind us of your testcase that produces this?
> (sorry, I guess I should search for that old thread...)
>
> Thanks,
> -Eric
>
>> --Mark.

xfstest 273 hits it 100% of the time for me, as does 32+ process 
fsstress, pretty much any high log usage test.

I know Brian hit this with xfstest 273 when he was testing for commit 
9a3a5dab.

Using xfstest 273, I was seeing ten of thousand of bytes of over commit. 
 From what I recall, I tried a separate lock for the write/reserve grant 
heads, put locks to make sure the verifier was not getting stale 
information, ordered the write/reserve ungrants relative to the grants, 
put in cache smp_mb() call. Some attempts were more successful than 
others, but the only way I could prevent the overgrant completely was to 
put back the global lock between the checking for space and the granting 
of space.

--Mark.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: prevent spurious "head behind tail" warnings
  2013-11-19 23:08 ` Mark Tinguely
  2013-11-19 23:24   ` Eric Sandeen
@ 2013-11-21  1:42   ` Dave Chinner
  2013-11-21 13:01     ` Dave Chinner
  1 sibling, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2013-11-21  1:42 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: xfs

On Tue, Nov 19, 2013 at 05:08:33PM -0600, Mark Tinguely wrote:
> On 11/19/13 16:37, Dave Chinner wrote:
> >From: Dave Chinner<dchinner@redhat.com>
> >
> >When xlog_space_left() cracks the grant head and the log tail, it
> >does so without locking to synchronise the sampling of the
> >variables. It samples the grant head first, so if there is a delay
> >before it smaples the log tail, there is a window where the log tail
> >could have moved onwards and be moved past the sampled value of the
> >grant head. This then leads to the "xlog_space_left: head behind
> >tail" warning message.
> >
> >To avoid spurious output in this situation, swap the order in which
> >the variables are cracked. This means that the head may grant head
> >may move if there is a delay, but the log tail will be stable, hence
> >ensure the tail does not jump the head accidentally.
> >
> >While this avoids the spurious head behind tail problem, it
> >introduces the opposite problem - the head can move more than a full
> >cycle past the tail. The code already handles this case by
> >indicating that the log is full (i.e. zero space available) but
> >that's still (generally) a spurious situation.
> >
> >Hence, if we detect that the head is more than a cycle ahead of the
> >tail or the head is behind the tail, start the calculation again by
> >resampling the variables and trying again. If we get too many
> >resamples, then throw a warning and return a full or empty log
> >appropriately.
> >
> >Signed-off-by: Dave Chinner<dchinner@redhat.com>
> >---
> 
> I am still getting the debug message:
> 
>   xlog_verify_grant_tail: space > BBTOB(tail_blocks)

I'm not sure why that is relevant to this patch - it doesn't touch
xlog_verify_grant_tail(), nor does it change the functioning of the
code except to remove false positive warnings.

Further, the comment about xlog_verify_grant_tail() says:

  * This check is run unlocked, so can give false positives. 

That's exactly what this patch addresses for xlog_space_left() -
it prevents false positives when it is run unlocked. So we could do
exactly the same thing to xlog_verify_grant_tail() as well. Patch
attached below.

> This is a real over grant.

Evidence, please.

> It has been a while since I did all the
> tests, but basically the only way to stop it is to have a lock
> between checking for xlog_space_left() and actually reserving the
> space.

Yes, so you demonstrated, but that change also completely destroyed
scalability of the transaction subsystem in the process, hence is a
non-starter, especially just to silence a debug-only false positive.
IOWs, there was no evidence provided that the message was the result
of an actual leak (i.e. a problem in the algorithm) rather than it
being a false positive.

The first step in determining if we really have a problem here is to
cut the false positive rate, and that's exactly what I'm addressing
here. Hence when we see this warning being emitted we will have a
much higher confidence that it is a real problem and not a false
positive.

> I am not a fan of another band-aid on a problem that is caused
> because we are granting space without locks.

Lockless algorithms *by their very nature* involve verifying what
was grabbed without a lock is valid before using it. It's a
fundamental principle underlying lockless algorithms, whether they
be lockless due to use of cmpxchg (like the log grant code) or RCU
(like the xfs inode cache) or just plain old memory barriers to
enforce memory access ordering.

Just because the original code didn't do this doesn't mean doing it
now is a band-aid. I understand lockless algorithms a whole lot
better than I did 3 years ago when I wrote the original algorithm,
and as such can see things I didn't clearly understand back then.
Indeed, the comments indicate that I knew there were problems (i.e.
false positive detection issues), but didn't understand them well
enough to be able to mitigate them. I certainly understand them well
enough now, so I'd hardly call mitigating known issues like this a
band-aid.

Looking at the bigger picture, we're adding lockless algorithms
rapidly throughout the kernel.  Hence it's critical that kernel
developers understand how lockless algorithms work and realise that
locks are no longer the preferred solution to the scalability
problem. If a lockless algorithm is broken, then we need to
understand exactly how it is broken, determine whether it can be
fixed (e.g. is it just missing a critical memory barrier?), and as
a last resort, if it needs to be replaced then we need to design
a new algorithm that scales equivalently well.

We need to start by killing as many false positives the lockless
algorithm throws out and see what remains. We have no real evidence
that there are fundamental problems with the lockless algorithm, so
let's address the issues we know about first and go from there.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: prevent spurious "space > BBTOB(tail_blocks)" warnings

When xlog_verify_grant_tail() cracks the grant head and the log
tail, it does so without locking to synchronise the sampling of the
variables. It samples the grant head first, so if there is a delay
before it samples the log tail, there is a window where the log tail
could have moved onwards and be moved past the sampled value of the
grant head. This then leads to the "xlog_verify_grant_tail: space >
BBTOB(tail_blocks)" warning message.

To avoid spurious output in this situation, swap the order in which
the variables are cracked. This means that the head may grant head
may move if there is a delay, but the log tail will be stable, hence
ensure the tail does not jump the head accidentally.

While this avoids the spurious head behind tail problem, it
introduces the opposite problem - the head can move more than a full
cycle past the tail.

Hence, if we detect that the head is more than a cycle ahead of the
tail or the head is behind the tail, start the calculation again by
resampling the variables and trying again. If we get too many
resamples, then throw a warning.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

---
 fs/xfs/xfs_log.c | 82 +++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 64 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 0cfcc20..6857e09 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3682,26 +3682,72 @@ STATIC void
 xlog_verify_grant_tail(
 	struct xlog	*log)
 {
-	int		tail_cycle, tail_blocks;
-	int		cycle, space;
-
-	xlog_crack_grant_head(&log->l_write_head.grant, &cycle, &space);
-	xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, &tail_blocks);
-	if (tail_cycle != cycle) {
-		if (cycle - 1 != tail_cycle &&
-		    !(log->l_flags & XLOG_TAIL_WARN)) {
-			xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
-				"%s: cycle - 1 != tail_cycle", __func__);
-			log->l_flags |= XLOG_TAIL_WARN;
-		}
+	int		tail_cycle;
+	int		tail_blocks;
+	int		head_cycle;
+	int		head_bytes;
+	int		retries = 0;
+#define XLOG_VERIFY_MAX_RETRIES	5
 
-		if (space > BBTOB(tail_blocks) &&
-		    !(log->l_flags & XLOG_TAIL_WARN)) {
-			xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
-				"%s: space > BBTOB(tail_blocks)", __func__);
-			log->l_flags |= XLOG_TAIL_WARN;
-		}
+
+	/*
+	 * If we've already detected an problem here, don't bother checking
+	 * again.
+	 */
+	if (log->l_flags & XLOG_TAIL_WARN)
+		return;
+	do {
+		/*
+		 * sample tail before head to avoid spurious warnings due to
+		 * racing tail updates. We dump a memory barrier here to make
+		 * sure we pick up the latest values that have been written by
+		 * other threads each time through.
+		 */
+		smp_mb();
+		xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle,
+				      &tail_blocks);
+		xlog_crack_grant_head(&log->l_write_head.grant, &head_cycle,
+				      &head_bytes);
+
+		/*
+		 * if the cycles are the same, the head and tail can't be
+		 * overlapping, so everything is ok and we are done.
+		 */
+		if (tail_cycle == head_cycle)
+			return;
+
+		/*
+		 * if the tail is on the previous cycle to the head and the head
+		 * is before the tail, then all is good.
+		 */
+		if (tail_cycle == head_cycle - 1 &&
+		    BBTOB(tail_blocks) >= head_bytes)
+			return;
+
+		/*
+		 * Cycles don't match or the head overlapped the tail.
+		 * Invalid, so let's go around again and resample to ensure this
+		 * isn't a false positive.
+		 */
+	} while (retries++ < XLOG_VERIFY_MAX_RETRIES);
+
+	/*
+	 * OK, we're in trouble now - the head and tail are out of sync. Time to
+	 * issue a warning about it
+	 */
+	if (head_cycle - 1 != tail_cycle) {
+		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
+			"%s: head cycle - 1 != tail_cycle (0x%x/0x%x)",
+			      __func__, head_cycle, tail_cycle);
+	}
+
+	if (head_bytes > BBTOB(tail_blocks)) {
+		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
+			"%s: head_bytes > BBTOB(tail_blocks) (0x%x/0x%x)",
+			      __func__, head_bytes, BBTOB(tail_blocks));
 	}
+
+	log->l_flags |= XLOG_TAIL_WARN;
 }
 
 /* check if it will fit */

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: prevent spurious "head behind tail" warnings
  2013-11-19 23:44     ` Mark Tinguely
@ 2013-11-21  2:19       ` Dave Chinner
  2013-11-21  3:16         ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2013-11-21  2:19 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: Eric Sandeen, xfs

On Tue, Nov 19, 2013 at 05:44:50PM -0600, Mark Tinguely wrote:
> On 11/19/13 17:24, Eric Sandeen wrote:
> >On 11/19/13, 5:08 PM, Mark Tinguely wrote:
> >>On 11/19/13 16:37, Dave Chinner wrote:
> >>>From: Dave Chinner<dchinner@redhat.com>
> >>>
> >>>When xlog_space_left() cracks the grant head and the log tail, it
> >>>does so without locking to synchronise the sampling of the
> >>>variables. It samples the grant head first, so if there is a delay
> >>>before it smaples the log tail, there is a window where the log tail
> >>>could have moved onwards and be moved past the sampled value of the
> >>>grant head. This then leads to the "xlog_space_left: head behind
> >>>tail" warning message.
> >>>
> >>>To avoid spurious output in this situation, swap the order in which
> >>>the variables are cracked. This means that the head may grant head
> >>>may move if there is a delay, but the log tail will be stable, hence
> >>>ensure the tail does not jump the head accidentally.
> >>>
> >>>While this avoids the spurious head behind tail problem, it
> >>>introduces the opposite problem - the head can move more than a full
> >>>cycle past the tail. The code already handles this case by
> >>>indicating that the log is full (i.e. zero space available) but
> >>>that's still (generally) a spurious situation.
> >>>
> >>>Hence, if we detect that the head is more than a cycle ahead of the
> >>>tail or the head is behind the tail, start the calculation again by
> >>>resampling the variables and trying again. If we get too many
> >>>resamples, then throw a warning and return a full or empty log
> >>>appropriately.
> >>>
> >>>Signed-off-by: Dave Chinner<dchinner@redhat.com>
> >>>---
> >>
> >>I am still getting the debug message:
> >>
> >>   xlog_verify_grant_tail: space>  BBTOB(tail_blocks)
> >>
> >>This is a real over grant. It has been a while since I did all the tests, but basically the only way to stop it is to have a lock between checking for xlog_space_left() and actually reserving the space.
> >>
> >>I am not a fan of another band-aid on a problem that is caused because we are granting space without locks.
> >
> >Mark, can you remind us of your testcase that produces this?
> >(sorry, I guess I should search for that old thread...)
> >
> >Thanks,
> >-Eric
> >
> >>--Mark.
> 
> xfstest 273 hits it 100% of the time for me, as does 32+ process
> fsstress, pretty much any high log usage test.

Yet I do those sorts of tests all the time on lots of different
machines and I rarely see either of the warnings that we are talking
about here. I sent the xlog_space_left() patch because a user
reported their log being spammed, not because they had a problem
with a filesystem hanging on log space.

i.e. spam == false positive, log space hang == real problem.

> I know Brian hit this with xfstest 273 when he was testing for
> commit 9a3a5dab.
> 
> Using xfstest 273, I was seeing ten of thousand of bytes of over
> commit.

That doesn't mean it's real.

By definition, a false positive shows exactly the same numbers as a
real problem. The only way to determine it is a false positive is to
use some other method of verifying the numbers. That could be
checking the state of other related variables or simply resampling
the variables multiple times to reduce the probability that the
problem was caused by a temporary condition (i.e. a race).

In this case, resampling them from their source is used to reverify
them, and if we get continual failures being detected, then the
likelihood of them being a false positive is very low.

> From what I recall, I tried a separate lock for the
> write/reserve grant heads,

We already have a per-grant head lock, and most of the calls to
xlog_space_left() are called under it. Hence the problem I'm
addressing is not a result of the grant head changing - the problem
is the log tail is updated concurrently. No amount of locking being
added to the grant heads will fix that problem.

And regardless of the fact that xlog_verify_grant_tail() samples the
write grant head with locks (it's a atomic64_t to do this safely),
the problem is that log tail can then change before it is sampled.
No amount of locking the grant heads will fix that problem, either.

> put locks to make sure the verifier was
> not getting stale information, ordered the write/reserve ungrants
> relative to the grants, put in cache smp_mb() call. Some attempts
> were more successful than others, but the only way I could prevent
> the overgrant completely was to put back the global lock between the
> checking for space and the granting of space.

Which won't have solved the problem of concurrent log tail updates
occurring after the grant head has been sampled. All this change
does is modify the concurrency patterns and hence made the race
condition much harder to hit by disabling pre-emption across
functions.

IOWs, we can probably get exactly the same result from adding
"preempt_disable(); ....  preempt_enable();" around the variable
sampling so that we cannot be pre-empted at all between the two
samples being taken.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: prevent spurious "head behind tail" warnings
  2013-11-21  2:19       ` Dave Chinner
@ 2013-11-21  3:16         ` Dave Chinner
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2013-11-21  3:16 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: Eric Sandeen, xfs

On Thu, Nov 21, 2013 at 01:19:33PM +1100, Dave Chinner wrote:
> On Tue, Nov 19, 2013 at 05:44:50PM -0600, Mark Tinguely wrote:
> > On 11/19/13 17:24, Eric Sandeen wrote:
> > >On 11/19/13, 5:08 PM, Mark Tinguely wrote:
> > >>On 11/19/13 16:37, Dave Chinner wrote:
> > >>>From: Dave Chinner<dchinner@redhat.com>
> > >>>
> > >>>When xlog_space_left() cracks the grant head and the log tail, it
> > >>>does so without locking to synchronise the sampling of the
> > >>>variables. It samples the grant head first, so if there is a delay
> > >>>before it smaples the log tail, there is a window where the log tail
> > >>>could have moved onwards and be moved past the sampled value of the
> > >>>grant head. This then leads to the "xlog_space_left: head behind
> > >>>tail" warning message.
> > >>>
> > >>>To avoid spurious output in this situation, swap the order in which
> > >>>the variables are cracked. This means that the head may grant head
> > >>>may move if there is a delay, but the log tail will be stable, hence
> > >>>ensure the tail does not jump the head accidentally.
> > >>>
> > >>>While this avoids the spurious head behind tail problem, it
> > >>>introduces the opposite problem - the head can move more than a full
> > >>>cycle past the tail. The code already handles this case by
> > >>>indicating that the log is full (i.e. zero space available) but
> > >>>that's still (generally) a spurious situation.
> > >>>
> > >>>Hence, if we detect that the head is more than a cycle ahead of the
> > >>>tail or the head is behind the tail, start the calculation again by
> > >>>resampling the variables and trying again. If we get too many
> > >>>resamples, then throw a warning and return a full or empty log
> > >>>appropriately.
> > >>>
> > >>>Signed-off-by: Dave Chinner<dchinner@redhat.com>
> > >>>---
> > >>
> > >>I am still getting the debug message:
> > >>
> > >>   xlog_verify_grant_tail: space>  BBTOB(tail_blocks)
> > >>
> > >>This is a real over grant. It has been a while since I did all the tests, but basically the only way to stop it is to have a lock between checking for xlog_space_left() and actually reserving the space.
> > >>
> > >>I am not a fan of another band-aid on a problem that is caused because we are granting space without locks.
> > >
> > >Mark, can you remind us of your testcase that produces this?
> > >(sorry, I guess I should search for that old thread...)
> > >
> > >Thanks,
> > >-Eric
> > >
> > >>--Mark.
> > 
> > xfstest 273 hits it 100% of the time for me, as does 32+ process
> > fsstress, pretty much any high log usage test.
> 
> Yet I do those sorts of tests all the time on lots of different
> machines and I rarely see either of the warnings that we are talking
> about here. I sent the xlog_space_left() patch because a user
> reported their log being spammed, not because they had a problem
> with a filesystem hanging on log space.
> 
> i.e. spam == false positive, log space hang == real problem.
> 
> > I know Brian hit this with xfstest 273 when he was testing for
> > commit 9a3a5dab.
> > 
> > Using xfstest 273, I was seeing ten of thousand of bytes of over
> > commit.
> 
> That doesn't mean it's real.
> 
> By definition, a false positive shows exactly the same numbers as a
> real problem. The only way to determine it is a false positive is to
> use some other method of verifying the numbers. That could be
> checking the state of other related variables or simply resampling
> the variables multiple times to reduce the probability that the
> problem was caused by a temporary condition (i.e. a race).
> 
> In this case, resampling them from their source is used to reverify
> them, and if we get continual failures being detected, then the
> likelihood of them being a false positive is very low.
> 
> > From what I recall, I tried a separate lock for the
> > write/reserve grant heads,
> 
> We already have a per-grant head lock, and most of the calls to
> xlog_space_left() are called under it. Hence the problem I'm
> addressing is not a result of the grant head changing - the problem
> is the log tail is updated concurrently. No amount of locking being
> added to the grant heads will fix that problem.
> 
> And regardless of the fact that xlog_verify_grant_tail() samples the
> write grant head with locks (it's a atomic64_t to do this safely),

		   without locks

> the problem is that log tail can then change before it is sampled.
> No amount of locking the grant heads will fix that problem, either.
> 
> > put locks to make sure the verifier was
> > not getting stale information, ordered the write/reserve ungrants
> > relative to the grants, put in cache smp_mb() call. Some attempts
> > were more successful than others, but the only way I could prevent
> > the overgrant completely was to put back the global lock between the
> > checking for space and the granting of space.
> 
> Which won't have solved the problem of concurrent log tail updates
> occurring after the grant head has been sampled. All this change
> does is modify the concurrency patterns and hence made the race
> condition much harder to hit by disabling pre-emption across
> functions.
> 
> IOWs, we can probably get exactly the same result from adding
> "preempt_disable(); ....  preempt_enable();" around the variable
> sampling so that we cannot be pre-empted at all between the two
> samples being taken.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
> 

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: prevent spurious "head behind tail" warnings
  2013-11-21  1:42   ` Dave Chinner
@ 2013-11-21 13:01     ` Dave Chinner
  2013-11-22  2:57       ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2013-11-21 13:01 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: xfs

On Thu, Nov 21, 2013 at 12:42:55PM +1100, Dave Chinner wrote:
> On Tue, Nov 19, 2013 at 05:08:33PM -0600, Mark Tinguely wrote:
> > On 11/19/13 16:37, Dave Chinner wrote:
> > >From: Dave Chinner<dchinner@redhat.com>
> > >
> > >When xlog_space_left() cracks the grant head and the log tail, it
> > >does so without locking to synchronise the sampling of the
> > >variables. It samples the grant head first, so if there is a delay
> > >before it smaples the log tail, there is a window where the log tail
> > >could have moved onwards and be moved past the sampled value of the
> > >grant head. This then leads to the "xlog_space_left: head behind
> > >tail" warning message.
> > >
> > >To avoid spurious output in this situation, swap the order in which
> > >the variables are cracked. This means that the head may grant head
> > >may move if there is a delay, but the log tail will be stable, hence
> > >ensure the tail does not jump the head accidentally.
> > >
> > >While this avoids the spurious head behind tail problem, it
> > >introduces the opposite problem - the head can move more than a full
> > >cycle past the tail. The code already handles this case by
> > >indicating that the log is full (i.e. zero space available) but
> > >that's still (generally) a spurious situation.
> > >
> > >Hence, if we detect that the head is more than a cycle ahead of the
> > >tail or the head is behind the tail, start the calculation again by
> > >resampling the variables and trying again. If we get too many
> > >resamples, then throw a warning and return a full or empty log
> > >appropriately.
> > >
> > >Signed-off-by: Dave Chinner<dchinner@redhat.com>
> > >---
> > 
> > I am still getting the debug message:
> > 
> >   xlog_verify_grant_tail: space > BBTOB(tail_blocks)
> 
> I'm not sure why that is relevant to this patch - it doesn't touch
> xlog_verify_grant_tail(), nor does it change the functioning of the
> code except to remove false positive warnings.
> 
> Further, the comment about xlog_verify_grant_tail() says:
> 
>   * This check is run unlocked, so can give false positives. 
> 
> That's exactly what this patch addresses for xlog_space_left() -
> it prevents false positives when it is run unlocked. So we could do
> exactly the same thing to xlog_verify_grant_tail() as well. Patch
> attached below.
> 
> > This is a real over grant.
> 
> Evidence, please.

So, I've done some more testing with these patches, and found a
couple of interesting things.  I found that xfs/297 with a bunch of
memory, VM and lock debugging turned on has fairly reliably
reproduced the messages from both functions, albeit slowly (taking
10 minutes a cycle, 3-4 cycles to get a single xlog_space_left()
failure).

First thing I did was remove the "print once" flag from
xlog_verify_grant_tail(), and I generally see 2-3 outputs from it
during a run, usaully tens to hundreds of seconds apart. No hangs,
and the difference in byte counts generally is under 200k.

It then made it print each retry, knowing that it wouldn't swamp the
logs. What each event gave was this:

[  201.796132] XFS (vdb): xlog_verify_grant_tail: retries 0, head_bytes > BBTOB(tail_blocks) (8365828/8126464)
[  201.798514] XFS (vdb): xlog_verify_grant_tail: retries 1, head_bytes > BBTOB(tail_blocks) (8365828/8126464)
[  201.800890] XFS (vdb): xlog_verify_grant_tail: retries 2, head_bytes > BBTOB(tail_blocks) (8365828/8126464)
[  201.802951] XFS (vdb): xlog_verify_grant_tail: retries 3, head_bytes > BBTOB(tail_blocks) (8365828/8126464)
[  201.805238] XFS (vdb): xlog_verify_grant_tail: retries 4, head_bytes > BBTOB(tail_blocks) (8365828/8126464)
[  201.807378] XFS (vdb): xlog_verify_grant_tail: retries 5, head_bytes > BBTOB(tail_blocks) (8365828/8126464)
[  201.809619] XFS (vdb): xlog_verify_grant_tail: head_bytes > BBTOB(tail_blocks) (8365828/8126464)

Which, as you can see, resampling didn't help prevent the message
output at all. The region of the log that this warning comes up in
is random; there is no pattern to it the repeated occurrences. Hence
I will drop the repeated sampling part of the patches, but keep the
memory barrier and preemption disabling around the sampling...

Scratch that - I just had the repeated sampling avoid a warning -
the fifth sample saw the problem go away....

[  140.728172] XFS (vdb): xlog_verify_grant_tail: retries 0, head_bytes > BBTOB(tail_blocks) (4099150/3932160)
[  140.731761] XFS (vdb): xlog_verify_grant_tail: retries 1, head_bytes > BBTOB(tail_blocks) (4099150/3932160)
[  140.735495] XFS (vdb): xlog_verify_grant_tail: retries 2, head_bytes > BBTOB(tail_blocks) (4099150/3932160)
[  140.739059] XFS (vdb): xlog_verify_grant_tail: retries 3, head_bytes > BBTOB(tail_blocks) (4099150/3932160)
[  140.742561] XFS (vdb): xlog_verify_grant_tail: retries 4, head_bytes > BBTOB(tail_blocks) (4099150/3932160)

What this points out is that the while the problem state is
temporary, the cause of it is not racing updates during sampling. We
do, indeed, appear to have a temporary state where the grant head
has passed over the log tail. By the next check, the grant head no
longer covers the tail.

The next step is to dump the stack when-ever such a warning is
emitted to find out what if there is a specific operation is that
is triggering them. All are coming from this path:

.....
  xfs_trans_reserve
    xfs_log_reserve
      xlog_verify_grant_tail

And so are occurring after the grant head has been moved forward by
the log reservation. That's rather interesting.

xlog_verify_grant_tail() verifies the log tail against the write
grant head, not the reserve grant head. The reserve grant head is
what prevents log space overcommit, and that's the one that the
xfs_log_reserve sleeps on waiting for log space to become available.
Once space is available on the reserve grant head, it adds the space
to both the reserve grant and the write grant.

Hence we know that we haven't unintentionally overcommitted the
reserve head via transaction reservations, so why does the write
head go past the log tail, and how is it different to the reserve
grant head.

Well, the write grant and the reserve grant are mostly tracking the
same thing. The difference comes, however, in permanent
transactions. While the permanent transaction has reservations for
multiple commits available (which they start with) both the write
grant and the reserve grant are updated during transaction commit

xfs_trans_commit
  xfs_log_commit_cil
    xfs_log_done
      xlog_regrant_reserve_log_space
	ticket->t_unit--
	xlog_grant_sub_space(reserve head, unused resv);
	xlog_grant_sub_space(write head, unused resv);

At this point, if the permanent log transaction still has reserved
units available, it returns. We'll come back to this.

For a rolling transaction, just before transaction commit, the
transaction will have been duplicated via xfs_trans_dup(). This
takes a reference to the ticket, and attaches it to the new
transaction. The old transaction is then committed like above, and
we continue with the same ticket in the above but with one less
reservation unit.

We then call xfs_trans_reserve on the ticket again, and that enters
xfs_log_regrant rather than xfs_log_reserve like the original ticket
reservation would have. if the ticket still has reservation units
left on it, it does nothing that push the AIL and reset the
reservation for the new transaction.

Where it gets interesting is when the ticket unit reservation runs
out, and we have to get new reservations to continue a rolling
transaction. Very few permanent log transactions actually need this.
In fact, I think xfs_itruncate_extents() is the only code that makes
use of this facility, as all other transactions have bound limits in
the code. xfs_itruncate_extents() can remove an unbound number of
extents, and hence the rolling transaction is essentially unbound in
size. Remember this, too.

So, when we run out of reservation units on the ticket, we do this:

xfs_trans_dup
xfs_trans_commit
  xfs_log_commit_cil
    xfs_log_done
      xlog_regrant_reserve_log_space
	xlog_grant_sub_space(reserve head, unused resv);
	xlog_grant_sub_space(write head, unused resv);
	xlog_grant_add_space(reserve head, unit reservation);

i.e. is simply steals the unit reservation from the reserve head
*without caring about the amount of reservation space that is
available*. This can push the reserve grant head past the tail if
the reservation space is already full.

While this might seem bad and wrong, it actually isn't because the
ticket doesn't yet have *write grant space*. So, following the
commit, we do:

xfs_trans_reserve
  xfs_log_regrant
    xlog_grant_head_check(write head, space needed)
    xlog_grant_add_space(write head, space needed)

It's the xlog_grant_head_check() that blocks waiting for write grant
space to come available, and hence we dont over-commit the log space
in such permanent transactions. Yes, we overcommit the reservation
spae, but we never overcommit the write space and hence we don't
overcommit the log space in this situation.

But, what is does point out is that the reserve head can *validly*
overcommit log space and overlap the tail of the log. This doesn't
explain the write grant head getting ahead of the log tail, but
let's ignore that and look at ithe warnings I'm getting from
xlog_space_left() for a moment.

[ 1458.577318] XFS (vdb): xlog_space_left: The grant head is in front of the log tail.
[ 1458.577318]   tail_cycle = 17, tail_bytes = 10223616
[ 1458.577318]   GH   cycle = 19, GH   bytes = 74284
[ 1458.577320] XFS: Assertion failed: 0, file: fs/xfs/xfs_log.c, line: 1194
[ 1458.580128] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 1458.580128] Modules linked in:
[ 1458.580128] CPU: 0 PID: 16722 Comm: fsstress Not tainted 3.12.0-dgc+ #60
[ 1458.580128] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 1458.580128] task: ffff88006d29e1c0 ti: ffff8800592e2000 task.ti: ffff8800592e2000
[ 1458.580128] RIP: 0010:[<ffffffff8149cf32>]  [<ffffffff8149cf32>] assfail+0x22/0x30
[ 1458.580128] RSP: 0018:ffff8800592e3b58  EFLAGS: 00010292
[ 1458.580128] RAX: 000000000000003c RBX: ffff88007764da68 RCX: 000000000000006c
[ 1458.580128] RDX: 00000000000000f2 RSI: 0000000000000046 RDI: 0000000000000246
[ 1458.580128] RBP: ffff8800592e3b58 R08: 0000000000000096 R09: 0000000000001383
[ 1458.580128] R10: 0000000000000000 R11: 0000000000001382 R12: 000000000001222c
[ 1458.580128] R13: ffff8800592e3c8c R14: 0000000000000000 R15: 0000000000000000
[ 1458.580128] FS:  00007f42f1fbe700(0000) GS:ffff88007f800000(0000) knlGS:0000000000000000
[ 1458.580128] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1458.580128] CR2: 00007f42f14c4c60 CR3: 000000005fd19000 CR4: 00000000000006f0
[ 1458.580128] Stack:
[ 1458.580128]  ffff8800592e3b98 ffffffff814f7a47 ffff880000000013 ffffffff0001222c
[ 1458.580128]  ffff8800517ec4f0 ffff8800517ec4f0 ffff88007764da68 00000000000005be
[ 1458.580128]  ffff8800592e3bb8 ffffffff814f8480 ffff88006312be00 ffff88007764da68
[ 1458.580128] Call Trace:
[ 1458.580128]  [<ffffffff814f7a47>] xlog_space_left+0xf7/0x160
[ 1458.580128]  [<ffffffff814f8480>] xlog_grant_push_ail+0x50/0xf0
[ 1458.580128]  [<ffffffff814fa19e>] xfs_log_regrant+0x5e/0x190
[ 1458.580128]  [<ffffffff814a426d>] xfs_trans_reserve+0xfd/0x310
[ 1458.580128]  [<ffffffff814a5596>] ? xfs_trans_commit+0xf6/0x2d0
[ 1458.580128]  [<ffffffff814861fa>] xfs_bmap_finish+0x11a/0x1e0
[ 1458.580128]  [<ffffffff814e5dfb>] xfs_itruncate_extents+0x24b/0x440
[ 1458.580128]  [<ffffffff814e62c8>] xfs_inactive_truncate+0xd8/0x140
[ 1458.580128]  [<ffffffff814e74f0>] xfs_inactive+0x1d0/0x1f0
[ 1458.580128]  [<ffffffff814a0656>] xfs_fs_evict_inode+0xa6/0xf0
[ 1458.580128]  [<ffffffff811bd2a8>] evict+0xb8/0x1b0
[ 1458.580128]  [<ffffffff811bda45>] iput+0x105/0x190
[ 1458.580128]  [<ffffffff811b01e2>] do_unlinkat+0x182/0x2a0
[ 1458.580128]  [<ffffffff811a44ee>] ? ____fput+0xe/0x10
[ 1458.580128]  [<ffffffff810af96c>] ? task_work_run+0xac/0xe0
[ 1458.580128]  [<ffffffff81ae3655>] ? do_device_not_available+0x35/0x60
[ 1458.580128]  [<ffffffff811b3446>] SyS_unlink+0x16/0x20
[ 1458.580128]  [<ffffffff81aead69>] system_call_fastpath+0x16/0x1b

First thing to note: this has come from xfs_itruncate_extents(),
pushing the AIL from the xfs_log_regrant(). There's a good chance we
have just overcommitted the reserve grant head here, because
xlog_grant_push ail() does:

	free_bytes = xlog_space_left(log, &log->l_reserve_head.grant);

Yup, in the situation where we have just validly overcommitted the
reserve grant space, xlog_space_left has detected that we've
overcommited the reserve grant space and assert failed.

The fact that overrun here is by a couple of hundred kilobytes is no
surprise - it's in the ballpark of a itruncate transaction unit
reservation (IIRC ~150-200k), and there can be multiple threads in
the same overcommit state at the same time so seeing 300-400k as the
overcommit is not unexpected.

Further, this overcommit situation is handled just fine by the
xlog_space_left() - it returns 0 bytes of log space available - and
so there are no problems with this situation at all.  In fact, now I
understand why the original code in xlog_space_left() handled this
situation silently without emmitting an error:

	else if (tail_cycle + 1 < head_cycle)
		return 0;

200 lines of explanation for 2 lines of code might be some kind of
record. :/

What that means, though, is that I still haven't been able to
reproduce the "head behind tail" conditions that people are
complaining about, and i don't yet have an answer to the
xlog_verify_grant_tail() issue.

I'll take on xlog_verify_grant_tail() where I left it in the
morning.  It's going on midnight and I've already drunk enough
expensive scotch for the evening plowing through this...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: prevent spurious "head behind tail" warnings
  2013-11-21 13:01     ` Dave Chinner
@ 2013-11-22  2:57       ` Dave Chinner
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2013-11-22  2:57 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: xfs

On Fri, Nov 22, 2013 at 12:01:56AM +1100, Dave Chinner wrote:
> On Thu, Nov 21, 2013 at 12:42:55PM +1100, Dave Chinner wrote:
> > On Tue, Nov 19, 2013 at 05:08:33PM -0600, Mark Tinguely wrote:
> > > On 11/19/13 16:37, Dave Chinner wrote:
> > > >From: Dave Chinner<dchinner@redhat.com>
> > > >
> > > >When xlog_space_left() cracks the grant head and the log tail, it
> > > >does so without locking to synchronise the sampling of the
> > > >variables. It samples the grant head first, so if there is a delay
> > > >before it smaples the log tail, there is a window where the log tail
> > > >could have moved onwards and be moved past the sampled value of the
> > > >grant head. This then leads to the "xlog_space_left: head behind
> > > >tail" warning message.
.....
> > > 
> > > I am still getting the debug message:
> > > 
> > >   xlog_verify_grant_tail: space > BBTOB(tail_blocks)
> > 
> > I'm not sure why that is relevant to this patch - it doesn't touch
> > xlog_verify_grant_tail(), nor does it change the functioning of the
> > code except to remove false positive warnings.
> > 
> > Further, the comment about xlog_verify_grant_tail() says:
> > 
> >   * This check is run unlocked, so can give false positives. 
> > 
> > That's exactly what this patch addresses for xlog_space_left() -
> > it prevents false positives when it is run unlocked. So we could do
> > exactly the same thing to xlog_verify_grant_tail() as well. Patch
> > attached below.
> > 
> > > This is a real over grant.
> > 
> > Evidence, please.
> 
> So, I've done some more testing with these patches, and found a
> couple of interesting things.  I found that xfs/297 with a bunch of
> memory, VM and lock debugging turned on has fairly reliably
> reproduced the messages from both functions, albeit slowly (taking
> 10 minutes a cycle, 3-4 cycles to get a single xlog_space_left()
> failure).
> 
> First thing I did was remove the "print once" flag from
> xlog_verify_grant_tail(), and I generally see 2-3 outputs from it
> during a run, usaully tens to hundreds of seconds apart. No hangs,
> and the difference in byte counts generally is under 200k.
> 
> It then made it print each retry, knowing that it wouldn't swamp the
> logs. What each event gave was this:
> 
> [  201.796132] XFS (vdb): xlog_verify_grant_tail: retries 0, head_bytes > BBTOB(tail_blocks) (8365828/8126464)
> [  201.798514] XFS (vdb): xlog_verify_grant_tail: retries 1, head_bytes > BBTOB(tail_blocks) (8365828/8126464)
> [  201.800890] XFS (vdb): xlog_verify_grant_tail: retries 2, head_bytes > BBTOB(tail_blocks) (8365828/8126464)
> [  201.802951] XFS (vdb): xlog_verify_grant_tail: retries 3, head_bytes > BBTOB(tail_blocks) (8365828/8126464)
> [  201.805238] XFS (vdb): xlog_verify_grant_tail: retries 4, head_bytes > BBTOB(tail_blocks) (8365828/8126464)
> [  201.807378] XFS (vdb): xlog_verify_grant_tail: retries 5, head_bytes > BBTOB(tail_blocks) (8365828/8126464)
> [  201.809619] XFS (vdb): xlog_verify_grant_tail: head_bytes > BBTOB(tail_blocks) (8365828/8126464)
> 
> Which, as you can see, resampling didn't help prevent the message
> output at all. The region of the log that this warning comes up in
> is random; there is no pattern to it the repeated occurrences. Hence
> I will drop the repeated sampling part of the patches, but keep the
> memory barrier and preemption disabling around the sampling...
> 
> Scratch that - I just had the repeated sampling avoid a warning -
> the fifth sample saw the problem go away....
> 
> [  140.728172] XFS (vdb): xlog_verify_grant_tail: retries 0, head_bytes > BBTOB(tail_blocks) (4099150/3932160)
> [  140.731761] XFS (vdb): xlog_verify_grant_tail: retries 1, head_bytes > BBTOB(tail_blocks) (4099150/3932160)
> [  140.735495] XFS (vdb): xlog_verify_grant_tail: retries 2, head_bytes > BBTOB(tail_blocks) (4099150/3932160)
> [  140.739059] XFS (vdb): xlog_verify_grant_tail: retries 3, head_bytes > BBTOB(tail_blocks) (4099150/3932160)
> [  140.742561] XFS (vdb): xlog_verify_grant_tail: retries 4, head_bytes > BBTOB(tail_blocks) (4099150/3932160)
> 
> What this points out is that the while the problem state is
> temporary, the cause of it is not racing updates during sampling. We
> do, indeed, appear to have a temporary state where the grant head
> has passed over the log tail. By the next check, the grant head no
> longer covers the tail.
> 
> The next step is to dump the stack when-ever such a warning is
> emitted to find out what if there is a specific operation is that
> is triggering them. All are coming from this path:
> 
> .....
>   xfs_trans_reserve
>     xfs_log_reserve
>       xlog_verify_grant_tail
> 
> And so are occurring after the grant head has been moved forward by
> the log reservation. That's rather interesting.
> 
> xlog_verify_grant_tail() verifies the log tail against the write
> grant head, not the reserve grant head. The reserve grant head is
> what prevents log space overcommit, and that's the one that the
> xfs_log_reserve sleeps on waiting for log space to become available.
> Once space is available on the reserve grant head, it adds the space
> to both the reserve grant and the write grant.
> 
> Hence we know that we haven't unintentionally overcommitted the
> reserve head via transaction reservations, so why does the write
> head go past the log tail, and how is it different to the reserve
> grant head.
> 
> Well, the write grant and the reserve grant are mostly tracking the
> same thing. The difference comes, however, in permanent
> transactions. While the permanent transaction has reservations for
> multiple commits available (which they start with) both the write
> grant and the reserve grant are updated during transaction commit
> 
> xfs_trans_commit
>   xfs_log_commit_cil
>     xfs_log_done
>       xlog_regrant_reserve_log_space
> 	ticket->t_unit--
> 	xlog_grant_sub_space(reserve head, unused resv);
> 	xlog_grant_sub_space(write head, unused resv);
> 
> At this point, if the permanent log transaction still has reserved
> units available, it returns. We'll come back to this.
> 
> For a rolling transaction, just before transaction commit, the
> transaction will have been duplicated via xfs_trans_dup(). This
> takes a reference to the ticket, and attaches it to the new
> transaction. The old transaction is then committed like above, and
> we continue with the same ticket in the above but with one less
> reservation unit.
> 
> We then call xfs_trans_reserve on the ticket again, and that enters
> xfs_log_regrant rather than xfs_log_reserve like the original ticket
> reservation would have. if the ticket still has reservation units
> left on it, it does nothing that push the AIL and reset the
> reservation for the new transaction.
> 
> Where it gets interesting is when the ticket unit reservation runs
> out, and we have to get new reservations to continue a rolling
> transaction. Very few permanent log transactions actually need this.
> In fact, I think xfs_itruncate_extents() is the only code that makes
> use of this facility, as all other transactions have bound limits in
> the code. xfs_itruncate_extents() can remove an unbound number of
> extents, and hence the rolling transaction is essentially unbound in
> size. Remember this, too.
> 
> So, when we run out of reservation units on the ticket, we do this:
> 
> xfs_trans_dup
> xfs_trans_commit
>   xfs_log_commit_cil
>     xfs_log_done
>       xlog_regrant_reserve_log_space
> 	xlog_grant_sub_space(reserve head, unused resv);
> 	xlog_grant_sub_space(write head, unused resv);
> 	xlog_grant_add_space(reserve head, unit reservation);
> 
> i.e. is simply steals the unit reservation from the reserve head
> *without caring about the amount of reservation space that is
> available*. This can push the reserve grant head past the tail if
> the reservation space is already full.
> 
> While this might seem bad and wrong, it actually isn't because the
> ticket doesn't yet have *write grant space*. So, following the
> commit, we do:
> 
> xfs_trans_reserve
>   xfs_log_regrant
>     xlog_grant_head_check(write head, space needed)
>     xlog_grant_add_space(write head, space needed)
> 
> It's the xlog_grant_head_check() that blocks waiting for write grant
> space to come available, and hence we dont over-commit the log space
> in such permanent transactions. Yes, we overcommit the reservation
> spae, but we never overcommit the write space and hence we don't
> overcommit the log space in this situation.
> 
> But, what is does point out is that the reserve head can *validly*
> overcommit log space and overlap the tail of the log. This doesn't
> explain the write grant head getting ahead of the log tail, but
> let's ignore that and look at ithe warnings I'm getting from
> xlog_space_left() for a moment.

[snip xlog_space_left() "overcommit is OK" debug]
> 
> What that means, though, is that I still haven't been able to
> reproduce the "head behind tail" conditions that people are
> complaining about, and i don't yet have an answer to the
> xlog_verify_grant_tail() issue.

So, how does the write grant head overlap the log tail? My current
hypothesis is that we're seeing the same pattern of "add space
without first checking" that results in overcommit of the write
head, just like we are doing with the reserve head.

That is, the reserve grant is overcommitting because we check and
throttle in xfs_log_reserve(), but then do an unchecked addition on
permanent transaction commit. In the case of the write head, we do
checking and throttling on permanent transaction reserve via
xfs_log_regrant(), but we do unchecked addition in
xfs_log_reserve().

Put simply, if we throttle in xfs_log_regrant because we don't have
write space available, we can have another transaction (say with a
smaller unit reservation) come through xfs_log_reserve() and get
reserve head space, which it then unconditionally adds to the write
head.

				resv head		write head
				X			Y
permanent xact P1 completes	X - curr		Y - curr
				  + unit_P1
                                (overcommit)
xact T2 reserve			OC, blocks
xact P1 reserve on dup					Not enough space,
							blocks
....
log tail moves a small amount
	space availble enough to fulfil T2 reserve, P1 write grants

xact P1 wakes			X - curr		Y - curr
				  + unit_P1		  + unit_P1
xact T2 wakes			X - curr		Y - curr
				  + unit_P1		  + unit_P1
				  + unit_T2		  + unit_T2
							  (overcommit)

In this case, it appears to me that the overcommit of the write
space is valid because transactions that block on write space
overcommit also hold a reserve space grant.

We already know that if we overcommit the reserve head due to a
permanent transaction regrant, then all new transactions are
throttled until the reserve space overcommit situation goes away.
Permanent transactions can continue to regrant even in an overcommit
situation, but we prevent real overcommit by throttling the write
grants.

IOWs, if we are in a reserve head overcommit situation, we cannot
then overcommit the write head because all new write head regrants
are checked and prevented from overcommit.

If we are in a write space overcommit state, then write regrants are
blocked after consuming reserve space. This will cause reserve
space to run out and so we can't can't overcommit the log with new
transactions. Permanent transactions can the drive the reserve space
into overcommit, but then block on the write space grant head
that is already over committed. At this point, the reserve space
overcommit situation will not be cleared until the write space
overcommit is resolved.

That is, the only way to make progress at this point is by the log
tail moving forwards due to metadata IO completion. If it only moves
forward a small amount, then write space will be the first to allow
progress, and so the permanent transactions will make progress and
complete ahead of new transactions being started, releasing their
overcommitted reserve space grants. If the log tail jumps by a long
way, it will alleviate the overcommit on both grant heads and both
new and permanent transactions will immediately make progress.

In summary: overcommits on both the reserve grant space and the
write grant space can and do occur, and both are valid temporary
conditions that resolve themselves through mechanisms built into the
algorithms without resulting in a real log space overcommit. Hence
debug code that triggers on grant head overcommit situations will
always generate false positive warnings.

I will rework the two patches to take this into account...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2013-11-22  2:58 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-11-19 22:37 [PATCH] xfs: prevent spurious "head behind tail" warnings Dave Chinner
2013-11-19 23:08 ` Mark Tinguely
2013-11-19 23:24   ` Eric Sandeen
2013-11-19 23:44     ` Mark Tinguely
2013-11-21  2:19       ` Dave Chinner
2013-11-21  3:16         ` Dave Chinner
2013-11-21  1:42   ` Dave Chinner
2013-11-21 13:01     ` Dave Chinner
2013-11-22  2:57       ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox