* [PATCH] xfs: prevent spurious "head behind tail" warnings @ 2013-11-19 22:37 Dave Chinner 2013-11-19 23:08 ` Mark Tinguely 0 siblings, 1 reply; 9+ messages in thread From: Dave Chinner @ 2013-11-19 22:37 UTC (permalink / raw) To: xfs From: Dave Chinner <dchinner@redhat.com> When xlog_space_left() cracks the grant head and the log tail, it does so without locking to synchronise the sampling of the variables. It samples the grant head first, so if there is a delay before it smaples the log tail, there is a window where the log tail could have moved onwards and be moved past the sampled value of the grant head. This then leads to the "xlog_space_left: head behind tail" warning message. To avoid spurious output in this situation, swap the order in which the variables are cracked. This means that the head may grant head may move if there is a delay, but the log tail will be stable, hence ensure the tail does not jump the head accidentally. While this avoids the spurious head behind tail problem, it introduces the opposite problem - the head can move more than a full cycle past the tail. The code already handles this case by indicating that the log is full (i.e. zero space available) but that's still (generally) a spurious situation. Hence, if we detect that the head is more than a cycle ahead of the tail or the head is behind the tail, start the calculation again by resampling the variables and trying again. If we get too many resamples, then throw a warning and return a full or empty log appropriately. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_log.c | 102 ++++++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 78 insertions(+), 24 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 8497a00..0cfcc20 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -1113,37 +1113,91 @@ xlog_space_left( struct xlog *log, atomic64_t *head) { - int free_bytes; int tail_bytes; int tail_cycle; int head_cycle; int head_bytes; + int retries = 0; +#define XLOG_SPACE_MAX_RETRIES 5 + + do { + /* + * sample tail before head to avoid spurious "head behind tail" + * warnings due to racing tail updates. We dump a memory barrier + * here to make sure we pick up the latest values that have been + * written by other threads each time through. + */ + smp_mb(); + xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, + &tail_bytes); + tail_bytes = BBTOB(tail_bytes); + xlog_crack_grant_head(head, &head_cycle, &head_bytes); + + /* + * The tail cycle should always be the same or one less than the + * head cycle. If not, restart again because we've likely had a + * race with an update during sampling. + */ + if (tail_cycle + 1 < head_cycle || tail_cycle > head_cycle) + continue; + + /* + * If the tail is in previous cycle, space available is the + * region between the tail and the head. + * + * H T + * +2222222111111111111111+ + * +-- free --+ + */ + if (tail_cycle < head_cycle) { + ASSERT(tail_cycle == (head_cycle - 1)); + return tail_bytes - head_bytes; + } + + /* + * If the tail is in the same cycle, space available is the + * regions outside the range between the tail and the head: + * + * T H + * +2222222222222222211111+ + * +-- used --+ + * +free+ +free+ + * + * If the head and tail are the same, then the whole log is + * free. + */ + if (tail_cycle == head_cycle && head_bytes >= tail_bytes) + return log->l_logsize - (head_bytes - tail_bytes); - xlog_crack_grant_head(head, &head_cycle, &head_bytes); - xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, &tail_bytes); - tail_bytes = BBTOB(tail_bytes); - if (tail_cycle == head_cycle && head_bytes >= tail_bytes) - free_bytes = log->l_logsize - (head_bytes - tail_bytes); - else if (tail_cycle + 1 < head_cycle) - return 0; - else if (tail_cycle < head_cycle) { - ASSERT(tail_cycle == (head_cycle - 1)); - free_bytes = tail_bytes - head_bytes; - } else { /* - * The reservation head is behind the tail. - * In this case we just want to return the size of the - * log as the amount of space left. + * The head is behind the tail. Invalid, so let's go around + * again and resample. */ - xfs_alert(log->l_mp, - "xlog_space_left: head behind tail\n" - " tail_cycle = %d, tail_bytes = %d\n" - " GH cycle = %d, GH bytes = %d", - tail_cycle, tail_bytes, head_cycle, head_bytes); - ASSERT(0); - free_bytes = log->l_logsize; - } - return free_bytes; + } while (retries++ < XLOG_SPACE_MAX_RETRIES); + + /* + * OK, we're in trouble now - the head and tail are out of sync. Time to + * issue a warning about it and return a default free space value + * depending on whether the tail is ahead or behind the log. + */ + ASSERT(retries >= XLOG_SPACE_MAX_RETRIES); + xfs_alert(log->l_mp, + "%s: The grant head is %s the log tail.\n" + " tail_cycle = %d, tail_bytes = %d\n" + " GH cycle = %d, GH bytes = %d", + __func__, + tail_cycle + 1 < head_cycle ? "in front of" : "behind", + tail_cycle, tail_bytes, head_cycle, head_bytes); + ASSERT(0); + + /* + * If the head is in front, there is not space left to advance it so we + * consider the log as being full. If the tail is behind, then we assume + * that the whole log is available. + */ + if (tail_cycle + 1 < head_cycle) + return 0; + return log->l_logsize; } -- 1.8.4.rc3 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] xfs: prevent spurious "head behind tail" warnings 2013-11-19 22:37 [PATCH] xfs: prevent spurious "head behind tail" warnings Dave Chinner @ 2013-11-19 23:08 ` Mark Tinguely 2013-11-19 23:24 ` Eric Sandeen 2013-11-21 1:42 ` Dave Chinner 0 siblings, 2 replies; 9+ messages in thread From: Mark Tinguely @ 2013-11-19 23:08 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On 11/19/13 16:37, Dave Chinner wrote: > From: Dave Chinner<dchinner@redhat.com> > > When xlog_space_left() cracks the grant head and the log tail, it > does so without locking to synchronise the sampling of the > variables. It samples the grant head first, so if there is a delay > before it smaples the log tail, there is a window where the log tail > could have moved onwards and be moved past the sampled value of the > grant head. This then leads to the "xlog_space_left: head behind > tail" warning message. > > To avoid spurious output in this situation, swap the order in which > the variables are cracked. This means that the head may grant head > may move if there is a delay, but the log tail will be stable, hence > ensure the tail does not jump the head accidentally. > > While this avoids the spurious head behind tail problem, it > introduces the opposite problem - the head can move more than a full > cycle past the tail. The code already handles this case by > indicating that the log is full (i.e. zero space available) but > that's still (generally) a spurious situation. > > Hence, if we detect that the head is more than a cycle ahead of the > tail or the head is behind the tail, start the calculation again by > resampling the variables and trying again. If we get too many > resamples, then throw a warning and return a full or empty log > appropriately. > > Signed-off-by: Dave Chinner<dchinner@redhat.com> > --- I am still getting the debug message: xlog_verify_grant_tail: space > BBTOB(tail_blocks) This is a real over grant. It has been a while since I did all the tests, but basically the only way to stop it is to have a lock between checking for xlog_space_left() and actually reserving the space. I am not a fan of another band-aid on a problem that is caused because we are granting space without locks. --Mark. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] xfs: prevent spurious "head behind tail" warnings 2013-11-19 23:08 ` Mark Tinguely @ 2013-11-19 23:24 ` Eric Sandeen 2013-11-19 23:44 ` Mark Tinguely 2013-11-21 1:42 ` Dave Chinner 1 sibling, 1 reply; 9+ messages in thread From: Eric Sandeen @ 2013-11-19 23:24 UTC (permalink / raw) To: Mark Tinguely, Dave Chinner; +Cc: xfs On 11/19/13, 5:08 PM, Mark Tinguely wrote: > On 11/19/13 16:37, Dave Chinner wrote: >> From: Dave Chinner<dchinner@redhat.com> >> >> When xlog_space_left() cracks the grant head and the log tail, it >> does so without locking to synchronise the sampling of the >> variables. It samples the grant head first, so if there is a delay >> before it smaples the log tail, there is a window where the log tail >> could have moved onwards and be moved past the sampled value of the >> grant head. This then leads to the "xlog_space_left: head behind >> tail" warning message. >> >> To avoid spurious output in this situation, swap the order in which >> the variables are cracked. This means that the head may grant head >> may move if there is a delay, but the log tail will be stable, hence >> ensure the tail does not jump the head accidentally. >> >> While this avoids the spurious head behind tail problem, it >> introduces the opposite problem - the head can move more than a full >> cycle past the tail. The code already handles this case by >> indicating that the log is full (i.e. zero space available) but >> that's still (generally) a spurious situation. >> >> Hence, if we detect that the head is more than a cycle ahead of the >> tail or the head is behind the tail, start the calculation again by >> resampling the variables and trying again. If we get too many >> resamples, then throw a warning and return a full or empty log >> appropriately. >> >> Signed-off-by: Dave Chinner<dchinner@redhat.com> >> --- > > I am still getting the debug message: > > xlog_verify_grant_tail: space > BBTOB(tail_blocks) > > This is a real over grant. It has been a while since I did all the tests, but basically the only way to stop it is to have a lock between checking for xlog_space_left() and actually reserving the space. > > I am not a fan of another band-aid on a problem that is caused because we are granting space without locks. Mark, can you remind us of your testcase that produces this? (sorry, I guess I should search for that old thread...) Thanks, -Eric > --Mark. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] xfs: prevent spurious "head behind tail" warnings 2013-11-19 23:24 ` Eric Sandeen @ 2013-11-19 23:44 ` Mark Tinguely 2013-11-21 2:19 ` Dave Chinner 0 siblings, 1 reply; 9+ messages in thread From: Mark Tinguely @ 2013-11-19 23:44 UTC (permalink / raw) To: Eric Sandeen; +Cc: xfs On 11/19/13 17:24, Eric Sandeen wrote: > On 11/19/13, 5:08 PM, Mark Tinguely wrote: >> On 11/19/13 16:37, Dave Chinner wrote: >>> From: Dave Chinner<dchinner@redhat.com> >>> >>> When xlog_space_left() cracks the grant head and the log tail, it >>> does so without locking to synchronise the sampling of the >>> variables. It samples the grant head first, so if there is a delay >>> before it smaples the log tail, there is a window where the log tail >>> could have moved onwards and be moved past the sampled value of the >>> grant head. This then leads to the "xlog_space_left: head behind >>> tail" warning message. >>> >>> To avoid spurious output in this situation, swap the order in which >>> the variables are cracked. This means that the head may grant head >>> may move if there is a delay, but the log tail will be stable, hence >>> ensure the tail does not jump the head accidentally. >>> >>> While this avoids the spurious head behind tail problem, it >>> introduces the opposite problem - the head can move more than a full >>> cycle past the tail. The code already handles this case by >>> indicating that the log is full (i.e. zero space available) but >>> that's still (generally) a spurious situation. >>> >>> Hence, if we detect that the head is more than a cycle ahead of the >>> tail or the head is behind the tail, start the calculation again by >>> resampling the variables and trying again. If we get too many >>> resamples, then throw a warning and return a full or empty log >>> appropriately. >>> >>> Signed-off-by: Dave Chinner<dchinner@redhat.com> >>> --- >> >> I am still getting the debug message: >> >> xlog_verify_grant_tail: space> BBTOB(tail_blocks) >> >> This is a real over grant. It has been a while since I did all the tests, but basically the only way to stop it is to have a lock between checking for xlog_space_left() and actually reserving the space. >> >> I am not a fan of another band-aid on a problem that is caused because we are granting space without locks. > > Mark, can you remind us of your testcase that produces this? > (sorry, I guess I should search for that old thread...) > > Thanks, > -Eric > >> --Mark. xfstest 273 hits it 100% of the time for me, as does 32+ process fsstress, pretty much any high log usage test. I know Brian hit this with xfstest 273 when he was testing for commit 9a3a5dab. Using xfstest 273, I was seeing ten of thousand of bytes of over commit. From what I recall, I tried a separate lock for the write/reserve grant heads, put locks to make sure the verifier was not getting stale information, ordered the write/reserve ungrants relative to the grants, put in cache smp_mb() call. Some attempts were more successful than others, but the only way I could prevent the overgrant completely was to put back the global lock between the checking for space and the granting of space. --Mark. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] xfs: prevent spurious "head behind tail" warnings 2013-11-19 23:44 ` Mark Tinguely @ 2013-11-21 2:19 ` Dave Chinner 2013-11-21 3:16 ` Dave Chinner 0 siblings, 1 reply; 9+ messages in thread From: Dave Chinner @ 2013-11-21 2:19 UTC (permalink / raw) To: Mark Tinguely; +Cc: Eric Sandeen, xfs On Tue, Nov 19, 2013 at 05:44:50PM -0600, Mark Tinguely wrote: > On 11/19/13 17:24, Eric Sandeen wrote: > >On 11/19/13, 5:08 PM, Mark Tinguely wrote: > >>On 11/19/13 16:37, Dave Chinner wrote: > >>>From: Dave Chinner<dchinner@redhat.com> > >>> > >>>When xlog_space_left() cracks the grant head and the log tail, it > >>>does so without locking to synchronise the sampling of the > >>>variables. It samples the grant head first, so if there is a delay > >>>before it smaples the log tail, there is a window where the log tail > >>>could have moved onwards and be moved past the sampled value of the > >>>grant head. This then leads to the "xlog_space_left: head behind > >>>tail" warning message. > >>> > >>>To avoid spurious output in this situation, swap the order in which > >>>the variables are cracked. This means that the head may grant head > >>>may move if there is a delay, but the log tail will be stable, hence > >>>ensure the tail does not jump the head accidentally. > >>> > >>>While this avoids the spurious head behind tail problem, it > >>>introduces the opposite problem - the head can move more than a full > >>>cycle past the tail. The code already handles this case by > >>>indicating that the log is full (i.e. zero space available) but > >>>that's still (generally) a spurious situation. > >>> > >>>Hence, if we detect that the head is more than a cycle ahead of the > >>>tail or the head is behind the tail, start the calculation again by > >>>resampling the variables and trying again. If we get too many > >>>resamples, then throw a warning and return a full or empty log > >>>appropriately. > >>> > >>>Signed-off-by: Dave Chinner<dchinner@redhat.com> > >>>--- > >> > >>I am still getting the debug message: > >> > >> xlog_verify_grant_tail: space> BBTOB(tail_blocks) > >> > >>This is a real over grant. It has been a while since I did all the tests, but basically the only way to stop it is to have a lock between checking for xlog_space_left() and actually reserving the space. > >> > >>I am not a fan of another band-aid on a problem that is caused because we are granting space without locks. > > > >Mark, can you remind us of your testcase that produces this? > >(sorry, I guess I should search for that old thread...) > > > >Thanks, > >-Eric > > > >>--Mark. > > xfstest 273 hits it 100% of the time for me, as does 32+ process > fsstress, pretty much any high log usage test. Yet I do those sorts of tests all the time on lots of different machines and I rarely see either of the warnings that we are talking about here. I sent the xlog_space_left() patch because a user reported their log being spammed, not because they had a problem with a filesystem hanging on log space. i.e. spam == false positive, log space hang == real problem. > I know Brian hit this with xfstest 273 when he was testing for > commit 9a3a5dab. > > Using xfstest 273, I was seeing ten of thousand of bytes of over > commit. That doesn't mean it's real. By definition, a false positive shows exactly the same numbers as a real problem. The only way to determine it is a false positive is to use some other method of verifying the numbers. That could be checking the state of other related variables or simply resampling the variables multiple times to reduce the probability that the problem was caused by a temporary condition (i.e. a race). In this case, resampling them from their source is used to reverify them, and if we get continual failures being detected, then the likelihood of them being a false positive is very low. > From what I recall, I tried a separate lock for the > write/reserve grant heads, We already have a per-grant head lock, and most of the calls to xlog_space_left() are called under it. Hence the problem I'm addressing is not a result of the grant head changing - the problem is the log tail is updated concurrently. No amount of locking being added to the grant heads will fix that problem. And regardless of the fact that xlog_verify_grant_tail() samples the write grant head with locks (it's a atomic64_t to do this safely), the problem is that log tail can then change before it is sampled. No amount of locking the grant heads will fix that problem, either. > put locks to make sure the verifier was > not getting stale information, ordered the write/reserve ungrants > relative to the grants, put in cache smp_mb() call. Some attempts > were more successful than others, but the only way I could prevent > the overgrant completely was to put back the global lock between the > checking for space and the granting of space. Which won't have solved the problem of concurrent log tail updates occurring after the grant head has been sampled. All this change does is modify the concurrency patterns and hence made the race condition much harder to hit by disabling pre-emption across functions. IOWs, we can probably get exactly the same result from adding "preempt_disable(); .... preempt_enable();" around the variable sampling so that we cannot be pre-empted at all between the two samples being taken. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] xfs: prevent spurious "head behind tail" warnings 2013-11-21 2:19 ` Dave Chinner @ 2013-11-21 3:16 ` Dave Chinner 0 siblings, 0 replies; 9+ messages in thread From: Dave Chinner @ 2013-11-21 3:16 UTC (permalink / raw) To: Mark Tinguely; +Cc: Eric Sandeen, xfs On Thu, Nov 21, 2013 at 01:19:33PM +1100, Dave Chinner wrote: > On Tue, Nov 19, 2013 at 05:44:50PM -0600, Mark Tinguely wrote: > > On 11/19/13 17:24, Eric Sandeen wrote: > > >On 11/19/13, 5:08 PM, Mark Tinguely wrote: > > >>On 11/19/13 16:37, Dave Chinner wrote: > > >>>From: Dave Chinner<dchinner@redhat.com> > > >>> > > >>>When xlog_space_left() cracks the grant head and the log tail, it > > >>>does so without locking to synchronise the sampling of the > > >>>variables. It samples the grant head first, so if there is a delay > > >>>before it smaples the log tail, there is a window where the log tail > > >>>could have moved onwards and be moved past the sampled value of the > > >>>grant head. This then leads to the "xlog_space_left: head behind > > >>>tail" warning message. > > >>> > > >>>To avoid spurious output in this situation, swap the order in which > > >>>the variables are cracked. This means that the head may grant head > > >>>may move if there is a delay, but the log tail will be stable, hence > > >>>ensure the tail does not jump the head accidentally. > > >>> > > >>>While this avoids the spurious head behind tail problem, it > > >>>introduces the opposite problem - the head can move more than a full > > >>>cycle past the tail. The code already handles this case by > > >>>indicating that the log is full (i.e. zero space available) but > > >>>that's still (generally) a spurious situation. > > >>> > > >>>Hence, if we detect that the head is more than a cycle ahead of the > > >>>tail or the head is behind the tail, start the calculation again by > > >>>resampling the variables and trying again. If we get too many > > >>>resamples, then throw a warning and return a full or empty log > > >>>appropriately. > > >>> > > >>>Signed-off-by: Dave Chinner<dchinner@redhat.com> > > >>>--- > > >> > > >>I am still getting the debug message: > > >> > > >> xlog_verify_grant_tail: space> BBTOB(tail_blocks) > > >> > > >>This is a real over grant. It has been a while since I did all the tests, but basically the only way to stop it is to have a lock between checking for xlog_space_left() and actually reserving the space. > > >> > > >>I am not a fan of another band-aid on a problem that is caused because we are granting space without locks. > > > > > >Mark, can you remind us of your testcase that produces this? > > >(sorry, I guess I should search for that old thread...) > > > > > >Thanks, > > >-Eric > > > > > >>--Mark. > > > > xfstest 273 hits it 100% of the time for me, as does 32+ process > > fsstress, pretty much any high log usage test. > > Yet I do those sorts of tests all the time on lots of different > machines and I rarely see either of the warnings that we are talking > about here. I sent the xlog_space_left() patch because a user > reported their log being spammed, not because they had a problem > with a filesystem hanging on log space. > > i.e. spam == false positive, log space hang == real problem. > > > I know Brian hit this with xfstest 273 when he was testing for > > commit 9a3a5dab. > > > > Using xfstest 273, I was seeing ten of thousand of bytes of over > > commit. > > That doesn't mean it's real. > > By definition, a false positive shows exactly the same numbers as a > real problem. The only way to determine it is a false positive is to > use some other method of verifying the numbers. That could be > checking the state of other related variables or simply resampling > the variables multiple times to reduce the probability that the > problem was caused by a temporary condition (i.e. a race). > > In this case, resampling them from their source is used to reverify > them, and if we get continual failures being detected, then the > likelihood of them being a false positive is very low. > > > From what I recall, I tried a separate lock for the > > write/reserve grant heads, > > We already have a per-grant head lock, and most of the calls to > xlog_space_left() are called under it. Hence the problem I'm > addressing is not a result of the grant head changing - the problem > is the log tail is updated concurrently. No amount of locking being > added to the grant heads will fix that problem. > > And regardless of the fact that xlog_verify_grant_tail() samples the > write grant head with locks (it's a atomic64_t to do this safely), without locks > the problem is that log tail can then change before it is sampled. > No amount of locking the grant heads will fix that problem, either. > > > put locks to make sure the verifier was > > not getting stale information, ordered the write/reserve ungrants > > relative to the grants, put in cache smp_mb() call. Some attempts > > were more successful than others, but the only way I could prevent > > the overgrant completely was to put back the global lock between the > > checking for space and the granting of space. > > Which won't have solved the problem of concurrent log tail updates > occurring after the grant head has been sampled. All this change > does is modify the concurrency patterns and hence made the race > condition much harder to hit by disabling pre-emption across > functions. > > IOWs, we can probably get exactly the same result from adding > "preempt_disable(); .... preempt_enable();" around the variable > sampling so that we cannot be pre-empted at all between the two > samples being taken. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs > -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] xfs: prevent spurious "head behind tail" warnings 2013-11-19 23:08 ` Mark Tinguely 2013-11-19 23:24 ` Eric Sandeen @ 2013-11-21 1:42 ` Dave Chinner 2013-11-21 13:01 ` Dave Chinner 1 sibling, 1 reply; 9+ messages in thread From: Dave Chinner @ 2013-11-21 1:42 UTC (permalink / raw) To: Mark Tinguely; +Cc: xfs On Tue, Nov 19, 2013 at 05:08:33PM -0600, Mark Tinguely wrote: > On 11/19/13 16:37, Dave Chinner wrote: > >From: Dave Chinner<dchinner@redhat.com> > > > >When xlog_space_left() cracks the grant head and the log tail, it > >does so without locking to synchronise the sampling of the > >variables. It samples the grant head first, so if there is a delay > >before it smaples the log tail, there is a window where the log tail > >could have moved onwards and be moved past the sampled value of the > >grant head. This then leads to the "xlog_space_left: head behind > >tail" warning message. > > > >To avoid spurious output in this situation, swap the order in which > >the variables are cracked. This means that the head may grant head > >may move if there is a delay, but the log tail will be stable, hence > >ensure the tail does not jump the head accidentally. > > > >While this avoids the spurious head behind tail problem, it > >introduces the opposite problem - the head can move more than a full > >cycle past the tail. The code already handles this case by > >indicating that the log is full (i.e. zero space available) but > >that's still (generally) a spurious situation. > > > >Hence, if we detect that the head is more than a cycle ahead of the > >tail or the head is behind the tail, start the calculation again by > >resampling the variables and trying again. If we get too many > >resamples, then throw a warning and return a full or empty log > >appropriately. > > > >Signed-off-by: Dave Chinner<dchinner@redhat.com> > >--- > > I am still getting the debug message: > > xlog_verify_grant_tail: space > BBTOB(tail_blocks) I'm not sure why that is relevant to this patch - it doesn't touch xlog_verify_grant_tail(), nor does it change the functioning of the code except to remove false positive warnings. Further, the comment about xlog_verify_grant_tail() says: * This check is run unlocked, so can give false positives. That's exactly what this patch addresses for xlog_space_left() - it prevents false positives when it is run unlocked. So we could do exactly the same thing to xlog_verify_grant_tail() as well. Patch attached below. > This is a real over grant. Evidence, please. > It has been a while since I did all the > tests, but basically the only way to stop it is to have a lock > between checking for xlog_space_left() and actually reserving the > space. Yes, so you demonstrated, but that change also completely destroyed scalability of the transaction subsystem in the process, hence is a non-starter, especially just to silence a debug-only false positive. IOWs, there was no evidence provided that the message was the result of an actual leak (i.e. a problem in the algorithm) rather than it being a false positive. The first step in determining if we really have a problem here is to cut the false positive rate, and that's exactly what I'm addressing here. Hence when we see this warning being emitted we will have a much higher confidence that it is a real problem and not a false positive. > I am not a fan of another band-aid on a problem that is caused > because we are granting space without locks. Lockless algorithms *by their very nature* involve verifying what was grabbed without a lock is valid before using it. It's a fundamental principle underlying lockless algorithms, whether they be lockless due to use of cmpxchg (like the log grant code) or RCU (like the xfs inode cache) or just plain old memory barriers to enforce memory access ordering. Just because the original code didn't do this doesn't mean doing it now is a band-aid. I understand lockless algorithms a whole lot better than I did 3 years ago when I wrote the original algorithm, and as such can see things I didn't clearly understand back then. Indeed, the comments indicate that I knew there were problems (i.e. false positive detection issues), but didn't understand them well enough to be able to mitigate them. I certainly understand them well enough now, so I'd hardly call mitigating known issues like this a band-aid. Looking at the bigger picture, we're adding lockless algorithms rapidly throughout the kernel. Hence it's critical that kernel developers understand how lockless algorithms work and realise that locks are no longer the preferred solution to the scalability problem. If a lockless algorithm is broken, then we need to understand exactly how it is broken, determine whether it can be fixed (e.g. is it just missing a critical memory barrier?), and as a last resort, if it needs to be replaced then we need to design a new algorithm that scales equivalently well. We need to start by killing as many false positives the lockless algorithm throws out and see what remains. We have no real evidence that there are fundamental problems with the lockless algorithm, so let's address the issues we know about first and go from there. Cheers, Dave. -- Dave Chinner david@fromorbit.com xfs: prevent spurious "space > BBTOB(tail_blocks)" warnings When xlog_verify_grant_tail() cracks the grant head and the log tail, it does so without locking to synchronise the sampling of the variables. It samples the grant head first, so if there is a delay before it samples the log tail, there is a window where the log tail could have moved onwards and be moved past the sampled value of the grant head. This then leads to the "xlog_verify_grant_tail: space > BBTOB(tail_blocks)" warning message. To avoid spurious output in this situation, swap the order in which the variables are cracked. This means that the head may grant head may move if there is a delay, but the log tail will be stable, hence ensure the tail does not jump the head accidentally. While this avoids the spurious head behind tail problem, it introduces the opposite problem - the head can move more than a full cycle past the tail. Hence, if we detect that the head is more than a cycle ahead of the tail or the head is behind the tail, start the calculation again by resampling the variables and trying again. If we get too many resamples, then throw a warning. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_log.c | 82 +++++++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 64 insertions(+), 18 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 0cfcc20..6857e09 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -3682,26 +3682,72 @@ STATIC void xlog_verify_grant_tail( struct xlog *log) { - int tail_cycle, tail_blocks; - int cycle, space; - - xlog_crack_grant_head(&log->l_write_head.grant, &cycle, &space); - xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, &tail_blocks); - if (tail_cycle != cycle) { - if (cycle - 1 != tail_cycle && - !(log->l_flags & XLOG_TAIL_WARN)) { - xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES, - "%s: cycle - 1 != tail_cycle", __func__); - log->l_flags |= XLOG_TAIL_WARN; - } + int tail_cycle; + int tail_blocks; + int head_cycle; + int head_bytes; + int retries = 0; +#define XLOG_VERIFY_MAX_RETRIES 5 - if (space > BBTOB(tail_blocks) && - !(log->l_flags & XLOG_TAIL_WARN)) { - xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES, - "%s: space > BBTOB(tail_blocks)", __func__); - log->l_flags |= XLOG_TAIL_WARN; - } + + /* + * If we've already detected an problem here, don't bother checking + * again. + */ + if (log->l_flags & XLOG_TAIL_WARN) + return; + do { + /* + * sample tail before head to avoid spurious warnings due to + * racing tail updates. We dump a memory barrier here to make + * sure we pick up the latest values that have been written by + * other threads each time through. + */ + smp_mb(); + xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, + &tail_blocks); + xlog_crack_grant_head(&log->l_write_head.grant, &head_cycle, + &head_bytes); + + /* + * if the cycles are the same, the head and tail can't be + * overlapping, so everything is ok and we are done. + */ + if (tail_cycle == head_cycle) + return; + + /* + * if the tail is on the previous cycle to the head and the head + * is before the tail, then all is good. + */ + if (tail_cycle == head_cycle - 1 && + BBTOB(tail_blocks) >= head_bytes) + return; + + /* + * Cycles don't match or the head overlapped the tail. + * Invalid, so let's go around again and resample to ensure this + * isn't a false positive. + */ + } while (retries++ < XLOG_VERIFY_MAX_RETRIES); + + /* + * OK, we're in trouble now - the head and tail are out of sync. Time to + * issue a warning about it + */ + if (head_cycle - 1 != tail_cycle) { + xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES, + "%s: head cycle - 1 != tail_cycle (0x%x/0x%x)", + __func__, head_cycle, tail_cycle); + } + + if (head_bytes > BBTOB(tail_blocks)) { + xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES, + "%s: head_bytes > BBTOB(tail_blocks) (0x%x/0x%x)", + __func__, head_bytes, BBTOB(tail_blocks)); } + + log->l_flags |= XLOG_TAIL_WARN; } /* check if it will fit */ _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] xfs: prevent spurious "head behind tail" warnings 2013-11-21 1:42 ` Dave Chinner @ 2013-11-21 13:01 ` Dave Chinner 2013-11-22 2:57 ` Dave Chinner 0 siblings, 1 reply; 9+ messages in thread From: Dave Chinner @ 2013-11-21 13:01 UTC (permalink / raw) To: Mark Tinguely; +Cc: xfs On Thu, Nov 21, 2013 at 12:42:55PM +1100, Dave Chinner wrote: > On Tue, Nov 19, 2013 at 05:08:33PM -0600, Mark Tinguely wrote: > > On 11/19/13 16:37, Dave Chinner wrote: > > >From: Dave Chinner<dchinner@redhat.com> > > > > > >When xlog_space_left() cracks the grant head and the log tail, it > > >does so without locking to synchronise the sampling of the > > >variables. It samples the grant head first, so if there is a delay > > >before it smaples the log tail, there is a window where the log tail > > >could have moved onwards and be moved past the sampled value of the > > >grant head. This then leads to the "xlog_space_left: head behind > > >tail" warning message. > > > > > >To avoid spurious output in this situation, swap the order in which > > >the variables are cracked. This means that the head may grant head > > >may move if there is a delay, but the log tail will be stable, hence > > >ensure the tail does not jump the head accidentally. > > > > > >While this avoids the spurious head behind tail problem, it > > >introduces the opposite problem - the head can move more than a full > > >cycle past the tail. The code already handles this case by > > >indicating that the log is full (i.e. zero space available) but > > >that's still (generally) a spurious situation. > > > > > >Hence, if we detect that the head is more than a cycle ahead of the > > >tail or the head is behind the tail, start the calculation again by > > >resampling the variables and trying again. If we get too many > > >resamples, then throw a warning and return a full or empty log > > >appropriately. > > > > > >Signed-off-by: Dave Chinner<dchinner@redhat.com> > > >--- > > > > I am still getting the debug message: > > > > xlog_verify_grant_tail: space > BBTOB(tail_blocks) > > I'm not sure why that is relevant to this patch - it doesn't touch > xlog_verify_grant_tail(), nor does it change the functioning of the > code except to remove false positive warnings. > > Further, the comment about xlog_verify_grant_tail() says: > > * This check is run unlocked, so can give false positives. > > That's exactly what this patch addresses for xlog_space_left() - > it prevents false positives when it is run unlocked. So we could do > exactly the same thing to xlog_verify_grant_tail() as well. Patch > attached below. > > > This is a real over grant. > > Evidence, please. So, I've done some more testing with these patches, and found a couple of interesting things. I found that xfs/297 with a bunch of memory, VM and lock debugging turned on has fairly reliably reproduced the messages from both functions, albeit slowly (taking 10 minutes a cycle, 3-4 cycles to get a single xlog_space_left() failure). First thing I did was remove the "print once" flag from xlog_verify_grant_tail(), and I generally see 2-3 outputs from it during a run, usaully tens to hundreds of seconds apart. No hangs, and the difference in byte counts generally is under 200k. It then made it print each retry, knowing that it wouldn't swamp the logs. What each event gave was this: [ 201.796132] XFS (vdb): xlog_verify_grant_tail: retries 0, head_bytes > BBTOB(tail_blocks) (8365828/8126464) [ 201.798514] XFS (vdb): xlog_verify_grant_tail: retries 1, head_bytes > BBTOB(tail_blocks) (8365828/8126464) [ 201.800890] XFS (vdb): xlog_verify_grant_tail: retries 2, head_bytes > BBTOB(tail_blocks) (8365828/8126464) [ 201.802951] XFS (vdb): xlog_verify_grant_tail: retries 3, head_bytes > BBTOB(tail_blocks) (8365828/8126464) [ 201.805238] XFS (vdb): xlog_verify_grant_tail: retries 4, head_bytes > BBTOB(tail_blocks) (8365828/8126464) [ 201.807378] XFS (vdb): xlog_verify_grant_tail: retries 5, head_bytes > BBTOB(tail_blocks) (8365828/8126464) [ 201.809619] XFS (vdb): xlog_verify_grant_tail: head_bytes > BBTOB(tail_blocks) (8365828/8126464) Which, as you can see, resampling didn't help prevent the message output at all. The region of the log that this warning comes up in is random; there is no pattern to it the repeated occurrences. Hence I will drop the repeated sampling part of the patches, but keep the memory barrier and preemption disabling around the sampling... Scratch that - I just had the repeated sampling avoid a warning - the fifth sample saw the problem go away.... [ 140.728172] XFS (vdb): xlog_verify_grant_tail: retries 0, head_bytes > BBTOB(tail_blocks) (4099150/3932160) [ 140.731761] XFS (vdb): xlog_verify_grant_tail: retries 1, head_bytes > BBTOB(tail_blocks) (4099150/3932160) [ 140.735495] XFS (vdb): xlog_verify_grant_tail: retries 2, head_bytes > BBTOB(tail_blocks) (4099150/3932160) [ 140.739059] XFS (vdb): xlog_verify_grant_tail: retries 3, head_bytes > BBTOB(tail_blocks) (4099150/3932160) [ 140.742561] XFS (vdb): xlog_verify_grant_tail: retries 4, head_bytes > BBTOB(tail_blocks) (4099150/3932160) What this points out is that the while the problem state is temporary, the cause of it is not racing updates during sampling. We do, indeed, appear to have a temporary state where the grant head has passed over the log tail. By the next check, the grant head no longer covers the tail. The next step is to dump the stack when-ever such a warning is emitted to find out what if there is a specific operation is that is triggering them. All are coming from this path: ..... xfs_trans_reserve xfs_log_reserve xlog_verify_grant_tail And so are occurring after the grant head has been moved forward by the log reservation. That's rather interesting. xlog_verify_grant_tail() verifies the log tail against the write grant head, not the reserve grant head. The reserve grant head is what prevents log space overcommit, and that's the one that the xfs_log_reserve sleeps on waiting for log space to become available. Once space is available on the reserve grant head, it adds the space to both the reserve grant and the write grant. Hence we know that we haven't unintentionally overcommitted the reserve head via transaction reservations, so why does the write head go past the log tail, and how is it different to the reserve grant head. Well, the write grant and the reserve grant are mostly tracking the same thing. The difference comes, however, in permanent transactions. While the permanent transaction has reservations for multiple commits available (which they start with) both the write grant and the reserve grant are updated during transaction commit xfs_trans_commit xfs_log_commit_cil xfs_log_done xlog_regrant_reserve_log_space ticket->t_unit-- xlog_grant_sub_space(reserve head, unused resv); xlog_grant_sub_space(write head, unused resv); At this point, if the permanent log transaction still has reserved units available, it returns. We'll come back to this. For a rolling transaction, just before transaction commit, the transaction will have been duplicated via xfs_trans_dup(). This takes a reference to the ticket, and attaches it to the new transaction. The old transaction is then committed like above, and we continue with the same ticket in the above but with one less reservation unit. We then call xfs_trans_reserve on the ticket again, and that enters xfs_log_regrant rather than xfs_log_reserve like the original ticket reservation would have. if the ticket still has reservation units left on it, it does nothing that push the AIL and reset the reservation for the new transaction. Where it gets interesting is when the ticket unit reservation runs out, and we have to get new reservations to continue a rolling transaction. Very few permanent log transactions actually need this. In fact, I think xfs_itruncate_extents() is the only code that makes use of this facility, as all other transactions have bound limits in the code. xfs_itruncate_extents() can remove an unbound number of extents, and hence the rolling transaction is essentially unbound in size. Remember this, too. So, when we run out of reservation units on the ticket, we do this: xfs_trans_dup xfs_trans_commit xfs_log_commit_cil xfs_log_done xlog_regrant_reserve_log_space xlog_grant_sub_space(reserve head, unused resv); xlog_grant_sub_space(write head, unused resv); xlog_grant_add_space(reserve head, unit reservation); i.e. is simply steals the unit reservation from the reserve head *without caring about the amount of reservation space that is available*. This can push the reserve grant head past the tail if the reservation space is already full. While this might seem bad and wrong, it actually isn't because the ticket doesn't yet have *write grant space*. So, following the commit, we do: xfs_trans_reserve xfs_log_regrant xlog_grant_head_check(write head, space needed) xlog_grant_add_space(write head, space needed) It's the xlog_grant_head_check() that blocks waiting for write grant space to come available, and hence we dont over-commit the log space in such permanent transactions. Yes, we overcommit the reservation spae, but we never overcommit the write space and hence we don't overcommit the log space in this situation. But, what is does point out is that the reserve head can *validly* overcommit log space and overlap the tail of the log. This doesn't explain the write grant head getting ahead of the log tail, but let's ignore that and look at ithe warnings I'm getting from xlog_space_left() for a moment. [ 1458.577318] XFS (vdb): xlog_space_left: The grant head is in front of the log tail. [ 1458.577318] tail_cycle = 17, tail_bytes = 10223616 [ 1458.577318] GH cycle = 19, GH bytes = 74284 [ 1458.577320] XFS: Assertion failed: 0, file: fs/xfs/xfs_log.c, line: 1194 [ 1458.580128] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC [ 1458.580128] Modules linked in: [ 1458.580128] CPU: 0 PID: 16722 Comm: fsstress Not tainted 3.12.0-dgc+ #60 [ 1458.580128] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 1458.580128] task: ffff88006d29e1c0 ti: ffff8800592e2000 task.ti: ffff8800592e2000 [ 1458.580128] RIP: 0010:[<ffffffff8149cf32>] [<ffffffff8149cf32>] assfail+0x22/0x30 [ 1458.580128] RSP: 0018:ffff8800592e3b58 EFLAGS: 00010292 [ 1458.580128] RAX: 000000000000003c RBX: ffff88007764da68 RCX: 000000000000006c [ 1458.580128] RDX: 00000000000000f2 RSI: 0000000000000046 RDI: 0000000000000246 [ 1458.580128] RBP: ffff8800592e3b58 R08: 0000000000000096 R09: 0000000000001383 [ 1458.580128] R10: 0000000000000000 R11: 0000000000001382 R12: 000000000001222c [ 1458.580128] R13: ffff8800592e3c8c R14: 0000000000000000 R15: 0000000000000000 [ 1458.580128] FS: 00007f42f1fbe700(0000) GS:ffff88007f800000(0000) knlGS:0000000000000000 [ 1458.580128] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1458.580128] CR2: 00007f42f14c4c60 CR3: 000000005fd19000 CR4: 00000000000006f0 [ 1458.580128] Stack: [ 1458.580128] ffff8800592e3b98 ffffffff814f7a47 ffff880000000013 ffffffff0001222c [ 1458.580128] ffff8800517ec4f0 ffff8800517ec4f0 ffff88007764da68 00000000000005be [ 1458.580128] ffff8800592e3bb8 ffffffff814f8480 ffff88006312be00 ffff88007764da68 [ 1458.580128] Call Trace: [ 1458.580128] [<ffffffff814f7a47>] xlog_space_left+0xf7/0x160 [ 1458.580128] [<ffffffff814f8480>] xlog_grant_push_ail+0x50/0xf0 [ 1458.580128] [<ffffffff814fa19e>] xfs_log_regrant+0x5e/0x190 [ 1458.580128] [<ffffffff814a426d>] xfs_trans_reserve+0xfd/0x310 [ 1458.580128] [<ffffffff814a5596>] ? xfs_trans_commit+0xf6/0x2d0 [ 1458.580128] [<ffffffff814861fa>] xfs_bmap_finish+0x11a/0x1e0 [ 1458.580128] [<ffffffff814e5dfb>] xfs_itruncate_extents+0x24b/0x440 [ 1458.580128] [<ffffffff814e62c8>] xfs_inactive_truncate+0xd8/0x140 [ 1458.580128] [<ffffffff814e74f0>] xfs_inactive+0x1d0/0x1f0 [ 1458.580128] [<ffffffff814a0656>] xfs_fs_evict_inode+0xa6/0xf0 [ 1458.580128] [<ffffffff811bd2a8>] evict+0xb8/0x1b0 [ 1458.580128] [<ffffffff811bda45>] iput+0x105/0x190 [ 1458.580128] [<ffffffff811b01e2>] do_unlinkat+0x182/0x2a0 [ 1458.580128] [<ffffffff811a44ee>] ? ____fput+0xe/0x10 [ 1458.580128] [<ffffffff810af96c>] ? task_work_run+0xac/0xe0 [ 1458.580128] [<ffffffff81ae3655>] ? do_device_not_available+0x35/0x60 [ 1458.580128] [<ffffffff811b3446>] SyS_unlink+0x16/0x20 [ 1458.580128] [<ffffffff81aead69>] system_call_fastpath+0x16/0x1b First thing to note: this has come from xfs_itruncate_extents(), pushing the AIL from the xfs_log_regrant(). There's a good chance we have just overcommitted the reserve grant head here, because xlog_grant_push ail() does: free_bytes = xlog_space_left(log, &log->l_reserve_head.grant); Yup, in the situation where we have just validly overcommitted the reserve grant space, xlog_space_left has detected that we've overcommited the reserve grant space and assert failed. The fact that overrun here is by a couple of hundred kilobytes is no surprise - it's in the ballpark of a itruncate transaction unit reservation (IIRC ~150-200k), and there can be multiple threads in the same overcommit state at the same time so seeing 300-400k as the overcommit is not unexpected. Further, this overcommit situation is handled just fine by the xlog_space_left() - it returns 0 bytes of log space available - and so there are no problems with this situation at all. In fact, now I understand why the original code in xlog_space_left() handled this situation silently without emmitting an error: else if (tail_cycle + 1 < head_cycle) return 0; 200 lines of explanation for 2 lines of code might be some kind of record. :/ What that means, though, is that I still haven't been able to reproduce the "head behind tail" conditions that people are complaining about, and i don't yet have an answer to the xlog_verify_grant_tail() issue. I'll take on xlog_verify_grant_tail() where I left it in the morning. It's going on midnight and I've already drunk enough expensive scotch for the evening plowing through this... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] xfs: prevent spurious "head behind tail" warnings 2013-11-21 13:01 ` Dave Chinner @ 2013-11-22 2:57 ` Dave Chinner 0 siblings, 0 replies; 9+ messages in thread From: Dave Chinner @ 2013-11-22 2:57 UTC (permalink / raw) To: Mark Tinguely; +Cc: xfs On Fri, Nov 22, 2013 at 12:01:56AM +1100, Dave Chinner wrote: > On Thu, Nov 21, 2013 at 12:42:55PM +1100, Dave Chinner wrote: > > On Tue, Nov 19, 2013 at 05:08:33PM -0600, Mark Tinguely wrote: > > > On 11/19/13 16:37, Dave Chinner wrote: > > > >From: Dave Chinner<dchinner@redhat.com> > > > > > > > >When xlog_space_left() cracks the grant head and the log tail, it > > > >does so without locking to synchronise the sampling of the > > > >variables. It samples the grant head first, so if there is a delay > > > >before it smaples the log tail, there is a window where the log tail > > > >could have moved onwards and be moved past the sampled value of the > > > >grant head. This then leads to the "xlog_space_left: head behind > > > >tail" warning message. ..... > > > > > > I am still getting the debug message: > > > > > > xlog_verify_grant_tail: space > BBTOB(tail_blocks) > > > > I'm not sure why that is relevant to this patch - it doesn't touch > > xlog_verify_grant_tail(), nor does it change the functioning of the > > code except to remove false positive warnings. > > > > Further, the comment about xlog_verify_grant_tail() says: > > > > * This check is run unlocked, so can give false positives. > > > > That's exactly what this patch addresses for xlog_space_left() - > > it prevents false positives when it is run unlocked. So we could do > > exactly the same thing to xlog_verify_grant_tail() as well. Patch > > attached below. > > > > > This is a real over grant. > > > > Evidence, please. > > So, I've done some more testing with these patches, and found a > couple of interesting things. I found that xfs/297 with a bunch of > memory, VM and lock debugging turned on has fairly reliably > reproduced the messages from both functions, albeit slowly (taking > 10 minutes a cycle, 3-4 cycles to get a single xlog_space_left() > failure). > > First thing I did was remove the "print once" flag from > xlog_verify_grant_tail(), and I generally see 2-3 outputs from it > during a run, usaully tens to hundreds of seconds apart. No hangs, > and the difference in byte counts generally is under 200k. > > It then made it print each retry, knowing that it wouldn't swamp the > logs. What each event gave was this: > > [ 201.796132] XFS (vdb): xlog_verify_grant_tail: retries 0, head_bytes > BBTOB(tail_blocks) (8365828/8126464) > [ 201.798514] XFS (vdb): xlog_verify_grant_tail: retries 1, head_bytes > BBTOB(tail_blocks) (8365828/8126464) > [ 201.800890] XFS (vdb): xlog_verify_grant_tail: retries 2, head_bytes > BBTOB(tail_blocks) (8365828/8126464) > [ 201.802951] XFS (vdb): xlog_verify_grant_tail: retries 3, head_bytes > BBTOB(tail_blocks) (8365828/8126464) > [ 201.805238] XFS (vdb): xlog_verify_grant_tail: retries 4, head_bytes > BBTOB(tail_blocks) (8365828/8126464) > [ 201.807378] XFS (vdb): xlog_verify_grant_tail: retries 5, head_bytes > BBTOB(tail_blocks) (8365828/8126464) > [ 201.809619] XFS (vdb): xlog_verify_grant_tail: head_bytes > BBTOB(tail_blocks) (8365828/8126464) > > Which, as you can see, resampling didn't help prevent the message > output at all. The region of the log that this warning comes up in > is random; there is no pattern to it the repeated occurrences. Hence > I will drop the repeated sampling part of the patches, but keep the > memory barrier and preemption disabling around the sampling... > > Scratch that - I just had the repeated sampling avoid a warning - > the fifth sample saw the problem go away.... > > [ 140.728172] XFS (vdb): xlog_verify_grant_tail: retries 0, head_bytes > BBTOB(tail_blocks) (4099150/3932160) > [ 140.731761] XFS (vdb): xlog_verify_grant_tail: retries 1, head_bytes > BBTOB(tail_blocks) (4099150/3932160) > [ 140.735495] XFS (vdb): xlog_verify_grant_tail: retries 2, head_bytes > BBTOB(tail_blocks) (4099150/3932160) > [ 140.739059] XFS (vdb): xlog_verify_grant_tail: retries 3, head_bytes > BBTOB(tail_blocks) (4099150/3932160) > [ 140.742561] XFS (vdb): xlog_verify_grant_tail: retries 4, head_bytes > BBTOB(tail_blocks) (4099150/3932160) > > What this points out is that the while the problem state is > temporary, the cause of it is not racing updates during sampling. We > do, indeed, appear to have a temporary state where the grant head > has passed over the log tail. By the next check, the grant head no > longer covers the tail. > > The next step is to dump the stack when-ever such a warning is > emitted to find out what if there is a specific operation is that > is triggering them. All are coming from this path: > > ..... > xfs_trans_reserve > xfs_log_reserve > xlog_verify_grant_tail > > And so are occurring after the grant head has been moved forward by > the log reservation. That's rather interesting. > > xlog_verify_grant_tail() verifies the log tail against the write > grant head, not the reserve grant head. The reserve grant head is > what prevents log space overcommit, and that's the one that the > xfs_log_reserve sleeps on waiting for log space to become available. > Once space is available on the reserve grant head, it adds the space > to both the reserve grant and the write grant. > > Hence we know that we haven't unintentionally overcommitted the > reserve head via transaction reservations, so why does the write > head go past the log tail, and how is it different to the reserve > grant head. > > Well, the write grant and the reserve grant are mostly tracking the > same thing. The difference comes, however, in permanent > transactions. While the permanent transaction has reservations for > multiple commits available (which they start with) both the write > grant and the reserve grant are updated during transaction commit > > xfs_trans_commit > xfs_log_commit_cil > xfs_log_done > xlog_regrant_reserve_log_space > ticket->t_unit-- > xlog_grant_sub_space(reserve head, unused resv); > xlog_grant_sub_space(write head, unused resv); > > At this point, if the permanent log transaction still has reserved > units available, it returns. We'll come back to this. > > For a rolling transaction, just before transaction commit, the > transaction will have been duplicated via xfs_trans_dup(). This > takes a reference to the ticket, and attaches it to the new > transaction. The old transaction is then committed like above, and > we continue with the same ticket in the above but with one less > reservation unit. > > We then call xfs_trans_reserve on the ticket again, and that enters > xfs_log_regrant rather than xfs_log_reserve like the original ticket > reservation would have. if the ticket still has reservation units > left on it, it does nothing that push the AIL and reset the > reservation for the new transaction. > > Where it gets interesting is when the ticket unit reservation runs > out, and we have to get new reservations to continue a rolling > transaction. Very few permanent log transactions actually need this. > In fact, I think xfs_itruncate_extents() is the only code that makes > use of this facility, as all other transactions have bound limits in > the code. xfs_itruncate_extents() can remove an unbound number of > extents, and hence the rolling transaction is essentially unbound in > size. Remember this, too. > > So, when we run out of reservation units on the ticket, we do this: > > xfs_trans_dup > xfs_trans_commit > xfs_log_commit_cil > xfs_log_done > xlog_regrant_reserve_log_space > xlog_grant_sub_space(reserve head, unused resv); > xlog_grant_sub_space(write head, unused resv); > xlog_grant_add_space(reserve head, unit reservation); > > i.e. is simply steals the unit reservation from the reserve head > *without caring about the amount of reservation space that is > available*. This can push the reserve grant head past the tail if > the reservation space is already full. > > While this might seem bad and wrong, it actually isn't because the > ticket doesn't yet have *write grant space*. So, following the > commit, we do: > > xfs_trans_reserve > xfs_log_regrant > xlog_grant_head_check(write head, space needed) > xlog_grant_add_space(write head, space needed) > > It's the xlog_grant_head_check() that blocks waiting for write grant > space to come available, and hence we dont over-commit the log space > in such permanent transactions. Yes, we overcommit the reservation > spae, but we never overcommit the write space and hence we don't > overcommit the log space in this situation. > > But, what is does point out is that the reserve head can *validly* > overcommit log space and overlap the tail of the log. This doesn't > explain the write grant head getting ahead of the log tail, but > let's ignore that and look at ithe warnings I'm getting from > xlog_space_left() for a moment. [snip xlog_space_left() "overcommit is OK" debug] > > What that means, though, is that I still haven't been able to > reproduce the "head behind tail" conditions that people are > complaining about, and i don't yet have an answer to the > xlog_verify_grant_tail() issue. So, how does the write grant head overlap the log tail? My current hypothesis is that we're seeing the same pattern of "add space without first checking" that results in overcommit of the write head, just like we are doing with the reserve head. That is, the reserve grant is overcommitting because we check and throttle in xfs_log_reserve(), but then do an unchecked addition on permanent transaction commit. In the case of the write head, we do checking and throttling on permanent transaction reserve via xfs_log_regrant(), but we do unchecked addition in xfs_log_reserve(). Put simply, if we throttle in xfs_log_regrant because we don't have write space available, we can have another transaction (say with a smaller unit reservation) come through xfs_log_reserve() and get reserve head space, which it then unconditionally adds to the write head. resv head write head X Y permanent xact P1 completes X - curr Y - curr + unit_P1 (overcommit) xact T2 reserve OC, blocks xact P1 reserve on dup Not enough space, blocks .... log tail moves a small amount space availble enough to fulfil T2 reserve, P1 write grants xact P1 wakes X - curr Y - curr + unit_P1 + unit_P1 xact T2 wakes X - curr Y - curr + unit_P1 + unit_P1 + unit_T2 + unit_T2 (overcommit) In this case, it appears to me that the overcommit of the write space is valid because transactions that block on write space overcommit also hold a reserve space grant. We already know that if we overcommit the reserve head due to a permanent transaction regrant, then all new transactions are throttled until the reserve space overcommit situation goes away. Permanent transactions can continue to regrant even in an overcommit situation, but we prevent real overcommit by throttling the write grants. IOWs, if we are in a reserve head overcommit situation, we cannot then overcommit the write head because all new write head regrants are checked and prevented from overcommit. If we are in a write space overcommit state, then write regrants are blocked after consuming reserve space. This will cause reserve space to run out and so we can't can't overcommit the log with new transactions. Permanent transactions can the drive the reserve space into overcommit, but then block on the write space grant head that is already over committed. At this point, the reserve space overcommit situation will not be cleared until the write space overcommit is resolved. That is, the only way to make progress at this point is by the log tail moving forwards due to metadata IO completion. If it only moves forward a small amount, then write space will be the first to allow progress, and so the permanent transactions will make progress and complete ahead of new transactions being started, releasing their overcommitted reserve space grants. If the log tail jumps by a long way, it will alleviate the overcommit on both grant heads and both new and permanent transactions will immediately make progress. In summary: overcommits on both the reserve grant space and the write grant space can and do occur, and both are valid temporary conditions that resolve themselves through mechanisms built into the algorithms without resulting in a real log space overcommit. Hence debug code that triggers on grant head overcommit situations will always generate false positive warnings. I will rework the two patches to take this into account... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2013-11-22 2:58 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-11-19 22:37 [PATCH] xfs: prevent spurious "head behind tail" warnings Dave Chinner 2013-11-19 23:08 ` Mark Tinguely 2013-11-19 23:24 ` Eric Sandeen 2013-11-19 23:44 ` Mark Tinguely 2013-11-21 2:19 ` Dave Chinner 2013-11-21 3:16 ` Dave Chinner 2013-11-21 1:42 ` Dave Chinner 2013-11-21 13:01 ` Dave Chinner 2013-11-22 2:57 ` Dave Chinner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox