public inbox for linux-xfs@vger.kernel.org
* [RFC PATCH v2 0/3] xfs: fix xfsaild races and re-enable idle mode
@ 2012-05-21 18:49 Brian Foster
  2012-05-21 18:49 ` [RFC PATCH v2 1/3] xfs: re-enable xfsaild " Brian Foster
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Brian Foster @ 2012-05-21 18:49 UTC (permalink / raw)
  To: xfs; +Cc: Brian Foster

Hi all,

We reproduced and debugged several hangs in a RHEL 6.3 kernel that
still supports xfsaild idle mode. Our short-term fix was to disable
idle mode, as upstream does, but I'd like to send out a couple of
potential fixes that allow us to re-enable it, assuming there are no
other problems I'm not aware of. The details of the bug are at:

https://bugzilla.redhat.com/show_bug.cgi?id=813137

... but I'll try to provide all relevant data in this post. The
reproducer is xfstests 273 running in a 100-iteration loop. I have
reproduced this hang on upstream kernels quite reliably with commit
670ce93f reverted. The performance enhancement in that commit makes
this much harder to reproduce.

With the proposed modifications, I've probably run 5+ 100-loop
iterations of test 273 without reproducing a hang. Previously, I
was able to reproduce the first hang with 100% reliability and the
second hang reproduced 10 minutes or so after starting a second
100-loop test (with the first fix applied).

I still have to do a full xfstests run, but the changes are small
enough that I wanted to send them out before I got too far. Thanks.

Changes since v1:
- Rebased against a pristine tree.

Brian Foster (3):
  xfs: re-enable xfsaild idle mode
  xfs: fix xfsaild hang due to premature idle
  xfs: fix xfsaild hang due to lost wake ups

 fs/xfs/xfs_trans_ail.c  |    8 ++++----
 fs/xfs/xfs_trans_priv.h |    1 +
 2 files changed, 5 insertions(+), 4 deletions(-)

-- 
1.7.7.6

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* [RFC PATCH v2 1/3] xfs: re-enable xfsaild idle mode
  2012-05-21 18:49 [RFC PATCH v2 0/3] xfs: fix xfsaild races and re-enable idle mode Brian Foster
@ 2012-05-21 18:49 ` Brian Foster
  2012-05-21 18:49 ` [RFC PATCH v2 2/3] xfs: fix xfsaild hang due to premature idle Brian Foster
  2012-05-21 18:49 ` [RFC PATCH v2 3/3] xfs: fix xfsaild hang due to lost wake ups Brian Foster
  2 siblings, 0 replies; 7+ messages in thread
From: Brian Foster @ 2012-05-21 18:49 UTC (permalink / raw)
  To: xfs; +Cc: Brian Foster

Allow the xfsaild kernel thread to sleep when there is no work
to do, by setting tout to 0 instead of 50 in the idle case.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_trans_ail.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 1dead07..ae620eb 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -508,7 +508,7 @@ out_done:
 		ailp->xa_last_pushed_lsn = 0;
 		ailp->xa_log_flush = 0;
 
-		tout = 50;
+		tout = 0;
 	} else if (XFS_LSN_CMP(lsn, target) >= 0) {
 		/*
 		 * We reached the target so wait a bit longer for I/O to
-- 
1.7.7.6


* [RFC PATCH v2 2/3] xfs: fix xfsaild hang due to premature idle
  2012-05-21 18:49 [RFC PATCH v2 0/3] xfs: fix xfsaild races and re-enable idle mode Brian Foster
  2012-05-21 18:49 ` [RFC PATCH v2 1/3] xfs: re-enable xfsaild " Brian Foster
@ 2012-05-21 18:49 ` Brian Foster
  2012-05-21 21:19   ` Mark Tinguely
  2012-05-21 18:49 ` [RFC PATCH v2 3/3] xfs: fix xfsaild hang due to lost wake ups Brian Foster
  2 siblings, 1 reply; 7+ messages in thread
From: Brian Foster @ 2012-05-21 18:49 UTC (permalink / raw)
  To: xfs; +Cc: Brian Foster

Running xfstests 273 in a loop reproduces an XFS lockup due to
xfsaild entering idle mode indefinitely. The following
high-level sequence of events leads to the hang:

- xfsaild is running, hits the stuck item threshold and reschedules,
  setting xa_last_pushed_lsn appropriately.
- xa_target is updated.
- xfsaild restarts from the previous xa_last_pushed_lsn, hits the
  new target and enters idle mode, even though the previously
  stuck items still populate the ail.

Modify the tout logic to only enter idle mode when the ail is empty.
IOW, if we hit the target but did not perform the current scan from
the start of the ail, reschedule at least one more time.
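
To make the condition concrete, here is a minimal standalone model of
the idle test, not the kernel code itself. The names ail_model and
would_idle() are made up for illustration; only count and
xa_last_pushed_lsn correspond to what the patch touches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the one struct xfs_ail field used here. */
struct ail_model {
	uint64_t last_pushed_lsn;	/* 0: the scan began at the head of the AIL */
};

/*
 * Simplified model of the test at out_done in xfsaild_push().
 * Returns true when the thread would take the idle branch.
 */
static bool would_idle(const struct ail_model *ailp, int count, bool patched)
{
	if (patched)
		return !count && !ailp->last_pushed_lsn;
	return !count;
}
```

With a non-zero last_pushed_lsn and count == 0, the unpatched test
idles, while the patched test falls through and reschedules.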

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_trans_ail.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index ae620eb..8bc8aa2 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -503,7 +503,7 @@ xfsaild_push(
 
 	/* assume we have more work to do in a short while */
 out_done:
-	if (!count) {
+	if (!count && !ailp->xa_last_pushed_lsn) {
 		/* We're past our target or empty, so idle */
 		ailp->xa_last_pushed_lsn = 0;
 		ailp->xa_log_flush = 0;
-- 
1.7.7.6


* [RFC PATCH v2 3/3] xfs: fix xfsaild hang due to lost wake ups
  2012-05-21 18:49 [RFC PATCH v2 0/3] xfs: fix xfsaild races and re-enable idle mode Brian Foster
  2012-05-21 18:49 ` [RFC PATCH v2 1/3] xfs: re-enable xfsaild " Brian Foster
  2012-05-21 18:49 ` [RFC PATCH v2 2/3] xfs: fix xfsaild hang due to premature idle Brian Foster
@ 2012-05-21 18:49 ` Brian Foster
  2 siblings, 0 replies; 7+ messages in thread
From: Brian Foster @ 2012-05-21 18:49 UTC (permalink / raw)
  To: xfs; +Cc: Brian Foster

Running xfstests 273 in a loop reproduces an XFS lockup due to
xfsaild entering idle mode indefinitely. The following
high-level sequence of events leads to the hang:

- xfsaild is running with a cached target lsn
- xfs_ail_push() is invoked, updates ailp->xa_target and
  invokes wake_up_process(). wake_up_process() returns 0
  because xfsaild is already running.
- xfsaild enters idle mode having met its current target.

Once in the described state, xfs_ail_push() is invoked many
more times with the already set threshold_lsn, but these calls
never reach wake_up_process() because none of them moves the
target forward. Add a flag to xfs_ail to capture whether an
issued wake actually succeeds. If not, continue issuing wakes
until we know one has been successful for the current target.
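
The race can be sketched in a small userspace model. Everything below
is hypothetical (task_model, ail_model, model_wake_up() and
model_ail_push() are illustrative names); the only assumption taken
from the kernel is that wake_up_process() returns 0 when the task is
already running.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the xfsaild task. */
struct task_model {
	bool running;
};

/* Hypothetical stand-in for the struct xfs_ail fields involved. */
struct ail_model {
	uint64_t target;		/* models xa_target */
	bool pending_wake;		/* models xa_pending_wake */
	struct task_model task;
};

/* Mimics wake_up_process(): 1 if the task was woken, 0 if already running. */
static int model_wake_up(struct task_model *t)
{
	if (t->running)
		return 0;
	t->running = true;
	return 1;
}

/* Models the early-return test and the wake in xfs_ail_push(). */
static void model_ail_push(struct ail_model *ailp, uint64_t threshold,
			   bool patched)
{
	bool no_new_target = threshold <= ailp->target;

	/* Patched: only bail out if no earlier wake is still outstanding. */
	if (patched ? (no_new_target && !ailp->pending_wake) : no_new_target)
		return;

	ailp->target = threshold;
	ailp->pending_wake = !model_wake_up(&ailp->task);
}
```

In the unpatched model, a push that finds the task running followed by
the task going idle leaves every later push with the same threshold
returning early. In the patched model the second push retries the wake.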

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_trans_ail.c  |    4 ++--
 fs/xfs/xfs_trans_priv.h |    1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 8bc8aa2..e886785 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -583,7 +583,7 @@ xfs_ail_push(
 
 	lip = xfs_ail_min(ailp);
 	if (!lip || XFS_FORCED_SHUTDOWN(ailp->xa_mount) ||
-	    XFS_LSN_CMP(threshold_lsn, ailp->xa_target) <= 0)
+	    ((XFS_LSN_CMP(threshold_lsn, ailp->xa_target) <= 0) && !ailp->xa_pending_wake))
 		return;
 
 	/*
@@ -594,7 +594,7 @@ xfs_ail_push(
 	xfs_trans_ail_copy_lsn(ailp, &ailp->xa_target, &threshold_lsn);
 	smp_wmb();
 
-	wake_up_process(ailp->xa_task);
+	ailp->xa_pending_wake = !wake_up_process(ailp->xa_task);
 }
 
 /*
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 8ab2ced..62bb4a9 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -71,6 +71,7 @@ struct xfs_ail {
 	spinlock_t		xa_lock;
 	xfs_lsn_t		xa_last_pushed_lsn;
 	int			xa_log_flush;
+	int			xa_pending_wake;
 };
 
 /*
-- 
1.7.7.6


* Re: [RFC PATCH v2 2/3] xfs: fix xfsaild hang due to premature idle
  2012-05-21 18:49 ` [RFC PATCH v2 2/3] xfs: fix xfsaild hang due to premature idle Brian Foster
@ 2012-05-21 21:19   ` Mark Tinguely
  2012-05-22  0:31     ` Brian Foster
  0 siblings, 1 reply; 7+ messages in thread
From: Mark Tinguely @ 2012-05-21 21:19 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On 05/21/12 13:49, Brian Foster wrote:
> Running xfstests 273 in a loop reproduces an XFS lockup due to
> xfsaild entering idle mode indefinitely. The following
> high-level sequence of events leads to the hang:
>
> - xfsaild is running, hits the stuck item threshold and reschedules,
>    setting xa_last_pushed_lsn appropriately.
> - xa_target is updated.
> - xfsaild restarts from the previous xa_last_pushed_lsn, hits the
>    new target and enters idle mode, even though the previously
>    stuck items still populate the ail.
>
> Modify the tout logic to only enter idle mode when the ail is empty.
> IOW, if we hit the target but did not perform the current scan from
> the start of the ail, reschedule at least one more time.
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   fs/xfs/xfs_trans_ail.c |    2 +-
>   1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index ae620eb..8bc8aa2 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -503,7 +503,7 @@ xfsaild_push(
>
>   	/* assume we have more work to do in a short while */
>   out_done:
> -	if (!count) {
> +	if (!count && !ailp->xa_last_pushed_lsn) {
>   		/* We're past our target or empty, so idle */
>   		ailp->xa_last_pushed_lsn = 0;
>   		ailp->xa_log_flush = 0;

There is another patch in the OSS XFS tree (43ff2122 in
git://oss.sgi.com/xfs/xfs), not yet in Linus' tree, that touches this
area, which is why this patch is not applying cleanly.

So the xfs_log_force() will un-stick the stuck items from the previous
pass which set ailp->xa_last_pushed_lsn = 0; I'd like to be reassured
that count will be non-zero and that you won't go idle with items
still stuck.


The problem that we are chasing in the AIL seems different from the
lost wakeup (next patch), but it would be interesting to have the patch
in the kernel for testing.

--Mark Tinguely


* Re: [RFC PATCH v2 2/3] xfs: fix xfsaild hang due to premature idle
  2012-05-21 21:19   ` Mark Tinguely
@ 2012-05-22  0:31     ` Brian Foster
  2012-05-22 13:10       ` Mark Tinguely
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Foster @ 2012-05-22  0:31 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: xfs

On 05/21/2012 05:19 PM, Mark Tinguely wrote:
> On 05/21/12 13:49, Brian Foster wrote:
>> Running xfstests 273 in a loop reproduces an XFS lockup due to
>> xfsaild entering idle mode indefinitely. The following
>> high-level sequence of events leads to the hang:
>>
>> - xfsaild is running, hits the stuck item threshold and reschedules,
>>    setting xa_last_pushed_lsn appropriately.
>> - xa_target is updated.
>> - xfsaild restarts from the previous xa_last_pushed_lsn, hits the
>>    new target and enters idle mode, even though the previously
>>    stuck items still populate the ail.
>>
>> Modify the tout logic to only enter idle mode when the ail is empty.
>> IOW, if we hit the target but did not perform the current scan from
>> the start of the ail, reschedule at least one more time.
>>
>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>> ---
>>   fs/xfs/xfs_trans_ail.c |    2 +-
>>   1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
>> index ae620eb..8bc8aa2 100644
>> --- a/fs/xfs/xfs_trans_ail.c
>> +++ b/fs/xfs/xfs_trans_ail.c
>> @@ -503,7 +503,7 @@ xfsaild_push(
>>
>>       /* assume we have more work to do in a short while */
>>   out_done:
>> -    if (!count) {
>> +    if (!count && !ailp->xa_last_pushed_lsn) {
>>           /* We're past our target or empty, so idle */
>>           ailp->xa_last_pushed_lsn = 0;
>>           ailp->xa_log_flush = 0;
> 

Hi Mark,

> There is another patch in the OSS XFS tree (43ff2122 in
> git://oss.sgi.com/xfs/xfs), not yet in Linus' tree, that touches this
> area, which is why this patch is not applying cleanly.
> 

Ah, sorry about that. This is my first time posting patches for XFS, so
I'm relatively new to the process. :) Should I rebase against the
oss.sgi.com tree? For future reference, are new patches expected to be
based against that tree?

> So the xfs_log_force() will un-stick the stuck items from the previous
> pass which set ailp->xa_last_pushed_lsn = 0; I'd like to be reassured
> that count will be non-zero and that you won't go idle with items
> still stuck.
> 

I'm not sure I parse this comment... but my interpretation of
xfsaild_push() is that it's possible to "miss" a section of the ail (as
reflected by count) when xa_last_pushed_lsn is non-zero. If
xa_last_pushed_lsn is 0, how could count be zero unless the ail is
empty?

Brian

> 
> The problem that we are chasing in the AIL seems different from the
> lost wakeup (next patch), but it would be interesting to have the patch
> in the kernel for testing.
> 
> --Mark Tinguely


* Re: [RFC PATCH v2 2/3] xfs: fix xfsaild hang due to premature idle
  2012-05-22  0:31     ` Brian Foster
@ 2012-05-22 13:10       ` Mark Tinguely
  0 siblings, 0 replies; 7+ messages in thread
From: Mark Tinguely @ 2012-05-22 13:10 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On 05/21/12 19:31, Brian Foster wrote:
> On 05/21/2012 05:19 PM, Mark Tinguely wrote:
>> On 05/21/12 13:49, Brian Foster wrote:
>>> Running xfstests 273 in a loop reproduces an XFS lockup due to
>>> xfsaild entering idle mode indefinitely. The following
>>> high-level sequence of events leads to the hang:
>>>
>>> - xfsaild is running, hits the stuck item threshold and reschedules,
>>>     setting xa_last_pushed_lsn appropriately.
>>> - xa_target is updated.
>>> - xfsaild restarts from the previous xa_last_pushed_lsn, hits the
>>>     new target and enters idle mode, even though the previously
>>>     stuck items still populate the ail.
>>>
>>> Modify the tout logic to only enter idle mode when the ail is empty.
>>> IOW, if we hit the target but did not perform the current scan from
>>> the start of the ail, reschedule at least one more time.
>>>
>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>> ---
>>>    fs/xfs/xfs_trans_ail.c |    2 +-
>>>    1 files changed, 1 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
>>> index ae620eb..8bc8aa2 100644
>>> --- a/fs/xfs/xfs_trans_ail.c
>>> +++ b/fs/xfs/xfs_trans_ail.c
>>> @@ -503,7 +503,7 @@ xfsaild_push(
>>>
>>>        /* assume we have more work to do in a short while */
>>>    out_done:
>>> -    if (!count) {
>>> +    if (!count && !ailp->xa_last_pushed_lsn) {
>>>            /* We're past our target or empty, so idle */
>>>            ailp->xa_last_pushed_lsn = 0;
>>>            ailp->xa_log_flush = 0;
>>
>
> Hi Mark,
>
>> There is another patch in the OSS XFS tree (43ff2122 in
>> git://oss.sgi.com/xfs/xfs), not yet in Linus' tree, that touches this
>> area, which is why this patch is not applying cleanly.
>>
>
> Ah, sorry about that. This is my first time posting patches for XFS, so
> I'm relatively new to the process. :) Should I rebase against the
> oss.sgi.com tree? For future reference, are new patches expected to be
> based against that tree?

Please rebase to that tree.

>> So the xfs_log_force() will un-stick the stuck items from the previous
>> pass which set ailp->xa_last_pushed_lsn = 0; I'd like to be reassured
>> that count will be non-zero and that you won't go idle with items
>> still stuck.
>>
>
> I'm not sure I parse this comment... but my interpretation of
> xfsaild_push() is that it's possible to "miss" a section of the ail (as
> reflected by count) when xa_last_pushed_lsn is non-zero. If
> xa_last_pushed_lsn is 0, how could count be zero unless the ail is
> empty?

You are correct; the counts are incremented. I do not know why I was
thinking the break was for the while loop and not the switch statement.

> Brian
>
>>
>> The problem that we are chasing in the AIL seems different from the
>> lost wakeup (next patch), but it would be interesting to have the patch
>> in the kernel for testing.
>>
>> --Mark Tinguely
>

Thank you,

--Mark.
