linux-mm.kvack.org archive mirror
* xfs: two deadlock problems occur when kswapd writebacks XFS pages.
@ 2014-06-17  8:50 Masayoshi Mizuma
  2014-06-17 13:26 ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Masayoshi Mizuma @ 2014-06-17  8:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-mm

I found two deadlock problems that occur when kswapd writes back XFS
pages. I actually detected these problems on a RHEL kernel, and I
suppose they also happen on the upstream kernel (3.16-rc1).

1.

A process (processA) has taken the read side of the rw_semaphore
"xfs_cil.xc_ctx_lock" in xfs_log_commit_cil() and is waiting on kswapd.
Meanwhile, a kworker running xlog_cil_push_work() is waiting to take
the write side of that semaphore. kswapd is then waiting to take the
read side in xfs_log_commit_cil(), because the kworker is already
queued ahead of it waiting for the write side in xlog_cil_push() (a
queued writer blocks new readers). Therefore, a deadlock happens.

The deadlock flow is as follows.

  processA              | kworker                  | kswapd              
  ----------------------+--------------------------+----------------------
| xfs_trans_commit      |                          |
| xfs_log_commit_cil    |                          |
| down_read(xc_ctx_lock)|                          |
| xlog_cil_insert_items |                          |
| xlog_cil_insert_format_items                     |
| kmem_alloc            |                          |
| :                     |                          |
| shrink_inactive_list  |                          |
| congestion_wait       |                          |
| # waiting for kswapd..|                          |
|                       | xlog_cil_push_work       |
|                       | xlog_cil_push            |
|                       | xfs_trans_commit         |
|                       | down_write(xc_ctx_lock)  |
|                       | # waiting for processA...|
|                       |                          | shrink_page_list
|                       |                          | xfs_vm_writepage
|                       |                          | xfs_map_blocks
|                       |                          | xfs_iomap_write_allocate
|                       |                          | xfs_trans_commit
|                       |                          | xfs_log_commit_cil
|                       |                          | down_read(xc_ctx_lock)
V(time)                 |                          | # waiting for kworker...
  ----------------------+--------------------------+-----------------------

To fix this, should we up the read semaphore before calling
kmem_alloc() at xlog_cil_insert_format_items() to avoid blocking the
kworker? Or should we change the second argument of kmem_alloc() from
KM_SLEEP|KM_NOFS to KM_NOSLEEP to avoid waiting for kswapd? Or...
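
For clarity, the allocation I am talking about is the log vector buffer
allocation in xlog_cil_insert_format_items(); roughly like this (the
variable names and exact code in 3.16-rc1 may differ a little, this is
just to show which call I mean):

---
	/* current code: KM_SLEEP means we may enter direct reclaim and
	 * wait there while holding the xc_ctx_lock read side */
	lv = kmem_alloc(buf_size, KM_SLEEP|KM_NOFS);
---

The second idea would simply pass KM_NOSLEEP here instead and handle a
NULL return somehow.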

2. 

A kworker (kworkerA), which is a writeback thread, is waiting for the
XFS allocation worker (kworkerB) while it writes back XFS pages.
kworkerB has started the allocation and is waiting for kswapd to free
pages. kswapd has started writing back XFS pages and is waiting for
more log space. The log space stays exhausted because both the
writeback thread and kswapd are stuck, so the processes that hold log
space and are requesting free pages are also stuck.

The deadlock flow is as follows.

  kworkerA              | kworkerB                 | kswapd            
  ----------------------+--------------------------+-----------------------
| wb_writeback          |                          |
| :                     |                          |
| xfs_vm_writepage      |                          |
| xfs_map_blocks        |                          |
| xfs_iomap_write_allocate                         |
| xfs_bmapi_write       |                          |
| xfs_bmapi_allocate    |                          |
| wait_for_completion   |                          |
| # waiting for kworkerB...                        |
|                       | xfs_bmapi_allocate_worker|
|                       | :                        |
|                       | xfs_buf_get_map          |
|                       | xfs_buf_allocate_memory  |
|                       | alloc_pages_current      |
|                       | :                        |
|                       | shrink_inactive_list     |
|                       | congestion_wait          |
|                       | # waiting for kswapd...  |
|                       |                          | shrink_page_list
|                       |                          | xfs_vm_writepage
|                       |                          | :
|                       |                          | xfs_log_reserve
|                       |                          | :
|                       |                          | xlog_grant_head_check
|                       |                          | xlog_grant_head_wait
|                       |                          | # waiting for more
|                       |                          | # space...
V(time)                 |                          |
  ----------------------+--------------------------+-----------------------

I don't have any ideas to fix this...

Thanks,
Masayoshi Mizuma


* Re: xfs: two deadlock problems occur when kswapd writebacks XFS pages.
  2014-06-17  8:50 xfs: two deadlock problems occur when kswapd writebacks XFS pages Masayoshi Mizuma
@ 2014-06-17 13:26 ` Dave Chinner
  2014-06-18  9:37   ` Masayoshi Mizuma
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2014-06-17 13:26 UTC (permalink / raw)
  To: Masayoshi Mizuma; +Cc: xfs, linux-mm

On Tue, Jun 17, 2014 at 05:50:02PM +0900, Masayoshi Mizuma wrote:
> I found two deadlock problems occur when kswapd writebacks XFS pages.
> I detected these problems on RHEL kernel actually, and I suppose these
> also happen on upstream kernel (3.16-rc1).
> 
> 1.
> 
> A process (processA) has acquired read semaphore "xfs_cil.xc_ctx_lock"
> at xfs_log_commit_cil() and it is waiting for the kswapd. Then, a
> kworker has issued xlog_cil_push_work() and it is waiting for acquiring
> the write semaphore. kswapd is waiting for acquiring the read semaphore
> at xfs_log_commit_cil() because the kworker has been waiting before for
> acquiring the write semaphore at xlog_cil_push(). Therefore, a deadlock
> happens.
> 
> The deadlock flow is as follows.
> 
>   processA              | kworker                  | kswapd              
>   ----------------------+--------------------------+----------------------
> | xfs_trans_commit      |                          |
> | xfs_log_commit_cil    |                          |
> | down_read(xc_ctx_lock)|                          |
> | xlog_cil_insert_items |                          |
> | xlog_cil_insert_format_items                     |
> | kmem_alloc            |                          |
> | :                     |                          |
> | shrink_inactive_list  |                          |
> | congestion_wait       |                          |
> | # waiting for kswapd..|                          |
> |                       | xlog_cil_push_work       |
> |                       | xlog_cil_push            |
> |                       | xfs_trans_commit         |
> |                       | down_write(xc_ctx_lock)  |
> |                       | # waiting for processA...|
> |                       |                          | shrink_page_list
> |                       |                          | xfs_vm_writepage
> |                       |                          | xfs_map_blocks
> |                       |                          | xfs_iomap_write_allocate
> |                       |                          | xfs_trans_commit
> |                       |                          | xfs_log_commit_cil
> |                       |                          | down_read(xc_ctx_lock)
> V(time)                 |                          | # waiting for kworker...
>   ----------------------+--------------------------+-----------------------

Where's the deadlock here? congestion_wait() simply times out and
processA continues onward doing memory reclaim. It should continue
making progress, albeit slowly, and if it isn't then the allocation
will fail. If the allocation repeatedly fails then you should be
seeing this in the logs:

XFS: possible memory allocation deadlock in <func> (mode:0x%x)

If you aren't seeing that in the logs a few times a second and never
stopping, then the system is still making progress and isn't
deadlocked.
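
(For reference, that message comes from the allocation retry loop in
fs/xfs/kmem.c, which looks roughly like this - abridged, so treat it as
a sketch rather than the exact source:)

---
void *
kmem_alloc(size_t size, xfs_km_flags_t flags)
{
	int	retries = 0;
	gfp_t	lflags = kmem_flags_convert(flags);
	void	*ptr;

	do {
		ptr = kmalloc(size, lflags);
		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
			return ptr;
		/* warn every 100 failed attempts, then keep retrying */
		if (!(++retries % 100))
			xfs_err(NULL,
		"possible memory allocation deadlock in %s (mode:0x%x)",
					__func__, lflags);
		congestion_wait(BLK_RW_ASYNC, HZ/50);
	} while (1);
}
---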

> To fix this, should we up the read semaphore before calling kmem_alloc()
> at xlog_cil_insert_format_items() to avoid blocking the kworker? Or,
> should we the second argument of kmem_alloc() from KM_SLEEP|KM_NOFS
> to KM_NOSLEEP to avoid waiting for the kswapd. Or...

Can't do that - it's in transaction context and so reclaim can't
recurse into the fs. Even if you do remove the flag, kmem_alloc()
will re-add the GFP_NOFS silently because of the PF_FSTRANS flag on
the task, so it won't affect anything...

We might be able to do a down_write_trylock() in xlog_cil_push(),
but we can't delay the push for an arbitrary amount of time - the
write lock needs to be a barrier otherwise we'll get push
starvation and that will lead to checkpoint size overruns (i.e.
temporary journal corruption).

> 2. 
> 
> A kworker (kworkerA), whish is a writeback thread, is waiting for
> the XFS allocation thread (kworkerB) while it writebacks XFS pages.
> kworkerB has started the allocation and it is waiting for kswapd to
> allocate free pages. kswapd has started writeback XFS pages and
> it is waiting for more log space. The reason why exhaustion of the
> log space is both the writeback thread and kswapd are stuck, so
> some processes, who have allocated the log space and are requesting
> free pages, are also stuck.
> 
> The deadlock flow is as follows.
> 
>   kworkerA              | kworkerB                 | kswapd            
>   ----------------------+--------------------------+-----------------------
> | wb_writeback          |                          |
> | :                     |                          |
> | xfs_vm_writepage      |                          |
> | xfs_map_blocks        |                          |
> | xfs_iomap_write_allocate                         |
> | xfs_bmapi_write       |                          |
> | xfs_bmapi_allocate    |                          |
> | wait_for_completion   |                          |
> | # waiting for kworkerB...                        |
> |                       | xfs_bmapi_allocate_worker|
> |                       | :                        |
> |                       | xfs_buf_get_map          |
> |                       | xfs_buf_allocate_memory  |
> |                       | alloc_pages_current      |
> |                       | :                        |
> |                       | shrink_inactive_list     |
> |                       | congestion_wait          |
> |                       | # waiting for kswapd...  |
> |                       |                          | shrink_page_list
> |                       |                          | xfs_vm_writepage
> |                       |                          | :
> |                       |                          | xfs_log_reserve
> |                       |                          | :
> |                       |                          | xlog_grant_head_check
> |                       |                          | xlog_grant_head_wait
> |                       |                          | # waiting for more
> |                       |                          | # space...
> V(time)                 |                          |
>   ----------------------+--------------------------+-----------------------

Again, anything in congestion_wait() is not stuck and if the
allocations here are repeatedly failing and progress is not being
made, then there should be log messages from XFS indicating this.

I need more information about your test setup to understand what is
going on here. Can you provide:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

The output of sysrq-w would also be useful here, because the above
abridged stack traces do not tell me everything about the state of
the system I need to know.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs: two deadlock problems occur when kswapd writebacks XFS pages.
  2014-06-17 13:26 ` Dave Chinner
@ 2014-06-18  9:37   ` Masayoshi Mizuma
  2014-06-18 11:48     ` Dave Chinner
       [not found]     ` <53A7D6CC.1040605@jp.fujitsu.com>
  0 siblings, 2 replies; 6+ messages in thread
From: Masayoshi Mizuma @ 2014-06-18  9:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-mm


On Tue, 17 Jun 2014 23:26:09 +1000 Dave Chinner wrote:
> On Tue, Jun 17, 2014 at 05:50:02PM +0900, Masayoshi Mizuma wrote:
>> I found two deadlock problems occur when kswapd writebacks XFS pages.
>> I detected these problems on RHEL kernel actually, and I suppose these
>> also happen on upstream kernel (3.16-rc1).
>>
>> 1.
>>
>> A process (processA) has acquired read semaphore "xfs_cil.xc_ctx_lock"
>> at xfs_log_commit_cil() and it is waiting for the kswapd. Then, a
>> kworker has issued xlog_cil_push_work() and it is waiting for acquiring
>> the write semaphore. kswapd is waiting for acquiring the read semaphore
>> at xfs_log_commit_cil() because the kworker has been waiting before for
>> acquiring the write semaphore at xlog_cil_push(). Therefore, a deadlock
>> happens.
>>
>> The deadlock flow is as follows.
>>
>>    processA              | kworker                  | kswapd
>>    ----------------------+--------------------------+----------------------
>> | xfs_trans_commit      |                          |
>> | xfs_log_commit_cil    |                          |
>> | down_read(xc_ctx_lock)|                          |
>> | xlog_cil_insert_items |                          |
>> | xlog_cil_insert_format_items                     |
>> | kmem_alloc            |                          |
>> | :                     |                          |
>> | shrink_inactive_list  |                          |
>> | congestion_wait       |                          |
>> | # waiting for kswapd..|                          |
>> |                       | xlog_cil_push_work       |
>> |                       | xlog_cil_push            |
>> |                       | xfs_trans_commit         |
>> |                       | down_write(xc_ctx_lock)  |
>> |                       | # waiting for processA...|
>> |                       |                          | shrink_page_list
>> |                       |                          | xfs_vm_writepage
>> |                       |                          | xfs_map_blocks
>> |                       |                          | xfs_iomap_write_allocate
>> |                       |                          | xfs_trans_commit
>> |                       |                          | xfs_log_commit_cil
>> |                       |                          | down_read(xc_ctx_lock)
>> V(time)                 |                          | # waiting for kworker...
>>    ----------------------+--------------------------+-----------------------
>
> Where's the deadlock here? congestion_wait() simply times out and
> processA continues onward doing memory reclaim. It should continue
> making progress, albeit slowly, and if it isn't then the allocation
> will fail. If the allocation repeatedly fails then you should be
> seeing this in the logs:
>
> XFS: possible memory allocation deadlock in <func> (mode:0x%x)
>
> If you aren't seeing that in the logs a few times a second and never
> stopping, then the system is still making progress and isn't
> deadlocked.

processA is stuck in the following while loop. In this situation,
too_many_isolated() always returns true because kswapd is also stuck...

---
static noinline_for_stack unsigned long
shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
                      struct scan_control *sc, enum lru_list lru)
{
...
         while (unlikely(too_many_isolated(zone, file, sc))) {
                 congestion_wait(BLK_RW_ASYNC, HZ/10);

                 /* We are about to die and free our memory. Return now. */
                 if (fatal_signal_pending(current))
                         return SWAP_CLUSTER_MAX;
         }
---
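
For reference, too_many_isolated() in this kernel is roughly the
following (abridged from mm/vmscan.c, so take it as a sketch). Once the
isolated count exceeds the inactive count, every non-kswapd direct
reclaimer keeps spinning in that congestion_wait() loop:

---
static int too_many_isolated(struct zone *zone, int file,
		struct scan_control *sc)
{
	unsigned long inactive, isolated;

	/* kswapd itself is never throttled here */
	if (current_is_kswapd())
		return 0;

	if (!global_reclaim(sc))
		return 0;

	if (file) {
		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
	} else {
		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
	}

	/*
	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages,
	 * so they won't get blocked by normal direct reclaim.
	 */
	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
		inactive >>= 3;

	return isolated > inactive;
}
---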

On that point, this problem is similar to the problem fixed by
the following commit.

1f6d64829d xfs: block allocation work needs to be kswapd aware

So the same kind of solution, for example adding PF_KSWAPD to
current->flags before calling kmem_alloc(), may fix problem 1...
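
Something like the following rough, untested sketch, just mirroring
what 1f6d64829d does for the allocation workqueue (the placement and
variable names here are illustrative only):

---
	unsigned long	pflags;

	/*
	 * Pretend to be kswapd so that too_many_isolated() does not
	 * throttle us if we end up in direct reclaim here.
	 */
	current_set_flags_nested(&pflags, PF_KSWAPD);
	lv = kmem_alloc(buf_size, KM_SLEEP|KM_NOFS);
	current_restore_flags_nested(&pflags, PF_KSWAPD);
---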

>
>> To fix this, should we up the read semaphore before calling kmem_alloc()
>> at xlog_cil_insert_format_items() to avoid blocking the kworker? Or,
>> should we the second argument of kmem_alloc() from KM_SLEEP|KM_NOFS
>> to KM_NOSLEEP to avoid waiting for the kswapd. Or...
>
> Can't do that - it's in transaction context and so reclaim can't
> recurse into the fs. Even if you do remove the flag, kmem_alloc()
> will re-add the GFP_NOFS silently because of the PF_FSTRANS flag on
> the task, so it won't affect anything...

I think kmem_alloc() doesn't add GFP_NOFS back if the second argument
is set to KM_NOSLEEP; in that case it uses GFP_ATOMIC | __GFP_NOWARN
instead.

---
static inline gfp_t
kmem_flags_convert(xfs_km_flags_t flags)
{
         gfp_t   lflags;

         BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO));

         if (flags & KM_NOSLEEP) {
                 lflags = GFP_ATOMIC | __GFP_NOWARN;
         } else {
                 lflags = GFP_KERNEL | __GFP_NOWARN;
                 if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
                         lflags &= ~__GFP_FS;
         }

         if (flags & KM_ZERO)
                 lflags |= __GFP_ZERO;

         return lflags;
}
---

>
> We might be able to do a down_write_trylock() in xlog_cil_push(),
> but we can't delay the push for an arbitrary amount of time - the
> write lock needs to be a barrier otherwise we'll get push
> starvation and that will lead to checkpoint size overruns (i.e.
> temporary journal corruption).

I understand, thanks.

>
>> 2.
>>
>> A kworker (kworkerA), whish is a writeback thread, is waiting for
>> the XFS allocation thread (kworkerB) while it writebacks XFS pages.
>> kworkerB has started the allocation and it is waiting for kswapd to
>> allocate free pages. kswapd has started writeback XFS pages and
>> it is waiting for more log space. The reason why exhaustion of the
>> log space is both the writeback thread and kswapd are stuck, so
>> some processes, who have allocated the log space and are requesting
>> free pages, are also stuck.
>>
>> The deadlock flow is as follows.
>>
>>    kworkerA              | kworkerB                 | kswapd
>>    ----------------------+--------------------------+-----------------------
>> | wb_writeback          |                          |
>> | :                     |                          |
>> | xfs_vm_writepage      |                          |
>> | xfs_map_blocks        |                          |
>> | xfs_iomap_write_allocate                         |
>> | xfs_bmapi_write       |                          |
>> | xfs_bmapi_allocate    |                          |
>> | wait_for_completion   |                          |
>> | # waiting for kworkerB...                        |
>> |                       | xfs_bmapi_allocate_worker|
>> |                       | :                        |
>> |                       | xfs_buf_get_map          |
>> |                       | xfs_buf_allocate_memory  |
>> |                       | alloc_pages_current      |
>> |                       | :                        |
>> |                       | shrink_inactive_list     |
>> |                       | congestion_wait          |
>> |                       | # waiting for kswapd...  |
>> |                       |                          | shrink_page_list
>> |                       |                          | xfs_vm_writepage
>> |                       |                          | :
>> |                       |                          | xfs_log_reserve
>> |                       |                          | :
>> |                       |                          | xlog_grant_head_check
>> |                       |                          | xlog_grant_head_wait
>> |                       |                          | # waiting for more
>> |                       |                          | # space...
>> V(time)                 |                          |
>>    ----------------------+--------------------------+-----------------------
>
> Again, anything in congestion_wait() is not stuck and if the
> allocations here are repeatedly failing and progress is not being
> made, then there should be log messages from XFS indicating this.

kworkerB is stuck for the same reason as processA above.

>
> I need more information about your test setup to understand what is
> going on here. Can you provide:
>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
> The output of sysrq-w would also be useful here, because the above
> abridged stack traces do not tell me everything about the state of
> the system I need to know.

OK, I will try to get that information when problem 2 is reproduced.

Thanks,
Masayoshi Mizuma


* Re: xfs: two deadlock problems occur when kswapd writebacks XFS pages.
  2014-06-18  9:37   ` Masayoshi Mizuma
@ 2014-06-18 11:48     ` Dave Chinner
       [not found]     ` <53A7D6CC.1040605@jp.fujitsu.com>
  1 sibling, 0 replies; 6+ messages in thread
From: Dave Chinner @ 2014-06-18 11:48 UTC (permalink / raw)
  To: Masayoshi Mizuma; +Cc: xfs, linux-mm

On Wed, Jun 18, 2014 at 06:37:11PM +0900, Masayoshi Mizuma wrote:
> 
> On Tue, 17 Jun 2014 23:26:09 +1000 Dave Chinner wrote:
> >On Tue, Jun 17, 2014 at 05:50:02PM +0900, Masayoshi Mizuma wrote:
> >>I found two deadlock problems occur when kswapd writebacks XFS pages.
> >>I detected these problems on RHEL kernel actually, and I suppose these
> >>also happen on upstream kernel (3.16-rc1).
> >>
> >>1.
> >>
> >>A process (processA) has acquired read semaphore "xfs_cil.xc_ctx_lock"
> >>at xfs_log_commit_cil() and it is waiting for the kswapd. Then, a
> >>kworker has issued xlog_cil_push_work() and it is waiting for acquiring
> >>the write semaphore. kswapd is waiting for acquiring the read semaphore
> >>at xfs_log_commit_cil() because the kworker has been waiting before for
> >>acquiring the write semaphore at xlog_cil_push(). Therefore, a deadlock
> >>happens.
> >>
> >>The deadlock flow is as follows.
> >>
> >>   processA              | kworker                  | kswapd
> >>   ----------------------+--------------------------+----------------------
> >>| xfs_trans_commit      |                          |
> >>| xfs_log_commit_cil    |                          |
> >>| down_read(xc_ctx_lock)|                          |
> >>| xlog_cil_insert_items |                          |
> >>| xlog_cil_insert_format_items                     |
> >>| kmem_alloc            |                          |
> >>| :                     |                          |
> >>| shrink_inactive_list  |                          |
> >>| congestion_wait       |                          |
> >>| # waiting for kswapd..|                          |
> >>|                       | xlog_cil_push_work       |
> >>|                       | xlog_cil_push            |
> >>|                       | xfs_trans_commit         |
> >>|                       | down_write(xc_ctx_lock)  |
> >>|                       | # waiting for processA...|
> >>|                       |                          | shrink_page_list
> >>|                       |                          | xfs_vm_writepage
> >>|                       |                          | xfs_map_blocks
> >>|                       |                          | xfs_iomap_write_allocate
> >>|                       |                          | xfs_trans_commit
> >>|                       |                          | xfs_log_commit_cil
> >>|                       |                          | down_read(xc_ctx_lock)
> >>V(time)                 |                          | # waiting for kworker...
> >>   ----------------------+--------------------------+-----------------------
> >
> >Where's the deadlock here? congestion_wait() simply times out and
> >processA continues onward doing memory reclaim. It should continue
> >making progress, albeit slowly, and if it isn't then the allocation
> >will fail. If the allocation repeatedly fails then you should be
> >seeing this in the logs:
> >
> >XFS: possible memory allocation deadlock in <func> (mode:0x%x)
> >
> >If you aren't seeing that in the logs a few times a second and never
> >stopping, then the system is still making progress and isn't
> >deadlocked.
> 
> processA is stuck at following while loop. In this situation,
> too_many_isolated() always returns true because kswapd is also stuck...

How is this a filesystem problem, though? kswapd is not guaranteed
to make writeback progress. It's *always* been able to stall waiting
on log space or transaction commit during writeback like this, and
filesystems are allowed to simply redirty pages to avoid deadlocks.
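
That's exactly what xfs_vm_writepage() already does for non-kswapd
reclaim callers - roughly, abridged from xfs_aops.c:

---
	/*
	 * Refuse to write the page out if we are called from reclaim
	 * context; kswapd is explicitly allowed through.
	 */
	if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
			PF_MEMALLOC))
		goto redirty;
	...
redirty:
	redirty_page_for_writepage(wbc, page);
	unlock_page(page);
	return 0;
---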

For those playing along at home, this is also the reason why
filesystems can't use mempools for writeback structures - they can't
guarantee forward progress in low memory situations and mempools
aren't a solution to memory allocation problems.

Here's a basic example for you:

Process A				kswapd

start transaction
allocate block
lock AGF 1
read btree block
allocate memory for btree buffer
<direct memory reclaim>
loop while (too many isolated)
    <blocks waiting on kswapd>

					shrink_page_list
					xfs_vm_writepage
					xfs_map_blocks
					xfs_iomap_write_allocate
					....
					start transaction
					<allocate block>
					lock AGF 1
					<blocks waiting on process A>

See how simple it is to prevent kswapd from making progress? I can
think of many, many other ways that XFS can prevent kswapd from
making progress and none of them are new....

> ---
> static noinline_for_stack unsigned long
> shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>                      struct scan_control *sc, enum lru_list lru)
> {
> ...
>         while (unlikely(too_many_isolated(zone, file, sc))) {
>                 congestion_wait(BLK_RW_ASYNC, HZ/10);
> 
>                 /* We are about to die and free our memory. Return now. */
>                 if (fatal_signal_pending(current))
>                         return SWAP_CLUSTER_MAX;
>         }
> ---
> 
> On that point, this problem is similar to the problem fixed by
> the following commit.
> 
> 1f6d64829d xfs: block allocation work needs to be kswapd aware

Which has already proven to be the wrong thing to do. I'm ready to
revert that because of other performance and memory reclaim
regressions I've isolated to that patch. Indeed, it makes my test
VMs start to issue allocation deadlock warnings from XFS under
workloads that it's never had problems with before....

> So, the same solution, for example we add PF_KSWAPD to current->flags
> before calling kmem_alloc(), may fix this problem1...

That's just a nasty hack, not a solution.

What we need to know is exactly why we are getting stuck with too
many isolated pages, and why kswapd seems to be the only thing that
can "unisolate" them. Why isn't the bdi flusher thread making
progress cleaning pages?  Is it stuck in memory reclaim, too? Why do
we wait forever rather than failing, winding up the reclaim priority
and retrying?

I'm not going to hack stuff into a filesystem when the problem really
looks like a direct reclaim throttling issue. We need to understand
exactly how reclaim is getting stuck here and then work out how
direct reclaim can avoid getting stuck. Especially in the context of
GFP_NOFS allocations...

> >>To fix this, should we up the read semaphore before calling kmem_alloc()
> >>at xlog_cil_insert_format_items() to avoid blocking the kworker? Or,
> >>should we the second argument of kmem_alloc() from KM_SLEEP|KM_NOFS
> >>to KM_NOSLEEP to avoid waiting for the kswapd. Or...
> >
> >Can't do that - it's in transaction context and so reclaim can't
> >recurse into the fs. Even if you do remove the flag, kmem_alloc()
> >will re-add the GFP_NOFS silently because of the PF_FSTRANS flag on
> >the task, so it won't affect anything...
> 
> I think kmem_alloc() doesn't re-add the GFP_NOFS if the second argument
> is set to KM_NOSLEEP. kmem_alloc() will re-add GFP_ATOMIC and __GFP_NOWARN.

The second argument is KM_SLEEP|KM_NOFS, so what it does when
KM_NOSLEEP is set is irrelevant to the discussion at hand.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs: two deadlock problems occur when kswapd writebacks XFS pages.
       [not found]     ` <53A7D6CC.1040605@jp.fujitsu.com>
@ 2014-06-24 22:05       ` Dave Chinner
  2014-07-14 11:00         ` Masayoshi Mizuma
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2014-06-24 22:05 UTC (permalink / raw)
  To: Masayoshi Mizuma; +Cc: xfs, linux-mm

On Mon, Jun 23, 2014 at 04:27:08PM +0900, Masayoshi Mizuma wrote:
> Hi Dave,
> 
> (I removed CCing xfs and linux-mm. And I changed your email address
>  to @redhat.com because this email includes RHEL7 kernel stack traces.)

Please don't do that. There's nothing wrong with posting RHEL7 stack
traces to public lists (though I'd prefer you to reproduce this
problem on a 3.15 or 3.16-rc kernel), and breaking the thread of
discussion makes it impossible to involve the people necessary to
solve this problem.

I've re-added xfs and linux-mm to the cc list, and taken my redhat
address off it...

<snip the 3 process back traces>

[looks at sysrq-w output]

kswapd0 is blocked in shrink_inactive_list/congestion_wait().

kswapd1 is blocked waiting for log space from
shrink_inactive_list().

kthreadd is blocked in shrink_inactive_list/congestion_wait trying
to fork another process.

xfsaild is in uninterruptible sleep, indicating that there is still
metadata to be written to push the log tail to its required target,
and it will retry again in less than 20ms.

xfslogd is not blocked, indicating the log has not deadlocked
due to lack of space.

There are lots of timestamp updates waiting for log space.

There is one kworker stuck in data IO completion on an inode lock.

There are several threads blocked on an AGF lock trying to free
extents.

The bdi writeback thread is blocked waiting for allocation.

A single xfs_alloc_wq kworker is blocked in
shrink_inactive_list/congestion_wait while trying to read in btree
blocks for transactional modification - indicative of memory pressure
trashing the working set of cached metadata. While it waits for memory
reclaim it holds an AGF lock, which is what blocks the unlinks.

There are 113 (!) blocked sadc processes - why are there so many
stats gathering processes running? If you stop gathering stats, does
the problem go away?

There are 54 mktemp processes blocked - what is generating them?
What filesystem are they actually running on, i.e. which XFS
filesystem in the system is having log space shortages? And what is
the xfs_info output of that filesystem, i.e. have you simply
oversubscribed a tiny log and so it crawls along at a very slow
pace?

All of the blocked processes are on CPUs 0-3 i.e. on node 0, which
is handled by kswapd0, which is not blocked waiting for log
space. Hmmm - what is the value of /proc/sys/vm/zone_reclaim_mode?
If it is not zero, does setting it to zero make the problem go away?

Interestingly enough, for a system under extreme memory pressure,
I don't see any processes blocked waiting for swap space or swap IO.
Do you have any swap space configured on this machine?  If you
don't, does the problem go away when you add a swap device?

Overall, I can't see anything that indicates that the filesystem has
actually hung. I can see it having trouble allocating the memory it
needs to make forwards progress, but the system itself is not
deadlocked. Is there any IO being issued when the system is in this
state? If there is Io being issued, then progress is being made and
the system is merely slow because of the extreme memory pressure
generated by the stress test.

If there is not IO being issued, does the system start making
progress again if you kill one of the memory hogs? i.e. does the
equivalent of triggering an OOM-kill make the system responsive
again? If it does, then the filesystem is not hung and the problem
is that there isn't enough free memory to allow the filesystem to do
IO and hence allow memory reclaim to make progress. In which case,
does increasing /proc/sys/vm/min_free_kbytes make the problem go
away?

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com


* Re: xfs: two deadlock problems occur when kswapd writebacks XFS pages.
  2014-06-24 22:05       ` Dave Chinner
@ 2014-07-14 11:00         ` Masayoshi Mizuma
  0 siblings, 0 replies; 6+ messages in thread
From: Masayoshi Mizuma @ 2014-07-14 11:00 UTC (permalink / raw)
  To: david; +Cc: linux-mm, xfs

Hi Dave,

Thank you for your comments, and I apologize for my delayed response.

As you suggested, I investigated the RHEL7 crash dump again to find out
why the processes doing direct memory reclaim are stuck in
shrink_inactive_list(). The reason is that those processes and kswapd
are trying to free page cache from a zone even though the number of
inactive file pages in that zone is very small (40 pages). kswapd had
moved the inactive file pages onto the isolated list in order to free
them in shrink_inactive_list(), so NR_INACTIVE_FILE was 0 and
NR_ISOLATED_FILE was 40. After that, nothing can increase
NR_INACTIVE_FILE or decrease NR_ISOLATED_FILE, so too_many_isolated()
stays true and the system hangs up. In such a situation we should not
keep trying to reclaim inactive file pages, because kswapd and each
direct reclaimer can isolate up to 32 inactive file pages at a time.

Also, I found out why the problems do not happen on the upstream
kernel: they are avoided by the following commit.
---
commit 623762517e2370be3b3f95f4fe08d6c063a49b06
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Tue May 6 12:50:07 2014 -0700

     revert "mm: vmscan: do not swap anon pages just because free+file is low"

Thank you so much!

Masayoshi Mizuma
On Wed, 25 Jun 2014 08:05:30 +1000 Dave Chinner wrote:
> On Mon, Jun 23, 2014 at 04:27:08PM +0900, Masayoshi Mizuma wrote:
>> Hi Dave,
>>
>> (I removed CCing xfs and linux-mm. And I changed your email address
>>   to @redhat.com because this email includes RHEL7 kernel stack traces.)
>
> Please don't do that. There's nothing wrong with posting RHEL7 stack
> traces to public lists (though I'd prefer you to reproduce this
> problem on a 3.15 or 3.16-rc kernel), and breaking the thread of
> discussion makes it impossible to involve the people necessary to
> solve this problem.
>
> I've re-added xfs and linux-mm to the cc list, and taken my redhat
> address off it...
>
> <snip the 3 process back traces>
>
> [looks at sysrq-w output]
>
> kswapd0 is blocked in shrink_inactive_list/congestion_wait().
>
> kswapd1 is blocked waiting for log space from
> shrink_inactive_list().
>
> kthreadd is blocked in shrink_inactive_list/congestion_wait trying
> to fork another process.
>
> xfsaild is in uninterruptible sleep, indicating that there is still
> metadata to be written to push the log tail to it's required target,
> and it will retry again in less than 20ms.
>
> xfslogd is not blocked, indicating the log has not deadlocked
> due to lack of space.
>
> there are lots of timestamp updates waiting for log space.
>
> There is one kworker stuck in data IO completion on an inode lock.
>
> There are several threads blocked on an AGF lock trying to free
> extents.
>
> The bdi writeback thread is blocked waiting for allocation.
>
> A single xfs_alloc_wq kworker is blocked in
> shrink_inactive_list/congestion_wait while trying to read in btree
> blocks for transactional modification. Indicative of memory pressure
> trashing the working set of cached metadata. waiting for memory
> reclaim
> 	- holds agf lock, blocks unlinks
>
> There are 113 (!) blocked sadc processes - why are there so many
> stats gathering processes running? If you stop gathering stats, does
> the problem go away?
>
> There are 54 mktemp processes blocked - what is generating them?
> what filesystem are they actually running on? i.e. which XFS
> filesystem in the system is having log space shortages? And what is
> the xfs_info output of that filesystem i.e. have you simply
> oversubscribed a tiny log and so it crawls along at a very slow
> pace?
>
> All of the blocked processes are on CPUs 0-3 i.e. on node 0, which
> is handled by kswapd0, which is not blocked waiting for log
> space. Hmmm - what is the value of /proc/sys/vm/zone_reclaim_mode?
> If it is not zero, does setting it to zero make the problem go away?
>
> Interestingly enough, for a system under extreme memory pressure,
> don't see any processes blocked waiting for swap space or swap IO.
> Do you have any swap space configured on this machine?  If you
> don't, does the problem go away when you add a swap device?
>
> Overall, I can't see anything that indicates that the filesystem has
> actually hung. I can see it having trouble allocating the memory it
> needs to make forwards progress, but the system itself is not
> deadlocked. Is there any IO being issued when the system is in this
> state? If there is Io being issued, then progress is being made and
> the system is merely slow because of the extreme memory pressure
> generated by the stress test.
>
> If there is not IO being issued, does the system start making
> progress again if you kill one of the memory hogs? i.e. does the
> equivalent of triggering an OOM-kill make the system responsive
> again? If it does, then the filesystem is not hung and the problem
> is that there isn't enough free memory to allow the filesystem to do
> IO and hence allow memory reclaim to make progress. In which case,
> does increasing /proc/sys/vm/min_free_kbytes make the problem go
> away?
>
> Cheers,
>
> Dave.
>
