From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-mm@kvack.org, xfs@oss.sgi.com
Subject: Re: xfs: two deadlock problems occur when kswapd writebacks XFS pages.
Date: Wed, 18 Jun 2014 18:37:11 +0900
Message-ID: <53A15DC7.50001@jp.fujitsu.com>
In-Reply-To: <20140617132609.GI9508@dastard>
On Tue, 17 Jun 2014 23:26:09 +1000 Dave Chinner wrote:
> On Tue, Jun 17, 2014 at 05:50:02PM +0900, Masayoshi Mizuma wrote:
>> I found two deadlock problems that occur when kswapd writes back XFS pages.
>> I actually detected these problems on a RHEL kernel, and I suppose they
>> also happen on the upstream kernel (3.16-rc1).
>>
>> 1.
>>
>> A process (processA) has acquired the read semaphore "xfs_cil.xc_ctx_lock"
>> in xfs_log_commit_cil() and is waiting for kswapd. Then a kworker has
>> run xlog_cil_push_work() and is waiting to acquire the write semaphore.
>> kswapd is waiting to acquire the read semaphore in xfs_log_commit_cil()
>> because the kworker was already waiting for the write semaphore in
>> xlog_cil_push(). Therefore, a deadlock happens.
>>
>> The deadlock flow is as follows.
>>
>>   processA              | kworker                  | kswapd
>>   ----------------------+--------------------------+-----------------------
>> | xfs_trans_commit      |                          |
>> | xfs_log_commit_cil    |                          |
>> | down_read(xc_ctx_lock)|                          |
>> | xlog_cil_insert_items |                          |
>> | xlog_cil_insert_format_items                     |
>> | kmem_alloc            |                          |
>> | :                     |                          |
>> | shrink_inactive_list  |                          |
>> | congestion_wait       |                          |
>> | # waiting for kswapd..|                          |
>> |                       | xlog_cil_push_work       |
>> |                       | xlog_cil_push            |
>> |                       | xfs_trans_commit         |
>> |                       | down_write(xc_ctx_lock)  |
>> |                       | # waiting for processA...|
>> |                       |                          | shrink_page_list
>> |                       |                          | xfs_vm_writepage
>> |                       |                          | xfs_map_blocks
>> |                       |                          | xfs_iomap_write_allocate
>> |                       |                          | xfs_trans_commit
>> |                       |                          | xfs_log_commit_cil
>> |                       |                          | down_read(xc_ctx_lock)
>> V(time)                 |                          | # waiting for kworker...
>>   ----------------------+--------------------------+-----------------------
>
> Where's the deadlock here? congestion_wait() simply times out and
> processA continues onward doing memory reclaim. It should continue
> making progress, albeit slowly, and if it isn't then the allocation
> will fail. If the allocation repeatedly fails then you should be
> seeing this in the logs:
>
> XFS: possible memory allocation deadlock in <func> (mode:0x%x)
>
> If you aren't seeing that in the logs a few times a second and never
> stopping, then the system is still making progress and isn't
> deadlocked.
processA is stuck in the following while loop. In this situation,
too_many_isolated() always returns true because kswapd is also stuck, so
each congestion_wait() timeout only leads to another pass through the loop...
---
static noinline_for_stack unsigned long
shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
                     struct scan_control *sc, enum lru_list lru)
{
...
        while (unlikely(too_many_isolated(zone, file, sc))) {
                congestion_wait(BLK_RW_ASYNC, HZ/10);
                /* We are about to die and free our memory. Return now. */
                if (fatal_signal_pending(current))
                        return SWAP_CLUSTER_MAX;
        }
---
In that respect, this problem is similar to the one fixed by the
following commit:
  1f6d64829d xfs: block allocation work needs to be kswapd aware
So the same kind of solution, for example setting PF_KSWAPD in
current->flags before calling kmem_alloc(), may fix problem 1...
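To illustrate why PF_KSWAPD might matter here, below is a rough userspace model, not kernel code. It assumes that too_many_isolated() exempts kswapd and otherwise throttles while the isolated count exceeds the inactive count, which is my reading of mm/vmscan.c around 3.16. The struct, the function names, and the MODEL_PF_KSWAPD value are all hypothetical and exist only for this sketch.

```c
#include <stdbool.h>

/* Illustrative flag value only; the real PF_KSWAPD is defined by the kernel. */
#define MODEL_PF_KSWAPD 0x00040000u

/* Hypothetical snapshot of the state the throttle check looks at. */
struct reclaim_model {
    unsigned int task_flags;   /* stands in for current->flags */
    unsigned long nr_inactive; /* pages on the inactive LRU */
    unsigned long nr_isolated; /* pages isolated by reclaimers */
};

/* Simplified check: kswapd is exempt, others throttle while isolated > inactive. */
bool model_too_many_isolated(const struct reclaim_model *m)
{
    if (m->task_flags & MODEL_PF_KSWAPD)
        return false;
    return m->nr_isolated > m->nr_inactive;
}

/*
 * Bounded stand-in for the while loop in shrink_inactive_list().
 * Returns the number of wait rounds taken; a result equal to
 * max_rounds means the condition never cleared (the stuck case).
 */
unsigned int model_throttle_rounds(const struct reclaim_model *m,
                                   unsigned int max_rounds)
{
    unsigned int rounds = 0;

    /* Nothing updates the counters here, mimicking a stuck kswapd. */
    while (rounds < max_rounds && model_too_many_isolated(m))
        rounds++;
    return rounds;
}
```

In this model an ordinary task with nr_isolated > nr_inactive uses up every round it is given, while the same task with MODEL_PF_KSWAPD set skips the loop entirely. Whether setting the flag is safe in the real xlog_cil_insert_format_items() path is a separate question that the model does not answer.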
>
>> To fix this, should we up the read semaphore before calling kmem_alloc()
>> in xlog_cil_insert_format_items() to avoid blocking the kworker? Or
>> should we change the second argument of kmem_alloc() from KM_SLEEP|KM_NOFS
>> to KM_NOSLEEP to avoid waiting for kswapd? Or...
>
> Can't do that - it's in transaction context and so reclaim can't
> recurse into the fs. Even if you do remove the flag, kmem_alloc()
> will re-add the GFP_NOFS silently because of the PF_FSTRANS flag on
> the task, so it won't affect anything...
I think kmem_alloc() doesn't re-add GFP_NOFS if the second argument is
set to KM_NOSLEEP. In that case kmem_flags_convert() returns
GFP_ATOMIC | __GFP_NOWARN and never checks PF_FSTRANS.
---
static inline gfp_t
kmem_flags_convert(xfs_km_flags_t flags)
{
        gfp_t   lflags;
        BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO));
        if (flags & KM_NOSLEEP) {
                lflags = GFP_ATOMIC | __GFP_NOWARN;
        } else {
                lflags = GFP_KERNEL | __GFP_NOWARN;
                if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
                        lflags &= ~__GFP_FS;
        }
        if (flags & KM_ZERO)
                lflags |= __GFP_ZERO;
        return lflags;
}
---
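The same logic can be tried out in userspace with the sketch below. It is only a model: the M_* bit values are made up for illustration and are not the kernel's definitions, and the in_fstrans parameter stands in for the PF_FSTRANS test on current->flags. It also assumes that GFP_ATOMIC is __GFP_HIGH and that GFP_KERNEL is __GFP_WAIT | __GFP_IO | __GFP_FS, as I understand include/linux/gfp.h for this kernel generation.

```c
#include <assert.h>

/* Illustrative bit values only; NOT the kernel's real definitions. */
#define M_KM_SLEEP    0x01u
#define M_KM_NOSLEEP  0x02u
#define M_KM_NOFS     0x04u
#define M_KM_MAYFAIL  0x08u
#define M_KM_ZERO     0x10u

#define M_GFP_WAIT    0x001u
#define M_GFP_HIGH    0x002u
#define M_GFP_IO      0x004u
#define M_GFP_FS      0x008u
#define M_GFP_NOWARN  0x010u
#define M_GFP_ZERO    0x020u

/* Assumed compositions, see the note above. */
#define M_GFP_ATOMIC  (M_GFP_HIGH)
#define M_GFP_KERNEL  (M_GFP_WAIT | M_GFP_IO | M_GFP_FS)

/* Userspace model of kmem_flags_convert(); in_fstrans models PF_FSTRANS. */
unsigned int model_kmem_flags_convert(unsigned int flags, int in_fstrans)
{
    unsigned int lflags;

    /* Stands in for the BUG_ON() on unknown flags. */
    assert(!(flags & ~(M_KM_SLEEP | M_KM_NOSLEEP | M_KM_NOFS |
                       M_KM_MAYFAIL | M_KM_ZERO)));

    if (flags & M_KM_NOSLEEP) {
        /* The transaction state is never consulted on this branch. */
        lflags = M_GFP_ATOMIC | M_GFP_NOWARN;
    } else {
        lflags = M_GFP_KERNEL | M_GFP_NOWARN;
        if (in_fstrans || (flags & M_KM_NOFS))
            lflags &= ~M_GFP_FS;
    }

    if (flags & M_KM_ZERO)
        lflags |= M_GFP_ZERO;
    return lflags;
}
```

Under these assumptions the KM_NOSLEEP result contains neither the wait bit nor the FS bit, so the allocation would not sleep in reclaim and still would not recurse into the filesystem. The cost is that such an allocation can fail, so the caller would have to handle a NULL return.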
>
> We might be able to do a down_write_trylock() in xlog_cil_push(),
> but we can't delay the push for an arbitrary amount of time - the
> write lock needs to be a barrier otherwise we'll get push
> starvation and that will lead to checkpoint size overruns (i.e.
> temporary journal corruption).
I understand, thanks.
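For reference, the barrier behaviour of the write lock is also what makes kswapd queue behind the kworker in problem 1. A small single-threaded sketch of that ordering follows. It assumes a fair reader/writer semaphore in which a new reader must queue once any waiter is already queued; the struct and function names are hypothetical and this is not the kernel's rwsem implementation.

```c
#include <stdbool.h>

/* Hypothetical single-threaded model of a fair reader/writer semaphore. */
struct model_rwsem {
    unsigned int active_readers; /* readers currently holding the lock */
    bool writer_active;          /* true while a writer holds the lock */
    unsigned int queued;         /* waiters (readers or writers) in the queue */
};

/* Returns true if the read lock is granted, false if the caller must queue. */
bool model_down_read(struct model_rwsem *s)
{
    /* A queued waiter acts as a barrier for later readers. */
    if (s->writer_active || s->queued > 0) {
        s->queued++;
        return false;
    }
    s->active_readers++;
    return true;
}

/* Returns true if the write lock is granted, false if the caller must queue. */
bool model_down_write(struct model_rwsem *s)
{
    if (s->writer_active || s->active_readers > 0 || s->queued > 0) {
        s->queued++;
        return false;
    }
    s->writer_active = true;
    return true;
}
```

Replaying problem 1 against this model: processA's model_down_read() succeeds, the kworker's model_down_write() queues behind it, and kswapd's later model_down_read() queues behind the kworker even though only a reader holds the lock.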
>
>> 2.
>>
>> A kworker (kworkerA), which is a writeback thread, is waiting for
>> the XFS allocation thread (kworkerB) while it writes back XFS pages.
>> kworkerB has started the allocation and is waiting for kswapd to
>> make free pages available. kswapd has started writing back XFS pages
>> and is waiting for more log space. The log space is exhausted because
>> both the writeback thread and kswapd are stuck, so the processes that
>> have reserved log space and are requesting free pages are also stuck.
>>
>> The deadlock flow is as follows.
>>
>>   kworkerA              | kworkerB                 | kswapd
>>   ----------------------+--------------------------+-----------------------
>> | wb_writeback          |                          |
>> | :                     |                          |
>> | xfs_vm_writepage      |                          |
>> | xfs_map_blocks        |                          |
>> | xfs_iomap_write_allocate                         |
>> | xfs_bmapi_write       |                          |
>> | xfs_bmapi_allocate    |                          |
>> | wait_for_completion   |                          |
>> | # waiting for kworkerB...                        |
>> |                       | xfs_bmapi_allocate_worker|
>> |                       | :                        |
>> |                       | xfs_buf_get_map          |
>> |                       | xfs_buf_allocate_memory  |
>> |                       | alloc_pages_current      |
>> |                       | :                        |
>> |                       | shrink_inactive_list     |
>> |                       | congestion_wait          |
>> |                       | # waiting for kswapd...  |
>> |                       |                          | shrink_page_list
>> |                       |                          | xfs_vm_writepage
>> |                       |                          | :
>> |                       |                          | xfs_log_reserve
>> |                       |                          | :
>> |                       |                          | xlog_grant_head_check
>> |                       |                          | xlog_grant_head_wait
>> |                       |                          | # waiting for more
>> |                       |                          | # space...
>> V(time)                 |                          |
>>   ----------------------+--------------------------+-----------------------
>
> Again, anything in congestion_wait() is not stuck and if the
> allocations here are repeatedly failing and progress is not being
> made, then there should be log messages from XFS indicating this.
kworkerB is stuck for the same reason as processA above.
>
> I need more information about your test setup to understand what is
> going on here. Can you provide:
>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
> The output of sysrq-w would also be useful here, because the above
> abridged stack traces do not tell me everything about the state of
> the system I need to know.
OK, I will try to get that information when problem 2 is reproduced.
Thanks,
Masayoshi Mizuma