Re: Lockup in wait_transaction_locked under memory pressure

From: Dave Chinner <david@fromorbit.com>
To: Michal Hocko <mhocko@suse.cz>
Cc: Nikolay Borisov <kernel@kyup.com>, Theodore Ts'o <tytso@mit.edu>,
	linux-ext4@vger.kernel.org, Marian Marinov <mm@1h.com>
Subject: Re: Lockup in wait_transaction_locked under memory pressure
Date: Wed, 1 Jul 2015 08:58:51 +1000
Message-ID: <20150630225851.GK7943@dastard>
In-Reply-To: <20150630143158.GD4578@dhcp22.suse.cz>

On Tue, Jun 30, 2015 at 04:31:58PM +0200, Michal Hocko wrote:
> On Tue 30-06-15 14:30:33, Michal Hocko wrote:
> > On Tue 30-06-15 11:52:06, Dave Chinner wrote:
> > > On Mon, Jun 29, 2015 at 11:36:40AM +0200, Michal Hocko wrote:
> > > > On Mon 29-06-15 12:01:49, Nikolay Borisov wrote:
> > > > > Today I observed the issue again, this time on a different server. What
> > > > > is particularly strange is the fact that the OOM wasn't triggered for
> > > > > the cgroup, whose tasks have entered D state. There were a couple of
> > > > > SSHD processes and an RSYNC performing some backup tasks. Here is what
> > > > > the stacktrace for the rsync looks like:
> > > > > 
> > > > > crash> set 18308
> > > > >     PID: 18308
> > > > > COMMAND: "rsync"
> > > > >    TASK: ffff883d7c9b0a30  [THREAD_INFO: ffff881773748000]
> > > > >     CPU: 1
> > > > >   STATE: TASK_UNINTERRUPTIBLE
> > > > > crash> bt
> > > > > PID: 18308  TASK: ffff883d7c9b0a30  CPU: 1   COMMAND: "rsync"
> > > > >  #0 [ffff88177374ac60] __schedule at ffffffff815ab152
> > > > >  #1 [ffff88177374acb0] schedule at ffffffff815ab76e
> > > > >  #2 [ffff88177374acd0] schedule_timeout at ffffffff815ae5e5
> > > > >  #3 [ffff88177374ad70] io_schedule_timeout at ffffffff815aad6a
> > > > >  #4 [ffff88177374ada0] bit_wait_io at ffffffff815abfc6
> > > > >  #5 [ffff88177374adb0] __wait_on_bit at ffffffff815abda5
> > > > >  #6 [ffff88177374ae00] wait_on_page_bit at ffffffff8111fd4f
> > > > >  #7 [ffff88177374ae50] shrink_page_list at ffffffff81135445
> > > > 
> > > > This is most probably wait_on_page_writeback because the reclaim has
> > > > encountered a dirty page which is under writeback currently.
> > > 
> > > Yes, and look at the caller path....
> > > 
> > > > >  #8 [ffff88177374af50] shrink_inactive_list at ffffffff81135845
> > > > >  #9 [ffff88177374b060] shrink_lruvec at ffffffff81135ead
> > > > > #10 [ffff88177374b150] shrink_zone at ffffffff811360c3
> > > > > #11 [ffff88177374b220] shrink_zones at ffffffff81136eff
> > > > > #12 [ffff88177374b2a0] do_try_to_free_pages at ffffffff8113712f
> > > > > #13 [ffff88177374b300] try_to_free_mem_cgroup_pages at ffffffff811372be
> > > > > #14 [ffff88177374b380] try_charge at ffffffff81189423
> > > > > #15 [ffff88177374b430] mem_cgroup_try_charge at ffffffff8118c6f5
> > > > > #16 [ffff88177374b470] __add_to_page_cache_locked at ffffffff8112137d
> > > > > #17 [ffff88177374b4e0] add_to_page_cache_lru at ffffffff81121618
> > > > > #18 [ffff88177374b510] pagecache_get_page at ffffffff8112170b
> > > > > #19 [ffff88177374b560] grow_dev_page at ffffffff811c8297
> > > > > #20 [ffff88177374b5c0] __getblk_slow at ffffffff811c91d6
> > > > > #21 [ffff88177374b600] __getblk_gfp at ffffffff811c92c1
> > > > > #22 [ffff88177374b630] ext4_ext_grow_indepth at ffffffff8124565c
> > > > > #23 [ffff88177374b690] ext4_ext_create_new_leaf at ffffffff81246ca8
> > > > > #24 [ffff88177374b6e0] ext4_ext_insert_extent at ffffffff81246f09
> > > > > #25 [ffff88177374b750] ext4_ext_map_blocks at ffffffff8124a848
> > > > > #26 [ffff88177374b870] ext4_map_blocks at ffffffff8121a5b7
> > > > > #27 [ffff88177374b910] mpage_map_one_extent at ffffffff8121b1fa
> > > > > #28 [ffff88177374b950] mpage_map_and_submit_extent at ffffffff8121f07b
> > > > > #29 [ffff88177374b9b0] ext4_writepages at ffffffff8121f6d5
> > > > > #30 [ffff88177374bb20] do_writepages at ffffffff8112c490
> > > > > #31 [ffff88177374bb30] __filemap_fdatawrite_range at ffffffff81120199
> > > > > #32 [ffff88177374bb80] filemap_flush at ffffffff8112041c
> > > 
> > > That's a potential self deadlocking path, isn't it? i.e. the
> > > writeback path has been entered, may hold pages locked in the
> > > current bio being built (waiting for submission), then memory
> > > reclaim has been entered while trying to map more contiguous blocks
> > > to submit, and that waits on page IO to complete on a page in a bio
> > > that ext4 hasn't yet submitted?
> > 
> > I am not sure I understand. Pages are marked writeback in
> > ext4_bio_write_page after all of this has been done already and then
> > the IO is submitted and the reclaim shouldn't block it. Or am I missing
> > something?
> 
> Thanks to Jan Kara for the off list clarification. I misunderstood the
> code. You are right ext4 is really deadlock prone. The heuristic in the
> reclaim code assumes that waiting on page_writeback is guaranteed to
> make a progress (from memcg POV) and that is not true for ext4 as it

*blink*

/me re-reads again

That assumption is fundamentally broken. Filesystems use GFP_NOFS
because the filesystem holds resources that can prevent memory
reclaim making forwards progress if it re-enters the filesystem or
blocks on anything filesystem related. memcg does not change that,
and I'm kinda scared to learn that memcg plays fast and loose like
this.

For example: IO completion might require unwritten extent conversion
which executes filesystem transactions and GFP_NOFS allocations. The
writeback flag on the pages can not be cleared until unwritten
extent conversion completes. Hence memory reclaim cannot wait on
page writeback to complete in GFP_NOFS context because it is not
safe to do so, memcg reclaim or otherwise.
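
To make that rule concrete, here is a minimal userspace sketch of the
decision reclaim has to make when it meets a page under writeback. It
is a toy model, not the vmscan code: the MODEL_GFP_* bits, the enum and
reclaim_page_action() are names invented for illustration, and real
reclaim has more cases than this.

```c
#include <assert.h>   /* only needed for checks like the one in the note below */
#include <stdbool.h>

/* Hypothetical flag bits for this model; NOT the kernel's gfp_t values. */
#define MODEL_GFP_IO 0x1u   /* caller may start or wait on block IO */
#define MODEL_GFP_FS 0x2u   /* caller may re-enter the filesystem */

#define MODEL_GFP_KERNEL (MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS   (MODEL_GFP_IO)

enum reclaim_action {
	RECLAIM_PAGE,        /* page is idle: reclaim it now */
	SKIP_PAGE,           /* leave the page alone and move on */
	WAIT_FOR_WRITEBACK,  /* block until the writeback flag clears */
};

/*
 * Decide what reclaim may do with one page.
 *
 * Clearing the writeback flag can depend on filesystem work (for
 * example unwritten extent conversion), so a caller that is not
 * allowed to re-enter the filesystem must never block on it.
 */
static enum reclaim_action reclaim_page_action(unsigned int gfp,
					       bool under_writeback)
{
	if (!under_writeback)
		return RECLAIM_PAGE;

	if (!(gfp & MODEL_GFP_FS))
		return SKIP_PAGE;

	return WAIT_FOR_WRITEBACK;
}
```

In this model, assert(reclaim_page_action(MODEL_GFP_NOFS, true) ==
SKIP_PAGE) holds, and only a MODEL_GFP_KERNEL caller can ever reach
WAIT_FOR_WRITEBACK.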

> really charge after set_page_writeback (called from ext4_bio_write_page)
> and before the page is really submitted (when the bio is full or
> explicitly via ext4_io_submit). I thought that io_submit_add_bh submits
> the page but it doesn't do that necessarily.

XFS does exactly the same thing - the underlying algorithm ext4 uses
to build large bios efficiently was copied from XFS. And FWIW XFS has
been using this algorithm since 2.6.15....
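
The window being discussed can be sketched in userspace as follows.
This is only a simplified model of "mark the page under writeback, add
it to a bio, submit later": struct model_page, struct model_bio and the
model_*() helpers are hypothetical names, and neither the ext4 nor the
XFS code looks like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BIO_MAX_PAGES 4

struct model_page {
	bool writeback;          /* stands in for the page writeback flag */
};

struct model_bio {
	struct model_page *pages[MODEL_BIO_MAX_PAGES];
	size_t nr_pages;
	bool submitted;          /* false while the bio is still being built */
};

/*
 * Mark the page under writeback and queue it. The bio is only
 * submitted once it is full, so a partly built bio holds pages that
 * are flagged as under writeback but have no IO in flight.
 */
static void model_add_page(struct model_bio *bio, struct model_page *page)
{
	assert(!bio->submitted && bio->nr_pages < MODEL_BIO_MAX_PAGES);
	page->writeback = true;
	bio->pages[bio->nr_pages++] = page;
	if (bio->nr_pages == MODEL_BIO_MAX_PAGES)
		bio->submitted = true;
}

/*
 * IO completion: clears the writeback flags, but only for a bio that
 * was actually submitted. Returns the number of pages completed.
 */
static size_t model_complete(struct model_bio *bio)
{
	size_t i, done;

	if (!bio->submitted)
		return 0;
	for (i = 0; i < bio->nr_pages; i++)
		bio->pages[i]->writeback = false;
	done = bio->nr_pages;
	bio->nr_pages = 0;
	return done;
}

/*
 * True if waiting on this page cannot finish until the owner of the
 * bio submits it. If the waiter is that owner, it is waiting on itself.
 */
static bool model_wait_would_deadlock(const struct model_bio *bio,
				      const struct model_page *page)
{
	size_t i;

	if (!page->writeback || bio->submitted)
		return false;
	for (i = 0; i < bio->nr_pages; i++)
		if (bio->pages[i] == page)
			return true;
	return false;
}
```

Adding a single page to an empty model_bio leaves it flagged as under
writeback with submitted still false, so model_complete() returns 0 and
model_wait_would_deadlock() returns true until the bio is submitted.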

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
