From: Michal Hocko <mhocko@kernel.org>
To: gregkh@linuxfoundation.org
Cc: akpm@linux-foundation.org, bo.liu@linux.alibaba.com,
david@fromorbit.com, hannes@cmpxchg.org, jack@suse.cz,
kirill.shutemov@linux.intel.com, shakeelb@google.com,
stable@vger.kernel.org, torvalds@linux-foundation.org,
tytso@mit.edu, vdavydov.dev@gmail.com
Subject: Re: FAILED: patch "[PATCH] mm, memcg: fix reclaim deadlock with writeback" failed to apply to 4.4-stable tree
Date: Tue, 15 Jan 2019 16:34:44 +0100
Message-ID: <20190115153444.GD7283@dhcp22.suse.cz>
In-Reply-To: <154747783690179@kroah.com>

I do not see a straightforward backport of this patch without pulling
more changes in. Has anybody actually hit the issue on those older
kernels? While the issue is possible in principle, I do not remember
anybody complaining.

On Mon 14-01-19 15:57:16, Greg KH wrote:
> From 63f3655f950186752236bb88a22f8252c11ce394 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Tue, 8 Jan 2019 15:23:07 -0800
> Subject: [PATCH] mm, memcg: fix reclaim deadlock with writeback
>
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback:
>
> task1:
> wait_on_page_bit+0x82/0xa0
> shrink_page_list+0x907/0x960
> shrink_inactive_list+0x2c7/0x680
> shrink_node_memcg+0x404/0x830
> shrink_node+0xd8/0x300
> do_try_to_free_pages+0x10d/0x330
> try_to_free_mem_cgroup_pages+0xd5/0x1b0
> try_charge+0x14d/0x720
> memcg_kmem_charge_memcg+0x3c/0xa0
> memcg_kmem_charge+0x7e/0xd0
> __alloc_pages_nodemask+0x178/0x260
> alloc_pages_current+0x95/0x140
> pte_alloc_one+0x17/0x40
> __pte_alloc+0x1e/0x110
> alloc_set_pte+0x5fe/0xc20
> do_fault+0x103/0x970
> handle_mm_fault+0x61e/0xd10
> __do_page_fault+0x252/0x4d0
> do_page_fault+0x30/0x80
> page_fault+0x28/0x30
>
> task2:
> __lock_page+0x86/0xa0
> mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> ext4_writepages+0x479/0xd60
> do_writepages+0x1e/0x30
> __writeback_single_inode+0x45/0x320
> writeback_sb_inodes+0x272/0x600
> __writeback_inodes_wb+0x92/0xc0
> wb_writeback+0x268/0x300
> wb_workfn+0xb4/0x390
> process_one_work+0x189/0x420
> worker_thread+0x4e/0x4b0
> kthread+0xe6/0x100
> ret_from_fork+0x41/0x50
>
> He adds:
> "task1 is waiting for the PageWriteback bit of the page that task2 has
> collected in mpd->io_submit->io_bio, and task2 is waiting for the
> LOCKED bit of the page which task1 has locked"
>
> More precisely, task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg. That in turn hits a
> memory limit reclaim, and the memcg reclaim for the legacy controller is
> waiting on the writeback, but that is never going to finish because the
> writeback itself is waiting for the page locked in the #PF path. So
> this is essentially an ABBA deadlock:
>
> 	lock_page(A)
> 	SetPageWriteback(A)
> 	unlock_page(A)
> 	lock_page(B)
> 						lock_page(B)
> 						pte_alloc_one
> 						  shrink_page_list
> 						    wait_on_page_writeback(A)
> 	SetPageWriteback(B)
> 	unlock_page(B)
>
> 	# flush A, B to clear the writeback
>
> This accumulating of more pages to flush is used by several filesystems
> to generate more optimal IO patterns.
>
> Waiting for the writeback in the legacy memcg controller is a workaround
> for premature OOM killer invocations because there is no dirty IO
> throttling available for the controller. There is no easy way around
> that, unfortunately. Therefore fix this specific issue by pre-allocating
> the page table outside of the page lock. We already have handy
> infrastructure for that, so simply reuse the fault-around pattern
> which already does this.
>
> There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> made while a fs page is locked, but they should be really rare. I am not
> aware of a better solution, unfortunately.
>
> [akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
> [akpm@linux-foundation.org: coding-style fixes]
> [mhocko@kernel.org: enhance comment, per Johannes]
> Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
> Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
> Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Reported-by: Liu Bo <bo.liu@linux.alibaba.com>
> Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Theodore Ts'o <tytso@mit.edu>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>
> diff --git a/mm/memory.c b/mm/memory.c
> index a52663c0612d..5e46836714dc 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2994,6 +2994,28 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	vm_fault_t ret;
>  
> +	/*
> +	 * Preallocate pte before we take page_lock because this might lead to
> +	 * deadlocks for memcg reclaim which waits for pages under writeback:
> +	 *				lock_page(A)
> +	 *				SetPageWriteback(A)
> +	 *				unlock_page(A)
> +	 * lock_page(B)
> +	 *				lock_page(B)
> +	 * pte_alloc_one
> +	 *   shrink_page_list
> +	 *     wait_on_page_writeback(A)
> +	 *				SetPageWriteback(B)
> +	 *				unlock_page(B)
> +	 *				# flush A, B to clear the writeback
> +	 */
> +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
> +		if (!vmf->prealloc_pte)
> +			return VM_FAULT_OOM;
> +		smp_wmb(); /* See comment in __pte_alloc() */
> +	}
> +
>  	ret = vma->vm_ops->fault(vmf);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
>  			    VM_FAULT_DONE_COW)))
>
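
For anybody evaluating a backport, the ordering problem described in the
changelog above can be modelled in user space. What follows is only a
minimal single-threaded sketch in C, not kernel code: struct world,
charged_alloc() and the two fault_*() helpers are hypothetical names made
up for illustration, and the deadlock is recorded as a flag instead of
actually blocking. It assumes the flusher has already set writeback on
page A and needs the lock on page B before it submits the batch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model state; none of these names are kernel APIs. */
struct world {
	bool page_b_locked_by_fault;	/* fault path holds the lock on page B */
	bool page_a_writeback;		/* flusher set writeback on A, IO not yet submitted */
	bool deadlocked;		/* set when a wait could never finish */
};

/*
 * Models a charged allocation that hits the memcg limit: reclaim waits
 * for writeback on page A. The flusher only submits the IO once it has
 * locked page B as well, so the wait cannot finish while the caller
 * holds B locked.
 */
static void *charged_alloc(struct world *w, size_t size)
{
	if (w->page_a_writeback) {
		if (w->page_b_locked_by_fault) {
			w->deadlocked = true;	/* ABBA: nobody can make progress */
			return NULL;
		}
		/* Flusher locks B, submits A and B, IO completes. */
		w->page_a_writeback = false;
	}
	return malloc(size);
}

/* Old ordering: take the page lock first, then allocate the page table. */
static bool fault_alloc_under_lock(struct world *w)
{
	void *pte;

	w->page_b_locked_by_fault = true;	/* ->fault() returned a locked page */
	pte = charged_alloc(w, 64);
	w->page_b_locked_by_fault = false;
	free(pte);				/* free(NULL) is a no-op */
	return !w->deadlocked;
}

/* Patched ordering: preallocate the page table, then take the page lock. */
static bool fault_prealloc_first(struct world *w)
{
	void *pte = charged_alloc(w, 64);

	if (!pte)
		return false;			/* models VM_FAULT_OOM */
	w->page_b_locked_by_fault = true;	/* ->fault() returned a locked page */
	/* The preallocated table would be installed here, with no allocation. */
	w->page_b_locked_by_fault = false;
	free(pte);
	return !w->deadlocked;
}
```

Starting from a world with page_a_writeback set, fault_alloc_under_lock()
reports the deadlock while fault_prealloc_first() completes, because the
only allocation that can enter reclaim happens before page B is locked.
The model says nothing about how hard the backport itself is.
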
--
Michal Hocko
SUSE Labs