From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
stable@vger.kernel.org, Michal Hocko <mhocko@suse.com>,
Ondrej Kozina <okozina@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>, NeilBrown <neilb@suse.com>,
David Rientjes <rientjes@google.com>,
Mikulas Patocka <mpatocka@redhat.com>,
Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
Mel Gorman <mgorman@suse.de>,
Andrew Morton <akpm@linux-foundation.org>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 4.6 45/56] Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"
Date: Sun, 14 Aug 2016 22:37:49 +0200 [thread overview]
Message-ID: <20160814202506.781093795@linuxfoundation.org> (raw)
In-Reply-To: <20160814202504.908694181@linuxfoundation.org>
4.6-stable review patch. If anyone has any objections, please let me know.
------------------
From: Michal Hocko <mhocko@suse.com>
commit 4e390b2b2f34b8daaabf2df1df0cf8f798b87ddb upstream.
This reverts commit f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC
if there are free elements").
There has been a report about OOM killer invoked when swapping out to a
dm-crypt device. The primary reason seems to be that the swapout out IO
managed to completely deplete memory reserves. Ondrej was able to
bisect and explained the issue by pointing to f9054c70d28b ("mm,
mempool: only set __GFP_NOMEMALLOC if there are free elements").
The reason is that the swapout path is not throttled properly because
the md-raid layer needs to allocate from the generic_make_request path
which means it allocates from the PF_MEMALLOC context. dm layer uses
mempool_alloc in order to guarantee a forward progress which used to
inhibit access to memory reserves when using page allocator. This has
changed by f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if
there are free elements") which has dropped the __GFP_NOMEMALLOC
protection when the memory pool is depleted.
If we are running out of memory and the only way forward to free memory
is to perform swapout we just keep consuming memory reserves rather than
throttling the mempool allocations and allowing the pending IO to
complete up to a moment when the memory is depleted completely and there
is no way forward but invoking the OOM killer. This is less than
optimal.
The original intention of f9054c70d28b was to help with the OOM
situations where the oom victim depends on mempool allocation to make a
forward progress. David has mentioned the following backtrace:
schedule
schedule_timeout
io_schedule_timeout
mempool_alloc
__split_and_process_bio
dm_request
generic_make_request
submit_bio
mpage_readpages
ext4_readpages
__do_page_cache_readahead
ra_submit
filemap_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault
We do not know more about why the mempool is depleted without being
replenished in time, though. In any case the dm layer shouldn't depend
on any allocations outside of the dedicated pools so a forward progress
should be guaranteed. If this is not the case then the dm should be
fixed rather than papering over the problem and postponing it to later
by accessing more memory reserves.
mempools are a mechanism to maintain dedicated memory reserves to
guaratee forward progress. Allowing them an unbounded access to the
page allocator memory reserves is going against the whole purpose of
this mechanism.
Bisected by Ondrej Kozina.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20160721145309.GR26379@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: NeilBrown <neilb@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/mempool.c | 18 +++---------------
1 file changed, 3 insertions(+), 15 deletions(-)
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -310,7 +310,7 @@ EXPORT_SYMBOL(mempool_resize);
* returns NULL. Note that due to preallocation, this function
* *never* fails when called from process contexts. (it might
* fail if called from an IRQ context.)
- * Note: neither __GFP_NOMEMALLOC nor __GFP_ZERO are supported.
+ * Note: using __GFP_ZERO is not supported.
*/
void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
{
@@ -319,27 +319,16 @@ void *mempool_alloc(mempool_t *pool, gfp
wait_queue_t wait;
gfp_t gfp_temp;
- /* If oom killed, memory reserves are essential to prevent livelock */
- VM_WARN_ON_ONCE(gfp_mask & __GFP_NOMEMALLOC);
- /* No element size to zero on allocation */
VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
-
might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
+ gfp_mask |= __GFP_NOMEMALLOC; /* don't allocate emergency reserves */
gfp_mask |= __GFP_NORETRY; /* don't loop in __alloc_pages */
gfp_mask |= __GFP_NOWARN; /* failures are OK */
gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
repeat_alloc:
- if (likely(pool->curr_nr)) {
- /*
- * Don't allocate from emergency reserves if there are
- * elements available. This check is racy, but it will
- * be rechecked each loop.
- */
- gfp_temp |= __GFP_NOMEMALLOC;
- }
element = pool->alloc(gfp_temp, pool->pool_data);
if (likely(element != NULL))
@@ -363,12 +352,11 @@ repeat_alloc:
* We use gfp mask w/o direct reclaim or IO for the first round. If
* alloc failed with that and @pool was empty, retry immediately.
*/
- if ((gfp_temp & ~__GFP_NOMEMALLOC) != gfp_mask) {
+ if (gfp_temp != gfp_mask) {
spin_unlock_irqrestore(&pool->lock, flags);
gfp_temp = gfp_mask;
goto repeat_alloc;
}
- gfp_temp = gfp_mask;
/* We must not sleep if !__GFP_DIRECT_RECLAIM */
if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
next prev parent reply other threads:[~2016-08-14 20:37 UTC|newest]
Thread overview: 58+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <CGME20160814203815uscas1p2549802c8af27d2aa233de8bce43fe3ee@uscas1p2.samsung.com>
2016-08-14 20:37 ` [PATCH 4.6 00/56] 4.6.7-stable review Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 01/56] libnvdimm, dax: record the specified alignment of a dax-device instance Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 02/56] libnvdimm, pfn, dax: fix initialization vs autodetect for mode + alignment Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 03/56] ppp: defer netns reference release for ppp channel Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 04/56] tcp: make challenge acks less predictable Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 05/56] tcp: enable per-socket rate limiting of all challenge acks Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 06/56] bonding: set carrier off for devices created through netlink Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 07/56] net: bgmac: Fix infinite loop in bgmac_dma_tx_add() Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 08/56] vlan: use a valid default mtu value for vlan over macsec Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 09/56] bridge: Fix incorrect re-injection of LLDP packets Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 10/56] net: ipv6: Always leave anycast and multicast groups on link down Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 11/56] net/irda: fix NULL pointer dereference on memory allocation failure Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 12/56] qed: Fix setting/clearing bit in completion bitmap Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 13/56] macsec: ensure rx_sa is set when validation is disabled Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 14/56] tcp: consider recv buf for the initial window scale Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 16/56] arm: oabi compat: add missing access checks Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 17/56] KEYS: 64-bit MIPS needs to use compat_sys_keyctl for 32-bit userspace Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 18/56] IB/hfi1: Correct issues with sc5 computation Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 19/56] IB/hfi1: Fix deadlock with txreq allocation slow path Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 20/56] apparmor: fix ref count leak when profile sha1 hash is read Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 21/56] regulator: qcom_smd: Remove list_voltage callback for rpm_smps_ldo_ops_fixed Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 22/56] random: strengthen input validation for RNDADDTOENTCNT Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 23/56] x86/mm/pat: Add support of non-default PAT MSR setting Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 24/56] x86/mm/pat: Add pat_disable() interface Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 25/56] x86/mm/pat: Replace cpu_has_pat with boot_cpu_has() Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 26/56] x86/mtrr: Fix Xorg crashes in Qemu sessions Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 27/56] x86/mtrr: Fix PAT init handling when MTRR is disabled Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 28/56] x86/xen, pat: Remove PAT table init code from Xen Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 29/56] x86/pat: Document the PAT initialization sequence Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 30/56] x86/mm/pat: Fix BUG_ON() in mmap_mem() on QEMU/i386 Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 31/56] udf: Prevent stack overflow on corrupted filesystem mount Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 32/56] powerpc/eeh: Fix invalid cached PE primary bus Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 33/56] powerpc/bpf/jit: Disable classic BPF JIT on ppc64le Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 34/56] mm: memcontrol: fix swap counter leak on swapout from offline cgroup Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 35/56] mm: memcontrol: fix memcg id ref counter on swap charge move Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 36/56] x86/syscalls/64: Add compat_sys_keyctl for 32-bit userspace Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 37/56] block: fix use-after-free in seq file Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 38/56] sysv, ipc: fix security-layer leaking Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 39/56] radix-tree: account nodes to memcg only if explicitly requested Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 40/56] x86/microcode: Fix suspend to RAM with builtin microcode Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 41/56] x86/power/64: Fix hibernation return address corruption Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 42/56] fuse: fsync() did not return IO errors Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 43/56] fuse: fuse_flush must check mapping->flags for errors Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 44/56] fuse: fix wrong assignment of ->flags in fuse_send_init() Greg Kroah-Hartman
2016-08-14 20:37 ` Greg Kroah-Hartman [this message]
2016-08-14 20:37 ` [PATCH 4.6 46/56] fs/dcache.c: avoid soft-lockup in dput() Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 47/56] Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency" Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 48/56] crypto: gcm - Filter out async ghash if necessary Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 49/56] crypto: scatterwalk - Fix test in scatterwalk_done Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 50/56] serial: mvebu-uart: free the IRQ in ->shutdown() Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 51/56] ext4: check for extents that wrap around Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 52/56] ext4: fix deadlock during page writeback Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 53/56] ext4: dont call ext4_should_journal_data() on the journal inode Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 54/56] ext4: validate s_reserved_gdt_blocks on mount Greg Kroah-Hartman
2016-08-14 20:37 ` [PATCH 4.6 55/56] ext4: short-cut orphan cleanup on error Greg Kroah-Hartman
2016-08-14 20:38 ` [PATCH 4.6 56/56] ext4: fix reference counting bug on block allocation error Greg Kroah-Hartman
2016-08-15 13:07 ` [PATCH 4.6 00/56] 4.6.7-stable review Guenter Roeck
2016-08-16 4:02 ` Shuah Khan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160814202506.781093795@linuxfoundation.org \
--to=gregkh@linuxfoundation.org \
--cc=akpm@linux-foundation.org \
--cc=hannes@cmpxchg.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mgorman@suse.de \
--cc=mhocko@suse.com \
--cc=mpatocka@redhat.com \
--cc=neilb@suse.com \
--cc=okozina@redhat.com \
--cc=penguin-kernel@i-love.sakura.ne.jp \
--cc=rientjes@google.com \
--cc=stable@vger.kernel.org \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).