From: Vlastimil Babka <vbabka@suse.cz>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>,
Michal Hocko <mhocko@kernel.org>,
Hillf Danton <hillf.zj@alibaba-inc.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Vlastimil Babka <vbabka@suse.cz>,
stable@vger.kernel.org
Subject: [PATCH v2 4/4] mm, page_alloc: fix premature OOM when racing with cpuset mems update
Date: Fri, 20 Jan 2017 11:38:43 +0100 [thread overview]
Message-ID: <20170120103843.24587-5-vbabka@suse.cz> (raw)
In-Reply-To: <20170120103843.24587-1-vbabka@suse.cz>
Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode triggers
OOM killer in few seconds, despite lots of free memory. The test attempts to
repeatedly fault in memory in one process in a cpuset, while changing allowed
nodes of the cpuset between 0 and 1 in another process.
The problem comes from insufficient protection against cpuset changes, which
can cause get_page_from_freelist() to consider all zones as non-eligible due to
nodemask and/or current->mems_allowed. This was masked in the past by
sufficient retries, but since commit 682a3385e773 ("mm, page_alloc: inline the
fast path of the zonelist iterator") we fix the preferred_zoneref once, and
don't iterate over the whole zonelist in further attempts, thus the only
eligible zones might be placed in the zonelist before our starting point and we
always miss them.
A previous patch fixed this problem for current->mems_allowed. However, cpuset
changes also update the task's mempolicy nodemask. The fix has two parts. We
have to repeat the preferred_zoneref search when we detect cpuset update by way
of seqcount, and we have to check the seqcount before considering OOM.
Reported-by: Ganapatrao Kulkarni <gpkulkarni@gmail.com>
Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
---
mm/page_alloc.c | 35 ++++++++++++++++++++++++-----------
1 file changed, 24 insertions(+), 11 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd3b9839a355..1c331ff6fdc4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3555,6 +3555,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
no_progress_loops = 0;
compact_priority = DEF_COMPACT_PRIORITY;
cpuset_mems_cookie = read_mems_allowed_begin();
+ /*
+ * We need to recalculate the starting point for the zonelist iterator
+ * because we might have used different nodemask in the fast path, or
+ * there was a cpuset modification and we are retrying - otherwise we
+ * could end up iterating over non-eligible zones endlessly.
+ */
+ ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
+ ac->high_zoneidx, ac->nodemask);
+ if (!ac->preferred_zoneref->zone)
+ goto nopage;
+
/*
* The fast path uses conservative alloc_flags to succeed only until
@@ -3715,6 +3726,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
&compaction_retries))
goto retry;
+ /*
+ * It's possible we raced with cpuset update so the OOM would be
+ * premature (see below the nopage: label for full explanation).
+ */
+ if (read_mems_allowed_retry(cpuset_mems_cookie))
+ goto retry_cpuset;
+
/* Reclaim has failed us, start killing things */
page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
if (page)
@@ -3728,10 +3746,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
nopage:
/*
- * When updating a task's mems_allowed, it is possible to race with
- * parallel threads in such a way that an allocation can fail while
- * the mask is being updated. If a page allocation is about to fail,
- * check if the cpuset changed during allocation and if so, retry.
+ * When updating a task's mems_allowed or mempolicy nodeask, it is
+ * possible to race with parallel threads in such a way that our
+ * allocation can fail while the mask is being updated. If we are about
+ * to fail, check if the cpuset changed during allocation and if so,
+ * retry.
*/
if (read_mems_allowed_retry(cpuset_mems_cookie))
goto retry_cpuset;
@@ -3822,15 +3841,9 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
/*
* Restore the original nodemask if it was potentially replaced with
* &cpuset_current_mems_allowed to optimize the fast-path attempt.
- * Also recalculate the starting point for the zonelist iterator or
- * we could end up iterating over non-eligible zones endlessly.
*/
- if (unlikely(ac.nodemask != nodemask)) {
+ if (unlikely(ac.nodemask != nodemask))
ac.nodemask = nodemask;
- ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
- ac.high_zoneidx, ac.nodemask);
- /* If we have NULL preferred zone, slowpath wll handle that */
- }
page = __alloc_pages_slowpath(alloc_mask, order, &ac);
--
2.11.0
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2017-01-20 10:38 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-01-20 10:38 [PATCH v2 0/4] fix premature OOM regression in 4.7+ due to cpuset races Vlastimil Babka
2017-01-20 10:38 ` [PATCH v2 1/4] mm, page_alloc: fix check for NULL preferred_zone Vlastimil Babka
2017-01-20 10:38 ` [PATCH v2 2/4] mm, page_alloc: fix fast-path race with cpuset update or removal Vlastimil Babka
2017-01-20 10:38 ` [PATCH v2 3/4] mm, page_alloc: move cpuset seqcount checking to slowpath Vlastimil Babka
2017-01-20 10:38 ` Vlastimil Babka [this message]
2017-01-21 12:22 ` [PATCH v2 0/4] fix premature OOM regression in 4.7+ due to cpuset races Hillf Danton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170120103843.24587-5-vbabka@suse.cz \
--to=vbabka@suse.cz \
--cc=akpm@linux-foundation.org \
--cc=hillf.zj@alibaba-inc.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@techsingularity.net \
--cc=mhocko@kernel.org \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).