From: Cliff Wickman <cpw@sgi.com>
To: Michal Hocko <mhocko@suse.cz>
Cc: cmetcalf@tilera.com, dave@linux.vnet.ibm.com, dhillf@gmail.com,
dwg@au1.ibm.com, kamezawa.hiroyuki@gmail.com,
khlebnikov@openvz.org, lee.schermerhorn@hp.com, mgorman@suse.de,
shhuiw@gmail.com, viro@zeniv.linux.org.uk, linux-mm@kvack.org
Subject: Re: [PATCH v2] list corruption by gather_surp
Date: Tue, 31 Jul 2012 18:13:06 -0500 [thread overview]
Message-ID: <20120731231306.GA25248@sgi.com> (raw)
In-Reply-To: <20120730122224.GA12680@tiehlicka.suse.cz>
On Mon, Jul 30, 2012 at 02:22:24PM +0200, Michal Hocko wrote:
> On Fri 27-07-12 17:32:15, Cliff Wickman wrote:
> > From: Cliff Wickman <cpw@sgi.com>
> >
> > v2: diff'd against linux-next
> >
> > I am seeing list corruption occurring from within gather_surplus_pages()
> > (mm/hugetlb.c). The problem occurs in a RHEL6 kernel under a heavy load,
> > and seems to be because this function drops the hugetlb_lock.
> > The list_add() in gather_surplus_pages() seems to need to be protected by
> > the lock.
> > (I don't have a similar test for a linux-next kernel)
>
> Because you cannot reproduce or you just didn't test it with linux-next?
>
> > I have CONFIG_DEBUG_LIST=y, and am running an MPI application with 64 threads
> > and a library that creates a large heap of hugetlbfs pages for it.
> >
> > The below patch fixes the problem.
> > The gist of this patch is that gather_surplus_pages() does not have to drop
>
> But you cannot hold spinlock while allocating memory because the
> allocation is not atomic and you could deadlock easily.
>
> > the lock if alloc_buddy_huge_page() is told whether the lock is already held.
>
> The changelog doesn't actually explain how does the list gets corrupted.
> alloc_buddy_huge_page doesn't provide the freshly allocated page to use
> so nobody could get and free it. enqueue_huge_page happens under hugetlb_lock.
> I am sorry but I do not see how we could race here.
I finally got my test running on a linux-next kernel and could not
reproduce the problem.
So I agree that no race seems possible now. Disregard this patch.
I'll offer the fix to the distro of the old kernel on which I saw the
problem.
>
>
> > Signed-off-by: Cliff Wickman <cpw@sgi.com>
> > ---
> > mm/hugetlb.c | 29 ++++++++++++++++-------------
> > 1 file changed, 16 insertions(+), 13 deletions(-)
> >
> > Index: linux/mm/hugetlb.c
> > ===================================================================
> > --- linux.orig/mm/hugetlb.c
> > +++ linux/mm/hugetlb.c
> > @@ -838,7 +838,9 @@ static int free_pool_huge_page(struct hs
> > return ret;
> > }
> >
> > -static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
> > +/* already_locked means the caller has already locked hugetlb_lock */
> > +static struct page *alloc_buddy_huge_page(struct hstate *h, int nid,
> > + int already_locked)
> > {
> > struct page *page;
> > unsigned int r_nid;
> > @@ -869,15 +871,19 @@ static struct page *alloc_buddy_huge_pag
> > * the node values until we've gotten the hugepage and only the
> > * per-node value is checked there.
> > */
> > - spin_lock(&hugetlb_lock);
> > + if (!already_locked)
> > + spin_lock(&hugetlb_lock);
> > +
> > if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
> > - spin_unlock(&hugetlb_lock);
> > + if (!already_locked)
> > + spin_unlock(&hugetlb_lock);
> > return NULL;
> > } else {
> > h->nr_huge_pages++;
> > h->surplus_huge_pages++;
> > }
> > spin_unlock(&hugetlb_lock);
> > + /* page allocation may sleep, so the lock must be unlocked */
> >
> > if (nid == NUMA_NO_NODE)
> > page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
> > @@ -910,7 +916,8 @@ static struct page *alloc_buddy_huge_pag
> > h->surplus_huge_pages--;
> > __count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
> > }
> > - spin_unlock(&hugetlb_lock);
> > + if (!already_locked)
> > + spin_unlock(&hugetlb_lock);
> >
> > return page;
> > }
> > @@ -929,7 +936,7 @@ struct page *alloc_huge_page_node(struct
> > spin_unlock(&hugetlb_lock);
> >
> > if (!page)
> > - page = alloc_buddy_huge_page(h, nid);
> > + page = alloc_buddy_huge_page(h, nid, 0);
> >
> > return page;
> > }
> > @@ -937,6 +944,7 @@ struct page *alloc_huge_page_node(struct
> > /*
> > * Increase the hugetlb pool such that it can accommodate a reservation
> > * of size 'delta'.
> > + * This is entered and exited with hugetlb_lock locked.
> > */
> > static int gather_surplus_pages(struct hstate *h, int delta)
> > {
> > @@ -957,9 +965,8 @@ static int gather_surplus_pages(struct h
> >
> > ret = -ENOMEM;
> > retry:
> > - spin_unlock(&hugetlb_lock);
> > for (i = 0; i < needed; i++) {
> > - page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> > + page = alloc_buddy_huge_page(h, NUMA_NO_NODE, 1);
> > if (!page) {
> > alloc_ok = false;
> > break;
> > @@ -969,10 +976,9 @@ retry:
> > allocated += i;
> >
> > /*
> > - * After retaking hugetlb_lock, we need to recalculate 'needed'
> > + * With hugetlb_lock still locked, we need to recalculate 'needed'
> > * because either resv_huge_pages or free_huge_pages may have changed.
> > */
> > - spin_lock(&hugetlb_lock);
> > needed = (h->resv_huge_pages + delta) -
> > (h->free_huge_pages + allocated);
> > if (needed > 0) {
> > @@ -1010,15 +1016,12 @@ retry:
> > enqueue_huge_page(h, page);
> > }
> > free:
> > - spin_unlock(&hugetlb_lock);
> > -
> > /* Free unnecessary surplus pages to the buddy allocator */
> > if (!list_empty(&surplus_list)) {
> > list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
> > put_page(page);
> > }
> > }
> > - spin_lock(&hugetlb_lock);
> >
> > return ret;
> > }
> > @@ -1151,7 +1154,7 @@ static struct page *alloc_huge_page(stru
> > spin_unlock(&hugetlb_lock);
> > } else {
> > spin_unlock(&hugetlb_lock);
> > - page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> > + page = alloc_buddy_huge_page(h, NUMA_NO_NODE, 0);
> > if (!page) {
> > hugetlb_cgroup_uncharge_cgroup(idx,
> > pages_per_huge_page(h),
>
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
Cliff Wickman
SGI
cpw@sgi.com
(651) 683-3824
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2012-07-31 23:11 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-07-27 22:32 [PATCH v2] list corruption by gather_surp Cliff Wickman
2012-07-28 13:14 ` Hillf Danton
2012-07-30 12:22 ` Michal Hocko
2012-07-31 23:13 ` Cliff Wickman [this message]
2012-08-01 6:51 ` Michal Hocko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120731231306.GA25248@sgi.com \
--to=cpw@sgi.com \
--cc=cmetcalf@tilera.com \
--cc=dave@linux.vnet.ibm.com \
--cc=dhillf@gmail.com \
--cc=dwg@au1.ibm.com \
--cc=kamezawa.hiroyuki@gmail.com \
--cc=khlebnikov@openvz.org \
--cc=lee.schermerhorn@hp.com \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=mhocko@suse.cz \
--cc=shhuiw@gmail.com \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.