linux-mm.kvack.org archive mirror
From: Feng Tang <feng.tang@intel.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: "Sang, Oliver" <oliver.sang@intel.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	"oe-lkp@lists.linux.dev" <oe-lkp@lists.linux.dev>,
	lkp <lkp@intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Torvalds, Linus" <torvalds@linux-foundation.org>,
	Jann Horn <jannh@google.com>,
	"Song, Youquan" <youquan.song@intel.com>,
	Andrea Arcangeli <aarcange@redhat.com>, Jan Kara <jack@suse.cz>,
	John Hubbard <jhubbard@nvidia.com>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	Matthew Wilcox <willy@infradead.org>,
	Michal Hocko <mhocko@kernel.org>,
	Muchun Song <songmuchun@bytedance.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Hyeonggon Yoo <42.hyeyoo@gmail.com>,
	"Yin, Fengwei" <fengwei.yin@intel.com>
Subject: Re: [linus:master] [hugetlb] 7118fc2906: kernel_BUG_at_lib/list_debug.c
Date: Tue, 17 Jan 2023 16:01:08 +0800
Message-ID: <Y8ZVxJSZdtEk8Nco@feng-clx>
In-Reply-To: <2f483247-da76-9ec9-3e51-f690939f4585@suse.cz>

On Tue, Jan 17, 2023 at 03:39:15PM +0800, Vlastimil Babka wrote:
> On 1/17/23 08:10, kernel test robot wrote:
> > 
> > +Vlastimil Babka, Hyeonggon Yoo, Feng Tang and Fengwei Yin
> > 
> > Hi, Mike Kravetz,
> > 
> > we reported
> > "[linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h" [1]
> > 
> > Vlastimil, Hyeonggon, Feng and Fengwei gave us a lot of great guidance based on
> > it and, in particular, after enabling the config options below per Vlastimil's suggestion
> >   CONFIG_DEBUG_PAGEALLOC
> >   CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
> >   CONFIG_SLUB_DEBUG
> >   CONFIG_SLUB_DEBUG_ON
> > and with more tests, we realized that "0af8489b02" is not the real culprit.
> > 
> > A new bisection was triggered, and it finally pointed to this "7118fc2906".
> > 
> > though reporting for different issues
> > ("kernel_BUG_at_include/linux/mm.h" for 0af8489b02 vs.
> > "kernel_BUG_at_lib/list_debug.c" for this commit),
> > Feng and Fengwei helped further to confirm they are similar.
> > They will supply a more technical analysis later.
> > 
> > Please note that the issues do not always happen
> > (~10% on this commit or 0af8489b02)
> 
> Great find! Looking at the commit, I'd bet the only part relevant to our bug
> is the "by the way we remove setting refcount to zero on tail pages which
> should already be zero":
> 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index db00ee8d79d2..eeff64843718 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -754,7 +754,6 @@ void prep_compound_page(struct page *page, unsigned int order)
> >         __SetPageHead(page);
> >         for (i = 1; i < nr_pages; i++) {
> >                 struct page *p = page + i;
> > -               set_page_count(p, 0);
> >                 p->mapping = TAIL_MAPPING;
> >                 set_compound_head(p, page);
> >         }
> 
> So either the assumption of refcount being already 0 is wrong (shouldn't be,
> AFAIK?), or this atomic operation effectively prevents some very subtle race
> (although IIRC atomic_set() has no barrier semantics defined, it could still
> affect a specific CPU?)
 
Yes, "set_page_count(p, 0);" seems to be what matters here. Restoring
it makes the list corruption issue not reproducible for 300+ runs.

Back when we were debugging 0af8489b02, things were similar: if we
added some code inside prep_compound_page(), the issue also could not
be reproduced.

So this 7118fc2906 seems to just 'expose' the problem on i386, and is
not the root cause.

I suspect it is related to i386 compilation, based on the debugging and
memory dumps. I'm trying some compiler options and adding a memory
barrier in prep_compound_page(), and will update when the test runs
are done.

Thanks,
Feng

> I guess we could
> - try to restore that set_page_count(p, 0); on current kernel to see if it
> kills the bug
> - instead of restoring it, add (only locally for purposes of the test) a
> BUG_ON() if refcount is not zero already, and find out why if it triggers
> (unfortunately might also appear to fix the bug even if it doesn't trigger).
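
For illustration only, the second option above (check the tail refcounts
rather than reset them) could be modelled roughly as below. This is a
userspace sketch, not kernel code: struct page, TAIL_MAPPING and
prep_compound_page_checked() here are simplified, hypothetical stand-ins,
and the function counts and reports offending tail pages where a real
test patch would presumably use a BUG_ON().

```c
/*
 * Userspace model only. The struct layout, the TAIL_MAPPING value and the
 * function name are assumptions made for this sketch; they are not the
 * kernel's definitions.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

struct page {
	atomic_int refcount;        /* stands in for the page refcount */
	void *mapping;
	struct page *compound_head; /* simplified: plain pointer to the head */
};

#define TAIL_MAPPING ((void *)0x400) /* placeholder marker value */

/*
 * Set up the tail pages of a compound page of the given order, but
 * instead of forcing each tail refcount to zero, verify that it is
 * already zero. Returns the number of tail pages that violated the
 * assumption.
 */
static unsigned int prep_compound_page_checked(struct page *page,
					       unsigned int order)
{
	unsigned int nr_pages = 1u << order;
	unsigned int bad = 0;

	assert(page != NULL);

	for (unsigned int i = 1; i < nr_pages; i++) {
		struct page *p = page + i;
		int count = atomic_load(&p->refcount);

		if (count != 0) {
			/* a kernel test patch would likely BUG here */
			fprintf(stderr, "tail %u: unexpected refcount %d\n",
				i, count);
			bad++;
		}
		p->mapping = TAIL_MAPPING;
		p->compound_head = page;
	}
	return bad;
}
```

Calling prep_compound_page_checked() on a zero-initialized array of four
pages with order 2 returns 0; setting one tail's refcount to 1 beforehand
makes it return 1. As noted in the quoted text, the extra load could
itself change timing enough to hide the bug, so a clean run of such a
check would not be conclusive.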


Thread overview: 13+ messages
2023-01-17  7:10 [linus:master] [hugetlb] 7118fc2906: kernel_BUG_at_lib/list_debug.c kernel test robot
2023-01-17  7:39 ` Vlastimil Babka
2023-01-17  7:47   ` Yin, Fengwei
2023-01-17  8:01   ` Feng Tang [this message]
2023-01-17 12:20     ` Feng Tang
2023-01-17 18:25       ` Linus Torvalds
2023-01-18  1:07         ` Feng Tang
2023-01-18 13:31         ` Feng Tang
2023-01-18 17:10           ` Linus Torvalds
2023-01-19  2:19             ` Feng Tang
2023-01-18 13:35         ` Vlastimil Babka
2023-01-18 15:07           ` Feng Tang
2023-01-17 18:50 ` Mike Kravetz
