From: Mike Rapoport <rppt@linux.ibm.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: akpm@linux-foundation.org, bhe@redhat.com, cai@lca.pw,
	david@redhat.com, mgorman@suse.de, mhocko@kernel.org,
	mm-commits@vger.kernel.org, stable@vger.kernel.org,
	vbabka@suse.cz
Subject: Re: + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree
Date: Wed, 9 Dec 2020 23:42:31 +0200
Message-ID: <20201209214231.GC8939@linux.ibm.com>
In-Reply-To: <X9AIq3bwYXtrpFvx@redhat.com>

On Tue, Dec 08, 2020 at 06:13:47PM -0500, Andrea Arcangeli wrote:
> On Tue, Dec 08, 2020 at 11:46:14PM +0200, Mike Rapoport wrote:
>
> > Sorry, I was not clear. The penalty here is not the traversal of
> > memblock.reserved array. The penalty is for redundant initialization of
> > struct page for ranges memblock.reserved describes.
> 
> But that's the fix... How can you avoid removing the PagePoison for
> all pfn_valid in memblock.reserved, when the crash of
> https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org is
> in __SetPageReserved of reserve_bootmem_region?

With your fix, the pages in memblock.reserved that intersect with
memblock.memory will be initialized twice: first in
for_each_mem_pfn_range() and a second time in for_each_res_pfn_range().
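
Roughly, the flow I am worried about would look like this (a simplified
and untested sketch, not the actual memmap_init() code; the
for_each_res_pfn_range() iterator comes from your patch and init_range()
is a made-up stand-in for the struct page initialization):

	int i, nid = 0;
	unsigned long start_pfn, end_pfn;

	/* first pass: everything described by memblock.memory */
	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL)
		init_range(start_pfn, end_pfn, nid);

	/*
	 * second pass added by the patch: everything in memblock.reserved.
	 * Any range that is also in memblock.memory (the CMA area, for
	 * example) has its struct pages initialized again here.
	 */
	for_each_res_pfn_range(i, nid, &start_pfn, &end_pfn, NULL)
		init_range(start_pfn, end_pfn, nid);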

> > For instance, if the user specified hugetlb_cma=nnG, the loop over
> > memblock.reserved will cause a redundant initialization pass over the
> > pages in this nnG.
> 
> And if that didn't happen, then I'm left wondering whether
> free_low_memory_core_early would crash again with hugetlb_cma=nnG
> specified, and not just at pfn 0 anymore, even with the simple fix
> applied.
> 
> Where does PagePoison get removed from the CMA range, or is somehow
> __SetPageReserved not called with the simple fix applied on the CMA
> reserved range?

The CMA range would be in both memblock.memory and memblock.reserved, so
the pages for it will be initialized in for_each_mem_pfn_range().
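
To make the overlap concrete, here is a minimal sketch (the real
hugetlb_cma path goes through cma_declare_contiguous_nid(), but the net
effect on memblock is the same; the variables here are just placeholders):

	/* arch code (e820 on x86) registers the populated RAM: */
	memblock_add(ram_base, ram_size);

	/*
	 * hugetlb_cma=nnG later carves its area out of that RAM; the
	 * allocation records the range in memblock.reserved while it is
	 * still covered by memblock.memory, hence the intersection.
	 */
	cma_base = memblock_phys_alloc(cma_size, PAGE_SIZE);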

> > Regardless of particular implementation, the generic code cannot detect
> > physical memory layout, this is up to architecture to detect what memory
> > banks are populated and provide this information to the generic code.
> > 
> > So we'll always have what you call "dependencies on the caller" here.
> > The good thing is that we can fix "the caller" when we find it's wrong.
> 
> The caller that massages the e820 BIOS table is still software.
> Software can go from arch code to common code. There's no asm or arch
> dependency in there.
> 
> The caller, in my view, needs to know what e820 is and send down the raw
> info... it's a 1:1 API translation; the caller's massaging of the data
> can all go in the callee and benefit all callers.

Well, the translation between e820 and memblock could have been 1:1, but
it is not. I think e820, and more broadly x86 memory initialization, could
benefit from a more straightforward translation to memblock. For instance,
I don't see why pfn 0 should be marked as reserved in both e820 and
memblock, or why some of the e820 reserved areas never make it to
memblock.reserved.
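
For reference, the current translation looks roughly like this (simplified
from e820__memblock_setup() in arch/x86/kernel/e820.c as I remember it, so
the details may differ): only E820_TYPE_RAM and E820_TYPE_RESERVED_KERN
entries are added to memblock.memory, E820_TYPE_SOFT_RESERVED entries are
reserved, and plain E820_TYPE_RESERVED entries are skipped, which is why
they never show up in memblock.reserved:

	for (i = 0; i < e820_table->nr_entries; i++) {
		struct e820_entry *entry = &e820_table->entries[i];

		if (entry->type == E820_TYPE_SOFT_RESERVED)
			memblock_reserve(entry->addr, entry->size);

		/* E820_TYPE_RESERVED and friends are dropped here */
		if (entry->type != E820_TYPE_RAM &&
		    entry->type != E820_TYPE_RESERVED_KERN)
			continue;

		memblock_add(entry->addr, entry->size);
	}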

> The benefit is that once the callee does all the validation, all archs
> that make a mistake will be told, boot will not fail, and it'll be more
> likely to boot stably even in the presence of inconsistent hardware
> tables.

That's true. Making the generic code more robust will benefit everybody.

> > > I guess before worrying about pfn 1 not being added to
> > > memblock.reserved, I'd worry about pfn 0 itself not being added to
> > > memblock.reserved.
> > 
> > My point was that with a bit different e820 table we may again hit a
> > corner case that we didn't anticipate.
> > 
> > pfn 0 is in memblock.reserved by a coincidence and not by design and using
> > memblock.reserved to detect zone/node boundaries is not reliable.
> >
> > Moreover, if you look for usage of for_each_mem_pfn_range() in
> > page_alloc.c, it very much looks like it is presumed to iterate over
> > *all* physical memory, including the mysterious pfn 0.
> 
> All the complex patch does is guarantee that all the reserved pages can
> get a "right" zoneid.
> 
> Then we certainly should consider rounding it down by the pageblock in
> case pfn 0 isn't reserved.
> 
> In fact, if you add all memblock.reserved ranges to memblock.memory,
> you'll obtain exactly the same result in the zone_start_pfn calculation
> as with the complex patch, and the exact same issue of PagePoison being
> left on pfn 0 will happen if pfn 0 isn't added to memblock.memory with
> { memblock_add; memblock_reserve }.
>  
> > > Still, with the complex patch applied, if something goes wrong
> > > DEBUG_VM=y will make it reproducible; with the simple fix we're back
> > > in non-reproducible land.
> > 
> > I still think that the complex patch is going in the wrong direction.
> > We cannot rely on memblock.reserved to calculate zone and node spans.
> > 
> > This:
> > 
> > +       /*
> > +        * reserved regions must be included so that their page
> > +        * structure can be part of a zone and obtain a valid zoneid
> > +        * before __SetPageReserved().
> > +        */
> > +       return min(PHYS_PFN(memblock_start_of_DRAM()),
> > +                  PHYS_PFN(memblock.reserved.regions[0].base));
> > 
> > is definitely a hack that worked because "the caller" has this:
> > 
> > 	/*
> > 	 * Make sure page 0 is always reserved because on systems with
> > 	 * L1TF its contents can be leaked to user processes.
> > 	 */
> > 	memblock_reserve(0, PAGE_SIZE);
> > 
> > So, what would have happened if L1TF had not yet been disclosed? ;-)
> 
> It's actually a fine thing to delete the above memblock_reserve(0,
> PAGE_SIZE) for all testing so we can move on to the remaining issues.

There are a few more places in arch/x86/kernel/setup.c that
memblock_reserve() one or more pages starting from address 0 :-)
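
For example (quoting from memory around the v5.10 tree, so the exact shape
may differ), there is also this, with reserve_low defaulting to
CONFIG_X86_RESERVE_LOW:

	static void __init trim_low_memory_range(void)
	{
		memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
	}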

> The end result of the "hack" is precisely what you obtain if you add
> all of memblock.reserved to memblock.memory, except it doesn't require
> adding memblock.reserved to memblock.memory.

I still do not agree that there can be a minimum taken between
memblock_start_of_DRAM() and anything else. I think it's semantically
wrong.

And I really dislike the addition of memblock.reserved traversals, mostly
because of the areas that actually overlap.

However, I do agree that adding the non-overlapping parts of
memblock.reserved to memblock.memory will make the generic code more
robust. I just don't want to do this implicitly by calling memblock_add()
from memblock_reserve(). I think the cleaner way is to join them just
before free_area_init() starts zone/node size detection, along the lines
of the sketch below.
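
Something like this is what I have in mind (an untested sketch, the
function name is made up; memblock_add() already merges overlapping
ranges, the memblock_is_region_memory() check just avoids pointless
churn for regions that are already fully covered):

	/* call this right before free_area_init() computes zone/node spans */
	static void __init memblock_add_reserved_to_memory(void)
	{
		unsigned long i;
		phys_addr_t base, size;

		for (i = 0; i < memblock.reserved.cnt; i++) {
			base = memblock.reserved.regions[i].base;
			size = memblock.reserved.regions[i].size;

			/* already a subset of memblock.memory, nothing to do */
			if (memblock_is_region_memory(base, size))
				continue;

			memblock_add(base, size);
		}
	}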

> Thanks,
> Andrea
> 

-- 
Sincerely yours,
Mike.
