From: Mike Rapoport <rppt@linux.ibm.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: akpm@linux-foundation.org, bhe@redhat.com, cai@lca.pw,
david@redhat.com, mgorman@suse.de, mhocko@kernel.org,
mm-commits@vger.kernel.org, stable@vger.kernel.org,
vbabka@suse.cz
Subject: Re: + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree
Date: Wed, 9 Dec 2020 23:42:31 +0200 [thread overview]
Message-ID: <20201209214231.GC8939@linux.ibm.com> (raw)
In-Reply-To: <X9AIq3bwYXtrpFvx@redhat.com>
On Tue, Dec 08, 2020 at 06:13:47PM -0500, Andrea Arcangeli wrote:
> On Tue, Dec 08, 2020 at 11:46:14PM +0200, Mike Rapoport wrote:
>
> > Sorry, I was not clear. The penalty here is not the traversal of
> > memblock.reserved array. The penalty is for redundant initialization of
> > struct page for ranges memblock.reserved describes.
>
> But that's the fix... How can you avoid removing the PagePoison for
> all pfn_valid in memblock.reserved, when the crash of
> https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org is
> in __SetPageReserved of reserve_bootmem_region?
With your fix the pages in the memblock.reserved that intersects
with memblock.memory will be initialized twice: first time in
for_each_mem_pfn_range() and the second time in for_each_res_pfn_range()
> > For instance, the if user specified hugetlb_cma=nnG loop over
> > memblock.reserved will cause redundant initialization pass over the
> > pages in this nnG.
>
> And if that didn't happen, then I'm left wondering if
> free_low_memory_core_early would crash again with specified
> hugetlb_cma=nnG, not just in pfn 0 anymore even with the simple fix
> applied.
>
> Where does PagePoison get removed from the CMA range, or is somehow
> __SetPageReserved not called with the simple fix applied on the CMA
> reserved range?
CMA range would be in both memblock.memory and memblock.reserved. So the
pages for it will be initialized in for_each_mem_pfn_range().
> > Regardless of particular implementation, the generic code cannot detect
> > physical memory layout, this is up to architecture to detect what memory
> > banks are populated and provide this information to the generic code.
> >
> > So we'll always have what you call "dependencies on the caller" here.
> > The good thing is that we can fix "the caller" when we find it's wrong.
>
> The caller that massages the e820 bios table, is still
> software. Software can go from arch code to common code. There's no
> asm or arch dependency in there.
>
> The caller in my view needs to know what e820 is and send down the raw
> info... it's a 1:1 API translation, the caller massaging of the data
> can all go in the callee and benefit all callers.
Well, the translation between e820 and memblock could been 1:1, but it
is not. I think e820 in more broadly x86 memory initialization could
benifit from more straightforward translation to memblock. For instance,
I don't see why the pfn 0 should be marked as reserved in both e820 and
memblock and why some of the e820 reserved areas never make it to
memblock.reserved.
> The benefit is once the callee does all validation, then all archs
> that do a mistake will be told and boot will not fail and it'll be
> more likely to boot stable even in presence of inconsistent hardware
> tables.
That's true. Having the generic code more robust will benifit everybody.
> > > I guess before worrying of pfn 1 not being added to memblock.reserved,
> > > I'd worry about pfn 0 itself not being added to memblock.reserved.
> >
> > My point was that with a bit different e820 table we may again hit a
> > corner case that we didn't anticipate.
> >
> > pfn 0 is in memblock.reserved by a coincidence and not by design and using
> > memblock.reserved to detect zone/node boundaries is not reliable.
> >
> > Moreover, if you look for usage of for_each_mem_pfn_range() in
> > page_alloc.c, it's looks very much that it is presumed to iterate over
> > *all* physical memory, including mysterious pfn 0.
>
> All the complex patch does it that it guarantees all the reserved
> pages can get a "right" zoneid.
>
> Then we certainly should consider rounding it down by the pageblock in
> case pfn 0 isn't reserved.
>
> In fact if you add all memblock.reserved ranges to memblock.memory,
> you'll obtain exactly the same result in the zone_start_pfn as with
> the complex patch in the zone_start_pfn calculation and the exact same
> issue of PagePoison being left on pfn 0 will happen if pfn 0 isn't
> added to memblock.memory with { memblock_add; memblock_reserved }.
>
> > > Still with the complex patch applied, if something goes wrong
> > > DEBUG_VM=y will make it reproducible, with the simple fix we return in
> > > non-reproducible land.
> >
> > I still think that the complex patch is going in the wrong direction.
> > We cannot rely on memblock.reserved to calculate zones and nodes span.
> >
> > This:
> >
> > + /*
> > + * reserved regions must be included so that their page
> > + * structure can be part of a zone and obtain a valid zoneid
> > + * before __SetPageReserved().
> > + */
> > + return min(PHYS_PFN(memblock_start_of_DRAM()),
> > + PHYS_PFN(memblock.reserved.regions[0].base));
> >
> > is definitely a hack that worked because "the caller" has this:
> >
> > /*
> > * Make sure page 0 is always reserved because on systems with
> > * L1TF its contents can be leaked to user processes.
> > */
> > memblock_reserve(0, PAGE_SIZE);
> >
> > So, what would have happen if L1TF was not yet disclosed? ;-)
>
> It's actually fine thing to delete the above memblock_reserve(0,
> PAGE_SIZE) for all testing so we can move into the remaining issues.
There are few more places in arch/x86/kernel/setup.c that
memblock_reserve() one or several pages from address 0 :-)
> The end result of the "hack" is precisely what you obtain if you add
> all memblock.reserved to memblock.memory, except it doesn't require to
> add memblock.reserved to memblock.memory.
I still do not agree that there can be minimum between
memblock_start_of_DRAM() and anything else. I think it's semantically
wrong.
And I really dislike addition of memblock.reserved traversals, mostly
because of the areas that actually overlap.
However, I do agree that adding non-overlaping part of memblock.reserved
to memblock.memory will make the generic code more robust. I just don't
want to do this implicitly by calling memblock_add() from
memblock_reserve(). I think the cleaner way is to join them just before
free_area_init() starts zone/node size detection.
> Thanks,
> Andrea
>
--
Sincerely yours,
Mike.
next prev parent reply other threads:[~2020-12-09 21:47 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-12-06 0:54 + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree akpm
2020-12-06 6:48 ` Mike Rapoport
2020-12-06 19:30 ` Andrea Arcangeli
2020-12-07 16:50 ` Mike Rapoport
2020-12-08 2:57 ` Andrea Arcangeli
2020-12-08 21:46 ` Mike Rapoport
2020-12-08 23:13 ` Andrea Arcangeli
2020-12-09 21:42 ` Mike Rapoport [this message]
2020-12-07 8:58 ` David Hildenbrand
2020-12-07 9:45 ` Mike Rapoport
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20201209214231.GC8939@linux.ibm.com \
--to=rppt@linux.ibm.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=bhe@redhat.com \
--cc=cai@lca.pw \
--cc=david@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mgorman@suse.de \
--cc=mhocko@kernel.org \
--cc=mm-commits@vger.kernel.org \
--cc=stable@vger.kernel.org \
--cc=vbabka@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.