From: Mike Rapoport <rppt@linux.ibm.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: akpm@linux-foundation.org, bhe@redhat.com, cai@lca.pw,
david@redhat.com, mgorman@suse.de, mhocko@kernel.org,
mm-commits@vger.kernel.org, stable@vger.kernel.org,
vbabka@suse.cz
Subject: Re: + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree
Date: Wed, 9 Dec 2020 23:42:31 +0200 [thread overview]
Message-ID: <20201209214231.GC8939@linux.ibm.com> (raw)
In-Reply-To: <X9AIq3bwYXtrpFvx@redhat.com>
On Tue, Dec 08, 2020 at 06:13:47PM -0500, Andrea Arcangeli wrote:
> On Tue, Dec 08, 2020 at 11:46:14PM +0200, Mike Rapoport wrote:
>
> > Sorry, I was not clear. The penalty here is not the traversal of
> > memblock.reserved array. The penalty is for redundant initialization of
> > struct page for ranges memblock.reserved describes.
>
> But that's the fix... How can you avoid removing the PagePoison for
> all pfn_valid in memblock.reserved, when the crash of
> https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org is
> in __SetPageReserved of reserve_bootmem_region?
With your fix the pages in the memblock.reserved that intersects
with memblock.memory will be initialized twice: first time in
for_each_mem_pfn_range() and the second time in for_each_res_pfn_range()
> > For instance, the if user specified hugetlb_cma=nnG loop over
> > memblock.reserved will cause redundant initialization pass over the
> > pages in this nnG.
>
> And if that didn't happen, then I'm left wondering if
> free_low_memory_core_early would crash again with specified
> hugetlb_cma=nnG, not just in pfn 0 anymore even with the simple fix
> applied.
>
> Where does PagePoison get removed from the CMA range, or is somehow
> __SetPageReserved not called with the simple fix applied on the CMA
> reserved range?
CMA range would be in both memblock.memory and memblock.reserved. So the
pages for it will be initialized in for_each_mem_pfn_range().
> > Regardless of particular implementation, the generic code cannot detect
> > physical memory layout, this is up to architecture to detect what memory
> > banks are populated and provide this information to the generic code.
> >
> > So we'll always have what you call "dependencies on the caller" here.
> > The good thing is that we can fix "the caller" when we find it's wrong.
>
> The caller that massages the e820 bios table, is still
> software. Software can go from arch code to common code. There's no
> asm or arch dependency in there.
>
> The caller in my view needs to know what e820 is and send down the raw
> info... it's a 1:1 API translation, the caller massaging of the data
> can all go in the callee and benefit all callers.
Well, the translation between e820 and memblock could been 1:1, but it
is not. I think e820 in more broadly x86 memory initialization could
benifit from more straightforward translation to memblock. For instance,
I don't see why the pfn 0 should be marked as reserved in both e820 and
memblock and why some of the e820 reserved areas never make it to
memblock.reserved.
> The benefit is once the callee does all validation, then all archs
> that do a mistake will be told and boot will not fail and it'll be
> more likely to boot stable even in presence of inconsistent hardware
> tables.
That's true. Having the generic code more robust will benifit everybody.
> > > I guess before worrying of pfn 1 not being added to memblock.reserved,
> > > I'd worry about pfn 0 itself not being added to memblock.reserved.
> >
> > My point was that with a bit different e820 table we may again hit a
> > corner case that we didn't anticipate.
> >
> > pfn 0 is in memblock.reserved by a coincidence and not by design and using
> > memblock.reserved to detect zone/node boundaries is not reliable.
> >
> > Moreover, if you look for usage of for_each_mem_pfn_range() in
> > page_alloc.c, it's looks very much that it is presumed to iterate over
> > *all* physical memory, including mysterious pfn 0.
>
> All the complex patch does it that it guarantees all the reserved
> pages can get a "right" zoneid.
>
> Then we certainly should consider rounding it down by the pageblock in
> case pfn 0 isn't reserved.
>
> In fact if you add all memblock.reserved ranges to memblock.memory,
> you'll obtain exactly the same result in the zone_start_pfn as with
> the complex patch in the zone_start_pfn calculation and the exact same
> issue of PagePoison being left on pfn 0 will happen if pfn 0 isn't
> added to memblock.memory with { memblock_add; memblock_reserved }.
>
> > > Still with the complex patch applied, if something goes wrong
> > > DEBUG_VM=y will make it reproducible, with the simple fix we return in
> > > non-reproducible land.
> >
> > I still think that the complex patch is going in the wrong direction.
> > We cannot rely on memblock.reserved to calculate zones and nodes span.
> >
> > This:
> >
> > + /*
> > + * reserved regions must be included so that their page
> > + * structure can be part of a zone and obtain a valid zoneid
> > + * before __SetPageReserved().
> > + */
> > + return min(PHYS_PFN(memblock_start_of_DRAM()),
> > + PHYS_PFN(memblock.reserved.regions[0].base));
> >
> > is definitely a hack that worked because "the caller" has this:
> >
> > /*
> > * Make sure page 0 is always reserved because on systems with
> > * L1TF its contents can be leaked to user processes.
> > */
> > memblock_reserve(0, PAGE_SIZE);
> >
> > So, what would have happen if L1TF was not yet disclosed? ;-)
>
> It's actually fine thing to delete the above memblock_reserve(0,
> PAGE_SIZE) for all testing so we can move into the remaining issues.
There are few more places in arch/x86/kernel/setup.c that
memblock_reserve() one or several pages from address 0 :-)
> The end result of the "hack" is precisely what you obtain if you add
> all memblock.reserved to memblock.memory, except it doesn't require to
> add memblock.reserved to memblock.memory.
I still do not agree that there can be minimum between
memblock_start_of_DRAM() and anything else. I think it's semantically
wrong.
And I really dislike addition of memblock.reserved traversals, mostly
because of the areas that actually overlap.
However, I do agree that adding non-overlaping part of memblock.reserved
to memblock.memory will make the generic code more robust. I just don't
want to do this implicitly by calling memblock_add() from
memblock_reserve(). I think the cleaner way is to join them just before
free_area_init() starts zone/node size detection.
> Thanks,
> Andrea
>
--
Sincerely yours,
Mike.
next prev parent reply other threads:[~2020-12-09 21:47 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-12-06 0:54 + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree akpm
2020-12-06 6:48 ` Mike Rapoport
2020-12-06 19:30 ` Andrea Arcangeli
2020-12-07 16:50 ` Mike Rapoport
2020-12-08 2:57 ` Andrea Arcangeli
2020-12-08 21:46 ` Mike Rapoport
2020-12-08 23:13 ` Andrea Arcangeli
2020-12-09 21:42 ` Mike Rapoport [this message]
2020-12-07 8:58 ` David Hildenbrand
2020-12-07 9:45 ` Mike Rapoport
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20201209214231.GC8939@linux.ibm.com \
--to=rppt@linux.ibm.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=bhe@redhat.com \
--cc=cai@lca.pw \
--cc=david@redhat.com \
--cc=mgorman@suse.de \
--cc=mhocko@kernel.org \
--cc=mm-commits@vger.kernel.org \
--cc=stable@vger.kernel.org \
--cc=vbabka@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).