All of lore.kernel.org
 help / color / mirror / Atom feed
From: Naoya Horiguchi <naoya.horiguchi@linux.dev>
To: Dan Williams <dan.j.williams@intel.com>
Cc: "HORIGUCHI NAOYA(堀口 直也)" <naoya.horiguchi@nec.com>,
	"Wang Wensheng" <wangwensheng4@huawei.com>,
	"David Hildenbrand" <david@redhat.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"osalvador@suse.de" <osalvador@suse.de>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"rui.xiang@huawei.com" <rui.xiang@huawei.com>,
	"HAGIO KAZUHITO(萩尾 一仁)" <k-hagio-ab@nec.com>
Subject: [PATCH] mm/sparse: set SECTION_NID_SHIFT to 6
Date: Mon, 28 Jun 2021 16:06:01 +0900	[thread overview]
Message-ID: <20210628070601.GB418318@u2004> (raw)
In-Reply-To: <CAPcyv4h5a5AYscsyC40_5bc6j1kmjMFWJ_0MFAGEx1EPS9Tmrw@mail.gmail.com>

On Fri, Jun 25, 2021 at 02:23:05PM -0700, Dan Williams wrote:
> On Wed, Jun 23, 2021 at 4:10 PM HORIGUCHI NAOYA(堀口 直也)
> <naoya.horiguchi@nec.com> wrote:
> >
> > On Tue, Apr 27, 2021 at 11:05:17AM +0200, David Hildenbrand wrote:
> > > On 27.04.21 10:30, Wang Wensheng wrote:
> > > > The section_mem_map member of struct mem_section stores some flags and
> > > > the address of struct page array of the mem_section.
> > > >
> > > > Additionally the node id of the mem_section is stored during early boot,
> > > > where the struct page array has not been allocated. In other words, the
> > > > higher bits of section_mem_map are used for two purpose, and the node id
> > > > should be clear properly after the early boot.
> > > >
> > > > Currently the node id field is overlapped with the flag field and cannot
> > > > be clear properly. That overlapped bits would then be treated as
> > > > mem_section flags and may lead to unexpected side effects.
> > > >
> > > > Define SECTION_NID_SHIFT using order_base_2 to ensure that the node id
> > > > field always locates after flags field. That's why the overlap occurs -
> > > > forgetting to increase SECTION_NID_SHIFT when adding new mem_section
> > > > flag.
> > > >
> > > > Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
> > > > Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
> > > > ---
> > > >   include/linux/mmzone.h | 2 +-
> > > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > > index 47946ce..b01694d 100644
> > > > --- a/include/linux/mmzone.h
> > > > +++ b/include/linux/mmzone.h
> > > > @@ -1325,7 +1325,7 @@ extern size_t mem_section_usage_size(void);
> > > >   #define SECTION_TAINT_ZONE_DEVICE (1UL<<4)
> > > >   #define SECTION_MAP_LAST_BIT              (1UL<<5)
> > > >   #define SECTION_MAP_MASK          (~(SECTION_MAP_LAST_BIT-1))
> > > > -#define SECTION_NID_SHIFT          3
> > > > +#define SECTION_NID_SHIFT          order_base_2(SECTION_MAP_LAST_BIT)
> > > >   static inline struct page *__section_mem_map_addr(struct mem_section *section)
> > > >   {
> > > >
> > >
> > > Well, all sections around during boot that have an early NID are early ...
> > > so it's not an issue with SECTION_IS_EARLY, no? I mean, it's ugly, but not
> > > broken.
> > >
> > > But it's an issue with SECTION_TAINT_ZONE_DEVICE, AFAIKT.
> > > sparse_init_one_section() would leave the bit set if the nid happens to have
> > > that bit set (e.g., node 2,3). It's semi-broken then, because we force all
> > > pfn_to_online_page() through the slow path.
> > >
> > >
> > > That whole section flag setting code is fragile.
> >
> > Hi Wensheng, David,
> >
> > This patch seems not accepted or updated yet, so what's going on?
> >
> > We recently saw the exact issue on testing crash utilities with latest
> > kernels on 4 NUMA node system.  SECTION_TAINT_ZONE_DEVICE seems to be
> > set wrongly, and makedumpfile could fail due to this. So we need a fix.
> >
> > I thought of another approach like below before finding this thread,
> > so if you are fine, I'd like to submit a patch with this. And if you
> > like going with order_base_2() approach, I'm glad to ack this patch.
> >
> >   --- a/include/linux/mmzone.h
> >   +++ b/include/linux/mmzone.h
> >   @@ -1358,14 +1358,15 @@ extern size_t mem_section_usage_size(void);
> >     *      which results in PFN_SECTION_SHIFT equal 6.
> >     * To sum it up, at least 6 bits are available.
> >     */
> >   +#define SECTION_MAP_LAST_SHIFT         5
> >    #define SECTION_MARKED_PRESENT         (1UL<<0)
> >    #define SECTION_HAS_MEM_MAP            (1UL<<1)
> >    #define SECTION_IS_ONLINE              (1UL<<2)
> >    #define SECTION_IS_EARLY               (1UL<<3)
> >    #define SECTION_TAINT_ZONE_DEVICE      (1UL<<4)
> >   -#define SECTION_MAP_LAST_BIT           (1UL<<5)
> >   +#define SECTION_MAP_LAST_BIT           (1UL<<SECTION_MAP_LAST_SHIFT)
> >    #define SECTION_MAP_MASK               (~(SECTION_MAP_LAST_BIT-1))
> >   -#define SECTION_NID_SHIFT              3
> >   +#define SECTION_NID_SHIFT              SECTION_MAP_LAST_SHIFT
> 
> Rather than make it dynamic, why not just make it 6 directly since
> that matches the comment about the maximum number of flags available.

Sounds nice to me, so here's a patch. Could you review this?

Thanks,
Naoya Horiguchi
---
From a146c9f12ae8985c8985a5861330f7528cd14fe8 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date: Mon, 28 Jun 2021 15:50:37 +0900
Subject: [PATCH] mm/sparse: set SECTION_NID_SHIFT to 6

Hagio-san reported that crash utility can see bit 4 in section_mem_map
(SECTION_TAINT_ZONE_DEVICE) to be set, even if we do not use any
ZONE_DEVICE ilke pmem or HMM.  This problem could break crash-related
toolsets and/or other memory analysis tools.

The root cause is that SECTION_NID_SHIFT is incorrectly set to 3,
while we use lower 5 bits for SECTION_* flags.  So bit 3 and 4 can be
overlapped by sub-field for early NID, and bit 4 is unexpectedly set
on (for example) NUMA node id is 2 or 3.

To fix it, set SECTION_NID_SHIFT to 6 which is the minimum number of
available bits of section flag field.

[1]: https://github.com/crash-utility/crash/commit/0b5435e10161345cf713ed447a155a611a1b408b

Fixes: 1f90a3477df3 ("mm: teach pfn_to_online_page() about ZONE_DEVICE section collisions")
Cc: stable@vger.kernel.org # v5.12+
Reported-by: Kazuhito Hagio <k-hagio-ab@nec.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 include/linux/mmzone.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fcb535560028..d6aa2a196aeb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1357,6 +1357,7 @@ extern size_t mem_section_usage_size(void);
  *      worst combination is powerpc with 256k pages,
  *      which results in PFN_SECTION_SHIFT equal 6.
  * To sum it up, at least 6 bits are available.
+ * SECTION_NID_SHIFT is set to 6 based on this fact.
  */
 #define SECTION_MARKED_PRESENT		(1UL<<0)
 #define SECTION_HAS_MEM_MAP		(1UL<<1)
@@ -1365,7 +1366,7 @@ extern size_t mem_section_usage_size(void);
 #define SECTION_TAINT_ZONE_DEVICE	(1UL<<4)
 #define SECTION_MAP_LAST_BIT		(1UL<<5)
 #define SECTION_MAP_MASK		(~(SECTION_MAP_LAST_BIT-1))
-#define SECTION_NID_SHIFT		3
+#define SECTION_NID_SHIFT		6
 
 static inline struct page *__section_mem_map_addr(struct mem_section *section)
 {
-- 
2.25.1



  reply	other threads:[~2021-06-28  7:06 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-27  8:30 [PATCH] mm/sparse: Fix flags overlap in section_mem_map Wang Wensheng
2021-04-27  9:05 ` David Hildenbrand
2021-06-23 23:09   ` HORIGUCHI NAOYA(堀口 直也)
2021-06-25 21:23     ` Dan Williams
2021-06-28  7:06       ` Naoya Horiguchi [this message]
2021-07-06  8:36         ` [PATCH] mm/sparse: set SECTION_NID_SHIFT to 6 David Hildenbrand
2021-07-07  4:54           ` Naoya Horiguchi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210628070601.GB418318@u2004 \
    --to=naoya.horiguchi@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=david@redhat.com \
    --cc=k-hagio-ab@nec.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=naoya.horiguchi@nec.com \
    --cc=osalvador@suse.de \
    --cc=rui.xiang@huawei.com \
    --cc=wangwensheng4@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.