All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: David Hildenbrand <david@redhat.com>, linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org,
	Alexander Duyck <alexander.h.duyck@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andy Lutomirski <luto@kernel.org>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Arun KS <arunks@codeaurora.org>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Borislav Petkov <bp@alien8.de>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	Christophe Leroy <christophe.leroy@c-s.fr>,
	Dan Williams <dan.j.williams@intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Fenghua Yu <fenghua.yu@intel.com>,
	Gerald Schaefer <gerald.schaefer@de.ibm.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Halil Pasic <pasic@linux.ibm.com>,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	"H. Peter Anvin" <hpa@zytor.com>, Ingo Molnar <mingo@redhat.com>,
	Ira Weiny <ira.weiny@intel.com>, Jason Gunthorpe <jgg@ziepe.ca>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Jun Yao <yaojun8558363@gmail.com>,
	Logan Gunthorpe <logang@deltatee.com>,
	Mark Rutland <mark.rutland@arm.com>,
	Masahiro Yamada <yamada.masahiro@socionext.com>,
	"Matthew Wilcox \(Oracle\)" <willy@infradead.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Michal Hocko <mhocko@suse.com>,
	Mike Rapoport <rppt@linux.ibm.com>,
	Oscar Salvador <osalvador@suse.de>,
	Paul Mackerras <paulus@samba.org>,
	Pavel Tatashin <pasha.tatashin@soleen.com>,
	Pavel Tatashin <pavel.tatashin@microsoft.com>,
	Peter Zijlstra <peterz@infradead.org>, Qian Cai <cai@lca.pw>,
	Rich Felker <dalias@libc.org>,
	Robin Murphy <robin.murphy@arm.com>,
	Steve Capper <steve.capper@arm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Tom Lendacky <thomas.lendacky@amd.com>,
	Tony Luck <tony.luck@intel.com>,
	Vasily Gorbik <gor@linux.ibm.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Wei Yang <richard.weiyang@gmail.com>,
	Wei Yang <richardw.yang@linux.intel.com>,
	Will Deacon <will@kernel.org>,
	Yoshinori Sato <ysato@users.sourceforge.jp>,
	Yu Zhao <yuzhao@google.com>
Subject: Re: [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory
Date: Tue, 27 Aug 2019 11:16:14 +0530	[thread overview]
Message-ID: <87mufvma8p.fsf@linux.ibm.com> (raw)
In-Reply-To: <f1f5d981-b633-dc39-0b92-3619e4b8f0a5@redhat.com>

David Hildenbrand <david@redhat.com> writes:

> On 26.08.19 18:20, David Hildenbrand wrote:
>> On 26.08.19 18:01, Aneesh Kumar K.V wrote:
>>> On 8/26/19 9:13 PM, David Hildenbrand wrote:
>>>> On 26.08.19 16:53, Aneesh Kumar K.V wrote:
>>>>> David Hildenbrand <david@redhat.com> writes:
>>>>>
>>>>>>
>>>
>>> ....
>>>
>>>>>
>>>>> I did report a variant of the issue at
>>>>>
>>>>> https://lore.kernel.org/linux-mm/20190514025354.9108-1-aneesh.kumar@linux.ibm.com/
>>>>>
>>>>> This patch series still doesn't handle the fact that struct page backing
>>>>> the start_pfn might not be initialized. ie, it results in crash like
>>>>> below
>>>>
>>>> Okay, that's a related but different issue I think.
>>>>
>>>> I can see that current shrink_zone_span() might read-access the
>>>> uninitialized struct page of a PFN if
>>>>
>>>> 1. The zone has holes and we check for "zone all holes". If we get
>>>> pfn_valid(pfn), we check if "page_zone(pfn_to_page(pfn)) != zone".
>>>>
>>>> 2. Via find_smallest_section_pfn() / find_biggest_section_pfn() find a
>>>> spanned pfn_valid(). We check
>>>> - pfn_to_nid(start_pfn) != nid
>>>> - zone != page_zone(pfn_to_page(start_pfn)
>>>>
>>>> So we don't actually use the zone/nid, only use it to sanity check. That
>>>> might result in false-positives (not that bad).
>>>>
>>>> It all boils down to shrink_zone_span() not working only on active
>>>> memory, for which the PFN is not only valid but also initialized
>>>> (something for which we need a new section flag I assume).
>>>>
>>>> Which access triggers the issue you describe? pfn_to_nid()?
>>>>
>>>>>
>>>>>      pc: c0000000004bc1ec: shrink_zone_span+0x1bc/0x290
>>>>>      lr: c0000000004bc1e8: shrink_zone_span+0x1b8/0x290
>>>>>      sp: c0000000dac7f910
>>>>>     msr: 800000000282b033
>>>>>    current = 0xc0000000da2fa000
>>>>>    paca    = 0xc00000000fffb300   irqmask: 0x03   irq_happened: 0x01
>>>>>      pid   = 1224, comm = ndctl
>>>>> kernel BUG at /home/kvaneesh/src/linux/include/linux/mm.h:1088!
>>>>> Linux version 5.3.0-rc6-17495-gc7727d815970-dirty (kvaneesh@ltc-boston123) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #183 SMP Mon Aug 26 09:37:32 CDT 2019
>>>>> enter ? for help
>>>>
>>>> Which exact kernel BUG are you hitting here? (my tree doesn't seem t
>>>> have any BUG statement around  include/linux/mm.h:1088). 
>>>
>>>
>>>
>>> This is against upstream linus with your patches applied.
>> 
>> I'm
>> 
>>>
>>>
>>> static inline int page_to_nid(const struct page *page)
>>> {
>>> 	struct page *p = (struct page *)page;
>>>
>>> 	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
>>> }
>>>
>>>
>>> #define PF_POISONED_CHECK(page) ({					\
>>> 		VM_BUG_ON_PGFLAGS(PagePoisoned(page), page);		\
>>> 		page; })
>>> #
>>>
>>>
>>> It is the node id access.
>> 
>> A right. A temporary hack would be to assume in these functions
>> (shrink_zone_span() and friends) that we might have invalid NIDs /
>> zonenumbers and simply skip these. After all we're only using them for
>> finding zone boundaries. Not what we ultimately want, but I think until
>> we have a proper SECTION_ACTIVE, it might take a while.
>> 
>
> I am talking about something as hacky as this:
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8d1c7313ab3f..57ed3dd76a4f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1099,6 +1099,7 @@ static inline int page_zone_id(struct page *page)
>
>  #ifdef NODE_NOT_IN_PAGE_FLAGS
>  extern int page_to_nid(const struct page *page);
> +#define __page_to_nid page_to_nid
>  #else
>  static inline int page_to_nid(const struct page *page)
>  {
> @@ -1106,6 +1107,10 @@ static inline int page_to_nid(const struct page
> *page)
>
>  	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
>  }
> +static inline int __page_to_nid(const struct page *page)
> +{
> +	return ((page)->flags >> NODES_PGSHIFT) & NODES_MASK;
> +}
>  #endif
>
>  #ifdef CONFIG_NUMA_BALANCING
> @@ -1249,6 +1254,12 @@ static inline struct zone *page_zone(const struct
> page *page)
>  	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
>  }
>
> +static inline struct zone *__page_zone(const struct page *page)
> +{
> +	return &NODE_DATA(__page_to_nid(page))->node_zones[page_zonenum(page)];
> +}

We don't need that. We can always do an explicity __page_to_nid check
and break from the loop before doing a page_zone() ? Also if the struct
page is poisoned, we should not trust the page_zonenum()? 

> +
> +
>  static inline pg_data_t *page_pgdat(const struct page *page)
>  {
>  	return NODE_DATA(page_to_nid(page));
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 49ca3364eb70..378b593d1fe1 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -334,10 +334,10 @@ static unsigned long find_smallest_section_pfn(int
> nid, struct zone *zone,
>  		if (unlikely(!pfn_valid(start_pfn)))
>  			continue;
>
> -		if (unlikely(pfn_to_nid(start_pfn) != nid))
> +		/* We might have uninitialized memmaps */
> +		if (unlikely(__page_to_nid(pfn_to_page(start_pfn)) != nid))
>  			continue;

if we are here we got non poisoned struct page. Hence we don't need the
below change?

> -
> -		if (zone && zone != page_zone(pfn_to_page(start_pfn)))
> +		if (zone && zone != __page_zone(pfn_to_page(start_pfn)))
>  			continue;
>
>  		return start_pfn;
> @@ -359,10 +359,10 @@ static unsigned long find_biggest_section_pfn(int
> nid, struct zone *zone,
>  		if (unlikely(!pfn_valid(pfn)))
>  			continue;
>
> -		if (unlikely(pfn_to_nid(pfn) != nid))
> +		/* We might have uninitialized memmaps */
> +		if (unlikely(__page_to_nid(pfn_to_page(pfn)) != nid))
>  			continue;

same as above

> -
> -		if (zone && zone != page_zone(pfn_to_page(pfn)))
> +		if (zone && zone != __page_zone(pfn_to_page(pfn)))
>  			continue;
>
>  		return pfn;
> @@ -418,7 +418,10 @@ static void shrink_zone_span(struct zone *zone,
> unsigned long start_pfn,
>  		if (unlikely(!pfn_valid(pfn)))
>  			continue;
>
> -		if (page_zone(pfn_to_page(pfn)) != zone)
> +		/* We might have uninitialized memmaps */
> +		if (unlikely(__page_to_nid(pfn_to_page(pfn)) != nid))
> +			continue;

same as above? 

> +		if (__page_zone(pfn_to_page(pfn)) != zone)
>  			continue;
>
>  		/* Skip range to be removed */
> @@ -483,7 +486,8 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
>  		if (unlikely(!pfn_valid(pfn)))
>  			continue;
>
> -		if (pfn_to_nid(pfn) != nid)
> +		/* We might have uninitialized memmaps */
> +		if (unlikely(__page_to_nid(pfn_to_page(pfn)) != nid))
>  			continue;
>
>  		/* Skip range to be removed */
>


But I am not sure whether this is the right approach.

-aneesh



  reply	other threads:[~2019-08-27  5:46 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-26 10:10 [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory David Hildenbrand
2019-08-26 10:10 ` [PATCH v2 1/6] mm/memory_hotplug: Exit early in __remove_pages() on BUGs David Hildenbrand
2019-08-26 10:10 ` [PATCH v2 2/6] mm: Exit early in set_zone_contiguous() if already contiguous David Hildenbrand
2019-08-26 10:10 ` [PATCH v2 3/6] mm/memory_hotplug: Process all zones when removing memory David Hildenbrand
2019-08-29 15:39   ` Michal Hocko
2019-08-29 15:54     ` David Hildenbrand
2019-08-29 16:27       ` Michal Hocko
2019-08-29 16:59         ` David Hildenbrand
2019-08-30  6:01           ` Michal Hocko
2019-08-30  6:20             ` David Hildenbrand
2019-08-30  6:47               ` Michal Hocko
2019-08-30  7:07                 ` David Hildenbrand
2019-08-30  8:31                   ` Michal Hocko
2019-08-26 10:10 ` [PATCH v2 4/6] mm/memory_hotplug: Cleanup __remove_pages() David Hildenbrand
2019-08-26 10:10 ` [PATCH v2 5/6] mm: Introduce for_each_zone_nid() David Hildenbrand
2019-08-26 10:10 ` [PATCH v2 6/6] mm/memory_hotplug: Pass nid instead of zone to __remove_pages() David Hildenbrand
2019-08-26 10:10   ` David Hildenbrand
2019-08-27 10:49   ` Robin Murphy
2019-08-27 10:49     ` Robin Murphy
2019-08-26 14:53 ` [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory Aneesh Kumar K.V
2019-08-26 15:43   ` David Hildenbrand
2019-08-26 16:01     ` Aneesh Kumar K.V
2019-08-26 16:20       ` David Hildenbrand
2019-08-26 16:44         ` David Hildenbrand
2019-08-27  5:46           ` Aneesh Kumar K.V [this message]
2019-08-27  7:06             ` David Hildenbrand
2019-08-28  9:33             ` David Hildenbrand
2019-08-29  8:38   ` Michal Hocko
2019-08-29 11:55     ` David Hildenbrand
2019-08-29 12:20       ` Michal Hocko
2019-08-29  8:36 ` Michal Hocko
2019-08-29 11:39   ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87mufvma8p.fsf@linux.ibm.com \
    --to=aneesh.kumar@linux.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=alexander.h.duyck@linux.intel.com \
    --cc=anshuman.khandual@arm.com \
    --cc=arunks@codeaurora.org \
    --cc=benh@kernel.crashing.org \
    --cc=borntraeger@de.ibm.com \
    --cc=bp@alien8.de \
    --cc=cai@lca.pw \
    --cc=catalin.marinas@arm.com \
    --cc=christophe.leroy@c-s.fr \
    --cc=dalias@libc.org \
    --cc=dan.j.williams@intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@redhat.com \
    --cc=fenghua.yu@intel.com \
    --cc=gerald.schaefer@de.ibm.com \
    --cc=gor@linux.ibm.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=heiko.carstens@de.ibm.com \
    --cc=hpa@zytor.com \
    --cc=ira.weiny@intel.com \
    --cc=jgg@ziepe.ca \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=logang@deltatee.com \
    --cc=luto@kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mgorman@techsingularity.net \
    --cc=mhocko@suse.com \
    --cc=mingo@redhat.com \
    --cc=mpe@ellerman.id.au \
    --cc=osalvador@suse.de \
    --cc=pasha.tatashin@soleen.com \
    --cc=pasic@linux.ibm.com \
    --cc=paulus@samba.org \
    --cc=pavel.tatashin@microsoft.com \
    --cc=peterz@infradead.org \
    --cc=richard.weiyang@gmail.com \
    --cc=richardw.yang@linux.intel.com \
    --cc=robin.murphy@arm.com \
    --cc=rppt@linux.ibm.com \
    --cc=steve.capper@arm.com \
    --cc=tglx@linutronix.de \
    --cc=thomas.lendacky@amd.com \
    --cc=tony.luck@intel.com \
    --cc=vbabka@suse.cz \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=yamada.masahiro@socionext.com \
    --cc=yaojun8558363@gmail.com \
    --cc=ysato@users.sourceforge.jp \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.