LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v2 2/8] s390x/mm: Implement arch_remove_memory()
From: David Hildenbrand @ 2019-05-07 20:47 UTC (permalink / raw)
  To: Dan Williams
  Cc: Oscar Salvador, linux-s390, Michal Hocko, linux-ia64,
	Vasily Gorbik, Linux-sh, Heiko Carstens,
	Linux Kernel Mailing List, Linux MM, Mike Rapoport,
	Martin Schwidefsky, Andrew Morton, linuxppc-dev
In-Reply-To: <CAPcyv4gtAMn2mDz0s1GRTJ52MeTK3jJYLQne6MiEx_ipPFUsmA@mail.gmail.com>

On 07.05.19 22:46, Dan Williams wrote:
> On Tue, May 7, 2019 at 11:38 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> Will come in handy when wanting to handle errors after
>> arch_add_memory().
>>
>> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
>> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Vasily Gorbik <gor@linux.ibm.com>
>> Cc: Oscar Salvador <osalvador@suse.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  arch/s390/mm/init.c | 13 +++++++------
>>  1 file changed, 7 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
>> index 31b1071315d7..1e0cbae69f12 100644
>> --- a/arch/s390/mm/init.c
>> +++ b/arch/s390/mm/init.c
>> @@ -237,12 +237,13 @@ int arch_add_memory(int nid, u64 start, u64 size,
>>  void arch_remove_memory(int nid, u64 start, u64 size,
>>                         struct vmem_altmap *altmap)
>>  {
>> -       /*
>> -        * There is no hardware or firmware interface which could trigger a
>> -        * hot memory remove on s390. So there is nothing that needs to be
>> -        * implemented.
>> -        */
>> -       BUG();
>> +       unsigned long start_pfn = start >> PAGE_SHIFT;
>> +       unsigned long nr_pages = size >> PAGE_SHIFT;
>> +       struct zone *zone;
>> +
>> +       zone = page_zone(pfn_to_page(start_pfn));
> 
> Does s390 actually support passing in an altmap? If 'yes', I think it
> also needs the vmem_altmap_offset() fixup like x86-64:
> 
>         /* With altmap the first mapped page is offset from @start */
>         if (altmap)
>                 page += vmem_altmap_offset(altmap);
> 
> ...but I suspect it does not support altmap since
> arch/s390/mm/vmem.c::vmemmap_populate() does not arrange for 'struct
> page' capacity to be allocated out of an altmap defined page pool.
> 
> I think it would be enough to disallow any arch_add_memory() on s390
> where @altmap is non-NULL. At least until s390 gains ZONE_DEVICE
> support and can enable the pmem use case.
> 

As far as I know, it doesn't yet, however I guess this could change once
virtio-pmem is supported?

Thanks!

-- 

Thanks,

David / dhildenb

^ permalink raw reply

* Re: [PATCH]powerpc/mobility: Serialize PRRN and LPM in device tree update
From: Juliet Kim @ 2019-05-07 20:47 UTC (permalink / raw)
  To: Nathan Lynch, Juliet Kim, linuxppc-dev; +Cc: mmc, mwb
In-Reply-To: <87sgtrqzcz.fsf@linux.ibm.com>

Hi Nathan,

On 5/6/19 12:14 PM, Nathan Lynch wrote:
> Hi Juliet,
>
> Juliet Kim<julietk@linux.vnet.ibm.com> writes:
>> Fix extending start/stop topology update scope during LPM
>> Commit 65b9fdadfc4d ("powerpc/pseries/mobility: Extend start/stop
>> topology update scope") made the change to the duration that
>> topology updates are suppressed during LPM to allow the complete
>> device tree update which leaves the property update notifier
>> unregistered until device tree update completes. This prevents
>> topology update during LPM.
>>
>> Instead, use mutex_lock, which serializes LPM and PRRN operation
>> in pseries_devicetree_update.
> I think this is conflating two issues:
>
> 1. Insufficient serialization/ordering of handling PRRNs and
>     LPM. E.g. we could migrate while processing a PRRN from the source
>     system and end up with incorrect contents in the device tree on the
>     destination if the LPM changes the same nodes. The OS is supposed to
>     drain any outstanding PRRNs before proceeding with migration, which
>     is a stronger requirement than simple serialization of device tree
>     updates. If we don't impose this ordering already we should fix that.

PRRN request can be received at any time including before/after LPM and
during LPM.  Currently, we do not have a protocol with hypervisor 
prohibiting
PRRN after LPM begins. This patch is to fix the regression(inconsistent 
state
of device tree and skipping CPU affinity update) injected by a patch
Commit 65b9fdadfc4d (Extending start/stop topology update scope during
LPM ).

This patch uses mutex_lock to update device tree allowing device tree to be
consistent state in both cases : LPM begins while PRRN event is running and
vice versa. If we migrate while PRRN is running at source, PRRN holding 
the lock
completes at target. Once PRRN release the lock, LPM take the lock and 
update
device tree. PRRN completes device tree update before LPM begins.
To avoid PRRN and LPM from running at the same time, it needs serialization
at the higher layer which requires design change and may be future work.

>
> 2. The NUMA topology update processing. Generally speaking,
>     start/stop_topology_update() enable/disable dt_update_callback(),
>     which we use to update CPU-node assignments. Since we now know that
>     doing that is Bad, it's sort of a happy accident that
>     migration_store() was changed to re-register the notifier after
>     updating the device tree, which is too late. So I don't think we
>     should try to "fix" this. Instead we should remove the broken code
>     (dt_update_callback -> dlpar_cpu_readdd and so on).
When the regression (CPU affinity update has been accidentally disabled 
at LPM)
and CPU readd causes some issues, I suggested that we revert the CPU 
readd patch
already upstream and leave the regression without fixing. But then, we 
decided to
disable CPU affinity update globally for LPM, PRRN, VPHN and fix the 
regression
once the disablement CPU affinity update patch is accepted upstream as the
regression needs to be corrected in case of enabling CPU affinity update 
and we
would learn up codes once the disablement is stabilized.
> Do you agree?
>
> Thanks,
> Nathan


^ permalink raw reply

* Re: [PATCH v2 2/8] s390x/mm: Implement arch_remove_memory()
From: Dan Williams @ 2019-05-07 20:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, linux-s390, Michal Hocko, linux-ia64,
	Vasily Gorbik, Linux-sh, Heiko Carstens,
	Linux Kernel Mailing List, Linux MM, Mike Rapoport,
	Martin Schwidefsky, Andrew Morton, linuxppc-dev
In-Reply-To: <20190507183804.5512-3-david@redhat.com>

On Tue, May 7, 2019 at 11:38 AM David Hildenbrand <david@redhat.com> wrote:
>
> Will come in handy when wanting to handle errors after
> arch_add_memory().
>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Oscar Salvador <osalvador@suse.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  arch/s390/mm/init.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index 31b1071315d7..1e0cbae69f12 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -237,12 +237,13 @@ int arch_add_memory(int nid, u64 start, u64 size,
>  void arch_remove_memory(int nid, u64 start, u64 size,
>                         struct vmem_altmap *altmap)
>  {
> -       /*
> -        * There is no hardware or firmware interface which could trigger a
> -        * hot memory remove on s390. So there is nothing that needs to be
> -        * implemented.
> -        */
> -       BUG();
> +       unsigned long start_pfn = start >> PAGE_SHIFT;
> +       unsigned long nr_pages = size >> PAGE_SHIFT;
> +       struct zone *zone;
> +
> +       zone = page_zone(pfn_to_page(start_pfn));

Does s390 actually support passing in an altmap? If 'yes', I think it
also needs the vmem_altmap_offset() fixup like x86-64:

        /* With altmap the first mapped page is offset from @start */
        if (altmap)
                page += vmem_altmap_offset(altmap);

...but I suspect it does not support altmap since
arch/s390/mm/vmem.c::vmemmap_populate() does not arrange for 'struct
page' capacity to be allocated out of an altmap defined page pool.

I think it would be enough to disallow any arch_add_memory() on s390
where @altmap is non-NULL. At least until s390 gains ZONE_DEVICE
support and can enable the pmem use case.

^ permalink raw reply

* Re: [PATCH v2 1/8] mm/memory_hotplug: Simplify and fix check_hotplug_memory_range()
From: Dan Williams @ 2019-05-07 20:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-s390, Michal Hocko, linux-ia64, Pavel Tatashin, Linux-sh,
	Mathieu Malaterre, Linux Kernel Mailing List, Wei Yang, Linux MM,
	Qian Cai, Arun KS, Andrew Morton, linuxppc-dev, Oscar Salvador
In-Reply-To: <20190507183804.5512-2-david@redhat.com>

On Tue, May 7, 2019 at 11:38 AM David Hildenbrand <david@redhat.com> wrote:
>
> By converting start and size to page granularity, we actually ignore
> unaligned parts within a page instead of properly bailing out with an
> error.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> Cc: Qian Cai <cai@lca.pw>
> Cc: Wei Yang <richard.weiyang@gmail.com>
> Cc: Arun KS <arunks@codeaurora.org>
> Cc: Mathieu Malaterre <malat@debian.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply

* Re: [PATCH v2 0/8] mm/memory_hotplug: Factor out memory block device handling
From: Dan Williams @ 2019-05-07 20:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, Michal Hocko, linux-ia64, Linux-sh,
	Peter Zijlstra, Dave Hansen, Heiko Carstens, Chris Wilson,
	Linux MM, Pavel Tatashin, Rich Felker, Arun KS, H. Peter Anvin,
	Ingo Molnar, Qian Cai, linux-s390, Baoquan He, Logan Gunthorpe,
	Rafael J. Wysocki, Mike Rapoport, mike.travis@hpe.com,
	Ingo Molnar, Fenghua Yu, Pavel Tatashin, Vasily Gorbik,
	Rob Herring, Masahiro Yamada, Nicholas Piggin, Martin Schwidefsky,
	Mark Brown, Borislav Petkov, Andy Lutomirski, Jonathan Cameron,
	Thomas Gleixner, Wei Yang, Joonsoo Kim, Oscar Salvador, Tony Luck,
	Yoshinori Sato, Andrew Banman, Mathieu Malaterre,
	Greg Kroah-Hartman, Linux Kernel Mailing List, Mike Rapoport,
	Wei Yang, Alex Deucher, Paul Mackerras, Andrew Morton,
	linuxppc-dev, David S. Miller, Kirill A. Shutemov
In-Reply-To: <ad971f57-5f09-c056-beef-6a7b63311106@redhat.com>

On Tue, May 7, 2019 at 12:38 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 07.05.19 21:21, David Hildenbrand wrote:
> > On 07.05.19 21:04, Dan Williams wrote:
> >> On Tue, May 7, 2019 at 11:38 AM David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>> We only want memory block devices for memory to be onlined/offlined
> >>> (add/remove from the buddy). This is required so user space can
> >>> online/offline memory and kdump gets notified about newly onlined memory.
> >>>
> >>> Only such memory has the requirement of having to span whole memory blocks.
> >>> Let's factor out creation/removal of memory block devices. This helps
> >>> to further cleanup arch_add_memory/arch_remove_memory() and to make
> >>> implementation of new features easier. E.g. supplying a driver for
> >>> memory block devices becomes way easier (so user space is able to
> >>> distinguish different types of added memory to properly online it).
> >>>
> >>> Patch 1 makes sure the memory block size granularity is always respected.
> >>> Patch 2 implements arch_remove_memory() on s390x. Patch 3 prepares
> >>> arch_remove_memory() to be also called without CONFIG_MEMORY_HOTREMOVE.
> >>> Patch 4,5 and 6 factor out creation/removal of memory block devices.
> >>> Patch 7 gets rid of some unlikely errors that could have happened, not
> >>> removing links between memory block devices and nodes, previously brought
> >>> up by Oscar.
> >>>
> >>> Did a quick sanity test with DIMM plug/unplug, making sure all devices
> >>> and sysfs links properly get added/removed. Compile tested on s390x and
> >>> x86-64.
> >>>
> >>> Based on git://git.cmpxchg.org/linux-mmots.git
> >>>
> >>> Next refactoring on my list will be making sure that remove_memory()
> >>> will never deal with zones / access "struct pages". Any kind of zone
> >>> handling will have to be done when offlining system memory / before
> >>> removing device memory. I am thinking about remove_pfn_range_from_zone()",
> >>> du undo everything "move_pfn_range_to_zone()" did.
> >>>
> >>> v1 -> v2:
> >>> - s390x/mm: Implement arch_remove_memory()
> >>> -- remove mapping after "__remove_pages"
> >>>
> >>>
> >>> David Hildenbrand (8):
> >>>   mm/memory_hotplug: Simplify and fix check_hotplug_memory_range()
> >>>   s390x/mm: Implement arch_remove_memory()
> >>>   mm/memory_hotplug: arch_remove_memory() and __remove_pages() with
> >>>     CONFIG_MEMORY_HOTPLUG
> >>>   mm/memory_hotplug: Create memory block devices after arch_add_memory()
> >>>   mm/memory_hotplug: Drop MHP_MEMBLOCK_API
> >>
> >> So at a minimum we need a bit of patch staging guidance because this
> >> obviously collides with the subsection bits that are built on top of
> >> the existence of MHP_MEMBLOCK_API. What trigger do you envision as a
> >> replacement that arch_add_memory() use to determine that subsection
> >> operations should be disallowed?
> >>
> >
> > Looks like we now have time to sort it out :)
> >
> >
> > Looking at your series
> >
> > [PATCH v8 08/12] mm/sparsemem: Prepare for sub-section ranges
> >
> > is the "single" effectively place using MHP_MEMBLOCK_API, namely
> > "subsection_check()". Used when adding/removing memory.
> >
> >
> > +static int subsection_check(unsigned long pfn, unsigned long nr_pages,
> > +             unsigned long flags, const char *reason)
> > +{
> > +     /*
> > +      * Only allow partial section hotplug for !memblock ranges,
> > +      * since register_new_memory() requires section alignment, and
> > +      * CONFIG_SPARSEMEM_VMEMMAP=n requires sections to be fully
> > +      * populated.
> > +      */
> > +     if ((!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)
> > +                             || (flags & MHP_MEMBLOCK_API))
> > +                     && ((pfn & ~PAGE_SECTION_MASK)
> > +                             || (nr_pages & ~PAGE_SECTION_MASK))) {
> > +             WARN(1, "Sub-section hot-%s incompatible with %s\n", reason,
> > +                             (flags & MHP_MEMBLOCK_API)
> > +                             ? "memblock api" : "!CONFIG_SPARSEMEM_VMEMMAP");
> > +             return -EINVAL;
> > +     }
> > +     return 0;
> >  }
> >
> >
> > (flags & MHP_MEMBLOCK_API)) && ((pfn & ~PAGE_SECTION_MASK) || (nr_pages
> > & ~PAGE_SECTION_MASK)))
> >
> > sounds like something the caller (add_memory()) always has to take care
> > of. No need to check. The one imposing this restriction is the only caller.
> >
> > In my opinion, that check/function can go completely.
> >
> > Am I missing something / missing another user?
> >
>
> In other word, this series moves the restriction out of
> arch_add_memory() and therefore you don't need subsection_check() at all
> anymore. At least if I am not missing something :)

Ah, ok. Only direct arch_add_memory() users need to be worried about
subsection hotplug and the add_memory_resource() + __remove_memory()
paths are already protected by check_hotplug_memory_range(). Ok, I can
get on board with the removal.

Let me go ahead and review this series so Andrew can get it pulled in
and I can rebase.

^ permalink raw reply

* Re: [PATCH v2 0/8] mm/memory_hotplug: Factor out memory block device handling
From: David Hildenbrand @ 2019-05-07 19:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: Oscar Salvador, Michal Hocko, linux-ia64, Linux-sh,
	Peter Zijlstra, Dave Hansen, Heiko Carstens, Chris Wilson,
	Linux MM, Pavel Tatashin, Rich Felker, Arun KS, H. Peter Anvin,
	Ingo Molnar, Qian Cai, linux-s390, Baoquan He, Logan Gunthorpe,
	Rafael J. Wysocki, Mike Rapoport, mike.travis@hpe.com,
	Ingo Molnar, Fenghua Yu, Pavel Tatashin, Vasily Gorbik,
	Rob Herring, Masahiro Yamada, Nicholas Piggin, Martin Schwidefsky,
	Mark Brown, Borislav Petkov, Andy Lutomirski, Jonathan Cameron,
	Thomas Gleixner, Wei Yang, Joonsoo Kim, Oscar Salvador, Tony Luck,
	Yoshinori Sato, Andrew Banman, Mathieu Malaterre,
	Greg Kroah-Hartman, Linux Kernel Mailing List, Mike Rapoport,
	Wei Yang, Alex Deucher, Paul Mackerras, Andrew Morton,
	linuxppc-dev, David S. Miller, Kirill A. Shutemov
In-Reply-To: <6f69e615-2b4a-ff31-5d2a-e1711c564f9b@redhat.com>

On 07.05.19 21:21, David Hildenbrand wrote:
> On 07.05.19 21:04, Dan Williams wrote:
>> On Tue, May 7, 2019 at 11:38 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> We only want memory block devices for memory to be onlined/offlined
>>> (add/remove from the buddy). This is required so user space can
>>> online/offline memory and kdump gets notified about newly onlined memory.
>>>
>>> Only such memory has the requirement of having to span whole memory blocks.
>>> Let's factor out creation/removal of memory block devices. This helps
>>> to further cleanup arch_add_memory/arch_remove_memory() and to make
>>> implementation of new features easier. E.g. supplying a driver for
>>> memory block devices becomes way easier (so user space is able to
>>> distinguish different types of added memory to properly online it).
>>>
>>> Patch 1 makes sure the memory block size granularity is always respected.
>>> Patch 2 implements arch_remove_memory() on s390x. Patch 3 prepares
>>> arch_remove_memory() to be also called without CONFIG_MEMORY_HOTREMOVE.
>>> Patch 4,5 and 6 factor out creation/removal of memory block devices.
>>> Patch 7 gets rid of some unlikely errors that could have happened, not
>>> removing links between memory block devices and nodes, previously brought
>>> up by Oscar.
>>>
>>> Did a quick sanity test with DIMM plug/unplug, making sure all devices
>>> and sysfs links properly get added/removed. Compile tested on s390x and
>>> x86-64.
>>>
>>> Based on git://git.cmpxchg.org/linux-mmots.git
>>>
>>> Next refactoring on my list will be making sure that remove_memory()
>>> will never deal with zones / access "struct pages". Any kind of zone
>>> handling will have to be done when offlining system memory / before
>>> removing device memory. I am thinking about remove_pfn_range_from_zone()",
>>> du undo everything "move_pfn_range_to_zone()" did.
>>>
>>> v1 -> v2:
>>> - s390x/mm: Implement arch_remove_memory()
>>> -- remove mapping after "__remove_pages"
>>>
>>>
>>> David Hildenbrand (8):
>>>   mm/memory_hotplug: Simplify and fix check_hotplug_memory_range()
>>>   s390x/mm: Implement arch_remove_memory()
>>>   mm/memory_hotplug: arch_remove_memory() and __remove_pages() with
>>>     CONFIG_MEMORY_HOTPLUG
>>>   mm/memory_hotplug: Create memory block devices after arch_add_memory()
>>>   mm/memory_hotplug: Drop MHP_MEMBLOCK_API
>>
>> So at a minimum we need a bit of patch staging guidance because this
>> obviously collides with the subsection bits that are built on top of
>> the existence of MHP_MEMBLOCK_API. What trigger do you envision as a
>> replacement that arch_add_memory() use to determine that subsection
>> operations should be disallowed?
>>
> 
> Looks like we now have time to sort it out :)
> 
> 
> Looking at your series
> 
> [PATCH v8 08/12] mm/sparsemem: Prepare for sub-section ranges
> 
> is the "single" effectively place using MHP_MEMBLOCK_API, namely
> "subsection_check()". Used when adding/removing memory.
> 
> 
> +static int subsection_check(unsigned long pfn, unsigned long nr_pages,
> +		unsigned long flags, const char *reason)
> +{
> +	/*
> +	 * Only allow partial section hotplug for !memblock ranges,
> +	 * since register_new_memory() requires section alignment, and
> +	 * CONFIG_SPARSEMEM_VMEMMAP=n requires sections to be fully
> +	 * populated.
> +	 */
> +	if ((!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)
> +				|| (flags & MHP_MEMBLOCK_API))
> +			&& ((pfn & ~PAGE_SECTION_MASK)
> +				|| (nr_pages & ~PAGE_SECTION_MASK))) {
> +		WARN(1, "Sub-section hot-%s incompatible with %s\n", reason,
> +				(flags & MHP_MEMBLOCK_API)
> +				? "memblock api" : "!CONFIG_SPARSEMEM_VMEMMAP");
> +		return -EINVAL;
> +	}
> +	return 0;
>  }
> 
> 
> (flags & MHP_MEMBLOCK_API)) && ((pfn & ~PAGE_SECTION_MASK) || (nr_pages
> & ~PAGE_SECTION_MASK)))
> 
> sounds like something the caller (add_memory()) always has to take care
> of. No need to check. The one imposing this restriction is the only caller.
> 
> In my opinion, that check/function can go completely.
> 
> Am I missing something / missing another user?
> 

In other word, this series moves the restriction out of
arch_add_memory() and therefore you don't need subsection_check() at all
anymore. At least if I am not missing something :)

-- 

Thanks,

David / dhildenb

^ permalink raw reply

* Re: [PATCH v2 0/8] mm/memory_hotplug: Factor out memory block device handling
From: David Hildenbrand @ 2019-05-07 19:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Oscar Salvador, Michal Hocko, linux-ia64, Linux-sh,
	Peter Zijlstra, Dave Hansen, Heiko Carstens, Chris Wilson,
	Linux MM, Pavel Tatashin, Rich Felker, Arun KS, H. Peter Anvin,
	Ingo Molnar, Qian Cai, linux-s390, Baoquan He, Logan Gunthorpe,
	Rafael J. Wysocki, Mike Rapoport, mike.travis@hpe.com,
	Ingo Molnar, Fenghua Yu, Pavel Tatashin, Vasily Gorbik,
	Rob Herring, Masahiro Yamada, Nicholas Piggin, Martin Schwidefsky,
	Mark Brown, Borislav Petkov, Andy Lutomirski, Jonathan Cameron,
	Thomas Gleixner, Wei Yang, Joonsoo Kim, Oscar Salvador, Tony Luck,
	Yoshinori Sato, Andrew Banman, Mathieu Malaterre,
	Greg Kroah-Hartman, Linux Kernel Mailing List, Mike Rapoport,
	Wei Yang, Alex Deucher, Paul Mackerras, Andrew Morton,
	linuxppc-dev, David S. Miller, Kirill A. Shutemov
In-Reply-To: <CAPcyv4gxwhsiZ8Hjm4cNbjmLXV2m4s=t14ZoH0uf8AADP2nOtA@mail.gmail.com>

On 07.05.19 21:04, Dan Williams wrote:
> On Tue, May 7, 2019 at 11:38 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> We only want memory block devices for memory to be onlined/offlined
>> (add/remove from the buddy). This is required so user space can
>> online/offline memory and kdump gets notified about newly onlined memory.
>>
>> Only such memory has the requirement of having to span whole memory blocks.
>> Let's factor out creation/removal of memory block devices. This helps
>> to further cleanup arch_add_memory/arch_remove_memory() and to make
>> implementation of new features easier. E.g. supplying a driver for
>> memory block devices becomes way easier (so user space is able to
>> distinguish different types of added memory to properly online it).
>>
>> Patch 1 makes sure the memory block size granularity is always respected.
>> Patch 2 implements arch_remove_memory() on s390x. Patch 3 prepares
>> arch_remove_memory() to be also called without CONFIG_MEMORY_HOTREMOVE.
>> Patch 4,5 and 6 factor out creation/removal of memory block devices.
>> Patch 7 gets rid of some unlikely errors that could have happened, not
>> removing links between memory block devices and nodes, previously brought
>> up by Oscar.
>>
>> Did a quick sanity test with DIMM plug/unplug, making sure all devices
>> and sysfs links properly get added/removed. Compile tested on s390x and
>> x86-64.
>>
>> Based on git://git.cmpxchg.org/linux-mmots.git
>>
>> Next refactoring on my list will be making sure that remove_memory()
>> will never deal with zones / access "struct pages". Any kind of zone
>> handling will have to be done when offlining system memory / before
>> removing device memory. I am thinking about remove_pfn_range_from_zone()",
>> du undo everything "move_pfn_range_to_zone()" did.
>>
>> v1 -> v2:
>> - s390x/mm: Implement arch_remove_memory()
>> -- remove mapping after "__remove_pages"
>>
>>
>> David Hildenbrand (8):
>>   mm/memory_hotplug: Simplify and fix check_hotplug_memory_range()
>>   s390x/mm: Implement arch_remove_memory()
>>   mm/memory_hotplug: arch_remove_memory() and __remove_pages() with
>>     CONFIG_MEMORY_HOTPLUG
>>   mm/memory_hotplug: Create memory block devices after arch_add_memory()
>>   mm/memory_hotplug: Drop MHP_MEMBLOCK_API
> 
> So at a minimum we need a bit of patch staging guidance because this
> obviously collides with the subsection bits that are built on top of
> the existence of MHP_MEMBLOCK_API. What trigger do you envision as a
> replacement that arch_add_memory() use to determine that subsection
> operations should be disallowed?
> 

Looks like we now have time to sort it out :)


Looking at your series

[PATCH v8 08/12] mm/sparsemem: Prepare for sub-section ranges

is the "single" effectively place using MHP_MEMBLOCK_API, namely
"subsection_check()". Used when adding/removing memory.


+static int subsection_check(unsigned long pfn, unsigned long nr_pages,
+		unsigned long flags, const char *reason)
+{
+	/*
+	 * Only allow partial section hotplug for !memblock ranges,
+	 * since register_new_memory() requires section alignment, and
+	 * CONFIG_SPARSEMEM_VMEMMAP=n requires sections to be fully
+	 * populated.
+	 */
+	if ((!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)
+				|| (flags & MHP_MEMBLOCK_API))
+			&& ((pfn & ~PAGE_SECTION_MASK)
+				|| (nr_pages & ~PAGE_SECTION_MASK))) {
+		WARN(1, "Sub-section hot-%s incompatible with %s\n", reason,
+				(flags & MHP_MEMBLOCK_API)
+				? "memblock api" : "!CONFIG_SPARSEMEM_VMEMMAP");
+		return -EINVAL;
+	}
+	return 0;
 }


(flags & MHP_MEMBLOCK_API)) && ((pfn & ~PAGE_SECTION_MASK) || (nr_pages
& ~PAGE_SECTION_MASK)))

sounds like something the caller (add_memory()) always has to take care
of. No need to check. The one imposing this restriction is the only caller.

In my opinion, that check/function can go completely.

Am I missing something / missing another user?

-- 

Thanks,

David / dhildenb

^ permalink raw reply

* Re: [PATCH v2 0/8] mm/memory_hotplug: Factor out memory block device handling
From: Dan Williams @ 2019-05-07 19:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, Michal Hocko, linux-ia64, Linux-sh,
	Peter Zijlstra, Dave Hansen, Heiko Carstens, Chris Wilson,
	Linux MM, Pavel Tatashin, Rich Felker, Arun KS, H. Peter Anvin,
	Ingo Molnar, Qian Cai, linux-s390, Baoquan He, Logan Gunthorpe,
	Rafael J. Wysocki, Mike Rapoport, mike.travis@hpe.com,
	Ingo Molnar, Fenghua Yu, Pavel Tatashin, Vasily Gorbik,
	Rob Herring, Masahiro Yamada, Nicholas Piggin, Martin Schwidefsky,
	Mark Brown, Borislav Petkov, Andy Lutomirski, Jonathan Cameron,
	Thomas Gleixner, Wei Yang, Joonsoo Kim, Oscar Salvador, Tony Luck,
	Yoshinori Sato, Andrew Banman, Mathieu Malaterre,
	Greg Kroah-Hartman, Linux Kernel Mailing List, Mike Rapoport,
	Wei Yang, Alex Deucher, Paul Mackerras, Andrew Morton,
	linuxppc-dev, David S. Miller, Kirill A. Shutemov
In-Reply-To: <20190507183804.5512-1-david@redhat.com>

On Tue, May 7, 2019 at 11:38 AM David Hildenbrand <david@redhat.com> wrote:
>
> We only want memory block devices for memory to be onlined/offlined
> (add/remove from the buddy). This is required so user space can
> online/offline memory and kdump gets notified about newly onlined memory.
>
> Only such memory has the requirement of having to span whole memory blocks.
> Let's factor out creation/removal of memory block devices. This helps
> to further cleanup arch_add_memory/arch_remove_memory() and to make
> implementation of new features easier. E.g. supplying a driver for
> memory block devices becomes way easier (so user space is able to
> distinguish different types of added memory to properly online it).
>
> Patch 1 makes sure the memory block size granularity is always respected.
> Patch 2 implements arch_remove_memory() on s390x. Patch 3 prepares
> arch_remove_memory() to be also called without CONFIG_MEMORY_HOTREMOVE.
> Patch 4,5 and 6 factor out creation/removal of memory block devices.
> Patch 7 gets rid of some unlikely errors that could have happened, not
> removing links between memory block devices and nodes, previously brought
> up by Oscar.
>
> Did a quick sanity test with DIMM plug/unplug, making sure all devices
> and sysfs links properly get added/removed. Compile tested on s390x and
> x86-64.
>
> Based on git://git.cmpxchg.org/linux-mmots.git
>
> Next refactoring on my list will be making sure that remove_memory()
> will never deal with zones / access "struct pages". Any kind of zone
> handling will have to be done when offlining system memory / before
> removing device memory. I am thinking about remove_pfn_range_from_zone()",
> du undo everything "move_pfn_range_to_zone()" did.
>
> v1 -> v2:
> - s390x/mm: Implement arch_remove_memory()
> -- remove mapping after "__remove_pages"
>
>
> David Hildenbrand (8):
>   mm/memory_hotplug: Simplify and fix check_hotplug_memory_range()
>   s390x/mm: Implement arch_remove_memory()
>   mm/memory_hotplug: arch_remove_memory() and __remove_pages() with
>     CONFIG_MEMORY_HOTPLUG
>   mm/memory_hotplug: Create memory block devices after arch_add_memory()
>   mm/memory_hotplug: Drop MHP_MEMBLOCK_API

So at a minimum we need a bit of patch staging guidance because this
obviously collides with the subsection bits that are built on top of
the existence of MHP_MEMBLOCK_API. What trigger do you envision as a
replacement that arch_add_memory() use to determine that subsection
operations should be disallowed?

^ permalink raw reply

* [PATCH v2 8/8] mm/memory_hotplug: Remove "zone" parameter from sparse_remove_one_section
From: David Hildenbrand @ 2019-05-07 18:38 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-s390, linux-ia64, linux-sh, David Hildenbrand, linux-kernel,
	akpm, linuxppc-dev, Dan Williams
In-Reply-To: <20190507183804.5512-1-david@redhat.com>

Unused, and memory unplug path should never care about zones. This is
the job of memory offlining. ZONE_DEVICE might require special care -
the caller of arch_remove_memory() should handle this.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/memory_hotplug.h | 2 +-
 mm/memory_hotplug.c            | 2 +-
 mm/sparse.c                    | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 2f1f87e13baa..1a4257c5f74c 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -346,7 +346,7 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern int sparse_add_one_section(int nid, unsigned long start_pfn,
 				  struct vmem_altmap *altmap);
-extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
+extern void sparse_remove_one_section(struct mem_section *ms,
 		unsigned long map_offset, struct vmem_altmap *altmap);
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
 					  unsigned long pnum);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 527fe4f9c620..e0340c8f6df4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -523,7 +523,7 @@ static void __remove_section(struct zone *zone, struct mem_section *ms,
 	start_pfn = section_nr_to_pfn((unsigned long)scn_nr);
 	__remove_zone(zone, start_pfn);
 
-	sparse_remove_one_section(zone, ms, map_offset, altmap);
+	sparse_remove_one_section(ms, map_offset, altmap);
 }
 
 /**
diff --git a/mm/sparse.c b/mm/sparse.c
index d1d5e05f5b8d..1552c855d62a 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -800,8 +800,8 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap,
 		free_map_bootmem(memmap);
 }
 
-void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
-		unsigned long map_offset, struct vmem_altmap *altmap)
+void sparse_remove_one_section(struct mem_section *ms, unsigned long map_offset,
+			       struct vmem_altmap *altmap)
 {
 	struct page *memmap = NULL;
 	unsigned long *usemap = NULL;
-- 
2.20.1


^ permalink raw reply related

* [PATCH v2 7/8] mm/memory_hotplug: Make unregister_memory_block_under_nodes() never fail
From: David Hildenbrand @ 2019-05-07 18:38 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-s390, linux-ia64, linux-sh, Greg Kroah-Hartman,
	David Hildenbrand, linux-kernel, David S. Miller, Alex Deucher,
	Mark Brown, Jonathan Cameron, Rafael J. Wysocki, akpm,
	Chris Wilson, linuxppc-dev, Dan Williams, Oscar Salvador
In-Reply-To: <20190507183804.5512-1-david@redhat.com>

We really don't want anything during memory hotunplug to fail.
We always pass a valid memory block device, that check can go. Avoid
allocating memory and eventually failing. As we are always called under
lock, we can use a static piece of memory. This avoids having to put
the structure onto the stack, having to guess about the stack size
of callers.

Patch inspired by a patch from Oscar Salvador.

In the future, there might be no need to iterate over nodes at all.
mem->nid should tell us exactly what to remove. Memory block devices
with mixed nodes (added during boot) should properly fenced off and never
removed.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/base/node.c  | 18 +++++-------------
 include/linux/node.h |  5 ++---
 2 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 04fdfa99b8bc..9be88fd05147 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -803,20 +803,14 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, void *arg)
 
 /*
  * Unregister memory block device under all nodes that it spans.
+ * Has to be called with mem_sysfs_mutex held (due to unlinked_nodes).
  */
-int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
+void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
 {
-	NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
 	unsigned long pfn, sect_start_pfn, sect_end_pfn;
+	static nodemask_t unlinked_nodes;
 
-	if (!mem_blk) {
-		NODEMASK_FREE(unlinked_nodes);
-		return -EFAULT;
-	}
-	if (!unlinked_nodes)
-		return -ENOMEM;
-	nodes_clear(*unlinked_nodes);
-
+	nodes_clear(unlinked_nodes);
 	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
 	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
@@ -827,15 +821,13 @@ int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
 			continue;
 		if (!node_online(nid))
 			continue;
-		if (node_test_and_set(nid, *unlinked_nodes))
+		if (node_test_and_set(nid, unlinked_nodes))
 			continue;
 		sysfs_remove_link(&node_devices[nid]->dev.kobj,
 			 kobject_name(&mem_blk->dev.kobj));
 		sysfs_remove_link(&mem_blk->dev.kobj,
 			 kobject_name(&node_devices[nid]->dev.kobj));
 	}
-	NODEMASK_FREE(unlinked_nodes);
-	return 0;
 }
 
 int link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn)
diff --git a/include/linux/node.h b/include/linux/node.h
index 02a29e71b175..548c226966a2 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -139,7 +139,7 @@ extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int register_mem_sect_under_node(struct memory_block *mem_blk,
 						void *arg);
-extern int unregister_memory_block_under_nodes(struct memory_block *mem_blk);
+extern void unregister_memory_block_under_nodes(struct memory_block *mem_blk);
 
 extern int register_memory_node_under_compute_node(unsigned int mem_nid,
 						   unsigned int cpu_nid,
@@ -175,9 +175,8 @@ static inline int register_mem_sect_under_node(struct memory_block *mem_blk,
 {
 	return 0;
 }
-static inline int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
+static inline void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
 {
-	return 0;
 }
 
 static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
-- 
2.20.1


^ permalink raw reply related

* [PATCH v2 6/8] mm/memory_hotplug: Remove memory block devices before arch_remove_memory()
From: David Hildenbrand @ 2019-05-07 18:38 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, linux-ia64, linux-sh, Chris Wilson, Arun KS,
	Ingo Molnar, Rafael J. Wysocki, linux-s390, David Hildenbrand,
	Pavel Tatashin, mike.travis@hpe.com, Mark Brown, Jonathan Cameron,
	Dan Williams, Oscar Salvador, Andrew Banman, Mathieu Malaterre,
	Greg Kroah-Hartman, linux-kernel, Alex Deucher, akpm,
	linuxppc-dev, David S. Miller
In-Reply-To: <20190507183804.5512-1-david@redhat.com>

Let's factor out removing of memory block devices, which is only
necessary for memory added via add_memory() and friends that created
memory block devices. Remove the devices before calling
arch_remove_memory().

This finishes factoring out memory block device handling from
arch_add_memory() and arch_remove_memory().

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Mathieu Malaterre <malat@debian.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/base/memory.c  | 39 +++++++++++++++++++--------------------
 drivers/base/node.c    | 11 ++++++-----
 include/linux/memory.h |  2 +-
 include/linux/node.h   |  6 ++----
 mm/memory_hotplug.c    |  5 +++--
 5 files changed, 31 insertions(+), 32 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 862c202a18ca..47ff49058d1f 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -756,32 +756,31 @@ int hotplug_memory_register(unsigned long start, unsigned long size)
 	return ret;
 }
 
-static int remove_memory_section(struct mem_section *section)
+/*
+ * Remove memory block devices for the given memory area. Start and size
+ * have to be aligned to memory block granularity. Memory block devices
+ * have to be offline.
+ */
+void hotplug_memory_unregister(unsigned long start, unsigned long size)
 {
+	unsigned long block_nr_pages = memory_block_size_bytes() >> PAGE_SHIFT;
+	unsigned long start_pfn = PFN_DOWN(start);
+	unsigned long end_pfn = start_pfn + (size >> PAGE_SHIFT);
 	struct memory_block *mem;
+	unsigned long pfn;
 
-	if (WARN_ON_ONCE(!present_section(section)))
-		return;
+	BUG_ON(!IS_ALIGNED(start, memory_block_size_bytes()));
+	BUG_ON(!IS_ALIGNED(size, memory_block_size_bytes()));
 
 	mutex_lock(&mem_sysfs_mutex);
-
-	/*
-	 * Some users of the memory hotplug do not want/need memblock to
-	 * track all sections. Skip over those.
-	 */
-	mem = find_memory_block(section);
-	if (!mem)
-		goto out_unlock;
-
-	unregister_mem_sect_under_nodes(mem, __section_nr(section));
-
-	mem->section_count--;
-	if (mem->section_count == 0)
+	for (pfn = start_pfn; pfn != end_pfn; pfn += block_nr_pages) {
+		mem = find_memory_block(__pfn_to_section(pfn));
+		if (!mem)
+			continue;
+		mem->section_count = 0;
+		unregister_memory_block_under_nodes(mem);
 		unregister_memory(mem);
-	else
-		put_device(&mem->dev);
-
-out_unlock:
+	}
 	mutex_unlock(&mem_sysfs_mutex);
 }
 
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 8598fcbd2a17..04fdfa99b8bc 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -801,9 +801,10 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, void *arg)
 	return 0;
 }
 
-/* unregister memory section under all nodes that it spans */
-int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
-				    unsigned long phys_index)
+/*
+ * Unregister memory block device under all nodes that it spans.
+ */
+int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
 {
 	NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
 	unsigned long pfn, sect_start_pfn, sect_end_pfn;
@@ -816,8 +817,8 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
 		return -ENOMEM;
 	nodes_clear(*unlinked_nodes);
 
-	sect_start_pfn = section_nr_to_pfn(phys_index);
-	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
+	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
+	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
 		int nid;
 
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 95505fbb5f85..aa236c2a0466 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -112,7 +112,7 @@ extern void unregister_memory_notifier(struct notifier_block *nb);
 extern int register_memory_isolate_notifier(struct notifier_block *nb);
 extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
 int hotplug_memory_register(unsigned long start, unsigned long size);
-extern void unregister_memory_section(struct mem_section *);
+void hotplug_memory_unregister(unsigned long start, unsigned long size);
 extern int memory_dev_init(void);
 extern int memory_notify(unsigned long val, void *v);
 extern int memory_isolate_notify(unsigned long val, void *v);
diff --git a/include/linux/node.h b/include/linux/node.h
index 1a557c589ecb..02a29e71b175 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -139,8 +139,7 @@ extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int register_mem_sect_under_node(struct memory_block *mem_blk,
 						void *arg);
-extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
-					   unsigned long phys_index);
+extern int unregister_memory_block_under_nodes(struct memory_block *mem_blk);
 
 extern int register_memory_node_under_compute_node(unsigned int mem_nid,
 						   unsigned int cpu_nid,
@@ -176,8 +175,7 @@ static inline int register_mem_sect_under_node(struct memory_block *mem_blk,
 {
 	return 0;
 }
-static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
-						  unsigned long phys_index)
+static inline int unregister_memory_block_under_nodes(struct memory_block *mem_blk)
 {
 	return 0;
 }
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 107f72952347..527fe4f9c620 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -519,8 +519,6 @@ static void __remove_section(struct zone *zone, struct mem_section *ms,
 	if (WARN_ON_ONCE(!valid_section(ms)))
 		return;
 
-	unregister_memory_section(ms);
-
 	scn_nr = __section_nr(ms);
 	start_pfn = section_nr_to_pfn((unsigned long)scn_nr);
 	__remove_zone(zone, start_pfn);
@@ -1844,6 +1842,9 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
 	memblock_free(start, size);
 	memblock_remove(start, size);
 
+	/* remove memory block devices before removing memory */
+	hotplug_memory_unregister(start, size);
+
 	arch_remove_memory(nid, start, size, NULL);
 	__release_memory_resource(start, size);
 
-- 
2.20.1


^ permalink raw reply related

* [PATCH v2 5/8] mm/memory_hotplug: Drop MHP_MEMBLOCK_API
From: David Hildenbrand @ 2019-05-07 18:38 UTC (permalink / raw)
  To: linux-mm
  Cc: Oscar Salvador, linux-s390, Michal Hocko, linux-ia64,
	Pavel Tatashin, linux-sh, Mathieu Malaterre, David Hildenbrand,
	linux-kernel, Wei Yang, Qian Cai, Joonsoo Kim, akpm, linuxppc-dev,
	Dan Williams, Arun KS
In-Reply-To: <20190507183804.5512-1-david@redhat.com>

No longer needed, the callers of arch_add_memory() can handle this
manually.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Mathieu Malaterre <malat@debian.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/memory_hotplug.h | 8 --------
 mm/memory_hotplug.c            | 9 +++------
 2 files changed, 3 insertions(+), 14 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 2d4de313926d..2f1f87e13baa 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -128,14 +128,6 @@ extern void arch_remove_memory(int nid, u64 start, u64 size,
 extern void __remove_pages(struct zone *zone, unsigned long start_pfn,
 			   unsigned long nr_pages, struct vmem_altmap *altmap);
 
-/*
- * Do we want sysfs memblock files created. This will allow userspace to online
- * and offline memory explicitly. Lack of this bit means that the caller has to
- * call move_pfn_range_to_zone to finish the initialization.
- */
-
-#define MHP_MEMBLOCK_API               (1<<0)
-
 /* reasonably generic interface to expand the physical pages */
 extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 		       struct mhp_restrictions *restrictions);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e1637c8a0723..107f72952347 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -250,7 +250,7 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
 #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
 
 static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
-		struct vmem_altmap *altmap, bool want_memblock)
+				   struct vmem_altmap *altmap)
 {
 	int ret;
 
@@ -293,8 +293,7 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 	}
 
 	for (i = start_sec; i <= end_sec; i++) {
-		err = __add_section(nid, section_nr_to_pfn(i), altmap,
-				restrictions->flags & MHP_MEMBLOCK_API);
+		err = __add_section(nid, section_nr_to_pfn(i), altmap);
 
 		/*
 		 * EEXIST is finally dealt with by ioresource collision
@@ -1066,9 +1065,7 @@ static int online_memory_block(struct memory_block *mem, void *arg)
  */
 int __ref add_memory_resource(int nid, struct resource *res)
 {
-	struct mhp_restrictions restrictions = {
-		.flags = MHP_MEMBLOCK_API,
-	};
+	struct mhp_restrictions restrictions = {};
 	u64 start, size;
 	bool new_node = false;
 	int ret;
-- 
2.20.1


^ permalink raw reply related

* [PATCH v2 4/8] mm/memory_hotplug: Create memory block devices after arch_add_memory()
From: David Hildenbrand @ 2019-05-07 18:38 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-s390, Michal Hocko, linux-ia64, Pavel Tatashin, linux-sh,
	mike.travis@hpe.com, Greg Kroah-Hartman, David Hildenbrand,
	linux-kernel, Ingo Molnar, Mathieu Malaterre, Andrew Banman,
	Qian Cai, Rafael J. Wysocki, Arun KS, akpm, Wei Yang,
	linuxppc-dev, Dan Williams, Oscar Salvador
In-Reply-To: <20190507183804.5512-1-david@redhat.com>

Only memory to be added to the buddy and to be onlined/offlined by
user space using memory block devices needs (and should have!) memory
block devices.

Factor out creation of memory block devices Create all devices after
arch_add_memory() succeeded. We can later drop the want_memblock parameter,
because it is now effectively stale.

Only after memory block devices have been added, memory can be onlined
by user space. This implies, that memory is not visible to user space at
all before arch_add_memory() succeeded.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Mathieu Malaterre <malat@debian.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/base/memory.c  | 70 ++++++++++++++++++++++++++----------------
 include/linux/memory.h |  2 +-
 mm/memory_hotplug.c    | 15 ++++-----
 3 files changed, 53 insertions(+), 34 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 6e0cb4fda179..862c202a18ca 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -701,44 +701,62 @@ static int add_memory_block(int base_section_nr)
 	return 0;
 }
 
+static void unregister_memory(struct memory_block *memory)
+{
+	BUG_ON(memory->dev.bus != &memory_subsys);
+
+	/* drop the ref. we got via find_memory_block() */
+	put_device(&memory->dev);
+	device_unregister(&memory->dev);
+}
+
 /*
- * need an interface for the VM to add new memory regions,
- * but without onlining it.
+ * Create memory block devices for the given memory area. Start and size
+ * have to be aligned to memory block granularity. Memory block devices
+ * will be initialized as offline.
  */
-int hotplug_memory_register(int nid, struct mem_section *section)
+int hotplug_memory_register(unsigned long start, unsigned long size)
 {
-	int ret = 0;
+	unsigned long block_nr_pages = memory_block_size_bytes() >> PAGE_SHIFT;
+	unsigned long start_pfn = PFN_DOWN(start);
+	unsigned long end_pfn = start_pfn + (size >> PAGE_SHIFT);
+	unsigned long pfn;
 	struct memory_block *mem;
+	int ret = 0;
 
-	mutex_lock(&mem_sysfs_mutex);
+	BUG_ON(!IS_ALIGNED(start, memory_block_size_bytes()));
+	BUG_ON(!IS_ALIGNED(size, memory_block_size_bytes()));
 
-	mem = find_memory_block(section);
-	if (mem) {
-		mem->section_count++;
-		put_device(&mem->dev);
-	} else {
-		ret = init_memory_block(&mem, section, MEM_OFFLINE);
+	mutex_lock(&mem_sysfs_mutex);
+	for (pfn = start_pfn; pfn != end_pfn; pfn += block_nr_pages) {
+		mem = find_memory_block(__pfn_to_section(pfn));
+		if (mem) {
+			WARN_ON_ONCE(false);
+			put_device(&mem->dev);
+			continue;
+		}
+		ret = init_memory_block(&mem, __pfn_to_section(pfn),
+					MEM_OFFLINE);
 		if (ret)
-			goto out;
-		mem->section_count++;
+			break;
+		mem->section_count = memory_block_size_bytes() /
+				     MIN_MEMORY_BLOCK_SIZE;
+	}
+	if (ret) {
+		end_pfn = pfn;
+		for (pfn = start_pfn; pfn != end_pfn; pfn += block_nr_pages) {
+			mem = find_memory_block(__pfn_to_section(pfn));
+			if (!mem)
+				continue;
+			mem->section_count = 0;
+			unregister_memory(mem);
+		}
 	}
-
-out:
 	mutex_unlock(&mem_sysfs_mutex);
 	return ret;
 }
 
-static void
-unregister_memory(struct memory_block *memory)
-{
-	BUG_ON(memory->dev.bus != &memory_subsys);
-
-	/* drop the ref. we got via find_memory_block() */
-	put_device(&memory->dev);
-	device_unregister(&memory->dev);
-}
-
-void unregister_memory_section(struct mem_section *section)
+static int remove_memory_section(struct mem_section *section)
 {
 	struct memory_block *mem;
 
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 474c7c60c8f2..95505fbb5f85 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -111,7 +111,7 @@ extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
 extern int register_memory_isolate_notifier(struct notifier_block *nb);
 extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
-int hotplug_memory_register(int nid, struct mem_section *section);
+int hotplug_memory_register(unsigned long start, unsigned long size);
 extern void unregister_memory_section(struct mem_section *);
 extern int memory_dev_init(void);
 extern int memory_notify(unsigned long val, void *v);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7b5439839d67..e1637c8a0723 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -258,13 +258,7 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
 		return -EEXIST;
 
 	ret = sparse_add_one_section(nid, phys_start_pfn, altmap);
-	if (ret < 0)
-		return ret;
-
-	if (!want_memblock)
-		return 0;
-
-	return hotplug_memory_register(nid, __pfn_to_section(phys_start_pfn));
+	return ret < 0 ? ret : 0;
 }
 
 /*
@@ -1106,6 +1100,13 @@ int __ref add_memory_resource(int nid, struct resource *res)
 	if (ret < 0)
 		goto error;
 
+	/* create memory block devices after memory was added */
+	ret = hotplug_memory_register(start, size);
+	if (ret) {
+		arch_remove_memory(nid, start, size, NULL);
+		goto error;
+	}
+
 	if (new_node) {
 		/* If sysfs file of new node can't be created, cpu on the node
 		 * can't be hot-added. There is no rollback way now.
-- 
2.20.1


^ permalink raw reply related

* [PATCH v2 3/8] mm/memory_hotplug: arch_remove_memory() and __remove_pages() with CONFIG_MEMORY_HOTPLUG
From: David Hildenbrand @ 2019-05-07 18:37 UTC (permalink / raw)
  To: linux-mm
  Cc: Oscar Salvador, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, Chris Wilson, Masahiro Yamada,
	Michal Hocko, Paul Mackerras, H. Peter Anvin, Dan Williams,
	Rafael J. Wysocki, Qian Cai, linux-s390, Yoshinori Sato,
	David Hildenbrand, Mike Rapoport, Ingo Molnar, Fenghua Yu,
	Pavel Tatashin, Vasily Gorbik, Rob Herring, mike.travis@hpe.com,
	Nicholas Piggin, Alex Deucher, Mark Brown, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, Tony Luck, Baoquan He,
	Andrew Banman, Mathieu Malaterre, Greg Kroah-Hartman,
	linux-kernel, Logan Gunthorpe, Wei Yang, Martin Schwidefsky,
	Arun KS, akpm, linuxppc-dev, David S. Miller, Kirill A. Shutemov
In-Reply-To: <20190507183804.5512-1-david@redhat.com>

Let's prepare for better error handling while adding memory by allowing
to use arch_remove_memory() and __remove_pages() even if
CONFIG_MEMORY_HOTREMOVE is not set. CONFIG_MEMORY_HOTREMOVE effectively
covers
- Offlining of system ram (memory block devices) - offline_pages()
- Unplug of system ram - remove_memory()
- Unplug/remap of device memory - devm_memremap()

This allows e.g. for handling like

arch_add_memory()
rc = do_something();
if (rc) {
	arch_remove_memory();
}

Whereby do_something() will for example be memory block device creation
after it has been factored out.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/ia64/mm/init.c            | 2 --
 arch/powerpc/mm/mem.c          | 2 --
 arch/s390/mm/init.c            | 2 --
 arch/sh/mm/init.c              | 2 --
 arch/x86/mm/init_32.c          | 2 --
 arch/x86/mm/init_64.c          | 2 --
 drivers/base/memory.c          | 2 --
 include/linux/memory.h         | 2 --
 include/linux/memory_hotplug.h | 2 --
 mm/memory_hotplug.c            | 2 --
 mm/sparse.c                    | 6 ------
 11 files changed, 26 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index d28e29103bdb..aae75fd7b810 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -681,7 +681,6 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	return ret;
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 void arch_remove_memory(int nid, u64 start, u64 size,
 			struct vmem_altmap *altmap)
 {
@@ -693,4 +692,3 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 	__remove_pages(zone, start_pfn, nr_pages, altmap);
 }
 #endif
-#endif
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index a2b78e72452f..ddc69b59575c 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -131,7 +131,6 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
 	return __add_pages(nid, start_pfn, nr_pages, restrictions);
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 void __ref arch_remove_memory(int nid, u64 start, u64 size,
 			     struct vmem_altmap *altmap)
 {
@@ -164,7 +163,6 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
 	resize_hpt_for_hotplug(memblock_phys_mem_size());
 }
 #endif
-#endif /* CONFIG_MEMORY_HOTPLUG */
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 void __init mem_topology_setup(void)
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 1e0cbae69f12..eafa3c750efc 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -233,7 +233,6 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	return rc;
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 void arch_remove_memory(int nid, u64 start, u64 size,
 			struct vmem_altmap *altmap)
 {
@@ -245,5 +244,4 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 	__remove_pages(zone, start_pfn, nr_pages, altmap);
 	vmem_remove_mapping(start, size);
 }
-#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 5aeb4d7099a1..59c5fe511f25 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -428,7 +428,6 @@ int memory_add_physaddr_to_nid(u64 addr)
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 void arch_remove_memory(int nid, u64 start, u64 size,
 			struct vmem_altmap *altmap)
 {
@@ -439,5 +438,4 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 	zone = page_zone(pfn_to_page(start_pfn));
 	__remove_pages(zone, start_pfn, nr_pages, altmap);
 }
-#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 075e568098f2..8d4bf2d97d50 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -859,7 +859,6 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	return __add_pages(nid, start_pfn, nr_pages, restrictions);
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 void arch_remove_memory(int nid, u64 start, u64 size,
 			struct vmem_altmap *altmap)
 {
@@ -871,7 +870,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 	__remove_pages(zone, start_pfn, nr_pages, altmap);
 }
 #endif
-#endif
 
 int kernel_set_to_readonly __read_mostly;
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 20d14254b686..f1b55ddea23f 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1131,7 +1131,6 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
 	remove_pagetable(start, end, false, altmap);
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 static void __meminit
 kernel_physical_mapping_remove(unsigned long start, unsigned long end)
 {
@@ -1156,7 +1155,6 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
 	__remove_pages(zone, start_pfn, nr_pages, altmap);
 	kernel_physical_mapping_remove(start, start + size);
 }
-#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 static struct kcore_list kcore_vsyscall;
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f180427e48f4..6e0cb4fda179 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -728,7 +728,6 @@ int hotplug_memory_register(int nid, struct mem_section *section)
 	return ret;
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 static void
 unregister_memory(struct memory_block *memory)
 {
@@ -767,7 +766,6 @@ void unregister_memory_section(struct mem_section *section)
 out_unlock:
 	mutex_unlock(&mem_sysfs_mutex);
 }
-#endif /* CONFIG_MEMORY_HOTREMOVE */
 
 /* return true if the memory block is offlined, otherwise, return false */
 bool is_memblock_offlined(struct memory_block *mem)
diff --git a/include/linux/memory.h b/include/linux/memory.h
index e1dc1bb2b787..474c7c60c8f2 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -112,9 +112,7 @@ extern void unregister_memory_notifier(struct notifier_block *nb);
 extern int register_memory_isolate_notifier(struct notifier_block *nb);
 extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
 int hotplug_memory_register(int nid, struct mem_section *section);
-#ifdef CONFIG_MEMORY_HOTREMOVE
 extern void unregister_memory_section(struct mem_section *);
-#endif
 extern int memory_dev_init(void);
 extern int memory_notify(unsigned long val, void *v);
 extern int memory_isolate_notify(unsigned long val, void *v);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index ae892eef8b82..2d4de313926d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -123,12 +123,10 @@ static inline bool movable_node_is_enabled(void)
 	return movable_node_enabled;
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 extern void arch_remove_memory(int nid, u64 start, u64 size,
 			       struct vmem_altmap *altmap);
 extern void __remove_pages(struct zone *zone, unsigned long start_pfn,
 			   unsigned long nr_pages, struct vmem_altmap *altmap);
-#endif /* CONFIG_MEMORY_HOTREMOVE */
 
 /*
  * Do we want sysfs memblock files created. This will allow userspace to online
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 202febe88b58..7b5439839d67 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -317,7 +317,6 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 	return err;
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 /* find the smallest valid pfn in the range [start_pfn, end_pfn) */
 static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
 				     unsigned long start_pfn,
@@ -581,7 +580,6 @@ void __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
 
 	set_zone_contiguous(zone);
 }
-#endif /* CONFIG_MEMORY_HOTREMOVE */
 
 int set_online_page_callback(online_page_callback_t callback)
 {
diff --git a/mm/sparse.c b/mm/sparse.c
index fd13166949b5..d1d5e05f5b8d 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -604,7 +604,6 @@ static void __kfree_section_memmap(struct page *memmap,
 
 	vmemmap_free(start, end, altmap);
 }
-#ifdef CONFIG_MEMORY_HOTREMOVE
 static void free_map_bootmem(struct page *memmap)
 {
 	unsigned long start = (unsigned long)memmap;
@@ -612,7 +611,6 @@ static void free_map_bootmem(struct page *memmap)
 
 	vmemmap_free(start, end, NULL);
 }
-#endif /* CONFIG_MEMORY_HOTREMOVE */
 #else
 static struct page *__kmalloc_section_memmap(void)
 {
@@ -651,7 +649,6 @@ static void __kfree_section_memmap(struct page *memmap,
 			   get_order(sizeof(struct page) * PAGES_PER_SECTION));
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 static void free_map_bootmem(struct page *memmap)
 {
 	unsigned long maps_section_nr, removing_section_nr, i;
@@ -681,7 +678,6 @@ static void free_map_bootmem(struct page *memmap)
 			put_page_bootmem(page);
 	}
 }
-#endif /* CONFIG_MEMORY_HOTREMOVE */
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
 /**
@@ -746,7 +742,6 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
 	return ret;
 }
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
 #ifdef CONFIG_MEMORY_FAILURE
 static void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
 {
@@ -823,5 +818,4 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
 			PAGES_PER_SECTION - map_offset);
 	free_section_usemap(memmap, usemap, altmap);
 }
-#endif /* CONFIG_MEMORY_HOTREMOVE */
 #endif /* CONFIG_MEMORY_HOTPLUG */
-- 
2.20.1


^ permalink raw reply related

* [PATCH v2 2/8] s390x/mm: Implement arch_remove_memory()
From: David Hildenbrand @ 2019-05-07 18:37 UTC (permalink / raw)
  To: linux-mm
  Cc: Oscar Salvador, linux-s390, Michal Hocko, linux-ia64,
	Vasily Gorbik, linux-sh, David Hildenbrand, Heiko Carstens,
	linux-kernel, Mike Rapoport, Martin Schwidefsky, akpm,
	linuxppc-dev, Dan Williams
In-Reply-To: <20190507183804.5512-1-david@redhat.com>

Will come in handy when wanting to handle errors after
arch_add_memory().

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Oscar Salvador <osalvador@suse.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/mm/init.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 31b1071315d7..1e0cbae69f12 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -237,12 +237,13 @@ int arch_add_memory(int nid, u64 start, u64 size,
 void arch_remove_memory(int nid, u64 start, u64 size,
 			struct vmem_altmap *altmap)
 {
-	/*
-	 * There is no hardware or firmware interface which could trigger a
-	 * hot memory remove on s390. So there is nothing that needs to be
-	 * implemented.
-	 */
-	BUG();
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	vmem_remove_mapping(start, size);
 }
 #endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
-- 
2.20.1


^ permalink raw reply related

* [PATCH v2 0/8] mm/memory_hotplug: Factor out memory block device handling
From: David Hildenbrand @ 2019-05-07 18:37 UTC (permalink / raw)
  To: linux-mm
  Cc: Oscar Salvador, Michal Hocko, linux-ia64, linux-sh,
	Peter Zijlstra, Dave Hansen, Heiko Carstens, Chris Wilson,
	Masahiro Yamada, Pavel Tatashin, Rich Felker, Arun KS,
	H. Peter Anvin, Ingo Molnar, Rafael J. Wysocki, Qian Cai,
	linux-s390, Baoquan He, Logan Gunthorpe, David Hildenbrand,
	Mike Rapoport, Ingo Molnar, Fenghua Yu, Pavel Tatashin,
	Vasily Gorbik, Rob Herring, mike.travis@hpe.com, Nicholas Piggin,
	Martin Schwidefsky, Mark Brown, Borislav Petkov, Andy Lutomirski,
	Jonathan Cameron, Dan Williams, Wei Yang, Joonsoo Kim,
	Oscar Salvador, Tony Luck, Yoshinori Sato, Andrew Banman,
	Mathieu Malaterre, Greg Kroah-Hartman, linux-kernel,
	Mike Rapoport, Thomas Gleixner, Wei Yang, Alex Deucher,
	Paul Mackerras, akpm, linuxppc-dev, David S. Miller,
	Kirill A. Shutemov

We only want memory block devices for memory to be onlined/offlined
(add/remove from the buddy). This is required so user space can
online/offline memory and kdump gets notified about newly onlined memory.

Only such memory has the requirement of having to span whole memory blocks.
Let's factor out creation/removal of memory block devices. This helps
to further cleanup arch_add_memory/arch_remove_memory() and to make
implementation of new features easier. E.g. supplying a driver for
memory block devices becomes way easier (so user space is able to
distinguish different types of added memory to properly online it).

Patch 1 makes sure the memory block size granularity is always respected.
Patch 2 implements arch_remove_memory() on s390x. Patch 3 prepares
arch_remove_memory() to be also called without CONFIG_MEMORY_HOTREMOVE.
Patch 4,5 and 6 factor out creation/removal of memory block devices.
Patch 7 gets rid of some unlikely errors that could have happened, not
removing links between memory block devices and nodes, previously brought
up by Oscar.

Did a quick sanity test with DIMM plug/unplug, making sure all devices
and sysfs links properly get added/removed. Compile tested on s390x and
x86-64.

Based on git://git.cmpxchg.org/linux-mmots.git

Next refactoring on my list will be making sure that remove_memory()
will never deal with zones / access "struct pages". Any kind of zone
handling will have to be done when offlining system memory / before
removing device memory. I am thinking about remove_pfn_range_from_zone()",
du undo everything "move_pfn_range_to_zone()" did.

v1 -> v2:
- s390x/mm: Implement arch_remove_memory()
-- remove mapping after "__remove_pages"


David Hildenbrand (8):
  mm/memory_hotplug: Simplify and fix check_hotplug_memory_range()
  s390x/mm: Implement arch_remove_memory()
  mm/memory_hotplug: arch_remove_memory() and __remove_pages() with
    CONFIG_MEMORY_HOTPLUG
  mm/memory_hotplug: Create memory block devices after arch_add_memory()
  mm/memory_hotplug: Drop MHP_MEMBLOCK_API
  mm/memory_hotplug: Remove memory block devices before
    arch_remove_memory()
  mm/memory_hotplug: Make unregister_memory_block_under_nodes() never
    fail
  mm/memory_hotplug: Remove "zone" parameter from
    sparse_remove_one_section

 arch/ia64/mm/init.c            |   2 -
 arch/powerpc/mm/mem.c          |   2 -
 arch/s390/mm/init.c            |  15 +++--
 arch/sh/mm/init.c              |   2 -
 arch/x86/mm/init_32.c          |   2 -
 arch/x86/mm/init_64.c          |   2 -
 drivers/base/memory.c          | 109 +++++++++++++++++++--------------
 drivers/base/node.c            |  27 +++-----
 include/linux/memory.h         |   6 +-
 include/linux/memory_hotplug.h |  12 +---
 include/linux/node.h           |   7 +--
 mm/memory_hotplug.c            |  44 ++++++-------
 mm/sparse.c                    |  10 +--
 13 files changed, 104 insertions(+), 136 deletions(-)

-- 
2.20.1


^ permalink raw reply

* [PATCH v2 1/8] mm/memory_hotplug: Simplify and fix check_hotplug_memory_range()
From: David Hildenbrand @ 2019-05-07 18:37 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-s390, Michal Hocko, linux-ia64, Pavel Tatashin, linux-sh,
	Mathieu Malaterre, David Hildenbrand, linux-kernel, Wei Yang,
	Qian Cai, Arun KS, akpm, linuxppc-dev, Dan Williams,
	Oscar Salvador
In-Reply-To: <20190507183804.5512-1-david@redhat.com>

By converting start and size to page granularity, we actually ignore
unaligned parts within a page instead of properly bailing out with an
error.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Mathieu Malaterre <malat@debian.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 328878b6799d..202febe88b58 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1050,16 +1050,11 @@ int try_online_node(int nid)
 
 static int check_hotplug_memory_range(u64 start, u64 size)
 {
-	unsigned long block_sz = memory_block_size_bytes();
-	u64 block_nr_pages = block_sz >> PAGE_SHIFT;
-	u64 nr_pages = size >> PAGE_SHIFT;
-	u64 start_pfn = PFN_DOWN(start);
-
 	/* memory range must be block size aligned */
-	if (!nr_pages || !IS_ALIGNED(start_pfn, block_nr_pages) ||
-	    !IS_ALIGNED(nr_pages, block_nr_pages)) {
+	if (!size || !IS_ALIGNED(start, memory_block_size_bytes()) ||
+	    !IS_ALIGNED(size, memory_block_size_bytes())) {
 		pr_err("Block size [%#lx] unaligned hotplug range: start %#llx, size %#llx",
-		       block_sz, start, size);
+		       memory_block_size_bytes(), start, size);
 		return -EINVAL;
 	}
 
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH] vfio-pci/nvlink2: Fix potential VMA leak
From: Alex Williamson @ 2019-05-07 17:25 UTC (permalink / raw)
  To: Greg Kurz; +Cc: Sam Bobroff, Alexey Kardashevskiy, linuxppc-dev, linux-kernel
In-Reply-To: <20190507090145.4be12c82@bahia.lan>

On Tue, 7 May 2019 09:01:45 +0200
Greg Kurz <groug@kaod.org> wrote:

> On Tue, 7 May 2019 11:52:44 +1000
> Sam Bobroff <sbobroff@linux.ibm.com> wrote:
> 
> > On Mon, May 06, 2019 at 03:58:45PM -0600, Alex Williamson wrote:  
> > > On Fri, 19 Apr 2019 17:37:17 +0200
> > > Greg Kurz <groug@kaod.org> wrote:
> > >     
> > > > If vfio_pci_register_dev_region() fails then we should rollback
> > > > previous changes, ie. unmap the ATSD registers.
> > > > 
> > > > Signed-off-by: Greg Kurz <groug@kaod.org>
> > > > ---    
> > > 
> > > Applied to vfio next branch for v5.2 with Alexey's R-b.  Thanks!
> > > 
> > > Alex    
> > 
> > Should this have a fixes tag? e.g.:
> > Fixes: 7f92891778df ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver")
> >   
> 
> Oops... you're right.
> 
> Alex, can you add the above tag ?

Added.  Thanks,

Alex

^ permalink raw reply

* Re: [PATCH v2 15/16] powernv/fadump: consider f/w load area
From: Mahesh J Salgaonkar @ 2019-05-07 17:13 UTC (permalink / raw)
  To: Hari Bathini
  Cc: Ananth N Mavinakayanahalli, Mahesh J Salgaonkar, Vasant Hegde,
	linuxppc-dev, Nicholas Piggin, Stewart Smith, Daniel Axtens
In-Reply-To: <155541097094.812.18328895014763068053.stgit@hbathini.in.ibm.com>

On 2019-04-16 16:06:13 Tue, Hari Bathini wrote:
> OPAL loads kernel & initrd at 512MB offset (256MB size), also exported
> as ibm,opal/dump/fw-load-area. So, if boot memory size of FADump is
> less than 768MB, kernel memory to be exported as '/proc/vmcore' would
> be overwritten by f/w while loading kernel & initrd. To avoid such a
> scenario, enforce a minimum boot memory size of 768MB on OPAL platform.
> 
> Also, skip using FADump if a newer F/W version loads kernel & initrd
> above 768MB.
> 
> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
> ---
>  arch/powerpc/kernel/fadump-common.h          |   15 +++++++++++++--
>  arch/powerpc/kernel/fadump.c                 |    8 ++++++++
>  arch/powerpc/platforms/powernv/opal-fadump.c |   23 +++++++++++++++++++++++
>  3 files changed, 44 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/fadump-common.h b/arch/powerpc/kernel/fadump-common.h
> index 1bd3aeb..f59fdc7 100644
> --- a/arch/powerpc/kernel/fadump-common.h
> +++ b/arch/powerpc/kernel/fadump-common.h
> @@ -24,14 +24,25 @@
>  #define RMA_END		(ppc64_rma_size)
>  
>  /*
> + * With kernel & initrd loaded at 512MB (with 256MB size), enforce a minimum
> + * boot memory size of 768MB to ensure f/w loading kernel and initrd doesn't
> + * mess with crash'ed kernel's memory during MPIPL.
> + */
> +#define OPAL_MIN_BOOT_MEM	(0x30000000UL)
> +
> +/*
>   * On some Power systems where RMO is 128MB, it still requires minimum of
>   * 256MB for kernel to boot successfully. When kdump infrastructure is
>   * configured to save vmcore over network, we run into OOM issue while
>   * loading modules related to network setup. Hence we need additional 64M
>   * of memory to avoid OOM issue.
>   */
> -#define MIN_BOOT_MEM	(((RMA_END < (0x1UL << 28)) ? (0x1UL << 28) : RMA_END) \
> -			+ (0x1UL << 26))
> +#define PSERIES_MIN_BOOT_MEM	(((RMA_END < (0x1UL << 28)) ? (0x1UL << 28) : \
> +				 RMA_END) + (0x1UL << 26))
> +
> +#define MIN_BOOT_MEM	((fw_dump.fadump_platform ==			\
> +			 FADUMP_PLATFORM_POWERNV) ? OPAL_MIN_BOOT_MEM :	\
> +			 PSERIES_MIN_BOOT_MEM)

Can we hide this behind fadump_ops.get_bootmem_min() instead of common code
doing platform check ?

>  
>  /* The upper limit percentage for user specified boot memory size (25%) */
>  #define MAX_BOOT_MEM_RATIO			4
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index ba26169..3c3adc2 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -582,6 +582,14 @@ int __init fadump_reserve_mem(void)
>  				ALIGN(fw_dump.boot_memory_size,
>  							FADUMP_CMA_ALIGNMENT);
>  #endif
> +
> +		if ((fw_dump.fadump_platform == FADUMP_PLATFORM_POWERNV) &&
> +		    (fw_dump.boot_memory_size < OPAL_MIN_BOOT_MEM)) {

and here too.. fadump_ops.validate_bootmem_size() ? push platform specific
stuff behind fadump_ops.

Rest looks good.

Thanks,
-Mahesh.


^ permalink raw reply

* Re: [PATCH] powerpc: slightly improve cache helpers
From: Christophe Leroy @ 2019-05-07 16:53 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: linuxppc-dev, Paul Mackerras, linux-kernel
In-Reply-To: <20190507151030.GF8599@gate.crashing.org>



Le 07/05/2019 à 17:10, Segher Boessenkool a écrit :
> Hi Christophe,
> 
> On Tue, May 07, 2019 at 01:31:39PM +0000, Christophe Leroy wrote:
>> Cache instructions (dcbz, dcbi, dcbf and dcbst) take two registers
>> that are summed to obtain the target address. Using '%y0' argument
>> gives GCC the opportunity to use both registers instead of only one
>> with the second being forced to 0.
> 
> That's not quite right.  Sorry if I didn't explain it properly.
> 
> "m" allows all memory.  But this instruction only allows reg,reg and
> 0,reg addressing.  For that you need to use constraint "Z".

But gcc help 
(https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints) 
says it is better to use 'm':

Z

     Memory operand that is an indexed or indirect from a register (it 
is usually better to use ‘m’ or ‘es’ in asm statements)

That's the reason why I used 'm', I thought it was equivalent.

Christophe

> 
> The output modifier "%y0" just makes [reg] (i.e. simple indirect addressing)
> print as "0,reg" instead of "0(reg)" as it would by default (for just "%0").
> 
> 
> Segher
> 

^ permalink raw reply

* Re: [PATCH] x86/mpx: fix recursive munmap() corruption
From: Laurent Dufour @ 2019-05-07 16:35 UTC (permalink / raw)
  To: Michael Ellerman, Dave Hansen, Thomas Gleixner, Dave Hansen
  Cc: mhocko, rguenther, x86, LKML, stable, luto, linux-mm,
	Andrew Morton, linuxppc-dev, vbabka
In-Reply-To: <87k1faa2i0.fsf@concordia.ellerman.id.au>

[-- Attachment #1: Type: text/plain, Size: 1527 bytes --]

Le 01/05/2019 à 12:32, Michael Ellerman a écrit :
> Laurent Dufour <ldufour@linux.vnet.ibm.com> writes:
>> Le 23/04/2019 à 18:04, Dave Hansen a écrit :
>>> On 4/23/19 4:16 AM, Laurent Dufour wrote:
> ...
>>>> There are 2 assumptions here:
>>>>    1. 'start' and 'end' are page aligned (this is guaranteed by __do_munmap().
>>>>    2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store on powerpc)
>>>
>>> Are you sure about #2?  The 'vdso64_pages' variable seems rather
>>> unnecessary if the VDSO is only 1 page. ;)
>>
>> Hum, not so sure now ;)
>> I got confused, only the header is one page.
>> The test is working as a best effort, and don't cover the case where
>> only few pages inside the VDSO are unmmapped (start >
>> mm->context.vdso_base). This is not what CRIU is doing and so this was
>> enough for CRIU support.
>>
>> Michael, do you think there is a need to manage all the possibility
>> here, since the only user is CRIU and unmapping the VDSO is not a so
>> good idea for other processes ?
> 
> Couldn't we implement the semantic that if any part of the VDSO is
> unmapped then vdso_base is set to zero? That should be fairly easy, eg:
> 
> 	if (start < vdso_end && end >= mm->context.vdso_base)
> 		mm->context.vdso_base = 0;
> 
> 
> We might need to add vdso_end to the mm->context, but that should be OK.
> 
> That seems like it would work for CRIU and make sense in general?

Sorry for the late answer, yes this would make more sense.

Here is a patch doing that.

Cheers,
Laurent



[-- Attachment #2: 0001-powerpc-vdso-handle-generic-unmap-of-the-VDSO.patch --]
[-- Type: text/plain, Size: 6973 bytes --]

From 5b64a86c2a8042c7785c3d3f5e58e954a2c8c843 Mon Sep 17 00:00:00 2001
From: Laurent Dufour <ldufour@linux.ibm.com>
Date: Tue, 7 May 2019 16:29:46 +0200
Subject: [PATCH] powerpc/vdso: handle generic unmap of the VDSO

Make the unmap of the VDSO more generic by checking for the start and end
of the VDSO.

This implies to add the vdso_end address in the mm_context_t structure.

Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/book3s/32/mmu-hash.h |  3 ++-
 arch/powerpc/include/asm/book3s/64/mmu.h      |  2 +-
 arch/powerpc/include/asm/mm-arch-hooks.h      |  5 ++++-
 arch/powerpc/include/asm/mmu_context.h        | 21 +++++++++++++++++--
 arch/powerpc/include/asm/nohash/32/mmu-40x.h  |  2 +-
 arch/powerpc/include/asm/nohash/32/mmu-44x.h  |  2 +-
 arch/powerpc/include/asm/nohash/32/mmu-8xx.h  |  2 +-
 arch/powerpc/include/asm/nohash/mmu-book3e.h  |  2 +-
 arch/powerpc/kernel/vdso.c                    |  2 ++
 9 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/32/mmu-hash.h b/arch/powerpc/include/asm/book3s/32/mmu-hash.h
index 2e277ca0170f..452152b809fc 100644
--- a/arch/powerpc/include/asm/book3s/32/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/32/mmu-hash.h
@@ -29,6 +29,7 @@
 #define BPP_RX	0x01		/* Read only */
 #define BPP_RW	0x02		/* Read/write */
 
+
 #ifndef __ASSEMBLY__
 /* Contort a phys_addr_t into the right format/bits for a BAT */
 #ifdef CONFIG_PHYS_64BIT
@@ -90,7 +91,7 @@ struct hash_pte {
 
 typedef struct {
 	unsigned long id;
-	unsigned long vdso_base;
+	unsigned long vdso_base, vdso_end;
 } mm_context_t;
 
 void update_bats(void);
diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index 74d24201fc4f..7a5a91a0696f 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -120,7 +120,7 @@ typedef struct {
 	struct npu_context *npu_context;
 	struct hash_mm_context *hash_context;
 
-	unsigned long vdso_base;
+	unsigned long vdso_base, vdso_end;
 	/*
 	 * pagetable fragment support
 	 */
diff --git a/arch/powerpc/include/asm/mm-arch-hooks.h b/arch/powerpc/include/asm/mm-arch-hooks.h
index f2a2da895897..1e2d527d3d1f 100644
--- a/arch/powerpc/include/asm/mm-arch-hooks.h
+++ b/arch/powerpc/include/asm/mm-arch-hooks.h
@@ -16,12 +16,15 @@ static inline void arch_remap(struct mm_struct *mm,
 			      unsigned long old_start, unsigned long old_end,
 			      unsigned long new_start, unsigned long new_end)
 {
+	unsigned long length = mm->context.vdso_end - mm->context.vdso_base;
 	/*
 	 * mremap() doesn't allow moving multiple vmas so we can limit the
 	 * check to old_start == vdso_base.
 	 */
-	if (old_start == mm->context.vdso_base)
+	if (old_start == mm->context.vdso_base) {
+		mm->context.vdso_end = new_start + length;
 		mm->context.vdso_base = new_start;
+	}
 }
 #define arch_remap arch_remap
 
diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 611204e588b9..c24f5ed0aeff 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -235,8 +235,25 @@ static inline void arch_unmap(struct mm_struct *mm,
 			      struct vm_area_struct *vma,
 			      unsigned long start, unsigned long end)
 {
-	if (start <= mm->context.vdso_base && mm->context.vdso_base < end)
-		mm->context.vdso_base = 0;
+	unsigned long vdso_base, vdso_end;
+
+	vdso_base = mm->context.vdso_base;
+	vdso_end = mm->context.vdso_end;
+
+	/*
+	 * Partial unmapping of pages inside the VDSO, is consider equivalent
+	 * to unmapping the VDSO.
+	 *
+	 * case 1   >  |     VDSO    |  <
+	 * case 2   >  |           < |
+	 * case 3      |  >        < |
+	 * case 4      |  >          |  <
+	 */
+
+	if ((start <= vdso_base && vdso_end <= end) ||  /* 1   */
+	    (vdso_base <= start && start < vdso_end) || /* 3,4 */
+	    (vdso_base < end && end <= vdso_end))       /* 2,3 */
+		mm->context.vdso_base = mm->context.vdso_end = 0;
 }
 
 static inline void arch_bprm_mm_init(struct mm_struct *mm,
diff --git a/arch/powerpc/include/asm/nohash/32/mmu-40x.h b/arch/powerpc/include/asm/nohash/32/mmu-40x.h
index 74f4edb5916e..98739ba9d36e 100644
--- a/arch/powerpc/include/asm/nohash/32/mmu-40x.h
+++ b/arch/powerpc/include/asm/nohash/32/mmu-40x.h
@@ -57,7 +57,7 @@
 typedef struct {
 	unsigned int	id;
 	unsigned int	active;
-	unsigned long	vdso_base;
+	unsigned long	vdso_base, vdso_end;
 } mm_context_t;
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/nohash/32/mmu-44x.h b/arch/powerpc/include/asm/nohash/32/mmu-44x.h
index 28aa3b339c5e..de1d5b1c8cec 100644
--- a/arch/powerpc/include/asm/nohash/32/mmu-44x.h
+++ b/arch/powerpc/include/asm/nohash/32/mmu-44x.h
@@ -108,7 +108,7 @@ extern unsigned int tlb_44x_index;
 typedef struct {
 	unsigned int	id;
 	unsigned int	active;
-	unsigned long	vdso_base;
+	unsigned long	vdso_base, vdso_end;
 } mm_context_t;
 
 /* patch sites */
diff --git a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h
index 76af5b0cb16e..414ce6638b20 100644
--- a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h
@@ -209,7 +209,7 @@ struct slice_mask {
 typedef struct {
 	unsigned int id;
 	unsigned int active;
-	unsigned long vdso_base;
+	unsigned long vdso_base, vdso_end;
 #ifdef CONFIG_PPC_MM_SLICES
 	u16 user_psize;		/* page size index */
 	unsigned char low_slices_psize[SLICE_ARRAY_SIZE];
diff --git a/arch/powerpc/include/asm/nohash/mmu-book3e.h b/arch/powerpc/include/asm/nohash/mmu-book3e.h
index 4c9777d256fb..8f406ad9fe25 100644
--- a/arch/powerpc/include/asm/nohash/mmu-book3e.h
+++ b/arch/powerpc/include/asm/nohash/mmu-book3e.h
@@ -229,7 +229,7 @@ extern unsigned int tlbcam_index;
 typedef struct {
 	unsigned int	id;
 	unsigned int	active;
-	unsigned long	vdso_base;
+	unsigned long	vdso_base, vdso_end;
 } mm_context_t;
 
 /* Page size definitions, common between 32 and 64-bit
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index a31b6234fcd7..263f820cc666 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -182,6 +182,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 #endif
 
 	current->mm->context.vdso_base = 0;
+	current->mm->context.vdso_end = 0;
 
 	/* vDSO has a problem and was disabled, just don't "enable" it for the
 	 * process
@@ -217,6 +218,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	 * will fail to recognise it as a vDSO (since arch_vma_name fails).
 	 */
 	current->mm->context.vdso_base = vdso_base;
+	current->mm->context.vdso_end = vdso_base + (vdso_pages << PAGE_SHIFT);
 
 	/*
 	 * our vma flags don't have VM_WRITE so by default, the process isn't
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH v2 09/16] powernv/fadump: process architected register state data provided by firmware
From: Segher Boessenkool @ 2019-05-07 16:00 UTC (permalink / raw)
  To: Mahesh J Salgaonkar
  Cc: Ananth N Mavinakayanahalli, Mahesh J Salgaonkar, Vasant Hegde,
	linuxppc-dev, Nicholas Piggin, Stewart Smith, Hari Bathini,
	Daniel Axtens
In-Reply-To: <20190507141356.saug5kjhntwozu76@in.ibm.com>

On Tue, May 07, 2019 at 07:43:56PM +0530, Mahesh J Salgaonkar wrote:
> Can we use SPRN_* #defines which are already present in asm/reg.h instead of
> hard coding numbers for switch cases ? You may want to add new #defines
> for NIP, MSR and CCR.

But none of those three are SPRs.  Please don't pollute that namespace.
Put such defines into some fadump header, instead?


Segher

^ permalink raw reply

* Re: [PATCH] powerpc: slightly improve cache helpers
From: Segher Boessenkool @ 2019-05-07 15:10 UTC (permalink / raw)
  To: Christophe Leroy; +Cc: linuxppc-dev, Paul Mackerras, linux-kernel
In-Reply-To: <0b460a85319fb89dab2c5d1200ac69a3e1b7c1ef.1557235807.git.christophe.leroy@c-s.fr>

Hi Christophe,

On Tue, May 07, 2019 at 01:31:39PM +0000, Christophe Leroy wrote:
> Cache instructions (dcbz, dcbi, dcbf and dcbst) take two registers
> that are summed to obtain the target address. Using '%y0' argument
> gives GCC the opportunity to use both registers instead of only one
> with the second being forced to 0.

That's not quite right.  Sorry if I didn't explain it properly.

"m" allows all memory.  But this instruction only allows reg,reg and
0,reg addressing.  For that you need to use constraint "Z".

The output modifier "%y0" just makes [reg] (i.e. simple indirect addressing)
print as "0,reg" instead of "0(reg)" as it would by default (for just "%0").


Segher

^ permalink raw reply

* Re: [PATCH v2] drivers/dax: Allow to include DEV_DAX_PMEM as builtin
From: Dan Williams @ 2019-05-07 15:09 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Linux MM, linuxppc-dev, linux-nvdimm
In-Reply-To: <87pnoumql8.fsf@linux.ibm.com>

On Tue, May 7, 2019 at 4:50 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
>
> Hi Dan,
>
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>
> > This move the dependency to DEV_DAX_PMEM_COMPAT such that only
> > if DEV_DAX_PMEM is built as module we can allow the compat support.
> >
> > This allows to test the new code easily in a emulation setup where we
> > often build things without module support.
> >
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>
> Any update on this. Can we merge this?

Applied for the v5.2 pull request.

^ permalink raw reply

* Crashes in linux-next on powerpc with CONFIG_PPC_KUAP and CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG
From: Michael Ellerman @ 2019-05-07 14:54 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Stephen Rothwell, Petr Mladek

Hi folks,

Just an FYI in case anyone else is seeing crashes very early in boot in
linux-next with the above config options.

The problem is the combination of some new code called via printk(),
check_pointer() which calls probe_kernel_read(). That then calls 
allow_user_access() (PPC_KUAP) and that uses mmu_has_feature() too early
(before we've patched features). With the JUMP_LABEL debug enabled that
causes us to call printk() & dump_stack() and we end up recursing and
overflowing the stack.

Because it happens so early you don't get any output, just an apparently
dead system.

The stack trace (which you don't see) is something like:

  ...
  dump_stack+0xdc
  probe_kernel_read+0x1a4
  check_pointer+0x58
  string+0x3c
  vsnprintf+0x1bc
  vscnprintf+0x20
  printk_safe_log_store+0x7c
  printk+0x40
  dump_stack_print_info+0xbc
  dump_stack+0x8
  probe_kernel_read+0x1a4
  probe_kernel_read+0x19c
  check_pointer+0x58
  string+0x3c
  vsnprintf+0x1bc
  vscnprintf+0x20
  vprintk_store+0x6c
  vprintk_emit+0xec
  vprintk_func+0xd4
  printk+0x40
  cpufeatures_process_feature+0xc8
  scan_cpufeatures_subnodes+0x380
  of_scan_flat_dt_subnodes+0xb4
  dt_cpu_ftrs_scan_callback+0x158
  of_scan_flat_dt+0xf0
  dt_cpu_ftrs_scan+0x3c
  early_init_devtree+0x360
  early_setup+0x9c


The simple fix is to use early_mmu_has_feature() in allow_user_access(),
but we'd rather not do that because it penalises all
copy_to/from_users() for the life of the system with the cost of the
runtime check vs the jump label. The irony is probe_kernel_read()
shouldn't be allowing user access at all, because we're reading the
kernel not userspace.

For now if you're hitting it just turn off 
CONFIG_PPC_KUAP and/or CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG.

cheers

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox