LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH 14/16] block: use memcpy_from_bvec in __blk_queue_bounce
From: Ira Weiny @ 2021-06-09  1:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Thomas Bogendoerfer, Mike Snitzer, Geoff Levand,
	linux-mips, Dongsheng Yang, linux-kernel, linux-block, dm-devel,
	Ilya Dryomov, linuxppc-dev, ceph-devel
In-Reply-To: <20210608160603.1535935-15-hch@lst.de>

On Tue, Jun 08, 2021 at 06:06:01PM +0200, Christoph Hellwig wrote:
> Rewrite the actual bounce buffering loop in __blk_queue_bounce to that
> the memcpy_to_bvec helper can be used to perform the data copies.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/bounce.c | 21 +++++++--------------
>  1 file changed, 7 insertions(+), 14 deletions(-)
> 
> diff --git a/block/bounce.c b/block/bounce.c
> index a2fc6326b6c9..b5ad09e07bcf 100644
> --- a/block/bounce.c
> +++ b/block/bounce.c
> @@ -243,24 +243,17 @@ void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig)
>  	 * because the 'bio' is single-page bvec.
>  	 */
>  	for (i = 0, to = bio->bi_io_vec; i < bio->bi_vcnt; to++, i++) {
> -		struct page *page = to->bv_page;
> +		struct page *bounce_page;
>  
> -		if (!PageHighMem(page))
> +		if (!PageHighMem(to->bv_page))
>  			continue;
>  
> -		to->bv_page = mempool_alloc(&page_pool, GFP_NOIO);
> -		inc_zone_page_state(to->bv_page, NR_BOUNCE);
> +		bounce_page = mempool_alloc(&page_pool, GFP_NOIO);
> +		inc_zone_page_state(bounce_page, NR_BOUNCE);
>  
> -		if (rw == WRITE) {
> -			char *vto, *vfrom;
> -
> -			flush_dcache_page(page);
> -
> -			vto = page_address(to->bv_page) + to->bv_offset;
> -			vfrom = kmap_atomic(page) + to->bv_offset;
> -			memcpy(vto, vfrom, to->bv_len);
> -			kunmap_atomic(vfrom);
> -		}
> +		if (rw == WRITE)
> +			memcpy_from_bvec(page_address(bounce_page), to);

NIT: the fact that the copy is from 'to' makes my head hurt...  But I don't
see a good way to change that without declaring unnecessary variables...  :-(

The logic seems right.

Ira

> +		to->bv_page = bounce_page;
>  	}
>  
>  	trace_block_bio_bounce(*bio_orig);
> -- 
> 2.30.2
> 

^ permalink raw reply

* Re: switch the block layer to use kmap_local_page
From: Ira Weiny @ 2021-06-09  1:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Thomas Bogendoerfer, Mike Snitzer, Geoff Levand,
	linux-mips, Dongsheng Yang, linux-kernel, linux-block, dm-devel,
	Ilya Dryomov, linuxppc-dev, ceph-devel
In-Reply-To: <20210608160603.1535935-1-hch@lst.de>

On Tue, Jun 08, 2021 at 06:05:47PM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this series switches the core block layer code and all users of the
> existing bvec kmap helpers to use kmap_local_page.  Drivers that
> currently use open coded kmap_atomic calls will converted in a follow
> on series.

Other than the missing flush_dcache's.

For the series.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

> 
> Diffstat:
>  arch/mips/include/asm/mach-rc32434/rb.h |    2 -
>  block/bio-integrity.c                   |   14 ++++------
>  block/bio.c                             |   37 +++++++---------------------
>  block/blk-map.c                         |    2 -
>  block/bounce.c                          |   35 ++++++--------------------
>  block/t10-pi.c                          |   16 ++++--------
>  drivers/block/ps3disk.c                 |   19 ++------------
>  drivers/block/rbd.c                     |   15 +----------
>  drivers/md/dm-writecache.c              |    5 +--
>  include/linux/bio.h                     |   42 --------------------------------
>  include/linux/bvec.h                    |   27 ++++++++++++++++++--
>  include/linux/highmem.h                 |    4 +--
>  12 files changed, 64 insertions(+), 154 deletions(-)

^ permalink raw reply

* Re: [PATCH v2] lockdown,selinux: avoid bogus SELinux lockdown permission checks
From: Paul Moore @ 2021-06-09  2:40 UTC (permalink / raw)
  To: Ondrej Mosnacek
  Cc: SElinux list, network dev, Stephen Smalley, James Morris,
	Steven Rostedt, Linux kernel mailing list, Casey Schaufler,
	Linux Security Module list, Ingo Molnar, Linux FS Devel, bpf,
	linuxppc-dev
In-Reply-To: <CAFqZXNsVFv2yh5cXwWYXscYTAuoCKk4H-JbPAYzDbwKUzSDP3A@mail.gmail.com>

On Tue, Jun 8, 2021 at 7:02 AM Ondrej Mosnacek <omosnace@redhat.com> wrote:
> On Thu, Jun 3, 2021 at 7:46 PM Paul Moore <paul@paul-moore.com> wrote:

...

> > It sounds an awful lot like the lockdown hook is in the wrong spot.
> > It sounds like it would be a lot better to relocate the hook than
> > remove it.
>
> I don't see how you would solve this by moving the hook. Where do you
> want to relocate it?

Wherever it makes sense.  Based on your comments it really sounded
like the hook was in a bad spot and since your approach in a lot of
this had been to remove or disable hooks I wanted to make sure that
relocating the hook was something you had considered.  Thankfully it
sounds like you have considered moving the hook - that's good.

> The main obstacle is that the message containing
> the SA dump is sent to consumers via a simple netlink broadcast, which
> doesn't provide a facility to redact the SA secret on a per-consumer
> basis. I can't see any way to make the checks meaningful for SELinux
> without a major overhaul of the broadcast logic.

Fair enough.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH] crash_core, vmcoreinfo: Append 'SECTION_SIZE_BITS' to vmcoreinfo
From: Baoquan He @ 2021-06-09  2:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mark Rutland, Kazuhito Hagio, Bhupesh Sharma, linux-arm-kernel,
	Will Deacon, x86, kexec, linuxppc-dev, Pingfan Liu, linux-kernel,
	Boris Petkov, Catalin Marinas, James Morse, Thomas Gleixner,
	Dave Young, Ingo Molnar, Paul Mackerras, Dave Anderson
In-Reply-To: <20210608141410.0026a925ba3a609b0dd4e560@linux-foundation.org>

On 06/08/21 at 02:14pm, Andrew Morton wrote:
> On Tue, 8 Jun 2021 22:24:32 +0800 Baoquan He <bhe@redhat.com> wrote:
> 
> > On 06/08/21 at 06:33am, Pingfan Liu wrote:
> > > As mentioned in kernel commit 1d50e5d0c505 ("crash_core, vmcoreinfo:
> > > Append 'MAX_PHYSMEM_BITS' to vmcoreinfo"), SECTION_SIZE_BITS in the
> > > formula:
> > >     #define SECTIONS_SHIFT    (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
> > > 
> > > Besides SECTIONS_SHIFT, SECTION_SIZE_BITS is also used to calculate
> > > PAGES_PER_SECTION in makedumpfile just like kernel.
> > > 
> > > Unfortunately, this arch-dependent macro SECTION_SIZE_BITS changes, e.g.
> > > recently in kernel commit f0b13ee23241 ("arm64/sparsemem: reduce
> > > SECTION_SIZE_BITS"). But user space wants a stable interface to get this
> > > info. Such info is impossible to be deduced from a crashdump vmcore.
> > > Hence append SECTION_SIZE_BITS to vmcoreinfo.
> > 
> > ...
> >
> > Add the discussion of the original thread in kexec ML for reference:
> > http://lists.infradead.org/pipermail/kexec/2021-June/022676.html
> 
> I added a Link: for this.

Thanks, Andrew.

>  
> > This looks good to me.
> > 
> > Acked-by: Baoquan He <bhe@redhat.com>
> 
> I'm thinking we should backport this at least to Fixes:f0b13ee23241. 
> But perhaps it's simpler to just backport it as far as possible, so I
> added a bare cc:stable with no Fixes:.  Thoughts?

Yeah, it should add cc:stable, thanks. Otherwise it will break
vmcore dumping on 5.12 stable kernel even though with the updated
makedumpfile utility. Fixes:f0b13ee23241 will help stable kernel
maintainer easier to identify which kernel this patch need be applied
on? If only having cc:stable with no Fixes is allowed, it's also OK. 


^ permalink raw reply

* Re: [PATCH v2 2/3] powerpc/mm/hash: Avoid multiple HPT resize-ups on memory hotplug
From: Leonardo Brás @ 2021-06-09  3:09 UTC (permalink / raw)
  To: David Gibson
  Cc: Nathan Lynch, Aneesh Kumar K.V, David Hildenbrand, Scott Cheloha,
	linux-kernel, Nicholas Piggin, Paul Mackerras, Sandipan Das,
	Andrew Morton, Laurent Dufour, linuxppc-dev, Mike Rapoport
In-Reply-To: <YL2qKPhC2TrsFn6e@yekko>

On Mon, 2021-06-07 at 15:10 +1000, David Gibson wrote:
> On Fri, Apr 30, 2021 at 11:36:08AM -0300, Leonardo Bras wrote:
> > Every time a memory hotplug happens, and the memory limit crosses a
> > 2^n
> > value, it may be necessary to perform HPT resizing-up, which can
> > take
> > some time (over 100ms in my tests).
> > 
> > It usually is not an issue, but it can take some time if a lot of
> > memory
> > is added to a guest with little starting memory:
> > Adding 256G to a 2GB guest, for example will require 8 HPT resizes.
> > 
> > Perform an HPT resize before memory hotplug, updating HPT to its
> > final size (considering a successful hotplug), taking the number of
> > HPT resizes to at most one per memory hotplug action.
> > 
> > Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Thanks David!

> 
> > ---
> >  arch/powerpc/include/asm/book3s/64/hash.h     |  2 ++
> >  arch/powerpc/mm/book3s64/hash_utils.c         | 20
> > +++++++++++++++++++
> >  .../platforms/pseries/hotplug-memory.c        |  9 +++++++++
> >  3 files changed, 31 insertions(+)
> > 
> > diff --git a/arch/powerpc/include/asm/book3s/64/hash.h
> > b/arch/powerpc/include/asm/book3s/64/hash.h
> > index d959b0195ad9..fad4af8b8543 100644
> > --- a/arch/powerpc/include/asm/book3s/64/hash.h
> > +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> > @@ -255,6 +255,8 @@ int hash__create_section_mapping(unsigned long
> > start, unsigned long end,
> >                                  int nid, pgprot_t prot);
> >  int hash__remove_section_mapping(unsigned long start, unsigned
> > long end);
> >  
> > +void hash_batch_expand_prepare(unsigned long newsize);
> > +
> >  #endif /* !__ASSEMBLY__ */
> >  #endif /* __KERNEL__ */
> >  #endif /* _ASM_POWERPC_BOOK3S_64_HASH_H */
> > diff --git a/arch/powerpc/mm/book3s64/hash_utils.c
> > b/arch/powerpc/mm/book3s64/hash_utils.c
> > index 608e4ed397a9..3fa395b3fe57 100644
> > --- a/arch/powerpc/mm/book3s64/hash_utils.c
> > +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> > @@ -859,6 +859,26 @@ int hash__remove_section_mapping(unsigned long
> > start, unsigned long end)
> >  
> >         return rc;
> >  }
> > +
> > +void hash_batch_expand_prepare(unsigned long newsize)
> > +{
> > +       const u64 starting_size = ppc64_pft_size;
> > +
> > +       /*
> > +        * Resizing-up HPT should never fail, but there are some
> > cases system starts with higher
> > +        * SHIFT than required, and we go through the funny case of
> > resizing HPT down while
> > +        * adding memory
> > +        */
> > +
> > +       while (resize_hpt_for_hotplug(newsize, false) == -ENOSPC) {
> > +               newsize *= 2;
> > +               pr_warn("Hash collision while resizing HPT\n");
> > +
> > +               /* Do not try to resize to the starting size, or
> > bigger value */
> > +               if (htab_shift_for_mem_size(newsize) >=
> > starting_size)
> > +                       break;
> > +       }
> > +}
> >  #endif /* CONFIG_MEMORY_HOTPLUG */
> >  
> >  static void __init hash_init_partition_table(phys_addr_t
> > hash_table,
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c
> > b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > index 8377f1f7c78e..48b2cfe4ce69 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > @@ -13,6 +13,7 @@
> >  #include <linux/memory.h>
> >  #include <linux/memory_hotplug.h>
> >  #include <linux/slab.h>
> > +#include <linux/pgtable.h>
> >  
> >  #include <asm/firmware.h>
> >  #include <asm/machdep.h>
> > @@ -671,6 +672,10 @@ static int dlpar_memory_add_by_count(u32
> > lmbs_to_add)
> >         if (lmbs_available < lmbs_to_add)
> >                 return -EINVAL;
> >  
> > +       if (!radix_enabled())
> > +               hash_batch_expand_prepare(memblock_phys_mem_size()
> > +
> > +                                                lmbs_to_add *
> > drmem_lmb_size());
> > +
> >         for_each_drmem_lmb(lmb) {
> >                 if (lmb->flags & DRCONF_MEM_ASSIGNED)
> >                         continue;
> > @@ -788,6 +793,10 @@ static int dlpar_memory_add_by_ic(u32
> > lmbs_to_add, u32 drc_index)
> >         if (lmbs_available < lmbs_to_add)
> >                 return -EINVAL;
> >  
> > +       if (!radix_enabled())
> > +               hash_batch_expand_prepare(memblock_phys_mem_size()
> > +
> > +                                         lmbs_to_add *
> > drmem_lmb_size());
> > +
> >         for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
> >                 if (lmb->flags & DRCONF_MEM_ASSIGNED)
> >                         continue;
> 
> -- 
> David Gibson                    | I'll have my music baroque, and my
> code
> david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_
> _other_
>                                 | _way_ _around_!
> http://www.ozlabs.org/~dgibson



^ permalink raw reply

* Re: [PATCH 03/11] Documentation: ocxl.rst: change FPGA indirect article to an
From: Andrew Donnellan @ 2021-06-09  3:10 UTC (permalink / raw)
  To: trix, mdf, robh+dt, hao.wu, corbet, fbarrat, bbrezillon, arno,
	schalla, herbert, davem, gregkh, Sven.Auhagen, grandmaster
  Cc: devicetree, linux-doc, linux-fpga, linux-staging, linux-kernel,
	linux-crypto, linuxppc-dev
In-Reply-To: <20210608212350.3029742-5-trix@redhat.com>

On 9/6/21 7:23 am, trix@redhat.com wrote:
> From: Tom Rix <trix@redhat.com>
> 
> Change use of 'a fpga' to 'an fpga'
> 
> Signed-off-by: Tom Rix <trix@redhat.com>

Acked-by: Andrew Donnellan <ajd@linux.ibm.com>


-- 
Andrew Donnellan              OzLabs, ADL Canberra
ajd@linux.ibm.com             IBM Australia Limited

^ permalink raw reply

* Re: [PATCH v2 1/3] powerpc/mm/hash: Avoid resizing-down HPT on first memory hotplug
From: David Gibson @ 2021-06-09  4:40 UTC (permalink / raw)
  To: Leonardo Brás
  Cc: Nathan Lynch, Aneesh Kumar K.V, David Hildenbrand, Scott Cheloha,
	linux-kernel, Nicholas Piggin, Paul Mackerras, Sandipan Das,
	Andrew Morton, Laurent Dufour, linuxppc-dev, Mike Rapoport
In-Reply-To: <648b382159009c5f4277d9b9c3f896142ea75d6c.camel@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3445 bytes --]

On Tue, Jun 08, 2021 at 09:52:10PM -0300, Leonardo Brás wrote:
> On Mon, 2021-06-07 at 15:02 +1000, David Gibson wrote:
> > On Fri, Apr 30, 2021 at 11:36:06AM -0300, Leonardo Bras wrote:
> > > Because hypervisors may need to create HPTs without knowing the
> > > guest
> > > page size, the smallest used page-size (4k) may be chosen,
> > > resulting in
> > > a HPT that is possibly bigger than needed.
> > > 
> > > On a guest with bigger page-sizes, the amount of entries for HTP
> > > may be
> > > too high, causing the guest to ask for a HPT resize-down on the
> > > first
> > > hotplug.
> > > 
> > > This becomes a problem when HPT resize-down fails, and causes the
> > > HPT resize to be performed on every LMB added, until HPT size is
> > > compatible to guest memory size, causing a major slowdown.
> > > 
> > > So, avoiding HPT resizing-down on hot-add significantly improves
> > > memory
> > > hotplug times.
> > > 
> > > As an example, hotplugging 256GB on a 129GB guest took 710s without
> > > this
> > > patch, and 21s after applied.
> > > 
> > > Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
> > 
> > Sorry it's taken me so long to look at these
> > 
> > I don't love the extra statefulness that the 'shrinking' parameter
> > adds, but I can't see an elegant way to avoid it, so:
> > 
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> np, thanks for reviewing!

Actually... I take that back.  With the subsequent patches my
discomfort with the complexity of implementing the batching grew.

I think I can see a simpler way - although it wasn't as clear as I
thought it might be, without some deep history on this feature.

What's going on here is pretty hard to follow, because it starts in
arch-specific code (arch/powerpc/platforms/pseries/hotplug-memory.c)
where it processes the add/remove requests, then goes into generic
code __add_memory() which eventually emerges back in arch specific
code (hash__create_section_mapping()).

The HPT resizing calls are in the "inner" arch specific section,
whereas it's only the outer arch section that has the information to
batch properly.  The mutex and 'shrinking' parameter in Leonardo's
code are all about conveying information from the outer to inner
section.

Now, I think the reason I had the resize calls in the inner section
was to accomodate the notion that a) pHyp might support resizing in
future, and it could come in through a different path with its drmgr
thingy and/or b) bare metal hash architectures might want to implement
hash resizing, and this would make at least part of the path common.

Given the decreasing relevance of hash MMUs, I think we can now safely
say neither of these is ever going to happen.

Therefore, we can simplify things by moving the HPT resize calls into
the pseries LMB code, instead of create/remove_section_mapping.  Then
to do batching without extra complications we just need this logic for
all resizes (both add and remove):

	let new_hpt_order = expected HPT size for new mem size;

	if (new_hpt_order > current_hpt_order)
		resize to new_hpt_order

	add/remove memory

	if (new_hpt_order < current_hpt_order - 1)
		resize to new_hpt_order


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v2 3/3] powerpc/mm/hash: Avoid multiple HPT resize-downs on memory hotunplug
From: Leonardo Brás @ 2021-06-09  5:30 UTC (permalink / raw)
  To: David Gibson
  Cc: Nathan Lynch, Aneesh Kumar K.V, David Hildenbrand, Scott Cheloha,
	linux-kernel, Nicholas Piggin, Paul Mackerras, Sandipan Das,
	Andrew Morton, Laurent Dufour, linuxppc-dev, Mike Rapoport
In-Reply-To: <YL2sjKM7ByS0Xeko@yekko>

On Mon, 2021-06-07 at 15:20 +1000, David Gibson wrote:
> On Fri, Apr 30, 2021 at 11:36:10AM -0300, Leonardo Bras wrote:
> > During memory hotunplug, after each LMB is removed, the HPT may be
> > resized-down if it would map a max of 4 times the current amount of
> > memory.
> > (2 shifts, due to introduced histeresis)
> > 
> > It usually is not an issue, but it can take a lot of time if HPT
> > resizing-down fails. This happens  because resize-down failures
> > usually repeat at each LMB removal, until there are no more bolted
> > entries
> > conflict, which can take a while to happen.
> > 
> > This can be solved by doing a single HPT resize at the end of
> > memory
> > hotunplug, after all requested entries are removed.
> > 
> > To make this happen, it's necessary to temporarily disable all HPT
> > resize-downs before hotunplug, re-enable them after hotunplug ends,
> > and then resize-down HPT to the current memory size.
> > 
> > As an example, hotunplugging 256GB from a 385GB guest took 621s
> > without
> > this patch, and 100s after applied.
> > 
> > Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
> 
> Hrm.  This looks correct, but it seems overly complicated.
> 
> AFAICT, the resize calls that this adds should in practice be the
> *only* times we call resize, all the calls from the lower level code
> should be suppressed. 

That's correct.

>  In which case can't we just remove those calls
> entirely, and not deal with the clunky locking and exclusion here.
> That should also remove the need for the 'shrinking' parameter in
> 1/3.


If I get your suggestion correctly, you suggest something like:
1 - Never calling resize_hpt_for_hotplug() in
hash__remove_section_mapping(), thus not needing the srinking
parameter.
2 - Functions in hotplug-memory.c that call dlpar_remove_lmb() would in
fact call another function to do the batch resize_hpt_for_hotplug() for
them

If so, that assumes that no other function that currently calls
resize_hpt_for_hotplug() under another path, or if they do, it does not
need to actually resize the HPT.

Is the above correct?

There are some examples of functions that currently call
resize_hpt_for_hotplug() by another path:

add_memory_driver_managed
	virtio_mem_add_memory
	dev_dax_kmem_probe

reserve_additional_memory
	balloon_process
	add_ballooned_pages

__add_memory
	probe_store

__remove_memory
	pseries_remove_memblock

remove_memory
	dev_dax_kmem_remove
	virtio_mem_remove_memory

memunmap_pages
	pci_p2pdma_add_resource
	virtio_fs_setup_dax


Best regards,
Leonardo Bras


^ permalink raw reply

* Re: [PATCH v2 1/3] powerpc/mm/hash: Avoid resizing-down HPT on first memory hotplug
From: Leonardo Brás @ 2021-06-09  5:51 UTC (permalink / raw)
  To: David Gibson
  Cc: Nathan Lynch, Aneesh Kumar K.V, David Hildenbrand, Scott Cheloha,
	linux-kernel, Nicholas Piggin, Paul Mackerras, Sandipan Das,
	Andrew Morton, Laurent Dufour, linuxppc-dev, Mike Rapoport
In-Reply-To: <YMBGW3RQOzoQxBqy@yekko>

On Wed, 2021-06-09 at 14:40 +1000, David Gibson wrote:
> On Tue, Jun 08, 2021 at 09:52:10PM -0300, Leonardo Brás wrote:
> > On Mon, 2021-06-07 at 15:02 +1000, David Gibson wrote:
> > > On Fri, Apr 30, 2021 at 11:36:06AM -0300, Leonardo Bras wrote:
> > > > Because hypervisors may need to create HPTs without knowing the
> > > > guest
> > > > page size, the smallest used page-size (4k) may be chosen,
> > > > resulting in
> > > > a HPT that is possibly bigger than needed.
> > > > 
> > > > On a guest with bigger page-sizes, the amount of entries for
> > > > HTP
> > > > may be
> > > > too high, causing the guest to ask for a HPT resize-down on the
> > > > first
> > > > hotplug.
> > > > 
> > > > This becomes a problem when HPT resize-down fails, and causes
> > > > the
> > > > HPT resize to be performed on every LMB added, until HPT size
> > > > is
> > > > compatible to guest memory size, causing a major slowdown.
> > > > 
> > > > So, avoiding HPT resizing-down on hot-add significantly
> > > > improves
> > > > memory
> > > > hotplug times.
> > > > 
> > > > As an example, hotplugging 256GB on a 129GB guest took 710s
> > > > without
> > > > this
> > > > patch, and 21s after applied.
> > > > 
> > > > Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
> > > 
> > > Sorry it's taken me so long to look at these
> > > 
> > > I don't love the extra statefulness that the 'shrinking'
> > > parameter
> > > adds, but I can't see an elegant way to avoid it, so:
> > > 
> > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > 
> > np, thanks for reviewing!
> 
> Actually... I take that back.  With the subsequent patches my
> discomfort with the complexity of implementing the batching grew.
> 
> I think I can see a simpler way - although it wasn't as clear as I
> thought it might be, without some deep history on this feature.
> 
> What's going on here is pretty hard to follow, because it starts in
> arch-specific code (arch/powerpc/platforms/pseries/hotplug-memory.c)
> where it processes the add/remove requests, then goes into generic
> code __add_memory() which eventually emerges back in arch specific
> code (hash__create_section_mapping()).
> 
> The HPT resizing calls are in the "inner" arch specific section,
> whereas it's only the outer arch section that has the information to
> batch properly.  The mutex and 'shrinking' parameter in Leonardo's
> code are all about conveying information from the outer to inner
> section.
> 
> Now, I think the reason I had the resize calls in the inner section
> was to accomodate the notion that a) pHyp might support resizing in
> future, and it could come in through a different path with its drmgr
> thingy and/or b) bare metal hash architectures might want to
> implement
> hash resizing, and this would make at least part of the path common.
> 
> Given the decreasing relevance of hash MMUs, I think we can now
> safely
> say neither of these is ever going to happen.
> 
> Therefore, we can simplify things by moving the HPT resize calls into
> the pseries LMB code, instead of create/remove_section_mapping.  Then
> to do batching without extra complications we just need this logic
> for
> all resizes (both add and remove):
> 
>         let new_hpt_order = expected HPT size for new mem size;
> 
>         if (new_hpt_order > current_hpt_order)
>                 resize to new_hpt_order
> 
>         add/remove memory
> 
>         if (new_hpt_order < current_hpt_order - 1)
>                 resize to new_hpt_order
> 
> 


Ok, that really does seem to simplify a lot the batching.

Question:
by LMB code, you mean dlpar_memory_{add,remove}_by_* ?
(dealing only with dlpar_{add,remove}_lmb() would not be enough to deal
with batching)

In my 3/3 repĺy I sent you some other examples of functions that
currently end up calling resize_hpt_for_hotplug() without comming from 
hotplug-memory.c. Is that ok that they do not call it anymore?


Best regards,
Leonardo Bras


^ permalink raw reply

* Re: [PATCH v1 04/12] mm/memory_hotplug: remove nid parameter from arch_remove_memory()
From: Heiko Carstens @ 2021-06-09  5:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michel Lespinasse, Kefeng Wang, Rich Felker, linux-ia64, Wei Yang,
	Michael S. Tsirkin, Peter Zijlstra, Catalin Marinas, Jason Wang,
	Dave Hansen, virtualization, linux-mm, Paul Mackerras,
	H. Peter Anvin, Will Deacon, Thomas Gleixner, linux-s390,
	Laurent Dufour, Jia He, Yoshinori Sato, Christian Borntraeger,
	linux-sh, x86, Ard Biesheuvel, linux-acpi, Ingo Molnar,
	linux-arm-kernel, Len Brown, Thiago Jung Bauermann,
	Pavel Tatashin, Vasily Gorbik, Anshuman Khandual, Nicholas Piggin,
	Borislav Petkov, Sergei Trofimovich, Andy Lutomirski,
	Dan Williams, Michal Hocko, Andrew Morton, Vlastimil Babka,
	Oscar Salvador, Christophe Leroy, Pankaj Gupta, Baoquan He,
	Pierre Morel, Rafael J. Wysocki, linux-kernel, Hui Zhu,
	Aneesh Kumar K.V, Joe Perches, Vitaly Kuznetsov, linuxppc-dev,
	Marek Kedzierski, Mike Rapoport
In-Reply-To: <20210607195430.48228-5-david@redhat.com>

On Mon, Jun 07, 2021 at 09:54:22PM +0200, David Hildenbrand wrote:
> The parameter is unused, let's remove it.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  arch/arm64/mm/mmu.c            | 3 +--
>  arch/ia64/mm/init.c            | 3 +--
>  arch/powerpc/mm/mem.c          | 3 +--
>  arch/s390/mm/init.c            | 3 +--
>  arch/sh/mm/init.c              | 3 +--
>  arch/x86/mm/init_32.c          | 3 +--
>  arch/x86/mm/init_64.c          | 3 +--
>  include/linux/memory_hotplug.h | 3 +--
>  mm/memory_hotplug.c            | 4 ++--
>  mm/memremap.c                  | 5 +----
>  10 files changed, 11 insertions(+), 22 deletions(-)

For s390:
Acked-by: Heiko Carstens <hca@linux.ibm.com>

^ permalink raw reply

* [PATCH] powerpc: Remove proc_trap()
From: Christophe Leroy @ 2021-06-09  5:52 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel

proc_trap() has never been used, remove it.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/reg.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 7c81d3e563b2..3bb01a8779c9 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -1435,8 +1435,6 @@ static inline void mtsr(u32 val, u32 idx)
 }
 #endif
 
-#define proc_trap()	asm volatile("trap")
-
 extern unsigned long current_stack_frame(void);
 
 register unsigned long current_stack_pointer asm("r1");
-- 
2.25.0


^ permalink raw reply related

* Re: [PATCH v2 3/3] powerpc/mm/hash: Avoid multiple HPT resize-downs on memory hotunplug
From: David Gibson @ 2021-06-09  6:08 UTC (permalink / raw)
  To: Leonardo Brás
  Cc: Nathan Lynch, Aneesh Kumar K.V, David Hildenbrand, Scott Cheloha,
	linux-kernel, Nicholas Piggin, Paul Mackerras, Sandipan Das,
	Andrew Morton, Laurent Dufour, linuxppc-dev, Mike Rapoport
In-Reply-To: <fa7a0e981a067445beb1ae01d53db932990717b7.camel@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4851 bytes --]

On Wed, Jun 09, 2021 at 02:30:36AM -0300, Leonardo Brás wrote:
> On Mon, 2021-06-07 at 15:20 +1000, David Gibson wrote:
> > On Fri, Apr 30, 2021 at 11:36:10AM -0300, Leonardo Bras wrote:
> > > During memory hotunplug, after each LMB is removed, the HPT may be
> > > resized-down if it would map a max of 4 times the current amount of
> > > memory.
> > > (2 shifts, due to introduced histeresis)
> > > 
> > > It usually is not an issue, but it can take a lot of time if HPT
> > > resizing-down fails. This happens  because resize-down failures
> > > usually repeat at each LMB removal, until there are no more bolted
> > > entries
> > > conflict, which can take a while to happen.
> > > 
> > > This can be solved by doing a single HPT resize at the end of
> > > memory
> > > hotunplug, after all requested entries are removed.
> > > 
> > > To make this happen, it's necessary to temporarily disable all HPT
> > > resize-downs before hotunplug, re-enable them after hotunplug ends,
> > > and then resize-down HPT to the current memory size.
> > > 
> > > As an example, hotunplugging 256GB from a 385GB guest took 621s
> > > without
> > > this patch, and 100s after applied.
> > > 
> > > Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
> > 
> > Hrm.  This looks correct, but it seems overly complicated.
> > 
> > AFAICT, the resize calls that this adds should in practice be the
> > *only* times we call resize, all the calls from the lower level code
> > should be suppressed. 
> 
> That's correct.
> 
> >  In which case can't we just remove those calls
> > entirely, and not deal with the clunky locking and exclusion here.
> > That should also remove the need for the 'shrinking' parameter in
> > 1/3.
> 
> 
> If I get your suggestion correctly, you suggest something like:
> 1 - Never calling resize_hpt_for_hotplug() in
> hash__remove_section_mapping(), thus not needing the srinking
> parameter.
> 2 - Functions in hotplug-memory.c that call dlpar_remove_lmb() would in
> fact call another function to do the batch resize_hpt_for_hotplug() for
> them

Basically, yes.

> If so, that assumes that no other function that currently calls
> resize_hpt_for_hotplug() under another path, or if they do, it does not
> need to actually resize the HPT.
> 
> Is the above correct?
> 
> There are some examples of functions that currently call
> resize_hpt_for_hotplug() by another path:
> 
> add_memory_driver_managed
> 	virtio_mem_add_memory
> 	dev_dax_kmem_probe

Oh... virtio-mem.  I didn't think of that.


> reserve_additional_memory
> 	balloon_process
> 	add_ballooned_pages

AFAICT this comes from drivers/xen, and Xen has never been a thing on
POWER.

> __add_memory
> 	probe_store

So this is a sysfs triggered memory add.  If the user is doing this
manually, then I think it's reasonable for them to manually manage the
HPT size as well, which they can do through debugfs.  I think it might
also be used my drmgr under pHyp, but pHyp doesn't support HPT
resizing.

> __remove_memory
> 	pseries_remove_memblock

Huh, this one comes through OF_RECONFIG_DETACH_NODE.  I don't really
know when those happen, but I strongly suspect it's only under pHyp
again.

> remove_memory
> 	dev_dax_kmem_remove
> 	virtio_mem_remove_memory

virtio-mem again.

> memunmap_pages
> 	pci_p2pdma_add_resource
> 	virtio_fs_setup_dax

And virtio-fs in dax mode.  Didn't think of that either.


Ugh, yeah, I'm used to the world where the platform provides the only
way of hotplugging memory, but virtio-mem does indeed provide another
one, and we could indeed need to manage the HPT size based on that.
Drat, so moving all the HPT resizing handling up into
pseries/hotplug-memory.c won't work.

I still think we can simplify the communication between the stuff in
the pseries hotplug code and the actual hash resizing.  In your draft
there are kind of 3 ways the information is conveyed: the mutex
suppresses HPT shrinks, pre-growing past what we need prevents HPT
grows, and the 'shrinking' flag handles some edge cases.

I suggest instead a single flag that will suppress all the current
resizes.  Not sure it technically has to be an atomic mutex, but
that's probably the obvious safe choice.  Then have a "resize up to
target" and "resize down to target" that ignore that suppression and
are no-ops if the target is in the other direction.
Then you should be able to make the path for pseries hotplugs be:

	suppress other resizes

	resize up to target

	do the actual adds or removes

	resize down to target

	unsuppress other resizes


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH] powerpc: Move update_power8_hid0() into its only user
From: Christophe Leroy @ 2021-06-09  6:10 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel

update_power8_hid0() is used only by powernv platform subcore.c

Move it there.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/reg.h           | 10 ----------
 arch/powerpc/platforms/powernv/subcore.c | 10 ++++++++++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 3bb01a8779c9..c5a527489ba5 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -1445,16 +1445,6 @@ extern void scom970_write(unsigned int address, unsigned long value);
 struct pt_regs;
 
 extern void ppc_save_regs(struct pt_regs *regs);
-
-static inline void update_power8_hid0(unsigned long hid0)
-{
-	/*
-	 *  The HID0 update on Power8 should at the very least be
-	 *  preceded by a SYNC instruction followed by an ISYNC
-	 *  instruction
-	 */
-	asm volatile("sync; mtspr %0,%1; isync":: "i"(SPRN_HID0), "r"(hid0));
-}
 #endif /* __ASSEMBLY__ */
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_REG_H */
diff --git a/arch/powerpc/platforms/powernv/subcore.c b/arch/powerpc/platforms/powernv/subcore.c
index 73207b53dc2b..4fe0594c3f4d 100644
--- a/arch/powerpc/platforms/powernv/subcore.c
+++ b/arch/powerpc/platforms/powernv/subcore.c
@@ -169,6 +169,16 @@ static void update_hid_in_slw(u64 hid0)
 	}
 }
 
+static void update_power8_hid0(unsigned long hid0)
+{
+	/*
+	 *  The HID0 update on Power8 should at the very least be
+	 *  preceded by a SYNC instruction followed by an ISYNC
+	 *  instruction
+	 */
+	asm volatile("sync; mtspr %0,%1; isync":: "i"(SPRN_HID0), "r"(hid0));
+}
+
 static void unsplit_core(void)
 {
 	u64 hid0, mask;
-- 
2.25.0


^ permalink raw reply related

* Re: [PATCH v3 8/9] mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
From: Mike Rapoport @ 2021-06-09  6:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-m68k, linux-ia64, linux-sh, linux-mips, linux-mm,
	sparclinux, linux-riscv, linux-arch, linux-s390, Jonathan Corbet,
	linux-doc, Mike Rapoport, Geert Uytterhoeven, Matt Turner,
	linux-snps-arc, linux-xtensa, Arnd Bergmann, Ivan Kokshaysky,
	linux-arm-kernel, Richard Henderson, Vineet Gupta, kexec,
	linux-kernel, linux-alpha, linuxppc-dev
In-Reply-To: <20210608172544.d9bf17549565d866fbb18451@linux-foundation.org>

On Tue, Jun 08, 2021 at 05:25:44PM -0700, Andrew Morton wrote:
> On Tue,  8 Jun 2021 12:13:15 +0300 Mike Rapoport <rppt@kernel.org> wrote:
> 
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > After removal of DISCINTIGMEM the NEED_MULTIPLE_NODES and NUMA
> > configuration options are equivalent.
> > 
> > Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.
> > 
> > Done with
> > 
> > 	$ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
> > 		$(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
> > 	$ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
> > 		$(git grep -wl NEED_MULTIPLE_NODES)
> > 
> > with manual tweaks afterwards.
> > 
> > ...
> >
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -987,7 +987,7 @@ extern int movable_zone;
> >  #ifdef CONFIG_HIGHMEM
> >  static inline int zone_movable_is_highmem(void)
> >  {
> > -#ifdef CONFIG_NEED_MULTIPLE_NODES
> > +#ifdef CONFIG_NUMA
> >  	return movable_zone == ZONE_HIGHMEM;
> >  #else
> >  	return (ZONE_MOVABLE - 1) == ZONE_HIGHMEM;
> 
> I dropped this hunk - your "mm/mmzone.h: simplify is_highmem_idx()"
> removed zone_movable_is_highmem().  

Ah, right.
Thanks!

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH v1 1/1] powerpc/prom_init: Move custom isspace() to its own namespace
From: Michael Ellerman @ 2021-06-09  7:00 UTC (permalink / raw)
  To: Andy Shevchenko, linuxppc-dev, linux-kernel
  Cc: Paul Mackerras, kernel test robot
In-Reply-To: <YL5Tm7jZaaQ0POH5@smile.fi.intel.com>

Andy Shevchenko <andriy.shevchenko@linux.intel.com> writes:
> On Mon, May 10, 2021 at 05:49:25PM +0300, Andy Shevchenko wrote:
>> If by some reason any of the headers will include ctype.h
>> we will have a name collision. Avoid this by moving isspace()
>> to the dedicate namespace.
>> 
>> First appearance of the code is in the commit cf68787b68a2
>> ("powerpc/prom_init: Evaluate mem kernel parameter for early allocation").
>
> Any comments on this?

Looks fine. Thanks.

I just missed it because it came in a bit early, I tend not to pick
things up until rc2.

I tweaked the formatting of prom_isxdigit() slightly now that we allow
100 column lines.

Have put it in my next-test now.

cheers

>> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
>> index 41ed7e33d897..6845cbbc0cd4 100644
>> --- a/arch/powerpc/kernel/prom_init.c
>> +++ b/arch/powerpc/kernel/prom_init.c
>> @@ -701,13 +701,13 @@ static int __init prom_setprop(phandle node, const char *nodename,
>>  }
>>  
>>  /* We can't use the standard versions because of relocation headaches. */
>> -#define isxdigit(c)	(('0' <= (c) && (c) <= '9') \
>> -			 || ('a' <= (c) && (c) <= 'f') \
>> -			 || ('A' <= (c) && (c) <= 'F'))
>> +#define prom_isxdigit(c)	(('0' <= (c) && (c) <= '9') \
>> +				 || ('a' <= (c) && (c) <= 'f') \
>> +				 || ('A' <= (c) && (c) <= 'F'))
>>  
>> -#define isdigit(c)	('0' <= (c) && (c) <= '9')
>> -#define islower(c)	('a' <= (c) && (c) <= 'z')
>> -#define toupper(c)	(islower(c) ? ((c) - 'a' + 'A') : (c))
>> +#define prom_isdigit(c)		('0' <= (c) && (c) <= '9')
>> +#define prom_islower(c)		('a' <= (c) && (c) <= 'z')
>> +#define prom_toupper(c)		(prom_islower(c) ? ((c) - 'a' + 'A') : (c))
>>  
>>  static unsigned long prom_strtoul(const char *cp, const char **endp)
>>  {
>> @@ -716,14 +716,14 @@ static unsigned long prom_strtoul(const char *cp, const char **endp)
>>  	if (*cp == '0') {
>>  		base = 8;
>>  		cp++;
>> -		if (toupper(*cp) == 'X') {
>> +		if (prom_toupper(*cp) == 'X') {
>>  			cp++;
>>  			base = 16;
>>  		}
>>  	}
>>  
>> -	while (isxdigit(*cp) &&
>> -	       (value = isdigit(*cp) ? *cp - '0' : toupper(*cp) - 'A' + 10) < base) {
>> +	while (prom_isxdigit(*cp) &&
>> +	       (value = prom_isdigit(*cp) ? *cp - '0' : prom_toupper(*cp) - 'A' + 10) < base) {
>>  		result = result * base + value;
>>  		cp++;
>>  	}
>> -- 
>> 2.30.2
>> 
>
> -- 
> With Best Regards,
> Andy Shevchenko

^ permalink raw reply

* Re: [PATCH v2 1/3] powerpc/mm/hash: Avoid resizing-down HPT on first memory hotplug
From: David Gibson @ 2021-06-09  6:59 UTC (permalink / raw)
  To: Leonardo Brás
  Cc: Nathan Lynch, Aneesh Kumar K.V, David Hildenbrand, Scott Cheloha,
	linux-kernel, Nicholas Piggin, Paul Mackerras, Sandipan Das,
	Andrew Morton, Laurent Dufour, linuxppc-dev, Mike Rapoport
In-Reply-To: <a69f18159b90c5ede95e6d3769e224b883cc974f.camel@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4892 bytes --]

On Wed, Jun 09, 2021 at 02:51:49AM -0300, Leonardo Brás wrote:
> On Wed, 2021-06-09 at 14:40 +1000, David Gibson wrote:
> > On Tue, Jun 08, 2021 at 09:52:10PM -0300, Leonardo Brás wrote:
> > > On Mon, 2021-06-07 at 15:02 +1000, David Gibson wrote:
> > > > On Fri, Apr 30, 2021 at 11:36:06AM -0300, Leonardo Bras wrote:
> > > > > Because hypervisors may need to create HPTs without knowing the
> > > > > guest
> > > > > page size, the smallest used page-size (4k) may be chosen,
> > > > > resulting in
> > > > > a HPT that is possibly bigger than needed.
> > > > > 
> > > > > On a guest with bigger page-sizes, the amount of entries for
> > > > > HTP
> > > > > may be
> > > > > too high, causing the guest to ask for a HPT resize-down on the
> > > > > first
> > > > > hotplug.
> > > > > 
> > > > > This becomes a problem when HPT resize-down fails, and causes
> > > > > the
> > > > > HPT resize to be performed on every LMB added, until HPT size
> > > > > is
> > > > > compatible to guest memory size, causing a major slowdown.
> > > > > 
> > > > > So, avoiding HPT resizing-down on hot-add significantly
> > > > > improves
> > > > > memory
> > > > > hotplug times.
> > > > > 
> > > > > As an example, hotplugging 256GB on a 129GB guest took 710s
> > > > > without
> > > > > this
> > > > > patch, and 21s after applied.
> > > > > 
> > > > > Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
> > > > 
> > > > Sorry it's taken me so long to look at these
> > > > 
> > > > I don't love the extra statefulness that the 'shrinking'
> > > > parameter
> > > > adds, but I can't see an elegant way to avoid it, so:
> > > > 
> > > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > > 
> > > np, thanks for reviewing!
> > 
> > Actually... I take that back.  With the subsequent patches my
> > discomfort with the complexity of implementing the batching grew.
> > 
> > I think I can see a simpler way - although it wasn't as clear as I
> > thought it might be, without some deep history on this feature.
> > 
> > What's going on here is pretty hard to follow, because it starts in
> > arch-specific code (arch/powerpc/platforms/pseries/hotplug-memory.c)
> > where it processes the add/remove requests, then goes into generic
> > code __add_memory() which eventually emerges back in arch specific
> > code (hash__create_section_mapping()).
> > 
> > The HPT resizing calls are in the "inner" arch specific section,
> > whereas it's only the outer arch section that has the information to
> > batch properly.  The mutex and 'shrinking' parameter in Leonardo's
> > code are all about conveying information from the outer to inner
> > section.
> > 
> > Now, I think the reason I had the resize calls in the inner section
> > was to accomodate the notion that a) pHyp might support resizing in
> > future, and it could come in through a different path with its drmgr
> > thingy and/or b) bare metal hash architectures might want to
> > implement
> > hash resizing, and this would make at least part of the path common.
> > 
> > Given the decreasing relevance of hash MMUs, I think we can now
> > safely
> > say neither of these is ever going to happen.
> > 
> > Therefore, we can simplify things by moving the HPT resize calls into
> > the pseries LMB code, instead of create/remove_section_mapping.  Then
> > to do batching without extra complications we just need this logic
> > for
> > all resizes (both add and remove):
> > 
> >         let new_hpt_order = expected HPT size for new mem size;
> > 
> >         if (new_hpt_order > current_hpt_order)
> >                 resize to new_hpt_order
> > 
> >         add/remove memory
> > 
> >         if (new_hpt_order < current_hpt_order - 1)
> >                 resize to new_hpt_order
> > 
> > 
> 
> 
> Ok, that really does seem to simplify a lot the batching.
> 
> Question:
> by LMB code, you mean dlpar_memory_{add,remove}_by_* ?
> (dealing only with dlpar_{add,remove}_lmb() would not be enough to deal
> with batching)

I was thinking of a two stage process.  First moving the resizes to
dlpar_{add,remote}_lmb() (not changing behaviour for the pseries dlpar
path), then implementing the batching by moving to the {add,remove}_by
functions.

But..

> In my 3/3 repĺy I sent you some other examples of functions that
> currently end up calling resize_hpt_for_hotplug() without comming from 
> hotplug-memory.c. Is that ok that they do not call it anymore?

..as I replied there, I was wrong about it being safe to move the
resizes all to the pseries dlpar code.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH] powerpc/bpf: Use bctrl for making function calls
From: Naveen N. Rao @ 2021-06-09  9:00 UTC (permalink / raw)
  To: linuxppc-dev, bpf

blrl corrupts the link stack. Instead use bctrl when making function
calls from BPF programs.

Reported-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/ppc-opcode.h |  1 +
 arch/powerpc/net/bpf_jit_comp32.c     |  4 ++--
 arch/powerpc/net/bpf_jit_comp64.c     | 12 ++++++------
 3 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index ac41776661e963..1abacb8417d562 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -451,6 +451,7 @@
 #define PPC_RAW_MTLR(r)			(0x7c0803a6 | ___PPC_RT(r))
 #define PPC_RAW_MFLR(t)			(PPC_INST_MFLR | ___PPC_RT(t))
 #define PPC_RAW_BCTR()			(PPC_INST_BCTR)
+#define PPC_RAW_BCTRL()			(PPC_INST_BCTRL)
 #define PPC_RAW_MTCTR(r)		(PPC_INST_MTCTR | ___PPC_RT(r))
 #define PPC_RAW_ADDI(d, a, i)		(PPC_INST_ADDI | ___PPC_RT(d) | ___PPC_RA(a) | IMM_L(i))
 #define PPC_RAW_LI(r, i)		PPC_RAW_ADDI(r, 0, i)
diff --git a/arch/powerpc/net/bpf_jit_comp32.c b/arch/powerpc/net/bpf_jit_comp32.c
index bbb16099e8c7fa..40ab50bea61c02 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c
@@ -195,8 +195,8 @@ void bpf_jit_emit_func_call_rel(u32 *image, struct codegen_context *ctx, u64 fun
 		/* Load function address into r0 */
 		EMIT(PPC_RAW_LIS(__REG_R0, IMM_H(func)));
 		EMIT(PPC_RAW_ORI(__REG_R0, __REG_R0, IMM_L(func)));
-		EMIT(PPC_RAW_MTLR(__REG_R0));
-		EMIT(PPC_RAW_BLRL());
+		EMIT(PPC_RAW_MTCTR(__REG_R0));
+		EMIT(PPC_RAW_BCTRL());
 	}
 }
 
diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
index 57a8c1153851a0..ae9a6532be6ad4 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -153,8 +153,8 @@ static void bpf_jit_emit_func_call_hlp(u32 *image, struct codegen_context *ctx,
 	PPC_LI64(b2p[TMP_REG_2], func);
 	/* Load actual entry point from function descriptor */
 	PPC_BPF_LL(b2p[TMP_REG_1], b2p[TMP_REG_2], 0);
-	/* ... and move it to LR */
-	EMIT(PPC_RAW_MTLR(b2p[TMP_REG_1]));
+	/* ... and move it to CTR */
+	EMIT(PPC_RAW_MTCTR(b2p[TMP_REG_1]));
 	/*
 	 * Load TOC from function descriptor at offset 8.
 	 * We can clobber r2 since we get called through a
@@ -165,9 +165,9 @@ static void bpf_jit_emit_func_call_hlp(u32 *image, struct codegen_context *ctx,
 #else
 	/* We can clobber r12 */
 	PPC_FUNC_ADDR(12, func);
-	EMIT(PPC_RAW_MTLR(12));
+	EMIT(PPC_RAW_MTCTR(12));
 #endif
-	EMIT(PPC_RAW_BLRL());
+	EMIT(PPC_RAW_BCTRL());
 }
 
 void bpf_jit_emit_func_call_rel(u32 *image, struct codegen_context *ctx, u64 func)
@@ -202,8 +202,8 @@ void bpf_jit_emit_func_call_rel(u32 *image, struct codegen_context *ctx, u64 fun
 	PPC_BPF_LL(12, 12, 0);
 #endif
 
-	EMIT(PPC_RAW_MTLR(12));
-	EMIT(PPC_RAW_BLRL());
+	EMIT(PPC_RAW_MTCTR(12));
+	EMIT(PPC_RAW_BCTRL());
 }
 
 static void bpf_jit_emit_tail_call(u32 *image, struct codegen_context *ctx, u32 out)

base-commit: 112f47a1484ddca610b70cbe4a99f0d0f1701daf
-- 
2.31.1


^ permalink raw reply related

* Re: [PATCH 04/16] bvec: add a bvec_kmap_local helper
From: Ilya Dryomov @ 2021-06-09  9:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Thomas Bogendoerfer, Mike Snitzer, Geoff Levand,
	linuxppc-dev, linux-mips, Dongsheng Yang, LKML, linux-block,
	dm-devel, Ceph Development, Ira Weiny
In-Reply-To: <20210608160603.1535935-5-hch@lst.de>

On Tue, Jun 8, 2021 at 6:06 PM Christoph Hellwig <hch@lst.de> wrote:
>
> Add a helper to call kmap_local_page on a bvec.  There is no need for
> an unmap helper given that kunmap_local accept any address in the mapped
> page.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  include/linux/bvec.h | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> index 883faf5f1523..d64d6c0ceb77 100644
> --- a/include/linux/bvec.h
> +++ b/include/linux/bvec.h
> @@ -7,6 +7,7 @@
>  #ifndef __LINUX_BVEC_H
>  #define __LINUX_BVEC_H
>
> +#include <linux/highmem.h>
>  #include <linux/bug.h>
>  #include <linux/errno.h>
>  #include <linux/limits.h>
> @@ -183,4 +184,9 @@ static inline void bvec_advance(const struct bio_vec *bvec,
>         }
>  }
>
> +static inline void *bvec_kmap_local(struct bio_vec *bvec)
> +{
> +       return kmap_local_page(bvec->bv_page) + bvec->bv_offset;
> +}
> +
>  #endif /* __LINUX_BVEC_H */

Might be useful to add the second sentence of the commit message as
a comment for bvec_kmap_local().  It could be expanded to mention the
single-page bvec caveat too.

Thanks,

                Ilya

^ permalink raw reply

* Re: [PATCH 07/16] rbd: use memzero_bvec
From: Ilya Dryomov @ 2021-06-09  9:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Thomas Bogendoerfer, Mike Snitzer, Geoff Levand,
	linuxppc-dev, linux-mips, Dongsheng Yang, LKML, linux-block,
	dm-devel, Ceph Development, Ira Weiny
In-Reply-To: <20210608160603.1535935-8-hch@lst.de>

On Tue, Jun 8, 2021 at 6:06 PM Christoph Hellwig <hch@lst.de> wrote:
>
> Use memzero_bvec instead of reimplementing it.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/block/rbd.c | 15 ++-------------
>  1 file changed, 2 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index bbb88eb009e0..eb243fc4d108 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -1219,24 +1219,13 @@ static void rbd_dev_mapping_clear(struct rbd_device *rbd_dev)
>         rbd_dev->mapping.size = 0;
>  }
>
> -static void zero_bvec(struct bio_vec *bv)
> -{
> -       void *buf;
> -       unsigned long flags;
> -
> -       buf = bvec_kmap_irq(bv, &flags);
> -       memset(buf, 0, bv->bv_len);
> -       flush_dcache_page(bv->bv_page);
> -       bvec_kunmap_irq(buf, &flags);
> -}
> -
>  static void zero_bios(struct ceph_bio_iter *bio_pos, u32 off, u32 bytes)
>  {
>         struct ceph_bio_iter it = *bio_pos;
>
>         ceph_bio_iter_advance(&it, off);
>         ceph_bio_iter_advance_step(&it, bytes, ({
> -               zero_bvec(&bv);
> +               memzero_bvec(&bv);
>         }));
>  }
>
> @@ -1246,7 +1235,7 @@ static void zero_bvecs(struct ceph_bvec_iter *bvec_pos, u32 off, u32 bytes)
>
>         ceph_bvec_iter_advance(&it, off);
>         ceph_bvec_iter_advance_step(&it, bytes, ({
> -               zero_bvec(&bv);
> +               memzero_bvec(&bv);
>         }));
>  }
>

Ira already brought up the fact that this conversion drops
flush_dcache_page() calls throughout.  Other than that:

Acked-by: Ilya Dryomov <idryomov@gmail.com>

Thanks,

                Ilya

^ permalink raw reply

* Re: [PATCH] powerpc/bpf: Use bctrl for making function calls
From: Christophe Leroy @ 2021-06-09  9:42 UTC (permalink / raw)
  To: Naveen N. Rao, linuxppc-dev, bpf
In-Reply-To: <20210609090024.1446800-1-naveen.n.rao@linux.vnet.ibm.com>



Le 09/06/2021 à 11:00, Naveen N. Rao a écrit :
> blrl corrupts the link stack. Instead use bctrl when making function
> calls from BPF programs.

What's the link stack ? Is it the PPC64 branch predictor stack ?

> 
> Reported-by: Anton Blanchard <anton@ozlabs.org>
> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
> ---
>   arch/powerpc/include/asm/ppc-opcode.h |  1 +
>   arch/powerpc/net/bpf_jit_comp32.c     |  4 ++--
>   arch/powerpc/net/bpf_jit_comp64.c     | 12 ++++++------
>   3 files changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
> index ac41776661e963..1abacb8417d562 100644
> --- a/arch/powerpc/include/asm/ppc-opcode.h
> +++ b/arch/powerpc/include/asm/ppc-opcode.h
> @@ -451,6 +451,7 @@
>   #define PPC_RAW_MTLR(r)			(0x7c0803a6 | ___PPC_RT(r))
>   #define PPC_RAW_MFLR(t)			(PPC_INST_MFLR | ___PPC_RT(t))
>   #define PPC_RAW_BCTR()			(PPC_INST_BCTR)
> +#define PPC_RAW_BCTRL()			(PPC_INST_BCTRL)

Can you use the numeric value instead of the PPC_INST_BCTRL, to avoid conflict with 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/4ca2bfdca2f47a293d05f61eb3c4e487ee170f1f.1621506159.git.christophe.leroy@csgroup.eu/

>   #define PPC_RAW_MTCTR(r)		(PPC_INST_MTCTR | ___PPC_RT(r))
>   #define PPC_RAW_ADDI(d, a, i)		(PPC_INST_ADDI | ___PPC_RT(d) | ___PPC_RA(a) | IMM_L(i))
>   #define PPC_RAW_LI(r, i)		PPC_RAW_ADDI(r, 0, i)
> diff --git a/arch/powerpc/net/bpf_jit_comp32.c b/arch/powerpc/net/bpf_jit_comp32.c
> index bbb16099e8c7fa..40ab50bea61c02 100644
> --- a/arch/powerpc/net/bpf_jit_comp32.c
> +++ b/arch/powerpc/net/bpf_jit_comp32.c
> @@ -195,8 +195,8 @@ void bpf_jit_emit_func_call_rel(u32 *image, struct codegen_context *ctx, u64 fun
>   		/* Load function address into r0 */
>   		EMIT(PPC_RAW_LIS(__REG_R0, IMM_H(func)));
>   		EMIT(PPC_RAW_ORI(__REG_R0, __REG_R0, IMM_L(func)));
> -		EMIT(PPC_RAW_MTLR(__REG_R0));
> -		EMIT(PPC_RAW_BLRL());
> +		EMIT(PPC_RAW_MTCTR(__REG_R0));
> +		EMIT(PPC_RAW_BCTRL());
>   	}
>   }
>   
> diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
> index 57a8c1153851a0..ae9a6532be6ad4 100644
> --- a/arch/powerpc/net/bpf_jit_comp64.c
> +++ b/arch/powerpc/net/bpf_jit_comp64.c
> @@ -153,8 +153,8 @@ static void bpf_jit_emit_func_call_hlp(u32 *image, struct codegen_context *ctx,
>   	PPC_LI64(b2p[TMP_REG_2], func);
>   	/* Load actual entry point from function descriptor */
>   	PPC_BPF_LL(b2p[TMP_REG_1], b2p[TMP_REG_2], 0);
> -	/* ... and move it to LR */
> -	EMIT(PPC_RAW_MTLR(b2p[TMP_REG_1]));
> +	/* ... and move it to CTR */
> +	EMIT(PPC_RAW_MTCTR(b2p[TMP_REG_1]));
>   	/*
>   	 * Load TOC from function descriptor at offset 8.
>   	 * We can clobber r2 since we get called through a
> @@ -165,9 +165,9 @@ static void bpf_jit_emit_func_call_hlp(u32 *image, struct codegen_context *ctx,
>   #else
>   	/* We can clobber r12 */
>   	PPC_FUNC_ADDR(12, func);
> -	EMIT(PPC_RAW_MTLR(12));
> +	EMIT(PPC_RAW_MTCTR(12));
>   #endif
> -	EMIT(PPC_RAW_BLRL());
> +	EMIT(PPC_RAW_BCTRL());
>   }
>   
>   void bpf_jit_emit_func_call_rel(u32 *image, struct codegen_context *ctx, u64 func)
> @@ -202,8 +202,8 @@ void bpf_jit_emit_func_call_rel(u32 *image, struct codegen_context *ctx, u64 fun
>   	PPC_BPF_LL(12, 12, 0);
>   #endif
>   
> -	EMIT(PPC_RAW_MTLR(12));
> -	EMIT(PPC_RAW_BLRL());
> +	EMIT(PPC_RAW_MTCTR(12));
> +	EMIT(PPC_RAW_BCTRL());
>   }
>   
>   static void bpf_jit_emit_tail_call(u32 *image, struct codegen_context *ctx, u32 out)
> 
> base-commit: 112f47a1484ddca610b70cbe4a99f0d0f1701daf
> 

^ permalink raw reply

* Re: [PATCH v1 05/12] mm/memory_hotplug: remove nid parameter from remove_memory() and friends
From: David Hildenbrand @ 2021-06-09 10:05 UTC (permalink / raw)
  To: Michael Ellerman, linux-kernel
  Cc: nvdimm, Wei Yang, Michael S. Tsirkin, Jason Wang, Dave Hansen,
	virtualization, linux-mm, Paul Mackerras, Laurent Dufour,
	Dave Jiang, Vishal Verma, linux-acpi, Len Brown, Nathan Lynch,
	Pavel Tatashin, Anshuman Khandual, Dan Williams, Michal Hocko,
	Vitaly Kuznetsov, Vlastimil Babka, Oscar Salvador, Pankaj Gupta,
	Scott Cheloha, Rafael J. Wysocki, Hui Zhu, Aneesh Kumar K.V,
	Andrew Morton, linuxppc-dev, Marek Kedzierski, Mike Rapoport
In-Reply-To: <7463b3ed-07d3-7157-629d-a85a3ff558d6@redhat.com>

On 08.06.21 13:18, David Hildenbrand wrote:
> On 08.06.21 13:11, Michael Ellerman wrote:
>> David Hildenbrand <david@redhat.com> writes:
>>> There is only a single user remaining. We can simply try to offline all
>>> online nodes - which is fast, because we usually span pages and can skip
>>> such nodes right away.
>>
>> That makes me slightly nervous, because our big powerpc boxes tend to
>> trip on these scaling issues before others.
>>
>> But the spanned pages check is just:
>>
>> void try_offline_node(int nid)
>> {
>> 	pg_data_t *pgdat = NODE_DATA(nid);
>>           ...
>> 	if (pgdat->node_spanned_pages)
>> 		return;
>>
>> So I guess that's pretty cheap, and it's only O(nodes), which should
>> never get that big.
> 
> Exactly. And if it does turn out to be a problem, we can walk all memory
> blocks before removing them, collecting the nid(s).
> 

I might just do the following on top:

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 61bff8f3bfb1..bbc26fdac364 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2176,7 +2176,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
  static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
  {
         int ret = !is_memblock_offlined(mem);
+       int *nid = arg;
  
+       *nid = mem->nid;
         if (unlikely(ret)) {
                 phys_addr_t beginpa, endpa;
  
@@ -2271,10 +2273,10 @@ EXPORT_SYMBOL(try_offline_node);
  
  static int __ref try_remove_memory(u64 start, u64 size)
  {
-       int rc = 0, nid;
         struct vmem_altmap mhp_altmap = {};
         struct vmem_altmap *altmap = NULL;
         unsigned long nr_vmemmap_pages;
+       int rc = 0, nid = NUMA_NO_NODE;
  
         BUG_ON(check_hotplug_memory_range(start, size));
  
@@ -2282,8 +2284,12 @@ static int __ref try_remove_memory(u64 start, u64 size)
          * All memory blocks must be offlined before removing memory.  Check
          * whether all memory blocks in question are offline and return error
          * if this is not the case.
+        *
+        * While at it, determine the nid. Note that if we'd have mixed nodes,
+        * we'd only try to offline the last determined one -- which is good
+        * enough for the cases we care about.
          */
-       rc = walk_memory_blocks(start, size, NULL, check_memblock_offlined_cb);
+       rc = walk_memory_blocks(start, size, &nid, check_memblock_offlined_cb);
         if (rc)
                 return rc;
  
@@ -2332,7 +2338,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
  
         release_mem_region_adjustable(start, size);
  
-       for_each_online_node(nid)
+       if (nid != NUMA_NO_NODE)
                 try_offline_node(nid);
  
         mem_hotplug_done();



-- 
Thanks,

David / dhildenb


^ permalink raw reply related

* Re: [RFC] powerpc/pseries: Interface to represent PAPR firmware attributes
From: Pratik Sampat @ 2021-06-09 10:08 UTC (permalink / raw)
  To: Fabiano Rosas, mpe, benh, paulus, linuxppc-dev, kvm-ppc,
	linux-kernel, pratik.r.sampat
In-Reply-To: <87wnr4uhs9.fsf@linux.ibm.com>

Hello,
Thank you for your comments on the design.

On 09/06/21 3:43 am, Fabiano Rosas wrote:
> "Pratik R. Sampat" <psampat@linux.ibm.com> writes:
>
> Hi, I have some general comments and questions, mostly trying to
> understand design of the hcall and use cases of the sysfs data:
>
>> Adds a generic interface to represent the energy and frequency related
>> PAPR attributes on the system using the new H_CALL
>> "H_GET_ENERGY_SCALE_INFO".
>>
>> H_GET_EM_PARMS H_CALL was previously responsible for exporting this
>> information in the lparcfg, however the H_GET_EM_PARMS H_CALL
>> will be deprecated P10 onwards.
>>
>> The H_GET_ENERGY_SCALE_INFO H_CALL is of the following call format:
>> hcall(
>>    uint64 H_GET_ENERGY_SCALE_INFO,  // Get energy scale info
>>    uint64 flags,           // Per the flag request
>>    uint64 firstAttributeId,// The attribute id
>>    uint64 bufferAddress,   // The logical address of the output buffer
> Instead of logical address, guest address or guest physical address
> would be more precise.

Yes, the name guest physical address makes more sense for this attribute.
The term logical address had me confused too when I first read it in the ACR,
however that isn't the case.

I'll change it to guest physical address here. Thanks for pointing out.

>
>>    uint64 bufferSize       // The size in bytes of the output buffer
>> );
>>
>> This H_CALL can query either all the attributes at once with
>> firstAttributeId = 0, flags = 0 as well as query only one attribute
>> at a time with firstAttributeId = id
>>
>> The output buffer consists of the following
>> 1. number of attributes              - 8 bytes
>> 2. array offset to the data location - 8 bytes
> The offset is from the start of the buffer, isn't it? So not the array
> offset.

Yes,the offset carries information that is to the start of the data buffer.

>> 3. version info                      - 1 byte
>> 4. A data array of size num attributes, which contains the following:
>>    a. attribute ID              - 8 bytes
>>    b. attribute value in number - 8 bytes
>>    c. attribute name in string  - 64 bytes
>>    d. attribute value in string - 64 bytes
> Is this new hypercall already present in the spec? These seem a bit
> underspecified to me.

Yes, it is present in the spec. I probably summarized a little more than needed
here and I could expand upon below.

The input buffer recives the following data:

1. “flags”:
	a. Bit 0: singleAttribute
		If set to 1, only return the single attribute matching firstAttributeId.
	b. Bits 1-63: Reserved
2. “firstAttributeId”: The first attribute to retrieve
3. “bufferAddress”: The logical real address of the start of the output buffer
4. “bufferSize”: The size in bytes of the output buffer
	

 From the document, the format of the output buffer is as follows:

Table 1 --> output buffer
================================================================================
| Field Name           | Byte   | Length   |  Description
|                      | Offset | in Bytes |
================================================================================
| NumberOf             |        |          | Number of Attributes in Buffer
| AttributesInBuffer   | 0x000  | 0x08     |
--------------------------------------------------------------------------------
| AttributeArrayOffset | 0x008  | 0x08     | Byte offset to start of Array
|                      |        |          | of Attributes
|                      |        |          |
--------------------------------------------------------------------------------
| OutputBufferData     |        |          | Version of the Header.
| HeaderVersion        | 0x010  | 0x01     | The header will be always
| AttributesInBuffer   |        |          | backward compatible, and changes
|                      |        |          | will not impact the Array of
|                      |        |          | attributes.
|                      |        |          | Current version = 0x01
--------------------------------------------------------------------------------
| ArrayOfAttributes    |        |          | The array will contain
|                      |        |          | "NumberOfAttributesInBuffer"
|                      |        |          | array elements not to exceed
|                      |        |          | the size of the buffer.
|                      |        |          | Layout of the array is
|                      |        |          | detailed in Table 2.
--------------------------------------------------------------------------------


Table 2 --> Array of attributes
================================================================================
| Field Name           | Byte   | Length   |  Description
|                      | Offset | in Bytes |
================================================================================
| 1st AttributeId      | 0x000  | 0x08     | The ID of the Attribute
--------------------------------------------------------------------------------
| 1st AttributeValue   | 0x008  | 0x08     | The numerical value of
|                      |        |          | the attribute
--------------------------------------------------------------------------------
| 1st AttributeString  | 0x010  | 0x40     | The ASCII string
| Description          |        |          | description of the
|                      |        |          | attribute, up to 63
|                      |        |          | characters plus a NULL
|                      |        |          | terminator.
--------------------------------------------------------------------------------
| 1st AttributeValue   | 0x050  | 0x40     | The ASCII string
| StringDescription    |        |          | description of the
|                      |        |          | attribute value, up to 63
|                      |        |          | characters plus a NULL
|                      |        |          | terminator. If this
|                      |        |          | contains only a NULL
|                      |        |          | terminator, then there is
|                      |        |          | no ASCII string
|                      |        |          | associated with AttributeValue.
--------------------------------------------------------------------------------
| ....                 |        |          |


>
>> The new H_CALL exports information in direct string value format, hence
>> a new interface has been introduced in /sys/firmware/papr to export
> Hm.. Maybe this should be something less generic than "papr"?

The interface naming was inspired from /sys/firmware/opal's naming convention.
We believed the name PAPR could serve as more generic name to be used by both
Linux running on PHYP and linux on KVM.

If you have something more concrete in mind, please let me know. I'm open to
suggestions.

>
>> this information to userspace in an extensible pass-through format.
>> The H_CALL returns the name, numeric value and string value. As string
>> values are in human readable format, therefore if the string value
>> exists then that is given precedence over the numeric value.
> So the hypervisor could simply not send the string representation? How
> will the userspace tell the difference since they are reading everything
> from a file?
>
> Overall I'd say we should give the data in a more structured way and let
> the user-facing tool do the formatting and presentation.

That's a valid concern, the design for this was inspired from hwmon's interface
to housing the sensor information.

One alternative to add more structure to this format could be to introduce:
attr_X_name, attr_X_num_val, attr_X_str_val

However, in some cases like min/max frequency the string value is empty. In
that case the file attr_X_str_val will also be empty.
Is that an acceptable format of having empty files that in some cases will
never be populated?
We also went ahead to confirm with the SPEC team that if a string value exists
in their buffer, that must be given precedence.

Another alternative format could to keep attr_X_name, attr_X_val intact but
change what X means. Currently X is just an iteratively increasing number. But
X can also serve as an ID which we get from H_CALL output buffer.

In this case, we should also include some versioning so that the tool now also
has cognizance of contents of each file.

>> The format of exposing the sysfs information is as follows:
>> /sys/firmware/papr/
>>    |-- attr_0_name
>>    |-- attr_0_val
>>    |-- attr_1_name
>>    |-- attr_1_val
>> ...
> How do we keep a stable interface with userspace? Say the hypervisor
> decides to add or remove attributes, change their order, string
> representation, etc? It will inform us via the version field, but that
> is lost when we output this to sysfs.
>
> I get that if the userspace just iterate over the contents of the
> directory then nothing breaks, but there is not much else it could do it
> seems.

Fair point, having the version exposed to the sysfs does seem crucial.

Currently in ppc-utils we iterate over all the information, however as you
rightly pointed out there may be other tools needing just specific information.
The alternative I suggested a few sentences above to include ID based attribute
naming and versioning maybe a more elegant way of solving this problem.

What are your thoughts on a design like this?

>> The energy information that is exported is useful for userspace tools
>> such as powerpc-utils. Currently these tools infer the
>> "power_mode_data" value in the lparcfg, which in turn is obtained from
>> the to be deprecated H_GET_EM_PARMS H_CALL.
>> On future platforms, such userspace utilities will have to look at the
>> data returned from the new H_CALL being populated in this new sysfs
>> interface and report this information directly without the need of
>> interpretation.
>>
>> Signed-off-by: Pratik R. Sampat <psampat@linux.ibm.com>

Thanks
Pratik


^ permalink raw reply

* Re: [PATCH 1/9] alpha: remove DISCONTIGMEM and NUMA
From: David Hildenbrand @ 2021-06-09 10:50 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton
  Cc: linux-ia64, linux-sh, linux-mips, linux-mm, sparclinux,
	linux-riscv, linux-arch, linux-s390, Jonathan Corbet, linux-doc,
	Mike Rapoport, Geert Uytterhoeven, Matt Turner, linux-snps-arc,
	linux-xtensa, Arnd Bergmann, linux-m68k, Ivan Kokshaysky,
	linux-arm-kernel, Richard Henderson, Vineet Gupta, kexec,
	linux-kernel, linux-alpha, linuxppc-dev
In-Reply-To: <20210602105348.13387-2-rppt@kernel.org>

On 02.06.21 12:53, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> NUMA is marked broken on alpha for more than 15 years and DISCONTIGMEM was
> replaced with SPARSEMEM in v5.11.
> 
> Remove both NUMA and DISCONTIGMEM support from alpha.
> 
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>   arch/alpha/Kconfig                |  22 ---
>   arch/alpha/include/asm/machvec.h  |   6 -
>   arch/alpha/include/asm/mmzone.h   | 100 --------------
>   arch/alpha/include/asm/pgtable.h  |   4 -
>   arch/alpha/include/asm/topology.h |  39 ------
>   arch/alpha/kernel/core_marvel.c   |  53 +------
>   arch/alpha/kernel/core_wildfire.c |  29 +---
>   arch/alpha/kernel/pci_iommu.c     |  29 ----
>   arch/alpha/kernel/proto.h         |   8 --
>   arch/alpha/kernel/setup.c         |  16 ---
>   arch/alpha/kernel/sys_marvel.c    |   5 -
>   arch/alpha/kernel/sys_wildfire.c  |   5 -
>   arch/alpha/mm/Makefile            |   2 -
>   arch/alpha/mm/init.c              |   3 -
>   arch/alpha/mm/numa.c              | 223 ------------------------------
>   15 files changed, 4 insertions(+), 540 deletions(-)
>   delete mode 100644 arch/alpha/include/asm/mmzone.h
>   delete mode 100644 arch/alpha/mm/numa.c
> 
> diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
> index 5998106faa60..8954216b9956 100644
> --- a/arch/alpha/Kconfig
> +++ b/arch/alpha/Kconfig
> @@ -549,29 +549,12 @@ config NR_CPUS
>   	  MARVEL support can handle a maximum of 32 CPUs, all the others
>   	  with working support have a maximum of 4 CPUs.
>   
> -config ARCH_DISCONTIGMEM_ENABLE
> -	bool "Discontiguous Memory Support"
> -	depends on BROKEN
> -	help
> -	  Say Y to support efficient handling of discontiguous physical memory,
> -	  for architectures which are either NUMA (Non-Uniform Memory Access)
> -	  or have huge holes in the physical address space for other reasons.
> -	  See <file:Documentation/vm/numa.rst> for more.
> -
>   config ARCH_SPARSEMEM_ENABLE
>   	bool "Sparse Memory Support"
>   	help
>   	  Say Y to support efficient handling of discontiguous physical memory,
>   	  for systems that have huge holes in the physical address space.
>   
> -config NUMA
> -	bool "NUMA Support (EXPERIMENTAL)"
> -	depends on DISCONTIGMEM && BROKEN
> -	help
> -	  Say Y to compile the kernel to support NUMA (Non-Uniform Memory
> -	  Access).  This option is for configuring high-end multiprocessor
> -	  server machines.  If in doubt, say N.
> -
>   config ALPHA_WTINT
>   	bool "Use WTINT" if ALPHA_SRM || ALPHA_GENERIC
>   	default y if ALPHA_QEMU
> @@ -596,11 +579,6 @@ config ALPHA_WTINT
>   
>   	  If unsure, say N.
>   
> -config NODES_SHIFT
> -	int
> -	default "7"
> -	depends on NEED_MULTIPLE_NODES
> -
>   # LARGE_VMALLOC is racy, if you *really* need it then fix it first
>   config ALPHA_LARGE_VMALLOC
>   	bool
> diff --git a/arch/alpha/include/asm/machvec.h b/arch/alpha/include/asm/machvec.h
> index a4e96e2bec74..e49fabce7b33 100644
> --- a/arch/alpha/include/asm/machvec.h
> +++ b/arch/alpha/include/asm/machvec.h
> @@ -99,12 +99,6 @@ struct alpha_machine_vector
>   
>   	const char *vector_name;
>   
> -	/* NUMA information */
> -	int (*pa_to_nid)(unsigned long);
> -	int (*cpuid_to_nid)(int);
> -	unsigned long (*node_mem_start)(int);
> -	unsigned long (*node_mem_size)(int);
> -
>   	/* System specific parameters.  */
>   	union {
>   	    struct {
> diff --git a/arch/alpha/include/asm/mmzone.h b/arch/alpha/include/asm/mmzone.h
> deleted file mode 100644
> index 86644604d977..000000000000
> --- a/arch/alpha/include/asm/mmzone.h
> +++ /dev/null
> @@ -1,100 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> -/*
> - * Written by Kanoj Sarcar (kanoj@sgi.com) Aug 99
> - * Adapted for the alpha wildfire architecture Jan 2001.
> - */
> -#ifndef _ASM_MMZONE_H_
> -#define _ASM_MMZONE_H_
> -
> -#ifdef CONFIG_DISCONTIGMEM
> -
> -#include <asm/smp.h>
> -
> -/*
> - * Following are macros that are specific to this numa platform.
> - */
> -
> -extern pg_data_t node_data[];
> -
> -#define alpha_pa_to_nid(pa)		\
> -        (alpha_mv.pa_to_nid 		\
> -	 ? alpha_mv.pa_to_nid(pa)	\
> -	 : (0))
> -#define node_mem_start(nid)		\
> -        (alpha_mv.node_mem_start 	\
> -	 ? alpha_mv.node_mem_start(nid) \
> -	 : (0UL))
> -#define node_mem_size(nid)		\
> -        (alpha_mv.node_mem_size 	\
> -	 ? alpha_mv.node_mem_size(nid) 	\
> -	 : ((nid) ? (0UL) : (~0UL)))
> -
> -#define pa_to_nid(pa)		alpha_pa_to_nid(pa)
> -#define NODE_DATA(nid)		(&node_data[(nid)])
> -
> -#define node_localnr(pfn, nid)	((pfn) - NODE_DATA(nid)->node_start_pfn)
> -
> -#if 1
> -#define PLAT_NODE_DATA_LOCALNR(p, n)	\
> -	(((p) >> PAGE_SHIFT) - PLAT_NODE_DATA(n)->gendata.node_start_pfn)
> -#else
> -static inline unsigned long
> -PLAT_NODE_DATA_LOCALNR(unsigned long p, int n)
> -{
> -	unsigned long temp;
> -	temp = p >> PAGE_SHIFT;
> -	return temp - PLAT_NODE_DATA(n)->gendata.node_start_pfn;
> -}
> -#endif
> -
> -/*
> - * Following are macros that each numa implementation must define.
> - */
> -
> -/*
> - * Given a kernel address, find the home node of the underlying memory.
> - */
> -#define kvaddr_to_nid(kaddr)	pa_to_nid(__pa(kaddr))
> -
> -/*
> - * Given a kaddr, LOCAL_BASE_ADDR finds the owning node of the memory
> - * and returns the kaddr corresponding to first physical page in the
> - * node's mem_map.
> - */
> -#define LOCAL_BASE_ADDR(kaddr)						  \
> -    ((unsigned long)__va(NODE_DATA(kvaddr_to_nid(kaddr))->node_start_pfn  \
> -			 << PAGE_SHIFT))
> -
> -/* XXX: FIXME -- nyc */
> -#define kern_addr_valid(kaddr)	(0)
> -
> -#define mk_pte(page, pgprot)						     \
> -({								 	     \
> -	pte_t pte;                                                           \
> -	unsigned long pfn;                                                   \
> -									     \
> -	pfn = page_to_pfn(page) << 32; \
> -	pte_val(pte) = pfn | pgprot_val(pgprot);			     \
> -									     \
> -	pte;								     \
> -})
> -
> -#define pte_page(x)							\
> -({									\
> -       	unsigned long kvirt;						\
> -	struct page * __xx;						\
> -									\
> -	kvirt = (unsigned long)__va(pte_val(x) >> (32-PAGE_SHIFT));	\
> -	__xx = virt_to_page(kvirt);					\
> -									\
> -	__xx;                                                           \
> -})
> -
> -#define pfn_to_nid(pfn)		pa_to_nid(((u64)(pfn) << PAGE_SHIFT))
> -#define pfn_valid(pfn)							\
> -	(((pfn) - node_start_pfn(pfn_to_nid(pfn))) <			\
> -	 node_spanned_pages(pfn_to_nid(pfn)))					\
> -
> -#endif /* CONFIG_DISCONTIGMEM */
> -
> -#endif /* _ASM_MMZONE_H_ */
> diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h
> index 8d856c62e22a..e1757b7cfe3d 100644
> --- a/arch/alpha/include/asm/pgtable.h
> +++ b/arch/alpha/include/asm/pgtable.h
> @@ -206,7 +206,6 @@ extern unsigned long __zero_page(void);
>   #define page_to_pa(page)	(page_to_pfn(page) << PAGE_SHIFT)
>   #define pte_pfn(pte)	(pte_val(pte) >> 32)
>   
> -#ifndef CONFIG_DISCONTIGMEM
>   #define pte_page(pte)	pfn_to_page(pte_pfn(pte))
>   #define mk_pte(page, pgprot)						\
>   ({									\
> @@ -215,7 +214,6 @@ extern unsigned long __zero_page(void);
>   	pte_val(pte) = (page_to_pfn(page) << 32) | pgprot_val(pgprot);	\
>   	pte;								\
>   })
> -#endif
>   
>   extern inline pte_t pfn_pte(unsigned long physpfn, pgprot_t pgprot)
>   { pte_t pte; pte_val(pte) = (PHYS_TWIDDLE(physpfn) << 32) | pgprot_val(pgprot); return pte; }
> @@ -330,9 +328,7 @@ extern inline pte_t mk_swap_pte(unsigned long type, unsigned long offset)
>   #define __pte_to_swp_entry(pte)	((swp_entry_t) { pte_val(pte) })
>   #define __swp_entry_to_pte(x)	((pte_t) { (x).val })
>   
> -#ifndef CONFIG_DISCONTIGMEM
>   #define kern_addr_valid(addr)	(1)
> -#endif
>   
>   #define pte_ERROR(e) \
>   	printk("%s:%d: bad pte %016lx.\n", __FILE__, __LINE__, pte_val(e))
> diff --git a/arch/alpha/include/asm/topology.h b/arch/alpha/include/asm/topology.h
> index 5a77a40567fa..7d393036aa8f 100644
> --- a/arch/alpha/include/asm/topology.h
> +++ b/arch/alpha/include/asm/topology.h
> @@ -7,45 +7,6 @@
>   #include <linux/numa.h>
>   #include <asm/machvec.h>
>   
> -#ifdef CONFIG_NUMA
> -static inline int cpu_to_node(int cpu)
> -{
> -	int node;
> -	
> -	if (!alpha_mv.cpuid_to_nid)
> -		return 0;
> -
> -	node = alpha_mv.cpuid_to_nid(cpu);
> -
> -#ifdef DEBUG_NUMA
> -	BUG_ON(node < 0);
> -#endif
> -
> -	return node;
> -}
> -
> -extern struct cpumask node_to_cpumask_map[];
> -/* FIXME: This is dumb, recalculating every time.  But simple. */
> -static const struct cpumask *cpumask_of_node(int node)
> -{
> -	int cpu;
> -
> -	if (node == NUMA_NO_NODE)
> -		return cpu_all_mask;
> -
> -	cpumask_clear(&node_to_cpumask_map[node]);
> -
> -	for_each_online_cpu(cpu) {
> -		if (cpu_to_node(cpu) == node)
> -			cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
> -	}
> -
> -	return &node_to_cpumask_map[node];
> -}
> -
> -#define cpumask_of_pcibus(bus)	(cpu_online_mask)
> -
> -#endif /* !CONFIG_NUMA */
>   # include <asm-generic/topology.h>
>   
>   #endif /* _ASM_ALPHA_TOPOLOGY_H */
> diff --git a/arch/alpha/kernel/core_marvel.c b/arch/alpha/kernel/core_marvel.c
> index 4485b77f8658..1efca79ac83c 100644
> --- a/arch/alpha/kernel/core_marvel.c
> +++ b/arch/alpha/kernel/core_marvel.c
> @@ -287,8 +287,7 @@ io7_init_hose(struct io7 *io7, int port)
>   	/*
>   	 * Set up window 0 for scatter-gather 8MB at 8MB.
>   	 */
> -	hose->sg_isa = iommu_arena_new_node(marvel_cpuid_to_nid(io7->pe),
> -					    hose, 0x00800000, 0x00800000, 0);
> +	hose->sg_isa = iommu_arena_new_node(0, hose, 0x00800000, 0x00800000, 0);
>   	hose->sg_isa->align_entry = 8;	/* cache line boundary */
>   	csrs->POx_WBASE[0].csr =
>   		hose->sg_isa->dma_base | wbase_m_ena | wbase_m_sg;
> @@ -305,8 +304,7 @@ io7_init_hose(struct io7 *io7, int port)
>   	/*
>   	 * Set up window 2 for scatter-gather (up-to) 1GB at 3GB.
>   	 */
> -	hose->sg_pci = iommu_arena_new_node(marvel_cpuid_to_nid(io7->pe),
> -					    hose, 0xc0000000, 0x40000000, 0);
> +	hose->sg_pci = iommu_arena_new_node(0, hose, 0xc0000000, 0x40000000, 0);
>   	hose->sg_pci->align_entry = 8;	/* cache line boundary */
>   	csrs->POx_WBASE[2].csr =
>   		hose->sg_pci->dma_base | wbase_m_ena | wbase_m_sg;
> @@ -843,53 +841,8 @@ EXPORT_SYMBOL(marvel_ioportmap);
>   EXPORT_SYMBOL(marvel_ioread8);
>   EXPORT_SYMBOL(marvel_iowrite8);
>   #endif
> -\f
> -/*
> - * NUMA Support
> - */
> -/**********
> - * FIXME - for now each cpu is a node by itself
> - *              -- no real support for striped mode
> - **********
> - */
> -int
> -marvel_pa_to_nid(unsigned long pa)
> -{
> -	int cpuid;
>   
> -	if ((pa >> 43) & 1) 	/* I/O */
> -		cpuid = (~(pa >> 35) & 0xff);
> -	else			/* mem */
> -		cpuid = ((pa >> 34) & 0x3) | ((pa >> (37 - 2)) & (0x1f << 2));
> -
> -	return marvel_cpuid_to_nid(cpuid);
> -}
> -
> -int
> -marvel_cpuid_to_nid(int cpuid)
> -{
> -	return cpuid;
> -}
> -
> -unsigned long
> -marvel_node_mem_start(int nid)
> -{
> -	unsigned long pa;
> -
> -	pa = (nid & 0x3) | ((nid & (0x1f << 2)) << 1);
> -	pa <<= 34;
> -
> -	return pa;
> -}
> -
> -unsigned long
> -marvel_node_mem_size(int nid)
> -{
> -	return 16UL * 1024 * 1024 * 1024; /* 16GB */
> -}
> -
> -\f
> -/*
> +/*
>    * AGP GART Support.
>    */
>   #include <linux/agp_backend.h>
> diff --git a/arch/alpha/kernel/core_wildfire.c b/arch/alpha/kernel/core_wildfire.c
> index e8d3b033018d..3a804b67f9da 100644
> --- a/arch/alpha/kernel/core_wildfire.c
> +++ b/arch/alpha/kernel/core_wildfire.c
> @@ -434,39 +434,12 @@ wildfire_write_config(struct pci_bus *bus, unsigned int devfn, int where,
>   	return PCIBIOS_SUCCESSFUL;
>   }
>   
> -struct pci_ops wildfire_pci_ops =
> +struct pci_ops wildfire_pci_ops =
>   {
>   	.read =		wildfire_read_config,
>   	.write =	wildfire_write_config,
>   };
>   
> -\f
> -/*
> - * NUMA Support
> - */
> -int wildfire_pa_to_nid(unsigned long pa)
> -{
> -	return pa >> 36;
> -}
> -
> -int wildfire_cpuid_to_nid(int cpuid)
> -{
> -	/* assume 4 CPUs per node */
> -	return cpuid >> 2;
> -}
> -
> -unsigned long wildfire_node_mem_start(int nid)
> -{
> -	/* 64GB per node */
> -	return (unsigned long)nid * (64UL * 1024 * 1024 * 1024);
> -}
> -
> -unsigned long wildfire_node_mem_size(int nid)
> -{
> -	/* 64GB per node */
> -	return 64UL * 1024 * 1024 * 1024;
> -}
> -
>   #if DEBUG_DUMP_REGS
>   
>   static void __init
> diff --git a/arch/alpha/kernel/pci_iommu.c b/arch/alpha/kernel/pci_iommu.c
> index d84b19aa8e9d..35d7b3096d6e 100644
> --- a/arch/alpha/kernel/pci_iommu.c
> +++ b/arch/alpha/kernel/pci_iommu.c
> @@ -71,33 +71,6 @@ iommu_arena_new_node(int nid, struct pci_controller *hose, dma_addr_t base,
>   	if (align < mem_size)
>   		align = mem_size;
>   
> -
> -#ifdef CONFIG_DISCONTIGMEM
> -
> -	arena = memblock_alloc_node(sizeof(*arena), align, nid);
> -	if (!NODE_DATA(nid) || !arena) {
> -		printk("%s: couldn't allocate arena from node %d\n"
> -		       "    falling back to system-wide allocation\n",
> -		       __func__, nid);
> -		arena = memblock_alloc(sizeof(*arena), SMP_CACHE_BYTES);
> -		if (!arena)
> -			panic("%s: Failed to allocate %zu bytes\n", __func__,
> -			      sizeof(*arena));
> -	}
> -
> -	arena->ptes = memblock_alloc_node(sizeof(*arena), align, nid);
> -	if (!NODE_DATA(nid) || !arena->ptes) {
> -		printk("%s: couldn't allocate arena ptes from node %d\n"
> -		       "    falling back to system-wide allocation\n",
> -		       __func__, nid);
> -		arena->ptes = memblock_alloc(mem_size, align);
> -		if (!arena->ptes)
> -			panic("%s: Failed to allocate %lu bytes align=0x%lx\n",
> -			      __func__, mem_size, align);
> -	}
> -
> -#else /* CONFIG_DISCONTIGMEM */
> -
>   	arena = memblock_alloc(sizeof(*arena), SMP_CACHE_BYTES);
>   	if (!arena)
>   		panic("%s: Failed to allocate %zu bytes\n", __func__,
> @@ -107,8 +80,6 @@ iommu_arena_new_node(int nid, struct pci_controller *hose, dma_addr_t base,
>   		panic("%s: Failed to allocate %lu bytes align=0x%lx\n",
>   		      __func__, mem_size, align);
>   
> -#endif /* CONFIG_DISCONTIGMEM */
> -
>   	spin_lock_init(&arena->lock);
>   	arena->hose = hose;
>   	arena->dma_base = base;
> diff --git a/arch/alpha/kernel/proto.h b/arch/alpha/kernel/proto.h
> index 701a05090141..5816a31c1b38 100644
> --- a/arch/alpha/kernel/proto.h
> +++ b/arch/alpha/kernel/proto.h
> @@ -49,10 +49,6 @@ extern void marvel_init_arch(void);
>   extern void marvel_kill_arch(int);
>   extern void marvel_machine_check(unsigned long, unsigned long);
>   extern void marvel_pci_tbi(struct pci_controller *, dma_addr_t, dma_addr_t);
> -extern int marvel_pa_to_nid(unsigned long);
> -extern int marvel_cpuid_to_nid(int);
> -extern unsigned long marvel_node_mem_start(int);
> -extern unsigned long marvel_node_mem_size(int);
>   extern struct _alpha_agp_info *marvel_agp_info(void);
>   struct io7 *marvel_find_io7(int pe);
>   struct io7 *marvel_next_io7(struct io7 *prev);
> @@ -101,10 +97,6 @@ extern void wildfire_init_arch(void);
>   extern void wildfire_kill_arch(int);
>   extern void wildfire_machine_check(unsigned long vector, unsigned long la_ptr);
>   extern void wildfire_pci_tbi(struct pci_controller *, dma_addr_t, dma_addr_t);
> -extern int wildfire_pa_to_nid(unsigned long);
> -extern int wildfire_cpuid_to_nid(int);
> -extern unsigned long wildfire_node_mem_start(int);
> -extern unsigned long wildfire_node_mem_size(int);
>   
>   /* console.c */
>   #ifdef CONFIG_VGA_HOSE
> diff --git a/arch/alpha/kernel/setup.c b/arch/alpha/kernel/setup.c
> index 03dda3beb3bd..5f6858e9dc28 100644
> --- a/arch/alpha/kernel/setup.c
> +++ b/arch/alpha/kernel/setup.c
> @@ -79,11 +79,6 @@ int alpha_l3_cacheshape;
>   unsigned long alpha_verbose_mcheck = CONFIG_VERBOSE_MCHECK_ON;
>   #endif
>   
> -#ifdef CONFIG_NUMA
> -struct cpumask node_to_cpumask_map[MAX_NUMNODES] __read_mostly;
> -EXPORT_SYMBOL(node_to_cpumask_map);
> -#endif
> -
>   /* Which processor we booted from.  */
>   int boot_cpuid;
>   
> @@ -305,7 +300,6 @@ move_initrd(unsigned long mem_limit)
>   }
>   #endif
>   
> -#ifndef CONFIG_DISCONTIGMEM
>   static void __init
>   setup_memory(void *kernel_end)
>   {
> @@ -389,9 +383,6 @@ setup_memory(void *kernel_end)
>   	}
>   #endif /* CONFIG_BLK_DEV_INITRD */
>   }
> -#else
> -extern void setup_memory(void *);
> -#endif /* !CONFIG_DISCONTIGMEM */
>   
>   int __init
>   page_is_ram(unsigned long pfn)
> @@ -618,13 +609,6 @@ setup_arch(char **cmdline_p)
>   	       "VERBOSE_MCHECK "
>   #endif
>   
> -#ifdef CONFIG_DISCONTIGMEM
> -	       "DISCONTIGMEM "
> -#ifdef CONFIG_NUMA
> -	       "NUMA "
> -#endif
> -#endif
> -
>   #ifdef CONFIG_DEBUG_SPINLOCK
>   	       "DEBUG_SPINLOCK "
>   #endif
> diff --git a/arch/alpha/kernel/sys_marvel.c b/arch/alpha/kernel/sys_marvel.c
> index 83d6c53d6d4d..1f99b03effc2 100644
> --- a/arch/alpha/kernel/sys_marvel.c
> +++ b/arch/alpha/kernel/sys_marvel.c
> @@ -461,10 +461,5 @@ struct alpha_machine_vector marvel_ev7_mv __initmv = {
>   	.kill_arch		= marvel_kill_arch,
>   	.pci_map_irq		= marvel_map_irq,
>   	.pci_swizzle		= common_swizzle,
> -
> -	.pa_to_nid		= marvel_pa_to_nid,
> -	.cpuid_to_nid		= marvel_cpuid_to_nid,
> -	.node_mem_start		= marvel_node_mem_start,
> -	.node_mem_size		= marvel_node_mem_size,
>   };
>   ALIAS_MV(marvel_ev7)
> diff --git a/arch/alpha/kernel/sys_wildfire.c b/arch/alpha/kernel/sys_wildfire.c
> index 2c54d707142a..3cee05443f07 100644
> --- a/arch/alpha/kernel/sys_wildfire.c
> +++ b/arch/alpha/kernel/sys_wildfire.c
> @@ -337,10 +337,5 @@ struct alpha_machine_vector wildfire_mv __initmv = {
>   	.kill_arch		= wildfire_kill_arch,
>   	.pci_map_irq		= wildfire_map_irq,
>   	.pci_swizzle		= common_swizzle,
> -
> -	.pa_to_nid		= wildfire_pa_to_nid,
> -	.cpuid_to_nid		= wildfire_cpuid_to_nid,
> -	.node_mem_start		= wildfire_node_mem_start,
> -	.node_mem_size		= wildfire_node_mem_size,
>   };
>   ALIAS_MV(wildfire)
> diff --git a/arch/alpha/mm/Makefile b/arch/alpha/mm/Makefile
> index 08ac6612edad..bd770302eb82 100644
> --- a/arch/alpha/mm/Makefile
> +++ b/arch/alpha/mm/Makefile
> @@ -6,5 +6,3 @@
>   ccflags-y := -Werror
>   
>   obj-y	:= init.o fault.o
> -
> -obj-$(CONFIG_DISCONTIGMEM) += numa.o
> diff --git a/arch/alpha/mm/init.c b/arch/alpha/mm/init.c
> index a97650a618f1..f6114d03357c 100644
> --- a/arch/alpha/mm/init.c
> +++ b/arch/alpha/mm/init.c
> @@ -235,8 +235,6 @@ callback_init(void * kernel_end)
>   	return kernel_end;
>   }
>   
> -
> -#ifndef CONFIG_DISCONTIGMEM
>   /*
>    * paging_init() sets up the memory map.
>    */
> @@ -257,7 +255,6 @@ void __init paging_init(void)
>   	/* Initialize the kernel's ZERO_PGE. */
>   	memset((void *)ZERO_PGE, 0, PAGE_SIZE);
>   }
> -#endif /* CONFIG_DISCONTIGMEM */
>   
>   #if defined(CONFIG_ALPHA_GENERIC) || defined(CONFIG_ALPHA_SRM)
>   void
> diff --git a/arch/alpha/mm/numa.c b/arch/alpha/mm/numa.c
> deleted file mode 100644
> index 0636e254a22f..000000000000
> --- a/arch/alpha/mm/numa.c
> +++ /dev/null
> @@ -1,223 +0,0 @@
> -// SPDX-License-Identifier: GPL-2.0
> -/*
> - *  linux/arch/alpha/mm/numa.c
> - *
> - *  DISCONTIGMEM NUMA alpha support.
> - *
> - *  Copyright (C) 2001 Andrea Arcangeli <andrea@suse.de> SuSE
> - */
> -
> -#include <linux/types.h>
> -#include <linux/kernel.h>
> -#include <linux/mm.h>
> -#include <linux/memblock.h>
> -#include <linux/swap.h>
> -#include <linux/initrd.h>
> -#include <linux/pfn.h>
> -#include <linux/module.h>
> -
> -#include <asm/hwrpb.h>
> -#include <asm/sections.h>
> -
> -pg_data_t node_data[MAX_NUMNODES];
> -EXPORT_SYMBOL(node_data);
> -
> -#undef DEBUG_DISCONTIG
> -#ifdef DEBUG_DISCONTIG
> -#define DBGDCONT(args...) printk(args)
> -#else
> -#define DBGDCONT(args...)
> -#endif
> -
> -#define for_each_mem_cluster(memdesc, _cluster, i)		\
> -	for ((_cluster) = (memdesc)->cluster, (i) = 0;		\
> -	     (i) < (memdesc)->numclusters; (i)++, (_cluster)++)
> -
> -static void __init show_mem_layout(void)
> -{
> -	struct memclust_struct * cluster;
> -	struct memdesc_struct * memdesc;
> -	int i;
> -
> -	/* Find free clusters, and init and free the bootmem accordingly.  */
> -	memdesc = (struct memdesc_struct *)
> -	  (hwrpb->mddt_offset + (unsigned long) hwrpb);
> -
> -	printk("Raw memory layout:\n");
> -	for_each_mem_cluster(memdesc, cluster, i) {
> -		printk(" memcluster %2d, usage %1lx, start %8lu, end %8lu\n",
> -		       i, cluster->usage, cluster->start_pfn,
> -		       cluster->start_pfn + cluster->numpages);
> -	}
> -}
> -
> -static void __init
> -setup_memory_node(int nid, void *kernel_end)
> -{
> -	extern unsigned long mem_size_limit;
> -	struct memclust_struct * cluster;
> -	struct memdesc_struct * memdesc;
> -	unsigned long start_kernel_pfn, end_kernel_pfn;
> -	unsigned long start, end;
> -	unsigned long node_pfn_start, node_pfn_end;
> -	unsigned long node_min_pfn, node_max_pfn;
> -	int i;
> -	int show_init = 0;
> -
> -	/* Find the bounds of current node */
> -	node_pfn_start = (node_mem_start(nid)) >> PAGE_SHIFT;
> -	node_pfn_end = node_pfn_start + (node_mem_size(nid) >> PAGE_SHIFT);
> -	
> -	/* Find free clusters, and init and free the bootmem accordingly.  */
> -	memdesc = (struct memdesc_struct *)
> -	  (hwrpb->mddt_offset + (unsigned long) hwrpb);
> -
> -	/* find the bounds of this node (node_min_pfn/node_max_pfn) */
> -	node_min_pfn = ~0UL;
> -	node_max_pfn = 0UL;
> -	for_each_mem_cluster(memdesc, cluster, i) {
> -		/* Bit 0 is console/PALcode reserved.  Bit 1 is
> -		   non-volatile memory -- we might want to mark
> -		   this for later.  */
> -		if (cluster->usage & 3)
> -			continue;
> -
> -		start = cluster->start_pfn;
> -		end = start + cluster->numpages;
> -
> -		if (start >= node_pfn_end || end <= node_pfn_start)
> -			continue;
> -
> -		if (!show_init) {
> -			show_init = 1;
> -			printk("Initializing bootmem allocator on Node ID %d\n", nid);
> -		}
> -		printk(" memcluster %2d, usage %1lx, start %8lu, end %8lu\n",
> -		       i, cluster->usage, cluster->start_pfn,
> -		       cluster->start_pfn + cluster->numpages);
> -
> -		if (start < node_pfn_start)
> -			start = node_pfn_start;
> -		if (end > node_pfn_end)
> -			end = node_pfn_end;
> -
> -		if (start < node_min_pfn)
> -			node_min_pfn = start;
> -		if (end > node_max_pfn)
> -			node_max_pfn = end;
> -	}
> -
> -	if (mem_size_limit && node_max_pfn > mem_size_limit) {
> -		static int msg_shown = 0;
> -		if (!msg_shown) {
> -			msg_shown = 1;
> -			printk("setup: forcing memory size to %ldK (from %ldK).\n",
> -			       mem_size_limit << (PAGE_SHIFT - 10),
> -			       node_max_pfn    << (PAGE_SHIFT - 10));
> -		}
> -		node_max_pfn = mem_size_limit;
> -	}
> -
> -	if (node_min_pfn >= node_max_pfn)
> -		return;
> -
> -	/* Update global {min,max}_low_pfn from node information. */
> -	if (node_min_pfn < min_low_pfn)
> -		min_low_pfn = node_min_pfn;
> -	if (node_max_pfn > max_low_pfn)
> -		max_pfn = max_low_pfn = node_max_pfn;
> -
> -#if 0 /* we'll try this one again in a little while */
> -	/* Cute trick to make sure our local node data is on local memory */
> -	node_data[nid] = (pg_data_t *)(__va(node_min_pfn << PAGE_SHIFT));
> -#endif
> -	printk(" Detected node memory:   start %8lu, end %8lu\n",
> -	       node_min_pfn, node_max_pfn);
> -
> -	DBGDCONT(" DISCONTIG: node_data[%d]   is at 0x%p\n", nid, NODE_DATA(nid));
> -
> -	/* Find the bounds of kernel memory.  */
> -	start_kernel_pfn = PFN_DOWN(KERNEL_START_PHYS);
> -	end_kernel_pfn = PFN_UP(virt_to_phys(kernel_end));
> -
> -	if (!nid && (node_max_pfn < end_kernel_pfn || node_min_pfn > start_kernel_pfn))
> -		panic("kernel loaded out of ram");
> -
> -	memblock_add_node(PFN_PHYS(node_min_pfn),
> -			  (node_max_pfn - node_min_pfn) << PAGE_SHIFT, nid);
> -
> -	/* Zone start phys-addr must be 2^(MAX_ORDER-1) aligned.
> -	   Note that we round this down, not up - node memory
> -	   has much larger alignment than 8Mb, so it's safe. */
> -	node_min_pfn &= ~((1UL << (MAX_ORDER-1))-1);
> -
> -	NODE_DATA(nid)->node_start_pfn = node_min_pfn;
> -	NODE_DATA(nid)->node_present_pages = node_max_pfn - node_min_pfn;
> -
> -	node_set_online(nid);
> -}
> -
> -void __init
> -setup_memory(void *kernel_end)
> -{
> -	unsigned long kernel_size;
> -	int nid;
> -
> -	show_mem_layout();
> -
> -	nodes_clear(node_online_map);
> -
> -	min_low_pfn = ~0UL;
> -	max_low_pfn = 0UL;
> -	for (nid = 0; nid < MAX_NUMNODES; nid++)
> -		setup_memory_node(nid, kernel_end);
> -
> -	kernel_size = virt_to_phys(kernel_end) - KERNEL_START_PHYS;
> -	memblock_reserve(KERNEL_START_PHYS, kernel_size);
> -
> -#ifdef CONFIG_BLK_DEV_INITRD
> -	initrd_start = INITRD_START;
> -	if (initrd_start) {
> -		extern void *move_initrd(unsigned long);
> -
> -		initrd_end = initrd_start+INITRD_SIZE;
> -		printk("Initial ramdisk at: 0x%p (%lu bytes)\n",
> -		       (void *) initrd_start, INITRD_SIZE);
> -
> -		if ((void *)initrd_end > phys_to_virt(PFN_PHYS(max_low_pfn))) {
> -			if (!move_initrd(PFN_PHYS(max_low_pfn)))
> -				printk("initrd extends beyond end of memory "
> -				       "(0x%08lx > 0x%p)\ndisabling initrd\n",
> -				       initrd_end,
> -				       phys_to_virt(PFN_PHYS(max_low_pfn)));
> -		} else {
> -			nid = kvaddr_to_nid(initrd_start);
> -			memblock_reserve(virt_to_phys((void *)initrd_start),
> -					 INITRD_SIZE);
> -		}
> -	}
> -#endif /* CONFIG_BLK_DEV_INITRD */
> -}
> -
> -void __init paging_init(void)
> -{
> -	unsigned long   max_zone_pfn[MAX_NR_ZONES] = {0, };
> -	unsigned long	dma_local_pfn;
> -
> -	/*
> -	 * The old global MAX_DMA_ADDRESS per-arch API doesn't fit
> -	 * in the NUMA model, for now we convert it to a pfn and
> -	 * we interpret this pfn as a local per-node information.
> -	 * This issue isn't very important since none of these machines
> -	 * have legacy ISA slots anyways.
> -	 */
> -	dma_local_pfn = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
> -
> -	max_zone_pfn[ZONE_DMA] = dma_local_pfn;
> -	max_zone_pfn[ZONE_NORMAL] = max_pfn;
> -
> -	free_area_init(max_zone_pfn);
> -
> -	/* Initialize the kernel's ZERO_PGE. */
> -	memset((void *)ZERO_PGE, 0, PAGE_SIZE);
> -}
> 

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply

* Re: [PATCH 2/9] arc: update comment about HIGHMEM implementation
From: David Hildenbrand @ 2021-06-09 10:52 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton
  Cc: linux-ia64, linux-sh, linux-mips, linux-mm, sparclinux,
	linux-riscv, linux-arch, linux-s390, Jonathan Corbet, linux-doc,
	Mike Rapoport, Geert Uytterhoeven, Matt Turner, linux-snps-arc,
	linux-xtensa, Arnd Bergmann, linux-m68k, Ivan Kokshaysky,
	linux-arm-kernel, Richard Henderson, Vineet Gupta, kexec,
	linux-kernel, linux-alpha, linuxppc-dev
In-Reply-To: <20210602105348.13387-3-rppt@kernel.org>

On 02.06.21 12:53, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Arc does not use DISCONTIGMEM to implement high memory, update the comment
> describing how high memory works to reflect this.
> 
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>   arch/arc/mm/init.c | 13 +++++--------
>   1 file changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
> index e2ed355438c9..397a201adfe3 100644
> --- a/arch/arc/mm/init.c
> +++ b/arch/arc/mm/init.c
> @@ -139,16 +139,13 @@ void __init setup_arch_memory(void)
>   
>   #ifdef CONFIG_HIGHMEM
>   	/*
> -	 * Populate a new node with highmem
> -	 *
>   	 * On ARC (w/o PAE) HIGHMEM addresses are actually smaller (0 based)
> -	 * than addresses in normal ala low memory (0x8000_0000 based).
> +	 * than addresses in normal aka low memory (0x8000_0000 based).
>   	 * Even with PAE, the huge peripheral space hole would waste a lot of
> -	 * mem with single mem_map[]. This warrants a mem_map per region design.
> -	 * Thus HIGHMEM on ARC is imlemented with DISCONTIGMEM.
> -	 *
> -	 * DISCONTIGMEM in turns requires multiple nodes. node 0 above is
> -	 * populated with normal memory zone while node 1 only has highmem
> +	 * mem with single contiguous mem_map[].
> +	 * Thus when HIGHMEM on ARC is enabled the memory map corresponding
> +	 * to the hole is freed and ARC specific version of pfn_valid()
> +	 * handles the hole in the memory map.
>   	 */
>   #ifdef CONFIG_DISCONTIGMEM
>   	node_set_online(1);
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply

* Re: [PATCH 3/9] arc: remove support for DISCONTIGMEM
From: David Hildenbrand @ 2021-06-09 10:53 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton
  Cc: linux-ia64, linux-sh, linux-mips, linux-mm, sparclinux,
	linux-riscv, linux-arch, linux-s390, Jonathan Corbet, linux-doc,
	Mike Rapoport, Geert Uytterhoeven, Matt Turner, linux-snps-arc,
	linux-xtensa, Arnd Bergmann, linux-m68k, Ivan Kokshaysky,
	linux-arm-kernel, Richard Henderson, Vineet Gupta, kexec,
	linux-kernel, linux-alpha, linuxppc-dev
In-Reply-To: <20210602105348.13387-4-rppt@kernel.org>

On 02.06.21 12:53, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map
> in v5.11.
> 
> Remove the support for DISCONTIGMEM entirely.
> 
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox