LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [RFC PATCH 4/8] powerpc: add CPU field to struct thread_info
From: Kees Cook @ 2021-09-29 23:10 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Peter Zijlstra, Catalin Marinas, Paul Mackerras, linux-riscv,
	Will Deacon, Ard Biesheuvel, open list:S390, Vasily Gorbik,
	Russell King, Christian Borntraeger, Ingo Molnar, Albert Ou,
	Arnd Bergmann, Heiko Carstens, Keith Packard, Borislav Petkov,
	Andy Lutomirski, Paul Walmsley, Thomas Gleixner, Linux ARM,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Linux Kernel Mailing List, Palmer Dabbelt, Linus Torvalds
In-Reply-To: <878rzf0zmb.fsf@mpe.ellerman.id.au>

On Thu, Sep 30, 2021 at 08:46:04AM +1000, Michael Ellerman wrote:
> Ard Biesheuvel <ardb@kernel.org> writes:
> > On Tue, 28 Sept 2021 at 02:16, Michael Ellerman <mpe@ellerman.id.au> wrote:
> >>
> >> Michael Ellerman <mpe@ellerman.id.au> writes:
> >> > Ard Biesheuvel <ardb@kernel.org> writes:
> >> >> On Tue, 14 Sept 2021 at 14:11, Ard Biesheuvel <ardb@kernel.org> wrote:
> >> >>>
> >> >>> The CPU field will be moved back into thread_info even when
> >> >>> THREAD_INFO_IN_TASK is enabled, so add it back to powerpc's definition
> >> >>> of struct thread_info.
> >> >>>
> >> >>> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> >> >>
> >> >> Michael,
> >> >>
> >> >> Do you have any objections or issues with this patch or the subsequent
> >> >> ones cleaning up the task CPU kludge for ppc32? Christophe indicated
> >> >> that he was happy with it.
> >> >
> >> > No objections, it looks good to me, thanks for cleaning up that horror :)
> >> >
> >> > It didn't apply cleanly to master so I haven't tested it at all, if you can point me at a
> >> > git tree with the dependencies I'd be happy to run some tests over it.
> >>
> >> Actually I realised I can just drop the last patch.
> >>
> >> So that looks fine, passes my standard quick build & boot on qemu tests,
> >> and builds with/without stack protector enabled.
> >>
> >
> > Thanks.
> >
> > Do you have any opinion on how this series should be merged? Kees Cook
> > is willing to take them via his cross-arch tree, or you could carry
> > them if you prefer. Taking it via multiple trees at the same time is
> > going to be tricky, or take two cycles, with I'd prefer to avoid.
> 
> I don't really mind. If Kees is happy to take it then that's OK by me.
> 
> If Kees put the series in a topic branch based off rc2 then I could
> merge that, and avoid any conflicts.

If that helps, yeah, I can make a separate stable branch. Thanks!

-Kees

-- 
Kees Cook

^ permalink raw reply

* Re: [RFC PATCH 4/8] powerpc: add CPU field to struct thread_info
From: Michael Ellerman @ 2021-09-29 22:46 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Peter Zijlstra, Catalin Marinas, Paul Mackerras, linux-riscv,
	Will Deacon, open list:S390, Arnd Bergmann, Russell King,
	Christian Borntraeger, Ingo Molnar, Albert Ou, Kees Cook,
	Vasily Gorbik, Heiko Carstens, Keith Packard, Borislav Petkov,
	Andy Lutomirski, Paul Walmsley, Thomas Gleixner, Linux ARM,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Linux Kernel Mailing List, Palmer Dabbelt, Linus Torvalds
In-Reply-To: <CAMj1kXFXtbD3=L+QvCnwbyFr-qbWivZ0wRGT0N4LNxANPD8x4g@mail.gmail.com>

Ard Biesheuvel <ardb@kernel.org> writes:
> On Tue, 28 Sept 2021 at 02:16, Michael Ellerman <mpe@ellerman.id.au> wrote:
>>
>> Michael Ellerman <mpe@ellerman.id.au> writes:
>> > Ard Biesheuvel <ardb@kernel.org> writes:
>> >> On Tue, 14 Sept 2021 at 14:11, Ard Biesheuvel <ardb@kernel.org> wrote:
>> >>>
>> >>> The CPU field will be moved back into thread_info even when
>> >>> THREAD_INFO_IN_TASK is enabled, so add it back to powerpc's definition
>> >>> of struct thread_info.
>> >>>
>> >>> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>> >>
>> >> Michael,
>> >>
>> >> Do you have any objections or issues with this patch or the subsequent
>> >> ones cleaning up the task CPU kludge for ppc32? Christophe indicated
>> >> that he was happy with it.
>> >
>> > No objections, it looks good to me, thanks for cleaning up that horror :)
>> >
>> > It didn't apply cleanly to master so I haven't tested it at all, if you can point me at a
>> > git tree with the dependencies I'd be happy to run some tests over it.
>>
>> Actually I realised I can just drop the last patch.
>>
>> So that looks fine, passes my standard quick build & boot on qemu tests,
>> and builds with/without stack protector enabled.
>>
>
> Thanks.
>
> Do you have any opinion on how this series should be merged? Kees Cook
> is willing to take them via his cross-arch tree, or you could carry
> them if you prefer. Taking it via multiple trees at the same time is
> going to be tricky, or take two cycles, with I'd prefer to avoid.

I don't really mind. If Kees is happy to take it then that's OK by me.

If Kees put the series in a topic branch based off rc2 then I could
merge that, and avoid any conflicts.

cheers

^ permalink raw reply

* Re: [PATCH v5 07/11] PCI: Replace pci_dev::driver usage that gets the driver name
From: Ido Schimmel @ 2021-09-29 15:21 UTC (permalink / raw)
  To: Ido Schimmel, u.kleine-koenig
  Cc: linux-pci, Alexander Duyck, oss-drivers, Paul Mackerras,
	Christoph Hellwig, Herbert Xu, Rafał Miłecki,
	Jesse Brandeburg, Bjorn Helgaas, Jakub Kicinski, Yisen Zhuang,
	Vadym Kochan, Michael Buesch, Jiri Pirko, Salil Mehta, netdev,
	linux-wireless, linux-kernel, Taras Chornyi, Zhou Wang,
	linux-crypto, kernel, Simon Horman, Oliver O'Halloran,
	linuxppc-dev, David S. Miller
In-Reply-To: <YVR74+8Rw6XmTqDD@shredder>

On Wed, Sep 29, 2021 at 05:44:51PM +0300, Ido Schimmel wrote:
> On Wed, Sep 29, 2021 at 10:53:02AM +0200, Uwe Kleine-König wrote:
> > struct pci_dev::driver holds (apart from a constant offset) the same
> > data as struct pci_dev::dev->driver. With the goal to remove struct
> > pci_dev::driver to get rid of data duplication replace getting the
> > driver name by dev_driver_string() which implicitly makes use of struct
> > pci_dev::dev->driver.
> > 
> > Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
> 
> For mlxsw:
> 
> Reviewed-by: Ido Schimmel <idosch@nvidia.com>
> Tested-by: Ido Schimmel <idosch@nvidia.com>
> 
> Thanks

Actually, I found out that after loading and executing another kernel
(or the same one) via kexec I get this splat [1].

[1]
 BUG: unable to handle page fault for address: ffffffffffffffc8         
 #PF: supervisor read access in kernel mode                       
 #PF: error_code(0x0000) - not-present page
 PGD 6e40c067 P4D 6e40c067 PUD 6e40e067 PMD 0 
 Oops: 0000 [#1] SMP                                                    
 CPU: 0 PID: 786 Comm: kexec Not tainted 5.15.0-rc2-custom-45114-g6b0effa5a61f #112
 Hardware name: Mellanox Technologies Ltd. MSN3700/VMOD0005, BIOS 5.11 01/06/2019                                                                              
 RIP: 0010:pci_device_shutdown+0x16/0x40
 Code: 01 00 31 d2 4c 89 e7 89 c6 e8 36 ce 01 00 41 89 c5 eb bb 90 55 48 8d af 40 ff ff ff 53 48 8b 47 68 48 89 fb 48 83 f8 78 74 0e <48> 8b 40 c8 48 85 c0 74 
05 48 89 ef ff d0 80 3d 35 81 b7 01 00 74                                             
 RSP: 0018:ffff95fec0d37db8 EFLAGS: 00010297                            
 RAX: 0000000000000000 RBX: ffff8d70c0f1f0c0 RCX: 0000000000000004
 RDX: ffff8d7115a03a00 RSI: 0000000000000206 RDI: ffff8d70c0f1f0c0      
 RBP: ffff8d70c0f1f000 R08: 0000000000000002 R09: 0000000000000502
 R10: 0000000000000000 R11: 0000000000000006 R12: ffff8d70c0f1f0c0                                                                                             
 R13: ffff8d70c0f1f140 R14: 00000000fee1dead R15: 0000000000000000                                                                                             
 FS:  00007fd3089e0b80(0000) GS:ffff8d7237c00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033                     
 CR2: ffffffffffffffc8 CR3: 0000000155abb001 CR4: 00000000003706f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  device_shutdown+0x12e/0x180 
  kernel_kexec+0x52/0xb0
  __do_sys_reboot+0x1c0/0x210 
  do_syscall_64+0x35/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7fd308afd557
 Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 
c3 48 8b 15 f1 a8 0c 00 f7 d8 64 89 02 b8
 RSP: 002b:00007fff7d01e0a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
 RAX: ffffffffffffffda RBX: 00005606db11d380 RCX: 00007fd308afd557
 RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
 RBP: 0000000000000000 R08: 0000000000000007 R09: 00007fd308bc8a60
 R10: 0000000000000021 R11: 0000000000000246 R12: 0000000000000003
 R13: 00007fff7d01e1f0 R14: 00005606db11d8c0 R15: 00000000ffffffff
 Modules linked in:
 CR2: ffffffffffffffc8
 ---[ end trace 0cb0bc633a6fde3e ]---

Where:

(gdb) l *(pci_device_shutdown+0x16)
0xffffffff8156abf6 is in pci_device_shutdown (drivers/pci/pci-driver.c:496).
491             struct pci_dev *pci_dev = to_pci_dev(dev);
492             struct pci_driver *drv = to_pci_driver(pci_dev->dev.driver);
493
494             pm_runtime_resume(dev);
495
496             if (drv && drv->shutdown)
497                     drv->shutdown(pci_dev);
498
499             /*
500              * If this is a kexec reboot, turn off Bus Master bit on the

^ permalink raw reply

* Re: [PATCH v5 07/11] PCI: Replace pci_dev::driver usage that gets the driver name
From: Ido Schimmel @ 2021-09-29 14:44 UTC (permalink / raw)
  To: Uwe Kleine-König
  Cc: linux-pci, Alexander Duyck, oss-drivers, Paul Mackerras,
	Christoph Hellwig, Herbert Xu, Rafał Miłecki,
	Jesse Brandeburg, Bjorn Helgaas, Jakub Kicinski, Yisen Zhuang,
	Vadym Kochan, Michael Buesch, Jiri Pirko, Salil Mehta, netdev,
	linux-wireless, linux-kernel, Taras Chornyi, Zhou Wang,
	linux-crypto, kernel, Simon Horman, Oliver O'Halloran,
	linuxppc-dev, David S. Miller
In-Reply-To: <20210929085306.2203850-8-u.kleine-koenig@pengutronix.de>

On Wed, Sep 29, 2021 at 10:53:02AM +0200, Uwe Kleine-König wrote:
> struct pci_dev::driver holds (apart from a constant offset) the same
> data as struct pci_dev::dev->driver. With the goal to remove struct
> pci_dev::driver to get rid of data duplication replace getting the
> driver name by dev_driver_string() which implicitly makes use of struct
> pci_dev::dev->driver.
> 
> Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>

For mlxsw:

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>

Thanks

^ permalink raw reply

* Re: [PATCH 00/10] Add Apple M1 support to PASemi i2c driver
From: Wolfram Sang @ 2021-09-29 20:35 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Sven Peter, Hector Martin, Linux Kernel Mailing List, Linux I2C,
	Paul Mackerras, Linux ARM, Olof Johansson, Mohamed Mediouni,
	Mark Kettenis, linuxppc-dev, Alyssa Rosenzweig, Stan Skowronek
In-Reply-To: <CAK8P3a3Lt2QXk+aWLtXUXjjNhKJwNns6d9r=Yh5_aWETuvZTpQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 245 bytes --]


> This looks all very good to me, I had one very minor comment.
> 
> Whole series
> 
> Reviewed-by: Arnd Bergmann <arnd@arndb.de>

Thanks for the series and the review!

Same here, looks good to me and I only had one minor comment.


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 09/10] i2c: pasemi: Add Apple platform driver
From: Wolfram Sang @ 2021-09-29 20:33 UTC (permalink / raw)
  To: Sven Peter
  Cc: Arnd Bergmann, Hector Martin, linux-kernel, linux-i2c,
	Paul Mackerras, linux-arm-kernel, Olof Johansson,
	Mohamed Mediouni, Mark Kettenis, linuxppc-dev, Alyssa Rosenzweig,
	Stan Skowronek
In-Reply-To: <20210926095847.38261-10-sven@svenpeter.dev>

[-- Attachment #1: Type: text/plain, Size: 192 bytes --]


>  drivers/i2c/busses/i2c-pasemi-apple.c | 122 ++++++++++++++++++++++++++

Can't we name it 'i2c-pasemi-platform.c' instead? Makes more sense to me
because the other instance is named -pci.


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v5 10/11] PCI: Replace pci_dev::driver usage by pci_dev::dev.driver
From: Andrew Donnellan @ 2021-09-29 15:44 UTC (permalink / raw)
  To: Uwe Kleine-König
  Cc: Mark Rutland, Peter Zijlstra, H. Peter Anvin,
	Oliver O'Halloran, Jiri Olsa, Christoph Hellwig,
	Stefano Stabellini, Arnd Bergmann, Boris Ostrovsky, x86,
	Alexander Shishkin, Ingo Molnar, Bjorn Helgaas, linux-pci,
	xen-devel, Mathias Nyman, Konrad Rzeszutek Wilk,
	Arnaldo Carvalho de Melo, Borislav Petkov, Bjorn Helgaas,
	Namhyung Kim, Thomas Gleixner, Juergen Gross, Greg Kroah-Hartman,
	linux-usb, linux-perf-users, kernel, Frederic Barrat,
	Paul Mackerras, linuxppc-dev
In-Reply-To: <20210929134330.e5c57t7mtwu5iner@pengutronix.de>

On 29/9/21 11:43 pm, Uwe Kleine-König wrote:> I'm not a huge fan either, 
I used it to keep the control flow as is and
> without introducing several calls to to_pci_driver.
> 
> The whole code looks as follows:
> 
> 	list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
> 		struct pci_driver *afu_drv;
> 		if (afu_dev->dev.driver &&
> 		    (afu_drv = to_pci_driver(afu_dev->dev.driver))->err_handler &&
> 		    afu_drv->err_handler->resume)
> 			afu_drv->err_handler->resume(afu_dev);
> 	}
> 
> Without assignment in the if it could look as follows:
> 
> 	list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
> 		struct pci_driver *afu_drv;
> 
> 		if (!afu_dev->dev.driver)
> 			continue;
> 
> 		afu_drv = to_pci_driver(afu_dev->dev.driver));
> 
> 		if (afu_drv->err_handler && afu_drv->err_handler->resume)
> 			afu_drv->err_handler->resume(afu_dev);
> 	}
> 
> Fine for me.

This looks fine.

As an aside while writing my email I discovered the existence of 
container_of_safe(), a version of container_of() that handles the null 
and err ptr cases... if to_pci_driver() used that, the null check in the 
caller could be moved until after the to_pci_driver() call which would 
be neater.

But then, grep tells me that container_of_safe() is used precisely zero 
times in the entire tree. Interesting.

> (Sidenote: What happens if the device is unbound directly after the
> check for afu_dev->dev.driver? This is a problem the old code had, too
> (assuming it is a real problem, didn't check deeply).)

Looking at any of the cxl PCI error handling paths brings back 
nightmares from a few years ago... Fred: I wonder if we need to add a 
lock here?

-- 
Andrew Donnellan              OzLabs, ADL Canberra
ajd@linux.ibm.com             IBM Australia Limited

^ permalink raw reply

* [PATCH v1 6/6] x86: remove memory hotplug support on X86_32
From: David Hildenbrand @ 2021-09-29 14:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Michal Hocko, Michael S. Tsirkin, Peter Zijlstra, Jason Wang,
	Dave Hansen, x86, virtualization, linux-mm, Paul Mackerras,
	linux-kselftest, H. Peter Anvin, Shuah Khan, Alex Shi,
	Jonathan Corbet, David Hildenbrand, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, Oscar Salvador,
	Greg Kroah-Hartman, linux-doc, Rafael J. Wysocki, Andrew Morton,
	linuxppc-dev, Mike Rapoport
In-Reply-To: <20210929143600.49379-1-david@redhat.com>

CONFIG_MEMORY_HOTPLUG was marked BROKEN over one year and we just
restricted it to 64 bit. Let's remove the unused x86 32bit implementation
and simplify the Kconfig.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/Kconfig      |  6 +++---
 arch/x86/mm/init_32.c | 31 -------------------------------
 2 files changed, 3 insertions(+), 34 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ab83c22d274e..85f4762429f1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -62,7 +62,7 @@ config X86
 	select ARCH_32BIT_OFF_T			if X86_32
 	select ARCH_CLOCKSOURCE_INIT
 	select ARCH_ENABLE_HUGEPAGE_MIGRATION if X86_64 && HUGETLB_PAGE && MIGRATION
-	select ARCH_ENABLE_MEMORY_HOTPLUG if X86_64 || (X86_32 && HIGHMEM)
+	select ARCH_ENABLE_MEMORY_HOTPLUG if X86_64
 	select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG
 	select ARCH_ENABLE_SPLIT_PMD_PTLOCK if (PGTABLE_LEVELS > 2) && (X86_64 || X86_PAE)
 	select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE
@@ -1615,7 +1615,7 @@ config ARCH_SELECT_MEMORY_MODEL
 
 config ARCH_MEMORY_PROBE
 	bool "Enable sysfs memory/probe interface"
-	depends on X86_64 && MEMORY_HOTPLUG
+	depends on MEMORY_HOTPLUG
 	help
 	  This option enables a sysfs memory/probe interface for testing.
 	  See Documentation/admin-guide/mm/memory-hotplug.rst for more information.
@@ -2395,7 +2395,7 @@ endmenu
 
 config ARCH_HAS_ADD_PAGES
 	def_bool y
-	depends on X86_64 && ARCH_ENABLE_MEMORY_HOTPLUG
+	depends on ARCH_ENABLE_MEMORY_HOTPLUG
 
 config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
 	def_bool y
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index bd90b8fe81e4..5cd7ea6d645c 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -779,37 +779,6 @@ void __init mem_init(void)
 	test_wp_bit();
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size,
-		    struct mhp_params *params)
-{
-	unsigned long start_pfn = start >> PAGE_SHIFT;
-	unsigned long nr_pages = size >> PAGE_SHIFT;
-	int ret;
-
-	/*
-	 * The page tables were already mapped at boot so if the caller
-	 * requests a different mapping type then we must change all the
-	 * pages with __set_memory_prot().
-	 */
-	if (params->pgprot.pgprot != PAGE_KERNEL.pgprot) {
-		ret = __set_memory_prot(start, nr_pages, params->pgprot);
-		if (ret)
-			return ret;
-	}
-
-	return __add_pages(nid, start_pfn, nr_pages, params);
-}
-
-void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
-{
-	unsigned long start_pfn = start >> PAGE_SHIFT;
-	unsigned long nr_pages = size >> PAGE_SHIFT;
-
-	__remove_pages(start_pfn, nr_pages, altmap);
-}
-#endif
-
 int kernel_set_to_readonly __read_mostly;
 
 static void mark_nxdata_nx(void)
-- 
2.31.1


^ permalink raw reply related

* [PATCH v1 4/6] mm/memory_hotplug: remove HIGHMEM leftovers
From: David Hildenbrand @ 2021-09-29 14:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Michal Hocko, Michael S. Tsirkin, Peter Zijlstra, Jason Wang,
	Dave Hansen, x86, virtualization, linux-mm, Paul Mackerras,
	linux-kselftest, H. Peter Anvin, Shuah Khan, Alex Shi,
	Jonathan Corbet, David Hildenbrand, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, Oscar Salvador,
	Greg Kroah-Hartman, linux-doc, Rafael J. Wysocki, Andrew Morton,
	linuxppc-dev, Mike Rapoport
In-Reply-To: <20210929143600.49379-1-david@redhat.com>

We don't support CONFIG_MEMORY_HOTPLUG on 32 bit and consequently not
HIGHMEM. Let's remove any leftover code -- including the unused
"status_change_nid_high" field part of the memory notifier.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 Documentation/core-api/memory-hotplug.rst     |  3 --
 .../zh_CN/core-api/memory-hotplug.rst         |  4 ---
 include/linux/memory.h                        |  1 -
 mm/memory_hotplug.c                           | 36 ++-----------------
 4 files changed, 2 insertions(+), 42 deletions(-)

diff --git a/Documentation/core-api/memory-hotplug.rst b/Documentation/core-api/memory-hotplug.rst
index de7467e48067..682259ee633a 100644
--- a/Documentation/core-api/memory-hotplug.rst
+++ b/Documentation/core-api/memory-hotplug.rst
@@ -57,7 +57,6 @@ The third argument (arg) passes a pointer of struct memory_notify::
 		unsigned long start_pfn;
 		unsigned long nr_pages;
 		int status_change_nid_normal;
-		int status_change_nid_high;
 		int status_change_nid;
 	}
 
@@ -65,8 +64,6 @@ The third argument (arg) passes a pointer of struct memory_notify::
 - nr_pages is # of pages of online/offline memory.
 - status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask
   is (will be) set/clear, if this is -1, then nodemask status is not changed.
-- status_change_nid_high is set node id when N_HIGH_MEMORY of nodemask
-  is (will be) set/clear, if this is -1, then nodemask status is not changed.
 - status_change_nid is set node id when N_MEMORY of nodemask is (will be)
   set/clear. It means a new(memoryless) node gets new memory by online and a
   node loses all memory. If this is -1, then nodemask status is not changed.
diff --git a/Documentation/translations/zh_CN/core-api/memory-hotplug.rst b/Documentation/translations/zh_CN/core-api/memory-hotplug.rst
index 161f4d2c18cc..9a204eb196f2 100644
--- a/Documentation/translations/zh_CN/core-api/memory-hotplug.rst
+++ b/Documentation/translations/zh_CN/core-api/memory-hotplug.rst
@@ -63,7 +63,6 @@ memory_notify结构体的指针::
 		unsigned long start_pfn;
 		unsigned long nr_pages;
 		int status_change_nid_normal;
-		int status_change_nid_high;
 		int status_change_nid;
 	}
 
@@ -74,9 +73,6 @@ memory_notify结构体的指针::
 - status_change_nid_normal是当nodemask的N_NORMAL_MEMORY被设置/清除时设置节
   点id，如果是-1，则nodemask状态不改变。
 
-- status_change_nid_high是当nodemask的N_HIGH_MEMORY被设置/清除时设置的节点
-  id，如果这个值为-1，那么nodemask状态不会改变。
-
 - status_change_nid是当nodemask的N_MEMORY被（将）设置/清除时设置的节点id。这
   意味着一个新的（没上线的）节点通过联机获得新的内存，而一个节点失去了所有的内
   存。如果这个值为-1，那么nodemask的状态就不会改变。
diff --git a/include/linux/memory.h b/include/linux/memory.h
index dd6e608c3e0b..c46ff374d48d 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -96,7 +96,6 @@ struct memory_notify {
 	unsigned long start_pfn;
 	unsigned long nr_pages;
 	int status_change_nid_normal;
-	int status_change_nid_high;
 	int status_change_nid;
 };
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8d7b2b593a26..95c927c8bfb8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -21,7 +21,6 @@
 #include <linux/memory.h>
 #include <linux/memremap.h>
 #include <linux/memory_hotplug.h>
-#include <linux/highmem.h>
 #include <linux/vmalloc.h>
 #include <linux/ioport.h>
 #include <linux/delay.h>
@@ -585,10 +584,6 @@ void generic_online_page(struct page *page, unsigned int order)
 	debug_pagealloc_map_pages(page, 1 << order);
 	__free_pages_core(page, order);
 	totalram_pages_add(1UL << order);
-#ifdef CONFIG_HIGHMEM
-	if (PageHighMem(page))
-		totalhigh_pages_add(1UL << order);
-#endif
 }
 EXPORT_SYMBOL_GPL(generic_online_page);
 
@@ -625,16 +620,11 @@ static void node_states_check_changes_online(unsigned long nr_pages,
 
 	arg->status_change_nid = NUMA_NO_NODE;
 	arg->status_change_nid_normal = NUMA_NO_NODE;
-	arg->status_change_nid_high = NUMA_NO_NODE;
 
 	if (!node_state(nid, N_MEMORY))
 		arg->status_change_nid = nid;
 	if (zone_idx(zone) <= ZONE_NORMAL && !node_state(nid, N_NORMAL_MEMORY))
 		arg->status_change_nid_normal = nid;
-#ifdef CONFIG_HIGHMEM
-	if (zone_idx(zone) <= ZONE_HIGHMEM && !node_state(nid, N_HIGH_MEMORY))
-		arg->status_change_nid_high = nid;
-#endif
 }
 
 static void node_states_set_node(int node, struct memory_notify *arg)
@@ -642,9 +632,6 @@ static void node_states_set_node(int node, struct memory_notify *arg)
 	if (arg->status_change_nid_normal >= 0)
 		node_set_state(node, N_NORMAL_MEMORY);
 
-	if (arg->status_change_nid_high >= 0)
-		node_set_state(node, N_HIGH_MEMORY);
-
 	if (arg->status_change_nid >= 0)
 		node_set_state(node, N_MEMORY);
 }
@@ -1801,7 +1788,6 @@ static void node_states_check_changes_offline(unsigned long nr_pages,
 
 	arg->status_change_nid = NUMA_NO_NODE;
 	arg->status_change_nid_normal = NUMA_NO_NODE;
-	arg->status_change_nid_high = NUMA_NO_NODE;
 
 	/*
 	 * Check whether node_states[N_NORMAL_MEMORY] will be changed.
@@ -1816,24 +1802,9 @@ static void node_states_check_changes_offline(unsigned long nr_pages,
 	if (zone_idx(zone) <= ZONE_NORMAL && nr_pages >= present_pages)
 		arg->status_change_nid_normal = zone_to_nid(zone);
 
-#ifdef CONFIG_HIGHMEM
-	/*
-	 * node_states[N_HIGH_MEMORY] contains nodes which
-	 * have normal memory or high memory.
-	 * Here we add the present_pages belonging to ZONE_HIGHMEM.
-	 * If the zone is within the range of [0..ZONE_HIGHMEM), and
-	 * we determine that the zones in that range become empty,
-	 * we need to clear the node for N_HIGH_MEMORY.
-	 */
-	present_pages += pgdat->node_zones[ZONE_HIGHMEM].present_pages;
-	if (zone_idx(zone) <= ZONE_HIGHMEM && nr_pages >= present_pages)
-		arg->status_change_nid_high = zone_to_nid(zone);
-#endif
-
 	/*
-	 * We have accounted the pages from [0..ZONE_NORMAL), and
-	 * in case of CONFIG_HIGHMEM the pages from ZONE_HIGHMEM
-	 * as well.
+	 * We have accounted the pages from [0..ZONE_NORMAL); ZONE_HIGHMEM
+	 * does not apply as we don't support 32bit.
 	 * Here we count the possible pages from ZONE_MOVABLE.
 	 * If after having accounted all the pages, we see that the nr_pages
 	 * to be offlined is over or equal to the accounted pages,
@@ -1851,9 +1822,6 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
 	if (arg->status_change_nid_normal >= 0)
 		node_clear_state(node, N_NORMAL_MEMORY);
 
-	if (arg->status_change_nid_high >= 0)
-		node_clear_state(node, N_HIGH_MEMORY);
-
 	if (arg->status_change_nid >= 0)
 		node_clear_state(node, N_MEMORY);
 }
-- 
2.31.1


^ permalink raw reply related

* [PATCH v1 5/6] mm/memory_hotplug: remove stale function declarations
From: David Hildenbrand @ 2021-09-29 14:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Michal Hocko, Michael S. Tsirkin, Peter Zijlstra, Jason Wang,
	Dave Hansen, x86, virtualization, linux-mm, Paul Mackerras,
	linux-kselftest, H. Peter Anvin, Shuah Khan, Alex Shi,
	Jonathan Corbet, David Hildenbrand, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, Oscar Salvador,
	Greg Kroah-Hartman, linux-doc, Rafael J. Wysocki, Andrew Morton,
	linuxppc-dev, Mike Rapoport
In-Reply-To: <20210929143600.49379-1-david@redhat.com>

These functions no longer exist.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/memory_hotplug.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index e5a867c950b2..be48e003a518 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -98,9 +98,6 @@ static inline void zone_seqlock_init(struct zone *zone)
 {
 	seqlock_init(&zone->span_seqlock);
 }
-extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
-extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
-extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
 extern void adjust_present_page_count(struct page *page,
 				      struct memory_group *group,
 				      long nr_pages);
-- 
2.31.1


^ permalink raw reply related

* [PATCH v1 3/6] mm/memory_hotplug: restrict CONFIG_MEMORY_HOTPLUG to 64 bit
From: David Hildenbrand @ 2021-09-29 14:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Michal Hocko, Michael S. Tsirkin, Peter Zijlstra, Jason Wang,
	Dave Hansen, x86, virtualization, linux-mm, Paul Mackerras,
	linux-kselftest, H. Peter Anvin, Shuah Khan, Alex Shi,
	Jonathan Corbet, David Hildenbrand, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, Oscar Salvador,
	Greg Kroah-Hartman, linux-doc, Rafael J. Wysocki, Andrew Morton,
	linuxppc-dev, Mike Rapoport
In-Reply-To: <20210929143600.49379-1-david@redhat.com>

32 bit support is broken in various ways: for example, we can online
memory that should actually go to ZONE_HIGHMEM to ZONE_MOVABLE or in
some cases even to one of the other kernel zones.

We marked it BROKEN in commit b59d02ed0869 ("mm/memory_hotplug: disable the
functionality for 32b") almost one year ago. According to that commit
it might be broken at least since 2017. Further, there is hardly a sane use
case nowadays.

Let's just depend completely on 64bit, dropping the "BROKEN" dependency to
make clear that we are not going to support it again. Next, we'll remove
some HIGHMEM leftovers from memory hotplug code to clean up.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index ea8762cd8e1e..88273dd5c6d6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -125,7 +125,7 @@ config MEMORY_HOTPLUG
 	select MEMORY_ISOLATION
 	depends on SPARSEMEM
 	depends on ARCH_ENABLE_MEMORY_HOTPLUG
-	depends on 64BIT || BROKEN
+	depends on 64BIT
 	select NUMA_KEEP_MEMINFO if NUMA

 config MEMORY_HOTPLUG_DEFAULT_ONLINE
-- 
2.31.1

^ permalink raw reply related

* [PATCH v1 2/6] mm/memory_hotplug: remove CONFIG_MEMORY_HOTPLUG_SPARSE
From: David Hildenbrand @ 2021-09-29 14:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Michal Hocko, Michael S. Tsirkin, Peter Zijlstra, Jason Wang,
	Dave Hansen, x86, virtualization, linux-mm, Paul Mackerras,
	linux-kselftest, H. Peter Anvin, Shuah Khan, Alex Shi,
	Jonathan Corbet, David Hildenbrand, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, Oscar Salvador,
	Greg Kroah-Hartman, linux-doc, Rafael J. Wysocki, Andrew Morton,
	linuxppc-dev, Mike Rapoport
In-Reply-To: <20210929143600.49379-1-david@redhat.com>

CONFIG_MEMORY_HOTPLUG depends on CONFIG_SPARSEMEM, so there is no need for
CONFIG_MEMORY_HOTPLUG_SPARSE anymore; adjust all instances to use
CONFIG_MEMORY_HOTPLUG and remove CONFIG_MEMORY_HOTPLUG_SPARSE.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/include/asm/machdep.h            |  2 +-
 arch/powerpc/kernel/setup_64.c                |  2 +-
 arch/powerpc/platforms/powernv/setup.c        |  4 ++--
 arch/powerpc/platforms/pseries/setup.c        |  2 +-
 drivers/base/Makefile                         |  2 +-
 drivers/base/node.c                           |  9 ++++-----
 drivers/virtio/Kconfig                        |  2 +-
 include/linux/memory.h                        | 18 +++++++-----------
 include/linux/node.h                          |  4 ++--
 lib/Kconfig.debug                             |  2 +-
 mm/Kconfig                                    |  4 ----
 mm/memory_hotplug.c                           |  2 --
 tools/testing/selftests/memory-hotplug/config |  1 -
 13 files changed, 21 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 764f2732a821..d8a2ca007082 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -32,7 +32,7 @@ struct machdep_calls {
 	void		(*iommu_save)(void);
 	void		(*iommu_restore)(void);
 #endif
-#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+#ifdef CONFIG_MEMORY_HOTPLUG
 	unsigned long	(*memory_block_size)(void);
 #endif
 #endif /* CONFIG_PPC64 */
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index eaa79a0996d1..21f15d82f062 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -912,7 +912,7 @@ void __init setup_per_cpu_areas(void)
 }
 #endif
 
-#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+#ifdef CONFIG_MEMORY_HOTPLUG
 unsigned long memory_block_size_bytes(void)
 {
 	if (ppc_md.memory_block_size)
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index a8db3f153063..ad56a54ac9c5 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -440,7 +440,7 @@ static void pnv_kexec_cpu_down(int crash_shutdown, int secondary)
 }
 #endif /* CONFIG_KEXEC_CORE */
 
-#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+#ifdef CONFIG_MEMORY_HOTPLUG
 static unsigned long pnv_memory_block_size(void)
 {
 	/*
@@ -553,7 +553,7 @@ define_machine(powernv) {
 #ifdef CONFIG_KEXEC_CORE
 	.kexec_cpu_down		= pnv_kexec_cpu_down,
 #endif
-#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+#ifdef CONFIG_MEMORY_HOTPLUG
 	.memory_block_size	= pnv_memory_block_size,
 #endif
 };
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index f79126f16258..d29f6f1f7f37 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -1089,7 +1089,7 @@ define_machine(pseries) {
 	.machine_kexec          = pSeries_machine_kexec,
 	.kexec_cpu_down         = pseries_kexec_cpu_down,
 #endif
-#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+#ifdef CONFIG_MEMORY_HOTPLUG
 	.memory_block_size	= pseries_memory_block_size,
 #endif
 };
diff --git a/drivers/base/Makefile b/drivers/base/Makefile
index ef8e44a7d288..02f7f1358e86 100644
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -13,7 +13,7 @@ obj-y			+= power/
 obj-$(CONFIG_ISA_BUS_API)	+= isa.o
 obj-y				+= firmware_loader/
 obj-$(CONFIG_NUMA)	+= node.o
-obj-$(CONFIG_MEMORY_HOTPLUG_SPARSE) += memory.o
+obj-$(CONFIG_MEMORY_HOTPLUG) += memory.o
 ifeq ($(CONFIG_SYSFS),y)
 obj-$(CONFIG_MODULES)	+= module.o
 endif
diff --git a/drivers/base/node.c b/drivers/base/node.c
index c56d34f8158f..b5a4ba18f9f9 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -629,7 +629,7 @@ static void node_device_release(struct device *dev)
 {
 	struct node *node = to_node(dev);
 
-#if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
+#if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_HUGETLBFS)
 	/*
 	 * We schedule the work only when a memory section is
 	 * onlined/offlined on this node. When we come here,
@@ -782,7 +782,7 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
 	return 0;
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+#ifdef CONFIG_MEMORY_HOTPLUG
 static int __ref get_nid_for_pfn(unsigned long pfn)
 {
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
@@ -958,10 +958,9 @@ static int node_memory_callback(struct notifier_block *self,
 	return NOTIFY_OK;
 }
 #endif	/* CONFIG_HUGETLBFS */
-#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
+#endif /* CONFIG_MEMORY_HOTPLUG */
 
-#if !defined(CONFIG_MEMORY_HOTPLUG_SPARSE) || \
-    !defined(CONFIG_HUGETLBFS)
+#if !defined(CONFIG_MEMORY_HOTPLUG) || !defined(CONFIG_HUGETLBFS)
 static inline int node_memory_callback(struct notifier_block *self,
 				unsigned long action, void *arg)
 {
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index ce1b3f6ec325..3654def9915c 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -98,7 +98,7 @@ config VIRTIO_MEM
 	default m
 	depends on X86_64
 	depends on VIRTIO
-	depends on MEMORY_HOTPLUG_SPARSE
+	depends on MEMORY_HOTPLUG
 	depends on MEMORY_HOTREMOVE
 	depends on CONTIG_ALLOC
 	help
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 7efc0a7c14c9..dd6e608c3e0b 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -110,7 +110,7 @@ struct mem_section;
 #define SLAB_CALLBACK_PRI       1
 #define IPC_CALLBACK_PRI        10
 
-#ifndef CONFIG_MEMORY_HOTPLUG_SPARSE
+#ifndef CONFIG_MEMORY_HOTPLUG
 static inline void memory_dev_init(void)
 {
 	return;
@@ -126,7 +126,11 @@ static inline int memory_notify(unsigned long val, void *v)
 {
 	return 0;
 }
-#else
+#define hotplug_memory_notifier(fn, pri)	({ 0; })
+/* These aren't inline functions due to a GCC bug. */
+#define register_hotmemory_notifier(nb)    ({ (void)(nb); 0; })
+#define unregister_hotmemory_notifier(nb)  ({ (void)(nb); })
+#else /* CONFIG_MEMORY_HOTPLUG */
 extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
 int create_memory_block_devices(unsigned long start, unsigned long size,
@@ -149,9 +153,6 @@ struct memory_group *memory_group_find_by_id(int mgid);
 typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *);
 int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
 			       struct memory_group *excluded, void *arg);
-#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
-
-#ifdef CONFIG_MEMORY_HOTPLUG
 #define hotplug_memory_notifier(fn, pri) ({		\
 	static __meminitdata struct notifier_block fn##_mem_nb =\
 		{ .notifier_call = fn, .priority = pri };\
@@ -159,12 +160,7 @@ int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
 })
 #define register_hotmemory_notifier(nb)		register_memory_notifier(nb)
 #define unregister_hotmemory_notifier(nb) 	unregister_memory_notifier(nb)
-#else
-#define hotplug_memory_notifier(fn, pri)	({ 0; })
-/* These aren't inline functions due to a GCC bug. */
-#define register_hotmemory_notifier(nb)    ({ (void)(nb); 0; })
-#define unregister_hotmemory_notifier(nb)  ({ (void)(nb); })
-#endif
+#endif /* CONFIG_MEMORY_HOTPLUG */
 
 /*
  * Kernel text modification mutex, used for code patching. Users of this lock
diff --git a/include/linux/node.h b/include/linux/node.h
index 8e5a29897936..bb21fd631b16 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -85,7 +85,7 @@ struct node {
 	struct device	dev;
 	struct list_head access_list;
 
-#if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
+#if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_HUGETLBFS)
 	struct work_struct	node_work;
 #endif
 #ifdef CONFIG_HMEM_REPORTING
@@ -98,7 +98,7 @@ struct memory_block;
 extern struct node *node_devices[];
 typedef  void (*node_registration_func_t)(struct node *);
 
-#if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_NUMA)
+#if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_NUMA)
 void link_mem_sections(int nid, unsigned long start_pfn,
 		       unsigned long end_pfn,
 		       enum meminit_context context);
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 2a9b6dcdac4f..669fee1d26b8 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -877,7 +877,7 @@ config DEBUG_MEMORY_INIT
 
 config MEMORY_NOTIFIER_ERROR_INJECT
 	tristate "Memory hotplug notifier error injection module"
-	depends on MEMORY_HOTPLUG_SPARSE && NOTIFIER_ERROR_INJECTION
+	depends on MEMORY_HOTPLUG && NOTIFIER_ERROR_INJECTION
 	help
 	  This option provides the ability to inject artificial errors to
 	  memory hotplug notifier chain callbacks.  It is controlled through
diff --git a/mm/Kconfig b/mm/Kconfig
index b7fb3f0b485e..ea8762cd8e1e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -128,10 +128,6 @@ config MEMORY_HOTPLUG
 	depends on 64BIT || BROKEN
 	select NUMA_KEEP_MEMINFO if NUMA
 
-config MEMORY_HOTPLUG_SPARSE
-	def_bool y
-	depends on SPARSEMEM && MEMORY_HOTPLUG
-
 config MEMORY_HOTPLUG_DEFAULT_ONLINE
 	bool "Online the newly added memory blocks by default"
 	depends on MEMORY_HOTPLUG
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9fd0be32a281..8d7b2b593a26 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -220,7 +220,6 @@ static void release_memory_resource(struct resource *res)
 	kfree(res);
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
 static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
 		const char *reason)
 {
@@ -1163,7 +1162,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	mem_hotplug_done();
 	return ret;
 }
-#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
 
 static void reset_node_present_pages(pg_data_t *pgdat)
 {
diff --git a/tools/testing/selftests/memory-hotplug/config b/tools/testing/selftests/memory-hotplug/config
index a7e8cd5bb265..1eef042a31e1 100644
--- a/tools/testing/selftests/memory-hotplug/config
+++ b/tools/testing/selftests/memory-hotplug/config
@@ -1,5 +1,4 @@
 CONFIG_MEMORY_HOTPLUG=y
-CONFIG_MEMORY_HOTPLUG_SPARSE=y
 CONFIG_NOTIFIER_ERROR_INJECTION=y
 CONFIG_MEMORY_NOTIFIER_ERROR_INJECT=m
 CONFIG_MEMORY_HOTREMOVE=y
-- 
2.31.1


^ permalink raw reply related

* [PATCH v1 1/6] mm/memory_hotplug: remove CONFIG_X86_64_ACPI_NUMA dependency from CONFIG_MEMORY_HOTPLUG
From: David Hildenbrand @ 2021-09-29 14:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Michal Hocko, Michael S. Tsirkin, Peter Zijlstra, Jason Wang,
	Dave Hansen, x86, virtualization, linux-mm, Paul Mackerras,
	linux-kselftest, H. Peter Anvin, Shuah Khan, Alex Shi,
	Jonathan Corbet, David Hildenbrand, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, Oscar Salvador,
	Greg Kroah-Hartman, linux-doc, Rafael J. Wysocki, Andrew Morton,
	linuxppc-dev, Mike Rapoport
In-Reply-To: <20210929143600.49379-1-david@redhat.com>

SPARSEMEM is the only possible memory model for x86-64, FLATMEM is not
possible:
	config ARCH_FLATMEM_ENABLE
		def_bool y
		depends on X86_32 && !NUMA

And X86_64_ACPI_NUMA (obviously) only supports x86-64:
	config X86_64_ACPI_NUMA
		def_bool y
		depends on X86_64 && NUMA && ACPI && PCI

Let's just remove the CONFIG_X86_64_ACPI_NUMA dependency, as it does no
longer make sense.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index d16ba9249bc5..b7fb3f0b485e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -123,7 +123,7 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
 config MEMORY_HOTPLUG
 	bool "Allow for memory hot-add"
 	select MEMORY_ISOLATION
-	depends on SPARSEMEM || X86_64_ACPI_NUMA
+	depends on SPARSEMEM
 	depends on ARCH_ENABLE_MEMORY_HOTPLUG
 	depends on 64BIT || BROKEN
 	select NUMA_KEEP_MEMINFO if NUMA
-- 
2.31.1


^ permalink raw reply related

* [PATCH v1 0/6] mm/memory_hotplug: Kconfig and 32 bit cleanups
From: David Hildenbrand @ 2021-09-29 14:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Michal Hocko, Michael S. Tsirkin, Peter Zijlstra, Jason Wang,
	Dave Hansen, x86, virtualization, linux-mm, Paul Mackerras,
	linux-kselftest, H. Peter Anvin, Shuah Khan, Alex Shi,
	Jonathan Corbet, David Hildenbrand, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, Oscar Salvador,
	Greg Kroah-Hartman, linux-doc, Rafael J. Wysocki, Andrew Morton,
	linuxppc-dev, Mike Rapoport

Some cleanups around CONFIG_MEMORY_HOTPLUG, including removing 32 bit
leftovers of memory hotplug support.

Compile-tested on various architectures, quickly tested memory hotplug
on x86-64.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alex Shi <alexs@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: x86@kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kselftest@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: virtualization@lists.linux-foundation.org

David Hildenbrand (6):
  mm/memory_hotplug: remove CONFIG_X86_64_ACPI_NUMA dependency from
    CONFIG_MEMORY_HOTPLUG
  mm/memory_hotplug: remove CONFIG_MEMORY_HOTPLUG_SPARSE
  mm/memory_hotplug: restrict CONFIG_MEMORY_HOTPLUG to 64 bit
  mm/memory_hotplug: remove HIGHMEM leftovers
  mm/memory_hotplug: remove stale function declarations
  x86: remove memory hotplug support on X86_32

 Documentation/core-api/memory-hotplug.rst     |  3 --
 .../zh_CN/core-api/memory-hotplug.rst         |  4 --
 arch/powerpc/include/asm/machdep.h            |  2 +-
 arch/powerpc/kernel/setup_64.c                |  2 +-
 arch/powerpc/platforms/powernv/setup.c        |  4 +-
 arch/powerpc/platforms/pseries/setup.c        |  2 +-
 arch/x86/Kconfig                              |  6 +--
 arch/x86/mm/init_32.c                         | 31 ---------------
 drivers/base/Makefile                         |  2 +-
 drivers/base/node.c                           |  9 ++---
 drivers/virtio/Kconfig                        |  2 +-
 include/linux/memory.h                        | 19 ++++------
 include/linux/memory_hotplug.h                |  3 --
 include/linux/node.h                          |  4 +-
 lib/Kconfig.debug                             |  2 +-
 mm/Kconfig                                    |  8 +---
 mm/memory_hotplug.c                           | 38 +------------------
 tools/testing/selftests/memory-hotplug/config |  1 -
 18 files changed, 28 insertions(+), 114 deletions(-)

base-commit: 5816b3e6577eaa676ceb00a848f0fd65fe2adc29
-- 
2.31.1

^ permalink raw reply

* Re: [PATCH v8 1/2] powerpc/pseries: Interface to represent PAPR firmware attributes
From: Michael Ellerman @ 2021-09-29 13:49 UTC (permalink / raw)
  To: Pratik R. Sampat, benh, paulus, shuah, farosas, kjain,
	linuxppc-dev, kvm-ppc, linux-kselftest, linux-kernel, psampat,
	pratik.r.sampat
In-Reply-To: <20210928115102.57117-2-psampat@linux.ibm.com>

Hi Pratik,

Some comments inline below ...

"Pratik R. Sampat" <psampat@linux.ibm.com> writes:
> Adds a generic interface to represent the energy and frequency related
> PAPR attributes on the system using the new H_CALL
> "H_GET_ENERGY_SCALE_INFO".

In Linux "generic" would usually mean something that is not architecture
specific, but this is architecture specific.

It's also not generic across different types of attributes, it's only
for information from this particular hcall (right?).

I think you mean generic in contrast to lparcfg, which I guess makes
some sense, this is more generic than lparcfg :)

But I think it'd be better to just say "a sysfs interface".

> H_GET_EM_PARMS H_CALL was previously responsible for exporting this
> information in the lparcfg, however the H_GET_EM_PARMS H_CALL
> will be deprecated P10 onwards.
>
> The H_GET_ENERGY_SCALE_INFO H_CALL is of the following call format:
> hcall(
>   uint64 H_GET_ENERGY_SCALE_INFO,  // Get energy scale info
>   uint64 flags,           // Per the flag request
>   uint64 firstAttributeId,// The attribute id
>   uint64 bufferAddress,   // Guest physical address of the output buffer
>   uint64 bufferSize       // The size in bytes of the output buffer
> );

As specified in PAPR+ v2.11, section 14.14.3.

> This H_CALL can query either all the attributes at once with
> firstAttributeId = 0, flags = 0 as well as query only one attribute
> at a time with firstAttributeId = id, flags = 1.
>
> The output buffer consists of the following
> 1. number of attributes              - 8 bytes
> 2. array offset to the data location - 8 bytes
> 3. version info                      - 1 byte
     (possible future expansion here)
> 4. A data array of size num attributes, which contains the following:
>   a. attribute ID              - 8 bytes
>   b. attribute value in number - 8 bytes
>   c. attribute name in string  - 64 bytes
>   d. attribute value in string - 64 bytes
>
> The new H_CALL exports information in direct string value format, hence
> a new interface has been introduced in
> /sys/firmware/papr/energy_scale_info to export this information to
> userspace in an extensible pass-through format.

I'm not really sure what "extensible pass-through format" means? Do you
mean the kernel doesn't interpret the values, so firmware can add new
values and they will just appear without any kernel changes required?

> The H_CALL returns the name, numeric value and string value (if exists)
>
> The format of exposing the sysfs information is as follows:
> /sys/firmware/papr/energy_scale_info/
>    |-- <id>/
>      |-- desc
>      |-- value
>      |-- value_desc (if exists)
>    |-- <id>/
>      |-- desc
>      |-- value
>      |-- value_desc (if exists)
> ...
>
> The energy information that is exported is useful for userspace tools
> such as powerpc-utils. Currently these tools infer the
> "power_mode_data" value in the lparcfg, which in turn is obtained from
> the to be deprecated H_GET_EM_PARMS H_CALL.
> On future platforms, such userspace utilities will have to look at the
> data returned from the new H_CALL being populated in this new sysfs
> interface and report this information directly without the need of
> interpretation.

I don't see anything here or in PAPR about how/if the values change over
time.

Are we expected to always read the values from the hypervisor, or are we
allowed to cache them for any period of time?

The reason I ask is currently we're doing 3 hcalls for each attribute,
once for desc, value and value_desc. Is there any issue with the
consistency of the values given we're doing 3 separate calls?

Or are the values static? Or static except for LPM? In which case we
could read them once at boot/LPM. Or maybe we not know if/how the values
will change, especially in future?

> diff --git a/Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info b/Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info
> new file mode 100644
> index 000000000000..139a576c7c9d
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info
> @@ -0,0 +1,26 @@
> +What:		/sys/firmware/papr/energy_scale_info
> +Date:		June 2021
> +Contact:	Linux for PowerPC mailing list <linuxppc-dev@ozlabs.org>
> +Description:	Directory hosting a set of platform attributes like
> +		energy/frequency on Linux running as a PAPR guest.
> +
> +		Each file in a directory contains a platform
> +		attribute hierarchy pertaining to performance/
> +		energy-savings mode and processor frequency.
> +
> +What:		/sys/firmware/papr/energy_scale_info/<id>
> +		/sys/firmware/papr/energy_scale_info/<id>/desc
> +		/sys/firmware/papr/energy_scale_info/<id>/value
> +		/sys/firmware/papr/energy_scale_info/<id>/value_desc
> +Date:		June 2021
> +Contact:	Linux for PowerPC mailing list <linuxppc-dev@ozlabs.org>
> +Description:	Energy, frequency attributes directory for POWERVM servers
> +
> +		This directory provides energy, frequency, folding information. It
> +		contains below sysfs attributes:
> +
> +		- desc: String description of the attribute <id>
> +
> +		- value: Numeric value of attribute <id>
> +
> +		- value_desc: String value of attribute <id>
> diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
> index 9bcf345cb208..38980fef7a3d 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -323,7 +323,8 @@
>  #define H_SCM_PERFORMANCE_STATS 0x418
>  #define H_RPT_INVALIDATE	0x448
>  #define H_SCM_FLUSH		0x44C
> -#define MAX_HCALL_OPCODE	H_SCM_FLUSH
> +#define H_GET_ENERGY_SCALE_INFO	0x450
> +#define MAX_HCALL_OPCODE	H_GET_ENERGY_SCALE_INFO
>  
>  /* Scope args for H_SCM_UNBIND_ALL */
>  #define H_UNBIND_SCOPE_ALL (0x1)
> @@ -641,6 +642,27 @@ struct hv_gpci_request_buffer {
>  	uint8_t bytes[HGPCI_MAX_DATA_BYTES];
>  } __packed;
>  

None of the following needs to be in this header AFAICS. It can all go
in the C file.

> +#define ESI_VERSION	0x1
> +#define MAX_ESI_ATTRS	10

Where does that max come from? I don't see it mentioned in PAPR.

PAPR implies any number of attributes up to 2^64 can be returned, and if
the buffer is too small H_PARTIAL is returned.

We really mustn't hard code a limit of 10 attributes. We need a loop that
reads N attributes, and if we receive H_PARTIAL, we repeat the hcall,
until we've received all the attributes.

To test that is working on an existing system, which presumably only has
a small number of attributes, you should shrink the buffer to the size
of 1 attribute, to make sure the loop logic works.

> +#define MAX_BUF_SZ	(sizeof(struct h_energy_scale_info_hdr) + \
> +			(sizeof(struct energy_scale_attribute) * MAX_ESI_ATTRS))
>
> +struct energy_scale_attribute {
> +	__be64 id;
> +	__be64 value;
> +	unsigned char desc[64];
> +	unsigned char value_desc[64];

I prefer u8.

> +} __packed;
> +
> +struct h_energy_scale_info_hdr {
> +	__be64 num_attrs;
> +	__be64 array_offset;
> +	__u8 data_header_version;

You can use u8 here, __u8 is only needed in uapi headers.

> +} __packed;

So because of array_offset, we don't actually know what the size of the
header plus a single attribute will be.

Because the header tells us the offset to the first attribute, when the
header expands in a future version of the hcall, sizeof(hdr) +
sizeof(energy_scale_attribute) will not necessarily be large enough to
fit a single attribute. I guess maybe that's why you're always
allocating space for 10 of them (MAX_BUF_SZ).

To be truly forward compatible we should do the hcall at startup with
the current known size of the structs, and if we receive H_P4 try again
with increasing sizes until it succeeds.

That's probably overly complex though, given we'd probably only expect
the header to grow by a few 10s of bytes in future.

Maybe we should just allocate 2KB for the buffer, which is still small
as far as allocations go, but gives us a huge amount of headroom for the
header to grow.


> +/* /sys/firmware/papr */
> +extern struct kobject *papr_kobj;
> +
>  #endif /* __ASSEMBLY__ */
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_POWERPC_HVCALL_H */
> diff --git a/arch/powerpc/kvm/trace_hv.h b/arch/powerpc/kvm/trace_hv.h
> index 830a126e095d..38cd0ed0a617 100644
> --- a/arch/powerpc/kvm/trace_hv.h
> +++ b/arch/powerpc/kvm/trace_hv.h
> @@ -115,6 +115,7 @@
>  	{H_VASI_STATE,			"H_VASI_STATE"}, \
>  	{H_ENABLE_CRQ,			"H_ENABLE_CRQ"}, \
>  	{H_GET_EM_PARMS,		"H_GET_EM_PARMS"}, \
> +	{H_GET_ENERGY_SCALE_INFO,	"H_GET_ENERGY_SCALE_INFO"}, \
>  	{H_SET_MPP,			"H_SET_MPP"}, \
>  	{H_GET_MPP,			"H_GET_MPP"}, \
>  	{H_HOME_NODE_ASSOCIATIVITY,	"H_HOME_NODE_ASSOCIATIVITY"}, \
> diff --git a/arch/powerpc/platforms/pseries/Makefile b/arch/powerpc/platforms/pseries/Makefile
> index 4cda0ef87be0..c4c19f6a5975 100644
> --- a/arch/powerpc/platforms/pseries/Makefile
> +++ b/arch/powerpc/platforms/pseries/Makefile
> @@ -6,7 +6,8 @@ obj-y			:= lpar.o hvCall.o nvram.o reconfig.o \
>  			   of_helpers.o \
>  			   setup.o iommu.o event_sources.o ras.o \
>  			   firmware.o power.o dlpar.o mobility.o rng.o \
> -			   pci.o pci_dlpar.o eeh_pseries.o msi.o
> +			   pci.o pci_dlpar.o eeh_pseries.o msi.o \
> +			   papr_platform_attributes.o

I don't really mind, but is there a reason we're calling it
papr_platform_attributes.c and not energy_info.c or something?

Is the plan to support more attributes in future?

>  obj-$(CONFIG_SMP)	+= smp.o
>  obj-$(CONFIG_SCANLOG)	+= scanlog.o
>  obj-$(CONFIG_KEXEC_CORE)	+= kexec.o
> diff --git a/arch/powerpc/platforms/pseries/papr_platform_attributes.c b/arch/powerpc/platforms/pseries/papr_platform_attributes.c
> new file mode 100644
> index 000000000000..84ddce52e519
> --- /dev/null
> +++ b/arch/powerpc/platforms/pseries/papr_platform_attributes.c
> @@ -0,0 +1,312 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Platform energy and frequency attributes driver
> + *
> + * This driver creates a sys file at /sys/firmware/papr/ which encapsulates a
> + * directory structure containing files in keyword - value pairs that specify
> + * energy and frequency configuration of the system.
> + *
> + * The format of exposing the sysfs information is as follows:
> + * /sys/firmware/papr/energy_scale_info/
> + *  |-- <id>/
> + *    |-- desc
> + *    |-- value
> + *    |-- value_desc (if exists)
> + *  |-- <id>/
> + *    |-- desc
> + *    |-- value
> + *    |-- value_desc (if exists)
> + *
> + * Copyright 2021 IBM Corp.
> + */
> +
> +#include <asm/hvcall.h>
> +#include <asm/machdep.h>
> +
> +#include "pseries.h"
> +
> +/*
> + * Flag attributes to fetch either all or one attribute from the HCALL
> + * flag = BE(0) => fetch all attributes with firstAttributeId = 0
> + * flag = BE(1) => fetch a single attribute with firstAttributeId = id
> + */
> +#define ESI_FLAGS_ALL		0
> +#define ESI_FLAGS_SINGLE	PPC_BIT(0)

I dislike PPC_BIT(). It's obscure and not helpful for people reading the
code compared to a simple (1ull << 63).

> +
> +#define MAX_ATTRS		3
> +
> +struct papr_attr {
> +	u64 id;
> +	struct kobj_attribute kobj_attr;
> +};
> +struct papr_group {
> +	struct attribute_group pg;
> +	struct papr_attr pgattrs[MAX_ATTRS];
> +} *pgs;

pgs is not a great name for a global.

Can it be static?

Also would be more normal to define the structure separately from
declaring *pgs.

> +
> +/* /sys/firmware/papr */
> +struct kobject *papr_kobj;

static?

> +/* /sys/firmware/papr/energy_scale_info */
> +struct kobject *esi_kobj;

static?


The following three functions are almost exactly the same, except they
print a different field with a different format string. Seems like there
could be some shared code.

> +/*
> + * Extract and export the description of the energy scale attributes
> + */
> +static ssize_t papr_show_desc(struct kobject *kobj,
> +			       struct kobj_attribute *kobj_attr,
> +			       char *buf)
> +{
> +	struct papr_attr *pattr = container_of(kobj_attr, struct papr_attr,
> +					       kobj_attr);
> +	struct h_energy_scale_info_hdr *t_hdr;
> +	struct energy_scale_attribute *t_esi;

What's the "t_", temp? I don't really like it. hdr and esi would be fine IMHO.

> +	char *t_buf;
> +	int ret = 0;

That's initialisation is pointless, you override it below.

> +	t_buf = kmalloc(MAX_BUF_SZ, GFP_KERNEL);
> +	if (t_buf == NULL)
> +		return -ENOMEM;
> +
> +	ret = plpar_hcall_norets(H_GET_ENERGY_SCALE_INFO, ESI_FLAGS_SINGLE,
> +				 pattr->id, virt_to_phys(t_buf),
> +				 MAX_BUF_SZ);
> +
> +	if (ret != H_SUCCESS) {
> +		pr_warn("hcall failed: H_GET_ENERGY_SCALE_INFO");
> +		goto out;

Here you're going to return ret back to userspace. But ret currently
holds a hcall return value, hcall return values are not necessarily
valid Linux error codes.

eg. H_PARTIAL is 5, H_BUSY is 1, etc. Those are potentially read as
success by your caller.

You must never return a hcall error back to Linux code that's not
expecting it.

Here it should probably just return -EIO I guess.

> +	}
> +
> +	t_hdr = (struct h_energy_scale_info_hdr *) t_buf;
> +	t_esi = (struct energy_scale_attribute *)
> +		(t_buf + be64_to_cpu(t_hdr->array_offset));

You should check that array_offset + sizeof(attribute) doesn't point
past the end of your buffer.

> +
> +	ret = snprintf(buf, sizeof(t_esi->desc), "%s\n", t_esi->desc);
> +	if (ret < 0)
> +		ret = -EIO;
> +out:
> +	kfree(t_buf);
> +
> +	return ret;
> +}
> +
> +/*
> + * Extract and export the numeric value of the energy scale attributes
> + */
> +static ssize_t papr_show_value(struct kobject *kobj,
> +				struct kobj_attribute *kobj_attr,
> +				char *buf)
> +{
> +	struct papr_attr *pattr = container_of(kobj_attr, struct papr_attr,
> +					       kobj_attr);
> +	struct h_energy_scale_info_hdr *t_hdr;
> +	struct energy_scale_attribute *t_esi;
> +	char *t_buf;
> +	int ret = 0;
> +
> +	t_buf = kmalloc(MAX_BUF_SZ, GFP_KERNEL);
> +	if (t_buf == NULL)
> +		return -ENOMEM;
> +
> +	ret = plpar_hcall_norets(H_GET_ENERGY_SCALE_INFO, ESI_FLAGS_SINGLE,
> +				 pattr->id, virt_to_phys(t_buf),
> +				 MAX_BUF_SZ);
> +
> +	if (ret != H_SUCCESS) {
> +		pr_warn("hcall failed: H_GET_ENERGY_SCALE_INFO");
> +		goto out;
> +	}
> +
> +	t_hdr = (struct h_energy_scale_info_hdr *) t_buf;
> +	t_esi = (struct energy_scale_attribute *)
> +		(t_buf + be64_to_cpu(t_hdr->array_offset));
> +
> +	ret = snprintf(buf, sizeof(t_esi->value), "%llu\n",
> +		       be64_to_cpu(t_esi->value));
> +	if (ret < 0)
> +		ret = -EIO;
> +out:
> +	kfree(t_buf);
> +
> +	return ret;
> +}
> +
> +/*
> + * Extract and export the value description in string format of the energy
> + * scale attributes
> + */
> +static ssize_t papr_show_value_desc(struct kobject *kobj,
> +				     struct kobj_attribute *kobj_attr,
> +				     char *buf)
> +{
> +	struct papr_attr *pattr = container_of(kobj_attr, struct papr_attr,
> +					       kobj_attr);
> +	struct h_energy_scale_info_hdr *t_hdr;
> +	struct energy_scale_attribute *t_esi;
> +	char *t_buf;
> +	int ret = 0;
> +
> +	t_buf = kmalloc(MAX_BUF_SZ, GFP_KERNEL);
> +	if (t_buf == NULL)
> +		return -ENOMEM;
> +
> +	ret = plpar_hcall_norets(H_GET_ENERGY_SCALE_INFO, ESI_FLAGS_SINGLE,
> +				 pattr->id, virt_to_phys(t_buf),
> +				 MAX_BUF_SZ);
> +
> +	if (ret != H_SUCCESS) {
> +		pr_warn("hcall failed: H_GET_ENERGY_SCALE_INFO");
> +		goto out;
> +	}
> +
> +	t_hdr = (struct h_energy_scale_info_hdr *) t_buf;
> +	t_esi = (struct energy_scale_attribute *)
> +		(t_buf + be64_to_cpu(t_hdr->array_offset));
> +
> +	ret = snprintf(buf, sizeof(t_esi->value_desc), "%s\n",
> +		       t_esi->value_desc);
> +	if (ret < 0)
> +		ret = -EIO;
> +out:
> +	kfree(t_buf);
> +
> +	return ret;
> +}
> +
> +static struct papr_ops_info {
> +	const char *attr_name;
> +	ssize_t (*show)(struct kobject *kobj, struct kobj_attribute *kobj_attr,
> +			char *buf);
> +} ops_info[MAX_ATTRS] = {
> +	{ "desc", papr_show_desc },
> +	{ "value", papr_show_value },
> +	{ "value_desc", papr_show_value_desc },
> +};
> +
> +static void add_attr(u64 id, int index, struct papr_attr *attr)
> +{
> +	attr->id = id;
> +	sysfs_attr_init(&attr->kobj_attr.attr);
> +	attr->kobj_attr.attr.name = ops_info[index].attr_name;
> +	attr->kobj_attr.attr.mode = 0444;
> +	attr->kobj_attr.show = ops_info[index].show;
> +}
> +
> +static int add_attr_group(u64 id, struct papr_group *pg, bool show_val_desc)
> +{
> +	int i;
> +
> +	for (i = 0; i < MAX_ATTRS; i++) {
> +		if (!strcmp(ops_info[i].attr_name, "value_desc") &&
> +		    !show_val_desc) {
> +			continue;
> +		}
> +		add_attr(id, i, &pg->pgattrs[i]);
> +		pg->pg.attrs[i] = &pg->pgattrs[i].kobj_attr.attr;
> +	}
> +
> +	return sysfs_create_group(esi_kobj, &pg->pg);
> +}
> +
> +static int __init papr_init(void)
> +{
> +	struct h_energy_scale_info_hdr *esi_hdr;
> +	struct energy_scale_attribute *esi_attrs;
> +	uint64_t num_attrs;
> +	int ret, idx, i;
> +	char *esi_buf;
> +
> +	if (!firmware_has_feature(FW_FEATURE_LPAR))
> +		return -ENXIO;
> +
> +	esi_buf = kmalloc(MAX_BUF_SZ, GFP_KERNEL);
> +	if (esi_buf == NULL)
> +		return -ENOMEM;
> +	/*
> +	 * hcall(
> +	 * uint64 H_GET_ENERGY_SCALE_INFO,  // Get energy scale info
> +	 * uint64 flags,            // Per the flag request
> +	 * uint64 firstAttributeId, // The attribute id
> +	 * uint64 bufferAddress,    // Guest physical address of the output buffer
> +	 * uint64 bufferSize);      // The size in bytes of the output buffer
> +	 */
> +	ret = plpar_hcall_norets(H_GET_ENERGY_SCALE_INFO, ESI_FLAGS_ALL, 0,
> +				 virt_to_phys(esi_buf), MAX_BUF_SZ);
> +	if (ret != H_SUCCESS) {
> +		pr_warn("hcall failed: H_GET_ENERGY_SCALE_INFO");
> +		goto out;

This prints a warning on all existing systems (that don't support the
new hcall), which is not ideal :)

The proper way to handle discovery of the hcall is by looking at
ibm,hypertas-functions and setting a FW_FEATURE flag based on that.

See fw_hypertas_feature_init().

> +	}

Also the hcall can return H_PARTIAL if the buffer isn't big enough.

> +	esi_hdr = (struct h_energy_scale_info_hdr *) esi_buf;
> +	if (esi_hdr->data_header_version != ESI_VERSION) {
> +		pr_warn("H_GET_ENERGY_SCALE_INFO VER MISMATCH - EXP: 0x%x, REC: 0x%x",
> +			ESI_VERSION, esi_hdr->data_header_version);

That shouldn't be a warning. PAPR says any changes will be backward
compatible. If we need to check the version at all, it would be to make
sure the hypervisor version is >= the version we expect. But I'm not
sure it's necessary to check at all?

> +	}
> +
> +	num_attrs = be64_to_cpu(esi_hdr->num_attrs);
> +	esi_attrs = (struct energy_scale_attribute *)
> +		    (esi_buf + be64_to_cpu(esi_hdr->array_offset));
> +
> +	pgs = kcalloc(num_attrs, sizeof(*pgs), GFP_KERNEL);
> +	if (!pgs)
> +		goto out;
> +
> +	papr_kobj = kobject_create_and_add("papr", firmware_kobj);
> +	if (!papr_kobj) {
> +		pr_warn("kobject_create_and_add papr failed\n");
> +		goto out_pgs;
> +	}
> +
> +	esi_kobj = kobject_create_and_add("energy_scale_info", papr_kobj);
> +	if (!esi_kobj) {
> +		pr_warn("kobject_create_and_add energy_scale_info failed\n");
> +		goto out_kobj;
> +	}
> +
> +	for (idx = 0; idx < num_attrs; idx++) {
> +		bool show_val_desc = true;
> +
> +		pgs[idx].pg.attrs = kcalloc(MAX_ATTRS + 1,
> +					    sizeof(*pgs[idx].pg.attrs),
> +					    GFP_KERNEL);
> +		if (!pgs[idx].pg.attrs) {
> +			goto out_pgattrs;
> +		}
> +
> +		pgs[idx].pg.name = kasprintf(GFP_KERNEL, "%lld",
> +					     be64_to_cpu(esi_attrs[idx].id));
> +		if (pgs[idx].pg.name == NULL) {
> +			goto out_pgattrs;
> +		}
> +		/* Do not add the value description if it does not exist */
> +		if (strnlen(esi_attrs[idx].value_desc,
> +			    sizeof(esi_attrs[idx].value_desc)) == 0)
> +			show_val_desc = false;
> +
> +		if (add_attr_group(be64_to_cpu(esi_attrs[idx].id), &pgs[idx],
> +				   show_val_desc)) {
> +			pr_warn("Failed to create papr attribute group %s\n",
> +				pgs[idx].pg.name);
> +			goto out_pgattrs;
> +		}
> +	}
> +
> +	kfree(esi_buf);
> +	return 0;
> +
> +out_pgattrs:

Is this right? If you failed the allocation for i = 1, then the attrs
for i = 0 would already be registered with sysfs, and then you free them
while they're still in use?

To simplify it you can probably do all the allocations first before
registering anything.

> +	for (i = 0; i < idx ; i++) {
> +		kfree(pgs[i].pg.attrs);
> +		kfree(pgs[i].pg.name);
> +	}
> +	kobject_put(esi_kobj);
> +out_kobj:
> +	kobject_put(papr_kobj);
> +out_pgs:
> +	kfree(pgs);
> +out:

If you're freeing something the label should be "out_free_thing". So in
this case out_free_esi_buf.

> +	kfree(esi_buf);
> +
> +	return -ENOMEM;
> +}
> +
> +machine_device_initcall(pseries, papr_init);
> -- 
> 2.31.1


cheers

^ permalink raw reply

* Re: [PATCH v5 10/11] PCI: Replace pci_dev::driver usage by pci_dev::dev.driver
From: Uwe Kleine-König @ 2021-09-29 13:43 UTC (permalink / raw)
  To: Andrew Donnellan
  Cc: Mark Rutland, Peter Zijlstra, H. Peter Anvin,
	Oliver O'Halloran, Jiri Olsa, Christoph Hellwig,
	Stefano Stabellini, Arnd Bergmann, Boris Ostrovsky, x86,
	Alexander Shishkin, Ingo Molnar, Bjorn Helgaas, linux-pci,
	xen-devel, Mathias Nyman, Konrad Rzeszutek Wilk,
	Arnaldo Carvalho de Melo, Borislav Petkov, Bjorn Helgaas,
	Namhyung Kim, Thomas Gleixner, Juergen Gross, Greg Kroah-Hartman,
	linux-usb, linux-perf-users, kernel, Frederic Barrat,
	Paul Mackerras, linuxppc-dev
In-Reply-To: <75dd6d60-08b9-fa68-352c-3a0c5a04c0ab@linux.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 1935 bytes --]

On Wed, Sep 29, 2021 at 11:15:37PM +1000, Andrew Donnellan wrote:
> On 29/9/21 6:53 pm, Uwe Kleine-König wrote:>   	
> list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
> > -			if (afu_dev->driver && afu_dev->driver->err_handler &&
> > -			    afu_dev->driver->err_handler->resume)
> > -				afu_dev->driver->err_handler->resume(afu_dev);
> > +			struct pci_driver *afu_drv;
> > +			if (afu_dev->dev.driver &&
> > +			    (afu_drv = to_pci_driver(afu_dev->dev.driver))->err_handler &&
> 
> I'm not a huge fan of assignments in if statements and if you send a v6 I'd
> prefer you break it up.

I'm not a huge fan either, I used it to keep the control flow as is and
without introducing several calls to to_pci_driver.

The whole code looks as follows:

	list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
		struct pci_driver *afu_drv;
		if (afu_dev->dev.driver &&
		    (afu_drv = to_pci_driver(afu_dev->dev.driver))->err_handler &&
		    afu_drv->err_handler->resume)
			afu_drv->err_handler->resume(afu_dev);
	}

Without assignment in the if it could look as follows:

	list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
		struct pci_driver *afu_drv;

		if (!afu_dev->dev.driver)
			continue;

		afu_drv = to_pci_driver(afu_dev->dev.driver));

		if (afu_drv->err_handler && afu_drv->err_handler->resume)
			afu_drv->err_handler->resume(afu_dev);
	}

Fine for me.

(Sidenote: What happens if the device is unbound directly after the
check for afu_dev->dev.driver? This is a problem the old code had, too
(assuming it is a real problem, didn't check deeply).)

> Apart from that everything in the powerpc and cxl sections looks good to me.

Thanks for checking.

Best regards
Uwe

-- 
Pengutronix e.K.                           | Uwe Kleine-König            |
Industrial Linux Solutions                 | https://www.pengutronix.de/ |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 484 bytes --]

^ permalink raw reply

* Re: [PATCH v5 10/11] PCI: Replace pci_dev::driver usage by pci_dev::dev.driver
From: Andrew Donnellan @ 2021-09-29 13:15 UTC (permalink / raw)
  To: Uwe Kleine-König, Bjorn Helgaas
  Cc: Mark Rutland, Peter Zijlstra, H. Peter Anvin,
	Oliver O'Halloran, Jiri Olsa, Christoph Hellwig,
	Stefano Stabellini, Mathias Nyman, x86, Alexander Shishkin,
	Ingo Molnar, linux-pci, xen-devel, Juergen Gross, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Arnaldo Carvalho de Melo, Borislav Petkov,
	Bjorn Helgaas, Namhyung Kim, Thomas Gleixner, Boris Ostrovsky,
	Greg Kroah-Hartman, linux-usb, linux-perf-users, kernel,
	Frederic Barrat, Paul Mackerras, linuxppc-dev
In-Reply-To: <20210929085306.2203850-11-u.kleine-koenig@pengutronix.de>

On 29/9/21 6:53 pm, Uwe Kleine-König wrote:>   	 
list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
> -			if (afu_dev->driver && afu_dev->driver->err_handler &&
> -			    afu_dev->driver->err_handler->resume)
> -				afu_dev->driver->err_handler->resume(afu_dev);
> +			struct pci_driver *afu_drv;
> +			if (afu_dev->dev.driver &&
> +			    (afu_drv = to_pci_driver(afu_dev->dev.driver))->err_handler &&

I'm not a huge fan of assignments in if statements and if you send a v6 
I'd prefer you break it up.

Apart from that everything in the powerpc and cxl sections looks good to me.

-- 
Andrew Donnellan              OzLabs, ADL Canberra
ajd@linux.ibm.com             IBM Australia Limited

^ permalink raw reply

* [PATCH v5 00/11] PCI: Drop duplicated tracking of a pci_dev's bound driver
From: Uwe Kleine-König @ 2021-09-29  8:52 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Giovanni Cabiddu, Mark Rutland, Sathya Prakash,
	Alexander Shishkin, linux-pci, Alexander Duyck, x86, qat-linux,
	oss-drivers, Oliver O'Halloran, H. Peter Anvin, Jiri Olsa,
	Christoph Hellwig, Marco Chiappero, Stefano Stabellini,
	Herbert Xu, linux-scsi, Rafał Miłecki, Jesse Brandeburg,
	Peter Zijlstra, Ingo Molnar, linux-wireless, Jakub Kicinski,
	Yisen Zhuang, Suganath Prabu Subramani, Fiona Trahe,
	Andrew Donnellan, Arnd Bergmann, Konrad Rzeszutek Wilk,
	Ido Schimmel, Simon Horman, linuxppc-dev,
	Arnaldo Carvalho de Melo, Jack Xu, Borislav Petkov,
	Michael Buesch, Jiri Pirko, Bjorn Helgaas, Namhyung Kim,
	Boris Ostrovsky, Andy Shevchenko, Juergen Gross, Salil Mehta,
	Sreekanth Reddy, xen-devel, Vadym Kochan, MPT-FusionLinux.pdl,
	Greg Kroah-Hartman, linux-usb, Wojciech Ziemba, linux-kernel,
	Mathias Nyman, Zhou Wang, Thomas Gleixner, linux-crypto, kernel,
	netdev, Frederic Barrat, Paul Mackerras, Tomaszx Kowalik,
	Taras Chornyi, David S. Miller, linux-perf-users

Hello,

this is v5 of the quest to drop the "driver" member from struct pci_dev
which tracks the same data (apart from a constant offset) as dev.driver.

Changes since v4:
 - split some changes out of "PCI: replace pci_dev::driver usage that
   gets the driver name" and reworked them a bit as suggested by Bjorn.
 - Add a line break in
   drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c to please
   checkpatch --strict and Simon Horman.
 - Fixed a build problem in "crypto: qat - simplify adf_enable_aer()",
   thanks to Giovanni Cabiddu for spotting and a suggested fix.

I didn't replace :: by . as suggested by Bjorn as I'm unsure if his
preference is stronger than mine.

This patch stack survived an allmodconfig build on arm64, m68k, powerpc,
riscv, s390, sparc64 and x86_64 on top of v5.15-rc3.

Best regards
Uwe

Uwe Kleine-König (11):
  PCI: Simplify pci_device_remove()
  PCI: Drop useless check from pci_device_probe()
  xen/pci: Drop some checks that are always true
  bcma: simplify reference to the driver's name
  powerpc/eeh: Don't use driver member of struct pci_dev and further
    cleanups
  ssb: Simplify determination of driver name
  PCI: Replace pci_dev::driver usage that gets the driver name
  scsi: message: fusion: Remove unused parameter of mpt_pci driver's
    probe()
  crypto: qat - simplify adf_enable_aer()
  PCI: Replace pci_dev::driver usage by pci_dev::dev.driver
  PCI: Drop duplicated tracking of a pci_dev's bound driver

 arch/powerpc/include/asm/ppc-pci.h            |  5 --
 arch/powerpc/kernel/eeh.c                     |  8 +++
 arch/powerpc/kernel/eeh_driver.c              | 10 +--
 arch/x86/events/intel/uncore.c                |  2 +-
 arch/x86/kernel/probe_roms.c                  |  2 +-
 drivers/bcma/host_pci.c                       |  6 +-
 drivers/crypto/hisilicon/qm.c                 |  2 +-
 drivers/crypto/qat/qat_4xxx/adf_drv.c         |  7 +--
 drivers/crypto/qat/qat_c3xxx/adf_drv.c        |  7 +--
 drivers/crypto/qat/qat_c62x/adf_drv.c         |  7 +--
 drivers/crypto/qat/qat_common/adf_aer.c       | 10 +--
 .../crypto/qat/qat_common/adf_common_drv.h    |  3 +-
 drivers/crypto/qat/qat_dh895xcc/adf_drv.c     |  7 +--
 drivers/message/fusion/mptbase.c              |  7 +--
 drivers/message/fusion/mptbase.h              |  2 +-
 drivers/message/fusion/mptctl.c               |  4 +-
 drivers/message/fusion/mptlan.c               |  2 +-
 drivers/misc/cxl/guest.c                      | 24 ++++---
 drivers/misc/cxl/pci.c                        | 30 +++++----
 .../ethernet/hisilicon/hns3/hns3_ethtool.c    |  2 +-
 .../ethernet/marvell/prestera/prestera_pci.c  |  2 +-
 drivers/net/ethernet/mellanox/mlxsw/pci.c     |  2 +-
 .../ethernet/netronome/nfp/nfp_net_ethtool.c  |  3 +-
 drivers/pci/iov.c                             | 25 +++++---
 drivers/pci/pci-driver.c                      | 45 ++++++-------
 drivers/pci/pci.c                             |  4 +-
 drivers/pci/pcie/err.c                        | 36 ++++++-----
 drivers/pci/xen-pcifront.c                    | 63 +++++++++----------
 drivers/ssb/pcihost_wrapper.c                 |  6 +-
 drivers/usb/host/xhci-pci.c                   |  2 +-
 include/linux/pci.h                           |  1 -
 31 files changed, 161 insertions(+), 175 deletions(-)

Range-diff against v4:
 1:  8d064bbc74b0 =  1:  c2b53ab26a6b PCI: Simplify pci_device_remove()
 2:  966b49c308b4 =  2:  2c733e1d5186 PCI: Drop useless check from pci_device_probe()
 3:  ce710d6e8a1b =  3:  547ca5a7aa16 xen/pci: Drop some checks that are always true
 -:  ------------ >  4:  40eb07353844 bcma: simplify reference to the driver's name
 -:  ------------ >  5:  bab59c1dff6d powerpc/eeh: Don't use driver member of struct pci_dev and further cleanups
 -:  ------------ >  6:  abd70de9782d ssb: Simplify determination of driver name
 4:  3e4e6994a59d !  7:  735845bd26b9 PCI: replace pci_dev::driver usage that gets the driver name
    @@ Metadata
     Author: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
     
      ## Commit message ##
    -    PCI: replace pci_dev::driver usage that gets the driver name
    +    PCI: Replace pci_dev::driver usage that gets the driver name
     
         struct pci_dev::driver holds (apart from a constant offset) the same
         data as struct pci_dev::dev->driver. With the goal to remove struct
    @@ Commit message
     
         Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
     
    - ## arch/powerpc/include/asm/ppc-pci.h ##
    -@@ arch/powerpc/include/asm/ppc-pci.h: void eeh_sysfs_remove_device(struct pci_dev *pdev);
    - 
    - static inline const char *eeh_driver_name(struct pci_dev *pdev)
    - {
    --	return (pdev && pdev->driver) ? pdev->driver->name : "<null>";
    -+	if (pdev) {
    -+		const char *drvstr = dev_driver_string(&pdev->dev);
    -+
    -+		if (strcmp(drvstr, ""))
    -+			return drvstr;
    -+	}
    -+
    -+	return "<null>";
    - }
    - 
    - #endif /* CONFIG_EEH */
    -
    - ## drivers/bcma/host_pci.c ##
    -@@ drivers/bcma/host_pci.c: static int bcma_host_pci_probe(struct pci_dev *dev,
    - 	if (err)
    - 		goto err_kfree_bus;
    - 
    --	name = dev_name(&dev->dev);
    --	if (dev->driver && dev->driver->name)
    --		name = dev->driver->name;
    -+	name = dev_driver_string(&dev->dev);
    -+	if (!strcmp(name, ""))
    -+		name = dev_name(&dev->dev);
    -+
    - 	err = pci_request_regions(dev, name);
    - 	if (err)
    - 		goto err_pci_disable;
    -
      ## drivers/crypto/hisilicon/qm.c ##
     @@ drivers/crypto/hisilicon/qm.c: static int qm_alloc_uacce(struct hisi_qm *qm)
      	};
    @@ drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c: nfp_get_drvinfo(struct nfp
      	char nsp_version[ETHTOOL_FWVERS_LEN] = {};
      
     -	strlcpy(drvinfo->driver, pdev->driver->name, sizeof(drvinfo->driver));
    -+	strlcpy(drvinfo->driver, dev_driver_string(&pdev->dev), sizeof(drvinfo->driver));
    ++	strlcpy(drvinfo->driver, dev_driver_string(&pdev->dev),
    ++		sizeof(drvinfo->driver));
      	nfp_net_get_nspinfo(app, nsp_version);
      	snprintf(drvinfo->fw_version, sizeof(drvinfo->fw_version),
      		 "%s %s %s %s", vnic_version, nsp_version,
    -
    - ## drivers/ssb/pcihost_wrapper.c ##
    -@@ drivers/ssb/pcihost_wrapper.c: static int ssb_pcihost_probe(struct pci_dev *dev,
    - 	err = pci_enable_device(dev);
    - 	if (err)
    - 		goto err_kfree_ssb;
    --	name = dev_name(&dev->dev);
    --	if (dev->driver && dev->driver->name)
    --		name = dev->driver->name;
    -+
    -+	name = dev_driver_string(&dev->dev);
    -+	if (*name == '\0')
    -+		name = dev_name(&dev->dev);
    -+
    - 	err = pci_request_regions(dev, name);
    - 	if (err)
    - 		goto err_pci_disable;
 5:  574b743327b8 =  8:  1e58019165b9 scsi: message: fusion: Remove unused parameter of mpt_pci driver's probe()
 6:  674c6efd3dab !  9:  dea72a470141 crypto: qat - simplify adf_enable_aer()
    @@ drivers/crypto/qat/qat_4xxx/adf_drv.c: static struct pci_driver adf_driver = {
      	.probe = adf_probe,
      	.remove = adf_remove,
      	.sriov_configure = adf_sriov_configure,
    -+	.err_handler = adf_err_handler,
    ++	.err_handler = &adf_err_handler,
      };
      
      module_pci_driver(adf_driver);
    @@ drivers/crypto/qat/qat_c3xxx/adf_drv.c: static struct pci_driver adf_driver = {
      	.probe = adf_probe,
      	.remove = adf_remove,
      	.sriov_configure = adf_sriov_configure,
    -+	.err_handler = adf_err_handler,
    ++	.err_handler = &adf_err_handler,
      };
      
      static void adf_cleanup_pci_dev(struct adf_accel_dev *accel_dev)
    @@ drivers/crypto/qat/qat_c62x/adf_drv.c: static struct pci_driver adf_driver = {
      	.probe = adf_probe,
      	.remove = adf_remove,
      	.sriov_configure = adf_sriov_configure,
    -+	.err_handler = adf_err_handler,
    ++	.err_handler = &adf_err_handler,
      };
      
      static void adf_cleanup_pci_dev(struct adf_accel_dev *accel_dev)
    @@ drivers/crypto/qat/qat_common/adf_common_drv.h: void adf_ae_fw_release(struct ad
      int adf_ae_stop(struct adf_accel_dev *accel_dev);
      
     -int adf_enable_aer(struct adf_accel_dev *accel_dev);
    ++extern const struct pci_error_handlers adf_err_handler;
     +void adf_enable_aer(struct adf_accel_dev *accel_dev);
      void adf_disable_aer(struct adf_accel_dev *accel_dev);
      void adf_reset_sbr(struct adf_accel_dev *accel_dev);
    @@ drivers/crypto/qat/qat_dh895xcc/adf_drv.c: static struct pci_driver adf_driver =
      	.probe = adf_probe,
      	.remove = adf_remove,
      	.sriov_configure = adf_sriov_configure,
    -+	.err_handler = adf_err_handler,
    ++	.err_handler = &adf_err_handler,
      };
      
      static void adf_cleanup_pci_dev(struct adf_accel_dev *accel_dev)
 7:  0e022deb0f75 = 10:  b4165dda38ea PCI: Replace pci_dev::driver usage by pci_dev::dev.driver
 8:  edd9d24df02a = 11:  d93a138bd7ab PCI: Drop duplicated tracking of a pci_dev's bound driver

base-commit: 5816b3e6577eaa676ceb00a848f0fd65fe2adc29
-- 
2.30.2


^ permalink raw reply

* Re: [PATCH v2 4/4] powerpc/pseries/cpuhp: remove obsolete comment from pseries_cpu_die
From: Nathan Lynch @ 2021-09-29 12:06 UTC (permalink / raw)
  To: Michael Ellerman, Daniel Henrique Barboza, linuxppc-dev
  Cc: tyreld, ldufour, aneesh.kumar
In-Reply-To: <87h7e40wac.fsf@mpe.ellerman.id.au>

Michael Ellerman <mpe@ellerman.id.au> writes:
> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
>> This is enough to say that we can't easily see the history behind this comment.
>> I also believe that we're better of without it since it doesn't make sense
>> with the current codebase.
>
> It was added by the original CPU hotplug commit for ppc64::
>
> https://github.com/mpe/linux-fullhistory/commit/0e9fd9441cd2113b67b14e739267c9e69761489b
>
>
> The code was fairly similar:
>
> void __cpu_die(unsigned int cpu)
> {
> 	int tries;
> 	int cpu_status;
> 	unsigned int pcpu = get_hard_smp_processor_id(cpu);
>
> 	for (tries = 0; tries < 5; tries++) {
> 		cpu_status = query_cpu_stopped(pcpu);
>
> 		if (cpu_status == 0)
> 			break;
> 		set_current_state(TASK_UNINTERRUPTIBLE);
> 		schedule_timeout(HZ);
> 	}
> 	if (cpu_status != 0) {
> 		printk("Querying DEAD? cpu %i (%i) shows %i\n",
> 		       cpu, pcpu, cpu_status);
> 	}
>
> 	/* Isolation and deallocation are definatly done by
> 	 * drslot_chrp_cpu.  If they were not they would be
> 	 * done here.  Change isolate state to Isolate and
> 	 * change allocation-state to Unusable.
> 	 */
> 	paca[cpu].xProcStart = 0;
>
> 	/* So we can recognize if it fails to come up next time. */
> 	cpu_callin_map[cpu] = 0;
> }
>
>
> drslot_chrp_cpu() still exists in drmgr:
>
>   https://github.com/ibm-power-utilities/powerpc-utils/blob/e798c4a09fbf0fa0f421e624cfa366a6c405c9fe/src/drmgr/drslot_chrp_cpu.c#L406
>
>
> I agree the comment is no longer meaningful and can be removed.

Thanks for providing this background.

> It might be good to then add a comment explaining why we need to set
> cpu_start = 0.

Sure, I can take that as a follow-up. Or perhaps it should be moved to
the online path.

> It's not immediately clear why we need to. When we bring a CPU back
> online in smp_pSeries_kick_cpu() we ask RTAS to start it and then
> immediately set cpu_start = 1, ie. there isn't a separate step that sets
> cpu_start = 1 for hotplugged CPUs.

Hmm I'm not following the distinction you seem to be drawing between
bringing a CPU back online and a hotplugged CPU. kick_cpu is used in all
cases AFAIK.


^ permalink raw reply

* Re: [PATCH v4 5/8] bpf ppc64: Add BPF_PROBE_MEM support for JIT
From: Christophe Leroy @ 2021-09-29 11:47 UTC (permalink / raw)
  To: Hari Bathini, naveen.n.rao, mpe, ast, daniel
  Cc: Ravi Bangoria, songliubraving, netdev, john.fastabend, andrii,
	kpsingh, paulus, yhs, bpf, linuxppc-dev, kafai
In-Reply-To: <20210929111855.50254-6-hbathini@linux.ibm.com>



Le 29/09/2021 à 13:18, Hari Bathini a écrit :
> From: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
> 
> BPF load instruction with BPF_PROBE_MEM mode can cause a fault
> inside kernel. Append exception table for such instructions
> within BPF program.
> 
> Unlike other archs which uses extable 'fixup' field to pass dest_reg
> and nip, BPF exception table on PowerPC follows the generic PowerPC


For my curiosity, can you explain why we don't want and can't do the 
same on powerpc as on other archs ?


> exception table design, where it populates both fixup and extable
> sections within BPF program. fixup section contains two instructions,
> first instruction clears dest_reg and 2nd jumps to next instruction
> in the BPF code. extable 'insn' field contains relative offset of
> the instruction and 'fixup' field contains relative offset of the
> fixup entry. Example layout of BPF program with extable present:
> 
>               +------------------+
>               |                  |
>               |                  |
>     0x4020 -->| ld   r27,4(r3)   |
>               |                  |
>               |                  |
>     0x40ac -->| lwz  r3,0(r4)    |
>               |                  |
>               |                  |
>               |------------------|
>     0x4280 -->| li  r27,0        |  \ fixup entry
>               | b   0x4024       |  /
>     0x4288 -->| li  r3,0         |
>               | b   0x40b0       |
>               |------------------|
>     0x4290 -->| insn=0xfffffd90  |  \ extable entry
>               | fixup=0xffffffec |  /
>     0x4298 -->| insn=0xfffffe14  |
>               | fixup=0xffffffec |
>               +------------------+
> 
>     (Addresses shown here are chosen random, not real)
> 

^ permalink raw reply

* Re: [PATCH v4 8/8] bpf ppc32: Access only if addr is kernel address
From: Christophe Leroy @ 2021-09-29 11:45 UTC (permalink / raw)
  To: Hari Bathini, naveen.n.rao, mpe, ast, daniel
  Cc: songliubraving, netdev, john.fastabend, andrii, kpsingh, paulus,
	yhs, bpf, linuxppc-dev, kafai
In-Reply-To: <20210929111855.50254-9-hbathini@linux.ibm.com>



Le 29/09/2021 à 13:18, Hari Bathini a écrit :
> With KUAP enabled, any kernel code which wants to access userspace
> needs to be surrounded by disable-enable KUAP. But that is not
> happening for BPF_PROBE_MEM load instruction. Though PPC32 does not
> support read protection, considering the fact that PTR_TO_BTF_ID
> (which uses BPF_PROBE_MEM mode) could either be a valid kernel pointer
> or NULL but should never be a pointer to userspace address, execute
> BPF_PROBE_MEM load only if addr is kernel address, otherwise set
> dst_reg=0 and move on.
> 
> This will catch NULL, valid or invalid userspace pointers. Only bad
> kernel pointer will be handled by BPF exception table.
> 
> [Alexei suggested for x86]
> Suggested-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>

Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>

> ---
> 
> Changes in v4:
> * Adjusted the emit code to avoid using temporary reg.
> 
> 
>   arch/powerpc/net/bpf_jit_comp32.c | 34 +++++++++++++++++++++++++++++++
>   1 file changed, 34 insertions(+)
> 
> diff --git a/arch/powerpc/net/bpf_jit_comp32.c b/arch/powerpc/net/bpf_jit_comp32.c
> index 6ee13a09c70d..2ac81563c78d 100644
> --- a/arch/powerpc/net/bpf_jit_comp32.c
> +++ b/arch/powerpc/net/bpf_jit_comp32.c
> @@ -818,6 +818,40 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context *
>   		case BPF_LDX | BPF_PROBE_MEM | BPF_W:
>   		case BPF_LDX | BPF_MEM | BPF_DW: /* dst = *(u64 *)(ul) (src + off) */
>   		case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
> +			/*
> +			 * As PTR_TO_BTF_ID that uses BPF_PROBE_MEM mode could either be a valid
> +			 * kernel pointer or NULL but not a userspace address, execute BPF_PROBE_MEM
> +			 * load only if addr is kernel address (see is_kernel_addr()), otherwise
> +			 * set dst_reg=0 and move on.
> +			 */
> +			if (BPF_MODE(code) == BPF_PROBE_MEM) {
> +				PPC_LI32(_R0, TASK_SIZE - off);
> +				EMIT(PPC_RAW_CMPLW(src_reg, _R0));
> +				PPC_BCC(COND_GT, (ctx->idx + 5) * 4);
> +				EMIT(PPC_RAW_LI(dst_reg, 0));
> +				/*
> +				 * For BPF_DW case, "li reg_h,0" would be needed when
> +				 * !fp->aux->verifier_zext. Emit NOP otherwise.
> +				 *
> +				 * Note that "li reg_h,0" is emitted for BPF_B/H/W case,
> +				 * if necessary. So, jump there insted of emitting an
> +				 * additional "li reg_h,0" instruction.
> +				 */
> +				if (size == BPF_DW && !fp->aux->verifier_zext)
> +					EMIT(PPC_RAW_LI(dst_reg_h, 0));
> +				else
> +					EMIT(PPC_RAW_NOP());
> +				/*
> +				 * Need to jump two instructions instead of one for BPF_DW case
> +				 * as there are two load instructions for dst_reg_h & dst_reg
> +				 * respectively.
> +				 */
> +				if (size == BPF_DW)
> +					PPC_JMP((ctx->idx + 3) * 4);
> +				else
> +					PPC_JMP((ctx->idx + 2) * 4);
> +			}
> +
>   			switch (size) {
>   			case BPF_B:
>   				EMIT(PPC_RAW_LBZ(dst_reg, src_reg, off));
> 

^ permalink raw reply

* Re: [PATCH v4 7/8] bpf ppc32: Add BPF_PROBE_MEM support for JIT
From: Christophe Leroy @ 2021-09-29 11:44 UTC (permalink / raw)
  To: Hari Bathini, naveen.n.rao, mpe, ast, daniel
  Cc: songliubraving, netdev, john.fastabend, andrii, kpsingh, paulus,
	yhs, bpf, linuxppc-dev, kafai
In-Reply-To: <20210929111855.50254-8-hbathini@linux.ibm.com>



Le 29/09/2021 à 13:18, Hari Bathini a écrit :
> BPF load instruction with BPF_PROBE_MEM mode can cause a fault
> inside kernel. Append exception table for such instructions
> within BPF program.
> 
> Unlike other archs which uses extable 'fixup' field to pass dest_reg
> and nip, BPF exception table on PowerPC follows the generic PowerPC
> exception table design, where it populates both fixup and extable
> sections within BPF program. fixup section contains 3 instructions,
> first 2 instructions clear dest_reg (lower & higher 32-bit registers)
> and last instruction jumps to next instruction in the BPF code.
> extable 'insn' field contains relative offset of the instruction and
> 'fixup' field contains relative offset of the fixup entry. Example
> layout of BPF program with extable present:
> 
>               +------------------+
>               |                  |
>               |                  |
>     0x4020 -->| lwz   r28,4(r4)  |
>               |                  |
>               |                  |
>     0x40ac -->| lwz  r3,0(r24)   |
>               | lwz  r4,4(r24)   |
>               |                  |
>               |                  |
>               |------------------|
>     0x4278 -->| li  r28,0        |  \
>               | li  r27,0        |  | fixup entry
>               | b   0x4024       |  /
>     0x4284 -->| li  r4,0         |
>               | li  r3,0         |
>               | b   0x40b4       |
>               |------------------|
>     0x4290 -->| insn=0xfffffd90  |  \ extable entry
>               | fixup=0xffffffe4 |  /
>     0x4298 -->| insn=0xfffffe14  |
>               | fixup=0xffffffe8 |
>               +------------------+
> 
>     (Addresses shown here are chosen random, not real)
> 
> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>

Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>

> ---
> 
> Changes in v4:
> * Dropped explicit fallthrough statement for empty switch cases.
> 
> 
>   arch/powerpc/net/bpf_jit.h        |  4 ++++
>   arch/powerpc/net/bpf_jit_comp.c   |  2 ++
>   arch/powerpc/net/bpf_jit_comp32.c | 30 ++++++++++++++++++++++++++++++
>   3 files changed, 36 insertions(+)
> 
> diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
> index 561689a2abdf..800734056200 100644
> --- a/arch/powerpc/net/bpf_jit.h
> +++ b/arch/powerpc/net/bpf_jit.h
> @@ -144,7 +144,11 @@ struct codegen_context {
>   	unsigned int exentry_idx;
>   };
>   
> +#ifdef CONFIG_PPC32
> +#define BPF_FIXUP_LEN	3 /* Three instructions => 12 bytes */
> +#else
>   #define BPF_FIXUP_LEN	2 /* Two instructions => 8 bytes */
> +#endif
>   
>   static inline void bpf_flush_icache(void *start, void *end)
>   {
> diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
> index f02457c6b54f..1a0041997050 100644
> --- a/arch/powerpc/net/bpf_jit_comp.c
> +++ b/arch/powerpc/net/bpf_jit_comp.c
> @@ -297,6 +297,8 @@ int bpf_add_extable_entry(struct bpf_prog *fp, u32 *image, int pass, struct code
>   		(ctx->exentry_idx * BPF_FIXUP_LEN * 4);
>   
>   	fixup[0] = PPC_RAW_LI(dst_reg, 0);
> +	if (IS_ENABLED(CONFIG_PPC32))
> +		fixup[1] = PPC_RAW_LI(dst_reg - 1, 0); /* clear higher 32-bit register too */
>   
>   	fixup[BPF_FIXUP_LEN - 1] =
>   		PPC_RAW_BRANCH((long)(pc + jmp_off) - (long)&fixup[BPF_FIXUP_LEN - 1]);
> diff --git a/arch/powerpc/net/bpf_jit_comp32.c b/arch/powerpc/net/bpf_jit_comp32.c
> index c04291517a7e..6ee13a09c70d 100644
> --- a/arch/powerpc/net/bpf_jit_comp32.c
> +++ b/arch/powerpc/net/bpf_jit_comp32.c
> @@ -811,9 +811,13 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context *
>   		 * BPF_LDX
>   		 */
>   		case BPF_LDX | BPF_MEM | BPF_B: /* dst = *(u8 *)(ul) (src + off) */
> +		case BPF_LDX | BPF_PROBE_MEM | BPF_B:
>   		case BPF_LDX | BPF_MEM | BPF_H: /* dst = *(u16 *)(ul) (src + off) */
> +		case BPF_LDX | BPF_PROBE_MEM | BPF_H:
>   		case BPF_LDX | BPF_MEM | BPF_W: /* dst = *(u32 *)(ul) (src + off) */
> +		case BPF_LDX | BPF_PROBE_MEM | BPF_W:
>   		case BPF_LDX | BPF_MEM | BPF_DW: /* dst = *(u64 *)(ul) (src + off) */
> +		case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
>   			switch (size) {
>   			case BPF_B:
>   				EMIT(PPC_RAW_LBZ(dst_reg, src_reg, off));
> @@ -832,6 +836,32 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context *
>   
>   			if (size != BPF_DW && !fp->aux->verifier_zext)
>   				EMIT(PPC_RAW_LI(dst_reg_h, 0));
> +
> +			if (BPF_MODE(code) == BPF_PROBE_MEM) {
> +				int insn_idx = ctx->idx - 1;
> +				int jmp_off = 4;
> +
> +				/*
> +				 * In case of BPF_DW, two lwz instructions are emitted, one
> +				 * for higher 32-bit and another for lower 32-bit. So, set
> +				 * ex->insn to the first of the two and jump over both
> +				 * instructions in fixup.
> +				 *
> +				 * Similarly, with !verifier_zext, two instructions are
> +				 * emitted for BPF_B/H/W case. So, set ex->insn to the
> +				 * instruction that could fault and skip over both
> +				 * instructions.
> +				 */
> +				if (size == BPF_DW || !fp->aux->verifier_zext) {
> +					insn_idx -= 1;
> +					jmp_off += 4;
> +				}
> +
> +				ret = bpf_add_extable_entry(fp, image, pass, ctx, insn_idx,
> +							    jmp_off, dst_reg);
> +				if (ret)
> +					return ret;
> +			}
>   			break;
>   
>   		/*
> 

^ permalink raw reply

* Re: [PATCH v4 6/8] bpf ppc64: Access only if addr is kernel address
From: LEROY Christophe @ 2021-09-29 11:44 UTC (permalink / raw)
  To: Hari Bathini, naveen.n.rao@linux.ibm.com, mpe@ellerman.id.au,
	ast@kernel.org, daniel@iogearbox.net
  Cc: Ravi Bangoria, songliubraving@fb.com, netdev@vger.kernel.org,
	john.fastabend@gmail.com, andrii@kernel.org, kpsingh@kernel.org,
	paulus@samba.org, yhs@fb.com, bpf@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, kafai@fb.com
In-Reply-To: <20210929111855.50254-7-hbathini@linux.ibm.com>



Le 29/09/2021 à 13:18, Hari Bathini a écrit :
> From: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
> 
> On PPC64 with KUAP enabled, any kernel code which wants to
> access userspace needs to be surrounded by disable-enable KUAP.
> But that is not happening for BPF_PROBE_MEM load instruction.
> So, when BPF program tries to access invalid userspace address,
> page-fault handler considers it as bad KUAP fault:
> 
>    Kernel attempted to read user page (d0000000) - exploit attempt? (uid: 0)
> 
> Considering the fact that PTR_TO_BTF_ID (which uses BPF_PROBE_MEM
> mode) could either be a valid kernel pointer or NULL but should
> never be a pointer to userspace address, execute BPF_PROBE_MEM load
> only if addr is kernel address, otherwise set dst_reg=0 and move on.
> 
> This will catch NULL, valid or invalid userspace pointers. Only bad
> kernel pointer will be handled by BPF exception table.
> 
> [Alexei suggested for x86]
> Suggested-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>

Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>

> ---
> 
> Changes in v4:
> * Used IS_ENABLED() instead of #ifdef.
> * Dropped the else case that is not applicable for PPC64.
> 
> 
>   arch/powerpc/net/bpf_jit_comp64.c | 26 ++++++++++++++++++++++++++
>   1 file changed, 26 insertions(+)
> 
> diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
> index 4170999371ee..e1ea64081ae1 100644
> --- a/arch/powerpc/net/bpf_jit_comp64.c
> +++ b/arch/powerpc/net/bpf_jit_comp64.c
> @@ -727,6 +727,32 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context *
>   		/* dst = *(u64 *)(ul) (src + off) */
>   		case BPF_LDX | BPF_MEM | BPF_DW:
>   		case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
> +			/*
> +			 * As PTR_TO_BTF_ID that uses BPF_PROBE_MEM mode could either be a valid
> +			 * kernel pointer or NULL but not a userspace address, execute BPF_PROBE_MEM
> +			 * load only if addr is kernel address (see is_kernel_addr()), otherwise
> +			 * set dst_reg=0 and move on.
> +			 */
> +			if (BPF_MODE(code) == BPF_PROBE_MEM) {
> +				EMIT(PPC_RAW_ADDI(b2p[TMP_REG_1], src_reg, off));
> +				if (IS_ENABLED(CONFIG_PPC_BOOK3E_64))
> +					PPC_LI64(b2p[TMP_REG_2], 0x8000000000000000ul);
> +				else /* BOOK3S_64 */
> +					PPC_LI64(b2p[TMP_REG_2], PAGE_OFFSET);
> +				EMIT(PPC_RAW_CMPLD(b2p[TMP_REG_1], b2p[TMP_REG_2]));
> +				PPC_BCC(COND_GT, (ctx->idx + 4) * 4);
> +				EMIT(PPC_RAW_LI(dst_reg, 0));
> +				/*
> +				 * Check if 'off' is word aligned because PPC_BPF_LL()
> +				 * (BPF_DW case) generates two instructions if 'off' is not
> +				 * word-aligned and one instruction otherwise.
> +				 */
> +				if (BPF_SIZE(code) == BPF_DW && (off & 3))
> +					PPC_JMP((ctx->idx + 3) * 4);
> +				else
> +					PPC_JMP((ctx->idx + 2) * 4);
> +			}
> +
>   			switch (size) {
>   			case BPF_B:
>   				EMIT(PPC_RAW_LBZ(dst_reg, src_reg, off));
> 

^ permalink raw reply

* Re: [PATCH v4 5/8] bpf ppc64: Add BPF_PROBE_MEM support for JIT
From: Christophe Leroy @ 2021-09-29 11:43 UTC (permalink / raw)
  To: Hari Bathini, naveen.n.rao, mpe, ast, daniel
  Cc: Ravi Bangoria, songliubraving, netdev, john.fastabend, andrii,
	kpsingh, paulus, yhs, bpf, linuxppc-dev, kafai
In-Reply-To: <20210929111855.50254-6-hbathini@linux.ibm.com>



Le 29/09/2021 à 13:18, Hari Bathini a écrit :
> From: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
> 
> BPF load instruction with BPF_PROBE_MEM mode can cause a fault
> inside kernel. Append exception table for such instructions
> within BPF program.
> 
> Unlike other archs which uses extable 'fixup' field to pass dest_reg
> and nip, BPF exception table on PowerPC follows the generic PowerPC
> exception table design, where it populates both fixup and extable
> sections within BPF program. fixup section contains two instructions,
> first instruction clears dest_reg and 2nd jumps to next instruction
> in the BPF code. extable 'insn' field contains relative offset of
> the instruction and 'fixup' field contains relative offset of the
> fixup entry. Example layout of BPF program with extable present:
> 
>               +------------------+
>               |                  |
>               |                  |
>     0x4020 -->| ld   r27,4(r3)   |
>               |                  |
>               |                  |
>     0x40ac -->| lwz  r3,0(r4)    |
>               |                  |
>               |                  |
>               |------------------|
>     0x4280 -->| li  r27,0        |  \ fixup entry
>               | b   0x4024       |  /
>     0x4288 -->| li  r3,0         |
>               | b   0x40b0       |
>               |------------------|
>     0x4290 -->| insn=0xfffffd90  |  \ extable entry
>               | fixup=0xffffffec |  /
>     0x4298 -->| insn=0xfffffe14  |
>               | fixup=0xffffffec |
>               +------------------+
> 
>     (Addresses shown here are chosen random, not real)
> 
> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>

Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>

> ---
> 
> Changes in v4:
> * Dropped explicit fallthrough statement for empty switch cases.
> 
> 
>   arch/powerpc/net/bpf_jit.h        |  8 +++-
>   arch/powerpc/net/bpf_jit_comp.c   | 70 ++++++++++++++++++++++++++++---
>   arch/powerpc/net/bpf_jit_comp32.c |  2 +-
>   arch/powerpc/net/bpf_jit_comp64.c | 13 +++++-
>   4 files changed, 84 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
> index 0c8f885b8f48..561689a2abdf 100644
> --- a/arch/powerpc/net/bpf_jit.h
> +++ b/arch/powerpc/net/bpf_jit.h
> @@ -141,8 +141,11 @@ struct codegen_context {
>   	unsigned int idx;
>   	unsigned int stack_size;
>   	int b2p[ARRAY_SIZE(b2p)];
> +	unsigned int exentry_idx;
>   };
>   
> +#define BPF_FIXUP_LEN	2 /* Two instructions => 8 bytes */
> +
>   static inline void bpf_flush_icache(void *start, void *end)
>   {
>   	smp_wmb();	/* smp write barrier */
> @@ -166,11 +169,14 @@ static inline void bpf_clear_seen_register(struct codegen_context *ctx, int i)
>   
>   void bpf_jit_emit_func_call_rel(u32 *image, struct codegen_context *ctx, u64 func);
>   int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context *ctx,
> -		       u32 *addrs);
> +		       u32 *addrs, int pass);
>   void bpf_jit_build_prologue(u32 *image, struct codegen_context *ctx);
>   void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx);
>   void bpf_jit_realloc_regs(struct codegen_context *ctx);
>   
> +int bpf_add_extable_entry(struct bpf_prog *fp, u32 *image, int pass, struct codegen_context *ctx,
> +			  int insn_idx, int jmp_off, int dst_reg);
> +
>   #endif
>   
>   #endif
> diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
> index c5c9e8ad1de7..f02457c6b54f 100644
> --- a/arch/powerpc/net/bpf_jit_comp.c
> +++ b/arch/powerpc/net/bpf_jit_comp.c
> @@ -101,6 +101,8 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>   	struct bpf_prog *tmp_fp;
>   	bool bpf_blinded = false;
>   	bool extra_pass = false;
> +	u32 extable_len;
> +	u32 fixup_len;
>   
>   	if (!fp->jit_requested)
>   		return org_fp;
> @@ -131,7 +133,6 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>   		image = jit_data->image;
>   		bpf_hdr = jit_data->header;
>   		proglen = jit_data->proglen;
> -		alloclen = proglen + FUNCTION_DESCR_SIZE;
>   		extra_pass = true;
>   		goto skip_init_ctx;
>   	}
> @@ -149,7 +150,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>   	cgctx.stack_size = round_up(fp->aux->stack_depth, 16);
>   
>   	/* Scouting faux-generate pass 0 */
> -	if (bpf_jit_build_body(fp, 0, &cgctx, addrs)) {
> +	if (bpf_jit_build_body(fp, 0, &cgctx, addrs, 0)) {
>   		/* We hit something illegal or unsupported. */
>   		fp = org_fp;
>   		goto out_addrs;
> @@ -162,7 +163,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>   	 */
>   	if (cgctx.seen & SEEN_TAILCALL) {
>   		cgctx.idx = 0;
> -		if (bpf_jit_build_body(fp, 0, &cgctx, addrs)) {
> +		if (bpf_jit_build_body(fp, 0, &cgctx, addrs, 0)) {
>   			fp = org_fp;
>   			goto out_addrs;
>   		}
> @@ -177,8 +178,11 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>   	bpf_jit_build_prologue(0, &cgctx);
>   	bpf_jit_build_epilogue(0, &cgctx);
>   
> +	fixup_len = fp->aux->num_exentries * BPF_FIXUP_LEN * 4;
> +	extable_len = fp->aux->num_exentries * sizeof(struct exception_table_entry);
> +
>   	proglen = cgctx.idx * 4;
> -	alloclen = proglen + FUNCTION_DESCR_SIZE;
> +	alloclen = proglen + FUNCTION_DESCR_SIZE + fixup_len + extable_len;
>   
>   	bpf_hdr = bpf_jit_binary_alloc(alloclen, &image, 4, bpf_jit_fill_ill_insns);
>   	if (!bpf_hdr) {
> @@ -186,6 +190,9 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>   		goto out_addrs;
>   	}
>   
> +	if (extable_len)
> +		fp->aux->extable = (void *)image + FUNCTION_DESCR_SIZE + proglen + fixup_len;
> +
>   skip_init_ctx:
>   	code_base = (u32 *)(image + FUNCTION_DESCR_SIZE);
>   
> @@ -210,7 +217,11 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>   		/* Now build the prologue, body code & epilogue for real. */
>   		cgctx.idx = 0;
>   		bpf_jit_build_prologue(code_base, &cgctx);
> -		bpf_jit_build_body(fp, code_base, &cgctx, addrs);
> +		if (bpf_jit_build_body(fp, code_base, &cgctx, addrs, pass)) {
> +			bpf_jit_binary_free(bpf_hdr);
> +			fp = org_fp;
> +			goto out_addrs;
> +		}
>   		bpf_jit_build_epilogue(code_base, &cgctx);
>   
>   		if (bpf_jit_enable > 1)
> @@ -234,7 +245,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>   
>   	fp->bpf_func = (void *)image;
>   	fp->jited = 1;
> -	fp->jited_len = alloclen;
> +	fp->jited_len = proglen + FUNCTION_DESCR_SIZE;
>   
>   	bpf_flush_icache(bpf_hdr, (u8 *)bpf_hdr + (bpf_hdr->pages * PAGE_SIZE));
>   	bpf_jit_binary_lock_ro(bpf_hdr);
> @@ -258,3 +269,50 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>   
>   	return fp;
>   }
> +
> +/*
> + * The caller should check for (BPF_MODE(code) == BPF_PROBE_MEM) before calling
> + * this function, as this only applies to BPF_PROBE_MEM, for now.
> + */
> +int bpf_add_extable_entry(struct bpf_prog *fp, u32 *image, int pass, struct codegen_context *ctx,
> +			  int insn_idx, int jmp_off, int dst_reg)
> +{
> +	off_t offset;
> +	unsigned long pc;
> +	struct exception_table_entry *ex;
> +	u32 *fixup;
> +
> +	/* Populate extable entries only in the last pass */
> +	if (pass != 2)
> +		return 0;
> +
> +	if (!fp->aux->extable ||
> +	    WARN_ON_ONCE(ctx->exentry_idx >= fp->aux->num_exentries))
> +		return -EINVAL;
> +
> +	pc = (unsigned long)&image[insn_idx];
> +
> +	fixup = (void *)fp->aux->extable -
> +		(fp->aux->num_exentries * BPF_FIXUP_LEN * 4) +
> +		(ctx->exentry_idx * BPF_FIXUP_LEN * 4);
> +
> +	fixup[0] = PPC_RAW_LI(dst_reg, 0);
> +
> +	fixup[BPF_FIXUP_LEN - 1] =
> +		PPC_RAW_BRANCH((long)(pc + jmp_off) - (long)&fixup[BPF_FIXUP_LEN - 1]);
> +
> +	ex = &fp->aux->extable[ctx->exentry_idx];
> +
> +	offset = pc - (long)&ex->insn;
> +	if (WARN_ON_ONCE(offset >= 0 || offset < INT_MIN))
> +		return -ERANGE;
> +	ex->insn = offset;
> +
> +	offset = (long)fixup - (long)&ex->fixup;
> +	if (WARN_ON_ONCE(offset >= 0 || offset < INT_MIN))
> +		return -ERANGE;
> +	ex->fixup = offset;
> +
> +	ctx->exentry_idx++;
> +	return 0;
> +}
> diff --git a/arch/powerpc/net/bpf_jit_comp32.c b/arch/powerpc/net/bpf_jit_comp32.c
> index 65a4d1ed97bf..c04291517a7e 100644
> --- a/arch/powerpc/net/bpf_jit_comp32.c
> +++ b/arch/powerpc/net/bpf_jit_comp32.c
> @@ -266,7 +266,7 @@ static void bpf_jit_emit_tail_call(u32 *image, struct codegen_context *ctx, u32
>   
>   /* Assemble the body code between the prologue & epilogue */
>   int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context *ctx,
> -		       u32 *addrs)
> +		       u32 *addrs, int pass)
>   {
>   	const struct bpf_insn *insn = fp->insnsi;
>   	int flen = fp->len;
> diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
> index 49e6e0b6e4d2..4170999371ee 100644
> --- a/arch/powerpc/net/bpf_jit_comp64.c
> +++ b/arch/powerpc/net/bpf_jit_comp64.c
> @@ -272,7 +272,7 @@ static void bpf_jit_emit_tail_call(u32 *image, struct codegen_context *ctx, u32
>   
>   /* Assemble the body code between the prologue & epilogue */
>   int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context *ctx,
> -		       u32 *addrs)
> +		       u32 *addrs, int pass)
>   {
>   	const struct bpf_insn *insn = fp->insnsi;
>   	int flen = fp->len;
> @@ -717,12 +717,16 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context *
>   		 */
>   		/* dst = *(u8 *)(ul) (src + off) */
>   		case BPF_LDX | BPF_MEM | BPF_B:
> +		case BPF_LDX | BPF_PROBE_MEM | BPF_B:
>   		/* dst = *(u16 *)(ul) (src + off) */
>   		case BPF_LDX | BPF_MEM | BPF_H:
> +		case BPF_LDX | BPF_PROBE_MEM | BPF_H:
>   		/* dst = *(u32 *)(ul) (src + off) */
>   		case BPF_LDX | BPF_MEM | BPF_W:
> +		case BPF_LDX | BPF_PROBE_MEM | BPF_W:
>   		/* dst = *(u64 *)(ul) (src + off) */
>   		case BPF_LDX | BPF_MEM | BPF_DW:
> +		case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
>   			switch (size) {
>   			case BPF_B:
>   				EMIT(PPC_RAW_LBZ(dst_reg, src_reg, off));
> @@ -740,6 +744,13 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context *
>   
>   			if (size != BPF_DW && insn_is_zext(&insn[i + 1]))
>   				addrs[++i] = ctx->idx * 4;
> +
> +			if (BPF_MODE(code) == BPF_PROBE_MEM) {
> +				ret = bpf_add_extable_entry(fp, image, pass, ctx, ctx->idx - 1,
> +							    4, dst_reg);
> +				if (ret)
> +					return ret;
> +			}
>   			break;
>   
>   		/*
> 

^ permalink raw reply

* Re: [PATCH v4 3/8] bpf powerpc: refactor JIT compiler code
From: Christophe Leroy @ 2021-09-29 11:42 UTC (permalink / raw)
  To: Hari Bathini, naveen.n.rao, mpe, ast, daniel
  Cc: songliubraving, netdev, john.fastabend, andrii, kpsingh, paulus,
	yhs, bpf, linuxppc-dev, kafai
In-Reply-To: <20210929111855.50254-4-hbathini@linux.ibm.com>



Le 29/09/2021 à 13:18, Hari Bathini a écrit :
> Refactor powerpc LDX JITing code to simplify adding BPF_PROBE_MEM
> support.
> 
> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>

Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>

> ---
> 
> Changes in v4:
> * Dropped the default case in the switch statement for bpf size.
> * Dropped explicit fallthrough statement for empty switch cases.
> 
> 
>   arch/powerpc/net/bpf_jit_comp32.c | 33 ++++++++++++++++++-------------
>   arch/powerpc/net/bpf_jit_comp64.c | 31 +++++++++++++++++------------
>   2 files changed, 37 insertions(+), 27 deletions(-)
> 
> diff --git a/arch/powerpc/net/bpf_jit_comp32.c b/arch/powerpc/net/bpf_jit_comp32.c
> index b60b59426a24..65a4d1ed97bf 100644
> --- a/arch/powerpc/net/bpf_jit_comp32.c
> +++ b/arch/powerpc/net/bpf_jit_comp32.c
> @@ -282,6 +282,7 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context *
>   		u32 src_reg = bpf_to_ppc(ctx, insn[i].src_reg);
>   		u32 src_reg_h = src_reg - 1;
>   		u32 tmp_reg = bpf_to_ppc(ctx, TMP_REG);
> +		u32 size = BPF_SIZE(code);
>   		s16 off = insn[i].off;
>   		s32 imm = insn[i].imm;
>   		bool func_addr_fixed;
> @@ -810,23 +811,27 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context *
>   		 * BPF_LDX
>   		 */
>   		case BPF_LDX | BPF_MEM | BPF_B: /* dst = *(u8 *)(ul) (src + off) */
> -			EMIT(PPC_RAW_LBZ(dst_reg, src_reg, off));
> -			if (!fp->aux->verifier_zext)
> -				EMIT(PPC_RAW_LI(dst_reg_h, 0));
> -			break;
>   		case BPF_LDX | BPF_MEM | BPF_H: /* dst = *(u16 *)(ul) (src + off) */
> -			EMIT(PPC_RAW_LHZ(dst_reg, src_reg, off));
> -			if (!fp->aux->verifier_zext)
> -				EMIT(PPC_RAW_LI(dst_reg_h, 0));
> -			break;
>   		case BPF_LDX | BPF_MEM | BPF_W: /* dst = *(u32 *)(ul) (src + off) */
> -			EMIT(PPC_RAW_LWZ(dst_reg, src_reg, off));
> -			if (!fp->aux->verifier_zext)
> -				EMIT(PPC_RAW_LI(dst_reg_h, 0));
> -			break;
>   		case BPF_LDX | BPF_MEM | BPF_DW: /* dst = *(u64 *)(ul) (src + off) */
> -			EMIT(PPC_RAW_LWZ(dst_reg_h, src_reg, off));
> -			EMIT(PPC_RAW_LWZ(dst_reg, src_reg, off + 4));
> +			switch (size) {
> +			case BPF_B:
> +				EMIT(PPC_RAW_LBZ(dst_reg, src_reg, off));
> +				break;
> +			case BPF_H:
> +				EMIT(PPC_RAW_LHZ(dst_reg, src_reg, off));
> +				break;
> +			case BPF_W:
> +				EMIT(PPC_RAW_LWZ(dst_reg, src_reg, off));
> +				break;
> +			case BPF_DW:
> +				EMIT(PPC_RAW_LWZ(dst_reg_h, src_reg, off));
> +				EMIT(PPC_RAW_LWZ(dst_reg, src_reg, off + 4));
> +				break;
> +			}
> +
> +			if (size != BPF_DW && !fp->aux->verifier_zext)
> +				EMIT(PPC_RAW_LI(dst_reg_h, 0));
>   			break;
>   
>   		/*
> diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
> index 2a87da50d9a4..49e6e0b6e4d2 100644
> --- a/arch/powerpc/net/bpf_jit_comp64.c
> +++ b/arch/powerpc/net/bpf_jit_comp64.c
> @@ -285,6 +285,7 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context *
>   		u32 code = insn[i].code;
>   		u32 dst_reg = b2p[insn[i].dst_reg];
>   		u32 src_reg = b2p[insn[i].src_reg];
> +		u32 size = BPF_SIZE(code);
>   		s16 off = insn[i].off;
>   		s32 imm = insn[i].imm;
>   		bool func_addr_fixed;
> @@ -716,25 +717,29 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context *
>   		 */
>   		/* dst = *(u8 *)(ul) (src + off) */
>   		case BPF_LDX | BPF_MEM | BPF_B:
> -			EMIT(PPC_RAW_LBZ(dst_reg, src_reg, off));
> -			if (insn_is_zext(&insn[i + 1]))
> -				addrs[++i] = ctx->idx * 4;
> -			break;
>   		/* dst = *(u16 *)(ul) (src + off) */
>   		case BPF_LDX | BPF_MEM | BPF_H:
> -			EMIT(PPC_RAW_LHZ(dst_reg, src_reg, off));
> -			if (insn_is_zext(&insn[i + 1]))
> -				addrs[++i] = ctx->idx * 4;
> -			break;
>   		/* dst = *(u32 *)(ul) (src + off) */
>   		case BPF_LDX | BPF_MEM | BPF_W:
> -			EMIT(PPC_RAW_LWZ(dst_reg, src_reg, off));
> -			if (insn_is_zext(&insn[i + 1]))
> -				addrs[++i] = ctx->idx * 4;
> -			break;
>   		/* dst = *(u64 *)(ul) (src + off) */
>   		case BPF_LDX | BPF_MEM | BPF_DW:
> -			PPC_BPF_LL(dst_reg, src_reg, off);
> +			switch (size) {
> +			case BPF_B:
> +				EMIT(PPC_RAW_LBZ(dst_reg, src_reg, off));
> +				break;
> +			case BPF_H:
> +				EMIT(PPC_RAW_LHZ(dst_reg, src_reg, off));
> +				break;
> +			case BPF_W:
> +				EMIT(PPC_RAW_LWZ(dst_reg, src_reg, off));
> +				break;
> +			case BPF_DW:
> +				PPC_BPF_LL(dst_reg, src_reg, off);
> +				break;
> +			}
> +
> +			if (size != BPF_DW && insn_is_zext(&insn[i + 1]))
> +				addrs[++i] = ctx->idx * 4;
>   			break;
>   
>   		/*
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox