LinuxPPC-Dev Archive on lore.kernel.org
* Re: [PATCH 00/15 v4] tick/sched: Refactor idle cputime accounting
From: Frederic Weisbecker @ 2026-05-26 10:42 UTC (permalink / raw)
  To: LKML, Peter Zijlstra, Thomas Gleixner
  Cc: Madhavan Srinivasan, Jan Kiszka, Dietmar Eggemann,
	Shrikanth Hegde, Nicholas Piggin, Alexander Gordeev, Ben Segall,
	Vasily Gorbik, Rafael J. Wysocki, linux-pm, Sashiko, Ingo Molnar,
	Michael Ellerman, Boqun Feng, Valentin Schneider, linuxppc-dev,
	Sven Schnelle, Ingo Molnar, Vincent Guittot,
	Christian Borntraeger, Mel Gorman, Steven Rostedt, Joel Fernandes,
	Paul E . McKenney, Neeraj Upadhyay, Anna-Maria Behnsen,
	Christophe Leroy (CS GROUP), Juri Lelli, Uladzislau Rezki,
	Viresh Kumar, Kieran Bingham, Xin Zhao, linux-s390,
	Heiko Carstens
In-Reply-To: <20260508131647.43868-1-frederic@kernel.org>

Hi,

I don't see any further concern. What should we do with this? It could
either go through the scheduler tree or the timer tree.

Thanks.


On Fri, May 08, 2026 at 03:16:32PM +0200, Frederic Weisbecker wrote:
> Hi,
> 
> After the issue reported here:
> 
>         https://lore.kernel.org/all/20251210083135.3993562-1-jackzxcui1989@163.com/
> 
> It turns out that the idle cputime accounting is a big mess that
> accumulates within two concurrent statistics, each having their own
> shortcomings:
> 
> * The accounting for online CPUs which is based on the delta between
>   tick_nohz_start_idle() and tick_nohz_stop_idle().
> 
>   Pros:
>        - Works when the tick is off
> 
>        - Has nsecs granularity
> 
>   Cons:
>        - Accounts idle steal time but doesn't subtract it from idle
>          cputime.
> 
>        - Assumes CONFIG_IRQ_TIME_ACCOUNTING by not accounting IRQs but
>          the IRQ time is simply ignored when
>          CONFIG_IRQ_TIME_ACCOUNTING=n
> 
>        - The windows between 1) idle task scheduling and the first call
>          to tick_nohz_start_idle() and 2) idle task between the last
>          tick_nohz_stop_idle() and the rest of the idle time are
>          blindspots wrt. cputime accounting (though mostly insignificant
>          amount)
> 
>        - Relies on private fields outside of kernel stats, with specific
>          accessors.
> 
> * The accounting for offline CPUs which is based on ticks and the
>   jiffies delta during which the tick was stopped.
> 
>   Pros:
>        - Handles steal time correctly
> 
>        - Handles CONFIG_IRQ_TIME_ACCOUNTING=y and
>          CONFIG_IRQ_TIME_ACCOUNTING=n correctly.
> 
>        - Handles the whole idle task
> 
>        - Accounts directly to kernel stats, without midlayer accumulator.
> 
>    Cons:
>        - Doesn't elapse when the tick is off, which doesn't make it
>          suitable for online CPUs.
> 
>        - Has TICK_NSEC granularity (jiffies)
> 
>        - Needs to track the dyntick-idle ticks that were accounted and
>          subtract them from the total jiffies time spent while the tick
>          was stopped. This is an ugly workaround.
> 
> Having two different accountings for a single context is not the only
> problem: since those accountings are of different natures, it is
> possible to observe the global idle time going backward after a CPU goes
> offline, as reported by Xin Zhao.
> 
> Clean up the situation by introducing a hybrid approach that stays
> coherent, fixes the backward jumps and works for both online and offline
> CPUs:
> 
> * Tick-based or native vtime accounting operates before the tick is
>   stopped and resumes once the tick is restarted.
> 
> * When the idle loop starts, switch to dynticks-idle accounting as is
>   done currently, except that the statistics accumulate directly to the
>   relevant kernel stat fields.
> 
> * Private dyntick cputime accounting fields are removed.
> 
> * Works in both the online and offline cases.
> 
> * Move most of the relevant code to the common sched/cputime subsystem
> 
> * Handle CONFIG_IRQ_TIME_ACCOUNTING=n correctly such that the
>   dynticks-idle accounting still elapses while on IRQs.
> 
> * Correctly subtract idle steal cputime from idle time
> 
> Changes since v3 (among which a lot of relevant reviews from Sashiko):
> 
> - Add new tags
> 
> - Rebase on latest -rc1
> 
> - Add "tick/sched: Fix TOCTOU in nohz idle time fetch" (Sashiko)
> 
> - Fix buggy state refetch in kcpustat_cpu_fetch_vtime() (Sashiko)
> 
> - Fix build issue on powerpc (Christophe Leroy)
> 
> - Fix s390 lost steal time occurring on idle IRQs (call vtime_flush() on
>   vtime_account_hardirq() and vtime_account_softirq()) (Sashiko)
> 
> - Fix build issue on s390
> 
> - Fix uninitialized idle_sleeptime_seq (Sashiko)
> 
> - Fix irqtime being disabled or enabled in the middle of an idle IRQ
>   (Sashiko)
>   
> - Fix tick restart and then restop in the same idle loop (Sashiko)
> 
> - Fix "sched/cputime: Handle idle irqtime gracefully" changelog (Sashiko)
> 
> - Fix idle steal time subtracted from the wrong index between idle and
>   iowait kcpustat. (Sashiko)
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> 	timers/core-v4
> 
> HEAD: e64ba052ce04e363ff76d3cb8bedc5f812188acb
> Thanks,
> 	Frederic
> ---
> 
> Frederic Weisbecker (15):
>       tick/sched: Fix TOCTOU in nohz idle time fetch
>       sched/idle: Handle offlining first in idle loop
>       sched/cputime: Remove superfluous and error prone kcpustat_field() parameter
>       sched/cputime: Correctly support generic vtime idle time
>       powerpc/time: Prepare to stop elapsing in dynticks-idle
>       s390/time: Prepare to stop elapsing in dynticks-idle
>       tick/sched: Unify idle cputime accounting
>       tick/sched: Remove nohz disabled special case in cputime fetch
>       tick/sched: Move dyntick-idle cputime accounting to cputime code
>       tick/sched: Remove unused fields
>       tick/sched: Account tickless idle cputime only when tick is stopped
>       tick/sched: Consolidate idle time fetching APIs
>       sched/cputime: Provide get_cpu_[idle|iowait]_time_us() off-case
>       sched/cputime: Handle idle irqtime gracefully
>       sched/cputime: Handle dyntick-idle steal time correctly
> 
>  arch/powerpc/kernel/time.c         |  41 +++++
>  arch/s390/include/asm/idle.h       |   2 +
>  arch/s390/kernel/idle.c            |   5 +-
>  arch/s390/kernel/vtime.c           |  75 ++++++++-
>  drivers/cpufreq/cpufreq.c          |  29 +---
>  drivers/cpufreq/cpufreq_governor.c |   6 +-
>  drivers/macintosh/rack-meter.c     |   2 +-
>  fs/proc/stat.c                     |  40 +----
>  fs/proc/uptime.c                   |   8 +-
>  include/linux/kernel_stat.h        |  76 +++++++--
>  include/linux/tick.h               |   4 -
>  include/linux/vtime.h              |  22 ++-
>  kernel/rcu/tree.c                  |   9 +-
>  kernel/rcu/tree_stall.h            |   7 +-
>  kernel/sched/core.c                |   6 +-
>  kernel/sched/cputime.c             | 308 +++++++++++++++++++++++++++++++------
>  kernel/sched/idle.c                |  13 +-
>  kernel/time/tick-sched.c           | 212 ++++++-------------------
>  kernel/time/tick-sched.h           |  12 --
>  kernel/time/timer_list.c           |   6 +-
>  scripts/gdb/linux/timerlist.py     |   4 -
>  21 files changed, 529 insertions(+), 358 deletions(-)

-- 
Frederic Weisbecker
SUSE Labs
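
Archive note: the backward jump described in the cover letter can be
pictured with a small user-space model. This is only a sketch under
stated assumptions, not the kernel code: the struct, the function names
and the 250 Hz tick length are hypothetical, and the model simply
assumes that an online CPU is read through an nsec-granular counter
while an offline CPU is read through a tick-granular one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TICK_NSEC_SKETCH 4000000ULL	/* hypothetical 250 Hz tick, in ns */

/* Hypothetical per-CPU state holding both idle statistics. */
struct cpu_idle_sketch {
	uint64_t nohz_idle_ns;	/* nsec-granular delta accounting */
	uint64_t tick_idle_ns;	/* tick-granular accounting */
	int online;
};

/* Account one idle period in both statistics. */
static void account_idle(struct cpu_idle_sketch *c, uint64_t ns)
{
	assert(c != NULL);
	c->nohz_idle_ns += ns;
	/* The tick-based view only sees whole ticks. */
	c->tick_idle_ns += (ns / TICK_NSEC_SKETCH) * TICK_NSEC_SKETCH;
}

/* The reader picks a different statistic depending on the CPU state. */
static uint64_t read_idle(const struct cpu_idle_sketch *c)
{
	return c->online ? c->nohz_idle_ns : c->tick_idle_ns;
}

/* Returns 1 if the reported idle time drops when the CPU goes offline. */
int idle_went_backward(uint64_t idle_period_ns)
{
	struct cpu_idle_sketch c = { .online = 1 };
	uint64_t before, after;

	account_idle(&c, idle_period_ns);
	before = read_idle(&c);
	c.online = 0;
	after = read_idle(&c);
	return after < before;
}
```

In this model an idle period of 10.5 ms is reported as 10.5 ms while the
CPU is online and as 8 ms (two whole ticks) once it is offline, so a
reader summing idle time sees the total shrink. A single statistic used
for both states avoids that by construction.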



* [PATCH] [net-next] net: dsa: netc: fix enetc dependencies
From: Arnd Bergmann @ 2026-05-26 10:26 UTC (permalink / raw)
  To: Wei Fang, Claudiu Manoil, Clark Wang, Christophe Leroy (CS GROUP)
  Cc: Arnd Bergmann, Andrew Lunn, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, imx, netdev,
	linux-kernel, linuxppc-dev, linux-arm-kernel

From: Arnd Bergmann <arnd@arndb.de>

The newly added netc dsa support has incorrect Kconfig dependencies,
leading to Kconfig and link failures:

WARNING: unmet direct dependencies detected for FSL_ENETC_MDIO
  Depends on [n]: NETDEVICES [=y] && ETHERNET [=n] && NET_VENDOR_FREESCALE [=n] && PCI [=y] && PHYLIB [=y]
  Selected by [y]:
  - NET_DSA_NETC_SWITCH [=y] && NETDEVICES [=y] && (ARM64 || COMPILE_TEST [=y]) && NET_DSA [=y] && PCI [=y]
WARNING: unmet direct dependencies detected for NXP_NTMP
  Depends on [n]: NETDEVICES [=y] && ETHERNET [=n] && NET_VENDOR_FREESCALE [=n]
  Selected by [m]:
  - NET_DSA_NETC_SWITCH [=m] && NETDEVICES [=y] && (ARM64 || COMPILE_TEST [=y]) && NET_DSA [=m] && PCI [=y]
ERROR: modpost: "enetc_mdio_read_c22" [drivers/net/dsa/netc/nxp-netc-switch.ko] undefined!
ERROR: modpost: "ntmp_fdbt_delete_entry" [drivers/net/dsa/netc/nxp-netc-switch.ko] undefined!
ERROR: modpost: "enetc_mdio_read_c45" [drivers/net/dsa/netc/nxp-netc-switch.ko] undefined!

Add the required 'NET_VENDOR_FREESCALE' dependency to make it possible
to select both the PHY and NTMP library modules. Originally this was
meant to be done through an 'IS_REACHABLE' check in the header file,
but that did not work because the drivers/net/ethernet/freescale
directory is not even entered in this case. IS_REACHABLE() is generally
counterproductive because even when it works as intended, it turns
a helpful link-time error into a silent runtime failure that is
harder to debug. In this case, it clearly did not even do anything.

Fixes: 187fbae024c8 ("net: dsa: netc: introduce NXP NETC switch driver for i.MX94")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/dsa/netc/Kconfig   |  1 +
 include/linux/fsl/enetc_mdio.h | 22 ----------------------
 2 files changed, 1 insertion(+), 22 deletions(-)

diff --git a/drivers/net/dsa/netc/Kconfig b/drivers/net/dsa/netc/Kconfig
index d2f78d74ac23..eaad3cb5babe 100644
--- a/drivers/net/dsa/netc/Kconfig
+++ b/drivers/net/dsa/netc/Kconfig
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config NET_DSA_NETC_SWITCH
 	tristate "NXP NETC Ethernet switch support"
+	depends on NET_VENDOR_FREESCALE
 	depends on ARM64 || COMPILE_TEST
 	depends on NET_DSA && PCI
 	select NET_DSA_TAG_NETC
diff --git a/include/linux/fsl/enetc_mdio.h b/include/linux/fsl/enetc_mdio.h
index 623ccfcbf39c..7cd5be694cc4 100644
--- a/include/linux/fsl/enetc_mdio.h
+++ b/include/linux/fsl/enetc_mdio.h
@@ -35,8 +35,6 @@ struct enetc_mdio_priv {
 	int mdio_base;
 };
 
-#if IS_REACHABLE(CONFIG_FSL_ENETC_MDIO)
-
 int enetc_mdio_read_c22(struct mii_bus *bus, int phy_id, int regnum);
 int enetc_mdio_write_c22(struct mii_bus *bus, int phy_id, int regnum,
 			 u16 value);
@@ -45,24 +43,4 @@ int enetc_mdio_write_c45(struct mii_bus *bus, int phy_id, int devad, int regnum,
 			 u16 value);
 struct enetc_hw *enetc_hw_alloc(struct device *dev, void __iomem *port_regs);
 
-#else
-
-static inline int enetc_mdio_read_c22(struct mii_bus *bus, int phy_id,
-				      int regnum)
-{ return -EINVAL; }
-static inline int enetc_mdio_write_c22(struct mii_bus *bus, int phy_id,
-				       int regnum, u16 value)
-{ return -EINVAL; }
-static inline int enetc_mdio_read_c45(struct mii_bus *bus, int phy_id,
-				      int devad, int regnum)
-{ return -EINVAL; }
-static inline int enetc_mdio_write_c45(struct mii_bus *bus, int phy_id,
-				       int devad, int regnum, u16 value)
-{ return -EINVAL; }
-static inline struct enetc_hw *enetc_hw_alloc(struct device *dev,
-					      void __iomem *port_regs)
-{ return ERR_PTR(-EINVAL); }
-
-#endif
-
 #endif
-- 
2.39.5
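
Archive note: the point about IS_REACHABLE() hiding a link error behind
a runtime failure can be illustrated with a user-space sketch. All
names here (SKETCH_PROVIDER_REACHABLE, sketch_mdio_read, sketch_probe)
are hypothetical stand-ins, not the enetc API; the macro plays the role
of an IS_REACHABLE() check that evaluates to false.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for IS_REACHABLE(CONFIG_...): 0 = provider not built. */
#define SKETCH_PROVIDER_REACHABLE 0

#if SKETCH_PROVIDER_REACHABLE
/* Real implementation would live in the provider module. */
int sketch_mdio_read(int phy_id, int regnum);
#else
/* Stub: the caller compiles and links fine, but every call fails at run time. */
static inline int sketch_mdio_read(int phy_id, int regnum)
{
	(void)phy_id;
	(void)regnum;
	return -EINVAL;
}
#endif

/* Hypothetical caller: nothing at build time hints that this cannot work. */
int sketch_probe(int phy_id)
{
	int val;

	assert(phy_id >= 0 && phy_id < 32);	/* clause 22 PHY addresses are 5 bits */
	val = sketch_mdio_read(phy_id, 1);
	return val < 0 ? val : 0;
}
```

With the stub in place, sketch_probe() builds cleanly and returns
-EINVAL on every call. Without the stub, the same misconfiguration
would surface as an undefined symbol at link time, which is the more
helpful failure the commit message argues for.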




* [PATCH 2/2] powerpc: export memory encryption helper functions
From: Arnd Bergmann @ 2026-05-26 10:20 UTC (permalink / raw)
  To: Madhavan Srinivasan, Michael Ellerman, T.J. Mercier,
	Maxime Ripard, Sumit Semwal, Andrew Davis
  Cc: Arnd Bergmann, Nicholas Piggin, Christophe Leroy (CS GROUP),
	linuxppc-dev, linux-kernel
In-Reply-To: <20260526102113.2594501-1-arnd@kernel.org>

From: Arnd Bergmann <arnd@arndb.de>

The set_memory_encrypted/set_memory_decrypted functions are exported
on x86 and arm64 but not on powerpc, which leads to a new build failure
because they are now used in a loadable module:

ERROR: modpost: "set_memory_encrypted" [drivers/dma-buf/heaps/system_heap.ko] undefined!
ERROR: modpost: "set_memory_decrypted" [drivers/dma-buf/heaps/system_heap.ko] undefined!

Export these the same way we do on the other architectures.

Fixes: fd55edff8a0a ("dma-buf: heaps: system: Turn the heap into a module")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 arch/powerpc/platforms/pseries/svm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/svm.c b/arch/powerpc/platforms/pseries/svm.c
index 384c9dc1899a..ab8f8c722741 100644
--- a/arch/powerpc/platforms/pseries/svm.c
+++ b/arch/powerpc/platforms/pseries/svm.c
@@ -6,6 +6,7 @@
  * Author: Anshuman Khandual <khandual@linux.vnet.ibm.com>
  */
 
+#include <linux/export.h>
 #include <linux/mm.h>
 #include <linux/memblock.h>
 #include <linux/mem_encrypt.h>
@@ -50,6 +51,7 @@ int set_memory_encrypted(unsigned long addr, int numpages)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(set_memory_encrypted);
 
 int set_memory_decrypted(unsigned long addr, int numpages)
 {
@@ -63,6 +65,7 @@ int set_memory_decrypted(unsigned long addr, int numpages)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(set_memory_decrypted);
 
 /* There's one dispatch log per CPU. */
 #define NR_DTL_PAGE (DISPATCH_LOG_BYTES * CONFIG_NR_CPUS / PAGE_SIZE)
-- 
2.39.5




* Re: [PATCH v5 06/14] module: Switch load_info::len to size_t
From: Petr Pavlu @ 2026-05-26  9:47 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Nathan Chancellor,
	Nicolas Schier, Arnd Bergmann, Luis Chamberlain, Sami Tolvanen,
	Daniel Gomez, Paul Moore, James Morris, Serge E. Hallyn,
	Jonathan Corbet, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Naveen N Rao, Mimi Zohar, Roberto Sassu,
	Dmitry Kasatkin, Eric Snowberg, Nicolas Schier, Daniel Gomez,
	Aaron Tomlin, Christophe Leroy (CS GROUP), Nicolas Bouchinet,
	Xiu Jianfeng, Martin KaFai Lau, Song Liu, Yonghong Song,
	Jiri Olsa, bpf, Fabian Grünbichler, Arnout Engelen,
	Mattia Rizzolo, kpcyrd, Christian Heusel, Câju Mihai-Drosi,
	Eric Biggers, Sebastian Andrzej Siewior, linux-kbuild,
	linux-kernel, linux-arch, linux-modules, linux-security-module,
	linux-doc, linuxppc-dev, linux-integrity, debian-kernel
In-Reply-To: <20260505-module-hashes-v5-6-e174a5a49fce@weissschuh.net>

On 5/5/26 11:05 AM, Thomas Weißschuh wrote:
> Switching the types will make some later changes cleaner.

Since the updated version drops the patch "module: Deduplicate signature
extraction", I believe this change is no longer necessary.

> size_t is also the semantically correct type for this field.
> 
> As both 'size_t' and 'unsigned long' are always the same size, this
> should be risk-free.

The module 'len' would now start in init_module() as 'unsigned long',
then change in copy_module_from_user() to size_t, and then back to
'unsigned long' when calling copy_chunked_from_user(). The current code
is more consistent and mostly uses 'unsigned long', matching the syscall
interface.

-- 
Thanks,
Petr



* Re: [PATCH V16 4/7] rust/powerpc: Set min rustc version for powerpc
From: Mukesh Kumar Chaurasiya @ 2026-05-26  8:52 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: maddy, mpe, npiggin, chleroy, peterz, jpoimboe, jbaron, aliceryhl,
	rostedt, ardb, ojeda, boqun, gary, bjorn3_gh, lossin, a.hindborg,
	tmgross, dakr, nathan, nick.desaulniers+lkml, morbo, justinstitt,
	daniel.almeida, acourbot, fujita.tomonori, gregkh, prafulrai522,
	tamird, kees, lyude, airlied, linuxppc-dev, linux-kernel,
	rust-for-linux, llvm
In-Reply-To: <CANiq72nPOMkZXpMpwTLFg4xjFfmyf_O_x2FN_nrNBsxeNu178Q@mail.gmail.com>

On Mon, May 25, 2026 at 08:16:53PM +0200, Miguel Ojeda wrote:
> On Wed, May 20, 2026 at 8:48 AM Mukesh Kumar Chaurasiya (IBM)
> <mkchauras@gmail.com> wrote:
> >
> > Minimum `rustc` version required for powerpc is 1.95 as some critical
> > features required for compiling rust code for kernel are not there.
> 
> Which critical features?
Hey Miguel,

Right now I can only think of inline asm. I can rerun the whole thing
with 1.85 and figure out the issues with 1.85. I'll get back on this.

>
> > For example Stable inline asm support which got merged in 1.95.
> 
> It is not needed that the support is stable, but rather that
> everything you may need works.
> 
I wanted inline asm to be stable. I was wary of unstable inline asm
potentially messing up the whole system. That's the reason I waited for
the stable support to get merged before sending out this patch series.

> From a quick test (with a dummy example that may not be
> representative), ppc64 inline assembly seems to have worked for a long time,
> way before Rust 1.95.
> 
> So, which is the actual version that is needed? i.e. 1.95.0 doesn't
> seem to be required at least due to that.
> 
Yeah it may be true. I'll test out the 1.85 rustc and come back with the
results.

> That is why I am asking about the critical features above, because it
> may be that this works since earlier versions.
> 
> (I wonder if this patch and s390's similar one influenced each other?)
>
I am not aware of the s390x approach to Rust support, so I can't say
anything about that.

> Thanks!
> 
> Cheers,
> Miguel

Regards,
Mukesh




* Re: [PATCH v2 3/4] fbdev: Wrap fbcon updates from vga-switcheroo in helper
From: Thomas Zimmermann @ 2026-05-26  7:25 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: deller, simona, airlied, lukas, maddy, mpe, npiggin, chleroy,
	dri-devel, linux-fbdev, linuxppc-dev
In-Reply-To: <CAMuHMdXKF=fSZLqQiOuxDvygBDVSZKD+CQ3Rj+R4E_rYrz-WtA@mail.gmail.com>

Hi Geert

On 25.05.26 at 11:31, Geert Uytterhoeven wrote:
> Hi Thomas,
>
> On Fri, 22 May 2026 at 15:11, Thomas Zimmermann <tzimmermann@suse.de> wrote:
>> Handle console remapping in fbcon in fb_switch_output(). Vga-switcheroo
>> invokes this functionality before switching physical outputs to a new
>> graphics device. Open-coding fbcon state in vga-switcheroo exposed fbdev
>> implementation details.
>>
>> Vga-switcheroo is used for switching physical outputs among graphics
>> hardware. This functionality is only supported by DRM drivers. A later
>> update will further move fb_switch_output() into DRM's fbdev emulation;
>> thus fully decoupling vga-switcheroo from fbdev.
>>
>> v2:
>> - use '#if defined' (Helge)
>>
>> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Thanks for your patch, which is now commit 91458b3f2a84bc7b ("fbdev:
> Wrap fbcon updates from vga-switcheroo in helper") in fbdev/for-next.
>
>> --- a/drivers/gpu/vga/vga_switcheroo.c
>> +++ b/drivers/gpu/vga/vga_switcheroo.c
>> @@ -31,11 +31,9 @@
>>   #define pr_fmt(fmt) "vga_switcheroo: " fmt
>>
>>   #include <linux/apple-gmux.h>
>> -#include <linux/console.h>
>>   #include <linux/debugfs.h>
>>   #include <linux/fb.h>
>>   #include <linux/fs.h>
>> -#include <linux/fbcon.h>
>>   #include <linux/module.h>
>>   #include <linux/pci.h>
>>   #include <linux/pm_domain.h>
>> @@ -735,8 +733,10 @@ static int vga_switchto_stage2(struct vga_switcheroo_client *new_client)
>>          if (!active->driver_power_control)
>>                  set_audio_state(active->id, VGA_SWITCHEROO_OFF);
>>
>> +#if defined(CONFIG_FB)
>>          if (new_client->fb_info)
>> -               fbcon_remap_all(new_client->fb_info);
>> +               fb_switch_outputs(new_client->fb_info);
>> +#endif
> What if CONFIG_FB is modular?
> CONFIG_VGA_SWITCHEROO is bool.

Good point. Fbcon is currently linked into fb.ko, which is built with
CONFIG_FB.  Kconfig covers this dependency at [1]. For now, I think we
could make it 'depends on FB=y'.  As I mentioned elsewhere, this
fb-related logic is only relevant for DRM and is supposed to be moved
there soon.  That will also resolve any such config issues.

Let me prepare an update for this.

Best regards
Thomas

[1] 
https://elixir.bootlin.com/linux/v7.0.9/source/drivers/gpu/vga/Kconfig#L7

>
>>          mutex_lock(&vgasr_priv.mux_hw_lock);
>>          ret = vgasr_priv.handler->switchto(new_client->id);
> Gr{oetje,eeting}s,
>
>                          Geert
>

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)
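
Archive note: the concern about a modular CONFIG_FB can be shown with a
small preprocessor sketch. For a tristate option set to =m, the
generated configuration header defines the CONFIG_..._MODULE macro
rather than the plain CONFIG_... one, so '#if defined(CONFIG_FB)' is
false in that case. The option name CONFIG_SKETCH_FB and the helper
functions below are hypothetical.

```c
#include <assert.h>

/* What would be generated for a hypothetical CONFIG_SKETCH_FB=m. */
#define CONFIG_SKETCH_FB_MODULE 1

/* Mirrors '#if defined(CONFIG_...)': true only for =y. */
int sketch_builtin_branch_compiled(void)
{
#if defined(CONFIG_SKETCH_FB)
	return 1;
#else
	return 0;
#endif
}

/* Checks both macros: true for =y and for =m. */
int sketch_any_branch_compiled(void)
{
#if defined(CONFIG_SKETCH_FB) || defined(CONFIG_SKETCH_FB_MODULE)
	return 1;
#else
	return 0;
#endif
}

/* Self-check: with =m only the second helper sees the option. */
void sketch_selfcheck(void)
{
	assert(sketch_builtin_branch_compiled() == 0);
	assert(sketch_any_branch_compiled() == 1);
}
```

So with FB=m the guarded call would be compiled out silently, while
testing for "built-in or module" from built-in code would instead leave
a reference to a symbol that only exists in a module. Restricting the
code to FB=y, as proposed in the reply, sidesteps both outcomes.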





* Re: [PATCH] KVM: Validate irqchip index for LoongArch and PowerPC
From: Bibo Mao @ 2026-05-26  7:04 UTC (permalink / raw)
  To: Yanfei Xu, zhaotianrui, chenhuacai, maddy, npiggin,
	sashiko-reviews, seanjc, pbonzini
  Cc: kvm, loongarch, linuxppc-dev, caixiangfeng, fangying.tommy,
	isyanfei.xu, Sashiko
In-Reply-To: <20260525070154.495455-1-yanfei.xu@bytedance.com>



On 2026/5/25 3:01 PM, Yanfei Xu wrote:
> Sashiko reported that irqchip index is not validated for LoongArch and
> PowerPC. Add validation and reject out-of-range irqchip indexes to avoid
> indexing past the routing table's chip array.
> 
> Fixes: de9ba2f36368 ("KVM: PPC: Support irq routing and irqfd for in-kernel MPIC")
> Fixes: 1928254c5ccb ("LoongArch: KVM: Add irqfd support")
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://lore.kernel.org/kvm/20260525051714.485D51F000E9@smtp.kernel.org/
> Signed-off-by: Yanfei Xu <yanfei.xu@bytedance.com>
> ---
>   arch/loongarch/kvm/irqfd.c | 3 ++-
>   arch/powerpc/kvm/mpic.c    | 3 ++-
>   2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/loongarch/kvm/irqfd.c b/arch/loongarch/kvm/irqfd.c
> index f4f953b22419..40ed1081c4b6 100644
> --- a/arch/loongarch/kvm/irqfd.c
> +++ b/arch/loongarch/kvm/irqfd.c
> @@ -51,7 +51,8 @@ int kvm_set_routing_entry(struct kvm *kvm,
>   		e->irqchip.irqchip = ue->u.irqchip.irqchip;
>   		e->irqchip.pin = ue->u.irqchip.pin;
>   
> -		if (e->irqchip.pin >= KVM_IRQCHIP_NUM_PINS)
> +		if (e->irqchip.pin >= KVM_IRQCHIP_NUM_PINS ||
> +		    e->irqchip.irqchip >= KVM_NR_IRQCHIPS)
>   			return -EINVAL;
>   
>   		return 0;
> diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
> index 3070f36d9fb8..fb5f9e65e02e 100644
> --- a/arch/powerpc/kvm/mpic.c
> +++ b/arch/powerpc/kvm/mpic.c
> @@ -1833,7 +1833,8 @@ int kvm_set_routing_entry(struct kvm *kvm,
>   		e->set = mpic_set_irq;
>   		e->irqchip.irqchip = ue->u.irqchip.irqchip;
>   		e->irqchip.pin = ue->u.irqchip.pin;
> -		if (e->irqchip.pin >= KVM_IRQCHIP_NUM_PINS)
> +		if (e->irqchip.pin >= KVM_IRQCHIP_NUM_PINS ||
> +		    e->irqchip.irqchip >= KVM_NR_IRQCHIPS)
>   			goto out;
>   		break;
>   	case KVM_IRQ_ROUTING_MSI:
> 
Hi Yanfei,

This is an important fix, thanks for your efforts.

Reviewed-by: Bibo Mao <maobibo@loongson.cn>
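
Archive note: the reason both indexes need checking can be sketched in
user space. The table layout, sizes and names below are hypothetical
and only loosely modelled on a two-dimensional chip/pin routing array;
this is not the KVM code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_NR_IRQCHIPS 3	/* hypothetical number of chips */
#define SKETCH_NUM_PINS 24	/* hypothetical pins per chip */

/* Hypothetical routing table indexed by [irqchip][pin]. */
struct sketch_routing_table {
	int chip[SKETCH_NR_IRQCHIPS][SKETCH_NUM_PINS];
};

/*
 * Validate both user-supplied indexes before they are used. Checking
 * only the pin would let an out-of-range irqchip index past the first
 * dimension of the array.
 */
int sketch_set_routing_entry(struct sketch_routing_table *rt,
			     unsigned int irqchip, unsigned int pin, int gsi)
{
	assert(rt != NULL);

	if (pin >= SKETCH_NUM_PINS || irqchip >= SKETCH_NR_IRQCHIPS)
		return -EINVAL;

	rt->chip[irqchip][pin] = gsi;
	return 0;
}
```

A valid pair such as (1, 2) is stored, while (3, 0) and (0, 24) are both
rejected with -EINVAL before any array access happens.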




* Re: [PATCH v2] KVM: PPC: Kconfig: Enable CONFIG_VPA_PMU with KVM
From: Harsh Prateek Bora @ 2026-05-26  7:00 UTC (permalink / raw)
  To: Gautam Menghani, maddy, npiggin, mpe, chleroy, atrajeev
  Cc: linuxppc-dev, kvm, linux-kernel, stable
In-Reply-To: <20260518044150.34632-1-gautam@linux.ibm.com>



On 18/05/26 10:11 am, Gautam Menghani wrote:
> Enable CONFIG_VPA_PMU with KVM to enable its usage. Currently, the
> vpa-pmu driver cannot be used since it is not enabled in distro configs.
> 

I think the commit log needs rephrasing: irrespective of the current
state of distro configs, it makes sense to enable CONFIG_VPA_PMU for KVM
guests on Power by default, since this is the only use case for VPA
counters (i.e. in a KVM guest).

> On fedora kernel 6.13.7, the config option is disabled:
> $ cat /boot/config-6.19.12-200.fc43.ppc64le  | grep VPA_PMU
>   # CONFIG_VPA_PMU is not set
> 
> Fixes: 176cda0619b6c ("powerpc/perf: Add perf interface to expose vpa counters")
> Cc: stable@vger.kernel.org # v6.13+
> Signed-off-by: Gautam Menghani <gautam@linux.ibm.com>
> ---
> v1 -> v2:
> 1. Rebased on latest master
> 
>   arch/powerpc/kvm/Kconfig | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> index 9a0d1c1aca6c..56e86b46ff13 100644
> --- a/arch/powerpc/kvm/Kconfig
> +++ b/arch/powerpc/kvm/Kconfig
> @@ -82,6 +82,7 @@ config KVM_BOOK3S_64_HV
>   	select KVM_BOOK3S_HV_POSSIBLE
>   	select KVM_BOOK3S_HV_PMU
>   	select CMA
> +	select VPA_PMU if HV_PERF_CTRS


Also, since we already select KVM_BOOK3S_HV_PMU, VPA_PMU is a natural
extension, provided we enable it only if the required dependency is enabled.

With an update to the changelog with suggested rationale:

Reviewed-by: Harsh Prateek Bora <harshpb@linux.ibm.com>

>   	help
>   	  Support running unmodified book3s_64 guest kernels in
>   	  virtual machines on POWER7 and newer processors that have




* Re: [PATCH 00/11] Convert moduleparams to seq_buf
From: Petr Pavlu @ 2026-05-26  6:53 UTC (permalink / raw)
  To: Kees Cook
  Cc: Luis Chamberlain, Pengpeng Hou, Richard Weinberger, Anton Ivanov,
	Johannes Berg, Rafael J. Wysocki, Len Brown, Corey Minyard,
	Gabriel Somlo, Michael S. Tsirkin, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, David Airlie, Simona Vetter,
	Bart Van Assche, Jason Gunthorpe, Leon Romanovsky,
	Laurent Pinchart, Hans de Goede, Mauro Carvalho Chehab,
	Bjorn Helgaas, Hannes Reinecke, James E.J. Bottomley,
	Martin K. Petersen, Daniel Lezcano, Zhang Rui, Lukasz Luba,
	Greg Kroah-Hartman, Jiri Slaby, Alan Stern, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Jason Baron, Jim Cromie, Tiwei Bie,
	Benjamin Berg, Ilpo Järvinen, David E. Box,
	Maciej W. Rozycki, Srinivas Pandruvada, Peter Zijlstra,
	Heiko Carstens, Vasily Gorbik, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vinod Koul, Frank Li, Daniel Gomez, Sami Tolvanen,
	Aaron Tomlin, Alexander Potapenko, Marco Elver, Dmitry Vyukov,
	Andrew Morton, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Andy Shevchenko, Georgia Garcia, kvm, dmaengine,
	linux-modules, kasan-dev, linux-mm, apparmor,
	linux-security-module, linux-um, linux-acpi, openipmi-developer,
	qemu-devel, intel-gfx, dri-devel, linux-rdma, linux-media,
	linux-pci, linux-scsi, linux-pm, linuxppc-dev, linux-serial,
	linux-usb, usb-storage, virtualization, linux-kernel, linux-arch,
	netdev, linux-fsdevel, linux-hardening
In-Reply-To: <20260521133315.work.845-kees@kernel.org>

On 5/21/26 3:33 PM, Kees Cook wrote:
> Hi,
> 
> I tried to trim the CC list here, but it's still pretty huge...
> 
> We've had a long-standing issue with "write to a string pointer" callbacks
> that don't bounds check the destination (and for which the bounds is
> also not part of the callback prototype, even if it is "known" to be
> PAGE_SIZE, which sysfs_emit() depends on). Both moduleparams and sysfs
> use this pattern. As a first step, and to test the migration method,
> migrate moduleparams first.
> 
> There are 2 "mechanical" treewide patches that are handled by Coccinelle:
> - treewide: Convert struct kernel_param_ops initializers to DEFINE_KERNEL_PARAM_OPS
> - treewide: Convert custom kernel_param_ops .get callbacks to seq_buf via cocci
> 
> The last treewide patch is manual, and may need to be broken up into
> per-subsystem patches, though I'd prefer to avoid this, as it would
> extend the migration from 1 release to at least 2 releases. (1 to
> release the migration infrastructure, then 1 release to collect all the
> subsystem changes, and possibly 1 more release to remove the migration
> infrastructure.)
> 
> Thoughts, questions?

This looks reasonable to me. I added a few minor comments on the patches
but they already look solid.

-- 
Thanks,
Petr
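
Archive note: the difference between writing through a bare string
pointer and writing through a bounded buffer object can be sketched in
user space. The struct and function below are hypothetical and only
imitate the general idea of a buffer that carries its own size; they
are not the kernel's seq_buf API.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bounded buffer: the capacity travels with the pointer. */
struct sketch_buf {
	char *data;
	size_t size;	/* total capacity, including the NUL terminator */
	size_t len;	/* bytes written so far */
	int overflowed;	/* set once a write did not fit */
};

/* Append formatted text; returns 0 on success, -1 if it would not fit. */
int sketch_buf_printf(struct sketch_buf *s, const char *fmt, ...)
{
	va_list ap;
	size_t room;
	int n;

	assert(s != NULL && s->len < s->size);
	if (s->overflowed)
		return -1;

	room = s->size - s->len;
	va_start(ap, fmt);
	n = vsnprintf(s->data + s->len, room, fmt, ap);
	va_end(ap);

	if (n < 0 || (size_t)n >= room) {
		/* Roll back to the last complete write and remember the overflow. */
		s->data[s->len] = '\0';
		s->overflowed = 1;
		return -1;
	}

	s->len += (size_t)n;
	return 0;
}
```

A callback that receives such an object cannot write past the end even
if it never learns the capacity, whereas a callback that receives only
a 'char *' has to trust an unstated size convention.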



* Re: [PATCH] powerpc/boot: Allow text relocations for pseries wrapper with binutils 2.46+
From: Anushree Mathur @ 2026-05-26  6:44 UTC (permalink / raw)
  To: Amit Machhiwal, Madhavan Srinivasan, linuxppc-dev
  Cc: Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Antonio Alvarez Feijoo, Vaibhav Jain, Harsh Prateek Bora,
	linux-kernel
In-Reply-To: <20260525161601.32097-1-amachhiw@linux.ibm.com>



On 25/05/26 9:46 PM, Amit Machhiwal wrote:
> Binutils 2.46 changed the default linker behavior from '-z notext' to
> '-z text', which treats dynamic relocations in read-only segments as
> errors rather than warnings. This causes the pseries boot wrapper build
> to fail with:
>
>    /usr/bin/ld.bfd: arch/powerpc/boot/wrapper.a(crt0.o): warning:
>      relocation against `_platform_stack_top' in read-only section `.text'
>    /usr/bin/ld.bfd: error: read-only segment has dynamic relocations
>
> The pseries wrapper uses '-pie' to create position-independent code.
> However, crt0.S contains a pointer to '_platform_stack_top' in the .text
> section, which requires a dynamic relocation at runtime. This creates
> DT_TEXTREL (text relocations), which were allowed by default in binutils
> 2.45 and earlier (via implicit '-z notext') but are now rejected by
> binutils 2.46+.
>
> Add '-z notext' linker flag to explicitly allow text relocations for
> the pseries platform, similar to what is already done for the epapr
> platform. This restores the previous behavior and allows the boot
> wrapper to build successfully with binutils 2.46+.
>
> Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> ---
>   arch/powerpc/boot/wrapper | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/boot/wrapper b/arch/powerpc/boot/wrapper
> index 1efd1206fcab..25321ce262e8 100755
> --- a/arch/powerpc/boot/wrapper
> +++ b/arch/powerpc/boot/wrapper
> @@ -262,6 +262,7 @@ pseries)
>       if [ "$format" != "elf32ppc" ]; then
>   	link_address=
>   	pie=-pie
> +	notext='-z notext'
>       fi
>       make_space=n
>       ;;
>
> base-commit: e7ae89a0c97ce2b68b0983cd01eda67cf373517d
Hi Amit,
I have tested this patch and it has fixed the kernel build issue. Here 
is my analysis:

i) Without applying the patch:

The kernel build failed with the error below:

   CC [M]  drivers/net/usb/r8152.mod.o
   CC [M]  drivers/net/usb/hso.mod.o
   CC [M]  drivers/net/usb/lan78xx.mod.o
   CC [M]  drivers/net/usb/asix.mod.o
   CC [M]  drivers/net/usb/ax88179_178a.mod.o
/usr/bin/ld.bfd: arch/powerpc/boot/wrapper.a(crt0.o): warning: relocation against `_platform_stack_top' in read-only section `.text'
/usr/bin/ld.bfd: error: read-only segment has dynamic relocations
make[2]: *** [arch/powerpc/boot/Makefile:386: arch/powerpc/boot/zImage.pseries] Error 1
   CC [M]  drivers/net/usb/cdc_ether.mod.o
make[1]: *** [arch/powerpc/Makefile:236: zImage] Error 2
make[1]: *** Waiting for unfinished jobs....

ii) After applying the patch:

The kernel build succeeded.

Please feel free to add:

Tested-by: Anushree Mathur <anushree.mathur@linux.ibm.com>

Thank you!
Anushree Mathur



* Re: [PATCH] KVM: Validate irqchip index for LoongArch and PowerPC
From: Harsh Prateek Bora @ 2026-05-26  6:32 UTC (permalink / raw)
  To: Yanfei Xu, zhaotianrui, maobibo, chenhuacai, maddy, npiggin,
	sashiko-reviews, seanjc, pbonzini
  Cc: kvm, loongarch, linuxppc-dev, caixiangfeng, fangying.tommy,
	isyanfei.xu, Sashiko, stable
In-Reply-To: <20260525070154.495455-1-yanfei.xu@bytedance.com>

+ cc: stable@vger.kernel.org

On 25/05/26 12:31 pm, Yanfei Xu wrote:
> Sashiko reported that irqchip index is not validated for LoongArch and
> PowerPC. Add validation and reject out-of-range irqchip indexes to avoid
> indexing past the routing table's chip array.
> 
> Fixes: de9ba2f36368 ("KVM: PPC: Support irq routing and irqfd for in-kernel MPIC")
> Fixes: 1928254c5ccb ("LoongArch: KVM: Add irqfd support")
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://lore.kernel.org/kvm/20260525051714.485D51F000E9@smtp.kernel.org/
> Signed-off-by: Yanfei Xu <yanfei.xu@bytedance.com>
> ---
>   arch/loongarch/kvm/irqfd.c | 3 ++-
>   arch/powerpc/kvm/mpic.c    | 3 ++-
>   2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/loongarch/kvm/irqfd.c b/arch/loongarch/kvm/irqfd.c
> index f4f953b22419..40ed1081c4b6 100644
> --- a/arch/loongarch/kvm/irqfd.c
> +++ b/arch/loongarch/kvm/irqfd.c
> @@ -51,7 +51,8 @@ int kvm_set_routing_entry(struct kvm *kvm,
>   		e->irqchip.irqchip = ue->u.irqchip.irqchip;
>   		e->irqchip.pin = ue->u.irqchip.pin;
>   
> -		if (e->irqchip.pin >= KVM_IRQCHIP_NUM_PINS)
> +		if (e->irqchip.pin >= KVM_IRQCHIP_NUM_PINS ||
> +		    e->irqchip.irqchip >= KVM_NR_IRQCHIPS)
>   			return -EINVAL;
>   
>   		return 0;
> diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
> index 3070f36d9fb8..fb5f9e65e02e 100644
> --- a/arch/powerpc/kvm/mpic.c
> +++ b/arch/powerpc/kvm/mpic.c
> @@ -1833,7 +1833,8 @@ int kvm_set_routing_entry(struct kvm *kvm,
>   		e->set = mpic_set_irq;
>   		e->irqchip.irqchip = ue->u.irqchip.irqchip;
>   		e->irqchip.pin = ue->u.irqchip.pin;
> -		if (e->irqchip.pin >= KVM_IRQCHIP_NUM_PINS)
> +		if (e->irqchip.pin >= KVM_IRQCHIP_NUM_PINS ||
> +		    e->irqchip.irqchip >= KVM_NR_IRQCHIPS)

Reviewed-by: Harsh Prateek Bora <harshpb@linux.ibm.com> # PPC KVM
>   			goto out;
>   		break;
>   	case KVM_IRQ_ROUTING_MSI:



^ permalink raw reply

* Re: [BUG] sched/cache: "Make LLC id continuous" causes NULL cpumask dereference in build_sched_domains on POWER9
From: K Prateek Nayak @ 2026-05-26  5:53 UTC (permalink / raw)
  To: Srikar Dronamraju, Chen, Yu C
  Cc: Venkat Rao Bagalkote, Madhavan Srinivasan, Shrikanth Hegde,
	Ritesh Harjani, Christophe Leroy (CS GROUP), LKML, linuxppc-dev,
	linux-sched, tim.c.chen, Peter Zijlstra
In-Reply-To: <ahUogeqIq5km7JSr@linux.ibm.com>

Hello Srikar,

On 5/26/2026 10:28 AM, Srikar Dronamraju wrote:
> L2 Cache reported here is for SMT8 Core aka CACHE domain.

Apart from the scheduler, nothing in tree currently cares about
cpu_coregroup_mask() except for drivers/base/arch_topology.c, but
Power doesn't select GENERIC_ARCH_TOPOLOGY.

Why can't Power have an internal mask for the MC domain (tl_mc_mask) and
let the scheduler use cpu_coregroup_mask() for the actual LLC? (The L2
mask in this case.)

Power adds its own topology via set_sched_topology() anyway, so the
default_topology from kernel/sched/topology.c remains unused.

...

> Shouldnt cache-aware scheduling be worried about cpuset partitions too.
> If a cpuset has subset of LLC cores, then we should Scheduler assume it can
> control complete LLC?

Well, the scheduling takes care of partitions, and the cache-aware
scheduling bits take care of looking at the full system perspective
for stats aggregation and pointing to a particular LLC.

We don't compare llc_id across cpusets, so keeping one unique llc_id
per H/W LLC instance is feasible, and it lets us keep the llc_id space
limited, which helps optimize cache-aware scheduling.

Now if we have threads of the same process across partitions, we'll
still aggregate the utilization numbers across the full LLC, but
the load balancers at the individual partitions will make a call on
the aggregation.
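
To illustrate what "one unique llc_id per H/W LLC instance" means, here is a rough user-space sketch. It is not the code from the commit: the toy_ name, the 16-CPU limit and the one-bit-per-CPU uint64_t masks are all assumptions made for the example. It also assumes every CPU sharing an LLC reports the same mask.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 16

/*
 * Assign one dense id per distinct LLC mask.
 * llc_of[cpu] has one bit set for every CPU sharing that CPU's LLC.
 * Returns the number of ids handed out.
 */
static int toy_assign_llc_ids(const uint64_t llc_of[NR_CPUS], int id[NR_CPUS])
{
	int next = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		int first = 0;

		/* A CPU is always part of its own LLC mask. */
		assert(llc_of[cpu] & (UINT64_C(1) << cpu));

		/* Find the lowest-numbered CPU in this LLC. */
		while (!(llc_of[cpu] & (UINT64_C(1) << first)))
			first++;

		/* The first CPU of an LLC opens a new id; the others reuse it. */
		id[cpu] = (first == cpu) ? next++ : id[first];
	}
	return next;
}
```

With 16 CPUs grouped four per LLC, this hands out ids 0 to 3 regardless of how the CPUs are later split across cpusets, since the hardware masks do not change.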

-- 
Thanks and Regards,
Prateek



^ permalink raw reply

* Re: [BUG] sched/cache: "Make LLC id continuous" causes NULL cpumask dereference in build_sched_domains on POWER9
From: Venkat Rao Bagalkote @ 2026-05-26  5:24 UTC (permalink / raw)
  To: Chen, Yu C, Srikar Dronamraju
  Cc: Madhavan Srinivasan, Shrikanth Hegde, Ritesh Harjani,
	Christophe Leroy (CS GROUP), LKML, linuxppc-dev, linux-sched,
	tim.c.chen, K Prateek Nayak, Peter Zijlstra
In-Reply-To: <a0e3f4f4-03b6-4bb1-881f-06bf5abca1a3@intel.com>


On 26/05/26 9:38 am, Chen, Yu C wrote:
> Hi Venkat,
>
> On 5/26/2026 11:14 AM, Srikar Dronamraju wrote:
>> * Chen, Yu C <yu.c.chen@intel.com> [2026-05-25 23:35:45]:
>>
>>> Hi Venkat,
>>>
>>> On 5/25/2026 10:07 PM, Venkat Rao Bagalkote wrote:
>>>> Greetings!!!
>>>>
>>>> I am seeing an early boot kernel panic due to NULL pointer dereference
>>>> on a POWER9 (pSeries) system when testing linux-next (next-20260522).


This issue is seen on P11 as well.

[    0.006697] smp: Brought up 1 node, 16 CPUs
[    0.006702] Big cores detected but using small core scheduling
[    0.006752] BUG: Kernel NULL pointer dereference on read at 0x00000000
[    0.006755] Faulting instruction address: 0xc000000020adbb6c
[    0.006759] Oops: Kernel access of bad area, sig: 7 [#1]
[    0.006762] LE PAGE_SIZE=4K MMU=Radix  SMP NR_CPUS=8192 NUMA pSeries
[    0.006767] Modules linked in:
[    0.006772] CPU: 4 UID: 0 PID: 1 Comm: swapper/4 Not tainted 7.1.0-rc5-next-20260525 #1 PREEMPT(lazy)
[    0.006777] Hardware name: IBM,9080-HEX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1110.01 (NH1110_069) hv:phyp pSeries
[    0.006781] NIP:  c000000020adbb6c LR: c0000000202e5a58 CTR: 0000000000000000
[    0.006784] REGS: c0000000283d7890 TRAP: 0300   Not tainted (7.1.0-rc5-next-20260525)
[    0.006788] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 44002242  XER: 20040003
[    0.006796] CFAR: c0000000202e5a54 DAR: 0000000000000000 DSISR: 00080000 IRQMASK: 0
[    0.006796] GPR00: 0000000000000000 c0000000283d7b50 c000000021abf100 0000000000000010
[    0.006796] GPR04: 0000000000000010 0000000000000030 0000000000000000 c000000028365500
[    0.006796] GPR08: 0000000000000000 c000000022213598 000000003b77d000 0000000000000000
[    0.006796] GPR12: c00000002005d8f0 c000000000008000 c0000000283cb578 c0000000283cb400
[    0.006796] GPR16: c0000000283c9000 c000000022218b20 c0000000222330e8 00000000ffffffff
[    0.006796] GPR20: fffffffffffffff6 0000000000000000 c000000022da36e0 0000000000000000
[    0.006796] GPR24: 0000000000000000 0000000000000000 c0000000283c9178 c0000000227b5f00
[    0.006796] GPR28: c00000002831c1e8 c000000022db5980 0000000000000000 0000000000000000
[    0.006835] NIP [c000000020adbb6c] _find_first_bit+0xc/0xc0
[    0.006842] LR [c0000000202e5a58] build_sched_domains+0x7d8/0xb40
[    0.006847] Call Trace:
[    0.006849] [c0000000283d7b50] [c0000000202e5408] build_sched_domains+0x188/0xb40 (unreliable)
[    0.006854] [c0000000283d7c90] [c000000022034380] sched_init_domains+0x118/0x168
[    0.006860] [c0000000283d7ce0] [c000000022032b14] sched_init_smp+0xa8/0x158
[    0.006865] [c0000000283d7d30] [c000000022005674] kernel_init_freeable+0x1ac/0x294
[    0.006870] [c0000000283d7dd0] [c000000020011718] kernel_init+0x2c/0x1c4
[    0.006874] [c0000000283d7e30] [c00000002000debc] ret_from_kernel_user_thread+0x14/0x1c
[    0.006878] ---- interrupt: 0 at 0x0
[    0.006881] Code: eb610038 7fc3f378 eb810040 eba10048 38210060 ebc1fff0 ebe1fff8 7c0803a6 4e800020 7c681b78 7c832379 4d820020 <e9280000> 38e3ffff 39400000 78e7d7e2
[    0.006895] ---[ end trace 0000000000000000 ]---
[    0.006898]


Regards,

Venkat.

>>>
>>> It seems that cpumask_first(llc_mask(i)) is accessing
>>> NULL cpu_coregroup_mask():
>>
>>> has_coregroup_support() is false, thus cpu_coregroup_map
>>> is never allocated in smp_prepare_cpus().
>>> This machine is a "shared system" VM. We should probably
>>> let the LLC id generation fall back to using L2 id if
>>> cpu_coregroup_mask is unavailable (which restores the
>>> behavior before this patch). I'm wondering if the following
>>> change would help(need IBM friends' help on this):
>>
>> Power9 and below systems, dont have coregroup.
>> Its not because of shared LPAR. But its true for dedicated LPARs too.
>> Only Power10 and above systems have hemisphere where we add MC/coregroup
>> support.
>>
>
> OK, thanks for the correction. Are you saying coregroup_enabled is false
> on Power9 and older hardware, and set to true on Power10? Power10 has a
> corresponding device-tree property, which is parsed to enable hemisphere
> support in find_possible_nodes(). This is why has_coregroup_support()
> returns true for Power10.
>
>>>
>>> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
>>> index 3467f86fd78f..cf6c2e4190ab 100644
>>> --- a/arch/powerpc/kernel/smp.c
>>> +++ b/arch/powerpc/kernel/smp.c
>>> @@ -1042,11 +1042,6 @@ static const struct cpumask
>>> *tl_smallcore_smt_mask(struct sched_domain_topology_
>>>   }
>>>   #endif
>>>
>>> -struct cpumask *cpu_coregroup_mask(int cpu)
>>> -{
>>> -       return per_cpu(cpu_coregroup_map, cpu);
>>> -}
>>> -
>>>   static bool has_coregroup_support(void)
>>>   {
>>>          /* Coregroup identification not available on shared systems */
>>> @@ -1056,6 +1051,14 @@ static bool has_coregroup_support(void)
>>>          return coregroup_enabled;
>>>   }
>>>
>>> +struct cpumask *cpu_coregroup_mask(int cpu)
>>> +{
>>> +       if (!has_coregroup_support())
>>> +               return cpu_l2_cache_mask(cpu);
>>> +
>>> +       return per_cpu(cpu_coregroup_map, cpu);
>>> +}
>>> +
>>
>> While this is a work-around for the problem in Power9
>> It will hurt Power10 and Power11 systems.
>> As has been alluded by Prateek, MC is not LLC on Power.
>
> Could you please elaborate on the cache topology?
> Specifically, could you clarify what the LLC is for Power9
> and Power10 respectively? Is it always the L2 cache?
>
> I have checked the IBM documentation available at:
> https://hc32.hotchips.org/assets/program/conference/day1/HotChips2020_Server_Processors_IBM_Starke_POWER10_v33.pdf 
>
> According to the document, a hemisphere corresponds to a 64MB
> L3 cache shared by 8 cores. Since the MC domain spans a single
> hemisphere, I wonder why the SD_SHARE_LLC flag is not enabled
> for the MC domain?
>
>> So by using llc_mask as cpu_coregroup_mask() we run the trouble of 
>> assuming
>> MC to be similar to LLC. So it will impact Power 10/11 Systems.
>>
>> In commit b5ea300a17e3 sched/cache: Make LLC id continuous, we define
>> #define llc_mask(cpu) cpu_coregroup_mask(cpu)
>>
>> defining it llc_mask to cpu_coregroup_mask means MC should be LLC.
>> This is not true for some architectures atleast on Power.
>>
>
> OK.
>
>> So shouldn't it be using
>> #define llc_mask(cpu) per_cpu(sd_llc, cpu)
>>
>> This should work for systems where LLC is sub-coregroup, coregroup 
>> (or super
>> coregroup: Lets say some archs want LLC at PKG and cluster at 
>> coregroup).
>>
>> if we do that, I dont think we even need the else case where we say
>> #define llc_mask(cpu) cpumask_of(cpu)
>>
>
> I suppose you are referring to
> sched_domain_span(per_cpu(sd_llc, cpu)).
>
> Indeed, deriving the LLC from the SD_SHARE_LLC level offers
> better scalability. However, this approach would involve scheduler
> domains, which can be truncated by cpuset partitions - a scenario we
> prefer to avoid.
>
> thanks,
> Chenyu
>


^ permalink raw reply

* Re: [BUG] sched/cache: "Make LLC id continuous" causes NULL cpumask dereference in build_sched_domains on POWER9
From: Srikar Dronamraju @ 2026-05-26  4:58 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Venkat Rao Bagalkote, Madhavan Srinivasan, Shrikanth Hegde,
	Ritesh Harjani, Christophe Leroy (CS GROUP), LKML, linuxppc-dev,
	linux-sched, tim.c.chen, K Prateek Nayak, Peter Zijlstra
In-Reply-To: <a0e3f4f4-03b6-4bb1-881f-06bf5abca1a3@intel.com>

Hi Chen,

> > > It seems that cpumask_first(llc_mask(i)) is accessing
> > > NULL cpu_coregroup_mask():
> > 
> > > has_coregroup_support() is false, thus cpu_coregroup_map
> > > is never allocated in smp_prepare_cpus().
> > > This machine is a "shared system" VM. We should probably
> > > let the LLC id generation fall back to using L2 id if
> > > cpu_coregroup_mask is unavailable (which restores the
> > > behavior before this patch). I'm wondering if the following
> > > change would help(need IBM friends' help on this):
> > 
> > Power9 and below systems, dont have coregroup.
> > Its not because of shared LPAR. But its true for dedicated LPARs too.
> > Only Power10 and above systems have hemisphere where we add MC/coregroup
> > support.
> > 
> 
> OK, thanks for the correction. Are you saying coregroup_enabled is false
> on Power9 and older hardware, and set to true on Power10? Power10 has a
> corresponding device-tree property, which is parsed to enable hemisphere
> support in find_possible_nodes(). This is why has_coregroup_support()
> returns true for Power10.
> 

Yes, Chen, coregroup_enabled is true only on Power10 and later.
And yes, we derive the coregroup from the device-tree properties.
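
To make the failure mode concrete, here is a small user-space model of it. This is only a sketch and not the kernel code: the toy_ names, the 16-CPU array and the one-bit-per-CPU uint64_t masks are assumptions for illustration. The point is that a per-CPU mask pointer that is never allocated stays NULL, and a first-bit search that does not check for that will fault, which is what the _find_first_bit entry in the oops suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 16

/* Toy per-CPU coregroup masks; entries stay NULL when coregroups are unsupported. */
static uint64_t *coregroup_map[NR_CPUS];

/* Model of the unguarded accessor: it may hand back NULL. */
static const uint64_t *toy_coregroup_mask(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return coregroup_map[cpu];
}

/*
 * First set bit in the mask, or -1 for a NULL or empty mask.
 * The NULL check is the part the toy adds; a helper without it
 * would dereference the NULL pointer.
 */
static int toy_first_cpu(const uint64_t *mask)
{
	if (!mask || !*mask)
		return -1;
	for (int cpu = 0; cpu < 64; cpu++)
		if (*mask & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}
```

On a system without coregroups every coregroup_map[] entry is NULL, so toy_first_cpu(toy_coregroup_mask(cpu)) returns -1 here instead of faulting.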

> > > +struct cpumask *cpu_coregroup_mask(int cpu)
> > > +{
> > > +       if (!has_coregroup_support())
> > > +               return cpu_l2_cache_mask(cpu);
> > > +
> > > +       return per_cpu(cpu_coregroup_map, cpu);
> > > +}
> > > +
> > 
> > While this is a work-around for the problem in Power9
> > It will hurt Power10 and Power11 systems.
> > As has been alluded by Prateek, MC is not LLC on Power.
> 
> Could you please elaborate on the cache topology?
> Specifically, could you clarify what the LLC is for Power9
> and Power10 respectively? Is it always the L2 cache?
> 
> I have checked the IBM documentation available at:
> https://hc32.hotchips.org/assets/program/conference/day1/HotChips2020_Server_Processors_IBM_Starke_POWER10_v33.pdf
> According to the document, a hemisphere corresponds to a 64MB
> L3 cache shared by 8 cores. Since the MC domain spans a single
> hemisphere, I wonder why the SD_SHARE_LLC flag is not enabled
> for the MC domain?

If we look at the presentation you pointed to above, L2 is 2MB per SMT8 core.
L3 is a local 8MB per SMT8 core, and together these form a 64MB L3 buffer per
hemisphere. L3 is a victim cache, and all the L3s together form an L3.1 buffer.

In practice, we split the cache per small core, aka SMT4 core. So we have
1MB of L2 per SMT4 core and 4MB of L3 per SMT4 core. L3 is a victim cache and
all the L3s combine to form the L3.1 buffer. Hence for now we still consider
L2 to be the LLC.

On Power9, L2 is at the CACHE domain.
On all other Power systems (P7, P8, P10, P11), L2 is at the SMT domain.
On Power, we have taken L2 as the LLC.
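
As a quick sanity check of the numbers above, here is the arithmetic as a small C sketch. The constant and function names are made up for illustration, and the figures are simply the ones quoted from the presentation; nothing here comes from kernel source.

```c
#include <assert.h>

/* Figures quoted above from the Power10 presentation, in MB. */
enum {
	L2_MB_PER_SMT8_CORE = 2,
	L3_MB_PER_SMT8_CORE = 8,
	SMT8_CORES_PER_HEMISPHERE = 8,
	SMT4_CORES_PER_SMT8_CORE = 2,	/* one big core split into two small cores */
};

/* Cache share of one SMT4 (small) core, in KB. */
static unsigned int per_smt4_kb(unsigned int mb_per_smt8_core)
{
	assert(mb_per_smt8_core > 0);
	return mb_per_smt8_core * 1024 / SMT4_CORES_PER_SMT8_CORE;
}

/* Aggregate L3 of one hemisphere, in MB. */
static unsigned int hemisphere_l3_mb(void)
{
	return L3_MB_PER_SMT8_CORE * SMT8_CORES_PER_HEMISPHERE;
}
```

per_smt4_kb(L2_MB_PER_SMT8_CORE) gives 1024 and per_smt4_kb(L3_MB_PER_SMT8_CORE) gives 4096, which is what lscpu reports as L2 and L3 on Power10, while hemisphere_l3_mb() gives the 64MB per-hemisphere figure.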

lscpu (on Power 10)
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              480
On-line CPU(s) list: 0-479
Thread(s) per core:  8
Core(s) per socket:  15
Socket(s):           4
NUMA node(s):        4
Model:               2.0 (pvr 0080 0200)
Model name:          POWER10, altivec supported
CPU max MHz:         3249.0000
CPU min MHz:         3249.0000
L1d cache:           32K
L1i cache:           48K
L2 cache:            1024K
L3 cache:            4096K
NUMA node0 CPU(s):   0-119
NUMA node1 CPU(s):   120-239
NUMA node2 CPU(s):   240-359
NUMA node3 CPU(s):   360-479

L2 Cache reported here is for SMT4 Core.


lscpu (on Power 9)
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  8
Core(s) per socket:  8
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 0202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
Physical sockets:    2
Physical chips:      1
Physical cores/chip: 8

L2 Cache reported here is for SMT8 Core aka CACHE domain.


> 
> > So by using llc_mask as cpu_coregroup_mask() we run the trouble of assuming
> > MC to be similar to LLC. So it will impact Power 10/11 Systems.
> > 
> > In commit b5ea300a17e3 sched/cache: Make LLC id continuous, we define
> > #define llc_mask(cpu) cpu_coregroup_mask(cpu)
> > 
> > defining it llc_mask to cpu_coregroup_mask means MC should be LLC.
> > This is not true for some architectures atleast on Power.
> > 
> 
> OK.
> 
> > So shouldn't it be using
> > #define llc_mask(cpu) per_cpu(sd_llc, cpu)
> > 
> > This should work for systems where LLC is sub-coregroup, coregroup (or super
> > coregroup: Lets say some archs want LLC at PKG and cluster at coregroup).
> > 
> > if we do that, I dont think we even need the else case where we say
> > #define llc_mask(cpu) cpumask_of(cpu)
> > 
> 
> I suppose you are referring to
> sched_domain_span(per_cpu(sd_llc, cpu)).
> 
> Indeed, deriving the LLC from the SD_SHARE_LLC level offers
> better scalability. However, this approach would involve scheduler
> domains, which can be truncated by cpuset partitions - a scenario we
> prefer to avoid.
> 

Shouldn't cache-aware scheduling be worried about cpuset partitions too?
If a cpuset has only a subset of an LLC's cores, should the scheduler assume
it can control the complete LLC?

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply

* RE: [PATCH v5 00/20] dma-mapping: Use DMA_ATTR_CC_SHARED through direct, pool and swiotlb paths
From: Michael Kelley @ 2026-05-26  4:30 UTC (permalink / raw)
  To: Aneesh Kumar K.V (Arm), iommu@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev
  Cc: Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Jason Gunthorpe, Mostafa Saleh, Petr Tesarik,
	Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Alexander Gordeev, Gerald Schaefer,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, x86@kernel.org
In-Reply-To: <20260522042815.370873-1-aneesh.kumar@kernel.org>

From: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Sent: Thursday, May 21, 2026 9:28 PM
> 
> This series propagates DMA_ATTR_CC_SHARED through the dma-direct,
> dma-pool, and swiotlb paths so that encrypted and decrypted DMA buffers
> are handled consistently.
> 
> Today, the direct DMA path mostly relies on force_dma_unencrypted() for
> shared/decrypted buffer handling. This series consolidates the
> force_dma_unencrypted() checks in the top-level functions and ensures
> that the remaining DMA interfaces use DMA attributes to make the correct
> decisions.
> 
> The series:
> - moves swiotlb-backed allocations out of __dma_direct_alloc_pages(),
> - propagates DMA_ATTR_CC_SHARED through the dma-direct alloc/free
>   paths
> - teaches the atomic DMA pools to track encrypted versus decrypted
>   state
> - tracks swiotlb pool encryption state and enforces strict pool
>   selection
> - centralizes encrypted/decrypted pgprot handling in dma_pgprot() using
>   DMA attributes
> - passes DMA attributes down to dma_capable() so capability checks can
>   validate whether the selected DMA address encoding matches
>   DMA_ATTR_CC_SHARED
> - makes dma_direct_map_phys() choose the DMA address encoding from
>   DMA_ATTR_CC_SHARED and fall back to swiotlb when a shared DMA request
>   cannot use the direct mapping, which lets arm64 and x86 CCA guests stop
>   relying on SWIOTLB_FORCE for DMA mappings
> - use the selected swiotlb pool state to derive the returned DMA
>   address.

[snip]

> 
> 
> Aneesh Kumar K.V (Arm) (20):
>   [DO NOT MERGE] arm64/coco: Add pKVM as a CC platform
>   [DO NOT MERGE] s390: Expose protected virtualization through
>     cc_platform_has()
>   dma-direct: swiotlb: handle swiotlb alloc/free outside
>     __dma_direct_alloc_pages
>   dma-direct: use DMA_ATTR_CC_SHARED in alloc/free paths
>   dma-pool: track decrypted atomic pools and select them via attrs
>   dma: swiotlb: pass mapping attributes by reference
>   dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
>   dma-mapping: make dma_pgprot() honor DMA_ATTR_CC_SHARED
>   dma-direct: pass attrs to dma_capable() for DMA_ATTR_CC_SHARED checks
>   dma-direct: make dma_direct_map_phys() honor DMA_ATTR_CC_SHARED
>   dma-direct: set decrypted flag for remapped DMA allocations
>   dma-direct: select DMA address encoding from DMA_ATTR_CC_SHARED
>   dma-pool: fix page leak in atomic_pool_expand() cleanup
>   dma-direct: rename ret to cpu_addr in alloc helpers
>   dma-direct: return struct page from dma_direct_alloc_from_pool()
>   iommu/dma: Check atomic pool allocation result directly
>   dma: swiotlb: free dynamic pools from process context
>   dma: swiotlb: handle set_memory_decrypted() failures
>   dma: free atomic pool pages by physical address
>   swiotlb: Preserve allocation virtual address for dynamic pools
> 
>  arch/arm64/include/asm/hypervisor.h           |   6 +
>  arch/arm64/include/asm/mem_encrypt.h          |   3 +-
>  arch/arm64/kernel/rsi.c                       |  12 -
>  arch/arm64/mm/init.c                          |  17 +-
>  arch/powerpc/platforms/pseries/svm.c          |   2 +-
>  arch/s390/Kconfig                             |   1 +
>  arch/s390/mm/init.c                           |  16 +-
>  arch/x86/kernel/amd_gart_64.c                 |  30 +-
>  arch/x86/kernel/pci-dma.c                     |   4 +-
>  drivers/iommu/dma-iommu.c                     |  15 +-
>  drivers/virt/coco/pkvm-guest/arm-pkvm-guest.c |   5 +
>  drivers/xen/swiotlb-xen.c                     |   8 +-
>  include/linux/dma-direct.h                    |  20 +-
>  include/linux/dma-map-ops.h                   |   3 +-
>  include/linux/swiotlb.h                       |  20 +-
>  kernel/dma/direct.c                           | 275 +++++++++++++-----
>  kernel/dma/direct.h                           |  47 +--
>  kernel/dma/mapping.c                          |  16 +-
>  kernel/dma/pool.c                             | 221 ++++++++++----
>  kernel/dma/swiotlb.c                          | 270 +++++++++++++----
>  20 files changed, 717 insertions(+), 274 deletions(-)
> 

I tested the series in a linux-next20260518 kernel, running in an
Azure VM on the Hyper-V hypervisor. The physical processor is Intel
XEON(R) PLATINUM 8573C with TDX memory encryption in use, so
this is a Linux CoCo VM. The VM has the usual VMBus synthetic disk
and network devices provided by Hyper-V, plus two PCI NVMe devices
that are directly assigned to the VM. I did basic smoke tests in the
VM, including reading and writing the NVMe devices. The swiotlb is
used as expected for DMA transfers to/from the synthetic and NVMe
devices. The NVMe driver does dma_alloc_coherent() to allocate
memory for control structures that must be decrypted. I did "unbind"
on the NVMe devices, and then rebound them so the dma allocations
would be freed and then reallocated. All looks good.

I'd like to try the same tests in a CoCo VM based on AMD SEV-SNP,
but I need to get quota for that VM size in Azure, and I don't know
how soon that can happen.

So as described above,

Tested-by: Michael Kelley <mhklinux@outlook.com>


^ permalink raw reply

* Re: [BUG] sched/cache: "Make LLC id continuous" causes NULL cpumask dereference in build_sched_domains on POWER9
From: Chen, Yu C @ 2026-05-26  4:08 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Venkat Rao Bagalkote, Madhavan Srinivasan, Shrikanth Hegde,
	Ritesh Harjani, Christophe Leroy (CS GROUP), LKML, linuxppc-dev,
	linux-sched, tim.c.chen, K Prateek Nayak, Peter Zijlstra
In-Reply-To: <ahUQE-AgvOAQV_mI@linux.ibm.com>

Hi Venkat,

On 5/26/2026 11:14 AM, Srikar Dronamraju wrote:
> * Chen, Yu C <yu.c.chen@intel.com> [2026-05-25 23:35:45]:
> 
>> Hi Venkat,
>>
>> On 5/25/2026 10:07 PM, Venkat Rao Bagalkote wrote:
>>> Greetings!!!
>>>
>>> I am seeing an early boot kernel panic due to NULL pointer dereference
>>> on a POWER9 (pSeries) system when testing linux-next (next-20260522).
>>
>> It seems that cpumask_first(llc_mask(i)) is accessing
>> NULL cpu_coregroup_mask():
> 
>> has_coregroup_support() is false, thus cpu_coregroup_map
>> is never allocated in smp_prepare_cpus().
>> This machine is a "shared system" VM. We should probably
>> let the LLC id generation fall back to using L2 id if
>> cpu_coregroup_mask is unavailable (which restores the
>> behavior before this patch). I'm wondering if the following
>> change would help(need IBM friends' help on this):
> 
> Power9 and below systems, dont have coregroup.
> Its not because of shared LPAR. But its true for dedicated LPARs too.
> Only Power10 and above systems have hemisphere where we add MC/coregroup
> support.
>

OK, thanks for the correction. Are you saying coregroup_enabled is false
on Power9 and older hardware, and set to true on Power10? Power10 has a
corresponding device-tree property, which is parsed to enable hemisphere
support in find_possible_nodes(). This is why has_coregroup_support()
returns true for Power10.

>>
>> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
>> index 3467f86fd78f..cf6c2e4190ab 100644
>> --- a/arch/powerpc/kernel/smp.c
>> +++ b/arch/powerpc/kernel/smp.c
>> @@ -1042,11 +1042,6 @@ static const struct cpumask
>> *tl_smallcore_smt_mask(struct sched_domain_topology_
>>   }
>>   #endif
>>
>> -struct cpumask *cpu_coregroup_mask(int cpu)
>> -{
>> -       return per_cpu(cpu_coregroup_map, cpu);
>> -}
>> -
>>   static bool has_coregroup_support(void)
>>   {
>>          /* Coregroup identification not available on shared systems */
>> @@ -1056,6 +1051,14 @@ static bool has_coregroup_support(void)
>>          return coregroup_enabled;
>>   }
>>
>> +struct cpumask *cpu_coregroup_mask(int cpu)
>> +{
>> +       if (!has_coregroup_support())
>> +               return cpu_l2_cache_mask(cpu);
>> +
>> +       return per_cpu(cpu_coregroup_map, cpu);
>> +}
>> +
> 
> While this is a work-around for the problem in Power9
> It will hurt Power10 and Power11 systems.
> As has been alluded by Prateek, MC is not LLC on Power.

Could you please elaborate on the cache topology?
Specifically, could you clarify what the LLC is for Power9
and Power10 respectively? Is it always the L2 cache?

I have checked the IBM documentation available at:
https://hc32.hotchips.org/assets/program/conference/day1/HotChips2020_Server_Processors_IBM_Starke_POWER10_v33.pdf
According to the document, a hemisphere corresponds to a 64MB
L3 cache shared by 8 cores. Since the MC domain spans a single
hemisphere, I wonder why the SD_SHARE_LLC flag is not enabled
for the MC domain?

> So by using llc_mask as cpu_coregroup_mask() we run the trouble of assuming
> MC to be similar to LLC. So it will impact Power 10/11 Systems.
> 
> In commit b5ea300a17e3 sched/cache: Make LLC id continuous, we define
> #define llc_mask(cpu) cpu_coregroup_mask(cpu)
> 
> defining it llc_mask to cpu_coregroup_mask means MC should be LLC.
> This is not true for some architectures atleast on Power.
> 

OK.

> So shouldn't it be using
> #define llc_mask(cpu) per_cpu(sd_llc, cpu)
> 
> This should work for systems where LLC is sub-coregroup, coregroup (or super
> coregroup: Lets say some archs want LLC at PKG and cluster at coregroup).
> 
> if we do that, I dont think we even need the else case where we say
> #define llc_mask(cpu) cpumask_of(cpu)
> 

I suppose you are referring to
sched_domain_span(per_cpu(sd_llc, cpu)).

Indeed, deriving the LLC from the SD_SHARE_LLC level offers
better scalability. However, this approach would involve scheduler
domains, which can be truncated by cpuset partitions - a scenario we
prefer to avoid.
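
A toy example of the truncation I mean, as a user-space sketch only. The toy_ helpers and the one-bit-per-CPU uint64_t masks are assumptions for illustration, not scheduler code.

```c
#include <assert.h>
#include <stdint.h>

/* Span of a toy sched domain: the hardware LLC mask clipped to one partition's CPUs. */
static uint64_t toy_domain_span(uint64_t hw_llc_mask, uint64_t partition_cpus)
{
	assert(hw_llc_mask != 0);
	return hw_llc_mask & partition_cpus;
}

/* Number of CPUs set in a mask. */
static int toy_weight(uint64_t mask)
{
	int n = 0;

	/* Clear the lowest set bit on each iteration. */
	for (; mask; mask &= mask - 1)
		n++;
	return n;
}
```

Take an 8-CPU LLC (0xFF) split across two partitions, 0x0F and 0xF0. Each domain span then covers only 4 CPUs, so an id derived from the span would treat one hardware LLC as two separate ones.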

thanks,
Chenyu


^ permalink raw reply

* Re: [BUG] sched/cache: "Make LLC id continuous" causes NULL cpumask dereference in build_sched_domains on POWER9
From: Chen, Yu C @ 2026-05-26  3:14 UTC (permalink / raw)
  To: K Prateek Nayak, Venkat Rao Bagalkote
  Cc: Madhavan Srinivasan, Shrikanth Hegde, Ritesh Harjani,
	Christophe Leroy (CS GROUP), LKML, linuxppc-dev, linux-sched,
	tim.c.chen, Peter Zijlstra
In-Reply-To: <75179ee2-ec20-4757-8631-79b1f304c366@amd.com>

On 5/26/2026 12:16 AM, K Prateek Nayak wrote:
> Hello Chenyu, Venkat,
> 
> On 5/25/2026 9:05 PM, Chen, Yu C wrote:
>> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
>> index 3467f86fd78f..cf6c2e4190ab 100644
>> --- a/arch/powerpc/kernel/smp.c
>> +++ b/arch/powerpc/kernel/smp.c
>> @@ -1042,11 +1042,6 @@ static const struct cpumask *tl_smallcore_smt_mask(struct sched_domain_topology_
>>   }
>>   #endif
>>
>> -struct cpumask *cpu_coregroup_mask(int cpu)
>> -{
>> -       return per_cpu(cpu_coregroup_map, cpu);
>> -}
>> -
>>   static bool has_coregroup_support(void)
>>   {
>>          /* Coregroup identification not available on shared systems */
>> @@ -1056,6 +1051,14 @@ static bool has_coregroup_support(void)
>>          return coregroup_enabled;
>>   }
>>
>> +struct cpumask *cpu_coregroup_mask(int cpu)
>> +{
>> +       if (!has_coregroup_support())
>> +               return cpu_l2_cache_mask(cpu);
>> +
>> +       return per_cpu(cpu_coregroup_map, cpu);
> 
> Interestingly, on powerpc, the MC domain doesn't have SD_SHARE_LLC flag
> set but I believe there is still some benefit of keeping the tasks on
> the same hemisphere?
> 
You are right. I guess the Power9 system reported here does not have
hemispheres and Power10 does, according to commit fb2ff9fa72e2:
"From Power10 processors onwards, each chip has 2 hemispheres"
But yes, on both Power9 and Power10 the MC domain doesn't have SD_SHARE_LLC,
so aggregating threads onto one L2 domain might bring a benefit. A side note:
cache-aware scheduling is disabled on Power for now because Power does not
use the generic cacheinfo framework, so its cache size is returned as 0 and
get_effective_llc_bytes() returns 0 (for now, unless we get help from IBM
friends to support it):
commit 7030513a0877 ("7030513a0877")

> If we are actually aiming for LLC, I think cpu_l2_cache_mask() is the
> right cpumask for all cases since tl_cache_mask() also returns that
> and the l2_cache_mask is set in all cases covered by update_mask_by_l2()
> in the same file.
> 
> If consolidation on hemisphere is beneficial, then the above diff
> looks correct.
> 
> Note: For has_coregroup_support(), with the above fix, the scheduler
> side llc_id will now be based on MC domain's span which seems wrong
> since it doesn't have SD_SHARE_LLC flag and might lead to other
> behavioral changes now.
> 

You are right, it seems to be a bug when has_coregroup_support() returns true.
Maybe we can always return the L2 id on Power?

How about this (reverting the previous diff):
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 66ed5fe1b718..3b3b4156f418 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -130,13 +130,15 @@ static inline int cpu_to_coregroup_id(int cpu)

  #ifdef CONFIG_SMP
  #include <asm/cputable.h>
+#include <asm/smp.h>

  struct cpumask *cpu_coregroup_mask(int cpu);
  const struct cpumask *cpu_die_mask(int cpu);
  int cpu_die_id(int cpu);

+#define arch_llc_mask(cpu)     cpu_l2_cache_mask(cpu)
+
  #ifdef CONFIG_PPC64
-#include <asm/smp.h>

  #define topology_physical_package_id(cpu)      (cpu_to_chip_id(cpu))
  #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index df2ceb54c970..6772eb0ce493 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2063,12 +2063,18 @@ const struct cpumask *tl_mc_mask(struct sched_domain_topology_level *tl, int cpu
         return cpu_coregroup_mask(cpu);
  }

-#define llc_mask(cpu) cpu_coregroup_mask(cpu)
+#ifndef arch_llc_mask
+#define arch_llc_mask(cpu) cpu_coregroup_mask(cpu)
+#endif

  #else
-#define llc_mask(cpu) cpumask_of(cpu)
+#ifndef arch_llc_mask
+#define arch_llc_mask(cpu) cpumask_of(cpu)
+#endif
  #endif

+#define llc_mask(cpu) arch_llc_mask(cpu)
+
  const struct cpumask *tl_pkg_mask(struct sched_domain_topology_level *tl, int cpu)
  {
         return cpu_node_mask(cpu);
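
The override pattern in the diff above can be tried in isolation with a user-space sketch like the following. Everything here is hypothetical: the toy_ functions and the divisors 4 and 32 are placeholders standing in for the L2 and coregroup lookups, not real topology values.

```c
#include <assert.h>

/* Hypothetical arch header: provides the override before the generic code is seen. */
static int toy_l2_id(int cpu)
{
	assert(cpu >= 0);
	return cpu / 4;		/* placeholder: 4 CPUs per L2 */
}

static int toy_coregroup_id(int cpu)
{
	assert(cpu >= 0);
	return cpu / 32;	/* placeholder: 32 CPUs per coregroup */
}

#define toy_arch_llc_id(cpu) toy_l2_id(cpu)

/* Generic code: only supplies a default when the arch did not provide one. */
#ifndef toy_arch_llc_id
#define toy_arch_llc_id(cpu) toy_coregroup_id(cpu)
#endif

#define toy_llc_id(cpu) toy_arch_llc_id(cpu)
```

Because the arch definition comes first, toy_llc_id() resolves to the L2-based lookup; removing the arch #define makes it fall back to the coregroup-based default, which mirrors what other architectures would keep getting.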


thanks,
Chenyu

>> +}
>> +
>>   static int __init init_big_cores(void)
>>   {
>>          int cpu;
> 


^ permalink raw reply related

* Re: [BUG] sched/cache: "Make LLC id continuous" causes NULL cpumask dereference in build_sched_domains on POWER9
From: Srikar Dronamraju @ 2026-05-26  3:14 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Venkat Rao Bagalkote, Madhavan Srinivasan, Shrikanth Hegde,
	Ritesh Harjani, Christophe Leroy (CS GROUP), LKML, linuxppc-dev,
	linux-sched, tim.c.chen, K Prateek Nayak, Peter Zijlstra
In-Reply-To: <bca40921-d351-4439-ae6a-9e5294f23e2e@intel.com>

* Chen, Yu C <yu.c.chen@intel.com> [2026-05-25 23:35:45]:

> Hi Venkat,
> 
> On 5/25/2026 10:07 PM, Venkat Rao Bagalkote wrote:
> > Greetings!!!
> > 
> > I am seeing an early boot kernel panic due to NULL pointer dereference
> > on a POWER9 (pSeries) system when testing linux-next (next-20260522).
> 
> It seems that cpumask_first(llc_mask(i)) is accessing
> NULL cpu_coregroup_mask():

> has_coregroup_support() is false, thus cpu_coregroup_map
> is never allocated in smp_prepare_cpus().
> This machine is a "shared system" VM. We should probably
> let the LLC id generation fall back to using L2 id if
> cpu_coregroup_mask is unavailable (which restores the
> behavior before this patch). I'm wondering if the following
> change would help (need IBM friends' help on this):

Power9 and older systems don't have coregroups.
That is not because of shared LPARs; it is true for dedicated LPARs too.
Only Power10 and later systems have hemispheres, which is where we add
MC/coregroup support.

> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 3467f86fd78f..cf6c2e4190ab 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1042,11 +1042,6 @@ static const struct cpumask *tl_smallcore_smt_mask(struct sched_domain_topology_
>  }
>  #endif
> 
> -struct cpumask *cpu_coregroup_mask(int cpu)
> -{
> -       return per_cpu(cpu_coregroup_map, cpu);
> -}
> -
>  static bool has_coregroup_support(void)
>  {
>         /* Coregroup identification not available on shared systems */
> @@ -1056,6 +1051,14 @@ static bool has_coregroup_support(void)
>         return coregroup_enabled;
>  }
> 
> +struct cpumask *cpu_coregroup_mask(int cpu)
> +{
> +       if (!has_coregroup_support())
> +               return cpu_l2_cache_mask(cpu);
> +
> +       return per_cpu(cpu_coregroup_map, cpu);
> +}
> +

While this is a workaround for the problem on Power9,
it will hurt Power10 and Power11 systems.
As Prateek alluded to, MC is not LLC on Power.
So by defining llc_mask as cpu_coregroup_mask() we run into the trouble of
assuming MC is equivalent to LLC, which will impact Power10/11 systems.

In commit b5ea300a17e3 ("sched/cache: Make LLC id continuous"), we define
#define llc_mask(cpu) cpu_coregroup_mask(cpu)

Defining llc_mask as cpu_coregroup_mask means MC should be LLC.
This is not true for some architectures, at least not on Power.

So shouldn't it be using
#define llc_mask(cpu) per_cpu(sd_llc, cpu)

This should work for systems where the LLC is sub-coregroup, coregroup, or
super-coregroup (let's say some archs want LLC at PKG and cluster at coregroup).

If we do that, I don't think we even need the else case where we say
#define llc_mask(cpu) cpumask_of(cpu)

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply

* RE: [PATCH v5 10/20] dma-direct: make dma_direct_map_phys() honor DMA_ATTR_CC_SHARED
From: Michael Kelley @ 2026-05-26  2:56 UTC (permalink / raw)
  To: Aneesh Kumar K.V (Arm), iommu@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev
  Cc: Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Jason Gunthorpe, Mostafa Saleh, Petr Tesarik,
	Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Alexander Gordeev, Gerald Schaefer,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, x86@kernel.org, Jiri Pirko
In-Reply-To: <20260522042815.370873-11-aneesh.kumar@kernel.org>

From: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Sent: Thursday, May 21, 2026 9:28 PM
> 
> Teach dma_direct_map_phys() to select the DMA address encoding based on
> DMA_ATTR_CC_SHARED.
> 
> Use phys_to_dma_unencrypted() for decrypted mappings and
> phys_to_dma_encrypted() otherwise. If a device requires unencrypted DMA
> but the source physical address is still encrypted, force the mapping
> through swiotlb so the DMA address and backing memory attributes remain
> consistent.
> 
> Update the arm64, x86, s390 and powerpc secure-guest setup to not use
> swiotlb force option
> 
> Tested-by: Jiri Pirko <jiri@nvidia.com>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
> Changes from v3:
> * Handle DMA_ATTR_MMIO
> ---
>  arch/arm64/mm/init.c                 |  4 +--
>  arch/powerpc/platforms/pseries/svm.c |  2 +-
>  arch/s390/mm/init.c                  |  2 +-
>  arch/x86/kernel/pci-dma.c            |  4 +--
>  kernel/dma/direct.c                  |  4 ++-
>  kernel/dma/direct.h                  | 45 +++++++++++++++-------------
>  6 files changed, 31 insertions(+), 30 deletions(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index c1b223e7cc8e..a087ac5b15f7 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -340,10 +340,8 @@ void __init arch_mm_preinit(void)
>  	unsigned int flags = SWIOTLB_VERBOSE;
>  	bool swiotlb = max_pfn > PFN_DOWN(arm64_dma_phys_limit);
> 
> -	if (is_realm_world()) {
> +	if (is_realm_world())
>  		swiotlb = true;
> -		flags |= SWIOTLB_FORCE;
> -	}
> 
>  	if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) && !swiotlb) {
>  		/*
> diff --git a/arch/powerpc/platforms/pseries/svm.c b/arch/powerpc/platforms/pseries/svm.c
> index 384c9dc1899a..7a403dbd35ee 100644
> --- a/arch/powerpc/platforms/pseries/svm.c
> +++ b/arch/powerpc/platforms/pseries/svm.c
> @@ -29,7 +29,7 @@ static int __init init_svm(void)
>  	 * need to use the SWIOTLB buffer for DMA even if dma_capable() says
>  	 * otherwise.
>  	 */
> -	ppc_swiotlb_flags |= SWIOTLB_ANY | SWIOTLB_FORCE;
> +	ppc_swiotlb_flags |= SWIOTLB_ANY;
> 
>  	/* Share the SWIOTLB buffer with the host. */
>  	swiotlb_update_mem_attributes();
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index ad3c6d92b801..581af1483c42 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -163,7 +163,7 @@ static void __init pv_init(void)
>  	virtio_set_mem_acc_cb(virtio_require_restricted_mem_acc);
> 
>  	/* make sure bounce buffers are shared */
> -	swiotlb_init(true, SWIOTLB_FORCE | SWIOTLB_VERBOSE);
> +	swiotlb_init(true, SWIOTLB_VERBOSE);
>  	swiotlb_update_mem_attributes();
>  }
> 
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index 6267363e0189..75cf8f6ae8cd 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -59,10 +59,8 @@ static void __init pci_swiotlb_detect(void)
>  	 * bounce buffers as the hypervisor can't access arbitrary VM memory
>  	 * that is not explicitly shared with it.
>  	 */
> -	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
> +	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
>  		x86_swiotlb_enable = true;
> -		x86_swiotlb_flags |= SWIOTLB_FORCE;
> -	}

With this patch removing SWIOTLB_FORCE from four places in
kernel code, there are no remaining places where it is set.
The test of SWIOTLB_FORCE could be removed from
swiotlb_init_remap(), and its definition could be deleted
from include/linux/swiotlb.h.

Michael
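
The per-mapping selection described in the quoted commit message can be
sketched as a small decision function. This is only a user-space model
under stated assumptions: enum dma_path and pick_dma_path() are
hypothetical names that do not exist in the kernel, the two booleans stand
in for "DMA_ATTR_CC_SHARED is set" and "the device requires unencrypted
DMA", and the DMA_ATTR_MMIO handling mentioned in the changelog is not
modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome labels; the kernel does not define this enum. */
enum dma_path {
	DMA_PATH_DIRECT_ENCRYPTED,	/* models phys_to_dma_encrypted() */
	DMA_PATH_DIRECT_UNENCRYPTED,	/* models phys_to_dma_unencrypted() */
	DMA_PATH_BOUNCE_SWIOTLB,	/* models bouncing through swiotlb */
};

/*
 * attr_cc_shared:        caller says the source memory is already shared
 *                        (decrypted).
 * dev_needs_unencrypted: the device can only DMA to unencrypted memory.
 */
static inline enum dma_path pick_dma_path(bool attr_cc_shared,
					  bool dev_needs_unencrypted)
{
	/* Decrypted mapping: use the unencrypted address encoding. */
	if (attr_cc_shared)
		return DMA_PATH_DIRECT_UNENCRYPTED;

	/*
	 * Source is still encrypted but the device needs unencrypted DMA:
	 * bounce so the address and the memory attributes stay consistent.
	 */
	if (dev_needs_unencrypted)
		return DMA_PATH_BOUNCE_SWIOTLB;

	/* Otherwise map directly with the encrypted encoding. */
	return DMA_PATH_DIRECT_ENCRYPTED;
}
```

In this model the bounce decision is made per mapping, which would be
consistent with the observation above that a global SWIOTLB_FORCE flag no
longer has any users.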


^ permalink raw reply

* Re: [PATCH] MAINTAINERS: powerpc: update VMX AES entries
From: Madhavan Srinivasan @ 2026-05-26  2:00 UTC (permalink / raw)
  To: Eric Biggers, Thorsten Blum
  Cc: Herbert Xu, David S. Miller, Breno Leitão, Nayna Jain,
	Paulo Flabiano Smorigo, Ard Biesheuvel, linux-crypto,
	linuxppc-dev, linux-kernel
In-Reply-To: <20260524213525.GA112327@quark>


On 5/25/26 3:05 AM, Eric Biggers wrote:
> On Sun, May 24, 2026 at 11:29:45PM +0200, Thorsten Blum wrote:
>> Commit 7cf2082e74ce ("lib/crypto: powerpc/aes: Migrate POWER8 optimized
>> code into library") removed arch/powerpc/crypto/aes.c and moved
>> arch/powerpc/crypto/aesp8-ppc.pl to lib/crypto/powerpc/.
>>
>> However, the "IBM Power VMX Cryptographic instructions" entry still
>> references the removed file and no longer covers the moved aesp8-ppc.pl.
>>
>> Remove the stale entry, add lib/crypto/powerpc/aesp8-ppc.pl, and tighten
>> the arch/powerpc/crypto/aesp8-ppc.* pattern to match the remaining
>> header only.
>>
>> Fixes: 7cf2082e74ce ("lib/crypto: powerpc/aes: Migrate POWER8 optimized code into library")
>> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
> Acked-by: Eric Biggers <ebiggers@kernel.org>
>
> If this doesn't get picked up through the powerpc tree, I can take this
> through libcrypto-next.
>
> - Eric

I can take this via the ppc tree.

Maddy



^ permalink raw reply

* Re: [PATCH V16 4/7] rust/powerpc: Set min rustc version for powerpc
From: Miguel Ojeda @ 2026-05-25 18:16 UTC (permalink / raw)
  To: Mukesh Kumar Chaurasiya (IBM)
  Cc: maddy, mpe, npiggin, chleroy, peterz, jpoimboe, jbaron, aliceryhl,
	rostedt, ardb, ojeda, boqun, gary, bjorn3_gh, lossin, a.hindborg,
	tmgross, dakr, nathan, nick.desaulniers+lkml, morbo, justinstitt,
	daniel.almeida, acourbot, fujita.tomonori, gregkh, prafulrai522,
	tamird, kees, lyude, airlied, linuxppc-dev, linux-kernel,
	rust-for-linux, llvm
In-Reply-To: <20260520064630.1785283-5-mkchauras@gmail.com>

On Wed, May 20, 2026 at 8:48 AM Mukesh Kumar Chaurasiya (IBM)
<mkchauras@gmail.com> wrote:
>
> Minimum `rustc` version required for powerpc is 1.95 as some critical
> features required for compiling rust code for kernel are not there.

Which critical features?

> For example Stable inline asm support which got merged in 1.95.

The support does not need to be stable; what matters is that
everything you may need works.

From a quick test (with a dummy example that may not be
representative), ppc64 inline assembly seems to have worked for a long
time, since well before Rust 1.95.

So, which version is actually needed? I.e. 1.95.0 doesn't seem to be
required, at least not for that reason.

That is why I am asking about the critical features above, because
this may have worked since earlier versions.

(I wonder if this patch and s390's similar one influenced each other?)

Thanks!

Cheers,
Miguel


^ permalink raw reply

* Re: [PATCH v2 13/69] mm/hugetlb: Refactor early boot gigantic hugepage allocation
From: Oscar Salvador (SUSE) @ 2026-05-25 17:21 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, Ackerley Tng, Frank van der Linden,
	aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
	linux-kernel
In-Reply-To: <20260513130542.35604-14-songmuchun@bytedance.com>

On Wed, May 13, 2026 at 09:04:41PM +0800, Muchun Song wrote:
> The early boot gigantic hugepage allocation helpers currently mix
> allocation with huge_bootmem_page setup, and leave part of the
> initialization flow in architecture code.
> 
> Refactor the interface to return the allocated huge page pointer and
> move the huge_bootmem_page setup into the generic hugetlb code. This
> makes the architecture-specific paths focus only on finding memory,
> while the common code handles node placement and early page metadata
> setup in one place.
> 
> This also lets powerpc benefit from memblock_reserved_mark_noinit(),
> which it did not enable before.
> 
> In addition, upcoming cross-zone validation for boot-time gigantic
> hugetlb reservation is common logic. With this refactoring, that logic
> can stay in the generic code instead of being duplicated in
> architecture-specific paths.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Same comment as Mike:

Reviewed-by: Oscar Salvador (SUSE) <osalvador@suse.de>

 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply

* Re: [PATCH v2 12/69] mm/hugetlb_cma: Validate hugetlb CMA range by zone at reserve time
From: Oscar Salvador (SUSE) @ 2026-05-25 17:19 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, Ackerley Tng, Frank van der Linden,
	aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
	linux-kernel
In-Reply-To: <20260513130542.35604-13-songmuchun@bytedance.com>

On Wed, May 13, 2026 at 09:04:40PM +0800, Muchun Song wrote:
> Hugetlb CMA allocation currently has to cope with CMA areas that span
> multiple zones.
> 
> Validate the reserved CMA range up front in hugetlb_cma_reserve() so
> later hugetlb CMA allocations can assume a zone-consistent area.
> 
> Also drop the pfn_valid() check from cma_validate_zones(). mem_section
> is not fully initialized at this point, so the check can trigger false
> warnings. Keep the sanity check in cma_activate_area() instead.
> 
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Reviewed-by: Oscar Salvador (SUSE) <osalvador@suse.de>

 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply

* Re: [PATCH v2 11/69] mm/sparse: Move sparse_vmemmap_init_nid_late() into sparse_init_nid()
From: Oscar Salvador (SUSE) @ 2026-05-25 17:11 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, Ackerley Tng, Frank van der Linden,
	aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
	linux-kernel
In-Reply-To: <20260513130542.35604-12-songmuchun@bytedance.com>

On Wed, May 13, 2026 at 09:04:39PM +0800, Muchun Song wrote:
> sparse_vmemmap_init_nid_late() is still called separately from
> mm_core_init_early(), away from the rest of the sparse initialization
> path.
> 
> Now that sparse_init() runs after zone initialization, call
> sparse_vmemmap_init_nid_late() from sparse_init_nid() instead. This
> keeps both sparse_vmemmap_init_nid_early() and
> sparse_vmemmap_init_nid_late() in the sparse setup path.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Reviewed-by: Oscar Salvador (SUSE) <osalvador@kernel.org>

 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply

* Re: [PATCH v2 10/69] mm/mm_init: Remove set_pageblock_order() call from sparse_init()
From: Oscar Salvador (SUSE) @ 2026-05-25 17:10 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
	Michael Ellerman, Madhavan Srinivasan, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
	Christophe Leroy, Ackerley Tng, Frank van der Linden,
	aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
	linux-kernel
In-Reply-To: <20260513130542.35604-11-songmuchun@bytedance.com>

On Wed, May 13, 2026 at 09:04:38PM +0800, Muchun Song wrote:
> free_area_init() already sets pageblock_order before sparse_init() runs
> for CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, so sparse_init() does not need to
> call set_pageblock_order() again.
> 
> With that call removed, set_pageblock_order() is only used in mm/mm_init.c.
> Make it static.
> 
> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Reviewed-by: Oscar Salvador (SUSE) <osalvador@suse.de>

 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply

