LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix
From: Jordan Niethe @ 2021-09-11  9:14 UTC (permalink / raw)
  To: Christopher M. Riedl; +Cc: linuxppc-dev, linux-hardening
In-Reply-To: <20210911022904.30962-5-cmr@bluescreens.de>

On Sat, Sep 11, 2021 at 12:39 PM Christopher M. Riedl
<cmr@bluescreens.de> wrote:
>
> When code patching a STRICT_KERNEL_RWX kernel the page containing the
> address to be patched is temporarily mapped as writeable. Currently, a
> per-cpu vmalloc patch area is used for this purpose. While the patch
> area is per-cpu, the temporary page mapping is inserted into the kernel
> page tables for the duration of patching. The mapping is exposed to CPUs
> other than the patching CPU - this is undesirable from a hardening
> perspective. Use a temporary mm instead which keeps the mapping local to
> the CPU doing the patching.
>
> Use the `poking_init` init hook to prepare a temporary mm and patching
> address. Initialize the temporary mm by copying the init mm. Choose a
> randomized patching address inside the temporary mm userspace address
> space. The patching address is randomized between PAGE_SIZE and
> DEFAULT_MAP_WINDOW-PAGE_SIZE.
>
> Bits of entropy with 64K page size on BOOK3S_64:
>
>         bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
>
>         PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
>         bits of entropy = log2(128TB / 64K)
>         bits of entropy = 31
>
> The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
> operates - by default the space above DEFAULT_MAP_WINDOW is not
> available. Currently the Hash MMU does not use a temporary mm so
> technically this upper limit isn't necessary; however, a larger
> randomization range does not further "harden" this overall approach and
> future work may introduce patching with a temporary mm on Hash as well.
>
> Randomization occurs only once during initialization at boot for each
> possible CPU in the system.
>
> Introduce two new functions, map_patch_mm() and unmap_patch_mm(), to
> respectively create and remove the temporary mapping with write
> permissions at patching_addr. Map the page with PAGE_KERNEL to set
> EAA[0] for the PTE which ignores the AMR (so no need to unlock/lock
> KUAP) according to PowerISA v3.0b Figure 35 on Radix.
>
> Based on x86 implementation:
>
> commit 4fc19708b165
> ("x86/alternatives: Initialize temporary mm for patching")
>
> and:
>
> commit b3fd8e83ada0
> ("x86/alternatives: Use temporary mm for text poking")
>
> Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>
>
> ---
>
> v6:  * Small clean-ups (naming, formatting, style, etc).
>      * Call stop_using_temporary_mm() before pte_unmap_unlock() after
>        patching.
>      * Replace BUG_ON()s in poking_init() w/ WARN_ON()s.
>
> v5:  * Only support Book3s64 Radix MMU for now.
>      * Use a per-cpu datastructure to hold the patching_addr and
>        patching_mm to avoid the need for a synchronization lock/mutex.
>
> v4:  * In the previous series this was two separate patches: one to init
>        the temporary mm in poking_init() (unused in powerpc at the time)
>        and the other to use it for patching (which removed all the
>        per-cpu vmalloc code). Now that we use poking_init() in the
>        existing per-cpu vmalloc approach, that separation doesn't work
>        as nicely anymore so I just merged the two patches into one.
>      * Preload the SLB entry and hash the page for the patching_addr
>        when using Hash on book3s64 to avoid taking an SLB and Hash fault
>        during patching. The previous implementation was a hack which
>        changed current->mm to allow the SLB and Hash fault handlers to
>        work with the temporary mm since both of those code-paths always
>        assume mm == current->mm.
>      * Also (hmm - seeing a trend here) with the book3s64 Hash MMU we
>        have to manage the mm->context.active_cpus counter and mm cpumask
>        since they determine (via mm_is_thread_local()) if the TLB flush
>        in pte_clear() is local or not - it should always be local when
>        we're using the temporary mm. On book3s64's Radix MMU we can
>        just call local_flush_tlb_mm().
>      * Use HPTE_USE_KERNEL_KEY on Hash to avoid costly lock/unlock of
>        KUAP.
> ---
>  arch/powerpc/lib/code-patching.c | 119 +++++++++++++++++++++++++++++--
>  1 file changed, 112 insertions(+), 7 deletions(-)
>
> diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> index e802e42c2789..af8e2a02a9dd 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -11,6 +11,7 @@
>  #include <linux/cpuhotplug.h>
>  #include <linux/slab.h>
>  #include <linux/uaccess.h>
> +#include <linux/random.h>
>
>  #include <asm/tlbflush.h>
>  #include <asm/page.h>
> @@ -103,6 +104,7 @@ static inline void stop_using_temporary_mm(struct temp_mm *temp_mm)
>
>  static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
>  static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
> +static DEFINE_PER_CPU(struct mm_struct *, cpu_patching_mm);
>
>  static int text_area_cpu_up(unsigned int cpu)
>  {
> @@ -126,8 +128,48 @@ static int text_area_cpu_down(unsigned int cpu)
>         return 0;
>  }
>
> +static __always_inline void __poking_init_temp_mm(void)
> +{
> +       int cpu;
> +       spinlock_t *ptl; /* for protecting pte table */

ptl is just used so we don't have to open code allocating a pte in
patching_mm isn't it?

> +       pte_t *ptep;
> +       struct mm_struct *patching_mm;
> +       unsigned long patching_addr;
> +
> +       for_each_possible_cpu(cpu) {
> +               patching_mm = copy_init_mm();
> +               WARN_ON(!patching_mm);

Would it be okay to just let the mmu handle null pointer dereferences?

> +               per_cpu(cpu_patching_mm, cpu) = patching_mm;
> +
> +               /*
> +                * Choose a randomized, page-aligned address from the range:
> +                * [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE] The lower
> +                * address bound is PAGE_SIZE to avoid the zero-page.  The
> +                * upper address bound is DEFAULT_MAP_WINDOW - PAGE_SIZE to
> +                * stay under DEFAULT_MAP_WINDOW with the Book3s64 Hash MMU.
> +                */
> +               patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK)
> +                               % (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE));
> +               per_cpu(cpu_patching_addr, cpu) = patching_addr;

On x86 the randomization depends on CONFIG_RANDOMIZE_BASE. Should it
be controllable here too?

> +
> +               /*
> +                * PTE allocation uses GFP_KERNEL which means we need to
> +                * pre-allocate the PTE here because we cannot do the
> +                * allocation during patching when IRQs are disabled.
> +                */
> +               ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
> +               WARN_ON(!ptep);
> +               pte_unmap_unlock(ptep, ptl);
> +       }
> +}
> +
>  void __init poking_init(void)
>  {
> +       if (radix_enabled()) {
> +               __poking_init_temp_mm();

Should this also be done with cpuhp_setup_state()?

> +               return;
> +       }
> +
>         WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
>                 "powerpc/text_poke:online", text_area_cpu_up,
>                 text_area_cpu_down) < 0);
> @@ -197,30 +239,93 @@ static inline int unmap_patch_area(void)
>         return 0;
>  }
>
> +struct patch_mapping {
> +       spinlock_t *ptl; /* for protecting pte table */
> +       pte_t *ptep;
> +       struct temp_mm temp_mm;
> +};
> +
> +/*
> + * This can be called for kernel text or a module.
> + */
> +static int map_patch_mm(const void *addr, struct patch_mapping *patch_mapping)
> +{
> +       struct page *page;
> +       struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> +       unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> +
> +       if (is_vmalloc_or_module_addr(addr))
> +               page = vmalloc_to_page(addr);
> +       else
> +               page = virt_to_page(addr);
> +
> +       patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
> +                                            &patch_mapping->ptl);
> +       if (unlikely(!patch_mapping->ptep)) {
> +               pr_warn("map patch: failed to allocate pte for patching\n");
> +               return -1;
> +       }
> +
> +       set_pte_at(patching_mm, patching_addr, patch_mapping->ptep,
> +                  pte_mkdirty(mk_pte(page, PAGE_KERNEL)));
> +
> +       init_temp_mm(&patch_mapping->temp_mm, patching_mm);
> +       start_using_temporary_mm(&patch_mapping->temp_mm);
> +
> +       return 0;
> +}
> +
> +static int unmap_patch_mm(struct patch_mapping *patch_mapping)
> +{
> +       struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> +       unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> +
> +       pte_clear(patching_mm, patching_addr, patch_mapping->ptep);
> +
> +       local_flush_tlb_mm(patching_mm);
> +       stop_using_temporary_mm(&patch_mapping->temp_mm);
> +
> +       pte_unmap_unlock(patch_mapping->ptep, patch_mapping->ptl);
> +
> +       return 0;
> +}
> +
>  static int do_patch_instruction(u32 *addr, struct ppc_inst instr)
>  {
>         int err, rc = 0;
>         u32 *patch_addr = NULL;
>         unsigned long flags;
> +       struct patch_mapping patch_mapping;
>
>         /*
> -        * During early early boot patch_instruction is called
> -        * when text_poke_area is not ready, but we still need
> -        * to allow patching. We just do the plain old patching
> +        * During early early boot patch_instruction is called when the
> +        * patching_mm/text_poke_area is not ready, but we still need to allow
> +        * patching. We just do the plain old patching.
>          */
> -       if (!this_cpu_read(text_poke_area))
> -               return raw_patch_instruction(addr, instr);
> +       if (radix_enabled()) {
> +               if (!this_cpu_read(cpu_patching_mm))
> +                       return raw_patch_instruction(addr, instr);
> +       } else {
> +               if (!this_cpu_read(text_poke_area))
> +                       return raw_patch_instruction(addr, instr);
> +       }

Would testing cpu_patching_addr handler both of these cases?

Then I think it might be clearer to do something like this:
if (radix_enabled()) {
      return patch_instruction_mm(addr, instr);
}

patch_instruction_mm() would combine map_patch_mm(), then patching and
unmap_patch_mm() into one function.

IMO, a bit of code duplication would be cleaner than checking multiple
times for radix_enabled() and having struct patch_mapping especially
for maintaining state.

>
>         local_irq_save(flags);
>
> -       err = map_patch_area(addr);
> +       if (radix_enabled())
> +               err = map_patch_mm(addr, &patch_mapping);
> +       else
> +               err = map_patch_area(addr);
>         if (err)
>                 goto out;
>
>         patch_addr = (u32 *)(__this_cpu_read(cpu_patching_addr) | offset_in_page(addr));
>         rc = __patch_instruction(addr, instr, patch_addr);
>
> -       err = unmap_patch_area();
> +       if (radix_enabled())
> +               err = unmap_patch_mm(&patch_mapping);
> +       else
> +               err = unmap_patch_area();
>
>  out:
>         local_irq_restore(flags);
> --
> 2.32.0
>
Thanks,
Jordan

^ permalink raw reply

* Re: [PATCH v6 1/4] powerpc/64s: Introduce temporary mm for Radix MMU
From: Jordan Niethe @ 2021-09-11  8:26 UTC (permalink / raw)
  To: Christopher M. Riedl; +Cc: linuxppc-dev, linux-hardening
In-Reply-To: <20210911022904.30962-2-cmr@bluescreens.de>

On Sat, Sep 11, 2021 at 12:35 PM Christopher M. Riedl
<cmr@bluescreens.de> wrote:
>
> x86 supports the notion of a temporary mm which restricts access to
> temporary PTEs to a single CPU. A temporary mm is useful for situations
> where a CPU needs to perform sensitive operations (such as patching a
> STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
> said mappings to other CPUs. Another benefit is that other CPU TLBs do
> not need to be flushed when the temporary mm is torn down.
>
> Mappings in the temporary mm can be set in the userspace portion of the
> address-space.
>
> Interrupts must be disabled while the temporary mm is in use. HW
> breakpoints, which may have been set by userspace as watchpoints on
> addresses now within the temporary mm, are saved and disabled when
> loading the temporary mm. The HW breakpoints are restored when unloading
> the temporary mm. All HW breakpoints are indiscriminately disabled while
> the temporary mm is in use - this may include breakpoints set by perf.

I had thought CPUs with a DAWR might not need to do this because the
privilege level that breakpoints trigger on can be configured. But it
turns out in ptrace, etc we use HW_BRK_TYPE_PRIV_ALL.

>
> Based on x86 implementation:
>
> commit cefa929c034e
> ("x86/mm: Introduce temporary mm structs")
>
> Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>
>
> ---
>
> v6:  * Use {start,stop}_using_temporary_mm() instead of
>        {use,unuse}_temporary_mm() as suggested by Christophe.
>
> v5:  * Drop support for using a temporary mm on Book3s64 Hash MMU.
>
> v4:  * Pass the prev mm instead of NULL to switch_mm_irqs_off() when
>        using/unusing the temp mm as suggested by Jann Horn to keep
>        the context.active counter in-sync on mm/nohash.
>      * Disable SLB preload in the temporary mm when initializing the
>        temp_mm struct.
>      * Include asm/debug.h header to fix build issue with
>        ppc44x_defconfig.
> ---
>  arch/powerpc/include/asm/debug.h |  1 +
>  arch/powerpc/kernel/process.c    |  5 +++
>  arch/powerpc/lib/code-patching.c | 56 ++++++++++++++++++++++++++++++++
>  3 files changed, 62 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
> index 86a14736c76c..dfd82635ea8b 100644
> --- a/arch/powerpc/include/asm/debug.h
> +++ b/arch/powerpc/include/asm/debug.h
> @@ -46,6 +46,7 @@ static inline int debugger_fault_handler(struct pt_regs *regs) { return 0; }
>  #endif
>
>  void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
> +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk);
>  bool ppc_breakpoint_available(void);
>  #ifdef CONFIG_PPC_ADV_DEBUG_REGS
>  extern void do_send_trap(struct pt_regs *regs, unsigned long address,
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index 50436b52c213..6aa1f5c4d520 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -865,6 +865,11 @@ static inline int set_breakpoint_8xx(struct arch_hw_breakpoint *brk)
>         return 0;
>  }
>
> +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk)
> +{
> +       memcpy(brk, this_cpu_ptr(&current_brk[nr]), sizeof(*brk));
> +}

The breakpoint code is already a little hard to follow. I'm worried
doing this might spread breakpoint handling into more places in the
future.
What about something like having a breakpoint_pause() function which
clears the hardware registers only and then a breakpoint_resume()
function that copies from current_brk[] back to the hardware
registers?
Then we don't have to make another copy of the breakpoint state.

> +
>  void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
>  {
>         memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
> diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> index f9a3019e37b4..8d61a7d35b89 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c

Sorry I might have missed it, but what was the reason for not putting
this stuff in mmu_context.h?

> @@ -17,6 +17,9 @@
>  #include <asm/code-patching.h>
>  #include <asm/setup.h>
>  #include <asm/inst.h>
> +#include <asm/mmu_context.h>
> +#include <asm/debug.h>
> +#include <asm/tlb.h>
>
>  static int __patch_instruction(u32 *exec_addr, struct ppc_inst instr, u32 *patch_addr)
>  {
> @@ -45,6 +48,59 @@ int raw_patch_instruction(u32 *addr, struct ppc_inst instr)
>  }
>
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> +
> +struct temp_mm {
> +       struct mm_struct *temp;
> +       struct mm_struct *prev;
> +       struct arch_hw_breakpoint brk[HBP_NUM_MAX];
^ Then we wouldn't need this.
> +};
> +
> +static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
> +{
> +       /* We currently only support temporary mm on the Book3s64 Radix MMU */
> +       WARN_ON(!radix_enabled());
> +
> +       temp_mm->temp = mm;
> +       temp_mm->prev = NULL;
> +       memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
> +}
> +
> +static inline void start_using_temporary_mm(struct temp_mm *temp_mm)
> +{
> +       lockdep_assert_irqs_disabled();
> +
> +       temp_mm->prev = current->active_mm;
> +       switch_mm_irqs_off(temp_mm->prev, temp_mm->temp, current);

Now that we only support radix it should be fine again to have it like this:
switch_mm_irqs_off(NULL, temp_mm->temp, current);?
It was changed from that because it would cause issues on nohash I thought.

> +
> +       WARN_ON(!mm_is_thread_local(temp_mm->temp));
> +
> +       if (ppc_breakpoint_available()) {
> +               struct arch_hw_breakpoint null_brk = {0};
> +               int i = 0;
> +
> +               for (; i < nr_wp_slots(); ++i) {
> +                       __get_breakpoint(i, &temp_mm->brk[i]);
> +                       if (temp_mm->brk[i].type != 0)
> +                               __set_breakpoint(i, &null_brk);
> +               }
> +       }
> +}
> +
> +static inline void stop_using_temporary_mm(struct temp_mm *temp_mm)
> +{
> +       lockdep_assert_irqs_disabled();
> +
> +       switch_mm_irqs_off(temp_mm->temp, temp_mm->prev, current);
> +
> +       if (ppc_breakpoint_available()) {
> +               int i = 0;
> +
> +               for (; i < nr_wp_slots(); ++i)
> +                       if (temp_mm->brk[i].type != 0)
> +                               __set_breakpoint(i, &temp_mm->brk[i]);
> +       }
> +}
> +
>  static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
>
>  static int text_area_cpu_up(unsigned int cpu)
> --
> 2.32.0
>
Thanks,
Jordan

^ permalink raw reply

* [PATCH v6 2/4] powerpc: Rework and improve STRICT_KERNEL_RWX patching
From: Christopher M. Riedl @ 2021-09-11  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-hardening
In-Reply-To: <20210911022904.30962-1-cmr@bluescreens.de>

Rework code-patching with STRICT_KERNEL_RWX to prepare for a later patch
which uses a temporary mm for patching under the Book3s64 Radix MMU.
Make improvements by adding a WARN_ON when the patchsite doesn't match
after patching and return the error from __patch_instruction() properly.

Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>

---

v6:  * Remove the pr_warn() message from unmap_patch_area().

v5:  * New to series.
---
 arch/powerpc/lib/code-patching.c | 35 ++++++++++++++++----------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 8d61a7d35b89..8d0bb86125d5 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -102,6 +102,7 @@ static inline void stop_using_temporary_mm(struct temp_mm *temp_mm)
 }
 
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
+static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
 
 static int text_area_cpu_up(unsigned int cpu)
 {
@@ -114,6 +115,7 @@ static int text_area_cpu_up(unsigned int cpu)
 		return -1;
 	}
 	this_cpu_write(text_poke_area, area);
+	this_cpu_write(cpu_patching_addr, (unsigned long)area->addr);
 
 	return 0;
 }
@@ -139,7 +141,7 @@ void __init poking_init(void)
 /*
  * This can be called for kernel text or a module.
  */
-static int map_patch_area(void *addr, unsigned long text_poke_addr)
+static int map_patch_area(void *addr)
 {
 	unsigned long pfn;
 	int err;
@@ -149,17 +151,20 @@ static int map_patch_area(void *addr, unsigned long text_poke_addr)
 	else
 		pfn = __pa_symbol(addr) >> PAGE_SHIFT;
 
-	err = map_kernel_page(text_poke_addr, (pfn << PAGE_SHIFT), PAGE_KERNEL);
+	err = map_kernel_page(__this_cpu_read(cpu_patching_addr),
+			      (pfn << PAGE_SHIFT), PAGE_KERNEL);
 
-	pr_devel("Mapped addr %lx with pfn %lx:%d\n", text_poke_addr, pfn, err);
+	pr_devel("Mapped addr %lx with pfn %lx:%d\n",
+		 __this_cpu_read(cpu_patching_addr), pfn, err);
 	if (err)
 		return -1;
 
 	return 0;
 }
 
-static inline int unmap_patch_area(unsigned long addr)
+static inline int unmap_patch_area(void)
 {
+	unsigned long addr = __this_cpu_read(cpu_patching_addr);
 	pte_t *ptep;
 	pmd_t *pmdp;
 	pud_t *pudp;
@@ -199,11 +204,9 @@ static inline int unmap_patch_area(unsigned long addr)
 
 static int do_patch_instruction(u32 *addr, struct ppc_inst instr)
 {
-	int err;
+	int err, rc = 0;
 	u32 *patch_addr = NULL;
 	unsigned long flags;
-	unsigned long text_poke_addr;
-	unsigned long kaddr = (unsigned long)addr;
 
 	/*
 	 * During early early boot patch_instruction is called
@@ -215,24 +218,20 @@ static int do_patch_instruction(u32 *addr, struct ppc_inst instr)
 
 	local_irq_save(flags);
 
-	text_poke_addr = (unsigned long)__this_cpu_read(text_poke_area)->addr;
-	if (map_patch_area(addr, text_poke_addr)) {
-		err = -1;
+	err = map_patch_area(addr);
+	if (err)
 		goto out;
-	}
 
-	patch_addr = (u32 *)(text_poke_addr + (kaddr & ~PAGE_MASK));
+	patch_addr = (u32 *)(__this_cpu_read(cpu_patching_addr) | offset_in_page(addr));
+	rc = __patch_instruction(addr, instr, patch_addr);
 
-	__patch_instruction(addr, instr, patch_addr);
-
-	err = unmap_patch_area(text_poke_addr);
-	if (err)
-		pr_warn("failed to unmap %lx\n", text_poke_addr);
+	err = unmap_patch_area();
 
 out:
 	local_irq_restore(flags);
+	WARN_ON(!ppc_inst_equal(ppc_inst_read(addr), instr));
 
-	return err;
+	return rc ? rc : err;
 }
 #else /* !CONFIG_STRICT_KERNEL_RWX */
 
-- 
2.32.0


^ permalink raw reply related

* [PATCH v6 3/4] powerpc: Use WARN_ON and fix check in poking_init
From: Christopher M. Riedl @ 2021-09-11  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-hardening
In-Reply-To: <20210911022904.30962-1-cmr@bluescreens.de>

The latest kernel docs list BUG_ON() as 'deprecated' and that they
should be replaced with WARN_ON() (or pr_warn()) when possible. The
BUG_ON() in poking_init() warrants a WARN_ON() rather than a pr_warn()
since the error condition is deemed "unreachable".

Also take this opportunity to fix the failure check in the WARN_ON():
cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, ...) returns a positive integer
on success and a negative integer on failure.

Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>

---

v6:  * New to series - based on Christophe's relentless feedback in the
       crusade against BUG_ON()s :)
---
 arch/powerpc/lib/code-patching.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 8d0bb86125d5..e802e42c2789 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -126,16 +126,11 @@ static int text_area_cpu_down(unsigned int cpu)
 	return 0;
 }
 
-/*
- * Although BUG_ON() is rude, in this case it should only happen if ENOMEM, and
- * we judge it as being preferable to a kernel that will crash later when
- * someone tries to use patch_instruction().
- */
 void __init poking_init(void)
 {
-	BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+	WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
 		"powerpc/text_poke:online", text_area_cpu_up,
-		text_area_cpu_down));
+		text_area_cpu_down) < 0);
 }
 
 /*
-- 
2.32.0


^ permalink raw reply related

* [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix
From: Christopher M. Riedl @ 2021-09-11  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-hardening
In-Reply-To: <20210911022904.30962-1-cmr@bluescreens.de>

When code patching a STRICT_KERNEL_RWX kernel the page containing the
address to be patched is temporarily mapped as writeable. Currently, a
per-cpu vmalloc patch area is used for this purpose. While the patch
area is per-cpu, the temporary page mapping is inserted into the kernel
page tables for the duration of patching. The mapping is exposed to CPUs
other than the patching CPU - this is undesirable from a hardening
perspective. Use a temporary mm instead which keeps the mapping local to
the CPU doing the patching.

Use the `poking_init` init hook to prepare a temporary mm and patching
address. Initialize the temporary mm by copying the init mm. Choose a
randomized patching address inside the temporary mm userspace address
space. The patching address is randomized between PAGE_SIZE and
DEFAULT_MAP_WINDOW-PAGE_SIZE.

Bits of entropy with 64K page size on BOOK3S_64:

        bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)

        PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
        bits of entropy = log2(128TB / 64K)
	bits of entropy = 31

The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
operates - by default the space above DEFAULT_MAP_WINDOW is not
available. Currently the Hash MMU does not use a temporary mm so
technically this upper limit isn't necessary; however, a larger
randomization range does not further "harden" this overall approach and
future work may introduce patching with a temporary mm on Hash as well.

Randomization occurs only once during initialization at boot for each
possible CPU in the system.

Introduce two new functions, map_patch_mm() and unmap_patch_mm(), to
respectively create and remove the temporary mapping with write
permissions at patching_addr. Map the page with PAGE_KERNEL to set
EAA[0] for the PTE which ignores the AMR (so no need to unlock/lock
KUAP) according to PowerISA v3.0b Figure 35 on Radix.

Based on x86 implementation:

commit 4fc19708b165
("x86/alternatives: Initialize temporary mm for patching")

and:

commit b3fd8e83ada0
("x86/alternatives: Use temporary mm for text poking")

Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>

---

v6:  * Small clean-ups (naming, formatting, style, etc).
     * Call stop_using_temporary_mm() before pte_unmap_unlock() after
       patching.
     * Replace BUG_ON()s in poking_init() w/ WARN_ON()s.

v5:  * Only support Book3s64 Radix MMU for now.
     * Use a per-cpu datastructure to hold the patching_addr and
       patching_mm to avoid the need for a synchronization lock/mutex.

v4:  * In the previous series this was two separate patches: one to init
       the temporary mm in poking_init() (unused in powerpc at the time)
       and the other to use it for patching (which removed all the
       per-cpu vmalloc code). Now that we use poking_init() in the
       existing per-cpu vmalloc approach, that separation doesn't work
       as nicely anymore so I just merged the two patches into one.
     * Preload the SLB entry and hash the page for the patching_addr
       when using Hash on book3s64 to avoid taking an SLB and Hash fault
       during patching. The previous implementation was a hack which
       changed current->mm to allow the SLB and Hash fault handlers to
       work with the temporary mm since both of those code-paths always
       assume mm == current->mm.
     * Also (hmm - seeing a trend here) with the book3s64 Hash MMU we
       have to manage the mm->context.active_cpus counter and mm cpumask
       since they determine (via mm_is_thread_local()) if the TLB flush
       in pte_clear() is local or not - it should always be local when
       we're using the temporary mm. On book3s64's Radix MMU we can
       just call local_flush_tlb_mm().
     * Use HPTE_USE_KERNEL_KEY on Hash to avoid costly lock/unlock of
       KUAP.
---
 arch/powerpc/lib/code-patching.c | 119 +++++++++++++++++++++++++++++--
 1 file changed, 112 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index e802e42c2789..af8e2a02a9dd 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -11,6 +11,7 @@
 #include <linux/cpuhotplug.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
+#include <linux/random.h>
 
 #include <asm/tlbflush.h>
 #include <asm/page.h>
@@ -103,6 +104,7 @@ static inline void stop_using_temporary_mm(struct temp_mm *temp_mm)
 
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
+static DEFINE_PER_CPU(struct mm_struct *, cpu_patching_mm);
 
 static int text_area_cpu_up(unsigned int cpu)
 {
@@ -126,8 +128,48 @@ static int text_area_cpu_down(unsigned int cpu)
 	return 0;
 }
 
+static __always_inline void __poking_init_temp_mm(void)
+{
+	int cpu;
+	spinlock_t *ptl; /* for protecting pte table */
+	pte_t *ptep;
+	struct mm_struct *patching_mm;
+	unsigned long patching_addr;
+
+	for_each_possible_cpu(cpu) {
+		patching_mm = copy_init_mm();
+		WARN_ON(!patching_mm);
+		per_cpu(cpu_patching_mm, cpu) = patching_mm;
+
+		/*
+		 * Choose a randomized, page-aligned address from the range:
+		 * [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE] The lower
+		 * address bound is PAGE_SIZE to avoid the zero-page.  The
+		 * upper address bound is DEFAULT_MAP_WINDOW - PAGE_SIZE to
+		 * stay under DEFAULT_MAP_WINDOW with the Book3s64 Hash MMU.
+		 */
+		patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK)
+				% (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE));
+		per_cpu(cpu_patching_addr, cpu) = patching_addr;
+
+		/*
+		 * PTE allocation uses GFP_KERNEL which means we need to
+		 * pre-allocate the PTE here because we cannot do the
+		 * allocation during patching when IRQs are disabled.
+		 */
+		ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
+		WARN_ON(!ptep);
+		pte_unmap_unlock(ptep, ptl);
+	}
+}
+
 void __init poking_init(void)
 {
+	if (radix_enabled()) {
+		__poking_init_temp_mm();
+		return;
+	}
+
 	WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
 		"powerpc/text_poke:online", text_area_cpu_up,
 		text_area_cpu_down) < 0);
@@ -197,30 +239,93 @@ static inline int unmap_patch_area(void)
 	return 0;
 }
 
+struct patch_mapping {
+	spinlock_t *ptl; /* for protecting pte table */
+	pte_t *ptep;
+	struct temp_mm temp_mm;
+};
+
+/*
+ * This can be called for kernel text or a module.
+ */
+static int map_patch_mm(const void *addr, struct patch_mapping *patch_mapping)
+{
+	struct page *page;
+	struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
+	unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
+
+	if (is_vmalloc_or_module_addr(addr))
+		page = vmalloc_to_page(addr);
+	else
+		page = virt_to_page(addr);
+
+	patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
+					     &patch_mapping->ptl);
+	if (unlikely(!patch_mapping->ptep)) {
+		pr_warn("map patch: failed to allocate pte for patching\n");
+		return -1;
+	}
+
+	set_pte_at(patching_mm, patching_addr, patch_mapping->ptep,
+		   pte_mkdirty(mk_pte(page, PAGE_KERNEL)));
+
+	init_temp_mm(&patch_mapping->temp_mm, patching_mm);
+	start_using_temporary_mm(&patch_mapping->temp_mm);
+
+	return 0;
+}
+
+static int unmap_patch_mm(struct patch_mapping *patch_mapping)
+{
+	struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
+	unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
+
+	pte_clear(patching_mm, patching_addr, patch_mapping->ptep);
+
+	local_flush_tlb_mm(patching_mm);
+	stop_using_temporary_mm(&patch_mapping->temp_mm);
+
+	pte_unmap_unlock(patch_mapping->ptep, patch_mapping->ptl);
+
+	return 0;
+}
+
 static int do_patch_instruction(u32 *addr, struct ppc_inst instr)
 {
 	int err, rc = 0;
 	u32 *patch_addr = NULL;
 	unsigned long flags;
+	struct patch_mapping patch_mapping;
 
 	/*
-	 * During early early boot patch_instruction is called
-	 * when text_poke_area is not ready, but we still need
-	 * to allow patching. We just do the plain old patching
+	 * During early early boot patch_instruction is called when the
+	 * patching_mm/text_poke_area is not ready, but we still need to allow
+	 * patching. We just do the plain old patching.
 	 */
-	if (!this_cpu_read(text_poke_area))
-		return raw_patch_instruction(addr, instr);
+	if (radix_enabled()) {
+		if (!this_cpu_read(cpu_patching_mm))
+			return raw_patch_instruction(addr, instr);
+	} else {
+		if (!this_cpu_read(text_poke_area))
+			return raw_patch_instruction(addr, instr);
+	}
 
 	local_irq_save(flags);
 
-	err = map_patch_area(addr);
+	if (radix_enabled())
+		err = map_patch_mm(addr, &patch_mapping);
+	else
+		err = map_patch_area(addr);
 	if (err)
 		goto out;
 
 	patch_addr = (u32 *)(__this_cpu_read(cpu_patching_addr) | offset_in_page(addr));
 	rc = __patch_instruction(addr, instr, patch_addr);
 
-	err = unmap_patch_area();
+	if (radix_enabled())
+		err = unmap_patch_mm(&patch_mapping);
+	else
+		err = unmap_patch_area();
 
 out:
 	local_irq_restore(flags);
-- 
2.32.0


^ permalink raw reply related

* [PATCH v6 0/4] Use per-CPU temporary mappings for patching on Radix MMU
From: Christopher M. Riedl @ 2021-09-11  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-hardening

When compiled with CONFIG_STRICT_KERNEL_RWX, the kernel must create
temporary mappings when patching itself. These mappings temporarily
override the strict RWX text protections to permit a write. Currently,
powerpc allocates a per-CPU VM area for patching. Patching occurs as
follows:

	1. Map page in per-CPU VM area w/ PAGE_KERNEL protection
	2. Patch text
	3. Remove the temporary mapping

While the VM area is per-CPU, the mapping is actually inserted into the
kernel page tables. Presumably, this could allow another CPU to access
the normally write-protected text - either malicously or accidentally -
via this same mapping if the address of the VM area is known. Ideally,
the mapping should be kept local to the CPU doing the patching [0].

x86 introduced "temporary mm" structs which allow the creation of mappings
local to a particular CPU [1]. This series intends to bring the notion of a
temporary mm to powerpc's Book3s64 Radix MMU and harden it by using such a
mapping for patching a kernel with strict RWX permissions.

Tested boot and ftrace:
	- QEMU+KVM (host: POWER9 Blackbird): Radix MMU w/ KUAP
	- QEMU+KVM (host: POWER9 Blackbird): Hash MMU

Tested boot:
	- QEMU+TCG: ppc44x (bamboo)
	- QEMU+TCG: g5 (mac99)

I also tested with various extra config options enabled as suggested in
section 12) in Documentation/process/submit-checklist.rst.

v6:	* Split series to separate powerpc percpu temporary mm
	  implementation and LKDTM test (still working on this one) and
	  implement some of Christophe Leroy's feedback.
	* Rebase on linuxppc/next: powerpc-5.15-1

v5:	* Only support Book3s64 Radix MMU for now. There are some issues with
	  the previous implementation on the Hash MMU as pointed out by Nick
	  Piggin. Fixing these is not trivial so we only support the Radix MMU
	  for now.

v4:	* It's time to revisit this series again since @jpn and @mpe fixed
	  our known STRICT_*_RWX bugs on powerpc/64s.
	* Rebase on linuxppc/next:
          commit ee1bc694fbaec ("powerpc/kvm: Fix build error when PPC_MEM_KEYS/PPC_PSERIES=n")
	* Completely rework how map_patch() works on book3s64 Hash MMU
	* Split the LKDTM x86_64 and powerpc bits into separate patches
	* Annotate commit messages with changes from v3 instead of
	  listing them here completely out-of context...

v3:	* Rebase on linuxppc/next: commit 9123e3a74ec7 ("Linux 5.9-rc1")
	* Move temporary mm implementation into code-patching.c where it
	  belongs
	* Implement LKDTM hijacker test on x86_64 (on IBM time oof) Do
	* not use address zero for the patching address in the
	  temporary mm (thanks @dja for pointing this out!)
	* Wrap the LKDTM test w/ CONFIG_SMP as suggested by Christophe
	  Leroy
	* Comments to clarify PTE pre-allocation and patching addr
	  selection

v2:	* Rebase on linuxppc/next:
	  commit 105fb38124a4 ("powerpc/8xx: Modify ptep_get()")
	* Always dirty pte when mapping patch
	* Use `ppc_inst_len` instead of `sizeof` on instructions
	* Declare LKDTM patching addr accessor in header where it belongs	

v1:	* Rebase on linuxppc/next (4336b9337824)
	* Save and restore second hw watchpoint
	* Use new ppc_inst_* functions for patching check and in LKDTM test

rfc-v2:	* Many fixes and improvements mostly based on extensive feedback
          and testing by Christophe Leroy (thanks!).
	* Make patching_mm and patching_addr static and move
	  '__ro_after_init' to after the variable name (more common in
	  other parts of the kernel)
	* Use 'asm/debug.h' header instead of 'asm/hw_breakpoint.h' to
	  fix PPC64e compile
	* Add comment explaining why we use BUG_ON() during the init
	  call to setup for patching later
	* Move ptep into patch_mapping to avoid walking page tables a
	  second time when unmapping the temporary mapping
	* Use KUAP under non-radix, also manually dirty the PTE for patch
	  mapping on non-BOOK3S_64 platforms
	* Properly return any error from __patch_instruction
        * Do not use 'memcmp' where a simple comparison is appropriate
	* Simplify expression for patch address by removing pointer maths
	* Add LKDTM test

[0]: https://github.com/linuxppc/issues/issues/224
[1]: https://lore.kernel.org/kernel-hardening/20190426232303.28381-1-nadav.amit@gmail.com/

Christopher M. Riedl (4):
  powerpc/64s: Introduce temporary mm for Radix MMU
  powerpc: Rework and improve STRICT_KERNEL_RWX patching
  powerpc: Use WARN_ON and fix check in poking_init
  powerpc/64s: Initialize and use a temporary mm for patching on Radix

 arch/powerpc/include/asm/debug.h |   1 +
 arch/powerpc/kernel/process.c    |   5 +
 arch/powerpc/lib/code-patching.c | 215 ++++++++++++++++++++++++++-----
 3 files changed, 191 insertions(+), 30 deletions(-)

-- 
2.32.0


^ permalink raw reply

* [PATCH v6 1/4] powerpc/64s: Introduce temporary mm for Radix MMU
From: Christopher M. Riedl @ 2021-09-11  2:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-hardening
In-Reply-To: <20210911022904.30962-1-cmr@bluescreens.de>

x86 supports the notion of a temporary mm which restricts access to
temporary PTEs to a single CPU. A temporary mm is useful for situations
where a CPU needs to perform sensitive operations (such as patching a
STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
said mappings to other CPUs. Another benefit is that other CPU TLBs do
not need to be flushed when the temporary mm is torn down.

Mappings in the temporary mm can be set in the userspace portion of the
address-space.

Interrupts must be disabled while the temporary mm is in use. HW
breakpoints, which may have been set by userspace as watchpoints on
addresses now within the temporary mm, are saved and disabled when
loading the temporary mm. The HW breakpoints are restored when unloading
the temporary mm. All HW breakpoints are indiscriminately disabled while
the temporary mm is in use - this may include breakpoints set by perf.

Based on x86 implementation:

commit cefa929c034e
("x86/mm: Introduce temporary mm structs")

Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>

---

v6:  * Use {start,stop}_using_temporary_mm() instead of
       {use,unuse}_temporary_mm() as suggested by Christophe.

v5:  * Drop support for using a temporary mm on Book3s64 Hash MMU.

v4:  * Pass the prev mm instead of NULL to switch_mm_irqs_off() when
       using/unusing the temp mm as suggested by Jann Horn to keep
       the context.active counter in-sync on mm/nohash.
     * Disable SLB preload in the temporary mm when initializing the
       temp_mm struct.
     * Include asm/debug.h header to fix build issue with
       ppc44x_defconfig.
---
 arch/powerpc/include/asm/debug.h |  1 +
 arch/powerpc/kernel/process.c    |  5 +++
 arch/powerpc/lib/code-patching.c | 56 ++++++++++++++++++++++++++++++++
 3 files changed, 62 insertions(+)

diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index 86a14736c76c..dfd82635ea8b 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -46,6 +46,7 @@ static inline int debugger_fault_handler(struct pt_regs *regs) { return 0; }
 #endif
 
 void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
+void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk);
 bool ppc_breakpoint_available(void);
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
 extern void do_send_trap(struct pt_regs *regs, unsigned long address,
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 50436b52c213..6aa1f5c4d520 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -865,6 +865,11 @@ static inline int set_breakpoint_8xx(struct arch_hw_breakpoint *brk)
 	return 0;
 }
 
+void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk)
+{
+	memcpy(brk, this_cpu_ptr(&current_brk[nr]), sizeof(*brk));
+}
+
 void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
 {
 	memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index f9a3019e37b4..8d61a7d35b89 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -17,6 +17,9 @@
 #include <asm/code-patching.h>
 #include <asm/setup.h>
 #include <asm/inst.h>
+#include <asm/mmu_context.h>
+#include <asm/debug.h>
+#include <asm/tlb.h>
 
 static int __patch_instruction(u32 *exec_addr, struct ppc_inst instr, u32 *patch_addr)
 {
@@ -45,6 +48,59 @@ int raw_patch_instruction(u32 *addr, struct ppc_inst instr)
 }
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
+
+struct temp_mm {
+	struct mm_struct *temp;
+	struct mm_struct *prev;
+	struct arch_hw_breakpoint brk[HBP_NUM_MAX];
+};
+
+static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
+{
+	/* We currently only support temporary mm on the Book3s64 Radix MMU */
+	WARN_ON(!radix_enabled());
+
+	temp_mm->temp = mm;
+	temp_mm->prev = NULL;
+	memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
+}
+
+static inline void start_using_temporary_mm(struct temp_mm *temp_mm)
+{
+	lockdep_assert_irqs_disabled();
+
+	temp_mm->prev = current->active_mm;
+	switch_mm_irqs_off(temp_mm->prev, temp_mm->temp, current);
+
+	WARN_ON(!mm_is_thread_local(temp_mm->temp));
+
+	if (ppc_breakpoint_available()) {
+		struct arch_hw_breakpoint null_brk = {0};
+		int i = 0;
+
+		for (; i < nr_wp_slots(); ++i) {
+			__get_breakpoint(i, &temp_mm->brk[i]);
+			if (temp_mm->brk[i].type != 0)
+				__set_breakpoint(i, &null_brk);
+		}
+	}
+}
+
+static inline void stop_using_temporary_mm(struct temp_mm *temp_mm)
+{
+	lockdep_assert_irqs_disabled();
+
+	switch_mm_irqs_off(temp_mm->temp, temp_mm->prev, current);
+
+	if (ppc_breakpoint_available()) {
+		int i = 0;
+
+		for (; i < nr_wp_slots(); ++i)
+			if (temp_mm->brk[i].type != 0)
+				__set_breakpoint(i, &temp_mm->brk[i]);
+	}
+}
+
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 
 static int text_area_cpu_up(unsigned int cpu)
-- 
2.32.0


^ permalink raw reply related

* [PATCH bpf-next v2] bpf: Change value of MAX_TAIL_CALL_CNT from 32 to 33
From: Tiezhu Yang @ 2021-09-11  1:56 UTC (permalink / raw)
  To: Shubham Bansal, Russell King, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Zi Shen Lim, Catalin Marinas,
	Will Deacon, Paul Burton, Thomas Bogendoerfer, naveen.n.rao,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Luke Nelson, Xi Wang, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	bjorn, davem, Johan Almbladh, Paul Chaignon
  Cc: netdev, linux-kernel, linux-mips, sparclinux, bpf, linuxppc-dev,
	linux-riscv, linux-arm-kernel

In the current code, the actual max tail call count is 33 which is greater
than MAX_TAIL_CALL_CNT (defined as 32), the actual limit is not consistent
with the meaning of MAX_TAIL_CALL_CNT, there is some confusion and need to
spend some time to think about the reason at the first glance.

We can see the historical evolution from commit 04fd61ab36ec ("bpf: allow
bpf programs to tail-call other bpf programs") and commit f9dabe016b63
("bpf: Undo off-by-one in interpreter tail call count limit").

In order to avoid changing existing behavior, the actual limit is 33 now,
this is reasonable.

After commit 874be05f525e ("bpf, tests: Add tail call test suite"), we can
see there exists failed testcase.

On all archs when CONFIG_BPF_JIT_ALWAYS_ON is not set:
 # echo 0 > /proc/sys/net/core/bpf_jit_enable
 # modprobe test_bpf
 # dmesg | grep -w FAIL
 Tail call error path, max count reached jited:0 ret 34 != 33 FAIL

On some archs:
 # echo 1 > /proc/sys/net/core/bpf_jit_enable
 # modprobe test_bpf
 # dmesg | grep -w FAIL
 Tail call error path, max count reached jited:1 ret 34 != 33 FAIL

So it is necessary to change the value of MAX_TAIL_CALL_CNT from 32 to 33,
then do some small changes of the related code.

With this patch, it does not change the current limit 33, MAX_TAIL_CALL_CNT
can reflect the actual max tail call count, the tailcall selftests can work
well, and also the above failed testcase in test_bpf can be fixed for the
interpreter (all archs) and the JIT (all archs except for x86).

 # uname -m
 x86_64
 # echo 1 > /proc/sys/net/core/bpf_jit_enable
 # modprobe test_bpf
 # dmesg | grep -w FAIL
 Tail call error path, max count reached jited:1 ret 33 != 34 FAIL

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
---

v2:
  -- fix the typos in the commit message and update the commit message.
  -- fix the failed tailcall selftests for x86 jit.
     I am not quite sure the change on x86 is proper, with this change,
     tailcall selftests passed, but tailcall limit test in test_bpf.ko
     failed, I do not know the reason now, I think this is another issue,
     maybe someone more versed in x86 jit could take a look.

 arch/arm/net/bpf_jit_32.c         | 11 ++++++-----
 arch/arm64/net/bpf_jit_comp.c     |  7 ++++---
 arch/mips/net/ebpf_jit.c          |  4 ++--
 arch/powerpc/net/bpf_jit_comp32.c |  4 ++--
 arch/powerpc/net/bpf_jit_comp64.c | 12 ++++++------
 arch/riscv/net/bpf_jit_comp32.c   |  4 ++--
 arch/riscv/net/bpf_jit_comp64.c   |  4 ++--
 arch/sparc/net/bpf_jit_comp_64.c  |  8 ++++----
 arch/x86/net/bpf_jit_comp.c       | 10 +++++-----
 include/linux/bpf.h               |  2 +-
 kernel/bpf/core.c                 |  4 ++--
 11 files changed, 36 insertions(+), 34 deletions(-)

diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index a951276..39d9ae9 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -1180,18 +1180,19 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
 
 	/* tmp2[0] = array, tmp2[1] = index */
 
-	/* if (tail_call_cnt > MAX_TAIL_CALL_CNT)
-	 *	goto out;
+	/*
 	 * tail_call_cnt++;
+	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
+	 *	goto out;
 	 */
+	tc = arm_bpf_get_reg64(tcc, tmp, ctx);
+	emit(ARM_ADDS_I(tc[1], tc[1], 1), ctx);
+	emit(ARM_ADC_I(tc[0], tc[0], 0), ctx);
 	lo = (u32)MAX_TAIL_CALL_CNT;
 	hi = (u32)((u64)MAX_TAIL_CALL_CNT >> 32);
-	tc = arm_bpf_get_reg64(tcc, tmp, ctx);
 	emit(ARM_CMP_I(tc[0], hi), ctx);
 	_emit(ARM_COND_EQ, ARM_CMP_I(tc[1], lo), ctx);
 	_emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
-	emit(ARM_ADDS_I(tc[1], tc[1], 1), ctx);
-	emit(ARM_ADC_I(tc[0], tc[0], 0), ctx);
 	arm_bpf_put_reg64(tcc, tmp, ctx);
 
 	/* prog = array->ptrs[index]
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 41c23f4..5d6c843 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -286,14 +286,15 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
 	emit(A64_CMP(0, r3, tmp), ctx);
 	emit(A64_B_(A64_COND_CS, jmp_offset), ctx);
 
-	/* if (tail_call_cnt > MAX_TAIL_CALL_CNT)
-	 *     goto out;
+	/*
 	 * tail_call_cnt++;
+	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
+	 *     goto out;
 	 */
+	emit(A64_ADD_I(1, tcc, tcc, 1), ctx);
 	emit_a64_mov_i64(tmp, MAX_TAIL_CALL_CNT, ctx);
 	emit(A64_CMP(1, tcc, tmp), ctx);
 	emit(A64_B_(A64_COND_HI, jmp_offset), ctx);
-	emit(A64_ADD_I(1, tcc, tcc, 1), ctx);
 
 	/* prog = array->ptrs[index];
 	 * if (prog == NULL)
diff --git a/arch/mips/net/ebpf_jit.c b/arch/mips/net/ebpf_jit.c
index 3a73e93..029fc34 100644
--- a/arch/mips/net/ebpf_jit.c
+++ b/arch/mips/net/ebpf_jit.c
@@ -617,14 +617,14 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx, int this_idx)
 	b_off = b_imm(this_idx + 1, ctx);
 	emit_instr(ctx, bne, MIPS_R_AT, MIPS_R_ZERO, b_off);
 	/*
-	 * if (TCC-- < 0)
+	 * if (--TCC < 0)
 	 *     goto out;
 	 */
 	/* Delay slot */
 	tcc_reg = (ctx->flags & EBPF_TCC_IN_V1) ? MIPS_R_V1 : MIPS_R_S4;
 	emit_instr(ctx, daddiu, MIPS_R_T5, tcc_reg, -1);
 	b_off = b_imm(this_idx + 1, ctx);
-	emit_instr(ctx, bltz, tcc_reg, b_off);
+	emit_instr(ctx, bltz, MIPS_R_T5, b_off);
 	/*
 	 * prog = array->ptrs[index];
 	 * if (prog == NULL)
diff --git a/arch/powerpc/net/bpf_jit_comp32.c b/arch/powerpc/net/bpf_jit_comp32.c
index beb12cb..b5585ad 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c
@@ -221,12 +221,12 @@ static void bpf_jit_emit_tail_call(u32 *image, struct codegen_context *ctx, u32
 	PPC_BCC(COND_GE, out);
 
 	/*
+	 * tail_call_cnt++;
 	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
 	 *   goto out;
 	 */
-	EMIT(PPC_RAW_CMPLWI(_R0, MAX_TAIL_CALL_CNT));
-	/* tail_call_cnt++; */
 	EMIT(PPC_RAW_ADDIC(_R0, _R0, 1));
+	EMIT(PPC_RAW_CMPLWI(_R0, MAX_TAIL_CALL_CNT));
 	PPC_BCC(COND_GT, out);
 
 	/* prog = array->ptrs[index]; */
diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
index b87a63d..bb15cc4 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -227,6 +227,12 @@ static void bpf_jit_emit_tail_call(u32 *image, struct codegen_context *ctx, u32
 	PPC_BCC(COND_GE, out);
 
 	/*
+	 * tail_call_cnt++;
+	 */
+	EMIT(PPC_RAW_ADDI(b2p[TMP_REG_1], b2p[TMP_REG_1], 1));
+	PPC_BPF_STL(b2p[TMP_REG_1], 1, bpf_jit_stack_tailcallcnt(ctx));
+
+	/*
 	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
 	 *   goto out;
 	 */
@@ -234,12 +240,6 @@ static void bpf_jit_emit_tail_call(u32 *image, struct codegen_context *ctx, u32
 	EMIT(PPC_RAW_CMPLWI(b2p[TMP_REG_1], MAX_TAIL_CALL_CNT));
 	PPC_BCC(COND_GT, out);
 
-	/*
-	 * tail_call_cnt++;
-	 */
-	EMIT(PPC_RAW_ADDI(b2p[TMP_REG_1], b2p[TMP_REG_1], 1));
-	PPC_BPF_STL(b2p[TMP_REG_1], 1, bpf_jit_stack_tailcallcnt(ctx));
-
 	/* prog = array->ptrs[index]; */
 	EMIT(PPC_RAW_MULI(b2p[TMP_REG_1], b2p_index, 8));
 	EMIT(PPC_RAW_ADD(b2p[TMP_REG_1], b2p[TMP_REG_1], b2p_bpf_array));
diff --git a/arch/riscv/net/bpf_jit_comp32.c b/arch/riscv/net/bpf_jit_comp32.c
index e649742..1608d94 100644
--- a/arch/riscv/net/bpf_jit_comp32.c
+++ b/arch/riscv/net/bpf_jit_comp32.c
@@ -800,12 +800,12 @@ static int emit_bpf_tail_call(int insn, struct rv_jit_context *ctx)
 
 	/*
 	 * temp_tcc = tcc - 1;
-	 * if (tcc < 0)
+	 * if (temp_tcc < 0)
 	 *   goto out;
 	 */
 	emit(rv_addi(RV_REG_T1, RV_REG_TCC, -1), ctx);
 	off = ninsns_rvoff(tc_ninsn - (ctx->ninsns - start_insn));
-	emit_bcc(BPF_JSLT, RV_REG_TCC, RV_REG_ZERO, off, ctx);
+	emit_bcc(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
 
 	/*
 	 * prog = array->ptrs[index];
diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
index 3af4131..6e9ba83 100644
--- a/arch/riscv/net/bpf_jit_comp64.c
+++ b/arch/riscv/net/bpf_jit_comp64.c
@@ -311,12 +311,12 @@ static int emit_bpf_tail_call(int insn, struct rv_jit_context *ctx)
 	off = ninsns_rvoff(tc_ninsn - (ctx->ninsns - start_insn));
 	emit_branch(BPF_JGE, RV_REG_A2, RV_REG_T1, off, ctx);
 
-	/* if (TCC-- < 0)
+	/* if (--TCC < 0)
 	 *     goto out;
 	 */
 	emit_addi(RV_REG_T1, tcc, -1, ctx);
 	off = ninsns_rvoff(tc_ninsn - (ctx->ninsns - start_insn));
-	emit_branch(BPF_JSLT, tcc, RV_REG_ZERO, off, ctx);
+	emit_branch(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
 
 	/* prog = array->ptrs[index];
 	 * if (!prog)
diff --git a/arch/sparc/net/bpf_jit_comp_64.c b/arch/sparc/net/bpf_jit_comp_64.c
index 9a2f20c..50d914c 100644
--- a/arch/sparc/net/bpf_jit_comp_64.c
+++ b/arch/sparc/net/bpf_jit_comp_64.c
@@ -863,6 +863,10 @@ static void emit_tail_call(struct jit_ctx *ctx)
 	emit_branch(BGEU, ctx->idx, ctx->idx + OFFSET1, ctx);
 	emit_nop(ctx);
 
+	emit_alu_K(ADD, tmp, 1, ctx);
+	off = BPF_TAILCALL_CNT_SP_OFF;
+	emit(ST32 | IMMED | RS1(SP) | S13(off) | RD(tmp), ctx);
+
 	off = BPF_TAILCALL_CNT_SP_OFF;
 	emit(LD32 | IMMED | RS1(SP) | S13(off) | RD(tmp), ctx);
 	emit_cmpi(tmp, MAX_TAIL_CALL_CNT, ctx);
@@ -870,10 +874,6 @@ static void emit_tail_call(struct jit_ctx *ctx)
 	emit_branch(BGU, ctx->idx, ctx->idx + OFFSET2, ctx);
 	emit_nop(ctx);
 
-	emit_alu_K(ADD, tmp, 1, ctx);
-	off = BPF_TAILCALL_CNT_SP_OFF;
-	emit(ST32 | IMMED | RS1(SP) | S13(off) | RD(tmp), ctx);
-
 	emit_alu3_K(SLL, bpf_index, 3, tmp, ctx);
 	emit_alu(ADD, bpf_array, tmp, ctx);
 	off = offsetof(struct bpf_array, ptrs);
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 0fe6aac..74a9e61 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -402,7 +402,7 @@ static int get_pop_bytes(bool *callee_regs_used)
  * ... bpf_tail_call(void *ctx, struct bpf_array *array, u64 index) ...
  *   if (index >= array->map.max_entries)
  *     goto out;
- *   if (++tail_call_cnt > MAX_TAIL_CALL_CNT)
+ *   if (tail_call_cnt++ == MAX_TAIL_CALL_CNT)
  *     goto out;
  *   prog = array->ptrs[index];
  *   if (prog == NULL)
@@ -452,13 +452,13 @@ static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
 	EMIT2(X86_JBE, OFFSET1);                  /* jbe out */
 
 	/*
-	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
+	 * if (tail_call_cnt++ == MAX_TAIL_CALL_CNT)
 	 *	goto out;
 	 */
 	EMIT2_off32(0x8B, 0x85, tcc_off);         /* mov eax, dword ptr [rbp - tcc_off] */
 	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
 #define OFFSET2 (off2 + RETPOLINE_RCX_BPF_JIT_SIZE)
-	EMIT2(X86_JA, OFFSET2);                   /* ja out */
+	EMIT2(X86_JE, OFFSET2);                   /* je out */
 	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
 	EMIT2_off32(0x89, 0x85, tcc_off);         /* mov dword ptr [rbp - tcc_off], eax */
 
@@ -530,12 +530,12 @@ static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke,
 	}
 
 	/*
-	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
+	 * if (tail_call_cnt++ == MAX_TAIL_CALL_CNT)
 	 *	goto out;
 	 */
 	EMIT2_off32(0x8B, 0x85, tcc_off);             /* mov eax, dword ptr [rbp - tcc_off] */
 	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
-	EMIT2(X86_JA, off1);                          /* ja out */
+	EMIT2(X86_JE, off1);                          /* je out */
 	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
 	EMIT2_off32(0x89, 0x85, tcc_off);             /* mov dword ptr [rbp - tcc_off], eax */
 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f4c16f1..224cc7e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1046,7 +1046,7 @@ struct bpf_array {
 };
 
 #define BPF_COMPLEXITY_LIMIT_INSNS      1000000 /* yes. 1M insns */
-#define MAX_TAIL_CALL_CNT 32
+#define MAX_TAIL_CALL_CNT 33
 
 #define BPF_F_ACCESS_MASK	(BPF_F_RDONLY |		\
 				 BPF_F_RDONLY_PROG |	\
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 9f4636d..8edb1c3 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1564,10 +1564,10 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn)
 
 		if (unlikely(index >= array->map.max_entries))
 			goto out;
-		if (unlikely(tail_call_cnt > MAX_TAIL_CALL_CNT))
-			goto out;
 
 		tail_call_cnt++;
+		if (unlikely(tail_call_cnt > MAX_TAIL_CALL_CNT))
+			goto out;
 
 		prog = READ_ONCE(array->ptrs[index]);
 		if (!prog)
-- 
2.1.0


^ permalink raw reply related

* [PATCH v5 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance
From: Ricardo Neri @ 2021-09-11  1:18 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Len Brown, Aubrey Li, Rafael J . Wysocki, Srikar Dronamraju,
	Ravi V. Shankar, linuxppc-dev, Aubrey Li, Nicholas Piggin,
	Ricardo Neri, Steven Rostedt, Quentin Perret, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Srinivas Pandruvada,
	Joel Fernandes (Google), Tim Chen, Dietmar Eggemann, linux-kernel,
	Ricardo Neri
In-Reply-To: <20210911011819.12184-1-ricardo.neri-calderon@linux.intel.com>

When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
check for the idle state of the destination CPU, dst_cpu, but also of
its SMT siblings.

If dst_cpu is idle but its SMT siblings are busy, performance suffers
if it pulls tasks from a medium priority CPU that does not have SMT
siblings.

Implement asym_smt_can_pull_tasks() to inspect the state of the SMT
siblings of both dst_cpu and the CPUs in the candidate busiest group.

Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Quentin Perret <qperret@google.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
  * Use sg_lb_stats::sum_nr_running the idle state of a scheduling group.
    (Vincent, Peter)
  * Do not even idle CPUs in asym_smt_can_pull_tasks(). (Vincent)
  * Updated function documentation and corrected a typo.

Changes since v3:
  * Removed the arch_asym_check_smt_siblings() hook. Discussions with the
    powerpc folks showed that this patch should not impact them. Also, more
    recent powerpc processor no longer use asym_packing. (PeterZ)
  * Removed unnecessary local variable in asym_can_pull_tasks(). (Dietmar)
  * Removed unnecessary check for local CPUs when the local group has zero
    utilization. (Joel)
  * Renamed asym_can_pull_tasks() as asym_smt_can_pull_tasks() to reflect
    the fact that it deals with SMT cases.
  * Made asym_smt_can_pull_tasks() return false for !CONFIG_SCHED_SMT so
    that callers can deal with non-SMT cases.

Changes since v2:
  * Reworded the commit message to reflect updates in code.
  * Corrected misrepresentation of dst_cpu as the CPU doing the load
    balancing. (PeterZ)
  * Removed call to arch_asym_check_smt_siblings() as it is now called in
    sched_asym().

Changes since v1:
  * Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
    tasks. Instead, reclassify the candidate busiest group, as it
    may still be selected. (PeterZ)
  * Avoid an expensive and unnecessary call to cpumask_weight() when
    determining if a sched_group is comprised of SMT siblings.
    (PeterZ).
---
 kernel/sched/fair.c | 94 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 26db017c14a3..8d763dd0174b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8597,10 +8597,98 @@ group_type group_classify(unsigned int imbalance_pct,
 	return group_has_spare;
 }
 
+/**
+ * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
+ * @dst_cpu:	Destination CPU of the load balancing
+ * @sds:	Load-balancing data with statistics of the local group
+ * @sgs:	Load-balancing statistics of the candidate busiest group
+ * @sg:		The candidate busiest group
+ *
+ * Check the state of the SMT siblings of both @sds::local and @sg and decide
+ * if @dst_cpu can pull tasks.
+ *
+ * If @dst_cpu does not have SMT siblings, it can pull tasks if two or more of
+ * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
+ * only if @dst_cpu has higher priority.
+ *
+ * If both @dst_cpu and @sg have SMT siblings, and @sg has exactly one more
+ * busy CPU than @sds::local, let @dst_cpu pull tasks if it has higher priority.
+ * Bigger imbalances in the number of busy CPUs will be dealt with in
+ * update_sd_pick_busiest().
+ *
+ * If @sg does not have SMT siblings, only pull tasks if all of the SMT siblings
+ * of @dst_cpu are idle and @sg has lower priority.
+ */
+static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
+				    struct sg_lb_stats *sgs,
+				    struct sched_group *sg)
+{
+#ifdef CONFIG_SCHED_SMT
+	bool local_is_smt, sg_is_smt;
+	int sg_busy_cpus;
+
+	local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
+	sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
+
+	sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
+
+	if (!local_is_smt) {
+		/*
+		 * If we are here, @dst_cpu is idle and does not have SMT
+		 * siblings. Pull tasks if candidate group has two or more
+		 * busy CPUs.
+		 */
+		if (sg_is_smt && sg_busy_cpus >= 2)
+			return true;
+
+		/*
+		 * @dst_cpu does not have SMT siblings. @sg may have SMT
+		 * siblings and only one is busy. In such case, @dst_cpu
+		 * can help if it has higher priority and is idle (i.e.,
+		 * it has no running tasks).
+		 */
+		return !sds->local_stat.sum_nr_running &&
+		       sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
+	}
+
+	/* @dst_cpu has SMT siblings. */
+
+	if (sg_is_smt) {
+		int local_busy_cpus = sds->local->group_weight -
+				      sds->local_stat.idle_cpus;
+		int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
+
+		if (busy_cpus_delta == 1)
+			return sched_asym_prefer(dst_cpu,
+						 sg->asym_prefer_cpu);
+
+		return false;
+	}
+
+	/*
+	 * @sg does not have SMT siblings. Ensure that @sds::local does not end
+	 * up with more than one busy SMT sibling and only pull tasks if there
+	 * are not busy CPUs (i.e., no CPU has running tasks).
+	 */
+	if (!sds->local_stat.sum_nr_running)
+		return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
+
+	return false;
+#else
+	/* Always return false so that callers deal with non-SMT cases. */
+	return false;
+#endif
+}
+
 static inline bool
 sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
 	   struct sched_group *group)
 {
+	/* Only do SMT checks if either local or candidate have SMT siblings */
+	if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
+	    (group->flags & SD_SHARE_CPUCAPACITY))
+		return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
+
 	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
 }
 
@@ -9606,6 +9694,12 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		    nr_running == 1)
 			continue;
 
+		/* Make sure we only pull tasks from a CPU of lower priority */
+		if ((env->sd->flags & SD_ASYM_PACKING) &&
+		    sched_asym_prefer(i, env->dst_cpu) &&
+		    nr_running == 1)
+			continue;
+
 		switch (env->migration_type) {
 		case migrate_load:
 			/*
-- 
2.17.1


^ permalink raw reply related

* [PATCH v5 5/6] sched/fair: Carve out logic to mark a group for asymmetric packing
From: Ricardo Neri @ 2021-09-11  1:18 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Len Brown, Aubrey Li, Rafael J . Wysocki, Srikar Dronamraju,
	Ravi V. Shankar, linuxppc-dev, Aubrey Li, Nicholas Piggin,
	Ricardo Neri, Steven Rostedt, Quentin Perret, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Srinivas Pandruvada,
	Joel Fernandes (Google), Tim Chen, Dietmar Eggemann, linux-kernel,
	Ricardo Neri
In-Reply-To: <20210911011819.12184-1-ricardo.neri-calderon@linux.intel.com>

Create a separate function, sched_asym(). A subsequent changeset will
introduce logic to deal with SMT in conjunction with asmymmetric
packing. Such logic will need the statistics of the scheduling
group provided as argument. Update them before calling sched_asym().

Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Quentin Perret <qperret@google.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
  * None

Changes since v3:
  * Remove a redundant check for the local group in sched_asym().
    (Dietmar)
  * Reworded commit message for clarity. (Len)

Changes since v2:
  * Introduced this patch.

Changes since v1:
  * N/A
---
 kernel/sched/fair.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c5851260b4d8..26db017c14a3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8597,6 +8597,13 @@ group_type group_classify(unsigned int imbalance_pct,
 	return group_has_spare;
 }
 
+static inline bool
+sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
+	   struct sched_group *group)
+{
+	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
+}
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -8657,18 +8664,17 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		}
 	}
 
+	sgs->group_capacity = group->sgc->capacity;
+
+	sgs->group_weight = group->group_weight;
+
 	/* Check if dst CPU is idle and preferred to this group */
 	if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
-	    env->idle != CPU_NOT_IDLE &&
-	    sgs->sum_h_nr_running &&
-	    sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
+	    env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running &&
+	    sched_asym(env, sds, sgs, group)) {
 		sgs->group_asym_packing = 1;
 	}
 
-	sgs->group_capacity = group->sgc->capacity;
-
-	sgs->group_weight = group->group_weight;
-
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
 	/* Computing avg_load makes sense only when group is overloaded */
-- 
2.17.1


^ permalink raw reply related

* [PATCH v5 4/6] sched/fair: Provide update_sg_lb_stats() with sched domain statistics
From: Ricardo Neri @ 2021-09-11  1:18 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Len Brown, Aubrey Li, Rafael J . Wysocki, Srikar Dronamraju,
	Ravi V. Shankar, linuxppc-dev, Aubrey Li, Nicholas Piggin,
	Ricardo Neri, Steven Rostedt, Quentin Perret, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Srinivas Pandruvada,
	Joel Fernandes (Google), Tim Chen, Dietmar Eggemann, linux-kernel,
	Ricardo Neri
In-Reply-To: <20210911011819.12184-1-ricardo.neri-calderon@linux.intel.com>

Before deciding to pull tasks when using asymmetric packing of tasks,
on some architectures (e.g., x86) it is necessary to know not only the
state of dst_cpu but also of its SMT siblings. The decision to classify
a candidate busiest group as group_asym_packing is done in
update_sg_lb_stats(). Give this function access to the scheduling domain
statistics, which contains the statistics of the local group.

Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Quentin Perret <qperret@google.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
  * None

Changes since v3:
  * None

Changes since v2:
  * Introduced this patch.

Changes since v1:
  * N/A
---
 kernel/sched/fair.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a054f528bcc..c5851260b4d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8605,6 +8605,7 @@ group_type group_classify(unsigned int imbalance_pct,
  * @sg_status: Holds flag indicating the status of the sched_group
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
+				      struct sd_lb_stats *sds,
 				      struct sched_group *group,
 				      struct sg_lb_stats *sgs,
 				      int *sg_status)
@@ -8613,7 +8614,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 	memset(sgs, 0, sizeof(*sgs));
 
-	local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group));
+	local_group = group == sds->local;
 
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
 		struct rq *rq = cpu_rq(i);
@@ -9176,7 +9177,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 				update_group_capacity(env->sd, env->dst_cpu);
 		}
 
-		update_sg_lb_stats(env, sg, sgs, &sg_status);
+		update_sg_lb_stats(env, sds, sg, sgs, &sg_status);
 
 		if (local_group)
 			goto next_group;
-- 
2.17.1


^ permalink raw reply related

* [PATCH v5 3/6] sched/fair: Optimize checking for group_asym_packing
From: Ricardo Neri @ 2021-09-11  1:18 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Len Brown, Aubrey Li, Rafael J . Wysocki, Srikar Dronamraju,
	Ravi V. Shankar, linuxppc-dev, Aubrey Li, Nicholas Piggin,
	Ricardo Neri, Steven Rostedt, Quentin Perret, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Srinivas Pandruvada,
	Joel Fernandes (Google), Tim Chen, Dietmar Eggemann, linux-kernel,
	Ricardo Neri
In-Reply-To: <20210911011819.12184-1-ricardo.neri-calderon@linux.intel.com>

sched_asmy_prefer() always returns false when called on the local group. By
checking local_group, we can avoid additional checks and invoking
sched_asmy_prefer() when it is not needed. No functional changes are
introduced.

Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Quentin Perret <qperret@google.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
  * None

Changes since v3:
  * Further rewording of the commit message. (Len)

Changes since v2:
  * Reworded the commit message for clarity. (Peter Z)

Changes since v1:
  * None
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..7a054f528bcc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8657,7 +8657,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	}
 
 	/* Check if dst CPU is idle and preferred to this group */
-	if (env->sd->flags & SD_ASYM_PACKING &&
+	if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
 	    env->idle != CPU_NOT_IDLE &&
 	    sgs->sum_h_nr_running &&
 	    sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
-- 
2.17.1


^ permalink raw reply related

* [PATCH v5 2/6] sched/topology: Introduce sched_group::flags
From: Ricardo Neri @ 2021-09-11  1:18 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Len Brown, Aubrey Li, Rafael J . Wysocki, Srikar Dronamraju,
	Ravi V. Shankar, linuxppc-dev, Aubrey Li, Nicholas Piggin,
	Ricardo Neri, Steven Rostedt, Quentin Perret, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Srinivas Pandruvada,
	Joel Fernandes (Google), Tim Chen, Dietmar Eggemann, linux-kernel,
	Ricardo Neri
In-Reply-To: <20210911011819.12184-1-ricardo.neri-calderon@linux.intel.com>

There exist situations in which the load balance needs to know the
properties of the CPUs in a scheduling group. When using asymmetric
packing, for instance, the load balancer needs to know not only the
state of dst_cpu but also of its SMT siblings, if any.

Use the flags of the child scheduling domains to initialize scheduling
group flags. This will reflect the properties of the CPUs in the
group.

A subsequent changeset will make use of these new flags. No functional
changes are introduced.

Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Quentin Perret <qperret@google.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
  * None

Changes since v3:
  * Clear the flags of the scheduling groups of a domain if its child is
    destroyed.
  * Minor rewording of the commit message.

Changes since v2:
  * Introduced this patch.

Changes since v1:
  * N/A
---
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c | 21 ++++++++++++++++++---
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3d3e5793e117..86ab33ce529d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1809,6 +1809,7 @@ struct sched_group {
 	unsigned int		group_weight;
 	struct sched_group_capacity *sgc;
 	int			asym_prefer_cpu;	/* CPU of highest priority in group */
+	int			flags;
 
 	/*
 	 * The CPUs this group covers.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 4e8698e62f07..c56faae461d9 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -716,8 +716,20 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 		tmp = sd;
 		sd = sd->parent;
 		destroy_sched_domain(tmp);
-		if (sd)
+		if (sd) {
+			struct sched_group *sg = sd->groups;
+
+			/*
+			 * sched groups hold the flags of the child sched
+			 * domain for convenience. Clear such flags since
+			 * the child is being destroyed.
+			 */
+			do {
+				sg->flags = 0;
+			} while (sg != sd->groups);
+
 			sd->child = NULL;
+		}
 	}
 
 	for (tmp = sd; tmp; tmp = tmp->parent)
@@ -916,10 +928,12 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
 		return NULL;
 
 	sg_span = sched_group_span(sg);
-	if (sd->child)
+	if (sd->child) {
 		cpumask_copy(sg_span, sched_domain_span(sd->child));
-	else
+		sg->flags = sd->child->flags;
+	} else {
 		cpumask_copy(sg_span, sched_domain_span(sd));
+	}
 
 	atomic_inc(&sg->ref);
 	return sg;
@@ -1169,6 +1183,7 @@ static struct sched_group *get_group(int cpu, struct sd_data *sdd)
 	if (child) {
 		cpumask_copy(sched_group_span(sg), sched_domain_span(child));
 		cpumask_copy(group_balance_mask(sg), sched_group_span(sg));
+		sg->flags = child->flags;
 	} else {
 		cpumask_set_cpu(cpu, sched_group_span(sg));
 		cpumask_set_cpu(cpu, group_balance_mask(sg));
-- 
2.17.1


^ permalink raw reply related

* [PATCH v5 1/6] x86/sched: Decrease further the priorities of SMT siblings
From: Ricardo Neri @ 2021-09-11  1:18 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Len Brown, Aubrey Li, Rafael J . Wysocki, Srikar Dronamraju,
	Ravi V. Shankar, linuxppc-dev, Aubrey Li, Nicholas Piggin,
	Ricardo Neri, Steven Rostedt, Quentin Perret, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Srinivas Pandruvada,
	Joel Fernandes (Google), Tim Chen, Dietmar Eggemann, linux-kernel,
	Ricardo Neri
In-Reply-To: <20210911011819.12184-1-ricardo.neri-calderon@linux.intel.com>

When scheduling, it is better to prefer a separate physical core rather
than the SMT sibling of a high priority core. The existing formula to
compute priorities takes such fact in consideration. There may exist,
however, combinations of priorities (i.e., maximum frequencies) in which
the priority of high-numbered SMT siblings of high-priority cores collides
with the priority of low-numbered SMT siblings of low-priority cores.

Consider for instance an SMT2 system with CPUs [0, 1] with priority 60 and
[2, 3] with priority 30(CPUs in brackets are SMT siblings. In such a case,
the resulting priorities would be [120, 60], [60, 30]. Thus, to ensure
that CPU2 has higher priority than CPU1, divide the raw priority by the
squared SMT iterator. The resulting priorities are [120, 30]. [60, 15].

Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Quentin Perret <qperret@google.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Originally-by: Len Brown <len.brown@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
  * None

Changes since v3:
  * Introduced this patch

Changes since v2:
  * N/A

Changes since v1:
  * N/A
---
 arch/x86/kernel/itmt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index 1afbdd1dd777..9ff480e94511 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -198,7 +198,7 @@ void sched_set_itmt_core_prio(int prio, int core_cpu)
 		 * of the priority chain and only used when
 		 * all other high priority cpus are out of capacity.
 		 */
-		smt_prio = prio * smp_num_siblings / i;
+		smt_prio = prio * smp_num_siblings / (i * i);
 		per_cpu(sched_core_priority, cpu) = smt_prio;
 		i++;
 	}
-- 
2.17.1


^ permalink raw reply related

* [PATCH v5 0/6] sched/fair: Fix load balancing of SMT siblings with ASYM_PACKING
From: Ricardo Neri @ 2021-09-11  1:18 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Len Brown, Srikar Dronamraju, Ravi V. Shankar, linuxppc-dev,
	Aubrey Li, Nicholas Piggin, Ricardo Neri, Steven Rostedt,
	Quentin Perret, Ben Segall, Mel Gorman, Srinivas Pandruvada,
	Joel Fernandes (Google), Tim Chen, Dietmar Eggemann, linux-kernel,
	Ricardo Neri

This is v5 the series. v1, v2, v3, and v4 patches and test results can be
found in [1], [2], [3], and [4], respectively.

=== Problem statement ===
++ Load balancing ++

When using asymmetric packing, there exists CPU topologies with three
priority levels in which only a subset of the physical cores support SMT.
An instance of such topology is Intel Alderlake, a hybrid processor with
a mixture of Intel Core (with support for SMT) and Intel Atom CPUs.

On Alderlake, it is almost always beneficial to spread work by picking
first the Core CPUs, then the Atoms and at last the SMT siblings.

The current load balancer, however, does not behave as described when using
ASYM_PACKING. Instead, the load balancer will choose higher-priority CPUs 
(an Intel Core) over medium-priority CPUs (an Intel Atom), and subsequently
overflow the load to a low priority SMT sibling CPU. This leaves medium-
priority CPUs idle while low-priority CPUs are busy.

This patchset fixes this behavior by also checking the idle state of the
SMT siblings of both the CPU doing the load balance and the busiest
candidate group when deciding whether the destination CPUs can pull tasks
from the busiest CPU.

++ Rework ASYM_PACKING priorities with ITMT ++
We also reworked the priority computation of the SMT siblings to ensure
that higher-numbered SMT siblings are always low priority. The current
computation may lead to situations in which in some processors those
higher-numbered SMT siblings have the same priority as the Intel Atom
CPUs.

=== Testing ===
I ran a few benchmarks with and without this version of the patchset on
an Intel Alderlake system with 8 Intel Core (with SMT) and 8 Intel
Atom CPUs.

The baseline for the results is an unmodified v5.14 kernel. Results
show a comparative percentage of improvement (positive) or degradation
(negative). Each test case is repeated eight times, and the standard
deviation among repetitions is also documented.

Table 1 shows the results when using hardware-controlled performance
performance states (HWP), a common use case. The impact of the patches
is overall positive with a few test cases showing slight degradation.
hackbench is especially difficult to assess it shows a high degree of
variability.

Thanks and BR,
Ricardo

ITMT: Intel Turbo Boost Max Technology 3.0

========
Changes since v4:
  * Use sg_lb_stats::sum_nr_running the idle state of a scheduling group.
    (patch 6, Vincent, Peter)
  * Do not even idle CPUs in asym_smt_can_pull_tasks(). (patch 6, Vincent)
  * Unchanged patches: 1, 2, 3, 4, 5.

Changes since v3:
  * Reworked the ITMT priority computation to further reduce the priority
    of SMT siblings (patch 1).
  * Clear scheduling group flags when a child scheduling level
    degenerates (patch 2).
  * Removed arch-specific hooks (patch 6, PeterZ)
  * Removed redundant checks for the local group. (patch 5, Dietmar)
  * Removed redundant check for local idle CPUs and local group
    utilization. (patch 6, Joel)
  * Reworded commit messages of patches 2, 3, 5, and 6 for clarity.
    (Len, PeterZ)
  * Added Joel's Reviewed-by tag.
  * Unchanged patches: 4

Changes since v2:
  * Removed arch_sched_asym_prefer_early() and re-evaluation of the
    candidate busiest group in update_sd_pick_busiest(). (PeterZ)
  * Introduced sched_group::flags to reflect the properties of CPUs
    in the scheduling group. This helps to identify scheduling groups
    whose CPUs are SMT siblings. (PeterZ)
  * Modified update_sg_lb_stats() to get statistics of the scheduling
    domain as an argument. This provides it with the statistics of the
    local domain. (PeterZ)
  * Introduced sched_asym() to decide if a busiest candidate group can
    be marked for asymmetric packing.
  * Reworded patch 1 for clarity. (PeterZ)

Changes since v1:
  * Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
    tasks. Instead, reclassify the candidate busiest group, as it
    may still be selected. (PeterZ)
  * Avoid an expensive and unnecessary call to cpumask_weight() when
    determining if a sched_group is comprised of SMT siblings.
    (PeterZ).
  * Updated test results using the v2 patches.


========      Table 1. Test results of patches with HWP        ========
=======================================================================

hackbench
=========
case                    load            baseline(std%)  compare%( std%)
process-pipe            group-1          1.00 ( 18.21)   +4.95 ( 11.79)
process-pipe            group-2          1.00 ( 10.09)   -2.41 ( 14.12)
process-pipe            group-4          1.00 ( 29.09)   -4.04 ( 22.58)
process-pipe            group-8          1.00 (  6.76)   -1.61 (  8.13)
process-pipe            group-12         1.00 ( 12.39)   +3.90 (  7.59)
process-pipe            group-16         1.00 (  5.78)   -3.65 (  7.90)
process-pipe            group-20         1.00 (  4.71)   -2.70 (  5.17)
process-pipe            group-24         1.00 (  9.44)  -11.20 ( 10.22)
process-pipe            group-32         1.00 (  9.29)   -0.84 (  7.04)
process-pipe            group-48         1.00 (  7.47)   +0.66 (  5.90)
process-sockets         group-1          1.00 ( 13.53)   -9.60 (  7.91)
process-sockets         group-2          1.00 ( 21.92)   +7.48 (  9.23)
process-sockets         group-4          1.00 ( 14.59)   +9.43 ( 11.85)
process-sockets         group-8          1.00 (  9.43)   +4.67 (  6.23)
process-sockets         group-12         1.00 ( 12.80)   +7.44 ( 12.62)
process-sockets         group-16         1.00 (  8.47)   +2.12 (  9.45)
process-sockets         group-20         1.00 (  5.86)   +2.41 (  3.20)
process-sockets         group-24         1.00 (  4.47)   +4.56 (  3.14)
process-sockets         group-32         1.00 (  4.40)   +5.41 (  3.11)
process-sockets         group-48         1.00 ( 16.60)  +14.69 (  3.08)
threads-pipe            group-1          1.00 (  3.49)   -0.91 (  3.37)
threads-pipe            group-2          1.00 ( 17.48)   +7.36 (  8.81)
threads-pipe            group-4          1.00 ( 17.58)   -2.36 ( 18.80)
threads-pipe            group-8          1.00 (  9.58)   -1.26 (  6.24)
threads-pipe            group-12         1.00 (  6.49)   +2.34 (  4.37)
threads-pipe            group-16         1.00 ( 15.49)   -6.13 ( 14.84)
threads-pipe            group-20         1.00 (  2.76)   -7.87 (  7.93)
threads-pipe            group-24         1.00 ( 13.80)   +3.46 ( 11.91)
threads-pipe            group-32         1.00 (  8.12)   -2.91 (  8.20)
threads-pipe            group-48         1.00 (  5.79)   -3.95 (  5.44)
threads-sockets         group-1          1.00 ( 11.24)   -4.56 ( 10.01)
threads-sockets         group-2          1.00 (  6.81)   +0.60 (  4.95)
threads-sockets         group-4          1.00 (  8.78)   -0.79 (  7.86)
threads-sockets         group-8          1.00 (  6.51)  -15.64 ( 15.33)
threads-sockets         group-12         1.00 ( 12.30)   -7.09 ( 12.45)
threads-sockets         group-16         1.00 (  8.65)   +3.77 (  8.25)
threads-sockets         group-20         1.00 (  5.52)   +4.40 (  3.48)
threads-sockets         group-24         1.00 (  2.89)   +2.54 (  2.68)
threads-sockets         group-32         1.00 (  3.49)   +1.17 (  3.02)
threads-sockets         group-48         1.00 (  3.15)   -3.95 ( 10.64)

netperf
=======
case                    load            baseline(std%)  compare%( std%)
TCP_RR                  thread-1         1.00 (  0.55)   -0.12 (  0.47)
TCP_RR                  thread-2         1.00 (  0.77)   -0.44 (  0.65)
TCP_RR                  thread-4         1.00 (  0.73)   +0.26 (  0.61)
TCP_RR                  thread-8         1.00 (  1.21)   -0.18 (  1.18)
TCP_RR                  thread-12        1.00 (  1.91)   -0.29 (  2.25)
TCP_RR                  thread-16        1.00 (  4.18)   -0.45 (  3.78)
TCP_RR                  thread-20        1.00 (  2.09)   -0.83 (  1.75)
TCP_RR                  thread-24        1.00 (  1.23)   -0.42 (  1.35)
TCP_RR                  thread-32        1.00 ( 13.72)   +6.22 ( 16.10)
TCP_RR                  thread-48        1.00 ( 12.91)   -0.38 ( 13.37)
UDP_RR                  thread-1         1.00 (  0.85)   +0.04 (  0.75)
UDP_RR                  thread-2         1.00 (  0.57)   -0.56 (  0.62)
UDP_RR                  thread-4         1.00 (  0.65)   -0.04 (  0.78)
UDP_RR                  thread-8         1.00 (  1.24)   -0.46 (  8.31)
UDP_RR                  thread-12        1.00 (  6.87)   +0.01 (  1.27)
UDP_RR                  thread-16        1.00 (  6.07)   -0.30 (  1.51)
UDP_RR                  thread-20        1.00 (  1.00)   -0.97 (  0.87)
UDP_RR                  thread-24        1.00 (  0.67)   +0.65 (  4.39)
UDP_RR                  thread-32        1.00 ( 15.59)   +3.27 ( 17.34)
UDP_RR                  thread-48        1.00 ( 12.56)   -1.28 ( 13.43)

tbench
======
case                    load            baseline(std%)  compare%( std%)
loopback                thread-1         1.00 (  0.59)   +0.06 (  0.53)
loopback                thread-2         1.00 (  0.44)   -0.69 (  0.66)
loopback                thread-4         1.00 (  0.27)   +0.61 (  0.31)
loopback                thread-8         1.00 (  0.25)   -0.18 (  0.20)
loopback                thread-12        1.00 (  1.12)   -0.23 (  0.85)
loopback                thread-16        1.00 (  0.40)   -0.25 (  1.59)
loopback                thread-20        1.00 (  0.20)   +0.58 (  0.34)
loopback                thread-24        1.00 (  6.93)   +0.73 (  8.46)
loopback                thread-32        1.00 (  4.61)   +0.96 (  1.62)
loopback                thread-48        1.00 (  2.33)   +1.45 (  0.97)

schbench
========
case                    load            baseline(std%)  compare%( std%)
normal                  mthread-1        1.00 (  4.39)   -0.37 (  6.53)
normal                  mthread-2        1.00 (  2.44)   +0.96 (  4.31)
normal                  mthread-4        1.00 (  9.47)  +13.26 ( 19.52)
normal                  mthread-8        1.00 (  0.06)   -1.05 (  2.31)
normal                  mthread-12       1.00 (  2.62)   -0.66 (  2.21)
normal                  mthread-16       1.00 (  1.90)   +0.21 (  1.70)
normal                  mthread-20       1.00 (  2.25)   -0.44 (  2.41)
normal                  mthread-24       1.00 (  1.89)   -0.05 (  1.78)
normal                  mthread-32       1.00 (  2.04)   -0.28 (  1.92)
normal                  mthread-48       1.00 (  4.43)   +1.10 (  3.86)

[1]. https://lore.kernel.org/lkml/20210406041108.7416-1-ricardo.neri-calderon@linux.intel.com/
[2]. https://lore.kernel.org/lkml/20210414020436.12980-1-ricardo.neri-calderon@linux.intel.com/
[3]. https://lore.kernel.org/lkml/20210513154909.6385-1-ricardo.neri-calderon@linux.intel.com/
[4]. https://lore.kernel.org/lkml/20210810144145.18776-1-ricardo.neri-calderon@linux.intel.com/

Ricardo Neri (6):
  x86/sched: Decrease further the priorities of SMT siblings
  sched/topology: Introduce sched_group::flags
  sched/fair: Optimize checking for group_asym_packing
  sched/fair: Provide update_sg_lb_stats() with sched domain statistics
  sched/fair: Carve out logic to mark a group for asymmetric packing
  sched/fair: Consider SMT in ASYM_PACKING load balance

 arch/x86/kernel/itmt.c  |   2 +-
 kernel/sched/fair.c     | 121 ++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h    |   1 +
 kernel/sched/topology.c |  21 ++++++-
 4 files changed, 131 insertions(+), 14 deletions(-)

-- 
2.17.1


^ permalink raw reply

* Re: [PATCH 06/10] powerpc: remove GCC version check for UPD_CONSTR
From: Nathan Chancellor @ 2021-09-10 23:48 UTC (permalink / raw)
  To: Nick Desaulniers, Andrew Morton
  Cc: Stephen Rothwell, linuxppc-dev, Arnd Bergmann, Masahiro Yamada,
	llvm, Rasmus Villemoes, linux-kernel, Paul Mackerras, Joe Perches,
	Linus Torvalds
In-Reply-To: <20210910234047.1019925-7-ndesaulniers@google.com>

On 9/10/2021 4:40 PM, Nick Desaulniers wrote:
> Now that GCC 5.1 is the minimum supported version, we can drop this
> workaround for older versions of GCC. This adversely affected clang,
> too.
> 
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Segher Boessenkool <segher@kernel.crashing.org>
> Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
> ---
>   arch/powerpc/include/asm/asm-const.h | 10 ----------
>   1 file changed, 10 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/asm-const.h b/arch/powerpc/include/asm/asm-const.h
> index 0ce2368bd20f..dbfa5e1e3198 100644
> --- a/arch/powerpc/include/asm/asm-const.h
> +++ b/arch/powerpc/include/asm/asm-const.h
> @@ -12,16 +12,6 @@
>   #  define ASM_CONST(x)		__ASM_CONST(x)
>   #endif
>   
> -/*
> - * Inline assembly memory constraint
> - *
> - * GCC 4.9 doesn't properly handle pre update memory constraint "m<>"
> - *
> - */
> -#if defined(GCC_VERSION) && GCC_VERSION < 50000
> -#define UPD_CONSTR ""
> -#else
>   #define UPD_CONSTR "<>"
> -#endif

The only reason this exists is because of commit 592bbe9c505d 
("powerpc/uaccess: Don't use "m<>" constraint with GCC 4.9"). It is 
probably just worth sinking "<>" into all of the callsites and removing
UPD_CONSTR.

>   
>   #endif /* _ASM_POWERPC_ASM_CONST_H */
> 

^ permalink raw reply

* [PATCH 06/10] powerpc: remove GCC version check for UPD_CONSTR
From: Nick Desaulniers @ 2021-09-10 23:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Arnd Bergmann, linuxppc-dev, Stephen Rothwell, Nick Desaulniers,
	Masahiro Yamada, llvm, Rasmus Villemoes, linux-kernel,
	Nathan Chancellor, Paul Mackerras, Joe Perches, Linus Torvalds
In-Reply-To: <20210910234047.1019925-1-ndesaulniers@google.com>

Now that GCC 5.1 is the minimum supported version, we can drop this
workaround for older versions of GCC. This adversely affected clang,
too.

Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Segher Boessenkool <segher@kernel.crashing.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
---
 arch/powerpc/include/asm/asm-const.h | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-const.h b/arch/powerpc/include/asm/asm-const.h
index 0ce2368bd20f..dbfa5e1e3198 100644
--- a/arch/powerpc/include/asm/asm-const.h
+++ b/arch/powerpc/include/asm/asm-const.h
@@ -12,16 +12,6 @@
 #  define ASM_CONST(x)		__ASM_CONST(x)
 #endif
 
-/*
- * Inline assembly memory constraint
- *
- * GCC 4.9 doesn't properly handle pre update memory constraint "m<>"
- *
- */
-#if defined(GCC_VERSION) && GCC_VERSION < 50000
-#define UPD_CONSTR ""
-#else
 #define UPD_CONSTR "<>"
-#endif
 
 #endif /* _ASM_POWERPC_ASM_CONST_H */
-- 
2.33.0.309.g3052b89438-goog


^ permalink raw reply related

* Re: [PATCH v1] powerpc: flexible GPR range save/restore macros
From: Segher Boessenkool @ 2021-09-10 17:54 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linuxppc-dev
In-Reply-To: <20210910143511.375177-1-npiggin@gmail.com>

Hi!

On Sat, Sep 11, 2021 at 12:35:11AM +1000, Nicholas Piggin wrote:
> +.macro OP_REGS op, width, start, end, base, offset
> +	.Lreg=\start
> +	.rept (\end - \start + 1)
> +	\op	.Lreg,\offset+\width*.Lreg(\base)
> +	.Lreg=.Lreg+1
> +	.endr
> +.endm

"offset" here is the offset of reg "0", not the offset of reg "start".
This isn't new, but documenting it would not hurt :-)

".Lreg" does not really give you much protection, you could use any name
that won't collide, it will be a local symbol anyway.  You could use a
name with a "$" in it, even as first letter, for example.  As written it
still conflicts with any other symbol ".Lreg".  Pretty unlikely of
course :-)

Looks fine in any case.

Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>

Segher

^ permalink raw reply

* Re: [PATCH for-5.15 0/5] ASoC: fsl: register platform component before registering cpu dai
From: Mark Brown @ 2021-09-10 16:08 UTC (permalink / raw)
  To: Xiubo.Lee, perex, nicoleotsuka, Shengjiu Wang, alsa-devel, tiwai,
	timur, festevam
  Cc: Mark Brown, linuxppc-dev, linux-kernel
In-Reply-To: <1630665006-31437-1-git-send-email-shengjiu.wang@nxp.com>

On Fri, 3 Sep 2021 18:30:01 +0800, Shengjiu Wang wrote:
> There is no defer probe when adding platform component to
> snd_soc_pcm_runtime(rtd), the code is in snd_soc_add_pcm_runtime()
> 
> snd_soc_register_card()
>   -> snd_soc_bind_card()
>     -> snd_soc_add_pcm_runtime()
>       -> adding cpu dai
>       -> adding codec dai
>       -> adding platform component.
> 
> [...]

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/5] ASoC: fsl_sai: register platform component before registering cpu dai
      commit: 9c3ad33b5a412d8bc0a377e7cd9baa53ed52f22d
[2/5] ASoC: fsl_esai: register platform component before registering cpu dai
      commit: f12ce92e98b21c1fc669cd74e12c54a0fe3bc2eb
[3/5] ASoC: fsl_micfil: register platform component before registering cpu dai
      commit: 0adf292069dcca8bab76a603251fcaabf77468ca
[4/5] ASoC: fsl_spdif: register platform component before registering cpu dai
      commit: ee8ccc2eb5840e34fce088bdb174fd5329153ef0
[5/5] ASoC: fsl_xcvr: register platform component before registering cpu dai
      commit: c590fa80b39287a91abeb487829f3190e7ae775f

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

^ permalink raw reply

* Re: [PATCH v3 2/8] mm: Introduce a function to check for confidential computing features
From: Borislav Petkov @ 2021-09-10 15:02 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: linux-s390, Sathyanarayanan Kuppuswamy, linux-efi, Brijesh Singh,
	kvm, Joerg Roedel, x86, kexec, linux-kernel, amd-gfx,
	platform-driver-x86, Christoph Hellwig, iommu, Andi Kleen,
	linux-graphics-maintainer, dri-devel, linux-fsdevel, Tianyu Lan,
	linuxppc-dev
In-Reply-To: <0a7618d54e7e954ee56c22ad1b94af2ffe69543a.1631141919.git.thomas.lendacky@amd.com>

On Wed, Sep 08, 2021 at 05:58:33PM -0500, Tom Lendacky wrote:
> In prep for other confidential computing technologies, introduce a generic

preparation

> helper function, cc_platform_has(), that can be used to check for specific
> active confidential computing attributes, like memory encryption. This is
> intended to eliminate having to add multiple technology-specific checks to
> the code (e.g. if (sev_active() || tdx_active())).

...

> diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
> new file mode 100644
> index 000000000000..253f3ea66cd8
> --- /dev/null
> +++ b/include/linux/cc_platform.h
> @@ -0,0 +1,88 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Confidential Computing Platform Capability checks
> + *
> + * Copyright (C) 2021 Advanced Micro Devices, Inc.
> + *
> + * Author: Tom Lendacky <thomas.lendacky@amd.com>
> + */
> +
> +#ifndef _CC_PLATFORM_H

	_LINUX_CC_PLATFORM_H

> +#define _CC_PLATFORM_H

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply

* [PATCH v1] powerpc: flexible GPR range save/restore macros
From: Nicholas Piggin @ 2021-09-10 14:35 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Introduce macros that operate on a (start, end) range of GPRs, which
reduces lines of code and need to do mental arithmetic while reading the
code.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
Since RFC:
- Removed FP / VMX regs changes from the patch

 arch/powerpc/boot/crt0.S                      | 31 +++++++------
 arch/powerpc/crypto/md5-asm.S                 | 10 ++---
 arch/powerpc/crypto/sha1-powerpc-asm.S        |  6 +--
 arch/powerpc/include/asm/ppc_asm.h            | 43 ++++++++++++-------
 arch/powerpc/kernel/entry_32.S                | 23 ++++------
 arch/powerpc/kernel/exceptions-64e.S          | 14 ++----
 arch/powerpc/kernel/exceptions-64s.S          |  6 +--
 arch/powerpc/kernel/head_32.h                 |  3 +-
 arch/powerpc/kernel/head_booke.h              |  3 +-
 arch/powerpc/kernel/interrupt_64.S            | 34 ++++++---------
 arch/powerpc/kernel/optprobes_head.S          |  4 +-
 arch/powerpc/kernel/tm.S                      | 15 ++-----
 .../powerpc/kernel/trace/ftrace_64_mprofile.S | 14 +++---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S       |  5 +--
 .../lib/test_emulate_step_exec_instr.S        |  8 ++--
 15 files changed, 93 insertions(+), 126 deletions(-)

diff --git a/arch/powerpc/boot/crt0.S b/arch/powerpc/boot/crt0.S
index 1d83966f5ef6..e8f10a599659 100644
--- a/arch/powerpc/boot/crt0.S
+++ b/arch/powerpc/boot/crt0.S
@@ -226,16 +226,19 @@ p_base:	mflr	r10		/* r10 now points to runtime addr of p_base */
 #ifdef __powerpc64__
 
 #define PROM_FRAME_SIZE 512
-#define SAVE_GPR(n, base)       std     n,8*(n)(base)
-#define REST_GPR(n, base)       ld      n,8*(n)(base)
-#define SAVE_2GPRS(n, base)     SAVE_GPR(n, base); SAVE_GPR(n+1, base)
-#define SAVE_4GPRS(n, base)     SAVE_2GPRS(n, base); SAVE_2GPRS(n+2, base)
-#define SAVE_8GPRS(n, base)     SAVE_4GPRS(n, base); SAVE_4GPRS(n+4, base)
-#define SAVE_10GPRS(n, base)    SAVE_8GPRS(n, base); SAVE_2GPRS(n+8, base)
-#define REST_2GPRS(n, base)     REST_GPR(n, base); REST_GPR(n+1, base)
-#define REST_4GPRS(n, base)     REST_2GPRS(n, base); REST_2GPRS(n+2, base)
-#define REST_8GPRS(n, base)     REST_4GPRS(n, base); REST_4GPRS(n+4, base)
-#define REST_10GPRS(n, base)    REST_8GPRS(n, base); REST_2GPRS(n+8, base)
+
+.macro OP_REGS op, width, start, end, base, offset
+	.Lreg=\start
+	.rept (\end - \start + 1)
+	\op	.Lreg,\offset+\width*.Lreg(\base)
+	.Lreg=.Lreg+1
+	.endr
+.endm
+
+#define SAVE_GPRS(start, end, base)	OP_REGS std, 8, start, end, base, 0
+#define REST_GPRS(start, end, base)	OP_REGS ld, 8, start, end, base, 0
+#define SAVE_GPR(n, base)		SAVE_GPRS(n, n, base)
+#define REST_GPR(n, base)		REST_GPRS(n, n, base)
 
 /* prom handles the jump into and return from firmware.  The prom args pointer
    is loaded in r3. */
@@ -246,9 +249,7 @@ prom:
 	stdu	r1,-PROM_FRAME_SIZE(r1) /* Save SP and create stack space */
 
 	SAVE_GPR(2, r1)
-	SAVE_GPR(13, r1)
-	SAVE_8GPRS(14, r1)
-	SAVE_10GPRS(22, r1)
+	SAVE_GPRS(13, 31, r1)
 	mfcr    r10
 	std     r10,8*32(r1)
 	mfmsr   r10
@@ -283,9 +284,7 @@ prom:
 
 	/* Restore other registers */
 	REST_GPR(2, r1)
-	REST_GPR(13, r1)
-	REST_8GPRS(14, r1)
-	REST_10GPRS(22, r1)
+	REST_GPRS(13, 31, r1)
 	ld      r10,8*32(r1)
 	mtcr	r10
 
diff --git a/arch/powerpc/crypto/md5-asm.S b/arch/powerpc/crypto/md5-asm.S
index 948d100a2934..fa6bc440cf4a 100644
--- a/arch/powerpc/crypto/md5-asm.S
+++ b/arch/powerpc/crypto/md5-asm.S
@@ -38,15 +38,11 @@
 
 #define INITIALIZE \
 	PPC_STLU r1,-INT_FRAME_SIZE(r1); \
-	SAVE_8GPRS(14, r1);		/* push registers onto stack	*/ \
-	SAVE_4GPRS(22, r1);						   \
-	SAVE_GPR(26, r1)
+	SAVE_GPRS(14, 26, r1)		/* push registers onto stack	*/
 
 #define FINALIZE \
-	REST_8GPRS(14, r1);		/* pop registers from stack	*/ \
-	REST_4GPRS(22, r1);						   \
-	REST_GPR(26, r1);						   \
-	addi	r1,r1,INT_FRAME_SIZE;
+	REST_GPRS(14, 26, r1);		/* pop registers from stack	*/ \
+	addi	r1,r1,INT_FRAME_SIZE
 
 #ifdef __BIG_ENDIAN__
 #define LOAD_DATA(reg, off) \
diff --git a/arch/powerpc/crypto/sha1-powerpc-asm.S b/arch/powerpc/crypto/sha1-powerpc-asm.S
index 23e248beff71..f0d5ed557ab1 100644
--- a/arch/powerpc/crypto/sha1-powerpc-asm.S
+++ b/arch/powerpc/crypto/sha1-powerpc-asm.S
@@ -125,8 +125,7 @@
 
 _GLOBAL(powerpc_sha_transform)
 	PPC_STLU r1,-INT_FRAME_SIZE(r1)
-	SAVE_8GPRS(14, r1)
-	SAVE_10GPRS(22, r1)
+	SAVE_GPRS(14, 31, r1)
 
 	/* Load up A - E */
 	lwz	RA(0),0(r3)	/* A */
@@ -184,7 +183,6 @@ _GLOBAL(powerpc_sha_transform)
 	stw	RD(0),12(r3)
 	stw	RE(0),16(r3)
 
-	REST_8GPRS(14, r1)
-	REST_10GPRS(22, r1)
+	REST_GPRS(14, 31, r1)
 	addi	r1,r1,INT_FRAME_SIZE
 	blr
diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h
index 1c538a9a11e0..8c84fde90440 100644
--- a/arch/powerpc/include/asm/ppc_asm.h
+++ b/arch/powerpc/include/asm/ppc_asm.h
@@ -16,30 +16,41 @@
 
 #define SZL			(BITS_PER_LONG/8)
 
+.macro OP_REGS op, width, start, end, base, offset
+	.Lreg=\start
+	.rept (\end - \start + 1)
+	\op	.Lreg, \offset + \width * .Lreg(\base)
+	.Lreg=.Lreg+1
+	.endr
+.endm
+
+.macro OP_REGS_IDX op, width, start, end, tmpreg, base
+	.Lreg=\start
+	.rept (\end - \start + 1)
+	li	\tmpreg, \width * .Lreg
+	\op	.Lreg, \tmpreg, \base
+	.Lreg=.Lreg+1
+	.endr
+.endm
+
 /*
  * Macros for storing registers into and loading registers from
  * exception frames.
  */
 #ifdef __powerpc64__
-#define SAVE_GPR(n, base)	std	n,GPR0+8*(n)(base)
-#define REST_GPR(n, base)	ld	n,GPR0+8*(n)(base)
-#define SAVE_NVGPRS(base)	SAVE_8GPRS(14, base); SAVE_10GPRS(22, base)
-#define REST_NVGPRS(base)	REST_8GPRS(14, base); REST_10GPRS(22, base)
+#define SAVE_GPRS(start, end, base)	OP_REGS std, 8, start, end, base, GPR0
+#define REST_GPRS(start, end, base)	OP_REGS ld, 8, start, end, base, GPR0
+#define SAVE_NVGPRS(base)		SAVE_GPRS(14, 31, base)
+#define REST_NVGPRS(base)		REST_GPRS(14, 31, base)
 #else
-#define SAVE_GPR(n, base)	stw	n,GPR0+4*(n)(base)
-#define REST_GPR(n, base)	lwz	n,GPR0+4*(n)(base)
-#define SAVE_NVGPRS(base)	stmw	13, GPR0+4*13(base)
-#define REST_NVGPRS(base)	lmw	13, GPR0+4*13(base)
+#define SAVE_GPRS(start, end, base)	OP_REGS stw, 4, start, end, base, GPR0
+#define REST_GPRS(start, end, base)	OP_REGS lwz, 4, start, end, base, GPR0
+#define SAVE_NVGPRS(base)		stmw	13, GPR0+4*13(base)
+#define REST_NVGPRS(base)		lmw	13, GPR0+4*13(base)
 #endif
 
-#define SAVE_2GPRS(n, base)	SAVE_GPR(n, base); SAVE_GPR(n+1, base)
-#define SAVE_4GPRS(n, base)	SAVE_2GPRS(n, base); SAVE_2GPRS(n+2, base)
-#define SAVE_8GPRS(n, base)	SAVE_4GPRS(n, base); SAVE_4GPRS(n+4, base)
-#define SAVE_10GPRS(n, base)	SAVE_8GPRS(n, base); SAVE_2GPRS(n+8, base)
-#define REST_2GPRS(n, base)	REST_GPR(n, base); REST_GPR(n+1, base)
-#define REST_4GPRS(n, base)	REST_2GPRS(n, base); REST_2GPRS(n+2, base)
-#define REST_8GPRS(n, base)	REST_4GPRS(n, base); REST_4GPRS(n+4, base)
-#define REST_10GPRS(n, base)	REST_8GPRS(n, base); REST_2GPRS(n+8, base)
+#define SAVE_GPR(n, base)		SAVE_GPRS(n, n, base)
+#define REST_GPR(n, base)		REST_GPRS(n, n, base)
 
 #define SAVE_FPR(n, base)	stfd	n,8*TS_FPRWIDTH*(n)(base)
 #define SAVE_2FPRS(n, base)	SAVE_FPR(n, base); SAVE_FPR(n+1, base)
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 61fdd53cdd9a..c62dd9815965 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -90,8 +90,7 @@ transfer_to_syscall:
 	stw	r12,8(r1)
 	stw	r2,_TRAP(r1)
 	SAVE_GPR(0, r1)
-	SAVE_4GPRS(3, r1)
-	SAVE_2GPRS(7, r1)
+	SAVE_GPRS(3, 8, r1)
 	addi	r2,r10,-THREAD
 	SAVE_NVGPRS(r1)
 
@@ -139,7 +138,7 @@ syscall_exit_finish:
 	mtxer	r5
 	lwz	r0,GPR0(r1)
 	lwz	r3,GPR3(r1)
-	REST_8GPRS(4,r1)
+	REST_GPRS(4, 11, r1)
 	lwz	r12,GPR12(r1)
 	b	1b
 
@@ -232,9 +231,9 @@ fast_exception_return:
 	beq	3f			/* if not, we've got problems */
 #endif
 
-2:	REST_4GPRS(3, r11)
+2:	REST_GPRS(3, 6, r11)
 	lwz	r10,_CCR(r11)
-	REST_2GPRS(1, r11)
+	REST_GPRS(1, 2, r11)
 	mtcr	r10
 	lwz	r10,_LINK(r11)
 	mtlr	r10
@@ -298,16 +297,14 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_STCX_CHECKS_ADDRESS)
 	 * the reliable stack unwinder later on. Clear it.
 	 */
 	stw	r0,8(r1)
-	REST_4GPRS(7, r1)
-	REST_2GPRS(11, r1)
+	REST_GPRS(7, 12, r1)
 
 	mtcr	r3
 	mtlr	r4
 	mtctr	r5
 	mtspr	SPRN_XER,r6
 
-	REST_4GPRS(2, r1)
-	REST_GPR(6, r1)
+	REST_GPRS(2, 6, r1)
 	REST_GPR(0, r1)
 	REST_GPR(1, r1)
 	rfi
@@ -341,8 +338,7 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_STCX_CHECKS_ADDRESS)
 	lwz	r6,_CCR(r1)
 	li	r0,0
 
-	REST_4GPRS(7, r1)
-	REST_2GPRS(11, r1)
+	REST_GPRS(7, 12, r1)
 
 	mtlr	r3
 	mtctr	r4
@@ -354,7 +350,7 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_STCX_CHECKS_ADDRESS)
 	 */
 	stw	r0,8(r1)
 
-	REST_4GPRS(2, r1)
+	REST_GPRS(2, 5, r1)
 
 	bne-	cr1,1f /* emulate stack store */
 	mtcr	r6
@@ -430,8 +426,7 @@ _ASM_NOKPROBE_SYMBOL(interrupt_return)
 	bne	interrupt_return;					\
 	lwz	r0,GPR0(r1);						\
 	lwz	r2,GPR2(r1);						\
-	REST_4GPRS(3, r1);						\
-	REST_2GPRS(7, r1);						\
+	REST_GPRS(3, 8, r1);						\
 	lwz	r10,_XER(r1);						\
 	lwz	r11,_CTR(r1);						\
 	mtspr	SPRN_XER,r10;						\
diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
index 711c66b76df1..67dc4e3179a0 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -198,8 +198,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
 
 	stdcx.	r0,0,r1		/* to clear the reservation */
 
-	REST_4GPRS(2, r1)
-	REST_4GPRS(6, r1)
+	REST_GPRS(2, 9, r1)
 
 	ld	r10,_CTR(r1)
 	ld	r11,_XER(r1)
@@ -375,9 +374,7 @@ ret_from_mc_except:
 exc_##n##_common:							    \
 	std	r0,GPR0(r1);		/* save r0 in stackframe */	    \
 	std	r2,GPR2(r1);		/* save r2 in stackframe */	    \
-	SAVE_4GPRS(3, r1);		/* save r3 - r6 in stackframe */    \
-	SAVE_2GPRS(7, r1);		/* save r7, r8 in stackframe */	    \
-	std	r9,GPR9(r1);		/* save r9 in stackframe */	    \
+	SAVE_GPRS(3, 9, r1);		/* save r3 - r9 in stackframe */    \
 	std	r10,_NIP(r1);		/* save SRR0 to stackframe */	    \
 	std	r11,_MSR(r1);		/* save SRR1 to stackframe */	    \
 	beq	2f;			/* if from kernel mode */	    \
@@ -1061,9 +1058,7 @@ bad_stack_book3e:
 	std	r11,_ESR(r1)
 	std	r0,GPR0(r1);		/* save r0 in stackframe */	    \
 	std	r2,GPR2(r1);		/* save r2 in stackframe */	    \
-	SAVE_4GPRS(3, r1);		/* save r3 - r6 in stackframe */    \
-	SAVE_2GPRS(7, r1);		/* save r7, r8 in stackframe */	    \
-	std	r9,GPR9(r1);		/* save r9 in stackframe */	    \
+	SAVE_GPRS(3, 9, r1);		/* save r3 - r9 in stackframe */    \
 	ld	r3,PACA_EXGEN+EX_R10(r13);/* get back r10 */		    \
 	ld	r4,PACA_EXGEN+EX_R11(r13);/* get back r11 */		    \
 	mfspr	r5,SPRN_SPRG_GEN_SCRATCH;/* get back r13 XXX can be wrong */ \
@@ -1077,8 +1072,7 @@ bad_stack_book3e:
 	std	r10,_LINK(r1)
 	std	r11,_CTR(r1)
 	std	r12,_XER(r1)
-	SAVE_10GPRS(14,r1)
-	SAVE_8GPRS(24,r1)
+	SAVE_GPRS(14, 31, r1)
 	lhz	r12,PACA_TRAP_SAVE(r13)
 	std	r12,_TRAP(r1)
 	addi	r11,r1,INT_FRAME_SIZE
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 37859e62a8dc..28ad5c1d883a 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -574,8 +574,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
 	ld	r10,IAREA+EX_CTR(r13)
 	std	r10,_CTR(r1)
 	std	r2,GPR2(r1)		/* save r2 in stackframe	*/
-	SAVE_4GPRS(3, r1)		/* save r3 - r6 in stackframe   */
-	SAVE_2GPRS(7, r1)		/* save r7, r8 in stackframe	*/
+	SAVE_GPRS(3, 8, r1)		/* save r3 - r8 in stackframe   */
 	mflr	r9			/* Get LR, later save to stack	*/
 	ld	r2,PACATOC(r13)		/* get kernel TOC into r2	*/
 	std	r9,_LINK(r1)
@@ -693,8 +692,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
 	mtlr	r9
 	ld	r9,_CCR(r1)
 	mtcr	r9
-	REST_8GPRS(2, r1)
-	REST_4GPRS(10, r1)
+	REST_GPRS(2, 13, r1)
 	REST_GPR(0, r1)
 	/* restore original r1. */
 	ld	r1,GPR1(r1)
diff --git a/arch/powerpc/kernel/head_32.h b/arch/powerpc/kernel/head_32.h
index 6b1ec9e3541b..25887303651a 100644
--- a/arch/powerpc/kernel/head_32.h
+++ b/arch/powerpc/kernel/head_32.h
@@ -115,8 +115,7 @@ _ASM_NOKPROBE_SYMBOL(\name\()_virt)
 	stw	r10,8(r1)
 	li	r10, \trapno
 	stw	r10,_TRAP(r1)
-	SAVE_4GPRS(3, r1)
-	SAVE_2GPRS(7, r1)
+	SAVE_GPRS(3, 8, r1)
 	SAVE_NVGPRS(r1)
 	stw	r2,GPR2(r1)
 	stw	r12,_NIP(r1)
diff --git a/arch/powerpc/kernel/head_booke.h b/arch/powerpc/kernel/head_booke.h
index e5503420b6c6..0ae26396639d 100644
--- a/arch/powerpc/kernel/head_booke.h
+++ b/arch/powerpc/kernel/head_booke.h
@@ -87,8 +87,7 @@ END_BTB_FLUSH_SECTION
 	stw	r10, 8(r1)
 	li	r10, \trapno
 	stw	r10,_TRAP(r1)
-	SAVE_4GPRS(3, r1)
-	SAVE_2GPRS(7, r1)
+	SAVE_GPRS(3, 8, r1)
 	SAVE_NVGPRS(r1)
 	stw	r2,GPR2(r1)
 	stw	r12,_NIP(r1)
diff --git a/arch/powerpc/kernel/interrupt_64.S b/arch/powerpc/kernel/interrupt_64.S
index d4212d2ff0b5..f090224c2a6e 100644
--- a/arch/powerpc/kernel/interrupt_64.S
+++ b/arch/powerpc/kernel/interrupt_64.S
@@ -169,10 +169,9 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 	 * The value of AMR only matters while we're in the kernel.
 	 */
 	mtcr	r2
-	ld	r2,GPR2(r1)
-	ld	r3,GPR3(r1)
-	ld	r13,GPR13(r1)
-	ld	r1,GPR1(r1)
+	REST_GPRS(2, 3, r1)
+	REST_GPR(13, r1)
+	REST_GPR(1, r1)
 	RFSCV_TO_USER
 	b	.	/* prevent speculative execution */
 
@@ -190,9 +189,8 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 	mtctr	r3
 	mtlr	r4
 	mtspr	SPRN_XER,r5
-	REST_10GPRS(2, r1)
-	REST_2GPRS(12, r1)
-	ld	r1,GPR1(r1)
+	REST_GPRS(2, 13, r1)
+	REST_GPR(1, r1)
 	RFI_TO_USER
 .Lsyscall_vectored_\name\()_rst_end:
 
@@ -387,10 +385,9 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 	 * The value of AMR only matters while we're in the kernel.
 	 */
 	mtcr	r2
-	ld	r2,GPR2(r1)
-	ld	r3,GPR3(r1)
-	ld	r13,GPR13(r1)
-	ld	r1,GPR1(r1)
+	REST_GPRS(2, 3, r1)
+	REST_GPR(13, r1)
+	REST_GPR(1, r1)
 	RFI_TO_USER
 	b	.	/* prevent speculative execution */
 
@@ -401,8 +398,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 	mtctr	r3
 	mtspr	SPRN_XER,r4
 	ld	r0,GPR0(r1)
-	REST_8GPRS(4, r1)
-	ld	r12,GPR12(r1)
+	REST_GPRS(4, 12, r1)
 	b	.Lsyscall_restore_regs_cont
 .Lsyscall_rst_end:
 
@@ -559,17 +555,14 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_STCX_CHECKS_ADDRESS)
 	ld	r6,_XER(r1)
 	li	r0,0
 
-	REST_4GPRS(7, r1)
-	REST_2GPRS(11, r1)
-	REST_GPR(13, r1)
+	REST_GPRS(7, 13, r1)
 
 	mtcr	r3
 	mtlr	r4
 	mtctr	r5
 	mtspr	SPRN_XER,r6
 
-	REST_4GPRS(2, r1)
-	REST_GPR(6, r1)
+	REST_GPRS(2, 6, r1)
 	REST_GPR(0, r1)
 	REST_GPR(1, r1)
 	.ifc \srr,srr
@@ -666,8 +659,7 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_STCX_CHECKS_ADDRESS)
 	ld	r6,_CCR(r1)
 	li	r0,0
 
-	REST_4GPRS(7, r1)
-	REST_2GPRS(11, r1)
+	REST_GPRS(7, 12, r1)
 
 	mtlr	r3
 	mtctr	r4
@@ -679,7 +671,7 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_STCX_CHECKS_ADDRESS)
 	 */
 	std	r0,STACK_FRAME_OVERHEAD-16(r1)
 
-	REST_4GPRS(2, r1)
+	REST_GPRS(2, 5, r1)
 
 	bne-	cr1,1f /* emulate stack store */
 	mtcr	r6
diff --git a/arch/powerpc/kernel/optprobes_head.S b/arch/powerpc/kernel/optprobes_head.S
index 19ea3312403c..5c7f0b4b784b 100644
--- a/arch/powerpc/kernel/optprobes_head.S
+++ b/arch/powerpc/kernel/optprobes_head.S
@@ -10,8 +10,8 @@
 #include <asm/asm-offsets.h>
 
 #ifdef CONFIG_PPC64
-#define SAVE_30GPRS(base) SAVE_10GPRS(2,base); SAVE_10GPRS(12,base); SAVE_10GPRS(22,base)
-#define REST_30GPRS(base) REST_10GPRS(2,base); REST_10GPRS(12,base); REST_10GPRS(22,base)
+#define SAVE_30GPRS(base) SAVE_GPRS(2, 31, base)
+#define REST_30GPRS(base) REST_GPRS(2, 31, base)
 #define TEMPLATE_FOR_IMM_LOAD_INSNS	nop; nop; nop; nop; nop
 #else
 #define SAVE_30GPRS(base) stmw	r2, GPR2(base)
diff --git a/arch/powerpc/kernel/tm.S b/arch/powerpc/kernel/tm.S
index 2b91f233b05d..3beecc32940b 100644
--- a/arch/powerpc/kernel/tm.S
+++ b/arch/powerpc/kernel/tm.S
@@ -226,11 +226,8 @@ _GLOBAL(tm_reclaim)
 
 	/* Sync the userland GPRs 2-12, 14-31 to thread->regs: */
 	SAVE_GPR(0, r7)				/* user r0 */
-	SAVE_GPR(2, r7)				/* user r2 */
-	SAVE_4GPRS(3, r7)			/* user r3-r6 */
-	SAVE_GPR(8, r7)				/* user r8 */
-	SAVE_GPR(9, r7)				/* user r9 */
-	SAVE_GPR(10, r7)			/* user r10 */
+	SAVE_GPRS(2, 6, r7)			/* user r2-r6 */
+	SAVE_GPRS(8, 10, r7)			/* user r8-r10 */
 	ld	r3, GPR1(r1)			/* user r1 */
 	ld	r4, GPR7(r1)			/* user r7 */
 	ld	r5, GPR11(r1)			/* user r11 */
@@ -445,12 +442,8 @@ restore_gprs:
 	ld	r6, THREAD_TM_PPR(r3)
 
 	REST_GPR(0, r7)				/* GPR0 */
-	REST_2GPRS(2, r7)			/* GPR2-3 */
-	REST_GPR(4, r7)				/* GPR4 */
-	REST_4GPRS(8, r7)			/* GPR8-11 */
-	REST_2GPRS(12, r7)			/* GPR12-13 */
-
-	REST_NVGPRS(r7)				/* GPR14-31 */
+	REST_GPRS(2, 4, r7)			/* GPR2-4 */
+	REST_GPRS(8, 31, r7)			/* GPR8-31 */
 
 	/* Load up PPR and DSCR here so we don't run with user values for long */
 	mtspr	SPRN_DSCR, r5
diff --git a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
index f9fd5f743eba..dd5e720fe44b 100644
--- a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
+++ b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
@@ -41,15 +41,14 @@ _GLOBAL(ftrace_regs_caller)
 
 	/* Save all gprs to pt_regs */
 	SAVE_GPR(0, r1)
-	SAVE_10GPRS(2, r1)
+	SAVE_GPRS(2, 11, r1)
 
 	/* Ok to continue? */
 	lbz	r3, PACA_FTRACE_ENABLED(r13)
 	cmpdi	r3, 0
 	beq	ftrace_no_trace
 
-	SAVE_10GPRS(12, r1)
-	SAVE_10GPRS(22, r1)
+	SAVE_GPRS(12, 31, r1)
 
 	/* Save previous stack pointer (r1) */
 	addi	r8, r1, SWITCH_FRAME_SIZE
@@ -108,10 +107,7 @@ ftrace_regs_call:
 #endif
 
 	/* Restore gprs */
-	REST_GPR(0,r1)
-	REST_10GPRS(2,r1)
-	REST_10GPRS(12,r1)
-	REST_10GPRS(22,r1)
+	REST_GPRS(2, 31, r1)
 
 	/* Restore possibly modified LR */
 	ld	r0, _LINK(r1)
@@ -157,7 +153,7 @@ _GLOBAL(ftrace_caller)
 	stdu	r1, -SWITCH_FRAME_SIZE(r1)
 
 	/* Save all gprs to pt_regs */
-	SAVE_8GPRS(3, r1)
+	SAVE_GPRS(3, 10, r1)
 
 	lbz	r3, PACA_FTRACE_ENABLED(r13)
 	cmpdi	r3, 0
@@ -194,7 +190,7 @@ ftrace_call:
 	mtctr	r3
 
 	/* Restore gprs */
-	REST_8GPRS(3,r1)
+	REST_GPRS(3, 10, r1)
 
 	/* Restore callee's TOC */
 	ld	r2, 24(r1)
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 75079397c2a5..62b692159d6f 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -2673,8 +2673,7 @@ kvmppc_bad_host_intr:
 	std	r0, GPR0(r1)
 	std	r9, GPR1(r1)
 	std	r2, GPR2(r1)
-	SAVE_4GPRS(3, r1)
-	SAVE_2GPRS(7, r1)
+	SAVE_GPRS(3, 8, r1)
 	srdi	r0, r12, 32
 	clrldi	r12, r12, 32
 	std	r0, _CCR(r1)
@@ -2697,7 +2696,7 @@ kvmppc_bad_host_intr:
 	ld	r9, HSTATE_SCRATCH2(r13)
 	ld	r12, HSTATE_SCRATCH0(r13)
 	GET_SCRATCH0(r0)
-	SAVE_4GPRS(9, r1)
+	SAVE_GPRS(9, 12, r1)
 	std	r0, GPR13(r1)
 	SAVE_NVGPRS(r1)
 	ld	r5, HSTATE_CFAR(r13)
diff --git a/arch/powerpc/lib/test_emulate_step_exec_instr.S b/arch/powerpc/lib/test_emulate_step_exec_instr.S
index 9ef941d958d8..5473f9d03df3 100644
--- a/arch/powerpc/lib/test_emulate_step_exec_instr.S
+++ b/arch/powerpc/lib/test_emulate_step_exec_instr.S
@@ -37,7 +37,7 @@ _GLOBAL(exec_instr)
 	 * The stack pointer (GPR1) and the thread pointer (GPR13) are not
 	 * saved as these should not be modified anyway.
 	 */
-	SAVE_2GPRS(2, r1)
+	SAVE_GPRS(2, 3, r1)
 	SAVE_NVGPRS(r1)
 
 	/*
@@ -75,8 +75,7 @@ _GLOBAL(exec_instr)
 
 	/* Load GPRs from pt_regs */
 	REST_GPR(0, r31)
-	REST_10GPRS(2, r31)
-	REST_GPR(12, r31)
+	REST_GPRS(2, 12, r31)
 	REST_NVGPRS(r31)
 
 	/* Placeholder for the test instruction */
@@ -99,8 +98,7 @@ _GLOBAL(exec_instr)
 	subi	r3, r3, GPR0
 	SAVE_GPR(0, r3)
 	SAVE_GPR(2, r3)
-	SAVE_8GPRS(4, r3)
-	SAVE_GPR(12, r3)
+	SAVE_GPRS(4, 12, r3)
 	SAVE_NVGPRS(r3)
 
 	/* Save resulting LR to pt_regs */
-- 
2.23.0


^ permalink raw reply related

* [PATCH 1/1] powerpc: Drop superfluous pci_dev_is_added() calls
From: Niklas Schnelle @ 2021-09-10 14:19 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-arch, linux-s390, linux-pci, linux-kernel,
	Oliver O'Halloran, linuxppc-dev
In-Reply-To: <20210910141940.2598035-1-schnelle@linux.ibm.com>

On powerpc, pci_dev_is_added() is called as part of SR-IOV fixups
that are done under pcibios_add_device() which in turn is only called in
pci_device_add() whih is called when a PCI device is scanned.

Now pci_dev_assign_added() is called in pci_bus_add_device() which is
only called after scanning the device. Thus pci_dev_is_added() is always
false and can be dropped.

Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 6 ------
 arch/powerpc/platforms/pseries/setup.c     | 3 +--
 2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c b/arch/powerpc/platforms/powernv/pci-sriov.c
index 28aac933a439..deddbb233fde 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -9,9 +9,6 @@
 
 #include "pci.h"
 
-/* for pci_dev_is_added() */
-#include "../../../../drivers/pci/pci.h"
-
 /*
  * The majority of the complexity in supporting SR-IOV on PowerNV comes from
  * the need to put the MMIO space for each VF into a separate PE. Internally
@@ -228,9 +225,6 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 
 void pnv_pci_ioda_fixup_iov(struct pci_dev *pdev)
 {
-	if (WARN_ON(pci_dev_is_added(pdev)))
-		return;
-
 	if (pdev->is_virtfn) {
 		struct pnv_ioda_pe *pe = pnv_ioda_get_pe(pdev);
 
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index f79126f16258..2188054470c1 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -74,7 +74,6 @@
 #include <asm/hvconsole.h>
 
 #include "pseries.h"
-#include "../../../../drivers/pci/pci.h"
 
 DEFINE_STATIC_KEY_FALSE(shared_processor);
 EXPORT_SYMBOL(shared_processor);
@@ -750,7 +749,7 @@ static void pseries_pci_fixup_iov_resources(struct pci_dev *pdev)
 	const int *indexes;
 	struct device_node *dn = pci_device_to_OF_node(pdev);
 
-	if (!pdev->is_physfn || pci_dev_is_added(pdev))
+	if (!pdev->is_physfn)
 		return;
 	/*Firmware must support open sriov otherwise dont configure*/
 	indexes = of_get_property(dn, "ibm,open-sriov-vf-bar-info", NULL);
-- 
2.25.1


^ permalink raw reply related

* [PATCH 0/1] Arch use of pci_dev_is_added()
From: Niklas Schnelle @ 2021-09-10 14:19 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-arch, linux-s390, linux-pci, linux-kernel,
	Oliver O'Halloran, linuxppc-dev

Hi Bjorn, Hi Michael,

In my proposal to make pci_dev_is_added() more regularly usable by arch code
you mentioned[0] that you believe the uses in arch/powerpc are not necessary
anymore. From code reading I agree and so does Oliver O'Halloran[1].

So as promised here is a patch removing them. I only compile tested this as
I don't have access to a powerpc system.

I've also looked a bit more into our use in s390 and as dicussed previously
I don't think we can cleanly get rid of the existing one in
arch/s390/pci_sysfs.c:recover_store() because we need to distinguish an already
removed pdev just by looking at the pdev itself.

As for new uses I think in the upcoming automatic recovery code we can rely on
the fact that a removed device has pdev->driver == NULL and don't need
pci_dev_is_added() but it would make things clearer. I also noticed that before
commit 44bda4b7d26e9 ("PCI: Fix is_added/is_busmaster race condition") there
was simply a pdev->is_added flag that was cleanly accessible by arch code. So
I wanted to ask for your advice.

Thanks,
Niklas

[0] https://lore.kernel.org/lkml/20210825190444.GA3593752@bjorn-Precision-5520/
[1] https://lore.kernel.org/lkml/CAOSf1CFyuf9FaeSNparj+7W0mKTPvtcM8vxjHDSFsNDC6k_7xQ@mail.gmail.com/

Niklas Schnelle (1):
  powerpc: Drop superfluous pci_dev_is_added() calls

 arch/powerpc/platforms/powernv/pci-sriov.c | 6 ------
 arch/powerpc/platforms/pseries/setup.c     | 3 +--
 2 files changed, 1 insertion(+), 8 deletions(-)

-- 
2.25.1

^ permalink raw reply

* Re: [PATCH v2 3/5] signal: Add unsafe_copy_siginfo_to_user()
From: Christophe Leroy @ 2021-09-10 10:27 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linuxppc-dev, Paul Mackerras, linux-kernel
In-Reply-To: <877dfrrkxo.fsf@disp2133>



On 9/8/21 6:17 PM, Eric W. Biederman wrote:
> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> 
>> Le 02/09/2021 à 20:43, Eric W. Biederman a écrit :
>>> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>>>
>>>> In the same spirit as commit fb05121fd6a2 ("signal: Add
>>>> unsafe_get_compat_sigset()"), implement an 'unsafe' version of
>>>> copy_siginfo_to_user() in order to use it within user access blocks.
>>>>
>>>> For that, also add an 'unsafe' version of clear_user().
>>>
>>> Looking at your use cases you need the 32bit compat version of this
>>> as well.
>>>
>>> The 32bit compat version is too complicated to become a macro, so I
>>> don't think you can make this work correctly for the 32bit compat case.
>>
>> When looking into patch 5/5 that you nacked, I think you missed the fact that we
>> keep using copy_siginfo_to_user32() as it for the 32 bit compat case.
> 
> I did.  My mistake.
> 
> However that mistake was so easy I think it mirrors the comments others
> have made that this looks like a maintenance hazard.
> 
> Is improving the performance of 32bit kernels interesting?

Yes it is, and that's what this series do.

> Is improving the performance of 32bit compat support interesting?

For me this is a corner case, so I left it aside for now.

> 
> If performance one or either of those cases is interesting it looks like
> we already have copy_siginfo_to_external32 the factor you would need
> to build unsafe_copy_siginfo_to_user32.

I'm not sure I understand your saying here. What do you expect me to do 
with copy_siginfo_to_external32() ?

copy_siginfo_to_user32() is for compat only.

Native 32 bits powerpc use copy_siginfo_to_user()

> 
> So I am not going to say impossible but please make something
> maintainable.  I unified all of the compat 32bit siginfo logic because
> it simply did not get enough love and attention when it was implemented
> per architecture.

Yes, and ? I didn't do any modification to the compat case, so what you 
did remains.


> 
> In general I think that concern applies to this case as well.  We really
> need an implementation that shares as much burden as possible with other
> architectures.

I think yes, that's the reason why I made a generic 
unsafe_copy_siginfo_to_user() and didn't make a powerpc dedicated change.

Once this is merged any other architecture can use 
unsafe_copy_siginfo_to_user().

Did I miss something ?

Christophe

^ permalink raw reply

* Re: [PATCH for-5.15 0/5] ASoC: fsl: register platform component before registering cpu dai
From: Mark Brown @ 2021-09-10 10:16 UTC (permalink / raw)
  To: Shengjiu Wang
  Cc: alsa-devel, Timur Tabi, Xiubo Li, Fabio Estevam, Shengjiu Wang,
	Takashi Iwai, linux-kernel, Nicolin Chen, linuxppc-dev
In-Reply-To: <CAA+D8APjyq74FECmH6ZzyVKHOz6MEV0bt+D4-Xkfc-6C5n9hZg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 313 bytes --]

On Tue, Sep 07, 2021 at 10:43:26AM +0800, Shengjiu Wang wrote:

> It is hard to add conditions here for defer probe.  And maybe
> some drivers need the same components for cpu and platform.
> Do you have any suggestions?

I would expect that in this situation the same component would be able
to fill both roles.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox