LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v13 01/12] swiotlb: Refactor swiotlb init functions
From: Christoph Hellwig @ 2021-06-21 20:19 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: thomas.hellstrom, heikki.krogerus, linux-devicetree, peterz,
	joonas.lahtinen, dri-devel, chris, grant.likely, paulus,
	Frank Rowand, mingo, Marek Szyprowski, Nicolas Boichat,
	Saravana Kannan, Joerg Roedel, Rafael J . Wysocki,
	Christoph Hellwig, Bartosz Golaszewski, bskeggs, linux-pci,
	xen-devel, Thierry Reding, intel-gfx, matthew.auld, Tom Lendacky,
	Jianxiong Gao, Daniel Vetter, Will Deacon, Konrad Rzeszutek Wilk,
	maarten.lankhorst, airlied, Dan Williams, linuxppc-dev,
	jani.nikula, Rob Herring, rodrigo.vivi, Bjorn Helgaas,
	Claire Chang, boris.ostrovsky, Andy Shevchenko, jgross, Greg KH,
	Randy Dunlap, lkml, Tomasz Figa, list@263.net:IOMMU DRIVERS,
	Jim Quinlan, xypron.glpk, Robin Murphy, bauerman
In-Reply-To: <alpine.DEB.2.21.2106211052270.24906@sstabellini-ThinkPad-T480s>

On Mon, Jun 21, 2021 at 10:59:20AM -0700, Stefano Stabellini wrote:
> Just as a clarification: I was referring to the zeroing of "mem" in
> swiotlb_late_init_with_tbl and swiotlb_init_with_tbl. While it looks
> like Tom and Christoph are talking about the zeroing of "tlb".

Indeed. 

> 
> The zeroing of "mem" is required as some fields of struct io_tlb_mem
> need to be initialized to zero. While the zeroing of "tlb" is not
> required except from the point of view of not leaking sensitive data as
> Christoph mentioned.
> 
> Either way, Claire's v14 patch 01/12 looks fine to me.

Agreed.

^ permalink raw reply

* Re: [PATCH v2] ASoC: fsl_xcvr: disable all interrupts when suspend happens
From: Mark Brown @ 2021-06-21 18:46 UTC (permalink / raw)
  To: Xiubo.Lee, tiwai, nicoleotsuka, perex, timur, alsa-devel,
	festevam, Shengjiu Wang
  Cc: Mark Brown, linuxppc-dev, linux-kernel
In-Reply-To: <1624019913-3380-1-git-send-email-shengjiu.wang@nxp.com>

On Fri, 18 Jun 2021 20:38:33 +0800, Shengjiu Wang wrote:
> There is an unhandled interrupt after suspend, which cause endless
> interrupt when system resume, so system may hang.
> 
> Disable all interrupts in runtime suspend callback to avoid above
> issue.

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/1] ASoC: fsl_xcvr: disable all interrupts when suspend happens
      commit: ea837090b388245744988083313f6e9c7c9b9699

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

^ permalink raw reply

* Re: [PATCH] ASoC: fsl_xcvr: disable all interrupts when suspend happens
From: Mark Brown @ 2021-06-21 18:46 UTC (permalink / raw)
  To: tiwai, Xiubo.Lee, nicoleotsuka, perex, timur, alsa-devel,
	festevam, Shengjiu Wang
  Cc: Mark Brown, linuxppc-dev, linux-kernel
In-Reply-To: <1624009876-3076-1-git-send-email-shengjiu.wang@nxp.com>

On Fri, 18 Jun 2021 17:51:16 +0800, Shengjiu Wang wrote:
> There is an unhandled interrupt after suspend, which cause endless
> interrupt when system resume, so system may hang.
> 
> Disable all interrupts in runtime suspend callback to avoid above
> issue.

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/1] ASoC: fsl_xcvr: disable all interrupts when suspend happens
      commit: ea837090b388245744988083313f6e9c7c9b9699

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

^ permalink raw reply

* Re: [PATCH -next v2 0/9] ASoC: fsl: Use devm_platform_get_and_ioremap_resource()
From: Mark Brown @ 2021-06-21 18:45 UTC (permalink / raw)
  To: linuxppc-dev, Yang Yingliang, linux-kernel, alsa-devel; +Cc: Mark Brown, timur
In-Reply-To: <20210615013922.784296-1-yangyingliang@huawei.com>

On Tue, 15 Jun 2021 09:39:13 +0800, Yang Yingliang wrote:
> patch #1 ~ #8:
>   Use devm_platform_get_and_ioremap_resource()
> 
> patch #9
>   check return value of platform_get_resource_byname()
> 
> v2:
>   change error message in patch #9
> 
> [...]

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/9] ASoC: fsl_asrc: Use devm_platform_get_and_ioremap_resource()
      commit: c66d7621737fb07e660b3d6eef40636ef4e9103a
[2/9] ASoC: fsl_aud2htx: Use devm_platform_get_and_ioremap_resource()
      commit: 41e90cbbc50085487b4633f08c86dd71b0f18d7f
[3/9] ASoC: fsl_easrc: Use devm_platform_get_and_ioremap_resource()
      commit: 2cd16cf0d6bbb47adddc633c60ca405f672e64f4
[4/9] ASoC: fsl_esai: Use devm_platform_get_and_ioremap_resource()
      commit: f25bb69e6f04a3d45effbe1c571f5f3ac10253bb
[5/9] ASoC: fsl_micfil: Use devm_platform_get_and_ioremap_resource()
      commit: d9bf1e791ae61d606b0da0003ad19dbe7f252fe8
[6/9] ASoC: fsl_sai: Use devm_platform_get_and_ioremap_resource()
      commit: 664107f63888bdd8a5e1d38c8246b9508a1dc46a
[7/9] ASoC: fsl_spdif: Use devm_platform_get_and_ioremap_resource()
      commit: cbb7ea0aebf0c07061be615cab97ac9cab8a48a0
[8/9] ASoC: fsl_ssi: Use devm_platform_get_and_ioremap_resource()
      commit: 67798860e6d0114149562e6897cf07ba4bebc1d6
[9/9] ASoC: fsl_xcvr: check return value after calling platform_get_resource_byname()
      commit: a2f6ed4a44721d3a9fdf4da7e0743cb13866bf61

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

^ permalink raw reply

* [PATCH] KVM: PPC: Book3S HV: Workaround high stack usage with clang
From: Nathan Chancellor @ 2021-06-21 18:24 UTC (permalink / raw)
  To: Paul Mackerras, Michael Ellerman
  Cc: kernel test robot, Arnd Bergmann, Nick Desaulniers, linux-kernel,
	kvm-ppc, Nathan Chancellor, clang-built-linux, Nicholas Piggin,
	linuxppc-dev
In-Reply-To: <YNDUEoanTqvayZ5P@archlinux-ax161>

LLVM does not emit optimal byteswap assembly, which results in high
stack usage in kvmhv_enter_nested_guest() due to the inlining of
byteswap_pt_regs(). With LLVM 12.0.0:

arch/powerpc/kvm/book3s_hv_nested.c:289:6: error: stack frame size of
2512 bytes in function 'kvmhv_enter_nested_guest' [-Werror,-Wframe-larger-than=]
long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
     ^
1 error generated.

While this gets fixed in LLVM, mark byteswap_pt_regs() as
noinline_for_stack so that it does not get inlined and break the build
due to -Werror by default in arch/powerpc/. Not inlining saves
approximately 800 bytes with LLVM 12.0.0:

arch/powerpc/kvm/book3s_hv_nested.c:290:6: warning: stack frame size of
1728 bytes in function 'kvmhv_enter_nested_guest' [-Wframe-larger-than=]
long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
     ^
1 warning generated.

Link: https://github.com/ClangBuiltLinux/linux/issues/1292
Link: https://bugs.llvm.org/show_bug.cgi?id=49610
Link: https://lore.kernel.org/r/202104031853.vDT0Qjqj-lkp@intel.com/
Link: https://gist.github.com/ba710e3703bf45043a31e2806c843ffd
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
---
 arch/powerpc/kvm/book3s_hv_nested.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 60724f674421..1b3ff0af1264 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -53,7 +53,8 @@ void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr)
 	hr->dawrx1 = vcpu->arch.dawrx1;
 }
 
-static void byteswap_pt_regs(struct pt_regs *regs)
+/* Use noinline_for_stack due to https://bugs.llvm.org/show_bug.cgi?id=49610 */
+static noinline_for_stack void byteswap_pt_regs(struct pt_regs *regs)
 {
 	unsigned long *addr = (unsigned long *) regs;
 

base-commit: 4a21192e2796c3338c4b0083b494a84a61311aaf
-- 
2.32.0.93.g670b81a890


^ permalink raw reply related

* Re: arch/powerpc/kvm/book3s_hv_nested.c:264:6: error: stack frame size of 2304 bytes in function 'kvmhv_enter_nested_guest'
From: Nathan Chancellor @ 2021-06-21 18:01 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: kbuild-all, kernel test robot, Arnd Bergmann,
	Linux Memory Management List, linux-kernel, Nicholas Piggin,
	clang-built-linux, kvm-ppc, Andrew Morton, linuxppc-dev,
	Kees Cook
In-Reply-To: <87im273604.fsf@mpe.ellerman.id.au>

On Mon, Jun 21, 2021 at 07:46:03PM +1000, Michael Ellerman wrote:
> Nathan Chancellor <nathan@kernel.org> writes:
> > On 6/20/2021 4:59 PM, Nicholas Piggin wrote:
> >> Excerpts from kernel test robot's message of April 3, 2021 8:47 pm:
> >>> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> >>> head:   d93a0d43e3d0ba9e19387be4dae4a8d5b175a8d7
> >>> commit: 97e4910232fa1f81e806aa60c25a0450276d99a2 linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP*
> >>> date:   3 weeks ago
> >>> config: powerpc64-randconfig-r006-20210403 (attached as .config)
> >>> compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 0fe8af94688aa03c01913c2001d6a1a911f42ce6)
> >>> reproduce (this is a W=1 build):
> >>>          wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> >>>          chmod +x ~/bin/make.cross
> >>>          # install powerpc64 cross compiling tool for clang build
> >>>          # apt-get install binutils-powerpc64-linux-gnu
> >>>          # https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=97e4910232fa1f81e806aa60c25a0450276d99a2
> >>>          git remote add linus https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> >>>          git fetch --no-tags linus master
> >>>          git checkout 97e4910232fa1f81e806aa60c25a0450276d99a2
> >>>          # save the attached .config to linux build tree
> >>>          COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=powerpc64
> >>>
> >>> If you fix the issue, kindly add following tag as appropriate
> >>> Reported-by: kernel test robot <lkp@intel.com>
> >>>
> >>> All errors (new ones prefixed by >>):
> >>>
> >>>>> arch/powerpc/kvm/book3s_hv_nested.c:264:6: error: stack frame size of 2304 bytes in function 'kvmhv_enter_nested_guest' [-Werror,-Wframe-larger-than=]
> >>>     long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
> >>>          ^
> >>>     1 error generated.
> >>>
> >>>
> >>> vim +/kvmhv_enter_nested_guest +264 arch/powerpc/kvm/book3s_hv_nested.c
> >> 
> >> Not much changed here recently. It's not that big a concern because it's
> >> only called in the KVM ioctl path, not in any deep IO paths or anything,
> >> and doesn't recurse. Might be a bit of inlining or stack spilling put it
> >> over the edge.
> >
> > It appears to be the fact that LLVM's PowerPC backend does not emit 
> > efficient byteswap assembly:
> >
> > https://github.com/ClangBuiltLinux/linux/issues/1292
> >
> > https://bugs.llvm.org/show_bug.cgi?id=49610
> >
> >> powerpc does make it an error though, would be good to avoid that so the
> >> robot doesn't keep tripping over.
> >
> > Marking byteswap_pt_regs as 'noinline_for_stack' drastically reduces the 
> > stack usage. If that is an acceptable solution, I can send it along 
> > tomorrow.
> 
> Yeah that should be OK. Can you post the before/after disassembly when
> you post the patch?
> 
> It should just be two extra function calls, which shouldn't be enough
> overhead to be measurable.

The diff is pretty large so I have attached it here along with the full
disassembly of the files before and after the patch I am about to send.
I will reply to this message so the history is there.

Cheers,
Nathan

^ permalink raw reply

* Re: [PATCH v13 01/12] swiotlb: Refactor swiotlb init functions
From: Stefano Stabellini @ 2021-06-21 17:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: thomas.hellstrom, heikki.krogerus, linux-devicetree, peterz,
	joonas.lahtinen, dri-devel, chris, grant.likely, paulus,
	Frank Rowand, mingo, Marek Szyprowski, Stefano Stabellini,
	Saravana Kannan, Joerg Roedel, Rafael J . Wysocki,
	Bartosz Golaszewski, bskeggs, linux-pci, xen-devel,
	Thierry Reding, intel-gfx, matthew.auld, Tom Lendacky,
	Jianxiong Gao, Daniel Vetter, Will Deacon, Konrad Rzeszutek Wilk,
	maarten.lankhorst, airlied, Dan Williams, linuxppc-dev,
	jani.nikula, Rob Herring, rodrigo.vivi, Bjorn Helgaas,
	Claire Chang, boris.ostrovsky, Andy Shevchenko, jgross,
	Nicolas Boichat, Greg KH, Randy Dunlap, lkml, Tomasz Figa,
	list@263.net:IOMMU DRIVERS, Jim Quinlan, xypron.glpk,
	Robin Murphy, bauerman
In-Reply-To: <20210618143212.GA19284@lst.de>

On Fri, 18 Jun 2021, Christoph Hellwig wrote:
> On Fri, Jun 18, 2021 at 09:09:17AM -0500, Tom Lendacky wrote:
> > > swiotlb_init_with_tbl uses memblock_alloc to allocate the io_tlb_mem
> > > and memblock_alloc[1] will do memset in memblock_alloc_try_nid[2], so
> > > swiotlb_init_with_tbl is also good.
> > > I'm happy to add the memset in swiotlb_init_io_tlb_mem if you think
> > > it's clearer and safer.
> > 
> > On x86, if the memset is done before set_memory_decrypted() and memory
> > encryption is active, then the memory will look like ciphertext afterwards
> > and not be zeroes. If zeroed memory is required, then a memset must be
> > done after the set_memory_decrypted() calls.
> 
> Which should be fine - we don't care that the memory is cleared to 0,
> just that it doesn't leak other data.  Maybe a comment would be useful,
> though,

Just as a clarification: I was referring to the zeroing of "mem" in
swiotlb_late_init_with_tbl and swiotlb_init_with_tbl. While it looks
like Tom and Christoph are talking about the zeroing of "tlb".

The zeroing of "mem" is required as some fields of struct io_tlb_mem
need to be initialized to zero. While the zeroing of "tlb" is not
required except from the point of view of not leaking sensitive data as
Christoph mentioned.

Either way, Claire's v14 patch 01/12 looks fine to me.

^ permalink raw reply

* Re: [powerpc][5.13.0-rc7] Kernel warning (kernel/sched/fair.c:401) while running LTP tests
From: Vincent Guittot @ 2021-06-21 17:44 UTC (permalink / raw)
  To: Sachin Sant; +Cc: Peter Zijlstra, linuxppc-dev, Odin Ugedal, open list
In-Reply-To: <14EEE4A4-B2B1-4D0C-B2F6-BDB7C11B05DE@linux.vnet.ibm.com>

On Mon, 21 Jun 2021 at 19:32, Sachin Sant <sachinp@linux.vnet.ibm.com> wrote:
>
> >>> Any thoughts Vincent?
> >>
> >>
> >> I would prefer that we use the reason of adding the cfs in the list instead.
> >>
> >> Something like the below should also fixed the problem. It is based on a
> >> proposal I made to Rik sometimes ago when he tried to flatten the rq:
> >> https://lore.kernel.org/lkml/20190906191237.27006-6-riel@surriel.com/
> >>
> >> This will ensure that a cfs is added in the list whenever one of its  child
> >> is still in the list.
> >
> > Could you confirm that this patch fixes the problem for you too ?
> >
> Thanks for the fix.
>
> The patch fixes the reported problem. The test ran to completion without
> any failure.
>
> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>

Thanks

>
> -Sachin
>

^ permalink raw reply

* Re: [powerpc][5.13.0-rc7] Kernel warning (kernel/sched/fair.c:401) while running LTP tests
From: Sachin Sant @ 2021-06-21 17:31 UTC (permalink / raw)
  To: Vincent Guittot; +Cc: Peter Zijlstra, linuxppc-dev, Odin Ugedal, open list
In-Reply-To: <CAKfTPtACzzoGhDFW0bTGgZRPB=3LR6kSwuUOrcKDFTAJ7BhTFQ@mail.gmail.com>

>>> Any thoughts Vincent?
>> 
>> 
>> I would prefer that we use the reason of adding the cfs in the list instead.
>> 
>> Something like the below should also fixed the problem. It is based on a
>> proposal I made to Rik sometimes ago when he tried to flatten the rq:
>> https://lore.kernel.org/lkml/20190906191237.27006-6-riel@surriel.com/
>> 
>> This will ensure that a cfs is added in the list whenever one of its  child
>> is still in the list.
> 
> Could you confirm that this patch fixes the problem for you too ?
> 
Thanks for the fix.

The patch fixes the reported problem. The test ran to completion without
any failure. 

Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>

-Sachin


^ permalink raw reply

* Re: [PATCH v8 4/6] KVM: PPC: Book3S HV: Nested support in H_RPT_INVALIDATE
From: Nathan Chancellor @ 2021-06-21 17:12 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: farosas, aneesh.kumar, npiggin, kvm-ppc, linuxppc-dev, david
In-Reply-To: <20210621085003.904767-5-bharata@linux.ibm.com>

On Mon, Jun 21, 2021 at 02:20:01PM +0530, Bharata B Rao wrote:
> Enable support for process-scoped invalidations from nested
> guests and partition-scoped invalidations for nested guests.
> 
> Process-scoped invalidations for any level of nested guests
> are handled by implementing H_RPT_INVALIDATE handler in the
> nested guest exit path in L0.
> 
> Partition-scoped invalidation requests are forwarded to the
> right nested guest, handled there and passed down to L0
> for eventual handling.
> 
> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> 	[Nested guest partition-scoped invalidation changes]
> ---
>  .../include/asm/book3s/64/tlbflush-radix.h    |   4 +
>  arch/powerpc/include/asm/kvm_book3s.h         |   3 +
>  arch/powerpc/kvm/book3s_hv.c                  |  59 ++++++++-
>  arch/powerpc/kvm/book3s_hv_nested.c           | 117 ++++++++++++++++++
>  arch/powerpc/mm/book3s64/radix_tlb.c          |   4 -
>  5 files changed, 180 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
> index 8b33601cdb9d..a46fd37ad552 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
> @@ -4,6 +4,10 @@
>  
>  #include <asm/hvcall.h>
>  
> +#define RIC_FLUSH_TLB 0
> +#define RIC_FLUSH_PWC 1
> +#define RIC_FLUSH_ALL 2
> +
>  struct vm_area_struct;
>  struct mm_struct;
>  struct mmu_gather;
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
> index e6b53c6e21e3..caaa0f592d8e 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -307,6 +307,9 @@ void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1);
>  void kvmhv_release_all_nested(struct kvm *kvm);
>  long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
>  long kvmhv_do_nested_tlbie(struct kvm_vcpu *vcpu);
> +long do_h_rpt_invalidate_pat(struct kvm_vcpu *vcpu, unsigned long lpid,
> +			     unsigned long type, unsigned long pg_sizes,
> +			     unsigned long start, unsigned long end);
>  int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu,
>  			  u64 time_limit, unsigned long lpcr);
>  void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr);
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 7e6da4687d88..3d5b8ba3786d 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -925,6 +925,34 @@ static int kvmppc_get_yield_count(struct kvm_vcpu *vcpu)
>  	return yield_count;
>  }
>  
> +/*
> + * H_RPT_INVALIDATE hcall handler for nested guests.
> + *
> + * Handles only nested process-scoped invalidation requests in L0.
> + */
> +static int kvmppc_nested_h_rpt_invalidate(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long type = kvmppc_get_gpr(vcpu, 6);
> +	unsigned long pid, pg_sizes, start, end;
> +
> +	/*
> +	 * The partition-scoped invalidations aren't handled here in L0.
> +	 */
> +	if (type & H_RPTI_TYPE_NESTED)
> +		return RESUME_HOST;
> +
> +	pid = kvmppc_get_gpr(vcpu, 4);
> +	pg_sizes = kvmppc_get_gpr(vcpu, 7);
> +	start = kvmppc_get_gpr(vcpu, 8);
> +	end = kvmppc_get_gpr(vcpu, 9);
> +
> +	do_h_rpt_invalidate_prt(pid, vcpu->arch.nested->shadow_lpid,
> +				type, pg_sizes, start, end);
> +
> +	kvmppc_set_gpr(vcpu, 3, H_SUCCESS);
> +	return RESUME_GUEST;
> +}
> +
>  static long kvmppc_h_rpt_invalidate(struct kvm_vcpu *vcpu,
>  				    unsigned long id, unsigned long target,
>  				    unsigned long type, unsigned long pg_sizes,
> @@ -938,10 +966,18 @@ static long kvmppc_h_rpt_invalidate(struct kvm_vcpu *vcpu,
>  
>  	/*
>  	 * Partition-scoped invalidation for nested guests.
> -	 * Not yet supported
>  	 */
> -	if (type & H_RPTI_TYPE_NESTED)
> -		return H_P3;
> +	if (type & H_RPTI_TYPE_NESTED) {
> +		if (!nesting_enabled(vcpu->kvm))
> +			return H_FUNCTION;
> +
> +		/* Support only cores as target */
> +		if (target != H_RPTI_TARGET_CMMU)
> +			return H_P2;
> +
> +		return do_h_rpt_invalidate_pat(vcpu, id, type, pg_sizes,
> +					       start, end);
> +	}
>  
>  	/*
>  	 * Process-scoped invalidation for L1 guests.
> @@ -1629,6 +1665,23 @@ static int kvmppc_handle_nested_exit(struct kvm_vcpu *vcpu)
>  		if (!xics_on_xive())
>  			kvmppc_xics_rm_complete(vcpu, 0);
>  		break;
> +	case BOOK3S_INTERRUPT_SYSCALL:
> +	{
> +		unsigned long req = kvmppc_get_gpr(vcpu, 3);
> +
> +		/*
> +		 * The H_RPT_INVALIDATE hcalls issued by nested
> +		 * guests for process-scoped invalidations when
> +		 * GTSE=0, are handled here in L0.
> +		 */
> +		if (req == H_RPT_INVALIDATE) {
> +			r = kvmppc_nested_h_rpt_invalidate(vcpu);
> +			break;
> +		}
> +
> +		r = RESUME_HOST;
> +		break;
> +	}
>  	default:
>  		r = RESUME_HOST;
>  		break;
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
> index 60724f674421..056d3df68de1 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -1214,6 +1214,123 @@ long kvmhv_do_nested_tlbie(struct kvm_vcpu *vcpu)
>  	return H_SUCCESS;
>  }
>  
> +static long do_tlb_invalidate_nested_tlb(struct kvm_vcpu *vcpu,
> +					 unsigned long lpid,
> +					 unsigned long page_size,
> +					 unsigned long ap,
> +					 unsigned long start,
> +					 unsigned long end)
> +{
> +	unsigned long addr = start;
> +	int ret;
> +
> +	do {
> +		ret = kvmhv_emulate_tlbie_tlb_addr(vcpu, lpid, ap,
> +						   get_epn(addr));
> +		if (ret)
> +			return ret;
> +		addr += page_size;
> +	} while (addr < end);
> +
> +	return ret;
> +}
> +
> +static long do_tlb_invalidate_nested_all(struct kvm_vcpu *vcpu,
> +					 unsigned long lpid, unsigned long ric)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	struct kvm_nested_guest *gp;
> +
> +	gp = kvmhv_get_nested(kvm, lpid, false);
> +	if (gp) {
> +		kvmhv_emulate_tlbie_lpid(vcpu, gp, ric);
> +		kvmhv_put_nested(gp);
> +	}
> +	return H_SUCCESS;
> +}
> +
> +/*
> + * Number of pages above which we invalidate the entire LPID rather than
> + * flush individual pages.
> + */
> +static unsigned long tlb_range_flush_page_ceiling __read_mostly = 33;
> +
> +/*
> + * Performs partition-scoped invalidations for nested guests
> + * as part of H_RPT_INVALIDATE hcall.
> + */
> +long do_h_rpt_invalidate_pat(struct kvm_vcpu *vcpu, unsigned long lpid,
> +			     unsigned long type, unsigned long pg_sizes,
> +			     unsigned long start, unsigned long end)
> +{
> +	struct kvm_nested_guest *gp;
> +	long ret;
> +	unsigned long psize, ap;
> +
> +	/*
> +	 * If L2 lpid isn't valid, we need to return H_PARAMETER.
> +	 *
> +	 * However, nested KVM issues a L2 lpid flush call when creating
> +	 * partition table entries for L2. This happens even before the
> +	 * corresponding shadow lpid is created in HV which happens in
> +	 * H_ENTER_NESTED call. Since we can't differentiate this case from
> +	 * the invalid case, we ignore such flush requests and return success.
> +	 */
> +	gp = kvmhv_find_nested(vcpu->kvm, lpid);
> +	if (!gp)
> +		return H_SUCCESS;
> +
> +	/*
> +	 * A flush all request can be handled by a full lpid flush only.
> +	 */
> +	if ((type & H_RPTI_TYPE_NESTED_ALL) == H_RPTI_TYPE_NESTED_ALL)
> +		return do_tlb_invalidate_nested_all(vcpu, lpid, RIC_FLUSH_ALL);
> +
> +	/*
> +	 * We don't need to handle a PWC flush like process table here,
> +	 * because intermediate partition scoped table in nested guest doesn't
> +	 * really have PWC. Only level we have PWC is in L0 and for nested
> +	 * invalidate at L0 we always do kvm_flush_lpid() which does
> +	 * radix__flush_all_lpid(). For range invalidate at any level, we
> +	 * are not removing the higher level page tables and hence there is
> +	 * no PWC invalidate needed.
> +	 *
> +	 * if (type & H_RPTI_TYPE_PWC) {
> +	 *	ret = do_tlb_invalidate_nested_all(vcpu, lpid, RIC_FLUSH_PWC);
> +	 *	if (ret)
> +	 *		return H_P4;
> +	 * }
> +	 */
> +
> +	if (start == 0 && end == -1)
> +		return do_tlb_invalidate_nested_all(vcpu, lpid, RIC_FLUSH_TLB);
> +
> +	if (type & H_RPTI_TYPE_TLB) {
> +		struct mmu_psize_def *def;
> +		bool flush_lpid;
> +		unsigned long nr_pages;
> +
> +		for (psize = 0; psize < MMU_PAGE_COUNT; psize++) {
> +			def = &mmu_psize_defs[psize];
> +			if (!(pg_sizes & def->h_rpt_pgsize))
> +				continue;
> +
> +			nr_pages = (end - start) >> def->shift;
> +			flush_lpid = nr_pages > tlb_range_flush_page_ceiling;
> +			if (flush_lpid)
> +				return do_tlb_invalidate_nested_all(vcpu, lpid,
> +								RIC_FLUSH_TLB);
> +
> +			ret = do_tlb_invalidate_nested_tlb(vcpu, lpid,
> +							   (1UL << def->shift),
> +							   ap, start, end);

I have not seen this reported yet so apologies if it has and there is a
fix I am missing:

arch/powerpc/kvm/book3s_hv_nested.c:1334:11: error: variable 'ap' is uninitialized when used here [-Werror,-Wuninitialized]
                                                           ap, start, end);
                                                           ^~
arch/powerpc/kvm/book3s_hv_nested.c:1276:25: note: initialize the variable 'ap' to silence this warning
        unsigned long psize, ap;
                               ^
                                = 0
1 error generated.

Cheers,
Nathan

> +			if (ret)
> +				return H_P4;
> +		}
> +	}
> +	return H_SUCCESS;
> +}
> +
>  /* Used to convert a nested guest real address to a L1 guest real address */
>  static int kvmhv_translate_addr_nested(struct kvm_vcpu *vcpu,
>  				       struct kvm_nested_guest *gp,
> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
> index cdd98b9e7b15..4f38cf34ea40 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -20,10 +20,6 @@
>  
>  #include "internal.h"
>  
> -#define RIC_FLUSH_TLB 0
> -#define RIC_FLUSH_PWC 1
> -#define RIC_FLUSH_ALL 2
> -
>  /*
>   * tlbiel instruction for radix, set invalidation
>   * i.e., r=1 and is=01 or is=10 or is=11
> -- 
> 2.31.1

^ permalink raw reply

* Re: [powerpc][5.13.0-rc7] Kernel warning (kernel/sched/fair.c:401) while running LTP tests
From: Vincent Guittot @ 2021-06-21 17:09 UTC (permalink / raw)
  To: Sachin Sant; +Cc: Peter Zijlstra, Odin Ugedal, linuxppc-dev, open list
In-Reply-To: <20210621162243.GA29874@vingu-book>

Hi Sacha

On Mon, 21 Jun 2021 at 18:22, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> Le lundi 21 juin 2021 à 14:42:23 (+0200), Odin Ugedal a écrit :
> > Hi,
> >
> > Did some more research, and it looks like this is what happens:
> >
> > $ tree /sys/fs/cgroup/ltp/ -d --charset=ascii
> > /sys/fs/cgroup/ltp/
> > |-- drain
> > `-- test-6851
> >     `-- level2
> >         |-- level3a
> >         |   |-- worker1
> >         |   `-- worker2
> >         `-- level3b
> >             `-- worker3
> >
> > Timeline (ish):
> > - worker3 gets throttled
> > - level3b is decayed, since it has no more load
> > - level2 get throttled
> > - worker3 get unthrottled
> > - level2 get unthrottled
> >   - worker3 is added to list
> >   - level3b is not added to list, since nr_running==0 and is decayed
> >
> >
> > The attached diff (based on
> > https://lore.kernel.org/lkml/20210518125202.78658-3-odin@uged.al/)
> > fixes the issue for me. Not the most elegant solution, but the
> > simplest one as of now, and to show what is wrong.
> >
> > Any thoughts Vincent?
>
>
> I would prefer that we use the reason of adding the cfs in the list instead.
>
> Something like the below should also fixed the problem. It is based on a
> proposal I made to Rik sometimes ago when he tried to flatten the rq:
> https://lore.kernel.org/lkml/20190906191237.27006-6-riel@surriel.com/
>
> This will ensure that a cfs is added in the list whenever one of its  child
> is still in the list.

Could you confirm that this patch fixes the problem for you too ?

>
> ---
>  kernel/sched/fair.c | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ea7de54cb022..e751061a9449 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3272,6 +3272,31 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
>
>  #ifdef CONFIG_SMP
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> +/*
> + * Because list_add_leaf_cfs_rq always places a child cfs_rq on the list
> + * immediately before a parent cfs_rq, and cfs_rqs are removed from the list
> + * bottom-up, we only have to test whether the cfs_rq before us on the list
> + * is our child.
> + * If cfs_rq is not on the list, test wether a child needs its to be added to
> + * connect a branch to the tree  * (see list_add_leaf_cfs_rq() for details).
> + */
> +static inline bool child_cfs_rq_on_list(struct cfs_rq *cfs_rq)
> +{
> +       struct cfs_rq *prev_cfs_rq;
> +       struct list_head *prev;
> +
> +       if (cfs_rq->on_list) {
> +               prev = cfs_rq->leaf_cfs_rq_list.prev;
> +       } else {
> +               struct rq *rq = rq_of(cfs_rq);
> +
> +               prev = rq->tmp_alone_branch;
> +       }
> +
> +       prev_cfs_rq = container_of(prev, struct cfs_rq, leaf_cfs_rq_list);
> +
> +       return (prev_cfs_rq->tg->parent == cfs_rq->tg);
> +}
>
>  static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>  {
> @@ -3287,6 +3312,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>         if (cfs_rq->avg.runnable_sum)
>                 return false;
>
> +       if (child_cfs_rq_on_list(cfs_rq))
> +               return false;
> +
>         return true;
>  }
>
> --
> 2.17.1
>
>
>
> >
> > Thanks
> > Odin
> >
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bfaa6e1f6067..aa32e9c29efd 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -376,7 +376,8 @@ static inline bool list_add_leaf_cfs_rq(struct
> > cfs_rq *cfs_rq)
> >         return false;
> >  }
> >
> > -static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> > +/* Returns 1 if cfs_rq was present in the list and removed */
> > +static inline bool list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> >  {
> >         if (cfs_rq->on_list) {
> >                 struct rq *rq = rq_of(cfs_rq);
> > @@ -393,7 +394,9 @@ static inline void list_del_leaf_cfs_rq(struct
> > cfs_rq *cfs_rq)
> >
> >                 list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
> >                 cfs_rq->on_list = 0;
> > +               return 1;
> >         }
> > +       return 0;
> >  }
> >
> >  static inline void assert_list_leaf_cfs_rq(struct rq *rq)
> > @@ -3298,24 +3301,6 @@ static inline void cfs_rq_util_change(struct
> > cfs_rq *cfs_rq, int flags)
> >
> >  #ifdef CONFIG_SMP
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > -
> > -static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> > -{
> > -       if (cfs_rq->load.weight)
> > -               return false;
> > -
> > -       if (cfs_rq->avg.load_sum)
> > -               return false;
> > -
> > -       if (cfs_rq->avg.util_sum)
> > -               return false;
> > -
> > -       if (cfs_rq->avg.runnable_sum)
> > -               return false;
> > -
> > -       return true;
> > -}
> > -
> >  /**
> >   * update_tg_load_avg - update the tg's load avg
> >   * @cfs_rq: the cfs_rq whose avg changed
> > @@ -4109,11 +4094,6 @@ static inline void update_misfit_status(struct
> > task_struct *p, struct rq *rq)
> >
> >  #else /* CONFIG_SMP */
> >
> > -static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> > -{
> > -       return true;
> > -}
> > -
> >  #define UPDATE_TG      0x0
> >  #define SKIP_AGE_LOAD  0x0
> >  #define DO_ATTACH      0x0
> > @@ -4771,10 +4751,11 @@ static int tg_unthrottle_up(struct task_group
> > *tg, void *data)
> >         if (!cfs_rq->throttle_count) {
> >                 cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
> >                                              cfs_rq->throttled_clock_task;
> > -
> > -               /* Add cfs_rq with load or one or more already running
> > entities to the list */
> > -               if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
> > +               if (cfs_rq->insert_on_unthrottle) {
> >                         list_add_leaf_cfs_rq(cfs_rq);
> > +                       if (tg->parent)
> > +
> > tg->parent->cfs_rq[cpu_of(rq)]->insert_on_unthrottle = true;
> > +                       }
> >         }
> >
> >         return 0;
> > @@ -4788,7 +4769,7 @@ static int tg_throttle_down(struct task_group
> > *tg, void *data)
> >         /* group is entering throttled state, stop time */
> >         if (!cfs_rq->throttle_count) {
> >                 cfs_rq->throttled_clock_task = rq_clock_task(rq);
> > -               list_del_leaf_cfs_rq(cfs_rq);
> > +               cfs_rq->insert_on_unthrottle = list_del_leaf_cfs_rq(cfs_rq);
> >         }
> >         cfs_rq->throttle_count++;
> >
> > @@ -8019,6 +8000,23 @@ static bool __update_blocked_others(struct rq
> > *rq, bool *done)
> >
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> >
> > +static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> > +{
> > +       if (cfs_rq->load.weight)
> > +               return false;
> > +
> > +       if (cfs_rq->avg.load_sum)
> > +               return false;
> > +
> > +       if (cfs_rq->avg.util_sum)
> > +               return false;
> > +
> > +       if (cfs_rq->avg.runnable_sum)
> > +               return false;
> > +
> > +       return true;
> > +}
> > +
> >  static bool __update_blocked_fair(struct rq *rq, bool *done)
> >  {
> >         struct cfs_rq *cfs_rq, *pos;
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index a189bec13729..12a707d99ee6 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -602,6 +602,7 @@ struct cfs_rq {
> >         u64                     throttled_clock_task_time;
> >         int                     throttled;
> >         int                     throttle_count;
> > +       int                     insert_on_unthrottle;
> >         struct list_head        throttled_list;
> >  #endif /* CONFIG_CFS_BANDWIDTH */
> >  #endif /* CONFIG_FAIR_GROUP_SCHED */

^ permalink raw reply

* Re: [powerpc][5.13.0-rc7] Kernel warning (kernel/sched/fair.c:401) while running LTP tests
From: Vincent Guittot @ 2021-06-21 17:07 UTC (permalink / raw)
  To: Odin Ugedal; +Cc: Sachin Sant, Peter Zijlstra, linuxppc-dev, open list
In-Reply-To: <CAFpoUr1hyAeFSCvyJU4SKjtEKn6Hq1RLuu1eZPhRt1SaAq7=TQ@mail.gmail.com>

On Mon, 21 Jun 2021 at 18:45, Odin Ugedal <odin@uged.al> wrote:
>
> man. 21. jun. 2021 kl. 18:22 skrev Vincent Guittot <vincent.guittot@linaro.org>:
> > I would prefer that we use the reason of adding the cfs in the list instead.
> >
> > Something like the below should also fixed the problem. It is based on a
> > proposal I made to Rik sometimes ago when he tried to flatten the rq:
> > https://lore.kernel.org/lkml/20190906191237.27006-6-riel@surriel.com/
> >
> > This will ensure that a cfs is added in the list whenever one of its  child
> > is still in the list.
>
> Oh, yeah, that is a much more elegant solution! It fixes the issue as well!
>
> Feel free to add this when/if you submit it as a patch:
> Acked-by: Odin Ugedal <odin@uged.al>

Thanks

>
> Odin

^ permalink raw reply

* Re: [powerpc][5.13.0-rc7] Kernel warning (kernel/sched/fair.c:401) while running LTP tests
From: Odin Ugedal @ 2021-06-21 16:45 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Sachin Sant, linuxppc-dev, Peter Zijlstra, Odin Ugedal, open list
In-Reply-To: <20210621162243.GA29874@vingu-book>

man. 21. jun. 2021 kl. 18:22 skrev Vincent Guittot <vincent.guittot@linaro.org>:
> I would prefer that we use the reason of adding the cfs in the list instead.
>
> Something like the below should also fixed the problem. It is based on a
> proposal I made to Rik sometimes ago when he tried to flatten the rq:
> https://lore.kernel.org/lkml/20190906191237.27006-6-riel@surriel.com/
>
> This will ensure that a cfs is added in the list whenever one of its  child
> is still in the list.

Oh, yeah, that is a much more elegant solution! It fixes the issue as well!

Feel free to add this when/if you submit it as a patch:
Acked-by: Odin Ugedal <odin@uged.al>

Odin

^ permalink raw reply

* Re: [powerpc][5.13.0-rc7] Kernel warning (kernel/sched/fair.c:401) while running LTP tests
From: Vincent Guittot @ 2021-06-21 16:22 UTC (permalink / raw)
  To: Odin Ugedal; +Cc: Sachin Sant, Peter Zijlstra, linuxppc-dev, open list
In-Reply-To: <CAFpoUr3Wy9raHx+Dc0S8TB_Xi=E+Epsh_pA3DEFZP4eKf7s07A@mail.gmail.com>

Le lundi 21 juin 2021 à 14:42:23 (+0200), Odin Ugedal a écrit :
> Hi,
> 
> Did some more research, and it looks like this is what happens:
> 
> $ tree /sys/fs/cgroup/ltp/ -d --charset=ascii
> /sys/fs/cgroup/ltp/
> |-- drain
> `-- test-6851
>     `-- level2
>         |-- level3a
>         |   |-- worker1
>         |   `-- worker2
>         `-- level3b
>             `-- worker3
> 
> Timeline (ish):
> - worker3 gets throttled
> - level3b is decayed, since it has no more load
> - level2 get throttled
> - worker3 get unthrottled
> - level2 get unthrottled
>   - worker3 is added to list
>   - level3b is not added to list, since nr_running==0 and is decayed
> 
> 
> The attached diff (based on
> https://lore.kernel.org/lkml/20210518125202.78658-3-odin@uged.al/)
> fixes the issue for me. Not the most elegant solution, but the
> simplest one as of now, and to show what is wrong.
> 
> Any thoughts Vincent?


I would prefer that we use the reason of adding the cfs in the list instead.

Something like the below should also fixed the problem. It is based on a
proposal I made to Rik sometimes ago when he tried to flatten the rq:
https://lore.kernel.org/lkml/20190906191237.27006-6-riel@surriel.com/

This will ensure that a cfs is added in the list whenever one of its  child
is still in the list. 

---
 kernel/sched/fair.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea7de54cb022..e751061a9449 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3272,6 +3272,31 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)

 #ifdef CONFIG_SMP
 #ifdef CONFIG_FAIR_GROUP_SCHED
+/*
+ * Because list_add_leaf_cfs_rq always places a child cfs_rq on the list
+ * immediately before a parent cfs_rq, and cfs_rqs are removed from the list
+ * bottom-up, we only have to test whether the cfs_rq before us on the list
+ * is our child.
+ * If cfs_rq is not on the list, test wether a child needs its to be added to
+ * connect a branch to the tree  * (see list_add_leaf_cfs_rq() for details).
+ */
+static inline bool child_cfs_rq_on_list(struct cfs_rq *cfs_rq)
+{
+	struct cfs_rq *prev_cfs_rq;
+	struct list_head *prev;
+
+	if (cfs_rq->on_list) {
+		prev = cfs_rq->leaf_cfs_rq_list.prev;
+	} else {
+		struct rq *rq = rq_of(cfs_rq);
+
+		prev = rq->tmp_alone_branch;
+	}
+
+	prev_cfs_rq = container_of(prev, struct cfs_rq, leaf_cfs_rq_list);
+
+	return (prev_cfs_rq->tg->parent == cfs_rq->tg);
+}

 static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 {
@@ -3287,6 +3312,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (cfs_rq->avg.runnable_sum)
 		return false;

+	if (child_cfs_rq_on_list(cfs_rq))
+		return false;
+
 	return true;
 }

--
2.17.1



> 
> Thanks
> Odin
> 
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bfaa6e1f6067..aa32e9c29efd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -376,7 +376,8 @@ static inline bool list_add_leaf_cfs_rq(struct
> cfs_rq *cfs_rq)
>         return false;
>  }
> 
> -static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> +/* Returns 1 if cfs_rq was present in the list and removed */
> +static inline bool list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>  {
>         if (cfs_rq->on_list) {
>                 struct rq *rq = rq_of(cfs_rq);
> @@ -393,7 +394,9 @@ static inline void list_del_leaf_cfs_rq(struct
> cfs_rq *cfs_rq)
> 
>                 list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
>                 cfs_rq->on_list = 0;
> +               return 1;
>         }
> +       return 0;
>  }
> 
>  static inline void assert_list_leaf_cfs_rq(struct rq *rq)
> @@ -3298,24 +3301,6 @@ static inline void cfs_rq_util_change(struct
> cfs_rq *cfs_rq, int flags)
> 
>  #ifdef CONFIG_SMP
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -
> -static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> -{
> -       if (cfs_rq->load.weight)
> -               return false;
> -
> -       if (cfs_rq->avg.load_sum)
> -               return false;
> -
> -       if (cfs_rq->avg.util_sum)
> -               return false;
> -
> -       if (cfs_rq->avg.runnable_sum)
> -               return false;
> -
> -       return true;
> -}
> -
>  /**
>   * update_tg_load_avg - update the tg's load avg
>   * @cfs_rq: the cfs_rq whose avg changed
> @@ -4109,11 +4094,6 @@ static inline void update_misfit_status(struct
> task_struct *p, struct rq *rq)
> 
>  #else /* CONFIG_SMP */
> 
> -static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> -{
> -       return true;
> -}
> -
>  #define UPDATE_TG      0x0
>  #define SKIP_AGE_LOAD  0x0
>  #define DO_ATTACH      0x0
> @@ -4771,10 +4751,11 @@ static int tg_unthrottle_up(struct task_group
> *tg, void *data)
>         if (!cfs_rq->throttle_count) {
>                 cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
>                                              cfs_rq->throttled_clock_task;
> -
> -               /* Add cfs_rq with load or one or more already running
> entities to the list */
> -               if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
> +               if (cfs_rq->insert_on_unthrottle) {
>                         list_add_leaf_cfs_rq(cfs_rq);
> +                       if (tg->parent)
> +
> tg->parent->cfs_rq[cpu_of(rq)]->insert_on_unthrottle = true;
> +                       }
>         }
> 
>         return 0;
> @@ -4788,7 +4769,7 @@ static int tg_throttle_down(struct task_group
> *tg, void *data)
>         /* group is entering throttled state, stop time */
>         if (!cfs_rq->throttle_count) {
>                 cfs_rq->throttled_clock_task = rq_clock_task(rq);
> -               list_del_leaf_cfs_rq(cfs_rq);
> +               cfs_rq->insert_on_unthrottle = list_del_leaf_cfs_rq(cfs_rq);
>         }
>         cfs_rq->throttle_count++;
> 
> @@ -8019,6 +8000,23 @@ static bool __update_blocked_others(struct rq
> *rq, bool *done)
> 
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> 
> +static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> +{
> +       if (cfs_rq->load.weight)
> +               return false;
> +
> +       if (cfs_rq->avg.load_sum)
> +               return false;
> +
> +       if (cfs_rq->avg.util_sum)
> +               return false;
> +
> +       if (cfs_rq->avg.runnable_sum)
> +               return false;
> +
> +       return true;
> +}
> +
>  static bool __update_blocked_fair(struct rq *rq, bool *done)
>  {
>         struct cfs_rq *cfs_rq, *pos;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index a189bec13729..12a707d99ee6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -602,6 +602,7 @@ struct cfs_rq {
>         u64                     throttled_clock_task_time;
>         int                     throttled;
>         int                     throttle_count;
> +       int                     insert_on_unthrottle;
>         struct list_head        throttled_list;
>  #endif /* CONFIG_CFS_BANDWIDTH */
>  #endif /* CONFIG_FAIR_GROUP_SCHED */

^ permalink raw reply related

* Re: [PATCH 1/1] powerpc/pseries: Interface to represent PAPR firmware attributes
From: Pratik Sampat @ 2021-06-21 15:35 UTC (permalink / raw)
  To: Fabiano Rosas, mpe, benh, paulus, linuxppc-dev, kvm-ppc,
	linux-kernel, pratik.r.sampat
In-Reply-To: <87wnqqopp9.fsf@linux.ibm.com>

Hello Fabiano,

Thank you for your review.

On 19/06/21 2:28 am, Fabiano Rosas wrote:
> "Pratik R. Sampat" <psampat@linux.ibm.com> writes:
>
>> Adds a generic interface to represent the energy and frequency related
>> PAPR attributes on the system using the new H_CALL
>> "H_GET_ENERGY_SCALE_INFO".
>>
>> H_GET_EM_PARMS H_CALL was previously responsible for exporting this
>> information in the lparcfg, however the H_GET_EM_PARMS H_CALL
>> will be deprecated P10 onwards.
>>
>> The H_GET_ENERGY_SCALE_INFO H_CALL is of the following call format:
>> hcall(
>>    uint64 H_GET_ENERGY_SCALE_INFO,  // Get energy scale info
>>    uint64 flags,           // Per the flag request
>>    uint64 firstAttributeId,// The attribute id
>>    uint64 bufferAddress,   // Guest physical address of the output buffer
>>    uint64 bufferSize       // The size in bytes of the output buffer
>> );
>>
>> This H_CALL can query either all the attributes at once with
>> firstAttributeId = 0, flags = 0 as well as query only one attribute
>> at a time with firstAttributeId = id
>>
>> The output buffer consists of the following
>> 1. number of attributes              - 8 bytes
>> 2. array offset to the data location - 8 bytes
>> 3. version info                      - 1 byte
>> 4. A data array of size num attributes, which contains the following:
>>    a. attribute ID              - 8 bytes
>>    b. attribute value in number - 8 bytes
>>    c. attribute name in string  - 64 bytes
>>    d. attribute value in string - 64 bytes
>>
>> The new H_CALL exports information in direct string value format, hence
>> a new interface has been introduced in
>> /sys/firmware/papr/energy_scale_info to export this information to
>> userspace in an extensible pass-through format.
>>
>> The H_CALL returns the name, numeric value and string value (if exists)
>>
>> The format of exposing the sysfs information is as follows:
>> /sys/firmware/papr/energy_scale_info/
>>     |-- <id>/
>>       |-- desc
>>       |-- value
>>       |-- value_desc (if exists)
>>     |-- <id>/
>>       |-- desc
>>       |-- value
>>       |-- value_desc (if exists)
>> ...
>>
>> The energy information that is exported is useful for userspace tools
>> such as powerpc-utils. Currently these tools infer the
>> "power_mode_data" value in the lparcfg, which in turn is obtained from
>> the to be deprecated H_GET_EM_PARMS H_CALL.
>> On future platforms, such userspace utilities will have to look at the
>> data returned from the new H_CALL being populated in this new sysfs
>> interface and report this information directly without the need of
>> interpretation.
>>
>> Signed-off-by: Pratik R. Sampat <psampat@linux.ibm.com>
>> ---
>>   .../sysfs-firmware-papr-energy-scale-info     |  26 ++
>>   arch/powerpc/include/asm/hvcall.h             |  21 +-
>>   arch/powerpc/kvm/trace_hv.h                   |   1 +
>>   arch/powerpc/platforms/pseries/Makefile       |   3 +-
>>   .../pseries/papr_platform_attributes.c        | 292 ++++++++++++++++++
>>   5 files changed, 341 insertions(+), 2 deletions(-)
>>   create mode 100644 Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info
>>   create mode 100644 arch/powerpc/platforms/pseries/papr_platform_attributes.c
>>
>> diff --git a/Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info b/Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info
>> new file mode 100644
>> index 000000000000..499bc1ae173a
>> --- /dev/null
>> +++ b/Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info
>> @@ -0,0 +1,26 @@
>> +What:		/sys/firmware/papr/energy_scale_info
>> +Date:		June 2021
>> +Contact:	Linux for PowerPC mailing list <linuxppc-dev@ozlabs.org>
>> +Description:	Director hosting a set of platform attributes on Linux
>> +		running as a PAPR guest.
> This still refers to papr attributes. We want energy/frequency, etc. instead.

Sure, I can mention about energy, frequency info attributes here.

>> +
>> +		Each file in a directory contains a platform
>> +		attribute hierarchy pertaining to performance/
>> +		energy-savings mode and processor frequency.
>> +
>> +What:		/sys/firmware/papr/energy_scale_info/<id>
>> +		/sys/firmware/papr/energy_scale_info/<id>/desc
>> +		/sys/firmware/papr/energy_scale_info/<id>/value
>> +		/sys/firmware/papr/energy_scale_info/<id>/value_desc
>> +Date:		June 2021
>> +Contact:	Linux for PowerPC mailing list <linuxppc-dev@ozlabs.org>
>> +Description:	PAPR attributes directory for POWERVM servers
> Same here.

ack.

>> +
>> +		This directory provides PAPR information. It
> And here.

ack.

>
>> +		contains below sysfs attributes:
>> +
>> +		- desc: File contains the name of attribute <id>
> desc: String description of the attribute <id>

Sure, this description is more complete

>
>> +
>> +		- value: Numeric value of attribute <id>
>> +
>> +		- value_desc: String value of attribute <id>
>> diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
>> index e3b29eda8074..19a2a8c77a49 100644
>> --- a/arch/powerpc/include/asm/hvcall.h
>> +++ b/arch/powerpc/include/asm/hvcall.h
>> @@ -316,7 +316,8 @@
>>   #define H_SCM_PERFORMANCE_STATS 0x418
>>   #define H_RPT_INVALIDATE	0x448
>>   #define H_SCM_FLUSH		0x44C
>> -#define MAX_HCALL_OPCODE	H_SCM_FLUSH
>> +#define H_GET_ENERGY_SCALE_INFO	0x450
>> +#define MAX_HCALL_OPCODE	H_GET_ENERGY_SCALE_INFO
>>
>>   /* Scope args for H_SCM_UNBIND_ALL */
>>   #define H_UNBIND_SCOPE_ALL (0x1)
>> @@ -631,6 +632,24 @@ struct hv_gpci_request_buffer {
>>   	uint8_t bytes[HGPCI_MAX_DATA_BYTES];
>>   } __packed;
>>
>> +#define MAX_EM_ATTRS	10
>> +#define MAX_EM_DATA_BYTES \
>> +	(sizeof(struct energy_scale_attributes) * MAX_EM_ATTRS)
> EM doesn't really mean anything in the context of this code. Maybe we
> could standardize the names with 'energy_scale_info' and 'esi' for
> short.

Right, energy_scale_info (esi) is better naming convention and I can change
others to maintain consistency too.

>
>> +struct energy_scale_attributes {
> s/attributes/attribute/

ack, thanks for pointing out incorrect grammar.

>
>> +	__be64 attr_id;
>> +	__be64 attr_value;
>> +	unsigned char attr_desc[64];
>> +	unsigned char attr_value_desc[64];
> These attr_ prefixes could be dropped. I will be clear from the context
> where they are used.
>
>> +} __packed;
>> +
>> +struct hv_energy_scale_buffer {
>> +	__be64 num_attr;
> s/num_attr/num_attrs/

ack

>
>> +	__be64 array_offset;
>> +	__u8 data_header_version;
>> +	unsigned char data[MAX_EM_DATA_BYTES];
>> +} __packed;
> Your code is correct with the current assumptions, but I think we got
> the assumptions wrong, see if you agree:
>
> My understanding of the hypercall is that it is designed around a header
> + data concept. If the header is versioned and backwards compatible,
> then it means that it could increase in size. The offset to the data is
> what ensures backward compatibility so that we can always access the
> data, no matter what the header contains. So this leads me to conclude:
>
> 1- We should stop aborting in case the version changes. Nothing should
> really break if that happens. Both the kernel and userspace would read
> less data than the hypercall is returning, but that is it.
>
> With the current design, if the hypervisor changes the header version,
> the attributes stop being exposed in sysfs (because this code aborts),
> which will break the userspace setup.
>
> 2- We cannot have 'data' in the struct. There is no array there. The
> array is wherever the offset indicates. If the header increases in size,
> then the 'data' would move forward in the buffer.
>
> 3- The concept of MAX_EM_ATTRS is misleading. It is the maximum number
> of attributes that fit in the buffer *for this version of the hcall*. A
> subsequent version could fit fewer attributes and our MAX_EM_ATTRS would
> land us out of bounds. So the number of attributes must be defined
> exclusively by num_attrs.
>
> If we change the above, we only need to touch this code again if the
> header version changes *and* we want to expose the extra information
> brought by the change. An unexpected change in version should not cause
> this code to fail.

I think I have understand the approach you are describing here.
Basically you mean to say that the header can change in size in the future and
that is why header and data should be their own separate entities.

Although I'm not a 100% sure that the header isn't a fixed entity, I agree with
your proposed approach. It only makes the design more flexible and I don't see
any downsides to it.

Please correct me if I'm wrong, but this is what I'm thinking this means in
rough psuedo-code:

char *esi_buf;
struct h_energy_scale_info_hdr *esi_hdr;
struct energy_scale_attribute *esi_attrs;

plpar_hcall_norets(H_GET_ENERGY_SCALE_INFO, ESI_FLAGS_ALL, 0,
                    virt_to_phys(esi_buf), MAX_BUF_SZ);
esi_hdr = (struct h_energy_scale_info_hdr *) &esi_buf[0];
esi_attrs = (struct energy_scale_attribute *) &esi_buf[be64_to_cpu(esi_hdr->array_offset)];

where "h_energy_scale_info_hdr" is the header that you have defined below, minus the "data".
and "energy_scale_attribute" is pretty much the same minus the "attr" prefix on all the variables

> So what I suggest is we keep a 'header' struct:
>
> struct h_energy_scale_info_hdr {
>      __be64 num_attrs;
>      __be64 data_offset;
>      __u8 version;
> };
>
> and we define an arbitrary size for the attributes array and allocate
> that much more memory:
>
> /*
>   * Note: we allocate space for 10 attributes, but the HV can move the data
>   * array further in the buffer, so it could return fewer attributes.
>   */

Yes, that really depends where the attribute is, Currently there are only 7
attributes that this HCALL returns that's why I have made the max 10 so that
the offset and even then a few more attributes could be accommodated for.

> #define MAX_BUF_SZ (sizeof(struct h_energy_scale_info_hdr) + \
>                      sizeof(struct energy_scale_attributes) * 10)
> ...
> em_buf = kmalloc(MAX_BUF_SZ, GFP_KERNEL);
>
>
>
> The suggestions from here on apply even if my analysis above is wrong:
>
>> +
>> +
>>   #endif /* __ASSEMBLY__ */
>>   #endif /* __KERNEL__ */
>>   #endif /* _ASM_POWERPC_HVCALL_H */
>> diff --git a/arch/powerpc/kvm/trace_hv.h b/arch/powerpc/kvm/trace_hv.h
>> index 830a126e095d..38cd0ed0a617 100644
>> --- a/arch/powerpc/kvm/trace_hv.h
>> +++ b/arch/powerpc/kvm/trace_hv.h
>> @@ -115,6 +115,7 @@
>>   	{H_VASI_STATE,			"H_VASI_STATE"}, \
>>   	{H_ENABLE_CRQ,			"H_ENABLE_CRQ"}, \
>>   	{H_GET_EM_PARMS,		"H_GET_EM_PARMS"}, \
>> +	{H_GET_ENERGY_SCALE_INFO,	"H_GET_ENERGY_SCALE_INFO"}, \
>>   	{H_SET_MPP,			"H_SET_MPP"}, \
>>   	{H_GET_MPP,			"H_GET_MPP"}, \
>>   	{H_HOME_NODE_ASSOCIATIVITY,	"H_HOME_NODE_ASSOCIATIVITY"}, \
>> diff --git a/arch/powerpc/platforms/pseries/Makefile b/arch/powerpc/platforms/pseries/Makefile
>> index c8a2b0b05ac0..d14fca89ac25 100644
>> --- a/arch/powerpc/platforms/pseries/Makefile
>> +++ b/arch/powerpc/platforms/pseries/Makefile
>> @@ -6,7 +6,8 @@ obj-y			:= lpar.o hvCall.o nvram.o reconfig.o \
>>   			   of_helpers.o \
>>   			   setup.o iommu.o event_sources.o ras.o \
>>   			   firmware.o power.o dlpar.o mobility.o rng.o \
>> -			   pci.o pci_dlpar.o eeh_pseries.o msi.o
>> +			   pci.o pci_dlpar.o eeh_pseries.o msi.o \
>> +			   papr_platform_attributes.o
>>   obj-$(CONFIG_SMP)	+= smp.o
>>   obj-$(CONFIG_SCANLOG)	+= scanlog.o
>>   obj-$(CONFIG_KEXEC_CORE)	+= kexec.o
>> diff --git a/arch/powerpc/platforms/pseries/papr_platform_attributes.c b/arch/powerpc/platforms/pseries/papr_platform_attributes.c
>> new file mode 100644
>> index 000000000000..498c74a5e9ab
>> --- /dev/null
>> +++ b/arch/powerpc/platforms/pseries/papr_platform_attributes.c
>> @@ -0,0 +1,292 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * PAPR platform energy attributes driver
>> + *
>> + * This driver creates a sys file at /sys/firmware/papr/ which contains
>> + * files keyword - value pairs that specify energy configuration of the system.
> This description needs updating.

ack, I'll update this

>> + *
>> + * Copyright 2021 IBM Corp.
>> + */
>> +
>> +#include <linux/module.h>
>> +#include <linux/types.h>
>> +#include <linux/errno.h>
>> +#include <linux/init.h>
>> +#include <linux/seq_file.h>
>> +#include <linux/slab.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/hugetlb.h>
>> +#include <asm/lppaca.h>
>> +#include <asm/hvcall.h>
>> +#include <asm/firmware.h>
>> +#include <asm/time.h>
>> +#include <asm/prom.h>
>> +#include <asm/vdso_datapage.h>
>> +#include <asm/vio.h>
>> +#include <asm/mmu.h>
>> +#include <asm/machdep.h>
>> +#include <asm/drmem.h>
>> +
>> +#include "pseries.h"
>> +
>> +#define MAX_ATTRS	3
> It might be good to link this with ops_info size somehow to make sure we
> don't update one without the other.

Sure, I could define it something as follows to determine this relationship
visually.
ops_info[MAX_ATTRS] = { ... }

>
>> +#define MAX_NAME_LEN	16
>> +
>> +struct papr_attr {
>> +	u64 id;
>> +	struct kobj_attribute attr;
> We have attr everywhere. I would use kobj_attr here to improve
> readability.
>
>> +};
>> +struct papr_group {
>> +	char name[MAX_NAME_LEN];
>> +	struct attribute_group pg;
>> +	struct papr_attr *pgattrs;
>> +} *pgs;
>> +
>> +struct kobject *papr_kobj;
>> +struct kobject *escale_kobj;
> Nitpick: esi_kobj

ack, I'll make the "energy_scale_info (esi)" naming convention coherent everywhere.

>
>> +struct hv_energy_scale_buffer *em_buf;
> Could replace the "em" here.
>
>> +struct energy_scale_attributes *ea;
> Nitpick: esi_attrs.

ack.

>
>> +
>> +static ssize_t papr_show_desc(struct kobject *kobj,
>> +			       struct kobj_attribute *attr,
>> +			       char *buf)
>> +{
>> +	struct papr_attr *pattr = container_of(attr, struct papr_attr, attr);
>> +	int idx, ret = 0;
>> +
>> +	/*
>> +	 * We do not expect the name to change, hence use the old value
>> +	 * and save a HCALL
>> +	 */
>> +	for (idx = 0; idx < be64_to_cpu(em_buf->num_attr); idx++) {
>> +		if (pattr->id == be64_to_cpu(ea[idx].attr_id)) {
>> +			ret = sprintf(buf, "%s\n", ea[idx].attr_desc);
>> +			if (ret < 0)
>> +				ret = -EIO;
>> +			break;
>> +		}
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static ssize_t papr_show_value(struct kobject *kobj,
>> +				struct kobj_attribute *attr,
>> +				char *buf)
>> +{
>> +	struct papr_attr *pattr = container_of(attr, struct papr_attr, attr);
>> +	struct hv_energy_scale_buffer *t_buf;
>> +	struct energy_scale_attributes *t_ea;
>> +	int data_offset, ret = 0;
>> +
>> +	t_buf = kmalloc(sizeof(*t_buf), GFP_KERNEL);
>> +	if (t_buf == NULL)
>> +		return -ENOMEM;
>> +
>> +	ret = plpar_hcall_norets(H_GET_ENERGY_SCALE_INFO, 0,
>> +				 pattr->id, virt_to_phys(t_buf),
>> +				 sizeof(*t_buf));
>> +
>> +	if (ret != H_SUCCESS) {
>> +		pr_warn("hcall failed: H_GET_ENERGY_SCALE_INFO");
>> +		goto out;
>> +	}
>> +
>> +	data_offset = be64_to_cpu(t_buf->array_offset) -
>> +			(sizeof(t_buf->num_attr) +
>> +			sizeof(t_buf->array_offset) +
>> +			sizeof(t_buf->data_header_version));
>> +
>> +	t_ea = (struct energy_scale_attributes *) &t_buf->data[data_offset];
> t_ea = (struct energy_scale_attributes *)(t_buf + be64_to_cpu(t_buf->array_offset));
>
> Right? If array_offset "points" to data, then data_offset would always
> be 0. So there is no point doing this arithmetic.

Yes, the offset now can directly used as it is a different structure altogether
we don't need to account for the header as that is now in the original parsed
buffer too.

>
>> +
>> +	ret = sprintf(buf, "%llu\n", be64_to_cpu(t_ea->attr_value));
>> +	if (ret < 0)
>> +		ret = -EIO;
>> +out:
>> +	kfree(t_buf);
>> +
>> +	return ret;
>> +}
>> +
>> +static ssize_t papr_show_value_desc(struct kobject *kobj,
>> +				     struct kobj_attribute *attr,
>> +				     char *buf)
>> +{
>> +	struct papr_attr *pattr = container_of(attr, struct papr_attr, attr);
>> +	struct hv_energy_scale_buffer *t_buf;
>> +	struct energy_scale_attributes *t_ea;
>> +	int data_offset, ret = 0;
>> +
>> +	t_buf = kmalloc(sizeof(*t_buf), GFP_KERNEL);
>> +	if (t_buf == NULL)
>> +		return -ENOMEM;
>> +
>> +	ret = plpar_hcall_norets(H_GET_ENERGY_SCALE_INFO, 0,
>> +				 pattr->id, virt_to_phys(t_buf),
>> +				 sizeof(*t_buf));
>> +
>> +	if (ret != H_SUCCESS) {
>> +		pr_warn("hcall failed: H_GET_ENERGY_SCALE_INFO");
>> +		goto out;
>> +	}
>> +
>> +	data_offset = be64_to_cpu(t_buf->array_offset) -
>> +			(sizeof(t_buf->num_attr) +
>> +			sizeof(t_buf->array_offset) +
>> +			sizeof(t_buf->data_header_version));
>> +
>> +	t_ea = (struct energy_scale_attributes *) &t_buf->data[data_offset];
>> +
>> +	ret = sprintf(buf, "%s\n", t_ea->attr_value_desc);
>> +	if (ret < 0)
>> +		ret = -EIO;
>> +out:
>> +	kfree(t_buf);
>> +
>> +	return ret;
>> +}
>> +
>> +static struct papr_ops_info {
>> +	const char *attr_name;
>> +	ssize_t (*show)(struct kobject *kobj, struct kobj_attribute *attr,
>> +			char *buf);
>> +} ops_info[] = {
>> +	{ "desc", papr_show_desc },
>> +	{ "value", papr_show_value },
>> +	{ "value_desc", papr_show_value_desc },
>> +};
>> +
>> +static void add_attr(u64 id, int index, struct papr_attr *attr)
>> +{
>> +	attr->id = id;
>> +	sysfs_attr_init(&attr->attr.attr);
>> +	attr->attr.attr.name = ops_info[index].attr_name;
>> +	attr->attr.attr.mode = 0444;
>> +	attr->attr.show = ops_info[index].show;
>> +}
>> +
>> +static int add_attr_group(u64 id, int len, struct papr_group *pg,
>> +			  bool show_val_desc)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < len; i++) {
>> +		if (!strcmp(ops_info[i].attr_name, "value_desc") &&
>> +		    !show_val_desc) {
>> +			continue;
>> +		}
>> +		add_attr(id, i, &pg->pgattrs[i]);
>> +		pg->pg.attrs[i] = &pg->pgattrs[i].attr.attr;
>> +	}
>> +
>> +	return sysfs_create_group(escale_kobj, &pg->pg);
>> +}
>> +
>> +
>> +static int __init papr_init(void)
>> +{
>> +	uint64_t num_attr;
>> +	int ret, idx, i, data_offset;
>> +
>> +	em_buf = kmalloc(sizeof(*em_buf), GFP_KERNEL);
>> +	if (em_buf == NULL)
>> +		return -ENOMEM;
>> +	/*
>> +	 * hcall(
>> +	 * uint64 H_GET_ENERGY_SCALE_INFO,  // Get energy scale info
>> +	 * uint64 flags,            // Per the flag request
>> +	 * uint64 firstAttributeId, // The attribute id
>> +	 * uint64 bufferAddress,    // Guest physical address of the output buffer
>> +	 * uint64 bufferSize);      // The size in bytes of the output buffer
>> +	 */
> Since the flags are well defined, we could have:
> #define ESI_FLAGS_ALL PPC_BIT(0)
> #define ESI_FLAGS_SINGLE PPC_BIT(1)

Sure, I can add this macro for abstraction.
I understand this will make things clear looking at the call itself rather then
deciphering the document

> I assume the bits are IBM-order. But you get my point...
>
>> +	ret = plpar_hcall_norets(H_GET_ENERGY_SCALE_INFO, 0, 0,
>> +				 virt_to_phys(em_buf), sizeof(*em_buf));
>> +
>> +	if (!firmware_has_feature(FW_FEATURE_LPAR) || ret != H_SUCCESS ||
>> +	    em_buf->data_header_version != 0x1) {
>> +		pr_warn("hcall failed: H_GET_ENERGY_SCALE_INFO");
>> +		goto out;
>> +	}
> I'd make the FW_FEATURE_LPAR check an early return at the start of the
> function.

Right, makes sense. No point doing the HCALL if we don't have the right
firmware feature. Thanks

>
>> +
>> +	num_attr = be64_to_cpu(em_buf->num_attr);
>> +
>> +	/*
>> +	 * Typecast the energy buffer to the attribute structure at the offset
>> +	 * specified in the buffer
>> +	 */
>> +	data_offset = be64_to_cpu(em_buf->array_offset) -
>> +			(sizeof(em_buf->num_attr) +
>> +			sizeof(em_buf->array_offset) +
>> +			sizeof(em_buf->data_header_version));
>> +
>> +	ea = (struct energy_scale_attributes *) &em_buf->data[data_offset];
>> +
>> +	pgs = kcalloc(num_attr, sizeof(*pgs), GFP_KERNEL);
>> +	if (!pgs)
>> +		goto out_pgs;
>> +
>> +	papr_kobj = kobject_create_and_add("papr", firmware_kobj);
>> +	if (!papr_kobj) {
>> +		pr_warn("kobject_create_and_add papr failed\n");
>> +		goto out_kobj;
>> +	}
>> +
>> +	escale_kobj = kobject_create_and_add("energy_scale_info", papr_kobj);
>> +	if (!escale_kobj) {
>> +		pr_warn("kobject_create_and_add energy_scale_info failed\n");
>> +		goto out_ekobj;
>> +	}
>> +
>> +	for (idx = 0; idx < num_attr; idx++) {
>> +		char buf[4];
>> +		bool show_val_desc = true;
>> +
>> +		pgs[idx].pgattrs = kcalloc(MAX_ATTRS,
>> +					   sizeof(*pgs[idx].pgattrs),
>> +					   GFP_KERNEL);
>> +		if (!pgs[idx].pgattrs)
>> +			goto out_kobj;
>> +
>> +		pgs[idx].pg.attrs = kcalloc(MAX_ATTRS + 1,
>> +					    sizeof(*pgs[idx].pg.attrs),
>> +					    GFP_KERNEL);
>> +		if (!pgs[idx].pg.attrs) {
>> +			kfree(pgs[idx].pgattrs);
>> +			goto out_kobj;
>> +		}
>> +
>> +		sprintf(buf, "%lld", be64_to_cpu(ea[idx].attr_id));
>> +		pgs[idx].pg.name = buf;
>> +
>> +		/* Do not add the value description if it does not exist */
>> +		if (strlen(ea[idx].attr_value_desc) == 0)
>> +			show_val_desc = false;
>> +
>> +		if (add_attr_group(be64_to_cpu(ea[idx].attr_id),
>> +				   MAX_ATTRS, &pgs[idx], show_val_desc)) {
>> +			pr_warn("Failed to create papr attribute group %s\n",
>> +				pgs[idx].pg.name);
>> +			goto out_pgattrs;
>> +		}
>> +	}
>> +
>> +	return 0;
>> +
>> +out_pgattrs:
>> +	for (i = 0; i < MAX_ATTRS; i++) {
>> +		kfree(pgs[i].pgattrs);
>> +		kfree(pgs[i].pg.attrs);
>> +	}
>> +out_ekobj:
>> +	kobject_put(escale_kobj);
>> +out_kobj:
>> +	kobject_put(papr_kobj);
>> +out_pgs:
>> +	kfree(pgs);
>> +out:
>> +	kfree(em_buf);
>> +
>> +	return -ENOMEM;
>> +}
>> +
>> +machine_device_initcall(pseries, papr_init);

Regards,
Pratik


^ permalink raw reply

* Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()
From: Christophe Leroy @ 2021-06-21 14:11 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: maged michael, Peter Zijlstra, Dave Watson, Will Deacon,
	Russell King, ARM Linux, David Sehr, Paul Mackerras,
	H. Peter Anvin, linux-arch, x86, Andrew Hunter, Greg Hackmann,
	Ingo Molnar, Alan Stern, Paul, Andrea Parri, Avi Kivity,
	Boqun Feng, linuxppc-dev, Nicholas Piggin, Mathieu Desnoyers,
	Alexander Viro, Andy Lutomirski, Thomas Gleixner, linux-api,
	linux-kernel, Linus Torvalds
In-Reply-To: <20210619150202.GZ5077@gate.crashing.org>



Le 19/06/2021 à 17:02, Segher Boessenkool a écrit :
> On Sat, Jun 19, 2021 at 11:35:34AM +0200, Christophe Leroy wrote:
>>
>>
>> Le 18/06/2021 à 19:26, Mathieu Desnoyers a écrit :
>>> ----- On Jun 18, 2021, at 1:13 PM, Christophe Leroy
>>> christophe.leroy@csgroup.eu wrote:
>>> [...]
>>>>
>>>> I don't understand all that complexity to just replace a simple
>>>> 'smp_mb__after_unlock_lock()'.
>>>>
>>>> #define smp_mb__after_unlock_lock()	smp_mb()
>>>> #define smp_mb()	barrier()
>>>> # define barrier() __asm__ __volatile__("": : :"memory")
>>>>
>>>>
>>>> Am I missing some subtility ?
>>>
>>> On powerpc CONFIG_SMP, smp_mb() is actually defined as:
>>>
>>> #define smp_mb()        __smp_mb()
>>> #define __smp_mb()      mb()
>>> #define mb()   __asm__ __volatile__ ("sync" : : : "memory")
>>>
>>> So the original motivation here was to skip a "sync" instruction whenever
>>> switching between threads which are part of the same process. But based on
>>> recent discussions, I suspect my implementation may be inaccurately doing
>>> so though.
>>>
>>
>> I see.
>>
>> Then, if you think a 'sync' is a concern, shouldn't we try and remove the
>> forest of 'sync' in the I/O accessors ?
>>
>> I can't really understand why we need all those 'sync' and 'isync' and
>> 'twi' around the accesses whereas I/O memory is usually mapped as 'Guarded'
>> so memory access ordering is already garantied.
>>
>> I'm sure we'll save a lot with that.
> 
> The point of the twi in the I/O accessors was to make things easier to
> debug if the accesses fail: for the twi insn to complete the load will
> have to have completed as well.  On a correctly working system you never
> should need this (until something fails ;-) )
> 
> Without the twi you might need to enforce ordering in some cases still.
> The twi is a very heavy hammer, but some of that that gives us is no
> doubt actually needed.
> 

Well, I've always been quite perplex about that. According to the documentation of the 8xx, if a bus 
error or something happens on an I/O access, the exception will be accounted on the instruction 
which does the access. But based on the following function, I understand that some version of 
powerpc do generate the trap on the instruction which was being executed at the time the I/O access 
failed, not the instruction that does the access itself ?

/*
  * I/O accesses can cause machine checks on powermacs.
  * Check if the NIP corresponds to the address of a sync
  * instruction for which there is an entry in the exception
  * table.
  *  -- paulus.
  */
static inline int check_io_access(struct pt_regs *regs)
{
#ifdef CONFIG_PPC32
	unsigned long msr = regs->msr;
	const struct exception_table_entry *entry;
	unsigned int *nip = (unsigned int *)regs->nip;

	if (((msr & 0xffff0000) == 0 || (msr & (0x80000 | 0x40000)))
	    && (entry = search_exception_tables(regs->nip)) != NULL) {
		/*
		 * Check that it's a sync instruction, or somewhere
		 * in the twi; isync; nop sequence that inb/inw/inl uses.
		 * As the address is in the exception table
		 * we should be able to read the instr there.
		 * For the debug message, we look at the preceding
		 * load or store.
		 */
		if (*nip == PPC_INST_NOP)
			nip -= 2;
		else if (*nip == PPC_INST_ISYNC)
			--nip;
		if (*nip == PPC_INST_SYNC || (*nip >> 26) == OP_TRAP) {
			unsigned int rb;

			--nip;
			rb = (*nip >> 11) & 0x1f;
			printk(KERN_DEBUG "%s bad port %lx at %p\n",
			       (*nip & 0x100)? "OUT to": "IN from",
			       regs->gpr[rb] - _IO_BASE, nip);
			regs->msr |= MSR_RI;
			regs->nip = extable_fixup(entry);
			return 1;
		}
	}
#endif /* CONFIG_PPC32 */
	return 0;
}

Am I right ?

It is not only the twi which bother's me in the I/O accessors but also the sync/isync and stuff.

A write typically is

	sync
	stw

A read is

	sync
	lwz
	twi
	isync

Taking into account that HW ordering is garanteed by the fact that __iomem is guarded, isn't the 
'memory' clobber enough as a barrier ?

Thanks
Christophe

^ permalink raw reply

* Re: [PATCH v2 2/9] powerpc: Add Microwatt device tree
From: Segher Boessenkool @ 2021-06-21 13:54 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev
In-Reply-To: <YM8v2ricaCzGi2vv@thinks.paulus.ozlabs.org>

On Sun, Jun 20, 2021 at 10:08:58PM +1000, Paul Mackerras wrote:
> On Sat, Jun 19, 2021 at 09:26:16AM -0500, Segher Boessenkool wrote:
> > On Fri, Jun 18, 2021 at 01:44:16PM +1000, Paul Mackerras wrote:
> > > Microwatt currently runs with MSR[HV] = 0,
> > 
> > That isn't compliant though?  If your implementation does not have LPAR
> > it must set MSR[HV]=1 always.
> 
> True - but if I actually do that, Linux starts trying to use hrfid
> (for example in masked_Hinterrupt), which Microwatt doesn't have.
> Something for Nick to fix. :)

That looks like it needs fixing, yes (it is hard to actually read).  But
one thing you can do to make this Just Work is to make hrfid do exactly
the same as rfid, i.e. decode hrfid (01000 10010) as rfid (00000 10010).
That probably makes things run already, you don't even need to alias
to SPRs HSRRn (01001 1101n) to SRRn (00000 1101n) :-)

Segher

^ permalink raw reply

* Re: [PATCH 1/2] powerpc/prom_init: Convert prom_strcpy() into prom_strscpy_pad()
From: Daniel Axtens @ 2021-06-21 12:57 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev
In-Reply-To: <20210621064938.2021419-1-mpe@ellerman.id.au>

Hi

> -static char __init *prom_strcpy(char *dest, const char *src)
> +static ssize_t __init prom_strscpy_pad(char *dest, const char *src, size_t n)
>  {
> -	char *tmp = dest;
> +	ssize_t rc;
> +	size_t i;
>  
> -	while ((*dest++ = *src++) != '\0')
> -		/* nothing */;
> -	return tmp;
> +	if (n == 0 || n > INT_MAX)
> +		return -E2BIG;
> +
> +	// Copy up to n bytes
> +	for (i = 0; i < n && src[i] != '\0'; i++)
> +		dest[i] = src[i];
> +
> +	rc = i;
> +
> +	// If we copied all n then we have run out of space for the nul
> +	if (rc == n) {
> +		// Rewind by one character to ensure nul termination
> +		i--;
> +		rc = -E2BIG;
> +	}
> +
> +	for (; i < n; i++)
> +		dest[i] = '\0';
> +
> +	return rc;
>  }
>  

This implementation seems good to me.

I copied it into a new C file and added the following:

int main() {
	char longstr[255]="abcdefghijklmnopqrstuvwxyz";
	char shortstr[5];
	assert(prom_strscpy_pad(longstr, "", 0) == -E2BIG);
	assert(prom_strscpy_pad(longstr, "hello", 255) == 5);
	assert(prom_strscpy_pad(shortstr, "hello", 5) == -E2BIG);
	assert(memcmp(shortstr, "hell", 5) == 0);
	assert(memcmp(longstr, "hello\0\0\0\0\0\0\0\0\0", 6) == 0);
	return 0;
}

All the assertions pass. I believe this covers all the conditions from
the strscpy_pad docstring.

Reviewed-by: Daniel Axtens <dja@axtens.net>

Kind regards,
Daniel

>  static int __init prom_strncmp(const char *cs, const char *ct, size_t count)
> @@ -2701,7 +2719,7 @@ static void __init flatten_device_tree(void)
>  
>  	/* Add "phandle" in there, we'll need it */
>  	namep = make_room(&mem_start, &mem_end, 16, 1);
> -	prom_strcpy(namep, "phandle");
> +	prom_strscpy_pad(namep, "phandle", sizeof("phandle"));
>  	mem_start = (unsigned long)namep + prom_strlen(namep) + 1;
>  
>  	/* Build string array */
> -- 
> 2.25.1

^ permalink raw reply

* Re: [powerpc][5.13.0-rc7] Kernel warning (kernel/sched/fair.c:401) while running LTP tests
From: Odin Ugedal @ 2021-06-21 12:42 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: Sachin Sant, linuxppc-dev, Peter Zijlstra, Vincent Guittot,
	open list
In-Reply-To: <CAFpoUr0iWFTq2grtnX_EH6KnZLZQCg1o6_yv1gfDK8WdbHmUCA@mail.gmail.com>

Hi,

Did some more research, and it looks like this is what happens:

$ tree /sys/fs/cgroup/ltp/ -d --charset=ascii
/sys/fs/cgroup/ltp/
|-- drain
`-- test-6851
    `-- level2
        |-- level3a
        |   |-- worker1
        |   `-- worker2
        `-- level3b
            `-- worker3

Timeline (ish):
- worker3 gets throttled
- level3b is decayed, since it has no more load
- level2 get throttled
- worker3 get unthrottled
- level2 get unthrottled
  - worker3 is added to list
  - level3b is not added to list, since nr_running==0 and is decayed


The attached diff (based on
https://lore.kernel.org/lkml/20210518125202.78658-3-odin@uged.al/)
fixes the issue for me. Not the most elegant solution, but the
simplest one as of now, and to show what is wrong.

Any thoughts Vincent?

Thanks
Odin


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bfaa6e1f6067..aa32e9c29efd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -376,7 +376,8 @@ static inline bool list_add_leaf_cfs_rq(struct
cfs_rq *cfs_rq)
        return false;
 }

-static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+/* Returns 1 if cfs_rq was present in the list and removed */
+static inline bool list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
        if (cfs_rq->on_list) {
                struct rq *rq = rq_of(cfs_rq);
@@ -393,7 +394,9 @@ static inline void list_del_leaf_cfs_rq(struct
cfs_rq *cfs_rq)

                list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
                cfs_rq->on_list = 0;
+               return 1;
        }
+       return 0;
 }

 static inline void assert_list_leaf_cfs_rq(struct rq *rq)
@@ -3298,24 +3301,6 @@ static inline void cfs_rq_util_change(struct
cfs_rq *cfs_rq, int flags)

 #ifdef CONFIG_SMP
 #ifdef CONFIG_FAIR_GROUP_SCHED
-
-static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
-{
-       if (cfs_rq->load.weight)
-               return false;
-
-       if (cfs_rq->avg.load_sum)
-               return false;
-
-       if (cfs_rq->avg.util_sum)
-               return false;
-
-       if (cfs_rq->avg.runnable_sum)
-               return false;
-
-       return true;
-}
-
 /**
  * update_tg_load_avg - update the tg's load avg
  * @cfs_rq: the cfs_rq whose avg changed
@@ -4109,11 +4094,6 @@ static inline void update_misfit_status(struct
task_struct *p, struct rq *rq)

 #else /* CONFIG_SMP */

-static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
-{
-       return true;
-}
-
 #define UPDATE_TG      0x0
 #define SKIP_AGE_LOAD  0x0
 #define DO_ATTACH      0x0
@@ -4771,10 +4751,11 @@ static int tg_unthrottle_up(struct task_group
*tg, void *data)
        if (!cfs_rq->throttle_count) {
                cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
                                             cfs_rq->throttled_clock_task;
-
-               /* Add cfs_rq with load or one or more already running
entities to the list */
-               if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
+               if (cfs_rq->insert_on_unthrottle) {
                        list_add_leaf_cfs_rq(cfs_rq);
+                       if (tg->parent)
+
tg->parent->cfs_rq[cpu_of(rq)]->insert_on_unthrottle = true;
+                       }
        }

        return 0;
@@ -4788,7 +4769,7 @@ static int tg_throttle_down(struct task_group
*tg, void *data)
        /* group is entering throttled state, stop time */
        if (!cfs_rq->throttle_count) {
                cfs_rq->throttled_clock_task = rq_clock_task(rq);
-               list_del_leaf_cfs_rq(cfs_rq);
+               cfs_rq->insert_on_unthrottle = list_del_leaf_cfs_rq(cfs_rq);
        }
        cfs_rq->throttle_count++;

@@ -8019,6 +8000,23 @@ static bool __update_blocked_others(struct rq
*rq, bool *done)

 #ifdef CONFIG_FAIR_GROUP_SCHED

+static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
+{
+       if (cfs_rq->load.weight)
+               return false;
+
+       if (cfs_rq->avg.load_sum)
+               return false;
+
+       if (cfs_rq->avg.util_sum)
+               return false;
+
+       if (cfs_rq->avg.runnable_sum)
+               return false;
+
+       return true;
+}
+
 static bool __update_blocked_fair(struct rq *rq, bool *done)
 {
        struct cfs_rq *cfs_rq, *pos;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a189bec13729..12a707d99ee6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -602,6 +602,7 @@ struct cfs_rq {
        u64                     throttled_clock_task_time;
        int                     throttled;
        int                     throttle_count;
+       int                     insert_on_unthrottle;
        struct list_head        throttled_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */

^ permalink raw reply related

* Re: arch/powerpc/kvm/book3s_hv_nested.c:264:6: error: stack frame size of 2304 bytes in function 'kvmhv_enter_nested_guest'
From: Arnd Bergmann @ 2021-06-21 11:53 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: kbuild-all, kernel test robot, Linux Memory Management List,
	Linux Kernel Mailing List, Nicholas Piggin, Nathan Chancellor,
	clang-built-linux, kvm-ppc, Andrew Morton, linuxppc-dev,
	Kees Cook
In-Reply-To: <87im273604.fsf@mpe.ellerman.id.au>

On Mon, Jun 21, 2021 at 11:46 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
> Nathan Chancellor <nathan@kernel.org> writes:
> > On 6/20/2021 4:59 PM, Nicholas Piggin wrote:
> >> Excerpts from kernel test robot's message of April 3, 2021 8:47 pm:
> >>>
> >>> vim +/kvmhv_enter_nested_guest +264 arch/powerpc/kvm/book3s_hv_nested.c
> >>
> >> Not much changed here recently. It's not that big a concern because it's
> >> only called in the KVM ioctl path, not in any deep IO paths or anything,
> >> and doesn't recurse. Might be a bit of inlining or stack spilling put it
> >> over the edge.
> >
> > It appears to be the fact that LLVM's PowerPC backend does not emit
> > efficient byteswap assembly:
> >
> > https://github.com/ClangBuiltLinux/linux/issues/1292
> >
> > https://bugs.llvm.org/show_bug.cgi?id=49610
> >
> >> powerpc does make it an error though, would be good to avoid that so the
> >> robot doesn't keep tripping over.
> >
> > Marking byteswap_pt_regs as 'noinline_for_stack' drastically reduces the
> > stack usage. If that is an acceptable solution, I can send it along
> > tomorrow.
>
> Yeah that should be OK.

That's fine with me as well.

> Can you post the before/after disassembly when
> you post the patch?
>
> It should just be two extra function calls, which shouldn't be enough
> overhead to be measurable.

The thing I remember is that the 'before' code here is some seriously bad
output from llvm, and it would be helpful to have someone get the compiler
to emit the correct powerpc byteswap instructions and avoid the excessive
stack spilling.

The warning here is just a symptom of a missed optimization and the
same thing probably happens elsewhere on powerpc, even if it doesn't
exceed the stack warning limit.

        Arnd

^ permalink raw reply

* Re: [powerpc][5.13.0-rc7] Kernel warning (kernel/sched/fair.c:401) while running LTP tests
From: Odin Ugedal @ 2021-06-21 11:04 UTC (permalink / raw)
  To: Sachin Sant
  Cc: Peter Zijlstra, Vincent Guittot, linuxppc-dev, Odin Ugedal,
	open list
In-Reply-To: <6D1F875D-58E9-4A55-B0C3-21D5F31EDB76@linux.vnet.ibm.com>

man. 21. jun. 2021 kl. 12:57 skrev Sachin Sant <sachinp@linux.vnet.ibm.com>:
>
> Unfortunately this does not help. I can still recreate the failure.
>
> Have attached the o/p from test run.
>
> Thanks
> -Sachin

Yes, thanks!

I am able to reproduce it locally now, so will keep looking to see if
I find the cause. Thanks!

Odin

^ permalink raw reply

* Re: [powerpc][5.13.0-rc7] Kernel warning (kernel/sched/fair.c:401) while running LTP tests
From: Sachin Sant @ 2021-06-21 10:57 UTC (permalink / raw)
  To: Odin Ugedal; +Cc: Peter Zijlstra, linuxppc-dev, Vincent Guittot, open list
In-Reply-To: <CAFpoUr2o2PVPOx+AvatjjUvqPTyNKE3C6oXejyU3HVMmtCnzvQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1475 bytes --]



> On 21-Jun-2021, at 3:24 PM, Odin Ugedal <odin@uged.al> wrote:
> 
> man. 21. jun. 2021 kl. 11:50 skrev Vincent Guittot <vincent.guittot@linaro.org>:
>> This means that a child's load was not null and it was inserted
>> whereas parent's load was null. This should not happen unless the
>> propagation failed somewhere
> 
> My initial thought is that the patch below will fix it, if that is the
> issue (that a leaf is inserted, but the propagation is not "completed"
> in unthrottle). Might that be the case? Still working on reproducing
> the issue tho.
> 

Unfortunately this does not help. I can still recreate the failure.

Have attached the o/p from test run.

Thanks
-Sachin
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bfaa6e1f6067..015c5a5c1a4d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4930,12 +4930,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>                if (cfs_rq_throttled(cfs_rq))
>                        goto unthrottle_throttle;
> 
> -               /*
> -                * One parent has been throttled and cfs_rq removed from the
> -                * list. Add it back to not break the leaf list.
> -                */
> -               if (throttled_hierarchy(cfs_rq))
> -                       list_add_leaf_cfs_rq(cfs_rq);
> +               list_add_leaf_cfs_rq(cfs_rq);
>        }
> 
>        /* At this point se is NULL and we are at root level*/

[-- Attachment #2: run.txt --]
[-- Type: text/plain, Size: 5966 bytes --]

# cd /opt/ltp/testcases/bin
# ./cfs_bandwidth01 -i 5
tst_test.c:1313: TINFO: Timeout per run is 0h 05m 00s
tst_buffers.c:55: TINFO: Test is using guarded buffers
cfs_bandwidth01.c:49: TINFO: Set 'worker1/cpu.max' = '3000 10000'
cfs_bandwidth01.c:49: TINFO: Set 'worker2/cpu.max' = '2000 10000'
cfs_bandwidth01.c:49: TINFO: Set 'worker3/cpu.max' = '3000 10000'
cfs_bandwidth01.c:113: TPASS: Scheduled bandwidth constrained workers
cfs_bandwidth01.c:49: TINFO: Set 'level2/cpu.max' = '5000 10000'
cfs_bandwidth01.c:125: TPASS: Workers exited
cfs_bandwidth01.c:113: TPASS: Scheduled bandwidth constrained workers
[   48.343143] ------------[ cut here ]------------
[   48.343164] rq->tmp_alone_branch != &rq->leaf_cfs_rq_list
[   48.343172] WARNING: CPU: 24 PID: 4405 at kernel/sched/fair.c:401 unthrottle_cfs_rq+0x49c/0x560
[   48.343196] Modules linked in: nf_tables nfnetlink tun bridge stp llc rfkill sunrpc pseries_rng xts vmx_crypto uio_pdrv_genirq uio sch_fq_codel ip_tables xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod fuse
[   48.343251] CPU: 24 PID: 4405 Comm: cfs_bandwidth01 Not tainted 5.13.0-rc7-dirty #4
[   48.343261] NIP:  c0000000001b88fc LR: c0000000001b88f8 CTR: c000000000723d10
[   48.343269] REGS: c00000000fb13780 TRAP: 0700   Not tainted  (5.13.0-rc7-dirty)
[   48.343278] MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 48044224  XER: 00000005
[   48.343295] CFAR: c00000000014d8a0 IRQMASK: 1 
[   48.343295] GPR00: c0000000001b88f8 c00000000fb13a20 c0000000029ab400 000000000000002d 
[   48.343295] GPR04: 00000000fffeffff c00000000fb136e0 0000000000000027 c00000154f817e08 
[   48.343295] GPR08: 0000000000000023 0000000000000001 0000000000000027 c00000167f1d7fe8 
[   48.343295] GPR12: 0000000000004000 c00000154ffdc680 0000000000000000 0000000000000000 
[   48.343295] GPR16: c000000000fa6660 0000000000000001 0000000000000000 c0000000024e1cd8 
[   48.343295] GPR20: 0000000000000000 c00000000290a69a 0000000000000000 c0000000024e1cc0 
[   48.343295] GPR24: 0000000000000000 c0000000029f2140 c00000154f762380 0000000000000001 
[   48.343295] GPR28: 0000000000000001 0000000000000000 c00000154f762400 0000000000000000 
[   48.343388] NIP [c0000000001b88fc] unthrottle_cfs_rq+0x49c/0x560
[   48.343397] LR [c0000000001b88f8] unthrottle_cfs_rq+0x498/0x560
[   48.343406] Call Trace:
[   48.343410] [c00000000fb13a20] [c0000000001b88f8] unthrottle_cfs_rq+0x498/0x560 (unreliable)
[   48.343422] [c00000000fb13ac0] [c00000000019edb8] tg_set_cfs_bandwidth+0x2c8/0x470
[   48.343433] [c00000000fb13bc0] [c000000000263874] cgroup_file_write+0x164/0x210
[   48.343444] [c00000000fb13c20] [c00000000058cfac] kernfs_fop_write_iter+0x1cc/0x280
[   48.343455] [c00000000fb13c70] [c00000000047024c] new_sync_write+0x14c/0x1d0
[   48.343467] [c00000000fb13d10] [c000000000473844] vfs_write+0x224/0x330
[   48.343476] [c00000000fb13d60] [c000000000473b2c] ksys_write+0x7c/0x140
[   48.343485] [c00000000fb13db0] [c000000000030fb0] system_call_exception+0x150/0x2d0
[   48.343495] [c00000000fb13e10] [c00000000000d45c] system_call_common+0xec/0x278
[   48.343504] --- interrupt: c00 at 0x7fffaa67bd74
[   48.343511] NIP:  00007fffaa67bd74 LR: 00007fffaa5f34c4 CTR: 0000000000000000
[   48.343519] REGS: c00000000fb13e80 TRAP: 0c00   Not tainted  (5.13.0-rc7-dirty)
[   48.343527] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 28002282  XER: 00000000
[   48.343548] IRQMASK: 0 
[   48.343548] GPR00: 0000000000000004 00007fffcb534d60 00007fffaa777100 0000000000000010 
[   48.343548] GPR04: 00000000415623d0 0000000000000005 0000000000000010 00007fffcb534df8 
[   48.343548] GPR08: 0000000010028618 0000000000000000 0000000000000000 0000000000000000 
[   48.343548] GPR12: 0000000000000000 00007fffaa81a310 0000000000000000 0000000000000000 
[   48.343548] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
[   48.343548] GPR20: 0000000000000000 0000000000000000 0000000000000002 0000000000000000 
[   48.343548] GPR24: 0000000000000000 000000000000002b 0000000000000005 00000000415623d0 
[   48.343548] GPR28: 0000000000000005 00007fffcb534eb0 00000000415623d0 0000000000000005 
[   48.343634] NIP [00007fffaa67bd74] 0x7fffaa67bd74
[   48.343640] LR [00007fffaa5f34c4] 0x7fffaa5f34c4
[   48.343646] --- interrupt: c00
[   48.343651] Instruction dump:
[   48.343656] 4bfffc74 3d22fff6 8929f2a9 2f890000 409efed4 39200001 3d42fff6 3c62fe60 
[   48.343672] 3863be08 992af2a9 4bf94f45 60000000 <0fe00000> 4bfffeb0 7f6407b4 7f43d378 
[   48.343687] ---[ end trace 61db91af8340603f ]---
cfs_bandwidth01.c:49: TINFO: Set 'level2/cpu.max' = '5000 10000'
cfs_bandwidth01.c:125: TPASS: Workers exited
cfs_bandwidth01.c:113: TPASS: Scheduled bandwidth constrained workers
cfs_bandwidth01.c:49: TINFO: Set 'level2/cpu.max' = '5000 10000'
cfs_bandwidth01.c:125: TPASS: Workers exited
cfs_bandwidth01.c:113: TPASS: Scheduled bandwidth constrained workers
cfs_bandwidth01.c:49: TINFO: Set 'level2/cpu.max' = '5000 10000'
cfs_bandwidth01.c:125: TPASS: Workers exited
cfs_bandwidth01.c:113: TPASS: Scheduled bandwidth constrained workers
cfs_bandwidth01.c:49: TINFO: Set 'level2/cpu.max' = '5000 10000'
cfs_bandwidth01.c:125: TPASS: Workers exited
tst_test.c:1349: TFAIL: Kernel is now tainted.

HINT: You _MAY_ be missing kernel fixes, see:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=39f23ce07b93
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b34cb07dde7c
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fe61468b2cbc
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5ab297bab984
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6d4d22468dae

Summary:
passed   10
failed   1
broken   0
skipped  0
warnings 0

^ permalink raw reply

* Re: [powerpc][5.13.0-rc7] Kernel warning (kernel/sched/fair.c:401) while running LTP tests
From: Odin Ugedal @ 2021-06-21  9:54 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Sachin Sant, linuxppc-dev, Peter Zijlstra, Odin Ugedal, open list
In-Reply-To: <CAKfTPtC=aXasuSNvn+A3152-4xoOTWROhJpZAVq6RLh1Hacpng@mail.gmail.com>

man. 21. jun. 2021 kl. 11:50 skrev Vincent Guittot <vincent.guittot@linaro.org>:
> This means that a child's load was not null and it was inserted
> whereas parent's load was null. This should not happen unless the
> propagation failed somewhere

My initial thought is that the patch below will fix it, if that is the
issue (that a leaf is inserted, but the propagation is not "completed"
in unthrottle). Might that be the case? Still working on reproducing
the issue tho.


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bfaa6e1f6067..015c5a5c1a4d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4930,12 +4930,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
                if (cfs_rq_throttled(cfs_rq))
                        goto unthrottle_throttle;

-               /*
-                * One parent has been throttled and cfs_rq removed from the
-                * list. Add it back to not break the leaf list.
-                */
-               if (throttled_hierarchy(cfs_rq))
-                       list_add_leaf_cfs_rq(cfs_rq);
+               list_add_leaf_cfs_rq(cfs_rq);
        }

        /* At this point se is NULL and we are at root level*/

^ permalink raw reply related

* Re: [powerpc][5.13.0-rc7] Kernel warning (kernel/sched/fair.c:401) while running LTP tests
From: Odin Ugedal @ 2021-06-21  9:39 UTC (permalink / raw)
  To: Sachin Sant
  Cc: Peter Zijlstra, Odin Ugedal, linuxppc-dev, open list,
	Vincent Guittot
In-Reply-To: <9D4A658A-5F77-4C33-904A-126E6052B205@linux.vnet.ibm.com>

man. 21. jun. 2021 kl. 08:33 skrev Sachin Sant <sachinp@linux.vnet.ibm.com>:
>
> While running LTP tests (cfs_bandwidth01) against 5.13.0-rc7 kernel on a powerpc box
> following warning is seen
>
> [ 6611.331827] ------------[ cut here ]------------
> [ 6611.331855] rq->tmp_alone_branch != &rq->leaf_cfs_rq_list
> [ 6611.331862] WARNING: CPU: 8 PID: 0 at kernel/sched/fair.c:401 unthrottle_cfs_rq+0x4cc/0x590
> [ 6611.331883] Modules linked in: nfsv3 nfs_acl nfs lockd grace fscache netfs tun brd overlay vfat fat btrfs blake2b_generic xor zstd_compress raid6_pq xfs loop sctp ip6_udp_tunnel udp_tunnel libcrc32c dm_mod bonding rfkill sunrpc pseries_rng xts vmx_crypto sch_fq_codel ip_tables ext4 mbcache jbd2 sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp fuse [last unloaded: init_module]
> [ 6611.331957] CPU: 8 PID: 0 Comm: swapper/8 Tainted: G           OE     5.13.0-rc6-gcba5e97280f5 #1
> [ 6611.331968] NIP:  c0000000001b7aac LR: c0000000001b7aa8 CTR: c000000000722d30
> [ 6611.331976] REGS: c00000000274f3a0 TRAP: 0700   Tainted: G           OE      (5.13.0-rc6-gcba5e97280f5)
> [ 6611.331985] MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 48000224  XER: 00000005
> [ 6611.332002] CFAR: c00000000014ca20 IRQMASK: 1
> [ 6611.332002] GPR00: c0000000001b7aa8 c00000000274f640 c000000001abaf00 000000000000002d
> [ 6611.332002] GPR04: 00000000ffff7fff c00000000274f300 0000000000000027 c000000efdb07e08
> [ 6611.332002] GPR08: 0000000000000023 0000000000000001 0000000000000027 c000000001976680
> [ 6611.332002] GPR12: 0000000000000000 c000000effc0be80 c000000ef07b3f90 000000001eefe200
> [ 6611.332002] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [ 6611.332002] GPR20: 0000000000000001 c000000000fa6c08 c000000000fa6030 0000000000000001
> [ 6611.332002] GPR24: 0000000000000000 0000000000000000 c000000efde12380 0000000000000001
> [ 6611.332002] GPR28: 0000000000000001 0000000000000000 c000000efde12400 0000000000000000
> [ 6611.332094] NIP [c0000000001b7aac] unthrottle_cfs_rq+0x4cc/0x590
> [ 6611.332104] LR [c0000000001b7aa8] unthrottle_cfs_rq+0x4c8/0x590
> [ 6611.332113] Call Trace:
> [ 6611.332116] [c00000000274f640] [c0000000001b7aa8] unthrottle_cfs_rq+0x4c8/0x590 (unreliable)
> [ 6611.332128] [c00000000274f6e0] [c0000000001b7e38] distribute_cfs_runtime+0x1d8/0x280
> [ 6611.332139] [c00000000274f7b0] [c0000000001b81d0] sched_cfs_period_timer+0x140/0x330
> [ 6611.332149] [c00000000274f870] [c00000000022a03c] __hrtimer_run_queues+0x17c/0x380
> [ 6611.332158] [c00000000274f8f0] [c00000000022ac68] hrtimer_interrupt+0x128/0x2f0
> [ 6611.332168] [c00000000274f9a0] [c00000000002940c] timer_interrupt+0x13c/0x370
> [ 6611.332179] [c00000000274fa00] [c000000000009c04] decrementer_common_virt+0x1a4/0x1b0
> [ 6611.332189] --- interrupt: 900 at plpar_hcall_norets_notrace+0x18/0x24
> [ 6611.332199] NIP:  c0000000000f6af8 LR: c000000000a05f68 CTR: 0000000000000000
> [ 6611.332206] REGS: c00000000274fa70 TRAP: 0900   Tainted: G           OE      (5.13.0-rc6-gcba5e97280f5)
> [ 6611.332214] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28000224  XER: 00000000
> [ 6611.332234] CFAR: 0000000000000c00 IRQMASK: 0
> [ 6611.332234] GPR00: 0000000000000000 c00000000274fd10 c000000001abaf00 0000000000000000
> [ 6611.332234] GPR04: 00000000000000c0 0000000000000080 0001a91c68b80fa1 00000000000003dc
> [ 6611.332234] GPR08: 000000000001f400 0000000000000001 0000000000000000 0000000000000000
> [ 6611.332234] GPR12: 0000000000000000 c000000effc0be80 c000000ef07b3f90 000000001eefe200
> [ 6611.332234] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [ 6611.332234] GPR20: 0000000000000001 0000000000000002 0000000000000010 c0000000019fe2f8
> [ 6611.332234] GPR24: 0000000000000001 00000603517d757e 0000000000000000 0000000000000000
> [ 6611.332234] GPR28: 0000000000000001 0000000000000000 c000000001231f90 c000000001231f98
> [ 6611.332323] NIP [c0000000000f6af8] plpar_hcall_norets_notrace+0x18/0x24
> [ 6611.332332] LR [c000000000a05f68] check_and_cede_processor+0x48/0x60
> [ 6611.332340] --- interrupt: 900
> [ 6611.332345] [c00000000274fd10] [c000000efdb92380] 0xc000000efdb92380 (unreliable)
> [ 6611.332355] [c00000000274fd70] [c000000000a063bc] dedicated_cede_loop+0x9c/0x1b0
> [ 6611.332364] [c00000000274fdc0] [c000000000a02b04] cpuidle_enter_state+0x2e4/0x4e0
> [ 6611.332375] [c00000000274fe20] [c000000000a02da0] cpuidle_enter+0x50/0x70
> [ 6611.332385] [c00000000274fe60] [c0000000001a883c] call_cpuidle+0x4c/0x80
> [ 6611.332393] [c00000000274fe80] [c0000000001a8ee0] do_idle+0x380/0x3e0
> [ 6611.332402] [c00000000274ff00] [c0000000001a91bc] cpu_startup_entry+0x3c/0x40
> [ 6611.332411] [c00000000274ff30] [c000000000063ff8] start_secondary+0x298/0x2b0
> [ 6611.332421] [c00000000274ff90] [c00000000000c754] start_secondary_prolog+0x10/0x14
> [ 6611.332430] Instruction dump:
> [ 6611.332435] 4bfffc44 3d22fff6 8929f328 2f890000 409efea4 39200001 3d42fff6 3c62ff4f
> [ 6611.332451] 3863bcd8 992af328 4bf94f15 60000000 <0fe00000> 4bfffe80 7f6407b4 7f43d378
> [ 6611.332466] ---[ end trace 1346f865cd1cae91 ]—
>
> 5.13.0-rc6 was good. Bisect points to following patch
>
> commit a7b359fc6a37
> sched/fair: Correctly insert cfs_rq's to list on unthrottle
>
> The test runs to completion(without this warning) if the patch is reverted.
>
> Thanks
> -Sachin
>

Hi,

Thanks for the report! I have a theory about what is possibly causing
this, so I will try to reproduce it and see if my assumptions are
correct.


Odin

^ permalink raw reply

* Re: [powerpc][5.13.0-rc7] Kernel warning (kernel/sched/fair.c:401) while running LTP tests
From: Vincent Guittot @ 2021-06-21  9:50 UTC (permalink / raw)
  To: Odin Ugedal; +Cc: Sachin Sant, Peter Zijlstra, linuxppc-dev, open list
In-Reply-To: <CAFpoUr3g5t3Z0BtW4-jnYomc3cdY=V5=Zt94-C+fHOjGWa107w@mail.gmail.com>

On Mon, 21 Jun 2021 at 11:39, Odin Ugedal <odin@uged.al> wrote:
>
> man. 21. jun. 2021 kl. 08:33 skrev Sachin Sant <sachinp@linux.vnet.ibm.com>:
> >
> > While running LTP tests (cfs_bandwidth01) against 5.13.0-rc7 kernel on a powerpc box
> > following warning is seen
> >
> > [ 6611.331827] ------------[ cut here ]------------
> > [ 6611.331855] rq->tmp_alone_branch != &rq->leaf_cfs_rq_list
> > [ 6611.331862] WARNING: CPU: 8 PID: 0 at kernel/sched/fair.c:401 unthrottle_cfs_rq+0x4cc/0x590
> > [ 6611.331883] Modules linked in: nfsv3 nfs_acl nfs lockd grace fscache netfs tun brd overlay vfat fat btrfs blake2b_generic xor zstd_compress raid6_pq xfs loop sctp ip6_udp_tunnel udp_tunnel libcrc32c dm_mod bonding rfkill sunrpc pseries_rng xts vmx_crypto sch_fq_codel ip_tables ext4 mbcache jbd2 sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp fuse [last unloaded: init_module]
> > [ 6611.331957] CPU: 8 PID: 0 Comm: swapper/8 Tainted: G           OE     5.13.0-rc6-gcba5e97280f5 #1
> > [ 6611.331968] NIP:  c0000000001b7aac LR: c0000000001b7aa8 CTR: c000000000722d30
> > [ 6611.331976] REGS: c00000000274f3a0 TRAP: 0700   Tainted: G           OE      (5.13.0-rc6-gcba5e97280f5)
> > [ 6611.331985] MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 48000224  XER: 00000005
> > [ 6611.332002] CFAR: c00000000014ca20 IRQMASK: 1
> > [ 6611.332002] GPR00: c0000000001b7aa8 c00000000274f640 c000000001abaf00 000000000000002d
> > [ 6611.332002] GPR04: 00000000ffff7fff c00000000274f300 0000000000000027 c000000efdb07e08
> > [ 6611.332002] GPR08: 0000000000000023 0000000000000001 0000000000000027 c000000001976680
> > [ 6611.332002] GPR12: 0000000000000000 c000000effc0be80 c000000ef07b3f90 000000001eefe200
> > [ 6611.332002] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > [ 6611.332002] GPR20: 0000000000000001 c000000000fa6c08 c000000000fa6030 0000000000000001
> > [ 6611.332002] GPR24: 0000000000000000 0000000000000000 c000000efde12380 0000000000000001
> > [ 6611.332002] GPR28: 0000000000000001 0000000000000000 c000000efde12400 0000000000000000
> > [ 6611.332094] NIP [c0000000001b7aac] unthrottle_cfs_rq+0x4cc/0x590
> > [ 6611.332104] LR [c0000000001b7aa8] unthrottle_cfs_rq+0x4c8/0x590
> > [ 6611.332113] Call Trace:
> > [ 6611.332116] [c00000000274f640] [c0000000001b7aa8] unthrottle_cfs_rq+0x4c8/0x590 (unreliable)
> > [ 6611.332128] [c00000000274f6e0] [c0000000001b7e38] distribute_cfs_runtime+0x1d8/0x280
> > [ 6611.332139] [c00000000274f7b0] [c0000000001b81d0] sched_cfs_period_timer+0x140/0x330
> > [ 6611.332149] [c00000000274f870] [c00000000022a03c] __hrtimer_run_queues+0x17c/0x380
> > [ 6611.332158] [c00000000274f8f0] [c00000000022ac68] hrtimer_interrupt+0x128/0x2f0
> > [ 6611.332168] [c00000000274f9a0] [c00000000002940c] timer_interrupt+0x13c/0x370
> > [ 6611.332179] [c00000000274fa00] [c000000000009c04] decrementer_common_virt+0x1a4/0x1b0
> > [ 6611.332189] --- interrupt: 900 at plpar_hcall_norets_notrace+0x18/0x24
> > [ 6611.332199] NIP:  c0000000000f6af8 LR: c000000000a05f68 CTR: 0000000000000000
> > [ 6611.332206] REGS: c00000000274fa70 TRAP: 0900   Tainted: G           OE      (5.13.0-rc6-gcba5e97280f5)
> > [ 6611.332214] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28000224  XER: 00000000
> > [ 6611.332234] CFAR: 0000000000000c00 IRQMASK: 0
> > [ 6611.332234] GPR00: 0000000000000000 c00000000274fd10 c000000001abaf00 0000000000000000
> > [ 6611.332234] GPR04: 00000000000000c0 0000000000000080 0001a91c68b80fa1 00000000000003dc
> > [ 6611.332234] GPR08: 000000000001f400 0000000000000001 0000000000000000 0000000000000000
> > [ 6611.332234] GPR12: 0000000000000000 c000000effc0be80 c000000ef07b3f90 000000001eefe200
> > [ 6611.332234] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > [ 6611.332234] GPR20: 0000000000000001 0000000000000002 0000000000000010 c0000000019fe2f8
> > [ 6611.332234] GPR24: 0000000000000001 00000603517d757e 0000000000000000 0000000000000000
> > [ 6611.332234] GPR28: 0000000000000001 0000000000000000 c000000001231f90 c000000001231f98
> > [ 6611.332323] NIP [c0000000000f6af8] plpar_hcall_norets_notrace+0x18/0x24
> > [ 6611.332332] LR [c000000000a05f68] check_and_cede_processor+0x48/0x60
> > [ 6611.332340] --- interrupt: 900
> > [ 6611.332345] [c00000000274fd10] [c000000efdb92380] 0xc000000efdb92380 (unreliable)
> > [ 6611.332355] [c00000000274fd70] [c000000000a063bc] dedicated_cede_loop+0x9c/0x1b0
> > [ 6611.332364] [c00000000274fdc0] [c000000000a02b04] cpuidle_enter_state+0x2e4/0x4e0
> > [ 6611.332375] [c00000000274fe20] [c000000000a02da0] cpuidle_enter+0x50/0x70
> > [ 6611.332385] [c00000000274fe60] [c0000000001a883c] call_cpuidle+0x4c/0x80
> > [ 6611.332393] [c00000000274fe80] [c0000000001a8ee0] do_idle+0x380/0x3e0
> > [ 6611.332402] [c00000000274ff00] [c0000000001a91bc] cpu_startup_entry+0x3c/0x40
> > [ 6611.332411] [c00000000274ff30] [c000000000063ff8] start_secondary+0x298/0x2b0
> > [ 6611.332421] [c00000000274ff90] [c00000000000c754] start_secondary_prolog+0x10/0x14
> > [ 6611.332430] Instruction dump:
> > [ 6611.332435] 4bfffc44 3d22fff6 8929f328 2f890000 409efea4 39200001 3d42fff6 3c62ff4f
> > [ 6611.332451] 3863bcd8 992af328 4bf94f15 60000000 <0fe00000> 4bfffe80 7f6407b4 7f43d378
> > [ 6611.332466] ---[ end trace 1346f865cd1cae91 ]—
> >
> > 5.13.0-rc6 was good. Bisect points to following patch
> >
> > commit a7b359fc6a37
> > sched/fair: Correctly insert cfs_rq's to list on unthrottle
> >
> > The test runs to completion(without this warning) if the patch is reverted.
> >
> > Thanks
> > -Sachin
> >
>
> Hi,
>
> Thanks for the report! I have a theory about what is possibly causing
> this, so I will try to reproduce it and see if my assumptions are
> correct.

This means that a child's load was not null and it was inserted
whereas parent's load was null. This should not happen unless the
propagation failed somewhere

>
>
> Odin

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox