LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v5 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
From: Srikar Dronamraju @ 2020-09-13 16:53 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Ingo Molnar,
	Oliver O'Halloran, linuxppc-dev, Valentin Schneider
In-Reply-To: <87imciqwhq.fsf@mpe.ellerman.id.au>

* Michael Ellerman <mpe@ellerman.id.au> [2020-09-13 11:46:41]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > * Michael Ellerman <mpe@ellerman.id.au> [2020-09-11 21:55:23]:
> >
> >> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> >> > Current code assumes that cpumask of cpus sharing a l2-cache mask will
> >> > always be a superset of cpu_sibling_mask.
> >> >
> >> > Lets stop that assumption. cpu_l2_cache_mask is a superset of
> >> > cpu_sibling_mask if and only if shared_caches is set.
> >> 
> >> I'm seeing oopses with this:
> >> 
> 
> The patch fixes qemu, and I don't see the crash on mambo, but I still
> see:
>   [    0.010536] smp: Bringing up secondary CPUs ...
>   [    0.019189] smp: Brought up 2 nodes, 8 CPUs
>   [    0.019210] numa: Node 0 CPUs: 0-3
>   [    0.019235] numa: Node 1 CPUs: 4-7
>   [    0.024444]      the CACHE domain not a subset of the MC domain
>   [    0.024505] BUG: arch topology borken
>   [    0.024527]      the SMT domain not a subset of the CACHE domain
>   [    0.024563] BUG: arch topology borken
>   [    0.024584]      the CACHE domain not a subset of the MC domain
>   [    0.024645] BUG: arch topology borken
>   [    0.024666]      the SMT domain not a subset of the CACHE domain
>   [    0.024702] BUG: arch topology borken
>   [    0.024723]      the CACHE domain not a subset of the MC domain
> 
> That's the p9 mambo model, using skiboot.tcl from skiboot, with CPUS=2,
> THREADS=4 and MAMBO_NUMA=1.
> 

I was able to reproduce with
 qemu-system-ppc64 -nographic -vga none -M pseries -cpu POWER8 \
   -kernel build~/vmlinux \
   -m 2G,slots=2,maxmem=4G \
   -object memory-backend-ram,size=1G,id=m0 \
   -object memory-backend-ram,size=1G,id=m1 \
   -numa node,nodeid=0,memdev=m0 \
   -numa node,nodeid=1,memdev=m1 \
   -smp 8,threads=4,sockets=2,maxcpus=8  \


If the CPU doesn't have a l2-cache element, then CPU not only has to set
itself in the cpu_l2_cache but also the siblings. Otherwise it will so
happen that the Siblings will have 4 Cpus set, and the Cache domain will
have just one cpu set, leading to this BUG message.

Patch follows this mail.

> Node layout is:
> 
> [    0.000000] Early memory node ranges
> [    0.000000]   node   0: [mem 0x0000000000000000-0x00000000ffffffff]
> [    0.000000]   node   1: [mem 0x0000200000000000-0x00002000ffffffff]
> [    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x00000000ffffffff]
> [    0.000000] On node 0 totalpages: 65536
> [    0.000000] Initmem setup node 1 [mem 0x0000200000000000-0x00002000ffffffff]
> [    0.000000] On node 1 totalpages: 65536
> 
> 
> There aren't any l2-cache properties in the device-tree under cpus.
> 
> I'll try and have a closer look tonight.
> 
> cheers

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply

* Re: [PATCH v5 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
From: Srikar Dronamraju @ 2020-09-13 17:10 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar
In-Reply-To: <87imciqwhq.fsf@mpe.ellerman.id.au>

Fix to make it work where CPUs dont have a l2-cache element.

------------>8-----------------------------------------8<---------------------

From b25d47b01b7195b1df19083a4043fa6a87a901a3 Mon Sep 17 00:00:00 2001
From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Date: Thu, 9 Jul 2020 13:33:38 +0530
Subject: [PATCH v5.2 05/10] powerpc/smp: Dont assume l2-cache to be superset of
 sibling

Current code assumes that cpumask of cpus sharing a l2-cache mask will
always be a superset of cpu_sibling_mask.

Lets stop that assumption. cpu_l2_cache_mask is a superset of
cpu_sibling_mask if and only if shared_caches is set.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
	Set cpumask after verifying l2-cache. (Gautham)

Changelog v5 -> v5.2:
	If cpu has no l2-cache set cpumask as per its
                 sibling mask. (Michael Ellerman)

 arch/powerpc/kernel/smp.c | 43 +++++++++++++++++++++++++++++--------------
 1 file changed, 29 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 9f4333d..168532e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1186,9 +1186,23 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int))
 	int i;
 
 	l2_cache = cpu_to_l2cache(cpu);
-	if (!l2_cache)
+	if (!l2_cache) {
+		struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
+
+		/*
+		 * If no l2cache for this CPU, assume all siblings to share
+		 * cache with this CPU.
+		 */
+		if (has_big_cores)
+			sibling_mask = cpu_smallcore_mask;
+
+		for_each_cpu(i, sibling_mask(cpu))
+			set_cpus_related(cpu, i, cpu_l2_cache_mask);
+
 		return false;
+	}
 
+	cpumask_set_cpu(cpu, mask_fn(cpu));
 	for_each_cpu(i, cpu_online_mask) {
 		/*
 		 * when updating the marks the current CPU has not been marked
@@ -1271,29 +1285,30 @@ static void add_cpu_to_masks(int cpu)
 	 * add it to it's own thread sibling mask.
 	 */
 	cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
+	cpumask_set_cpu(cpu, cpu_core_mask(cpu));
 
 	for (i = first_thread; i < first_thread + threads_per_core; i++)
 		if (cpu_online(i))
 			set_cpus_related(i, cpu, cpu_sibling_mask);
 
 	add_cpu_to_smallcore_masks(cpu);
-	/*
-	 * Copy the thread sibling mask into the cache sibling mask
-	 * and mark any CPUs that share an L2 with this CPU.
-	 */
-	for_each_cpu(i, cpu_sibling_mask(cpu))
-		set_cpus_related(cpu, i, cpu_l2_cache_mask);
 	update_mask_by_l2(cpu, cpu_l2_cache_mask);
 
-	/*
-	 * Copy the cache sibling mask into core sibling mask and mark
-	 * any CPUs on the same chip as this CPU.
-	 */
-	for_each_cpu(i, cpu_l2_cache_mask(cpu))
-		set_cpus_related(cpu, i, cpu_core_mask);
+	if (pkg_id == -1) {
+		struct cpumask *(*mask)(int) = cpu_sibling_mask;
+
+		/*
+		 * Copy the sibling mask into core sibling mask and
+		 * mark any CPUs on the same chip as this CPU.
+		 */
+		if (shared_caches)
+			mask = cpu_l2_cache_mask;
+
+		for_each_cpu(i, mask(cpu))
+			set_cpus_related(cpu, i, cpu_core_mask);
 
-	if (pkg_id == -1)
 		return;
+	}
 
 	for_each_cpu(i, cpu_online_mask)
 		if (get_physical_package_id(i) == pkg_id)
-- 
2.17.1


^ permalink raw reply related

* RE: [PATCHv7 04/12] PCI: designware-ep: Modify MSI and MSIX CAP way of finding
From: Z.q. Hou @ 2020-09-13 17:24 UTC (permalink / raw)
  To: Rob Herring
  Cc: devicetree@vger.kernel.org, lorenzo.pieralisi@arm.com,
	Xiaowei Bao, Roy Zang, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org, Leo Li, M.h. Lian,
	jingoohan1@gmail.com, andrew.murray@arm.com, Mingkai Hu,
	gustavo.pimentel@synopsys.com, bhelgaas@google.com,
	shawnguo@kernel.org, kishon@ti.com, linuxppc-dev@lists.ozlabs.org,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <20200910181009.GB592152@bogus>

Hi Rob,

Thanks a lot for your comments!

> -----Original Message-----
> From: Rob Herring <robh@kernel.org>
> Sent: 2020年9月11日 2:10
> To: Z.q. Hou <zhiqiang.hou@nxp.com>
> Cc: linux-pci@vger.kernel.org; devicetree@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux-arm-kernel@lists.infradead.org;
> linuxppc-dev@lists.ozlabs.org; bhelgaas@google.com;
> lorenzo.pieralisi@arm.com; shawnguo@kernel.org; Leo Li
> <leoyang.li@nxp.com>; kishon@ti.com; gustavo.pimentel@synopsys.com;
> Roy Zang <roy.zang@nxp.com>; jingoohan1@gmail.com;
> andrew.murray@arm.com; Mingkai Hu <mingkai.hu@nxp.com>; M.h. Lian
> <minghuan.lian@nxp.com>; Xiaowei Bao <xiaowei.bao@nxp.com>
> Subject: Re: [PATCHv7 04/12] PCI: designware-ep: Modify MSI and MSIX CAP
> way of finding
> 
> On Tue, Aug 11, 2020 at 05:54:33PM +0800, Zhiqiang Hou wrote:
> > From: Xiaowei Bao <xiaowei.bao@nxp.com>
> >
> > Each PF of EP device should have its own MSI or MSIX capabitily
> > struct, so create a dw_pcie_ep_func struct and move the msi_cap and
> > msix_cap to this struct from dw_pcie_ep, and manage the PFs via a
> > list.
> >
> > Signed-off-by: Xiaowei Bao <xiaowei.bao@nxp.com>
> > Signed-off-by: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>
> > ---
> > V7:
> >  - Rebase the patch without functionality change.
> >
> >  .../pci/controller/dwc/pcie-designware-ep.c   | 139
> +++++++++++++++---
> >  drivers/pci/controller/dwc/pcie-designware.h  |  18 ++-
> >  2 files changed, 136 insertions(+), 21 deletions(-)
> >
> > diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c
> > b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > index 56bd1cd71f16..4680a51c49c0 100644
> > --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> > +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > @@ -28,6 +28,19 @@ void dw_pcie_ep_init_notify(struct dw_pcie_ep
> *ep)
> > }  EXPORT_SYMBOL_GPL(dw_pcie_ep_init_notify);
> >
> > +struct dw_pcie_ep_func *
> > +dw_pcie_ep_get_func_from_ep(struct dw_pcie_ep *ep, u8 func_no) {
> > +	struct dw_pcie_ep_func *ep_func;
> > +
> > +	list_for_each_entry(ep_func, &ep->func_list, list) {
> > +		if (ep_func->func_no == func_no)
> > +			return ep_func;
> > +	}
> > +
> > +	return NULL;
> > +}
> > +
> >  static unsigned int dw_pcie_ep_func_select(struct dw_pcie_ep *ep, u8
> > func_no)  {
> >  	unsigned int func_offset = 0;
> > @@ -68,6 +81,47 @@ void dw_pcie_ep_reset_bar(struct dw_pcie *pci,
> enum pci_barno bar)
> >  		__dw_pcie_ep_reset_bar(pci, func_no, bar, 0);  }
> >
> > +static u8 __dw_pcie_ep_find_next_cap(struct dw_pcie_ep *ep, u8
> func_no,
> > +		u8 cap_ptr, u8 cap)
> > +{
> > +	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > +	unsigned int func_offset = 0;
> > +	u8 cap_id, next_cap_ptr;
> > +	u16 reg;
> > +
> > +	if (!cap_ptr)
> > +		return 0;
> > +
> > +	func_offset = dw_pcie_ep_func_select(ep, func_no);
> > +
> > +	reg = dw_pcie_readw_dbi(pci, func_offset + cap_ptr);
> > +	cap_id = (reg & 0x00ff);
> > +
> > +	if (cap_id > PCI_CAP_ID_MAX)
> > +		return 0;
> > +
> > +	if (cap_id == cap)
> > +		return cap_ptr;
> > +
> > +	next_cap_ptr = (reg & 0xff00) >> 8;
> > +	return __dw_pcie_ep_find_next_cap(ep, func_no, next_cap_ptr, cap); }
> > +
> > +static u8 dw_pcie_ep_find_capability(struct dw_pcie_ep *ep, u8
> > +func_no, u8 cap) {
> > +	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > +	unsigned int func_offset = 0;
> > +	u8 next_cap_ptr;
> > +	u16 reg;
> > +
> > +	func_offset = dw_pcie_ep_func_select(ep, func_no);
> > +
> > +	reg = dw_pcie_readw_dbi(pci, func_offset + PCI_CAPABILITY_LIST);
> > +	next_cap_ptr = (reg & 0x00ff);
> > +
> > +	return __dw_pcie_ep_find_next_cap(ep, func_no, next_cap_ptr, cap); }
> 
> These are almost the same as __dw_pcie_find_next_cap and
> dw_pcie_find_capability. Please modify them to take a function offset and
> work for both host and endpoints.

Yes, will rework in next version.

> 
> > +
> >  static int dw_pcie_ep_write_header(struct pci_epc *epc, u8 func_no,
> >  				   struct pci_epf_header *hdr)
> >  {
> > @@ -257,13 +311,18 @@ static int dw_pcie_ep_get_msi(struct pci_epc
> *epc, u8 func_no)
> >  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> >  	u32 val, reg;
> >  	unsigned int func_offset = 0;
> > +	struct dw_pcie_ep_func *ep_func;
> >
> > -	if (!ep->msi_cap)
> > +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> > +	if (!ep_func)
> > +		return -EINVAL;
> > +
> > +	if (!ep_func->msi_cap)
> >  		return -EINVAL;
> >
> >  	func_offset = dw_pcie_ep_func_select(ep, func_no);
> >
> > -	reg = ep->msi_cap + func_offset + PCI_MSI_FLAGS;
> > +	reg = ep_func->msi_cap + func_offset + PCI_MSI_FLAGS;
> >  	val = dw_pcie_readw_dbi(pci, reg);
> >  	if (!(val & PCI_MSI_FLAGS_ENABLE))
> >  		return -EINVAL;
> > @@ -279,13 +338,18 @@ static int dw_pcie_ep_set_msi(struct pci_epc
> *epc, u8 func_no, u8 interrupts)
> >  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> >  	u32 val, reg;
> >  	unsigned int func_offset = 0;
> > +	struct dw_pcie_ep_func *ep_func;
> > +
> > +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> > +	if (!ep_func)
> > +		return -EINVAL;
> >
> > -	if (!ep->msi_cap)
> > +	if (!ep_func->msi_cap)
> >  		return -EINVAL;
> >
> >  	func_offset = dw_pcie_ep_func_select(ep, func_no);
> >
> > -	reg = ep->msi_cap + func_offset + PCI_MSI_FLAGS;
> > +	reg = ep_func->msi_cap + func_offset + PCI_MSI_FLAGS;
> 
> If msi_cap is now per function, then shouldn't it already include
> 'func_offset'?

Good suggestion, will rework in next version.

Regards,
Zhiqiang

> 
> >  	val = dw_pcie_readw_dbi(pci, reg);
> >  	val &= ~PCI_MSI_FLAGS_QMASK;
> >  	val |= (interrupts << 1) & PCI_MSI_FLAGS_QMASK; @@ -302,13 +366,18
> > @@ static int dw_pcie_ep_get_msix(struct pci_epc *epc, u8 func_no)
> >  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> >  	u32 val, reg;
> >  	unsigned int func_offset = 0;
> > +	struct dw_pcie_ep_func *ep_func;
> > +
> > +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> > +	if (!ep_func)
> > +		return -EINVAL;
> >
> > -	if (!ep->msix_cap)
> > +	if (!ep_func->msix_cap)
> >  		return -EINVAL;
> >
> >  	func_offset = dw_pcie_ep_func_select(ep, func_no);
> >
> > -	reg = ep->msix_cap + func_offset + PCI_MSIX_FLAGS;
> > +	reg = ep_func->msix_cap + func_offset + PCI_MSIX_FLAGS;
> >  	val = dw_pcie_readw_dbi(pci, reg);
> >  	if (!(val & PCI_MSIX_FLAGS_ENABLE))
> >  		return -EINVAL;
> > @@ -325,25 +394,30 @@ static int dw_pcie_ep_set_msix(struct pci_epc
> *epc, u8 func_no, u16 interrupts,
> >  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> >  	u32 val, reg;
> >  	unsigned int func_offset = 0;
> > +	struct dw_pcie_ep_func *ep_func;
> >
> > -	if (!ep->msix_cap)
> > +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> > +	if (!ep_func)
> > +		return -EINVAL;
> > +
> > +	if (!ep_func->msix_cap)
> >  		return -EINVAL;
> >
> >  	dw_pcie_dbi_ro_wr_en(pci);
> >
> >  	func_offset = dw_pcie_ep_func_select(ep, func_no);
> >
> > -	reg = ep->msix_cap + func_offset + PCI_MSIX_FLAGS;
> > +	reg = ep_func->msix_cap + func_offset + PCI_MSIX_FLAGS;
> >  	val = dw_pcie_readw_dbi(pci, reg);
> >  	val &= ~PCI_MSIX_FLAGS_QSIZE;
> >  	val |= interrupts;
> >  	dw_pcie_writew_dbi(pci, reg, val);
> >
> > -	reg = ep->msix_cap + func_offset + PCI_MSIX_TABLE;
> > +	reg = ep_func->msix_cap + func_offset + PCI_MSIX_TABLE;
> >  	val = offset | bir;
> >  	dw_pcie_writel_dbi(pci, reg, val);
> >
> > -	reg = ep->msix_cap + func_offset + PCI_MSIX_PBA;
> > +	reg = ep_func->msix_cap + func_offset + PCI_MSIX_PBA;
> >  	val = (offset + (interrupts * PCI_MSIX_ENTRY_SIZE)) | bir;
> >  	dw_pcie_writel_dbi(pci, reg, val);
> >
> > @@ -426,6 +500,7 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep
> *ep, u8 func_no,
> >  			     u8 interrupt_num)
> >  {
> >  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > +	struct dw_pcie_ep_func *ep_func;
> >  	struct pci_epc *epc = ep->epc;
> >  	unsigned int aligned_offset;
> >  	unsigned int func_offset = 0;
> > @@ -435,25 +510,29 @@ int dw_pcie_ep_raise_msi_irq(struct
> dw_pcie_ep *ep, u8 func_no,
> >  	bool has_upper;
> >  	int ret;
> >
> > -	if (!ep->msi_cap)
> > +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> > +	if (!ep_func)
> > +		return -EINVAL;
> > +
> > +	if (!ep_func->msi_cap)
> >  		return -EINVAL;
> >
> >  	func_offset = dw_pcie_ep_func_select(ep, func_no);
> >
> >  	/* Raise MSI per the PCI Local Bus Specification Revision 3.0, 6.8.1. */
> > -	reg = ep->msi_cap + func_offset + PCI_MSI_FLAGS;
> > +	reg = ep_func->msi_cap + func_offset + PCI_MSI_FLAGS;
> >  	msg_ctrl = dw_pcie_readw_dbi(pci, reg);
> >  	has_upper = !!(msg_ctrl & PCI_MSI_FLAGS_64BIT);
> > -	reg = ep->msi_cap + func_offset + PCI_MSI_ADDRESS_LO;
> > +	reg = ep_func->msi_cap + func_offset + PCI_MSI_ADDRESS_LO;
> >  	msg_addr_lower = dw_pcie_readl_dbi(pci, reg);
> >  	if (has_upper) {
> > -		reg = ep->msi_cap + func_offset + PCI_MSI_ADDRESS_HI;
> > +		reg = ep_func->msi_cap + func_offset + PCI_MSI_ADDRESS_HI;
> >  		msg_addr_upper = dw_pcie_readl_dbi(pci, reg);
> > -		reg = ep->msi_cap + func_offset + PCI_MSI_DATA_64;
> > +		reg = ep_func->msi_cap + func_offset + PCI_MSI_DATA_64;
> >  		msg_data = dw_pcie_readw_dbi(pci, reg);
> >  	} else {
> >  		msg_addr_upper = 0;
> > -		reg = ep->msi_cap + func_offset + PCI_MSI_DATA_32;
> > +		reg = ep_func->msi_cap + func_offset + PCI_MSI_DATA_32;
> >  		msg_data = dw_pcie_readw_dbi(pci, reg);
> >  	}
> >  	aligned_offset = msg_addr_lower & (epc->mem->window.page_size -
> 1);
> > @@ -489,6 +568,7 @@ int dw_pcie_ep_raise_msix_irq(struct dw_pcie_ep
> *ep, u8 func_no,
> >  			      u16 interrupt_num)
> >  {
> >  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > +	struct dw_pcie_ep_func *ep_func;
> >  	struct pci_epf_msix_tbl *msix_tbl;
> >  	struct pci_epc *epc = ep->epc;
> >  	unsigned int func_offset = 0;
> > @@ -499,9 +579,16 @@ int dw_pcie_ep_raise_msix_irq(struct
> dw_pcie_ep *ep, u8 func_no,
> >  	int ret;
> >  	u8 bir;
> >
> > +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> > +	if (!ep_func)
> > +		return -EINVAL;
> > +
> > +	if (!ep_func->msix_cap)
> > +		return -EINVAL;
> > +
> >  	func_offset = dw_pcie_ep_func_select(ep, func_no);
> >
> > -	reg = ep->msix_cap + func_offset + PCI_MSIX_TABLE;
> > +	reg = ep_func->msix_cap + func_offset + PCI_MSIX_TABLE;
> >  	tbl_offset = dw_pcie_readl_dbi(pci, reg);
> >  	bir = (tbl_offset & PCI_MSIX_TABLE_BIR);
> >  	tbl_offset &= PCI_MSIX_TABLE_OFFSET; @@ -596,11 +683,15 @@ int
> > dw_pcie_ep_init(struct dw_pcie_ep *ep)  {
> >  	int ret;
> >  	void *addr;
> > +	u8 func_no;
> >  	struct pci_epc *epc;
> >  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> >  	struct device *dev = pci->dev;
> >  	struct device_node *np = dev->of_node;
> >  	const struct pci_epc_features *epc_features;
> > +	struct dw_pcie_ep_func *ep_func;
> > +
> > +	INIT_LIST_HEAD(&ep->func_list);
> >
> >  	if (!pci->dbi_base || !pci->dbi_base2) {
> >  		dev_err(dev, "dbi_base/dbi_base2 is not populated\n"); @@ -660,9
> > +751,19 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep)
> >  	if (ret < 0)
> >  		epc->max_functions = 1;
> >
> > -	ep->msi_cap = dw_pcie_find_capability(pci, PCI_CAP_ID_MSI);
> > +	for (func_no = 0; func_no < epc->max_functions; func_no++) {
> > +		ep_func = devm_kzalloc(dev, sizeof(*ep_func), GFP_KERNEL);
> > +		if (!ep_func)
> > +			return -ENOMEM;
> >
> > -	ep->msix_cap = dw_pcie_find_capability(pci, PCI_CAP_ID_MSIX);
> > +		ep_func->func_no = func_no;
> > +		ep_func->msi_cap = dw_pcie_ep_find_capability(ep, func_no,
> > +							      PCI_CAP_ID_MSI);
> > +		ep_func->msix_cap = dw_pcie_ep_find_capability(ep, func_no,
> > +							       PCI_CAP_ID_MSIX);
> > +
> > +		list_add_tail(&ep_func->list, &ep->func_list);
> > +	}
> >
> >  	if (ep->ops->ep_init)
> >  		ep->ops->ep_init(ep);
> > diff --git a/drivers/pci/controller/dwc/pcie-designware.h
> > b/drivers/pci/controller/dwc/pcie-designware.h
> > index 745b4938225a..19c4ba486239 100644
> > --- a/drivers/pci/controller/dwc/pcie-designware.h
> > +++ b/drivers/pci/controller/dwc/pcie-designware.h
> > @@ -230,8 +230,16 @@ struct dw_pcie_ep_ops {
> >  	unsigned int (*func_conf_select)(struct dw_pcie_ep *ep, u8 func_no);
> > };
> >
> > +struct dw_pcie_ep_func {
> > +	struct list_head	list;
> > +	u8			func_no;
> > +	u8			msi_cap;	/* MSI capability offset */
> > +	u8			msix_cap;	/* MSI-X capability offset */
> > +};
> > +
> >  struct dw_pcie_ep {
> >  	struct pci_epc		*epc;
> > +	struct list_head	func_list;
> >  	const struct dw_pcie_ep_ops *ops;
> >  	phys_addr_t		phys_base;
> >  	size_t			addr_size;
> > @@ -244,8 +252,6 @@ struct dw_pcie_ep {
> >  	u32			num_ob_windows;
> >  	void __iomem		*msi_mem;
> >  	phys_addr_t		msi_mem_phys;
> > -	u8			msi_cap;	/* MSI capability offset */
> > -	u8			msix_cap;	/* MSI-X capability offset */
> >  	struct pci_epf_bar	*epf_bar[PCI_STD_NUM_BARS];
> >  };
> >
> > @@ -440,6 +446,8 @@ int dw_pcie_ep_raise_msix_irq(struct dw_pcie_ep
> > *ep, u8 func_no,  int dw_pcie_ep_raise_msix_irq_doorbell(struct
> dw_pcie_ep *ep, u8 func_no,
> >  				       u16 interrupt_num);
> >  void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno bar);
> > +struct dw_pcie_ep_func *
> > +dw_pcie_ep_get_func_from_ep(struct dw_pcie_ep *ep, u8 func_no);
> >  #else
> >  static inline void dw_pcie_ep_linkup(struct dw_pcie_ep *ep)  { @@
> > -490,5 +498,11 @@ static inline int
> > dw_pcie_ep_raise_msix_irq_doorbell(struct dw_pcie_ep *ep,  static
> > inline void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno
> > bar)  {  }
> > +
> > +static inline struct dw_pcie_ep_func *
> > +dw_pcie_ep_get_func_from_ep(struct dw_pcie_ep *ep, u8 func_no) {
> > +	return NULL;
> > +}
> >  #endif
> >  #endif /* _PCIE_DESIGNWARE_H */
> > --
> > 2.17.1
> >

^ permalink raw reply

* RE: [PATCHv7 12/12] misc: pci_endpoint_test: Add driver data for Layerscape PCIe controllers
From: Z.q. Hou @ 2020-09-13 17:24 UTC (permalink / raw)
  To: Rob Herring
  Cc: Roy Zang, lorenzo.pieralisi@arm.com, jingoohan1@gmail.com,
	linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
	Leo Li, kishon@ti.com, M.h. Lian, devicetree@vger.kernel.org,
	robh+dt@kernel.org, Mingkai Hu, linux-pci@vger.kernel.org,
	bhelgaas@google.com, andrew.murray@arm.com,
	gustavo.pimentel@synopsys.com, shawnguo@kernel.org,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <20200910181756.GA622331@bogus>

Hi Rob,

Thanks a lot for your review and ack!

Regards,
Zhiqiang

> -----Original Message-----
> From: Rob Herring <robh@kernel.org>
> Sent: 2020年9月11日 2:18
> To: Z.q. Hou <zhiqiang.hou@nxp.com>
> Cc: bhelgaas@google.com; shawnguo@kernel.org; M.h. Lian
> <minghuan.lian@nxp.com>; Leo Li <leoyang.li@nxp.com>;
> linuxppc-dev@lists.ozlabs.org; robh+dt@kernel.org;
> linux-arm-kernel@lists.infradead.org; Roy Zang <roy.zang@nxp.com>;
> andrew.murray@arm.com; linux-pci@vger.kernel.org;
> lorenzo.pieralisi@arm.com; gustavo.pimentel@synopsys.com; Mingkai Hu
> <mingkai.hu@nxp.com>; linux-kernel@vger.kernel.org;
> jingoohan1@gmail.com; kishon@ti.com; devicetree@vger.kernel.org
> Subject: Re: [PATCHv7 12/12] misc: pci_endpoint_test: Add driver data for
> Layerscape PCIe controllers
> 
> On Tue, 11 Aug 2020 17:54:41 +0800, Zhiqiang Hou wrote:
> > From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>
> >
> > The commit 0a121f9bc3f5 ("misc: pci_endpoint_test: Use streaming DMA
> > APIs for buffer allocation") changed to use streaming DMA APIs,
> > however,
> > dma_map_single() might not return a 4KB aligned address, so add the
> > default_data as driver data for Layerscape PCIe controllers to make it
> > 4KB aligned.
> >
> > Signed-off-by: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>
> > ---
> > V7:
> >  - New patch.
> >
> >  drivers/misc/pci_endpoint_test.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> 
> Acked-by: Rob Herring <robh@kernel.org>

^ permalink raw reply

* Re: [PATCH v5 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
From: Satheesh Rajendran @ 2020-09-13 17:42 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Ingo Molnar, Oliver O'Halloran, linuxppc-dev,
	Valentin Schneider
In-Reply-To: <87y2lgr0ic.fsf@mpe.ellerman.id.au>

On Fri, Sep 11, 2020 at 09:55:23PM +1000, Michael Ellerman wrote:
> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > Current code assumes that cpumask of cpus sharing a l2-cache mask will
> > always be a superset of cpu_sibling_mask.
> >
> > Lets stop that assumption. cpu_l2_cache_mask is a superset of
> > cpu_sibling_mask if and only if shared_caches is set.
> 
> I'm seeing oopses with this:
> 
> [    0.117392][    T1] smp: Bringing up secondary CPUs ...
> [    0.156515][    T1] smp: Brought up 2 nodes, 2 CPUs
> [    0.158265][    T1] numa: Node 0 CPUs: 0
> [    0.158520][    T1] numa: Node 1 CPUs: 1
> [    0.167453][    T1] BUG: Unable to handle kernel data access on read at 0x8000000041228298
> [    0.167992][    T1] Faulting instruction address: 0xc00000000018c128
> [    0.168817][    T1] Oops: Kernel access of bad area, sig: 11 [#1]
> [    0.168964][    T1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> [    0.169417][    T1] Modules linked in:
> [    0.170047][    T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc2-00095-g7430ad5aa700 #209
> [    0.170305][    T1] NIP:  c00000000018c128 LR: c00000000018c0cc CTR: c00000000004dce0
> [    0.170498][    T1] REGS: c00000007e343880 TRAP: 0380   Not tainted  (5.9.0-rc2-00095-g7430ad5aa700)
> [    0.170602][    T1] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 44002222  XER: 00000000
> [    0.170985][    T1] CFAR: c00000000018c288 IRQMASK: 0
> [    0.170985][    T1] GPR00: 0000000000000000 c00000007e343b10 c00000000173e400 0000000000004000
> [    0.170985][    T1] GPR04: 0000000000000000 0000000000000800 0000000000000800 0000000000000000
> [    0.170985][    T1] GPR08: 0000000000000000 c00000000122c298 c00000003fffc000 c00000007fd05ce8
> [    0.170985][    T1] GPR12: c00000007e0119f8 c000000001930000 00000000ffff8ade 0000000000000000
> [    0.170985][    T1] GPR16: c00000007e3c0640 0000000000000917 c00000007e3c0658 0000000000000008
> [    0.170985][    T1] GPR20: c0000000015d0bb8 00000000ffff8ade c000000000f57400 c000000001817c28
> [    0.170985][    T1] GPR24: c00000000176dc80 c00000007e3c0890 c00000007e3cfe00 0000000000000000
> [    0.170985][    T1] GPR28: c000000001772310 c00000007e011900 c00000007e3c0800 0000000000000001
> [    0.172750][    T1] NIP [c00000000018c128] build_sched_domains+0x808/0x14b0
> [    0.172900][    T1] LR [c00000000018c0cc] build_sched_domains+0x7ac/0x14b0
> [    0.173186][    T1] Call Trace:
> [    0.173484][    T1] [c00000007e343b10] [c00000000018bfe8] build_sched_domains+0x6c8/0x14b0 (unreliable)
> [    0.173821][    T1] [c00000007e343c50] [c00000000018dcdc] sched_init_domains+0xec/0x130
> [    0.174037][    T1] [c00000007e343ca0] [c0000000010d59d8] sched_init_smp+0x50/0xc4
> [    0.174207][    T1] [c00000007e343cd0] [c0000000010b45c4] kernel_init_freeable+0x1b4/0x378
> [    0.174378][    T1] [c00000007e343db0] [c0000000000129fc] kernel_init+0x24/0x158
> [    0.174740][    T1] [c00000007e343e20] [c00000000000d9d0] ret_from_kernel_thread+0x5c/0x6c
> [    0.175050][    T1] Instruction dump:
> [    0.175626][    T1] 554905ee 71480040 7d2907b4 4182016c 2c290000 3920006e 913e002c 41820034
> [    0.175841][    T1] 7c6307b4 e9300020 78631f24 7d58182a <7d2a482a> f93e0080 7d404828 314a0001
> [    0.178340][    T1] ---[ end trace 6876b88dd1d4b3bb ]---
> [    0.178512][    T1]
> [    1.180458][    T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> 
> That's qemu:
> 
> qemu-system-ppc64 -nographic -vga none -M pseries -cpu POWER8 \
>   -kernel build~/vmlinux \
>   -m 2G,slots=2,maxmem=4G \
>   -object memory-backend-ram,size=1G,id=m0 \
>   -object memory-backend-ram,size=1G,id=m1 \
>   -numa node,nodeid=0,memdev=m0 \
>   -numa node,nodeid=1,memdev=m1 \
>   -smp 2,sockets=2,maxcpus=2  \

PowerKVM guest vCPUs does not yet have L2 and L3 cache elements
I had this bug raised some time ago, probably related?
https://bugs.launchpad.net/qemu/+bug/1774605


Regards,
-Satheesh.
> 
> 
> On mambo I get:
> 
> [    0.005069][    T1] smp: Bringing up secondary CPUs ...
> [    0.011656][    T1] smp: Brought up 2 nodes, 8 CPUs
> [    0.011682][    T1] numa: Node 0 CPUs: 0-3
> [    0.011709][    T1] numa: Node 1 CPUs: 4-7
> [    0.012015][    T1] BUG: arch topology borken
> [    0.012040][    T1]      the SMT domain not a subset of the CACHE domain
> [    0.012107][    T1] BUG: Unable to handle kernel data access on read at 0x80000001012e7398
> [    0.012142][    T1] Faulting instruction address: 0xc0000000001aa4f0
> [    0.012174][    T1] Oops: Kernel access of bad area, sig: 11 [#1]
> [    0.012206][    T1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
> [    0.012236][    T1] Modules linked in:
> [    0.012264][    T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc2-00095-g7430ad5aa700 #1
> [    0.012304][    T1] NIP:  c0000000001aa4f0 LR: c0000000001aa498 CTR: 0000000000000000
> [    0.012341][    T1] REGS: c0000000ef583880 TRAP: 0380   Not tainted  (5.9.0-rc2-00095-g7430ad5aa700)
> [    0.012379][    T1] MSR:  9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 44002222  XER: 00040000
> [    0.012439][    T1] CFAR: c0000000000101b0 IRQMASK: 0
> [    0.012439][    T1] GPR00: 0000000000000000 c0000000ef583b10 c0000000017fd000 0000000000004000
> [    0.012439][    T1] GPR04: 0000000000000000 0000000000000800 0000000000000000 0000000000000000
> [    0.012439][    T1] GPR08: 0000000000000000 c0000000012eb398 c0000000ffffc000 0000000000000000
> [    0.012439][    T1] GPR12: 0000000000000020 c0000000019f0000 00000000ffff8ad1 0000000000000000
> [    0.012439][    T1] GPR16: c0000000ef068658 c0000000018d7ba8 0000000000000008 c000000001690bb8
> [    0.012439][    T1] GPR20: c00000000182dc80 c0000000ef06be90 00000000ffff8ad1 c000000001014aa8
> [    0.012439][    T1] GPR24: 0000000000000917 c0000000ef068e00 0000000000000000 c0000000ef06be00
> [    0.012439][    T1] GPR28: 0000000000000001 c0000000ef068640 c0000000ef4a1800 c000000001832310
> [    0.012774][    T1] NIP [c0000000001aa4f0] build_sched_domains+0x5c0/0x14f0
> [    0.012812][    T1] LR [c0000000001aa498] build_sched_domains+0x568/0x14f0
> [    0.012842][    T1] Call Trace:
> [    0.012872][    T1] [c0000000ef583b10] [c0000000001aa3b4] build_sched_domains+0x484/0x14f0 (unreliable)
> [    0.012922][    T1] [c0000000ef583c50] [c0000000001ac3d8] sched_init_domains+0xd8/0x120
> [    0.012966][    T1] [c0000000ef583ca0] [c0000000011962d0] sched_init_smp+0x50/0xc4
> [    0.013008][    T1] [c0000000ef583cd0] [c00000000117451c] kernel_init_freeable+0x1b4/0x378
> [    0.013051][    T1] [c0000000ef583db0] [c000000000012994] kernel_init+0x2c/0x158
> [    0.013092][    T1] [c0000000ef583e20] [c00000000000d9d0] ret_from_kernel_thread+0x5c/0x6c
> [    0.013128][    T1] Instruction dump:
> [    0.013151][    T1] e93b003a 712a0040 552a05ee 418203c4 2c2a0000 3920006e 913b002c 41820034
> [    0.013206][    T1] 7c6307b4 e93d0020 78631f24 7d54182a <7d2a482a> f93b0080 7d404828 314a0001
> [    0.013267][    T1] ---[ end trace 1bf5f6f38a9fd096 ]---
> 
> 
> Did I miss a lead-up patch?
> 
> See here for what I have applied:
>   https://github.com/linuxppc/linux/commits/next-test
> 
> cheers

^ permalink raw reply

* [PATCH] powerpc/papr_scm: Add PAPR command family to pass-through command-set
From: Vaibhav Jain @ 2020-09-13 21:19 UTC (permalink / raw)
  To: linuxppc-dev, linux-nvdimm
  Cc: Santosh Sivaraj, Oliver O'Halloran, Aneesh Kumar K . V,
	Vaibhav Jain, Dan Williams, Ira Weiny

Add NVDIMM_FAMILY_PAPR to the list of valid 'dimm_family_mask'
acceptable by papr_scm. This is needed as since commit
92fe2aa859f5 ("libnvdimm: Validate command family indices") libnvdimm
performs a validation of 'nd_cmd_pkg.nd_family' received as part of
ND_CMD_CALL processing to ensure only known command families can use
the general ND_CMD_CALL pass-through functionality.

Without this change the ND_CMD_CALL pass-through targeting
NVDIMM_FAMILY_PAPR error out with -EINVAL.

Fixes: 92fe2aa859f5 ("libnvdimm: Validate command family indices")
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 5493bc847bd08..27268370dee00 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -898,6 +898,9 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 	p->bus_desc.of_node = p->pdev->dev.of_node;
 	p->bus_desc.provider_name = kstrdup(p->pdev->name, GFP_KERNEL);
 
+	/* Set the dimm command family mask to accept PDSMs */
+	set_bit(NVDIMM_FAMILY_PAPR, &p->bus_desc.dimm_family_mask);
+
 	if (!p->bus_desc.provider_name)
 		return -ENOMEM;
 
-- 
2.26.2


^ permalink raw reply related

* [PATCH] powerpc/papr_scm: Support dynamic enable/disable of performance statistics
From: Vaibhav Jain @ 2020-09-13 21:21 UTC (permalink / raw)
  To: linuxppc-dev, linux-nvdimm
  Cc: Santosh Sivaraj, Oliver O'Halloran, Aneesh Kumar K . V,
	Vaibhav Jain, Dan Williams, Ira Weiny

Collection of performance statistics of an NVDIMM can be dynamically
enabled/disabled from the Hypervisor Management Console even when the
guest lpar is running. The current implementation however will check if
the performance statistics collection is supported during NVDIMM probe
and if yes will assume that to be the case afterwards.

Hence we update papr_scm to remove this assumption from the code by
eliminating the 'stat_buffer_len' member from 'struct papr_scm_priv'
that was used to cache the max buffer size needed to fetch NVDIMM
performance stats from PHYP. With that struct member gone, various
functions that depended on it are updated. Specifically
perf_stats_show() is updated to query the PHYP first for the size of
buffer needed to hold all performance statistics instead of relying on
'stat_buffer_len'

Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 arch/powerpc/platforms/pseries/papr_scm.c | 53 +++++++++++------------
 1 file changed, 25 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 27268370dee00..6697e1c3b9ebe 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -112,9 +112,6 @@ struct papr_scm_priv {
 
 	/* Health information for the dimm */
 	u64 health_bitmap;
-
-	/* length of the stat buffer as expected by phyp */
-	size_t stat_buffer_len;
 };
 
 static LIST_HEAD(papr_nd_regions);
@@ -230,14 +227,15 @@ static int drc_pmem_query_n_bind(struct papr_scm_priv *p)
  * - If buff_stats == NULL the return value is the size in byes of the buffer
  * needed to hold all supported performance-statistics.
  * - If buff_stats != NULL and num_stats == 0 then we copy all known
- * performance-statistics to 'buff_stat' and expect to be large enough to
- * hold them.
+ * performance-statistics to 'buff_stat' and expect it to be large enough to
+ * hold them. The 'buff_size' args contains the size of the 'buff_stats'
  * - if buff_stats != NULL and num_stats > 0 then copy the requested
  * performance-statistics to buff_stats.
  */
 static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
 				    struct papr_scm_perf_stats *buff_stats,
-				    unsigned int num_stats)
+				    unsigned int num_stats,
+				    size_t buff_size)
 {
 	unsigned long ret[PLPAR_HCALL_BUFSIZE];
 	size_t size;
@@ -261,12 +259,18 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
 			size = sizeof(struct papr_scm_perf_stats) +
 				num_stats * sizeof(struct papr_scm_perf_stat);
 		else
-			size = p->stat_buffer_len;
+			size = buff_size;
 	} else {
 		/* In case of no out buffer ignore the size */
 		size = 0;
 	}
 
+	/* verify that the buffer size needed is sufficient */
+	if (size > buff_size) {
+		__WARN();
+		return -EINVAL;
+	}
+
 	/* Do the HCALL asking PHYP for info */
 	rc = plpar_hcall(H_SCM_PERFORMANCE_STATS, ret, p->drc_index,
 			 buff_stats ? virt_to_phys(buff_stats) : 0,
@@ -277,6 +281,10 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
 		dev_err(&p->pdev->dev,
 			"Unknown performance stats, Err:0x%016lX\n", ret[0]);
 		return -ENOENT;
+	} else if (rc == H_AUTHORITY) {
+		dev_dbg(&p->pdev->dev,
+			"Performance stats in-accessible\n");
+		return -EPERM;
 	} else if (rc != H_SUCCESS) {
 		dev_err(&p->pdev->dev,
 			"Failed to query performance stats, Err:%lld\n", rc);
@@ -526,10 +534,6 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
 	struct papr_scm_perf_stat *stat;
 	struct papr_scm_perf_stats *stats;
 
-	/* Silently fail if fetching performance metrics isn't  supported */
-	if (!p->stat_buffer_len)
-		return 0;
-
 	/* Allocate request buffer enough to hold single performance stat */
 	size = sizeof(struct papr_scm_perf_stats) +
 		sizeof(struct papr_scm_perf_stat);
@@ -543,9 +547,11 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
 	stat->stat_val = 0;
 
 	/* Fetch the fuel gauge and populate it in payload */
-	rc = drc_pmem_query_stats(p, stats, 1);
+	rc = drc_pmem_query_stats(p, stats, 1, size);
 	if (rc < 0) {
 		dev_dbg(&p->pdev->dev, "Err(%d) fetching fuel gauge\n", rc);
+		/* Silently fail if unable to fetch performance metric */
+		rc = 0;
 		goto free_stats;
 	}
 
@@ -786,23 +792,25 @@ static ssize_t perf_stats_show(struct device *dev,
 			       struct device_attribute *attr, char *buf)
 {
 	int index;
-	ssize_t rc;
+	ssize_t rc, buff_len;
 	struct seq_buf s;
 	struct papr_scm_perf_stat *stat;
 	struct papr_scm_perf_stats *stats;
 	struct nvdimm *dimm = to_nvdimm(dev);
 	struct papr_scm_priv *p = nvdimm_provider_data(dimm);
 
-	if (!p->stat_buffer_len)
-		return -ENOENT;
+	/* fetch the length of buffer needed to get all stats */
+	buff_len = drc_pmem_query_stats(p, NULL, 0, 0);
+	if (buff_len <= 0)
+		return buff_len;
 
 	/* Allocate the buffer for phyp where stats are written */
-	stats = kzalloc(p->stat_buffer_len, GFP_KERNEL);
+	stats = kzalloc(buff_len, GFP_KERNEL);
 	if (!stats)
 		return -ENOMEM;
 
 	/* Ask phyp to return all dimm perf stats */
-	rc = drc_pmem_query_stats(p, stats, 0);
+	rc = drc_pmem_query_stats(p, stats, 0, buff_len);
 	if (rc)
 		goto free_stats;
 	/*
@@ -891,7 +899,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 	struct nd_region_desc ndr_desc;
 	unsigned long dimm_flags;
 	int target_nid, online_nid;
-	ssize_t stat_size;
 
 	p->bus_desc.ndctl = papr_scm_ndctl;
 	p->bus_desc.module = THIS_MODULE;
@@ -962,16 +969,6 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 	list_add_tail(&p->region_list, &papr_nd_regions);
 	mutex_unlock(&papr_ndr_lock);
 
-	/* Try retriving the stat buffer and see if its supported */
-	stat_size = drc_pmem_query_stats(p, NULL, 0);
-	if (stat_size > 0) {
-		p->stat_buffer_len = stat_size;
-		dev_dbg(&p->pdev->dev, "Max perf-stat size %lu-bytes\n",
-			p->stat_buffer_len);
-	} else {
-		dev_info(&p->pdev->dev, "Dimm performance stats unavailable\n");
-	}
-
 	return 0;
 
 err:	nvdimm_bus_unregister(p->bus);
-- 
2.26.2


^ permalink raw reply related

* Re: [PATCH] spi: fsl-espi: Only process interrupts for expected events
From: Chris Packham @ 2020-09-13 22:03 UTC (permalink / raw)
  To: broonie@kernel.org, npiggin@gmail.com, hkallweit1@gmail.com,
	Joakim Tjernlund
  Cc: linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org, linux-spi@vger.kernel.org
In-Reply-To: <20200904002812.7300-1-chris.packham@alliedtelesis.co.nz>

Hi All,

On 4/09/20 12:28 pm, Chris Packham wrote:
> The SPIE register contains counts for the TX FIFO so any time the irq
> handler was invoked we would attempt to process the RX/TX fifos. Use the
> SPIM value to mask the events so that we only process interrupts that
> were expected.
>
> This was a latent issue exposed by commit 3282a3da25bd ("powerpc/64:
> Implement soft interrupt replay in C").
>
> Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
> Cc: stable@vger.kernel.org
> ---
ping?
>
> Notes:
>      I've tested this on a T2080RDB and a custom board using the T2081 SoC. With
>      this change I don't see any spurious instances of the "Transfer done but
>      SPIE_DON isn't set!" or "Transfer done but rx/tx fifo's aren't empty!" messages
>      and the updates to spi flash are successful.
>      
>      I think this should go into the stable trees that contain 3282a3da25bd but I
>      haven't added a Fixes: tag because I think 3282a3da25bd exposed the issue as
>      opposed to causing it.
>
>   drivers/spi/spi-fsl-espi.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
> index 7e7c92cafdbb..cb120b68c0e2 100644
> --- a/drivers/spi/spi-fsl-espi.c
> +++ b/drivers/spi/spi-fsl-espi.c
> @@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi *espi, u32 events)
>   static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
>   {
>   	struct fsl_espi *espi = context_data;
> -	u32 events;
> +	u32 events, mask;
>   
>   	spin_lock(&espi->lock);
>   
>   	/* Get interrupt events(tx/rx) */
>   	events = fsl_espi_read_reg(espi, ESPI_SPIE);
> -	if (!events) {
> +	mask = fsl_espi_read_reg(espi, ESPI_SPIM);
> +	if (!(events & mask)) {
>   		spin_unlock(&espi->lock);
>   		return IRQ_NONE;
>   	}

^ permalink raw reply

* Re: [PATCH v1 4/5] powerpc/fault: Avoid heavy search_exception_tables() verification
From: Nicholas Piggin @ 2020-09-14  2:23 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Benjamin Herrenschmidt, Christophe Leroy,
	Michael Ellerman, Paul Mackerras
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <1ded5e11-a9e0-a98f-295c-c623e0a5ed36@csgroup.eu>

Excerpts from Christophe Leroy's message of September 9, 2020 4:20 pm:
> 
> 
> Le 09/09/2020 à 08:04, Aneesh Kumar K.V a écrit :
>> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>> 
>>> search_exception_tables() is an heavy operation, we have to avoid it.
>>> When KUAP is selected, we'll know the fault has been blocked by KUAP.
>>> Otherwise, it behaves just as if the address was already in the TLBs
>>> and no fault was generated.
>>>
>>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>>> ---
>>>   arch/powerpc/mm/fault.c | 20 +++++---------------
>>>   1 file changed, 5 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
>>> index 525e0c2b5406..edde169ba3a6 100644
>>> --- a/arch/powerpc/mm/fault.c
>>> +++ b/arch/powerpc/mm/fault.c
>>> @@ -214,24 +214,14 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned long error_code,
>>>   	if (address >= TASK_SIZE)
>>>   		return true;
>>>   
>>> -	if (!is_exec && (error_code & DSISR_PROTFAULT) &&
>>> -	    !search_exception_tables(regs->nip)) {
>>> +	// Read/write fault blocked by KUAP is bad, it can never succeed.
>>> +	if (bad_kuap_fault(regs, address, is_write)) {
>>>   		pr_crit_ratelimited("Kernel attempted to access user page (%lx) - exploit attempt? (uid: %d)\n",
>>> -				    address,
>>> -				    from_kuid(&init_user_ns, current_uid()));
>>> -	}
>>> -
>>> -	// Fault on user outside of certain regions (eg. copy_tofrom_user()) is bad
>>> -	if (!search_exception_tables(regs->nip))
>>> -		return true;
>> 
>> We still need to keep this ? Without that we detect the lack of
>> exception tables pretty late.
> 
> Is that a problem at all to detect the lack of exception tables late ?
> That case is very unlikely and will lead to failure anyway. So, is it 
> worth impacting performance of the likely case which will always have an 
> exception table and where we expect the exception to run as fast as 
> possible ?
> 
> The other architectures I have looked at (arm64 and x86) only have the 
> exception table search together with the down_read_trylock(&mm->mmap_sem).

Yeah I don't see how it'd be a problem. User could arrange for page 
table to already be at this address and avoid the fault so it's not the 
right way to stop an attacker, KUAP is.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH] spi: fsl-espi: Only process interrupts for expected events
From: Nicholas Piggin @ 2020-09-14  2:28 UTC (permalink / raw)
  To: broonie@kernel.org
  Cc: linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org, linux-spi@vger.kernel.org
In-Reply-To: <ecedc71d-100a-7d7a-ff7f-ef1a3086dd74@alliedtelesis.co.nz>

Excerpts from Chris Packham's message of September 14, 2020 8:03 am:
> Hi All,
> 
> On 4/09/20 12:28 pm, Chris Packham wrote:
>> The SPIE register contains counts for the TX FIFO so any time the irq
>> handler was invoked we would attempt to process the RX/TX fifos. Use the
>> SPIM value to mask the events so that we only process interrupts that
>> were expected.
>>
>> This was a latent issue exposed by commit 3282a3da25bd ("powerpc/64:
>> Implement soft interrupt replay in C").
>>
>> Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
>> Cc: stable@vger.kernel.org
>> ---
> ping?

I don't know the code/hardware but thanks for tracking this down.

Was there anything more to be done with Jocke's observations, or would 
that be a follow-up patch if anything?

If this patch fixes your problem it should probably go in, unless there 
are any objections.

Thanks,
Nick

>>
>> Notes:
>>      I've tested this on a T2080RDB and a custom board using the T2081 SoC. With
>>      this change I don't see any spurious instances of the "Transfer done but
>>      SPIE_DON isn't set!" or "Transfer done but rx/tx fifo's aren't empty!" messages
>>      and the updates to spi flash are successful.
>>      
>>      I think this should go into the stable trees that contain 3282a3da25bd but I
>>      haven't added a Fixes: tag because I think 3282a3da25bd exposed the issue as
>>      opposed to causing it.
>>
>>   drivers/spi/spi-fsl-espi.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
>> index 7e7c92cafdbb..cb120b68c0e2 100644
>> --- a/drivers/spi/spi-fsl-espi.c
>> +++ b/drivers/spi/spi-fsl-espi.c
>> @@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi *espi, u32 events)
>>   static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
>>   {
>>   	struct fsl_espi *espi = context_data;
>> -	u32 events;
>> +	u32 events, mask;
>>   
>>   	spin_lock(&espi->lock);
>>   
>>   	/* Get interrupt events(tx/rx) */
>>   	events = fsl_espi_read_reg(espi, ESPI_SPIE);
>> -	if (!events) {
>> +	mask = fsl_espi_read_reg(espi, ESPI_SPIM);
>> +	if (!(events & mask)) {
>>   		spin_unlock(&espi->lock);
>>   		return IRQ_NONE;
>>   	}

^ permalink raw reply

* Re: [PATCH 12/15] selftests/seccomp: powerpc: Fix seccomp return value testing
From: Michael Ellerman @ 2020-09-14  3:38 UTC (permalink / raw)
  To: Kees Cook, linux-kernel
  Cc: Thadeu Lima de Souza Cascardo, Will Drewry, Kees Cook,
	linux-xtensa, linux-mips, Andy Lutomirski, Max Filippov,
	linux-arm-kernel, linux-kselftest, linuxppc-dev,
	Christian Brauner
In-Reply-To: <20200912110820.597135-13-keescook@chromium.org>

Kees Cook <keescook@chromium.org> writes:
> On powerpc, the errno is not inverted, and depends on ccr.so being
> set. Add this to a powerpc definition of SYSCALL_RET_SET().
>
> Co-developed-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
> Link: https://lore.kernel.org/linux-kselftest/20200911181012.171027-1-cascardo@canonical.com/
> Fixes: 5d83c2b37d43 ("selftests/seccomp: Add powerpc support")
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)

This looks right to me, and matches what strace does AFAICS.

Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>

cheers

> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 623953a53032..bbab2420d708 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -1750,6 +1750,21 @@ TEST_F(TRACE_poke, getpid_runs_normally)
>  # define ARCH_REGS		struct pt_regs
>  # define SYSCALL_NUM(_regs)	(_regs).gpr[0]
>  # define SYSCALL_RET(_regs)	(_regs).gpr[3]
> +# define SYSCALL_RET_SET(_regs, _val)				\
> +	do {							\
> +		typeof(_val) _result = (_val);			\
> +		/*						\
> +		 * A syscall error is signaled by CR0 SO bit	\
> +		 * and the code is stored as a positive value.	\
> +		 */						\
> +		if (_result < 0) {				\
> +			SYSCALL_RET(_regs) = -result;		\
> +			(_regs).ccr |= 0x10000000;		\
> +		} else {					\
> +			SYSCALL_RET(_regs) = result;		\
> +			(_regs).ccr &= ~0x10000000;		\
> +		}						\
> +	} while (0)
>  #elif defined(__s390__)
>  # define ARCH_REGS		s390_regs
>  # define SYSCALL_NUM(_regs)	(_regs).gprs[2]
> -- 
> 2.25.1

^ permalink raw reply

* [PATCH -next] soc/qman: convert to use be32_add_cpu()
From: Liu Shixin @ 2020-09-14  4:17 UTC (permalink / raw)
  To: Li Yang; +Cc: Liu Shixin, linuxppc-dev, linux-kernel, linux-arm-kernel

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
drivers/soc/fsl/qbman/qman_test_api.c---
 drivers/soc/fsl/qbman/qman_test_api.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/soc/fsl/qbman/qman_test_api.c b/drivers/soc/fsl/qbman/qman_test_api.c
index 2895d062cf51..7066b2f1467c 100644
--- a/drivers/soc/fsl/qbman/qman_test_api.c
+++ b/drivers/soc/fsl/qbman/qman_test_api.c
@@ -86,7 +86,7 @@ static void fd_inc(struct qm_fd *fd)
 	len--;
 	qm_fd_set_param(fd, fmt, off, len);
 
-	fd->cmd = cpu_to_be32(be32_to_cpu(fd->cmd) + 1);
+	be32_add_cpu(&fd->cmd, 1);
 }
 
 /* The only part of the 'fd' we can't memcmp() is the ppid */
-- 
2.25.1


^ permalink raw reply related

* [PATCH v2 0/4] more mm switching vs TLB shootdown and lazy tlb fixes
From: Nicholas Piggin @ 2020-09-14  4:52 UTC (permalink / raw)
  To: linux-mm @ kvack . org
  Cc: linux-arch, Jens Axboe, Peter Zijlstra, Aneesh Kumar K . V,
	linux-kernel, Nicholas Piggin, sparclinux, Andrew Morton,
	linuxppc-dev, David S . Miller

This is an attempt to fix a few different related issues around
switching mm, TLB flushing, and lazy tlb mm handling.

This will require all architectures to eventually move to disabling
irqs over activate_mm, but it's possible we could add another arch
call after irqs are re-enabled for those few which can't do their
entire activation with irqs disabled.

Testing so far indicates this has fixed a mm refcounting bug that
powerpc was running into (via distro report and backport). I haven't
had any real feedback on this series outside powerpc (and it doesn't
really affect other archs), so I propose patches 1,2,4 go via the
powerpc tree.

There is no dependency between them and patch 3, I put it there only
because it follows the history of the code (powerpc code was written
using the sparc64 logic), but I guess they have to go via different arch
trees. Dave, I'll leave patch 3 with you.

Thanks,
Nick

Since v1:
- Updates from Michael Ellerman's review comments.

Nicholas Piggin (4):
  mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
  powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
  sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
  powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm

 arch/Kconfig                           |  7 +++
 arch/powerpc/Kconfig                   |  1 +
 arch/powerpc/include/asm/mmu_context.h |  2 +-
 arch/powerpc/include/asm/tlb.h         | 13 ------
 arch/powerpc/mm/book3s64/radix_tlb.c   | 23 ++++++---
 arch/sparc/kernel/smp_64.c             | 65 ++++++--------------------
 fs/exec.c                              | 17 ++++++-
 7 files changed, 54 insertions(+), 74 deletions(-)

-- 
2.23.0


^ permalink raw reply

* [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
From: Nicholas Piggin @ 2020-09-14  4:52 UTC (permalink / raw)
  To: linux-mm @ kvack . org
  Cc: linux-arch, Jens Axboe, Peter Zijlstra, Aneesh Kumar K . V,
	linux-kernel, Nicholas Piggin, sparclinux, Andrew Morton,
	linuxppc-dev, David S . Miller
In-Reply-To: <20200914045219.3736466-1-npiggin@gmail.com>

Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.

This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).

Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:

  call_usermodehelper()
    kernel_execve()
      old_mm = current->mm;
      active_mm = current->active_mm;
      *** preempt *** -------------------->  schedule()
                                               prev->active_mm = NULL;
                                               mmdrop(prev active_mm);
                                             ...
                      <--------------------  schedule()
      current->mm = mm;
      current->active_mm = mm;
      if (!old_mm)
          mmdrop(active_mm);

If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.

Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().

So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.

This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/Kconfig |  7 +++++++
 fs/exec.c    | 17 +++++++++++++++--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..94821e3f94d1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -414,6 +414,13 @@ config MMU_GATHER_NO_GATHER
 	bool
 	depends on MMU_GATHER_TABLE_FREE
 
+config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
+	bool
+	help
+	  Temporary select until all architectures can be converted to have
+	  irqs disabled over activate_mm. Architectures that do IPI based TLB
+	  shootdowns should enable this.
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
 
diff --git a/fs/exec.c b/fs/exec.c
index a91003e28eaa..d4fb18baf1fb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1130,11 +1130,24 @@ static int exec_mmap(struct mm_struct *mm)
 	}
 
 	task_lock(tsk);
-	active_mm = tsk->active_mm;
 	membarrier_exec_mmap(mm);
-	tsk->mm = mm;
+
+	local_irq_disable();
+	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
+	tsk->mm = mm;
+	/*
+	 * This prevents preemption while active_mm is being loaded and
+	 * it and mm are being updated, which could cause problems for
+	 * lazy tlb mm refcounting when these are updated by context
+	 * switches. Not all architectures can handle irqs off over
+	 * activate_mm yet.
+	 */
+	if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+		local_irq_enable();
 	activate_mm(active_mm, mm);
+	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+		local_irq_enable();
 	tsk->mm->vmacache_seqnum = 0;
 	vmacache_flush(tsk);
 	task_unlock(tsk);
-- 
2.23.0


^ permalink raw reply related

* [PATCH v2 2/4] powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
From: Nicholas Piggin @ 2020-09-14  4:52 UTC (permalink / raw)
  To: linux-mm @ kvack . org
  Cc: linux-arch, Jens Axboe, Peter Zijlstra, Aneesh Kumar K . V,
	linux-kernel, Nicholas Piggin, sparclinux, Andrew Morton,
	linuxppc-dev, David S . Miller
In-Reply-To: <20200914045219.3736466-1-npiggin@gmail.com>

powerpc uses IPIs in some situations to switch a kernel thread away
from a lazy tlb mm, which is subject to the TLB flushing race
described in the changelog introducing ARCH_WANT_IRQS_OFF_ACTIVATE_MM.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/Kconfig                   | 1 +
 arch/powerpc/include/asm/mmu_context.h | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 65bed1fdeaad..587ba8352d01 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -149,6 +149,7 @@ config PPC
 	select ARCH_USE_QUEUED_RWLOCKS		if PPC_QUEUED_SPINLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS	if PPC_QUEUED_SPINLOCKS
 	select ARCH_WANT_IPC_PARSE_VERSION
+	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	select ARCH_WEAK_RELEASE_ACQUIRE
 	select BINFMT_ELF
 	select BUILDTIME_TABLE_SORT
diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index a3a12a8341b2..b42813359f49 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -244,7 +244,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 #define activate_mm activate_mm
 static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next)
 {
-	switch_mm(prev, next, current);
+	switch_mm_irqs_off(prev, next, current);
 }
 
 /* We don't currently use enter_lazy_tlb() for anything */
-- 
2.23.0


^ permalink raw reply related

* [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
From: Nicholas Piggin @ 2020-09-14  4:52 UTC (permalink / raw)
  To: linux-mm @ kvack . org
  Cc: linux-arch, Jens Axboe, Peter Zijlstra, Aneesh Kumar K . V,
	linux-kernel, Nicholas Piggin, sparclinux, Andrew Morton,
	linuxppc-dev, David S . Miller
In-Reply-To: <20200914045219.3736466-1-npiggin@gmail.com>

The de facto (and apparently uncommented) standard for using an mm had,
thanks to this code in sparc if nothing else, been that you must have a
reference on mm_users *and that reference must have been obtained with
mmget()*, i.e., from a thread with a reference to mm_users that had used
the mm.

The introduction of mmget_not_zero() in commit d2005e3f41d4
("userfaultfd: don't pin the user memory in userfaultfd_file_create()")
allowed mm_count holders to aoperate on user mappings asynchronously
from the actual threads using the mm, but they were not to load those
mappings into their TLB (i.e., walking vmas and page tables is okay,
kthread_use_mm() is not).

io_uring 2b188cc1bb857 ("Add io_uring IO interface") added code which
does a kthread_use_mm() from a mmget_not_zero() refcount.

The problem with this is code which previously assumed mm == current->mm
and mm->mm_users == 1 implies the mm will remain single-threaded at
least until this thread creates another mm_users reference, has now
broken.

arch/sparc/kernel/smp_64.c:

    if (atomic_read(&mm->mm_users) == 1) {
        cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
        goto local_flush_and_out;
    }

vs fs/io_uring.c

    if (unlikely(!(ctx->flags & IORING_SETUP_SQPOLL) ||
                 !mmget_not_zero(ctx->sqo_mm)))
        return -EFAULT;
    kthread_use_mm(ctx->sqo_mm);

mmget_not_zero() could come in right after the mm_users == 1 test, then
kthread_use_mm() which sets its CPU in the mm_cpumask. That update could
be lost if cpumask_copy() occurs afterward.

I propose we fix this by allowing mmget_not_zero() to be a first-class
reference, and not have this obscure undocumented and unchecked
restriction.

The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
optimisation could be effectively restored by sending IPIs to mm_cpumask
members and having them remove themselves from mm_cpumask. This is more
tricky so I leave it as an exercise for someone with a sparc64 SMP.
powerpc has a (currently similarly broken) example.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/sparc/kernel/smp_64.c | 65 ++++++++------------------------------
 1 file changed, 14 insertions(+), 51 deletions(-)

diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index e286e2badc8a..e38d8bf454e8 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1039,38 +1039,9 @@ void smp_fetch_global_pmu(void)
  * are flush_tlb_*() routines, and these run after flush_cache_*()
  * which performs the flushw.
  *
- * The SMP TLB coherency scheme we use works as follows:
- *
- * 1) mm->cpu_vm_mask is a bit mask of which cpus an address
- *    space has (potentially) executed on, this is the heuristic
- *    we use to avoid doing cross calls.
- *
- *    Also, for flushing from kswapd and also for clones, we
- *    use cpu_vm_mask as the list of cpus to make run the TLB.
- *
- * 2) TLB context numbers are shared globally across all processors
- *    in the system, this allows us to play several games to avoid
- *    cross calls.
- *
- *    One invariant is that when a cpu switches to a process, and
- *    that processes tsk->active_mm->cpu_vm_mask does not have the
- *    current cpu's bit set, that tlb context is flushed locally.
- *
- *    If the address space is non-shared (ie. mm->count == 1) we avoid
- *    cross calls when we want to flush the currently running process's
- *    tlb state.  This is done by clearing all cpu bits except the current
- *    processor's in current->mm->cpu_vm_mask and performing the
- *    flush locally only.  This will force any subsequent cpus which run
- *    this task to flush the context from the local tlb if the process
- *    migrates to another cpu (again).
- *
- * 3) For shared address spaces (threads) and swapping we bite the
- *    bullet for most cases and perform the cross call (but only to
- *    the cpus listed in cpu_vm_mask).
- *
- *    The performance gain from "optimizing" away the cross call for threads is
- *    questionable (in theory the big win for threads is the massive sharing of
- *    address space state across processors).
+ * mm->cpu_vm_mask is a bit mask of which cpus an address
+ * space has (potentially) executed on, this is the heuristic
+ * we use to limit cross calls.
  */
 
 /* This currently is only used by the hugetlb arch pre-fault
@@ -1080,18 +1051,13 @@ void smp_fetch_global_pmu(void)
 void smp_flush_tlb_mm(struct mm_struct *mm)
 {
 	u32 ctx = CTX_HWBITS(mm->context);
-	int cpu = get_cpu();
 
-	if (atomic_read(&mm->mm_users) == 1) {
-		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
-		goto local_flush_and_out;
-	}
+	get_cpu();
 
 	smp_cross_call_masked(&xcall_flush_tlb_mm,
 			      ctx, 0, 0,
 			      mm_cpumask(mm));
 
-local_flush_and_out:
 	__flush_tlb_mm(ctx, SECONDARY_CONTEXT);
 
 	put_cpu();
@@ -1114,17 +1080,15 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
 {
 	u32 ctx = CTX_HWBITS(mm->context);
 	struct tlb_pending_info info;
-	int cpu = get_cpu();
+
+	get_cpu();
 
 	info.ctx = ctx;
 	info.nr = nr;
 	info.vaddrs = vaddrs;
 
-	if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
-		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
-	else
-		smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
-				       &info, 1);
+	smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
+			       &info, 1);
 
 	__flush_tlb_pending(ctx, nr, vaddrs);
 
@@ -1134,14 +1098,13 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
 void smp_flush_tlb_page(struct mm_struct *mm, unsigned long vaddr)
 {
 	unsigned long context = CTX_HWBITS(mm->context);
-	int cpu = get_cpu();
 
-	if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
-		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
-	else
-		smp_cross_call_masked(&xcall_flush_tlb_page,
-				      context, vaddr, 0,
-				      mm_cpumask(mm));
+	get_cpu();
+
+	smp_cross_call_masked(&xcall_flush_tlb_page,
+			      context, vaddr, 0,
+			      mm_cpumask(mm));
+
 	__flush_tlb_page(context, vaddr);
 
 	put_cpu();
-- 
2.23.0


^ permalink raw reply related

* [PATCH v2 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm
From: Nicholas Piggin @ 2020-09-14  4:52 UTC (permalink / raw)
  To: linux-mm @ kvack . org
  Cc: linux-arch, Jens Axboe, Peter Zijlstra, Aneesh Kumar K . V,
	linux-kernel, Nicholas Piggin, sparclinux, Andrew Morton,
	linuxppc-dev, David S . Miller
In-Reply-To: <20200914045219.3736466-1-npiggin@gmail.com>

Commit 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of
single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
a process under certain conditions. One of the assumptions is that
mm_users would not be incremented via a reference outside the process
context with mmget_not_zero() then go on to kthread_use_mm() via that
reference.

That invariant was broken by io_uring code (see previous sparc64 fix),
but I'll point Fixes: to the original powerpc commit because we are
changing that assumption going forward, so this will make backports
match up.

Fix this by no longer relying on that assumption, but by having each CPU
check the mm is not being used, and clearing their own bit from the mask
only if it hasn't been switched-to by the time the IPI is processed.

This relies on commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate") and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to disable irqs over mm
switch sequences.

Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
Depends-on: 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB invalidate")
Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/tlb.h       | 13 -------------
 arch/powerpc/mm/book3s64/radix_tlb.c | 23 ++++++++++++++++-------
 2 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index fbc6f3002f23..d97f061fecac 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -66,19 +66,6 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
 		return false;
 	return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
 }
-static inline void mm_reset_thread_local(struct mm_struct *mm)
-{
-	WARN_ON(atomic_read(&mm->context.copros) > 0);
-	/*
-	 * It's possible for mm_access to take a reference on mm_users to
-	 * access the remote mm from another thread, but it's not allowed
-	 * to set mm_cpumask, so mm_users may be > 1 here.
-	 */
-	WARN_ON(current->mm != mm);
-	atomic_set(&mm->context.active_cpus, 1);
-	cpumask_clear(mm_cpumask(mm));
-	cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
-}
 #else /* CONFIG_PPC_BOOK3S_64 */
 static inline int mm_is_thread_local(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 0d233763441f..143b4fd396f0 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -645,19 +645,29 @@ static void do_exit_flush_lazy_tlb(void *arg)
 	struct mm_struct *mm = arg;
 	unsigned long pid = mm->context.id;
 
+	/*
+	 * A kthread could have done a mmget_not_zero() after the flushing CPU
+	 * checked mm_is_singlethreaded, and be in the process of
+	 * kthread_use_mm when interrupted here. In that case, current->mm will
+	 * be set to mm, because kthread_use_mm() setting ->mm and switching to
+	 * the mm is done with interrupts off.
+	 */
 	if (current->mm == mm)
-		return; /* Local CPU */
+		goto out_flush;
 
 	if (current->active_mm == mm) {
-		/*
-		 * Must be a kernel thread because sender is single-threaded.
-		 */
-		BUG_ON(current->mm);
+		WARN_ON_ONCE(current->mm != NULL);
+		/* Is a kernel thread and is using mm as the lazy tlb */
 		mmgrab(&init_mm);
-		switch_mm(mm, &init_mm, current);
 		current->active_mm = &init_mm;
+		switch_mm_irqs_off(mm, &init_mm, current);
 		mmdrop(mm);
 	}
+
+	atomic_dec(&mm->context.active_cpus);
+	cpumask_clear_cpu(smp_processor_id(), mm_cpumask(mm));
+
+out_flush:
 	_tlbiel_pid(pid, RIC_FLUSH_ALL);
 }
 
@@ -672,7 +682,6 @@ static void exit_flush_lazy_tlbs(struct mm_struct *mm)
 	 */
 	smp_call_function_many(mm_cpumask(mm), do_exit_flush_lazy_tlb,
 				(void *)mm, 1);
-	mm_reset_thread_local(mm);
 }
 
 void radix__flush_tlb_mm(struct mm_struct *mm)
-- 
2.23.0


^ permalink raw reply related

* Re: [PATCH 13/15] selftests/seccomp: powerpc: Set syscall return during ptrace syscall exit
From: Michael Ellerman @ 2020-09-14  5:47 UTC (permalink / raw)
  To: Kees Cook, linux-kernel
  Cc: Thadeu Lima de Souza Cascardo, Will Drewry, Kees Cook,
	linux-xtensa, linux-mips, Andy Lutomirski, Max Filippov,
	linux-arm-kernel, linux-kselftest, linuxppc-dev,
	Christian Brauner
In-Reply-To: <20200912110820.597135-14-keescook@chromium.org>

Kees Cook <keescook@chromium.org> writes:
> Some archs (like ppc) only support changing the return code during
> syscall exit when ptrace is used. As the syscall number might not
> be available anymore during syscall exit, it needs to be saved
> during syscall enter. Adjust the ptrace tests to do this.

I'm not that across all the fixture stuff, but if I'm reading it right
you're now calling change_syscall() on both entry and exit for all
arches.

That should work, but it no longer tests changing the return code on
entry on the arches that support it, which seems like a backward step?

cheers


> Reported-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
> Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
> Link: https://lore.kernel.org/linux-kselftest/20200911181012.171027-1-cascardo@canonical.com/
> Fixes: 58d0a862f573 ("seccomp: add tests for ptrace hole")
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 34 +++++++++++--------
>  1 file changed, 20 insertions(+), 14 deletions(-)
>
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index bbab2420d708..26c712c6a575 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -1949,12 +1949,19 @@ void tracer_seccomp(struct __test_metadata *_metadata, pid_t tracee,
>  
>  }
>  
> +FIXTURE(TRACE_syscall) {
> +	struct sock_fprog prog;
> +	pid_t tracer, mytid, mypid, parent;
> +	long syscall_nr;
> +};
> +
>  void tracer_ptrace(struct __test_metadata *_metadata, pid_t tracee,
>  		   int status, void *args)
>  {
> -	int ret, nr;
> +	int ret;
>  	unsigned long msg;
>  	static bool entry;
> +	FIXTURE_DATA(TRACE_syscall) *self = args;
>  
>  	/*
>  	 * The traditional way to tell PTRACE_SYSCALL entry/exit
> @@ -1968,24 +1975,23 @@ void tracer_ptrace(struct __test_metadata *_metadata, pid_t tracee,
>  	EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
>  			: PTRACE_EVENTMSG_SYSCALL_EXIT, msg);
>  
> -	if (!entry)
> -		return;
> -
> -	nr = get_syscall(_metadata, tracee);
> +	/*
> +	 * Some architectures only support setting return values during
> +	 * syscall exit under ptrace, and on exit the syscall number may
> +	 * no longer be available. Therefore, save it here, and call
> +	 * "change syscall and set return values" on both entry and exit.
> +	 */
> +	if (entry)
> +		self->syscall_nr = get_syscall(_metadata, tracee);
>  
> -	if (nr == __NR_getpid)
> +	if (self->syscall_nr == __NR_getpid)
>  		change_syscall(_metadata, tracee, __NR_getppid, 0);
> -	if (nr == __NR_gettid)
> +	if (self->syscall_nr == __NR_gettid)
>  		change_syscall(_metadata, tracee, -1, 45000);
> -	if (nr == __NR_openat)
> +	if (self->syscall_nr == __NR_openat)
>  		change_syscall(_metadata, tracee, -1, -ESRCH);
>  }
>  
> -FIXTURE(TRACE_syscall) {
> -	struct sock_fprog prog;
> -	pid_t tracer, mytid, mypid, parent;
> -};
> -
>  FIXTURE_VARIANT(TRACE_syscall) {
>  	/*
>  	 * All of the SECCOMP_RET_TRACE behaviors can be tested with either
> @@ -2044,7 +2050,7 @@ FIXTURE_SETUP(TRACE_syscall)
>  	self->tracer = setup_trace_fixture(_metadata,
>  					   variant->use_ptrace ? tracer_ptrace
>  							       : tracer_seccomp,
> -					   NULL, variant->use_ptrace);
> +					   self, variant->use_ptrace);
>  
>  	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
>  	ASSERT_EQ(0, ret);
> -- 
> 2.25.1

^ permalink raw reply

* [PATCH -next] macintosh: windfarm: use for_each_child_of_node() macro
From: Qinglang Miao @ 2020-09-14  6:14 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, linux-kernel, Qinglang Miao

Use for_each_child_of_node() macro instead of open coding it.

Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
---
 drivers/macintosh/windfarm_smu_sat.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/macintosh/windfarm_smu_sat.c b/drivers/macintosh/windfarm_smu_sat.c
index cb75dc035..e46e1153a 100644
--- a/drivers/macintosh/windfarm_smu_sat.c
+++ b/drivers/macintosh/windfarm_smu_sat.c
@@ -216,8 +216,7 @@ static int wf_sat_probe(struct i2c_client *client,
 
 	vsens[0] = vsens[1] = -1;
 	isens[0] = isens[1] = -1;
-	child = NULL;
-	while ((child = of_get_next_child(dev, child)) != NULL) {
+	for_each_child_of_node(dev, child) {
 		reg = of_get_property(child, "reg", NULL);
 		loc = of_get_property(child, "location", NULL);
 		if (reg == NULL || loc == NULL)
-- 
2.23.0


^ permalink raw reply related

* Re: [PATCH] KVM: PPC: Book3S HV: Do not allocate HPT for a nested guest
From: David Gibson @ 2020-09-14  6:12 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: linuxppc-dev, kvm, kvm-ppc
In-Reply-To: <20200911041607.198092-1-farosas@linux.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 4819 bytes --]

On Fri, Sep 11, 2020 at 01:16:07AM -0300, Fabiano Rosas wrote:
> The current nested KVM code does not support HPT guests. This is
> informed/enforced in some ways:
> 
> - Hosts < P9 will not be able to enable the nested HV feature;
> 
> - The nested hypervisor MMU capabilities will not contain
>   KVM_CAP_PPC_MMU_HASH_V3;
> 
> - QEMU reflects the MMU capabilities in the
>   'ibm,arch-vec-5-platform-support' device-tree property;
> 
> - The nested guest, at 'prom_parse_mmu_model' ignores the
>   'disable_radix' kernel command line option if HPT is not supported;
> 
> - The KVM_PPC_CONFIGURE_V3_MMU ioctl will fail if trying to use HPT.
> 
> There is, however, still a way to start a HPT guest by using
> max-compat-cpu=power8 at the QEMU machine options. This leads to the
> guest being set to use hash after QEMU calls the KVM_PPC_ALLOCATE_HTAB
> ioctl.
> 
> With the guest set to hash, the nested hypervisor goes through the
> entry path that has no knowledge of nesting (kvmppc_run_vcpu) and
> crashes when it tries to execute an hypervisor-privileged (mtspr
> HDEC) instruction at __kvmppc_vcore_entry:
> 
> root@L1:~ $ qemu-system-ppc64 -machine pseries,max-cpu-compat=power8 ...
> 
> <snip>
> [  538.543303] CPU: 83 PID: 25185 Comm: CPU 0/KVM Not tainted 5.9.0-rc4 #1
> [  538.543355] NIP:  c00800000753f388 LR: c00800000753f368 CTR: c0000000001e5ec0
> [  538.543417] REGS: c0000013e91e33b0 TRAP: 0700   Not tainted  (5.9.0-rc4)
> [  538.543470] MSR:  8000000002843033 <SF,VEC,VSX,FP,ME,IR,DR,RI,LE>  CR: 22422882  XER: 20040000
> [  538.543546] CFAR: c00800000753f4b0 IRQMASK: 3
>                GPR00: c0080000075397a0 c0000013e91e3640 c00800000755e600 0000000080000000
>                GPR04: 0000000000000000 c0000013eab19800 c000001394de0000 00000043a054db72
>                GPR08: 00000000003b1652 0000000000000000 0000000000000000 c0080000075502e0
>                GPR12: c0000000001e5ec0 c0000007ffa74200 c0000013eab19800 0000000000000008
>                GPR16: 0000000000000000 c00000139676c6c0 c000000001d23948 c0000013e91e38b8
>                GPR20: 0000000000000053 0000000000000000 0000000000000001 0000000000000000
>                GPR24: 0000000000000001 0000000000000001 0000000000000000 0000000000000001
>                GPR28: 0000000000000001 0000000000000053 c0000013eab19800 0000000000000001
> [  538.544067] NIP [c00800000753f388] __kvmppc_vcore_entry+0x90/0x104 [kvm_hv]
> [  538.544121] LR [c00800000753f368] __kvmppc_vcore_entry+0x70/0x104 [kvm_hv]
> [  538.544173] Call Trace:
> [  538.544196] [c0000013e91e3640] [c0000013e91e3680] 0xc0000013e91e3680 (unreliable)
> [  538.544260] [c0000013e91e3820] [c0080000075397a0] kvmppc_run_core+0xbc8/0x19d0 [kvm_hv]
> [  538.544325] [c0000013e91e39e0] [c00800000753d99c] kvmppc_vcpu_run_hv+0x404/0xc00 [kvm_hv]
> [  538.544394] [c0000013e91e3ad0] [c0080000072da4fc] kvmppc_vcpu_run+0x34/0x48 [kvm]
> [  538.544472] [c0000013e91e3af0] [c0080000072d61b8] kvm_arch_vcpu_ioctl_run+0x310/0x420 [kvm]
> [  538.544539] [c0000013e91e3b80] [c0080000072c7450] kvm_vcpu_ioctl+0x298/0x778 [kvm]
> [  538.544605] [c0000013e91e3ce0] [c0000000004b8c2c] sys_ioctl+0x1dc/0xc90
> [  538.544662] [c0000013e91e3dc0] [c00000000002f9a4] system_call_exception+0xe4/0x1c0
> [  538.544726] [c0000013e91e3e20] [c00000000000d140] system_call_common+0xf0/0x27c
> [  538.544787] Instruction dump:
> [  538.544821] f86d1098 60000000 60000000 48000099 e8ad0fe8 e8c500a0 e9264140 75290002
> [  538.544886] 7d1602a6 7cec42a6 40820008 7d0807b4 <7d164ba6> 7d083a14 f90d10a0 480104fd
> [  538.544953] ---[ end trace 74423e2b948c2e0c ]---
> 
> This patch makes the KVM_PPC_ALLOCATE_HTAB ioctl fail when running in
> the nested hypervisor, causing QEMU to abort.
> 
> Reported-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
> Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/kvm/book3s_hv.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 4ba06a2a306c..764b6239ef72 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -5250,6 +5250,12 @@ static long kvm_arch_vm_ioctl_hv(struct file *filp,
>  	case KVM_PPC_ALLOCATE_HTAB: {
>  		u32 htab_order;
>  
> +		/* If we're a nested hypervisor, we currently only support radix */
> +		if (kvmhv_on_pseries()) {
> +			r = -EOPNOTSUPP;
> +			break;
> +		}
> +
>  		r = -EFAULT;
>  		if (get_user(htab_order, (u32 __user *)argp))
>  			break;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
From: Nicholas Piggin @ 2020-09-14  7:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Jens Axboe, linux-arch, Peter Zijlstra, Aneesh Kumar K . V,
	linux-kernel, sparclinux, Andrew Morton, linuxppc-dev,
	David S . Miller
In-Reply-To: <20200914045219.3736466-4-npiggin@gmail.com>

Excerpts from Nicholas Piggin's message of September 14, 2020 2:52 pm:

[...]

> The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
> optimisation could be effectively restored by sending IPIs to mm_cpumask
> members and having them remove themselves from mm_cpumask. This is more
> tricky so I leave it as an exercise for someone with a sparc64 SMP.
> powerpc has a (currently similarly broken) example.

So this compiles and boots on qemu, but qemu does not support any
sparc64 machines with SMP. Attempting some simple hacks doesn't get
me far because openbios isn't populating an SMP device tree, which
blows up everywhere.

The patch is _relatively_ simple, hopefully it shouldn't explode, so
it's probably ready for testing on real SMP hardware, if someone has
a few cycles.

Thanks,
Nick

> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  arch/sparc/kernel/smp_64.c | 65 ++++++++------------------------------
>  1 file changed, 14 insertions(+), 51 deletions(-)
> 
> diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
> index e286e2badc8a..e38d8bf454e8 100644
> --- a/arch/sparc/kernel/smp_64.c
> +++ b/arch/sparc/kernel/smp_64.c
> @@ -1039,38 +1039,9 @@ void smp_fetch_global_pmu(void)
>   * are flush_tlb_*() routines, and these run after flush_cache_*()
>   * which performs the flushw.
>   *
> - * The SMP TLB coherency scheme we use works as follows:
> - *
> - * 1) mm->cpu_vm_mask is a bit mask of which cpus an address
> - *    space has (potentially) executed on, this is the heuristic
> - *    we use to avoid doing cross calls.
> - *
> - *    Also, for flushing from kswapd and also for clones, we
> - *    use cpu_vm_mask as the list of cpus to make run the TLB.
> - *
> - * 2) TLB context numbers are shared globally across all processors
> - *    in the system, this allows us to play several games to avoid
> - *    cross calls.
> - *
> - *    One invariant is that when a cpu switches to a process, and
> - *    that processes tsk->active_mm->cpu_vm_mask does not have the
> - *    current cpu's bit set, that tlb context is flushed locally.
> - *
> - *    If the address space is non-shared (ie. mm->count == 1) we avoid
> - *    cross calls when we want to flush the currently running process's
> - *    tlb state.  This is done by clearing all cpu bits except the current
> - *    processor's in current->mm->cpu_vm_mask and performing the
> - *    flush locally only.  This will force any subsequent cpus which run
> - *    this task to flush the context from the local tlb if the process
> - *    migrates to another cpu (again).
> - *
> - * 3) For shared address spaces (threads) and swapping we bite the
> - *    bullet for most cases and perform the cross call (but only to
> - *    the cpus listed in cpu_vm_mask).
> - *
> - *    The performance gain from "optimizing" away the cross call for threads is
> - *    questionable (in theory the big win for threads is the massive sharing of
> - *    address space state across processors).
> + * mm->cpu_vm_mask is a bit mask of which cpus an address
> + * space has (potentially) executed on, this is the heuristic
> + * we use to limit cross calls.
>   */
>  
>  /* This currently is only used by the hugetlb arch pre-fault
> @@ -1080,18 +1051,13 @@ void smp_fetch_global_pmu(void)
>  void smp_flush_tlb_mm(struct mm_struct *mm)
>  {
>  	u32 ctx = CTX_HWBITS(mm->context);
> -	int cpu = get_cpu();
>  
> -	if (atomic_read(&mm->mm_users) == 1) {
> -		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
> -		goto local_flush_and_out;
> -	}
> +	get_cpu();
>  
>  	smp_cross_call_masked(&xcall_flush_tlb_mm,
>  			      ctx, 0, 0,
>  			      mm_cpumask(mm));
>  
> -local_flush_and_out:
>  	__flush_tlb_mm(ctx, SECONDARY_CONTEXT);
>  
>  	put_cpu();
> @@ -1114,17 +1080,15 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
>  {
>  	u32 ctx = CTX_HWBITS(mm->context);
>  	struct tlb_pending_info info;
> -	int cpu = get_cpu();
> +
> +	get_cpu();
>  
>  	info.ctx = ctx;
>  	info.nr = nr;
>  	info.vaddrs = vaddrs;
>  
> -	if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
> -		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
> -	else
> -		smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
> -				       &info, 1);
> +	smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
> +			       &info, 1);
>  
>  	__flush_tlb_pending(ctx, nr, vaddrs);
>  
> @@ -1134,14 +1098,13 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
>  void smp_flush_tlb_page(struct mm_struct *mm, unsigned long vaddr)
>  {
>  	unsigned long context = CTX_HWBITS(mm->context);
> -	int cpu = get_cpu();
>  
> -	if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
> -		cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
> -	else
> -		smp_cross_call_masked(&xcall_flush_tlb_page,
> -				      context, vaddr, 0,
> -				      mm_cpumask(mm));
> +	get_cpu();
> +
> +	smp_cross_call_masked(&xcall_flush_tlb_page,
> +			      context, vaddr, 0,
> +			      mm_cpumask(mm));
> +
>  	__flush_tlb_page(context, vaddr);
>  
>  	put_cpu();
> -- 
> 2.23.0
> 
> 

^ permalink raw reply

* Re: [PATCH] serial: ucc_uart: make qe_uart_set_mctrl() static
From: Jiri Slaby @ 2020-09-14  5:29 UTC (permalink / raw)
  To: Jason Yan, timur, gregkh, linux, leoyang.li, linuxppc-dev,
	linux-serial
  Cc: Hulk Robot
In-Reply-To: <20200912033834.143166-1-yanaijie@huawei.com>

On 12. 09. 20, 5:38, Jason Yan wrote:
> This eliminates the following sparse warning:
> 
> drivers/tty/serial/ucc_uart.c:286:6: warning: symbol 'qe_uart_set_mctrl'
> was not declared. Should it be static?
> 
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Signed-off-by: Jason Yan <yanaijie@huawei.com>

Sure:
Acked-by: Jiri Slaby <jirislaby@kernel.org>

> ---
>  drivers/tty/serial/ucc_uart.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/tty/serial/ucc_uart.c b/drivers/tty/serial/ucc_uart.c
> index 3c8c662c69e2..d6a8604157ab 100644
> --- a/drivers/tty/serial/ucc_uart.c
> +++ b/drivers/tty/serial/ucc_uart.c
> @@ -283,7 +283,7 @@ static unsigned int qe_uart_tx_empty(struct uart_port *port)
>   * don't need that support. This function must exist, however, otherwise
>   * the kernel will panic.
>   */
> -void qe_uart_set_mctrl(struct uart_port *port, unsigned int mctrl)
> +static void qe_uart_set_mctrl(struct uart_port *port, unsigned int mctrl)
>  {
>  }
>  
> 


-- 
js

^ permalink raw reply

* Re: [PATCH] tty: hvcs: Don't NULL tty->driver_data until hvcs_cleanup()
From: Jiri Slaby @ 2020-09-14  5:35 UTC (permalink / raw)
  To: Tyrel Datwyler, gregkh
  Cc: Joe Perches, open list:HYPERVISOR VIRTUAL CONSOLE DRIVER,
	open list, Jason Yan
In-Reply-To: <20200820234643.70412-1-tyreld@linux.ibm.com>

On 21. 08. 20, 1:46, Tyrel Datwyler wrote:
> The code currently NULLs tty->driver_data in hvcs_close() with the
> intent of informing the next call to hvcs_open() that device needs to be
> reconfigured. However, when hvcs_cleanup() is called we copy hvcsd from
> tty->driver_data which was previoulsy NULLed by hvcs_close() and our
> call to tty_port_put(&hvcsd->port) doesn't actually do anything since
> &hvcsd->port ends up translating to NULL by chance. This has the side
> effect that when hvcs_remove() is called we have one too many port
> references preventing hvcs_destuct_port() from ever being called. This
> also prevents us from reusing the /dev/hvcsX node in a future
> hvcs_probe() and we can eventually run out of /dev/hvcsX devices.
> 
> Fix this by waiting to NULL tty->driver_data in hvcs_cleanup().

Without actually looking into the code, it looks like we need a fix
similar to:
commit 24eb2377f977fe06d84fca558f891f95bc28a449
Author: Jiri Slaby <jirislaby@kernel.org>
Date:   Tue May 26 16:56:32 2020 +0200

    tty: hvc_console, fix crashes on parallel open/close

here too?

> Signed-off-by: Tyrel Datwyler <tyreld@linux.ibm.com>
> ---
>  drivers/tty/hvc/hvcs.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/tty/hvc/hvcs.c b/drivers/tty/hvc/hvcs.c
> index 55105ac38f89..509d1042825a 100644
> --- a/drivers/tty/hvc/hvcs.c
> +++ b/drivers/tty/hvc/hvcs.c
> @@ -1216,13 +1216,6 @@ static void hvcs_close(struct tty_struct *tty, struct file *filp)
>  
>  		tty_wait_until_sent(tty, HVCS_CLOSE_WAIT);
>  
> -		/*
> -		 * This line is important because it tells hvcs_open that this
> -		 * device needs to be re-configured the next time hvcs_open is
> -		 * called.
> -		 */
> -		tty->driver_data = NULL;
> -
>  		free_irq(irq, hvcsd);
>  		return;
>  	} else if (hvcsd->port.count < 0) {
> @@ -1237,6 +1230,13 @@ static void hvcs_cleanup(struct tty_struct * tty)
>  {
>  	struct hvcs_struct *hvcsd = tty->driver_data;
>  
> +	/*
> +	 * This line is important because it tells hvcs_open that this
> +	 * device needs to be re-configured the next time hvcs_open is
> +	 * called.
> +	 */
> +	tty->driver_data = NULL;
> +
>  	tty_port_put(&hvcsd->port);
>  }
>  
> 

thanks,
-- 
js

^ permalink raw reply

* Re: [PATCH] spi: fsl-espi: Only process interrupts for expected events
From: Joakim Tjernlund @ 2020-09-14  9:06 UTC (permalink / raw)
  To: broonie@kernel.org, npiggin@gmail.com
  Cc: linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org, linux-spi@vger.kernel.org
In-Reply-To: <1600050281.5iiy8pkb7z.astroid@bobo.none>

On Mon, 2020-09-14 at 12:28 +1000, Nicholas Piggin wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe.
> 
> 
> Excerpts from Chris Packham's message of September 14, 2020 8:03 am:
> > Hi All,
> > 
> > On 4/09/20 12:28 pm, Chris Packham wrote:
> > > The SPIE register contains counts for the TX FIFO so any time the irq
> > > handler was invoked we would attempt to process the RX/TX fifos. Use the
> > > SPIM value to mask the events so that we only process interrupts that
> > > were expected.
> > > 
> > > This was a latent issue exposed by commit 3282a3da25bd ("powerpc/64:
> > > Implement soft interrupt replay in C").
> > > 
> > > Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
> > > Cc: stable@vger.kernel.org
> > > ---
> > ping?
> 
> I don't know the code/hardware but thanks for tracking this down.
> 
> Was there anything more to be done with Jocke's observations, or would
> that be a follow-up patch if anything?

Patch is good IMHO, there may be more to fix w.r.t clearing the IRQs

> 
> If this patch fixes your problem it should probably go in, unless there
> are any objections.

It should go in I think.

 Jocke

> 
> Thanks,
> Nick
> 
> > > 
> > > Notes:
> > >      I've tested this on a T2080RDB and a custom board using the T2081 SoC. With
> > >      this change I don't see any spurious instances of the "Transfer done but
> > >      SPIE_DON isn't set!" or "Transfer done but rx/tx fifo's aren't empty!" messages
> > >      and the updates to spi flash are successful.
> > > 
> > >      I think this should go into the stable trees that contain 3282a3da25bd but I
> > >      haven't added a Fixes: tag because I think 3282a3da25bd exposed the issue as
> > >      opposed to causing it.
> > > 
> > >   drivers/spi/spi-fsl-espi.c | 5 +++--
> > >   1 file changed, 3 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
> > > index 7e7c92cafdbb..cb120b68c0e2 100644
> > > --- a/drivers/spi/spi-fsl-espi.c
> > > +++ b/drivers/spi/spi-fsl-espi.c
> > > @@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi *espi, u32 events)
> > >   static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
> > >   {
> > >      struct fsl_espi *espi = context_data;
> > > -    u32 events;
> > > +    u32 events, mask;
> > > 
> > >      spin_lock(&espi->lock);
> > > 
> > >      /* Get interrupt events(tx/rx) */
> > >      events = fsl_espi_read_reg(espi, ESPI_SPIE);
> > > -    if (!events) {
> > > +    mask = fsl_espi_read_reg(espi, ESPI_SPIM);
> > > +    if (!(events & mask)) {
> > >              spin_unlock(&espi->lock);
> > >              return IRQ_NONE;
> > >      }


^ permalink raw reply

* Re: [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
From: peterz @ 2020-09-14 10:56 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: linux-arch, Jens Axboe, Dave Hansen, Aneesh Kumar K . V,
	linux-kernel, Andy Lutomirski, linux-mm @ kvack . org, sparclinux,
	Andrew Morton, linuxppc-dev, David S . Miller
In-Reply-To: <20200914045219.3736466-2-npiggin@gmail.com>

On Mon, Sep 14, 2020 at 02:52:16PM +1000, Nicholas Piggin wrote:
> Reading and modifying current->mm and current->active_mm and switching
> mm should be done with irqs off, to prevent races seeing an intermediate
> state.
> 
> This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
> invalidate"). At exec-time when the new mm is activated, the old one
> should usually be single-threaded and no longer used, unless something
> else is holding an mm_users reference (which may be possible).
> 
> Absent other mm_users, there is also a race with preemption and lazy tlb
> switching. Consider the kernel_execve case where the current thread is
> using a lazy tlb active mm:
> 
>   call_usermodehelper()
>     kernel_execve()
>       old_mm = current->mm;
>       active_mm = current->active_mm;
>       *** preempt *** -------------------->  schedule()
>                                                prev->active_mm = NULL;
>                                                mmdrop(prev active_mm);
>                                              ...
>                       <--------------------  schedule()
>       current->mm = mm;
>       current->active_mm = mm;
>       if (!old_mm)
>           mmdrop(active_mm);
> 
> If we switch back to the kernel thread from a different mm, there is a
> double free of the old active_mm, and a missing free of the new one.
> 
> Closing this race only requires interrupts to be disabled while ->mm
> and ->active_mm are being switched, but the TLB problem requires also
> holding interrupts off over activate_mm. Unfortunately not all archs
> can do that yet, e.g., arm defers the switch if irqs are disabled and
> expects finish_arch_post_lock_switch() to be called to complete the
> flush; um takes a blocking lock in activate_mm().
> 
> So as a first step, disable interrupts across the mm/active_mm updates
> to close the lazy tlb preempt race, and provide an arch option to
> extend that to activate_mm which allows architectures doing IPI based
> TLB shootdowns to close the second race.
> 
> This is a bit ugly, but in the interest of fixing the bug and backporting
> before all architectures are converted this is a compromise.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

I'm thinking we want this selected on x86 as well. Andy?

> ---
>  arch/Kconfig |  7 +++++++
>  fs/exec.c    | 17 +++++++++++++++--
>  2 files changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index af14a567b493..94821e3f94d1 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -414,6 +414,13 @@ config MMU_GATHER_NO_GATHER
>  	bool
>  	depends on MMU_GATHER_TABLE_FREE
>  
> +config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
> +	bool
> +	help
> +	  Temporary select until all architectures can be converted to have
> +	  irqs disabled over activate_mm. Architectures that do IPI based TLB
> +	  shootdowns should enable this.
> +
>  config ARCH_HAVE_NMI_SAFE_CMPXCHG
>  	bool
>  
> diff --git a/fs/exec.c b/fs/exec.c
> index a91003e28eaa..d4fb18baf1fb 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1130,11 +1130,24 @@ static int exec_mmap(struct mm_struct *mm)
>  	}
>  
>  	task_lock(tsk);
> -	active_mm = tsk->active_mm;
>  	membarrier_exec_mmap(mm);
> -	tsk->mm = mm;
> +
> +	local_irq_disable();
> +	active_mm = tsk->active_mm;
>  	tsk->active_mm = mm;
> +	tsk->mm = mm;
> +	/*
> +	 * This prevents preemption while active_mm is being loaded and
> +	 * it and mm are being updated, which could cause problems for
> +	 * lazy tlb mm refcounting when these are updated by context
> +	 * switches. Not all architectures can handle irqs off over
> +	 * activate_mm yet.
> +	 */
> +	if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
> +		local_irq_enable();
>  	activate_mm(active_mm, mm);
> +	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
> +		local_irq_enable();
>  	tsk->mm->vmacache_seqnum = 0;
>  	vmacache_flush(tsk);
>  	task_unlock(tsk);
> -- 
> 2.23.0
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox