LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* RE: [PATCH v6 04/11] PCI: designware-ep: Modify MSI and MSIX CAP way of finding
From: Z.q. Hou @ 2020-09-24 14:53 UTC (permalink / raw)
  To: Bjorn Helgaas, Xiaowei Bao
  Cc: Roy Zang, lorenzo.pieralisi@arm.com, devicetree@vger.kernel.org,
	jingoohan1@gmail.com, linuxppc-dev@lists.ozlabs.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Leo Li,
	M.h. Lian, robh+dt@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	gustavo.pimentel@synopsys.com, bhelgaas@google.com,
	andrew.murray@arm.com, kishon@ti.com, shawnguo@kernel.org,
	Mingkai Hu, amurray@thegoodpenguin.co.uk
In-Reply-To: <20200923201613.GA2291357@bjorn-Precision-5520>

Hi Bjorn,

Thanks a lot for your comments!

> -----Original Message-----
> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: 2020年9月24日 4:16
> To: Xiaowei Bao <xiaowei.bao@nxp.com>
> Cc: Z.q. Hou <zhiqiang.hou@nxp.com>; M.h. Lian
> <minghuan.lian@nxp.com>; Mingkai Hu <mingkai.hu@nxp.com>;
> bhelgaas@google.com; robh+dt@kernel.org; shawnguo@kernel.org; Leo Li
> <leoyang.li@nxp.com>; kishon@ti.com; lorenzo.pieralisi@arm.com; Roy
> Zang <roy.zang@nxp.com>; amurray@thegoodpenguin.co.uk;
> jingoohan1@gmail.com; gustavo.pimentel@synopsys.com;
> andrew.murray@arm.com; linux-pci@vger.kernel.org;
> devicetree@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-arm-kernel@lists.infradead.org; linuxppc-dev@lists.ozlabs.org
> Subject: Re: [PATCH v6 04/11] PCI: designware-ep: Modify MSI and MSIX
> CAP way of finding
> 
> s/MSIX/MSI-X/ (subject and below)
> 
> On Sat, Mar 14, 2020 at 11:30:31AM +0800, Xiaowei Bao wrote:
> > Each PF of EP device should have it's own MSI or MSIX capabitily
> > struct, so create a dw_pcie_ep_func struct and remove the msi_cap and
> > msix_cap to this struct from dw_pcie_ep, and manage the PFs with a
> > list.
> 
> s/capabitily/capability/
> 
> I know Lorenzo has already applied this, but for the future, or in case there
> are other reasons to update this patch.
> 
> There are a bunch of unnecessary initializations below for future cleanup.

Yes, and there are many calling of dw_pcie_ep_func_select() to get func_offset, I plan to submit a separate patch to clean up.

Thanks,
Zhiqiang

> 
> > Signed-off-by: Xiaowei Bao <xiaowei.bao@nxp.com>
> > ---
> > v3:
> >  - This is a new patch, to fix the issue of MSI and MSIX CAP way of
> >    finding.
> > v4:
> >  - Correct some word of commit message.
> > v5:
> >  - No change.
> > v6:
> >  - Fix up the compile error.
> >
> >  drivers/pci/controller/dwc/pcie-designware-ep.c | 135
> +++++++++++++++++++++---
> >  drivers/pci/controller/dwc/pcie-designware.h    |  18 +++-
> >  2 files changed, 134 insertions(+), 19 deletions(-)
> >
> > diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c
> > b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > index 933bb89..fb915f2 100644
> > --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> > +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > @@ -19,6 +19,19 @@ void dw_pcie_ep_linkup(struct dw_pcie_ep *ep)
> >  	pci_epc_linkup(epc);
> >  }
> >
> > +struct dw_pcie_ep_func *
> > +dw_pcie_ep_get_func_from_ep(struct dw_pcie_ep *ep, u8 func_no) {
> > +	struct dw_pcie_ep_func *ep_func;
> > +
> > +	list_for_each_entry(ep_func, &ep->func_list, list) {
> > +		if (ep_func->func_no == func_no)
> > +			return ep_func;
> > +	}
> > +
> > +	return NULL;
> > +}
> > +
> >  static unsigned int dw_pcie_ep_func_select(struct dw_pcie_ep *ep, u8
> > func_no)  {
> >  	unsigned int func_offset = 0;
> > @@ -59,6 +72,47 @@ void dw_pcie_ep_reset_bar(struct dw_pcie *pci,
> enum pci_barno bar)
> >  		__dw_pcie_ep_reset_bar(pci, func_no, bar, 0);  }
> >
> > +static u8 __dw_pcie_ep_find_next_cap(struct dw_pcie_ep *ep, u8
> func_no,
> > +		u8 cap_ptr, u8 cap)
> > +{
> > +	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > +	unsigned int func_offset = 0;
> 
> Unnecessary initialization.
> 
> > +	u8 cap_id, next_cap_ptr;
> > +	u16 reg;
> > +
> > +	if (!cap_ptr)
> > +		return 0;
> > +
> > +	func_offset = dw_pcie_ep_func_select(ep, func_no);
> > +
> > +	reg = dw_pcie_readw_dbi(pci, func_offset + cap_ptr);
> > +	cap_id = (reg & 0x00ff);
> > +
> > +	if (cap_id > PCI_CAP_ID_MAX)
> > +		return 0;
> > +
> > +	if (cap_id == cap)
> > +		return cap_ptr;
> > +
> > +	next_cap_ptr = (reg & 0xff00) >> 8;
> > +	return __dw_pcie_ep_find_next_cap(ep, func_no, next_cap_ptr, cap); }
> > +
> > +static u8 dw_pcie_ep_find_capability(struct dw_pcie_ep *ep, u8
> > +func_no, u8 cap) {
> > +	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > +	unsigned int func_offset = 0;
> 
> Unnecessary initialization.
> 
> > +	u8 next_cap_ptr;
> > +	u16 reg;
> > +
> > +	func_offset = dw_pcie_ep_func_select(ep, func_no);
> > +
> > +	reg = dw_pcie_readw_dbi(pci, func_offset + PCI_CAPABILITY_LIST);
> > +	next_cap_ptr = (reg & 0x00ff);
> > +
> > +	return __dw_pcie_ep_find_next_cap(ep, func_no, next_cap_ptr, cap); }
> > +
> >  static int dw_pcie_ep_write_header(struct pci_epc *epc, u8 func_no,
> >  				   struct pci_epf_header *hdr)
> >  {
> > @@ -246,13 +300,18 @@ static int dw_pcie_ep_get_msi(struct pci_epc
> *epc, u8 func_no)
> >  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> >  	u32 val, reg;
> >  	unsigned int func_offset = 0;
> 
> Unnecessary initialization (not from your patch).
> 
> > +	struct dw_pcie_ep_func *ep_func;
> >
> > -	if (!ep->msi_cap)
> > +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> > +	if (!ep_func)
> > +		return -EINVAL;
> > +
> > +	if (!ep_func->msi_cap)
> >  		return -EINVAL;
> >
> >  	func_offset = dw_pcie_ep_func_select(ep, func_no);
> >
> > -	reg = ep->msi_cap + func_offset + PCI_MSI_FLAGS;
> > +	reg = ep_func->msi_cap + func_offset + PCI_MSI_FLAGS;
> >  	val = dw_pcie_readw_dbi(pci, reg);
> >  	if (!(val & PCI_MSI_FLAGS_ENABLE))
> >  		return -EINVAL;
> > @@ -268,13 +327,18 @@ static int dw_pcie_ep_set_msi(struct pci_epc
> *epc, u8 func_no, u8 interrupts)
> >  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> >  	u32 val, reg;
> >  	unsigned int func_offset = 0;
> 
> Unnecessary initialization (not from your patch).
> 
> > +	struct dw_pcie_ep_func *ep_func;
> > +
> > +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> > +	if (!ep_func)
> > +		return -EINVAL;
> >
> > -	if (!ep->msi_cap)
> > +	if (!ep_func->msi_cap)
> >  		return -EINVAL;
> >
> >  	func_offset = dw_pcie_ep_func_select(ep, func_no);
> >
> > -	reg = ep->msi_cap + func_offset + PCI_MSI_FLAGS;
> > +	reg = ep_func->msi_cap + func_offset + PCI_MSI_FLAGS;
> >  	val = dw_pcie_readw_dbi(pci, reg);
> >  	val &= ~PCI_MSI_FLAGS_QMASK;
> >  	val |= (interrupts << 1) & PCI_MSI_FLAGS_QMASK; @@ -291,13 +355,18
> > @@ static int dw_pcie_ep_get_msix(struct pci_epc *epc, u8 func_no)
> >  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> >  	u32 val, reg;
> >  	unsigned int func_offset = 0;
> 
> Unnecessary initialization (not from your patch).
> 
> > +	struct dw_pcie_ep_func *ep_func;
> > +
> > +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> > +	if (!ep_func)
> > +		return -EINVAL;
> >
> > -	if (!ep->msix_cap)
> > +	if (!ep_func->msix_cap)
> >  		return -EINVAL;
> >
> >  	func_offset = dw_pcie_ep_func_select(ep, func_no);
> >
> > -	reg = ep->msix_cap + func_offset + PCI_MSIX_FLAGS;
> > +	reg = ep_func->msix_cap + func_offset + PCI_MSIX_FLAGS;
> >  	val = dw_pcie_readw_dbi(pci, reg);
> >  	if (!(val & PCI_MSIX_FLAGS_ENABLE))
> >  		return -EINVAL;
> > @@ -313,13 +382,18 @@ static int dw_pcie_ep_set_msix(struct pci_epc
> *epc, u8 func_no, u16 interrupts)
> >  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> >  	u32 val, reg;
> >  	unsigned int func_offset = 0;
> 
> Unnecessary initialization (not from your patch).
> 
> > +	struct dw_pcie_ep_func *ep_func;
> >
> > -	if (!ep->msix_cap)
> > +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> > +	if (!ep_func)
> > +		return -EINVAL;
> > +
> > +	if (!ep_func->msix_cap)
> >  		return -EINVAL;
> >
> >  	func_offset = dw_pcie_ep_func_select(ep, func_no);
> >
> > -	reg = ep->msix_cap + func_offset + PCI_MSIX_FLAGS;
> > +	reg = ep_func->msix_cap + func_offset + PCI_MSIX_FLAGS;
> >  	val = dw_pcie_readw_dbi(pci, reg);
> >  	val &= ~PCI_MSIX_FLAGS_QSIZE;
> >  	val |= interrupts;
> > @@ -404,6 +478,7 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep
> *ep, u8 func_no,
> >  			     u8 interrupt_num)
> >  {
> >  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > +	struct dw_pcie_ep_func *ep_func;
> >  	struct pci_epc *epc = ep->epc;
> >  	unsigned int aligned_offset;
> >  	unsigned int func_offset = 0;
> 
> Unnecessary initialization (not from your patch).
> 
> > @@ -413,25 +488,29 @@ int dw_pcie_ep_raise_msi_irq(struct
> dw_pcie_ep *ep, u8 func_no,
> >  	bool has_upper;
> >  	int ret;
> >
> > -	if (!ep->msi_cap)
> > +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> > +	if (!ep_func)
> > +		return -EINVAL;
> > +
> > +	if (!ep_func->msi_cap)
> >  		return -EINVAL;
> >
> >  	func_offset = dw_pcie_ep_func_select(ep, func_no);
> >
> >  	/* Raise MSI per the PCI Local Bus Specification Revision 3.0, 6.8.1. */
> > -	reg = ep->msi_cap + func_offset + PCI_MSI_FLAGS;
> > +	reg = ep_func->msi_cap + func_offset + PCI_MSI_FLAGS;
> >  	msg_ctrl = dw_pcie_readw_dbi(pci, reg);
> >  	has_upper = !!(msg_ctrl & PCI_MSI_FLAGS_64BIT);
> > -	reg = ep->msi_cap + func_offset + PCI_MSI_ADDRESS_LO;
> > +	reg = ep_func->msi_cap + func_offset + PCI_MSI_ADDRESS_LO;
> >  	msg_addr_lower = dw_pcie_readl_dbi(pci, reg);
> >  	if (has_upper) {
> > -		reg = ep->msi_cap + func_offset + PCI_MSI_ADDRESS_HI;
> > +		reg = ep_func->msi_cap + func_offset + PCI_MSI_ADDRESS_HI;
> >  		msg_addr_upper = dw_pcie_readl_dbi(pci, reg);
> > -		reg = ep->msi_cap + func_offset + PCI_MSI_DATA_64;
> > +		reg = ep_func->msi_cap + func_offset + PCI_MSI_DATA_64;
> >  		msg_data = dw_pcie_readw_dbi(pci, reg);
> >  	} else {
> >  		msg_addr_upper = 0;
> > -		reg = ep->msi_cap + func_offset + PCI_MSI_DATA_32;
> > +		reg = ep_func->msi_cap + func_offset + PCI_MSI_DATA_32;
> >  		msg_data = dw_pcie_readw_dbi(pci, reg);
> >  	}
> >  	aligned_offset = msg_addr_lower & (epc->mem->page_size - 1); @@
> > -467,6 +546,7 @@ int dw_pcie_ep_raise_msix_irq(struct dw_pcie_ep *ep,
> u8 func_no,
> >  			      u16 interrupt_num)
> >  {
> >  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > +	struct dw_pcie_ep_func *ep_func;
> >  	struct pci_epc *epc = ep->epc;
> >  	u16 tbl_offset, bir;
> >  	unsigned int func_offset = 0;
> 
> Unnecessary initialization (not from your patch).
> 
> > @@ -477,9 +557,16 @@ int dw_pcie_ep_raise_msix_irq(struct
> dw_pcie_ep *ep, u8 func_no,
> >  	void __iomem *msix_tbl;
> >  	int ret;
> >
> > +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> > +	if (!ep_func)
> > +		return -EINVAL;
> > +
> > +	if (!ep_func->msix_cap)
> > +		return -EINVAL;
> > +
> >  	func_offset = dw_pcie_ep_func_select(ep, func_no);
> >
> > -	reg = ep->msix_cap + func_offset + PCI_MSIX_TABLE;
> > +	reg = ep_func->msix_cap + func_offset + PCI_MSIX_TABLE;
> >  	tbl_offset = dw_pcie_readl_dbi(pci, reg);
> >  	bir = (tbl_offset & PCI_MSIX_TABLE_BIR);
> >  	tbl_offset &= PCI_MSIX_TABLE_OFFSET; @@ -558,6 +645,7 @@ int
> > dw_pcie_ep_init(struct dw_pcie_ep *ep)
> >  	int i;
> >  	int ret;
> >  	u32 reg;
> > +	u8 func_no;
> >  	void *addr;
> >  	u8 hdr_type;
> >  	unsigned int nbars;
> > @@ -566,6 +654,9 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep)
> >  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> >  	struct device *dev = pci->dev;
> >  	struct device_node *np = dev->of_node;
> > +	struct dw_pcie_ep_func *ep_func;
> > +
> > +	INIT_LIST_HEAD(&ep->func_list);
> >
> >  	if (!pci->dbi_base || !pci->dbi_base2) {
> >  		dev_err(dev, "dbi_base/dbi_base2 is not populated\n"); @@ -632,9
> > +723,19 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep)
> >  	if (ret < 0)
> >  		epc->max_functions = 1;
> >
> > -	ep->msi_cap = dw_pcie_find_capability(pci, PCI_CAP_ID_MSI);
> > +	for (func_no = 0; func_no < epc->max_functions; func_no++) {
> > +		ep_func = devm_kzalloc(dev, sizeof(*ep_func), GFP_KERNEL);
> > +		if (!ep_func)
> > +			return -ENOMEM;
> >
> > -	ep->msix_cap = dw_pcie_find_capability(pci, PCI_CAP_ID_MSIX);
> > +		ep_func->func_no = func_no;
> > +		ep_func->msi_cap = dw_pcie_ep_find_capability(ep, func_no,
> > +							      PCI_CAP_ID_MSI);
> > +		ep_func->msix_cap = dw_pcie_ep_find_capability(ep, func_no,
> > +							       PCI_CAP_ID_MSIX);
> > +
> > +		list_add_tail(&ep_func->list, &ep->func_list);
> > +	}
> >
> >  	if (ep->ops->ep_init)
> >  		ep->ops->ep_init(ep);
> > diff --git a/drivers/pci/controller/dwc/pcie-designware.h
> > b/drivers/pci/controller/dwc/pcie-designware.h
> > index cb32afa..dd9b7b4 100644
> > --- a/drivers/pci/controller/dwc/pcie-designware.h
> > +++ b/drivers/pci/controller/dwc/pcie-designware.h
> > @@ -230,8 +230,16 @@ struct dw_pcie_ep_ops {
> >  	unsigned int (*func_conf_select)(struct dw_pcie_ep *ep, u8 func_no);
> > };
> >
> > +struct dw_pcie_ep_func {
> > +	struct list_head	list;
> > +	u8			func_no;
> > +	u8			msi_cap;	/* MSI capability offset */
> > +	u8			msix_cap;	/* MSI-X capability offset */
> > +};
> > +
> >  struct dw_pcie_ep {
> >  	struct pci_epc		*epc;
> > +	struct list_head	func_list;
> >  	const struct dw_pcie_ep_ops *ops;
> >  	phys_addr_t		phys_base;
> >  	size_t			addr_size;
> > @@ -244,8 +252,6 @@ struct dw_pcie_ep {
> >  	u32			num_ob_windows;
> >  	void __iomem		*msi_mem;
> >  	phys_addr_t		msi_mem_phys;
> > -	u8			msi_cap;	/* MSI capability offset */
> > -	u8			msix_cap;	/* MSI-X capability offset */
> >  };
> >
> >  struct dw_pcie_ops {
> > @@ -437,6 +443,8 @@ int dw_pcie_ep_raise_msix_irq(struct dw_pcie_ep
> > *ep, u8 func_no,  int dw_pcie_ep_raise_msix_irq_doorbell(struct
> dw_pcie_ep *ep, u8 func_no,
> >  				       u16 interrupt_num);
> >  void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno bar);
> > +struct dw_pcie_ep_func *
> > +dw_pcie_ep_get_func_from_ep(struct dw_pcie_ep *ep, u8 func_no);
> >  #else
> >  static inline void dw_pcie_ep_linkup(struct dw_pcie_ep *ep)  { @@
> > -478,5 +486,11 @@ static inline int
> > dw_pcie_ep_raise_msix_irq_doorbell(struct dw_pcie_ep *ep,  static
> > inline void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno
> > bar)  {  }
> > +
> > +static inline struct dw_pcie_ep_func *
> > +dw_pcie_ep_get_func_from_ep(struct dw_pcie_ep *ep, u8 func_no) {
> > +	return NULL;
> > +}
> >  #endif
> >  #endif /* _PCIE_DESIGNWARE_H */
> > --
> > 2.9.5
> >

^ permalink raw reply

* Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
From: Peter Zijlstra @ 2020-09-24 13:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, David Airlie, Sebastian Andrzej Siewior,
	Joonas Lahtinen, dri-devel, linux-mips, Ben Segall, Max Filippov,
	Guo Ren, linux-sparc, Vincent Chen, Will Deacon, Ard Biesheuvel,
	linux-arch, Vincent Guittot, Herbert Xu, the arch/x86 maintainers,
	Russell King, linux-csky, Mel Gorman,
	open list:SYNOPSYS ARC ARCHITECTURE, linux-xtensa, Paul McKenney,
	intel-gfx, linuxppc-dev, Jani Nikula, Greentime Hu, Rodrigo Vivi,
	Thomas Gleixner, Dietmar Eggemann, Linux ARM, Chris Zankel,
	Michal Simek, Thomas Bogendoerfer, Nick Hu, Linux-MM,
	Linus Torvalds, LKML, Arnd Bergmann, Daniel Vetter, Vineet Gupta,
	Paul Mackerras, Andrew Morton, Daniel Bristot de Oliveira,
	David S. Miller
In-Reply-To: <20200924095138.5318d242@oasis.local.home>

On Thu, Sep 24, 2020 at 09:51:38AM -0400, Steven Rostedt wrote:

> > It turns out, that getting selected for pull-balance is exactly that
> > condition, and clearly a migrate_disable() task cannot be pulled, but we
> > can use that signal to try and pull away the running task that's in the
> > way.
> 
> Unless of course that running task is in a migrate disable section
> itself ;-)

See my ramblings here:

  https://lkml.kernel.org/r/20200924082717.GA1362448@hirez.programming.kicks-ass.net

My plan was to have the migrate_enable() of the running task trigger the
migration in that case.


^ permalink raw reply

* Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
From: Steven Rostedt @ 2020-09-24 13:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, David Airlie, Sebastian Andrzej Siewior,
	Joonas Lahtinen, dri-devel, linux-mips, Ben Segall, Max Filippov,
	Guo Ren, linux-sparc, Vincent Chen, Will Deacon, Ard Biesheuvel,
	linux-arch, Vincent Guittot, Herbert Xu, the arch/x86 maintainers,
	Russell King, linux-csky, Mel Gorman,
	open list:SYNOPSYS ARC ARCHITECTURE, linux-xtensa, Paul McKenney,
	intel-gfx, linuxppc-dev, Jani Nikula, Greentime Hu, Rodrigo Vivi,
	Thomas Gleixner, Dietmar Eggemann, Linux ARM, Chris Zankel,
	Michal Simek, Thomas Bogendoerfer, Nick Hu, Linux-MM,
	Linus Torvalds, LKML, Arnd Bergmann, Daniel Vetter, Vineet Gupta,
	Paul Mackerras, Andrew Morton, Daniel Bristot de Oliveira,
	David S. Miller
In-Reply-To: <20200924124241.GK2628@hirez.programming.kicks-ass.net>

On Thu, 24 Sep 2020 14:42:41 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Sep 24, 2020 at 08:32:41AM -0400, Steven Rostedt wrote:
> > Anyway, instead of blocking. What about having a counter of number of
> > migrate disabled tasks per cpu, and when taking a migrate_disable(), and there's
> > already another task with migrate_disabled() set, and the current task has
> > an affinity greater than 1, it tries to migrate to another CPU?  
> 
> That doesn't solve the problem. On wakeup we should already prefer an
> idle CPU over one running a (RT) task, but you can always wake more
> tasks than there's CPUs around and you'll _have_ to stack at some point.

Yes, understood.

> 
> The trick is how to unstack them correctly. We need to detect when a
> migrate_disable() task _should_ start running again, and migrate away
> whoever is in the way at that point.
> 
> It turns out, that getting selected for pull-balance is exactly that
> condition, and clearly a migrate_disable() task cannot be pulled, but we
> can use that signal to try and pull away the running task that's in the
> way.

Unless of course that running task is in a migrate disable section
itself ;-)

But I guess we will always have that SHC, and there will always be a
scenario that you can't balance properly. But hopefully in practice we
wont see that.

How to handle kmap_local(), will migrate_disable() be used only for
32bit or, for consistency, will it also apply to 64bit?

-- Steve

^ permalink raw reply

* Re: C vdso
From: Christophe Leroy @ 2020-09-24 13:17 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev@ozlabs.org
In-Reply-To: <87r1r0oa4o.fsf@mpe.ellerman.id.au>

Hi Michael

Le 17/09/2020 à 14:33, Michael Ellerman a écrit :
> Hi Christophe,
> 
> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>> Hi Michael,
>>
>> What is the status with the generic C vdso merge ?
>> In some mail, you mentionned having difficulties getting it working on
>> ppc64, any progress ? What's the problem ? Can I help ?
> 
> Yeah sorry I was hoping to get time to work on it but haven't been able
> to.
> 
> It's causing crashes on ppc64 ie. big endian.
> 

> 
> As you can see from the instruction dump we have jumped into the weeds somewhere.
> 
> We also had the report from the kbuild robot about rela.opd being
> discarded, which I think is indicative of a bigger problem. ie. we don't
> process relocations for the VDSO, but opds require relocations (they
> contain an absolute pointer).
> 
> I thought we could get away with that, because the VDSO entry points
> aren't proper functions (so they don't have opds), and I didn't think
> we'd be calling via function pointers in the VDSO code (which would
> require opds). But seems something is not working right.
> 
> Sorry I haven't got back to you with those details. Things are a bit of
> a mess inside IBM at the moment (always?), and I've been trying to get
> everything done before I take a holiday next week.
> 


Can you tell what defconfig you are using ? I have been able to setup a 
full glibc PPC64 cross compilation chain and been able to test it under 
QEMU with success, using Nathan's vdsotest tool.

I tested with both ppc64_defconfig and pseries_defconfig.

The only problem I got is with getcpu, which segfaults but both before 
and after applying my series, so I guess this is unrelated.

Not sure we can pay too much attention to the exact measurement as it is 
a ppc64 QEMU running on a x86 Linux which is running in a Virtual Box on 
a x86 windows Laptop, but at least it works:

Without the series:

clock-getres-monotonic:    vdso: 389 nsec/call
clock-gettime-monotonic:    vdso: 781 nsec/call
clock-getres-monotonic-coarse:    vdso: 13715 nsec/call
clock-gettime-monotonic-coarse:    vdso: 312 nsec/call
clock-getres-monotonic-raw:    vdso: 13589 nsec/call
clock-getres-tai:    vdso: 13827 nsec/call
clock-gettime-tai:    vdso: 14846 nsec/call
clock-getres-boottime:    vdso: 13596 nsec/call
clock-gettime-boottime:    vdso: 14758 nsec/call
clock-getres-realtime:    vdso: 327 nsec/call
clock-gettime-realtime:    vdso: 717 nsec/call
clock-getres-realtime-coarse:    vdso: 14102 nsec/call
clock-gettime-realtime-coarse:    vdso: 299 nsec/call
gettimeofday:    vdso: 771 nsec/call

With the series:

clock-getres-monotonic:    vdso: 350 nsec/call
clock-gettime-monotonic:    vdso: 726 nsec/call
clock-getres-monotonic-coarse:    vdso: 356 nsec/call
clock-gettime-monotonic-coarse:    vdso: 423 nsec/call
clock-getres-monotonic-raw:    vdso: 349 nsec/call
clock-getres-tai:    vdso: 419 nsec/call
clock-gettime-tai:    vdso: 724 nsec/call
clock-getres-boottime:    vdso: 352 nsec/call
clock-gettime-boottime:    vdso: 752 nsec/call
clock-getres-realtime:    vdso: 351 nsec/call
clock-gettime-realtime:    vdso: 733 nsec/call
clock-getres-realtime-coarse:    vdso: 356 nsec/call
clock-gettime-realtime-coarse:    vdso: 367 nsec/call
gettimeofday:    vdso: 796 nsec/call


Thanks
Christophe

^ permalink raw reply

* Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
From: Peter Zijlstra @ 2020-09-24 12:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, David Airlie, Sebastian Andrzej Siewior,
	Joonas Lahtinen, dri-devel, linux-mips, Ben Segall, Max Filippov,
	Guo Ren, linux-sparc, Vincent Chen, Will Deacon, Ard Biesheuvel,
	linux-arch, Vincent Guittot, Herbert Xu, the arch/x86 maintainers,
	Russell King, linux-csky, Mel Gorman,
	open list:SYNOPSYS ARC ARCHITECTURE, linux-xtensa, Paul McKenney,
	intel-gfx, linuxppc-dev, Jani Nikula, Greentime Hu, Rodrigo Vivi,
	Thomas Gleixner, Dietmar Eggemann, Linux ARM, Chris Zankel,
	Michal Simek, Thomas Bogendoerfer, Nick Hu, Linux-MM,
	Linus Torvalds, LKML, Arnd Bergmann, Daniel Vetter, Vineet Gupta,
	Paul Mackerras, Andrew Morton, Daniel Bristot de Oliveira,
	David S. Miller
In-Reply-To: <20200924083241.314f2102@gandalf.local.home>

On Thu, Sep 24, 2020 at 08:32:41AM -0400, Steven Rostedt wrote:
> Anyway, instead of blocking. What about having a counter of number of
> migrate disabled tasks per cpu, and when taking a migrate_disable(), and there's
> already another task with migrate_disabled() set, and the current task has
> an affinity greater than 1, it tries to migrate to another CPU?

That doesn't solve the problem. On wakeup we should already prefer an
idle CPU over one running a (RT) task, but you can always wake more
tasks than there's CPUs around and you'll _have_ to stack at some point.

The trick is how to unstack them correctly. We need to detect when a
migrate_disable() task _should_ start running again, and migrate away
whoever is in the way at that point.

It turns out, that getting selected for pull-balance is exactly that
condition, and clearly a migrate_disable() task cannot be pulled, but we
can use that signal to try and pull away the running task that's in the
way.

^ permalink raw reply

* Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
From: Steven Rostedt @ 2020-09-24 12:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Juri Lelli, peterz, Sebastian Andrzej Siewior, Joonas Lahtinen,
	dri-devel, linux-mips, Ben Segall, Max Filippov, Guo Ren,
	linux-sparc, Vincent Chen, Will Deacon, Ard Biesheuvel,
	linux-arch, Vincent Guittot, Herbert Xu, the arch/x86 maintainers,
	Russell King, linux-csky, David Airlie, Mel Gorman,
	open list:SYNOPSYS ARC ARCHITECTURE, linux-xtensa, Paul McKenney,
	intel-gfx, linuxppc-dev, Jani Nikula, Greentime Hu, Rodrigo Vivi,
	Dietmar Eggemann, Linux ARM, Chris Zankel, Michal Simek,
	Thomas Bogendoerfer, Nick Hu, Linux-MM, Linus Torvalds, LKML,
	Arnd Bergmann, Daniel Vetter, Vineet Gupta, Paul Mackerras,
	Andrew Morton, Daniel Bristot de Oliveira, David S. Miller
In-Reply-To: <871riracgf.fsf@nanos.tec.linutronix.de>

On Thu, 24 Sep 2020 08:57:52 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> > Now as for migration disabled nesting, at least now we would have
> > groupings of this, and perhaps the theorists can handle that. I mean,
> > how is this much different that having a bunch of tasks blocked on a
> > mutex with the owner is pinned on a CPU?
> >
> > migrate_disable() is a BKL of pinning affinity.  
> 
> No. That's just wrong. preempt disable is a concurrency control,

I think you totally misunderstood what I was saying. The above wasn't about
comparing preempt_disable to migrate_disable. It was comparing
migrate_disable to a chain of tasks blocked on mutexes where the top owner
has preempt_disable set. You still have a bunch of tasks that can't move to
other CPUs.

> > If we only have local_lock() available (even on !RT), then it makes
> > the blocking in groups. At least this way you could grep for all the
> > different local_locks in the system and plug that into the algorithm
> > for WCS, just like one would with a bunch of mutexes.  
> 
> You cannot do that on RT at all where migrate disable is substituting
> preempt disable in spin and rw locks. The result would be the same as
> with a !RT kernel just with horribly bad performance.

Note, the spin and rwlocks already have a lock associated with them. Why
would it be any different on RT? I wasn't suggesting adding another lock
inside a spinlock. Why would I recommend THAT? I wasn't recommending
blindly replacing migrate_disable() with local_lock(). I just meant expose
local_lock() but not migrate_disable().

> 
> That means the stacking problem has to be solved anyway.
> 
> So why on earth do you want to create yet another special duct tape case
> for kamp_local() which proliferates inconsistency instead of aiming for
> consistency accross all preemption models?

The idea was to help with the scheduling issue.

Anyway, instead of blocking. What about having a counter of number of
migrate disabled tasks per cpu, and when taking a migrate_disable(), and there's
already another task with migrate_disabled() set, and the current task has
an affinity greater than 1, it tries to migrate to another CPU?

This way migrate_disable() is less likely to have a bunch of tasks blocked
on one CPU serialized by each task exiting the migrate_disable() section.

Yes, there's more overhead, but it only happens if multiple tasks are in a
migrate disable section on the same CPU.

-- Steve

^ permalink raw reply

* Re: [PATCH] powerpc/process: Fix uninitialised variable error
From: Michael Ellerman @ 2020-09-24 12:29 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev
In-Reply-To: <20200917024509.3253837-1-mpe@ellerman.id.au>

On Thu, 17 Sep 2020 12:45:09 +1000, Michael Ellerman wrote:
> Clang, and GCC with -Wmaybe-uninitialized, can't see that val is
> unused in get_fpexec_mode():
> 
>   arch/powerpc/kernel/process.c:1940:7: error: variable 'val' is used
>   uninitialized whenever 'if' condition is true
> 		  if (cpu_has_feature(CPU_FTR_SPE)) {
> 		      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/process: Fix uninitialised variable error
      https://git.kernel.org/powerpc/c/d208e13c6a2277d9fb71fad6a1394c70bdd7b634

cheers

^ permalink raw reply

* Re: [PATCH -next v2] powerpc/book3s64: fix link error with CONFIG_PPC_RADIX_MMU=n
From: Michael Ellerman @ 2020-09-24 12:28 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev, Yang Yingliang
In-Reply-To: <20200917020643.90375-1-yangyingliang@huawei.com>

On Thu, 17 Sep 2020 10:06:43 +0800, Yang Yingliang wrote:
> Fix link error when CONFIG_PPC_RADIX_MMU is disabled:
> powerpc64-linux-gnu-ld: arch/powerpc/platforms/pseries/lpar.o:(.toc+0x0): undefined reference to `mmu_pid_bits'

Applied to powerpc/next.

[1/1] powerpc/book3s64: fix link error with CONFIG_PPC_RADIX_MMU=n
      https://git.kernel.org/powerpc/c/bda7673d64b6c2e92423363a756caa657464e096

cheers

^ permalink raw reply

* Re: [PATCH v2 0/4] more mm switching vs TLB shootdown and lazy tlb fixes
From: Michael Ellerman @ 2020-09-24 12:28 UTC (permalink / raw)
  To: Nicholas Piggin, linux-mm @ kvack . org
  Cc: Jens Axboe, linux-arch, Peter Zijlstra, Aneesh Kumar K . V,
	linux-kernel, sparclinux, Andrew Morton, linuxppc-dev,
	David S . Miller
In-Reply-To: <20200914045219.3736466-1-npiggin@gmail.com>

On Mon, 14 Sep 2020 14:52:15 +1000, Nicholas Piggin wrote:
> This is an attempt to fix a few different related issues around
> switching mm, TLB flushing, and lazy tlb mm handling.
> 
> This will require all architectures to eventually move to disabling
> irqs over activate_mm, but it's possible we could add another arch
> call after irqs are re-enabled for those few which can't do their
> entire activation with irqs disabled.
> 
> [...]

Applied to powerpc/next.

[1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
      https://git.kernel.org/powerpc/c/d53c3dfb23c45f7d4f910c3a3ca84bf0a99c6143
[2/4] powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
      https://git.kernel.org/powerpc/c/66acd46080bd9e5ad2be4b0eb1d498d5145d058e
[3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
      https://git.kernel.org/powerpc/c/bafb056ce27940c9994ea905336aa8f27b4f7275
[4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm
      https://git.kernel.org/powerpc/c/a665eec0a22e11cdde708c1c256a465ebe768047

cheers

^ permalink raw reply

* Re: [PATCH] powerpc/prom_init: Check display props exist before enabling btext
From: Michael Ellerman @ 2020-09-24 12:28 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman
In-Reply-To: <20200821103407.3362149-1-mpe@ellerman.id.au>

On Fri, 21 Aug 2020 20:34:07 +1000, Michael Ellerman wrote:
> It's possible to enable CONFIG_PPC_EARLY_DEBUG_BOOTX for a pseries
> kernel (maybe it shouldn't be), which is then booted with qemu/slof.
> 
> But if you do that the kernel crashes in draw_byte(), with a DAR
> pointing somewhere near INT_MAX.
> 
> Adding some debug to prom_init we see that we're not able to read the
> "address" property from OF, so we're just using whatever junk value
> was on the stack.
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/prom_init: Check display props exist before enabling btext
      https://git.kernel.org/powerpc/c/6c71cfcc01685ef495ca7886471a76e73446424e

cheers

^ permalink raw reply

* Re: [PATCH 1/3] powerpc: Move arch_cpu_idle_dead() into smp.c
From: Michael Ellerman @ 2020-09-24 12:27 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman
In-Reply-To: <20200819015634.1974478-1-mpe@ellerman.id.au>

On Wed, 19 Aug 2020 11:56:32 +1000, Michael Ellerman wrote:
> arch_cpu_idle_dead() is in idle.c, which makes sense, but it's inside
> a CONFIG_HOTPLUG_CPU block.
> 
> It would be more at home in smp.c, inside the existing
> CONFIG_HOTPLUG_CPU block. Note that CONFIG_HOTPLUG_CPU depends on
> CONFIG_SMP so even though smp.c is not built for SMP=n builds, that's
> fine.

Applied to powerpc/next.

[1/3] powerpc: Move arch_cpu_idle_dead() into smp.c
      https://git.kernel.org/powerpc/c/1ea21ba231f248034e8c794aa675869ca2b97d42
[2/3] powerpc/smp: Fold cpu_die() into its only caller
      https://git.kernel.org/powerpc/c/bf3c1464db883a953ad7bbed64924480b8b2b244
[3/3] powerpc/smp: Move ppc_md.cpu_die() to smp_ops.cpu_offline_self()
      https://git.kernel.org/powerpc/c/39f87561454dc33efb2a3d8354d066207acac8a6

cheers

^ permalink raw reply

* Re: [PATCH 1/2] powerpc/mm/64s: Fix slb_setup_new_exec() sparse warning
From: Michael Ellerman @ 2020-09-24 12:27 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman
In-Reply-To: <20200916115637.3100484-1-mpe@ellerman.id.au>

On Wed, 16 Sep 2020 21:56:36 +1000, Michael Ellerman wrote:
> Sparse says:
>   symbol slb_setup_new_exec was not declared. Should it be static?
> 
> No, it should have a declaration in a header, add one.

Applied to powerpc/next.

[1/2] powerpc/mm/64s: Fix slb_setup_new_exec() sparse warning
      https://git.kernel.org/powerpc/c/ef1edbba52883907caf02ab85e0d00a2e4648f05
[2/2] powerpc/perf: Add declarations to fix sparse warnings
      https://git.kernel.org/powerpc/c/d10ebe79dfae7dc59b6cf77ffa615f0b8dae21bf

cheers

^ permalink raw reply

* Re: [PATCH -next] serial: pmac_zilog: use for_each_child_of_node() macro
From: Michael Ellerman @ 2020-09-24 12:27 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Greg Kroah-Hartman, Jiri Slaby, Qinglang Miao
  Cc: linuxppc-dev, linux-kernel, linux-serial
In-Reply-To: <20200916062138.191188-1-miaoqinglang@huawei.com>

On Wed, 16 Sep 2020 14:21:38 +0800, Qinglang Miao wrote:
> Use for_each_child_of_node() macro instead of open coding it.

Applied to powerpc/next.

[1/1] serial: pmac_zilog: use for_each_child_of_node() macro
      https://git.kernel.org/powerpc/c/1d42e07e9c249b7a032fba82b673ee8a8d6bd7b7

cheers

^ permalink raw reply

* Re: [PATCH -next] powerpc/powernv: fix wrong warning message in opalcore_config_init()
From: Michael Ellerman @ 2020-09-24 12:27 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Qinglang Miao
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200916062129.190864-1-miaoqinglang@huawei.com>

On Wed, 16 Sep 2020 14:21:29 +0800, Qinglang Miao wrote:
> The logic of the warn output is incorrect. The two args should be
> exchanged.

Applied to powerpc/next.

[1/1] powerpc/powernv: fix wrong warning message in opalcore_config_init()
      https://git.kernel.org/powerpc/c/8ec5cb12cd957e59b0470b75d26c901aaf645bc3

cheers

^ permalink raw reply

* Re: [PATCH -next] macintosh: smu_sensors: use for_each_child_of_node() macro
From: Michael Ellerman @ 2020-09-24 12:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Qinglang Miao; +Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200916062125.190729-1-miaoqinglang@huawei.com>

On Wed, 16 Sep 2020 14:21:25 +0800, Qinglang Miao wrote:
> Use for_each_child_of_node() macro instead of open coding it.

Applied to powerpc/next.

[1/1] macintosh: smu_sensors: use for_each_child_of_node() macro
      https://git.kernel.org/powerpc/c/acff5e6c37fa4bf8d002c917a762c4f7615f6eaf

cheers

^ permalink raw reply

* Re: [PATCH -next] drivers/macintosh/smu.c: use for_each_child_of_node() macro
From: Michael Ellerman @ 2020-09-24 12:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Qinglang Miao; +Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200916062122.190586-1-miaoqinglang@huawei.com>

On Wed, 16 Sep 2020 14:21:22 +0800, Qinglang Miao wrote:
> Use for_each_child_of_node() macro instead of open coding it.

Applied to powerpc/next.

[1/1] drivers/macintosh/smu.c: use for_each_child_of_node() macro
      https://git.kernel.org/powerpc/c/9c826d31a73815464bd3df81e56d39b3d908ac73

cheers

^ permalink raw reply

* Re: [PATCH -next] powerpc/pseries: convert to use DEFINE_SEQ_ATTRIBUTE macro
From: Michael Ellerman @ 2020-09-24 12:27 UTC (permalink / raw)
  To: Michael Ellerman, Liu Shixin, Paul Mackerras,
	Benjamin Herrenschmidt
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200916025026.3992835-1-liushixin2@huawei.com>

On Wed, 16 Sep 2020 10:50:26 +0800, Liu Shixin wrote:
> Use DEFINE_SEQ_ATTRIBUTE macro to simplify the code.

Applied to powerpc/next.

[1/1] powerpc/pseries: convert to use DEFINE_SEQ_ATTRIBUTE macro
      https://git.kernel.org/powerpc/c/96543e7352bded5d6d1a0e0022376ebdd6c1b8ab

cheers

^ permalink raw reply

* Re: [PATCH v2 0/7] powerpc: Fix a few W=1 compile warnings
From: Michael Ellerman @ 2020-09-24 12:27 UTC (permalink / raw)
  To: Michael Ellerman, Cédric Le Goater; +Cc: Christophe Leroy, linuxppc-dev
In-Reply-To: <20200914211007.2285999-1-clg@kaod.org>

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 813 bytes --]

On Mon, 14 Sep 2020 23:10:00 +0200, Cédric Le Goater wrote:
> Here is a small contribution improving compile with W=1.
> 
> Thanks,
> 
> C.
> 
> Changes in v2:
> 
> [...]

Patches 1, 3, 4 and 7 applied to powerpc/next.

[1/7] powerpc/sysfs: Remove unused 'err' variable in sysfs_create_dscr_default()
      https://git.kernel.org/powerpc/c/7b2aab5f22f0f7cc9e2f8672c9e65e2e88d30102
[3/7] powerpc/sstep: Remove empty if statement checking for invalid form
      https://git.kernel.org/powerpc/c/5ab187e01a5310e1f9cd2f6b192b2343b8bd14cb
[4/7] powerpc/xive: Make debug routines static
      https://git.kernel.org/powerpc/c/2228f19cf90ef796c8d84f54f3a5db2dcc85c83f
[7/7] powerpc/32: Declare stack_overflow_exception() prototype
      https://git.kernel.org/powerpc/c/ebbfeef0d8093a06ff39c60105b6650be3344cbe

cheers

^ permalink raw reply

* Re: [PATCH -next] powerpc/papr_scm: Fix warnings about undeclared variable
From: Vaibhav Jain @ 2020-09-24 11:46 UTC (permalink / raw)
  To: Wang Wensheng, linuxppc-dev
  Cc: santosh, linux-kernel, paulus, aneesh.kumar, dan.j.williams,
	ira.weiny
In-Reply-To: <20200918085951.44983-1-wangwensheng4@huawei.com>


Thanks for the patch. This looks good to me.

Wang Wensheng <wangwensheng4@huawei.com> writes:

> Build the kernel with 'make C=2':
> arch/powerpc/platforms/pseries/papr_scm.c:825:1: warning: symbol
> 'dev_attr_perf_stats' was not declared. Should it be static?

> Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com>

-- 
Cheers
~ Vaibhav

^ permalink raw reply

* Re: [PATCH v2] powerpc/pci: unmap legacy INTx interrupts when a PHB is removed
From: Cédric Le Goater @ 2020-09-24 10:31 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Qian Cai, Michael Ellerman
  Cc: linuxppc-dev, linux-next, Oliver O'Halloran, linux-kernel,
	Stephen Rothwell
In-Reply-To: <6716add0-9244-4da1-a578-f7faeb529e77@ozlabs.ru>

On 9/24/20 7:11 AM, Alexey Kardashevskiy wrote:
> 
> 
> On 23/09/2020 17:06, Cédric Le Goater wrote:
>> On 9/23/20 2:33 AM, Qian Cai wrote:
>>> On Fri, 2020-08-07 at 12:18 +0200, Cédric Le Goater wrote:
>>>> When a passthrough IO adapter is removed from a pseries machine using
>>>> hash MMU and the XIVE interrupt mode, the POWER hypervisor expects the
>>>> guest OS to clear all page table entries related to the adapter. If
>>>> some are still present, the RTAS call which isolates the PCI slot
>>>> returns error 9001 "valid outstanding translations" and the removal of
>>>> the IO adapter fails. This is because when the PHBs are scanned, Linux
>>>> maps automatically the INTx interrupts in the Linux interrupt number
>>>> space but these are never removed.
>>>>
>>>> To solve this problem, we introduce a PPC platform specific
>>>> pcibios_remove_bus() routine which clears all interrupt mappings when
>>>> the bus is removed. This also clears the associated page table entries
>>>> of the ESB pages when using XIVE.
>>>>
>>>> For this purpose, we record the logical interrupt numbers of the
>>>> mapped interrupt under the PHB structure and let pcibios_remove_bus()
>>>> do the clean up.
>>>>
>>>> Since some PCI adapters, like GPUs, use the "interrupt-map" property
>>>> to describe interrupt mappings other than the legacy INTx interrupts,
>>>> we can not restrict the size of the mapping array to PCI_NUM_INTX. The
>>>> number of interrupt mappings is computed from the "interrupt-map"
>>>> property and the mapping array is allocated accordingly.
>>>>
>>>> Cc: "Oliver O'Halloran" <oohall@gmail.com>
>>>> Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>
>>> Some syscall fuzzing will trigger this on POWER9 NV where the traces pointed to
>>> this patch.
>>>
>>> .config: https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config
>>
>> OK. The patch is missing a NULL assignement after kfree() and that
>> might be the issue. 
>>
>> I did try PHB removal under PowerNV, so I would like to understand 
>> how we managed to remove twice the PCI bus and possibly reproduce. 
>> Any chance we could grab what the syscall fuzzer (syzkaller) did ? 
> 
> 
> 
> My guess would be it is doing this in parallel to provoke races.

Concurrency removal and rescan should be controlled by : 

   pci_stop_and_remove_bus_device_locked()
       pci_lock_rescan_remove() 

And, in the report, the stack traces are on the same CPU and PID. 



What I think is happening is that we did a couple of remove/rescan
of the same PHB. The problem is that ->irq_map is not reallocated
the second time because the PHB is re-scanned differently on the 
PowerNV platform. At the second remove, the ->irq_map being not NULL, 
we try to kfree it again and the allocator warns of a double free :/

This works fine on pseries/KVM because the PHB is never removed and 
on pseries/pHyp, pcibios_scan_phb() is called at PHB hotplug. But on 
PowerNV, pcibios_scan_phb() is only called at probe/boot time and 
not at hotplug time :/

I was a good thing to spot that before merge. But I need to revise 
that patch again.

Thanks,

C.

^ permalink raw reply

* Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
From: peterz @ 2020-09-24  8:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, David Airlie, Sebastian Andrzej Siewior,
	Joonas Lahtinen, dri-devel, linux-mips, Ben Segall, Max Filippov,
	Guo Ren, linux-sparc, Vincent Chen, Will Deacon, Ard Biesheuvel,
	linux-arch, Vincent Guittot, Herbert Xu, the arch/x86 maintainers,
	Russell King, linux-csky, Mel Gorman,
	open list:SYNOPSYS ARC ARCHITECTURE, linux-xtensa, Paul McKenney,
	intel-gfx, linuxppc-dev, Jani Nikula, Greentime Hu, Rodrigo Vivi,
	Thomas Gleixner, Dietmar Eggemann, Linux ARM, Chris Zankel,
	Michal Simek, Thomas Bogendoerfer, Nick Hu, Linux-MM,
	Linus Torvalds, LKML, Arnd Bergmann, Daniel Vetter, Vineet Gupta,
	Paul Mackerras, Andrew Morton, Daniel Bristot de Oliveira,
	David S. Miller
In-Reply-To: <20200923115251.7cc63a7e@oasis.local.home>

On Wed, Sep 23, 2020 at 11:52:51AM -0400, Steven Rostedt wrote:
> On Wed, 23 Sep 2020 10:40:32 +0200
> peterz@infradead.org wrote:
> 
> > However, with migrate_disable() we can have each task preempted in a
> > migrate_disable() region, worse we can stack them all on the _same_ CPU
> > (super ridiculous odds, sure). And then we end up only able to run one
> > task, with the rest of the CPUs picking their nose.
> 
> What if we just made migrate_disable() a local_lock() available for !RT?

Can't, neiter migrate_disable() nor migrate_enable() are allowed to
block -- which is what makes their implementation so 'interesting'.

> This should lower the SHC in theory, if you can't have stacked migrate
> disables on the same CPU.

See this email in that other thread (you're on Cc too IIRC):

  https://lkml.kernel.org/r/20200923170809.GY1362448@hirez.programming.kicks-ass.net

I think that is we 'frob' the balance PULL, we'll end up with something
similar.

Whichever way around we turn this thing, the migrate_disable() runtime
(we'll have to add a tracer for that), will be an interference term on
the lower priority task, exactly like preempt_disable() would be. We'll
just not exclude a higher priority task from running.

AFAICT; the best case is a single migrate_disable() nesting, where a
higher priority task preempts in a migrate_disable() section -- this is
per design.

When this preempted task becomes elegible to run under the ideal model
(IOW it becomes one of the M highest priority tasks), it might still
have to wait for the preemptee's migrate_disable() section to complete.
Thereby suffering interference in the exact duration of
migrate_disable() section.

Per this argument, the change from preempt_disable() to
migrate_disable() gets us:

 - higher priority tasks gain reduced wake-up latency
 - lower priority tasks are unchanged and are subject to the exact same
   interference term as if the higher priority task were using
   preempt_disable().

Since we've already established this term is unbounded, any task but the
highest priority task is basically buggered.

TL;DR, if we get balancing fixed and achieve (near) the optimal case
above, migrate_disable() is an over-all win. But it's provably
non-deterministic as long as the migrate_disable() sections are
non-deterministic.

The reason this all mostly works in practise is (I think) because:

 - People care most about the higher prio RT tasks and craft them to
   mostly avoid the migrate_disable() infected code.

 - The preemption scenario is most pronounced at higher utilization
   scenarios, and I suspect this is fairly rare to begin with.

 - And while many of these migrate_disable() regions are unbound in
   theory, in practise they're often fairly reasonable.

So my current todo list is:

 - Change RT PULL
 - Change DL PULL
 - Add migrate_disable() tracer; exactly like preempt/irqoff, except
   measuring task-runtime instead of cpu-time.
 - Add a mode that measures actual interference.
 - Add a traceevent to detect preemption in migrate_disable().

And then I suppose I should twist Daniel's arm to update his model to
include these scenarios and numbers.

^ permalink raw reply

* [Bug 195755] rcu_sched detected stalls on CPUs/tasks: (detected by 0, t=6302 jiffies, g=11405, c=11404, q=1880), ppc64, G5
From: bugzilla-daemon @ 2020-09-24  7:41 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <bug-195755-206035@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=195755

Nigel Reed (nigel@nigelreed.net) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nigel@nigelreed.net

--- Comment #29 from Nigel Reed (nigel@nigelreed.net) ---
I know this is old but I have been having some issues for a while, I was
finally able to get something useful:

[165716.089703] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[165716.095949] rcu:    1-...!: (0 ticks this GP) idle=354/0/0x0
softirq=2154363/2154363 fqs=0
[165716.104512] rcu:    3-...!: (0 ticks this GP) idle=29c/0/0x0
softirq=883832/883832 fqs=0
[165716.112873] rcu:    4-...!: (8 GPs behind) idle=ad8/0/0x0
softirq=2165586/2165586 fqs=0
[165716.121179] rcu:    9-...!: (9 GPs behind) idle=acc/0/0x0
softirq=1340600/1340600 fqs=0
[165716.129467] rcu:    11-...!: (2 GPs behind) idle=a18/0/0x0
softirq=4538536/4538537 fqs=0
[165716.137828] rcu:    12-...!: (0 ticks this GP) idle=870/0/0x0
softirq=2158040/2158040 fqs=0
[165775.697763] rcu: rcu_sched kthread starved for 29898 jiffies! g36134941
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=3
[165775.709013] rcu: RCU grace-period kthread stack dump:
[165837.494623] watchdog: BUG: soft lockup - CPU#6 stuck for 23s!
[(resolved):52315]
[165865.494840] watchdog: BUG: soft lockup - CPU#6 stuck for 23s!
[(resolved):52315]

This happened just after freshclam ran but I don't know if it's related.

This is with a Ryzen 7 1800X CPU.
5.4.0-48-generic #52-Ubuntu

I thought I had sysrq configured but it seems not so I can't really provide any
more information, other than this is driving me crazy.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply

* Re: [PATCH kernel] powerpc/dma: Fix dma_map_ops::get_required_mask
From: Alexey Kardashevskiy @ 2020-09-24  7:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Oliver OHalloran, linuxppc-dev@lists.ozlabs.org,
	Cédric Le Goater
In-Reply-To: <20200923141020.GA12374@lst.de>



On 24/09/2020 00:10, Christoph Hellwig wrote:
> On Tue, Sep 22, 2020 at 12:26:18PM +1000, Alexey Kardashevskiy wrote:
>>> Well, the original intent of dma_get_required_mask is to return the
>>> mask that the driver then uses to figure out what to set, so what aacraid
>>> does fits that use case. 
>>
>> What was the original intent exactly? The driver asks for the minimum or
>> maximum DMA mask the platform supports?
>>
>> As for now, we (ppc64/powernv) can do:
>> 1. bypass (==64bit)
>> 2. a DMA window which used to be limited by 2GB but not anymore.
>>
>> I can understand if the driver asked for required mask in expectation to
>> receive "less or equal than 32bit" and "more than 32 bit" and choose.
>> And this probably was the intent as at the time when the bug was
>> introduced, the window was always smaller than 4GB.
>>
>> But today the window is bigger than than (44 bits now, or a similar
>> value, depends on max page order) so the returned mask is >32. Which
>> still enables that DAC in aacraid but I suspect this is accidental.
> 
> I think for powernv returning 64-bit always would make a lot of sense.
> AFAIK all of powernv is PCIe and not legacy PCI, so returning anything
> less isn't going to help to optimize anything.

May be... The current behavior is not wrong (after the fix) but not
optimal either. Even with legacy PCI it should just result in failing
attempt to set 64bit mask which drivers should still handle, i.e. choose
a shorter mask.

Why not ditch the whole dma_get_required_mask() and just fail on setting
a bigger mask? Are these failures not handled in some drivers? Or there
are cases when a shorter mask is better? Thanks,


>>> Of course that idea is pretty bogus for
>>> PCIe devices.
>>
>> Why? From the PHB side, there are windows. From the device side, there
>> are many crippled devices, like, no GPU I saw in last years supported
>> more than 48bit.
> 
> Yes, but dma_get_required_mask is misnamed - the mask is not required,
> it is the optimal mask.  Even if the window is smaller we handle it
> some way, usually by using swiotlb, or by iommu tricks in your case.
>
>>> I suspect the right fix is to just not query dma_get_required_mask for
>>> PCIe devices in aacraid (and other drivers that do something similar).
>>
>> May be, if you write nice and big comment next to
>> dma_get_required_mask() explaining exactly what it does, then I will
>> realize I am getting this all wrong and we will move to fixing the
>> drivers :)
> 
> Yes, it needs a comment or two, and probaby be renamed to
> dma_get_optimal_dma_mask, and a cleanup of most users.  I've added it
> to my ever growing TODO list, but I would not be unhappy if someone
> else gives it a spin.
> 

-- 
Alexey

^ permalink raw reply

* Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
From: Thomas Gleixner @ 2020-09-24  6:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, peterz, Sebastian Andrzej Siewior, Joonas Lahtinen,
	dri-devel, linux-mips, Ben Segall, Max Filippov, Guo Ren,
	linux-sparc, Vincent Chen, Will Deacon, Ard Biesheuvel,
	linux-arch, Vincent Guittot, Herbert Xu, the arch/x86 maintainers,
	Russell King, linux-csky, David Airlie, Mel Gorman,
	open list:SYNOPSYS ARC ARCHITECTURE, linux-xtensa, Paul McKenney,
	intel-gfx, linuxppc-dev, Jani Nikula, Greentime Hu, Rodrigo Vivi,
	Dietmar Eggemann, Linux ARM, Chris Zankel, Michal Simek,
	Thomas Bogendoerfer, Nick Hu, Linux-MM, Linus Torvalds, LKML,
	Arnd Bergmann, Daniel Vetter, Vineet Gupta, Paul Mackerras,
	Andrew Morton, Daniel Bristot de Oliveira, David S. Miller
In-Reply-To: <20200923171234.0001402d@oasis.local.home>

On Wed, Sep 23 2020 at 17:12, Steven Rostedt wrote:
> On Wed, 23 Sep 2020 22:55:54 +0200
> Then scratch the idea of having anonymous local_lock() and just bring
> local_lock in directly? Then have a kmap local lock, which would only
> block those that need to do a kmap.

That's still going to end up in lock ordering nightmares and you lose
the ability to use kmap_local from arbitrary contexts which was again
one of the goals of this exercise.

Aside of that you're imposing reentrancy protections on something which
does not need it in the first place.

> Now as for migration disabled nesting, at least now we would have
> groupings of this, and perhaps the theorists can handle that. I mean,
> how is this much different that having a bunch of tasks blocked on a
> mutex with the owner is pinned on a CPU?
>
> migrate_disable() is a BKL of pinning affinity.

No. That's just wrong. preempt disable is a concurrency control,
i.e. protecting against reentrancy on a given CPU. But it's a cpu global
protection which means that it's not protecting a specific code path.

Contrary to preempt disable, migrate disable is not protecting against
reentrancy on a given CPU. It's a temporary restriction to the scheduler
on placement.

The fact that disabling preemption implicitely disables migration does
not make them semantically equivalent.

> If we only have local_lock() available (even on !RT), then it makes
> the blocking in groups. At least this way you could grep for all the
> different local_locks in the system and plug that into the algorithm
> for WCS, just like one would with a bunch of mutexes.

You cannot do that on RT at all where migrate disable is substituting
preempt disable in spin and rw locks. The result would be the same as
with a !RT kernel just with horribly bad performance.

That means the stacking problem has to be solved anyway.

So why on earth do you want to create yet another special duct tape case
for kamp_local() which proliferates inconsistency instead of aiming for
consistency accross all preemption models?

Thanks,

        tglx

^ permalink raw reply

* Re: [PATCH] rpadlpar_io:Add MODULE_DESCRIPTION entries to kernel modules
From: Oliver O'Halloran @ 2020-09-24  6:41 UTC (permalink / raw)
  To: Mamatha Inamdar
  Cc: Tyrel Datwyler, linux-pci, Linux Kernel Mailing List,
	Bjorn Helgaas, linuxppc-dev
In-Reply-To: <20200924051343.16052.9571.stgit@localhost.localdomain>

On Thu, Sep 24, 2020 at 3:15 PM Mamatha Inamdar
<mamatha4@linux.vnet.ibm.com> wrote:
>
> This patch adds a brief MODULE_DESCRIPTION to rpadlpar_io kernel modules
> (descriptions taken from Kconfig file)
>
> Signed-off-by: Mamatha Inamdar <mamatha4@linux.vnet.ibm.com>
> ---
>  drivers/pci/hotplug/rpadlpar_core.c |    1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/pci/hotplug/rpadlpar_core.c b/drivers/pci/hotplug/rpadlpar_core.c
> index f979b70..bac65ed 100644
> --- a/drivers/pci/hotplug/rpadlpar_core.c
> +++ b/drivers/pci/hotplug/rpadlpar_core.c
> @@ -478,3 +478,4 @@ static void __exit rpadlpar_io_exit(void)
>  module_init(rpadlpar_io_init);
>  module_exit(rpadlpar_io_exit);
>  MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("RPA Dynamic Logical Partitioning driver for I/O slots");

RPA as a spec was superseded by PAPR in the early 2000s. Can we rename
this already?

The only potential problem I can see is scripts doing: modprobe
rpadlpar_io or similar

However, we should be able to fix that with a module alias.

Oliver

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox