LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH kernel v2 1/2] dma: Allow mixing bypass and normal IOMMU operation
From: Alexey Kardashevskiy @ 2020-10-28 23:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: iommu, linuxppc-dev, linux-kernel
In-Reply-To: <20201028172106.GA10015@lst.de>



On 29/10/2020 04:21, Christoph Hellwig wrote:
> On Wed, Oct 28, 2020 at 05:55:23PM +1100, Alexey Kardashevskiy wrote:
>>
>> It is passing an address of the end of the mapped area so passing a page
>> struct means passing page and offset which is an extra parameter and we do
>> not want to do anything with the page in those hooks anyway so I'd keep it
>> as is.
>>
>>
>>> and
>>>      maybe even hide the dma_map_direct inside it.
>>
>> Call dma_map_direct() from arch_dma_map_page_direct() if
>> arch_dma_map_page_direct() is defined? Seems suboptimal as it is going to
>> be bypass=true in most cases and we save one call by avoiding calling
>> arch_dma_map_page_direct(). Unless I missed something?
> 
> C does not even evaluate the right hand side of a || expression if the
> left hand evaluates to true.

Right, this is what I meant. dma_map_direct() is inline and fast so I 
did not want it inside the arch hook which is not inline.


-- 
Alexey

^ permalink raw reply

* Re: [PATCH kernel v3 1/2] dma: Allow mixing bypass and mapped DMA operation
From: Alexey Kardashevskiy @ 2020-10-28 23:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: iommu, linuxppc-dev, linux-kernel
In-Reply-To: <20201028172201.GB10015@lst.de>



On 29/10/2020 04:22, Christoph Hellwig wrote:
> On Wed, Oct 28, 2020 at 06:00:29PM +1100, Alexey Kardashevskiy wrote:
>> At the moment we allow bypassing DMA ops only when we can do this for
>> the entire RAM. However there are configs with mixed type memory
>> where we could still allow bypassing IOMMU in most cases;
>> POWERPC with persistent memory is one example.
>>
>> This adds an arch hook to determine where bypass can still work and
>> we invoke direct DMA API. The following patch checks the bus limit
>> on POWERPC to allow or disallow direct mapping.
>>
>> This adds a CONFIG_ARCH_HAS_DMA_SET_MASK config option to make arch_xxxx
>> hooks no-op by default.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>   kernel/dma/mapping.c | 24 ++++++++++++++++++++----
>>   kernel/dma/Kconfig   |  4 ++++
>>   2 files changed, 24 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
>> index 51bb8fa8eb89..a0bc9eb876ed 100644
>> --- a/kernel/dma/mapping.c
>> +++ b/kernel/dma/mapping.c
>> @@ -137,6 +137,18 @@ static inline bool dma_map_direct(struct device *dev,
>>   	return dma_go_direct(dev, *dev->dma_mask, ops);
>>   }
>>   
>> +#ifdef CONFIG_ARCH_HAS_DMA_MAP_DIRECT
>> +bool arch_dma_map_page_direct(struct device *dev, phys_addr_t addr);
>> +bool arch_dma_unmap_page_direct(struct device *dev, dma_addr_t dma_handle);
>> +bool arch_dma_map_sg_direct(struct device *dev, struct scatterlist *sg, int nents);
>> +bool arch_dma_unmap_sg_direct(struct device *dev, struct scatterlist *sg, int nents);
>> +#else
>> +#define arch_dma_map_page_direct(d, a) (0)
>> +#define arch_dma_unmap_page_direct(d, a) (0)
>> +#define arch_dma_map_sg_direct(d, s, n) (0)
>> +#define arch_dma_unmap_sg_direct(d, s, n) (0)
>> +#endif
> 
> A bunch of overly long lines here.  Except for that this looks ok to me.
> If you want me to queue up the series I can just fix it up.

I thought 100 is the new limit since 
https://lkml.org/lkml/2020/5/29/1038 (yeah that mentioned some Christoph 
:) ) and having these multiline does not make a huge difference but feel 
free fixing them up.

Are you going to take both patches? Do you need mpe's ack? Thanks,


-- 
Alexey

^ permalink raw reply

* Re: [PATCH 0/4] Powerpc: Better preemption for shared processor
From: Waiman Long @ 2020-10-29  0:01 UTC (permalink / raw)
  To: Srikar Dronamraju, Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Phil Auld, Juri Lelli,
	Peter Zijlstra, LKML, Nicholas Piggin, linuxppc-dev,
	Valentin Schneider
In-Reply-To: <20201028123512.871051-1-srikar@linux.vnet.ibm.com>

On 10/28/20 8:35 AM, Srikar Dronamraju wrote:
> Currently, vcpu_is_preempted will return the yield_count for
> shared_processor. On a PowerVM LPAR, Phyp schedules at SMT8 core boundary
> i.e all CPUs belonging to a core are either group scheduled in or group
> scheduled out. This can be used to better predict non-preempted CPUs on
> PowerVM shared LPARs.
>
> perf stat -r 5 -a perf bench sched pipe -l 10000000 (lesser time is better)
>
> powerpc/next
>       35,107,951.20 msec cpu-clock                 #  255.898 CPUs utilized            ( +-  0.31% )
>          23,655,348      context-switches          #    0.674 K/sec                    ( +-  3.72% )
>              14,465      cpu-migrations            #    0.000 K/sec                    ( +-  5.37% )
>              82,463      page-faults               #    0.002 K/sec                    ( +-  8.40% )
>   1,127,182,328,206      cycles                    #    0.032 GHz                      ( +-  1.60% )  (66.67%)
>      78,587,300,622      stalled-cycles-frontend   #    6.97% frontend cycles idle     ( +-  0.08% )  (50.01%)
>     654,124,218,432      stalled-cycles-backend    #   58.03% backend cycles idle      ( +-  1.74% )  (50.01%)
>     834,013,059,242      instructions              #    0.74  insn per cycle
>                                                    #    0.78  stalled cycles per insn  ( +-  0.73% )  (66.67%)
>     132,911,454,387      branches                  #    3.786 M/sec                    ( +-  0.59% )  (50.00%)
>       2,890,882,143      branch-misses             #    2.18% of all branches          ( +-  0.46% )  (50.00%)
>
>             137.195 +- 0.419 seconds time elapsed  ( +-  0.31% )
>
> powerpc/next + patchset
>       29,981,702.64 msec cpu-clock                 #  255.881 CPUs utilized            ( +-  1.30% )
>          40,162,456      context-switches          #    0.001 M/sec                    ( +-  0.01% )
>               1,110      cpu-migrations            #    0.000 K/sec                    ( +-  5.20% )
>              62,616      page-faults               #    0.002 K/sec                    ( +-  3.93% )
>   1,430,030,626,037      cycles                    #    0.048 GHz                      ( +-  1.41% )  (66.67%)
>      83,202,707,288      stalled-cycles-frontend   #    5.82% frontend cycles idle     ( +-  0.75% )  (50.01%)
>     744,556,088,520      stalled-cycles-backend    #   52.07% backend cycles idle      ( +-  1.39% )  (50.01%)
>     940,138,418,674      instructions              #    0.66  insn per cycle
>                                                    #    0.79  stalled cycles per insn  ( +-  0.51% )  (66.67%)
>     146,452,852,283      branches                  #    4.885 M/sec                    ( +-  0.80% )  (50.00%)
>       3,237,743,996      branch-misses             #    2.21% of all branches          ( +-  1.18% )  (50.01%)
>
>              117.17 +- 1.52 seconds time elapsed  ( +-  1.30% )
>
> This is around 14.6% improvement in performance.
>
> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Phil Auld <pauld@redhat.com>
>
> Srikar Dronamraju (4):
>    powerpc: Refactor is_kvm_guest declaration to new header
>    powerpc: Rename is_kvm_guest to check_kvm_guest
>    powerpc: Reintroduce is_kvm_guest
>    powerpc/paravirt: Use is_kvm_guest in vcpu_is_preempted
>
>   arch/powerpc/include/asm/firmware.h  |  6 ------
>   arch/powerpc/include/asm/kvm_guest.h | 25 +++++++++++++++++++++++++
>   arch/powerpc/include/asm/kvm_para.h  |  2 +-
>   arch/powerpc/include/asm/paravirt.h  | 18 ++++++++++++++++++
>   arch/powerpc/kernel/firmware.c       |  5 ++++-
>   arch/powerpc/platforms/pseries/smp.c |  3 ++-
>   6 files changed, 50 insertions(+), 9 deletions(-)
>   create mode 100644 arch/powerpc/include/asm/kvm_guest.h
>
This patch series looks good to me and the performance is nice too.

Acked-by: Waiman Long <longman@redhat.com>

Just curious, is the performance mainly from the use of static_branch 
(patches 1 - 3) or from reducing call to yield_count_of().

Cheers,
Longman


^ permalink raw reply

* Re: [PATCH] powerpc/smp: Move rcu_cpu_starting() earlier
From: Michael Ellerman @ 2020-10-29  0:09 UTC (permalink / raw)
  To: Qian Cai, Paul E . McKenney
  Cc: Peter Zijlstra, linux-kernel, Qian Cai, Paul Mackerras,
	linuxppc-dev
In-Reply-To: <20201028182334.13466-1-cai@redhat.com>

Qian Cai <cai@redhat.com> writes:
> The call to rcu_cpu_starting() in start_secondary() is not early enough
> in the CPU-hotplug onlining process, which results in lockdep splats as
> follows:

Since when?
What kernel version?

I haven't seen this running CPU hotplug tests with PROVE_LOCKING=y on
v5.10-rc1. Am I missing a CONFIG?

cheers


>  WARNING: suspicious RCU usage
>  -----------------------------
>  kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!
>
>  other info that might help us debug this:
>
>  RCU used illegally from offline CPU!
>  rcu_scheduler_active = 1, debug_locks = 1
>  no locks held by swapper/1/0.
>
>  Call Trace:
>  dump_stack+0xec/0x144 (unreliable)
>  lockdep_rcu_suspicious+0x128/0x14c
>  __lock_acquire+0x1060/0x1c60
>  lock_acquire+0x140/0x5f0
>  _raw_spin_lock_irqsave+0x64/0xb0
>  clockevents_register_device+0x74/0x270
>  register_decrementer_clockevent+0x94/0x110
>  start_secondary+0x134/0x800
>  start_secondary_prolog+0x10/0x14
>
> This is avoided by moving the call to rcu_cpu_starting up near the
> beginning of the start_secondary() function. Note that the
> raw_smp_processor_id() is required in order to avoid calling into
> lockdep before RCU has declared the CPU to be watched for readers.
>
> Link: https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
> Signed-off-by: Qian Cai <cai@redhat.com>
> ---
>  arch/powerpc/kernel/smp.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 3c6b9822f978..8c2857cbd960 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1393,13 +1393,14 @@ static void add_cpu_to_masks(int cpu)
>  /* Activate a secondary processor. */
>  void start_secondary(void *unused)
>  {
> -	unsigned int cpu = smp_processor_id();
> +	unsigned int cpu = raw_smp_processor_id();
>  
>  	mmgrab(&init_mm);
>  	current->active_mm = &init_mm;
>  
>  	smp_store_cpu_info(cpu);
>  	set_dec(tb_ticks_per_jiffy);
> +	rcu_cpu_starting(cpu);
>  	preempt_disable();
>  	cpu_callin_map[cpu] = 1;
>  
> -- 
> 2.28.0

^ permalink raw reply

* Re: [PATCH 01/13] PCI: dwc/imx6: Drop setting PCI_MSI_FLAGS_ENABLE
From: Michael Ellerman @ 2020-10-29  0:21 UTC (permalink / raw)
  To: Rob Herring, Lorenzo Pieralisi
  Cc: Kunihiko Hayashi, Neil Armstrong, linux-pci, Binghui Wang,
	Bjorn Andersson, Minghuan Lian, Thierry Reding,
	Krzysztof Kozlowski, Thomas Petazzoni, Jonathan Chocron,
	Jonathan Hunter, Fabio Estevam, Jerome Brunet, Jesper Nilsson,
	linux-samsung-soc, Kevin Hilman, Pratyush Anand, linux-arm-kernel,
	Kishon Vijay Abraham I, Kukjin Kim, NXP Linux Team, Xiaowei Song,
	Richard Zhu, Martin Blumenstingl, linux-arm-msm, Sascha Hauer,
	linuxppc-dev, Yue Wang, Murali Karicheri, linux-tegra,
	linux-amlogic, linux-omap, Mingkai Hu, Roy Zang, Bjorn Helgaas,
	Masahiro Yamada, Jingoo Han, Andy Gross, Stanimir Varbanov,
	Pengutronix Kernel Team, Gustavo Pimentel, Shawn Guo, Lucas Stach
In-Reply-To: <20201028204646.356535-2-robh@kernel.org>

Rob Herring <robh@kernel.org> writes:
> No other host driver sets the PCI_MSI_FLAGS_ENABLE bit, so it must not
> be necessary. If it is, a comment is needed.

Yeah, but git blame directly points to:

  75cb8d20c112 ("PCI: imx: Enable MSI from downstream components")

Which has a pretty long explanation. The relevant bit probably being:

  ... on i.MX6, the MSI Enable bit controls delivery of MSI interrupts
  from components below the Root Port.


So it seems a little rash to just remove the code.

cheers


> Cc: Richard Zhu <hongxing.zhu@nxp.com>
> Cc: Lucas Stach <l.stach@pengutronix.de>
> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Shawn Guo <shawnguo@kernel.org>
> Cc: Sascha Hauer <s.hauer@pengutronix.de>
> Cc: Pengutronix Kernel Team <kernel@pengutronix.de>
> Cc: Fabio Estevam <festevam@gmail.com>
> Cc: NXP Linux Team <linux-imx@nxp.com>
> Signed-off-by: Rob Herring <robh@kernel.org>
> ---
>  drivers/pci/controller/dwc/pci-imx6.c | 8 --------
>  1 file changed, 8 deletions(-)
>
> diff --git a/drivers/pci/controller/dwc/pci-imx6.c b/drivers/pci/controller/dwc/pci-imx6.c
> index 5cf1ef12fb9b..7dd137d62dca 100644
> --- a/drivers/pci/controller/dwc/pci-imx6.c
> +++ b/drivers/pci/controller/dwc/pci-imx6.c
> @@ -1002,7 +1002,6 @@ static int imx6_pcie_probe(struct platform_device *pdev)
>  	struct resource *dbi_base;
>  	struct device_node *node = dev->of_node;
>  	int ret;
> -	u16 val;
>  
>  	imx6_pcie = devm_kzalloc(dev, sizeof(*imx6_pcie), GFP_KERNEL);
>  	if (!imx6_pcie)
> @@ -1167,13 +1166,6 @@ static int imx6_pcie_probe(struct platform_device *pdev)
>  	if (ret < 0)
>  		return ret;
>  
> -	if (pci_msi_enabled()) {
> -		u8 offset = dw_pcie_find_capability(pci, PCI_CAP_ID_MSI);
> -		val = dw_pcie_readw_dbi(pci, offset + PCI_MSI_FLAGS);
> -		val |= PCI_MSI_FLAGS_ENABLE;
> -		dw_pcie_writew_dbi(pci, offset + PCI_MSI_FLAGS, val);
> -	}
> -
>  	return 0;
>  }
>  
> -- 
> 2.25.1

^ permalink raw reply

* Re: [PATCH] powerpc/smp: Move rcu_cpu_starting() earlier
From: Paul E. McKenney @ 2020-10-29  0:31 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Qian Cai, linux-kernel, Peter Zijlstra, Paul Mackerras,
	linuxppc-dev
In-Reply-To: <87lffpx598.fsf@mpe.ellerman.id.au>

On Thu, Oct 29, 2020 at 11:09:07AM +1100, Michael Ellerman wrote:
> Qian Cai <cai@redhat.com> writes:
> > The call to rcu_cpu_starting() in start_secondary() is not early enough
> > in the CPU-hotplug onlining process, which results in lockdep splats as
> > follows:
> 
> Since when?
> What kernel version?
> 
> I haven't seen this running CPU hotplug tests with PROVE_LOCKING=y on
> v5.10-rc1. Am I missing a CONFIG?

My guess would be that adding CONFIG_PROVE_RAW_LOCK_NESTING=y will
get you some splats.

							Thanx, Paul

> cheers
> 
> 
> >  WARNING: suspicious RCU usage
> >  -----------------------------
> >  kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!
> >
> >  other info that might help us debug this:
> >
> >  RCU used illegally from offline CPU!
> >  rcu_scheduler_active = 1, debug_locks = 1
> >  no locks held by swapper/1/0.
> >
> >  Call Trace:
> >  dump_stack+0xec/0x144 (unreliable)
> >  lockdep_rcu_suspicious+0x128/0x14c
> >  __lock_acquire+0x1060/0x1c60
> >  lock_acquire+0x140/0x5f0
> >  _raw_spin_lock_irqsave+0x64/0xb0
> >  clockevents_register_device+0x74/0x270
> >  register_decrementer_clockevent+0x94/0x110
> >  start_secondary+0x134/0x800
> >  start_secondary_prolog+0x10/0x14
> >
> > This is avoided by moving the call to rcu_cpu_starting up near the
> > beginning of the start_secondary() function. Note that the
> > raw_smp_processor_id() is required in order to avoid calling into
> > lockdep before RCU has declared the CPU to be watched for readers.
> >
> > Link: https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
> > Signed-off-by: Qian Cai <cai@redhat.com>
> > ---
> >  arch/powerpc/kernel/smp.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 3c6b9822f978..8c2857cbd960 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -1393,13 +1393,14 @@ static void add_cpu_to_masks(int cpu)
> >  /* Activate a secondary processor. */
> >  void start_secondary(void *unused)
> >  {
> > -	unsigned int cpu = smp_processor_id();
> > +	unsigned int cpu = raw_smp_processor_id();
> >  
> >  	mmgrab(&init_mm);
> >  	current->active_mm = &init_mm;
> >  
> >  	smp_store_cpu_info(cpu);
> >  	set_dec(tb_ticks_per_jiffy);
> > +	rcu_cpu_starting(cpu);
> >  	preempt_disable();
> >  	cpu_callin_map[cpu] = 1;
> >  
> > -- 
> > 2.28.0

^ permalink raw reply

* Re: [PATCH kernel v3 2/2] powerpc/dma: Fallback to dma_ops when persistent memory present
From: Michael Ellerman @ 2020-10-29  0:40 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: Alexey Kardashevskiy, iommu, Christoph Hellwig, linux-kernel
In-Reply-To: <20201028070030.60643-3-aik@ozlabs.ru>

Alexey Kardashevskiy <aik@ozlabs.ru> writes:
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index e4198700ed1a..91112e748491 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -1111,11 +1112,13 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
>   */
>  static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>  {
> -	int len, ret;
> +	int len = 0, ret;
> +	bool pmem_present = of_find_node_by_type(NULL, "ibm,pmemory") != NULL;

That leaks a reference on the returned node.

	dn = of_find_node_by_type(NULL, "ibm,pmemory");
	pmem_present = dn != NULL;
	of_node_put(dn);


> @@ -1126,7 +1129,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>  
>  	mutex_lock(&direct_window_init_mutex);
>  
> -	dma_addr = find_existing_ddw(pdn);
> +	dma_addr = find_existing_ddw(pdn, &len);

I don't see len used anywhere?

>  	if (dma_addr != 0)
>  		goto out_unlock;
>  
> @@ -1212,14 +1215,26 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>  	}
>  	/* verify the window * number of ptes will map the partition */
>  	/* check largest block * page size > max memory hotplug addr */
> -	max_addr = ddw_memory_hotplug_max();
> -	if (query.largest_available_block < (max_addr >> page_shift)) {
> -		dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
> -			  "%llu-sized pages\n", max_addr,  query.largest_available_block,
> -			  1ULL << page_shift);
> +	/*
> +	 * The "ibm,pmemory" can appear anywhere in the address space.
> +	 * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
> +	 * for the upper limit and fallback to max RAM otherwise but this
> +	 * disables device::dma_ops_bypass.
> +	 */
> +	len = max_ram_len;

Here you override whatever find_existing_ddw() wrote to len?

> +	if (pmem_present) {
> +		if (query.largest_available_block >=
> +		    (1ULL << (MAX_PHYSMEM_BITS - page_shift)))
> +			len = MAX_PHYSMEM_BITS - page_shift;
> +		else
> +			dev_info(&dev->dev, "Skipping ibm,pmemory");
> +	}
> +
> +	if (query.largest_available_block < (1ULL << (len - page_shift))) {
> +		dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu %llu-sized pages\n",
> +			1ULL << len, query.largest_available_block, 1ULL << page_shift);
>  		goto out_failed;
>  	}
> -	len = order_base_2(max_addr);
>  	win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
>  	if (!win64) {
>  		dev_info(&dev->dev,


cheers

^ permalink raw reply

* Re: [PATCH kernel v3 2/2] powerpc/dma: Fallback to dma_ops when persistent memory present
From: Alexey Kardashevskiy @ 2020-10-29  0:46 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: iommu, Christoph Hellwig, linux-kernel
In-Reply-To: <87eelhx3t6.fsf@mpe.ellerman.id.au>



On 29/10/2020 11:40, Michael Ellerman wrote:
> Alexey Kardashevskiy <aik@ozlabs.ru> writes:
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>> index e4198700ed1a..91112e748491 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -1111,11 +1112,13 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
>>    */
>>   static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>   {
>> -	int len, ret;
>> +	int len = 0, ret;
>> +	bool pmem_present = of_find_node_by_type(NULL, "ibm,pmemory") != NULL;
> 
> That leaks a reference on the returned node.
> 
> 	dn = of_find_node_by_type(NULL, "ibm,pmemory");
> 	pmem_present = dn != NULL;
> 	of_node_put(dn);


ah, true. v4 then.


> 
> 
>> @@ -1126,7 +1129,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>   
>>   	mutex_lock(&direct_window_init_mutex);
>>   
>> -	dma_addr = find_existing_ddw(pdn);
>> +	dma_addr = find_existing_ddw(pdn, &len);
> 
> I don't see len used anywhere?
> 
>>   	if (dma_addr != 0)
>>   		goto out_unlock;
>>   
>> @@ -1212,14 +1215,26 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>   	}
>>   	/* verify the window * number of ptes will map the partition */
>>   	/* check largest block * page size > max memory hotplug addr */
>> -	max_addr = ddw_memory_hotplug_max();
>> -	if (query.largest_available_block < (max_addr >> page_shift)) {
>> -		dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
>> -			  "%llu-sized pages\n", max_addr,  query.largest_available_block,
>> -			  1ULL << page_shift);
>> +	/*
>> +	 * The "ibm,pmemory" can appear anywhere in the address space.
>> +	 * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
>> +	 * for the upper limit and fallback to max RAM otherwise but this
>> +	 * disables device::dma_ops_bypass.
>> +	 */
>> +	len = max_ram_len;
> 
> Here you override whatever find_existing_ddw() wrote to len?


Not always, there is a bunch of gotos before this line to the end of the 
function and one (which returns the existing window) is legit. Thanks,



> 
>> +	if (pmem_present) {
>> +		if (query.largest_available_block >=
>> +		    (1ULL << (MAX_PHYSMEM_BITS - page_shift)))
>> +			len = MAX_PHYSMEM_BITS - page_shift;
>> +		else
>> +			dev_info(&dev->dev, "Skipping ibm,pmemory");
>> +	}
>> +
>> +	if (query.largest_available_block < (1ULL << (len - page_shift))) {
>> +		dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu %llu-sized pages\n",
>> +			1ULL << len, query.largest_available_block, 1ULL << page_shift);
>>   		goto out_failed;
>>   	}
>> -	len = order_base_2(max_addr);
>>   	win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
>>   	if (!win64) {
>>   		dev_info(&dev->dev,
> 
> 
> cheers
> 

-- 
Alexey

^ permalink raw reply

* Re: [PATCH] ibmvscsi: fix race potential race after loss of transport
From: Tyrel Datwyler @ 2020-10-29  1:28 UTC (permalink / raw)
  To: Michael Ellerman, james.bottomley
  Cc: brking, linuxppc-dev, linux-scsi, martin.petersen, linux-kernel
In-Reply-To: <87o8knvsb1.fsf@mpe.ellerman.id.au>

On 10/27/20 10:21 PM, Michael Ellerman wrote:
> Tyrel Datwyler <tyreld@linux.ibm.com> writes:
>> After a loss of tranport due to an adatper migration or crash/disconnect from
>> the host partner there is a tiny window where we can race adjusting the
>> request_limit of the adapter. The request limit is atomically inc/dec to track
>> the number of inflight requests against the allowed limit of our VIOS partner.
>> After a transport loss we set the request_limit to zero to reflect this state.
>> However, there is a window where the adapter may attempt to queue a command
>> because the transport loss event hasn't been fully processed yet and
>> request_limit is still greater than zero. The hypercall to send the event will
>> fail and the error path will increment the request_limit as a result. If the
>> adapter processes the transport event prior to this increment the request_limit
>> becomes out of sync with the adapter state and can result in scsi commands being
>> submitted on the now reset connection prior to an SRP Login resulting in a
>> protocol violation.
>>
>> Fix this race by protecting request_limit with the host lock when changing the
>> value via atomic_set() to indicate no transport.
>>
>> Signed-off-by: Tyrel Datwyler <tyreld@linux.ibm.com>
>> ---
>>  drivers/scsi/ibmvscsi/ibmvscsi.c | 36 +++++++++++++++++++++++---------
>>  1 file changed, 26 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
>> index b1f3017b6547..188ed75417a5 100644
>> --- a/drivers/scsi/ibmvscsi/ibmvscsi.c
>> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
>> @@ -806,6 +806,22 @@ static void purge_requests(struct ibmvscsi_host_data *hostdata, int error_code)
>>  	spin_unlock_irqrestore(hostdata->host->host_lock, flags);
>>  }
>>  
>> +/**
>> + * ibmvscsi_set_request_limit - Set the adapter request_limit in response to
>> + * an adapter failure, reset, or SRP Login. Done under host lock to prevent
>> + * race with scsi command submission.
>> + * @hostdata:	adapter to adjust
>> + * @limit:	new request limit
>> + */
>> +static void ibmvscsi_set_request_limit(struct ibmvscsi_host_data *hostdata, int limit)
>> +{
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(hostdata->host->host_lock, flags);
>> +	atomic_set(&hostdata->request_limit, limit);
>> +	spin_unlock_irqrestore(hostdata->host->host_lock, flags);
>> +}
>> +
>>  /**
>>   * ibmvscsi_reset_host - Reset the connection to the server
>>   * @hostdata:	struct ibmvscsi_host_data to reset
> ...
>> @@ -2137,12 +2153,12 @@ static void ibmvscsi_do_work(struct ibmvscsi_host_data *hostdata)
>>  	}
>>  
>>  	hostdata->action = IBMVSCSI_HOST_ACTION_NONE;
>> +	spin_unlock_irqrestore(hostdata->host->host_lock, flags);
> 
> You drop the lock ...
> 
>>  	if (rc) {
>> -		atomic_set(&hostdata->request_limit, -1);
>> +		ibmvscsi_set_request_limit(hostdata, -1);
> 
> .. then retake it, then drop it again in ibmvscsi_set_request_limit().
> 
> Which introduces the possibility that something else gets the lock
> before you can set the limit to -1.
> 
> I'm not sure that's a bug, but it's not obviously correct either?

Yeah, I'd already moved the request_limit update into its own function before I
got to this case which made me a bit uneasy when I realized I had to drop the
lock because my new function takes the lock. However, we only need to protect
ourselves from from racing with queuecommand() which is locked for its entire
call. Further, if we've gotten here it means we were either resetting or
re-enabling the adapter which would have already set request_limit to zero. At
this point the transport was already gone and we've further failed to reset it.
Also, we've blocked any new scsi requests at this point.

-Tyrel

> 
> cheers
> 
>>  		dev_err(hostdata->dev, "error after %s\n", action);
>>  	}
>> -	spin_unlock_irqrestore(hostdata->host->host_lock, flags);
>>  
>>  	scsi_unblock_requests(hostdata->host);
>>  }


^ permalink raw reply

* [PATCH kernel v4 0/2] DMA, powerpc/dma: Fallback to dma_ops when persistent memory present
From: Alexey Kardashevskiy @ 2020-10-29  1:52 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, iommu, Christoph Hellwig, linux-kernel


This allows mixing direct DMA (to/from RAM) and
IOMMU (to/from apersistent memory) on the PPC64/pseries
platform.

This replaces https://lkml.org/lkml/2020/10/28/929
which replaces https://lkml.org/lkml/2020/10/27/418
which replaces https://lkml.org/lkml/2020/10/20/1085


This is based on sha1
4525c8781ec0 Linus Torvalds "scsi: qla2xxx: remove incorrect sparse #ifdef".

Please comment. Thanks.



Alexey Kardashevskiy (2):
  dma: Allow mixing bypass and mapped DMA operation
  powerpc/dma: Fallback to dma_ops when persistent memory present

 arch/powerpc/kernel/dma-iommu.c        | 73 +++++++++++++++++++++++++-
 arch/powerpc/platforms/pseries/iommu.c | 51 ++++++++++++++----
 kernel/dma/mapping.c                   | 26 +++++++--
 arch/powerpc/Kconfig                   |  1 +
 kernel/dma/Kconfig                     |  4 ++
 5 files changed, 139 insertions(+), 16 deletions(-)

-- 
2.17.1


^ permalink raw reply

* [PATCH kernel v4 2/2] powerpc/dma: Fallback to dma_ops when persistent memory present
From: Alexey Kardashevskiy @ 2020-10-29  1:52 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, iommu, Christoph Hellwig, linux-kernel
In-Reply-To: <20201029015241.73920-1-aik@ozlabs.ru>

So far we have been using huge DMA windows to map all the RAM available.
The RAM is normally mapped to the VM address space contiguously, and
there is always a reasonable upper limit for possible future hot plugged
RAM which makes it easy to map all RAM via IOMMU.

Now there is persistent memory ("ibm,pmemory" in the FDT) which (unlike
normal RAM) can map anywhere in the VM space beyond the maximum RAM size
and since it can be used for DMA, it requires extending the huge window
up to MAX_PHYSMEM_BITS which requires hypervisor support for:
1. huge TCE tables;
2. multilevel TCE tables;
3. huge IOMMU pages.

Certain hypervisors cannot do either so the only option left is
restricting the huge DMA window to include only RAM and fallback to
the default DMA window for persistent memory.

This defines arch_dma_map_direct/etc to allow generic DMA code perform
additional checks on whether direct DMA is still possible.

This checks if the system has persistent memory. If it does not,
the DMA bypass mode is selected, i.e.
* dev->bus_dma_limit = 0
* dev->dma_ops_bypass = true <- this avoid calling dma_ops for mapping.

If there is such memory, this creates identity mapping only for RAM and
sets the dev->bus_dma_limit to let the generic code decide whether to
call into the direct DMA or the indirect DMA ops.

This should not change the existing behaviour when no persistent memory
as dev->dma_ops_bypass is expected to be set.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v4:
* fixed leaked device_node
* wrapped long lines
---
 arch/powerpc/kernel/dma-iommu.c        | 73 +++++++++++++++++++++++++-
 arch/powerpc/platforms/pseries/iommu.c | 51 ++++++++++++++----
 arch/powerpc/Kconfig                   |  1 +
 3 files changed, 113 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index a1c744194018..e5e9e5e3e3ca 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -10,6 +10,65 @@
 #include <linux/pci.h>
 #include <asm/iommu.h>
 
+#define can_map_direct(dev, addr) \
+	((dev)->bus_dma_limit >= phys_to_dma((dev), (addr)))
+
+bool arch_dma_map_page_direct(struct device *dev, phys_addr_t addr)
+{
+	if (likely(!dev->bus_dma_limit))
+		return false;
+
+	return can_map_direct(dev, addr);
+}
+EXPORT_SYMBOL_GPL(arch_dma_map_page_direct);
+
+#define is_direct_handle(dev, h) ((h) >= (dev)->archdata.dma_offset)
+
+bool arch_dma_unmap_page_direct(struct device *dev, dma_addr_t dma_handle)
+{
+	if (likely(!dev->bus_dma_limit))
+		return false;
+
+	return is_direct_handle(dev, dma_handle);
+}
+EXPORT_SYMBOL_GPL(arch_dma_unmap_page_direct);
+
+bool arch_dma_map_sg_direct(struct device *dev, struct scatterlist *sg,
+			    int nents)
+{
+	struct scatterlist *s;
+	int i;
+
+	if (likely(!dev->bus_dma_limit))
+		return false;
+
+	for_each_sg(sg, s, nents, i) {
+		if (!can_map_direct(dev, sg_phys(s) + s->offset + s->length))
+			return false;
+	}
+
+	return true;
+}
+EXPORT_SYMBOL(arch_dma_map_sg_direct);
+
+bool arch_dma_unmap_sg_direct(struct device *dev, struct scatterlist *sg,
+			      int nents)
+{
+	struct scatterlist *s;
+	int i;
+
+	if (likely(!dev->bus_dma_limit))
+		return false;
+
+	for_each_sg(sg, s, nents, i) {
+		if (!is_direct_handle(dev, s->dma_address + s->length))
+			return false;
+	}
+
+	return true;
+}
+EXPORT_SYMBOL(arch_dma_unmap_sg_direct);
+
 /*
  * Generic iommu implementation
  */
@@ -90,8 +149,18 @@ int dma_iommu_dma_supported(struct device *dev, u64 mask)
 	struct iommu_table *tbl = get_iommu_table_base(dev);
 
 	if (dev_is_pci(dev) && dma_iommu_bypass_supported(dev, mask)) {
-		dev->dma_ops_bypass = true;
-		dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n");
+		/*
+		 * dma_iommu_bypass_supported() sets dma_max when there is
+		 * 1:1 mapping but it is somehow limited.
+		 * ibm,pmemory is one example.
+		 */
+		dev->dma_ops_bypass = dev->bus_dma_limit == 0;
+		if (!dev->dma_ops_bypass)
+			dev_warn(dev,
+				 "iommu: 64-bit OK but direct DMA is limited by %llx\n",
+				 dev->bus_dma_limit);
+		else
+			dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n");
 		return 1;
 	}
 
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index e4198700ed1a..9fc5217f0c8e 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -839,7 +839,7 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
 			np, ret);
 }
 
-static u64 find_existing_ddw(struct device_node *pdn)
+static u64 find_existing_ddw(struct device_node *pdn, int *window_shift)
 {
 	struct direct_window *window;
 	const struct dynamic_dma_window_prop *direct64;
@@ -851,6 +851,7 @@ static u64 find_existing_ddw(struct device_node *pdn)
 		if (window->device == pdn) {
 			direct64 = window->prop;
 			dma_addr = be64_to_cpu(direct64->dma_base);
+			*window_shift = be32_to_cpu(direct64->window_shift);
 			break;
 		}
 	}
@@ -1111,11 +1112,12 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
  */
 static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 {
-	int len, ret;
+	int len = 0, ret;
+	int max_ram_len = order_base_2(ddw_memory_hotplug_max());
 	struct ddw_query_response query;
 	struct ddw_create_response create;
 	int page_shift;
-	u64 dma_addr, max_addr;
+	u64 dma_addr;
 	struct device_node *dn;
 	u32 ddw_avail[DDW_APPLICABLE_SIZE];
 	struct direct_window *window;
@@ -1123,10 +1125,15 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	struct dynamic_dma_window_prop *ddwprop;
 	struct failed_ddw_pdn *fpdn;
 	bool default_win_removed = false;
+	bool pmem_present;
+
+	dn = of_find_node_by_type(NULL, "ibm,pmemory");
+	pmem_present = dn != NULL;
+	of_node_put(dn);
 
 	mutex_lock(&direct_window_init_mutex);
 
-	dma_addr = find_existing_ddw(pdn);
+	dma_addr = find_existing_ddw(pdn, &len);
 	if (dma_addr != 0)
 		goto out_unlock;
 
@@ -1212,14 +1219,29 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	}
 	/* verify the window * number of ptes will map the partition */
 	/* check largest block * page size > max memory hotplug addr */
-	max_addr = ddw_memory_hotplug_max();
-	if (query.largest_available_block < (max_addr >> page_shift)) {
-		dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
-			  "%llu-sized pages\n", max_addr,  query.largest_available_block,
-			  1ULL << page_shift);
+	/*
+	 * The "ibm,pmemory" can appear anywhere in the address space.
+	 * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
+	 * for the upper limit and fallback to max RAM otherwise but this
+	 * disables device::dma_ops_bypass.
+	 */
+	len = max_ram_len;
+	if (pmem_present) {
+		if (query.largest_available_block >=
+		    (1ULL << (MAX_PHYSMEM_BITS - page_shift)))
+			len = MAX_PHYSMEM_BITS - page_shift;
+		else
+			dev_info(&dev->dev, "Skipping ibm,pmemory");
+	}
+
+	if (query.largest_available_block < (1ULL << (len - page_shift))) {
+		dev_dbg(&dev->dev,
+			"can't map partition max 0x%llx with %llu %llu-sized pages\n",
+			1ULL << len,
+			query.largest_available_block,
+			1ULL << page_shift);
 		goto out_failed;
 	}
-	len = order_base_2(max_addr);
 	win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
 	if (!win64) {
 		dev_info(&dev->dev,
@@ -1299,6 +1321,15 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 
 out_unlock:
 	mutex_unlock(&direct_window_init_mutex);
+
+	/*
+	 * If we have persistent memory and the window size is only as big
+	 * as RAM, then we failed to create a window to cover persistent
+	 * memory and need to set the DMA limit.
+	 */
+	if (pmem_present && dma_addr && (len == max_ram_len))
+		dev->dev.bus_dma_limit = dma_addr + (1ULL << len);
+
 	return dma_addr;
 }
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index e9f13fe08492..b2d4580acf79 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -159,6 +159,7 @@ config PPC
 	select DCACHE_WORD_ACCESS		if PPC64 && CPU_LITTLE_ENDIAN
 	select DMA_OPS				if PPC64
 	select DMA_OPS_BYPASS			if PPC64
+	select ARCH_HAS_DMA_MAP_DIRECT 		if PPC64 && PPC_PSERIES
 	select DYNAMIC_FTRACE			if FUNCTION_TRACER
 	select EDAC_ATOMIC_SCRUB
 	select EDAC_SUPPORT
-- 
2.17.1


^ permalink raw reply related

* [PATCH kernel v4 1/2] dma: Allow mixing bypass and mapped DMA operation
From: Alexey Kardashevskiy @ 2020-10-29  1:52 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, iommu, Christoph Hellwig, linux-kernel
In-Reply-To: <20201029015241.73920-1-aik@ozlabs.ru>

At the moment we allow bypassing DMA ops only when we can do this for
the entire RAM. However there are configs with mixed type memory
where we could still allow bypassing IOMMU in most cases;
POWERPC with persistent memory is one example.

This adds an arch hook to determine where bypass can still work and
we invoke direct DMA API. The following patch checks the bus limit
on POWERPC to allow or disallow direct mapping.

This adds a CONFIG_ARCH_HAS_DMA_SET_MASK config option to make arch_xxxx
hooks no-op by default.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v4:
* wrapped long lines
---
 kernel/dma/mapping.c | 26 ++++++++++++++++++++++----
 kernel/dma/Kconfig   |  4 ++++
 2 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 51bb8fa8eb89..ad1f052e046d 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -137,6 +137,20 @@ static inline bool dma_map_direct(struct device *dev,
 	return dma_go_direct(dev, *dev->dma_mask, ops);
 }
 
+#ifdef CONFIG_ARCH_HAS_DMA_MAP_DIRECT
+bool arch_dma_map_page_direct(struct device *dev, phys_addr_t addr);
+bool arch_dma_unmap_page_direct(struct device *dev, dma_addr_t dma_handle);
+bool arch_dma_map_sg_direct(struct device *dev, struct scatterlist *sg,
+			    int nents);
+bool arch_dma_unmap_sg_direct(struct device *dev, struct scatterlist *sg,
+			      int nents);
+#else
+#define arch_dma_map_page_direct(d, a) (0)
+#define arch_dma_unmap_page_direct(d, a) (0)
+#define arch_dma_map_sg_direct(d, s, n) (0)
+#define arch_dma_unmap_sg_direct(d, s, n) (0)
+#endif
+
 dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
 		size_t offset, size_t size, enum dma_data_direction dir,
 		unsigned long attrs)
@@ -149,7 +163,8 @@ dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
 	if (WARN_ON_ONCE(!dev->dma_mask))
 		return DMA_MAPPING_ERROR;
 
-	if (dma_map_direct(dev, ops))
+	if (dma_map_direct(dev, ops) ||
+	    arch_dma_map_page_direct(dev, page_to_phys(page) + offset + size))
 		addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
 	else
 		addr = ops->map_page(dev, page, offset, size, dir, attrs);
@@ -165,7 +180,8 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
 	const struct dma_map_ops *ops = get_dma_ops(dev);
 
 	BUG_ON(!valid_dma_direction(dir));
-	if (dma_map_direct(dev, ops))
+	if (dma_map_direct(dev, ops) ||
+	    arch_dma_unmap_page_direct(dev, addr + size))
 		dma_direct_unmap_page(dev, addr, size, dir, attrs);
 	else if (ops->unmap_page)
 		ops->unmap_page(dev, addr, size, dir, attrs);
@@ -188,7 +204,8 @@ int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
 	if (WARN_ON_ONCE(!dev->dma_mask))
 		return 0;
 
-	if (dma_map_direct(dev, ops))
+	if (dma_map_direct(dev, ops) ||
+	    arch_dma_map_sg_direct(dev, sg, nents))
 		ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
 	else
 		ents = ops->map_sg(dev, sg, nents, dir, attrs);
@@ -207,7 +224,8 @@ void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
 
 	BUG_ON(!valid_dma_direction(dir));
 	debug_dma_unmap_sg(dev, sg, nents, dir);
-	if (dma_map_direct(dev, ops))
+	if (dma_map_direct(dev, ops) ||
+	    arch_dma_unmap_sg_direct(dev, sg, nents))
 		dma_direct_unmap_sg(dev, sg, nents, dir, attrs);
 	else if (ops->unmap_sg)
 		ops->unmap_sg(dev, sg, nents, dir, attrs);
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index c99de4a21458..43d106598e82 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -20,6 +20,10 @@ config DMA_OPS
 config DMA_OPS_BYPASS
 	bool
 
+# Lets platform IOMMU driver choose between bypass and IOMMU
+config ARCH_HAS_DMA_MAP_DIRECT
+	bool
+
 config NEED_SG_DMA_LENGTH
 	bool
 
-- 
2.17.1


^ permalink raw reply related

* Re: [RFC PATCH kernel 1/2] irq: Add reference counting to IRQ mappings
From: Alexey Kardashevskiy @ 2020-10-29  5:04 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Rob Herring, linux-kernel, Qian Cai, Cédric Le Goater,
	Frederic Barrat, Oliver O'Halloran, Thomas Gleixner,
	Michal Suchánek, linuxppc-dev
In-Reply-To: <415025f93a2b93e8ae62cba57ca1a8a7@kernel.org>



On 28/10/2020 03:09, Marc Zyngier wrote:
> Hi Alexey,
> 
> On 2020-10-27 09:06, Alexey Kardashevskiy wrote:
>> PCI devices share 4 legacy INTx interrupts from the same PCI host bridge.
>> Device drivers map/unmap hardware interrupts via irq_create_mapping()/
>> irq_dispose_mapping(). The problem with that these interrupts are
>> shared and when performing hot unplug, we need to unmap the interrupt
>> only when the last device is released.
>>
>> This reuses already existing irq_desc::kobj for this purpose.
>> The refcounter is naturally 1 when the descriptor is allocated already;
>> this adds kobject_get() in places where already existing mapped virq
>> is returned.
> 
> That's quite interesting, as I was about to revive a patch series that
> rework the irqdomain subsystem to directly cache irq_desc instead of
> raw interrupt numbers. And for that, I needed some form of refcounting...
> 
>>
>> This reorganizes irq_dispose_mapping() to release the kobj and let
>> the release callback do the cleanup.
>>
>> If some driver or platform does its own reference counting, this expects
>> those parties to call irq_find_mapping() and call irq_dispose_mapping()
>> for every irq_create_fwspec_mapping()/irq_create_mapping().
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  kernel/irq/irqdesc.c   | 35 +++++++++++++++++++++++------------
>>  kernel/irq/irqdomain.c | 27 +++++++++++++--------------
>>  2 files changed, 36 insertions(+), 26 deletions(-)
>>
>> diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
>> index 1a7723604399..dae096238500 100644
>> --- a/kernel/irq/irqdesc.c
>> +++ b/kernel/irq/irqdesc.c
>> @@ -419,20 +419,39 @@ static struct irq_desc *alloc_desc(int irq, int
>> node, unsigned int flags,
>>      return NULL;
>>  }
>>
>> +static void delayed_free_desc(struct rcu_head *rhp);
>>  static void irq_kobj_release(struct kobject *kobj)
>>  {
>>      struct irq_desc *desc = container_of(kobj, struct irq_desc, kobj);
>> +    struct irq_domain *domain;
>> +    unsigned int virq = desc->irq_data.irq;
>>
>> -    free_masks(desc);
>> -    free_percpu(desc->kstat_irqs);
>> -    kfree(desc);
>> +    domain = desc->irq_data.domain;
>> +    if (domain) {
>> +        if (irq_domain_is_hierarchy(domain)) {
>> +            irq_domain_free_irqs(virq, 1);
> 
> How does this work with hierarchical domains? Each domain should
> contribute as a reference on the irq_desc. But if you got here,
> it means the refcount has already dropped to 0.
> 
> So either there is nothing to free here, or you don't track the
> references implied by the hierarchy. I suspect the latter.

This is correct, I did not look at hierarchy yet, looking now...



>> +        } else {
>> +            irq_domain_disassociate(domain, virq);
>> +            irq_free_desc(virq);
>> +        }
>> +    }
>> +
>> +    /*
>> +     * We free the descriptor, masks and stat fields via RCU. That
>> +     * allows demultiplex interrupts to do rcu based management of
>> +     * the child interrupts.
>> +     * This also allows us to use rcu in kstat_irqs_usr().
>> +     */
>> +    call_rcu(&desc->rcu, delayed_free_desc);
>>  }
>>
>>  static void delayed_free_desc(struct rcu_head *rhp)
>>  {
>>      struct irq_desc *desc = container_of(rhp, struct irq_desc, rcu);
>>
>> -    kobject_put(&desc->kobj);
>> +    free_masks(desc);
>> +    free_percpu(desc->kstat_irqs);
>> +    kfree(desc);
>>  }
>>
>>  static void free_desc(unsigned int irq)
>> @@ -453,14 +472,6 @@ static void free_desc(unsigned int irq)
>>       */
>>      irq_sysfs_del(desc);
>>      delete_irq_desc(irq);
>> -
>> -    /*
>> -     * We free the descriptor, masks and stat fields via RCU. That
>> -     * allows demultiplex interrupts to do rcu based management of
>> -     * the child interrupts.
>> -     * This also allows us to use rcu in kstat_irqs_usr().
>> -     */
>> -    call_rcu(&desc->rcu, delayed_free_desc);
>>  }
>>
>>  static int alloc_descs(unsigned int start, unsigned int cnt, int node,
>> diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
>> index cf8b374b892d..02733ddc321f 100644
>> --- a/kernel/irq/irqdomain.c
>> +++ b/kernel/irq/irqdomain.c
>> @@ -638,6 +638,7 @@ unsigned int irq_create_mapping(struct irq_domain 
>> *domain,
>>  {
>>      struct device_node *of_node;
>>      int virq;
>> +    struct irq_desc *desc;
>>
>>      pr_debug("irq_create_mapping(0x%p, 0x%lx)\n", domain, hwirq);
>>
>> @@ -655,7 +656,9 @@ unsigned int irq_create_mapping(struct irq_domain 
>> *domain,
>>      /* Check if mapping already exists */
>>      virq = irq_find_mapping(domain, hwirq);
>>      if (virq) {
>> +        desc = irq_to_desc(virq);
>>          pr_debug("-> existing mapping on virq %d\n", virq);
>> +        kobject_get(&desc->kobj);
> 
> My worry with this is that there is probably a significant amount of
> code out there that relies on multiple calls to irq_create_mapping()
> with the same parameters not to have any side effects. They would
> expect a subsequent irq_dispose_mapping() to drop the translation
> altogether, and that's obviously not the case here.
> 
> Have you audited the various call sites to see what could break?


The vast majority calls one of irq..create_mapping in init/probe and 
then calls irq_dispose_mapping() right there if probing failed or when 
the driver is unloaded. I could not spot any reference counting 
anywhere, everyone seems to call irq_dispose_mapping() per every 
irq_of_parse_and_map() (or friends).

Then there is a minority (such as the code I am fixing in 
powerpc/pseries) but it is either broken (such as pseries/pci which does 
not call irq_dispose_mapping at all)  or  it is 1 disposal per 1 mapping 
(PPC KVM).

I did not spend awful amount of time though, git grep 
irq_dispose_mapping gives 200 lines...


> 
>>          return virq;
>>      }
>>
>> @@ -751,6 +754,7 @@ unsigned int irq_create_fwspec_mapping(struct
>> irq_fwspec *fwspec)
>>      irq_hw_number_t hwirq;
>>      unsigned int type = IRQ_TYPE_NONE;
>>      int virq;
>> +    struct irq_desc *desc;
>>
>>      if (fwspec->fwnode) {
>>          domain = irq_find_matching_fwspec(fwspec, DOMAIN_BUS_WIRED);
>> @@ -787,8 +791,11 @@ unsigned int irq_create_fwspec_mapping(struct
>> irq_fwspec *fwspec)
>>           * current trigger type then we are done so return the
>>           * interrupt number.
>>           */
>> -        if (type == IRQ_TYPE_NONE || type == irq_get_trigger_type(virq))
>> +        if (type == IRQ_TYPE_NONE || type == 
>> irq_get_trigger_type(virq)) {
>> +            desc = irq_to_desc(virq);
>> +            kobject_get(&desc->kobj);
>>              return virq;
>> +        }
>>
>>          /*
>>           * If the trigger type has not been set yet, then set
>> @@ -800,6 +807,8 @@ unsigned int irq_create_fwspec_mapping(struct
>> irq_fwspec *fwspec)
>>                  return 0;
>>
>>              irqd_set_trigger_type(irq_data, type);
>> +            desc = irq_to_desc(virq);
>> +            kobject_get(&desc->kobj);
>>              return virq;
>>          }
>>
>> @@ -852,22 +861,12 @@ EXPORT_SYMBOL_GPL(irq_create_of_mapping);
>>   */
>>  void irq_dispose_mapping(unsigned int virq)
>>  {
>> -    struct irq_data *irq_data = irq_get_irq_data(virq);
>> -    struct irq_domain *domain;
>> +    struct irq_desc *desc = irq_to_desc(virq);
>>
>> -    if (!virq || !irq_data)
>> +    if (!virq || !desc)
>>          return;
>>
>> -    domain = irq_data->domain;
>> -    if (WARN_ON(domain == NULL))
>> -        return;
>> -
>> -    if (irq_domain_is_hierarchy(domain)) {
>> -        irq_domain_free_irqs(virq, 1);
>> -    } else {
>> -        irq_domain_disassociate(domain, virq);
>> -        irq_free_desc(virq);
>> -    }
>> +    kobject_put(&desc->kobj);
>>  }
>>  EXPORT_SYMBOL_GPL(irq_dispose_mapping);
> 
> Thanks,
> 
>          M.

-- 
Alexey

^ permalink raw reply

* Re: [PATCH] powerpc/eeh_cache: Fix a possible debugfs deadlock
From: Oliver O'Halloran @ 2020-10-29  5:57 UTC (permalink / raw)
  To: Qian Cai; +Cc: Linux Kernel Mailing List, Paul Mackerras, linuxppc-dev
In-Reply-To: <20201028152717.8967-1-cai@redhat.com>

On Thu, Oct 29, 2020 at 2:27 AM Qian Cai <cai@redhat.com> wrote:
>
> Lockdep complains that a possible deadlock below in
> eeh_addr_cache_show() because it is acquiring a lock with IRQ enabled,
> but eeh_addr_cache_insert_dev() needs to acquire the same lock with IRQ
> disabled. Let's just make eeh_addr_cache_show() acquire the lock with
> IRQ disabled as well.
>
>         CPU0                    CPU1
>         ----                    ----
>    lock(&pci_io_addr_cache_root.piar_lock);
>                                 local_irq_disable();
>                                 lock(&tp->lock);
>                                 lock(&pci_io_addr_cache_root.piar_lock);
>    <Interrupt>
>      lock(&tp->lock);
>
>   *** DEADLOCK ***
>
>   lock_acquire+0x140/0x5f0
>   _raw_spin_lock_irqsave+0x64/0xb0
>   eeh_addr_cache_insert_dev+0x48/0x390
>   eeh_probe_device+0xb8/0x1a0
>   pnv_pcibios_bus_add_device+0x3c/0x80
>   pcibios_bus_add_device+0x118/0x290
>   pci_bus_add_device+0x28/0xe0
>   pci_bus_add_devices+0x54/0xb0
>   pcibios_init+0xc4/0x124
>   do_one_initcall+0xac/0x528
>   kernel_init_freeable+0x35c/0x3fc
>   kernel_init+0x24/0x148
>   ret_from_kernel_thread+0x5c/0x80
>
>   lock_acquire+0x140/0x5f0
>   _raw_spin_lock+0x4c/0x70
>   eeh_addr_cache_show+0x38/0x110
>   seq_read+0x1a0/0x660
>   vfs_read+0xc8/0x1f0
>   ksys_read+0x74/0x130
>   system_call_exception+0xf8/0x1d0
>   system_call_common+0xe8/0x218
>
> Fixes: 5ca85ae6318d ("powerpc/eeh_cache: Add a way to dump the EEH address cache")
> Signed-off-by: Qian Cai <cai@redhat.com>

Good catch,

Reviewed-by: Oliver O'Halloran <oohall@gmail.com>

^ permalink raw reply

* Re: [PATCH 0/4] Powerpc: Better preemption for shared processor
From: Srikar Dronamraju @ 2020-10-29  7:14 UTC (permalink / raw)
  To: Waiman Long
  Cc: Nathan Lynch, Gautham R Shenoy, Phil Auld, Juri Lelli,
	Peter Zijlstra, LKML, Nicholas Piggin, linuxppc-dev,
	Valentin Schneider
In-Reply-To: <da67d6ce-f120-f61a-19ff-0ae4f1f5dac0@redhat.com>

* Waiman Long <longman@redhat.com> [2020-10-28 20:01:30]:

> > Srikar Dronamraju (4):
> >    powerpc: Refactor is_kvm_guest declaration to new header
> >    powerpc: Rename is_kvm_guest to check_kvm_guest
> >    powerpc: Reintroduce is_kvm_guest
> >    powerpc/paravirt: Use is_kvm_guest in vcpu_is_preempted
> > 
> >   arch/powerpc/include/asm/firmware.h  |  6 ------
> >   arch/powerpc/include/asm/kvm_guest.h | 25 +++++++++++++++++++++++++
> >   arch/powerpc/include/asm/kvm_para.h  |  2 +-
> >   arch/powerpc/include/asm/paravirt.h  | 18 ++++++++++++++++++
> >   arch/powerpc/kernel/firmware.c       |  5 ++++-
> >   arch/powerpc/platforms/pseries/smp.c |  3 ++-
> >   6 files changed, 50 insertions(+), 9 deletions(-)
> >   create mode 100644 arch/powerpc/include/asm/kvm_guest.h
> > 
> This patch series looks good to me and the performance is nice too.
> 
> Acked-by: Waiman Long <longman@redhat.com>

Thank you.

> 
> Just curious, is the performance mainly from the use of static_branch
> (patches 1 - 3) or from reducing call to yield_count_of().

Because of the reduced call to yield_count

> 
> Cheers,
> Longman
> 

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply

* Re: [PATCH 2/4] PM: hibernate: improve robustness of mapping pages in the direct map
From: Mike Rapoport @ 2020-10-29  7:54 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david@redhat.com, peterz@infradead.org, catalin.marinas@arm.com,
	dave.hansen@linux.intel.com, linux-mm@kvack.org, paulus@samba.org,
	pavel@ucw.cz, hpa@zytor.com, sparclinux@vger.kernel.org,
	cl@linux.com, will@kernel.org, linux-riscv@lists.infradead.org,
	linux-s390@vger.kernel.org, x86@kernel.org, rppt@linux.ibm.com,
	borntraeger@de.ibm.com, mingo@redhat.com, rientjes@google.com,
	Brown, Len, aou@eecs.berkeley.edu, gor@linux.ibm.com,
	linux-pm@vger.kernel.org, hca@linux.ibm.com, bp@alien8.de,
	luto@kernel.org, paul.walmsley@sifive.com, kirill@shutemov.name,
	tglx@linutronix.de, akpm@linux-foundation.org,
	linux-arm-kernel@lists.infradead.org, rjw@rjwysocki.net,
	linux-kernel@vger.kernel.org, penberg@kernel.org,
	palmer@dabbelt.com, iamjoonsoo.kim@lge.com,
	linuxppc-dev@lists.ozlabs.org, davem@davemloft.net
In-Reply-To: <3b4b2b3559bd3dc68adcddf99415bae57152cb6b.camel@intel.com>

On Wed, Oct 28, 2020 at 09:15:38PM +0000, Edgecombe, Rick P wrote:
> On Sun, 2020-10-25 at 12:15 +0200, Mike Rapoport wrote:
> > +       if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
> > +               unsigned long addr = (unsigned
> > long)page_address(page);
> > +               int ret;
> > +
> > +               if (enable)
> > +                       ret = set_direct_map_default_noflush(page);
> > +               else
> > +                       ret = set_direct_map_invalid_noflush(page);
> > +
> > +               if (WARN_ON(ret))
> > +                       return;
> > +
> > +               flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> > +       } else {
> > +               debug_pagealloc_map_pages(page, 1, enable);
> > +       }
> 
> Looking at the arm side again, I think this might actually introduce a
> regression for the arm/hibernate/DEBUG_PAGEALLOC combo.
> 
> Unlike __kernel_map_pages(), it looks like arm's cpa will always bail
> in the set_direct_map_() functions if rodata_full is false.
>
> So if rodata_full was disabled but debug page alloc is on, then this
> would now skip remapping the pages. I guess the significance depends
> on whether hibernate could actually try to save any DEBUG_PAGEALLOC
> unmapped pages. Looks like it to me though.
 
__kernel_map_pages() on arm64 will also bail out if rodata_full is
false:

void __kernel_map_pages(struct page *page, int numpages, int enable)
{
	if (!debug_pagealloc_enabled() && !rodata_full)
		return;

	set_memory_valid((unsigned long)page_address(page), numpages, enable);
}

So using set_direct_map() to map back pages removed from the direct map
with __kernel_map_pages() seems safe to me.

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH 5/9] kprobes/ftrace: Add recursion protection to the ftrace callback
From: Masami Hiramatsu @ 2020-10-29  7:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: James E.J. Bottomley, Guo Ren, linux-csky, H. Peter Anvin,
	linux-s390, Helge Deller, x86, Anil S Keshavamurthy,
	Christian Borntraeger, Naveen N. Rao, Vasily Gorbik,
	Heiko Carstens, Borislav Petkov, Thomas Gleixner, linux-parisc,
	linux-kernel, Masami Hiramatsu, Paul Mackerras, Andrew Morton,
	linuxppc-dev, David S. Miller
In-Reply-To: <20201028115613.140212174@goodmis.org>

Hi Steve,

On Wed, 28 Oct 2020 07:52:49 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
> 
> If a ftrace callback does not supply its own recursion protection and
> does not set the RECURSION_SAFE flag in its ftrace_ops, then ftrace will
> make a helper trampoline to do so before calling the callback instead of
> just calling the callback directly.

So in that case the handlers will be called without preempt disabled?


> The default for ftrace_ops is going to assume recursion protection unless
> otherwise specified.

This seems to skip entier handler if ftrace finds recursion.
I would like to increment the missed counter even in that case.

[...]
e.g.

> diff --git a/arch/csky/kernel/probes/ftrace.c b/arch/csky/kernel/probes/ftrace.c
> index 5264763d05be..5eb2604fdf71 100644
> --- a/arch/csky/kernel/probes/ftrace.c
> +++ b/arch/csky/kernel/probes/ftrace.c
> @@ -13,16 +13,21 @@ int arch_check_ftrace_location(struct kprobe *p)
>  void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
>  			   struct ftrace_ops *ops, struct pt_regs *regs)
>  {
> +	int bit;
>  	bool lr_saver = false;
>  	struct kprobe *p;
>  	struct kprobe_ctlblk *kcb;
>  
> -	/* Preempt is disabled by ftrace */
> +	bit = ftrace_test_recursion_trylock();

> +
> +	preempt_disable_notrace();
>  	p = get_kprobe((kprobe_opcode_t *)ip);
>  	if (!p) {
>  		p = get_kprobe((kprobe_opcode_t *)(ip - MCOUNT_INSN_SIZE));
>  		if (unlikely(!p) || kprobe_disabled(p))
> -			return;
> +			goto out;
>  		lr_saver = true;
>  	}

	if (bit < 0) {
		kprobes_inc_nmissed_count(p);
		goto out;
	}

>  
> @@ -56,6 +61,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
>  		 */
>  		__this_cpu_write(current_kprobe, NULL);
>  	}
> +out:
> +	preempt_enable_notrace();

	if (bit >= 0)
		ftrace_test_recursion_unlock(bit);

>  }
>  NOKPROBE_SYMBOL(kprobe_ftrace_handler);
>  

Or, we can also introduce a support function,

static inline void kprobes_inc_nmissed_ip(unsigned long ip)
{
	struct kprobe *p;

	preempt_disable_notrace();
	p = get_kprobe((kprobe_opcode_t *)ip);
	if (p)
		kprobes_inc_nmissed_count(p);
	preempt_enable_notrace();
}

> diff --git a/arch/parisc/kernel/ftrace.c b/arch/parisc/kernel/ftrace.c
> index 4bab21c71055..5f7742b225a5 100644
> --- a/arch/parisc/kernel/ftrace.c
> +++ b/arch/parisc/kernel/ftrace.c
> @@ -208,13 +208,19 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
>  {
>  	struct kprobe_ctlblk *kcb;
>  	struct kprobe *p = get_kprobe((kprobe_opcode_t *)ip);

(BTW, here is a bug... get_kprobe() must be called with preempt disabled.)

> +	int bit;
>  
> -	if (unlikely(!p) || kprobe_disabled(p))
> +	bit = ftrace_test_recursion_trylock();

	if (bit < 0) {
		kprobes_inc_nmissed_ip(ip);
>  		return;
	}

This may easier for you ?

Thank you,

>  
> +	preempt_disable_notrace();
> +	if (unlikely(!p) || kprobe_disabled(p))
> +		goto out;
> +
>  	if (kprobe_running()) {
>  		kprobes_inc_nmissed_count(p);
> -		return;
> +		goto out;
>  	}
>  
>  	__this_cpu_write(current_kprobe, p);
> @@ -235,6 +241,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
>  		}
>  	}
>  	__this_cpu_write(current_kprobe, NULL);
> +out:
> +	preempt_enable_notrace();
> +	ftrace_test_recursion_unlock(bit);
>  }
>  NOKPROBE_SYMBOL(kprobe_ftrace_handler);
>  
> diff --git a/arch/powerpc/kernel/kprobes-ftrace.c b/arch/powerpc/kernel/kprobes-ftrace.c
> index 972cb28174b2..5df8d50c65ae 100644
> --- a/arch/powerpc/kernel/kprobes-ftrace.c
> +++ b/arch/powerpc/kernel/kprobes-ftrace.c
> @@ -18,10 +18,16 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
>  {
>  	struct kprobe *p;
>  	struct kprobe_ctlblk *kcb;
> +	int bit;
>  
> +	bit = ftrace_test_recursion_trylock();
> +	if (bit < 0)
> +		return;
> +
> +	preempt_disable_notrace();
>  	p = get_kprobe((kprobe_opcode_t *)nip);
>  	if (unlikely(!p) || kprobe_disabled(p))
> -		return;
> +		goto out;
>  
>  	kcb = get_kprobe_ctlblk();
>  	if (kprobe_running()) {
> @@ -52,6 +58,9 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
>  		 */
>  		__this_cpu_write(current_kprobe, NULL);
>  	}
> +out:
> +	preempt_enable_notrace();
> +	ftrace_test_recursion_unlock(bit);
>  }
>  NOKPROBE_SYMBOL(kprobe_ftrace_handler);
>  
> diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
> index b388e87a08bf..88466d7fb6b2 100644
> --- a/arch/s390/kernel/ftrace.c
> +++ b/arch/s390/kernel/ftrace.c
> @@ -202,13 +202,19 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
>  {
>  	struct kprobe_ctlblk *kcb;
>  	struct kprobe *p = get_kprobe((kprobe_opcode_t *)ip);
> +	int bit;
>  
> -	if (unlikely(!p) || kprobe_disabled(p))
> +	bit = ftrace_test_recursion_trylock();
> +	if (bit < 0)
>  		return;
>  
> +	preempt_disable_notrace();
> +	if (unlikely(!p) || kprobe_disabled(p))
> +		goto out;
> +
>  	if (kprobe_running()) {
>  		kprobes_inc_nmissed_count(p);
> -		return;
> +		goto out;
>  	}
>  
>  	__this_cpu_write(current_kprobe, p);
> @@ -228,6 +234,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
>  		}
>  	}
>  	__this_cpu_write(current_kprobe, NULL);
> +out:
> +	preempt_enable_notrace();
> +	ftrace_test_recursion_unlock(bit);
>  }
>  NOKPROBE_SYMBOL(kprobe_ftrace_handler);
>  
> diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c
> index 681a4b36e9bb..a40a6cdfcca3 100644
> --- a/arch/x86/kernel/kprobes/ftrace.c
> +++ b/arch/x86/kernel/kprobes/ftrace.c
> @@ -18,11 +18,16 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
>  {
>  	struct kprobe *p;
>  	struct kprobe_ctlblk *kcb;
> +	int bit;
>  
> -	/* Preempt is disabled by ftrace */
> +	bit = ftrace_test_recursion_trylock();
> +	if (bit < 0)
> +		return;
> +
> +	preempt_disable_notrace();
>  	p = get_kprobe((kprobe_opcode_t *)ip);
>  	if (unlikely(!p) || kprobe_disabled(p))
> -		return;
> +		goto out;
>  
>  	kcb = get_kprobe_ctlblk();
>  	if (kprobe_running()) {
> @@ -52,6 +57,9 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
>  		 */
>  		__this_cpu_write(current_kprobe, NULL);
>  	}
> +out:
> +	preempt_enable_notrace();
> +	ftrace_test_recursion_unlock(bit);
>  }
>  NOKPROBE_SYMBOL(kprobe_ftrace_handler);
>  
> -- 
> 2.28.0
> 
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH 20/33] docs: ABI: testing: make the files compatible with ReST output
From: Mauro Carvalho Chehab @ 2020-10-29  7:21 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Gautham R. Shenoy, Jason A. Donenfeld, Heikki Krogerus,
	Peter Meerwald-Stadler, Petr Mladek, Linux Doc Mailing List,
	Alexander Shishkin, Nayna Jain, Alexandre Belloni, Mimi Zohar,
	Sebastian Reichel, Guenter Roeck, Bruno Meneguele, Vishal Verma,
	Pavel Machek, Hanjun Guo, Mauro Carvalho Chehab, netdev,
	Oleh Kravchenko, Dan Williams, Andrew Donnellan,
	Javier González, Fabrice Gasnier, Stefano Stabellini,
	linux-acpi, Jonathan Corbet, Chunyan Zhang, Mario Limonciello,
	linux-stm32, Lakshmi Ramasubramanian, Ludovic Desroches,
	Pawan Gupta, linux-arm-kernel, Frederic Barrat, Niklas Cassel,
	Len Brown, Juergen Gross, Mika Westerberg, Alexandre Torgue,
	linux-pm, linux-kernel, linuxppc-dev, Baolin Wang,
	Lars-Peter Clausen, Dan Murphy, Orson Zhai, Philippe Bergheaud,
	xen-devel, Boris Ostrovsky, Andy Shevchenko, Benson Leung,
	Konstantin Khlebnikov, Jens Axboe, Felipe Balbi, Kranthi Kuntala,
	Martin K. Petersen, linux-mm, Greg Kroah-Hartman, linux-usb,
	Rafael J. Wysocki, Nicolas Ferre, linux-iio, Thinh Nguyen,
	Sergey Senozhatsky, Thomas Gleixner, Leonid Maksymchuk,
	Maxime Coquelin, Johannes Thumshirn, Enric Balletbo i Serra,
	Vineela Tummalapalli, Peter Rosin, Jonathan Cameron, Mike Kravetz
In-Reply-To: <20201028174427.GE9364@hoboy.vegasvil.org>

Hi Richard,

Em Wed, 28 Oct 2020 10:44:27 -0700
Richard Cochran <richardcochran@gmail.com> escreveu:

> On Wed, Oct 28, 2020 at 03:23:18PM +0100, Mauro Carvalho Chehab wrote:
> 
> > diff --git a/Documentation/ABI/testing/sysfs-uevent b/Documentation/ABI/testing/sysfs-uevent
> > index aa39f8d7bcdf..d0893dad3f38 100644
> > --- a/Documentation/ABI/testing/sysfs-uevent
> > +++ b/Documentation/ABI/testing/sysfs-uevent
> > @@ -19,7 +19,8 @@ Description:
> >                  a transaction identifier so it's possible to use the same UUID
> >                  value for one or more synthetic uevents in which case we
> >                  logically group these uevents together for any userspace
> > -                listeners. The UUID value appears in uevent as
> > +                listeners. The UUID value appears in uevent as:  
> 
> I know almost nothing about Sphinx, but why have one colon here ^^^ and ...

Good point. After re-reading the text, this ":" doesn't belong here.

> 
> > +
> >                  "SYNTH_UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" environment
> >                  variable.
> >  
> > @@ -30,18 +31,19 @@ Description:
> >                  It's possible to define zero or more pairs - each pair is then
> >                  delimited by a space character ' '. Each pair appears in
> >                  synthetic uevent as "SYNTH_ARG_KEY=VALUE". That means the KEY
> > -                name gains "SYNTH_ARG_" prefix to avoid possible collisions
> > +                name gains `SYNTH_ARG_` prefix to avoid possible collisions
> >                  with existing variables.
> >  
> > -                Example of valid sequence written to the uevent file:
> > +                Example of valid sequence written to the uevent file::  
> 
> ... two here?

The main issue that this patch wants to solve is here:

                This generates synthetic uevent including these variables::

                    ACTION=add
                    SYNTH_ARG_A=1
                    SYNTH_ARG_B=abc
                    SYNTH_UUID=fe4d7c9d-b8c6-4a70-9ef1-3d8a58d18eed

On Sphinx, consecutive lines with the same indent belongs to the same
paragraph. So, without "::", the above will be displayed on a single line,
which is undesired.

using "::" tells Sphinx to display as-is. It will also place it into a a 
box (colored for html output) and using a monospaced font.

The change at the "uevent file:" line was done just for coherency
purposes.

Yet, after re-reading the text, there are other things that are not
coherent. So, I guess the enclosed patch will work better for sys-uevent.

Thanks,
Mauro

docs: ABI: sysfs-uevent: make it compatible with ReST output

- Replace " by ``, in order to use monospaced fonts;
- mark literal blocks as such.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

diff --git a/Documentation/ABI/testing/sysfs-uevent b/Documentation/ABI/testing/sysfs-uevent
index aa39f8d7bcdf..0b6227706b35 100644
--- a/Documentation/ABI/testing/sysfs-uevent
+++ b/Documentation/ABI/testing/sysfs-uevent
@@ -6,42 +6,46 @@ Description:
                 Enable passing additional variables for synthetic uevents that
                 are generated by writing /sys/.../uevent file.
 
-                Recognized extended format is ACTION [UUID [KEY=VALUE ...].
+                Recognized extended format is::
 
-                The ACTION is compulsory - it is the name of the uevent action
-                ("add", "change", "remove"). There is no change compared to
-                previous functionality here. The rest of the extended format
-                is optional.
+			ACTION [UUID [KEY=VALUE ...]
+
+                The ACTION is compulsory - it is the name of the uevent
+                action (``add``, ``change``, ``remove``). There is no change
+                compared to previous functionality here. The rest of the
+                extended format is optional.
 
                 You need to pass UUID first before any KEY=VALUE pairs.
-                The UUID must be in "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
+                The UUID must be in ``xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx``
                 format where 'x' is a hex digit. The UUID is considered to be
                 a transaction identifier so it's possible to use the same UUID
                 value for one or more synthetic uevents in which case we
                 logically group these uevents together for any userspace
                 listeners. The UUID value appears in uevent as
-                "SYNTH_UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" environment
+                ``SYNTH_UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx`` environment
                 variable.
 
                 If UUID is not passed in, the generated synthetic uevent gains
-                "SYNTH_UUID=0" environment variable automatically.
+                ``SYNTH_UUID=0`` environment variable automatically.
 
                 The KEY=VALUE pairs can contain alphanumeric characters only.
+
                 It's possible to define zero or more pairs - each pair is then
                 delimited by a space character ' '. Each pair appears in
-                synthetic uevent as "SYNTH_ARG_KEY=VALUE". That means the KEY
-                name gains "SYNTH_ARG_" prefix to avoid possible collisions
+                synthetic uevent as ``SYNTH_ARG_KEY=VALUE``. That means the KEY
+                name gains ``SYNTH_ARG_`` prefix to avoid possible collisions
                 with existing variables.
 
-                Example of valid sequence written to the uevent file:
+                Example of valid sequence written to the uevent file::
 
                     add fe4d7c9d-b8c6-4a70-9ef1-3d8a58d18eed A=1 B=abc
 
-                This generates synthetic uevent including these variables:
+                This generates synthetic uevent including these variables::
 
                     ACTION=add
                     SYNTH_ARG_A=1
                     SYNTH_ARG_B=abc
                     SYNTH_UUID=fe4d7c9d-b8c6-4a70-9ef1-3d8a58d18eed
+
 Users:
                 udev, userspace tools generating synthetic uevents

^ permalink raw reply related

* [PATCH 02/25] ASoC: fsl: fsl_ssi: remove unnecessary CONFIG_PM_SLEEP
From: Coiby Xu @ 2020-10-29  7:42 UTC (permalink / raw)
  To: Jaroslav Kysela, Takashi Iwai
  Cc: moderated list:FREESCALE SOC SOUND DRIVERS, Timur Tabi, Xiubo Li,
	Fabio Estevam, open list:FREESCALE SOC SOUND DRIVERS,
	Liam Girdwood, open list, Nicolin Chen, Mark Brown, Shengjiu Wang
In-Reply-To: <20201029074301.226644-1-coiby.xu@gmail.com>

SET_SYSTEM_SLEEP_PM_OPS has already took good care of CONFIG_PM_CONFIG.

Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
---
 sound/soc/fsl/fsl_ssi.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/sound/soc/fsl/fsl_ssi.c b/sound/soc/fsl/fsl_ssi.c
index 404be27c15fe..065500a4cbc1 100644
--- a/sound/soc/fsl/fsl_ssi.c
+++ b/sound/soc/fsl/fsl_ssi.c
@@ -1669,7 +1669,6 @@ static int fsl_ssi_remove(struct platform_device *pdev)
 	return 0;
 }
 
-#ifdef CONFIG_PM_SLEEP
 static int fsl_ssi_suspend(struct device *dev)
 {
 	struct fsl_ssi *ssi = dev_get_drvdata(dev);
@@ -1699,7 +1698,6 @@ static int fsl_ssi_resume(struct device *dev)
 
 	return regcache_sync(regs);
 }
-#endif /* CONFIG_PM_SLEEP */
 
 static const struct dev_pm_ops fsl_ssi_pm = {
 	SET_SYSTEM_SLEEP_PM_OPS(fsl_ssi_suspend, fsl_ssi_resume)
-- 
2.28.0


^ permalink raw reply related

* [PATCH 03/25] ASoC: fsl: remove unnecessary CONFIG_PM_SLEEP
From: Coiby Xu @ 2020-10-29  7:42 UTC (permalink / raw)
  To: Jaroslav Kysela, Takashi Iwai
  Cc: moderated list:FREESCALE SOC SOUND DRIVERS, Timur Tabi, Xiubo Li,
	Fabio Estevam, Sascha Hauer,
	open list:FREESCALE SOC SOUND DRIVERS, Liam Girdwood, open list,
	Nicolin Chen, Mark Brown, NXP Linux Team, Pengutronix Kernel Team,
	Shawn Guo, Shengjiu Wang,
	moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE
In-Reply-To: <20201029074301.226644-1-coiby.xu@gmail.com>

SET_SYSTEM_SLEEP_PM_OPS has already took good care of CONFIG_PM_CONFIG.

Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
---
 sound/soc/fsl/imx-audmux.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/sound/soc/fsl/imx-audmux.c b/sound/soc/fsl/imx-audmux.c
index 25c18b9e348f..6d77188a4eab 100644
--- a/sound/soc/fsl/imx-audmux.c
+++ b/sound/soc/fsl/imx-audmux.c
@@ -349,7 +349,6 @@ static int imx_audmux_remove(struct platform_device *pdev)
 	return 0;
 }
 
-#ifdef CONFIG_PM_SLEEP
 static int imx_audmux_suspend(struct device *dev)
 {
 	int i;
@@ -377,7 +376,6 @@ static int imx_audmux_resume(struct device *dev)
 
 	return 0;
 }
-#endif /* CONFIG_PM_SLEEP */
 
 static const struct dev_pm_ops imx_audmux_pm = {
 	SET_SYSTEM_SLEEP_PM_OPS(imx_audmux_suspend, imx_audmux_resume)
-- 
2.28.0


^ permalink raw reply related

* [PATCH 25/25] ALSA: aoa: remove unnecessary CONFIG_PM_SLEEP
From: Coiby Xu @ 2020-10-29  7:43 UTC (permalink / raw)
  To: Jaroslav Kysela, Takashi Iwai
  Cc: Johannes Berg, open list:AOA Apple Onboard Audio ALSA DRIVER,
	moderated list:AOA Apple Onboard Audio ALSA DRIVER, open list
In-Reply-To: <20201029074301.226644-1-coiby.xu@gmail.com>

SIMPLE_DEV_PM_OPS has already took good care of CONFIG_PM_CONFIG.

Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
---
 sound/aoa/fabrics/layout.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/sound/aoa/fabrics/layout.c b/sound/aoa/fabrics/layout.c
index d2e85b83f7ed..197d13f23141 100644
--- a/sound/aoa/fabrics/layout.c
+++ b/sound/aoa/fabrics/layout.c
@@ -1126,7 +1126,6 @@ static int aoa_fabric_layout_remove(struct soundbus_dev *sdev)
 	return 0;
 }
 
-#ifdef CONFIG_PM_SLEEP
 static int aoa_fabric_layout_suspend(struct device *dev)
 {
 	struct layout_dev *ldev = dev_get_drvdata(dev);
@@ -1150,7 +1149,6 @@ static int aoa_fabric_layout_resume(struct device *dev)
 static SIMPLE_DEV_PM_OPS(aoa_fabric_layout_pm_ops,
 	aoa_fabric_layout_suspend, aoa_fabric_layout_resume);
 
-#endif
 
 static struct soundbus_driver aoa_soundbus_driver = {
 	.name = "snd_aoa_soundbus_drv",
@@ -1159,9 +1157,7 @@ static struct soundbus_driver aoa_soundbus_driver = {
 	.remove = aoa_fabric_layout_remove,
 	.driver = {
 		.owner = THIS_MODULE,
-#ifdef CONFIG_PM_SLEEP
 		.pm = &aoa_fabric_layout_pm_ops,
-#endif
 	}
 };
 
-- 
2.28.0


^ permalink raw reply related

* Re: [PATCH 0/4] arch, mm: improve robustness of direct map manipulation
From: Mike Rapoport @ 2020-10-29  8:12 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: david@redhat.com, peterz@infradead.org, catalin.marinas@arm.com,
	dave.hansen@linux.intel.com, linux-mm@kvack.org, paulus@samba.org,
	pavel@ucw.cz, hpa@zytor.com, sparclinux@vger.kernel.org,
	cl@linux.com, will@kernel.org, linux-riscv@lists.infradead.org,
	linux-s390@vger.kernel.org, x86@kernel.org, rppt@linux.ibm.com,
	borntraeger@de.ibm.com, mingo@redhat.com, rientjes@google.com,
	Brown, Len, aou@eecs.berkeley.edu, gor@linux.ibm.com,
	linux-pm@vger.kernel.org, hca@linux.ibm.com, bp@alien8.de,
	luto@kernel.org, paul.walmsley@sifive.com, kirill@shutemov.name,
	tglx@linutronix.de, iamjoonsoo.kim@lge.com,
	linux-arm-kernel@lists.infradead.org, rjw@rjwysocki.net,
	linux-kernel@vger.kernel.org, penberg@kernel.org,
	palmer@dabbelt.com, akpm@linux-foundation.org,
	linuxppc-dev@lists.ozlabs.org, davem@davemloft.net
In-Reply-To: <9e77d0a939eda3029d6ae89bd14d7f1465b0559d.camel@intel.com>

On Wed, Oct 28, 2020 at 09:03:31PM +0000, Edgecombe, Rick P wrote:

> > On Wed, Oct 28, 2020 at 11:20:12AM +0000, Will Deacon wrote:
> > > On Tue, Oct 27, 2020 at 10:38:16AM +0200, Mike Rapoport wrote:
> > > > 					   	
> > > > This is a theoretical bug, but it is still not nice :) 		
> > > > 					
> > > 
> > > Just to clarify: this patch series fixes this problem, right?
> > 
> > Yes.
> > 
> 
> Well, now I'm confused again.
> 
> As David pointed, __vunmap() should not be executing simultaneously
> with the hibernate operation because hibernate can't snapshot while
> data it needs to save is still updating. If a thread was paused when a
> page was in an "invalid" state, it should be remapped by hibernate
> before the copy.
> 
> To level set, before reading this mail, my takeaways from the
> discussions on potential hibernate/debug page alloc problems were:
> 
> Potential RISC-V issue:
> Doesn't have hibernate support
> 
> Potential ARM issue:
> The logic around when it's cpa determines pages might be unmapped looks
> correct for current callers.
> 
> Potential x86 page break issue:
> Seems to be ok for now, but a new set_memory_np() caller could violate
> assumptions in hibernate.
> 
> Non-obvious thorny logic: 
> General agreement it would be good to separate dependencies.
> 
> Behavior of V1 of this patchset:
> No functional change other than addition of a warn in hibernate.

There is a change that adds explicit use of set_direct_map() to
hibernate. Currently, in case of arm64 with DEBUG_PAGEALLOC=n if a
thread was paused when a page was in an "invalid" state hibernate will
access an unmapped data because __kernel_map_pages() will bail out.
After the change set_direct_map_default_noflush() would be used and the
page will get mapped before copy.

> So "does this fix the problem", "yes" leaves me a bit confused... Not
> saying there couldn't be any problems, especially due to the thorniness
> and cross arch stride, but what is it exactly and how does this series
> fix it?

This series goal was primarily to separate dependincies and make it
clearer what DEBUG_PAGEALLOC and what SET_DIRECT_MAP are. As it turned
out, there is also some lack of consistency between architectures that
implement either of this so I tried to improve this as well.

Honestly, I don't know if a thread can be paused at the time __vunmap()
left invalid pages, but it could, there is an issue on arm64 with
DEBUG_PAGEALLOC=n and this set fixes it.

__vunmap()
    vm_remove_mappings()
        set_direct_map_invalid()
	/* thread is frozen */
 					safe_copy_page()	
 					    __kernel_map_pages()
						if (!debug_pagealloc())
 					    	    return
 					    do_copy_page() -> fault

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH 0/4] arch, mm: improve robustness of direct map manipulation
From: David Hildenbrand @ 2020-10-29  8:15 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	Pavel Machek, H. Peter Anvin, sparclinux, Christoph Lameter,
	Will Deacon, linux-riscv, linux-s390, x86, Mike Rapoport,
	Christian Borntraeger, Ingo Molnar, Catalin Marinas, Len Brown,
	Albert Ou, Vasily Gorbik, linux-pm, Heiko Carstens,
	David Rientjes, Borislav Petkov, Andy Lutomirski, Paul Walmsley,
	Kirill A. Shutemov, Thomas Gleixner, linux-arm-kernel,
	Rafael J. Wysocki, linux-kernel, Pekka Enberg, Palmer Dabbelt,
	Joonsoo Kim, Edgecombe, Rick P, linuxppc-dev, David S. Miller
In-Reply-To: <20201025101555.3057-1-rppt@kernel.org>

On 25.10.20 11:15, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Hi,
> 
> During recent discussion about KVM protected memory, David raised a concern
> about usage of __kernel_map_pages() outside of DEBUG_PAGEALLOC scope [1].
> 
> Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP it is
> possible that __kernel_map_pages() would fail, but since this function is
> void, the failure will go unnoticed.
> 
> Moreover, there's lack of consistency of __kernel_map_pages() semantics
> across architectures as some guard this function with
> #ifdef DEBUG_PAGEALLOC, some refuse to update the direct map if page
> allocation debugging is disabled at run time and some allow modifying the
> direct map regardless of DEBUG_PAGEALLOC settings.
> 
> This set straightens this out by restoring dependency of
> __kernel_map_pages() on DEBUG_PAGEALLOC and updating the call sites
> accordingly.
> 

So, I was primarily wondering if we really have to touch direct mappings 
in hibernation code, or if we can avoid doing that. I was wondering if 
we cannot simply do something like kmap() when trying to access a 
!mapped page. Similar to reading old-os memory after kexec when in 
kdump. Just a thought.

-- 
Thanks,

David / dhildenb


^ permalink raw reply

* Re: [PATCH kernel v3 2/2] powerpc/dma: Fallback to dma_ops when persistent memory present
From: Michael Ellerman @ 2020-10-29  9:55 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev; +Cc: iommu, Christoph Hellwig, linux-kernel
In-Reply-To: <2f285412-9e19-7888-1102-f50658c43b9d@ozlabs.ru>

Alexey Kardashevskiy <aik@ozlabs.ru> writes:
> On 29/10/2020 11:40, Michael Ellerman wrote:
>> Alexey Kardashevskiy <aik@ozlabs.ru> writes:
>>> @@ -1126,7 +1129,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>>   
>>>   	mutex_lock(&direct_window_init_mutex);
>>>   
>>> -	dma_addr = find_existing_ddw(pdn);
>>> +	dma_addr = find_existing_ddw(pdn, &len);
>> 
>> I don't see len used anywhere?
>> 
>>>   	if (dma_addr != 0)
>>>   		goto out_unlock;
>>>   
>>> @@ -1212,14 +1215,26 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>>   	}
>>>   	/* verify the window * number of ptes will map the partition */
>>>   	/* check largest block * page size > max memory hotplug addr */
>>> -	max_addr = ddw_memory_hotplug_max();
>>> -	if (query.largest_available_block < (max_addr >> page_shift)) {
>>> -		dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
>>> -			  "%llu-sized pages\n", max_addr,  query.largest_available_block,
>>> -			  1ULL << page_shift);
>>> +	/*
>>> +	 * The "ibm,pmemory" can appear anywhere in the address space.
>>> +	 * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
>>> +	 * for the upper limit and fallback to max RAM otherwise but this
>>> +	 * disables device::dma_ops_bypass.
>>> +	 */
>>> +	len = max_ram_len;
>> 
>> Here you override whatever find_existing_ddw() wrote to len?
>
> Not always, there is a bunch of gotos before this line to the end of the 
> function and one (which returns the existing window) is legit. Thanks,

Ah yep I see it.

Gotos considered confusing ;)

cheers

^ permalink raw reply

* Re: [PATCH kernel v4 2/2] powerpc/dma: Fallback to dma_ops when persistent memory present
From: Michael Ellerman @ 2020-10-29  9:56 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: Alexey Kardashevskiy, iommu, Christoph Hellwig, linux-kernel
In-Reply-To: <20201029015241.73920-3-aik@ozlabs.ru>

Alexey Kardashevskiy <aik@ozlabs.ru> writes:
> So far we have been using huge DMA windows to map all the RAM available.
> The RAM is normally mapped to the VM address space contiguously, and
> there is always a reasonable upper limit for possible future hot plugged
> RAM which makes it easy to map all RAM via IOMMU.
>
> Now there is persistent memory ("ibm,pmemory" in the FDT) which (unlike
> normal RAM) can map anywhere in the VM space beyond the maximum RAM size
> and since it can be used for DMA, it requires extending the huge window
> up to MAX_PHYSMEM_BITS which requires hypervisor support for:
> 1. huge TCE tables;
> 2. multilevel TCE tables;
> 3. huge IOMMU pages.
>
> Certain hypervisors cannot do either so the only option left is
> restricting the huge DMA window to include only RAM and fallback to
> the default DMA window for persistent memory.
>
> This defines arch_dma_map_direct/etc to allow generic DMA code perform
> additional checks on whether direct DMA is still possible.
>
> This checks if the system has persistent memory. If it does not,
> the DMA bypass mode is selected, i.e.
> * dev->bus_dma_limit = 0
> * dev->dma_ops_bypass = true <- this avoid calling dma_ops for mapping.
>
> If there is such memory, this creates identity mapping only for RAM and
> sets the dev->bus_dma_limit to let the generic code decide whether to
> call into the direct DMA or the indirect DMA ops.
>
> This should not change the existing behaviour when no persistent memory
> as dev->dma_ops_bypass is expected to be set.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Acked-by: Michael Ellerman <mpe@ellerman.id.au>

cheers

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox