LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [V4 0/2] tools/perf: Add instruction and data address registers to extended regs in powerpc
From: Arnaldo Carvalho de Melo @ 2021-10-19 16:30 UTC (permalink / raw)
  To: Athira Rajeev
  Cc: maddy, rnsastry, linux-perf-users, jolsa, kjain, linuxppc-dev
In-Reply-To: <20211018114948.16830-1-atrajeev@linux.vnet.ibm.com>

Em Mon, Oct 18, 2021 at 05:19:46PM +0530, Athira Rajeev escreveu:
> Patch set adds PMU registers namely Sampled Instruction Address Register
> (SIAR) and Sampled Data Address Register (SDAR) as part of extended regs
> in PowerPC. These registers provides the instruction/data address and
> adding these to extended regs helps in debug purposes.
> 
> Patch 1/2 refactors the existing macro definition of
> PERF_REG_PMU_MASK_300 and PERF_REG_PMU_MASK_31 to make it more
> readable.
> Patch 2/2 includes perf tools side changes to add the SPRs to
> sample_reg_mask to use with -I? option.

Thanks, applied.

- Arnaldo

 
> Changelog:
> Change from v3 -> v4:
> - Spilt tools side patches separately since kernel side
>   changes are in powerpc/next. There is no code wise changes
>   from v3.
>   Link to previous version:
>   https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=265811&state=*
> 
>   Kernel patches are taken to powerpc/next:
>   [1/4] powerpc/perf: Refactor the code definition of perf reg extended mask
>   https://git.kernel.org/powerpc/c/02b182e67482d9167a13a0ff19b55037b70b21ad
>   [3/4] powerpc/perf: Expose instruction and data address registers as part of extended regs
>   https://git.kernel.org/powerpc/c/29908bbf7b8960d261dfdd428bbaa656275e80f3
> 
> Change from v2 -> v3:
> Addressed review comments from Michael Ellerman
> - Fixed the macro definition to use "unsigned long long"
>   which otherwise will cause build error with perf on
>   32-bit.
> - Added Reviewed-by from Daniel Axtens for patch3.
> 
> Change from v1 -> v2:
> Addressed review comments from Michael Ellerman
> - Refactored the perf reg extended mask value macros for
>   PERF_REG_PMU_MASK_300 and PERF_REG_PMU_MASK_31 to
>   make it more readable. Also moved PERF_REG_EXTENDED_MAX
>   along with enum definition similar to PERF_REG_POWERPC_MAX.
> 
> Athira Rajeev (2):
>   tools/perf: Refactor the code definition of perf reg extended mask in
>     tools side header file
>   tools/perf: Add perf tools support to expose instruction and data
>     address registers as part of extended regs
> 
>  .../arch/powerpc/include/uapi/asm/perf_regs.h | 28 ++++++++++++-------
>  tools/perf/arch/powerpc/include/perf_regs.h   |  2 ++
>  tools/perf/arch/powerpc/util/perf_regs.c      |  2 ++
>  3 files changed, 22 insertions(+), 10 deletions(-)
> 
> -- 
> 2.33.0

-- 

- Arnaldo

^ permalink raw reply

* Re: [PATCH v2][net-next] soc: fsl: dpio: Unsigned compared against 0 in qbman_swp_set_irq_coalescing()
From: Ioana Ciornei @ 2021-10-19 18:39 UTC (permalink / raw)
  To: Tim Gardner
  Cc: netdev@vger.kernel.org, Roy Pledge, linux-kernel@vger.kernel.org,
	Leo Li, linuxppc-dev@lists.ozlabs.org,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <20211019121925.8910-1-tim.gardner@canonical.com>

On Tue, Oct 19, 2021 at 06:19:25AM -0600, Tim Gardner wrote:
> Coverity complains of unsigned compare against 0. There are 2 cases in
> this function:
> 
> 1821        itp = (irq_holdoff * 1000) / p->desc->qman_256_cycles_per_ns;
> 
> CID 121131 (#1 of 1): Unsigned compared against 0 (NO_EFFECT)
> unsigned_compare: This less-than-zero comparison of an unsigned value is never true. itp < 0U.
> 1822        if (itp < 0 || itp > 4096) {
> 1823                max_holdoff = (p->desc->qman_256_cycles_per_ns * 4096) / 1000;
> 1824                pr_err("irq_holdoff must be between 0..%dus\n", max_holdoff);
> 1825                return -EINVAL;
> 1826        }
> 1827
>     	unsigned_compare: This less-than-zero comparison of an unsigned value is never true. irq_threshold < 0U.
> 1828        if (irq_threshold >= p->dqrr.dqrr_size || irq_threshold < 0) {
> 1829                pr_err("irq_threshold must be between 0..%d\n",
> 1830                       p->dqrr.dqrr_size - 1);
> 1831                return -EINVAL;
> 1832        }
> 
> Fix this by removing the comparisons altogether as they are incorrect. Zero is
> a possible value in either case. Also fix a minor comment typo and update the
> 2 pr_err() calls to use %u formatting as well as be more precise regarding
> the exact error.
> 
> Fixes: ed1d2143fee5 ("soc: fsl: dpio: add support for irq coalescing per software portal")
> Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
> Cc: Roy Pledge <Roy.Pledge@nxp.com>
> Cc: Li Yang <leoyang.li@nxp.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: netdev@vger.kernel.org
> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>

Thanks,

-Ioana

^ permalink raw reply

* Re: [PATCH] powerpc/pseries/mobility: ignore ibm,platform-facilities updates
From: Nathan Lynch @ 2021-10-19 21:36 UTC (permalink / raw)
  To: Tyrel Datwyler; +Cc: cheloha, ldufour, linuxppc-dev
In-Reply-To: <6de8b295-112f-651e-a18e-3ab3e499ad69@linux.ibm.com>

Hi Tyrel, thanks for the detailed review.

Tyrel Datwyler <tyreld@linux.ibm.com> writes:
> On 10/18/21 9:34 AM, Nathan Lynch wrote:
>> On VMs with NX encryption, compression, and/or RNG offload, these
>> capabilities are described by nodes in the ibm,platform-facilities device
>> tree hierarchy:
>> 
>>   $ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
>>   /sys/firmware/devicetree/base/ibm,platform-facilities/
>>   ├── ibm,compression-v1
>>   ├── ibm,random-v1
>>   └── ibm,sym-encryption-v1
>> 
>>   3 directories
>> 
>> The acceleration functions that these nodes describe are not disrupted by
>> live migration, not even temporarily.
>> 
>> But the post-migration ibm,update-nodes sequence firmware always sends
>> "delete" messages for this hierarchy, followed by an "add" directive to
>> reconstruct it via ibm,configure-connector (log with debugging statements
>> enabled in mobility.c):
>> 
>>   mobility: removing node /ibm,platform-facilities/ibm,random-v1:4294967285
>>   mobility: removing node /ibm,platform-facilities/ibm,compression-v1:4294967284
>>   mobility: removing node /ibm,platform-facilities/ibm,sym-encryption-v1:4294967283
>>   mobility: removing node /ibm,platform-facilities:4294967286
>>   ...
>>   mobility: added node /ibm,platform-facilities:4294967286
>
> It always bothered me that the update from firmware here was an delete/add at
> the entire '/ibm,platform-facilities' node level instead of update properties
> calls. When I asked about it years ago the claim was it was just easier to pass
> an entire sub-tree as a single node add.

Yes, I believe this is for ease of implementation on their part.

>> Note we receive a single "add" message for the entire hierarchy, and what
>> we receive from the ibm,configure-connector sequence is the top-level
>> platform-facilities node along with its three children. The debug message
>> simply reports the parent node and not the whole subtree.
>> 
>> Also, significantly, the nodes added are almost completely equivalent to
>> the ones removed; even phandles are unchanged. ibm,shared-interrupt-pool in
>> the leaf nodes is the only property I've observed to differ, and Linux does
>> not use that. So in practice, the sum of update messages Linux receives for
>> this hierarchy is equivalent to minor property updates.
>
> The two props I would think maybe we would need to be most be concerned about
> ensuring don't change are "ibm,resource-id" which gets used in the vio bus code
> when configuring platform facilities nodes, and 'ibm,max-sync-cop' used by the
> pseries-nx driver.

The proposed change does not introduce any regressions with respect to
those properties. At least, that's the intention.

ibm,resource-id: the vio code lacks any way of reacting to this changing
at runtime. (I could be wrong, but I suspect it does not change in
practice even though it is technically allowed by the spec.)

ibm,max-sync-cop: the nx code has a notifier callback to handle changes
to this and a couple other properties, but only if they are propagated
through an "update property" event. Node remove+add doesn't trigger the
callback. I think this one is more likely than resource-id to change
between different models or configurations, but I haven't observed this
to happen in my test environment.

Without this change, the child nodes of the ibm,platform-facilities
container in Linux's device tree are completely discarded (see more
below) after a partition migration. This makes driver re-initialization
impossible without a reboot. With the change, the nodes are retained,
albeit with properties that may not correctly reflect the destination
system. To be clear, I think this is bad, and I am working on a better
solution. But I consider the change here a net improvement for the
common case (no meaningful property changes), and it's a wash for
migrations where the properties could change (we weren't handling that
anyway).

>> We succeed in removing the original hierarchy from the device tree. But the
>> drivers bound to the leaf nodes are ignorant of this, and do not relinquish
>> their references. The leaf nodes, still reachable through sysfs, of course
>> still refer to the now-freed ibm,platform-facilities parent node, which
>> makes use-after-free possible:
>
> It is actually more subtle then the drivers themselves being ignorant. They
> could register node update notifiers, but the real problem here is that the vio
> bus device itself takes a reference to each child node registered to the bus.
> I'm not sure we really want to unbind/rebind drivers as a result of LPM, but it
> would be generic to the vio bus instead of updating each driver to ensure its
> handling it device node references properly.

You're right. I will modify the commit message.

>>   refcount_t: addition on 0; use-after-free.
>>   WARNING: CPU: 3 PID: 1706 at lib/refcount.c:25 refcount_warn_saturate+0x164/0x1f0
>>   refcount_warn_saturate+0x160/0x1f0 (unreliable)
>>   kobject_get+0xf0/0x100
>>   of_node_get+0x30/0x50
>>   of_get_parent+0x50/0xb0
>>   of_fwnode_get_parent+0x54/0x90
>>   fwnode_count_parents+0x50/0x150
>>   fwnode_full_name_string+0x30/0x110
>>   device_node_string+0x49c/0x790
>>   vsnprintf+0x1c0/0x4c0
>>   sprintf+0x44/0x60
>>   devspec_show+0x34/0x50
>>   dev_attr_show+0x40/0xa0
>>   sysfs_kf_seq_show+0xbc/0x200
>>   kernfs_seq_show+0x44/0x60
>>   seq_read_iter+0x2a4/0x740
>>   kernfs_fop_read_iter+0x254/0x2e0
>>   new_sync_read+0x120/0x190
>>   vfs_read+0x1d0/0x240
>> 
>> Moreover, the "new" replacement subtree is not correctly added to the
>> device tree, resulting in ibm,platform-facilities parent node without the
>> appropriate leaf nodes, and broken symlinks in the sysfs device hierarchy:
>> 
>>   $ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
>>   /sys/firmware/devicetree/base/ibm,platform-facilities/
>> 
>>   0 directories
>> 
>>   $ cd /sys/devices/vio ; find . -xtype l -exec file {} +
>>   ./ibm,sym-encryption-v1/of_node: broken symbolic link to
>>     ../../../firmware/devicetree/base/ibm,platform-facilities/ibm,sym-encryption-v1
>>   ./ibm,random-v1/of_node:         broken symbolic link to
>>     ../../../firmware/devicetree/base/ibm,platform-facilities/ibm,random-v1
>>   ./ibm,compression-v1/of_node:    broken symbolic link to
>>     ../../../firmware/devicetree/base/ibm,platform-facilities/ibm,compression-v1
>> 
>> This is because add_dt_node() -> dlpar_attach_node() attaches only the
>> parent node returned from configure-connector, ignoring any children. This
>> should be corrected for the general case, but fixing that won't help with
>> the stale OF node references, which is the more urgent problem.
>
> I don't follow. If the code path you mention is truly broken in the way you say
> then DLPAR operations involving nodes with child nodes should also be
> broken.

Hmm I don't know of any kernel-based DLPAR workflow that attaches
sub-trees i.e. parent + children. Processor addition attaches CPU nodes
and possibly cache nodes, but they have sibling relationships. If you're
thinking of I/O adapters, the DT updates are (still!) driven from user
space. As undesirable as that is, that use case actually works correctly
AFAIK.

What happens for the platfac LPM scenario is that
dlpar_configure_connector() returns the node pointer for the root
ibm,platform-facilities node, with children attached. That node pointer
is passed from add_dt_node() -> dlpar_attach_node() -> of_attach_node()
-> __of_attach_node(), where its child and sibling fields are
overwritten in the process of attaching it to the tree.

> Last I had seen was that sysfs adds of the new nodes got renamed because the old
> nodes still existed as a result of there reference count not going to zero. Has
> this behavior changed, or am I misremembering the observed behavior?

Maybe you're thinking of the CPU cache node issue recently fixed by
7edd5c9a8820 ("powerpc/pseries/cpuhp: cache node corrections")?

>> One way to address that would be to make the drivers respond to node
>> removal notifications, so that node references can be dropped
>> appropriately. But this would likely force the drivers to disrupt active
>> clients for no useful purpose: equivalent nodes are immediately re-added.
>> And recall that the acceleration capabilities described by the nodes remain
>> available throughout the whole process.
>
> See my comments above about its the vio bus more at fault here then the drivers
> themselves. I'm inclined to agree though that disrupting active operations with
> a driver unbind/rebind is a little extreme.
>
> This also brings me back to firmware removing and re-adding the whole
> '/ibm,platform-facilities' node instead of simply updating changed properties
> could avoid this whole fiasco.

It's a little unfortunate for us, but it's longstanding behavior that is
presumably tolerated fine by other OSes, and it's allowed by the spec.
So we just need to deal with it.

>> The solution I believe to be robust for this situation is to convert
>> remove+add of a node with an unchanged phandle to an update of the node's
>> properties in the Linux device tree structure. That would involve changing
>> and adding a fair amount of code, and may take several iterations to land.
>> 
>> Until that can be realized we have a confirmed use-after-free and the
>> possibility of memory corruption. So add a limited workaround that
>> discriminates on the node type, ignoring adds and removes. This should be
>> amenable to backporting in the meantime.
>
> The reality is that '/ibm,platform-facilities' and 'cache' nodes are the only
> LPM scoped device tree nodes that allow node delete/add. So, as a one-off
> workaround to deal with what I consider a bad firmware approach I think this is
> probably the best approach barring getting firmware to move to an update
> properties approach.

Thanks. I think there's no reason to expect firmware's behavior to
change, and there wouldn't be much payoff for us at this point -- we
would still have to support the current behavior.

^ permalink raw reply

* Re: [PATCH] powerpc/pseries/mobility: ignore ibm,platform-facilities updates
From: Nathan Lynch @ 2021-10-19 21:43 UTC (permalink / raw)
  To: Laurent Dufour; +Cc: cheloha, linuxppc-dev, Tyrel Datwyler
In-Reply-To: <6f72adc9-d28c-beeb-e21f-4a468d2bf0e3@linux.ibm.com>

Laurent Dufour <ldufour@linux.ibm.com> writes:
> Le 19/10/2021 à 00:37, Tyrel Datwyler a écrit :
>> On 10/18/21 9:34 AM, Nathan Lynch wrote:
>> The reality is that '/ibm,platform-facilities' and 'cache' nodes are the only
>> LPM scoped device tree nodes that allow node delete/add. So, as a one-off
>> workaround to deal with what I consider a bad firmware approach I think this is
>> probably the best approach barring getting firmware to move to an update
>> properties approach.
>
> I do agree, this is probably the best option until the firmware is moving to an 
> update notification.

Just to be clear, my proposal is to carry this hack until *Linux* can
be changed to better support the current firmware behavior. There's
little point in trying to get firmware to change now.

^ permalink raw reply

* Re: [PATCH 1/6] PCI/AER: Enable COR/UNCOR error reporting in set_device_error_reporting()
From: Bjorn Helgaas @ 2021-10-19 22:44 UTC (permalink / raw)
  To: Naveen Naidu
  Cc: tsbogend, linux-pci, linux-mips, linux-kernel, oohall, bhelgaas,
	linuxppc-dev, linux-kernel-mentees
In-Reply-To: <b583172ece1fb1dab3d75c6007ec8c443323158d.1633369560.git.naveennaidu479@gmail.com>

On Mon, Oct 04, 2021 at 11:29:27PM +0530, Naveen Naidu wrote:
> The (PCIe r5.0, sec 7.6.4.3, Table 7-101) and  (PCIe r5.0, sec 7.8.4.6,
> Table 7-104) 

s/7.6.4.3/7.8.4.3/

Cite it like this:

  Per PCIe r5.0, sec 7.8.4.3 and sec 7.8.4.6, the default values ...

> states that the default values for the Uncorrectable Error
> Mask and Correctable Error Mask should be 0b. But the current code does
> not set the default value of these registers when the PCIe bus loads the
> AER service driver.

The defaults specified here are for hardware designers -- this is what
the registers must contain after power-up or reset.  This section of
the spec isn't telling us what the OS is required to write.

If we want to clear these masks, I think we have to:

  1) Analyze every other place that writes the masks to make sure we
  don't break any of them.  There aren't very many, and most of them
  are in drivers, which would be after the aer_probe() path.  There
  might be a conflict with program_hpx_type2(), though.

  2) Make it dependent on pcie_aer_is_native().  Ownership of the AER
  capability can be retained by the platform, in which case the OS
  shouldn't touch it.

> Enable reporting of all correctable and uncorrectable errors during
> aer_probe()
> 
> Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
> ---
>  drivers/pci/pcie/aer.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 9784fdcf3006..88c4ca6098fb 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1212,6 +1212,7 @@ static int set_device_error_reporting(struct pci_dev *dev, void *data)
>  {
>  	bool enable = *((bool *)data);
>  	int type = pci_pcie_type(dev);
> +	int aer = dev->aer_cap;
>  
>  	if ((type == PCI_EXP_TYPE_ROOT_PORT) ||
>  	    (type == PCI_EXP_TYPE_RC_EC) ||
> @@ -1223,8 +1224,18 @@ static int set_device_error_reporting(struct pci_dev *dev, void *data)
>  			pci_disable_pcie_error_reporting(dev);
>  	}
>  
> -	if (enable)
> +	if (enable) {
> +
> +		/* Enable reporting of all uncorrectable errors */
> +		/* Uncorrectable Error Mask - turned on bits disable errors */
> +		pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, 0);
> +
> +		/* Enable reporting of all correctable errors */
> +		/* Correctable Error Mask - turned on bits disable errors */
> +		pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, 0);
> +
>  		pcie_set_ecrc_checking(dev);
> +	}
>  
>  	return 0;
>  }
> -- 
> 2.25.1
> 
> _______________________________________________
> Linux-kernel-mentees mailing list
> Linux-kernel-mentees@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees

^ permalink raw reply

* Re: [PATCH] powerpc/pseries/mobility: ignore ibm, platform-facilities updates
From: Tyrel Datwyler @ 2021-10-19 23:02 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: cheloha, ldufour, linuxppc-dev
In-Reply-To: <87zgr4y9x4.fsf@linux.ibm.com>

On 10/19/21 2:36 PM, Nathan Lynch wrote:
> Hi Tyrel, thanks for the detailed review.
> 
> Tyrel Datwyler <tyreld@linux.ibm.com> writes:
>> On 10/18/21 9:34 AM, Nathan Lynch wrote:
>>> On VMs with NX encryption, compression, and/or RNG offload, these
>>> capabilities are described by nodes in the ibm,platform-facilities device
>>> tree hierarchy:
>>>
>>>   $ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
>>>   /sys/firmware/devicetree/base/ibm,platform-facilities/
>>>   ├── ibm,compression-v1
>>>   ├── ibm,random-v1
>>>   └── ibm,sym-encryption-v1
>>>
>>>   3 directories
>>>
>>> The acceleration functions that these nodes describe are not disrupted by
>>> live migration, not even temporarily.
>>>
>>> But the post-migration ibm,update-nodes sequence firmware always sends
>>> "delete" messages for this hierarchy, followed by an "add" directive to
>>> reconstruct it via ibm,configure-connector (log with debugging statements
>>> enabled in mobility.c):
>>>
>>>   mobility: removing node /ibm,platform-facilities/ibm,random-v1:4294967285
>>>   mobility: removing node /ibm,platform-facilities/ibm,compression-v1:4294967284
>>>   mobility: removing node /ibm,platform-facilities/ibm,sym-encryption-v1:4294967283
>>>   mobility: removing node /ibm,platform-facilities:4294967286
>>>   ...
>>>   mobility: added node /ibm,platform-facilities:4294967286
>>
>> It always bothered me that the update from firmware here was an delete/add at
>> the entire '/ibm,platform-facilities' node level instead of update properties
>> calls. When I asked about it years ago the claim was it was just easier to pass
>> an entire sub-tree as a single node add.
> 
> Yes, I believe this is for ease of implementation on their part.

I'd still argue that a full node removal or addition is a platform
reconfiguration on par with a hotplug operation.

> 
>>> Note we receive a single "add" message for the entire hierarchy, and what
>>> we receive from the ibm,configure-connector sequence is the top-level
>>> platform-facilities node along with its three children. The debug message
>>> simply reports the parent node and not the whole subtree.
>>>
>>> Also, significantly, the nodes added are almost completely equivalent to
>>> the ones removed; even phandles are unchanged. ibm,shared-interrupt-pool in
>>> the leaf nodes is the only property I've observed to differ, and Linux does
>>> not use that. So in practice, the sum of update messages Linux receives for
>>> this hierarchy is equivalent to minor property updates.
>>
>> The two props I would think maybe we would need to be most be concerned about
>> ensuring don't change are "ibm,resource-id" which gets used in the vio bus code
>> when configuring platform facilities nodes, and 'ibm,max-sync-cop' used by the
>> pseries-nx driver.
> 
> The proposed change does not introduce any regressions with respect to
> those properties. At least, that's the intention.
> 
> ibm,resource-id: the vio code lacks any way of reacting to this changing
> at runtime. (I could be wrong, but I suspect it does not change in
> practice even though it is technically allowed by the spec.)
> 
> ibm,max-sync-cop: the nx code has a notifier callback to handle changes
> to this and a couple other properties, but only if they are propagated
> through an "update property" event. Node remove+add doesn't trigger the
> callback. I think this one is more likely than resource-id to change
> between different models or configurations, but I haven't observed this
> to happen in my test environment.
> 
> Without this change, the child nodes of the ibm,platform-facilities
> container in Linux's device tree are completely discarded (see more
> below) after a partition migration. This makes driver re-initialization
> impossible without a reboot. With the change, the nodes are retained,
> albeit with properties that may not correctly reflect the destination
> system. To be clear, I think this is bad, and I am working on a better
> solution. But I consider the change here a net improvement for the
> common case (no meaningful property changes), and it's a wash for
> migrations where the properties could change (we weren't handling that
> anyway).

This didn't use to be the case. After some digging I see the problem, and it is
clearly a long standing regression (see comment further down). Part of that is
my fault for putting the platform-facilities errors on the back burner when I
was working this area.

> 
>>> We succeed in removing the original hierarchy from the device tree. But the
>>> drivers bound to the leaf nodes are ignorant of this, and do not relinquish
>>> their references. The leaf nodes, still reachable through sysfs, of course
>>> still refer to the now-freed ibm,platform-facilities parent node, which
>>> makes use-after-free possible:
>>
>> It is actually more subtle then the drivers themselves being ignorant. They
>> could register node update notifiers, but the real problem here is that the vio
>> bus device itself takes a reference to each child node registered to the bus.
>> I'm not sure we really want to unbind/rebind drivers as a result of LPM, but it
>> would be generic to the vio bus instead of updating each driver to ensure its
>> handling it device node references properly.
> 
> You're right. I will modify the commit message.
> 
>>>   refcount_t: addition on 0; use-after-free.
>>>   WARNING: CPU: 3 PID: 1706 at lib/refcount.c:25 refcount_warn_saturate+0x164/0x1f0
>>>   refcount_warn_saturate+0x160/0x1f0 (unreliable)
>>>   kobject_get+0xf0/0x100
>>>   of_node_get+0x30/0x50
>>>   of_get_parent+0x50/0xb0
>>>   of_fwnode_get_parent+0x54/0x90
>>>   fwnode_count_parents+0x50/0x150
>>>   fwnode_full_name_string+0x30/0x110
>>>   device_node_string+0x49c/0x790
>>>   vsnprintf+0x1c0/0x4c0
>>>   sprintf+0x44/0x60
>>>   devspec_show+0x34/0x50
>>>   dev_attr_show+0x40/0xa0
>>>   sysfs_kf_seq_show+0xbc/0x200
>>>   kernfs_seq_show+0x44/0x60
>>>   seq_read_iter+0x2a4/0x740
>>>   kernfs_fop_read_iter+0x254/0x2e0
>>>   new_sync_read+0x120/0x190
>>>   vfs_read+0x1d0/0x240
>>>
>>> Moreover, the "new" replacement subtree is not correctly added to the
>>> device tree, resulting in ibm,platform-facilities parent node without the
>>> appropriate leaf nodes, and broken symlinks in the sysfs device hierarchy:
>>>
>>>   $ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
>>>   /sys/firmware/devicetree/base/ibm,platform-facilities/
>>>
>>>   0 directories
>>>
>>>   $ cd /sys/devices/vio ; find . -xtype l -exec file {} +
>>>   ./ibm,sym-encryption-v1/of_node: broken symbolic link to
>>>     ../../../firmware/devicetree/base/ibm,platform-facilities/ibm,sym-encryption-v1
>>>   ./ibm,random-v1/of_node:         broken symbolic link to
>>>     ../../../firmware/devicetree/base/ibm,platform-facilities/ibm,random-v1
>>>   ./ibm,compression-v1/of_node:    broken symbolic link to
>>>     ../../../firmware/devicetree/base/ibm,platform-facilities/ibm,compression-v1
>>>
>>> This is because add_dt_node() -> dlpar_attach_node() attaches only the
>>> parent node returned from configure-connector, ignoring any children. This
>>> should be corrected for the general case, but fixing that won't help with
>>> the stale OF node references, which is the more urgent problem.
>>
>> I don't follow. If the code path you mention is truly broken in the way you say
>> then DLPAR operations involving nodes with child nodes should also be
>> broken.
> 
> Hmm I don't know of any kernel-based DLPAR workflow that attaches
> sub-trees i.e. parent + children. Processor addition attaches CPU nodes
> and possibly cache nodes, but they have sibling relationships. If you're
> thinking of I/O adapters, the DT updates are (still!) driven from user
> space. As undesirable as that is, that use case actually works correctly
> AFAIK.
> 
> What happens for the platfac LPM scenario is that
> dlpar_configure_connector() returns the node pointer for the root
> ibm,platform-facilities node, with children attached. That node pointer
> is passed from add_dt_node() -> dlpar_attach_node() -> of_attach_node()
> -> __of_attach_node(), where its child and sibling fields are
> overwritten in the process of attaching it to the tree.

Well it worked back in 2013 when I got all the in kernel device tree update
stuff working. Looks like I missed this one which now explains one area where
platform-facilities update originally went to shit.

commit 6162dbe49a451f96431a23b4821f05e3bd925bc1
Author: Grant Likely <grant.likely@linaro.org>
Date:   Wed Jul 16 08:48:46 2014 -0600

    of: Make sure attached nodes don't carry along extra children

    The child pointer does not get cleared when attaching new nodes which
    could cause the tree to be inconsistent. Clear the child pointer in
    __of_attach_node() to be absolutely sure that the structure remains in a
    consistent layout.

    Signed-off-by: Grant Likely <grant.likely@linaro.org>

diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index c875787fa394..b96d83100987 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -98,6 +98,7 @@ int of_property_notify(int action, struct device_node *np,

 void __of_attach_node(struct device_node *np)
 {
+       np->child = NULL;
        np->sibling = np->parent->child;
        np->allnext = np->parent->allnext;
        np->parent->allnext = np;

Not sure what the reasoning was here since this prevents attaching a node that
is a sub tree. Either need to revert that, rewrite dlpar_attach_node to iterate
over the sub-tree, or probably best rewrite dlpar_configure_connector to use a
of_changeset instead.

> 
>> Last I had seen was that sysfs adds of the new nodes got renamed because the old
>> nodes still existed as a result of there reference count not going to zero. Has
>> this behavior changed, or am I misremembering the observed behavior?
> 
> Maybe you're thinking of the CPU cache node issue recently fixed by
> 7edd5c9a8820 ("powerpc/pseries/cpuhp: cache node corrections")?
> 
>>> One way to address that would be to make the drivers respond to node
>>> removal notifications, so that node references can be dropped
>>> appropriately. But this would likely force the drivers to disrupt active
>>> clients for no useful purpose: equivalent nodes are immediately re-added.
>>> And recall that the acceleration capabilities described by the nodes remain
>>> available throughout the whole process.
>>
>> See my comments above about its the vio bus more at fault here then the drivers
>> themselves. I'm inclined to agree though that disrupting active operations with
>> a driver unbind/rebind is a little extreme.
>>
>> This also brings me back to firmware removing and re-adding the whole
>> '/ibm,platform-facilities' node instead of simply updating changed properties
>> could avoid this whole fiasco.
> 
> It's a little unfortunate for us, but it's longstanding behavior that is
> presumably tolerated fine by other OSes, and it's allowed by the spec.
> So we just need to deal with it.
> 
>>> The solution I believe to be robust for this situation is to convert
>>> remove+add of a node with an unchanged phandle to an update of the node's
>>> properties in the Linux device tree structure. That would involve changing
>>> and adding a fair amount of code, and may take several iterations to land.
>>>
>>> Until that can be realized we have a confirmed use-after-free and the
>>> possibility of memory corruption. So add a limited workaround that
>>> discriminates on the node type, ignoring adds and removes. This should be
>>> amenable to backporting in the meantime.
>>
>> The reality is that '/ibm,platform-facilities' and 'cache' nodes are the only
>> LPM scoped device tree nodes that allow node delete/add. So, as a one-off
>> workaround to deal with what I consider a bad firmware approach I think this is
>> probably the best approach barring getting firmware to move to an update
>> properties approach.
> 
> Thanks. I think there's no reason to expect firmware's behavior to
> change, and there wouldn't be much payoff for us at this point -- we
> would still have to support the current behavior.
> 


^ permalink raw reply related

* Re: [PATCH v2] soc: fsl: dpio: instead smp_processor_id with raw_smp_processor_id
From: Li Yang @ 2021-10-20  1:35 UTC (permalink / raw)
  To: Meng.Li
  Cc: Ioana Ciocoi Radulescu, Horia Geanta, Roy Pledge, lkml,
	linuxppc-dev,
	moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE
In-Reply-To: <20211019023241.17466-1-Meng.Li@windriver.com>

On Mon, Oct 18, 2021 at 9:46 PM <Meng.Li@windriver.com> wrote:
>
> From: Meng Li <Meng.Li@windriver.com>
>
> When enable debug kernel configs,there will be calltrace as below:
>
> BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
> caller is debug_smp_processor_id+0x20/0x30
> CPU: 6 PID: 1 Comm: swapper/0 Not tainted 5.10.63-yocto-standard #1
> Hardware name: NXP Layerscape LX2160ARDB (DT)
> Call trace:
>  dump_backtrace+0x0/0x1a0
>  show_stack+0x24/0x30
>  dump_stack+0xf0/0x13c
>  check_preemption_disabled+0x100/0x110
>  debug_smp_processor_id+0x20/0x30
>  dpaa2_io_query_fq_count+0xdc/0x154
>  dpaa2_eth_stop+0x144/0x314
>  __dev_close_many+0xdc/0x160
>  __dev_change_flags+0xe8/0x220
>  dev_change_flags+0x30/0x70
>  ic_close_devs+0x50/0x78
>  ip_auto_config+0xed0/0xf10
>  do_one_initcall+0xac/0x460
>  kernel_init_freeable+0x30c/0x378
>  kernel_init+0x20/0x128
>  ret_from_fork+0x10/0x38
>
> Based on comment in the context, it doesn't matter whether
> preemption is disable or not. So, instead smp_processor_id()

s/instead/replace/g

> with raw_smp_processor_id() to avoid above call trace.
>
> v2:
> Remove the preempt_disable/enable() protection, instead smp_processor_id
> with raw_smp_processor_id.

The revision history information should go after the "---" below.

>
> Fixes: c89105c9b390 ("staging: fsl-mc: Move DPIO from staging to drivers/soc/fsl")
> Cc: stable@vger.kernel.org
> Signed-off-by: Meng Li <Meng.Li@windriver.com>

I helped to fix the issues I mentioned.  Applied for fix.  Thanks.

> ---
>  drivers/soc/fsl/dpio/dpio-service.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/soc/fsl/dpio/dpio-service.c b/drivers/soc/fsl/dpio/dpio-service.c
> index 19f47ea9dab0..3050a534d42c 100644
> --- a/drivers/soc/fsl/dpio/dpio-service.c
> +++ b/drivers/soc/fsl/dpio/dpio-service.c
> @@ -59,7 +59,7 @@ static inline struct dpaa2_io *service_select_by_cpu(struct dpaa2_io *d,
>          * potentially being migrated away.
>          */
>         if (cpu < 0)
> -               cpu = smp_processor_id();
> +               cpu = raw_smp_processor_id();
>
>         /* If a specific cpu was requested, pick it up immediately */
>         return dpio_by_cpu[cpu];
> --
> 2.17.1
>

^ permalink raw reply

* Re: [PATCH] driver: soc: dpio: use the whole functions to protect critical zone
From: Li Yang @ 2021-10-20  1:42 UTC (permalink / raw)
  To: Meng.Li
  Cc: linuxppc-dev, Youri Querry,
	moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE, lkml
In-Reply-To: <20211019030555.29461-1-Meng.Li@windriver.com>

On Mon, Oct 18, 2021 at 10:07 PM <Meng.Li@windriver.com> wrote:
>
> From: Meng Li <Meng.Li@windriver.com>
>
> In orininal code, use 2 function spin_lock() and local_irq_save() to
> protect the critical zone. But when enable the kernel debug config,
> there are below inconsistent lock state detected.
> ================================
> WARNING: inconsistent lock state
> 5.10.63-yocto-standard #1 Not tainted
> --------------------------------
> inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
> lock_torture_wr/226 [HC0[0]:SC1[5]:HE1:SE0] takes:
> ffff002005b2dd80 (&p->access_spinlock){+.?.}-{3:3}, at: qbman_swp_enqueue_multiple_mem_back+0x44/0x270
> {SOFTIRQ-ON-W} state was registered at:
>   lock_acquire.part.0+0xf8/0x250
>   lock_acquire+0x68/0x84
>   _raw_spin_lock+0x68/0x90
>   qbman_swp_enqueue_multiple_mem_back+0x44/0x270
>   ......
>   cryptomgr_test+0x38/0x60
>   kthread+0x158/0x164
>   ret_from_fork+0x10/0x38
> irq event stamp: 4498
> hardirqs last  enabled at (4498): [<ffff800010fcf980>] _raw_spin_unlock_irqrestore+0x90/0xb0
> hardirqs last disabled at (4497): [<ffff800010fcffc4>] _raw_spin_lock_irqsave+0xd4/0xe0
> softirqs last  enabled at (4458): [<ffff8000100108c4>] __do_softirq+0x674/0x724
> softirqs last disabled at (4465): [<ffff80001005b2a4>] __irq_exit_rcu+0x190/0x19c
>
> other info that might help us debug this:
>  Possible unsafe locking scenario:
>        CPU0
>        ----
>   lock(&p->access_spinlock);
>   <Interrupt>
>     lock(&p->access_spinlock);
>  *** DEADLOCK ***
>
> So, in order to avoid deadlock, use the whole functinos

s/functinos/functions/

> spin_lock_irqsave/spin_unlock_irqrestore() to protect critical zone.
>
> Fixes: 3b2abda7d28c ("soc: fsl: dpio: Replace QMAN array mode with ring mode enqueue")
> Cc: stable@vger.kernel.org
> Signed-off-by: Meng Li <Meng.Li@windriver.com>

Applied for fix.  Thanks.

> ---
>  drivers/soc/fsl/dpio/qbman-portal.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/soc/fsl/dpio/qbman-portal.c b/drivers/soc/fsl/dpio/qbman-portal.c
> index 845e91416b58..a56dbe4de34f 100644
> --- a/drivers/soc/fsl/dpio/qbman-portal.c
> +++ b/drivers/soc/fsl/dpio/qbman-portal.c
> @@ -785,8 +785,7 @@ int qbman_swp_enqueue_multiple_mem_back(struct qbman_swp *s,
>         int i, num_enqueued = 0;
>         unsigned long irq_flags;
>
> -       spin_lock(&s->access_spinlock);
> -       local_irq_save(irq_flags);
> +       spin_lock_irqsave(&s->access_spinlock, irq_flags);
>
>         half_mask = (s->eqcr.pi_ci_mask>>1);
>         full_mask = s->eqcr.pi_ci_mask;
> @@ -797,8 +796,7 @@ int qbman_swp_enqueue_multiple_mem_back(struct qbman_swp *s,
>                 s->eqcr.available = qm_cyc_diff(s->eqcr.pi_ring_size,
>                                         eqcr_ci, s->eqcr.ci);
>                 if (!s->eqcr.available) {
> -                       local_irq_restore(irq_flags);
> -                       spin_unlock(&s->access_spinlock);
> +                       spin_unlock_irqrestore(&s->access_spinlock, irq_flags);
>                         return 0;
>                 }
>         }
> @@ -837,8 +835,7 @@ int qbman_swp_enqueue_multiple_mem_back(struct qbman_swp *s,
>         dma_wmb();
>         qbman_write_register(s, QBMAN_CINH_SWP_EQCR_PI,
>                                 (QB_RT_BIT)|(s->eqcr.pi)|s->eqcr.pi_vb);
> -       local_irq_restore(irq_flags);
> -       spin_unlock(&s->access_spinlock);
> +       spin_unlock_irqrestore(&s->access_spinlock, irq_flags);
>
>         return num_enqueued;
>  }
> --
> 2.17.1
>

^ permalink raw reply

* Re: [PATCH v1 04/11] powerpc/64s: Move and rename do_bad_slb_fault as it is not hash specific
From: Nicholas Piggin @ 2021-10-20  5:07 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev
In-Reply-To: <80ea8862-0843-02ae-791c-5c921c19803f@csgroup.eu>

Excerpts from Christophe Leroy's message of October 19, 2021 3:09 am:
> 
> 
> Le 15/10/2021 à 17:46, Nicholas Piggin a écrit :
>> slb.c is hash-specific SLB management, but do_bad_slb_fault deals with
>> segment interrupts that occur with radix MMU as well.
>> ---
>>   arch/powerpc/include/asm/interrupt.h |  2 +-
>>   arch/powerpc/kernel/exceptions-64s.S |  4 ++--
>>   arch/powerpc/mm/book3s64/slb.c       | 16 ----------------
>>   arch/powerpc/mm/fault.c              | 17 +++++++++++++++++
>>   4 files changed, 20 insertions(+), 19 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/interrupt.h b/arch/powerpc/include/asm/interrupt.h
>> index a1d238255f07..3487aab12229 100644
>> --- a/arch/powerpc/include/asm/interrupt.h
>> +++ b/arch/powerpc/include/asm/interrupt.h
>> @@ -564,7 +564,7 @@ DECLARE_INTERRUPT_HANDLER(kernel_bad_stack);
>>   
>>   /* slb.c */
>>   DECLARE_INTERRUPT_HANDLER_RAW(do_slb_fault);
>> -DECLARE_INTERRUPT_HANDLER(do_bad_slb_fault);
>> +DECLARE_INTERRUPT_HANDLER(do_bad_segment_interrupt);
>>   
>>   /* hash_utils.c */
>>   DECLARE_INTERRUPT_HANDLER_RAW(do_hash_fault);
>> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
>> index eaf1f72131a1..046c99e31d01 100644
>> --- a/arch/powerpc/kernel/exceptions-64s.S
>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>> @@ -1430,7 +1430,7 @@ MMU_FTR_SECTION_ELSE
>>   ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
>>   	std	r3,RESULT(r1)
>>   	addi	r3,r1,STACK_FRAME_OVERHEAD
>> -	bl	do_bad_slb_fault
>> +	bl	do_bad_segment_interrupt
>>   	b	interrupt_return_srr
>>   
>>   
>> @@ -1510,7 +1510,7 @@ MMU_FTR_SECTION_ELSE
>>   ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
>>   	std	r3,RESULT(r1)
>>   	addi	r3,r1,STACK_FRAME_OVERHEAD
>> -	bl	do_bad_slb_fault
>> +	bl	do_bad_segment_interrupt
>>   	b	interrupt_return_srr
>>   
>>   
>> diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
>> index f0037bcc47a0..31f4cef3adac 100644
>> --- a/arch/powerpc/mm/book3s64/slb.c
>> +++ b/arch/powerpc/mm/book3s64/slb.c
>> @@ -868,19 +868,3 @@ DEFINE_INTERRUPT_HANDLER_RAW(do_slb_fault)
>>   		return err;
>>   	}
>>   }
>> -
>> -DEFINE_INTERRUPT_HANDLER(do_bad_slb_fault)
>> -{
>> -	int err = regs->result;
>> -
>> -	if (err == -EFAULT) {
>> -		if (user_mode(regs))
>> -			_exception(SIGSEGV, regs, SEGV_BNDERR, regs->dar);
>> -		else
>> -			bad_page_fault(regs, SIGSEGV);
>> -	} else if (err == -EINVAL) {
>> -		unrecoverable_exception(regs);
>> -	} else {
>> -		BUG();
>> -	}
>> -}
>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
>> index a8d0ce85d39a..53ddcae0ac9e 100644
>> --- a/arch/powerpc/mm/fault.c
>> +++ b/arch/powerpc/mm/fault.c
>> @@ -35,6 +35,7 @@
>>   #include <linux/kfence.h>
>>   #include <linux/pkeys.h>
>>   
>> +#include <asm/asm-prototypes.h>
>>   #include <asm/firmware.h>
>>   #include <asm/interrupt.h>
>>   #include <asm/page.h>
>> @@ -620,4 +621,20 @@ DEFINE_INTERRUPT_HANDLER(do_bad_page_fault_segv)
>>   {
>>   	bad_page_fault(regs, SIGSEGV);
>>   }
>> +
>> +DEFINE_INTERRUPT_HANDLER(do_bad_segment_interrupt)
>> +{
>> +	int err = regs->result;
>> +
>> +	if (err == -EFAULT) {
>> +		if (user_mode(regs))
>> +			_exception(SIGSEGV, regs, SEGV_BNDERR, regs->dar);
>> +		else
>> +			bad_page_fault(regs, SIGSEGV);
>> +	} else if (err == -EINVAL) {
>> +		unrecoverable_exception(regs);
>> +	} else {
>> +		BUG();
>> +	}
>> +}
>>   #endif
>> 
> 
> You could do something more flat:
> 
> 	if (err == -EINVAL)
> 		unrecoverable_exception(regs);
> 
> 	BUG_ON(err != -EFAULT);
> 
> 	if (user_mode(regs))
> 		_exception(SIGSEGV, regs, SEGV_BNDERR, regs->dar);
> 	else
> 		bad_page_fault(regs, SIGSEGV);
> 
> I know you are just moving existing code but moving code is always an 
> opportunity to clean it without additional churn.
> 

Hmm, moving code I prefer not to make any changes. I don't know if it's 
that big an improvement to make the change here.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v1 09/11] powerpc/64s: Make hash MMU code build configurable
From: Nicholas Piggin @ 2021-10-20  5:20 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev
In-Reply-To: <eb342e34-8a97-a7f1-ee56-c8874b1bcd85@csgroup.eu>

Excerpts from Christophe Leroy's message of October 19, 2021 6:05 pm:
> 
> 
> Le 15/10/2021 à 17:46, Nicholas Piggin a écrit :
>> Introduce a new option CONFIG_PPC_64S_HASH_MMU which allows the 64s hash
>> MMU code to be compiled out if radix is selected and the minimum
>> supported CPU type is POWER9 or higher, and KVM is not selected.
>> 
>> This saves 128kB kernel image size (90kB text) on powernv_defconfig
>> minus KVM, 350kB on pseries_defconfig minus KVM, 40kB on a tiny config.
> 
> This patch is huge, it could be split in several smaller patches ?
> 
> I'm sure at least the Kconfig stuff can be do as a second step. In first 
> step just make CONFIG_PPC_64S_HASH_MMU always y.

I can do that.

> 
> I'm wondering if we could also reduce the amount of #ifdefs in C files, 
> by using IS_ENABLED() and/or stubs defined in H files.

I didn't see a lot of low hanging things there. A lot of it is struct
members and globals and things. E.g., something like this -

@@ -175,7 +181,9 @@ static int radix__init_new_context(struct mm_struct *mm)
         */
        asm volatile("ptesync;isync" : : : "memory");
 
+#ifdef CONFIG_PPC_64S_HASH_MMU
        mm->context.hash_context = NULL;
+#endif
 
        return index;
 }

In theory we could add a mm_set_hash_context(mm, NULL) function to do 
this for us, but if it is only required in this one bit of radix init
code then I would say the new function actually adds a reading burden
on all the rest of the code that uses it (or if we only use it in
this one place then it's pretty pointless).

> 
>> 
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>>   arch/powerpc/Kconfig                          |  1 +
>>   arch/powerpc/include/asm/book3s/64/mmu.h      | 22 ++++++++++++++++++-
>>   .../include/asm/book3s/64/tlbflush-hash.h     |  7 ++++++
>>   arch/powerpc/include/asm/book3s/pgtable.h     |  4 ++++
>>   arch/powerpc/include/asm/mmu.h                | 14 +++++++++---
>>   arch/powerpc/include/asm/mmu_context.h        |  2 ++
>>   arch/powerpc/include/asm/paca.h               |  8 +++++++
>>   arch/powerpc/kernel/asm-offsets.c             |  2 ++
>>   arch/powerpc/kernel/dt_cpu_ftrs.c             |  8 ++++++-
>>   arch/powerpc/kernel/entry_64.S                |  4 ++--
>>   arch/powerpc/kernel/exceptions-64s.S          | 16 ++++++++++++++
>>   arch/powerpc/kernel/mce.c                     |  2 +-
>>   arch/powerpc/kernel/mce_power.c               | 10 ++++++---
>>   arch/powerpc/kernel/paca.c                    | 18 ++++++---------
>>   arch/powerpc/kernel/process.c                 | 13 ++++++-----
>>   arch/powerpc/kernel/prom.c                    |  2 ++
>>   arch/powerpc/kernel/setup_64.c                |  4 ++++
>>   arch/powerpc/kexec/core_64.c                  |  4 ++--
>>   arch/powerpc/kexec/ranges.c                   |  4 ++++
>>   arch/powerpc/kvm/Kconfig                      |  1 +
>>   arch/powerpc/mm/book3s64/Makefile             | 17 ++++++++------
>>   arch/powerpc/mm/book3s64/hash_utils.c         | 10 ---------
>>   .../{hash_hugetlbpage.c => hugetlbpage.c}     |  6 +++++
>>   arch/powerpc/mm/book3s64/mmu_context.c        | 16 ++++++++++++++
>>   arch/powerpc/mm/book3s64/pgtable.c            | 12 ++++++++++
>>   arch/powerpc/mm/book3s64/radix_pgtable.c      |  4 ++++
>>   arch/powerpc/mm/copro_fault.c                 |  2 ++
>>   arch/powerpc/mm/pgtable.c                     | 10 ++++++---
>>   arch/powerpc/platforms/Kconfig.cputype        | 21 +++++++++++++++++-
>>   arch/powerpc/platforms/cell/Kconfig           |  1 +
>>   arch/powerpc/platforms/maple/Kconfig          |  1 +
>>   arch/powerpc/platforms/microwatt/Kconfig      |  2 +-
>>   arch/powerpc/platforms/pasemi/Kconfig         |  1 +
>>   arch/powerpc/platforms/powermac/Kconfig       |  1 +
>>   arch/powerpc/platforms/powernv/Kconfig        |  2 +-
>>   arch/powerpc/platforms/powernv/idle.c         |  2 ++
>>   arch/powerpc/platforms/powernv/setup.c        |  2 ++
>>   arch/powerpc/platforms/pseries/lpar.c         | 11 ++++++++--
>>   arch/powerpc/platforms/pseries/lparcfg.c      |  2 +-
>>   arch/powerpc/platforms/pseries/mobility.c     |  6 +++++
>>   arch/powerpc/platforms/pseries/ras.c          |  2 ++
>>   arch/powerpc/platforms/pseries/reconfig.c     |  2 ++
>>   arch/powerpc/platforms/pseries/setup.c        |  6 +++--
>>   arch/powerpc/xmon/xmon.c                      |  8 +++++--
>>   44 files changed, 233 insertions(+), 60 deletions(-)
>>   rename arch/powerpc/mm/book3s64/{hash_hugetlbpage.c => hugetlbpage.c} (95%)
>> 
> 
>> diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
>> index 8abe8e42e045..0f89fcab834d 100644
>> --- a/arch/powerpc/include/asm/mmu.h
>> +++ b/arch/powerpc/include/asm/mmu.h
>> @@ -157,7 +157,7 @@ DECLARE_PER_CPU(int, next_tlbcam_idx);
>>   
>>   enum {
>>   	MMU_FTRS_POSSIBLE =
>> -#if defined(CONFIG_PPC_BOOK3S_64) || defined(CONFIG_PPC_BOOK3S_604)
>> +#if defined(CONFIG_PPC_BOOK3S_604)
>>   		MMU_FTR_HPTE_TABLE |
>>   #endif
>>   #ifdef CONFIG_PPC_8xx
>> @@ -184,15 +184,18 @@ enum {
>>   		MMU_FTR_USE_TLBRSRV | MMU_FTR_USE_PAIRED_MAS |
>>   #endif
>>   #ifdef CONFIG_PPC_BOOK3S_64
>> +		MMU_FTR_KERNEL_RO |
>> +#ifdef CONFIG_PPC_64S_HASH_MMU
>>   		MMU_FTR_NO_SLBIE_B | MMU_FTR_16M_PAGE | MMU_FTR_TLBIEL |
>>   		MMU_FTR_LOCKLESS_TLBIE | MMU_FTR_CI_LARGE_PAGE |
>>   		MMU_FTR_1T_SEGMENT | MMU_FTR_TLBIE_CROP_VA |
>> -		MMU_FTR_KERNEL_RO | MMU_FTR_68_BIT_VA |
>> +		MMU_FTR_68_BIT_VA | MMU_FTR_HPTE_TABLE |
>>   #endif
>>   #ifdef CONFIG_PPC_RADIX_MMU
>>   		MMU_FTR_TYPE_RADIX |
>>   		MMU_FTR_GTSE |
>>   #endif /* CONFIG_PPC_RADIX_MMU */
>> +#endif
>>   #ifdef CONFIG_PPC_KUAP
>>   	MMU_FTR_BOOK3S_KUAP |
>>   #endif /* CONFIG_PPC_KUAP */
>> @@ -223,6 +226,11 @@ enum {
>>   #ifdef CONFIG_E500
>>   #define MMU_FTRS_ALWAYS		MMU_FTR_TYPE_FSL_E
>>   #endif
>> +#ifdef CONFIG_PPC_BOOK3S_64
>> +#if defined(CONFIG_PPC_RADIX_MMU) && !defined(CONFIG_PPC_64S_HASH_MMU)
>> +#define MMU_FTRS_ALWAYS		MMU_FTR_TYPE_RADIX
>> +#endif
>> +#endif
> 
> Should you also set MMU_FTR_HPTE_TABLE in MMU_FTRS_ALWAYS when HAS_MMU 
> && !RADIX ?

Yeah, good point.

>>   
>>   #ifndef MMU_FTRS_ALWAYS
>>   #define MMU_FTRS_ALWAYS		0
>> @@ -329,7 +337,7 @@ static __always_inline bool radix_enabled(void)
>>   	return mmu_has_feature(MMU_FTR_TYPE_RADIX);
>>   }
>>   
>> -static inline bool early_radix_enabled(void)
>> +static __always_inline bool early_radix_enabled(void)
>>   {
>>   	return early_mmu_has_feature(MMU_FTR_TYPE_RADIX);
>>   }
> 
>> diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
>> index c10fc8a72fb3..642cabc25e99 100644
>> --- a/arch/powerpc/mm/book3s64/mmu_context.c
>> +++ b/arch/powerpc/mm/book3s64/mmu_context.c
>> @@ -31,6 +31,7 @@ static int alloc_context_id(int min_id, int max_id)
>>   	return ida_alloc_range(&mmu_context_ida, min_id, max_id, GFP_KERNEL);
>>   }
>>   
>> +#ifdef CONFIG_PPC_64S_HASH_MMU
>>   void hash__reserve_context_id(int id)
>>   {
>>   	int result = ida_alloc_range(&mmu_context_ida, id, id, GFP_KERNEL);
>> @@ -50,7 +51,9 @@ int hash__alloc_context_id(void)
>>   	return alloc_context_id(MIN_USER_CONTEXT, max);
>>   }
>>   EXPORT_SYMBOL_GPL(hash__alloc_context_id);
>> +#endif
>>   
>> +#ifdef CONFIG_PPC_64S_HASH_MMU
>>   static int realloc_context_ids(mm_context_t *ctx)
>>   {
>>   	int i, id;
>> @@ -144,12 +147,15 @@ static int hash__init_new_context(struct mm_struct *mm)
>>   	return index;
>>   }
>>   
>> +void slb_setup_new_exec(void);
>> +
>>   void hash__setup_new_exec(void)
>>   {
>>   	slice_setup_new_exec();
>>   
>>   	slb_setup_new_exec();
>>   }
>> +#endif
>>   
>>   static int radix__init_new_context(struct mm_struct *mm)
>>   {
>> @@ -175,7 +181,9 @@ static int radix__init_new_context(struct mm_struct *mm)
>>   	 */
>>   	asm volatile("ptesync;isync" : : : "memory");
>>   
>> +#ifdef CONFIG_PPC_64S_HASH_MMU
>>   	mm->context.hash_context = NULL;
>> +#endif
>>   
>>   	return index;
>>   }
>> @@ -186,8 +194,10 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
>>   
>>   	if (radix_enabled())
>>   		index = radix__init_new_context(mm);
>> +#ifdef CONFIG_PPC_64S_HASH_MMU
>>   	else
>>   		index = hash__init_new_context(mm);
>> +#endif
> 
> I really dislike #ifdef nested in if/else.
> 
> Can you do something like
> 
> 	if (radix_enabled()
> 		index = radix__init_new_context(mm);
> 	else if (mmu_has_feature(MMU_FTR_HPTE_TABLE))
> 		index = hash__init_new_context(mm);

If radix_enabled() constant folds properly then even the 2nd if AFAIKS 
should not be required. Maybe it does now after some of your patches.

I'll check.

> 
> 
>>   
>>   	if (index < 0)
>>   		return index;
>> @@ -211,6 +221,7 @@ void __destroy_context(int context_id)
>>   }
>>   EXPORT_SYMBOL_GPL(__destroy_context);
>>   
>> +#ifdef CONFIG_PPC_64S_HASH_MMU
>>   static void destroy_contexts(mm_context_t *ctx)
>>   {
>>   	int index, context_id;
>> @@ -222,6 +233,7 @@ static void destroy_contexts(mm_context_t *ctx)
>>   	}
>>   	kfree(ctx->hash_context);
>>   }
>> +#endif
>>   
>>   static void pmd_frag_destroy(void *pmd_frag)
>>   {
>> @@ -274,7 +286,11 @@ void destroy_context(struct mm_struct *mm)
>>   		process_tb[mm->context.id].prtb0 = 0;
>>   	else
>>   		subpage_prot_free(mm);
>> +#ifdef CONFIG_PPC_64S_HASH_MMU
>>   	destroy_contexts(&mm->context);
>> +#else
>> +	ida_free(&mmu_context_ida, mm->context.id);
> 
> Is that correct ? Was it done somewhere else before ?

Yeah in destroy_contexts. hash has a extended_id union member that 
covers id.

I could just move this into destroy_contexts though, at least remove the 
ifdef here.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v3 10/52] KVM: PPC: Book3S HV: Don't always save PMU for guest capable of nesting
From: Nicholas Piggin @ 2021-10-20  5:26 UTC (permalink / raw)
  To: Fabiano Rosas, kvm-ppc, linuxppc-dev
In-Reply-To: <87zgr9w3dp.fsf@linux.ibm.com>

Excerpts from Fabiano Rosas's message of October 16, 2021 10:38 pm:
> Nicholas Piggin <npiggin@gmail.com> writes:
> 
>> Provide a config option that controls the workaround added by commit
>> 63279eeb7f93 ("KVM: PPC: Book3S HV: Always save guest pmu for guest
>> capable of nesting"). The option defaults to y for now, but is expected
>> to go away within a few releases.
>>
>> Nested capable guests running with the earlier commit ("KVM: PPC: Book3S
>> HV Nested: Indicate guest PMU in-use in VPA") will now indicate the PMU
> 
> I think the commit reference is now: 178266389794 (KVM: PPC: Book3S HV
> Nested: Reflect guest PMU in-use to L0 when guest SPRs are live)

Ah yes good point, would be good to that in the changelog. I guess we
can add a References: tag as well, just in case anybody mines this 
stuff.

> 
>> in-use status of their guests, which means the parent does not need to
>> unconditionally save the PMU for nested capable guests.
>>
>> After this latest round of performance optimisations, this option costs
>> about 540 cycles or 10% entry/exit performance on a POWER9 nested-capable
>> guest.
>>
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>>  arch/powerpc/kvm/Kconfig     | 15 +++++++++++++++
>>  arch/powerpc/kvm/book3s_hv.c | 10 ++++++++--
>>  2 files changed, 23 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>> index ff581d70f20c..1e7aae522be8 100644
>> --- a/arch/powerpc/kvm/Kconfig
>> +++ b/arch/powerpc/kvm/Kconfig
>> @@ -130,6 +130,21 @@ config KVM_BOOK3S_HV_EXIT_TIMING
>>
>>  	  If unsure, say N.
>>
>> +config KVM_BOOK3S_HV_NESTED_PMU_WORKAROUND
>> +	bool "Nested L0 host workaround for L1 KVM host PMU handling bug" if EXPERT
>> +	depends on KVM_BOOK3S_HV_POSSIBLE
>> +	default !EXPERT
>> +	help
>> +	  Old nested HV capable Linux guests have a bug where the don't
> 
> s/the/they/
> 
> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
> 

Good catch.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v3 19/52] KVM: PPC: Book3S HV P9: Reduce mtmsrd instructions required to save host SPRs
From: Nicholas Piggin @ 2021-10-20  5:35 UTC (permalink / raw)
  To: Fabiano Rosas, kvm-ppc, linuxppc-dev
In-Reply-To: <87r1clw0a7.fsf@linux.ibm.com>

Excerpts from Fabiano Rosas's message of October 16, 2021 11:45 pm:
> Nicholas Piggin <npiggin@gmail.com> writes:
> 
>> This reduces the number of mtmsrd required to enable facility bits when
>> saving/restoring registers, by having the KVM code set all bits up front
>> rather than using individual facility functions that set their particular
>> MSR bits.
>>
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> 
> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
> 
> Aside: at msr_check_and_set what's with MSR_VSX always being implicitly
> set whenever MSR_FP is set? I get that it depends on MSR_FP, but if FP
> always implies VSX, then you could stop setting MSR_VSX in this patch.

Good question, this seems to come from quite old code and is carried
forward. I did not immediately see why, might have been to avoid
another mtmsrd operation if we later want to set VSX.

But the rule seems to be to set MSR_VSX if both FP and VEC are set, so
this seems a bit odd. __msr_check_and_clear similarly clears VSX if we 
clear FP, but not if we clear VEC.

I might be good to remove that logic or turn it into warnings and make 
sure the callers do the right thing. Not sure.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v2] KVM: PPC: Defer vtime accounting 'til after IRQ handling
From: Laurent Vivier @ 2021-10-20  6:29 UTC (permalink / raw)
  To: Nicholas Piggin, kvm-ppc, Michael Ellerman
  Cc: Greg Kurz, stable, linux-kernel, linuxppc-dev
In-Reply-To: <1634263564.zfj0ajf8eh.astroid@bobo.none>

On 15/10/2021 04:23, Nicholas Piggin wrote:
> Excerpts from Laurent Vivier's message of October 13, 2021 7:30 pm:
>> On 13/10/2021 01:18, Michael Ellerman wrote:
>>> Laurent Vivier <lvivier@redhat.com> writes:
>>>> Commit 112665286d08 moved guest_exit() in the interrupt protected
>>>> area to avoid wrong context warning (or worse), but the tick counter
>>>> cannot be updated and the guest time is accounted to the system time.
>>>>
>>>> To fix the problem port to POWER the x86 fix
>>>> 160457140187 ("Defer vtime accounting 'til after IRQ handling"):
>>>>
>>>> "Defer the call to account guest time until after servicing any IRQ(s)
>>>>    that happened in the guest or immediately after VM-Exit.  Tick-based
>>>>    accounting of vCPU time relies on PF_VCPU being set when the tick IRQ
>>>>    handler runs, and IRQs are blocked throughout the main sequence of
>>>>    vcpu_enter_guest(), including the call into vendor code to actually
>>>>    enter and exit the guest."
>>>>
>>>> Fixes: 112665286d08 ("KVM: PPC: Book3S HV: Context tracking exit guest context before enabling irqs")
>>>> Cc: npiggin@gmail.com
>>>> Cc: <stable@vger.kernel.org> # 5.12
>>>> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
>>>> ---
>>>>
>>>> Notes:
>>>>       v2: remove reference to commit 61bd0f66ff92
>>>>           cc stable 5.12
>>>>           add the same comment in the code as for x86
>>>>
>>>>    arch/powerpc/kvm/book3s_hv.c | 24 ++++++++++++++++++++----
>>>>    1 file changed, 20 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>>>> index 2acb1c96cfaf..a694d1a8f6ce 100644
>>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>> ...
>>>> @@ -4506,13 +4514,21 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
>>>>    
>>>>    	srcu_read_unlock(&kvm->srcu, srcu_idx);
>>>>    
>>>> +	context_tracking_guest_exit();
>>>> +
>>>>    	set_irq_happened(trap);
>>>>    
>>>>    	kvmppc_set_host_core(pcpu);
>>>>    
>>>> -	guest_exit_irqoff();
>>>> -
>>>>    	local_irq_enable();
>>>> +	/*
>>>> +	 * Wait until after servicing IRQs to account guest time so that any
>>>> +	 * ticks that occurred while running the guest are properly accounted
>>>> +	 * to the guest.  Waiting until IRQs are enabled degrades the accuracy
>>>> +	 * of accounting via context tracking, but the loss of accuracy is
>>>> +	 * acceptable for all known use cases.
>>>> +	 */
>>>> +	vtime_account_guest_exit();
>>>
>>> This pops a warning for me, running guest(s) on Power8:
>>>    
>>>     [  270.745303][T16661] ------------[ cut here ]------------
>>>     [  270.745374][T16661] WARNING: CPU: 72 PID: 16661 at arch/powerpc/kernel/time.c:311 vtime_account_kernel+0xe0/0xf0
>>
>> Thank you, I missed that...
>>
>> My patch is wrong, I have to add vtime_account_guest_exit() before the local_irq_enable().
> 
> I thought so because if we take an interrupt after exiting the guest that
> should be accounted to kernel not guest.
> 
>>
>> arch/powerpc/kernel/time.c
>>
>>    305 static unsigned long vtime_delta(struct cpu_accounting_data *acct,
>>    306                                  unsigned long *stime_scaled,
>>    307                                  unsigned long *steal_time)
>>    308 {
>>    309         unsigned long now, stime;
>>    310
>>    311         WARN_ON_ONCE(!irqs_disabled());
>> ...
>>
>> But I don't understand how ticks can be accounted now if irqs are still disabled.
>>
>> Not sure it is as simple as expected...
> 
> I don't know all the timer stuff too well. The
> !CONFIG_VIRT_CPU_ACCOUNTING case is relying on PF_VCPU to be set when
> the host timer interrupt runs irqtime_account_process_tick runs so it
> can accumulate that tick to the guest?
> 
> That probably makes sense then, but it seems like we need that in a
> different place. Timer interrupts are not guaranteed to be the first one
> to occur when interrupts are enabled.
> 
> Maybe a new tick_account_guest_exit() and move PF_VCPU clearing to that
> for tick based accounting. Call it after local_irq_enable and call the
> vtime accounting before it. Would that work?

Hi Nick,

I think I will not have the time to have a look to fix that?

Could you try?

Thanks,
Laurent


^ permalink raw reply

* Re: [PATCH v11 2/3] tty: hvc: pass DMA capable memory to put_chars()
From: Xianting Tian @ 2021-10-20  8:47 UTC (permalink / raw)
  To: gregkh, jirislaby, amit, arnd, osandov
  Cc: shile.zhang, linuxppc-dev, linux-kernel, virtualization
In-Reply-To: <20211015024658.1353987-3-xianting.tian@linux.alibaba.com>

hi Greg，

Could I get  your comments of this new version patches？ thanks

在 2021/10/15 上午10:46, Xianting Tian 写道:
> As well known, hvc backend can register its opertions to hvc backend.
> the operations contain put_chars(), get_chars() and so on.
>
> Some hvc backend may do dma in its operations. eg, put_chars() of
> virtio-console. But in the code of hvc framework, it may pass DMA
> incapable memory to put_chars() under a specific configuration, which
> is explained in commit c4baad5029(virtio-console: avoid DMA from stack):
> 1, c[] is on stack,
>     hvc_console_print():
>          char c[N_OUTBUF] __ALIGNED__;
>          cons_ops[index]->put_chars(vtermnos[index], c, i);
> 2, ch is on stack,
>     static void hvc_poll_put_char(,,char ch)
>     {
>          struct tty_struct *tty = driver->ttys[0];
>          struct hvc_struct *hp = tty->driver_data;
>          int n;
>
>          do {
>                  n = hp->ops->put_chars(hp->vtermno, &ch, 1);
>          } while (n <= 0);
>     }
>
> Commit c4baad5029 is just the fix to avoid DMA from stack memory, which
> is passed to virtio-console by hvc framework in above code. But I think
> the fix is aggressive, it directly uses kmemdup() to alloc new buffer
> from kmalloc area and do memcpy no matter the memory is in kmalloc area
> or not. But most importantly, it should better be fixed in the hvc
> framework, by changing it to never pass stack memory to the put_chars()
> function in the first place. Otherwise, we still face the same issue if
> a new hvc backend using dma added in the furture.
>
> In this patch, add 'char cons_outbuf[]' as part of 'struct hvc_struct',
> so hp->cons_outbuf is no longer the stack memory, we can use it in above
> cases safely. We also add lock to protect cons_outbuf instead of using
> the global lock of hvc.
>
> Introduce another array(cons_hvcs[]) for hvc pointers next to the
> cons_ops[] and vtermnos[] arrays. With the array, we can easily find
> hvc's cons_outbuf and its lock.
>
> With the patch, we can revert the fix c4baad5029.
>
> Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
> ---
>   drivers/tty/hvc/hvc_console.c | 36 ++++++++++++++++++++---------------
>   drivers/tty/hvc/hvc_console.h | 21 +++++++++++++++++++-
>   2 files changed, 41 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/tty/hvc/hvc_console.c b/drivers/tty/hvc/hvc_console.c
> index 5957ab728..11f2463a1 100644
> --- a/drivers/tty/hvc/hvc_console.c
> +++ b/drivers/tty/hvc/hvc_console.c
> @@ -41,16 +41,6 @@
>    */
>   #define HVC_CLOSE_WAIT (HZ/100) /* 1/10 of a second */
>   
> -/*
> - * These sizes are most efficient for vio, because they are the
> - * native transfer size. We could make them selectable in the
> - * future to better deal with backends that want other buffer sizes.
> - */
> -#define N_OUTBUF	16
> -#define N_INBUF		16
> -
> -#define __ALIGNED__ __attribute__((__aligned__(L1_CACHE_BYTES)))
> -
>   static struct tty_driver *hvc_driver;
>   static struct task_struct *hvc_task;
>   
> @@ -142,6 +132,7 @@ static int hvc_flush(struct hvc_struct *hp)
>   static const struct hv_ops *cons_ops[MAX_NR_HVC_CONSOLES];
>   static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] =
>   	{[0 ... MAX_NR_HVC_CONSOLES - 1] = -1};
> +static struct hvc_struct *cons_hvcs[MAX_NR_HVC_CONSOLES];
>   
>   /*
>    * Console APIs, NOT TTY.  These APIs are available immediately when
> @@ -151,9 +142,11 @@ static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] =
>   static void hvc_console_print(struct console *co, const char *b,
>   			      unsigned count)
>   {
> -	char c[N_OUTBUF] __ALIGNED__;
> +	char *c;
>   	unsigned i = 0, n = 0;
>   	int r, donecr = 0, index = co->index;
> +	unsigned long flags;
> +	struct hvc_struct *hp;
>   
>   	/* Console access attempt outside of acceptable console range. */
>   	if (index >= MAX_NR_HVC_CONSOLES)
> @@ -163,6 +156,13 @@ static void hvc_console_print(struct console *co, const char *b,
>   	if (vtermnos[index] == -1)
>   		return;
>   
> +	hp = cons_hvcs[index];
> +	if (!hp)
> +		return;
> +
> +	c = hp->cons_outbuf;
> +
> +	spin_lock_irqsave(&hp->cons_outbuf_lock, flags);
>   	while (count > 0 || i > 0) {
>   		if (count > 0 && i < sizeof(c)) {
>   			if (b[n] == '\n' && !donecr) {
> @@ -191,6 +191,7 @@ static void hvc_console_print(struct console *co, const char *b,
>   			}
>   		}
>   	}
> +	spin_unlock_irqrestore(&hp->cons_outbuf_lock, flags);
>   	hvc_console_flush(cons_ops[index], vtermnos[index]);
>   }
>   
> @@ -878,9 +879,13 @@ static void hvc_poll_put_char(struct tty_driver *driver, int line, char ch)
>   	struct tty_struct *tty = driver->ttys[0];
>   	struct hvc_struct *hp = tty->driver_data;
>   	int n;
> +	unsigned long flags;
>   
>   	do {
> -		n = hp->ops->put_chars(hp->vtermno, &ch, 1);
> +		spin_lock_irqsave(&hp->cons_outbuf_lock, flags);
> +		hp->cons_outbuf[0] = ch;
> +		n = hp->ops->put_chars(hp->vtermno, &hp->cons_outbuf[0], 1);
> +		spin_unlock_irqrestore(&hp->cons_outbuf_lock, flags);
>   	} while (n <= 0);
>   }
>   #endif
> @@ -922,8 +927,7 @@ struct hvc_struct *hvc_alloc(uint32_t vtermno, int data,
>   			return ERR_PTR(err);
>   	}
>   
> -	hp = kzalloc(ALIGN(sizeof(*hp), sizeof(long)) + outbuf_size,
> -			GFP_KERNEL);
> +	hp = kzalloc(struct_size(hp, outbuf, outbuf_size), GFP_KERNEL);
>   	if (!hp)
>   		return ERR_PTR(-ENOMEM);
>   
> @@ -931,13 +935,13 @@ struct hvc_struct *hvc_alloc(uint32_t vtermno, int data,
>   	hp->data = data;
>   	hp->ops = ops;
>   	hp->outbuf_size = outbuf_size;
> -	hp->outbuf = &((char *)hp)[ALIGN(sizeof(*hp), sizeof(long))];
>   
>   	tty_port_init(&hp->port);
>   	hp->port.ops = &hvc_port_ops;
>   
>   	INIT_WORK(&hp->tty_resize, hvc_set_winsz);
>   	spin_lock_init(&hp->lock);
> +	spin_lock_init(&hp->cons_outbuf_lock);
>   	mutex_lock(&hvc_structs_mutex);
>   
>   	/*
> @@ -964,6 +968,7 @@ struct hvc_struct *hvc_alloc(uint32_t vtermno, int data,
>   	if (i < MAX_NR_HVC_CONSOLES) {
>   		cons_ops[i] = ops;
>   		vtermnos[i] = vtermno;
> +		cons_hvcs[i] = hp;
>   	}
>   
>   	list_add_tail(&(hp->next), &hvc_structs);
> @@ -988,6 +993,7 @@ int hvc_remove(struct hvc_struct *hp)
>   	if (hp->index < MAX_NR_HVC_CONSOLES) {
>   		vtermnos[hp->index] = -1;
>   		cons_ops[hp->index] = NULL;
> +		cons_hvcs[hp->index] = NULL;
>   	}
>   
>   	/* Don't whack hp->irq because tty_hangup() will need to free the irq. */
> diff --git a/drivers/tty/hvc/hvc_console.h b/drivers/tty/hvc/hvc_console.h
> index 18d005814..2c32ab67b 100644
> --- a/drivers/tty/hvc/hvc_console.h
> +++ b/drivers/tty/hvc/hvc_console.h
> @@ -32,12 +32,21 @@
>    */
>   #define HVC_ALLOC_TTY_ADAPTERS	8
>   
> +/*
> + * These sizes are most efficient for vio, because they are the
> + * native transfer size. We could make them selectable in the
> + * future to better deal with backends that want other buffer sizes.
> + */
> +#define N_OUTBUF	16
> +#define N_INBUF		16
> +
> +#define __ALIGNED__ __attribute__((__aligned__(L1_CACHE_BYTES)))
> +
>   struct hvc_struct {
>   	struct tty_port port;
>   	spinlock_t lock;
>   	int index;
>   	int do_wakeup;
> -	char *outbuf;
>   	int outbuf_size;
>   	int n_outbuf;
>   	uint32_t vtermno;
> @@ -48,6 +57,16 @@ struct hvc_struct {
>   	struct work_struct tty_resize;
>   	struct list_head next;
>   	unsigned long flags;
> +
> +	/*
> +	 * the buf and its lock are used in hvc console api for putting chars,
> +	 * and also used in hvc_poll_put_char() for putting single char.
> +	 */
> +	spinlock_t cons_outbuf_lock;
> +	char cons_outbuf[N_OUTBUF] __ALIGNED__;
> +
> +	/* the buf is used for putting chars to tty */
> +	char outbuf[] __ALIGNED__;
>   };
>   
>   /* implemented by a low level driver */

^ permalink raw reply

* Re: [PATCH v11 2/3] tty: hvc: pass DMA capable memory to put_chars()
From: Greg KH @ 2021-10-20  8:56 UTC (permalink / raw)
  To: Xianting Tian
  Cc: arnd, amit, jirislaby, shile.zhang, linux-kernel, virtualization,
	linuxppc-dev, osandov
In-Reply-To: <d56c2c23-3e99-417b-8144-cf1bb31b5f6d@linux.alibaba.com>

A: http://en.wikipedia.org/wiki/Top_post
Q: Were do I find info about this thing called top-posting?
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

A: No.
Q: Should I include quotations after my reply?

http://daringfireball.net/2007/07/on_top

On Wed, Oct 20, 2021 at 04:47:23PM +0800, Xianting Tian wrote:
> hi Greg，
> 
> Could I get  your comments of this new version patches？ thanks

It has been less than 5 days.  Please relax, and only worry after 2
weeks have gone by.  We have lots of patches to review.  To help
maintainers out, why don't you review other patches on the mailing lists
as well while you wait?

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v2] KVM: PPC: Defer vtime accounting 'til after IRQ handling
From: Nicholas Piggin @ 2021-10-20  9:35 UTC (permalink / raw)
  To: kvm-ppc, Laurent Vivier, Michael Ellerman
  Cc: Greg Kurz, stable, linux-kernel, linuxppc-dev
In-Reply-To: <2a13119c-ccec-1dd5-8cf6-da07a9d8fe6f@redhat.com>

Excerpts from Laurent Vivier's message of October 20, 2021 4:29 pm:
> On 15/10/2021 04:23, Nicholas Piggin wrote:
>> Excerpts from Laurent Vivier's message of October 13, 2021 7:30 pm:
>>> On 13/10/2021 01:18, Michael Ellerman wrote:
>>>> Laurent Vivier <lvivier@redhat.com> writes:
>>>>> Commit 112665286d08 moved guest_exit() in the interrupt protected
>>>>> area to avoid wrong context warning (or worse), but the tick counter
>>>>> cannot be updated and the guest time is accounted to the system time.
>>>>>
>>>>> To fix the problem port to POWER the x86 fix
>>>>> 160457140187 ("Defer vtime accounting 'til after IRQ handling"):
>>>>>
>>>>> "Defer the call to account guest time until after servicing any IRQ(s)
>>>>>    that happened in the guest or immediately after VM-Exit.  Tick-based
>>>>>    accounting of vCPU time relies on PF_VCPU being set when the tick IRQ
>>>>>    handler runs, and IRQs are blocked throughout the main sequence of
>>>>>    vcpu_enter_guest(), including the call into vendor code to actually
>>>>>    enter and exit the guest."
>>>>>
>>>>> Fixes: 112665286d08 ("KVM: PPC: Book3S HV: Context tracking exit guest context before enabling irqs")
>>>>> Cc: npiggin@gmail.com
>>>>> Cc: <stable@vger.kernel.org> # 5.12
>>>>> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
>>>>> ---
>>>>>
>>>>> Notes:
>>>>>       v2: remove reference to commit 61bd0f66ff92
>>>>>           cc stable 5.12
>>>>>           add the same comment in the code as for x86
>>>>>
>>>>>    arch/powerpc/kvm/book3s_hv.c | 24 ++++++++++++++++++++----
>>>>>    1 file changed, 20 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>>>>> index 2acb1c96cfaf..a694d1a8f6ce 100644
>>>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>>> ...
>>>>> @@ -4506,13 +4514,21 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
>>>>>    
>>>>>    	srcu_read_unlock(&kvm->srcu, srcu_idx);
>>>>>    
>>>>> +	context_tracking_guest_exit();
>>>>> +
>>>>>    	set_irq_happened(trap);
>>>>>    
>>>>>    	kvmppc_set_host_core(pcpu);
>>>>>    
>>>>> -	guest_exit_irqoff();
>>>>> -
>>>>>    	local_irq_enable();
>>>>> +	/*
>>>>> +	 * Wait until after servicing IRQs to account guest time so that any
>>>>> +	 * ticks that occurred while running the guest are properly accounted
>>>>> +	 * to the guest.  Waiting until IRQs are enabled degrades the accuracy
>>>>> +	 * of accounting via context tracking, but the loss of accuracy is
>>>>> +	 * acceptable for all known use cases.
>>>>> +	 */
>>>>> +	vtime_account_guest_exit();
>>>>
>>>> This pops a warning for me, running guest(s) on Power8:
>>>>    
>>>>     [  270.745303][T16661] ------------[ cut here ]------------
>>>>     [  270.745374][T16661] WARNING: CPU: 72 PID: 16661 at arch/powerpc/kernel/time.c:311 vtime_account_kernel+0xe0/0xf0
>>>
>>> Thank you, I missed that...
>>>
>>> My patch is wrong, I have to add vtime_account_guest_exit() before the local_irq_enable().
>> 
>> I thought so because if we take an interrupt after exiting the guest that
>> should be accounted to kernel not guest.
>> 
>>>
>>> arch/powerpc/kernel/time.c
>>>
>>>    305 static unsigned long vtime_delta(struct cpu_accounting_data *acct,
>>>    306                                  unsigned long *stime_scaled,
>>>    307                                  unsigned long *steal_time)
>>>    308 {
>>>    309         unsigned long now, stime;
>>>    310
>>>    311         WARN_ON_ONCE(!irqs_disabled());
>>> ...
>>>
>>> But I don't understand how ticks can be accounted now if irqs are still disabled.
>>>
>>> Not sure it is as simple as expected...
>> 
>> I don't know all the timer stuff too well. The
>> !CONFIG_VIRT_CPU_ACCOUNTING case is relying on PF_VCPU to be set when
>> the host timer interrupt runs irqtime_account_process_tick runs so it
>> can accumulate that tick to the guest?
>> 
>> That probably makes sense then, but it seems like we need that in a
>> different place. Timer interrupts are not guaranteed to be the first one
>> to occur when interrupts are enabled.
>> 
>> Maybe a new tick_account_guest_exit() and move PF_VCPU clearing to that
>> for tick based accounting. Call it after local_irq_enable and call the
>> vtime accounting before it. Would that work?
> 
> Hi Nick,
> 
> I think I will not have the time to have a look to fix that?
> 
> Could you try?

Hey Laurent, sure I can take a look at it.

Thanks,
Nick

^ permalink raw reply

* [PATCH] PCI/hotplug: Remove unneeded of_node_put() in pnv_php
From: Wan Jiabing @ 2021-10-20  9:46 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Bjorn Helgaas, linuxppc-dev, linux-pci, linux-kernel
  Cc: kael_w, Wan Jiabing

Fix following coccicheck warning:
./drivers/pci/hotplug/pnv_php.c:161:2-13: ERROR: probable double put.

Device node iterators put the previous value of the index variable, so
an explicit put causes a double put.

Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
---
 drivers/pci/hotplug/pnv_php.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
index f4c2e6e01be0..f3da4f95d73f 100644
--- a/drivers/pci/hotplug/pnv_php.c
+++ b/drivers/pci/hotplug/pnv_php.c
@@ -158,7 +158,6 @@ static void pnv_php_detach_device_nodes(struct device_node *parent)
 	for_each_child_of_node(parent, dn) {
 		pnv_php_detach_device_nodes(dn);
 
-		of_node_put(dn);
 		of_detach_node(dn);
 	}
 }
-- 
2.20.1


^ permalink raw reply related

* [PATCH] powerpc/idle: Don't corrupt back chain when going idle
From: Michael Ellerman @ 2021-10-20  9:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: kvm-ppc, npiggin

In isa206_idle_insn_mayloss() we store various registers into the stack
red zone, which is allowed.

However inside the IDLE_STATE_ENTER_SEQ_NORET macro we save r2 again,
to 0(r1), which corrupts the stack back chain.

We used to do the same in isa206_idle_insn_mayloss() itself, but we
fixed that in 73287caa9210 ("powerpc64/idle: Fix SP offsets when saving
GPRs"), however we missed that the macro also corrupts the back chain.

Corrupting the back chain is bad for debuggability but doesn't
necessarily cause a bug.

However we recently changed the stack handling in some KVM code, and it
now relies on the stack back chain being valid when it returns. The
corruption causes that code to return with r1 pointing somewhere in
kernel data, at some point LR is restored from the stack and we branch
to NULL or somewhere else invalid.

Only affects Power8 hosts running KVM guests, with dynamic_mt_modes
enabled (which it is by default).

The fixes tag below points to the commit that changed the KVM stack
handling, exposing this bug. The actual corruption of the back chain has
always existed since 948cf67c4726 ("powerpc: Add NAP mode support on
Power7 in HV mode").

Fixes: 9b4416c5095c ("KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest()")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/kernel/idle_book3s.S | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index abb719b21cae..3d97fb833834 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -126,14 +126,16 @@ _GLOBAL(idle_return_gpr_loss)
 /*
  * This is the sequence required to execute idle instructions, as
  * specified in ISA v2.07 (and earlier). MSR[IR] and MSR[DR] must be 0.
- *
- * The 0(r1) slot is used to save r2 in isa206, so use that here.
+ * We have to store a GPR somewhere, ptesync, then reload it, and create
+ * a false dependency on the result of the load. It doesn't matter which
+ * GPR we store, or where we store it. We have already stored r2 to the
+ * stack at -8(r1) in isa206_idle_insn_mayloss, so use that.
  */
 #define IDLE_STATE_ENTER_SEQ_NORET(IDLE_INST)			\
 	/* Magic NAP/SLEEP/WINKLE mode enter sequence */	\
-	std	r2,0(r1);					\
+	std	r2,-8(r1);					\
 	ptesync;						\
-	ld	r2,0(r1);					\
+	ld	r2,-8(r1);					\
 236:	cmpd	cr0,r2,r2;					\
 	bne	236b;						\
 	IDLE_INST;						\
-- 
2.31.1

^ permalink raw reply related

* Re: [PATCH] PCI/hotplug: Remove unneeded of_node_put() in pnv_php
From: Nathan Lynch @ 2021-10-20 11:39 UTC (permalink / raw)
  To: Wan Jiabing
  Cc: Tyrel Datwyler, Wan Jiabing, ajd, linux-pci, linux-kernel,
	Paul Mackerras, Bjorn Helgaas, kael_w, linuxppc-dev
In-Reply-To: <20211020094604.2106-1-wanjiabing@vivo.com>

Wan Jiabing <wanjiabing@vivo.com> writes:
> Fix following coccicheck warning:
> ./drivers/pci/hotplug/pnv_php.c:161:2-13: ERROR: probable double put.
>
> Device node iterators put the previous value of the index variable, so
> an explicit put causes a double put.

I suppose Coccinelle doesn't take into account that this code is
detaching and freeing the nodes.


> diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
> index f4c2e6e01be0..f3da4f95d73f 100644
> --- a/drivers/pci/hotplug/pnv_php.c
> +++ b/drivers/pci/hotplug/pnv_php.c
> @@ -158,7 +158,6 @@ static void pnv_php_detach_device_nodes(struct device_node *parent)
>  	for_each_child_of_node(parent, dn) {
>  		pnv_php_detach_device_nodes(dn);
>  
> -		of_node_put(dn);
>  		of_detach_node(dn);
>  	}

The code might be improved by comments explaining how the bare
of_node_put() corresponds to a "get" somewhere else in the driver, and
how it doesn't render the ongoing traversal unsafe. It looks suspicious
on first review, but I believe it's intentional and probably correct as
written.

^ permalink raw reply

* [PATCH kernel 1/4] powerpc/pseries/iommu: Fix indentations
From: Alexey Kardashevskiy @ 2021-10-20 13:23 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Frederic Barrat, Brian King, Alexey Kardashevskiy, Leonardo Bras
In-Reply-To: <20211020132315.2287178-1-aik@ozlabs.ru>

This fixes broken indentations. The first hunk might suggest that
the introducing patch was applied incorrectly but it is correct.

Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/pseries/iommu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 269f61d519c2..09d59088864a 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1404,8 +1404,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 			dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n",
 				 dn, ret);
 
-		/* Make sure to clean DDW if any TCE was set*/
-		clean_dma_window(pdn, win64->value);
+			/* Make sure to clean DDW if any TCE was set*/
+			clean_dma_window(pdn, win64->value);
 			goto out_del_list;
 		}
 	} else {
@@ -1490,7 +1490,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	if (pmem_present && ddw_enabled && direct_mapping && len == max_ram_len)
 		dev->dev.bus_dma_limit = dev->dev.archdata.dma_offset + (1ULL << len);
 
-    return ddw_enabled && direct_mapping;
+	return ddw_enabled && direct_mapping;
 }
 
 static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
-- 
2.30.2


^ permalink raw reply related

* [PATCH kernel 0/4] Fixes for powerpc/pseries/iommu: Make use of DDW for indirect mapping
From: Alexey Kardashevskiy @ 2021-10-20 13:23 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Frederic Barrat, Brian King, Alexey Kardashevskiy, Leonardo Bras

Found some issues on SRIOV enabled PHYP.
It probably should be one patch, or not?

Please comment. Thanks.



Alexey Kardashevskiy (4):
  powerpc/pseries/iommu: Fix indentations
  powerpc/pseries/iommu: Use correct vfree for it_map
  powerpc/pseries/iommu: Check if the default window in use before
    removing it
  powerpc/pseries/iommu: Create huge DMA window if no MMIO32 is present

 arch/powerpc/platforms/pseries/iommu.c | 33 +++++++++++++-------------
 1 file changed, 17 insertions(+), 16 deletions(-)

-- 
2.30.2


^ permalink raw reply

* [PATCH kernel 2/4] powerpc/pseries/iommu: Use correct vfree for it_map
From: Alexey Kardashevskiy @ 2021-10-20 13:23 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Frederic Barrat, Brian King, Alexey Kardashevskiy, Leonardo Bras
In-Reply-To: <20211020132315.2287178-1-aik@ozlabs.ru>

The it_map array is vzalloc'ed so use vfree() for it when creating
a huge DMA window failed for whatever reason.

While at this, write zero to it_map.

Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/pseries/iommu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 09d59088864a..45564547cd80 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1440,7 +1440,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 		/* Keep default DMA window stuct if removed */
 		if (default_win_removed) {
 			tbl->it_size = 0;
-			kfree(tbl->it_map);
+			vfree(tbl->it_map);
+			tbl->it_map = NULL;
 		}
 
 		set_iommu_table_base(&dev->dev, newtbl);
-- 
2.30.2


^ permalink raw reply related

* [PATCH kernel 3/4] powerpc/pseries/iommu: Check if the default window in use before removing it
From: Alexey Kardashevskiy @ 2021-10-20 13:23 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Frederic Barrat, Brian King, Alexey Kardashevskiy, Leonardo Bras
In-Reply-To: <20211020132315.2287178-1-aik@ozlabs.ru>

At the moment this check is performed after we remove the default window
which is late and disallows to revert whatever changes enable_ddw()
has made to DMA windows.

This moves the check and error exit before removing the window.

This raised the message severity from "debug" to "warning" as this
should not happen in practice and cannot be triggered by the userspace.

Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/pseries/iommu.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 45564547cd80..9bbcfdece9e3 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1302,6 +1302,12 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 		struct property *default_win;
 		int reset_win_ext;
 
+		/* DDW + IOMMU on single window may fail if there is any allocation */
+		if (iommu_table_in_use(tbl)) {
+			dev_warn(&dev->dev, "current IOMMU table in use, can't be replaced.\n");
+			goto out_failed;
+		}
+
 		default_win = of_find_property(pdn, "ibm,dma-window", NULL);
 		if (!default_win)
 			goto out_failed;
@@ -1356,12 +1362,6 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 			query.largest_available_block,
 			1ULL << page_shift);
 
-		/* DDW + IOMMU on single window may fail if there is any allocation */
-		if (default_win_removed && iommu_table_in_use(tbl)) {
-			dev_dbg(&dev->dev, "current IOMMU table in use, can't be replaced.\n");
-			goto out_failed;
-		}
-
 		len = order_base_2(query.largest_available_block << page_shift);
 		win_name = DMA64_PROPNAME;
 	} else {
-- 
2.30.2


^ permalink raw reply related

* [PATCH kernel 4/4] powerpc/pseries/iommu: Create huge DMA window if no MMIO32 is present
From: Alexey Kardashevskiy @ 2021-10-20 13:23 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Frederic Barrat, Brian King, Alexey Kardashevskiy, Leonardo Bras
In-Reply-To: <20211020132315.2287178-1-aik@ozlabs.ru>

The iommu_init_table() helper takes an address range to reserve in
the IOMMU table being initialized to exclude MMIO addresses, this is
useful if the window stretches far beyond 4GB (although wastes some TCEs).
At the moment the code searches for such MMIO32 range and fails if none
found which is considered a problem while it really is not: it is actually
better as this says there is no MMIO32 to reserve and we can use
usually wasted TCEs. Furthermore PHYP never actually allows creating
windows starting at busaddress=0 so this MMIO32 range is never useful.

This removes error exit and initializes the table with zero range if
no MMIO32 is detected.

Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/pseries/iommu.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 9bbcfdece9e3..258e98c10095 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1411,18 +1411,19 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	} else {
 		struct iommu_table *newtbl;
 		int i;
+		unsigned long start = 0, end = 0;
 
 		for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
 			const unsigned long mask = IORESOURCE_MEM_64 | IORESOURCE_MEM;
 
 			/* Look for MMIO32 */
-			if ((pci->phb->mem_resources[i].flags & mask) == IORESOURCE_MEM)
+			if ((pci->phb->mem_resources[i].flags & mask) == IORESOURCE_MEM) {
+				start = pci->phb->mem_resources[i].start;
+				end = pci->phb->mem_resources[i].end;
 				break;
+			}
 		}
 
-		if (i == ARRAY_SIZE(pci->phb->mem_resources))
-			goto out_del_list;
-
 		/* New table for using DDW instead of the default DMA window */
 		newtbl = iommu_pseries_alloc_table(pci->phb->node);
 		if (!newtbl) {
@@ -1432,8 +1433,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 
 		iommu_table_setparms_common(newtbl, pci->phb->bus->number, create.liobn, win_addr,
 					    1UL << len, page_shift, NULL, &iommu_table_lpar_multi_ops);
-		iommu_init_table(newtbl, pci->phb->node, pci->phb->mem_resources[i].start,
-				 pci->phb->mem_resources[i].end);
+		iommu_init_table(newtbl, pci->phb->node, start, end);
 
 		pci->table_group->tables[1] = newtbl;
 
-- 
2.30.2


^ permalink raw reply related

* Re: [PATCH v2][net-next] soc: fsl: dpio: Unsigned compared against 0 in qbman_swp_set_irq_coalescing()
From: patchwork-bot+netdevbpf @ 2021-10-20 13:30 UTC (permalink / raw)
  To: Tim Gardner
  Cc: netdev, Roy.Pledge, linux-kernel, leoyang.li, ioana.ciornei,
	linuxppc-dev, linux-arm-kernel
In-Reply-To: <20211019121925.8910-1-tim.gardner@canonical.com>

Hello:

This patch was applied to netdev/net-next.git (master)
by David S. Miller <davem@davemloft.net>:

On Tue, 19 Oct 2021 06:19:25 -0600 you wrote:
> Coverity complains of unsigned compare against 0. There are 2 cases in
> this function:
> 
> 1821        itp = (irq_holdoff * 1000) / p->desc->qman_256_cycles_per_ns;
> 
> CID 121131 (#1 of 1): Unsigned compared against 0 (NO_EFFECT)
> unsigned_compare: This less-than-zero comparison of an unsigned value is never true. itp < 0U.
> 1822        if (itp < 0 || itp > 4096) {
> 1823                max_holdoff = (p->desc->qman_256_cycles_per_ns * 4096) / 1000;
> 1824                pr_err("irq_holdoff must be between 0..%dus\n", max_holdoff);
> 1825                return -EINVAL;
> 1826        }
> 1827
>     	unsigned_compare: This less-than-zero comparison of an unsigned value is never true. irq_threshold < 0U.
> 1828        if (irq_threshold >= p->dqrr.dqrr_size || irq_threshold < 0) {
> 1829                pr_err("irq_threshold must be between 0..%d\n",
> 1830                       p->dqrr.dqrr_size - 1);
> 1831                return -EINVAL;
> 1832        }
> 
> [...]

Here is the summary with links:
  - [v2,net-next] soc: fsl: dpio: Unsigned compared against 0 in qbman_swp_set_irq_coalescing()
    https://git.kernel.org/netdev/net-next/c/818a76a55d6e

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox