LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [RFC 09/11] pci/hotplug/pnv-php: Relax check when disabling slot
From: Frederic Barrat @ 2019-06-19 13:28 UTC (permalink / raw)
  To: linuxppc-dev, andrew.donnellan, clombard
  Cc: aik, arbab, oohall, groug, alastair
In-Reply-To: <20190619132840.27634-1-fbarrat@linux.ibm.com>

The driver only allows to disable a slot in the POPULATED
state. However, if an error occurs while enabling the slot, say
because the link couldn't be trained, then the POPULATED state may not
be reached, yet the power state of the slot is on. So allow to disable
a slot in the REGISTERED state. Removing the devices will do nothing
since it's not populated, and we'll set the power state of the slot
back to off.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 drivers/pci/hotplug/pnv_php.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
index f9c624334ef7..74b62a8e11e7 100644
--- a/drivers/pci/hotplug/pnv_php.c
+++ b/drivers/pci/hotplug/pnv_php.c
@@ -523,7 +523,13 @@ static int pnv_php_disable_slot(struct hotplug_slot *slot)
 	struct pnv_php_slot *php_slot = to_pnv_php_slot(slot);
 	int ret;
 
-	if (php_slot->state != PNV_PHP_STATE_POPULATED)
+	/*
+	 * Allow to disable a slot already in the registered state to
+	 * cover cases where the slot couldn't be enabled and never
+	 * reached the populated state
+	 */
+	if (php_slot->state != PNV_PHP_STATE_POPULATED &&
+		php_slot->state != PNV_PHP_STATE_REGISTERED)
 		return 0;
 
 	/* Remove all devices behind the slot */
-- 
2.21.0


^ permalink raw reply related

* [RFC 05/11] powerpc/powernv/ioda: Find opencapi slot for a device node
From: Frederic Barrat @ 2019-06-19 13:28 UTC (permalink / raw)
  To: linuxppc-dev, andrew.donnellan, clombard
  Cc: aik, arbab, oohall, groug, alastair
In-Reply-To: <20190619132840.27634-1-fbarrat@linux.ibm.com>

Unlike real PCI slots, opencapi slots are directly associated to
the (virtual) opencapi PHB, there's no intermediate bridge. So when
looking for a slot ID, we must start the search from the device node
itself and not its parent.

Also, the slot ID is not attached to a specific bdfn, so let's build
it from the PHB ID, like skiboot.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 arch/powerpc/include/asm/pnv-pci.h   |  1 +
 arch/powerpc/platforms/powernv/pci.c | 10 +++++++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pnv-pci.h b/arch/powerpc/include/asm/pnv-pci.h
index b5a85f1bb305..4b4dfa6bfdd3 100644
--- a/arch/powerpc/include/asm/pnv-pci.h
+++ b/arch/powerpc/include/asm/pnv-pci.h
@@ -15,6 +15,7 @@
 #define PCI_SLOT_ID_PREFIX	(1UL << 63)
 #define PCI_SLOT_ID(phb_id, bdfn)	\
 	(PCI_SLOT_ID_PREFIX | ((uint64_t)(bdfn) << 16) | (phb_id))
+#define PCI_PHB_SLOT_ID(phb_id)		(phb_id)
 
 extern int pnv_pci_get_slot_id(struct device_node *np, uint64_t *id);
 extern int pnv_pci_get_device_tree(uint32_t phandle, void *buf, uint64_t len);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index ff1a33fee8e6..3e4e75a883e1 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -49,13 +49,14 @@ int pnv_pci_get_slot_id(struct device_node *np, uint64_t *id)
 		return -ENXIO;
 
 	bdfn = ((bdfn & 0x00ffff00) >> 8);
-	while ((parent = of_get_parent(parent))) {
+	for (parent = np; parent; parent = of_get_parent(parent)) {
 		if (!PCI_DN(parent)) {
 			of_node_put(parent);
 			break;
 		}
 
-		if (!of_device_is_compatible(parent, "ibm,ioda2-phb")) {
+		if (!of_device_is_compatible(parent, "ibm,ioda2-phb") &&
+		    !of_device_is_compatible(parent, "ibm,ioda2-npu2-opencapi-phb")) {
 			of_node_put(parent);
 			continue;
 		}
@@ -66,7 +67,10 @@ int pnv_pci_get_slot_id(struct device_node *np, uint64_t *id)
 			return -ENXIO;
 		}
 
-		*id = PCI_SLOT_ID(phbid, bdfn);
+		if (of_device_is_compatible(parent, "ibm,ioda2-npu2-opencapi-phb"))
+			*id = PCI_PHB_SLOT_ID(phbid);
+		else
+			*id = PCI_SLOT_ID(phbid, bdfn);
 		return 0;
 	}
 
-- 
2.21.0


^ permalink raw reply related

* [RFC 02/11] powerpc/powernv/ioda: Protect PE list
From: Frederic Barrat @ 2019-06-19 13:28 UTC (permalink / raw)
  To: linuxppc-dev, andrew.donnellan, clombard
  Cc: aik, arbab, oohall, groug, alastair
In-Reply-To: <20190619132840.27634-1-fbarrat@linux.ibm.com>

Protect the PHB's list of PE. Probably not needed as long as it was
populated during PHB creation, but it feels right and will become
required once we can add/remove opencapi devices on hotplug.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 3082912e2600..2c063b05bb64 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1078,8 +1078,9 @@ static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
 	}
 
 	/* Put PE to the list */
+	mutex_lock(&phb->ioda.pe_list_mutex);
 	list_add_tail(&pe->list, &phb->ioda.pe_list);
-
+	mutex_unlock(&phb->ioda.pe_list_mutex);
 	return pe;
 }
 
@@ -3501,7 +3502,10 @@ static void pnv_ioda_release_pe(struct pnv_ioda_pe *pe)
 	struct pnv_phb *phb = pe->phb;
 	struct pnv_ioda_pe *slave, *tmp;
 
+	mutex_lock(&phb->ioda.pe_list_mutex);
 	list_del(&pe->list);
+	mutex_unlock(&phb->ioda.pe_list_mutex);
+
 	switch (phb->type) {
 	case PNV_PHB_IODA1:
 		pnv_pci_ioda1_release_pe_dma(pe);
-- 
2.21.0


^ permalink raw reply related

* [RFC 03/11] powerpc/powernv/ioda: set up PE on opencapi device when enabling
From: Frederic Barrat @ 2019-06-19 13:28 UTC (permalink / raw)
  To: linuxppc-dev, andrew.donnellan, clombard
  Cc: aik, arbab, oohall, groug, alastair
In-Reply-To: <20190619132840.27634-1-fbarrat@linux.ibm.com>

The PE for an opencapi device was set as part of a late PHB fixup
operation, when creating the PHB. To use the PCI hotplug framework,
this is not going to work, as the PHB stays the same, it's only the
devices underneath which are updated. For regular PCI devices, it is
done as part of the reconfiguration of the bridge, but for opencapi
PHBs, we don't have an intermediate bridge. So let's define the PE
when the device is enabled. PEs are meaningless for opencapi, the NPU
doesn't define them and opal is not doing anything with them.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 31 +++++++++++++++++------
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 2c063b05bb64..2cf06fb98978 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1258,8 +1258,6 @@ static void pnv_pci_ioda_setup_PEs(void)
 {
 	struct pci_controller *hose;
 	struct pnv_phb *phb;
-	struct pci_bus *bus;
-	struct pci_dev *pdev;
 	struct pnv_ioda_pe *pe;
 
 	list_for_each_entry(hose, &hose_list, list_node) {
@@ -1271,11 +1269,6 @@ static void pnv_pci_ioda_setup_PEs(void)
 			if (phb->model == PNV_PHB_MODEL_NPU2)
 				WARN_ON_ONCE(pnv_npu2_init(hose));
 		}
-		if (phb->type == PNV_PHB_NPU_OCAPI) {
-			bus = hose->bus;
-			list_for_each_entry(pdev, &bus->devices, bus_list)
-				pnv_ioda_setup_dev_PE(pdev);
-		}
 	}
 	list_for_each_entry(hose, &hose_list, list_node) {
 		phb = hose->private_data;
@@ -3373,6 +3366,28 @@ static bool pnv_pci_enable_device_hook(struct pci_dev *dev)
 	return true;
 }
 
+static bool pnv_ocapi_enable_device_hook(struct pci_dev *dev)
+{
+	struct pci_controller *hose = pci_bus_to_host(dev->bus);
+	struct pnv_phb *phb = hose->private_data;
+	struct pci_dn *pdn;
+	struct pnv_ioda_pe *pe;
+
+	if (!phb->initialized)
+		return true;
+
+	pdn = pci_get_pdn(dev);
+	if (!pdn)
+		return false;
+
+	if (pdn->pe_number == IODA_INVALID_PE) {
+		pe = pnv_ioda_setup_dev_PE(dev);
+		if (!pe)
+			return false;
+	}
+	return true;
+}
+
 static long pnv_pci_ioda1_unset_window(struct iommu_table_group *table_group,
 				       int num)
 {
@@ -3613,7 +3628,7 @@ static const struct pci_controller_ops pnv_npu_ioda_controller_ops = {
 };
 
 static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
-	.enable_device_hook	= pnv_pci_enable_device_hook,
+	.enable_device_hook	= pnv_ocapi_enable_device_hook,
 	.window_alignment	= pnv_pci_window_alignment,
 	.reset_secondary_bus	= pnv_pci_reset_secondary_bus,
 	.shutdown		= pnv_pci_ioda_shutdown,
-- 
2.21.0


^ permalink raw reply related

* [RFC 01/11] powerpc/powernv/ioda: Fix ref count for devices with their own PE
From: Frederic Barrat @ 2019-06-19 13:28 UTC (permalink / raw)
  To: linuxppc-dev, andrew.donnellan, clombard
  Cc: aik, arbab, oohall, groug, alastair
In-Reply-To: <20190619132840.27634-1-fbarrat@linux.ibm.com>

Taking a reference on the pci_dev structure was required with initial
commit 184cd4a3b962 ("powerpc/powernv: PCI support for p7IOC under
OPAL v2"), where we we storing the pci dev in the pci_dn structure.

However, the pci_dev was later removed from the pci_dn structure, but
the reference was kept. See 902bdc57451c ("powerpc/powernv/idoa:
Remove unnecessary pcidev from pci_dn").

The pnv_ioda_pe structure life cycle is the same as the pci_dev
structure, the PE is freed when the device is released. So we don't
need a reference for the pci_dev stored in the PE, otherwise the
pci_dev will never be released. Which is not really a surprise as the
comment (removed here as no longer needed) was stating as much.

Fixes: 902bdc57451c ("powerpc/powernv/idoa: Remove unnecessary pcidev from pci_dn")
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 10cc42b9e541..3082912e2600 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1060,14 +1060,6 @@ static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
 		return NULL;
 	}
 
-	/* NOTE: We get only one ref to the pci_dev for the pdn, not for the
-	 * pointer in the PE data structure, both should be destroyed at the
-	 * same time. However, this needs to be looked at more closely again
-	 * once we actually start removing things (Hotplug, SR-IOV, ...)
-	 *
-	 * At some point we want to remove the PDN completely anyways
-	 */
-	pci_dev_get(dev);
 	pdn->pe_number = pe->pe_number;
 	pe->flags = PNV_IODA_PE_DEV;
 	pe->pdev = dev;
@@ -1082,7 +1074,6 @@ static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
 		pnv_ioda_free_pe(pe);
 		pdn->pe_number = IODA_INVALID_PE;
 		pe->pdev = NULL;
-		pci_dev_put(dev);
 		return NULL;
 	}
 
@@ -1226,7 +1217,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_npu_PE(struct pci_dev *npu_pdev)
 			 */
 			dev_info(&npu_pdev->dev,
 				"Associating to existing PE %x\n", pe_num);
-			pci_dev_get(npu_pdev);
+			pci_dev_get(npu_pdev); // still needed after 902bdc57451c2c64aa139bbe24067f70a186db0a ?
 			npu_pdn = pci_get_pdn(npu_pdev);
 			rid = npu_pdev->bus->number << 8 | npu_pdn->devfn;
 			npu_pdn->pe_number = pe_num;
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH 1/3] powerpc/64: __ioremap_at clean up in the error case
From: Christophe Leroy @ 2019-06-19 13:04 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev
In-Reply-To: <1560916886.zyg87enrjs.astroid@bobo.none>



Le 19/06/2019 à 06:04, Nicholas Piggin a écrit :
> Christophe Leroy's on June 11, 2019 4:28 pm:
>>
>>
>> Le 10/06/2019 à 05:08, Nicholas Piggin a écrit :
>>> __ioremap_at error handling is wonky, it requires caller to clean up
>>> after it. Implement a helper that does the map and error cleanup and
>>> remove the requirement from the caller.
>>>
>>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>>> ---
>>>
>>> This series is a different approach to the problem, using the generic
>>> ioremap_page_range directly which reduces added code, and moves
>>> the radix specific code into radix files. Thanks to Christophe for
>>> pointing out various problems with the previous patch.
>>>
>>>    arch/powerpc/mm/pgtable_64.c | 27 ++++++++++++++++++++-------
>>>    1 file changed, 20 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
>>> index d2d976ff8a0e..6bd3660388aa 100644
>>> --- a/arch/powerpc/mm/pgtable_64.c
>>> +++ b/arch/powerpc/mm/pgtable_64.c
>>> @@ -108,14 +108,30 @@ unsigned long ioremap_bot;
>>>    unsigned long ioremap_bot = IOREMAP_BASE;
>>>    #endif
>>>    
>>> +static int ioremap_range(unsigned long ea, phys_addr_t pa, unsigned long size, pgprot_t prot, int nid)
>>> +{
>>> +	unsigned long i;
>>> +
>>> +	for (i = 0; i < size; i += PAGE_SIZE) {
>>> +		int err = map_kernel_page(ea + i, pa + i, prot);
>>
>> Missing a blank line
>>
>>> +		if (err) {
>>
>> I'd have done the following to reduce indentation depth
>>
>> 		if (!err)
>> 			continue
> 
> I'll consider it, line lengths were not too bad.
> 
>>> +			if (slab_is_available())
>>> +				unmap_kernel_range(ea, size);
>>
>> Shouldn't it be unmap_kernel_range(ea, i) ?
> 
> I guess (i - PAGE_SIZE really), although the old code effectively did
> the full range. As a "clean up" it may be better to avoid subtle
> change in behaviour and do that in another patch?

Not sure we have to do it in another patch.
Previous code was doing full range because it was done at upper level so 
it didn't know the boundaries. You are creating a nice brand new 
function that have all necessary information, so why not make it right 
from the start ?

Christophe

> 
> Thanks,
> Nick
> 

^ permalink raw reply

* Re: [PATCH 2/3] powerpc/64s/radix: ioremap use ioremap_page_range
From: Christophe Leroy @ 2019-06-19 12:49 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev
In-Reply-To: <1560915874.eudrz3r20a.astroid@bobo.none>



Le 19/06/2019 à 05:59, Nicholas Piggin a écrit :
> Christophe Leroy's on June 11, 2019 4:46 pm:
>>
>>
>> Le 10/06/2019 à 05:08, Nicholas Piggin a écrit :

[snip]

>>> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
>>> index c9bcf428dd2b..db993bc1aef3 100644
>>> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
>>> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
>>> @@ -11,6 +11,7 @@
>>>    
>>>    #define pr_fmt(fmt) "radix-mmu: " fmt
>>>    
>>> +#include <linux/io.h>
>>>    #include <linux/kernel.h>
>>>    #include <linux/sched/mm.h>
>>>    #include <linux/memblock.h>
>>> @@ -1122,3 +1123,23 @@ void radix__ptep_modify_prot_commit(struct vm_area_struct *vma,
>>>    
>>>    	set_pte_at(mm, addr, ptep, pte);
>>>    }
>>> +
>>> +int radix__ioremap_range(unsigned long ea, phys_addr_t pa, unsigned long size,
>>> +			pgprot_t prot, int nid)
>>> +{
>>> +	if (likely(slab_is_available())) {
>>> +		int err = ioremap_page_range(ea, ea + size, pa, prot);
>>> +		if (err)
>>> +			unmap_kernel_range(ea, size);
>>> +		return err;
>>> +	} else {
>>> +		unsigned long i;
>>> +
>>> +		for (i = 0; i < size; i += PAGE_SIZE) {
>>> +			int err = map_kernel_page(ea + i, pa + i, prot);
>>> +			if (WARN_ON_ONCE(err)) /* Should clean up */
>>> +				return err;
>>> +		}
>>
>> Same loop again.
>>
>> What about not doing a radix specific function and just putting
>> something like below in the core ioremap_range() function ?
>>
>> 	if (likely(slab_is_available()) && radix_enabled()) {
>> 		int err = ioremap_page_range(ea, ea + size, pa, prot);
>>
>> 		if (err)
>> 			unmap_kernel_range(ea, size);
>> 		return err;
>> 	}
>>
>> Because I'm pretty sure will more and more use ioremap_page_range().
> 
> Well I agree the duplication is not so nice, but it's convenient
> to see what is going on for each MMU type.
> 
> There is a significant amount of churn that needs to be done in
> this layer so I prefer to make it a bit simpler despite duplication.
> 
> I would like to remove the early ioremap or make it into its own
> function. Re-implement map_kernel_page with ioremap_page_range,
> allow page tables that don't use slab to avoid the early check,
> unbolt the hptes mapped in early boot, etc.
> 
> I just wanted to escape out the 64s and hash/radix implementations
> completely until that settles.

I can understand the benefit in some situations but here I just can't. 
And code duplication should be avoided as much as possible as it makes 
code maintenance more difficult.

Here you have:

+static int ioremap_range(unsigned long ea, phys_addr_t pa, unsigned 
long size, pgprot_t prot, int nid)
+{
+	unsigned long i;
+
+	for (i = 0; i < size; i += PAGE_SIZE) {
+		int err = map_kernel_page(ea + i, pa + i, prot);
+		if (err) {
+			if (slab_is_available())
+				unmap_kernel_range(ea, size);
+			else
+				WARN_ON_ONCE(1); /* Should clean up */
+			return err;
+		}
+	}
+
+	return 0;
+}

You now create a new one in another file, that is almost identical:

+int ioremap_range(unsigned long ea, phys_addr_t pa, unsigned long size, 
pgprot_t prot, int nid)
+{
+	unsigned long i;
+
+	if (radix_enabled())
+		return radix__ioremap_range(ea, pa, size, prot, nid);
+
+	for (i = 0; i < size; i += PAGE_SIZE) {
+		int err = map_kernel_page(ea + i, pa + i, prot);
+		if (err) {
+			if (slab_is_available())
+				unmap_kernel_range(ea, size);
+			else
+				WARN_ON_ONCE(1); /* Should clean up */
+			return err;
+		}
+	}
+
+	return 0;
+}

Then you have to make the original one __weak.

Sorry I'm still having difficulties understanding what the benefit is.

radix_enabled() is defined for every platforms so could just add the 
following on top of the existing ioremap_range() and voila.

+	if (radix_enabled())
+		return radix__ioremap_range(ea, pa, size, prot, nid);


And with that you wouldn't have the __weak stuff to handle.

> 
>>> -static int ioremap_range(unsigned long ea, phys_addr_t pa, unsigned long size, pgprot_t prot, int nid)
>>> +int __weak ioremap_range(unsigned long ea, phys_addr_t pa, unsigned long size, pgprot_t prot, int nid)
>>
>> Hum. Weak functions remain in unused in vmlinux unless
>> CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is selected.
>>
>> Also, they are some how dangerous because people might change them
>> without seeing that it is overridden for some particular configuration.
> 
> Well you shouldn't assume that when you see a weak function, but
> what's the preferred alternative? A config option?

Yes you are right, nobody should assume that, but ...

But I think if the fonctions were really different, the preferred 
alternative would be to move the original function into a file dedicated 
to nohash64.

Christophe

^ permalink raw reply

* Re: [PATCH 1/4] mm: Move ioremap page table mapping function to mm/
From: Christophe Leroy @ 2019-06-19 13:18 UTC (permalink / raw)
  To: Nicholas Piggin, linux-mm; +Cc: linuxppc-dev, linux-arm-kernel
In-Reply-To: <1560915576.aqf69c3nf8.astroid@bobo.none>



Le 19/06/2019 à 05:43, Nicholas Piggin a écrit :
> Christophe Leroy's on June 11, 2019 3:24 pm:
>>
>>
>> Le 10/06/2019 à 06:38, Nicholas Piggin a écrit :

[snip]

>>> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
>>> index 51e131245379..812bea5866d6 100644
>>> --- a/include/linux/vmalloc.h
>>> +++ b/include/linux/vmalloc.h
>>> @@ -147,6 +147,9 @@ extern struct vm_struct *find_vm_area(const void *addr);
>>>    extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
>>>    			struct page **pages);
>>>    #ifdef CONFIG_MMU
>>> +extern int vmap_range(unsigned long addr,
>>> +		       unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
>>> +		       unsigned int max_page_shift);
>>
>> Drop extern keyword here.
> 
> I don't know if I was going crazy but at one point I was getting
> duplicate symbol errors that were fixed by adding extern somewhere.

probably not on a function name ...

> Maybe sleep depravation. However...
> 
>> As checkpatch tells you, 'CHECK:AVOID_EXTERNS: extern prototypes should
>> be avoided in .h files'
> 
> I prefer to follow existing style in surrounding code at the expense
> of some checkpatch warnings. If somebody later wants to "fix" it
> that's fine.

I don't think that's fine to 'fix' later things that could be done right 
from the begining. 'Cosmetic only' fixes never happen because they are a 
nightmare for backports, and a shame for 'git blame'.

In some patches, you add cleanups to make the code look nicer, and here 
you have the opportunity to make the code nice from the begining and you 
prefer repeating the errors done in the past ? You're surprising me.

Christophe

> 
> Thanks,
> Nick
> 

^ permalink raw reply

* Re: [PATCH 0/1] PPC32: fix ptrace() access to FPU registers
From: Radu Rendec @ 2019-06-19 12:57 UTC (permalink / raw)
  To: Daniel Axtens, Andreas Schwab; +Cc: Paul Mackerras, linuxppc-dev, Oleg Nesterov
In-Reply-To: <87k1di2yxg.fsf@dja-thinkpad.axtens.net>

On Wed, 2019-06-19 at 10:36 +1000, Daniel Axtens wrote:
> Andreas Schwab <
> schwab@linux-m68k.org
> > writes:
> 
> > On Jun 18 2019, Radu Rendec <
> > radu.rendec@gmail.com
> > > wrote:
> > 
> > > Since you already have a working setup, it would be nice if you could
> > > add a printk to arch_ptrace() to print the address and confirm what I
> > > believe happens (by reading the gdb source code).
> > 
> > A ppc32 ptrace syscall goes through compat_arch_ptrace.

Right. I completely overlooked that part.

> Ah right, and that (in ptrace32.c) contains code that will work:
> 
> 
> 			/*
> 			 * the user space code considers the floating point
> 			 * to be an array of unsigned int (32 bits) - the
> 			 * index passed in is based on this assumption.
> 			 */
> 			tmp = ((unsigned int *)child->thread.fp_state.fpr)
> 				[FPRINDEX(index)];
> 
> FPRINDEX is defined above to deal with the various manipulations you
> need to do.

Correct. Basically it does the same that I did in my patch: it divides
the index again by 2 (it's already divided by 4 in compat_arch_ptrace()
so it ends up divided by 8), then takes the least significant bit and
adds it to the index. I take bit 2 of the original address, which is the
same thing (because in FPRHALF() the address is already divided by 4).

So we have this in ptrace32.c:

#define FPRNUMBER(i) (((i) - PT_FPR0) >> 1)
#define FPRHALF(i) (((i) - PT_FPR0) & 1)
#define FPRINDEX(i) TS_FPRWIDTH * FPRNUMBER(i) * 2 + FPRHALF(i)

index = (unsigned long) addr >> 2;
(unsigned int *)child->thread.fp_state.fpr)[FPRINDEX(index)]


And we have this in my patch:

fpidx = (addr - PT_FPR0 * sizeof(long)) / 8;
(void *)&child->thread.TS_FPR(fpidx) + (addr & 4)

> Radu: I think we want to copy that working code back into ptrace.c. 

I'm not sure that would work. There's a subtle difference: the code in
ptrace32.c is always compiled on a 64-bit kernel and the user space
calling it is always 32-bit; on the other hand, the code in ptrace.c can
be compiled on either a 64-bit kernel or a 32-bit kernel and the user
space calling it always has the same "bitness" as the kernel.

One difference is the size of the CPU registers. On 64-bit they are 8
byte long and user space knows that and generates 8-byte aligned
addresses. So you have to divide the address by 8 to calculate the CPU
register index correctly, which compat_arch_ptrace() currently doesn't.

Another difference is that on 64-bit `long` is 8 bytes, so user space
can read a whole FPU register in a single ptrace call. 

Now that we are all aware of compat_arch_ptrace() (which handles the
special case of a 32-bit process running on a 64-bit kernel) I would say
the patch is correct and does the right thing for both 32-bit and 64-bit 
kernels and processes.

> The challenge will be unpicking the awful mess of ifdefs in ptrace.c
> and making it somewhat more comprehensible.

I'm not sure what ifdefs you're thinking about. The only that are used
inside arch_ptrace() are PT_FPR0, PT_FPSCR and TS_FPR, which seem to be
correct.

But perhaps it would be useful to change my patch and add a comment just
before arch_ptrace() that explains how the math is done and that the
code must work on both 32-bit and 64-bit, the user space address
assumptions, etc.

By the way, I'm not sure the code in compat_arch_ptrace() handles
PT_FPSCR correctly. It might (just because fpscr is right next to fpr[]
in memory - and that's a hack), but I can't figure out if it accesses
the right half.

Radu



^ permalink raw reply

* Re: [PATCH 1/2] KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr()
From: Michael Ellerman @ 2019-06-19 12:36 UTC (permalink / raw)
  To: Suraj Jitindar Singh, linuxppc-dev; +Cc: mikey, clg, kvm-ppc
In-Reply-To: <20190617071619.19360-2-sjitindarsingh@gmail.com>

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 3799 bytes --]

On Mon, 2019-06-17 at 07:16:18 UTC, Suraj Jitindar Singh wrote:
> From: Michael Neuling <mikey@neuling.org>
> 
> Commit c1fe190c0672 ("powerpc: Add force enable of DAWR on P9
> option") screwed up some assembler and corrupted a pointer in
> r3. This resulted in crashes like the below:
> 
>   [   44.374746] BUG: Kernel NULL pointer dereference at 0x000013bf
>   [   44.374848] Faulting instruction address: 0xc00000000010b044
>   [   44.374906] Oops: Kernel access of bad area, sig: 11 [#1]
>   [   44.374951] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
>   [   44.375018] Modules linked in: vhost_net vhost tap xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter vmx_crypto crct10dif_vpmsum crc32c_vpmsum kvm_hv kvm sch_fq_codel ip_tables x_tables autofs4 virtio_net net_failover virtio_scsi failover
>   [   44.375401] CPU: 8 PID: 1771 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc4+ #3
>   [   44.375500] NIP:  c00000000010b044 LR: c0080000089dacf4 CTR: c00000000010aff4
>   [   44.375604] REGS: c00000179b397710 TRAP: 0300   Not tainted  (5.2.0-rc4+)
>   [   44.375691] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 42244842  XER: 00000000
>   [   44.375815] CFAR: c00000000010aff8 DAR: 00000000000013bf DSISR: 42000000 IRQMASK: 0
>   [   44.375815] GPR00: c0080000089dd6bc c00000179b3979a0 c008000008a04300 ffffffffffffffff
>   [   44.375815] GPR04: 0000000000000000 0000000000000003 000000002444b05d c0000017f11c45d0
>   [   44.375815] GPR08: 078000003e018dfe 0000000000000028 0000000000000001 0000000000000075
>   [   44.375815] GPR12: c00000000010aff4 c000000007ff6300 0000000000000000 0000000000000000
>   [   44.375815] GPR16: 0000000000000000 c0000017f11d0000 00000000ffffffff c0000017f11ca7a8
>   [   44.375815] GPR20: c0000017f11c42ec ffffffffffffffff 0000000000000000 000000000000000a
>   [   44.375815] GPR24: fffffffffffffffc 0000000000000000 c0000017f11c0000 c000000001a77ed8
>   [   44.375815] GPR28: c00000179af70000 fffffffffffffffc c0080000089ff170 c00000179ae88540
>   [   44.376673] NIP [c00000000010b044] kvmppc_h_set_dabr+0x50/0x68
>   [   44.376754] LR [c0080000089dacf4] kvmppc_pseries_do_hcall+0xa3c/0xeb0 [kvm_hv]
>   [   44.376849] Call Trace:
>   [   44.376886] [c00000179b3979a0] [c0000017f11c0000] 0xc0000017f11c0000 (unreliable)
>   [   44.376982] [c00000179b397a10] [c0080000089dd6bc] kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv]
>   [   44.377084] [c00000179b397ae0] [c0080000093f8bcc] kvmppc_vcpu_run+0x34/0x48 [kvm]
>   [   44.377185] [c00000179b397b00] [c0080000093f522c] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
>   [   44.377286] [c00000179b397b90] [c0080000093e3618] kvm_vcpu_ioctl+0x460/0x850 [kvm]
>   [   44.377384] [c00000179b397d00] [c0000000004ba6c4] do_vfs_ioctl+0xe4/0xb40
>   [   44.377464] [c00000179b397db0] [c0000000004bb1e4] ksys_ioctl+0xc4/0x110
>   [   44.377547] [c00000179b397e00] [c0000000004bb258] sys_ioctl+0x28/0x80
>   [   44.377628] [c00000179b397e20] [c00000000000b888] system_call+0x5c/0x70
>   [   44.377712] Instruction dump:
>   [   44.377765] 4082fff4 4c00012c 38600000 4e800020 e96280c0 896b0000 2c2b0000 3860ffff
>   [   44.377862] 4d820020 50852e74 508516f6 78840724 <f88313c0> f8a313c8 7c942ba6 7cbc2ba6
> 
> Fix the bug by only changing r3 when we are returning immediately.
> 
> Fixes: c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option")
> Signed-off-by: Michael Neuling <mikey@neuling.org>
> Reported-by: Cédric Le Goater <clg@kaod.org>

Series applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/fabb2efcf0846e28b4910fc20bdc203d3d0170af

cheers

^ permalink raw reply

* Re: [PATCH] powerpc: enable a 30-bit ZONE_DMA for 32-bit pmac
From: Michael Ellerman @ 2019-06-19 12:32 UTC (permalink / raw)
  To: Christoph Hellwig, benh, paulus
  Cc: aaro.koskinen, linuxppc-dev, linux-kernel, Larry.Finger
In-Reply-To: <20190619105039.GA10118@lst.de>

Christoph Hellwig <hch@lst.de> writes:
> Any chance this could get picked up to fix the regression?

Was hoping Ben would Ack it. He's still powermac maintainer :)

I guess he OK'ed it in the other thread, will add it to my queue.

cheers

> On Thu, Jun 13, 2019 at 10:24:46AM +0200, Christoph Hellwig wrote:
>> With the strict dma mask checking introduced with the switch to
>> the generic DMA direct code common wifi chips on 32-bit powerbooks
>> stopped working.  Add a 30-bit ZONE_DMA to the 32-bit pmac builds
>> to allow them to reliably allocate dma coherent memory.
>> 
>> Fixes: 65a21b71f948 ("powerpc/dma: remove dma_nommu_dma_supported")
>> Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi>
>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>> ---
>>  arch/powerpc/include/asm/page.h         | 7 +++++++
>>  arch/powerpc/mm/mem.c                   | 3 ++-
>>  arch/powerpc/platforms/powermac/Kconfig | 1 +
>>  3 files changed, 10 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
>> index b8286a2013b4..0d52f57fca04 100644
>> --- a/arch/powerpc/include/asm/page.h
>> +++ b/arch/powerpc/include/asm/page.h
>> @@ -319,6 +319,13 @@ struct vm_area_struct;
>>  #endif /* __ASSEMBLY__ */
>>  #include <asm/slice.h>
>>  
>> +/*
>> + * Allow 30-bit DMA for very limited Broadcom wifi chips on many powerbooks.
>> + */
>> +#ifdef CONFIG_PPC32
>> +#define ARCH_ZONE_DMA_BITS 30
>> +#else
>>  #define ARCH_ZONE_DMA_BITS 31
>> +#endif
>>  
>>  #endif /* _ASM_POWERPC_PAGE_H */
>> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
>> index cba29131bccc..2540d3b2588c 100644
>> --- a/arch/powerpc/mm/mem.c
>> +++ b/arch/powerpc/mm/mem.c
>> @@ -248,7 +248,8 @@ void __init paging_init(void)
>>  	       (long int)((top_of_ram - total_ram) >> 20));
>>  
>>  #ifdef CONFIG_ZONE_DMA
>> -	max_zone_pfns[ZONE_DMA]	= min(max_low_pfn, 0x7fffffffUL >> PAGE_SHIFT);
>> +	max_zone_pfns[ZONE_DMA]	= min(max_low_pfn,
>> +			((1UL << ARCH_ZONE_DMA_BITS) - 1) >> PAGE_SHIFT);
>>  #endif
>>  	max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
>>  #ifdef CONFIG_HIGHMEM
>> diff --git a/arch/powerpc/platforms/powermac/Kconfig b/arch/powerpc/platforms/powermac/Kconfig
>> index f834a19ed772..c02d8c503b29 100644
>> --- a/arch/powerpc/platforms/powermac/Kconfig
>> +++ b/arch/powerpc/platforms/powermac/Kconfig
>> @@ -7,6 +7,7 @@ config PPC_PMAC
>>  	select PPC_INDIRECT_PCI if PPC32
>>  	select PPC_MPC106 if PPC32
>>  	select PPC_NATIVE
>> +	select ZONE_DMA if PPC32
>>  	default y
>>  
>>  config PPC_PMAC64
>> -- 
>> 2.20.1
> ---end quoted text---

^ permalink raw reply

* Re: [PATCH] powerpc: enable a 30-bit ZONE_DMA for 32-bit pmac
From: Christoph Hellwig @ 2019-06-19 10:50 UTC (permalink / raw)
  To: benh, paulus, mpe; +Cc: aaro.koskinen, linuxppc-dev, linux-kernel, Larry.Finger
In-Reply-To: <20190613082446.18685-1-hch@lst.de>

Any chance this could get picked up to fix the regression?

On Thu, Jun 13, 2019 at 10:24:46AM +0200, Christoph Hellwig wrote:
> With the strict dma mask checking introduced with the switch to
> the generic DMA direct code common wifi chips on 32-bit powerbooks
> stopped working.  Add a 30-bit ZONE_DMA to the 32-bit pmac builds
> to allow them to reliably allocate dma coherent memory.
> 
> Fixes: 65a21b71f948 ("powerpc/dma: remove dma_nommu_dma_supported")
> Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  arch/powerpc/include/asm/page.h         | 7 +++++++
>  arch/powerpc/mm/mem.c                   | 3 ++-
>  arch/powerpc/platforms/powermac/Kconfig | 1 +
>  3 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
> index b8286a2013b4..0d52f57fca04 100644
> --- a/arch/powerpc/include/asm/page.h
> +++ b/arch/powerpc/include/asm/page.h
> @@ -319,6 +319,13 @@ struct vm_area_struct;
>  #endif /* __ASSEMBLY__ */
>  #include <asm/slice.h>
>  
> +/*
> + * Allow 30-bit DMA for very limited Broadcom wifi chips on many powerbooks.
> + */
> +#ifdef CONFIG_PPC32
> +#define ARCH_ZONE_DMA_BITS 30
> +#else
>  #define ARCH_ZONE_DMA_BITS 31
> +#endif
>  
>  #endif /* _ASM_POWERPC_PAGE_H */
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index cba29131bccc..2540d3b2588c 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -248,7 +248,8 @@ void __init paging_init(void)
>  	       (long int)((top_of_ram - total_ram) >> 20));
>  
>  #ifdef CONFIG_ZONE_DMA
> -	max_zone_pfns[ZONE_DMA]	= min(max_low_pfn, 0x7fffffffUL >> PAGE_SHIFT);
> +	max_zone_pfns[ZONE_DMA]	= min(max_low_pfn,
> +			((1UL << ARCH_ZONE_DMA_BITS) - 1) >> PAGE_SHIFT);
>  #endif
>  	max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
>  #ifdef CONFIG_HIGHMEM
> diff --git a/arch/powerpc/platforms/powermac/Kconfig b/arch/powerpc/platforms/powermac/Kconfig
> index f834a19ed772..c02d8c503b29 100644
> --- a/arch/powerpc/platforms/powermac/Kconfig
> +++ b/arch/powerpc/platforms/powermac/Kconfig
> @@ -7,6 +7,7 @@ config PPC_PMAC
>  	select PPC_INDIRECT_PCI if PPC32
>  	select PPC_MPC106 if PPC32
>  	select PPC_NATIVE
> +	select ZONE_DMA if PPC32
>  	default y
>  
>  config PPC_PMAC64
> -- 
> 2.20.1
---end quoted text---

^ permalink raw reply

* Re: [PATCH 4/7] powerpc/ftrace: Additionally nop out the preceding mflr with -mprofile-kernel
From: Nicholas Piggin @ 2019-06-19 10:41 UTC (permalink / raw)
  To: Masami Hiramatsu, Ingo Molnar, Michael Ellerman, Naveen N. Rao,
	Steven Rostedt
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <1560935530.70niyxru6o.naveen@linux.ibm.com>

Naveen N. Rao's on June 19, 2019 7:53 pm:
> Nicholas Piggin wrote:
>> Michael Ellerman's on June 19, 2019 3:14 pm:
>>> Hi Naveen,
>>> 
>>> Sorry I meant to reply to this earlier .. :/
> 
> No problem. Thanks for the questions.
> 
>>> 
>>> "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:
>>>> With -mprofile-kernel, gcc emits 'mflr r0', followed by 'bl _mcount' to
>>>> enable function tracing and profiling. So far, with dynamic ftrace, we
>>>> used to only patch out the branch to _mcount(). However, mflr is
>>>> executed by the branch unit that can only execute one per cycle on
>>>> POWER9 and shared with branches, so it would be nice to avoid it where
>>>> possible.
>>>>
>>>> We cannot simply nop out the mflr either. When enabling function
>>>> tracing, there can be a race if tracing is enabled when some thread was
>>>> interrupted after executing a nop'ed out mflr. In this case, the thread
>>>> would execute the now-patched-in branch to _mcount() without having
>>>> executed the preceding mflr.
>>>>
>>>> To solve this, we now enable function tracing in 2 steps: patch in the
>>>> mflr instruction, use synchronize_rcu_tasks() to ensure all existing
>>>> threads make progress, and then patch in the branch to _mcount(). We
>>>> override ftrace_replace_code() with a powerpc64 variant for this
>>>> purpose.
>>> 
>>> According to the ISA we're not allowed to patch mflr at runtime. See the
>>> section on "CMODX".
>> 
>> According to "quasi patch class" engineering note, we can patch
>> anything with a preferred nop. But that's written as an optional
>> facility, which we don't have a feature to test for.
>> 
> 
> Hmm... I wonder what the implications are. We've been patching in a 
> 'trap' for kprobes for a long time now, along with having to patch back 
> the original instruction (which can be anything), when the probe is 
> removed.

Will have to check what implementations support "quasi patch class"
instructions. IIRC recent POWER processors are okay. May have to add
a feature test though.

>>> 
>>> I'm also not convinced the ordering between the two patches is
>>> guaranteed by the ISA, given that there's possibly no isync on the other
>>> CPU.
>> 
>> Will they go through a context synchronizing event?
>> 
>> synchronize_rcu_tasks() should ensure a thread is scheduled away, but
>> I'm not actually sure it guarantees CSI if it's kernel->kernel. Could
>> do a smp_call_function to do the isync on each CPU to be sure.
> 
> Good point. Per 
> Documentation/RCU/Design/Requirements/Requirements.html#Tasks RCU:
> "The solution, in the form of Tasks RCU, is to have implicit read-side 
> critical sections that are delimited by voluntary context switches, that 
> is, calls to schedule(), cond_resched(), and synchronize_rcu_tasks(). In 
> addition, transitions to and from userspace execution also delimit 
> tasks-RCU read-side critical sections."
> 
> I suppose transitions to/from userspace, as well as calls to schedule() 
> result in context synchronizing instruction being executed. But, if some 
> tasks call cond_resched() and synchronize_rcu_tasks(), we probably won't 
> have a CSI executed.
> 
> Also:
> "In CONFIG_PREEMPT=n kernels, trampolines cannot be preempted, so these 
> APIs map to call_rcu(), synchronize_rcu(), and rcu_barrier(), 
> respectively."
> 
> In this scenario as well, I think we won't have a CSI executed in case 
> of cond_resched().
> 
> Should we enhance patch_instruction() to handle that?

Well, not sure. Do we have many post-boot callers of it? Should
they take care of their own synchronization requirements?

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH] ocxl: Allow contexts to be attached with a NULL mm
From: Frederic Barrat @ 2019-06-19 10:36 UTC (permalink / raw)
  To: Andrew Donnellan, Alastair D'Silva, alastair
  Cc: Greg Kroah-Hartman, linuxppc-dev, linux-kernel, Arnd Bergmann,
	Nicholas Piggin
In-Reply-To: <81f8951e-a095-3e13-4229-6475f6a8d4a5@linux.ibm.com>



Le 18/06/2019 à 03:50, Andrew Donnellan a écrit :
> On 17/6/19 2:41 pm, Alastair D'Silva wrote:
>> From: Alastair D'Silva <alastair@d-silva.org>
>>
>> If an OpenCAPI context is to be used directly by a kernel driver, there
>> may not be a suitable mm to use.
>>
>> The patch makes the mm parameter to ocxl_context_attach optional.
>>
>> Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
> 
> The one issue I can see here is that using mm == NULL bypasses our 
> method of enabling/disabling global TLBIs in mm_context_add_copro().
> 
> Discussing this privately with Alastair and Fred - this should be fine, 
> but perhaps we should document that.


So indeed we should be fine. I confirmed with Nick that kernel space 
invalidations are already global today.
Nick mentioned that we should still be fine tomorrow, but in the distant 
future, we could imagine local usage of some part of the kernel space. 
It will require some work, but it would be best to add a comment in one 
of the kernel invalidation function (for example 
radix__flush_tlb_kernel_range()) that if a kernel invalidation ever 
becomes local, then clients of the nest MMU may need some work.

A few more comments below.


>> ---
>>   drivers/misc/ocxl/context.c |  9 ++++++---
>>   drivers/misc/ocxl/link.c    | 12 ++++++++----
>>   2 files changed, 14 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/misc/ocxl/context.c b/drivers/misc/ocxl/context.c
>> index bab9c9364184..994563a078eb 100644
>> --- a/drivers/misc/ocxl/context.c
>> +++ b/drivers/misc/ocxl/context.c
>> @@ -69,6 +69,7 @@ static void xsl_fault_error(void *data, u64 addr, 
>> u64 dsisr)
>>   int ocxl_context_attach(struct ocxl_context *ctx, u64 amr, struct 
>> mm_struct *mm)
>>   {
>>       int rc;
>> +    unsigned long pidr = 0;
>>       // Locks both status & tidr
>>       mutex_lock(&ctx->status_mutex);
>> @@ -77,9 +78,11 @@ int ocxl_context_attach(struct ocxl_context *ctx, 
>> u64 amr, struct mm_struct *mm)
>>           goto out;
>>       }
>> -    rc = ocxl_link_add_pe(ctx->afu->fn->link, ctx->pasid,
>> -            mm->context.id, ctx->tidr, amr, mm,
>> -            xsl_fault_error, ctx);
>> +    if (mm)
>> +        pidr = mm->context.id;
>> +
>> +    rc = ocxl_link_add_pe(ctx->afu->fn->link, ctx->pasid, pidr, 
>> ctx->tidr,
>> +                  amr, mm, xsl_fault_error, ctx);
>>       if (rc)
>>           goto out;
>> diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
>> index cce5b0d64505..43542f124807 100644
>> --- a/drivers/misc/ocxl/link.c
>> +++ b/drivers/misc/ocxl/link.c
>> @@ -523,7 +523,8 @@ int ocxl_link_add_pe(void *link_handle, int pasid, 
>> u32 pidr, u32 tidr,
>>       pe->amr = cpu_to_be64(amr);
>>       pe->software_state = cpu_to_be32(SPA_PE_VALID);
>> -    mm_context_add_copro(mm);
>> +    if (mm)
>> +        mm_context_add_copro(mm);


Same as above, we should add a comment here in the driver code that a 
kernel context is ok because invalidations are global.


We also need a new check in xsl_fault_handler(). A valid kernel address 
shouldn't fault, but it's still possible for the FPGA to try accessing a 
bogus kernel address. In which case, xsl_fault_handler() would be 
entered, with a valid fault context. We'll find pe_data in the tree 
based on the valid pe_handle, but pe_data->mm will be NULL. In that, we 
can return early, acknowledging the interrupt with ADDRESS_ERROR value 
(like we do if pe_data is not found in the tree).

   Fred


>>       /*
>>        * Barrier is to make sure PE is visible in the SPA before it
>>        * is used by the device. It also helps with the global TLBI
>> @@ -546,7 +547,8 @@ int ocxl_link_add_pe(void *link_handle, int pasid, 
>> u32 pidr, u32 tidr,
>>        * have a reference on mm_users. Incrementing mm_count solves
>>        * the problem.
>>        */
>> -    mmgrab(mm);
>> +    if (mm)
>> +        mmgrab(mm);
>>       trace_ocxl_context_add(current->pid, spa->spa_mem, pasid, pidr, 
>> tidr);
>>   unlock:
>>       mutex_unlock(&spa->spa_lock);
>> @@ -652,8 +654,10 @@ int ocxl_link_remove_pe(void *link_handle, int 
>> pasid)
>>       if (!pe_data) {
>>           WARN(1, "Couldn't find pe data when removing PE\n");
>>       } else {
>> -        mm_context_remove_copro(pe_data->mm);
>> -        mmdrop(pe_data->mm);
>> +        if (pe_data->mm) {
>> +            mm_context_remove_copro(pe_data->mm);
>> +            mmdrop(pe_data->mm);
>> +        }
>>           kfree_rcu(pe_data, rcu);
>>       }
>>   unlock:
>>
> 


^ permalink raw reply

* Re: [RFC PATCH v0] powerpc: Fix BUG_ON during memory unplug on radix
From: Nicholas Piggin @ 2019-06-19 10:17 UTC (permalink / raw)
  To: Bharata B Rao, linuxppc-dev; +Cc: aneesh.kumar
In-Reply-To: <20190619074545.11761-1-bharata@linux.ibm.com>

Bharata B Rao's on June 19, 2019 5:45 pm:
> We hit the following BUG_ON when memory hotplugged before reboot
> is unplugged after reboot:
> 
> kernel BUG at arch/powerpc/mm/pgtable-frag.c:113!
> 
>  remove_pagetable+0x594/0x6a0
>  (unreliable)
>  remove_pagetable+0x94/0x6a0
>  vmemmap_free+0x394/0x410
>  sparse_remove_one_section+0x26c/0x2e8
>  __remove_pages+0x428/0x540
>  arch_remove_memory+0xd0/0x170
>  __remove_memory+0xd4/0x1a0
>  dlpar_remove_lmb+0xbc/0x110
>  dlpar_memory+0xa80/0xd20
>  handle_dlpar_errorlog+0xa8/0x160
>  pseries_hp_work_fn+0x2c/0x60
>  process_one_work+0x46c/0x860
>  worker_thread+0x364/0x5e0
>  kthread+0x1b0/0x1c0
>  ret_from_kernel_thread+0x5c/0x68
> 
> This occurs because, during reboot-after-hotplug, the hotplugged
> memory range gets initialized as regular memory and page
> tables are setup using memblock allocator. This means that we
> wouldn't have initialized the PMD or PTE fragment count for
> those PMD or PTE pages.
> 
> Fixing this includes 3 aspects:
> 
> - Walk the init_mm page tables from mem_init() and initialize
>   the PMD and PTE fragment counts appropriately.
> - When we do early allocation of PMD (and PGD as well) pages,
>   allocate in page size PAGE_SIZE granularity so that we are
>   sure that the complete page is available for us to set the
>   fragment count which is part of struct page.
> - When PMD or PTE page is freed, check if it comes from memblock
>   allocator and free it appropriately.
> 
> Reported-by: Srikanth Aithal <sraithal@linux.vnet.ibm.com>
> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> ---
>  arch/powerpc/include/asm/book3s/64/radix.h |  1 +
>  arch/powerpc/include/asm/sparsemem.h       |  1 +
>  arch/powerpc/mm/book3s64/pgtable.c         | 12 +++-
>  arch/powerpc/mm/book3s64/radix_pgtable.c   | 67 +++++++++++++++++++++-
>  arch/powerpc/mm/mem.c                      |  5 ++
>  arch/powerpc/mm/pgtable-frag.c             |  5 +-
>  6 files changed, 87 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
> index 574eca33f893..4320f2790e8d 100644
> --- a/arch/powerpc/include/asm/book3s/64/radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/radix.h
> @@ -285,6 +285,7 @@ static inline unsigned long radix__get_tree_size(void)
>  #ifdef CONFIG_MEMORY_HOTPLUG
>  int radix__create_section_mapping(unsigned long start, unsigned long end, int nid);
>  int radix__remove_section_mapping(unsigned long start, unsigned long end);
> +void radix__fixup_pgtable_fragments(void);
>  #endif /* CONFIG_MEMORY_HOTPLUG */
>  #endif /* __ASSEMBLY__ */
>  #endif
> diff --git a/arch/powerpc/include/asm/sparsemem.h b/arch/powerpc/include/asm/sparsemem.h
> index 3192d454a733..e662f9232d35 100644
> --- a/arch/powerpc/include/asm/sparsemem.h
> +++ b/arch/powerpc/include/asm/sparsemem.h
> @@ -15,6 +15,7 @@
>  #ifdef CONFIG_MEMORY_HOTPLUG
>  extern int create_section_mapping(unsigned long start, unsigned long end, int nid);
>  extern int remove_section_mapping(unsigned long start, unsigned long end);
> +void fixup_pgtable_fragments(void);
>  
>  #ifdef CONFIG_PPC_BOOK3S_64
>  extern int resize_hpt_for_hotplug(unsigned long new_mem_size);
> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> index 01bc9663360d..7efe9cc16b39 100644
> --- a/arch/powerpc/mm/book3s64/pgtable.c
> +++ b/arch/powerpc/mm/book3s64/pgtable.c
> @@ -186,6 +186,13 @@ int __meminit remove_section_mapping(unsigned long start, unsigned long end)
>  
>  	return hash__remove_section_mapping(start, end);
>  }
> +
> +void fixup_pgtable_fragments(void)
> +{
> +	if (radix_enabled())
> +		radix__fixup_pgtable_fragments();
> +}
> +
>  #endif /* CONFIG_MEMORY_HOTPLUG */
>  
>  void __init mmu_partition_table_init(void)
> @@ -320,7 +327,10 @@ void pmd_fragment_free(unsigned long *pmd)
>  	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
>  	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
>  		pgtable_pmd_page_dtor(page);
> -		__free_page(page);
> +		if (PageReserved(page))
> +			free_reserved_page(page);

Hmm. Rather than adding this special case here, I wonder if you can
just go along in your fixup walk and convert all these pages to
non-reserved pages?

ClearPageReserved ; init_page_count ; adjust_managed_page_count ; 
should do the trick, right?


> +		else
> +			__free_page(page);

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v2 1/1] cpuidle-powernv : forced wakeup for stop states
From: Nicholas Piggin @ 2019-06-19 10:09 UTC (permalink / raw)
  To: Abhishek, linux-kernel, linux-pm, linuxppc-dev
  Cc: ego, daniel.lezcano, rjw, dja
In-Reply-To: <689a52a7-7bfc-7225-e563-ac07f7357e75@linux.vnet.ibm.com>

Abhishek's on June 19, 2019 7:08 pm:
> Hi Nick,
> 
> Thanks for the review. Some replies below.
> 
> On 06/19/2019 09:53 AM, Nicholas Piggin wrote:
>> Abhishek Goel's on June 17, 2019 7:56 pm:
>>> Currently, the cpuidle governors determine what idle state a idling CPU
>>> should enter into based on heuristics that depend on the idle history on
>>> that CPU. Given that no predictive heuristic is perfect, there are cases
>>> where the governor predicts a shallow idle state, hoping that the CPU will
>>> be busy soon. However, if no new workload is scheduled on that CPU in the
>>> near future, the CPU may end up in the shallow state.
>>>
>>> This is problematic, when the predicted state in the aforementioned
>>> scenario is a shallow stop state on a tickless system. As we might get
>>> stuck into shallow states for hours, in absence of ticks or interrupts.
>>>
>>> To address this, We forcefully wakeup the cpu by setting the
>>> decrementer. The decrementer is set to a value that corresponds with the
>>> residency of the next available state. Thus firing up a timer that will
>>> forcefully wakeup the cpu. Few such iterations will essentially train the
>>> governor to select a deeper state for that cpu, as the timer here
>>> corresponds to the next available cpuidle state residency. Thus, cpu will
>>> eventually end up in the deepest possible state.
>>>
>>> Signed-off-by: Abhishek Goel <huntbag@linux.vnet.ibm.com>
>>> ---
>>>
>>> Auto-promotion
>>>   v1 : started as auto promotion logic for cpuidle states in generic
>>> driver
>>>   v2 : Removed timeout_needed and rebased the code to upstream kernel
>>> Forced-wakeup
>>>   v1 : New patch with name of forced wakeup started
>>>   v2 : Extending the forced wakeup logic for all states. Setting the
>>> decrementer instead of queuing up a hrtimer to implement the logic.
>>>
>>>   drivers/cpuidle/cpuidle-powernv.c | 38 +++++++++++++++++++++++++++++++
>>>   1 file changed, 38 insertions(+)
>>>
>>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>>> index 84b1ebe212b3..bc9ca18ae7e3 100644
>>> --- a/drivers/cpuidle/cpuidle-powernv.c
>>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>>> @@ -46,6 +46,26 @@ static struct stop_psscr_table stop_psscr_table[CPUIDLE_STATE_MAX] __read_mostly
>>>   static u64 default_snooze_timeout __read_mostly;
>>>   static bool snooze_timeout_en __read_mostly;
>>>   
>>> +static u64 forced_wakeup_timeout(struct cpuidle_device *dev,
>>> +				 struct cpuidle_driver *drv,
>>> +				 int index)
>>> +{
>>> +	int i;
>>> +
>>> +	for (i = index + 1; i < drv->state_count; i++) {
>>> +		struct cpuidle_state *s = &drv->states[i];
>>> +		struct cpuidle_state_usage *su = &dev->states_usage[i];
>>> +
>>> +		if (s->disabled || su->disable)
>>> +			continue;
>>> +
>>> +		return (s->target_residency + 2 * s->exit_latency) *
>>> +			tb_ticks_per_usec;
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>> It would be nice to not have this kind of loop iteration in the
>> idle fast path. Can we add a flag or something to the idle state?
> Currently, we do not have any callback notification or some feedback that
> notifies the driver everytime some state is enabled/disabled. So we have
> to parse everytime to get the next enabled state.

Ahh, that's why you're doing that.

> Are you suggesting to
> add something like next_enabled_state in cpuidle state structure itself
> which will be updated when a state is enabled or disabled?

Hmm, I guess it normally should not iterate over more than one state
unless some idle states are disabled.

What would have been nice is each state just have its own timeout
field with ticks already calculated, if that could be updated when
a state is enabled or disabled. How hard is that to add to the
cpuidle core?

>>> +
>>>   static u64 get_snooze_timeout(struct cpuidle_device *dev,
>>>   			      struct cpuidle_driver *drv,
>>>   			      int index)
>>> @@ -144,8 +164,26 @@ static int stop_loop(struct cpuidle_device *dev,
>>>   		     struct cpuidle_driver *drv,
>>>   		     int index)
>>>   {
>>> +	u64 dec_expiry_tb, dec, timeout_tb, forced_wakeup;
>>> +
>>> +	dec = mfspr(SPRN_DEC);
>>> +	timeout_tb = forced_wakeup_timeout(dev, drv, index);
>>> +	forced_wakeup = 0;
>>> +
>>> +	if (timeout_tb && timeout_tb < dec) {
>>> +		forced_wakeup = 1;
>>> +		dec_expiry_tb = mftb() + dec;
>>> +	}
>> The compiler probably can't optimise away the SPR manipulations so try
>> to avoid them if possible.
> Are you suggesting something like set_dec_before_idle?(in line with
> what you have suggested to do after idle, reset_dec_after_idle)

I should have been clear, I meant don't mfspr(SPRN_DEC) until you
have tested timeout_tb.

>>> +
>>> +	if (forced_wakeup)
>>> +		mtspr(SPRN_DEC, timeout_tb);
>> This should just be put in the above 'if'.
> Fair point.
>>
>>> +
>>>   	power9_idle_type(stop_psscr_table[index].val,
>>>   			 stop_psscr_table[index].mask);
>>> +
>>> +	if (forced_wakeup)
>>> +		mtspr(SPRN_DEC, dec_expiry_tb - mftb());
>> This will sometimes go negative and result in another timer interrupt.
>>
>> It also breaks irq work (which can be set here by machine check I
>> believe.
>>
>> May need to implement some timer code to do this for you.
>>
>> static void reset_dec_after_idle(void)
>> {
>> 	u64 now;
>>          u64 *next_tb;
>>
>> 	if (test_irq_work_pending())
>> 		return;
>> 	now = mftb;
>> 	next_tb = this_cpu_ptr(&decrementers_next_tb);
>>
>> 	if (now >= *next_tb)
>> 		return;
>> 	set_dec(*next_tb - now);
>> 	if (test_irq_work_pending())
>> 		set_dec(1);
>> }
>>
>> Something vaguely like that. See timer_interrupt().
> Ah, Okay. Will go through timer_interrupt().

Thanks,
Nick


^ permalink raw reply

* Re: [PATCH 06/28] powerpc/64s/exception: remove the "extra" macro parameter
From: Michael Ellerman @ 2019-06-19 10:02 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev
In-Reply-To: <1560933396.zhudswbcjr.astroid@bobo.none>

Nicholas Piggin <npiggin@gmail.com> writes:
> Nicholas Piggin's on June 12, 2019 12:30 am:
>> @@ -265,7 +275,7 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
>>  EXC_REAL_END(machine_check, 0x200, 0x100)
>>  EXC_VIRT_NONE(0x4200, 0x100)
>>  TRAMP_REAL_BEGIN(machine_check_common_early)
>> -	EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
>> +	EXCEPTION_PROLOG_1 EXC_HV, PACA_EXMC, 0, 0x200
>>  	/*
>>  	 * Register contents:
>>  	 * R13		= PACA
>
> This is a little bug here, machine check is an EXC_STD exception. It
> does not show up as generated code problem because EXCEPTION_PROLOG_1
> does not actually do anything with this parameter if KVM is false,
> which it is here.
>
> Still, it's wrong. I may just resend the series, because it caused a
> few conflicts in subsequent patches, and I have a few more to add to
> the end.

OK. I'll pull the series out for now.

cheers

^ permalink raw reply

* Re: [PATCH 4/7] powerpc/ftrace: Additionally nop out the preceding mflr with -mprofile-kernel
From: Naveen N. Rao @ 2019-06-19  9:53 UTC (permalink / raw)
  To: Masami Hiramatsu, Ingo Molnar, Michael Ellerman, Nicholas Piggin,
	Steven Rostedt
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <1560927184.kqsg9x9bd1.astroid@bobo.none>

Nicholas Piggin wrote:
> Michael Ellerman's on June 19, 2019 3:14 pm:
>> Hi Naveen,
>> 
>> Sorry I meant to reply to this earlier .. :/

No problem. Thanks for the questions.

>> 
>> "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:
>>> With -mprofile-kernel, gcc emits 'mflr r0', followed by 'bl _mcount' to
>>> enable function tracing and profiling. So far, with dynamic ftrace, we
>>> used to only patch out the branch to _mcount(). However, mflr is
>>> executed by the branch unit that can only execute one per cycle on
>>> POWER9 and shared with branches, so it would be nice to avoid it where
>>> possible.
>>>
>>> We cannot simply nop out the mflr either. When enabling function
>>> tracing, there can be a race if tracing is enabled when some thread was
>>> interrupted after executing a nop'ed out mflr. In this case, the thread
>>> would execute the now-patched-in branch to _mcount() without having
>>> executed the preceding mflr.
>>>
>>> To solve this, we now enable function tracing in 2 steps: patch in the
>>> mflr instruction, use synchronize_rcu_tasks() to ensure all existing
>>> threads make progress, and then patch in the branch to _mcount(). We
>>> override ftrace_replace_code() with a powerpc64 variant for this
>>> purpose.
>> 
>> According to the ISA we're not allowed to patch mflr at runtime. See the
>> section on "CMODX".
> 
> According to "quasi patch class" engineering note, we can patch
> anything with a preferred nop. But that's written as an optional
> facility, which we don't have a feature to test for.
> 

Hmm... I wonder what the implications are. We've been patching in a 
'trap' for kprobes for a long time now, along with having to patch back 
the original instruction (which can be anything), when the probe is 
removed.

>> 
>> I'm also not convinced the ordering between the two patches is
>> guaranteed by the ISA, given that there's possibly no isync on the other
>> CPU.
> 
> Will they go through a context synchronizing event?
> 
> synchronize_rcu_tasks() should ensure a thread is scheduled away, but
> I'm not actually sure it guarantees CSI if it's kernel->kernel. Could
> do a smp_call_function to do the isync on each CPU to be sure.

Good point. Per 
Documentation/RCU/Design/Requirements/Requirements.html#Tasks RCU:
"The solution, in the form of Tasks RCU, is to have implicit read-side 
critical sections that are delimited by voluntary context switches, that 
is, calls to schedule(), cond_resched(), and synchronize_rcu_tasks(). In 
addition, transitions to and from userspace execution also delimit 
tasks-RCU read-side critical sections."

I suppose transitions to/from userspace, as well as calls to schedule() 
result in context synchronizing instruction being executed. But, if some 
tasks call cond_resched() and synchronize_rcu_tasks(), we probably won't 
have a CSI executed.

Also:
"In CONFIG_PREEMPT=n kernels, trampolines cannot be preempted, so these 
APIs map to call_rcu(), synchronize_rcu(), and rcu_barrier(), 
respectively."

In this scenario as well, I think we won't have a CSI executed in case 
of cond_resched().

Should we enhance patch_instruction() to handle that?


- Naveen


^ permalink raw reply

* RE: [PATCH v4 1/3] PM: wakeup: Add routine to help fetch wakeup source object.
From: Ran Wang @ 2019-06-19  9:36 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Mark Rutland, Len Brown, devicetree@vger.kernel.org,
	Greg Kroah-Hartman, linux-pm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Leo Li, Rob Herring, Pavel Machek,
	linuxppc-dev@lists.ozlabs.org,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <3448272.3g8bHhgBA9@kreacher>

Hi Rafael,

On Wednesday, June 19, 2019 06:45, Rafael J. Wysocki wrote:
> 
> On Monday, May 20, 2019 11:52:36 AM CEST Ran Wang wrote:
> > Some user might want to go through all registered wakeup sources and
> > doing things accordingly. For example, SoC PM driver might need to do
> > HW programming to prevent powering down specific IP which wakeup
> > source depending on. And is user's responsibility to identify if this
> > wakeup source he is interested in.
> 
> I guess the idea here is that you need to walk wakeup devices and you noticed
> that there was a wakeup source object for each of them and those wakeup
> source objects were on a list, so you could walk wakeup devices by walking the
> list of wakeup source objects.
> 
> That is fair enough, but the changelog above doesn't even talk about that.

How about this:
"Providing a API for helping walk through all registered wakeup devices on the list.
It will be useful for SoC PMU driver to know which device will work as a wakeup
source then do specific HW programming for them."

> > Signed-off-by: Ran Wang <ran.wang_1@nxp.com>
> > ---
> > Change in v4:
> > 	- None.
> >
> > Change in v3:
> > 	- Adjust indentation of *attached_dev;.
> >
> > Change in v2:
> > 	- None.
> >
> >  drivers/base/power/wakeup.c |   18 ++++++++++++++++++
> >  include/linux/pm_wakeup.h   |    3 +++
> >  2 files changed, 21 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/base/power/wakeup.c b/drivers/base/power/wakeup.c
> > index 5b2b6a0..6904485 100644
> > --- a/drivers/base/power/wakeup.c
> > +++ b/drivers/base/power/wakeup.c
> > @@ -14,6 +14,7 @@
> >  #include <linux/suspend.h>
> >  #include <linux/seq_file.h>
> >  #include <linux/debugfs.h>
> > +#include <linux/of_device.h>
> >  #include <linux/pm_wakeirq.h>
> >  #include <trace/events/power.h>
> >
> > @@ -226,6 +227,22 @@ void wakeup_source_unregister(struct
> wakeup_source *ws)
> >  	}
> >  }
> >  EXPORT_SYMBOL_GPL(wakeup_source_unregister);
> > +/**
> > + * wakeup_source_get_next - Get next wakeup source from the list
> > + * @ws: Previous wakeup source object, null means caller want first one.
> > + */
> > +struct wakeup_source *wakeup_source_get_next(struct wakeup_source
> > +*ws) {
> > +	struct list_head *ws_head = &wakeup_sources;
> > +
> > +	if (ws)
> > +		return list_next_or_null_rcu(ws_head, &ws->entry,
> > +				struct wakeup_source, entry);
> > +	else
> > +		return list_entry_rcu(ws_head->next,
> > +				struct wakeup_source, entry);
> > +}
> > +EXPORT_SYMBOL_GPL(wakeup_source_get_next);
> 
> This needs to be arranged along the lines of
> wakeup_sources_stats_seq_start/next/stop()
> because of the SRCU protection of the list.

Got it, how about this:
 230 /**                                                                              
 231  * wakeup_source_get_next - Get next wakeup source from the list                 
 232  * @ws: Previous wakeup source object, null means caller want first one.         
 233  */                                                                              
 234 struct wakeup_source *wakeup_source_get_next(struct wakeup_source *ws)           
 235 {                                                                                
 236         struct list_head *ws_head = &wakeup_sources;                             
 237         struct wakeup_source *next_ws = NULL;                                    
 238         int idx;                                                                 
 239                                                                                  
 240         idx = srcu_read_lock(&wakeup_srcu);                                      
 241         if (ws)                                                                                                                
 242                 next_ws = list_next_or_null_rcu(ws_head, &ws->entry,             
 243                                 struct wakeup_source, entry);                    
 244         else                                                                     
 245                 next_ws = list_entry_rcu(ws_head->next,                          
 246                                 struct wakeup_source, entry);                    
 247         srcu_read_unlock(&wakeup_srcu, idx);                                     
 248                                                                                  
 249         return next_ws;                                                          
 250 }                                                                                
 251 EXPORT_SYMBOL_GPL(wakeup_source_get_next);   

> >
> >  /**
> >   * device_wakeup_attach - Attach a wakeup source object to a device object.
> > @@ -242,6 +259,7 @@ static int device_wakeup_attach(struct device *dev,
> struct wakeup_source *ws)
> >  		return -EEXIST;
> >  	}
> >  	dev->power.wakeup = ws;
> > +	ws->attached_dev = dev;
> >  	if (dev->power.wakeirq)
> >  		device_wakeup_attach_irq(dev, dev->power.wakeirq);
> >  	spin_unlock_irq(&dev->power.lock);
> > diff --git a/include/linux/pm_wakeup.h b/include/linux/pm_wakeup.h
> > index 0ff134d..913b2fb 100644
> > --- a/include/linux/pm_wakeup.h
> > +++ b/include/linux/pm_wakeup.h
> > @@ -50,6 +50,7 @@
> >   * @wakeup_count: Number of times the wakeup source might abort suspend.
> >   * @active: Status of the wakeup source.
> >   * @has_timeout: The wakeup source has been activated with a timeout.
> > + * @attached_dev: The device it attached to
> >   */
> >  struct wakeup_source {
> >  	const char 		*name;
> > @@ -70,6 +71,7 @@ struct wakeup_source {
> >  	unsigned long		wakeup_count;
> >  	bool			active:1;
> >  	bool			autosleep_enabled:1;
> > +	struct device		*attached_dev;
> 
> Please (a) call it just dev and (b) move it up (before wakeirq, say)

Got it, will update in next version.

Thanks & Regards,
Ran
> 
> >  };
> >
> >  #ifdef CONFIG_PM_SLEEP
> > @@ -101,6 +103,7 @@ static inline void device_set_wakeup_path(struct
> > device *dev)  extern void wakeup_source_remove(struct wakeup_source
> > *ws);  extern struct wakeup_source *wakeup_source_register(const char
> > *name);  extern void wakeup_source_unregister(struct wakeup_source
> > *ws);
> > +extern struct wakeup_source *wakeup_source_get_next(struct
> > +wakeup_source *ws);
> >  extern int device_wakeup_enable(struct device *dev);  extern int
> > device_wakeup_disable(struct device *dev);  extern void
> > device_set_wakeup_capable(struct device *dev, bool capable);
> >
> 
> 
> 


^ permalink raw reply

* Re: [PATCH 5/7] powerpc/ftrace: Update ftrace_location() for powerpc -mprofile-kernel
From: Steven Rostedt @ 2019-06-19  9:28 UTC (permalink / raw)
  To: Naveen N. Rao
  Cc: linux-kernel, Nicholas Piggin, Masami Hiramatsu, linuxppc-dev,
	Ingo Molnar
In-Reply-To: <1560930937.j2vguryjp3.naveen@linux.ibm.com>

On Wed, 19 Jun 2019 13:26:37 +0530
"Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> wrote:

> > In include/ftrace.h:
> > 
> > #ifndef FTRACE_IP_EXTENSION
> > # define FTRACE_IP_EXTENSION	0
> > #endif
> > 
> > 
> > In arch/powerpc/include/asm/ftrace.h
> > 
> > #define FTRACE_IP_EXTENSION	MCOUNT_INSN_SIZE
> > 
> > 
> > Then we can just have:
> > 
> > unsigned long ftrace_location(unsigned long ip)
> > {
> > 	return ftrace_location_range(ip, ip + FTRACE_IP_EXTENSION);
> > }  
> 
> Thanks, that's indeed nice. I hope you don't mind me adding your SOB for 
> that.

Actually, it's best not to put a SOB by anyone other than yourself. It
actually has legal meaning.

In this case, please add:

Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

Thanks!

-- Steve

^ permalink raw reply

* Re: [PATCH v2 1/1] cpuidle-powernv : forced wakeup for stop states
From: Abhishek @ 2019-06-19  9:08 UTC (permalink / raw)
  To: Nicholas Piggin, linux-kernel, linux-pm, linuxppc-dev
  Cc: ego, daniel.lezcano, rjw, dja
In-Reply-To: <1560917320.mk5nn6r8jw.astroid@bobo.none>

Hi Nick,

Thanks for the review. Some replies below.

On 06/19/2019 09:53 AM, Nicholas Piggin wrote:
> Abhishek Goel's on June 17, 2019 7:56 pm:
>> Currently, the cpuidle governors determine what idle state a idling CPU
>> should enter into based on heuristics that depend on the idle history on
>> that CPU. Given that no predictive heuristic is perfect, there are cases
>> where the governor predicts a shallow idle state, hoping that the CPU will
>> be busy soon. However, if no new workload is scheduled on that CPU in the
>> near future, the CPU may end up in the shallow state.
>>
>> This is problematic, when the predicted state in the aforementioned
>> scenario is a shallow stop state on a tickless system. As we might get
>> stuck into shallow states for hours, in absence of ticks or interrupts.
>>
>> To address this, We forcefully wakeup the cpu by setting the
>> decrementer. The decrementer is set to a value that corresponds with the
>> residency of the next available state. Thus firing up a timer that will
>> forcefully wakeup the cpu. Few such iterations will essentially train the
>> governor to select a deeper state for that cpu, as the timer here
>> corresponds to the next available cpuidle state residency. Thus, cpu will
>> eventually end up in the deepest possible state.
>>
>> Signed-off-by: Abhishek Goel <huntbag@linux.vnet.ibm.com>
>> ---
>>
>> Auto-promotion
>>   v1 : started as auto promotion logic for cpuidle states in generic
>> driver
>>   v2 : Removed timeout_needed and rebased the code to upstream kernel
>> Forced-wakeup
>>   v1 : New patch with name of forced wakeup started
>>   v2 : Extending the forced wakeup logic for all states. Setting the
>> decrementer instead of queuing up a hrtimer to implement the logic.
>>
>>   drivers/cpuidle/cpuidle-powernv.c | 38 +++++++++++++++++++++++++++++++
>>   1 file changed, 38 insertions(+)
>>
>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>> index 84b1ebe212b3..bc9ca18ae7e3 100644
>> --- a/drivers/cpuidle/cpuidle-powernv.c
>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>> @@ -46,6 +46,26 @@ static struct stop_psscr_table stop_psscr_table[CPUIDLE_STATE_MAX] __read_mostly
>>   static u64 default_snooze_timeout __read_mostly;
>>   static bool snooze_timeout_en __read_mostly;
>>   
>> +static u64 forced_wakeup_timeout(struct cpuidle_device *dev,
>> +				 struct cpuidle_driver *drv,
>> +				 int index)
>> +{
>> +	int i;
>> +
>> +	for (i = index + 1; i < drv->state_count; i++) {
>> +		struct cpuidle_state *s = &drv->states[i];
>> +		struct cpuidle_state_usage *su = &dev->states_usage[i];
>> +
>> +		if (s->disabled || su->disable)
>> +			continue;
>> +
>> +		return (s->target_residency + 2 * s->exit_latency) *
>> +			tb_ticks_per_usec;
>> +	}
>> +
>> +	return 0;
>> +}
> It would be nice to not have this kind of loop iteration in the
> idle fast path. Can we add a flag or something to the idle state?
Currently, we do not have any callback notification or some feedback that
notifies the driver everytime some state is enabled/disabled. So we have
to parse everytime to get the next enabled state. Are you suggesting to
add something like next_enabled_state in cpuidle state structure itself
which will be updated when a state is enabled or disabled?
>> +
>>   static u64 get_snooze_timeout(struct cpuidle_device *dev,
>>   			      struct cpuidle_driver *drv,
>>   			      int index)
>> @@ -144,8 +164,26 @@ static int stop_loop(struct cpuidle_device *dev,
>>   		     struct cpuidle_driver *drv,
>>   		     int index)
>>   {
>> +	u64 dec_expiry_tb, dec, timeout_tb, forced_wakeup;
>> +
>> +	dec = mfspr(SPRN_DEC);
>> +	timeout_tb = forced_wakeup_timeout(dev, drv, index);
>> +	forced_wakeup = 0;
>> +
>> +	if (timeout_tb && timeout_tb < dec) {
>> +		forced_wakeup = 1;
>> +		dec_expiry_tb = mftb() + dec;
>> +	}
> The compiler probably can't optimise away the SPR manipulations so try
> to avoid them if possible.
Are you suggesting something like set_dec_before_idle?(in line with
what you have suggested to do after idle, reset_dec_after_idle)
>
>> +
>> +	if (forced_wakeup)
>> +		mtspr(SPRN_DEC, timeout_tb);
> This should just be put in the above 'if'.
Fair point.
>
>> +
>>   	power9_idle_type(stop_psscr_table[index].val,
>>   			 stop_psscr_table[index].mask);
>> +
>> +	if (forced_wakeup)
>> +		mtspr(SPRN_DEC, dec_expiry_tb - mftb());
> This will sometimes go negative and result in another timer interrupt.
>
> It also breaks irq work (which can be set here by machine check I
> believe.
>
> May need to implement some timer code to do this for you.
>
> static void reset_dec_after_idle(void)
> {
> 	u64 now;
>          u64 *next_tb;
>
> 	if (test_irq_work_pending())
> 		return;
> 	now = mftb;
> 	next_tb = this_cpu_ptr(&decrementers_next_tb);
>
> 	if (now >= *next_tb)
> 		return;
> 	set_dec(*next_tb - now);
> 	if (test_irq_work_pending())
> 		set_dec(1);
> }
>
> Something vaguely like that. See timer_interrupt().
Ah, Okay. Will go through timer_interrupt().
> Thanks,
> Nick
Thanks,
Abhishek


^ permalink raw reply

* Re: [RFC PATCH v0] powerpc: Fix BUG_ON during memory unplug on radix
From: Aneesh Kumar K.V @ 2019-06-19  9:06 UTC (permalink / raw)
  To: Bharata B Rao, linuxppc-dev; +Cc: aneesh.kumar, npiggin, Bharata B Rao
In-Reply-To: <20190619074545.11761-1-bharata@linux.ibm.com>

Bharata B Rao <bharata@linux.ibm.com> writes:

> We hit the following BUG_ON when memory hotplugged before reboot
> is unplugged after reboot:
>
> kernel BUG at arch/powerpc/mm/pgtable-frag.c:113!
>
>  remove_pagetable+0x594/0x6a0
>  (unreliable)
>  remove_pagetable+0x94/0x6a0
>  vmemmap_free+0x394/0x410
>  sparse_remove_one_section+0x26c/0x2e8
>  __remove_pages+0x428/0x540
>  arch_remove_memory+0xd0/0x170
>  __remove_memory+0xd4/0x1a0
>  dlpar_remove_lmb+0xbc/0x110
>  dlpar_memory+0xa80/0xd20
>  handle_dlpar_errorlog+0xa8/0x160
>  pseries_hp_work_fn+0x2c/0x60
>  process_one_work+0x46c/0x860
>  worker_thread+0x364/0x5e0
>  kthread+0x1b0/0x1c0
>  ret_from_kernel_thread+0x5c/0x68
>
> This occurs because, during reboot-after-hotplug, the hotplugged
> memory range gets initialized as regular memory and page
> tables are setup using memblock allocator. This means that we
> wouldn't have initialized the PMD or PTE fragment count for
> those PMD or PTE pages.
>
> Fixing this includes 3 aspects:
>
> - Walk the init_mm page tables from mem_init() and initialize
>   the PMD and PTE fragment counts appropriately.
> - When we do early allocation of PMD (and PGD as well) pages,
>   allocate in page size PAGE_SIZE granularity so that we are
>   sure that the complete page is available for us to set the
>   fragment count which is part of struct page.


That is an important change now. For early page table we now allocate
PAGE_SIZE tables and hencec we consider then as pages with fragment
count 1. You also may want to explain here why. I guess the challenge is
due to the fact that we can't clearly control how the rest of the page
will get used and we are not sure they all will be allocated for backing
page table pages.

> - When PMD or PTE page is freed, check if it comes from memblock
>   allocator and free it appropriately.
>
> Reported-by: Srikanth Aithal <sraithal@linux.vnet.ibm.com>
> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> ---
>  arch/powerpc/include/asm/book3s/64/radix.h |  1 +
>  arch/powerpc/include/asm/sparsemem.h       |  1 +
>  arch/powerpc/mm/book3s64/pgtable.c         | 12 +++-
>  arch/powerpc/mm/book3s64/radix_pgtable.c   | 67 +++++++++++++++++++++-
>  arch/powerpc/mm/mem.c                      |  5 ++
>  arch/powerpc/mm/pgtable-frag.c             |  5 +-
>  6 files changed, 87 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
> index 574eca33f893..4320f2790e8d 100644
> --- a/arch/powerpc/include/asm/book3s/64/radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/radix.h
> @@ -285,6 +285,7 @@ static inline unsigned long radix__get_tree_size(void)
>  #ifdef CONFIG_MEMORY_HOTPLUG
>  int radix__create_section_mapping(unsigned long start, unsigned long end, int nid);
>  int radix__remove_section_mapping(unsigned long start, unsigned long end);
> +void radix__fixup_pgtable_fragments(void);
>  #endif /* CONFIG_MEMORY_HOTPLUG */
>  #endif /* __ASSEMBLY__ */
>  #endif
> diff --git a/arch/powerpc/include/asm/sparsemem.h b/arch/powerpc/include/asm/sparsemem.h
> index 3192d454a733..e662f9232d35 100644
> --- a/arch/powerpc/include/asm/sparsemem.h
> +++ b/arch/powerpc/include/asm/sparsemem.h
> @@ -15,6 +15,7 @@
>  #ifdef CONFIG_MEMORY_HOTPLUG
>  extern int create_section_mapping(unsigned long start, unsigned long end, int nid);
>  extern int remove_section_mapping(unsigned long start, unsigned long end);
> +void fixup_pgtable_fragments(void);
>
>  #ifdef CONFIG_PPC_BOOK3S_64
>  extern int resize_hpt_for_hotplug(unsigned long new_mem_size);
> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> index 01bc9663360d..7efe9cc16b39 100644
> --- a/arch/powerpc/mm/book3s64/pgtable.c
> +++ b/arch/powerpc/mm/book3s64/pgtable.c
> @@ -186,6 +186,13 @@ int __meminit remove_section_mapping(unsigned long start, unsigned long end)
>
>  	return hash__remove_section_mapping(start, end);
>  }
> +
> +void fixup_pgtable_fragments(void)
> +{
> +	if (radix_enabled())
> +		radix__fixup_pgtable_fragments();
> +}
> +
>  #endif /* CONFIG_MEMORY_HOTPLUG */
>
>  void __init mmu_partition_table_init(void)
> @@ -320,7 +327,10 @@ void pmd_fragment_free(unsigned long *pmd)
>  	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
>  	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
>  		pgtable_pmd_page_dtor(page);
> -		__free_page(page);
> +		if (PageReserved(page))
> +			free_reserved_page(page);
> +		else
> +			__free_page(page);
>  	}
>  }
>
> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
> index 273ae66a9a45..402e8da28cab 100644
> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
> @@ -32,6 +32,69 @@
>  unsigned int mmu_pid_bits;
>  unsigned int mmu_base_pid;
>
> +static void fixup_pmd_fragments(pmd_t *pmd)
> +{
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PMD; i++, pmd++) {
> +		pte_t *pte;
> +		struct page *page;
> +
> +		if (pmd_none(*pmd))
> +			continue;
> +		if (pmd_huge(*pmd))
> +			continue;
> +
> +		pte = pte_offset_kernel(pmd, 0);
> +		if (!pte_none(*pte)) {
> +			page = virt_to_page(pte);
> +			atomic_inc(&page->pt_frag_refcount);
> +		}
> +	}
> +}


That naming is confusing. You are fixing up fragemts used for level 4
table here. I would call them pte fragments? 


> +
> +static void fixup_pud_fragments(pud_t *pud)
> +{
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PUD; i++, pud++) {
> +		pmd_t *pmd;
> +		struct page *page;
> +
> +		if (pud_none(*pud))
> +			continue;
> +		if (pud_huge(*pud))
> +			continue;
> +
> +		pmd = pmd_offset(pud, 0);
> +		if (!pmd_none(*pmd)) {
> +			page = virt_to_page(pmd);
> +			atomic_inc(&page->pt_frag_refcount);
> +		}
> +		fixup_pmd_fragments(pmd);
> +	}
> +}
> +


How do we handle pud table free. That is going to be tricky for general
allocation they are allocated out of slab and we free them back to slab.
With pages allocated out of memblock, we need to special case them? 


> +void radix__fixup_pgtable_fragments(void)
> +{
> +	int i;
> +	pgd_t *pgd = pgd_offset_k(0UL);
> +
> +	spin_lock(&init_mm.page_table_lock);
> +	for (i = 0; i < PTRS_PER_PGD; i++, pgd++) {
> +		pud_t *pud;
> +
> +		if (pgd_none(*pgd))
> +			continue;
> +		if (pgd_huge(*pgd))
> +			continue;
> +
> +		pud = pud_offset(pgd, 0);
> +		fixup_pud_fragments(pud);
> +	}
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +
>  static int native_register_process_table(unsigned long base, unsigned long pg_sz,
>  					 unsigned long table_size)
>  {
> @@ -80,7 +143,7 @@ static int early_map_kernel_page(unsigned long ea, unsigned long pa,
>
>  	pgdp = pgd_offset_k(ea);
>  	if (pgd_none(*pgdp)) {
> -		pudp = early_alloc_pgtable(PUD_TABLE_SIZE, nid,
> +		pudp = early_alloc_pgtable(PAGE_SIZE, nid,
>  						region_start, region_end);
>  		pgd_populate(&init_mm, pgdp, pudp);
>  	}
> @@ -90,7 +153,7 @@ static int early_map_kernel_page(unsigned long ea, unsigned long pa,
>  		goto set_the_pte;
>  	}
>  	if (pud_none(*pudp)) {
> -		pmdp = early_alloc_pgtable(PMD_TABLE_SIZE, nid,
> +		pmdp = early_alloc_pgtable(PAGE_SIZE, nid,
>  						region_start, region_end);
>  		pud_populate(&init_mm, pudp, pmdp);
>  	}
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index cba29131bccc..a8788b404266 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -51,6 +51,10 @@
>
>  #include <mm/mmu_decl.h>
>
> +void __weak fixup_pgtable_fragments(void)
> +{
> +}
> +
>  #ifndef CPU_FTR_COHERENT_ICACHE
>  #define CPU_FTR_COHERENT_ICACHE	0	/* XXX for now */
>  #define CPU_FTR_NOEXECUTE	0
> @@ -276,6 +280,7 @@ void __init mem_init(void)
>  	set_max_mapnr(max_pfn);
>  	memblock_free_all();
>
> +	fixup_pgtable_fragments();
>  #ifdef CONFIG_HIGHMEM
>  	{
>  		unsigned long pfn, highmem_mapnr;
> diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
> index a7b05214760c..694de7c731aa 100644
> --- a/arch/powerpc/mm/pgtable-frag.c
> +++ b/arch/powerpc/mm/pgtable-frag.c
> @@ -114,6 +114,9 @@ void pte_fragment_free(unsigned long *table, int kernel)
>  	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
>  		if (!kernel)
>  			pgtable_page_dtor(page);
> -		__free_page(page);
> +		if (PageReserved(page))
> +			free_reserved_page(page);
> +		else
> +			__free_page(page);
>  	}
>  }
> -- 
> 2.17.1


^ permalink raw reply

* Re: [PATCH 06/28] powerpc/64s/exception: remove the "extra" macro parameter
From: Nicholas Piggin @ 2019-06-19  8:40 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <20190611143040.7834-7-npiggin@gmail.com>

Nicholas Piggin's on June 12, 2019 12:30 am:
> @@ -265,7 +275,7 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
>  EXC_REAL_END(machine_check, 0x200, 0x100)
>  EXC_VIRT_NONE(0x4200, 0x100)
>  TRAMP_REAL_BEGIN(machine_check_common_early)
> -	EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
> +	EXCEPTION_PROLOG_1 EXC_HV, PACA_EXMC, 0, 0x200
>  	/*
>  	 * Register contents:
>  	 * R13		= PACA

This is a little bug here, machine check is an EXC_STD exception. It
does not show up as generated code problem because EXCEPTION_PROLOG_1
does not actually do anything with this parameter if KVM is false,
which it is here.

Still, it's wrong. I may just resend the series, because it caused a
few conflicts in subsequent patches, and I have a few more to add to
the end.

Thanks,
Nick


^ permalink raw reply

* Re: [PATCH 5/7] powerpc/ftrace: Update ftrace_location() for powerpc -mprofile-kernel
From: Naveen N. Rao @ 2019-06-19  7:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Nicholas Piggin, Masami Hiramatsu, linuxppc-dev,
	Ingo Molnar
In-Reply-To: <20190618143234.78539805@gandalf.local.home>

Steven Rostedt wrote:
> On Tue, 18 Jun 2019 23:53:11 +0530
> "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> wrote:
> 
>> Naveen N. Rao wrote:
>> > Steven Rostedt wrote:  
>> >> On Tue, 18 Jun 2019 20:17:04 +0530
>> >> "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> wrote:
>> >>   
>> >>> @@ -1551,7 +1551,7 @@ unsigned long ftrace_location_range(unsigned long start, unsigned long end)
>> >>>  	key.flags = end;	/* overload flags, as it is unsigned long */
>> >>>  
>> >>>  	for (pg = ftrace_pages_start; pg; pg = pg->next) {
>> >>> -		if (end < pg->records[0].ip ||
>> >>> +		if (end <= pg->records[0].ip ||  
>> >> 
>> >> This breaks the algorithm. "end" is inclusive. That is, if you look for
>> >> a single byte, where "start" and "end" are the same, and it happens to
>> >> be the first ip on the pg page, it will be skipped, and not found.  
>> > 
>> > Thanks. It looks like I should be over-riding ftrace_location() instead.  
>> > I will update this patch.  
>> 
>> I think I will have ftrace own the two instruction range, regardless of 
>> whether the preceding instruction is a 'mflr r0' or not. This simplifies 
>> things and I don't see an issue with it as of now. I will do more 
>> testing to confirm.
>> 
>> - Naveen
>> 
>> 
>> --- a/arch/powerpc/kernel/trace/ftrace.c
>> +++ b/arch/powerpc/kernel/trace/ftrace.c
>> @@ -951,6 +951,16 @@ void arch_ftrace_update_code(int command)
>>  }
>>  
>>  #ifdef CONFIG_MPROFILE_KERNEL
>> +/*
>> + * We consider two instructions -- 'mflr r0', 'bl _mcount' -- to be part
>> + * of ftrace. When checking for the first instruction, we want to include
>> + * the next instruction in the range check.
>> + */
>> +unsigned long ftrace_location(unsigned long ip)
>> +{
>> +	return ftrace_location_range(ip, ip + MCOUNT_INSN_SIZE);
>> +}
>> +
>>  /* Returns 1 if we patched in the mflr */
>>  static int __ftrace_make_call_prep(struct dyn_ftrace *rec)
>>  {
>> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
>> index 21d8e201ee80..122e2bb4a739 100644
>> --- a/kernel/trace/ftrace.c
>> +++ b/kernel/trace/ftrace.c
>> @@ -1573,7 +1573,7 @@ unsigned long ftrace_location_range(unsigned long start, unsigned long end)
>>   * the function tracer. It checks the ftrace internal tables to
>>   * determine if the address belongs or not.
>>   */
>> -unsigned long ftrace_location(unsigned long ip)
>> +unsigned long __weak ftrace_location(unsigned long ip)
>>  {
>>  	return ftrace_location_range(ip, ip);
>>  }
> 
> Actually, instead of making this a weak function, let's do this:
> 
> 
> In include/ftrace.h:
> 
> #ifndef FTRACE_IP_EXTENSION
> # define FTRACE_IP_EXTENSION	0
> #endif
> 
> 
> In arch/powerpc/include/asm/ftrace.h
> 
> #define FTRACE_IP_EXTENSION	MCOUNT_INSN_SIZE
> 
> 
> Then we can just have:
> 
> unsigned long ftrace_location(unsigned long ip)
> {
> 	return ftrace_location_range(ip, ip + FTRACE_IP_EXTENSION);
> }

Thanks, that's indeed nice. I hope you don't mind me adding your SOB for 
that.

- Naveen



^ permalink raw reply

* Re: [PATCH 0/5] Powerpc/hw-breakpoint: Fixes plus Code refactor
From: Ravi Bangoria @ 2019-06-19  7:47 UTC (permalink / raw)
  To: Michael Neuling, Christophe Leroy
  Cc: Ravi Bangoria, linux-kernel, npiggin, paulus, naveen.n.rao,
	linuxppc-dev
In-Reply-To: <85d5494d2a5d4a887e739c379105436e498217a8.camel@neuling.org>



On 6/18/19 11:47 AM, Michael Neuling wrote:
> On Tue, 2019-06-18 at 08:01 +0200, Christophe Leroy wrote:
>>
>> Le 18/06/2019 à 06:27, Ravi Bangoria a écrit :
>>> patch 1-3: Code refactor
>>> patch 4: Speedup disabling breakpoint
>>> patch 5: Fix length calculation for unaligned targets
>>
>> While you are playing with hw breakpoints, did you have a look at 
>> https://github.com/linuxppc/issues/issues/38 ?
> 
> Agreed and also: 
> 
> https://github.com/linuxppc/issues/issues/170
> 
> https://github.com/linuxppc/issues/issues/128 
> 

Yes, I'm aware of those. Will have a look at them.


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox