LinuxPPC-Dev Archive on lore.kernel.org


* [PATCH 05/13] powerpc/xive: Fix allocation of pages donated to the XIVE controller
From: Cédric Le Goater @ 2020-12-08 15:11 UTC
  To: linuxppc-dev; +Cc: Cédric Le Goater
In-Reply-To: <20201208151124.1329942-1-clg@kaod.org>

The XIVE interrupt controller uses a set of internal tables to handle
interrupt routing. The small tables (SBE, EAT) are direct tables and
allocated by OPAL. The bigger ones (NVT, ENDT) are indirect, i.e., the
first table entries point to a page which contains the XIVE structures
used by HW. For these, OPAL only allocates the first page level and
requests pages to be allocated by Linux when a new entry is inserted.

Make sure that these pages cannot be reclaimed. The problem can be
seen while stressing a system with 4K KVM guests, 16 vCPUs each.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/sysdev/xive/xive-internal.h | 2 ++
 arch/powerpc/sysdev/xive/common.c        | 2 +-
 arch/powerpc/sysdev/xive/native.c        | 2 +-
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/xive-internal.h b/arch/powerpc/sysdev/xive/xive-internal.h
index d701af7fb48c..1eacc90f4dcf 100644
--- a/arch/powerpc/sysdev/xive/xive-internal.h
+++ b/arch/powerpc/sysdev/xive/xive-internal.h
@@ -73,6 +73,8 @@ static inline u32 xive_alloc_order(u32 queue_shift)
 	return (queue_shift > PAGE_SHIFT) ? (queue_shift - PAGE_SHIFT) : 0;
 }
 
+#define XIVE_GFP (__GFP_NORETRY | __GFP_NOWARN | __GFP_NOMEMALLOC)
+
 extern bool xive_cmdline_disabled;
 
 #endif /*  __XIVE_INTERNAL_H */
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 605238ca65e4..80fd97d764ab 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1544,7 +1544,7 @@ __be32 *xive_queue_page_alloc(unsigned int cpu, u32 queue_shift)
 	__be32 *qpage;
 
 	alloc_order = xive_alloc_order(queue_shift);
-	pages = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, alloc_order);
+	pages = alloc_pages_node(cpu_to_node(cpu), XIVE_GFP, alloc_order);
 	if (!pages)
 		return ERR_PTR(-ENOMEM);
 	qpage = (__be32 *)page_address(pages);
diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c
index cb58ec7ce77a..6afb44d0d816 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -643,7 +643,7 @@ static bool xive_native_provision_pages(void)
 		 * XXX TODO: Try to make the allocation local to the node where
 		 * the chip resides.
 		 */
-		p = kmem_cache_alloc(xive_provision_cache, GFP_KERNEL);
+		p = kmem_cache_alloc(xive_provision_cache, XIVE_GFP);
 		if (!p) {
 			pr_err("Failed to allocate provisioning page\n");
 			return false;
-- 
2.26.2
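
For readers new to the indirect tables described in the commit message,
here is a toy userspace model of the idea: a first level of pointers,
with backing pages provisioned only when an entry is first touched. All
names and sizes (toy_indirect_table, TOY_L1_ENTRIES, TOY_PAGE_ENTRIES,
toy_lookup) are hypothetical, and calloc() stands in for the page that
Linux donates on OPAL's request. It is only a sketch of the concept, not
the XIVE or OPAL data layout.

```c
#include <stdlib.h>

#define TOY_L1_ENTRIES   8	/* hypothetical first-level size */
#define TOY_PAGE_ENTRIES 16	/* hypothetical entries per donated page */

struct toy_indirect_table {
	int *l1[TOY_L1_ENTRIES];	/* first level: pointers to donated pages */
	unsigned int donated;		/* pages provisioned so far */
};

/* Return the slot for idx, provisioning the backing page on first use. */
static int *toy_lookup(struct toy_indirect_table *t, unsigned int idx)
{
	unsigned int l1 = idx / TOY_PAGE_ENTRIES;

	if (l1 >= TOY_L1_ENTRIES)
		return NULL;

	if (!t->l1[l1]) {
		/* Stand-in for Linux donating a page when OPAL asks for one. */
		t->l1[l1] = calloc(TOY_PAGE_ENTRIES, sizeof(int));
		if (!t->l1[l1])
			return NULL;
		t->donated++;
	}

	return &t->l1[l1][idx % TOY_PAGE_ENTRIES];
}
```

Once a page is linked into the first level, the table keeps pointing at
it. That is why the donated pages must stay resident.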



* Re: [PATCH 14/20] ethernet: ucc_geth: don't statically allocate eight ucc_geth_info
From: Christophe Leroy @ 2020-12-08 15:13 UTC
  To: Rasmus Villemoes, Li Yang, David S. Miller, Jakub Kicinski
  Cc: Vladimir Oltean, Zhao Qiang, linuxppc-dev, linux-kernel, netdev
In-Reply-To: <20201205191744.7847-15-rasmus.villemoes@prevas.dk>



On 05/12/2020 at 20:17, Rasmus Villemoes wrote:
> struct ucc_geth_info is somewhat large, and on systems with only one
> or two UCC instances, that just wastes a few KB of memory. So
> allocate and populate a chunk of memory at probe time instead of
> initializing them all during driver init.
> 
> Note that the existing "ug_info == NULL" check was dead code, as the
> address of some static array element can obviously never be NULL.
> 
> Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
> ---
>   drivers/net/ethernet/freescale/ucc_geth.c | 32 +++++++++--------------
>   1 file changed, 12 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
> index a06744d8b4af..273342233bba 100644
> --- a/drivers/net/ethernet/freescale/ucc_geth.c
> +++ b/drivers/net/ethernet/freescale/ucc_geth.c
> @@ -157,8 +157,6 @@ static const struct ucc_geth_info ugeth_primary_info = {
>   	.riscRx = QE_RISC_ALLOCATION_RISC1_AND_RISC2,
>   };
>   
> -static struct ucc_geth_info ugeth_info[8];
> -
>   #ifdef DEBUG
>   static void mem_disp(u8 *addr, int size)
>   {
> @@ -3714,25 +3712,23 @@ static int ucc_geth_probe(struct platform_device* ofdev)
>   	if ((ucc_num < 0) || (ucc_num > 7))
>   		return -ENODEV;
>   
> -	ug_info = &ugeth_info[ucc_num];
> -	if (ug_info == NULL) {
> -		if (netif_msg_probe(&debug))
> -			pr_err("[%d] Missing additional data!\n", ucc_num);
> -		return -ENODEV;
> -	}
> +	ug_info = kmalloc(sizeof(*ug_info), GFP_KERNEL);

Could we use devm_kmalloc() instead, to avoid the freeing on the way out and in the err_free_info: path?

> +	if (ug_info == NULL)
> +		return -ENOMEM;
> +	memcpy(ug_info, &ugeth_primary_info, sizeof(*ug_info));
>   
>   	ug_info->uf_info.ucc_num = ucc_num;
>   
>   	err = ucc_geth_parse_clock(np, "rx", &ug_info->uf_info.rx_clock);
>   	if (err)
> -		return err;
> +		goto err_free_info;
>   	err = ucc_geth_parse_clock(np, "tx", &ug_info->uf_info.tx_clock);
>   	if (err)
> -		return err;
> +		goto err_free_info;
>   
>   	err = of_address_to_resource(np, 0, &res);
>   	if (err)
> -		return -EINVAL;
> +		goto err_free_info;
>   
>   	ug_info->uf_info.regs = res.start;
>   	ug_info->uf_info.irq = irq_of_parse_and_map(np, 0);
> @@ -3745,7 +3741,7 @@ static int ucc_geth_probe(struct platform_device* ofdev)
>   		 */
>   		err = of_phy_register_fixed_link(np);
>   		if (err)
> -			return err;
> +			goto err_free_info;
>   		ug_info->phy_node = of_node_get(np);
>   	}
>   
> @@ -3876,6 +3872,8 @@ static int ucc_geth_probe(struct platform_device* ofdev)
>   		of_phy_deregister_fixed_link(np);
>   	of_node_put(ug_info->tbi_node);
>   	of_node_put(ug_info->phy_node);
> +err_free_info:
> +	kfree(ug_info);
>   
>   	return err;
>   }
> @@ -3886,6 +3884,7 @@ static int ucc_geth_remove(struct platform_device* ofdev)
>   	struct ucc_geth_private *ugeth = netdev_priv(dev);
>   	struct device_node *np = ofdev->dev.of_node;
>   
> +	kfree(ugeth->ug_info);
>   	ucc_geth_memclean(ugeth);
>   	if (of_phy_is_fixed_link(np))
>   		of_phy_deregister_fixed_link(np);
> @@ -3920,17 +3919,10 @@ static struct platform_driver ucc_geth_driver = {
>   
>   static int __init ucc_geth_init(void)
>   {
> -	int i, ret;
> -
>   	if (netif_msg_drv(&debug))
>   		pr_info(DRV_DESC "\n");
> -	for (i = 0; i < 8; i++)
> -		memcpy(&(ugeth_info[i]), &ugeth_primary_info,
> -		       sizeof(ugeth_primary_info));
> -
> -	ret = platform_driver_register(&ucc_geth_driver);
>   
> -	return ret;
> +	return platform_driver_register(&ucc_geth_driver);
>   }
>   
>   static void __exit ucc_geth_exit(void)
> 
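
The allocate, copy-from-template and unwind-on-error pattern in the
quoted probe function can be sketched in userspace as follows. This is
only an illustration: struct toy_info, toy_primary_info, toy_probe() and
the fail_parse switch are hypothetical stand-ins, with malloc()/free()
in place of kmalloc()/kfree().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct ucc_geth_info. */
struct toy_info {
	int ucc_num;
	int rx_clock;
};

/* Read-only template, copied into every per-instance allocation. */
static const struct toy_info toy_primary_info = { .ucc_num = -1, .rx_clock = 0 };

/* Returns 0 and stores a per-instance copy in *out, or a negative value. */
static int toy_probe(int ucc_num, int fail_parse, struct toy_info **out)
{
	struct toy_info *info;
	int err;

	info = malloc(sizeof(*info));
	if (!info)
		return -1;
	memcpy(info, &toy_primary_info, sizeof(*info));
	info->ucc_num = ucc_num;

	if (fail_parse) {	/* stand-in for a failing parse step */
		err = -2;
		goto err_free_info;
	}

	*out = info;
	return 0;

err_free_info:
	free(info);
	return err;
}
```

Every early exit after the allocation has to go through the
err_free_info label, and the success path hands ownership to the caller,
who must free it on removal.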


* [PATCH 06/13] powerpc/xive: Add a name to the IRQ domain
From: Cédric Le Goater @ 2020-12-08 15:11 UTC
  To: linuxppc-dev; +Cc: Cédric Le Goater
In-Reply-To: <20201208151124.1329942-1-clg@kaod.org>

We hope one day to handle multiple irq_domains in the XIVE driver.
Start simple by setting the name using the DT node.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/sysdev/xive/xive-internal.h |  4 ++--
 arch/powerpc/sysdev/xive/common.c        | 10 +++++-----
 arch/powerpc/sysdev/xive/native.c        |  2 +-
 arch/powerpc/sysdev/xive/spapr.c         |  2 +-
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/xive-internal.h b/arch/powerpc/sysdev/xive/xive-internal.h
index 1eacc90f4dcf..066d6fe3dc1d 100644
--- a/arch/powerpc/sysdev/xive/xive-internal.h
+++ b/arch/powerpc/sysdev/xive/xive-internal.h
@@ -63,8 +63,8 @@ struct xive_ops {
 	const char *name;
 };
 
-bool xive_core_init(const struct xive_ops *ops, void __iomem *area, u32 offset,
-		    u8 max_prio);
+bool xive_core_init(struct device_node *np, const struct xive_ops *ops,
+		    void __iomem *area, u32 offset, u8 max_prio);
 __be32 *xive_queue_page_alloc(unsigned int cpu, u32 queue_shift);
 int xive_core_debug_init(void);
 
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 80fd97d764ab..721617f0f854 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1310,9 +1310,9 @@ static const struct irq_domain_ops xive_irq_domain_ops = {
 	.xlate = xive_irq_domain_xlate,
 };
 
-static void __init xive_init_host(void)
+static void __init xive_init_host(struct device_node *np)
 {
-	xive_irq_domain = irq_domain_add_nomap(NULL, XIVE_MAX_IRQ,
+	xive_irq_domain = irq_domain_add_nomap(np, XIVE_MAX_IRQ,
 					       &xive_irq_domain_ops, NULL);
 	if (WARN_ON(xive_irq_domain == NULL))
 		return;
@@ -1508,8 +1508,8 @@ void xive_shutdown(void)
 	xive_ops->shutdown();
 }
 
-bool __init xive_core_init(const struct xive_ops *ops, void __iomem *area, u32 offset,
-			   u8 max_prio)
+bool __init xive_core_init(struct device_node *np, const struct xive_ops *ops,
+			   void __iomem *area, u32 offset, u8 max_prio)
 {
 	xive_tima = area;
 	xive_tima_offset = offset;
@@ -1520,7 +1520,7 @@ bool __init xive_core_init(const struct xive_ops *ops, void __iomem *area, u32 o
 	__xive_enabled = true;
 
 	pr_devel("Initializing host..\n");
-	xive_init_host();
+	xive_init_host(np);
 
 	pr_devel("Initializing boot CPU..\n");
 
diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c
index 6afb44d0d816..5f1e5aed8ab4 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -622,7 +622,7 @@ bool __init xive_native_init(void)
 	xive_native_setup_pools();
 
 	/* Initialize XIVE core with our backend */
-	if (!xive_core_init(&xive_native_ops, tima, TM_QW3_HV_PHYS,
+	if (!xive_core_init(np, &xive_native_ops, tima, TM_QW3_HV_PHYS,
 			    max_prio)) {
 		opal_xive_reset(OPAL_XIVE_MODE_EMU);
 		return false;
diff --git a/arch/powerpc/sysdev/xive/spapr.c b/arch/powerpc/sysdev/xive/spapr.c
index 1e3674d7ea7b..6610e5149d5a 100644
--- a/arch/powerpc/sysdev/xive/spapr.c
+++ b/arch/powerpc/sysdev/xive/spapr.c
@@ -857,7 +857,7 @@ bool __init xive_spapr_init(void)
 	}
 
 	/* Initialize XIVE core with our backend */
-	if (!xive_core_init(&xive_spapr_ops, tima, TM_QW1_OS, max_prio))
+	if (!xive_core_init(np, &xive_spapr_ops, tima, TM_QW1_OS, max_prio))
 		return false;
 
 	pr_info("Using %dkB queues\n", 1 << (xive_queue_shift - 10));
-- 
2.26.2



* [PATCH 13/13] powerpc/xive: Improve error reporting of OPAL calls
From: Cédric Le Goater @ 2020-12-08 15:11 UTC
  To: linuxppc-dev; +Cc: Cédric Le Goater
In-Reply-To: <20201208151124.1329942-1-clg@kaod.org>

Introduce a vp_err() macro to standardize error reporting.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/sysdev/xive/native.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c
index 4902d05ebbd1..42297a131a6e 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -122,6 +122,8 @@ static int xive_native_get_irq_config(u32 hw_irq, u32 *target, u8 *prio,
 	return rc == 0 ? 0 : -ENXIO;
 }
 
+#define vp_err(vp, fmt, ...) pr_err("VP[0x%x]: " fmt, vp, ##__VA_ARGS__)
+
 /* This can be called multiple time to change a queue configuration */
 int xive_native_configure_queue(u32 vp_id, struct xive_q *q, u8 prio,
 				__be32 *qpage, u32 order, bool can_escalate)
@@ -149,7 +151,7 @@ int xive_native_configure_queue(u32 vp_id, struct xive_q *q, u8 prio,
 				      &esc_irq_be,
 				      NULL);
 	if (rc) {
-		pr_err("Error %lld getting queue info prio %d\n", rc, prio);
+		vp_err(vp_id, "Failed to get queue %d info : %lld\n", prio, rc);
 		rc = -EIO;
 		goto fail;
 	}
@@ -172,7 +174,7 @@ int xive_native_configure_queue(u32 vp_id, struct xive_q *q, u8 prio,
 		msleep(OPAL_BUSY_DELAY_MS);
 	}
 	if (rc) {
-		pr_err("Error %lld setting queue for prio %d\n", rc, prio);
+		vp_err(vp_id, "Failed to set queue %d info: %lld\n", prio, rc);
 		rc = -EIO;
 	} else {
 		/*
@@ -199,7 +201,7 @@ static void __xive_native_disable_queue(u32 vp_id, struct xive_q *q, u8 prio)
 		msleep(OPAL_BUSY_DELAY_MS);
 	}
 	if (rc)
-		pr_err("Error %lld disabling queue for prio %d\n", rc, prio);
+		vp_err(vp_id, "Failed to disable queue %d : %lld\n", prio, rc);
 }
 
 void xive_native_disable_queue(u32 vp_id, struct xive_q *q, u8 prio)
@@ -698,6 +700,8 @@ int xive_native_enable_vp(u32 vp_id, bool single_escalation)
 			break;
 		msleep(OPAL_BUSY_DELAY_MS);
 	}
+	if (rc)
+		vp_err(vp_id, "Failed to enable VP : %lld\n", rc);
 	return rc ? -EIO : 0;
 }
 EXPORT_SYMBOL_GPL(xive_native_enable_vp);
@@ -712,6 +716,8 @@ int xive_native_disable_vp(u32 vp_id)
 			break;
 		msleep(OPAL_BUSY_DELAY_MS);
 	}
+	if (rc)
+		vp_err(vp_id, "Failed to disable VP : %lld\n", rc);
 	return rc ? -EIO : 0;
 }
 EXPORT_SYMBOL_GPL(xive_native_disable_vp);
@@ -723,8 +729,10 @@ int xive_native_get_vp_info(u32 vp_id, u32 *out_cam_id, u32 *out_chip_id)
 	s64 rc;
 
 	rc = opal_xive_get_vp_info(vp_id, NULL, &vp_cam_be, NULL, &vp_chip_id_be);
-	if (rc)
+	if (rc) {
+		vp_err(vp_id, "Failed to get VP info : %lld\n", rc);
 		return -EIO;
+	}
 	*out_cam_id = be64_to_cpu(vp_cam_be) & 0xffffffffu;
 	*out_chip_id = be32_to_cpu(vp_chip_id_be);
 
@@ -755,8 +763,7 @@ int xive_native_get_queue_info(u32 vp_id, u32 prio,
 	rc = opal_xive_get_queue_info(vp_id, prio, &qpage, &qsize,
 				      &qeoi_page, &escalate_irq, &qflags);
 	if (rc) {
-		pr_err("OPAL failed to get queue info for VCPU %d/%d : %lld\n",
-		       vp_id, prio, rc);
+		vp_err(vp_id, "failed to get queue %d info : %lld\n", prio, rc);
 		return -EIO;
 	}
 
@@ -784,8 +791,7 @@ int xive_native_get_queue_state(u32 vp_id, u32 prio, u32 *qtoggle, u32 *qindex)
 	rc = opal_xive_get_queue_state(vp_id, prio, &opal_qtoggle,
 				       &opal_qindex);
 	if (rc) {
-		pr_err("OPAL failed to get queue state for VCPU %d/%d : %lld\n",
-		       vp_id, prio, rc);
+		vp_err(vp_id, "failed to get queue %d state : %lld\n", prio, rc);
 		return -EIO;
 	}
 
@@ -804,8 +810,7 @@ int xive_native_set_queue_state(u32 vp_id, u32 prio, u32 qtoggle, u32 qindex)
 
 	rc = opal_xive_set_queue_state(vp_id, prio, qtoggle, qindex);
 	if (rc) {
-		pr_err("OPAL failed to set queue state for VCPU %d/%d : %lld\n",
-		       vp_id, prio, rc);
+		vp_err(vp_id, "failed to set queue %d state : %lld\n", prio, rc);
 		return -EIO;
 	}
 
@@ -827,8 +832,7 @@ int xive_native_get_vp_state(u32 vp_id, u64 *out_state)
 
 	rc = opal_xive_get_vp_state(vp_id, &state);
 	if (rc) {
-		pr_err("OPAL failed to get vp state for VCPU %d : %lld\n",
-		       vp_id, rc);
+		vp_err(vp_id, "failed to get vp state : %lld\n", rc);
 		return -EIO;
 	}
 
-- 
2.26.2
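
To see what the vp_err() prefix expands to, here is a small userspace
sketch of the same macro shape. vp_err_buf() is a hypothetical variant
that formats into a caller-supplied buffer with snprintf() instead of
calling pr_err(). Like the kernel macro, it relies on the GNU
`##__VA_ARGS__` extension (accepted by gcc and clang).

```c
#include <stdio.h>

/*
 * Userspace stand-in for vp_err(): same "VP[0x%x]: " prefix, but the
 * message is written into buf rather than sent to the kernel log.
 */
#define vp_err_buf(buf, len, vp, fmt, ...) \
	snprintf(buf, len, "VP[0x%x]: " fmt, vp, ##__VA_ARGS__)
```

For example, vp_err_buf(buf, sizeof(buf), 0x80u, "Failed to enable VP : %lld\n", -5LL)
leaves "VP[0x80]: Failed to enable VP : -5\n" in buf.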



* [PATCH 10/13] powerpc/xive: Remove P9 DD1 flag XIVE_IRQ_FLAG_MASK_FW
From: Cédric Le Goater @ 2020-12-08 15:11 UTC
  To: linuxppc-dev; +Cc: Cédric Le Goater
In-Reply-To: <20201208151124.1329942-1-clg@kaod.org>

This flag was used to support the PHB4 LSIs on P9 DD1, and we
stopped supporting this CPU when DD2 came out. See skiboot commit:

  https://github.com/open-power/skiboot/commit/0b0d15e3c170

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/asm/opal-api.h |  2 +-
 arch/powerpc/include/asm/xive.h     |  2 +-
 arch/powerpc/kvm/book3s_xive.c      | 54 +++++------------------------
 arch/powerpc/sysdev/xive/common.c   | 39 +--------------------
 arch/powerpc/sysdev/xive/native.c   |  2 --
 5 files changed, 11 insertions(+), 88 deletions(-)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 48ee604ca39a..0455b679c050 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -1092,7 +1092,7 @@ enum {
 	OPAL_XIVE_IRQ_STORE_EOI		= 0x00000002,
 	OPAL_XIVE_IRQ_LSI		= 0x00000004,
 	OPAL_XIVE_IRQ_SHIFT_BUG		= 0x00000008, /* P9 DD1.0 workaround */
-	OPAL_XIVE_IRQ_MASK_VIA_FW	= 0x00000010,
+	OPAL_XIVE_IRQ_MASK_VIA_FW	= 0x00000010, /* P9 DD1.0 workaround */
 	OPAL_XIVE_IRQ_EOI_VIA_FW	= 0x00000020,
 };
 
diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index ff805885a028..d62368d0ba91 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -61,7 +61,7 @@ struct xive_irq_data {
 #define XIVE_IRQ_FLAG_STORE_EOI	0x01
 #define XIVE_IRQ_FLAG_LSI	0x02
 #define XIVE_IRQ_FLAG_SHIFT_BUG	0x04 /* P9 DD1.0 workaround */
-#define XIVE_IRQ_FLAG_MASK_FW	0x08
+#define XIVE_IRQ_FLAG_MASK_FW	0x08 /* P9 DD1.0 workaround */
 #define XIVE_IRQ_FLAG_EOI_FW	0x10
 #define XIVE_IRQ_FLAG_H_INT_ESB	0x20
 
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index fae1c2e8da29..59a986ae640b 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -419,37 +419,16 @@ static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
 	/* Get the right irq */
 	kvmppc_xive_select_irq(state, &hw_num, &xd);
 
-	/*
-	 * If the interrupt is marked as needing masking via
-	 * firmware, we do it here. Firmware masking however
-	 * is "lossy", it won't return the old p and q bits
-	 * and won't set the interrupt to a state where it will
-	 * record queued ones. If this is an issue we should do
-	 * lazy masking instead.
-	 *
-	 * For now, we work around this in unmask by forcing
-	 * an interrupt whenever we unmask a non-LSI via FW
-	 * (if ever).
-	 */
-	if (xd->flags & OPAL_XIVE_IRQ_MASK_VIA_FW) {
-		xive_native_configure_irq(hw_num,
-				kvmppc_xive_vp(xive, state->act_server),
-				MASKED, state->number);
-		/* set old_p so we can track if an H_EOI was done */
-		state->old_p = true;
-		state->old_q = false;
-	} else {
-		/* Set PQ to 10, return old P and old Q and remember them */
-		val = xive_vm_esb_load(xd, XIVE_ESB_SET_PQ_10);
-		state->old_p = !!(val & 2);
-		state->old_q = !!(val & 1);
+	/* Set PQ to 10, return old P and old Q and remember them */
+	val = xive_vm_esb_load(xd, XIVE_ESB_SET_PQ_10);
+	state->old_p = !!(val & 2);
+	state->old_q = !!(val & 1);
 
-		/*
-		 * Synchronize hardware to sensure the queues are updated
-		 * when masking
+	/*
+	 * Synchronize hardware to sensure the queues are updated
+	 * when masking
 		 */
-		xive_native_sync_source(hw_num);
-	}
+	xive_native_sync_source(hw_num);
 
 	return old_prio;
 }
@@ -483,23 +462,6 @@ static void xive_finish_unmask(struct kvmppc_xive *xive,
 	/* Get the right irq */
 	kvmppc_xive_select_irq(state, &hw_num, &xd);
 
-	/*
-	 * See comment in xive_lock_and_mask() concerning masking
-	 * via firmware.
-	 */
-	if (xd->flags & OPAL_XIVE_IRQ_MASK_VIA_FW) {
-		xive_native_configure_irq(hw_num,
-				kvmppc_xive_vp(xive, state->act_server),
-				state->act_priority, state->number);
-		/* If an EOI is needed, do it here */
-		if (!state->old_p)
-			xive_vm_source_eoi(hw_num, xd);
-		/* If this is not an LSI, force a trigger */
-		if (!(xd->flags & OPAL_XIVE_IRQ_LSI))
-			xive_irq_trigger(xd);
-		goto bail;
-	}
-
 	/* Old Q set, set PQ to 11 */
 	if (state->old_q)
 		xive_vm_esb_load(xd, XIVE_ESB_SET_PQ_11);
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index a9259470bf9f..a71412fefb65 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -424,9 +424,7 @@ static void xive_irq_eoi(struct irq_data *d)
 }
 
 /*
- * Helper used to mask and unmask an interrupt source. This
- * is only called for normal interrupts that do not require
- * masking/unmasking via firmware.
+ * Helper used to mask and unmask an interrupt source.
  */
 static void xive_do_source_set_mask(struct xive_irq_data *xd,
 				    bool mask)
@@ -673,20 +671,6 @@ static void xive_irq_unmask(struct irq_data *d)
 
 	pr_devel("xive_irq_unmask: irq %d data @%p\n", d->irq, xd);
 
-	/*
-	 * This is a workaround for PCI LSI problems on P9, for
-	 * these, we call FW to set the mask. The problems might
-	 * be fixed by P9 DD2.0, if that is the case, firmware
-	 * will no longer set that flag.
-	 */
-	if (xd->flags & XIVE_IRQ_FLAG_MASK_FW) {
-		unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
-		xive_ops->configure_irq(hw_irq,
-					get_hard_smp_processor_id(xd->target),
-					xive_irq_priority, d->irq);
-		return;
-	}
-
 	xive_do_source_set_mask(xd, false);
 }
 
@@ -696,20 +680,6 @@ static void xive_irq_mask(struct irq_data *d)
 
 	pr_devel("xive_irq_mask: irq %d data @%p\n", d->irq, xd);
 
-	/*
-	 * This is a workaround for PCI LSI problems on P9, for
-	 * these, we call OPAL to set the mask. The problems might
-	 * be fixed by P9 DD2.0, if that is the case, firmware
-	 * will no longer set that flag.
-	 */
-	if (xd->flags & XIVE_IRQ_FLAG_MASK_FW) {
-		unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
-		xive_ops->configure_irq(hw_irq,
-					get_hard_smp_processor_id(xd->target),
-					0xff, d->irq);
-		return;
-	}
-
 	xive_do_source_set_mask(xd, true);
 }
 
@@ -852,13 +822,6 @@ static int xive_irq_set_vcpu_affinity(struct irq_data *d, void *state)
 	int rc;
 	u8 pq;
 
-	/*
-	 * We only support this on interrupts that do not require
-	 * firmware calls for masking and unmasking
-	 */
-	if (xd->flags & XIVE_IRQ_FLAG_MASK_FW)
-		return -EIO;
-
 	/*
 	 * This is called by KVM with state non-NULL for enabling
 	 * pass-through or NULL for disabling it
diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c
index 0310783241b5..deb97ad25d62 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -64,8 +64,6 @@ int xive_native_populate_irq_data(u32 hw_irq, struct xive_irq_data *data)
 		data->flags |= XIVE_IRQ_FLAG_STORE_EOI;
 	if (opal_flags & OPAL_XIVE_IRQ_LSI)
 		data->flags |= XIVE_IRQ_FLAG_LSI;
-	if (opal_flags & OPAL_XIVE_IRQ_MASK_VIA_FW)
-		data->flags |= XIVE_IRQ_FLAG_MASK_FW;
 	if (opal_flags & OPAL_XIVE_IRQ_EOI_VIA_FW)
 		data->flags |= XIVE_IRQ_FLAG_EOI_FW;
 	data->eoi_page = be64_to_cpu(eoi_page);
-- 
2.26.2



* [PATCH 07/13] powerpc/xive: Add a debug_show handler to the XIVE irq_domain
From: Cédric Le Goater @ 2020-12-08 15:11 UTC
  To: linuxppc-dev; +Cc: Cédric Le Goater
In-Reply-To: <20201208151124.1329942-1-clg@kaod.org>

The full state of the Linux interrupt descriptors can be dumped under
debugfs when the kernel is compiled with CONFIG_GENERIC_IRQ_DEBUGFS.
Add support for the XIVE interrupt controller.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/sysdev/xive/common.c | 58 +++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 721617f0f854..411cba12d73b 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1303,11 +1303,69 @@ static int xive_irq_domain_match(struct irq_domain *h, struct device_node *node,
 	return xive_ops->match(node);
 }
 
+#ifdef CONFIG_GENERIC_IRQ_DEBUGFS
+static const char * const esb_names[] = { "RESET", "OFF", "PENDING", "QUEUED" };
+
+static const struct {
+	u64  mask;
+	char *name;
+} xive_irq_flags[] = {
+	{ XIVE_IRQ_FLAG_STORE_EOI, "STORE_EOI" },
+	{ XIVE_IRQ_FLAG_LSI,       "LSI"       },
+	{ XIVE_IRQ_FLAG_SHIFT_BUG, "SHIFT_BUG" },
+	{ XIVE_IRQ_FLAG_MASK_FW,   "MASK_FW"   },
+	{ XIVE_IRQ_FLAG_EOI_FW,    "EOI_FW"    },
+	{ XIVE_IRQ_FLAG_H_INT_ESB, "H_INT_ESB" },
+	{ XIVE_IRQ_FLAG_NO_EOI,    "NO_EOI"    },
+};
+
+static void xive_irq_domain_debug_show(struct seq_file *m, struct irq_domain *d,
+				       struct irq_data *irqd, int ind)
+{
+	struct xive_irq_data *xd;
+	u64 val;
+	int i;
+
+	/* No IRQ domain level information. To be done */
+	if (!irqd)
+		return;
+
+	if (!is_xive_irq(irq_data_get_irq_chip(irqd)))
+		return;
+
+	seq_printf(m, "%*sXIVE:\n", ind, "");
+	ind++;
+
+	xd = irq_data_get_irq_handler_data(irqd);
+	if (!xd) {
+		seq_printf(m, "%*snot assigned\n", ind, "");
+		return;
+	}
+
+	val = xive_esb_read(xd, XIVE_ESB_GET);
+	seq_printf(m, "%*sESB:      %s\n", ind, "", esb_names[val & 0x3]);
+	seq_printf(m, "%*sPstate:   %s %s\n", ind, "", xd->stale_p ? "stale" : "",
+		   xd->saved_p ? "saved" : "");
+	seq_printf(m, "%*sTarget:   %d\n", ind, "", xd->target);
+	seq_printf(m, "%*sChip:     %d\n", ind, "", xd->src_chip);
+	seq_printf(m, "%*sTrigger:  0x%016llx\n", ind, "", xd->trig_page);
+	seq_printf(m, "%*sEOI:      0x%016llx\n", ind, "", xd->eoi_page);
+	seq_printf(m, "%*sFlags:    0x%llx\n", ind, "", xd->flags);
+	for (i = 0; i < ARRAY_SIZE(xive_irq_flags); i++) {
+		if (xd->flags & xive_irq_flags[i].mask)
+			seq_printf(m, "%*s%s\n", ind + 12, "", xive_irq_flags[i].name);
+	}
+}
+#endif
+
 static const struct irq_domain_ops xive_irq_domain_ops = {
 	.match = xive_irq_domain_match,
 	.map = xive_irq_domain_map,
 	.unmap = xive_irq_domain_unmap,
 	.xlate = xive_irq_domain_xlate,
+#ifdef CONFIG_GENERIC_IRQ_DEBUGFS
+	.debug_show = xive_irq_domain_debug_show,
+#endif
 };
 
 static void __init xive_init_host(struct device_node *np)
-- 
2.26.2
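
The decoding done by the debug_show handler (ESB state name from the low
two bits, then one name per flag bit set) can be tried out in userspace
with the sketch below. The flag values are the ones visible in the
asm/xive.h hunk elsewhere in this series. NO_EOI is left out because its
value is not shown in this thread. decode_flags() and esb_state() are
hypothetical helper names, and the output goes to a buffer rather than a
seq_file.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static const char * const esb_names[] = { "RESET", "OFF", "PENDING", "QUEUED" };

static const struct {
	uint64_t mask;
	const char *name;
} flag_names[] = {
	{ 0x01, "STORE_EOI" },
	{ 0x02, "LSI"       },
	{ 0x04, "SHIFT_BUG" },
	{ 0x08, "MASK_FW"   },
	{ 0x10, "EOI_FW"    },
	{ 0x20, "H_INT_ESB" },
};

/* Name of the ESB state held in the low two bits of val. */
static const char *esb_state(uint64_t val)
{
	return esb_names[val & 0x3];
}

/* Write the names of all set flags into buf, space separated. */
static char *decode_flags(uint64_t flags, char *buf, size_t len)
{
	size_t i, off = 0;

	if (len)
		buf[0] = '\0';

	for (i = 0; i < sizeof(flag_names) / sizeof(flag_names[0]); i++) {
		int n;

		if (!(flags & flag_names[i].mask) || off >= len)
			continue;
		n = snprintf(buf + off, len - off, "%s%s",
			     off ? " " : "", flag_names[i].name);
		if (n < 0)
			break;
		off += (size_t)n;
	}
	return buf;
}
```

Keeping the mask/name pairs in one table means a new flag only needs one
more row, which is the same design choice as the xive_irq_flags[] array
in the patch.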



* [PATCH 03/13] powerpc/xive: Introduce XIVE_IPI_HW_IRQ
From: Cédric Le Goater @ 2020-12-08 15:11 UTC
  To: linuxppc-dev; +Cc: Cédric Le Goater
In-Reply-To: <20201208151124.1329942-1-clg@kaod.org>

The XIVE driver deals with CPU IPIs in a peculiar way. Each CPU has
its own XIVE IPI interrupt allocated at the HW level, for PowerNV, or
at the hypervisor level for pSeries. In practice, these interrupts are
not always used. pSeries/PowerVM prefers local doorbells for local
threads since they are faster. On PowerNV, global doorbells are also
preferred for the same reason.

The mapping in Linux is reduced to a single interrupt using HW
interrupt number 0 and a custom irq_chip to handle EOI. This can cause
performance issues in some benchmarks (ipistorm) on multichip systems.

Clarify the use of the 0 value by giving it a name. This will help in
improving multichip support.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/sysdev/xive/xive-internal.h |  2 ++
 arch/powerpc/sysdev/xive/common.c        | 10 +++++-----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/xive-internal.h b/arch/powerpc/sysdev/xive/xive-internal.h
index b7b901da2168..d701af7fb48c 100644
--- a/arch/powerpc/sysdev/xive/xive-internal.h
+++ b/arch/powerpc/sysdev/xive/xive-internal.h
@@ -5,6 +5,8 @@
 #ifndef __XIVE_INTERNAL_H
 #define __XIVE_INTERNAL_H
 
+#define XIVE_IPI_HW_IRQ		0 /* interrupt source # for IPIs */
+
 /*
  * A "disabled" interrupt should never fire, to catch problems
  * we set its logical number to this
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 65af34ac1fa2..ee375daf8114 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1142,7 +1142,7 @@ static void __init xive_request_ipi(void)
 		return;
 
 	/* Initialize it */
-	virq = irq_create_mapping(xive_irq_domain, 0);
+	virq = irq_create_mapping(xive_irq_domain, XIVE_IPI_HW_IRQ);
 	xive_ipi_irq = virq;
 
 	WARN_ON(request_irq(virq, xive_muxed_ipi_action,
@@ -1242,7 +1242,7 @@ static int xive_irq_domain_map(struct irq_domain *h, unsigned int virq,
 
 #ifdef CONFIG_SMP
 	/* IPIs are special and come up with HW number 0 */
-	if (hw == 0) {
+	if (hw == XIVE_IPI_HW_IRQ) {
 		/*
 		 * IPIs are marked per-cpu. We use separate HW interrupts under
 		 * the hood but associated with the same "linux" interrupt
@@ -1271,7 +1271,7 @@ static void xive_irq_domain_unmap(struct irq_domain *d, unsigned int virq)
 	if (!data)
 		return;
 	hw_irq = (unsigned int)irqd_to_hwirq(data);
-	if (hw_irq)
+	if (hw_irq != XIVE_IPI_HW_IRQ)
 		xive_irq_free_data(virq);
 }
 
@@ -1421,7 +1421,7 @@ static void xive_flush_cpu_queue(unsigned int cpu, struct xive_cpu *xc)
 		 * Ignore anything that isn't a XIVE irq and ignore
 		 * IPIs, so can just be dropped.
 		 */
-		if (d->domain != xive_irq_domain || hw_irq == 0)
+		if (d->domain != xive_irq_domain || hw_irq == XIVE_IPI_HW_IRQ)
 			continue;
 
 		/*
@@ -1655,7 +1655,7 @@ static int xive_core_debug_show(struct seq_file *m, void *private)
 		hw_irq = (unsigned int)irqd_to_hwirq(d);
 
 		/* IPIs are special (HW number 0) */
-		if (hw_irq)
+		if (hw_irq != XIVE_IPI_HW_IRQ)
 			xive_debug_show_irq(m, hw_irq, d);
 	}
 	return 0;
-- 
2.26.2



* [PATCH 01/13] KVM: PPC: Book3S HV: XIVE: Show detailed configuration in debug output
From: Cédric Le Goater @ 2020-12-08 15:11 UTC
  To: linuxppc-dev; +Cc: Cédric Le Goater, Greg Kurz
In-Reply-To: <20201208151124.1329942-1-clg@kaod.org>

This is useful to track allocation of the HW resources on a per-guest
basis. Making sure IPIs are local to the chip of the vCPUs reduces
rerouting between interrupt controllers and gives better performance
in case of pinning. Checking the distribution of VP structures on the
chips also helps in reducing PowerBUS traffic.

Signed-off-by: Greg Kurz <groug@kaod.org>
[ clg: resurrected show_sources and reworked output ]
Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/kvm/book3s_xive.h        |  2 +
 arch/powerpc/kvm/book3s_xive.c        | 76 ++++++++++++++++++++++-----
 arch/powerpc/kvm/book3s_xive_native.c | 21 ++++++--
 3 files changed, 82 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index 382e3a56e789..d5d4fee7ac94 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -290,6 +290,8 @@ extern int (*__xive_vm_h_eoi)(struct kvm_vcpu *vcpu, unsigned long xirr);
  */
 void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu);
 int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu);
+void kvmppc_xive_debug_show_sources(struct seq_file *m,
+				    struct kvmppc_xive_src_block *sb);
 struct kvmppc_xive_src_block *kvmppc_xive_create_src_block(
 	struct kvmppc_xive *xive, int irq);
 void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb);
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index a0ebc29f30b2..18a6b75a3bfd 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -2125,9 +2125,8 @@ int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu)
 		if (!q->qpage && !xc->esc_virq[i])
 			continue;
 
-		seq_printf(m, " [q%d]: ", i);
-
 		if (q->qpage) {
+			seq_printf(m, "    q[%d]: ", i);
 			idx = q->idx;
 			i0 = be32_to_cpup(q->qpage + idx);
 			idx = (idx + 1) & q->msk;
@@ -2141,16 +2140,54 @@ int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu)
 				irq_data_get_irq_handler_data(d);
 			u64 pq = xive_vm_esb_load(xd, XIVE_ESB_GET);
 
-			seq_printf(m, "E:%c%c I(%d:%llx:%llx)",
-				   (pq & XIVE_ESB_VAL_P) ? 'P' : 'p',
-				   (pq & XIVE_ESB_VAL_Q) ? 'Q' : 'q',
-				   xc->esc_virq[i], pq, xd->eoi_page);
+			seq_printf(m, "    ESC %d %c%c EOI @%llx",
+				   xc->esc_virq[i],
+				   (pq & XIVE_ESB_VAL_P) ? 'P' : '-',
+				   (pq & XIVE_ESB_VAL_Q) ? 'Q' : '-',
+				   xd->eoi_page);
 			seq_puts(m, "\n");
 		}
 	}
 	return 0;
 }
 
+void kvmppc_xive_debug_show_sources(struct seq_file *m,
+				    struct kvmppc_xive_src_block *sb)
+{
+	int i;
+
+	seq_puts(m, "    LISN      HW/CHIP   TYPE    PQ      EISN    CPU/PRIO\n");
+	for (i = 0; i < KVMPPC_XICS_IRQ_PER_ICS; i++) {
+		struct kvmppc_xive_irq_state *state = &sb->irq_state[i];
+		struct xive_irq_data *xd;
+		u64 pq;
+		u32 hw_num;
+
+		if (!state->valid)
+			continue;
+
+		kvmppc_xive_select_irq(state, &hw_num, &xd);
+
+		pq = xive_vm_esb_load(xd, XIVE_ESB_GET);
+
+		seq_printf(m, "%08x  %08x/%02x", state->number, hw_num,
+			   xd->src_chip);
+		if (state->lsi)
+			seq_printf(m, " %cLSI", state->asserted ? '^' : ' ');
+		else
+			seq_puts(m, "  MSI");
+
+		seq_printf(m, " %s  %c%c  %08x   % 4d/%d",
+			   state->ipi_number == hw_num ? "IPI" : " PT",
+			   pq & XIVE_ESB_VAL_P ? 'P' : '-',
+			   pq & XIVE_ESB_VAL_Q ? 'Q' : '-',
+			   state->eisn, state->act_server,
+			   state->act_priority);
+
+		seq_puts(m, "\n");
+	}
+}
+
 static int xive_debug_show(struct seq_file *m, void *private)
 {
 	struct kvmppc_xive *xive = m->private;
@@ -2171,7 +2208,7 @@ static int xive_debug_show(struct seq_file *m, void *private)
 	if (!kvm)
 		return 0;
 
-	seq_printf(m, "=========\nVCPU state\n=========\n");
+	seq_puts(m, "=========\nVCPU state\n=========\n");
 
 	kvm_for_each_vcpu(i, vcpu, kvm) {
 		struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
@@ -2179,11 +2216,12 @@ static int xive_debug_show(struct seq_file *m, void *private)
 		if (!xc)
 			continue;
 
-		seq_printf(m, "cpu server %#x VP:%#x CPPR:%#x HWCPPR:%#x"
-			   " MFRR:%#x PEND:%#x h_xirr: R=%lld V=%lld\n",
-			   xc->server_num, xc->vp_id, xc->cppr, xc->hw_cppr,
-			   xc->mfrr, xc->pending,
-			   xc->stat_rm_h_xirr, xc->stat_vm_h_xirr);
+		seq_printf(m, "VCPU %d: VP:%#x/%02x\n"
+			 "    CPPR:%#x HWCPPR:%#x MFRR:%#x PEND:%#x h_xirr: R=%lld V=%lld\n",
+			 xc->server_num, xc->vp_id, xc->vp_chip_id,
+			 xc->cppr, xc->hw_cppr,
+			 xc->mfrr, xc->pending,
+			 xc->stat_rm_h_xirr, xc->stat_vm_h_xirr);
 
 		kvmppc_xive_debug_show_queues(m, vcpu);
 
@@ -2199,13 +2237,25 @@ static int xive_debug_show(struct seq_file *m, void *private)
 		t_vm_h_ipi += xc->stat_vm_h_ipi;
 	}
 
-	seq_printf(m, "Hcalls totals\n");
+	seq_puts(m, "Hcalls totals\n");
 	seq_printf(m, " H_XIRR  R=%10lld V=%10lld\n", t_rm_h_xirr, t_vm_h_xirr);
 	seq_printf(m, " H_IPOLL R=%10lld V=%10lld\n", t_rm_h_ipoll, t_vm_h_ipoll);
 	seq_printf(m, " H_CPPR  R=%10lld V=%10lld\n", t_rm_h_cppr, t_vm_h_cppr);
 	seq_printf(m, " H_EOI   R=%10lld V=%10lld\n", t_rm_h_eoi, t_vm_h_eoi);
 	seq_printf(m, " H_IPI   R=%10lld V=%10lld\n", t_rm_h_ipi, t_vm_h_ipi);
 
+	seq_puts(m, "=========\nSources\n=========\n");
+
+	for (i = 0; i <= xive->max_sbid; i++) {
+		struct kvmppc_xive_src_block *sb = xive->src_blocks[i];
+
+		if (sb) {
+			arch_spin_lock(&sb->lock);
+			kvmppc_xive_debug_show_sources(m, sb);
+			arch_spin_unlock(&sb->lock);
+		}
+	}
+
 	return 0;
 }
 
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 6aaaa4bedaaf..9b395381179d 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -1257,18 +1257,31 @@ static int xive_native_debug_show(struct seq_file *m, void *private)
 		if (!xc)
 			continue;
 
-		seq_printf(m, "cpu server %#x VP=%#x NSR=%02x CPPR=%02x IBP=%02x PIPR=%02x w01=%016llx w2=%08x\n",
-			   xc->server_num, xc->vp_id,
+		seq_printf(m, "VCPU %d: VP=%#x/%02x\n"
+			   "    NSR=%02x CPPR=%02x IBP=%02x PIPR=%02x w01=%016llx w2=%08x\n",
+			   xc->server_num, xc->vp_id, xc->vp_chip_id,
 			   vcpu->arch.xive_saved_state.nsr,
 			   vcpu->arch.xive_saved_state.cppr,
 			   vcpu->arch.xive_saved_state.ipb,
 			   vcpu->arch.xive_saved_state.pipr,
-			   vcpu->arch.xive_saved_state.w01,
-			   (u32) vcpu->arch.xive_cam_word);
+			   be64_to_cpu(vcpu->arch.xive_saved_state.w01),
+			   be32_to_cpu(vcpu->arch.xive_cam_word));
 
 		kvmppc_xive_debug_show_queues(m, vcpu);
 	}
 
+	seq_puts(m, "=========\nSources\n=========\n");
+
+	for (i = 0; i <= xive->max_sbid; i++) {
+		struct kvmppc_xive_src_block *sb = xive->src_blocks[i];
+
+		if (sb) {
+			arch_spin_lock(&sb->lock);
+			kvmppc_xive_debug_show_sources(m, sb);
+			arch_spin_unlock(&sb->lock);
+		}
+	}
+
 	return 0;
 }
 
-- 
2.26.2


^ permalink raw reply related

* [PATCH 11/13] powerpc/xive: Remove P9 DD1 flag XIVE_IRQ_FLAG_EOI_FW
From: Cédric Le Goater @ 2020-12-08 15:11 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Cédric Le Goater
In-Reply-To: <20201208151124.1329942-1-clg@kaod.org>

This flag was used to support P9 DD1, and we stopped supporting
this CPU when DD2 came out. See skiboot commit:

  https://github.com/open-power/skiboot/commit/0b0d15e3c170

Also remove the eoi handler, which is now unused.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/asm/opal-api.h      |  2 +-
 arch/powerpc/include/asm/xive.h          |  2 +-
 arch/powerpc/sysdev/xive/xive-internal.h |  1 -
 arch/powerpc/kvm/book3s_xive_template.c  |  2 --
 arch/powerpc/sysdev/xive/common.c        | 13 +------------
 arch/powerpc/sysdev/xive/native.c        | 12 ------------
 arch/powerpc/sysdev/xive/spapr.c         |  6 ------
 7 files changed, 3 insertions(+), 35 deletions(-)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 0455b679c050..0b63ba7d5917 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -1093,7 +1093,7 @@ enum {
 	OPAL_XIVE_IRQ_LSI		= 0x00000004,
 	OPAL_XIVE_IRQ_SHIFT_BUG		= 0x00000008, /* P9 DD1.0 workaround */
 	OPAL_XIVE_IRQ_MASK_VIA_FW	= 0x00000010, /* P9 DD1.0 workaround */
-	OPAL_XIVE_IRQ_EOI_VIA_FW	= 0x00000020,
+	OPAL_XIVE_IRQ_EOI_VIA_FW	= 0x00000020, /* P9 DD1.0 workaround */
 };
 
 /* Flags for OPAL_XIVE_GET/SET_QUEUE_INFO */
diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index d62368d0ba91..f6150d7a757a 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -62,7 +62,7 @@ struct xive_irq_data {
 #define XIVE_IRQ_FLAG_LSI	0x02
 #define XIVE_IRQ_FLAG_SHIFT_BUG	0x04 /* P9 DD1.0 workaround */
 #define XIVE_IRQ_FLAG_MASK_FW	0x08 /* P9 DD1.0 workaround */
-#define XIVE_IRQ_FLAG_EOI_FW	0x10
+#define XIVE_IRQ_FLAG_EOI_FW	0x10 /* P9 DD1.0 workaround */
 #define XIVE_IRQ_FLAG_H_INT_ESB	0x20
 
 /* Special flag set by KVM for excalation interrupts */
diff --git a/arch/powerpc/sysdev/xive/xive-internal.h b/arch/powerpc/sysdev/xive/xive-internal.h
index 066d6fe3dc1d..3b7dd2cba9db 100644
--- a/arch/powerpc/sysdev/xive/xive-internal.h
+++ b/arch/powerpc/sysdev/xive/xive-internal.h
@@ -52,7 +52,6 @@ struct xive_ops {
 	void	(*shutdown)(void);
 
 	void	(*update_pending)(struct xive_cpu *xc);
-	void	(*eoi)(u32 hw_irq);
 	void	(*sync_source)(u32 hw_irq);
 	u64	(*esb_rw)(u32 hw_irq, u32 offset, u64 data, bool write);
 #ifdef CONFIG_SMP
diff --git a/arch/powerpc/kvm/book3s_xive_template.c b/arch/powerpc/kvm/book3s_xive_template.c
index ece36e024a8f..b0015e05d99a 100644
--- a/arch/powerpc/kvm/book3s_xive_template.c
+++ b/arch/powerpc/kvm/book3s_xive_template.c
@@ -74,8 +74,6 @@ static void GLUE(X_PFX,source_eoi)(u32 hw_irq, struct xive_irq_data *xd)
 	/* If the XIVE supports the new "store EOI facility, use it */
 	if (xd->flags & XIVE_IRQ_FLAG_STORE_EOI)
 		__x_writeq(0, __x_eoi_page(xd) + XIVE_ESB_STORE_EOI);
-	else if (hw_irq && xd->flags & XIVE_IRQ_FLAG_EOI_FW)
-		opal_int_eoi(hw_irq);
 	else if (xd->flags & XIVE_IRQ_FLAG_LSI) {
 		/*
 		 * For LSIs the HW EOI cycle is used rather than PQ bits,
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index a71412fefb65..fe6229dd3241 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -354,18 +354,7 @@ static void xive_do_source_eoi(u32 hw_irq, struct xive_irq_data *xd)
 	/* If the XIVE supports the new "store EOI facility, use it */
 	if (xd->flags & XIVE_IRQ_FLAG_STORE_EOI)
 		xive_esb_write(xd, XIVE_ESB_STORE_EOI, 0);
-	else if (hw_irq && xd->flags & XIVE_IRQ_FLAG_EOI_FW) {
-		/*
-		 * The FW told us to call it. This happens for some
-		 * interrupt sources that need additional HW whacking
-		 * beyond the ESB manipulation. For example LPC interrupts
-		 * on P9 DD1.0 needed a latch to be clared in the LPC bridge
-		 * itself. The Firmware will take care of it.
-		 */
-		if (WARN_ON_ONCE(!xive_ops->eoi))
-			return;
-		xive_ops->eoi(hw_irq);
-	} else {
+	else {
 		u8 eoi_val;
 
 		/*
diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c
index deb97ad25d62..4902d05ebbd1 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -64,8 +64,6 @@ int xive_native_populate_irq_data(u32 hw_irq, struct xive_irq_data *data)
 		data->flags |= XIVE_IRQ_FLAG_STORE_EOI;
 	if (opal_flags & OPAL_XIVE_IRQ_LSI)
 		data->flags |= XIVE_IRQ_FLAG_LSI;
-	if (opal_flags & OPAL_XIVE_IRQ_EOI_VIA_FW)
-		data->flags |= XIVE_IRQ_FLAG_EOI_FW;
 	data->eoi_page = be64_to_cpu(eoi_page);
 	data->trig_page = be64_to_cpu(trig_page);
 	data->esb_shift = be32_to_cpu(esb_shift);
@@ -380,15 +378,6 @@ static void xive_native_update_pending(struct xive_cpu *xc)
 	}
 }
 
-static void xive_native_eoi(u32 hw_irq)
-{
-	/*
-	 * Not normally used except if specific interrupts need
-	 * a workaround on EOI.
-	 */
-	opal_int_eoi(hw_irq);
-}
-
 static void xive_native_setup_cpu(unsigned int cpu, struct xive_cpu *xc)
 {
 	s64 rc;
@@ -471,7 +460,6 @@ static const struct xive_ops xive_native_ops = {
 	.match			= xive_native_match,
 	.shutdown		= xive_native_shutdown,
 	.update_pending		= xive_native_update_pending,
-	.eoi			= xive_native_eoi,
 	.setup_cpu		= xive_native_setup_cpu,
 	.teardown_cpu		= xive_native_teardown_cpu,
 	.sync_source		= xive_native_sync_source,
diff --git a/arch/powerpc/sysdev/xive/spapr.c b/arch/powerpc/sysdev/xive/spapr.c
index 6610e5149d5a..01ccc0786ada 100644
--- a/arch/powerpc/sysdev/xive/spapr.c
+++ b/arch/powerpc/sysdev/xive/spapr.c
@@ -628,11 +628,6 @@ static void xive_spapr_update_pending(struct xive_cpu *xc)
 	}
 }
 
-static void xive_spapr_eoi(u32 hw_irq)
-{
-	/* Not used */;
-}
-
 static void xive_spapr_setup_cpu(unsigned int cpu, struct xive_cpu *xc)
 {
 	/* Only some debug on the TIMA settings */
@@ -677,7 +672,6 @@ static const struct xive_ops xive_spapr_ops = {
 	.match			= xive_spapr_match,
 	.shutdown		= xive_spapr_shutdown,
 	.update_pending		= xive_spapr_update_pending,
-	.eoi			= xive_spapr_eoi,
 	.setup_cpu		= xive_spapr_setup_cpu,
 	.teardown_cpu		= xive_spapr_teardown_cpu,
 	.sync_source		= xive_spapr_sync_source,
-- 
2.26.2


^ permalink raw reply related

* [PATCH 12/13] powerpc/xive: Simplify xive_do_source_eoi()
From: Cédric Le Goater @ 2020-12-08 15:11 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Cédric Le Goater
In-Reply-To: <20201208151124.1329942-1-clg@kaod.org>

Previous patches removed the need for the first argument, which was a
hack for Firmware EOI. Remove it and flatten the routine, which has
become simpler.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
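Note: the flattened control flow described above can be modeled in
user space. The sketch below is only an illustration under stated
assumptions, not the kernel code: struct mock_source, enum eoi_method
and mock_source_eoi() are hypothetical names, and the P/Q bit values
are assumed to match XIVE_ESB_VAL_P and XIVE_ESB_VAL_Q.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as they appear in asm/xive.h in this series. */
#define XIVE_IRQ_FLAG_STORE_EOI	0x01
#define XIVE_IRQ_FLAG_LSI	0x02

/* Assumption: same bit layout as XIVE_ESB_VAL_P / XIVE_ESB_VAL_Q. */
#define ESB_VAL_P	0x2
#define ESB_VAL_Q	0x1

/* Which EOI method the routine ended up using (hypothetical enum). */
enum eoi_method { EOI_STORE, EOI_LSI_LOAD, EOI_PQ_00 };

/* Hypothetical stand-in for struct xive_irq_data plus the ESB state. */
struct mock_source {
	uint32_t flags;
	uint8_t pq;			/* current P/Q bits of the source */
	bool has_trigger;		/* models xd->trig_mmio != NULL */
	bool stale_p;
	unsigned int retriggers;	/* number of modeled re-triggers */
};

static enum eoi_method mock_source_eoi(struct mock_source *s)
{
	uint8_t old;

	s->stale_p = false;

	/* Store EOI: a single MMIO store, nothing else to do here. */
	if (s->flags & XIVE_IRQ_FLAG_STORE_EOI)
		return EOI_STORE;

	/* LSI: the "EOI cycle" load; HW re-triggers if still asserted. */
	if (s->flags & XIVE_IRQ_FLAG_LSI)
		return EOI_LSI_LOAD;

	/* Otherwise: clear P and Q, and re-trigger if Q was set. */
	old = s->pq;
	s->pq = 0;
	if ((old & ESB_VAL_Q) && s->has_trigger)
		s->retriggers++;

	return EOI_PQ_00;
}
```

Each case returns early, so the three methods read as a flat sequence
rather than as nested else branches.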
 arch/powerpc/sysdev/xive/common.c | 72 ++++++++++++++-----------------
 1 file changed, 33 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index fe6229dd3241..fb438203d5ee 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -348,39 +348,40 @@ static void xive_do_queue_eoi(struct xive_cpu *xc)
  * EOI an interrupt at the source. There are several methods
  * to do this depending on the HW version and source type
  */
-static void xive_do_source_eoi(u32 hw_irq, struct xive_irq_data *xd)
+static void xive_do_source_eoi(struct xive_irq_data *xd)
 {
+	u8 eoi_val;
+
 	xd->stale_p = false;
+
 	/* If the XIVE supports the new "store EOI facility, use it */
-	if (xd->flags & XIVE_IRQ_FLAG_STORE_EOI)
+	if (xd->flags & XIVE_IRQ_FLAG_STORE_EOI) {
 		xive_esb_write(xd, XIVE_ESB_STORE_EOI, 0);
-	else {
-		u8 eoi_val;
+		return;
+	}
 
-		/*
-		 * Otherwise for EOI, we use the special MMIO that does
-		 * a clear of both P and Q and returns the old Q,
-		 * except for LSIs where we use the "EOI cycle" special
-		 * load.
-		 *
-		 * This allows us to then do a re-trigger if Q was set
-		 * rather than synthesizing an interrupt in software
-		 *
-		 * For LSIs the HW EOI cycle is used rather than PQ bits,
-		 * as they are automatically re-triggred in HW when still
-		 * pending.
-		 */
-		if (xd->flags & XIVE_IRQ_FLAG_LSI)
-			xive_esb_read(xd, XIVE_ESB_LOAD_EOI);
-		else {
-			eoi_val = xive_esb_read(xd, XIVE_ESB_SET_PQ_00);
-			DBG_VERBOSE("eoi_val=%x\n", eoi_val);
-
-			/* Re-trigger if needed */
-			if ((eoi_val & XIVE_ESB_VAL_Q) && xd->trig_mmio)
-				out_be64(xd->trig_mmio, 0);
-		}
+	/*
+	 * For LSIs, we use the "EOI cycle" special load rather than
+	 * PQ bits, as they are automatically re-triggered in HW when
+	 * still pending.
+	 */
+	if (xd->flags & XIVE_IRQ_FLAG_LSI) {
+		xive_esb_read(xd, XIVE_ESB_LOAD_EOI);
+		return;
 	}
+
+	/*
+	 * Otherwise, we use the special MMIO that does a clear of
+	 * both P and Q and returns the old Q. This allows us to then
+	 * do a re-trigger if Q was set rather than synthesizing an
+	 * interrupt in software
+	 */
+	eoi_val = xive_esb_read(xd, XIVE_ESB_SET_PQ_00);
+	DBG_VERBOSE("eoi_val=%x\n", eoi_val);
+
+	/* Re-trigger if needed */
+	if ((eoi_val & XIVE_ESB_VAL_Q) && xd->trig_mmio)
+		out_be64(xd->trig_mmio, 0);
 }
 
 /* irq_chip eoi callback, called with irq descriptor lock held */
@@ -398,7 +399,7 @@ static void xive_irq_eoi(struct irq_data *d)
 	 */
 	if (!irqd_irq_disabled(d) && !irqd_is_forwarded_to_vcpu(d) &&
 	    !(xd->flags & XIVE_IRQ_FLAG_NO_EOI))
-		xive_do_source_eoi(irqd_to_hwirq(d), xd);
+		xive_do_source_eoi(xd);
 	else
 		xd->stale_p = true;
 
@@ -788,14 +789,7 @@ static int xive_irq_retrigger(struct irq_data *d)
 	 * 11, then perform an EOI.
 	 */
 	xive_esb_read(xd, XIVE_ESB_SET_PQ_11);
-
-	/*
-	 * Note: We pass "0" to the hw_irq argument in order to
-	 * avoid calling into the backend EOI code which we don't
-	 * want to do in the case of a re-trigger. Backends typically
-	 * only do EOI for LSIs anyway.
-	 */
-	xive_do_source_eoi(0, xd);
+	xive_do_source_eoi(xd);
 
 	return 1;
 }
@@ -910,7 +904,7 @@ static int xive_irq_set_vcpu_affinity(struct irq_data *d, void *state)
 		 * while masked, the generic code will re-mask it anyway.
 		 */
 		if (!xd->saved_p)
-			xive_do_source_eoi(hw_irq, xd);
+			xive_do_source_eoi(xd);
 
 	}
 	return 0;
@@ -1054,7 +1048,7 @@ static void xive_ipi_eoi(struct irq_data *d)
 	DBG_VERBOSE("IPI eoi: irq=%d [0x%lx] (HW IRQ 0x%x) pending=%02x\n",
 		    d->irq, irqd_to_hwirq(d), xc->hw_ipi, xc->pending_prio);
 
-	xive_do_source_eoi(xc->hw_ipi, &xc->ipi_data);
+	xive_do_source_eoi(&xc->ipi_data);
 	xive_do_queue_eoi(xc);
 }
 
@@ -1443,7 +1437,7 @@ static void xive_flush_cpu_queue(unsigned int cpu, struct xive_cpu *xc)
 		 * still asserted. Otherwise do an MSI retrigger.
 		 */
 		if (xd->flags & XIVE_IRQ_FLAG_LSI)
-			xive_do_source_eoi(irqd_to_hwirq(d), xd);
+			xive_do_source_eoi(xd);
 		else
 			xive_irq_retrigger(d);
 
-- 
2.26.2


^ permalink raw reply related

* [PATCH 09/13] powerpc/xive: Remove P9 DD1 flag XIVE_IRQ_FLAG_SHIFT_BUG
From: Cédric Le Goater @ 2020-12-08 15:11 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Cédric Le Goater
In-Reply-To: <20201208151124.1329942-1-clg@kaod.org>

This flag was used to support the PHB4 LSIs on P9 DD1, and we
stopped supporting this CPU when DD2 came out. See skiboot commit:

  https://github.com/open-power/skiboot/commit/0b0d15e3c170

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
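Note: for readers unfamiliar with the workaround being removed, the
sketch below models the offset computation before and after this
patch. It is a user-space illustration only: esb_offset_old() and
esb_offset_new() are hypothetical helpers, and 0xc00 is just an
example offset (assumed here to correspond to XIVE_ESB_SET_PQ_00).

```c
#include <stdint.h>

/* Flag value as it appears in asm/xive.h in this series. */
#define XIVE_IRQ_FLAG_SHIFT_BUG	0x04

/* Example ESB offset; assumed to be XIVE_ESB_SET_PQ_00. */
#define EXAMPLE_ESB_OFFSET	0xc00

/* Before the patch: the P9 DD1 workaround duplicated the offset bits. */
static uint32_t esb_offset_old(uint32_t flags, uint32_t offset)
{
	if (flags & XIVE_IRQ_FLAG_SHIFT_BUG)
		offset |= offset << 4;
	return offset;
}

/* After the patch: the offset is used as is, whatever the flags say. */
static uint32_t esb_offset_new(uint32_t flags, uint32_t offset)
{
	(void)flags;
	return offset;
}
```

With the flag set, the old helper turns 0xc00 into 0xcc00; without
it, both helpers return the offset unchanged.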
 arch/powerpc/include/asm/opal-api.h     | 2 +-
 arch/powerpc/include/asm/xive.h         | 2 +-
 arch/powerpc/kvm/book3s_xive_native.c   | 3 ---
 arch/powerpc/kvm/book3s_xive_template.c | 3 ---
 arch/powerpc/sysdev/xive/common.c       | 8 --------
 arch/powerpc/sysdev/xive/native.c       | 2 --
 6 files changed, 2 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 1dffa3cb16ba..48ee604ca39a 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -1091,7 +1091,7 @@ enum {
 	OPAL_XIVE_IRQ_TRIGGER_PAGE	= 0x00000001,
 	OPAL_XIVE_IRQ_STORE_EOI		= 0x00000002,
 	OPAL_XIVE_IRQ_LSI		= 0x00000004,
-	OPAL_XIVE_IRQ_SHIFT_BUG		= 0x00000008,
+	OPAL_XIVE_IRQ_SHIFT_BUG		= 0x00000008, /* P9 DD1.0 workaround */
 	OPAL_XIVE_IRQ_MASK_VIA_FW	= 0x00000010,
 	OPAL_XIVE_IRQ_EOI_VIA_FW	= 0x00000020,
 };
diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index d332dd9a18de..ff805885a028 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -60,7 +60,7 @@ struct xive_irq_data {
 };
 #define XIVE_IRQ_FLAG_STORE_EOI	0x01
 #define XIVE_IRQ_FLAG_LSI	0x02
-#define XIVE_IRQ_FLAG_SHIFT_BUG	0x04
+#define XIVE_IRQ_FLAG_SHIFT_BUG	0x04 /* P9 DD1.0 workaround */
 #define XIVE_IRQ_FLAG_MASK_FW	0x08
 #define XIVE_IRQ_FLAG_EOI_FW	0x10
 #define XIVE_IRQ_FLAG_H_INT_ESB	0x20
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 9b395381179d..170d1d04e1d1 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -37,9 +37,6 @@ static u8 xive_vm_esb_load(struct xive_irq_data *xd, u32 offset)
 	 * ordering.
 	 */
 
-	if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
-		offset |= offset << 4;
-
 	val = in_be64(xd->eoi_mmio + offset);
 	return (u8)val;
 }
diff --git a/arch/powerpc/kvm/book3s_xive_template.c b/arch/powerpc/kvm/book3s_xive_template.c
index 4ad3c0279458..ece36e024a8f 100644
--- a/arch/powerpc/kvm/book3s_xive_template.c
+++ b/arch/powerpc/kvm/book3s_xive_template.c
@@ -61,9 +61,6 @@ static u8 GLUE(X_PFX,esb_load)(struct xive_irq_data *xd, u32 offset)
 	if (offset == XIVE_ESB_SET_PQ_10 && xd->flags & XIVE_IRQ_FLAG_STORE_EOI)
 		offset |= XIVE_ESB_LD_ST_MO;
 
-	if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
-		offset |= offset << 4;
-
 	val =__x_readq(__x_eoi_page(xd) + offset);
 #ifdef __LITTLE_ENDIAN__
 	val >>= 64-8;
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 411cba12d73b..a9259470bf9f 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -200,10 +200,6 @@ static notrace u8 xive_esb_read(struct xive_irq_data *xd, u32 offset)
 	if (offset == XIVE_ESB_SET_PQ_10 && xd->flags & XIVE_IRQ_FLAG_STORE_EOI)
 		offset |= XIVE_ESB_LD_ST_MO;
 
-	/* Handle HW errata */
-	if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
-		offset |= offset << 4;
-
 	if ((xd->flags & XIVE_IRQ_FLAG_H_INT_ESB) && xive_ops->esb_rw)
 		val = xive_ops->esb_rw(xd->hw_irq, offset, 0, 0);
 	else
@@ -214,10 +210,6 @@ static notrace u8 xive_esb_read(struct xive_irq_data *xd, u32 offset)
 
 static void xive_esb_write(struct xive_irq_data *xd, u32 offset, u64 data)
 {
-	/* Handle HW errata */
-	if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
-		offset |= offset << 4;
-
 	if ((xd->flags & XIVE_IRQ_FLAG_H_INT_ESB) && xive_ops->esb_rw)
 		xive_ops->esb_rw(xd->hw_irq, offset, data, 1);
 	else
diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c
index 5f1e5aed8ab4..0310783241b5 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -64,8 +64,6 @@ int xive_native_populate_irq_data(u32 hw_irq, struct xive_irq_data *data)
 		data->flags |= XIVE_IRQ_FLAG_STORE_EOI;
 	if (opal_flags & OPAL_XIVE_IRQ_LSI)
 		data->flags |= XIVE_IRQ_FLAG_LSI;
-	if (opal_flags & OPAL_XIVE_IRQ_SHIFT_BUG)
-		data->flags |= XIVE_IRQ_FLAG_SHIFT_BUG;
 	if (opal_flags & OPAL_XIVE_IRQ_MASK_VIA_FW)
 		data->flags |= XIVE_IRQ_FLAG_MASK_FW;
 	if (opal_flags & OPAL_XIVE_IRQ_EOI_VIA_FW)
-- 
2.26.2


^ permalink raw reply related

* [PATCH 00/13] powerpc/xive: misc cleanups
From: Cédric Le Goater @ 2020-12-08 15:11 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Cédric Le Goater

Hello,

The most important change is the removal of support for the OPAL
flags required for P9 DD1. It provides a good cleanup of some complex
routines.

The series also includes a change to how the pages donated to the
XIVE IC are allocated in Linux. The flags are changed to make sure
that these pages cannot be reclaimed.

Thanks,

C.

Cédric Le Goater (13):
  KVM: PPC: Book3S HV: XIVE: Show detailed configuration in debug output
  powerpc/xive: Rename XIVE_IRQ_NO_EOI to show it's a flag
  powerpc/xive: Introduce XIVE_IPI_HW_IRQ
  powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
  powerpc/xive: Fix allocation of pages donated to the XIVE controller
  powerpc/xive: Add a name to the IRQ domain
  powerpc/xive: Add a debug_show handler to the XIVE irq_domain
  powerpc: Increase NR_IRQS range to support more KVM guests
  powerpc/xive: Remove P9 DD1 flag XIVE_IRQ_FLAG_SHIFT_BUG
  powerpc/xive: Remove P9 DD1 flag XIVE_IRQ_FLAG_MASK_FW
  powerpc/xive: Remove P9 DD1 flag XIVE_IRQ_FLAG_EOI_FW
  powerpc/xive: Simplify xive_do_source_eoi()
  powerpc/xive: Improve error reporting of OPAL calls

 arch/powerpc/include/asm/opal-api.h      |   6 +-
 arch/powerpc/include/asm/xive.h          |   8 +-
 arch/powerpc/kvm/book3s_xive.h           |   2 +
 arch/powerpc/sysdev/xive/xive-internal.h |   9 +-
 arch/powerpc/kvm/book3s_xive.c           | 134 +++++++-------
 arch/powerpc/kvm/book3s_xive_native.c    |  24 ++-
 arch/powerpc/kvm/book3s_xive_template.c  |   5 -
 arch/powerpc/sysdev/xive/common.c        | 219 +++++++++++------------
 arch/powerpc/sysdev/xive/native.c        |  48 ++---
 arch/powerpc/sysdev/xive/spapr.c         |   8 +-
 arch/powerpc/Kconfig                     |   2 +-
 11 files changed, 230 insertions(+), 235 deletions(-)

-- 
2.26.2


^ permalink raw reply

* [PATCH 08/13] powerpc: Increase NR_IRQS range to support more KVM guests
From: Cédric Le Goater @ 2020-12-08 15:11 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Cédric Le Goater
In-Reply-To: <20201208151124.1329942-1-clg@kaod.org>

PowerNV systems can handle up to 4K guests and 1M interrupt numbers
per chip. Increase the range of allowed interrupts to support a larger
number of guests.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 5181872f9452..c250fbd430d1 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -66,7 +66,7 @@ config NEED_PER_CPU_PAGE_FIRST_CHUNK
 
 config NR_IRQS
 	int "Number of virtual interrupt numbers"
-	range 32 32768
+	range 32 1048576
 	default "512"
 	help
 	  This defines the number of virtual interrupt numbers the kernel
-- 
2.26.2


^ permalink raw reply related

* [PATCH 04/13] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
From: Cédric Le Goater @ 2020-12-08 15:11 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Cédric Le Goater
In-Reply-To: <20201208151124.1329942-1-clg@kaod.org>

The 'chip_id' field of the XIVE CPU structure is used to choose a
target for a source located on the same chip when possible. This field
is assigned from the "ibm,chip-id" property on the PowerNV platform
and on pSeries under KVM when NUMA nodes are defined, but it is
undefined under PowerVM. The XIVE source structure has a similar field
'src_chip' which is only assigned on the PowerNV platform.

cpu_to_node() returns a compatible value on all platforms, 0 being the
default node. It will also give us the opportunity to set the affinity
of sources on pSeries when we can localize them.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
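Note: the way 'chip_id' and 'src_chip' are meant to be used together
can be sketched as below. This is a deliberately simplified user-space
model, not the kernel's actual target selection code: struct mock_cpu
and pick_target() are hypothetical, and the real code works on CPU
masks rather than on a plain array.

```c
#include <stddef.h>

/* Mirrors XIVE_INVALID_CHIP_ID from asm/xive.h. */
#define INVALID_CHIP_ID	(-1)

/* Hypothetical per-CPU record. */
struct mock_cpu {
	int id;
	int chip_id;	/* with this patch: the value of cpu_to_node(cpu) */
};

/*
 * Prefer a CPU on the same chip as the source. Fall back to the first
 * CPU when the source chip is unknown or no CPU matches. Return -1 if
 * there is no CPU at all.
 */
static int pick_target(const struct mock_cpu *cpus, size_t n, int src_chip)
{
	size_t i;

	if (src_chip != INVALID_CHIP_ID) {
		for (i = 0; i < n; i++)
			if (cpus[i].chip_id == src_chip)
				return cpus[i].id;
	}

	return n ? cpus[0].id : -1;
}
```

Because cpu_to_node() yields 0 by default on every platform, the
comparison stays meaningful even where "ibm,chip-id" is not defined.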
 arch/powerpc/sysdev/xive/common.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index ee375daf8114..605238ca65e4 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1342,16 +1342,11 @@ static int xive_prepare_cpu(unsigned int cpu)

 	xc = per_cpu(xive_cpu, cpu);
 	if (!xc) {
-		struct device_node *np;
-
 		xc = kzalloc_node(sizeof(struct xive_cpu),
 				  GFP_KERNEL, cpu_to_node(cpu));
 		if (!xc)
 			return -ENOMEM;
-		np = of_get_cpu_node(cpu, NULL);
-		if (np)
-			xc->chip_id = of_get_ibm_chip_id(np);
-		of_node_put(np);
+		xc->chip_id = cpu_to_node(cpu);
 		xc->hw_ipi = XIVE_BAD_IRQ;

 		per_cpu(xive_cpu, cpu) = xc;
-- 
2.26.2

^ permalink raw reply related

* [PATCH 02/13] powerpc/xive: Rename XIVE_IRQ_NO_EOI to show it's a flag
From: Cédric Le Goater @ 2020-12-08 15:11 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Cédric Le Goater
In-Reply-To: <20201208151124.1329942-1-clg@kaod.org>

This is a simple cleanup to make all flags of the XIVE interrupt
structure easy to identify. The interrupts flagged with
XIVE_IRQ_FLAG_NO_EOI are the escalations used to wake up vCPUs in KVM.
They are handled very differently from the rest.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
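Note: the effect of the renamed flag on the EOI path can be modeled
as below. This is a user-space sketch under stated assumptions:
struct mock_irq and mock_irq_eoi() are hypothetical, and the source
EOI itself is reduced to a counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag value as it appears in asm/xive.h after this patch. */
#define XIVE_IRQ_FLAG_NO_EOI	0x80

/* Hypothetical stand-in for the irq_data / xive_irq_data pair. */
struct mock_irq {
	uint32_t flags;
	bool disabled;			/* models irqd_irq_disabled() */
	bool forwarded_to_vcpu;		/* models irqd_is_forwarded_to_vcpu() */
	bool stale_p;
	unsigned int source_eois;	/* number of modeled source EOIs */
};

static void mock_irq_eoi(struct mock_irq *irq)
{
	if (!irq->disabled && !irq->forwarded_to_vcpu &&
	    !(irq->flags & XIVE_IRQ_FLAG_NO_EOI)) {
		/* Regular interrupt: EOI the source, P is up to date. */
		irq->stale_p = false;
		irq->source_eois++;
	} else {
		/* Escalation (or masked/forwarded): leave the source alone. */
		irq->stale_p = true;
	}
}
```

An escalation interrupt therefore never gets a source EOI from the
core code and stays effectively masked after it fires once.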
 arch/powerpc/include/asm/xive.h   | 2 +-
 arch/powerpc/kvm/book3s_xive.c    | 4 ++--
 arch/powerpc/sysdev/xive/common.c | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 309b4d65b74f..d332dd9a18de 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -66,7 +66,7 @@ struct xive_irq_data {
 #define XIVE_IRQ_FLAG_H_INT_ESB	0x20
 
 /* Special flag set by KVM for excalation interrupts */
-#define XIVE_IRQ_NO_EOI		0x80
+#define XIVE_IRQ_FLAG_NO_EOI	0x80
 
 #define XIVE_INVALID_CHIP_ID	-1
 
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 18a6b75a3bfd..fae1c2e8da29 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -219,7 +219,7 @@ int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio,
 	/* In single escalation mode, we grab the ESB MMIO of the
 	 * interrupt and mask it. Also populate the VCPU v/raddr
 	 * of the ESB page for use by asm entry/exit code. Finally
-	 * set the XIVE_IRQ_NO_EOI flag which will prevent the
+	 * set the XIVE_IRQ_FLAG_NO_EOI flag which will prevent the
 	 * core code from performing an EOI on the escalation
 	 * interrupt, thus leaving it effectively masked after
 	 * it fires once.
@@ -231,7 +231,7 @@ int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio,
 		xive_vm_esb_load(xd, XIVE_ESB_SET_PQ_01);
 		vcpu->arch.xive_esc_raddr = xd->eoi_page;
 		vcpu->arch.xive_esc_vaddr = (__force u64)xd->eoi_mmio;
-		xd->flags |= XIVE_IRQ_NO_EOI;
+		xd->flags |= XIVE_IRQ_FLAG_NO_EOI;
 	}
 
 	return 0;
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index a80440af491a..65af34ac1fa2 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -416,7 +416,7 @@ static void xive_irq_eoi(struct irq_data *d)
 	 * been passed-through to a KVM guest
 	 */
 	if (!irqd_irq_disabled(d) && !irqd_is_forwarded_to_vcpu(d) &&
-	    !(xd->flags & XIVE_IRQ_NO_EOI))
+	    !(xd->flags & XIVE_IRQ_FLAG_NO_EOI))
 		xive_do_source_eoi(irqd_to_hwirq(d), xd);
 	else
 		xd->stale_p = true;
-- 
2.26.2


^ permalink raw reply related

* Re: [PATCH v3 4/5] powerpc/fault: Avoid heavy search_exception_tables() verification
From: Christophe Leroy @ 2020-12-08 15:07 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, npiggin
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <87lfe8qrik.fsf@linux.ibm.com>



Le 08/12/2020 à 15:52, Aneesh Kumar K.V a écrit :
> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> 
>> search_exception_tables() is a heavy operation; we have to avoid it.
>> When KUAP is selected, we'll know the fault has been blocked by KUAP.
>> Otherwise, it behaves just as if the address was already in the TLBs
>> and no fault was generated.
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>> v3: rebased
>> v2: Squashed with the preceding patch which was re-ordering tests that get removed in this patch.
>> ---
>>   arch/powerpc/mm/fault.c | 23 +++++++----------------
>>   1 file changed, 7 insertions(+), 16 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
>> index 3fcd34c28e10..1770b41e4730 100644
>> --- a/arch/powerpc/mm/fault.c
>> +++ b/arch/powerpc/mm/fault.c
>> @@ -210,28 +210,19 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned long error_code,
>>   		return true;
>>   	}
>>   
>> -	if (!is_exec && address < TASK_SIZE && (error_code & (DSISR_PROTFAULT | DSISR_KEYFAULT)) &&
>> -	    !search_exception_tables(regs->nip)) {
>> -		pr_crit_ratelimited("Kernel attempted to access user page (%lx) - exploit attempt? (uid: %d)\n",
>> -				    address,
>> -				    from_kuid(&init_user_ns, current_uid()));
>> -	}
>> -
>>   	// Kernel fault on kernel address is bad
>>   	if (address >= TASK_SIZE)
>>   		return true;
>>   
>> -	// Fault on user outside of certain regions (eg. copy_tofrom_user()) is bad
>> -	if (!search_exception_tables(regs->nip))
>> -		return true;
>> -
>> -	// Read/write fault in a valid region (the exception table search passed
>> -	// above), but blocked by KUAP is bad, it can never succeed.
>> -	if (bad_kuap_fault(regs, address, is_write))
>> +	// Read/write fault blocked by KUAP is bad, it can never succeed.
>> +	if (bad_kuap_fault(regs, address, is_write)) {
>> +		pr_crit_ratelimited("Kernel attempted to %s user page (%lx) - exploit attempt? (uid: %d)\n",
>> +				    is_write ? "write" : "read", address,
>> +				    from_kuid(&init_user_ns, current_uid()));
>>   		return true;
>> +	}
> 
> 
> With this I am wondering whether the WARN() in bad_kuap_fault() is
> needed. A direct access to a userspace address will trigger this, whereas
> previously we used bad_kuap_fault() only to identify an incorrect restore
> of the AMR register (i.e., to identify kernel bugs). Hence a WARN() there was
> useful. Do we lose that differentiation now?

Yes, I wanted to remove the WARN(), see
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/cc9129bdda1dbc2f0a09cf45fece7d0b0e690784.1605541983.git.christophe.leroy@csgroup.eu/
but I understood from Michael that maybe it was not a good idea, so I left it aside for now when
rebasing to v3.

Yes, previously we were able to differentiate between a direct access to userspace and a valid access
triggering a KUAP fault, but at the cost of the heavy search_exception_tables().
The issue was reported by Nick through https://github.com/linuxppc/issues/issues/317

Should we perform the search_exception_tables() once we have hit the KUAP fault and WARN() only in
that case?

I was also wondering if we should keep the WARN() only when CONFIG_PPC_KUAP_DEBUG is set?
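
To make the first question concrete, here is one possible reading of
it as a user-space model. It is only a sketch of the idea under
discussion, not a proposed patch: struct mock_fault, struct
fault_result and classify_fault() are hypothetical, and the three
booleans stand in for the address check, bad_kuap_fault() and
search_exception_tables().

```c
#include <stdbool.h>

/* Hypothetical description of a kernel-mode fault. */
struct mock_fault {
	bool kernel_address;	/* address >= TASK_SIZE */
	bool blocked_by_kuap;	/* what bad_kuap_fault() would report */
	bool nip_in_extable;	/* what search_exception_tables() would find */
};

struct fault_result {
	bool bad;			/* the fault can never succeed */
	bool warned;			/* a WARN() would have fired */
	unsigned int extable_searches;	/* how often the heavy search ran */
};

static struct fault_result classify_fault(const struct mock_fault *f)
{
	struct fault_result r = { false, false, 0 };

	/* Kernel fault on kernel address is bad. */
	if (f->kernel_address) {
		r.bad = true;
		return r;
	}

	if (f->blocked_by_kuap) {
		/*
		 * Only now pay for the exception table search. A match
		 * means a legitimate user access was blocked, which
		 * points at a kernel bug (e.g. a bad AMR restore), so
		 * WARN. No match means a direct access to userspace.
		 */
		r.extable_searches++;
		r.warned = f->nip_in_extable;
		r.bad = true;
		return r;
	}

	/* Kernel fault on user address, allowed by KUAP. */
	return r;
}
```

In this model the common path (no KUAP fault) never runs the search,
while the two bad cases can still be told apart.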

> 
> 
>>   
>> -	// What's left? Kernel fault on user in well defined regions (extable
>> -	// matched), and allowed by KUAP in the faulting context.
>> +	// What's left? Kernel fault on user and allowed by KUAP in the faulting context.
>>   	return false;
>>   }
>>   
>> -- 
>> 2.25.0

^ permalink raw reply

* [PATCH] drivers: usb: gadget: prefer pr_*() functions over raw printk()
From: Enrico Weigelt, metux IT consult @ 2020-12-08 14:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: balbi, linux-usb, linuxppc-dev, laurent.pinchart, leoyang.li

Reduce logging boilerplate a bit by using the preferred pr_*()
macros instead of raw printk().

Signed-off-by: Enrico Weigelt, metux IT consult <info@metux.net>
---
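Note: the relationship between printk(KERN_INFO ...) and pr_info()
can be illustrated with the user-space model below. It is a sketch
only: mock_printk() and log_buf are hypothetical, KERN_INFO is
simplified to a visible "<6>" prefix, and the macro relies on the GNU
'##__VA_ARGS__' extension, as the kernel macros do.

```c
#include <stdarg.h>
#include <stdio.h>

/* Last formatted message, kept so that callers can inspect it. */
static char log_buf[256];

/* Hypothetical stand-in for printk(): formats into log_buf. */
static int mock_printk(const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(log_buf, sizeof(log_buf), fmt, ap);
	va_end(ap);

	return n;
}

/* Simplified log level prefix; the kernel uses a different encoding. */
#define KERN_INFO	"<6>"

/* A file may define pr_fmt() before this point to add a prefix. */
#ifndef pr_fmt
#define pr_fmt(fmt)	fmt
#endif

/* Same shape as the kernel macro: level, pr_fmt() and the arguments. */
#define pr_info(fmt, ...) \
	mock_printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
```

With the default pr_fmt(), pr_info("x %d\n", 1) and
mock_printk(KERN_INFO "x %d\n", 1) produce the same string. Note that
pr_debug() is a different case in the kernel: unless DEBUG is defined
or dynamic debug is enabled for the call site, it emits nothing, so
the printk(KERN_DEBUG ...) conversions below are not strictly
equivalent.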
 drivers/usb/gadget/function/uvc.h       |  2 +-
 drivers/usb/gadget/udc/atmel_usba_udc.c |  2 +-
 drivers/usb/gadget/udc/fsl_udc_core.c   |  4 +--
 drivers/usb/gadget/udc/fsl_usb2_udc.h   |  4 +--
 drivers/usb/gadget/udc/fusb300_udc.c    | 64 ++++++++++++++++-----------------
 drivers/usb/gadget/udc/goku_udc.c       |  2 +-
 drivers/usb/gadget/udc/r8a66597-udc.h   |  2 +-
 7 files changed, 40 insertions(+), 40 deletions(-)

diff --git a/drivers/usb/gadget/function/uvc.h b/drivers/usb/gadget/function/uvc.h
index 23ee25383c1f..d546eb7c348c 100644
--- a/drivers/usb/gadget/function/uvc.h
+++ b/drivers/usb/gadget/function/uvc.h
@@ -49,7 +49,7 @@ extern unsigned int uvc_gadget_trace_param;
 #define uvc_trace(flag, msg...) \
 	do { \
 		if (uvc_gadget_trace_param & flag) \
-			printk(KERN_DEBUG "uvcvideo: " msg); \
+			pr_debug("uvcvideo: " msg); \
 	} while (0)
 
 #define uvcg_dbg(f, fmt, args...) \
diff --git a/drivers/usb/gadget/udc/atmel_usba_udc.c b/drivers/usb/gadget/udc/atmel_usba_udc.c
index 2b893bceea45..4834fafb3f70 100644
--- a/drivers/usb/gadget/udc/atmel_usba_udc.c
+++ b/drivers/usb/gadget/udc/atmel_usba_udc.c
@@ -1573,7 +1573,7 @@ static void usba_control_irq(struct usba_udc *udc, struct usba_ep *ep)
 		 * generate or receive a reply right away. */
 		usba_ep_writel(ep, CLR_STA, USBA_RX_SETUP);
 
-		/* printk(KERN_DEBUG "setup: %d: %02x.%02x\n",
+		/* pr_debug("setup: %d: %02x.%02x\n",
 			ep->state, crq.crq.bRequestType,
 			crq.crq.bRequest); */
 
diff --git a/drivers/usb/gadget/udc/fsl_udc_core.c b/drivers/usb/gadget/udc/fsl_udc_core.c
index ad6ff9c4188e..cab4def04f9f 100644
--- a/drivers/usb/gadget/udc/fsl_udc_core.c
+++ b/drivers/usb/gadget/udc/fsl_udc_core.c
@@ -1474,7 +1474,7 @@ __acquires(udc->lock)
 			mdelay(10);
 			tmp = fsl_readl(&dr_regs->portsc1) | (ptc << 16);
 			fsl_writel(tmp, &dr_regs->portsc1);
-			printk(KERN_INFO "udc: switch to test mode %d.\n", ptc);
+			pr_info("udc: switch to test mode %d.\n", ptc);
 		}
 
 		return;
@@ -1952,7 +1952,7 @@ static int fsl_udc_start(struct usb_gadget *g,
 	if (!IS_ERR_OR_NULL(udc_controller->transceiver)) {
 		/* Suspend the controller until OTG enable it */
 		udc_controller->stopped = 1;
-		printk(KERN_INFO "Suspend udc for OTG auto detect\n");
+		pr_info("Suspend udc for OTG auto detect\n");
 
 		/* connect to bus through transceiver */
 		if (!IS_ERR_OR_NULL(udc_controller->transceiver)) {
diff --git a/drivers/usb/gadget/udc/fsl_usb2_udc.h b/drivers/usb/gadget/udc/fsl_usb2_udc.h
index 4ba651ae9048..b180bf14dd0c 100644
--- a/drivers/usb/gadget/udc/fsl_usb2_udc.h
+++ b/drivers/usb/gadget/udc/fsl_usb2_udc.h
@@ -509,7 +509,7 @@ struct fsl_udc {
 /*-------------------------------------------------------------------------*/
 
 #ifdef DEBUG
-#define DBG(fmt, args...) 	printk(KERN_DEBUG "[%s]  " fmt "\n", \
+#define DBG(fmt, args...) 	pr_debug("[%s]  " fmt "\n", \
 				__func__, ## args)
 #else
 #define DBG(fmt, args...)	do{}while(0)
@@ -535,7 +535,7 @@ static void dump_msg(const char *label, const u8 * buf, unsigned int length)
 			p += 3;
 		}
 		*p = 0;
-		printk(KERN_DEBUG "%6x: %s\n", start, line);
+		pr_debug("%6x: %s\n", start, line);
 		buf += num;
 		start += num;
 		length -= num;
diff --git a/drivers/usb/gadget/udc/fusb300_udc.c b/drivers/usb/gadget/udc/fusb300_udc.c
index 9af8b415f303..c4e7e4b8e46f 100644
--- a/drivers/usb/gadget/udc/fusb300_udc.c
+++ b/drivers/usb/gadget/udc/fusb300_udc.c
@@ -352,24 +352,24 @@ static void fusb300_wrcxf(struct fusb300_ep *ep,
 		for (i = length >> 2; i > 0; i--) {
 			data = *tmp | *(tmp + 1) << 8 | *(tmp + 2) << 16 |
 				*(tmp + 3) << 24;
-			printk(KERN_DEBUG "    0x%x\n", data);
+			pr_debug("    0x%x\n", data);
 			iowrite32(data, fusb300->reg + FUSB300_OFFSET_CXPORT);
 			tmp = tmp + 4;
 		}
 		switch (length % 4) {
 		case 1:
 			data = *tmp;
-			printk(KERN_DEBUG "    0x%x\n", data);
+			pr_debug("    0x%x\n", data);
 			iowrite32(data, fusb300->reg + FUSB300_OFFSET_CXPORT);
 			break;
 		case 2:
 			data = *tmp | *(tmp + 1) << 8;
-			printk(KERN_DEBUG "    0x%x\n", data);
+			pr_debug("    0x%x\n", data);
 			iowrite32(data, fusb300->reg + FUSB300_OFFSET_CXPORT);
 			break;
 		case 3:
 			data = *tmp | *(tmp + 1) << 8 | *(tmp + 2) << 16;
-			printk(KERN_DEBUG "    0x%x\n", data);
+			pr_debug("    0x%x\n", data);
 			iowrite32(data, fusb300->reg + FUSB300_OFFSET_CXPORT);
 			break;
 		default:
@@ -390,7 +390,7 @@ static void fusb300_clear_epnstall(struct fusb300 *fusb300, u8 ep)
 	u32 reg = ioread32(fusb300->reg + FUSB300_OFFSET_EPSET0(ep));
 
 	if (reg & FUSB300_EPSET0_STL) {
-		printk(KERN_DEBUG "EP%d stall... Clear!!\n", ep);
+		pr_debug("EP%d stall... Clear!!\n", ep);
 		reg |= FUSB300_EPSET0_STL_CLR;
 		iowrite32(reg, fusb300->reg + FUSB300_OFFSET_EPSET0(ep));
 	}
@@ -402,7 +402,7 @@ static void ep0_queue(struct fusb300_ep *ep, struct fusb300_request *req)
 		if (req->req.length) {
 			fusb300_wrcxf(ep, req);
 		} else
-			printk(KERN_DEBUG "%s : req->req.length = 0x%x\n",
+			pr_debug("%s : req->req.length = 0x%x\n",
 				__func__, req->req.length);
 		if ((req->req.length == req->req.actual) ||
 		    (req->req.actual < ep->ep.maxpacket))
@@ -565,7 +565,7 @@ static void fusb300_rdcxf(struct fusb300 *fusb300,
 
 	for (i = (length >> 2); i > 0; i--) {
 		data = ioread32(fusb300->reg + FUSB300_OFFSET_CXPORT);
-		printk(KERN_DEBUG "    0x%x\n", data);
+		pr_debug("    0x%x\n", data);
 		*tmp = data & 0xFF;
 		*(tmp + 1) = (data >> 8) & 0xFF;
 		*(tmp + 2) = (data >> 16) & 0xFF;
@@ -576,18 +576,18 @@ static void fusb300_rdcxf(struct fusb300 *fusb300,
 	switch (length % 4) {
 	case 1:
 		data = ioread32(fusb300->reg + FUSB300_OFFSET_CXPORT);
-		printk(KERN_DEBUG "    0x%x\n", data);
+		pr_debug("    0x%x\n", data);
 		*tmp = data & 0xFF;
 		break;
 	case 2:
 		data = ioread32(fusb300->reg + FUSB300_OFFSET_CXPORT);
-		printk(KERN_DEBUG "    0x%x\n", data);
+		pr_debug("    0x%x\n", data);
 		*tmp = data & 0xFF;
 		*(tmp + 1) = (data >> 8) & 0xFF;
 		break;
 	case 3:
 		data = ioread32(fusb300->reg + FUSB300_OFFSET_CXPORT);
-		printk(KERN_DEBUG "    0x%x\n", data);
+		pr_debug("    0x%x\n", data);
 		*tmp = data & 0xFF;
 		*(tmp + 1) = (data >> 8) & 0xFF;
 		*(tmp + 2) = (data >> 16) & 0xFF;
@@ -610,7 +610,7 @@ static void fusb300_rdfifo(struct fusb300_ep *ep,
 	req->req.actual += length;
 
 	if (req->req.actual > req->req.length)
-		printk(KERN_DEBUG "req->req.actual > req->req.length\n");
+		pr_debug("req->req.actual > req->req.length\n");
 
 	for (i = (length >> 2); i > 0; i--) {
 		data = ioread32(fusb300->reg +
@@ -649,7 +649,7 @@ static void fusb300_rdfifo(struct fusb300_ep *ep,
 		reg = ioread32(fusb300->reg + FUSB300_OFFSET_IGR1);
 		reg &= FUSB300_IGR1_SYNF0_EMPTY_INT;
 		if (i)
-			printk(KERN_INFO "sync fifo is not empty!\n");
+			pr_info("sync fifo is not empty!\n");
 		i++;
 	} while (!reg);
 }
@@ -677,7 +677,7 @@ static u8 fusb300_get_cxstall(struct fusb300 *fusb300)
 static void request_error(struct fusb300 *fusb300)
 {
 	fusb300_set_cxstall(fusb300);
-	printk(KERN_DEBUG "request error!!\n");
+	pr_debug("request error!!\n");
 }
 
 static void get_status(struct fusb300 *fusb300, struct usb_ctrlrequest *ctrl)
@@ -999,7 +999,7 @@ static void check_device_mode(struct fusb300 *fusb300)
 		fusb300->gadget.speed = USB_SPEED_UNKNOWN;
 		break;
 	}
-	printk(KERN_INFO "dev_mode = %d\n", (reg & FUSB300_GCR_DEVEN_MSK));
+	pr_info("dev_mode = %d\n", (reg & FUSB300_GCR_DEVEN_MSK));
 }
 
 
@@ -1076,14 +1076,14 @@ static irqreturn_t fusb300_irq(int irq, void *_fusb300)
 	if (int_grp1 & FUSB300_IGR1_WARM_RST_INT) {
 		fusb300_clear_int(fusb300, FUSB300_OFFSET_IGR1,
 				  FUSB300_IGR1_WARM_RST_INT);
-		printk(KERN_INFO"fusb300_warmreset\n");
+		pr_info("fusb300_warmreset\n");
 		fusb300_reset();
 	}
 
 	if (int_grp1 & FUSB300_IGR1_HOT_RST_INT) {
 		fusb300_clear_int(fusb300, FUSB300_OFFSET_IGR1,
 				  FUSB300_IGR1_HOT_RST_INT);
-		printk(KERN_INFO"fusb300_hotreset\n");
+		pr_info("fusb300_hotreset\n");
 		fusb300_reset();
 	}
 
@@ -1097,13 +1097,13 @@ static irqreturn_t fusb300_irq(int irq, void *_fusb300)
 	if (int_grp1 & FUSB300_IGR1_CX_COMABT_INT) {
 		fusb300_clear_int(fusb300, FUSB300_OFFSET_IGR1,
 				  FUSB300_IGR1_CX_COMABT_INT);
-		printk(KERN_INFO"fusb300_ep0abt\n");
+		pr_info("fusb300_ep0abt\n");
 	}
 
 	if (int_grp1 & FUSB300_IGR1_VBUS_CHG_INT) {
 		fusb300_clear_int(fusb300, FUSB300_OFFSET_IGR1,
 				  FUSB300_IGR1_VBUS_CHG_INT);
-		printk(KERN_INFO"fusb300_vbus_change\n");
+		pr_info("fusb300_vbus_change\n");
 	}
 
 	if (int_grp1 & FUSB300_IGR1_U3_EXIT_FAIL_INT) {
@@ -1134,25 +1134,25 @@ static irqreturn_t fusb300_irq(int irq, void *_fusb300)
 	if (int_grp1 & FUSB300_IGR1_U3_EXIT_INT) {
 		fusb300_clear_int(fusb300, FUSB300_OFFSET_IGR1,
 				  FUSB300_IGR1_U3_EXIT_INT);
-		printk(KERN_INFO "FUSB300_IGR1_U3_EXIT_INT\n");
+		pr_info("FUSB300_IGR1_U3_EXIT_INT\n");
 	}
 
 	if (int_grp1 & FUSB300_IGR1_U2_EXIT_INT) {
 		fusb300_clear_int(fusb300, FUSB300_OFFSET_IGR1,
 				  FUSB300_IGR1_U2_EXIT_INT);
-		printk(KERN_INFO "FUSB300_IGR1_U2_EXIT_INT\n");
+		pr_info("FUSB300_IGR1_U2_EXIT_INT\n");
 	}
 
 	if (int_grp1 & FUSB300_IGR1_U1_EXIT_INT) {
 		fusb300_clear_int(fusb300, FUSB300_OFFSET_IGR1,
 				  FUSB300_IGR1_U1_EXIT_INT);
-		printk(KERN_INFO "FUSB300_IGR1_U1_EXIT_INT\n");
+		pr_info("FUSB300_IGR1_U1_EXIT_INT\n");
 	}
 
 	if (int_grp1 & FUSB300_IGR1_U3_ENTRY_INT) {
 		fusb300_clear_int(fusb300, FUSB300_OFFSET_IGR1,
 				  FUSB300_IGR1_U3_ENTRY_INT);
-		printk(KERN_INFO "FUSB300_IGR1_U3_ENTRY_INT\n");
+		pr_info("FUSB300_IGR1_U3_ENTRY_INT\n");
 		fusb300_enable_bit(fusb300, FUSB300_OFFSET_SSCR1,
 				   FUSB300_SSCR1_GO_U3_DONE);
 	}
@@ -1160,31 +1160,31 @@ static irqreturn_t fusb300_irq(int irq, void *_fusb300)
 	if (int_grp1 & FUSB300_IGR1_U2_ENTRY_INT) {
 		fusb300_clear_int(fusb300, FUSB300_OFFSET_IGR1,
 				  FUSB300_IGR1_U2_ENTRY_INT);
-		printk(KERN_INFO "FUSB300_IGR1_U2_ENTRY_INT\n");
+		pr_info("FUSB300_IGR1_U2_ENTRY_INT\n");
 	}
 
 	if (int_grp1 & FUSB300_IGR1_U1_ENTRY_INT) {
 		fusb300_clear_int(fusb300, FUSB300_OFFSET_IGR1,
 				  FUSB300_IGR1_U1_ENTRY_INT);
-		printk(KERN_INFO "FUSB300_IGR1_U1_ENTRY_INT\n");
+		pr_info("FUSB300_IGR1_U1_ENTRY_INT\n");
 	}
 
 	if (int_grp1 & FUSB300_IGR1_RESM_INT) {
 		fusb300_clear_int(fusb300, FUSB300_OFFSET_IGR1,
 				  FUSB300_IGR1_RESM_INT);
-		printk(KERN_INFO "fusb300_resume\n");
+		pr_info("fusb300_resume\n");
 	}
 
 	if (int_grp1 & FUSB300_IGR1_SUSP_INT) {
 		fusb300_clear_int(fusb300, FUSB300_OFFSET_IGR1,
 				  FUSB300_IGR1_SUSP_INT);
-		printk(KERN_INFO "fusb300_suspend\n");
+		pr_info("fusb300_suspend\n");
 	}
 
 	if (int_grp1 & FUSB300_IGR1_HS_LPM_INT) {
 		fusb300_clear_int(fusb300, FUSB300_OFFSET_IGR1,
 				  FUSB300_IGR1_HS_LPM_INT);
-		printk(KERN_INFO "fusb300_HS_LPM_INT\n");
+		pr_info("fusb300_HS_LPM_INT\n");
 	}
 
 	if (int_grp1 & FUSB300_IGR1_DEV_MODE_CHG_INT) {
@@ -1195,11 +1195,11 @@ static irqreturn_t fusb300_irq(int irq, void *_fusb300)
 
 	if (int_grp1 & FUSB300_IGR1_CX_COMFAIL_INT) {
 		fusb300_set_cxstall(fusb300);
-		printk(KERN_INFO "fusb300_ep0fail\n");
+		pr_info("fusb300_ep0fail\n");
 	}
 
 	if (int_grp1 & FUSB300_IGR1_CX_SETUP_INT) {
-		printk(KERN_INFO "fusb300_ep0setup\n");
+		pr_info("fusb300_ep0setup\n");
 		if (setup_packet(fusb300, &ctrl)) {
 			spin_unlock(&fusb300->lock);
 			if (fusb300->driver->setup(&fusb300->gadget, &ctrl) < 0)
@@ -1209,16 +1209,16 @@ static irqreturn_t fusb300_irq(int irq, void *_fusb300)
 	}
 
 	if (int_grp1 & FUSB300_IGR1_CX_CMDEND_INT)
-		printk(KERN_INFO "fusb300_cmdend\n");
+		pr_info("fusb300_cmdend\n");
 
 
 	if (int_grp1 & FUSB300_IGR1_CX_OUT_INT) {
-		printk(KERN_INFO "fusb300_cxout\n");
+		pr_info("fusb300_cxout\n");
 		fusb300_ep0out(fusb300);
 	}
 
 	if (int_grp1 & FUSB300_IGR1_CX_IN_INT) {
-		printk(KERN_INFO "fusb300_cxin\n");
+		pr_info("fusb300_cxin\n");
 		fusb300_ep0in(fusb300);
 	}
 
diff --git a/drivers/usb/gadget/udc/goku_udc.c b/drivers/usb/gadget/udc/goku_udc.c
index 3e1267d38774..4f225552861a 100644
--- a/drivers/usb/gadget/udc/goku_udc.c
+++ b/drivers/usb/gadget/udc/goku_udc.c
@@ -1748,7 +1748,7 @@ static int goku_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	int			retval;
 
 	if (!pdev->irq) {
-		printk(KERN_ERR "Check PCI %s IRQ setup!\n", pci_name(pdev));
+		pr_err("Check PCI %s IRQ setup!\n", pci_name(pdev));
 		retval = -ENODEV;
 		goto err;
 	}
diff --git a/drivers/usb/gadget/udc/r8a66597-udc.h b/drivers/usb/gadget/udc/r8a66597-udc.h
index 9a115caba661..fa4d62c32ea1 100644
--- a/drivers/usb/gadget/udc/r8a66597-udc.h
+++ b/drivers/usb/gadget/udc/r8a66597-udc.h
@@ -247,7 +247,7 @@ static inline u16 get_xtal_from_pdata(struct r8a66597_platdata *pdata)
 		clock = XTAL48;
 		break;
 	default:
-		printk(KERN_ERR "r8a66597: platdata clock is wrong.\n");
+		pr_err("r8a66597: platdata clock is wrong.\n");
 		break;
 	}
 
-- 
2.11.0


^ permalink raw reply related
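
Note on the conversion above: pr_info(fmt, ...) is shorthand for
printk(KERN_INFO pr_fmt(fmt), ...), so those hunks should be equivalent.
pr_debug() is different: unless DEBUG is defined for the file or
CONFIG_DYNAMIC_DEBUG is enabled, it compiles to a no-op, so converting
printk(KERN_DEBUG ...) may silence messages that were previously logged.
The following is a minimal user-space sketch of that behaviour, not the
kernel's implementation. model_printk(), log_test_mode(), log_debug_word()
and the visible "<6>"/"<7>" prefixes are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, visible stand-ins for the kernel's log-level prefixes. */
#define KERN_INFO  "<6>"
#define KERN_DEBUG "<7>"

/* A file may define pr_fmt() before this point to add a common prefix. */
#ifndef pr_fmt
#define pr_fmt(fmt) fmt
#endif

static char last_msg[128];

/* Stand-in for printk(): records the formatted message instead of logging. */
int model_printk(const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(last_msg, sizeof(last_msg), fmt, ap);
	va_end(ap);
	return n;
}

#define pr_info(fmt, ...) \
	model_printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)

#ifdef DEBUG
#define pr_debug(fmt, ...) \
	model_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
#else
/* Arguments are still type-checked, but nothing is ever emitted. */
#define pr_debug(fmt, ...) \
	do { if (0) model_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__); } while (0)
#endif

/* Mirrors the converted message in fsl_udc_core.c. */
const char *log_test_mode(int ptc)
{
	last_msg[0] = '\0';
	pr_info("udc: switch to test mode %d.\n", ptc);
	return last_msg;
}

/* Mirrors the converted debug dump in fusb300_udc.c. */
const char *log_debug_word(unsigned int data)
{
	last_msg[0] = '\0';
	pr_debug("    0x%x\n", data);
	return last_msg;
}

/* Usage demo: the info message is kept, the debug message is dropped. */
void demo_pr_macros(void)
{
	assert(strcmp(log_test_mode(3), "<6>udc: switch to test mode 3.\n") == 0);
#ifndef DEBUG
	assert(log_debug_word(0x1234)[0] == '\0');
#endif
}
```

Building the sketch with -DDEBUG makes log_debug_word() record the message
again, which is the difference a reviewer of such a conversion may want to
keep in mind.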

* Re: [PATCH v3 4/5] powerpc/fault: Avoid heavy search_exception_tables() verification
From: Aneesh Kumar K.V @ 2020-12-08 14:52 UTC (permalink / raw)
  To: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, npiggin
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <731bdee26a5a5c81cd815ed624a6fb3bdef8b4db.1607416578.git.christophe.leroy@csgroup.eu>

Christophe Leroy <christophe.leroy@csgroup.eu> writes:

> search_exception_tables() is a heavy operation, so we have to avoid it.
> When KUAP is selected, we'll know the fault has been blocked by KUAP.
> Otherwise, it behaves just as if the address was already in the TLBs
> and no fault was generated.
>
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> v3: rebased
> v2: Squashed with the preceding patch, which was re-ordering tests that get removed in this patch.
> ---
>  arch/powerpc/mm/fault.c | 23 +++++++----------------
>  1 file changed, 7 insertions(+), 16 deletions(-)
>
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 3fcd34c28e10..1770b41e4730 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -210,28 +210,19 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned long error_code,
>  		return true;
>  	}
>  
> -	if (!is_exec && address < TASK_SIZE && (error_code & (DSISR_PROTFAULT | DSISR_KEYFAULT)) &&
> -	    !search_exception_tables(regs->nip)) {
> -		pr_crit_ratelimited("Kernel attempted to access user page (%lx) - exploit attempt? (uid: %d)\n",
> -				    address,
> -				    from_kuid(&init_user_ns, current_uid()));
> -	}
> -
>  	// Kernel fault on kernel address is bad
>  	if (address >= TASK_SIZE)
>  		return true;
>  
> -	// Fault on user outside of certain regions (eg. copy_tofrom_user()) is bad
> -	if (!search_exception_tables(regs->nip))
> -		return true;
> -
> -	// Read/write fault in a valid region (the exception table search passed
> -	// above), but blocked by KUAP is bad, it can never succeed.
> -	if (bad_kuap_fault(regs, address, is_write))
> +	// Read/write fault blocked by KUAP is bad, it can never succeed.
> +	if (bad_kuap_fault(regs, address, is_write)) {
> +		pr_crit_ratelimited("Kernel attempted to %s user page (%lx) - exploit attempt? (uid: %d)\n",
> +				    is_write ? "write" : "read", address,
> +				    from_kuid(&init_user_ns, current_uid()));
>  		return true;
> +	}


With this I am wondering whether the WARN() in bad_kuap_fault() is
still needed. A direct access to a userspace address will now trigger it,
whereas previously we used bad_kuap_fault() only to identify an incorrect
restore of the AMR register (i.e., to identify kernel bugs), so a WARN()
there was useful. Do we lose that differentiation now?


>  
> -	// What's left? Kernel fault on user in well defined regions (extable
> -	// matched), and allowed by KUAP in the faulting context.
> +	// What's left? Kernel fault on user and allowed by KUAP in the faulting context.
>  	return false;
>  }
>  
> -- 
> 2.25.0

^ permalink raw reply

* Re: [PATCH] arch: fix 'unexpected IRQ trap at vector' warnings
From: Helge Deller @ 2020-12-08 14:42 UTC (permalink / raw)
  To: Michael Ellerman, Enrico Weigelt, metux IT consult, linux-kernel
  Cc: linux-s390, hpa, linux-parisc, jdike, x86, linux-um,
	James.Bottomley, mingo, paulus, richard, bp, tglx, linuxppc-dev,
	anton.ivanov
In-Reply-To: <877dptt5av.fsf@mpe.ellerman.id.au>

On 12/8/20 3:11 AM, Michael Ellerman wrote:
> "Enrico Weigelt, metux IT consult" <info@metux.net> writes:
>> All archs, except Alpha, print out the irq number in hex, but the message
>> looks like it was a decimal number, which is quite confusing. Fixing this
>> by adding "0x" prefix.
>
> Arguably decimal would be better, /proc/interrupts and /proc/irq/ both
> use decimal.

I agree.

> The whole message is very dated IMO, these days the number it prints is
> (possibly) virtualised via IRQ domains, i.e. it's not necessarily a
> "vector" (if that even makes sense on all arches). Arguably "trap" is the
> wrong term on some arches too.
>
> So it would be better reworded entirely IMO, and also switched to
> decimal to match other sources of information on interrupts.
>
> Perhaps:
> 	"Unexpected Linux IRQ %d."

Yes.

And while cleaning this up, it would be nice to introduce a default weak
implementation of ack_bad_irq() that increments irq_err_count on all platforms.

Helge

> If anyone else is having deja vu like me, yes this has come up before:
>   https://lore.kernel.org/lkml/20150712220211.7166.42035.stgit@bhelgaas-glaptop2.roam.corp.google.com/
>
> cheers
>
>
>
>> diff --git a/arch/arm/include/asm/hw_irq.h b/arch/arm/include/asm/hw_irq.h
>> index cecc13214ef1..2749f19271d9 100644
>> --- a/arch/arm/include/asm/hw_irq.h
>> +++ b/arch/arm/include/asm/hw_irq.h
>> @@ -9,7 +9,7 @@ static inline void ack_bad_irq(int irq)
>>  {
>>  	extern unsigned long irq_err_count;
>>  	irq_err_count++;
>> -	pr_crit("unexpected IRQ trap at vector %02x\n", irq);
>> +	pr_crit("unexpected IRQ trap at vector 0x%02x\n", irq);
>>  }
>>
>>  #define ARCH_IRQ_INIT_FLAGS	(IRQ_NOREQUEST | IRQ_NOPROBE)
>> diff --git a/arch/parisc/include/asm/hardirq.h b/arch/parisc/include/asm/hardirq.h
>> index 7f7039516e53..c3348af88d3f 100644
>> --- a/arch/parisc/include/asm/hardirq.h
>> +++ b/arch/parisc/include/asm/hardirq.h
>> @@ -35,6 +35,6 @@ DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
>>  #define __IRQ_STAT(cpu, member) (irq_stat[cpu].member)
>>  #define inc_irq_stat(member)	this_cpu_inc(irq_stat.member)
>>  #define __inc_irq_stat(member)	__this_cpu_inc(irq_stat.member)
>> -#define ack_bad_irq(irq) WARN(1, "unexpected IRQ trap at vector %02x\n", irq)
>> +#define ack_bad_irq(irq) WARN(1, "unexpected IRQ trap at vector 0x%02x\n", irq)
>>
>>  #endif /* _PARISC_HARDIRQ_H */
>> diff --git a/arch/powerpc/include/asm/hardirq.h b/arch/powerpc/include/asm/hardirq.h
>> index f133b5930ae1..ec8cf3cf6e49 100644
>> --- a/arch/powerpc/include/asm/hardirq.h
>> +++ b/arch/powerpc/include/asm/hardirq.h
>> @@ -29,7 +29,7 @@ DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
>>
>>  static inline void ack_bad_irq(unsigned int irq)
>>  {
>> -	printk(KERN_CRIT "unexpected IRQ trap at vector %02x\n", irq);
>> +	printk(KERN_CRIT "unexpected IRQ trap at vector 0x%02x\n", irq);
>>  }
>>
>>  extern u64 arch_irq_stat_cpu(unsigned int cpu);
>> diff --git a/arch/s390/include/asm/hardirq.h b/arch/s390/include/asm/hardirq.h
>> index dfbc3c6c0674..aaaec5cdd4fe 100644
>> --- a/arch/s390/include/asm/hardirq.h
>> +++ b/arch/s390/include/asm/hardirq.h
>> @@ -23,7 +23,7 @@
>>
>>  static inline void ack_bad_irq(unsigned int irq)
>>  {
>> -	printk(KERN_CRIT "unexpected IRQ trap at vector %02x\n", irq);
>> +	printk(KERN_CRIT "unexpected IRQ trap at vector 0x%02x\n", irq);
>>  }
>>
>>  #endif /* __ASM_HARDIRQ_H */
>> diff --git a/arch/um/include/asm/hardirq.h b/arch/um/include/asm/hardirq.h
>> index b426796d26fd..2a2e6eae034b 100644
>> --- a/arch/um/include/asm/hardirq.h
>> +++ b/arch/um/include/asm/hardirq.h
>> @@ -15,7 +15,7 @@ typedef struct {
>>  #ifndef ack_bad_irq
>>  static inline void ack_bad_irq(unsigned int irq)
>>  {
>> -	printk(KERN_CRIT "unexpected IRQ trap at vector %02x\n", irq);
>> +	printk(KERN_CRIT "unexpected IRQ trap at vector 0x%02x\n", irq);
>>  }
>>  #endif
>>
>> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
>> index c5dd50369e2f..957c716f2df7 100644
>> --- a/arch/x86/kernel/irq.c
>> +++ b/arch/x86/kernel/irq.c
>> @@ -37,7 +37,7 @@ atomic_t irq_err_count;
>>  void ack_bad_irq(unsigned int irq)
>>  {
>>  	if (printk_ratelimit())
>> -		pr_err("unexpected IRQ trap at vector %02x\n", irq);
>> +		pr_err("unexpected IRQ trap at vector 0x%02x\n", irq);
>>
>>  	/*
>>  	 * Currently unexpected vectors happen only on SMP and APIC.
>> --
>> 2.11.0


^ permalink raw reply
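
The "default weak implementation" suggested in the reply above could look
roughly like the sketch below. It is a user-space model under stated
assumptions, not a proposed patch: it uses the GCC/Clang weak attribute
directly, a plain unsigned long counter (x86 declares irq_err_count as an
atomic_t), fprintf() in place of the kernel logging call, and the reworded
message proposed in the thread with %u because the argument is unsigned.
demo_bad_irqs() is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for the kernel-wide counter of unexpected interrupts. */
unsigned long irq_err_count;

/*
 * Generic fallback. Because the symbol is weak, an architecture that
 * links in its own (strong) ack_bad_irq() replaces this one.
 */
__attribute__((weak)) void ack_bad_irq(unsigned int irq)
{
	irq_err_count++;
	fprintf(stderr, "Unexpected Linux IRQ %u.\n", irq);
}

/* Usage demo: two spurious interrupts bump the counter twice. */
void demo_bad_irqs(void)
{
	unsigned long before = irq_err_count;

	ack_bad_irq(5);
	ack_bad_irq(7);
	assert(irq_err_count == before + 2);
}
```

One design point to check before adopting this: several architectures
define ack_bad_irq() as a static inline or a macro in a header, and those
definitions would have to be removed or guarded for a weak out-of-line
default to take effect.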

* Re: [PATCH v3 4/5] powerpc/fault: Avoid heavy search_exception_tables() verification
From: Aneesh Kumar K.V @ 2020-12-08 14:31 UTC (permalink / raw)
  To: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, npiggin
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <d66f9706-9e36-5b92-5a87-90ebd05587e9@csgroup.eu>

Christophe Leroy <christophe.leroy@csgroup.eu> writes:

> Le 08/12/2020 à 14:00, Aneesh Kumar K.V a écrit :
>> On 12/8/20 2:07 PM, Christophe Leroy wrote:
>>> search_exception_tables() is a heavy operation, so we have to avoid it.
>>> When KUAP is selected, we'll know the fault has been blocked by KUAP.
>>> Otherwise, it behaves just as if the address was already in the TLBs
>>> and no fault was generated.
>>>
>>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>>> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
>>> ---
>>> v3: rebased
>>> v2: Squashed with the preceding patch, which was re-ordering tests that get removed in this patch.
>>> ---
>>>   arch/powerpc/mm/fault.c | 23 +++++++----------------
>>>   1 file changed, 7 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
>>> index 3fcd34c28e10..1770b41e4730 100644
>>> --- a/arch/powerpc/mm/fault.c
>>> +++ b/arch/powerpc/mm/fault.c
>>> @@ -210,28 +210,19 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned long error_code,
>>>           return true;
>>>       }
>>> -    if (!is_exec && address < TASK_SIZE && (error_code & (DSISR_PROTFAULT | DSISR_KEYFAULT)) &&
>>> -        !search_exception_tables(regs->nip)) {
>>> -        pr_crit_ratelimited("Kernel attempted to access user page (%lx) - exploit attempt? (uid: %d)\n",
>>> -                    address,
>>> -                    from_kuid(&init_user_ns, current_uid()));
>>> -    }
>>> -
>>>       // Kernel fault on kernel address is bad
>>>       if (address >= TASK_SIZE)
>>>           return true;
>>> -    // Fault on user outside of certain regions (eg. copy_tofrom_user()) is bad
>>> -    if (!search_exception_tables(regs->nip))
>>> -        return true;
>>> -
>>> -    // Read/write fault in a valid region (the exception table search passed
>>> -    // above), but blocked by KUAP is bad, it can never succeed.
>>> -    if (bad_kuap_fault(regs, address, is_write))
>>> +    // Read/write fault blocked by KUAP is bad, it can never succeed.
>>> +    if (bad_kuap_fault(regs, address, is_write)) {
>>> +        pr_crit_ratelimited("Kernel attempted to %s user page (%lx) - exploit attempt? (uid: %d)\n",
>>> +                    is_write ? "write" : "read", address,
>>> +                    from_kuid(&init_user_ns, current_uid()));
>>>           return true;
>>> +    }
>> 
>> 
>> Should we update bad_kuap_fault() to check for !is_kernel_addr() and
>> error_code & (DSISR_PROTFAULT | DSISR_KEYFAULT)? I am wondering whether we can
>> take another fault w.r.t. a kernel/user address and end up reporting it as a
>> KUAP fault.
>
> Just before this we do:
>
> 	if (address >= TASK_SIZE)
> 		return true;
>
> About the error code, I don't know. Can we take a fault that is not a
> DSISR_PROTFAULT | DSISR_KEYFAULT and that is not a KUAP fault?
>
> Previously (before this patch), the error code was taken into account for the
> call to search_exception_tables(), but never for bad_kuap_fault().
>

A KUAP fault on radix will result in PROTFAULT, and on hash it will
generate KEYFAULT, i.e., something like below:

diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h
index 32fd4452e960..b18cd931e325 100644
--- a/arch/powerpc/include/asm/book3s/32/kup.h
+++ b/arch/powerpc/include/asm/book3s/32/kup.h
@@ -177,8 +177,8 @@ static inline void restore_user_access(unsigned long flags)
 		allow_user_access(to, to, end - addr, KUAP_READ_WRITE);
 }
 
-static inline bool
-bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write)
+static inline bool bad_kuap_fault(struct pt_regs *regs, unsigned long address,
+				  bool is_write, unsigned long error_code)
 {
 	unsigned long begin = regs->kuap & 0xf0000000;
 	unsigned long end = regs->kuap << 28;
diff --git a/arch/powerpc/include/asm/book3s/64/kup.h b/arch/powerpc/include/asm/book3s/64/kup.h
index 91af50e9b09a..16bb6aee9fcd 100644
--- a/arch/powerpc/include/asm/book3s/64/kup.h
+++ b/arch/powerpc/include/asm/book3s/64/kup.h
@@ -392,14 +392,24 @@ static inline void set_kuap(unsigned long value)
 }
 
 static inline bool bad_kuap_fault(struct pt_regs *regs, unsigned long address,
-				  bool is_write)
+				  bool is_write, unsigned long error_code)
 {
+	unsigned long fault;
+
 	if (!mmu_has_feature(MMU_FTR_BOOK3S_KUAP))
 		return false;
+
+	if (radix_enabled())
+		fault = DSISR_PROTFAULT;
+	else
+		fault = DSISR_KEYFAULT;
+
 	/*
 	 * For radix this will be a storage protection fault (DSISR_PROTFAULT).
 	 * For hash this will be a key fault (DSISR_KEYFAULT)
 	 */
+	if (!(error_code & fault))
+		return false;
 	/*
 	 * We do have exception table entry, but accessing the
 	 * userspace results in fault.  This could be because we
diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h
index 49fe4b4a9434..bdffa2664bf0 100644
--- a/arch/powerpc/include/asm/kup.h
+++ b/arch/powerpc/include/asm/kup.h
@@ -62,8 +62,8 @@ void setup_kuap(bool disabled);
 #else
 static inline void setup_kuap(bool disabled) { }
 
-static inline bool
-bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write)
+static inline bool bad_kuap_fault(struct pt_regs *regs, unsigned long address,
+				  bool is_write, unsigned long error_code)
 {
 	return false;
 }
diff --git a/arch/powerpc/include/asm/nohash/32/kup-8xx.h b/arch/powerpc/include/asm/nohash/32/kup-8xx.h
index 567cdc557402..7bdd9e5b63ed 100644
--- a/arch/powerpc/include/asm/nohash/32/kup-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/kup-8xx.h
@@ -60,8 +60,8 @@ static inline void restore_user_access(unsigned long flags)
 	mtspr(SPRN_MD_AP, flags);
 }
 
-static inline bool
-bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write)
+static inline bool bad_kuap_fault(struct pt_regs *regs, unsigned long address,
+				  bool is_write, unsigned long error_code)
 {
 	return WARN(!((regs->kuap ^ MD_APG_KUAP) & 0xff000000),
 		    "Bug: fault blocked by AP register !");
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index b12595102525..03c3414bdc79 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -218,7 +218,7 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned long error_code,
 	}
 
 	// Kernel fault on kernel address is bad
-	if (address >= TASK_SIZE)
+	if (is_kernel_addr(address))
 		return true;
 
 	// Fault on user outside of certain regions (eg. copy_tofrom_user()) is bad
@@ -227,7 +227,7 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned long error_code,
 
 	// Read/write fault in a valid region (the exception table search passed
 	// above), but blocked by KUAP is bad, it can never succeed.
-	if (bad_kuap_fault(regs, address, is_write))
+	if (bad_kuap_fault(regs, address, is_write, error_code))
 		return true;
 
 	// What's left? Kernel fault on user in well defined regions (extable


^ permalink raw reply related
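
The error-code filter proposed in the diff above can be modelled in user
space as follows. This is only a sketch of the idea (radix reports a KUAP
violation as a protection fault, hash as a key fault, and any other fault
type is not treated as a KUAP fault). kuap_fault_candidate() is a
hypothetical helper, the radix flag stands in for radix_enabled(), and the
two bit values are quoted from memory of asm/reg.h, so treat them as
illustrative rather than authoritative.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values; check arch/powerpc/include/asm/reg.h before reuse. */
#define DSISR_PROTFAULT	0x08000000UL
#define DSISR_KEYFAULT	0x00200000UL

/*
 * Hypothetical helper: returns true when the fault type is the one a KUAP
 * violation would produce on the given MMU, false otherwise.
 */
bool kuap_fault_candidate(bool radix, unsigned long error_code)
{
	unsigned long fault = radix ? DSISR_PROTFAULT : DSISR_KEYFAULT;

	return (error_code & fault) != 0;
}

/* Usage demo covering both MMU types. */
void demo_kuap_filter(void)
{
	/* Radix: only a protection fault qualifies. */
	assert(kuap_fault_candidate(true, DSISR_PROTFAULT));
	assert(!kuap_fault_candidate(true, DSISR_KEYFAULT));

	/* Hash: only a key fault qualifies. */
	assert(kuap_fault_candidate(false, DSISR_KEYFAULT));
	assert(!kuap_fault_candidate(false, DSISR_PROTFAULT));

	/* A fault with neither bit set is never reported as KUAP. */
	assert(!kuap_fault_candidate(true, 0));
	assert(!kuap_fault_candidate(false, 0));
}
```

In the real bad_kuap_fault() this check would only gate the existing AMR
inspection; the sketch leaves that part out.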

* Re: [PATCH v3 4/5] powerpc/fault: Avoid heavy search_exception_tables() verification
From: Christophe Leroy @ 2020-12-08 14:26 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, npiggin
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <0e25f03d-9f59-b963-312c-c3ae1d7953a2@linux.ibm.com>



Le 08/12/2020 à 14:00, Aneesh Kumar K.V a écrit :
> On 12/8/20 2:07 PM, Christophe Leroy wrote:
>> search_exception_tables() is a heavy operation, so we have to avoid it.
>> When KUAP is selected, we'll know the fault has been blocked by KUAP.
>> Otherwise, it behaves just as if the address was already in the TLBs
>> and no fault was generated.
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>> v3: rebased
>> v2: Squashed with the preceding patch, which was re-ordering tests that get removed in this patch.
>> ---
>>   arch/powerpc/mm/fault.c | 23 +++++++----------------
>>   1 file changed, 7 insertions(+), 16 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
>> index 3fcd34c28e10..1770b41e4730 100644
>> --- a/arch/powerpc/mm/fault.c
>> +++ b/arch/powerpc/mm/fault.c
>> @@ -210,28 +210,19 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned long error_code,
>>           return true;
>>       }
>> -    if (!is_exec && address < TASK_SIZE && (error_code & (DSISR_PROTFAULT | DSISR_KEYFAULT)) &&
>> -        !search_exception_tables(regs->nip)) {
>> -        pr_crit_ratelimited("Kernel attempted to access user page (%lx) - exploit attempt? (uid: %d)\n",
>> -                    address,
>> -                    from_kuid(&init_user_ns, current_uid()));
>> -    }
>> -
>>       // Kernel fault on kernel address is bad
>>       if (address >= TASK_SIZE)
>>           return true;
>> -    // Fault on user outside of certain regions (eg. copy_tofrom_user()) is bad
>> -    if (!search_exception_tables(regs->nip))
>> -        return true;
>> -
>> -    // Read/write fault in a valid region (the exception table search passed
>> -    // above), but blocked by KUAP is bad, it can never succeed.
>> -    if (bad_kuap_fault(regs, address, is_write))
>> +    // Read/write fault blocked by KUAP is bad, it can never succeed.
>> +    if (bad_kuap_fault(regs, address, is_write)) {
>> +        pr_crit_ratelimited("Kernel attempted to %s user page (%lx) - exploit attempt? (uid: %d)\n",
>> +                    is_write ? "write" : "read", address,
>> +                    from_kuid(&init_user_ns, current_uid()));
>>           return true;
>> +    }
> 
> 
> Should we update bad_kuap_fault() to check for !is_kernel_addr() and
> error_code & (DSISR_PROTFAULT | DSISR_KEYFAULT)? I am wondering whether we can
> take another fault w.r.t. a kernel/user address and end up reporting it as a
> KUAP fault.

Just before this we do:

	if (address >= TASK_SIZE)
		return true;

About the error code, I don't know. Can we take a fault that is not a
DSISR_PROTFAULT | DSISR_KEYFAULT and that is not a KUAP fault?

Previously (before this patch), the error code was taken into account for the
call to search_exception_tables(), but never for bad_kuap_fault().

> 
>> -    // What's left? Kernel fault on user in well defined regions (extable
>> -    // matched), and allowed by KUAP in the faulting context.
>> +    // What's left? Kernel fault on user and allowed by KUAP in the faulting context.
>>       return false;
>>   }
>>
> 
> 
> -aneesh

Christophe

^ permalink raw reply

* Re: [PATCH] powerpc/book3s_hv_uvmem: Check for failed page migration
From: Ram Pai @ 2020-12-08 14:20 UTC (permalink / raw)
  To: Alistair Popple; +Cc: linuxppc-dev, bharata
In-Reply-To: <20201203050812.5234-1-alistair@popple.id.au>

On Thu, Dec 03, 2020 at 04:08:12PM +1100, Alistair Popple wrote:
> migrate_vma_pages() may still clear MIGRATE_PFN_MIGRATE on pages which
> are not able to be migrated. Drivers may safely copy data prior to
> calling migrate_vma_pages() however a remote mapping must not be
> established until after migrate_vma_pages() has returned as the
> migration could still fail.
> 
> UV_PAGE_IN both copies and maps the data page, therefore it should
> only be called after checking the results of migrate_vma_pages().
> 
> Signed-off-by: Alistair Popple <alistair@popple.id.au>
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 84e5a2dc8be5..08aa6a90c525 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -762,7 +762,10 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma,
>  		goto out_finalize;
>  	}
> 
> -	if (pagein) {
> +	*mig.dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
> +	migrate_vma_pages(&mig);
> +
> +	if ((*mig.src & MIGRATE_PFN_MIGRATE) && pagein) {
>  		pfn = *mig.src >> MIGRATE_PFN_SHIFT;
>  		spage = migrate_pfn_to_page(*mig.src);
>  		if (spage) {
> @@ -773,8 +776,6 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma,
>  		}
>  	}
> 
> -	*mig.dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
> -	migrate_vma_pages(&mig);
>  out_finalize:
>  	migrate_vma_finalize(&mig);
>  	return ret;

Though this patch did not solve the specific problem I am running into,
my tests did not expose any regression.

Tested-by: Ram Pai <linuxram@us.ibm.com>


Reviewed-by: Ram Pai <linuxram@us.ibm.com>
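
The ordering the patch enforces can be sketched as a small userspace model.
This is only an illustration under stated assumptions, not kernel code:
FAKE_PFN_MIGRATE, fake_migrate_pages(), fake_page_in() and
model_svm_page_in() are hypothetical stand-ins for MIGRATE_PFN_MIGRATE,
migrate_vma_pages(), the UV_PAGE_IN step and kvmppc_svm_page_in().

```c
#include <stdbool.h>

/* Hypothetical stand-in for MIGRATE_PFN_MIGRATE; the bit value is arbitrary. */
#define FAKE_PFN_MIGRATE (1UL << 1)

/* Counts how often the (hypothetical) copy-and-map step was performed. */
int fake_page_in_calls;

/* Stand-in for migrate_vma_pages(): clears the flag when migration fails. */
void fake_migrate_pages(unsigned long *src, bool can_migrate)
{
	if (!can_migrate)
		*src &= ~FAKE_PFN_MIGRATE;
}

/* Stand-in for the step that both copies and maps the data page. */
void fake_page_in(void)
{
	fake_page_in_calls++;
}

/*
 * Model of the patched flow: run the migration step first, then page in
 * only if the source entry still carries the migrate flag.
 * Returns true when the page-in step was performed.
 */
bool model_svm_page_in(unsigned long *src, bool can_migrate, bool pagein)
{
	fake_migrate_pages(src, can_migrate);

	if ((*src & FAKE_PFN_MIGRATE) && pagein) {
		fake_page_in();
		return true;
	}
	return false;
}
```

In this model a failed migration leaves the flag cleared, so the remote
mapping is never established for a page that did not migrate.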

* Re: [PATCH v3 4/5] powerpc/fault: Avoid heavy search_exception_tables() verification
From: Aneesh Kumar K.V @ 2020-12-08 13:00 UTC (permalink / raw)
  To: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, npiggin
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <731bdee26a5a5c81cd815ed624a6fb3bdef8b4db.1607416578.git.christophe.leroy@csgroup.eu>

On 12/8/20 2:07 PM, Christophe Leroy wrote:
> search_exception_tables() is a heavy operation, we have to avoid it.
> When KUAP is selected, we'll know the fault has been blocked by KUAP.
> Otherwise, it behaves just as if the address was already in the TLBs
> and no fault was generated.
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> v3: rebased
> v2: Squashed with the preceding patch which was re-ordering tests that get removed in this patch.
> ---
>   arch/powerpc/mm/fault.c | 23 +++++++----------------
>   1 file changed, 7 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 3fcd34c28e10..1770b41e4730 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -210,28 +210,19 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned long error_code,
>   		return true;
>   	}
>   
> -	if (!is_exec && address < TASK_SIZE && (error_code & (DSISR_PROTFAULT | DSISR_KEYFAULT)) &&
> -	    !search_exception_tables(regs->nip)) {
> -		pr_crit_ratelimited("Kernel attempted to access user page (%lx) - exploit attempt? (uid: %d)\n",
> -				    address,
> -				    from_kuid(&init_user_ns, current_uid()));
> -	}
> -
>   	// Kernel fault on kernel address is bad
>   	if (address >= TASK_SIZE)
>   		return true;
>   
> -	// Fault on user outside of certain regions (eg. copy_tofrom_user()) is bad
> -	if (!search_exception_tables(regs->nip))
> -		return true;
> -
> -	// Read/write fault in a valid region (the exception table search passed
> -	// above), but blocked by KUAP is bad, it can never succeed.
> -	if (bad_kuap_fault(regs, address, is_write))
> +	// Read/write fault blocked by KUAP is bad, it can never succeed.
> +	if (bad_kuap_fault(regs, address, is_write)) {
> +		pr_crit_ratelimited("Kernel attempted to %s user page (%lx) - exploit attempt? (uid: %d)\n",
> +				    is_write ? "write" : "read", address,
> +				    from_kuid(&init_user_ns, current_uid()));
>   		return true;
> +	}


Should we update bad_kuap_fault() to check for !is_kernel_addr() and
error_code & (DSISR_PROTFAULT | DSISR_KEYFAULT)? I am wondering
whether we could take another fault on a kernel or user address and
end up reporting it as a KUAP fault.

>   
> -	// What's left? Kernel fault on user in well defined regions (extable
> -	// matched), and allowed by KUAP in the faulting context.
> +	// What's left? Kernel fault on user and allowed by KUAP in the faulting context.
>   	return false;
>   }
>   
> 


-aneesh
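
The extra guard suggested in this reply can be sketched as a userspace
model. Everything here is an assumption made for illustration: the MODEL_*
constants only mimic DSISR_PROTFAULT, DSISR_KEYFAULT and TASK_SIZE, and the
user_access_locked parameter is a hypothetical stand-in for whatever KUAP
state the real bad_kuap_fault() inspects.

```c
#include <stdbool.h>

/* Illustrative values only; the real bits and limit are arch-defined. */
#define MODEL_DSISR_PROTFAULT 0x08000000UL
#define MODEL_DSISR_KEYFAULT  0x00200000UL
#define MODEL_TASK_SIZE       (1ULL << 46)

/* Stand-in for is_kernel_addr(): anything at or above the user limit. */
bool model_is_kernel_addr(unsigned long long address)
{
	return address >= MODEL_TASK_SIZE;
}

/*
 * Sketch of the suggested guard: report a KUAP fault only when the
 * access hit a user address, the error code says it was a protection
 * or key fault, and user access was locked in the faulting context.
 */
bool model_bad_kuap_fault(unsigned long long address,
			  unsigned long error_code,
			  bool user_access_locked)
{
	if (model_is_kernel_addr(address))
		return false;
	if (!(error_code & (MODEL_DSISR_PROTFAULT | MODEL_DSISR_KEYFAULT)))
		return false;
	return user_access_locked;
}
```

With these checks, a fault on a kernel address or one with an unrelated
error code is never reported as a KUAP violation in the model.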

* Re: [RFC][PATCH 1/2] libnvdimm: Introduce ND_CMD_GET_STAT to retrieve nvdimm statistics
From: Peter Zijlstra @ 2020-12-08 13:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Santosh Sivaraj, Ira Weiny, linux-nvdimm,
	Arnaldo Carvalho de Melo, Ingo Molnar, Aneesh Kumar K . V,
	Vaibhav Jain, linuxppc-dev
In-Reply-To: <CAPcyv4h0PAPyYoea2oxqw_mOZ-Ec-o1MwcdSN0gf5UXqZqjafQ@mail.gmail.com>

On Mon, Dec 07, 2020 at 04:54:21PM -0800, Dan Williams wrote:
> [ add perf maintainers ]
> 
> On Sun, Nov 8, 2020 at 1:16 PM Vaibhav Jain <vaibhav@linux.ibm.com> wrote:
> >
> > Implement support for exposing generic nvdimm statistics via newly
> > introduced dimm-command ND_CMD_GET_STAT that can be handled by nvdimm
> > command handler function and provide values for these statistics back
> > to libnvdimm. Following generic nvdimm statistics are defined as an
> > enumeration in 'uapi/ndctl.h':
> >
> > * "media_reads" : Number of media reads that have occurred since reboot.
> > * "media_writes" : Number of media writes that have occurred since reboot.
> > * "read_requests" : Number of read requests that have occurred since reboot.
> > * "write_requests" : Number of write requests that have occurred since reboot.
> 
> Perhaps document these as "since device reset"? As I can imagine some
> devices might have a mechanism to reset the count outside of "reboot"
> which is a bit ambiguous.
> 
> > * "total_media_reads" : Total number of media reads that have occurred.
> > * "total_media_writes" : Total number of media writes that have occurred.
> > * "total_read_requests" : Total number of read requests that have occurred.
> > * "total_write_requests" : Total number of write requests that have occurred.
> >
> > Apart from ND_CMD_GET_STAT ioctl these nvdimm statistics are also
> > exposed via sysfs '<nvdimm-device>/stats' directory for easy user-space
> > access like below:
> >
> > /sys/class/nd/ndctl0/device/nmem0/stats # tail -n +1 *
> > ==> media_reads <==
> > 252197707602
> > ==> media_writes <==
> > 20684685172
> > ==> read_requests <==
> > 658810924962
> > ==> write_requests <==
> > 404464081574
> 
> Hmm, I haven't looked but how hard would it be to plumb these to be
> perf counter-events. So someone could combine these with other perf
> counters?
> 
> > In case a specific nvdimm-statistic is not supported, the nvdimm
> > command handler function can simply return an error (e.g. -ENOENT) for
> > a request to read that nvdimm-statistic.
> 
> Makes sense, but I expect the perf route also has a way to enumerate
> which statistics / counters are supported. I'm not opposed to also
> having them in sysfs, but I think perf support should be a first class
> citizen.

arch/x86/events/msr.c might be a good starting point for a software pmu
delivering pure counters.

* Re: [PATCH] powerpc/mce: Remove per cpu variables from MCE handlers
From: Mahesh Jagannath Salgaonkar @ 2020-12-08 12:04 UTC (permalink / raw)
  To: Ganesh, Michael Ellerman, linuxppc-dev; +Cc: npiggin
In-Reply-To: <a514db98-6090-467a-74ae-9c7b4337d0c1@linux.ibm.com>

On 12/8/20 4:16 PM, Ganesh wrote:
> 
> On 12/8/20 4:01 PM, Michael Ellerman wrote:
>> Ganesh Goudar <ganeshgr@linux.ibm.com> writes:
>>> diff --git a/arch/powerpc/include/asm/paca.h
>>> b/arch/powerpc/include/asm/paca.h
>>> index 9454d29ff4b4..4769954efa7d 100644
>>> --- a/arch/powerpc/include/asm/paca.h
>>> +++ b/arch/powerpc/include/asm/paca.h
>>> @@ -273,6 +274,17 @@ struct paca_struct {
>>>   #ifdef CONFIG_MMIOWB
>>>       struct mmiowb_state mmiowb_state;
>>>   #endif
>>> +#ifdef CONFIG_PPC_BOOK3S_64
>>> +    int mce_nest_count;
>>> +    struct machine_check_event mce_event[MAX_MC_EVT];
>>> +    /* Queue for delayed MCE events. */
>>> +    int mce_queue_count;
>>> +    struct machine_check_event mce_event_queue[MAX_MC_EVT];
>>> +
>>> +    /* Queue for delayed MCE UE events. */
>>> +    int mce_ue_count;
>>> +    struct machine_check_event  mce_ue_event_queue[MAX_MC_EVT];
>>> +#endif /* CONFIG_PPC_BOOK3S_64 */
>>>   } ____cacheline_aligned;
>> How much does this expand the paca by?
> 
> The paca is 4480 bytes; these fields add another 2160 bytes, which
> expands it by 48%.
> 

Should we dynamically allocate the arrays early, similar to
paca->mce_faulty_slbs, so that we don't bump up the paca size?

Thanks,
-Mahesh.
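
The trade-off raised here can be sketched in userspace C. This is a model
under stated assumptions, not the paca layout: the struct names,
MODEL_MAX_EVT and the 64-byte event size are hypothetical, and calloc()
merely stands in for an early-boot allocator.

```c
#include <stddef.h>
#include <stdlib.h>

#define MODEL_MAX_EVT 10	/* hypothetical queue depth */

/* Hypothetical event record; the real structure differs. */
struct model_event {
	unsigned char payload[64];
};

/* MCE bookkeeping gathered into one separately allocated block. */
struct model_mce_info {
	int nest_count;
	struct model_event events[MODEL_MAX_EVT];
};

/* Variant A: the arrays live inside the per-CPU structure itself. */
struct model_paca_embedded {
	struct model_mce_info mce_info;
};

/* Variant B: the per-CPU structure only carries a pointer. */
struct model_paca_dynamic {
	struct model_mce_info *mce_info;
};

/* Allocate the block for variant B; returns 0 on success, -1 on failure. */
int model_alloc_mce_info(struct model_paca_dynamic *paca)
{
	paca->mce_info = calloc(1, sizeof(*paca->mce_info));
	return paca->mce_info ? 0 : -1;
}

/* Integer percentage by which 'added' bytes grow a 'base'-byte structure. */
unsigned int model_growth_percent(size_t base, size_t added)
{
	return (unsigned int)(added * 100 / base);
}
```

Variant B keeps the per-CPU structure one pointer wide regardless of the
queue depth, at the cost of an extra allocation and a pointer dereference.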
