* [RFC PATCH 01/15] irqchip/riscv-imsic: Use hierarchy to reach irq_set_affinity
2024-11-14 16:18 [RFC PATCH 00/15] iommu/riscv: Add irqbypass support Andrew Jones
@ 2024-11-14 16:18 ` Andrew Jones
2024-12-03 13:53 ` Thomas Gleixner
2024-11-14 16:18 ` [RFC PATCH 02/15] genirq/msi: Provide DOMAIN_BUS_MSI_REMAP Andrew Jones
` (13 subsequent siblings)
14 siblings, 1 reply; 36+ messages in thread
From: Andrew Jones @ 2024-11-14 16:18 UTC (permalink / raw)
To: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel
Cc: tjeznach, zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
In order to support IRQ domains which reside between the leaf domains
and the IMSIC, move the IMSIC implementation of irq_set_affinity into its
irq_chip.
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
---
drivers/irqchip/irq-riscv-imsic-platform.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/drivers/irqchip/irq-riscv-imsic-platform.c b/drivers/irqchip/irq-riscv-imsic-platform.c
index c708780e8760..5d7c30ad8855 100644
--- a/drivers/irqchip/irq-riscv-imsic-platform.c
+++ b/drivers/irqchip/irq-riscv-imsic-platform.c
@@ -96,9 +96,8 @@ static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask
bool force)
{
struct imsic_vector *old_vec, *new_vec;
- struct irq_data *pd = d->parent_data;
- old_vec = irq_data_get_irq_chip_data(pd);
+ old_vec = irq_data_get_irq_chip_data(d);
if (WARN_ON(!old_vec))
return -ENOENT;
@@ -116,13 +115,13 @@ static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask
return -ENOSPC;
/* Point device to the new vector */
- imsic_msi_update_msg(d, new_vec);
+ imsic_msi_update_msg(irq_get_irq_data(d->irq), new_vec);
/* Update irq descriptors with the new vector */
- pd->chip_data = new_vec;
+ d->chip_data = new_vec;
- /* Update effective affinity of parent irq data */
- irq_data_update_effective_affinity(pd, cpumask_of(new_vec->cpu));
+ /* Update effective affinity */
+ irq_data_update_effective_affinity(d, cpumask_of(new_vec->cpu));
/* Move state of the old vector to the new vector */
imsic_vector_move(old_vec, new_vec);
@@ -135,6 +134,9 @@ static struct irq_chip imsic_irq_base_chip = {
.name = "IMSIC",
.irq_mask = imsic_irq_mask,
.irq_unmask = imsic_irq_unmask,
+#ifdef CONFIG_SMP
+ .irq_set_affinity = imsic_irq_set_affinity,
+#endif
.irq_retrigger = imsic_irq_retrigger,
.irq_compose_msi_msg = imsic_irq_compose_msg,
.flags = IRQCHIP_SKIP_SET_WAKE |
@@ -245,7 +247,7 @@ static bool imsic_init_dev_msi_info(struct device *dev,
if (WARN_ON_ONCE(domain != real_parent))
return false;
#ifdef CONFIG_SMP
- info->chip->irq_set_affinity = imsic_irq_set_affinity;
+ info->chip->irq_set_affinity = irq_chip_set_affinity_parent;
#endif
break;
default:
--
2.47.0
^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 01/15] irqchip/riscv-imsic: Use hierarchy to reach irq_set_affinity
2024-11-14 16:18 ` [RFC PATCH 01/15] irqchip/riscv-imsic: Use hierarchy to reach irq_set_affinity Andrew Jones
@ 2024-12-03 13:53 ` Thomas Gleixner
2024-12-03 16:27 ` Andrew Jones
2024-12-03 16:37 ` Anup Patel
0 siblings, 2 replies; 36+ messages in thread
From: Thomas Gleixner @ 2024-12-03 13:53 UTC (permalink / raw)
To: Andrew Jones, iommu, kvm-riscv, kvm, linux-riscv, linux-kernel
Cc: tjeznach, zong.li, joro, will, robin.murphy, anup, atishp,
alex.williamson, paul.walmsley, palmer, aou
On Thu, Nov 14 2024 at 17:18, Andrew Jones wrote:
> @@ -96,9 +96,8 @@ static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask
> bool force)
> {
> struct imsic_vector *old_vec, *new_vec;
> - struct irq_data *pd = d->parent_data;
>
> - old_vec = irq_data_get_irq_chip_data(pd);
> + old_vec = irq_data_get_irq_chip_data(d);
> if (WARN_ON(!old_vec))
> return -ENOENT;
>
> @@ -116,13 +115,13 @@ static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask
> return -ENOSPC;
>
> /* Point device to the new vector */
> - imsic_msi_update_msg(d, new_vec);
> + imsic_msi_update_msg(irq_get_irq_data(d->irq), new_vec);
This looks more than fishy. See below.
> @@ -245,7 +247,7 @@ static bool imsic_init_dev_msi_info(struct device *dev,
> if (WARN_ON_ONCE(domain != real_parent))
> return false;
> #ifdef CONFIG_SMP
> - info->chip->irq_set_affinity = imsic_irq_set_affinity;
> + info->chip->irq_set_affinity = irq_chip_set_affinity_parent;
This should use msi_domain_set_affinity(), which does the right thing:
1) It invokes the irq_set_affinity() callback of the parent domain
2) It composes the message via the hierarchy
3) It writes the message with the irq_write_msi_msg() callback of the top
level domain
Sorry, I missed that when reviewing the original IMSIC MSI support.
The whole IMSIC MSI support can be moved over to MSI LIB which makes all
of this indirection go away and your intermediate domain will just fit
in.
Uncompiled patch below. If that works, it needs to be split up properly.
Note, this removes the setup of the irq_retrigger callback, but that's
fine because on hierarchical domains irq_chip_retrigger_hierarchy() is
invoked anyway. See try_retrigger().
Thanks,
tglx
---
drivers/irqchip/Kconfig | 1
drivers/irqchip/irq-gic-v2m.c | 1
drivers/irqchip/irq-imx-mu-msi.c | 1
drivers/irqchip/irq-msi-lib.c | 11 +-
drivers/irqchip/irq-mvebu-gicp.c | 1
drivers/irqchip/irq-mvebu-odmi.c | 1
drivers/irqchip/irq-mvebu-sei.c | 1
drivers/irqchip/irq-riscv-imsic-platform.c | 131 +----------------------------
include/linux/msi.h | 11 ++
9 files changed, 32 insertions(+), 127 deletions(-)
--- a/drivers/irqchip/Kconfig
+++ b/drivers/irqchip/Kconfig
@@ -587,6 +587,7 @@ config RISCV_IMSIC
select IRQ_DOMAIN_HIERARCHY
select GENERIC_IRQ_MATRIX_ALLOCATOR
select GENERIC_MSI_IRQ
+ select IRQ_MSI_LIB
config RISCV_IMSIC_PCI
bool
--- a/drivers/irqchip/irq-gic-v2m.c
+++ b/drivers/irqchip/irq-gic-v2m.c
@@ -255,6 +255,7 @@ static void __init gicv2m_teardown(void)
static struct msi_parent_ops gicv2m_msi_parent_ops = {
.supported_flags = GICV2M_MSI_FLAGS_SUPPORTED,
.required_flags = GICV2M_MSI_FLAGS_REQUIRED,
+ .chip_flags = MSI_CHIP_FLAG_SET_EOI | MSI_CHIP_FLAG_SET_ACK,
.bus_select_token = DOMAIN_BUS_NEXUS,
.bus_select_mask = MATCH_PCI_MSI | MATCH_PLATFORM_MSI,
.prefix = "GICv2m-",
--- a/drivers/irqchip/irq-imx-mu-msi.c
+++ b/drivers/irqchip/irq-imx-mu-msi.c
@@ -214,6 +214,7 @@ static void imx_mu_msi_irq_handler(struc
static const struct msi_parent_ops imx_mu_msi_parent_ops = {
.supported_flags = IMX_MU_MSI_FLAGS_SUPPORTED,
.required_flags = IMX_MU_MSI_FLAGS_REQUIRED,
+ .chip_flags = MSI_CHIP_FLAG_SET_EOI | MSI_CHIP_FLAG_SET_ACK,
.bus_select_token = DOMAIN_BUS_NEXUS,
.bus_select_mask = MATCH_PLATFORM_MSI,
.prefix = "MU-MSI-",
--- a/drivers/irqchip/irq-msi-lib.c
+++ b/drivers/irqchip/irq-msi-lib.c
@@ -28,6 +28,7 @@ bool msi_lib_init_dev_msi_info(struct de
struct msi_domain_info *info)
{
const struct msi_parent_ops *pops = real_parent->msi_parent_ops;
+ struct irq_chip *chip = info->chip;
u32 required_flags;
/* Parent ops available? */
@@ -92,10 +93,10 @@ bool msi_lib_init_dev_msi_info(struct de
info->flags |= required_flags;
/* Chip updates for all child bus types */
- if (!info->chip->irq_eoi)
- info->chip->irq_eoi = irq_chip_eoi_parent;
- if (!info->chip->irq_ack)
- info->chip->irq_ack = irq_chip_ack_parent;
+ if (!chip->irq_eoi && (pops->chip_flags & MSI_CHIP_FLAG_SET_EOI))
+ chip->irq_eoi = irq_chip_eoi_parent;
+ if (!chip->irq_ack && (pops->chip_flags & MSI_CHIP_FLAG_SET_ACK))
+ chip->irq_ack = irq_chip_ack_parent;
/*
* The device MSI domain can never have a set affinity callback. It
@@ -105,7 +106,7 @@ bool msi_lib_init_dev_msi_info(struct de
* device MSI domain aside of mask/unmask which is provided e.g. by
* PCI/MSI device domains.
*/
- info->chip->irq_set_affinity = msi_domain_set_affinity;
+ chip->irq_set_affinity = msi_domain_set_affinity;
return true;
}
EXPORT_SYMBOL_GPL(msi_lib_init_dev_msi_info);
--- a/drivers/irqchip/irq-mvebu-gicp.c
+++ b/drivers/irqchip/irq-mvebu-gicp.c
@@ -161,6 +161,7 @@ static const struct irq_domain_ops gicp_
static const struct msi_parent_ops gicp_msi_parent_ops = {
.supported_flags = GICP_MSI_FLAGS_SUPPORTED,
.required_flags = GICP_MSI_FLAGS_REQUIRED,
+ .chip_flags = MSI_CHIP_FLAG_SET_EOI | MSI_CHIP_FLAG_SET_ACK,
.bus_select_token = DOMAIN_BUS_GENERIC_MSI,
.bus_select_mask = MATCH_PLATFORM_MSI,
.prefix = "GICP-",
--- a/drivers/irqchip/irq-mvebu-odmi.c
+++ b/drivers/irqchip/irq-mvebu-odmi.c
@@ -157,6 +157,7 @@ static const struct irq_domain_ops odmi_
static const struct msi_parent_ops odmi_msi_parent_ops = {
.supported_flags = ODMI_MSI_FLAGS_SUPPORTED,
.required_flags = ODMI_MSI_FLAGS_REQUIRED,
+ .chip_flags = MSI_CHIP_FLAG_SET_EOI | MSI_CHIP_FLAG_SET_ACK,
.bus_select_token = DOMAIN_BUS_GENERIC_MSI,
.bus_select_mask = MATCH_PLATFORM_MSI,
.prefix = "ODMI-",
--- a/drivers/irqchip/irq-mvebu-sei.c
+++ b/drivers/irqchip/irq-mvebu-sei.c
@@ -356,6 +356,7 @@ static void mvebu_sei_reset(struct mvebu
static const struct msi_parent_ops sei_msi_parent_ops = {
.supported_flags = SEI_MSI_FLAGS_SUPPORTED,
.required_flags = SEI_MSI_FLAGS_REQUIRED,
+ .chip_flags = MSI_CHIP_FLAG_SET_EOI | MSI_CHIP_FLAG_SET_ACK,
.bus_select_mask = MATCH_PLATFORM_MSI,
.bus_select_token = DOMAIN_BUS_GENERIC_MSI,
.prefix = "SEI-",
--- a/drivers/irqchip/irq-riscv-imsic-platform.c
+++ b/drivers/irqchip/irq-riscv-imsic-platform.c
@@ -21,6 +21,7 @@
#include <linux/smp.h>
#include "irq-riscv-imsic-state.h"
+#include "irq-msi-lib.h"
static bool imsic_cpu_page_phys(unsigned int cpu, unsigned int guest_index,
phys_addr_t *out_msi_pa)
@@ -84,19 +85,10 @@ static void imsic_irq_compose_msg(struct
}
#ifdef CONFIG_SMP
-static void imsic_msi_update_msg(struct irq_data *d, struct imsic_vector *vec)
-{
- struct msi_msg msg = { };
-
- imsic_irq_compose_vector_msg(vec, &msg);
- irq_data_get_irq_chip(d)->irq_write_msi_msg(d, &msg);
-}
-
static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
bool force)
{
struct imsic_vector *old_vec, *new_vec;
- struct irq_data *pd = d->parent_data;
old_vec = irq_data_get_irq_chip_data(pd);
if (WARN_ON(!old_vec))
@@ -115,14 +107,11 @@ static int imsic_irq_set_affinity(struct
if (!new_vec)
return -ENOSPC;
- /* Point device to the new vector */
- imsic_msi_update_msg(d, new_vec);
-
/* Update irq descriptors with the new vector */
- pd->chip_data = new_vec;
+ d->chip_data = new_vec;
/* Update effective affinity of parent irq data */
- irq_data_update_effective_affinity(pd, cpumask_of(new_vec->cpu));
+ irq_data_update_effective_affinity(d, cpumask_of(new_vec->cpu));
/* Move state of the old vector to the new vector */
imsic_vector_move(old_vec, new_vec);
@@ -137,6 +126,9 @@ static struct irq_chip imsic_irq_base_ch
.irq_unmask = imsic_irq_unmask,
.irq_retrigger = imsic_irq_retrigger,
.irq_compose_msi_msg = imsic_irq_compose_msg,
+#ifdef CONFIG_SMP
+ .irq_set_affinity = imsic_irq_set_affinity,
+#endif
.flags = IRQCHIP_SKIP_SET_WAKE |
IRQCHIP_MASK_ON_SUSPEND,
};
@@ -172,22 +164,6 @@ static void imsic_irq_domain_free(struct
irq_domain_free_irqs_parent(domain, virq, nr_irqs);
}
-static int imsic_irq_domain_select(struct irq_domain *domain, struct irq_fwspec *fwspec,
- enum irq_domain_bus_token bus_token)
-{
- const struct msi_parent_ops *ops = domain->msi_parent_ops;
- u32 busmask = BIT(bus_token);
-
- if (fwspec->fwnode != domain->fwnode || fwspec->param_count != 0)
- return 0;
-
- /* Handle pure domain searches */
- if (bus_token == ops->bus_select_token)
- return 1;
-
- return !!(ops->bus_select_mask & busmask);
-}
-
#ifdef CONFIG_GENERIC_IRQ_DEBUGFS
static void imsic_irq_debug_show(struct seq_file *m, struct irq_domain *d,
struct irq_data *irqd, int ind)
@@ -210,104 +186,15 @@ static const struct irq_domain_ops imsic
#endif
};
-#ifdef CONFIG_RISCV_IMSIC_PCI
-
-static void imsic_pci_mask_irq(struct irq_data *d)
-{
- pci_msi_mask_irq(d);
- irq_chip_mask_parent(d);
-}
-
-static void imsic_pci_unmask_irq(struct irq_data *d)
-{
- irq_chip_unmask_parent(d);
- pci_msi_unmask_irq(d);
-}
-
-#define MATCH_PCI_MSI BIT(DOMAIN_BUS_PCI_MSI)
-
-#else
-
-#define MATCH_PCI_MSI 0
-
-#endif
-
-static bool imsic_init_dev_msi_info(struct device *dev,
- struct irq_domain *domain,
- struct irq_domain *real_parent,
- struct msi_domain_info *info)
-{
- const struct msi_parent_ops *pops = real_parent->msi_parent_ops;
-
- /* MSI parent domain specific settings */
- switch (real_parent->bus_token) {
- case DOMAIN_BUS_NEXUS:
- if (WARN_ON_ONCE(domain != real_parent))
- return false;
-#ifdef CONFIG_SMP
- info->chip->irq_set_affinity = imsic_irq_set_affinity;
-#endif
- break;
- default:
- WARN_ON_ONCE(1);
- return false;
- }
-
- /* Is the target supported? */
- switch (info->bus_token) {
-#ifdef CONFIG_RISCV_IMSIC_PCI
- case DOMAIN_BUS_PCI_DEVICE_MSI:
- case DOMAIN_BUS_PCI_DEVICE_MSIX:
- info->chip->irq_mask = imsic_pci_mask_irq;
- info->chip->irq_unmask = imsic_pci_unmask_irq;
- break;
-#endif
- case DOMAIN_BUS_DEVICE_MSI:
- /*
- * Per-device MSI should never have any MSI feature bits
- * set. It's sole purpose is to create a dumb interrupt
- * chip which has a device specific irq_write_msi_msg()
- * callback.
- */
- if (WARN_ON_ONCE(info->flags))
- return false;
-
- /* Core managed MSI descriptors */
- info->flags |= MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS |
- MSI_FLAG_FREE_MSI_DESCS;
- break;
- case DOMAIN_BUS_WIRED_TO_MSI:
- break;
- default:
- WARN_ON_ONCE(1);
- return false;
- }
-
- /* Use hierarchial chip operations re-trigger */
- info->chip->irq_retrigger = irq_chip_retrigger_hierarchy;
-
- /*
- * Mask out the domain specific MSI feature flags which are not
- * supported by the real parent.
- */
- info->flags &= pops->supported_flags;
-
- /* Enforce the required flags */
- info->flags |= pops->required_flags;
-
- return true;
-}
-
-#define MATCH_PLATFORM_MSI BIT(DOMAIN_BUS_PLATFORM_MSI)
-
static const struct msi_parent_ops imsic_msi_parent_ops = {
.supported_flags = MSI_GENERIC_FLAGS_MASK |
MSI_FLAG_PCI_MSIX,
.required_flags = MSI_FLAG_USE_DEF_DOM_OPS |
- MSI_FLAG_USE_DEF_CHIP_OPS,
+ MSI_FLAG_USE_DEF_CHIP_OPS |
+ MSI_FLAG_PCI_MSI_MASK_PARENT,
.bus_select_token = DOMAIN_BUS_NEXUS,
.bus_select_mask = MATCH_PCI_MSI | MATCH_PLATFORM_MSI,
- .init_dev_msi_info = imsic_init_dev_msi_info,
+ .init_dev_msi_info = msi_lib_init_dev_msi_info,
};
int imsic_irqdomain_init(void)
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -558,11 +558,21 @@ enum {
MSI_FLAG_NO_AFFINITY = (1 << 21),
};
+/*
+ * Flags for msi_parent_ops::chip_flags
+ */
+enum {
+ MSI_CHIP_FLAG_SET_EOI = (1 << 0),
+ MSI_CHIP_FLAG_SET_ACK = (1 << 1),
+};
+
/**
* struct msi_parent_ops - MSI parent domain callbacks and configuration info
*
* @supported_flags: Required: The supported MSI flags of the parent domain
* @required_flags: Optional: The required MSI flags of the parent MSI domain
+ * @chip_flags: Optional: Select MSI chip callbacks to update with defaults
+ * in msi_lib_init_dev_msi_info().
* @bus_select_token: Optional: The bus token of the real parent domain for
* irq_domain::select()
* @bus_select_mask: Optional: A mask of supported BUS_DOMAINs for
@@ -575,6 +585,7 @@ enum {
struct msi_parent_ops {
u32 supported_flags;
u32 required_flags;
+ u32 chip_flags;
u32 bus_select_token;
u32 bus_select_mask;
const char *prefix;
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [RFC PATCH 01/15] irqchip/riscv-imsic: Use hierarchy to reach irq_set_affinity
2024-12-03 13:53 ` Thomas Gleixner
@ 2024-12-03 16:27 ` Andrew Jones
2024-12-03 16:50 ` Thomas Gleixner
2024-12-03 16:37 ` Anup Patel
1 sibling, 1 reply; 36+ messages in thread
From: Andrew Jones @ 2024-12-03 16:27 UTC (permalink / raw)
To: Thomas Gleixner
Cc: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel, tjeznach,
zong.li, joro, will, robin.murphy, anup, atishp, alex.williamson,
paul.walmsley, palmer, aou
On Tue, Dec 03, 2024 at 02:53:45PM +0100, Thomas Gleixner wrote:
> On Thu, Nov 14 2024 at 17:18, Andrew Jones wrote:
> > @@ -96,9 +96,8 @@ static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask
> > bool force)
> > {
> > struct imsic_vector *old_vec, *new_vec;
> > - struct irq_data *pd = d->parent_data;
> >
> > - old_vec = irq_data_get_irq_chip_data(pd);
> > + old_vec = irq_data_get_irq_chip_data(d);
> > if (WARN_ON(!old_vec))
> > return -ENOENT;
> >
> > @@ -116,13 +115,13 @@ static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask
> > return -ENOSPC;
> >
> > /* Point device to the new vector */
> > - imsic_msi_update_msg(d, new_vec);
> > + imsic_msi_update_msg(irq_get_irq_data(d->irq), new_vec);
>
> This looks more than fishy. See below.
>
> > @@ -245,7 +247,7 @@ static bool imsic_init_dev_msi_info(struct device *dev,
> > if (WARN_ON_ONCE(domain != real_parent))
> > return false;
> > #ifdef CONFIG_SMP
> > - info->chip->irq_set_affinity = imsic_irq_set_affinity;
> > + info->chip->irq_set_affinity = irq_chip_set_affinity_parent;
>
> This should use msi_domain_set_affinity(), which does the right thing:
>
> 1) It invokes the irq_set_affinity() callback of the parent domain
>
> 2) It composes the message via the hierarchy
>
> 3) It writes the message with the irq_write_msi_msg() callback of the top
> level domain
>
> Sorry, I missed that when reviewing the original IMSIC MSI support.
>
> The whole IMSIC MSI support can be moved over to MSI LIB which makes all
> of this indirection go away and your intermediate domain will just fit
> in.
>
> Uncompiled patch below. If that works, it needs to be split up properly.
Thanks Thomas. I gave your patch below a go, but we now fail to have an
msi domain set up when probing devices which go through aplic_msi_setup(),
resulting in an immediate NULL dereference in
msi_create_device_irq_domain(). I'll look closer tomorrow.
Thanks,
drew
>
> Note, this removes the setup of the irq_retrigger callback, but that's
> fine because on hierarchical domains irq_chip_retrigger_hierarchy() is
> invoked anyway. See try_retrigger().
>
> Thanks,
>
> tglx
> ---
> drivers/irqchip/Kconfig | 1
> drivers/irqchip/irq-gic-v2m.c | 1
> drivers/irqchip/irq-imx-mu-msi.c | 1
> drivers/irqchip/irq-msi-lib.c | 11 +-
> drivers/irqchip/irq-mvebu-gicp.c | 1
> drivers/irqchip/irq-mvebu-odmi.c | 1
> drivers/irqchip/irq-mvebu-sei.c | 1
> drivers/irqchip/irq-riscv-imsic-platform.c | 131 +----------------------------
> include/linux/msi.h | 11 ++
> 9 files changed, 32 insertions(+), 127 deletions(-)
>
> --- a/drivers/irqchip/Kconfig
> +++ b/drivers/irqchip/Kconfig
> @@ -587,6 +587,7 @@ config RISCV_IMSIC
> select IRQ_DOMAIN_HIERARCHY
> select GENERIC_IRQ_MATRIX_ALLOCATOR
> select GENERIC_MSI_IRQ
> + select IRQ_MSI_LIB
>
> config RISCV_IMSIC_PCI
> bool
> --- a/drivers/irqchip/irq-gic-v2m.c
> +++ b/drivers/irqchip/irq-gic-v2m.c
> @@ -255,6 +255,7 @@ static void __init gicv2m_teardown(void)
> static struct msi_parent_ops gicv2m_msi_parent_ops = {
> .supported_flags = GICV2M_MSI_FLAGS_SUPPORTED,
> .required_flags = GICV2M_MSI_FLAGS_REQUIRED,
> + .chip_flags = MSI_CHIP_FLAG_SET_EOI | MSI_CHIP_FLAG_SET_ACK,
> .bus_select_token = DOMAIN_BUS_NEXUS,
> .bus_select_mask = MATCH_PCI_MSI | MATCH_PLATFORM_MSI,
> .prefix = "GICv2m-",
> --- a/drivers/irqchip/irq-imx-mu-msi.c
> +++ b/drivers/irqchip/irq-imx-mu-msi.c
> @@ -214,6 +214,7 @@ static void imx_mu_msi_irq_handler(struc
> static const struct msi_parent_ops imx_mu_msi_parent_ops = {
> .supported_flags = IMX_MU_MSI_FLAGS_SUPPORTED,
> .required_flags = IMX_MU_MSI_FLAGS_REQUIRED,
> + .chip_flags = MSI_CHIP_FLAG_SET_EOI | MSI_CHIP_FLAG_SET_ACK,
> .bus_select_token = DOMAIN_BUS_NEXUS,
> .bus_select_mask = MATCH_PLATFORM_MSI,
> .prefix = "MU-MSI-",
> --- a/drivers/irqchip/irq-msi-lib.c
> +++ b/drivers/irqchip/irq-msi-lib.c
> @@ -28,6 +28,7 @@ bool msi_lib_init_dev_msi_info(struct de
> struct msi_domain_info *info)
> {
> const struct msi_parent_ops *pops = real_parent->msi_parent_ops;
> + struct irq_chip *chip = info->chip;
> u32 required_flags;
>
> /* Parent ops available? */
> @@ -92,10 +93,10 @@ bool msi_lib_init_dev_msi_info(struct de
> info->flags |= required_flags;
>
> /* Chip updates for all child bus types */
> - if (!info->chip->irq_eoi)
> - info->chip->irq_eoi = irq_chip_eoi_parent;
> - if (!info->chip->irq_ack)
> - info->chip->irq_ack = irq_chip_ack_parent;
> + if (!chip->irq_eoi && (pops->chip_flags & MSI_CHIP_FLAG_SET_EOI))
> + chip->irq_eoi = irq_chip_eoi_parent;
> + if (!chip->irq_ack && (pops->chip_flags & MSI_CHIP_FLAG_SET_ACK))
> + chip->irq_ack = irq_chip_ack_parent;
>
> /*
> * The device MSI domain can never have a set affinity callback. It
> @@ -105,7 +106,7 @@ bool msi_lib_init_dev_msi_info(struct de
> * device MSI domain aside of mask/unmask which is provided e.g. by
> * PCI/MSI device domains.
> */
> - info->chip->irq_set_affinity = msi_domain_set_affinity;
> + chip->irq_set_affinity = msi_domain_set_affinity;
> return true;
> }
> EXPORT_SYMBOL_GPL(msi_lib_init_dev_msi_info);
> --- a/drivers/irqchip/irq-mvebu-gicp.c
> +++ b/drivers/irqchip/irq-mvebu-gicp.c
> @@ -161,6 +161,7 @@ static const struct irq_domain_ops gicp_
> static const struct msi_parent_ops gicp_msi_parent_ops = {
> .supported_flags = GICP_MSI_FLAGS_SUPPORTED,
> .required_flags = GICP_MSI_FLAGS_REQUIRED,
> + .chip_flags = MSI_CHIP_FLAG_SET_EOI | MSI_CHIP_FLAG_SET_ACK,
> .bus_select_token = DOMAIN_BUS_GENERIC_MSI,
> .bus_select_mask = MATCH_PLATFORM_MSI,
> .prefix = "GICP-",
> --- a/drivers/irqchip/irq-mvebu-odmi.c
> +++ b/drivers/irqchip/irq-mvebu-odmi.c
> @@ -157,6 +157,7 @@ static const struct irq_domain_ops odmi_
> static const struct msi_parent_ops odmi_msi_parent_ops = {
> .supported_flags = ODMI_MSI_FLAGS_SUPPORTED,
> .required_flags = ODMI_MSI_FLAGS_REQUIRED,
> + .chip_flags = MSI_CHIP_FLAG_SET_EOI | MSI_CHIP_FLAG_SET_ACK,
> .bus_select_token = DOMAIN_BUS_GENERIC_MSI,
> .bus_select_mask = MATCH_PLATFORM_MSI,
> .prefix = "ODMI-",
> --- a/drivers/irqchip/irq-mvebu-sei.c
> +++ b/drivers/irqchip/irq-mvebu-sei.c
> @@ -356,6 +356,7 @@ static void mvebu_sei_reset(struct mvebu
> static const struct msi_parent_ops sei_msi_parent_ops = {
> .supported_flags = SEI_MSI_FLAGS_SUPPORTED,
> .required_flags = SEI_MSI_FLAGS_REQUIRED,
> + .chip_flags = MSI_CHIP_FLAG_SET_EOI | MSI_CHIP_FLAG_SET_ACK,
> .bus_select_mask = MATCH_PLATFORM_MSI,
> .bus_select_token = DOMAIN_BUS_GENERIC_MSI,
> .prefix = "SEI-",
> --- a/drivers/irqchip/irq-riscv-imsic-platform.c
> +++ b/drivers/irqchip/irq-riscv-imsic-platform.c
> @@ -21,6 +21,7 @@
> #include <linux/smp.h>
>
> #include "irq-riscv-imsic-state.h"
> +#include "irq-msi-lib.h"
>
> static bool imsic_cpu_page_phys(unsigned int cpu, unsigned int guest_index,
> phys_addr_t *out_msi_pa)
> @@ -84,19 +85,10 @@ static void imsic_irq_compose_msg(struct
> }
>
> #ifdef CONFIG_SMP
> -static void imsic_msi_update_msg(struct irq_data *d, struct imsic_vector *vec)
> -{
> - struct msi_msg msg = { };
> -
> - imsic_irq_compose_vector_msg(vec, &msg);
> - irq_data_get_irq_chip(d)->irq_write_msi_msg(d, &msg);
> -}
> -
> static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
> bool force)
> {
> struct imsic_vector *old_vec, *new_vec;
> - struct irq_data *pd = d->parent_data;
>
> old_vec = irq_data_get_irq_chip_data(pd);
> if (WARN_ON(!old_vec))
> @@ -115,14 +107,11 @@ static int imsic_irq_set_affinity(struct
> if (!new_vec)
> return -ENOSPC;
>
> - /* Point device to the new vector */
> - imsic_msi_update_msg(d, new_vec);
> -
> /* Update irq descriptors with the new vector */
> - pd->chip_data = new_vec;
> + d->chip_data = new_vec;
>
> /* Update effective affinity of parent irq data */
> - irq_data_update_effective_affinity(pd, cpumask_of(new_vec->cpu));
> + irq_data_update_effective_affinity(d, cpumask_of(new_vec->cpu));
>
> /* Move state of the old vector to the new vector */
> imsic_vector_move(old_vec, new_vec);
> @@ -137,6 +126,9 @@ static struct irq_chip imsic_irq_base_ch
> .irq_unmask = imsic_irq_unmask,
> .irq_retrigger = imsic_irq_retrigger,
> .irq_compose_msi_msg = imsic_irq_compose_msg,
> +#ifdef CONFIG_SMP
> + .irq_set_affinity = imsic_irq_set_affinity,
> +#endif
> .flags = IRQCHIP_SKIP_SET_WAKE |
> IRQCHIP_MASK_ON_SUSPEND,
> };
> @@ -172,22 +164,6 @@ static void imsic_irq_domain_free(struct
> irq_domain_free_irqs_parent(domain, virq, nr_irqs);
> }
>
> -static int imsic_irq_domain_select(struct irq_domain *domain, struct irq_fwspec *fwspec,
> - enum irq_domain_bus_token bus_token)
> -{
> - const struct msi_parent_ops *ops = domain->msi_parent_ops;
> - u32 busmask = BIT(bus_token);
> -
> - if (fwspec->fwnode != domain->fwnode || fwspec->param_count != 0)
> - return 0;
> -
> - /* Handle pure domain searches */
> - if (bus_token == ops->bus_select_token)
> - return 1;
> -
> - return !!(ops->bus_select_mask & busmask);
> -}
> -
> #ifdef CONFIG_GENERIC_IRQ_DEBUGFS
> static void imsic_irq_debug_show(struct seq_file *m, struct irq_domain *d,
> struct irq_data *irqd, int ind)
> @@ -210,104 +186,15 @@ static const struct irq_domain_ops imsic
> #endif
> };
>
> -#ifdef CONFIG_RISCV_IMSIC_PCI
> -
> -static void imsic_pci_mask_irq(struct irq_data *d)
> -{
> - pci_msi_mask_irq(d);
> - irq_chip_mask_parent(d);
> -}
> -
> -static void imsic_pci_unmask_irq(struct irq_data *d)
> -{
> - irq_chip_unmask_parent(d);
> - pci_msi_unmask_irq(d);
> -}
> -
> -#define MATCH_PCI_MSI BIT(DOMAIN_BUS_PCI_MSI)
> -
> -#else
> -
> -#define MATCH_PCI_MSI 0
> -
> -#endif
> -
> -static bool imsic_init_dev_msi_info(struct device *dev,
> - struct irq_domain *domain,
> - struct irq_domain *real_parent,
> - struct msi_domain_info *info)
> -{
> - const struct msi_parent_ops *pops = real_parent->msi_parent_ops;
> -
> - /* MSI parent domain specific settings */
> - switch (real_parent->bus_token) {
> - case DOMAIN_BUS_NEXUS:
> - if (WARN_ON_ONCE(domain != real_parent))
> - return false;
> -#ifdef CONFIG_SMP
> - info->chip->irq_set_affinity = imsic_irq_set_affinity;
> -#endif
> - break;
> - default:
> - WARN_ON_ONCE(1);
> - return false;
> - }
> -
> - /* Is the target supported? */
> - switch (info->bus_token) {
> -#ifdef CONFIG_RISCV_IMSIC_PCI
> - case DOMAIN_BUS_PCI_DEVICE_MSI:
> - case DOMAIN_BUS_PCI_DEVICE_MSIX:
> - info->chip->irq_mask = imsic_pci_mask_irq;
> - info->chip->irq_unmask = imsic_pci_unmask_irq;
> - break;
> -#endif
> - case DOMAIN_BUS_DEVICE_MSI:
> - /*
> - * Per-device MSI should never have any MSI feature bits
> - * set. It's sole purpose is to create a dumb interrupt
> - * chip which has a device specific irq_write_msi_msg()
> - * callback.
> - */
> - if (WARN_ON_ONCE(info->flags))
> - return false;
> -
> - /* Core managed MSI descriptors */
> - info->flags |= MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS |
> - MSI_FLAG_FREE_MSI_DESCS;
> - break;
> - case DOMAIN_BUS_WIRED_TO_MSI:
> - break;
> - default:
> - WARN_ON_ONCE(1);
> - return false;
> - }
> -
> - /* Use hierarchial chip operations re-trigger */
> - info->chip->irq_retrigger = irq_chip_retrigger_hierarchy;
> -
> - /*
> - * Mask out the domain specific MSI feature flags which are not
> - * supported by the real parent.
> - */
> - info->flags &= pops->supported_flags;
> -
> - /* Enforce the required flags */
> - info->flags |= pops->required_flags;
> -
> - return true;
> -}
> -
> -#define MATCH_PLATFORM_MSI BIT(DOMAIN_BUS_PLATFORM_MSI)
> -
> static const struct msi_parent_ops imsic_msi_parent_ops = {
> .supported_flags = MSI_GENERIC_FLAGS_MASK |
> MSI_FLAG_PCI_MSIX,
> .required_flags = MSI_FLAG_USE_DEF_DOM_OPS |
> - MSI_FLAG_USE_DEF_CHIP_OPS,
> + MSI_FLAG_USE_DEF_CHIP_OPS |
> + MSI_FLAG_PCI_MSI_MASK_PARENT,
> .bus_select_token = DOMAIN_BUS_NEXUS,
> .bus_select_mask = MATCH_PCI_MSI | MATCH_PLATFORM_MSI,
> - .init_dev_msi_info = imsic_init_dev_msi_info,
> + .init_dev_msi_info = msi_lib_init_dev_msi_info,
> };
>
> int imsic_irqdomain_init(void)
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -558,11 +558,21 @@ enum {
> MSI_FLAG_NO_AFFINITY = (1 << 21),
> };
>
> +/*
> + * Flags for msi_parent_ops::chip_flags
> + */
> +enum {
> + MSI_CHIP_FLAG_SET_EOI = (1 << 0),
> + MSI_CHIP_FLAG_SET_ACK = (1 << 1),
> +};
> +
> /**
> * struct msi_parent_ops - MSI parent domain callbacks and configuration info
> *
> * @supported_flags: Required: The supported MSI flags of the parent domain
> * @required_flags: Optional: The required MSI flags of the parent MSI domain
> + * @chip_flags: Optional: Select MSI chip callbacks to update with defaults
> + * in msi_lib_init_dev_msi_info().
> * @bus_select_token: Optional: The bus token of the real parent domain for
> * irq_domain::select()
> * @bus_select_mask: Optional: A mask of supported BUS_DOMAINs for
> @@ -575,6 +585,7 @@ enum {
> struct msi_parent_ops {
> u32 supported_flags;
> u32 required_flags;
> + u32 chip_flags;
> u32 bus_select_token;
> u32 bus_select_mask;
> const char *prefix;
^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 01/15] irqchip/riscv-imsic: Use hierarchy to reach irq_set_affinity
2024-12-03 16:27 ` Andrew Jones
@ 2024-12-03 16:50 ` Thomas Gleixner
2024-12-05 16:12 ` Andrew Jones
0 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2024-12-03 16:50 UTC (permalink / raw)
To: Andrew Jones
Cc: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel, tjeznach,
zong.li, joro, will, robin.murphy, anup, atishp, alex.williamson,
paul.walmsley, palmer, aou
On Tue, Dec 03 2024 at 17:27, Andrew Jones wrote:
> On Tue, Dec 03, 2024 at 02:53:45PM +0100, Thomas Gleixner wrote:
>> On Thu, Nov 14 2024 at 17:18, Andrew Jones wrote:
>> The whole IMSIC MSI support can be moved over to MSI LIB which makes all
>> of this indirection go away and your intermediate domain will just fit
>> in.
>>
>> Uncompiled patch below. If that works, it needs to be split up properly.
>
> Thanks Thomas. I gave your patch below a go, but we now fail to have an
> msi domain set up when probing devices which go through aplic_msi_setup(),
> resulting in an immediate NULL dereference in
> msi_create_device_irq_domain(). I'll look closer tomorrow.
Duh! I forgot to update the .select callback. I don't know how you fixed that
compile fail up. Delta patch below.
Thanks,
tglx
---
--- a/drivers/irqchip/irq-riscv-imsic-platform.c
+++ b/drivers/irqchip/irq-riscv-imsic-platform.c
@@ -180,7 +180,7 @@ static void imsic_irq_debug_show(struct
static const struct irq_domain_ops imsic_base_domain_ops = {
.alloc = imsic_irq_domain_alloc,
.free = imsic_irq_domain_free,
- .select = imsic_irq_domain_select,
+ .select = msi_lib_irq_domain_select,
#ifdef CONFIG_GENERIC_IRQ_DEBUGFS
.debug_show = imsic_irq_debug_show,
#endif
^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 01/15] irqchip/riscv-imsic: Use hierarchy to reach irq_set_affinity
2024-12-03 16:50 ` Thomas Gleixner
@ 2024-12-05 16:12 ` Andrew Jones
0 siblings, 0 replies; 36+ messages in thread
From: Andrew Jones @ 2024-12-05 16:12 UTC (permalink / raw)
To: Thomas Gleixner
Cc: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel, tjeznach,
zong.li, joro, will, robin.murphy, anup, atishp, alex.williamson,
paul.walmsley, palmer, aou
On Tue, Dec 03, 2024 at 05:50:13PM +0100, Thomas Gleixner wrote:
> On Tue, Dec 03 2024 at 17:27, Andrew Jones wrote:
> > On Tue, Dec 03, 2024 at 02:53:45PM +0100, Thomas Gleixner wrote:
> >> On Thu, Nov 14 2024 at 17:18, Andrew Jones wrote:
> >> The whole IMSIC MSI support can be moved over to MSI LIB which makes all
> >> of this indirection go away and your intermediate domain will just fit
> >> in.
> >>
> >> Uncompiled patch below. If that works, it needs to be split up properly.
> >
> > Thanks Thomas. I gave your patch below a go, but we now fail to have an
> > msi domain set up when probing devices which go through aplic_msi_setup(),
> > resulting in an immediate NULL deference in
> > msi_create_device_irq_domain(). I'll look closer tomorrow.
>
> Duh! I forgot to update the .select callback. I don't know how you fixed that
> compile fail up. Delta patch below.
>
> Thanks,
>
> tglx
> ---
> --- a/drivers/irqchip/irq-riscv-imsic-platform.c
> +++ b/drivers/irqchip/irq-riscv-imsic-platform.c
> @@ -180,7 +180,7 @@ static void imsic_irq_debug_show(struct
> static const struct irq_domain_ops imsic_base_domain_ops = {
> .alloc = imsic_irq_domain_alloc,
> .free = imsic_irq_domain_free,
> - .select = imsic_irq_domain_select,
> + .select = msi_lib_irq_domain_select,
> #ifdef CONFIG_GENERIC_IRQ_DEBUGFS
> .debug_show = imsic_irq_debug_show,
> #endif
Hi Thomas,
With this select fix and replacing patch 03/15 of this series with the
following diff, this irqbypass PoC still works.
Based on what Anup said, I kept imsic_msi_update_msg(), which means I
kept this entire patch (01/15) as is. Anup is working on a series to fix
the non-atomic MSI message writes to the device and will likely pick this
patch up along with your changes to convert IMSIC to msi-lib.
I'd like to know your opinion on patch 02/15 of this series and the diff
below. AFAICT, x86 does something similar with the DOMAIN_BUS_DMAR and
DOMAIN_BUS_AMDVI tokens in x86_init_dev_msi_info().
Thanks,
drew
diff --git a/drivers/irqchip/irq-msi-lib.c b/drivers/irqchip/irq-msi-lib.c
index 51464c6257f3..cc18516a4e82 100644
--- a/drivers/irqchip/irq-msi-lib.c
+++ b/drivers/irqchip/irq-msi-lib.c
@@ -36,14 +36,14 @@ bool msi_lib_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
return false;
/*
- * MSI parent domain specific settings. For now there is only the
- * root parent domain, e.g. NEXUS, acting as a MSI parent, but it is
- * possible to stack MSI parents. See x86 vector -> irq remapping
+ * MSI parent domain specific settings. There may be only the root
+ * parent domain, e.g. NEXUS, acting as a MSI parent, or there may
+ * be stacked MSI parents, typically used for remapping.
*/
if (domain->bus_token == pops->bus_select_token) {
if (WARN_ON_ONCE(domain != real_parent))
return false;
- } else {
+ } else if (real_parent->bus_token != DOMAIN_BUS_MSI_REMAP) {
WARN_ON_ONCE(1);
return false;
}
* Re: [RFC PATCH 01/15] irqchip/riscv-imsic: Use hierarchy to reach irq_set_affinity
2024-12-03 13:53 ` Thomas Gleixner
2024-12-03 16:27 ` Andrew Jones
@ 2024-12-03 16:37 ` Anup Patel
2024-12-03 20:55 ` Thomas Gleixner
1 sibling, 1 reply; 36+ messages in thread
From: Anup Patel @ 2024-12-03 16:37 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Andrew Jones, iommu, kvm-riscv, kvm, linux-riscv, linux-kernel,
tjeznach, zong.li, joro, will, robin.murphy, atishp,
alex.williamson, paul.walmsley, palmer, aou
On Tue, Dec 3, 2024 at 7:23 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, Nov 14 2024 at 17:18, Andrew Jones wrote:
> > @@ -96,9 +96,8 @@ static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask
> > bool force)
> > {
> > struct imsic_vector *old_vec, *new_vec;
> > - struct irq_data *pd = d->parent_data;
> >
> > - old_vec = irq_data_get_irq_chip_data(pd);
> > + old_vec = irq_data_get_irq_chip_data(d);
> > if (WARN_ON(!old_vec))
> > return -ENOENT;
> >
> > @@ -116,13 +115,13 @@ static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask
> > return -ENOSPC;
> >
> > /* Point device to the new vector */
> > - imsic_msi_update_msg(d, new_vec);
> > + imsic_msi_update_msg(irq_get_irq_data(d->irq), new_vec);
>
> This looks more than fishy. See below.
>
> > @@ -245,7 +247,7 @@ static bool imsic_init_dev_msi_info(struct device *dev,
> > if (WARN_ON_ONCE(domain != real_parent))
> > return false;
> > #ifdef CONFIG_SMP
> > - info->chip->irq_set_affinity = imsic_irq_set_affinity;
> > + info->chip->irq_set_affinity = irq_chip_set_affinity_parent;
>
> This should use msi_domain_set_affinity(), which does the right thing:
>
> 1) It invokes the irq_set_affinity() callback of the parent domain
>
> 2) It composes the message via the hierarchy
>
> 3) It writes the message with the msi_write_msg() callback of the top
> level domain
>
> Sorry, I missed that when reviewing the original IMSIC MSI support.
>
> The whole IMSIC MSI support can be moved over to MSI LIB which makes all
> of this indirection go away and your intermediate domain will just fit
> in.
>
> Uncompiled patch below. If that works, it needs to be split up properly.
>
> Note, this removes the setup of the irq_retrigger callback, but that's
> fine because on hierarchical domains irq_chip_retrigger_hierarchy() is
> invoked anyway. See try_retrigger().
The IMSIC driver was merged one kernel release before common
MSI LIB was merged.
We should definitely update the IMSIC driver to use MSI LIB, I will
try your suggested changes (below) and post a separate series.
Thanks,
Anup
>
> Thanks,
>
> tglx
> ---
> drivers/irqchip/Kconfig | 1
> drivers/irqchip/irq-gic-v2m.c | 1
> drivers/irqchip/irq-imx-mu-msi.c | 1
> drivers/irqchip/irq-msi-lib.c | 11 +-
> drivers/irqchip/irq-mvebu-gicp.c | 1
> drivers/irqchip/irq-mvebu-odmi.c | 1
> drivers/irqchip/irq-mvebu-sei.c | 1
> drivers/irqchip/irq-riscv-imsic-platform.c | 131 +----------------------------
> include/linux/msi.h | 11 ++
> 9 files changed, 32 insertions(+), 127 deletions(-)
>
> --- a/drivers/irqchip/Kconfig
> +++ b/drivers/irqchip/Kconfig
> @@ -587,6 +587,7 @@ config RISCV_IMSIC
> select IRQ_DOMAIN_HIERARCHY
> select GENERIC_IRQ_MATRIX_ALLOCATOR
> select GENERIC_MSI_IRQ
> + select IRQ_MSI_LIB
>
> config RISCV_IMSIC_PCI
> bool
> --- a/drivers/irqchip/irq-gic-v2m.c
> +++ b/drivers/irqchip/irq-gic-v2m.c
> @@ -255,6 +255,7 @@ static void __init gicv2m_teardown(void)
> static struct msi_parent_ops gicv2m_msi_parent_ops = {
> .supported_flags = GICV2M_MSI_FLAGS_SUPPORTED,
> .required_flags = GICV2M_MSI_FLAGS_REQUIRED,
> + .chip_flags = MSI_CHIP_FLAG_SET_EOI | MSI_CHIP_FLAG_SET_ACK,
> .bus_select_token = DOMAIN_BUS_NEXUS,
> .bus_select_mask = MATCH_PCI_MSI | MATCH_PLATFORM_MSI,
> .prefix = "GICv2m-",
> --- a/drivers/irqchip/irq-imx-mu-msi.c
> +++ b/drivers/irqchip/irq-imx-mu-msi.c
> @@ -214,6 +214,7 @@ static void imx_mu_msi_irq_handler(struc
> static const struct msi_parent_ops imx_mu_msi_parent_ops = {
> .supported_flags = IMX_MU_MSI_FLAGS_SUPPORTED,
> .required_flags = IMX_MU_MSI_FLAGS_REQUIRED,
> + .chip_flags = MSI_CHIP_FLAG_SET_EOI | MSI_CHIP_FLAG_SET_ACK,
> .bus_select_token = DOMAIN_BUS_NEXUS,
> .bus_select_mask = MATCH_PLATFORM_MSI,
> .prefix = "MU-MSI-",
> --- a/drivers/irqchip/irq-msi-lib.c
> +++ b/drivers/irqchip/irq-msi-lib.c
> @@ -28,6 +28,7 @@ bool msi_lib_init_dev_msi_info(struct de
> struct msi_domain_info *info)
> {
> const struct msi_parent_ops *pops = real_parent->msi_parent_ops;
> + struct irq_chip *chip = info->chip;
> u32 required_flags;
>
> /* Parent ops available? */
> @@ -92,10 +93,10 @@ bool msi_lib_init_dev_msi_info(struct de
> info->flags |= required_flags;
>
> /* Chip updates for all child bus types */
> - if (!info->chip->irq_eoi)
> - info->chip->irq_eoi = irq_chip_eoi_parent;
> - if (!info->chip->irq_ack)
> - info->chip->irq_ack = irq_chip_ack_parent;
> + if (!chip->irq_eoi && (pops->chip_flags & MSI_CHIP_FLAG_SET_EOI))
> + chip->irq_eoi = irq_chip_eoi_parent;
> + if (!chip->irq_ack && (pops->chip_flags & MSI_CHIP_FLAG_SET_ACK))
> + chip->irq_ack = irq_chip_ack_parent;
>
> /*
> * The device MSI domain can never have a set affinity callback. It
> @@ -105,7 +106,7 @@ bool msi_lib_init_dev_msi_info(struct de
> * device MSI domain aside of mask/unmask which is provided e.g. by
> * PCI/MSI device domains.
> */
> - info->chip->irq_set_affinity = msi_domain_set_affinity;
> + chip->irq_set_affinity = msi_domain_set_affinity;
> return true;
> }
> EXPORT_SYMBOL_GPL(msi_lib_init_dev_msi_info);
> --- a/drivers/irqchip/irq-mvebu-gicp.c
> +++ b/drivers/irqchip/irq-mvebu-gicp.c
> @@ -161,6 +161,7 @@ static const struct irq_domain_ops gicp_
> static const struct msi_parent_ops gicp_msi_parent_ops = {
> .supported_flags = GICP_MSI_FLAGS_SUPPORTED,
> .required_flags = GICP_MSI_FLAGS_REQUIRED,
> + .chip_flags = MSI_CHIP_FLAG_SET_EOI | MSI_CHIP_FLAG_SET_ACK,
> .bus_select_token = DOMAIN_BUS_GENERIC_MSI,
> .bus_select_mask = MATCH_PLATFORM_MSI,
> .prefix = "GICP-",
> --- a/drivers/irqchip/irq-mvebu-odmi.c
> +++ b/drivers/irqchip/irq-mvebu-odmi.c
> @@ -157,6 +157,7 @@ static const struct irq_domain_ops odmi_
> static const struct msi_parent_ops odmi_msi_parent_ops = {
> .supported_flags = ODMI_MSI_FLAGS_SUPPORTED,
> .required_flags = ODMI_MSI_FLAGS_REQUIRED,
> + .chip_flags = MSI_CHIP_FLAG_SET_EOI | MSI_CHIP_FLAG_SET_ACK,
> .bus_select_token = DOMAIN_BUS_GENERIC_MSI,
> .bus_select_mask = MATCH_PLATFORM_MSI,
> .prefix = "ODMI-",
> --- a/drivers/irqchip/irq-mvebu-sei.c
> +++ b/drivers/irqchip/irq-mvebu-sei.c
> @@ -356,6 +356,7 @@ static void mvebu_sei_reset(struct mvebu
> static const struct msi_parent_ops sei_msi_parent_ops = {
> .supported_flags = SEI_MSI_FLAGS_SUPPORTED,
> .required_flags = SEI_MSI_FLAGS_REQUIRED,
> + .chip_flags = MSI_CHIP_FLAG_SET_EOI | MSI_CHIP_FLAG_SET_ACK,
> .bus_select_mask = MATCH_PLATFORM_MSI,
> .bus_select_token = DOMAIN_BUS_GENERIC_MSI,
> .prefix = "SEI-",
> --- a/drivers/irqchip/irq-riscv-imsic-platform.c
> +++ b/drivers/irqchip/irq-riscv-imsic-platform.c
> @@ -21,6 +21,7 @@
> #include <linux/smp.h>
>
> #include "irq-riscv-imsic-state.h"
> +#include "irq-msi-lib.h"
>
> static bool imsic_cpu_page_phys(unsigned int cpu, unsigned int guest_index,
> phys_addr_t *out_msi_pa)
> @@ -84,19 +85,10 @@ static void imsic_irq_compose_msg(struct
> }
>
> #ifdef CONFIG_SMP
> -static void imsic_msi_update_msg(struct irq_data *d, struct imsic_vector *vec)
> -{
> - struct msi_msg msg = { };
> -
> - imsic_irq_compose_vector_msg(vec, &msg);
> - irq_data_get_irq_chip(d)->irq_write_msi_msg(d, &msg);
> -}
> -
> static int imsic_irq_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
> bool force)
> {
> struct imsic_vector *old_vec, *new_vec;
> - struct irq_data *pd = d->parent_data;
>
> old_vec = irq_data_get_irq_chip_data(pd);
> if (WARN_ON(!old_vec))
> @@ -115,14 +107,11 @@ static int imsic_irq_set_affinity(struct
> if (!new_vec)
> return -ENOSPC;
>
> - /* Point device to the new vector */
> - imsic_msi_update_msg(d, new_vec);
> -
> /* Update irq descriptors with the new vector */
> - pd->chip_data = new_vec;
> + d->chip_data = new_vec;
>
> /* Update effective affinity of parent irq data */
> - irq_data_update_effective_affinity(pd, cpumask_of(new_vec->cpu));
> + irq_data_update_effective_affinity(d, cpumask_of(new_vec->cpu));
>
> /* Move state of the old vector to the new vector */
> imsic_vector_move(old_vec, new_vec);
> @@ -137,6 +126,9 @@ static struct irq_chip imsic_irq_base_ch
> .irq_unmask = imsic_irq_unmask,
> .irq_retrigger = imsic_irq_retrigger,
> .irq_compose_msi_msg = imsic_irq_compose_msg,
> +#ifdef CONFIG_SMP
> + .irq_set_affinity = imsic_irq_set_affinity,
> +#endif
> .flags = IRQCHIP_SKIP_SET_WAKE |
> IRQCHIP_MASK_ON_SUSPEND,
> };
> @@ -172,22 +164,6 @@ static void imsic_irq_domain_free(struct
> irq_domain_free_irqs_parent(domain, virq, nr_irqs);
> }
>
> -static int imsic_irq_domain_select(struct irq_domain *domain, struct irq_fwspec *fwspec,
> - enum irq_domain_bus_token bus_token)
> -{
> - const struct msi_parent_ops *ops = domain->msi_parent_ops;
> - u32 busmask = BIT(bus_token);
> -
> - if (fwspec->fwnode != domain->fwnode || fwspec->param_count != 0)
> - return 0;
> -
> - /* Handle pure domain searches */
> - if (bus_token == ops->bus_select_token)
> - return 1;
> -
> - return !!(ops->bus_select_mask & busmask);
> -}
> -
> #ifdef CONFIG_GENERIC_IRQ_DEBUGFS
> static void imsic_irq_debug_show(struct seq_file *m, struct irq_domain *d,
> struct irq_data *irqd, int ind)
> @@ -210,104 +186,15 @@ static const struct irq_domain_ops imsic
> #endif
> };
>
> -#ifdef CONFIG_RISCV_IMSIC_PCI
> -
> -static void imsic_pci_mask_irq(struct irq_data *d)
> -{
> - pci_msi_mask_irq(d);
> - irq_chip_mask_parent(d);
> -}
> -
> -static void imsic_pci_unmask_irq(struct irq_data *d)
> -{
> - irq_chip_unmask_parent(d);
> - pci_msi_unmask_irq(d);
> -}
> -
> -#define MATCH_PCI_MSI BIT(DOMAIN_BUS_PCI_MSI)
> -
> -#else
> -
> -#define MATCH_PCI_MSI 0
> -
> -#endif
> -
> -static bool imsic_init_dev_msi_info(struct device *dev,
> - struct irq_domain *domain,
> - struct irq_domain *real_parent,
> - struct msi_domain_info *info)
> -{
> - const struct msi_parent_ops *pops = real_parent->msi_parent_ops;
> -
> - /* MSI parent domain specific settings */
> - switch (real_parent->bus_token) {
> - case DOMAIN_BUS_NEXUS:
> - if (WARN_ON_ONCE(domain != real_parent))
> - return false;
> -#ifdef CONFIG_SMP
> - info->chip->irq_set_affinity = imsic_irq_set_affinity;
> -#endif
> - break;
> - default:
> - WARN_ON_ONCE(1);
> - return false;
> - }
> -
> - /* Is the target supported? */
> - switch (info->bus_token) {
> -#ifdef CONFIG_RISCV_IMSIC_PCI
> - case DOMAIN_BUS_PCI_DEVICE_MSI:
> - case DOMAIN_BUS_PCI_DEVICE_MSIX:
> - info->chip->irq_mask = imsic_pci_mask_irq;
> - info->chip->irq_unmask = imsic_pci_unmask_irq;
> - break;
> -#endif
> - case DOMAIN_BUS_DEVICE_MSI:
> - /*
> - * Per-device MSI should never have any MSI feature bits
> - * set. It's sole purpose is to create a dumb interrupt
> - * chip which has a device specific irq_write_msi_msg()
> - * callback.
> - */
> - if (WARN_ON_ONCE(info->flags))
> - return false;
> -
> - /* Core managed MSI descriptors */
> - info->flags |= MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS |
> - MSI_FLAG_FREE_MSI_DESCS;
> - break;
> - case DOMAIN_BUS_WIRED_TO_MSI:
> - break;
> - default:
> - WARN_ON_ONCE(1);
> - return false;
> - }
> -
> - /* Use hierarchial chip operations re-trigger */
> - info->chip->irq_retrigger = irq_chip_retrigger_hierarchy;
> -
> - /*
> - * Mask out the domain specific MSI feature flags which are not
> - * supported by the real parent.
> - */
> - info->flags &= pops->supported_flags;
> -
> - /* Enforce the required flags */
> - info->flags |= pops->required_flags;
> -
> - return true;
> -}
> -
> -#define MATCH_PLATFORM_MSI BIT(DOMAIN_BUS_PLATFORM_MSI)
> -
> static const struct msi_parent_ops imsic_msi_parent_ops = {
> .supported_flags = MSI_GENERIC_FLAGS_MASK |
> MSI_FLAG_PCI_MSIX,
> .required_flags = MSI_FLAG_USE_DEF_DOM_OPS |
> - MSI_FLAG_USE_DEF_CHIP_OPS,
> + MSI_FLAG_USE_DEF_CHIP_OPS |
> + MSI_FLAG_PCI_MSI_MASK_PARENT,
> .bus_select_token = DOMAIN_BUS_NEXUS,
> .bus_select_mask = MATCH_PCI_MSI | MATCH_PLATFORM_MSI,
> - .init_dev_msi_info = imsic_init_dev_msi_info,
> + .init_dev_msi_info = msi_lib_init_dev_msi_info,
> };
>
> int imsic_irqdomain_init(void)
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -558,11 +558,21 @@ enum {
> MSI_FLAG_NO_AFFINITY = (1 << 21),
> };
>
> +/*
> + * Flags for msi_parent_ops::chip_flags
> + */
> +enum {
> + MSI_CHIP_FLAG_SET_EOI = (1 << 0),
> + MSI_CHIP_FLAG_SET_ACK = (1 << 1),
> +};
> +
> /**
> * struct msi_parent_ops - MSI parent domain callbacks and configuration info
> *
> * @supported_flags: Required: The supported MSI flags of the parent domain
> * @required_flags: Optional: The required MSI flags of the parent MSI domain
> + * @chip_flags: Optional: Select MSI chip callbacks to update with defaults
> + * in msi_lib_init_dev_msi_info().
> * @bus_select_token: Optional: The bus token of the real parent domain for
> * irq_domain::select()
> * @bus_select_mask: Optional: A mask of supported BUS_DOMAINs for
> @@ -575,6 +585,7 @@ enum {
> struct msi_parent_ops {
> u32 supported_flags;
> u32 required_flags;
> + u32 chip_flags;
> u32 bus_select_token;
> u32 bus_select_mask;
> const char *prefix;
* Re: [RFC PATCH 01/15] irqchip/riscv-imsic: Use hierarchy to reach irq_set_affinity
2024-12-03 16:37 ` Anup Patel
@ 2024-12-03 20:55 ` Thomas Gleixner
2024-12-03 22:59 ` Thomas Gleixner
0 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2024-12-03 20:55 UTC (permalink / raw)
To: Anup Patel
Cc: Andrew Jones, iommu, kvm-riscv, kvm, linux-riscv, linux-kernel,
tjeznach, zong.li, joro, will, robin.murphy, atishp,
alex.williamson, paul.walmsley, palmer, aou
On Tue, Dec 03 2024 at 22:07, Anup Patel wrote:
> On Tue, Dec 3, 2024 at 7:23 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> Sorry, I missed that when reviewing the original IMSIC MSI support.
>>
>> The whole IMSIC MSI support can be moved over to MSI LIB which makes all
>> of this indirection go away and your intermediate domain will just fit
>> in.
>>
>> Uncompiled patch below. If that works, it needs to be split up properly.
>>
>> Note, this removes the setup of the irq_retrigger callback, but that's
>> fine because on hierarchical domains irq_chip_retrigger_hierarchy() is
>> invoked anyway. See try_retrigger().
>
> The IMSIC driver was merged one kernel release before common
> MSI LIB was merged.
Ah indeed.
> We should definitely update the IMSIC driver to use MSI LIB, I will
> try your suggested changes (below) and post a separate series.
Pick up the delta patch I gave Andrew...
* Re: [RFC PATCH 01/15] irqchip/riscv-imsic: Use hierarchy to reach irq_set_affinity
2024-12-03 20:55 ` Thomas Gleixner
@ 2024-12-03 22:59 ` Thomas Gleixner
2024-12-04 3:43 ` Anup Patel
0 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2024-12-03 22:59 UTC (permalink / raw)
To: Anup Patel
Cc: Andrew Jones, iommu, kvm-riscv, kvm, linux-riscv, linux-kernel,
tjeznach, zong.li, joro, will, robin.murphy, atishp,
alex.williamson, paul.walmsley, palmer, aou
On Tue, Dec 03 2024 at 21:55, Thomas Gleixner wrote:
> On Tue, Dec 03 2024 at 22:07, Anup Patel wrote:
>> On Tue, Dec 3, 2024 at 7:23 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>> Sorry, I missed that when reviewing the original IMSIC MSI support.
>>>
>>> The whole IMSIC MSI support can be moved over to MSI LIB which makes all
>>> of this indirection go away and your intermediate domain will just fit
>>> in.
>>>
>>> Uncompiled patch below. If that works, it needs to be split up properly.
>>>
>>> Note, this removes the setup of the irq_retrigger callback, but that's
>>> fine because on hierarchical domains irq_chip_retrigger_hierarchy() is
>>> invoked anyway. See try_retrigger().
>>
>> The IMSIC driver was merged one kernel release before common
>> MSI LIB was merged.
>
> Ah indeed.
>
>> We should definitely update the IMSIC driver to use MSI LIB, I will
>> try your suggested changes (below) and post a separate series.
>
> Pick up the delta patch I gave Andrew...
As I was looking at something else MSI related I had a look at
imsic_irq_set_affinity() again.
It's actually required to have the message write in that function and
not afterwards, as you invoke imsic_vector_move() from that function.
That's obviously not true for the remap case as that will not change the
message address/data pair because the remap table entry is immutable -
at least I assume so for my mental sanity sake :)
But that brings me to a related question. How is this supposed to work
with non-atomic message updates? PCI/MSI does not necessarily provide
masking, and the write of the address/data pair is done in bits and
pieces. So you can end up with an intermediate state seen by the device
which ends up somewhere in interrupt nirvana space.
See the dance in msi_set_affinity() and commit 6f1a4891a592
("x86/apic/msi: Plug non-maskable MSI affinity race") for further
explanation.
The way the IMSIC driver works seems to be pretty much the same as the
x86 APIC mess:
@address is the physical address of the per CPU MSI target
address and @data is the vector ID on that CPU.
So the non-atomic update in case of non-maskable MSI suffers from the
same problem. It works most of the time, but if it doesn't you might
stare at the occasionally lost interrupt and the stale device in
disbelief for quite a while :)
I might be missing something which magically prevent that though :)
Thanks,
tglx
* Re: [RFC PATCH 01/15] irqchip/riscv-imsic: Use hierarchy to reach irq_set_affinity
2024-12-03 22:59 ` Thomas Gleixner
@ 2024-12-04 3:43 ` Anup Patel
2024-12-04 13:05 ` Thomas Gleixner
0 siblings, 1 reply; 36+ messages in thread
From: Anup Patel @ 2024-12-04 3:43 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Andrew Jones, iommu, kvm-riscv, kvm, linux-riscv, linux-kernel,
tjeznach, zong.li, joro, will, robin.murphy, atishp,
alex.williamson, paul.walmsley, palmer, aou
On Wed, Dec 4, 2024 at 4:29 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, Dec 03 2024 at 21:55, Thomas Gleixner wrote:
> > On Tue, Dec 03 2024 at 22:07, Anup Patel wrote:
> >> On Tue, Dec 3, 2024 at 7:23 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>> Sorry, I missed that when reviewing the original IMSIC MSI support.
> >>>
> >>> The whole IMSIC MSI support can be moved over to MSI LIB which makes all
> >>> of this indirection go away and your intermediate domain will just fit
> >>> in.
> >>>
> >>> Uncompiled patch below. If that works, it needs to be split up properly.
> >>>
> >>> Note, this removes the setup of the irq_retrigger callback, but that's
> >>> fine because on hierarchical domains irq_chip_retrigger_hierarchy() is
> >>> invoked anyway. See try_retrigger().
> >>
> >> The IMSIC driver was merged one kernel release before common
> >> MSI LIB was merged.
> >
> > Ah indeed.
> >
> >> We should definitely update the IMSIC driver to use MSI LIB, I will
> >> try your suggested changes (below) and post a separate series.
> >
> > Pick up the delta patch I gave Andrew...
>
> As I was looking at something else MSI related I had a look at
> imsic_irq_set_affinity() again.
>
> It's actually required to have the message write in that function and
> not afterwards as you invoke imsic_vector_move() from that function.
>
> That's obviously not true for the remap case as that will not change the
> message address/data pair because the remap table entry is immutable -
> at least I assume so for my mental sanity sake :)
>
> But that brings me to a related question. How is this supposed to work
> with non-atomic message updates? PCI/MSI does not necessarily provide
> masking, and the write of the address/data pair is done in bits and
> pieces. So you can end up with an intermediate state seen by the device
> which ends up somewhere in interrupt nirvana space.
>
> See the dance in msi_set_affinity() and commit 6f1a4891a592
> ("x86/apic/msi: Plug non-maskable MSI affinity race") for further
> explanation.
>
> The way how the IMSIC driver works seems to be pretty much the same as
> the x86 APIC mess:
>
> @address is the physical address of the per CPU MSI target
> address and @data is the vector ID on that CPU.
>
> So the non-atomic update in case of non-maskable MSI suffers from the
> same problem. It works most of the time, but if it doesn't you might
> stare at the occasionally lost interrupt and the stale device in
> disbelief for quite a while :)
Yes, we have the same challenges as x86 APIC when changing
MSI affinity.
>
> I might be missing something which magically prevent that though :)
>
Your understanding is correct. In fact, the IMSIC msi_set_affinity()
handling is inspired by the x86 APIC approach, due to the similarity of
the overall MSI controller design.
The high-level idea of imsic_irq_set_affinity() is as follows:
1) Allocate new_vector (new CPU IMSIC address + new ID on that CPU)
2) Update the MSI address and data programmed in the device
based on new_vector (see imsic_msi_update_msg())
3) At this point the device points to the new_vector but old_vector
(old CPU IMSIC address + old ID on that CPU) is still enabled and
we might have received MSI on old_vector while we were busy
setting up a new_vector for the device. To address this, we call
imsic_vector_move().
4) The imsic_vector_move() marks the old_vector as being
moved and schedules a lazy timer on the old CPU.
5) The lazy timer expires on the old CPU and results in
__imsic_local_sync() being called on the old CPU.
6) If there was a pending MSI on the old vector then the
__imsic_local_sync() function injects an MSI to the
new_vector using an MMIO write.
It is very unlikely that an MSI from the device will be dropped (unless
I am missing something), but the unsolved issue is that handling of an
in-flight MSI received on the old_vector during the MSI re-programming
is delayed, which may have side effects on the device driver side.
I believe in the future RISC-V AIA v2.0 (whenever that
happens) will address the gaps in AIA v1.0 (like this one).
Regards,
Anup
* Re: [RFC PATCH 01/15] irqchip/riscv-imsic: Use hierarchy to reach irq_set_affinity
2024-12-04 3:43 ` Anup Patel
@ 2024-12-04 13:05 ` Thomas Gleixner
0 siblings, 0 replies; 36+ messages in thread
From: Thomas Gleixner @ 2024-12-04 13:05 UTC (permalink / raw)
To: Anup Patel
Cc: Andrew Jones, iommu, kvm-riscv, kvm, linux-riscv, linux-kernel,
tjeznach, zong.li, joro, will, robin.murphy, atishp,
alex.williamson, paul.walmsley, palmer, aou
On Wed, Dec 04 2024 at 09:13, Anup Patel wrote:
> On Wed, Dec 4, 2024 at 4:29 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> As I was looking at something else MSI related I had a look at
>> imsic_irq_set_affinity() again.
>>
>> It's actually required to have the message write in that function and
>> not afterwards as you invoke imsic_vector_move() from that function.
>>
>> That's obviously not true for the remap case as that will not change the
>> message address/data pair because the remap table entry is immutable -
>> at least I assume so for my mental sanity sake :)
>>
>> But that brings me to a related question. How is this supposed to work
>> with non-atomic message updates? PCI/MSI does not necessarily provide
>> masking, and the write of the address/data pair is done in bits and
>> pieces. So you can end up with an intermediate state seen by the device
>> which ends up somewhere in interrupt nirvana space.
>>
>> See the dance in msi_set_affinity() and commit 6f1a4891a592
>> ("x86/apic/msi: Plug non-maskable MSI affinity race") for further
>> explanation.
>>
>> The way how the IMSIC driver works seems to be pretty much the same as
>> the x86 APIC mess:
>>
>> @address is the physical address of the per CPU MSI target
>> address and @data is the vector ID on that CPU.
>>
>> So the non-atomic update in case of non-maskable MSI suffers from the
>> same problem. It works most of the time, but if it doesn't you might
>> stare at the occasionally lost interrupt and the stale device in
>> disbelief for quite a while :)
>
> Yes, we have the same challenges as x86 APIC when changing
> MSI affinity.
>>
>> I might be missing something which magically prevent that though :)
>>
> Your understanding is correct. In fact, the IMSIC msi_set_affinity()
> handling is inspired from x86 APIC approach due to similarity in
> the overall MSI controller.
>
> The high-level idea of imsic_irq_set_affinity() is as follows:
>
> 1) Allocate new_vector (new CPU IMSIC address + new ID on that CPU)
>
> 2) Update the MSI address and data programmed in the device
> based on new_vector (see imsic_msi_update_msg())
>
> 3) At this point the device points to the new_vector but old_vector
> (old CPU IMSIC address + old ID on that CPU) is still enabled and
> we might have received MSI on old_vector while we were busy
> setting up a new_vector for the device. To address this, we call
> imsic_vector_move().
>
> 4) The imsic_vector_move() marks the old_vector as being
> moved and schedules a lazy timer on the old CPU.
>
> 5) The lazy timer expires on the old CPU and results in
> __imsic_local_sync() being called on the old CPU.
>
> 6) If there was a pending MSI on the old vector then the
> __imsic_local_sync() function injects an MSI to the
> new_vector using an MMIO write.
>
> It is very unlikely that an MSI from device will be dropped
> (unless I am missing something) but the unsolved issue
> is that handling of in-flight MSI received on the old_vector
> during the MSI re-programming is delayed which may have
> side effects on the device driver side.
Interrupt delivery can be delayed for tons of other reasons.
But yes, you are missing something which is worse than a jiffie delay:
    CPU                                 Device
    msi_update_msg()
      compose()
      write()
        write(msg->address_lo);  [1]
        write(msg->address_hi);  [2]
                                        Raises interrupt [3]
        write(msg->data);        [4]
[2] can be ignored as it should not change (otherwise 32-bit-only
devices would not work).
Lets assume that the original message was:
addr_lo = 0x1000, data = 0x10 (CPU1, vector 0x10)
The new message is
addr_lo = 0x2000, data = 0x20 (CPU2, vector 0x20)
After [2] the device sees:
addr_lo = 0x2000, data = 0x10 (CPU2, vector 0x10)
The interrupt raised in [3] will end up on CPU2 at vector 0x10 which
might be not in use or used by some other device. In any case the
interrupt is not reaching the real device handler and you can't see that
interrupt as pending in CPU1.
That's why x86 in case of non-remapped interrupts has this for the two
situations:
1) old->data == new->data
write_msg(new);
The next interrupt either arrives on the old CPU or on the new CPU
depending on timing. There is no intermediate state because the
vector (data) is the same on both CPUs.
2) old->data != new->data
tmp_msg.addr = old_msg.addr
tmp_msg.data = new_msg.data
write_msg(tmp_msg);
So after that write the device might raise the interrupt on CPU1
and vector 0x20.
The next step is
write_msg(new);
which changes the destination CPU to CPU2.
So depending what the device has observed the interrupt might end
up on
CPU1 vector 0x10 (old)
CPU1 vector 0x20 (tmp)
CPU2 vector 0x20 (new)
CPU1 vector 0x20 (tmp) is then checked in the pending register and
the interrupt is retriggered if pending.
That requires moving the interrupt from actual interrupt context on the
old target CPU. It allows evaluating the old target CPU's pending
register with local interrupts disabled, which obviously does not work
remotely.
I don't see how that can work remotely with the IMSIC either, even if
you can easily access the pending state of the remote CPU:
    CPU0                               CPU1                  Device
    set_affinity()
      write_msg(tmp)
        write(addr);  // CPU1
        write(data);  // vector 0x20
                                                             raise IRQ
                                       handle vector 0x20
                                       (spurious or other
                                        device)
      check_pending(CPU1, 0x20) == false -> Interrupt is lost
Remapped interrupts do not have that issue because the table update is
atomic vs. a concurrently raised interrupt, so that it either ends up on
the old or on the new destination. The device message does not change as
it always targets the table entry, which is immutable after setup.
That's why x86 has this special msi affinity setter for the non remapped
case. For the remapped case it's just using the hierarchy default which
ends up at the remap domain.
So the hierarchies look like this:
1) non-remap

   PCI/MSI device domain
       irq_chip.irq_set_affinity = msi_set_affinity;
   VECTOR domain

2) remap

   PCI/MSI device domain
       irq_chip.irq_set_affinity = msi_domain_set_affinity;
   REMAP domain
       irq_chip.irq_set_affinity = remap_set_affinity;
   VECTOR domain
The special case of msi_set_affinity() is not involved in the remap case
and solely utilized for the non-remap horrors. The remap case does not
require the interrupt to be moved in interrupt context either because
there is no intermediate step and pending register check on the old CPU
involved.
The vector cleanup of the old CPU always happens when the interrupt
arrives on the new target for the first time independent of remapping
mode. That's required for both cases to take care of the case where the
interrupt is raised on the old vector (cpu1, 0x10) before the write
happens and is pending in CPU1. So after desc::lock is released and
interrupts are reenabled the interrupt is handled on CPU1. The next one
is guaranteed to arrive on CPU2 and that triggers the clean up of the
CPU1 vector, which releases it for reuse in the matrix allocator.
I think you should model it in a similar way instead of trying to
artificially reuse imsic_irq_set_affinity() for the remap case,
especially when you decide to close the non-maskable MSI hole described
above, which I recommend doing :)
You still can utilize msi-lib and just have your private implementation
of init_dev_msi_info() as a wrapper around the library similar to
mbi_init_dev_msi_info() and its_init_dev_msi_info(). That wrapper would
just handle the non-remap case to set info::chip::irq_set_affinity to the
magic non-remap function.
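As a rough illustration of the wrapper shape suggested here, the sketch below models the pattern in self-contained C. All of the types, the `msi_remap` flag, and the function names other than the ones cited above are simplified stand-ins for illustration, not the real kernel API; `msi_lib_init_dev_msi_info()` is stubbed out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures involved. */
struct irq_chip {
	int (*irq_set_affinity)(void);
};

struct msi_domain_info {
	struct irq_chip *chip;
};

struct irq_domain {
	bool msi_remap;		/* hypothetical flag: parent remaps MSIs */
};

/* Stand-in for the "magic" non-remap affinity setter. */
static int nonremap_msi_set_affinity(void)
{
	return 0;
}

/* Stub for the shared msi-lib helper the wrapper delegates to. */
static bool msi_lib_init_dev_msi_info(struct irq_domain *real_parent,
				      struct msi_domain_info *info)
{
	(void)real_parent;
	(void)info;
	return true;
}

/*
 * Wrapper in the style of mbi_init_dev_msi_info()/its_init_dev_msi_info():
 * let the library do the common setup, then override the affinity callback
 * only for the non-remap case, so the remap case keeps the hierarchy
 * default that ends up at the remap domain.
 */
static bool imsic_wrapper_init_dev_msi_info(struct irq_domain *real_parent,
					    struct msi_domain_info *info)
{
	if (!msi_lib_init_dev_msi_info(real_parent, info))
		return false;

	if (!real_parent->msi_remap)
		info->chip->irq_set_affinity = nonremap_msi_set_affinity;

	return true;
}
```

The point of the split is that only the non-remap path needs the special setter; a remapping parent leaves the callback alone and the affinity change propagates through the hierarchy.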
Hope that helps.
> I believe in the future RISC-V AIA v2.0 (whenever that
> happens) will address the gaps in AIA v1.0 (like this one).
If that is strictly translation table based, yes.
Thanks,
tglx
^ permalink raw reply [flat|nested] 36+ messages in thread
* [RFC PATCH 02/15] genirq/msi: Provide DOMAIN_BUS_MSI_REMAP
2024-11-14 16:18 [RFC PATCH 00/15] iommu/riscv: Add irqbypass support Andrew Jones
2024-11-14 16:18 ` [RFC PATCH 01/15] irqchip/riscv-imsic: Use hierarchy to reach irq_set_affinity Andrew Jones
@ 2024-11-14 16:18 ` Andrew Jones
2024-11-14 16:18 ` [RFC PATCH 03/15] irqchip/riscv-imsic: Add support for DOMAIN_BUS_MSI_REMAP Andrew Jones
` (12 subsequent siblings)
14 siblings, 0 replies; 36+ messages in thread
From: Andrew Jones @ 2024-11-14 16:18 UTC (permalink / raw)
To: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel
Cc: tjeznach, zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
Provide a domain bus token for the upcoming support for the RISC-V
IOMMU interrupt remapping domain, which needs to be distinguished
from NEXUS domains. The new token name is generic, as the only
information that needs to be conveyed is that the IRQ domain will
remap MSIs, i.e. there's nothing RISC-V specific to convey.
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
---
include/linux/irqdomain_defs.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/linux/irqdomain_defs.h b/include/linux/irqdomain_defs.h
index 36653e2ee1c9..676eca8147ae 100644
--- a/include/linux/irqdomain_defs.h
+++ b/include/linux/irqdomain_defs.h
@@ -27,6 +27,7 @@ enum irq_domain_bus_token {
DOMAIN_BUS_AMDVI,
DOMAIN_BUS_DEVICE_MSI,
DOMAIN_BUS_WIRED_TO_MSI,
+ DOMAIN_BUS_MSI_REMAP,
};
#endif /* _LINUX_IRQDOMAIN_DEFS_H */
--
2.47.0
^ permalink raw reply related [flat|nested] 36+ messages in thread

* [RFC PATCH 03/15] irqchip/riscv-imsic: Add support for DOMAIN_BUS_MSI_REMAP
2024-11-14 16:18 [RFC PATCH 00/15] iommu/riscv: Add irqbypass support Andrew Jones
2024-11-14 16:18 ` [RFC PATCH 01/15] irqchip/riscv-imsic: Use hierarchy to reach irq_set_affinity Andrew Jones
2024-11-14 16:18 ` [RFC PATCH 02/15] genirq/msi: Provide DOMAIN_BUS_MSI_REMAP Andrew Jones
@ 2024-11-14 16:18 ` Andrew Jones
2024-11-14 16:18 ` [RFC PATCH 04/15] iommu/riscv: report iommu capabilities Andrew Jones
` (11 subsequent siblings)
14 siblings, 0 replies; 36+ messages in thread
From: Andrew Jones @ 2024-11-14 16:18 UTC (permalink / raw)
To: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel
Cc: tjeznach, zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
Unlike a child of an MSI NEXUS domain, a child of an MSI_REMAP domain
will not invoke init_dev_msi_info() with 'domain' equal to
'msi_parent_domain'. This is because the MSI_REMAP domain implements
init_dev_msi_info() with msi_parent_init_dev_msi_info(), which makes
'domain' point to the NEXUS (IMSIC) domain, while keeping
'msi_parent_domain' pointing to itself. The rest of the IMSIC
init_dev_msi_info() implementation already works for MSI_REMAP domains,
though, so nothing is needed to add support other than accepting the
token.
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
---
drivers/irqchip/irq-riscv-imsic-platform.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/irqchip/irq-riscv-imsic-platform.c b/drivers/irqchip/irq-riscv-imsic-platform.c
index 5d7c30ad8855..6a7d7fefda6a 100644
--- a/drivers/irqchip/irq-riscv-imsic-platform.c
+++ b/drivers/irqchip/irq-riscv-imsic-platform.c
@@ -246,6 +246,8 @@ static bool imsic_init_dev_msi_info(struct device *dev,
case DOMAIN_BUS_NEXUS:
if (WARN_ON_ONCE(domain != real_parent))
return false;
+ fallthrough;
+ case DOMAIN_BUS_MSI_REMAP:
#ifdef CONFIG_SMP
info->chip->irq_set_affinity = irq_chip_set_affinity_parent;
#endif
--
2.47.0
^ permalink raw reply related [flat|nested] 36+ messages in thread

* [RFC PATCH 04/15] iommu/riscv: report iommu capabilities
2024-11-14 16:18 [RFC PATCH 00/15] iommu/riscv: Add irqbypass support Andrew Jones
` (2 preceding siblings ...)
2024-11-14 16:18 ` [RFC PATCH 03/15] irqchip/riscv-imsic: Add support for DOMAIN_BUS_MSI_REMAP Andrew Jones
@ 2024-11-14 16:18 ` Andrew Jones
2024-11-15 15:20 ` Robin Murphy
2024-11-14 16:18 ` [RFC PATCH 05/15] iommu/riscv: use data structure instead of individual values Andrew Jones
` (10 subsequent siblings)
14 siblings, 1 reply; 36+ messages in thread
From: Andrew Jones @ 2024-11-14 16:18 UTC (permalink / raw)
To: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel
Cc: tjeznach, zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
From: Tomasz Jeznach <tjeznach@rivosinc.com>
Report RISC-V IOMMU capabilities required by VFIO subsystem
to enable PCIe device assignment.
Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
---
drivers/iommu/riscv/iommu.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 8a05def774bd..3fe4ceba8dd3 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -1462,6 +1462,17 @@ static struct iommu_group *riscv_iommu_device_group(struct device *dev)
return generic_device_group(dev);
}
+static bool riscv_iommu_capable(struct device *dev, enum iommu_cap cap)
+{
+ switch (cap) {
+ case IOMMU_CAP_CACHE_COHERENCY:
+ case IOMMU_CAP_DEFERRED_FLUSH:
+ return true;
+ default:
+ return false;
+ }
+}
+
static int riscv_iommu_of_xlate(struct device *dev, const struct of_phandle_args *args)
{
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -1526,6 +1537,7 @@ static void riscv_iommu_release_device(struct device *dev)
static const struct iommu_ops riscv_iommu_ops = {
.pgsize_bitmap = SZ_4K,
.of_xlate = riscv_iommu_of_xlate,
+ .capable = riscv_iommu_capable,
.identity_domain = &riscv_iommu_identity_domain,
.blocked_domain = &riscv_iommu_blocking_domain,
.release_domain = &riscv_iommu_blocking_domain,
--
2.47.0
^ permalink raw reply [flat|nested] 36+ messages in thread

* Re: [RFC PATCH 04/15] iommu/riscv: report iommu capabilities
2024-11-14 16:18 ` [RFC PATCH 04/15] iommu/riscv: report iommu capabilities Andrew Jones
@ 2024-11-15 15:20 ` Robin Murphy
2024-11-19 8:28 ` Andrew Jones
0 siblings, 1 reply; 36+ messages in thread
From: Robin Murphy @ 2024-11-15 15:20 UTC (permalink / raw)
To: Andrew Jones, iommu, kvm-riscv, kvm, linux-riscv, linux-kernel
Cc: tjeznach, zong.li, joro, will, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
On 14/11/2024 4:18 pm, Andrew Jones wrote:
> From: Tomasz Jeznach <tjeznach@rivosinc.com>
>
> Report RISC-V IOMMU capabilities required by VFIO subsystem
> to enable PCIe device assignment.
IOMMU_CAP_DEFERRED_FLUSH has nothing at all to do with VFIO. As far as I
can tell from what's queued, riscv_iommu_unmap_pages() isn't really
implementing the full optimisation to get the most out of it either.
I guess IOMMU_CAP_CACHE_COHERENCY falls out of the assumption of a
coherent IOMMU and lack of PBMT support making everything implicitly
IOMMU_CACHE all the time whether you want it or not, but clarifying that
might be nice (especially since there's some chance that something will
eventually come along to break it...)
Thanks,
Robin.
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
> ---
> drivers/iommu/riscv/iommu.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> index 8a05def774bd..3fe4ceba8dd3 100644
> --- a/drivers/iommu/riscv/iommu.c
> +++ b/drivers/iommu/riscv/iommu.c
> @@ -1462,6 +1462,17 @@ static struct iommu_group *riscv_iommu_device_group(struct device *dev)
> return generic_device_group(dev);
> }
>
> +static bool riscv_iommu_capable(struct device *dev, enum iommu_cap cap)
> +{
> + switch (cap) {
> + case IOMMU_CAP_CACHE_COHERENCY:
> + case IOMMU_CAP_DEFERRED_FLUSH:
> + return true;
> + default:
> + return false;
> + }
> +}
> +
> static int riscv_iommu_of_xlate(struct device *dev, const struct of_phandle_args *args)
> {
> return iommu_fwspec_add_ids(dev, args->args, 1);
> @@ -1526,6 +1537,7 @@ static void riscv_iommu_release_device(struct device *dev)
> static const struct iommu_ops riscv_iommu_ops = {
> .pgsize_bitmap = SZ_4K,
> .of_xlate = riscv_iommu_of_xlate,
> + .capable = riscv_iommu_capable,
> .identity_domain = &riscv_iommu_identity_domain,
> .blocked_domain = &riscv_iommu_blocking_domain,
> .release_domain = &riscv_iommu_blocking_domain,
^ permalink raw reply [flat|nested] 36+ messages in thread

* Re: [RFC PATCH 04/15] iommu/riscv: report iommu capabilities
2024-11-15 15:20 ` Robin Murphy
@ 2024-11-19 8:28 ` Andrew Jones
0 siblings, 0 replies; 36+ messages in thread
From: Andrew Jones @ 2024-11-19 8:28 UTC (permalink / raw)
To: Robin Murphy
Cc: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel, tjeznach,
zong.li, joro, will, anup, atishp, tglx, alex.williamson,
paul.walmsley, palmer, aou
On Fri, Nov 15, 2024 at 03:20:36PM +0000, Robin Murphy wrote:
> On 14/11/2024 4:18 pm, Andrew Jones wrote:
> > From: Tomasz Jeznach <tjeznach@rivosinc.com>
> >
> > Report RISC-V IOMMU capabilities required by VFIO subsystem
> > to enable PCIe device assignment.
>
> IOMMU_CAP_DEFERRED_FLUSH has nothing at all to do with VFIO. As far as I can
> tell from what's queued, riscv_iommu_unmap_pages() isn't really implementing
> the full optimisation to get the most out of it either.
Thanks, Robin. I'll drop this cap for the next version.
>
> I guess IOMMU_CAP_CACHE_COHERENCY falls out of the assumption of a coherent
> IOMMU and lack of PBMT support making everything implicitly IOMMU_CACHE all
> the time whether you want it or not, but clarifying that might be nice
> (especially since there's some chance that something will eventually come
> along to break it...)
Yes, riscv selects ARCH_DMA_DEFAULT_COHERENT and the riscv IOMMU hardware
descriptions don't provide any way to say otherwise. I can put a comment
above the IOMMU_CAP_CACHE_COHERENCY case which states "The RISC-V IOMMU is
always DMA cache coherent", or did you have something else in mind?
Thanks,
drew
>
> Thanks,
> Robin.
>
> > Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> > Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
> > ---
> > drivers/iommu/riscv/iommu.c | 12 ++++++++++++
> > 1 file changed, 12 insertions(+)
> >
> > diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> > index 8a05def774bd..3fe4ceba8dd3 100644
> > --- a/drivers/iommu/riscv/iommu.c
> > +++ b/drivers/iommu/riscv/iommu.c
> > @@ -1462,6 +1462,17 @@ static struct iommu_group *riscv_iommu_device_group(struct device *dev)
> > return generic_device_group(dev);
> > }
> > +static bool riscv_iommu_capable(struct device *dev, enum iommu_cap cap)
> > +{
> > + switch (cap) {
> > + case IOMMU_CAP_CACHE_COHERENCY:
> > + case IOMMU_CAP_DEFERRED_FLUSH:
> > + return true;
> > + default:
> > + return false;
> > + }
> > +}
> > +
> > static int riscv_iommu_of_xlate(struct device *dev, const struct of_phandle_args *args)
> > {
> > return iommu_fwspec_add_ids(dev, args->args, 1);
> > @@ -1526,6 +1537,7 @@ static void riscv_iommu_release_device(struct device *dev)
> > static const struct iommu_ops riscv_iommu_ops = {
> > .pgsize_bitmap = SZ_4K,
> > .of_xlate = riscv_iommu_of_xlate,
> > + .capable = riscv_iommu_capable,
> > .identity_domain = &riscv_iommu_identity_domain,
> > .blocked_domain = &riscv_iommu_blocking_domain,
> > .release_domain = &riscv_iommu_blocking_domain,
^ permalink raw reply [flat|nested] 36+ messages in thread
* [RFC PATCH 05/15] iommu/riscv: use data structure instead of individual values
2024-11-14 16:18 [RFC PATCH 00/15] iommu/riscv: Add irqbypass support Andrew Jones
` (3 preceding siblings ...)
2024-11-14 16:18 ` [RFC PATCH 04/15] iommu/riscv: report iommu capabilities Andrew Jones
@ 2024-11-14 16:18 ` Andrew Jones
2024-11-14 16:18 ` [RFC PATCH 06/15] iommu/riscv: support GSCID and GVMA invalidation command Andrew Jones
` (9 subsequent siblings)
14 siblings, 0 replies; 36+ messages in thread
From: Andrew Jones @ 2024-11-14 16:18 UTC (permalink / raw)
To: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel
Cc: tjeznach, zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
From: Zong Li <zong.li@sifive.com>
The number of parameters will grow as we need to set up more
bit fields in the device context. Use a data structure to
wrap them up.
Signed-off-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
---
drivers/iommu/riscv/iommu.c | 31 +++++++++++++++++++------------
1 file changed, 19 insertions(+), 12 deletions(-)
diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 3fe4ceba8dd3..9d7945dc3c24 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -1001,7 +1001,7 @@ static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain,
* interim translation faults.
*/
static void riscv_iommu_iodir_update(struct riscv_iommu_device *iommu,
- struct device *dev, u64 fsc, u64 ta)
+ struct device *dev, struct riscv_iommu_dc *new_dc)
{
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
struct riscv_iommu_dc *dc;
@@ -1035,10 +1035,10 @@ static void riscv_iommu_iodir_update(struct riscv_iommu_device *iommu,
for (i = 0; i < fwspec->num_ids; i++) {
dc = riscv_iommu_get_dc(iommu, fwspec->ids[i]);
tc = READ_ONCE(dc->tc);
- tc |= ta & RISCV_IOMMU_DC_TC_V;
+ tc |= new_dc->ta & RISCV_IOMMU_DC_TC_V;
- WRITE_ONCE(dc->fsc, fsc);
- WRITE_ONCE(dc->ta, ta & RISCV_IOMMU_PC_TA_PSCID);
+ WRITE_ONCE(dc->fsc, new_dc->fsc);
+ WRITE_ONCE(dc->ta, new_dc->ta & RISCV_IOMMU_PC_TA_PSCID);
/* Update device context, write TC.V as the last step. */
dma_wmb();
WRITE_ONCE(dc->tc, tc);
@@ -1315,20 +1315,20 @@ static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
struct riscv_iommu_device *iommu = dev_to_iommu(dev);
struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
- u64 fsc, ta;
+ struct riscv_iommu_dc dc = {0};
if (!riscv_iommu_pt_supported(iommu, domain->pgd_mode))
return -ENODEV;
- fsc = FIELD_PREP(RISCV_IOMMU_PC_FSC_MODE, domain->pgd_mode) |
- FIELD_PREP(RISCV_IOMMU_PC_FSC_PPN, virt_to_pfn(domain->pgd_root));
- ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, domain->pscid) |
- RISCV_IOMMU_PC_TA_V;
+ dc.fsc = FIELD_PREP(RISCV_IOMMU_PC_FSC_MODE, domain->pgd_mode) |
+ FIELD_PREP(RISCV_IOMMU_PC_FSC_PPN, virt_to_pfn(domain->pgd_root));
+ dc.ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, domain->pscid) |
+ RISCV_IOMMU_PC_TA_V;
if (riscv_iommu_bond_link(domain, dev))
return -ENOMEM;
- riscv_iommu_iodir_update(iommu, dev, fsc, ta);
+ riscv_iommu_iodir_update(iommu, dev, &dc);
riscv_iommu_bond_unlink(info->domain, dev);
info->domain = domain;
@@ -1419,9 +1419,12 @@ static int riscv_iommu_attach_blocking_domain(struct iommu_domain *iommu_domain,
{
struct riscv_iommu_device *iommu = dev_to_iommu(dev);
struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
+ struct riscv_iommu_dc dc = {0};
+
+ dc.fsc = RISCV_IOMMU_FSC_BARE;
/* Make device context invalid, translation requests will fault w/ #258 */
- riscv_iommu_iodir_update(iommu, dev, RISCV_IOMMU_FSC_BARE, 0);
+ riscv_iommu_iodir_update(iommu, dev, &dc);
riscv_iommu_bond_unlink(info->domain, dev);
info->domain = NULL;
@@ -1440,8 +1443,12 @@ static int riscv_iommu_attach_identity_domain(struct iommu_domain *iommu_domain,
{
struct riscv_iommu_device *iommu = dev_to_iommu(dev);
struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
+ struct riscv_iommu_dc dc = {0};
+
+ dc.fsc = RISCV_IOMMU_FSC_BARE;
+ dc.ta = RISCV_IOMMU_PC_TA_V;
- riscv_iommu_iodir_update(iommu, dev, RISCV_IOMMU_FSC_BARE, RISCV_IOMMU_PC_TA_V);
+ riscv_iommu_iodir_update(iommu, dev, &dc);
riscv_iommu_bond_unlink(info->domain, dev);
info->domain = NULL;
--
2.47.0
^ permalink raw reply related [flat|nested] 36+ messages in thread

* [RFC PATCH 06/15] iommu/riscv: support GSCID and GVMA invalidation command
2024-11-14 16:18 [RFC PATCH 00/15] iommu/riscv: Add irqbypass support Andrew Jones
` (4 preceding siblings ...)
2024-11-14 16:18 ` [RFC PATCH 05/15] iommu/riscv: use data structure instead of individual values Andrew Jones
@ 2024-11-14 16:18 ` Andrew Jones
2024-11-14 16:18 ` [RFC PATCH 07/15] iommu/riscv: Move definitions to iommu.h Andrew Jones
` (8 subsequent siblings)
14 siblings, 0 replies; 36+ messages in thread
From: Andrew Jones @ 2024-11-14 16:18 UTC (permalink / raw)
To: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel
Cc: tjeznach, zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
From: Zong Li <zong.li@sifive.com>
This patch adds an ID allocator for GSCID and a wrapper for setting up
the GSCID in the IOTLB invalidation command.
Set up iohgatp to enable the second-stage table and flush the stage-2
table if the GSCID is set.
The GSCID of a domain should be freed when the domain is released. A
GSCID will be allocated for the parent domain in the nested IOMMU case.
Signed-off-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
---
drivers/iommu/riscv/iommu-bits.h | 7 +++++++
drivers/iommu/riscv/iommu.c | 32 ++++++++++++++++++++++++++------
2 files changed, 33 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/riscv/iommu-bits.h b/drivers/iommu/riscv/iommu-bits.h
index 98daf0e1a306..d72b982cf9bf 100644
--- a/drivers/iommu/riscv/iommu-bits.h
+++ b/drivers/iommu/riscv/iommu-bits.h
@@ -715,6 +715,13 @@ static inline void riscv_iommu_cmd_inval_vma(struct riscv_iommu_command *cmd)
cmd->dword1 = 0;
}
+static inline void riscv_iommu_cmd_inval_gvma(struct riscv_iommu_command *cmd)
+{
+ cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IOTINVAL_OPCODE) |
+ FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA);
+ cmd->dword1 = 0;
+}
+
static inline void riscv_iommu_cmd_inval_set_addr(struct riscv_iommu_command *cmd,
u64 addr)
{
diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 9d7945dc3c24..ef38a1bb3eca 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -45,6 +45,10 @@
static DEFINE_IDA(riscv_iommu_pscids);
#define RISCV_IOMMU_MAX_PSCID (BIT(20) - 1)
+/* IOMMU GSCID allocation namespace. */
+static DEFINE_IDA(riscv_iommu_gscids);
+#define RISCV_IOMMU_MAX_GSCID (BIT(16) - 1)
+
/* Device resource-managed allocations */
struct riscv_iommu_devres {
void *addr;
@@ -801,6 +805,7 @@ struct riscv_iommu_domain {
struct list_head bonds;
spinlock_t lock; /* protect bonds list updates. */
int pscid;
+ int gscid;
bool amo_enabled;
int numa_node;
unsigned int pgd_mode;
@@ -954,15 +959,20 @@ static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain,
/*
* IOTLB invalidation request can be safely omitted if already sent
- * to the IOMMU for the same PSCID, and with domain->bonds list
+ * to the IOMMU for the same PSCID/GSCID, and with domain->bonds list
* arranged based on the device's IOMMU, it's sufficient to check
* last device the invalidation was sent to.
*/
if (iommu == prev)
continue;
- riscv_iommu_cmd_inval_vma(&cmd);
- riscv_iommu_cmd_inval_set_pscid(&cmd, domain->pscid);
+ if (domain->gscid) {
+ riscv_iommu_cmd_inval_gvma(&cmd);
+ riscv_iommu_cmd_inval_set_gscid(&cmd, domain->gscid);
+ } else {
+ riscv_iommu_cmd_inval_vma(&cmd);
+ riscv_iommu_cmd_inval_set_pscid(&cmd, domain->pscid);
+ }
if (len && len < RISCV_IOMMU_IOTLB_INVAL_LIMIT) {
for (iova = start; iova < end; iova += PAGE_SIZE) {
riscv_iommu_cmd_inval_set_addr(&cmd, iova);
@@ -1039,6 +1049,7 @@ static void riscv_iommu_iodir_update(struct riscv_iommu_device *iommu,
WRITE_ONCE(dc->fsc, new_dc->fsc);
WRITE_ONCE(dc->ta, new_dc->ta & RISCV_IOMMU_PC_TA_PSCID);
+ WRITE_ONCE(dc->iohgatp, new_dc->iohgatp);
/* Update device context, write TC.V as the last step. */
dma_wmb();
WRITE_ONCE(dc->tc, tc);
@@ -1287,8 +1298,10 @@ static void riscv_iommu_free_paging_domain(struct iommu_domain *iommu_domain)
WARN_ON(!list_empty(&domain->bonds));
- if ((int)domain->pscid > 0)
+ if (domain->pscid > 0)
ida_free(&riscv_iommu_pscids, domain->pscid);
+ if (domain->gscid > 0)
+ ida_free(&riscv_iommu_gscids, domain->gscid);
riscv_iommu_pte_free(domain, _io_pte_entry(pfn, _PAGE_TABLE), NULL);
kfree(domain);
@@ -1320,8 +1333,15 @@ static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
if (!riscv_iommu_pt_supported(iommu, domain->pgd_mode))
return -ENODEV;
- dc.fsc = FIELD_PREP(RISCV_IOMMU_PC_FSC_MODE, domain->pgd_mode) |
- FIELD_PREP(RISCV_IOMMU_PC_FSC_PPN, virt_to_pfn(domain->pgd_root));
+ if (domain->gscid) {
+ dc.iohgatp = FIELD_PREP(RISCV_IOMMU_DC_IOHGATP_MODE, domain->pgd_mode) |
+ FIELD_PREP(RISCV_IOMMU_DC_IOHGATP_GSCID, domain->gscid) |
+ FIELD_PREP(RISCV_IOMMU_DC_IOHGATP_PPN, virt_to_pfn(domain->pgd_root));
+ } else {
+ dc.fsc = FIELD_PREP(RISCV_IOMMU_PC_FSC_MODE, domain->pgd_mode) |
+ FIELD_PREP(RISCV_IOMMU_PC_FSC_PPN, virt_to_pfn(domain->pgd_root));
+ }
+
dc.ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, domain->pscid) |
RISCV_IOMMU_PC_TA_V;
--
2.47.0
^ permalink raw reply related [flat|nested] 36+ messages in thread

* [RFC PATCH 07/15] iommu/riscv: Move definitions to iommu.h
2024-11-14 16:18 [RFC PATCH 00/15] iommu/riscv: Add irqbypass support Andrew Jones
` (5 preceding siblings ...)
2024-11-14 16:18 ` [RFC PATCH 06/15] iommu/riscv: support GSCID and GVMA invalidation command Andrew Jones
@ 2024-11-14 16:18 ` Andrew Jones
2024-11-14 16:18 ` [RFC PATCH 08/15] iommu/riscv: Add IRQ domain for interrupt remapping Andrew Jones
` (7 subsequent siblings)
14 siblings, 0 replies; 36+ messages in thread
From: Andrew Jones @ 2024-11-14 16:18 UTC (permalink / raw)
To: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel
Cc: tjeznach, zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
In order to add the interrupt remapping support in a separate file,
share definitions through the header and make some functions public.
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
---
drivers/iommu/riscv/iommu-bits.h | 4 ++
drivers/iommu/riscv/iommu.c | 71 ++++----------------------------
drivers/iommu/riscv/iommu.h | 54 ++++++++++++++++++++++++
3 files changed, 67 insertions(+), 62 deletions(-)
diff --git a/drivers/iommu/riscv/iommu-bits.h b/drivers/iommu/riscv/iommu-bits.h
index d72b982cf9bf..d3d98dbed709 100644
--- a/drivers/iommu/riscv/iommu-bits.h
+++ b/drivers/iommu/riscv/iommu-bits.h
@@ -36,6 +36,10 @@
#define RISCV_IOMMU_ATP_PPN_FIELD GENMASK_ULL(43, 0)
#define RISCV_IOMMU_ATP_MODE_FIELD GENMASK_ULL(63, 60)
+/* RISC-V IOMMU PPN <> PHYS address conversions, PHYS <=> PPN[53:10] */
+#define riscv_iommu_phys_to_ppn(pa) (((pa) >> 2) & (((1ULL << 44) - 1) << 10))
+#define riscv_iommu_ppn_to_phys(pn) (((pn) << 2) & (((1ULL << 44) - 1) << 12))
+
/* 5.3 IOMMU Capabilities (64bits) */
#define RISCV_IOMMU_REG_CAPABILITIES 0x0000
#define RISCV_IOMMU_CAPABILITIES_VERSION GENMASK_ULL(7, 0)
diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index ef38a1bb3eca..6e8ea3d22ff5 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -24,23 +24,10 @@
#include "iommu-bits.h"
#include "iommu.h"
-/* Timeouts in [us] */
-#define RISCV_IOMMU_QCSR_TIMEOUT 150000
-#define RISCV_IOMMU_QUEUE_TIMEOUT 150000
-#define RISCV_IOMMU_DDTP_TIMEOUT 10000000
-#define RISCV_IOMMU_IOTINVAL_TIMEOUT 90000000
-
/* Number of entries per CMD/FLT queue, should be <= INT_MAX */
#define RISCV_IOMMU_DEF_CQ_COUNT 8192
#define RISCV_IOMMU_DEF_FQ_COUNT 4096
-/* RISC-V IOMMU PPN <> PHYS address conversions, PHYS <=> PPN[53:10] */
-#define phys_to_ppn(pa) (((pa) >> 2) & (((1ULL << 44) - 1) << 10))
-#define ppn_to_phys(pn) (((pn) << 2) & (((1ULL << 44) - 1) << 12))
-
-#define dev_to_iommu(dev) \
- iommu_get_iommu_dev(dev, struct riscv_iommu_device, iommu)
-
/* IOMMU PSCID allocation namespace. */
static DEFINE_IDA(riscv_iommu_pscids);
#define RISCV_IOMMU_MAX_PSCID (BIT(20) - 1)
@@ -177,7 +164,7 @@ static int riscv_iommu_queue_alloc(struct riscv_iommu_device *iommu,
if (!queue->base)
return -ENOMEM;
- qb = phys_to_ppn(queue->phys) |
+ qb = riscv_iommu_phys_to_ppn(queue->phys) |
FIELD_PREP(RISCV_IOMMU_QUEUE_LOG2SZ_FIELD, logsz);
/* Update base register and read back to verify hw accepted our write */
@@ -480,15 +467,15 @@ static irqreturn_t riscv_iommu_cmdq_process(int irq, void *data)
}
/* Send command to the IOMMU command queue */
-static void riscv_iommu_cmd_send(struct riscv_iommu_device *iommu,
- struct riscv_iommu_command *cmd)
+void riscv_iommu_cmd_send(struct riscv_iommu_device *iommu,
+ struct riscv_iommu_command *cmd)
{
riscv_iommu_queue_send(&iommu->cmdq, cmd, sizeof(*cmd));
}
/* Send IOFENCE.C command and wait for all scheduled commands to complete. */
-static void riscv_iommu_cmd_sync(struct riscv_iommu_device *iommu,
- unsigned int timeout_us)
+void riscv_iommu_cmd_sync(struct riscv_iommu_device *iommu,
+ unsigned int timeout_us)
{
struct riscv_iommu_command cmd;
unsigned int prod;
@@ -614,7 +601,7 @@ static struct riscv_iommu_dc *riscv_iommu_get_dc(struct riscv_iommu_device *iomm
do {
ddt = READ_ONCE(*(unsigned long *)ddtp);
if (ddt & RISCV_IOMMU_DDTE_V) {
- ddtp = __va(ppn_to_phys(ddt));
+ ddtp = __va(riscv_iommu_ppn_to_phys(ddt));
break;
}
@@ -622,7 +609,7 @@ static struct riscv_iommu_dc *riscv_iommu_get_dc(struct riscv_iommu_device *iomm
if (!ptr)
return NULL;
- new = phys_to_ppn(__pa(ptr)) | RISCV_IOMMU_DDTE_V;
+ new = riscv_iommu_phys_to_ppn(__pa(ptr)) | RISCV_IOMMU_DDTE_V;
old = cmpxchg_relaxed((unsigned long *)ddtp, ddt, new);
if (old == ddt) {
@@ -687,7 +674,7 @@ static int riscv_iommu_iodir_alloc(struct riscv_iommu_device *iommu)
if (ddtp & RISCV_IOMMU_DDTP_BUSY)
return -EBUSY;
- iommu->ddt_phys = ppn_to_phys(ddtp);
+ iommu->ddt_phys = riscv_iommu_ppn_to_phys(ddtp);
if (iommu->ddt_phys)
iommu->ddt_root = devm_ioremap(iommu->dev,
iommu->ddt_phys, PAGE_SIZE);
@@ -734,7 +721,7 @@ static int riscv_iommu_iodir_set_mode(struct riscv_iommu_device *iommu,
do {
rq_ddtp = FIELD_PREP(RISCV_IOMMU_DDTP_IOMMU_MODE, rq_mode);
if (rq_mode > RISCV_IOMMU_DDTP_IOMMU_MODE_BARE)
- rq_ddtp |= phys_to_ppn(iommu->ddt_phys);
+ rq_ddtp |= riscv_iommu_phys_to_ppn(iommu->ddt_phys);
riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP, rq_ddtp);
ddtp = riscv_iommu_read_ddtp(iommu);
@@ -799,49 +786,9 @@ static int riscv_iommu_iodir_set_mode(struct riscv_iommu_device *iommu,
return 0;
}
-/* This struct contains protection domain specific IOMMU driver data. */
-struct riscv_iommu_domain {
- struct iommu_domain domain;
- struct list_head bonds;
- spinlock_t lock; /* protect bonds list updates. */
- int pscid;
- int gscid;
- bool amo_enabled;
- int numa_node;
- unsigned int pgd_mode;
- unsigned long *pgd_root;
-};
-
#define iommu_domain_to_riscv(iommu_domain) \
container_of(iommu_domain, struct riscv_iommu_domain, domain)
-/* Private IOMMU data for managed devices, dev_iommu_priv_* */
-struct riscv_iommu_info {
- struct riscv_iommu_domain *domain;
-};
-
-/*
- * Linkage between an iommu_domain and attached devices.
- *
- * Protection domain requiring IOATC and DevATC translation cache invalidations,
- * should be linked to attached devices using a riscv_iommu_bond structure.
- * Devices should be linked to the domain before first use and unlinked after
- * the translations from the referenced protection domain can no longer be used.
- * Blocking and identity domains are not tracked here, as the IOMMU hardware
- * does not cache negative and/or identity (BARE mode) translations, and DevATC
- * is disabled for those protection domains.
- *
- * The device pointer and IOMMU data remain stable in the bond struct after
- * _probe_device() where it's attached to the managed IOMMU, up to the
- * completion of the _release_device() call. The release of the bond structure
- * is synchronized with the device release.
- */
-struct riscv_iommu_bond {
- struct list_head list;
- struct rcu_head rcu;
- struct device *dev;
-};
-
static int riscv_iommu_bond_link(struct riscv_iommu_domain *domain,
struct device *dev)
{
diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
index b1c4664542b4..dd538b19fbb7 100644
--- a/drivers/iommu/riscv/iommu.h
+++ b/drivers/iommu/riscv/iommu.h
@@ -17,8 +17,35 @@
#include "iommu-bits.h"
+/* Timeouts in [us] */
+#define RISCV_IOMMU_QCSR_TIMEOUT 150000
+#define RISCV_IOMMU_QUEUE_TIMEOUT 150000
+#define RISCV_IOMMU_DDTP_TIMEOUT 10000000
+#define RISCV_IOMMU_IOTINVAL_TIMEOUT 90000000
+
+/* This struct contains protection domain specific IOMMU driver data. */
+struct riscv_iommu_domain {
+ struct iommu_domain domain;
+ struct list_head bonds;
+ spinlock_t lock; /* protect bonds list updates. */
+ int pscid;
+ int gscid;
+ bool amo_enabled;
+ int numa_node;
+ unsigned int pgd_mode;
+ unsigned long *pgd_root;
+};
+
+/* Private IOMMU data for managed devices, dev_iommu_priv_* */
+struct riscv_iommu_info {
+ struct riscv_iommu_domain *domain;
+};
+
struct riscv_iommu_device;
+#define dev_to_iommu(dev) \
+ iommu_get_iommu_dev(dev, struct riscv_iommu_device, iommu)
+
struct riscv_iommu_queue {
atomic_t prod; /* unbounded producer allocation index */
atomic_t head; /* unbounded shadow ring buffer consumer index */
@@ -62,9 +89,36 @@ struct riscv_iommu_device {
u64 *ddt_root;
};
+/*
+ * Linkage between an iommu_domain and attached devices.
+ *
+ * Protection domain requiring IOATC and DevATC translation cache invalidations,
+ * should be linked to attached devices using a riscv_iommu_bond structure.
+ * Devices should be linked to the domain before first use and unlinked after
+ * the translations from the referenced protection domain can no longer be used.
+ * Blocking and identity domains are not tracked here, as the IOMMU hardware
+ * does not cache negative and/or identity (BARE mode) translations, and DevATC
+ * is disabled for those protection domains.
+ *
+ * The device pointer and IOMMU data remain stable in the bond struct after
+ * _probe_device() where it's attached to the managed IOMMU, up to the
+ * completion of the _release_device() call. The release of the bond structure
+ * is synchronized with the device release.
+ */
+struct riscv_iommu_bond {
+ struct list_head list;
+ struct rcu_head rcu;
+ struct device *dev;
+};
+
int riscv_iommu_init(struct riscv_iommu_device *iommu);
void riscv_iommu_remove(struct riscv_iommu_device *iommu);
+void riscv_iommu_cmd_send(struct riscv_iommu_device *iommu,
+ struct riscv_iommu_command *cmd);
+void riscv_iommu_cmd_sync(struct riscv_iommu_device *iommu,
+ unsigned int timeout_us);
+
#define riscv_iommu_readl(iommu, addr) \
readl_relaxed((iommu)->reg + (addr))
--
2.47.0
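The PPN conversion helpers moved above pack a 44-bit PPN into bits [53:10] of a directory-table entry. A quick standalone round-trip check of the macros, copied exactly as defined in the patch (the `roundtrip()` helper is only for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the patch: PHYS <=> PPN[53:10] conversions. */
#define riscv_iommu_phys_to_ppn(pa) (((pa) >> 2) & (((1ULL << 44) - 1) << 10))
#define riscv_iommu_ppn_to_phys(pn) (((pn) << 2) & (((1ULL << 44) - 1) << 12))

/*
 * Round-trip helper: a 4KiB-aligned physical address representable in
 * 56 bits survives the conversion to the PPN field and back; sub-page
 * bits are masked off along the way.
 */
static uint64_t roundtrip(uint64_t pa)
{
	return riscv_iommu_ppn_to_phys(riscv_iommu_phys_to_ppn(pa));
}
```

Note that phys_to_ppn() shifts right by 2 rather than by 12 because the result is the PPN already placed in its field position within the 64-bit entry (bit 12 of the PA lands at bit 10 of the entry).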
^ permalink raw reply related [flat|nested] 36+ messages in thread

* [RFC PATCH 08/15] iommu/riscv: Add IRQ domain for interrupt remapping
2024-11-14 16:18 [RFC PATCH 00/15] iommu/riscv: Add irqbypass support Andrew Jones
` (6 preceding siblings ...)
2024-11-14 16:18 ` [RFC PATCH 07/15] iommu/riscv: Move definitions to iommu.h Andrew Jones
@ 2024-11-14 16:18 ` Andrew Jones
2024-11-18 18:43 ` Jason Gunthorpe
2024-11-14 16:18 ` [RFC PATCH 09/15] RISC-V: KVM: Enable KVM_VFIO interfaces on RISC-V arch Andrew Jones
` (6 subsequent siblings)
14 siblings, 1 reply; 36+ messages in thread
From: Andrew Jones @ 2024-11-14 16:18 UTC (permalink / raw)
To: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel
Cc: tjeznach, zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
This is just a skeleton. Until irq_set_vcpu_affinity() is
implemented, the IRQ domain doesn't serve any purpose.
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
---
drivers/iommu/riscv/Makefile | 2 +-
drivers/iommu/riscv/iommu-ir.c | 209 +++++++++++++++++++++++++++++++++
drivers/iommu/riscv/iommu.c | 43 ++++++-
drivers/iommu/riscv/iommu.h | 21 ++++
4 files changed, 270 insertions(+), 5 deletions(-)
create mode 100644 drivers/iommu/riscv/iommu-ir.c
diff --git a/drivers/iommu/riscv/Makefile b/drivers/iommu/riscv/Makefile
index f54c9ed17d41..8420dd1776cb 100644
--- a/drivers/iommu/riscv/Makefile
+++ b/drivers/iommu/riscv/Makefile
@@ -1,3 +1,3 @@
# SPDX-License-Identifier: GPL-2.0-only
-obj-$(CONFIG_RISCV_IOMMU) += iommu.o iommu-platform.o
+obj-$(CONFIG_RISCV_IOMMU) += iommu.o iommu-ir.o iommu-platform.o
obj-$(CONFIG_RISCV_IOMMU_PCI) += iommu-pci.o
diff --git a/drivers/iommu/riscv/iommu-ir.c b/drivers/iommu/riscv/iommu-ir.c
new file mode 100644
index 000000000000..c177e064b205
--- /dev/null
+++ b/drivers/iommu/riscv/iommu-ir.c
@@ -0,0 +1,209 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * IOMMU Interrupt Remapping
+ *
+ * Copyright © 2024 Ventana Micro Systems Inc.
+ */
+#include <linux/irqdomain.h>
+#include <linux/msi.h>
+
+#include "../iommu-pages.h"
+#include "iommu.h"
+
+static size_t riscv_iommu_ir_get_msipte_idx(struct riscv_iommu_domain *domain,
+ phys_addr_t msi_pa)
+{
+ phys_addr_t addr = msi_pa >> 12;
+ size_t idx;
+
+ if (domain->group_index_bits) {
+ phys_addr_t group_mask = BIT(domain->group_index_bits) - 1;
+ phys_addr_t group_shift = domain->group_index_shift - 12;
+ phys_addr_t group = (addr >> group_shift) & group_mask;
+ phys_addr_t mask = domain->msiptp.msi_addr_mask & ~(group_mask << group_shift);
+
+ idx = addr & mask;
+ idx |= group << fls64(mask);
+ } else {
+ idx = addr & domain->msiptp.msi_addr_mask;
+ }
+
+ return idx;
+}
+
+static struct riscv_iommu_msipte *riscv_iommu_ir_get_msipte(struct riscv_iommu_domain *domain,
+ phys_addr_t msi_pa)
+{
+ size_t idx;
+
+ if (((msi_pa >> 12) & ~domain->msiptp.msi_addr_mask) != domain->msiptp.msi_addr_pattern)
+ return NULL;
+
+ idx = riscv_iommu_ir_get_msipte_idx(domain, msi_pa);
+ return &domain->msi_root[idx];
+}
+
+static size_t riscv_iommu_ir_nr_msiptes(struct riscv_iommu_domain *domain)
+{
+ phys_addr_t base = domain->msiptp.msi_addr_pattern << 12;
+ phys_addr_t max_addr = base | (domain->msiptp.msi_addr_mask << 12);
+ size_t max_idx = riscv_iommu_ir_get_msipte_idx(domain, max_addr);
+
+ return max_idx + 1;
+}
+
+static struct irq_chip riscv_iommu_irq_chip = {
+ .name = "IOMMU-IR",
+ .irq_mask = irq_chip_mask_parent,
+ .irq_unmask = irq_chip_unmask_parent,
+};
+
+static int riscv_iommu_irq_domain_alloc_irqs(struct irq_domain *irqdomain,
+ unsigned int irq_base, unsigned int nr_irqs,
+ void *arg)
+{
+ struct irq_data *data;
+ int i, ret;
+
+ ret = irq_domain_alloc_irqs_parent(irqdomain, irq_base, nr_irqs, arg);
+ if (ret)
+ return ret;
+
+ for (i = 0; i < nr_irqs; i++) {
+ data = irq_domain_get_irq_data(irqdomain, irq_base + i);
+ data->chip = &riscv_iommu_irq_chip;
+ }
+
+ return 0;
+}
+
+static const struct irq_domain_ops riscv_iommu_irq_domain_ops = {
+ .alloc = riscv_iommu_irq_domain_alloc_irqs,
+ .free = irq_domain_free_irqs_parent,
+};
+
+static const struct msi_parent_ops riscv_iommu_msi_parent_ops = {
+ .prefix = "IR-",
+ .supported_flags = MSI_GENERIC_FLAGS_MASK |
+ MSI_FLAG_PCI_MSIX,
+ .required_flags = MSI_FLAG_USE_DEF_DOM_OPS |
+ MSI_FLAG_USE_DEF_CHIP_OPS,
+ .init_dev_msi_info = msi_parent_init_dev_msi_info,
+};
+
+int riscv_iommu_irq_domain_create(struct riscv_iommu_domain *domain,
+ struct device *dev)
+{
+ struct riscv_iommu_device *iommu = dev_to_iommu(dev);
+ struct fwnode_handle *fn;
+ char *fwname;
+
+ if (domain->irqdomain) {
+ dev_set_msi_domain(dev, domain->irqdomain);
+ return 0;
+ }
+
+ if (!(iommu->caps & RISCV_IOMMU_CAPABILITIES_MSI_FLAT)) {
+ dev_warn(iommu->dev, "Cannot enable interrupt remapping\n");
+ return 0;
+ }
+
+ spin_lock_init(&domain->msi_lock);
+ /*
+ * TODO: The hypervisor should be in control of this size. For now
+ * we just allocate enough space for 512 VCPUs.
+ */
+ domain->msi_order = 1;
+ domain->msi_root = iommu_alloc_pages_node(domain->numa_node,
+ GFP_KERNEL_ACCOUNT, domain->msi_order);
+ if (!domain->msi_root)
+ return -ENOMEM;
+
+ fwname = kasprintf(GFP_KERNEL, "IOMMU-IR-%s", dev_name(dev));
+ if (!fwname) {
+ iommu_free_pages(domain->msi_root, domain->msi_order);
+ return -ENOMEM;
+ }
+
+ fn = irq_domain_alloc_named_fwnode(fwname);
+ kfree(fwname);
+ if (!fn) {
+ dev_err(iommu->dev, "Couldn't allocate fwnode\n");
+ iommu_free_pages(domain->msi_root, domain->msi_order);
+ return -ENOMEM;
+ }
+
+ domain->irqdomain = irq_domain_create_hierarchy(dev_get_msi_domain(dev),
+ 0, 0, fn,
+ &riscv_iommu_irq_domain_ops,
+ domain);
+ if (!domain->irqdomain) {
+ dev_err(iommu->dev, "Failed to create IOMMU irq domain\n");
+ iommu_free_pages(domain->msi_root, domain->msi_order);
+ irq_domain_free_fwnode(fn);
+ return -ENOMEM;
+ }
+
+ domain->irqdomain->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT |
+ IRQ_DOMAIN_FLAG_ISOLATED_MSI;
+ domain->irqdomain->msi_parent_ops = &riscv_iommu_msi_parent_ops;
+ irq_domain_update_bus_token(domain->irqdomain, DOMAIN_BUS_MSI_REMAP);
+ dev_set_msi_domain(dev, domain->irqdomain);
+
+ return 0;
+}
+
+void riscv_iommu_ir_get_resv_regions(struct device *dev, struct list_head *head)
+{
+ struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
+ struct riscv_iommu_domain *domain = info->domain;
+ struct iommu_resv_region *reg;
+ phys_addr_t base, addr;
+ size_t nr_pages, i;
+
+ if (!domain || !domain->msiptp.msiptp)
+ return;
+
+ base = domain->msiptp.msi_addr_pattern << 12;
+
+ if (domain->group_index_bits) {
+ phys_addr_t group_mask = BIT(domain->group_index_bits) - 1;
+ phys_addr_t group_shift = domain->group_index_shift - 12;
+ phys_addr_t mask = domain->msiptp.msi_addr_mask & ~(group_mask << group_shift);
+
+ nr_pages = mask + 1;
+ } else {
+ nr_pages = domain->msiptp.msi_addr_mask + 1;
+ }
+
+ for (i = 0; i < BIT(domain->group_index_bits); i++) {
+ addr = base | (i << domain->group_index_shift);
+ reg = iommu_alloc_resv_region(addr, nr_pages * 4096,
+ 0, IOMMU_RESV_MSI, GFP_KERNEL);
+ if (reg)
+ list_add_tail(&reg->list, head);
+ }
+}
+
+void riscv_iommu_irq_domain_remove(struct riscv_iommu_domain *domain)
+{
+ struct fwnode_handle *fn;
+
+ if (!domain->irqdomain)
+ return;
+
+ iommu_free_pages(domain->msi_root, domain->msi_order);
+
+ fn = domain->irqdomain->fwnode;
+ irq_domain_remove(domain->irqdomain);
+ irq_domain_free_fwnode(fn);
+}
+
+void riscv_iommu_irq_domain_unlink(struct riscv_iommu_domain *domain,
+ struct device *dev)
+{
+ if (!domain || !domain->irqdomain)
+ return;
+
+ dev_set_msi_domain(dev, domain->irqdomain->parent);
+}
diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 6e8ea3d22ff5..c4a47b21c58f 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -943,7 +943,8 @@ static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain,
rcu_read_unlock();
}
-#define RISCV_IOMMU_FSC_BARE 0
+#define RISCV_IOMMU_FSC_BARE 0
+#define RISCV_IOMMU_IOHGATP_BARE 0
/*
* Update IODIR for the device.
@@ -1245,6 +1246,8 @@ static void riscv_iommu_free_paging_domain(struct iommu_domain *iommu_domain)
WARN_ON(!list_empty(&domain->bonds));
+ riscv_iommu_irq_domain_remove(domain);
+
if (domain->pscid > 0)
ida_free(&riscv_iommu_pscids, domain->pscid);
if (domain->gscid > 0)
@@ -1276,10 +1279,30 @@ static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
struct riscv_iommu_device *iommu = dev_to_iommu(dev);
struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
struct riscv_iommu_dc dc = {0};
+ int ret;
if (!riscv_iommu_pt_supported(iommu, domain->pgd_mode))
return -ENODEV;
+ if (riscv_iommu_bond_link(domain, dev))
+ return -ENOMEM;
+
+ if (iommu_domain->type == IOMMU_DOMAIN_UNMANAGED) {
+ domain->gscid = ida_alloc_range(&riscv_iommu_gscids, 1,
+ RISCV_IOMMU_MAX_GSCID, GFP_KERNEL);
+ if (domain->gscid < 0) {
+ riscv_iommu_bond_unlink(domain, dev);
+ return -ENOMEM;
+ }
+
+ ret = riscv_iommu_irq_domain_create(domain, dev);
+ if (ret) {
+ riscv_iommu_bond_unlink(domain, dev);
+ ida_free(&riscv_iommu_gscids, domain->gscid);
+ return ret;
+ }
+ }
+
if (domain->gscid) {
dc.iohgatp = FIELD_PREP(RISCV_IOMMU_DC_IOHGATP_MODE, domain->pgd_mode) |
FIELD_PREP(RISCV_IOMMU_DC_IOHGATP_GSCID, domain->gscid) |
@@ -1292,10 +1315,9 @@ static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
dc.ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, domain->pscid) |
RISCV_IOMMU_PC_TA_V;
- if (riscv_iommu_bond_link(domain, dev))
- return -ENOMEM;
-
riscv_iommu_iodir_update(iommu, dev, &dc);
+
+ riscv_iommu_irq_domain_unlink(info->domain, dev);
riscv_iommu_bond_unlink(info->domain, dev);
info->domain = domain;
@@ -1389,9 +1411,12 @@ static int riscv_iommu_attach_blocking_domain(struct iommu_domain *iommu_domain,
struct riscv_iommu_dc dc = {0};
dc.fsc = RISCV_IOMMU_FSC_BARE;
+ dc.iohgatp = RISCV_IOMMU_IOHGATP_BARE;
/* Make device context invalid, translation requests will fault w/ #258 */
riscv_iommu_iodir_update(iommu, dev, &dc);
+
+ riscv_iommu_irq_domain_unlink(info->domain, dev);
riscv_iommu_bond_unlink(info->domain, dev);
info->domain = NULL;
@@ -1413,15 +1438,24 @@ static int riscv_iommu_attach_identity_domain(struct iommu_domain *iommu_domain,
struct riscv_iommu_dc dc = {0};
dc.fsc = RISCV_IOMMU_FSC_BARE;
+ dc.iohgatp = RISCV_IOMMU_IOHGATP_BARE;
dc.ta = RISCV_IOMMU_PC_TA_V;
riscv_iommu_iodir_update(iommu, dev, &dc);
+
+ riscv_iommu_irq_domain_unlink(info->domain, dev);
riscv_iommu_bond_unlink(info->domain, dev);
info->domain = NULL;
return 0;
}
+static void riscv_iommu_get_resv_regions(struct device *dev,
+ struct list_head *head)
+{
+ riscv_iommu_ir_get_resv_regions(dev, head);
+}
+
static struct iommu_domain riscv_iommu_identity_domain = {
.type = IOMMU_DOMAIN_IDENTITY,
.ops = &(const struct iommu_domain_ops) {
@@ -1516,6 +1550,7 @@ static const struct iommu_ops riscv_iommu_ops = {
.blocked_domain = &riscv_iommu_blocking_domain,
.release_domain = &riscv_iommu_blocking_domain,
.domain_alloc_paging = riscv_iommu_alloc_paging_domain,
+ .get_resv_regions = riscv_iommu_get_resv_regions,
.device_group = riscv_iommu_device_group,
.probe_device = riscv_iommu_probe_device,
.release_device = riscv_iommu_release_device,
diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
index dd538b19fbb7..6ce71095781c 100644
--- a/drivers/iommu/riscv/iommu.h
+++ b/drivers/iommu/riscv/iommu.h
@@ -23,6 +23,12 @@
#define RISCV_IOMMU_DDTP_TIMEOUT 10000000
#define RISCV_IOMMU_IOTINVAL_TIMEOUT 90000000
+struct riscv_iommu_msiptp_state {
+ u64 msiptp;
+ u64 msi_addr_mask;
+ u64 msi_addr_pattern;
+};
+
/* This struct contains protection domain specific IOMMU driver data. */
struct riscv_iommu_domain {
struct iommu_domain domain;
@@ -34,6 +40,13 @@ struct riscv_iommu_domain {
int numa_node;
unsigned int pgd_mode;
unsigned long *pgd_root;
+ u32 group_index_bits;
+ u32 group_index_shift;
+ int msi_order;
+ struct riscv_iommu_msipte *msi_root;
+ spinlock_t msi_lock;
+ struct riscv_iommu_msiptp_state msiptp;
+ struct irq_domain *irqdomain;
};
/* Private IOMMU data for managed devices, dev_iommu_priv_* */
@@ -119,6 +132,14 @@ void riscv_iommu_cmd_send(struct riscv_iommu_device *iommu,
void riscv_iommu_cmd_sync(struct riscv_iommu_device *iommu,
unsigned int timeout_us);
+int riscv_iommu_irq_domain_create(struct riscv_iommu_domain *domain,
+ struct device *dev);
+void riscv_iommu_irq_domain_remove(struct riscv_iommu_domain *domain);
+void riscv_iommu_irq_domain_unlink(struct riscv_iommu_domain *domain,
+ struct device *dev);
+void riscv_iommu_ir_get_resv_regions(struct device *dev,
+ struct list_head *head);
+
#define riscv_iommu_readl(iommu, addr) \
readl_relaxed((iommu)->reg + (addr))
--
2.47.0
* Re: [RFC PATCH 08/15] iommu/riscv: Add IRQ domain for interrupt remapping
2024-11-14 16:18 ` [RFC PATCH 08/15] iommu/riscv: Add IRQ domain for interrupt remapping Andrew Jones
@ 2024-11-18 18:43 ` Jason Gunthorpe
2024-11-19 7:49 ` Andrew Jones
0 siblings, 1 reply; 36+ messages in thread
From: Jason Gunthorpe @ 2024-11-18 18:43 UTC (permalink / raw)
To: Andrew Jones
Cc: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel, tjeznach,
zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
On Thu, Nov 14, 2024 at 05:18:53PM +0100, Andrew Jones wrote:
> @@ -1276,10 +1279,30 @@ static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
> struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
> struct riscv_iommu_dc dc = {0};
> + int ret;
>
> if (!riscv_iommu_pt_supported(iommu, domain->pgd_mode))
> return -ENODEV;
>
> + if (riscv_iommu_bond_link(domain, dev))
> + return -ENOMEM;
> +
> + if (iommu_domain->type == IOMMU_DOMAIN_UNMANAGED) {
Drivers should not be making tests like this.
> + domain->gscid = ida_alloc_range(&riscv_iommu_gscids, 1,
> + RISCV_IOMMU_MAX_GSCID, GFP_KERNEL);
> + if (domain->gscid < 0) {
> + riscv_iommu_bond_unlink(domain, dev);
> + return -ENOMEM;
> + }
> +
> + ret = riscv_iommu_irq_domain_create(domain, dev);
> + if (ret) {
> + riscv_iommu_bond_unlink(domain, dev);
> + ida_free(&riscv_iommu_gscids, domain->gscid);
> + return ret;
> + }
> + }
What are you trying to do? Make something behave differently for VFIO?
That isn't OK, we are trying to remove all the hacky VFIO special
cases in drivers.
What is the HW issue here? It is very very strange (and probably not
going to work right) that the irq domains change when domain
attachment changes.
The IRQ setup should really be fixed before any device drivers probe
onto the device.
Jason
* Re: [RFC PATCH 08/15] iommu/riscv: Add IRQ domain for interrupt remapping
2024-11-18 18:43 ` Jason Gunthorpe
@ 2024-11-19 7:49 ` Andrew Jones
2024-11-19 14:00 ` Jason Gunthorpe
0 siblings, 1 reply; 36+ messages in thread
From: Andrew Jones @ 2024-11-19 7:49 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel, tjeznach,
zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
On Mon, Nov 18, 2024 at 02:43:36PM -0400, Jason Gunthorpe wrote:
> On Thu, Nov 14, 2024 at 05:18:53PM +0100, Andrew Jones wrote:
> > @@ -1276,10 +1279,30 @@ static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
> > struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> > struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
> > struct riscv_iommu_dc dc = {0};
> > + int ret;
> >
> > if (!riscv_iommu_pt_supported(iommu, domain->pgd_mode))
> > return -ENODEV;
> >
> > + if (riscv_iommu_bond_link(domain, dev))
> > + return -ENOMEM;
> > +
> > + if (iommu_domain->type == IOMMU_DOMAIN_UNMANAGED) {
>
> Drivers should not be making tests like this.
>
> > + domain->gscid = ida_alloc_range(&riscv_iommu_gscids, 1,
> > + RISCV_IOMMU_MAX_GSCID, GFP_KERNEL);
> > + if (domain->gscid < 0) {
> > + riscv_iommu_bond_unlink(domain, dev);
> > + return -ENOMEM;
> > + }
> > +
> > + ret = riscv_iommu_irq_domain_create(domain, dev);
> > + if (ret) {
> > + riscv_iommu_bond_unlink(domain, dev);
> > + ida_free(&riscv_iommu_gscids, domain->gscid);
> > + return ret;
> > + }
> > + }
>
> What are you trying to do? Make something behave differently for VFIO?
> That isn't OK, we are trying to remove all the hacky VFIO special
> cases in drivers.
>
> What is the HW issue here? It is very very strange (and probably not
> going to work right) that the irq domains change when domain
> attachment changes.
>
> The IRQ setup should really be fixed before any device drivers probe
> onto the device.
I can't disagree with the statement that this looks hacky, but considering
that a VFIO domain needs to use the g-stage for its single-stage translation
while a paging domain for the host would use the s-stage, it seems we need
to identify the VFIO domains for their special treatment. Is there an
example of converting VFIO special casing in other drivers to something
cleaner that you can point me at?
The IRQ domain will only be useful for device assignment, as that's when
an MSI translation will be needed. I can't think of any problems that
could arise from only creating the IRQ domain when probing assigned
devices, but I could certainly be missing something. Do you have some
potential problems in mind?
Thanks,
drew
* Re: [RFC PATCH 08/15] iommu/riscv: Add IRQ domain for interrupt remapping
2024-11-19 7:49 ` Andrew Jones
@ 2024-11-19 14:00 ` Jason Gunthorpe
2024-11-19 15:03 ` Andrew Jones
0 siblings, 1 reply; 36+ messages in thread
From: Jason Gunthorpe @ 2024-11-19 14:00 UTC (permalink / raw)
To: Andrew Jones
Cc: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel, tjeznach,
zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
On Tue, Nov 19, 2024 at 08:49:37AM +0100, Andrew Jones wrote:
> On Mon, Nov 18, 2024 at 02:43:36PM -0400, Jason Gunthorpe wrote:
> > On Thu, Nov 14, 2024 at 05:18:53PM +0100, Andrew Jones wrote:
> > > @@ -1276,10 +1279,30 @@ static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
> > > struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> > > struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
> > > struct riscv_iommu_dc dc = {0};
> > > + int ret;
> > >
> > > if (!riscv_iommu_pt_supported(iommu, domain->pgd_mode))
> > > return -ENODEV;
> > >
> > > + if (riscv_iommu_bond_link(domain, dev))
> > > + return -ENOMEM;
> > > +
> > > + if (iommu_domain->type == IOMMU_DOMAIN_UNMANAGED) {
> >
> > Drivers should not be making tests like this.
> >
> > > + domain->gscid = ida_alloc_range(&riscv_iommu_gscids, 1,
> > > + RISCV_IOMMU_MAX_GSCID, GFP_KERNEL);
> > > + if (domain->gscid < 0) {
> > > + riscv_iommu_bond_unlink(domain, dev);
> > > + return -ENOMEM;
> > > + }
> > > +
> > > + ret = riscv_iommu_irq_domain_create(domain, dev);
> > > + if (ret) {
> > > + riscv_iommu_bond_unlink(domain, dev);
> > > + ida_free(&riscv_iommu_gscids, domain->gscid);
> > > + return ret;
> > > + }
> > > + }
> >
> > What are you trying to do? Make something behave differently for VFIO?
> > That isn't OK, we are trying to remove all the hacky VFIO special
> > cases in drivers.
> >
> > What is the HW issue here? It is very very strange (and probably not
> > going to work right) that the irq domains change when domain
> > attachment changes.
> >
> > The IRQ setup should really be fixed before any device drivers probe
> > onto the device.
>
> I can't disagree with the statement that this looks hacky, but considering
> a VFIO domain needs to use the g-stage for its single-stage translation
> and a paging domain for the host would use s-stage, then it seems we need
> to identify the VFIO domains for their special treatment.
This is the wrong thinking entirely. There is no such thing as a "VFIO
domain".
Default VFIO created domains should act exactly the same as a DMA API
domain.
If you want your system to have irq remapping, then it should be on by
default and DMA API gets remapping too. There would need to be a very
strong reason not to do that in order to make something special for
riscv. If so you'd need to add some kind of flag to select it.
Until you reach nested translation there is no "need" for VFIO to use
any particular stage. The design is that default VFIO uses the same
stage as the DMA API because it is doing the same basic default
translation function.
Nested translation has a control to select the stage, and you can
then force the g-stage for VFIO users at that point.
Regardless, you must not use UNMANAGED as some indication of VFIO,
that is not what it means, that is not what it is for.
> Is there an example of converting VFIO special casing in other
> drivers to something cleaner that you can point me at?
Nobody has had an issue where they want interrupt remapping on/off
depending on VFIO. I think that is inherently wrong.
> The IRQ domain will only be useful for device assignment, as that's when
> an MSI translation will be needed. I can't think of any problems that
> could arise from only creating the IRQ domain when probing assigned
> devices, but I could certainly be missing something. Do you have some
> potential problems in mind?
I'm not an expert in the interrupt subsystem, but my understanding was
we expect the interrupt domains/etc to be static once a device driver
is probed. Changing things during iommu domain attach is after drivers
are probed. I don't really expect it to work correctly in all corner
cases.
VFIO is allowed to change the translation as it operates and we expect
that interrupts are not disturbed.
Jason
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [RFC PATCH 08/15] iommu/riscv: Add IRQ domain for interrupt remapping
2024-11-19 14:00 ` Jason Gunthorpe
@ 2024-11-19 15:03 ` Andrew Jones
2024-11-19 15:36 ` Jason Gunthorpe
0 siblings, 1 reply; 36+ messages in thread
From: Andrew Jones @ 2024-11-19 15:03 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel, tjeznach,
zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
On November 19, 2024 3:00:47 PM GMT+01:00, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>On Tue, Nov 19, 2024 at 08:49:37AM +0100, Andrew Jones wrote:
>> On Mon, Nov 18, 2024 at 02:43:36PM -0400, Jason Gunthorpe wrote:
>> > On Thu, Nov 14, 2024 at 05:18:53PM +0100, Andrew Jones wrote:
>> > > @@ -1276,10 +1279,30 @@ static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
>> > > struct riscv_iommu_device *iommu = dev_to_iommu(dev);
>> > > struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
>> > > struct riscv_iommu_dc dc = {0};
>> > > + int ret;
>> > >
>> > > if (!riscv_iommu_pt_supported(iommu, domain->pgd_mode))
>> > > return -ENODEV;
>> > >
>> > > + if (riscv_iommu_bond_link(domain, dev))
>> > > + return -ENOMEM;
>> > > +
>> > > + if (iommu_domain->type == IOMMU_DOMAIN_UNMANAGED) {
>> >
>> > Drivers should not be making tests like this.
>> >
>> > > + domain->gscid = ida_alloc_range(&riscv_iommu_gscids, 1,
>> > > + RISCV_IOMMU_MAX_GSCID, GFP_KERNEL);
>> > > + if (domain->gscid < 0) {
>> > > + riscv_iommu_bond_unlink(domain, dev);
>> > > + return -ENOMEM;
>> > > + }
>> > > +
>> > > + ret = riscv_iommu_irq_domain_create(domain, dev);
>> > > + if (ret) {
>> > > + riscv_iommu_bond_unlink(domain, dev);
>> > > + ida_free(&riscv_iommu_gscids, domain->gscid);
>> > > + return ret;
>> > > + }
>> > > + }
>> >
>> > What are you trying to do? Make something behave differently for VFIO?
>> > That isn't OK, we are trying to remove all the hacky VFIO special
>> > cases in drivers.
>> >
>> > What is the HW issue here? It is very very strange (and probably not
>> > going to work right) that the irq domains change when domain
>> > attachment changes.
>> >
>> > The IRQ setup should really be fixed before any device drivers probe
>> > onto the device.
>>
>> I can't disagree with the statement that this looks hacky, but considering
>> a VFIO domain needs to use the g-stage for its single-stage translation
>> and a paging domain for the host would use s-stage, then it seems we need
>> to identify the VFIO domains for their special treatment.
>
>This is the wrong thinking entirely. There is no such thing as a "VFIO
>domain".
>
>Default VFIO created domains should act exactly the same as a DMA API
>domain.
>
>If you want your system to have irq remapping, then it should be on by
>default and DMA API gets remapping too. There would need to be a very
>strong reason not to do that in order to make something special for
>riscv. If so you'd need to add some kind of flag to select it.
>
>Until you reach nested translation there is no "need" for VFIO to use
>any particular stage. The design is that default VFIO uses the same
>stage as the DMA API because it is doing the same basic default
>translation function.
The RISC-V IOMMU needs to use g-stage for device assignment, if we also want to enable irqbypass, because the IOMMU is specified to only look at the MSI table when g-stage is in use. This is actually another reason the irq domain only makes sense for device assignment.
>
>Nested translation has a control to select the stage, and you can
>then force the g-stage for VFIO users at that point.
We could force riscv device assignment to always be nested; when not providing an iommu to the guest, translation would still be single-stage, just using the g-stage. But I don't think that's currently possible with VFIO, is it?
>
>Regardless, you must not use UNMANAGED as some indication of VFIO,
>that is not what it means, that is not what it is for.
>
>> Is there an example of converting VFIO special casing in other
>> drivers to something cleaner that you can point me at?
>
>Nobody has had an issue where they want interrupt remapping on/off
>depending on VFIO. I think that is inherently wrong.
>
>> The IRQ domain will only be useful for device assignment, as that's when
>> an MSI translation will be needed. I can't think of any problems that
>> could arise from only creating the IRQ domain when probing assigned
>> devices, but I could certainly be missing something. Do you have some
>> potential problems in mind?
>
>I'm not an expert in the interrupt subsystem, but my understanding was
>we expect the interrupt domains/etc to be static once a device driver
>is probed. Changing things during iommu domain attach is after drivers
>are probed. I don't really expect it to work correctly in all corner
>cases.
With VFIO the iommu domain attach comes after an unbind/bind, so the new driver is probed. I think that's a safe time. However, if there could be cases where the attach does not follow an unbind/bind, then I agree that wouldn't be safe. I'll consider always creating an IRQ domain, even if it won't provide any additional functionality unless the device is assigned.
>
>VFIO is allowed to change the translation as it operates and we expect
>that interrupts are not disturbed.
>
The IRQ domain stays the same during operation, the only changes are the mappings from what the guest believes are its s-mode interrupt files to the hypervisor selected guest interrupt files, and these changes are made possible by the IRQ domain's vcpu-affinity support.
Thanks,
drew
* Re: [RFC PATCH 08/15] iommu/riscv: Add IRQ domain for interrupt remapping
2024-11-19 15:03 ` Andrew Jones
@ 2024-11-19 15:36 ` Jason Gunthorpe
2024-11-22 15:11 ` Andrew Jones
0 siblings, 1 reply; 36+ messages in thread
From: Jason Gunthorpe @ 2024-11-19 15:36 UTC (permalink / raw)
To: Andrew Jones
Cc: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel, tjeznach,
zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
On Tue, Nov 19, 2024 at 04:03:05PM +0100, Andrew Jones wrote:
> >This is the wrong thinking entirely. There is no such thing as a "VFIO
> >domain".
> >
> >Default VFIO created domains should act exactly the same as a DMA API
> >domain.
> >
> >If you want your system to have irq remapping, then it should be on by
> >default and DMA API gets remapping too. There would need to be a very
> >strong reason not to do that in order to make something special for
> >riscv. If so you'd need to add some kind of flag to select it.
> >
> >Until you reach nested translation there is no "need" for VFIO to use
> >any particular stage. The design is that default VFIO uses the same
> >stage as the DMA API because it is doing the same basic default
> >translation function.
>
> The RISC-V IOMMU needs to use g-stage for device assignment, if we
> also want to enable irqbypass, because the IOMMU is specified to
> only look at the MSI table when g-stage is in use. This is actually
> another reason the irq domain only makes sense for device
> assignment.
Most HW has interrupt remapping that can be enabled, and typically
Linux just turns it on unconditionally.
Is there a reason the DMA API shouldn't use this translation mode too?
That seems to be the main issue here, you are trying to avoid
interrupt remapping for DMA API and use it only for VFIO, and haven't
explained why we need such complexity. Just use it always?
> >Nested translation has a control to select the stage, and you can
> >then force the g-stage for VFIO users at that point.
>
> We could force riscv device assignment to always be nested, and when
> not providing an iommu to the guest, it will still be single-stage,
> but g-stage, but I don't think that's currently possible with VFIO,
> is it?
That isn't what I mean. I mean you should not be forcing the kind of
domain being created until you get to special cases like nested.
Default VFIO should work the same as the DMA API.
> >> The IRQ domain will only be useful for device assignment, as that's when
> >> an MSI translation will be needed. I can't think of any problems that
> >> could arise from only creating the IRQ domain when probing assigned
> >> devices, but I could certainly be missing something. Do you have some
> >> potential problems in mind?
> >
> >I'm not an expert in the interrupt subsystem, but my understanding was
> >we expect the interrupt domains/etc to be static once a device driver
> >is probed. Changing things during iommu domain attach is after drivers
> >are probed. I don't really expect it to work correctly in all corner
> >cases.
>
> With VFIO the iommu domain attach comes after an unbind/bind, so the
> new driver is probed.
That's the opposite of what I mean. The irq domain should be set up
*before* VFIO binds to the driver.
Changing the active irq_domain while VFIO is already probed to the
device is highly unlikely to work right in all cases.
> I think that's a safe time. However, if there
> could be cases where the attach does not follow an unbind/bind, then
> I agree that wouldn't be safe.
These cases exist.
> I'll consider always creating an IRQ
> domain, even if it won't provide any additional functionality unless
> the device is assigned.
That isn't ideal; the translation under the IRQs shouldn't really be
changing as the translation under the IOMMU changes.
Further, VFIO assumes iommu_group_has_isolated_msi(), i.e.
IRQ_DOMAIN_FLAG_ISOLATED_MSI, is fixed while it is bound. Will that
be true if the iommu is flapping all about? What will you do when VFIO
has it attached to a blocked domain?
It just doesn't make sense to change something so fundamental as the
interrupt path on an iommu domain attachment. :\
> >VFIO is allowed to change the translation as it operates and we expect
> >that interrupts are not disturbed.
>
> The IRQ domain stays the same during operation, the only changes are
> the mappings from what the guest believes are its s-mode interrupt
> files to the hypervisor selected guest interrupt files, and these
> changes are made possible by the IRQ domain's vcpu-affinity support.
That is only the case when talking about kvm; this all still has to
work fully for non-kvm VFIO use cases too.
Jason
* Re: [RFC PATCH 08/15] iommu/riscv: Add IRQ domain for interrupt remapping
2024-11-19 15:36 ` Jason Gunthorpe
@ 2024-11-22 15:11 ` Andrew Jones
2024-11-22 15:33 ` Jason Gunthorpe
0 siblings, 1 reply; 36+ messages in thread
From: Andrew Jones @ 2024-11-22 15:11 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel, tjeznach,
zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
On Tue, Nov 19, 2024 at 11:36:22AM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 19, 2024 at 04:03:05PM +0100, Andrew Jones wrote:
>
> > >This is the wrong thinking entirely. There is no such thing as a "VFIO
> > >domain".
> > >
> > >Default VFIO created domains should act exactly the same as a DMA API
> > >domain.
> > >
> > >If you want your system to have irq remapping, then it should be on by
> > >default and DMA API gets remapping too. There would need to be a very
> > >strong reason not to do that in order to make something special for
> > >riscv. If so you'd need to add some kind of flag to select it.
> > >
> > >Until you reach nested translation there is no "need" for VFIO to use
> > >any particular stage. The design is that default VFIO uses the same
> > >stage as the DMA API because it is doing the same basic default
> > >translation function.
> >
> > The RISC-V IOMMU needs to use g-stage for device assignment, if we
> > also want to enable irqbypass, because the IOMMU is specified to
> > only look at the MSI table when g-stage is in use. This is actually
> > another reason the irq domain only makes sense for device
> > assignment.
>
> Most HW has interrupt remapping that can be enabled, and typically
> Linux just turns it always on.
>
> Is there a reason the DMA API shouldn't use this translation mode too?
> That seems to be the main issue here, you are trying to avoid
> interrupt remapping for DMA API and use it only for VFIO, and haven't
> explained why we need such complexity. Just use it always?
The reason is that the RISC-V IOMMU only checks the MSI table, i.e.
enables its support for MSI remapping, when the g-stage (second-stage)
page table is in use. However, the expected virtual memory scheme for an
OS to use for DMA would be to have s-stage (first-stage) in use and the
g-stage set to 'Bare' (not in use). IOW, it doesn't appear the spec
authors expected MSI remapping to be enabled for the host DMA use case.
That does make some sense, since it's actually not necessary. For the
host DMA use case, providing mappings for each s-mode interrupt file
which the device is allowed to write to in the s-stage page table
sufficiently enables MSIs to be delivered.
>
> > >Nested translation has a control to select the stage, and you can
> > >then force the g-stage for VFIO users at that point.
> >
> > We could force riscv device assignment to always be nested, and when
> > not providing an iommu to the guest, it will still be single-stage,
> > but g-stage, but I don't think that's currently possible with VFIO,
> > is it?
>
> That isn't what I mean, I mean you should not be forcing the kind of
> domain being created until you get to special cases like nested.
>
> Default VFIO should work the same as the DMA API.
If "default VFIO" means VFIO without irqbypass, then it would work the
same as the DMA API, assuming all mappings for all necessary s-mode
interrupt files are created (something the DMA API needs as well).
However, VFIO would also need 'vfio_iommu_type1.allow_unsafe_interrupts=1'
to be set for this no-irqbypass configuration.
>
> > >> The IRQ domain will only be useful for device assignment, as that's when
> > >> an MSI translation will be needed. I can't think of any problems that
> > >> could arise from only creating the IRQ domain when probing assigned
> > >> devices, but I could certainly be missing something. Do you have some
> > >> potential problems in mind?
> > >
> > >I'm not an expert in the interrupt subsystem, but my understanding was
> > >we expect the interrupt domains/etc to be static once a device driver
> > >is probed. Changing things during iommu domain attach is after drivers
> > >are probed. I don't really expect it to work correctly in all corner
> > >cases.
> >
> > With VFIO the iommu domain attach comes after an unbind/bind, so the
> > new driver is probed.
>
> That's the opposite of what I mean. The irq domain should be setup
> *before* VFIO binds to the driver.
>
> Changing the active irq_domain while VFIO is already probed to the
> device is highly unlikely to work right in all cases.
>
> > I think that's a safe time. However, if there
> > could be cases where the attach does not follow an unbind/bind, then
> > I agree that wouldn't be safe.
>
> These cases exist.
>
> > I'll consider always creating an IRQ
> > domain, even if it won't provide any additional functionality unless
> > the device is assigned.
>
> That isn't ideal, the translation under the IRQs shouldn't really be
> changing as the translation under the IOMMU changes.
Unless the device is assigned to a guest, then the IRQ domain wouldn't
do anything at all (it'd just sit between the device and the device's
old MSI parent domain), but it also wouldn't come and go, risking issues
with anything sensitive to changes in the IRQ domain hierarchy.
>
> Further, VFIO assumes iommu_group_has_isolated_msi(), ie
> IRQ_DOMAIN_FLAG_ISOLATED_MSI, is fixed while it is bound. Will that
> be true if the iommu is flapping all about? What will you do when VFIO
> has it attached to a blocked domain?
>
> It just doesn't make sense to change something so fundamental as the
> interrupt path on an iommu domain attachment. :\
Yes, it does appear I should be doing this at iommu device probe time
instead. It won't provide any additional functionality to use cases which
aren't assigning devices to guests, but it also won't hurt, and it should
avoid the risks you point out.
>
> > >VFIO is allowed to change the translation as it operates and we expect
> > >that interrupts are not disturbed.
> >
> > The IRQ domain stays the same during operation, the only changes are
> > the mappings from what the guest believes are its s-mode interrupt
> > files to the hypervisor selected guest interrupt files, and these
> > changes are made possible by the IRQ domain's vcpu-affinity support.
>
> That is only the case when talking about kvm, this all still has to
> work fully for non-kvm VFIO use cases too.
>
I'll rework the irq domain creation to be at iommu device probe time for
the next version.
Thanks,
drew
* Re: [RFC PATCH 08/15] iommu/riscv: Add IRQ domain for interrupt remapping
2024-11-22 15:11 ` Andrew Jones
@ 2024-11-22 15:33 ` Jason Gunthorpe
2024-11-22 17:07 ` Andrew Jones
0 siblings, 1 reply; 36+ messages in thread
From: Jason Gunthorpe @ 2024-11-22 15:33 UTC (permalink / raw)
To: Andrew Jones
Cc: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel, tjeznach,
zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
On Fri, Nov 22, 2024 at 04:11:36PM +0100, Andrew Jones wrote:
> The reason is that the RISC-V IOMMU only checks the MSI table, i.e.
> enables its support for MSI remapping, when the g-stage (second-stage)
> page table is in use. However, the expected virtual memory scheme for an
> OS to use for DMA would be to have s-stage (first-stage) in use and the
> g-stage set to 'Bare' (not in use).
That isn't really a technical reason.
> IOW, it doesn't appear the spec authors expected MSI remapping to
> be enabled for the host DMA use case. That does make some sense,
> since it's actually not necessary. For the host DMA use case,
> providing mappings for each s-mode interrupt file which the device
> is allowed to write to in the s-stage page table sufficiently
> enables MSIs to be delivered.
Well, that seems to be the main problem here. You are grappling with a
spec design that doesn't match the SW expectations. Since it has
deviated from what everyone else has done you now have extra
challenges to resolve in some way.
Just always using interrupt remapping if the HW is capable of
interrupt remapping and ignoring the spec "expectation" is a nice a
simple way to make things work with existing Linux.
> If "default VFIO" means VFIO without irqbypass, then it would work the
> same as the DMA API, assuming all mappings for all necessary s-mode
> interrupt files are created (something the DMA API needs as well).
> However, VFIO would also need 'vfio_iommu_type1.allow_unsafe_interrupts=1'
> to be set for this no-irqbypass configuration.
Which isn't what anyone wants, you need to make the DMA API domain be
fully functional so that VFIO works.
> > That isn't ideal, the translation under the IRQs shouldn't really be
> > changing as the translation under the IOMMU changes.
>
> Unless the device is assigned to a guest, then the IRQ domain wouldn't
> do anything at all (it'd just sit between the device and the device's
> old MSI parent domain), but it also wouldn't come and go, risking issues
> with anything sensitive to changes in the IRQ domain hierarchy.
VFIO isn't restricted to such a simple use model. You have to support
all the generality, which includes fully supporting changing the iommu
translation on the fly.
> > Further, VFIO assumes iommu_group_has_isolated_msi(), ie
> > IRQ_DOMAIN_FLAG_ISOLATED_MSI, is fixed while it is bound. Will that
> > be true if the iommu is flapping all about? What will you do when VFIO
> > has it attached to a blocked domain?
> >
> > It just doesn't make sense to change something so fundamental as the
> > interrupt path on an iommu domain attachment. :\
>
> Yes, it does appear I should be doing this at iommu device probe time
> instead. It won't provide any additional functionality to use cases which
> aren't assigning devices to guests, but it also won't hurt, and it should
> avoid the risks you point out.
Even if you statically create the domain you can't change the value of
IRQ_DOMAIN_FLAG_ISOLATED_MSI depending on what is currently attached
to the IOMMU.
What you are trying to do is not supported by the software stack right
now. You need to make much bigger, more intrusive changes, if you
really want to make interrupt remapping dynamic.
Jason
* Re: [RFC PATCH 08/15] iommu/riscv: Add IRQ domain for interrupt remapping
2024-11-22 15:33 ` Jason Gunthorpe
@ 2024-11-22 17:07 ` Andrew Jones
2024-11-25 15:07 ` Jason Gunthorpe
0 siblings, 1 reply; 36+ messages in thread
From: Andrew Jones @ 2024-11-22 17:07 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel, tjeznach,
zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
On Fri, Nov 22, 2024 at 11:33:40AM -0400, Jason Gunthorpe wrote:
> On Fri, Nov 22, 2024 at 04:11:36PM +0100, Andrew Jones wrote:
>
> > The reason is that the RISC-V IOMMU only checks the MSI table, i.e.
> > enables its support for MSI remapping, when the g-stage (second-stage)
> > page table is in use. However, the expected virtual memory scheme for an
> > OS to use for DMA would be to have s-stage (first-stage) in use and the
> > g-stage set to 'Bare' (not in use).
>
> That isn't really a technical reason.
>
> > IOW, it doesn't appear the spec authors expected MSI remapping to
> > be enabled for the host DMA use case. That does make some sense,
> > since it's actually not necessary. For the host DMA use case,
> > providing mappings for each s-mode interrupt file which the device
> > is allowed to write to in the s-stage page table sufficiently
> > enables MSIs to be delivered.
>
> Well, that seems to be the main problem here. You are grappling with a
> spec design that doesn't match the SW expectations. Since it has
> deviated from what everyone else has done you now have extra
> challenges to resolve in some way.
>
> Just always using interrupt remapping if the HW is capable of
> interrupt remapping, and ignoring the spec "expectation", is a nice and
> simple way to make things work with existing Linux.
>
> > If "default VFIO" means VFIO without irqbypass, then it would work the
> > same as the DMA API, assuming all mappings for all necessary s-mode
> > interrupt files are created (something the DMA API needs as well).
> > However, VFIO would also need 'vfio_iommu_type1.allow_unsafe_interrupts=1'
> > to be set for this no-irqbypass configuration.
>
> Which isn't what anyone wants, you need to make the DMA API domain be
> fully functional so that VFIO works.
>
> > > That isn't ideal, the translation under the IRQs shouldn't really be
> > > changing as the translation under the IOMMU changes.
> >
> > Unless the device is assigned to a guest, then the IRQ domain wouldn't
> > do anything at all (it'd just sit between the device and the device's
> > old MSI parent domain), but it also wouldn't come and go, risking issues
> > with anything sensitive to changes in the IRQ domain hierarchy.
>
> VFIO isn't restricted to such a simple use model. You have to support
> all the generality, which includes fully supporting changing the iommu
> translation on the fly.
>
> > > Further, VFIO assumes iommu_group_has_isolated_msi(), ie
> > > IRQ_DOMAIN_FLAG_ISOLATED_MSI, is fixed while it is bound. Will that
> > > be true if the iommu is flapping all about? What will you do when VFIO
> > > has it attached to a blocked domain?
> > >
> > > It just doesn't make sense to change something so fundamental as the
> > > interrupt path on an iommu domain attachment. :\
> >
> > Yes, it does appear I should be doing this at iommu device probe time
> > instead. It won't provide any additional functionality to use cases which
> > aren't assigning devices to guests, but it also won't hurt, and it should
> > avoid the risks you point out.
>
> Even if you statically create the domain you can't change the value of
> IRQ_DOMAIN_FLAG_ISOLATED_MSI depending on what is currently attached
> to the IOMMU.
>
> What you are trying to do is not supported by the software stack right
> now. You need to make much bigger, more intrusive changes, if you
> really want to make interrupt remapping dynamic.
>
Let the fun begin. I'll look into this more. It also looks like I need to
collect some test cases to ensure I can support all use cases with
whatever I propose next. Pointers for those would be welcome.
Thanks,
drew
* Re: [RFC PATCH 08/15] iommu/riscv: Add IRQ domain for interrupt remapping
2024-11-22 17:07 ` Andrew Jones
@ 2024-11-25 15:07 ` Jason Gunthorpe
0 siblings, 0 replies; 36+ messages in thread
From: Jason Gunthorpe @ 2024-11-25 15:07 UTC (permalink / raw)
To: Andrew Jones
Cc: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel, tjeznach,
zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
On Fri, Nov 22, 2024 at 06:07:59PM +0100, Andrew Jones wrote:
> > What you are trying to do is not supported by the software stack right
> > now. You need to make much bigger, more intrusive changes, if you
> > really want to make interrupt remapping dynamic.
> >
>
> Let the fun begin. I'll look into this more. It also looks like I need to
> collect some test cases to ensure I can support all use cases with
> whatever I propose next. Pointers for those would be welcome.
Sorry, I don't really have anything. But iommufd allows changing the
translation at will and we expect this to happen in normal VMM
scenarios. So blocking, paging, nesting are all expected to be
dynamically selectable and non-disruptive to interrupts.
So, you can't decide if remapping is enabled or not for a device based
only on the domain attachment.
I think you'd need to create a way for VFIO to request dynamic
interrupt remapping be enabled for the device very, very early in its
process and that would remain fixed while VFIO is using the device.
The dynamic state of interrupt remapping would constrain what iommu
attachment configurations are permitted.
Jason
* [RFC PATCH 09/15] RISC-V: KVM: Enable KVM_VFIO interfaces on RISC-V arch
2024-11-14 16:18 [RFC PATCH 00/15] iommu/riscv: Add irqbypass support Andrew Jones
` (7 preceding siblings ...)
2024-11-14 16:18 ` [RFC PATCH 08/15] iommu/riscv: Add IRQ domain for interrupt remapping Andrew Jones
@ 2024-11-14 16:18 ` Andrew Jones
2024-11-14 16:18 ` [RFC PATCH 10/15] RISC-V: KVM: Add irqbypass skeleton Andrew Jones
` (5 subsequent siblings)
14 siblings, 0 replies; 36+ messages in thread
From: Andrew Jones @ 2024-11-14 16:18 UTC (permalink / raw)
To: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel
Cc: tjeznach, zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
From: Tomasz Jeznach <tjeznach@rivosinc.com>
Enable KVM/VFIO support on RISC-V architecture.
Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
---
arch/riscv/kvm/Kconfig | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/riscv/kvm/Kconfig b/arch/riscv/kvm/Kconfig
index 0c3cbb0915ff..333d95da8ebe 100644
--- a/arch/riscv/kvm/Kconfig
+++ b/arch/riscv/kvm/Kconfig
@@ -29,10 +29,12 @@ config KVM
select KVM_GENERIC_DIRTYLOG_READ_PROTECT
select KVM_GENERIC_HARDWARE_ENABLING
select KVM_MMIO
+ select KVM_VFIO
select KVM_XFER_TO_GUEST_WORK
select KVM_GENERIC_MMU_NOTIFIER
select SCHED_INFO
select GUEST_PERF_EVENTS if PERF_EVENTS
+ select SRCU
help
Support hosting virtualized guest machines.
--
2.47.0
* [RFC PATCH 10/15] RISC-V: KVM: Add irqbypass skeleton
2024-11-14 16:18 [RFC PATCH 00/15] iommu/riscv: Add irqbypass support Andrew Jones
` (8 preceding siblings ...)
2024-11-14 16:18 ` [RFC PATCH 09/15] RISC-V: KVM: Enable KVM_VFIO interfaces on RISC-V arch Andrew Jones
@ 2024-11-14 16:18 ` Andrew Jones
2024-11-14 16:18 ` [RFC PATCH 11/15] RISC-V: Define irqbypass vcpu_info Andrew Jones
` (4 subsequent siblings)
14 siblings, 0 replies; 36+ messages in thread
From: Andrew Jones @ 2024-11-14 16:18 UTC (permalink / raw)
To: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel
Cc: tjeznach, zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
Add all the functions needed to wire up irqbypass support.
kvm_arch_update_irqfd_routing() is left unimplemented, so return
false from kvm_arch_has_irq_bypass().
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
---
arch/riscv/include/asm/kvm_host.h | 3 ++
arch/riscv/kvm/Kconfig | 1 +
arch/riscv/kvm/aia_imsic.c | 6 ++++
arch/riscv/kvm/vm.c | 60 +++++++++++++++++++++++++++++++
4 files changed, 70 insertions(+)
diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 35eab6e0f4ae..fef7422939f6 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -114,6 +114,9 @@ struct kvm_arch {
/* AIA Guest/VM context */
struct kvm_aia aia;
+
+#define __KVM_HAVE_ARCH_ASSIGNED_DEVICE
+ atomic_t assigned_device_count;
};
struct kvm_cpu_trap {
diff --git a/arch/riscv/kvm/Kconfig b/arch/riscv/kvm/Kconfig
index 333d95da8ebe..9a4feb1e671d 100644
--- a/arch/riscv/kvm/Kconfig
+++ b/arch/riscv/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
tristate "Kernel-based Virtual Machine (KVM) support (EXPERIMENTAL)"
depends on RISCV_SBI && MMU
select HAVE_KVM_IRQCHIP
+ select HAVE_KVM_IRQ_BYPASS
select HAVE_KVM_IRQ_ROUTING
select HAVE_KVM_MSI
select HAVE_KVM_VCPU_ASYNC_IOCTL
diff --git a/arch/riscv/kvm/aia_imsic.c b/arch/riscv/kvm/aia_imsic.c
index a8085cd8215e..64b1f3713dd5 100644
--- a/arch/riscv/kvm/aia_imsic.c
+++ b/arch/riscv/kvm/aia_imsic.c
@@ -727,6 +727,12 @@ void kvm_riscv_vcpu_aia_imsic_release(struct kvm_vcpu *vcpu)
kvm_riscv_aia_free_hgei(old_vsfile_cpu, old_vsfile_hgei);
}
+int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
+ uint32_t guest_irq, bool set)
+{
+ return -ENXIO;
+}
+
int kvm_riscv_vcpu_aia_imsic_update(struct kvm_vcpu *vcpu)
{
unsigned long flags;
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index 7396b8654f45..9c5837518c1a 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -11,6 +11,9 @@
#include <linux/module.h>
#include <linux/uaccess.h>
#include <linux/kvm_host.h>
+#include <linux/kvm_irqfd.h>
+
+#include <asm/kvm_aia.h>
const struct _kvm_stats_desc kvm_vm_stats_desc[] = {
KVM_GENERIC_VM_STATS()
@@ -55,6 +58,63 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
kvm_riscv_aia_destroy_vm(kvm);
}
+void kvm_arch_start_assignment(struct kvm *kvm)
+{
+ atomic_inc(&kvm->arch.assigned_device_count);
+}
+EXPORT_SYMBOL_GPL(kvm_arch_start_assignment);
+
+void kvm_arch_end_assignment(struct kvm *kvm)
+{
+ atomic_dec(&kvm->arch.assigned_device_count);
+}
+EXPORT_SYMBOL_GPL(kvm_arch_end_assignment);
+
+bool noinstr kvm_arch_has_assigned_device(struct kvm *kvm)
+{
+ return arch_atomic_read(&kvm->arch.assigned_device_count);
+}
+EXPORT_SYMBOL_GPL(kvm_arch_has_assigned_device);
+
+bool kvm_arch_has_irq_bypass(void)
+{
+ return false;
+}
+
+int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
+ struct irq_bypass_producer *prod)
+{
+ struct kvm_kernel_irqfd *irqfd =
+ container_of(cons, struct kvm_kernel_irqfd, consumer);
+ int ret;
+
+ irqfd->producer = prod;
+ kvm_arch_start_assignment(irqfd->kvm);
+ ret = kvm_arch_update_irqfd_routing(irqfd->kvm, prod->irq, irqfd->gsi, true);
+ if (ret)
+ kvm_arch_end_assignment(irqfd->kvm);
+
+ return ret;
+}
+
+void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
+ struct irq_bypass_producer *prod)
+{
+ struct kvm_kernel_irqfd *irqfd =
+ container_of(cons, struct kvm_kernel_irqfd, consumer);
+ int ret;
+
+ WARN_ON(irqfd->producer != prod);
+ irqfd->producer = NULL;
+
+ ret = kvm_arch_update_irqfd_routing(irqfd->kvm, prod->irq, irqfd->gsi, false);
+ if (ret)
+ pr_info("irq bypass consumer (token %p) unregistration fails: %d\n",
+ irqfd->consumer.token, ret);
+
+ kvm_arch_end_assignment(irqfd->kvm);
+}
+
int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irql,
bool line_status)
{
--
2.47.0
* [RFC PATCH 11/15] RISC-V: Define irqbypass vcpu_info
2024-11-14 16:18 [RFC PATCH 00/15] iommu/riscv: Add irqbypass support Andrew Jones
` (9 preceding siblings ...)
2024-11-14 16:18 ` [RFC PATCH 10/15] RISC-V: KVM: Add irqbypass skeleton Andrew Jones
@ 2024-11-14 16:18 ` Andrew Jones
2024-11-14 16:18 ` [RFC PATCH 12/15] iommu/riscv: Add guest file irqbypass support Andrew Jones
` (3 subsequent siblings)
14 siblings, 0 replies; 36+ messages in thread
From: Andrew Jones @ 2024-11-14 16:18 UTC (permalink / raw)
To: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel
Cc: tjeznach, zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
The vcpu_info parameter to irq_set_vcpu_affinity() effectively
defines an arch specific IOMMU <=> hypervisor protocol. Provide
a definition for the RISCV IOMMU.
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
---
arch/riscv/include/asm/irq.h | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/riscv/include/asm/irq.h b/arch/riscv/include/asm/irq.h
index 7b038f3b7cb0..8588667cbb5f 100644
--- a/arch/riscv/include/asm/irq.h
+++ b/arch/riscv/include/asm/irq.h
@@ -23,6 +23,15 @@ void riscv_set_intc_hwnode_fn(struct fwnode_handle *(*fn)(void));
struct fwnode_handle *riscv_get_intc_hwnode(void);
+struct riscv_iommu_vcpu_info {
+ u64 msi_addr_pattern;
+ u64 msi_addr_mask;
+ u32 group_index_bits;
+ u32 group_index_shift;
+ u64 gpa;
+ u64 hpa;
+};
+
#ifdef CONFIG_ACPI
enum riscv_irqchip_type {
--
2.47.0
* [RFC PATCH 12/15] iommu/riscv: Add guest file irqbypass support
2024-11-14 16:18 [RFC PATCH 00/15] iommu/riscv: Add irqbypass support Andrew Jones
` (10 preceding siblings ...)
2024-11-14 16:18 ` [RFC PATCH 11/15] RISC-V: Define irqbypass vcpu_info Andrew Jones
@ 2024-11-14 16:18 ` Andrew Jones
2024-11-14 16:18 ` [RFC PATCH 13/15] RISC-V: KVM: " Andrew Jones
` (2 subsequent siblings)
14 siblings, 0 replies; 36+ messages in thread
From: Andrew Jones @ 2024-11-14 16:18 UTC (permalink / raw)
To: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel
Cc: tjeznach, zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
Implement irq_set_vcpu_affinity() in the RISCV IOMMU driver.
irq_set_vcpu_affinity() is the channel from a hypervisor to the
IOMMU needed to ensure that assigned devices which direct MSIs to
guest IMSIC addresses will have those MSI writes redirected to
their corresponding guest interrupt files.
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
---
drivers/iommu/riscv/iommu-ir.c | 151 +++++++++++++++++++++++++++++++++
drivers/iommu/riscv/iommu.c | 4 +-
drivers/iommu/riscv/iommu.h | 3 +
3 files changed, 156 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/riscv/iommu-ir.c b/drivers/iommu/riscv/iommu-ir.c
index c177e064b205..958270450ec1 100644
--- a/drivers/iommu/riscv/iommu-ir.c
+++ b/drivers/iommu/riscv/iommu-ir.c
@@ -7,6 +7,8 @@
#include <linux/irqdomain.h>
#include <linux/msi.h>
+#include <asm/irq.h>
+
#include "../iommu-pages.h"
#include "iommu.h"
@@ -52,10 +54,159 @@ static size_t riscv_iommu_ir_nr_msiptes(struct riscv_iommu_domain *domain)
return max_idx + 1;
}
+static void riscv_iommu_ir_msitbl_inval(struct riscv_iommu_domain *domain,
+ struct riscv_iommu_msipte *pte)
+{
+ struct riscv_iommu_bond *bond;
+ struct riscv_iommu_device *iommu, *prev;
+ struct riscv_iommu_command cmd;
+ u64 addr;
+
+ addr = pfn_to_phys(FIELD_GET(RISCV_IOMMU_MSIPTE_PPN, pte->pte));
+ riscv_iommu_cmd_inval_gvma(&cmd);
+ riscv_iommu_cmd_inval_set_addr(&cmd, addr);
+
+ /* Like riscv_iommu_iotlb_inval(), synchronize with riscv_iommu_bond_link() */
+ smp_mb();
+
+ rcu_read_lock();
+
+ prev = NULL;
+ list_for_each_entry_rcu(bond, &domain->bonds, list) {
+ iommu = dev_to_iommu(bond->dev);
+
+ if (iommu == prev)
+ continue;
+
+ riscv_iommu_cmd_send(iommu, &cmd);
+ riscv_iommu_cmd_sync(iommu, RISCV_IOMMU_IOTINVAL_TIMEOUT);
+ prev = iommu;
+ }
+
+ rcu_read_unlock();
+}
+
+static void riscv_iommu_ir_msitbl_update(struct riscv_iommu_domain *domain,
+ struct riscv_iommu_msiptp_state *msiptp)
+{
+ struct riscv_iommu_bond *bond;
+ struct riscv_iommu_device *iommu, *prev;
+ struct riscv_iommu_command cmd;
+ struct iommu_fwspec *fwspec;
+ struct riscv_iommu_dc *dc;
+ int i;
+
+ /* Like riscv_iommu_ir_msitbl_inval(), synchronize with riscv_iommu_bond_link() */
+ smp_mb();
+ rcu_read_lock();
+
+ prev = NULL;
+ list_for_each_entry_rcu(bond, &domain->bonds, list) {
+ iommu = dev_to_iommu(bond->dev);
+ fwspec = dev_iommu_fwspec_get(bond->dev);
+
+ for (i = 0; i < fwspec->num_ids; i++) {
+ dc = riscv_iommu_get_dc(iommu, fwspec->ids[i]);
+ WRITE_ONCE(dc->msiptp, msiptp->msiptp);
+ WRITE_ONCE(dc->msi_addr_mask, msiptp->msi_addr_mask);
+ WRITE_ONCE(dc->msi_addr_pattern, msiptp->msi_addr_pattern);
+ }
+
+ dma_wmb();
+
+ if (iommu == prev)
+ continue;
+
+ riscv_iommu_cmd_inval_gvma(&cmd);
+ riscv_iommu_cmd_send(iommu, &cmd);
+ riscv_iommu_cmd_sync(iommu, RISCV_IOMMU_IOTINVAL_TIMEOUT);
+ prev = iommu;
+ }
+
+ rcu_read_unlock();
+}
+
+static int riscv_iommu_ir_msitbl_init(struct riscv_iommu_domain *domain,
+ struct riscv_iommu_vcpu_info *vcpu_info)
+{
+ domain->msiptp.msi_addr_pattern = vcpu_info->msi_addr_pattern;
+ domain->msiptp.msi_addr_mask = vcpu_info->msi_addr_mask;
+ domain->group_index_bits = vcpu_info->group_index_bits;
+ domain->group_index_shift = vcpu_info->group_index_shift;
+
+ if (riscv_iommu_ir_nr_msiptes(domain) * sizeof(*domain->msi_root) > PAGE_SIZE * 2)
+ return -ENOMEM;
+
+ domain->msiptp.msiptp = virt_to_pfn(domain->msi_root) |
+ FIELD_PREP(RISCV_IOMMU_DC_MSIPTP_MODE,
+ RISCV_IOMMU_DC_MSIPTP_MODE_FLAT);
+
+ riscv_iommu_ir_msitbl_update(domain, &domain->msiptp);
+
+ return 0;
+}
+
+static int riscv_iommu_irq_set_vcpu_affinity(struct irq_data *data, void *info)
+{
+ struct riscv_iommu_vcpu_info *vcpu_info = info;
+ struct riscv_iommu_domain *domain = data->domain->host_data;
+ struct riscv_iommu_msipte *pte;
+ int ret = -EINVAL;
+ u64 pteval;
+
+ if (WARN_ON(domain->domain.type != IOMMU_DOMAIN_UNMANAGED))
+ return ret;
+
+ spin_lock(&domain->msi_lock);
+
+ if (!domain->msiptp.msiptp) {
+ if (WARN_ON(!vcpu_info))
+ goto out_unlock;
+
+ ret = riscv_iommu_ir_msitbl_init(domain, vcpu_info);
+ if (ret)
+ goto out_unlock;
+ } else if (!vcpu_info) {
+ /*
+ * Nothing to do here since we don't track host_irq <=> msipte mappings
+ * nor reference count the ptes. If we did do that tracking then we would
+ * decrement the reference count of the pte for the host_irq and possibly
+ * clear its valid bit if it was the last one mapped.
+ */
+ ret = 0;
+ goto out_unlock;
+ } else if (WARN_ON(vcpu_info->msi_addr_pattern != domain->msiptp.msi_addr_pattern ||
+ vcpu_info->msi_addr_mask != domain->msiptp.msi_addr_mask ||
+ vcpu_info->group_index_bits != domain->group_index_bits ||
+ vcpu_info->group_index_shift != domain->group_index_shift)) {
+ goto out_unlock;
+ }
+
+ pte = riscv_iommu_ir_get_msipte(domain, vcpu_info->gpa);
+ if (!pte)
+ goto out_unlock;
+
+ pteval = FIELD_PREP(RISCV_IOMMU_MSIPTE_M, 3) |
+ riscv_iommu_phys_to_ppn(vcpu_info->hpa) |
+ FIELD_PREP(RISCV_IOMMU_MSIPTE_V, 1);
+
+ if (pte->pte != pteval) {
+ pte->pte = pteval;
+ riscv_iommu_ir_msitbl_inval(domain, pte);
+ }
+
+ ret = 0;
+
+out_unlock:
+ spin_unlock(&domain->msi_lock);
+ return ret;
+}
+
static struct irq_chip riscv_iommu_irq_chip = {
.name = "IOMMU-IR",
.irq_mask = irq_chip_mask_parent,
.irq_unmask = irq_chip_unmask_parent,
+ .irq_set_vcpu_affinity = riscv_iommu_irq_set_vcpu_affinity,
};
static int riscv_iommu_irq_domain_alloc_irqs(struct irq_domain *irqdomain,
diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index c4a47b21c58f..46ac228ba7ac 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -544,8 +544,8 @@ static irqreturn_t riscv_iommu_fltq_process(int irq, void *data)
}
/* Lookup and initialize device context info structure. */
-static struct riscv_iommu_dc *riscv_iommu_get_dc(struct riscv_iommu_device *iommu,
- unsigned int devid)
+struct riscv_iommu_dc *riscv_iommu_get_dc(struct riscv_iommu_device *iommu,
+ unsigned int devid)
{
const bool base_format = !(iommu->caps & RISCV_IOMMU_CAPABILITIES_MSI_FLAT);
unsigned int depth;
diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
index 6ce71095781c..2ca76edf48f5 100644
--- a/drivers/iommu/riscv/iommu.h
+++ b/drivers/iommu/riscv/iommu.h
@@ -127,6 +127,9 @@ struct riscv_iommu_bond {
int riscv_iommu_init(struct riscv_iommu_device *iommu);
void riscv_iommu_remove(struct riscv_iommu_device *iommu);
+struct riscv_iommu_dc *riscv_iommu_get_dc(struct riscv_iommu_device *iommu,
+ unsigned int devid);
+
void riscv_iommu_cmd_send(struct riscv_iommu_device *iommu,
struct riscv_iommu_command *cmd);
void riscv_iommu_cmd_sync(struct riscv_iommu_device *iommu,
--
2.47.0
* [RFC PATCH 13/15] RISC-V: KVM: Add guest file irqbypass support
2024-11-14 16:18 [RFC PATCH 00/15] iommu/riscv: Add irqbypass support Andrew Jones
` (11 preceding siblings ...)
2024-11-14 16:18 ` [RFC PATCH 12/15] iommu/riscv: Add guest file irqbypass support Andrew Jones
@ 2024-11-14 16:18 ` Andrew Jones
2024-11-14 16:18 ` [RFC PATCH 14/15] vfio: enable IOMMU_TYPE1 for RISC-V Andrew Jones
2024-11-14 16:19 ` [RFC PATCH 15/15] RISC-V: defconfig: Add VFIO modules Andrew Jones
14 siblings, 0 replies; 36+ messages in thread
From: Andrew Jones @ 2024-11-14 16:18 UTC (permalink / raw)
To: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel
Cc: tjeznach, zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
Implement kvm_arch_update_irqfd_routing() which makes
irq_set_vcpu_affinity() calls whenever the assigned device updates
its target addresses and whenever the hypervisor has migrated a
VCPU to another CPU (which requires changing the guest interrupt
file).
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
---
arch/riscv/kvm/aia_imsic.c | 132 ++++++++++++++++++++++++++++++++++++-
arch/riscv/kvm/vm.c | 2 +-
2 files changed, 130 insertions(+), 4 deletions(-)
diff --git a/arch/riscv/kvm/aia_imsic.c b/arch/riscv/kvm/aia_imsic.c
index 64b1f3713dd5..6a7c23e25f79 100644
--- a/arch/riscv/kvm/aia_imsic.c
+++ b/arch/riscv/kvm/aia_imsic.c
@@ -11,11 +11,13 @@
#include <linux/bitmap.h>
#include <linux/irqchip/riscv-imsic.h>
#include <linux/kvm_host.h>
+#include <linux/kvm_irqfd.h>
#include <linux/math.h>
#include <linux/spinlock.h>
#include <linux/swab.h>
#include <kvm/iodev.h>
#include <asm/csr.h>
+#include <asm/irq.h>
#define IMSIC_MAX_EIX (IMSIC_MAX_ID / BITS_PER_TYPE(u64))
@@ -676,6 +678,14 @@ static void imsic_swfile_update(struct kvm_vcpu *vcpu,
imsic_swfile_extirq_update(vcpu);
}
+static u64 kvm_riscv_aia_msi_addr_mask(struct kvm_aia *aia)
+{
+ u64 group_mask = BIT(aia->nr_group_bits) - 1;
+
+ return (group_mask << (aia->nr_group_shift - IMSIC_MMIO_PAGE_SHIFT)) |
+ (BIT(aia->nr_hart_bits + aia->nr_guest_bits) - 1);
+}
+
void kvm_riscv_vcpu_aia_imsic_release(struct kvm_vcpu *vcpu)
{
unsigned long flags;
@@ -730,7 +740,120 @@ void kvm_riscv_vcpu_aia_imsic_release(struct kvm_vcpu *vcpu)
int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
uint32_t guest_irq, bool set)
{
- return -ENXIO;
+ struct irq_data *irqdata = irq_get_irq_data(host_irq);
+ struct kvm_irq_routing_table *irq_rt;
+ struct kvm_vcpu *vcpu;
+ unsigned long tmp, flags;
+ int idx, ret = -ENXIO;
+
+ if (!set)
+ return irq_set_vcpu_affinity(host_irq, NULL);
+
+ idx = srcu_read_lock(&kvm->irq_srcu);
+ irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
+ if (guest_irq >= irq_rt->nr_rt_entries ||
+ hlist_empty(&irq_rt->map[guest_irq])) {
+ pr_warn_once("no route for guest_irq %u/%u (broken user space?)\n",
+ guest_irq, irq_rt->nr_rt_entries);
+ goto out;
+ }
+
+ kvm_for_each_vcpu(tmp, vcpu, kvm) {
+ struct imsic *imsic = vcpu->arch.aia_context.imsic_state;
+ gpa_t ippn = vcpu->arch.aia_context.imsic_addr >> IMSIC_MMIO_PAGE_SHIFT;
+ struct kvm_aia *aia = &kvm->arch.aia;
+ struct kvm_kernel_irq_routing_entry *e;
+
+ hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
+ struct msi_msg msg[2] = {
+ {
+ .address_hi = e->msi.address_hi,
+ .address_lo = e->msi.address_lo,
+ .data = e->msi.data,
+ },
+ };
+ struct riscv_iommu_vcpu_info vcpu_info = {
+ .msi_addr_mask = kvm_riscv_aia_msi_addr_mask(aia),
+ .group_index_bits = aia->nr_group_bits,
+ .group_index_shift = aia->nr_group_shift,
+ };
+ gpa_t target, tppn;
+
+ if (e->type != KVM_IRQ_ROUTING_MSI)
+ continue;
+
+ target = ((gpa_t)e->msi.address_hi << 32) | e->msi.address_lo;
+ tppn = target >> IMSIC_MMIO_PAGE_SHIFT;
+
+ WARN_ON(target & (IMSIC_MMIO_PAGE_SZ - 1));
+
+ if (ippn != tppn)
+ continue;
+
+ vcpu_info.msi_addr_pattern = tppn & ~vcpu_info.msi_addr_mask;
+ vcpu_info.gpa = target;
+
+ read_lock_irqsave(&imsic->vsfile_lock, flags);
+
+ if (WARN_ON_ONCE(imsic->vsfile_cpu < 0)) {
+ read_unlock_irqrestore(&imsic->vsfile_lock, flags);
+ goto out;
+ }
+
+ vcpu_info.hpa = imsic->vsfile_pa;
+
+ ret = irq_set_vcpu_affinity(host_irq, &vcpu_info);
+ if (ret) {
+ read_unlock_irqrestore(&imsic->vsfile_lock, flags);
+ goto out;
+ }
+
+ irq_data_get_irq_chip(irqdata)->irq_write_msi_msg(irqdata, msg);
+
+ read_unlock_irqrestore(&imsic->vsfile_lock, flags);
+ }
+ }
+
+ ret = 0;
+out:
+ srcu_read_unlock(&kvm->irq_srcu, idx);
+ return ret;
+}
+
+static int kvm_riscv_vcpu_irq_update(struct kvm_vcpu *vcpu)
+{
+ struct kvm *kvm = vcpu->kvm;
+ struct imsic *imsic = vcpu->arch.aia_context.imsic_state;
+ struct kvm_aia *aia = &kvm->arch.aia;
+ u64 mask = kvm_riscv_aia_msi_addr_mask(aia);
+ u64 target = vcpu->arch.aia_context.imsic_addr;
+ struct riscv_iommu_vcpu_info vcpu_info = {
+ .msi_addr_pattern = (target >> IMSIC_MMIO_PAGE_SHIFT) & ~mask,
+ .msi_addr_mask = mask,
+ .group_index_bits = aia->nr_group_bits,
+ .group_index_shift = aia->nr_group_shift,
+ .gpa = target,
+ .hpa = imsic->vsfile_pa,
+ };
+ struct kvm_kernel_irqfd *irqfd;
+ int host_irq, ret;
+
+ spin_lock_irq(&kvm->irqfds.lock);
+
+ list_for_each_entry(irqfd, &kvm->irqfds.items, list) {
+ if (!irqfd->producer)
+ continue;
+ host_irq = irqfd->producer->irq;
+ ret = irq_set_vcpu_affinity(host_irq, &vcpu_info);
+ if (ret) {
+ spin_unlock_irq(&kvm->irqfds.lock);
+ return ret;
+ }
+ }
+
+ spin_unlock_irq(&kvm->irqfds.lock);
+
+ return 0;
}
int kvm_riscv_vcpu_aia_imsic_update(struct kvm_vcpu *vcpu)
@@ -797,14 +920,17 @@ int kvm_riscv_vcpu_aia_imsic_update(struct kvm_vcpu *vcpu)
if (ret)
goto fail_free_vsfile_hgei;
- /* TODO: Update the IOMMU mapping ??? */
-
/* Update new IMSIC VS-file details in IMSIC context */
write_lock_irqsave(&imsic->vsfile_lock, flags);
+
imsic->vsfile_hgei = new_vsfile_hgei;
imsic->vsfile_cpu = vcpu->cpu;
imsic->vsfile_va = new_vsfile_va;
imsic->vsfile_pa = new_vsfile_pa;
+
+ /* Update the IOMMU mapping */
+ kvm_riscv_vcpu_irq_update(vcpu);
+
write_unlock_irqrestore(&imsic->vsfile_lock, flags);
/*
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index 9c5837518c1a..5f697d9a37da 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -78,7 +78,7 @@ EXPORT_SYMBOL_GPL(kvm_arch_has_assigned_device);
bool kvm_arch_has_irq_bypass(void)
{
- return false;
+ return true;
}
int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
--
2.47.0
* [RFC PATCH 14/15] vfio: enable IOMMU_TYPE1 for RISC-V
2024-11-14 16:18 [RFC PATCH 00/15] iommu/riscv: Add irqbypass support Andrew Jones
` (12 preceding siblings ...)
2024-11-14 16:18 ` [RFC PATCH 13/15] RISC-V: KVM: " Andrew Jones
@ 2024-11-14 16:18 ` Andrew Jones
2024-11-14 16:19 ` [RFC PATCH 15/15] RISC-V: defconfig: Add VFIO modules Andrew Jones
14 siblings, 0 replies; 36+ messages in thread
From: Andrew Jones @ 2024-11-14 16:18 UTC (permalink / raw)
To: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel
Cc: tjeznach, zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
From: Tomasz Jeznach <tjeznach@rivosinc.com>
Enable VFIO support on the RISC-V architecture.
Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
---
drivers/vfio/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index ceae52fd7586..ad62205b4e45 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -39,7 +39,7 @@ config VFIO_GROUP
config VFIO_CONTAINER
bool "Support for the VFIO container /dev/vfio/vfio"
- select VFIO_IOMMU_TYPE1 if MMU && (X86 || S390 || ARM || ARM64)
+ select VFIO_IOMMU_TYPE1 if MMU && (X86 || S390 || ARM || ARM64 || RISCV)
depends on VFIO_GROUP
default y
help
--
2.47.0
* [RFC PATCH 15/15] RISC-V: defconfig: Add VFIO modules
2024-11-14 16:18 [RFC PATCH 00/15] iommu/riscv: Add irqbypass support Andrew Jones
` (13 preceding siblings ...)
2024-11-14 16:18 ` [RFC PATCH 14/15] vfio: enable IOMMU_TYPE1 for RISC-V Andrew Jones
@ 2024-11-14 16:19 ` Andrew Jones
14 siblings, 0 replies; 36+ messages in thread
From: Andrew Jones @ 2024-11-14 16:19 UTC (permalink / raw)
To: iommu, kvm-riscv, kvm, linux-riscv, linux-kernel
Cc: tjeznach, zong.li, joro, will, robin.murphy, anup, atishp, tglx,
alex.williamson, paul.walmsley, palmer, aou
Add the VFIO modules to the defconfig to complement KVM now
that there is IOMMU support.
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
---
arch/riscv/configs/defconfig | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/riscv/configs/defconfig b/arch/riscv/configs/defconfig
index b4a37345703e..10fc9d84a28c 100644
--- a/arch/riscv/configs/defconfig
+++ b/arch/riscv/configs/defconfig
@@ -242,6 +242,8 @@ CONFIG_RTC_DRV_SUN6I=y
CONFIG_DMADEVICES=y
CONFIG_DMA_SUN6I=m
CONFIG_DW_AXI_DMAC=y
+CONFIG_VFIO=m
+CONFIG_VFIO_PCI=m
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_BALLOON=y
CONFIG_VIRTIO_INPUT=y
--
2.47.0