* [PATCH 0/2] s390x/pci: relax I/O address translation requirement
@ 2024-12-09 19:29 Matthew Rosato
2024-12-09 19:29 ` [PATCH 1/2] s390x/pci: add support for guests that request direct mapping Matthew Rosato
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Matthew Rosato @ 2024-12-09 19:29 UTC (permalink / raw)
To: qemu-s390x
Cc: farman, schnelle, thuth, pasic, borntraeger, richard.henderson,
david, iii, clegoate, qemu-devel
This series introduces the concept of the relaxed translation requirement
for s390x guests in order to allow bypass of the guest IOMMU for more
efficient PCI passthrough.
With this series, QEMU can indicate to the guest that an IOMMU is not
strictly required for a zPCI device. This would subsequently allow a
guest linux to use iommu.passthrough=1 and bypass their guest IOMMU for
PCI devices.
When this occurs, QEMU will note the behavior via an intercepted MPCIFC
instruction and will fill the host iommu with mappings of the entire
guest address space in response.
There is a kernel series [1] that adds the relevant behavior needed to
exploit this new feature from within a s390x linux guest.
[1]: https://lore.kernel.org/linux-s390/20241209192403.107090-1-mjrosato@linux.ibm.com/
Matthew Rosato (2):
s390x/pci: add support for guests that request direct mapping
s390x/pci: indicate QEMU supports relaxed translation for passthrough
hw/s390x/s390-pci-bus.c | 23 ++++++++++++++++++
hw/s390x/s390-pci-inst.c | 42 +++++++++++++++++++++++++++++++--
hw/s390x/s390-pci-vfio.c | 4 +++-
include/hw/s390x/s390-pci-bus.h | 2 ++
include/hw/s390x/s390-pci-clp.h | 1 +
5 files changed, 69 insertions(+), 3 deletions(-)
--
2.47.0
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/2] s390x/pci: add support for guests that request direct mapping
2024-12-09 19:29 [PATCH 0/2] s390x/pci: relax I/O address translation requirement Matthew Rosato
@ 2024-12-09 19:29 ` Matthew Rosato
2024-12-09 21:01 ` David Hildenbrand
` (2 more replies)
2024-12-09 19:29 ` [PATCH 2/2] s390x/pci: indicate QEMU supports relaxed translation for passthrough Matthew Rosato
2024-12-12 9:10 ` [PATCH 0/2] s390x/pci: relax I/O address translation requirement Thomas Huth
2 siblings, 3 replies; 17+ messages in thread
From: Matthew Rosato @ 2024-12-09 19:29 UTC (permalink / raw)
To: qemu-s390x
Cc: farman, schnelle, thuth, pasic, borntraeger, richard.henderson,
david, iii, clegoate, qemu-devel
When receiving a guest mpcifc(4) or mpcifc(6) instruction without the T
bit set, treat this as a request to perform direct mapping instead of
address translation. In order to facilitiate this, pin the entirety of
guest memory into the host iommu.
Subsequent guest DMA operations are all expected to be of the format
guest_phys+sdma, allowing them to be used as lookup into the host
iommu table.
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
hw/s390x/s390-pci-bus.c | 23 ++++++++++++++++++
hw/s390x/s390-pci-inst.c | 42 +++++++++++++++++++++++++++++++--
include/hw/s390x/s390-pci-bus.h | 2 ++
3 files changed, 65 insertions(+), 2 deletions(-)
diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index 40b2567aa7..8d4224e032 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -18,6 +18,7 @@
#include "hw/s390x/s390-pci-inst.h"
#include "hw/s390x/s390-pci-kvm.h"
#include "hw/s390x/s390-pci-vfio.h"
+#include "hw/boards.h"
#include "hw/pci/pci_bus.h"
#include "hw/qdev-properties.h"
#include "hw/pci/pci_bridge.h"
@@ -720,6 +721,27 @@ void s390_pci_iommu_enable(S390PCIIOMMU *iommu)
TYPE_S390_IOMMU_MEMORY_REGION, OBJECT(&iommu->mr),
name, iommu->pal + 1);
iommu->enabled = true;
+ iommu->direct_map = false;
+ memory_region_add_subregion(&iommu->mr, 0, MEMORY_REGION(&iommu->iommu_mr));
+ g_free(name);
+}
+
+void s390_pci_iommu_dm_enable(S390PCIIOMMU *iommu)
+{
+ MachineState *ms = MACHINE(qdev_get_machine());
+
+ /*
+ * For direct-mapping we must map the entire guest address space. Because
+ * the mappings are contiguous we are not restricted to individual 4K
+ * mappings via vfio, so let's not worry about the DMA limit when
+ * calculating the range.
+ */
+ char *name = g_strdup_printf("iommu-s390-%04x", iommu->pbdev->uid);
+ memory_region_init_iommu(&iommu->iommu_mr, sizeof(iommu->iommu_mr),
+ TYPE_S390_IOMMU_MEMORY_REGION, OBJECT(&iommu->mr),
+ name, iommu->pba + ms->ram_size);
+ iommu->enabled = true;
+ iommu->direct_map = true;
memory_region_add_subregion(&iommu->mr, 0, MEMORY_REGION(&iommu->iommu_mr));
g_free(name);
}
@@ -727,6 +749,7 @@ void s390_pci_iommu_enable(S390PCIIOMMU *iommu)
void s390_pci_iommu_disable(S390PCIIOMMU *iommu)
{
iommu->enabled = false;
+ iommu->direct_map = false;
g_hash_table_remove_all(iommu->iotlb);
memory_region_del_subregion(&iommu->mr, MEMORY_REGION(&iommu->iommu_mr));
object_unparent(OBJECT(&iommu->iommu_mr));
diff --git a/hw/s390x/s390-pci-inst.c b/hw/s390x/s390-pci-inst.c
index 41655082da..f4d8fe8fe8 100644
--- a/hw/s390x/s390-pci-inst.c
+++ b/hw/s390x/s390-pci-inst.c
@@ -16,6 +16,7 @@
#include "exec/memory.h"
#include "qemu/error-report.h"
#include "sysemu/hw_accel.h"
+#include "hw/boards.h"
#include "hw/pci/pci_device.h"
#include "hw/s390x/s390-pci-inst.h"
#include "hw/s390x/s390-pci-bus.h"
@@ -990,6 +991,33 @@ int pci_dereg_irqs(S390PCIBusDevice *pbdev)
return 0;
}
+static void s390_pci_setup_stage2_map(S390PCIIOMMU *iommu)
+{
+ MachineState *ms = MACHINE(qdev_get_machine());
+ uint64_t remain = ms->ram_size, start = iommu->pba, mask, size, curr = 0;
+ uint64_t end = start + remain - 1;
+ IOMMUTLBEvent event = {
+ .type = IOMMU_NOTIFIER_MAP,
+ .entry = {
+ .target_as = &address_space_memory,
+ .translated_addr = 0,
+ .perm = IOMMU_RW,
+ },
+ };
+
+ while (remain >= TARGET_PAGE_SIZE) {
+ mask = dma_aligned_pow2_mask(start, end, 64);
+ size = mask + 1;
+ event.entry.iova = start;
+ event.entry.addr_mask = mask;
+ event.entry.translated_addr = curr;
+ memory_region_notify_iommu(&iommu->iommu_mr, 0, event);
+ start += size;
+ curr += size;
+ remain -= size;
+ }
+}
+
static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib,
uintptr_t ra)
{
@@ -1008,7 +1036,7 @@ static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib,
}
/* currently we only support designation type 1 with translation */
- if (!(dt == ZPCI_IOTA_RTTO && t)) {
+ if (t && !(dt == ZPCI_IOTA_RTTO)) {
error_report("unsupported ioat dt %d t %d", dt, t);
s390_program_interrupt(env, PGM_OPERAND, ra);
return -EINVAL;
@@ -1018,13 +1046,23 @@ static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib,
iommu->pal = pal;
iommu->g_iota = g_iota;
- s390_pci_iommu_enable(iommu);
+ if (t) {
+ s390_pci_iommu_enable(iommu);
+ } else {
+ s390_pci_iommu_dm_enable(iommu);
+ /* If not translating, map everything now */
+ s390_pci_setup_stage2_map(iommu);
+ }
return 0;
}
void pci_dereg_ioat(S390PCIIOMMU *iommu)
{
+ MachineState *ms = MACHINE(qdev_get_machine());
+ if (iommu->direct_map) {
+ s390_pci_batch_unmap(iommu, iommu->pba, ms->ram_size);
+ }
s390_pci_iommu_disable(iommu);
iommu->pba = 0;
iommu->pal = 0;
diff --git a/include/hw/s390x/s390-pci-bus.h b/include/hw/s390x/s390-pci-bus.h
index 2c43ea123f..2aa55e3fd0 100644
--- a/include/hw/s390x/s390-pci-bus.h
+++ b/include/hw/s390x/s390-pci-bus.h
@@ -278,6 +278,7 @@ struct S390PCIIOMMU {
MemoryRegion mr;
IOMMUMemoryRegion iommu_mr;
bool enabled;
+ bool direct_map;
uint64_t g_iota;
uint64_t pba;
uint64_t pal;
@@ -389,6 +390,7 @@ int pci_chsc_sei_nt2_have_event(void);
void s390_pci_sclp_configure(SCCB *sccb);
void s390_pci_sclp_deconfigure(SCCB *sccb);
void s390_pci_iommu_enable(S390PCIIOMMU *iommu);
+void s390_pci_iommu_dm_enable(S390PCIIOMMU *iommu);
void s390_pci_iommu_disable(S390PCIIOMMU *iommu);
void s390_pci_generate_error_event(uint16_t pec, uint32_t fh, uint32_t fid,
uint64_t faddr, uint32_t e);
--
2.47.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 2/2] s390x/pci: indicate QEMU supports relaxed translation for passthrough
2024-12-09 19:29 [PATCH 0/2] s390x/pci: relax I/O address translation requirement Matthew Rosato
2024-12-09 19:29 ` [PATCH 1/2] s390x/pci: add support for guests that request direct mapping Matthew Rosato
@ 2024-12-09 19:29 ` Matthew Rosato
2024-12-11 11:40 ` Thomas Huth
2024-12-12 9:10 ` [PATCH 0/2] s390x/pci: relax I/O address translation requirement Thomas Huth
2 siblings, 1 reply; 17+ messages in thread
From: Matthew Rosato @ 2024-12-09 19:29 UTC (permalink / raw)
To: qemu-s390x
Cc: farman, schnelle, thuth, pasic, borntraeger, richard.henderson,
david, iii, clegoate, qemu-devel
Specifying this bit in the guest CLP response indicates that the guest
can optionally choose to skip translation and instead use
identity-mapped operations.
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
hw/s390x/s390-pci-vfio.c | 4 +++-
include/hw/s390x/s390-pci-clp.h | 1 +
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/hw/s390x/s390-pci-vfio.c b/hw/s390x/s390-pci-vfio.c
index 7dbbc76823..51ac5ff3eb 100644
--- a/hw/s390x/s390-pci-vfio.c
+++ b/hw/s390x/s390-pci-vfio.c
@@ -224,7 +224,9 @@ static void s390_pci_read_group(S390PCIBusDevice *pbdev,
resgrp = &pbdev->pci_group->zpci_group;
if (cap->flags & VFIO_DEVICE_INFO_ZPCI_FLAG_REFRESH) {
- resgrp->fr = 1;
+ resgrp->fr = (CLP_RSP_QPCIG_MASK_RTR | CLP_RSP_QPCIG_MASK_REFRESH);
+ } else {
+ resgrp->fr = CLP_RSP_QPCIG_MASK_RTR;
}
resgrp->dasm = cap->dasm;
resgrp->msia = cap->msi_addr;
diff --git a/include/hw/s390x/s390-pci-clp.h b/include/hw/s390x/s390-pci-clp.h
index 03b7f9ba5f..6a635d693b 100644
--- a/include/hw/s390x/s390-pci-clp.h
+++ b/include/hw/s390x/s390-pci-clp.h
@@ -158,6 +158,7 @@ typedef struct ClpRspQueryPciGrp {
#define CLP_RSP_QPCIG_MASK_NOI 0xfff
uint16_t i;
uint8_t version;
+#define CLP_RSP_QPCIG_MASK_RTR 0x20
#define CLP_RSP_QPCIG_MASK_FRAME 0x2
#define CLP_RSP_QPCIG_MASK_REFRESH 0x1
uint8_t fr;
--
2.47.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] s390x/pci: add support for guests that request direct mapping
2024-12-09 19:29 ` [PATCH 1/2] s390x/pci: add support for guests that request direct mapping Matthew Rosato
@ 2024-12-09 21:01 ` David Hildenbrand
2024-12-09 21:45 ` Matthew Rosato
2024-12-11 11:34 ` Thomas Huth
2024-12-11 14:40 ` Cédric Le Goater
2 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2024-12-09 21:01 UTC (permalink / raw)
To: Matthew Rosato, qemu-s390x
Cc: farman, schnelle, thuth, pasic, borntraeger, richard.henderson,
iii, clegoate, qemu-devel
On 09.12.24 20:29, Matthew Rosato wrote:
> When receiving a guest mpcifc(4) or mpcifc(6) instruction without the T
> bit set, treat this as a request to perform direct mapping instead of
> address translation. In order to facilitiate this, pin the entirety of
> guest memory into the host iommu.
>
> Subsequent guest DMA operations are all expected to be of the format
> guest_phys+sdma, allowing them to be used as lookup into the host
> iommu table.
>
> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
> ---
> hw/s390x/s390-pci-bus.c | 23 ++++++++++++++++++
> hw/s390x/s390-pci-inst.c | 42 +++++++++++++++++++++++++++++++--
> include/hw/s390x/s390-pci-bus.h | 2 ++
> 3 files changed, 65 insertions(+), 2 deletions(-)
>
> diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
> index 40b2567aa7..8d4224e032 100644
> --- a/hw/s390x/s390-pci-bus.c
> +++ b/hw/s390x/s390-pci-bus.c
> @@ -18,6 +18,7 @@
> #include "hw/s390x/s390-pci-inst.h"
> #include "hw/s390x/s390-pci-kvm.h"
> #include "hw/s390x/s390-pci-vfio.h"
> +#include "hw/boards.h"
> #include "hw/pci/pci_bus.h"
> #include "hw/qdev-properties.h"
> #include "hw/pci/pci_bridge.h"
> @@ -720,6 +721,27 @@ void s390_pci_iommu_enable(S390PCIIOMMU *iommu)
> TYPE_S390_IOMMU_MEMORY_REGION, OBJECT(&iommu->mr),
> name, iommu->pal + 1);
> iommu->enabled = true;
> + iommu->direct_map = false;
> + memory_region_add_subregion(&iommu->mr, 0, MEMORY_REGION(&iommu->iommu_mr));
> + g_free(name);
> +}
> +
> +void s390_pci_iommu_dm_enable(S390PCIIOMMU *iommu)
> +{
> + MachineState *ms = MACHINE(qdev_get_machine());
> +
> + /*
> + * For direct-mapping we must map the entire guest address space. Because
> + * the mappings are contiguous we are not restricted to individual 4K
> + * mappings via vfio, so let's not worry about the DMA limit when
> + * calculating the range.
> + */
> + char *name = g_strdup_printf("iommu-s390-%04x", iommu->pbdev->uid);
> + memory_region_init_iommu(&iommu->iommu_mr, sizeof(iommu->iommu_mr),
> + TYPE_S390_IOMMU_MEMORY_REGION, OBJECT(&iommu->mr),
> + name, iommu->pba + ms->ram_size);
> + iommu->enabled = true;
> + iommu->direct_map = true;
> memory_region_add_subregion(&iommu->mr, 0, MEMORY_REGION(&iommu->iommu_mr));
> g_free(name);
> }
> @@ -727,6 +749,7 @@ void s390_pci_iommu_enable(S390PCIIOMMU *iommu)
> void s390_pci_iommu_disable(S390PCIIOMMU *iommu)
> {
> iommu->enabled = false;
> + iommu->direct_map = false;
> g_hash_table_remove_all(iommu->iotlb);
> memory_region_del_subregion(&iommu->mr, MEMORY_REGION(&iommu->iommu_mr));
> object_unparent(OBJECT(&iommu->iommu_mr));
> diff --git a/hw/s390x/s390-pci-inst.c b/hw/s390x/s390-pci-inst.c
> index 41655082da..f4d8fe8fe8 100644
> --- a/hw/s390x/s390-pci-inst.c
> +++ b/hw/s390x/s390-pci-inst.c
> @@ -16,6 +16,7 @@
> #include "exec/memory.h"
> #include "qemu/error-report.h"
> #include "sysemu/hw_accel.h"
> +#include "hw/boards.h"
> #include "hw/pci/pci_device.h"
> #include "hw/s390x/s390-pci-inst.h"
> #include "hw/s390x/s390-pci-bus.h"
> @@ -990,6 +991,33 @@ int pci_dereg_irqs(S390PCIBusDevice *pbdev)
> return 0;
> }
>
> +static void s390_pci_setup_stage2_map(S390PCIIOMMU *iommu)
> +{
> + MachineState *ms = MACHINE(qdev_get_machine());
> + uint64_t remain = ms->ram_size, start = iommu->pba, mask, size, curr = 0;
> + uint64_t end = start + remain - 1;
> + IOMMUTLBEvent event = {
> + .type = IOMMU_NOTIFIER_MAP,
> + .entry = {
> + .target_as = &address_space_memory,
> + .translated_addr = 0,
> + .perm = IOMMU_RW,
> + },
> + };
> +
> + while (remain >= TARGET_PAGE_SIZE) {
> + mask = dma_aligned_pow2_mask(start, end, 64);
> + size = mask + 1;
> + event.entry.iova = start;
> + event.entry.addr_mask = mask;
> + event.entry.translated_addr = curr;
> + memory_region_notify_iommu(&iommu->iommu_mr, 0, event);
> + start += size;
> + curr += size;
> + remain -= size;
> + }
> +}
Hi,
Trying to wrap my head around that ... you mention that "pin the
entirety of guest memory".
Do you mean that we will actually end up longterm pinning all guest RAM
in the kernel, similar to what vfio ends up doing?
In that case, it would be incompatible with virtio-balloon (and without
modifications with upcoming virtio-mem). Is there already a mechanism in
place to handle that -- a call to ram_block_discard_disable() -- or
even a way to support coordinated discarding of RAM (e.g., virtio-mem +
vfio)?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] s390x/pci: add support for guests that request direct mapping
2024-12-09 21:01 ` David Hildenbrand
@ 2024-12-09 21:45 ` Matthew Rosato
2024-12-09 22:09 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: Matthew Rosato @ 2024-12-09 21:45 UTC (permalink / raw)
To: David Hildenbrand, qemu-s390x
Cc: farman, schnelle, thuth, pasic, borntraeger, richard.henderson,
iii, clegoate, qemu-devel
On 12/9/24 4:01 PM, David Hildenbrand wrote:
> On 09.12.24 20:29, Matthew Rosato wrote:
>
> Hi,
>
> Trying to wrap my head around that ... you mention that "pin the entirety of guest memory".
>
> Do you mean that we will actually end up longterm pinning all guest RAM in the kernel, similar to what vfio ends up doing?
Yes. Actually, the usecase here is specifically PCI passthrough via vfio-pci on s390. Unlike other platforms, the default s390 approach only pins on-demand and doesn't longterm pin all guest RAM, which is nice from a memory footprint perspective but pays a price via all those guest-2 RPCIT instructions. The goal here is now provide the optional alternative to longterm pin like other platforms.
>
> In that case, it would be incompatible with virtio-balloon (and without modifications with upcoming virtio-mem). Is there already a mechanism in place to handle that -- a call to ram_block_discard_disable() -- or even a way to support coordinated discarding of RAM (e.g., virtio-mem + vfio)?
Good point, should be calling add ram_block_discard_disable(true) when set register + a corresponding (false) during deregister... Will add for v2.
As for supporting coordinated discard, I was hoping to subsequently look at virtio-mem for this.
Thanks,
Matt
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] s390x/pci: add support for guests that request direct mapping
2024-12-09 21:45 ` Matthew Rosato
@ 2024-12-09 22:09 ` David Hildenbrand
2024-12-09 23:22 ` Matthew Rosato
0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2024-12-09 22:09 UTC (permalink / raw)
To: Matthew Rosato, qemu-s390x
Cc: farman, schnelle, thuth, pasic, borntraeger, richard.henderson,
iii, clegoate, qemu-devel
On 09.12.24 22:45, Matthew Rosato wrote:
> On 12/9/24 4:01 PM, David Hildenbrand wrote:
>> On 09.12.24 20:29, Matthew Rosato wrote:
>>
>> Hi,
>>
>> Trying to wrap my head around that ... you mention that "pin the entirety of guest memory".
>>
>> Do you mean that we will actually end up longterm pinning all guest RAM in the kernel, similar to what vfio ends up doing?
>
> Yes. Actually, the usecase here is specifically PCI passthrough via vfio-pci on s390. Unlike other platforms, the default s390 approach only pins on-demand and doesn't longterm pin all guest RAM, which is nice from a memory footprint perspective but pays a price via all those guest-2 RPCIT instructions. The goal here is now provide the optional alternative to longterm pin like other platforms.
Okay, thanks for confirming. One more question: who will trigger this
longterm-pinning? Is it vfio?
(the code flow from your code to the pinning code would be nice)
>
>>
>> In that case, it would be incompatible with virtio-balloon (and without modifications with upcoming virtio-mem). Is there already a mechanism in place to handle that -- a call to ram_block_discard_disable() -- or even a way to support coordinated discarding of RAM (e.g., virtio-mem + vfio)?
>
> Good point, should be calling add ram_block_discard_disable(true) when set register + a corresponding (false) during deregister... Will add for v2.
>
> As for supporting coordinated discard, I was hoping to subsequently look at virtio-mem for this.
As long as discarding is blocked for now, we're good. To support it, the
RAMDiscardManager would have to be wired up, similar to vfio.
I think the current way of handling it via
+ IOMMUTLBEvent event = {
+ .type = IOMMU_NOTIFIER_MAP,
+ .entry = {
+ .target_as = &address_space_memory,
+ .translated_addr = 0,
+ .perm = IOMMU_RW,
+ },
+ };
Is probably not ideal: it cannot cope with memory holes (which
virtio-mem would create).
Likely, you'd instead want an address space notifier, and really only
map the memory region sections you get notified about.
There, you can test for RAMDiscardManager and handle it like vfio does.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] s390x/pci: add support for guests that request direct mapping
2024-12-09 22:09 ` David Hildenbrand
@ 2024-12-09 23:22 ` Matthew Rosato
2024-12-10 8:58 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: Matthew Rosato @ 2024-12-09 23:22 UTC (permalink / raw)
To: David Hildenbrand, qemu-s390x
Cc: farman, schnelle, thuth, pasic, borntraeger, richard.henderson,
iii, clegoate, qemu-devel
On 12/9/24 5:09 PM, David Hildenbrand wrote:
> On 09.12.24 22:45, Matthew Rosato wrote:
>> On 12/9/24 4:01 PM, David Hildenbrand wrote:
>>> On 09.12.24 20:29, Matthew Rosato wrote:
>>>
>>> Hi,
>>>
>>> Trying to wrap my head around that ... you mention that "pin the entirety of guest memory".
>>>
>>> Do you mean that we will actually end up longterm pinning all guest RAM in the kernel, similar to what vfio ends up doing?
>>
>> Yes. Actually, the usecase here is specifically PCI passthrough via vfio-pci on s390. Unlike other platforms, the default s390 approach only pins on-demand and doesn't longterm pin all guest RAM, which is nice from a memory footprint perspective but pays a price via all those guest-2 RPCIT instructions. The goal here is now provide the optional alternative to longterm pin like other platforms.
>
> Okay, thanks for confirming. One more question: who will trigger this longterm-pinning? Is it vfio?
>
> (the code flow from your code to the pinning code would be nice)
>
Yes, the vfio IOMMU code triggers it. My s390_pci_setup_stage2_map added by this patch calls memory_region_notify_iommu in a loop such that we trigger iommu notifications to map iova X+pba -> GPA X for all GPA, where pba = a "base address" offset that has to be applied when mapping on s390. The notifications are sent in the largest batches possible to minimize vfio ioctls / use the least number of vfio dma mappings.
The iommu notifications get picked up in vfio_iommu_map_notify where we will follow the container DMA path down to vfio_legacy_dma_map; then ultimately vfio is handling the pinning via the VFIO_IOMMU_MAP_DMA ioctl.
>>
>>>
>>> In that case, it would be incompatible with virtio-balloon (and without modifications with upcoming virtio-mem). Is there already a mechanism in place to handle that -- a call to ram_block_discard_disable() -- or even a way to support coordinated discarding of RAM (e.g., virtio-mem + vfio)?
>>
>> Good point, should be calling add ram_block_discard_disable(true) when set register + a corresponding (false) during deregister... Will add for v2.
>>
>> As for supporting coordinated discard, I was hoping to subsequently look at virtio-mem for this.
>
> As long as discarding is blocked for now, we're good. To support it, the RAMDiscardManager would have to be wired up, similar to vfio.
>
> I think the current way of handling it via
> vf
> + IOMMUTLBEvent event = {
> + .type = IOMMU_NOTIFIER_MAP,
> + .entry = {
> + .target_as = &address_space_memory,
> + .translated_addr = 0,
> + .perm = IOMMU_RW,
> + },
> + };
>
>
> Is probably not ideal: it cannot cope with memory holes (which virtio-mem would create).
>
> Likely, you'd instead want an address space notifier, and really only map the memory region sections you get notified about.
>
> There, you can test for RAMDiscardManager and handle it like vfio does.
>
I'll start looking into this; for the moment I'll plan on blocking discarding in this series with a follow-on to then enable virtio-mem, but if I get something working sooner I'll add it to this series. Either way I'll put you on CC.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] s390x/pci: add support for guests that request direct mapping
2024-12-09 23:22 ` Matthew Rosato
@ 2024-12-10 8:58 ` David Hildenbrand
2024-12-13 22:46 ` Matthew Rosato
0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2024-12-10 8:58 UTC (permalink / raw)
To: Matthew Rosato, qemu-s390x
Cc: farman, schnelle, thuth, pasic, borntraeger, richard.henderson,
iii, clegoate, qemu-devel, Alex Williamson
On 10.12.24 00:22, Matthew Rosato wrote:
> On 12/9/24 5:09 PM, David Hildenbrand wrote:
>> On 09.12.24 22:45, Matthew Rosato wrote:
>>> On 12/9/24 4:01 PM, David Hildenbrand wrote:
>>>> On 09.12.24 20:29, Matthew Rosato wrote:
>>>>
>>>> Hi,
>>>>
>>>> Trying to wrap my head around that ... you mention that "pin the entirety of guest memory".
>>>>
>>>> Do you mean that we will actually end up longterm pinning all guest RAM in the kernel, similar to what vfio ends up doing?
>>>
>>> Yes. Actually, the usecase here is specifically PCI passthrough via vfio-pci on s390. Unlike other platforms, the default s390 approach only pins on-demand and doesn't longterm pin all guest RAM, which is nice from a memory footprint perspective but pays a price via all those guest-2 RPCIT instructions. The goal here is now provide the optional alternative to longterm pin like other platforms.
>>
>> Okay, thanks for confirming. One more question: who will trigger this longterm-pinning? Is it vfio?
>>
>> (the code flow from your code to the pinning code would be nice)
>>
Thanks for the details.
>
> Yes, the vfio IOMMU code triggers it. My s390_pci_setup_stage2_map added by this patch calls memory_region_notify_iommu in a loop such that we trigger iommu notifications to map iova X+pba -> GPA X for all GPA, where pba = a "base address" offset that has to be applied when mapping on s390.
Ah, now I see that you use "ms->ram_size", so indeed, this will map
initial RAM only.
The more-future-proof approach would indeed be to register a memory
listener on &address_space_memory, to then map/unmap whenever notified
about map/unmap.
See "struct vfio_memory_listener" with its region_add/region_del
callbacks that do that, and also implement the RAMDiscardManager plumbing.
And now I wonder if there would be a way to just reuse "struct
vfio_memory_listener" here some way? Or at least reuse the map/unmap
functions, because ...
The notifications are sent in the largest batches possible to minimize
vfio ioctls / use the least number of vfio dma mappings.
>
> The iommu notifications get picked up in vfio_iommu_map_notify where we will follow the container DMA path down to vfio_legacy_dma_map; then ultimately vfio is handling the pinning via the VFIO_IOMMU_MAP_DMA ioctl.
... vfio_listener_region_add() will also just call vfio_container_dma_map().
Maybe there is a reason s390x needs to handle this using
memory_region_notify_iommu() callbacks instead of doing it similar to
"struct vfio_memory_listener" when registered on &address_space_memory
without a viommu.
>
>>>
>>>>
>>>> In that case, it would be incompatible with virtio-balloon (and without modifications with upcoming virtio-mem). Is there already a mechanism in place to handle that -- a call to ram_block_discard_disable() -- or even a way to support coordinated discarding of RAM (e.g., virtio-mem + vfio)?
>>>
>>> Good point, should be calling add ram_block_discard_disable(true) when set register + a corresponding (false) during deregister... Will add for v2.
>>>
>>> As for supporting coordinated discard, I was hoping to subsequently look at virtio-mem for this.
>>
>> As long as discarding is blocked for now, we're good. To support it, the RAMDiscardManager would have to be wired up, similar to vfio.
>>
>> I think the current way of handling it via
>> vf
>> + IOMMUTLBEvent event = {
>> + .type = IOMMU_NOTIFIER_MAP,
>> + .entry = {
>> + .target_as = &address_space_memory,
>> + .translated_addr = 0,
>> + .perm = IOMMU_RW,
>> + },
>> + };
>>
>>
>> Is probably not ideal: it cannot cope with memory holes (which virtio-mem would create).
>>
>> Likely, you'd instead want an address space notifier, and really only map the memory region sections you get notified about.
>>
>> There, you can test for RAMDiscardManager and handle it like vfio does.
>>
>
> I'll start looking into this; for the moment I'll plan on blocking discarding in this series with a follow-on to then enable virtio-mem, but if I get something working sooner I'll add it to this series. Either way I'll put you on CC.
Thanks, exploring whether we can reuse the vfio bits to map/unmap might
be very valuable and simplify things.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] s390x/pci: add support for guests that request direct mapping
2024-12-09 19:29 ` [PATCH 1/2] s390x/pci: add support for guests that request direct mapping Matthew Rosato
2024-12-09 21:01 ` David Hildenbrand
@ 2024-12-11 11:34 ` Thomas Huth
2024-12-11 14:40 ` Cédric Le Goater
2 siblings, 0 replies; 17+ messages in thread
From: Thomas Huth @ 2024-12-11 11:34 UTC (permalink / raw)
To: Matthew Rosato, qemu-s390x
Cc: farman, schnelle, pasic, borntraeger, richard.henderson, david,
iii, clegoate, qemu-devel
On 09/12/2024 20.29, Matthew Rosato wrote:
> When receiving a guest mpcifc(4) or mpcifc(6) instruction without the T
> bit set, treat this as a request to perform direct mapping instead of
> address translation. In order to facilitiate this, pin the entirety of
s/facilitiate/facilitate/
> guest memory into the host iommu.
...
> +void s390_pci_iommu_dm_enable(S390PCIIOMMU *iommu)
> +{
> + MachineState *ms = MACHINE(qdev_get_machine());
> +
> + /*
> + * For direct-mapping we must map the entire guest address space. Because
> + * the mappings are contiguous we are not restricted to individual 4K
> + * mappings via vfio, so let's not worry about the DMA limit when
> + * calculating the range.
> + */
> + char *name = g_strdup_printf("iommu-s390-%04x", iommu->pbdev->uid);
FWIW, you could use g_autofree to get rid of the g_free() at the end of the
function.
> + memory_region_init_iommu(&iommu->iommu_mr, sizeof(iommu->iommu_mr),
> + TYPE_S390_IOMMU_MEMORY_REGION, OBJECT(&iommu->mr),
> + name, iommu->pba + ms->ram_size);
> + iommu->enabled = true;
> + iommu->direct_map = true;
> memory_region_add_subregion(&iommu->mr, 0, MEMORY_REGION(&iommu->iommu_mr));
> g_free(name);
> }
Thomas
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/2] s390x/pci: indicate QEMU supports relaxed translation for passthrough
2024-12-09 19:29 ` [PATCH 2/2] s390x/pci: indicate QEMU supports relaxed translation for passthrough Matthew Rosato
@ 2024-12-11 11:40 ` Thomas Huth
0 siblings, 0 replies; 17+ messages in thread
From: Thomas Huth @ 2024-12-11 11:40 UTC (permalink / raw)
To: Matthew Rosato, qemu-s390x
Cc: farman, schnelle, pasic, borntraeger, richard.henderson, david,
iii, clegoate, qemu-devel
On 09/12/2024 20.29, Matthew Rosato wrote:
> Specifying this bit in the guest CLP response indicates that the guest
> can optionally choose to skip translation and instead use
> identity-mapped operations.
>
> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
> ---
> hw/s390x/s390-pci-vfio.c | 4 +++-
> include/hw/s390x/s390-pci-clp.h | 1 +
> 2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/hw/s390x/s390-pci-vfio.c b/hw/s390x/s390-pci-vfio.c
> index 7dbbc76823..51ac5ff3eb 100644
> --- a/hw/s390x/s390-pci-vfio.c
> +++ b/hw/s390x/s390-pci-vfio.c
> @@ -224,7 +224,9 @@ static void s390_pci_read_group(S390PCIBusDevice *pbdev,
>
> resgrp = &pbdev->pci_group->zpci_group;
> if (cap->flags & VFIO_DEVICE_INFO_ZPCI_FLAG_REFRESH) {
> - resgrp->fr = 1;
> + resgrp->fr = (CLP_RSP_QPCIG_MASK_RTR | CLP_RSP_QPCIG_MASK_REFRESH);
> + } else {
> + resgrp->fr = CLP_RSP_QPCIG_MASK_RTR;
> }
Just a matter of taste, but maybe easier to write it like this:
resgrp->fr = CLP_RSP_QPCIG_MASK_RTR;
if (cap->flags & VFIO_DEVICE_INFO_ZPCI_FLAG_REFRESH) {
resgrp->fr |= CLP_RSP_QPCIG_MASK_REFRESH;
}
?
Thomas
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] s390x/pci: add support for guests that request direct mapping
2024-12-09 19:29 ` [PATCH 1/2] s390x/pci: add support for guests that request direct mapping Matthew Rosato
2024-12-09 21:01 ` David Hildenbrand
2024-12-11 11:34 ` Thomas Huth
@ 2024-12-11 14:40 ` Cédric Le Goater
2024-12-11 15:17 ` Matthew Rosato
2 siblings, 1 reply; 17+ messages in thread
From: Cédric Le Goater @ 2024-12-11 14:40 UTC (permalink / raw)
To: Matthew Rosato, qemu-s390x
Cc: farman, schnelle, thuth, pasic, borntraeger, richard.henderson,
david, iii, qemu-devel
On 12/9/24 20:29, Matthew Rosato wrote:
> When receiving a guest mpcifc(4) or mpcifc(6) instruction without the T
> bit set, treat this as a request to perform direct mapping instead of
> address translation. In order to facilitiate this, pin the entirety of
> guest memory into the host iommu.
>
> Subsequent guest DMA operations are all expected to be of the format
> guest_phys+sdma, allowing them to be used as lookup into the host
> iommu table.
>
> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
> ---
> hw/s390x/s390-pci-bus.c | 23 ++++++++++++++++++
> hw/s390x/s390-pci-inst.c | 42 +++++++++++++++++++++++++++++++--
> include/hw/s390x/s390-pci-bus.h | 2 ++
> 3 files changed, 65 insertions(+), 2 deletions(-)
>
> diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
> index 40b2567aa7..8d4224e032 100644
> --- a/hw/s390x/s390-pci-bus.c
> +++ b/hw/s390x/s390-pci-bus.c
> @@ -18,6 +18,7 @@
> #include "hw/s390x/s390-pci-inst.h"
> #include "hw/s390x/s390-pci-kvm.h"
> #include "hw/s390x/s390-pci-vfio.h"
> +#include "hw/boards.h"
> #include "hw/pci/pci_bus.h"
> #include "hw/qdev-properties.h"
> #include "hw/pci/pci_bridge.h"
> @@ -720,6 +721,27 @@ void s390_pci_iommu_enable(S390PCIIOMMU *iommu)
> TYPE_S390_IOMMU_MEMORY_REGION, OBJECT(&iommu->mr),
> name, iommu->pal + 1);
> iommu->enabled = true;
> + iommu->direct_map = false;
> + memory_region_add_subregion(&iommu->mr, 0, MEMORY_REGION(&iommu->iommu_mr));
> + g_free(name);
> +}
> +
> +void s390_pci_iommu_dm_enable(S390PCIIOMMU *iommu)
This is duplicating s390_pci_iommu_enable(). May be we could add a new
argument to s390_pci_iommu_enable() instead ?
> +{
> + MachineState *ms = MACHINE(qdev_get_machine());
> +
> + /*
> + * For direct-mapping we must map the entire guest address space. Because
> + * the mappings are contiguous we are not restricted to individual 4K
> + * mappings via vfio, so let's not worry about the DMA limit when> + * calculating the range.
> + */
> + char *name = g_strdup_printf("iommu-s390-%04x", iommu->pbdev->uid);
> + memory_region_init_iommu(&iommu->iommu_mr, sizeof(iommu->iommu_mr),
> + TYPE_S390_IOMMU_MEMORY_REGION, OBJECT(&iommu->mr),
> + name, iommu->pba + ms->ram_size);
> + iommu->enabled = true;
> + iommu->direct_map = true;
> memory_region_add_subregion(&iommu->mr, 0, MEMORY_REGION(&iommu->iommu_mr));
> g_free(name);
> }
> @@ -727,6 +749,7 @@ void s390_pci_iommu_enable(S390PCIIOMMU *iommu)
> void s390_pci_iommu_disable(S390PCIIOMMU *iommu)
> {
> iommu->enabled = false;
> + iommu->direct_map = false;
> g_hash_table_remove_all(iommu->iotlb);
> memory_region_del_subregion(&iommu->mr, MEMORY_REGION(&iommu->iommu_mr));
> object_unparent(OBJECT(&iommu->iommu_mr));
> diff --git a/hw/s390x/s390-pci-inst.c b/hw/s390x/s390-pci-inst.c
> index 41655082da..f4d8fe8fe8 100644
> --- a/hw/s390x/s390-pci-inst.c
> +++ b/hw/s390x/s390-pci-inst.c
> @@ -16,6 +16,7 @@
> #include "exec/memory.h"
> #include "qemu/error-report.h"
> #include "sysemu/hw_accel.h"
> +#include "hw/boards.h"
> #include "hw/pci/pci_device.h"
> #include "hw/s390x/s390-pci-inst.h"
> #include "hw/s390x/s390-pci-bus.h"
> @@ -990,6 +991,33 @@ int pci_dereg_irqs(S390PCIBusDevice *pbdev)
> return 0;
> }
>
> +static void s390_pci_setup_stage2_map(S390PCIIOMMU *iommu)
This is very much like s390_pci_batch_unmap(). Could we introduce a
common helper ?
> +{
> + MachineState *ms = MACHINE(qdev_get_machine());
> + uint64_t remain = ms->ram_size, start = iommu->pba, mask, size, curr = 0;
> + uint64_t end = start + remain - 1;
> + IOMMUTLBEvent event = {
> + .type = IOMMU_NOTIFIER_MAP,
> + .entry = {
> + .target_as = &address_space_memory,
> + .translated_addr = 0,
> + .perm = IOMMU_RW,
> + },
> + };
> +
> + while (remain >= TARGET_PAGE_SIZE) {
> + mask = dma_aligned_pow2_mask(start, end, 64);
> + size = mask + 1;
> + event.entry.iova = start;
> + event.entry.addr_mask = mask;
> + event.entry.translated_addr = curr;
> + memory_region_notify_iommu(&iommu->iommu_mr, 0, event);
> + start += size;
> + curr += size;
> + remain -= size;
> + }
> +}
> +
> static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib,
> uintptr_t ra)
> {
> @@ -1008,7 +1036,7 @@ static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib,
> }
>
> /* currently we only support designation type 1 with translation */
> - if (!(dt == ZPCI_IOTA_RTTO && t)) {
> + if (t && !(dt == ZPCI_IOTA_RTTO)) {
Is this change part of the patchset ? It looks like some other issue.
I might be wrong.
> error_report("unsupported ioat dt %d t %d", dt, t);
> s390_program_interrupt(env, PGM_OPERAND, ra);
> return -EINVAL;
> @@ -1018,13 +1046,23 @@ static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib,
> iommu->pal = pal;
> iommu->g_iota = g_iota;
>
> - s390_pci_iommu_enable(iommu);
> + if (t) {
> + s390_pci_iommu_enable(iommu);
> + } else {
> + s390_pci_iommu_dm_enable(iommu);
> + /* If not translating, map everything now */
> + s390_pci_setup_stage2_map(iommu);
> + }
I don't understand how we can enter "map everything" case.
Could you explain a bit more the scenario ?
Thanks,
C.
> return 0;
> }
>
> void pci_dereg_ioat(S390PCIIOMMU *iommu)
> {
> + MachineState *ms = MACHINE(qdev_get_machine());
> + if (iommu->direct_map) {
> + s390_pci_batch_unmap(iommu, iommu->pba, ms->ram_size);
> + }
> s390_pci_iommu_disable(iommu);
> iommu->pba = 0;
> iommu->pal = 0;
> diff --git a/include/hw/s390x/s390-pci-bus.h b/include/hw/s390x/s390-pci-bus.h
> index 2c43ea123f..2aa55e3fd0 100644
> --- a/include/hw/s390x/s390-pci-bus.h
> +++ b/include/hw/s390x/s390-pci-bus.h
> @@ -278,6 +278,7 @@ struct S390PCIIOMMU {
> MemoryRegion mr;
> IOMMUMemoryRegion iommu_mr;
> bool enabled;
> + bool direct_map;
> uint64_t g_iota;
> uint64_t pba;
> uint64_t pal;
> @@ -389,6 +390,7 @@ int pci_chsc_sei_nt2_have_event(void);
> void s390_pci_sclp_configure(SCCB *sccb);
> void s390_pci_sclp_deconfigure(SCCB *sccb);
> void s390_pci_iommu_enable(S390PCIIOMMU *iommu);
> +void s390_pci_iommu_dm_enable(S390PCIIOMMU *iommu);
> void s390_pci_iommu_disable(S390PCIIOMMU *iommu);
> void s390_pci_generate_error_event(uint16_t pec, uint32_t fh, uint32_t fid,
> uint64_t faddr, uint32_t e);
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] s390x/pci: add support for guests that request direct mapping
2024-12-11 14:40 ` Cédric Le Goater
@ 2024-12-11 15:17 ` Matthew Rosato
0 siblings, 0 replies; 17+ messages in thread
From: Matthew Rosato @ 2024-12-11 15:17 UTC (permalink / raw)
To: Cédric Le Goater, qemu-s390x
Cc: farman, schnelle, thuth, pasic, borntraeger, richard.henderson,
david, iii, qemu-devel
On 12/11/24 9:40 AM, Cédric Le Goater wrote:
> On 12/9/24 20:29, Matthew Rosato wrote:
>> static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib,
>> uintptr_t ra)
>> {
>> @@ -1008,7 +1036,7 @@ static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib,
>> }
>> /* currently we only support designation type 1 with translation */
>> - if (!(dt == ZPCI_IOTA_RTTO && t)) {
>> + if (t && !(dt == ZPCI_IOTA_RTTO)) {
>
> Is this change part of the patchset ? It looks like some other issue.
> I might be wrong.
>
Definitely part of this patch.
Before we only allow a single type of translation request from the guest. So, guest needed to be requesting translation (t) and it had to be of a certain type (dt==ZPCI_IOTA_RTTO) else we fail.
Now we are allowing either that same single translation type (so t && dt==ZPCI_IOTA_RTTO) OR no translation (!t). So it becomes valid to have (!t) but still invalid to have (t) with any other dt value besides (dt==ZPCI_IOTA_RTTO), hence the new check.
>> error_report("unsupported ioat dt %d t %d", dt, t);
>> s390_program_interrupt(env, PGM_OPERAND, ra);
>> return -EINVAL;
>> @@ -1018,13 +1046,23 @@ static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib,
>> iommu->pal = pal;
>> iommu->g_iota = g_iota;
>> - s390_pci_iommu_enable(iommu);
>> + if (t) {
>> + s390_pci_iommu_enable(iommu);
>> + } else {
>> + s390_pci_iommu_dm_enable(iommu);
>> + /* If not translating, map everything now */
>> + s390_pci_setup_stage2_map(iommu);
>> + }
>
>
> I don't understand how we can enter "map everything" case.
> Could you explain a bit more the scenario ?
>
Sure, this actually relates directly to the check I discussed up above...
Before, we would only ever accept a guest MPCIFC instruction of type ZPCI_MOD_FC_REG_IOAT that specified "designation type 1 with translation". All else would be rejected.
Now, we would accept either the above OR a guest MPCIFC instruction that specifies "no translation" - this is the case that gets us into the "map everything" scenario.
Patch 2 adds a CLP indication that we advertise to the guest that tells them whether or not the device group in question supports the "map everything case", so it's safe/allowable to issue a MPCIFC instruction that specifies "no translation" to a device in that group.
The referenced kernel series in the cover letter implements the linux guest behavior necessary to exploit this "no translation" case via the optional kernel parameter iommu.passthrough=1.
Hope that helps,
Matt
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/2] s390x/pci: relax I/O address translation requirement
2024-12-09 19:29 [PATCH 0/2] s390x/pci: relax I/O address translation requirement Matthew Rosato
2024-12-09 19:29 ` [PATCH 1/2] s390x/pci: add support for guests that request direct mapping Matthew Rosato
2024-12-09 19:29 ` [PATCH 2/2] s390x/pci: indicate QEMU supports relaxed translation for passthrough Matthew Rosato
@ 2024-12-12 9:10 ` Thomas Huth
2024-12-12 14:42 ` Matthew Rosato
2 siblings, 1 reply; 17+ messages in thread
From: Thomas Huth @ 2024-12-12 9:10 UTC (permalink / raw)
To: Matthew Rosato, qemu-s390x
Cc: farman, schnelle, pasic, borntraeger, richard.henderson, david,
iii, clegoate, qemu-devel
On 09/12/2024 20.29, Matthew Rosato wrote:
> This series introduces the concept of the relaxed translation requirement
> for s390x guests in order to allow bypass of the guest IOMMU for more
> efficient PCI passthrough.
>
> With this series, QEMU can indicate to the guest that an IOMMU is not
> strictly required for a zPCI device. This would subsequently allow a
> guest linux to use iommu.passthrough=1 and bypass their guest IOMMU for
> PCI devices.
>
> When this occurs, QEMU will note the behavior via an intercepted MPCIFC
> instruction and will fill the host iommu with mappings of the entire
> guest address space in response.
>
> There is a kernel series [1] that adds the relevant behavior needed to
> exploit this new feature from within a s390x linux guest.
>
> [1]: https://lore.kernel.org/linux-s390/20241209192403.107090-1-mjrosato@linux.ibm.com/
>
> Matthew Rosato (2):
> s390x/pci: add support for guests that request direct mapping
> s390x/pci: indicate QEMU supports relaxed translation for passthrough
Hi again!
One more thought: This is a guest-visible feature, isn't it? So do we also
need some migration handling for this? For example, what happens if you
start a guest that is aware of this feature on a host that has a QEMU with
this feature, and then try to live-migrate the guest to a QEMU that does not
have this feature? I guess the guest will crash? It would be better to fail
the migration instead. At least we should disable the feature in older
machine types and only allow it for the latest one.
Thomas
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/2] s390x/pci: relax I/O address translation requirement
2024-12-12 9:10 ` [PATCH 0/2] s390x/pci: relax I/O address translation requirement Thomas Huth
@ 2024-12-12 14:42 ` Matthew Rosato
2024-12-13 9:07 ` Cédric Le Goater
2024-12-13 9:24 ` Thomas Huth
0 siblings, 2 replies; 17+ messages in thread
From: Matthew Rosato @ 2024-12-12 14:42 UTC (permalink / raw)
To: Thomas Huth, qemu-s390x
Cc: farman, schnelle, pasic, borntraeger, richard.henderson, david,
iii, clegoate, qemu-devel
On 12/12/24 4:10 AM, Thomas Huth wrote:
> On 09/12/2024 20.29, Matthew Rosato wrote:
>> This series introduces the concept of the relaxed translation requirement
>> for s390x guests in order to allow bypass of the guest IOMMU for more
>> efficient PCI passthrough.
>>
>> With this series, QEMU can indicate to the guest that an IOMMU is not
>> strictly required for a zPCI device. This would subsequently allow a
>> guest linux to use iommu.passthrough=1 and bypass their guest IOMMU for
>> PCI devices.
>>
>> When this occurs, QEMU will note the behavior via an intercepted MPCIFC
>> instruction and will fill the host iommu with mappings of the entire
>> guest address space in response.
>>
>> There is a kernel series [1] that adds the relevant behavior needed to
>> exploit this new feature from within a s390x linux guest.
>>
>> [1]: https://lore.kernel.org/linux-s390/20241209192403.107090-1-mjrosato@linux.ibm.com/
>>
>> Matthew Rosato (2):
>> s390x/pci: add support for guests that request direct mapping
>> s390x/pci: indicate QEMU supports relaxed translation for passthrough
>
> Hi again!
>
> One more thought: This is a guest-visible feature, isn't it? So do we also need some migration handling for this? For example, what happens if you start a guest that is aware of this feature on a host that has a QEMU with this feature, and then try to live-migrate the guest to a QEMU that does not have this feature? I guess the guest will crash? It would be better to fail the migration instead. At least we should disable the feature in older machine types and only allow it for the latest one.
zPCI devices are currently marked as unmigratable in s390_pci_device_vmstate so it's not a reproducible issue yet.
Re: disabling the feature for older machines, OK -- Shall I fence similar to what we did for interpret/forwarding-assist with a new device property that is default to off on older machines ("relax-translation"? alternative suggestions welcome)
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/2] s390x/pci: relax I/O address translation requirement
2024-12-12 14:42 ` Matthew Rosato
@ 2024-12-13 9:07 ` Cédric Le Goater
2024-12-13 9:24 ` Thomas Huth
1 sibling, 0 replies; 17+ messages in thread
From: Cédric Le Goater @ 2024-12-13 9:07 UTC (permalink / raw)
To: Matthew Rosato, Thomas Huth, qemu-s390x
Cc: farman, schnelle, pasic, borntraeger, richard.henderson, david,
iii, qemu-devel
On 12/12/24 15:42, Matthew Rosato wrote:
> On 12/12/24 4:10 AM, Thomas Huth wrote:
>> On 09/12/2024 20.29, Matthew Rosato wrote:
>>> This series introduces the concept of the relaxed translation requirement
>>> for s390x guests in order to allow bypass of the guest IOMMU for more
>>> efficient PCI passthrough.
>>>
>>> With this series, QEMU can indicate to the guest that an IOMMU is not
>>> strictly required for a zPCI device. This would subsequently allow a
>>> guest linux to use iommu.passthrough=1 and bypass their guest IOMMU for
>>> PCI devices.
>>>
>>> When this occurs, QEMU will note the behavior via an intercepted MPCIFC
>>> instruction and will fill the host iommu with mappings of the entire
>>> guest address space in response.
>>>
>>> There is a kernel series [1] that adds the relevant behavior needed to
>>> exploit this new feature from within a s390x linux guest.
>>>
>>> [1]: https://lore.kernel.org/linux-s390/20241209192403.107090-1-mjrosato@linux.ibm.com/
>>>
>>> Matthew Rosato (2):
>>> s390x/pci: add support for guests that request direct mapping
>>> s390x/pci: indicate QEMU supports relaxed translation for passthrough
>>
>> Hi again!
>>
>> One more thought: This is a guest-visible feature, isn't it? So do we also need some migration handling for this? For example, what happens if you start a guest that is aware of this feature on a host that has a QEMU with this feature, and then try to live-migrate the guest to a QEMU that does not have this feature? I guess the guest will crash? It would be better to fail the migration instead. At least we should disable the feature in older machine types and only allow it for the latest one.
>
> zPCI devices are currently marked as unmigratable in s390_pci_device_vmstate so it's not a reproducible issue yet.
>
> Re: disabling the feature for older machines, OK -- Shall I fence similar to what we did for interpret/forwarding-assist with a new device property that is default to off on older machines ("relax-translation"? alternative suggestions welcome)
Looks good to me.
Thanks,
C.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/2] s390x/pci: relax I/O address translation requirement
2024-12-12 14:42 ` Matthew Rosato
2024-12-13 9:07 ` Cédric Le Goater
@ 2024-12-13 9:24 ` Thomas Huth
1 sibling, 0 replies; 17+ messages in thread
From: Thomas Huth @ 2024-12-13 9:24 UTC (permalink / raw)
To: Matthew Rosato, qemu-s390x
Cc: farman, schnelle, pasic, borntraeger, richard.henderson, david,
iii, clegoate, qemu-devel
On 12/12/2024 15.42, Matthew Rosato wrote:
> On 12/12/24 4:10 AM, Thomas Huth wrote:
>> On 09/12/2024 20.29, Matthew Rosato wrote:
>>> This series introduces the concept of the relaxed translation requirement
>>> for s390x guests in order to allow bypass of the guest IOMMU for more
>>> efficient PCI passthrough.
>>>
>>> With this series, QEMU can indicate to the guest that an IOMMU is not
>>> strictly required for a zPCI device. This would subsequently allow a
>>> guest linux to use iommu.passthrough=1 and bypass their guest IOMMU for
>>> PCI devices.
>>>
>>> When this occurs, QEMU will note the behavior via an intercepted MPCIFC
>>> instruction and will fill the host iommu with mappings of the entire
>>> guest address space in response.
>>>
>>> There is a kernel series [1] that adds the relevant behavior needed to
>>> exploit this new feature from within a s390x linux guest.
>>>
>>> [1]: https://lore.kernel.org/linux-s390/20241209192403.107090-1-mjrosato@linux.ibm.com/
>>>
>>> Matthew Rosato (2):
>>> s390x/pci: add support for guests that request direct mapping
>>> s390x/pci: indicate QEMU supports relaxed translation for passthrough
>>
>> Hi again!
>>
>> One more thought: This is a guest-visible feature, isn't it? So do we also need some migration handling for this? For example, what happens if you start a guest that is aware of this feature on a host that has a QEMU with this feature, and then try to live-migrate the guest to a QEMU that does not have this feature? I guess the guest will crash? It would be better to fail the migration instead. At least we should disable the feature in older machine types and only allow it for the latest one.
>
> zPCI devices are currently marked as unmigratable in s390_pci_device_vmstate so it's not a reproducible issue yet.
Ah, right, I forgot about that migration blocker, so we should be fine, indeed!
> Re: disabling the feature for older machines, OK -- Shall I fence similar to what we did for interpret/forwarding-assist with a new device property that is default to off on older machines ("relax-translation"? alternative suggestions welcome)
Sounds reasonable!
Thomas
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] s390x/pci: add support for guests that request direct mapping
2024-12-10 8:58 ` David Hildenbrand
@ 2024-12-13 22:46 ` Matthew Rosato
0 siblings, 0 replies; 17+ messages in thread
From: Matthew Rosato @ 2024-12-13 22:46 UTC (permalink / raw)
To: David Hildenbrand, qemu-s390x
Cc: farman, schnelle, thuth, pasic, borntraeger, richard.henderson,
iii, clegoate, qemu-devel, Alex Williamson
On 12/10/24 3:58 AM, David Hildenbrand wrote:
> Maybe there is a reason s390x needs to handle this using memory_region_notify_iommu() callbacks instead of doing it similar to "struct vfio_memory_listener" when registered on &address_space_memory without a viommu.
>
Hi David,
I think I sorted a way to handle this such that, when direct mapping, we use a memory region alias instead so that vfio can ultimately handle all of the pinning/unpinning in the non-iommu path of vfio_listener_region_add/del, just like it does for other platforms. But for s390 the alias is needed to provide the SDMA offset so as to ensure that e.g. GPA X maps to iova SDMA+X. Looks to be working nicely so far with my rework of the associated kernel series -- Going to send as part of v2, would appreciate it if you'd give that a look.
Thanks,
Matt
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2024-12-13 22:47 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-09 19:29 [PATCH 0/2] s390x/pci: relax I/O address translation requirement Matthew Rosato
2024-12-09 19:29 ` [PATCH 1/2] s390x/pci: add support for guests that request direct mapping Matthew Rosato
2024-12-09 21:01 ` David Hildenbrand
2024-12-09 21:45 ` Matthew Rosato
2024-12-09 22:09 ` David Hildenbrand
2024-12-09 23:22 ` Matthew Rosato
2024-12-10 8:58 ` David Hildenbrand
2024-12-13 22:46 ` Matthew Rosato
2024-12-11 11:34 ` Thomas Huth
2024-12-11 14:40 ` Cédric Le Goater
2024-12-11 15:17 ` Matthew Rosato
2024-12-09 19:29 ` [PATCH 2/2] s390x/pci: indicate QEMU supports relaxed translation for passthrough Matthew Rosato
2024-12-11 11:40 ` Thomas Huth
2024-12-12 9:10 ` [PATCH 0/2] s390x/pci: relax I/O address translation requirement Thomas Huth
2024-12-12 14:42 ` Matthew Rosato
2024-12-13 9:07 ` Cédric Le Goater
2024-12-13 9:24 ` Thomas Huth
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).