qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* MMIO through IOMMU from a TCG processor
@ 2025-10-08 12:38 Mark Burton
  2025-10-09  8:43 ` Mark Cave-Ayland
  0 siblings, 1 reply; 5+ messages in thread
From: Mark Burton @ 2025-10-08 12:38 UTC (permalink / raw)
  To: QEMU Developers, peterx, eric.auger, zhenzhong.duan,
	alejandro.j.jimenez, peter.maydell, jasowang, pbonzini, tjeznach,
	steven.sistare, clement.mathieu--drif, joao.m.martins,
	ean-philippe, jean-philippe, sarunkod, Phil Mathieu-Daudé

All, sorry for the wide CC, I’m trying to find somebody who understands this corner of the code… This is perhaps obscure, but I think it should work.
I am trying to access an MMIO region through an IOMMU, from TCG.
The IOMMU translation has provided an address space that is different from the CPU’s own address space.

In address_space_translate_for_iotlb the section is calculated using the address space provided by the IOMMU translation.
> d = flatview_to_dispatch(address_space_to_flatview(iotlb.target_as));
> 
Later, we come to do the actual access (via e.g. do_st_mmio_leN), and at this point we pick up the CPU’s address space in iotlb_to_section, which is different, and so the recorded section seems to be incorrect.

> CPUAddressSpace *cpuas = &cpu->cpu_ases[asidx];
> AddressSpaceDispatch *d = cpuas->memory_dispatch;
> int section_index = index & ~TARGET_PAGE_MASK;
> MemoryRegionSection *ret;
> 
> assert(section_index < d->map.sections_nb);
> ret = d->map.sections + section_index;

What I don’t fully understand is how this is supposed to work…?

Have I missed something obvious?

Cheers
Mark.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: MMIO through IOMMU from a TCG processor
  2025-10-08 12:38 MMIO through IOMMU from a TCG processor Mark Burton
@ 2025-10-09  8:43 ` Mark Cave-Ayland
  2025-10-09  9:10   ` Mark Burton
  0 siblings, 1 reply; 5+ messages in thread
From: Mark Cave-Ayland @ 2025-10-09  8:43 UTC (permalink / raw)
  To: Mark Burton, QEMU Developers, peterx, eric.auger, zhenzhong.duan,
	alejandro.j.jimenez, peter.maydell, jasowang, pbonzini, tjeznach,
	steven.sistare, clement.mathieu--drif, joao.m.martins,
	ean-philippe, jean-philippe, sarunkod, Phil Mathieu-Daudé

On 08/10/2025 13:38, Mark Burton wrote:

> All, sorry for the wide CC, I’m trying to find somebody who understands this corner of the code… This is perhaps obscure, but I think it should work.
> I am trying to access an MMIO region through an IOMMU, from TCG.
> The IOMMU translation has provided an address space that is different from the CPU’s own address space.
> 
> In address_space_translate_for_iotlb the section is calculated using the address space provided by the IOMMU translation.
>> d = flatview_to_dispatch(address_space_to_flatview(iotlb.target_as));
>>
> Later, we come to do the actual access (via e.g. do_st_mmio_leN), and at this point we pick up the CPU’s address space in iotlb_to_section, which is different, and so the recorded section seems to be incorrect.
> 
>> CPUAddressSpace *cpuas = &cpu->cpu_ases[asidx];
>> AddressSpaceDispatch *d = cpuas->memory_dispatch;
>> int section_index = index & ~TARGET_PAGE_MASK;
>> MemoryRegionSection *ret;
>>
>> assert(section_index < d->map.sections_nb);
>> ret = d->map.sections + section_index;
> 
> What I don’t fully understand is how this is supposed to work….?
> 
> Have I missed something obvious?
> 
> Cheers
> Mark.

What version of QEMU are you using? I'm wondering if you're getting 
caught out by a variant of this: 
https://gitlab.com/qemu-project/qemu/-/issues/3040.


ATB,

Mark.



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: MMIO through IOMMU from a TCG processor
  2025-10-09  8:43 ` Mark Cave-Ayland
@ 2025-10-09  9:10   ` Mark Burton
  2025-10-09 18:07     ` Pierrick Bouvier
  0 siblings, 1 reply; 5+ messages in thread
From: Mark Burton @ 2025-10-09  9:10 UTC (permalink / raw)
  To: Mark Cave-Ayland
  Cc: QEMU Developers, peterx, eric.auger, zhenzhong.duan,
	alejandro.j.jimenez, peter.maydell, jasowang, pbonzini, tjeznach,
	steven.sistare, clement.mathieu--drif, joao.m.martins,
	ean-philippe, jean-philippe, sarunkod, Phil Mathieu-Daudé,
	Pierrick Bouvier


(Adding Pierrick)
Thanks for getting back to me Mark.

I initially thought the same; I think I have seen that issue, and I have also taken that patch. However…

For MMIO access, as best I can tell, the initial calculation of the dispatch is based on the iotlb reported by the translate function (correct), while the subsequent use of the section number uses the dispatch table from the CPU’s address space… which gives you the wrong section.

I would very happily do a live debug with you (or anybody) if it would help… I’m more than willing to believe I’ve made a mistake, but I just don’t see how it’s supposed to work.

I have been looking at solutions, and right now I don’t see anything obvious. As best I can tell, we “lose” the address space that has been returned by the translate function - so either we would need a way to hold onto that, or we would have to re-call the function, or…
All of those options look really really nasty to me.

The issue is going to be systems where SMMUs are used all over the place, specifically in front of MMIO. (Memory works OK because we get the memory pointer itself and all is fine; the issue seems only to be with MMIO accesses through IOMMU regions.)

Cheers
Mark.


> On 9 Oct 2025, at 10:43, Mark Cave-Ayland <mark.caveayland@nutanix.com> wrote:
> 
> On 08/10/2025 13:38, Mark Burton wrote:
> 
>> All, sorry for the wide CC, I’m trying to find somebody who understands this corner of the code… This is perhaps obscure, but I think it should work.
>> I am trying to access an MMIO region through an IOMMU, from TCG.
>> The IOMMU translation has provided an address space that is different from the CPU’s own address space.
>> In address_space_translate_for_iotlb the section is calculated using the address space provided by the IOMMU translation.
>>> d = flatview_to_dispatch(address_space_to_flatview(iotlb.target_as));
>>> 
>> Later, we come to do the actual access (via e.g. do_st_mmio_leN), and at this point we pick up the CPU’s address space in iotlb_to_section, which is different, and so the recorded section seems to be incorrect.
>>> CPUAddressSpace *cpuas = &cpu->cpu_ases[asidx];
>>> AddressSpaceDispatch *d = cpuas->memory_dispatch;
>>> int section_index = index & ~TARGET_PAGE_MASK;
>>> MemoryRegionSection *ret;
>>> 
>>> assert(section_index < d->map.sections_nb);
>>> ret = d->map.sections + section_index;
>> What I don’t fully understand is how this is supposed to work….?
>> Have I missed something obvious?
>> Cheers
>> Mark.
> 
> What version of QEMU are you using? I'm wondering if you're getting caught out by a variant of this: https://gitlab.com/qemu-project/qemu/-/issues/3040.
> 
> 
> ATB,
> 
> Mark.
> 



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: MMIO through IOMMU from a TCG processor
  2025-10-09  9:10   ` Mark Burton
@ 2025-10-09 18:07     ` Pierrick Bouvier
  2025-10-10  8:53       ` Mark Burton
  0 siblings, 1 reply; 5+ messages in thread
From: Pierrick Bouvier @ 2025-10-09 18:07 UTC (permalink / raw)
  To: Mark Burton, Mark Cave-Ayland
  Cc: QEMU Developers, peterx, eric.auger, zhenzhong.duan,
	alejandro.j.jimenez, peter.maydell, jasowang, pbonzini, tjeznach,
	steven.sistare, clement.mathieu--drif, joao.m.martins,
	ean-philippe, jean-philippe, sarunkod, Phil Mathieu-Daudé

Hi Mark,

Regarding "recorded section therefore seems to be incorrect":
do you observe a crash, or an assert failing, at execution?

I don't know the code you mention in detail, but after investigating 
and fixing https://gitlab.com/qemu-project/qemu/-/issues/3040, I can 
share a few things.

Overall, what you describe looks like a race condition exposing a 
lifetime issue, especially when you say "we 'lose' the address space 
that has been returned by the translate function".
A value was not updated as expected and is out of sync, or was freed too 
early. Memory region lifetimes are definitely tricky in QEMU, and when 
you mix that with RCU, things can become very obscure in multithreaded 
scenarios.

In the bug above, the solution was to stop duplicating this information 
and to get it from the same source. The overhead of reading such atomic 
data is quite small, thanks to the use of RCU.
At KVM Forum, Paolo told me he had introduced this copy precisely to 
avoid issues, but the opposite happened in reality, which we both found 
quite funny.

Additional questions:
- At which point of execution does it happen? Is it during PCI device 
initialization, or when remapping specific memory sections?
- Is the bug deterministic or random? If random, does increasing the 
number of PCI devices attached increase the probability of hitting it?

Additional tools:
- If you observe a crash, build with ASan. If you get a use-after-free 
error, it's probably an issue with RCU cleaning things up before you 
expect. This is what I had in the bug mentioned above.
- If your assert fails, I recommend capturing the execution with rr 
(https://github.com/rr-debugger/rr), using chaos mode (rr record 
--chaos), which randomizes the scheduling of threads. I don't know if 
you're familiar with it, but it allows you to debug your execution 
"backward".
Once you have captured the faulty execution, you can reach the crash or 
faulty assert, then execute backward (reverse-continue) with a 
watchpoint set on the (correct) value that was updated in the meantime. 
This way you'll find the sequence that led to the desynchronization, and 
you'll have a good start on deducing the root cause.
- Spend some time making the crash/assert almost deterministic; it will 
save you time later, especially when implementing a possible fix and 
proving that it works.

I hope it helps.

Regards,
Pierrick

On 10/9/25 2:10 AM, Mark Burton wrote:
> 
> (Adding Pierrick)
> Thanks for getting back to me Mark.
> 
> I initially thought the same; I think I have seen that issue, and I have also taken that patch. However…
> 
> For MMIO access, as best I can tell, the initial calculation of the dispatch is based on the iotlb reported by the translate function (correct), while the subsequent use of the section number uses the dispatch table from the CPU’s address space… which gives you the wrong section.
> 
> I would very happily do a live debug with you (or anybody) if it would help… I’m more than willing to believe I’ve made a mistake, but I just don’t see how it’s supposed to work.
> 
> I have been looking at solutions, and right now I don’t see anything obvious. As best I can tell, we “lose” the address space that has been returned by the translate function - so either we would need a way to hold onto that, or we would have to re-call the function, or…
> All of those options look really really nasty to me.
> 
> The issue is going to be systems where SMMUs are used all over the place, specifically in front of MMIO. (Memory works OK because we get the memory pointer itself and all is fine; the issue seems only to be with MMIO accesses through IOMMU regions.)
> 
> Cheers
> Mark.
> 
> 
>> On 9 Oct 2025, at 10:43, Mark Cave-Ayland <mark.caveayland@nutanix.com> wrote:
>>
>> On 08/10/2025 13:38, Mark Burton wrote:
>>
>>> All, sorry for the wide CC, I’m trying to find somebody who understands this corner of the code… This is perhaps obscure, but I think it should work.
>>> I am trying to access an MMIO region through an IOMMU, from TCG.
>>> The IOMMU translation has provided an address space that is different from the CPU’s own address space.
>>> In address_space_translate_for_iotlb the section is calculated using the address space provided by the IOMMU translation.
>>>> d = flatview_to_dispatch(address_space_to_flatview(iotlb.target_as));
>>>>
>>> Later, we come to do the actual access (via e.g. do_st_mmio_leN), and at this point we pick up the CPU’s address space in iotlb_to_section, which is different, and so the recorded section seems to be incorrect.
>>>> CPUAddressSpace *cpuas = &cpu->cpu_ases[asidx];
>>>> AddressSpaceDispatch *d = cpuas->memory_dispatch;
>>>> int section_index = index & ~TARGET_PAGE_MASK;
>>>> MemoryRegionSection *ret;
>>>>
>>>> assert(section_index < d->map.sections_nb);
>>>> ret = d->map.sections + section_index;
>>> What I don’t fully understand is how this is supposed to work….?
>>> Have I missed something obvious?
>>> Cheers
>>> Mark.
>>
>> What version of QEMU are you using? I'm wondering if you're getting caught out by a variant of this: https://gitlab.com/qemu-project/qemu/-/issues/3040.
>>
>>
>> ATB,
>>
>> Mark.
>>
> 



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: MMIO through IOMMU from a TCG processor
  2025-10-09 18:07     ` Pierrick Bouvier
@ 2025-10-10  8:53       ` Mark Burton
  0 siblings, 0 replies; 5+ messages in thread
From: Mark Burton @ 2025-10-10  8:53 UTC (permalink / raw)
  To: Pierrick Bouvier
  Cc: Mark Cave-Ayland, QEMU Developers, peterx, eric.auger,
	zhenzhong.duan, alejandro.j.jimenez, peter.maydell, jasowang,
	pbonzini, tjeznach, steven.sistare, clement.mathieu--drif,
	joao.m.martins, ean-philippe, jean-philippe, sarunkod,
	Phil Mathieu-Daudé



> On 9 Oct 2025, at 20:07, Pierrick Bouvier <pierrick.bouvier@linaro.org> wrote:
> 
> Hi Mark,
> 
> Regarding "recorded section therefore seems to be incorrect":
> do you observe a crash, or an assert failing, at execution?

I guess it would depend on the exact simulation; in my case I get a random memory region instead of the device I’m looking for.


I no longer believe this is a race condition. To be clear on my setup (which I think is about to become more common, hence worth fixing):

I have

CPU —>”SMMU”—> MMIO

This is modelled as

CPU -> cpuas -> Memory MR (1)
             -> IOMMU MR (2) -> IO address space -> MMIO device MR (1)

What happens is that we “prepare the IO”, and the translation in the IOMMU returns the IO address space; we then (correctly, IMHO) calculate the correct MR from that address space (I’ve labelled it (1)).
However, when the CPU finally does the actual access, using the fully populated TLB entry, we use the CPU’s address space, in which MR (1) now relates to some random memory.

(BTW, note that if the device were actually memory, we would then pick up the memory pointer, and all (I think) would work correctly.)

I believe the solution is to carry the address space (pointer) in the full TLB entry. I have a patch for this which seems to fix the issue and allows MMIO to be accessed through an IOMMU, but it’s a bit messy and currently based on an old 9.x tree. I’ll try to modernise it, but it might take a little while. Meanwhile, the general gist of it is below, though it won’t apply cleanly to upstream (and I’ve taken the liberty of adding some of your patch here too, but, as I say, I am working on a stale branch).

Cheers
Mark.

Signed-off-by: Mark Burton <mburton@quicinc.com>
---
accel/tcg/cputlb.c | 21 +++++++++++++--------
include/exec/exec-all.h | 7 ++++---
system/physmem.c | 27 +++++++++++----------------
3 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 5138d6c1e0..7018665465 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1087,7 +1087,8 @@ void tlb_set_page_full(CPUState *cpu, int mmu_idx,
prot = full->prot;
asidx = cpu_asidx_from_attrs(cpu, full->attrs);
section = address_space_translate_for_iotlb(cpu, asidx, paddr_page,
- &xlat, &sz, full->attrs, &prot);
+ &xlat, &sz, &full->attrs,
+ &full->as, &prot);
assert(sz >= TARGET_PAGE_SIZE);
tlb_debug("vaddr=%016" VADDR_PRIx " paddr=0x" HWADDR_FMT_plx
@@ -1272,13 +1273,13 @@ static inline void cpu_unaligned_access(CPUState *cpu, vaddr addr,
}
static MemoryRegionSection *
-io_prepare(hwaddr *out_offset, CPUState *cpu, hwaddr xlat,
+io_prepare(hwaddr *out_offset, CPUState *cpu, AddressSpace *as, hwaddr xlat,
MemTxAttrs attrs, vaddr addr, uintptr_t retaddr)
{
MemoryRegionSection *section;
hwaddr mr_offset;
- section = iotlb_to_section(cpu, xlat, attrs);
+ section = iotlb_to_section(as, xlat, attrs);
mr_offset = (xlat & TARGET_PAGE_MASK) + addr;
cpu->mem_io_pc = retaddr;
if (!cpu->neg.can_do_io) {
@@ -1596,7 +1597,7 @@ bool tlb_plugin_lookup(CPUState *cpu, vaddr addr, int mmu_idx,
/* We must have an iotlb entry for MMIO */
if (tlb_addr & TLB_MMIO) {
MemoryRegionSection *section =
- iotlb_to_section(cpu, full->xlat_section & ~TARGET_PAGE_MASK,
+ iotlb_to_section(full->as, full->xlat_section & ~TARGET_PAGE_MASK,
full->attrs);
data->is_io = true;
data->mr = section->mr;
@@ -2008,7 +2009,8 @@ static uint64_t do_ld_mmio_beN(CPUState *cpu, CPUTLBEntryFull *full,
tcg_debug_assert(size > 0 && size <= 8);
attrs = full->attrs;
- section = io_prepare(&mr_offset, cpu, full->xlat_section, attrs, addr, ra);
+ section = io_prepare(&mr_offset, cpu, full->as,
+ full->xlat_section, attrs, addr, ra);
mr = section->mr;
BQL_LOCK_GUARD();
@@ -2029,7 +2031,8 @@ static Int128 do_ld16_mmio_beN(CPUState *cpu, CPUTLBEntryFull *full,
tcg_debug_assert(size > 8 && size <= 16);
attrs = full->attrs;
- section = io_prepare(&mr_offset, cpu, full->xlat_section, attrs, addr, ra);
+ section = io_prepare(&mr_offset, cpu, full->as,
+ full->xlat_section, attrs, addr, ra);
mr = section->mr;
BQL_LOCK_GUARD();
@@ -2573,7 +2576,8 @@ static uint64_t do_st_mmio_leN(CPUState *cpu, CPUTLBEntryFull *full,
tcg_debug_assert(size > 0 && size <= 8);
attrs = full->attrs;
- section = io_prepare(&mr_offset, cpu, full->xlat_section, attrs, addr, ra);
+ section = io_prepare(&mr_offset, cpu, full->as,
+ full->xlat_section, attrs, addr, ra);
mr = section->mr;
BQL_LOCK_GUARD();
@@ -2593,7 +2597,8 @@ static uint64_t do_st16_mmio_leN(CPUState *cpu, CPUTLBEntryFull *full,
tcg_debug_assert(size > 8 && size <= 16);
attrs = full->attrs;
- section = io_prepare(&mr_offset, cpu, full->xlat_section, attrs, addr, ra);
+ section = io_prepare(&mr_offset, cpu, full->as,
+ full->xlat_section, attrs, addr, ra);
mr = section->mr;
BQL_LOCK_GUARD();
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 72240ef426..305323caff 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -475,14 +475,14 @@ extern __thread uintptr_t tci_tb_ptr;
/**
* iotlb_to_section:
- * @cpu: CPU performing the access
+ * @as: Address space to access
* @index: TCG CPU IOTLB entry
*
* Given a TCG CPU IOTLB entry, return the MemoryRegionSection that
* it refers to. @index will have been initially created and returned
* by memory_region_section_get_iotlb().
*/
-struct MemoryRegionSection *iotlb_to_section(CPUState *cpu,
+struct MemoryRegionSection *iotlb_to_section(AddressSpace *as,
hwaddr index, MemTxAttrs attrs);
#endif
@@ -596,7 +596,8 @@ void tlb_reset_dirty_range_all(ram_addr_t start, ram_addr_t length);
MemoryRegionSection *
address_space_translate_for_iotlb(CPUState *cpu, int asidx, hwaddr addr,
hwaddr *xlat, hwaddr *plen,
- MemTxAttrs attrs, int *prot);
+ MemTxAttrs *attrs, AddressSpace **,
+ int *prot);
hwaddr memory_region_section_get_iotlb(CPUState *cpu,
MemoryRegionSection *section);
#endif
diff --git a/system/physmem.c b/system/physmem.c
index 8fbb8407de..8aec2f2131 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -164,7 +164,6 @@ static void tcg_commit(MemoryListener *listener);
typedef struct CPUAddressSpace {
CPUState *cpu;
AddressSpace *as;
- struct AddressSpaceDispatch *memory_dispatch;
MemoryListener tcg_as_listener;
} CPUAddressSpace;
@@ -679,7 +678,8 @@ void tcg_iommu_init_notifier_list(CPUState *cpu)
MemoryRegionSection *
address_space_translate_for_iotlb(CPUState *cpu, int asidx, hwaddr orig_addr,
hwaddr *xlat, hwaddr *plen,
- MemTxAttrs attrs, int *prot)
+ MemTxAttrs *attrs, AddressSpace **as,
+ int *prot)
{
MemoryRegionSection *section;
IOMMUMemoryRegion *iommu_mr;
@@ -687,7 +687,8 @@ address_space_translate_for_iotlb(CPUState *cpu, int asidx, hwaddr orig_addr,
IOMMUTLBEntry iotlb;
int iommu_idx;
hwaddr addr = orig_addr;
- AddressSpaceDispatch *d = cpu->cpu_ases[asidx].memory_dispatch;
+ *as = cpu->cpu_ases[asidx].as;
+ AddressSpaceDispatch *d = address_space_to_dispatch(*as);
for (;;) {
section = address_space_translate_internal(d, addr, &addr, plen, false);
@@ -699,13 +700,13 @@ address_space_translate_for_iotlb(CPUState *cpu, int asidx, hwaddr orig_addr,
imrc = memory_region_get_iommu_class_nocheck(iommu_mr);
- iommu_idx = imrc->attrs_to_index(iommu_mr, attrs);
+ iommu_idx = imrc->attrs_to_index(iommu_mr, *attrs);
tcg_register_iommu_notifier(cpu, iommu_mr, iommu_idx);
/* We need all the permissions, so pass IOMMU_NONE so the IOMMU
* doesn't short-cut its translation table walk.
*/
if (imrc->translate_attr) {
- iotlb = imrc->translate_attr(iommu_mr, addr, IOMMU_NONE, &attrs);
+ iotlb = imrc->translate_attr(iommu_mr, addr, IOMMU_NONE, attrs);
} else {
iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, iommu_idx);
}
@@ -726,7 +727,8 @@ address_space_translate_for_iotlb(CPUState *cpu, int asidx, hwaddr orig_addr,
goto translate_fail;
}
- d = flatview_to_dispatch(address_space_to_flatview(iotlb.target_as));
+ *as = iotlb.target_as;
+ d = flatview_to_dispatch(address_space_to_flatview(*as));
}
assert(!memory_region_is_iommu(section->mr));
@@ -2529,12 +2531,11 @@ static uint16_t dummy_section(PhysPageMap *map, FlatView *fv, MemoryRegion *mr)
return phys_section_add(map, &section);
}
-MemoryRegionSection *iotlb_to_section(CPUState *cpu,
+MemoryRegionSection *iotlb_to_section(AddressSpace *as,
hwaddr index, MemTxAttrs attrs)
{
- int asidx = cpu_asidx_from_attrs(cpu, attrs);
- CPUAddressSpace *cpuas = &cpu->cpu_ases[asidx];
- AddressSpaceDispatch *d = cpuas->memory_dispatch;
+ assert(as);
+ AddressSpaceDispatch *d = address_space_to_dispatch(as);
int section_index = index & ~TARGET_PAGE_MASK;
MemoryRegionSection *ret;
@@ -2611,9 +2612,6 @@ static void tcg_log_global_after_sync(MemoryListener *listener)
static void tcg_commit_cpu(CPUState *cpu, run_on_cpu_data data)
{
- CPUAddressSpace *cpuas = data.host_ptr;
-
- cpuas->memory_dispatch = address_space_to_dispatch(cpuas->as);
tlb_flush(cpu);
}
@@ -2669,10 +2667,7 @@ static void tcg_commit(MemoryListener *listener)
* That said, the listener is also called during realize, before
* all of the tcg machinery for run-on is initialized: thus halt_cond.
*/
- AddressSpaceDispatch *d;
cpu_reloading_memory_map();
- d = address_space_to_dispatch(cpuas->as);
- qatomic_rcu_set(&cpuas->memory_dispatch, d);
tlb_flush(cpuas->cpu);
if (cpu->halt_cond) {
async_run_on_cpu(cpu, tcg_commit_cpu, RUN_ON_CPU_HOST_PTR(cpuas));
-- 
2.51.0





> 
> I don't know the code you mention in detail, but after investigating and fixing https://gitlab.com/qemu-project/qemu/-/issues/3040, I can share a few things.
> 
> Overall, what you describe looks like a race condition exposing a lifetime issue, especially when you say "we 'lose' the address space that has been returned by the translate function".
> A value was not updated as expected and is out of sync, or was freed too early. Memory region lifetimes are definitely tricky in QEMU, and when you mix that with RCU, things can become very obscure in multithreaded scenarios.
> 
> In the bug above, the solution was to stop duplicating this information and to get it from the same source. The overhead of reading such atomic data is quite small, thanks to the use of RCU.
> At KVM Forum, Paolo told me he had introduced this copy precisely to avoid issues, but the opposite happened in reality, which we both found quite funny.
> 
> Additional questions:
> - At which point of execution does it happen? Is it during PCI device initialization, or when remapping specific memory sections?
> - Is the bug deterministic or random? If random, does increasing the number of PCI devices attached increase the probability of hitting it?
> 
> Additional tools:
> - If you observe a crash, build with ASan. If you get a use-after-free error, it's probably an issue with RCU cleaning things up before you expect. This is what I had in the bug mentioned above.
> - If your assert fails, I recommend capturing the execution with rr (https://github.com/rr-debugger/rr), using chaos mode (rr record --chaos), which randomizes the scheduling of threads. I don't know if you're familiar with it, but it allows you to debug your execution "backward".
> Once you have captured the faulty execution, you can reach the crash or faulty assert, then execute backward (reverse-continue) with a watchpoint set on the (correct) value that was updated in the meantime. This way you'll find the sequence that led to the desynchronization, and you'll have a good start on deducing the root cause.
> - Spend some time making the crash/assert almost deterministic; it will save you time later, especially when implementing a possible fix and proving that it works.
> 
> I hope it helps.
> 
> Regards,
> Pierrick
> 
> On 10/9/25 2:10 AM, Mark Burton wrote:
>> (Adding Pierrick)
>> Thanks for getting back to me Mark.
>> I initially thought the same; I think I have seen that issue, and I have also taken that patch. However…
>> For MMIO access, as best I can tell, the initial calculation of the dispatch is based on the iotlb reported by the translate function (correct), while the subsequent use of the section number uses the dispatch table from the CPU’s address space… which gives you the wrong section.
>> I would very happily do a live debug with you (or anybody) if it would help… I’m more than willing to believe I’ve made a mistake, but I just don’t see how it’s supposed to work.
>> I have been looking at solutions, and right now I don’t see anything obvious. As best I can tell, we “lose” the address space that has been returned by the translate function - so either we would need a way to hold onto that, or we would have to re-call the function, or…
>> All of those options look really really nasty to me.
>> The issue is going to be systems where SMMUs are used all over the place, specifically in front of MMIO. (Memory works OK because we get the memory pointer itself and all is fine; the issue seems only to be with MMIO accesses through IOMMU regions.)
>> Cheers
>> Mark.
>>> On 9 Oct 2025, at 10:43, Mark Cave-Ayland <mark.caveayland@nutanix.com> wrote:
>>> 
>>> On 08/10/2025 13:38, Mark Burton wrote:
>>> 
>>>> All, sorry for the wide CC, I’m trying to find somebody who understands this corner of the code… This is perhaps obscure, but I think it should work.
>>>> I am trying to access an MMIO region through an IOMMU, from TCG.
>>>> The IOMMU translation has provided an address space that is different from the CPU’s own address space.
>>>> In address_space_translate_for_iotlb the section is calculated using the address space provided by the IOMMU translation.
>>>>> d = flatview_to_dispatch(address_space_to_flatview(iotlb.target_as));
>>>>> 
>>>> Later, we come to do the actual access (via e.g. do_st_mmio_leN), and at this point we pick up the CPU’s address space in iotlb_to_section, which is different, and so the recorded section seems to be incorrect.
>>>>> CPUAddressSpace *cpuas = &cpu->cpu_ases[asidx];
>>>>> AddressSpaceDispatch *d = cpuas->memory_dispatch;
>>>>> int section_index = index & ~TARGET_PAGE_MASK;
>>>>> MemoryRegionSection *ret;
>>>>> 
>>>>> assert(section_index < d->map.sections_nb);
>>>>> ret = d->map.sections + section_index;
>>>> What I don’t fully understand is how this is supposed to work….?
>>>> Have I missed something obvious?
>>>> Cheers
>>>> Mark.
>>> 
>>> What version of QEMU are you using? I'm wondering if you're getting caught out by a variant of this: https://gitlab.com/qemu-project/qemu/-/issues/3040.
>>> 
>>> 
>>> ATB,
>>> 
>>> Mark.
>>> 
> 



^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-10-10  8:55 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-08 12:38 MMIO through IOMMU from a TCG processor Mark Burton
2025-10-09  8:43 ` Mark Cave-Ayland
2025-10-09  9:10   ` Mark Burton
2025-10-09 18:07     ` Pierrick Bouvier
2025-10-10  8:53       ` Mark Burton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).