Memory corruption bug with Xen PV Dom0 and BOSS-S1 RAID card

* Memory corruption bug with Xen PV Dom0 and BOSS-S1 RAID card
@ 2025-02-17 20:19 Paweł Srokosz
  2025-02-18  9:44 ` Roger Pau Monné
  0 siblings, 1 reply; 9+ messages in thread
From: Paweł Srokosz @ 2025-02-17 20:19 UTC (permalink / raw)
  To: xen-devel

Hello everyone,

For the past few months I have been struggling with a very strange memory
corruption issue on a Xen PV Dom0 whose storage is backed by a BOSS-S1 RAID-1
card. I noticed it when I copied a huge ISO file onto the Dom0 file system and
tried to use it for a DomU installation. Everything was fine while its contents
were cached in memory, but after I rebooted the system and the file was read
again, some parts of the file had changed their contents. At the same time,
fsck doesn't report any problems with the filesystem.

I'm able to reproduce it only when reading and writing files on storage backed
by two SSDs in hardware RAID-1 (BOSS-S1 SATA AHCI RAID-1 fw ver. 2.5.13.3024)
and only under Xen PV. Without Xen, or with a PVH Dom0, everything works
correctly. I have reproduced the bug on three servers with the same
hardware/software specification:

- Platform: Dell PowerEdge R640
- CPU: 1x Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
- RAM: 4x Multi-bit ECC DDR-4 32 GB
- Storage:
 - 2xSSD 240GiB with BOSS-S1 SATA AHCI RAID-1 fw ver. 2.5.13.3024 (where
 files get corrupted)
 - 1xSSD 4TiB SAS PERC H330 Mini JBOD fw ver. 25.5.9.0001 (where files
 don't get corrupted)

I reproduced the same situation by writing a file, flushing dirty pages to
the storage (`sync`) and dropping cached pages (`echo 3 > /proc/sys/vm/drop_caches`).

```
# sha256sum Win10_22H2_Polish_x64v1.iso
96aad9e4b20b6e3f5fea40b981263e854f6c879472369d5ce8324aae1f6b7556 Win10_22H2_Polish_x64v1.iso

# echo 3 > /proc/sys/vm/drop_caches

# sha256sum Win10_22H2_Polish_x64v1.iso
0ba05ee38c0f2755bce4ccdf6b389963d9177b261505cbc2b41f8198e9f3bc60 Win10_22H2_Polish_x64v1.iso

# echo 3 > /proc/sys/vm/drop_caches

# sha256sum Win10_22H2_Polish_x64v1.iso
972a7f363e48b72a612efe85cc6a2c8ce7314858ec0e7ef08d9d7578c9a10ddc Win10_22H2_Polish_x64v1.iso
```

The same effect occurred on two other machines with the same hardware. Only
files written under Xen PV Dom0 were affected. When these files (written under
Xen Dom0) were read without Xen, they were consistently corrupted in the
affected parts. When they were read under Xen, the corruption changed every
time we dropped the page cache.

I found that the file is corrupted along 4 kB page boundaries, so it looked
like a memory issue. I therefore wrote a script that writes a huge file with a
specific pattern in each 4 kB block (matching the page size) and, after
flush/drop_caches, mmaps the file and checks the integrity of each block.
When a block mismatch occurs, it prints the VA and GFN from
`/proc/<pid>/pagemap` (using https://github.com/dwks/pagemap). Each page is
filled with numbers derived from the file offset, so I'm able to correlate
the contents with a specific offset in case they're shifted or out of
order.
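
The script itself is not included in this thread. A minimal sketch of such a
checker could look like the block below. It assumes that every 4 kB block is
filled with its own file offset repeated as 64-bit little-endian words, and
that the file size is a multiple of the block size; the names `make_block`,
`write_pattern` and `verify_pattern` are hypothetical. It does not read
`/proc/<pid>/pagemap`, and the page cache still has to be dropped by hand
(`sync`, then `echo 3 > /proc/sys/vm/drop_caches`) between writing and
verifying.

```python
import mmap
import os
import struct

BLOCK = 4096  # assumed block size, chosen to match the 4 kB page size


def make_block(offset):
    """Return one block filled with its own file offset (assumed pattern)."""
    return struct.pack("<Q", offset) * (BLOCK // 8)


def write_pattern(path, blocks):
    """Write `blocks` pattern blocks and flush them to stable storage."""
    with open(path, "wb") as f:
        for i in range(blocks):
            f.write(make_block(i * BLOCK))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path):
    """Return (expected_offset, offset_found) for every mismatching block."""
    mismatches = []
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ) as m:
            for off in range(0, size, BLOCK):
                data = m[off:off + BLOCK]
                if data != make_block(off):
                    # The first word tells which offset the data belongs to.
                    found = struct.unpack_from("<Q", data)[0]
                    mismatches.append((off, found))
    return mismatches
```

Because each block encodes its own offset, a mismatch immediately shows which
other block (if any) the data came from, which is what makes the
"Block mismatch X read Y" style of report possible.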

In terms of file offset, the corrupted region usually extends up to an offset
whose low 16 bits are 0xffff, e.g. mismatched blocks can be found within these
file offset ranges:
- 0x248f000-0x248ffff
- 0xd4944000-0xd494ffff
- 0xc1fb000-0xc1fffff
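
The alignment of the three ranges above can be checked with a few lines of
Python. This is only a sanity check of the arithmetic, assuming a 64 KiB
window (0x10000 bytes); the helper names are hypothetical.

```python
WINDOW = 0x10000  # 64 KiB, i.e. offsets whose low 16 bits run 0x0000..0xffff
BLOCK = 0x1000    # 4 kB block size used by the test pattern

# File offset ranges of mismatched blocks, copied from the list above.
ranges = [
    (0x248f000, 0x248ffff),
    (0xd4944000, 0xd494ffff),
    (0xc1fb000, 0xc1fffff),
]


def window_of(offset):
    """Return the start of the 64 KiB window that contains `offset`."""
    return offset & ~(WINDOW - 1)


def ends_on_window_boundary(start, end):
    """True if the range stays in one window and runs up to its last byte."""
    same_window = window_of(start) == window_of(end)
    return same_window and (end & (WINDOW - 1)) == WINDOW - 1


for start, end in ranges:
    blocks = (end + 1 - start) // BLOCK
    print(hex(start), hex(end), blocks, ends_on_window_boundary(start, end))
```

Each listed range stays inside a single 64 KiB window and runs up to the last
byte of that window, while the starting block within the window varies.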

My wild guess is that 0xffff is a Linux boundary for readahead operations.
When I load two or more files into the page cache, I start to see some
patterns in the Dom0 Linux PFN (GFN?):

```
Block mismatch 0x4f577000 read -0x1
7f664ec00000-7f6742e40000 r--s 00000000 fe:00 397029 /home/pawelsr/testfile1
 00007f669e177000 pm a18000000030e50c pc 0000000000000001 pf 000000040000082c cg 0000000000000be5

<... redacted series of similar entries for ...8000, ...9000, ...a000>

Block mismatch 0x4f57f000 read -0x1
7f664ec00000-7f6742e40000 r--s 00000000 fe:00 397029 /home/pawelsr/testfile1
 00007f669e17f000 pm a18000000030e514 pc 0000000000000001 pf 000000040000082c cg 0000000000000be5
```

```
Block mismatch 0xc1fb000 read 0x4f577000
7f0552600000-7f0646840000 r--s 00000000 fe:00 399642 /home/pawelsr/testfile2
 00007f055e7fb000 pm a18000000020e50c pc 0000000000000001 pf 000000040000082c cg 0000000000000be5

<... redacted series of similar entries for ...c000, ...d000, ...e000>

Block mismatch 0xc203000 read 0x4f57f000
7f0552600000-7f0646840000 r--s 00000000 fe:00 399642 /home/pawelsr/testfile2
 00007f055e803000 pm a18000000020e514 pc 0000000000000001 pf 000000040000082c cg 0000000000000be5
```

which means that when I try to read from GFNs `20e50c-20e514`, I'm getting
contents that should land in GFNs `30e50c-30e514`. On the other hand,
`30e50c-30e514` contain only zeroes, although sometimes I see something that
looks like a random portion of some other memory. When I'm able to correlate
the contents, they very often come from a GFN offset by a multiple of 0x100000.
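
The frame numbers can be recovered from the `pm` values in the dumps above. The
sketch below assumes the standard Linux pagemap entry layout, in which bits
0-54 hold the page frame number of a present page; `pagemap_pfn` is a
hypothetical helper name.

```python
PFN_MASK = (1 << 55) - 1  # bits 0-54 of a pagemap entry hold the PFN
PAGE_SIZE = 4096


def pagemap_pfn(entry):
    """Extract the page frame number from a 64-bit pagemap entry."""
    return entry & PFN_MASK


# "pm" values of the first mismatching block in testfile2 and testfile1.
misread = pagemap_pfn(0xa18000000020e50c)  # page that returned wrong data
source = pagemap_pfn(0xa18000000030e50c)   # page the data should be in

delta = source - misread
print(hex(misread), hex(source), hex(delta))
print(delta * PAGE_SIZE // 2**30, "GiB apart")
```

With 4 kB pages, a distance of 0x100000 frames corresponds to exactly 4 GiB of
guest-physical address space.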

The corruption isn't limited to the page cache; it makes the whole system
unstable and from time to time results in a kernel panic or random
segmentation faults. It's also not easy to reproduce: I need to read/write a
lot of blocks to trigger it, and the bug appears to be timing-sensitive.

All three servers behave the same, and it doesn't look like the problem is
caused by a simple hardware issue. All health checks and tests of
RAM/storage/other components pass.

Our BOSS-S1 PCI card uses the following SATA controller: Marvell Technology
Group Ltd. 88SE9230 PCIe 2.0 x2 4-port SATA 6 Gb/s RAID Controller (rev 11).
There are well-known problems with this family of controllers and Linux
contains a fixup for DMA function 1
(https://github.com/torvalds/linux/blob/2408a807bfc3f738850ef5ad5e3fd59d66168996/drivers/pci/quirks.c#L4316).
This hardware bug is known to cause issues on Xen with the IOMMU
(https://github.com/QubesOS/qubes-issues/issues/5968). I'm not sure whether
it's related and causes problems with PV as well.

By testing the bug under different conditions I also spotted a few more
correlations:

- the bug occurs on Xen PV Dom0 and was reproduced on Xen versions from 4.16.0
 to 4.19.2-pre (up to git:4803a3c5b5 from stable-4.19) and on Debian 10 to 12
 (both stable and backports kernels). Somehow that specific commit,
 git:4803a3c5b5, makes the bug harder to trigger, but that may be just a
 coincidence.
- I was unable to reproduce it when Xen was compiled from the master branch,
 but once again I'm not sure whether that was just bad timing for triggering
 the bug.
- the bug occurs only on an ext4 file system on hardware RAID backed by BOSS-S1
- the bug DOESN'T occur without Xen
- the bug DOESN'T occur on Xen PVH Dom0
- the bug DOESN'T occur on Xen PV Dom0 when Xen is compiled without `NDEBUG`
 defined for the file `xen/arch/x86/pv/dom0_build.c`. When I played with it, I
 found that I'm unable to reproduce the issue when the code that reverses
 the MFN<->PFN mapping for Dom0 is active.
- the bug DOESN'T occur when using storage other than the one backed by BOSS-S1.
- the bug was tested under a few additional conditions, and reproduction does
 not depend on any of them:
 - -O1/-O2/no optimization behaves the same
 - the PAT patch that uses the Linux PAT layout instead of Xen's choice doesn't
 change anything
 (https://github.com/QubesOS/qubes-vmm-xen/blob/main/1018-x86-Use-Linux-s-PAT.patch)
 - a different Linux kernel version doesn't change anything
 - vCPU pinning (e.g. a single vCPU pinned to Dom0) doesn't change anything
- the bug was tested only with smt=1 because Xen doesn't boot properly on our
 machines with smt=0 (it hangs with "(XEN) CPU X still not dead", similar to
 https://lists.xen.org/archives/html/xen-devel/2019-08/msg00138.html)

`xl info` for my testbed:

```
# xl info
host : <redacted>
release : 6.12.9+bpo-amd64
version : #1 SMP PREEMPT_DYNAMIC Debian 6.12.9-1~bpo12+1 (2025-01-19)
machine : x86_64
nr_cpus : 28
max_cpu_id : 27
nr_nodes : 1
cores_per_socket : 14
threads_per_core : 2
cpu_mhz : 2593.905
hw_caps : bfebfbff:77fef3ff:2c100800:00000121:0000000f:d29ffffb:00000008:00000100
virt_caps : pv hvm hvm_directio pv_directio hap shadow iommu_hap_pt_share vmtrace gnttab-v1 gnttab-v2
total_memory : 130562
free_memory : 96501
sharing_freed_memory : 0
sharing_used_memory : 0
outstanding_claims : 0
free_cpus : 0
xen_major : 4
xen_minor : 19
xen_extra : .2-pre
xen_version : 4.19.2-pre
xen_caps : xen-3.0-x86_64 hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler : credit2
xen_pagesize : 4096
platform_params : virt_start=0xffff800000000000
xen_changeset : Tue Jan 21 09:21:01 2025 +0100 git:4803a3c5b5
xen_commandline : placeholder dom0_mem=32G,max:32G dom0_max_vcpus=16 dom0_vcpus_pin=1 no-real-mode edd=off
cc_compiler : gcc (Debian 12.2.0-14) 12.2.0
cc_compile_by : root
cc_compile_domain : <redacted>
cc_compile_date : Mon Feb 17 17:31:08 UTC 2025
build_id : 410ba653f1f1fc13770b5d2a8cdf5e4d285b6783
xend_config_format : 4
```

After collecting all of this information I've hit a roadblock. The effects of
this bug on memory consistency are pretty serious, but on the other hand they
occur only under very specific conditions, which makes them difficult to track
down. I would appreciate any help in finding the root cause of this issue.


* Re: Memory corruption bug with Xen PV Dom0 and BOSS-S1 RAID card
  2025-02-17 20:19 Memory corruption bug with Xen PV Dom0 and BOSS-S1 RAID card Paweł Srokosz
@ 2025-02-18  9:44 ` Roger Pau Monné
  2025-02-19 18:37   ` Paweł Srokosz
  0 siblings, 1 reply; 9+ messages in thread
From: Roger Pau Monné @ 2025-02-18  9:44 UTC (permalink / raw)
  To: Paweł Srokosz; +Cc: xen-devel, jgross, andrew.cooper3, JBeulich

On Mon, Feb 17, 2025 at 09:19:41PM +0100, Paweł Srokosz wrote:
> Hello everyone,

Adding the x86 maintainers plus the Linux Xen maintainer to the email.

> for few months I'm struggling with a very weird memory corruption issue in 
> Xen PV Dom0 and storage backed by BOSS-S1 RAID-1 card. I noticed it when I tried 
> to copy huge ISO file on Dom0 file system and use it for DomU installation.
> Everything was fine while its contents were cached in the memory, but when I
> rebooted the system and file was read, some parts of the file changed their
> contents. In the same time fsck doesn't report any problems with the
> filesystem.
> 
> In the same time I'm able to reproduce it only when I'm reading and writing
> files onto storage backed by two SSDs in hardware RAID-1 (BOSS-S1 SATA AHCI
> RAID-1 fw ver. 2.5.13.3024) and only under Xen PV. Without Xen or with PVH Dom0 
> everything works correctly. I have reproduced the bug on three servers
> with the same hardware/software specification:

We had similar reports, and IIRC also with software RAID.

> - Platform: Dell PowerEdge R640
> - CPU: 1x Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
> - RAM: 4x Multi-bit ECC DDR-4 32 GB
> - Storage:
>  - 2xSSD 240GiB with BOSS-S1 SATA AHCI RAID-1 fw ver. 2.5.13.3024 (where
>  files corrupt)
>  - 1xSSD 4TiB SAS PERC H330 Mini JBOD fw ver. 25.5.9.0001 (where files
>  doesn't corrupt)
> 
> I reproduced the same situation by writing a file, flushing dirty pages to
> the storage (`sync`) and dropping cached pages (`echo 3 > /proc/sys/vm/drop_caches`).
> 
> ```
> # sha256sum Win10_22H2_Polish_x64v1.iso
> 96aad9e4b20b6e3f5fea40b981263e854f6c879472369d5ce8324aae1f6b7556 Win10_22H2_Polish_x64v1.iso
> 
> # echo 3 > /proc/sys/vm/drop_caches
> 
> # sha256sum Win10_22H2_Polish_x64v1.iso
> 0ba05ee38c0f2755bce4ccdf6b389963d9177b261505cbc2b41f8198e9f3bc60 Win10_22H2_Polish_x64v1.iso
> 
> # echo 3 > /proc/sys/vm/drop_caches
> 
> # sha256sum Win10_22H2_Polish_x64v1.iso
> 972a7f363e48b72a612efe85cc6a2c8ce7314858ec0e7ef08d9d7578c9a10ddc Win10_22H2_Polish_x64v1.iso
> ```
> 
> Same effect occurred on two other machines with the same hardware. Only files
> written under Xen PV Dom0 were affected. When these files (written under Xen Dom0)
> were read without Xen, they were consistently corrupted in affected
> parts. When these files are read with Xen, the corruption changed every time
> we dropped the page cache.
> 
> I found that file is corrupted within 4kB page boundaries, so it looked like
> memory issue. So I wrote a script that writes a huge file with specific
> pattern on each 4kB block (matching the page size) and after
> flush/drop_caches, it mmap's the file and checks the integrity of each block.
> When block mismatch occurs, it prints the VA and GFN from
> `/proc/<pid>/pagemap` (using https://github.com/dwks/pagemap). Each page is
> filled with numbers depending on the file offset, so I'm able to correlate
> the contents with the specific offset in case they're shifted or out of
> order.
> 
> In terms of file offset, corruption is usually aligned up to 0xffff boundary
> e.g. mismatched blocks can be found within these file offset ranges:
> - 0x248f000-0x248ffff
> - 0xd4944000-0xd494ffff
> - 0xc1fb000-0xc1fffff
> 
> My wild guess is that 0xffff is a Linux boundary for readahead operation.
> When I try to load two or more files into page cache, I start to see some
> patterns on Dom0 Linux PFN (GFN?):
> 
> ```
> Block mismatch 0x4f577000 read -0x1
> 7f664ec00000-7f6742e40000 r--s 00000000 fe:00 397029 /home/pawelsr/testfile1
>  00007f669e177000 pm a18000000030e50c pc 0000000000000001 pf 000000040000082c cg 0000000000000be5
> 
> <... redacted series of similar entries for ...8000, ...9000, ...a000>
> 
> Block mismatch 0x4f57f000 read -0x1
> 7f664ec00000-7f6742e40000 r--s 00000000 fe:00 397029 /home/pawelsr/testfile1
>  00007f669e17f000 pm a18000000030e514 pc 0000000000000001 pf 000000040000082c cg 0000000000000be5
> ```
> 
> ```
> Block mismatch 0xc1fb000 read 0x4f577000
> 7f0552600000-7f0646840000 r--s 00000000 fe:00 399642 /home/pawelsr/testfile2
>  00007f055e7fb000 pm a18000000020e50c pc 0000000000000001 pf 000000040000082c cg 0000000000000be5
> 
> <... redacted series of similar entries for ...c000, ...d000, ...e000>
> 
> Block mismatch 0xc203000 read 0x4f57f000
> 7f0552600000-7f0646840000 r--s 00000000 fe:00 399642 /home/pawelsr/testfile2
>  00007f055e803000 pm a18000000020e514 pc 0000000000000001 pf 000000040000082c cg 0000000000000be5
> ```
> 
> which means that when I try to read from `20e50c-20e514` GFN, I'm getting
> contents that should land in `30e50c-30e514` GFN. On the other hand
> `30e50c-30e514` contain only zeroes, but sometimes I see something that looks
> like a random portion of some memory. When I'm able to correlate the
> contents, very often it comes from GFN offseted by multiply of 0x100000.
> 
> Corruption isn't limited to page cache but makes whole system unstable and
> from time to time results in kernel panic or random segmentation faults. 
> It's also not easy to reproduce, I need to read/write a lot of blocks to trigger 
> it and bug looks to be time-sensitive.
> 
> All three servers behave the same and it doesn't look like problem is caused
> by simple hardware issue. All healthchecks and tests on RAM/storage/other
> components pass.
> 
> Our BOSS-S1 PCI card uses the following SATA controller: Marvell Technology
> Group Ltd. 88SE9230 PCIe 2.0 x2 4-port SATA 6 Gb/s RAID Controller (rev 11).
> There are well-known problems with this family of controllers and Linux
> contains a fixup for DMA function 1
> (https://github.com/torvalds/linux/blob/2408a807bfc3f738850ef5ad5e3fd59d66168996/drivers/pci/quirks.c#L4316).
> This bug is known to cause some issues on Xen with IOMMU
> (https://github.com/QubesOS/qubes-issues/issues/5968). I'm not sure if it's
> somehow correlated and causes problems with PV as well.
> 
> By testing the bug in different conditions I also spotted few more
> correlations:
> 
> - bug occurs on Xen PV Dom0 and was reproduced on Xen versions from 4.16.0 to
>  4.19.2-pre (up to git:4803a3c5b5 from stable-4.19) and Debian 10 to 12
>  (both stable and backports kernel). Somehow that specific commit
>  git:4803a3c5b5 makes the bug harder to trigger but it may be just a
>  coincidence.

I think it's more likely to be a Linux bug than a Xen (hypervisor)
bug.

> - I was unable to reproduce it when Xen was compiled from master branch, but
>  I'm not sure if once again - it wasn't just a bad timing to trigger the
>  bug.
> - bug occurs only on ext4 file system with hardware RAID backed by BOSS-S1
> - bug DOESN'T occur without Xen
> - bug DOESN'T occur on Xen PVH Dom0
> - bug DOESN'T occur on Xen PV Dom0 when Xen was compiled with excluded
>  `NDEBUG` in file `xen/arch/x86/pv/dom0_build.c`. When I played with it, I
>  found that I'm unable to reproduce the issue when the code that reverses
>  MFN<->PFN mapping for Dom0 is active.

So the issue doesn't happen on debug=y builds?

That's unexpected.  I would expect the opposite, that some code in
Linux assumes that pfn + 1 == mfn + 1, and hence breaks when the
relation is reversed.

> - bug DOESN'T occur when using different storage than one backed by BOSS-S1.
> - bug was tested in few additional conditions and reproduction is not
>  dependent on these:
>  - -O1/-O2/no optimization behaves the same
>  - PAT patch to use Linux PAT layout instead of Xen's choice doesn't
>  change anything
>  (https://github.com/QubesOS/qubes-vmm-xen/blob/main/1018-x86-Use-Linux-s-PAT.patch)
>  - different Linux kernel version doesn't change anything
>  - vCPU pinning (e.g. single vCPU pinned to Dom0) doesn't change anything
> - bug was tested only with smt=1 because Xen doesn't boot properly on our
>  machines with smt=0 (hangs with "(XEN) CPU X still not dead", similar to
>  https://lists.xen.org/archives/html/xen-devel/2019-08/msg00138.html)

Hm, from that thread it seems like the original bug should already be
fixed.

> `xl info` for my testbed:
> 
> ```
> # xl info
> host : <redacted>
> release : 6.12.9+bpo-amd64
> version : #1 SMP PREEMPT_DYNAMIC Debian 6.12.9-1~bpo12+1 (2025-01-19)
> machine : x86_64
> nr_cpus : 28
> max_cpu_id : 27
> nr_nodes : 1
> cores_per_socket : 14
> threads_per_core : 2
> cpu_mhz : 2593.905
> hw_caps : bfebfbff:77fef3ff:2c100800:00000121:0000000f:d29ffffb:00000008:00000100
> virt_caps : pv hvm hvm_directio pv_directio hap shadow iommu_hap_pt_share vmtrace gnttab-v1 gnttab-v2
> total_memory : 130562
> free_memory : 96501
> sharing_freed_memory : 0
> sharing_used_memory : 0
> outstanding_claims : 0
> free_cpus : 0
> xen_major : 4
> xen_minor : 19
> xen_extra : .2-pre
> xen_version : 4.19.2-pre
> xen_caps : xen-3.0-x86_64 hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
> xen_scheduler : credit2
> xen_pagesize : 4096
> platform_params : virt_start=0xffff800000000000
> xen_changeset : Tue Jan 21 09:21:01 2025 +0100 git:4803a3c5b5
> xen_commandline : placeholder dom0_mem=32G,max:32G dom0_max_vcpus=16 dom0_vcpus_pin=1 no-real-mode edd=off
> cc_compiler : gcc (Debian 12.2.0-14) 12.2.0
> cc_compile_by : root
> cc_compile_domain : <redacted>
> cc_compile_date : Mon Feb 17 17:31:08 UTC 2025
> build_id : 410ba653f1f1fc13770b5d2a8cdf5e4d285b6783
> xend_config_format : 4
> ```
> 
> After collecting all of these information I'm on a roadblock. Effects of this
> bug on memory constistency are pretty serious, but on the other hand, they occur 
> in very specific conditions, which makes them difficult to track. I would appreciate 
> any help in finding the root cause of this issue.

Can you see if you can reproduce with dom0-iommu=strict in the Xen
command line?

Thanks, Roger.



* Re: Re: Memory corruption bug with Xen PV Dom0 and BOSS-S1 RAID card
  2025-02-18  9:44 ` Roger Pau Monné
@ 2025-02-19 18:37   ` Paweł Srokosz
  2025-02-20  9:16     ` Roger Pau Monné
  0 siblings, 1 reply; 9+ messages in thread
From: Paweł Srokosz @ 2025-02-19 18:37 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, jgross, andrew cooper3, JBeulich

Hello,

> So the issue doesn't happen on debug=y builds? That's unexpected.  I would
> expect the opposite, that some code in Linux assumes that pfn + 1 == mfn +
> 1, and hence breaks when the relation is reversed.

It was also surprising to me, but I think the key thing is that debug=y
causes the whole mapping to be reversed, so each PFN lands on a completely
different MFN, e.g. MFN=0x1300000 is mapped to PFN=0x20e50c in the ndebug
build, but in the debug build it's mapped to PFN=0x5FFFFF. I guess that's why
I can't reproduce the problem.

> Can you see if you can reproduce with dom0-iommu=strict in the Xen command
> line?

Unfortunately, it doesn't help. But I have a few more observations.

Firstly, I checked the "xen-mfndump dump-m2p" output and found that misread
blocks are mapped to suspiciously round MFNs. I have different versions of
Xen and the Linux kernel on each machine, and I still see the same correlation.

I write a few huge files without Xen to ensure that they have been written
correctly (because under Xen both reads and writeback are affected). Then I
boot into Xen, memory-map the files and read each page. I see that when a
block is corrupted, it is mapped to a round MFN, e.g.
pfn=0x5095d9/mfn=0x1600000, another at pfn=0x4095d9/mfn=0x1500000 etc.

On another machine with a different Linux/Xen version, these faults appear at
pfn=0x20e50c/mfn=0x1300000, pfn=0x30e50c/mfn=0x1400000 etc.

I also noticed that while reading a page that is mapped to
pfn=0x20e50c/mfn=0x1300000, I get these faults from DMAR:

```
(XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200000000
(XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
(XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200001000
(XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
(XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200006000
(XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
(XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200008000
(XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
(XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200009000
(XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
(XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000a000
(XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
(XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000c000
(XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
```

and every time I drop the cache and read this region, I get DMAR faults on a
few random addresses from the 1200000000-120000f000 range (I guess MFNs
0x1200000-0x120000f). MFNs 0x1200000-0x12000ff are not mapped to any PFN in
Dom0 (based on the xen-mfndump output).
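
The conversion from fault address to MFN is a shift by the 4 kB page size. A
small sketch that extracts the frame numbers from log lines like the ones
above, assuming the `fault addr <hex>` format shown there (`fault_mfns` is a
hypothetical helper name):

```python
import re

PAGE_SHIFT = 12  # 4 kB pages
FAULT_RE = re.compile(r"fault addr ([0-9a-f]+)")

# A few of the DMAR lines quoted above.
log = """\
(XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200000000
(XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
(XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200001000
(XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
(XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000c000
(XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
"""


def fault_mfns(text):
    """Return the sorted, de-duplicated MFNs of all faulting addresses."""
    addrs = (int(m.group(1), 16) for m in FAULT_RE.finditer(text))
    return sorted({addr >> PAGE_SHIFT for addr in addrs})


mfns = fault_mfns(log)
print([hex(m) for m in mfns])
# Distance from the MFN backing the page that was being read (0x1300000).
print(hex(0x1300000 - mfns[0]))
```

For the sample lines, the faulting frames lie 0x100000 frames below the MFN of
the page being read, the same distance as between the affected PFNs.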

On the other hand, I don't get these DMAR faults while reading other regions.
Also, I can't trigger the bug with the reversed Dom0 mapping, even if I fill
the page cache with reads.

Thank you,
Paweł


* Re: Re: Memory corruption bug with Xen PV Dom0 and BOSS-S1 RAID card
  2025-02-19 18:37   ` Paweł Srokosz
@ 2025-02-20  9:16     ` Roger Pau Monné
  2025-02-20  9:31       ` Jürgen Groß
  0 siblings, 1 reply; 9+ messages in thread
From: Roger Pau Monné @ 2025-02-20  9:16 UTC (permalink / raw)
  To: Paweł Srokosz; +Cc: xen-devel, jgross, andrew cooper3, JBeulich

On Wed, Feb 19, 2025 at 07:37:47PM +0100, Paweł Srokosz wrote:
> Hello,
> 
> > So the issue doesn't happen on debug=y builds? That's unexpected.  I would
> > expect the opposite, that some code in Linux assumes that pfn + 1 == mfn +
> > 1, and hence breaks when the relation is reversed.
> 
> It was also surprising for me but I think the key thing is that debug=y
> causes whole mapping to be reversed so each PFN lands on completely different
> MFN e.g. MFN=0x1300000 is mapped to PFN=0x20e50c in ndebug, but in debug
> it's mapped to PFN=0x5FFFFF. I guess that's why I can't reproduce the
> problem.
> 
> > Can you see if you can reproduce with dom0-iommu=strict in the Xen command
> > line?
> 
> Unfortunately, it doesn't help. But I have few more observations.
> 
> Firstly, I checked the "xen-mfndump dump-m2p" output and found that misread
> blocks are mapped to suspiciously round MFNs. I have different versions of
> Xen and Linux kernel on each machine and I see some coincidence.
> 
> I'm writing few huge files without Xen to ensure that they have been written
> correctly (because under Xen both read and writeback is affected). Then I'm
> booting to Xen, memory-mapping the files and reading each page. I see that when 
> block is corrupted, it is mapped on round MFN e.g. pfn=0x5095d9/mfn=0x1600000, 
> another on pfn=0x4095d9/mfn=0x1500000 etc.
> 
> On another machine with different Linux/Xen version these faults appear on
> pfn=0x20e50c/mfn=0x1300000, pfn=0x30e50c/mfn=0x1400000 etc.
> 
> I also noticed that during read of page that is mapped to
> pfn=0x20e50c/mfn=0x1300000, I'm getting these faults from DMAR:
> 
> ```
> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200000000
> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200001000
> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200006000
> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200008000
> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200009000
> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000a000
> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000c000
> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> ```

That's interesting, it seems to me that Linux is assuming that pages
at certain boundaries are superpages, and thus it can just increase
the mfn to get the next physical page.

> and every time I'm dropping the cache and reading this region, I'm getting
> DMAR faults on few random addresses from 1200000000-120000f000 range (I guess 
> MFNs 0x1200000-120000f). MFNs 0x1200000-0x12000ff are not mapped to any PFN in
> Dom0 (based on xen-mfndump output.). 

It would be very interesting to figure out where those requests
originate, iow: which entity in Linux creates the bios with the
faulting address(es).

It's a wild guess, but could you try to boot Linux with swiotlb=force
on the command line and attempt to trigger the issue?  I wonder
whether imposing the usage of the swiotlb will surface the issues as
CPU accesses, rather than IOMMU faults, and that could get us a trace
inside Linux of how those requests are generated.

> On the other hand, I'm not getting these DMAR faults while reading other regions.
> Also I can't trigger the bug with reversed Dom0 mapping, even if I fill the page
> cache with reads.

There's possibly some condition we are missing that causes a component
in Linux to assume the next address is mfn + 1, instead of doing the
full address translation from the linear or pfn space.

Thanks, Roger.



* Re: Memory corruption bug with Xen PV Dom0 and BOSS-S1 RAID card
  2025-02-20  9:16     ` Roger Pau Monné
@ 2025-02-20  9:31       ` Jürgen Groß
  2025-02-20 12:37         ` Roger Pau Monné
  0 siblings, 1 reply; 9+ messages in thread
From: Jürgen Groß @ 2025-02-20  9:31 UTC (permalink / raw)
  To: Roger Pau Monné, Paweł Srokosz
  Cc: xen-devel, andrew cooper3, JBeulich



On 20.02.25 10:16, Roger Pau Monné wrote:
> On Wed, Feb 19, 2025 at 07:37:47PM +0100, Paweł Srokosz wrote:
>> Hello,
>>
>>> So the issue doesn't happen on debug=y builds? That's unexpected.  I would
>>> expect the opposite, that some code in Linux assumes that pfn + 1 == mfn +
>>> 1, and hence breaks when the relation is reversed.
>>
>> It was also surprising for me but I think the key thing is that debug=y
>> causes whole mapping to be reversed so each PFN lands on completely different
>> MFN e.g. MFN=0x1300000 is mapped to PFN=0x20e50c in ndebug, but in debug
>> it's mapped to PFN=0x5FFFFF. I guess that's why I can't reproduce the
>> problem.
>>
>>> Can you see if you can reproduce with dom0-iommu=strict in the Xen command
>>> line?
>>
>> Unfortunately, it doesn't help. But I have few more observations.
>>
>> Firstly, I checked the "xen-mfndump dump-m2p" output and found that misread
>> blocks are mapped to suspiciously round MFNs. I have different versions of
>> Xen and Linux kernel on each machine and I see some coincidence.
>>
>> I'm writing few huge files without Xen to ensure that they have been written
>> correctly (because under Xen both read and writeback is affected). Then I'm
>> booting to Xen, memory-mapping the files and reading each page. I see that when
>> block is corrupted, it is mapped on round MFN e.g. pfn=0x5095d9/mfn=0x1600000,
>> another on pfn=0x4095d9/mfn=0x1500000 etc.
>>
>> On another machine with different Linux/Xen version these faults appear on
>> pfn=0x20e50c/mfn=0x1300000, pfn=0x30e50c/mfn=0x1400000 etc.
>>
>> I also noticed that during read of page that is mapped to
>> pfn=0x20e50c/mfn=0x1300000, I'm getting these faults from DMAR:
>>
>> ```
>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200000000
>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200001000
>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200006000
>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200008000
>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200009000
>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000a000
>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000c000
>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>> ```
> 
> That's interesting, it seems to me that Linux is assuming that pages
> at certain boundaries are superpages, and thus it can just increase
> the mfn to get the next physical page.

I'm not sure this is true. See below.

>> and every time I'm dropping the cache and reading this region, I'm getting
>> DMAR faults on few random addresses from 1200000000-120000f000 range (I guess
>> MFNs 0x1200000-120000f). MFNs 0x1200000-0x12000ff are not mapped to any PFN in
>> Dom0 (based on xen-mfndump output.).
> 
> It would be very interesting to figure out where those requests
> originate, iow: which entity in Linux creates the bios with the
> faulting address(es).

I _think_ this is related to the kernel trying to get some contiguous areas
for the buffers used by the I/Os. As those areas are being given back after
the I/O, they don't appear in the mfndump.

> It's a wild guess, but could you try to boot Linux with swiotlb=force
> on the command line and attempt to trigger the issue?  I wonder
> whether imposing the usage of the swiotlb will surface the issues as
> CPU accesses, rather then IOMMU faults, and that could get us a trace
> inside Linux of how those requests are generated.
> 
>> On the other hand, I'm not getting these DMAR faults while reading other regions.
>> Also I can't trigger the bug with reversed Dom0 mapping, even if I fill the page
>> cache with reads.
> 
> There's possibly some condition we are missing that causes a component
> in Linux to assume the next address is mfn + 1, instead of doing the
> full address translation from the linear or pfn space.

My theory is:

The kernel sees the buffer in use as a physically contiguous area, so it
does _not_ use a scatter-gather list (it does in the debug Xen case, which
is why no errors show up there). Unfortunately the buffer is not aligned to
its size, so swiotlb-xen remaps the buffer to a suitably aligned one. The
driver then uses the returned machine address for I/Os to both devices of
the RAID configuration. When the first I/O is done, the driver probably
already calls the DMA unmap or device sync function, causing the
intermediate contiguous region to be destroyed again (this is when the DMAR
errors should show up for the 2nd I/O, which is still running).

So the main issue IMHO is that a DMA buffer mapped for one device is used
for 2 devices instead.
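
As a toy illustration of this theory, the sketch below models one mapping
that is shared by two in-flight I/Os and torn down when the first one
completes. ToyIommu and mirrored_read() are hypothetical names made up for
this example, not kernel or Xen interfaces, and the MFN range is simply the
one seen in the DMAR log.

```python
# Toy model only: ToyIommu and mirrored_read are hypothetical, not real APIs.
class ToyIommu:
    def __init__(self):
        # Machine frame numbers the device may currently write to.
        self.writable = set()

    def map(self, mfns):
        self.writable.update(mfns)

    def unmap(self, mfns):
        self.writable.difference_update(mfns)

    def dma_write(self, mfn):
        # True on success; False where real hardware would log a DMAR fault.
        return mfn in self.writable


def mirrored_read(iommu, bounce_mfns):
    """One mapping used by two I/Os, unmapped after the first completes."""
    iommu.map(bounce_mfns)
    first = [iommu.dma_write(m) for m in bounce_mfns]   # 1st I/O completes
    iommu.unmap(bounce_mfns)                            # premature unmap
    second = [iommu.dma_write(m) for m in bounce_mfns]  # 2nd I/O still running
    return first, second


first, second = mirrored_read(ToyIommu(), range(0x1200000, 0x1200010))
print(all(first), any(second))
```

In this model every write of the first I/O succeeds and every write of the
second one is rejected, which is the pattern the theory predicts for the
real hardware.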


Juergen


* Re: Memory corruption bug with Xen PV Dom0 and BOSS-S1 RAID card
  2025-02-20  9:31       ` Jürgen Groß
@ 2025-02-20 12:37         ` Roger Pau Monné
  2025-02-20 12:43           ` Jürgen Groß
  0 siblings, 1 reply; 9+ messages in thread
From: Roger Pau Monné @ 2025-02-20 12:37 UTC (permalink / raw)
  To: Jürgen Groß
  Cc: Paweł Srokosz, xen-devel, andrew cooper3, JBeulich

On Thu, Feb 20, 2025 at 10:31:02AM +0100, Jürgen Groß wrote:
> On 20.02.25 10:16, Roger Pau Monné wrote:
> > On Wed, Feb 19, 2025 at 07:37:47PM +0100, Paweł Srokosz wrote:
> > > Hello,
> > > 
> > > > So the issue doesn't happen on debug=y builds? That's unexpected.  I would
> > > > expect the opposite, that some code in Linux assumes that pfn + 1 == mfn +
> > > > 1, and hence breaks when the relation is reversed.
> > > 
> > > It was also surprising for me but I think the key thing is that debug=y
> > > causes whole mapping to be reversed so each PFN lands on completely different
> > > MFN e.g. MFN=0x1300000 is mapped to PFN=0x20e50c in ndebug, but in debug
> > > it's mapped to PFN=0x5FFFFF. I guess that's why I can't reproduce the
> > > problem.
> > > 
> > > > Can you see if you can reproduce with dom0-iommu=strict in the Xen command
> > > > line?
> > > 
> > > Unfortunately, it doesn't help. But I have few more observations.
> > > 
> > > Firstly, I checked the "xen-mfndump dump-m2p" output and found that misread
> > > blocks are mapped to suspiciously round MFNs. I have different versions of
> > > Xen and Linux kernel on each machine and I see some coincidence.
> > > 
> > > I'm writing few huge files without Xen to ensure that they have been written
> > > correctly (because under Xen both read and writeback is affected). Then I'm
> > > booting to Xen, memory-mapping the files and reading each page. I see that when
> > > block is corrupted, it is mapped on round MFN e.g. pfn=0x5095d9/mfn=0x1600000,
> > > another on pfn=0x4095d9/mfn=0x1500000 etc.
> > > 
> > > On another machine with different Linux/Xen version these faults appear on
> > > pfn=0x20e50c/mfn=0x1300000, pfn=0x30e50c/mfn=0x1400000 etc.
> > > 
> > > I also noticed that during read of page that is mapped to
> > > pfn=0x20e50c/mfn=0x1300000, I'm getting these faults from DMAR:
> > > 
> > > ```
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200000000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200001000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200006000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200008000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200009000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000a000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000c000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > ```
> > 
> > That's interesting, it seems to me that Linux is assuming that pages
> > at certain boundaries are superpages, and thus it can just increase
> > the mfn to get the next physical page.
> 
> I'm not sure this is true. See below.
> 
> > > and every time I'm dropping the cache and reading this region, I'm getting
> > > DMAR faults on few random addresses from 1200000000-120000f000 range (I guess
> > > MFNs 0x1200000-120000f). MFNs 0x1200000-0x12000ff are not mapped to any PFN in
> > > Dom0 (based on xen-mfndump output.).
> > 
> > It would be very interesting to figure out where those requests
> > originate, iow: which entity in Linux creates the bios with the
> > faulting address(es).
> 
> I _think_ this is related to the kernel trying to get some contiguous areas
> for the buffers used by the I/Os. As those areas are being given back after
> the I/O, they don't appear in the mfndump.
> 
> > It's a wild guess, but could you try to boot Linux with swiotlb=force
> > on the command line and attempt to trigger the issue?  I wonder
> > whether imposing the usage of the swiotlb will surface the issues as
> > CPU accesses, rather then IOMMU faults, and that could get us a trace
> > inside Linux of how those requests are generated.
> > 
> > > On the other hand, I'm not getting these DMAR faults while reading other regions.
> > > Also I can't trigger the bug with reversed Dom0 mapping, even if I fill the page
> > > cache with reads.
> > 
> > There's possibly some condition we are missing that causes a component
> > in Linux to assume the next address is mfn + 1, instead of doing the
> > full address translation from the linear or pfn space.
> 
> My theory is:
> 
> The kernel is seeing the used buffer to be a physically contiguous area,
> so it is _not_ using a scatter-gather list (it does in the debug Xen case,
> resulting in it not to show any errors). Unfortunately the buffer is not
> aligned to its size, so swiotlb-xen will remap the buffer to a suitably
> aligned one. The driver will then use the returned machine address for
> I/Os to both the devices of the RAID configuration. When the first I/O is
> done, the driver probably is calling the DMA unmap or device sync function
> already, causing the intermediate contiguous region to be destroyed again
> (this is the time when the DMAR errors should show up for the 2nd I/O still
> running).
> 
> So the main issue IMHO is, that a DMA buffer mapped for one device is used
> for 2 devices instead.

But that won't cause IOMMU faults?  Because the memory used by the
bounce buffer would still be owned by dom0 (and thus part of its IOMMU
page-tables), just probably re-written to contain different data.

Or is the swiotlb contiguous region torn down after every operation?
That would seem extremely wasteful to me; I assume the buffer is
allocated during device init, and stays the same until the device is
detached.

Thanks, Roger.



* Re: Memory corruption bug with Xen PV Dom0 and BOSS-S1 RAID card
  2025-02-20 12:37         ` Roger Pau Monné
@ 2025-02-20 12:43           ` Jürgen Groß
  2025-02-20 13:29             ` Roger Pau Monné
  0 siblings, 1 reply; 9+ messages in thread
From: Jürgen Groß @ 2025-02-20 12:43 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Paweł Srokosz, xen-devel, andrew cooper3, JBeulich


On 20.02.25 13:37, Roger Pau Monné wrote:
> On Thu, Feb 20, 2025 at 10:31:02AM +0100, Jürgen Groß wrote:
>> On 20.02.25 10:16, Roger Pau Monné wrote:
>>> On Wed, Feb 19, 2025 at 07:37:47PM +0100, Paweł Srokosz wrote:
>>>> Hello,
>>>>
>>>>> So the issue doesn't happen on debug=y builds? That's unexpected.  I would
>>>>> expect the opposite, that some code in Linux assumes that pfn + 1 == mfn +
>>>>> 1, and hence breaks when the relation is reversed.
>>>>
>>>> It was also surprising for me but I think the key thing is that debug=y
>>>> causes whole mapping to be reversed so each PFN lands on completely different
>>>> MFN e.g. MFN=0x1300000 is mapped to PFN=0x20e50c in ndebug, but in debug
>>>> it's mapped to PFN=0x5FFFFF. I guess that's why I can't reproduce the
>>>> problem.
>>>>
>>>>> Can you see if you can reproduce with dom0-iommu=strict in the Xen command
>>>>> line?
>>>>
>>>> Unfortunately, it doesn't help. But I have few more observations.
>>>>
>>>> Firstly, I checked the "xen-mfndump dump-m2p" output and found that misread
>>>> blocks are mapped to suspiciously round MFNs. I have different versions of
>>>> Xen and Linux kernel on each machine and I see some coincidence.
>>>>
>>>> I'm writing few huge files without Xen to ensure that they have been written
>>>> correctly (because under Xen both read and writeback is affected). Then I'm
>>>> booting to Xen, memory-mapping the files and reading each page. I see that when
>>>> block is corrupted, it is mapped on round MFN e.g. pfn=0x5095d9/mfn=0x1600000,
>>>> another on pfn=0x4095d9/mfn=0x1500000 etc.
>>>>
>>>> On another machine with different Linux/Xen version these faults appear on
>>>> pfn=0x20e50c/mfn=0x1300000, pfn=0x30e50c/mfn=0x1400000 etc.
>>>>
>>>> I also noticed that during read of page that is mapped to
>>>> pfn=0x20e50c/mfn=0x1300000, I'm getting these faults from DMAR:
>>>>
>>>> ```
>>>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200000000
>>>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>>>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200001000
>>>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>>>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200006000
>>>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>>>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200008000
>>>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>>>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200009000
>>>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>>>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000a000
>>>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>>>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000c000
>>>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>>>> ```
>>>
>>> That's interesting, it seems to me that Linux is assuming that pages
>>> at certain boundaries are superpages, and thus it can just increase
>>> the mfn to get the next physical page.
>>
>> I'm not sure this is true. See below.
>>
>>>> and every time I'm dropping the cache and reading this region, I'm getting
>>>> DMAR faults on few random addresses from 1200000000-120000f000 range (I guess
>>>> MFNs 0x1200000-120000f). MFNs 0x1200000-0x12000ff are not mapped to any PFN in
>>>> Dom0 (based on xen-mfndump output.).
>>>
>>> It would be very interesting to figure out where those requests
>>> originate, iow: which entity in Linux creates the bios with the
>>> faulting address(es).
>>
>> I _think_ this is related to the kernel trying to get some contiguous areas
>> for the buffers used by the I/Os. As those areas are being given back after
>> the I/O, they don't appear in the mfndump.
>>
>>> It's a wild guess, but could you try to boot Linux with swiotlb=force
>>> on the command line and attempt to trigger the issue?  I wonder
>>> whether imposing the usage of the swiotlb will surface the issues as
>>> CPU accesses, rather then IOMMU faults, and that could get us a trace
>>> inside Linux of how those requests are generated.
>>>
>>>> On the other hand, I'm not getting these DMAR faults while reading other regions.
>>>> Also I can't trigger the bug with reversed Dom0 mapping, even if I fill the page
>>>> cache with reads.
>>>
>>> There's possibly some condition we are missing that causes a component
>>> in Linux to assume the next address is mfn + 1, instead of doing the
>>> full address translation from the linear or pfn space.
>>
>> My theory is:
>>
>> The kernel is seeing the used buffer to be a physically contiguous area,
>> so it is _not_ using a scatter-gather list (it does in the debug Xen case,
>> resulting in it not to show any errors). Unfortunately the buffer is not
>> aligned to its size, so swiotlb-xen will remap the buffer to a suitably
>> aligned one. The driver will then use the returned machine address for
>> I/Os to both the devices of the RAID configuration. When the first I/O is
>> done, the driver probably is calling the DMA unmap or device sync function
>> already, causing the intermediate contiguous region to be destroyed again
>> (this is the time when the DMAR errors should show up for the 2nd I/O still
>> running).
>>
>> So the main issue IMHO is, that a DMA buffer mapped for one device is used
>> for 2 devices instead.
> 
> But that won't cause IOMMU faults?  Because the memory used by the
> bounce buffer would still be owned by dom0 (and thus part of it's IOMMU
> page-tables), just probably re-written to contain different data.
> 
> Or is the swiotlb contiguous region torn down after every operation?

See the kernel function xen_swiotlb_alloc_coherent(): it will try to
allocate a contiguous region from the hypervisor on demand and give it
back via xen_swiotlb_free_coherent() after the I/O.

> That would seem extremely wasteful to me, I assume the buffer is
> allocated during device init, and stays the same until the device is
> detached.

Yes, that is the normal use case for xen_swiotlb_alloc_coherent(). Whether
all users are doing it that way is another question.

For normal I/O the standard case is to use either an SG-list, a pre-allocated
contiguous region, or the swiotlb (implicitly done via xen_swiotlb_map_page()).

As the observation was that there are DMAR messages NOT related to dom0 MFNs,
I ruled out normal swiotlb buffers, which are indeed pre-allocated and as such
known to belong to dom0 when taking the mfndump.
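
The check behind that reasoning can be sketched as follows, assuming 4 KiB
pages so that a fault address maps to an MFN by shifting right by 12 bits.
The fault addresses are the ones from the DMAR log; dom0_mfns is only an
illustrative subset standing in for the MFNs listed by
"xen-mfndump dump-m2p", and both function names are hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def fault_addr_to_mfn(addr):
    """Convert a DMAR fault address to a machine frame number."""
    return addr >> PAGE_SHIFT


def faults_outside_domain(fault_addrs, domain_mfns):
    """Return the faulting MFNs (as hex strings) not owned by the domain."""
    mfns = (fault_addr_to_mfn(a) for a in fault_addrs)
    return [hex(m) for m in mfns if m not in domain_mfns]


# Fault addresses taken from the DMAR log in this thread.
fault_addrs = [0x1200000000, 0x1200001000, 0x1200006000, 0x1200008000,
               0x1200009000, 0x120000A000, 0x120000C000]

# Illustrative subset only; the real set would come from the mfndump output.
dom0_mfns = {0x1300000, 0x1400000}

print(faults_outside_domain(fault_addrs, dom0_mfns))
```

With this input every faulting MFN is reported as not belonging to dom0,
which is why pre-allocated swiotlb buffers look unlikely as the target of
the faulting writes.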


Juergen


* Re: Memory corruption bug with Xen PV Dom0 and BOSS-S1 RAID card
  2025-02-20 12:43           ` Jürgen Groß
@ 2025-02-20 13:29             ` Roger Pau Monné
  2025-02-20 13:41               ` Jürgen Groß
  0 siblings, 1 reply; 9+ messages in thread
From: Roger Pau Monné @ 2025-02-20 13:29 UTC (permalink / raw)
  To: Jürgen Groß
  Cc: Paweł Srokosz, xen-devel, andrew cooper3, JBeulich

On Thu, Feb 20, 2025 at 01:43:39PM +0100, Jürgen Groß wrote:
> On 20.02.25 13:37, Roger Pau Monné wrote:
> > On Thu, Feb 20, 2025 at 10:31:02AM +0100, Jürgen Groß wrote:
> > > On 20.02.25 10:16, Roger Pau Monné wrote:
> > > > On Wed, Feb 19, 2025 at 07:37:47PM +0100, Paweł Srokosz wrote:
> > > > > Hello,
> > > > > 
> > > > > > So the issue doesn't happen on debug=y builds? That's unexpected.  I would
> > > > > > expect the opposite, that some code in Linux assumes that pfn + 1 == mfn +
> > > > > > 1, and hence breaks when the relation is reversed.
> > > > > 
> > > > > It was also surprising for me but I think the key thing is that debug=y
> > > > > causes whole mapping to be reversed so each PFN lands on completely different
> > > > > MFN e.g. MFN=0x1300000 is mapped to PFN=0x20e50c in ndebug, but in debug
> > > > > it's mapped to PFN=0x5FFFFF. I guess that's why I can't reproduce the
> > > > > problem.
> > > > > 
> > > > > > Can you see if you can reproduce with dom0-iommu=strict in the Xen command
> > > > > > line?
> > > > > 
> > > > > Unfortunately, it doesn't help. But I have few more observations.
> > > > > 
> > > > > Firstly, I checked the "xen-mfndump dump-m2p" output and found that misread
> > > > > blocks are mapped to suspiciously round MFNs. I have different versions of
> > > > > Xen and Linux kernel on each machine and I see some coincidence.
> > > > > 
> > > > > I'm writing few huge files without Xen to ensure that they have been written
> > > > > correctly (because under Xen both read and writeback is affected). Then I'm
> > > > > booting to Xen, memory-mapping the files and reading each page. I see that when
> > > > > block is corrupted, it is mapped on round MFN e.g. pfn=0x5095d9/mfn=0x1600000,
> > > > > another on pfn=0x4095d9/mfn=0x1500000 etc.
> > > > > 
> > > > > On another machine with different Linux/Xen version these faults appear on
> > > > > pfn=0x20e50c/mfn=0x1300000, pfn=0x30e50c/mfn=0x1400000 etc.
> > > > > 
> > > > > I also noticed that during read of page that is mapped to
> > > > > pfn=0x20e50c/mfn=0x1300000, I'm getting these faults from DMAR:
> > > > > 
> > > > > ```
> > > > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200000000
> > > > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200001000
> > > > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200006000
> > > > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200008000
> > > > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200009000
> > > > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000a000
> > > > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000c000
> > > > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > > > ```
> > > > 
> > > > That's interesting, it seems to me that Linux is assuming that pages
> > > > at certain boundaries are superpages, and thus it can just increase
> > > > the mfn to get the next physical page.
> > > 
> > > I'm not sure this is true. See below.
> > > 
> > > > > and every time I'm dropping the cache and reading this region, I'm getting
> > > > > DMAR faults on few random addresses from 1200000000-120000f000 range (I guess
> > > > > MFNs 0x1200000-120000f). MFNs 0x1200000-0x12000ff are not mapped to any PFN in
> > > > > Dom0 (based on xen-mfndump output.).
> > > > 
> > > > It would be very interesting to figure out where those requests
> > > > originate, iow: which entity in Linux creates the bios with the
> > > > faulting address(es).
> > > 
> > > I _think_ this is related to the kernel trying to get some contiguous areas
> > > for the buffers used by the I/Os. As those areas are being given back after
> > > the I/O, they don't appear in the mfndump.
> > > 
> > > > It's a wild guess, but could you try to boot Linux with swiotlb=force
> > > > on the command line and attempt to trigger the issue?  I wonder
> > > > whether imposing the usage of the swiotlb will surface the issues as
> > > > CPU accesses, rather then IOMMU faults, and that could get us a trace
> > > > inside Linux of how those requests are generated.
> > > > 
> > > > > On the other hand, I'm not getting these DMAR faults while reading other regions.
> > > > > Also I can't trigger the bug with reversed Dom0 mapping, even if I fill the page
> > > > > cache with reads.
> > > > 
> > > > There's possibly some condition we are missing that causes a component
> > > > in Linux to assume the next address is mfn + 1, instead of doing the
> > > > full address translation from the linear or pfn space.
> > > 
> > > My theory is:
> > > 
> > > The kernel is seeing the used buffer to be a physically contiguous area,
> > > so it is _not_ using a scatter-gather list (it does in the debug Xen case,
> > > resulting in it not to show any errors). Unfortunately the buffer is not
> > > aligned to its size, so swiotlb-xen will remap the buffer to a suitably
> > > aligned one. The driver will then use the returned machine address for
> > > I/Os to both the devices of the RAID configuration. When the first I/O is
> > > done, the driver probably is calling the DMA unmap or device sync function
> > > already, causing the intermediate contiguous region to be destroyed again
> > > (this is the time when the DMAR errors should show up for the 2nd I/O still
> > > running).
> > > 
> > > So the main issue IMHO is, that a DMA buffer mapped for one device is used
> > > for 2 devices instead.
> > 
> > But that won't cause IOMMU faults?  Because the memory used by the
> > bounce buffer would still be owned by dom0 (and thus part of it's IOMMU
> > page-tables), just probably re-written to contain different data.
> > 
> > Or is the swiotlb contiguous region torn down after every operation?
> 
> See the kernel function xen_swiotlb_alloc_coherent(): it will try to
> allocate a continuous region from the hypervisor on demand and give it
> back via xen_swiotlb_free_coherent() after the I/O.
> 
> > That would seem extremely wasteful to me, I assume the buffer is
> > allocated during device init, and stays the same until the device is
> > detached.
> 
> Yes, that is the normal use case for xen_swiotlb_alloc_coherent(). Whether
> all users are doing it that way is another question.
> 
> For normal I/O the standard case is to use either SG-list, a pre-allocated
> contiguous region, or the swiotlb (implicitly done via xen_swiotlb_map_page()).
> 
> As the observation was that there are DMAR messages NOT related to dom0 MFNs,
> I ruled out normal swiotlb buffers, which are indeed pre-allocated and as such
> known to belong to dom0 when taking the mfndump.

Do you have any suggestions on how to debug this further? Is there
some way to trace swiotlb operation to detect this case?

I wonder whether the above scenario would also trigger on native, as it's
also possible to have non-aligned buffers in that case, and hence the
premature release of the bounced memory should cause issues there as
well?

Thanks, Roger.



* Re: Memory corruption bug with Xen PV Dom0 and BOSS-S1 RAID card
  2025-02-20 13:29             ` Roger Pau Monné
@ 2025-02-20 13:41               ` Jürgen Groß
  0 siblings, 0 replies; 9+ messages in thread
From: Jürgen Groß @ 2025-02-20 13:41 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Paweł Srokosz, xen-devel, andrew cooper3, JBeulich


On 20.02.25 14:29, Roger Pau Monné wrote:
> On Thu, Feb 20, 2025 at 01:43:39PM +0100, Jürgen Groß wrote:
>> On 20.02.25 13:37, Roger Pau Monné wrote:
>>> On Thu, Feb 20, 2025 at 10:31:02AM +0100, Jürgen Groß wrote:
>>>> On 20.02.25 10:16, Roger Pau Monné wrote:
>>>>> On Wed, Feb 19, 2025 at 07:37:47PM +0100, Paweł Srokosz wrote:
>>>>>> Hello,
>>>>>>
>>>>>>> So the issue doesn't happen on debug=y builds? That's unexpected.  I would
>>>>>>> expect the opposite, that some code in Linux assumes that pfn + 1 == mfn +
>>>>>>> 1, and hence breaks when the relation is reversed.
>>>>>>
>>>>>> It was also surprising for me but I think the key thing is that debug=y
>>>>>> causes whole mapping to be reversed so each PFN lands on completely different
>>>>>> MFN e.g. MFN=0x1300000 is mapped to PFN=0x20e50c in ndebug, but in debug
>>>>>> it's mapped to PFN=0x5FFFFF. I guess that's why I can't reproduce the
>>>>>> problem.
>>>>>>
>>>>>>> Can you see if you can reproduce with dom0-iommu=strict in the Xen command
>>>>>>> line?
>>>>>>
>>>>>> Unfortunately, it doesn't help. But I have few more observations.
>>>>>>
>>>>>> Firstly, I checked the "xen-mfndump dump-m2p" output and found that misread
>>>>>> blocks are mapped to suspiciously round MFNs. I have different versions of
>>>>>> Xen and Linux kernel on each machine and I see some coincidence.
>>>>>>
>>>>>> I'm writing few huge files without Xen to ensure that they have been written
>>>>>> correctly (because under Xen both read and writeback is affected). Then I'm
>>>>>> booting to Xen, memory-mapping the files and reading each page. I see that when
>>>>>> block is corrupted, it is mapped on round MFN e.g. pfn=0x5095d9/mfn=0x1600000,
>>>>>> another on pfn=0x4095d9/mfn=0x1500000 etc.
>>>>>>
>>>>>> On another machine with different Linux/Xen version these faults appear on
>>>>>> pfn=0x20e50c/mfn=0x1300000, pfn=0x30e50c/mfn=0x1400000 etc.
>>>>>>
>>>>>> I also noticed that during read of page that is mapped to
>>>>>> pfn=0x20e50c/mfn=0x1300000, I'm getting these faults from DMAR:
>>>>>>
>>>>>> ```
>>>>>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200000000
>>>>>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>>>>>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200001000
>>>>>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>>>>>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200006000
>>>>>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>>>>>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200008000
>>>>>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>>>>>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200009000
>>>>>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>>>>>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000a000
>>>>>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>>>>>> (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000c000
>>>>>> (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
>>>>>> ```
>>>>>
>>>>> That's interesting, it seems to me that Linux is assuming that pages
>>>>> at certain boundaries are superpages, and thus it can just increase
>>>>> the mfn to get the next physical page.
>>>>
>>>> I'm not sure this is true. See below.
>>>>
>>>>>> and every time I'm dropping the cache and reading this region, I'm getting
>>>>>> DMAR faults on few random addresses from 1200000000-120000f000 range (I guess
>>>>>> MFNs 0x1200000-120000f). MFNs 0x1200000-0x12000ff are not mapped to any PFN in
>>>>>> Dom0 (based on xen-mfndump output.).
>>>>>
>>>>> It would be very interesting to figure out where those requests
>>>>> originate, iow: which entity in Linux creates the bios with the
>>>>> faulting address(es).
>>>>
>>>> I _think_ this is related to the kernel trying to get some contiguous areas
>>>> for the buffers used by the I/Os. As those areas are being given back after
>>>> the I/O, they don't appear in the mfndump.
>>>>
>>>>> It's a wild guess, but could you try to boot Linux with swiotlb=force
>>>>> on the command line and attempt to trigger the issue?  I wonder
>>>>> whether imposing the usage of the swiotlb will surface the issues as
>>>>> CPU accesses, rather then IOMMU faults, and that could get us a trace
>>>>> inside Linux of how those requests are generated.
>>>>>
>>>>>> On the other hand, I'm not getting these DMAR faults while reading other regions.
>>>>>> Also I can't trigger the bug with reversed Dom0 mapping, even if I fill the page
>>>>>> cache with reads.
>>>>>
>>>>> There's possibly some condition we are missing that causes a component
>>>>> in Linux to assume the next address is mfn + 1, instead of doing the
>>>>> full address translation from the linear or pfn space.
>>>>
>>>> My theory is:
>>>>
>>>> The kernel is seeing the used buffer to be a physically contiguous area,
>>>> so it is _not_ using a scatter-gather list (it does in the debug Xen case,
>>>> resulting in it not to show any errors). Unfortunately the buffer is not
>>>> aligned to its size, so swiotlb-xen will remap the buffer to a suitably
>>>> aligned one. The driver will then use the returned machine address for
>>>> I/Os to both the devices of the RAID configuration. When the first I/O is
>>>> done, the driver probably is calling the DMA unmap or device sync function
>>>> already, causing the intermediate contiguous region to be destroyed again
>>>> (this is the time when the DMAR errors should show up for the 2nd I/O still
>>>> running).
>>>>
>>>> So the main issue IMHO is, that a DMA buffer mapped for one device is used
>>>> for 2 devices instead.
>>>
>>> But that won't cause IOMMU faults?  Because the memory used by the
>>> bounce buffer would still be owned by dom0 (and thus part of it's IOMMU
>>> page-tables), just probably re-written to contain different data.
>>>
>>> Or is the swiotlb contiguous region torn down after every operation?
>>
>> See the kernel function xen_swiotlb_alloc_coherent(): it will try to
>> allocate a continuous region from the hypervisor on demand and give it
>> back via xen_swiotlb_free_coherent() after the I/O.
>>
>>> That would seem extremely wasteful to me, I assume the buffer is
>>> allocated during device init, and stays the same until the device is
>>> detached.
>>
>> Yes, that is the normal use case for xen_swiotlb_alloc_coherent(). Whether
>> all users are doing it that way is another question.
>>
>> For normal I/O the standard case is to use either SG-list, a pre-allocated
>> contiguous region, or the swiotlb (implicitly done via xen_swiotlb_map_page()).
>>
>> As the observation was that there are DMAR messages NOT related to dom0 MFNs,
>> I ruled out normal swiotlb buffers, which are indeed pre-allocated and as such
>> known to belong to dom0 when taking the mfndump.
> 
> Do you have any suggestion about how to debug this further, is there
> some way to trace swiotlb operation to detect this case?

I guess looking into the driver source code would be the best option
we have. And this includes the SW-RAID code.

> I wonder whether the above scenario won't trigger on native, as it's
> also possible to have non-aligned buffers in that case, and hence the
> premature relinquish of the bounced memory should also cause issues
> there?

On bare metal, a PFN-contiguous buffer will automatically be requested via
the dma_ops->alloc() function, and it will be aligned according to its size.
The problem can only occur in Xen PV guests, as even a buffer allocated to
be contiguous and suitably aligned in PFN space can still have the wrong
alignment in MFN space.
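
A minimal sketch of that difference, with made-up frame numbers: the p2m
dictionaries below are hypothetical PFN-to-MFN tables, and the helper names
are invented for this example rather than taken from the kernel. The
buffer size in frames is assumed to be a power of two.

```python
def is_contiguous(frames):
    """True if each frame number is exactly one above its predecessor."""
    return all(b == a + 1 for a, b in zip(frames, frames[1:]))


def is_size_aligned(first_frame, nr_frames):
    """True if the start frame is a multiple of the buffer size in frames."""
    return first_frame % nr_frames == 0


def dma_usable_without_remap(pfn_start, nr_frames, p2m):
    """Check contiguity and size alignment in MFN space, not PFN space."""
    mfns = [p2m[pfn] for pfn in range(pfn_start, pfn_start + nr_frames)]
    return is_contiguous(mfns) and is_size_aligned(mfns[0], nr_frames)


PFN_START, NR_FRAMES = 0x100, 4

# Bare metal: PFN == MFN, so PFN alignment carries over unchanged.
native_p2m = {PFN_START + i: PFN_START + i for i in range(NR_FRAMES)}

# PV guest (made-up values): MFNs are contiguous but start at an odd offset.
pv_p2m = {PFN_START + i: 0x2003 + i for i in range(NR_FRAMES)}

print(is_size_aligned(PFN_START, NR_FRAMES))                   # PFN view
print(dma_usable_without_remap(PFN_START, NR_FRAMES, native_p2m))
print(dma_usable_without_remap(PFN_START, NR_FRAMES, pv_p2m))
```

The same buffer passes the check with the identity mapping and fails it
with the PV mapping, even though it is aligned in PFN space in both cases.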


Juergen


Thread overview: 9+ messages
2025-02-17 20:19 Memory corruption bug with Xen PV Dom0 and BOSS-S1 RAID card Paweł Srokosz
2025-02-18  9:44 ` Roger Pau Monné
2025-02-19 18:37   ` Paweł Srokosz
2025-02-20  9:16     ` Roger Pau Monné
2025-02-20  9:31       ` Jürgen Groß
2025-02-20 12:37         ` Roger Pau Monné
2025-02-20 12:43           ` Jürgen Groß
2025-02-20 13:29             ` Roger Pau Monné
2025-02-20 13:41               ` Jürgen Groß
