Re: Memory corruption bug with Xen PV Dom0 and BOSS-S1 RAID card

From: "Roger Pau Monné" <roger.pau@citrix.com>
To: "Paweł Srokosz" <pawel.srokosz@cert.pl>
Cc: xen-devel@lists.xenproject.org, jgross@suse.com,
	andrew.cooper3@citrix.com, JBeulich@suse.com
Subject: Re: Memory corruption bug with Xen PV Dom0 and BOSS-S1 RAID card
Date: Tue, 18 Feb 2025 10:44:20 +0100
Message-ID: <Z7RWdPpUde9ZoaZu@macbook.local>
In-Reply-To: <1050214476.1105853.1739823581696.JavaMail.zimbra@cert.pl>

On Mon, Feb 17, 2025 at 09:19:41PM +0100, Paweł Srokosz wrote:
> Hello everyone,

Adding the x86 maintainers plus the Linux Xen maintainer to the email.

> For a few months I've been struggling with a very weird memory corruption issue in
> Xen PV Dom0 with storage backed by a BOSS-S1 RAID-1 card. I noticed it when I tried
> to copy a huge ISO file onto the Dom0 file system and use it for a DomU installation.
> Everything was fine while its contents were cached in memory, but when I
> rebooted the system and the file was read back, some parts of the file had changed
> their contents. At the same time, fsck doesn't report any problems with the
> filesystem.
> 
> I'm able to reproduce it only when reading and writing
> files on storage backed by two SSDs in hardware RAID-1 (BOSS-S1 SATA AHCI
> RAID-1 fw ver. 2.5.13.3024) and only under Xen PV. Without Xen or with PVH Dom0
> everything works correctly. I have reproduced the bug on three servers
> with the same hardware/software specification:

We had similar reports, and IIRC also with software RAID.

> - Platform: Dell PowerEdge R640
> - CPU: 1x Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
> - RAM: 4x Multi-bit ECC DDR-4 32 GB
> - Storage:
>  - 2xSSD 240GiB with BOSS-S1 SATA AHCI RAID-1 fw ver. 2.5.13.3024 (where
>  files get corrupted)
>  - 1xSSD 4TiB SAS PERC H330 Mini JBOD fw ver. 25.5.9.0001 (where files
>  don't get corrupted)
> 
> I reproduced the same situation by writing a file, flushing dirty pages to
> the storage (`sync`) and dropping cached pages (`echo 3 > /proc/sys/vm/drop_caches`).
> 
> ```
> # sha256sum Win10_22H2_Polish_x64v1.iso
> 96aad9e4b20b6e3f5fea40b981263e854f6c879472369d5ce8324aae1f6b7556 Win10_22H2_Polish_x64v1.iso
> 
> # echo 3 > /proc/sys/vm/drop_caches
> 
> # sha256sum Win10_22H2_Polish_x64v1.iso
> 0ba05ee38c0f2755bce4ccdf6b389963d9177b261505cbc2b41f8198e9f3bc60 Win10_22H2_Polish_x64v1.iso
> 
> # echo 3 > /proc/sys/vm/drop_caches
> 
> # sha256sum Win10_22H2_Polish_x64v1.iso
> 972a7f363e48b72a612efe85cc6a2c8ce7314858ec0e7ef08d9d7578c9a10ddc Win10_22H2_Polish_x64v1.iso
> ```
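
For anyone who wants to script the check quoted above, here is a minimal sketch in Python. It assumes the process runs as root (required to write `/proc/sys/vm/drop_caches`); the function names are hypothetical and not taken from the original report.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def drop_page_cache():
    """Flush dirty pages, then ask the kernel to drop clean caches (needs root)."""
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")


def hashes_after_cache_drops(path, rounds=3):
    """Hash the file `rounds` times, dropping the page cache before each re-read."""
    results = [sha256_of(path)]
    for _ in range(rounds - 1):
        drop_page_cache()
        results.append(sha256_of(path))
    return results
```

On a healthy system every digest returned by `hashes_after_cache_drops()` should be identical; the report above shows three different digests for the same file.
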
> 
> The same effect occurred on two other machines with the same hardware. Only files
> written under Xen PV Dom0 were affected. When these files (written under Xen Dom0)
> were read without Xen, they were consistently corrupted in the affected
> parts. When these files were read under Xen, the corruption changed every time
> we dropped the page cache.
> 
> I found that the file is corrupted along 4kB page boundaries, so it looked like
> a memory issue. So I wrote a script that writes a huge file with a specific
> pattern in each 4kB block (matching the page size) and, after
> flush/drop_caches, mmaps the file and checks the integrity of each block.
> When a block mismatch occurs, it prints the VA and GFN from
> `/proc/<pid>/pagemap` (using https://github.com/dwks/pagemap). Each page is
> filled with numbers depending on the file offset, so I'm able to correlate
> the contents with the specific offset in case they're shifted or out of
> order.
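
The script itself was not posted, so the following is only a sketch of the kind of test described above, under stated assumptions: each 4 kB block is filled with a repeated 8-byte word holding its 1-based block number (so an all-zero block is never a valid pattern), and the file is a whole number of blocks. The names `make_block`, `identify_block`, `write_pattern` and `find_mismatches` are hypothetical, and the `/proc/<pid>/pagemap` lookup is left out.

```python
import mmap
import os
import struct

BLOCK_SIZE = 4096  # matches the 4 kB page size mentioned above
WORDS_PER_BLOCK = BLOCK_SIZE // 8


def make_block(offset):
    """Build the pattern block for a file offset (assumed encoding: 1-based block number)."""
    word = offset // BLOCK_SIZE + 1
    return struct.pack("<Q", word) * WORDS_PER_BLOCK


def identify_block(block):
    """Return the file offset a block belongs to, or -1 if it is not a pattern block."""
    (word,) = struct.unpack_from("<Q", block)
    if word != 0 and block == struct.pack("<Q", word) * WORDS_PER_BLOCK:
        return (word - 1) * BLOCK_SIZE
    return -1


def write_pattern(path, num_blocks):
    """Write `num_blocks` pattern blocks and force them out to storage."""
    with open(path, "wb") as f:
        for i in range(num_blocks):
            f.write(make_block(i * BLOCK_SIZE))
        f.flush()
        os.fsync(f.fileno())


def find_mismatches(path):
    """mmap the file and return (expected_offset, found_offset) for each bad block."""
    mismatches = []
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        usable = size - size % BLOCK_SIZE  # ignore a trailing partial block
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ) as m:
            for offset in range(0, usable, BLOCK_SIZE):
                block = m[offset:offset + BLOCK_SIZE]
                if block != make_block(offset):
                    mismatches.append((offset, identify_block(block)))
    return mismatches
```

The page cache has to be dropped between `write_pattern()` and `find_mismatches()` for the check to hit the storage path. A `found_offset` of -1 plays the same role as the `read -0x1` entries in the output quoted below.
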
> 
> In terms of file offset, the corrupted range usually extends up to a 0xffff boundary,
> e.g. mismatched blocks can be found within these file offset ranges:
> - 0x248f000-0x248ffff
> - 0xd4944000-0xd494ffff
> - 0xc1fb000-0xc1fffff
> 
> My wild guess is that 0xffff is a Linux boundary for readahead operations.
> When I try to load two or more files into the page cache, I start to see some
> patterns in the Dom0 Linux PFNs (GFNs?):
> 
> ```
> Block mismatch 0x4f577000 read -0x1
> 7f664ec00000-7f6742e40000 r--s 00000000 fe:00 397029 /home/pawelsr/testfile1
>  00007f669e177000 pm a18000000030e50c pc 0000000000000001 pf 000000040000082c cg 0000000000000be5
> 
> <... redacted series of similar entries for ...8000, ...9000, ...a000>
> 
> Block mismatch 0x4f57f000 read -0x1
> 7f664ec00000-7f6742e40000 r--s 00000000 fe:00 397029 /home/pawelsr/testfile1
>  00007f669e17f000 pm a18000000030e514 pc 0000000000000001 pf 000000040000082c cg 0000000000000be5
> ```
> 
> ```
> Block mismatch 0xc1fb000 read 0x4f577000
> 7f0552600000-7f0646840000 r--s 00000000 fe:00 399642 /home/pawelsr/testfile2
>  00007f055e7fb000 pm a18000000020e50c pc 0000000000000001 pf 000000040000082c cg 0000000000000be5
> 
> <... redacted series of similar entries for ...c000, ...d000, ...e000>
> 
> Block mismatch 0xc203000 read 0x4f57f000
> 7f0552600000-7f0646840000 r--s 00000000 fe:00 399642 /home/pawelsr/testfile2
>  00007f055e803000 pm a18000000020e514 pc 0000000000000001 pf 000000040000082c cg 0000000000000be5
> ```
> 
> Which means that when I try to read from GFNs `20e50c-20e514`, I'm getting
> contents that should land in GFNs `30e50c-30e514`. On the other hand,
> `30e50c-30e514` contain only zeroes, but sometimes I see something that looks
> like a random portion of some memory. When I'm able to correlate the
> contents, very often they come from a GFN offset by a multiple of 0x100000.
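
To put the 0x100000 figure in byte terms, here is a small worked calculation in Python. It assumes that the `pm` column in the output above is the raw `/proc/<pid>/pagemap` entry (whose bits 0-54 hold the frame number for a present page) and that pages are 4 KiB; the helper name is hypothetical.

```python
PAGE_SHIFT = 12            # 4 KiB pages
PFN_MASK = (1 << 55) - 1   # bits 0-54 of a pagemap entry hold the PFN when the page is present


def pfn_from_pagemap(entry):
    """Extract the frame number from a raw pagemap entry."""
    return entry & PFN_MASK


# "pm" values taken from the first mismatch of each file in the quoted output.
gfn_testfile1 = pfn_from_pagemap(0xA18000000030E50C)
gfn_testfile2 = pfn_from_pagemap(0xA18000000020E50C)

delta_pages = gfn_testfile1 - gfn_testfile2
delta_bytes = delta_pages << PAGE_SHIFT

print(hex(gfn_testfile1), hex(gfn_testfile2))  # 0x30e50c 0x20e50c
print(hex(delta_pages))                        # 0x100000
print(hex(delta_bytes))                        # 0x100000000
```

So a GFN offset of 0x100000 corresponds to exactly 4 GiB: the two guest-physical addresses (0x30e50c000 and 0x20e50c000) differ only in bit 32.
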
> 
> The corruption isn't limited to the page cache but makes the whole system unstable
> and from time to time results in a kernel panic or random segmentation faults.
> It's also not easy to reproduce: I need to read/write a lot of blocks to trigger
> it and the bug looks to be time-sensitive.
> 
> All three servers behave the same and it doesn't look like the problem is caused
> by a simple hardware issue. All healthchecks and tests on RAM/storage/other
> components pass.
> 
> Our BOSS-S1 PCI card uses the following SATA controller: Marvell Technology
> Group Ltd. 88SE9230 PCIe 2.0 x2 4-port SATA 6 Gb/s RAID Controller (rev 11).
> There are well-known problems with this family of controllers and Linux
> contains a fixup for DMA function 1
> (https://github.com/torvalds/linux/blob/2408a807bfc3f738850ef5ad5e3fd59d66168996/drivers/pci/quirks.c#L4316).
> This bug is known to cause some issues on Xen with IOMMU
> (https://github.com/QubesOS/qubes-issues/issues/5968). I'm not sure if it's
> somehow correlated and causes problems with PV as well.
> 
> By testing the bug under different conditions I also spotted a few more
> correlations:
> 
> - bug occurs on Xen PV Dom0 and was reproduced on Xen versions from 4.16.0 to
>  4.19.2-pre (up to git:4803a3c5b5 from stable-4.19) and Debian 10 to 12
>  (both stable and backports kernel). Somehow that specific commit
>  git:4803a3c5b5 makes the bug harder to trigger but it may be just a
>  coincidence.

I think it's more likely to be a Linux bug than a Xen (hypervisor)
bug.

> - I was unable to reproduce it when Xen was compiled from the master branch, but
>  once again I'm not sure whether it was just bad timing for triggering the
>  bug.
> - bug occurs only on ext4 file system with hardware RAID backed by BOSS-S1
> - bug DOESN'T occur without Xen
> - bug DOESN'T occur on Xen PVH Dom0
> - bug DOESN'T occur on Xen PV Dom0 when Xen was compiled with `NDEBUG`
>  excluded in file `xen/arch/x86/pv/dom0_build.c`. When I played with it, I
>  found that I'm unable to reproduce the issue when the code that reverses
>  the MFN<->PFN mapping for Dom0 is active.

So the issue doesn't happen on debug=y builds?

That's unexpected.  I would expect the opposite, that some code in
Linux assumes that pfn + 1 == mfn + 1, and hence breaks when the
relation is reversed.

> - bug DOESN'T occur when using different storage than one backed by BOSS-S1.
> - bug was tested under a few additional conditions and reproduction is not
>  dependent on these:
>  - -O1/-O2/no optimization behaves the same
>  - PAT patch to use Linux PAT layout instead of Xen's choice doesn't
>  change anything
>  (https://github.com/QubesOS/qubes-vmm-xen/blob/main/1018-x86-Use-Linux-s-PAT.patch)
>  - different Linux kernel version doesn't change anything
>  - vCPU pinning (e.g. single vCPU pinned to Dom0) doesn't change anything
> - bug was tested only with smt=1 because Xen doesn't boot properly on our
>  machines with smt=0 (hangs with "(XEN) CPU X still not dead", similar to
>  https://lists.xen.org/archives/html/xen-devel/2019-08/msg00138.html)

Hm, from that thread it seems like the original bug should already be
fixed.

> `xl info` for my testbed:
> 
> ```
> # xl info
> host : <redacted>
> release : 6.12.9+bpo-amd64
> version : #1 SMP PREEMPT_DYNAMIC Debian 6.12.9-1~bpo12+1 (2025-01-19)
> machine : x86_64
> nr_cpus : 28
> max_cpu_id : 27
> nr_nodes : 1
> cores_per_socket : 14
> threads_per_core : 2
> cpu_mhz : 2593.905
> hw_caps : bfebfbff:77fef3ff:2c100800:00000121:0000000f:d29ffffb:00000008:00000100
> virt_caps : pv hvm hvm_directio pv_directio hap shadow iommu_hap_pt_share vmtrace gnttab-v1 gnttab-v2
> total_memory : 130562
> free_memory : 96501
> sharing_freed_memory : 0
> sharing_used_memory : 0
> outstanding_claims : 0
> free_cpus : 0
> xen_major : 4
> xen_minor : 19
> xen_extra : .2-pre
> xen_version : 4.19.2-pre
> xen_caps : xen-3.0-x86_64 hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
> xen_scheduler : credit2
> xen_pagesize : 4096
> platform_params : virt_start=0xffff800000000000
> xen_changeset : Tue Jan 21 09:21:01 2025 +0100 git:4803a3c5b5
> xen_commandline : placeholder dom0_mem=32G,max:32G dom0_max_vcpus=16 dom0_vcpus_pin=1 no-real-mode edd=off
> cc_compiler : gcc (Debian 12.2.0-14) 12.2.0
> cc_compile_by : root
> cc_compile_domain : <redacted>
> cc_compile_date : Mon Feb 17 17:31:08 UTC 2025
> build_id : 410ba653f1f1fc13770b5d2a8cdf5e4d285b6783
> xend_config_format : 4
> ```
> 
> After collecting all of this information I've hit a roadblock. The effects of this
> bug on memory consistency are pretty serious, but on the other hand they occur
> under very specific conditions, which makes them difficult to track down. I would
> appreciate any help in finding the root cause of this issue.

Can you see if you can reproduce with dom0-iommu=strict in the Xen
command line?

Thanks, Roger.
