Re: Memory corruption bug with Xen PV Dom0 and BOSS-S1 RAID card

From: "Roger Pau Monné" <roger.pau@citrix.com>
To: "Jürgen Groß" <jgross@suse.com>
Cc: "Paweł Srokosz" <pawel.srokosz@cert.pl>,
	xen-devel <xen-devel@lists.xenproject.org>,
	"andrew cooper3" <andrew.cooper3@citrix.com>,
	JBeulich@suse.com
Subject: Re: Memory corruption bug with Xen PV Dom0 and BOSS-S1 RAID card
Date: Thu, 20 Feb 2025 14:29:30 +0100
Message-ID: <Z7cuOl0um1XG0t74@macbook.local>
In-Reply-To: <c6e37d70-6d27-4857-8541-f522a48915a3@suse.com>

On Thu, Feb 20, 2025 at 01:43:39PM +0100, Jürgen Groß wrote:
> On 20.02.25 13:37, Roger Pau Monné wrote:
> > On Thu, Feb 20, 2025 at 10:31:02AM +0100, Jürgen Groß wrote:
> > > On 20.02.25 10:16, Roger Pau Monné wrote:
> > > > On Wed, Feb 19, 2025 at 07:37:47PM +0100, Paweł Srokosz wrote:
> > > > > Hello,
> > > > > 
> > > > > > So the issue doesn't happen on debug=y builds? That's unexpected.  I would
> > > > > > expect the opposite, that some code in Linux assumes that pfn + 1 == mfn +
> > > > > > 1, and hence breaks when the relation is reversed.
> > > > > 
> > > > > It was also surprising for me but I think the key thing is that debug=y
> > > > > causes whole mapping to be reversed so each PFN lands on completely different
> > > > > MFN e.g. MFN=0x1300000 is mapped to PFN=0x20e50c in ndebug, but in debug
> > > > > it's mapped to PFN=0x5FFFFF. I guess that's why I can't reproduce the
> > > > > problem.
> > > > > 
> > > > > > Can you see if you can reproduce with dom0-iommu=strict in the Xen command
> > > > > > line?
> > > > > 
> > > > > Unfortunately, it doesn't help. But I have a few more observations.
> > > > > 
> > > > > Firstly, I checked the "xen-mfndump dump-m2p" output and found that misread
> > > > > blocks are mapped to suspiciously round MFNs. I have different versions of
> > > > > Xen and Linux kernel on each machine and I see some coincidence.
> > > > > 
> > > > > I'm writing few huge files without Xen to ensure that they have been written
> > > > > correctly (because under Xen both read and writeback is affected). Then I'm
> > > > > booting to Xen, memory-mapping the files and reading each page. I see that when
> > > > > block is corrupted, it is mapped on round MFN e.g. pfn=0x5095d9/mfn=0x1600000,
> > > > > another on pfn=0x4095d9/mfn=0x1500000 etc.
> > > > > 
> > > > > On another machine with different Linux/Xen version these faults appear on
> > > > > pfn=0x20e50c/mfn=0x1300000, pfn=0x30e50c/mfn=0x1400000 etc.
> > > > > 
> > > > > I also noticed that during read of page that is mapped to
> > > > > pfn=0x20e50c/mfn=0x1300000, I'm getting these faults from DMAR:
> > > > > 
> > > > > ```
> > > > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200000000
> > > > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200001000
> > > > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200006000
> > > > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200008000
> > > > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200009000
> > > > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000a000
> > > > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000c000
> > > > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > > > ```
> > > > 
> > > > That's interesting, it seems to me that Linux is assuming that pages
> > > > at certain boundaries are superpages, and thus it can just increase
> > > > the mfn to get the next physical page.
> > > 
> > > I'm not sure this is true. See below.
> > > 
> > > > > and every time I'm dropping the cache and reading this region, I'm getting
> > > > > DMAR faults on few random addresses from 1200000000-120000f000 range (I guess
> > > > > MFNs 0x1200000-120000f). MFNs 0x1200000-0x12000ff are not mapped to any PFN in
> > > > > Dom0 (based on xen-mfndump output.).
> > > > 
> > > > It would be very interesting to figure out where those requests
> > > > originate, iow: which entity in Linux creates the bios with the
> > > > faulting address(es).
> > > 
> > > I _think_ this is related to the kernel trying to get some contiguous areas
> > > for the buffers used by the I/Os. As those areas are being given back after
> > > the I/O, they don't appear in the mfndump.
> > > 
> > > > It's a wild guess, but could you try to boot Linux with swiotlb=force
> > > > on the command line and attempt to trigger the issue?  I wonder
> > > > whether imposing the usage of the swiotlb will surface the issues as
> > > > CPU accesses, rather than IOMMU faults, and that could get us a trace
> > > > inside Linux of how those requests are generated.
> > > > 
> > > > > On the other hand, I'm not getting these DMAR faults while reading other regions.
> > > > > Also I can't trigger the bug with reversed Dom0 mapping, even if I fill the page
> > > > > cache with reads.
> > > > 
> > > > There's possibly some condition we are missing that causes a component
> > > > in Linux to assume the next address is mfn + 1, instead of doing the
> > > > full address translation from the linear or pfn space.
> > > 
> > > My theory is:
> > > 
> > > The kernel is seeing the used buffer to be a physically contiguous area,
> > > so it is _not_ using a scatter-gather list (it does in the debug Xen case,
> > > resulting in it not to show any errors). Unfortunately the buffer is not
> > > aligned to its size, so swiotlb-xen will remap the buffer to a suitably
> > > aligned one. The driver will then use the returned machine address for
> > > I/Os to both the devices of the RAID configuration. When the first I/O is
> > > done, the driver probably is calling the DMA unmap or device sync function
> > > already, causing the intermediate contiguous region to be destroyed again
> > > (this is the time when the DMAR errors should show up for the 2nd I/O still
> > > running).
> > > 
> > > So the main issue IMHO is, that a DMA buffer mapped for one device is used
> > > for 2 devices instead.
> > 
> > But that won't cause IOMMU faults?  Because the memory used by the
> > bounce buffer would still be owned by dom0 (and thus part of its IOMMU
> > page-tables), just probably re-written to contain different data.
> > 
> > Or is the swiotlb contiguous region torn down after every operation?
> 
> See the kernel function xen_swiotlb_alloc_coherent(): it will try to
> allocate a contiguous region from the hypervisor on demand and give it
> back via xen_swiotlb_free_coherent() after the I/O.
> 
> > That would seem extremely wasteful to me, I assume the buffer is
> > allocated during device init, and stays the same until the device is
> > detached.
> 
> Yes, that is the normal use case for xen_swiotlb_alloc_coherent(). Whether
> all users are doing it that way is another question.
> 
> For normal I/O the standard case is to use either SG-list, a pre-allocated
> contiguous region, or the swiotlb (implicitly done via xen_swiotlb_map_page()).
> 
> As the observation was that there are DMAR messages NOT related to dom0 MFNs,
> I ruled out normal swiotlb buffers, which are indeed pre-allocated and as such
> known to belong to dom0 when taking the mfndump.

Do you have any suggestions on how to debug this further? Is there
some way to trace swiotlb operations to detect this case?
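
Independently of any tracing, the DMAR lines quoted above can be turned
into MFNs mechanically and checked against the m2p dump.  A minimal
Python sketch follows.  It assumes 4 KiB pages and the exact log format
shown in Paweł's report; the helper names and the {mfn: pfn} dict are
hypothetical, and parsing of the real xen-mfndump output is deliberately
left out since I'm not assuming its format here.

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB pages, so mfn = addr >> 12

# Matches lines of the form quoted in the report, e.g.
# (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200000000
FAULT_RE = re.compile(
    r"\[VT-D\]DMAR:\[DMA (?P<op>Read|Write)\] Request device "
    r"\[(?P<sbdf>[0-9a-f:.]+)\] fault addr (?P<addr>[0-9a-f]+)"
)


def fault_mfns(log_text):
    """Return the sorted, de-duplicated MFNs of all DMAR faults in log_text."""
    mfns = set()
    for match in FAULT_RE.finditer(log_text):
        mfns.add(int(match.group("addr"), 16) >> PAGE_SHIFT)
    return sorted(mfns)


def unmapped(mfns, m2p):
    """Return the MFNs with no entry in m2p, a {mfn: pfn} dict built by hand
    (or by a parser of your own) from the dom0 m2p dump."""
    return [mfn for mfn in mfns if mfn not in m2p]


SAMPLE = """\
(XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200000000
(XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
(XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200001000
(XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
"""

if __name__ == "__main__":
    for mfn in fault_mfns(SAMPLE):
        print(hex(mfn))
```

Feeding it the full log together with the m2p entries would show at a
glance whether every faulting frame lies outside dom0's mappings, as
the 0x1200000-0x12000ff observation suggests.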

I wonder whether the above scenario wouldn't also trigger on native:
non-aligned buffers are possible there too, so prematurely
relinquishing the bounced memory should cause issues there as well.
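
To check that I understand the sequence you describe, here is a toy
model of it in Python.  It is not kernel or Xen code: ToyIommu,
ToyBounce and run_mirrored_write are names made up for illustration,
and "unmap" simply stands for handing the contiguous region back so
that its frames leave the IOMMU page-tables.

```python
class ToyIommu:
    """Tracks which machine frames the domain may currently DMA to."""

    def __init__(self):
        self.allowed = set()

    def dma_write(self, device, mfn):
        # Returning False models a DMAR "PTE Write access is not set" fault.
        return mfn in self.allowed


class ToyBounce:
    """Models an on-demand contiguous region that is returned on unmap."""

    def __init__(self, iommu):
        self.iommu = iommu

    def map(self, mfns):
        region = list(mfns)
        self.iommu.allowed.update(region)
        return region

    def unmap(self, region):
        self.iommu.allowed.difference_update(region)


def run_mirrored_write(unmap_after_first):
    """One mapping, two devices.  Returns (first_ok, second_ok)."""
    iommu = ToyIommu()
    bounce = ToyBounce(iommu)
    region = bounce.map(range(0x1200000, 0x1200010))

    first_ok = all(iommu.dma_write("disk0", mfn) for mfn in region)
    if unmap_after_first:
        # The suspected bug: the mapping is dropped while disk1 still uses it.
        bounce.unmap(region)
    second_ok = all(iommu.dma_write("disk1", mfn) for mfn in region)
    if not unmap_after_first:
        bounce.unmap(region)
    return first_ok, second_ok


if __name__ == "__main__":
    print(run_mirrored_write(unmap_after_first=True))
    print(run_mirrored_write(unmap_after_first=False))
```

In this model the second device only faults when the region really
leaves the IOMMU mappings on unmap, which is why I'm asking whether the
region is torn down per operation.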

Thanks, Roger.
