Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>,
	"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	Keir Fraser <keir@xen.org>,
	Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>,
	"bruce.edge@gmail.com" <bruce.edge@gmail.com>,
	Gianni Tedesco <gianni.tedesco@citrix.com>
Subject: Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
Date: Tue, 16 Nov 2010 10:50:16 -0500
Message-ID: <20101116155016.GA6535@dumpdata.com>
In-Reply-To: <1289899586.31507.717.camel@zakaz.uk.xensource.com>

Disclaimer:
This email got a bit lengthy, so make sure you have a cup of coffee when you read it.

> On an unrelated note I think if we do go down the route of having the
> guest kernel punch the holes itself and such we should do so iff
> XENMEM_memory_map returns either ENOSYS or nr_entries == 1 to leave open

When would that actually happen? Is ENOSYS what gets returned when the
hypervisor does not implement the call (and in which version was it implemented)?

> the possibility of cunning tricks on the tools side in the future.

<shudders>
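
For reference, a minimal sketch of the condition Ian describes above. The
helper name is hypothetical, and it assumes the Linux convention that a
failed hypercall shows up as a negative errno value. Treating nr_entries == 1
as "the toolstack supplied only one flat region" is also an assumption; the
thread does not spell out the reasoning.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not taken from the patch: decide whether the guest
 * kernel should punch the PCI hole itself, based on the result of a
 * XENMEM_memory_map call.
 *
 * rc         - return code of the call (assumed: 0 or a negative errno)
 * nr_entries - number of E820 entries the call reported
 */
bool guest_should_punch_hole(int rc, unsigned int nr_entries)
{
    /* The hypervisor does not implement the call at all. */
    if (rc == -ENOSYS)
        return true;

    /* Any other failure: do not guess, leave the layout alone. */
    if (rc != 0)
        return false;

    /* Assumed meaning: a single entry is one flat region with no holes. */
    return nr_entries == 1;
}
```

With more than one entry the guest would leave the map untouched, which keeps
the door open for the toolstack to hand over a layout of its own.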

I think we have three options with regard to this RFC patch I posted:
 1). Continue with this and have the toolstack punch the PCI hole. It would
     fill the PCI hole area with INVALID_MFN. The toolstack determines where
     the PCI hole starts.
 2). Do this in the guest where the guest calls both XENMEM_machine_memory_map and
     XENMEM_memory_map to get an idea of the host side PCI hole and set it up.
     Requires changes in hypervisor to allow non-privileged PV guest to make
     XENMEM_machine_memory_map call. Linux kernel decides where PCI hole starts and
     the PCI hole is filled with INVALID_MFN.
 3). Unconditionally make a PCI hole starting at 3GB. The PCI hole is filled with INVALID_MFN.
 4). Another one I didn't think of?
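
To make the E820 side of these options concrete, here is a rough sketch of
what a map with a PCI hole could look like. It is not the code from the RFC
patch: the struct, the function name and the choice to relocate the displaced
RAM above 4GB are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define E820_RAM 1
#define FOUR_GB  (1ULL << 32)

/* Simplified E820 entry; the field names are illustrative. */
struct e820_entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/*
 * Build a RAM layout for a guest with mem_bytes of memory and a PCI hole
 * covering [hole_start, 4GB). The caller must pass hole_start <= 4GB.
 * RAM that would have landed in the hole is placed above 4GB (assumption).
 * Returns the number of entries written to out (1 or 2).
 */
size_t e820_with_pci_hole(uint64_t mem_bytes, uint64_t hole_start,
                          struct e820_entry out[2])
{
    size_t n = 0;

    if (mem_bytes <= hole_start) {
        /* All RAM fits below the hole: a single flat entry is enough. */
        out[n++] = (struct e820_entry){ .addr = 0, .size = mem_bytes,
                                        .type = E820_RAM };
        return n;
    }

    /* RAM below the hole. */
    out[n++] = (struct e820_entry){ .addr = 0, .size = hole_start,
                                    .type = E820_RAM };
    /* The remainder, moved above the 4GB boundary. */
    out[n++] = (struct e820_entry){ .addr = FOUR_GB,
                                    .size = mem_bytes - hole_start,
                                    .type = E820_RAM };
    return n;
}
```

For a 4GB guest with the hole at 3GB this yields one entry for [0, 3GB) and
one for [4GB, 5GB). The range in between is simply not described as RAM.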

For all of those cases, when devices show up we populate the P2M array on demand
with the MFNs. For the first two proposals, the BARs we read from
the PCI devices are going to be written to the P2M array as identity (so
mfn_list[0xc0000] == 0xc0000). The code for this has not been written yet.

For the third proposal, we would have non-identity mappings in the P2M array, as
during the migration we could move from a device with BARs of 0xc0000 to 0x20000.
So mfn_list[0xc0000] = 0x20000.
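
Since that code does not exist yet, the following is only a sketch of the two
P2M behaviours described above, with the P2M modelled as a plain array. The
function names are hypothetical, and INVALID_MFN is defined locally for the
example.

```c
#include <stddef.h>

#define INVALID_MFN (~0UL)

/* Mark every PFN in [start_pfn, end_pfn) as having no backing frame. */
void p2m_punch_hole(unsigned long *mfn_list,
                    unsigned long start_pfn, unsigned long end_pfn)
{
    for (unsigned long pfn = start_pfn; pfn < end_pfn; pfn++)
        mfn_list[pfn] = INVALID_MFN;
}

/*
 * Populate the P2M for a BAR of nr_pages pages when a device shows up.
 * guest_pfn == host_mfn gives the identity mapping of proposals 1) and 2);
 * different values give the non-identity mapping of proposal 3).
 */
void p2m_map_bar(unsigned long *mfn_list, unsigned long guest_pfn,
                 unsigned long host_mfn, unsigned long nr_pages)
{
    for (unsigned long i = 0; i < nr_pages; i++)
        mfn_list[guest_pfn + i] = host_mfn + i;
}
```

In these terms, p2m_map_bar(mfn_list, 0xc0000, 0xc0000, n) is the identity
case and p2m_map_bar(mfn_list, 0xc0000, 0x20000, n) is the post-migration
case from the third proposal.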

But for the third case I am unsure how we would get the "real" MFNs. We initially get
the BARs via 0xcf8 calls and if we don't filter them, they reach the ioremap function.
Say the host side BAR is at 0x20000, and our PCI hole starts at 0xc0000. The ioremap
gets called with 0x20000, and in the guest's E820 that region is 'System RAM'.

        last_pfn = last_addr >> PAGE_SHIFT;
        for (pfn = phys_addr >> PAGE_SHIFT; pfn <= last_pfn; pfn++) {
                int is_ram = page_is_ram(pfn);

                if (is_ram && pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn)))
                        return NULL;
                WARN_ON_ONCE(is_ram);
        }

Ugh, and it will think (correctly) that it falls within RAM.
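
The collision can be modelled outside the kernel. The sketch below is a
userspace stand-in for the page_is_ram() test in the loop above, not kernel
code; the struct and function names are made up for the example, and all
values are page frame numbers.

```c
#include <stdbool.h>
#include <stddef.h>

/* A run of guest RAM, [start_pfn, end_pfn). */
struct pfn_range {
    unsigned long start_pfn;
    unsigned long end_pfn;   /* exclusive */
};

/* Stand-in for page_is_ram(): is pfn inside any RAM range? */
bool pfn_is_ram(const struct pfn_range *ram, size_t n, unsigned long pfn)
{
    for (size_t i = 0; i < n; i++)
        if (pfn >= ram[i].start_pfn && pfn < ram[i].end_pfn)
            return true;
    return false;
}

/* True if any page of the BAR [bar_pfn, bar_pfn + nr_pages) is RAM. */
bool bar_overlaps_ram(const struct pfn_range *ram, size_t n,
                      unsigned long bar_pfn, unsigned long nr_pages)
{
    for (unsigned long pfn = bar_pfn; pfn < bar_pfn + nr_pages; pfn++)
        if (pfn_is_ram(ram, n, pfn))
            return true;
    return false;
}
```

With RAM described as [0, 0xc0000), a BAR at 0x20000 overlaps RAM while a BAR
at 0xc0000 does not, which is exactly the situation ioremap trips over.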

If we filter the 0xcf8 calls, which we can do in the Xen PCI backend case, we can then
provide BARs that always start at 0xC0000. But that does not help the PV guest to
know the "real" MFNs which it needs so it can program the P2M array. So the Xen
PCI front would have to do this - which it could, though it adds complexity to it.

We also need to make all of this work with Domain zero, and here 1) or 2) can
easily be used as the Xen hypervisor has given us the E820 nicely peppered with holes.
(I wonder, what happens if dom0 makes a XENMEM_memory_map call - does it get anything?)

If we then go with 3), we would need to instrument the code that reads the BARs so that
it can filter them properly. That would be the low-level Linux pci_conf_read and that is not
going to happen - so we would have to make the Xen hypervisor aware of this and, when
it traps the in/out, provide new BAR values starting at 0xC0000.

I am not comfortable maintaining this filter/keep state code in both the Xen hypervisor
and the Xen PCI front module so I think 3) would not work that well, unless there are
better ways that I have missed?

Back to 1) and 2). Migration would work if we unplug the PCI devices before suspend and
on resume plug them back in - otherwise the PCI BARs might have changed between
migrations. When the guest gets recreated - how does it iterate over the E820 to create
the P2M list? Or is that something that is not done and we just save the P2M list and
restore as-is on the other side? Naturally, since we would unplug the PCI device the
entries in the E820 gaps would be INVALID_MFN...

If we consult the E820 during resume I think doing the PCI hole in the toolstack has
merits - simply b/c the user can set the PCI hole to an arbitrary address that is low
enough (0x2000 say) to cover all of the machines that he/she would migrate to. While
if we do it in the Linux kernel we do not have that information. Even if we don't
consult the E820, the toolstack still has merits - as the PCI hole start address
might be different between the migration machines and we might have started on
a box with the PCI hole being way up (3.9GB) while the other machines might have
it at 1.2GB.
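
As an illustration of "low enough to cover all of the machines": if the hole
start of every candidate host were known, a safe value would be the lowest of
them. The helper below is hypothetical, and nothing in the toolstack collects
this list today; it only shows the arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the lowest PCI hole start among the candidate hosts.
 * Returns 0 when the list is empty; the caller has to treat that as
 * "no data" rather than as a usable hole start.
 */
uint64_t lowest_hole_start(const uint64_t *starts, size_t n)
{
    if (n == 0)
        return 0;

    uint64_t lowest = starts[0];
    for (size_t i = 1; i < n; i++)
        if (starts[i] < lowest)
            lowest = starts[i];
    return lowest;
}
```

For the example above, with one host at roughly 3.9GB and another at roughly
1.2GB, the guest's hole would have to start at or below the 1.2GB mark.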

The other thing I don't know is how all of this works with 32-bit kernels.

P.S.
I've done the testing of 1) with 64-bit w/ and w/o ballooning and it worked fine.
