* maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages"
@ 2017-03-22 21:16 Dan Streetman
  2017-03-23  2:13 ` Boris Ostrovsky
  0 siblings, 1 reply; 18+ messages in thread

From: Dan Streetman @ 2017-03-22 21:16 UTC (permalink / raw)
To: Boris Ostrovsky, Konrad Rzeszutek Wilk
Cc: David Vrabel, Juergen Gross, xen-devel, linux-kernel

I have a question about a problem introduced by this commit:
c275a57f5ec3056f732843b11659d892235faff7
"xen/balloon: Set balloon's initial state to number of existing RAM pages"

It changed the xen balloon current_pages calculation to start with the number
of physical pages in the system, instead of max_pfn. Since get_num_physpages()
does not include holes, it's always less than the e820 map's max_pfn.

However, the problem that commit introduced is: if the hypervisor sets the
balloon target equal to the e820 map's max_pfn, then the balloon target will
*always* be higher than the initial current pages. Even if the hypervisor sets
the target to (e820 max_pfn - holes), if the OS adds any holes, the balloon
target will be higher than the current pages. This is the situation, for
example, for Amazon AWS instances. The result is that the xen balloon will
always immediately hotplug some memory at boot, but then make only
(max_pfn - get_num_physpages()) of it available to the system.

This balloon-hotplugged memory can cause problems if the hypervisor wasn't
expecting it; specifically, the system's physical page addresses will now
exceed the e820 map's max_pfn, due to the balloon-hotplugged pages; if the
hypervisor isn't expecting pt-device DMA to/from those physical pages above
the e820 max_pfn, it causes problems.
For example:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129

The additional small amount of balloon memory can cause other problems as
well, for example:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457

Anyway, I'd like to ask: was the original commit added because hypervisors are
supposed to set their balloon target to the guest system's number of phys
pages (max_pfn - holes)? The mailing list discussion and commit description
seem to indicate that. However, I'm not sure how that is possible, because the
kernel reserves its own holes, regardless of any predefined holes in the e820
map; for example, the kernel reserves 64k (by default) at phys addr 0 (the
amount of reservation is configurable via CONFIG_X86_RESERVE_LOW). So the
hypervisor really has no way to know what the "right" target to specify is;
unless it knows the exact guest OS and kernel version, and kernel config
values, it will never be able to correctly specify its target to be exactly
(e820 max_pfn - all holes).

Should this commit be reverted? Should the xen balloon target be adjusted
based on kernel-added e820 holes? Should something else be done?

For context, Amazon Linux has simply disabled Xen ballooning completely.
Likewise, we're planning to disable Xen ballooning in the Ubuntu kernel for
Amazon AWS-specific kernels (but not for non-AWS Ubuntu kernels). However, if
reverting this patch makes sense in a bigger context (i.e. for Xen users
besides AWS), that would allow more Ubuntu kernels to work correctly in AWS
instances.

^ permalink raw reply [flat|nested] 18+ messages in thread
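[Editorial note: the mismatch Dan describes can be sketched numerically. All
page counts below are made-up illustrative values, not taken from a real
guest; the point is only the sign of the difference between the two
quantities.]

```shell
# Illustrative-only sketch: commit c275a57f5ec3 initializes the balloon's
# current_pages from get_num_physpages() (RAM pages, excluding holes),
# while the toolstack may set the target to the e820 max_pfn.
max_pfn=$(( 0x40000 ))        # assumed e820 max_pfn: 1 GB guest, 4 KiB pages
hole_pages=$(( 0x2000 ))      # assumed pages lost to e820 holes/reservations
num_physpages=$(( max_pfn - hole_pages ))

target=$max_pfn               # what the hypervisor/toolstack sets
current=$num_physpages        # what the commit starts current_pages at
deficit=$(( target - current ))

echo "target=$target current=$current deficit=$deficit"
# A positive deficit is what makes the balloon driver hotplug memory
# immediately at boot, as described above.
```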
* Re: maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" 2017-03-22 21:16 maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" Dan Streetman @ 2017-03-23 2:13 ` Boris Ostrovsky 2017-03-23 7:56 ` Juergen Gross 2017-03-24 20:34 ` Dan Streetman 0 siblings, 2 replies; 18+ messages in thread From: Boris Ostrovsky @ 2017-03-23 2:13 UTC (permalink / raw) To: Dan Streetman, Konrad Rzeszutek Wilk Cc: David Vrabel, Juergen Gross, xen-devel, linux-kernel On 03/22/2017 05:16 PM, Dan Streetman wrote: > I have a question about a problem introduced by this commit: > c275a57f5ec3056f732843b11659d892235faff7 > "xen/balloon: Set balloon's initial state to number of existing RAM pages" > > It changed the xen balloon current_pages calculation to start with the > number of physical pages in the system, instead of max_pfn. Since > get_num_physpages() does not include holes, it's always less than the > e820 map's max_pfn. > > However, the problem that commit introduced is, if the hypervisor sets > the balloon target to equal to the e820 map's max_pfn, then the > balloon target will *always* be higher than the initial current pages. > Even if the hypervisor sets the target to (e820 max_pfn - holes), if > the OS adds any holes, the balloon target will be higher than the > current pages. This is the situation, for example, for Amazon AWS > instances. The result is, the xen balloon will always immediately > hotplug some memory at boot, but then make only (max_pfn - > get_num_physpages()) available to the system. > > This balloon-hotplugged memory can cause problems, if the hypervisor > wasn't expecting it; specifically, the system's physical page > addresses now will exceed the e820 map's max_pfn, due to the > balloon-hotplugged pages; if the hypervisor isn't expecting pt-device > DMA to/from those physical pages above the e820 max_pfn, it causes > problems. 
For example: > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129 > > The additional small amount of balloon memory can cause other problems > as well, for example: > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457 > > Anyway, I'd like to ask, was the original commit added because > hypervisors are supposed to set their balloon target to the guest > system's number of phys pages (max_pfn - holes)? The mailing list > discussion and commit description seem to indicate that. IIRC the problem that this was trying to fix was that since max_pfn includes holes, upon booting we'd immediately balloon down by the (typically, MMIO) hole size. If you boot a guest with ~4+GB memory you should see this. > However I'm > not sure how that is possible, because the kernel reserves its own > holes, regardless of any predefined holes in the e820 map; for > example, the kernel reserves 64k (by default) at phys addr 0 (the > amount of reservation is configurable via CONFIG_X86_RESERVE_LOW). So > the hypervisor really has no way to know what the "right" target to > specify is; unless it knows the exact guest OS and kernel version, and > kernel config values, it will never be able to correctly specify its > target to be exactly (e820 max_pfn - all holes). > > Should this commit be reverted? Should the xen balloon target be > adjusted based on kernel-added e820 holes? I think the second one but shouldn't current_pages be updated, and not the target? The latter is set by Xen (toolstack, via xenstore usually). Also, the bugs above (at least one of them) talk about NVMe and I wonder whether the memory that they add is of RAM type --- I believe it has its own type and so perhaps that introduces additional inconsistencies. AWS may have added their own support for that, which we don't have upstream yet. -boris > Should something else be > done? > > For context, Amazon Linux has simply disabled Xen ballooning > completely. 
Likewise, we're planning to disable Xen ballooning in the > Ubuntu kernel for Amazon AWS-specific kernels (but not for non-AWS > Ubuntu kernels). However, if reverting this patch makes sense in a > bigger context (i.e. Xen users besides AWS), that would allow more > Ubuntu kernels to work correctly in AWS instances. > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" 2017-03-23 2:13 ` Boris Ostrovsky @ 2017-03-23 7:56 ` Juergen Gross 2017-03-24 20:30 ` Dan Streetman 2017-03-24 20:34 ` Dan Streetman 1 sibling, 1 reply; 18+ messages in thread From: Juergen Gross @ 2017-03-23 7:56 UTC (permalink / raw) To: Dan Streetman Cc: Boris Ostrovsky, Konrad Rzeszutek Wilk, xen-devel, linux-kernel On 23/03/17 03:13, Boris Ostrovsky wrote: > > > On 03/22/2017 05:16 PM, Dan Streetman wrote: >> I have a question about a problem introduced by this commit: >> c275a57f5ec3056f732843b11659d892235faff7 >> "xen/balloon: Set balloon's initial state to number of existing RAM >> pages" >> >> It changed the xen balloon current_pages calculation to start with the >> number of physical pages in the system, instead of max_pfn. Since >> get_num_physpages() does not include holes, it's always less than the >> e820 map's max_pfn. >> >> However, the problem that commit introduced is, if the hypervisor sets >> the balloon target to equal to the e820 map's max_pfn, then the >> balloon target will *always* be higher than the initial current pages. >> Even if the hypervisor sets the target to (e820 max_pfn - holes), if >> the OS adds any holes, the balloon target will be higher than the >> current pages. This is the situation, for example, for Amazon AWS >> instances. The result is, the xen balloon will always immediately >> hotplug some memory at boot, but then make only (max_pfn - >> get_num_physpages()) available to the system. >> >> This balloon-hotplugged memory can cause problems, if the hypervisor >> wasn't expecting it; specifically, the system's physical page >> addresses now will exceed the e820 map's max_pfn, due to the >> balloon-hotplugged pages; if the hypervisor isn't expecting pt-device >> DMA to/from those physical pages above the e820 max_pfn, it causes >> problems. 
For example: >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129 >> >> The additional small amount of balloon memory can cause other problems >> as well, for example: >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457 >> >> Anyway, I'd like to ask, was the original commit added because >> hypervisors are supposed to set their balloon target to the guest >> system's number of phys pages (max_pfn - holes)? The mailing list >> discussion and commit description seem to indicate that. > > > IIRC the problem that this was trying to fix was that since max_pfn > includes holes, upon booting we'd immediately balloon down by the > (typically, MMIO) hole size. > > If you boot a guest with ~4+GB memory you should see this. > > >> However I'm >> not sure how that is possible, because the kernel reserves its own >> holes, regardless of any predefined holes in the e820 map; for >> example, the kernel reserves 64k (by default) at phys addr 0 (the >> amount of reservation is configurable via CONFIG_X86_RESERVE_LOW). So >> the hypervisor really has no way to know what the "right" target to >> specify is; unless it knows the exact guest OS and kernel version, and >> kernel config values, it will never be able to correctly specify its >> target to be exactly (e820 max_pfn - all holes). >> >> Should this commit be reverted? Should the xen balloon target be >> adjusted based on kernel-added e820 holes? > > I think the second one but shouldn't current_pages be updated, and not > the target? The latter is set by Xen (toolstack, via xenstore usually). Right. Looking into a HVM domU I can't see any problem related to CONFIG_X86_RESERVE_LOW: it is set to 64 on my system. The domU is configured with 2048 MB of RAM, 8MB being video RAM. Looking into /sys/devices/system/xen_memory/xen_memory0 I can see the current size and target size do match: both are 2088960 kB (2 GB - 8 MB). Ballooning down and up to 2048 MB again doesn't change the picture. 
So which additional holes are added by the kernel on AWS via which functions?

Juergen

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" 2017-03-23 7:56 ` Juergen Gross @ 2017-03-24 20:30 ` Dan Streetman 0 siblings, 0 replies; 18+ messages in thread From: Dan Streetman @ 2017-03-24 20:30 UTC (permalink / raw) To: Juergen Gross Cc: Boris Ostrovsky, Konrad Rzeszutek Wilk, xen-devel, linux-kernel On Thu, Mar 23, 2017 at 3:56 AM, Juergen Gross <jgross@suse.com> wrote: > On 23/03/17 03:13, Boris Ostrovsky wrote: >> >> >> On 03/22/2017 05:16 PM, Dan Streetman wrote: >>> I have a question about a problem introduced by this commit: >>> c275a57f5ec3056f732843b11659d892235faff7 >>> "xen/balloon: Set balloon's initial state to number of existing RAM >>> pages" >>> >>> It changed the xen balloon current_pages calculation to start with the >>> number of physical pages in the system, instead of max_pfn. Since >>> get_num_physpages() does not include holes, it's always less than the >>> e820 map's max_pfn. >>> >>> However, the problem that commit introduced is, if the hypervisor sets >>> the balloon target to equal to the e820 map's max_pfn, then the >>> balloon target will *always* be higher than the initial current pages. >>> Even if the hypervisor sets the target to (e820 max_pfn - holes), if >>> the OS adds any holes, the balloon target will be higher than the >>> current pages. This is the situation, for example, for Amazon AWS >>> instances. The result is, the xen balloon will always immediately >>> hotplug some memory at boot, but then make only (max_pfn - >>> get_num_physpages()) available to the system. >>> >>> This balloon-hotplugged memory can cause problems, if the hypervisor >>> wasn't expecting it; specifically, the system's physical page >>> addresses now will exceed the e820 map's max_pfn, due to the >>> balloon-hotplugged pages; if the hypervisor isn't expecting pt-device >>> DMA to/from those physical pages above the e820 max_pfn, it causes >>> problems. 
For example: >>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129 >>> >>> The additional small amount of balloon memory can cause other problems >>> as well, for example: >>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457 >>> >>> Anyway, I'd like to ask, was the original commit added because >>> hypervisors are supposed to set their balloon target to the guest >>> system's number of phys pages (max_pfn - holes)? The mailing list >>> discussion and commit description seem to indicate that. >> >> >> IIRC the problem that this was trying to fix was that since max_pfn >> includes holes, upon booting we'd immediately balloon down by the >> (typically, MMIO) hole size. >> >> If you boot a guest with ~4+GB memory you should see this. >> >> >>> However I'm >>> not sure how that is possible, because the kernel reserves its own >>> holes, regardless of any predefined holes in the e820 map; for >>> example, the kernel reserves 64k (by default) at phys addr 0 (the >>> amount of reservation is configurable via CONFIG_X86_RESERVE_LOW). So >>> the hypervisor really has no way to know what the "right" target to >>> specify is; unless it knows the exact guest OS and kernel version, and >>> kernel config values, it will never be able to correctly specify its >>> target to be exactly (e820 max_pfn - all holes). >>> >>> Should this commit be reverted? Should the xen balloon target be >>> adjusted based on kernel-added e820 holes? >> >> I think the second one but shouldn't current_pages be updated, and not >> the target? The latter is set by Xen (toolstack, via xenstore usually). > > Right. > > Looking into a HVM domU I can't see any problem related to > CONFIG_X86_RESERVE_LOW: it is set to 64 on my system. The domU is sorry I brought that up; I was only giving an example. 
It's not directly relevant to this and may have distracted from the actual
problem; in fact, on closer inspection, X86_RESERVE_LOW uses
memblock_reserve(), which removes the range from managed memory but not from
the e820 map (and thus doesn't remove it from get_num_physpages()). Only phys
page 0 is actually reserved in the e820 map.

> configured with 2048 MB of RAM, 8MB being video RAM. Looking into
> /sys/devices/system/xen_memory/xen_memory0 I can see the current
> size and target size do match: both are 2088960 kB (2 GB - 8 MB).
>
> Ballooning down and up to 2048 MB again doesn't change the picture.
>
> So which additional holes are added by the kernel on AWS via which
> functions?

I'll use two AWS types as examples, t2.micro (1G mem) and t2.large (8G mem).

In the micro, the results of ballooning are obvious, because the hotplugged
memory always goes into the Normal zone; but since the base memory is only 1G,
it's contained entirely in the DMA32/DMA zones. So we get:

$ grep -E '(start_pfn|present|spanned|managed)' /proc/zoneinfo
        spanned  4095
        present  3997
        managed  3976
  start_pfn:           1
        spanned  258048
        present  258048
        managed  249606
  start_pfn:           4096
        spanned  32768
        present  32768
        managed  11
  start_pfn:           262144

As you can see, none of the e820 memory went into the Normal zone; the balloon
driver hotplugged 128M (32k pages), but only made 11 pages available. Having a
memory zone with only 11 pages really screwed with kswapd, since the zone's
memory watermarks were all 0. That was the second bug I referenced in my
initial email.
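[Editorial note: the zero watermarks come from how the kernel splits the
min_free_kbytes budget across zones roughly in proportion to each zone's
managed pages, using integer division, so an 11-page zone rounds to zero. A
rough sketch, not the kernel's exact __setup_per_zone_wmarks code; the
min_free_pages budget below is an assumed value, the zone sizes are from the
t2.micro zoneinfo output above.]

```shell
# Approximate per-zone min watermark: min_free_pages * zone_managed / total_managed
min_free_pages=11264                      # assumed: min_free_kbytes=45056 on 4 KiB pages
total_managed=$(( 3976 + 249606 + 11 ))   # DMA + DMA32 + Normal, from zoneinfo above

for zone_managed in 3976 249606 11; do
    wmark=$(( min_free_pages * zone_managed / total_managed ))
    echo "managed=$zone_managed min_wmark=$wmark"
done
# The final iteration is the 11-page Normal zone: integer division
# truncates its share to 0, leaving kswapd with all-zero watermarks there.
```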
Anyway, if we look at the large instance, you don't really notice the
additional balloon memory:

$ grep -E '(start_pfn|present|spanned|managed)' /proc/zoneinfo
        spanned  4095
        present  3997
        managed  3976
  start_pfn:           1
        spanned  1044480
        present  978944
        managed  958778
  start_pfn:           4096
        spanned  1146880
        present  1146880
        managed  1080666
  start_pfn:           1048576

but doing the actual math shows the problem:

$ printf "%x\n" $[ 1048576 + 1146880 ]
218000
$ printf "%x\n" $[ 1048576 + 1080666 ]
207d5a
$ dmesg | grep e820
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000efffffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000020fffffff] usable
[    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000000] e820: last_pfn = 0x210000 max_arch_pfn = 0x400000000
[    0.000000] e820: last_pfn = 0xf0000 max_arch_pfn = 0x400000000
[    0.000000] e820: [mem 0xf0000000-0xfbffffff] available for PCI devices
[    0.595083] e820: reserve RAM buffer [mem 0x0009e000-0x0009ffff]

So we can see the balloon driver hotplugged those extra 0x8000 pages, and made
some of them available. The target has been set to:

$ printf "%x\n" $( cat /sys/devices/system/xen_memory/xen_memory0/target )
200000000

while the e820 map provides:

$ printf "%x\n" $[ 0x210000000 - 0x100000000 + 0xf0000000 - 0x100000 + 0x9e000 - 0x1000 ]
1fff9d000

and current memory is:

/sys/devices/system/xen_memory/xen_memory0$ printf "%x\n" $[ $( cat info/current_kb ) * 1024 ]
1fffa8000

so the balloon driver has added...
$ echo $[ ( 0x1fffa8000 - 0x1fff9d000 ) / 4096 ]
11

exactly 11 pages, just like the micro instance type. I'm not sure where the
balloon driver gets that 11-page calculation, nor am I sure why the current_kb
is actually less than the balloon target.

^ permalink raw reply [flat|nested] 18+ messages in thread
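[Editorial note: as a sanity check on the arithmetic above, the e820 usable
bytes can be summed directly from the three usable ranges in the quoted dmesg
output, after the page-0 reservation and the 0xa0000-0xfffff removal. This
reproduces Dan's 0x1fff9d000 figure; each term is (end_exclusive - start) in
bytes.]

```shell
# Sum the t2.large guest's usable e820 ranges, from the dmesg output above:
#   0x1000      - 0x9dfff    (low RAM, minus reserved page 0)
#   0x100000    - 0xefffffff (RAM below the MMIO hole)
#   0x100000000 - 0x20fffffff (RAM above 4 GB)
usable_bytes=$(( (0x9e000 - 0x1000) \
               + (0xf0000000 - 0x100000) \
               + (0x210000000 - 0x100000000) ))
printf 'e820 usable: 0x%x bytes, %d pages\n' \
       "$usable_bytes" "$(( usable_bytes / 4096 ))"
```

This matches the `printf "%x\n" $[ ... ]` one-liner in the message above, and
makes the balloon target's excess over usable RAM (0x200000000 - 0x1fff9d000)
easy to recompute.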
* Re: maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" 2017-03-23 2:13 ` Boris Ostrovsky 2017-03-23 7:56 ` Juergen Gross @ 2017-03-24 20:34 ` Dan Streetman 2017-03-24 21:10 ` Konrad Rzeszutek Wilk 1 sibling, 1 reply; 18+ messages in thread From: Dan Streetman @ 2017-03-24 20:34 UTC (permalink / raw) To: Boris Ostrovsky Cc: Konrad Rzeszutek Wilk, David Vrabel, Juergen Gross, xen-devel, linux-kernel On Wed, Mar 22, 2017 at 10:13 PM, Boris Ostrovsky <boris.ostrovsky@oracle.com> wrote: > > > On 03/22/2017 05:16 PM, Dan Streetman wrote: >> >> I have a question about a problem introduced by this commit: >> c275a57f5ec3056f732843b11659d892235faff7 >> "xen/balloon: Set balloon's initial state to number of existing RAM pages" >> >> It changed the xen balloon current_pages calculation to start with the >> number of physical pages in the system, instead of max_pfn. Since >> get_num_physpages() does not include holes, it's always less than the >> e820 map's max_pfn. >> >> However, the problem that commit introduced is, if the hypervisor sets >> the balloon target to equal to the e820 map's max_pfn, then the >> balloon target will *always* be higher than the initial current pages. >> Even if the hypervisor sets the target to (e820 max_pfn - holes), if >> the OS adds any holes, the balloon target will be higher than the >> current pages. This is the situation, for example, for Amazon AWS >> instances. The result is, the xen balloon will always immediately >> hotplug some memory at boot, but then make only (max_pfn - >> get_num_physpages()) available to the system. >> >> This balloon-hotplugged memory can cause problems, if the hypervisor >> wasn't expecting it; specifically, the system's physical page >> addresses now will exceed the e820 map's max_pfn, due to the >> balloon-hotplugged pages; if the hypervisor isn't expecting pt-device >> DMA to/from those physical pages above the e820 max_pfn, it causes >> problems. 
For example: >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129 >> >> The additional small amount of balloon memory can cause other problems >> as well, for example: >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457 >> >> Anyway, I'd like to ask, was the original commit added because >> hypervisors are supposed to set their balloon target to the guest >> system's number of phys pages (max_pfn - holes)? The mailing list >> discussion and commit description seem to indicate that. > > > > IIRC the problem that this was trying to fix was that since max_pfn includes > holes, upon booting we'd immediately balloon down by the (typically, MMIO) > hole size. > > If you boot a guest with ~4+GB memory you should see this. > > >> However I'm >> not sure how that is possible, because the kernel reserves its own >> holes, regardless of any predefined holes in the e820 map; for >> example, the kernel reserves 64k (by default) at phys addr 0 (the >> amount of reservation is configurable via CONFIG_X86_RESERVE_LOW). So >> the hypervisor really has no way to know what the "right" target to >> specify is; unless it knows the exact guest OS and kernel version, and >> kernel config values, it will never be able to correctly specify its >> target to be exactly (e820 max_pfn - all holes). >> >> Should this commit be reverted? Should the xen balloon target be >> adjusted based on kernel-added e820 holes? > > > I think the second one but shouldn't current_pages be updated, and not the > target? The latter is set by Xen (toolstack, via xenstore usually). > > Also, the bugs above (at least one of them) talk about NVMe and I wonder > whether the memory that they add is of RAM type --- I believe it has its own > type and so perhaps that introduces additional inconsistencies. AWS may have > added their own support for that, which we don't have upstream yet. The type of memory doesn't have anything to do with it. 
The problem with NVMe is it's a passthrough device, so the guest talks directly to the NVMe controller and does DMA with it. But the hypervisor does swiotlb translation between the guest physical memory, and the host physical memory, so that the NVMe device can correctly DMA to the right memory in the host. However, the hypervisor only has the guest's physical memory up to the max e820 pfn mapped; it didn't expect the balloon driver to hotplug any additional memory above the e820 max pfn, so when the NVMe driver in the guest tries to tell the NVMe controller to DMA to that balloon-hotplugged memory, the hypervisor fails the NVMe request, because it can't do the guest-to-host phys mem mapping, since the guest phys address is outside the expected max range. > > -boris > > > >> Should something else be >> done? >> >> For context, Amazon Linux has simply disabled Xen ballooning >> completely. Likewise, we're planning to disable Xen ballooning in the >> Ubuntu kernel for Amazon AWS-specific kernels (but not for non-AWS >> Ubuntu kernels). However, if reverting this patch makes sense in a >> bigger context (i.e. Xen users besides AWS), that would allow more >> Ubuntu kernels to work correctly in AWS instances. >> > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" 2017-03-24 20:34 ` Dan Streetman @ 2017-03-24 21:10 ` Konrad Rzeszutek Wilk 2017-03-24 21:26 ` Dan Streetman 0 siblings, 1 reply; 18+ messages in thread From: Konrad Rzeszutek Wilk @ 2017-03-24 21:10 UTC (permalink / raw) To: Dan Streetman Cc: Boris Ostrovsky, David Vrabel, Juergen Gross, xen-devel, linux-kernel On Fri, Mar 24, 2017 at 04:34:23PM -0400, Dan Streetman wrote: > On Wed, Mar 22, 2017 at 10:13 PM, Boris Ostrovsky > <boris.ostrovsky@oracle.com> wrote: > > > > > > On 03/22/2017 05:16 PM, Dan Streetman wrote: > >> > >> I have a question about a problem introduced by this commit: > >> c275a57f5ec3056f732843b11659d892235faff7 > >> "xen/balloon: Set balloon's initial state to number of existing RAM pages" > >> > >> It changed the xen balloon current_pages calculation to start with the > >> number of physical pages in the system, instead of max_pfn. Since > >> get_num_physpages() does not include holes, it's always less than the > >> e820 map's max_pfn. > >> > >> However, the problem that commit introduced is, if the hypervisor sets > >> the balloon target to equal to the e820 map's max_pfn, then the > >> balloon target will *always* be higher than the initial current pages. > >> Even if the hypervisor sets the target to (e820 max_pfn - holes), if > >> the OS adds any holes, the balloon target will be higher than the > >> current pages. This is the situation, for example, for Amazon AWS > >> instances. The result is, the xen balloon will always immediately > >> hotplug some memory at boot, but then make only (max_pfn - > >> get_num_physpages()) available to the system. 
> >> > >> This balloon-hotplugged memory can cause problems, if the hypervisor > >> wasn't expecting it; specifically, the system's physical page > >> addresses now will exceed the e820 map's max_pfn, due to the > >> balloon-hotplugged pages; if the hypervisor isn't expecting pt-device > >> DMA to/from those physical pages above the e820 max_pfn, it causes > >> problems. For example: > >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129 > >> > >> The additional small amount of balloon memory can cause other problems > >> as well, for example: > >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457 > >> > >> Anyway, I'd like to ask, was the original commit added because > >> hypervisors are supposed to set their balloon target to the guest > >> system's number of phys pages (max_pfn - holes)? The mailing list > >> discussion and commit description seem to indicate that. > > > > > > > > IIRC the problem that this was trying to fix was that since max_pfn includes > > holes, upon booting we'd immediately balloon down by the (typically, MMIO) > > hole size. > > > > If you boot a guest with ~4+GB memory you should see this. > > > > > >> However I'm > >> not sure how that is possible, because the kernel reserves its own > >> holes, regardless of any predefined holes in the e820 map; for > >> example, the kernel reserves 64k (by default) at phys addr 0 (the > >> amount of reservation is configurable via CONFIG_X86_RESERVE_LOW). So > >> the hypervisor really has no way to know what the "right" target to > >> specify is; unless it knows the exact guest OS and kernel version, and > >> kernel config values, it will never be able to correctly specify its > >> target to be exactly (e820 max_pfn - all holes). > >> > >> Should this commit be reverted? Should the xen balloon target be > >> adjusted based on kernel-added e820 holes? > > > > > > I think the second one but shouldn't current_pages be updated, and not the > > target? 
The latter is set by Xen (toolstack, via xenstore usually). > > > > Also, the bugs above (at least one of them) talk about NVMe and I wonder > > whether the memory that they add is of RAM type --- I believe it has its own > > type and so perhaps that introduces additional inconsistencies. AWS may have > > added their own support for that, which we don't have upstream yet. > > The type of memory doesn't have anything to do with it. > > The problem with NVMe is it's a passthrough device, so the guest talks > directly to the NVMe controller and does DMA with it. But the > hypervisor does swiotlb translation between the guest physical memory, Um, the hypervisor does not have SWIOTLB support, only IOMMU support. > and the host physical memory, so that the NVMe device can correctly > DMA to the right memory in the host. > > However, the hypervisor only has the guest's physical memory up to the > max e820 pfn mapped; it didn't expect the balloon driver to hotplug > any additional memory above the e820 max pfn, so when the NVMe driver > in the guest tries to tell the NVMe controller to DMA to that > balloon-hotplugged memory, the hypervisor fails the NVMe request, But when the memory hotplug happens the hypercalls are done to raise the max pfn. > because it can't do the guest-to-host phys mem mapping, since the > guest phys address is outside the expected max range. > > > > > > > -boris > > > > > > > >> Should something else be > >> done? > >> > >> For context, Amazon Linux has simply disabled Xen ballooning > >> completely. Likewise, we're planning to disable Xen ballooning in the > >> Ubuntu kernel for Amazon AWS-specific kernels (but not for non-AWS > >> Ubuntu kernels). However, if reverting this patch makes sense in a > >> bigger context (i.e. Xen users besides AWS), that would allow more > >> Ubuntu kernels to work correctly in AWS instances. > >> > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" 2017-03-24 21:10 ` Konrad Rzeszutek Wilk @ 2017-03-24 21:26 ` Dan Streetman 2017-03-25 1:33 ` Boris Ostrovsky 0 siblings, 1 reply; 18+ messages in thread From: Dan Streetman @ 2017-03-24 21:26 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Boris Ostrovsky, David Vrabel, Juergen Gross, xen-devel, linux-kernel On Fri, Mar 24, 2017 at 5:10 PM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote: > On Fri, Mar 24, 2017 at 04:34:23PM -0400, Dan Streetman wrote: >> On Wed, Mar 22, 2017 at 10:13 PM, Boris Ostrovsky >> <boris.ostrovsky@oracle.com> wrote: >> > >> > >> > On 03/22/2017 05:16 PM, Dan Streetman wrote: >> >> >> >> I have a question about a problem introduced by this commit: >> >> c275a57f5ec3056f732843b11659d892235faff7 >> >> "xen/balloon: Set balloon's initial state to number of existing RAM pages" >> >> >> >> It changed the xen balloon current_pages calculation to start with the >> >> number of physical pages in the system, instead of max_pfn. Since >> >> get_num_physpages() does not include holes, it's always less than the >> >> e820 map's max_pfn. >> >> >> >> However, the problem that commit introduced is, if the hypervisor sets >> >> the balloon target to equal to the e820 map's max_pfn, then the >> >> balloon target will *always* be higher than the initial current pages. >> >> Even if the hypervisor sets the target to (e820 max_pfn - holes), if >> >> the OS adds any holes, the balloon target will be higher than the >> >> current pages. This is the situation, for example, for Amazon AWS >> >> instances. The result is, the xen balloon will always immediately >> >> hotplug some memory at boot, but then make only (max_pfn - >> >> get_num_physpages()) available to the system. 
>> >> >> >> This balloon-hotplugged memory can cause problems, if the hypervisor >> >> wasn't expecting it; specifically, the system's physical page >> >> addresses now will exceed the e820 map's max_pfn, due to the >> >> balloon-hotplugged pages; if the hypervisor isn't expecting pt-device >> >> DMA to/from those physical pages above the e820 max_pfn, it causes >> >> problems. For example: >> >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129 >> >> >> >> The additional small amount of balloon memory can cause other problems >> >> as well, for example: >> >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457 >> >> >> >> Anyway, I'd like to ask, was the original commit added because >> >> hypervisors are supposed to set their balloon target to the guest >> >> system's number of phys pages (max_pfn - holes)? The mailing list >> >> discussion and commit description seem to indicate that. >> > >> > >> > >> > IIRC the problem that this was trying to fix was that since max_pfn includes >> > holes, upon booting we'd immediately balloon down by the (typically, MMIO) >> > hole size. >> > >> > If you boot a guest with ~4+GB memory you should see this. >> > >> > >> >> However I'm >> >> not sure how that is possible, because the kernel reserves its own >> >> holes, regardless of any predefined holes in the e820 map; for >> >> example, the kernel reserves 64k (by default) at phys addr 0 (the >> >> amount of reservation is configurable via CONFIG_X86_RESERVE_LOW). So >> >> the hypervisor really has no way to know what the "right" target to >> >> specify is; unless it knows the exact guest OS and kernel version, and >> >> kernel config values, it will never be able to correctly specify its >> >> target to be exactly (e820 max_pfn - all holes). >> >> >> >> Should this commit be reverted? Should the xen balloon target be >> >> adjusted based on kernel-added e820 holes? 
>> > >> > >> > I think the second one but shouldn't current_pages be updated, and not the >> > target? The latter is set by Xen (toolstack, via xenstore usually). >> > >> > Also, the bugs above (at least one of them) talk about NVMe and I wonder >> > whether the memory that they add is of RAM type --- I believe it has its own >> > type and so perhaps that introduces additional inconsistencies. AWS may have >> > added their own support for that, which we don't have upstream yet. >> >> The type of memory doesn't have anything to do with it. >> >> The problem with NVMe is it's a passthrough device, so the guest talks >> directly to the NVMe controller and does DMA with it. But the >> hypervisor does swiotlb translation between the guest physical memory, > > Um, the hypervisor does not have SWIOTLB support, only IOMMU support. heh, well I have no special insight into Amazon's hypervisor, so I have no idea what underlying memory remapping mechanism it uses :) > >> and the host physical memory, so that the NVMe device can correctly >> DMA to the right memory in the host. >> >> However, the hypervisor only has the guest's physical memory up to the >> max e820 pfn mapped; it didn't expect the balloon driver to hotplug >> any additional memory above the e820 max pfn, so when the NVMe driver >> in the guest tries to tell the NVMe controller to DMA to that >> balloon-hotplugged memory, the hypervisor fails the NVMe request, > > But when the memory hotplug happens the hypercalls are done to > raise the max pfn. well...all I can say is it rejects DMA above the e820 range. so this very well may be a hypervisor bug, where it should add the balloon memory region to whatever does the NVMe passthrough device iommu mapping. I think we can all agree that the *ideal* situation would be, for the balloon driver to not immediately hotplug memory so it can add 11 more pages, so maybe I just need to figure out why the balloon driver thinks it needs 11 more pages, and fix that. 
> >> because it can't do the guest-to-host phys mem mapping, since the >> guest phys address is outside the expected max range. >> >> >> >> > >> > -boris >> > >> > >> > >> >> Should something else be >> >> done? >> >> >> >> For context, Amazon Linux has simply disabled Xen ballooning >> >> completely. Likewise, we're planning to disable Xen ballooning in the >> >> Ubuntu kernel for Amazon AWS-specific kernels (but not for non-AWS >> >> Ubuntu kernels). However, if reverting this patch makes sense in a >> >> bigger context (i.e. Xen users besides AWS), that would allow more >> >> Ubuntu kernels to work correctly in AWS instances. >> >> >> > ^ permalink raw reply [flat|nested] 18+ messages in thread
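The balloon worker behaviour being discussed here boils down to a credit calculation. A toy model of it (a deliberate simplification of the logic in drivers/xen/balloon.c; the real driver also hotplugs a memory section before asking Xen to populate it, which is elided, and the hypervisor stub is hypothetical):

```python
def current_credit(target_pages, current_pages):
    """Positive credit: the guest believes the hypervisor still owes it pages."""
    return target_pages - current_pages

def balloon_process(state, populate_physmap):
    """One pass of the worker: try to close the gap between target and current."""
    credit = current_credit(state["target"], state["current"])
    if credit <= 0:
        return 0
    granted = populate_physmap(credit)  # pages Xen actually backs (may be < credit)
    state["current"] += granted
    return granted

# AWS-like behaviour from this thread: Xen backs only 0xb pages and then
# returns 0 forever, so the worker keeps retrying at its maximum interval.
state = {"target": 0x200000, "current": 0x1fff9d}
grants = iter([0xb])
granted = balloon_process(state, lambda n: next(grants, 0))
print(hex(granted), hex(current_credit(state["target"], state["current"])))  # 0xb 0x58
```

With these numbers the credit never reaches zero, which is exactly the endless retry loop Dan describes.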
* Re: maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" 2017-03-24 21:26 ` Dan Streetman @ 2017-03-25 1:33 ` Boris Ostrovsky 2017-03-27 19:57 ` Dan Streetman 0 siblings, 1 reply; 18+ messages in thread From: Boris Ostrovsky @ 2017-03-25 1:33 UTC (permalink / raw) To: Dan Streetman, Konrad Rzeszutek Wilk Cc: Juergen Gross, xen-devel, linux-kernel > > I think we can all agree that the *ideal* situation would be, for the > balloon driver to not immediately hotplug memory so it can add 11 more > pages, so maybe I just need to figure out why the balloon driver > thinks it needs 11 more pages, and fix that. How does the new memory appear in the guest? Via online_pages()? Or is ballooning triggered from watch_target()? -boris ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" 2017-03-25 1:33 ` Boris Ostrovsky @ 2017-03-27 19:57 ` Dan Streetman 2017-03-28 1:57 ` Boris Ostrovsky 0 siblings, 1 reply; 18+ messages in thread From: Dan Streetman @ 2017-03-27 19:57 UTC (permalink / raw) To: Boris Ostrovsky Cc: Konrad Rzeszutek Wilk, Juergen Gross, xen-devel, linux-kernel On Fri, Mar 24, 2017 at 9:33 PM, Boris Ostrovsky <boris.ostrovsky@oracle.com> wrote: > >> >> I think we can all agree that the *ideal* situation would be, for the >> balloon driver to not immediately hotplug memory so it can add 11 more >> pages, so maybe I just need to figure out why the balloon driver >> thinks it needs 11 more pages, and fix that. > > > > How does the new memory appear in the guest? Via online_pages()? > > Or is ballooning triggered from watch_target()? yes, it's triggered from watch_target() which then calls online_pages() with the new memory. I added some debug (all numbers are in hex): [ 0.500080] xen:balloon: Initialising balloon driver [ 0.503027] xen:balloon: balloon_init: current/target pages 1fff9d [ 0.504044] xen_balloon: Initialising balloon driver [ 0.508046] xen_balloon: watch_target: new target 800000 kb [ 0.508046] xen:balloon: balloon_set_new_target: target 200000 [ 0.524024] xen:balloon: current_credit: target pages 200000 current pages 1fff9d credit 63 [ 0.567055] xen:balloon: balloon_process: current_credit 63 [ 0.568005] xen:balloon: reserve_additional_memory: adding memory resource for 8000 pages [ 3.694443] online_pages: pfn 210000 nr_pages 8000 type 0 [ 3.701072] xen:balloon: current_credit: target pages 200000 current pages 1fff9d credit 63 [ 3.701074] xen:balloon: balloon_process: current_credit 63 [ 3.701075] xen:balloon: increase_reservation: nr_pages 63 [ 3.701170] xen:balloon: increase_reservation: done, current_pages 1fffa8 [ 3.701172] xen:balloon: current_credit: target pages 200000 current pages 1fffa8 credit 58 [ 
3.701173] xen:balloon: balloon_process: current_credit 58 [ 3.701173] xen:balloon: increase_reservation: nr_pages 58 [ 3.701180] xen:balloon: increase_reservation: XENMEM_populate_physmap err 0 [ 5.708085] xen:balloon: current_credit: target pages 200000 current pages 1fffa8 credit 58 [ 5.708088] xen:balloon: balloon_process: current_credit 58 [ 5.708089] xen:balloon: increase_reservation: nr_pages 58 [ 5.708106] xen:balloon: increase_reservation: XENMEM_populate_physmap err 0 [ 9.716065] xen:balloon: current_credit: target pages 200000 current pages 1fffa8 credit 58 [ 9.716068] xen:balloon: balloon_process: current_credit 58 [ 9.716069] xen:balloon: increase_reservation: nr_pages 58 [ 9.716087] xen:balloon: increase_reservation: XENMEM_populate_physmap err 0 and that continues forever at the max interval (32), since max_retry_count is unlimited. So I think I understand things now; first, the current_pages is set properly based on the e820 map: $ dmesg|grep -i e820 [ 0.000000] e820: BIOS-provided physical RAM map: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000efffffff] usable [ 0.000000] BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000020fffffff] usable [ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved [ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable [ 0.000000] e820: last_pfn = 0x210000 max_arch_pfn = 0x400000000 [ 0.000000] e820: last_pfn = 0xf0000 max_arch_pfn = 0x400000000 [ 0.000000] e820: [mem 0xf0000000-0xfbffffff] available for PCI devices [ 0.528007] e820: reserve RAM buffer [mem 0x0009e000-0x0009ffff] ubuntu@ip-172-31-60-112:~$ printf "%x\n" $[ 0x210000 - 0x100000 + 0xf0000 - 0x100 + 0x9e - 1 ] 1fff9d then, the xen 
balloon notices its target has been set to 200000 by the hypervisor. That target does account for the hole at 0xf0000 to 0x100000, but it doesn't account for the hole at 0xe0 to 0x100 ( 0x20 pages), nor the hole at 0x9e to 0xa0 ( 2 pages ), nor the unlisted hole (that the kernel removes) at 0xa0 to 0xe0 ( 0x40 pages). That's 0x62 pages, plus the 1-page hole at addr 0 that the kernel always reserves, is 0x63 pages of holes, which aren't accounted for in the hypervisor's target. so the balloon driver hotplugs the memory, and tries to increase its reservation to provide the needed pages to get the current_pages up to the target. However, when it calls the hypervisor to populate the physmap, the hypervisor only allows 11 (0xb) pages to be populated; all calls after that get back 0 from the hypervisor. Do you think the hypervisor's balloon target should account for the e820 holes (and for the kernel's added hole at addr 0)? Alternately/additionally, if the hypervisor doesn't want to support ballooning, should it just return error from the call to populate the physmap, and not allow those 11 pages? At this point, it doesn't seem to me like the kernel is doing anything wrong, correct? ^ permalink raw reply [flat|nested] 18+ messages in thread
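Dan's hole accounting above can be replayed mechanically. A small sketch (the pfn ranges are transcribed from the dmesg output in this message; the variable names are mine):

```python
# Usable RAM ranges as (start_pfn, end_pfn) after the kernel's own trimming:
# page 0 is reserved (CONFIG_X86_RESERVE_LOW), pfns 0x9e-0x100 and the MMIO
# hole at 0xf0000-0x100000 are not RAM.
E820_USABLE = [
    (0x1,      0x9e),      # low memory, minus the reserved first page
    (0x100,    0xf0000),   # 1 MiB .. 0xefffffff
    (0x100000, 0x210000),  # 4 GiB .. 0x20fffffff
]

num_physpages = sum(end - start for start, end in E820_USABLE)
target = 0x200000                 # toolstack target: e820 max_pfn minus the MMIO hole
credit = target - num_physpages   # pages the balloon driver thinks it is owed

print(hex(num_physpages), hex(credit))  # 0x1fff9d 0x63
```

The 0x63 pages of credit are exactly the small holes the toolstack's target does not know about.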
* Re: maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" 2017-03-27 19:57 ` Dan Streetman @ 2017-03-28 1:57 ` Boris Ostrovsky 2017-03-28 8:08 ` [Xen-devel] " Jan Beulich 0 siblings, 1 reply; 18+ messages in thread From: Boris Ostrovsky @ 2017-03-28 1:57 UTC (permalink / raw) To: Dan Streetman Cc: Konrad Rzeszutek Wilk, Juergen Gross, xen-devel, linux-kernel On 03/27/2017 03:57 PM, Dan Streetman wrote: > On Fri, Mar 24, 2017 at 9:33 PM, Boris Ostrovsky > <boris.ostrovsky@oracle.com> wrote: >> >>> >>> I think we can all agree that the *ideal* situation would be, for the >>> balloon driver to not immediately hotplug memory so it can add 11 more >>> pages, so maybe I just need to figure out why the balloon driver >>> thinks it needs 11 more pages, and fix that. >> >> >> >> How does the new memory appear in the guest? Via online_pages()? >> >> Or is ballooning triggered from watch_target()? > > yes, it's triggered from watch_target() which then calls > online_pages() with the new memory. 
I added some debug (all numbers > are in hex): > > [ 0.500080] xen:balloon: Initialising balloon driver > [ 0.503027] xen:balloon: balloon_init: current/target pages 1fff9d > [ 0.504044] xen_balloon: Initialising balloon driver > [ 0.508046] xen_balloon: watch_target: new target 800000 kb > [ 0.508046] xen:balloon: balloon_set_new_target: target 200000 > [ 0.524024] xen:balloon: current_credit: target pages 200000 > current pages 1fff9d credit 63 > [ 0.567055] xen:balloon: balloon_process: current_credit 63 > [ 0.568005] xen:balloon: reserve_additional_memory: adding memory > resource for 8000 pages > [ 3.694443] online_pages: pfn 210000 nr_pages 8000 type 0 > [ 3.701072] xen:balloon: current_credit: target pages 200000 > current pages 1fff9d credit 63 > [ 3.701074] xen:balloon: balloon_process: current_credit 63 > [ 3.701075] xen:balloon: increase_reservation: nr_pages 63 > [ 3.701170] xen:balloon: increase_reservation: done, current_pages 1fffa8 > [ 3.701172] xen:balloon: current_credit: target pages 200000 > current pages 1fffa8 credit 58 > [ 3.701173] xen:balloon: balloon_process: current_credit 58 > [ 3.701173] xen:balloon: increase_reservation: nr_pages 58 > [ 3.701180] xen:balloon: increase_reservation: XENMEM_populate_physmap err 0 > [ 5.708085] xen:balloon: current_credit: target pages 200000 > current pages 1fffa8 credit 58 > [ 5.708088] xen:balloon: balloon_process: current_credit 58 > [ 5.708089] xen:balloon: increase_reservation: nr_pages 58 > [ 5.708106] xen:balloon: increase_reservation: XENMEM_populate_physmap err 0 > [ 9.716065] xen:balloon: current_credit: target pages 200000 > current pages 1fffa8 credit 58 > [ 9.716068] xen:balloon: balloon_process: current_credit 58 > [ 9.716069] xen:balloon: increase_reservation: nr_pages 58 > [ 9.716087] xen:balloon: increase_reservation: XENMEM_populate_physmap err 0 > > > and that continues forever at the max interval (32), since > max_retry_count is unlimited. 
So I think I understand things now; > first, the current_pages is set properly based on the e820 map: > > $ dmesg|grep -i e820 > [ 0.000000] e820: BIOS-provided physical RAM map: > [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable > [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved > [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved > [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000efffffff] usable > [ 0.000000] BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved > [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000020fffffff] usable > [ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved > [ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable > [ 0.000000] e820: last_pfn = 0x210000 max_arch_pfn = 0x400000000 > [ 0.000000] e820: last_pfn = 0xf0000 max_arch_pfn = 0x400000000 > [ 0.000000] e820: [mem 0xf0000000-0xfbffffff] available for PCI devices > [ 0.528007] e820: reserve RAM buffer [mem 0x0009e000-0x0009ffff] > ubuntu@ip-172-31-60-112:~$ printf "%x\n" $[ 0x210000 - 0x100000 + > 0xf0000 - 0x100 + 0x9e - 1 ] > 1fff9d > > > then, the xen balloon notices its target has been set to 200000 by the > hypervisor. That target does account for the hole at 0xf0000 to > 0x100000, but it doesn't account for the hole at 0xe0 to 0x100 ( 0x20 > pages), nor the hole at 0x9e to 0xa0 ( 2 pages ), nor the unlisted > hole (that the kernel removes) at 0xa0 to 0xe0 ( 0x40 pages). That's > 0x62 pages, plus the 1-page hole at addr 0 that the kernel always > reserves, is 0x63 pages of holes, which aren't accounted for in the > hypervisor's target. > > so the balloon driver hotplugs the memory, and tries to increase its > reservation to provide the needed pages to get the current_pages up to > the target. 
However, when it calls the hypervisor to populate the > physmap, the hypervisor only allows 11 (0xb) pages to be populated; > all calls after that get back 0 from the hypervisor. > > Do you think the hypervisor's balloon target should account for the > e820 holes (and for the kernel's added hole at addr 0)? > Alternately/additionally, if the hypervisor doesn't want to support > ballooning, should it just return error from the call to populate the > physmap, and not allow those 11 pages? > > At this point, it doesn't seem to me like the kernel is doing anything > wrong, correct? > I think there is indeed a disconnect between target memory (provided by the toolstack) and current memory (i.e actual pages available to the guest). For example [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved are missed in target calculation. The hvmloader marks them as RESERVED (in build_e820_table()) but target value is not aware of this action. And then the same problem repeats when kernel removes 0x000a0000-0x000fffff chunk. (BTW, this is all happening before the new 0x8000 pages are onlined, which takes places much later and is a separate and what looks to me an unrelated event). -boris ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Xen-devel] maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" 2017-03-28 1:57 ` Boris Ostrovsky @ 2017-03-28 8:08 ` Jan Beulich 2017-03-28 14:27 ` Boris Ostrovsky 0 siblings, 1 reply; 18+ messages in thread From: Jan Beulich @ 2017-03-28 8:08 UTC (permalink / raw) To: Dan Streetman, Boris Ostrovsky; +Cc: xen-devel, Juergen Gross, linux-kernel >>> On 28.03.17 at 03:57, <boris.ostrovsky@oracle.com> wrote: > I think there is indeed a disconnect between target memory (provided by > the toolstack) and current memory (i.e actual pages available to the guest). > > For example > > [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] > reserved > [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] > reserved > > are missed in target calculation. The hvmloader marks them as RESERVED > (in build_e820_table()) but target value is not aware of this action. > > And then the same problem repeats when kernel removes > 0x000a0000-0x000fffff chunk. But this is all in-guest behavior, i.e. nothing an entity outside the guest (tool stack or hypervisor) should need to be aware of. That said, there is still room for improvement in the tools I think: Regions which architecturally aren't RAM (namely the 0xa0000-0xfffff range) would probably better not be accounted for as RAM as far as ballooning is concerned. In the hypervisor, otoh, all memory assigned to the guest (i.e. including such backing ROMs) needs to be accounted. Jan ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Xen-devel] maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" 2017-03-28 8:08 ` [Xen-devel] " Jan Beulich @ 2017-03-28 14:27 ` Boris Ostrovsky 2017-03-28 15:04 ` Jan Beulich 2017-03-28 15:30 ` Juergen Gross 0 siblings, 2 replies; 18+ messages in thread From: Boris Ostrovsky @ 2017-03-28 14:27 UTC (permalink / raw) To: Jan Beulich, Dan Streetman; +Cc: xen-devel, Juergen Gross, linux-kernel On 03/28/2017 04:08 AM, Jan Beulich wrote: >>>> On 28.03.17 at 03:57, <boris.ostrovsky@oracle.com> wrote: >> I think there is indeed a disconnect between target memory (provided by >> the toolstack) and current memory (i.e actual pages available to the guest). >> >> For example >> >> [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] >> reserved >> [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] >> reserved >> >> are missed in target calculation. The hvmloader marks them as RESERVED >> (in build_e820_table()) but target value is not aware of this action. >> >> And then the same problem repeats when kernel removes >> 0x000a0000-0x000fffff chunk. > But this is all in-guest behavior, i.e. nothing an entity outside the > guest (tool stack or hypervisor) should need to be aware of. That > said, there is still room for improvement in the tools I think: > Regions which architecturally aren't RAM (namely the > 0xa0000-0xfffff range) would probably better not be accounted > for as RAM as far as ballooning is concerned. In the hypervisor, > otoh, all memory assigned to the guest (i.e. including such backing > ROMs) needs to be accounted. On the Linux side we should not include in balloon calculations pages reserved by trim_bios_range(), i.e. (BIOS_END-BIOS_BEGIN) + 1. Which leaves hvmloader's special pages (and possibly memory under 0xA0000 which may get reserved). Can we pass this info to guests via xenstore? -boris ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Xen-devel] maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" 2017-03-28 14:27 ` Boris Ostrovsky @ 2017-03-28 15:04 ` Jan Beulich 2017-03-28 15:30 ` Juergen Gross 1 sibling, 0 replies; 18+ messages in thread From: Jan Beulich @ 2017-03-28 15:04 UTC (permalink / raw) To: Boris Ostrovsky; +Cc: Dan Streetman, xen-devel, Juergen Gross, linux-kernel >>> On 28.03.17 at 16:27, <boris.ostrovsky@oracle.com> wrote: > Which leaves hvmloader's special pages (and possibly memory under > 0xA0000 which may get reserved). Can we pass this info to guests via > xenstore? I'm perhaps the wrong one to ask regarding xenstore, but for in-guest communication this seems an at least strange approach to me. Jan ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Xen-devel] maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" 2017-03-28 14:27 ` Boris Ostrovsky 2017-03-28 15:04 ` Jan Beulich @ 2017-03-28 15:30 ` Juergen Gross 2017-03-28 16:32 ` Boris Ostrovsky 2017-07-08 0:59 ` Konrad Rzeszutek Wilk 1 sibling, 2 replies; 18+ messages in thread From: Juergen Gross @ 2017-03-28 15:30 UTC (permalink / raw) To: Boris Ostrovsky, Jan Beulich, Dan Streetman; +Cc: xen-devel, linux-kernel On 28/03/17 16:27, Boris Ostrovsky wrote: > On 03/28/2017 04:08 AM, Jan Beulich wrote: >>>>> On 28.03.17 at 03:57, <boris.ostrovsky@oracle.com> wrote: >>> I think there is indeed a disconnect between target memory (provided by >>> the toolstack) and current memory (i.e actual pages available to the guest). >>> >>> For example >>> >>> [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] >>> reserved >>> [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] >>> reserved >>> >>> are missed in target calculation. The hvmloader marks them as RESERVED >>> (in build_e820_table()) but target value is not aware of this action. >>> >>> And then the same problem repeats when kernel removes >>> 0x000a0000-0x000fffff chunk. >> But this is all in-guest behavior, i.e. nothing an entity outside the >> guest (tool stack or hypervisor) should need to be aware of. That >> said, there is still room for improvement in the tools I think: >> Regions which architecturally aren't RAM (namely the >> 0xa0000-0xfffff range) would probably better not be accounted >> for as RAM as far as ballooning is concerned. In the hypervisor, >> otoh, all memory assigned to the guest (i.e. including such backing >> ROMs) needs to be accounted. > > On the Linux side we should not include in balloon calculations pages > reserved by trim_bios_range(), i.e. (BIOS_END-BIOS_BEGIN) + 1. > > Which leaves hvmloader's special pages (and possibly memory under > 0xA0000 which may get reserved). 
Can we pass this info to guests via > xenstore? I'd rather keep an internal difference between online pages and E820-map count value in the balloon driver. This should work always. Juergen ^ permalink raw reply [flat|nested] 18+ messages in thread
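Juergen's suggestion could be sketched as follows (a hedged sketch only; the field name `target_unpopulated` and the exact placement of the subtraction are my assumptions, not necessarily what an eventual patch would use):

```python
# Keep an internal delta between what the E820 map promises and what was
# actually handed to the page allocator, and subtract it before comparing
# against the toolstack-provided target.
e820_ram_pages = 0x200000            # pages the E820 map counts as RAM
online_pages   = 0x1fff9d            # pages actually given to the allocator
target_unpopulated = e820_ram_pages - online_pages   # 0x63 pages of holes

def current_credit(target, current):
    # A toolstack target of "all E820 RAM" no longer yields positive credit,
    # so boot no longer triggers a spurious memory hotplug.
    return target - target_unpopulated - current

print(hex(current_credit(0x200000, 0x1fff9d)))  # 0x0
```

With the delta tracked guest-side, the toolstack needs no knowledge of kernel-created holes, which addresses Jan's objection that this is all in-guest behaviour.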
* Re: [Xen-devel] maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" 2017-03-28 15:30 ` Juergen Gross @ 2017-03-28 16:32 ` Boris Ostrovsky 2017-03-29 4:36 ` Juergen Gross 2017-07-08 0:59 ` Konrad Rzeszutek Wilk 1 sibling, 1 reply; 18+ messages in thread From: Boris Ostrovsky @ 2017-03-28 16:32 UTC (permalink / raw) To: Juergen Gross, Jan Beulich, Dan Streetman; +Cc: xen-devel, linux-kernel On 03/28/2017 11:30 AM, Juergen Gross wrote: > On 28/03/17 16:27, Boris Ostrovsky wrote: >> On 03/28/2017 04:08 AM, Jan Beulich wrote: >>>>>> On 28.03.17 at 03:57, <boris.ostrovsky@oracle.com> wrote: >>>> I think there is indeed a disconnect between target memory (provided by >>>> the toolstack) and current memory (i.e actual pages available to the guest). >>>> >>>> For example >>>> >>>> [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] >>>> reserved >>>> [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] >>>> reserved >>>> >>>> are missed in target calculation. The hvmloader marks them as RESERVED >>>> (in build_e820_table()) but target value is not aware of this action. >>>> >>>> And then the same problem repeats when kernel removes >>>> 0x000a0000-0x000fffff chunk. >>> But this is all in-guest behavior, i.e. nothing an entity outside the >>> guest (tool stack or hypervisor) should need to be aware of. That >>> said, there is still room for improvement in the tools I think: >>> Regions which architecturally aren't RAM (namely the >>> 0xa0000-0xfffff range) would probably better not be accounted >>> for as RAM as far as ballooning is concerned. In the hypervisor, >>> otoh, all memory assigned to the guest (i.e. including such backing >>> ROMs) needs to be accounted. >> On the Linux side we should not include in balloon calculations pages >> reserved by trim_bios_range(), i.e. (BIOS_END-BIOS_BEGIN) + 1. 
>> >> Which leaves hvmloader's special pages (and possibly memory under >> 0xA0000 which may get reserved). Can we pass this info to guests via >> xenstore? > I'd rather keep an internal difference between online pages and E820-map > count value in the balloon driver. This should work always. We could indeed base calculation on initial state of e820 and not count the holes toward ballooning needs. I am not sure this will work for memory unplug though, where a hole can be created in the map and we will be supposed to handle disappearing memory via ballooning. Or am I creating a problem where none exists? -boris ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Xen-devel] maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" 2017-03-28 16:32 ` Boris Ostrovsky @ 2017-03-29 4:36 ` Juergen Gross 0 siblings, 0 replies; 18+ messages in thread From: Juergen Gross @ 2017-03-29 4:36 UTC (permalink / raw) To: Boris Ostrovsky, Jan Beulich, Dan Streetman; +Cc: xen-devel, linux-kernel On 28/03/17 18:32, Boris Ostrovsky wrote: > On 03/28/2017 11:30 AM, Juergen Gross wrote: >> On 28/03/17 16:27, Boris Ostrovsky wrote: >>> On 03/28/2017 04:08 AM, Jan Beulich wrote: >>>>>>> On 28.03.17 at 03:57, <boris.ostrovsky@oracle.com> wrote: >>>>> I think there is indeed a disconnect between target memory (provided by >>>>> the toolstack) and current memory (i.e actual pages available to the guest). >>>>> >>>>> For example >>>>> >>>>> [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] >>>>> reserved >>>>> [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] >>>>> reserved >>>>> >>>>> are missed in target calculation. The hvmloader marks them as RESERVED >>>>> (in build_e820_table()) but target value is not aware of this action. >>>>> >>>>> And then the same problem repeats when kernel removes >>>>> 0x000a0000-0x000fffff chunk. >>>> But this is all in-guest behavior, i.e. nothing an entity outside the >>>> guest (tool stack or hypervisor) should need to be aware of. That >>>> said, there is still room for improvement in the tools I think: >>>> Regions which architecturally aren't RAM (namely the >>>> 0xa0000-0xfffff range) would probably better not be accounted >>>> for as RAM as far as ballooning is concerned. In the hypervisor, >>>> otoh, all memory assigned to the guest (i.e. including such backing >>>> ROMs) needs to be accounted. >>> On the Linux side we should not include in balloon calculations pages >>> reserved by trim_bios_range(), i.e. (BIOS_END-BIOS_BEGIN) + 1. 
>>> >>> Which leaves hvmloader's special pages (and possibly memory under >>> 0xA0000 which may get reserved). Can we pass this info to guests via >>> xenstore? >> I'd rather keep an internal difference between online pages and E820-map >> count value in the balloon driver. This should work always. > > We could indeed base calculation on initial state of e820 and not count > the holes toward ballooning needs. I am not sure this will work for > memory unplug though, where a hole can be created in the map and we will > be supposed to handle disappearing memory via ballooning. > > Or am I creating a problem where none exists? I'm rather sure memory has to be offlined before being deleted from the E820 map. Juergen ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Xen-devel] maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" 2017-03-28 15:30 ` Juergen Gross 2017-03-28 16:32 ` Boris Ostrovsky @ 2017-07-08 0:59 ` Konrad Rzeszutek Wilk 2017-07-09 13:16 ` Juergen Gross 1 sibling, 1 reply; 18+ messages in thread From: Konrad Rzeszutek Wilk @ 2017-07-08 0:59 UTC (permalink / raw) To: Juergen Gross Cc: Boris Ostrovsky, Jan Beulich, Dan Streetman, xen-devel, linux-kernel On Tue, Mar 28, 2017 at 05:30:24PM +0200, Juergen Gross wrote: > On 28/03/17 16:27, Boris Ostrovsky wrote: > > On 03/28/2017 04:08 AM, Jan Beulich wrote: > >>>>> On 28.03.17 at 03:57, <boris.ostrovsky@oracle.com> wrote: > >>> I think there is indeed a disconnect between target memory (provided by > >>> the toolstack) and current memory (i.e actual pages available to the guest). > >>> > >>> For example > >>> > >>> [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] > >>> reserved > >>> [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] > >>> reserved > >>> > >>> are missed in target calculation. The hvmloader marks them as RESERVED > >>> (in build_e820_table()) but target value is not aware of this action. > >>> > >>> And then the same problem repeats when kernel removes > >>> 0x000a0000-0x000fffff chunk. > >> But this is all in-guest behavior, i.e. nothing an entity outside the > >> guest (tool stack or hypervisor) should need to be aware of. That > >> said, there is still room for improvement in the tools I think: > >> Regions which architecturally aren't RAM (namely the > >> 0xa0000-0xfffff range) would probably better not be accounted > >> for as RAM as far as ballooning is concerned. In the hypervisor, > >> otoh, all memory assigned to the guest (i.e. including such backing > >> ROMs) needs to be accounted. > > > > On the Linux side we should not include in balloon calculations pages > > reserved by trim_bios_range(), i.e. (BIOS_END-BIOS_BEGIN) + 1. 
> > > > Which leaves hvmloader's special pages (and possibly memory under > > 0xA0000 which may get reserved). Can we pass this info to guests via > > xenstore? > > I'd rather keep an internal difference between online pages and E820-map > count value in the balloon driver. This should work always. Did we ever come with a patch for this? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Xen-devel] maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" 2017-07-08 0:59 ` Konrad Rzeszutek Wilk @ 2017-07-09 13:16 ` Juergen Gross 0 siblings, 0 replies; 18+ messages in thread From: Juergen Gross @ 2017-07-09 13:16 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Boris Ostrovsky, Jan Beulich, Dan Streetman, xen-devel, linux-kernel On 08/07/17 02:59, Konrad Rzeszutek Wilk wrote: > On Tue, Mar 28, 2017 at 05:30:24PM +0200, Juergen Gross wrote: >> On 28/03/17 16:27, Boris Ostrovsky wrote: >>> On 03/28/2017 04:08 AM, Jan Beulich wrote: >>>>>>> On 28.03.17 at 03:57, <boris.ostrovsky@oracle.com> wrote: >>>>> I think there is indeed a disconnect between target memory (provided by >>>>> the toolstack) and current memory (i.e actual pages available to the guest). >>>>> >>>>> For example >>>>> >>>>> [ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] >>>>> reserved >>>>> [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] >>>>> reserved >>>>> >>>>> are missed in target calculation. The hvmloader marks them as RESERVED >>>>> (in build_e820_table()) but target value is not aware of this action. >>>>> >>>>> And then the same problem repeats when kernel removes >>>>> 0x000a0000-0x000fffff chunk. >>>> But this is all in-guest behavior, i.e. nothing an entity outside the >>>> guest (tool stack or hypervisor) should need to be aware of. That >>>> said, there is still room for improvement in the tools I think: >>>> Regions which architecturally aren't RAM (namely the >>>> 0xa0000-0xfffff range) would probably better not be accounted >>>> for as RAM as far as ballooning is concerned. In the hypervisor, >>>> otoh, all memory assigned to the guest (i.e. including such backing >>>> ROMs) needs to be accounted. >>> >>> On the Linux side we should not include in balloon calculations pages >>> reserved by trim_bios_range(), i.e. (BIOS_END-BIOS_BEGIN) + 1. 
>>> >>> Which leaves hvmloader's special pages (and possibly memory under >>> 0xA0000 which may get reserved). Can we pass this info to guests via >>> xenstore? >> >> I'd rather keep an internal difference between online pages and E820-map >> count value in the balloon driver. This should work always. > > Did we ever come with a patch for this? Yes, I've sent V2 recently: https://lists.xen.org/archives/html/xen-devel/2017-07/msg00530.html Juergen ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads: [~2017-07-09 13:16 UTC | newest]

Thread overview: 18+ messages
2017-03-22 21:16 maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages" Dan Streetman
2017-03-23  2:13 ` Boris Ostrovsky
2017-03-23  7:56 ` Juergen Gross
2017-03-24 20:30 ` Dan Streetman
2017-03-24 20:34 ` Dan Streetman
2017-03-24 21:10 ` Konrad Rzeszutek Wilk
2017-03-24 21:26 ` Dan Streetman
2017-03-25  1:33 ` Boris Ostrovsky
2017-03-27 19:57 ` Dan Streetman
2017-03-28  1:57 ` Boris Ostrovsky
2017-03-28  8:08 ` [Xen-devel] " Jan Beulich
2017-03-28 14:27 ` Boris Ostrovsky
2017-03-28 15:04 ` Jan Beulich
2017-03-28 15:30 ` Juergen Gross
2017-03-28 16:32 ` Boris Ostrovsky
2017-03-29  4:36 ` Juergen Gross
2017-07-08  0:59 ` Konrad Rzeszutek Wilk
2017-07-09 13:16 ` Juergen Gross