[PATCH 0/3] x86/pvh: Support relocating dom0 kernel

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] x86/pvh: Support relocating dom0 kernel
@ 2024-03-06 18:50 Jason Andryuk
  2024-03-06 18:50 ` [PATCH 1/3] features.h: Replace hard tabs Jason Andryuk
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Jason Andryuk @ 2024-03-06 18:50 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Andrew Cooper, George Dunlap, Jan Beulich,
	Julien Grall, Stefano Stabellini, Wei Liu, Roger Pau Monné

Xen tries to load a PVH dom0 kernel at the fixed guest physical address
from the elf headers.  For Linux, this defaults to 0x1000000 (16MB), but
it can be configured.

Unfortunately there exist firmwares that have reserved regions at this
address, so Xen fails to load the dom0 kernel since it's not RAM.

The other issue is that the Linux PVH entry point is not
position-independent.  It expects to run at the compiled
CONFIG_PHYSICAL_ADDRESS.

This patch set expands the PVH dom0 builder to try to relocate the
kernel if needed and possible.  XENFEAT_pvh_relocatable is added for
kernels to indicate they are relocatable.  However, we may want to
switch to an additional ELF note with the kernel alignment.  Linux
specifies a kernel alignment in the bzImage boot_params.setup_header,
but that is not present the ELF vmlinux file.

The first patch fixes some whitespace in features.h

The second patch enhances bzimage_parse to pull the kernel_alignment into
an optional align pointer.

The third patch expands the pvh dom0 kernel placement code.

I'll post an additional patch showing the Linux changes to make PVH
relocatable.  

Jason Andryuk (3):
  features.h: Replace hard tabs
  xen/x86: bzImage parse kernel_alignment
  x86/PVH: Support relocatable dom0 kernels

 xen/arch/x86/bzimage.c             |   4 +-
 xen/arch/x86/hvm/dom0_build.c      | 113 ++++++++++++++++++++++++++++-
 xen/arch/x86/include/asm/bzimage.h |   3 +-
 xen/arch/x86/pv/dom0_build.c       |   2 +-
 xen/include/public/features.h      |   9 ++-
 5 files changed, 124 insertions(+), 7 deletions(-)

-- 
2.44.0

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/3] features.h: Replace hard tabs
  2024-03-06 18:50 [PATCH 0/3] x86/pvh: Support relocating dom0 kernel Jason Andryuk
@ 2024-03-06 18:50 ` Jason Andryuk
  2024-03-06 20:41   ` Stefano Stabellini
  2024-03-06 18:50 ` [PATCH 2/3] xen/x86: bzImage parse kernel_alignment Jason Andryuk
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Jason Andryuk @ 2024-03-06 18:50 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Andrew Cooper, George Dunlap, Jan Beulich,
	Julien Grall, Stefano Stabellini, Wei Liu

Replace hard tabs to match the rest of the file.

Fixes: 48a3fd1432 ("domain: expose newly introduced hypercalls as XENFEAT")
Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
---
 xen/include/public/features.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/xen/include/public/features.h b/xen/include/public/features.h
index 36936f6a4e..4437f25d25 100644
--- a/xen/include/public/features.h
+++ b/xen/include/public/features.h
@@ -117,8 +117,8 @@
  * VCPUOP_register_runstate_phys_area
  * VCPUOP_register_vcpu_time_phys_area
  */
-#define XENFEAT_runstate_phys_area	  18
-#define XENFEAT_vcpu_time_phys_area	  19
+#define XENFEAT_runstate_phys_area        18
+#define XENFEAT_vcpu_time_phys_area       19
 
 #define XENFEAT_NR_SUBMAPS 1
 
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] features.h: Replace hard tabs
  2024-03-06 18:50 ` [PATCH 1/3] features.h: Replace hard tabs Jason Andryuk
@ 2024-03-06 20:41   ` Stefano Stabellini
  0 siblings, 0 replies; 21+ messages in thread
From: Stefano Stabellini @ 2024-03-06 20:41 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: xen-devel, Andrew Cooper, George Dunlap, Jan Beulich,
	Julien Grall, Stefano Stabellini, Wei Liu

On Wed, 6 Mar 2024, Jason Andryuk wrote:
> Replace hard tabs to match the rest of the file.
> 
> Fixes: 48a3fd1432 ("domain: expose newly introduced hypercalls as XENFEAT")
> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>

Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 2/3] xen/x86: bzImage parse kernel_alignment
  2024-03-06 18:50 [PATCH 0/3] x86/pvh: Support relocating dom0 kernel Jason Andryuk
  2024-03-06 18:50 ` [PATCH 1/3] features.h: Replace hard tabs Jason Andryuk
@ 2024-03-06 18:50 ` Jason Andryuk
  2024-03-07  2:09   ` Stefano Stabellini
  2024-03-06 18:50 ` [PATCH 3/3] x86/PVH: Support relocatable dom0 kernels Jason Andryuk
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Jason Andryuk @ 2024-03-06 18:50 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu

Expand bzimage_parse() to return kernel_alignment from the setup_header.
This will be needed if loading a PVH kernel at a physical offset to
compensate for a reserved E820 region.

Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
---
 xen/arch/x86/bzimage.c             | 4 +++-
 xen/arch/x86/hvm/dom0_build.c      | 4 +++-
 xen/arch/x86/include/asm/bzimage.h | 3 +--
 xen/arch/x86/pv/dom0_build.c       | 2 +-
 4 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/bzimage.c b/xen/arch/x86/bzimage.c
index ac4fd428be..0f4cfc49f7 100644
--- a/xen/arch/x86/bzimage.c
+++ b/xen/arch/x86/bzimage.c
@@ -105,7 +105,7 @@ unsigned long __init bzimage_headroom(void *image_start,
 }
 
 int __init bzimage_parse(void *image_base, void **image_start,
-                         unsigned long *image_len)
+                         unsigned long *image_len, unsigned int *align)
 {
     struct setup_header *hdr = (struct setup_header *)(*image_start);
     int err = bzimage_check(hdr, *image_len);
@@ -118,6 +118,8 @@ int __init bzimage_parse(void *image_base, void **image_start,
     {
         *image_start += (hdr->setup_sects + 1) * 512 + hdr->payload_offset;
         *image_len = hdr->payload_length;
+        if ( align )
+            *align = hdr->kernel_alignment;
     }
 
     if ( elf_is_elfbinary(*image_start, *image_len) )
diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index 0ceda4140b..bbae8a5645 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -548,12 +548,14 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image,
     struct elf_binary elf;
     struct elf_dom_parms parms;
     paddr_t last_addr;
+    unsigned int align = 0;
     struct hvm_start_info start_info = { 0 };
     struct hvm_modlist_entry mod = { 0 };
     struct vcpu *v = d->vcpu[0];
     int rc;
 
-    if ( (rc = bzimage_parse(image_base, &image_start, &image_len)) != 0 )
+    rc = bzimage_parse(image_base, &image_start, &image_len, &align);
+    if ( rc != 0 )
     {
         printk("Error trying to detect bz compressed kernel\n");
         return rc;
diff --git a/xen/arch/x86/include/asm/bzimage.h b/xen/arch/x86/include/asm/bzimage.h
index 7ed69d3910..de4e9a446f 100644
--- a/xen/arch/x86/include/asm/bzimage.h
+++ b/xen/arch/x86/include/asm/bzimage.h
@@ -4,8 +4,7 @@
 #include <xen/init.h>
 
 unsigned long bzimage_headroom(void *image_start, unsigned long image_length);
-
 int bzimage_parse(void *image_base, void **image_start,
-                  unsigned long *image_len);
+                  unsigned long *image_len, unsigned int *align);
 
 #endif /* __X86_BZIMAGE_H__ */
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index d8043fa58a..e9fa8a9a82 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -416,7 +416,7 @@ int __init dom0_construct_pv(struct domain *d,
 
     d->max_pages = ~0U;
 
-    if ( (rc = bzimage_parse(image_base, &image_start, &image_len)) != 0 )
+    if ( (rc = bzimage_parse(image_base, &image_start, &image_len, NULL)) != 0 )
         return rc;
 
     if ( (rc = elf_init(&elf, image_start, image_len)) != 0 )
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/3] xen/x86: bzImage parse kernel_alignment
  2024-03-06 18:50 ` [PATCH 2/3] xen/x86: bzImage parse kernel_alignment Jason Andryuk
@ 2024-03-07  2:09   ` Stefano Stabellini
  2024-03-07  8:26     ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Stefano Stabellini @ 2024-03-07  2:09 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu

On Wed, 6 Mar 2024, Jason Andryuk wrote:
> Expand bzimage_parse() to return kernel_alignment from the setup_header.
> This will be needed if loading a PVH kernel at a physical offset to
> compensate for a reserved E820 region.
> 
> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>

Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>


> ---
>  xen/arch/x86/bzimage.c             | 4 +++-
>  xen/arch/x86/hvm/dom0_build.c      | 4 +++-
>  xen/arch/x86/include/asm/bzimage.h | 3 +--
>  xen/arch/x86/pv/dom0_build.c       | 2 +-
>  4 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/xen/arch/x86/bzimage.c b/xen/arch/x86/bzimage.c
> index ac4fd428be..0f4cfc49f7 100644
> --- a/xen/arch/x86/bzimage.c
> +++ b/xen/arch/x86/bzimage.c
> @@ -105,7 +105,7 @@ unsigned long __init bzimage_headroom(void *image_start,
>  }
>  
>  int __init bzimage_parse(void *image_base, void **image_start,
> -                         unsigned long *image_len)
> +                         unsigned long *image_len, unsigned int *align)
>  {
>      struct setup_header *hdr = (struct setup_header *)(*image_start);
>      int err = bzimage_check(hdr, *image_len);
> @@ -118,6 +118,8 @@ int __init bzimage_parse(void *image_base, void **image_start,
>      {
>          *image_start += (hdr->setup_sects + 1) * 512 + hdr->payload_offset;
>          *image_len = hdr->payload_length;
> +        if ( align )
> +            *align = hdr->kernel_alignment;
>      }
>  
>      if ( elf_is_elfbinary(*image_start, *image_len) )
> diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
> index 0ceda4140b..bbae8a5645 100644
> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -548,12 +548,14 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image,
>      struct elf_binary elf;
>      struct elf_dom_parms parms;
>      paddr_t last_addr;
> +    unsigned int align = 0;
>      struct hvm_start_info start_info = { 0 };
>      struct hvm_modlist_entry mod = { 0 };
>      struct vcpu *v = d->vcpu[0];
>      int rc;
>  
> -    if ( (rc = bzimage_parse(image_base, &image_start, &image_len)) != 0 )
> +    rc = bzimage_parse(image_base, &image_start, &image_len, &align);
> +    if ( rc != 0 )
>      {
>          printk("Error trying to detect bz compressed kernel\n");
>          return rc;
> diff --git a/xen/arch/x86/include/asm/bzimage.h b/xen/arch/x86/include/asm/bzimage.h
> index 7ed69d3910..de4e9a446f 100644
> --- a/xen/arch/x86/include/asm/bzimage.h
> +++ b/xen/arch/x86/include/asm/bzimage.h
> @@ -4,8 +4,7 @@
>  #include <xen/init.h>
>  
>  unsigned long bzimage_headroom(void *image_start, unsigned long image_length);
> -
>  int bzimage_parse(void *image_base, void **image_start,
> -                  unsigned long *image_len);
> +                  unsigned long *image_len, unsigned int *align);
>  
>  #endif /* __X86_BZIMAGE_H__ */
> diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
> index d8043fa58a..e9fa8a9a82 100644
> --- a/xen/arch/x86/pv/dom0_build.c
> +++ b/xen/arch/x86/pv/dom0_build.c
> @@ -416,7 +416,7 @@ int __init dom0_construct_pv(struct domain *d,
>  
>      d->max_pages = ~0U;
>  
> -    if ( (rc = bzimage_parse(image_base, &image_start, &image_len)) != 0 )
> +    if ( (rc = bzimage_parse(image_base, &image_start, &image_len, NULL)) != 0 )
>          return rc;
>  
>      if ( (rc = elf_init(&elf, image_start, image_len)) != 0 )
> -- 
> 2.44.0
> 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/3] xen/x86: bzImage parse kernel_alignment
  2024-03-07  2:09   ` Stefano Stabellini
@ 2024-03-07  8:26     ` Jan Beulich
  2024-03-07 15:06       ` Jason Andryuk
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2024-03-07  8:26 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Andrew Cooper, Roger Pau Monné, Wei Liu,
	Jason Andryuk

On 07.03.2024 03:09, Stefano Stabellini wrote:
> On Wed, 6 Mar 2024, Jason Andryuk wrote:
>> Expand bzimage_parse() to return kernel_alignment from the setup_header.
>> This will be needed if loading a PVH kernel at a physical offset to
>> compensate for a reserved E820 region.
>>
>> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
> 
> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

Acked-by: Jan Beulich <jbeulich@suse.com>
with two remarks:

>> --- a/xen/arch/x86/hvm/dom0_build.c
>> +++ b/xen/arch/x86/hvm/dom0_build.c
>> @@ -548,12 +548,14 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image,
>>      struct elf_binary elf;
>>      struct elf_dom_parms parms;
>>      paddr_t last_addr;
>> +    unsigned int align = 0;

Strictly speaking this isn't needed here, yet, and would suffice when added
in the next patch. But I'm okay with keeping it.

>> --- a/xen/arch/x86/include/asm/bzimage.h
>> +++ b/xen/arch/x86/include/asm/bzimage.h
>> @@ -4,8 +4,7 @@
>>  #include <xen/init.h>
>>  
>>  unsigned long bzimage_headroom(void *image_start, unsigned long image_length);
>> -
>>  int bzimage_parse(void *image_base, void **image_start,
>> -                  unsigned long *image_len);
>> +                  unsigned long *image_len, unsigned int *align);

Any particular reason for dropping the blank line? I'd prefer if it was kept,
and I may take the liberty to respectively adjust the patch while committing.

Jan


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/3] xen/x86: bzImage parse kernel_alignment
  2024-03-07  8:26     ` Jan Beulich
@ 2024-03-07 15:06       ` Jason Andryuk
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Andryuk @ 2024-03-07 15:06 UTC (permalink / raw)
  To: Jan Beulich, Stefano Stabellini
  Cc: xen-devel, Andrew Cooper, Roger Pau Monné, Wei Liu

On 2024-03-07 03:26, Jan Beulich wrote:
> On 07.03.2024 03:09, Stefano Stabellini wrote:
>> On Wed, 6 Mar 2024, Jason Andryuk wrote:
>>> Expand bzimage_parse() to return kernel_alignment from the setup_header.
>>> This will be needed if loading a PVH kernel at a physical offset to
>>> compensate for a reserved E820 region.
>>>
>>> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
>>
>> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
> 
> Acked-by: Jan Beulich <jbeulich@suse.com>
> with two remarks:
> 
>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>> @@ -548,12 +548,14 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image,
>>>       struct elf_binary elf;
>>>       struct elf_dom_parms parms;
>>>       paddr_t last_addr;
>>> +    unsigned int align = 0;
> 
> Strictly speaking this isn't needed here, yet, and would suffice when added
> in the next patch. But I'm okay with keeping it.
> 
>>> --- a/xen/arch/x86/include/asm/bzimage.h
>>> +++ b/xen/arch/x86/include/asm/bzimage.h
>>> @@ -4,8 +4,7 @@
>>>   #include <xen/init.h>
>>>   
>>>   unsigned long bzimage_headroom(void *image_start, unsigned long image_length);
>>> -
>>>   int bzimage_parse(void *image_base, void **image_start,
>>> -                  unsigned long *image_len);
>>> +                  unsigned long *image_len, unsigned int *align);
> 
> Any particular reason for dropping the blank line? I'd prefer if it was kept,
> and I may take the liberty to respectively adjust the patch while committing.

No, no particular reason.  The blank line can be retained.

Thanks,
Jason


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 3/3] x86/PVH: Support relocatable dom0 kernels
  2024-03-06 18:50 [PATCH 0/3] x86/pvh: Support relocating dom0 kernel Jason Andryuk
  2024-03-06 18:50 ` [PATCH 1/3] features.h: Replace hard tabs Jason Andryuk
  2024-03-06 18:50 ` [PATCH 2/3] xen/x86: bzImage parse kernel_alignment Jason Andryuk
@ 2024-03-06 18:50 ` Jason Andryuk
  2024-03-07  2:09   ` Stefano Stabellini
                     ` (2 more replies)
  2024-03-06 19:31 ` [LINUX PATCH] RFC: x86/pvh: Make Xen PVH entrypoint PIC Jason Andryuk
  2024-03-07 10:00 ` [PATCH 0/3] x86/pvh: Support relocating dom0 kernel Roger Pau Monné
  4 siblings, 3 replies; 21+ messages in thread
From: Jason Andryuk @ 2024-03-06 18:50 UTC (permalink / raw)
  To: xen-devel
  Cc: Jason Andryuk, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

Xen tries to load a PVH dom0 kernel at the fixed guest physical address
from the elf headers.  For Linux, this defaults to 0x1000000 (16MB), but
it can be configured.

Unfortunately there exist firmwares that have reserved regions at this
address, so Xen fails to load the dom0 kernel since it's not RAM.

The PVH entry code is not relocatable - it loads from absolute
addresses, which fail when the kernel is loaded at a different address.
With a suitably modified kernel, a reloctable entry point is possible.

Add the XENFEAT_pvh_relocatable flag to let a kernel indicate that it
supports a relocatable entry path.

Change the loading to check for an acceptable load address.  If the
kernel is relocatable, support finding an alternate load address.

Linux cares about its physical alignment.  This can be pulled out of the
bzImage header, but not from the vmlinux ELF file.  If an alignment
can't be found, use 2MB.

Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
---
Put alignment as a new ELF note?  Presence of that note would indicate
relocation support without a new XENFEAT flag.

Default alignment to a multiple of 2MB to make more cases work?  It has
to be a power of two, and a multiple might allow loading a customized
kernel.  A larger alignment would limit the number of possible load
locations.
---
 xen/arch/x86/hvm/dom0_build.c | 109 ++++++++++++++++++++++++++++++++++
 xen/include/public/features.h |   5 ++
 2 files changed, 114 insertions(+)

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index bbae8a5645..34d68ee8fb 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -537,6 +537,109 @@ static paddr_t __init find_memory(
     return INVALID_PADDR;
 }
 
+static bool __init check_load_address(
+    const struct domain *d, const struct elf_binary *elf)
+{
+    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
+    paddr_t kernel_end = ROUNDUP((paddr_t)elf->dest_base + elf->dest_size,
+                                 PAGE_SIZE);
+    unsigned int i;
+
+    /*
+     * The memory map is sorted and all RAM regions starts and sizes are
+     * aligned to page boundaries.
+     */
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        paddr_t start = d->arch.e820[i].addr;
+        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
+
+        if ( start <= kernel_start &&
+             end >= kernel_end &&
+             d->arch.e820[i].type == E820_RAM )
+            return true;
+    }
+
+    return false;
+}
+
+/*
+ * Find an e820 RAM region that fits the kernel at a suitable alignment.
+ */
+static paddr_t find_kernel_memory(
+    const struct domain *d, struct elf_binary *elf, paddr_t align)
+{
+    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
+    paddr_t kernel_end = ROUNDUP((paddr_t)elf->dest_base + elf->dest_size,
+                                 PAGE_SIZE);
+    unsigned int i;
+
+    /*
+     * The memory map is sorted and all RAM regions starts and sizes are
+     * aligned to page boundaries.
+     */
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        paddr_t start = d->arch.e820[i].addr;
+        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
+        paddr_t kstart, kend, offset;
+
+        if ( d->arch.e820[i].type != E820_RAM )
+            continue;
+
+        if ( d->arch.e820[i].size < elf->dest_size )
+            continue;
+
+        if ( end < kernel_end )
+            continue;
+
+        kstart = ROUNDUP(start, align);
+        offset = kstart - kernel_start;
+        kend = kernel_end + offset;
+
+        if ( kend <= end )
+            return offset;
+    }
+
+    return 0;
+}
+
+/*
+ * Check the kernel load address, and adjust if necessary and possible.
+ */
+static bool __init adjust_load_address(
+    const struct domain *d, struct elf_binary *elf, struct elf_dom_parms *parms,
+    paddr_t align)
+{
+    paddr_t offset;
+
+    /* Check load address */
+    if ( check_load_address(d, elf) )
+        return true;
+
+    if ( !test_bit(XENFEAT_pvh_relocatable, parms->f_supported) )
+    {
+        printk("Address conflict and %pd kernel is not relocatable\n", d);
+        return false;
+    }
+
+    if ( align == 0 )
+        align = MB(2);
+
+    offset = find_kernel_memory(d, elf, align);
+    if ( offset == 0 )
+    {
+        printk("Failed find a load offset for the kernel\n");
+        return false;
+    }
+
+    printk("Adjusting load address by %#lx\n", offset);
+    elf->dest_base += offset;
+    parms->phys_entry += offset;
+
+    return true;
+}
+
 static int __init pvh_load_kernel(struct domain *d, const module_t *image,
                                   unsigned long image_headroom,
                                   module_t *initrd, void *image_base,
@@ -587,6 +690,12 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image,
     elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
     elf.dest_size = parms.virt_kend - parms.virt_kstart;
 
+    if ( !adjust_load_address(d, &elf, &parms, align) )
+    {
+        printk("Unable to load kernel\n");
+        return -ENOMEM;
+    }
+
     elf_set_vcpu(&elf, v);
     rc = elf_load_binary(&elf);
     if ( rc < 0 )
diff --git a/xen/include/public/features.h b/xen/include/public/features.h
index 4437f25d25..300480cb22 100644
--- a/xen/include/public/features.h
+++ b/xen/include/public/features.h
@@ -120,6 +120,11 @@
 #define XENFEAT_runstate_phys_area        18
 #define XENFEAT_vcpu_time_phys_area       19
 
+/*
+ * PVH: If set, the guest supports relocation in load address.
+ */
+#define XENFEAT_pvh_relocatable           20
+
 #define XENFEAT_NR_SUBMAPS 1
 
 #endif /* __XEN_PUBLIC_FEATURES_H__ */
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] x86/PVH: Support relocatable dom0 kernels
  2024-03-06 18:50 ` [PATCH 3/3] x86/PVH: Support relocatable dom0 kernels Jason Andryuk
@ 2024-03-07  2:09   ` Stefano Stabellini
  2024-03-07 16:07     ` Jason Andryuk
  2024-03-07  9:30   ` Roger Pau Monné
  2024-03-11 16:53   ` Jan Beulich
  2 siblings, 1 reply; 21+ messages in thread
From: Stefano Stabellini @ 2024-03-07  2:09 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

On Wed, 6 Mar 2024, Jason Andryuk wrote:
> Xen tries to load a PVH dom0 kernel at the fixed guest physical address
> from the elf headers.  For Linux, this defaults to 0x1000000 (16MB), but
> it can be configured.
> 
> Unfortunately there exist firmwares that have reserved regions at this
> address, so Xen fails to load the dom0 kernel since it's not RAM.
> 
> The PVH entry code is not relocatable - it loads from absolute
> addresses, which fail when the kernel is loaded at a different address.
> With a suitably modified kernel, a reloctable entry point is possible.
> 
> Add the XENFEAT_pvh_relocatable flag to let a kernel indicate that it
> supports a relocatable entry path.
> 
> Change the loading to check for an acceptable load address.  If the
> kernel is relocatable, support finding an alternate load address.
> 
> Linux cares about its physical alignment.  This can be pulled out of the
> bzImage header, but not from the vmlinux ELF file.  If an alignment
> can't be found, use 2MB.
> 
> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
> ---
> Put alignment as a new ELF note?  Presence of that note would indicate
> relocation support without a new XENFEAT flag.
> 
> Default alignment to a multiple of 2MB to make more cases work?  It has
> to be a power of two, and a multiple might allow loading a customized
> kernel.  A larger alignment would limit the number of possible load
> locations.
> ---
>  xen/arch/x86/hvm/dom0_build.c | 109 ++++++++++++++++++++++++++++++++++
>  xen/include/public/features.h |   5 ++
>  2 files changed, 114 insertions(+)
> 
> diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
> index bbae8a5645..34d68ee8fb 100644
> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -537,6 +537,109 @@ static paddr_t __init find_memory(
>      return INVALID_PADDR;
>  }
>  
> +static bool __init check_load_address(
> +    const struct domain *d, const struct elf_binary *elf)
> +{
> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
> +    paddr_t kernel_end = ROUNDUP((paddr_t)elf->dest_base + elf->dest_size,
> +                                 PAGE_SIZE);
> +    unsigned int i;
> +
> +    /*
> +     * The memory map is sorted and all RAM regions starts and sizes are
> +     * aligned to page boundaries.
> +     */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        paddr_t start = d->arch.e820[i].addr;
> +        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
> +
> +        if ( start <= kernel_start &&
> +             end >= kernel_end &&
> +             d->arch.e820[i].type == E820_RAM )
> +            return true;
> +    }
> +
> +    return false;
> +}
> +
> +/*
> + * Find an e820 RAM region that fits the kernel at a suitable alignment.
> + */
> +static paddr_t find_kernel_memory(
> +    const struct domain *d, struct elf_binary *elf, paddr_t align)
> +{
> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
> +    paddr_t kernel_end = ROUNDUP((paddr_t)elf->dest_base + elf->dest_size,
> +                                 PAGE_SIZE);
> +    unsigned int i;
> +
> +    /*
> +     * The memory map is sorted and all RAM regions starts and sizes are
> +     * aligned to page boundaries.
> +     */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        paddr_t start = d->arch.e820[i].addr;
> +        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
> +        paddr_t kstart, kend, offset;
> +
> +        if ( d->arch.e820[i].type != E820_RAM )
> +            continue;
> +
> +        if ( d->arch.e820[i].size < elf->dest_size )
> +            continue;
> +
> +        if ( end < kernel_end )
> +            continue;

Why this check? Is it to make sure we look for e820 regions that are
higher in terms of addresses? If so, couldn't we start from
d->arch.nr_e820 and go down instead of starting from 0 and going up?

The PVH entry point is a 32-bit entry point if I remember right? Do we
need a 32-bit check? If so then it might not be a good idea to start
from arch.nr_e820 and go down.



> +        kstart = ROUNDUP(start, align);
> +        offset = kstart - kernel_start;
> +        kend = kernel_end + offset;
> +
> +        if ( kend <= end )
> +            return offset;
> +    }
> +
> +    return 0;
> +}
> +
> +/*
> + * Check the kernel load address, and adjust if necessary and possible.
> + */
> +static bool __init adjust_load_address(
> +    const struct domain *d, struct elf_binary *elf, struct elf_dom_parms *parms,
> +    paddr_t align)
> +{
> +    paddr_t offset;
> +
> +    /* Check load address */
> +    if ( check_load_address(d, elf) )
> +        return true;
> +
> +    if ( !test_bit(XENFEAT_pvh_relocatable, parms->f_supported) )
> +    {
> +        printk("Address conflict and %pd kernel is not relocatable\n", d);
> +        return false;
> +    }
> +
> +    if ( align == 0 )
> +        align = MB(2);
> +
> +    offset = find_kernel_memory(d, elf, align);
> +    if ( offset == 0 )
> +    {
> +        printk("Failed find a load offset for the kernel\n");
> +        return false;
> +    }
> +
> +    printk("Adjusting load address by %#lx\n", offset);
> +    elf->dest_base += offset;
> +    parms->phys_entry += offset;
> +
> +    return true;
> +}
> +
>  static int __init pvh_load_kernel(struct domain *d, const module_t *image,
>                                    unsigned long image_headroom,
>                                    module_t *initrd, void *image_base,
> @@ -587,6 +690,12 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image,
>      elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
>      elf.dest_size = parms.virt_kend - parms.virt_kstart;
>  
> +    if ( !adjust_load_address(d, &elf, &parms, align) )
> +    {
> +        printk("Unable to load kernel\n");
> +        return -ENOMEM;
> +    }
> +
>      elf_set_vcpu(&elf, v);
>      rc = elf_load_binary(&elf);
>      if ( rc < 0 )
> diff --git a/xen/include/public/features.h b/xen/include/public/features.h
> index 4437f25d25..300480cb22 100644
> --- a/xen/include/public/features.h
> +++ b/xen/include/public/features.h
> @@ -120,6 +120,11 @@
>  #define XENFEAT_runstate_phys_area        18
>  #define XENFEAT_vcpu_time_phys_area       19
>  
> +/*
> + * PVH: If set, the guest supports relocation in load address.
> + */
> +#define XENFEAT_pvh_relocatable           20
> +
>  #define XENFEAT_NR_SUBMAPS 1
>  
>  #endif /* __XEN_PUBLIC_FEATURES_H__ */
> -- 
> 2.44.0
> 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] x86/PVH: Support relocatable dom0 kernels
  2024-03-07  2:09   ` Stefano Stabellini
@ 2024-03-07 16:07     ` Jason Andryuk
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Andryuk @ 2024-03-07 16:07 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall

On 2024-03-06 21:09, Stefano Stabellini wrote:
> On Wed, 6 Mar 2024, Jason Andryuk wrote:
>> Xen tries to load a PVH dom0 kernel at the fixed guest physical address
>> from the elf headers.  For Linux, this defaults to 0x1000000 (16MB), but
>> it can be configured.
>>
>> Unfortunately there exist firmwares that have reserved regions at this
>> address, so Xen fails to load the dom0 kernel since it's not RAM.
>>
>> The PVH entry code is not relocatable - it loads from absolute
>> addresses, which fail when the kernel is loaded at a different address.
>> With a suitably modified kernel, a reloctable entry point is possible.
>>
>> Add the XENFEAT_pvh_relocatable flag to let a kernel indicate that it
>> supports a relocatable entry path.
>>
>> Change the loading to check for an acceptable load address.  If the
>> kernel is relocatable, support finding an alternate load address.
>>
>> Linux cares about its physical alignment.  This can be pulled out of the
>> bzImage header, but not from the vmlinux ELF file.  If an alignment
>> can't be found, use 2MB.
>>
>> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
>> ---
>> Put alignment as a new ELF note?  Presence of that note would indicate
>> relocation support without a new XENFEAT flag.
>>
>> Default alignment to a multiple of 2MB to make more cases work?  It has
>> to be a power of two, and a multiple might allow loading a customized
>> kernel.  A larger alignment would limit the number of possible load
>> locations.
>> ---
>>   xen/arch/x86/hvm/dom0_build.c | 109 ++++++++++++++++++++++++++++++++++
>>   xen/include/public/features.h |   5 ++
>>   2 files changed, 114 insertions(+)
>>
>> diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
>> index bbae8a5645..34d68ee8fb 100644
>> --- a/xen/arch/x86/hvm/dom0_build.c
>> +++ b/xen/arch/x86/hvm/dom0_build.c
>> @@ -537,6 +537,109 @@ static paddr_t __init find_memory(
>>       return INVALID_PADDR;
>>   }
>>   
>> +static bool __init check_load_address(
>> +    const struct domain *d, const struct elf_binary *elf)
>> +{
>> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
>> +    paddr_t kernel_end = ROUNDUP((paddr_t)elf->dest_base + elf->dest_size,
>> +                                 PAGE_SIZE);
>> +    unsigned int i;
>> +
>> +    /*
>> +     * The memory map is sorted and all RAM regions starts and sizes are
>> +     * aligned to page boundaries.
>> +     */
>> +    for ( i = 0; i < d->arch.nr_e820; i++ )
>> +    {
>> +        paddr_t start = d->arch.e820[i].addr;
>> +        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
>> +
>> +        if ( start <= kernel_start &&
>> +             end >= kernel_end &&
>> +             d->arch.e820[i].type == E820_RAM )
>> +            return true;
>> +    }
>> +
>> +    return false;
>> +}
>> +
>> +/*
>> + * Find an e820 RAM region that fits the kernel at a suitable alignment.
>> + */
>> +static paddr_t find_kernel_memory(
>> +    const struct domain *d, struct elf_binary *elf, paddr_t align)
>> +{
>> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
>> +    paddr_t kernel_end = ROUNDUP((paddr_t)elf->dest_base + elf->dest_size,
>> +                                 PAGE_SIZE);
>> +    unsigned int i;
>> +
>> +    /*
>> +     * The memory map is sorted and all RAM regions starts and sizes are
>> +     * aligned to page boundaries.
>> +     */
>> +    for ( i = 0; i < d->arch.nr_e820; i++ )
>> +    {
>> +        paddr_t start = d->arch.e820[i].addr;
>> +        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
>> +        paddr_t kstart, kend, offset;
>> +
>> +        if ( d->arch.e820[i].type != E820_RAM )
>> +            continue;
>> +
>> +        if ( d->arch.e820[i].size < elf->dest_size )
>> +            continue;
>> +
>> +        if ( end < kernel_end )
>> +            continue;
> 
> Why this check? Is it to make sure we look for e820 regions that are
> higher in terms of addresses? If so, couldn't we start from
> d->arch.nr_e820 and go down instead of starting from 0 and going up?

Yes, I thought we only wanted a higher address.  The Linux bzImage entry 
code uses the LOAD_PHYSICAL_ADDR (CONFIG_PHYSICAL_START/elf->dest_base) 
as a minimum when extracting vmlinux for a relocatable kernel.  I'm not 
sure if that is strictly required though.

I also thought a smaller adjustment would be better, so starting from 
the lower e820 entries would find the first acceptable one.  But that 
may not matter.

> The PVH entry point is a 32-bit entry point if I remember right? Do we
> need a 32-bit check? If so then it might not be a good idea to start
> from arch.nr_e820 and go down.

Yes, the entry point is 32-bit, so you are correct that that the range 
should to be limited at 4GB.

Regards,
Jason


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] x86/PVH: Support relocatable dom0 kernels
  2024-03-06 18:50 ` [PATCH 3/3] x86/PVH: Support relocatable dom0 kernels Jason Andryuk
  2024-03-07  2:09   ` Stefano Stabellini
@ 2024-03-07  9:30   ` Roger Pau Monné
  2024-03-07 17:01     ` Jason Andryuk
  2024-03-11 16:53   ` Jan Beulich
  2 siblings, 1 reply; 21+ messages in thread
From: Roger Pau Monné @ 2024-03-07  9:30 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Wei Liu, George Dunlap,
	Julien Grall, Stefano Stabellini

On Wed, Mar 06, 2024 at 01:50:32PM -0500, Jason Andryuk wrote:
> Xen tries to load a PVH dom0 kernel at the fixed guest physical address
> from the elf headers.  For Linux, this defaults to 0x1000000 (16MB), but
> it can be configured.
> 
> Unfortunately there exist firmwares that have reserved regions at this
> address, so Xen fails to load the dom0 kernel since it's not RAM.
> 
> The PVH entry code is not relocatable - it loads from absolute
> addresses, which fail when the kernel is loaded at a different address.
> With a suitably modified kernel, a reloctable entry point is possible.
> 
> Add the XENFEAT_pvh_relocatable flag to let a kernel indicate that it
> supports a relocatable entry path.
> 
> Change the loading to check for an acceptable load address.  If the
> kernel is relocatable, support finding an alternate load address.
> 
> Linux cares about its physical alignment.  This can be pulled out of the
> bzImage header, but not from the vmlinux ELF file.  If an alignment
> can't be found, use 2MB.

While I'm fine with having a Linux specific way, there needs to be a
generic way of passing the alignment for non-bzImage kernels.

ELF program headers have an align field, would that be suitable to
use?

In elf_parse_binary() where the p{start,end} is calculated, you could
also fetch the p_offset from the lower found program header and use it
as the required alignment.  Would that be OK for Linux and maybe avoid
having to fiddle with the bzImage header?  FWIW it is likely fine for
FreeBSD.

> 
> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>

I created a gitlab ticket for this:

https://gitlab.com/xen-project/xen/-/issues/180

If you want to reference it.

> ---
> Put alignment as a new ELF note?  Presence of that note would indicate
> relocation support without a new XENFEAT flag.
> 
> Default alignment to a multiple of 2MB to make more cases work?  It has
> to be a power of two, and a multiple might allow loading a customized
> kernel.  A larger alignment would limit the number of possible load
> locations.
> ---
>  xen/arch/x86/hvm/dom0_build.c | 109 ++++++++++++++++++++++++++++++++++
>  xen/include/public/features.h |   5 ++
>  2 files changed, 114 insertions(+)
> 
> diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
> index bbae8a5645..34d68ee8fb 100644
> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -537,6 +537,109 @@ static paddr_t __init find_memory(
>      return INVALID_PADDR;
>  }
>  
> +static bool __init check_load_address(
> +    const struct domain *d, const struct elf_binary *elf)
> +{
> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
> +    paddr_t kernel_end = ROUNDUP((paddr_t)elf->dest_base + elf->dest_size,
> +                                 PAGE_SIZE);

You can use PAGE_ALIGN() here (and below) for simplicity.

> +    unsigned int i;
> +
> +    /*
> +     * The memory map is sorted and all RAM regions starts and sizes are
> +     * aligned to page boundaries.
> +     */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        paddr_t start = d->arch.e820[i].addr;
> +        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;

Since the memory map is sorted you can end the loop once end start >=
kernel_end?  As further regions are past the kernel destination.

> +
> +        if ( start <= kernel_start &&
> +             end >= kernel_end &&
> +             d->arch.e820[i].type == E820_RAM )
> +            return true;
> +    }
> +
> +    return false;
> +}
> +
> +/*
> + * Find an e820 RAM region that fits the kernel at a suitable alignment.
> + */

This (and other) comment seems to fit in a single line: /* ... */.

> +static paddr_t find_kernel_memory(
> +    const struct domain *d, struct elf_binary *elf, paddr_t align)

elf can be const AFAICT.

> +{
> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
> +    paddr_t kernel_end = ROUNDUP((paddr_t)elf->dest_base + elf->dest_size,
> +                                 PAGE_SIZE);
> +    unsigned int i;
> +
> +    /*
> +     * The memory map is sorted and all RAM regions starts and sizes are
> +     * aligned to page boundaries.
> +     */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        paddr_t start = d->arch.e820[i].addr;
> +        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
> +        paddr_t kstart, kend, offset;
> +
> +        if ( d->arch.e820[i].type != E820_RAM )
> +            continue;
> +
> +        if ( d->arch.e820[i].size < elf->dest_size )
> +            continue;
> +
> +        if ( end < kernel_end )
> +            continue;

I'm not sure I understand this check, why would we refuse regions
below the fixed kernel end?  Those should be equally suitable if big
enough and meeting the alignment requirements.

> +
> +        kstart = ROUNDUP(start, align);
> +        offset = kstart - kernel_start;
> +        kend = kernel_end + offset;
> +
> +        if ( kend <= end )
> +            return offset;

Why not return this as an address to use to load the kernel instead of
an offset from dest_base?  That would also make the calculations
easier IMO.

> +    }

This should be limited to a range below 4GB.

> +    return 0;
> +}
> +
> +/*
> + * Check the kernel load address, and adjust if necessary and possible.
> + */
> +static bool __init adjust_load_address(
> +    const struct domain *d, struct elf_binary *elf, struct elf_dom_parms *parms,
> +    paddr_t align)
> +{
> +    paddr_t offset;
> +
> +    /* Check load address */
> +    if ( check_load_address(d, elf) )
> +        return true;
> +
> +    if ( !test_bit(XENFEAT_pvh_relocatable, parms->f_supported) )
> +    {
> +        printk("Address conflict and %pd kernel is not relocatable\n", d);
> +        return false;
> +    }
> +
> +    if ( align == 0 )
> +        align = MB(2);
> +
> +    offset = find_kernel_memory(d, elf, align);
> +    if ( offset == 0 )
> +    {
> +        printk("Failed find a load offset for the kernel\n");
> +        return false;
> +    }
> +
> +    printk("Adjusting load address by %#lx\n", offset);

I think this would be more helpful if the previous and the new ranges
are printed, as I'm not sure the previous dest_base is printed, in
which case the offset doesn't help much.  I would do:

if ( opt_dom0_verbose )
    printk("relocating kernel from [%lx, %lx] -> [%lx, %lx]\n", ...);

> +    elf->dest_base += offset;
> +    parms->phys_entry += offset;

As noted above, I think it would be better if find_kernel_memory()
find an absolute address which is then adjusted here.

> +
> +    return true;
> +}
> +
>  static int __init pvh_load_kernel(struct domain *d, const module_t *image,
>                                    unsigned long image_headroom,
>                                    module_t *initrd, void *image_base,
> @@ -587,6 +690,12 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image,
>      elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
>      elf.dest_size = parms.virt_kend - parms.virt_kstart;
>  
> +    if ( !adjust_load_address(d, &elf, &parms, align) )

check_and_adjust_?  As the address is not unconditionally adjusted.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] x86/PVH: Support relocatable dom0 kernels
  2024-03-07  9:30   ` Roger Pau Monné
@ 2024-03-07 17:01     ` Jason Andryuk
  2024-03-08  6:34       ` Juergen Gross
  0 siblings, 1 reply; 21+ messages in thread
From: Jason Andryuk @ 2024-03-07 17:01 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Wei Liu, George Dunlap,
	Julien Grall, Stefano Stabellini

On 2024-03-07 04:30, Roger Pau Monné wrote:
> On Wed, Mar 06, 2024 at 01:50:32PM -0500, Jason Andryuk wrote:
>> Xen tries to load a PVH dom0 kernel at the fixed guest physical address
>> from the elf headers.  For Linux, this defaults to 0x1000000 (16MB), but
>> it can be configured.
>>
>> Unfortunately there exist firmwares that have reserved regions at this
>> address, so Xen fails to load the dom0 kernel since it's not RAM.
>>
>> The PVH entry code is not relocatable - it loads from absolute
>> addresses, which fail when the kernel is loaded at a different address.
>> With a suitably modified kernel, a reloctable entry point is possible.
>>
>> Add the XENFEAT_pvh_relocatable flag to let a kernel indicate that it
>> supports a relocatable entry path.
>>
>> Change the loading to check for an acceptable load address.  If the
>> kernel is relocatable, support finding an alternate load address.
>>
>> Linux cares about its physical alignment.  This can be pulled out of the
>> bzImage header, but not from the vmlinux ELF file.  If an alignment
>> can't be found, use 2MB.
> 
> While I'm fine with having a Linux specific way, there needs to be a
> generic way of passing the alignment for non-bzImage kernels.
> 
> ELF program headers have an align field, would that be suitable to
> use?

Unfortunately, it doesn't seem correct.  Linux has 
CONFIG_PHYSICAL_ALIGN, and it doesn't seem to be used in the elf 
headers.  As a quick test, I set CONFIG_PHYSICAL_ALIGN=0x800000, but the 
elf align values are still 0x200000.

> In elf_parse_binary() where the p{start,end} is calculated, you could
> also fetch the p_offset from the lower found program header and use it
> as the required alignment.  Would that be OK for Linux and maybe avoid
> having to fiddle with the bzImage header?  FWIW it is likely fine for
> FreeBSD.

Adding an explicit value in an elf note would avoid any ambiguity.

>>
>> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
> 
> I created a gitlab ticket for this:
> 
> https://gitlab.com/xen-project/xen/-/issues/180
> 
> If you want to reference it.

Will do

>> ---
>> Put alignment as a new ELF note?  Presence of that note would indicate
>> relocation support without a new XENFEAT flag.
>>
>> Default alignment to a multiple of 2MB to make more cases work?  It has
>> to be a power of two, and a multiple might allow loading a customized
>> kernel.  A larger alignment would limit the number of possible load
>> locations.
>> ---
>>   xen/arch/x86/hvm/dom0_build.c | 109 ++++++++++++++++++++++++++++++++++
>>   xen/include/public/features.h |   5 ++
>>   2 files changed, 114 insertions(+)
>>
>> diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
>> index bbae8a5645..34d68ee8fb 100644
>> --- a/xen/arch/x86/hvm/dom0_build.c
>> +++ b/xen/arch/x86/hvm/dom0_build.c
>> @@ -537,6 +537,109 @@ static paddr_t __init find_memory(
>>       return INVALID_PADDR;
>>   }
>>   
>> +static bool __init check_load_address(
>> +    const struct domain *d, const struct elf_binary *elf)
>> +{
>> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
>> +    paddr_t kernel_end = ROUNDUP((paddr_t)elf->dest_base + elf->dest_size,
>> +                                 PAGE_SIZE);
> 
> You can use PAGE_ALIGN() here (and below) for simplicity.

Ok

>> +    unsigned int i;
>> +
>> +    /*
>> +     * The memory map is sorted and all RAM regions starts and sizes are
>> +     * aligned to page boundaries.
>> +     */
>> +    for ( i = 0; i < d->arch.nr_e820; i++ )
>> +    {
>> +        paddr_t start = d->arch.e820[i].addr;
>> +        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
> 
> Since the memory map is sorted you can end the loop once end start >=
> kernel_end?  As further regions are past the kernel destination.

Yes, thanks.

>> +
>> +        if ( start <= kernel_start &&
>> +             end >= kernel_end &&
>> +             d->arch.e820[i].type == E820_RAM )
>> +            return true;
>> +    }
>> +
>> +    return false;
>> +}
>> +
>> +/*
>> + * Find an e820 RAM region that fits the kernel at a suitable alignment.
>> + */
> 
> This (and other) comment seems to fit in a single line: /* ... */.

Ok

>> +static paddr_t find_kernel_memory(
>> +    const struct domain *d, struct elf_binary *elf, paddr_t align)
> 
> elf can be const AFAICT.

Ok

>> +{
>> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
>> +    paddr_t kernel_end = ROUNDUP((paddr_t)elf->dest_base + elf->dest_size,
>> +                                 PAGE_SIZE);
>> +    unsigned int i;
>> +
>> +    /*
>> +     * The memory map is sorted and all RAM regions starts and sizes are
>> +     * aligned to page boundaries.
>> +     */
>> +    for ( i = 0; i < d->arch.nr_e820; i++ )
>> +    {
>> +        paddr_t start = d->arch.e820[i].addr;
>> +        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
>> +        paddr_t kstart, kend, offset;
>> +
>> +        if ( d->arch.e820[i].type != E820_RAM )
>> +            continue;
>> +
>> +        if ( d->arch.e820[i].size < elf->dest_size )
>> +            continue;
>> +
>> +        if ( end < kernel_end )
>> +            continue;
> 
> I'm not sure I understand this check, why would we refuse regions
> below the fixed kernel end?  Those should be equally suitable if big
> enough and meeting the alignment requirements.

I thought (Linux at least) wouldn't want to be moved lower - only higher 
in ram.  There was some Linux boot code that gave me that impression, 
but it may not be strictly true.

>> +
>> +        kstart = ROUNDUP(start, align);
>> +        offset = kstart - kernel_start;
>> +        kend = kernel_end + offset;
>> +
>> +        if ( kend <= end )
>> +            return offset;
> 
> Why not return this as an address to use to load the kernel instead of
> an offset from dest_base?  That would also make the calculations
> easier IMO.

phys_entry needs to be updated as well as dest_base, so returning the 
offset seemed more useful.

>> +    }
> 
> This should be limited to a range below 4GB.

Yes.

>> +    return 0;
>> +}
>> +
>> +/*
>> + * Check the kernel load address, and adjust if necessary and possible.
>> + */
>> +static bool __init adjust_load_address(
>> +    const struct domain *d, struct elf_binary *elf, struct elf_dom_parms *parms,
>> +    paddr_t align)
>> +{
>> +    paddr_t offset;
>> +
>> +    /* Check load address */
>> +    if ( check_load_address(d, elf) )
>> +        return true;
>> +
>> +    if ( !test_bit(XENFEAT_pvh_relocatable, parms->f_supported) )
>> +    {
>> +        printk("Address conflict and %pd kernel is not relocatable\n", d);
>> +        return false;
>> +    }
>> +
>> +    if ( align == 0 )
>> +        align = MB(2);
>> +
>> +    offset = find_kernel_memory(d, elf, align);
>> +    if ( offset == 0 )
>> +    {
>> +        printk("Failed find a load offset for the kernel\n");
>> +        return false;
>> +    }
>> +
>> +    printk("Adjusting load address by %#lx\n", offset);
> 
> I think this would be more helpful if the previous and the new ranges
> are printed, as I'm not sure the previous dest_base is printed, in
> which case the offset doesn't help much.  I would do:
> 
> if ( opt_dom0_verbose )
>      printk("relocating kernel from [%lx, %lx] -> [%lx, %lx]\n", ...);

dest_base was printed (maybe with extra verbosity), but this message is 
clearer since it shows the end result.

>> +    elf->dest_base += offset;
>> +    parms->phys_entry += offset;
> 
> As noted above, I think it would be better if find_kernel_memory()
> find an absolute address which is then adjusted here.

I thought the modification with just offset was clearer compared to:

params->phys_entry += (reloc_base - elf->dest_base);
elf->dest_base = reloc_base;

But I'm fine making the change.

>> +
>> +    return true;
>> +}
>> +
>>   static int __init pvh_load_kernel(struct domain *d, const module_t *image,
>>                                     unsigned long image_headroom,
>>                                     module_t *initrd, void *image_base,
>> @@ -587,6 +690,12 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image,
>>       elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
>>       elf.dest_size = parms.virt_kend - parms.virt_kstart;
>>   
>> +    if ( !adjust_load_address(d, &elf, &parms, align) )
> 
> check_and_adjust_?  As the address is not unconditionally adjusted.

I thought it was a little too wordy, but it is more accurate.  I'll 
rename it.

Thanks for taking a look.

-Jason


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] x86/PVH: Support relocatable dom0 kernels
  2024-03-07 17:01     ` Jason Andryuk
@ 2024-03-08  6:34       ` Juergen Gross
  0 siblings, 0 replies; 21+ messages in thread
From: Juergen Gross @ 2024-03-08  6:34 UTC (permalink / raw)
  To: Jason Andryuk, Roger Pau Monné
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Wei Liu, George Dunlap,
	Julien Grall, Stefano Stabellini


[-- Attachment #1.1.1: Type: text/plain, Size: 1967 bytes --]

On 07.03.24 18:01, Jason Andryuk wrote:
> On 2024-03-07 04:30, Roger Pau Monné wrote:
>> On Wed, Mar 06, 2024 at 01:50:32PM -0500, Jason Andryuk wrote:
>>> Xen tries to load a PVH dom0 kernel at the fixed guest physical address
>>> from the elf headers.  For Linux, this defaults to 0x1000000 (16MB), but
>>> it can be configured.
>>>
>>> Unfortunately there exist firmwares that have reserved regions at this
>>> address, so Xen fails to load the dom0 kernel since it's not RAM.
>>>
>>> The PVH entry code is not relocatable - it loads from absolute
>>> addresses, which fail when the kernel is loaded at a different address.
>>> With a suitably modified kernel, a reloctable entry point is possible.
>>>
>>> Add the XENFEAT_pvh_relocatable flag to let a kernel indicate that it
>>> supports a relocatable entry path.
>>>
>>> Change the loading to check for an acceptable load address.  If the
>>> kernel is relocatable, support finding an alternate load address.
>>>
>>> Linux cares about its physical alignment.  This can be pulled out of the
>>> bzImage header, but not from the vmlinux ELF file.  If an alignment
>>> can't be found, use 2MB.
>>
>> While I'm fine with having a Linux specific way, there needs to be a
>> generic way of passing the alignment for non-bzImage kernels.
>>
>> ELF program headers have an align field, would that be suitable to
>> use?
> 
> Unfortunately, it doesn't seem correct.  Linux has CONFIG_PHYSICAL_ALIGN, and it 
> doesn't seem to be used in the elf headers.  As a quick test, I set 
> CONFIG_PHYSICAL_ALIGN=0x800000, but the elf align values are still 0x200000.

An excerpt from the kernel's arch/x86/Makefile:

#
# The 64-bit kernel must be aligned to 2MB.  Pass -z max-page-size=0x200000 to
# the linker to force 2MB page size regardless of the default page size used
# by the linker.
#
ifdef CONFIG_X86_64
LDFLAGS_vmlinux += -z max-page-size=0x200000
endif


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3743 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] x86/PVH: Support relocatable dom0 kernels
  2024-03-06 18:50 ` [PATCH 3/3] x86/PVH: Support relocatable dom0 kernels Jason Andryuk
  2024-03-07  2:09   ` Stefano Stabellini
  2024-03-07  9:30   ` Roger Pau Monné
@ 2024-03-11 16:53   ` Jan Beulich
  2024-03-11 19:53     ` Jason Andryuk
  2 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2024-03-11 16:53 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, George Dunlap,
	Julien Grall, Stefano Stabellini, xen-devel

On 06.03.2024 19:50, Jason Andryuk wrote:
> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -537,6 +537,109 @@ static paddr_t __init find_memory(
>      return INVALID_PADDR;
>  }
>  
> +static bool __init check_load_address(
> +    const struct domain *d, const struct elf_binary *elf)
> +{
> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
> +    paddr_t kernel_end = ROUNDUP((paddr_t)elf->dest_base + elf->dest_size,
> +                                 PAGE_SIZE);
> +    unsigned int i;
> +
> +    /*
> +     * The memory map is sorted and all RAM regions starts and sizes are
> +     * aligned to page boundaries.
> +     */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        paddr_t start = d->arch.e820[i].addr;
> +        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
> +
> +        if ( start <= kernel_start &&
> +             end >= kernel_end &&
> +             d->arch.e820[i].type == E820_RAM )
> +            return true;
> +    }
> +
> +    return false;
> +}
> +
> +/*
> + * Find an e820 RAM region that fits the kernel at a suitable alignment.
> + */
> +static paddr_t find_kernel_memory(

__init ?

Jan


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] x86/PVH: Support relocatable dom0 kernels
  2024-03-11 16:53   ` Jan Beulich
@ 2024-03-11 19:53     ` Jason Andryuk
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Andryuk @ 2024-03-11 19:53 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, George Dunlap,
	Julien Grall, Stefano Stabellini, xen-devel

On 2024-03-11 12:53, Jan Beulich wrote:
> On 06.03.2024 19:50, Jason Andryuk wrote:
>> --- a/xen/arch/x86/hvm/dom0_build.c
>> +++ b/xen/arch/x86/hvm/dom0_build.c
>> @@ -537,6 +537,109 @@ static paddr_t __init find_memory(
>>       return INVALID_PADDR;
>>   }
>>   
>> +static bool __init check_load_address(
>> +    const struct domain *d, const struct elf_binary *elf)
>> +{
>> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
>> +    paddr_t kernel_end = ROUNDUP((paddr_t)elf->dest_base + elf->dest_size,
>> +                                 PAGE_SIZE);
>> +    unsigned int i;
>> +
>> +    /*
>> +     * The memory map is sorted and all RAM regions starts and sizes are
>> +     * aligned to page boundaries.
>> +     */
>> +    for ( i = 0; i < d->arch.nr_e820; i++ )
>> +    {
>> +        paddr_t start = d->arch.e820[i].addr;
>> +        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
>> +
>> +        if ( start <= kernel_start &&
>> +             end >= kernel_end &&
>> +             d->arch.e820[i].type == E820_RAM )
>> +            return true;
>> +    }
>> +
>> +    return false;
>> +}
>> +
>> +/*
>> + * Find an e820 RAM region that fits the kernel at a suitable alignment.
>> + */
>> +static paddr_t find_kernel_memory(
> 
> __init ?

Yes, thanks for catching that.

Regards,
Jason


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [LINUX PATCH] RFC: x86/pvh: Make Xen PVH entrypoint PIC
  2024-03-06 18:50 [PATCH 0/3] x86/pvh: Support relocating dom0 kernel Jason Andryuk
                   ` (2 preceding siblings ...)
  2024-03-06 18:50 ` [PATCH 3/3] x86/PVH: Support relocatable dom0 kernels Jason Andryuk
@ 2024-03-06 19:31 ` Jason Andryuk
  2024-03-07 10:00 ` [PATCH 0/3] x86/pvh: Support relocating dom0 kernel Roger Pau Monné
  4 siblings, 0 replies; 21+ messages in thread
From: Jason Andryuk @ 2024-03-06 19:31 UTC (permalink / raw)
  To: xen-devel
  Cc: Jan Beulich, Roger Pau Monné, Andrew Cooper, Juergen Gross,
	Boris Ostrovsky, Jason Andryuk

The Xen PVH entrypoint is 32bit non-PIC code running at a default load
address of 0x1000000 (16MB) (CONFIG_PHYSICAL_START).  Xen loads the
kernel at that physical address inside the PVH container.

When running a PVH Dom0, the system reserved addresses are mapped 1-1
into the PVH container.  There exist system firmwares (Coreboot/EDK2)
with reserved memory at 16MB.  This creates a conflict where the PVH
kernel cannot be loaded at that address.

Modify the PVH entrypoint to be position-indepedent to allow flexibility
in load address.

Initial PVH entry runs at the physical addresses and then transitions to
the identity mapped address.  While executing xen_prepare_pvh() calls
through pv_ops function pointers transition to the high mapped
addresses.  Additionally, __va() is called on some hvm_start_info
physical addresses, we need the directmap address range is used.  So we
need to run page tables with all of those ranges mapped.

Modifying init_top_pgt tables ran into issue since
startup_64/__startup_64() will modify those page tables again.  Use a
dedicated set of page tables - pvh_init_top_pgt  - for the PVH entry to
avoid unwanted interactions.

In xen_pvh_init(), __pa() is called to find the physical address of the
hypercall page.  Set phys_base temporarily before calling into
xen_prepare_pvh(), which calls xen_pvh_init(), and clear it afterwards.
__startup_64() assumes phys_base is zero and adds load_delta to it.  If
phys_base is already set, the calculation results in an incorrect cr3.

TODO: The 32bit entry path needs additional work to have relocatable
pagetables.

TODO: Sync features.h from xen.git commit xxxxxxxxxx when it is
commited.

Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
---
Put this out as an example for the Xen modifications.  Will split for
actual submission.

Instead of setting and clearing phys_base, add a dedicated variable?

Make __startup_64() exit if phys_base is already set to allow calling
multiple times, and use that and init_top_pgt instead of adding
additional page tables?
---
 arch/x86/platform/pvh/head.S     | 195 ++++++++++++++++++++++++++++---
 arch/x86/xen/xen-head.S          |   7 +-
 include/xen/interface/features.h |   5 +
 3 files changed, 190 insertions(+), 17 deletions(-)

diff --git a/arch/x86/platform/pvh/head.S b/arch/x86/platform/pvh/head.S
index f7235ef87bc3..bab857db55ef 100644
--- a/arch/x86/platform/pvh/head.S
+++ b/arch/x86/platform/pvh/head.S
@@ -50,11 +50,32 @@
 #define PVH_CS_SEL		(PVH_GDT_ENTRY_CS * 8)
 #define PVH_DS_SEL		(PVH_GDT_ENTRY_DS * 8)
 
+#define rva(x) ((x) - pvh_start_xen)
+
 SYM_CODE_START_LOCAL(pvh_start_xen)
 	UNWIND_HINT_END_OF_STACK
 	cld
 
-	lgdt (_pa(gdt))
+	/*
+	 * See the comment for startup_32 for more details.  We need to
+	 * execute a call to get the execution address to be position
+	 * independent, but we don't have a stack.  Save and restore the
+	 * magic field of start_info in ebx, and use that as the stack.
+	 */
+	mov	(%ebx), %eax
+	leal	4(%ebx), %esp
+	ANNOTATE_INTRA_FUNCTION_CALL
+	call	1f
+1:	popl	%ebp
+	mov	%eax, (%ebx)
+	subl	$ rva(1b), %ebp
+	movl	$0, %esp
+
+	leal	rva(gdt)(%ebp), %eax
+	movl	%eax, %ecx
+	leal	rva(gdt_start)(%ebp), %ecx
+	movl	%ecx, 2(%eax)
+	lgdt	(%eax)
 
 	mov $PVH_DS_SEL,%eax
 	mov %eax,%ds
@@ -62,14 +83,14 @@ SYM_CODE_START_LOCAL(pvh_start_xen)
 	mov %eax,%ss
 
 	/* Stash hvm_start_info. */
-	mov $_pa(pvh_start_info), %edi
+	leal rva(pvh_start_info)(%ebp), %edi
 	mov %ebx, %esi
-	mov _pa(pvh_start_info_sz), %ecx
+	movl rva(pvh_start_info_sz)(%ebp), %ecx
 	shr $2,%ecx
 	rep
 	movsl
 
-	mov $_pa(early_stack_end), %esp
+	leal rva(early_stack_end)(%ebp), %esp
 
 	/* Enable PAE mode. */
 	mov %cr4, %eax
@@ -83,29 +104,83 @@ SYM_CODE_START_LOCAL(pvh_start_xen)
 	btsl $_EFER_LME, %eax
 	wrmsr
 
+	mov %ebp, %ebx
+	subl $LOAD_PHYSICAL_ADDR, %ebx /* offset */
+	jz .Lpagetable_done
+
+	/* Fixup page-tables for relocation. */
+	leal rva(pvh_init_top_pgt)(%ebp), %edi
+	movl $512, %ecx
+2:
+	movl 0x00(%edi), %eax
+	addl 0x04(%edi), %eax
+	jz 1f
+	addl %ebx, 0x00(%edi)
+1:
+	addl $8, %edi
+	decl %ecx
+	jnz 2b
+
+	/* L3 ident has a single entry. */
+	leal rva(pvh_level3_ident_pgt)(%ebp), %edi
+	addl %ebx, 0x00(%edi)
+
+	leal rva(pvh_level3_kernel_pgt)(%ebp), %edi
+	addl %ebx, (4096 - 16)(%edi)
+	addl %ebx, (4096 - 8)(%edi)
+
+	/* pvh_level2_ident_pgt is fine - large pages */
+
+	/* pvh_level2_kernel_pgt needs adjustment - large pages */
+	leal rva(pvh_level2_kernel_pgt)(%ebp), %edi
+	movl $512, %ecx
+2:
+	movl 0x00(%edi), %eax
+	addl 0x04(%edi), %eax
+	jz 1f
+	addl %ebx, 0x00(%edi)
+1:
+	addl $8, %edi
+	decl %ecx
+	jnz 2b
+
+.Lpagetable_done:
 	/* Enable pre-constructed page tables. */
-	mov $_pa(init_top_pgt), %eax
+	leal rva(pvh_init_top_pgt)(%ebp), %eax
 	mov %eax, %cr3
 	mov $(X86_CR0_PG | X86_CR0_PE), %eax
 	mov %eax, %cr0
 
 	/* Jump to 64-bit mode. */
-	ljmp $PVH_CS_SEL, $_pa(1f)
+	pushl $PVH_CS_SEL
+	leal  rva(1f)(%ebp), %eax
+	pushl %eax
+	lretl
 
 	/* 64-bit entry point. */
 	.code64
 1:
 	/* Set base address in stack canary descriptor. */
 	mov $MSR_GS_BASE,%ecx
-	mov $_pa(canary), %eax
+	leal rva(canary)(%ebp), %eax
 	xor %edx, %edx
 	wrmsr
 
+	/* Calculate load offset from LOAD_PHYSICAL_ADDR and store in
+	 * phys_base.  __pa() needs phys_base set to calculate the the
+	 * hypercall page in xen_pvh_init(). */
+	movq %rbp, %rbx
+	subq $LOAD_PHYSICAL_ADDR, %rbx
+	movq %rbx, phys_base(%rip)
 	call xen_prepare_pvh
+	/* Clear phys_base.  startup_64/__startup_64 will *add* to its value,
+	   so start from 0. */
+	xor  %rbx, %rbx
+	movq %rbx, phys_base(%rip)
 
 	/* startup_64 expects boot_params in %rsi. */
-	mov $_pa(pvh_bootparams), %rsi
-	mov $_pa(startup_64), %rax
+	lea rva(pvh_bootparams)(%ebp), %rsi
+	lea rva(startup_64)(%ebp), %rax
 	ANNOTATE_RETPOLINE_SAFE
 	jmp *%rax
 
@@ -113,20 +188,27 @@ SYM_CODE_START_LOCAL(pvh_start_xen)
 
 	call mk_early_pgtbl_32
 
-	mov $_pa(initial_page_table), %eax
+	leal rva(initial_page_table)(%ebp), %eax
 	mov %eax, %cr3
 
 	mov %cr0, %eax
 	or $(X86_CR0_PG | X86_CR0_PE), %eax
 	mov %eax, %cr0
 
-	ljmp $PVH_CS_SEL, $1f
+	pushl $PVH_CS_SEL
+	leal  rva(1f)(%ebp), %eax
+	pushl %eax
+	lretl
+
 1:
 	call xen_prepare_pvh
-	mov $_pa(pvh_bootparams), %esi
+	leal rva(pvh_bootparams)(%ebp), %esi
 
 	/* startup_32 doesn't expect paging and PAE to be on. */
-	ljmp $PVH_CS_SEL, $_pa(2f)
+	pushl $PVH_CS_SEL
+	leal  rva(2f)(%ebp), %eax
+	pushl %eax
+	lretl
 2:
 	mov %cr0, %eax
 	and $~X86_CR0_PG, %eax
@@ -135,15 +217,19 @@ SYM_CODE_START_LOCAL(pvh_start_xen)
 	and $~X86_CR4_PAE, %eax
 	mov %eax, %cr4
 
-	ljmp $PVH_CS_SEL, $_pa(startup_32)
+	pushl $PVH_CS_SEL
+	leal  rva(startup_32)(%ebp), %eax
+	pushl %eax
+	lretl
 #endif
+
 SYM_CODE_END(pvh_start_xen)
 
 	.section ".init.data","aw"
 	.balign 8
 SYM_DATA_START_LOCAL(gdt)
-	.word gdt_end - gdt_start
-	.long _pa(gdt_start)
+	.word gdt_end - gdt_start - 1
+	.long 0
 	.word 0
 SYM_DATA_END(gdt)
 SYM_DATA_START_LOCAL(gdt_start)
@@ -163,5 +249,82 @@ SYM_DATA_START_LOCAL(early_stack)
 	.fill BOOT_STACK_SIZE, 1, 0
 SYM_DATA_END_LABEL(early_stack, SYM_L_LOCAL, early_stack_end)
 
+/*
+ * We are not able to switch in one step to the final KERNEL ADDRESS SPACE
+ * because we need identity-mapped pages.
+ */
+#define l4_index(x)     (((x) >> 39) & 511)
+#define pud_index(x)    (((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
+
+L4_PAGE_OFFSET  = l4_index(__PAGE_OFFSET_BASE_L4)
+L4_START_KERNEL = l4_index(__START_KERNEL_map)
+L3_START_KERNEL = pud_index(__START_KERNEL_map)
+
+#define SYM_DATA_START_PAGE_ALIGNED(name)			\
+	SYM_START(name, SYM_L_GLOBAL, .balign PAGE_SIZE)
+
+/* Automate the creation of 1 to 1 mapping pmd entries */
+#define PMDS(START, PERM, COUNT)			\
+	i = 0 ;						\
+	.rept (COUNT) ;					\
+	.quad	(START) + (i << PMD_SHIFT) + (PERM) ;	\
+	i = i + 1 ;					\
+	.endr
+
+/*
+ * Xen PVH needs a set of identity mapped and kernel high mapping
+ * page tables.  pvh_start_xen starts running on the identity mapped
+ * page tables, but xen_prepare_pvh calls into the high mapping.
+ * These page tables are need to be relocatable and are only used until
+ * startup_64 transitions to init_top_pgt.
+ */
+SYM_DATA_START_PAGE_ALIGNED(pvh_init_top_pgt)
+	.quad   pvh_level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
+	.org    pvh_init_top_pgt + L4_PAGE_OFFSET*8, 0
+	.quad   pvh_level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
+	.org    pvh_init_top_pgt + L4_START_KERNEL*8, 0
+	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
+	.quad   pvh_level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
+SYM_DATA_END(pvh_init_top_pgt)
+
+SYM_DATA_START_PAGE_ALIGNED(pvh_level3_ident_pgt)
+	.quad	pvh_level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
+	.fill	511, 8, 0
+SYM_DATA_END(pvh_level3_ident_pgt)
+SYM_DATA_START_PAGE_ALIGNED(pvh_level2_ident_pgt)
+	/*
+	 * Since I easily can, map the first 1G.
+	 * Don't set NX because code runs from these pages.
+	 *
+	 * Note: This sets _PAGE_GLOBAL despite whether
+	 * the CPU supports it or it is enabled.  But,
+	 * the CPU should ignore the bit.
+	 */
+	PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
+SYM_DATA_END(pvh_level2_ident_pgt)
+SYM_DATA_START_PAGE_ALIGNED(pvh_level3_kernel_pgt)
+	.fill	L3_START_KERNEL,8,0
+	/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
+	.quad	pvh_level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
+	.quad	0 /* no fixmap */
+SYM_DATA_END(pvh_level3_kernel_pgt)
+
+SYM_DATA_START_PAGE_ALIGNED(pvh_level2_kernel_pgt)
+	/*
+	 * Kernel high mapping.
+	 *
+	 * The kernel code+data+bss must be located below KERNEL_IMAGE_SIZE in
+	 * virtual address space, which is 1 GiB if RANDOMIZE_BASE is enabled,
+	 * 512 MiB otherwise.
+	 *
+	 * (NOTE: after that starts the module area, see MODULES_VADDR.)
+	 *
+	 * This table is eventually used by the kernel during normal runtime.
+	 * Care must be taken to clear out undesired bits later, like _PAGE_RW
+	 * or _PAGE_GLOBAL in some cases.
+	 */
+	PMDS(0, __PAGE_KERNEL_LARGE_EXEC, KERNEL_IMAGE_SIZE/PMD_SIZE)
+SYM_DATA_END(pvh_level2_kernel_pgt)
+
 	ELFNOTE(Xen, XEN_ELFNOTE_PHYS32_ENTRY,
 	             _ASM_PTR (pvh_start_xen - __START_KERNEL_map))
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index a0ea285878db..e6994aaa8cc3 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -106,7 +106,12 @@ SYM_CODE_END(xen_cpu_bringup_again)
 # define FEATURES_PV 0
 #endif
 #ifdef CONFIG_XEN_PVH
-# define FEATURES_PVH (1 << XENFEAT_linux_rsdp_unrestricted)
+# ifdef CONFIG_RELOCATABLE
+#   define FEATURES_PVH (1 << XENFEAT_linux_rsdp_unrestricted) | \
+                        (1 << XENFEAT_pvh_relocatable)
+# else
+#   define FEATURES_PVH (1 << XENFEAT_linux_rsdp_unrestricted)
+# endif
 #else
 # define FEATURES_PVH 0
 #endif
diff --git a/include/xen/interface/features.h b/include/xen/interface/features.h
index 53f760378e39..92e7da9194c8 100644
--- a/include/xen/interface/features.h
+++ b/include/xen/interface/features.h
@@ -97,6 +97,11 @@
 #define XENFEAT_not_direct_mapped         16
 #define XENFEAT_direct_mapped             17
 
+/*
+ * PVH: If set, the guest supports being relocated in physical memory on entry.
+ */
+#define XENFEAT_pvh_relocatable           20
+
 #define XENFEAT_NR_SUBMAPS 1
 
 #endif /* __XEN_PUBLIC_FEATURES_H__ */
-- 
2.44.0
g


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] x86/pvh: Support relocating dom0 kernel
  2024-03-06 18:50 [PATCH 0/3] x86/pvh: Support relocating dom0 kernel Jason Andryuk
                   ` (3 preceding siblings ...)
  2024-03-06 19:31 ` [LINUX PATCH] RFC: x86/pvh: Make Xen PVH entrypoint PIC Jason Andryuk
@ 2024-03-07 10:00 ` Roger Pau Monné
  2024-03-07 10:08   ` Jan Beulich
  4 siblings, 1 reply; 21+ messages in thread
From: Roger Pau Monné @ 2024-03-07 10:00 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: xen-devel, Andrew Cooper, George Dunlap, Jan Beulich,
	Julien Grall, Stefano Stabellini, Wei Liu

On Wed, Mar 06, 2024 at 01:50:29PM -0500, Jason Andryuk wrote:
> Xen tries to load a PVH dom0 kernel at the fixed guest physical address
> from the elf headers.  For Linux, this defaults to 0x1000000 (16MB), but
> it can be configured.
> 
> Unfortunately there exist firmwares that have reserved regions at this
> address, so Xen fails to load the dom0 kernel since it's not RAM.
> 
> The other issue is that the Linux PVH entry point is not
> position-independent.  It expects to run at the compiled
> CONFIG_PHYSICAL_ADDRESS.
> 
> This patch set expands the PVH dom0 builder to try to relocate the
> kernel if needed and possible.  XENFEAT_pvh_relocatable is added for
> kernels to indicate they are relocatable.  However, we may want to
> switch to an additional ELF note with the kernel alignment.  Linux
> specifies a kernel alignment in the bzImage boot_params.setup_header,
> but that is not present the ELF vmlinux file.

I wonder whether we need a pair of notes, to signal the min/max
addresses the kernel supports being relocated to.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] x86/pvh: Support relocating dom0 kernel
  2024-03-07 10:00 ` [PATCH 0/3] x86/pvh: Support relocating dom0 kernel Roger Pau Monné
@ 2024-03-07 10:08   ` Jan Beulich
  2024-03-07 10:20     ` Roger Pau Monné
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2024-03-07 10:08 UTC (permalink / raw)
  To: Roger Pau Monné, Jason Andryuk
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu

On 07.03.2024 11:00, Roger Pau Monné wrote:
> On Wed, Mar 06, 2024 at 01:50:29PM -0500, Jason Andryuk wrote:
>> Xen tries to load a PVH dom0 kernel at the fixed guest physical address
>> from the elf headers.  For Linux, this defaults to 0x1000000 (16MB), but
>> it can be configured.
>>
>> Unfortunately there exist firmwares that have reserved regions at this
>> address, so Xen fails to load the dom0 kernel since it's not RAM.
>>
>> The other issue is that the Linux PVH entry point is not
>> position-independent.  It expects to run at the compiled
>> CONFIG_PHYSICAL_ADDRESS.
>>
>> This patch set expands the PVH dom0 builder to try to relocate the
>> kernel if needed and possible.  XENFEAT_pvh_relocatable is added for
>> kernels to indicate they are relocatable.  However, we may want to
>> switch to an additional ELF note with the kernel alignment.  Linux
>> specifies a kernel alignment in the bzImage boot_params.setup_header,
>> but that is not present the ELF vmlinux file.
> 
> I wonder whether we need a pair of notes, to signal the min/max
> addresses the kernel supports being relocated to.

Plus, as per your other reply, a 3rd one to specify alignment needs.

Then again - a single note can have multiple values. So perhaps it
doesn't need to be more than one new notes (except if dealing with
multi-value ones is deemed too complicated).

Jan


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] x86/pvh: Support relocating dom0 kernel
  2024-03-07 10:08   ` Jan Beulich
@ 2024-03-07 10:20     ` Roger Pau Monné
  2024-03-07 17:33       ` Jason Andryuk
  0 siblings, 1 reply; 21+ messages in thread
From: Roger Pau Monné @ 2024-03-07 10:20 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Jason Andryuk, xen-devel, Andrew Cooper, George Dunlap,
	Julien Grall, Stefano Stabellini, Wei Liu

On Thu, Mar 07, 2024 at 11:08:37AM +0100, Jan Beulich wrote:
> On 07.03.2024 11:00, Roger Pau Monné wrote:
> > On Wed, Mar 06, 2024 at 01:50:29PM -0500, Jason Andryuk wrote:
> >> Xen tries to load a PVH dom0 kernel at the fixed guest physical address
> >> from the elf headers.  For Linux, this defaults to 0x1000000 (16MB), but
> >> it can be configured.
> >>
> >> Unfortunately there exist firmwares that have reserved regions at this
> >> address, so Xen fails to load the dom0 kernel since it's not RAM.
> >>
> >> The other issue is that the Linux PVH entry point is not
> >> position-independent.  It expects to run at the compiled
> >> CONFIG_PHYSICAL_ADDRESS.
> >>
> >> This patch set expands the PVH dom0 builder to try to relocate the
> >> kernel if needed and possible.  XENFEAT_pvh_relocatable is added for
> >> kernels to indicate they are relocatable.  However, we may want to
> >> switch to an additional ELF note with the kernel alignment.  Linux
> >> specifies a kernel alignment in the bzImage boot_params.setup_header,
> >> but that is not present the ELF vmlinux file.
> > 
> > I wonder whether we need a pair of notes, to signal the min/max
> > addresses the kernel supports being relocated to.
> 
> Plus, as per your other reply, a 3rd one to specify alignment needs.

Alignment we could in theory get from the ELF program header, if OSes
fill those reliably.  FreeBSD seems to do so, haven't checked Linux.

> Then again - a single note can have multiple values. So perhaps it
> doesn't need to be more than one new notes (except if dealing with
> multi-value ones is deemed too complicated).

I've never dealt with a multi-value note, if that's not overly
complicated I would be fine with it, otherwise multiple notes are OK
IMO.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] x86/pvh: Support relocating dom0 kernel
  2024-03-07 10:20     ` Roger Pau Monné
@ 2024-03-07 17:33       ` Jason Andryuk
  2024-03-08  7:03         ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Jason Andryuk @ 2024-03-07 17:33 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu

On 2024-03-07 05:20, Roger Pau Monné wrote:
> On Thu, Mar 07, 2024 at 11:08:37AM +0100, Jan Beulich wrote:
>> On 07.03.2024 11:00, Roger Pau Monné wrote:
>>> On Wed, Mar 06, 2024 at 01:50:29PM -0500, Jason Andryuk wrote:
>>>> Xen tries to load a PVH dom0 kernel at the fixed guest physical address
>>>> from the elf headers.  For Linux, this defaults to 0x1000000 (16MB), but
>>>> it can be configured.
>>>>
>>>> Unfortunately there exist firmwares that have reserved regions at this
>>>> address, so Xen fails to load the dom0 kernel since it's not RAM.
>>>>
>>>> The other issue is that the Linux PVH entry point is not
>>>> position-independent.  It expects to run at the compiled
>>>> CONFIG_PHYSICAL_ADDRESS.
>>>>
>>>> This patch set expands the PVH dom0 builder to try to relocate the
>>>> kernel if needed and possible.  XENFEAT_pvh_relocatable is added for
>>>> kernels to indicate they are relocatable.  However, we may want to
>>>> switch to an additional ELF note with the kernel alignment.  Linux
>>>> specifies a kernel alignment in the bzImage boot_params.setup_header,
>>>> but that is not present the ELF vmlinux file.
>>>
>>> I wonder whether we need a pair of notes, to signal the min/max
>>> addresses the kernel supports being relocated to.
>>
>> Plus, as per your other reply, a 3rd one to specify alignment needs.
> 
> Alignment we could in theory get from the ELF program header, if OSes
> fill those reliably.  FreeBSD seems to do so, haven't checked Linux.

I will look into this more, but at first glance, I don't see a 
connection between Linux's CONFIG_PHYSICAL_ALIGN and the values in the 
elf headers.  As a quick test, I set CONFIG_PHYSICAL_ALIGN=0x800000, but 
the elf align values are still 0x200000.

The elf header values may be a suitable fallback though?

>> Then again - a single note can have multiple values. So perhaps it
>> doesn't need to be more than one new notes (except if dealing with
>> multi-value ones is deemed too complicated).
> 
> I've never dealt with a multi-value note, if that's not overly
> complicated I would be fine with it, otherwise multiple notes are OK
> IMO.

It looks like a single note can be followed by arbitrary data, so 
putting three values in should be fine.  PVH is defined with a 32bit 
entry point, so are 32bit min, max & align values acceptable, or should 
64bit be specified?  Linux uses _ASM_PTR for PHYS32_ENTRY so its size 
matches the bitness of the target (32 on 32/ 64 on 64).  I guess I'll 
follow that unless I hear a preference otherwise.

Regards,
Jason


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] x86/pvh: Support relocating dom0 kernel
  2024-03-07 17:33       ` Jason Andryuk
@ 2024-03-08  7:03         ` Jan Beulich
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Beulich @ 2024-03-08  7:03 UTC (permalink / raw)
  To: Jason Andryuk
  Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné

On 07.03.2024 18:33, Jason Andryuk wrote:
> On 2024-03-07 05:20, Roger Pau Monné wrote:
>> On Thu, Mar 07, 2024 at 11:08:37AM +0100, Jan Beulich wrote:
>>> On 07.03.2024 11:00, Roger Pau Monné wrote:
>>>> On Wed, Mar 06, 2024 at 01:50:29PM -0500, Jason Andryuk wrote:
>>>>> Xen tries to load a PVH dom0 kernel at the fixed guest physical address
>>>>> from the elf headers.  For Linux, this defaults to 0x1000000 (16MB), but
>>>>> it can be configured.
>>>>>
>>>>> Unfortunately there exist firmwares that have reserved regions at this
>>>>> address, so Xen fails to load the dom0 kernel since it's not RAM.
>>>>>
>>>>> The other issue is that the Linux PVH entry point is not
>>>>> position-independent.  It expects to run at the compiled
>>>>> CONFIG_PHYSICAL_ADDRESS.
>>>>>
>>>>> This patch set expands the PVH dom0 builder to try to relocate the
>>>>> kernel if needed and possible.  XENFEAT_pvh_relocatable is added for
>>>>> kernels to indicate they are relocatable.  However, we may want to
>>>>> switch to an additional ELF note with the kernel alignment.  Linux
>>>>> specifies a kernel alignment in the bzImage boot_params.setup_header,
>>>>> but that is not present the ELF vmlinux file.
>>>>
>>>> I wonder whether we need a pair of notes, to signal the min/max
>>>> addresses the kernel supports being relocated to.
>>>
>>> Plus, as per your other reply, a 3rd one to specify alignment needs.
>>
>> Alignment we could in theory get from the ELF program header, if OSes
>> fill those reliably.  FreeBSD seems to do so, haven't checked Linux.
> 
> I will look into this more, but at first glance, I don't see a 
> connection between Linux's CONFIG_PHYSICAL_ALIGN and the values in the 
> elf headers.  As a quick test, I set CONFIG_PHYSICAL_ALIGN=0x800000, but 
> the elf align values are still 0x200000.
> 
> The elf header values may be a suitable fallback though?

Imo, given the above, explicit values should be required. Better not
load a kernel than doing so and then getting hard to debug crashes.

Jan


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2024-03-11 19:57 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-03-06 18:50 [PATCH 0/3] x86/pvh: Support relocating dom0 kernel Jason Andryuk
2024-03-06 18:50 ` [PATCH 1/3] features.h: Replace hard tabs Jason Andryuk
2024-03-06 20:41   ` Stefano Stabellini
2024-03-06 18:50 ` [PATCH 2/3] xen/x86: bzImage parse kernel_alignment Jason Andryuk
2024-03-07  2:09   ` Stefano Stabellini
2024-03-07  8:26     ` Jan Beulich
2024-03-07 15:06       ` Jason Andryuk
2024-03-06 18:50 ` [PATCH 3/3] x86/PVH: Support relocatable dom0 kernels Jason Andryuk
2024-03-07  2:09   ` Stefano Stabellini
2024-03-07 16:07     ` Jason Andryuk
2024-03-07  9:30   ` Roger Pau Monné
2024-03-07 17:01     ` Jason Andryuk
2024-03-08  6:34       ` Juergen Gross
2024-03-11 16:53   ` Jan Beulich
2024-03-11 19:53     ` Jason Andryuk
2024-03-06 19:31 ` [LINUX PATCH] RFC: x86/pvh: Make Xen PVH entrypoint PIC Jason Andryuk
2024-03-07 10:00 ` [PATCH 0/3] x86/pvh: Support relocating dom0 kernel Roger Pau Monné
2024-03-07 10:08   ` Jan Beulich
2024-03-07 10:20     ` Roger Pau Monné
2024-03-07 17:33       ` Jason Andryuk
2024-03-08  7:03         ` Jan Beulich

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.