Issue with shared information page on Xen/ARM 4.17

All of lore.kernel.org
 help / color / mirror / Atom feed

* Issue with shared information page on Xen/ARM 4.17
@ 2023-09-29  2:49 Elliott Mitchell
  2023-10-03  8:26 ` Roger Pau Monné
  0 siblings, 1 reply; 24+ messages in thread
From: Elliott Mitchell @ 2023-09-29  2:49 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Volodymyr Babchuk

I'm trying to get FreeBSD/ARM operational on Xen/ARM.  Current issue is
the changes with the handling of the shared information page appear to
have broken things for me.

With a pre-4.17 build of Xen/ARM things worked fine.  Yet with a build
of the 4.17 release, mapping the shared information page doesn't work.

I'm using Tianocore as the first stage loader.  This continues to work
fine.  The build is using tag "edk2-stable202211", commit fff6d81270.
While Tianocore does map the shared information page, my reading of their
source is that it properly unmaps the page and therefore shouldn't cause
trouble.

Notes on the actual call is gpfn was 0x0000000000040072.  This is outside
the recommended address range, but my understanding is this is supposed
to be okay.

The return code is -16, which is EBUSY.

Ideas?

The obvious spot for EBUSY is from already having the shared information
page mapped.  Issue is I'm pretty sure it isn't mapped.

Did the policy change and now the shared information page *must* be
mapped within the recommended address ranges?

Is it a problem to map the shared information page at one address.  Unmap
it, then map it at another address?

-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-09-29  2:49 Issue with shared information page on Xen/ARM 4.17 Elliott Mitchell
@ 2023-10-03  8:26 ` Roger Pau Monné
  2023-10-03 19:18   ` Elliott Mitchell
  0 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2023-10-03  8:26 UTC (permalink / raw)
  To: Elliott Mitchell
  Cc: xen-devel, Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Volodymyr Babchuk, Oleksandr Tyshchenko

On Thu, Sep 28, 2023 at 07:49:18PM -0700, Elliott Mitchell wrote:
> I'm trying to get FreeBSD/ARM operational on Xen/ARM.  Current issue is
> the changes with the handling of the shared information page appear to
> have broken things for me.
> 
> With a pre-4.17 build of Xen/ARM things worked fine.  Yet with a build
> of the 4.17 release, mapping the shared information page doesn't work.

This is due to 71320946d5edf AFAICT.

> I'm using Tianocore as the first stage loader.  This continues to work
> fine.  The build is using tag "edk2-stable202211", commit fff6d81270.
> While Tianocore does map the shared information page, my reading of their
> source is that it properly unmaps the page and therefore shouldn't cause
> trouble.
> 
> Notes on the actual call is gpfn was 0x0000000000040072.  This is outside
> the recommended address range, but my understanding is this is supposed
> to be okay.
> 
> The return code is -16, which is EBUSY.
> 
> Ideas?

I think the issue is that you are mapping the shared info page over a
guest RAM page, and in order to do that you would fist need to create
a hole and then map the shared info page.  IOW: the issue is not with
edk2 not having unmapped the page, but with FreeBSD trying to map the
shared_info over a RAM page instead of a hole in the p2m.  x86
behavior is different here, and does allow mapping the shared_info
page over a RAM gfn (by first removing the backing RAM page on the
gfn).

Ideally we would like to use an unpopulated gfn, but doing so is not
trivial right now, as the point where the shared_info page is mapped
we don't yet have an easy way to account for unpopulated regions.

My suggestion would be to do something like:

diff --git a/sys/x86/xen/hvm.c b/sys/x86/xen/hvm.c
index 4122daeaf600..7251bc69ae15 100644
--- a/sys/x86/xen/hvm.c
+++ b/sys/x86/xen/hvm.c
@@ -194,18 +194,20 @@ xen_hvm_init_shared_info_page(void)
 {
 	struct xen_add_to_physmap xatp;
 
-	if (xen_pv_domain()) {
-		/*
-		 * Already setup in the PV case, shared_info is passed inside
-		 * of the start_info struct at start of day.
-		 */
-		return;
-	}
-
 	if (HYPERVISOR_shared_info == NULL) {
+		struct xen_remove_from_physmap rm = {
+			.domid = DOMID_SELF,
+		};
+		int rc;
+
 		HYPERVISOR_shared_info = malloc(PAGE_SIZE, M_XENHVM, M_NOWAIT);
 		if (HYPERVISOR_shared_info == NULL)
 			panic("Unable to allocate Xen shared info page");
+
+		rm.gpfn = vtophys(HYPERVISOR_shared_info) >> PAGE_SHIFT;
+		rc = HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &rm);
+		if (rc != 0)
+			printf("Failed to remove shared_info GFN: %d\n", rc);
 	}
 
 	xatp.domid = DOMID_SELF;

But in the long term we should see about initializing the shared_info
page as part of xenpv_attach().

Regards, Roger.


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-03  8:26 ` Roger Pau Monné
@ 2023-10-03 19:18   ` Elliott Mitchell
  2023-10-04  8:13     ` Roger Pau Monné
  0 siblings, 1 reply; 24+ messages in thread
From: Elliott Mitchell @ 2023-10-03 19:18 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Volodymyr Babchuk, Oleksandr Tyshchenko

On Tue, Oct 03, 2023 at 10:26:28AM +0200, Roger Pau Monné wrote:
> On Thu, Sep 28, 2023 at 07:49:18PM -0700, Elliott Mitchell wrote:
> > I'm trying to get FreeBSD/ARM operational on Xen/ARM.  Current issue is
> > the changes with the handling of the shared information page appear to
> > have broken things for me.
> > 
> > With a pre-4.17 build of Xen/ARM things worked fine.  Yet with a build
> > of the 4.17 release, mapping the shared information page doesn't work.
> 
> This is due to 71320946d5edf AFAICT.

Yes.  While the -EBUSY line may be the one triggering, I'm unsure why.
This seems a fairly reasonable change, so I had no intention of asking
for a revert (which likely would have been rejected).  There is also a
real possibility the -EBUSY comes from elsewhere.  Could also be
71320946d5edf caused a bug elsewhere to be exposed.

> > I'm using Tianocore as the first stage loader.  This continues to work
> > fine.  The build is using tag "edk2-stable202211", commit fff6d81270.
> > While Tianocore does map the shared information page, my reading of their
> > source is that it properly unmaps the page and therefore shouldn't cause
> > trouble.
> > 
> > Notes on the actual call is gpfn was 0x0000000000040072.  This is outside
> > the recommended address range, but my understanding is this is supposed
> > to be okay.
> > 
> > The return code is -16, which is EBUSY.
> > 
> > Ideas?
> 
> I think the issue is that you are mapping the shared info page over a
> guest RAM page, and in order to do that you would fist need to create
> a hole and then map the shared info page.  IOW: the issue is not with
> edk2 not having unmapped the page, but with FreeBSD trying to map the
> shared_info over a RAM page instead of a hole in the p2m.  x86
> behavior is different here, and does allow mapping the shared_info
> page over a RAM gfn (by first removing the backing RAM page on the
> gfn).

An interesting thought.  I thought I'd tried this, but since I didn't see
such in my experiments list.  What I had tried was removing all the pages
in the suggested mapping range.  Yet this failed.

Since this seemed reasonable, I've now tried and found it fails.  The
XENMEM_remove_from_physmap call returns 0.

> Ideally we would like to use an unpopulated gfn, but doing so is not
> trivial right now, as the point where the shared_info page is mapped
> we don't yet have an easy way to account for unpopulated regions.
> 
> My suggestion would be to do something like:
> 
> diff --git a/sys/x86/xen/hvm.c b/sys/x86/xen/hvm.c
> index 4122daeaf600..7251bc69ae15 100644
> --- a/sys/x86/xen/hvm.c
> +++ b/sys/x86/xen/hvm.c
> @@ -194,18 +194,20 @@ xen_hvm_init_shared_info_page(void)
>  {
>  	struct xen_add_to_physmap xatp;
>  
> -	if (xen_pv_domain()) {
> -		/*
> -		 * Already setup in the PV case, shared_info is passed inside
> -		 * of the start_info struct at start of day.
> -		 */
> -		return;
> -	}
> -
>  	if (HYPERVISOR_shared_info == NULL) {
> +		struct xen_remove_from_physmap rm = {
> +			.domid = DOMID_SELF,
> +		};
> +		int rc;
> +
>  		HYPERVISOR_shared_info = malloc(PAGE_SIZE, M_XENHVM, M_NOWAIT);
>  		if (HYPERVISOR_shared_info == NULL)
>  			panic("Unable to allocate Xen shared info page");
> +
> +		rm.gpfn = vtophys(HYPERVISOR_shared_info) >> PAGE_SHIFT;
> +		rc = HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &rm);
> +		if (rc != 0)
> +			printf("Failed to remove shared_info GFN: %d\n", rc);
>  	}
>  
>  	xatp.domid = DOMID_SELF;
> 
> But in the long term we should see about initializing the shared_info
> page as part of xenpv_attach().

Didn't even look at the example since I already had ones handy.
Unfortunately this has also failed.  The XENMEM_remove_from_physmap call
returns 0.

If there is a need I can hand off a build of Xen and Tianocore, then let
someone else with better Xen/ARM diagnostics play with it.  Runs about
75MB, someone could place it in a better sharing place if desired.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-03 19:18   ` Elliott Mitchell
@ 2023-10-04  8:13     ` Roger Pau Monné
  2023-10-04 10:55       ` Julien Grall
  0 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2023-10-04  8:13 UTC (permalink / raw)
  To: Elliott Mitchell
  Cc: xen-devel, Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Volodymyr Babchuk, Oleksandr Tyshchenko

On Tue, Oct 03, 2023 at 12:18:35PM -0700, Elliott Mitchell wrote:
> On Tue, Oct 03, 2023 at 10:26:28AM +0200, Roger Pau Monné wrote:
> > On Thu, Sep 28, 2023 at 07:49:18PM -0700, Elliott Mitchell wrote:
> > > I'm trying to get FreeBSD/ARM operational on Xen/ARM.  Current issue is
> > > the changes with the handling of the shared information page appear to
> > > have broken things for me.
> > > 
> > > With a pre-4.17 build of Xen/ARM things worked fine.  Yet with a build
> > > of the 4.17 release, mapping the shared information page doesn't work.
> > 
> > This is due to 71320946d5edf AFAICT.
> 
> Yes.  While the -EBUSY line may be the one triggering, I'm unsure why.
> This seems a fairly reasonable change, so I had no intention of asking
> for a revert (which likely would have been rejected).  There is also a
> real possibility the -EBUSY comes from elsewhere.  Could also be
> 71320946d5edf caused a bug elsewhere to be exposed.

A good way to know would be to attempt to revert 71320946d5edf and see
if that fixes your issue.

Alternatively you can try (or similar):

diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
index 6ccffeaea57d..105ef3faecfd 100644
--- a/xen/arch/arm/mm.c
+++ b/xen/arch/arm/mm.c
@@ -1424,6 +1424,8 @@ int xenmem_add_to_physmap_one(
                 page_set_xenheap_gfn(mfn_to_page(mfn), gfn);
         }
         else
+        {
+            printk("%u already mapped\n", space);
             /*
              * Mandate the caller to first unmap the page before mapping it
              * again. This is to prevent Xen creating an unwanted hole in
@@ -1432,6 +1434,7 @@ int xenmem_add_to_physmap_one(
              * to unmap it afterwards.
              */
             rc = -EBUSY;
+        }
         p2m_write_unlock(p2m);
     }
 

> > > I'm using Tianocore as the first stage loader.  This continues to work
> > > fine.  The build is using tag "edk2-stable202211", commit fff6d81270.
> > > While Tianocore does map the shared information page, my reading of their
> > > source is that it properly unmaps the page and therefore shouldn't cause
> > > trouble.
> > > 
> > > Notes on the actual call is gpfn was 0x0000000000040072.  This is outside
> > > the recommended address range, but my understanding is this is supposed
> > > to be okay.
> > > 
> > > The return code is -16, which is EBUSY.
> > > 
> > > Ideas?
> > 
> > I think the issue is that you are mapping the shared info page over a
> > guest RAM page, and in order to do that you would fist need to create
> > a hole and then map the shared info page.  IOW: the issue is not with
> > edk2 not having unmapped the page, but with FreeBSD trying to map the
> > shared_info over a RAM page instead of a hole in the p2m.  x86
> > behavior is different here, and does allow mapping the shared_info
> > page over a RAM gfn (by first removing the backing RAM page on the
> > gfn).
> 
> An interesting thought.  I thought I'd tried this, but since I didn't see
> such in my experiments list.  What I had tried was removing all the pages
> in the suggested mapping range.  Yet this failed.

Yeah, I went too fast and didn't read the code correctly, it is not
checking that the provided gfn is already populated, but whether the
mfn intended to be mapped is already mapped at a different location.

> Since this seemed reasonable, I've now tried and found it fails.  The
> XENMEM_remove_from_physmap call returns 0.

XENMEM_remove_from_physmap returning 0 is fine, but it seems to me
like edk2 hasn't unmapped the shared_info page.  The OS has no idea
at which position the shared_info page is currently mapped, and hence
can't do anything to attempt to unmap it in order to cover up for
buggy firmware.

edk2 should be the entity to issue the XENMEM_remove_from_physmap
against the gfn where it has the shared_info page mapped.  Likely
needs to be done as part of ExitBootServices() method.

FWIW, 71320946d5edf is an ABI change, and as desirable as such
behavior might be, a new hypercall should have introduced that had the
behavior that the change intended to retrofit into
XENMEM_add_to_physmap.

Thanks, Roger.


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-04  8:13     ` Roger Pau Monné
@ 2023-10-04 10:55       ` Julien Grall
  2023-10-04 11:42         ` Oleksandr Tyshchenko
  2023-10-04 12:53         ` Roger Pau Monné
  0 siblings, 2 replies; 24+ messages in thread
From: Julien Grall @ 2023-10-04 10:55 UTC (permalink / raw)
  To: Roger Pau Monné, Elliott Mitchell
  Cc: xen-devel, Stefano Stabellini, Bertrand Marquis,
	Volodymyr Babchuk, Oleksandr Tyshchenko

Hi Roger,

On 04/10/2023 09:13, Roger Pau Monné wrote:
> On Tue, Oct 03, 2023 at 12:18:35PM -0700, Elliott Mitchell wrote:
>> On Tue, Oct 03, 2023 at 10:26:28AM +0200, Roger Pau Monné wrote:
>>> On Thu, Sep 28, 2023 at 07:49:18PM -0700, Elliott Mitchell wrote:
>>>> I'm trying to get FreeBSD/ARM operational on Xen/ARM.  Current issue is
>>>> the changes with the handling of the shared information page appear to
>>>> have broken things for me.
>>>>
>>>> With a pre-4.17 build of Xen/ARM things worked fine.  Yet with a build
>>>> of the 4.17 release, mapping the shared information page doesn't work.
>>>
>>> This is due to 71320946d5edf AFAICT.
>>
>> Yes.  While the -EBUSY line may be the one triggering, I'm unsure why.
>> This seems a fairly reasonable change, so I had no intention of asking
>> for a revert (which likely would have been rejected).  There is also a
>> real possibility the -EBUSY comes from elsewhere.  Could also be
>> 71320946d5edf caused a bug elsewhere to be exposed.
> 
> A good way to know would be to attempt to revert 71320946d5edf and see
> if that fixes your issue.
> 
> Alternatively you can try (or similar):
> 
> diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
> index 6ccffeaea57d..105ef3faecfd 100644
> --- a/xen/arch/arm/mm.c
> +++ b/xen/arch/arm/mm.c
> @@ -1424,6 +1424,8 @@ int xenmem_add_to_physmap_one(
>                   page_set_xenheap_gfn(mfn_to_page(mfn), gfn);
>           }
>           else
> +        {
> +            printk("%u already mapped\n", space);
>               /*
>                * Mandate the caller to first unmap the page before mapping it
>                * again. This is to prevent Xen creating an unwanted hole in
> @@ -1432,6 +1434,7 @@ int xenmem_add_to_physmap_one(
>                * to unmap it afterwards.
>                */
>               rc = -EBUSY;
> +        }
>           p2m_write_unlock(p2m);
>       }
>   
> 
>>>> I'm using Tianocore as the first stage loader.  This continues to work
>>>> fine.  The build is using tag "edk2-stable202211", commit fff6d81270.
>>>> While Tianocore does map the shared information page, my reading of their
>>>> source is that it properly unmaps the page and therefore shouldn't cause
>>>> trouble.
>>>>
>>>> Notes on the actual call is gpfn was 0x0000000000040072.  This is outside
>>>> the recommended address range, but my understanding is this is supposed
>>>> to be okay.
>>>>
>>>> The return code is -16, which is EBUSY.
>>>>
>>>> Ideas?
>>>
>>> I think the issue is that you are mapping the shared info page over a
>>> guest RAM page, and in order to do that you would fist need to create
>>> a hole and then map the shared info page.  IOW: the issue is not with
>>> edk2 not having unmapped the page, but with FreeBSD trying to map the
>>> shared_info over a RAM page instead of a hole in the p2m.  x86
>>> behavior is different here, and does allow mapping the shared_info
>>> page over a RAM gfn (by first removing the backing RAM page on the
>>> gfn).
>>
>> An interesting thought.  I thought I'd tried this, but since I didn't see
>> such in my experiments list.  What I had tried was removing all the pages
>> in the suggested mapping range.  Yet this failed.
> 
> Yeah, I went too fast and didn't read the code correctly, it is not
> checking that the provided gfn is already populated, but whether the
> mfn intended to be mapped is already mapped at a different location.
> 
>> Since this seemed reasonable, I've now tried and found it fails.  The
>> XENMEM_remove_from_physmap call returns 0.
> 
> XENMEM_remove_from_physmap returning 0 is fine, but it seems to me
> like edk2 hasn't unmapped the shared_info page.  The OS has no idea
> at which position the shared_info page is currently mapped, and hence
> can't do anything to attempt to unmap it in order to cover up for
> buggy firmware.
> 
> edk2 should be the entity to issue the XENMEM_remove_from_physmap
> against the gfn where it has the shared_info page mapped.  Likely
> needs to be done as part of ExitBootServices() method.
> 
> FWIW, 71320946d5edf is an ABI change, and as desirable as such
> behavior might be, a new hypercall should have introduced that had the
> behavior that the change intended to retrofit into
> XENMEM_add_to_physmap.
I can see how you think this is an ABI change but the previous behavior 
was incorrect. Before this patch, on Arm, we would allow the shared page 
to be mapped twice. As we don't know where the firmware had mapped it 
this could result to random corruption.

Now, we could surely decide to remove the page as x86 did. But this 
could leave a hole in the RAM. As the OS would not know where the hole 
is, this could lead to page fault randomly during runtime.

Neither of the two behaviors help the users. In fact, I think they only 
make the experience worse because you don't know when the issue will happen.

AFAICT, there is no way for an HVM guestto know which GFN was inuse. So 
in all the cases, I can't think of a way for the OS to workaround 
properly buggy firmware. Therefore, returning -EBUSY is the safest we 
can do for our users and I don't view it as a ABI change (someone rely 
on the previous behavior is bound to failure).

For more details, you can look at the original thread ([1], [2]).

Cheers,

[1] ef5bf7df-ad8a-e420-0fc4-d8f0a0e0f2fc@xen.org
[2] dfd7657e-bbb9-3d96-2650-063561a00b9b@xen.org


-- 
Julien Grall


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-04 10:55       ` Julien Grall
@ 2023-10-04 11:42         ` Oleksandr Tyshchenko
  2023-10-04 12:59           ` Roger Pau Monné
  2023-10-04 12:53         ` Roger Pau Monné
  1 sibling, 1 reply; 24+ messages in thread
From: Oleksandr Tyshchenko @ 2023-10-04 11:42 UTC (permalink / raw)
  To: Julien Grall, Roger Pau Monné, Elliott Mitchell
  Cc: xen-devel@lists.xenproject.org, Stefano Stabellini,
	Bertrand Marquis, Volodymyr Babchuk



On 04.10.23 13:55, Julien Grall wrote:

Hello all.

> Hi Roger,
> 
> On 04/10/2023 09:13, Roger Pau Monné wrote:
>> On Tue, Oct 03, 2023 at 12:18:35PM -0700, Elliott Mitchell wrote:
>>> On Tue, Oct 03, 2023 at 10:26:28AM +0200, Roger Pau Monné wrote:
>>>> On Thu, Sep 28, 2023 at 07:49:18PM -0700, Elliott Mitchell wrote:
>>>>> I'm trying to get FreeBSD/ARM operational on Xen/ARM.  Current 
>>>>> issue is
>>>>> the changes with the handling of the shared information page appear to
>>>>> have broken things for me.
>>>>>
>>>>> With a pre-4.17 build of Xen/ARM things worked fine.  Yet with a build
>>>>> of the 4.17 release, mapping the shared information page doesn't work.
>>>>
>>>> This is due to 71320946d5edf AFAICT.
>>>
>>> Yes.  While the -EBUSY line may be the one triggering, I'm unsure why.
>>> This seems a fairly reasonable change, so I had no intention of asking
>>> for a revert (which likely would have been rejected).  There is also a
>>> real possibility the -EBUSY comes from elsewhere.  Could also be
>>> 71320946d5edf caused a bug elsewhere to be exposed.
>>
>> A good way to know would be to attempt to revert 71320946d5edf and see
>> if that fixes your issue.
>>
>> Alternatively you can try (or similar):
>>
>> diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
>> index 6ccffeaea57d..105ef3faecfd 100644
>> --- a/xen/arch/arm/mm.c
>> +++ b/xen/arch/arm/mm.c
>> @@ -1424,6 +1424,8 @@ int xenmem_add_to_physmap_one(
>>                   page_set_xenheap_gfn(mfn_to_page(mfn), gfn);
>>           }
>>           else
>> +        {
>> +            printk("%u already mapped\n", space);
>>               /*
>>                * Mandate the caller to first unmap the page before 
>> mapping it
>>                * again. This is to prevent Xen creating an unwanted 
>> hole in
>> @@ -1432,6 +1434,7 @@ int xenmem_add_to_physmap_one(
>>                * to unmap it afterwards.
>>                */
>>               rc = -EBUSY;
>> +        }
>>           p2m_write_unlock(p2m);
>>       }
>>
>>>>> I'm using Tianocore as the first stage loader.  This continues to work
>>>>> fine.  The build is using tag "edk2-stable202211", commit fff6d81270.
>>>>> While Tianocore does map the shared information page, my reading of 
>>>>> their
>>>>> source is that it properly unmaps the page and therefore shouldn't 
>>>>> cause
>>>>> trouble.
>>>>>
>>>>> Notes on the actual call is gpfn was 0x0000000000040072.  This is 
>>>>> outside
>>>>> the recommended address range, but my understanding is this is 
>>>>> supposed
>>>>> to be okay.
>>>>>
>>>>> The return code is -16, which is EBUSY.
>>>>>
>>>>> Ideas?
>>>>
>>>> I think the issue is that you are mapping the shared info page over a
>>>> guest RAM page, and in order to do that you would fist need to create
>>>> a hole and then map the shared info page.  IOW: the issue is not with
>>>> edk2 not having unmapped the page, but with FreeBSD trying to map the
>>>> shared_info over a RAM page instead of a hole in the p2m.  x86
>>>> behavior is different here, and does allow mapping the shared_info
>>>> page over a RAM gfn (by first removing the backing RAM page on the
>>>> gfn).
>>>
>>> An interesting thought.  I thought I'd tried this, but since I didn't 
>>> see
>>> such in my experiments list.  What I had tried was removing all the 
>>> pages
>>> in the suggested mapping range.  Yet this failed.
>>
>> Yeah, I went too fast and didn't read the code correctly, it is not
>> checking that the provided gfn is already populated, but whether the
>> mfn intended to be mapped is already mapped at a different location.
>>
>>> Since this seemed reasonable, I've now tried and found it fails.  The
>>> XENMEM_remove_from_physmap call returns 0.
>>
>> XENMEM_remove_from_physmap returning 0 is fine, but it seems to me
>> like edk2 hasn't unmapped the shared_info page.  The OS has no idea
>> at which position the shared_info page is currently mapped, and hence
>> can't do anything to attempt to unmap it in order to cover up for
>> buggy firmware.
>>
>> edk2 should be the entity to issue the XENMEM_remove_from_physmap
>> against the gfn where it has the shared_info page mapped.  Likely
>> needs to be done as part of ExitBootServices() method.
>>
>> FWIW, 71320946d5edf is an ABI change, and as desirable as such
>> behavior might be, a new hypercall should have introduced that had the
>> behavior that the change intended to retrofit into
>> XENMEM_add_to_physmap.
> I can see how you think this is an ABI change but the previous behavior 
> was incorrect. Before this patch, on Arm, we would allow the shared page 
> to be mapped twice. As we don't know where the firmware had mapped it 
> this could result to random corruption.
> 
> Now, we could surely decide to remove the page as x86 did. But this 
> could leave a hole in the RAM. As the OS would not know where the hole 
> is, this could lead to page fault randomly during runtime.


+1.

In addition to what Julien has already said, I would like to say the 
same issue was faced due to U-Boot (running as a part of Xen guest 
before OS) didn't perform a cleanup before jumping to OS. This is 
already fixed to follow the current behavior. I didn't find 
corresponding U-Boot mail thread, but can point to already upstreamed 
commit in the main repo.

https://github.com/u-boot/u-boot/commit/0001a964b840a62c66da42a89a10a2656831aa4b

[snip]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-04 10:55       ` Julien Grall
  2023-10-04 11:42         ` Oleksandr Tyshchenko
@ 2023-10-04 12:53         ` Roger Pau Monné
  2023-10-04 13:03           ` Julien Grall
  1 sibling, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2023-10-04 12:53 UTC (permalink / raw)
  To: Julien Grall
  Cc: Elliott Mitchell, xen-devel, Stefano Stabellini, Bertrand Marquis,
	Volodymyr Babchuk, Oleksandr Tyshchenko

On Wed, Oct 04, 2023 at 11:55:05AM +0100, Julien Grall wrote:
> Hi Roger,
> 
> On 04/10/2023 09:13, Roger Pau Monné wrote:
> > On Tue, Oct 03, 2023 at 12:18:35PM -0700, Elliott Mitchell wrote:
> > > On Tue, Oct 03, 2023 at 10:26:28AM +0200, Roger Pau Monné wrote:
> > > > On Thu, Sep 28, 2023 at 07:49:18PM -0700, Elliott Mitchell wrote:
> > > > > I'm trying to get FreeBSD/ARM operational on Xen/ARM.  Current issue is
> > > > > the changes with the handling of the shared information page appear to
> > > > > have broken things for me.
> > > > > 
> > > > > With a pre-4.17 build of Xen/ARM things worked fine.  Yet with a build
> > > > > of the 4.17 release, mapping the shared information page doesn't work.
> > > > 
> > > > This is due to 71320946d5edf AFAICT.
> > > 
> > > Yes.  While the -EBUSY line may be the one triggering, I'm unsure why.
> > > This seems a fairly reasonable change, so I had no intention of asking
> > > for a revert (which likely would have been rejected).  There is also a
> > > real possibility the -EBUSY comes from elsewhere.  Could also be
> > > 71320946d5edf caused a bug elsewhere to be exposed.
> > 
> > A good way to know would be to attempt to revert 71320946d5edf and see
> > if that fixes your issue.
> > 
> > Alternatively you can try (or similar):
> > 
> > diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
> > index 6ccffeaea57d..105ef3faecfd 100644
> > --- a/xen/arch/arm/mm.c
> > +++ b/xen/arch/arm/mm.c
> > @@ -1424,6 +1424,8 @@ int xenmem_add_to_physmap_one(
> >                   page_set_xenheap_gfn(mfn_to_page(mfn), gfn);
> >           }
> >           else
> > +        {
> > +            printk("%u already mapped\n", space);
> >               /*
> >                * Mandate the caller to first unmap the page before mapping it
> >                * again. This is to prevent Xen creating an unwanted hole in
> > @@ -1432,6 +1434,7 @@ int xenmem_add_to_physmap_one(
> >                * to unmap it afterwards.
> >                */
> >               rc = -EBUSY;
> > +        }
> >           p2m_write_unlock(p2m);
> >       }
> > 
> > > > > I'm using Tianocore as the first stage loader.  This continues to work
> > > > > fine.  The build is using tag "edk2-stable202211", commit fff6d81270.
> > > > > While Tianocore does map the shared information page, my reading of their
> > > > > source is that it properly unmaps the page and therefore shouldn't cause
> > > > > trouble.
> > > > > 
> > > > > Notes on the actual call is gpfn was 0x0000000000040072.  This is outside
> > > > > the recommended address range, but my understanding is this is supposed
> > > > > to be okay.
> > > > > 
> > > > > The return code is -16, which is EBUSY.
> > > > > 
> > > > > Ideas?
> > > > 
> > > > I think the issue is that you are mapping the shared info page over a
> > > > guest RAM page, and in order to do that you would fist need to create
> > > > a hole and then map the shared info page.  IOW: the issue is not with
> > > > edk2 not having unmapped the page, but with FreeBSD trying to map the
> > > > shared_info over a RAM page instead of a hole in the p2m.  x86
> > > > behavior is different here, and does allow mapping the shared_info
> > > > page over a RAM gfn (by first removing the backing RAM page on the
> > > > gfn).
> > > 
> > > An interesting thought.  I thought I'd tried this, but since I didn't see
> > > such in my experiments list.  What I had tried was removing all the pages
> > > in the suggested mapping range.  Yet this failed.
> > 
> > Yeah, I went too fast and didn't read the code correctly, it is not
> > checking that the provided gfn is already populated, but whether the
> > mfn intended to be mapped is already mapped at a different location.
> > 
> > > Since this seemed reasonable, I've now tried and found it fails.  The
> > > XENMEM_remove_from_physmap call returns 0.
> > 
> > XENMEM_remove_from_physmap returning 0 is fine, but it seems to me
> > like edk2 hasn't unmapped the shared_info page.  The OS has no idea
> > at which position the shared_info page is currently mapped, and hence
> > can't do anything to attempt to unmap it in order to cover up for
> > buggy firmware.
> > 
> > edk2 should be the entity to issue the XENMEM_remove_from_physmap
> > against the gfn where it has the shared_info page mapped.  Likely
> > needs to be done as part of ExitBootServices() method.
> > 
> > FWIW, 71320946d5edf is an ABI change, and as desirable as such
> > behavior might be, a new hypercall should have introduced that had the
> > behavior that the change intended to retrofit into
> > XENMEM_add_to_physmap.
> I can see how you think this is an ABI change but the previous behavior was
> incorrect. Before this patch, on Arm, we would allow the shared page to be
> mapped twice. As we don't know where the firmware had mapped it this could
> result to random corruption.
> 
> Now, we could surely decide to remove the page as x86 did. But this could
> leave a hole in the RAM. As the OS would not know where the hole is, this
> could lead to page fault randomly during runtime.

I would say it's the job of the firmware to notify the OS where the
hole is, by modifying the memory map handled to the OS.  Or else
mapping the shared_info page in an unpopulated p2m hole.

When using UEFI there's RAM that will always be in-use by the
firmware, as runtime services cannot be shut down, and hence the
firmware must already have a way to remove/reserve such region(s) on
the memory map.

> Neither of the two behaviors help the users. In fact, I think they only make
> the experience worse because you don't know when the issue will happen.
> 
> AFAICT, there is no way for an HVM guestto know which GFN was inuse. So in
> all the cases, I can't think of a way for the OS to workaround properly
> buggy firmware. Therefore, returning -EBUSY is the safest we can do for our
> users and I don't view it as a ABI change (someone rely on the previous
> behavior is bound to failure).

I fully agree the current behavior might not be the best one, but I do
consider this part of the ABI, specially as booting guests using edk2
has now stopped working after this change.

Introducing a different hypercall, or even using
XENMAPSPACE_shared_info with idx = 1 to signal the usage of the new
behavior should be used instead.

This would also allow unifying the behavior between x86 and Arm.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-04 11:42         ` Oleksandr Tyshchenko
@ 2023-10-04 12:59           ` Roger Pau Monné
  2023-10-04 17:18             ` Oleksandr Tyshchenko
  0 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2023-10-04 12:59 UTC (permalink / raw)
  To: Oleksandr Tyshchenko
  Cc: Julien Grall, Elliott Mitchell, xen-devel@lists.xenproject.org,
	Stefano Stabellini, Bertrand Marquis, Volodymyr Babchuk

On Wed, Oct 04, 2023 at 11:42:32AM +0000, Oleksandr Tyshchenko wrote:
> 
> 
> On 04.10.23 13:55, Julien Grall wrote:
> 
> Hello all.
> 
> > Hi Roger,
> > 
> > On 04/10/2023 09:13, Roger Pau Monné wrote:
> >> On Tue, Oct 03, 2023 at 12:18:35PM -0700, Elliott Mitchell wrote:
> >>> On Tue, Oct 03, 2023 at 10:26:28AM +0200, Roger Pau Monné wrote:
> >>>> On Thu, Sep 28, 2023 at 07:49:18PM -0700, Elliott Mitchell wrote:
> >>>>> I'm trying to get FreeBSD/ARM operational on Xen/ARM.  Current 
> >>>>> issue is
> >>>>> the changes with the handling of the shared information page appear to
> >>>>> have broken things for me.
> >>>>>
> >>>>> With a pre-4.17 build of Xen/ARM things worked fine.  Yet with a build
> >>>>> of the 4.17 release, mapping the shared information page doesn't work.
> >>>>
> >>>> This is due to 71320946d5edf AFAICT.
> >>>
> >>> Yes.  While the -EBUSY line may be the one triggering, I'm unsure why.
> >>> This seems a fairly reasonable change, so I had no intention of asking
> >>> for a revert (which likely would have been rejected).  There is also a
> >>> real possibility the -EBUSY comes from elsewhere.  Could also be
> >>> 71320946d5edf caused a bug elsewhere to be exposed.
> >>
> >> A good way to know would be to attempt to revert 71320946d5edf and see
> >> if that fixes your issue.
> >>
> >> Alternatively you can try (or similar):
> >>
> >> diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
> >> index 6ccffeaea57d..105ef3faecfd 100644
> >> --- a/xen/arch/arm/mm.c
> >> +++ b/xen/arch/arm/mm.c
> >> @@ -1424,6 +1424,8 @@ int xenmem_add_to_physmap_one(
> >>                   page_set_xenheap_gfn(mfn_to_page(mfn), gfn);
> >>           }
> >>           else
> >> +        {
> >> +            printk("%u already mapped\n", space);
> >>               /*
> >>                * Mandate the caller to first unmap the page before 
> >> mapping it
> >>                * again. This is to prevent Xen creating an unwanted 
> >> hole in
> >> @@ -1432,6 +1434,7 @@ int xenmem_add_to_physmap_one(
> >>                * to unmap it afterwards.
> >>                */
> >>               rc = -EBUSY;
> >> +        }
> >>           p2m_write_unlock(p2m);
> >>       }
> >>
> >>>>> I'm using Tianocore as the first stage loader.  This continues to work
> >>>>> fine.  The build is using tag "edk2-stable202211", commit fff6d81270.
> >>>>> While Tianocore does map the shared information page, my reading of 
> >>>>> their
> >>>>> source is that it properly unmaps the page and therefore shouldn't 
> >>>>> cause
> >>>>> trouble.
> >>>>>
> >>>>> Notes on the actual call is gpfn was 0x0000000000040072.  This is 
> >>>>> outside
> >>>>> the recommended address range, but my understanding is this is 
> >>>>> supposed
> >>>>> to be okay.
> >>>>>
> >>>>> The return code is -16, which is EBUSY.
> >>>>>
> >>>>> Ideas?
> >>>>
> >>>> I think the issue is that you are mapping the shared info page over a
> >>>> guest RAM page, and in order to do that you would fist need to create
> >>>> a hole and then map the shared info page.  IOW: the issue is not with
> >>>> edk2 not having unmapped the page, but with FreeBSD trying to map the
> >>>> shared_info over a RAM page instead of a hole in the p2m.  x86
> >>>> behavior is different here, and does allow mapping the shared_info
> >>>> page over a RAM gfn (by first removing the backing RAM page on the
> >>>> gfn).
> >>>
> >>> An interesting thought.  I thought I'd tried this, but since I didn't 
> >>> see
> >>> such in my experiments list.  What I had tried was removing all the 
> >>> pages
> >>> in the suggested mapping range.  Yet this failed.
> >>
> >> Yeah, I went too fast and didn't read the code correctly, it is not
> >> checking that the provided gfn is already populated, but whether the
> >> mfn intended to be mapped is already mapped at a different location.
> >>
> >>> Since this seemed reasonable, I've now tried and found it fails.  The
> >>> XENMEM_remove_from_physmap call returns 0.
> >>
> >> XENMEM_remove_from_physmap returning 0 is fine, but it seems to me
> >> like edk2 hasn't unmapped the shared_info page.  The OS has no idea
> >> at which position the shared_info page is currently mapped, and hence
> >> can't do anything to attempt to unmap it in order to cover up for
> >> buggy firmware.
> >>
> >> edk2 should be the entity to issue the XENMEM_remove_from_physmap
> >> against the gfn where it has the shared_info page mapped.  Likely
> >> needs to be done as part of ExitBootServices() method.
> >>
> >> FWIW, 71320946d5edf is an ABI change, and as desirable as such
> >> behavior might be, a new hypercall should have introduced that had the
> >> behavior that the change intended to retrofit into
> >> XENMEM_add_to_physmap.
> > I can see how you think this is an ABI change but the previous behavior 
> > was incorrect. Before this patch, on Arm, we would allow the shared page 
> > to be mapped twice. As we don't know where the firmware had mapped it 
> > this could result to random corruption.
> > 
> > Now, we could surely decide to remove the page as x86 did. But this 
> > could leave a hole in the RAM. As the OS would not know where the hole 
> > is, this could lead to page fault randomly during runtime.
> 
> 
> +1.
> 
> In addition to what Julien has already said, I would like to say the 
> same issue was faced due to U-Boot (running as a part of Xen guest 
> before OS) didn't perform a cleanup before jumping to OS. This is 
> already fixed to follow the current behavior. I didn't find 
> corresponding U-Boot mail thread, but can point to already upstreamed 
> commit in the main repo.

What about other bootloaders?

There might be other loaders that didn't unmap shared_info either, but
that removed the page where the shared_info is mapped from the
reported RAM ranges to the guest.  In such case not doing the
unmapping was benign, as the guest would never use that gfn as RAM
anyway.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-04 12:53         ` Roger Pau Monné
@ 2023-10-04 13:03           ` Julien Grall
  2023-10-04 13:39             ` Roger Pau Monné
  0 siblings, 1 reply; 24+ messages in thread
From: Julien Grall @ 2023-10-04 13:03 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Elliott Mitchell, xen-devel, Stefano Stabellini, Bertrand Marquis,
	Volodymyr Babchuk, Oleksandr Tyshchenko

Hi Roger,

On 04/10/2023 13:53, Roger Pau Monné wrote:
> On Wed, Oct 04, 2023 at 11:55:05AM +0100, Julien Grall wrote:
>> Hi Roger,
>>
>> On 04/10/2023 09:13, Roger Pau Monné wrote:
>>> On Tue, Oct 03, 2023 at 12:18:35PM -0700, Elliott Mitchell wrote:
>>>> On Tue, Oct 03, 2023 at 10:26:28AM +0200, Roger Pau Monné wrote:
>>>>> On Thu, Sep 28, 2023 at 07:49:18PM -0700, Elliott Mitchell wrote:
>>>>>> I'm trying to get FreeBSD/ARM operational on Xen/ARM.  Current issue is
>>>>>> the changes with the handling of the shared information page appear to
>>>>>> have broken things for me.
>>>>>>
>>>>>> With a pre-4.17 build of Xen/ARM things worked fine.  Yet with a build
>>>>>> of the 4.17 release, mapping the shared information page doesn't work.
>>>>>
>>>>> This is due to 71320946d5edf AFAICT.
>>>>
>>>> Yes.  While the -EBUSY line may be the one triggering, I'm unsure why.
>>>> This seems a fairly reasonable change, so I had no intention of asking
>>>> for a revert (which likely would have been rejected).  There is also a
>>>> real possibility the -EBUSY comes from elsewhere.  Could also be
>>>> 71320946d5edf caused a bug elsewhere to be exposed.
>>>
>>> A good way to know would be to attempt to revert 71320946d5edf and see
>>> if that fixes your issue.
>>>
>>> Alternatively you can try (or similar):
>>>
>>> diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
>>> index 6ccffeaea57d..105ef3faecfd 100644
>>> --- a/xen/arch/arm/mm.c
>>> +++ b/xen/arch/arm/mm.c
>>> @@ -1424,6 +1424,8 @@ int xenmem_add_to_physmap_one(
>>>                    page_set_xenheap_gfn(mfn_to_page(mfn), gfn);
>>>            }
>>>            else
>>> +        {
>>> +            printk("%u already mapped\n", space);
>>>                /*
>>>                 * Mandate the caller to first unmap the page before mapping it
>>>                 * again. This is to prevent Xen creating an unwanted hole in
>>> @@ -1432,6 +1434,7 @@ int xenmem_add_to_physmap_one(
>>>                 * to unmap it afterwards.
>>>                 */
>>>                rc = -EBUSY;
>>> +        }
>>>            p2m_write_unlock(p2m);
>>>        }
>>>
>>>>>> I'm using Tianocore as the first stage loader.  This continues to work
>>>>>> fine.  The build is using tag "edk2-stable202211", commit fff6d81270.
>>>>>> While Tianocore does map the shared information page, my reading of their
>>>>>> source is that it properly unmaps the page and therefore shouldn't cause
>>>>>> trouble.
>>>>>>
>>>>>> Notes on the actual call is gpfn was 0x0000000000040072.  This is outside
>>>>>> the recommended address range, but my understanding is this is supposed
>>>>>> to be okay.
>>>>>>
>>>>>> The return code is -16, which is EBUSY.
>>>>>>
>>>>>> Ideas?
>>>>>
>>>>> I think the issue is that you are mapping the shared info page over a
>>>>> guest RAM page, and in order to do that you would fist need to create
>>>>> a hole and then map the shared info page.  IOW: the issue is not with
>>>>> edk2 not having unmapped the page, but with FreeBSD trying to map the
>>>>> shared_info over a RAM page instead of a hole in the p2m.  x86
>>>>> behavior is different here, and does allow mapping the shared_info
>>>>> page over a RAM gfn (by first removing the backing RAM page on the
>>>>> gfn).
>>>>
>>>> An interesting thought.  I thought I'd tried this, but since I didn't see
>>>> such in my experiments list.  What I had tried was removing all the pages
>>>> in the suggested mapping range.  Yet this failed.
>>>
>>> Yeah, I went too fast and didn't read the code correctly, it is not
>>> checking that the provided gfn is already populated, but whether the
>>> mfn intended to be mapped is already mapped at a different location.
>>>
>>>> Since this seemed reasonable, I've now tried and found it fails.  The
>>>> XENMEM_remove_from_physmap call returns 0.
>>>
>>> XENMEM_remove_from_physmap returning 0 is fine, but it seems to me
>>> like edk2 hasn't unmapped the shared_info page.  The OS has no idea
>>> at which position the shared_info page is currently mapped, and hence
>>> can't do anything to attempt to unmap it in order to cover up for
>>> buggy firmware.
>>>
>>> edk2 should be the entity to issue the XENMEM_remove_from_physmap
>>> against the gfn where it has the shared_info page mapped.  Likely
>>> needs to be done as part of ExitBootServices() method.
>>>
>>> FWIW, 71320946d5edf is an ABI change, and as desirable as such
>>> behavior might be, a new hypercall should have introduced that had the
>>> behavior that the change intended to retrofit into
>>> XENMEM_add_to_physmap.
>> I can see how you think this is an ABI change but the previous behavior was
>> incorrect. Before this patch, on Arm, we would allow the shared page to be
>> mapped twice. As we don't know where the firmware had mapped it this could
>> result to random corruption.
>>
>> Now, we could surely decide to remove the page as x86 did. But this could
>> leave a hole in the RAM. As the OS would not know where the hole is, this
>> could lead to page fault randomly during runtime.
> 
> I would say it's the job of the firmware to notify the OS where the
> hole is, by modifying the memory map handled to the OS.  Or else
> mapping the shared_info page in an unpopulated p2m hole.

I agree but I am not convinced that they are all doing it. At least 
U-boot didn't do it before we fixed it.

> 
> When using UEFI there's RAM that will always be in-use by the
> firmware, as runtime services cannot be shut down, and hence the
> firmware must already have a way to remove/reserve such region(s) on
> the memory map.

Can either you or Elliott confirm if EDK2 reserve the region?

> 
>> Neither of the two behaviors help the users. In fact, I think they only make
>> the experience worse because you don't know when the issue will happen.
>>
>> AFAICT, there is no way for an HVM guestto know which GFN was inuse. So in
>> all the cases, I can't think of a way for the OS to workaround properly
>> buggy firmware. Therefore, returning -EBUSY is the safest we can do for our
>> users and I don't view it as a ABI change (someone rely on the previous
>> behavior is bound to failure).
> 
> I fully agree the current behavior might not be the best one, but I do
> consider this part of the ABI, specially as booting guests using edk2
> has now stopped working after this change.

Right. If we remove the check, you may be able to boot a guest. But are 
we sure that such guest would run safely?

Also, it is not really argument, but this is not the only broken part in 
EDK2 for Xen Arm guests. The other one I know is EDS makes assumption 
how some Device-Tree nodes and this will break on newer Xen.

So overall, it feels to me that EDK2 is not entirely ready to be used in 
production for Xen on Arm guests.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-04 13:03           ` Julien Grall
@ 2023-10-04 13:39             ` Roger Pau Monné
  2023-10-04 14:06               ` Julien Grall
  0 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2023-10-04 13:39 UTC (permalink / raw)
  To: Julien Grall
  Cc: Elliott Mitchell, xen-devel, Stefano Stabellini, Bertrand Marquis,
	Volodymyr Babchuk, Oleksandr Tyshchenko

On Wed, Oct 04, 2023 at 02:03:43PM +0100, Julien Grall wrote:
> Hi Roger,
> 
> On 04/10/2023 13:53, Roger Pau Monné wrote:
> > On Wed, Oct 04, 2023 at 11:55:05AM +0100, Julien Grall wrote:
> > > Hi Roger,
> > > 
> > > On 04/10/2023 09:13, Roger Pau Monné wrote:
> > > > On Tue, Oct 03, 2023 at 12:18:35PM -0700, Elliott Mitchell wrote:
> > > > > On Tue, Oct 03, 2023 at 10:26:28AM +0200, Roger Pau Monné wrote:
> > > > > > On Thu, Sep 28, 2023 at 07:49:18PM -0700, Elliott Mitchell wrote:
> > > > > > > I'm trying to get FreeBSD/ARM operational on Xen/ARM.  Current issue is
> > > > > > > the changes with the handling of the shared information page appear to
> > > > > > > have broken things for me.
> > > > > > > 
> > > > > > > With a pre-4.17 build of Xen/ARM things worked fine.  Yet with a build
> > > > > > > of the 4.17 release, mapping the shared information page doesn't work.
> > > > > > 
> > > > > > This is due to 71320946d5edf AFAICT.
> > > > > 
> > > > > Yes.  While the -EBUSY line may be the one triggering, I'm unsure why.
> > > > > This seems a fairly reasonable change, so I had no intention of asking
> > > > > for a revert (which likely would have been rejected).  There is also a
> > > > > real possibility the -EBUSY comes from elsewhere.  Could also be
> > > > > 71320946d5edf caused a bug elsewhere to be exposed.
> > > > 
> > > > A good way to know would be to attempt to revert 71320946d5edf and see
> > > > if that fixes your issue.
> > > > 
> > > > Alternatively you can try (or similar):
> > > > 
> > > > diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
> > > > index 6ccffeaea57d..105ef3faecfd 100644
> > > > --- a/xen/arch/arm/mm.c
> > > > +++ b/xen/arch/arm/mm.c
> > > > @@ -1424,6 +1424,8 @@ int xenmem_add_to_physmap_one(
> > > >                    page_set_xenheap_gfn(mfn_to_page(mfn), gfn);
> > > >            }
> > > >            else
> > > > +        {
> > > > +            printk("%u already mapped\n", space);
> > > >                /*
> > > >                 * Mandate the caller to first unmap the page before mapping it
> > > >                 * again. This is to prevent Xen creating an unwanted hole in
> > > > @@ -1432,6 +1434,7 @@ int xenmem_add_to_physmap_one(
> > > >                 * to unmap it afterwards.
> > > >                 */
> > > >                rc = -EBUSY;
> > > > +        }
> > > >            p2m_write_unlock(p2m);
> > > >        }
> > > > 
> > > > > > > I'm using Tianocore as the first stage loader.  This continues to work
> > > > > > > fine.  The build is using tag "edk2-stable202211", commit fff6d81270.
> > > > > > > While Tianocore does map the shared information page, my reading of their
> > > > > > > source is that it properly unmaps the page and therefore shouldn't cause
> > > > > > > trouble.
> > > > > > > 
> > > > > > > Notes on the actual call is gpfn was 0x0000000000040072.  This is outside
> > > > > > > the recommended address range, but my understanding is this is supposed
> > > > > > > to be okay.
> > > > > > > 
> > > > > > > The return code is -16, which is EBUSY.
> > > > > > > 
> > > > > > > Ideas?
> > > > > > 
> > > > > > I think the issue is that you are mapping the shared info page over a
> > > > > > guest RAM page, and in order to do that you would fist need to create
> > > > > > a hole and then map the shared info page.  IOW: the issue is not with
> > > > > > edk2 not having unmapped the page, but with FreeBSD trying to map the
> > > > > > shared_info over a RAM page instead of a hole in the p2m.  x86
> > > > > > behavior is different here, and does allow mapping the shared_info
> > > > > > page over a RAM gfn (by first removing the backing RAM page on the
> > > > > > gfn).
> > > > > 
> > > > > An interesting thought.  I thought I'd tried this, but since I didn't see
> > > > > such in my experiments list.  What I had tried was removing all the pages
> > > > > in the suggested mapping range.  Yet this failed.
> > > > 
> > > > Yeah, I went too fast and didn't read the code correctly, it is not
> > > > checking that the provided gfn is already populated, but whether the
> > > > mfn intended to be mapped is already mapped at a different location.
> > > > 
> > > > > Since this seemed reasonable, I've now tried and found it fails.  The
> > > > > XENMEM_remove_from_physmap call returns 0.
> > > > 
> > > > XENMEM_remove_from_physmap returning 0 is fine, but it seems to me
> > > > like edk2 hasn't unmapped the shared_info page.  The OS has no idea
> > > > at which position the shared_info page is currently mapped, and hence
> > > > can't do anything to attempt to unmap it in order to cover up for
> > > > buggy firmware.
> > > > 
> > > > edk2 should be the entity to issue the XENMEM_remove_from_physmap
> > > > against the gfn where it has the shared_info page mapped.  Likely
> > > > needs to be done as part of ExitBootServices() method.
> > > > 
> > > > FWIW, 71320946d5edf is an ABI change, and as desirable as such
> > > > behavior might be, a new hypercall should have introduced that had the
> > > > behavior that the change intended to retrofit into
> > > > XENMEM_add_to_physmap.
> > > I can see how you think this is an ABI change but the previous behavior was
> > > incorrect. Before this patch, on Arm, we would allow the shared page to be
> > > mapped twice. As we don't know where the firmware had mapped it this could
> > > result to random corruption.
> > > 
> > > Now, we could surely decide to remove the page as x86 did. But this could
> > > leave a hole in the RAM. As the OS would not know where the hole is, this
> > > could lead to page fault randomly during runtime.
> > 
> > I would say it's the job of the firmware to notify the OS where the
> > hole is, by modifying the memory map handled to the OS.  Or else
> > mapping the shared_info page in an unpopulated p2m hole.
> 
> I agree but I am not convinced that they are all doing it. At least U-boot
> didn't do it before we fixed it.
> 
> > 
> > When using UEFI there's RAM that will always be in-use by the
> > firmware, as runtime services cannot be shut down, and hence the
> > firmware must already have a way to remove/reserve such region(s) on
> > the memory map.
> 
> Can either you or Elliott confirm if EDK2 reserve the region?

I will defer to Elliott to check for arm.  I would be quite surprised
if it doesn't on x86, or else we would get a myriad of bug reports
about guests randomly crashing when using edk2.

> > 
> > > Neither of the two behaviors help the users. In fact, I think they only make
> > > the experience worse because you don't know when the issue will happen.
> > > 
> > > AFAICT, there is no way for an HVM guestto know which GFN was inuse. So in
> > > all the cases, I can't think of a way for the OS to workaround properly
> > > buggy firmware. Therefore, returning -EBUSY is the safest we can do for our
> > > users and I don't view it as a ABI change (someone rely on the previous
> > > behavior is bound to failure).
> > 
> > I fully agree the current behavior might not be the best one, but I do
> > consider this part of the ABI, specially as booting guests using edk2
> > has now stopped working after this change.
> 
> Right. If we remove the check, you may be able to boot a guest. But are we
> sure that such guest would run safely?

If the guest wants the hypervisor to enforce such behavior, let it
use the new hypercall to explicitly request the shared_info page to
not be mapped anywhere else.

But if you don't trust the bootloader, how do you know it hasn't poked
holes elsewhere in the RAM regions?

Even if the shared_info page has been unmapped, can you be sure the
bootloader has put a RAM page back in that gfn?

I understand this ABI change is done to avoid bugs, but at the cost of
diverging from x86, and breaking existing firmwares which might not be
buggy.

> Also, it is not really argument, but this is not the only broken part in
> EDK2 for Xen Arm guests. The other one I know is EDS makes assumption how
> some Device-Tree nodes and this will break on newer Xen.
> 
> So overall, it feels to me that EDK2 is not entirely ready to be used in
> production for Xen on Arm guests.

I really have no insight on this.  What are the supported way of booting
guests on Arm?  (SUPPORT.md doesn't seem to list any firmware for Arm
guests AFAICT).

Thanks, Roger.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-04 13:39             ` Roger Pau Monné
@ 2023-10-04 14:06               ` Julien Grall
  2023-10-04 14:53                 ` Roger Pau Monné
  0 siblings, 1 reply; 24+ messages in thread
From: Julien Grall @ 2023-10-04 14:06 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Elliott Mitchell, xen-devel, Stefano Stabellini, Bertrand Marquis,
	Volodymyr Babchuk, Oleksandr Tyshchenko

Hi Roger,

On 04/10/2023 14:39, Roger Pau Monné wrote:
> On Wed, Oct 04, 2023 at 02:03:43PM +0100, Julien Grall wrote:
>> Hi Roger,
>>
>> On 04/10/2023 13:53, Roger Pau Monné wrote:
>>> On Wed, Oct 04, 2023 at 11:55:05AM +0100, Julien Grall wrote:
>>>> Hi Roger,
>>>>
>>>> On 04/10/2023 09:13, Roger Pau Monné wrote:
>>>>> On Tue, Oct 03, 2023 at 12:18:35PM -0700, Elliott Mitchell wrote:
>>>>>> On Tue, Oct 03, 2023 at 10:26:28AM +0200, Roger Pau Monné wrote:
>>>>>>> On Thu, Sep 28, 2023 at 07:49:18PM -0700, Elliott Mitchell wrote:
>>>>>>>> I'm trying to get FreeBSD/ARM operational on Xen/ARM.  Current issue is
>>>>>>>> the changes with the handling of the shared information page appear to
>>>>>>>> have broken things for me.
>>>>>>>>
>>>>>>>> With a pre-4.17 build of Xen/ARM things worked fine.  Yet with a build
>>>>>>>> of the 4.17 release, mapping the shared information page doesn't work.
>>>>>>>
>>>>>>> This is due to 71320946d5edf AFAICT.
>>>>>>
>>>>>> Yes.  While the -EBUSY line may be the one triggering, I'm unsure why.
>>>>>> This seems a fairly reasonable change, so I had no intention of asking
>>>>>> for a revert (which likely would have been rejected).  There is also a
>>>>>> real possibility the -EBUSY comes from elsewhere.  Could also be
>>>>>> 71320946d5edf caused a bug elsewhere to be exposed.
>>>>>
>>>>> A good way to know would be to attempt to revert 71320946d5edf and see
>>>>> if that fixes your issue.
>>>>>
>>>>> Alternatively you can try (or similar):
>>>>>
>>>>> diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
>>>>> index 6ccffeaea57d..105ef3faecfd 100644
>>>>> --- a/xen/arch/arm/mm.c
>>>>> +++ b/xen/arch/arm/mm.c
>>>>> @@ -1424,6 +1424,8 @@ int xenmem_add_to_physmap_one(
>>>>>                     page_set_xenheap_gfn(mfn_to_page(mfn), gfn);
>>>>>             }
>>>>>             else
>>>>> +        {
>>>>> +            printk("%u already mapped\n", space);
>>>>>                 /*
>>>>>                  * Mandate the caller to first unmap the page before mapping it
>>>>>                  * again. This is to prevent Xen creating an unwanted hole in
>>>>> @@ -1432,6 +1434,7 @@ int xenmem_add_to_physmap_one(
>>>>>                  * to unmap it afterwards.
>>>>>                  */
>>>>>                 rc = -EBUSY;
>>>>> +        }
>>>>>             p2m_write_unlock(p2m);
>>>>>         }
>>>>>
>>>>>>>> I'm using Tianocore as the first stage loader.  This continues to work
>>>>>>>> fine.  The build is using tag "edk2-stable202211", commit fff6d81270.
>>>>>>>> While Tianocore does map the shared information page, my reading of their
>>>>>>>> source is that it properly unmaps the page and therefore shouldn't cause
>>>>>>>> trouble.
>>>>>>>>
>>>>>>>> Notes on the actual call is gpfn was 0x0000000000040072.  This is outside
>>>>>>>> the recommended address range, but my understanding is this is supposed
>>>>>>>> to be okay.
>>>>>>>>
>>>>>>>> The return code is -16, which is EBUSY.
>>>>>>>>
>>>>>>>> Ideas?
>>>>>>>
>>>>>>> I think the issue is that you are mapping the shared info page over a
>>>>>>> guest RAM page, and in order to do that you would fist need to create
>>>>>>> a hole and then map the shared info page.  IOW: the issue is not with
>>>>>>> edk2 not having unmapped the page, but with FreeBSD trying to map the
>>>>>>> shared_info over a RAM page instead of a hole in the p2m.  x86
>>>>>>> behavior is different here, and does allow mapping the shared_info
>>>>>>> page over a RAM gfn (by first removing the backing RAM page on the
>>>>>>> gfn).
>>>>>>
>>>>>> An interesting thought.  I thought I'd tried this, but since I didn't see
>>>>>> such in my experiments list.  What I had tried was removing all the pages
>>>>>> in the suggested mapping range.  Yet this failed.
>>>>>
>>>>> Yeah, I went too fast and didn't read the code correctly, it is not
>>>>> checking that the provided gfn is already populated, but whether the
>>>>> mfn intended to be mapped is already mapped at a different location.
>>>>>
>>>>>> Since this seemed reasonable, I've now tried and found it fails.  The
>>>>>> XENMEM_remove_from_physmap call returns 0.
>>>>>
>>>>> XENMEM_remove_from_physmap returning 0 is fine, but it seems to me
>>>>> like edk2 hasn't unmapped the shared_info page.  The OS has no idea
>>>>> at which position the shared_info page is currently mapped, and hence
>>>>> can't do anything to attempt to unmap it in order to cover up for
>>>>> buggy firmware.
>>>>>
>>>>> edk2 should be the entity to issue the XENMEM_remove_from_physmap
>>>>> against the gfn where it has the shared_info page mapped.  Likely
>>>>> needs to be done as part of ExitBootServices() method.
>>>>>
>>>>> FWIW, 71320946d5edf is an ABI change, and as desirable as such
>>>>> behavior might be, a new hypercall should have introduced that had the
>>>>> behavior that the change intended to retrofit into
>>>>> XENMEM_add_to_physmap.
>>>> I can see how you think this is an ABI change but the previous behavior was
>>>> incorrect. Before this patch, on Arm, we would allow the shared page to be
>>>> mapped twice. As we don't know where the firmware had mapped it this could
>>>> result to random corruption.
>>>>
>>>> Now, we could surely decide to remove the page as x86 did. But this could
>>>> leave a hole in the RAM. As the OS would not know where the hole is, this
>>>> could lead to page fault randomly during runtime.
>>>
>>> I would say it's the job of the firmware to notify the OS where the
>>> hole is, by modifying the memory map handled to the OS.  Or else
>>> mapping the shared_info page in an unpopulated p2m hole.
>>
>> I agree but I am not convinced that they are all doing it. At least U-boot
>> didn't do it before we fixed it.
>>
>>>
>>> When using UEFI there's RAM that will always be in-use by the
>>> firmware, as runtime services cannot be shut down, and hence the
>>> firmware must already have a way to remove/reserve such region(s) on
>>> the memory map.
>>
>> Can either you or Elliott confirm if EDK2 reserve the region?
> 
> I will defer to Elliott to check for arm.  I would be quite surprised
> if it doesn't on x86, or else we would get a myriad of bug reports
> about guests randomly crashing when using edk2.
> 
>>>
>>>> Neither of the two behaviors help the users. In fact, I think they only make
>>>> the experience worse because you don't know when the issue will happen.
>>>>
>>>> AFAICT, there is no way for an HVM guestto know which GFN was inuse. So in
>>>> all the cases, I can't think of a way for the OS to workaround properly
>>>> buggy firmware. Therefore, returning -EBUSY is the safest we can do for our
>>>> users and I don't view it as a ABI change (someone rely on the previous
>>>> behavior is bound to failure).
>>>
>>> I fully agree the current behavior might not be the best one, but I do
>>> consider this part of the ABI, specially as booting guests using edk2
>>> has now stopped working after this change.
>>
>> Right. If we remove the check, you may be able to boot a guest. But are we
>> sure that such guest would run safely?
> 
> If the guest wants the hypervisor to enforce such behavior, let it
> use the new hypercall to explicitly request the shared_info page to
> not be mapped anywhere else.

TBH, I am not convinced the new hypercall is going to help. Let say we 
decide to modify FreeBSD/Linux to use it, The old EDK2 firmware would 
still be buggy and this would prevent boot. So we are back the same 
problem...

We could also say we don't support older firmware. But that's not very 
different from leaving the hypercall as-is and fix EDK2

> 
> But if you don't trust the bootloader, how do you know it hasn't poked
> holes elsewhere in the RAM regions?

We don't know. But how do you know the bootloader will not want to 
continue using the vCPU shared page?

For instance, the public headers doesn't seem to mention that the page 
can only mapped once and it would unmap the previous area. In fact, for 
Arm, until that commit shared page could be mapped N times... So 
technically even if we remove the page, the commit already made an ABI 
change. Yes it is now more inline with x86 but this doesn't this is 
still an ABI change. I would be surprised if you say we should not have 
done that because (in particular if you have XSA-379 in mind).

> 
> Even if the shared_info page has been unmapped, can you be sure the
> bootloader has put a RAM page back in that gfn?

We can't. But the same goes with the bootloader reserving the region...

> 
> I understand this ABI change is done to avoid bugs, but at the cost of
> diverging from x86, and breaking existing firmwares which might not be
> buggy.
As I pointed out above, the exact behavior of the hypercall is not fully 
documented and the behavior has changed with some XSAs. So this is no 
surprise if Arm and x86 behaved differently (even before that commit).

There are plenty of behavior I considered wrong in the way x86 update 
the P2M and I would be concerned if we don't give any leeway for the 
architectures to tighten the update. BTW some checks have evolved over 
the time during security event (XSA-378 for example).

This is not very different here. For Arm we decided to not follow a 
behavior that I consider incorrect and potentially more harmful than 
trying to support bootloader not removing the shared page.

If we want to handle such firmware, I think it would be better if we 
provide an hypercall that would return the GFN where it is currently mapped.

> 
>> Also, it is not really argument, but this is not the only broken part in
>> EDK2 for Xen Arm guests. The other one I know is EDS makes assumption how
>> some Device-Tree nodes and this will break on newer Xen.
>>
>> So overall, it feels to me that EDK2 is not entirely ready to be used in
>> production for Xen on Arm guests.
> 
> I really have no insight on this.  What are the supported way of booting
> guests on Arm?  (SUPPORT.md doesn't seem to list any firmware for Arm
> guests AFAICT).

Some bootloaders (e.g. U-boot/EDK2) have support to be used as a fimware 
for Xen on Arm guests. But they are not supported officially.

Most of the setup seems to specify the kernel directly in the XL 
configuration. We probably ought to add support for EDK2/U-boot.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-04 14:06               ` Julien Grall
@ 2023-10-04 14:53                 ` Roger Pau Monné
  2023-10-04 17:47                   ` Julien Grall
  0 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2023-10-04 14:53 UTC (permalink / raw)
  To: Julien Grall
  Cc: Elliott Mitchell, xen-devel, Stefano Stabellini, Bertrand Marquis,
	Volodymyr Babchuk, Oleksandr Tyshchenko

On Wed, Oct 04, 2023 at 03:06:14PM +0100, Julien Grall wrote:
> Hi Roger,
> 
> On 04/10/2023 14:39, Roger Pau Monné wrote:
> > On Wed, Oct 04, 2023 at 02:03:43PM +0100, Julien Grall wrote:
> > > Hi Roger,
> > > 
> > > On 04/10/2023 13:53, Roger Pau Monné wrote:
> > > > On Wed, Oct 04, 2023 at 11:55:05AM +0100, Julien Grall wrote:
> > > > > Hi Roger,
> > > > > 
> > > > > On 04/10/2023 09:13, Roger Pau Monné wrote:
> > > > > > On Tue, Oct 03, 2023 at 12:18:35PM -0700, Elliott Mitchell wrote:
> > > > > > > On Tue, Oct 03, 2023 at 10:26:28AM +0200, Roger Pau Monné wrote:
> > > > > > > > On Thu, Sep 28, 2023 at 07:49:18PM -0700, Elliott Mitchell wrote:
> > > > > > > > > I'm trying to get FreeBSD/ARM operational on Xen/ARM.  Current issue is
> > > > > > > > > the changes with the handling of the shared information page appear to
> > > > > > > > > have broken things for me.
> > > > > > > > > 
> > > > > > > > > With a pre-4.17 build of Xen/ARM things worked fine.  Yet with a build
> > > > > > > > > of the 4.17 release, mapping the shared information page doesn't work.
> > > > > > > > 
> > > > > > > > This is due to 71320946d5edf AFAICT.
> > > > > > > 
> > > > > > > Yes.  While the -EBUSY line may be the one triggering, I'm unsure why.
> > > > > > > This seems a fairly reasonable change, so I had no intention of asking
> > > > > > > for a revert (which likely would have been rejected).  There is also a
> > > > > > > real possibility the -EBUSY comes from elsewhere.  Could also be
> > > > > > > 71320946d5edf caused a bug elsewhere to be exposed.
> > > > > > 
> > > > > > A good way to know would be to attempt to revert 71320946d5edf and see
> > > > > > if that fixes your issue.
> > > > > > 
> > > > > > Alternatively you can try (or similar):
> > > > > > 
> > > > > > diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
> > > > > > index 6ccffeaea57d..105ef3faecfd 100644
> > > > > > --- a/xen/arch/arm/mm.c
> > > > > > +++ b/xen/arch/arm/mm.c
> > > > > > @@ -1424,6 +1424,8 @@ int xenmem_add_to_physmap_one(
> > > > > >                     page_set_xenheap_gfn(mfn_to_page(mfn), gfn);
> > > > > >             }
> > > > > >             else
> > > > > > +        {
> > > > > > +            printk("%u already mapped\n", space);
> > > > > >                 /*
> > > > > >                  * Mandate the caller to first unmap the page before mapping it
> > > > > >                  * again. This is to prevent Xen creating an unwanted hole in
> > > > > > @@ -1432,6 +1434,7 @@ int xenmem_add_to_physmap_one(
> > > > > >                  * to unmap it afterwards.
> > > > > >                  */
> > > > > >                 rc = -EBUSY;
> > > > > > +        }
> > > > > >             p2m_write_unlock(p2m);
> > > > > >         }
> > > > > > 
> > > > > > > > > I'm using Tianocore as the first stage loader.  This continues to work
> > > > > > > > > fine.  The build is using tag "edk2-stable202211", commit fff6d81270.
> > > > > > > > > While Tianocore does map the shared information page, my reading of their
> > > > > > > > > source is that it properly unmaps the page and therefore shouldn't cause
> > > > > > > > > trouble.
> > > > > > > > > 
> > > > > > > > > Notes on the actual call is gpfn was 0x0000000000040072.  This is outside
> > > > > > > > > the recommended address range, but my understanding is this is supposed
> > > > > > > > > to be okay.
> > > > > > > > > 
> > > > > > > > > The return code is -16, which is EBUSY.
> > > > > > > > > 
> > > > > > > > > Ideas?
> > > > > > > > 
> > > > > > > > I think the issue is that you are mapping the shared info page over a
> > > > > > > > guest RAM page, and in order to do that you would fist need to create
> > > > > > > > a hole and then map the shared info page.  IOW: the issue is not with
> > > > > > > > edk2 not having unmapped the page, but with FreeBSD trying to map the
> > > > > > > > shared_info over a RAM page instead of a hole in the p2m.  x86
> > > > > > > > behavior is different here, and does allow mapping the shared_info
> > > > > > > > page over a RAM gfn (by first removing the backing RAM page on the
> > > > > > > > gfn).
> > > > > > > 
> > > > > > > An interesting thought.  I thought I'd tried this, but since I didn't see
> > > > > > > such in my experiments list.  What I had tried was removing all the pages
> > > > > > > in the suggested mapping range.  Yet this failed.
> > > > > > 
> > > > > > Yeah, I went too fast and didn't read the code correctly, it is not
> > > > > > checking that the provided gfn is already populated, but whether the
> > > > > > mfn intended to be mapped is already mapped at a different location.
> > > > > > 
> > > > > > > Since this seemed reasonable, I've now tried and found it fails.  The
> > > > > > > XENMEM_remove_from_physmap call returns 0.
> > > > > > 
> > > > > > XENMEM_remove_from_physmap returning 0 is fine, but it seems to me
> > > > > > like edk2 hasn't unmapped the shared_info page.  The OS has no idea
> > > > > > at which position the shared_info page is currently mapped, and hence
> > > > > > can't do anything to attempt to unmap it in order to cover up for
> > > > > > buggy firmware.
> > > > > > 
> > > > > > edk2 should be the entity to issue the XENMEM_remove_from_physmap
> > > > > > against the gfn where it has the shared_info page mapped.  Likely
> > > > > > needs to be done as part of ExitBootServices() method.
> > > > > > 
> > > > > > FWIW, 71320946d5edf is an ABI change, and as desirable as such
> > > > > > behavior might be, a new hypercall should have introduced that had the
> > > > > > behavior that the change intended to retrofit into
> > > > > > XENMEM_add_to_physmap.
> > > > > I can see how you think this is an ABI change but the previous behavior was
> > > > > incorrect. Before this patch, on Arm, we would allow the shared page to be
> > > > > mapped twice. As we don't know where the firmware had mapped it this could
> > > > > result to random corruption.
> > > > > 
> > > > > Now, we could surely decide to remove the page as x86 did. But this could
> > > > > leave a hole in the RAM. As the OS would not know where the hole is, this
> > > > > could lead to page fault randomly during runtime.
> > > > 
> > > > I would say it's the job of the firmware to notify the OS where the
> > > > hole is, by modifying the memory map handled to the OS.  Or else
> > > > mapping the shared_info page in an unpopulated p2m hole.
> > > 
> > > I agree but I am not convinced that they are all doing it. At least U-boot
> > > didn't do it before we fixed it.
> > > 
> > > > 
> > > > When using UEFI there's RAM that will always be in-use by the
> > > > firmware, as runtime services cannot be shut down, and hence the
> > > > firmware must already have a way to remove/reserve such region(s) on
> > > > the memory map.
> > > 
> > > Can either you or Elliott confirm if EDK2 reserve the region?
> > 
> > I will defer to Elliott to check for arm.  I would be quite surprised
> > if it doesn't on x86, or else we would get a myriad of bug reports
> > about guests randomly crashing when using edk2.
> > 
> > > > 
> > > > > Neither of the two behaviors help the users. In fact, I think they only make
> > > > > the experience worse because you don't know when the issue will happen.
> > > > > 
> > > > > AFAICT, there is no way for an HVM guestto know which GFN was inuse. So in
> > > > > all the cases, I can't think of a way for the OS to workaround properly
> > > > > buggy firmware. Therefore, returning -EBUSY is the safest we can do for our
> > > > > users and I don't view it as a ABI change (someone rely on the previous
> > > > > behavior is bound to failure).
> > > > 
> > > > I fully agree the current behavior might not be the best one, but I do
> > > > consider this part of the ABI, specially as booting guests using edk2
> > > > has now stopped working after this change.
> > > 
> > > Right. If we remove the check, you may be able to boot a guest. But are we
> > > sure that such guest would run safely?
> > 
> > If the guest wants the hypervisor to enforce such behavior, let it
> > use the new hypercall to explicitly request the shared_info page to
> > not be mapped anywhere else.
> 
> TBH, I am not convinced the new hypercall is going to help. Let say we
> decide to modify FreeBSD/Linux to use it, The old EDK2 firmware would still
> be buggy and this would prevent boot. So we are back the same problem...
> 
> We could also say we don't support older firmware. But that's not very
> different from leaving the hypercall as-is and fix EDK2

We could at least print a warning message that the firmware still had
the shared_info page mapped, and that the system might not work as
expected.

> > 
> > But if you don't trust the bootloader, how do you know it hasn't poked
> > holes elsewhere in the RAM regions?
> 
> We don't know. But how do you know the bootloader will not want to continue
> using the vCPU shared page?

I don't think it's feasible for two entities to concurrently use the
shared_info page.

> For instance, the public headers doesn't seem to mention that the page can
> only mapped once and it would unmap the previous area. In fact, for Arm,
> until that commit shared page could be mapped N times... So technically even
> if we remove the page, the commit already made an ABI change. Yes it is now
> more inline with x86 but this doesn't this is still an ABI change. I would
> be surprised if you say we should not have done that because (in particular
> if you have XSA-379 in mind).

So we agree at least that there's an ABI change :).

It's different from 379 because the shared_info page is never freed
for the lifetime of the domain, hence there's no risk of leak in this
specific case.  I can see how preventing multiple mappings can be a
safeguard for possible issues similar to 379.

Isn't it possible to map a grant at multiple gfns however?

> > 
> > Even if the shared_info page has been unmapped, can you be sure the
> > bootloader has put a RAM page back in that gfn?
> 
> We can't. But the same goes with the bootloader reserving the region...
> 
> > 
> > I understand this ABI change is done to avoid bugs, but at the cost of
> > diverging from x86, and breaking existing firmwares which might not be
> > buggy.
> As I pointed out above, the exact behavior of the hypercall is not fully
> documented and the behavior has changed with some XSAs. So this is no
> surprise if Arm and x86 behaved differently (even before that commit).
> 
> There are plenty of behavior I considered wrong in the way x86 update the
> P2M and I would be concerned if we don't give any leeway for the
> architectures to tighten the update. BTW some checks have evolved over the
> time during security event (XSA-378 for example).

ABI changes due to security issues are unavoidable.

> This is not very different here. For Arm we decided to not follow a behavior
> that I consider incorrect and potentially more harmful than trying to
> support bootloader not removing the shared page.

I think this is not very friendly to users, specially if edk2 wasn't
checked.  I understand the situation is different on Arm vs x86, so if
edk2 is not supported on arm I guess it doesn't matter much whether
it's broken.  It would be a much worse issue on x86 where edk2 is
supported.

> If we want to handle such firmware, I think it would be better if we provide
> an hypercall that would return the GFN where it is currently mapped.

Sure, but such hypercall would be racy, as by the time the gfn is
returned the value could be stale.  Adding a replacing and non
replacing XENMEM_add_to_physmap variations would IMO be better.

Anyway, I don't maintain this, so it's up to you.

> > 
> > > Also, it is not really argument, but this is not the only broken part in
> > > EDK2 for Xen Arm guests. The other one I know is EDS makes assumption how
> > > some Device-Tree nodes and this will break on newer Xen.
> > > 
> > > So overall, it feels to me that EDK2 is not entirely ready to be used in
> > > production for Xen on Arm guests.
> > 
> > I really have no insight on this.  What are the supported way of booting
> > guests on Arm?  (SUPPORT.md doesn't seem to list any firmware for Arm
> > guests AFAICT).
> 
> Some bootloaders (e.g. U-boot/EDK2) have support to be used as a fimware for
> Xen on Arm guests. But they are not supported officially.
> 
> Most of the setup seems to specify the kernel directly in the XL
> configuration. We probably ought to add support for EDK2/U-boot.

I had no idea about that, I do think some kind of firmware is required
or else OSes different than Linux can't be supported unless they
implement the Linux entry point.

Is the entry point / CPU state for arm guests documented somewhere?

I wonder whether Elliot could use that for FreeBSD until the situation
with edk2 is stable.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-04 12:59           ` Roger Pau Monné
@ 2023-10-04 17:18             ` Oleksandr Tyshchenko
  0 siblings, 0 replies; 24+ messages in thread
From: Oleksandr Tyshchenko @ 2023-10-04 17:18 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Julien Grall, Elliott Mitchell, xen-devel@lists.xenproject.org,
	Stefano Stabellini, Bertrand Marquis, Volodymyr Babchuk



On 04.10.23 15:59, Roger Pau Monné wrote:

Hello Roger



> On Wed, Oct 04, 2023 at 11:42:32AM +0000, Oleksandr Tyshchenko wrote:
>>
>>
>> On 04.10.23 13:55, Julien Grall wrote:
>>
>> Hello all.
>>
>>> Hi Roger,
>>>
>>> On 04/10/2023 09:13, Roger Pau Monné wrote:
>>>> On Tue, Oct 03, 2023 at 12:18:35PM -0700, Elliott Mitchell wrote:
>>>>> On Tue, Oct 03, 2023 at 10:26:28AM +0200, Roger Pau Monné wrote:
>>>>>> On Thu, Sep 28, 2023 at 07:49:18PM -0700, Elliott Mitchell wrote:
>>>>>>> I'm trying to get FreeBSD/ARM operational on Xen/ARM.  Current
>>>>>>> issue is
>>>>>>> the changes with the handling of the shared information page appear to
>>>>>>> have broken things for me.
>>>>>>>
>>>>>>> With a pre-4.17 build of Xen/ARM things worked fine.  Yet with a build
>>>>>>> of the 4.17 release, mapping the shared information page doesn't work.
>>>>>>
>>>>>> This is due to 71320946d5edf AFAICT.
>>>>>
>>>>> Yes.  While the -EBUSY line may be the one triggering, I'm unsure why.
>>>>> This seems a fairly reasonable change, so I had no intention of asking
>>>>> for a revert (which likely would have been rejected).  There is also a
>>>>> real possibility the -EBUSY comes from elsewhere.  Could also be
>>>>> 71320946d5edf caused a bug elsewhere to be exposed.
>>>>
>>>> A good way to know would be to attempt to revert 71320946d5edf and see
>>>> if that fixes your issue.
>>>>
>>>> Alternatively you can try (or similar):
>>>>
>>>> diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
>>>> index 6ccffeaea57d..105ef3faecfd 100644
>>>> --- a/xen/arch/arm/mm.c
>>>> +++ b/xen/arch/arm/mm.c
>>>> @@ -1424,6 +1424,8 @@ int xenmem_add_to_physmap_one(
>>>>                    page_set_xenheap_gfn(mfn_to_page(mfn), gfn);
>>>>            }
>>>>            else
>>>> +        {
>>>> +            printk("%u already mapped\n", space);
>>>>                /*
>>>>                 * Mandate the caller to first unmap the page before
>>>> mapping it
>>>>                 * again. This is to prevent Xen creating an unwanted
>>>> hole in
>>>> @@ -1432,6 +1434,7 @@ int xenmem_add_to_physmap_one(
>>>>                 * to unmap it afterwards.
>>>>                 */
>>>>                rc = -EBUSY;
>>>> +        }
>>>>            p2m_write_unlock(p2m);
>>>>        }
>>>>
>>>>>>> I'm using Tianocore as the first stage loader.  This continues to work
>>>>>>> fine.  The build is using tag "edk2-stable202211", commit fff6d81270.
>>>>>>> While Tianocore does map the shared information page, my reading of
>>>>>>> their
>>>>>>> source is that it properly unmaps the page and therefore shouldn't
>>>>>>> cause
>>>>>>> trouble.
>>>>>>>
>>>>>>> Notes on the actual call is gpfn was 0x0000000000040072.  This is
>>>>>>> outside
>>>>>>> the recommended address range, but my understanding is this is
>>>>>>> supposed
>>>>>>> to be okay.
>>>>>>>
>>>>>>> The return code is -16, which is EBUSY.
>>>>>>>
>>>>>>> Ideas?
>>>>>>
>>>>>> I think the issue is that you are mapping the shared info page over a
>>>>>> guest RAM page, and in order to do that you would fist need to create
>>>>>> a hole and then map the shared info page.  IOW: the issue is not with
>>>>>> edk2 not having unmapped the page, but with FreeBSD trying to map the
>>>>>> shared_info over a RAM page instead of a hole in the p2m.  x86
>>>>>> behavior is different here, and does allow mapping the shared_info
>>>>>> page over a RAM gfn (by first removing the backing RAM page on the
>>>>>> gfn).
>>>>>
>>>>> An interesting thought.  I thought I'd tried this, but since I didn't
>>>>> see
>>>>> such in my experiments list.  What I had tried was removing all the
>>>>> pages
>>>>> in the suggested mapping range.  Yet this failed.
>>>>
>>>> Yeah, I went too fast and didn't read the code correctly, it is not
>>>> checking that the provided gfn is already populated, but whether the
>>>> mfn intended to be mapped is already mapped at a different location.
>>>>
>>>>> Since this seemed reasonable, I've now tried and found it fails.  The
>>>>> XENMEM_remove_from_physmap call returns 0.
>>>>
>>>> XENMEM_remove_from_physmap returning 0 is fine, but it seems to me
>>>> like edk2 hasn't unmapped the shared_info page.  The OS has no idea
>>>> at which position the shared_info page is currently mapped, and hence
>>>> can't do anything to attempt to unmap it in order to cover up for
>>>> buggy firmware.
>>>>
>>>> edk2 should be the entity to issue the XENMEM_remove_from_physmap
>>>> against the gfn where it has the shared_info page mapped.  Likely
>>>> needs to be done as part of ExitBootServices() method.
>>>>
>>>> FWIW, 71320946d5edf is an ABI change, and as desirable as such
>>>> behavior might be, a new hypercall should have introduced that had the
>>>> behavior that the change intended to retrofit into
>>>> XENMEM_add_to_physmap.
>>> I can see how you think this is an ABI change but the previous behavior
>>> was incorrect. Before this patch, on Arm, we would allow the shared page
>>> to be mapped twice. As we don't know where the firmware had mapped it
>>> this could result to random corruption.
>>>
>>> Now, we could surely decide to remove the page as x86 did. But this
>>> could leave a hole in the RAM. As the OS would not know where the hole
>>> is, this could lead to page fault randomly during runtime.
>>
>>
>> +1.
>>
>> In addition to what Julien has already said, I would like to say the
>> same issue was faced due to U-Boot (running as a part of Xen guest
>> before OS) didn't perform a cleanup before jumping to OS. This is
>> already fixed to follow the current behavior. I didn't find
>> corresponding U-Boot mail thread, but can point to already upstreamed
>> commit in the main repo.
> 
> What about other bootloaders?
> 
> There might be other loaders that didn't unmap shared_info either, but
> that removed the page where the shared_info is mapped from the
> reported RAM ranges to the guest.  In such case not doing the
> unmapping was benign, as the guest would never use that gfn as RAM
> anyway.

I got your point, this way, I agree, not doing the
unmapping wouldn't be a big problem. I am wondering whether such 
bootloaders that instead of unmapping indeed hiding/reserving the gfn 
(where shared_info is mapped to) are really present in the world (Arm)? 
Patch was merged more than 1 year ago, and the issue hasn't been noticed 
until now with EDK2. Let's wait for the addition input on how the EDK2 
is really dealing with shared_info.


> 
> Thanks, Roger.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-04 14:53                 ` Roger Pau Monné
@ 2023-10-04 17:47                   ` Julien Grall
  2023-10-04 21:13                     ` Elliott Mitchell
  2023-10-04 22:52                     ` Stefano Stabellini
  0 siblings, 2 replies; 24+ messages in thread
From: Julien Grall @ 2023-10-04 17:47 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Elliott Mitchell, xen-devel, Stefano Stabellini, Bertrand Marquis,
	Volodymyr Babchuk, Oleksandr Tyshchenko

Hi,

On 04/10/2023 15:53, Roger Pau Monné wrote:
> On Wed, Oct 04, 2023 at 03:06:14PM +0100, Julien Grall wrote:
>>>>>> Neither of the two behaviors help the users. In fact, I think they only make
>>>>>> the experience worse because you don't know when the issue will happen.
>>>>>>
>>>>>> AFAICT, there is no way for an HVM guestto know which GFN was inuse. So in
>>>>>> all the cases, I can't think of a way for the OS to workaround properly
>>>>>> buggy firmware. Therefore, returning -EBUSY is the safest we can do for our
>>>>>> users and I don't view it as a ABI change (someone rely on the previous
>>>>>> behavior is bound to failure).
>>>>>
>>>>> I fully agree the current behavior might not be the best one, but I do
>>>>> consider this part of the ABI, specially as booting guests using edk2
>>>>> has now stopped working after this change.
>>>>
>>>> Right. If we remove the check, you may be able to boot a guest. But are we
>>>> sure that such guest would run safely?
>>>
>>> If the guest wants the hypervisor to enforce such behavior, let it
>>> use the new hypercall to explicitly request the shared_info page to
>>> not be mapped anywhere else.
>>
>> TBH, I am not convinced the new hypercall is going to help. Let say we
>> decide to modify FreeBSD/Linux to use it, The old EDK2 firmware would still
>> be buggy and this would prevent boot. So we are back the same problem...
>>
>> We could also say we don't support older firmware. But that's not very
>> different from leaving the hypercall as-is and fix EDK2
> 
> We could at least print a warning message that the firmware still had
> the shared_info page mapped, and that the system might not work as
> expected.

Are you suggesting to also fallback on the older hypercall?

> 
>>>
>>> But if you don't trust the bootloader, how do you know it hasn't poked
>>> holes elsewhere in the RAM regions?
>>
>> We don't know. But how do you know the bootloader will not want to continue
>> using the vCPU shared page?
> 
> I don't think it's feasible for two entities to concurrently use the
> shared_info page.
> 
>> For instance, the public headers doesn't seem to mention that the page can
>> only mapped once and it would unmap the previous area. In fact, for Arm,
>> until that commit shared page could be mapped N times... So technically even
>> if we remove the page, the commit already made an ABI change. Yes it is now
>> more inline with x86 but this doesn't this is still an ABI change. I would
>> be surprised if you say we should not have done that because (in particular
>> if you have XSA-379 in mind).
> 
> So we agree at least that there's an ABI change :).
> 
> It's different from 379 because the shared_info page is never freed
> for the lifetime of the domain, hence there's no risk of leak in this
> specific case.  I can see how preventing multiple mappings can be a
> safeguard for possible issues similar to 379.
> 
> Isn't it possible to map a grant at multiple gfns however?

There can be multiple mapping for the grant frame. But there can be 
*only* one mapping for the grant *table* frame.

> 
>>>
>>> Even if the shared_info page has been unmapped, can you be sure the
>>> bootloader has put a RAM page back in that gfn?
>>
>> We can't. But the same goes with the bootloader reserving the region...
>>
>>>
>>> I understand this ABI change is done to avoid bugs, but at the cost of
>>> diverging from x86, and breaking existing firmwares which might not be
>>> buggy.
>> As I pointed out above, the exact behavior of the hypercall is not fully
>> documented and the behavior has changed with some XSAs. So this is no
>> surprise if Arm and x86 behaved differently (even before that commit).
>>
>> There are plenty of behavior I considered wrong in the way x86 update the
>> P2M and I would be concerned if we don't give any leeway for the
>> architectures to tighten the update. BTW some checks have evolved over the
>> time during security event (XSA-378 for example).
> 
> ABI changes due to security issues are unavoidable.
> 
>> This is not very different here. For Arm we decided to not follow a behavior
>> that I consider incorrect and potentially more harmful than trying to
>> support bootloader not removing the shared page.
> 
> I think this is not very friendly to users, specially if edk2 wasn't
> checked.

This was forgotten because it is not yet common to use EDK2 on Xen on 
Arm (the proof is it took one year to find the obvious bug). I agree 
this is not user friendly, but it is impossible to check all the single 
projects. I will usually only look at the one that I know are used on 
Arm and/or someone remind me on the ML.

> I understand the situation is different on Arm vs x86, so if
> edk2 is not supported on arm I guess it doesn't matter much whether
> it's broken.  It would be a much worse issue on x86 where edk2 is
> supported.

AFAIK, we have CI for x86 on EDK2 but we don't on Arm.

> 
>> If we want to handle such firmware, I think it would be better if we provide
>> an hypercall that would return the GFN where it is currently mapped.
> 
> Sure, but such hypercall would be racy, as by the time the gfn is
> returned the value could be stale.  Adding a replacing and non
> replacing XENMEM_add_to_physmap variations would IMO be better.
> 
> Anyway, I don't maintain this, so it's up to you.

Bertrand/Stefano, any opinions?

> 
>>>
>>>> Also, it is not really argument, but this is not the only broken part in
>>>> EDK2 for Xen Arm guests. The other one I know is EDS makes assumption how
>>>> some Device-Tree nodes and this will break on newer Xen.
>>>>
>>>> So overall, it feels to me that EDK2 is not entirely ready to be used in
>>>> production for Xen on Arm guests.
>>>
>>> I really have no insight on this.  What are the supported way of booting
>>> guests on Arm?  (SUPPORT.md doesn't seem to list any firmware for Arm
>>> guests AFAICT).
>>
>> Some bootloaders (e.g. U-boot/EDK2) have support to be used as a fimware for
>> Xen on Arm guests. But they are not supported officially.
>>
>> Most of the setup seems to specify the kernel directly in the XL
>> configuration. We probably ought to add support for EDK2/U-boot.
> 
> I had no idea about that, I do think some kind of firmware is required
> or else OSes different than Linux can't be supported unless they
> implement the Linux entry point.

Indeed. But -ENOTIME and so far no-one else shown any interest :).

> 
> Is the entry point / CPU state for arm guests documented somewhere?

Yes. The protocol is documented in 
https://docs.kernel.org/arch/arm64/booting.html.

> 
> I wonder whether Elliot could use that for FreeBSD until the situation
> with edk2 is stable.

Unfortunately, the situation is unlikely to change if no one puts any 
effort to fix it and add testing.

While I am on the reviewer list for EDK2, I don't have the bandwith to 
rewiew, let alone working on it. In fact, I have been looking for 
someone to replace me as the reviewer for EDK2 in the past few months. 
The role is still open if someone has any interest and more time to 
allocate.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-04 17:47                   ` Julien Grall
@ 2023-10-04 21:13                     ` Elliott Mitchell
  2023-10-04 22:21                       ` Stefano Stabellini
  2023-10-05  8:54                       ` Julien Grall
  2023-10-04 22:52                     ` Stefano Stabellini
  1 sibling, 2 replies; 24+ messages in thread
From: Elliott Mitchell @ 2023-10-04 21:13 UTC (permalink / raw)
  To: Julien Grall
  Cc: Roger Pau Monné, xen-devel, Stefano Stabellini,
	Bertrand Marquis, Volodymyr Babchuk, Oleksandr Tyshchenko

Heavily trimming earlier messages.  Also doing one response to cover
several items.  Hopefully I'm not missing something which needs a
response.


On Wed, Oct 04, 2023 at 03:39:16PM +0200, Roger Pau Monné wrote:
> On Wed, Oct 04, 2023 at 02:03:43PM +0100, Julien Grall wrote:
> > 
> > On 04/10/2023 13:53, Roger Pau Monné wrote:
> > > 
> > > When using UEFI there's RAM that will always be in-use by the
> > > firmware, as runtime services cannot be shut down, and hence the
> > > firmware must already have a way to remove/reserve such region(s) on
> > > the memory map.
> > 
> > Can either you or Elliott confirm if EDK2 reserve the region?
> 
> I will defer to Elliott to check for arm.  I would be quite surprised
> if it doesn't on x86, or else we would get a myriad of bug reports
> about guests randomly crashing when using edk2.

When I had originally looked I thought there was no problem as
`OvmfPkg/XenPlatformPei/Xen.c`:
CalibrateLapicTimer()
	MapSharedInfoPage(SharedInfo)
	...
	UnmapXenPage(SharedInfo)

Later using `find * -type f -print0 | xargs -0 grep -eXENMAPSPACE_shared_info`
`OvmfPkg/XenBusDxe/XenBusDxe.c`:
	XenGetSharedInfoPage()
  // using reserved page because the page is not released when Linux is
  // starting because of the add_to_physmap. QEMU might try to access the
  // page, and fail because it have no right to do so (segv).

Looks like this second case leaks the shared information page.
Originally I thought there was no problem as I'd only found the first
instance.  Appears this second instance is the problem.


> > Also, it is not really argument, but this is not the only broken part in
> > EDK2 for Xen Arm guests. The other one I know is EDS makes assumption how
> > some Device-Tree nodes and this will break on newer Xen.
> > 
> > So overall, it feels to me that EDK2 is not entirely ready to be used in
> > production for Xen on Arm guests.
> 
> I really have no insight on this.  What are the supported way of booting
> guests on Arm?  (SUPPORT.md doesn't seem to list any firmware for Arm
> guests AFAICT).

I don't know about whether their support status, but I'm aware of two
viable ways to boot domains on ARM.  First, PyGRUB does work.  This is a
rather poor way to boot, but I do admit it is functional for Linux.
Second, Tianocore/EDK2.  This is very functional, but seems the above
broke recently.

I hope PvGRUB for ARM becomes available soon, but right now it isn't
available.


On Wed, Oct 04, 2023 at 06:47:41PM +0100, Julien Grall wrote:
> 
> On 04/10/2023 15:53, Roger Pau Monné wrote:
> > On Wed, Oct 04, 2023 at 03:06:14PM +0100, Julien Grall wrote:
> >>
> >> This is not very different here. For Arm we decided to not follow a behavior
> >> that I consider incorrect and potentially more harmful than trying to
> >> support bootloader not removing the shared page.
> > 
> > I think this is not very friendly to users, specially if edk2 wasn't
> > checked.
> 
> This was forgotten because it is not yet common to use EDK2 on Xen on 
> Arm (the proof is it took one year to find the obvious bug). I agree 
> this is not user friendly, but it is impossible to check all the single 
> projects. I will usually only look at the one that I know are used on 
> Arm and/or someone remind me on the ML.

By traditional standards, 1 year is quite fast.  Figure 3-6 months for
distributions/vendors to pick up the latest, then another 3-6 months
while people are experimenting.

This may point to Xen/ARM not being heavily deployed.  Could also be most
uses are doing direct kernel boot, or using PyGRUB.

Since it is the only bootloader for non-Linux and only bootloader which
doesn't requite execution in domain 0, Tianocore/EDK2 seems worth
monitoring.


> > I understand the situation is different on Arm vs x86, so if
> > edk2 is not supported on arm I guess it doesn't matter much whether
> > it's broken.  It would be a much worse issue on x86 where edk2 is
> > supported.
> 
> AFAIK, we have CI for x86 on EDK2 but we don't on Arm.

What is the current status of this?  I'm unsure whether it was an extra
patch done by Debian, but "edk2-stable202211"/fff6d81270 doesn't work
with Xen/Qemu.

I've also never observed 'kernel = "OVMF.fd"' generating any activity
with a PVH domain on x86.  Good news is Xen/x86 now accepts "OVMF.fd" as
a valid kernel, so progress has been made.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-04 21:13                     ` Elliott Mitchell
@ 2023-10-04 22:21                       ` Stefano Stabellini
  2023-10-04 22:37                         ` Elliott Mitchell
  2023-10-05  8:54                       ` Julien Grall
  1 sibling, 1 reply; 24+ messages in thread
From: Stefano Stabellini @ 2023-10-04 22:21 UTC (permalink / raw)
  To: Elliott Mitchell
  Cc: Julien Grall, Roger Pau Monné, xen-devel, Stefano Stabellini,
	Bertrand Marquis, Volodymyr Babchuk, Oleksandr Tyshchenko

[-- Attachment #1: Type: text/plain, Size: 1678 bytes --]

On Wed, 4 Oct 2023, Elliott Mitchell wrote:
> On Wed, Oct 04, 2023 at 03:39:16PM +0200, Roger Pau Monné wrote:
> > On Wed, Oct 04, 2023 at 02:03:43PM +0100, Julien Grall wrote:
> > > 
> > > On 04/10/2023 13:53, Roger Pau Monné wrote:
> > > > 
> > > > When using UEFI there's RAM that will always be in-use by the
> > > > firmware, as runtime services cannot be shut down, and hence the
> > > > firmware must already have a way to remove/reserve such region(s) on
> > > > the memory map.
> > > 
> > > Can either you or Elliott confirm if EDK2 reserve the region?
> > 
> > I will defer to Elliott to check for arm.  I would be quite surprised
> > if it doesn't on x86, or else we would get a myriad of bug reports
> > about guests randomly crashing when using edk2.
> 
> When I had originally looked I thought there was no problem as
> `OvmfPkg/XenPlatformPei/Xen.c`:
> CalibrateLapicTimer()
> 	MapSharedInfoPage(SharedInfo)
> 	...
> 	UnmapXenPage(SharedInfo)
> 
> Later using `find * -type f -print0 | xargs -0 grep -eXENMAPSPACE_shared_info`
> `OvmfPkg/XenBusDxe/XenBusDxe.c`:
> 	XenGetSharedInfoPage()
>   // using reserved page because the page is not released when Linux is
>   // starting because of the add_to_physmap. QEMU might try to access the
>   // page, and fail because it have no right to do so (segv).
> 
> Looks like this second case leaks the shared information page.
> Originally I thought there was no problem as I'd only found the first
> instance.  Appears this second instance is the problem.

I understand this second case is *not* unmapping the SharedInfo page,
but is it reserving it somehow? For instance marking it as reserved in
the EFI memory map?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-04 22:21                       ` Stefano Stabellini
@ 2023-10-04 22:37                         ` Elliott Mitchell
  2023-10-04 22:42                           ` Stefano Stabellini
  0 siblings, 1 reply; 24+ messages in thread
From: Elliott Mitchell @ 2023-10-04 22:37 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Julien Grall, Roger Pau Monné, xen-devel, Bertrand Marquis,
	Volodymyr Babchuk, Oleksandr Tyshchenko

On Wed, Oct 04, 2023 at 03:21:04PM -0700, Stefano Stabellini wrote:
> On Wed, 4 Oct 2023, Elliott Mitchell wrote:
> > On Wed, Oct 04, 2023 at 03:39:16PM +0200, Roger Pau Monné wrote:
> > > On Wed, Oct 04, 2023 at 02:03:43PM +0100, Julien Grall wrote:
> > > > 
> > > > On 04/10/2023 13:53, Roger Pau Monné wrote:
> > > > > 
> > > > > When using UEFI there's RAM that will always be in-use by the
> > > > > firmware, as runtime services cannot be shut down, and hence the
> > > > > firmware must already have a way to remove/reserve such region(s) on
> > > > > the memory map.
> > > > 
> > > > Can either you or Elliott confirm if EDK2 reserve the region?
> > > 
> > > I will defer to Elliott to check for arm.  I would be quite surprised
> > > if it doesn't on x86, or else we would get a myriad of bug reports
> > > about guests randomly crashing when using edk2.
> > 
> > When I had originally looked I thought there was no problem as
> > `OvmfPkg/XenPlatformPei/Xen.c`:
> > CalibrateLapicTimer()
> > 	MapSharedInfoPage(SharedInfo)
> > 	...
> > 	UnmapXenPage(SharedInfo)
> > 
> > Later using `find * -type f -print0 | xargs -0 grep -eXENMAPSPACE_shared_info`
> > `OvmfPkg/XenBusDxe/XenBusDxe.c`:
> > 	XenGetSharedInfoPage()
> >   // using reserved page because the page is not released when Linux is
> >   // starting because of the add_to_physmap. QEMU might try to access the
> >   // page, and fail because it have no right to do so (segv).
> > 
> > Looks like this second case leaks the shared information page.
> > Originally I thought there was no problem as I'd only found the first
> > instance.  Appears this second instance is the problem.
> 
> I understand this second case is *not* unmapping the SharedInfo page,
> but is it reserving it somehow? For instance marking it as reserved in
> the EFI memory map?

Notice the "//" comment which I carefully grabbed?

  // using reserved page because the page is not released when Linux is
  // starting because of the add_to_physmap. QEMU might try to access the
  // page, and fail because it have no right to do so (segv).

So the page shouldn't be touched by anyone, but it does end up wasted.
Likely ExitBootServices() should clear the mapping.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-04 22:37                         ` Elliott Mitchell
@ 2023-10-04 22:42                           ` Stefano Stabellini
  0 siblings, 0 replies; 24+ messages in thread
From: Stefano Stabellini @ 2023-10-04 22:42 UTC (permalink / raw)
  To: Elliott Mitchell
  Cc: Stefano Stabellini, Julien Grall, Roger Pau Monné, xen-devel,
	Bertrand Marquis, Volodymyr Babchuk, Oleksandr Tyshchenko

[-- Attachment #1: Type: text/plain, Size: 2530 bytes --]

On Wed, 4 Oct 2023, Elliott Mitchell wrote:
> On Wed, Oct 04, 2023 at 03:21:04PM -0700, Stefano Stabellini wrote:
> > On Wed, 4 Oct 2023, Elliott Mitchell wrote:
> > > On Wed, Oct 04, 2023 at 03:39:16PM +0200, Roger Pau Monné wrote:
> > > > On Wed, Oct 04, 2023 at 02:03:43PM +0100, Julien Grall wrote:
> > > > > 
> > > > > On 04/10/2023 13:53, Roger Pau Monné wrote:
> > > > > > 
> > > > > > When using UEFI there's RAM that will always be in-use by the
> > > > > > firmware, as runtime services cannot be shut down, and hence the
> > > > > > firmware must already have a way to remove/reserve such region(s) on
> > > > > > the memory map.
> > > > > 
> > > > > Can either you or Elliott confirm if EDK2 reserve the region?
> > > > 
> > > > I will defer to Elliott to check for arm.  I would be quite surprised
> > > > if it doesn't on x86, or else we would get a myriad of bug reports
> > > > about guests randomly crashing when using edk2.
> > > 
> > > When I had originally looked I thought there was no problem as
> > > `OvmfPkg/XenPlatformPei/Xen.c`:
> > > CalibrateLapicTimer()
> > > 	MapSharedInfoPage(SharedInfo)
> > > 	...
> > > 	UnmapXenPage(SharedInfo)
> > > 
> > > Later using `find * -type f -print0 | xargs -0 grep -eXENMAPSPACE_shared_info`
> > > `OvmfPkg/XenBusDxe/XenBusDxe.c`:
> > > 	XenGetSharedInfoPage()
> > >   // using reserved page because the page is not released when Linux is
> > >   // starting because of the add_to_physmap. QEMU might try to access the
> > >   // page, and fail because it have no right to do so (segv).
> > > 
> > > Looks like this second case leaks the shared information page.
> > > Originally I thought there was no problem as I'd only found the first
> > > instance.  Appears this second instance is the problem.
> > 
> > I understand this second case is *not* unmapping the SharedInfo page,
> > but is it reserving it somehow? For instance marking it as reserved in
> > the EFI memory map?
> 
> Notice the "//" comment which I carefully grabbed?
> 
>   // using reserved page because the page is not released when Linux is
>   // starting because of the add_to_physmap. QEMU might try to access the
>   // page, and fail because it have no right to do so (segv).
> 
> So the page shouldn't be touched by anyone, but it does end up wasted.
> Likely ExitBootServices() should clear the mapping.

Sorry to be pedantic but I am really not familiar with EDK2. Does
"reserved page" in this context mean a memory page from a reserved
region marked as reserved in the EFI memory map?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-04 17:47                   ` Julien Grall
  2023-10-04 21:13                     ` Elliott Mitchell
@ 2023-10-04 22:52                     ` Stefano Stabellini
  2023-10-05  0:20                       ` Elliott Mitchell
  2023-10-05  9:24                       ` Roger Pau Monné
  1 sibling, 2 replies; 24+ messages in thread
From: Stefano Stabellini @ 2023-10-04 22:52 UTC (permalink / raw)
  To: Julien Grall
  Cc: Roger Pau Monné, Elliott Mitchell, xen-devel,
	Stefano Stabellini, Bertrand Marquis, Volodymyr Babchuk,
	Oleksandr Tyshchenko

On Wed, 4 Oct 2023, Julien Grall wrote:
> > > This is not very different here. For Arm we decided to not follow a
> > > behavior
> > > that I consider incorrect and potentially more harmful than trying to
> > > support bootloader not removing the shared page.
> > 
> > I think this is not very friendly to users, specially if edk2 wasn't
> > checked.
> 
> This was forgotten because it is not yet common to use EDK2 on Xen on Arm (the
> proof is it took one year to find the obvious bug). I agree this is not user
> friendly, but it is impossible to check all the single projects. I will
> usually only look at the one that I know are used on Arm and/or someone remind
> me on the ML.

At AMD/Xilinx we don't have EDK2 on any ARM board. Unless I go out of my
way to test it on purpose I wouldn't see it.

> > I understand the situation is different on Arm vs x86, so if
> > edk2 is not supported on arm I guess it doesn't matter much whether
> > it's broken.  It would be a much worse issue on x86 where edk2 is
> > supported.
> 
> AFAIK, we have CI for x86 on EDK2 but we don't on Arm.

I think we should have a gitlab-ci job testing EDK2 on ARM using QEMU
and I would certainly welcome it if someone contributed it.

> > > If we want to handle such firmware, I think it would be better if we
> > > provide
> > > an hypercall that would return the GFN where it is currently mapped.
> > 
> > Sure, but such hypercall would be racy, as by the time the gfn is
> > returned the value could be stale.  Adding a replacing and non
> > replacing XENMEM_add_to_physmap variations would IMO be better.
> > 
> > Anyway, I don't maintain this, so it's up to you.
> 
> Bertrand/Stefano, any opinions?

I think we should fix EDK2 to unmap the shared info in all cases as
that's simpler and the best implementation. What's the value of keeping
the mapping around when the OS can't find it? Unless you have an idea on
how the OS could find the location of the existing EDK2 shared info
mapping.

It is important not to have 2 different behaviors for the same hypercall
on ARM and x86, especially when the hypercall looks arch-neutral and an
operating system would reasonably expect to use it in common code.
Having different behaviors on ARM/x86 is more error prone than having a
less-than-ideal hypercall implementation.

I agree with Julien that the ARM behavior is the right behavior. Can we
change the x86 implementation to match ARM somehow?

If we do, I guess we would end up breaking legacy EDK2?

Is really the only choice to change the ARM implementation to match the
x86 implementation?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-04 22:52                     ` Stefano Stabellini
@ 2023-10-05  0:20                       ` Elliott Mitchell
  2023-10-05  9:24                       ` Roger Pau Monné
  1 sibling, 0 replies; 24+ messages in thread
From: Elliott Mitchell @ 2023-10-05  0:20 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Julien Grall, Roger Pau Monné, xen-devel, Bertrand Marquis,
	Volodymyr Babchuk, Oleksandr Tyshchenko

On Wed, Oct 04, 2023 at 03:52:54PM -0700, Stefano Stabellini wrote:
> On Wed, 4 Oct 2023, Julien Grall wrote:
> > > > If we want to handle such firmware, I think it would be better if we
> > > > provide
> > > > an hypercall that would return the GFN where it is currently mapped.
> > > 
> > > Sure, but such hypercall would be racy, as by the time the gfn is
> > > returned the value could be stale.  Adding a replacing and non
> > > replacing XENMEM_add_to_physmap variations would IMO be better.
> > > 
> > > Anyway, I don't maintain this, so it's up to you.
> > 
> > Bertrand/Stefano, any opinions?
> 
> I think we should fix EDK2 to unmap the shared info in all cases as
> that's simpler and the best implementation. What's the value of keeping
> the mapping around when the OS can't find it? Unless you have an idea on
> how the OS could find the location of the existing EDK2 shared info
> mapping.

I tend to agree.

> It is important not to have 2 different behaviors for the same hypercall
> on ARM and x86, especially when the hypercall looks arch-neutral and an
> operating system would reasonably expect to use it in common code.
> Having different behaviors on ARM/x86 is more error prone than having a
> less-than-ideal hypercall implementation.

I attempted to head this direction with FreeBSD, but there were enough
surrounding differences to make this troublesome to implement.  May be
easier on other OSes with less history though.

I do agree on general principle fewer/smaller differences are better.
Yet I note my earlier patch to have `typedef struct trap_info trap_info_t`
consistently visible didn't go through...

> I agree with Julien that the ARM behavior is the right behavior. Can we
> change the x86 implementation to match ARM somehow?
> 
> If we do, I guess we would end up breaking legacy EDK2?

I agree this ARM behavior does seem appropriate.  Due to longer history
though it will need far more transition time.  If this 4.18 tarballs
aren't out yet, I would try to see whether a warning message could be
implemented *now*.  I estimate the old behavior will need support for
5-10 years.


On Wed, Oct 04, 2023 at 03:42:00PM -0700, Stefano Stabellini wrote:
> Sorry to be pedantic but I am really not familiar with EDK2. Does
> "reserved page" in this context mean a memory page from a reserved
> region marked as reserved in the EFI memory map?

I'm not too familiar either, so a fair amount of what I write is
speculation or guesses.  I'm assuming this refers to "ACPI reserved
page", meaning marked as not for use by the OS.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-04 21:13                     ` Elliott Mitchell
  2023-10-04 22:21                       ` Stefano Stabellini
@ 2023-10-05  8:54                       ` Julien Grall
  2023-10-05 15:56                         ` Elliott Mitchell
  1 sibling, 1 reply; 24+ messages in thread
From: Julien Grall @ 2023-10-05  8:54 UTC (permalink / raw)
  To: Elliott Mitchell
  Cc: Roger Pau Monné, xen-devel, Stefano Stabellini,
	Bertrand Marquis, Volodymyr Babchuk, Oleksandr Tyshchenko

On 04/10/2023 22:13, Elliott Mitchell wrote:
>>> I understand the situation is different on Arm vs x86, so if
>>> edk2 is not supported on arm I guess it doesn't matter much whether
>>> it's broken.  It would be a much worse issue on x86 where edk2 is
>>> supported.
>>
>> AFAIK, we have CI for x86 on EDK2 but we don't on Arm.
> 
> What is the current status of this?  I'm unsure whether it was an extra
> patch done by Debian, but "edk2-stable202211"/fff6d81270 doesn't work
> with Xen/Qemu.

I don't know what's the status for x86. But for Arm, there are nothing. 
And as I pointed out in my previous answer this is unlikely to change 
until someone invest time in EDK2 on Xen on Arm.

If there are patches sent on the ML, then I am happy to attempt to 
review them. But I am afraid, I am not going to have time to try to find 
and fix all the issues when using EDK2 in Arm guests.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-04 22:52                     ` Stefano Stabellini
  2023-10-05  0:20                       ` Elliott Mitchell
@ 2023-10-05  9:24                       ` Roger Pau Monné
  2023-10-05 20:52                         ` Stefano Stabellini
  1 sibling, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2023-10-05  9:24 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Julien Grall, Elliott Mitchell, xen-devel, Bertrand Marquis,
	Volodymyr Babchuk, Oleksandr Tyshchenko

On Wed, Oct 04, 2023 at 03:52:54PM -0700, Stefano Stabellini wrote:
> On Wed, 4 Oct 2023, Julien Grall wrote:
> > > > If we want to handle such firmware, I think it would be better if we
> > > > provide
> > > > an hypercall that would return the GFN where it is currently mapped.
> > > 
> > > Sure, but such hypercall would be racy, as by the time the gfn is
> > > returned the value could be stale.  Adding a replacing and non
> > > replacing XENMEM_add_to_physmap variations would IMO be better.
> > > 
> > > Anyway, I don't maintain this, so it's up to you.
> > 
> > Bertrand/Stefano, any opinions?
> 
> I think we should fix EDK2 to unmap the shared info in all cases as
> that's simpler and the best implementation. What's the value of keeping
> the mapping around when the OS can't find it? Unless you have an idea on
> how the OS could find the location of the existing EDK2 shared info
> mapping.

Indeed, edk2 should unmap the page, and we should fix that.

> 
> It is important not to have 2 different behaviors for the same hypercall
> on ARM and x86, especially when the hypercall looks arch-neutral and an
> operating system would reasonably expect to use it in common code.
> Having different behaviors on ARM/x86 is more error prone than having a
> less-than-ideal hypercall implementation.
> 
> I agree with Julien that the ARM behavior is the right behavior. Can we
> change the x86 implementation to match ARM somehow?

I'm afraid I don't see how.  edk2 is supported on x86, and hence we
cannot simply make a change to the hypervisor that would render all
current versions of edk2 unusable.

> If we do, I guess we would end up breaking legacy EDK2?

Breaking plain edk2, as there's no version of edk2 that currently does
the unmapping?

> Is really the only choice to change the ARM implementation to match the
> x86 implementation?

Unless we want x86 and Arm to diverge in behavior.

I do think the arm behavior is more sane, but I don't think we can
make that change on x86 and simply render all existing versions of
edk2 unusable.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-05  8:54                       ` Julien Grall
@ 2023-10-05 15:56                         ` Elliott Mitchell
  0 siblings, 0 replies; 24+ messages in thread
From: Elliott Mitchell @ 2023-10-05 15:56 UTC (permalink / raw)
  To: Julien Grall
  Cc: Roger Pau Monné, xen-devel, Stefano Stabellini,
	Bertrand Marquis, Volodymyr Babchuk, Oleksandr Tyshchenko

[-- Attachment #1: Type: text/plain, Size: 2318 bytes --]

On Thu, Oct 05, 2023 at 09:54:26AM +0100, Julien Grall wrote:
> On 04/10/2023 22:13, Elliott Mitchell wrote:
> >>> I understand the situation is different on Arm vs x86, so if
> >>> edk2 is not supported on arm I guess it doesn't matter much whether
> >>> it's broken.  It would be a much worse issue on x86 where edk2 is
> >>> supported.
> >>
> >> AFAIK, we have CI for x86 on EDK2 but we don't on Arm.
> > 
> > What is the current status of this?  I'm unsure whether it was an extra
> > patch done by Debian, but "edk2-stable202211"/fff6d81270 doesn't work
> > with Xen/Qemu.
> 
> I don't know what's the status for x86. But for Arm, there are nothing. 
> And as I pointed out in my previous answer this is unlikely to change 
> until someone invest time in EDK2 on Xen on Arm.

Indications are the measures for x86 aren't very good either.  Presently
the Debian distribution is using builds based on tag edk2-stable202211,
commit fff6d81270.  Could be the Debian package process got broken.
Could also be that tag, builds are completing yet the output fails to
function.

I've also been trying to get Tianocore/EDK2 to function as a bootloader
for PVH.  Current OVMF builds are accepted for the 'kernel = "OVMF.fd"'
line, but simply hang.  I'm wondering whether it is assuming the presence
of a framebuffer and doesn't use the x86 Xen console.

> If there are patches sent on the ML, then I am happy to attempt to 
> review them. But I am afraid, I am not going to have time to try to find 
> and fix all the issues when using EDK2 in Arm guests.

The attached patch got booting via Tianocore/EDK2 working.  This likely
needs adjustment to match their style; problem is their style is so
awful I was looking for an airsickness bag.  I really don't want to do
further polishing.

I'm unsure whether XENMEM_remove_from_physmap uncovers a memory page
which had been previously mapped, versus turning it into a hole.  In
the former case, the allocation should be moved to a normal heap page
since the OS can reuse it.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445



[-- Attachment #2: 0001-OvmfPkg-XenBusDxe-Unmap-shared-information-page-on-e.patch --]
[-- Type: text/x-diff, Size: 1746 bytes --]

From 19b0e88b5e0d6954a924f444b02bcb576a115952 Mon Sep 17 00:00:00 2001
Message-Id: <19b0e88b5e0d6954a924f444b02bcb576a115952.1696517822.git.ehem+xen@m5p.com>
From: Elliott Mitchell <ehem+xen@m5p.com>
Date: Wed, 4 Oct 2023 21:47:17 -0700
Subject: [PATCH] OvmfPkg/XenBusDxe: Unmap shared information page on exit

Xen/ARM now requires the shared information page to only be mapped once.
This behavior is expected to be copied to Xen/x86 at some future point.

Signed-off-by: Elliott Mitchell <ehem+xen@m5p.com>
---
 OvmfPkg/XenBusDxe/XenBusDxe.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/OvmfPkg/XenBusDxe/XenBusDxe.c b/OvmfPkg/XenBusDxe/XenBusDxe.c
index 132f43a72b..117a299d10 100644
--- a/OvmfPkg/XenBusDxe/XenBusDxe.c
+++ b/OvmfPkg/XenBusDxe/XenBusDxe.c
@@ -84,6 +84,33 @@ XenGetSharedInfoPage (
   return EFI_SUCCESS;
 }
 
+/**
+  Unmap the shared_info_t page from memory.
+
+  @param Dev    A XENBUS_DEVICE instance.
+**/
+VOID
+XenClearSharedInfoPage (
+  IN OUT XENBUS_DEVICE  *Dev
+  )
+{
+  xen_remove_from_physmap_t  Parameter;
+
+  //
+  // Either never mapped, or else already cleared.  No further cleanup
+  // action required.
+  //
+  if (!Dev->SharedInfo) return;
+
+  Parameter.domid = DOMID_SELF;
+  Parameter.gpfn  = (UINTN)Dev->SharedInfo >> EFI_PAGE_SHIFT;
+  if (XenHypercallMemoryOp (XENMEM_remove_from_physmap, &Parameter) != 0)
+    return;
+
+  FreePages (Dev->SharedInfo, 1);
+  Dev->SharedInfo = NULL;
+}
+
 /**
   Unloads an image.
 
@@ -501,6 +528,8 @@ XenBusDxeDriverBindingStop (
   XenStoreDeinit (Dev);
   XenGrantTableDeinit (Dev);
 
+  XenClearSharedInfoPage(Dev);
+
   gBS->CloseProtocol (
          ControllerHandle,
          &gEfiDevicePathProtocolGuid,
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: Issue with shared information page on Xen/ARM 4.17
  2023-10-05  9:24                       ` Roger Pau Monné
@ 2023-10-05 20:52                         ` Stefano Stabellini
  0 siblings, 0 replies; 24+ messages in thread
From: Stefano Stabellini @ 2023-10-05 20:52 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stefano Stabellini, Julien Grall, Elliott Mitchell, xen-devel,
	Bertrand Marquis, Volodymyr Babchuk, Oleksandr Tyshchenko

[-- Attachment #1: Type: text/plain, Size: 2721 bytes --]

On Thu, 5 Oct 2023, Roger Pau Monné wrote:
> On Wed, Oct 04, 2023 at 03:52:54PM -0700, Stefano Stabellini wrote:
> > On Wed, 4 Oct 2023, Julien Grall wrote:
> > > > > If we want to handle such firmware, I think it would be better if we
> > > > > provide
> > > > > an hypercall that would return the GFN where it is currently mapped.
> > > > 
> > > > Sure, but such hypercall would be racy, as by the time the gfn is
> > > > returned the value could be stale.  Adding a replacing and non
> > > > replacing XENMEM_add_to_physmap variations would IMO be better.
> > > > 
> > > > Anyway, I don't maintain this, so it's up to you.
> > > 
> > > Bertrand/Stefano, any opinions?
> > 
> > I think we should fix EDK2 to unmap the shared info in all cases as
> > that's simpler and the best implementation. What's the value of keeping
> > the mapping around when the OS can't find it? Unless you have an idea on
> > how the OS could find the location of the existing EDK2 shared info
> > mapping.
> 
> Indeed, edk2 should unmap the page, and we should fix that.
> 
> > 
> > It is important not to have 2 different behaviors for the same hypercall
> > on ARM and x86, especially when the hypercall looks arch-neutral and an
> > operating system would reasonably expect to use it in common code.
> > Having different behaviors on ARM/x86 is more error prone than having a
> > less-than-ideal hypercall implementation.
> > 
> > I agree with Julien that the ARM behavior is the right behavior. Can we
> > change the x86 implementation to match ARM somehow?
> 
> I'm afraid I don't see how.  edk2 is supported on x86, and hence we
> cannot simply make a change to the hypervisor that would render all
> current versions of edk2 unusable.
> 
> > If we do, I guess we would end up breaking legacy EDK2?
> 
> Breaking plain edk2, as there's no version of edk2 that currently does
> the unmapping?
> 
> > Is really the only choice to change the ARM implementation to match the
> > x86 implementation?
> 
> Unless we want x86 and Arm to diverge in behavior.
> 
> I do think the arm behavior is more sane, but I don't think we can
> make that change on x86 and simply render all existing versions of
> edk2 unusable.

Right, but maybe we can come up with a deprecation period?

Especially considering that this is for domU firmware (not host
firmware), and domU firmware is often (not always, but often) provided
by Xen (Xen in the sense of xen.git and Xen packages). First we fix EDK2
upstream and we update the EDK2 build in xen.git. Then we give it a
couple of releases, then we change the x86 hypercall behavior?

In the meantime we could print a warning in Xen DEBUG builds when the
hypercall is called and there is one existing mapping?

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-10-05 20:52 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-29  2:49 Issue with shared information page on Xen/ARM 4.17 Elliott Mitchell
2023-10-03  8:26 ` Roger Pau Monné
2023-10-03 19:18   ` Elliott Mitchell
2023-10-04  8:13     ` Roger Pau Monné
2023-10-04 10:55       ` Julien Grall
2023-10-04 11:42         ` Oleksandr Tyshchenko
2023-10-04 12:59           ` Roger Pau Monné
2023-10-04 17:18             ` Oleksandr Tyshchenko
2023-10-04 12:53         ` Roger Pau Monné
2023-10-04 13:03           ` Julien Grall
2023-10-04 13:39             ` Roger Pau Monné
2023-10-04 14:06               ` Julien Grall
2023-10-04 14:53                 ` Roger Pau Monné
2023-10-04 17:47                   ` Julien Grall
2023-10-04 21:13                     ` Elliott Mitchell
2023-10-04 22:21                       ` Stefano Stabellini
2023-10-04 22:37                         ` Elliott Mitchell
2023-10-04 22:42                           ` Stefano Stabellini
2023-10-05  8:54                       ` Julien Grall
2023-10-05 15:56                         ` Elliott Mitchell
2023-10-04 22:52                     ` Stefano Stabellini
2023-10-05  0:20                       ` Elliott Mitchell
2023-10-05  9:24                       ` Roger Pau Monné
2023-10-05 20:52                         ` Stefano Stabellini

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.