[PATCH 0/8] xen: cache control improvements

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/8] xen: cache control improvements
@ 2025-05-06  8:31 Roger Pau Monne
  2025-05-06  8:31 ` [PATCH 1/9] x86/pv: fix MMUEXT_FLUSH_CACHE to flush all pCPU caches Roger Pau Monne
                   ` (9 more replies)
  0 siblings, 10 replies; 50+ messages in thread
From: Roger Pau Monne @ 2025-05-06  8:31 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Julien Grall, Stefano Stabellini, Juergen Gross,
	Christian Lindig, David Scott, Bertrand Marquis,
	Volodymyr Babchuk

Hello,

Following series contain some fixes for cache control operations, the
main focus is to reduce the load on big systems when cache control
operations are executed.

Patches 1-4 are bugfixes, while patches starting from 5 are improvements
to the current code.  Patch 9 is an optimization to avoid having to
broadcast cache flushes on all pCPUs on x86.

Thanks, Roger.

Roger Pau Monne (9):
  x86/pv: fix MMUEXT_FLUSH_CACHE to flush all pCPU caches
  x86/pv: fix emulation of wb{,no}invd to flush all pCPU caches
  xen/gnttab: limit cache flush operation to guests allowed cache
    control
  x86/gnttab: do not implement GNTTABOP_cache_flush
  x86/mtrr: use memory_type_changed() in hvm_set_mem_pinned_cacheattr()
  x86/p2m: limit cache flush in memory_type_changed()
  xen/x86: rename cache_flush_permitted() to has_arch_io_resources()
  xen: introduce flag when a domain requires cache control
  xen/x86: track dirty pCPU caches for a given vCPU

 docs/man/xl.cfg.5.pod.in            | 10 +++++++
 tools/include/libxl.h               |  7 +++++
 tools/libs/light/libxl_create.c     |  6 ++++
 tools/libs/light/libxl_types.idl    |  3 ++
 tools/ocaml/libs/xc/xenctrl.ml      |  1 +
 tools/ocaml/libs/xc/xenctrl.mli     |  1 +
 tools/xl/xl_parse.c                 |  2 ++
 xen/arch/arm/dom0less-build.c       | 12 ++++++--
 xen/arch/arm/domain.c               |  3 +-
 xen/arch/arm/domain_build.c         |  6 ++++
 xen/arch/x86/domain.c               | 43 +++++++++++++++++++++++++++++
 xen/arch/x86/hvm/hvm.c              |  2 +-
 xen/arch/x86/hvm/mtrr.c             | 29 ++++---------------
 xen/arch/x86/hvm/svm/svm.c          |  6 ++--
 xen/arch/x86/hvm/vmx/vmcs.c         |  3 +-
 xen/arch/x86/hvm/vmx/vmx.c          |  6 ++--
 xen/arch/x86/include/asm/domain.h   |  9 ++++++
 xen/arch/x86/include/asm/flushtlb.h | 15 ----------
 xen/arch/x86/include/asm/iocap.h    | 19 ++-----------
 xen/arch/x86/include/asm/p2m.h      |  2 +-
 xen/arch/x86/mm.c                   | 21 ++++----------
 xen/arch/x86/mm/p2m-ept.c           |  7 +----
 xen/arch/x86/mm/p2m-pod.c           |  4 +--
 xen/arch/x86/mm/p2m.c               | 13 ++++++++-
 xen/arch/x86/pv/emul-priv-op.c      | 19 ++++++-------
 xen/arch/x86/setup.c                |  7 +++++
 xen/common/domain.c                 |  3 +-
 xen/common/grant_table.c            | 11 ++++++++
 xen/common/memory.c                 |  2 +-
 xen/include/asm-generic/iocap.h     |  2 +-
 xen/include/public/domctl.h         |  4 ++-
 xen/include/xen/iocap.h             | 25 ++---------------
 xen/include/xen/sched.h             |  6 ++++
 33 files changed, 181 insertions(+), 128 deletions(-)

-- 
2.48.1



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 1/9] x86/pv: fix MMUEXT_FLUSH_CACHE to flush all pCPU caches
  2025-05-06  8:31 [PATCH 0/8] xen: cache control improvements Roger Pau Monne
@ 2025-05-06  8:31 ` Roger Pau Monne
  2025-05-12 14:09   ` Jan Beulich
  2025-05-12 14:11   ` Jan Beulich
  2025-05-06  8:31 ` [PATCH 2/9] x86/pv: fix emulation of wb{,no}invd " Roger Pau Monne
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 50+ messages in thread
From: Roger Pau Monne @ 2025-05-06  8:31 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

The implementation of MMUEXT_FLUSH_CACHE is bogus, as it doesn't account to
flush the cache of any previous pCPU where the current vCPU might have run,
and hence is likely to not work as expected.

Fix this by resorting to use the same logic as MMUEXT_FLUSH_CACHE_GLOBAL,
which will be correct in all cases.

Fixes: 534ffcb416bb ("Fix WBINVD by adding a new hypercall.")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Alternatively, the hypercall could be made correct by keeping track of the
pCPUs the vCPU has run on since the last cache flush.  That's however more
work.  See later in the series.
---
 xen/arch/x86/mm.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 66c15a3c864f..38e214352201 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3805,14 +3805,11 @@ long do_mmuext_op(
             break;
 
         case MMUEXT_FLUSH_CACHE:
-            if ( unlikely(currd != pg_owner) )
-                rc = -EPERM;
-            else if ( unlikely(!cache_flush_permitted(currd)) )
-                rc = -EACCES;
-            else
-                wbinvd();
-            break;
-
+            /*
+             * Dirty pCPU caches where the current vCPU has been scheduled are
+             * not tracked, and hence we need to resort to a global cache
+             * flush for correctness.
+             */
         case MMUEXT_FLUSH_CACHE_GLOBAL:
             if ( unlikely(currd != pg_owner) )
                 rc = -EPERM;
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/9] x86/pv: fix MMUEXT_FLUSH_CACHE to flush all pCPU caches
  2025-05-06  8:31 ` [PATCH 1/9] x86/pv: fix MMUEXT_FLUSH_CACHE to flush all pCPU caches Roger Pau Monne
@ 2025-05-12 14:09   ` Jan Beulich
  2025-05-12 14:27     ` Roger Pau Monné
  2025-05-12 14:11   ` Jan Beulich
  1 sibling, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2025-05-12 14:09 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 06.05.2025 10:31, Roger Pau Monne wrote:
> The implementation of MMUEXT_FLUSH_CACHE is bogus, as it doesn't account to
> flush the cache of any previous pCPU where the current vCPU might have run,
> and hence is likely to not work as expected.
> 
> Fix this by resorting to use the same logic as MMUEXT_FLUSH_CACHE_GLOBAL,
> which will be correct in all cases.
> 
> Fixes: 534ffcb416bb ("Fix WBINVD by adding a new hypercall.")
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Alternatively, the hypercall could be made correct by keeping track of the
> pCPUs the vCPU has run on since the last cache flush.  That's however more
> work.  See later in the series.

Since, as iirc you indicated elsewhere, there's no actual user of this sub-op,
doing as you do here is likely good enough. Just one concern:

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -3805,14 +3805,11 @@ long do_mmuext_op(
>              break;
>  
>          case MMUEXT_FLUSH_CACHE:
> -            if ( unlikely(currd != pg_owner) )
> -                rc = -EPERM;
> -            else if ( unlikely(!cache_flush_permitted(currd)) )
> -                rc = -EACCES;

This error code will change to ...

> -            else
> -                wbinvd();
> -            break;
> -
> +            /*
> +             * Dirty pCPU caches where the current vCPU has been scheduled are
> +             * not tracked, and hence we need to resort to a global cache
> +             * flush for correctness.
> +             */
>          case MMUEXT_FLUSH_CACHE_GLOBAL:
>              if ( unlikely(currd != pg_owner) )
>                  rc = -EPERM;

... -EINVAL (sitting out of context). If we accept any error code change here,
I think it wants to be the other way around, as EINVAL isn't quite appropriate
to signal !cache_flush_permitted() to the caller. With that extra adjustment:
Acked-by: Jan Beulich <jbeulich@suse.com>

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/9] x86/pv: fix MMUEXT_FLUSH_CACHE to flush all pCPU caches
  2025-05-12 14:09   ` Jan Beulich
@ 2025-05-12 14:27     ` Roger Pau Monné
  0 siblings, 0 replies; 50+ messages in thread
From: Roger Pau Monné @ 2025-05-12 14:27 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Mon, May 12, 2025 at 04:09:28PM +0200, Jan Beulich wrote:
> On 06.05.2025 10:31, Roger Pau Monne wrote:
> > The implementation of MMUEXT_FLUSH_CACHE is bogus, as it doesn't account to
> > flush the cache of any previous pCPU where the current vCPU might have run,
> > and hence is likely to not work as expected.
> > 
> > Fix this by resorting to use the same logic as MMUEXT_FLUSH_CACHE_GLOBAL,
> > which will be correct in all cases.
> > 
> > Fixes: 534ffcb416bb ("Fix WBINVD by adding a new hypercall.")
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Alternatively, the hypercall could be made correct by keeping track of the
> > pCPUs the vCPU has run on since the last cache flush.  That's however more
> > work.  See later in the series.
> 
> Since, as iirc you indicated elsewhere, there's no actual user of this sub-op,
> doing as you do here is likely good enough. Just one concern:
> 
> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -3805,14 +3805,11 @@ long do_mmuext_op(
> >              break;
> >  
> >          case MMUEXT_FLUSH_CACHE:
> > -            if ( unlikely(currd != pg_owner) )
> > -                rc = -EPERM;
> > -            else if ( unlikely(!cache_flush_permitted(currd)) )
> > -                rc = -EACCES;
> 
> This error code will change to ...
> 
> > -            else
> > -                wbinvd();
> > -            break;
> > -
> > +            /*
> > +             * Dirty pCPU caches where the current vCPU has been scheduled are
> > +             * not tracked, and hence we need to resort to a global cache
> > +             * flush for correctness.
> > +             */
> >          case MMUEXT_FLUSH_CACHE_GLOBAL:
> >              if ( unlikely(currd != pg_owner) )
> >                  rc = -EPERM;
> 
> ... -EINVAL (sitting out of context). If we accept any error code change here,
> I think it wants to be the other way around, as EINVAL isn't quite appropriate
> to signal !cache_flush_permitted() to the caller. With that extra adjustment:
> Acked-by: Jan Beulich <jbeulich@suse.com>

Oh, good catch.  I didn't realize the return code change.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/9] x86/pv: fix MMUEXT_FLUSH_CACHE to flush all pCPU caches
  2025-05-06  8:31 ` [PATCH 1/9] x86/pv: fix MMUEXT_FLUSH_CACHE to flush all pCPU caches Roger Pau Monne
  2025-05-12 14:09   ` Jan Beulich
@ 2025-05-12 14:11   ` Jan Beulich
  2025-05-12 14:24     ` Roger Pau Monné
  1 sibling, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2025-05-12 14:11 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 06.05.2025 10:31, Roger Pau Monne wrote:
> The implementation of MMUEXT_FLUSH_CACHE is bogus, as it doesn't account to
> flush the cache of any previous pCPU where the current vCPU might have run,
> and hence is likely to not work as expected.
> 
> Fix this by resorting to use the same logic as MMUEXT_FLUSH_CACHE_GLOBAL,
> which will be correct in all cases.
> 
> Fixes: 534ffcb416bb ("Fix WBINVD by adding a new hypercall.")

Oh, and: I've looked up this hash, and found a "trivial merge". Are you sure
here?

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/9] x86/pv: fix MMUEXT_FLUSH_CACHE to flush all pCPU caches
  2025-05-12 14:11   ` Jan Beulich
@ 2025-05-12 14:24     ` Roger Pau Monné
  0 siblings, 0 replies; 50+ messages in thread
From: Roger Pau Monné @ 2025-05-12 14:24 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Mon, May 12, 2025 at 04:11:51PM +0200, Jan Beulich wrote:
> On 06.05.2025 10:31, Roger Pau Monne wrote:
> > The implementation of MMUEXT_FLUSH_CACHE is bogus, as it doesn't account to
> > flush the cache of any previous pCPU where the current vCPU might have run,
> > and hence is likely to not work as expected.
> > 
> > Fix this by resorting to use the same logic as MMUEXT_FLUSH_CACHE_GLOBAL,
> > which will be correct in all cases.
> > 
> > Fixes: 534ffcb416bb ("Fix WBINVD by adding a new hypercall.")
> 
> Oh, and: I've looked up this hash, and found a "trivial merge". Are you sure
> here?

Indeed, sorry.  I've made a mistake and copied the wrong hash, it
should be 8e90e37e6db8.  The change subject is correct however.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 2/9] x86/pv: fix emulation of wb{,no}invd to flush all pCPU caches
  2025-05-06  8:31 [PATCH 0/8] xen: cache control improvements Roger Pau Monne
  2025-05-06  8:31 ` [PATCH 1/9] x86/pv: fix MMUEXT_FLUSH_CACHE to flush all pCPU caches Roger Pau Monne
@ 2025-05-06  8:31 ` Roger Pau Monne
  2025-05-12 14:20   ` Jan Beulich
  2025-05-06  8:31 ` [PATCH 3/9] xen/gnttab: limit cache flush operation to guests allowed cache control Roger Pau Monne
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 50+ messages in thread
From: Roger Pau Monne @ 2025-05-06  8:31 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

The current emulation of wb{,no}invd is bogus for PV guests: it will only
flush the current pCPU cache, without taking into account pCPUs where the
vCPU had run previously.  Since there's no tracking of dirty cache pCPUs
currently, resort to flushing the cache on all host pCPUs.  Also as a
result of having no mechanism to broadcast a wbnoinvd instruction, resort
to emulating it using wbinvd behavior, which is more expensive performance
wise, but correct.

Fixes: 2f6070f0b988 ("Priv-op emulation in Xen, for RDMSR/WRMSR/WBINVD")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Note further patches will limit the broadcast cache flush to only pCPU
where the vCPU has ran.
---
 xen/arch/x86/pv/emul-priv-op.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index 70150c272276..089d4cb4d905 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -1193,17 +1193,18 @@ static int cf_check cache_op(
 {
     ASSERT(op == x86emul_wbinvd || op == x86emul_wbnoinvd);
 
-    /* Ignore the instruction if unprivileged. */
-    if ( !cache_flush_permitted(current->domain) )
+    /*
+     * Ignore the instruction if domain doesn't have cache control.
+     * Non-physdev domain attempted WBINVD; ignore for now since
+     * newer linux uses this in some start-of-day timing loops.
+     */
+    if ( cache_flush_permitted(current->domain) )
         /*
-         * Non-physdev domain attempted WBINVD; ignore for now since
-         * newer linux uses this in some start-of-day timing loops.
+         * Handle wbnoinvd as wbinvd, at the expense of higher cost.  Broadcast
+         * the flush to all pCPUs, Xen doesn't track where the vCPU has ran
+         * previously.
          */
-        ;
-    else if ( op == x86emul_wbnoinvd /* && cpu_has_wbnoinvd */ )
-        wbnoinvd();
-    else
-        wbinvd();
+        flush_all(FLUSH_CACHE);
 
     return X86EMUL_OKAY;
 }
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/9] x86/pv: fix emulation of wb{,no}invd to flush all pCPU caches
  2025-05-06  8:31 ` [PATCH 2/9] x86/pv: fix emulation of wb{,no}invd " Roger Pau Monne
@ 2025-05-12 14:20   ` Jan Beulich
  2025-05-12 14:41     ` Roger Pau Monné
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2025-05-12 14:20 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 06.05.2025 10:31, Roger Pau Monne wrote:
> --- a/xen/arch/x86/pv/emul-priv-op.c
> +++ b/xen/arch/x86/pv/emul-priv-op.c
> @@ -1193,17 +1193,18 @@ static int cf_check cache_op(
>  {
>      ASSERT(op == x86emul_wbinvd || op == x86emul_wbnoinvd);
>  
> -    /* Ignore the instruction if unprivileged. */
> -    if ( !cache_flush_permitted(current->domain) )
> +    /*
> +     * Ignore the instruction if domain doesn't have cache control.
> +     * Non-physdev domain attempted WBINVD; ignore for now since
> +     * newer linux uses this in some start-of-day timing loops.

That's very old comment, and hence I think "newer" isn't quite applicable
anymore. Either omit the word (if Linux still does so), or say "older"
instead? Also since you touch the comment, upper-casing the L in Linux
might be nice.

> +     */
> +    if ( cache_flush_permitted(current->domain) )
>          /*
> -         * Non-physdev domain attempted WBINVD; ignore for now since
> -         * newer linux uses this in some start-of-day timing loops.
> +         * Handle wbnoinvd as wbinvd, at the expense of higher cost.  Broadcast
> +         * the flush to all pCPUs, Xen doesn't track where the vCPU has ran
> +         * previously.
>           */
> -        ;
> -    else if ( op == x86emul_wbnoinvd /* && cpu_has_wbnoinvd */ )
> -        wbnoinvd();

So this goes away altogether, which isn't nice. It was "only" 2 years ago that
I posted a series where an additional ...

> -    else
> -        wbinvd();
> +        flush_all(FLUSH_CACHE);

... FLUSH_CACHE_WRITEBACK is introduced [1]. I really, really think that should
go in first, for it to then be used here. The more that it's the 1st patch in
that series.

Jan

[1] https://lists.xen.org/archives/html/xen-devel/2023-05/msg00242.html


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/9] x86/pv: fix emulation of wb{,no}invd to flush all pCPU caches
  2025-05-12 14:20   ` Jan Beulich
@ 2025-05-12 14:41     ` Roger Pau Monné
  0 siblings, 0 replies; 50+ messages in thread
From: Roger Pau Monné @ 2025-05-12 14:41 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Mon, May 12, 2025 at 04:20:30PM +0200, Jan Beulich wrote:
> On 06.05.2025 10:31, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/pv/emul-priv-op.c
> > +++ b/xen/arch/x86/pv/emul-priv-op.c
> > @@ -1193,17 +1193,18 @@ static int cf_check cache_op(
> >  {
> >      ASSERT(op == x86emul_wbinvd || op == x86emul_wbnoinvd);
> >  
> > -    /* Ignore the instruction if unprivileged. */
> > -    if ( !cache_flush_permitted(current->domain) )
> > +    /*
> > +     * Ignore the instruction if domain doesn't have cache control.
> > +     * Non-physdev domain attempted WBINVD; ignore for now since
> > +     * newer linux uses this in some start-of-day timing loops.
> 
> That's very old comment, and hence I think "newer" isn't quite applicable
> anymore. Either omit the word (if Linux still does so), or say "older"
> instead? Also since you touch the comment, upper-casing the L in Linux
> might be nice.

There's a wbinvd at the beginning of trampoline_start, which is
possibly to what this comment is referring to?

I will just drop the mention of "new" or "old" and capitalize the L in
Linux.

> > +     */
> > +    if ( cache_flush_permitted(current->domain) )
> >          /*
> > -         * Non-physdev domain attempted WBINVD; ignore for now since
> > -         * newer linux uses this in some start-of-day timing loops.
> > +         * Handle wbnoinvd as wbinvd, at the expense of higher cost.  Broadcast
> > +         * the flush to all pCPUs, Xen doesn't track where the vCPU has ran
> > +         * previously.
> >           */
> > -        ;
> > -    else if ( op == x86emul_wbnoinvd /* && cpu_has_wbnoinvd */ )
> > -        wbnoinvd();
> 
> So this goes away altogether, which isn't nice. It was "only" 2 years ago that
> I posted a series where an additional ...
> 
> > -    else
> > -        wbinvd();
> > +        flush_all(FLUSH_CACHE);
> 
> ... FLUSH_CACHE_WRITEBACK is introduced [1]. I really, really think that should
> go in first, for it to then be used here. The more that it's the 1st patch in
> that series.

Saw that series when doing this, I was going to ask about them but you
where away and then I forgot about it.

Let me take a look now.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 3/9] xen/gnttab: limit cache flush operation to guests allowed cache control
  2025-05-06  8:31 [PATCH 0/8] xen: cache control improvements Roger Pau Monne
  2025-05-06  8:31 ` [PATCH 1/9] x86/pv: fix MMUEXT_FLUSH_CACHE to flush all pCPU caches Roger Pau Monne
  2025-05-06  8:31 ` [PATCH 2/9] x86/pv: fix emulation of wb{,no}invd " Roger Pau Monne
@ 2025-05-06  8:31 ` Roger Pau Monne
  2025-05-06 10:15   ` Julien Grall
  2025-05-06  8:31 ` [PATCH 4/9] x86/gnttab: do not implement GNTTABOP_cache_flush Roger Pau Monne
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 50+ messages in thread
From: Roger Pau Monne @ 2025-05-06  8:31 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini

Whether a domain is allowed to issue cache-control operations is reported
by the cache_flush_permitted() check.  Introduce such check to limit the
availability of GNTTABOP_cache_flush to only guests that are granted cache
control.

Fixes: 18e8d22fe750 ("introduce GNTTABOP_cache_flush")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/common/grant_table.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index e75ff98aff1c..d874ac5f1241 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -3780,6 +3780,11 @@ long do_grant_table_op(
 
         if ( unlikely(!guest_handle_okay(cflush, count)) )
             goto out;
+
+        rc = -EPERM;
+        if ( !cache_flush_permitted(current->domain) )
+            goto out;
+
         rc = gnttab_cache_flush(cflush, &opaque_in, count);
         if ( rc >= 0 )
         {
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/9] xen/gnttab: limit cache flush operation to guests allowed cache control
  2025-05-06  8:31 ` [PATCH 3/9] xen/gnttab: limit cache flush operation to guests allowed cache control Roger Pau Monne
@ 2025-05-06 10:15   ` Julien Grall
  2025-05-06 10:40     ` Roger Pau Monné
  0 siblings, 1 reply; 50+ messages in thread
From: Julien Grall @ 2025-05-06 10:15 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Stefano Stabellini

Hi Roger,

On 06/05/2025 09:31, Roger Pau Monne wrote:
> Whether a domain is allowed to issue cache-control operations is reported
> by the cache_flush_permitted() check.  Introduce such check to limit the
> availability of GNTTABOP_cache_flush to only guests that are granted cache
> control.

Can you outline what's the problem you are trying to solve? Asking, 
because I don't see the problem of allowing any guest calling 
GNTTABOP_cache_flush on Arm from any domains.

Cheers,

-- 
Julien Grall



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/9] xen/gnttab: limit cache flush operation to guests allowed cache control
  2025-05-06 10:15   ` Julien Grall
@ 2025-05-06 10:40     ` Roger Pau Monné
  2025-05-12 14:23       ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Roger Pau Monné @ 2025-05-06 10:40 UTC (permalink / raw)
  To: Julien Grall
  Cc: xen-devel, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Stefano Stabellini

On Tue, May 06, 2025 at 11:15:09AM +0100, Julien Grall wrote:
> Hi Roger,
> 
> On 06/05/2025 09:31, Roger Pau Monne wrote:
> > Whether a domain is allowed to issue cache-control operations is reported
> > by the cache_flush_permitted() check.  Introduce such check to limit the
> > availability of GNTTABOP_cache_flush to only guests that are granted cache
> > control.
> 
> Can you outline what's the problem you are trying to solve? Asking, because
> I don't see the problem of allowing any guest calling GNTTABOP_cache_flush
> on Arm from any domains.

At least on x86 cache flush operations are restricted to guests for
which cache_flush_permitted() returns true.  I've assumed the same
would apply to Arm, since cache_flush_permitted() is also defined
there.  If it's fine to issue cache flush operations from any guests
on ARM, I suggest cache_flush_permitted() should unconditionally
return true then.

The problem on x86 is that it's an expensive operation when done
correctly, as it involves flushing the caches of all pCPUs where the
vCPU has been scheduled.  Note however the implementation of
GNTTABOP_cache_flush is incorrect on x86, and won't work as
expected.

Thanks, Roger.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/9] xen/gnttab: limit cache flush operation to guests allowed cache control
  2025-05-06 10:40     ` Roger Pau Monné
@ 2025-05-12 14:23       ` Jan Beulich
  0 siblings, 0 replies; 50+ messages in thread
From: Jan Beulich @ 2025-05-12 14:23 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Stefano Stabellini, Julien Grall

On 06.05.2025 12:40, Roger Pau Monné wrote:
> On Tue, May 06, 2025 at 11:15:09AM +0100, Julien Grall wrote:
>> On 06/05/2025 09:31, Roger Pau Monne wrote:
>>> Whether a domain is allowed to issue cache-control operations is reported
>>> by the cache_flush_permitted() check.  Introduce such check to limit the
>>> availability of GNTTABOP_cache_flush to only guests that are granted cache
>>> control.
>>
>> Can you outline what's the problem you are trying to solve? Asking, because
>> I don't see the problem of allowing any guest calling GNTTABOP_cache_flush
>> on Arm from any domains.
> 
> At least on x86 cache flush operations are restricted to guests for
> which cache_flush_permitted() returns true.  I've assumed the same
> would apply to Arm, since cache_flush_permitted() is also defined
> there.  If it's fine to issue cache flush operations from any guests
> on ARM, I suggest cache_flush_permitted() should unconditionally
> return true then.
> 
> The problem on x86 is that it's an expensive operation when done
> correctly, as it involves flushing the caches of all pCPUs where the
> vCPU has been scheduled.  Note however the implementation of
> GNTTABOP_cache_flush is incorrect on x86, and won't work as
> expected.

So instead of altering Arm behavior, how about rejecting GNTTABOP_cache_flush
on x86 then? It was introduced specifically for Arm, and it shouldn't have
gained any users (albeit of course we can't be sure of that).

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 4/9] x86/gnttab: do not implement GNTTABOP_cache_flush
  2025-05-06  8:31 [PATCH 0/8] xen: cache control improvements Roger Pau Monne
                   ` (2 preceding siblings ...)
  2025-05-06  8:31 ` [PATCH 3/9] xen/gnttab: limit cache flush operation to guests allowed cache control Roger Pau Monne
@ 2025-05-06  8:31 ` Roger Pau Monne
  2025-05-12 14:27   ` Jan Beulich
  2025-05-06  8:31 ` [PATCH 5/9] x86/mtrr: use memory_type_changed() in hvm_set_mem_pinned_cacheattr() Roger Pau Monne
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 50+ messages in thread
From: Roger Pau Monne @ 2025-05-06  8:31 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Julien Grall, Stefano Stabellini

The current underlying implementation of GNTTABOP_cache_flush on x86 won't
work as expected.  The provided {clean,invalidate}_dcache_va_range()
helpers only do a local pCPU cache flush, so the cache of previous pCPUs
where the vCPU might have run are not flushed.

However instead of attempting to fix this, make the GNTTABOP_cache_flush
operation only available to ARM.  There are no known users on x86, the
implementation is broken, and other architectures don't have grant-table
support yet.

With that operation not implemented on x86, the related
{clean,invalidate}_dcache_va_range() helpers can also be removed.

Fixes: f62dc81c2df7 ("x86: introduce more cache maintenance operations")
Fixes: 18e8d22fe750 ("introduce GNTTABOP_cache_flush")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
I've attempted to introduce a new arch_do_grant_table_op() in a separate
arch-specific file, but it required exposing too much functionality from
the common grant_table.c, ifdefying is probably better for the time being.
---
 xen/arch/x86/include/asm/flushtlb.h | 15 ---------------
 xen/common/grant_table.c            |  6 ++++++
 2 files changed, 6 insertions(+), 15 deletions(-)

diff --git a/xen/arch/x86/include/asm/flushtlb.h b/xen/arch/x86/include/asm/flushtlb.h
index bb0ad58db49b..d0c9120b5faf 100644
--- a/xen/arch/x86/include/asm/flushtlb.h
+++ b/xen/arch/x86/include/asm/flushtlb.h
@@ -182,21 +182,6 @@ void flush_area_mask(const cpumask_t *mask, const void *va,
 }
 
 static inline void flush_page_to_ram(unsigned long mfn, bool sync_icache) {}
-static inline int invalidate_dcache_va_range(const void *p,
-                                             unsigned long size)
-{ return -EOPNOTSUPP; }
-static inline int clean_and_invalidate_dcache_va_range(const void *p,
-                                                       unsigned long size)
-{
-    unsigned int order = get_order_from_bytes(size);
-    /* sub-page granularity support needs to be added if necessary */
-    flush_area_local(p, FLUSH_CACHE|FLUSH_ORDER(order));
-    return 0;
-}
-static inline int clean_dcache_va_range(const void *p, unsigned long size)
-{
-    return clean_and_invalidate_dcache_va_range(p, size);
-}
 
 unsigned int guest_flush_tlb_flags(const struct domain *d);
 void guest_flush_tlb_mask(const struct domain *d, const cpumask_t *mask);
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index d874ac5f1241..cc7c7d004065 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -940,6 +940,7 @@ static void reduce_status_for_pin(struct domain *rd,
         gnttab_clear_flags(rd, clear_flags, status);
 }
 
+#ifdef CONFIG_ARM
 static struct active_grant_entry *grant_map_exists(const struct domain *ld,
                                                    struct grant_table *rgt,
                                                    mfn_t mfn,
@@ -975,6 +976,7 @@ static struct active_grant_entry *grant_map_exists(const struct domain *ld,
 
     return ERR_PTR(-EINVAL);
 }
+#endif /* CONFIG_ARM */
 
 union maptrack_node {
     struct {
@@ -3520,6 +3522,7 @@ gnttab_swap_grant_ref(XEN_GUEST_HANDLE_PARAM(gnttab_swap_grant_ref_t) uop,
     return 0;
 }
 
+#ifdef CONFIG_ARM
 static int _cache_flush(const gnttab_cache_flush_t *cflush, grant_ref_t *cur_ref)
 {
     struct domain *d, *owner;
@@ -3631,6 +3634,7 @@ gnttab_cache_flush(XEN_GUEST_HANDLE_PARAM(gnttab_cache_flush_t) uop,
 
     return 0;
 }
+#endif /* CONFIG_ARM */
 
 long do_grant_table_op(
     unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) uop, unsigned int count)
@@ -3773,6 +3777,7 @@ long do_grant_table_op(
         break;
     }
 
+#ifdef CONFIG_ARM
     case GNTTABOP_cache_flush:
     {
         XEN_GUEST_HANDLE_PARAM(gnttab_cache_flush_t) cflush =
@@ -3794,6 +3799,7 @@ long do_grant_table_op(
         }
         break;
     }
+#endif
 
     default:
         rc = -ENOSYS;
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 4/9] x86/gnttab: do not implement GNTTABOP_cache_flush
  2025-05-06  8:31 ` [PATCH 4/9] x86/gnttab: do not implement GNTTABOP_cache_flush Roger Pau Monne
@ 2025-05-12 14:27   ` Jan Beulich
  0 siblings, 0 replies; 50+ messages in thread
From: Jan Beulich @ 2025-05-12 14:27 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On 06.05.2025 10:31, Roger Pau Monne wrote:
> The current underlying implementation of GNTTABOP_cache_flush on x86 won't
> work as expected.  The provided {clean,invalidate}_dcache_va_range()
> helpers only do a local pCPU cache flush, so the cache of previous pCPUs
> where the vCPU might have run are not flushed.
> 
> However instead of attempting to fix this, make the GNTTABOP_cache_flush
> operation only available to ARM.  There are no known users on x86, the
> implementation is broken, and other architectures don't have grant-table
> support yet.
> 
> With that operation not implemented on x86, the related
> {clean,invalidate}_dcache_va_range() helpers can also be removed.
> 
> Fixes: f62dc81c2df7 ("x86: introduce more cache maintenance operations")
> Fixes: 18e8d22fe750 ("introduce GNTTABOP_cache_flush")
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Ah, here we go. I think this is what we want, without patch 3. It will want
to come with a CHANGELOG entry, though.

> --- a/xen/common/grant_table.c
> +++ b/xen/common/grant_table.c
> @@ -940,6 +940,7 @@ static void reduce_status_for_pin(struct domain *rd,
>          gnttab_clear_flags(rd, clear_flags, status);
>  }
>  
> +#ifdef CONFIG_ARM

Better introduce a new Kconfig setting (e.g. HAS_GRANT_CACHE_FLUSH) right
away, in case RISC-V and/or PPC would also want such behavior? 

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 5/9] x86/mtrr: use memory_type_changed() in hvm_set_mem_pinned_cacheattr()
  2025-05-06  8:31 [PATCH 0/8] xen: cache control improvements Roger Pau Monne
                   ` (3 preceding siblings ...)
  2025-05-06  8:31 ` [PATCH 4/9] x86/gnttab: do not implement GNTTABOP_cache_flush Roger Pau Monne
@ 2025-05-06  8:31 ` Roger Pau Monne
  2025-05-12 15:04   ` Jan Beulich
  2025-05-16  6:58   ` Jan Beulich
  2025-05-06  8:31 ` [PATCH 6/9] x86/p2m: limit cache flush in memory_type_changed() Roger Pau Monne
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 50+ messages in thread
From: Roger Pau Monne @ 2025-05-06  8:31 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

The current logic partially open-codes memory_type_changed(), but doesn't
check whether the type change or the cache flush is actually needed.
Instead switch to using memory_type_changed(), at possibly a higher expense
cost of not exclusively issuing cache flushes when limiting cacheability.

However using memory_type_changed() has the benefit of limiting
recalculations and cache flushes to strictly only when it's meaningful due
to the guest configuration, like having devices or IO regions assigned.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/hvm/mtrr.c | 22 +++-------------------
 1 file changed, 3 insertions(+), 19 deletions(-)

diff --git a/xen/arch/x86/hvm/mtrr.c b/xen/arch/x86/hvm/mtrr.c
index 797f2ae7fd3a..b88e81eb44b1 100644
--- a/xen/arch/x86/hvm/mtrr.c
+++ b/xen/arch/x86/hvm/mtrr.c
@@ -605,22 +605,8 @@ int hvm_set_mem_pinned_cacheattr(struct domain *d, uint64_t gfn_start,
 
                 type = range->type;
                 call_rcu(&range->rcu, free_pinned_cacheattr_entry);
-                p2m_memory_type_changed(d);
-                switch ( type )
-                {
-                case X86_MT_UCM:
-                    /*
-                     * For EPT we can also avoid the flush in this case;
-                     * see epte_get_entry_emt().
-                     */
-                    if ( hap_enabled(d) && cpu_has_vmx )
-                case X86_MT_UC:
-                        break;
-                    /* fall through */
-                default:
-                    flush_all(FLUSH_CACHE);
-                    break;
-                }
+                memory_type_changed(d);
+
                 return 0;
             }
         domain_unlock(d);
@@ -682,9 +668,7 @@ int hvm_set_mem_pinned_cacheattr(struct domain *d, uint64_t gfn_start,
 
     xfree(newr);
 
-    p2m_memory_type_changed(d);
-    if ( type != X86_MT_WB )
-        flush_all(FLUSH_CACHE);
+    memory_type_changed(d);
 
     return rc;
 }
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/9] x86/mtrr: use memory_type_changed() in hvm_set_mem_pinned_cacheattr()
  2025-05-06  8:31 ` [PATCH 5/9] x86/mtrr: use memory_type_changed() in hvm_set_mem_pinned_cacheattr() Roger Pau Monne
@ 2025-05-12 15:04   ` Jan Beulich
  2025-05-15 10:22     ` Roger Pau Monné
  2025-05-16  6:58   ` Jan Beulich
  1 sibling, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2025-05-12 15:04 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 06.05.2025 10:31, Roger Pau Monne wrote:
> The current logic partially open-codes memory_type_changed(), but doesn't
> check whether the type change or the cache flush is actually needed.
> Instead switch to using memory_type_changed(), at possibly a higher expense
> cost of not exclusively issuing cache flushes when limiting cacheability.
> 
> However using memory_type_changed() has the benefit of limiting
> recalculations and cache flushes to strictly only when it's meaningful due
> to the guest configuration, like having devices or IO regions assigned.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Hmm, I'm not convinced this is a win. This operation isn't normally used on
a running guest, aiui.

What's more, this heavily conflicts with a patch posted (again) over 2 years
ago [1]. Unless there's something severely wrong with that (and unless your
patch would make that old one unnecessary), I'm again of the opinion that
that one should go in first. It is becoming increasingly noticeable that it's
unhelpful if posted patches aren't being looked at.

Jan

[1] https://lists.xen.org/archives/html/xen-devel/2023-03/msg01551.html



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/9] x86/mtrr: use memory_type_changed() in hvm_set_mem_pinned_cacheattr()
  2025-05-12 15:04   ` Jan Beulich
@ 2025-05-15 10:22     ` Roger Pau Monné
  2025-05-16  7:00       ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Roger Pau Monné @ 2025-05-15 10:22 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Mon, May 12, 2025 at 05:04:56PM +0200, Jan Beulich wrote:
> On 06.05.2025 10:31, Roger Pau Monne wrote:
> > The current logic partially open-codes memory_type_changed(), but doesn't
> > check whether the type change or the cache flush is actually needed.
> > Instead switch to using memory_type_changed(), at possibly a higher expense
> > cost of not exclusively issuing cache flushes when limiting cacheability.
> > 
> > However using memory_type_changed() has the benefit of limiting
> > recalculations and cache flushes to strictly only when it's meaningful due
> > to the guest configuration, like having devices or IO regions assigned.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Hmm, I'm not convinced this is a win. This operation isn't normally used on
> a running guest, aiui.
> 
> What's more, this heavily conflicts with a patch posted (again) over 2 years
> ago [1]. Unless there's something severely wrong with that (and unless your
> patch would make that old one unnecessary), I'm again of the opinion that
> that one should go in first. It is becoming increasingly noticeable that it's
> unhelpful if posted patches aren't being looked at.

I'm happy for your change to go in first, but I think that
memory_type_changed() should be adjusted to contain the extra checks
that you add in your patch, and then hvm_set_mem_pinned_cacheattr()
should be switched to use memory_type_changed() rather than
open-coding it.  For once it would miss the adjustment done to
memory_type_changed() to only flush the cache when there's a
meaningful change to the p2m (iow: p2m_memory_type_changed() is not a
no-op).

Thanks, Roger.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/9] x86/mtrr: use memory_type_changed() in hvm_set_mem_pinned_cacheattr()
  2025-05-15 10:22     ` Roger Pau Monné
@ 2025-05-16  7:00       ` Jan Beulich
  2025-05-16  7:57         ` Roger Pau Monné
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2025-05-16  7:00 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, xen-devel

On 15.05.2025 12:22, Roger Pau Monné wrote:
> On Mon, May 12, 2025 at 05:04:56PM +0200, Jan Beulich wrote:
>> On 06.05.2025 10:31, Roger Pau Monne wrote:
>>> The current logic partially open-codes memory_type_changed(), but doesn't
>>> check whether the type change or the cache flush is actually needed.
>>> Instead switch to using memory_type_changed(), at possibly a higher expense
>>> cost of not exclusively issuing cache flushes when limiting cacheability.
>>>
>>> However using memory_type_changed() has the benefit of limiting
>>> recalculations and cache flushes to strictly only when it's meaningful due
>>> to the guest configuration, like having devices or IO regions assigned.
>>>
>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>
>> Hmm, I'm not convinced this is a win. This operation isn't normally used on
>> a running guest, aiui.
>>
>> What's more, this heavily conflicts with a patch posted (again) over 2 years
>> ago [1]. Unless there's something severely wrong with that (and unless your
>> patch would make that old one unnecessary), I'm again of the opinion that
>> that one should go in first. It is becoming increasingly noticeable that it's
>> unhelpful if posted patches aren't being looked at.
> 
> I'm happy for your change to go in first, but I think that
> memory_type_changed() should be adjusted to contain the extra checks
> that you add in your patch, and then hvm_set_mem_pinned_cacheattr()
> should be switched to use memory_type_changed() rather than
> open-coding it.

Maybe, but see my other reply to your patch.

>  For once it would miss the adjustment done to
> memory_type_changed() to only flush the cache when there's a
> meaningful change to the p2m (iow: p2m_memory_type_changed() is not a
> no-op).

That could also be mirrored here, if there were (remaining) reasons to avoid
use of memory_type_changed() for this function.

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/9] x86/mtrr: use memory_type_changed() in hvm_set_mem_pinned_cacheattr()
  2025-05-16  7:00       ` Jan Beulich
@ 2025-05-16  7:57         ` Roger Pau Monné
  0 siblings, 0 replies; 50+ messages in thread
From: Roger Pau Monné @ 2025-05-16  7:57 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Fri, May 16, 2025 at 09:00:42AM +0200, Jan Beulich wrote:
> On 15.05.2025 12:22, Roger Pau Monné wrote:
> > On Mon, May 12, 2025 at 05:04:56PM +0200, Jan Beulich wrote:
> >> On 06.05.2025 10:31, Roger Pau Monne wrote:
> >>> The current logic partially open-codes memory_type_changed(), but doesn't
> >>> check whether the type change or the cache flush is actually needed.
> >>> Instead switch to using memory_type_changed(), at possibly a higher expense
> >>> cost of not exclusively issuing cache flushes when limiting cacheability.
> >>>
> >>> However using memory_type_changed() has the benefit of limiting
> >>> recalculations and cache flushes to strictly only when it's meaningful due
> >>> to the guest configuration, like having devices or IO regions assigned.
> >>>
> >>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> >>
> >> Hmm, I'm not convinced this is a win. This operation isn't normally used on
> >> a running guest, aiui.
> >>
> >> What's more, this heavily conflicts with a patch posted (again) over 2 years
> >> ago [1]. Unless there's something severely wrong with that (and unless your
> >> patch would make that old one unnecessary), I'm again of the opinion that
> >> that one should go in first. It is becoming increasingly noticeable that it's
> >> unhelpful if posted patches aren't being looked at.
> > 
> > I'm happy for your change to go in first, but I think that
> > memory_type_changed() should be adjusted to contain the extra checks
> > that you add in your patch, and then hvm_set_mem_pinned_cacheattr()
> > should be switched to use memory_type_changed() rather than
> > open-coding it.
> 
> Maybe, but see my other reply to your patch.
> 
> >  For once it would miss the adjustment done to
> > memory_type_changed() to only flush the cache when there's a
> > meaningful change to the p2m (iow: p2m_memory_type_changed() is not a
> > no-op).
> 
> That could also be mirrored here, if there were (remaining) reasons to avoid
> use of memory_type_changed() for this function.

Indeed, but it's just more open-coding of memory_type_changed() in the
context of hvm_set_mem_pinned_cacheattr().  IMO I don't know exactly
the usage of this function, it seems like it should be mostly used at
domain build time, or maybe when doing hotplug of a device, so not
very often.

Given how complicated cache management is, I would prefer if the logic
is limited to few places (like memory_type_changed()), instead of
having open-coded implementations of the same logic in multiple
places.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/9] x86/mtrr: use memory_type_changed() in hvm_set_mem_pinned_cacheattr()
  2025-05-06  8:31 ` [PATCH 5/9] x86/mtrr: use memory_type_changed() in hvm_set_mem_pinned_cacheattr() Roger Pau Monne
  2025-05-12 15:04   ` Jan Beulich
@ 2025-05-16  6:58   ` Jan Beulich
  2025-05-16  7:48     ` Roger Pau Monné
  1 sibling, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2025-05-16  6:58 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 06.05.2025 10:31, Roger Pau Monne wrote:
> --- a/xen/arch/x86/hvm/mtrr.c
> +++ b/xen/arch/x86/hvm/mtrr.c
> @@ -605,22 +605,8 @@ int hvm_set_mem_pinned_cacheattr(struct domain *d, uint64_t gfn_start,
>  
>                  type = range->type;
>                  call_rcu(&range->rcu, free_pinned_cacheattr_entry);
> -                p2m_memory_type_changed(d);
> -                switch ( type )
> -                {
> -                case X86_MT_UCM:
> -                    /*
> -                     * For EPT we can also avoid the flush in this case;
> -                     * see epte_get_entry_emt().
> -                     */
> -                    if ( hap_enabled(d) && cpu_has_vmx )
> -                case X86_MT_UC:
> -                        break;
> -                    /* fall through */
> -                default:
> -                    flush_all(FLUSH_CACHE);
> -                    break;
> -                }
> +                memory_type_changed(d);
> +
>                  return 0;
>              }

Hmm, so your consideration to avoid the "goto" in my patch was perhaps this
part of your change, where you expect the "return 0" could then stay here.
Maybe. However, you removing the switch() means we'd then also flush in
cases where currently (or with my change) we avoid doing so.

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/9] x86/mtrr: use memory_type_changed() in hvm_set_mem_pinned_cacheattr()
  2025-05-16  6:58   ` Jan Beulich
@ 2025-05-16  7:48     ` Roger Pau Monné
  2025-05-16  8:02       ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Roger Pau Monné @ 2025-05-16  7:48 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Fri, May 16, 2025 at 08:58:39AM +0200, Jan Beulich wrote:
> On 06.05.2025 10:31, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/hvm/mtrr.c
> > +++ b/xen/arch/x86/hvm/mtrr.c
> > @@ -605,22 +605,8 @@ int hvm_set_mem_pinned_cacheattr(struct domain *d, uint64_t gfn_start,
> >  
> >                  type = range->type;
> >                  call_rcu(&range->rcu, free_pinned_cacheattr_entry);
> > -                p2m_memory_type_changed(d);
> > -                switch ( type )
> > -                {
> > -                case X86_MT_UCM:
> > -                    /*
> > -                     * For EPT we can also avoid the flush in this case;
> > -                     * see epte_get_entry_emt().
> > -                     */
> > -                    if ( hap_enabled(d) && cpu_has_vmx )
> > -                case X86_MT_UC:
> > -                        break;
> > -                    /* fall through */
> > -                default:
> > -                    flush_all(FLUSH_CACHE);
> > -                    break;
> > -                }
> > +                memory_type_changed(d);
> > +
> >                  return 0;
> >              }
> 
> Hmm, so your consideration to avoid the "goto" in my patch was perhaps this
> part of your change, where you expect the "return 0" could then stay here.
> Maybe. However, you removing the switch() means we'd then also flush in
> cases where currently (or with my change) we avoid doing so.

Oh, yes, just replied to your previous email with this suggestion.

I don't have a strong opinion, but I don't think the fine grained
flush avoidance is really required.  What's more, if we want to call
memory_type_changed() we must do so for any changes, as the call to
p2m_memory_type_changed() must be done unconditionally regardless of
whether the specific MTRR type change could allow us to avoid the
flush.

Overall, I have the impression hvm_set_mem_pinned_cacheattr() should
only be used while building a domain, and hence the flush can likely
be skipped unconditionally, regardless of the type changes.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/9] x86/mtrr: use memory_type_changed() in hvm_set_mem_pinned_cacheattr()
  2025-05-16  7:48     ` Roger Pau Monné
@ 2025-05-16  8:02       ` Jan Beulich
  2025-05-16  8:45         ` Roger Pau Monné
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2025-05-16  8:02 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, xen-devel

On 16.05.2025 09:48, Roger Pau Monné wrote:
> Overall, I have the impression hvm_set_mem_pinned_cacheattr() should
> only be used while building a domain, and hence the flush can likely
> be skipped unconditionally, regardless of the type changes.

See my patch submission, which had this remark:

"Is it really sensible to add/remove ranges once the guest is already
 running? (If it is, limiting the scope of the flush would be nice, but
 would require knowing dirtyness for the domain wrt the caches, which
 currently we don't track.)"

As apparently we both agree, why don't we reject requests post-creation
then, and drop the flush? The one thing I'm uncertain about is whether
the DM would actually have carried out the operation strictly ahead of
the domain being first un-paused.

Jan

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/9] x86/mtrr: use memory_type_changed() in hvm_set_mem_pinned_cacheattr()
  2025-05-16  8:02       ` Jan Beulich
@ 2025-05-16  8:45         ` Roger Pau Monné
  0 siblings, 0 replies; 50+ messages in thread
From: Roger Pau Monné @ 2025-05-16  8:45 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Fri, May 16, 2025 at 10:02:22AM +0200, Jan Beulich wrote:
> On 16.05.2025 09:48, Roger Pau Monné wrote:
> > Overall, I have the impression hvm_set_mem_pinned_cacheattr() should
> > only be used while building a domain, and hence the flush can likely
> > be skipped unconditionally, regardless of the type changes.
> 
> See my patch submission, which had this remark:
> 
> "Is it really sensible to add/remove ranges once the guest is already
>  running? (If it is, limiting the scope of the flush would be nice, but
>  would require knowing dirtyness for the domain wrt the caches, which
>  currently we don't track.)"

Well, I had an extra patch to attempt to do track vCPU dirtyness wrt
to the caches.

> 
> As apparently we both agree, why don't we reject requests post-creation
> then, and drop the flush? The one thing I'm uncertain about is whether
> the DM would actually have carried out the operation strictly ahead of
> the domain being first un-paused.

I've looked at QEMU, and I cannot convince myself the operation
couldn't be used against live domains, it's part of a memory listener
hook.

What is also concerning is that this seems to be completely ignored
when using AMD NPT, I can only find references to
hvm_get_mem_pinned_cacheattr() in EPT and shadow code.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 6/9] x86/p2m: limit cache flush in memory_type_changed()
  2025-05-06  8:31 [PATCH 0/8] xen: cache control improvements Roger Pau Monne
                   ` (4 preceding siblings ...)
  2025-05-06  8:31 ` [PATCH 5/9] x86/mtrr: use memory_type_changed() in hvm_set_mem_pinned_cacheattr() Roger Pau Monne
@ 2025-05-06  8:31 ` Roger Pau Monne
  2025-05-12 15:10   ` Jan Beulich
  2025-05-06  8:31 ` [PATCH 7/9] xen/x86: rename cache_flush_permitted() to has_arch_io_resources() Roger Pau Monne
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 50+ messages in thread
From: Roger Pau Monne @ 2025-05-06  8:31 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

Only do the cache flush when there's a p2m type change to propagate,
otherwise there's no change in the p2m effective caching attributes.

If the p2m memory_type_changed hook is not set p2m_memory_type_changed() is
a no-op, no recalculation of caching attributes is needed, nor flushing of
the previous cache contents.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/hvm/mtrr.c        |  3 +--
 xen/arch/x86/include/asm/p2m.h |  2 +-
 xen/arch/x86/mm/p2m.c          | 13 ++++++++++++-
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/hvm/mtrr.c b/xen/arch/x86/hvm/mtrr.c
index b88e81eb44b1..e7acfb6e8dc4 100644
--- a/xen/arch/x86/hvm/mtrr.c
+++ b/xen/arch/x86/hvm/mtrr.c
@@ -767,9 +767,8 @@ HVM_REGISTER_SAVE_RESTORE(MTRR, hvm_save_mtrr_msr, NULL, hvm_load_mtrr_msr, 1,
 void memory_type_changed(struct domain *d)
 {
     if ( (is_iommu_enabled(d) || cache_flush_permitted(d)) &&
-         d->vcpu && d->vcpu[0] )
+         d->vcpu && d->vcpu[0] && p2m_memory_type_changed(d) )
     {
-        p2m_memory_type_changed(d);
         flush_all(FLUSH_CACHE);
     }
 }
diff --git a/xen/arch/x86/include/asm/p2m.h b/xen/arch/x86/include/asm/p2m.h
index b9ce7d8705ba..4358cc15a2a1 100644
--- a/xen/arch/x86/include/asm/p2m.h
+++ b/xen/arch/x86/include/asm/p2m.h
@@ -700,7 +700,7 @@ void p2m_pod_dump_data(struct domain *d);
 #ifdef CONFIG_HVM
 
 /* Report a change affecting memory types. */
-void p2m_memory_type_changed(struct domain *d);
+bool p2m_memory_type_changed(struct domain *d);
 
 /* Called by p2m code when demand-populating a PoD page */
 bool
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 3a39b5d1246b..b9a7c2dc5302 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -126,12 +126,21 @@ static void _memory_type_changed(struct p2m_domain *p2m)
 {
     if ( p2m->memory_type_changed )
         p2m->memory_type_changed(p2m);
+    else
+        ASSERT_UNREACHABLE();
 }
 
-void p2m_memory_type_changed(struct domain *d)
+bool p2m_memory_type_changed(struct domain *d)
 {
     struct p2m_domain *hostp2m = p2m_get_hostp2m(d);
 
+    /*
+     * The p2m memory_type_changed hook will be the same for the host p2m or
+     * the altp2ms, do the check early and return if not set.
+     */
+    if ( !hostp2m->memory_type_changed )
+        return false;
+
     p2m_lock(hostp2m);
 
     _memory_type_changed(hostp2m);
@@ -154,6 +163,8 @@ void p2m_memory_type_changed(struct domain *d)
     }
 
     p2m_unlock(hostp2m);
+
+    return true;
 }
 
 int p2m_set_ioreq_server(struct domain *d,
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 6/9] x86/p2m: limit cache flush in memory_type_changed()
  2025-05-06  8:31 ` [PATCH 6/9] x86/p2m: limit cache flush in memory_type_changed() Roger Pau Monne
@ 2025-05-12 15:10   ` Jan Beulich
  0 siblings, 0 replies; 50+ messages in thread
From: Jan Beulich @ 2025-05-12 15:10 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 06.05.2025 10:31, Roger Pau Monne wrote:
> Only do the cache flush when there's a p2m type change to propagate,
> otherwise there's no change in the p2m effective caching attributes.
> 
> If the p2m memory_type_changed hook is not set p2m_memory_type_changed() is
> a no-op, no recalculation of caching attributes is needed, nor flushing of
> the previous cache contents.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>




^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 7/9] xen/x86: rename cache_flush_permitted() to has_arch_io_resources()
  2025-05-06  8:31 [PATCH 0/8] xen: cache control improvements Roger Pau Monne
                   ` (5 preceding siblings ...)
  2025-05-06  8:31 ` [PATCH 6/9] x86/p2m: limit cache flush in memory_type_changed() Roger Pau Monne
@ 2025-05-06  8:31 ` Roger Pau Monne
  2025-05-12 15:16   ` Jan Beulich
  2025-05-06  8:31 ` [PATCH 8/9] xen: introduce flag when a domain requires cache control Roger Pau Monne
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 50+ messages in thread
From: Roger Pau Monne @ 2025-05-06  8:31 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Julien Grall, Stefano Stabellini

To better describe the underlying implementation.  Define
cache_flush_permitted() as an alias of has_arch_io_resources(), so that
current users of cache_flush_permitted() are not effectively modified.

With the introduction of the new handler, change some of the call sites of
cache_flush_permitted() to instead use has_arch_io_resources() as such
callers are not after whether cache flush is enabled, but rather whether
the domain has any IO resources assigned.

Take the opportunity to adjust l1_disallow_mask() to use the newly
introduced has_arch_io_resources() macro.

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/include/asm/iocap.h | 4 +++-
 xen/arch/x86/mm.c                | 3 +--
 xen/arch/x86/mm/p2m-pod.c        | 4 ++--
 xen/common/memory.c              | 2 +-
 xen/include/asm-generic/iocap.h  | 4 +++-
 5 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/include/asm/iocap.h b/xen/arch/x86/include/asm/iocap.h
index 53d87ae8a334..61d026dbf5f6 100644
--- a/xen/arch/x86/include/asm/iocap.h
+++ b/xen/arch/x86/include/asm/iocap.h
@@ -15,10 +15,12 @@
 #define ioports_access_permitted(d, s, e)               \
     rangeset_contains_range((d)->arch.ioport_caps, s, e)
 
-#define cache_flush_permitted(d)                        \
+#define has_arch_io_resources(d)                        \
     (!rangeset_is_empty((d)->iomem_caps) ||             \
      !rangeset_is_empty((d)->arch.ioport_caps))
 
+#define cache_flush_permitted has_arch_io_resources
+
 static inline int ioports_permit_access(struct domain *d, unsigned long s,
                                         unsigned long e)
 {
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 38e214352201..59b60b1e62a7 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -172,8 +172,7 @@ static DEFINE_SPINLOCK(subpage_ro_lock);
 
 #define l1_disallow_mask(d)                                     \
     (((d) != dom_io) &&                                         \
-     (rangeset_is_empty((d)->iomem_caps) &&                     \
-      rangeset_is_empty((d)->arch.ioport_caps) &&               \
+     (!has_arch_io_resources(d) &&                              \
       !has_arch_pdevs(d) &&                                     \
       is_pv_domain(d)) ?                                        \
      L1_DISALLOW_MASK : (L1_DISALLOW_MASK & ~PAGE_CACHE_ATTRS))
diff --git a/xen/arch/x86/mm/p2m-pod.c b/xen/arch/x86/mm/p2m-pod.c
index df2a1cc0749b..05633fe2ac88 100644
--- a/xen/arch/x86/mm/p2m-pod.c
+++ b/xen/arch/x86/mm/p2m-pod.c
@@ -338,7 +338,7 @@ p2m_pod_set_mem_target(struct domain *d, unsigned long target)
 
     ASSERT( pod_target >= p2m->pod.count );
 
-    if ( has_arch_pdevs(d) || cache_flush_permitted(d) )
+    if ( has_arch_pdevs(d) || has_arch_io_resources(d) )
         ret = -ENOTEMPTY;
     else
         ret = p2m_pod_set_cache_target(p2m, pod_target, 1/*preemptible*/);
@@ -1395,7 +1395,7 @@ guest_physmap_mark_populate_on_demand(struct domain *d, unsigned long gfn,
     if ( !paging_mode_translate(d) )
         return -EINVAL;
 
-    if ( has_arch_pdevs(d) || cache_flush_permitted(d) )
+    if ( has_arch_pdevs(d) || has_arch_io_resources(d) )
         return -ENOTEMPTY;
 
     do {
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 8ca4e1a8425b..46620ed8253d 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -86,7 +86,7 @@ static unsigned int max_order(const struct domain *d)
     unsigned int order = domu_max_order;
 
 #ifdef CONFIG_HAS_PASSTHROUGH
-    if ( cache_flush_permitted(d) && order < ptdom_max_order )
+    if ( has_arch_io_resources(d) && order < ptdom_max_order )
         order = ptdom_max_order;
 #endif
 
diff --git a/xen/include/asm-generic/iocap.h b/xen/include/asm-generic/iocap.h
index dd7cb45488f7..664bbc8971fe 100644
--- a/xen/include/asm-generic/iocap.h
+++ b/xen/include/asm-generic/iocap.h
@@ -2,9 +2,11 @@
 #ifndef __ASM_GENERIC_IOCAP_H__
 #define __ASM_GENERIC_IOCAP_H__
 
-#define cache_flush_permitted(d)                        \
+#define has_arch_io_resources(d)                        \
     (!rangeset_is_empty((d)->iomem_caps))
 
+#define cache_flush_permitted has_arch_io_resources
+
 #endif /* __ASM_GENERIC_IOCAP_H__ */
 
 /*
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] xen/x86: rename cache_flush_permitted() to has_arch_io_resources()
  2025-05-06  8:31 ` [PATCH 7/9] xen/x86: rename cache_flush_permitted() to has_arch_io_resources() Roger Pau Monne
@ 2025-05-12 15:16   ` Jan Beulich
  2025-05-15 10:28     ` Roger Pau Monné
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2025-05-12 15:16 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On 06.05.2025 10:31, Roger Pau Monne wrote:
> To better describe the underlying implementation.  Define
> cache_flush_permitted() as an alias of has_arch_io_resources(), so that
> current users of cache_flush_permitted() are not effectively modified.
> 
> With the introduction of the new handler, change some of the call sites of
> cache_flush_permitted() to instead use has_arch_io_resources() as such
> callers are not after whether cache flush is enabled, but rather whether
> the domain has any IO resources assigned.
> 
> Take the opportunity to adjust l1_disallow_mask() to use the newly
> introduced has_arch_io_resources() macro.

While I'm happy with everything else here, to me it's at least on the
edge whether cache_flush_permitted() wouldn't be the better predicate
to use there, for this being about ...

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -172,8 +172,7 @@ static DEFINE_SPINLOCK(subpage_ro_lock);
>  
>  #define l1_disallow_mask(d)                                     \
>      (((d) != dom_io) &&                                         \
> -     (rangeset_is_empty((d)->iomem_caps) &&                     \
> -      rangeset_is_empty((d)->arch.ioport_caps) &&               \
> +     (!has_arch_io_resources(d) &&                              \
>        !has_arch_pdevs(d) &&                                     \
>        is_pv_domain(d)) ?                                        \
>       L1_DISALLOW_MASK : (L1_DISALLOW_MASK & ~PAGE_CACHE_ATTRS))

... cachability, which goes hand in hand with the ability to also
flush cache contents.

Tangentially - is it plausible for has_arch_io_resources() to return
false when has_arch_pdevs() returns true? Perhaps there are exotic
PCI devices (but non-bridges) which work with no BARs at all ...

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] xen/x86: rename cache_flush_permitted() to has_arch_io_resources()
  2025-05-12 15:16   ` Jan Beulich
@ 2025-05-15 10:28     ` Roger Pau Monné
  2025-05-16  7:07       ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Roger Pau Monné @ 2025-05-15 10:28 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On Mon, May 12, 2025 at 05:16:02PM +0200, Jan Beulich wrote:
> On 06.05.2025 10:31, Roger Pau Monne wrote:
> > To better describe the underlying implementation.  Define
> > cache_flush_permitted() as an alias of has_arch_io_resources(), so that
> > current users of cache_flush_permitted() are not effectively modified.
> > 
> > With the introduction of the new handler, change some of the call sites of
> > cache_flush_permitted() to instead use has_arch_io_resources() as such
> > callers are not after whether cache flush is enabled, but rather whether
> > the domain has any IO resources assigned.
> > 
> > Take the opportunity to adjust l1_disallow_mask() to use the newly
> > introduced has_arch_io_resources() macro.
> 
> While I'm happy with everything else here, to me it's at least on the
> edge whether cache_flush_permitted() wouldn't be the better predicate
> to use there, for this being about ...
> 
> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -172,8 +172,7 @@ static DEFINE_SPINLOCK(subpage_ro_lock);
> >  
> >  #define l1_disallow_mask(d)                                     \
> >      (((d) != dom_io) &&                                         \
> > -     (rangeset_is_empty((d)->iomem_caps) &&                     \
> > -      rangeset_is_empty((d)->arch.ioport_caps) &&               \
> > +     (!has_arch_io_resources(d) &&                              \
> >        !has_arch_pdevs(d) &&                                     \
> >        is_pv_domain(d)) ?                                        \
> >       L1_DISALLOW_MASK : (L1_DISALLOW_MASK & ~PAGE_CACHE_ATTRS))
> 
> ... cachability, which goes hand in hand with the ability to also
> flush cache contents.

Hm, I was on the edge here, in fact I've previously coded this using
cache_flush_permitted(), just to the change back to
has_arch_io_resources().  If you think cache_flush_permitted() is
better I'm fine with that.

> Tangentially - is it plausible for has_arch_io_resources() to return
> false when has_arch_pdevs() returns true? Perhaps there are exotic
> PCI devices (but non-bridges) which work with no BARs at all ...

I guess it's technically possible, albeit very unlikely?  How would
the OS interact with such device then, exclusively with PCI config
space accesses?

I'm happy to just use cache_flush_permitted() which is likely more
correct given the context here.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] xen/x86: rename cache_flush_permitted() to has_arch_io_resources()
  2025-05-15 10:28     ` Roger Pau Monné
@ 2025-05-16  7:07       ` Jan Beulich
  2025-05-16  8:02         ` Roger Pau Monné
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2025-05-16  7:07 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On 15.05.2025 12:28, Roger Pau Monné wrote:
> On Mon, May 12, 2025 at 05:16:02PM +0200, Jan Beulich wrote:
>> On 06.05.2025 10:31, Roger Pau Monne wrote:
>>> To better describe the underlying implementation.  Define
>>> cache_flush_permitted() as an alias of has_arch_io_resources(), so that
>>> current users of cache_flush_permitted() are not effectively modified.
>>>
>>> With the introduction of the new handler, change some of the call sites of
>>> cache_flush_permitted() to instead use has_arch_io_resources() as such
>>> callers are not after whether cache flush is enabled, but rather whether
>>> the domain has any IO resources assigned.
>>>
>>> Take the opportunity to adjust l1_disallow_mask() to use the newly
>>> introduced has_arch_io_resources() macro.
>>
>> While I'm happy with everything else here, to me it's at least on the
>> edge whether cache_flush_permitted() wouldn't be the better predicate
>> to use there, for this being about ...
>>
>>> --- a/xen/arch/x86/mm.c
>>> +++ b/xen/arch/x86/mm.c
>>> @@ -172,8 +172,7 @@ static DEFINE_SPINLOCK(subpage_ro_lock);
>>>  
>>>  #define l1_disallow_mask(d)                                     \
>>>      (((d) != dom_io) &&                                         \
>>> -     (rangeset_is_empty((d)->iomem_caps) &&                     \
>>> -      rangeset_is_empty((d)->arch.ioport_caps) &&               \
>>> +     (!has_arch_io_resources(d) &&                              \
>>>        !has_arch_pdevs(d) &&                                     \
>>>        is_pv_domain(d)) ?                                        \
>>>       L1_DISALLOW_MASK : (L1_DISALLOW_MASK & ~PAGE_CACHE_ATTRS))
>>
>> ... cachability, which goes hand in hand with the ability to also
>> flush cache contents.
> 
> Hm, I was on the edge here, in fact I've previously coded this using
> cache_flush_permitted(), just to the change back to
> has_arch_io_resources().  If you think cache_flush_permitted() is
> better I'm fine with that.

I think that would be better here, yet as you say - it's not entirely
clear cut either way.

>> Tangentially - is it plausible for has_arch_io_resources() to return
>> false when has_arch_pdevs() returns true? Perhaps there are exotic
>> PCI devices (but non-bridges) which work with no BARs at all ...
> 
> I guess it's technically possible, albeit very unlikely?  How would
> the OS interact with such device then, exclusively with PCI config
> space accesses?

Yes, that's what I'd expect for such devices. Looking around, there
are numerous such devices (leaving aside bridges). Just that it looks
implausible to me that one would want to pass those through to a guest.

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] xen/x86: rename cache_flush_permitted() to has_arch_io_resources()
  2025-05-16  7:07       ` Jan Beulich
@ 2025-05-16  8:02         ` Roger Pau Monné
  2025-05-16  8:08           ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Roger Pau Monné @ 2025-05-16  8:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On Fri, May 16, 2025 at 09:07:43AM +0200, Jan Beulich wrote:
> On 15.05.2025 12:28, Roger Pau Monné wrote:
> > On Mon, May 12, 2025 at 05:16:02PM +0200, Jan Beulich wrote:
> >> On 06.05.2025 10:31, Roger Pau Monne wrote:
> >>> To better describe the underlying implementation.  Define
> >>> cache_flush_permitted() as an alias of has_arch_io_resources(), so that
> >>> current users of cache_flush_permitted() are not effectively modified.
> >>>
> >>> With the introduction of the new handler, change some of the call sites of
> >>> cache_flush_permitted() to instead use has_arch_io_resources() as such
> >>> callers are not after whether cache flush is enabled, but rather whether
> >>> the domain has any IO resources assigned.
> >>>
> >>> Take the opportunity to adjust l1_disallow_mask() to use the newly
> >>> introduced has_arch_io_resources() macro.
> >>
> >> While I'm happy with everything else here, to me it's at least on the
> >> edge whether cache_flush_permitted() wouldn't be the better predicate
> >> to use there, for this being about ...
> >>
> >>> --- a/xen/arch/x86/mm.c
> >>> +++ b/xen/arch/x86/mm.c
> >>> @@ -172,8 +172,7 @@ static DEFINE_SPINLOCK(subpage_ro_lock);
> >>>  
> >>>  #define l1_disallow_mask(d)                                     \
> >>>      (((d) != dom_io) &&                                         \
> >>> -     (rangeset_is_empty((d)->iomem_caps) &&                     \
> >>> -      rangeset_is_empty((d)->arch.ioport_caps) &&               \
> >>> +     (!has_arch_io_resources(d) &&                              \
> >>>        !has_arch_pdevs(d) &&                                     \
> >>>        is_pv_domain(d)) ?                                        \
> >>>       L1_DISALLOW_MASK : (L1_DISALLOW_MASK & ~PAGE_CACHE_ATTRS))
> >>
> >> ... cachability, which goes hand in hand with the ability to also
> >> flush cache contents.
> > 
> > Hm, I was on the edge here, in fact I've previously coded this using
> > cache_flush_permitted(), just to the change back to
> > has_arch_io_resources().  If you think cache_flush_permitted() is
> > better I'm fine with that.
> 
> I think that would be better here, yet as you say - it's not entirely
> clear cut either way.

I've reverted this chunk of the change and left the code as-is for the
time being.

> >> Tangentially - is it plausible for has_arch_io_resources() to return
> >> false when has_arch_pdevs() returns true? Perhaps there are exotic
> >> PCI devices (but non-bridges) which work with no BARs at all ...
> > 
> > I guess it's technically possible, albeit very unlikely?  How would
> > the OS interact with such device then, exclusively with PCI config
> > space accesses?
> 
> Yes, that's what I'd expect for such devices. Looking around, there
> are numerous such devices (leaving aside bridges). Just that it looks
> implausible to me that one would want to pass those through to a guest.

Well, we also need to consider dom0 here (either PV or PVH), which
will get those devices passed through.  I assume those are mostly
system devices, and hence there's usually no interaction of the OS
with them.

I'm thinking that our definition of cache_flush_permitted() is not
fully accurate then, we would need to also account for any PCI devices
being assigned to the guest, even if those have no IO resources?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] xen/x86: rename cache_flush_permitted() to has_arch_io_resources()
  2025-05-16  8:02         ` Roger Pau Monné
@ 2025-05-16  8:08           ` Jan Beulich
  2025-05-16  8:27             ` Roger Pau Monné
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2025-05-16  8:08 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On 16.05.2025 10:02, Roger Pau Monné wrote:
> On Fri, May 16, 2025 at 09:07:43AM +0200, Jan Beulich wrote:
>> On 15.05.2025 12:28, Roger Pau Monné wrote:
>>> On Mon, May 12, 2025 at 05:16:02PM +0200, Jan Beulich wrote:
>>>> On 06.05.2025 10:31, Roger Pau Monne wrote:
>>>>> To better describe the underlying implementation.  Define
>>>>> cache_flush_permitted() as an alias of has_arch_io_resources(), so that
>>>>> current users of cache_flush_permitted() are not effectively modified.
>>>>>
>>>>> With the introduction of the new handler, change some of the call sites of
>>>>> cache_flush_permitted() to instead use has_arch_io_resources() as such
>>>>> callers are not after whether cache flush is enabled, but rather whether
>>>>> the domain has any IO resources assigned.
>>>>>
>>>>> Take the opportunity to adjust l1_disallow_mask() to use the newly
>>>>> introduced has_arch_io_resources() macro.
>>>>
>>>> While I'm happy with everything else here, to me it's at least on the
>>>> edge whether cache_flush_permitted() wouldn't be the better predicate
>>>> to use there, for this being about ...
>>>>
>>>>> --- a/xen/arch/x86/mm.c
>>>>> +++ b/xen/arch/x86/mm.c
>>>>> @@ -172,8 +172,7 @@ static DEFINE_SPINLOCK(subpage_ro_lock);
>>>>>  
>>>>>  #define l1_disallow_mask(d)                                     \
>>>>>      (((d) != dom_io) &&                                         \
>>>>> -     (rangeset_is_empty((d)->iomem_caps) &&                     \
>>>>> -      rangeset_is_empty((d)->arch.ioport_caps) &&               \
>>>>> +     (!has_arch_io_resources(d) &&                              \
>>>>>        !has_arch_pdevs(d) &&                                     \
>>>>>        is_pv_domain(d)) ?                                        \
>>>>>       L1_DISALLOW_MASK : (L1_DISALLOW_MASK & ~PAGE_CACHE_ATTRS))
>>>>
>>>> ... cachability, which goes hand in hand with the ability to also
>>>> flush cache contents.
>>>
>>> Hm, I was on the edge here, in fact I've previously coded this using
>>> cache_flush_permitted(), just to the change back to
>>> has_arch_io_resources().  If you think cache_flush_permitted() is
>>> better I'm fine with that.
>>
>> I think that would be better here, yet as you say - it's not entirely
>> clear cut either way.
> 
> I've reverted this chunk of the change and left the code as-is for the
> time being.

Didn't we agree to use cache_flush_permitted() here instead?

>>>> Tangentially - is it plausible for has_arch_io_resources() to return
>>>> false when has_arch_pdevs() returns true? Perhaps there are exotic
>>>> PCI devices (but non-bridges) which work with no BARs at all ...
>>>
>>> I guess it's technically possible, albeit very unlikely?  How would
>>> the OS interact with such device then, exclusively with PCI config
>>> space accesses?
>>
>> Yes, that's what I'd expect for such devices. Looking around, there
>> are numerous such devices (leaving aside bridges). Just that it looks
>> implausible to me that one would want to pass those through to a guest.
> 
> Well, we also need to consider dom0 here (either PV or PVH), which
> will get those devices passed through.  I assume those are mostly
> system devices, and hence there's usually no interaction of the OS
> with them.
> 
> I'm thinking that our definition of cache_flush_permitted() is not
> fully accurate then, we would need to also account for any PCI devices
> being assigned to the guest, even if those have no IO resources?

I think so, yes.

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] xen/x86: rename cache_flush_permitted() to has_arch_io_resources()
  2025-05-16  8:08           ` Jan Beulich
@ 2025-05-16  8:27             ` Roger Pau Monné
  2025-05-16  8:36               ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Roger Pau Monné @ 2025-05-16  8:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On Fri, May 16, 2025 at 10:08:35AM +0200, Jan Beulich wrote:
> On 16.05.2025 10:02, Roger Pau Monné wrote:
> > On Fri, May 16, 2025 at 09:07:43AM +0200, Jan Beulich wrote:
> >> On 15.05.2025 12:28, Roger Pau Monné wrote:
> >>> On Mon, May 12, 2025 at 05:16:02PM +0200, Jan Beulich wrote:
> >>>> On 06.05.2025 10:31, Roger Pau Monne wrote:
> >>>>> To better describe the underlying implementation.  Define
> >>>>> cache_flush_permitted() as an alias of has_arch_io_resources(), so that
> >>>>> current users of cache_flush_permitted() are not effectively modified.
> >>>>>
> >>>>> With the introduction of the new handler, change some of the call sites of
> >>>>> cache_flush_permitted() to instead use has_arch_io_resources() as such
> >>>>> callers are not after whether cache flush is enabled, but rather whether
> >>>>> the domain has any IO resources assigned.
> >>>>>
> >>>>> Take the opportunity to adjust l1_disallow_mask() to use the newly
> >>>>> introduced has_arch_io_resources() macro.
> >>>>
> >>>> While I'm happy with everything else here, to me it's at least on the
> >>>> edge whether cache_flush_permitted() wouldn't be the better predicate
> >>>> to use there, for this being about ...
> >>>>
> >>>>> --- a/xen/arch/x86/mm.c
> >>>>> +++ b/xen/arch/x86/mm.c
> >>>>> @@ -172,8 +172,7 @@ static DEFINE_SPINLOCK(subpage_ro_lock);
> >>>>>  
> >>>>>  #define l1_disallow_mask(d)                                     \
> >>>>>      (((d) != dom_io) &&                                         \
> >>>>> -     (rangeset_is_empty((d)->iomem_caps) &&                     \
> >>>>> -      rangeset_is_empty((d)->arch.ioport_caps) &&               \
> >>>>> +     (!has_arch_io_resources(d) &&                              \
> >>>>>        !has_arch_pdevs(d) &&                                     \
> >>>>>        is_pv_domain(d)) ?                                        \
> >>>>>       L1_DISALLOW_MASK : (L1_DISALLOW_MASK & ~PAGE_CACHE_ATTRS))
> >>>>
> >>>> ... cachability, which goes hand in hand with the ability to also
> >>>> flush cache contents.
> >>>
> >>> Hm, I was on the edge here, in fact I've previously coded this using
> >>> cache_flush_permitted(), just to the change back to
> >>> has_arch_io_resources().  If you think cache_flush_permitted() is
> >>> better I'm fine with that.
> >>
> >> I think that would be better here, yet as you say - it's not entirely
> >> clear cut either way.
> > 
> > I've reverted this chunk of the change and left the code as-is for the
> > time being.
> 
> Didn't we agree to use cache_flush_permitted() here instead?

I think it would be a bit weird, if we want this to be a
non-functional change we would need to keep the has_arch_pdevs()
condition because cache_flush_permitted() doesn't take that into
account.  Or we need to adjust cache_flush_permitted() to also take
has_arch_pdevs() into consideration.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] xen/x86: rename cache_flush_permitted() to has_arch_io_resources()
  2025-05-16  8:27             ` Roger Pau Monné
@ 2025-05-16  8:36               ` Jan Beulich
  2025-05-16  8:55                 ` Roger Pau Monné
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2025-05-16  8:36 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On 16.05.2025 10:27, Roger Pau Monné wrote:
> On Fri, May 16, 2025 at 10:08:35AM +0200, Jan Beulich wrote:
>> On 16.05.2025 10:02, Roger Pau Monné wrote:
>>> On Fri, May 16, 2025 at 09:07:43AM +0200, Jan Beulich wrote:
>>>> On 15.05.2025 12:28, Roger Pau Monné wrote:
>>>>> On Mon, May 12, 2025 at 05:16:02PM +0200, Jan Beulich wrote:
>>>>>> On 06.05.2025 10:31, Roger Pau Monne wrote:
>>>>>>> To better describe the underlying implementation.  Define
>>>>>>> cache_flush_permitted() as an alias of has_arch_io_resources(), so that
>>>>>>> current users of cache_flush_permitted() are not effectively modified.
>>>>>>>
>>>>>>> With the introduction of the new handler, change some of the call sites of
>>>>>>> cache_flush_permitted() to instead use has_arch_io_resources() as such
>>>>>>> callers are not after whether cache flush is enabled, but rather whether
>>>>>>> the domain has any IO resources assigned.
>>>>>>>
>>>>>>> Take the opportunity to adjust l1_disallow_mask() to use the newly
>>>>>>> introduced has_arch_io_resources() macro.
>>>>>>
>>>>>> While I'm happy with everything else here, to me it's at least on the
>>>>>> edge whether cache_flush_permitted() wouldn't be the better predicate
>>>>>> to use there, for this being about ...
>>>>>>
>>>>>>> --- a/xen/arch/x86/mm.c
>>>>>>> +++ b/xen/arch/x86/mm.c
>>>>>>> @@ -172,8 +172,7 @@ static DEFINE_SPINLOCK(subpage_ro_lock);
>>>>>>>  
>>>>>>>  #define l1_disallow_mask(d)                                     \
>>>>>>>      (((d) != dom_io) &&                                         \
>>>>>>> -     (rangeset_is_empty((d)->iomem_caps) &&                     \
>>>>>>> -      rangeset_is_empty((d)->arch.ioport_caps) &&               \
>>>>>>> +     (!has_arch_io_resources(d) &&                              \
>>>>>>>        !has_arch_pdevs(d) &&                                     \
>>>>>>>        is_pv_domain(d)) ?                                        \
>>>>>>>       L1_DISALLOW_MASK : (L1_DISALLOW_MASK & ~PAGE_CACHE_ATTRS))
>>>>>>
>>>>>> ... cachability, which goes hand in hand with the ability to also
>>>>>> flush cache contents.
>>>>>
>>>>> Hm, I was on the edge here, in fact I've previously coded this using
>>>>> cache_flush_permitted(), just to the change back to
>>>>> has_arch_io_resources().  If you think cache_flush_permitted() is
>>>>> better I'm fine with that.
>>>>
>>>> I think that would be better here, yet as you say - it's not entirely
>>>> clear cut either way.
>>>
>>> I've reverted this chunk of the change and left the code as-is for the
>>> time being.
>>
>> Didn't we agree to use cache_flush_permitted() here instead?
> 
> I think it would be a bit weird, if we want this to be a
> non-functional change we would need to keep the has_arch_pdevs()
> condition because cache_flush_permitted() doesn't take that into
> account.  Or we need to adjust cache_flush_permitted() to also take
> has_arch_pdevs() into consideration.

Which is what you suggested elsewhere, or did I misunderstand that?

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] xen/x86: rename cache_flush_permitted() to has_arch_io_resources()
  2025-05-16  8:36               ` Jan Beulich
@ 2025-05-16  8:55                 ` Roger Pau Monné
  0 siblings, 0 replies; 50+ messages in thread
From: Roger Pau Monné @ 2025-05-16  8:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On Fri, May 16, 2025 at 10:36:19AM +0200, Jan Beulich wrote:
> On 16.05.2025 10:27, Roger Pau Monné wrote:
> > On Fri, May 16, 2025 at 10:08:35AM +0200, Jan Beulich wrote:
> >> On 16.05.2025 10:02, Roger Pau Monné wrote:
> >>> On Fri, May 16, 2025 at 09:07:43AM +0200, Jan Beulich wrote:
> >>>> On 15.05.2025 12:28, Roger Pau Monné wrote:
> >>>>> On Mon, May 12, 2025 at 05:16:02PM +0200, Jan Beulich wrote:
> >>>>>> On 06.05.2025 10:31, Roger Pau Monne wrote:
> >>>>>>> To better describe the underlying implementation.  Define
> >>>>>>> cache_flush_permitted() as an alias of has_arch_io_resources(), so that
> >>>>>>> current users of cache_flush_permitted() are not effectively modified.
> >>>>>>>
> >>>>>>> With the introduction of the new handler, change some of the call sites of
> >>>>>>> cache_flush_permitted() to instead use has_arch_io_resources() as such
> >>>>>>> callers are not after whether cache flush is enabled, but rather whether
> >>>>>>> the domain has any IO resources assigned.
> >>>>>>>
> >>>>>>> Take the opportunity to adjust l1_disallow_mask() to use the newly
> >>>>>>> introduced has_arch_io_resources() macro.
> >>>>>>
> >>>>>> While I'm happy with everything else here, to me it's at least on the
> >>>>>> edge whether cache_flush_permitted() wouldn't be the better predicate
> >>>>>> to use there, for this being about ...
> >>>>>>
> >>>>>>> --- a/xen/arch/x86/mm.c
> >>>>>>> +++ b/xen/arch/x86/mm.c
> >>>>>>> @@ -172,8 +172,7 @@ static DEFINE_SPINLOCK(subpage_ro_lock);
> >>>>>>>  
> >>>>>>>  #define l1_disallow_mask(d)                                     \
> >>>>>>>      (((d) != dom_io) &&                                         \
> >>>>>>> -     (rangeset_is_empty((d)->iomem_caps) &&                     \
> >>>>>>> -      rangeset_is_empty((d)->arch.ioport_caps) &&               \
> >>>>>>> +     (!has_arch_io_resources(d) &&                              \
> >>>>>>>        !has_arch_pdevs(d) &&                                     \
> >>>>>>>        is_pv_domain(d)) ?                                        \
> >>>>>>>       L1_DISALLOW_MASK : (L1_DISALLOW_MASK & ~PAGE_CACHE_ATTRS))
> >>>>>>
> >>>>>> ... cachability, which goes hand in hand with the ability to also
> >>>>>> flush cache contents.
> >>>>>
> >>>>> Hm, I was on the edge here, in fact I've previously coded this using
> >>>>> cache_flush_permitted(), just to the change back to
> >>>>> has_arch_io_resources().  If you think cache_flush_permitted() is
> >>>>> better I'm fine with that.
> >>>>
> >>>> I think that would be better here, yet as you say - it's not entirely
> >>>> clear cut either way.
> >>>
> >>> I've reverted this chunk of the change and left the code as-is for the
> >>> time being.
> >>
> >> Didn't we agree to use cache_flush_permitted() here instead?
> > 
> > I think it would be a bit weird, if we want this to be a
> > non-functional change we would need to keep the has_arch_pdevs()
> > condition because cache_flush_permitted() doesn't take that into
> > account.  Or we need to adjust cache_flush_permitted() to also take
> > has_arch_pdevs() into consideration.
> 
> Which is what you suggested elsewhere, or did I misunderstand that?

Yes, I missed that you agreed to that then, sorry.  To many messages
on the thread I'm afraid.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 8/9] xen: introduce flag when a domain requires cache control
  2025-05-06  8:31 [PATCH 0/8] xen: cache control improvements Roger Pau Monne
                   ` (6 preceding siblings ...)
  2025-05-06  8:31 ` [PATCH 7/9] xen/x86: rename cache_flush_permitted() to has_arch_io_resources() Roger Pau Monne
@ 2025-05-06  8:31 ` Roger Pau Monne
  2025-05-06 10:20   ` Julien Grall
  2025-05-12 15:24   ` Jan Beulich
  2025-05-06  8:31 ` [PATCH 9/9] xen/x86: track dirty pCPU caches for a given vCPU Roger Pau Monne
  2025-05-06  9:32 ` [PATCH 0/8] xen: cache control improvements Christian Lindig
  9 siblings, 2 replies; 50+ messages in thread
From: Roger Pau Monne @ 2025-05-06  8:31 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Anthony PERARD, Andrew Cooper, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini, Juergen Gross,
	Christian Lindig, David Scott, Bertrand Marquis,
	Volodymyr Babchuk

Such flag is added to the domain create hypercall, and a matching option is
added to xl and libxl to set the flag: `cache_control`.  When the flag is
set, the domain is allowed the usage of cache control operations.  If the
flag is not explicitly set, libxl will set it if the domain has any `iomem`
or `ioports` ranges assigned.

Modify cache_flush_permitted() helper so that it's return value is no
longer based on the IO resources assigned to the domain, but rather whether
the domain has been explicitly allowed control over the cache, or if it has
IOMMU support and there's a single IOMMU in the system that doesn't allow
forcing snooping behind the guest back.  As a result of this, the return of
cache_flush_permitted() is constant for the lifetime of a domain, based on
the domain creation flags and the capabilities of the IOMMU if enabled
for the domain.

Since cache control is now known at domain creation, and doesn't change for
the lifetime of a domain, the cache related checks can be simplified,
specially in iomem_{permit,deny}_access() and
ioports_{permit,deny}_access(), as the EPT EMT values won't depend on
whether the domain has IO resources assigned, but rather chosen at creation
time.

As cache_flush_permitted() now also takes into account if a domain has an
IOMMU assigned, and whether the IOMMU supports forcing snooping, some of
the checks can be simplified to drop the iommu_snoop or is_iommu_enabled()
checks previously used in conjunction with cache_flush_permitted().

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 docs/man/xl.cfg.5.pod.in         | 10 ++++++++++
 tools/include/libxl.h            |  7 +++++++
 tools/libs/light/libxl_create.c  |  6 ++++++
 tools/libs/light/libxl_types.idl |  3 +++
 tools/ocaml/libs/xc/xenctrl.ml   |  1 +
 tools/ocaml/libs/xc/xenctrl.mli  |  1 +
 tools/xl/xl_parse.c              |  2 ++
 xen/arch/arm/dom0less-build.c    | 12 ++++++++++--
 xen/arch/arm/domain.c            |  3 ++-
 xen/arch/arm/domain_build.c      |  6 ++++++
 xen/arch/x86/hvm/mtrr.c          |  2 +-
 xen/arch/x86/hvm/vmx/vmcs.c      |  3 +--
 xen/arch/x86/hvm/vmx/vmx.c       |  2 +-
 xen/arch/x86/include/asm/iocap.h | 19 ++-----------------
 xen/arch/x86/mm/p2m-ept.c        |  7 +------
 xen/arch/x86/setup.c             |  7 +++++++
 xen/common/domain.c              |  3 ++-
 xen/include/asm-generic/iocap.h  |  2 --
 xen/include/public/domctl.h      |  4 +++-
 xen/include/xen/iocap.h          | 25 ++-----------------------
 xen/include/xen/sched.h          |  6 ++++++
 21 files changed, 74 insertions(+), 57 deletions(-)

diff --git a/docs/man/xl.cfg.5.pod.in b/docs/man/xl.cfg.5.pod.in
index 7339c44efd54..d61685de4db3 100644
--- a/docs/man/xl.cfg.5.pod.in
+++ b/docs/man/xl.cfg.5.pod.in
@@ -1705,6 +1705,16 @@ This feature is a B<technology preview>.
 
 =back
 
+=item B<cache_control=BOOLEAN>
+
+If this option is B<true>, Xen allows the guest to perform cache control
+operations.
+
+If this option is missing the default will be set based on whether there are
+any <B>iomem or <B>ioports (if applicable) assigned to the guest.  Note that
+enabling <B>passthrough support can also allow guest cache control operations
+depending on the <B>IOMMU features.
+
 =back
 
 =head2 Paravirtualised (PV) Guest Specific Options
diff --git a/tools/include/libxl.h b/tools/include/libxl.h
index b7ad7735ca4c..9a55e98f28bb 100644
--- a/tools/include/libxl.h
+++ b/tools/include/libxl.h
@@ -647,6 +647,13 @@
  */
 #define LIBXL_HAVE_DT_OVERLAY_DOMAIN 1
 
+/*
+ * LIBXL_HAVE_CACHE_CONTROL indicates the presence of cache_control boolean
+ * field in libxl_domain_build_info.  The field signals whether a domain is
+ * allowed access to cache control operations.
+ */
+#define LIBXL_HAVE_CACHE_CONTROL 1
+
 /*
  * libxl memory management
  *
diff --git a/tools/libs/light/libxl_create.c b/tools/libs/light/libxl_create.c
index e03599ea99d1..469fa306e2f0 100644
--- a/tools/libs/light/libxl_create.c
+++ b/tools/libs/light/libxl_create.c
@@ -493,6 +493,9 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc,
         b_info->bootloader_user =
             libxl__strdup(gc, getenv("LIBXL_BOOTLOADER_USER"));
 
+    libxl_defbool_setdefault(&b_info->cache_control,
+                             b_info->num_iomem || b_info->num_ioports);
+
     return 0;
 }
 
@@ -667,6 +670,9 @@ int libxl__domain_make(libxl__gc *gc, libxl_domain_config *d_config,
         if (libxl_defbool_val(b_info->vpmu))
             create.flags |= XEN_DOMCTL_CDF_vpmu;
 
+        if (libxl_defbool_val(b_info->cache_control))
+            create.flags |= XEN_DOMCTL_CDF_cache_control;
+
         assert(info->passthrough != LIBXL_PASSTHROUGH_DEFAULT);
         LOG(DETAIL, "passthrough: %s",
             libxl_passthrough_to_string(info->passthrough));
diff --git a/tools/libs/light/libxl_types.idl b/tools/libs/light/libxl_types.idl
index 9bb296993199..59303e8df74a 100644
--- a/tools/libs/light/libxl_types.idl
+++ b/tools/libs/light/libxl_types.idl
@@ -737,6 +737,9 @@ libxl_domain_build_info = Struct("domain_build_info",[
 
     ("vpmu", libxl_defbool),
 
+    # Allow guest access to cache control instructions.
+    ("cache_control", libxl_defbool),
+
     ], dir=DIR_IN,
        copy_deprecated_fn="libxl__domain_build_info_copy_deprecated",
 )
diff --git a/tools/ocaml/libs/xc/xenctrl.ml b/tools/ocaml/libs/xc/xenctrl.ml
index 2690f9a92316..15ea2cbb0490 100644
--- a/tools/ocaml/libs/xc/xenctrl.ml
+++ b/tools/ocaml/libs/xc/xenctrl.ml
@@ -70,6 +70,7 @@ type domain_create_flag =
   | CDF_IOMMU
   | CDF_NESTED_VIRT
   | CDF_VPMU
+  | CDF_cache_control
 
 type domain_create_iommu_opts =
   | IOMMU_NO_SHAREPT
diff --git a/tools/ocaml/libs/xc/xenctrl.mli b/tools/ocaml/libs/xc/xenctrl.mli
index febbe1f6ae3f..c9ce99a62c16 100644
--- a/tools/ocaml/libs/xc/xenctrl.mli
+++ b/tools/ocaml/libs/xc/xenctrl.mli
@@ -63,6 +63,7 @@ type domain_create_flag =
   | CDF_IOMMU
   | CDF_NESTED_VIRT
   | CDF_VPMU
+  | CDF_cache_control
 
 type domain_create_iommu_opts =
   | IOMMU_NO_SHAREPT
diff --git a/tools/xl/xl_parse.c b/tools/xl/xl_parse.c
index 089a88935aff..b5f3133581cb 100644
--- a/tools/xl/xl_parse.c
+++ b/tools/xl/xl_parse.c
@@ -2993,6 +2993,8 @@ skip_usbdev:
 
     xlu_cfg_get_defbool(config, "vpmu", &b_info->vpmu, 0);
 
+    xlu_cfg_get_defbool(config, "cache_control", &b_info->cache_control, 0);
+
     xlu_cfg_destroy(config);
 }
 
diff --git a/xen/arch/arm/dom0less-build.c b/xen/arch/arm/dom0less-build.c
index a356fc94fc4d..e0f0b828d0b6 100644
--- a/xen/arch/arm/dom0less-build.c
+++ b/xen/arch/arm/dom0less-build.c
@@ -1106,6 +1106,8 @@ void __init create_domUs(void)
 
             if ( !strcmp(dom0less_iommu, "enabled") )
                 iommu = true;
+            else
+                d_cfg.flags |= XEN_DOMCTL_CDF_cache_control;
         }
 
         if ( (flags & CDF_hardware) && !(flags & CDF_directmap) &&
@@ -1120,8 +1122,14 @@ void __init create_domUs(void)
             has_dtb = true;
         }
 
-        if ( iommu_enabled && (iommu || has_dtb) )
-            d_cfg.flags |= XEN_DOMCTL_CDF_iommu;
+        if ( iommu || has_dtb )
+            /*
+             * Domain has devices assigned, either enable IOMMU support (if
+             * present), or explicitly allow cache control operations for DMA
+             * coherency.
+             */
+            d_cfg.flags |= iommu_enabled ? XEN_DOMCTL_CDF_iommu
+                                         : XEN_DOMCTL_CDF_cache_control;
 
         if ( !dt_property_read_u32(node, "nr_spis", &d_cfg.arch.nr_spis) )
         {
diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index 45aeb8bddcb0..1a8503f8d471 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -612,7 +612,8 @@ int arch_sanitise_domain_config(struct xen_domctl_createdomain *config)
     unsigned int max_vcpus;
     unsigned int flags_required = (XEN_DOMCTL_CDF_hvm | XEN_DOMCTL_CDF_hap);
     unsigned int flags_optional = (XEN_DOMCTL_CDF_iommu | XEN_DOMCTL_CDF_vpmu |
-                                   XEN_DOMCTL_CDF_xs_domain );
+                                   XEN_DOMCTL_CDF_xs_domain |
+                                   XEN_DOMCTL_CDF_cache_control);
     unsigned int sve_vl_bits = sve_decode_vl(config->arch.sve_vl);
 
     if ( (config->flags & ~flags_optional) != flags_required )
diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
index 270a6b97e42c..0e7e6a6ac044 100644
--- a/xen/arch/arm/domain_build.c
+++ b/xen/arch/arm/domain_build.c
@@ -2381,6 +2381,12 @@ void __init create_dom0(void)
 
     if ( iommu_enabled )
         dom0_cfg.flags |= XEN_DOMCTL_CDF_iommu;
+    else
+        /*
+         * A hardware domain without IOMMU will need cache control to
+         * ensure DMA coherency.
+         */
+        dom0_cfg.flags |= XEN_DOMCTL_CDF_cache_control;
 
     if ( opt_dom0_sve )
     {
diff --git a/xen/arch/x86/hvm/mtrr.c b/xen/arch/x86/hvm/mtrr.c
index e7acfb6e8dc4..887994d2b984 100644
--- a/xen/arch/x86/hvm/mtrr.c
+++ b/xen/arch/x86/hvm/mtrr.c
@@ -766,7 +766,7 @@ HVM_REGISTER_SAVE_RESTORE(MTRR, hvm_save_mtrr_msr, NULL, hvm_load_mtrr_msr, 1,
 
 void memory_type_changed(struct domain *d)
 {
-    if ( (is_iommu_enabled(d) || cache_flush_permitted(d)) &&
+    if ( cache_flush_permitted(d) &&
          d->vcpu && d->vcpu[0] && p2m_memory_type_changed(d) )
     {
         flush_all(FLUSH_CACHE);
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index a44475ae15bd..fba62b5d47d8 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -1921,8 +1921,7 @@ void cf_check vmx_do_resume(void)
          *  2: execute wbinvd on all dirty pCPUs when guest wbinvd exits.
          * If VT-d engine can force snooping, we don't need to do these.
          */
-        if ( has_arch_pdevs(v->domain) && !iommu_snoop
-                && !cpu_has_wbinvd_exiting )
+        if ( cache_flush_permitted(v->domain) && !cpu_has_wbinvd_exiting )
         {
             int cpu = v->arch.hvm.vmx.active_cpu;
             if ( cpu != -1 )
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 827db6bdd807..639882ceb216 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -3840,7 +3840,7 @@ static void vmx_do_extint(struct cpu_user_regs *regs)
 
 static void cf_check vmx_wbinvd_intercept(void)
 {
-    if ( !cache_flush_permitted(current->domain) || iommu_snoop )
+    if ( !cache_flush_permitted(current->domain) )
         return;
 
     if ( cpu_has_wbinvd_exiting )
diff --git a/xen/arch/x86/include/asm/iocap.h b/xen/arch/x86/include/asm/iocap.h
index 61d026dbf5f6..d3bb66f0e97a 100644
--- a/xen/arch/x86/include/asm/iocap.h
+++ b/xen/arch/x86/include/asm/iocap.h
@@ -19,31 +19,16 @@
     (!rangeset_is_empty((d)->iomem_caps) ||             \
      !rangeset_is_empty((d)->arch.ioport_caps))
 
-#define cache_flush_permitted has_arch_io_resources
-
 static inline int ioports_permit_access(struct domain *d, unsigned long s,
                                         unsigned long e)
 {
-    bool flush = cache_flush_permitted(d);
-    int ret = rangeset_add_range(d->arch.ioport_caps, s, e);
-
-    if ( !ret && !is_iommu_enabled(d) && !flush )
-        /* See comment in iomem_permit_access(). */
-        memory_type_changed(d);
-
-    return ret;
+    return rangeset_add_range(d->arch.ioport_caps, s, e);
 }
 
 static inline int ioports_deny_access(struct domain *d, unsigned long s,
                                       unsigned long e)
 {
-    int ret = rangeset_remove_range(d->arch.ioport_caps, s, e);
-
-    if ( !ret && !is_iommu_enabled(d) && !cache_flush_permitted(d) )
-        /* See comment in iomem_deny_access(). */
-        memory_type_changed(d);
-
-    return ret;
+    return rangeset_remove_range(d->arch.ioport_caps, s, e);
 }
 
 #endif /* __X86_IOCAP_H__ */
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index 0cf6818c13f0..d3ce422d76d7 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -505,12 +505,7 @@ int epte_get_entry_emt(struct domain *d, gfn_t gfn, mfn_t mfn,
         return -1;
     }
 
-    /*
-     * Conditional must be kept in sync with the code in
-     * {iomem,ioports}_{permit,deny}_access().
-     */
-    if ( type != p2m_mmio_direct && !is_iommu_enabled(d) &&
-         !cache_flush_permitted(d) )
+    if ( type != p2m_mmio_direct && !cache_flush_permitted(d) )
     {
         *ipat = true;
         return X86_MT_WB;
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 25189541244d..ec1b9772bec0 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1029,6 +1029,13 @@ static struct domain *__init create_dom0(struct boot_info *bi)
 
     if ( iommu_enabled )
         dom0_cfg.flags |= XEN_DOMCTL_CDF_iommu;
+    else if ( !pv_shim )
+        /*
+         * pvshim doesn't support passthrough, so never allow it cache control.
+         * Otherwise a hardware domain without IOMMU will need cache control to
+         * ensure DMA coherency.
+         */
+        dom0_cfg.flags |= XEN_DOMCTL_CDF_cache_control;
 
     /* Create initial domain.  Not d0 for pvshim. */
     bd->domid = get_initial_domain_id();
diff --git a/xen/common/domain.c b/xen/common/domain.c
index abf1969e60e3..07b81a70c7a6 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -721,7 +721,8 @@ static int sanitise_domain_config(struct xen_domctl_createdomain *config)
          ~(XEN_DOMCTL_CDF_hvm | XEN_DOMCTL_CDF_hap |
            XEN_DOMCTL_CDF_s3_integrity | XEN_DOMCTL_CDF_oos_off |
            XEN_DOMCTL_CDF_xs_domain | XEN_DOMCTL_CDF_iommu |
-           XEN_DOMCTL_CDF_nested_virt | XEN_DOMCTL_CDF_vpmu) )
+           XEN_DOMCTL_CDF_nested_virt | XEN_DOMCTL_CDF_vpmu |
+           XEN_DOMCTL_CDF_cache_control) )
     {
         dprintk(XENLOG_INFO, "Unknown CDF flags %#x\n", config->flags);
         return -EINVAL;
diff --git a/xen/include/asm-generic/iocap.h b/xen/include/asm-generic/iocap.h
index 664bbc8971fe..011c1a475072 100644
--- a/xen/include/asm-generic/iocap.h
+++ b/xen/include/asm-generic/iocap.h
@@ -5,8 +5,6 @@
 #define has_arch_io_resources(d)                        \
     (!rangeset_is_empty((d)->iomem_caps))
 
-#define cache_flush_permitted has_arch_io_resources
-
 #endif /* __ASM_GENERIC_IOCAP_H__ */
 
 /*
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 5b2063eed984..9134dde7e105 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -66,9 +66,11 @@ struct xen_domctl_createdomain {
 #define XEN_DOMCTL_CDF_nested_virt    (1U << _XEN_DOMCTL_CDF_nested_virt)
 /* Should we expose the vPMU to the guest? */
 #define XEN_DOMCTL_CDF_vpmu           (1U << 7)
+/* Guest has control over the cache? */
+#define XEN_DOMCTL_CDF_cache_control  (1U << 8)
 
 /* Max XEN_DOMCTL_CDF_* constant.  Used for ABI checking. */
-#define XEN_DOMCTL_CDF_MAX XEN_DOMCTL_CDF_vpmu
+#define XEN_DOMCTL_CDF_MAX XEN_DOMCTL_CDF_cache_control
 
     uint32_t flags;
 
diff --git a/xen/include/xen/iocap.h b/xen/include/xen/iocap.h
index ffbc48b60fd5..a753fea2ec50 100644
--- a/xen/include/xen/iocap.h
+++ b/xen/include/xen/iocap.h
@@ -15,34 +15,13 @@
 static inline int iomem_permit_access(struct domain *d, unsigned long s,
                                       unsigned long e)
 {
-    bool flush = cache_flush_permitted(d);
-    int ret = rangeset_add_range(d->iomem_caps, s, e);
-
-    if ( !ret && !is_iommu_enabled(d) && !flush )
-        /*
-         * Only flush if the range(s) are empty before this addition and
-         * IOMMU is not enabled for the domain, otherwise it makes no
-         * difference for effective cache attribute calculation purposes.
-         */
-        memory_type_changed(d);
-
-    return ret;
+    return rangeset_add_range(d->iomem_caps, s, e);
 }
 
 static inline int iomem_deny_access(struct domain *d, unsigned long s,
                                     unsigned long e)
 {
-    int ret = rangeset_remove_range(d->iomem_caps, s, e);
-
-    if ( !ret && !is_iommu_enabled(d) && !cache_flush_permitted(d) )
-        /*
-         * Only flush if the range(s) are empty after this removal and
-         * IOMMU is not enabled for the domain, otherwise it makes no
-         * difference for effective cache attribute calculation purposes.
-         */
-        memory_type_changed(d);
-
-    return ret;
+    return rangeset_remove_range(d->iomem_caps, s, e);
 }
 
 #define iomem_access_permitted(d, s, e)                 \
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 559d201e0c7e..908a3b7292ff 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -1222,6 +1222,12 @@ static always_inline bool is_iommu_enabled(const struct domain *d)
     return evaluate_nospec(d->options & XEN_DOMCTL_CDF_iommu);
 }
 
+static inline bool cache_flush_permitted(const struct domain *d)
+{
+    return (d->options & XEN_DOMCTL_CDF_cache_control) ||
+           (is_iommu_enabled(d) && !iommu_snoop);
+}
+
 #ifdef CONFIG_MEM_PAGING
 # define mem_paging_enabled(d) vm_event_check_ring((d)->vm_event_paging)
 #else
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] xen: introduce flag when a domain requires cache control
  2025-05-06  8:31 ` [PATCH 8/9] xen: introduce flag when a domain requires cache control Roger Pau Monne
@ 2025-05-06 10:20   ` Julien Grall
  2025-05-06 10:43     ` Roger Pau Monné
  2025-05-12 15:24   ` Jan Beulich
  1 sibling, 1 reply; 50+ messages in thread
From: Julien Grall @ 2025-05-06 10:20 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel
  Cc: Anthony PERARD, Andrew Cooper, Michal Orzel, Jan Beulich,
	Stefano Stabellini, Juergen Gross, Christian Lindig, David Scott,
	Bertrand Marquis, Volodymyr Babchuk

Hi Roger,

On 06/05/2025 09:31, Roger Pau Monne wrote:
> Such flag is added to the domain create hypercall, and a matching option is
> added to xl and libxl to set the flag: `cache_control`.  When the flag is
> set, the domain is allowed the usage of cache control operations.  

Can you clarify whether you are talking about using the hypercall to 
flush the cache or the instructions?

Cheers,

-- 
Julien Grall



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] xen: introduce flag when a domain requires cache control
  2025-05-06 10:20   ` Julien Grall
@ 2025-05-06 10:43     ` Roger Pau Monné
  0 siblings, 0 replies; 50+ messages in thread
From: Roger Pau Monné @ 2025-05-06 10:43 UTC (permalink / raw)
  To: Julien Grall
  Cc: xen-devel, Anthony PERARD, Andrew Cooper, Michal Orzel,
	Jan Beulich, Stefano Stabellini, Juergen Gross, Christian Lindig,
	David Scott, Bertrand Marquis, Volodymyr Babchuk

On Tue, May 06, 2025 at 11:20:41AM +0100, Julien Grall wrote:
> Hi Roger,
> 
> On 06/05/2025 09:31, Roger Pau Monne wrote:
> > Such flag is added to the domain create hypercall, and a matching option is
> > added to xl and libxl to set the flag: `cache_control`.  When the flag is
> > set, the domain is allowed the usage of cache control operations.
> 
> Can you clarify whether you are talking about using the hypercall to flush
> the cache or the instructions?

On x86 at least: both hypercall and instructions if possible, since
Xen traps cache flush instructions.  Maybe that's different on ARM.
As said on a previous reply I've assumed that cache_flush_permitted()
was also relevant on ARM, but maybe it's not and domains are
unconditionally allowed execution of cache control instructions and
hypercalls.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] xen: introduce flag when a domain requires cache control
  2025-05-06  8:31 ` [PATCH 8/9] xen: introduce flag when a domain requires cache control Roger Pau Monne
  2025-05-06 10:20   ` Julien Grall
@ 2025-05-12 15:24   ` Jan Beulich
  2025-05-15 10:44     ` Roger Pau Monné
  1 sibling, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2025-05-12 15:24 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Anthony PERARD, Andrew Cooper, Michal Orzel, Julien Grall,
	Stefano Stabellini, Juergen Gross, Christian Lindig, David Scott,
	Bertrand Marquis, Volodymyr Babchuk, xen-devel

On 06.05.2025 10:31, Roger Pau Monne wrote:
> Such flag is added to the domain create hypercall, and a matching option is
> added to xl and libxl to set the flag: `cache_control`.  When the flag is
> set, the domain is allowed the usage of cache control operations.  If the
> flag is not explicitly set, libxl will set it if the domain has any `iomem`
> or `ioports` ranges assigned.
> 
> Modify cache_flush_permitted() helper so that it's return value is no
> longer based on the IO resources assigned to the domain, but rather whether
> the domain has been explicitly allowed control over the cache, or if it has
> IOMMU support and there's a single IOMMU in the system that doesn't allow
> forcing snooping behind the guest back.  As a result of this, the return of
> cache_flush_permitted() is constant for the lifetime of a domain, based on
> the domain creation flags and the capabilities of the IOMMU if enabled
> for the domain.

This then further emphasizes the remark made for patch 7.

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] xen: introduce flag when a domain requires cache control
  2025-05-12 15:24   ` Jan Beulich
@ 2025-05-15 10:44     ` Roger Pau Monné
  2025-05-16  7:16       ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Roger Pau Monné @ 2025-05-15 10:44 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Anthony PERARD, Andrew Cooper, Michal Orzel, Julien Grall,
	Stefano Stabellini, Juergen Gross, Christian Lindig, David Scott,
	Bertrand Marquis, Volodymyr Babchuk, xen-devel

On Mon, May 12, 2025 at 05:24:01PM +0200, Jan Beulich wrote:
> On 06.05.2025 10:31, Roger Pau Monne wrote:
> > Such flag is added to the domain create hypercall, and a matching option is
> > added to xl and libxl to set the flag: `cache_control`.  When the flag is
> > set, the domain is allowed the usage of cache control operations.  If the
> > flag is not explicitly set, libxl will set it if the domain has any `iomem`
> > or `ioports` ranges assigned.
> > 
> > Modify cache_flush_permitted() helper so that it's return value is no
> > longer based on the IO resources assigned to the domain, but rather whether
> > the domain has been explicitly allowed control over the cache, or if it has
> > IOMMU support and there's a single IOMMU in the system that doesn't allow
> > forcing snooping behind the guest back.  As a result of this, the return of
> > cache_flush_permitted() is constant for the lifetime of a domain, based on
> > the domain creation flags and the capabilities of the IOMMU if enabled
> > for the domain.
> 
> This then further emphasizes the remark made for patch 7.

Hm, I think you are referring to the remark about how PCI device
without IO resources would be handled then, and what would
cache_flush_permitted() return then?

IMO the benefit of the approach presented here is that it aims to know
whether a domain needs cache control to be set at creation time, and
not altered during it's runtime.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] xen: introduce flag when a domain requires cache control
  2025-05-15 10:44     ` Roger Pau Monné
@ 2025-05-16  7:16       ` Jan Beulich
  2025-05-16  8:05         ` Roger Pau Monné
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2025-05-16  7:16 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Anthony PERARD, Andrew Cooper, Michal Orzel, Julien Grall,
	Stefano Stabellini, Juergen Gross, Christian Lindig, David Scott,
	Bertrand Marquis, Volodymyr Babchuk, xen-devel

On 15.05.2025 12:44, Roger Pau Monné wrote:
> On Mon, May 12, 2025 at 05:24:01PM +0200, Jan Beulich wrote:
>> On 06.05.2025 10:31, Roger Pau Monne wrote:
>>> Such flag is added to the domain create hypercall, and a matching option is
>>> added to xl and libxl to set the flag: `cache_control`.  When the flag is
>>> set, the domain is allowed the usage of cache control operations.  If the
>>> flag is not explicitly set, libxl will set it if the domain has any `iomem`
>>> or `ioports` ranges assigned.
>>>
>>> Modify cache_flush_permitted() helper so that it's return value is no
>>> longer based on the IO resources assigned to the domain, but rather whether
>>> the domain has been explicitly allowed control over the cache, or if it has
>>> IOMMU support and there's a single IOMMU in the system that doesn't allow
>>> forcing snooping behind the guest back.  As a result of this, the return of
>>> cache_flush_permitted() is constant for the lifetime of a domain, based on
>>> the domain creation flags and the capabilities of the IOMMU if enabled
>>> for the domain.
>>
>> This then further emphasizes the remark made for patch 7.
> 
> Hm, I think you are referring to the remark about how PCI device
> without IO resources would be handled then, and what would
> cache_flush_permitted() return then?

Or more generally the relationship between that and has_arch_io_resources().

> IMO the benefit of the approach presented here is that it aims to know
> whether a domain needs cache control to be set at creation time, and
> not altered during it's runtime.

I agree that having this not vary over a domain's lifetime makes things easier
for us. At the same time it may negatively affect performance of domains where
hardware devices are added / removed on the fly.

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] xen: introduce flag when a domain requires cache control
  2025-05-16  7:16       ` Jan Beulich
@ 2025-05-16  8:05         ` Roger Pau Monné
  0 siblings, 0 replies; 50+ messages in thread
From: Roger Pau Monné @ 2025-05-16  8:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Anthony PERARD, Andrew Cooper, Michal Orzel, Julien Grall,
	Stefano Stabellini, Juergen Gross, Christian Lindig, David Scott,
	Bertrand Marquis, Volodymyr Babchuk, xen-devel

On Fri, May 16, 2025 at 09:16:10AM +0200, Jan Beulich wrote:
> On 15.05.2025 12:44, Roger Pau Monné wrote:
> > On Mon, May 12, 2025 at 05:24:01PM +0200, Jan Beulich wrote:
> >> On 06.05.2025 10:31, Roger Pau Monne wrote:
> >>> Such flag is added to the domain create hypercall, and a matching option is
> >>> added to xl and libxl to set the flag: `cache_control`.  When the flag is
> >>> set, the domain is allowed the usage of cache control operations.  If the
> >>> flag is not explicitly set, libxl will set it if the domain has any `iomem`
> >>> or `ioports` ranges assigned.
> >>>
> >>> Modify cache_flush_permitted() helper so that it's return value is no
> >>> longer based on the IO resources assigned to the domain, but rather whether
> >>> the domain has been explicitly allowed control over the cache, or if it has
> >>> IOMMU support and there's a single IOMMU in the system that doesn't allow
> >>> forcing snooping behind the guest back.  As a result of this, the return of
> >>> cache_flush_permitted() is constant for the lifetime of a domain, based on
> >>> the domain creation flags and the capabilities of the IOMMU if enabled
> >>> for the domain.
> >>
> >> This then further emphasizes the remark made for patch 7.
> > 
> > Hm, I think you are referring to the remark about how PCI device
> > without IO resources would be handled then, and what would
> > cache_flush_permitted() return then?
> 
> Or more generally the relationship between that and has_arch_io_resources().
> 
> > IMO the benefit of the approach presented here is that it aims to know
> > whether a domain needs cache control to be set at creation time, and
> > not altered during it's runtime.
> 
> I agree that having this not vary over a domain's lifetime makes things easier
> for us. At the same time it may negatively affect performance of domains where
> hardware devices are added / removed on the fly.

I'm planing to leave this change for a further iteration of the
series.  Initially I just want to attempt to limit flushing when the
domain has IOMMU support enabled, and the host globally supports
snooping on all IOMMUs, as that's likely less controversial and would
unblock a customer reported issue.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 9/9] xen/x86: track dirty pCPU caches for a given vCPU
  2025-05-06  8:31 [PATCH 0/8] xen: cache control improvements Roger Pau Monne
                   ` (7 preceding siblings ...)
  2025-05-06  8:31 ` [PATCH 8/9] xen: introduce flag when a domain requires cache control Roger Pau Monne
@ 2025-05-06  8:31 ` Roger Pau Monne
  2025-05-06 11:16   ` Andrew Cooper
  2025-05-12 15:33   ` Jan Beulich
  2025-05-06  9:32 ` [PATCH 0/8] xen: cache control improvements Christian Lindig
  9 siblings, 2 replies; 50+ messages in thread
From: Roger Pau Monne @ 2025-05-06  8:31 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

When a guest is allowed access to cache control operations such tracking
prevents having to issue a system-wide cache flush, and rather just flush
the pCPUs where the vCPU has been scheduled since the last flush.

Note that domain-wide flushes accumulate the dirty caches from all the
vCPUs, but clearing the vCPU masks will require pausing all vCPUs, which
seems overkill.  Instead leave the vCPU dirty masks as-is, worse case it
will result in redundant flushes in further calls.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/domain.c             | 43 +++++++++++++++++++++++++++++++
 xen/arch/x86/hvm/hvm.c            |  2 +-
 xen/arch/x86/hvm/mtrr.c           |  2 +-
 xen/arch/x86/hvm/svm/svm.c        |  6 +++--
 xen/arch/x86/hvm/vmx/vmx.c        |  6 +++--
 xen/arch/x86/include/asm/domain.h |  9 +++++++
 xen/arch/x86/mm.c                 | 25 +++++++-----------
 xen/arch/x86/pv/emul-priv-op.c    |  8 ++----
 8 files changed, 73 insertions(+), 28 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index f197dad4c0cd..3d08b829d2db 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -579,6 +579,13 @@ int arch_vcpu_create(struct vcpu *v)
 
         if ( (rc = init_vcpu_msr_policy(v)) )
             goto fail;
+
+        if ( cache_flush_permitted(d) &&
+             !cond_zalloc_cpumask_var(&v->arch.dirty_cache) )
+        {
+            rc = -ENOMEM;
+            goto fail;
+        }
     }
     else if ( (rc = xstate_alloc_save_area(v)) != 0 )
         return rc;
@@ -614,6 +621,7 @@ int arch_vcpu_create(struct vcpu *v)
     vcpu_destroy_fpu(v);
     xfree(v->arch.msrs);
     v->arch.msrs = NULL;
+    FREE_CPUMASK_VAR(v->arch.dirty_cache);
 
     return rc;
 }
@@ -628,6 +636,8 @@ void arch_vcpu_destroy(struct vcpu *v)
     xfree(v->arch.msrs);
     v->arch.msrs = NULL;
 
+    FREE_CPUMASK_VAR(v->arch.dirty_cache);
+
     if ( is_hvm_vcpu(v) )
         hvm_vcpu_destroy(v);
     else
@@ -2018,6 +2028,9 @@ static void __context_switch(void)
         cpumask_set_cpu(cpu, nd->dirty_cpumask);
     write_atomic(&n->dirty_cpu, cpu);
 
+    if ( cache_flush_permitted(nd) )
+        __cpumask_set_cpu(cpu, n->arch.dirty_cache);
+
     if ( !is_idle_domain(nd) )
     {
         memcpy(stack_regs, &n->arch.user_regs, CTXT_SWITCH_STACK_BYTES);
@@ -2606,6 +2619,36 @@ unsigned int domain_max_paddr_bits(const struct domain *d)
     return bits;
 }
 
+void vcpu_flush_cache(struct vcpu *curr)
+{
+    ASSERT(curr == current);
+    ASSERT(cache_flush_permitted(curr->domain));
+
+    flush_mask(curr->arch.dirty_cache, FLUSH_CACHE);
+    cpumask_clear(curr->arch.dirty_cache);
+    __cpumask_set_cpu(smp_processor_id(), curr->arch.dirty_cache);
+}
+
+void domain_flush_cache(const struct domain *d)
+{
+    const struct vcpu *v;
+    cpumask_t *mask = this_cpu(scratch_cpumask);
+
+    ASSERT(cache_flush_permitted(d));
+
+    cpumask_clear(mask);
+    for_each_vcpu( d, v )
+        cpumask_or(mask, mask, v->arch.dirty_cache);
+
+    flush_mask(mask, FLUSH_CACHE);
+    /*
+     * Clearing the mask of vCPUs in the domain would be racy unless all vCPUs
+     * are paused, so just leave them as-is, at the cost of possibly doing
+     * redundant flushes in later calls.  It's still better than doing a
+     * host-wide cache flush.
+     */
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 4cb2e13046d1..aed582a215a0 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2277,7 +2277,7 @@ void hvm_shadow_handle_cd(struct vcpu *v, unsigned long value)
             domain_pause_nosync(v->domain);
 
             /* Flush physical caches. */
-            flush_all(FLUSH_CACHE);
+            domain_flush_cache(v->domain);
             hvm_set_uc_mode(v, 1);
 
             domain_unpause(v->domain);
diff --git a/xen/arch/x86/hvm/mtrr.c b/xen/arch/x86/hvm/mtrr.c
index 887994d2b984..cfe0d44459c2 100644
--- a/xen/arch/x86/hvm/mtrr.c
+++ b/xen/arch/x86/hvm/mtrr.c
@@ -769,7 +769,7 @@ void memory_type_changed(struct domain *d)
     if ( cache_flush_permitted(d) &&
          d->vcpu && d->vcpu[0] && p2m_memory_type_changed(d) )
     {
-        flush_all(FLUSH_CACHE);
+        domain_flush_cache(d);
     }
 }
 
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index e33a38c1e446..5d1777ace335 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -2315,8 +2315,10 @@ static void svm_vmexit_mce_intercept(
 
 static void cf_check svm_wbinvd_intercept(void)
 {
-    if ( cache_flush_permitted(current->domain) )
-        flush_all(FLUSH_CACHE);
+    struct vcpu *curr = current;
+
+    if ( cache_flush_permitted(curr->domain) )
+        vcpu_flush_cache(curr);
 }
 
 static void svm_vmexit_do_invalidate_cache(struct cpu_user_regs *regs,
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 639882ceb216..9273607d576c 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -3840,11 +3840,13 @@ static void vmx_do_extint(struct cpu_user_regs *regs)
 
 static void cf_check vmx_wbinvd_intercept(void)
 {
-    if ( !cache_flush_permitted(current->domain) )
+    struct vcpu *curr = current;
+
+    if ( !cache_flush_permitted(curr->domain) )
         return;
 
     if ( cpu_has_wbinvd_exiting )
-        flush_all(FLUSH_CACHE);
+        vcpu_flush_cache(curr);
     else
         wbinvd();
 }
diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index 8c0dea12a526..064b51889dc2 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -668,6 +668,12 @@ struct arch_vcpu
 
     struct vcpu_msrs *msrs;
 
+    /*
+     * When vCPU is allowed cache control track the pCPUs the vCPU has run on
+     * since the last flush.
+     */
+    cpumask_var_t dirty_cache;
+
     struct {
         bool next_interrupt_enabled;
     } monitor;
@@ -790,6 +796,9 @@ unsigned int domain_max_paddr_bits(const struct domain *d);
 #define arch_init_idle_domain arch_init_idle_domain
 void arch_init_idle_domain(struct domain *d);
 
+void vcpu_flush_cache(struct vcpu *curr);
+void domain_flush_cache(const struct domain *d);
+
 #endif /* __ASM_DOMAIN_H__ */
 
 /*
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 59b60b1e62a7..11b59398a2c4 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3804,26 +3804,19 @@ long do_mmuext_op(
             break;
 
         case MMUEXT_FLUSH_CACHE:
-            /*
-             * Dirty pCPU caches where the current vCPU has been scheduled are
-             * not tracked, and hence we need to resort to a global cache
-             * flush for correctness.
-             */
+            if ( unlikely(currd != pg_owner) )
+                rc = -EPERM;
+            else if ( likely(cache_flush_permitted(currd)) )
+                vcpu_flush_cache(curr);
+            else
+                rc = -EINVAL;
+            break;
+
         case MMUEXT_FLUSH_CACHE_GLOBAL:
             if ( unlikely(currd != pg_owner) )
                 rc = -EPERM;
             else if ( likely(cache_flush_permitted(currd)) )
-            {
-                unsigned int cpu;
-                cpumask_t *mask = this_cpu(scratch_cpumask);
-
-                cpumask_clear(mask);
-                for_each_online_cpu(cpu)
-                    if ( !cpumask_intersects(mask,
-                                             per_cpu(cpu_sibling_mask, cpu)) )
-                        __cpumask_set_cpu(cpu, mask);
-                flush_mask(mask, FLUSH_CACHE);
-            }
+                domain_flush_cache(currd);
             else
                 rc = -EINVAL;
             break;
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index 089d4cb4d905..076ce8f00457 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -1199,12 +1199,8 @@ static int cf_check cache_op(
      * newer linux uses this in some start-of-day timing loops.
      */
     if ( cache_flush_permitted(current->domain) )
-        /*
-         * Handle wbnoinvd as wbinvd, at the expense of higher cost.  Broadcast
-         * the flush to all pCPUs, Xen doesn't track where the vCPU has ran
-         * previously.
-         */
-        flush_all(FLUSH_CACHE);
+        /* Handle wbnoinvd as wbinvd, at the expense of higher cost. */
+        vcpu_flush_cache(current);
 
     return X86EMUL_OKAY;
 }
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 9/9] xen/x86: track dirty pCPU caches for a given vCPU
  2025-05-06  8:31 ` [PATCH 9/9] xen/x86: track dirty pCPU caches for a given vCPU Roger Pau Monne
@ 2025-05-06 11:16   ` Andrew Cooper
  2025-05-06 12:55     ` Roger Pau Monné
  2025-05-12 15:33   ` Jan Beulich
  1 sibling, 1 reply; 50+ messages in thread
From: Andrew Cooper @ 2025-05-06 11:16 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Jan Beulich

On 06/05/2025 9:31 am, Roger Pau Monne wrote:
> When a guest is allowed access to cache control operations such tracking
> prevents having to issue a system-wide cache flush, and rather just flush
> the pCPUs where the vCPU has been scheduled since the last flush.
>
> Note that domain-wide flushes accumulate the dirty caches from all the
> vCPUs, but clearing the vCPU masks will require pausing all vCPUs, which
> seems overkill.  Instead leave the vCPU dirty masks as-is, worse case it
> will result in redundant flushes in further calls.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

I'm afraid this doesn't work.

Unlike TLBs, dirty cacheline can move sideways, e.g. by foreign or grant
mapping, but also naturally because of how cache coherency works.

We need to use the guarantees given to us by the architecture to simply
nop out cache flushes when safe to do so.

Everything else is either a shootdown (clflush/opt/clwb, and doesn't
even trap to Xen), or needs to be a global WB{NO,}INVD.  Partial WBINVDs
are of no value.

~Andrew


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 9/9] xen/x86: track dirty pCPU caches for a given vCPU
  2025-05-06 11:16   ` Andrew Cooper
@ 2025-05-06 12:55     ` Roger Pau Monné
  2025-05-12 15:38       ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Roger Pau Monné @ 2025-05-06 12:55 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Jan Beulich

On Tue, May 06, 2025 at 12:16:00PM +0100, Andrew Cooper wrote:
> On 06/05/2025 9:31 am, Roger Pau Monne wrote:
> > When a guest is allowed access to cache control operations such tracking
> > prevents having to issue a system-wide cache flush, and rather just flush
> > the pCPUs where the vCPU has been scheduled since the last flush.
> >
> > Note that domain-wide flushes accumulate the dirty caches from all the
> > vCPUs, but clearing the vCPU masks will require pausing all vCPUs, which
> > seems overkill.  Instead leave the vCPU dirty masks as-is, worse case it
> > will result in redundant flushes in further calls.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> I'm afraid this doesn't work.
> 
> Unlike TLBs, dirty cacheline can move sideways, e.g. by foreign or grant
> mapping, but also naturally because of how cache coherency works.

Does such sideway moving also imply that local WB{NO,}INVD on native
could be equally bogus?

According to the SDM, cache lines can indeed move between processor
caches, but the memory controller must always snoop such moves and
flush the data to memory:

"Here, the processor with the valid data may pass the data to the
other processors without actually writing it to system memory;
however, it is the responsibility of the memory controller to snoop
this operation and update memory."

So a cache line moving sideways will always be propagated to memory as
part of the move, and hence the data in the previous pCPU cache will
always hit memory.

grants/foreign maps are indeed complex, but sharing non-coherent
across domains seems like a recipe for disaster.  It's maybe mitigated
by doing host-wide cache flushes, but how does the mapping domain know
whether  source domain has possibly dirty caches that need flushing?
IMO it's the source that would need to first flush any cache contents,
and then share the memory.

FWIW, (and not saying this is correct), but KVM uses the same model of
tracking dirty caches, see wbinvd_dirty_mask field in struct
kvm_vcpu_arch.

> We need to use the guarantees given to us by the architecture to simply
> nop out cache flushes when safe to do so.

We already do this when possible AFAICT.

> Everything else is either a shootdown (clflush/opt/clwb, and doesn't
> even trap to Xen), or needs to be a global WB{NO,}INVD.  Partial WBINVDs
> are of no value.

What about on Intel when there's no capability to trap WBINVD?  Xen is
currently flushing the cache of the previous pCPU in case the vCPU has
moved around, see vmx_do_resume().

Thanks, Roger.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 9/9] xen/x86: track dirty pCPU caches for a given vCPU
  2025-05-06 12:55     ` Roger Pau Monné
@ 2025-05-12 15:38       ` Jan Beulich
  2025-05-15 10:52         ` Roger Pau Monné
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2025-05-12 15:38 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper

On 06.05.2025 14:55, Roger Pau Monné wrote:
> On Tue, May 06, 2025 at 12:16:00PM +0100, Andrew Cooper wrote:
>> On 06/05/2025 9:31 am, Roger Pau Monne wrote:
>>> When a guest is allowed access to cache control operations such tracking
>>> prevents having to issue a system-wide cache flush, and rather just flush
>>> the pCPUs where the vCPU has been scheduled since the last flush.
>>>
>>> Note that domain-wide flushes accumulate the dirty caches from all the
>>> vCPUs, but clearing the vCPU masks will require pausing all vCPUs, which
>>> seems overkill.  Instead leave the vCPU dirty masks as-is, worse case it
>>> will result in redundant flushes in further calls.
>>>
>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>
>> I'm afraid this doesn't work.
>>
>> Unlike TLBs, dirty cacheline can move sideways, e.g. by foreign or grant
>> mapping, but also naturally because of how cache coherency works.
> 
> Does such sideway moving also imply that local WB{NO,}INVD on native
> could be equally bogus?
> 
> According to the SDM, cache lines can indeed move between processor
> caches, but the memory controller must always snoop such moves and
> flush the data to memory:
> 
> "Here, the processor with the valid data may pass the data to the
> other processors without actually writing it to system memory;
> however, it is the responsibility of the memory controller to snoop
> this operation and update memory."
> 
> So a cache line moving sideways will always be propagated to memory as
> part of the move, and hence the data in the previous pCPU cache will
> always hit memory.

But that's only one of the two aspects of a flush. The other is to ensure
respective data isn't in any (covered) cache anymore. IOW dirty-ness (as
the title has it) isn't a criteria, unless of course you mean "dirty" in
a sense different from what it means in the cache coherency model.

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 9/9] xen/x86: track dirty pCPU caches for a given vCPU
  2025-05-12 15:38       ` Jan Beulich
@ 2025-05-15 10:52         ` Roger Pau Monné
  2025-05-16  7:25           ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Roger Pau Monné @ 2025-05-15 10:52 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper

On Mon, May 12, 2025 at 05:38:07PM +0200, Jan Beulich wrote:
> On 06.05.2025 14:55, Roger Pau Monné wrote:
> > On Tue, May 06, 2025 at 12:16:00PM +0100, Andrew Cooper wrote:
> >> On 06/05/2025 9:31 am, Roger Pau Monne wrote:
> >>> When a guest is allowed access to cache control operations such tracking
> >>> prevents having to issue a system-wide cache flush, and rather just flush
> >>> the pCPUs where the vCPU has been scheduled since the last flush.
> >>>
> >>> Note that domain-wide flushes accumulate the dirty caches from all the
> >>> vCPUs, but clearing the vCPU masks will require pausing all vCPUs, which
> >>> seems overkill.  Instead leave the vCPU dirty masks as-is, worse case it
> >>> will result in redundant flushes in further calls.
> >>>
> >>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> >>
> >> I'm afraid this doesn't work.
> >>
> >> Unlike TLBs, dirty cacheline can move sideways, e.g. by foreign or grant
> >> mapping, but also naturally because of how cache coherency works.
> > 
> > Does such sideway moving also imply that local WB{NO,}INVD on native
> > could be equally bogus?
> > 
> > According to the SDM, cache lines can indeed move between processor
> > caches, but the memory controller must always snoop such moves and
> > flush the data to memory:
> > 
> > "Here, the processor with the valid data may pass the data to the
> > other processors without actually writing it to system memory;
> > however, it is the responsibility of the memory controller to snoop
> > this operation and update memory."
> > 
> > So a cache line moving sideways will always be propagated to memory as
> > part of the move, and hence the data in the previous pCPU cache will
> > always hit memory.
> 
> But that's only one of the two aspects of a flush. The other is to ensure
> respective data isn't in any (covered) cache anymore. IOW dirty-ness (as
> the title has it) isn't a criteria, unless of course you mean "dirty" in
> a sense different from what it means in the cache coherency model.

Given the direct map, and the fact that the CPU can speculatively load
entries in the cache, I'm not sure there's much Xen can effectively do
ATM to ensure a certain cache line it's not in any cache anymore.

It would possibly get better if/when the direct map is removed, but
even then Xen (or dom0) might map guest memory for emulation or IO
purposes.  Then Xen/dom0 would need to issue a wbinvd after unmapping
the memory, to ensure the cache is clean in case a vCPU from a domain
is scheduled there.

IMO being realistic it is very unlikely for Xen to be able to ensure
that after a guest wbinvd there are no guest owned cache lines in any
CPU cache, even if such wbinvd is propagated to all host CPUs.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 9/9] xen/x86: track dirty pCPU caches for a given vCPU
  2025-05-15 10:52         ` Roger Pau Monné
@ 2025-05-16  7:25           ` Jan Beulich
  0 siblings, 0 replies; 50+ messages in thread
From: Jan Beulich @ 2025-05-16  7:25 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper

On 15.05.2025 12:52, Roger Pau Monné wrote:
> On Mon, May 12, 2025 at 05:38:07PM +0200, Jan Beulich wrote:
>> On 06.05.2025 14:55, Roger Pau Monné wrote:
>>> On Tue, May 06, 2025 at 12:16:00PM +0100, Andrew Cooper wrote:
>>>> On 06/05/2025 9:31 am, Roger Pau Monne wrote:
>>>>> When a guest is allowed access to cache control operations such tracking
>>>>> prevents having to issue a system-wide cache flush, and rather just flush
>>>>> the pCPUs where the vCPU has been scheduled since the last flush.
>>>>>
>>>>> Note that domain-wide flushes accumulate the dirty caches from all the
>>>>> vCPUs, but clearing the vCPU masks will require pausing all vCPUs, which
>>>>> seems overkill.  Instead leave the vCPU dirty masks as-is, worse case it
>>>>> will result in redundant flushes in further calls.
>>>>>
>>>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>>>
>>>> I'm afraid this doesn't work.
>>>>
>>>> Unlike TLBs, dirty cacheline can move sideways, e.g. by foreign or grant
>>>> mapping, but also naturally because of how cache coherency works.
>>>
>>> Does such sideway moving also imply that local WB{NO,}INVD on native
>>> could be equally bogus?
>>>
>>> According to the SDM, cache lines can indeed move between processor
>>> caches, but the memory controller must always snoop such moves and
>>> flush the data to memory:
>>>
>>> "Here, the processor with the valid data may pass the data to the
>>> other processors without actually writing it to system memory;
>>> however, it is the responsibility of the memory controller to snoop
>>> this operation and update memory."
>>>
>>> So a cache line moving sideways will always be propagated to memory as
>>> part of the move, and hence the data in the previous pCPU cache will
>>> always hit memory.
>>
>> But that's only one of the two aspects of a flush. The other is to ensure
>> respective data isn't in any (covered) cache anymore. IOW dirty-ness (as
>> the title has it) isn't a criteria, unless of course you mean "dirty" in
>> a sense different from what it means in the cache coherency model.
> 
> Given the direct map, and the fact that the CPU can speculatively load
> entries in the cache, I'm not sure there's much Xen can effectively do
> ATM to ensure a certain cache line it's not in any cache anymore.
> 
> It would possibly get better if/when the direct map is removed, but
> even then Xen (or dom0) might map guest memory for emulation or IO
> purposes.  Then Xen/dom0 would need to issue a wbinvd after unmapping
> the memory, to ensure the cache is clean in case a vCPU from a domain
> is scheduled there.
> 
> IMO being realistic it is very unlikely for Xen to be able to ensure
> that after a guest wbinvd there are no guest owned cache lines in any
> CPU cache, even if such wbinvd is propagated to all host CPUs.

Well, see Andrew's reply on one of the two "restrict guest-induced WBINVD"
of mine. What you say is effectively supporting the statement I make in
the descriptions there ("We allow its use for writeback purposes only
anyway, ..."). Which Andrew says is wrong for at least future use cases,
possibly even today's. IOW I think we first need to find some common
grounds on the underlying goals / principles.

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 9/9] xen/x86: track dirty pCPU caches for a given vCPU
  2025-05-06  8:31 ` [PATCH 9/9] xen/x86: track dirty pCPU caches for a given vCPU Roger Pau Monne
  2025-05-06 11:16   ` Andrew Cooper
@ 2025-05-12 15:33   ` Jan Beulich
  1 sibling, 0 replies; 50+ messages in thread
From: Jan Beulich @ 2025-05-12 15:33 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 06.05.2025 10:31, Roger Pau Monne wrote:
> @@ -2606,6 +2619,36 @@ unsigned int domain_max_paddr_bits(const struct domain *d)
>      return bits;
>  }
>  
> +void vcpu_flush_cache(struct vcpu *curr)
> +{
> +    ASSERT(curr == current);
> +    ASSERT(cache_flush_permitted(curr->domain));
> +
> +    flush_mask(curr->arch.dirty_cache, FLUSH_CACHE);
> +    cpumask_clear(curr->arch.dirty_cache);

While here re-ordering of the two operations would be merely for (kind of)
doc purposes, ...

> +    __cpumask_set_cpu(smp_processor_id(), curr->arch.dirty_cache);
> +}
> +
> +void domain_flush_cache(const struct domain *d)
> +{
> +    const struct vcpu *v;
> +    cpumask_t *mask = this_cpu(scratch_cpumask);
> +
> +    ASSERT(cache_flush_permitted(d));
> +
> +    cpumask_clear(mask);
> +    for_each_vcpu( d, v )
> +        cpumask_or(mask, mask, v->arch.dirty_cache);
> +
> +    flush_mask(mask, FLUSH_CACHE);
> +    /*
> +     * Clearing the mask of vCPUs in the domain would be racy unless all vCPUs
> +     * are paused, so just leave them as-is, at the cost of possibly doing
> +     * redundant flushes in later calls.  It's still better than doing a
> +     * host-wide cache flush.
> +     */

... wouldn't clearing before flushing avoid the raciness here?

> +}

Also imo both functions are a better fit in mm.c.

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/8] xen: cache control improvements
  2025-05-06  8:31 [PATCH 0/8] xen: cache control improvements Roger Pau Monne
                   ` (8 preceding siblings ...)
  2025-05-06  8:31 ` [PATCH 9/9] xen/x86: track dirty pCPU caches for a given vCPU Roger Pau Monne
@ 2025-05-06  9:32 ` Christian Lindig
  9 siblings, 0 replies; 50+ messages in thread
From: Christian Lindig @ 2025-05-06  9:32 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Julien Grall, Stefano Stabellini, Juergen Gross,
	David Scott, Bertrand Marquis, Volodymyr Babchuk, Edwin Torok

This looks fine from an OCaml perspective, which is my focus. I didn't review the logical aspect of this.

— Christian

Acked-by: Christian Lindig <christian.lindig@cloud.com>


> On 6 May 2025, at 09:31, Roger Pau Monne <roger.pau@citrix.com> wrote:
> 
> Hello,
> 
> Following series contain some fixes for cache control operations, the
> main focus is to reduce the load on big systems when cache control
> operations are executed.
> 
> Patches 1-4 are bugfixes, while patches starting from 5 are improvements
> to the current code.  Patch 9 is an optimization to avoid having to
> broadcast cache flushes on all pCPUs on x86.
> 
> Thanks, Roger.
> 
> Roger Pau Monne (9):
>  x86/pv: fix MMUEXT_FLUSH_CACHE to flush all pCPU caches
>  x86/pv: fix emulation of wb{,no}invd to flush all pCPU caches
>  xen/gnttab: limit cache flush operation to guests allowed cache
>    control
>  x86/gnttab: do not implement GNTTABOP_cache_flush
>  x86/mtrr: use memory_type_changed() in hvm_set_mem_pinned_cacheattr()
>  x86/p2m: limit cache flush in memory_type_changed()
>  xen/x86: rename cache_flush_permitted() to has_arch_io_resources()
>  xen: introduce flag when a domain requires cache control
>  xen/x86: track dirty pCPU caches for a given vCPU
> 
> docs/man/xl.cfg.5.pod.in            | 10 +++++++
> tools/include/libxl.h               |  7 +++++
> tools/libs/light/libxl_create.c     |  6 ++++
> tools/libs/light/libxl_types.idl    |  3 ++
> tools/ocaml/libs/xc/xenctrl.ml      |  1 +
> tools/ocaml/libs/xc/xenctrl.mli     |  1 +
> tools/xl/xl_parse.c                 |  2 ++
> xen/arch/arm/dom0less-build.c       | 12 ++++++--
> xen/arch/arm/domain.c               |  3 +-
> xen/arch/arm/domain_build.c         |  6 ++++
> xen/arch/x86/domain.c               | 43 +++++++++++++++++++++++++++++
> xen/arch/x86/hvm/hvm.c              |  2 +-
> xen/arch/x86/hvm/mtrr.c             | 29 ++++---------------
> xen/arch/x86/hvm/svm/svm.c          |  6 ++--
> xen/arch/x86/hvm/vmx/vmcs.c         |  3 +-
> xen/arch/x86/hvm/vmx/vmx.c          |  6 ++--
> xen/arch/x86/include/asm/domain.h   |  9 ++++++
> xen/arch/x86/include/asm/flushtlb.h | 15 ----------
> xen/arch/x86/include/asm/iocap.h    | 19 ++-----------
> xen/arch/x86/include/asm/p2m.h      |  2 +-
> xen/arch/x86/mm.c                   | 21 ++++----------
> xen/arch/x86/mm/p2m-ept.c           |  7 +----
> xen/arch/x86/mm/p2m-pod.c           |  4 +--
> xen/arch/x86/mm/p2m.c               | 13 ++++++++-
> xen/arch/x86/pv/emul-priv-op.c      | 19 ++++++-------
> xen/arch/x86/setup.c                |  7 +++++
> xen/common/domain.c                 |  3 +-
> xen/common/grant_table.c            | 11 ++++++++
> xen/common/memory.c                 |  2 +-
> xen/include/asm-generic/iocap.h     |  2 +-
> xen/include/public/domctl.h         |  4 ++-
> xen/include/xen/iocap.h             | 25 ++---------------
> xen/include/xen/sched.h             |  6 ++++
> 33 files changed, 181 insertions(+), 128 deletions(-)
> 
> -- 
> 2.48.1
> 



^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2025-05-16  8:56 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-06  8:31 [PATCH 0/8] xen: cache control improvements Roger Pau Monne
2025-05-06  8:31 ` [PATCH 1/9] x86/pv: fix MMUEXT_FLUSH_CACHE to flush all pCPU caches Roger Pau Monne
2025-05-12 14:09   ` Jan Beulich
2025-05-12 14:27     ` Roger Pau Monné
2025-05-12 14:11   ` Jan Beulich
2025-05-12 14:24     ` Roger Pau Monné
2025-05-06  8:31 ` [PATCH 2/9] x86/pv: fix emulation of wb{,no}invd " Roger Pau Monne
2025-05-12 14:20   ` Jan Beulich
2025-05-12 14:41     ` Roger Pau Monné
2025-05-06  8:31 ` [PATCH 3/9] xen/gnttab: limit cache flush operation to guests allowed cache control Roger Pau Monne
2025-05-06 10:15   ` Julien Grall
2025-05-06 10:40     ` Roger Pau Monné
2025-05-12 14:23       ` Jan Beulich
2025-05-06  8:31 ` [PATCH 4/9] x86/gnttab: do not implement GNTTABOP_cache_flush Roger Pau Monne
2025-05-12 14:27   ` Jan Beulich
2025-05-06  8:31 ` [PATCH 5/9] x86/mtrr: use memory_type_changed() in hvm_set_mem_pinned_cacheattr() Roger Pau Monne
2025-05-12 15:04   ` Jan Beulich
2025-05-15 10:22     ` Roger Pau Monné
2025-05-16  7:00       ` Jan Beulich
2025-05-16  7:57         ` Roger Pau Monné
2025-05-16  6:58   ` Jan Beulich
2025-05-16  7:48     ` Roger Pau Monné
2025-05-16  8:02       ` Jan Beulich
2025-05-16  8:45         ` Roger Pau Monné
2025-05-06  8:31 ` [PATCH 6/9] x86/p2m: limit cache flush in memory_type_changed() Roger Pau Monne
2025-05-12 15:10   ` Jan Beulich
2025-05-06  8:31 ` [PATCH 7/9] xen/x86: rename cache_flush_permitted() to has_arch_io_resources() Roger Pau Monne
2025-05-12 15:16   ` Jan Beulich
2025-05-15 10:28     ` Roger Pau Monné
2025-05-16  7:07       ` Jan Beulich
2025-05-16  8:02         ` Roger Pau Monné
2025-05-16  8:08           ` Jan Beulich
2025-05-16  8:27             ` Roger Pau Monné
2025-05-16  8:36               ` Jan Beulich
2025-05-16  8:55                 ` Roger Pau Monné
2025-05-06  8:31 ` [PATCH 8/9] xen: introduce flag when a domain requires cache control Roger Pau Monne
2025-05-06 10:20   ` Julien Grall
2025-05-06 10:43     ` Roger Pau Monné
2025-05-12 15:24   ` Jan Beulich
2025-05-15 10:44     ` Roger Pau Monné
2025-05-16  7:16       ` Jan Beulich
2025-05-16  8:05         ` Roger Pau Monné
2025-05-06  8:31 ` [PATCH 9/9] xen/x86: track dirty pCPU caches for a given vCPU Roger Pau Monne
2025-05-06 11:16   ` Andrew Cooper
2025-05-06 12:55     ` Roger Pau Monné
2025-05-12 15:38       ` Jan Beulich
2025-05-15 10:52         ` Roger Pau Monné
2025-05-16  7:25           ` Jan Beulich
2025-05-12 15:33   ` Jan Beulich
2025-05-06  9:32 ` [PATCH 0/8] xen: cache control improvements Christian Lindig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.