Linux-HyperV List
 help / color / mirror / Atom feed
* Re: [PATCHv11.1 11/19] x86/tdx: Convert shared memory back to private on kexec
From: Dave Hansen @ 2024-06-04 15:47 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov
  Cc: adrian.hunter, ardb, ashish.kalra, bhe, dave.hansen,
	elena.reshetova, haiyangz, hpa, jun.nakajima, kai.huang, kexec,
	kys, linux-acpi, linux-coco, linux-hyperv, linux-kernel, ltao,
	mingo, peterz, rafael, rick.p.edgecombe,
	sathyanarayanan.kuppuswamy, seanjc, tglx, thomas.lendacky, x86
In-Reply-To: <noym2bqgxqcyhhdzoax7gvdfzhh7rtw7cv236fhzpqh3wqf76e@2jj733skv7y4>

On 6/4/24 08:32, Kirill A. Shutemov wrote:
> What about the comment below?
> 
> 			/*
> 			 * One possible reason for the failure is if kexec raced
> 			 * with memory conversion. In this case shared bit in
> 			 * page table got set (or not cleared) during
> 			 * shared<->private conversion, but the page is actually
> 			 * private. So this failure is not going to affect the
> 			 * kexec'ed kernel.
> 			 *
> 			 * The only thing one can do at this point on failure
> 			 * at this point is panic. In absence of better options,
> 			 * it is reasonable to proceed, hoping the failure is a
> 			 * benign shared bit mismatch due to the race.
> 			 *
> 			 * Also, even if the failure is real and the page cannot
> 			 * be touched as private, the kdump kernel will boot
> 			 * fine as it uses pre-reserved memory. What happens
> 			 * next depends on what the dumping process does and
> 			 * there's a reasonable chance to produce useful dump
> 			 * on crash.
> 			 *
> 			 * Regardless, the print leaves a trace in the log to
> 			 * give a clue for debug.
> 			 */

It's rambling too much for my taste.

Let's boil this down to what matters:

 1. Failures to change encryption status here can lead a future kernel
    to touch shared memory with a private mapping
 2. That causes an immediate unrecoverable guest shutdown (right?)
 3. kdump kernels should not be affected since they have their own
    memory ranges and its encryption status is not being tweawked here
 4. The pr_err() may help make some sense out of #2 when it happens

I'm not sure the reason behind the failed conversion is important here.

I wouldn't mention panic().

We don't need to opine about what the next kernel might or might not do.

^ permalink raw reply

* Re: [PATCHv11 09/19] x86/tdx: Account shared memory
From: Dave Hansen @ 2024-06-04 16:08 UTC (permalink / raw)
  To: Kirill A. Shutemov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Ard Biesheuvel, Baoquan He, H. Peter Anvin,
	K. Y. Srinivasan, Haiyang Zhang, kexec, linux-hyperv, linux-acpi,
	linux-coco, linux-kernel, Tao Liu
In-Reply-To: <20240528095522.509667-10-kirill.shutemov@linux.intel.com>

On 5/28/24 02:55, Kirill A. Shutemov wrote:
> Keep track of the number of shared pages. This will allow for 
> cross-checking against the shared information in the direct mapping
> and reporting if the shared bit is lost.

It's probably also worth mentioning that conversions are slow and
relatively rare and even though a global atomic isn't really scalable,
it also isn't worth doing anything fancier.
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 26fa47db5782..979891e97d83 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -38,6 +38,8 @@
>  
>  #define TDREPORT_SUBTYPE_0	0
>  
> +static atomic_long_t nr_shared;

Doesn't this technically need to be:

	static atomic_long_t nr_shared = ATOMIC_LONG_INIT(0);

?  I thought we had some architectures where the 0 logical value wasn't
actually all 0's.

^ permalink raw reply

* Re: [PATCHv11.1 11/19] x86/tdx: Convert shared memory back to private on kexec
From: Kirill A. Shutemov @ 2024-06-04 16:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Borislav Petkov, adrian.hunter, ardb, ashish.kalra, bhe,
	dave.hansen, elena.reshetova, haiyangz, hpa, jun.nakajima,
	kai.huang, kexec, kys, linux-acpi, linux-coco, linux-hyperv,
	linux-kernel, ltao, mingo, peterz, rafael, rick.p.edgecombe,
	sathyanarayanan.kuppuswamy, seanjc, tglx, thomas.lendacky, x86
In-Reply-To: <78d33a31-0ef2-417b-a240-b2880b64518e@intel.com>

On Tue, Jun 04, 2024 at 08:47:22AM -0700, Dave Hansen wrote:
> On 6/4/24 08:32, Kirill A. Shutemov wrote:
> > What about the comment below?
> > 
> > 			/*
> > 			 * One possible reason for the failure is if kexec raced
> > 			 * with memory conversion. In this case shared bit in
> > 			 * page table got set (or not cleared) during
> > 			 * shared<->private conversion, but the page is actually
> > 			 * private. So this failure is not going to affect the
> > 			 * kexec'ed kernel.
> > 			 *
> > 			 * The only thing one can do at this point on failure
> > 			 * at this point is panic. In absence of better options,
> > 			 * it is reasonable to proceed, hoping the failure is a
> > 			 * benign shared bit mismatch due to the race.
> > 			 *
> > 			 * Also, even if the failure is real and the page cannot
> > 			 * be touched as private, the kdump kernel will boot
> > 			 * fine as it uses pre-reserved memory. What happens
> > 			 * next depends on what the dumping process does and
> > 			 * there's a reasonable chance to produce useful dump
> > 			 * on crash.
> > 			 *
> > 			 * Regardless, the print leaves a trace in the log to
> > 			 * give a clue for debug.
> > 			 */
> 
> It's rambling too much for my taste.
> 
> Let's boil this down to what matters:
> 
>  1. Failures to change encryption status here can lead a future kernel
>     to touch shared memory with a private mapping
>  2. That causes an immediate unrecoverable guest shutdown (right?)

Right.

>  3. kdump kernels should not be affected since they have their own
>     memory ranges and its encryption status is not being tweawked here
>  4. The pr_err() may help make some sense out of #2 when it happens
> 
> I'm not sure the reason behind the failed conversion is important here.

The important part is that failure can be benign. It explains "can" in #1.
But okay.

> I wouldn't mention panic().
> 
> We don't need to opine about what the next kernel might or might not do.

Is this any better?

			/*
			 * If tdx_enc_status_changed() fails, it leaves memory
			 * in an unknown state. If the memory remains shared,
			 * it can result in an unrecoverable guest shutdown on
			 * the first accessed through a private mapping.
			 *
			 * The kdump kernel boot is not impacted as it uses
			 * a pre-reserved memory range that is always private.
			 * However, gathering crash information could lead to
			 * a crash if it accesses unconverted memory through
			 * a private mapping.
			 *
			 * pr_err() may assist in understanding such crashes.
			 */
-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCHv11 10/19] x86/mm: Add callbacks to prepare encrypted memory for kexec
From: Dave Hansen @ 2024-06-04 16:16 UTC (permalink / raw)
  To: Kirill A. Shutemov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Ard Biesheuvel, Baoquan He, H. Peter Anvin,
	K. Y. Srinivasan, Haiyang Zhang, kexec, linux-hyperv, linux-acpi,
	linux-coco, linux-kernel, Nikolay Borisov, Tao Liu
In-Reply-To: <20240528095522.509667-11-kirill.shutemov@linux.intel.com>

On 5/28/24 02:55, Kirill A. Shutemov wrote:
> +	x86_platform.guest.enc_kexec_begin(true);
> +	x86_platform.guest.enc_kexec_finish();

I really despise the random, unlabeled true/false/0/1 arguments to
functions like this.

I'll bring it up in the non-noop patch though.

^ permalink raw reply

* Re: [PATCHv11 09/19] x86/tdx: Account shared memory
From: Kirill A. Shutemov @ 2024-06-04 16:24 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Ard Biesheuvel, Baoquan He, H. Peter Anvin,
	K. Y. Srinivasan, Haiyang Zhang, kexec, linux-hyperv, linux-acpi,
	linux-coco, linux-kernel, Tao Liu
In-Reply-To: <6fd2c15b-5bd9-4e7e-8da0-4ca2c89b38f9@intel.com>

On Tue, Jun 04, 2024 at 09:08:25AM -0700, Dave Hansen wrote:
> On 5/28/24 02:55, Kirill A. Shutemov wrote:
> > Keep track of the number of shared pages. This will allow for 
> > cross-checking against the shared information in the direct mapping
> > and reporting if the shared bit is lost.
> 
> It's probably also worth mentioning that conversions are slow and
> relatively rare and even though a global atomic isn't really scalable,
> it also isn't worth doing anything fancier.

Okay, will do.

> > diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> > index 26fa47db5782..979891e97d83 100644
> > --- a/arch/x86/coco/tdx/tdx.c
> > +++ b/arch/x86/coco/tdx/tdx.c
> > @@ -38,6 +38,8 @@
> >  
> >  #define TDREPORT_SUBTYPE_0	0
> >  
> > +static atomic_long_t nr_shared;
> 
> Doesn't this technically need to be:
> 
> 	static atomic_long_t nr_shared = ATOMIC_LONG_INIT(0);
> 
> ?  I thought we had some architectures where the 0 logical value wasn't
> actually all 0's.

Hm. I am not aware of such requirement. I see plenty uninitilized
atomic_long_t in generic code. For instance, invalid_kread_bytes.

And I doubt TDX will ever be built for non-x86 :P

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCHv11 11/19] x86/tdx: Convert shared memory back to private on kexec
From: Dave Hansen @ 2024-06-04 16:27 UTC (permalink / raw)
  To: Kirill A. Shutemov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Ard Biesheuvel, Baoquan He, H. Peter Anvin,
	K. Y. Srinivasan, Haiyang Zhang, kexec, linux-hyperv, linux-acpi,
	linux-coco, linux-kernel, Tao Liu
In-Reply-To: <20240528095522.509667-12-kirill.shutemov@linux.intel.com>

On 5/28/24 02:55, Kirill A. Shutemov wrote:
> +/* Stop new private<->shared conversions */
> +static void tdx_kexec_begin(bool crash)
> +{
> +	/*
> +	 * Crash kernel reaches here with interrupts disabled: can't wait for
> +	 * conversions to finish.
> +	 *
> +	 * If race happened, just report and proceed.
> +	 */
> +	if (!set_memory_enc_stop_conversion(!crash))
> +		pr_warn("Failed to stop shared<->private conversions\n");
> +}

I don't like having to pass 'crash' in here.

If interrupts are the problem we have ways of testing for those directly.

If it's being in an oops that's a problem, we have 'oops_in_progress'
for that.

In other words, I'd much rather this function (or better yet
set_memory_enc_stop_conversion() itself) use some existing API to change
its behavior in a crash rather than have the context be passed down and
twiddled through several levels of function calls.

There are a ton of these in the console code:

	if (oops_in_progress)
		foo_trylock();
	else
		foo_lock();

To me, that's a billion times more clear than a 'wait' argument that
gets derives from who-knows-what that I have to trace through ten levels
of function calls.

	

^ permalink raw reply

* Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion
From: Borislav Petkov @ 2024-06-04 17:57 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: H. Peter Anvin, Nikolay Borisov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, Rafael J. Wysocki, Peter Zijlstra,
	Adrian Hunter, Kuppuswamy Sathyanarayanan, Elena Reshetova,
	Jun Nakajima, Rick Edgecombe, Tom Lendacky, Kalra, Ashish,
	Sean Christopherson, Huang, Kai, Ard Biesheuvel, Baoquan He,
	K. Y. Srinivasan, Haiyang Zhang, kexec, linux-hyperv, linux-acpi,
	linux-coco, linux-kernel
In-Reply-To: <ehttxqgg7zhbgty5m5uxkduj3xf7soonrzfu4rfw7hccqgdydl@afki66pnree5>

On Tue, Jun 04, 2024 at 06:21:27PM +0300, Kirill A. Shutemov wrote:
> What about this?

Yeah, LGTM.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply

* Re: [PATCHv11.1 11/19] x86/tdx: Convert shared memory back to private on kexec
From: Borislav Petkov @ 2024-06-04 18:05 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, adrian.hunter, ardb, ashish.kalra, bhe, dave.hansen,
	elena.reshetova, haiyangz, hpa, jun.nakajima, kai.huang, kexec,
	kys, linux-acpi, linux-coco, linux-hyperv, linux-kernel, ltao,
	mingo, peterz, rafael, rick.p.edgecombe,
	sathyanarayanan.kuppuswamy, seanjc, tglx, thomas.lendacky, x86
In-Reply-To: <u3hg3fqc2nxsjtfugjmmzlahwriyqlebnkxrbzgrxlkj6l3k36@yd3yudglgevi>

On Tue, Jun 04, 2024 at 07:14:00PM +0300, Kirill A. Shutemov wrote:
> 			/*
> 			 * If tdx_enc_status_changed() fails, it leaves memory
> 			 * in an unknown state. If the memory remains shared,
> 			 * it can result in an unrecoverable guest shutdown on
> 			 * the first accessed through a private mapping.

"access"

So this sentence above can go too, right?

Because that comment is in tdx_kexec_finish() and we're basically going
off to kexec. So can a guest even access it through a private mapping?
We're shutting down so nothing is running anymore...

> 			 * The kdump kernel boot is not impacted as it uses
> 			 * a pre-reserved memory range that is always private.
> 			 * However, gathering crash information could lead to
> 			 * a crash if it accesses unconverted memory through
> 			 * a private mapping.

When does the kexec kernel even get such a private mapping? It is not
even up yet...

> 			 * pr_err() may assist in understanding such crashes.

"Print error info in order to leave bread crumbs for debugging." is what
I'd say.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply

* Re: [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats
From: Thomas Gleixner @ 2024-06-04 18:13 UTC (permalink / raw)
  To: mhklinux, kys, haiyangz, wei.liu, decui, mingo, bp, dave.hansen,
	x86, hpa, lpieralisi, kw, robh, bhelgaas, James.Bottomley,
	martin.petersen, arnd, linux-hyperv, linux-kernel, linux-pci,
	linux-scsi, linux-arch
  Cc: maz, den, jgowans, dawei.li
In-Reply-To: <20240604050940.859909-7-mhklinux@outlook.com>

Michael!

On Mon, Jun 03 2024 at 22:09, mhkelley58@gmail.com wrote:
> Hyper-V VMBus devices generate interrupts that are multiplexed
> onto a single per-CPU architectural interrupt. The top-level VMBus
> driver ISR demultiplexes these interrupts and invokes per-device
> handlers. Currently, these per-device handlers are not modeled as
> Linux IRQs, so /proc/interrupts shows all VMBus interrupts as accounted
> to the top level architectural interrupt. Visibility into per-device
> interrupt stats requires accessing VMBus-specific entries in sysfs.
> The top-level VMBus driver ISR also handles management-related
> interrupts that are not attributable to a particular VMBus device.
>
> As part of changing VMBus to model VMBus per-device handlers as
> normal Linux IRQs, the top-level VMBus driver needs to conditionally
> account for interrupts. If it passes the interrupt off to a
> device-specific IRQ, the interrupt stats are done by that IRQ
> handler, and accounting for the interrupt at the top level
> is duplicative. But if it handles a management-related interrupt
> itself, then it should account for the interrupt itself.
>
> Introduce a new flow handler that provides this functionality.
> The new handler parallels handle_percpu_irq(), but does stats
> only if the ISR returns other than IRQ_NONE. The existing
> handle_untracked_irq() can't be used because it doesn't work for
> per-cpu IRQs, and it doesn't provide conditional stats.

There is a two other options to solve this:

   1) Move the inner workings of handle_percpu_irq() out into
      a static function which returns the 'handled' value and
      share it between the two handler functions.

   2) Allocate a proper interrupt for the management mode and invoke it
      via generic_handle_irq() just as any other demultiplex interrupt.
      That spares all the special casing in the core code and just
      works.

Thanks,

        tglx

^ permalink raw reply

* Re: [PATCH] x86/tdx: Enhance code generation for TDCALL and SEAMCALL wrappers
From: Sean Christopherson @ 2024-06-04 19:34 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Paolo Bonzini, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Josh Poimboeuf, Peter Zijlstra, linux-coco, linux-kernel,
	linux-hyperv
In-Reply-To: <crv2ak76xudopm7xnbg6zp2ntw4t2gni3vlowm7otl7xyu72oz@rt2ncbmodsth>

On Tue, Jun 04, 2024, Kirill A. Shutemov wrote:
> On Mon, Jun 03, 2024 at 06:37:45AM -0700, Dave Hansen wrote:
> > On 6/2/24 04:54, Kirill A. Shutemov wrote:
> > > Sean observed that the compiler is generating inefficient code to clear
> > > the tdx_module_args struct for TDCALL and SEAMCALL wrappers. The
> > > compiler is generating numerous instructions at each call site to clear
> > > the unused fields of the structure.
> > > 
> > > To address this issue, avoid using C99-initializer and instead
> > > explicitly use string instructions to clear the struct.
> > > 
> > > With Clang, this change results in a savings of approximately 3K with my
> > > configuration:
> > > 
> > > add/remove: 0/0 grow/shrink: 0/21 up/down: 0/-3187 (-3187)
> > > 
> > > With GCC, the savings are less significant at around 300 bytes:
> > > 
> > > add/remove: 0/0 grow/shrink: 3/22 up/down: 17/-313 (-296)
> > > 
> > > GCC tends to generate string instructions more frequently to clear the
> > > struct.
> > 
> > <shrug>
> > 
> > I don't think moving away from perfectly normal C struct initialization
> > is worth it for 300 bytes of text in couple of slow paths.
> > 
> > If someone out there is using clang, is confident that it is doing the
> > right thing and not just being silly, _and_ is horribly bothered by its
> > code generation, then please speak up.
> 
> Conceptually, I like my previous attempt more. But it is much more
> intrusive and I am not sure it is worth the risk.
> 
> This patch feels like hack around compiler.
> 
> Sean, do you have any comments?

Yes :-)

1. Y'all *really* need to actually look at the generated code, because this is
   amusingly broken.

diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 0519dd7cbb92..575cc54670ef 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -91,7 +91,7 @@ struct tdx_module_args {
 
 static __always_inline void tdx_arg_init(struct tdx_module_args *args)
 {
-       asm ("rep stosb"
+       asm volatile ("rep stosb"
             : "+D" (args)
             : "c" (sizeof(*args)), "a" (0)
             : "memory");

2. Look at *all* the code generation, because the code generation for TDX-as-a-guest
   as a whole is bad/confusing.  E.g. switching on ve->exit_reason in ve_instr_len()
   after doing the same in virt_exception_kernel() _could_ be optimized away, but
   in part because tdx_module_args generates so much code, gcc-13 at least tends
   to not inline the individual handlers, which in turn means ve_instr_len() doesn't
   get inlined because the compiler doesn't know that e.g. handle_cpuid() will
   only ever call ve_instr_len() with ve->exit_reason == EXIT_REASON_CPUID.

3. As Paolo pointed out[*], passing arguments vs. explicitly filling operands
   have different tradeoffs.  Explicitly filling operands is *visually* pleasing,
   but it's also prone to typos, and as evidenced by a data leak in mmio_read(),
   being able to easily audit the code doesn't mean squat unless someone actually
   does the audit (usually in response to a bug).

   static bool mmio_read(int size, unsigned long addr, unsigned long *val)
   {
	struct tdx_module_args args = {
		.r10 = TDX_HYPERCALL_STANDARD,
		.r11 = hcall_func(EXIT_REASON_EPT_VIOLATION),
		.r12 = size,
		.r13 = EPT_READ,
		.r14 = addr,
		.r15 = *val,    <==== data leak
	};

	if (__tdx_hypercall(&args))
		return false;

	*val = args.r11;
	return true;
   }

   [*] https://lore.kernel.org/all/611a387b-ba7e-46d7-b6bf-84dc6c037d33@redhat.com

3. Pick _one_ approach for the majority of TDVMCALLs.  The existing code is a mix
   of passing arguments (e.g. mmio_write()) and explicit operands (e.g. mmio_read()).
   There will inevitably be special snowflakes, e.g. for some asinine reason, CPUID
   skips r11 as an output.  But AFAICT, most TDVMCALLs conform to a standard
   pattern.

4. Using a trampoline probably isn't worth the marginal reduction in *written*
   code.  The generated code is almost as weird as the tdx_module_args code.
   E.g. each callsite generates a pile of MOV instructions to registers that
   *don't* match the GHCI, and so I doubt the end result would be any easier to
   debug for unsuspecting users.

If we're willing to suffer a few gnarly macros, I think we get a satisfactory mix
of standardized arguments and explicit operands, and generate vastly better code.

The macros are beyond ugly and are also error prone to some extent, but having to
add new macros should be quite rare, and much of the boilerplate could be stripped
away with even more macros.

And while the macros are ugly, the advantage of having to specify the number of
input and output operands reduces the probability of a data leak, e.g. the mmio_read()
bug wouldn't escape compilation because TDVMCALL_4_1() would be unhappy, and
TDVMCALL_5_1() doesn't need to exist.

Dump of assembler code for function tdx_handle_virt_exception:
   0xffffffff81003220 <+0>:	call   0xffffffff810577d0 <__fentry__>
   0xffffffff81003225 <+5>:	push   %r13
   0xffffffff81003227 <+7>:	push   %r12
   0xffffffff81003229 <+9>:	push   %rbp
   0xffffffff8100322a <+10>:	push   %rbx
   0xffffffff8100322b <+11>:	mov    %rdi,%rbx
   0xffffffff8100322e <+14>:	sub    $0x8,%rsp
   0xffffffff81003232 <+18>:	testb  $0x3,0x88(%rdi)
   0xffffffff81003239 <+25>:	mov    (%rsi),%rax
   0xffffffff8100323c <+28>:	je     0xffffffff81003269 <tdx_handle_virt_exception+73>
   0xffffffff8100323e <+30>:	cmp    $0xa,%rax
   0xffffffff81003242 <+34>:	jne    0xffffffff8100327a <tdx_handle_virt_exception+90>
   0xffffffff81003244 <+36>:	mov    %rbx,%rdi
   0xffffffff81003247 <+39>:	call   0xffffffff81002820 <handle_cpuid>
   0xffffffff8100324c <+44>:	test   %eax,%eax
   0xffffffff8100324e <+46>:	js     0xffffffff81003289 <tdx_handle_virt_exception+105>
   0xffffffff81003250 <+48>:	cltq
   0xffffffff81003252 <+50>:	add    %rax,0x80(%rbx)
   0xffffffff81003259 <+57>:	mov    $0x1,%eax
   0xffffffff8100325e <+62>:	add    $0x8,%rsp
   0xffffffff81003262 <+66>:	pop    %rbx
   0xffffffff81003263 <+67>:	pop    %rbp
   0xffffffff81003264 <+68>:	pop    %r12
   0xffffffff81003266 <+70>:	pop    %r13
   0xffffffff81003268 <+72>:	ret
   0xffffffff81003269 <+73>:	lea    -0xa(%rax),%rdx
   0xffffffff8100326d <+77>:	cmp    $0x26,%rdx
   0xffffffff81003271 <+81>:	ja     0xffffffff8100327a <tdx_handle_virt_exception+90>
   0xffffffff81003273 <+83>:	jmp    *-0x7e3ffd88(,%rdx,8)
   0xffffffff8100327a <+90>:	mov    %rax,%rsi
   0xffffffff8100327d <+93>:	mov    $0xffffffff81e6b0b6,%rdi
   0xffffffff81003284 <+100>:	call   0xffffffff810eb170 <_printk>
   0xffffffff81003289 <+105>:	xor    %eax,%eax
   0xffffffff8100328b <+107>:	jmp    0xffffffff8100325e <tdx_handle_virt_exception+62>
   0xffffffff8100328d <+109>:	mov    0x18(%rsi),%rbp
   0xffffffff81003291 <+113>:	mov    %rsi,(%rsp)
   0xffffffff81003295 <+117>:	mov    %rbp,%rdi
   0xffffffff81003298 <+120>:	call   0xffffffff81002700 <cc_mkenc>
   0xffffffff8100329d <+125>:	mov    (%rsp),%rsi
   0xffffffff810032a1 <+129>:	cmp    %rax,%rbp
   0xffffffff810032a4 <+132>:	je     0xffffffff81003376 <tdx_handle_virt_exception+342>
   0xffffffff810032aa <+138>:	mov    %rbx,%rdi
   0xffffffff810032ad <+141>:	call   0xffffffff81002e20 <handle_mmio>
   0xffffffff810032b2 <+146>:	jmp    0xffffffff8100324c <tdx_handle_virt_exception+44>
   0xffffffff810032b4 <+148>:	mov    0x60(%rdi),%rdx
   0xffffffff810032b8 <+152>:	xor    %eax,%eax
   0xffffffff810032ba <+154>:	mov    $0x3c00,%ecx
   0xffffffff810032bf <+159>:	shl    $0x20,%rdx
   0xffffffff810032c3 <+163>:	or     0x50(%rdi),%rdx
   0xffffffff810032c7 <+167>:	mov    $0x20,%r11
   0xffffffff810032ce <+174>:	mov    0x58(%rdi),%r12
   0xffffffff810032d2 <+178>:	mov    %rdx,%r13
   0xffffffff810032d5 <+181>:	xor    %r10d,%r10d
   0xffffffff810032d8 <+184>:	tdcall
   0xffffffff810032dc <+188>:	mov    %r10,%rcx
   0xffffffff810032df <+191>:	test   %rax,%rax
   0xffffffff810032e2 <+194>:	jne    0xffffffff81003371 <tdx_handle_virt_exception+337>
   0xffffffff810032e8 <+200>:	test   %rcx,%rcx
   0xffffffff810032eb <+203>:	jne    0xffffffff81003289 <tdx_handle_virt_exception+105>
   0xffffffff810032ed <+205>:	mov    0x20(%rsi),%eax
   0xffffffff810032f0 <+208>:	jmp    0xffffffff8100324c <tdx_handle_virt_exception+44>
   0xffffffff810032f5 <+213>:	xor    %eax,%eax
   0xffffffff810032f7 <+215>:	mov    $0x1c00,%ecx
   0xffffffff810032fc <+220>:	mov    $0x1f,%r11
   0xffffffff81003303 <+227>:	mov    0x58(%rdi),%r12
   0xffffffff81003307 <+231>:	xor    %r10d,%r10d
   0xffffffff8100330a <+234>:	tdcall
   0xffffffff8100330e <+238>:	mov    %r10,%rcx
   0xffffffff81003311 <+241>:	mov    %r11,%rdx
   0xffffffff81003314 <+244>:	test   %rax,%rax
   0xffffffff81003317 <+247>:	jne    0xffffffff81003371 <tdx_handle_virt_exception+337>
   0xffffffff81003319 <+249>:	test   %rcx,%rcx
   0xffffffff8100331c <+252>:	jne    0xffffffff81003289 <tdx_handle_virt_exception+105>
   0xffffffff81003322 <+258>:	movq   $0x0,0x50(%rdi)
   0xffffffff8100332a <+266>:	movq   $0x0,0x60(%rdi)
   0xffffffff81003332 <+274>:	cmpq   $0x1f,(%rsi)
   0xffffffff81003336 <+278>:	je     0xffffffff810032ed <tdx_handle_virt_exception+205>
   0xffffffff81003338 <+280>:	ud2
   0xffffffff8100333a <+282>:	jmp    0xffffffff810032ed <tdx_handle_virt_exception+205>
   0xffffffff8100333c <+284>:	mov    %rsi,(%rsp)
   0xffffffff81003340 <+288>:	pushf
   0xffffffff81003341 <+289>:	pop    %rdi
   0xffffffff81003342 <+290>:	shr    $0x9,%rdi
   0xffffffff81003346 <+294>:	xor    $0x1,%rdi
   0xffffffff8100334a <+298>:	and    $0x1,%edi
   0xffffffff8100334d <+301>:	call   0xffffffff81949c00 <__halt>
   0xffffffff81003352 <+306>:	test   %rax,%rax
   0xffffffff81003355 <+309>:	jne    0xffffffff81003289 <tdx_handle_virt_exception+105>
   0xffffffff8100335b <+315>:	mov    (%rsp),%rsi
   0xffffffff8100335f <+319>:	cmpq   $0xc,(%rsi)
   0xffffffff81003363 <+323>:	je     0xffffffff810032ed <tdx_handle_virt_exception+205>
   0xffffffff81003365 <+325>:	jmp    0xffffffff81003338 <tdx_handle_virt_exception+280>
   0xffffffff81003367 <+327>:	call   0xffffffff81002d20 <handle_io>
   0xffffffff8100336c <+332>:	jmp    0xffffffff8100324c <tdx_handle_virt_exception+44>
   0xffffffff81003371 <+337>:	call   0xffffffff81944440 <__tdx_hypercall_failed>
   0xffffffff81003376 <+342>:	mov    $0xffffffff81eab098,%rdi
   0xffffffff8100337d <+349>:	call   0xffffffff81080cf0 <panic>
End of assembler dump.

---
 arch/x86/boot/compressed/tdx.c    |  31 ++---
 arch/x86/coco/tdx/tdx.c           | 136 +++++++++-----------
 arch/x86/hyperv/ivm.c             |  26 +---
 arch/x86/include/asm/shared/tdx.h | 204 +++++++++++++++++++++++++++---
 4 files changed, 262 insertions(+), 135 deletions(-)

diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
index 8451d6a1030c..5a94cab412ed 100644
--- a/arch/x86/boot/compressed/tdx.c
+++ b/arch/x86/boot/compressed/tdx.c
@@ -18,32 +18,25 @@ void __tdx_hypercall_failed(void)
 
 static inline unsigned int tdx_io_in(int size, u16 port)
 {
-	struct tdx_module_args args = {
-		.r10 = TDX_HYPERCALL_STANDARD,
-		.r11 = hcall_func(EXIT_REASON_IO_INSTRUCTION),
-		.r12 = size,
-		.r13 = 0,
-		.r14 = port,
-	};
+	unsigned int val;
 
-	if (__tdx_hypercall(&args))
+	if (TDVMCALL_4_1(r11 = hcall_func(EXIT_REASON_IO_INSTRUCTION),
+			     r12 = size,
+			     r13 = 0,
+			     r14 = port,
+			     TDX_ON_SUCCESS(val = out_r11)))
 		return UINT_MAX;
 
-	return args.r11;
+	return val;
 }
 
 static inline void tdx_io_out(int size, u16 port, u32 value)
 {
-	struct tdx_module_args args = {
-		.r10 = TDX_HYPERCALL_STANDARD,
-		.r11 = hcall_func(EXIT_REASON_IO_INSTRUCTION),
-		.r12 = size,
-		.r13 = 1,
-		.r14 = port,
-		.r15 = value,
-	};
-
-	__tdx_hypercall(&args);
+	TDVMCALL_5_0(r11 = hcall_func(EXIT_REASON_IO_INSTRUCTION),
+		     r12 = size,
+		     r13 = 1,
+		     r14 = port,
+		     r15 = value);
 }
 
 static inline u8 tdx_inb(u16 port)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index c1cb90369915..b9f76445419c 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -124,7 +124,10 @@ EXPORT_SYMBOL_GPL(tdx_mcall_get_report0);
 u64 tdx_hcall_get_quote(u8 *buf, size_t size)
 {
 	/* Since buf is a shared memory, set the shared (decrypted) bits */
-	return _tdx_hypercall(TDVMCALL_GET_QUOTE, cc_mkdec(virt_to_phys(buf)), size, 0, 0);
+	return TDVMCALL_3_0(r11 = TDVMCALL_GET_QUOTE,
+			    r12 = cc_mkdec(virt_to_phys(buf)),
+			    r13 = size);
+	return 0;
 }
 EXPORT_SYMBOL_GPL(tdx_hcall_get_quote);
 
@@ -226,9 +229,11 @@ static void tdx_parse_tdinfo(u64 *cc_mask)
  * information if #VE occurred due to instruction execution, but not for EPT
  * violations.
  */
-static int ve_instr_len(struct ve_info *ve)
+static int ve_instr_len(u32 exit_reason, struct ve_info *ve)
 {
-	switch (ve->exit_reason) {
+	WARN_ON_ONCE(ve->exit_reason != exit_reason);
+
+	switch (exit_reason) {
 	case EXIT_REASON_HLT:
 	case EXIT_REASON_MSR_READ:
 	case EXIT_REASON_MSR_WRITE:
@@ -252,12 +257,6 @@ static int ve_instr_len(struct ve_info *ve)
 
 static u64 __cpuidle __halt(const bool irq_disabled)
 {
-	struct tdx_module_args args = {
-		.r10 = TDX_HYPERCALL_STANDARD,
-		.r11 = hcall_func(EXIT_REASON_HLT),
-		.r12 = irq_disabled,
-	};
-
 	/*
 	 * Emulate HLT operation via hypercall. More info about ABI
 	 * can be found in TDX Guest-Host-Communication Interface
@@ -270,7 +269,8 @@ static u64 __cpuidle __halt(const bool irq_disabled)
 	 * can keep the vCPU in virtual HLT, even if an IRQ is
 	 * pending, without hanging/breaking the guest.
 	 */
-	return __tdx_hypercall(&args);
+	return TDVMCALL_2_0(r11 = hcall_func(EXIT_REASON_HLT),
+			    r12 = irq_disabled);
 }
 
 static int handle_halt(struct ve_info *ve)
@@ -280,7 +280,7 @@ static int handle_halt(struct ve_info *ve)
 	if (__halt(irq_disabled))
 		return -EIO;
 
-	return ve_instr_len(ve);
+	return ve_instr_len(EXIT_REASON_HLT, ve);
 }
 
 void __cpuidle tdx_safe_halt(void)
@@ -296,43 +296,37 @@ void __cpuidle tdx_safe_halt(void)
 
 static int read_msr(struct pt_regs *regs, struct ve_info *ve)
 {
-	struct tdx_module_args args = {
-		.r10 = TDX_HYPERCALL_STANDARD,
-		.r11 = hcall_func(EXIT_REASON_MSR_READ),
-		.r12 = regs->cx,
-	};
+	u64 val = 0;
 
 	/*
 	 * Emulate the MSR read via hypercall. More info about ABI
 	 * can be found in TDX Guest-Host-Communication Interface
 	 * (GHCI), section titled "TDG.VP.VMCALL<Instruction.RDMSR>".
 	 */
-	if (__tdx_hypercall(&args))
+	if (TDVMCALL_2_1(r11 = hcall_func(EXIT_REASON_MSR_READ),
+			 r12 = regs->cx,
+			 TDX_ON_SUCCESS(val = out_r11)))
 		return -EIO;
 
-	regs->ax = lower_32_bits(args.r11);
-	regs->dx = upper_32_bits(args.r11);
-	return ve_instr_len(ve);
+	regs->ax = lower_32_bits(val);
+	regs->dx = upper_32_bits(val);
+
+	return ve_instr_len(EXIT_REASON_MSR_READ, ve);
 }
 
 static int write_msr(struct pt_regs *regs, struct ve_info *ve)
 {
-	struct tdx_module_args args = {
-		.r10 = TDX_HYPERCALL_STANDARD,
-		.r11 = hcall_func(EXIT_REASON_MSR_WRITE),
-		.r12 = regs->cx,
-		.r13 = (u64)regs->dx << 32 | regs->ax,
-	};
-
 	/*
 	 * Emulate the MSR write via hypercall. More info about ABI
 	 * can be found in TDX Guest-Host-Communication Interface
 	 * (GHCI) section titled "TDG.VP.VMCALL<Instruction.WRMSR>".
 	 */
-	if (__tdx_hypercall(&args))
+	if (TDVMCALL_3_0(r11 = hcall_func(EXIT_REASON_MSR_WRITE),
+			 r12 = regs->cx,
+			 r13 = (u64)regs->dx << 32 | regs->ax))
 		return -EIO;
 
-	return ve_instr_len(ve);
+	return ve_instr_len(EXIT_REASON_MSR_WRITE, ve);
 }
 
 static int handle_cpuid(struct pt_regs *regs, struct ve_info *ve)
@@ -353,7 +347,7 @@ static int handle_cpuid(struct pt_regs *regs, struct ve_info *ve)
 	 */
 	if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) {
 		regs->ax = regs->bx = regs->cx = regs->dx = 0;
-		return ve_instr_len(ve);
+		return ve_instr_len(EXIT_REASON_CPUID, ve);
 	}
 
 	/*
@@ -374,31 +368,26 @@ static int handle_cpuid(struct pt_regs *regs, struct ve_info *ve)
 	regs->cx = args.r14;
 	regs->dx = args.r15;
 
-	return ve_instr_len(ve);
+	return ve_instr_len(EXIT_REASON_CPUID, ve);
 }
 
 static bool mmio_read(int size, unsigned long addr, unsigned long *val)
 {
-	struct tdx_module_args args = {
-		.r10 = TDX_HYPERCALL_STANDARD,
-		.r11 = hcall_func(EXIT_REASON_EPT_VIOLATION),
-		.r12 = size,
-		.r13 = EPT_READ,
-		.r14 = addr,
-		.r15 = *val,
-	};
-
-	if (__tdx_hypercall(&args))
-		return false;
-
-	*val = args.r11;
-	return true;
+	return !TDVMCALL_4_1(r11 = hcall_func(EXIT_REASON_EPT_VIOLATION),
+			     r12 = size,
+			     r13 = EPT_READ,
+			     r14 = addr,
+			     TDX_ON_SUCCESS(*val = out_r11));
+	return false;
 }
 
 static bool mmio_write(int size, unsigned long addr, unsigned long val)
 {
-	return !_tdx_hypercall(hcall_func(EXIT_REASON_EPT_VIOLATION), size,
-			       EPT_WRITE, addr, val);
+	return !TDVMCALL_5_0(r11 = hcall_func(EXIT_REASON_EPT_VIOLATION),
+			     r12 = size,
+			     r13 = EPT_WRITE,
+			     r14 = addr,
+			     r15 = val);
 }
 
 static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
@@ -508,42 +497,37 @@ static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
 
 static bool handle_in(struct pt_regs *regs, int size, int port)
 {
-	struct tdx_module_args args = {
-		.r10 = TDX_HYPERCALL_STANDARD,
-		.r11 = hcall_func(EXIT_REASON_IO_INSTRUCTION),
-		.r12 = size,
-		.r13 = PORT_READ,
-		.r14 = port,
-	};
-	u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
-	bool success;
-
-	/*
-	 * Emulate the I/O read via hypercall. More info about ABI can be found
-	 * in TDX Guest-Host-Communication Interface (GHCI) section titled
-	 * "TDG.VP.VMCALL<Instruction.IO>".
-	 */
-	success = !__tdx_hypercall(&args);
+	const u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
 
 	/* Update part of the register affected by the emulated instruction */
 	regs->ax &= ~mask;
-	if (success)
-		regs->ax |= args.r11 & mask;
 
-	return success;
+	/*
+	 * Emulate the I/O read via hypercall. More info about ABI can be found
+	 * in TDX Guest-Host-Communication Interface (GHCI) section titled
+	 * "TDG.VP.VMCALL<Instruction.IO>".
+	 */
+	return !TDVMCALL_4_1(r11 = hcall_func(EXIT_REASON_IO_INSTRUCTION),
+			     r12 = size,
+			     r13 = PORT_READ,
+			     r14 = port,
+			     TDX_ON_SUCCESS(regs->ax |= out_r11 & mask));
 }
 
 static bool handle_out(struct pt_regs *regs, int size, int port)
 {
-	u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
+	const u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
 
 	/*
 	 * Emulate the I/O write via hypercall. More info about ABI can be found
 	 * in TDX Guest-Host-Communication Interface (GHCI) section titled
 	 * "TDG.VP.VMCALL<Instruction.IO>".
 	 */
-	return !_tdx_hypercall(hcall_func(EXIT_REASON_IO_INSTRUCTION), size,
-			       PORT_WRITE, port, regs->ax & mask);
+	return !TDVMCALL_5_0(r11 = hcall_func(EXIT_REASON_IO_INSTRUCTION),
+			     r12 = size,
+			     r13 = PORT_WRITE,
+			     r14 = port,
+			     r15 = regs->ax & mask);
 }
 
 /*
@@ -575,7 +559,7 @@ static int handle_io(struct pt_regs *regs, struct ve_info *ve)
 	if (!ret)
 		return -EIO;
 
-	return ve_instr_len(ve);
+	return ve_instr_len(EXIT_REASON_IO_INSTRUCTION, ve);
 }
 
 /*
@@ -745,14 +729,11 @@ static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
 	}
 
 	while (retry_count < max_retries_per_page) {
-		struct tdx_module_args args = {
-			.r10 = TDX_HYPERCALL_STANDARD,
-			.r11 = TDVMCALL_MAP_GPA,
-			.r12 = start,
-			.r13 = end - start };
-
 		u64 map_fail_paddr;
-		u64 ret = __tdx_hypercall(&args);
+		u64 ret = TDVMCALL_3_1(r11 = TDVMCALL_MAP_GPA,
+				       r12 = start,
+				       r13 = end - start,
+				       map_fail_paddr = out_r11);
 
 		if (ret != TDVMCALL_STATUS_RETRY)
 			return !ret;
@@ -761,7 +742,6 @@ static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
 		 * region starting at the GPA specified in R11. R11 comes
 		 * from the untrusted VMM. Sanity check it.
 		 */
-		map_fail_paddr = args.r11;
 		if (map_fail_paddr < start || map_fail_paddr >= end)
 			return false;
 
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 768d73de0d09..4d51b8fde6b1 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -385,32 +385,20 @@ static inline void hv_ghcb_msr_read(u64 msr, u64 *value) {}
 #ifdef CONFIG_INTEL_TDX_GUEST
 static void hv_tdx_msr_write(u64 msr, u64 val)
 {
-	struct tdx_module_args args = {
-		.r10 = TDX_HYPERCALL_STANDARD,
-		.r11 = EXIT_REASON_MSR_WRITE,
-		.r12 = msr,
-		.r13 = val,
-	};
-
-	u64 ret = __tdx_hypercall(&args);
+	u64 ret = TDVMCALL_3_0(r11 = EXIT_REASON_MSR_WRITE,
+			       r12 = msr,
+			       r13 = val);
 
 	WARN_ONCE(ret, "Failed to emulate MSR write: %lld\n", ret);
 }
 
 static void hv_tdx_msr_read(u64 msr, u64 *val)
 {
-	struct tdx_module_args args = {
-		.r10 = TDX_HYPERCALL_STANDARD,
-		.r11 = EXIT_REASON_MSR_READ,
-		.r12 = msr,
-	};
+	u64 ret = TDVMCALL_2_1(r11 = hcall_func(EXIT_REASON_MSR_READ),
+			       r12 = msr,
+			       *val = out_r11);
 
-	u64 ret = __tdx_hypercall(&args);
-
-	if (WARN_ONCE(ret, "Failed to emulate MSR read: %lld\n", ret))
-		*val = 0;
-	else
-		*val = args.r11;
+	WARN_ONCE(ret, "Failed to emulate MSR read: %lld\n", ret);
 }
 
 u64 hv_tdx_hypercall(u64 control, u64 param1, u64 param2)
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index fdfd41511b02..c1354054f144 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -65,6 +65,191 @@
 
 #include <linux/compiler_attributes.h>
 
+#define TDVMCALL_BUG_ON(ret)							\
+do {										\
+	if (unlikely(ret))							\
+		__tdx_hypercall_failed();					\
+} while (0)
+
+#define TDX_ON_SUCCESS(x) if (__ret) (x)
+
+#define TDVMCALL_2_0(__in_r11, __in_r12)					\
+({										\
+	u64 vmcall_ret = TDG_VP_VMCALL;						\
+	u64 __in_r11, __in_r12;							\
+	u64 __ret;								\
+										\
+	asm volatile (								\
+		"movq	%[r11_in], %%r11\n\t"					\
+		"movq	%[r12_in], %%r12\n\t"					\
+		"xor    %%r10d, %%r10d\n\t"					\
+		".byte 0x66,0x0f,0x01,0xcc\n\t"					\
+		"movq	%%r10, %[r10_out]\n\t"					\
+		: "+a"(vmcall_ret),						\
+		  [r10_out] "=rm" (__ret)					\
+		: "c" (TDX_R10 | TDX_R11 | TDX_R12),				\
+		  [r11_in] "irm" (r11),						\
+		  [r12_in] "irm" (r12)						\
+		: "r10", "r11", "r12"						\
+	);									\
+										\
+	TDVMCALL_BUG_ON(vmcall_ret);						\
+										\
+	__ret;									\
+})
+
+#define TDVMCALL_2_1(__in_r11, __in_r12, __out_r11)				\
+({										\
+	u64 vmcall_ret = TDG_VP_VMCALL;						\
+	u64 __in_r11, __in_r12;							\
+	u64 out_r11;								\
+	u64 __ret;								\
+										\
+	asm volatile (								\
+		"movq	%[r11_in], %%r11\n\t"					\
+		"movq	%[r12_in], %%r12\n\t"					\
+		"xor    %%r10d, %%r10d\n\t"					\
+		".byte 0x66,0x0f,0x01,0xcc\n\t"					\
+		"movq	%%r10, %[r10_out]\n\t"					\
+		"movq	%%r11, %[r11_out]\n\t"					\
+		: "+a"(vmcall_ret),						\
+		  [r10_out] "=rm" (__ret),					\
+		  [r11_out] "=rm" (out_r11)					\
+		: "c" (TDX_R10 | TDX_R11 | TDX_R12),				\
+		  [r11_in] "irm" (r11),						\
+		  [r12_in] "irm" (r12)						\
+		: "r10", "r11", "r12"						\
+	);									\
+										\
+	TDVMCALL_BUG_ON(vmcall_ret);						\
+										\
+	__out_r11;								\
+	__ret;									\
+})
+
+#define TDVMCALL_3_0(__in_r11, __in_r12, __in_r13)				\
+({										\
+	u64 vmcall_ret = TDG_VP_VMCALL;						\
+	u64 __in_r11, __in_r12, __in_r13;					\
+	u64 __ret;								\
+										\
+	asm volatile (								\
+		"movq	%[r11_in], %%r11\n\t"					\
+		"movq	%[r12_in], %%r12\n\t"					\
+		"movq	%[r13_in], %%r13\n\t"					\
+		"xor    %%r10d, %%r10d\n\t"					\
+		".byte 0x66,0x0f,0x01,0xcc\n\t"					\
+		"movq	%%r10, %[r10_out]\n\t"					\
+		: "+a"(vmcall_ret),						\
+		  [r10_out] "=rm" (__ret)					\
+		: "c" (TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13),			\
+		  [r11_in] "irm" (r11),						\
+		  [r12_in] "irm" (r12),						\
+		  [r13_in] "irm" (r13)						\
+		: "r10", "r11", "r12", "r13"					\
+	);									\
+										\
+	TDVMCALL_BUG_ON(vmcall_ret);						\
+										\
+	__ret;									\
+})
+
+#define TDVMCALL_3_1(__in_r11, __in_r12, __in_r13, __out_r11)			\
+({										\
+	u64 vmcall_ret = TDG_VP_VMCALL;						\
+	u64 __in_r11, __in_r12, __in_r13;					\
+	u64 out_r11;								\
+	u64 __ret;								\
+										\
+	asm volatile (								\
+		"movq	%[r11_in], %%r11\n\t"					\
+		"movq	%[r12_in], %%r12\n\t"					\
+		"movq	%[r13_in], %%r13\n\t"					\
+		"xor    %%r10d, %%r10d\n\t"					\
+		".byte 0x66,0x0f,0x01,0xcc\n\t"					\
+		"movq	%%r10, %[r10_out]\n\t"					\
+		"movq	%%r11, %[r11_out]\n\t"					\
+		: "+a"(vmcall_ret),						\
+		  [r10_out] "=rm" (__ret),					\
+		  [r11_out] "=rm" (out_r11)					\
+		: "c" (TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13),			\
+		  [r11_in] "irm" (r11),						\
+		  [r12_in] "irm" (r12),						\
+		  [r13_in] "irm" (r13)						\
+		: "r10", "r11", "r12", "r13"					\
+	);									\
+										\
+	TDVMCALL_BUG_ON(vmcall_ret);						\
+										\
+	__out_r11;								\
+	__ret;									\
+})
+
+#define TDVMCALL_4_1(__in_r11, __in_r12, __in_r13, __in_r14, __out_r11)		\
+({										\
+	u64 vmcall_ret = TDG_VP_VMCALL;						\
+	u64 __in_r11, __in_r12, __in_r13, __in_r14;				\
+	u64 out_r11;								\
+	u64 __ret;								\
+										\
+	asm volatile (								\
+		"movq	%[r11_in], %%r11\n\t"					\
+		"movq	%[r12_in], %%r12\n\t"					\
+		"movq	%[r13_in], %%r13\n\t"					\
+		"movq	%[r14_in], %%r14\n\t"					\
+		"xor    %%r10d, %%r10d\n\t"					\
+		".byte 0x66,0x0f,0x01,0xcc\n\t"					\
+		"movq	%%r10, %[r10_out]\n\t"					\
+		"movq	%%r11, %[r11_out]\n\t"					\
+		: "+a"(vmcall_ret),						\
+		  [r10_out] "=rm" (__ret),					\
+		  [r11_out] "=rm" (out_r11)					\
+		: "c" (TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13 | TDX_R14),	\
+		  [r11_in] "irm" (r11),						\
+		  [r12_in] "irm" (r12),						\
+		  [r13_in] "irm" (r13),						\
+		  [r14_in] "irm" (r14)						\
+		: "r10", "r11", "r12", "r13", "r14"				\
+	);									\
+										\
+	TDVMCALL_BUG_ON(vmcall_ret);						\
+										\
+	__out_r11;								\
+	__ret;									\
+})
+
+
+#define TDVMCALL_5_0(__in_r11, __in_r12, __in_r13, __in_r14, __in_r15)		\
+({										\
+	u64 vmcall_ret = TDG_VP_VMCALL;						\
+	u64 __in_r11, __in_r12, __in_r13, __in_r14, __in_r15;			\
+	u64 __ret;								\
+										\
+	asm volatile (								\
+		"movq	%[r11_in], %%r11\n\t"					\
+		"movq	%[r12_in], %%r12\n\t"					\
+		"movq	%[r13_in], %%r13\n\t"					\
+		"movq	%[r14_in], %%r14\n\t"					\
+		"movq	%[r15_in], %%r15\n\t"					\
+		"xor    %%r10d, %%r10d\n\t"					\
+		".byte 0x66,0x0f,0x01,0xcc\n\t"					\
+		"movq	%%r10, %[r10_out]\n\t"					\
+		: "+a"(vmcall_ret),						\
+		  [r10_out] "=rm" (__ret)					\
+		: "c" (TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13 | TDX_R14 | TDX_R15),\
+		  [r11_in] "irm" (r11),						\
+		  [r12_in] "irm" (r12),						\
+		  [r13_in] "irm" (r13),						\
+		  [r14_in] "irm" (r14),						\
+		  [r15_in] "irm" (r15)						\
+		: "r10", "r11", "r12", "r13", "r14", "r15"			\
+	);									\
+										\
+	TDVMCALL_BUG_ON(vmcall_ret);						\
+										\
+	__ret;									\
+})
+
 /*
  * Used in __tdcall*() to gather the input/output registers' values of the
  * TDCALL instruction when requesting services from the TDX module. This is a
@@ -97,25 +282,6 @@ u64 __tdcall_saved_ret(u64 fn, struct tdx_module_args *args);
 /* Used to request services from the VMM */
 u64 __tdx_hypercall(struct tdx_module_args *args);
 
-/*
- * Wrapper for standard use of __tdx_hypercall with no output aside from
- * return code.
- */
-static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
-{
-	struct tdx_module_args args = {
-		.r10 = TDX_HYPERCALL_STANDARD,
-		.r11 = fn,
-		.r12 = r12,
-		.r13 = r13,
-		.r14 = r14,
-		.r15 = r15,
-	};
-
-	return __tdx_hypercall(&args);
-}
-
-
 /* Called from __tdx_hypercall() for unrecoverable failure */
 void __noreturn __tdx_hypercall_failed(void);
 

base-commit: 2ab79514109578fc4b6df90633d500cf281eb689
-- 


^ permalink raw reply related

* RE: [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats
From: Michael Kelley @ 2024-06-04 23:03 UTC (permalink / raw)
  To: Thomas Gleixner, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, lpieralisi@kernel.org, kw@linux.com,
	robh@kernel.org, bhelgaas@google.com,
	James.Bottomley@HansenPartnership.com, martin.petersen@oracle.com,
	arnd@arndb.de, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: maz@kernel.org, den@valinux.co.jp, jgowans@amazon.com,
	dawei.li@shingroup.cn
In-Reply-To: <87h6e860f8.ffs@tglx>

From: Thomas Gleixner <tglx@linutronix.de> Sent: Tuesday, June 4, 2024 11:14 AM
> 
> Michael!
> 
> On Mon, Jun 03 2024 at 22:09, mhkelley58@gmail.com wrote:
> > Hyper-V VMBus devices generate interrupts that are multiplexed
> > onto a single per-CPU architectural interrupt. The top-level VMBus
> > driver ISR demultiplexes these interrupts and invokes per-device
> > handlers. Currently, these per-device handlers are not modeled as
> > Linux IRQs, so /proc/interrupts shows all VMBus interrupts as accounted
> > to the top level architectural interrupt. Visibility into per-device
> > interrupt stats requires accessing VMBus-specific entries in sysfs.
> > The top-level VMBus driver ISR also handles management-related
> > interrupts that are not attributable to a particular VMBus device.
> >
> > As part of changing VMBus to model VMBus per-device handlers as
> > normal Linux IRQs, the top-level VMBus driver needs to conditionally
> > account for interrupts. If it passes the interrupt off to a
> > device-specific IRQ, the interrupt stats are done by that IRQ
> > handler, and accounting for the interrupt at the top level
> > is duplicative. But if it handles a management-related interrupt
> > itself, then it should account for the interrupt itself.
> >
> > Introduce a new flow handler that provides this functionality.
> > The new handler parallels handle_percpu_irq(), but does stats
> > only if the ISR returns other than IRQ_NONE. The existing
> > handle_untracked_irq() can't be used because it doesn't work for
> > per-cpu IRQs, and it doesn't provide conditional stats.
> 
> There is a two other options to solve this:
> 

Thanks for taking a look.  Unfortunately, unless I'm missing something,
both options you suggest have downsides.

>    1) Move the inner workings of handle_percpu_irq() out into
>       a static function which returns the 'handled' value and
>       share it between the two handler functions.

The "inner workings" aren't quite the same in the two cases.
handle_percpu_irq() uses handle_irq_event_percpu() while
handle_percpu_demux_irq() uses __handle_irq_event_percpu().
The latter doesn't do add_interrupt_randomness() because the
demultiplexed IRQ handler will do it.  Doing add_interrupt_randomness()
twice doesn't break anything, but it's more overhead in the hard irq
path, which I'm trying to avoid.  The extra functionality in the
non-double-underscore version could be hoisted up to
handle_percpu_irq(), but that offsets gains from sharing the
inner workings.

> 
>    2) Allocate a proper interrupt for the management mode and invoke it
>       via generic_handle_irq() just as any other demultiplex interrupt.
>       That spares all the special casing in the core code and just
>       works.

Yes, this would work on x86, as the top-level interrupt isn't a Linux IRQ,
and the interrupt counting is done in Hyper-V specific code that could be
removed.  The demux'ed interrupt does the counting.

But on arm64 the top-level interrupt *is* a Linux IRQ, so each
interrupt will get double-counted, which is a problem.  Having to add
handle_percpu_demux_irq() to handle arm64 correctly isn't as clean
as I wish it could be.  But I couldn't find a better approach.

Michael

^ permalink raw reply

* Re: [PATCH net-next v3] net: mana: Allow variable size indirection table
From: Shradha Gupta @ 2024-06-05  8:39 UTC (permalink / raw)
  To: Simon Horman
  Cc: linux-hardening, netdev, linux-hyperv, linux-kernel, linux-rdma,
	Colin Ian King, Ahmed Zaki, Pavan Chebbi, Souradeep Chakrabarti,
	Konstantin Taranov, Kees Cook, Paolo Abeni, Jakub Kicinski,
	Eric Dumazet, David S. Miller, Dexuan Cui, Wei Liu, Haiyang Zhang,
	K. Y. Srinivasan, Leon Romanovsky, Jason Gunthorpe, Long Li,
	Shradha Gupta
In-Reply-To: <20240604093349.GP491852@kernel.org>

On Tue, Jun 04, 2024 at 10:33:49AM +0100, Simon Horman wrote:
> On Fri, May 31, 2024 at 08:37:41AM -0700, Shradha Gupta wrote:
> > Allow variable size indirection table allocation in MANA instead
> > of using a constant value MANA_INDIRECT_TABLE_SIZE.
> > The size is now derived from the MANA_QUERY_VPORT_CONFIG and the
> > indirection table is allocated dynamically.
> > 
> > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > Reviewed-by: Dexuan Cui <decui@microsoft.com>
> > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> 
> ...
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> 
> ...
> 
> > @@ -2344,11 +2352,33 @@ static int mana_create_vport(struct mana_port_context *apc,
> >  	return mana_create_txq(apc, net);
> >  }
> >  
> > +static int mana_rss_table_alloc(struct mana_port_context *apc)
> > +{
> > +	if (!apc->indir_table_sz) {
> > +		netdev_err(apc->ndev,
> > +			   "Indirection table size not set for vPort %d\n",
> > +			   apc->port_idx);
> > +		return -EINVAL;
> > +	}
> > +
> > +	apc->indir_table = kcalloc(apc->indir_table_sz, sizeof(u32), GFP_KERNEL);
> > +	if (!apc->indir_table)
> > +		return -ENOMEM;
> > +
> > +	apc->rxobj_table = kcalloc(apc->indir_table_sz, sizeof(mana_handle_t), GFP_KERNEL);
> > +	if (!apc->rxobj_table) {
> > +		kfree(apc->indir_table);
> 
> Hi, Shradha
> 
> Perhaps I am on the wrong track here, but I have some concerns
> about clean-up paths.
> 
> Firstly.  I think that apc->indir_table should be to NULL here for
> consistency with other clean-up paths. Or alternatively, fields of apc
> should not set to NULL elsewhere after being freed.

Hi Simon,

Thanks for the comments. This makes sense, I am planning of consistently removing
the NULLify from other places too as per Leon's comments.
> 
> In looking into this I noticed that mana_probe() does not call
> mana_remove() or return an error in the cases where mana_probe_port() or
> mana_attach() fail unless add_adev also fails. If so, is that intentional?

Right, so most calls like mana_probe_port(), mana_attach() cleanup after themselves
in the code if there is any error. So, not having to call mana_remove() in these
cases in mana_probe() is intentional. But I do agree that an error is returned in
mana_probe() only if add_adev also fails. I'll fix that too in the next version
> 
> In any case, I would suggest as a follow-up, arranging things so that when
> an error occurs in a function, anything that was allocated is unwound
> before returning an error.
> 
> I think this would make allocation/deallocation easier to reason with.
> And I suspect it would avoid both the need for fields of structures to be
> zeroed after being freed, and the need to call mana_remove() from
> mana_probe().

Agreed
> 
> > +		return -ENOMEM;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  static void mana_rss_table_init(struct mana_port_context *apc)
> >  {
> >  	int i;
> >  
> > -	for (i = 0; i < MANA_INDIRECT_TABLE_SIZE; i++)
> > +	for (i = 0; i < apc->indir_table_sz; i++)
> >  		apc->indir_table[i] =
> >  			ethtool_rxfh_indir_default(i, apc->num_queues);
> >  }
> 
> ...
> 
> > @@ -2739,11 +2772,17 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
> >  	err = register_netdev(ndev);
> >  	if (err) {
> >  		netdev_err(ndev, "Unable to register netdev.\n");
> > -		goto reset_apc;
> > +		goto free_indir;
> >  	}
> >  
> >  	return 0;
> >  
> > +free_indir:
> > +	apc->indir_table_sz = 0;
> > +	kfree(apc->indir_table);
> > +	apc->indir_table = NULL;
> > +	kfree(apc->rxobj_table);
> > +	apc->rxobj_table = NULL;
> >  reset_apc:
> >  	kfree(apc->rxqs);
> >  	apc->rxqs = NULL;
> 
> nit: Not strictly related to this patch, but the reset_apc code should
>      probably be a call to mana_cleanup_port_context() as it is the dual of
>      mana_init_port_context() which is called earlier in mana_probe_port()

Sure, let me do that too.
> 
> ...
> 
> > @@ -2931,6 +2972,11 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
> >  		}
> >  
> >  		unregister_netdevice(ndev);
> > +		apc->indir_table_sz = 0;
> > +		kfree(apc->indir_table);
> > +		apc->indir_table = NULL;
> > +		kfree(apc->rxobj_table);
> > +		apc->rxobj_table = NULL;
> 
> The code to free and zero indir_table_sz and indir_table appears twice
> in this patch. Perhaps a helper to do this, which would be the dual
> of mana_rss_table_alloc is in order.
Makes sense, will change this too.

Thanks,
Shradha.
> 
> >  
> >  		rtnl_unlock();
> >  
> 
> ...

^ permalink raw reply

* Re: [PATCHv11.1 11/19] x86/tdx: Convert shared memory back to private on kexec
From: Kirill A. Shutemov @ 2024-06-05 12:21 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, adrian.hunter, ardb, ashish.kalra, bhe, dave.hansen,
	elena.reshetova, haiyangz, hpa, jun.nakajima, kai.huang, kexec,
	kys, linux-acpi, linux-coco, linux-hyperv, linux-kernel, ltao,
	mingo, peterz, rafael, rick.p.edgecombe,
	sathyanarayanan.kuppuswamy, seanjc, tglx, thomas.lendacky, x86
In-Reply-To: <20240604180554.GIZl9XgscEI3PUvR-W@fat_crate.local>

On Tue, Jun 04, 2024 at 08:05:54PM +0200, Borislav Petkov wrote:
> On Tue, Jun 04, 2024 at 07:14:00PM +0300, Kirill A. Shutemov wrote:
> > 			/*
> > 			 * If tdx_enc_status_changed() fails, it leaves memory
> > 			 * in an unknown state. If the memory remains shared,
> > 			 * it can result in an unrecoverable guest shutdown on
> > 			 * the first accessed through a private mapping.
> 
> "access"

Okay.

> So this sentence above can go too, right?

I don't think so.

> Because that comment is in tdx_kexec_finish() and we're basically going
> off to kexec. So can a guest even access it through a private mapping?
> We're shutting down so nothing is running anymore...

This kernel can't. But the next kernel can.

If a page can be accessed via private mapping is determined by the
presence in Secure EPT. This state persist across kexec.

> > 			 * The kdump kernel boot is not impacted as it uses
> > 			 * a pre-reserved memory range that is always private.
> > 			 * However, gathering crash information could lead to
> > 			 * a crash if it accesses unconverted memory through
> > 			 * a private mapping.
> 
> When does the kexec kernel even get such a private mapping? It is not
> even up yet...

Crash kernel provides access to this memory via /proc/vmcore. Crash kernel
will assume all memory there is private.

> > 			 * pr_err() may assist in understanding such crashes.
> 
> "Print error info in order to leave bread crumbs for debugging." is what
> I'd say.

Okay.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCHv11 11/19] x86/tdx: Convert shared memory back to private on kexec
From: Kirill A. Shutemov @ 2024-06-05 12:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Ard Biesheuvel, Baoquan He, H. Peter Anvin,
	K. Y. Srinivasan, Haiyang Zhang, kexec, linux-hyperv, linux-acpi,
	linux-coco, linux-kernel, Tao Liu
In-Reply-To: <68943e0b-8d82-42de-8f09-058e97d9a392@intel.com>

On Tue, Jun 04, 2024 at 09:27:59AM -0700, Dave Hansen wrote:
> On 5/28/24 02:55, Kirill A. Shutemov wrote:
> > +/* Stop new private<->shared conversions */
> > +static void tdx_kexec_begin(bool crash)
> > +{
> > +	/*
> > +	 * Crash kernel reaches here with interrupts disabled: can't wait for
> > +	 * conversions to finish.
> > +	 *
> > +	 * If race happened, just report and proceed.
> > +	 */
> > +	if (!set_memory_enc_stop_conversion(!crash))
> > +		pr_warn("Failed to stop shared<->private conversions\n");
> > +}
> 
> I don't like having to pass 'crash' in here.
> 
> If interrupts are the problem we have ways of testing for those directly.
> 
> If it's being in an oops that's a problem, we have 'oops_in_progress'
> for that.
> 
> In other words, I'd much rather this function (or better yet
> set_memory_enc_stop_conversion() itself) use some existing API to change
> its behavior in a crash rather than have the context be passed down and
> twiddled through several levels of function calls.
> 
> There are a ton of these in the console code:
> 
> 	if (oops_in_progress)
> 		foo_trylock();
> 	else
> 		foo_lock();
> 
> To me, that's a billion times more clear than a 'wait' argument that
> gets derives from who-knows-what that I have to trace through ten levels
> of function calls.

Okay fair enough. Check out the fixup below. Is it what you mean?

One other thing I realized is that these callback are dead code if kernel
compiled without kexec support. Do we want them to be wrapped with
#ifdef COFNIG_KEXEC_CORE everywhere? It is going to be ugly.

Any better ideas?

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 3d23ea0f5d45..1c5aa036b76b 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -834,7 +834,7 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
 }
 
 /* Stop new private<->shared conversions */
-static void tdx_kexec_begin(bool crash)
+static void tdx_kexec_begin(void)
 {
 	/*
 	 * Crash kernel reaches here with interrupts disabled: can't wait for
@@ -842,7 +842,7 @@ static void tdx_kexec_begin(bool crash)
 	 *
 	 * If race happened, just report and proceed.
 	 */
-	if (!set_memory_enc_stop_conversion(!crash))
+	if (!set_memory_enc_stop_conversion())
 		pr_warn("Failed to stop shared<->private conversions\n");
 }
 
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index d490db38db9e..4b2abce2e3e7 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -50,7 +50,7 @@ int set_memory_np(unsigned long addr, int numpages);
 int set_memory_p(unsigned long addr, int numpages);
 int set_memory_4k(unsigned long addr, int numpages);
 
-bool set_memory_enc_stop_conversion(bool wait);
+bool set_memory_enc_stop_conversion(void);
 int set_memory_encrypted(unsigned long addr, int numpages);
 int set_memory_decrypted(unsigned long addr, int numpages);
 
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index b0f313278967..213cf5379a5a 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -152,8 +152,6 @@ struct x86_init_acpi {
  * @enc_kexec_begin		Begin the two-step process of converting shared memory back
  *				to private. It stops the new conversions from being started
  *				and waits in-flight conversions to finish, if possible.
- *				The @crash parameter denotes whether the function is being
- *				called in the crash shutdown path.
  * @enc_kexec_finish		Finish the two-step process of converting shared memory to
  *				private. All memory is private after the call when
  *				the function returns.
@@ -165,7 +163,7 @@ struct x86_guest {
 	int (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
 	bool (*enc_tlb_flush_required)(bool enc);
 	bool (*enc_cache_flush_required)(void);
-	void (*enc_kexec_begin)(bool crash);
+	void (*enc_kexec_begin)(void);
 	void (*enc_kexec_finish)(void);
 };
 
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index fc52ea80cdc8..340af8155658 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -137,7 +137,7 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 	 * down and interrupts have been disabled. This allows the callback to
 	 * detect a race with the conversion and report it.
 	 */
-	x86_platform.guest.enc_kexec_begin(true);
+	x86_platform.guest.enc_kexec_begin();
 	x86_platform.guest.enc_kexec_finish();
 
 	crash_save_cpu(regs, safe_smp_processor_id());
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 513809b5b27c..0e0a4cf6b5eb 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -723,7 +723,7 @@ void native_machine_shutdown(void)
 	 * conversions to finish cleanly.
 	 */
 	if (kexec_in_progress)
-		x86_platform.guest.enc_kexec_begin(false);
+		x86_platform.guest.enc_kexec_begin();
 
 	/* Stop the cpus and apics */
 #ifdef CONFIG_X86_IO_APIC
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index 8a79fb505303..82b128d3f309 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -138,7 +138,7 @@ static int enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool
 static int enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return 0; }
 static bool enc_tlb_flush_required_noop(bool enc) { return false; }
 static bool enc_cache_flush_required_noop(void) { return false; }
-static void enc_kexec_begin_noop(bool crash) {}
+static void enc_kexec_begin_noop(void) {}
 static void enc_kexec_finish_noop(void) {}
 static bool is_private_mmio_noop(u64 addr) {return false; }
 
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 2a548b65ef5f..443a97e515c0 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2240,13 +2240,14 @@ static DECLARE_RWSEM(mem_enc_lock);
  *
  * Taking the exclusive mem_enc_lock waits for in-flight conversions to complete.
  * The lock is not released to prevent new conversions from being started.
- *
- * If sleep is not allowed, as in a crash scenario, try to take the lock.
- * Failure indicates that there is a race with the conversion.
  */
-bool set_memory_enc_stop_conversion(bool wait)
+bool set_memory_enc_stop_conversion(void)
 {
-	if (!wait)
+	/*
+	 * In a crash scenario, sleep is not allowed. Try to take the lock.
+	 * Failure indicates that there is a race with the conversion.
+	 */
+	if (oops_in_progress)
 		return down_write_trylock(&mem_enc_lock);
 
 	down_write(&mem_enc_lock);
-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply related

* RE: [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats
From: Thomas Gleixner @ 2024-06-05 13:20 UTC (permalink / raw)
  To: Michael Kelley, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, lpieralisi@kernel.org, kw@linux.com,
	robh@kernel.org, bhelgaas@google.com,
	James.Bottomley@HansenPartnership.com, martin.petersen@oracle.com,
	arnd@arndb.de, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: maz@kernel.org, den@valinux.co.jp, jgowans@amazon.com,
	dawei.li@shingroup.cn
In-Reply-To: <SN6PR02MB415737FF6F7B40A1CD20C4A9D4F82@SN6PR02MB4157.namprd02.prod.outlook.com>

On Tue, Jun 04 2024 at 23:03, Michael Kelley wrote:
> From: Thomas Gleixner <tglx@linutronix.de> Sent: Tuesday, June 4, 2024 11:14 AM
>>    1) Move the inner workings of handle_percpu_irq() out into
>>       a static function which returns the 'handled' value and
>>       share it between the two handler functions.
>
> The "inner workings" aren't quite the same in the two cases.
> handle_percpu_irq() uses handle_irq_event_percpu() while
> handle_percpu_demux_irq() uses __handle_irq_event_percpu().
> The latter doesn't do add_interrupt_randomness() because the
> demultiplexed IRQ handler will do it.  Doing add_interrupt_randomness()
> twice doesn't break anything, but it's more overhead in the hard irq
> path, which I'm trying to avoid.  The extra functionality in the
> non-double-underscore version could be hoisted up to
> handle_percpu_irq(), but that offsets gains from sharing the
> inner workings.

That's not rocket science to solve:

static irqreturn_t helper(desc, func)
{
	boiler_plate..
        ret = func(desc)
	boiler_plate..
        return ret;
}

No?

TBH, I still hate that conditional accounting :)

>>    2) Allocate a proper interrupt for the management mode and invoke it
>>       via generic_handle_irq() just as any other demultiplex interrupt.
>>       That spares all the special casing in the core code and just
>>       works.
>
> Yes, this would work on x86, as the top-level interrupt isn't a Linux IRQ,
> and the interrupt counting is done in Hyper-V specific code that could be
> removed.  The demux'ed interrupt does the counting.
>
> But on arm64 the top-level interrupt *is* a Linux IRQ, so each
> interrupt will get double-counted, which is a problem.

What is the problem?

You have: toplevel, mgmt, device[], right?

They are all accounted for seperately and each toplevel interrupt might
result in demultiplexing one or more interrupts (mgmt, device[]), no?

IMO accounting the toplevel interrupt seperately is informative because
it allows you to figure out whether demultiplexing is clustered or not,
but I lost that argument long ago. That's why most demultiplex muck
installs a chained handler, which is a design fail on it's own.

Thanks,

        tglx


^ permalink raw reply

* RE: [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats
From: Michael Kelley @ 2024-06-05 13:45 UTC (permalink / raw)
  To: Thomas Gleixner, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, lpieralisi@kernel.org, kw@linux.com,
	robh@kernel.org, bhelgaas@google.com,
	James.Bottomley@HansenPartnership.com, martin.petersen@oracle.com,
	arnd@arndb.de, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: maz@kernel.org, den@valinux.co.jp, jgowans@amazon.com,
	dawei.li@shingroup.cn
In-Reply-To: <87zfrz4jce.ffs@tglx>

From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 6:20 AM
> 
> On Tue, Jun 04 2024 at 23:03, Michael Kelley wrote:
> > From: Thomas Gleixner <tglx@linutronix.de> Sent: Tuesday, June 4, 2024 11:14 AM
> >>    1) Move the inner workings of handle_percpu_irq() out into
> >>       a static function which returns the 'handled' value and
> >>       share it between the two handler functions.
> >
> > The "inner workings" aren't quite the same in the two cases.
> > handle_percpu_irq() uses handle_irq_event_percpu() while
> > handle_percpu_demux_irq() uses __handle_irq_event_percpu().
> > The latter doesn't do add_interrupt_randomness() because the
> > demultiplexed IRQ handler will do it.  Doing add_interrupt_randomness()
> > twice doesn't break anything, but it's more overhead in the hard irq
> > path, which I'm trying to avoid.  The extra functionality in the
> > non-double-underscore version could be hoisted up to
> > handle_percpu_irq(), but that offsets gains from sharing the
> > inner workings.
> 
> That's not rocket science to solve:
> 
> static irqreturn_t helper(desc, func)
> {
> 	boiler_plate..
>         ret = func(desc)
> 	boiler_plate..
>         return ret;
> }
> 
> No?
> 
> TBH, I still hate that conditional accounting :)
> 
> >>    2) Allocate a proper interrupt for the management mode and invoke it
> >>       via generic_handle_irq() just as any other demultiplex interrupt.
> >>       That spares all the special casing in the core code and just
> >>       works.
> >
> > Yes, this would work on x86, as the top-level interrupt isn't a Linux IRQ,
> > and the interrupt counting is done in Hyper-V specific code that could be
> > removed.  The demux'ed interrupt does the counting.
> >
> > But on arm64 the top-level interrupt *is* a Linux IRQ, so each
> > interrupt will get double-counted, which is a problem.
> 
> What is the problem?
> 
> You have: toplevel, mgmt, device[], right?
> 
> They are all accounted for seperately and each toplevel interrupt might
> result in demultiplexing one or more interrupts (mgmt, device[]), no?
> 
> IMO accounting the toplevel interrupt seperately is informative because
> it allows you to figure out whether demultiplexing is clustered or not,
> but I lost that argument long ago. That's why most demultiplex muck
> installs a chained handler, which is a design fail on it's own.

In /proc/interrupts, the double-counting isn't a problem, and is
potentially helpful as you say. But /proc/stat, for example, shows a total
interrupt count, which will be roughly double what it was before. That
/proc/stat value then shows up in user space in vmstat, for example.
That's what I was concerned about, though it's not a huge problem in
the grand scheme of things.

Michael

^ permalink raw reply

* RE: [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats
From: Thomas Gleixner @ 2024-06-05 14:19 UTC (permalink / raw)
  To: Michael Kelley, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, lpieralisi@kernel.org, kw@linux.com,
	robh@kernel.org, bhelgaas@google.com,
	James.Bottomley@HansenPartnership.com, martin.petersen@oracle.com,
	arnd@arndb.de, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: maz@kernel.org, den@valinux.co.jp, jgowans@amazon.com,
	dawei.li@shingroup.cn
In-Reply-To: <SN6PR02MB415706390CB0E8FD599B6494D4F92@SN6PR02MB4157.namprd02.prod.outlook.com>

On Wed, Jun 05 2024 at 13:45, Michael Kelley wrote:
> From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 6:20 AM
>
> In /proc/interrupts, the double-counting isn't a problem, and is
> potentially helpful as you say. But /proc/stat, for example, shows a total
> interrupt count, which will be roughly double what it was before. That
> /proc/stat value then shows up in user space in vmstat, for example.
> That's what I was concerned about, though it's not a huge problem in
> the grand scheme of things.

That's trivial to solve. We can mark interrupts to be excluded from
/proc/stat accounting.

Thanks,

        tglx

^ permalink raw reply

* Re: [PATCHv11 11/19] x86/tdx: Convert shared memory back to private on kexec
From: Dave Hansen @ 2024-06-05 16:05 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Ard Biesheuvel, Baoquan He, H. Peter Anvin,
	K. Y. Srinivasan, Haiyang Zhang, kexec, linux-hyperv, linux-acpi,
	linux-coco, linux-kernel, Tao Liu
In-Reply-To: <ass6d7jqmjmttnyfzpdfid6dymqriors2awdtyedwvc7ft4zhg@qzj637lzmjt6>

On 6/5/24 05:43, Kirill A. Shutemov wrote:
> Okay fair enough. Check out the fixup below. Is it what you mean?

Yes.  Much better.

> One other thing I realized is that these callback are dead code if kernel
> compiled without kexec support. Do we want them to be wrapped with
> #ifdef COFNIG_KEXEC_CORE everywhere? It is going to be ugly.
> 
> Any better ideas?

The other callbacks don't have #ifdefs either and they're dependent on
memory encryption as far as I can tell.

I think a simple:

	if (IS_ENABLED(COFNIG_KEXEC_CORE))
		return;

in the top of the callbacks will result in a tiny little stub function
when kexec is disabled.  So the bloat will be limited to kernels that
have TDX compiled in but kexec compiled out (probably never).  The bloat
will be two callback pointer, one tiny stub function, and a quick
call/return in a slow path.

I think that probably ends up being a few dozen bytes of bloat in kernel
text for a "probably never" config.

^ permalink raw reply

* Re: [PATCHv11.1 11/19] x86/tdx: Convert shared memory back to private on kexec
From: Borislav Petkov @ 2024-06-05 16:24 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, adrian.hunter, ardb, ashish.kalra, bhe, dave.hansen,
	elena.reshetova, haiyangz, hpa, jun.nakajima, kai.huang, kexec,
	kys, linux-acpi, linux-coco, linux-hyperv, linux-kernel, ltao,
	mingo, peterz, rafael, rick.p.edgecombe,
	sathyanarayanan.kuppuswamy, seanjc, tglx, thomas.lendacky, x86
In-Reply-To: <alkew673cceojzmhsp3wj43yv76cek5ydh2iosfcphuv6ro26q@pj6whxcoetht>

On Wed, Jun 05, 2024 at 03:21:42PM +0300, Kirill A. Shutemov wrote:
> If a page can be accessed via private mapping is determined by the
> presence in Secure EPT. This state persist across kexec.

I just love it how I tickle out details each time I touch this comment
because we three can't write a single concise and self-contained
explanation. :-(

Ok, next version:

"Private mappings persist across kexec. If tdx_enc_status_changed() fails
in the first kernel, it leaves memory in an unknown state.

If that memory remains shared, accessing it in the *next* kernel through
a private mapping will result in an unrecoverable guest shutdown.

The kdump kernel boot is not impacted as it uses a pre-reserved memory
range that is always private.  However, gathering crash information
could lead to a crash if it accesses unconverted memory through
a private mapping which is possible when accessing that memory through
/proc/vmcore, for example.

In all cases, print error info in order to leave enough bread crumbs for
debugging."

I think this is getting in the right direction as it actually makes
sense now.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply

* [PATCH 1/1] x86/hyperv: Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
From: mhkelley58 @ 2024-06-06  2:55 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	hpa, linux-hyperv, linux-kernel

From: Michael Kelley <mhklinux@outlook.com>

A Linux guest on Hyper-V gets the TSC frequency from a synthetic MSR, if
available. In this case, set X86_FEATURE_TSC_KNOWN_FREQ so that Linux
doesn't unnecessarily do refined TSC calibration when setting up the TSC
clocksource.

With this change, a message such as this is no longer output during boot
when the TSC is used as the clocksource:

[    1.115141] tsc: Refined TSC clocksource calibration: 2918.408 MHz

Furthermore, the guest and host will have exactly the same view of the
TSC frequency, which is important for features such as the TSC deadline
timer that are emulated by the Hyper-V host.

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/kernel/cpu/mshyperv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index e0fd57a8ba84..c3e38eaf6d2f 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -424,6 +424,7 @@ static void __init ms_hyperv_init_platform(void)
 	    ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
 		x86_platform.calibrate_tsc = hv_get_tsc_khz;
 		x86_platform.calibrate_cpu = hv_get_tsc_khz;
+		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
 	}
 
 	if (ms_hyperv.priv_high & HV_ISOLATION) {
-- 
2.25.1


^ permalink raw reply related

* RE: [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats
From: Michael Kelley @ 2024-06-06  3:14 UTC (permalink / raw)
  To: Thomas Gleixner, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, lpieralisi@kernel.org, kw@linux.com,
	robh@kernel.org, bhelgaas@google.com,
	James.Bottomley@HansenPartnership.com, martin.petersen@oracle.com,
	arnd@arndb.de, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: maz@kernel.org, den@valinux.co.jp, jgowans@amazon.com,
	dawei.li@shingroup.cn
In-Reply-To: <87cyov4glm.ffs@tglx>

From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 7:20 AM
> 
> On Wed, Jun 05 2024 at 13:45, Michael Kelley wrote:
> > From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 6:20 AM
> >
> > In /proc/interrupts, the double-counting isn't a problem, and is
> > potentially helpful as you say. But /proc/stat, for example, shows a total
> > interrupt count, which will be roughly double what it was before. That
> > /proc/stat value then shows up in user space in vmstat, for example.
> > That's what I was concerned about, though it's not a huge problem in
> > the grand scheme of things.
> 
> That's trivial to solve. We can mark interrupts to be excluded from
> /proc/stat accounting.
> 

OK.  On x86, some simple #ifdef'ery in arch_irq_stat_cpu() can filter
out the HYP interrupts. But what do you envision on arm64, where
there is no arch_irq_stat_cpu()?  On arm64, the top-level interrupt is a
normal Linux IRQ, and its count is included in the "kstat.irqs_sum" field
with no breakout by IRQ. Identifying the right IRQ and subtracting it
out later looks a lot uglier than the conditional stats accounting.

Or if there's some other approach I'm missing, please enlighten me!

Michael

^ permalink raw reply

* Re: [PATCH v6] Drivers: hv: Cosmetic changes for hv.c and balloon.c
From: Wei Liu @ 2024-06-06  6:03 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Aditya Nagesh, adityanagesh@microsoft.com, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157D719EE0D5BAC19FFA449D4FD2@SN6PR02MB4157.namprd02.prod.outlook.com>

On Sat, Jun 01, 2024 at 04:11:32AM +0000, Michael Kelley wrote:
> From: Aditya Nagesh <adityanagesh@linux.microsoft.com> Sent: Friday, May 31, 2024 3:49 AM
> > 
> > Fix issues reported by checkpatch.pl script in hv.c and
> > balloon.c
> >  - Remove unnecessary parentheses
> >  - Remove extra newlines
> >  - Remove extra spaces
> >  - Add spaces between comparison operators
> >  - Remove comparison with NULL in if statements
> > 
> > No functional changes intended
> > 
> > Signed-off-by: Aditya Nagesh <adityanagesh@linux.microsoft.com>
> > Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
> > ---
> > [V6]
> > Fix build failure and unintended change after rebase
> > 
> > [V5]
> > Rebase to hyperv-fixes
> > 
> > [V4]
> > Fix Alignment issue and revert a line since 100 characters are allowed in a line
> > 
> > [V3]
> > Fix alignment issues in multiline function parameters.
> > 
> > [V2]
> > Change Subject from "Drivers: hv: Fix Issues reported by checkpatch.pl script"
> >  drivers/hv/hv.c         | 37 ++++++++--------
> >  drivers/hv/hv_balloon.c | 98 +++++++++++++++--------------------------
> >  2 files changed, 53 insertions(+), 82 deletions(-)
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> 

Applied to hyperv-fixes. Thanks.

^ permalink raw reply

* RE: [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats
From: Thomas Gleixner @ 2024-06-06  9:34 UTC (permalink / raw)
  To: Michael Kelley, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, lpieralisi@kernel.org, kw@linux.com,
	robh@kernel.org, bhelgaas@google.com,
	James.Bottomley@HansenPartnership.com, martin.petersen@oracle.com,
	arnd@arndb.de, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: maz@kernel.org, den@valinux.co.jp, jgowans@amazon.com,
	dawei.li@shingroup.cn
In-Reply-To: <SN6PR02MB4157AD9DE6D3F45EC5F5595DD4FA2@SN6PR02MB4157.namprd02.prod.outlook.com>

On Thu, Jun 06 2024 at 03:14, Michael Kelley wrote:
> From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 7:20 AM
>> 
>> On Wed, Jun 05 2024 at 13:45, Michael Kelley wrote:
>> > From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 6:20 AM
>> >
>> > In /proc/interrupts, the double-counting isn't a problem, and is
>> > potentially helpful as you say. But /proc/stat, for example, shows a total
>> > interrupt count, which will be roughly double what it was before. That
>> > /proc/stat value then shows up in user space in vmstat, for example.
>> > That's what I was concerned about, though it's not a huge problem in
>> > the grand scheme of things.
>> 
>> That's trivial to solve. We can mark interrupts to be excluded from
>> /proc/stat accounting.
>> 
>
> OK.  On x86, some simple #ifdef'ery in arch_irq_stat_cpu() can filter
> out the HYP interrupts. But what do you envision on arm64, where
> there is no arch_irq_stat_cpu()?  On arm64, the top-level interrupt is a
> normal Linux IRQ, and its count is included in the "kstat.irqs_sum" field
> with no breakout by IRQ. Identifying the right IRQ and subtracting it
> out later looks a lot uglier than the conditional stats accounting.

Sure. There are two ways to solve that:

1) Introduce a IRQ_NO_PER_CPU_STATS flag, mark the interrupt
   accordingly and make the stats increment conditional on it.
   The downside is that the conditional affects every interrupt.

2) Do something like this:

static inline
void __handle_percpu_irq(struct irq_desc *desc, irqreturn_t (*handle)(struct irq_desc *))
{
	struct irq_chip *chip = irq_desc_get_chip(desc);

	if (chip->irq_ack)
		chip->irq_ack(&desc->irq_data);

	handle(desc);

	if (chip->irq_eoi)
		chip->irq_eoi(&desc->irq_data);
}

void handle_percpu_irq(struct irq_desc *desc)
{
	/*
	 * PER CPU interrupts are not serialized. Do not touch
	 * desc->tot_count.
	 */
	__kstat_incr_irqs_this_cpu(desc);
	__handle_percpu_irq(desc, handle_irq_event_percpu);
}

void handle_percpu_irq_nostat(struct irq_desc *desc)
{
	__this_cpu_inc(desc->kstat_irqs->cnt);
	__handle_percpu_irq(desc, __handle_irq_event_percpu);
}
 
So that keeps the interrupt accounted for in /proc/interrupts. If you
don't want that remove the __this_cpu_inc() and mark the interrupt with
irq_set_status_flags(irq, IRQ_HIDDEN). That will exclude it from
/proc/interrupts too.

Thanks,

        tglx

^ permalink raw reply

* Re: [PATCHv11.1 11/19] x86/tdx: Convert shared memory back to private on kexec
From: Kirill A. Shutemov @ 2024-06-06 12:39 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, adrian.hunter, ardb, ashish.kalra, bhe, dave.hansen,
	elena.reshetova, haiyangz, hpa, jun.nakajima, kai.huang, kexec,
	kys, linux-acpi, linux-coco, linux-hyperv, linux-kernel, ltao,
	mingo, peterz, rafael, rick.p.edgecombe,
	sathyanarayanan.kuppuswamy, seanjc, tglx, thomas.lendacky, x86
In-Reply-To: <20240605162419.GJZmCRM8V6xooyvm9H@fat_crate.local>

On Wed, Jun 05, 2024 at 06:24:19PM +0200, Borislav Petkov wrote:
> On Wed, Jun 05, 2024 at 03:21:42PM +0300, Kirill A. Shutemov wrote:
> > If a page can be accessed via private mapping is determined by the
> > presence in Secure EPT. This state persist across kexec.
> 
> I just love it how I tickle out details each time I touch this comment
> because we three can't write a single concise and self-contained
> explanation. :-(
> 
> Ok, next version:
> 
> "Private mappings persist across kexec. If tdx_enc_status_changed() fails

s/Private mappings persist /Memory encryption state persists /

> in the first kernel, it leaves memory in an unknown state.
> 
> If that memory remains shared, accessing it in the *next* kernel through
> a private mapping will result in an unrecoverable guest shutdown.
> 
> The kdump kernel boot is not impacted as it uses a pre-reserved memory
> range that is always private.  However, gathering crash information
> could lead to a crash if it accesses unconverted memory through
> a private mapping which is possible when accessing that memory through
> /proc/vmcore, for example.
> 
> In all cases, print error info in order to leave enough bread crumbs for
> debugging."
> 
> I think this is getting in the right direction as it actually makes
> sense now.

Otherwise looks good to me.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* RE: [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats
From: Michael Kelley @ 2024-06-06 14:34 UTC (permalink / raw)
  To: Thomas Gleixner, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, lpieralisi@kernel.org, kw@linux.com,
	robh@kernel.org, bhelgaas@google.com,
	James.Bottomley@HansenPartnership.com, martin.petersen@oracle.com,
	arnd@arndb.de, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: maz@kernel.org, den@valinux.co.jp, jgowans@amazon.com,
	dawei.li@shingroup.cn
In-Reply-To: <87le3i2z5g.ffs@tglx>

From: Thomas Gleixner <tglx@linutronix.de> Sent: Thursday, June 6, 2024 2:34 AM
> 
> On Thu, Jun 06 2024 at 03:14, Michael Kelley wrote:
> > From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 7:20 AM
> >>
> >> On Wed, Jun 05 2024 at 13:45, Michael Kelley wrote:
> >> > From: Thomas Gleixner <tglx@linutronix.de> Sent: Wednesday, June 5, 2024 6:20 AM
> >> >
> >> > In /proc/interrupts, the double-counting isn't a problem, and is
> >> > potentially helpful as you say. But /proc/stat, for example, shows a total
> >> > interrupt count, which will be roughly double what it was before. That
> >> > /proc/stat value then shows up in user space in vmstat, for example.
> >> > That's what I was concerned about, though it's not a huge problem in
> >> > the grand scheme of things.
> >>
> >> That's trivial to solve. We can mark interrupts to be excluded from
> >> /proc/stat accounting.
> >>
> >
> > OK.  On x86, some simple #ifdef'ery in arch_irq_stat_cpu() can filter
> > out the HYP interrupts. But what do you envision on arm64, where
> > there is no arch_irq_stat_cpu()?  On arm64, the top-level interrupt is a
> > normal Linux IRQ, and its count is included in the "kstat.irqs_sum" field
> > with no breakout by IRQ. Identifying the right IRQ and subtracting it
> > out later looks a lot uglier than the conditional stats accounting.
> 
> Sure. There are two ways to solve that:
> 
> 1) Introduce a IRQ_NO_PER_CPU_STATS flag, mark the interrupt
>    accordingly and make the stats increment conditional on it.
>    The downside is that the conditional affects every interrupt.
> 
> 2) Do something like this:
> 
> static inline
> void __handle_percpu_irq(struct irq_desc *desc, irqreturn_t (*handle)(struct irq_desc
> *))
> {
> 	struct irq_chip *chip = irq_desc_get_chip(desc);
> 
> 	if (chip->irq_ack)
> 		chip->irq_ack(&desc->irq_data);
> 
> 	handle(desc);
> 
> 	if (chip->irq_eoi)
> 		chip->irq_eoi(&desc->irq_data);
> }
> 
> void handle_percpu_irq(struct irq_desc *desc)
> {
> 	/*
> 	 * PER CPU interrupts are not serialized. Do not touch
> 	 * desc->tot_count.
> 	 */
> 	__kstat_incr_irqs_this_cpu(desc);
> 	__handle_percpu_irq(desc, handle_irq_event_percpu);
> }
> 
> void handle_percpu_irq_nostat(struct irq_desc *desc)
> {
> 	__this_cpu_inc(desc->kstat_irqs->cnt);
> 	__handle_percpu_irq(desc, __handle_irq_event_percpu);
> }
> 
> So that keeps the interrupt accounted for in /proc/interrupts. If you
> don't want that remove the __this_cpu_inc() and mark the interrupt with
> irq_set_status_flags(irq, IRQ_HIDDEN). That will exclude it from
> /proc/interrupts too.
> 

Yes, this works for not double-counting in the first place. Account for the
control message interrupts in their own Linux IRQ. Then for the top-level
interrupt, instead of adding a new handler with conditional accounting,
add a new per-CPU handler that does no accounting. I had not noticed
the IRQ_HIDDEN flag, and that solves my concern about having an
entry in /proc/interrupts that always shows zero interrupts.  And with
no double-counting, the interrupt counts in /proc/stat won't be bloated.

On x86, I'll have to separately make the "HYP" line in /proc/interrupts
go away, but that's easy.

Thanks,

Michael

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox