* Re: [PATCHv11.1 11/19] x86/tdx: Convert shared memory back to private on kexec
From: Dave Hansen @ 2024-06-04 15:47 UTC (permalink / raw)
To: Kirill A. Shutemov, Borislav Petkov
Cc: adrian.hunter, ardb, ashish.kalra, bhe, dave.hansen,
elena.reshetova, haiyangz, hpa, jun.nakajima, kai.huang, kexec,
kys, linux-acpi, linux-coco, linux-hyperv, linux-kernel, ltao,
mingo, peterz, rafael, rick.p.edgecombe,
sathyanarayanan.kuppuswamy, seanjc, tglx, thomas.lendacky, x86
In-Reply-To: <noym2bqgxqcyhhdzoax7gvdfzhh7rtw7cv236fhzpqh3wqf76e@2jj733skv7y4>
On 6/4/24 08:32, Kirill A. Shutemov wrote:
> What about the comment below?
>
> /*
> * One possible reason for the failure is if kexec raced
> * with memory conversion. In this case shared bit in
> * page table got set (or not cleared) during
> * shared<->private conversion, but the page is actually
> * private. So this failure is not going to affect the
> * kexec'ed kernel.
> *
> * The only thing one can do at this point on failure
> * at this point is panic. In absence of better options,
> * it is reasonable to proceed, hoping the failure is a
> * benign shared bit mismatch due to the race.
> *
> * Also, even if the failure is real and the page cannot
> * be touched as private, the kdump kernel will boot
> * fine as it uses pre-reserved memory. What happens
> * next depends on what the dumping process does and
> * there's a reasonable chance to produce useful dump
> * on crash.
> *
> * Regardless, the print leaves a trace in the log to
> * give a clue for debug.
> */
It's rambling too much for my taste.
Let's boil this down to what matters:
1. Failures to change encryption status here can lead a future kernel
to touch shared memory with a private mapping
2. That causes an immediate unrecoverable guest shutdown (right?)
3. kdump kernels should not be affected since they have their own
memory ranges and its encryption status is not being tweawked here
4. The pr_err() may help make some sense out of #2 when it happens
I'm not sure the reason behind the failed conversion is important here.
I wouldn't mention panic().
We don't need to opine about what the next kernel might or might not do.
^ permalink raw reply
* Re: [PATCHv11.1 11/19] x86/tdx: Convert shared memory back to private on kexec
From: Kirill A. Shutemov @ 2024-06-04 15:32 UTC (permalink / raw)
To: Borislav Petkov
Cc: adrian.hunter, ardb, ashish.kalra, bhe, dave.hansen,
elena.reshetova, haiyangz, hpa, jun.nakajima, kai.huang, kexec,
kys, linux-acpi, linux-coco, linux-hyperv, linux-kernel, ltao,
mingo, peterz, rafael, rick.p.edgecombe,
sathyanarayanan.kuppuswamy, seanjc, tglx, thomas.lendacky, x86
In-Reply-To: <20240603083754.GAZl2A4uXvVB5w4l9u@fat_crate.local>
On Mon, Jun 03, 2024 at 10:37:54AM +0200, Borislav Petkov wrote:
> On Sun, Jun 02, 2024 at 05:23:03PM +0300, Kirill A. Shutemov wrote:
> > + /*
> > + * The only thing one can do at this point on failure
> > + * is panic. It is reasonable to proceed.
>
> It makes even less sense now: panic() means "all stops and we die" and
> you say it is reasonable to proceed.
>
> I'm confused.
Right.
What about the comment below?
/*
* One possible reason for the failure is if kexec raced
* with memory conversion. In this case shared bit in
* page table got set (or not cleared) during
* shared<->private conversion, but the page is actually
* private. So this failure is not going to affect the
* kexec'ed kernel.
*
* The only thing one can do at this point on failure
* at this point is panic. In absence of better options,
* it is reasonable to proceed, hoping the failure is a
* benign shared bit mismatch due to the race.
*
* Also, even if the failure is real and the page cannot
* be touched as private, the kdump kernel will boot
* fine as it uses pre-reserved memory. What happens
* next depends on what the dumping process does and
* there's a reasonable chance to produce useful dump
* on crash.
*
* Regardless, the print leaves a trace in the log to
* give a clue for debug.
*/
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply
* Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion
From: Kirill A. Shutemov @ 2024-06-04 15:21 UTC (permalink / raw)
To: Borislav Petkov
Cc: H. Peter Anvin, Nikolay Borisov, Thomas Gleixner, Ingo Molnar,
Dave Hansen, x86, Rafael J. Wysocki, Peter Zijlstra,
Adrian Hunter, Kuppuswamy Sathyanarayanan, Elena Reshetova,
Jun Nakajima, Rick Edgecombe, Tom Lendacky, Kalra, Ashish,
Sean Christopherson, Huang, Kai, Ard Biesheuvel, Baoquan He,
K. Y. Srinivasan, Haiyang Zhang, kexec, linux-hyperv, linux-acpi,
linux-coco, linux-kernel
In-Reply-To: <20240604091503.GQZl7bF14qTSAjqUhN@fat_crate.local>
On Tue, Jun 04, 2024 at 11:15:03AM +0200, Borislav Petkov wrote:
> On Mon, Jun 03, 2024 at 05:24:00PM -0700, H. Peter Anvin wrote:
> > Trying one more time; sorry (again) if someone receives this in duplicate.
> >
> > > > >
> > > > > diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> > > > > index 56cab1bb25f5..085eef5c3904 100644
> > > > > --- a/arch/x86/kernel/relocate_kernel_64.S
> > > > > +++ b/arch/x86/kernel/relocate_kernel_64.S
> > > > > @@ -148,9 +148,10 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> > > > > */
> > > > > movl $X86_CR4_PAE, %eax
> > > > > testq $X86_CR4_LA57, %r13
> > > > > - jz 1f
> > > > > + jz .Lno_la57
> > > > > orl $X86_CR4_LA57, %eax
> > > > > -1:
> > > > > +.Lno_la57:
> > > > > +
> > > > > movq %rax, %cr4
> >
> > If we are cleaning up this code... the above can simply be:
> >
> > andl $(X86_CR4_PAE | X86_CR4_LA54), %r13
> > movq %r13, %cr4
> >
> > %r13 is dead afterwards, and the PAE bit *will* be set in %r13 anyway.
>
> Yeah, with a proper comment. The testing of bits is not really needed.
I think it is better fit the next patch.
What about this?
From b45fe48092abad2612c2bafbb199e4de80c99545 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Fri, 10 Feb 2023 12:53:11 +0300
Subject: [PATCHv11.1 06/19] x86/kexec: Keep CR4.MCE set during kexec for TDX guest
TDX guests run with MCA enabled (CR4.MCE=1b) from the very start. If
that bit is cleared during CR4 register reprogramming during boot or
kexec flows, a #VE exception will be raised which the guest kernel
cannot handle it.
Therefore, make sure the CR4.MCE setting is preserved over kexec too and
avoid raising any #VEs.
The change doesn't affect non-TDX-guest environments.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
arch/x86/kernel/relocate_kernel_64.S | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 085eef5c3904..9c2cf70c5f54 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -5,6 +5,8 @@
*/
#include <linux/linkage.h>
+#include <linux/stringify.h>
+#include <asm/alternative.h>
#include <asm/page_types.h>
#include <asm/kexec.h>
#include <asm/processor-flags.h>
@@ -145,14 +147,15 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
* Set cr4 to a known state:
* - physical address extension enabled
* - 5-level paging, if it was enabled before
+ * - Machine check exception on TDX guest, if it was enabled before.
+ * Clearing MCE might not be allowed in TDX guests, depending on setup.
+ *
+ * Use R13 that contains the original CR4 value, read in relocate_kernel().
+ * PAE is always set in the original CR4.
*/
- movl $X86_CR4_PAE, %eax
- testq $X86_CR4_LA57, %r13
- jz .Lno_la57
- orl $X86_CR4_LA57, %eax
-.Lno_la57:
-
- movq %rax, %cr4
+ andl $(X86_CR4_PAE | X86_CR4_LA57), %r13d
+ ALTERNATIVE "", __stringify(orl $X86_CR4_MCE, %r13d), X86_FEATURE_TDX_GUEST
+ movq %r13, %cr4
jmp 1f
1:
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply related
* Re: [PATCH] x86/tdx: Enhance code generation for TDCALL and SEAMCALL wrappers
From: Kirill A. Shutemov @ 2024-06-04 11:37 UTC (permalink / raw)
To: Dave Hansen, Sean Christopherson
Cc: Paolo Bonzini, Dave Hansen, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Josh Poimboeuf,
Peter Zijlstra, linux-coco, linux-kernel, linux-hyperv
In-Reply-To: <8992921e-7343-4279-9953-0c042d8baf90@intel.com>
On Mon, Jun 03, 2024 at 06:37:45AM -0700, Dave Hansen wrote:
> On 6/2/24 04:54, Kirill A. Shutemov wrote:
> > Sean observed that the compiler is generating inefficient code to clear
> > the tdx_module_args struct for TDCALL and SEAMCALL wrappers. The
> > compiler is generating numerous instructions at each call site to clear
> > the unused fields of the structure.
> >
> > To address this issue, avoid using C99-initializer and instead
> > explicitly use string instructions to clear the struct.
> >
> > With Clang, this change results in a savings of approximately 3K with my
> > configuration:
> >
> > add/remove: 0/0 grow/shrink: 0/21 up/down: 0/-3187 (-3187)
> >
> > With GCC, the savings are less significant at around 300 bytes:
> >
> > add/remove: 0/0 grow/shrink: 3/22 up/down: 17/-313 (-296)
> >
> > GCC tends to generate string instructions more frequently to clear the
> > struct.
>
> <shrug>
>
> I don't think moving away from perfectly normal C struct initialization
> is worth it for 300 bytes of text in couple of slow paths.
>
> If someone out there is using clang, is confident that it is doing the
> right thing and not just being silly, _and_ is horribly bothered by its
> code generation, then please speak up.
Conceptually, I like my previous attempt more. But it is much more
intrusive and I am not sure it is worth the risk.
This patch feels like hack around compiler.
Sean, do you have any comments?
> > +static __always_inline void tdx_arg_init(struct tdx_module_args *args)
> > +{
> > + asm ("rep stosb"
> > + : "+D" (args)
> > + : "c" (sizeof(*args)), "a" (0)
> > + : "memory");
> > +}
>
> The inline assembly also has the side-effect of tripping up the
> compiler. The compiler can't optimize across these at all and it
> probably has the effect of bloating the code.
It can, but it is limited. Compiler has to flush registers content back to
memory before asm() and cannot assume anything that read from memory
before the asm() is still valid after.
> Oh, and if we're going to leave this weirdo initialization idiom for
> TDX, it needs to be well commented:
>
> /*
> * Using normal " = {};" to initialize tdx_module_args results in
> * bloated hard-to-read assembly. Zero it using the most compact way
> * available.
> */
>
> Eh?
Okay.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply
* Re: [PATCH net-next v3] net: mana: Allow variable size indirection table
From: Simon Horman @ 2024-06-04 9:33 UTC (permalink / raw)
To: Shradha Gupta
Cc: linux-hardening, netdev, linux-hyperv, linux-kernel, linux-rdma,
Colin Ian King, Ahmed Zaki, Pavan Chebbi, Souradeep Chakrabarti,
Konstantin Taranov, Kees Cook, Paolo Abeni, Jakub Kicinski,
Eric Dumazet, David S. Miller, Dexuan Cui, Wei Liu, Haiyang Zhang,
K. Y. Srinivasan, Leon Romanovsky, Jason Gunthorpe, Long Li,
Shradha Gupta
In-Reply-To: <1717169861-15825-1-git-send-email-shradhagupta@linux.microsoft.com>
On Fri, May 31, 2024 at 08:37:41AM -0700, Shradha Gupta wrote:
> Allow variable size indirection table allocation in MANA instead
> of using a constant value MANA_INDIRECT_TABLE_SIZE.
> The size is now derived from the MANA_QUERY_VPORT_CONFIG and the
> indirection table is allocated dynamically.
>
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Reviewed-by: Dexuan Cui <decui@microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
...
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
...
> @@ -2344,11 +2352,33 @@ static int mana_create_vport(struct mana_port_context *apc,
> return mana_create_txq(apc, net);
> }
>
> +static int mana_rss_table_alloc(struct mana_port_context *apc)
> +{
> + if (!apc->indir_table_sz) {
> + netdev_err(apc->ndev,
> + "Indirection table size not set for vPort %d\n",
> + apc->port_idx);
> + return -EINVAL;
> + }
> +
> + apc->indir_table = kcalloc(apc->indir_table_sz, sizeof(u32), GFP_KERNEL);
> + if (!apc->indir_table)
> + return -ENOMEM;
> +
> + apc->rxobj_table = kcalloc(apc->indir_table_sz, sizeof(mana_handle_t), GFP_KERNEL);
> + if (!apc->rxobj_table) {
> + kfree(apc->indir_table);
Hi, Shradha
Perhaps I am on the wrong track here, but I have some concerns
about clean-up paths.
Firstly. I think that apc->indir_table should be to NULL here for
consistency with other clean-up paths. Or alternatively, fields of apc
should not set to NULL elsewhere after being freed.
In looking into this I noticed that mana_probe() does not call
mana_remove() or return an error in the cases where mana_probe_port() or
mana_attach() fail unless add_adev also fails. If so, is that intentional?
In any case, I would suggest as a follow-up, arranging things so that when
an error occurs in a function, anything that was allocated is unwound
before returning an error.
I think this would make allocation/deallocation easier to reason with.
And I suspect it would avoid both the need for fields of structures to be
zeroed after being freed, and the need to call mana_remove() from
mana_probe().
> + return -ENOMEM;
> + }
> +
> + return 0;
> +}
> +
> static void mana_rss_table_init(struct mana_port_context *apc)
> {
> int i;
>
> - for (i = 0; i < MANA_INDIRECT_TABLE_SIZE; i++)
> + for (i = 0; i < apc->indir_table_sz; i++)
> apc->indir_table[i] =
> ethtool_rxfh_indir_default(i, apc->num_queues);
> }
...
> @@ -2739,11 +2772,17 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
> err = register_netdev(ndev);
> if (err) {
> netdev_err(ndev, "Unable to register netdev.\n");
> - goto reset_apc;
> + goto free_indir;
> }
>
> return 0;
>
> +free_indir:
> + apc->indir_table_sz = 0;
> + kfree(apc->indir_table);
> + apc->indir_table = NULL;
> + kfree(apc->rxobj_table);
> + apc->rxobj_table = NULL;
> reset_apc:
> kfree(apc->rxqs);
> apc->rxqs = NULL;
nit: Not strictly related to this patch, but the reset_apc code should
probably be a call to mana_cleanup_port_context() as it is the dual of
mana_init_port_context() which is called earlier in mana_probe_port()
...
> @@ -2931,6 +2972,11 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
> }
>
> unregister_netdevice(ndev);
> + apc->indir_table_sz = 0;
> + kfree(apc->indir_table);
> + apc->indir_table = NULL;
> + kfree(apc->rxobj_table);
> + apc->rxobj_table = NULL;
The code to free and zero indir_table_sz and indir_table appears twice
in this patch. Perhaps a helper to do this, which would be the dual
of mana_rss_table_alloc is in order.
>
> rtnl_unlock();
>
...
^ permalink raw reply
* Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion
From: Borislav Petkov @ 2024-06-04 9:15 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Kirill A. Shutemov, Nikolay Borisov, Thomas Gleixner, Ingo Molnar,
Dave Hansen, x86, Rafael J. Wysocki, Peter Zijlstra,
Adrian Hunter, Kuppuswamy Sathyanarayanan, Elena Reshetova,
Jun Nakajima, Rick Edgecombe, Tom Lendacky, Kalra, Ashish,
Sean Christopherson, Huang, Kai, Ard Biesheuvel, Baoquan He,
K. Y. Srinivasan, Haiyang Zhang, kexec, linux-hyperv, linux-acpi,
linux-coco, linux-kernel
In-Reply-To: <748d3b70-60b4-44e0-bd81-9117f1ab699d@zytor.com>
On Mon, Jun 03, 2024 at 05:24:00PM -0700, H. Peter Anvin wrote:
> Trying one more time; sorry (again) if someone receives this in duplicate.
>
> > > >
> > > > diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> > > > index 56cab1bb25f5..085eef5c3904 100644
> > > > --- a/arch/x86/kernel/relocate_kernel_64.S
> > > > +++ b/arch/x86/kernel/relocate_kernel_64.S
> > > > @@ -148,9 +148,10 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> > > > */
> > > > movl $X86_CR4_PAE, %eax
> > > > testq $X86_CR4_LA57, %r13
> > > > - jz 1f
> > > > + jz .Lno_la57
> > > > orl $X86_CR4_LA57, %eax
> > > > -1:
> > > > +.Lno_la57:
> > > > +
> > > > movq %rax, %cr4
>
> If we are cleaning up this code... the above can simply be:
>
> andl $(X86_CR4_PAE | X86_CR4_LA54), %r13
> movq %r13, %cr4
>
> %r13 is dead afterwards, and the PAE bit *will* be set in %r13 anyway.
Yeah, with a proper comment. The testing of bits is not really needed.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply
* Re: [PATCH net-next v3] net: mana: Allow variable size indirection table
From: Shradha Gupta @ 2024-06-04 9:01 UTC (permalink / raw)
To: Leon Romanovsky
Cc: linux-hardening, netdev, linux-hyperv, linux-kernel, linux-rdma,
Colin Ian King, Ahmed Zaki, Pavan Chebbi, Souradeep Chakrabarti,
Konstantin Taranov, Kees Cook, Paolo Abeni, Jakub Kicinski,
Eric Dumazet, David S. Miller, Dexuan Cui, Wei Liu, Haiyang Zhang,
K. Y. Srinivasan, Jason Gunthorpe, Long Li, Shradha Gupta
In-Reply-To: <20240604083205.GM3884@unreal>
On Tue, Jun 04, 2024 at 11:32:05AM +0300, Leon Romanovsky wrote:
> On Mon, Jun 03, 2024 at 10:36:48PM -0700, Shradha Gupta wrote:
> > On Mon, Jun 03, 2024 at 11:41:22AM +0300, Leon Romanovsky wrote:
> > > On Fri, May 31, 2024 at 08:37:41AM -0700, Shradha Gupta wrote:
> > > > Allow variable size indirection table allocation in MANA instead
> > > > of using a constant value MANA_INDIRECT_TABLE_SIZE.
> > > > The size is now derived from the MANA_QUERY_VPORT_CONFIG and the
> > > > indirection table is allocated dynamically.
> > > >
> > > > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > > > Reviewed-by: Dexuan Cui <decui@microsoft.com>
> > > > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > > ---
> > > > Changes in v3:
> > > > * Fixed the memory leak(save_table) in mana_set_rxfh()
> > > >
> > > > Changes in v2:
> > > > * Rebased to latest net-next tree
> > > > * Rearranged cleanup code in mana_probe_port to avoid extra operations
> > > > ---
> > > > drivers/infiniband/hw/mana/qp.c | 10 +--
> > > > drivers/net/ethernet/microsoft/mana/mana_en.c | 68 ++++++++++++++++---
> > > > .../ethernet/microsoft/mana/mana_ethtool.c | 27 +++++---
> > > > include/net/mana/gdma.h | 4 +-
> > > > include/net/mana/mana.h | 9 +--
> > > > 5 files changed, 89 insertions(+), 29 deletions(-)
> > >
> > > <...>
> > >
> > > > +free_indir:
> > > > + apc->indir_table_sz = 0;
> > > > + kfree(apc->indir_table);
> > > > + apc->indir_table = NULL;
> > > > + kfree(apc->rxobj_table);
> > > > + apc->rxobj_table = NULL;
> > > > reset_apc:
> > > > kfree(apc->rxqs);
> > > > apc->rxqs = NULL;
> > > > @@ -2897,6 +2936,7 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
> > > > {
> > >
> > > <...>
> > >
> > > > @@ -2931,6 +2972,11 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
> > > > }
> > > >
> > > > unregister_netdevice(ndev);
> > > > + apc->indir_table_sz = 0;
> > > > + kfree(apc->indir_table);
> > > > + apc->indir_table = NULL;
> > > > + kfree(apc->rxobj_table);
> > > > + apc->rxobj_table = NULL;
> > >
> > > Why do you need to NULLify here? Will apc is going to be accessible
> > > after call to mana_remove? or port probe failure?
> > Right, they won't be accessed. This is just for the sake of completeness
> > and to prevent double free in case there are code bug in other place.
>
> This coding patter is called defensive programming, which is discouraged
> in the kernel. You are not preventing double free, but hiding bugs which
> were possible to be found by various static analysis tools.
>
> Please don't do it.
>
> Thanks
Understood, it makes sense. Let me fix this in the next version.
Regards,
Shradha
>
> >
> > Regards,
> > Shradha.
> > >
> > > Thanks
^ permalink raw reply
* Re: [PATCH net-next v3] net: mana: Allow variable size indirection table
From: Leon Romanovsky @ 2024-06-04 8:32 UTC (permalink / raw)
To: Shradha Gupta
Cc: linux-hardening, netdev, linux-hyperv, linux-kernel, linux-rdma,
Colin Ian King, Ahmed Zaki, Pavan Chebbi, Souradeep Chakrabarti,
Konstantin Taranov, Kees Cook, Paolo Abeni, Jakub Kicinski,
Eric Dumazet, David S. Miller, Dexuan Cui, Wei Liu, Haiyang Zhang,
K. Y. Srinivasan, Jason Gunthorpe, Long Li, Shradha Gupta
In-Reply-To: <20240604053648.GA14220@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
On Mon, Jun 03, 2024 at 10:36:48PM -0700, Shradha Gupta wrote:
> On Mon, Jun 03, 2024 at 11:41:22AM +0300, Leon Romanovsky wrote:
> > On Fri, May 31, 2024 at 08:37:41AM -0700, Shradha Gupta wrote:
> > > Allow variable size indirection table allocation in MANA instead
> > > of using a constant value MANA_INDIRECT_TABLE_SIZE.
> > > The size is now derived from the MANA_QUERY_VPORT_CONFIG and the
> > > indirection table is allocated dynamically.
> > >
> > > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > > Reviewed-by: Dexuan Cui <decui@microsoft.com>
> > > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > ---
> > > Changes in v3:
> > > * Fixed the memory leak(save_table) in mana_set_rxfh()
> > >
> > > Changes in v2:
> > > * Rebased to latest net-next tree
> > > * Rearranged cleanup code in mana_probe_port to avoid extra operations
> > > ---
> > > drivers/infiniband/hw/mana/qp.c | 10 +--
> > > drivers/net/ethernet/microsoft/mana/mana_en.c | 68 ++++++++++++++++---
> > > .../ethernet/microsoft/mana/mana_ethtool.c | 27 +++++---
> > > include/net/mana/gdma.h | 4 +-
> > > include/net/mana/mana.h | 9 +--
> > > 5 files changed, 89 insertions(+), 29 deletions(-)
> >
> > <...>
> >
> > > +free_indir:
> > > + apc->indir_table_sz = 0;
> > > + kfree(apc->indir_table);
> > > + apc->indir_table = NULL;
> > > + kfree(apc->rxobj_table);
> > > + apc->rxobj_table = NULL;
> > > reset_apc:
> > > kfree(apc->rxqs);
> > > apc->rxqs = NULL;
> > > @@ -2897,6 +2936,7 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
> > > {
> >
> > <...>
> >
> > > @@ -2931,6 +2972,11 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
> > > }
> > >
> > > unregister_netdevice(ndev);
> > > + apc->indir_table_sz = 0;
> > > + kfree(apc->indir_table);
> > > + apc->indir_table = NULL;
> > > + kfree(apc->rxobj_table);
> > > + apc->rxobj_table = NULL;
> >
> > Why do you need to NULLify here? Will apc is going to be accessible
> > after call to mana_remove? or port probe failure?
> Right, they won't be accessed. This is just for the sake of completeness
> and to prevent double free in case there are code bug in other place.
This coding patter is called defensive programming, which is discouraged
in the kernel. You are not preventing double free, but hiding bugs which
were possible to be found by various static analysis tools.
Please don't do it.
Thanks
>
> Regards,
> Shradha.
> >
> > Thanks
^ permalink raw reply
* Re: [PATCH net-next v3] net: mana: Allow variable size indirection table
From: Shradha Gupta @ 2024-06-04 5:36 UTC (permalink / raw)
To: Leon Romanovsky
Cc: linux-hardening, netdev, linux-hyperv, linux-kernel, linux-rdma,
Colin Ian King, Ahmed Zaki, Pavan Chebbi, Souradeep Chakrabarti,
Konstantin Taranov, Kees Cook, Paolo Abeni, Jakub Kicinski,
Eric Dumazet, David S. Miller, Dexuan Cui, Wei Liu, Haiyang Zhang,
K. Y. Srinivasan, Jason Gunthorpe, Long Li, Shradha Gupta
In-Reply-To: <20240603084122.GK3884@unreal>
On Mon, Jun 03, 2024 at 11:41:22AM +0300, Leon Romanovsky wrote:
> On Fri, May 31, 2024 at 08:37:41AM -0700, Shradha Gupta wrote:
> > Allow variable size indirection table allocation in MANA instead
> > of using a constant value MANA_INDIRECT_TABLE_SIZE.
> > The size is now derived from the MANA_QUERY_VPORT_CONFIG and the
> > indirection table is allocated dynamically.
> >
> > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > Reviewed-by: Dexuan Cui <decui@microsoft.com>
> > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > ---
> > Changes in v3:
> > * Fixed the memory leak(save_table) in mana_set_rxfh()
> >
> > Changes in v2:
> > * Rebased to latest net-next tree
> > * Rearranged cleanup code in mana_probe_port to avoid extra operations
> > ---
> > drivers/infiniband/hw/mana/qp.c | 10 +--
> > drivers/net/ethernet/microsoft/mana/mana_en.c | 68 ++++++++++++++++---
> > .../ethernet/microsoft/mana/mana_ethtool.c | 27 +++++---
> > include/net/mana/gdma.h | 4 +-
> > include/net/mana/mana.h | 9 +--
> > 5 files changed, 89 insertions(+), 29 deletions(-)
>
> <...>
>
> > +free_indir:
> > + apc->indir_table_sz = 0;
> > + kfree(apc->indir_table);
> > + apc->indir_table = NULL;
> > + kfree(apc->rxobj_table);
> > + apc->rxobj_table = NULL;
> > reset_apc:
> > kfree(apc->rxqs);
> > apc->rxqs = NULL;
> > @@ -2897,6 +2936,7 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
> > {
>
> <...>
>
> > @@ -2931,6 +2972,11 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
> > }
> >
> > unregister_netdevice(ndev);
> > + apc->indir_table_sz = 0;
> > + kfree(apc->indir_table);
> > + apc->indir_table = NULL;
> > + kfree(apc->rxobj_table);
> > + apc->rxobj_table = NULL;
>
> Why do you need to NULLify here? Will apc is going to be accessible
> after call to mana_remove? or port probe failure?
Right, they won't be accessed. This is just for the sake of completeness
and to prevent double free in case there are code bug in other place.
Regards,
Shradha.
>
> Thanks
^ permalink raw reply
* [RFC 12/12] Drivers: hv: vmbus: Ensure IRQ affinity isn't set to a CPU going offline
From: mhkelley58 @ 2024-06-04 5:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, lpieralisi, kw, robh, bhelgaas, James.Bottomley,
martin.petersen, arnd, linux-hyperv, linux-kernel, linux-pci,
linux-scsi, linux-arch
Cc: maz, den, jgowans, dawei.li
In-Reply-To: <20240604050940.859909-1-mhklinux@outlook.com>
From: Michael Kelley <mhklinux@outlook.com>
hv_synic_cleanup() currently prevents a CPU from going offline if any
VMBus channel IRQs are targeted at that CPU. However, current code has a
race in that an IRQ could be affinitized to the CPU after the check in
hv_synic_cleanup() and before the CPU is removed from cpu_online_mask.
Any channel interrupts could be lost and the channel would hang.
Fix this by adding a flag for each CPU indicating if the synic is online.
Filter the new affinity with these flags so that vmbus_irq_set_affinity()
doesn't pick a CPU where the synic is already offline.
Also add a spin lock so that vmbus_irq_set_affinity() changing the
channel target_cpu and sending the MODIFYCHANNEL message to Hyper-V
are atomic with respect to the checks made in hv_synic_cleanup().
hv_synic_cleanup() needs these operations to be atomic so that it
can correctly count the MODIFYCHANNEL messages that need to be
ack'ed by Hyper-V.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
drivers/hv/connection.c | 1 +
drivers/hv/hv.c | 22 ++++++++++++++++++++--
drivers/hv/hyperv_vmbus.h | 2 ++
drivers/hv/vmbus_drv.c | 34 ++++++++++++++++++++++++++++------
4 files changed, 51 insertions(+), 8 deletions(-)
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index a105eecdeec2..b44ce3d39135 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -213,6 +213,7 @@ int vmbus_connect(void)
INIT_LIST_HEAD(&vmbus_connection.chn_list);
mutex_init(&vmbus_connection.channel_mutex);
+ spin_lock_init(&vmbus_connection.set_affinity_lock);
/*
* Setup the vmbus event connection for channel interrupt
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index 76658dfc5008..89e491219129 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -338,6 +338,8 @@ int hv_synic_init(unsigned int cpu)
{
hv_synic_enable_regs(cpu);
+ cpumask_set_cpu(cpu, &vmbus_connection.synic_online);
+
hv_stimer_legacy_init(cpu, VMBUS_MESSAGE_SINT);
return 0;
@@ -513,6 +515,17 @@ int hv_synic_cleanup(unsigned int cpu)
* TODO: Re-bind the channels to different CPUs.
*/
mutex_lock(&vmbus_connection.channel_mutex);
+ spin_lock(&vmbus_connection.set_affinity_lock);
+
+ /*
+ * Once the check for channels assigned to this CPU is complete, we
+ * must not allow a channel to be assigned to this CPU. So mark
+ * the synic as no longer online. This cpumask is checked in
+ * vmbus_irq_set_affinity() to prevent setting the affinity of
+ * an IRQ to such a CPU.
+ */
+ cpumask_clear_cpu(cpu, &vmbus_connection.synic_online);
+
list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
if (channel->target_cpu == cpu) {
channel_found = true;
@@ -527,10 +540,11 @@ int hv_synic_cleanup(unsigned int cpu)
if (channel_found)
break;
}
+ spin_unlock(&vmbus_connection.set_affinity_lock);
mutex_unlock(&vmbus_connection.channel_mutex);
if (channel_found)
- return -EBUSY;
+ goto set_online;
/*
* channel_found == false means that any channels that were previously
@@ -547,7 +561,7 @@ int hv_synic_cleanup(unsigned int cpu)
if (hv_synic_event_pending()) {
pr_err("Events pending when trying to offline CPU %d\n",
cpu);
- return -EBUSY;
+ goto set_online;
}
}
@@ -557,4 +571,8 @@ int hv_synic_cleanup(unsigned int cpu)
hv_synic_disable_regs(cpu);
return 0;
+
+set_online:
+ cpumask_set_cpu(cpu, &vmbus_connection.synic_online);
+ return -EBUSY;
}
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 571b2955b38e..92ae5af10778 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -263,6 +263,8 @@ struct vmbus_connection {
struct fwnode_handle *vmbus_fwnode;
struct irq_domain *vmbus_irq_domain;
struct irq_chip vmbus_irq_chip;
+ cpumask_t synic_online;
+ spinlock_t set_affinity_lock;
/*
* VM-wide counts of MODIFYCHANNEL messages sent and completed.
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 87f2f3436136..3430ad42d7ba 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1351,6 +1351,14 @@ int vmbus_irq_set_affinity(struct irq_data *data,
return -EINVAL;
}
+ /*
+ * The spin lock must be held so that checking synic_online, sending
+ * the MODIFYCHANNEL message, and setting channel->target_cpu are
+ * atomic with respect to hv_synic_cleanup() clearing the CPU in
+ * synic_online and doing the search.
+ */
+ spin_lock(&vmbus_connection.set_affinity_lock);
+
/* Don't consider CPUs that are isolated */
if (housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
cpumask_and(&tempmask, dest,
@@ -1367,30 +1375,39 @@ int vmbus_irq_set_affinity(struct irq_data *data,
origin_cpu = channel->target_cpu;
if (cpumask_test_cpu(origin_cpu, &tempmask)) {
target_cpu = origin_cpu;
+ spin_unlock(&vmbus_connection.set_affinity_lock);
goto update_effective;
}
/*
* Pick a CPU from the new affinity mask. As a simple heuristic to
* spread out the selection when the mask contains multiple CPUs,
- * start with whatever CPU was last selected.
+ * start with whatever CPU was last selected. Also filter out any
+ * CPUs where synic_online isn't set -- these CPUs are in the process
+ * of going offline and must not have channel interrupts assigned
+ * to them.
*/
+ cpumask_and(&tempmask, &tempmask, &vmbus_connection.synic_online);
target_cpu = cpumask_next_wrap(next_cpu, &tempmask, nr_cpu_ids, false);
- if (target_cpu >= nr_cpu_ids)
- return -EINVAL;
+ if (target_cpu >= nr_cpu_ids) {
+ ret = -EINVAL;
+ goto unlock;
+ }
next_cpu = target_cpu;
/*
* Hyper-V will ignore MODIFYCHANNEL messages for "non-open" channels;
* avoid sending the message and fail here for such channels.
*/
- if (channel->state != CHANNEL_OPENED_STATE)
- return -EIO;
+ if (channel->state != CHANNEL_OPENED_STATE) {
+ ret = -EIO;
+ goto unlock;
+ }
ret = vmbus_send_modifychannel(channel,
hv_cpu_number_to_vp_number(target_cpu));
if (ret)
- return ret;
+ goto unlock;
/*
* Warning. At this point, there is *no* guarantee that the host will
@@ -1408,6 +1425,7 @@ int vmbus_irq_set_affinity(struct irq_data *data,
*/
channel->target_cpu = target_cpu;
+ spin_unlock(&vmbus_connection.set_affinity_lock);
/* See init_vp_index(). */
if (hv_is_perf_channel(channel))
@@ -1422,6 +1440,10 @@ int vmbus_irq_set_affinity(struct irq_data *data,
update_effective:
irq_data_update_effective_affinity(data, cpumask_of(target_cpu));
return IRQ_SET_MASK_OK;
+
+unlock:
+ spin_unlock(&vmbus_connection.set_affinity_lock);
+ return ret;
}
/*
--
2.25.1
^ permalink raw reply related
* [RFC 11/12] Drivers: hv: vmbus: Wait for MODIFYCHANNEL to finish when offlining CPUs
From: mhkelley58 @ 2024-06-04 5:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, lpieralisi, kw, robh, bhelgaas, James.Bottomley,
martin.petersen, arnd, linux-hyperv, linux-kernel, linux-pci,
linux-scsi, linux-arch
Cc: maz, den, jgowans, dawei.li
In-Reply-To: <20240604050940.859909-1-mhklinux@outlook.com>
From: Michael Kelley <mhklinux@outlook.com>
vmbus_irq_set_affinity() may issue a MODIFYCHANNEL request to Hyper-V to
target a VMBus channel interrupt to a different CPU. While newer versions
of Hyper-V send a response to the guest when the change is complete,
vmbus_irq_set_affinity() does not wait for the response because it is
running with interrupts disabled. So Hyper-V may continue to direct
interrupts to the old CPU for a short window after vmbus_irq_set_affinity()
completes. This lag is not a problem during normal operation. But if
the old CPU is taken offline during that window, Hyper-V may drop
the interrupt because the synic in the target CPU is disabled. Dropping
the interrupt may cause the VMBus channel to hang.
To prevent this, wait for in-process MODIFYCHANNEL requests when taking
a CPU offline. On newer versions of Hyper-V, completion can be confirmed
by waiting for the response sent by Hyper-V. But on older versions of
Hyper-V that don't send a response, wait a fixed interval of time that
empirically should be "long enough", as that's the best that can be done.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
drivers/hv/channel.c | 3 ++
drivers/hv/channel_mgmt.c | 32 ++++--------------
drivers/hv/hv.c | 70 +++++++++++++++++++++++++++++++++++----
drivers/hv/hyperv_vmbus.h | 8 +++++
4 files changed, 81 insertions(+), 32 deletions(-)
diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index b7920072e243..b7ee95373049 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -246,6 +246,9 @@ int vmbus_send_modifychannel(struct vmbus_channel *channel, u32 target_vp)
ret = vmbus_post_msg(&msg, sizeof(msg), false);
trace_vmbus_send_modifychannel(&msg, ret);
+ if (!ret)
+ vmbus_connection.modchan_sent++;
+
return ret;
}
EXPORT_SYMBOL_GPL(vmbus_send_modifychannel);
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index da42aaae994e..960a2f0367d8 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -1465,40 +1465,20 @@ static void vmbus_ongpadl_created(struct vmbus_channel_message_header *hdr)
* vmbus_onmodifychannel_response - Modify Channel response handler.
*
* This is invoked when we received a response to our channel modify request.
- * Find the matching request, copy the response and signal the requesting thread.
+ * Increment the count of responses received. No locking is needed because
+ * responses are always received on the VMBUS_CONNECT_CPU.
*/
static void vmbus_onmodifychannel_response(struct vmbus_channel_message_header *hdr)
{
struct vmbus_channel_modifychannel_response *response;
- struct vmbus_channel_msginfo *msginfo;
- unsigned long flags;
response = (struct vmbus_channel_modifychannel_response *)hdr;
+ if (response->status)
+ pr_err("Error status %x in MODIFYCHANNEL response for relid %d\n",
+ response->status, response->child_relid);
+ vmbus_connection.modchan_completed++;
trace_vmbus_onmodifychannel_response(response);
-
- /*
- * Find the modify msg, copy the response and signal/unblock the wait event.
- */
- spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
-
- list_for_each_entry(msginfo, &vmbus_connection.chn_msg_list, msglistentry) {
- struct vmbus_channel_message_header *responseheader =
- (struct vmbus_channel_message_header *)msginfo->msg;
-
- if (responseheader->msgtype == CHANNELMSG_MODIFYCHANNEL) {
- struct vmbus_channel_modifychannel *modifymsg;
-
- modifymsg = (struct vmbus_channel_modifychannel *)msginfo->msg;
- if (modifymsg->child_relid == response->child_relid) {
- memcpy(&msginfo->response.modify_response, response,
- sizeof(*response));
- complete(&msginfo->waitevent);
- break;
- }
- }
- }
- spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
}
/*
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index a8ad728354cb..76658dfc5008 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -401,6 +401,56 @@ void hv_synic_disable_regs(unsigned int cpu)
disable_percpu_irq(vmbus_irq);
}
+static void hv_synic_wait_for_modifychannel(int cpu)
+{
+ int i = 5;
+ u64 base;
+
+ /*
+ * If we're on a VMBus version where MODIFYCHANNEL doesn't send acks,
+ * just sleep for 20 milliseconds and hope that gives Hyper-V enough
+ * time to process them. Empirical data on recent server-class CPUs
+ * (both x86 and arm64) shows that the Hyper-V response is typically
+ * received and processed in the guest within a few hundred
+ * microseconds. The 20 millisecond wait is somewhat arbitrary and
+ * intended to give plenty to time in case there are multiple
+ * MODIFYCHANNEL requests in progress and the host is busy. It's
+ * the best we can do.
+ */
+ if (vmbus_proto_version < VERSION_WIN10_V5_3) {
+ usleep_range(20000, 25000);
+ return;
+ }
+
+ /*
+ * Otherwise compare the current value of modchan_completed against
+ * modchan_sent. If some MODIFYCHANNEL requests have been sent that
+ * haven't completed, sleep 5 milliseconds and check again. If the
+ * requests still haven't completed after 5 attempts, output a
+ * message and proceed anyway.
+ *
+ * Hyper-V guarantees to process MODIFYCHANNEL requests in the order
+ * they are received from the guest, so simply comparing the counts
+ * is sufficient.
+ *
+ * Note that this check may encompass MODIFYCHANNEL requests that are
+ * unrelated to the CPU that is going offline. But the only effect is
+ * to potentially wait a little bit longer than necessary. CPUs going
+ * offline and affinity changes that result in MODIFYCHANNEL are
+ * relatively rare and it's not worth the complexity to track them more
+ * precisely.
+ */
+ base = READ_ONCE(vmbus_connection.modchan_sent);
+ while (READ_ONCE(vmbus_connection.modchan_completed) < base && i) {
+ usleep_range(5000, 10000);
+ i--;
+ }
+
+ if (i == 0)
+ pr_err("Timed out waiting for MODIFYCHANNEL. CPU %d sent %lld completed %lld\n",
+ cpu, base, vmbus_connection.modchan_completed);
+}
+
#define HV_MAX_TRIES 3
/*
* Scan the event flags page of 'this' CPU looking for any bit that is set. If we find one
@@ -485,13 +535,21 @@ int hv_synic_cleanup(unsigned int cpu)
/*
* channel_found == false means that any channels that were previously
* assigned to the CPU have been reassigned elsewhere with a call of
- * vmbus_send_modifychannel(). Scan the event flags page looking for
- * bits that are set and waiting with a timeout for vmbus_chan_sched()
- * to process such bits. If bits are still set after this operation
- * and VMBus is connected, fail the CPU offlining operation.
+ * vmbus_send_modifychannel(). First wait until any MODIFYCHANNEL
+ * requests have been completed by Hyper-V, after which we know that
+ * no new bits in the event flags will be set. Then scan the event flags
+ * page looking for bits that are set and waiting with a timeout for
+ * vmbus_chan_sched() to process such bits. If bits are still set
+ * after this operation, fail the CPU offlining operation.
*/
- if (vmbus_proto_version >= VERSION_WIN10_V4_1 && hv_synic_event_pending())
- return -EBUSY;
+ if (vmbus_proto_version >= VERSION_WIN10_V4_1) {
+ hv_synic_wait_for_modifychannel(cpu);
+ if (hv_synic_event_pending()) {
+ pr_err("Events pending when trying to offline CPU %d\n",
+ cpu);
+ return -EBUSY;
+ }
+ }
always_cleanup:
hv_stimer_legacy_cleanup(cpu);
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index bf35bb40c55e..571b2955b38e 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -264,6 +264,14 @@ struct vmbus_connection {
struct irq_domain *vmbus_irq_domain;
struct irq_chip vmbus_irq_chip;
+ /*
+ * VM-wide counts of MODIFYCHANNEL messages sent and completed.
+ * Used when taking a CPU offline to make sure the relevant
+ * MODIFYCHANNEL messages have been completed.
+ */
+ u64 modchan_sent;
+ u64 modchan_completed;
+
/*
* An offer message is handled first on the work_queue, and then
* is further handled on handle_primary_chan_wq or
--
2.25.1
^ permalink raw reply related
* [RFC 10/12] Drivers: hv: vmbus: Implement vmbus_irq_set_affinity
From: mhkelley58 @ 2024-06-04 5:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, lpieralisi, kw, robh, bhelgaas, James.Bottomley,
martin.petersen, arnd, linux-hyperv, linux-kernel, linux-pci,
linux-scsi, linux-arch
Cc: maz, den, jgowans, dawei.li
In-Reply-To: <20240604050940.859909-1-mhklinux@outlook.com>
From: Michael Kelley <mhklinux@outlook.com>
Pull out core code from target_cpu_store() to implement
vmbus_irq_set_affinity() so the affinity of VMBus channel interrupts
can be updated from user space via /proc/irq.
Since vmbus_irq_set_affinity() runs with interrupts disabled,
vmbus_send_modifychannel() can't wait for an ACK from Hyper-V. As
such, remove the "wait for ack" version of vmbus_send_modifychannel().
Not waiting isn't a problem unless the old CPU is quickly taken offline
before Hyper-V makes the change, which is dealt with in a subsequent
patch.
Also change target_cpu_store() to call irq_set_affinity() so that
changes made via /sys/bus/vmbus/devices/<guid>/channels/<nn>/cpu
are in sync with the /proc/irq interface. The cpus_read_lock() is
no longer needed in target_cpu_store() because irq_set_affinity()
ensures that the interrupt affinity is not set to an offline
CPU.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
drivers/hv/channel.c | 97 ++++++-------------------
drivers/hv/vmbus_drv.c | 161 +++++++++++++++++++++++++----------------
2 files changed, 121 insertions(+), 137 deletions(-)
diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 1aa020b538f1..b7920072e243 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -212,79 +212,6 @@ int vmbus_send_tl_connect_request(const guid_t *shv_guest_servie_id,
}
EXPORT_SYMBOL_GPL(vmbus_send_tl_connect_request);
-static int send_modifychannel_without_ack(struct vmbus_channel *channel, u32 target_vp)
-{
- struct vmbus_channel_modifychannel msg;
- int ret;
-
- memset(&msg, 0, sizeof(msg));
- msg.header.msgtype = CHANNELMSG_MODIFYCHANNEL;
- msg.child_relid = channel->offermsg.child_relid;
- msg.target_vp = target_vp;
-
- ret = vmbus_post_msg(&msg, sizeof(msg), true);
- trace_vmbus_send_modifychannel(&msg, ret);
-
- return ret;
-}
-
-static int send_modifychannel_with_ack(struct vmbus_channel *channel, u32 target_vp)
-{
- struct vmbus_channel_modifychannel *msg;
- struct vmbus_channel_msginfo *info;
- unsigned long flags;
- int ret;
-
- info = kzalloc(sizeof(struct vmbus_channel_msginfo) +
- sizeof(struct vmbus_channel_modifychannel),
- GFP_KERNEL);
- if (!info)
- return -ENOMEM;
-
- init_completion(&info->waitevent);
- info->waiting_channel = channel;
-
- msg = (struct vmbus_channel_modifychannel *)info->msg;
- msg->header.msgtype = CHANNELMSG_MODIFYCHANNEL;
- msg->child_relid = channel->offermsg.child_relid;
- msg->target_vp = target_vp;
-
- spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
- list_add_tail(&info->msglistentry, &vmbus_connection.chn_msg_list);
- spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
-
- ret = vmbus_post_msg(msg, sizeof(*msg), true);
- trace_vmbus_send_modifychannel(msg, ret);
- if (ret != 0) {
- spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
- list_del(&info->msglistentry);
- spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
- goto free_info;
- }
-
- /*
- * Release channel_mutex; otherwise, vmbus_onoffer_rescind() could block on
- * the mutex and be unable to signal the completion.
- *
- * See the caller target_cpu_store() for information about the usage of the
- * mutex.
- */
- mutex_unlock(&vmbus_connection.channel_mutex);
- wait_for_completion(&info->waitevent);
- mutex_lock(&vmbus_connection.channel_mutex);
-
- spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
- list_del(&info->msglistentry);
- spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
-
- if (info->response.modify_response.status)
- ret = -EAGAIN;
-
-free_info:
- kfree(info);
- return ret;
-}
-
/*
* Set/change the vCPU (@target_vp) the channel (@child_relid) will interrupt.
*
@@ -294,14 +221,32 @@ static int send_modifychannel_with_ack(struct vmbus_channel *channel, u32 target
* out an ACK, we can not know when the host will stop interrupting the "old"
* vCPU and start interrupting the "new" vCPU for the given channel.
*
+ * But even if Hyper-V provides the ACK, we don't wait for it because the
+ * caller, vmbus_irq_set_affinity(), is running with a spin lock held. The
+ * unknown delay in when the host will start interrupting the new vCPU is not
+ * a problem unless the old vCPU is taken offline, and that situation is dealt
+ * with separately in the CPU offlining path.
+ *
* The CHANNELMSG_MODIFYCHANNEL message type is supported since VMBus version
* VERSION_WIN10_V4_1.
*/
int vmbus_send_modifychannel(struct vmbus_channel *channel, u32 target_vp)
{
- if (vmbus_proto_version >= VERSION_WIN10_V5_3)
- return send_modifychannel_with_ack(channel, target_vp);
- return send_modifychannel_without_ack(channel, target_vp);
+ struct vmbus_channel_modifychannel msg;
+ int ret;
+
+ if (vmbus_proto_version < VERSION_WIN10_V4_1)
+ return -EINVAL;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.header.msgtype = CHANNELMSG_MODIFYCHANNEL;
+ msg.child_relid = channel->offermsg.child_relid;
+ msg.target_vp = target_vp;
+
+ ret = vmbus_post_msg(&msg, sizeof(msg), false);
+ trace_vmbus_send_modifychannel(&msg, ret);
+
+ return ret;
}
EXPORT_SYMBOL_GPL(vmbus_send_modifychannel);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index b73be7c02d37..87f2f3436136 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -22,7 +22,6 @@
#include <linux/kernel_stat.h>
#include <linux/of_address.h>
#include <linux/clockchips.h>
-#include <linux/cpu.h>
#include <linux/sched/isolation.h>
#include <linux/sched/task_stack.h>
@@ -1322,10 +1321,107 @@ static irqreturn_t vmbus_percpu_isr(int irq, void *dev_id)
return IRQ_NONE;
}
+/*
+ * This function is invoked by user space affinity changes initiated
+ * from /proc/irq/<nn> or from the legacy VMBus-specific interface at
+ * /sys/bus/vmbus/devices/<guid>/channels/<nn>/cpu.
+ *
+ * In the former case, the /proc implementation ensures that unmapping
+ * (i.e., deleting) the IRQ will pend while this function is in progress.
+ * Since deleting the channel unmaps the IRQ first, the channel can't go
+ * away either.
+ *
+ * In the latter case, the VMBus connection channel_mutex is held, which
+ * prevents channel deltion, and therefore IRQ unampping as well.
+ *
+ * So in both cases, accessing the channel and IRQ data structures is safe.
+ */
int vmbus_irq_set_affinity(struct irq_data *data,
const struct cpumask *dest, bool force)
{
- return 0;
+ static int next_cpu;
+ static cpumask_t tempmask;
+ int origin_cpu, target_cpu;
+ struct vmbus_channel *channel = irq_data_get_irq_handler_data(data);
+ int ret;
+
+ if (!channel) {
+ pr_err("Bad channel in vmbus_irq_set_affinity for relid %ld\n",
+ data->hwirq);
+ return -EINVAL;
+ }
+
+ /* Don't consider CPUs that are isolated */
+ if (housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
+ cpumask_and(&tempmask, dest,
+ housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
+ else
+ cpumask_copy(&tempmask, dest);
+
+ /*
+ * If Hyper-V is already targeting a CPU in the new affinity mask,
+ * keep that targeting and Hyper-V doesn't need to be updated. But
+ * still set effective affinity as it may be unset when the IRQ is
+ * first created.
+ */
+ origin_cpu = channel->target_cpu;
+ if (cpumask_test_cpu(origin_cpu, &tempmask)) {
+ target_cpu = origin_cpu;
+ goto update_effective;
+ }
+
+ /*
+ * Pick a CPU from the new affinity mask. As a simple heuristic to
+ * spread out the selection when the mask contains multiple CPUs,
+ * start with whatever CPU was last selected.
+ */
+ target_cpu = cpumask_next_wrap(next_cpu, &tempmask, nr_cpu_ids, false);
+ if (target_cpu >= nr_cpu_ids)
+ return -EINVAL;
+ next_cpu = target_cpu;
+
+ /*
+ * Hyper-V will ignore MODIFYCHANNEL messages for "non-open" channels;
+ * avoid sending the message and fail here for such channels.
+ */
+ if (channel->state != CHANNEL_OPENED_STATE)
+ return -EIO;
+
+ ret = vmbus_send_modifychannel(channel,
+ hv_cpu_number_to_vp_number(target_cpu));
+ if (ret)
+ return ret;
+
+ /*
+ * Warning. At this point, there is *no* guarantee that the host will
+ * have successfully processed the vmbus_send_modifychannel() request.
+ * See the header comment of vmbus_send_modifychannel() for more info.
+ *
+ * Lags in the processing of the above vmbus_send_modifychannel() can
+ * result in missed interrupts if the "old" target CPU is taken offline
+ * before Hyper-V starts sending interrupts to the "new" target CPU.
+ * hv_synic_cleanup() will ensure no interrupts are missed.
+ *
+ * But apart from this offlining scenario, the code tolerates such
+ * lags. It will function correctly even if a channel interrupt comes
+ * in on a CPU that is different from the channel target_cpu value.
+ */
+
+ channel->target_cpu = target_cpu;
+
+ /* See init_vp_index(). */
+ if (hv_is_perf_channel(channel))
+ hv_update_allocated_cpus(origin_cpu, target_cpu);
+
+ /* Currently set only for storvsc channels. */
+ if (channel->change_target_cpu_callback) {
+ (*channel->change_target_cpu_callback)(channel,
+ origin_cpu, target_cpu);
+ }
+
+update_effective:
+ irq_data_update_effective_affinity(data, cpumask_of(target_cpu));
+ return IRQ_SET_MASK_OK;
}
/*
@@ -1655,7 +1751,7 @@ static ssize_t target_cpu_show(struct vmbus_channel *channel, char *buf)
static ssize_t target_cpu_store(struct vmbus_channel *channel,
const char *buf, size_t count)
{
- u32 target_cpu, origin_cpu;
+ u32 target_cpu;
ssize_t ret = count;
if (vmbus_proto_version < VERSION_WIN10_V4_1)
@@ -1668,17 +1764,6 @@ static ssize_t target_cpu_store(struct vmbus_channel *channel,
if (target_cpu >= nr_cpumask_bits)
return -EINVAL;
- if (!cpumask_test_cpu(target_cpu, housekeeping_cpumask(HK_TYPE_MANAGED_IRQ)))
- return -EINVAL;
-
- /* No CPUs should come up or down during this. */
- cpus_read_lock();
-
- if (!cpu_online(target_cpu)) {
- cpus_read_unlock();
- return -EINVAL;
- }
-
/*
* Synchronizes target_cpu_store() and channel closure:
*
@@ -1703,55 +1788,9 @@ static ssize_t target_cpu_store(struct vmbus_channel *channel,
*/
mutex_lock(&vmbus_connection.channel_mutex);
- /*
- * Hyper-V will ignore MODIFYCHANNEL messages for "non-open" channels;
- * avoid sending the message and fail here for such channels.
- */
- if (channel->state != CHANNEL_OPENED_STATE) {
- ret = -EIO;
- goto cpu_store_unlock;
- }
-
- origin_cpu = channel->target_cpu;
- if (target_cpu == origin_cpu)
- goto cpu_store_unlock;
-
- if (vmbus_send_modifychannel(channel,
- hv_cpu_number_to_vp_number(target_cpu))) {
- ret = -EIO;
- goto cpu_store_unlock;
- }
-
- /*
- * For version before VERSION_WIN10_V5_3, the following warning holds:
- *
- * Warning. At this point, there is *no* guarantee that the host will
- * have successfully processed the vmbus_send_modifychannel() request.
- * See the header comment of vmbus_send_modifychannel() for more info.
- *
- * Lags in the processing of the above vmbus_send_modifychannel() can
- * result in missed interrupts if the "old" target CPU is taken offline
- * before Hyper-V starts sending interrupts to the "new" target CPU.
- * But apart from this offlining scenario, the code tolerates such
- * lags. It will function correctly even if a channel interrupt comes
- * in on a CPU that is different from the channel target_cpu value.
- */
-
- channel->target_cpu = target_cpu;
-
- /* See init_vp_index(). */
- if (hv_is_perf_channel(channel))
- hv_update_allocated_cpus(origin_cpu, target_cpu);
-
- /* Currently set only for storvsc channels. */
- if (channel->change_target_cpu_callback) {
- (*channel->change_target_cpu_callback)(channel,
- origin_cpu, target_cpu);
- }
+ ret = irq_set_affinity(channel->irq, cpumask_of(target_cpu));
-cpu_store_unlock:
mutex_unlock(&vmbus_connection.channel_mutex);
- cpus_read_unlock();
return ret;
}
static VMBUS_CHAN_ATTR(cpu, 0644, target_cpu_show, target_cpu_store);
--
2.25.1
^ permalink raw reply related
* [RFC 09/12] Drivers: hv: vmbus: Use Linux IRQs to handle VMBus channel interrupts
From: mhkelley58 @ 2024-06-04 5:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, lpieralisi, kw, robh, bhelgaas, James.Bottomley,
martin.petersen, arnd, linux-hyperv, linux-kernel, linux-pci,
linux-scsi, linux-arch
Cc: maz, den, jgowans, dawei.li
In-Reply-To: <20240604050940.859909-1-mhklinux@outlook.com>
From: Michael Kelley <mhklinux@outlook.com>
Do the following:
1) Create an interrupt handler for VMBus channel interrupts by pulling
out portions of vmbus_chan_sched() into vmbus_chan_handler(). The
outer part of vmbus_chan_sched() that loops through the synic event
page bitmap remains unchanged. But when a pending VMBus channel
interrupt is found, call generic_handle_irq_desc() to invoke
handle_simple_irq() and then vmbus_chan_handler() for the channel's
IRQ. handle_simple_irq() does the IRQ stats for that channel's IRQ,
so that per-channel interrupt counts appear in /proc/interrupts. The
overall processing of VMBus channel interrupts is unchanged except
for the intervening handle_simple_irq() that does the stats. No acks
or EOIs are required for VMBus channel IRQs.
2) Update __vmbus_open() to call request_irq(), specifying the previously
setup channel IRQ name and vmbus_chan_handler() as the interrupt
handler. Set the IRQ affinity to the target_cpu assigned when the
channel was created.
3) Update vmbus_isr() to return "false" if it only handles VMBus
interrupts, which were passed to the channel IRQ handler. If
vmbus_isr() handles one or more control message interrupts, then
return "true". Update the related definitions to specify a boolean
return value.
4) The callers of vmbus_isr() increment IRQ stats for the top-level
IRQ only if "true" is returned. On x86, the caller is
sysvec_hyperv_callback(), which manages the stats directly. On
arm64, the caller is vmbus_percpu_isr(), which maps the boolean
return value to IRQ_NONE ("false") or IRQ_HANDLED ("true").
Then handle_percpu_demux_irq() conditionally updates the
stats based on the return value from vmbus_percpu_isr().
With these changes, interrupts from VMBus channels are now
processed as Linux IRQs that are demultiplexed from the main
VMBus interrupt.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
arch/x86/kernel/cpu/mshyperv.c | 9 ++--
drivers/hv/channel.c | 25 +++++++++-
drivers/hv/hv_common.c | 2 +-
drivers/hv/vmbus_drv.c | 84 +++++++++++++++++++---------------
include/asm-generic/mshyperv.h | 3 +-
5 files changed, 79 insertions(+), 44 deletions(-)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index e0fd57a8ba84..18bc282a99db 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -110,7 +110,7 @@ void hv_set_msr(unsigned int reg, u64 value)
}
EXPORT_SYMBOL_GPL(hv_set_msr);
-static void (*vmbus_handler)(void);
+static bool (*vmbus_handler)(void);
static void (*hv_stimer0_handler)(void);
static void (*hv_kexec_handler)(void);
static void (*hv_crash_handler)(struct pt_regs *regs);
@@ -119,9 +119,8 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
{
struct pt_regs *old_regs = set_irq_regs(regs);
- inc_irq_stat(irq_hv_callback_count);
- if (vmbus_handler)
- vmbus_handler();
+ if (vmbus_handler && vmbus_handler())
+ inc_irq_stat(irq_hv_callback_count);
if (ms_hyperv.hints & HV_DEPRECATING_AEOI_RECOMMENDED)
apic_eoi();
@@ -129,7 +128,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
set_irq_regs(old_regs);
}
-void hv_setup_vmbus_handler(void (*handler)(void))
+void hv_setup_vmbus_handler(bool (*handler)(void))
{
vmbus_handler = handler;
}
diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index fb8cd8469328..1aa020b538f1 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -638,6 +638,7 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
struct vmbus_channel_open_channel *open_msg;
struct vmbus_channel_msginfo *open_info = NULL;
struct page *page = newchannel->ringbuffer_page;
+ u32 relid = newchannel->offermsg.child_relid;
u32 send_pages, recv_pages;
unsigned long flags;
int err;
@@ -685,13 +686,31 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
if (err)
goto error_free_gpadl;
+ /* Request the IRQ and assign to target_cpu */
+ err = request_irq(newchannel->irq, vmbus_chan_handler, 0,
+ newchannel->irq_name, newchannel);
+ if (err) {
+ pr_err("request_irq failed with %d for relid %d irq %d\n",
+ err, relid, newchannel->irq);
+ goto error_free_gpadl;
+ }
+ err = irq_set_affinity_and_hint(newchannel->irq,
+ cpumask_of(newchannel->target_cpu));
+ if (err) {
+ pr_err("irq_set_affinity_and_hint failed with %d for relid %d irq %d\n",
+ err, relid, newchannel->irq);
+ free_irq(newchannel->irq, newchannel);
+ goto error_free_gpadl;
+ }
+ newchannel->irq_requested = true;
+
/* Create and init the channel open message */
open_info = kzalloc(sizeof(*open_info) +
sizeof(struct vmbus_channel_open_channel),
GFP_KERNEL);
if (!open_info) {
err = -ENOMEM;
- goto error_free_gpadl;
+ goto error_free_irq;
}
init_completion(&open_info->waitevent);
@@ -759,6 +778,10 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
error_free_info:
kfree(open_info);
+error_free_irq:
+ irq_update_affinity_hint(newchannel->irq, NULL);
+ free_irq(newchannel->irq, newchannel);
+ newchannel->irq_requested = false;
error_free_gpadl:
vmbus_teardown_gpadl(newchannel, &newchannel->ringbuffer_gpadlhandle);
error_clean_ring:
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 9c452bfbd571..38a23add721c 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -610,7 +610,7 @@ bool __weak hv_isolation_type_tdx(void)
}
EXPORT_SYMBOL_GPL(hv_isolation_type_tdx);
-void __weak hv_setup_vmbus_handler(void (*handler)(void))
+void __weak hv_setup_vmbus_handler(bool (*handler)(void))
{
}
EXPORT_SYMBOL_GPL(hv_setup_vmbus_handler);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 8fd03d41e71a..b73be7c02d37 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1193,6 +1193,45 @@ static void vmbus_force_channel_rescinded(struct vmbus_channel *channel)
}
#endif /* CONFIG_PM_SLEEP */
+irqreturn_t vmbus_chan_handler(int irq, void *dev_id)
+{
+ void (*callback_fn)(void *context);
+ struct vmbus_channel *channel = dev_id;
+
+ /*
+ * Make sure that the ring buffer data structure doesn't get
+ * freed while we dereference the ring buffer pointer. Test
+ * for the channel's onchannel_callback being NULL within a
+ * sched_lock critical section. See also the inline comments
+ * in vmbus_reset_channel_cb().
+ */
+ spin_lock(&channel->sched_lock);
+
+ callback_fn = channel->onchannel_callback;
+ if (unlikely(callback_fn == NULL))
+ goto spin_unlock;
+
+ trace_vmbus_chan_sched(channel);
+
+ ++channel->interrupts;
+
+ switch (channel->callback_mode) {
+ case HV_CALL_ISR:
+ (*callback_fn)(channel->channel_callback_context);
+ break;
+
+ case HV_CALL_BATCHED:
+ hv_begin_read(&channel->inbound);
+ fallthrough;
+ case HV_CALL_DIRECT:
+ tasklet_schedule(&channel->callback_event);
+ }
+
+spin_unlock:
+ spin_unlock(&channel->sched_lock);
+ return IRQ_HANDLED;
+}
+
/*
* Schedule all channels with events pending
*/
@@ -1217,7 +1256,6 @@ static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu)
return;
for_each_set_bit(relid, recv_int_page, maxbits) {
- void (*callback_fn)(void *context);
struct vmbus_channel *channel;
struct irq_desc *desc;
@@ -1244,43 +1282,14 @@ static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu)
if (channel->rescind)
goto sched_unlock_rcu;
- /*
- * Make sure that the ring buffer data structure doesn't get
- * freed while we dereference the ring buffer pointer. Test
- * for the channel's onchannel_callback being NULL within a
- * sched_lock critical section. See also the inline comments
- * in vmbus_reset_channel_cb().
- */
- spin_lock(&channel->sched_lock);
-
- callback_fn = channel->onchannel_callback;
- if (unlikely(callback_fn == NULL))
- goto sched_unlock;
-
- trace_vmbus_chan_sched(channel);
-
- ++channel->interrupts;
-
- switch (channel->callback_mode) {
- case HV_CALL_ISR:
- (*callback_fn)(channel->channel_callback_context);
- break;
-
- case HV_CALL_BATCHED:
- hv_begin_read(&channel->inbound);
- fallthrough;
- case HV_CALL_DIRECT:
- tasklet_schedule(&channel->callback_event);
- }
+ generic_handle_irq_desc(desc);
-sched_unlock:
- spin_unlock(&channel->sched_lock);
sched_unlock_rcu:
rcu_read_unlock();
}
}
-static void vmbus_isr(void)
+static bool vmbus_isr(void)
{
struct hv_per_cpu_context *hv_cpu
= this_cpu_ptr(hv_context.cpu_context);
@@ -1299,15 +1308,18 @@ static void vmbus_isr(void)
vmbus_signal_eom(msg, HVMSG_TIMER_EXPIRED);
} else
tasklet_schedule(&hv_cpu->msg_dpc);
- }
- add_interrupt_randomness(vmbus_interrupt);
+ add_interrupt_randomness(vmbus_interrupt);
+ return true;
+ }
+ return false;
}
static irqreturn_t vmbus_percpu_isr(int irq, void *dev_id)
{
- vmbus_isr();
- return IRQ_HANDLED;
+ if (vmbus_isr())
+ return IRQ_HANDLED;
+ return IRQ_NONE;
}
int vmbus_irq_set_affinity(struct irq_data *data,
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index 0488ff8b511f..0a5559b9d5f7 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -178,7 +178,7 @@ static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type)
int hv_get_hypervisor_version(union hv_hypervisor_version_info *info);
-void hv_setup_vmbus_handler(void (*handler)(void));
+void hv_setup_vmbus_handler(bool (*handler)(void));
void hv_remove_vmbus_handler(void);
void hv_setup_stimer0_handler(void (*handler)(void));
void hv_remove_stimer0_handler(void);
@@ -188,6 +188,7 @@ void hv_remove_kexec_handler(void);
void hv_setup_crash_handler(void (*handler)(struct pt_regs *regs));
void hv_remove_crash_handler(void);
+extern irqreturn_t vmbus_chan_handler(int irq, void *dev_id);
extern void vmbus_irq_mask(struct irq_data *data);
extern void vmbus_irq_unmask(struct irq_data *data);
extern int vmbus_irq_set_affinity(struct irq_data *data,
--
2.25.1
^ permalink raw reply related
* [RFC 08/12] Drivers: hv: vmbus: Allocate an IRQ per channel and use for relid mapping
From: mhkelley58 @ 2024-06-04 5:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, lpieralisi, kw, robh, bhelgaas, James.Bottomley,
martin.petersen, arnd, linux-hyperv, linux-kernel, linux-pci,
linux-scsi, linux-arch
Cc: maz, den, jgowans, dawei.li
In-Reply-To: <20240604050940.859909-1-mhklinux@outlook.com>
From: Michael Kelley <mhklinux@outlook.com>
Current code uses the "channels" array in the VMBus connection to map from
relid to VMBus channel. Functions exist to establish and remove mappings,
and to map from a relid to its channel.
To implement assigning Linux IRQs to VMBus channels, repurpose the existing
relid functions. Change the "map" function to create an IRQ mapping in the
VMBus IRQ domain, and set the handler data for the IRQ to be the VMBus
channel. Change the "unmap" function to remove the IRQ mapping after
freeing the IRQ if it has been requested. Change the relid2channel()
function to look up the mapping and return both the channel and the
corresponding irqdesc.
While Linux IRQs are allocated for each channel, they are used only for
relid mapping. A subsequent patch changes the interrupt handling path
to use the Linux IRQ.
With these changes, the "channels" array in the VMBus connection is no
longer used. Remove it. Also fixup comments that refer to the channels
array and make them refer to the IRQ mapping instead.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
drivers/hv/channel_mgmt.c | 52 ++++++++++++++++++++++++++++-----------
drivers/hv/connection.c | 23 ++++++-----------
drivers/hv/hyperv_vmbus.h | 5 +---
drivers/hv/vmbus_drv.c | 8 +++---
include/linux/hyperv.h | 3 +++
5 files changed, 54 insertions(+), 37 deletions(-)
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index adbe184e5197..da42aaae994e 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -10,6 +10,9 @@
#include <linux/kernel.h>
#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/irqdesc.h>
+#include <linux/irqdomain.h>
#include <linux/sched.h>
#include <linux/wait.h>
#include <linux/mm.h>
@@ -410,7 +413,10 @@ static void free_channel(struct vmbus_channel *channel)
void vmbus_channel_map_relid(struct vmbus_channel *channel)
{
- if (WARN_ON(channel->offermsg.child_relid >= MAX_CHANNEL_RELIDS))
+ int res;
+ u32 relid = channel->offermsg.child_relid;
+
+ if (WARN_ON(relid >= MAX_CHANNEL_RELIDS))
return;
/*
* The mapping of the channel's relid is visible from the CPUs that
@@ -437,18 +443,33 @@ void vmbus_channel_map_relid(struct vmbus_channel *channel)
* of the VMBus driver and vmbus_chan_sched() can not run before
* vmbus_bus_resume() has completed execution (cf. resume_noirq).
*/
- virt_store_mb(
- vmbus_connection.channels[channel->offermsg.child_relid],
- channel);
+
+ channel->irq = irq_create_mapping(vmbus_connection.vmbus_irq_domain, relid);
+ if (!channel->irq) {
+ pr_err("irq_create_mapping failed for relid %d\n", relid);
+ return;
+ }
+
+ res = irq_set_handler_data(channel->irq, channel);
+ if (res) {
+ irq_dispose_mapping(channel->irq);
+ channel->irq = 0;
+ pr_err("irq_set_handler_data failed with %d for relid %d\n",
+ res, relid);
+ return;
+ }
+
+ irq_set_status_flags(channel->irq, IRQ_MOVE_PCNTXT);
}
void vmbus_channel_unmap_relid(struct vmbus_channel *channel)
{
- if (WARN_ON(channel->offermsg.child_relid >= MAX_CHANNEL_RELIDS))
- return;
- WRITE_ONCE(
- vmbus_connection.channels[channel->offermsg.child_relid],
- NULL);
+ if (channel->irq_requested) {
+ irq_update_affinity_hint(channel->irq, NULL);
+ free_irq(channel->irq, channel);
+ }
+ channel->irq_requested = false;
+ irq_dispose_mapping(channel->irq);
}
static void vmbus_release_relid(u32 relid)
@@ -478,10 +499,10 @@ void hv_process_channel_removal(struct vmbus_channel *channel)
!is_hvsock_channel(channel));
/*
- * Upon suspend, an in-use hv_sock channel is removed from the array of
- * channels and the relid is invalidated. After hibernation, when the
+ * Upon suspend, an in-use hv_sock channel is removed from the IRQ
+ * map and the relid is invalidated. After hibernation, when the
* user-space application destroys the channel, it's unnecessary and
- * unsafe to remove the channel from the array of channels. See also
+ * unsafe to remove the channel from the IRQ map. See also
* the inline comments before the call of vmbus_release_relid() below.
*/
if (channel->offermsg.child_relid != INVALID_RELID)
@@ -533,6 +554,9 @@ static void vmbus_add_channel_work(struct work_struct *work)
struct vmbus_channel *primary_channel = newchannel->primary_channel;
int ret;
+ if (!newchannel->irq)
+ goto err_deq_chan;
+
/*
* This state is used to indicate a successful open
* so that when we do close the channel normally, we
@@ -1144,7 +1168,7 @@ static void vmbus_onoffer(struct vmbus_channel_message_header *hdr)
vmbus_setup_channel_state(oldchannel, offer);
}
- /* Add the channel back to the array of channels. */
+ /* Re-establish the channel's IRQ mapping using the new relid */
vmbus_channel_map_relid(oldchannel);
check_ready_for_resume_event();
@@ -1225,7 +1249,7 @@ static void vmbus_onoffer_rescind(struct vmbus_channel_message_header *hdr)
}
mutex_lock(&vmbus_connection.channel_mutex);
- channel = relid2channel(rescind->child_relid);
+ channel = relid2channel(rescind->child_relid, NULL);
if (channel != NULL) {
/*
* Guarantee that no other instance of vmbus_onoffer_rescind()
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index cb01784e5c3b..a105eecdeec2 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -308,14 +308,6 @@ int vmbus_connect(void)
pr_info("Vmbus version:%d.%d\n",
version >> 16, version & 0xFFFF);
- vmbus_connection.channels = kcalloc(MAX_CHANNEL_RELIDS,
- sizeof(struct vmbus_channel *),
- GFP_KERNEL);
- if (vmbus_connection.channels == NULL) {
- ret = -ENOMEM;
- goto cleanup;
- }
-
kfree(msginfo);
return 0;
@@ -373,15 +365,16 @@ void vmbus_disconnect(void)
* relid2channel - Get the channel object given its
* child relative id (ie channel id)
*/
-struct vmbus_channel *relid2channel(u32 relid)
+struct vmbus_channel *relid2channel(u32 relid, struct irq_desc **desc_ptr)
{
- if (vmbus_connection.channels == NULL) {
- pr_warn_once("relid2channel: relid=%d: No channels mapped!\n", relid);
- return NULL;
- }
- if (WARN_ON(relid >= MAX_CHANNEL_RELIDS))
+ struct irq_desc *desc;
+
+ desc = irq_resolve_mapping(vmbus_connection.vmbus_irq_domain, relid);
+ if (!desc)
return NULL;
- return READ_ONCE(vmbus_connection.channels[relid]);
+ if (desc_ptr)
+ *desc_ptr = desc;
+ return irq_desc_get_handler_data(desc);
}
/*
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 95d4d47d34f7..bf35bb40c55e 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -259,9 +259,6 @@ struct vmbus_connection {
struct list_head chn_list;
struct mutex channel_mutex;
- /* Array of channels */
- struct vmbus_channel **channels;
-
/* IRQ domain data */
struct fwnode_handle *vmbus_fwnode;
struct irq_domain *vmbus_irq_domain;
@@ -364,7 +361,7 @@ void vmbus_remove_channel_attr_group(struct vmbus_channel *channel);
void vmbus_channel_map_relid(struct vmbus_channel *channel);
void vmbus_channel_unmap_relid(struct vmbus_channel *channel);
-struct vmbus_channel *relid2channel(u32 relid);
+struct vmbus_channel *relid2channel(u32 relid, struct irq_desc **desc);
void vmbus_free_channels(void);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index cbccdfad49a2..8fd03d41e71a 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1219,6 +1219,7 @@ static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu)
for_each_set_bit(relid, recv_int_page, maxbits) {
void (*callback_fn)(void *context);
struct vmbus_channel *channel;
+ struct irq_desc *desc;
if (!sync_test_and_clear_bit(relid, recv_int_page))
continue;
@@ -1236,7 +1237,7 @@ static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu)
rcu_read_lock();
/* Find channel based on relid */
- channel = relid2channel(relid);
+ channel = relid2channel(relid, &desc);
if (channel == NULL)
goto sched_unlock_rcu;
@@ -2466,10 +2467,10 @@ static int vmbus_bus_suspend(struct device *dev)
list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
/*
- * Remove the channel from the array of channels and invalidate
+ * Remove the channel from the IRQ map and invalidate
* the channel's relid. Upon resume, vmbus_onoffer() will fix
* up the relid (and other fields, if necessary) and add the
- * channel back to the array.
+ * channel back to the IRQ map.
*/
vmbus_channel_unmap_relid(channel);
channel->offermsg.child_relid = INVALID_RELID;
@@ -2748,7 +2749,6 @@ static void __exit vmbus_exit(void)
vmbus_free_channels();
irq_domain_remove(vmbus_connection.vmbus_irq_domain);
irq_domain_free_fwnode(vmbus_connection.vmbus_fwnode);
- kfree(vmbus_connection.channels);
/*
* The vmbus panic notifier is always registered, hence we should
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 897d96fa19d4..d545b1b635e5 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1072,6 +1072,9 @@ struct vmbus_channel {
/* The max size of a packet on this channel */
u32 max_pkt_size;
+ /* The Linux IRQ for the channel in the "hv-vmbus" IRQ domain */
+ u32 irq;
+ bool irq_requested;
char irq_name[VMBUS_CHAN_IRQ_NAME_MAX];
};
--
2.25.1
^ permalink raw reply related
* [RFC 07/12] Drivers: hv: vmbus: Set up irqdomain and irqchip for the VMBus connection
From: mhkelley58 @ 2024-06-04 5:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, lpieralisi, kw, robh, bhelgaas, James.Bottomley,
martin.petersen, arnd, linux-hyperv, linux-kernel, linux-pci,
linux-scsi, linux-arch
Cc: maz, den, jgowans, dawei.li
In-Reply-To: <20240604050940.859909-1-mhklinux@outlook.com>
From: Michael Kelley <mhklinux@outlook.com>
In preparation for assigning Linux IRQs to VMBus channels, set up an
irqdomain and irqchip for the VMBus connection. The irqdomain is linear,
with the VMBus relid used as the "hwirq" value. A relid is a unique
index assigned by Hyper-V to each VMBus channel, with values ranging
from 1 to 2047. Because these hwirqs don't map to anything in the
architectural hardware, the domain is not part of the domain hierarchy.
VMBus channel interrupts provide minimal management functionality, so
provide only a minimal set of irqchip functions. The set_affinity function
is a place-holder that is populated in a subsequent patch.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
drivers/hv/connection.c | 24 +++++++++-----
drivers/hv/hyperv_vmbus.h | 9 +++++
drivers/hv/vmbus_drv.c | 60 +++++++++++++++++++++++++++++++++-
include/asm-generic/mshyperv.h | 6 ++++
4 files changed, 90 insertions(+), 9 deletions(-)
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index f001ae880e1d..cb01784e5c3b 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -21,21 +21,29 @@
#include <linux/export.h>
#include <linux/io.h>
#include <linux/set_memory.h>
+#include <linux/irq.h>
+#include <linux/irqdesc.h>
+#include <linux/irqdomain.h>
#include <asm/mshyperv.h>
#include "hyperv_vmbus.h"
struct vmbus_connection vmbus_connection = {
- .conn_state = DISCONNECTED,
- .unload_event = COMPLETION_INITIALIZER(
- vmbus_connection.unload_event),
- .next_gpadl_handle = ATOMIC_INIT(0xE1E10),
-
- .ready_for_suspend_event = COMPLETION_INITIALIZER(
- vmbus_connection.ready_for_suspend_event),
+ .conn_state = DISCONNECTED,
+ .unload_event = COMPLETION_INITIALIZER(
+ vmbus_connection.unload_event),
+ .next_gpadl_handle = ATOMIC_INIT(0xE1E10),
+
+ .vmbus_irq_chip.name = "VMBus",
+ .vmbus_irq_chip.irq_set_affinity = vmbus_irq_set_affinity,
+ .vmbus_irq_chip.irq_mask = vmbus_irq_mask,
+ .vmbus_irq_chip.irq_unmask = vmbus_irq_unmask,
+
+ .ready_for_suspend_event = COMPLETION_INITIALIZER(
+ vmbus_connection.ready_for_suspend_event),
.ready_for_resume_event = COMPLETION_INITIALIZER(
- vmbus_connection.ready_for_resume_event),
+ vmbus_connection.ready_for_resume_event),
};
EXPORT_SYMBOL_GPL(vmbus_connection);
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 76ac5185a01a..95d4d47d34f7 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -18,7 +18,11 @@
#include <asm/hyperv-tlfs.h>
#include <linux/atomic.h>
#include <linux/hyperv.h>
+#include <linux/fwnode.h>
#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/irqdesc.h>
+#include <linux/irqdomain.h>
#include "hv_trace.h"
@@ -258,6 +262,11 @@ struct vmbus_connection {
/* Array of channels */
struct vmbus_channel **channels;
+ /* IRQ domain data */
+ struct fwnode_handle *vmbus_fwnode;
+ struct irq_domain *vmbus_irq_domain;
+ struct irq_chip vmbus_irq_chip;
+
/*
* An offer message is handled first on the work_queue, and then
* is further handled on handle_primary_chan_wq or
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 291a8358370b..cbccdfad49a2 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -36,6 +36,9 @@
#include <linux/syscore_ops.h>
#include <linux/dma-map-ops.h>
#include <linux/pci.h>
+#include <linux/irq.h>
+#include <linux/irqdomain.h>
+#include <linux/hardirq.h>
#include <clocksource/hyperv_timer.h>
#include <asm/mshyperv.h>
#include "hyperv_vmbus.h"
@@ -1306,6 +1309,40 @@ static irqreturn_t vmbus_percpu_isr(int irq, void *dev_id)
return IRQ_HANDLED;
}
+int vmbus_irq_set_affinity(struct irq_data *data,
+ const struct cpumask *dest, bool force)
+{
+ return 0;
+}
+
+/*
+ * VMBus channel interrupts do not need to be masked or unmasked, and the
+ * Hyper-V synic doesn't provide any masking functionality anyway. But in the
+ * absence of these irqchip functions, the IRQ subsystem keeps the IRQ marked
+ * as "masked". To prevent any problems associated with staying the "masked"
+ * state, and so that IRQ status shown in debugfs doesn't indicate "masked",
+ * provide null implementations.
+ */
+void vmbus_irq_unmask(struct irq_data *data)
+{
+}
+
+void vmbus_irq_mask(struct irq_data *data)
+{
+}
+
+static int vmbus_irq_map(struct irq_domain *d, unsigned int irq,
+ irq_hw_number_t hw)
+{
+ irq_set_chip_and_handler(irq,
+ &vmbus_connection.vmbus_irq_chip, handle_simple_irq);
+ return 0;
+}
+
+static const struct irq_domain_ops vmbus_domain_ops = {
+ .map = vmbus_irq_map,
+};
+
/*
* vmbus_bus_init -Main vmbus driver initialization routine.
*
@@ -1340,6 +1377,7 @@ static int vmbus_bus_init(void)
if (vmbus_irq == -1) {
hv_setup_vmbus_handler(vmbus_isr);
} else {
+ irq_set_handler(vmbus_irq, handle_percpu_demux_irq);
vmbus_evt = alloc_percpu(long);
ret = request_percpu_irq(vmbus_irq, vmbus_percpu_isr,
"Hyper-V VMbus", vmbus_evt);
@@ -1355,6 +1393,20 @@ static int vmbus_bus_init(void)
if (ret)
goto err_alloc;
+ /* Create IRQ domain for VMBus devices */
+ vmbus_connection.vmbus_fwnode = irq_domain_alloc_named_fwnode("hv-vmbus");
+ if (!vmbus_connection.vmbus_fwnode) {
+ ret = -ENOMEM;
+ goto err_alloc;
+ }
+ vmbus_connection.vmbus_irq_domain = irq_domain_create_linear(
+ vmbus_connection.vmbus_fwnode, MAX_CHANNEL_RELIDS,
+ &vmbus_domain_ops, NULL);
+ if (!vmbus_connection.vmbus_irq_domain) {
+ ret = -ENOMEM;
+ goto err_fwnode;
+ }
+
/*
* Initialize the per-cpu interrupt state and stimer state.
* Then connect to the host.
@@ -1362,7 +1414,7 @@ static int vmbus_bus_init(void)
ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "hyperv/vmbus:online",
hv_synic_init, hv_synic_cleanup);
if (ret < 0)
- goto err_alloc;
+ goto err_domain;
hyperv_cpuhp_online = ret;
ret = vmbus_connect();
@@ -1382,6 +1434,10 @@ static int vmbus_bus_init(void)
err_connect:
cpuhp_remove_state(hyperv_cpuhp_online);
+err_domain:
+ irq_domain_remove(vmbus_connection.vmbus_irq_domain);
+err_fwnode:
+ irq_domain_free_fwnode(vmbus_connection.vmbus_fwnode);
err_alloc:
hv_synic_free();
if (vmbus_irq == -1) {
@@ -2690,6 +2746,8 @@ static void __exit vmbus_exit(void)
hv_debug_rm_all_dir();
vmbus_free_channels();
+ irq_domain_remove(vmbus_connection.vmbus_irq_domain);
+ irq_domain_free_fwnode(vmbus_connection.vmbus_fwnode);
kfree(vmbus_connection.channels);
/*
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index 8fe7aaab2599..0488ff8b511f 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -24,6 +24,7 @@
#include <acpi/acpi_numa.h>
#include <linux/cpumask.h>
#include <linux/nmi.h>
+#include <linux/irq.h>
#include <asm/ptrace.h>
#include <asm/hyperv-tlfs.h>
@@ -187,6 +188,11 @@ void hv_remove_kexec_handler(void);
void hv_setup_crash_handler(void (*handler)(struct pt_regs *regs));
void hv_remove_crash_handler(void);
+extern void vmbus_irq_mask(struct irq_data *data);
+extern void vmbus_irq_unmask(struct irq_data *data);
+extern int vmbus_irq_set_affinity(struct irq_data *data,
+ const struct cpumask *dest, bool force);
+
extern int vmbus_interrupt;
extern int vmbus_irq;
--
2.25.1
^ permalink raw reply related
* [RFC 06/12] genirq: Add per-cpu flow handler with conditional IRQ stats
From: mhkelley58 @ 2024-06-04 5:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, lpieralisi, kw, robh, bhelgaas, James.Bottomley,
martin.petersen, arnd, linux-hyperv, linux-kernel, linux-pci,
linux-scsi, linux-arch
Cc: maz, den, jgowans, dawei.li
In-Reply-To: <20240604050940.859909-1-mhklinux@outlook.com>
From: Michael Kelley <mhklinux@outlook.com>
Hyper-V VMBus devices generate interrupts that are multiplexed
onto a single per-CPU architectural interrupt. The top-level VMBus
driver ISR demultiplexes these interrupts and invokes per-device
handlers. Currently, these per-device handlers are not modeled as
Linux IRQs, so /proc/interrupts shows all VMBus interrupts as accounted
to the top level architectural interrupt. Visibility into per-device
interrupt stats requires accessing VMBus-specific entries in sysfs.
The top-level VMBus driver ISR also handles management-related
interrupts that are not attributable to a particular VMBus device.
As part of changing VMBus to model VMBus per-device handlers as
normal Linux IRQs, the top-level VMBus driver needs to conditionally
account for interrupts. If it passes the interrupt off to a
device-specific IRQ, the interrupt stats are done by that IRQ
handler, and accounting for the interrupt at the top level
is duplicative. But if it handles a management-related interrupt
itself, then it should account for the interrupt itself.
Introduce a new flow handler that provides this functionality.
The new handler parallels handle_percpu_irq(), but does stats
only if the ISR returns other than IRQ_NONE. The existing
handle_untracked_irq() can't be used because it doesn't work for
per-cpu IRQs, and it doesn't provide conditional stats.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
include/linux/irq.h | 1 +
kernel/irq/chip.c | 29 +++++++++++++++++++++++++++++
2 files changed, 30 insertions(+)
diff --git a/include/linux/irq.h b/include/linux/irq.h
index a217e1029c1d..8825b95cefe5 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -656,6 +656,7 @@ extern void handle_edge_eoi_irq(struct irq_desc *desc);
extern void handle_simple_irq(struct irq_desc *desc);
extern void handle_untracked_irq(struct irq_desc *desc);
extern void handle_percpu_irq(struct irq_desc *desc);
+extern void handle_percpu_demux_irq(struct irq_desc *desc);
extern void handle_percpu_devid_irq(struct irq_desc *desc);
extern void handle_bad_irq(struct irq_desc *desc);
extern void handle_nested_irq(unsigned int irq);
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index dc94e0bf2c94..1f37a9db4a4d 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -910,6 +910,35 @@ void handle_percpu_irq(struct irq_desc *desc)
chip->irq_eoi(&desc->irq_data);
}
+/**
+ * handle_percpu_demux_irq - Per CPU local irq handler for IRQs
+ * that may demultiplex to other IRQs
+ * @desc: the interrupt description structure for this irq
+ *
+ * For per CPU interrupts on SMP machines without locking requirements.
+ * Used for IRQs that may demultiplex to other IRQs that will do the
+ * stats tracking and randomness. If the handler result indicates no
+ * demultiplexing is done, account for the interrupt on this IRQ.
+ */
+void handle_percpu_demux_irq(struct irq_desc *desc)
+{
+ struct irq_chip *chip = irq_desc_get_chip(desc);
+
+ if (chip->irq_ack)
+ chip->irq_ack(&desc->irq_data);
+
+ if (__handle_irq_event_percpu(desc))
+ /*
+ * PER CPU interrupts are not serialized. Do not touch
+ * desc->tot_count.
+ */
+ __kstat_incr_irqs_this_cpu(desc);
+
+ if (chip->irq_eoi)
+ chip->irq_eoi(&desc->irq_data);
+}
+EXPORT_SYMBOL_GPL(handle_percpu_demux_irq);
+
/**
* handle_percpu_devid_irq - Per CPU local irq handler with per cpu dev ids
* @desc: the interrupt description structure for this irq
--
2.25.1
^ permalink raw reply related
* [RFC 05/12] scsi: storvsc: Annotate the VMBus channel IRQ name
From: mhkelley58 @ 2024-06-04 5:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, lpieralisi, kw, robh, bhelgaas, James.Bottomley,
martin.petersen, arnd, linux-hyperv, linux-kernel, linux-pci,
linux-scsi, linux-arch
Cc: maz, den, jgowans, dawei.li
In-Reply-To: <20240604050940.859909-1-mhklinux@outlook.com>
From: Michael Kelley <mhklinux@outlook.com>
In preparation for assigning Linux IRQs to VMBus channels, annotate the
IRQ name in the VMBus channels used by the virtual IDE, SCSI, and Fibre
Channel controllers in a Hyper-V guest. Distinguish the primary channel
and sub-channels with "pri" and "sub" prefixes. Identify controllers
with a numeric suffix that matches the numeric part of the "host<n>"
name displayed by "lsscsi" and SCSI-related entries in sysfs.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
drivers/scsi/storvsc_drv.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 7ceb982040a5..11b3fc3b24c9 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -686,6 +686,8 @@ static void handle_sc_creation(struct vmbus_channel *new_sc)
new_sc->max_pkt_size = STORVSC_MAX_PKT_SIZE;
new_sc->next_request_id_callback = storvsc_next_request_id;
+ snprintf(new_sc->irq_name, VMBUS_CHAN_IRQ_NAME_MAX, "sub@storvsc%d",
+ stor_device->host->host_no);
ret = vmbus_open(new_sc,
aligned_ringbuffer_size,
@@ -1322,6 +1324,7 @@ static int storvsc_connect_to_vsp(struct hv_device *device, u32 ring_size,
bool is_fc)
{
struct vmstorage_channel_properties props;
+ struct storvsc_device *stor_device;
int ret;
memset(&props, 0, sizeof(struct vmstorage_channel_properties));
@@ -1329,6 +1332,12 @@ static int storvsc_connect_to_vsp(struct hv_device *device, u32 ring_size,
device->channel->max_pkt_size = STORVSC_MAX_PKT_SIZE;
device->channel->next_request_id_callback = storvsc_next_request_id;
+ stor_device = get_out_stor_device(device);
+ if (!stor_device)
+ return -EINVAL;
+ snprintf(device->channel->irq_name, VMBUS_CHAN_IRQ_NAME_MAX, "pri@storvsc%d",
+ stor_device->host->host_no);
+
ret = vmbus_open(device->channel,
ring_size,
ring_size,
--
2.25.1
^ permalink raw reply related
* [RFC 04/12] PCI: hv: Annotate the VMBus channel IRQ name
From: mhkelley58 @ 2024-06-04 5:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, lpieralisi, kw, robh, bhelgaas, James.Bottomley,
martin.petersen, arnd, linux-hyperv, linux-kernel, linux-pci,
linux-scsi, linux-arch
Cc: maz, den, jgowans, dawei.li
In-Reply-To: <20240604050940.859909-1-mhklinux@outlook.com>
From: Michael Kelley <mhklinux@outlook.com>
In preparation for assigning Linux IRQs to VMBus channels, annotate
the IRQ name in the single VMBus channel used for setup and teardown
of a virtual PCI device in a Hyper-V guest. The annotation adds the
16-bit PCI domain ID that the Hyper-V vPCI driver assigns to the
virtual PCI bus for the device.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
drivers/pci/controller/pci-hyperv.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 5992280e8110..4f70cddb61dc 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3705,6 +3705,9 @@ static int hv_pci_probe(struct hv_device *hdev,
hdev->channel->request_addr_callback = vmbus_request_addr;
hdev->channel->rqstor_size = HV_PCI_RQSTOR_SIZE;
+ snprintf(hdev->channel->irq_name, VMBUS_CHAN_IRQ_NAME_MAX,
+ "vpci:%04x", dom);
+
ret = vmbus_open(hdev->channel, pci_ring_size, pci_ring_size, NULL, 0,
hv_pci_onchannelcallback, hbus);
if (ret)
@@ -4018,6 +4021,8 @@ static int hv_pci_resume(struct hv_device *hdev)
hdev->channel->next_request_id_callback = vmbus_next_request_id;
hdev->channel->request_addr_callback = vmbus_request_addr;
hdev->channel->rqstor_size = HV_PCI_RQSTOR_SIZE;
+ snprintf(hdev->channel->irq_name, VMBUS_CHAN_IRQ_NAME_MAX,
+ "vpci:%04x", hbus->bridge->domain_nr);
ret = vmbus_open(hdev->channel, pci_ring_size, pci_ring_size, NULL, 0,
hv_pci_onchannelcallback, hbus);
--
2.25.1
^ permalink raw reply related
* [RFC 03/12] Drivers: hv: vmbus: Add an IRQ name to VMBus channels
From: mhkelley58 @ 2024-06-04 5:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, lpieralisi, kw, robh, bhelgaas, James.Bottomley,
martin.petersen, arnd, linux-hyperv, linux-kernel, linux-pci,
linux-scsi, linux-arch
Cc: maz, den, jgowans, dawei.li
In-Reply-To: <20240604050940.859909-1-mhklinux@outlook.com>
From: Michael Kelley <mhklinux@outlook.com>
In preparation for assigning Linux IRQs to VMBus channels, assign a
string name to each channel. The name is used when the IRQ is
requested, and it will appear in the output of /proc/interrupts to
make it easier to identify the device the IRQ is associated with.
Many VMBus devices are single-instance, with a single channel. For
such devices, a default string name can be determined based on the
GUID identifying the device. So add default names to the vmbus_devs
table that lists recognized devices. When a channel is created, set
the channel name to that default so a reasonable name is displayed
without requiring VMBus driver modifications.
However, individual VMBus device drivers may be optionally modified
to override the default name to provide additional information. In
cases where a driver supports multiple instances of a device, or a
device has multiple channels, the additional information may be
helpful to display in /proc/interrupts.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
drivers/hv/channel_mgmt.c | 45 +++++++++++++++++++++++++++++++--------
include/linux/hyperv.h | 5 +++++
2 files changed, 41 insertions(+), 9 deletions(-)
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index a216a0aede73..adbe184e5197 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -20,6 +20,7 @@
#include <linux/delay.h>
#include <linux/cpu.h>
#include <linux/hyperv.h>
+#include <linux/string.h>
#include <asm/mshyperv.h>
#include <linux/sched/isolation.h>
@@ -33,6 +34,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_IDE_GUID,
.perf_device = true,
.allowed_in_isolated = false,
+ .irq_name = "storvsc",
},
/* SCSI */
@@ -40,6 +42,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_SCSI_GUID,
.perf_device = true,
.allowed_in_isolated = true,
+ .irq_name = "storvsc",
},
/* Fibre Channel */
@@ -47,6 +50,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_SYNTHFC_GUID,
.perf_device = true,
.allowed_in_isolated = false,
+ .irq_name = "storvsc",
},
/* Synthetic NIC */
@@ -54,6 +58,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_NIC_GUID,
.perf_device = true,
.allowed_in_isolated = true,
+ .irq_name = "netvsc",
},
/* Network Direct */
@@ -61,6 +66,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_ND_GUID,
.perf_device = true,
.allowed_in_isolated = false,
+ .irq_name = "netdirect",
},
/* PCIE */
@@ -68,6 +74,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_PCIE_GUID,
.perf_device = false,
.allowed_in_isolated = true,
+ .irq_name = "vpci",
},
/* Synthetic Frame Buffer */
@@ -75,6 +82,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_SYNTHVID_GUID,
.perf_device = false,
.allowed_in_isolated = false,
+ .irq_name = "framebuffer",
},
/* Synthetic Keyboard */
@@ -82,6 +90,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_KBD_GUID,
.perf_device = false,
.allowed_in_isolated = false,
+ .irq_name = "keyboard",
},
/* Synthetic MOUSE */
@@ -89,6 +98,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_MOUSE_GUID,
.perf_device = false,
.allowed_in_isolated = false,
+ .irq_name = "mouse",
},
/* KVP */
@@ -96,6 +106,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_KVP_GUID,
.perf_device = false,
.allowed_in_isolated = false,
+ .irq_name = "kvp",
},
/* Time Synch */
@@ -103,6 +114,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_TS_GUID,
.perf_device = false,
.allowed_in_isolated = true,
+ .irq_name = "timesync",
},
/* Heartbeat */
@@ -110,6 +122,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_HEART_BEAT_GUID,
.perf_device = false,
.allowed_in_isolated = true,
+ .irq_name = "heartbeat",
},
/* Shutdown */
@@ -117,6 +130,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_SHUTDOWN_GUID,
.perf_device = false,
.allowed_in_isolated = true,
+ .irq_name = "shutdown",
},
/* File copy */
@@ -126,6 +140,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_FCOPY_GUID,
.perf_device = false,
.allowed_in_isolated = false,
+ .irq_name = "fcopy",
},
/* Backup */
@@ -133,6 +148,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_VSS_GUID,
.perf_device = false,
.allowed_in_isolated = false,
+ .irq_name = "vss",
},
/* Dynamic Memory */
@@ -140,6 +156,7 @@ const struct vmbus_device vmbus_devs[] = {
HV_DM_GUID,
.perf_device = false,
.allowed_in_isolated = false,
+ .irq_name = "balloon",
},
/*
@@ -198,20 +215,30 @@ static bool is_unsupported_vmbus_devs(const guid_t *guid)
return false;
}
-static u16 hv_get_dev_type(const struct vmbus_channel *channel)
+static void hv_set_dev_type_and_irq_name(struct vmbus_channel *channel)
{
const guid_t *guid = &channel->offermsg.offer.if_type;
+ char *name = NULL;
u16 i;
- if (is_hvsock_channel(channel))
- return HV_UNKNOWN;
-
- for (i = HV_IDE; i < HV_UNKNOWN; i++) {
- if (guid_equal(guid, &vmbus_devs[i].guid))
- return i;
+ if (is_hvsock_channel(channel)) {
+ i = HV_UNKNOWN;
+ name = "hv_sock";
+ goto found;
}
+
+ for (i = HV_IDE; i < HV_UNKNOWN; i++)
+ if (guid_equal(guid, &vmbus_devs[i].guid)) {
+ name = vmbus_devs[i].irq_name;
+ goto found;
+ }
+
pr_info("Unknown GUID: %pUl\n", guid);
- return i;
+
+found:
+ channel->device_id = i;
+ if (name)
+ strscpy(channel->irq_name, name, VMBUS_CHAN_IRQ_NAME_MAX);
}
/**
@@ -970,7 +997,7 @@ static void vmbus_setup_channel_state(struct vmbus_channel *channel,
sizeof(struct vmbus_channel_offer_channel));
channel->monitor_grp = (u8)offer->monitorid / 32;
channel->monitor_bit = (u8)offer->monitorid % 32;
- channel->device_id = hv_get_dev_type(channel);
+ hv_set_dev_type_and_irq_name(channel);
}
/*
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index d52c916cc492..897d96fa19d4 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -826,6 +826,7 @@ struct vmbus_device {
guid_t guid;
bool perf_device;
bool allowed_in_isolated;
+ char *irq_name;
};
#define VMBUS_DEFAULT_MAX_PKT_SIZE 4096
@@ -837,6 +838,8 @@ struct vmbus_gpadl {
bool decrypted;
};
+#define VMBUS_CHAN_IRQ_NAME_MAX 32
+
struct vmbus_channel {
struct list_head listentry;
@@ -1068,6 +1071,8 @@ struct vmbus_channel {
/* The max size of a packet on this channel */
u32 max_pkt_size;
+
+ char irq_name[VMBUS_CHAN_IRQ_NAME_MAX];
};
#define lock_requestor(channel, flags) \
--
2.25.1
^ permalink raw reply related
* [RFC 02/12] Drivers: hv: vmbus: Fix error path that deletes non-existent sysfs group
From: mhkelley58 @ 2024-06-04 5:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, lpieralisi, kw, robh, bhelgaas, James.Bottomley,
martin.petersen, arnd, linux-hyperv, linux-kernel, linux-pci,
linux-scsi, linux-arch
Cc: maz, den, jgowans, dawei.li
In-Reply-To: <20240604050940.859909-1-mhklinux@outlook.com>
From: Michael Kelley <mhklinux@outlook.com>
If vmbus_device_create() returns an error to vmbus_add_channel_work(),
the cleanup path calls free_channel(), which in turn calls
vmbus_remove_channel_attr_group(). But the channel attr group hasn't
been created yet, causing sysfs_remove_group() to generate multiple
WARNs about non-existent entries.
Fix the WARNs by adding a flag to struct vmbus_channel to indicate
whether the sysfs group for the channel has been created. Use the
flag to determine if the sysfs group should be removed.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
drivers/hv/vmbus_drv.c | 5 ++++-
include/linux/hyperv.h | 1 +
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 12a707ab73f8..291a8358370b 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1842,6 +1842,7 @@ int vmbus_add_channel_kobj(struct hv_device *dev, struct vmbus_channel *channel)
dev_err(device, "Unable to set up channel sysfs files\n");
return ret;
}
+ channel->channel_attr_set = true;
kobject_uevent(kobj, KOBJ_ADD);
@@ -1853,7 +1854,9 @@ int vmbus_add_channel_kobj(struct hv_device *dev, struct vmbus_channel *channel)
*/
void vmbus_remove_channel_attr_group(struct vmbus_channel *channel)
{
- sysfs_remove_group(&channel->kobj, &vmbus_chan_group);
+ if (channel->channel_attr_set)
+ sysfs_remove_group(&channel->kobj, &vmbus_chan_group);
+ channel->channel_attr_set = false;
}
/*
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 5e39baa7f6cb..d52c916cc492 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -980,6 +980,7 @@ struct vmbus_channel {
* For sysfs per-channel properties.
*/
struct kobject kobj;
+ bool channel_attr_set;
/*
* For performance critical channels (storage, networking
--
2.25.1
^ permalink raw reply related
* [RFC 01/12] Drivers: hv: vmbus: Drop unsupported VMBus devices earlier
From: mhkelley58 @ 2024-06-04 5:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, lpieralisi, kw, robh, bhelgaas, James.Bottomley,
martin.petersen, arnd, linux-hyperv, linux-kernel, linux-pci,
linux-scsi, linux-arch
Cc: maz, den, jgowans, dawei.li
In-Reply-To: <20240604050940.859909-1-mhklinux@outlook.com>
From: Michael Kelley <mhklinux@outlook.com>
Because Hyper-V doesn't know ahead-of-time what operating system will be
running in a guest VM, it may offer to Linux guests VMBus devices that are
intended for use only in Windows guests. Currently in Linux, VMBus
channels are created for these devices, and they are processed by
vmbus_device_register(). While nothing further happens because no matching
driver is found, the channel continues to exist.
To avoid having the spurious channel, drop such devices immediately in
vmbus_onoffer(), based on identifying the device in the
vmbus_unsupported_devs table. If Hyper-V should issue a rescind request
for the device, no matching channel will be found and the rescind will
also be ignored.
Since unsupported devices are dropped early, the check for unsupported
devices in hv_get_dev_type() is redundant. Remove it.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
drivers/hv/channel_mgmt.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 3c6011a48dab..a216a0aede73 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -203,7 +203,7 @@ static u16 hv_get_dev_type(const struct vmbus_channel *channel)
const guid_t *guid = &channel->offermsg.offer.if_type;
u16 i;
- if (is_hvsock_channel(channel) || is_unsupported_vmbus_devs(guid))
+ if (is_hvsock_channel(channel))
return HV_UNKNOWN;
for (i = HV_IDE; i < HV_UNKNOWN; i++) {
@@ -1036,6 +1036,16 @@ static void vmbus_onoffer(struct vmbus_channel_message_header *hdr)
trace_vmbus_onoffer(offer);
+ /*
+ * Hyper-V may offer pseudo-devices with functionality intended for
+ * Windows guests. Silently ignore them. There's no value in
+ * cluttering dmesg with error messages.
+ */
+ if (is_unsupported_vmbus_devs(&offer->offer.if_type)) {
+ atomic_dec(&vmbus_connection.offer_in_progress);
+ return;
+ }
+
if (!vmbus_is_valid_offer(offer)) {
pr_err_ratelimited("Invalid offer %d from the host supporting isolation\n",
offer->child_relid);
--
2.25.1
^ permalink raw reply related
* [RFC 00/12] Hyper-V guests use Linux IRQs for channel interrupts
From: mhkelley58 @ 2024-06-04 5:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, lpieralisi, kw, robh, bhelgaas, James.Bottomley,
martin.petersen, arnd, linux-hyperv, linux-kernel, linux-pci,
linux-scsi, linux-arch
Cc: maz, den, jgowans, dawei.li
From: Michael Kelley <mhklinux@outlook.com>
This patch set makes significant changes in how Linux guests running on Hyper-V
handle interrupts from VMBus devices. It's "RFC" because of the level of change,
and because I'm interested in macro-level feedback on the whole idea. The usual
detailed-level code review feedback is also welcome.
Background
----------
Linux guests running on Hyper-V are presented with a range of synthetic
devices, including storage, networking, mouse, keyboard, etc. Guests are also
presented with management-related pseudo-devices, including time
synchronization, heartbeat, and host-initiated shutdown. These devices use the
VMBus connection between the host and the guest, and each device has one or
more VMBus channels for passing messages between the host and guest.
Separately, a control path exists for creating and tearing down VMBus channels
when needed. Hyper-V VMBus and its synthetic devices play a role similar to
virtio and virtio devices.
The host interrupts the guest when it sends a new message to the guest in an
otherwise empty channel, and when it sends a new control message to the guest.
All interrupts are multiplexed onto a single per-CPU architectural interrupt as
defined by the processor architecture. On arm64, this architectural interrupt
is a PPI, and is represented in Linux as a Linux per-CPU IRQ. The x86/x64
architecture does not provide per-CPU interrupts, so Linux reserves a fixed x86
interrupt vector (HYPERVISOR_CALLBACK_VECTOR) on all CPUs, with a hard-coded
interrupt handler. No Linux IRQ is assigned on x86/x64.
The interrupt handler for the architectural interrupt is the VMBus interrupt
handler, and it runs whenever the host generates an interrupt for a channel or
a control message. The VMBus interrupt handler consults data structures
provided by the per-CPU Hyper-V synthetic interrupt controller (synic) to
determine which channel or channels have a pending interrupt, or if there is a
pending control message. When doing this demultiplexing, the VMBus interrupt
handler directly invokes the per-device handlers, and processes any control
messages.
In this scheme, the individual VMBus devices and their channels are not
modelled as Linux IRQs. /proc/interrupts shows counts for the top-level
architectural interrupt, but not for the individual VMBus devices. The VMBus
driver has implemented separate reporting of per-device data under /sys.
Similarly, /proc/irq/<nn> does not show affinity information for individual
VMBus devices, and the affinity cannot be changed via /proc/irq. Again, a
separate interface under /sys is provided for changing the vCPU that a VMBus
channel should interrupt.
What's New
----------
Linux kernel community members have commented that it would be better for the
Hyper-V synic and the handling of VMBus channel interrupts to be modelled as
Linux IRQs. Then they would participate in existing IRQ handling mechanisms,
and not need something separate under /sys. This patch set provides such an
implementation.
With this patch set, the output of /proc/interrupts looks like this in a Linux
guest on a Hyper-V x86/x64 Generation 2 VM with 8 vCPUs (the columns for
CPUs 2 thru 5 are omitted due to text width considerations):
CPU0 CPU1 CPU6 CPU7
8: 0 0 0 0 IO-APIC 8-edge rtc0
9: 2 0 0 0 IO-APIC 9-fasteoi acpi
24: 16841 0 0 0 VMBus 7 heartbeat
25: 1 0 0 0 VMBus 9 shutdown
26: 6752 0 0 0 VMBus 10 timesync
27: 1 0 0 0 VMBus 11 vss
28: 22095 0 0 0 VMBus 14 pri@storvsc1
29: 0 85 0 0 VMBus 15 pri@storvsc0
30: 3 0 0 0 VMBus 1 balloon
31: 106 0 0 0 VMBus 3 mouse
32: 73 0 0 0 VMBus 4 keyboard
33: 6 0 0 0 VMBus 5 framebuffer
34: 0 0 0 0 VMBus 13 netvsc
35: 1 0 0 0 VMBus 8 kvp
36: 0 0 0 0 VMBus 16 netvsc
37: 0 0 0 0 VMBus 17 netvsc
38: 0 0 0 0 VMBus 18 netvsc
39: 0 0 2375 0 VMBus 19 netvsc
40: 0 0 0 1178 VMBus 20 netvsc
41: 1003 0 0 0 VMBus 21 netvsc
42: 0 4337 0 0 VMBus 22 netvsc
43: 0 0 0 0 VMBus 23 sub@storvsc1
44: 0 0 0 0 VMBus 24 sub@storvsc0
NMI: 0 0 0 0 Non-maskable interrupts
LOC: 0 0 0 0 Local timer interrupts
SPU: 0 0 0 0 Spurious interrupts
PMI: 0 0 0 0 Performance monitoring interrupts
IWI: 1 0 0 2 IRQ work interrupts
RTR: 0 0 0 0 APIC ICR read retries
RES: 411 233 349 318 Rescheduling interrupts
CAL: 135236 29211 99526 36424 Function call interrupts
TLB: 0 0 0 0 TLB shootdowns
TRM: 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 0 Machine check exceptions
MCP: 109 109 109 109 Machine check polls
HYP: 71 0 0 0 Hypervisor callback interrupts
HRE: 0 0 0 0 Hyper-V reenlightenment interrupts
HVS: 183391 109695 181539 339239 Hyper-V stimer0 interrupts
ERR: 0
MIS: 0
PIN: 0 0 0 0 Posted-interrupt notification event
NPI: 0 0 0 0 Nested posted-interrupt event
PIW: 0 0 0 0 Posted-interrupt wakeup event
IRQs 24 thru 44 are the VMBus channel IRQs that didn't previously exist. Some
devices, such as storvsc and netvsc, have multiple channels, each of which has
a separate IRQ. The HYP line now shows counts only for control messages instead
of for all VMBus interrupts. The HVS line continues to show Hyper-V synthetic
timer (stimer0) interrupts. (In older versions of Hyper-V where stimer0
interrupts are delivered as control messages, those interrupts are accounted
for on the HYP line. Since this is legacy behavior, it seemed unnecessary to
create a separate Linux IRQ to model these messages.)
The basic approach is to update the VMBus channel create/open/close/delete
infrastructure, and the VMBus interrupt handler, to create and process the
additional IRQs. The goal is that individual VMBus drivers require no code
changes to work with the new IRQs. However, a driver may optionally annotate
the IRQ using knowledge that only the driver has. For example, in the above
/proc/interrupts output, the storvsc driver for synthetic disks has been
updated to distinguish the primary channel from the subchannels, and to
distinguish controller 0 from controller 1. Since storvsc models a SCSI
controller, the numeric designations are set to match the "host0" and "host1"
SCSI controllers that could be seen with "lsscsi -H -v". Similarly annotating
the "netvsc" entries is a "to do" item.
In furtherance of the "no changes to existing VMBus drivers" goal, all VMBus
IRQs use the same interrupt handler. Each channel has a callback function
associated with it, and the interrupt handler invokes this callback in the same
way as before VMBus IRQs were introduced.
VMBus code creates a standalone linear IRQ domain for the VMBus IRQs. The hwirq
#'s are the VMBus channel relid's, which are integers in the range 1 to 2047.
Each relid uniquely identifies a VMBus channel, and a channel interrupt from
the host provides the relid. A mapping from IRQ to hwirq (relid) is created
when a new channel is created, and the mapping is deleted when the channel is
deleted, so adding and removing VMBus devices in a running VM is fully
functional.
Getting the IRQ counting correct requires a new IRQ flow handler in
/kernel/irq/chip.c. See Patch 6. The existing flow handlers don't handle this
case, and creating a custom flow handler outside of chip.c isn't feasible
because the needed components aren't available to code outside of /kernel/irq.
When a guest CPU is taken offline, Linux automatically changes the affinity of
IRQs assigned to that CPU so that the IRQ is handled on another CPU that is
still online. Unfortunately, VMBus channel IRQs are not able to take advantage
of that mechanism. At the point the mechanism runs, it isn't possible to know
when Hyper-V started interrupting the new CPU; hence pending interrupts may be
stranded on the old offline CPU. Existing VMBus code in Linux prevents a CPU
from going offline when a VMBus channel interrupt is affined to it, and that
code must be retained. Of course, VMBus channel IRQs can manually be affined to
a different CPU, which could then allow a CPU to be taken offline. I have an
idea for a somewhat convoluted way to allow the automatic Linux mechanism to
work, but that will be another patch set.
Having VMBus channel interrupts appear in /proc/irq means that user-space
irqbalance can change the affinity of the interrupts, where previously it could
not. For example, this gives irqbalance the ability to change (and perhaps
degrade) the default affinity configuration that was designed to maximize
network throughput. This is yet another reason to disable irqbalance in VM
images.
The custom VMBus entries under /sys are retained for backwards compatibility,
but they are integrated. For example, changing the affinity via /proc/irq is
reflected in /sys/bus/vmbus/devices/<guid>/channels/<nn>/cpu, and vice versa.
Patches
-------
The patches have been tested on x86/x64 Generation 1 and Generation 2 VMs, and
on arm64 VMs.
* Patches 1 and 2 fixes some error handling that is needed by subsequent
patches. They could be applied independent of this patch set.
* Patches 3,4, and 5 add support for IRQ names, and add annotations in the
storvsc and the Hyper-V vPCI driver so that descriptions in /proc/interrupts
convey useful information.
* Patch 6 adds a new IRQ flow handler needed to properly account for IRQs as
VMBus IRQs or as the top-level architectural interrupt.
* Patch 7 creates the new IRQ domain for VMBus channel IRQs.
* Patch 8 allocates the VMBus channel IRQs, and creates the mapping between
VMBus relid (used as the hwirq) and the Linux IRQ number. That mapping is then
used for lookups.
* Patch 9 does request_irq() for the new VMBus channel IRQs, and converts the
top-level VMBus architectural interrupt handler to use the new VMBus channel
IRQs for handling channel interrupts. With this patch, /proc/interrupts shows
counts for the VMBus channel IRQs.
* Patch 10 implements the irq_set_affinity() function for VMBus channel IRQs,
and integrates it with the existing Hyper-V-specific sysfs mechanism for
changing the CPU that a VMBus channel will interrupt.
* Patch 11 updates hv_synic_cleanup() during CPU offlining to handle the lack
of waiting for the MODIFYCHANNEL message in vmbus_irq_set_affinity() in Patch
10.
* Patch 12 fixes a race in VMBus channel interrupt assignments that existed
prior to this patch set when taking a CPU offline.
Open Topics
-----------
* The performance impact of the additional IRQ processing in the interrupt path
appears to be minimal, based on looking at the code. Measurements are still to
be taken to confirm this.
* Does the netvsc driver have sufficient information to annotate what appears
in /proc/interrupts, particularly when multiple synthetic NICs are present? At
first glance, it appears that the identifying info isn't generated until after
vmbus_open() is called, and by then it's too late to do the IRQ annotation.
Need some input from a netvsc expert.
* The VMBus irq domain and irq chip data structures are placed in the
vmbus_connection structure. Starting with VMBus protocol version 5.0, the VMBus
host side supports multiple connections from a single guest instance (see
commit ae20b254306a6). While there's no upstream kernel code that creates
multiple VMBus connections, it's unclear whether the IRQ domain should be
global across all VMBus connection, or per connection. Are relid's global (to
the VM) or per connection?
* I have not tried or tested any hv_sock channels. Is there a handy test
program that would be convenient for opening multiple hv_sock connections to
the guest, have them persist long enough to see what /proc/interrupts shows,
and try changing the IRQ affinity?
* I have not tried or tested Linux guest hibernation paths.
The patches build and run against 6.10-rc2 and next-20240603.
Michael Kelley (12):
Drivers: hv: vmbus: Drop unsupported VMBus devices earlier
Drivers: hv: vmbus: Fix error path that deletes non-existent sysfs
group
Drivers: hv: vmbus: Add an IRQ name to VMBus channels
PCI: hv: Annotate the VMBus channel IRQ name
scsi: storvsc: Annotate the VMBus channel IRQ name
genirq: Add per-cpu flow handler with conditional IRQ stats
Drivers: hv: vmbus: Set up irqdomain and irqchip for the VMBus
connection
Drivers: hv: vmbus: Allocate an IRQ per channel and use for relid
mapping
Drivers: hv: vmbus: Use Linux IRQs to handle VMBus channel interrupts
Drivers: hv: vmbus: Implement vmbus_irq_set_affinity
Drivers: hv: vmbus: Wait for MODIFYCHANNEL to finish when offlining
CPUs
Drivers: hv: vmbus: Ensure IRQ affinity isn't set to a CPU going
offline
arch/x86/kernel/cpu/mshyperv.c | 9 +-
drivers/hv/channel.c | 125 ++++------
drivers/hv/channel_mgmt.c | 139 ++++++++----
drivers/hv/connection.c | 48 ++--
drivers/hv/hv.c | 90 +++++++-
drivers/hv/hv_common.c | 2 +-
drivers/hv/hyperv_vmbus.h | 22 +-
drivers/hv/vmbus_drv.c | 338 +++++++++++++++++++---------
drivers/pci/controller/pci-hyperv.c | 5 +
drivers/scsi/storvsc_drv.c | 9 +
include/asm-generic/mshyperv.h | 9 +-
include/linux/hyperv.h | 9 +
include/linux/irq.h | 1 +
kernel/irq/chip.c | 29 +++
14 files changed, 567 insertions(+), 268 deletions(-)
--
2.25.1
^ permalink raw reply
* Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion
From: H. Peter Anvin @ 2024-06-04 0:24 UTC (permalink / raw)
To: Kirill A. Shutemov, Nikolay Borisov
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
Huang, Kai, Ard Biesheuvel, Baoquan He, K. Y. Srinivasan,
Haiyang Zhang, kexec, linux-hyperv, linux-acpi, linux-coco,
linux-kernel
In-Reply-To: <t3zx4f6ynru7qp4oel4syza2alcuxz7q7hxqgf2lxusgobnsnh@vtnecqrsxci5>
Trying one more time; sorry (again) if someone receives this in duplicate.
>>>
>>> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
>>> index 56cab1bb25f5..085eef5c3904 100644
>>> --- a/arch/x86/kernel/relocate_kernel_64.S
>>> +++ b/arch/x86/kernel/relocate_kernel_64.S
>>> @@ -148,9 +148,10 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>>> */
>>> movl $X86_CR4_PAE, %eax
>>> testq $X86_CR4_LA57, %r13
>>> - jz 1f
>>> + jz .Lno_la57
>>> orl $X86_CR4_LA57, %eax
>>> -1:
>>> +.Lno_la57:
>>> +
>>> movq %rax, %cr4
If we are cleaning up this code... the above can simply be:
andl $(X86_CR4_PAE | X86_CR4_LA54), %r13
movq %r13, %cr4
%r13 is dead afterwards, and the PAE bit *will* be set in %r13 anyway.
-hpa
^ permalink raw reply
* Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation
From: Sean Christopherson @ 2024-06-04 0:30 UTC (permalink / raw)
To: Mickaël Salaün
Cc: Nicolas Saenz Julienne, Borislav Petkov, Dave Hansen,
H . Peter Anvin, Ingo Molnar, Kees Cook, Paolo Bonzini,
Thomas Gleixner, Vitaly Kuznetsov, Wanpeng Li, Rick P Edgecombe,
Alexander Graf, Angelina Vu, Anna Trikalinou, Chao Peng,
Forrest Yuan Yu, James Gowans, James Morris, John Andersen,
Madhavan T . Venkataraman, Marian Rotariu, Mihai Donțu,
Nicușor Cîțu, Thara Gopinath, Trilok Soni, Wei Liu,
Will Deacon, Yu Zhang, Ștefan Șicleru, dev, kvm,
linux-hardening, linux-hyperv, linux-kernel,
linux-security-module, qemu-devel, virtualization, x86, xen-devel
In-Reply-To: <20240603.ahNg8waif6Fu@digikod.net>
On Mon, Jun 03, 2024, Mickaël Salaün wrote:
> On Wed, May 15, 2024 at 01:32:24PM -0700, Sean Christopherson wrote:
> > On Tue, May 14, 2024, Mickaël Salaün wrote:
> > > On Fri, May 10, 2024 at 10:07:00AM +0000, Nicolas Saenz Julienne wrote:
> > > > Development happens
> > > > https://github.com/vianpl/{linux,qemu,kvm-unit-tests} and the vsm-next
> > > > branch, but I'd advice against looking into it until we add some order
> > > > to the rework. Regardless, feel free to get in touch.
> > >
> > > Thanks for the update.
> > >
> > > Could we schedule a PUCK meeting to synchronize and help each other?
> > > What about June 12?
> >
> > June 12th works on my end.
>
> Can you please send an invite?
I think this is all the info?
Time: 6am PDT
Video: https://meet.google.com/vdb-aeqo-knk
Phone: https://tel.meet/vdb-aeqo-knk?pin=3003112178656
Calendar: https://calendar.google.com/calendar/u/0?cid=Y182MWE1YjFmNjQ0NzM5YmY1YmVkN2U1ZWE1ZmMzNjY5Y2UzMmEyNTQ0YzVkYjFjN2M4OTE3MDJjYTUwOTBjN2Q1QGdyb3VwLmNhbGVuZGFyLmdvb2dsZS5jb20
Drive: https://drive.google.com/drive/folders/1aTqCrvTsQI9T4qLhhLs_l986SngGlhPH?resourcekey=0-FDy0ykM3RerZedI8R-zj4A&usp=drive_link
^ permalink raw reply
* Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation
From: Sean Christopherson @ 2024-06-04 0:29 UTC (permalink / raw)
To: Mickaël Salaün
Cc: Nicolas Saenz Julienne, Borislav Petkov, Dave Hansen,
H . Peter Anvin, Ingo Molnar, Kees Cook, Paolo Bonzini,
Thomas Gleixner, Vitaly Kuznetsov, Wanpeng Li, Rick P Edgecombe,
Alexander Graf, Angelina Vu, Anna Trikalinou, Chao Peng,
Forrest Yuan Yu, James Gowans, James Morris, John Andersen,
Madhavan T . Venkataraman, Marian Rotariu, Mihai Donțu,
Nicușor Cîțu, Thara Gopinath, Trilok Soni, Wei Liu,
Will Deacon, Yu Zhang, Ștefan Șicleru, dev, kvm,
linux-hardening, linux-hyperv, linux-kernel,
linux-security-module, qemu-devel, virtualization, x86, xen-devel
In-Reply-To: <20240514.OoPohLaejai6@digikod.net>
On Tue, May 14, 2024, Mickaël Salaün wrote:
> On Tue, May 07, 2024 at 09:16:06AM -0700, Sean Christopherson wrote:
> > On Tue, May 07, 2024, Mickaël Salaün wrote:
> > > If yes, that would indeed require a *lot* of work for something we're not
> > > sure will be accepted later on.
> >
> > Yes and no. The AWS folks are pursuing VSM support in KVM+QEMU, and SVSM support
> > is trending toward the paired VM+vCPU model. IMO, it's entirely feasible to
> > design KVM support such that much of the development load can be shared between
> > the projects. And having 2+ use cases for a feature (set) makes it _much_ more
> > likely that the feature(s) will be accepted.
> >
> > And similar to what Paolo said regarding HEKI not having a complete story, I
> > don't see a clear line of sight for landing host-defined policy enforcement, as
> > there are many open, non-trivial questions that need answers. I.e. upstreaming
> > HEKI in its current form is also far from a done deal, and isn't guaranteed to
> > be substantially less work when all is said and done.
>
> I'm not sure to understand why "Heki not having a complete story". The
> goal is the same as the current kernel self-protection mechanisms.
HEKI doesn't have a complete story for how it's going to play nice with kexec(),
emulated RESET, etc. The kernel's existing self-protection mechanisms Just Work
because the protections are automatically disabled/lost on such transitions.
They are obviously significant drawbacks to that behavior, but they are accepted
drawbacks, i.e. solving those problems isn't in scope (yet) for the kernel. And
the "failure" mode is also loss of hardening, not an unusable guest.
In other words, the kernel's hardening is firmly best effort at this time,
whereas HEKI likely needs to be much more than "best effort" in order to justify
the extra complexity. And that means having answers to the various interoperability
questions.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox