* Re: [EXTERNAL] Re: [PATCH 5/9] x86/hyperv: Use vmmcall to implement Hyper-V hypercall in sev-snp enlightened guest
From: Tianyu Lan @ 2023-06-27 10:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, daniel.lezcano, arnd, michael.h.kelley, Tianyu Lan,
linux-arch, linux-hyperv, linux-kernel, vkuznets
In-Reply-To: <8b93aa93-903f-3a69-77f9-0c6b694d826b@gmail.com>
On 6/8/2023 11:15 PM, Tianyu Lan wrote:
> On 6/8/2023 9:21 PM, Peter Zijlstra wrote:
>> On Thu, Jun 01, 2023 at 11:16:18AM -0400, Tianyu Lan wrote:
>>> From: Tianyu Lan <tiala@microsoft.com>
>>>
>>> In sev-snp enlightened guest, Hyper-V hypercall needs
>>> to use vmmcall to trigger vmexit and notify hypervisor
>>> to handle hypercall request.
>>>
>>> There is no x86 SEV SNP feature flag support so far and
>>> hardware provides MSR_AMD64_SEV register to check SEV-SNP
>>> capability with MSR_AMD64_SEV_ENABLED bit. ALTERNATIVE can't
>>> work without SEV-SNP x86 feature flag. May add later when
>>> the associated flag is introduced.
>>>
>>> Signed-off-by: Tianyu Lan <tiala@microsoft.com>
>>> ---
>>> arch/x86/include/asm/mshyperv.h | 44 ++++++++++++++++++++++++---------
>>> 1 file changed, 33 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/mshyperv.h
>>> b/arch/x86/include/asm/mshyperv.h
>>> index 31c476f4e656..d859d7c5f5e8 100644
>>> --- a/arch/x86/include/asm/mshyperv.h
>>> +++ b/arch/x86/include/asm/mshyperv.h
>>> @@ -61,16 +61,25 @@ static inline u64 hv_do_hypercall(u64 control,
>>> void *input, void *output)
>>> u64 hv_status;
>>> #ifdef CONFIG_X86_64
>>> - if (!hv_hypercall_pg)
>>> - return U64_MAX;
>>> + if (hv_isolation_type_en_snp()) {
>>> + __asm__ __volatile__("mov %4, %%r8\n"
>>> + "vmmcall"
>>> + : "=a" (hv_status), ASM_CALL_CONSTRAINT,
>>> + "+c" (control), "+d" (input_address)
>>> + : "r" (output_address)
>>> + : "cc", "memory", "r8", "r9", "r10", "r11");
>>> + } else {
>>> + if (!hv_hypercall_pg)
>>> + return U64_MAX;
>>> - __asm__ __volatile__("mov %4, %%r8\n"
>>> - CALL_NOSPEC
>>> - : "=a" (hv_status), ASM_CALL_CONSTRAINT,
>>> - "+c" (control), "+d" (input_address)
>>> - : "r" (output_address),
>>> - THUNK_TARGET(hv_hypercall_pg)
>>> - : "cc", "memory", "r8", "r9", "r10", "r11");
>>> + __asm__ __volatile__("mov %4, %%r8\n"
>>> + CALL_NOSPEC
>>> + : "=a" (hv_status), ASM_CALL_CONSTRAINT,
>>> + "+c" (control), "+d" (input_address)
>>> + : "r" (output_address),
>>> + THUNK_TARGET(hv_hypercall_pg)
>>> + : "cc", "memory", "r8", "r9", "r10", "r11");
>>> + }
>>> #else
>>
>> Remains unanswered:
>>
>> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2F20230516102912.GG2587705%2540hirez.programming.kicks-ass.net&data=05%7C01%7CTianyu.Lan%40microsoft.com%7C60a576eb67634ffa27b108db68234d5a%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638218273105649705%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=MFj67DON0K%2BUoUJbeaIA5oVTxyrzO3fb5DbxYgDWwX0%3D&reserved=0
>>
>> Would this not generate better code with an alternative?
>
>
> Hi Peter:
> Thanks to review. I put the explaination in the change log.
>
> "There is no x86 SEV SNP feature(X86_FEATURE_SEV_SNP) flag
> support so far and hardware provides MSR_AMD64_SEV register
> to check SEV-SNP capability with MSR_AMD64_SEV_ENABLED bit
> ALTERNATIVE can't work without SEV-SNP x86 feature flag."
> There is no cpuid leaf bit to check AMD SEV-SNP feature.
>
> After some Hyper-V doesn't provides SEV and SEV-ES guest before and so
> may reuse X86_FEATURE_SEV and X86_FEATURE_SEV_ES flag as alternative
> feature check for Hyper-V SEV-SNP guest. Will refresh patch.
>
Hi Peter:
I tried using alternative for "vmmcall" and CALL_NOSPEC in a single
Inline assembly. The output is different in the SEV-SNP mode. When SEV-
SNP is enabled, thunk_target is not required. While it's necessary in
the non SEV-SNP mode. Do you have any idea how to differentiate outputs
in the single Inline assembly which just like alternative works for
assembler template.
^ permalink raw reply
* Re: [PATCH 2/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive
From: Souradeep Chakrabarti @ 2023-06-27 8:45 UTC (permalink / raw)
To: Haiyang Zhang
Cc: KY Srinivasan, wei.liu@kernel.org, Dexuan Cui,
davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, Long Li, Ajay Sharma, leon@kernel.org,
cai.huoqing@linux.dev, ssengar@linux.microsoft.com,
vkuznets@redhat.com, tglx@linutronix.de,
linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
stable@vger.kernel.org, Souradeep Chakrabarti
In-Reply-To: <PH7PR21MB3116B7C4826A19F3103D6304CA26A@PH7PR21MB3116.namprd21.prod.outlook.com>
On Mon, Jun 26, 2023 at 03:53:50PM +0000, Haiyang Zhang wrote:
>
>
> > -----Original Message-----
> > From: souradeep chakrabarti <schakrabarti@linux.microsoft.com>
> > Sent: Monday, June 26, 2023 5:20 AM
> > To: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> > <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> > <decui@microsoft.com>; davem@davemloft.net; edumazet@google.com;
> > kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay
> > Sharma <sharmaajay@microsoft.com>; leon@kernel.org;
> > cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com;
> > tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org;
> > linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org
> > Cc: stable@vger.kernel.org; Souradeep Chakrabarti
> > <schakrabarti@microsoft.com>; Souradeep Chakrabarti
> > <schakrabarti@linux.microsoft.com>
> > Subject: [PATCH 2/2 V3 net] net: mana: Fix MANA VF unload when host is
> > unresponsive
>
> In general, two patches shouldn't have the same Subject.
>
> For this patch set, the two patches are two steps to fix the unloading issue, and
> they are not very long. IMHO, they should be in one patch.
>
> Thanks,
> - Haiyang
I will create a single patch in next version. As this fixes the unloading issue.
^ permalink raw reply
* Re: [PATCH 1/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive
From: Souradeep Chakrabarti @ 2023-06-27 8:44 UTC (permalink / raw)
To: Praveen Kumar
Cc: kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
longli, sharmaajay, leon, cai.huoqing, ssengar, vkuznets, tglx,
linux-hyperv, netdev, linux-kernel, linux-rdma, stable,
schakrabarti
In-Reply-To: <69098dcb-c184-7d93-4045-7ac1bc0ac6d0@linux.microsoft.com>
On Mon, Jun 26, 2023 at 07:50:44PM +0530, Praveen Kumar wrote:
> On 6/26/2023 2:48 PM, souradeep chakrabarti wrote:
> > From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> >
> > This patch addresses the VF unload issue, where mana_dealloc_queues()
> > gets stuck in infinite while loop, because of host unresponsiveness.
> > It adds a timeout in the while loop, to fix it.
> >
> > Fixes: ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f (net: mana: Add a driver for
> > Microsoft Azure Network Adapter)
> > Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> > ---
> > V2 -> V3:
> > * Splitted the patch in two parts.
> > * Removed the unnecessary braces from mana_dealloc_queues().
> > ---
> > drivers/net/ethernet/microsoft/mana/mana_en.c | 19 +++++++++++++++++--
> > 1 file changed, 17 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index d907727c7b7a..cb5c43c3c47e 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > @@ -2329,7 +2329,10 @@ static int mana_dealloc_queues(struct net_device *ndev)
> > {
> > struct mana_port_context *apc = netdev_priv(ndev);
> > struct gdma_dev *gd = apc->ac->gdma_dev;
> > + unsigned long timeout;
> > struct mana_txq *txq;
> > + struct sk_buff *skb;
> > + struct mana_cq *cq;
> > int i, err;
> >
> > if (apc->port_is_up)
> > @@ -2348,13 +2351,25 @@ static int mana_dealloc_queues(struct net_device *ndev)
> > *
> > * Drain all the in-flight TX packets
> > */
> > +
> > + timeout = jiffies + 120 * HZ;
> > for (i = 0; i < apc->num_queues; i++) {
> > txq = &apc->tx_qp[i].txq;
> > -
> > - while (atomic_read(&txq->pending_sends) > 0)
> > + while (atomic_read(&txq->pending_sends) > 0 &&
> > + time_before(jiffies, timeout))
> > usleep_range(1000, 2000);
> > }
> >
> > + for (i = 0; i < apc->num_queues; i++) {
> > + txq = &apc->tx_qp[i].txq;
> > + cq = &apc->tx_qp[i].tx_cq;
> > + while (atomic_read(&txq->pending_sends)) {
> > + skb = skb_dequeue(&txq->pending_skbs);
> > + mana_unmap_skb(skb, apc);
> > + napi_consume_skb(skb, cq->budget);
> > + atomic_sub(1, &txq->pending_sends);
> > + }
> > + }
>
> Can we combine these 2 loops into 1 something like this ?
>
> for (i = 0; i < apc->num_queues; i++) {
> txq = &apc->tx_qp[i].txq;
> cq = &apc->tx_qp[i].tx_cq;
> while (atomic_read(&txq->pending_sends)) {
> if (time_before(jiffies, timeout)) {
> usleep_range(1000, 2000);
> } else {
> skb = skb_dequeue(&txq->pending_skbs);
> mana_unmap_skb(skb, apc);
> napi_consume_skb(skb, cq->budget);
> atomic_sub(1, &txq->pending_sends);
> }
> }
> }
We should free up the skbs only after timeout has happened or after all the
queues are looped.
> > /* We're 100% sure the queues can no longer be woken up, because
> > * we're sure now mana_poll_tx_cq() can't be running.
> > */
^ permalink raw reply
* Re: [PATCH 2/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive
From: Souradeep Chakrabarti @ 2023-06-27 8:42 UTC (permalink / raw)
To: Praveen Kumar
Cc: kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
longli, sharmaajay, leon, cai.huoqing, ssengar, vkuznets, tglx,
linux-hyperv, netdev, linux-kernel, linux-rdma, stable,
schakrabarti
In-Reply-To: <578faf91-35e6-d946-d9ec-c949e57eadef@linux.microsoft.com>
On Mon, Jun 26, 2023 at 07:43:07PM +0530, Praveen Kumar wrote:
> On 6/26/2023 2:50 PM, souradeep chakrabarti wrote:
> > From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> >
> > This is the second part of the fix.
> >
> > Also this patch adds a new attribute in mana_context, which gets set when
> > mana_hwc_send_request() hits a timeout because of host unresponsiveness.
> > This flag then helps to avoid the timeouts in successive calls.
> >
> > Fixes: ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f (net: mana: Add a driver for
> > Microsoft Azure Network Adapter)
> > Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> > ---
> > V2 -> V3:
> > * Removed the initialization of vf_unload_timeout
> > * Splitted the patch in two.
> > * Fixed extra space from the commit message.
> > ---
> > drivers/net/ethernet/microsoft/mana/gdma_main.c | 4 +++-
> > drivers/net/ethernet/microsoft/mana/hw_channel.c | 12 +++++++++++-
> > include/net/mana/mana.h | 2 ++
> > 3 files changed, 16 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 8f3f78b68592..6411f01be0d9 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > @@ -946,10 +946,12 @@ int mana_gd_deregister_device(struct gdma_dev *gd)
> > struct gdma_context *gc = gd->gdma_context;
> > struct gdma_general_resp resp = {};
> > struct gdma_general_req req = {};
> > + struct mana_context *ac;
> > int err;
> >
> > if (gd->pdid == INVALID_PDID)
> > return -EINVAL;
> > + ac = gd->driver_data;
> >
> > mana_gd_init_req_hdr(&req.hdr, GDMA_DEREGISTER_DEVICE, sizeof(req),
> > sizeof(resp));
> > @@ -957,7 +959,7 @@ int mana_gd_deregister_device(struct gdma_dev *gd)
> > req.hdr.dev_id = gd->dev_id;
> >
> > err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
> > - if (err || resp.hdr.status) {
> > + if ((err || resp.hdr.status) && !ac->vf_unload_timeout) {
> > dev_err(gc->dev, "Failed to deregister device: %d, 0x%x\n",
> > err, resp.hdr.status);
>
> With !ac->vf_unload_timeout option, this message may not be correctly showing err, status. Probably you want to add explicit information during timeouts so that it give right information ? Or have the err, status field properly updated.
This check !ac->vf_unload_timeout here means if ac->vf_unload_timeout is not yet set,
then only consider the err path, else continue the remaining operation.
>
> > if (!err)
> > diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > index 9d1507eba5b9..492cb2c6e2cb 100644
> > --- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > +++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > @@ -1,8 +1,10 @@
> > // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> > /* Copyright (c) 2021, Microsoft Corporation. */
> >
> > +#include "asm-generic/errno.h"
> > #include <net/mana/gdma.h>
> > #include <net/mana/hw_channel.h>
> > +#include <net/mana/mana.h>
> >
> > static int mana_hwc_get_msg_index(struct hw_channel_context *hwc, u16 *msg_id)
> > {
> > @@ -786,12 +788,19 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
> > struct hwc_wq *txq = hwc->txq;
> > struct gdma_req_hdr *req_msg;
> > struct hwc_caller_ctx *ctx;
> > + struct mana_context *ac;
> > u32 dest_vrcq = 0;
> > u32 dest_vrq = 0;
> > u16 msg_id;
> > int err;
> >
> > mana_hwc_get_msg_index(hwc, &msg_id);
> > + ac = hwc->gdma_dev->driver_data;
>
> Is there a case where gdma_dev be invalid here ? If so, lets check the state and then proceed further ?
I can see Haiyang has already in his comment, responded on the same.
hwc->gdma_dev will be valid here, but as Haiyang pointed we need to use
hwc->gdma_dev->gdma_context->mana.driver_data, or better to relocate the
attribute in gdma_context.
>
> > + if (ac->vf_unload_timeout) {
> > + dev_err(hwc->dev, "HWC: vport is already unloaded.\n");
> > + err = -ETIMEDOUT;
> > + goto out;
> > + }
> >
> > tx_wr = &txq->msg_buf->reqs[msg_id];
> >
> > @@ -825,9 +834,10 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
> > goto out;
> > }
> >
> > - if (!wait_for_completion_timeout(&ctx->comp_event, 30 * HZ)) {
> > + if (!wait_for_completion_timeout(&ctx->comp_event, 5 * HZ)) {
>
> IMHO we should have macros instead of magic numbers (5 , 30 or so). But would like others to comment here.
>
> > dev_err(hwc->dev, "HWC: Request timed out!\n");
> > err = -ETIMEDOUT;
> > + ac->vf_unload_timeout = true;
> > goto out;
> > }
> >
> > diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
> > index 9eef19972845..5f5affdca1eb 100644
> > --- a/include/net/mana/mana.h
> > +++ b/include/net/mana/mana.h
> > @@ -358,6 +358,8 @@ struct mana_context {
> >
> > u16 num_ports;
> >
> > + bool vf_unload_timeout;
> > +
> > struct mana_eq *eqs;
> >
> > struct net_device *ports[MAX_PORTS_IN_MANA_DEV];
^ permalink raw reply
* Re: [PATCH 1/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive
From: Souradeep Chakrabarti @ 2023-06-27 8:35 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Dexuan Cui, Simon Horman, KY Srinivasan, Haiyang Zhang,
wei.liu@kernel.org, davem@davemloft.net, edumazet@google.com,
kuba@kernel.org, pabeni@redhat.com, Long Li, Ajay Sharma,
leon@kernel.org, cai.huoqing@linux.dev,
ssengar@linux.microsoft.com, vkuznets@redhat.com,
tglx@linutronix.de, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, stable@vger.kernel.org,
Souradeep Chakrabarti
In-Reply-To: <20230626134721.256c6c16@hermes.local>
On Mon, Jun 26, 2023 at 01:47:21PM -0700, Stephen Hemminger wrote:
> On Mon, 26 Jun 2023 20:06:48 +0000
> Dexuan Cui <decui@microsoft.com> wrote:
>
> > > From: Simon Horman
> > > Sent: Monday, June 26, 2023 6:05 AM
> > > > ...
> > > > Fixes: ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f (net: mana: Add a
> > > > driver for
> > > > Microsoft Azure Network Adapter)
> > >
> > > nit: A correct format of this fixes tag is:
> > >
> > > In particular:
> > > * All on lone line
> > > * Description in double quotes.
> > >
> > > Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network
> > > Adapter (MANA)")
> >
> > Hi Souradeep, FYI I often refer to:
> > https://marc.info/?l=linux-pci&m=150905742808166&w=2
> >
> > The link mentions:
> > alias gsr='git --no-pager show -s --abbrev-commit --abbrev=12 --pretty=format:"%h (\"%s\")%n"'
> >
Thank you for the advice. Will use it from now onwards.
> > "gsr ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f" produces:
> > ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
>
> You can do same thing without shell alias by using git-config
>
> [alias]
> fixes = log -1 --format=fixes
> gsr = log -1 --format=gsr
>
> [pretty]
> fixes = Fixes: %h (\"%s\")
> gsr = %h (\"%s\")
>
> Then:
> $ git gsr 1919b39fc6eabb9a6f9a51706ff6d03865f5df29
> 1919b39fc6ea ("net: mana: Fix perf regression: remove rx_cqes, tx_cqes counters")
Thank you for the suggestion.
^ permalink raw reply
* [PATCH V2 9/9] x86/hyperv: Add hyperv-specific handling for VMMCALL under SEV-ES
From: Tianyu Lan @ 2023-06-27 3:22 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, daniel.lezcano, arnd, michael.h.kelley
Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel, vkuznets
In-Reply-To: <20230627032248.2170007-1-ltykernel@gmail.com>
From: Tianyu Lan <tiala@microsoft.com>
Add Hyperv-specific handling for faults caused by VMMCALL
instructions.
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
---
arch/x86/kernel/cpu/mshyperv.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 8e1d9ed6a1e0..ba9a3a65f664 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -32,6 +32,7 @@
#include <asm/nmi.h>
#include <clocksource/hyperv_timer.h>
#include <asm/numa.h>
+#include <asm/svm.h>
/* Is Linux running as the root partition? */
bool hv_root_partition;
@@ -577,6 +578,20 @@ static bool __init ms_hyperv_msi_ext_dest_id(void)
return eax & HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE;
}
+static void hv_sev_es_hcall_prepare(struct ghcb *ghcb, struct pt_regs *regs)
+{
+ /* RAX and CPL are already in the GHCB */
+ ghcb_set_rcx(ghcb, regs->cx);
+ ghcb_set_rdx(ghcb, regs->dx);
+ ghcb_set_r8(ghcb, regs->r8);
+}
+
+static bool hv_sev_es_hcall_finish(struct ghcb *ghcb, struct pt_regs *regs)
+{
+ /* No checking of the return state needed */
+ return true;
+}
+
const __initconst struct hypervisor_x86 x86_hyper_ms_hyperv = {
.name = "Microsoft Hyper-V",
.detect = ms_hyperv_platform,
@@ -584,4 +599,6 @@ const __initconst struct hypervisor_x86 x86_hyper_ms_hyperv = {
.init.x2apic_available = ms_hyperv_x2apic_available,
.init.msi_ext_dest_id = ms_hyperv_msi_ext_dest_id,
.init.init_platform = ms_hyperv_init_platform,
+ .runtime.sev_es_hcall_prepare = hv_sev_es_hcall_prepare,
+ .runtime.sev_es_hcall_finish = hv_sev_es_hcall_finish,
};
--
2.25.1
^ permalink raw reply related
* [PATCH V2 8/9] x86/hyperv: Add smp support for SEV-SNP guest
From: Tianyu Lan @ 2023-06-27 3:22 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, daniel.lezcano, arnd, michael.h.kelley
Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel, vkuznets
In-Reply-To: <20230627032248.2170007-1-ltykernel@gmail.com>
From: Tianyu Lan <tiala@microsoft.com>
In the AMD SEV-SNP guest, AP needs to be started up via sev es
save area and Hyper-V requires to call HVCALL_START_VP hypercall
to pass the gpa of sev es save area with AP's vp index and VTL(Virtual
trust level) parameters. Override wakeup_secondary_cpu_64 callback
with hv_snp_boot_ap.
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
---
arch/x86/hyperv/ivm.c | 95 +++++++++++++++++++++++++++++++
arch/x86/include/asm/mshyperv.h | 9 +++
arch/x86/kernel/cpu/mshyperv.c | 13 ++++-
include/asm-generic/hyperv-tlfs.h | 1 +
4 files changed, 116 insertions(+), 2 deletions(-)
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index b1639ec07155..9b307f99b540 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -22,11 +22,15 @@
#include <asm/sev.h>
#include <asm/realmode.h>
#include <asm/e820/api.h>
+#include <asm/desc.h>
#ifdef CONFIG_AMD_MEM_ENCRYPT
#define GHCB_USAGE_HYPERV_CALL 1
+static u8 ap_start_input_arg[PAGE_SIZE] __bss_decrypted __aligned(PAGE_SIZE);
+static u8 ap_start_stack[PAGE_SIZE] __aligned(PAGE_SIZE);
+
union hv_ghcb {
struct ghcb ghcb;
struct {
@@ -449,6 +453,97 @@ __init void hv_sev_init_mem_and_cpu(void)
}
}
+#define hv_populate_vmcb_seg(seg, gdtr_base) \
+do { \
+ if (seg.selector) { \
+ seg.base = 0; \
+ seg.limit = HV_AP_SEGMENT_LIMIT; \
+ seg.attrib = *(u16 *)(gdtr_base + seg.selector + 5); \
+ seg.attrib = (seg.attrib & 0xFF) | ((seg.attrib >> 4) & 0xF00); \
+ } \
+} while (0) \
+
+int hv_snp_boot_ap(int cpu, unsigned long start_ip)
+{
+ struct sev_es_save_area *vmsa = (struct sev_es_save_area *)
+ __get_free_page(GFP_KERNEL | __GFP_ZERO);
+ struct desc_ptr gdtr;
+ u64 ret, rmp_adjust, retry = 5;
+ struct hv_enable_vp_vtl *start_vp_input;
+ unsigned long flags;
+
+ native_store_gdt(&gdtr);
+
+ vmsa->gdtr.base = gdtr.address;
+ vmsa->gdtr.limit = gdtr.size;
+
+ asm volatile("movl %%es, %%eax;" : "=a" (vmsa->es.selector));
+ hv_populate_vmcb_seg(vmsa->es, vmsa->gdtr.base);
+
+ asm volatile("movl %%cs, %%eax;" : "=a" (vmsa->cs.selector));
+ hv_populate_vmcb_seg(vmsa->cs, vmsa->gdtr.base);
+
+ asm volatile("movl %%ss, %%eax;" : "=a" (vmsa->ss.selector));
+ hv_populate_vmcb_seg(vmsa->ss, vmsa->gdtr.base);
+
+ asm volatile("movl %%ds, %%eax;" : "=a" (vmsa->ds.selector));
+ hv_populate_vmcb_seg(vmsa->ds, vmsa->gdtr.base);
+
+ vmsa->efer = native_read_msr(MSR_EFER);
+
+ asm volatile("movq %%cr4, %%rax;" : "=a" (vmsa->cr4));
+ asm volatile("movq %%cr3, %%rax;" : "=a" (vmsa->cr3));
+ asm volatile("movq %%cr0, %%rax;" : "=a" (vmsa->cr0));
+
+ vmsa->xcr0 = 1;
+ vmsa->g_pat = HV_AP_INIT_GPAT_DEFAULT;
+ vmsa->rip = (u64)secondary_startup_64_no_verify;
+ vmsa->rsp = (u64)&ap_start_stack[PAGE_SIZE];
+
+ /*
+ * Set the SNP-specific fields for this VMSA:
+ * VMPL level
+ * SEV_FEATURES (matches the SEV STATUS MSR right shifted 2 bits)
+ */
+ vmsa->vmpl = 0;
+ vmsa->sev_features = sev_status >> 2;
+
+ /*
+ * Running at VMPL0 allows the kernel to change the VMSA bit for a page
+ * using the RMPADJUST instruction. However, for the instruction to
+ * succeed it must target the permissions of a lesser privileged
+ * (higher numbered) VMPL level, so use VMPL1 (refer to the RMPADJUST
+ * instruction in the AMD64 APM Volume 3).
+ */
+ rmp_adjust = RMPADJUST_VMSA_PAGE_BIT | 1;
+ ret = rmpadjust((unsigned long)vmsa, RMP_PG_SIZE_4K,
+ rmp_adjust);
+ if (ret != 0) {
+ pr_err("RMPADJUST(%llx) failed: %llx\n", (u64)vmsa, ret);
+ return ret;
+ }
+
+ local_irq_save(flags);
+ start_vp_input =
+ (struct hv_enable_vp_vtl *)ap_start_input_arg;
+ memset(start_vp_input, 0, sizeof(*start_vp_input));
+ start_vp_input->partition_id = -1;
+ start_vp_input->vp_index = cpu;
+ start_vp_input->target_vtl.target_vtl = ms_hyperv.vtl;
+ *(u64 *)&start_vp_input->vp_context = __pa(vmsa) | 1;
+
+ do {
+ ret = hv_do_hypercall(HVCALL_START_VP,
+ start_vp_input, NULL);
+ } while (hv_result(ret) == HV_STATUS_TIME_OUT && retry--);
+
+ local_irq_restore(flags);
+
+ if (!hv_result_success(ret))
+ pr_err("HvCallStartVirtualProcessor failed: %llx\n", ret);
+ return ret;
+}
+
void __init hv_vtom_init(void)
{
/*
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 7a9a6cdc2ae9..804c67475054 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -65,6 +65,13 @@ struct memory_map_entry {
u32 reserved;
};
+/*
+ * DEFAULT INIT GPAT and SEGMENT LIMIT value in struct VMSA
+ * to start AP in enlightened SEV guest.
+ */
+#define HV_AP_INIT_GPAT_DEFAULT 0x0007040600070406ULL
+#define HV_AP_SEGMENT_LIMIT 0xffffffff
+
int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
@@ -271,6 +278,7 @@ bool hv_ghcb_negotiate_protocol(void);
void __noreturn hv_ghcb_terminate(unsigned int set, unsigned int reason);
void hv_vtom_init(void);
void hv_sev_init_mem_and_cpu(void);
+int hv_snp_boot_ap(int cpu, unsigned long start_ip);
#else
static inline void hv_ghcb_msr_write(u64 msr, u64 value) {}
static inline void hv_ghcb_msr_read(u64 msr, u64 *value) {}
@@ -278,6 +286,7 @@ static inline bool hv_ghcb_negotiate_protocol(void) { return false; }
static inline void hv_ghcb_terminate(unsigned int set, unsigned int reason) {}
static inline void hv_vtom_init(void) {}
static inline void hv_sev_init_mem_and_cpu(void) {}
+static int hv_snp_boot_ap(int cpu, unsigned long start_ip) {}
#endif
extern bool hv_isolation_type_snp(void);
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index d3bb921ee7fe..8e1d9ed6a1e0 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -295,6 +295,16 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
native_smp_prepare_cpus(max_cpus);
+ /*
+ * Override wakeup_secondary_cpu_64 callback for SEV-SNP
+ * enlightened guest.
+ */
+ if (hv_isolation_type_en_snp())
+ apic->wakeup_secondary_cpu_64 = hv_snp_boot_ap;
+
+ if (!hv_root_partition)
+ return;
+
#ifdef CONFIG_X86_64
for_each_present_cpu(i) {
if (i == 0)
@@ -502,8 +512,7 @@ static void __init ms_hyperv_init_platform(void)
# ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = hv_smp_prepare_boot_cpu;
- if (hv_root_partition)
- smp_ops.smp_prepare_cpus = hv_smp_prepare_cpus;
+ smp_ops.smp_prepare_cpus = hv_smp_prepare_cpus;
# endif
/*
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index f4e4cc4f965f..fdac4a1714ec 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -223,6 +223,7 @@ enum HV_GENERIC_SET_FORMAT {
#define HV_STATUS_INVALID_PORT_ID 17
#define HV_STATUS_INVALID_CONNECTION_ID 18
#define HV_STATUS_INSUFFICIENT_BUFFERS 19
+#define HV_STATUS_TIME_OUT 120
#define HV_STATUS_VTL_ALREADY_ENABLED 134
/*
--
2.25.1
^ permalink raw reply related
* [PATCH V2 7/9] x86/hyperv: Initialize cpu and memory for SEV-SNP enlightened guest
From: Tianyu Lan @ 2023-06-27 3:22 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, daniel.lezcano, arnd, michael.h.kelley
Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel, vkuznets
In-Reply-To: <20230627032248.2170007-1-ltykernel@gmail.com>
From: Tianyu Lan <tiala@microsoft.com>
Hyper-V enlightened guest doesn't have boot loader support.
Boot Linux kernel directly from hypervisor with data (kernel
image, initrd and parameter page) and memory for boot up that
is initialized via AMD SEV PSP protocol (Please reference
Section 4.5 Launching a Guest of [1]).
Kernel needs to read processor and memory info from EN_SEV_
SNP_PROCESSOR/MEM_INFO_ADDR address which are populated by
Hyper-V. The these data is prepared by hypervisor via SNP_
LAUNCH_UPDATE with page type SNP_PAGE_TYPE_UNMEASURED and
Initialize smp cpu related ops, validate system memory and
add them into e820 table.
[1]: https://www.amd.com/system/files/TechDocs/56860.pdf
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
---
arch/x86/hyperv/ivm.c | 93 +++++++++++++++++++++++++++++++++
arch/x86/include/asm/mshyperv.h | 17 ++++++
arch/x86/kernel/cpu/mshyperv.c | 3 ++
3 files changed, 113 insertions(+)
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 5d3ee3124e00..b1639ec07155 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -17,6 +17,11 @@
#include <asm/mem_encrypt.h>
#include <asm/mshyperv.h>
#include <asm/hypervisor.h>
+#include <asm/coco.h>
+#include <asm/io_apic.h>
+#include <asm/sev.h>
+#include <asm/realmode.h>
+#include <asm/e820/api.h>
#ifdef CONFIG_AMD_MEM_ENCRYPT
@@ -57,6 +62,8 @@ union hv_ghcb {
static u16 hv_ghcb_version __ro_after_init;
+static u32 processor_count;
+
u64 hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size)
{
union hv_ghcb *hv_ghcb;
@@ -356,6 +363,92 @@ static bool hv_is_private_mmio(u64 addr)
return false;
}
+static __init void hv_snp_get_smp_config(unsigned int early)
+{
+ /*
+ * The "early" parameter can be true only if old-style AMD
+ * Opteron NUMA detection is enabled, which should never be
+ * the case for an SEV-SNP guest. See CONFIG_AMD_NUMA.
+ * For safety, just do nothing if "early" is true.
+ */
+ if (early)
+ return;
+
+ /*
+ * There is no firmware and ACPI MADT table support in
+ * in the Hyper-V SEV-SNP enlightened guest. Set smp
+ * related config variable here.
+ */
+ while (num_processors < processor_count) {
+ early_per_cpu(x86_cpu_to_apicid, num_processors) = num_processors;
+ early_per_cpu(x86_bios_cpu_apicid, num_processors) = num_processors;
+ physid_set(num_processors, phys_cpu_present_map);
+ set_cpu_possible(num_processors, true);
+ set_cpu_present(num_processors, true);
+ num_processors++;
+ }
+}
+
+__init void hv_sev_init_mem_and_cpu(void)
+{
+ struct memory_map_entry *entry;
+ struct e820_entry *e820_entry;
+ u64 e820_end;
+ u64 ram_end;
+ u64 page;
+
+ /*
+ * Hyper-V enlightened snp guest boots kernel
+ * directly without bootloader. So roms, bios
+ * regions and reserve resources are not available.
+ * Set these callback to NULL.
+ */
+ x86_platform.legacy.rtc = 0;
+ x86_platform.legacy.reserve_bios_regions = 0;
+ x86_platform.set_wallclock = set_rtc_noop;
+ x86_platform.get_wallclock = get_rtc_noop;
+ x86_init.resources.probe_roms = x86_init_noop;
+ x86_init.resources.reserve_resources = x86_init_noop;
+ x86_init.mpparse.find_smp_config = x86_init_noop;
+ x86_init.mpparse.get_smp_config = hv_snp_get_smp_config;
+
+ /*
+ * Hyper-V SEV-SNP enlightened guest doesn't support ioapic
+ * and legacy APIC page read/write. Switch to hv apic here.
+ */
+ disable_ioapic_support();
+
+ /* Get processor and mem info. */
+ processor_count = *(u32 *)__va(EN_SEV_SNP_PROCESSOR_INFO_ADDR);
+ entry = (struct memory_map_entry *)__va(EN_SEV_SNP_MEM_INFO_ADDR);
+
+ /*
+ * There is no bootloader/EFI firmware in the SEV SNP guest.
+ * E820 table in the memory just describes memory for kernel,
+ * ACPI table, cmdline, boot params and ramdisk. The dynamic
+ * data(e.g, vcpu number and the rest memory layout) needs to
+ * be read from EN_SEV_SNP_PROCESSOR_INFO_ADDR.
+ */
+ for (; entry->numpages != 0; entry++) {
+ e820_entry = &e820_table->entries[
+ e820_table->nr_entries - 1];
+ e820_end = e820_entry->addr + e820_entry->size;
+ ram_end = (entry->starting_gpn +
+ entry->numpages) * PAGE_SIZE;
+
+ if (e820_end < entry->starting_gpn * PAGE_SIZE)
+ e820_end = entry->starting_gpn * PAGE_SIZE;
+
+ if (e820_end < ram_end) {
+ pr_info("Hyper-V: add e820 entry [mem %#018Lx-%#018Lx]\n", e820_end, ram_end - 1);
+ e820__range_add(e820_end, ram_end - e820_end,
+ E820_TYPE_RAM);
+ for (page = e820_end; page < ram_end; page += PAGE_SIZE)
+ pvalidate((unsigned long)__va(page), RMP_PG_SIZE_4K, true);
+ }
+ }
+}
+
void __init hv_vtom_init(void)
{
/*
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index d859d7c5f5e8..7a9a6cdc2ae9 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -50,6 +50,21 @@ extern bool hv_isolation_type_en_snp(void);
extern union hv_ghcb * __percpu *hv_ghcb_pg;
+/*
+ * Hyper-V puts processor and memory layout info
+ * to this address in SEV-SNP enlightened guest.
+ */
+#define EN_SEV_SNP_PROCESSOR_INFO_ADDR 0x802000
+#define EN_SEV_SNP_MEM_INFO_ADDR 0x802018
+
+struct memory_map_entry {
+ u64 starting_gpn;
+ u64 numpages;
+ u16 type;
+ u16 flags;
+ u32 reserved;
+};
+
int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
@@ -255,12 +270,14 @@ void hv_ghcb_msr_read(u64 msr, u64 *value);
bool hv_ghcb_negotiate_protocol(void);
void __noreturn hv_ghcb_terminate(unsigned int set, unsigned int reason);
void hv_vtom_init(void);
+void hv_sev_init_mem_and_cpu(void);
#else
static inline void hv_ghcb_msr_write(u64 msr, u64 value) {}
static inline void hv_ghcb_msr_read(u64 msr, u64 *value) {}
static inline bool hv_ghcb_negotiate_protocol(void) { return false; }
static inline void hv_ghcb_terminate(unsigned int set, unsigned int reason) {}
static inline void hv_vtom_init(void) {}
+static inline void hv_sev_init_mem_and_cpu(void) {}
#endif
extern bool hv_isolation_type_snp(void);
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 5398fb2f4d39..d3bb921ee7fe 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -529,6 +529,9 @@ static void __init ms_hyperv_init_platform(void)
if (!(ms_hyperv.features & HV_ACCESS_TSC_INVARIANT))
mark_tsc_unstable("running on Hyper-V");
+ if (hv_isolation_type_en_snp())
+ hv_sev_init_mem_and_cpu();
+
hardlockup_detector_disable();
}
--
2.25.1
^ permalink raw reply related
* [PATCH V2 6/9] clocksource: hyper-v: Mark hyperv tsc page unencrypted in sev-snp enlightened guest
From: Tianyu Lan @ 2023-06-27 3:22 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, daniel.lezcano, arnd, michael.h.kelley
Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel, vkuznets
In-Reply-To: <20230627032248.2170007-1-ltykernel@gmail.com>
From: Tianyu Lan <tiala@microsoft.com>
Hyper-V tsc page is shared with hypervisor and mark the page
unencrypted in sev-snp enlightened guest when it's used.
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
---
drivers/clocksource/hyperv_timer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index bcd9042a0c9f..66e29a19770b 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -376,7 +376,7 @@ EXPORT_SYMBOL_GPL(hv_stimer_global_cleanup);
static union {
struct ms_hyperv_tsc_page page;
u8 reserved[PAGE_SIZE];
-} tsc_pg __aligned(PAGE_SIZE);
+} tsc_pg __bss_decrypted __aligned(PAGE_SIZE);
static struct ms_hyperv_tsc_page *tsc_page = &tsc_pg.page;
static unsigned long tsc_pfn;
--
2.25.1
^ permalink raw reply related
* [PATCH V2 5/9] x86/hyperv: Use vmmcall to implement Hyper-V hypercall in sev-snp enlightened guest
From: Tianyu Lan @ 2023-06-27 3:22 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, daniel.lezcano, arnd, michael.h.kelley
Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel, vkuznets
In-Reply-To: <20230627032248.2170007-1-ltykernel@gmail.com>
From: Tianyu Lan <tiala@microsoft.com>
In sev-snp enlightened guest, Hyper-V hypercall needs
to use vmmcall to trigger vmexit and notify hypervisor
to handle hypercall request.
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
---
arch/x86/include/asm/mshyperv.h | 44 ++++++++++++++++++++++++---------
1 file changed, 33 insertions(+), 11 deletions(-)
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 31c476f4e656..d859d7c5f5e8 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -61,16 +61,25 @@ static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
u64 hv_status;
#ifdef CONFIG_X86_64
- if (!hv_hypercall_pg)
- return U64_MAX;
+ if (hv_isolation_type_en_snp()) {
+ __asm__ __volatile__("mov %4, %%r8\n"
+ "vmmcall"
+ : "=a" (hv_status), ASM_CALL_CONSTRAINT,
+ "+c" (control), "+d" (input_address)
+ : "r" (output_address)
+ : "cc", "memory", "r8", "r9", "r10", "r11");
+ } else {
+ if (!hv_hypercall_pg)
+ return U64_MAX;
- __asm__ __volatile__("mov %4, %%r8\n"
- CALL_NOSPEC
- : "=a" (hv_status), ASM_CALL_CONSTRAINT,
- "+c" (control), "+d" (input_address)
- : "r" (output_address),
- THUNK_TARGET(hv_hypercall_pg)
- : "cc", "memory", "r8", "r9", "r10", "r11");
+ __asm__ __volatile__("mov %4, %%r8\n"
+ CALL_NOSPEC
+ : "=a" (hv_status), ASM_CALL_CONSTRAINT,
+ "+c" (control), "+d" (input_address)
+ : "r" (output_address),
+ THUNK_TARGET(hv_hypercall_pg)
+ : "cc", "memory", "r8", "r9", "r10", "r11");
+ }
#else
u32 input_address_hi = upper_32_bits(input_address);
u32 input_address_lo = lower_32_bits(input_address);
@@ -104,7 +113,13 @@ static inline u64 _hv_do_fast_hypercall8(u64 control, u64 input1)
u64 hv_status;
#ifdef CONFIG_X86_64
- {
+ if (hv_isolation_type_en_snp()) {
+ __asm__ __volatile__(
+ "vmmcall"
+ : "=a" (hv_status), ASM_CALL_CONSTRAINT,
+ "+c" (control), "+d" (input1)
+ :: "cc", "r8", "r9", "r10", "r11");
+ } else {
__asm__ __volatile__(CALL_NOSPEC
: "=a" (hv_status), ASM_CALL_CONSTRAINT,
"+c" (control), "+d" (input1)
@@ -149,7 +164,14 @@ static inline u64 _hv_do_fast_hypercall16(u64 control, u64 input1, u64 input2)
u64 hv_status;
#ifdef CONFIG_X86_64
- {
+ if (hv_isolation_type_en_snp()) {
+ __asm__ __volatile__("mov %4, %%r8\n"
+ "vmmcall"
+ : "=a" (hv_status), ASM_CALL_CONSTRAINT,
+ "+c" (control), "+d" (input1)
+ : "r" (input2)
+ : "cc", "r8", "r9", "r10", "r11");
+ } else {
__asm__ __volatile__("mov %4, %%r8\n"
CALL_NOSPEC
: "=a" (hv_status), ASM_CALL_CONSTRAINT,
--
2.25.1
^ permalink raw reply related
* [PATCH V2 3/9] x86/hyperv: Mark Hyper-V vp assist page unencrypted in SEV-SNP enlightened guest
From: Tianyu Lan @ 2023-06-27 3:22 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, daniel.lezcano, arnd, michael.h.kelley
Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel, vkuznets
In-Reply-To: <20230627032248.2170007-1-ltykernel@gmail.com>
From: Tianyu Lan <tiala@microsoft.com>
hv vp assist page needs to be shared between SEV-SNP guest and Hyper-V.
So mark the page unencrypted in the SEV-SNP guest.
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
---
arch/x86/hyperv/hv_init.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 1ba367a9686e..b004370d3b01 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -18,6 +18,7 @@
#include <asm/hyperv-tlfs.h>
#include <asm/mshyperv.h>
#include <asm/idtentry.h>
+#include <asm/set_memory.h>
#include <linux/kexec.h>
#include <linux/version.h>
#include <linux/vmalloc.h>
@@ -106,8 +107,21 @@ static int hv_cpu_init(unsigned int cpu)
* in hv_cpu_die(), otherwise a CPU may not be stopped in the
* case of CPU offlining and the VM will hang.
*/
- if (!*hvp)
+ if (!*hvp) {
*hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
+
+ /*
+ * Hyper-V should never specify a VM that is a Confidential
+ * VM and also running in the root partition. Root partition
+ * is blocked to run in Confidential VM. So only decrypt assist
+ * page in non-root partition here.
+ */
+ if (*hvp && hv_isolation_type_en_snp()) {
+ WARN_ON_ONCE(set_memory_decrypted((unsigned long)(*hvp), 1));
+ memset(*hvp, 0, PAGE_SIZE);
+ }
+ }
+
if (*hvp)
msr.pfn = vmalloc_to_pfn(*hvp);
--
2.25.1
^ permalink raw reply related
* [PATCH V2 4/9] drivers: hv: Mark percpu hvcall input arg page unencrypted in SEV-SNP enlightened guest
From: Tianyu Lan @ 2023-06-27 3:22 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, daniel.lezcano, arnd, michael.h.kelley
Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel, vkuznets
In-Reply-To: <20230627032248.2170007-1-ltykernel@gmail.com>
From: Tianyu Lan <tiala@microsoft.com>
Hypervisor needs to access input arg, VMBus synic event and
message pages. Mark these pages unencrypted in the SEV-SNP
guest and free them only if they have been marked encrypted
successfully.
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
---
drivers/hv/hv.c | 57 +++++++++++++++++++++++++++++++++++++++---
drivers/hv/hv_common.c | 13 ++++++++++
2 files changed, 67 insertions(+), 3 deletions(-)
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index de6708dbe0df..ec6e35a0d9bf 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -20,6 +20,7 @@
#include <linux/interrupt.h>
#include <clocksource/hyperv_timer.h>
#include <asm/mshyperv.h>
+#include <linux/set_memory.h>
#include "hyperv_vmbus.h"
/* The one and only */
@@ -78,7 +79,7 @@ int hv_post_message(union hv_connection_id connection_id,
int hv_synic_alloc(void)
{
- int cpu;
+ int cpu, ret = -ENOMEM;
struct hv_per_cpu_context *hv_cpu;
/*
@@ -123,26 +124,76 @@ int hv_synic_alloc(void)
goto err;
}
}
+
+ if (hv_isolation_type_en_snp()) {
+ ret = set_memory_decrypted((unsigned long)
+ hv_cpu->synic_message_page, 1);
+ if (ret) {
+ pr_err("Failed to decrypt SYNIC msg page: %d\n", ret);
+ hv_cpu->synic_message_page = NULL;
+
+ /*
+ * Free the event page here so that hv_synic_free()
+ * won't later try to re-encrypt it.
+ */
+ free_page((unsigned long)hv_cpu->synic_event_page);
+ hv_cpu->synic_event_page = NULL;
+ goto err;
+ }
+
+ ret = set_memory_decrypted((unsigned long)
+ hv_cpu->synic_event_page, 1);
+ if (ret) {
+ pr_err("Failed to decrypt SYNIC event page: %d\n", ret);
+ hv_cpu->synic_event_page = NULL;
+ goto err;
+ }
+
+ memset(hv_cpu->synic_message_page, 0, PAGE_SIZE);
+ memset(hv_cpu->synic_event_page, 0, PAGE_SIZE);
+ }
}
return 0;
+
err:
/*
* Any memory allocations that succeeded will be freed when
* the caller cleans up by calling hv_synic_free()
*/
- return -ENOMEM;
+ return ret;
}
void hv_synic_free(void)
{
- int cpu;
+ int cpu, ret;
for_each_present_cpu(cpu) {
struct hv_per_cpu_context *hv_cpu
= per_cpu_ptr(hv_context.cpu_context, cpu);
+ /* It's better to leak the page if the encryption fails. */
+ if (hv_isolation_type_en_snp()) {
+ if (hv_cpu->synic_message_page) {
+ ret = set_memory_encrypted((unsigned long)
+ hv_cpu->synic_message_page, 1);
+ if (ret) {
+ pr_err("Failed to encrypt SYNIC msg page: %d\n", ret);
+ hv_cpu->synic_message_page = NULL;
+ }
+ }
+
+ if (hv_cpu->synic_event_page) {
+ ret = set_memory_encrypted((unsigned long)
+ hv_cpu->synic_event_page, 1);
+ if (ret) {
+ pr_err("Failed to encrypt SYNIC event page: %d\n", ret);
+ hv_cpu->synic_event_page = NULL;
+ }
+ }
+ }
+
free_page((unsigned long)hv_cpu->synic_event_page);
free_page((unsigned long)hv_cpu->synic_message_page);
}
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 4b4aa53c34c2..2d43ba2bc925 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -24,6 +24,7 @@
#include <linux/kmsg_dump.h>
#include <linux/slab.h>
#include <linux/dma-map-ops.h>
+#include <linux/set_memory.h>
#include <asm/hyperv-tlfs.h>
#include <asm/mshyperv.h>
@@ -359,6 +360,7 @@ int hv_common_cpu_init(unsigned int cpu)
u64 msr_vp_index;
gfp_t flags;
int pgcount = hv_root_partition ? 2 : 1;
+ int ret;
/* hv_cpu_init() can be called with IRQs disabled from hv_resume() */
flags = irqs_disabled() ? GFP_ATOMIC : GFP_KERNEL;
@@ -378,6 +380,17 @@ int hv_common_cpu_init(unsigned int cpu)
outputarg = (void **)this_cpu_ptr(hyperv_pcpu_output_arg);
*outputarg = (char *)(*inputarg) + HV_HYP_PAGE_SIZE;
}
+
+ if (hv_isolation_type_en_snp()) {
+ ret = set_memory_decrypted((unsigned long)*inputarg, pgcount);
+ if (ret) {
+ kfree(*inputarg);
+ *inputarg = NULL;
+ return ret;
+ }
+
+ memset(*inputarg, 0x00, pgcount * PAGE_SIZE);
+ }
}
msr_vp_index = hv_get_register(HV_REGISTER_VP_INDEX);
--
2.25.1
^ permalink raw reply related
* [PATCH V2 2/9] x86/hyperv: Set Virtual Trust Level in VMBus init message
From: Tianyu Lan @ 2023-06-27 3:22 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, daniel.lezcano, arnd, michael.h.kelley
Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel, vkuznets
In-Reply-To: <20230627032248.2170007-1-ltykernel@gmail.com>
From: Tianyu Lan <tiala@microsoft.com>
SEV-SNP guest provides vtl(Virtual Trust Level) and
get it from Hyper-V hvcall via register hvcall HVCALL_
GET_VP_REGISTERS.
During initialization of VMBus, vtl needs to be set in the
VMBus init message.
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
---
arch/x86/hyperv/hv_init.c | 36 ++++++++++++++++++++++++++++++
arch/x86/include/asm/hyperv-tlfs.h | 7 ++++++
drivers/hv/connection.c | 1 +
include/asm-generic/mshyperv.h | 1 +
include/linux/hyperv.h | 4 ++--
5 files changed, 47 insertions(+), 2 deletions(-)
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 6c04b52f139b..1ba367a9686e 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -378,6 +378,40 @@ static void __init hv_get_partition_id(void)
local_irq_restore(flags);
}
+static u8 __init get_vtl(void)
+{
+ u64 control = HV_HYPERCALL_REP_COMP_1 | HVCALL_GET_VP_REGISTERS;
+ struct hv_get_vp_registers_input *input;
+ struct hv_get_vp_registers_output *output;
+ u64 vtl = 0;
+ u64 ret;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ output = (struct hv_get_vp_registers_output *)input;
+ if (!input) {
+ local_irq_restore(flags);
+ goto done;
+ }
+
+ memset(input, 0, struct_size(input, element, 1));
+ input->header.partitionid = HV_PARTITION_ID_SELF;
+ input->header.vpindex = HV_VP_INDEX_SELF;
+ input->header.inputvtl = 0;
+ input->element[0].name0 = HV_X64_REGISTER_VSM_VP_STATUS;
+
+ ret = hv_do_hypercall(control, input, output);
+ if (hv_result_success(ret))
+ vtl = output->as64.low & HV_X64_VTL_MASK;
+ else
+ pr_err("Hyper-V: failed to get VTL! %lld", ret);
+ local_irq_restore(flags);
+
+done:
+ return vtl;
+}
+
/*
* This function is to be invoked early in the boot sequence after the
* hypervisor has been detected.
@@ -506,6 +540,8 @@ void __init hyperv_init(void)
/* Query the VMs extended capability once, so that it can be cached. */
hv_query_ext_cap(0);
+ /* Find the VTL */
+ ms_hyperv.vtl = get_vtl();
return;
clean_guest_os_id:
diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index cea95dcd27c2..4bf0b315b0ce 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -301,6 +301,13 @@ enum hv_isolation_type {
#define HV_X64_MSR_TIME_REF_COUNT HV_REGISTER_TIME_REF_COUNT
#define HV_X64_MSR_REFERENCE_TSC HV_REGISTER_REFERENCE_TSC
+/*
+ * Registers are only accessible via HVCALL_GET_VP_REGISTERS hvcall and
+ * there is not associated MSR address.
+ */
+#define HV_X64_REGISTER_VSM_VP_STATUS 0x000D0003
+#define HV_X64_VTL_MASK GENMASK(3, 0)
+
/* Hyper-V memory host visibility */
enum hv_mem_host_visibility {
VMBUS_PAGE_NOT_VISIBLE = 0,
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 5978e9dbc286..02b54f85dc60 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -98,6 +98,7 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo, u32 version)
*/
if (version >= VERSION_WIN10_V5) {
msg->msg_sint = VMBUS_MESSAGE_SINT;
+ msg->msg_vtl = ms_hyperv.vtl;
vmbus_connection.msg_conn_id = VMBUS_MESSAGE_CONNECTION_ID_4;
} else {
msg->interrupt_page = virt_to_phys(vmbus_connection.int_page);
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index 6b5c41f90398..f73a044ecaa7 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -54,6 +54,7 @@ struct ms_hyperv_info {
};
};
u64 shared_gpa_boundary;
+ u8 vtl;
};
extern struct ms_hyperv_info ms_hyperv;
extern bool hv_nested;
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index bfbc37ce223b..1f2bfec4abde 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -665,8 +665,8 @@ struct vmbus_channel_initiate_contact {
u64 interrupt_page;
struct {
u8 msg_sint;
- u8 padding1[3];
- u32 padding2;
+ u8 msg_vtl;
+ u8 reserved[6];
};
};
u64 monitor_page1;
--
2.25.1
^ permalink raw reply related
* [PATCH V2 1/9] x86/hyperv: Add sev-snp enlightened guest static key
From: Tianyu Lan @ 2023-06-27 3:22 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, daniel.lezcano, arnd, michael.h.kelley
Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel, vkuznets
In-Reply-To: <20230627032248.2170007-1-ltykernel@gmail.com>
From: Tianyu Lan <tiala@microsoft.com>
Introduce static key isolation_type_en_snp for enlightened
sev-snp guest check.
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
---
arch/x86/hyperv/ivm.c | 11 +++++++++++
arch/x86/include/asm/mshyperv.h | 3 +++
arch/x86/kernel/cpu/mshyperv.c | 9 +++++++--
drivers/hv/hv_common.c | 6 ++++++
include/asm-generic/mshyperv.h | 12 +++++++++---
5 files changed, 36 insertions(+), 5 deletions(-)
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index cc92388b7a99..5d3ee3124e00 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -409,3 +409,14 @@ bool hv_isolation_type_snp(void)
{
return static_branch_unlikely(&isolation_type_snp);
}
+
+DEFINE_STATIC_KEY_FALSE(isolation_type_en_snp);
+/*
+ * hv_isolation_type_en_snp - Check system runs in the AMD SEV-SNP based
+ * isolation enlightened VM.
+ */
+bool hv_isolation_type_en_snp(void)
+{
+ return static_branch_unlikely(&isolation_type_en_snp);
+}
+
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 49bb4f2bd300..31c476f4e656 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -26,6 +26,7 @@
union hv_ghcb;
DECLARE_STATIC_KEY_FALSE(isolation_type_snp);
+DECLARE_STATIC_KEY_FALSE(isolation_type_en_snp);
typedef int (*hyperv_fill_flush_list_func)(
struct hv_guest_mapping_flush_list *flush,
@@ -45,6 +46,8 @@ extern void *hv_hypercall_pg;
extern u64 hv_current_partition_id;
+extern bool hv_isolation_type_en_snp(void);
+
extern union hv_ghcb * __percpu *hv_ghcb_pg;
int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index c7969e806c64..5398fb2f4d39 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -402,8 +402,12 @@ static void __init ms_hyperv_init_platform(void)
pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 0x%x\n",
ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b);
- if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP)
+
+ if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
+ static_branch_enable(&isolation_type_en_snp);
+ } else if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP) {
static_branch_enable(&isolation_type_snp);
+ }
}
if (hv_max_functions_eax >= HYPERV_CPUID_NESTED_FEATURES) {
@@ -473,7 +477,8 @@ static void __init ms_hyperv_init_platform(void)
#if IS_ENABLED(CONFIG_HYPERV)
if ((hv_get_isolation_type() == HV_ISOLATION_TYPE_VBS) ||
- (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP))
+ ((hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP) &&
+ ms_hyperv.paravisor_present))
hv_vtom_init();
/*
* Setup the hook to get control post apic initialization.
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 542a1d53b303..4b4aa53c34c2 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -502,6 +502,12 @@ bool __weak hv_isolation_type_snp(void)
}
EXPORT_SYMBOL_GPL(hv_isolation_type_snp);
+bool __weak hv_isolation_type_en_snp(void)
+{
+ return false;
+}
+EXPORT_SYMBOL_GPL(hv_isolation_type_en_snp);
+
void __weak hv_setup_vmbus_handler(void (*handler)(void))
{
}
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index 402a8c1c202d..6b5c41f90398 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -36,15 +36,21 @@ struct ms_hyperv_info {
u32 nested_features;
u32 max_vp_index;
u32 max_lp_index;
- u32 isolation_config_a;
+ union {
+ u32 isolation_config_a;
+ struct {
+ u32 paravisor_present : 1;
+ u32 reserved_a1 : 31;
+ };
+ };
union {
u32 isolation_config_b;
struct {
u32 cvm_type : 4;
- u32 reserved1 : 1;
+ u32 reserved_b1 : 1;
u32 shared_gpa_boundary_active : 1;
u32 shared_gpa_boundary_bits : 6;
- u32 reserved2 : 20;
+ u32 reserved_b2 : 20;
};
};
u64 shared_gpa_boundary;
--
2.25.1
^ permalink raw reply related
* [PATCH V2 0/9] x86/hyperv: Add AMD sev-snp enlightened guest support on hyperv
From: Tianyu Lan @ 2023-06-27 3:22 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, daniel.lezcano, arnd, michael.h.kelley
Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel, vkuznets
From: Tianyu Lan <tiala@microsoft.com>
Hyper-V provides two modes for running SEV-SNP VMs:
1) In vTOM mode with a paravisor (see Section 15.36.8 of [1])
2) In "fully enlightened" mode with normal "C" bit control
over page encryption, and no paravisor
For #1, the paravisor runs in VMPL 0, while Linux runs in VMPL 2
(see Section 15.36.7 of [1]). The paravisor is typically provided
by Hyper-V and handles most of the SNP-related functionality. As
such, most of the SNP functionality in the Linux guest is bypassed.
The guest operates in vTOM mode, where encryption is enabled by default.
The guest must still request page transitions between private and shared,
but there is relatively less SNP machinery required in the guest. Support
for this mode of operation first went upstream in the 5.15 kernel.
For #2, this patch set provides the initial support. The existing
SEV-SNP machinery in the kernel is fully used, but Hyper-V specific
updates are required to properly share Hyper-V communication pages
between the guest and host and to start APs at boot time.
In either mode, Hyper-V requires that the guest implement the SEV-SNP
Restricted Interrupt Injection feature (see Section 15.36.16 of [1],
and Section 5 of [2]). Without this feature, the guest is subject to
attack by a compromised hypervisor that can inject any exception at
any time, such as injecting an interrupt while the guest has interrupts
disabled. In vTOM mode, Restricted Interrupt Injection is implemented
by the paravisor, so no Linux guest changes are required. But in fully
enlightened mode, the Linux guest must provide the implementation.
This patch set is derived from an earlier patch set that includes both
the Hyper-V specific changes and Restricted Interrupt Injection support.[3]
But it is now limited to only the Hyper-V specific changes. The Restricted
Interrupt Injection support will come later in a separate patch set.
[1] https://www.amd.com/system/files/TechDocs/24593.pdf
[2] https://www.amd.com/system/files/TechDocs/56421-guest-hypervisor-communication-block-standardization.pdf
[3] https://lore.kernel.org/lkml/20230515165917.1306922-1-ltykernel@gmail.com/
Change since v1:
* vTOM case uses paravisor_present flag and
HV_ISOLATION_TYPE_SNP type.
* Rework some patches' change log
* Fix some comments in the patches
Tianyu Lan (9):
x86/hyperv: Add sev-snp enlightened guest static key
x86/hyperv: Set Virtual Trust Level in VMBus init message
x86/hyperv: Mark Hyper-V vp assist page unencrypted in SEV-SNP
enlightened guest
drivers: hv: Mark percpu hvcall input arg page unencrypted in SEV-SNP
enlightened guest
x86/hyperv: Use vmmcall to implement Hyper-V hypercall in sev-snp
enlightened guest
clocksource: hyper-v: Mark hyperv tsc page unencrypted in sev-snp
enlightened guest
x86/hyperv: Initialize cpu and memory for SEV-SNP enlightened guest
x86/hyperv: Add smp support for SEV-SNP guest
x86/hyperv: Add hyperv-specific handling for VMMCALL under SEV-ES
arch/x86/hyperv/hv_init.c | 52 +++++++-
arch/x86/hyperv/ivm.c | 199 +++++++++++++++++++++++++++++
arch/x86/include/asm/hyperv-tlfs.h | 7 +
arch/x86/include/asm/mshyperv.h | 73 +++++++++--
arch/x86/kernel/cpu/mshyperv.c | 42 +++++-
drivers/clocksource/hyperv_timer.c | 2 +-
drivers/hv/connection.c | 1 +
drivers/hv/hv.c | 57 ++++++++-
drivers/hv/hv_common.c | 19 +++
include/asm-generic/hyperv-tlfs.h | 1 +
include/asm-generic/mshyperv.h | 13 +-
include/linux/hyperv.h | 4 +-
12 files changed, 445 insertions(+), 25 deletions(-)
--
2.25.1
^ permalink raw reply
* [Patch v3] net: mana: Batch ringing RX queue doorbell on receiving packets
From: longli @ 2023-06-26 23:57 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky, Ajay Sharma, Dexuan Cui,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: linux-rdma, linux-hyperv, netdev, linux-kernel, Long Li, stable
From: Long Li <longli@microsoft.com>
It's inefficient to ring the doorbell page every time a WQE is posted to
the received queue. Excessive MMIO writes result in CPU spending more
time waiting on LOCK instructions (atomic operations), resulting in
poor scaling performance.
Move the code for ringing doorbell page to where after we have posted all
WQEs to the receive queue during a callback from napi_poll().
With this change, tests showed an improvement from 120G/s to 160G/s on a
200G physical link, with 16 or 32 hardware queues.
Tests showed no regression in network latency benchmarks on single
connection.
While we are making changes in this code path, change the code for
ringing doorbell to set the WQE_COUNT to 0 for Receive Queue. The
hardware specification specifies that it should set to 0. Although
currently the hardware doesn't enforce the check, in the future releases
it may do.
Cc: stable@vger.kernel.org
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
---
Change log:
v2:
Check for comp_read > 0 as it might be negative on completion error.
Set rq.wqe_cnt to 0 according to BNIC spec.
v3:
Add details in the commit on the reason of performance increase and test numbers.
Add details in the commit on why rq.wqe_cnt should be set to 0 according to hardware spec.
Add "Reviewed-by" from Haiyang and Dexuan.
drivers/net/ethernet/microsoft/mana/gdma_main.c | 5 ++++-
drivers/net/ethernet/microsoft/mana/mana_en.c | 10 ++++++++--
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 8f3f78b68592..3765d3389a9a 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -300,8 +300,11 @@ static void mana_gd_ring_doorbell(struct gdma_context *gc, u32 db_index,
void mana_gd_wq_ring_doorbell(struct gdma_context *gc, struct gdma_queue *queue)
{
+ /* Hardware Spec specifies that software client should set 0 for
+ * wqe_cnt for Receive Queues. This value is not used in Send Queues.
+ */
mana_gd_ring_doorbell(gc, queue->gdma_dev->doorbell, queue->type,
- queue->id, queue->head * GDMA_WQE_BU_SIZE, 1);
+ queue->id, queue->head * GDMA_WQE_BU_SIZE, 0);
}
void mana_gd_ring_cq(struct gdma_queue *cq, u8 arm_bit)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index cd4d5ceb9f2d..1d8abe63fcb8 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1383,8 +1383,8 @@ static void mana_post_pkt_rxq(struct mana_rxq *rxq)
recv_buf_oob = &rxq->rx_oobs[curr_index];
- err = mana_gd_post_and_ring(rxq->gdma_rq, &recv_buf_oob->wqe_req,
- &recv_buf_oob->wqe_inf);
+ err = mana_gd_post_work_request(rxq->gdma_rq, &recv_buf_oob->wqe_req,
+ &recv_buf_oob->wqe_inf);
if (WARN_ON_ONCE(err))
return;
@@ -1654,6 +1654,12 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
mana_process_rx_cqe(rxq, cq, &comp[i]);
}
+ if (comp_read > 0) {
+ struct gdma_context *gc = rxq->gdma_rq->gdma_dev->gdma_context;
+
+ mana_gd_wq_ring_doorbell(gc, rxq->gdma_rq);
+ }
+
if (rxq->xdp_flush)
xdp_do_flush();
}
--
2.34.1
^ permalink raw reply related
* RE: [PATCH v2] net: mana: Batch ringing RX queue doorbell on receiving packets
From: Long Li @ 2023-06-26 22:55 UTC (permalink / raw)
To: Jakub Kicinski, longli@linuxonhyperv.com
Cc: Jason Gunthorpe, Leon Romanovsky, Ajay Sharma, Dexuan Cui,
KY Srinivasan, Haiyang Zhang, Wei Liu, David S. Miller,
Eric Dumazet, Paolo Abeni, linux-rdma@vger.kernel.org,
linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <20230624151037.699c50c6@kernel.org>
> Subject: Re: [PATCH v2] net: mana: Batch ringing RX queue doorbell on
> receiving packets
>
> On Thu, 22 Jun 2023 09:22:36 -0700 longli@linuxonhyperv.com wrote:
> > It's inefficient to ring the doorbell page every time a WQE is posted
> > to the received queue.
> >
> > Move the code for ringing doorbell page to where after we have posted
> > all WQEs to the receive queue during a callback from napi_poll().
> >
> > Tests showed no regression in network latency benchmarks.
> >
> > Cc: stable@vger.kernel.org
> > Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure
> > Network Adapter (MANA)")
>
> If this is supposed to be a fix, you need to clearly explain what the
> performance loss was, so that backporters can make an informed decision.
I'm sending v3 to include the details.
>
> > drivers/net/ethernet/microsoft/mana/gdma_main.c | 5 ++++-
> > drivers/net/ethernet/microsoft/mana/mana_en.c | 10 ++++++++--
> > 2 files changed, 12 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 8f3f78b68592..ef11d09a3655 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > @@ -300,8 +300,11 @@ static void mana_gd_ring_doorbell(struct
> > gdma_context *gc, u32 db_index,
> >
> > void mana_gd_wq_ring_doorbell(struct gdma_context *gc, struct
> > gdma_queue *queue) {
> > + /* BNIC Spec specifies that client should set 0 for rq.wqe_cnt
> > + * This value is not used in sq
> > + */
> > mana_gd_ring_doorbell(gc, queue->gdma_dev->doorbell, queue-
> >type,
> > - queue->id, queue->head * GDMA_WQE_BU_SIZE,
> 1);
> > + queue->id, queue->head * GDMA_WQE_BU_SIZE,
> 0);
> > }
>
> This change needs to be explained in the commit message, or should be a
> separate patch.
Will explain the change in the commit message.
Thanks,
Long
> --
> pw-bot: cr
^ permalink raw reply
* [PATCH AUTOSEL 5.4 5/6] scsi: storvsc: Always set no_report_opcodes
From: Sasha Levin @ 2023-06-26 21:51 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Michael Kelley, Martin K . Petersen, Sasha Levin, kys, haiyangz,
wei.liu, decui, jejb, linux-hyperv, linux-scsi
In-Reply-To: <20230626215116.179581-1-sashal@kernel.org>
From: Michael Kelley <mikelley@microsoft.com>
[ Upstream commit 31d16e712bdcaee769de4780f72ff8d6cd3f0589 ]
Hyper-V synthetic SCSI devices do not support the MAINTENANCE_IN SCSI
command, so scsi_report_opcode() always fails, resulting in messages like
this:
hv_storvsc <guid>: tag#205 cmd 0xa3 status: scsi 0x2 srb 0x86 hv 0xc0000001
The recently added support for command duration limits calls
scsi_report_opcode() four times as each device comes online, which
significantly increases the number of messages logged in a system with many
disks.
Fix the problem by always marking Hyper-V synthetic SCSI devices as not
supporting scsi_report_opcode(). With this setting, the MAINTENANCE_IN SCSI
command is not issued and no messages are logged.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1686343101-18930-1-git-send-email-mikelley@microsoft.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/scsi/storvsc_drv.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 8d1b19b2322f5..a91ee2b03c382 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1423,6 +1423,8 @@ static int storvsc_device_configure(struct scsi_device *sdevice)
{
blk_queue_rq_timeout(sdevice->request_queue, (storvsc_timeout * HZ));
+ /* storvsc devices don't support MAINTENANCE_IN SCSI cmd */
+ sdevice->no_report_opcodes = 1;
sdevice->no_write_same = 1;
/*
--
2.39.2
^ permalink raw reply related
* [PATCH AUTOSEL 5.10 6/7] scsi: storvsc: Always set no_report_opcodes
From: Sasha Levin @ 2023-06-26 21:51 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Michael Kelley, Martin K . Petersen, Sasha Levin, kys, haiyangz,
wei.liu, decui, jejb, linux-hyperv, linux-scsi
In-Reply-To: <20230626215108.179483-1-sashal@kernel.org>
From: Michael Kelley <mikelley@microsoft.com>
[ Upstream commit 31d16e712bdcaee769de4780f72ff8d6cd3f0589 ]
Hyper-V synthetic SCSI devices do not support the MAINTENANCE_IN SCSI
command, so scsi_report_opcode() always fails, resulting in messages like
this:
hv_storvsc <guid>: tag#205 cmd 0xa3 status: scsi 0x2 srb 0x86 hv 0xc0000001
The recently added support for command duration limits calls
scsi_report_opcode() four times as each device comes online, which
significantly increases the number of messages logged in a system with many
disks.
Fix the problem by always marking Hyper-V synthetic SCSI devices as not
supporting scsi_report_opcode(). With this setting, the MAINTENANCE_IN SCSI
command is not issued and no messages are logged.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1686343101-18930-1-git-send-email-mikelley@microsoft.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/scsi/storvsc_drv.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 70b4868fe2f7d..17efb6c9e21f4 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1537,6 +1537,8 @@ static int storvsc_device_configure(struct scsi_device *sdevice)
{
blk_queue_rq_timeout(sdevice->request_queue, (storvsc_timeout * HZ));
+ /* storvsc devices don't support MAINTENANCE_IN SCSI cmd */
+ sdevice->no_report_opcodes = 1;
sdevice->no_write_same = 1;
/*
--
2.39.2
^ permalink raw reply related
* [PATCH AUTOSEL 5.15 7/9] scsi: storvsc: Always set no_report_opcodes
From: Sasha Levin @ 2023-06-26 21:50 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Michael Kelley, Martin K . Petersen, Sasha Levin, kys, haiyangz,
wei.liu, decui, jejb, linux-hyperv, linux-scsi
In-Reply-To: <20230626215057.179363-1-sashal@kernel.org>
From: Michael Kelley <mikelley@microsoft.com>
[ Upstream commit 31d16e712bdcaee769de4780f72ff8d6cd3f0589 ]
Hyper-V synthetic SCSI devices do not support the MAINTENANCE_IN SCSI
command, so scsi_report_opcode() always fails, resulting in messages like
this:
hv_storvsc <guid>: tag#205 cmd 0xa3 status: scsi 0x2 srb 0x86 hv 0xc0000001
The recently added support for command duration limits calls
scsi_report_opcode() four times as each device comes online, which
significantly increases the number of messages logged in a system with many
disks.
Fix the problem by always marking Hyper-V synthetic SCSI devices as not
supporting scsi_report_opcode(). With this setting, the MAINTENANCE_IN SCSI
command is not issued and no messages are logged.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1686343101-18930-1-git-send-email-mikelley@microsoft.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/scsi/storvsc_drv.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index c9b1500c2ab87..98db7bcbe2af1 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1626,6 +1626,8 @@ static int storvsc_device_configure(struct scsi_device *sdevice)
{
blk_queue_rq_timeout(sdevice->request_queue, (storvsc_timeout * HZ));
+ /* storvsc devices don't support MAINTENANCE_IN SCSI cmd */
+ sdevice->no_report_opcodes = 1;
sdevice->no_write_same = 1;
/*
--
2.39.2
^ permalink raw reply related
* [PATCH AUTOSEL 6.1 11/15] scsi: storvsc: Always set no_report_opcodes
From: Sasha Levin @ 2023-06-26 21:50 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Michael Kelley, Martin K . Petersen, Sasha Levin, kys, haiyangz,
wei.liu, decui, jejb, linux-hyperv, linux-scsi
In-Reply-To: <20230626215031.179159-1-sashal@kernel.org>
From: Michael Kelley <mikelley@microsoft.com>
[ Upstream commit 31d16e712bdcaee769de4780f72ff8d6cd3f0589 ]
Hyper-V synthetic SCSI devices do not support the MAINTENANCE_IN SCSI
command, so scsi_report_opcode() always fails, resulting in messages like
this:
hv_storvsc <guid>: tag#205 cmd 0xa3 status: scsi 0x2 srb 0x86 hv 0xc0000001
The recently added support for command duration limits calls
scsi_report_opcode() four times as each device comes online, which
significantly increases the number of messages logged in a system with many
disks.
Fix the problem by always marking Hyper-V synthetic SCSI devices as not
supporting scsi_report_opcode(). With this setting, the MAINTENANCE_IN SCSI
command is not issued and no messages are logged.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1686343101-18930-1-git-send-email-mikelley@microsoft.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/scsi/storvsc_drv.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 5284f9a0b826e..167f4d112a5f9 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1567,6 +1567,8 @@ static int storvsc_device_configure(struct scsi_device *sdevice)
{
blk_queue_rq_timeout(sdevice->request_queue, (storvsc_timeout * HZ));
+ /* storvsc devices don't support MAINTENANCE_IN SCSI cmd */
+ sdevice->no_report_opcodes = 1;
sdevice->no_write_same = 1;
/*
--
2.39.2
^ permalink raw reply related
* [PATCH AUTOSEL 6.3 11/16] scsi: storvsc: Always set no_report_opcodes
From: Sasha Levin @ 2023-06-26 21:49 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Michael Kelley, Martin K . Petersen, Sasha Levin, kys, haiyangz,
wei.liu, decui, jejb, linux-hyperv, linux-scsi
In-Reply-To: <20230626214956.178942-1-sashal@kernel.org>
From: Michael Kelley <mikelley@microsoft.com>
[ Upstream commit 31d16e712bdcaee769de4780f72ff8d6cd3f0589 ]
Hyper-V synthetic SCSI devices do not support the MAINTENANCE_IN SCSI
command, so scsi_report_opcode() always fails, resulting in messages like
this:
hv_storvsc <guid>: tag#205 cmd 0xa3 status: scsi 0x2 srb 0x86 hv 0xc0000001
The recently added support for command duration limits calls
scsi_report_opcode() four times as each device comes online, which
significantly increases the number of messages logged in a system with many
disks.
Fix the problem by always marking Hyper-V synthetic SCSI devices as not
supporting scsi_report_opcode(). With this setting, the MAINTENANCE_IN SCSI
command is not issued and no messages are logged.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1686343101-18930-1-git-send-email-mikelley@microsoft.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/scsi/storvsc_drv.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index e6bc622954cfa..659196a2f63ad 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1567,6 +1567,8 @@ static int storvsc_device_configure(struct scsi_device *sdevice)
{
blk_queue_rq_timeout(sdevice->request_queue, (storvsc_timeout * HZ));
+ /* storvsc devices don't support MAINTENANCE_IN SCSI cmd */
+ sdevice->no_report_opcodes = 1;
sdevice->no_write_same = 1;
/*
--
2.39.2
^ permalink raw reply related
* Re: [PATCH 1/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive
From: Stephen Hemminger @ 2023-06-26 20:47 UTC (permalink / raw)
To: Dexuan Cui
Cc: Simon Horman, souradeep chakrabarti, KY Srinivasan, Haiyang Zhang,
wei.liu@kernel.org, davem@davemloft.net, edumazet@google.com,
kuba@kernel.org, pabeni@redhat.com, Long Li, Ajay Sharma,
leon@kernel.org, cai.huoqing@linux.dev,
ssengar@linux.microsoft.com, vkuznets@redhat.com,
tglx@linutronix.de, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, stable@vger.kernel.org,
Souradeep Chakrabarti
In-Reply-To: <SA1PR21MB1335166153541BFDC2B7036BBF26A@SA1PR21MB1335.namprd21.prod.outlook.com>
On Mon, 26 Jun 2023 20:06:48 +0000
Dexuan Cui <decui@microsoft.com> wrote:
> > From: Simon Horman
> > Sent: Monday, June 26, 2023 6:05 AM
> > > ...
> > > Fixes: ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f (net: mana: Add a
> > > driver for
> > > Microsoft Azure Network Adapter)
> >
> > nit: A correct format of this fixes tag is:
> >
> > In particular:
> > * All on lone line
> > * Description in double quotes.
> >
> > Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network
> > Adapter (MANA)")
>
> Hi Souradeep, FYI I often refer to:
> https://marc.info/?l=linux-pci&m=150905742808166&w=2
>
> The link mentions:
> alias gsr='git --no-pager show -s --abbrev-commit --abbrev=12 --pretty=format:"%h (\"%s\")%n"'
>
> "gsr ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f" produces:
> ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
You can do same thing without shell alias by using git-config
[alias]
fixes = log -1 --format=fixes
gsr = log -1 --format=gsr
[pretty]
fixes = Fixes: %h (\"%s\")
gsr = %h (\"%s\")
Then:
$ git gsr 1919b39fc6eabb9a6f9a51706ff6d03865f5df29
1919b39fc6ea ("net: mana: Fix perf regression: remove rx_cqes, tx_cqes counters")
^ permalink raw reply
* RE: [PATCH 2/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive
From: Haiyang Zhang @ 2023-06-26 20:32 UTC (permalink / raw)
To: Praveen Kumar, souradeep chakrabarti, KY Srinivasan,
wei.liu@kernel.org, Dexuan Cui, davem@davemloft.net,
edumazet@google.com, kuba@kernel.org, pabeni@redhat.com, Long Li,
Ajay Sharma, leon@kernel.org, cai.huoqing@linux.dev,
ssengar@linux.microsoft.com, vkuznets@redhat.com,
tglx@linutronix.de, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org
Cc: stable@vger.kernel.org, Souradeep Chakrabarti
In-Reply-To: <578faf91-35e6-d946-d9ec-c949e57eadef@linux.microsoft.com>
> -----Original Message-----
> From: Praveen Kumar <kumarpraveen@linux.microsoft.com>
> Sent: Monday, June 26, 2023 10:13 AM
> To: souradeep chakrabarti <schakrabarti@linux.microsoft.com>; KY Srinivasan
> <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>;
> wei.liu@kernel.org; Dexuan Cui <decui@microsoft.com>;
> davem@davemloft.net; edumazet@google.com; kuba@kernel.org;
> pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay Sharma
> <sharmaajay@microsoft.com>; leon@kernel.org; cai.huoqing@linux.dev;
> ssengar@linux.microsoft.com; vkuznets@redhat.com; tglx@linutronix.de; linux-
> hyperv@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-rdma@vger.kernel.org
> Cc: stable@vger.kernel.org; Souradeep Chakrabarti
> <schakrabarti@microsoft.com>
> Subject: Re: [PATCH 2/2 V3 net] net: mana: Fix MANA VF unload when host is
> unresponsive
>
> On 6/26/2023 2:50 PM, souradeep chakrabarti wrote:
> > From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> >
> > This is the second part of the fix.
> >
> > Also this patch adds a new attribute in mana_context, which gets set when
> > mana_hwc_send_request() hits a timeout because of host unresponsiveness.
> > This flag then helps to avoid the timeouts in successive calls.
> >
> > Fixes: ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f (net: mana: Add a
> driver for
> > Microsoft Azure Network Adapter)
> > Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> > ---
> > V2 -> V3:
> > * Removed the initialization of vf_unload_timeout
> > * Splitted the patch in two.
> > * Fixed extra space from the commit message.
> > ---
> > drivers/net/ethernet/microsoft/mana/gdma_main.c | 4 +++-
> > drivers/net/ethernet/microsoft/mana/hw_channel.c | 12 +++++++++++-
> > include/net/mana/mana.h | 2 ++
> > 3 files changed, 16 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 8f3f78b68592..6411f01be0d9 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > @@ -946,10 +946,12 @@ int mana_gd_deregister_device(struct gdma_dev
> *gd)
> > struct gdma_context *gc = gd->gdma_context;
> > struct gdma_general_resp resp = {};
> > struct gdma_general_req req = {};
> > + struct mana_context *ac;
> > int err;
> >
> > if (gd->pdid == INVALID_PDID)
> > return -EINVAL;
> > + ac = gd->driver_data;
> >
> > mana_gd_init_req_hdr(&req.hdr, GDMA_DEREGISTER_DEVICE,
> sizeof(req),
> > sizeof(resp));
> > @@ -957,7 +959,7 @@ int mana_gd_deregister_device(struct gdma_dev *gd)
> > req.hdr.dev_id = gd->dev_id;
> >
> > err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
> > - if (err || resp.hdr.status) {
> > + if ((err || resp.hdr.status) && !ac->vf_unload_timeout) {
> > dev_err(gc->dev, "Failed to deregister device: %d, 0x%x\n",
> > err, resp.hdr.status);
>
> With !ac->vf_unload_timeout option, this message may not be correctly
> showing err, status. Probably you want to add explicit information during
> timeouts so that it give right information ? Or have the err, status field properly
> updated.
>
> > if (!err)
> > diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c
> b/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > index 9d1507eba5b9..492cb2c6e2cb 100644
> > --- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > +++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > @@ -1,8 +1,10 @@
> > // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> > /* Copyright (c) 2021, Microsoft Corporation. */
> >
> > +#include "asm-generic/errno.h"
> > #include <net/mana/gdma.h>
> > #include <net/mana/hw_channel.h>
> > +#include <net/mana/mana.h>
> >
> > static int mana_hwc_get_msg_index(struct hw_channel_context *hwc, u16
> *msg_id)
> > {
> > @@ -786,12 +788,19 @@ int mana_hwc_send_request(struct
> hw_channel_context *hwc, u32 req_len,
> > struct hwc_wq *txq = hwc->txq;
> > struct gdma_req_hdr *req_msg;
> > struct hwc_caller_ctx *ctx;
> > + struct mana_context *ac;
> > u32 dest_vrcq = 0;
> > u32 dest_vrq = 0;
> > u16 msg_id;
> > int err;
> >
> > mana_hwc_get_msg_index(hwc, &msg_id);
> > + ac = hwc->gdma_dev->driver_data;
>
> Is there a case where gdma_dev be invalid here ? If so, lets check the state and
> then proceed further ?
Yes, hwc->gdma_dev is assigned shortly after it's allocated - see the code below. So
it's valid.
But hwc->gdma_dev->driver_data is hwc, not "mana_context *ac". There are two
gdma_dev in gdma_context: hwc & mana.
You can get ac from: hwc->gdma_dev->gdma_context->mana.driver_data
Or, to avoid too many pointer deference, I suggest to put the vf_unload_timeout
into gdma_context.
int mana_hwc_create_channel(struct gdma_context *gc)
{
hwc = kzalloc(sizeof(*hwc), GFP_KERNEL);
...
gd->gdma_context = gc;
gd->driver_data = hwc;
hwc->gdma_dev = gd;
hwc->dev = gc->dev;
Also, mana_gd_send_request/mana_hwc_send_request() is used in many places,
not just unloading.
Should you use timeout value 5 sec, and the vf_unload_timeout flag in unloading
path only, and avoid touching other code paths? Please check with hostnet team for
suggestions.
If we decide to let the vf_unload_timeout flag affect all code paths, not just
unloading, then it should be renamed to hwc_timeout, and submit the second patch
separately.
If just use it for unloading, since mana_gd_deregister_device() is used by PF too,
name it like: unload_hwc_timeout.
Thanks,
-Haiyang
^ permalink raw reply
* RE: [PATCH 1/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive
From: Dexuan Cui @ 2023-06-26 20:06 UTC (permalink / raw)
To: Simon Horman, souradeep chakrabarti
Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, Long Li, Ajay Sharma, leon@kernel.org,
cai.huoqing@linux.dev, ssengar@linux.microsoft.com,
vkuznets@redhat.com, tglx@linutronix.de,
linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
stable@vger.kernel.org, Souradeep Chakrabarti
In-Reply-To: <ZJmNBKA3ygDryP7i@corigine.com>
> From: Simon Horman
> Sent: Monday, June 26, 2023 6:05 AM
> > ...
> > Fixes: ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f (net: mana: Add a
> > driver for
> > Microsoft Azure Network Adapter)
>
> nit: A correct format of this fixes tag is:
>
> In particular:
> * All on lone line
> * Description in double quotes.
>
> Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network
> Adapter (MANA)")
Hi Souradeep, FYI I often refer to:
https://marc.info/?l=linux-pci&m=150905742808166&w=2
The link mentions:
alias gsr='git --no-pager show -s --abbrev-commit --abbrev=12 --pretty=format:"%h (\"%s\")%n"'
"gsr ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f" produces:
ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox