Linux-HyperV List
 help / color / mirror / Atom feed
* Re: [PATCH net-next] tcp: Make pingpong threshold tunable
From: Neal Cardwell @ 2023-06-09 19:16 UTC (permalink / raw)
  To: Haiyang Zhang
  Cc: linux-hyperv, netdev, kys, olaf, vkuznets, davem, weiwan,
	tim.gardner, corbet, edumazet, kuba, pabeni, dsahern, atenart,
	bagasdotme, ykaliuta, kuniyu, stephen, simon.horman, maheshb,
	liushixin2, linux-doc, linux-kernel, Yuchung Cheng,
	Soheil Hassas Yeganeh
In-Reply-To: <1686327959-13478-1-git-send-email-haiyangz@microsoft.com>

On Fri, Jun 9, 2023 at 12:26 PM Haiyang Zhang <haiyangz@microsoft.com> wrote:

Regarding the patch title:
> [PATCH net-next] tcp: Make pingpong threshold tunable

There are many ways to make something tunable these days, including
BPF, setsockopt(), and sysctl. :-) This patch only uses sysctl. Please
consider a more clear/specific title, like:

   [PATCH net-next] tcp: set pingpong threshold via sysctl

> TCP pingpong threshold is 1 by default. But some applications, like SQL DB
> may prefer a higher pingpong threshold to activate delayed acks in quick
> ack mode for better performance.
>
> The pingpong threshold and related code were changed to 3 in the year
> 2019, and reverted to 1 in the year 2022.

Please include the specific commit, like:

The pingpong threshold and related code were changed to 3 in the year
 2019 in:
   commit 4a41f453bedf ("tcp: change pingpong threshold to 3")
and reverted to 1 in the year 2022 in:
  commit 4d8f24eeedc5 ("Revert "tcp: change pingpong threshold to 3"")

Then please make sure to use scripts/checkpatch.pl on your resulting
patch to check the formatting of the commit references, among other
things.

> There is no single value that
> fits all applications.
>
> Add net.core.tcp_pingpong_thresh sysctl tunable,

For consistency, TCP sysctls should be in net.ipv4 rather than
net.core. Yes, that is awkward, given IPv6 support. But consistency is
very important here. :-)

> so it can be tuned for
> optimal performance based on the application needs.
>
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
>  Documentation/admin-guide/sysctl/net.rst |  8 ++++++++
>  include/net/inet_connection_sock.h       | 14 +++++++++++---
>  net/core/sysctl_net_core.c               |  9 +++++++++
>  net/ipv4/tcp.c                           |  2 ++
>  net/ipv4/tcp_output.c                    | 17 +++++++++++++++--
>  5 files changed, 45 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> index 4877563241f3..16f54be9461f 100644
> --- a/Documentation/admin-guide/sysctl/net.rst
> +++ b/Documentation/admin-guide/sysctl/net.rst
> @@ -413,6 +413,14 @@ historical importance.
>
>  Default: 0
>
> +tcp_pingpong_thresh
> +-------------------
> +
> +TCP pingpong threshold is 1 by default, but some application may need a higher
> +threshold for optimal performance.
> +
> +Default: 1, min: 1, max: 3

If we want to make this tunable, it seems sad to make the max 3. I'd
suggest making the max 255, since we have 8 bits of space anyway in
the inet_csk(sk)->icsk_ack.pingpong field.

> +
>  2. /proc/sys/net/unix - Parameters for Unix domain sockets
>  ----------------------------------------------------------
>
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index c2b15f7e5516..e84e33ddae49 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -324,11 +324,11 @@ void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
>
>  struct dst_entry *inet_csk_update_pmtu(struct sock *sk, u32 mtu);
>
> -#define TCP_PINGPONG_THRESH    1
> +extern int tcp_pingpong_thresh;

To match most TCP sysctls, this should be per-namespace, rather than global.

Please follow a recent example by Eric, perhaps:
 65466904b015f6eeb9225b51aeb29b01a1d4b59c
  tcp: adjust TSO packet sizes based on min_rtt


>
>  static inline void inet_csk_enter_pingpong_mode(struct sock *sk)
>  {
> -       inet_csk(sk)->icsk_ack.pingpong = TCP_PINGPONG_THRESH;
> +       inet_csk(sk)->icsk_ack.pingpong = tcp_pingpong_thresh;
>  }

  inet_csk(sk)->icsk_ack.pingpong =  sock_net(sk)->sysctl_tcp_pingpong_thresh;

>  static inline void inet_csk_exit_pingpong_mode(struct sock *sk)
> @@ -338,7 +338,15 @@ static inline void inet_csk_exit_pingpong_mode(struct sock *sk)
>
>  static inline bool inet_csk_in_pingpong_mode(struct sock *sk)
>  {
> -       return inet_csk(sk)->icsk_ack.pingpong >= TCP_PINGPONG_THRESH;
> +       return inet_csk(sk)->icsk_ack.pingpong >= tcp_pingpong_thresh;
> +}

Again, sock_net(sk)->sysctl_tcp_pingpong_thresh rather than tcp_pingpong_thresh.

> +static inline void inet_csk_inc_pingpong_cnt(struct sock *sk)
> +{
> +       struct inet_connection_sock *icsk = inet_csk(sk);
> +
> +       if (icsk->icsk_ack.pingpong < U8_MAX)
> +               icsk->icsk_ack.pingpong++;
>  }
>
>  static inline bool inet_csk_has_ulp(struct sock *sk)
> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index 782273bb93c2..b5253567f2bd 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c
> @@ -653,6 +653,15 @@ static struct ctl_table net_core_table[] = {

Again, in net.ipv4, not net.core.

>                 .proc_handler   = proc_dointvec_minmax,
>                 .extra1         = SYSCTL_ZERO,
>         },
> +       {
> +               .procname       = "tcp_pingpong_thresh",
> +               .data           = &tcp_pingpong_thresh,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,
> +               .extra1         = SYSCTL_ONE,
> +               .extra2         = SYSCTL_THREE,

Please make the max U8_MAX to allow more flexibility (since we have 8
bits of space anyway in the inet_csk(sk)->icsk_ack.pingpong field).

> +       },
>         { }
>  };
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 53b7751b68e1..dcd143193d41 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -308,6 +308,8 @@ EXPORT_SYMBOL(tcp_have_smc);
>  struct percpu_counter tcp_sockets_allocated ____cacheline_aligned_in_smp;
>  EXPORT_SYMBOL(tcp_sockets_allocated);
>
> +int tcp_pingpong_thresh __read_mostly = 1;
> +

Again, per-network-namespace. You will need to initialize the
per-netns value in tcp_sk_init(). Again, see Eric's
65466904b015f6eeb9225b51aeb29b01a1d4b59c commit for an example.

>   * TCP splice context
>   */
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index cfe128b81a01..576d21621778 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -167,12 +167,25 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
>         if (tcp_packets_in_flight(tp) == 0)
>                 tcp_ca_event(sk, CA_EVENT_TX_START);
>
> +       /* If tcp_pingpong_thresh > 1, and
> +        * this is the first data packet sent in response to the
> +        * previous received data,
> +        * and it is a reply for ato after last received packet,
> +        * increase pingpong count.
> +        */
> +       if (tcp_pingpong_thresh > 1 &&
> +           before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
> +           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
> +               inet_csk_inc_pingpong_cnt(sk);
> +

Introducing this new code re-introduces a bug fixed in 4d8f24eeedc5.
As that commit description noted:

    This to-be-reverted commit was meant to apply a stricter rule for the
    stack to enter pingpong mode. However, the condition used to check for
    interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
    jiffy based and might be too coarse, which delays the stack entering
    pingpong mode.
    We revert this patch so that we no longer use the above condition to
    determine interactive session,

>         tp->lsndtime = now;
>
> -       /* If it is a reply for ato after last received
> +       /* If tcp_pingpong_thresh == 1, and

Please remove the "If tcp_pingpong_thresh == 1, and" part, since this
is the correct code path no matter the value of the threshold.

> +        * it is a reply for ato after last received
>          * packet, enter pingpong mode.
>          */
> -       if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
> +       if (tcp_pingpong_thresh == 1 &&

Please remove the "if (tcp_pingpong_thresh == 1 &&" part, since this
is the correct code path no matter the value of the threshold.

> +           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
>                 inet_csk_enter_pingpong_mode(sk);

Please make this call inet_csk_inc_pingpong_cnt(), since this is the
correct code path no matter the value of the threshold.

thanks,
neal

^ permalink raw reply

* Re: [PATCH net-next 3/3] tools: ynl: Remove duplicated include in devlink-user.c
From: Jakub Kicinski @ 2023-06-09 18:04 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Yang Li, simon.horman, davem, edumazet, pabeni, netdev,
	linux-hyperv, linux-kernel, Abaci Robot
In-Reply-To: <ZIMPLYi/xRih+DlC@nanopsycho>

On Fri, 9 Jun 2023 13:38:21 +0200 Jiri Pirko wrote:
> You are patching generated file, as the path suggests.
> See what the file header says:
> /* Do not edit directly, auto-generated from: */
> /*      Documentation/netlink/specs/devlink.yaml */

And the full fix is already on the list :(
https://lore.kernel.org/all/20230608211200.1247213-2-kuba@kernel.org/
Reverted...

^ permalink raw reply

* [PATCH net-next] tcp: Make pingpong threshold tunable
From: Haiyang Zhang @ 2023-06-09 16:25 UTC (permalink / raw)
  To: linux-hyperv, netdev
  Cc: haiyangz, kys, olaf, vkuznets, davem, weiwan, tim.gardner, corbet,
	edumazet, kuba, pabeni, dsahern, atenart, bagasdotme, ykaliuta,
	kuniyu, stephen, simon.horman, maheshb, liushixin2, linux-doc,
	linux-kernel

TCP pingpong threshold is 1 by default. But some applications, like SQL DB
may prefer a higher pingpong threshold to activate delayed acks in quick
ack mode for better performance.

The pingpong threshold and related code were changed to 3 in the year
2019, and reverted to 1 in the year 2022. There is no single value that
fits all applications.

Add net.core.tcp_pingpong_thresh sysctl tunable, so it can be tuned for
optimal performance based on the application needs.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 Documentation/admin-guide/sysctl/net.rst |  8 ++++++++
 include/net/inet_connection_sock.h       | 14 +++++++++++---
 net/core/sysctl_net_core.c               |  9 +++++++++
 net/ipv4/tcp.c                           |  2 ++
 net/ipv4/tcp_output.c                    | 17 +++++++++++++++--
 5 files changed, 45 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 4877563241f3..16f54be9461f 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -413,6 +413,14 @@ historical importance.
 
 Default: 0
 
+tcp_pingpong_thresh
+-------------------
+
+TCP pingpong threshold is 1 by default, but some application may need a higher
+threshold for optimal performance.
+
+Default: 1, min: 1, max: 3
+
 2. /proc/sys/net/unix - Parameters for Unix domain sockets
 ----------------------------------------------------------
 
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index c2b15f7e5516..e84e33ddae49 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -324,11 +324,11 @@ void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
 
 struct dst_entry *inet_csk_update_pmtu(struct sock *sk, u32 mtu);
 
-#define TCP_PINGPONG_THRESH	1
+extern int tcp_pingpong_thresh;
 
 static inline void inet_csk_enter_pingpong_mode(struct sock *sk)
 {
-	inet_csk(sk)->icsk_ack.pingpong = TCP_PINGPONG_THRESH;
+	inet_csk(sk)->icsk_ack.pingpong = tcp_pingpong_thresh;
 }
 
 static inline void inet_csk_exit_pingpong_mode(struct sock *sk)
@@ -338,7 +338,15 @@ static inline void inet_csk_exit_pingpong_mode(struct sock *sk)
 
 static inline bool inet_csk_in_pingpong_mode(struct sock *sk)
 {
-	return inet_csk(sk)->icsk_ack.pingpong >= TCP_PINGPONG_THRESH;
+	return inet_csk(sk)->icsk_ack.pingpong >= tcp_pingpong_thresh;
+}
+
+static inline void inet_csk_inc_pingpong_cnt(struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	if (icsk->icsk_ack.pingpong < U8_MAX)
+		icsk->icsk_ack.pingpong++;
 }
 
 static inline bool inet_csk_has_ulp(struct sock *sk)
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 782273bb93c2..b5253567f2bd 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -653,6 +653,15 @@ static struct ctl_table net_core_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 	},
+	{
+		.procname	= "tcp_pingpong_thresh",
+		.data		= &tcp_pingpong_thresh,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ONE,
+		.extra2		= SYSCTL_THREE,
+	},
 	{ }
 };
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 53b7751b68e1..dcd143193d41 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -308,6 +308,8 @@ EXPORT_SYMBOL(tcp_have_smc);
 struct percpu_counter tcp_sockets_allocated ____cacheline_aligned_in_smp;
 EXPORT_SYMBOL(tcp_sockets_allocated);
 
+int tcp_pingpong_thresh __read_mostly = 1;
+
 /*
  * TCP splice context
  */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cfe128b81a01..576d21621778 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -167,12 +167,25 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
 	if (tcp_packets_in_flight(tp) == 0)
 		tcp_ca_event(sk, CA_EVENT_TX_START);
 
+	/* If tcp_pingpong_thresh > 1, and
+	 * this is the first data packet sent in response to the
+	 * previous received data,
+	 * and it is a reply for ato after last received packet,
+	 * increase pingpong count.
+	 */
+	if (tcp_pingpong_thresh > 1 &&
+	    before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
+	    (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
+		inet_csk_inc_pingpong_cnt(sk);
+
 	tp->lsndtime = now;
 
-	/* If it is a reply for ato after last received
+	/* If tcp_pingpong_thresh == 1, and
+	 * it is a reply for ato after last received
 	 * packet, enter pingpong mode.
 	 */
-	if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
+	if (tcp_pingpong_thresh == 1 &&
+	    (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
 		inet_csk_enter_pingpong_mode(sk);
 }
 
-- 
2.25.1


^ permalink raw reply related

* [PATCH 1/1] clocksource: hyper-v: Rework clocksource and sched clock setup
From: Michael Kelley @ 2023-06-09 15:47 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, daniel.lezcano, tglx, linux-kernel,
	linux-hyperv
  Cc: mikelley

Current code assigns either the Hyper-V TSC page or MSR-based ref counter
as the sched clock. This may be sub-optimal in two cases. First, if there
is hardware support to ensure consistent TSC frequency across live
migrations and Hyper-V is using that support, the raw TSC is a faster
source of time than the Hyper-V TSC page.  Second, the MSR-based ref
counter is relatively slow because reads require a trap to the hypervisor.
As such, it should never be used as the sched clock. The native sched
clock based on the raw TSC or jiffies is much better.

Rework the sched clock setup so it is set to the TSC page only if
Hyper-V indicates that the TSC may have inconsistent frequency across
live migrations. Also, remove the code that sets the sched clock to
the MSR-based ref counter. In the cases where it is not set, the sched
clock will then be the native sched clock.

As part of the rework, always enable both the TSC page clocksource and
the MSR-based ref counter clocksource. Set the ratings so the TSC page
clocksource is preferred. While the MSR-based ref counter clocksource
is unlikely to ever be the default, having it available for manual
selection is convenient for development purposes.

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
---
 drivers/clocksource/hyperv_timer.c | 54 ++++++++++++++++----------------------
 1 file changed, 23 insertions(+), 31 deletions(-)

diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index d851970..e56307a 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -485,15 +485,9 @@ static u64 notrace read_hv_clock_msr_cs(struct clocksource *arg)
 	return read_hv_clock_msr();
 }
 
-static u64 noinstr read_hv_sched_clock_msr(void)
-{
-	return (read_hv_clock_msr() - hv_sched_clock_offset) *
-		(NSEC_PER_SEC / HV_CLOCK_HZ);
-}
-
 static struct clocksource hyperv_cs_msr = {
 	.name	= "hyperv_clocksource_msr",
-	.rating	= 500,
+	.rating	= 495,
 	.read	= read_hv_clock_msr_cs,
 	.mask	= CLOCKSOURCE_MASK(64),
 	.flags	= CLOCK_SOURCE_IS_CONTINUOUS,
@@ -523,7 +517,7 @@ static __always_inline void hv_setup_sched_clock(void *sched_clock)
 static __always_inline void hv_setup_sched_clock(void *sched_clock) {}
 #endif /* CONFIG_GENERIC_SCHED_CLOCK */
 
-static bool __init hv_init_tsc_clocksource(void)
+static void __init hv_init_tsc_clocksource(void)
 {
 	union hv_reference_tsc_msr tsc_msr;
 
@@ -534,17 +528,14 @@ static bool __init hv_init_tsc_clocksource(void)
 	 * Hyper-V Reference TSC rating, causing the generic TSC to be used.
 	 * TSC_INVARIANT is not offered on ARM64, so the Hyper-V Reference
 	 * TSC will be preferred over the virtualized ARM64 arch counter.
-	 * While the Hyper-V MSR clocksource won't be used since the
-	 * Reference TSC clocksource is present, change its rating as
-	 * well for consistency.
 	 */
 	if (ms_hyperv.features & HV_ACCESS_TSC_INVARIANT) {
 		hyperv_cs_tsc.rating = 250;
-		hyperv_cs_msr.rating = 250;
+		hyperv_cs_msr.rating = 245;
 	}
 
 	if (!(ms_hyperv.features & HV_MSR_REFERENCE_TSC_AVAILABLE))
-		return false;
+		return;
 
 	hv_read_reference_counter = read_hv_clock_tsc;
 
@@ -575,33 +566,34 @@ static bool __init hv_init_tsc_clocksource(void)
 
 	clocksource_register_hz(&hyperv_cs_tsc, NSEC_PER_SEC/100);
 
-	hv_sched_clock_offset = hv_read_reference_counter();
-	hv_setup_sched_clock(read_hv_sched_clock_tsc);
-
-	return true;
+	/*
+	 * If TSC is invariant, then let it stay as the sched clock since it
+	 * will be faster than reading the TSC page. But if not invariant, use
+	 * the TSC page so that live migrations across hosts with different
+	 * frequencies is handled correctly.
+	 */
+	if (!(ms_hyperv.features & HV_ACCESS_TSC_INVARIANT)) {
+		hv_sched_clock_offset = hv_read_reference_counter();
+		hv_setup_sched_clock(read_hv_sched_clock_tsc);
+	}
 }
 
 void __init hv_init_clocksource(void)
 {
 	/*
-	 * Try to set up the TSC page clocksource. If it succeeds, we're
-	 * done. Otherwise, set up the MSR clocksource.  At least one of
-	 * these will always be available except on very old versions of
-	 * Hyper-V on x86.  In that case we won't have a Hyper-V
+	 * Try to set up the TSC page clocksource, then the MSR clocksource.
+	 * At least one of these will always be available except on very old
+	 * versions of Hyper-V on x86.  In that case we won't have a Hyper-V
 	 * clocksource, but Linux will still run with a clocksource based
 	 * on the emulated PIT or LAPIC timer.
+	 *
+	 * Never use the MSR clocksource as sched clock.  It's too slow.
+	 * Better to use the native sched clock as the fallback.
 	 */
-	if (hv_init_tsc_clocksource())
-		return;
-
-	if (!(ms_hyperv.features & HV_MSR_TIME_REF_COUNT_AVAILABLE))
-		return;
-
-	hv_read_reference_counter = read_hv_clock_msr;
-	clocksource_register_hz(&hyperv_cs_msr, NSEC_PER_SEC/100);
+	hv_init_tsc_clocksource();
 
-	hv_sched_clock_offset = hv_read_reference_counter();
-	hv_setup_sched_clock(read_hv_sched_clock_msr);
+	if (ms_hyperv.features & HV_MSR_TIME_REF_COUNT_AVAILABLE)
+		clocksource_register_hz(&hyperv_cs_msr, NSEC_PER_SEC/100);
 }
 
 void __init hv_remap_tsc_clocksource(void)
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH net-next,V2] net: mana: Add support for vlan tagging
From: Haiyang Zhang @ 2023-06-09 12:47 UTC (permalink / raw)
  To: linux-hyperv, netdev
  Cc: haiyangz, decui, kys, paulros, olaf, vkuznets, davem, wei.liu,
	edumazet, kuba, pabeni, leon, longli, ssengar, linux-rdma, daniel,
	john.fastabend, bpf, ast, sharmaajay, hawk, tglx, shradhagupta,
	linux-kernel

To support vlan, use MANA_LONG_PKT_FMT if vlan tag is present in TX
skb. Then extract the vlan tag from the skb struct, and save it to
tx_oob for the NIC to transmit. For vlan tags on the payload, they
are accepted by the NIC too.

For RX, extract the vlan tag from CQE and put it into skb.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
V2:
Removed the code that extracts inband tag, because our NIC accepts
inband tags too.

---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index d907727c7b7a..cd4d5ceb9f2d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -179,6 +179,14 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 		pkg.tx_oob.s_oob.short_vp_offset = txq->vp_offset;
 	}
 
+	if (skb_vlan_tag_present(skb)) {
+		pkt_fmt = MANA_LONG_PKT_FMT;
+		pkg.tx_oob.l_oob.inject_vlan_pri_tag = 1;
+		pkg.tx_oob.l_oob.pcp = skb_vlan_tag_get_prio(skb);
+		pkg.tx_oob.l_oob.dei = skb_vlan_tag_get_cfi(skb);
+		pkg.tx_oob.l_oob.vlan_id = skb_vlan_tag_get_id(skb);
+	}
+
 	pkg.tx_oob.s_oob.pkt_fmt = pkt_fmt;
 
 	if (pkt_fmt == MANA_SHORT_PKT_FMT) {
@@ -1457,6 +1465,12 @@ static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
 			skb_set_hash(skb, hash_value, PKT_HASH_TYPE_L3);
 	}
 
+	if (cqe->rx_vlantag_present) {
+		u16 vlan_tci = cqe->rx_vlan_id;
+
+		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlan_tci);
+	}
+
 	u64_stats_update_begin(&rx_stats->syncp);
 	rx_stats->packets++;
 	rx_stats->bytes += pkt_len;
@@ -2451,8 +2465,9 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	ndev->hw_features |= NETIF_F_RXCSUM;
 	ndev->hw_features |= NETIF_F_TSO | NETIF_F_TSO6;
 	ndev->hw_features |= NETIF_F_RXHASH;
-	ndev->features = ndev->hw_features;
-	ndev->vlan_features = 0;
+	ndev->features = ndev->hw_features | NETIF_F_HW_VLAN_CTAG_TX |
+			 NETIF_F_HW_VLAN_CTAG_RX;
+	ndev->vlan_features = ndev->features;
 	ndev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
 			     NETDEV_XDP_ACT_NDO_XMIT;
 
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH net-next 3/3] tools: ynl: Remove duplicated include in devlink-user.c
From: Jiri Pirko @ 2023-06-09 11:38 UTC (permalink / raw)
  To: Yang Li
  Cc: simon.horman, davem, edumazet, kuba, pabeni, netdev, linux-hyperv,
	linux-kernel, Abaci Robot
In-Reply-To: <20230609085249.131071-1-yang.lee@linux.alibaba.com>

Fri, Jun 09, 2023 at 10:52:47AM CEST, yang.lee@linux.alibaba.com wrote:
>./tools/net/ynl/generated/devlink-user.c: stdlib.h is included more than once.
>
>Reported-by: Abaci Robot <abaci@linux.alibaba.com>
>Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5464
>Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
>---
> tools/net/ynl/generated/devlink-user.c | 1 -
> 1 file changed, 1 deletion(-)
>
>diff --git a/tools/net/ynl/generated/devlink-user.c b/tools/net/ynl/generated/devlink-user.c
>index c3204e20b971..18157afd7c73 100644
>--- a/tools/net/ynl/generated/devlink-user.c
>+++ b/tools/net/ynl/generated/devlink-user.c

You are patching generated file, as the path suggests.
See what the file header says:
/* Do not edit directly, auto-generated from: */
/*      Documentation/netlink/specs/devlink.yaml */


>@@ -8,7 +8,6 @@
> #include "ynl.h"
> #include <linux/devlink.h>
> 
>-#include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
> #include <libmnl/libmnl.h>
>-- 
>2.20.1.7.g153144c
>
>

^ permalink raw reply

* Re: [PATCH net-next 3/3] tools: ynl: Remove duplicated include in devlink-user.c
From: patchwork-bot+netdevbpf @ 2023-06-09 10:40 UTC (permalink / raw)
  To: Yang Li
  Cc: simon.horman, davem, edumazet, kuba, pabeni, netdev, linux-hyperv,
	linux-kernel, abaci
In-Reply-To: <20230609085249.131071-1-yang.lee@linux.alibaba.com>

Hello:

This series was applied to netdev/net-next.git (main)
by David S. Miller <davem@davemloft.net>:

On Fri,  9 Jun 2023 16:52:47 +0800 you wrote:
> ./tools/net/ynl/generated/devlink-user.c: stdlib.h is included more than once.
> 
> Reported-by: Abaci Robot <abaci@linux.alibaba.com>
> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5464
> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
> ---
>  tools/net/ynl/generated/devlink-user.c | 1 -
>  1 file changed, 1 deletion(-)

Here is the summary with links:
  - [net-next,3/3] tools: ynl: Remove duplicated include in devlink-user.c
    (no matching commit)
  - [net-next,2/3] tools: ynl: Remove duplicated include in handshake-user.c
    https://git.kernel.org/netdev/net-next/c/e7c5433c5aaa
  - [net-next,1/3] net: hv_netvsc: Remove duplicated include in rndis_filter.c
    (no matching commit)

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH 7/9] x86/hyperv: Initialize cpu and memory for SEV-SNP enlightened guest
From: Jeremi Piotrowski @ 2023-06-09  9:56 UTC (permalink / raw)
  To: Michael Kelley (LINUX)
  Cc: Tianyu Lan, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	daniel.lezcano@linaro.org, arnd@arndb.de, Tianyu Lan,
	linux-arch@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, vkuznets@redhat.com
In-Reply-To: <BYAPR21MB1688B4F74FF7B670267D32FAD750A@BYAPR21MB1688.namprd21.prod.outlook.com>

On Thu, Jun 08, 2023 at 01:51:35PM +0000, Michael Kelley (LINUX) wrote:
> From: Tianyu Lan <ltykernel@gmail.com> Sent: Thursday, June 1, 2023 8:16 AM
> > 
> > Hyper-V enlightened guest doesn't have boot loader support.
> > Boot Linux kernel directly from hypervisor with data(kernel
> 
> Add a space between "data" and "(kernel"
> 
> > image, initrd and parameter page) and memory for boot up that
> > is initialized via AMD SEV PSP proctol LAUNCH_UPDATE_DATA
> 
> s/proctol/protocol/
> 
> > (Please refernce https://www.amd.com/system/files/TechDocs/55766_SEV-KM_API_Specification.pdf 1.3.1 Launch).
> 
> s/refernce/reference/
> 
> And the link above didn't work for me -- the "55766_SEV-KM_API_Specification.pdf"
> part was separated from the rest of the URL, though it's possible the mangling
> was done by Microsoft's email system.  Please double check that the URL is
> correctly formatted with no spurious spaces.
> 
> Even better, maybe write this as:
> 
> Please reference Section 1.3.1 "Launch" of [1].
> 
> Then put the full link as [1] at the bottom of the commit message.
> 

Tianyu: that document is SEV specific, and does not have the parts that SEV-SNP
uses. For SNP this is the firmware ABI:

https://www.amd.com/system/files/TechDocs/56860.pdf

and the API I think you mean is SNP_LAUNCH_UPDATE.

It would also help to mention that the data at EN_SEV_SNP_PROCESSOR_INFO_ADDR
is loaded as PAGE_TYPE_UNMEASURED.

Thanks,
Jeremi

^ permalink raw reply

* [PATCH net-next 1/3] net: hv_netvsc: Remove duplicated include in rndis_filter.c
From: Yang Li @ 2023-06-09  8:52 UTC (permalink / raw)
  To: simon.horman
  Cc: davem, edumazet, kuba, pabeni, netdev, linux-hyperv, linux-kernel,
	Yang Li, Abaci Robot
In-Reply-To: <20230609085249.131071-1-yang.lee@linux.alibaba.com>

./drivers/net/hyperv/rndis_filter.c: linux/slab.h is included more than once.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5462
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
---
 drivers/net/hyperv/rndis_filter.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index af95947a87c5..ecc2128ca9b7 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -21,7 +21,6 @@
 #include <linux/rtnetlink.h>
 #include <linux/ucs2_string.h>
 #include <linux/string.h>
-#include <linux/slab.h>
 
 #include "hyperv_net.h"
 #include "netvsc_trace.h"
-- 
2.20.1.7.g153144c


^ permalink raw reply related

* [PATCH net-next 2/3] tools: ynl: Remove duplicated include in handshake-user.c
From: Yang Li @ 2023-06-09  8:52 UTC (permalink / raw)
  To: simon.horman
  Cc: davem, edumazet, kuba, pabeni, netdev, linux-hyperv, linux-kernel,
	Yang Li, Abaci Robot
In-Reply-To: <20230609085249.131071-1-yang.lee@linux.alibaba.com>

./tools/net/ynl/generated/handshake-user.c: stdlib.h is included more than once.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5464
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
---
 tools/net/ynl/generated/handshake-user.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/net/ynl/generated/handshake-user.c b/tools/net/ynl/generated/handshake-user.c
index fe99c4ef7373..7a1f0364b88f 100644
--- a/tools/net/ynl/generated/handshake-user.c
+++ b/tools/net/ynl/generated/handshake-user.c
@@ -8,7 +8,6 @@
 #include "ynl.h"
 #include <linux/handshake.h>
 
-#include <stdlib.h>
 #include <stdio.h>
 #include <string.h>
 #include <libmnl/libmnl.h>
-- 
2.20.1.7.g153144c


^ permalink raw reply related

* [PATCH net-next 3/3] tools: ynl: Remove duplicated include in devlink-user.c
From: Yang Li @ 2023-06-09  8:52 UTC (permalink / raw)
  To: simon.horman
  Cc: davem, edumazet, kuba, pabeni, netdev, linux-hyperv, linux-kernel,
	Yang Li, Abaci Robot

./tools/net/ynl/generated/devlink-user.c: stdlib.h is included more than once.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5464
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
---
 tools/net/ynl/generated/devlink-user.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/net/ynl/generated/devlink-user.c b/tools/net/ynl/generated/devlink-user.c
index c3204e20b971..18157afd7c73 100644
--- a/tools/net/ynl/generated/devlink-user.c
+++ b/tools/net/ynl/generated/devlink-user.c
@@ -8,7 +8,6 @@
 #include "ynl.h"
 #include <linux/devlink.h>
 
-#include <stdlib.h>
 #include <stdio.h>
 #include <string.h>
 #include <libmnl/libmnl.h>
-- 
2.20.1.7.g153144c


^ permalink raw reply related

* RE: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Wei Hu @ 2023-06-09  1:15 UTC (permalink / raw)
  To: Long Li, netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-rdma@vger.kernel.org, Ajay Sharma, jgg@ziepe.ca,
	leon@kernel.org, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, vkuznets@redhat.com,
	ssengar@linux.microsoft.com, shradhagupta@linux.microsoft.com
In-Reply-To: <PH7PR21MB3263782C842638253C1FDB0CCE50A@PH7PR21MB3263.namprd21.prod.outlook.com>



> -----Original Message-----
> From: Long Li <longli@microsoft.com>
> Sent: Friday, June 9, 2023 1:48 AM
> To: Wei Hu <weh@microsoft.com>; netdev@vger.kernel.org; linux-
> hyperv@vger.kernel.org; linux-rdma@vger.kernel.org; Ajay Sharma
> <sharmaajay@microsoft.com>; jgg@ziepe.ca; leon@kernel.org; KY
> Srinivasan <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>;
> wei.liu@kernel.org; Dexuan Cui <decui@microsoft.com>;
> davem@davemloft.net; edumazet@google.com; kuba@kernel.org;
> pabeni@redhat.com; vkuznets@redhat.com; ssengar@linux.microsoft.com;
> shradhagupta@linux.microsoft.com
> Subject: RE: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to
> mana ib driver.
> 
> > Subject: RE: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to
> > mana ib driver.
> >
> >
> >
> > > -----Original Message-----
> > > Subject: RE: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support
> > > to mana ib driver.
> > >
> > > > Subject: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to
> > > > mana ib driver.
> > > >
> > > > Add EQ interrupt support for mana ib driver. Allocate EQs per
> > > > ucontext to receive interrupt. Attach EQ when CQ is created. Call
> > > > CQ interrupt handler when completion interrupt happens. EQs are
> > > > destroyed when
> > > ucontext is deallocated.
> > > >
> > > > The change calls some public APIs in mana ethernet driver to
> > > > allocate EQs and other resources. Ehe EQ process routine is also
> > > > shared by mana ethernet and mana ib drivers.
> > > >
> > > > Co-developed-by: Ajay Sharma <sharmaajay@microsoft.com>
> > > > Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
> > > > Signed-off-by: Wei Hu <weh@microsoft.com>
> > > > ---
> > > >
> > > > v2: Use ibdev_dbg to print error messages and return -ENOMEN
> > > >     when kzalloc fails.
> > > >
> > > >  drivers/infiniband/hw/mana/cq.c               |  32 ++++-
> > > >  drivers/infiniband/hw/mana/main.c             |  87 ++++++++++++
> > > >  drivers/infiniband/hw/mana/mana_ib.h          |   4 +
> > > >  drivers/infiniband/hw/mana/qp.c               |  90 +++++++++++-
> > > >  .../net/ethernet/microsoft/mana/gdma_main.c   | 131 ++++++++++---
> ----
> > -
> > > >  drivers/net/ethernet/microsoft/mana/mana_en.c |   1 +
> > > >  include/net/mana/gdma.h                       |   9 +-
> > > >  7 files changed, 290 insertions(+), 64 deletions(-)
> > > >
> > > > diff --git a/drivers/infiniband/hw/mana/cq.c
> > > > b/drivers/infiniband/hw/mana/cq.c index
> d141cab8a1e6..3cd680e0e753
> > > > 100644
> > > > --- a/drivers/infiniband/hw/mana/cq.c
> > > > +++ b/drivers/infiniband/hw/mana/cq.c
> > > > @@ -12,13 +12,20 @@ int mana_ib_create_cq(struct ib_cq *ibcq,
> > > > const struct ib_cq_init_attr *attr,
> > > >  	struct ib_device *ibdev = ibcq->device;
> > > >  	struct mana_ib_create_cq ucmd = {};
> > > >  	struct mana_ib_dev *mdev;
> > > > +	struct gdma_context *gc;
> > > > +	struct gdma_dev *gd;
> > > >  	int err;
> > > >
> > > >  	mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
> > > > +	gd = mdev->gdma_dev;
> > > > +	gc = gd->gdma_context;
> > > >
> > > >  	if (udata->inlen < sizeof(ucmd))
> > > >  		return -EINVAL;
> > > >
> > > > +	cq->comp_vector = attr->comp_vector > gc->max_num_queues ?
> > > > +				0 : attr->comp_vector;
> > > > +
> > > >  	err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata-
> > > > >inlen));
> > > >  	if (err) {
> > > >  		ibdev_dbg(ibdev,
> > > > @@ -69,11 +76,32 @@ int mana_ib_destroy_cq(struct ib_cq *ibcq,
> > > > struct ib_udata *udata)
> > > >  	struct mana_ib_cq *cq = container_of(ibcq, struct mana_ib_cq, ibcq);
> > > >  	struct ib_device *ibdev = ibcq->device;
> > > >  	struct mana_ib_dev *mdev;
> > > > +	struct gdma_context *gc;
> > > > +	struct gdma_dev *gd;
> > > > +
> > > >
> > > >  	mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
> > > > +	gd = mdev->gdma_dev;
> > > > +	gc = gd->gdma_context;
> > > >
> > > > -	mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
> > > > -	ib_umem_release(cq->umem);
> > > > +
> > > > +
> > > > +	if (atomic_read(&ibcq->usecnt) == 0) {
> > > > +		mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
> > >
> > > Need to check if this function fails. The following code will call
> > > kfree(gc-
> > > >cq_table[cq->id]), it's possible that IRQ is happening at the same
> > > >time if CQ
> > > is not destroyed.
> > >
> >
> > Sure. Will update.
> >
> > > > +		ibdev_dbg(ibdev, "freeing gdma cq %p\n", gc->cq_table[cq-
> > > >id]);
> > > > +		kfree(gc->cq_table[cq->id]);
> > > > +		gc->cq_table[cq->id] = NULL;
> > > > +		ib_umem_release(cq->umem);
> > > > +	}
> > > >
> > > >  	return 0;
> > > >  }
> > > > +
> > > > +void mana_ib_cq_handler(void *ctx, struct gdma_queue *gdma_cq) {
> > > > +	struct mana_ib_cq *cq = ctx;
> > > > +	struct ib_device *ibdev = cq->ibcq.device;
> > > > +
> > > > +	ibdev_dbg(ibdev, "Enter %s %d\n", __func__, __LINE__);
> > >
> > > This debug message seems overkill?
> > >
> > > > +	cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); }
> > > > diff --git a/drivers/infiniband/hw/mana/main.c
> > > > b/drivers/infiniband/hw/mana/main.c
> > > > index 7be4c3adb4e2..e4efbcaed10e 100644
> > > > --- a/drivers/infiniband/hw/mana/main.c
> > > > +++ b/drivers/infiniband/hw/mana/main.c
> > > > @@ -143,6 +143,81 @@ int mana_ib_dealloc_pd(struct ib_pd *ibpd,
> > > > struct ib_udata *udata)
> > > >  	return err;
> > > >  }
> > > >
> > > > +static void mana_ib_destroy_eq(struct mana_ib_ucontext *ucontext,
> > > > +			       struct mana_ib_dev *mdev) {
> > > > +	struct gdma_context *gc = mdev->gdma_dev->gdma_context;
> > > > +	struct ib_device *ibdev = ucontext->ibucontext.device;
> > > > +	struct gdma_queue *eq;
> > > > +	int i;
> > > > +
> > > > +	if (!ucontext->eqs)
> > > > +		return;
> > > > +
> > > > +	for (i = 0; i < gc->max_num_queues; i++) {
> > > > +		eq = ucontext->eqs[i].eq;
> > > > +		if (!eq)
> > > > +			continue;
> > > > +
> > > > +		mana_gd_destroy_queue(gc, eq);
> > > > +	}
> > > > +
> > > > +	kfree(ucontext->eqs);
> > > > +	ucontext->eqs = NULL;
> > > > +
> > > > +	ibdev_dbg(ibdev, "destroyed eq's count %d\n", gc-
> > > >max_num_queues); }
> > >
> > > Will gc->max_num_queues change after destroying a EQ?
> > >
> >
> > I think it will not change. Also the compiler might optimize the code
> > to just read the value once and store it in a register.
> >
> > Thanks,
> > Wei
> 
> This message is confusing. How about changing it to " destroyed eq.
> Maximum count %d", or just remove the count as it's not informational.
> 
> Long

Sure. I will remove the count from the message.

Thanks,
Wei

^ permalink raw reply

* RE: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Long Li @ 2023-06-08 17:47 UTC (permalink / raw)
  To: Wei Hu, netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-rdma@vger.kernel.org, Ajay Sharma, jgg@ziepe.ca,
	leon@kernel.org, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, vkuznets@redhat.com,
	ssengar@linux.microsoft.com, shradhagupta@linux.microsoft.com
In-Reply-To: <SI2P153MB0441EC655394CEA3E8E727E7BB50A@SI2P153MB0441.APCP153.PROD.OUTLOOK.COM>

> Subject: RE: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to
> mana ib driver.
> 
> 
> 
> > -----Original Message-----
> > Subject: RE: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to
> > mana ib driver.
> >
> > > Subject: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to
> > > mana ib driver.
> > >
> > > Add EQ interrupt support for mana ib driver. Allocate EQs per
> > > ucontext to receive interrupt. Attach EQ when CQ is created. Call CQ
> > > interrupt handler when completion interrupt happens. EQs are
> > > destroyed when
> > ucontext is deallocated.
> > >
> > > The change calls some public APIs in mana ethernet driver to
> > > allocate EQs and other resources. Ehe EQ process routine is also
> > > shared by mana ethernet and mana ib drivers.
> > >
> > > Co-developed-by: Ajay Sharma <sharmaajay@microsoft.com>
> > > Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
> > > Signed-off-by: Wei Hu <weh@microsoft.com>
> > > ---
> > >
> > > v2: Use ibdev_dbg to print error messages and return -ENOMEN
> > >     when kzalloc fails.
> > >
> > >  drivers/infiniband/hw/mana/cq.c               |  32 ++++-
> > >  drivers/infiniband/hw/mana/main.c             |  87 ++++++++++++
> > >  drivers/infiniband/hw/mana/mana_ib.h          |   4 +
> > >  drivers/infiniband/hw/mana/qp.c               |  90 +++++++++++-
> > >  .../net/ethernet/microsoft/mana/gdma_main.c   | 131 ++++++++++-------
> -
> > >  drivers/net/ethernet/microsoft/mana/mana_en.c |   1 +
> > >  include/net/mana/gdma.h                       |   9 +-
> > >  7 files changed, 290 insertions(+), 64 deletions(-)
> > >
> > > diff --git a/drivers/infiniband/hw/mana/cq.c
> > > b/drivers/infiniband/hw/mana/cq.c index d141cab8a1e6..3cd680e0e753
> > > 100644
> > > --- a/drivers/infiniband/hw/mana/cq.c
> > > +++ b/drivers/infiniband/hw/mana/cq.c
> > > @@ -12,13 +12,20 @@ int mana_ib_create_cq(struct ib_cq *ibcq, const
> > > struct ib_cq_init_attr *attr,
> > >  	struct ib_device *ibdev = ibcq->device;
> > >  	struct mana_ib_create_cq ucmd = {};
> > >  	struct mana_ib_dev *mdev;
> > > +	struct gdma_context *gc;
> > > +	struct gdma_dev *gd;
> > >  	int err;
> > >
> > >  	mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
> > > +	gd = mdev->gdma_dev;
> > > +	gc = gd->gdma_context;
> > >
> > >  	if (udata->inlen < sizeof(ucmd))
> > >  		return -EINVAL;
> > >
> > > +	cq->comp_vector = attr->comp_vector > gc->max_num_queues ?
> > > +				0 : attr->comp_vector;
> > > +
> > >  	err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata-
> > > >inlen));
> > >  	if (err) {
> > >  		ibdev_dbg(ibdev,
> > > @@ -69,11 +76,32 @@ int mana_ib_destroy_cq(struct ib_cq *ibcq,
> > > struct ib_udata *udata)
> > >  	struct mana_ib_cq *cq = container_of(ibcq, struct mana_ib_cq, ibcq);
> > >  	struct ib_device *ibdev = ibcq->device;
> > >  	struct mana_ib_dev *mdev;
> > > +	struct gdma_context *gc;
> > > +	struct gdma_dev *gd;
> > > +
> > >
> > >  	mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
> > > +	gd = mdev->gdma_dev;
> > > +	gc = gd->gdma_context;
> > >
> > > -	mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
> > > -	ib_umem_release(cq->umem);
> > > +
> > > +
> > > +	if (atomic_read(&ibcq->usecnt) == 0) {
> > > +		mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
> >
> > Need to check if this function fails. The following code will call
> > kfree(gc-
> > >cq_table[cq->id]), it's possible that IRQ is happening at the same
> > >time if CQ
> > is not destroyed.
> >
> 
> Sure. Will update.
> 
> > > +		ibdev_dbg(ibdev, "freeing gdma cq %p\n", gc->cq_table[cq-
> > >id]);
> > > +		kfree(gc->cq_table[cq->id]);
> > > +		gc->cq_table[cq->id] = NULL;
> > > +		ib_umem_release(cq->umem);
> > > +	}
> > >
> > >  	return 0;
> > >  }
> > > +
> > > +void mana_ib_cq_handler(void *ctx, struct gdma_queue *gdma_cq) {
> > > +	struct mana_ib_cq *cq = ctx;
> > > +	struct ib_device *ibdev = cq->ibcq.device;
> > > +
> > > +	ibdev_dbg(ibdev, "Enter %s %d\n", __func__, __LINE__);
> >
> > This debug message seems overkill?
> >
> > > +	cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); }
> > > diff --git a/drivers/infiniband/hw/mana/main.c
> > > b/drivers/infiniband/hw/mana/main.c
> > > index 7be4c3adb4e2..e4efbcaed10e 100644
> > > --- a/drivers/infiniband/hw/mana/main.c
> > > +++ b/drivers/infiniband/hw/mana/main.c
> > > @@ -143,6 +143,81 @@ int mana_ib_dealloc_pd(struct ib_pd *ibpd,
> > > struct ib_udata *udata)
> > >  	return err;
> > >  }
> > >
> > > +static void mana_ib_destroy_eq(struct mana_ib_ucontext *ucontext,
> > > +			       struct mana_ib_dev *mdev) {
> > > +	struct gdma_context *gc = mdev->gdma_dev->gdma_context;
> > > +	struct ib_device *ibdev = ucontext->ibucontext.device;
> > > +	struct gdma_queue *eq;
> > > +	int i;
> > > +
> > > +	if (!ucontext->eqs)
> > > +		return;
> > > +
> > > +	for (i = 0; i < gc->max_num_queues; i++) {
> > > +		eq = ucontext->eqs[i].eq;
> > > +		if (!eq)
> > > +			continue;
> > > +
> > > +		mana_gd_destroy_queue(gc, eq);
> > > +	}
> > > +
> > > +	kfree(ucontext->eqs);
> > > +	ucontext->eqs = NULL;
> > > +
> > > +	ibdev_dbg(ibdev, "destroyed eq's count %d\n", gc-
> > >max_num_queues); }
> >
> > Will gc->max_num_queues change after destroying a EQ?
> >
> 
> I think it will not change. Also the compiler might optimize the code to just
> read the value once and store it in a register.
> 
> Thanks,
> Wei

This message is confusing. How about changing it to " destroyed eq. Maximum count %d", or just remove the count as it's not informational.

Long

^ permalink raw reply

* RE: [PATCH 1/1] x86/irq: Add hardcoded hypervisor interrupts to /proc/stat
From: Michael Kelley (LINUX) @ 2023-06-08 15:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, x86@kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <dce372bd-e63c-f24c-5b79-1ef65fd1e59a@intel.com>

From: Dave Hansen <dave.hansen@intel.com> Sent: Wednesday, March 22, 2023 11:07 AM
> 
> On 2/27/23 10:46, Michael Kelley wrote:
> > diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> > index 766ffe3..9f668d2 100644
> > --- a/arch/x86/kernel/irq.c
> > +++ b/arch/x86/kernel/irq.c
> > @@ -211,6 +211,13 @@ u64 arch_irq_stat_cpu(unsigned int cpu)
> >  #ifdef CONFIG_X86_MCE_THRESHOLD
> >  	sum += irq_stats(cpu)->irq_threshold_count;
> >  #endif
> > +#ifdef CONFIG_X86_HV_CALLBACK_VECTOR
> > +	sum += irq_stats(cpu)->irq_hv_callback_count;
> > +#endif
> > +#if IS_ENABLED(CONFIG_HYPERV)
> > +	sum += irq_stats(cpu)->irq_hv_reenlightenment_count;
> > +	sum += irq_stats(cpu)->hyperv_stimer0_count;
> > +#endif
> >  #ifdef CONFIG_X86_MCE
> >  	sum += per_cpu(mce_exception_count, cpu);
> >  	sum += per_cpu(mce_poll_count, cpu);
> 
> This seems fine, especially since arch_show_interrupts() has them.  But,
> what's with the "#if IS_ENABLED" versus the plain #ifdef?  Is there some
> difference I'm missing?  Why not just be consistent with the other code
> and use a plain #ifdef for both?

Dave --

With Sean's explanation for #if IS_ENABLED, are you OK with giving this
an ACK as an x86 maintainer?   This patch has been hanging around for a
while now ...

Michael

^ permalink raw reply

* Re: [PATCH 7/9] x86/hyperv: Initialize cpu and memory for SEV-SNP enlightened guest
From: Tianyu Lan @ 2023-06-08 15:18 UTC (permalink / raw)
  To: Michael Kelley (LINUX), KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Dexuan Cui, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, daniel.lezcano@linaro.org,
	arnd@arndb.de
  Cc: Tianyu Lan, linux-arch@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	vkuznets@redhat.com
In-Reply-To: <BYAPR21MB16882DC67EAF87B518D5D72ED750A@BYAPR21MB1688.namprd21.prod.outlook.com>



On 6/8/2023 10:09 PM, Michael Kelley (LINUX) wrote:
>> +static __init void hv_snp_get_smp_config(unsigned int early)
>> +{
>> +	/*
>> +	 * The "early" is only to be true when there is AMD
>> +	 * numa support. Hyper-V AMD SEV-SNP guest may not
>> +	 * have numa support. To make sure smp config is
>> +	 * always initialized, do that when early is false.
>> +	 */
> I didn't really understand this comment.  After doing a little research, let
> me suggest this wording:
> 
> 	/*
> 	 * The "early" parameter can be true only if old-style AMD
> 	 * Opteron NUMA detection is enabled, which should never be
> 	 * the case for an SEV-SNP guest.  See CONFIG_AMD_NUMA.
> 	 * For safety, just do nothing if "early" is true.
> 	 */
> 
> Let me know if this new wording makes sense based on your understanding.
> 

Yes, it makes sense. Will update. Thanks.

^ permalink raw reply

* RE: [PATCH] Drivers: hv: vmbus: Add missing check for dma_set_mask
From: Michael Kelley (LINUX) @ 2023-06-08 15:15 UTC (permalink / raw)
  To: Jiasheng Jiang, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, nathan@kernel.org
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20230607014310.19850-1-jiasheng@iscas.ac.cn>

From: Jiasheng Jiang <jiasheng@iscas.ac.cn> Sent: Tuesday, June 6, 2023 6:43 PM
> 
> Add check for dma_set_mask() and return the error if it fails.
> 
> Fixes: 6bf625a4140f ("Drivers: hv: vmbus: Rework use of DMA_BIT_MASK(64)")
> Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
> ---
>  drivers/hv/vmbus_drv.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 1c65a6dfb9fa..68b7be2762ea 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1908,7 +1908,11 @@ int vmbus_device_register(struct hv_device
> *child_device_obj)
> 
>  	child_device_obj->device.dma_parms = &child_device_obj->dma_parms;
>  	child_device_obj->device.dma_mask = &child_device_obj->dma_mask;
> -	dma_set_mask(&child_device_obj->device, DMA_BIT_MASK(64));
> +	ret = dma_set_mask(&child_device_obj->device, DMA_BIT_MASK(64));
> +	if (ret) {
> +		put_device(&child_device_obj->device);

I don't think the put_device() call is correct here.  The underlying kobj in
child_device_obj->device is not initialized until device_register() calls
device_initialize().  It's after device_initialize() is called that put_device()
must be used.

That said, I don't see that the memory for the child_device_obj gets
freed if we just do "return ret", though maybe I'm missing it.  And there's
the matter of the memory allocated by dev_set_name().   Getting this
error path fully correct may be rather awkward. :-(

Michael

> +		return ret;
> +	}
> 
>  	/*
>  	 * Register with the LDM. This will kick off the driver/device
> --
> 2.25.1


^ permalink raw reply

* Re: [EXTERNAL] Re: [PATCH 5/9] x86/hyperv: Use vmmcall to implement Hyper-V hypercall in sev-snp enlightened guest
From: Tianyu Lan @ 2023-06-08 15:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	hpa, daniel.lezcano, arnd, michael.h.kelley, Tianyu Lan,
	linux-arch, linux-hyperv, linux-kernel, vkuznets
In-Reply-To: <20230608132127.GK998233@hirez.programming.kicks-ass.net>

On 6/8/2023 9:21 PM, Peter Zijlstra wrote:
> On Thu, Jun 01, 2023 at 11:16:18AM -0400, Tianyu Lan wrote:
>> From: Tianyu Lan <tiala@microsoft.com>
>>
>> In sev-snp enlightened guest, Hyper-V hypercall needs
>> to use vmmcall to trigger vmexit and notify hypervisor
>> to handle hypercall request.
>>
>> There is no x86 SEV SNP feature flag support so far and
>> hardware provides MSR_AMD64_SEV register to check SEV-SNP
>> capability with MSR_AMD64_SEV_ENABLED bit. ALTERNATIVE can't
>> work without SEV-SNP x86 feature flag. May add later when
>> the associated flag is introduced.
>>
>> Signed-off-by: Tianyu Lan <tiala@microsoft.com>
>> ---
>>   arch/x86/include/asm/mshyperv.h | 44 ++++++++++++++++++++++++---------
>>   1 file changed, 33 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index 31c476f4e656..d859d7c5f5e8 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -61,16 +61,25 @@ static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
>>   	u64 hv_status;
>>   
>>   #ifdef CONFIG_X86_64
>> -	if (!hv_hypercall_pg)
>> -		return U64_MAX;
>> +	if (hv_isolation_type_en_snp()) {
>> +		__asm__ __volatile__("mov %4, %%r8\n"
>> +				     "vmmcall"
>> +				     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
>> +				       "+c" (control), "+d" (input_address)
>> +				     :  "r" (output_address)
>> +				     : "cc", "memory", "r8", "r9", "r10", "r11");
>> +	} else {
>> +		if (!hv_hypercall_pg)
>> +			return U64_MAX;
>>   
>> -	__asm__ __volatile__("mov %4, %%r8\n"
>> -			     CALL_NOSPEC
>> -			     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
>> -			       "+c" (control), "+d" (input_address)
>> -			     :  "r" (output_address),
>> -				THUNK_TARGET(hv_hypercall_pg)
>> -			     : "cc", "memory", "r8", "r9", "r10", "r11");
>> +		__asm__ __volatile__("mov %4, %%r8\n"
>> +				     CALL_NOSPEC
>> +				     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
>> +				       "+c" (control), "+d" (input_address)
>> +				     :  "r" (output_address),
>> +					THUNK_TARGET(hv_hypercall_pg)
>> +				     : "cc", "memory", "r8", "r9", "r10", "r11");
>> +	}
>>   #else
> 
> Remains unanswered:
> 
> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2F20230516102912.GG2587705%2540hirez.programming.kicks-ass.net&data=05%7C01%7CTianyu.Lan%40microsoft.com%7C60a576eb67634ffa27b108db68234d5a%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638218273105649705%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=MFj67DON0K%2BUoUJbeaIA5oVTxyrzO3fb5DbxYgDWwX0%3D&reserved=0
> 
> Would this not generate better code with an alternative?


Hi Peter:
	Thanks to review. I put the explaination in the change log.

"There is no x86 SEV SNP feature(X86_FEATURE_SEV_SNP) flag
support so far and hardware provides MSR_AMD64_SEV register
to check SEV-SNP capability with MSR_AMD64_SEV_ENABLED bit
ALTERNATIVE can't work without SEV-SNP x86 feature flag."
There is no cpuid leaf bit to check AMD SEV-SNP feature.

After some Hyper-V doesn't provides SEV and SEV-ES guest before and so
may reuse X86_FEATURE_SEV and X86_FEATURE_SEV_ES flag as alternative
feature check for Hyper-V SEV-SNP guest. Will refresh patch.




^ permalink raw reply

* RE: [PATCH v2 1/2] x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline
From: Michael Kelley (LINUX) @ 2023-06-08 14:39 UTC (permalink / raw)
  To: bp@alien8.de
  Cc: KY Srinivasan, Haiyang Zhang, Dexuan Cui, catalin.marinas@arm.com,
	will@kernel.org, tglx@linutronix.de, mingo@redhat.com,
	dave.hansen@linux.intel.com, hpa@zytor.com,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	Wei Liu, linux-arm-kernel@lists.infradead.org, x86@kernel.org
In-Reply-To: <ZG0LTAeV+KMAGXIq@liuwe-devbox-debian-v2>

From: Wei Liu <wei.liu@kernel.org>
> 
> On Tue, May 23, 2023 at 10:14:21AM -0700, Michael Kelley wrote:
> [...]
> > diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> > index 0f1001d..3ceb9df 100644
> > --- a/include/linux/cpuhotplug.h
> > +++ b/include/linux/cpuhotplug.h
> > @@ -200,6 +200,7 @@ enum cpuhp_state {
> >
> >  	/* Online section invoked on the hotplugged CPU from the hotplug thread */
> >  	CPUHP_AP_ONLINE_IDLE,
> > +	CPUHP_AP_HYPERV_ONLINE,
> 
> x86 maintainers, are you okay with this?
> 

Boris -- Are you OK with this, and could give an ACK?  This small patch
set fixes a problem introduced into 6.4-rc1 by other Confidential VM
changes, so this fix needs to be incorporated before 6.4 is released.

Michael

> >  	CPUHP_AP_KVM_ONLINE,
> >  	CPUHP_AP_SCHED_WAIT_EMPTY,
> >  	CPUHP_AP_SMPBOOT_THREADS,
> > --
> > 1.8.3.1
> >

^ permalink raw reply

* RE: [PATCH 4/9] drivers: hv: Mark shared pages unencrypted in SEV-SNP enlightened guest
From: Michael Kelley (LINUX) @ 2023-06-08 14:21 UTC (permalink / raw)
  To: Tianyu Lan, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	daniel.lezcano@linaro.org, arnd@arndb.de
  Cc: Tianyu Lan, linux-arch@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	vkuznets@redhat.com
In-Reply-To: <20230601151624.1757616-5-ltykernel@gmail.com>

From: Tianyu Lan <ltykernel@gmail.com> Sent: Thursday, June 1, 2023 8:16 AM
> 
> Hypervisor needs to access iput arg, VMBus synic event and

s/iput/input/

> message pages. Mask these pages unencrypted in the sev-snp

s/Mask/Mark/

> guest and free them only if they have been marked encrypted
> successfully.
> 
> Signed-off-by: Tianyu Lan <tiala@microsoft.com>
> ---
>  drivers/hv/hv.c        | 57 +++++++++++++++++++++++++++++++++++++++---
>  drivers/hv/hv_common.c | 24 +++++++++++++++++-
>  2 files changed, 77 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index de6708dbe0df..94406dbe0df0 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -20,6 +20,7 @@
>  #include <linux/interrupt.h>
>  #include <clocksource/hyperv_timer.h>
>  #include <asm/mshyperv.h>
> +#include <linux/set_memory.h>
>  #include "hyperv_vmbus.h"
> 
>  /* The one and only */
> @@ -78,7 +79,7 @@ int hv_post_message(union hv_connection_id connection_id,
> 
>  int hv_synic_alloc(void)
>  {
> -	int cpu;
> +	int cpu, ret = -ENOMEM;
>  	struct hv_per_cpu_context *hv_cpu;
> 
>  	/*
> @@ -123,26 +124,76 @@ int hv_synic_alloc(void)
>  				goto err;
>  			}
>  		}
> +
> +		if (hv_isolation_type_en_snp()) {
> +			ret = set_memory_decrypted((unsigned long)
> +				hv_cpu->synic_message_page, 1);
> +			if (ret) {
> +				pr_err("Failed to decrypt SYNIC msg page: %d\n", ret);
> +				hv_cpu->synic_message_page = NULL;
> +
> +				/*
> +				 * Free the event page here and not encrypt
> +				 * the page in hv_synic_free().
> +				 */

Let's tweak the wording of the comment:

				/*
				 * Free the event page here so that hv_synic_free()
				 * won't later try to re-encrypt it.
				 */

> +				free_page((unsigned long)hv_cpu->synic_event_page);
> +				hv_cpu->synic_event_page = NULL;
> +				goto err;
> +			}
> +
> +			ret = set_memory_decrypted((unsigned long)
> +				hv_cpu->synic_event_page, 1);
> +			if (ret) {
> +				pr_err("Failed to decrypt SYNIC event page: %d\n", ret);
> +				hv_cpu->synic_event_page = NULL;
> +				goto err;
> +			}
> +
> +			memset(hv_cpu->synic_message_page, 0, PAGE_SIZE);
> +			memset(hv_cpu->synic_event_page, 0, PAGE_SIZE);
> +		}
>  	}
> 
>  	return 0;
> +
>  err:
>  	/*
>  	 * Any memory allocations that succeeded will be freed when
>  	 * the caller cleans up by calling hv_synic_free()
>  	 */
> -	return -ENOMEM;
> +	return ret;
>  }
> 
> 
>  void hv_synic_free(void)
>  {
> -	int cpu;
> +	int cpu, ret;
> 
>  	for_each_present_cpu(cpu) {
>  		struct hv_per_cpu_context *hv_cpu
>  			= per_cpu_ptr(hv_context.cpu_context, cpu);
> 
> +		/* It's better to leak the page if the encryption fails. */
> +		if (hv_isolation_type_en_snp()) {
> +			if (hv_cpu->synic_message_page) {
> +				ret = set_memory_encrypted((unsigned long)
> +					hv_cpu->synic_message_page, 1);
> +				if (ret) {
> +					pr_err("Failed to encrypt SYNIC msg page: %d\n", ret);
> +					hv_cpu->synic_message_page = NULL;
> +				}
> +			}
> +
> +			if (hv_cpu->synic_event_page) {
> +				ret = set_memory_encrypted((unsigned long)
> +					hv_cpu->synic_event_page, 1);
> +				if (ret) {
> +					pr_err("Failed to encrypt SYNIC event page: %d\n", ret);
> +					hv_cpu->synic_event_page = NULL;
> +				}
> +			}
> +		}
> +
>  		free_page((unsigned long)hv_cpu->synic_event_page);
>  		free_page((unsigned long)hv_cpu->synic_message_page);
>  	}
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 179bc5f5bf52..bed9aa6ac19a 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -24,6 +24,7 @@
>  #include <linux/kmsg_dump.h>
>  #include <linux/slab.h>
>  #include <linux/dma-map-ops.h>
> +#include <linux/set_memory.h>
>  #include <asm/hyperv-tlfs.h>
>  #include <asm/mshyperv.h>
> 
> @@ -359,6 +360,7 @@ int hv_common_cpu_init(unsigned int cpu)
>  	u64 msr_vp_index;
>  	gfp_t flags;
>  	int pgcount = hv_root_partition ? 2 : 1;
> +	int ret;
> 
>  	/* hv_cpu_init() can be called with IRQs disabled from hv_resume() */
>  	flags = irqs_disabled() ? GFP_ATOMIC : GFP_KERNEL;
> @@ -368,6 +370,17 @@ int hv_common_cpu_init(unsigned int cpu)
>  	if (!(*inputarg))
>  		return -ENOMEM;
> 
> +	if (hv_isolation_type_en_snp()) {
> +		ret = set_memory_decrypted((unsigned long)*inputarg, pgcount);
> +		if (ret) {
> +			kfree(*inputarg);
> +			*inputarg = NULL;
> +			return ret;
> +		}
> +
> +		memset(*inputarg, 0x00, pgcount * PAGE_SIZE);
> +	}
> +
>  	if (hv_root_partition) {
>  		outputarg = (void **)this_cpu_ptr(hyperv_pcpu_output_arg);
>  		*outputarg = (char *)(*inputarg) + HV_HYP_PAGE_SIZE;
> @@ -387,7 +400,9 @@ int hv_common_cpu_die(unsigned int cpu)
>  {
>  	unsigned long flags;
>  	void **inputarg, **outputarg;
> +	int pgcount = hv_root_partition ? 2 : 1;
>  	void *mem;
> +	int ret;
> 
>  	local_irq_save(flags);
> 
> @@ -402,7 +417,14 @@ int hv_common_cpu_die(unsigned int cpu)
> 
>  	local_irq_restore(flags);
> 
> -	kfree(mem);
> +	if (hv_isolation_type_en_snp()) {
> +		ret = set_memory_encrypted((unsigned long)mem, pgcount);
> +		if (ret)
> +			pr_warn("Hyper-V: Failed to encrypt input arg on cpu%d: %d\n",
> +				cpu, ret);
> +		/* It's unsafe to free 'mem'. */
> +		return 0;
> +	}
> 
>  	return 0;
>  }
> --
> 2.25.1


^ permalink raw reply

* RE: [PATCH 7/9] x86/hyperv: Initialize cpu and memory for SEV-SNP enlightened guest
From: Michael Kelley (LINUX) @ 2023-06-08 14:09 UTC (permalink / raw)
  To: Tianyu Lan, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	daniel.lezcano@linaro.org, arnd@arndb.de
  Cc: Tianyu Lan, linux-arch@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	vkuznets@redhat.com
In-Reply-To: <20230601151624.1757616-8-ltykernel@gmail.com>

From: Tianyu Lan <ltykernel@gmail.com> Sent: Thursday, June 1, 2023 8:16 AM
> 

[snip]

> 
> +static __init void hv_snp_get_smp_config(unsigned int early)
> +{
> +	/*
> +	 * The "early" is only to be true when there is AMD
> +	 * numa support. Hyper-V AMD SEV-SNP guest may not
> +	 * have numa support. To make sure smp config is
> +	 * always initialized, do that when early is false.
> +	 */

I didn't really understand this comment.  After doing a little research, let
me suggest this wording:

	/*
	 * The "early" parameter can be true only if old-style AMD
	 * Opteron NUMA detection is enabled, which should never be
	 * the case for an SEV-SNP guest.  See CONFIG_AMD_NUMA.
	 * For safety, just do nothing if "early" is true.
	 */

Let me know if this new wording makes sense based on your understanding.

Michael

> +	if (early)
> +		return;
> +

^ permalink raw reply

* RE: [PATCH 7/9] x86/hyperv: Initialize cpu and memory for SEV-SNP enlightened guest
From: Michael Kelley (LINUX) @ 2023-06-08 13:51 UTC (permalink / raw)
  To: Tianyu Lan, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	daniel.lezcano@linaro.org, arnd@arndb.de
  Cc: Tianyu Lan, linux-arch@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	vkuznets@redhat.com
In-Reply-To: <20230601151624.1757616-8-ltykernel@gmail.com>

From: Tianyu Lan <ltykernel@gmail.com> Sent: Thursday, June 1, 2023 8:16 AM
> 
> Hyper-V enlightened guest doesn't have boot loader support.
> Boot Linux kernel directly from hypervisor with data(kernel

Add a space between "data" and "(kernel"

> image, initrd and parameter page) and memory for boot up that
> is initialized via AMD SEV PSP proctol LAUNCH_UPDATE_DATA

s/proctol/protocol/

> (Please refernce https://www.amd.com/system/files/TechDocs/55766_SEV-KM_API_Specification.pdf 1.3.1 Launch).

s/refernce/reference/

And the link above didn't work for me -- the "55766_SEV-KM_API_Specification.pdf"
part was separated from the rest of the URL, though it's possible the mangling
was done by Microsoft's email system.  Please double check that the URL is
correctly formatted with no spurious spaces.

Even better, maybe write this as:

Please reference Section 1.3.1 "Launch" of [1].

Then put the full link as [1] at the bottom of the commit message.

> 
> Kernel needs to read processor and memory info from EN_SEV_
> SNP_PROCESSOR/MEM_INFO_ADDR address which are populated by Hyper-V.
> Initialize smp cpu related ops, validate system memory and add
> it into e820 table.
> 
> Signed-off-by: Tianyu Lan <tiala@microsoft.com>
> ---
>  arch/x86/hyperv/ivm.c           | 93 +++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/mshyperv.h | 17 ++++++
>  arch/x86/kernel/cpu/mshyperv.c  |  3 ++
>  3 files changed, 113 insertions(+)
> 
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index 5d3ee3124e00..e735507d0f54 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -17,6 +17,11 @@
>  #include <asm/mem_encrypt.h>
>  #include <asm/mshyperv.h>
>  #include <asm/hypervisor.h>
> +#include <asm/coco.h>
> +#include <asm/io_apic.h>
> +#include <asm/sev.h>
> +#include <asm/realmode.h>
> +#include <asm/e820/api.h>
> 
>  #ifdef CONFIG_AMD_MEM_ENCRYPT
> 
> @@ -57,6 +62,8 @@ union hv_ghcb {
> 
>  static u16 hv_ghcb_version __ro_after_init;
> 
> +static u32 processor_count;
> +
>  u64 hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size)
>  {
>  	union hv_ghcb *hv_ghcb;
> @@ -356,6 +363,92 @@ static bool hv_is_private_mmio(u64 addr)
>  	return false;
>  }
> 
> +static __init void hv_snp_get_smp_config(unsigned int early)
> +{
> +	/*
> +	 * The "early" is only to be true when there is AMD
> +	 * numa support. Hyper-V AMD SEV-SNP guest may not
> +	 * have numa support. To make sure smp config is
> +	 * always initialized, do that when early is false.
> +	 */
> +	if (early)
> +		return;
> +
> +	/*
> +	 * There is no firmware and ACPI MADT table support in
> +	 * in the Hyper-V SEV-SNP enlightened guest. Set smp
> +	 * related config variable here.
> +	 */
> +	while (num_processors < processor_count) {
> +		early_per_cpu(x86_cpu_to_apicid, num_processors) = num_processors;
> +		early_per_cpu(x86_bios_cpu_apicid, num_processors) = num_processors;
> +		physid_set(num_processors, phys_cpu_present_map);
> +		set_cpu_possible(num_processors, true);
> +		set_cpu_present(num_processors, true);
> +		num_processors++;
> +	}
> +}
> +
> +__init void hv_sev_init_mem_and_cpu(void)
> +{
> +	struct memory_map_entry *entry;
> +	struct e820_entry *e820_entry;
> +	u64 e820_end;
> +	u64 ram_end;
> +	u64 page;
> +
> +	/*
> +	 * Hyper-V enlightened snp guest boots kernel
> +	 * directly without bootloader. So roms, bios
> +	 * regions and reserve resources are not available.
> +	 * Set these callback to NULL.
> +	 */
> +	x86_platform.legacy.rtc			= 0;
> +	x86_platform.legacy.reserve_bios_regions = 0;
> +	x86_platform.set_wallclock		= set_rtc_noop;
> +	x86_platform.get_wallclock		= get_rtc_noop;
> +	x86_init.resources.probe_roms		= x86_init_noop;
> +	x86_init.resources.reserve_resources	= x86_init_noop;
> +	x86_init.mpparse.find_smp_config	= x86_init_noop;
> +	x86_init.mpparse.get_smp_config		= hv_snp_get_smp_config;
> +
> +	/*
> +	 * Hyper-V SEV-SNP enlightened guest doesn't support ioapic
> +	 * and legacy APIC page read/write. Switch to hv apic here.
> +	 */
> +	disable_ioapic_support();
> +
> +	/* Get processor and mem info. */
> +	processor_count = *(u32 *)__va(EN_SEV_SNP_PROCESSOR_INFO_ADDR);
> +	entry = (struct memory_map_entry *)__va(EN_SEV_SNP_MEM_INFO_ADDR);
> +
> +	/*
> +	 * There is no bootloader/EFI firmware in the SEV SNP guest.
> +	 * E820 table in the memory just describes memory for kernel,
> +	 * ACPI table, cmdline, boot params and ramdisk. The dynamic
> +	 * data(e.g, vcpu number and the rest memory layout) needs to
> +	 * be read from EN_SEV_SNP_PROCESSOR_INFO_ADDR.
> +	 */
> +	for (; entry->numpages != 0; entry++) {
> +		e820_entry = &e820_table->entries[
> +				e820_table->nr_entries - 1];
> +		e820_end = e820_entry->addr + e820_entry->size;
> +		ram_end = (entry->starting_gpn +
> +			   entry->numpages) * PAGE_SIZE;
> +
> +		if (e820_end < entry->starting_gpn * PAGE_SIZE)
> +			e820_end = entry->starting_gpn * PAGE_SIZE;
> +
> +		if (e820_end < ram_end) {
> +			pr_info("Hyper-V: add e820 entry [mem %#018Lx-%#018Lx]\n", e820_end, ram_end - 1);
> +			e820__range_add(e820_end, ram_end - e820_end,
> +					E820_TYPE_RAM);
> +			for (page = e820_end; page < ram_end; page += PAGE_SIZE)
> +				pvalidate((unsigned long)__va(page), RMP_PG_SIZE_4K, true);
> +		}
> +	}
> +}
> +
>  void __init hv_vtom_init(void)
>  {
>  	/*
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index d859d7c5f5e8..7a9a6cdc2ae9 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -50,6 +50,21 @@ extern bool hv_isolation_type_en_snp(void);
> 
>  extern union hv_ghcb * __percpu *hv_ghcb_pg;
> 
> +/*
> + * Hyper-V puts processor and memory layout info
> + * to this address in SEV-SNP enlightened guest.
> + */
> +#define EN_SEV_SNP_PROCESSOR_INFO_ADDR  0x802000
> +#define EN_SEV_SNP_MEM_INFO_ADDR	0x802018
> +
> +struct memory_map_entry {
> +	u64 starting_gpn;
> +	u64 numpages;
> +	u16 type;
> +	u16 flags;
> +	u32 reserved;
> +};
> +
>  int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
>  int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
>  int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
> @@ -255,12 +270,14 @@ void hv_ghcb_msr_read(u64 msr, u64 *value);
>  bool hv_ghcb_negotiate_protocol(void);
>  void __noreturn hv_ghcb_terminate(unsigned int set, unsigned int reason);
>  void hv_vtom_init(void);
> +void hv_sev_init_mem_and_cpu(void);
>  #else
>  static inline void hv_ghcb_msr_write(u64 msr, u64 value) {}
>  static inline void hv_ghcb_msr_read(u64 msr, u64 *value) {}
>  static inline bool hv_ghcb_negotiate_protocol(void) { return false; }
>  static inline void hv_ghcb_terminate(unsigned int set, unsigned int reason) {}
>  static inline void hv_vtom_init(void) {}
> +static inline void hv_sev_init_mem_and_cpu(void) {}
>  #endif
> 
>  extern bool hv_isolation_type_snp(void);
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 9186453251f7..48b9eab3daf6 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -528,6 +528,9 @@ static void __init ms_hyperv_init_platform(void)
>  	if (!(ms_hyperv.features & HV_ACCESS_TSC_INVARIANT))
>  		mark_tsc_unstable("running on Hyper-V");
> 
> +	if (hv_isolation_type_en_snp())
> +		hv_sev_init_mem_and_cpu();
> +
>  	hardlockup_detector_disable();
>  }
> 
> --
> 2.25.1


^ permalink raw reply

* RE: [PATCH 3/9] x86/hyperv: Mark Hyper-V vp assist page unencrypted in SEV-SNP enlightened guest
From: Vitaly Kuznetsov @ 2023-06-08 13:44 UTC (permalink / raw)
  To: Michael Kelley (LINUX), Tianyu Lan, KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Dexuan Cui, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, daniel.lezcano@linaro.org,
	arnd@arndb.de
  Cc: Tianyu Lan, linux-arch@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <BYAPR21MB16883BF49ED337A6EF063461D750A@BYAPR21MB1688.namprd21.prod.outlook.com>

"Michael Kelley (LINUX)" <mikelley@microsoft.com> writes:

> From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Tuesday, June 6, 2023 8:49 AM
>> 
>> Tianyu Lan <ltykernel@gmail.com> writes:
>> 
>> > On 6/5/2023 8:13 PM, Vitaly Kuznetsov wrote:
>> >>> @@ -113,6 +114,11 @@ static int hv_cpu_init(unsigned int cpu)
>> >>>
>> >>>   	}
>> >>>   	if (!WARN_ON(!(*hvp))) {
>> >>> +		if (hv_isolation_type_en_snp()) {
>> >>> +			WARN_ON_ONCE(set_memory_decrypted((unsigned long)(*hvp), 1));
>> >>> +			memset(*hvp, 0, PAGE_SIZE);
>> >>> +		}
>> >> Why do we need to set the page as decrypted here and not when we
>> >> allocate the page (a few lines above)?
>> >
>> > If Linux root partition boots in the SEV-SNP guest, the page still needs
>> > to be decrypted.
>
> We have code in place that prevents this scenario.  We don't allow Linux
> in the root partition to run in SEV-SNP mode.  See commit f8acb24aaf89.
>
>> >
>> 
>> I'd suggest we add a flag to indicate that VP assist page was actually
>> set (on the first invocation of hv_cpu_init() for guest partitions and
>> all invocations for root partition) and only call
>> set_memory_decrypted()/memset() then: that would both help with the
>> potential issue with KVM using enlightened vmcs and avoid the unneeded
>> hypercall.
>> 
>
> I think there's actually a more immediate problem with the code as
> written.  The VP assist page for a CPU is not re-encrypted or freed when
> a CPU goes offline (for reasons that have been discussed elsewhere).  So
> if a CPU in an SEV-SNP VM goes offline and then comes back online, the
> originally allocated and already decrypted VP assist page will be reused.
> But bad things will happen if we try to decrypt the page again.
>
> Given that we disallow the root partition running in SEV-SNP mode,
> can we avoid the complexity of a flag, and just do the decryption and
> zero'ing when the page is allocated?

Sure, makes perfect sense but let's leave a [one line] comment why we
don't do any decryption for root partition then.

-- 
Vitaly


^ permalink raw reply

* RE: [PATCH 3/9] x86/hyperv: Mark Hyper-V vp assist page unencrypted in SEV-SNP enlightened guest
From: Michael Kelley (LINUX) @ 2023-06-08 13:25 UTC (permalink / raw)
  To: Vitaly Kuznetsov, Tianyu Lan, KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Dexuan Cui, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, daniel.lezcano@linaro.org,
	arnd@arndb.de
  Cc: Tianyu Lan, linux-arch@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <87o7lsk2v8.fsf@redhat.com>

From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Tuesday, June 6, 2023 8:49 AM
> 
> Tianyu Lan <ltykernel@gmail.com> writes:
> 
> > On 6/5/2023 8:13 PM, Vitaly Kuznetsov wrote:
> >>> @@ -113,6 +114,11 @@ static int hv_cpu_init(unsigned int cpu)
> >>>
> >>>   	}
> >>>   	if (!WARN_ON(!(*hvp))) {
> >>> +		if (hv_isolation_type_en_snp()) {
> >>> +			WARN_ON_ONCE(set_memory_decrypted((unsigned long)(*hvp), 1));
> >>> +			memset(*hvp, 0, PAGE_SIZE);
> >>> +		}
> >> Why do we need to set the page as decrypted here and not when we
> >> allocate the page (a few lines above)?
> >
> > If Linux root partition boots in the SEV-SNP guest, the page still needs
> > to be decrypted.

We have code in place that prevents this scenario.  We don't allow Linux
in the root partition to run in SEV-SNP mode.  See commit f8acb24aaf89.

> >
> 
> I'd suggest we add a flag to indicate that VP assist page was actually
> set (on the first invocation of hv_cpu_init() for guest partitions and
> all invocations for root partition) and only call
> set_memory_decrypted()/memset() then: that would both help with the
> potential issue with KVM using enlightened vmcs and avoid the unneeded
> hypercall.
> 

I think there's actually a more immediate problem with the code as
written.  The VP assist page for a CPU is not re-encrypted or freed when
a CPU goes offline (for reasons that have been discussed elsewhere).  So
if a CPU in an SEV-SNP VM goes offline and then comes back online, the
originally allocated and already decrypted VP assist page will be reused.
But bad things will happen if we try to decrypt the page again.

Given that we disallow the root partition running in SEV-SNP mode,
can we avoid the complexity of a flag, and just do the decryption and
zero'ing when the page is allocated?

Michael

^ permalink raw reply

* Re: [PATCH 2/9] x86/hyperv: Set Virtual Trust Level in VMBus init message
From: Tianyu Lan @ 2023-06-08 13:21 UTC (permalink / raw)
  To: Michael Kelley (LINUX), KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Dexuan Cui, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, daniel.lezcano@linaro.org,
	arnd@arndb.de
  Cc: Tianyu Lan, linux-arch@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	vkuznets@redhat.com
In-Reply-To: <BYAPR21MB1688903C92A708680FFE7B91D750A@BYAPR21MB1688.namprd21.prod.outlook.com>



On 6/8/2023 9:06 PM, Michael Kelley (LINUX) wrote:
> From: Tianyu Lan<ltykernel@gmail.com>  Sent: Thursday, June 1, 2023 8:16 AM
>> SEV-SNP guest provides vtl(Virtual Trust Level) and
>> get it from Hyper-V hvcall via register hvcall HVCALL_
>> GET_VP_REGISTERS.
>>
>> During initialization of VMBus, vtl needs to be set in the
>> VMBus init message.
> Let's clean up this commit message a bit.  I would suggest:
> 
> SEV-SNP guests on Hyper-V can run at multiple Virtual Trust
> Levels (VTL).  During boot, get the VTL at which we're running
> using the GET_VP_REGISTERs hypercall, and save the value
> for future use.  Then during VMBus initialization, set the VTL
> with the saved value as required in the VMBus init message.
> 

Will update. Thanks to rework change log.

^ permalink raw reply

* Re: [PATCH 5/9] x86/hyperv: Use vmmcall to implement Hyper-V hypercall in sev-snp enlightened guest
From: Peter Zijlstra @ 2023-06-08 13:21 UTC (permalink / raw)
  To: Tianyu Lan
  Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	hpa, daniel.lezcano, arnd, michael.h.kelley, Tianyu Lan,
	linux-arch, linux-hyperv, linux-kernel, vkuznets
In-Reply-To: <20230601151624.1757616-6-ltykernel@gmail.com>

On Thu, Jun 01, 2023 at 11:16:18AM -0400, Tianyu Lan wrote:
> From: Tianyu Lan <tiala@microsoft.com>
> 
> In sev-snp enlightened guest, Hyper-V hypercall needs
> to use vmmcall to trigger vmexit and notify hypervisor
> to handle hypercall request.
> 
> There is no x86 SEV SNP feature flag support so far and
> hardware provides MSR_AMD64_SEV register to check SEV-SNP
> capability with MSR_AMD64_SEV_ENABLED bit. ALTERNATIVE can't
> work without SEV-SNP x86 feature flag. May add later when
> the associated flag is introduced. 
> 
> Signed-off-by: Tianyu Lan <tiala@microsoft.com>
> ---
>  arch/x86/include/asm/mshyperv.h | 44 ++++++++++++++++++++++++---------
>  1 file changed, 33 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 31c476f4e656..d859d7c5f5e8 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -61,16 +61,25 @@ static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
>  	u64 hv_status;
>  
>  #ifdef CONFIG_X86_64
> -	if (!hv_hypercall_pg)
> -		return U64_MAX;
> +	if (hv_isolation_type_en_snp()) {
> +		__asm__ __volatile__("mov %4, %%r8\n"
> +				     "vmmcall"
> +				     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
> +				       "+c" (control), "+d" (input_address)
> +				     :  "r" (output_address)
> +				     : "cc", "memory", "r8", "r9", "r10", "r11");
> +	} else {
> +		if (!hv_hypercall_pg)
> +			return U64_MAX;
>  
> -	__asm__ __volatile__("mov %4, %%r8\n"
> -			     CALL_NOSPEC
> -			     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
> -			       "+c" (control), "+d" (input_address)
> -			     :  "r" (output_address),
> -				THUNK_TARGET(hv_hypercall_pg)
> -			     : "cc", "memory", "r8", "r9", "r10", "r11");
> +		__asm__ __volatile__("mov %4, %%r8\n"
> +				     CALL_NOSPEC
> +				     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
> +				       "+c" (control), "+d" (input_address)
> +				     :  "r" (output_address),
> +					THUNK_TARGET(hv_hypercall_pg)
> +				     : "cc", "memory", "r8", "r9", "r10", "r11");
> +	}
>  #else

Remains unanswered:

https://lkml.kernel.org/r/20230516102912.GG2587705%40hirez.programming.kicks-ass.net

Would this not generate better code with an alternative?

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox