linux-hyperv.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Simon Horman <horms@kernel.org>
To: mhklinux@outlook.com
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, andrew+netdev@lunn.ch, davem@davemloft.net,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	James.Bottomley@hansenpartnership.com,
	martin.petersen@oracle.com, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	linux-scsi@vger.kernel.org, stable@vger.kernel.org
Subject: Re: [PATCH net 3/5] hv_netvsc: Preserve contiguous PFN grouping in the page buffer array
Date: Wed, 14 May 2025 10:34:35 +0100	[thread overview]
Message-ID: <20250514093435.GE3339421@horms.kernel.org> (raw)
In-Reply-To: <20250513000604.1396-4-mhklinux@outlook.com>

On Mon, May 12, 2025 at 05:06:02PM -0700, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
> 
> Starting with commit dca5161f9bd0 ("hv_netvsc: Check status in
> SEND_RNDIS_PKT completion message") in the 6.3 kernel, the Linux
> driver for Hyper-V synthetic networking (netvsc) occasionally reports
> "nvsp_rndis_pkt_complete error status: 2".[1] This error indicates
> that Hyper-V has rejected a network packet transmit request from the
> guest, and the outgoing network packet is dropped. Higher level
> network protocols presumably recover and resend the packet so there is
> no functional error, but performance is slightly impacted. Commit
> dca5161f9bd0 is not the cause of the error -- it only added reporting
> of an error that was already happening without any notice. The error
> has presumably been present since the netvsc driver was originally
> introduced into Linux.
> 
> The root cause of the problem is that the netvsc driver in Linux may
> send an incorrectly formatted VMBus message to Hyper-V when
> transmitting the network packet. The incorrect formatting occurs when
> the rndis header of the VMBus message crosses a page boundary due to
> how the Linux skb head memory is aligned. In such a case, two PFNs are
> required to describe the location of the rndis header, even though
> they are contiguous in guest physical address (GPA) space. Hyper-V
> requires that two rndis header PFNs be in a single "GPA range" data
> struture, but current netvsc code puts each PFN in its own GPA range,
> which Hyper-V rejects as an error.
> 
> The incorrect formatting occurs only for larger packets that netvsc
> must transmit via a VMBus "GPA Direct" message. There's no problem
> when netvsc transmits a smaller packet by copying it into a pre-
> allocated send buffer slot because the pre-allocated slots don't have
> page crossing issues.
> 
> After commit 14ad6ed30a10 ("net: allow small head cache usage with
> large MAX_SKB_FRAGS values") in the 6.14-rc4 kernel, the error occurs
> much more frequently in VMs with 16 or more vCPUs. It may occur every
> few seconds, or even more frequently, in an ssh session that outputs a
> lot of text. Commit 14ad6ed30a10 subtly changes how skb head memory is
> allocated, making it much more likely that the rndis header will cross
> a page boundary when the vCPU count is 16 or more. The changes in
> commit 14ad6ed30a10 are perfectly valid -- they just had the side
> effect of making the netvsc bug more prominent.
> 
> Current code in init_page_array() creates a separate page buffer array
> entry for each PFN required to identify the data to be transmitted.
> Contiguous PFNs get separate entries in the page buffer array, and any
> information about contiguity is lost.
> 
> Fix the core issue by having init_page_array() construct the page
> buffer array to represent contiguous ranges rather than individual
> pages. When these ranges are subsequently passed to
> netvsc_build_mpb_array(), it can build GPA ranges that contain
> multiple PFNs, as required to avoid the error "nvsp_rndis_pkt_complete
> error status: 2". If instead the network packet is sent by copying
> into a pre-allocated send buffer slot, the copy proceeds using the
> contiguous ranges rather than individual pages, but the result of the
> copying is the same. Also fix rndis_filter_send_request() to construct
> a contiguous range, since it has its own page buffer array.
> 
> This change has a side benefit in CoCo VMs in that netvsc_dma_map()
> calls dma_map_single() on each contiguous range instead of on each
> page. This results in fewer calls to dma_map_single() but on larger
> chunks of memory, which should reduce contention on the swiotlb.
> 
> Since the page buffer array now contains one entry for each contiguous
> range instead of for each individual page, the number of entries in
> the array can be reduced, saving 208 bytes of stack space in
> netvsc_xmit() when MAX_SKG_FRAGS has the default value of 17.
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=217503
> 
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217503
> Cc: <stable@vger.kernel.org> # 6.1.x
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
> ---
>  drivers/net/hyperv/hyperv_net.h   | 12 ++++++
>  drivers/net/hyperv/netvsc_drv.c   | 63 ++++++++-----------------------
>  drivers/net/hyperv/rndis_filter.c | 24 +++---------
>  3 files changed, 32 insertions(+), 67 deletions(-)
> 
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 70f7cb383228..76725f25abd5 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -893,6 +893,18 @@ struct nvsp_message {
>  				 sizeof(struct nvsp_message))
>  #define NETVSC_MIN_IN_MSG_SIZE sizeof(struct vmpacket_descriptor)
>  
> +/* Maximum # of contiguous data ranges that can make up a trasmitted packet.
> + * Typically it's the max SKB fragments plus 2 for the rndis packet and the
> + * linear portion of the SKB. But if MAX_SKB_FRAGS is large, the value may
> + * need to be limited to MAX_PAGE_BUFFER_COUNT, which is the max # of entries
> + * in a GPA direct packet sent to netvsp over VMBus.
> + */
> +#if MAX_SKB_FRAGS + 2 < MAX_PAGE_BUFFER_COUNT
> +#define MAX_DATA_RANGES (MAX_SKB_FRAGS + 2)
> +#else
> +#define MAX_DATA_RANGES MAX_PAGE_BUFFER_COUNT
> +#endif
> +
>  /* Estimated requestor size:
>   * out_ring_size/min_out_msg_size + in_ring_size/min_in_msg_size
>   */

...

> @@ -371,28 +338,28 @@ static u32 init_page_array(void *hdr, u32 len, struct sk_buff *skb,
>  	 * 2. skb linear data
>  	 * 3. skb fragment data
>  	 */
> -	slots_used += fill_pg_buf(virt_to_hvpfn(hdr),
> -				  offset_in_hvpage(hdr),
> -				  len,
> -				  &pb[slots_used]);
>  
> +	pb[0].offset = offset_in_hvpage(hdr);
> +	pb[0].len = len;
> +	pb[0].pfn = virt_to_hvpfn(hdr);
>  	packet->rmsg_size = len;
> -	packet->rmsg_pgcnt = slots_used;
> +	packet->rmsg_pgcnt = 1;
>  
> -	slots_used += fill_pg_buf(virt_to_hvpfn(data),
> -				  offset_in_hvpage(data),
> -				  skb_headlen(skb),
> -				  &pb[slots_used]);
> +	pb[1].offset = offset_in_hvpage(skb->data);
> +	pb[1].len = skb_headlen(skb);
> +	pb[1].pfn = virt_to_hvpfn(skb->data);
>  
>  	for (i = 0; i < frags; i++) {
>  		skb_frag_t *frag = skb_shinfo(skb)->frags + i;
> +		struct hv_page_buffer *cur_pb = &pb[i + 2];

Hi Michael,

If I got things right then then pb is allocated on the stack
in netvsc_xmit and has MAX_DATA_RANGES elements.

If MAX_SKB_FRAGS is largs and MAX_DATA_RANGES has been limited to
MAX_DATA_RANGES. And frags is large. Is is possible to overrun pb here?

> +		u64 pfn = page_to_hvpfn(skb_frag_page(frag));
> +		u32 offset = skb_frag_off(frag);
>  
> -		slots_used += fill_pg_buf(page_to_hvpfn(skb_frag_page(frag)),
> -					  skb_frag_off(frag),
> -					  skb_frag_size(frag),
> -					  &pb[slots_used]);
> +		cur_pb->offset = offset_in_hvpage(offset);
> +		cur_pb->len = skb_frag_size(frag);
> +		cur_pb->pfn = pfn + (offset >> HV_HYP_PAGE_SHIFT);
>  	}
> -	return slots_used;
> +	return frags + 2;
>  }
>  
>  static int count_skb_frag_slots(struct sk_buff *skb)
> @@ -483,7 +450,7 @@ static int netvsc_xmit(struct sk_buff *skb, struct net_device *net, bool xdp_tx)
>  	struct net_device *vf_netdev;
>  	u32 rndis_msg_size;
>  	u32 hash;
> -	struct hv_page_buffer pb[MAX_PAGE_BUFFER_COUNT];
> +	struct hv_page_buffer pb[MAX_DATA_RANGES];
>  
>  	/* If VF is present and up then redirect packets to it.
>  	 * Skip the VF if it is marked down or has no carrier.

...

  reply	other threads:[~2025-05-14  9:34 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-13  0:05 [PATCH net 0/5] hv_netvsc: Fix error "nvsp_rndis_pkt_complete error status: 2" mhkelley58
2025-05-13  0:06 ` [PATCH net 1/5] Drivers: hv: Allow vmbus_sendpacket_mpb_desc() to create multiple ranges mhkelley58
2025-05-15 10:51   ` Simon Horman
2025-05-13  0:06 ` [PATCH net 2/5] hv_netvsc: Use vmbus_sendpacket_mpb_desc() to send VMBus messages mhkelley58
2025-05-14  9:37   ` Simon Horman
2025-05-14 15:44     ` Michael Kelley
2025-05-15 10:50       ` Simon Horman
2025-05-13  0:06 ` [PATCH net 3/5] hv_netvsc: Preserve contiguous PFN grouping in the page buffer array mhkelley58
2025-05-14  9:34   ` Simon Horman [this message]
2025-05-14 15:42     ` Michael Kelley
2025-05-15 10:40       ` Simon Horman
2025-05-13  0:06 ` [PATCH net 4/5] hv_netvsc: Remove rmsg_pgcnt mhkelley58
2025-05-15 10:55   ` Simon Horman
2025-05-13  0:06 ` [PATCH net 5/5] Drivers: hv: vmbus: Remove vmbus_sendpacket_pagebuffer() mhkelley58
2025-05-15  3:00 ` [PATCH net 0/5] hv_netvsc: Fix error "nvsp_rndis_pkt_complete error status: 2" patchwork-bot+netdevbpf

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250514093435.GE3339421@horms.kernel.org \
    --to=horms@kernel.org \
    --cc=James.Bottomley@hansenpartnership.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=davem@davemloft.net \
    --cc=decui@microsoft.com \
    --cc=edumazet@google.com \
    --cc=haiyangz@microsoft.com \
    --cc=kuba@kernel.org \
    --cc=kys@microsoft.com \
    --cc=linux-hyperv@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=mhklinux@outlook.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=stable@vger.kernel.org \
    --cc=wei.liu@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).