From: Simon Horman <horms@kernel.org>
To: mhklinux@outlook.com
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, andrew+netdev@lunn.ch, davem@davemloft.net,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	James.Bottomley@hansenpartnership.com,
	martin.petersen@oracle.com, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	linux-scsi@vger.kernel.org, stable@vger.kernel.org
Subject: Re: [PATCH net 3/5] hv_netvsc: Preserve contiguous PFN grouping in the page buffer array
Date: Wed, 14 May 2025 10:34:35 +0100
Message-ID: <20250514093435.GE3339421@horms.kernel.org>
In-Reply-To: <20250513000604.1396-4-mhklinux@outlook.com>

On Mon, May 12, 2025 at 05:06:02PM -0700, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
> 
> Starting with commit dca5161f9bd0 ("hv_netvsc: Check status in
> SEND_RNDIS_PKT completion message") in the 6.3 kernel, the Linux
> driver for Hyper-V synthetic networking (netvsc) occasionally reports
> "nvsp_rndis_pkt_complete error status: 2".[1] This error indicates
> that Hyper-V has rejected a network packet transmit request from the
> guest, and the outgoing network packet is dropped. Higher level
> network protocols presumably recover and resend the packet so there is
> no functional error, but performance is slightly impacted. Commit
> dca5161f9bd0 is not the cause of the error -- it only added reporting
> of an error that was already happening without any notice. The error
> has presumably been present since the netvsc driver was originally
> introduced into Linux.
> 
> The root cause of the problem is that the netvsc driver in Linux may
> send an incorrectly formatted VMBus message to Hyper-V when
> transmitting the network packet. The incorrect formatting occurs when
> the rndis header of the VMBus message crosses a page boundary due to
> how the Linux skb head memory is aligned. In such a case, two PFNs are
> required to describe the location of the rndis header, even though
> they are contiguous in guest physical address (GPA) space. Hyper-V
> requires that two rndis header PFNs be in a single "GPA range" data
> structure, but current netvsc code puts each PFN in its own GPA range,
> which Hyper-V rejects as an error.
> 
> The incorrect formatting occurs only for larger packets that netvsc
> must transmit via a VMBus "GPA Direct" message. There's no problem
> when netvsc transmits a smaller packet by copying it into a pre-
> allocated send buffer slot because the pre-allocated slots don't have
> page crossing issues.
> 
> After commit 14ad6ed30a10 ("net: allow small head cache usage with
> large MAX_SKB_FRAGS values") in the 6.14-rc4 kernel, the error occurs
> much more frequently in VMs with 16 or more vCPUs. It may occur every
> few seconds, or even more frequently, in an ssh session that outputs a
> lot of text. Commit 14ad6ed30a10 subtly changes how skb head memory is
> allocated, making it much more likely that the rndis header will cross
> a page boundary when the vCPU count is 16 or more. The changes in
> commit 14ad6ed30a10 are perfectly valid -- they just had the side
> effect of making the netvsc bug more prominent.
> 
> Current code in init_page_array() creates a separate page buffer array
> entry for each PFN required to identify the data to be transmitted.
> Contiguous PFNs get separate entries in the page buffer array, and any
> information about contiguity is lost.
> 
> Fix the core issue by having init_page_array() construct the page
> buffer array to represent contiguous ranges rather than individual
> pages. When these ranges are subsequently passed to
> netvsc_build_mpb_array(), it can build GPA ranges that contain
> multiple PFNs, as required to avoid the error "nvsp_rndis_pkt_complete
> error status: 2". If instead the network packet is sent by copying
> into a pre-allocated send buffer slot, the copy proceeds using the
> contiguous ranges rather than individual pages, but the result of the
> copying is the same. Also fix rndis_filter_send_request() to construct
> a contiguous range, since it has its own page buffer array.
> 
> This change has a side benefit in CoCo VMs in that netvsc_dma_map()
> calls dma_map_single() on each contiguous range instead of on each
> page. This results in fewer calls to dma_map_single() but on larger
> chunks of memory, which should reduce contention on the swiotlb.
> 
> Since the page buffer array now contains one entry for each contiguous
> range instead of for each individual page, the number of entries in
> the array can be reduced, saving 208 bytes of stack space in
> netvsc_xmit() when MAX_SKB_FRAGS has the default value of 17.
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=217503
> 
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217503
> Cc: <stable@vger.kernel.org> # 6.1.x
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
> ---
>  drivers/net/hyperv/hyperv_net.h   | 12 ++++++
>  drivers/net/hyperv/netvsc_drv.c   | 63 ++++++++-----------------------
>  drivers/net/hyperv/rndis_filter.c | 24 +++---------
>  3 files changed, 32 insertions(+), 67 deletions(-)
> 
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 70f7cb383228..76725f25abd5 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -893,6 +893,18 @@ struct nvsp_message {
>  				 sizeof(struct nvsp_message))
>  #define NETVSC_MIN_IN_MSG_SIZE sizeof(struct vmpacket_descriptor)
>  
> +/* Maximum # of contiguous data ranges that can make up a transmitted packet.
> + * Typically it's the max SKB fragments plus 2 for the rndis packet and the
> + * linear portion of the SKB. But if MAX_SKB_FRAGS is large, the value may
> + * need to be limited to MAX_PAGE_BUFFER_COUNT, which is the max # of entries
> + * in a GPA direct packet sent to netvsp over VMBus.
> + */
> +#if MAX_SKB_FRAGS + 2 < MAX_PAGE_BUFFER_COUNT
> +#define MAX_DATA_RANGES (MAX_SKB_FRAGS + 2)
> +#else
> +#define MAX_DATA_RANGES MAX_PAGE_BUFFER_COUNT
> +#endif
> +
>  /* Estimated requestor size:
>   * out_ring_size/min_out_msg_size + in_ring_size/min_in_msg_size
>   */

...

> @@ -371,28 +338,28 @@ static u32 init_page_array(void *hdr, u32 len, struct sk_buff *skb,
>  	 * 2. skb linear data
>  	 * 3. skb fragment data
>  	 */
> -	slots_used += fill_pg_buf(virt_to_hvpfn(hdr),
> -				  offset_in_hvpage(hdr),
> -				  len,
> -				  &pb[slots_used]);
>  
> +	pb[0].offset = offset_in_hvpage(hdr);
> +	pb[0].len = len;
> +	pb[0].pfn = virt_to_hvpfn(hdr);
>  	packet->rmsg_size = len;
> -	packet->rmsg_pgcnt = slots_used;
> +	packet->rmsg_pgcnt = 1;
>  
> -	slots_used += fill_pg_buf(virt_to_hvpfn(data),
> -				  offset_in_hvpage(data),
> -				  skb_headlen(skb),
> -				  &pb[slots_used]);
> +	pb[1].offset = offset_in_hvpage(skb->data);
> +	pb[1].len = skb_headlen(skb);
> +	pb[1].pfn = virt_to_hvpfn(skb->data);
>  
>  	for (i = 0; i < frags; i++) {
>  		skb_frag_t *frag = skb_shinfo(skb)->frags + i;
> +		struct hv_page_buffer *cur_pb = &pb[i + 2];

Hi Michael,

If I understand correctly, pb is allocated on the stack in netvsc_xmit()
and has MAX_DATA_RANGES elements.

If MAX_SKB_FRAGS is large, so that MAX_DATA_RANGES has been limited to
MAX_PAGE_BUFFER_COUNT, and frags is also large, is it possible to overrun
pb here?
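
To make the concern concrete, here is a minimal userspace sketch of the
bound arithmetic, not driver code. It assumes a cap of 32 as a stand-in
for MAX_PAGE_BUFFER_COUNT (I have not checked the value against the
tree), and the helper names max_data_ranges(), slots_written() and
would_overrun() are hypothetical. It only models the "frags + 2" indexing
in the hunk above and does not account for any check that may bound frags
before init_page_array() is reached.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed stand-in for MAX_PAGE_BUFFER_COUNT; illustrative value only. */
#define ASSUMED_PAGE_BUFFER_CAP 32u

/* Mirrors the #if in the patch: min(MAX_SKB_FRAGS + 2, cap). */
static unsigned int max_data_ranges(unsigned int max_skb_frags)
{
	unsigned int want = max_skb_frags + 2;

	return want < ASSUMED_PAGE_BUFFER_CAP ? want : ASSUMED_PAGE_BUFFER_CAP;
}

/* Entries written to pb[]: rndis header, skb linear data, one per frag. */
static unsigned int slots_written(unsigned int frags)
{
	return frags + 2;
}

/* True if writing pb[0..frags + 1] would run past the end of the array. */
static bool would_overrun(unsigned int max_skb_frags, unsigned int frags)
{
	return slots_written(frags) > max_data_ranges(max_skb_frags);
}
```

With the default of 17 fragments the array has 19 entries and 17 frags
fit exactly. With a hypothetical MAX_SKB_FRAGS of 45 the array is capped
at 32 entries, so any skb with more than 30 frags would index past the
end unless something earlier prevents it.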

> +		u64 pfn = page_to_hvpfn(skb_frag_page(frag));
> +		u32 offset = skb_frag_off(frag);
>  
> -		slots_used += fill_pg_buf(page_to_hvpfn(skb_frag_page(frag)),
> -					  skb_frag_off(frag),
> -					  skb_frag_size(frag),
> -					  &pb[slots_used]);
> +		cur_pb->offset = offset_in_hvpage(offset);
> +		cur_pb->len = skb_frag_size(frag);
> +		cur_pb->pfn = pfn + (offset >> HV_HYP_PAGE_SHIFT);
>  	}
> -	return slots_used;
> +	return frags + 2;
>  }
>  
>  static int count_skb_frag_slots(struct sk_buff *skb)
> @@ -483,7 +450,7 @@ static int netvsc_xmit(struct sk_buff *skb, struct net_device *net, bool xdp_tx)
>  	struct net_device *vf_netdev;
>  	u32 rndis_msg_size;
>  	u32 hash;
> -	struct hv_page_buffer pb[MAX_PAGE_BUFFER_COUNT];
> +	struct hv_page_buffer pb[MAX_DATA_RANGES];
>  
>  	/* If VF is present and up then redirect packets to it.
>  	 * Skip the VF if it is marked down or has no carrier.

...
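
For reference, the page-crossing case described in the quoted commit
message can be sketched in userspace as follows. This is an illustration
under assumptions, not the driver's code: it assumes a 4 KiB Hyper-V page
size, and struct range, make_range() and pfns_in_range() are hypothetical
names. The point is that one (pfn, offset, len) entry can describe a
range spanning two or more contiguous PFNs, and the PFN count can be
recovered from offset and len alone.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 4 KiB Hyper-V page size for this sketch. */
#define SKETCH_HVPAGE_SHIFT 12
#define SKETCH_HVPAGE_SIZE  (1ull << SKETCH_HVPAGE_SHIFT)

/* Hypothetical stand-in for one page buffer array entry. */
struct range {
	uint64_t pfn;    /* first page frame number of the range */
	uint32_t offset; /* byte offset within that first page */
	uint32_t len;    /* total length in bytes, may cross pages */
};

/* Describe a physically contiguous buffer as a single range. */
static struct range make_range(uint64_t gpa, uint32_t len)
{
	struct range r;

	r.pfn = gpa >> SKETCH_HVPAGE_SHIFT;
	r.offset = (uint32_t)(gpa & (SKETCH_HVPAGE_SIZE - 1));
	r.len = len;
	return r;
}

/* Number of contiguous PFNs needed to cover the range. */
static uint32_t pfns_in_range(const struct range *r)
{
	uint64_t end = (uint64_t)r->offset + r->len;

	return (uint32_t)((end + SKETCH_HVPAGE_SIZE - 1) >> SKETCH_HVPAGE_SHIFT);
}
```

A 200-byte header starting 4000 bytes into a page ends at byte 4200, so
it needs two PFNs even though it remains a single range.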