netdev.vger.kernel.org archive mirror
* [PATCH net-next v2] gve: make IRQ handlers and page allocation NUMA aware
@ 2025-07-07 21:01 Jeroen de Borst
  2025-07-08 13:31 ` Simon Horman
  2025-07-10  2:40 ` patchwork-bot+netdevbpf
  0 siblings, 2 replies; 4+ messages in thread
From: Jeroen de Borst @ 2025-07-07 21:01 UTC (permalink / raw)
  To: netdev
  Cc: hramamurthy, davem, edumazet, kuba, willemb, pabeni,
	Bailey Forrest, Joshua Washington, Jeroen de Borst

From: Bailey Forrest <bcf@google.com>

All memory in GVE is currently allocated without regard for the NUMA
node of the device. Because access to NUMA-local memory is significantly
cheaper than access to a remote node, this change attempts
to ensure that page frags used in the RX path, including page pool
frags, are allocated on the NUMA node local to the gVNIC device. Note
that this attempt is best-effort. If necessary, the driver will still
allocate non-local memory, as __GFP_THISNODE is not passed. Descriptor
ring allocations are not updated, as dma_alloc_coherent handles that.
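
Illustratively, the best-effort preference amounts to the following
sketch (not an excerpt from the diff below; names follow the driver's
existing fields):

	/* Prefer the device's node; without __GFP_THISNODE the page
	 * allocator may still fall back to a remote node.
	 */
	int nid = dev_to_node(&priv->pdev->dev); /* may be NUMA_NO_NODE */
	struct page *page = alloc_pages_node(nid, GFP_KERNEL, 0);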

This change also modifies the IRQ affinity setting to only select CPUs
from the node local to the device, preserving the behavior that TX and
RX queues of the same index share CPU affinity.
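
As an illustrative example (not from the patch): with two TX and two RX
queues and a device whose local node spans CPUs 8-15, the walk pins TX0
to CPU 8 and TX1 to CPU 9, then restarts at the first node-local CPU so
that RX0 also lands on CPU 8 and RX1 on CPU 9.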

Signed-off-by: Bailey Forrest <bcf@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
---
v1: https://lore.kernel.org/netdev/20250627183141.3781516-1-hramamurthy@google.com/
v2:
- Utilize kvcalloc_node instead of kvzalloc_node for array-type
  allocations.
---
 drivers/net/ethernet/google/gve/gve.h         |  1 +
 .../ethernet/google/gve/gve_buffer_mgmt_dqo.c |  1 +
 drivers/net/ethernet/google/gve/gve_main.c    | 30 +++++++++++++++----
 drivers/net/ethernet/google/gve/gve_rx.c      | 14 ++++-----
 drivers/net/ethernet/google/gve/gve_rx_dqo.c  |  8 ++---
 5 files changed, 37 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
index cf91195d5f39..53899096e89e 100644
--- a/drivers/net/ethernet/google/gve/gve.h
+++ b/drivers/net/ethernet/google/gve/gve.h
@@ -804,6 +804,7 @@ struct gve_priv {
 	struct gve_tx_queue_config tx_cfg;
 	struct gve_rx_queue_config rx_cfg;
 	u32 num_ntfy_blks; /* split between TX and RX so must be even */
+	int numa_node;
 
 	struct gve_registers __iomem *reg_bar0; /* see gve_register.h */
 	__be32 __iomem *db_bar2; /* "array" of doorbells */
diff --git a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
index a71883e1d920..6c3c459a1b5e 100644
--- a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
+++ b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
@@ -246,6 +246,7 @@ struct page_pool *gve_rx_create_page_pool(struct gve_priv *priv,
 		.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
 		.order = 0,
 		.pool_size = GVE_PAGE_POOL_SIZE_MULTIPLIER * priv->rx_desc_cnt,
+		.nid = priv->numa_node,
 		.dev = &priv->pdev->dev,
 		.netdev = priv->dev,
 		.napi = &priv->ntfy_blocks[ntfy_id].napi,
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 27f97a1d2957..be461751ff31 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -461,10 +461,19 @@ int gve_napi_poll_dqo(struct napi_struct *napi, int budget)
 	return work_done;
 }
 
+static const struct cpumask *gve_get_node_mask(struct gve_priv *priv)
+{
+	if (priv->numa_node == NUMA_NO_NODE)
+		return cpu_all_mask;
+	else
+		return cpumask_of_node(priv->numa_node);
+}
+
 static int gve_alloc_notify_blocks(struct gve_priv *priv)
 {
 	int num_vecs_requested = priv->num_ntfy_blks + 1;
-	unsigned int active_cpus;
+	const struct cpumask *node_mask;
+	unsigned int cur_cpu;
 	int vecs_enabled;
 	int i, j;
 	int err;
@@ -503,8 +512,6 @@ static int gve_alloc_notify_blocks(struct gve_priv *priv)
 		if (priv->rx_cfg.num_queues > priv->rx_cfg.max_queues)
 			priv->rx_cfg.num_queues = priv->rx_cfg.max_queues;
 	}
-	/* Half the notification blocks go to TX and half to RX */
-	active_cpus = min_t(int, priv->num_ntfy_blks / 2, num_online_cpus());
 
 	/* Setup Management Vector  - the last vector */
 	snprintf(priv->mgmt_msix_name, sizeof(priv->mgmt_msix_name), "gve-mgmnt@pci:%s",
@@ -533,6 +540,8 @@ static int gve_alloc_notify_blocks(struct gve_priv *priv)
 	}
 
 	/* Setup the other blocks - the first n-1 vectors */
+	node_mask = gve_get_node_mask(priv);
+	cur_cpu = cpumask_first(node_mask);
 	for (i = 0; i < priv->num_ntfy_blks; i++) {
 		struct gve_notify_block *block = &priv->ntfy_blocks[i];
 		int msix_idx = i;
@@ -549,9 +558,17 @@ static int gve_alloc_notify_blocks(struct gve_priv *priv)
 			goto abort_with_some_ntfy_blocks;
 		}
 		block->irq = priv->msix_vectors[msix_idx].vector;
-		irq_set_affinity_hint(priv->msix_vectors[msix_idx].vector,
-				      get_cpu_mask(i % active_cpus));
+		irq_set_affinity_and_hint(block->irq,
+					  cpumask_of(cur_cpu));
 		block->irq_db_index = &priv->irq_db_indices[i].index;
+
+		cur_cpu = cpumask_next(cur_cpu, node_mask);
+		/* Wrap once CPUs in the node have been exhausted, or when
+		 * starting RX queue affinities. TX and RX queues of the same
+		 * index share affinity.
+		 */
+		if (cur_cpu >= nr_cpu_ids || (i + 1) == priv->tx_cfg.max_queues)
+			cur_cpu = cpumask_first(node_mask);
 	}
 	return 0;
 abort_with_some_ntfy_blocks:
@@ -1040,7 +1057,7 @@ int gve_alloc_page(struct gve_priv *priv, struct device *dev,
 		   struct page **page, dma_addr_t *dma,
 		   enum dma_data_direction dir, gfp_t gfp_flags)
 {
-	*page = alloc_page(gfp_flags);
+	*page = alloc_pages_node(priv->numa_node, gfp_flags, 0);
 	if (!*page) {
 		priv->page_alloc_fail++;
 		return -ENOMEM;
@@ -2322,6 +2339,7 @@ static int gve_init_priv(struct gve_priv *priv, bool skip_describe_device)
 	 */
 	priv->num_ntfy_blks = (num_ntfy - 1) & ~0x1;
 	priv->mgmt_msix_idx = priv->num_ntfy_blks;
+	priv->numa_node = dev_to_node(&priv->pdev->dev);
 
 	priv->tx_cfg.max_queues =
 		min_t(int, priv->tx_cfg.max_queues, priv->num_ntfy_blks / 2);
diff --git a/drivers/net/ethernet/google/gve/gve_rx.c b/drivers/net/ethernet/google/gve/gve_rx.c
index 90e875c1832f..ec424d2f4f57 100644
--- a/drivers/net/ethernet/google/gve/gve_rx.c
+++ b/drivers/net/ethernet/google/gve/gve_rx.c
@@ -192,8 +192,8 @@ static int gve_rx_prefill_pages(struct gve_rx_ring *rx,
 	 */
 	slots = rx->mask + 1;
 
-	rx->data.page_info = kvzalloc(slots *
-				      sizeof(*rx->data.page_info), GFP_KERNEL);
+	rx->data.page_info = kvcalloc_node(slots, sizeof(*rx->data.page_info),
+					   GFP_KERNEL, priv->numa_node);
 	if (!rx->data.page_info)
 		return -ENOMEM;
 
@@ -216,7 +216,8 @@ static int gve_rx_prefill_pages(struct gve_rx_ring *rx,
 
 	if (!rx->data.raw_addressing) {
 		for (j = 0; j < rx->qpl_copy_pool_mask + 1; j++) {
-			struct page *page = alloc_page(GFP_KERNEL);
+			struct page *page = alloc_pages_node(priv->numa_node,
+							     GFP_KERNEL, 0);
 
 			if (!page) {
 				err = -ENOMEM;
@@ -303,10 +304,9 @@ int gve_rx_alloc_ring_gqi(struct gve_priv *priv,
 
 	rx->qpl_copy_pool_mask = min_t(u32, U32_MAX, slots * 2) - 1;
 	rx->qpl_copy_pool_head = 0;
-	rx->qpl_copy_pool = kvcalloc(rx->qpl_copy_pool_mask + 1,
-				     sizeof(rx->qpl_copy_pool[0]),
-				     GFP_KERNEL);
-
+	rx->qpl_copy_pool = kvcalloc_node(rx->qpl_copy_pool_mask + 1,
+					  sizeof(rx->qpl_copy_pool[0]),
+					  GFP_KERNEL, priv->numa_node);
 	if (!rx->qpl_copy_pool) {
 		err = -ENOMEM;
 		goto abort_with_slots;
diff --git a/drivers/net/ethernet/google/gve/gve_rx_dqo.c b/drivers/net/ethernet/google/gve/gve_rx_dqo.c
index 96743e1d80f3..2bc4ff9da981 100644
--- a/drivers/net/ethernet/google/gve/gve_rx_dqo.c
+++ b/drivers/net/ethernet/google/gve/gve_rx_dqo.c
@@ -237,9 +237,9 @@ int gve_rx_alloc_ring_dqo(struct gve_priv *priv,
 
 	rx->dqo.num_buf_states = cfg->raw_addressing ? buffer_queue_slots :
 		gve_get_rx_pages_per_qpl_dqo(cfg->ring_size);
-	rx->dqo.buf_states = kvcalloc(rx->dqo.num_buf_states,
-				      sizeof(rx->dqo.buf_states[0]),
-				      GFP_KERNEL);
+	rx->dqo.buf_states = kvcalloc_node(rx->dqo.num_buf_states,
+					   sizeof(rx->dqo.buf_states[0]),
+					   GFP_KERNEL, priv->numa_node);
 	if (!rx->dqo.buf_states)
 		return -ENOMEM;
 
@@ -488,7 +488,7 @@ static int gve_rx_copy_ondemand(struct gve_rx_ring *rx,
 				struct gve_rx_buf_state_dqo *buf_state,
 				u16 buf_len)
 {
-	struct page *page = alloc_page(GFP_ATOMIC);
+	struct page *page = alloc_pages_node(rx->gve->numa_node, GFP_ATOMIC, 0);
 	int num_frags;
 
 	if (!page)
-- 
2.50.0.727.gbf7dc18ff4-goog



* Re: [PATCH net-next v2] gve: make IRQ handlers and page allocation NUMA aware
  2025-07-07 21:01 [PATCH net-next v2] gve: make IRQ handlers and page allocation NUMA aware Jeroen de Borst
@ 2025-07-08 13:31 ` Simon Horman
  2025-07-09 17:55   ` Joshua Washington
  2025-07-10  2:40 ` patchwork-bot+netdevbpf
  1 sibling, 1 reply; 4+ messages in thread
From: Simon Horman @ 2025-07-08 13:31 UTC (permalink / raw)
  To: Jeroen de Borst
  Cc: netdev, hramamurthy, davem, edumazet, kuba, willemb, pabeni,
	Bailey Forrest, Joshua Washington

On Mon, Jul 07, 2025 at 02:01:07PM -0700, Jeroen de Borst wrote:
> From: Bailey Forrest <bcf@google.com>
> 
> All memory in GVE is currently allocated without regard for the NUMA
> node of the device. Because access to NUMA-local memory is significantly
> cheaper than access to a remote node, this change attempts
> to ensure that page frags used in the RX path, including page pool
> frags, are allocated on the NUMA node local to the gVNIC device. Note
> that this attempt is best-effort. If necessary, the driver will still
> allocate non-local memory, as __GFP_THISNODE is not passed. Descriptor
> ring allocations are not updated, as dma_alloc_coherent handles that.
> 
> This change also modifies the IRQ affinity setting to only select CPUs
> from the node local to the device, preserving the behavior that TX and
> RX queues of the same index share CPU affinity.
> 
> Signed-off-by: Bailey Forrest <bcf@google.com>
> Signed-off-by: Joshua Washington <joshwash@google.com>
> Reviewed-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Harshitha Ramamurthy <hramamurthy@google.com>
> Signed-off-by: Jeroen de Borst <jeroendb@google.com>
> ---
> v1: https://lore.kernel.org/netdev/20250627183141.3781516-1-hramamurthy@google.com/
> v2:
> - Utilize kvcalloc_node instead of kvzalloc_node for array-type
>   allocations.

Thanks for the update.
I note that this addresses Jakub's review of v1.

I have a minor suggestion below, but I don't think it warrants
blocking progress of this patch.

Reviewed-by: Simon Horman <horms@kernel.org>

...

> diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c

...

> @@ -533,6 +540,8 @@ static int gve_alloc_notify_blocks(struct gve_priv *priv)
>  	}
>  
>  	/* Setup the other blocks - the first n-1 vectors */
> +	node_mask = gve_get_node_mask(priv);
> +	cur_cpu = cpumask_first(node_mask);
>  	for (i = 0; i < priv->num_ntfy_blks; i++) {
>  		struct gve_notify_block *block = &priv->ntfy_blocks[i];
>  		int msix_idx = i;
> @@ -549,9 +558,17 @@ static int gve_alloc_notify_blocks(struct gve_priv *priv)
>  			goto abort_with_some_ntfy_blocks;
>  		}
>  		block->irq = priv->msix_vectors[msix_idx].vector;
> -		irq_set_affinity_hint(priv->msix_vectors[msix_idx].vector,
> -				      get_cpu_mask(i % active_cpus));
> +		irq_set_affinity_and_hint(block->irq,
> +					  cpumask_of(cur_cpu));
>  		block->irq_db_index = &priv->irq_db_indices[i].index;
> +
> +		cur_cpu = cpumask_next(cur_cpu, node_mask);
> +		/* Wrap once CPUs in the node have been exhausted, or when
> +		 * starting RX queue affinities. TX and RX queues of the same
> +		 * index share affinity.
> +		 */
> +		if (cur_cpu >= nr_cpu_ids || (i + 1) == priv->tx_cfg.max_queues)
> +			cur_cpu = cpumask_first(node_mask);

FWIIW, maybe this can be written more succinctly as follows.
(Completely untested!)

		/* TX and RX queues of the same index share affinity. */
		if (i + 1 == priv->tx_cfg.max_queues)
			cur_cpu = cpumask_first(node_mask);
		else
			cur_cpu = cpumask_next_wrap(cur_cpu, node_mask);

...


* Re: [PATCH net-next v2] gve: make IRQ handlers and page allocation NUMA aware
  2025-07-08 13:31 ` Simon Horman
@ 2025-07-09 17:55   ` Joshua Washington
  0 siblings, 0 replies; 4+ messages in thread
From: Joshua Washington @ 2025-07-09 17:55 UTC (permalink / raw)
  To: Simon Horman
  Cc: Jeroen de Borst, netdev, Harshitha Ramamurthy, davem,
	Eric Dumazet, Jakub Kicinski, Willem de Bruijn, Paolo Abeni,
	Bailey Forrest

Thanks for the feedback.

> > +             cur_cpu = cpumask_next(cur_cpu, node_mask);
> > +             /* Wrap once CPUs in the node have been exhausted, or when
> > +              * starting RX queue affinities. TX and RX queues of the same
> > +              * index share affinity.
> > +              */
> > +             if (cur_cpu >= nr_cpu_ids || (i + 1) == priv->tx_cfg.max_queues)
> > +                     cur_cpu = cpumask_first(node_mask);
>
> FWIIW, maybe this can be written more succinctly as follows.
> (Completely untested!)
>
>                 /* TX and RX queues of the same index share affinity. */
>                 if (i + 1 == priv->tx_cfg.max_queues)
>                         cur_cpu = cpumask_first(node_mask);
>                 else
>                         cur_cpu = cpumask_next_wrap(cur_cpu, node_mask);

I personally do not have a strong opinion on this, so I'll update it
if further feedback requires another patch revision. Otherwise, I will
leave it as-is, as the feedback does not seem to be blocking.


* Re: [PATCH net-next v2] gve: make IRQ handlers and page allocation NUMA aware
  2025-07-07 21:01 [PATCH net-next v2] gve: make IRQ handlers and page allocation NUMA aware Jeroen de Borst
  2025-07-08 13:31 ` Simon Horman
@ 2025-07-10  2:40 ` patchwork-bot+netdevbpf
  1 sibling, 0 replies; 4+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-07-10  2:40 UTC (permalink / raw)
  To: Jeroen de Borst
  Cc: netdev, hramamurthy, davem, edumazet, kuba, willemb, pabeni, bcf,
	joshwash

Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon,  7 Jul 2025 14:01:07 -0700 you wrote:
> From: Bailey Forrest <bcf@google.com>
> 
> All memory in GVE is currently allocated without regard for the NUMA
> node of the device. Because access to NUMA-local memory is significantly
> cheaper than access to a remote node, this change attempts
> to ensure that page frags used in the RX path, including page pool
> frags, are allocated on the NUMA node local to the gVNIC device. Note
> that this attempt is best-effort. If necessary, the driver will still
> allocate non-local memory, as __GFP_THISNODE is not passed. Descriptor
> ring allocations are not updated, as dma_alloc_coherent handles that.
> 
> [...]

Here is the summary with links:
  - [net-next,v2] gve: make IRQ handlers and page allocation NUMA aware
    https://git.kernel.org/netdev/net-next/c/d991666b7b69

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



