* [PATCH net 1/8] ice: respect netif readiness in AF_XDP ZC related ndo's
2024-07-08 22:14 [PATCH net 0/8][pull request] ice: fix AF_XDP ZC timeout and concurrency issues Tony Nguyen
@ 2024-07-08 22:14 ` Tony Nguyen
2024-07-08 22:14 ` [PATCH net 2/8] ice: don't busy wait for Rx queue disable in ice_qp_dis() Tony Nguyen
` (6 subsequent siblings)
7 siblings, 0 replies; 18+ messages in thread
From: Tony Nguyen @ 2024-07-08 22:14 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, netdev
Cc: Michal Kubiak, anthony.l.nguyen, maciej.fijalkowski,
magnus.karlsson, aleksander.lobakin, ast, daniel, hawk,
john.fastabend, bpf, Shannon Nelson, Chandan Kumar Rout
From: Michal Kubiak <michal.kubiak@intel.com>
Address a scenario in which XSK ZC Tx produces descriptors to the XDP Tx
ring when the link is either not yet fully initialized or the process of
stopping the netdev has already started. To avoid this, add checks
against carrier readiness in ice_xsk_wakeup() and in ice_xmit_zc().
One could argue that bailing out early in ice_xsk_wakeup() would be
sufficient, but given the fact that we produce Tx descriptors on behalf
of NAPI that is triggered for Rx traffic, the latter check is also needed.
Bringing the link up is an asynchronous event executed within
ice_service_task, so even though the interface has been brought up there
is still a time frame where the link is not yet up.
Without this patch, when AF_XDP ZC Tx is used simultaneously with stack
Tx, Tx timeouts occur after going through a link flap (admin brings the
interface down, then up again). HW seems to be unable to transmit a
descriptor to the wire after the HW tail register bump, which in turn
causes the __QUEUE_STATE_STACK_XOFF bit to be set forever, as
netdev_tx_completed_queue() sees no cleaned bytes on its input.
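The intent of both checks can be summarized in a short sketch
(illustrative only; the ice_zc_tx_allowed() helper name is made up here,
the patch itself open-codes the condition in ice_xmit_zc() and
ice_xsk_wakeup()):

/* Sketch: produce ZC Tx descriptors only when the netdev is fully up;
 * otherwise bail out and retry on a later NAPI cycle once the carrier
 * is reported as OK.
 */
static bool ice_zc_tx_allowed(struct ice_tx_ring *xdp_ring)
{
        struct net_device *netdev = xdp_ring->vsi->netdev;

        return netif_carrier_ok(netdev) && netif_running(netdev);
}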
Fixes: 126cdfe1007a ("ice: xsk: Improve AF_XDP ZC Tx and use batching API")
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP")
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/ice/ice_xsk.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index a65955eb23c0..72738b8b8a68 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -1048,6 +1048,10 @@ bool ice_xmit_zc(struct ice_tx_ring *xdp_ring)
ice_clean_xdp_irq_zc(xdp_ring);
+ if (!netif_carrier_ok(xdp_ring->vsi->netdev) ||
+ !netif_running(xdp_ring->vsi->netdev))
+ return true;
+
budget = ICE_DESC_UNUSED(xdp_ring);
budget = min_t(u16, budget, ICE_RING_QUARTER(xdp_ring));
@@ -1091,7 +1095,7 @@ ice_xsk_wakeup(struct net_device *netdev, u32 queue_id,
struct ice_vsi *vsi = np->vsi;
struct ice_tx_ring *ring;
- if (test_bit(ICE_VSI_DOWN, vsi->state))
+ if (test_bit(ICE_VSI_DOWN, vsi->state) || !netif_carrier_ok(netdev))
return -ENETDOWN;
if (!ice_is_xdp_ena_vsi(vsi))
--
2.41.0
* [PATCH net 2/8] ice: don't busy wait for Rx queue disable in ice_qp_dis()
2024-07-08 22:14 [PATCH net 0/8][pull request] ice: fix AF_XDP ZC timeout and concurrency issues Tony Nguyen
2024-07-08 22:14 ` [PATCH net 1/8] ice: respect netif readiness in AF_XDP ZC related ndo's Tony Nguyen
@ 2024-07-08 22:14 ` Tony Nguyen
2024-07-08 22:14 ` [PATCH net 3/8] ice: replace synchronize_rcu with synchronize_net Tony Nguyen
` (5 subsequent siblings)
7 siblings, 0 replies; 18+ messages in thread
From: Tony Nguyen @ 2024-07-08 22:14 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, netdev
Cc: Maciej Fijalkowski, anthony.l.nguyen, magnus.karlsson,
aleksander.lobakin, ast, daniel, hawk, john.fastabend, bpf,
Shannon Nelson, Chandan Kumar Rout
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
When the ice driver is spammed with multiple xdpsock instances and flow
control is enabled, there are cases when the Rx queue gets stuck and is
unable to reflect the disable state in the QRX_CTRL register. A similar
issue has previously been addressed in commit 13a6233b033f ("ice: Add
support to enable/disable all Rx queues before waiting").
To work around this, simply do not wait for the disabled state, as a
later patch will make sure that, regardless of any error encountered in
the process of disabling a queue pair, the Rx queue will be enabled
again.
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP")
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/ice/ice_xsk.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 72738b8b8a68..3104a5657b83 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -199,10 +199,8 @@ static int ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
if (err)
return err;
}
- err = ice_vsi_ctrl_one_rx_ring(vsi, false, q_idx, true);
- if (err)
- return err;
+ ice_vsi_ctrl_one_rx_ring(vsi, false, q_idx, false);
ice_qp_clean_rings(vsi, q_idx);
ice_qp_reset_stats(vsi, q_idx);
--
2.41.0
* [PATCH net 3/8] ice: replace synchronize_rcu with synchronize_net
2024-07-08 22:14 [PATCH net 0/8][pull request] ice: fix AF_XDP ZC timeout and concurrency issues Tony Nguyen
2024-07-08 22:14 ` [PATCH net 1/8] ice: respect netif readiness in AF_XDP ZC related ndo's Tony Nguyen
2024-07-08 22:14 ` [PATCH net 2/8] ice: don't busy wait for Rx queue disable in ice_qp_dis() Tony Nguyen
@ 2024-07-08 22:14 ` Tony Nguyen
2024-07-08 22:14 ` [PATCH net 4/8] ice: modify error handling when setting XSK pool in ndo_bpf Tony Nguyen
` (4 subsequent siblings)
7 siblings, 0 replies; 18+ messages in thread
From: Tony Nguyen @ 2024-07-08 22:14 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, netdev
Cc: Maciej Fijalkowski, anthony.l.nguyen, magnus.karlsson,
aleksander.lobakin, ast, daniel, hawk, john.fastabend, bpf,
Shannon Nelson, Chandan Kumar Rout
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Given that ice_qp_dis() is called under rtnl_lock, synchronize_net() can
be called instead of synchronize_rcu() so that XDP rings can finish their
job in a faster way. Also, do this earlier in the XSK queue disable flow.
Additionally, turn off the regular Tx queue before disabling IRQs and
NAPI.
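The speed-up comes from how synchronize_net() picks the grace-period
flavor; paraphrased from net/core/dev.c, it is roughly:

void synchronize_net(void)
{
        might_sleep();
        if (rtnl_is_locked())
                synchronize_rcu_expedited();
        else
                synchronize_rcu();
}

Since ice_qp_dis() runs with rtnl held, the expedited variant is used and
the wait for in-flight NAPI/XDP processing is much shorter.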
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP")
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/ice/ice_xsk.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 3104a5657b83..ba50af9a5929 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -52,10 +52,8 @@ static void ice_qp_reset_stats(struct ice_vsi *vsi, u16 q_idx)
static void ice_qp_clean_rings(struct ice_vsi *vsi, u16 q_idx)
{
ice_clean_tx_ring(vsi->tx_rings[q_idx]);
- if (ice_is_xdp_ena_vsi(vsi)) {
- synchronize_rcu();
+ if (ice_is_xdp_ena_vsi(vsi))
ice_clean_tx_ring(vsi->xdp_rings[q_idx]);
- }
ice_clean_rx_ring(vsi->rx_rings[q_idx]);
}
@@ -180,11 +178,12 @@ static int ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
usleep_range(1000, 2000);
}
+ synchronize_net();
+ netif_tx_stop_queue(netdev_get_tx_queue(vsi->netdev, q_idx));
+
ice_qvec_dis_irq(vsi, rx_ring, q_vector);
ice_qvec_toggle_napi(vsi, q_vector, false);
- netif_tx_stop_queue(netdev_get_tx_queue(vsi->netdev, q_idx));
-
ice_fill_txq_meta(vsi, tx_ring, &txq_meta);
err = ice_vsi_stop_tx_ring(vsi, ICE_NO_RESET, 0, tx_ring, &txq_meta);
if (err)
--
2.41.0
* [PATCH net 4/8] ice: modify error handling when setting XSK pool in ndo_bpf
2024-07-08 22:14 [PATCH net 0/8][pull request] ice: fix AF_XDP ZC timeout and concurrency issues Tony Nguyen
` (2 preceding siblings ...)
2024-07-08 22:14 ` [PATCH net 3/8] ice: replace synchronize_rcu with synchronize_net Tony Nguyen
@ 2024-07-08 22:14 ` Tony Nguyen
2024-07-08 22:14 ` [PATCH net 5/8] ice: toggle netif_carrier when setting up XSK pool Tony Nguyen
` (3 subsequent siblings)
7 siblings, 0 replies; 18+ messages in thread
From: Tony Nguyen @ 2024-07-08 22:14 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, netdev
Cc: Maciej Fijalkowski, anthony.l.nguyen, magnus.karlsson,
aleksander.lobakin, ast, daniel, hawk, john.fastabend, bpf,
Shannon Nelson, Chandan Kumar Rout
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Don't bail out as soon as an error is spotted within ice_qp_{dis,ena}();
instead, track the error and go through the whole flow of disabling and
enabling the queue pair.
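The resulting pattern in ice_qp_{dis,ena}() is the usual "record the
first error, keep going" scheme; a minimal sketch with made-up step
names:

        int fail = 0, err;

        err = first_step();
        if (!fail)
                fail = err;

        err = second_step();    /* still executed even if first_step() failed */
        if (!fail)
                fail = err;

        return fail;            /* 0 on success, otherwise the first error seen */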
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP")
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/ice/ice_xsk.c | 30 +++++++++++++-----------
1 file changed, 16 insertions(+), 14 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index ba50af9a5929..902096b000f5 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -162,6 +162,7 @@ static int ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
struct ice_tx_ring *tx_ring;
struct ice_rx_ring *rx_ring;
int timeout = 50;
+ int fail = 0;
int err;
if (q_idx >= vsi->num_rxq || q_idx >= vsi->num_txq)
@@ -186,8 +187,8 @@ static int ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
ice_fill_txq_meta(vsi, tx_ring, &txq_meta);
err = ice_vsi_stop_tx_ring(vsi, ICE_NO_RESET, 0, tx_ring, &txq_meta);
- if (err)
- return err;
+ if (!fail)
+ fail = err;
if (ice_is_xdp_ena_vsi(vsi)) {
struct ice_tx_ring *xdp_ring = vsi->xdp_rings[q_idx];
@@ -195,15 +196,15 @@ static int ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
ice_fill_txq_meta(vsi, xdp_ring, &txq_meta);
err = ice_vsi_stop_tx_ring(vsi, ICE_NO_RESET, 0, xdp_ring,
&txq_meta);
- if (err)
- return err;
+ if (!fail)
+ fail = err;
}
ice_vsi_ctrl_one_rx_ring(vsi, false, q_idx, false);
ice_qp_clean_rings(vsi, q_idx);
ice_qp_reset_stats(vsi, q_idx);
- return 0;
+ return fail;
}
/**
@@ -216,32 +217,33 @@ static int ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
static int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
{
struct ice_q_vector *q_vector;
+ int fail = 0;
int err;
err = ice_vsi_cfg_single_txq(vsi, vsi->tx_rings, q_idx);
- if (err)
- return err;
+ if (!fail)
+ fail = err;
if (ice_is_xdp_ena_vsi(vsi)) {
struct ice_tx_ring *xdp_ring = vsi->xdp_rings[q_idx];
err = ice_vsi_cfg_single_txq(vsi, vsi->xdp_rings, q_idx);
- if (err)
- return err;
+ if (!fail)
+ fail = err;
ice_set_ring_xdp(xdp_ring);
ice_tx_xsk_pool(vsi, q_idx);
}
err = ice_vsi_cfg_single_rxq(vsi, q_idx);
- if (err)
- return err;
+ if (!fail)
+ fail = err;
q_vector = vsi->rx_rings[q_idx]->q_vector;
ice_qvec_cfg_msix(vsi, q_vector);
err = ice_vsi_ctrl_one_rx_ring(vsi, true, q_idx, true);
- if (err)
- return err;
+ if (!fail)
+ fail = err;
ice_qvec_toggle_napi(vsi, q_vector, true);
ice_qvec_ena_irq(vsi, q_vector);
@@ -249,7 +251,7 @@ static int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
netif_tx_start_queue(netdev_get_tx_queue(vsi->netdev, q_idx));
clear_bit(ICE_CFG_BUSY, vsi->state);
- return 0;
+ return fail;
}
/**
--
2.41.0
* [PATCH net 5/8] ice: toggle netif_carrier when setting up XSK pool
2024-07-08 22:14 [PATCH net 0/8][pull request] ice: fix AF_XDP ZC timeout and concurrency issues Tony Nguyen
` (3 preceding siblings ...)
2024-07-08 22:14 ` [PATCH net 4/8] ice: modify error handling when setting XSK pool in ndo_bpf Tony Nguyen
@ 2024-07-08 22:14 ` Tony Nguyen
2024-07-08 22:14 ` [PATCH net 6/8] ice: improve updating ice_{t, r}x_ring::xsk_pool Tony Nguyen
` (2 subsequent siblings)
7 siblings, 0 replies; 18+ messages in thread
From: Tony Nguyen @ 2024-07-08 22:14 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, netdev
Cc: Maciej Fijalkowski, anthony.l.nguyen, magnus.karlsson,
aleksander.lobakin, ast, daniel, hawk, john.fastabend, bpf,
Shannon Nelson, Chandan Kumar Rout
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
This is to prevent Tx timeout issues. One of the conditions checked by
dev_watchdog(), which runs in the background, is netif_carrier_ok(), so
turn the carrier off when disabling the queues that belong to a q_vector
where an XSK pool is being configured. Turn the carrier on in
ice_qp_ena() only when ice_get_link_status() reports that the physical
link is up.
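For context, dev_watchdog() only scans for stalled Tx queues when the
carrier is reported as up; the gating condition in
net/sched/sch_generic.c is roughly:

        if (netif_device_present(dev) &&
            netif_running(dev) &&
            netif_carrier_ok(dev)) {
                /* walk the Tx queues and invoke ndo_tx_timeout() on a stall */
        }

so dropping the carrier for the duration of the queue pair
reconfiguration keeps the watchdog from declaring a spurious timeout.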
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP")
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/ice/ice_xsk.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 902096b000f5..3fbe4cfadfbf 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -180,6 +180,7 @@ static int ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
}
synchronize_net();
+ netif_carrier_off(vsi->netdev);
netif_tx_stop_queue(netdev_get_tx_queue(vsi->netdev, q_idx));
ice_qvec_dis_irq(vsi, rx_ring, q_vector);
@@ -218,6 +219,7 @@ static int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
{
struct ice_q_vector *q_vector;
int fail = 0;
+ bool link_up;
int err;
err = ice_vsi_cfg_single_txq(vsi, vsi->tx_rings, q_idx);
@@ -248,7 +250,11 @@ static int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
ice_qvec_toggle_napi(vsi, q_vector, true);
ice_qvec_ena_irq(vsi, q_vector);
- netif_tx_start_queue(netdev_get_tx_queue(vsi->netdev, q_idx));
+ ice_get_link_status(vsi->port_info, &link_up);
+ if (link_up) {
+ netif_tx_start_queue(netdev_get_tx_queue(vsi->netdev, q_idx));
+ netif_carrier_on(vsi->netdev);
+ }
clear_bit(ICE_CFG_BUSY, vsi->state);
return fail;
--
2.41.0
* [PATCH net 6/8] ice: improve updating ice_{t, r}x_ring::xsk_pool
2024-07-08 22:14 [PATCH net 0/8][pull request] ice: fix AF_XDP ZC timeout and concurrency issues Tony Nguyen
` (4 preceding siblings ...)
2024-07-08 22:14 ` [PATCH net 5/8] ice: toggle netif_carrier when setting up XSK pool Tony Nguyen
@ 2024-07-08 22:14 ` Tony Nguyen
2024-07-10 1:45 ` Jakub Kicinski
2024-07-08 22:14 ` [PATCH net 7/8] ice: add missing WRITE_ONCE when clearing ice_rx_ring::xdp_prog Tony Nguyen
2024-07-08 22:14 ` [PATCH net 8/8] ice: xsk: fix txq interrupt mapping Tony Nguyen
7 siblings, 1 reply; 18+ messages in thread
From: Tony Nguyen @ 2024-07-08 22:14 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, netdev
Cc: Maciej Fijalkowski, anthony.l.nguyen, magnus.karlsson,
aleksander.lobakin, ast, daniel, hawk, john.fastabend, bpf,
Shannon Nelson, Chandan Kumar Rout
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
The xsk_buff_pool pointers held by the ice ring structs are updated via
ndo_bpf, which executes in process context, while they can be read by a
remote CPU at the same time within NAPI poll. Use synchronize_net()
after the pointer update and {READ,WRITE}_ONCE() when working with the
mentioned pointers.
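Condensed, the publish/consume pairing looks like the sketch below (the
full sequence lives in ice_qp_dis()/ice_qp_ena() in the diff):

        /* updater: ndo_bpf, process context */
        WRITE_ONCE(ring->xsk_pool, ice_get_xp_from_qid(vsi, qid)); /* may be NULL */
        ...
        synchronize_net();      /* no NAPI poll is still using the old value */

        /* reader: NAPI poll */
        struct xsk_buff_pool *xsk_pool = READ_ONCE(ring->xsk_pool);

        if (xsk_pool)
                /* zero-copy path, keep working on the local snapshot */;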
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP")
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/ice/ice.h | 11 ++--
drivers/net/ethernet/intel/ice/ice_base.c | 4 +-
drivers/net/ethernet/intel/ice/ice_main.c | 2 +-
drivers/net/ethernet/intel/ice/ice_txrx.c | 4 +-
drivers/net/ethernet/intel/ice/ice_xsk.c | 78 ++++++++++++++---------
drivers/net/ethernet/intel/ice/ice_xsk.h | 4 +-
6 files changed, 61 insertions(+), 42 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index 99a75a59078e..caaa10157909 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -765,18 +765,17 @@ static inline struct xsk_buff_pool *ice_get_xp_from_qid(struct ice_vsi *vsi,
}
/**
- * ice_xsk_pool - get XSK buffer pool bound to a ring
+ * ice_rx_xsk_pool - assign XSK buff pool to Rx ring
* @ring: Rx ring to use
*
- * Returns a pointer to xsk_buff_pool structure if there is a buffer pool
- * present, NULL otherwise.
+ * Sets XSK buff pool pointer on Rx ring.
*/
-static inline struct xsk_buff_pool *ice_xsk_pool(struct ice_rx_ring *ring)
+static inline void ice_rx_xsk_pool(struct ice_rx_ring *ring)
{
struct ice_vsi *vsi = ring->vsi;
u16 qid = ring->q_index;
- return ice_get_xp_from_qid(vsi, qid);
+ WRITE_ONCE(ring->xsk_pool, ice_get_xp_from_qid(vsi, qid));
}
/**
@@ -801,7 +800,7 @@ static inline void ice_tx_xsk_pool(struct ice_vsi *vsi, u16 qid)
if (!ring)
return;
- ring->xsk_pool = ice_get_xp_from_qid(vsi, qid);
+ WRITE_ONCE(ring->xsk_pool, ice_get_xp_from_qid(vsi, qid));
}
/**
diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index 5d396c1a7731..1facf179a96f 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -536,7 +536,7 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
return err;
}
- ring->xsk_pool = ice_xsk_pool(ring);
+ ice_rx_xsk_pool(ring);
if (ring->xsk_pool) {
xdp_rxq_info_unreg(&ring->xdp_rxq);
@@ -597,7 +597,7 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
return 0;
}
- ok = ice_alloc_rx_bufs_zc(ring, num_bufs);
+ ok = ice_alloc_rx_bufs_zc(ring, ring->xsk_pool, num_bufs);
if (!ok) {
u16 pf_q = ring->vsi->rxq_map[ring->q_index];
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 55a42aad92a5..9b075dd48889 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -2949,7 +2949,7 @@ static void ice_vsi_rx_napi_schedule(struct ice_vsi *vsi)
ice_for_each_rxq(vsi, i) {
struct ice_rx_ring *rx_ring = vsi->rx_rings[i];
- if (rx_ring->xsk_pool)
+ if (READ_ONCE(rx_ring->xsk_pool))
napi_schedule(&rx_ring->q_vector->napi);
}
}
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 8bb743f78fcb..f4b2b1bca234 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1523,7 +1523,7 @@ int ice_napi_poll(struct napi_struct *napi, int budget)
ice_for_each_tx_ring(tx_ring, q_vector->tx) {
bool wd;
- if (tx_ring->xsk_pool)
+ if (READ_ONCE(tx_ring->xsk_pool))
wd = ice_xmit_zc(tx_ring);
else if (ice_ring_is_xdp(tx_ring))
wd = true;
@@ -1556,7 +1556,7 @@ int ice_napi_poll(struct napi_struct *napi, int budget)
* comparison in the irq context instead of many inside the
* ice_clean_rx_irq function and makes the codebase cleaner.
*/
- cleaned = rx_ring->xsk_pool ?
+ cleaned = READ_ONCE(rx_ring->xsk_pool) ?
ice_clean_rx_irq_zc(rx_ring, budget_per_ring) :
ice_clean_rx_irq(rx_ring, budget_per_ring);
work_done += cleaned;
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 3fbe4cfadfbf..b4058c4937bc 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -250,6 +250,8 @@ static int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
ice_qvec_toggle_napi(vsi, q_vector, true);
ice_qvec_ena_irq(vsi, q_vector);
+ /* make sure NAPI sees updated ice_{t,x}_ring::xsk_pool */
+ synchronize_net();
ice_get_link_status(vsi->port_info, &link_up);
if (link_up) {
netif_tx_start_queue(netdev_get_tx_queue(vsi->netdev, q_idx));
@@ -464,6 +466,7 @@ static u16 ice_fill_rx_descs(struct xsk_buff_pool *pool, struct xdp_buff **xdp,
/**
* __ice_alloc_rx_bufs_zc - allocate a number of Rx buffers
* @rx_ring: Rx ring
+ * @xsk_pool: XSK buffer pool to pick buffers to be filled by HW
* @count: The number of buffers to allocate
*
* Place the @count of descriptors onto Rx ring. Handle the ring wrap
@@ -472,7 +475,8 @@ static u16 ice_fill_rx_descs(struct xsk_buff_pool *pool, struct xdp_buff **xdp,
*
* Returns true if all allocations were successful, false if any fail.
*/
-static bool __ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
+static bool __ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring,
+ struct xsk_buff_pool *xsk_pool, u16 count)
{
u32 nb_buffs_extra = 0, nb_buffs = 0;
union ice_32b_rx_flex_desc *rx_desc;
@@ -484,8 +488,7 @@ static bool __ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
xdp = ice_xdp_buf(rx_ring, ntu);
if (ntu + count >= rx_ring->count) {
- nb_buffs_extra = ice_fill_rx_descs(rx_ring->xsk_pool, xdp,
- rx_desc,
+ nb_buffs_extra = ice_fill_rx_descs(xsk_pool, xdp, rx_desc,
rx_ring->count - ntu);
if (nb_buffs_extra != rx_ring->count - ntu) {
ntu += nb_buffs_extra;
@@ -498,7 +501,7 @@ static bool __ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
ice_release_rx_desc(rx_ring, 0);
}
- nb_buffs = ice_fill_rx_descs(rx_ring->xsk_pool, xdp, rx_desc, count);
+ nb_buffs = ice_fill_rx_descs(xsk_pool, xdp, rx_desc, count);
ntu += nb_buffs;
if (ntu == rx_ring->count)
@@ -514,6 +517,7 @@ static bool __ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
/**
* ice_alloc_rx_bufs_zc - allocate a number of Rx buffers
* @rx_ring: Rx ring
+ * @xsk_pool: XSK buffer pool to pick buffers to be filled by HW
* @count: The number of buffers to allocate
*
* Wrapper for internal allocation routine; figure out how many tail
@@ -521,7 +525,8 @@ static bool __ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
*
* Returns true if all calls to internal alloc routine succeeded
*/
-bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
+bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring,
+ struct xsk_buff_pool *xsk_pool, u16 count)
{
u16 rx_thresh = ICE_RING_QUARTER(rx_ring);
u16 leftover, i, tail_bumps;
@@ -530,9 +535,9 @@ bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
leftover = count - (tail_bumps * rx_thresh);
for (i = 0; i < tail_bumps; i++)
- if (!__ice_alloc_rx_bufs_zc(rx_ring, rx_thresh))
+ if (!__ice_alloc_rx_bufs_zc(rx_ring, xsk_pool, rx_thresh))
return false;
- return __ice_alloc_rx_bufs_zc(rx_ring, leftover);
+ return __ice_alloc_rx_bufs_zc(rx_ring, xsk_pool, leftover);
}
/**
@@ -653,7 +658,7 @@ static u32 ice_clean_xdp_irq_zc(struct ice_tx_ring *xdp_ring)
if (xdp_ring->next_to_clean >= cnt)
xdp_ring->next_to_clean -= cnt;
if (xsk_frames)
- xsk_tx_completed(xdp_ring->xsk_pool, xsk_frames);
+ xsk_tx_completed(READ_ONCE(xdp_ring->xsk_pool), xsk_frames);
return completed_frames;
}
@@ -705,7 +710,8 @@ static int ice_xmit_xdp_tx_zc(struct xdp_buff *xdp,
dma_addr_t dma;
dma = xsk_buff_xdp_get_dma(xdp);
- xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, size);
+ xsk_buff_raw_dma_sync_for_device(READ_ONCE(xdp_ring->xsk_pool),
+ dma, size);
tx_buf->xdp = xdp;
tx_buf->type = ICE_TX_BUF_XSK_TX;
@@ -763,7 +769,8 @@ ice_run_xdp_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
if (!err)
return ICE_XDP_REDIR;
- if (xsk_uses_need_wakeup(rx_ring->xsk_pool) && err == -ENOBUFS)
+ if (xsk_uses_need_wakeup(READ_ONCE(rx_ring->xsk_pool)) &&
+ err == -ENOBUFS)
result = ICE_XDP_EXIT;
else
result = ICE_XDP_CONSUMED;
@@ -832,8 +839,8 @@ ice_add_xsk_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *first,
*/
int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
{
+ struct xsk_buff_pool *xsk_pool = READ_ONCE(rx_ring->xsk_pool);
unsigned int total_rx_bytes = 0, total_rx_packets = 0;
- struct xsk_buff_pool *xsk_pool = rx_ring->xsk_pool;
u32 ntc = rx_ring->next_to_clean;
u32 ntu = rx_ring->next_to_use;
struct xdp_buff *first = NULL;
@@ -945,7 +952,8 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
rx_ring->next_to_clean = ntc;
entries_to_alloc = ICE_RX_DESC_UNUSED(rx_ring);
if (entries_to_alloc > ICE_RING_QUARTER(rx_ring))
- failure |= !ice_alloc_rx_bufs_zc(rx_ring, entries_to_alloc);
+ failure |= !ice_alloc_rx_bufs_zc(rx_ring, xsk_pool,
+ entries_to_alloc);
ice_finalize_xdp_rx(xdp_ring, xdp_xmit, 0);
ice_update_rx_ring_stats(rx_ring, total_rx_packets, total_rx_bytes);
@@ -968,17 +976,19 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
/**
* ice_xmit_pkt - produce a single HW Tx descriptor out of AF_XDP descriptor
* @xdp_ring: XDP ring to produce the HW Tx descriptor on
+ * @xsk_pool: XSK buffer pool to pick buffers to be consumed by HW
* @desc: AF_XDP descriptor to pull the DMA address and length from
* @total_bytes: bytes accumulator that will be used for stats update
*/
-static void ice_xmit_pkt(struct ice_tx_ring *xdp_ring, struct xdp_desc *desc,
+static void ice_xmit_pkt(struct ice_tx_ring *xdp_ring,
+ struct xsk_buff_pool *xsk_pool, struct xdp_desc *desc,
unsigned int *total_bytes)
{
struct ice_tx_desc *tx_desc;
dma_addr_t dma;
- dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc->addr);
- xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc->len);
+ dma = xsk_buff_raw_get_dma(xsk_pool, desc->addr);
+ xsk_buff_raw_dma_sync_for_device(xsk_pool, dma, desc->len);
tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_to_use++);
tx_desc->buf_addr = cpu_to_le64(dma);
@@ -991,10 +1001,13 @@ static void ice_xmit_pkt(struct ice_tx_ring *xdp_ring, struct xdp_desc *desc,
/**
* ice_xmit_pkt_batch - produce a batch of HW Tx descriptors out of AF_XDP descriptors
* @xdp_ring: XDP ring to produce the HW Tx descriptors on
+ * @xsk_pool: XSK buffer pool to pick buffers to be consumed by HW
* @descs: AF_XDP descriptors to pull the DMA addresses and lengths from
* @total_bytes: bytes accumulator that will be used for stats update
*/
-static void ice_xmit_pkt_batch(struct ice_tx_ring *xdp_ring, struct xdp_desc *descs,
+static void ice_xmit_pkt_batch(struct ice_tx_ring *xdp_ring,
+ struct xsk_buff_pool *xsk_pool,
+ struct xdp_desc *descs,
unsigned int *total_bytes)
{
u16 ntu = xdp_ring->next_to_use;
@@ -1004,8 +1017,8 @@ static void ice_xmit_pkt_batch(struct ice_tx_ring *xdp_ring, struct xdp_desc *de
loop_unrolled_for(i = 0; i < PKTS_PER_BATCH; i++) {
dma_addr_t dma;
- dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, descs[i].addr);
- xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, descs[i].len);
+ dma = xsk_buff_raw_get_dma(xsk_pool, descs[i].addr);
+ xsk_buff_raw_dma_sync_for_device(xsk_pool, dma, descs[i].len);
tx_desc = ICE_TX_DESC(xdp_ring, ntu++);
tx_desc->buf_addr = cpu_to_le64(dma);
@@ -1021,21 +1034,24 @@ static void ice_xmit_pkt_batch(struct ice_tx_ring *xdp_ring, struct xdp_desc *de
/**
* ice_fill_tx_hw_ring - produce the number of Tx descriptors onto ring
* @xdp_ring: XDP ring to produce the HW Tx descriptors on
+ * @xsk_pool: XSK buffer pool to pick buffers to be consumed by HW
* @descs: AF_XDP descriptors to pull the DMA addresses and lengths from
* @nb_pkts: count of packets to be send
* @total_bytes: bytes accumulator that will be used for stats update
*/
-static void ice_fill_tx_hw_ring(struct ice_tx_ring *xdp_ring, struct xdp_desc *descs,
- u32 nb_pkts, unsigned int *total_bytes)
+static void ice_fill_tx_hw_ring(struct ice_tx_ring *xdp_ring,
+ struct xsk_buff_pool *xsk_pool,
+ struct xdp_desc *descs, u32 nb_pkts,
+ unsigned int *total_bytes)
{
u32 batched, leftover, i;
batched = ALIGN_DOWN(nb_pkts, PKTS_PER_BATCH);
leftover = nb_pkts & (PKTS_PER_BATCH - 1);
for (i = 0; i < batched; i += PKTS_PER_BATCH)
- ice_xmit_pkt_batch(xdp_ring, &descs[i], total_bytes);
+ ice_xmit_pkt_batch(xdp_ring, xsk_pool, &descs[i], total_bytes);
for (; i < batched + leftover; i++)
- ice_xmit_pkt(xdp_ring, &descs[i], total_bytes);
+ ice_xmit_pkt(xdp_ring, xsk_pool, &descs[i], total_bytes);
}
/**
@@ -1046,7 +1062,8 @@ static void ice_fill_tx_hw_ring(struct ice_tx_ring *xdp_ring, struct xdp_desc *d
*/
bool ice_xmit_zc(struct ice_tx_ring *xdp_ring)
{
- struct xdp_desc *descs = xdp_ring->xsk_pool->tx_descs;
+ struct xsk_buff_pool *xsk_pool = READ_ONCE(xdp_ring->xsk_pool);
+ struct xdp_desc *descs = xsk_pool->tx_descs;
u32 nb_pkts, nb_processed = 0;
unsigned int total_bytes = 0;
int budget;
@@ -1060,25 +1077,26 @@ bool ice_xmit_zc(struct ice_tx_ring *xdp_ring)
budget = ICE_DESC_UNUSED(xdp_ring);
budget = min_t(u16, budget, ICE_RING_QUARTER(xdp_ring));
- nb_pkts = xsk_tx_peek_release_desc_batch(xdp_ring->xsk_pool, budget);
+ nb_pkts = xsk_tx_peek_release_desc_batch(xsk_pool, budget);
if (!nb_pkts)
return true;
if (xdp_ring->next_to_use + nb_pkts >= xdp_ring->count) {
nb_processed = xdp_ring->count - xdp_ring->next_to_use;
- ice_fill_tx_hw_ring(xdp_ring, descs, nb_processed, &total_bytes);
+ ice_fill_tx_hw_ring(xdp_ring, xsk_pool, descs, nb_processed,
+ &total_bytes);
xdp_ring->next_to_use = 0;
}
- ice_fill_tx_hw_ring(xdp_ring, &descs[nb_processed], nb_pkts - nb_processed,
- &total_bytes);
+ ice_fill_tx_hw_ring(xdp_ring, xsk_pool, &descs[nb_processed],
+ nb_pkts - nb_processed, &total_bytes);
ice_set_rs_bit(xdp_ring);
ice_xdp_ring_update_tail(xdp_ring);
ice_update_tx_ring_stats(xdp_ring, nb_pkts, total_bytes);
- if (xsk_uses_need_wakeup(xdp_ring->xsk_pool))
- xsk_set_tx_need_wakeup(xdp_ring->xsk_pool);
+ if (xsk_uses_need_wakeup(xsk_pool))
+ xsk_set_tx_need_wakeup(xsk_pool);
return nb_pkts < budget;
}
@@ -1111,7 +1129,7 @@ ice_xsk_wakeup(struct net_device *netdev, u32 queue_id,
ring = vsi->rx_rings[queue_id]->xdp_ring;
- if (!ring->xsk_pool)
+ if (!READ_ONCE(ring->xsk_pool))
return -EINVAL;
/* The idea here is that if NAPI is running, mark a miss, so
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.h b/drivers/net/ethernet/intel/ice/ice_xsk.h
index 6fa181f080ef..4cd2d62a0836 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.h
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.h
@@ -22,7 +22,8 @@ int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool,
u16 qid);
int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget);
int ice_xsk_wakeup(struct net_device *netdev, u32 queue_id, u32 flags);
-bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count);
+bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring,
+ struct xsk_buff_pool *xsk_pool, u16 count);
bool ice_xsk_any_rx_ring_ena(struct ice_vsi *vsi);
void ice_xsk_clean_rx_ring(struct ice_rx_ring *rx_ring);
void ice_xsk_clean_xdp_ring(struct ice_tx_ring *xdp_ring);
@@ -51,6 +52,7 @@ ice_clean_rx_irq_zc(struct ice_rx_ring __always_unused *rx_ring,
static inline bool
ice_alloc_rx_bufs_zc(struct ice_rx_ring __always_unused *rx_ring,
+ struct xsk_buff_pool __always_unused *xsk_pool,
u16 __always_unused count)
{
return false;
--
2.41.0
* Re: [PATCH net 6/8] ice: improve updating ice_{t, r}x_ring::xsk_pool
2024-07-08 22:14 ` [PATCH net 6/8] ice: improve updating ice_{t, r}x_ring::xsk_pool Tony Nguyen
@ 2024-07-10 1:45 ` Jakub Kicinski
2024-07-23 23:46 ` Maciej Fijalkowski
0 siblings, 1 reply; 18+ messages in thread
From: Jakub Kicinski @ 2024-07-10 1:45 UTC (permalink / raw)
To: Tony Nguyen
Cc: davem, pabeni, edumazet, netdev, Maciej Fijalkowski,
magnus.karlsson, aleksander.lobakin, ast, daniel, hawk,
john.fastabend, bpf, Shannon Nelson, Chandan Kumar Rout
On Mon, 8 Jul 2024 15:14:12 -0700 Tony Nguyen wrote:
> @@ -1556,7 +1556,7 @@ int ice_napi_poll(struct napi_struct *napi, int budget)
> * comparison in the irq context instead of many inside the
> * ice_clean_rx_irq function and makes the codebase cleaner.
> */
> - cleaned = rx_ring->xsk_pool ?
> + cleaned = READ_ONCE(rx_ring->xsk_pool) ?
> ice_clean_rx_irq_zc(rx_ring, budget_per_ring) :
> ice_clean_rx_irq(rx_ring, budget_per_ring);
> work_done += cleaned;
> @@ -832,8 +839,8 @@ ice_add_xsk_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *first,
> */
> int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
> {
> + struct xsk_buff_pool *xsk_pool = READ_ONCE(rx_ring->xsk_pool);
> unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> - struct xsk_buff_pool *xsk_pool = rx_ring->xsk_pool;
> u32 ntc = rx_ring->next_to_clean;
> u32 ntu = rx_ring->next_to_use;
> struct xdp_buff *first = NULL;
This looks suspicious, you need to at least explain why it's correct.
READ_ONCE() means one access per critical section, usually.
You access it at least twice in a single NAPI poll.
* Re: [PATCH net 6/8] ice: improve updating ice_{t, r}x_ring::xsk_pool
2024-07-10 1:45 ` Jakub Kicinski
@ 2024-07-23 23:46 ` Maciej Fijalkowski
2024-07-24 14:57 ` Jakub Kicinski
0 siblings, 1 reply; 18+ messages in thread
From: Maciej Fijalkowski @ 2024-07-23 23:46 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Tony Nguyen, davem, pabeni, edumazet, netdev, magnus.karlsson,
aleksander.lobakin, ast, daniel, hawk, john.fastabend, bpf,
Shannon Nelson, Chandan Kumar Rout
On Tue, Jul 09, 2024 at 06:45:24PM -0700, Jakub Kicinski wrote:
> On Mon, 8 Jul 2024 15:14:12 -0700 Tony Nguyen wrote:
> > @@ -1556,7 +1556,7 @@ int ice_napi_poll(struct napi_struct *napi, int budget)
> > * comparison in the irq context instead of many inside the
> > * ice_clean_rx_irq function and makes the codebase cleaner.
> > */
> > - cleaned = rx_ring->xsk_pool ?
> > + cleaned = READ_ONCE(rx_ring->xsk_pool) ?
> > ice_clean_rx_irq_zc(rx_ring, budget_per_ring) :
> > ice_clean_rx_irq(rx_ring, budget_per_ring);
> > work_done += cleaned;
>
>
> > @@ -832,8 +839,8 @@ ice_add_xsk_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *first,
> > */
> > int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
> > {
> > + struct xsk_buff_pool *xsk_pool = READ_ONCE(rx_ring->xsk_pool);
> > unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> > - struct xsk_buff_pool *xsk_pool = rx_ring->xsk_pool;
> > u32 ntc = rx_ring->next_to_clean;
> > u32 ntu = rx_ring->next_to_use;
> > struct xdp_buff *first = NULL;
>
> This looks suspicious, you need to at least explain why it's correct.
> READ_ONCE() means one access per critical section, usually.
> You access it at least twice in a single NAPI pool.
Hey after break! Comebacks are tough, vacation was followed by flu so bear
with me please...
Actually the xsk_pool *can* be accessed multiple times during the refill
of the HW Rx ring (at the end of the napi poll, Rx side). I thought it
would be safe to follow the scheme of the xdp prog pointer handling,
where we read it from the ring once per napi loop and then work on a
local pointer.
The goal of this commit was to prevent the compiler from reordering code
such that NAPI is launched before the update of the xsk_buff_pool
pointer, which is achieved with the WRITE_ONCE()/synchronize_net() pair.
Then, per my understanding, a single READ_ONCE() within NAPI was
sufficient: the one that makes the decision of which Rx routine should
be called (ZC or the standard one). Given that BHs are disabled and the
updater respects the RCU grace period, IMHO the pointer is valid for the
current NAPI cycle.
If you're saying that is not correct and each and every xsk_pool
reference within NAPI has to be decorated with READ_ONCE(), then the
same applies to the xdp_prog pointer, but I'd like to hear more about
this.
* Re: [PATCH net 6/8] ice: improve updating ice_{t, r}x_ring::xsk_pool
2024-07-23 23:46 ` Maciej Fijalkowski
@ 2024-07-24 14:57 ` Jakub Kicinski
2024-07-24 15:49 ` Maciej Fijalkowski
0 siblings, 1 reply; 18+ messages in thread
From: Jakub Kicinski @ 2024-07-24 14:57 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: Tony Nguyen, davem, pabeni, edumazet, netdev, magnus.karlsson,
aleksander.lobakin, ast, daniel, hawk, john.fastabend, bpf,
Shannon Nelson, Chandan Kumar Rout
On Wed, 24 Jul 2024 01:46:11 +0200 Maciej Fijalkowski wrote:
> Goal of this commit was to prevent compiler from code reoder such as NAPI
> is launched before update of xsk_buff_pool pointer which is achieved with
> WRITE_ONCE()/synchronize_net() pair. Then per my understanding single
> READ_ONCE() within NAPI was sufficient, the one that makes the decision
> which Rx routine should be called (zc or standard one). Given that bh are
> disabled and updater respects RCU grace period IMHO pointer is valid for
> current NAPI cycle.
So if we are already in the af_xdp handler, and the update path sets the
pool to NULL - the af_xdp handler will be fine with the pool becoming
NULL?
I guess it may be fine, it's just quite odd to call a function called
_ONCE() multiple times...
* Re: [PATCH net 6/8] ice: improve updating ice_{t, r}x_ring::xsk_pool
2024-07-24 14:57 ` Jakub Kicinski
@ 2024-07-24 15:49 ` Maciej Fijalkowski
2024-07-25 13:38 ` Jakub Kicinski
0 siblings, 1 reply; 18+ messages in thread
From: Maciej Fijalkowski @ 2024-07-24 15:49 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Tony Nguyen, davem, pabeni, edumazet, netdev, magnus.karlsson,
aleksander.lobakin, ast, daniel, hawk, john.fastabend, bpf,
Shannon Nelson, Chandan Kumar Rout
On Wed, Jul 24, 2024 at 07:57:42AM -0700, Jakub Kicinski wrote:
> On Wed, 24 Jul 2024 01:46:11 +0200 Maciej Fijalkowski wrote:
> > Goal of this commit was to prevent compiler from code reoder such as NAPI
> > is launched before update of xsk_buff_pool pointer which is achieved with
> > WRITE_ONCE()/synchronize_net() pair. Then per my understanding single
> > READ_ONCE() within NAPI was sufficient, the one that makes the decision
> > which Rx routine should be called (zc or standard one). Given that bh are
> > disabled and updater respects RCU grace period IMHO pointer is valid for
> > current NAPI cycle.
>
> So if we are already in the af_xdp handler, and update patch sets pool
> to NULL - the af_xdp handler will be fine with the pool becoming NULL?
> I guess it may be fine, it's just quite odd to call the function called
> _ONCE() multiple times..
The update path will go through an RCU grace period, stop NAPIs, disable
IRQs, etc. before NULLing the pool. A running NAPI won't be exposed to a
NULLed pool in such a case.
>
* Re: [PATCH net 6/8] ice: improve updating ice_{t, r}x_ring::xsk_pool
2024-07-24 15:49 ` Maciej Fijalkowski
@ 2024-07-25 13:38 ` Jakub Kicinski
2024-07-25 18:31 ` Maciej Fijalkowski
0 siblings, 1 reply; 18+ messages in thread
From: Jakub Kicinski @ 2024-07-25 13:38 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: Tony Nguyen, davem, pabeni, edumazet, netdev, magnus.karlsson,
aleksander.lobakin, ast, daniel, hawk, john.fastabend, bpf,
Shannon Nelson, Chandan Kumar Rout
On Wed, 24 Jul 2024 17:49:12 +0200 Maciej Fijalkowski wrote:
> > So if we are already in the af_xdp handler, and update patch sets pool
> > to NULL - the af_xdp handler will be fine with the pool becoming NULL?
> > I guess it may be fine, it's just quite odd to call the function called
> > _ONCE() multiple times..
>
> Update path before NULLing pool will go through rcu grace period, stop
> napis, disable irqs, etc. Running napi won't be exposed to nulled pool in
> such case.
Could you make it clearer what condition the patch is fixing, then?
What can go wrong without this patch?
* Re: [PATCH net 6/8] ice: improve updating ice_{t, r}x_ring::xsk_pool
2024-07-25 13:38 ` Jakub Kicinski
@ 2024-07-25 18:31 ` Maciej Fijalkowski
2024-07-25 23:07 ` Jakub Kicinski
0 siblings, 1 reply; 18+ messages in thread
From: Maciej Fijalkowski @ 2024-07-25 18:31 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Tony Nguyen, davem, pabeni, edumazet, netdev, magnus.karlsson,
aleksander.lobakin, ast, daniel, hawk, john.fastabend, bpf,
Shannon Nelson, Chandan Kumar Rout
On Thu, Jul 25, 2024 at 06:38:58AM -0700, Jakub Kicinski wrote:
> On Wed, 24 Jul 2024 17:49:12 +0200 Maciej Fijalkowski wrote:
> > > So if we are already in the af_xdp handler, and update patch sets pool
> > > to NULL - the af_xdp handler will be fine with the pool becoming NULL?
> > > I guess it may be fine, it's just quite odd to call the function called
> > > _ONCE() multiple times..
> >
> > Update path before NULLing pool will go through rcu grace period, stop
> > napis, disable irqs, etc. Running napi won't be exposed to nulled pool in
> > such case.
>
> Could you make it clearer what condition the patch is fixing, then?
> What can go wrong without this patch?
Sorry for the confusion, but without this patch the scenario you brought
up initially *could* happen, under some wild circumstances. When I was
responding yesterday my head was around the code with this particular
patch in place; that's why I said such a pool state transition was not
possible.
Updater does this (prior to this patch):
(...)
ring->xsk_pool = ice_get_xp_from_qid(vsi, qid); // set to NULL
(...)
ice_qvec_toggle_napi(vsi, q_vector, true);
ice_qvec_ena_irq(vsi, q_vector);
In theory the compiler is allowed to transform the code in a way that
the xsk_pool assignment happens *after* NAPI is triggered. So in
ice_napi_poll():
if (tx_ring->xsk_pool)
wd = ice_xmit_zc(tx_ring); // call ZC routine
else if (ice_ring_is_xdp(tx_ring))
wd = true;
else
wd = ice_clean_tx_irq(tx_ring, budget);
You would initiate ZC Tx processing because the xsk_pool ptr was still
valid, and then crash in the middle of the job once it is finally
NULLed. To prevent that:
updater:
(...)
WRITE_ONCE(ring->xsk_pool, ice_get_xp_from_qid(vsi, qid));
(...)
ice_qvec_toggle_napi(vsi, q_vector, true);
ice_qvec_ena_irq(vsi, q_vector);
/* make sure NAPI sees updated ice_{t,x}_ring::xsk_pool */
synchronize_net();
reader:
if (READ_ONCE(tx_ring->xsk_pool))
wd = ice_xmit_zc(tx_ring);
else if (ice_ring_is_xdp(tx_ring))
wd = true;
else
wd = ice_clean_tx_irq(tx_ring, budget);
Does that make any sense now?
* Re: [PATCH net 6/8] ice: improve updating ice_{t, r}x_ring::xsk_pool
2024-07-25 18:31 ` Maciej Fijalkowski
@ 2024-07-25 23:07 ` Jakub Kicinski
2024-07-26 13:43 ` Maciej Fijalkowski
0 siblings, 1 reply; 18+ messages in thread
From: Jakub Kicinski @ 2024-07-25 23:07 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: Tony Nguyen, davem, pabeni, edumazet, netdev, magnus.karlsson,
aleksander.lobakin, ast, daniel, hawk, john.fastabend, bpf,
Shannon Nelson, Chandan Kumar Rout
On Thu, 25 Jul 2024 20:31:31 +0200 Maciej Fijalkowski wrote:
> Does that make any sense now?
Could be brain fog due to post-netdev.conf covid but no, not really.
The _ONCE() helpers basically give you the ability to store the pointer
to a variable on the stack, and that variable won't change behind your
back. But the only reason to READ_ONCE(ptr->thing) something multiple
times is to tell KCSAN that "I know what I'm doing", it just silences
potential warnings :S
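Roughly, in code (use() below is just a stand-in for the actual
consumers):

        /* one annotated load per poll, then work on the stack copy */
        struct xsk_buff_pool *pool = READ_ONCE(ring->xsk_pool);

        if (pool)
                use(pool);
        /* ... later in the same poll ... */
        use(pool);              /* still the same snapshot */

as opposed to calling READ_ONCE(ring->xsk_pool) at every use site, where
each load may legitimately observe a different value.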
* Re: [PATCH net 6/8] ice: improve updating ice_{t, r}x_ring::xsk_pool
2024-07-25 23:07 ` Jakub Kicinski
@ 2024-07-26 13:43 ` Maciej Fijalkowski
2024-07-26 14:37 ` Jakub Kicinski
0 siblings, 1 reply; 18+ messages in thread
From: Maciej Fijalkowski @ 2024-07-26 13:43 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Tony Nguyen, davem, pabeni, edumazet, netdev, magnus.karlsson,
aleksander.lobakin, ast, daniel, hawk, john.fastabend, bpf,
Shannon Nelson, Chandan Kumar Rout
On Thu, Jul 25, 2024 at 04:07:00PM -0700, Jakub Kicinski wrote:
> On Thu, 25 Jul 2024 20:31:31 +0200 Maciej Fijalkowski wrote:
> > Does that make any sense now?
>
> Could be brain fog due to post-netdev.conf covid but no, not really.
Huh, that makes two of us.
>
> The _ONCE() helpers basically give you the ability to store the pointer
> to a variable on the stack, and that variable won't change behind your
> back. But the only reason to READ_ONCE(ptr->thing) something multiple
> times is to tell KCSAN that "I know what I'm doing", it just silences
> potential warnings :S
I feel like you keep referring to _ONCE (*) being used multiple times,
which might be counter-intuitive, whereas I was trying from the
beginning to explain my point that the xsk pool, from the driver's POV,
should get the very same treatment as the xdp prog currently has. So,
either mark it as an __rcu variable and use the RCU helpers, or use the
_ONCE variants plus some synchronization.
(*) Ok, if you meant from the very beginning that two READ_ONCE() calls
against the pool per single critical section are suspicious, then I
didn't get that, sorry. With the diff below I would have a single
READ_ONCE() and work on that variable for the rest of the napi. The
patch was actually trying to limit xsk_pool accesses from the ring
struct by working on a stack variable.
Would you be okay with that?
-----8<-----
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 4c115531beba..5b27aaaa94ee 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1550,14 +1550,15 @@ int ice_napi_poll(struct napi_struct *napi, int budget)
budget_per_ring = budget;
ice_for_each_rx_ring(rx_ring, q_vector->rx) {
+ struct xsk_buff_pool *xsk_pool = READ_ONCE(rx_ring->xsk_pool);
int cleaned;
/* A dedicated path for zero-copy allows making a single
* comparison in the irq context instead of many inside the
* ice_clean_rx_irq function and makes the codebase cleaner.
*/
- cleaned = READ_ONCE(rx_ring->xsk_pool) ?
- ice_clean_rx_irq_zc(rx_ring, budget_per_ring) :
+ cleaned = rx_ring->xsk_pool ?
+ ice_clean_rx_irq_zc(rx_ring, xsk_pool, budget_per_ring) :
ice_clean_rx_irq(rx_ring, budget_per_ring);
work_done += cleaned;
/* if we clean as many as budgeted, we must not be done */
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 492a9e54d58b..dceab7619a64 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -837,13 +837,15 @@ ice_add_xsk_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *first,
/**
* ice_clean_rx_irq_zc - consumes packets from the hardware ring
* @rx_ring: AF_XDP Rx ring
+ * @xsk_pool: AF_XDP pool ptr
* @budget: NAPI budget
*
* Returns number of processed packets on success, remaining budget on failure.
*/
-int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
+int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring,
+ struct xsk_buff_pool *xsk_pool,
+ int budget)
{
- struct xsk_buff_pool *xsk_pool = READ_ONCE(rx_ring->xsk_pool);
unsigned int total_rx_bytes = 0, total_rx_packets = 0;
u32 ntc = rx_ring->next_to_clean;
u32 ntu = rx_ring->next_to_use;
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.h b/drivers/net/ethernet/intel/ice/ice_xsk.h
index 4cd2d62a0836..8c3675185699 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.h
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.h
@@ -20,7 +20,9 @@ struct ice_vsi;
#ifdef CONFIG_XDP_SOCKETS
int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool,
u16 qid);
-int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget);
+int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring,
+ struct xsk_buff_pool *xsk_pool,
+ int budget);
int ice_xsk_wakeup(struct net_device *netdev, u32 queue_id, u32 flags);
bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring,
struct xsk_buff_pool *xsk_pool, u16 count);
@@ -45,6 +47,7 @@ ice_xsk_pool_setup(struct ice_vsi __always_unused *vsi,
static inline int
ice_clean_rx_irq_zc(struct ice_rx_ring __always_unused *rx_ring,
+ struct xsk_buff_pool __always_unused *xsk_pool,
int __always_unused budget)
{
return 0;
----->8-----
--
2.34.1
* Re: [PATCH net 6/8] ice: improve updating ice_{t, r}x_ring::xsk_pool
2024-07-26 13:43 ` Maciej Fijalkowski
@ 2024-07-26 14:37 ` Jakub Kicinski
0 siblings, 0 replies; 18+ messages in thread
From: Jakub Kicinski @ 2024-07-26 14:37 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: Tony Nguyen, davem, pabeni, edumazet, netdev, magnus.karlsson,
aleksander.lobakin, ast, daniel, hawk, john.fastabend, bpf,
Shannon Nelson, Chandan Kumar Rout
On Fri, 26 Jul 2024 15:43:20 +0200 Maciej Fijalkowski wrote:
> > The _ONCE() helpers basically give you the ability to store the pointer
> > to a variable on the stack, and that variable won't change behind your
> > back. But the only reason to READ_ONCE(ptr->thing) something multiple
> > times is to tell KCSAN that "I know what I'm doing", it just silences
> > potential warnings :S
>
> I feel like you keep on referring to _ONCE (*) being used multiple times
> which might be counter-intuitive whereas I was trying from the beginning
> to explain my point that xsk pool from driver POV should get the very same
> treatment as xdp prog has currently. So, either mark it as __rcu variable
> and use rcu helpers or use _ONCE variants plus some sync.
>
> (*) Ok, if you meant from the very beginning that two READ_ONCE against
> pool per single critical section is suspicious then I didn't get that,
> sorry. With diff below I would have single READ_ONCE and work on that
> variable for rest of the napi. Patch was actually trying to limit xsk_pool
> accesses from ring struct by working on stack variable.
>
> Would you be okay with that?
Yup! That diff makes sense, thanks!
* [PATCH net 7/8] ice: add missing WRITE_ONCE when clearing ice_rx_ring::xdp_prog
2024-07-08 22:14 [PATCH net 0/8][pull request] ice: fix AF_XDP ZC timeout and concurrency issues Tony Nguyen
` (5 preceding siblings ...)
2024-07-08 22:14 ` [PATCH net 6/8] ice: improve updating ice_{t, r}x_ring::xsk_pool Tony Nguyen
@ 2024-07-08 22:14 ` Tony Nguyen
2024-07-08 22:14 ` [PATCH net 8/8] ice: xsk: fix txq interrupt mapping Tony Nguyen
7 siblings, 0 replies; 18+ messages in thread
From: Tony Nguyen @ 2024-07-08 22:14 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, netdev
Cc: Maciej Fijalkowski, anthony.l.nguyen, magnus.karlsson,
aleksander.lobakin, ast, daniel, hawk, john.fastabend, bpf,
Shannon Nelson, Chandan Kumar Rout
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
The xdp_prog pointer is read by the data path and modified from process
context on a remote CPU, so WRITE_ONCE() is needed to clear it.
Fixes: efc2214b6047 ("ice: Add support for XDP")
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/ice/ice_txrx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index f4b2b1bca234..4c115531beba 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -456,7 +456,7 @@ void ice_free_rx_ring(struct ice_rx_ring *rx_ring)
if (rx_ring->vsi->type == ICE_VSI_PF)
if (xdp_rxq_info_is_reg(&rx_ring->xdp_rxq))
xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
- rx_ring->xdp_prog = NULL;
+ WRITE_ONCE(rx_ring->xdp_prog, NULL);
if (rx_ring->xsk_pool) {
kfree(rx_ring->xdp_buf);
rx_ring->xdp_buf = NULL;
--
2.41.0
* [PATCH net 8/8] ice: xsk: fix txq interrupt mapping
2024-07-08 22:14 [PATCH net 0/8][pull request] ice: fix AF_XDP ZC timeout and concurrency issues Tony Nguyen
` (6 preceding siblings ...)
2024-07-08 22:14 ` [PATCH net 7/8] ice: add missing WRITE_ONCE when clearing ice_rx_ring::xdp_prog Tony Nguyen
@ 2024-07-08 22:14 ` Tony Nguyen
7 siblings, 0 replies; 18+ messages in thread
From: Tony Nguyen @ 2024-07-08 22:14 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, netdev
Cc: Maciej Fijalkowski, anthony.l.nguyen, magnus.karlsson,
aleksander.lobakin, ast, daniel, hawk, john.fastabend, bpf,
Shannon Nelson, Chandan Kumar Rout
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
ice_cfg_txq_interrupt() internally handles the XDP Tx ring. Do not use
ice_for_each_tx_ring() in ice_qvec_cfg_msix(), as this causes the XDP
ring that belongs to the queue vector to be treated as a Tx ring and
therefore misconfigures the interrupts.
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP")
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/ice/ice_xsk.c | 24 ++++++++++++++----------
1 file changed, 14 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index b4058c4937bc..492a9e54d58b 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -110,25 +110,29 @@ ice_qvec_dis_irq(struct ice_vsi *vsi, struct ice_rx_ring *rx_ring,
* ice_qvec_cfg_msix - Enable IRQ for given queue vector
* @vsi: the VSI that contains queue vector
* @q_vector: queue vector
+ * @qid: queue index
*/
static void
-ice_qvec_cfg_msix(struct ice_vsi *vsi, struct ice_q_vector *q_vector)
+ice_qvec_cfg_msix(struct ice_vsi *vsi, struct ice_q_vector *q_vector, u16 qid)
{
u16 reg_idx = q_vector->reg_idx;
struct ice_pf *pf = vsi->back;
struct ice_hw *hw = &pf->hw;
- struct ice_tx_ring *tx_ring;
- struct ice_rx_ring *rx_ring;
+ int q, _qid = qid;
ice_cfg_itr(hw, q_vector);
- ice_for_each_tx_ring(tx_ring, q_vector->tx)
- ice_cfg_txq_interrupt(vsi, tx_ring->reg_idx, reg_idx,
- q_vector->tx.itr_idx);
+ for (q = 0; q < q_vector->num_ring_tx; q++) {
+ ice_cfg_txq_interrupt(vsi, _qid, reg_idx, q_vector->tx.itr_idx);
+ _qid++;
+ }
- ice_for_each_rx_ring(rx_ring, q_vector->rx)
- ice_cfg_rxq_interrupt(vsi, rx_ring->reg_idx, reg_idx,
- q_vector->rx.itr_idx);
+ _qid = qid;
+
+ for (q = 0; q < q_vector->num_ring_rx; q++) {
+ ice_cfg_rxq_interrupt(vsi, _qid, reg_idx, q_vector->rx.itr_idx);
+ _qid++;
+ }
ice_flush(hw);
}
@@ -241,7 +245,7 @@ static int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
fail = err;
q_vector = vsi->rx_rings[q_idx]->q_vector;
- ice_qvec_cfg_msix(vsi, q_vector);
+ ice_qvec_cfg_msix(vsi, q_vector, q_idx);
err = ice_vsi_ctrl_one_rx_ring(vsi, true, q_idx, true);
if (!fail)
--
2.41.0