* [PATCH net-next 0/5] mlxsw: Improve events processing performance
@ 2024-04-26 12:42 Petr Machata
2024-04-26 12:42 ` [PATCH net-next 1/5] mlxsw: pci: Handle up to 64 Rx completions in tasklet Petr Machata
` (5 more replies)
From: Petr Machata @ 2024-04-26 12:42 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
netdev
Cc: Ido Schimmel, Petr Machata, Amit Cohen, mlxsw
Amit Cohen writes:
Spectrum ASICs only support a single interrupt, which means that all
events are handled by one IRQ (interrupt request) handler.
Currently, we schedule a tasklet to handle events in the EQ, and then we
also use tasklets for the CQs, which serve the SDQs and RDQs. A tasklet
runs in softIRQ (software IRQ) context and is run on the same CPU that
scheduled it. This means that today one CPU handles all the packets
(both network packets and EMADs) coming from the hardware.
The existing implementation is not efficient and can be improved.
Measuring the latency of EMADs in the driver (excluding the time spent
in FW) shows that the latency increases by a factor of 28 (x28) when
network traffic is handled by the driver.
Measuring CPU throughput shows that the CPU can handle ~35% fewer
packets of a specific flow when corrupted packets are also handled by
the driver. In some cases the numbers are even worse; we measured a
decrease of ~44% in packet rate.
This can be improved if network packets and EMADs are handled in
parallel by several CPUs, and even more so if different types of traffic
are handled in parallel. We can achieve this using NAPI.
This set converts the driver to process completions from the hardware
via NAPI. The idea is to add a NAPI instance per CQ (which is mapped 1:1
to an SDQ/RDQ), which means that each DQ can be handled separately. We
have a DQ for EMADs and DQs for each trap group (like LLDP, BGP, L3
drops, etc.). See the commit messages for more details.
An additional improvement done as part of this set is related to
doorbell rings. The idea is to handle Rx packets in small chunks (which
is also recommended when using NAPI) and ring the doorbells once per
chunk. This reduces accesses to the hardware, which are expensive
(time-wise) and can take a long time because of memory barriers.
With this set we can see better performance.
To summarize:
EMADs latency:
+-------------------------------------------+
|                  | Before this set | Now  |
|------------------|-----------------|------|
| Increased factor | x28             | x1.5 |
+-------------------------------------------+
Note that some measurements even show better latency when traffic is
handled by the driver.
Throughput:
+-------------------------------------+
|             | Before this set | Now |
|-------------|-----------------|-----|
| Reduced     | 35%             | 6%  |
| packet rate |                 |     |
+-------------------------------------+
Additional improvements are planned: using the page pool for buffer
allocations and avoiding a cache miss for each SKB by using
napi_build_skb().
Patch set overview:
Patches #1-#2 improve access to the hardware by reducing doorbell rings.
Patches #3-#4 are preparations for NAPI usage.
Patch #5 converts the driver to use NAPI.
Amit Cohen (5):
mlxsw: pci: Handle up to 64 Rx completions in tasklet
mlxsw: pci: Ring RDQ and CQ doorbells once per several completions
mlxsw: pci: Initialize dummy net devices for NAPI
mlxsw: pci: Reorganize 'mlxsw_pci_queue' structure
mlxsw: pci: Use NAPI for event processing
drivers/net/ethernet/mellanox/mlxsw/pci.c | 204 ++++++++++++++++------
1 file changed, 150 insertions(+), 54 deletions(-)
--
2.43.0
* [PATCH net-next 1/5] mlxsw: pci: Handle up to 64 Rx completions in tasklet
2024-04-26 12:42 [PATCH net-next 0/5] mlxsw: Improve events processing performance Petr Machata
@ 2024-04-26 12:42 ` Petr Machata
2024-04-26 12:42 ` [PATCH net-next 2/5] mlxsw: pci: Ring RDQ and CQ doorbells once per several completions Petr Machata
` (4 subsequent siblings)
From: Petr Machata @ 2024-04-26 12:42 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
netdev
Cc: Ido Schimmel, Petr Machata, Amit Cohen, mlxsw
From: Amit Cohen <amcohen@nvidia.com>
We can get many completions in one interrupt. Currently, the CQ tasklet
handles completions for up to half the queue size and then arms the
hardware to generate additional events, which means that if there were
additional completions that we did not handle, we will immediately get
an additional interrupt to handle the rest.
The decision to handle up to half of the queue size is arbitrary and was
made in 2015, when the mlxsw driver was added to the kernel. An
additional fact that should be taken into account is that while WQEs
from the RDQ are handled, the CPU that runs the tasklet is dedicated to
this task, which means that we might hold the CPU for a long time.
Handle WQEs in smaller chunks, then ring the arm CQ doorbell so the
hardware will send additional notifications. Set the chunk size to 64,
as this number is recommended when using NAPI, and the driver will be
converted to use NAPI in a subsequent patch.
Note that for now we use the arm doorbell to retrigger the CQ tasklet,
but with NAPI this will be more efficient, as the software will
reschedule the poll method and we will not involve the hardware for
that.
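In simplified form, the resulting flow is as follows (a sketch only,
reusing the driver's function names; the arm doorbell call sits outside
the diff hunks below):

/* Sketch: handle at most MLXSW_PCI_CQ_MAX_HANDLE completions per
 * tasklet run, then arm the CQ so the hardware raises a new event if
 * completions remain.
 */
while ((cqe = mlxsw_pci_cq_sw_cqe_get(q))) {
        /* ... process one Rx completion ... */
        if (++items == MLXSW_PCI_CQ_MAX_HANDLE)
                break;
}
mlxsw_pci_queue_doorbell_arm_consumer_ring(mlxsw_pci, q);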
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
---
drivers/net/ethernet/mellanox/mlxsw/pci.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index 13fd067c39ed..8668947400ab 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -652,12 +652,13 @@ static char *mlxsw_pci_cq_sw_cqe_get(struct mlxsw_pci_queue *q)
return elem;
}
+#define MLXSW_PCI_CQ_MAX_HANDLE 64
+
static void mlxsw_pci_cq_rx_tasklet(struct tasklet_struct *t)
{
struct mlxsw_pci_queue *q = from_tasklet(q, t, tasklet);
struct mlxsw_pci_queue *rdq = q->cq.dq;
struct mlxsw_pci *mlxsw_pci = q->pci;
- int credits = q->count >> 1;
int items = 0;
char *cqe;
@@ -683,7 +684,7 @@ static void mlxsw_pci_cq_rx_tasklet(struct tasklet_struct *t)
mlxsw_pci_cqe_rdq_handle(mlxsw_pci, rdq,
wqe_counter, q->cq.v, ncqe);
- if (++items == credits)
+ if (++items == MLXSW_PCI_CQ_MAX_HANDLE)
break;
}
--
2.43.0
* [PATCH net-next 2/5] mlxsw: pci: Ring RDQ and CQ doorbells once per several completions
2024-04-26 12:42 [PATCH net-next 0/5] mlxsw: Improve events processing performance Petr Machata
2024-04-26 12:42 ` [PATCH net-next 1/5] mlxsw: pci: Handle up to 64 Rx completions in tasklet Petr Machata
@ 2024-04-26 12:42 ` Petr Machata
2024-04-26 12:42 ` [PATCH net-next 3/5] mlxsw: pci: Initialize dummy net devices for NAPI Petr Machata
` (3 subsequent siblings)
From: Petr Machata @ 2024-04-26 12:42 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
netdev
Cc: Ido Schimmel, Petr Machata, Amit Cohen, mlxsw
From: Amit Cohen <amcohen@nvidia.com>
Currently, for each CQE in the CQ, we ring the CQ doorbell, then handle
the RDQ and ring the RDQ doorbell. Finally we ring the CQ arm doorbell,
once per CQ tasklet.
The idea behind ringing the CQ doorbell before the RDQ doorbell is to be
sure that when we post a new WQE (after the RDQ is handled), there is an
available CQE. This was done because of a hardware bug, as part of
commit c9ebea04cb1b ("mlxsw: pci: Ring CQ's doorbell before RDQ's").
There is no real reason to ring the RDQ and CQ doorbells for each
completion; it is better to handle several completions and reduce the
number of rings, as accesses to the hardware are expensive (time-wise)
and can take a long time because of memory barriers.
A previous patch changed the CQ tasklet to handle up to 64 Rx packets.
With this limitation, we can ring the CQ and RDQ doorbells once per CQ
tasklet. The doorbell counters are increased by the number of packets
that we handled, so the device knows for which completion to send an
additional event.
To avoid reordering the CQ and RDQ doorbell rings, let the tasklet ring
the RDQ doorbell as well; mlxsw_pci_cqe_rdq_handle() updates the counter
but does not ring the doorbell.
Note that with this change there is no need to copy the CQE, as we ring
the CQ doorbell only after Rx packet processing (which uses the CQE) is
done.
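Schematically, the change moves the doorbell rings out of the
completion loop (a sketch with arguments elided, not the exact code):

/* Before: doorbells were rung once per completion. */
while ((cqe = mlxsw_pci_cq_sw_cqe_get(q))) {
        memcpy(ncqe, cqe, q->elem_size);
        mlxsw_pci_queue_doorbell_consumer_ring(mlxsw_pci, q);
        mlxsw_pci_cqe_rdq_handle(...); /* also rang the RDQ doorbell */
}

/* After: handle a chunk, then ring each doorbell once. The counters
 * were advanced once per completion, so the device knows how many
 * elements were consumed and produced.
 */
while ((cqe = mlxsw_pci_cq_sw_cqe_get(q)))
        mlxsw_pci_cqe_rdq_handle(...); /* updates counter, no doorbell */
mlxsw_pci_queue_doorbell_consumer_ring(mlxsw_pci, q);
mlxsw_pci_queue_doorbell_producer_ring(mlxsw_pci, rdq);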
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
---
drivers/net/ethernet/mellanox/mlxsw/pci.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index 8668947400ab..2094b802d8d5 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -630,9 +630,7 @@ static void mlxsw_pci_cqe_rdq_handle(struct mlxsw_pci *mlxsw_pci,
mlxsw_core_skb_receive(mlxsw_pci->core, skb, &rx_info);
out:
- /* Everything is set up, ring doorbell to pass elem to HW */
q->producer_counter++;
- mlxsw_pci_queue_doorbell_producer_ring(mlxsw_pci, q);
return;
}
@@ -666,7 +664,6 @@ static void mlxsw_pci_cq_rx_tasklet(struct tasklet_struct *t)
u16 wqe_counter = mlxsw_pci_cqe_wqe_counter_get(cqe);
u8 sendq = mlxsw_pci_cqe_sr_get(q->cq.v, cqe);
u8 dqn = mlxsw_pci_cqe_dqn_get(q->cq.v, cqe);
- char ncqe[MLXSW_PCI_CQE_SIZE_MAX];
if (unlikely(sendq)) {
WARN_ON_ONCE(1);
@@ -678,16 +675,15 @@ static void mlxsw_pci_cq_rx_tasklet(struct tasklet_struct *t)
continue;
}
- memcpy(ncqe, cqe, q->elem_size);
- mlxsw_pci_queue_doorbell_consumer_ring(mlxsw_pci, q);
-
mlxsw_pci_cqe_rdq_handle(mlxsw_pci, rdq,
- wqe_counter, q->cq.v, ncqe);
+ wqe_counter, q->cq.v, cqe);
if (++items == MLXSW_PCI_CQ_MAX_HANDLE)
break;
}
+ mlxsw_pci_queue_doorbell_consumer_ring(mlxsw_pci, q);
+ mlxsw_pci_queue_doorbell_producer_ring(mlxsw_pci, rdq);
mlxsw_pci_queue_doorbell_arm_consumer_ring(mlxsw_pci, q);
}
--
2.43.0
* [PATCH net-next 3/5] mlxsw: pci: Initialize dummy net devices for NAPI
2024-04-26 12:42 [PATCH net-next 0/5] mlxsw: Improve events processing performance Petr Machata
2024-04-26 12:42 ` [PATCH net-next 1/5] mlxsw: pci: Handle up to 64 Rx completions in tasklet Petr Machata
2024-04-26 12:42 ` [PATCH net-next 2/5] mlxsw: pci: Ring RDQ and CQ doorbells once per several completions Petr Machata
@ 2024-04-26 12:42 ` Petr Machata
2024-04-26 12:42 ` [PATCH net-next 4/5] mlxsw: pci: Reorganize 'mlxsw_pci_queue' structure Petr Machata
` (2 subsequent siblings)
From: Petr Machata @ 2024-04-26 12:42 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
netdev
Cc: Ido Schimmel, Petr Machata, Amit Cohen, mlxsw
From: Amit Cohen <amcohen@nvidia.com>
mlxsw will use NAPI for event processing in a subsequent patch. As
preparation, add two dummy net devices and initialize them.
A NAPI instance must be attached to a net device. In most network
drivers, each queue is used by a single net device, so the mapping
between net device and NAPI instance is intuitive. In our case, Rx
queues are not per port; they are per trap group. Tx queues are mapped
to net devices, but we do not have a separate queue for each local port;
several ports share the same queue.
Use alloc_netdev_dummy() to allocate dummy net devices for NAPI.
To run the NAPI poll method in a kernel thread, the net device that the
NAPI instance is attached to should be marked as 'threaded'. It is
recommended to handle Tx packets in softIRQ context, as this is usually
a short task, just freeing the Tx packet which has been transmitted.
Handling Rx packets is a more complicated task, so drivers can use a
dedicated kernel thread to process them. This allows packets from
different Rx queues to be processed in parallel. We would like to handle
only Rx packets in kernel threads, which means that we will use two
dummy net devices (one for Rx and one for Tx). Mark only the Rx one as
'threaded', as it will be used for Rx processing. Do not fail in case
setting 'threaded' fails, as it is better to use regular softIRQ NAPI
than to prevent the driver from loading.
Note that the net devices are allocated as dummy devices, so they are
not registered, which means that they will not be visible to the user.
It will not be possible to change the 'threaded' configuration from user
space, but this is reasonable in our case, as no other configuration
makes sense, considering that the user has no influence on how each
queue is used.
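For context, a CQ's NAPI instance will later be attached to one of these
dummy devices roughly as follows (taken from the conversion in patch
#5):

/* Rx completion queues attach to the threaded Rx dummy device, so
 * their poll method runs in a kernel thread; Tx completion queues
 * attach to the Tx dummy device and keep running in softIRQ context.
 */
netif_napi_add(mlxsw_pci->napi_dev_rx, &q->u.cq.napi,
               mlxsw_pci_napi_poll_cq_rx);
napi_enable(&q->u.cq.napi);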
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
---
drivers/net/ethernet/mellanox/mlxsw/pci.c | 41 +++++++++++++++++++++++
1 file changed, 41 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index 2094b802d8d5..ec54b876dfd9 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -127,8 +127,42 @@ struct mlxsw_pci {
u8 num_cqs; /* Number of CQs */
u8 num_sdqs; /* Number of SDQs */
bool skip_reset;
+ struct net_device *napi_dev_tx;
+ struct net_device *napi_dev_rx;
};
+static int mlxsw_pci_napi_devs_init(struct mlxsw_pci *mlxsw_pci)
+{
+ int err;
+
+ mlxsw_pci->napi_dev_tx = alloc_netdev_dummy(0);
+ if (!mlxsw_pci->napi_dev_tx)
+ return -ENOMEM;
+ strscpy(mlxsw_pci->napi_dev_tx->name, "mlxsw_tx",
+ sizeof(mlxsw_pci->napi_dev_tx->name));
+
+ mlxsw_pci->napi_dev_rx = alloc_netdev_dummy(0);
+ if (!mlxsw_pci->napi_dev_rx) {
+ err = -ENOMEM;
+ goto err_alloc_rx;
+ }
+ strscpy(mlxsw_pci->napi_dev_rx->name, "mlxsw_rx",
+ sizeof(mlxsw_pci->napi_dev_rx->name));
+ dev_set_threaded(mlxsw_pci->napi_dev_rx, true);
+
+ return 0;
+
+err_alloc_rx:
+ free_netdev(mlxsw_pci->napi_dev_tx);
+ return err;
+}
+
+static void mlxsw_pci_napi_devs_fini(struct mlxsw_pci *mlxsw_pci)
+{
+ free_netdev(mlxsw_pci->napi_dev_rx);
+ free_netdev(mlxsw_pci->napi_dev_tx);
+}
+
static void mlxsw_pci_queue_tasklet_schedule(struct mlxsw_pci_queue *q)
{
tasklet_schedule(&q->tasklet);
@@ -1721,6 +1755,10 @@ static int mlxsw_pci_init(void *bus_priv, struct mlxsw_core *mlxsw_core,
if (err)
goto err_requery_resources;
+ err = mlxsw_pci_napi_devs_init(mlxsw_pci);
+ if (err)
+ goto err_napi_devs_init;
+
err = mlxsw_pci_aqs_init(mlxsw_pci, mbox);
if (err)
goto err_aqs_init;
@@ -1738,6 +1776,8 @@ static int mlxsw_pci_init(void *bus_priv, struct mlxsw_core *mlxsw_core,
err_request_eq_irq:
mlxsw_pci_aqs_fini(mlxsw_pci);
err_aqs_init:
+ mlxsw_pci_napi_devs_fini(mlxsw_pci);
+err_napi_devs_init:
err_requery_resources:
err_config_profile:
err_cqe_v_check:
@@ -1765,6 +1805,7 @@ static void mlxsw_pci_fini(void *bus_priv)
free_irq(pci_irq_vector(mlxsw_pci->pdev, 0), mlxsw_pci);
mlxsw_pci_aqs_fini(mlxsw_pci);
+ mlxsw_pci_napi_devs_fini(mlxsw_pci);
mlxsw_pci_fw_area_fini(mlxsw_pci);
mlxsw_pci_free_irq_vectors(mlxsw_pci);
}
--
2.43.0
* [PATCH net-next 4/5] mlxsw: pci: Reorganize 'mlxsw_pci_queue' structure
2024-04-26 12:42 [PATCH net-next 0/5] mlxsw: Improve events processing performance Petr Machata
` (2 preceding siblings ...)
2024-04-26 12:42 ` [PATCH net-next 3/5] mlxsw: pci: Initialize dummy net devices for NAPI Petr Machata
@ 2024-04-26 12:42 ` Petr Machata
2024-04-26 12:42 ` [PATCH net-next 5/5] mlxsw: pci: Use NAPI for event processing Petr Machata
2024-04-29 9:50 ` [PATCH net-next 0/5] mlxsw: Improve events processing performance patchwork-bot+netdevbpf
From: Petr Machata @ 2024-04-26 12:42 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
netdev
Cc: Ido Schimmel, Petr Machata, Amit Cohen, mlxsw
From: Amit Cohen <amcohen@nvidia.com>
The next patch will convert the driver to use NAPI for event processing;
the tasklet mechanism will then be used only for the EQ. Reorganize
'mlxsw_pci_queue' to hold the EQ and CQ attributes in a union. For now,
add a tasklet for both the EQ and the CQs. This will be changed in the
next patch, when the CQ's 'tasklet_struct' is replaced with a NAPI
instance.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
---
drivers/net/ethernet/mellanox/mlxsw/pci.c | 76 +++++++++++------------
1 file changed, 38 insertions(+), 38 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index ec54b876dfd9..7724f9a61479 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -82,12 +82,17 @@ struct mlxsw_pci_queue {
u8 num; /* queue number */
u8 elem_size; /* size of one element */
enum mlxsw_pci_queue_type type;
- struct tasklet_struct tasklet; /* queue processing tasklet */
struct mlxsw_pci *pci;
- struct {
- enum mlxsw_pci_cqe_v v;
- struct mlxsw_pci_queue *dq;
- } cq;
+ union {
+ struct {
+ enum mlxsw_pci_cqe_v v;
+ struct mlxsw_pci_queue *dq;
+ struct tasklet_struct tasklet;
+ } cq;
+ struct {
+ struct tasklet_struct tasklet;
+ } eq;
+ } u;
};
struct mlxsw_pci_queue_type_group {
@@ -163,11 +168,6 @@ static void mlxsw_pci_napi_devs_fini(struct mlxsw_pci *mlxsw_pci)
free_netdev(mlxsw_pci->napi_dev_tx);
}
-static void mlxsw_pci_queue_tasklet_schedule(struct mlxsw_pci_queue *q)
-{
- tasklet_schedule(&q->tasklet);
-}
-
static char *__mlxsw_pci_queue_elem_get(struct mlxsw_pci_queue *q,
size_t elem_size, int elem_index)
{
@@ -324,7 +324,7 @@ static int mlxsw_pci_sdq_init(struct mlxsw_pci *mlxsw_pci, char *mbox,
return err;
cq = mlxsw_pci_cq_get(mlxsw_pci, cq_num);
- cq->cq.dq = q;
+ cq->u.cq.dq = q;
mlxsw_pci_queue_doorbell_producer_ring(mlxsw_pci, q);
return 0;
}
@@ -433,7 +433,7 @@ static int mlxsw_pci_rdq_init(struct mlxsw_pci *mlxsw_pci, char *mbox,
return err;
cq = mlxsw_pci_cq_get(mlxsw_pci, cq_num);
- cq->cq.dq = q;
+ cq->u.cq.dq = q;
mlxsw_pci_queue_doorbell_producer_ring(mlxsw_pci, q);
@@ -455,7 +455,7 @@ static int mlxsw_pci_rdq_init(struct mlxsw_pci *mlxsw_pci, char *mbox,
elem_info = mlxsw_pci_queue_elem_info_get(q, i);
mlxsw_pci_rdq_skb_free(mlxsw_pci, elem_info);
}
- cq->cq.dq = NULL;
+ cq->u.cq.dq = NULL;
mlxsw_cmd_hw2sw_rdq(mlxsw_pci->core, q->num);
return err;
@@ -477,12 +477,12 @@ static void mlxsw_pci_rdq_fini(struct mlxsw_pci *mlxsw_pci,
static void mlxsw_pci_cq_pre_init(struct mlxsw_pci *mlxsw_pci,
struct mlxsw_pci_queue *q)
{
- q->cq.v = mlxsw_pci->max_cqe_ver;
+ q->u.cq.v = mlxsw_pci->max_cqe_ver;
- if (q->cq.v == MLXSW_PCI_CQE_V2 &&
+ if (q->u.cq.v == MLXSW_PCI_CQE_V2 &&
q->num < mlxsw_pci->num_sdqs &&
!mlxsw_core_sdq_supports_cqe_v2(mlxsw_pci->core))
- q->cq.v = MLXSW_PCI_CQE_V1;
+ q->u.cq.v = MLXSW_PCI_CQE_V1;
}
static unsigned int mlxsw_pci_read32_off(struct mlxsw_pci *mlxsw_pci,
@@ -676,7 +676,7 @@ static char *mlxsw_pci_cq_sw_cqe_get(struct mlxsw_pci_queue *q)
elem_info = mlxsw_pci_queue_elem_info_consumer_get(q);
elem = elem_info->elem;
- owner_bit = mlxsw_pci_cqe_owner_get(q->cq.v, elem);
+ owner_bit = mlxsw_pci_cqe_owner_get(q->u.cq.v, elem);
if (mlxsw_pci_elem_hw_owned(q, owner_bit))
return NULL;
q->consumer_counter++;
@@ -688,16 +688,16 @@ static char *mlxsw_pci_cq_sw_cqe_get(struct mlxsw_pci_queue *q)
static void mlxsw_pci_cq_rx_tasklet(struct tasklet_struct *t)
{
- struct mlxsw_pci_queue *q = from_tasklet(q, t, tasklet);
- struct mlxsw_pci_queue *rdq = q->cq.dq;
+ struct mlxsw_pci_queue *q = from_tasklet(q, t, u.cq.tasklet);
+ struct mlxsw_pci_queue *rdq = q->u.cq.dq;
struct mlxsw_pci *mlxsw_pci = q->pci;
int items = 0;
char *cqe;
while ((cqe = mlxsw_pci_cq_sw_cqe_get(q))) {
u16 wqe_counter = mlxsw_pci_cqe_wqe_counter_get(cqe);
- u8 sendq = mlxsw_pci_cqe_sr_get(q->cq.v, cqe);
- u8 dqn = mlxsw_pci_cqe_dqn_get(q->cq.v, cqe);
+ u8 sendq = mlxsw_pci_cqe_sr_get(q->u.cq.v, cqe);
+ u8 dqn = mlxsw_pci_cqe_dqn_get(q->u.cq.v, cqe);
if (unlikely(sendq)) {
WARN_ON_ONCE(1);
@@ -710,7 +710,7 @@ static void mlxsw_pci_cq_rx_tasklet(struct tasklet_struct *t)
}
mlxsw_pci_cqe_rdq_handle(mlxsw_pci, rdq,
- wqe_counter, q->cq.v, cqe);
+ wqe_counter, q->u.cq.v, cqe);
if (++items == MLXSW_PCI_CQ_MAX_HANDLE)
break;
@@ -723,8 +723,8 @@ static void mlxsw_pci_cq_rx_tasklet(struct tasklet_struct *t)
static void mlxsw_pci_cq_tx_tasklet(struct tasklet_struct *t)
{
- struct mlxsw_pci_queue *q = from_tasklet(q, t, tasklet);
- struct mlxsw_pci_queue *sdq = q->cq.dq;
+ struct mlxsw_pci_queue *q = from_tasklet(q, t, u.cq.tasklet);
+ struct mlxsw_pci_queue *sdq = q->u.cq.dq;
struct mlxsw_pci *mlxsw_pci = q->pci;
int credits = q->count >> 1;
int items = 0;
@@ -732,8 +732,8 @@ static void mlxsw_pci_cq_tx_tasklet(struct tasklet_struct *t)
while ((cqe = mlxsw_pci_cq_sw_cqe_get(q))) {
u16 wqe_counter = mlxsw_pci_cqe_wqe_counter_get(cqe);
- u8 sendq = mlxsw_pci_cqe_sr_get(q->cq.v, cqe);
- u8 dqn = mlxsw_pci_cqe_dqn_get(q->cq.v, cqe);
+ u8 sendq = mlxsw_pci_cqe_sr_get(q->u.cq.v, cqe);
+ u8 dqn = mlxsw_pci_cqe_dqn_get(q->u.cq.v, cqe);
char ncqe[MLXSW_PCI_CQE_SIZE_MAX];
if (unlikely(!sendq)) {
@@ -750,7 +750,7 @@ static void mlxsw_pci_cq_tx_tasklet(struct tasklet_struct *t)
mlxsw_pci_queue_doorbell_consumer_ring(mlxsw_pci, q);
mlxsw_pci_cqe_sdq_handle(mlxsw_pci, sdq,
- wqe_counter, q->cq.v, ncqe);
+ wqe_counter, q->u.cq.v, ncqe);
if (++items == credits)
break;
@@ -777,10 +777,10 @@ static void mlxsw_pci_cq_tasklet_setup(struct mlxsw_pci_queue *q,
{
switch (cq_type) {
case MLXSW_PCI_CQ_SDQ:
- tasklet_setup(&q->tasklet, mlxsw_pci_cq_tx_tasklet);
+ tasklet_setup(&q->u.cq.tasklet, mlxsw_pci_cq_tx_tasklet);
break;
case MLXSW_PCI_CQ_RDQ:
- tasklet_setup(&q->tasklet, mlxsw_pci_cq_rx_tasklet);
+ tasklet_setup(&q->u.cq.tasklet, mlxsw_pci_cq_rx_tasklet);
break;
}
}
@@ -796,13 +796,13 @@ static int mlxsw_pci_cq_init(struct mlxsw_pci *mlxsw_pci, char *mbox,
for (i = 0; i < q->count; i++) {
char *elem = mlxsw_pci_queue_elem_get(q, i);
- mlxsw_pci_cqe_owner_set(q->cq.v, elem, 1);
+ mlxsw_pci_cqe_owner_set(q->u.cq.v, elem, 1);
}
- if (q->cq.v == MLXSW_PCI_CQE_V1)
+ if (q->u.cq.v == MLXSW_PCI_CQE_V1)
mlxsw_cmd_mbox_sw2hw_cq_cqe_ver_set(mbox,
MLXSW_CMD_MBOX_SW2HW_CQ_CQE_VER_1);
- else if (q->cq.v == MLXSW_PCI_CQE_V2)
+ else if (q->u.cq.v == MLXSW_PCI_CQE_V2)
mlxsw_cmd_mbox_sw2hw_cq_cqe_ver_set(mbox,
MLXSW_CMD_MBOX_SW2HW_CQ_CQE_VER_2);
@@ -831,13 +831,13 @@ static void mlxsw_pci_cq_fini(struct mlxsw_pci *mlxsw_pci,
static u16 mlxsw_pci_cq_elem_count(const struct mlxsw_pci_queue *q)
{
- return q->cq.v == MLXSW_PCI_CQE_V2 ? MLXSW_PCI_CQE2_COUNT :
+ return q->u.cq.v == MLXSW_PCI_CQE_V2 ? MLXSW_PCI_CQE2_COUNT :
MLXSW_PCI_CQE01_COUNT;
}
static u8 mlxsw_pci_cq_elem_size(const struct mlxsw_pci_queue *q)
{
- return q->cq.v == MLXSW_PCI_CQE_V2 ? MLXSW_PCI_CQE2_SIZE :
+ return q->u.cq.v == MLXSW_PCI_CQE_V2 ? MLXSW_PCI_CQE2_SIZE :
MLXSW_PCI_CQE01_SIZE;
}
@@ -860,7 +860,7 @@ static char *mlxsw_pci_eq_sw_eqe_get(struct mlxsw_pci_queue *q)
static void mlxsw_pci_eq_tasklet(struct tasklet_struct *t)
{
unsigned long active_cqns[BITS_TO_LONGS(MLXSW_PCI_CQS_MAX)];
- struct mlxsw_pci_queue *q = from_tasklet(q, t, tasklet);
+ struct mlxsw_pci_queue *q = from_tasklet(q, t, u.eq.tasklet);
struct mlxsw_pci *mlxsw_pci = q->pci;
int credits = q->count >> 1;
u8 cqn, cq_count;
@@ -886,7 +886,7 @@ static void mlxsw_pci_eq_tasklet(struct tasklet_struct *t)
cq_count = mlxsw_pci->num_cqs;
for_each_set_bit(cqn, active_cqns, cq_count) {
q = mlxsw_pci_cq_get(mlxsw_pci, cqn);
- mlxsw_pci_queue_tasklet_schedule(q);
+ tasklet_schedule(&q->u.cq.tasklet);
}
}
@@ -922,7 +922,7 @@ static int mlxsw_pci_eq_init(struct mlxsw_pci *mlxsw_pci, char *mbox,
err = mlxsw_cmd_sw2hw_eq(mlxsw_pci->core, mbox, q->num);
if (err)
return err;
- tasklet_setup(&q->tasklet, mlxsw_pci_eq_tasklet);
+ tasklet_setup(&q->u.eq.tasklet, mlxsw_pci_eq_tasklet);
mlxsw_pci_queue_doorbell_consumer_ring(mlxsw_pci, q);
mlxsw_pci_queue_doorbell_arm_consumer_ring(mlxsw_pci, q);
return 0;
@@ -1483,7 +1483,7 @@ static irqreturn_t mlxsw_pci_eq_irq_handler(int irq, void *dev_id)
struct mlxsw_pci_queue *q;
q = mlxsw_pci_eq_get(mlxsw_pci);
- mlxsw_pci_queue_tasklet_schedule(q);
+ tasklet_schedule(&q->u.eq.tasklet);
return IRQ_HANDLED;
}
--
2.43.0
* [PATCH net-next 5/5] mlxsw: pci: Use NAPI for event processing
2024-04-26 12:42 [PATCH net-next 0/5] mlxsw: Improve events processing performance Petr Machata
` (3 preceding siblings ...)
2024-04-26 12:42 ` [PATCH net-next 4/5] mlxsw: pci: Reorganize 'mlxsw_pci_queue' structure Petr Machata
@ 2024-04-26 12:42 ` Petr Machata
2024-04-29 9:50 ` [PATCH net-next 0/5] mlxsw: Improve events processing performance patchwork-bot+netdevbpf
From: Petr Machata @ 2024-04-26 12:42 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
netdev
Cc: Ido Schimmel, Petr Machata, Amit Cohen, mlxsw
From: Amit Cohen <amcohen@nvidia.com>
Spectrum ASICs only support a single interrupt, which means that all
events are handled by one IRQ (interrupt request) handler. Once an
interrupt is received, we schedule a tasklet to handle events from the
EQ and then schedule tasklets to handle completions from the CQs. A
tasklet runs in softIRQ (software IRQ) context on the same CPU that
scheduled it. That means that today we use only one CPU to handle all
the packets (both network packets and EMADs) coming from the hardware.
This can be improved using NAPI. The idea is to use a NAPI instance per
CQ, which is mapped 1:1 to a DQ (RDQ or SDQ). The NAPI poll method can
run in a kernel thread, so the driver will be able to handle WQEs on
several CPUs. Convert the existing code to use the NAPI APIs.
Add a NAPI instance as part of 'struct mlxsw_pci_queue' and initialize
it as part of CQ initialization. Set the appropriate poll method and
dummy net device according to the queue number, similar to the tasklet
setup. For CQs that are used for RDQ completions, use the Rx poll method
and 'napi_dev_rx', which is marked as 'threaded'. This means that the Rx
poll method will run in kernel thread context, so several RDQs will be
handled in parallel. For CQs that are used for SDQ completions, use the
Tx poll method and 'napi_dev_tx'; this method will run in softIRQ
context, as recommended in the NAPI documentation, since Tx packet
processing is a short task.
Convert mlxsw_pci_cq_{rx,tx}_tasklet() to poll methods. Handle the
'budget' argument: ignore it in the Tx poll method, as it is recommended
not to limit Tx processing. For Rx processing, handle up to 'budget'
completions. Return 'work_done', which is the number of completions that
were handled.
Handle the following cases (see the sketch after this list):
1. After processing 'budget' completions, the driver still has work to
   do: Return work_done = budget. In that case, the NAPI instance will
   be polled again (without needing to be rescheduled). Do not re-arm
   the queue, as NAPI will handle the rescheduling, so we do not have to
   involve the hardware in sending an additional interrupt for the
   completions that still need to be processed.
2. Event processing has been completed: Call napi_complete_done() to
   mark NAPI processing as completed, which means that the poll method
   will not be rescheduled. Re-arm the queue, as all completions were
   handled. If the poll method handled exactly 'budget' completions,
   return work_done = budget - 1, to distinguish this from the case
   where the driver still has completions to handle. Otherwise, return
   the number of completions that were handled.
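Distilled to its essentials, the Rx poll contract described above can be
sketched as follows (a simplified illustration; process_one_completion(),
more_completions_pending() and rearm_queue() are hypothetical
placeholders, not actual mlxsw helpers):

static int example_napi_poll_rx(struct napi_struct *napi, int budget)
{
        int work_done = 0;

        /* With a budget of 0, Rx processing is skipped and
         * napi_complete_done() must not be called.
         */
        if (unlikely(!budget))
                return 0;

        while (work_done < budget && process_one_completion(napi))
                work_done++;

        if (work_done == budget) {
                if (more_completions_pending(napi))
                        return budget; /* Case 1: polled again, no re-arm. */
                work_done--; /* Case 2 with exactly 'budget' completions. */
        }

        if (napi_complete_done(napi, work_done))
                rearm_queue(napi); /* Ask the hardware for new events. */

        return work_done;
}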
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
---
drivers/net/ethernet/mellanox/mlxsw/pci.c | 96 ++++++++++++++++++-----
1 file changed, 77 insertions(+), 19 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index 7724f9a61479..bf66d996e32e 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -87,7 +87,7 @@ struct mlxsw_pci_queue {
struct {
enum mlxsw_pci_cqe_v v;
struct mlxsw_pci_queue *dq;
- struct tasklet_struct tasklet;
+ struct napi_struct napi;
} cq;
struct {
struct tasklet_struct tasklet;
@@ -684,16 +684,29 @@ static char *mlxsw_pci_cq_sw_cqe_get(struct mlxsw_pci_queue *q)
return elem;
}
-#define MLXSW_PCI_CQ_MAX_HANDLE 64
+static bool mlxsw_pci_cq_cqe_to_handle(struct mlxsw_pci_queue *q)
+{
+ struct mlxsw_pci_queue_elem_info *elem_info;
+ bool owner_bit;
+
+ elem_info = mlxsw_pci_queue_elem_info_consumer_get(q);
+ owner_bit = mlxsw_pci_cqe_owner_get(q->u.cq.v, elem_info->elem);
+ return !mlxsw_pci_elem_hw_owned(q, owner_bit);
+}
-static void mlxsw_pci_cq_rx_tasklet(struct tasklet_struct *t)
+static int mlxsw_pci_napi_poll_cq_rx(struct napi_struct *napi, int budget)
{
- struct mlxsw_pci_queue *q = from_tasklet(q, t, u.cq.tasklet);
+ struct mlxsw_pci_queue *q = container_of(napi, struct mlxsw_pci_queue,
+ u.cq.napi);
struct mlxsw_pci_queue *rdq = q->u.cq.dq;
struct mlxsw_pci *mlxsw_pci = q->pci;
- int items = 0;
+ int work_done = 0;
char *cqe;
+ /* If the budget is 0, Rx processing should be skipped. */
+ if (unlikely(!budget))
+ return 0;
+
while ((cqe = mlxsw_pci_cq_sw_cqe_get(q))) {
u16 wqe_counter = mlxsw_pci_cqe_wqe_counter_get(cqe);
u8 sendq = mlxsw_pci_cqe_sr_get(q->u.cq.v, cqe);
@@ -712,22 +725,44 @@ static void mlxsw_pci_cq_rx_tasklet(struct tasklet_struct *t)
mlxsw_pci_cqe_rdq_handle(mlxsw_pci, rdq,
wqe_counter, q->u.cq.v, cqe);
- if (++items == MLXSW_PCI_CQ_MAX_HANDLE)
+ if (++work_done == budget)
break;
}
mlxsw_pci_queue_doorbell_consumer_ring(mlxsw_pci, q);
mlxsw_pci_queue_doorbell_producer_ring(mlxsw_pci, rdq);
- mlxsw_pci_queue_doorbell_arm_consumer_ring(mlxsw_pci, q);
+
+ if (work_done < budget)
+ goto processing_completed;
+
+ /* The driver still has outstanding work to do, budget was exhausted.
+ * Return exactly budget. In that case, the NAPI instance will be polled
+ * again.
+ */
+ if (mlxsw_pci_cq_cqe_to_handle(q))
+ goto out;
+
+ /* The driver processed all the completions and handled exactly
+ * 'budget'. Return 'budget - 1' to distinguish from the case that
+ * driver still has completions to handle.
+ */
+ if (work_done == budget)
+ work_done--;
+
+processing_completed:
+ if (napi_complete_done(napi, work_done))
+ mlxsw_pci_queue_doorbell_arm_consumer_ring(mlxsw_pci, q);
+out:
+ return work_done;
}
-static void mlxsw_pci_cq_tx_tasklet(struct tasklet_struct *t)
+static int mlxsw_pci_napi_poll_cq_tx(struct napi_struct *napi, int budget)
{
- struct mlxsw_pci_queue *q = from_tasklet(q, t, u.cq.tasklet);
+ struct mlxsw_pci_queue *q = container_of(napi, struct mlxsw_pci_queue,
+ u.cq.napi);
struct mlxsw_pci_queue *sdq = q->u.cq.dq;
struct mlxsw_pci *mlxsw_pci = q->pci;
- int credits = q->count >> 1;
- int items = 0;
+ int work_done = 0;
char *cqe;
while ((cqe = mlxsw_pci_cq_sw_cqe_get(q))) {
@@ -752,11 +787,21 @@ static void mlxsw_pci_cq_tx_tasklet(struct tasklet_struct *t)
mlxsw_pci_cqe_sdq_handle(mlxsw_pci, sdq,
wqe_counter, q->u.cq.v, ncqe);
- if (++items == credits)
- break;
+ work_done++;
}
+ /* If the budget is 0 napi_complete_done() should never be called. */
+ if (unlikely(!budget))
+ goto processing_completed;
+
+ work_done = min(work_done, budget - 1);
+ if (unlikely(!napi_complete_done(napi, work_done)))
+ goto out;
+
+processing_completed:
mlxsw_pci_queue_doorbell_arm_consumer_ring(mlxsw_pci, q);
+out:
+ return work_done;
}
static enum mlxsw_pci_cq_type
@@ -772,17 +817,29 @@ mlxsw_pci_cq_type(const struct mlxsw_pci *mlxsw_pci,
return MLXSW_PCI_CQ_RDQ;
}
-static void mlxsw_pci_cq_tasklet_setup(struct mlxsw_pci_queue *q,
- enum mlxsw_pci_cq_type cq_type)
+static void mlxsw_pci_cq_napi_setup(struct mlxsw_pci_queue *q,
+ enum mlxsw_pci_cq_type cq_type)
{
+ struct mlxsw_pci *mlxsw_pci = q->pci;
+
switch (cq_type) {
case MLXSW_PCI_CQ_SDQ:
- tasklet_setup(&q->u.cq.tasklet, mlxsw_pci_cq_tx_tasklet);
+ netif_napi_add(mlxsw_pci->napi_dev_tx, &q->u.cq.napi,
+ mlxsw_pci_napi_poll_cq_tx);
break;
case MLXSW_PCI_CQ_RDQ:
- tasklet_setup(&q->u.cq.tasklet, mlxsw_pci_cq_rx_tasklet);
+ netif_napi_add(mlxsw_pci->napi_dev_rx, &q->u.cq.napi,
+ mlxsw_pci_napi_poll_cq_rx);
break;
}
+
+ napi_enable(&q->u.cq.napi);
+}
+
+static void mlxsw_pci_cq_napi_teardown(struct mlxsw_pci_queue *q)
+{
+ napi_disable(&q->u.cq.napi);
+ netif_napi_del(&q->u.cq.napi);
}
static int mlxsw_pci_cq_init(struct mlxsw_pci *mlxsw_pci, char *mbox,
@@ -817,7 +874,7 @@ static int mlxsw_pci_cq_init(struct mlxsw_pci *mlxsw_pci, char *mbox,
err = mlxsw_cmd_sw2hw_cq(mlxsw_pci->core, mbox, q->num);
if (err)
return err;
- mlxsw_pci_cq_tasklet_setup(q, mlxsw_pci_cq_type(mlxsw_pci, q));
+ mlxsw_pci_cq_napi_setup(q, mlxsw_pci_cq_type(mlxsw_pci, q));
mlxsw_pci_queue_doorbell_consumer_ring(mlxsw_pci, q);
mlxsw_pci_queue_doorbell_arm_consumer_ring(mlxsw_pci, q);
return 0;
@@ -826,6 +883,7 @@ static int mlxsw_pci_cq_init(struct mlxsw_pci *mlxsw_pci, char *mbox,
static void mlxsw_pci_cq_fini(struct mlxsw_pci *mlxsw_pci,
struct mlxsw_pci_queue *q)
{
+ mlxsw_pci_cq_napi_teardown(q);
mlxsw_cmd_hw2sw_cq(mlxsw_pci->core, q->num);
}
@@ -886,7 +944,7 @@ static void mlxsw_pci_eq_tasklet(struct tasklet_struct *t)
cq_count = mlxsw_pci->num_cqs;
for_each_set_bit(cqn, active_cqns, cq_count) {
q = mlxsw_pci_cq_get(mlxsw_pci, cqn);
- tasklet_schedule(&q->u.cq.tasklet);
+ napi_schedule(&q->u.cq.napi);
}
}
--
2.43.0
* Re: [PATCH net-next 0/5] mlxsw: Improve events processing performance
2024-04-26 12:42 [PATCH net-next 0/5] mlxsw: Improve events processing performance Petr Machata
` (4 preceding siblings ...)
2024-04-26 12:42 ` [PATCH net-next 5/5] mlxsw: pci: Use NAPI for event processing Petr Machata
@ 2024-04-29 9:50 ` patchwork-bot+netdevbpf
From: patchwork-bot+netdevbpf @ 2024-04-29 9:50 UTC (permalink / raw)
To: Petr Machata
Cc: davem, edumazet, kuba, pabeni, netdev, idosch, amcohen, mlxsw
Hello:
This series was applied to netdev/net-next.git (main)
by David S. Miller <davem@davemloft.net>:
On Fri, 26 Apr 2024 14:42:21 +0200 you wrote:
> Amit Cohen writes:
>
> Spectrum ASICs only support a single interrupt, which means that all
> events are handled by one IRQ (interrupt request) handler.
>
> Currently, we schedule a tasklet to handle events in the EQ, and then we
> also use tasklets for the CQs, which serve the SDQs and RDQs. A tasklet
> runs in softIRQ (software IRQ) context and is run on the same CPU that
> scheduled it. This means that today one CPU handles all the packets
> (both network packets and EMADs) coming from the hardware.
>
> [...]
Here is the summary with links:
- [net-next,1/5] mlxsw: pci: Handle up to 64 Rx completions in tasklet
https://git.kernel.org/netdev/net-next/c/e28d8aba4381
- [net-next,2/5] mlxsw: pci: Ring RDQ and CQ doorbells once per several completions
https://git.kernel.org/netdev/net-next/c/6b3d015cdb2a
- [net-next,3/5] mlxsw: pci: Initialize dummy net devices for NAPI
https://git.kernel.org/netdev/net-next/c/5d01ed2e9708
- [net-next,4/5] mlxsw: pci: Reorganize 'mlxsw_pci_queue' structure
https://git.kernel.org/netdev/net-next/c/c0d9267873bc
- [net-next,5/5] mlxsw: pci: Use NAPI for event processing
(no matching commit)
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html