* [PATCH] net: increase the maximum of RX/TX descriptors
@ 2024-10-29 12:48 Lukas Sismis
2024-10-29 14:37 ` Morten Brørup
0 siblings, 1 reply; 12+ messages in thread
From: Lukas Sismis @ 2024-10-29 12:48 UTC (permalink / raw)
To: anatoly.burakov, ian.stokes; +Cc: dev, Lukas Sismis
Intel PMDs are capped by default at 4096 RX/TX descriptors.
This can be limiting for applications that require larger
buffering capabilities; the cap prevented such applications from
configuring more descriptors. By buffering more packets in the
RX/TX descriptor rings, applications can better absorb
processing peaks.
Signed-off-by: Lukas Sismis <sismis@cesnet.cz>
---
doc/guides/nics/ixgbe.rst | 2 +-
drivers/net/cpfl/cpfl_rxtx.h | 2 +-
drivers/net/e1000/e1000_ethdev.h | 2 +-
drivers/net/iavf/iavf_rxtx.h | 2 +-
drivers/net/ice/ice_rxtx.h | 2 +-
drivers/net/idpf/idpf_rxtx.h | 2 +-
drivers/net/ixgbe/ixgbe_ethdev.c | 2 +-
drivers/net/ixgbe/ixgbe_rxtx.h | 2 +-
8 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/doc/guides/nics/ixgbe.rst b/doc/guides/nics/ixgbe.rst
index 14573b542e..291b33d699 100644
--- a/doc/guides/nics/ixgbe.rst
+++ b/doc/guides/nics/ixgbe.rst
@@ -76,7 +76,7 @@ Scattered packets are not supported in this mode.
If an incoming packet is greater than the maximum acceptable length of one "mbuf" data size (by default, the size is 2 KB),
vPMD for RX would be disabled.
-By default, IXGBE_MAX_RING_DESC is set to 4096 and RTE_PMD_IXGBE_RX_MAX_BURST is set to 32.
+By default, IXGBE_MAX_RING_DESC is set to 32768 and RTE_PMD_IXGBE_RX_MAX_BURST is set to 32.
Windows Prerequisites and Pre-conditions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/drivers/net/cpfl/cpfl_rxtx.h b/drivers/net/cpfl/cpfl_rxtx.h
index aacd087b56..4db4025771 100644
--- a/drivers/net/cpfl/cpfl_rxtx.h
+++ b/drivers/net/cpfl/cpfl_rxtx.h
@@ -11,7 +11,7 @@
/* In QLEN must be whole number of 32 descriptors. */
#define CPFL_ALIGN_RING_DESC 32
#define CPFL_MIN_RING_DESC 32
-#define CPFL_MAX_RING_DESC 4096
+#define CPFL_MAX_RING_DESC 32768
#define CPFL_DMA_MEM_ALIGN 4096
#define CPFL_MAX_HAIRPINQ_RX_2_TX 1
diff --git a/drivers/net/e1000/e1000_ethdev.h b/drivers/net/e1000/e1000_ethdev.h
index 339ae1f4b6..e9046047f6 100644
--- a/drivers/net/e1000/e1000_ethdev.h
+++ b/drivers/net/e1000/e1000_ethdev.h
@@ -107,7 +107,7 @@
* (num_ring_desc * sizeof(struct e1000_rx/tx_desc)) % 128 == 0
*/
#define E1000_MIN_RING_DESC 32
-#define E1000_MAX_RING_DESC 4096
+#define E1000_MAX_RING_DESC 32768
/*
* TDBA/RDBA should be aligned on 16 byte boundary. But TDLEN/RDLEN should be
diff --git a/drivers/net/iavf/iavf_rxtx.h b/drivers/net/iavf/iavf_rxtx.h
index 7b56076d32..f9c129f0ef 100644
--- a/drivers/net/iavf/iavf_rxtx.h
+++ b/drivers/net/iavf/iavf_rxtx.h
@@ -8,7 +8,7 @@
/* In QLEN must be whole number of 32 descriptors. */
#define IAVF_ALIGN_RING_DESC 32
#define IAVF_MIN_RING_DESC 64
-#define IAVF_MAX_RING_DESC 4096
+#define IAVF_MAX_RING_DESC 32768
#define IAVF_DMA_MEM_ALIGN 4096
/* Base address of the HW descriptor ring should be 128B aligned. */
#define IAVF_RING_BASE_ALIGN 128
diff --git a/drivers/net/ice/ice_rxtx.h b/drivers/net/ice/ice_rxtx.h
index f7276cfc9f..6d18fe908d 100644
--- a/drivers/net/ice/ice_rxtx.h
+++ b/drivers/net/ice/ice_rxtx.h
@@ -9,7 +9,7 @@
#define ICE_ALIGN_RING_DESC 32
#define ICE_MIN_RING_DESC 64
-#define ICE_MAX_RING_DESC 4096
+#define ICE_MAX_RING_DESC 32768
#define ICE_DMA_MEM_ALIGN 4096
#define ICE_RING_BASE_ALIGN 128
diff --git a/drivers/net/idpf/idpf_rxtx.h b/drivers/net/idpf/idpf_rxtx.h
index 41a7495083..0f78f7cba5 100644
--- a/drivers/net/idpf/idpf_rxtx.h
+++ b/drivers/net/idpf/idpf_rxtx.h
@@ -11,7 +11,7 @@
/* In QLEN must be whole number of 32 descriptors. */
#define IDPF_ALIGN_RING_DESC 32
#define IDPF_MIN_RING_DESC 32
-#define IDPF_MAX_RING_DESC 4096
+#define IDPF_MAX_RING_DESC 32768
#define IDPF_DMA_MEM_ALIGN 4096
/* Base address of the HW descriptor ring should be 128B aligned. */
#define IDPF_RING_BASE_ALIGN 128
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 7da2ccf6a8..a2637f0a91 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -73,7 +73,7 @@
#define IXGBE_MMW_SIZE_DEFAULT 0x4
#define IXGBE_MMW_SIZE_JUMBO_FRAME 0x14
-#define IXGBE_MAX_RING_DESC 4096 /* replicate define from rxtx */
+#define IXGBE_MAX_RING_DESC 32768 /* replicate define from rxtx */
/*
* Default values for RX/TX configuration
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index ee89c89929..a28037b08a 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -25,7 +25,7 @@
* (num_ring_desc * sizeof(rx/tx descriptor)) % 128 == 0
*/
#define IXGBE_MIN_RING_DESC 32
-#define IXGBE_MAX_RING_DESC 4096
+#define IXGBE_MAX_RING_DESC 32768
#define RTE_PMD_IXGBE_TX_MAX_BURST 32
#define RTE_PMD_IXGBE_RX_MAX_BURST 32
--
2.34.1
* RE: [PATCH] net: increase the maximum of RX/TX descriptors
2024-10-29 12:48 Lukas Sismis
@ 2024-10-29 14:37 ` Morten Brørup
2024-10-30 13:58 ` Lukáš Šišmiš
0 siblings, 1 reply; 12+ messages in thread
From: Morten Brørup @ 2024-10-29 14:37 UTC (permalink / raw)
To: Lukas Sismis, anatoly.burakov, ian.stokes; +Cc: dev
> From: Lukas Sismis [mailto:sismis@cesnet.cz]
> Sent: Tuesday, 29 October 2024 13.49
>
> Intel PMDs are capped by default to only 4096 RX/TX descriptors.
> This can be limiting for applications requiring a bigger buffer
> capabilities. The cap prevented the applications to configure
> more descriptors. By bufferring more packets with RX/TX
> descriptors, the applications can better handle the processing
> peaks.
>
> Signed-off-by: Lukas Sismis <sismis@cesnet.cz>
> ---
Seems like a good idea.
Have the max number of descriptors been checked with the datasheets for all the affected NIC chips?
* Re: [PATCH] net: increase the maximum of RX/TX descriptors
2024-10-29 14:37 ` Morten Brørup
@ 2024-10-30 13:58 ` Lukáš Šišmiš
2024-10-30 15:20 ` Stephen Hemminger
0 siblings, 1 reply; 12+ messages in thread
From: Lukáš Šišmiš @ 2024-10-30 13:58 UTC (permalink / raw)
To: Morten Brørup, anatoly.burakov, ian.stokes; +Cc: dev
On 29. 10. 24 15:37, Morten Brørup wrote:
>> From: Lukas Sismis [mailto:sismis@cesnet.cz]
>> Sent: Tuesday, 29 October 2024 13.49
>>
>> Intel PMDs are capped by default to only 4096 RX/TX descriptors.
>> This can be limiting for applications requiring a bigger buffer
>> capabilities. The cap prevented the applications to configure
>> more descriptors. By bufferring more packets with RX/TX
>> descriptors, the applications can better handle the processing
>> peaks.
>>
>> Signed-off-by: Lukas Sismis <sismis@cesnet.cz>
>> ---
> Seems like a good idea.
>
> Have the max number of descriptors been checked with the datasheets for all the affected NIC chips?
>
I was hoping to get some feedback on this from the Intel folks.
But it seems I can raise it only for ixgbe (82599), to 32k
(possibly to 64k - 8); the others - ice (E810) and i40e (X710) - are
capped at 8k - 32.
I have neither experience with the other drivers nor the hardware
available to test them, so I will leave them unchanged in the follow-up
version of this patch.
Lukas
* Re: [PATCH] net: increase the maximum of RX/TX descriptors
2024-10-30 13:58 ` Lukáš Šišmiš
@ 2024-10-30 15:20 ` Stephen Hemminger
2024-10-30 15:40 ` Lukáš Šišmiš
0 siblings, 1 reply; 12+ messages in thread
From: Stephen Hemminger @ 2024-10-30 15:20 UTC (permalink / raw)
To: Lukáš Šišmiš
Cc: Morten Brørup, anatoly.burakov, ian.stokes, dev
On Wed, 30 Oct 2024 14:58:40 +0100
Lukáš Šišmiš <sismis@cesnet.cz> wrote:
> On 29. 10. 24 15:37, Morten Brørup wrote:
> >> From: Lukas Sismis [mailto:sismis@cesnet.cz]
> >> Sent: Tuesday, 29 October 2024 13.49
> >>
> >> Intel PMDs are capped by default to only 4096 RX/TX descriptors.
> >> This can be limiting for applications requiring a bigger buffer
> >> capabilities. The cap prevented the applications to configure
> >> more descriptors. By bufferring more packets with RX/TX
> >> descriptors, the applications can better handle the processing
> >> peaks.
> >>
> >> Signed-off-by: Lukas Sismis <sismis@cesnet.cz>
> >> ---
> > Seems like a good idea.
> >
> > Have the max number of descriptors been checked with the datasheets for all the affected NIC chips?
> >
> I was hoping to get some feedback on this from the Intel folks.
>
> But it seems like I can change it only for ixgbe (82599) to 32k
> (possibly to 64k - 8), others - ice (E810) and i40e (X710) are capped at
> 8k - 32.
>
> I neither have any experience with other drivers nor I have them
> available to test so I will let it be in the follow-up version of this
> patch.
>
> Lukas
>
Having a large number of descriptors, especially at lower speeds, will
increase bufferbloat. For real-life applications, you do not want to
increase latency by more than 1 ms.
10 Gbps has 7.62 Gbps of effective bandwidth due to per-frame overhead.
The rate for a 1500-byte MTU is 7.62 Gbps / (1500 * 8) = 635 Kpps (i.e. ~1.5 us per packet).
A ring of 4096 descriptors can take over 6 ms to drain with full-size packets.
Be careful: optimizing for 64-byte benchmarks can be a disaster in the real world.
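[Editor's note: the arithmetic above can be sketched as a small helper. The 7.62 Gbps effective-rate figure is taken from the message; the function names are illustrative only.]

```c
#include <assert.h>

/* Packets per second at a given effective line rate and frame size. */
static double pkts_per_sec(double eff_gbps, int frame_bytes)
{
    return (eff_gbps * 1e9) / (frame_bytes * 8.0);
}

/* Worst-case time, in microseconds, to drain a full RX ring of n_desc
 * descriptors at that rate -- i.e. the extra latency the ring can add. */
static double ring_drain_us(int n_desc, double eff_gbps, int frame_bytes)
{
    return (double)n_desc / pkts_per_sec(eff_gbps, frame_bytes) * 1e6;
}
```

With these, `ring_drain_us(4096, 7.62, 1500)` comes out around 6.4 ms, matching the estimate above, while a 32768-entry ring could add over 50 ms at the same rate.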
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] net: increase the maximum of RX/TX descriptors
2024-10-30 15:20 ` Stephen Hemminger
@ 2024-10-30 15:40 ` Lukáš Šišmiš
2024-10-30 15:58 ` Bruce Richardson
2024-10-30 16:06 ` Stephen Hemminger
0 siblings, 2 replies; 12+ messages in thread
From: Lukáš Šišmiš @ 2024-10-30 15:40 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Morten Brørup, anatoly.burakov, ian.stokes, dev
On 30. 10. 24 16:20, Stephen Hemminger wrote:
> On Wed, 30 Oct 2024 14:58:40 +0100
> Lukáš Šišmiš <sismis@cesnet.cz> wrote:
>
>> On 29. 10. 24 15:37, Morten Brørup wrote:
>>>> From: Lukas Sismis [mailto:sismis@cesnet.cz]
>>>> Sent: Tuesday, 29 October 2024 13.49
>>>>
>>>> Intel PMDs are capped by default to only 4096 RX/TX descriptors.
>>>> This can be limiting for applications requiring a bigger buffer
>>>> capabilities. The cap prevented the applications to configure
>>>> more descriptors. By bufferring more packets with RX/TX
>>>> descriptors, the applications can better handle the processing
>>>> peaks.
>>>>
>>>> Signed-off-by: Lukas Sismis <sismis@cesnet.cz>
>>>> ---
>>> Seems like a good idea.
>>>
>>> Have the max number of descriptors been checked with the datasheets for all the affected NIC chips?
>>>
>> I was hoping to get some feedback on this from the Intel folks.
>>
>> But it seems like I can change it only for ixgbe (82599) to 32k
>> (possibly to 64k - 8), others - ice (E810) and i40e (X710) are capped at
>> 8k - 32.
>>
>> I neither have any experience with other drivers nor I have them
>> available to test so I will let it be in the follow-up version of this
>> patch.
>>
>> Lukas
>>
> Having large number of descriptors especially at lower speeds will
> increase buffer bloat. For real life applications, do not want increase
> latency more than 1ms.
>
> 10 Gbps has 7.62Gbps of effective bandwidth due to overhead.
> Rate for 1500 MTU is 7.62Gbs / (1500 * 8) = 635 K pps (i.e 1.5 us per packet)
> A ring of 4096 descriptors can take 6 ms for full size packets.
>
> Be careful, optimizing for 64 byte benchmarks can be disaster in real world.
>
Thanks for the info Stephen, however I am not trying to optimize for
64-byte benchmarks. The work was initiated by an IO problem with Intel
NICs. A Suricata IDS worker (1 core per queue) receives a burst of
packets and then processes them sequentially, one by one. It seems that
4k buffers are not enough for this. NVIDIA NICs allow e.g. 32k
descriptors and work fine there, and in the end ixgbe worked fine as
well once its descriptor count was increased. I am not sure why
AF_PACKET handles this much better than DPDK; AF_PACKET is not
configured with a crazy high number of descriptors (<= 4096), yet it
works better. At the moment I assume there is internal buffering in the
kernel that absorbs processing spikes.
To give more context, here is the forum discussion -
https://forum.suricata.io/t/high-packet-drop-rate-with-dpdk-compared-to-af-packet-in-suricata-7-0-7/4896
* Re: [PATCH] net: increase the maximum of RX/TX descriptors
2024-10-30 15:40 ` Lukáš Šišmiš
@ 2024-10-30 15:58 ` Bruce Richardson
2024-10-30 16:06 ` Stephen Hemminger
1 sibling, 0 replies; 12+ messages in thread
From: Bruce Richardson @ 2024-10-30 15:58 UTC (permalink / raw)
To: Lukáš Šišmiš
Cc: Stephen Hemminger, Morten Brørup, anatoly.burakov,
ian.stokes, dev
On Wed, Oct 30, 2024 at 04:40:10PM +0100, Lukáš Šišmiš wrote:
>
> On 30. 10. 24 16:20, Stephen Hemminger wrote:
> > On Wed, 30 Oct 2024 14:58:40 +0100
> > Lukáš Šišmiš <sismis@cesnet.cz> wrote:
> >
> > > On 29. 10. 24 15:37, Morten Brørup wrote:
> > > > > From: Lukas Sismis [mailto:sismis@cesnet.cz]
> > > > > Sent: Tuesday, 29 October 2024 13.49
> > > > >
> > > > > Intel PMDs are capped by default to only 4096 RX/TX descriptors.
> > > > > This can be limiting for applications requiring a bigger buffer
> > > > > capabilities. The cap prevented the applications to configure
> > > > > more descriptors. By bufferring more packets with RX/TX
> > > > > descriptors, the applications can better handle the processing
> > > > > peaks.
> > > > >
> > > > > Signed-off-by: Lukas Sismis <sismis@cesnet.cz>
> > > > > ---
> > > > Seems like a good idea.
> > > >
> > > > Have the max number of descriptors been checked with the datasheets for all the affected NIC chips?
> > > I was hoping to get some feedback on this from the Intel folks.
> > >
> > > But it seems like I can change it only for ixgbe (82599) to 32k
> > > (possibly to 64k - 8), others - ice (E810) and i40e (X710) are capped at
> > > 8k - 32.
> > >
> > > I neither have any experience with other drivers nor I have them
> > > available to test so I will let it be in the follow-up version of this
> > > patch.
> > >
> > > Lukas
> > >
> > Having large number of descriptors especially at lower speeds will
> > increase buffer bloat. For real life applications, do not want increase
> > latency more than 1ms.
> >
> > 10 Gbps has 7.62Gbps of effective bandwidth due to overhead.
> > Rate for 1500 MTU is 7.62Gbs / (1500 * 8) = 635 K pps (i.e 1.5 us per packet)
> > A ring of 4096 descriptors can take 6 ms for full size packets.
> >
> > Be careful, optimizing for 64 byte benchmarks can be disaster in real world.
> >
> Thanks for the info Stephen, however I am not trying to optimize for 64 byte
> benchmarks. The work has been initiated by an IO problem and Intel NICs.
> Suricata IDS worker (1 core per queue) received a burst of packets and then
> sequentially processes them one by one. Well it seems like having a 4k
> buffers it seems to not be enough. NVIDIA NICs allow e.g. 32k descriptors
> and it works fine. In the end it worked fine when ixgbe descriptors were
> increased as well. I am not sure why AF-Packet can handle this much better
> than DPDK, AFP doesn't have crazy high number of descriptors configured <=
> 4096, yet it works better. At the moment I assume there is an internal
> buffering in the kernel which allows to handle processing spikes.
>
> To give more context here is the forum discussion - https://forum.suricata.io/t/high-packet-drop-rate-with-dpdk-compared-to-af-packet-in-suricata-7-0-7/4896
>
Thanks for the context; it is an interesting discussion.
One small suggestion, which I sadly don't think will help with your
problem specifically: I suspect that you don't need both Rx and Tx
queues to be that big. Given that the traffic going out is not going to be
greater than the traffic rate coming in, you shouldn't need much buffering
on the Tx side. Therefore, even if you increase the Rx buffers to 32k, I'd
suggest using only 1k or 512 Tx ring slots and seeing how it goes. That will
give you better performance due to a reduced memory buffer footprint. Any
packet buffers transmitted will remain in the NIC ring until SW wraps all
the way around the ring, meaning a 4k Tx ring will likely always hold 4k-64
buffers in it, and similarly a 32k Tx ring will increase your active buffer
count (and hence app cache footprint) by 32k-64.
/Bruce
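[Editor's note: Bruce's buffer-footprint accounting can be sketched as a tiny helper. The 64-buffer cleanup burst is the figure from his message; the function name is illustrative only.]

```c
#include <assert.h>

/* Approximate number of mbufs a TX ring keeps referenced: transmitted
 * buffers are freed only when the driver cleans up descriptors, so the
 * ring tends to hold roughly its full size minus one cleanup burst. */
static int tx_ring_mbuf_footprint(int ring_size, int cleanup_burst)
{
    return ring_size - cleanup_burst;
}
```

Under this rough model a 4096-slot TX ring pins about 4032 mbufs and a 32768-slot ring about 32704, while a 512-slot ring pins only around 448 - the cache-footprint saving Bruce describes.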
* Re: [PATCH] net: increase the maximum of RX/TX descriptors
2024-10-30 15:40 ` Lukáš Šišmiš
2024-10-30 15:58 ` Bruce Richardson
@ 2024-10-30 16:06 ` Stephen Hemminger
2024-11-05 8:49 ` Morten Brørup
1 sibling, 1 reply; 12+ messages in thread
From: Stephen Hemminger @ 2024-10-30 16:06 UTC (permalink / raw)
To: Lukáš Šišmiš
Cc: Morten Brørup, anatoly.burakov, ian.stokes, dev
On Wed, 30 Oct 2024 16:40:10 +0100
Lukáš Šišmiš <sismis@cesnet.cz> wrote:
> On 30. 10. 24 16:20, Stephen Hemminger wrote:
> > On Wed, 30 Oct 2024 14:58:40 +0100
> > Lukáš Šišmiš <sismis@cesnet.cz> wrote:
> >
> >> On 29. 10. 24 15:37, Morten Brørup wrote:
> >>>> From: Lukas Sismis [mailto:sismis@cesnet.cz]
> >>>> Sent: Tuesday, 29 October 2024 13.49
> >>>>
> >>>> Intel PMDs are capped by default to only 4096 RX/TX descriptors.
> >>>> This can be limiting for applications requiring a bigger buffer
> >>>> capabilities. The cap prevented the applications to configure
> >>>> more descriptors. By bufferring more packets with RX/TX
> >>>> descriptors, the applications can better handle the processing
> >>>> peaks.
> >>>>
> >>>> Signed-off-by: Lukas Sismis <sismis@cesnet.cz>
> >>>> ---
> >>> Seems like a good idea.
> >>>
> >>> Have the max number of descriptors been checked with the datasheets for all the affected NIC chips?
> >>>
> >> I was hoping to get some feedback on this from the Intel folks.
> >>
> >> But it seems like I can change it only for ixgbe (82599) to 32k
> >> (possibly to 64k - 8), others - ice (E810) and i40e (X710) are capped at
> >> 8k - 32.
> >>
> >> I neither have any experience with other drivers nor I have them
> >> available to test so I will let it be in the follow-up version of this
> >> patch.
> >>
> >> Lukas
> >>
> > Having large number of descriptors especially at lower speeds will
> > increase buffer bloat. For real life applications, do not want increase
> > latency more than 1ms.
> >
> > 10 Gbps has 7.62Gbps of effective bandwidth due to overhead.
> > Rate for 1500 MTU is 7.62Gbs / (1500 * 8) = 635 K pps (i.e 1.5 us per packet)
> > A ring of 4096 descriptors can take 6 ms for full size packets.
> >
> > Be careful, optimizing for 64 byte benchmarks can be disaster in real world.
> >
> Thanks for the info Stephen, however I am not trying to optimize for 64
> byte benchmarks. The work has been initiated by an IO problem and Intel
> NICs. Suricata IDS worker (1 core per queue) received a burst of packets
> and then sequentially processes them one by one. Well it seems like
> having a 4k buffers it seems to not be enough. NVIDIA NICs allow e.g.
> 32k descriptors and it works fine. In the end it worked fine when ixgbe
> descriptors were increased as well. I am not sure why AF-Packet can
> handle this much better than DPDK, AFP doesn't have crazy high number of
> descriptors configured <= 4096, yet it works better. At the moment I
> assume there is an internal buffering in the kernel which allows to
> handle processing spikes.
>
> To give more context here is the forum discussion -
> https://forum.suricata.io/t/high-packet-drop-rate-with-dpdk-compared-to-af-packet-in-suricata-7-0-7/4896
>
>
>
I suspect AF_PACKET provides an intermediate step which can buffer more
or spread out the work.
* RE: [PATCH] net: increase the maximum of RX/TX descriptors
2024-10-30 16:06 ` Stephen Hemminger
@ 2024-11-05 8:49 ` Morten Brørup
2024-11-05 15:55 ` Stephen Hemminger
0 siblings, 1 reply; 12+ messages in thread
From: Morten Brørup @ 2024-11-05 8:49 UTC (permalink / raw)
To: Stephen Hemminger, Lukáš Šišmiš
Cc: anatoly.burakov, ian.stokes, dev, bruce.richardson
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Wednesday, 30 October 2024 17.07
>
> On Wed, 30 Oct 2024 16:40:10 +0100
> Lukáš Šišmiš <sismis@cesnet.cz> wrote:
>
> > On 30. 10. 24 16:20, Stephen Hemminger wrote:
> > > On Wed, 30 Oct 2024 14:58:40 +0100
> > > Lukáš Šišmiš <sismis@cesnet.cz> wrote:
> > >
> > >> On 29. 10. 24 15:37, Morten Brørup wrote:
> > >>>> From: Lukas Sismis [mailto:sismis@cesnet.cz]
> > >>>> Sent: Tuesday, 29 October 2024 13.49
> > >>>>
> > >>>> Intel PMDs are capped by default to only 4096 RX/TX descriptors.
> > >>>> This can be limiting for applications requiring a bigger buffer
> > >>>> capabilities. The cap prevented the applications to configure
> > >>>> more descriptors. By bufferring more packets with RX/TX
> > >>>> descriptors, the applications can better handle the processing
> > >>>> peaks.
> > >>>>
> > >>>> Signed-off-by: Lukas Sismis <sismis@cesnet.cz>
> > >>>> ---
> > >>> Seems like a good idea.
> > >>>
> > >>> Have the max number of descriptors been checked with the
> datasheets for all the affected NIC chips?
> > >>>
> > >> I was hoping to get some feedback on this from the Intel folks.
> > >>
> > >> But it seems like I can change it only for ixgbe (82599) to 32k
> > >> (possibly to 64k - 8), others - ice (E810) and i40e (X710) are
> capped at
> > >> 8k - 32.
> > >>
> > >> I neither have any experience with other drivers nor I have them
> > >> available to test so I will let it be in the follow-up version of
> this
> > >> patch.
> > >>
> > >> Lukas
> > >>
> > > Having large number of descriptors especially at lower speeds will
> > > increase buffer bloat. For real life applications, do not want
> increase
> > > latency more than 1ms.
> > >
> > > 10 Gbps has 7.62Gbps of effective bandwidth due to overhead.
> > > Rate for 1500 MTU is 7.62Gbs / (1500 * 8) = 635 K pps (i.e 1.5 us
> per packet)
> > > A ring of 4096 descriptors can take 6 ms for full size packets.
> > >
> > > Be careful, optimizing for 64 byte benchmarks can be disaster in
> real world.
> > >
> > Thanks for the info Stephen, however I am not trying to optimize for
> 64
> > byte benchmarks. The work has been initiated by an IO problem and
> Intel
> > NICs. Suricata IDS worker (1 core per queue) received a burst of
> packets
> > and then sequentially processes them one by one. Well it seems like
> > having a 4k buffers it seems to not be enough. NVIDIA NICs allow e.g.
> > 32k descriptors and it works fine. In the end it worked fine when
> ixgbe
> > descriptors were increased as well. I am not sure why AF-Packet can
> > handle this much better than DPDK, AFP doesn't have crazy high number
> of
> > descriptors configured <= 4096, yet it works better. At the moment I
> > assume there is an internal buffering in the kernel which allows to
> > handle processing spikes.
> >
> > To give more context here is the forum discussion -
> > https://forum.suricata.io/t/high-packet-drop-rate-with-dpdk-compared-
> to-af-packet-in-suricata-7-0-7/4896
> >
> >
> >
>
> I suspect AF_PACKET provides an intermediate step which can buffer more
> or spread out the work.
Agree. It's a Linux scheduling issue.
With DPDK polling, there is no interrupt to alert the kernel scheduler.
If the CPU core running the DPDK polling thread is running some other thread when the packets arrive on the hardware, the DPDK polling thread is NOT scheduled immediately, but has to wait for the kernel scheduler to switch to this thread instead of the other thread.
Quite a lot of time can pass before this happens - the kernel scheduler does not know that the DPDK polling thread has urgent work pending.
And the number of RX descriptors needs to be big enough to absorb all packets arriving during the scheduling delay.
It is not well described how to *guarantee* that nothing but the DPDK polling thread runs on a dedicated CPU core.
With AF_PACKET, the hardware generates an interrupt, and the kernel immediately calls the driver's interrupt handler - regardless of what the CPU core is currently doing.
The driver's interrupt handler acknowledges the interrupt to the hardware and informs the kernel that the softirq handler is pending.
AFAIU, the kernel executes pending softirq handlers immediately after returning from an interrupt handler - regardless of what the CPU core was doing when the interrupt occurred.
The softirq handler then dequeues the packets from the hardware RX descriptors into SKBs, and when all of them have been dequeued from the hardware, re-enables interrupts. Then the CPU core resumes the work it was doing when interrupted.
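[Editor's note: Morten's sizing rule - the RX ring must absorb all packets arriving during a scheduling delay - can be written down directly. A sketch; the function name is illustrative only.]

```c
#include <assert.h>

/* Minimum RX ring size so that a polling thread descheduled for
 * delay_us microseconds loses no packets at the given arrival rate. */
static int rx_desc_for_delay(double pkts_per_sec, double delay_us)
{
    double pkts = pkts_per_sec * delay_us / 1e6;
    int n = (int)pkts;
    return (pkts > (double)n) ? n + 1 : n;  /* round up */
}
```

At the ~635 Kpps full-size-frame rate computed earlier in the thread, a 1 ms stall needs about 635 descriptors; at the 10 GbE 64-byte line rate of ~14.88 Mpps, the same stall needs ~14880, well beyond the old 4096 cap.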
* Re: [PATCH] net: increase the maximum of RX/TX descriptors
2024-11-05 8:49 ` Morten Brørup
@ 2024-11-05 15:55 ` Stephen Hemminger
2024-11-05 16:50 ` Morten Brørup
0 siblings, 1 reply; 12+ messages in thread
From: Stephen Hemminger @ 2024-11-05 15:55 UTC (permalink / raw)
To: Morten Brørup
Cc: Lukáš Šišmiš, anatoly.burakov,
ian.stokes, dev, bruce.richardson
On Tue, 5 Nov 2024 09:49:39 +0100
Morten Brørup <mb@smartsharesystems.com> wrote:
> >
> > I suspect AF_PACKET provides an intermediate step which can buffer more
> > or spread out the work.
>
> Agree. It's a Linux scheduling issue.
>
> With DPDK polling, there is no interrupt in the kernel scheduler.
> If the CPU core running the DPDK polling thread is running some other thread when the packets arrive on the hardware, the DPDK polling thread is NOT scheduled immediately, but has to wait for the kernel scheduler to switch to this thread instead of the other thread.
> Quite a lot of time can pass before this happens - the kernel scheduler does not know that the DPDK polling thread has urgent work pending.
> And the number of RX descriptors needs to be big enough to absorb all packets arriving during the scheduling delay.
> It is not well described how to *guarantee* that nothing but the DPDK polling thread runs on a dedicated CPU core.
That is why any non-trivial DPDK application needs to run on isolated CPUs.
* RE: [PATCH] net: increase the maximum of RX/TX descriptors
2024-11-05 15:55 ` Stephen Hemminger
@ 2024-11-05 16:50 ` Morten Brørup
2024-11-05 21:20 ` Lukáš Šišmiš
0 siblings, 1 reply; 12+ messages in thread
From: Morten Brørup @ 2024-11-05 16:50 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Lukáš Šišmiš, anatoly.burakov,
ian.stokes, dev, bruce.richardson
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Tuesday, 5 November 2024 16.55
>
> On Tue, 5 Nov 2024 09:49:39 +0100
> Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > >
> > > I suspect AF_PACKET provides an intermediate step which can buffer
> more
> > > or spread out the work.
> >
> > Agree. It's a Linux scheduling issue.
> >
> > With DPDK polling, there is no interrupt in the kernel scheduler.
> > If the CPU core running the DPDK polling thread is running some other
> thread when the packets arrive on the hardware, the DPDK polling thread
> is NOT scheduled immediately, but has to wait for the kernel scheduler
> to switch to this thread instead of the other thread.
> > Quite a lot of time can pass before this happens - the kernel
> scheduler does not know that the DPDK polling thread has urgent work
> pending.
> > And the number of RX descriptors needs to be big enough to absorb all
> packets arriving during the scheduling delay.
> > It is not well described how to *guarantee* that nothing but the DPDK
> polling thread runs on a dedicated CPU core.
>
> That's why any non-trivial DPDK application needs to run on isolated
> CPUs.
Exactly.
And it is non-trivial and not well described how to do this.
Especially in virtual environments.
E.g. I ran some scheduling latency tests earlier today, and frequently observed 500-1000 us scheduling latency under vmware vSphere ESXi. This requires a large number of RX descriptors to absorb without packet loss. (Disclaimer: The virtual machine configuration had not been optimized. Tweaking the knobs offered by the hypervisor might improve this.)
The exact same firmware (same kernel, rootfs, libraries, applications etc.) running directly on our purpose-built hardware has scheduling latency very close to the kernel's default "timerslack" (50 us).
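As a back-of-the-envelope check (a sketch; the 20-byte per-frame overhead covers the Ethernet preamble, SFD, and inter-frame gap), the ring size needed to ride out a given scheduling delay can be estimated as:

```python
def required_rx_descriptors(link_bps, frame_bytes, latency_s, overhead_bytes=20):
    """Smallest power-of-two RX ring that absorbs latency_s of line-rate traffic."""
    # Each frame occupies (frame + preamble/SFD + inter-frame gap) on the wire.
    wire_bits = (frame_bytes + overhead_bytes) * 8
    pps = link_bps / wire_bits          # worst-case packets per second
    packets = pps * latency_s           # packets arriving during the delay
    size = 1
    while size < packets:               # ring sizes are typically powers of two
        size *= 2
    return size

# 10 GbE, worst-case 64-byte frames, 1 ms scheduling delay:
print(required_rx_descriptors(10e9, 64, 1e-3))  # -> 16384
```

At 10 GbE line rate with minimum-size frames (~14.88 Mpps), a 1 ms delay already requires a 16384-entry ring — well past the previous 4096-descriptor cap this patch lifts.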
* Re: [PATCH] net: increase the maximum of RX/TX descriptors
2024-11-05 16:50 ` Morten Brørup
@ 2024-11-05 21:20 ` Lukáš Šišmiš
0 siblings, 0 replies; 12+ messages in thread
From: Lukáš Šišmiš @ 2024-11-05 21:20 UTC (permalink / raw)
To: Morten Brørup, Stephen Hemminger
Cc: anatoly.burakov, ian.stokes, dev, bruce.richardson
On 05. 11. 24 17:50, Morten Brørup wrote:
>> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
>> Sent: Tuesday, 5 November 2024 16.55
>>
>> On Tue, 5 Nov 2024 09:49:39 +0100
>> Morten Brørup <mb@smartsharesystems.com> wrote:
>>
>>>> I suspect AF_PACKET provides an intermediate step which can buffer
>> more
>>>> or spread out the work.
>>> Agree. It's a Linux scheduling issue.
>>>
>>> With DPDK polling, there is no interrupt in the kernel scheduler.
>>> If the CPU core running the DPDK polling thread is running some other
>> thread when the packets arrive on the hardware, the DPDK polling thread
>> is NOT scheduled immediately, but has to wait for the kernel scheduler
>> to switch to this thread instead of the other thread.
>>> Quite a lot of time can pass before this happens - the kernel
>> scheduler does not know that the DPDK polling thread has urgent work
>> pending.
>>> And the number of RX descriptors needs to be big enough to absorb all
>> packets arriving during the scheduling delay.
>>> It is not well described how to *guarantee* that nothing but the DPDK
>> polling thread runs on a dedicated CPU core.
>>
>> That's why any non-trivial DPDK application needs to run on isolated
>> CPUs.
> Exactly.
> And it is non-trivial and not well described how to do this.
>
> Especially in virtual environments.
> E.g. I ran some scheduling latency tests earlier today, and frequently observed 500-1000 us scheduling latency under vmware vSphere ESXi. This requires a large number of RX descriptors to absorb without packet loss. (Disclaimer: The virtual machine configuration had not been optimized. Tweaking the knobs offered by the hypervisor might improve this.)
>
> The exact same firmware (same kernel, rootfs, libraries, applications etc.) running directly on our purpose-built hardware has scheduling latency very close to the kernel's default "timerslack" (50 us).
>
Thanks for the feedback. I am currently not 100% sure whether I ran my
earlier experiments with isolcpus, or whether it had a significant impact.
But here is a decent guide on latency tuning I found the other day,
though virtual environments are not specifically covered:
https://rigtorp.se/low-latency-guide/
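A crude way to get a first number for the scheduling latency discussed above is to measure sleep overshoot — how far past a requested wakeup the thread actually resumes. This is only a rough proxy (for non-real-time tasks the result is dominated by the timer slack Morten mentioned, plus scheduler delay), but it makes the bare-metal vs. hypervisor difference visible:

```python
import time

def measure_wakeup_latency(sleep_us=100, iterations=1000):
    """Sleep repeatedly and record how far past the deadline we wake up (us)."""
    overshoots = []
    for _ in range(iterations):
        t0 = time.monotonic_ns()
        time.sleep(sleep_us / 1e6)
        t1 = time.monotonic_ns()
        overshoots.append((t1 - t0) / 1000 - sleep_us)
    overshoots.sort()
    # Median reflects the steady state; max hints at worst-case delay.
    return overshoots[len(overshoots) // 2], overshoots[-1]

median_us, max_us = measure_wakeup_latency()
print(f"median overshoot: {median_us:.0f} us, max: {max_us:.0f} us")
```

On an unloaded bare-metal Linux box the median typically lands near the default 50 us timerslack; long tails in the max are the spikes that a larger RX ring has to absorb.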
end of thread, other threads:[~2024-11-05 21:20 UTC | newest]
Thread overview: 12+ messages
2024-10-29 12:46 [PATCH] net: increase the maximum of RX/TX descriptors Lukas Sismis
-- strict thread matches above, loose matches on Subject: below --
2024-10-29 12:48 Lukas Sismis
2024-10-29 14:37 ` Morten Brørup
2024-10-30 13:58 ` Lukáš Šišmiš
2024-10-30 15:20 ` Stephen Hemminger
2024-10-30 15:40 ` Lukáš Šišmiš
2024-10-30 15:58 ` Bruce Richardson
2024-10-30 16:06 ` Stephen Hemminger
2024-11-05 8:49 ` Morten Brørup
2024-11-05 15:55 ` Stephen Hemminger
2024-11-05 16:50 ` Morten Brørup
2024-11-05 21:20 ` Lukáš Šišmiš