* [net-next PATCH 0/5] Optimizing "pktgen" for single CPU performance

From: Jesper Dangaard Brouer @ 2014-05-14 14:17 UTC
To: Jesper Dangaard Brouer, netdev
Cc: Alexander Duyck, Jeff Kirsher, Daniel Borkmann, Florian Westphal,
    David S. Miller, Stephen Hemminger, Paul E. McKenney, Robert Olsson,
    Ben Greear, John Fastabend, danieltt, zhouzhouyi

I'm on a quest to push the packet per sec (pps) limits of our network
stack, with a special focus on single CPU performance.

My first action is to measure and identify bottlenecks in the
transmit path. For achieving this goal, I need a fast in-kernel
packet generator, like "pktgen". It turned out that "pktgen" was too
slow. Thus, this series focuses on optimizing "pktgen" for single CPU
performance.

Overview of 1xCPU performance, packets per sec (pps) stats:
 * baseline: 3,930,068 pps
 * patch2:   5,362,722 pps -- TXSZ=1024
 * patch3:   5,608,781 pps --> 178.29ns per pkt
 * patch4:   5,857,065 pps --> 170.73ns ( -7.56ns)
 * patch5:   6,346,500 pps --> 157.56ns (-13.17ns)
 * No-lock:  6,642,948 pps --> 150.53ns ( -7.03ns)

The last result, "No-lock", removes the HARD_TX_{UN}LOCK and is not
applicable upstream. It removes two "LOCK" instructions (costing 8ns
each), thus I was expecting to see an improvement of 16ns, but we
only see 7ns. This leads me to believe that we have reached the
single-queue limit of the ixgbe driver.

Setup according to blogpost:
 http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html

Hardware:
 System: CPU E5-2630
 NIC: Intel ixgbe/82599 chip

Testing done with net-next git tree on top of commit 79e0f1c9f (ipv6:
Need to sock_put on csum error).

Pktgen script exercising race condition:
 https://github.com/netoptimizer/network-testing/blob/master/pktgen/unit_test01_race_add_rem_device_loop.sh

---

Jesper Dangaard Brouer (5):
      pktgen: RCU'ify "if_list" to remove lock in next_to_run()
      pktgen: avoid expensive set_current_state() call in loop
      pktgen: avoid atomic_inc per packet in xmit loop
      ixgbe: increase default TX ring buffer to 1024
      ixgbe: trivial fixes while reading code

 drivers/net/ethernet/intel/ixgbe/ixgbe.h      |    2
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    2
 net/core/pktgen.c                             |  115 +++++++++++++------------
 3 files changed, 61 insertions(+), 58 deletions(-)

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
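For reference, the per-packet nanosecond numbers above are simply
10^9/pps, and the bracketed deltas are differences between
consecutive steps. A small stand-alone C helper (purely illustrative;
not part of the patch series) that reproduces the table's arithmetic:

#include <stdio.h>

/* ns/pkt = 10^9 / pps; deltas between steps show per-packet savings */
static double ns_per_pkt(double pps)
{
	return 1e9 / pps;
}

int main(void)
{
	/* pps figures copied from the cover letter's overview table */
	double steps[] = { 3930068, 5362722, 5608781,
			   5857065, 6346500, 6642948 };
	int i;

	for (i = 1; i < 6; i++)
		printf("%.0f pps -> %.2f ns/pkt (%+.2f ns)\n", steps[i],
		       ns_per_pkt(steps[i]),
		       ns_per_pkt(steps[i]) - ns_per_pkt(steps[i - 1]));
	return 0;
}

Compiled with e.g. "gcc -O2", this prints 178.29, 170.73 (-7.56),
157.56 (-13.17), and 150.53 (-7.03) ns, matching the table.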
* [net-next PATCH 1/5] ixgbe: trivial fixes while reading code

From: Jesper Dangaard Brouer @ 2014-05-14 14:17 UTC
To: Jesper Dangaard Brouer, netdev
Cc: Alexander Duyck, Jeff Kirsher, Daniel Borkmann, Florian Westphal,
    David S. Miller, Stephen Hemminger, Paul E. McKenney, Robert Olsson,
    Ben Greear, John Fastabend, danieltt, zhouzhouyi

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 8089ea9..5e5df9f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2097,7 +2097,7 @@ dma_sync:
  * This function provides a "bounce buffer" approach to Rx interrupt
  * processing. The advantage to this is that on systems that have
  * expensive overhead for IOMMU access this provides a means of avoiding
- * it by maintaining the mapping of the page to the syste.
+ * it by maintaining the mapping of the page to the system.
  *
  * Returns amount of work completed
  **/
* [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024

From: Jesper Dangaard Brouer @ 2014-05-14 14:17 UTC
To: Jesper Dangaard Brouer, netdev
Cc: Alexander Duyck, Jeff Kirsher, Daniel Borkmann, Florian Westphal,
    David S. Miller, Stephen Hemminger, Paul E. McKenney, Robert Olsson,
    Ben Greear, John Fastabend, danieltt, zhouzhouyi

Using pktgen I'm seeing the ixgbe driver "push-back", due to the TX
ring running full. Thus, the TX ring is artificially limiting pktgen.

Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"
counters.

Increasing the TX ring buffer should be done carefully, as it comes
at a higher memory cost, which can also negatively influence
performance. E.g. the ring buffer array of struct ixgbe_tx_buffer
(current size 48 bytes) increases from 512*48=24576 bytes to
1024*48=49152 bytes, which is larger than the L1 data cache (32KB on
my E5-2630), thus increasing the L1->L2 cache-references.

Adjusting the TX ring buffer (TXSZ) measured over 10 sec with ifpps
(single CPU performance, ixgbe 10Gbit/s, E5-2630)
 * cmd: ethtool -G eth8 tx $TXSZ
 * 3,930,065 pps -- TXSZ= 512
 * 5,312,249 pps -- TXSZ= 768
 * 5,362,722 pps -- TXSZ=1024
 * 5,361,390 pps -- TXSZ=1536
 * 5,362,439 pps -- TXSZ=2048
 * 5,359,744 pps -- TXSZ=4096

I chose size 1024 because 768 is not enough for the next
optimizations.

Notice after commit 6f25cd47d (pktgen: fix xmit test for BQL enabled
devices) pktgen uses netif_xmit_frozen_or_drv_stopped() and ignores
the BQL "stack" pause (QUEUE_STATE_STACK_XOFF) flag. This allows us
to put more pressure on the TX ring buffers.

It is the ixgbe_maybe_stop_tx() call that stops the transmits, and
pktgen respects this via the call to
netif_xmit_frozen_or_drv_stopped(txq).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 drivers/net/ethernet/intel/ixgbe/ixgbe.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index c688c8a..bf078fe 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -63,7 +63,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

 /* TX/RX descriptor defines */
-#define IXGBE_DEFAULT_TXD		512
+#define IXGBE_DEFAULT_TXD		1024
 #define IXGBE_DEFAULT_TX_WORK		256
 #define IXGBE_MAX_TXD			4096
 #define IXGBE_MIN_TXD			64
* RE: [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024

From: David Laight @ 2014-05-14 14:28 UTC
To: 'Jesper Dangaard Brouer', netdev@vger.kernel.org
Cc: Alexander Duyck, Jeff Kirsher, Daniel Borkmann, Florian Westphal,
    David S. Miller, Stephen Hemminger, Paul E. McKenney, Robert Olsson,
    Ben Greear, John Fastabend, danieltt@kth.se, zhouzhouyi@gmail.com

From: Jesper Dangaard Brouer
> Using pktgen I'm seeing the ixgbe driver "push-back", due to the TX
> ring running full. Thus, the TX ring is artificially limiting pktgen.
>
> Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"
> counters.

Have you tried reducing the tx interrupt mitigation delay?
It might just be that the end of tx interrupt is delayed too long.

	David
* Re: [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024

From: Jesper Dangaard Brouer @ 2014-05-14 19:25 UTC
To: David Laight
Cc: brouer, netdev@vger.kernel.org, Alexander Duyck, Jeff Kirsher,
    Daniel Borkmann, Florian Westphal, David S. Miller,
    Stephen Hemminger, Paul E. McKenney, Robert Olsson, Ben Greear,
    John Fastabend, danieltt@kth.se, zhouzhouyi@gmail.com

On Wed, 14 May 2014 14:28:24 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Jesper Dangaard Brouer
> > Using pktgen I'm seeing the ixgbe driver "push-back", due to the TX
> > ring running full. Thus, the TX ring is artificially limiting pktgen.
> >
> > Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"
> > counters.
>
> Have you tried reducing the tx interrupt mitigation delay?
> It might just be that the end of tx interrupt is delayed too long.

Does the ixgbe have TX interrupt mitigation delays? I don't "see"
them:

$ sudo ethtool -c eth8
Coalesce parameters for eth8:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 1
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 0
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
* Re: [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024

From: Alexander Duyck @ 2014-05-14 16:28 UTC
To: Jesper Dangaard Brouer, netdev
Cc: Jeff Kirsher, Daniel Borkmann, Florian Westphal, David S. Miller,
    Stephen Hemminger, Paul E. McKenney, Robert Olsson, Ben Greear,
    John Fastabend, danieltt, zhouzhouyi

On 05/14/2014 07:17 AM, Jesper Dangaard Brouer wrote:
> Using pktgen I'm seeing the ixgbe driver "push-back", due to the TX
> ring running full. Thus, the TX ring is artificially limiting pktgen.
>
> Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"
> counters.
>
> Increasing the TX ring buffer should be done carefully, as it comes
> at a higher memory cost, which can also negatively influence
> performance. E.g. the ring buffer array of struct ixgbe_tx_buffer
> (current size 48 bytes) increases from 512*48=24576 bytes to
> 1024*48=49152 bytes, which is larger than the L1 data cache (32KB on
> my E5-2630), thus increasing the L1->L2 cache-references.
>
> Adjusting the TX ring buffer (TXSZ) measured over 10 sec with ifpps
> (single CPU performance, ixgbe 10Gbit/s, E5-2630)
>  * cmd: ethtool -G eth8 tx $TXSZ
>  * 3,930,065 pps -- TXSZ= 512
>  * 5,312,249 pps -- TXSZ= 768
>  * 5,362,722 pps -- TXSZ=1024
>  * 5,361,390 pps -- TXSZ=1536
>  * 5,362,439 pps -- TXSZ=2048
>  * 5,359,744 pps -- TXSZ=4096
>
> I chose size 1024 because 768 is not enough for the next
> optimizations.
>
> Notice after commit 6f25cd47d (pktgen: fix xmit test for BQL enabled
> devices) pktgen uses netif_xmit_frozen_or_drv_stopped() and ignores
> the BQL "stack" pause (QUEUE_STATE_STACK_XOFF) flag. This allows us
> to put more pressure on the TX ring buffers.
>
> It is the ixgbe_maybe_stop_tx() call that stops the transmits, and
> pktgen respects this via the call to
> netif_xmit_frozen_or_drv_stopped(txq).
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> index c688c8a..bf078fe 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> @@ -63,7 +63,7 @@
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>
>  /* TX/RX descriptor defines */
> -#define IXGBE_DEFAULT_TXD		512
> +#define IXGBE_DEFAULT_TXD		1024
>  #define IXGBE_DEFAULT_TX_WORK		256
>  #define IXGBE_MAX_TXD			4096
>  #define IXGBE_MIN_TXD			64
>

What is the point of optimizing ixgbe for a synthetic benchmark? In
my experience the full stack can only handle about 2Mpps at 60B
packets with a single queue. Updating the defaults for a pktgen test
seems unrealistic, as that isn't really a standard use case for the
driver.

I'd say that it might be better to just add a note to the
documentation folder indicating what configuration is optimal for
pktgen rather than changing everyone's defaults to support one
specific test.

Thanks,

Alex
* Re: [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024

From: David Miller @ 2014-05-14 17:49 UTC
To: alexander.h.duyck
Cc: brouer, netdev, jeffrey.t.kirsher, dborkman, fw, shemminger,
    paulmck, robert, greearb, john.r.fastabend, danieltt, zhouzhouyi

From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Wed, 14 May 2014 09:28:50 -0700

> I'd say that it might be better to just add a note to the
> documentation folder indicating what configuration is optimal for
> pktgen rather than changing everyone's defaults to support one
> specific test.

We could have drivers provide a pktgen config adjustment mechanism,
so if someone starts pktgen then the device auto-adjusts to a pktgen
optimal configuration (whatever that may entail).
* Re: [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024

From: Jesper Dangaard Brouer @ 2014-05-14 19:09 UTC
To: David Miller
Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher, dborkman, fw,
    shemminger, paulmck, robert, greearb, john.r.fastabend, danieltt,
    zhouzhouyi, brouer

On Wed, 14 May 2014 13:49:50 -0400 (EDT)
David Miller <davem@davemloft.net> wrote:

> From: Alexander Duyck <alexander.h.duyck@intel.com>
> Date: Wed, 14 May 2014 09:28:50 -0700
>
> > I'd say that it might be better to just add a note to the
> > documentation folder indicating what configuration is optimal for
> > pktgen rather than changing everyone's defaults to support one
> > specific test.
>
> We could have drivers provide a pktgen config adjustment mechanism,
> so if someone starts pktgen then the device auto-adjusts to a pktgen
> optimal configuration (whatever that may entail).

That might be problematic, because changing the TX queue size causes
the ixgbe driver to reset the link.

Notice that pktgen is ignoring BQL. I'm sort of hoping that BQL will
push back for real use-cases, to avoid the bad effects of increasing
the TX size.

One of the bad effects, which I'm hoping BQL will mitigate, is the
case of filling the TX queue with large frames. Consider 9K jumbo
frames: how long will it take to empty 1024 jumbo frames on a 10G
link?

 (9000*8)/(10000*10^6)*1000*1024 = 7.37ms

But with 9K MTU and 512, we already have:

 (9000*8)/(10000*10^6)*1000*512 = 3.69ms

I guess the more normal use-case would be 1500+38 (Ethernet
overhead):

 (1538*8)/(10000*10^6)*1000*1024 = 1.25ms

Then again, these calculations should in theory be multiplied by the
number of TX queues.

I know increasing these limits should not be taken lightly, but we
just have to be crystal clear that the current 512 limit is
artificially limiting the capabilities of your hardware.

We can postpone this increase, because I also observe a 2Mpps limit
when actually using (alloc/free) real SKBs. The alloc/free cost is
currently just hiding this limitation.

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
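The drain-time arithmetic above generalizes to
frames * bytes * 8 / line_rate. A minimal C sketch reproducing the
numbers (illustrative only; tx_drain_ms() is a made-up helper, not
something from pktgen or ixgbe):

#include <stdio.h>

/* Time (ms) to drain 'frames' packets of 'bytes' each from a TX ring
 * at 'gbps' gigabit/s line rate: frames * bytes * 8 / rate. */
static double tx_drain_ms(unsigned int frames, unsigned int bytes,
			  double gbps)
{
	return (double)frames * bytes * 8.0 / (gbps * 1e9) * 1e3;
}

int main(void)
{
	printf("%.2f ms\n", tx_drain_ms(1024, 9000, 10.0)); /* 7.37 ms  */
	printf("%.2f ms\n", tx_drain_ms(512,  9000, 10.0)); /* 3.69 ms  */
	printf("%.2f ms\n", tx_drain_ms(1024, 1538, 10.0)); /* ~1.26 ms */
	return 0;
}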
* Re: [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024

From: David Miller @ 2014-05-14 19:54 UTC
To: brouer
Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher, dborkman, fw,
    shemminger, paulmck, robert, greearb, john.r.fastabend, danieltt,
    zhouzhouyi

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 14 May 2014 21:09:35 +0200

> On Wed, 14 May 2014 13:49:50 -0400 (EDT)
> David Miller <davem@davemloft.net> wrote:
>
> > From: Alexander Duyck <alexander.h.duyck@intel.com>
> > Date: Wed, 14 May 2014 09:28:50 -0700
> >
> > > I'd say that it might be better to just add a note to the
> > > documentation folder indicating what configuration is optimal for
> > > pktgen rather than changing everyone's defaults to support one
> > > specific test.
> >
> > We could have drivers provide a pktgen config adjustment mechanism,
> > so if someone starts pktgen then the device auto-adjusts to a pktgen
> > optimal configuration (whatever that may entail).
>
> That might be problematic, because changing the TX queue size causes
> the ixgbe driver to reset the link.

A minor issue for someone firing up a specialized network test tool
like pktgen.

> Notice that pktgen is ignoring BQL. I'm sort of hoping that BQL will
> push back for real use-cases, to avoid the bad effects of increasing
> the TX size.

I think we need to think carefully about drivers configuring such
huge per-queue TX queue sizes. If anything, we should be decreasing
their size in the default configuration.
* RE: [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024

From: David Laight @ 2014-05-15 9:16 UTC
To: 'Jesper Dangaard Brouer', David Miller
Cc: alexander.h.duyck@intel.com, netdev@vger.kernel.org,
    jeffrey.t.kirsher@intel.com, dborkman@redhat.com, fw@strlen.de,
    shemminger@vyatta.com, paulmck@linux.vnet.ibm.com,
    robert@herjulf.se, greearb@candelatech.com,
    john.r.fastabend@intel.com, danieltt@kth.se, zhouzhouyi@gmail.com

From: Jesper Dangaard Brouer
> On Wed, 14 May 2014 13:49:50 -0400 (EDT)
> David Miller <davem@davemloft.net> wrote:
>
> > From: Alexander Duyck <alexander.h.duyck@intel.com>
> > Date: Wed, 14 May 2014 09:28:50 -0700
> >
> > > I'd say that it might be better to just add a note to the
> > > documentation folder indicating what configuration is optimal for
> > > pktgen rather than changing everyone's defaults to support one
> > > specific test.
> >
> > We could have drivers provide a pktgen config adjustment mechanism,
> > so if someone starts pktgen then the device auto-adjusts to a pktgen
> > optimal configuration (whatever that may entail).
>
> That might be problematic, because changing the TX queue size causes
> the ixgbe driver to reset the link.
>
> Notice that pktgen is ignoring BQL. I'm sort of hoping that BQL will
> push back for real use-cases, to avoid the bad effects of increasing
> the TX size.
>
> One of the bad effects, which I'm hoping BQL will mitigate, is the
> case of filling the TX queue with large frames. Consider 9K jumbo
> frames: how long will it take to empty 1024 jumbo frames on a 10G
> link?
>
>  (9000*8)/(10000*10^6)*1000*1024 = 7.37ms

Never mind 9k 'jumbo' frames, I'm pretty sure ixgbe supports TCP
segmentation offload - so you can have 64k frames in the tx ring.
Since each takes (about) 44 ethernet frames, that is about 55ms to
clear the queue.

	David
* Re: [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024

From: Jesper Dangaard Brouer @ 2014-05-29 15:29 UTC
To: Jesper Dangaard Brouer
Cc: David Miller, alexander.h.duyck, netdev, jeffrey.t.kirsher,
    dborkman, fw, shemminger, paulmck, robert, greearb,
    john.r.fastabend, danieltt, zhouzhouyi, Thomas Graf

On Wed, 14 May 2014 21:09:35 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> > From: Alexander Duyck <alexander.h.duyck@intel.com>
> > Date: Wed, 14 May 2014 09:28:50 -0700
> >
> > > I'd say that it might be better to just add a note to the
> > > documentation folder indicating what configuration is optimal for
> > > pktgen rather than changing everyone's defaults to support one
> > > specific test.
>
> I know increasing these limits should not be taken lightly, but we
> just have to be crystal clear that the current 512 limit is
> artificially limiting the capabilities of your hardware.

The above statement is mine and it is wrong ;-)

I'm dropping this patch because of the following understanding:

Alexander Duyck pointed out to me that interrupt throttling might be
the reason behind the need to increase the TX ring size. I tested
this, and Alex is right.

The TX ring size ("ethtool -g") needed for max performance is
directly correlated with how fast/often the TX cleanup runs.
Adjusting the "ethtool -C <ethx> rx-usecs" value affects how often we
clean up the ring(s). The default value "1" is some automatic
interrupt throttling.

Notice with this coalesce tuning, the performance even increases from
6.7Mpps to 7.1Mpps on top of patchset V1.

On top of V1 patchset:
 - 6,747,016 pps - rx-usecs:  1 tx-ring: 1024 (irqs:  9492)
 - 6,684,612 pps - rx-usecs: 10 tx-ring: 1024 (irqs: 99322)
 - 7,005,226 pps - rx-usecs: 20 tx-ring: 1024 (irqs: 50444)
 - 7,113,048 pps - rx-usecs: 30 tx-ring: 1024 (irqs: 34004)
 - 7,133,019 pps - rx-usecs: 40 tx-ring: 1024 (irqs: 25845)
 - 7,168,399 pps - rx-usecs: 50 tx-ring: 1024 (irqs: 20896)

Notice the same performance with a 512 TX ring.

Lowering TX ring size to (default) 512: (On top of V1 patchset)
 - 3,934,674 pps - rx-usecs:  1 tx-ring: 512 (irqs:  9602)
 - 6,684,066 pps - rx-usecs: 10 tx-ring: 512 (irqs: 99370)
 - 7,001,235 pps - rx-usecs: 20 tx-ring: 512 (irqs: 50567)
 - 7,115,047 pps - rx-usecs: 30 tx-ring: 512 (irqs: 34105)
 - 7,130,250 pps - rx-usecs: 40 tx-ring: 512 (irqs: 25741)
 - 7,165,296 pps - rx-usecs: 50 tx-ring: 512 (irqs: 20898)

Notice how even a 256 TX ring is enough, if we clean up the TX ring
fast enough, and how performance decreases if we clean up too slowly.

Lowering TX ring size to 256: (On top of V1 patchset)
 - 1,883,360 pps - rx-usecs:  1 tx-ring: 256 (irqs:  9800)
 - 6,683,552 pps - rx-usecs: 10 tx-ring: 256 (irqs: 99786)
 - 7,005,004 pps - rx-usecs: 20 tx-ring: 256 (irqs: 50749)
 - 7,108,776 pps - rx-usecs: 30 tx-ring: 256 (irqs: 34536)
 - 5,734,301 pps - rx-usecs: 40 tx-ring: 256 (irqs: 25909)
 - 4,590,093 pps - rx-usecs: 50 tx-ring: 256 (irqs: 21183)

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
* [net-next PATCH 3/5] pktgen: avoid atomic_inc per packet in xmit loop

From: Jesper Dangaard Brouer @ 2014-05-14 14:17 UTC
To: Jesper Dangaard Brouer, netdev
Cc: Alexander Duyck, Jeff Kirsher, Daniel Borkmann, Florian Westphal,
    David S. Miller, Stephen Hemminger, Paul E. McKenney, Robert Olsson,
    Ben Greear, John Fastabend, danieltt, zhouzhouyi

Avoid the expensive atomic refcnt increase in the pktgen xmit loop,
by simply setting the refcnt only when a new SKB gets allocated. Set
it according to how many times we are spinning the same SKB (and
handle the case of clone_skb=0).

Performance data with CLONE_SKB==100000 and TX ring buffer size=1024:
(single CPU performance, ixgbe 10Gbit/s, E5-2630)
 * Before: 5,362,722 pps --> 186.47ns per pkt (1/5362722*10^9)
 * Now:    5,608,781 pps --> 178.29ns per pkt (1/5608781*10^9)
 * Diff:    +246,059 pps -->  -8.18ns

The performance increase converted to nanoseconds (8.18ns)
corresponds well to the measured overhead of LOCK-prefixed assembler
instructions on my E5-2630 CPU, which is measured to be 8.23ns.

Note, with TX ring size 768 I see some "tx_restart_queue" events.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 net/core/pktgen.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 0304f98..7752806 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3327,6 +3327,9 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
 			pkt_dev->clone_count--;	/* back out increment, OOM */
 			return;
 		}
+		/* Avoid atomic inc for every packet before xmit call */
+		atomic_set(&(pkt_dev->skb->users),
+			   max(2, (pkt_dev->clone_skb + 1)));
 		pkt_dev->last_pkt_size = pkt_dev->skb->len;
 		pkt_dev->allocated_skbs++;
 		pkt_dev->clone_count = 0;	/* reset counter */
@@ -3347,7 +3350,6 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
 		pkt_dev->last_ok = 0;
 		goto unlock;
 	}
-	atomic_inc(&(pkt_dev->skb->users));

 	ret = (*xmit)(pkt_dev->skb, odev);

 	switch (ret) {
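The ~8ns LOCK-overhead figure can be reproduced in userspace with a
crude microbenchmark along the lines of the sketch below (a
simplified stand-in for the overhead_cmpxchg.c tool linked from the
V2 cover letter later in this thread; the result is machine dependent
and includes loop overhead):

#include <stdio.h>
#include <time.h>

#define LOOPS 100000000UL

static unsigned long nsecs(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000UL + ts.tv_nsec;
}

int main(void)
{
	static int cnt;
	unsigned long i, t0, t1;

	t0 = nsecs();
	for (i = 0; i < LOOPS; i++)
		__sync_fetch_and_add(&cnt, 1);	/* LOCK-prefixed add on x86 */
	t1 = nsecs();

	printf("%.2f ns per locked add (cnt=%d)\n",
	       (double)(t1 - t0) / LOOPS, cnt);
	return 0;
}

On older glibc this may need linking with -lrt for clock_gettime().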
* Re: [net-next PATCH 3/5] pktgen: avoid atomic_inc per packet in xmit loop

From: Eric Dumazet @ 2014-05-14 14:35 UTC
To: Jesper Dangaard Brouer
Cc: netdev, Alexander Duyck, Jeff Kirsher, Daniel Borkmann,
    Florian Westphal, David S. Miller, Stephen Hemminger,
    Paul E. McKenney, Robert Olsson, Ben Greear, John Fastabend,
    danieltt, zhouzhouyi

On Wed, 2014-05-14 at 16:17 +0200, Jesper Dangaard Brouer wrote:
> Avoid the expensive atomic refcnt increase in the pktgen xmit loop,
> by simply setting the refcnt only when a new SKB gets allocated. Set
> it according to how many times we are spinning the same SKB (and
> handle the case of clone_skb=0).
>
> Performance data with CLONE_SKB==100000 and TX ring buffer size=1024:
> (single CPU performance, ixgbe 10Gbit/s, E5-2630)
>  * Before: 5,362,722 pps --> 186.47ns per pkt (1/5362722*10^9)
>  * Now:    5,608,781 pps --> 178.29ns per pkt (1/5608781*10^9)
>  * Diff:    +246,059 pps -->  -8.18ns
>
> The performance increase converted to nanoseconds (8.18ns)
> corresponds well to the measured overhead of LOCK-prefixed assembler
> instructions on my E5-2630 CPU, which is measured to be 8.23ns.
>
> Note, with TX ring size 768 I see some "tx_restart_queue" events.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---

OK, but then you need to properly undo the refcnt at the end of the
pktgen loop, otherwise you leak one skb?

Say you run pktgen for 995 sends, and clone_count is 10.

If done properly, you probably can avoid the kfree_skb(pkt_dev->skb)
in pktgen_xmit(), and set initial skb->users to
max(1, pkt_dev->clone_skb);
* Re: [net-next PATCH 3/5] pktgen: avoid atomic_inc per packet in xmit loop

From: Jesper Dangaard Brouer @ 2014-05-14 15:13 UTC
To: Eric Dumazet
Cc: brouer, netdev, Alexander Duyck, Jeff Kirsher, Daniel Borkmann,
    Florian Westphal, David S. Miller, Stephen Hemminger,
    Paul E. McKenney, Robert Olsson, Ben Greear, John Fastabend,
    danieltt, zhouzhouyi

On Wed, 14 May 2014 07:35:31 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2014-05-14 at 16:17 +0200, Jesper Dangaard Brouer wrote:
> > Avoid the expensive atomic refcnt increase in the pktgen xmit loop,
> > by simply setting the refcnt only when a new SKB gets allocated.
> > Set it according to how many times we are spinning the same SKB
> > (and handle the case of clone_skb=0).
> >
> > Performance data with CLONE_SKB==100000 and TX ring buffer size=1024:
> > (single CPU performance, ixgbe 10Gbit/s, E5-2630)
> >  * Before: 5,362,722 pps --> 186.47ns per pkt (1/5362722*10^9)
> >  * Now:    5,608,781 pps --> 178.29ns per pkt (1/5608781*10^9)
> >  * Diff:    +246,059 pps -->  -8.18ns
> >
> > The performance increase converted to nanoseconds (8.18ns)
> > corresponds well to the measured overhead of LOCK-prefixed
> > assembler instructions on my E5-2630 CPU, which is measured to be
> > 8.23ns.
> >
> > Note, with TX ring size 768 I see some "tx_restart_queue" events.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
>
> OK, but then you need to properly undo the refcnt at the end of the
> pktgen loop, otherwise you leak one skb?
>
> Say you run pktgen for 995 sends, and clone_count is 10.

Ah, I see: if pktgen gets stopped, I might leak an SKB.

> If done properly, you probably can avoid the kfree_skb(pkt_dev->skb)
> in pktgen_xmit(), and set initial skb->users to
> max(1, pkt_dev->clone_skb);

Just setting max(1, pkt_dev->clone_skb) killed my machine (and my
netconsole didn't send the entire output before dying).
kernel:[91899.988338] skbuff: skb_under_panic: text:ffffffffa03a0b5f len:56 put:14 head:ffff88081bdeb400 data:ffff88081bdeb3f4 tail:0x2c end:0xc0 dev:eth8
kernel:[91899.988505] ------------[ cut here ]------------
kernel:[91899.988643] invalid opcode: 0000 [#1] SMP
kernel:[91899.988711] Modules linked in: pktgen netconsole ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle bridge stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
kernel:[91899.988887] general protection fault: 0000 [#2] SMP
kernel:[91899.988888] Modules linked in: pktgen netconsole ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle bridge stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc vhost_net macvtap macvlan vhost tun kvm_intel kvm microcode pcspkr ipmi_si ipmi_msghandler hid_generic i2c_i801 sg ixgbe mlx4_en mdio vxlan mlx4_core igb i2c_algo_bit i2c_core ptp pps_core sd_mod crc_t10dif crct10dif_common ahci libahci libata dm_mirror dm_region_hash dm_log dm_mod [last unloaded: pktgen]
kernel:[91899.988912] CPU: 0 PID: 21807 Comm: kpktgend_0 Tainted: G W 3.15.0-rc1-dpaccel01-pktgen+ #469
kernel:[91899.988913] Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
kernel:[91899.988914] task: ffff88081e1b6480 ti: ffff880807374000 task.ti: ffff880807374000
kernel:[91899.988915] RIP: 0010:[<ffffffff8119ecdc>] [<ffffffff8119ecdc>] __kmalloc_node_track_caller+0xdc/0x240
kernel:[91899.988920] RSP: 0018:ffff880807375798 EFLAGS: 00010046
kernel:[91899.988921] RAX: 0000000000000000 RBX: 0000000000000020 RCX: dc0a9af1def08800
kernel:[91899.988922] RDX: 0000000000046786 RSI: 0000000000000000 RDI: 0000000000016340
kernel:[91899.988922] RBP: ffff8808073757f8 R08: ffff88085fc16340 R09: ffffffff815a8367
kernel:[91899.988923] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000010220
kernel:[91899.988924] R13: ffff88085f803600 R14: 0000000000000180 R15: 00000000ffffffff
kernel:[91899.988925] FS: 0000000000000000(0000) GS:ffff88085fc00000(0000) knlGS:0000000000000000
kernel:[91899.988925] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel:[91899.988926] CR2: 00007f311af84000 CR3: 00000000019f0000 CR4: 00000000000407f0
kernel:[91899.988926] Stack:
kernel:[91899.988927]  00000020ffffffff ffff88081a78e800 ffff8808594c6800 ffff8808594c6000
kernel:[91899.988928]  ffffffff815a839b dc0a9af1def08800 ffff8808073757e8 0000000000000020
kernel:[91899.988930]  00000000ffffffff 0000000000000180 ffff880807375867 0000000000000180
kernel:[91899.988931] Call Trace:
kernel:[91899.988932]  [<ffffffff815a839b>] ? __alloc_skb+0x8b/0x1e0
kernel:[91899.988936]  [<ffffffff815a805b>] __kmalloc_reserve+0x3b/0xa0
kernel:[91899.988938]  [<ffffffff815a839b>] __alloc_skb+0x8b/0x1e0
kernel:[91899.988940]  [<ffffffff815d508c>] netpoll_send_udp+0x8c/0x400
kernel:[91899.988943]  [<ffffffffa03962c3>] write_msg+0xd3/0x130 [netconsole]
kernel:[91899.988946]  [<ffffffff810b5e75>] call_console_drivers.clone.1+0xa5/0x110
kernel:[91899.988950]  [<ffffffff810b5fc7>] console_cont_flush.clone.0+0xe7/0x1a0
kernel:[91899.988952]  [<ffffffff810b60b2>] console_unlock+0x32/0x2a0
kernel:[91899.988953]  [<ffffffff810b6805>] vprintk_emit+0x325/0x510
kernel:[91899.988954]  [<ffffffff810b577c>] ? wake_up_klogd+0x3c/0x50
kernel:[91899.988956]  [<ffffffff816be2a1>] printk+0x77/0x79
kernel:[91899.988959]  [<ffffffff816c6e95>] ? notifier_call_chain+0x55/0x80
kernel:[91899.988962]  [<ffffffff810d4547>] print_modules+0xb7/0xd0
kernel:[91899.988965]  [<ffffffff816c6f3e>] ? notify_die+0x2e/0x30
kernel:[91899.988966]  [<ffffffff816c3f7b>] __die+0xab/0x100
kernel:[91899.988967]  [<ffffffff81005e78>] die+0x48/0x90
kernel:[91899.988970]  [<ffffffff816c3b9b>] do_trap+0xcb/0x170
kernel:[91899.988971]  [<ffffffff816c6eda>] ? __atomic_notifier_call_chain+0x1a/0x30
kernel:[91899.988972]  [<ffffffff81003615>] do_invalid_op+0x95/0xb0
kernel:[91899.988974]  [<ffffffff815a7f45>] ? skb_push+0x85/0x90
kernel:[91899.988976]  [<ffffffff810b66c8>] ? vprintk_emit+0x1e8/0x510
kernel:[91899.988977]  [<ffffffff816cd378>] invalid_op+0x18/0x20
kernel:[91899.988979]  [<ffffffff815a7f45>] ? skb_push+0x85/0x90
kernel:[91899.988981]  [<ffffffff815a7f45>] ? skb_push+0x85/0x90
kernel:[91899.988982]  [<ffffffffa03a0b5f>] fill_packet_ipv4+0xaf/0x470 [pktgen]
kernel:[91899.988985]  [<ffffffff815a8eff>] ? __kfree_skb+0x3f/0xc0
kernel:[91899.988987]  [<ffffffffa0167ed0>] ? ixgbe_xmit_frame_ring+0x610/0x610 [ixgbe]
kernel:[91899.988993]  [<ffffffffa03a1438>] pktgen_xmit+0x98/0x370 [pktgen]
kernel:[91899.988995]  [<ffffffffa03a1540>] ? pktgen_xmit+0x1a0/0x370 [pktgen]
kernel:[91899.988997]  [<ffffffffa03a1887>] pktgen_thread_worker+0x177/0x500 [pktgen]
kernel:[91899.988999]  [<ffffffff810af2b0>] ? bit_waitqueue+0xd0/0xd0
kernel:[91899.989001]  [<ffffffff810af2b0>] ? bit_waitqueue+0xd0/0xd0
kernel:[91899.989002]  [<ffffffffa03a1710>] ? pktgen_xmit+0x370/0x370 [pktgen]
kernel:[91899.989004]  [<ffffffff8108e7ae>] kthread+0xce/0xf0
kernel:[91899.989006]  [<ffffffff8108e6e0>] ? kthread_worker_fn+0x120/0x120
kernel:[91899.989008]  [<ffffffff816cbd6c>] ret_from_fork+0x7c/0xb0
kernel:[91899.989011]  [<ffffffff8108e6e0>] ? kthread_worker_fn+0x120/0x120
kernel:[91899.989012] Code: 48 8b 4d c0 44 89 fa 44 89 e6 4c 89 ef e8 4d c5 ff ff 48 89 45 c8 eb 37 0f 1f 80 00 00 00 00 49 63 45 20 48 8b 4d c8 49 8b 7d 00 <48> 8b 1c 01 48 8d 4a 01 48 8b 45 c8 65 48 0f c7 0f 0f 94 c0 3c
kernel:[91899.989030] RSP <ffff880807375798>

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
* Re: [net-next PATCH 3/5] pktgen: avoid atomic_inc per packet in xmit loop

From: Eric Dumazet @ 2014-05-14 15:35 UTC
To: Jesper Dangaard Brouer
Cc: netdev, Alexander Duyck, Jeff Kirsher, Daniel Borkmann,
    Florian Westphal, David S. Miller, Stephen Hemminger,
    Paul E. McKenney, Robert Olsson, Ben Greear, John Fastabend,
    danieltt, zhouzhouyi

On Wed, 2014-05-14 at 17:13 +0200, Jesper Dangaard Brouer wrote:

> Ah, I see: if pktgen gets stopped, I might leak an SKB.
>
> > If done properly, you probably can avoid the kfree_skb(pkt_dev->skb)
> > in pktgen_xmit(), and set initial skb->users to
> > max(1, pkt_dev->clone_skb);
>
> Just setting max(1, pkt_dev->clone_skb) killed my machine (and my
> netconsole didn't send the entire output before dying).

Well, of course, I did not provide a patch, only a hint ;)

If skb->users is set to the _exact_ number of
kfree_skb()/consume_skb() calls done by the driver, then no
kfree_skb() is needed in pktgen.

The thing is: clone_skb == 0 basically means that you need to init
skb->users to the number of times you plan to send this packet.

This depends on pkt_dev->count - pkt_dev->sofar and other things
probably.

min(dev->clone_skb, pkt_dev->count - pkt_dev->sofar)

But don't 'try' that without making sure it's the right thing !!!

Make sure to replace atomic_dec(&(pkt_dev->skb->users)) by
kfree_skb() in case of NETDEV_TX_LOCKED/NETDEV_TX_BUSY
* [net-next PATCH 4/5] pktgen: avoid expensive set_current_state() call in loop

From: Jesper Dangaard Brouer @ 2014-05-14 14:17 UTC
To: Jesper Dangaard Brouer, netdev
Cc: Alexander Duyck, Jeff Kirsher, Daniel Borkmann, Florian Westphal,
    David S. Miller, Stephen Hemminger, Paul E. McKenney, Robert Olsson,
    Ben Greear, John Fastabend, danieltt, zhouzhouyi

I request review, as I'm uncertain about this change; I don't know
the set_current_state() API very well.

The set_current_state(TASK_INTERRUPTIBLE) uses an xchg, which is
implicitly LOCK prefixed.

Avoid calling set_current_state() inside the busy-loop in
pktgen_thread_worker(). In case of pkt_dev->delay, it is still used
in pktgen_xmit() via the spin() call.

Performance data with CLONE_SKB==100000 and TX ring buffer size=1024:
(single CPU performance, ixgbe 10Gbit/s, E5-2630)
 * Prev: 5608781 pps --> 178.29ns (1/5608781*10^9)
 * Now:  5857065 pps --> 170.73ns (1/5857065*10^9)
 * Diff: +248284 pps -->  -7.56ns

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 net/core/pktgen.c |    9 +++------
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 7752806..cae7e0c 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3409,10 +3409,10 @@ static int pktgen_thread_worker(void *arg)
 	pr_debug("starting pktgen/%d:  pid=%d\n", cpu, task_pid_nr(current));

-	set_current_state(TASK_INTERRUPTIBLE);
-
 	set_freezable();

+	__set_current_state(TASK_RUNNING);
+
 	while (!kthread_should_stop()) {
 		pkt_dev = next_to_run(t);

@@ -3426,8 +3426,6 @@ static int pktgen_thread_worker(void *arg)
 			continue;
 		}

-		__set_current_state(TASK_RUNNING);
-
 		if (likely(pkt_dev)) {
 			pktgen_xmit(pkt_dev);

@@ -3458,9 +3456,8 @@ static int pktgen_thread_worker(void *arg)
 		}

 		try_to_freeze();
-
-		set_current_state(TASK_INTERRUPTIBLE);
 	}
+	set_current_state(TASK_INTERRUPTIBLE);

 	pr_debug("%s stopping all device\n", t->tsk->comm);
 	pktgen_stop(t);
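For context, the canonical kthread idle pattern that pktgen's loop
derives from looks roughly like the sketch below (kernel-context
pseudocode, not pktgen's exact code; work_pending() and do_work() are
hypothetical stand-ins for the next_to_run() check and pktgen_xmit()).
set_current_state() must imply a full memory barrier (an implicitly
LOCK-prefixed xchg on x86) to avoid missed wakeups, which is exactly
the per-iteration cost this patch moves out of the busy loop:

	while (!kthread_should_stop()) {
		set_current_state(TASK_INTERRUPTIBLE); /* full barrier (xchg) */
		if (!work_pending()) {
			schedule();		/* sleep until woken */
			continue;
		}
		__set_current_state(TASK_RUNNING); /* plain store, no LOCK */
		do_work();
	}

The patch is safe as long as pktgen never actually sleeps in the busy
path, since a task that calls schedule() while TASK_RUNNING simply
stays runnable.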
* [net-next PATCH 5/5] pktgen: RCU'ify "if_list" to remove lock in next_to_run()

From: Jesper Dangaard Brouer @ 2014-05-14 14:18 UTC
To: Jesper Dangaard Brouer, netdev
Cc: Alexander Duyck, Jeff Kirsher, Daniel Borkmann, Florian Westphal,
    David S. Miller, Stephen Hemminger, Paul E. McKenney, Robert Olsson,
    Ben Greear, John Fastabend, danieltt, zhouzhouyi

The if_lock()/if_unlock() in next_to_run() adds a significant
overhead, because it is called for every packet in the busy loop of
pktgen_thread_worker().

Removing these two "LOCK" operations should in theory save us approx
16ns (8ns x 2). As illustrated below, we save 15ns when just removing
the locks, and 13ns when introducing RCU protection.

Performance data with CLONE_SKB==100000 and TX ring buffer size=1024:
(single CPU performance, ixgbe 10Gbit/s, E5-2630)
 * Prev    : 5857065 pps --> 170.73ns (1/5857065*10^9)
 * No-lock : 6424246 pps --> 155.66ns (1/6424246*10^9)
 * Diff    : +567181 pps --> -15.07ns (1/6424246*10^9)-(1/5857065*10^9)
 * RCU-fix : 6346500 pps --> 157.56ns
 * RCU-cost:  -77746 pps -->  +1.90ns
 * Diff    : +489435 pps --> -13.17ns (1/6346500*10^9)-(1/5857065*10^9)

To understand this RCU patch, I describe the pktgen thread model
below.

In pktgen there are several kernel threads, but there is only one CPU
running each kernel thread. Communication with the kernel threads is
done through some thread control flags. This allows the threads to
change data structures at a known synchronization point, see main
thread func pktgen_thread_worker().

Userspace changes are communicated through proc-file writes. There
are three types of changes: general control changes "pgctrl"
(func:pgctrl_write), thread changes "kpktgend_X"
(func:pktgen_thread_write), and interface config changes "ethX@N"
(func:pktgen_if_write).

Userspace "pgctrl" and "thread" changes are synchronized via the
mutex pktgen_thread_lock, thus only a single userspace instance can
run. The mutex is taken while the packet generator is running, by
pgctrl "start". Thus e.g. "add_device" cannot be invoked when pktgen
is running/started.

All "pgctrl" and all "thread" changes, except thread "add_device",
communicate via the thread control flags. The main problem is the
exception "add_device", which modifies the thread's "if_list"
directly.

Fortunately "add_device" cannot be invoked while pktgen is running.
But there exists a race between "rem_device_all" and "add_device"
(which normally doesn't occur, because "rem_device_all" waits 125ms
before returning). Background'ing "rem_device_all" and running
"add_device" immediately allows the race to occur.

The race affects the thread's (list of devices) "if_list". The
if_lock is used for protecting this "if_list". Other readers are
given lock-free access to the list under RCU read sections.

Note, interface config changes (via proc) can occur while pktgen is
running, which worries me a bit. I'm assuming proc_remove() takes
appropriate locks, to assure no writers exist after proc_remove()
finishes.
I've been running a script exercising the race condition (leading me
to fix the proc_remove order), without any issues. The script also
exercises concurrent proc writes while the interface config is
getting removed.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 net/core/pktgen.c |  102 ++++++++++++++++++++++++++++-------------------------
 1 files changed, 53 insertions(+), 49 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index cae7e0c..915b162 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -69,8 +69,9 @@
  * for running devices in the if_list and sends packets until count is 0 it
  * also the thread checks the thread->control which is used for inter-process
  * communication. controlling process "posts" operations to the threads this
- * way. The if_lock should be possible to remove when add/rem_device is merged
- * into this too.
+ * way.
+ * The if_list is RCU protected, and the if_lock remains to protect updating
+ * of if_list, from "add_device" as it is invoked from userspace (via proc write).
  *
  * By design there should only be *one* "controlling" process. In practice
  * multiple write accesses gives unpredictable result. Understood by "write"
@@ -208,7 +209,7 @@
 #define T_REMDEVALL   (1<<2)	/* Remove all devs */
 #define T_REMDEV      (1<<3)	/* Remove one dev */

-/* If lock -- can be removed after some work */
+/* If lock -- protects updating of if_list */
 #define   if_lock(t)           spin_lock(&(t->if_lock));
 #define   if_unlock(t)           spin_unlock(&(t->if_lock));

@@ -241,6 +242,7 @@ struct pktgen_dev {
 	struct proc_dir_entry *entry;	/* proc file */
 	struct pktgen_thread *pg_thread;/* the owner */
 	struct list_head list;		/* chaining in the thread's run-queue */
+	struct rcu_head rcu;		/* freed by RCU */

 	int running;		/* if false, the test will stop */

@@ -1737,14 +1739,15 @@ static int pktgen_thread_show(struct seq_file *seq, void *v)

 	seq_printf(seq, "Running: ");

-	if_lock(t);
-	list_for_each_entry(pkt_dev, &t->if_list, list)
+	/* Should we use if_lock here? */
+	rcu_read_lock();
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list)
 		if (pkt_dev->running)
 			seq_printf(seq, "%s ", pkt_dev->odevname);

 	seq_printf(seq, "\nStopped: ");

-	list_for_each_entry(pkt_dev, &t->if_list, list)
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list)
 		if (!pkt_dev->running)
 			seq_printf(seq, "%s ", pkt_dev->odevname);

@@ -1753,7 +1756,7 @@ static int pktgen_thread_show(struct seq_file *seq, void *v)
 	else
 		seq_printf(seq, "\nResult: NA\n");

-	if_unlock(t);
+	rcu_read_unlock();

 	return 0;
 }
@@ -1878,10 +1881,8 @@ static struct pktgen_dev *__pktgen_NN_threads(const struct pktgen_net *pn,
 		pkt_dev = pktgen_find_dev(t, ifname, exact);
 		if (pkt_dev) {
 			if (remove) {
-				if_lock(t);
 				pkt_dev->removal_mark = 1;
 				t->control |= T_REMDEV;
-				if_unlock(t);
 			}
 			break;
 		}
@@ -1931,7 +1932,8 @@ static void pktgen_change_name(const struct pktgen_net *pn, struct net_device *d
 	list_for_each_entry(t, &pn->pktgen_threads, th_list) {
 		struct pktgen_dev *pkt_dev;

-		list_for_each_entry(pkt_dev, &t->if_list, list) {
+		rcu_read_lock();
+		list_for_each_entry_rcu(pkt_dev, &t->if_list, list) {
 			if (pkt_dev->odev != dev)
 				continue;

@@ -1946,6 +1948,7 @@ static void pktgen_change_name(const struct pktgen_net *pn, struct net_device *d
 					     dev->name);
 			break;
 		}
+		rcu_read_unlock();
 	}
 }

@@ -2997,8 +3000,8 @@ static void pktgen_run(struct pktgen_thread *t)

 	func_enter();

-	if_lock(t);
-	list_for_each_entry(pkt_dev, &t->if_list, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list) {

 		/*
 		 * setup odev and create initial packet.
@@ -3007,18 +3010,18 @@ static void pktgen_run(struct pktgen_thread *t)

 		if (pkt_dev->odev) {
 			pktgen_clear_counters(pkt_dev);
-			pkt_dev->running = 1;	/* Cranke yeself! */
 			pkt_dev->skb = NULL;
 			pkt_dev->started_at = pkt_dev->next_tx = ktime_get();

 			set_pkt_overhead(pkt_dev);

 			strcpy(pkt_dev->result, "Starting");
+			pkt_dev->running = 1;	/* Cranke yeself! */
 			started++;
 		} else
 			strcpy(pkt_dev->result, "Error starting");
 	}
-	if_unlock(t);
+	rcu_read_unlock();
 	if (started)
 		t->control &= ~(T_STOP);
 }
@@ -3041,27 +3044,25 @@ static int thread_is_running(const struct pktgen_thread *t)
 {
 	const struct pktgen_dev *pkt_dev;

-	list_for_each_entry(pkt_dev, &t->if_list, list)
-		if (pkt_dev->running)
+	rcu_read_lock();
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list)
+		if (pkt_dev->running) {
+			rcu_read_unlock();
 			return 1;
+		}
+	rcu_read_unlock();
 	return 0;
 }

 static int pktgen_wait_thread_run(struct pktgen_thread *t)
 {
-	if_lock(t);
-
 	while (thread_is_running(t)) {
-
-		if_unlock(t);
-
 		msleep_interruptible(100);

 		if (signal_pending(current))
 			goto signal;
-
-		if_lock(t);
 	}
-	if_unlock(t);
 	return 1;
 signal:
 	return 0;
@@ -3166,10 +3167,10 @@ static int pktgen_stop_device(struct pktgen_dev *pkt_dev)
 		return -EINVAL;
 	}

+	pkt_dev->running = 0;
 	kfree_skb(pkt_dev->skb);
 	pkt_dev->skb = NULL;
 	pkt_dev->stopped_at = ktime_get();
-	pkt_dev->running = 0;

 	show_results(pkt_dev, nr_frags);

@@ -3180,9 +3181,8 @@ static struct pktgen_dev *next_to_run(struct pktgen_thread *t)
 {
 	struct pktgen_dev *pkt_dev, *best = NULL;

-	if_lock(t);
-
-	list_for_each_entry(pkt_dev, &t->if_list, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list) {
 		if (!pkt_dev->running)
 			continue;
 		if (best == NULL)
@@ -3190,7 +3190,8 @@ static struct pktgen_dev *next_to_run(struct pktgen_thread *t)
 		else if (ktime_compare(pkt_dev->next_tx, best->next_tx) < 0)
 			best = pkt_dev;
 	}
-	if_unlock(t);
+	rcu_read_unlock();
+
 	return best;
 }

@@ -3200,13 +3201,13 @@ static void pktgen_stop(struct pktgen_thread *t)

 	func_enter();

-	if_lock(t);
+	rcu_read_lock();

-	list_for_each_entry(pkt_dev, &t->if_list, list) {
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list) {
 		pktgen_stop_device(pkt_dev);
 	}

-	if_unlock(t);
+	rcu_read_unlock();
 }

 /*
@@ -3220,8 +3221,6 @@ static void pktgen_rem_one_if(struct pktgen_thread *t)

 	func_enter();

-	if_lock(t);
-
 	list_for_each_safe(q, n, &t->if_list) {
 		cur = list_entry(q, struct pktgen_dev, list);

@@ -3235,8 +3234,6 @@ static void pktgen_rem_one_if(struct pktgen_thread *t)

 		break;
 	}
-
-	if_unlock(t);
 }

 static void pktgen_rem_all_ifs(struct pktgen_thread *t)
@@ -3248,8 +3245,6 @@ static void pktgen_rem_all_ifs(struct pktgen_thread *t)

 	/* Remove all devices, free mem */

-	if_lock(t);
-
 	list_for_each_safe(q, n, &t->if_list) {
 		cur = list_entry(q, struct pktgen_dev, list);

@@ -3258,8 +3253,6 @@ static void pktgen_rem_all_ifs(struct pktgen_thread *t)

 		pktgen_remove_device(t, cur);
 	}
-
-	if_unlock(t);
 }

 static void pktgen_rem_thread(struct pktgen_thread *t)
@@ -3484,8 +3477,8 @@ static struct pktgen_dev *pktgen_find_dev(struct pktgen_thread *t,
 	struct pktgen_dev *p, *pkt_dev = NULL;
 	size_t len = strlen(ifname);

-	if_lock(t);
-	list_for_each_entry(p, &t->if_list, list)
+	rcu_read_lock();
+	list_for_each_entry_rcu(p, &t->if_list, list)
 		if (strncmp(p->odevname, ifname, len) == 0) {
 			if (p->odevname[len]) {
 				if (exact || p->odevname[len] != '@')
@@ -3495,7 +3488,7 @@ static struct pktgen_dev *pktgen_find_dev(struct pktgen_thread *t,
 			break;
 		}

-	if_unlock(t);
+	rcu_read_unlock();
 	pr_debug("find_dev(%s) returning %p\n", ifname, pkt_dev);
 	return pkt_dev;
 }

@@ -3509,6 +3502,12 @@ static int add_dev_to_thread(struct pktgen_thread *t,
 {
 	int rv = 0;

+	/* This function cannot be called concurrently, as it's called
+	 * under the pktgen_thread_lock mutex, but it can run from
+	 * userspace on another CPU than the kthread. The if_lock()
+	 * is used here to sync with concurrent instances of
+	 * _rem_dev_from_if_list() invoked via kthread, which is also
+	 * updating the if_list */
 	if_lock(t);

 	if (pkt_dev->pg_thread) {
@@ -3517,9 +3516,9 @@ static int add_dev_to_thread(struct pktgen_thread *t,
 		goto out;
 	}

-	list_add(&pkt_dev->list, &t->if_list);
-	pkt_dev->pg_thread = t;
 	pkt_dev->running = 0;
+	pkt_dev->pg_thread = t;
+	list_add_rcu(&pkt_dev->list, &t->if_list);

 out:
 	if_unlock(t);
@@ -3674,11 +3673,13 @@ static void _rem_dev_from_if_list(struct pktgen_thread *t,
 	struct list_head *q, *n;
 	struct pktgen_dev *p;

+	if_lock(t);
 	list_for_each_safe(q, n, &t->if_list) {
 		p = list_entry(q, struct pktgen_dev, list);
 		if (p == pkt_dev)
-			list_del(&p->list);
+			list_del_rcu(&p->list);
 	}
+	if_unlock(t);
 }

 static int pktgen_remove_device(struct pktgen_thread *t,
@@ -3698,20 +3699,22 @@ static int pktgen_remove_device(struct pktgen_thread *t,
 		pkt_dev->odev = NULL;
 	}

-	/* And update the thread if_list */
-
-	_rem_dev_from_if_list(t, pkt_dev);
-
+	/* Remove proc entry before if_list entry, because add_device
+	 * uses the list to determine if the interface already exists;
+	 * avoid racing with proc_create_data() */
 	if (pkt_dev->entry)
 		proc_remove(pkt_dev->entry);

+	/* And update the thread if_list */
+	_rem_dev_from_if_list(t, pkt_dev);
+
 #ifdef CONFIG_XFRM
 	free_SAs(pkt_dev);
 #endif
 	vfree(pkt_dev->flows);
 	if (pkt_dev->page)
 		put_page(pkt_dev->page);
-	kfree(pkt_dev);
+	kfree_rcu(pkt_dev, rcu);

 	return 0;
 }
@@ -3811,6 +3814,7 @@ static void __exit pg_cleanup(void)
 {
 	unregister_netdevice_notifier(&pktgen_notifier_block);
 	unregister_pernet_subsys(&pg_net_ops);
+	/* Don't need rcu_barrier() due to use of kfree_rcu() */
 }

 module_init(pg_init);
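Stepping back, the locking recipe this patch follows is the standard
RCU-protected list pattern: lock-free readers, writers serialized by
a small lock, and freeing deferred past a grace period. A condensed
kernel-style sketch of the pattern (illustrative only; struct item is
a made-up stand-in for pktgen_dev):

struct item {
	struct list_head list;
	struct rcu_head rcu;
	int running;
};

static LIST_HEAD(item_list);
static DEFINE_SPINLOCK(item_lock);	/* serializes writers only */

/* Reader (hot path): no locks, no atomic ops */
static struct item *find_running(void)
{
	struct item *p, *found = NULL;

	rcu_read_lock();
	list_for_each_entry_rcu(p, &item_list, list) {
		if (p->running) {
			found = p;
			break;
		}
	}
	rcu_read_unlock();
	/* Using 'found' after rcu_read_unlock() is only safe because,
	 * as in pktgen's thread model, the single kthread that uses
	 * the pointer is also the only remover. */
	return found;
}

/* Writer: unpublish under the lock, free after a grace period */
static void remove_item(struct item *p)
{
	spin_lock(&item_lock);
	list_del_rcu(&p->list);
	spin_unlock(&item_lock);
	kfree_rcu(p, rcu);	/* waits for all RCU readers to finish */
}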
* [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance

From: Jesper Dangaard Brouer @ 2014-06-26 11:16 UTC
To: Jesper Dangaard Brouer, David S. Miller, netdev
Cc: Daniel Borkmann, Florian Westphal, Hannes Frederic Sowa,
    Thomas Graf, Eric Dumazet, zoltan.kiss, Alexander Duyck, Jeff Kirsher

This series focuses on optimizing "pktgen" for single CPU
performance.

V2-series:
 - Removed some patches
 - Documented the real reason for the TX ring buffer filling up

NIC tuning for pktgen:
 http://netoptimizer.blogspot.dk/2014/06/pktgen-for-network-overload-testing.html

General overload setup according to:
 http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html

Hardware:
 System: CPU E5-2630
 NIC: Intel ixgbe/82599 chip

Testing done with net-next git tree on top of commit 6623b41944
("Merge branch 'master' of...jkirsher/net-next")

Pktgen script exercising race condition:
 https://github.com/netoptimizer/network-testing/blob/master/pktgen/unit_test01_race_add_rem_device_loop.sh

Tool for measuring LOCK overhead:
 https://github.com/netoptimizer/network-testing/blob/master/src/overhead_cmpxchg.c

---

Jesper Dangaard Brouer (3):
      pktgen: RCU-ify "if_list" to remove lock in next_to_run()
      pktgen: avoid expensive set_current_state() call in loop
      pktgen: document tuning for max NIC performance

 Documentation/networking/pktgen.txt |   28 +++++++++
 net/core/pktgen.c                   |  110 ++++++++++++++++++-----------------
 2 files changed, 83 insertions(+), 55 deletions(-)

--
* [net-next PATCH V2 1/3] pktgen: document tuning for max NIC performance

From: Jesper Dangaard Brouer @ 2014-06-26 11:16 UTC
To: Jesper Dangaard Brouer, David S. Miller, netdev
Cc: Daniel Borkmann, Florian Westphal, Hannes Frederic Sowa,
    Thomas Graf, Eric Dumazet, zoltan.kiss, Alexander Duyck, Jeff Kirsher

Using pktgen I'm seeing the ixgbe driver "push-back", due to the TX
ring running full. Thus, the TX ring is artificially limiting pktgen.
(Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"
counters.)

Using ixgbe, the real reason behind the TX ring running full is that
the TX ring is not being cleaned up fast enough. The ixgbe driver
combines TX+RX ring cleanups, and the cleanup interval is affected by
the ethtool --coalesce setting of parameter "rx-usecs".

Do not increase the default NIC TX ring buffer or the default cleanup
interval. Instead simply document that pktgen needs special NIC
tuning for maximum packet per sec performance.

Performance results with pktgen with clone_skb=100000.
TX ring size 512 (default), adjusting "rx-usecs":
(Single CPU performance, E5-2630, ixgbe)
 - 3935002 pps - rx-usecs:  1 (irqs:  9346)
 - 5132350 pps - rx-usecs: 10 (irqs: 99157)
 - 5375111 pps - rx-usecs: 20 (irqs: 50154)
 - 5454050 pps - rx-usecs: 30 (irqs: 33872)
 - 5496320 pps - rx-usecs: 40 (irqs: 26197)
 - 5502510 pps - rx-usecs: 50 (irqs: 21527)

TX ring size adjusting (ethtool -G), "rx-usecs==1" (default):
 - 3935002 pps - tx-size:  512
 - 5354401 pps - tx-size:  768
 - 5356847 pps - tx-size: 1024
 - 5327595 pps - tx-size: 1536
 - 5356779 pps - tx-size: 2048
 - 5353438 pps - tx-size: 4096

Notice after commit 6f25cd47d (pktgen: fix xmit test for BQL enabled
devices) pktgen uses netif_xmit_frozen_or_drv_stopped() and ignores
the BQL "stack" pause (QUEUE_STATE_STACK_XOFF) flag. This allows us
to put more pressure on the TX ring buffers.

It is the ixgbe_maybe_stop_tx() call that stops the transmits, and
pktgen respects this via the call to
netif_xmit_frozen_or_drv_stopped(txq).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 Documentation/networking/pktgen.txt |   28 ++++++++++++++++++++++++++++
 1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt
index 0e30c78..0dffc6e 100644
--- a/Documentation/networking/pktgen.txt
+++ b/Documentation/networking/pktgen.txt
@@ -24,6 +24,34 @@ For monitoring and control pktgen creates:
         /proc/net/pktgen/ethX


+Tuning NIC for max performance
+==============================
+
+The default NIC settings are (likely) not tuned for pktgen's
+artificial overload type of benchmarking, as this could hurt the
+normal use-case.
+
+Specifically increasing the TX ring buffer in the NIC:
+ # ethtool -G ethX tx 1024
+
+A larger TX ring can improve pktgen's performance, while it can hurt
+in the general case, 1) because the TX ring buffer might get larger
+than the CPU's L1/L2 cache, 2) because it allows more queueing in the
+NIC HW layer (which is bad for bufferbloat).
+
+One should be careful about concluding that packets/descriptors in
+the HW TX ring cause delay.
Drivers usually delay cleaning up the +ring-buffers (for various performance reasons), thus packets stalling +the TX ring, might just be waiting for cleanup. + +This cleanup issues is specifically the case, for the driver ixgbe +(Intel 82599 chip). This driver (ixgbe) combine TX+RX ring cleanups, +and the cleanup interval is affected by the ethtool --coalesce setting +of parameter "rx-usecs". + +For ixgbe use e.g "30" resulting in approx 33K interrupts/sec (1/30*10^6): + # ethtool -C ethX rx-usecs 30 + + Viewing threads =============== /proc/net/pktgen/kpktgend_0 ^ permalink raw reply related [flat|nested] 22+ messages in thread
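Pulling the commands from the patch above into one diagnose-and-tune
sequence (assuming eth1 is the pktgen output device; only the ethtool
options shown in the patch are used):

 # Diagnose: counters increasing across a pktgen run indicate the
 # TX ring is pushing back on pktgen
 ethtool -S eth1 | egrep 'tx_restart_queue|tx_busy'

 # Option 1: shorten the combined TX+RX cleanup interval (ixgbe),
 # giving approx 33K interrupts/sec
 ethtool -C eth1 rx-usecs 30

 # Option 2: enlarge the TX ring (trade-off: CPU cache pressure and
 # more NIC-level queueing in normal use)
 ethtool -G eth1 tx 1024

As the numbers in the patch description show, on this setup adjusting
rx-usecs is the more effective of the two knobs.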
* [net-next PATCH V2 2/3] pktgen: avoid expensive set_current_state() call in loop
  2014-06-26 11:16 ` [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance Jesper Dangaard Brouer
  2014-06-26 11:16 ` [net-next PATCH V2 1/3] pktgen: document tuning for max NIC performance Jesper Dangaard Brouer
@ 2014-06-26 11:16 ` Jesper Dangaard Brouer
  2014-06-26 11:16 ` [net-next PATCH V2 3/3] pktgen: RCU-ify "if_list" to remove lock in next_to_run() Jesper Dangaard Brouer
  2014-07-01 22:51 ` [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance David Miller
  3 siblings, 0 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2014-06-26 11:16 UTC (permalink / raw)
To: Jesper Dangaard Brouer, David S. Miller, netdev
Cc: Daniel Borkmann, Florian Westphal, Hannes Frederic Sowa, Thomas Graf,
    Eric Dumazet, zoltan.kiss, Alexander Duyck, Jeff Kirsher

Avoid calling set_current_state() inside the busy-loop in
pktgen_thread_worker().  In the case of pkt_dev->delay, it is still
used/enabled in pktgen_xmit() via the spin() call.

set_current_state(TASK_INTERRUPTIBLE) uses an xchg, which is
implicitly LOCK prefixed.  I've measured the asm LOCK operation to
take approx 8ns on this E5-2630 CPU.  The performance increase
correlates with this measurement.

Performance data with CLONE_SKB==100000, rx-usecs=30:
(single CPU performance, ixgbe 10Gbit/s, E5-2630)
 * Prev: 5454050 pps --> 183.35ns (1/5454050*10^9)
 * Now:  5684009 pps --> 175.93ns (1/5684009*10^9)
 * Diff: +229959 pps --> -7.42ns

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/pktgen.c |    9 +++------
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index fc17a9d..b61f553 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3407,10 +3407,10 @@ static int pktgen_thread_worker(void *arg)
 
 	pr_debug("starting pktgen/%d:  pid=%d\n", cpu, task_pid_nr(current));
 
-	set_current_state(TASK_INTERRUPTIBLE);
-
 	set_freezable();
 
+	__set_current_state(TASK_RUNNING);
+
 	while (!kthread_should_stop()) {
 		pkt_dev = next_to_run(t);
 
@@ -3424,8 +3424,6 @@ static int pktgen_thread_worker(void *arg)
 			continue;
 		}
 
-		__set_current_state(TASK_RUNNING);
-
 		if (likely(pkt_dev)) {
 			pktgen_xmit(pkt_dev);
 
@@ -3456,9 +3454,8 @@ static int pktgen_thread_worker(void *arg)
 		}
 
 		try_to_freeze();
-
-		set_current_state(TASK_INTERRUPTIBLE);
 	}
+	set_current_state(TASK_INTERRUPTIBLE);
 
 	pr_debug("%s stopping all device\n", t->tsk->comm);
 	pktgen_stop(t);

^ permalink raw reply related	[flat|nested] 22+ messages in thread
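For readers unfamiliar with the kthread idiom being changed above:
set_current_state() must order the state write against the waker's
condition re-check, hence the implicitly LOCK-prefixed xchg, while
__set_current_state() is a plain store.  A minimal sketch of the
general pattern — a hypothetical worker, not the pktgen code itself:

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/types.h>

/* Hypothetical work source, stand-in for pktgen's if_list scan */
struct example_src {
	bool (*have_work)(void *priv);
	void (*do_work)(void *priv);
	void *priv;
};

static int example_worker(void *arg)
{
	struct example_src *src = arg;

	__set_current_state(TASK_RUNNING);	/* plain store, no LOCK op */

	while (!kthread_should_stop()) {
		if (src->have_work(src->priv)) {
			src->do_work(src->priv);
			continue;	/* hot path stays TASK_RUNNING */
		}

		/* Slow path only: set_current_state() implies a full
		 * memory barrier (a LOCK-prefixed operation on x86),
		 * so a waker cannot miss the sleeping state. */
		set_current_state(TASK_INTERRUPTIBLE);
		if (!src->have_work(src->priv) && !kthread_should_stop())
			schedule();
		__set_current_state(TASK_RUNNING);
	}
	return 0;
}

The patch applies exactly this split: the per-packet hot loop pays no
atomic state change, and the serialized state write happens only once,
when the thread is about to stop.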
* [net-next PATCH V2 3/3] pktgen: RCU-ify "if_list" to remove lock in next_to_run()
  2014-06-26 11:16 ` [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance Jesper Dangaard Brouer
  2014-06-26 11:16 ` [net-next PATCH V2 1/3] pktgen: document tuning for max NIC performance Jesper Dangaard Brouer
  2014-06-26 11:16 ` [net-next PATCH V2 2/3] pktgen: avoid expensive set_current_state() call in loop Jesper Dangaard Brouer
@ 2014-06-26 11:16 ` Jesper Dangaard Brouer
  2014-07-01 22:51 ` [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance David Miller
  3 siblings, 0 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2014-06-26 11:16 UTC (permalink / raw)
To: Jesper Dangaard Brouer, David S. Miller, netdev
Cc: Daniel Borkmann, Florian Westphal, Hannes Frederic Sowa, Thomas Graf,
    Eric Dumazet, zoltan.kiss, Alexander Duyck, Jeff Kirsher

The if_lock()/if_unlock() in next_to_run() adds a significant
overhead, because it is called for every packet in the busy loop of
pktgen_thread_worker().  (Thomas Graf originally pointed me at this
lock problem.)

Removing these two "LOCK" operations should in theory save us approx
16ns (8ns x 2).  As illustrated below, we do save 16ns when removing
the locks and introducing RCU protection.

Performance data with CLONE_SKB==100000, TX-size=512, rx-usecs=30:
(single CPU performance, ixgbe 10Gbit/s, E5-2630)
 * Prev   : 5684009 pps --> 175.93ns (1/5684009*10^9)
 * RCU-fix: 6272204 pps --> 159.43ns (1/6272204*10^9)
 * Diff   : +588195 pps --> -16.50ns

To understand this RCU patch, I describe the pktgen thread model
below.

In pktgen there are several kernel threads, but only one CPU runs
each kernel thread.  Communication with the kernel threads is done
through some thread control flags.  This allows a thread to change
data structures at a known synchronization point; see main thread
func pktgen_thread_worker().

Userspace changes are communicated through proc-file writes.  There
are three types of changes: general control changes "pgctrl"
(func: pgctrl_write), thread changes "kpktgend_X"
(func: pktgen_thread_write), and interface config changes "etcX@N"
(func: pktgen_if_write).

Userspace "pgctrl" and "thread" changes are synchronized via the
mutex pktgen_thread_lock, thus only a single userspace instance can
run.  The mutex is taken while the packet generator is running, by
pgctrl "start".  Thus e.g. "add_device" cannot be invoked when pktgen
is running/started.

All "pgctrl" and all "thread" changes, except thread "add_device",
communicate via the thread control flags.  The main problem is the
exception "add_device", which modifies a thread's "if_list" directly.

Fortunately "add_device" cannot be invoked while pktgen is running.
But there exists a race between "rem_device_all" and "add_device"
(which normally doesn't occur, because "rem_device_all" waits 125ms
before returning).  Background'ing "rem_device_all" and running
"add_device" immediately allows the race to occur.

The race affects a thread's list of devices, the "if_list".  The
if_lock is used for protecting this "if_list".  Other readers are
given lock-free access to the list under RCU read sections.

Note, interface config changes (via proc) can occur while pktgen is
running, which worries me a bit.  I'm assuming proc_remove() takes
appropriate locks, to assure no writers exist after proc_remove()
finishes.

I've been running a script exercising the race condition (leading me
to fix the proc_remove order), without any issues.
The script also exercises concurrent proc writes while the interface
config is getting removed.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
---
 net/core/pktgen.c |  101 +++++++++++++++++++++++++++--------------------------
 1 files changed, 52 insertions(+), 49 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index b61f553..5b05e36 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -69,8 +69,9 @@
  * for running devices in the if_list and sends packets until count is 0 it
  * also the thread checks the thread->control which is used for inter-process
  * communication. controlling process "posts" operations to the threads this
- * way. The if_lock should be possible to remove when add/rem_device is merged
- * into this too.
+ * way.
+ * The if_list is RCU protected, and the if_lock remains to protect updating
+ * of if_list, as it is invoked from userspace via "add_device" (proc write).
  *
  * By design there should only be *one* "controlling" process. In practice
  * multiple write accesses gives unpredictable result. Understood by "write"
@@ -208,7 +209,7 @@
 #define T_REMDEVALL   (1<<2)	/* Remove all devs */
 #define T_REMDEV      (1<<3)	/* Remove one dev */
 
-/* If lock -- can be removed after some work */
+/* If lock -- protects updating of if_list */
 #define   if_lock(t)           spin_lock(&(t->if_lock));
 #define   if_unlock(t)         spin_unlock(&(t->if_lock));
 
@@ -241,6 +242,7 @@ struct pktgen_dev {
 	struct proc_dir_entry *entry;	/* proc file */
 	struct pktgen_thread *pg_thread;/* the owner */
 	struct list_head list;		/* chaining in the thread's run-queue */
+	struct rcu_head	 rcu;		/* freed by RCU */
 
 	int running;		/* if false, the test will stop */
 
@@ -1737,14 +1739,14 @@ static int pktgen_thread_show(struct seq_file *seq, void *v)
 
 	seq_puts(seq, "Running: ");
 
-	if_lock(t);
-	list_for_each_entry(pkt_dev, &t->if_list, list)
+	rcu_read_lock();
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list)
 		if (pkt_dev->running)
 			seq_printf(seq, "%s ", pkt_dev->odevname);
 
 	seq_puts(seq, "\nStopped: ");
 
-	list_for_each_entry(pkt_dev, &t->if_list, list)
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list)
 		if (!pkt_dev->running)
 			seq_printf(seq, "%s ", pkt_dev->odevname);
 
@@ -1753,7 +1755,7 @@ static int pktgen_thread_show(struct seq_file *seq, void *v)
 	else
 		seq_puts(seq, "\nResult: NA\n");
 
-	if_unlock(t);
+	rcu_read_unlock();
 
 	return 0;
 }
@@ -1878,10 +1880,8 @@ static struct pktgen_dev *__pktgen_NN_threads(const struct pktgen_net *pn,
 		pkt_dev = pktgen_find_dev(t, ifname, exact);
 		if (pkt_dev) {
 			if (remove) {
-				if_lock(t);
 				pkt_dev->removal_mark = 1;
 				t->control |= T_REMDEV;
-				if_unlock(t);
 			}
 			break;
 		}
@@ -1931,7 +1931,8 @@ static void pktgen_change_name(const struct pktgen_net *pn, struct net_device *d
 	list_for_each_entry(t, &pn->pktgen_threads, th_list) {
 		struct pktgen_dev *pkt_dev;
 
-		list_for_each_entry(pkt_dev, &t->if_list, list) {
+		rcu_read_lock();
+		list_for_each_entry_rcu(pkt_dev, &t->if_list, list) {
 			if (pkt_dev->odev != dev)
 				continue;
 
@@ -1946,6 +1947,7 @@ static void pktgen_change_name(const struct pktgen_net *pn, struct net_device *d
 					 dev->name);
 			break;
 		}
+		rcu_read_unlock();
 	}
 }
 
@@ -2997,8 +2999,8 @@ static void pktgen_run(struct pktgen_thread *t)
 
 	func_enter();
 
-	if_lock(t);
-	list_for_each_entry(pkt_dev, &t->if_list, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list) {
 
 		/*
 		 * setup odev and create initial packet.
@@ -3007,18 +3009,18 @@ static void pktgen_run(struct pktgen_thread *t)
 		if (pkt_dev->odev) {
 			pktgen_clear_counters(pkt_dev);
-			pkt_dev->running = 1;	/* Cranke yeself! */
 			pkt_dev->skb = NULL;
 			pkt_dev->started_at = pkt_dev->next_tx = ktime_get();
 
 			set_pkt_overhead(pkt_dev);
 
 			strcpy(pkt_dev->result, "Starting");
+			pkt_dev->running = 1;	/* Cranke yeself! */
 			started++;
 		} else
 			strcpy(pkt_dev->result, "Error starting");
 	}
-	if_unlock(t);
+	rcu_read_unlock();
 	if (started)
 		t->control &= ~(T_STOP);
 }
@@ -3041,27 +3043,25 @@ static int thread_is_running(const struct pktgen_thread *t)
 {
 	const struct pktgen_dev *pkt_dev;
 
-	list_for_each_entry(pkt_dev, &t->if_list, list)
-		if (pkt_dev->running)
+	rcu_read_lock();
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list)
+		if (pkt_dev->running) {
+			rcu_read_unlock();
 			return 1;
+		}
+	rcu_read_unlock();
 	return 0;
 }
 
 static int pktgen_wait_thread_run(struct pktgen_thread *t)
 {
-	if_lock(t);
-
 	while (thread_is_running(t)) {
 
-		if_unlock(t);
-
 		msleep_interruptible(100);
 
 		if (signal_pending(current))
 			goto signal;
-
-		if_lock(t);
 	}
-	if_unlock(t);
 	return 1;
 signal:
 	return 0;
@@ -3166,10 +3166,10 @@ static int pktgen_stop_device(struct pktgen_dev *pkt_dev)
 		return -EINVAL;
 	}
 
+	pkt_dev->running = 0;
 	kfree_skb(pkt_dev->skb);
 	pkt_dev->skb = NULL;
 	pkt_dev->stopped_at = ktime_get();
-	pkt_dev->running = 0;
 
 	show_results(pkt_dev, nr_frags);
 
@@ -3180,9 +3180,8 @@ static struct pktgen_dev *next_to_run(struct pktgen_thread *t)
 {
 	struct pktgen_dev *pkt_dev, *best = NULL;
 
-	if_lock(t);
-
-	list_for_each_entry(pkt_dev, &t->if_list, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list) {
 		if (!pkt_dev->running)
 			continue;
 		if (best == NULL)
@@ -3190,7 +3189,8 @@ static struct pktgen_dev *next_to_run(struct pktgen_thread *t)
 		else if (ktime_compare(pkt_dev->next_tx, best->next_tx) < 0)
 			best = pkt_dev;
 	}
-	if_unlock(t);
+	rcu_read_unlock();
+
 	return best;
 }
 
@@ -3200,13 +3200,13 @@ static void pktgen_stop(struct pktgen_thread *t)
 
 	func_enter();
 
-	if_lock(t);
+	rcu_read_lock();
 
-	list_for_each_entry(pkt_dev, &t->if_list, list) {
+	list_for_each_entry_rcu(pkt_dev, &t->if_list, list) {
 		pktgen_stop_device(pkt_dev);
 	}
 
-	if_unlock(t);
+	rcu_read_unlock();
 }
 
 /*
@@ -3220,8 +3220,6 @@ static void pktgen_rem_one_if(struct pktgen_thread *t)
 
 	func_enter();
 
-	if_lock(t);
-
 	list_for_each_safe(q, n, &t->if_list) {
 		cur = list_entry(q, struct pktgen_dev, list);
 
@@ -3235,8 +3233,6 @@ static void pktgen_rem_one_if(struct pktgen_thread *t)
 
 		break;
 	}
-
-	if_unlock(t);
 }
 
 static void pktgen_rem_all_ifs(struct pktgen_thread *t)
@@ -3248,8 +3244,6 @@ static void pktgen_rem_all_ifs(struct pktgen_thread *t)
 
 	/* Remove all devices, free mem */
 
-	if_lock(t);
-
 	list_for_each_safe(q, n, &t->if_list) {
 		cur = list_entry(q, struct pktgen_dev, list);
 
@@ -3258,8 +3252,6 @@ static void pktgen_rem_all_ifs(struct pktgen_thread *t)
 
 		pktgen_remove_device(t, cur);
 	}
-
-	if_unlock(t);
 }
 
 static void pktgen_rem_thread(struct pktgen_thread *t)
@@ -3482,8 +3474,8 @@ static struct pktgen_dev *pktgen_find_dev(struct pktgen_thread *t,
 	struct pktgen_dev *p, *pkt_dev = NULL;
 	size_t len = strlen(ifname);
 
-	if_lock(t);
-	list_for_each_entry(p, &t->if_list, list)
+	rcu_read_lock();
+	list_for_each_entry_rcu(p, &t->if_list, list)
 		if (strncmp(p->odevname, ifname, len) == 0) {
 			if (p->odevname[len]) {
 				if (exact || p->odevname[len] != '@')
@@ -3493,7 +3485,7 @@ static struct pktgen_dev *pktgen_find_dev(struct pktgen_thread *t,
 			break;
 		}
 
-	if_unlock(t);
+	rcu_read_unlock();
 	pr_debug("find_dev(%s) returning %p\n", ifname, pkt_dev);
 	return pkt_dev;
 }
 
@@ -3507,6 +3499,12 @@ static int add_dev_to_thread(struct pktgen_thread *t,
 {
 	int rv = 0;
 
+	/* This function cannot be called concurrently, as it is called
+	 * under the pktgen_thread_lock mutex, but it can run from
+	 * userspace on another CPU than the kthread.  The if_lock()
+	 * is used here to sync with concurrent instances of
+	 * _rem_dev_from_if_list() invoked via kthread, which is also
+	 * updating the if_list */
 	if_lock(t);
 
 	if (pkt_dev->pg_thread) {
@@ -3515,9 +3513,9 @@ static int add_dev_to_thread(struct pktgen_thread *t,
 		goto out;
 	}
 
-	list_add(&pkt_dev->list, &t->if_list);
-	pkt_dev->pg_thread = t;
 	pkt_dev->running = 0;
+	pkt_dev->pg_thread = t;
+	list_add_rcu(&pkt_dev->list, &t->if_list);
 
 out:
 	if_unlock(t);
@@ -3672,11 +3670,13 @@ static void _rem_dev_from_if_list(struct pktgen_thread *t,
 	struct list_head *q, *n;
 	struct pktgen_dev *p;
 
+	if_lock(t);
 	list_for_each_safe(q, n, &t->if_list) {
 		p = list_entry(q, struct pktgen_dev, list);
 		if (p == pkt_dev)
-			list_del(&p->list);
+			list_del_rcu(&p->list);
 	}
+	if_unlock(t);
 }
 
 static int pktgen_remove_device(struct pktgen_thread *t,
@@ -3696,20 +3696,22 @@ static int pktgen_remove_device(struct pktgen_thread *t,
 		pkt_dev->odev = NULL;
 	}
 
-	/* And update the thread if_list */
-
-	_rem_dev_from_if_list(t, pkt_dev);
-
+	/* Remove proc entry before the if_list entry, because add_device
+	 * uses the list to determine if an interface already exists,
+	 * avoiding a race with proc_create_data() */
 	if (pkt_dev->entry)
 		proc_remove(pkt_dev->entry);
 
+	/* And update the thread if_list */
+	_rem_dev_from_if_list(t, pkt_dev);
+
 #ifdef CONFIG_XFRM
 	free_SAs(pkt_dev);
 #endif
 	vfree(pkt_dev->flows);
 	if (pkt_dev->page)
 		put_page(pkt_dev->page);
-	kfree(pkt_dev);
+	kfree_rcu(pkt_dev, rcu);
 
 	return 0;
 }
@@ -3809,6 +3811,7 @@ static void __exit pg_cleanup(void)
 {
 	unregister_netdevice_notifier(&pktgen_notifier_block);
 	unregister_pernet_subsys(&pg_net_ops);
+	/* Don't need rcu_barrier() due to use of kfree_rcu() */
 }
 
 module_init(pg_init);

^ permalink raw reply related	[flat|nested] 22+ messages in thread
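The locking discipline this patch establishes — if_lock for writers,
RCU read sections for readers, kfree_rcu() to defer freeing past any
in-flight readers — is the standard rculist pattern.  A minimal
self-contained sketch of the same pattern, using hypothetical example
structures rather than the pktgen code:

#include <linux/list.h>
#include <linux/rculist.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Hypothetical example mirroring the pktgen scheme */
struct example_dev {
	struct list_head list;
	struct rcu_head rcu;
	int running;
};

static LIST_HEAD(example_list);
static DEFINE_SPINLOCK(example_lock);	/* plays the role of if_lock */

static void example_add(struct example_dev *d)
{
	spin_lock(&example_lock);		/* writers still serialize */
	list_add_rcu(&d->list, &example_list);
	spin_unlock(&example_lock);
}

static void example_del(struct example_dev *d)
{
	spin_lock(&example_lock);
	list_del_rcu(&d->list);
	spin_unlock(&example_lock);
	kfree_rcu(d, rcu);	/* freed only after all RCU readers finish */
}

/* Reader: the hot path, no LOCK-prefixed instructions at all */
static struct example_dev *example_find_running(void)
{
	struct example_dev *d, *found = NULL;

	rcu_read_lock();
	list_for_each_entry_rcu(d, &example_list, list)
		if (d->running) {
			found = d;
			break;
		}
	rcu_read_unlock();
	return found;	/* caller must tolerate a stale result */
}

This is why next_to_run() gets cheaper: the per-packet reader drops
from two LOCK operations to a plain rcu_read_lock()/unlock() pair,
which on a non-preempt kernel compiles down to (almost) nothing.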
* Re: [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance
  2014-06-26 11:16 ` [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance Jesper Dangaard Brouer
  ` (2 preceding siblings ...)
  2014-06-26 11:16 ` [net-next PATCH V2 3/3] pktgen: RCU-ify "if_list" to remove lock in next_to_run() Jesper Dangaard Brouer
@ 2014-07-01 22:51 ` David Miller
  3 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2014-07-01 22:51 UTC (permalink / raw)
To: brouer
Cc: netdev, dborkman, fw, hannes, tgraf, eric.dumazet, zoltan.kiss,
    alexander.h.duyck, jeffrey.t.kirsher

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Thu, 26 Jun 2014 13:16:17 +0200

> This series focuses on optimizing "pktgen" for single CPU performance.
>
> V2-series:
>  - Removed some patches
>  - Document the real reason for the TX ring buffer filling up

Looks great, series applied, thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread
end of thread, other threads:[~2014-07-01 22:51 UTC | newest]

Thread overview: 22+ messages
2014-05-14 14:17 [net-next PATCH 0/5] Optimizing "pktgen" for single CPU performance Jesper Dangaard Brouer
2014-05-14 14:17 ` [net-next PATCH 1/5] ixgbe: trivial fixes while reading code Jesper Dangaard Brouer
2014-05-14 14:17 ` [net-next PATCH 2/5] ixgbe: increase default TX ring buffer to 1024 Jesper Dangaard Brouer
2014-05-14 14:28 ` David Laight
2014-05-14 19:25 ` Jesper Dangaard Brouer
2014-05-14 16:28 ` Alexander Duyck
2014-05-14 17:49 ` David Miller
2014-05-14 19:09 ` Jesper Dangaard Brouer
2014-05-14 19:54 ` David Miller
2014-05-15  9:16 ` David Laight
2014-05-29 15:29 ` Jesper Dangaard Brouer
2014-05-14 14:17 ` [net-next PATCH 3/5] pktgen: avoid atomic_inc per packet in xmit loop Jesper Dangaard Brouer
2014-05-14 14:35 ` Eric Dumazet
2014-05-14 15:13 ` Jesper Dangaard Brouer
2014-05-14 15:35 ` Eric Dumazet
2014-05-14 14:17 ` [net-next PATCH 4/5] pktgen: avoid expensive set_current_state() call in loop Jesper Dangaard Brouer
2014-05-14 14:18 ` [net-next PATCH 5/5] pktgen: RCU'ify "if_list" to remove lock in next_to_run() Jesper Dangaard Brouer
2014-06-26 11:16 ` [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance Jesper Dangaard Brouer
2014-06-26 11:16 ` [net-next PATCH V2 1/3] pktgen: document tuning for max NIC performance Jesper Dangaard Brouer
2014-06-26 11:16 ` [net-next PATCH V2 2/3] pktgen: avoid expensive set_current_state() call in loop Jesper Dangaard Brouer
2014-06-26 11:16 ` [net-next PATCH V2 3/3] pktgen: RCU-ify "if_list" to remove lock in next_to_run() Jesper Dangaard Brouer
2014-07-01 22:51 ` [net-next PATCH V2 0/3] Optimizing pktgen for single CPU performance David Miller