* qdisc/trafgen: Measuring effect of qdisc bulk dequeue, with trafgen
From: Jesper Dangaard Brouer
Date: 2014-09-19 10:35 UTC
To: netdev@vger.kernel.org
Cc: David S. Miller, Tom Herbert, Hannes Frederic Sowa,
Florian Westphal, Daniel Borkmann, Jamal Hadi Salim,
Alexander Duyck, John Fastabend
This experiment was about finding the tipping point where bulking
from the qdisc kicks in. This is an artificial benchmark.
This testing relates to my qdisc bulk dequeue patches:
http://thread.gmane.org/gmane.linux.network/328829/focus=328951
My point has always been that we should only start bulking packets when
really needed; I dislike attempts to delay TX in anticipation of
packets arriving shortly (due to the added latency). IMHO the qdisc
layer is the right place to "see" when bulking makes sense.
The reason behind this test is that there are two code paths in the
qdisc layer: 1) when the qdisc is empty we allow the packet to be sent
directly via sch_direct_xmit(); 2) when the qdisc contains packets we go
through the more expensive process of enqueue, dequeue and possibly
rescheduling a softirq.
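To make the two paths concrete, here is a simplified C sketch of where
they split, loosely modelled on __dev_xmit_skb() in net/core/dev.c. It
is only an outline, not the actual kernel code: the function name is
made up, and the qdisc_run_begin()/qdisc_run_end() state handling and
return-code masking are left out::

  /* Simplified sketch (not actual kernel code) of the two qdisc
   * transmit paths described above. */
  static int xmit_sketch(struct sk_buff *skb, struct Qdisc *q,
                         struct net_device *dev, struct netdev_queue *txq)
  {
          spinlock_t *root_lock = qdisc_lock(q);
          int rc;

          spin_lock(root_lock);
          if ((q->flags & TCQ_F_CAN_BYPASS) && qdisc_qlen(q) == 0) {
                  /* Path 1: empty qdisc, bypass enqueue/dequeue and
                   * hand the packet directly to the driver. */
                  rc = sch_direct_xmit(skb, q, dev, txq, root_lock);
          } else {
                  /* Path 2: backlogged qdisc, pay for enqueue + dequeue
                   * and possibly a NET_TX softirq reschedule. */
                  rc = q->enqueue(skb, q);
                  qdisc_run(q);
          }
          spin_unlock(root_lock);
          return rc;
  }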
Thus, the cost when the qdisc kicks in should be slightly higher. My
qdisc bulk dequeue patch should actually help us get faster in this
case. The results below (with dequeue bulking of max 4 packets) show
that this was true; as expected, the locking cost was reduced, giving
us an actual speedup.
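Conceptually, the win comes from amortizing the qdisc root-lock
round-trips: while the lock is already held, pull several packets off
the qdisc and hand them to the driver as one chain. A rough sketch of
the idea follows; it is illustrative only, not the actual patch code,
and the function name is made up::

  /* Rough sketch of dequeue bulking (illustrative, not the patch code):
   * dequeue up to 'limit' packets per root-lock round-trip and link
   * them via skb->next, so the caller can transmit the whole chain. */
  static struct sk_buff *dequeue_bulk_sketch(struct Qdisc *q, int limit)
  {
          struct sk_buff *head = q->dequeue(q);
          struct sk_buff *tail = head;

          while (tail && --limit > 0) {
                  struct sk_buff *next = q->dequeue(q);

                  if (!next)
                          break;
                  tail->next = next;      /* chain the packets */
                  tail = next;
          }
          return head;    /* caller transmits the chain under the lock */
  }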
Testing this tipping point is hard, but I found a trafgen setup that
was balancing right on this tipping point: a single-CPU 1Gbit/s setup
with the igb driver.
# trafgen --cpp --dev eth1 --conf udp_example02_const.trafgen -V --qdisc-path -t0 --cpus 1
With this specific trafgen setup, I could show that when the qdisc
queue was empty I could not hit 1G wirespeed.
* instant rx:0 tx:1423314 pps n:60 average: rx:0 tx:1423454 pps
(instant variation TX -0.069 ns (min:-0.707 max:0.392) RX 0.000 ns)
Perf showed the top #1 item (13.49%) was _raw_spin_lock, called 81.32%
from sch_direct_xmit() and 16.92% from __dev_queue_xmit().
Sometimes trafgen, by itself, creates a qdisc backlog, and _then_ the
qdisc bulking kicks in, resulting in full 1G wirespeed.
* instant rx:0 tx:1489324 pps n:29 average: rx:0 tx:1489263 pps
(instant variation TX 0.028 ns (min:-0.040 max:0.028) RX 0.000 ns)
* Diff: (1/1423314*10^9)-(1/1489324*10^9) = 31ns
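Spelling out the arithmetic behind that diff as a small standalone C
check (the numbers are the averages quoted above)::

  #include <stdio.h>

  int main(void)
  {
          double no_bulk_pps = 1423314.0; /* avg pps, empty-qdisc path */
          double bulk_pps    = 1489324.0; /* avg pps, backlogged + bulking */

          double ns_no_bulk = 1e9 / no_bulk_pps;  /* ~702.6 ns per packet */
          double ns_bulk    = 1e9 / bulk_pps;     /* ~671.4 ns per packet */

          /* prints "per-packet saving: ~31 ns" */
          printf("per-packet saving: ~%.0f ns\n", ns_no_bulk - ns_bulk);
          return 0;
  }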
The process can be triggered by e.g. starting a netperf on another
CPU, which causes the trafgen qdisc to backlog. When stopping the
netperf, the trafgen qdisc stays backlogged (at least for a while), and
NOW it hits 1G wirespeed.
Perf record/diff showed exactly what was expected: _raw_spin_lock was
now top #3 with 6.09% (down by 7.20%). The distribution of callers (of
_raw_spin_lock) has changed (and sort-of swapped), with 58.64% from
__dev_queue_xmit() and only 33.54% from sch_direct_xmit().
# perf diff
# Baseline Delta Symbol
# no-bulk bulk(4)
# ........ ....... .........................
13.55% -7.20% [k] _raw_spin_lock
11.88% -0.06% [k] sock_alloc_send_pskb
6.24% +0.12% [k] packet_snd
3.66% -1.34% [k] igb_tx_map
3.58% +0.07% [k] __alloc_skb
2.86% +0.16% [k] __dev_queue_xmit
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
* qdisc/UDP_STREAM: measuring effect of qdisc bulk dequeue
From: Jesper Dangaard Brouer
Date: 2014-09-19 10:44 UTC
To: Jesper Dangaard Brouer
Cc: brouer, netdev@vger.kernel.org, David S. Miller, Tom Herbert,
Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
Jamal Hadi Salim, Alexander Duyck, John Fastabend
On Fri, 19 Sep 2014 12:35:36 +0200
Jesper Dangaard Brouer <jbrouer@redhat.com> wrote:
> This testing relates to my qdisc bulk dequeue patches:
> http://thread.gmane.org/gmane.linux.network/328829/focus=328951
I will quickly follow up with a more real-life use-case for qdisc-layer
dequeue bulking (as Eric dislikes my artificial benchmarks ;-)).
Using UDP_STREAM on a 1Gbit/s link (driver igb), I can show that the
_raw_spin_lock calls are reduced by approx 3% when enabling bulking of
just 2 packets.
This test can only demonstrate a CPU usage reduction, as the
throughput is already at maximum link (bandwidth) capacity.
Notice the netperf option "-m 1472", which makes sure we are not
sending UDP IP-fragments (1472 bytes payload + 8 bytes UDP header +
20 bytes IPv4 header = 1500 bytes, matching the Ethernet MTU)::
netperf -H 192.168.111.2 -t UDP_STREAM -l 120 -- -m 1472
Results from perf diff::
# Command: perf diff
# Event 'cycles'
# Baseline Delta Symbol
# no-bulk bulk(1)
# ........ ....... .........................................
#
7.05% -3.03% [k] _raw_spin_lock
6.34% +0.23% [k] copy_user_enhanced_fast_string
6.30% +0.26% [k] fib_table_lookup
3.03% +0.01% [k] __slab_free
3.00% +0.08% [k] intel_idle
2.49% +0.05% [k] sock_alloc_send_pskb
2.31% +0.30% netperf [.] send_omni_inner
2.12% +0.12% netperf [.] send_data
2.11% +0.10% [k] udp_sendmsg
1.96% +0.02% [k] __ip_append_data
1.48% -0.01% [k] __alloc_skb
1.46% +0.07% [k] __mkroute_output
1.34% +0.05% [k] __ip_select_ident
1.29% +0.03% [k] check_leaf
1.27% +0.09% [k] __skb_get_hash
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
Settings::
export N=0; sudo sh -c "echo $N > /proc/sys/net/core/qdisc_bulk_dequeue_limit"; \
grep -H . /proc/sys/net/core/qdisc_bulk_dequeue_limit
/proc/sys/net/core/qdisc_bulk_dequeue_limit:0
export N=1; sudo sh -c "echo $N > /proc/sys/net/core/qdisc_bulk_dequeue_limit"; \
grep -H . /proc/sys/net/core/qdisc_bulk_dequeue_limit
/proc/sys/net/core/qdisc_bulk_dequeue_limit:1
* Re: qdisc/trafgen: Measuring effect of qdisc bulk dequeue, with trafgen
From: Jamal Hadi Salim
Date: 2014-09-19 11:57 UTC
To: Jesper Dangaard Brouer, netdev@vger.kernel.org
Cc: David S. Miller, Tom Herbert, Hannes Frederic Sowa,
Florian Westphal, Daniel Borkmann, Alexander Duyck,
John Fastabend
On 09/19/14 06:35, Jesper Dangaard Brouer wrote:
>
> This experiment was about finding the tipping point where bulking
> from the qdisc kicks in. This is an artificial benchmark.
>
> This testing relates to my qdisc bulk dequeue patches:
> http://thread.gmane.org/gmane.linux.network/328829/focus=328951
>
> My point has always been that we should only start bulking packets when
> really needed; I dislike attempts to delay TX in anticipation of
> packets arriving shortly (due to the added latency). IMHO the qdisc
> layer is the right place to "see" when bulking makes sense.
>
> The reason behind this test is that there are two code paths in the
> qdisc layer: 1) when the qdisc is empty we allow the packet to be sent
> directly via sch_direct_xmit(); 2) when the qdisc contains packets we go
> through the more expensive process of enqueue, dequeue and possibly
> rescheduling a softirq.
>
> Thus, the cost when the qdisc kicks in should be slightly higher. My
> qdisc bulk dequeue patch should actually help us get faster in this
> case. The results below (with dequeue bulking of max 4 packets) show
> that this was true; as expected, the locking cost was reduced, giving
> us an actual speedup.
>
>
> Testing this tipping point is hard, but I found a trafgen setup that
> was balancing right on this tipping point: a single-CPU 1Gbit/s setup
> with the igb driver.
>
The feedback system is clearly very well oiled. Or is it now? ;->
Jesper, maybe you need to poke at the system level as opposed to the
microscopic lock level. The transmit path is essentially kicked by the
TX softirq, which is driven by the RX path etc. And those guys work
like a clock pendulum.
To busy that sucker, you may have more luck with forwarding-type
traffic: funnel traffic from many NIC ports, tied to different CPUs,
to one egress port.
Some coffee helped me remember that I actually surrendered on whether
it can be done at all at netconf 2011 [1], but please let me not poison
your thinking - you may find otherwise.
cheers,
jamal
[1] http://vger.kernel.org/netconf2011_slides/jamal_netconf2011.pdf, slide 12