Re: [PATCH v5 2/2] skb_array: ring test

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Jesper Dangaard Brouer <brouer@redhat.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: linux-kernel@vger.kernel.org, Jason Wang <jasowang@redhat.com>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	davem@davemloft.net, netdev@vger.kernel.org,
	Steven Rostedt <rostedt@goodmis.org>,
	brouer@redhat.com
Subject: Re: [PATCH v5 2/2] skb_array: ring test
Date: Fri, 3 Jun 2016 14:15:55 +0200	[thread overview]
Message-ID: <20160603141555.7dad8fda@redhat.com> (raw)
In-Reply-To: <20160602204725.7bcfd927@redhat.com>

On Thu, 2 Jun 2016 20:47:25 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Tue, 24 May 2016 23:34:14 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Tue, May 24, 2016 at 07:03:20PM +0200, Jesper Dangaard Brouer wrote:  
> > > 
> > > On Tue, 24 May 2016 12:28:09 +0200
> > > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > >     
> > > > I do like perf, but it does not answer my questions about the
> > > > performance of this queue. I will code something up in my own
> > > > framework[2] to answer my own performance questions.
> > > > 
> > > > Like what is be minimum overhead (in cycles) achievable with this type
> > > > of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot)
> > > > for fastpath usage.    
> > > 
> > > Coded it up here:
> > >  https://github.com/netoptimizer/prototype-kernel/commit/b16a3332184
> > >  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c
> > > 
> > > This is a really fake benchmark, but it sort of shows the  
> > > overhead achievable with this type of queue, where it is the same
> > > CPU enqueuing and dequeuing, and cache is guaranteed to be hot.
> > > 
> > > Measured on a i7-4790K CPU @ 4.00GHz, the average cost of
> > > enqueue+dequeue of a single object is around 102 cycles(tsc).
> > > 
> > > To compare this with below, where enq and deq is measured separately:
> > >  102 / 2 = 51 cycles  
> 
> The alf_queue[1] baseline is 26 cycles in this minimum overhead
> achievable benchmark with a MPMC (Multi-Producer/Multi-Consumer) queue
> which use a locked cmpxchg.  (SPSC variant is 5 cycles, thus most cost
> comes from locked cmpxchg).
> 
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/alf_queue.h
> 
> > > > Then I also want to know how this performs when two CPUs are involved.
> > > > As this is also a primary use-case, for you when sending packets into a
> > > > guest.    
> > > 
> > > Coded it up here:
> > >  https://github.com/netoptimizer/prototype-kernel/commit/75fe31ef62e
> > >  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c
> > >  
> > > This parallel benchmark try to keep two (or more) CPUs busy enqueuing or
> > > dequeuing on the same skb_array queue.  It prefills the queue,
> > > and stops the test as soon as queue is empty or full, or
> > > completes a number of "loops"/cycles.
> > > 
> > > For two CPUs the results are really good:
> > >  enqueue: 54 cycles(tsc)
> > >  dequeue: 53 cycles(tsc)  
> 
> As MST points out, a scheme like the alf_queue[1] have the issue that it
> "reads" the opposite cacheline of the consumer.tail/producer.tail to
> determine if space-is-left/queue-is-empty.  This cause an expensive
> transition for the cache coherency protocol.
> 
> Coded up similar test for alf_queue:
>  https://github.com/netoptimizer/prototype-kernel/commit/b3ff2624f1
>  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/alf_queue_parallel01.c
> 
> For two CPUs MPMC results are, significantly worse, and demonstrate MSTs point:
>  enqueue: 227 cycles(tsc)
>  dequeue: 231 cycles(tsc)
> 
> Alf_queue also have a SPSC (Single-Producer/Single-Consumer) variant:
>  enqueue: 24 cycles(tsc)
>  dequeue: 23 cycles(tsc)
> 
> 
> > > Going to 4 CPUs, things break down (but it was not primary use-case?):
> > >  CPU(0) 927 cycles(tsc) enqueue
> > >  CPU(1) 921 cycles(tsc) dequeue
> > >  CPU(2) 927 cycles(tsc) enqueue
> > >  CPU(3) 898 cycles(tsc) dequeue    
> > 
> > It's mostly the spinlock contention I guess.
> > Maybe we don't need fair spinlocks in this case.
> > Try replacing spinlocks with simple cmpxchg
> > and see what happens?  
> 
> The alf_queue uses a cmpxchg scheme, and it does scale better when the
> number of CPUs increase:
> 
>  CPUs:4 Average: 586 cycles(tsc)
>  CPUs:6 Average: 744 cycles(tsc)
>  CPUs:8 Average: 1578 cycles(tsc)
> 
> Notice the alf_queue was designed with the purpose of bulking, to
> mitigate the effect of this cacheline bouncing, but it was not covered
> in this test.

Added bulking to the alf_queue test:
 https://github.com/netoptimizer/prototype-kernel/commit/e22a0d8745 

This does help significantly, but requires use-cases where there are
packets to be bulk deq/enq.  On the other hand, the skb_array also
requires that objects in the queue/array exceed one cacheline, before
it starts to scale.

For two CPUs we need bulk=4 before beating skb_array.  See benchmark
adjusting bulk size:
 CPUs:2 bulk=step:1 Average: 231 cycles(tsc)
 CPUs:2 bulk=step:2 Average: 118 cycles(tsc)
 CPUs:2 bulk=step:3 Average: 65 cycles(tsc)
 CPUs:2 bulk=step:4 Average: 48 cycles(tsc)
 CPUs:2 bulk=step:5 Average: 40 cycles(tsc)
 CPUs:2 bulk=step:6 Average: 37 cycles(tsc)
 CPUs:2 bulk=step:7 Average: 29 cycles(tsc)
 CPUs:2 bulk=step:8 Average: 24 cycles(tsc)
 CPUs:2 bulk=step:9 Average: 23 cycles(tsc)
 CPUs:2 bulk=step:10 Average: 20 cycles(tsc)

Keeping bulk=8, and increasing the CPUs does show better scalability,
due to bulking.

This system (i7-4790K CPU @ 4.00GHz) only had 8-core CPUs:
 CPUs:2 bulk=step:8 Average: 25 cycles(tsc)
 CPUs:4 bulk=step:8 Average: 71 cycles(tsc)
 CPUs:6 bulk=step:8 Average: 100 cycles(tsc)
 CPUs:8 bulk=step:8 Average: 185 cycles(tsc)

Found a (slower) 24-core CPU system (E5-2695v2-ES @ 2.50GHz):
 CPUs:2 bulk=step:8 Average: 50 cycles(tsc)
 CPUs:4 bulk=step:8 Average: 101 cycles(tsc)
 CPUs:6 bulk=step:8 Average: 214 cycles(tsc)
 CPUs:8 bulk=step:8 Average: 347 cycles(tsc)
 CPUs:10 bulk=step:8 Average: 468 cycles(tsc)
 CPUs:12 bulk=step:8 Average: 670 cycles(tsc)
 CPUs:14 bulk=step:8 Average: 698 cycles(tsc)
 CPUs:16 bulk=step:8 Average: 1149 cycles(tsc)
 CPUs:18 bulk=step:8 Average: 1094 cycles(tsc)
 CPUs:20 bulk=step:8 Average: 1349 cycles(tsc)
 CPUs:22 bulk=step:8 Average: 1406 cycles(tsc)
 CPUs:24 bulk=step:8 Average: 1553 cycles(tsc)

I still think skb_array is the winner, when the normal use-case is two
CPUs, and we cannot guarantee CPU pinning (thus cannot use SPSC).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

next prev parent reply	other threads:[~2016-06-03 12:15 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-23 10:43 [PATCH v5 0/2] skb_array: array based FIFO for skbs Michael S. Tsirkin
2016-05-23 10:43 ` [PATCH v5 1/2] " Michael S. Tsirkin
2016-05-23 10:43 ` [PATCH v5 2/2] skb_array: ring test Michael S. Tsirkin
2016-05-23 13:09   ` Jesper Dangaard Brouer
2016-05-23 20:52     ` Michael S. Tsirkin
2016-05-24 10:28       ` Jesper Dangaard Brouer
2016-05-24 10:33         ` Michael S. Tsirkin
2016-05-24 11:54         ` Michael S. Tsirkin
2016-05-24 12:11         ` Michael S. Tsirkin
2016-05-24 17:03         ` Jesper Dangaard Brouer
2016-05-24 20:34           ` Michael S. Tsirkin
2016-06-02 18:47             ` Jesper Dangaard Brouer
2016-06-03 12:15               ` Jesper Dangaard Brouer [this message]
2016-05-23 13:31 ` [PATCH v5 0/2] skb_array: array based FIFO for skbs Eric Dumazet
2016-05-23 20:35   ` Michael S. Tsirkin
2016-05-30  9:59 ` Jason Wang
2016-05-30 15:37   ` Michael S. Tsirkin
2016-05-31  2:29     ` Jason Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160603141555.7dad8fda@redhat.com \
    --to=brouer@redhat.com \
    --cc=davem@davemloft.net \
    --cc=eric.dumazet@gmail.com \
    --cc=jasowang@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mst@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=rostedt@goodmis.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).