Re: Multi-core scalability problems - Jesper Dangaard Brouer

From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Federico Parola <fede.parola@hotmail.it>
Cc: xdp-newbies@vger.kernel.org, brouer@redhat.com
Subject: Re: Multi-core scalability problems
Date: Mon, 19 Oct 2020 20:26:21 +0200
Message-ID: <20201019202621.690aaacd@carbon>
In-Reply-To: <VI1PR04MB31049251BB588E95D5C5E3009E1E0@VI1PR04MB3104.eurprd04.prod.outlook.com>

On Mon, 19 Oct 2020 17:23:18 +0200
Federico Parola <fede.parola@hotmail.it> wrote:

> On 15/10/20 15:22, Jesper Dangaard Brouer wrote:
> > On Thu, 15 Oct 2020 14:04:51 +0200
> > Federico Parola <fede.parola@hotmail.it> wrote:
> >   
> >> On 14/10/20 16:26, Jesper Dangaard Brouer wrote:  
> >>> On Wed, 14 Oct 2020 14:17:46 +0200
> >>> Federico Parola <fede.parola@hotmail.it> wrote:
> >>>      
> >>>> On 14/10/20 11:15, Jesper Dangaard Brouer wrote:  
> >>>>> On Wed, 14 Oct 2020 08:56:43 +0200
> >>>>> Federico Parola <fede.parola@hotmail.it> wrote:
> >>>>>
> >>>>> [...]  
> >>>>>>> Can you try to use this[2] tool:
> >>>>>>>      ethtool_stats.pl --dev enp101s0f0
> >>>>>>>
> >>>>>>> And notice if there are any strange counters.
> >>>>>>>
> >>>>>>>
> >>>>>>> [2]https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl  
> > [...]
> >   
> >>>> The only solution I've found so far is to reduce the size of the rx ring
> >>>> as I mentioned in my former post. However I still see a decrease in
> >>>> performance when exceeding 4 cores.  
> >>>
> >>> What is happening when you are reducing the size of the rx ring is two
> >>> things. (1) i40e driver have reuse/recycle-pages trick that get less
> >>> efficient, but because you are dropping packets early you are not
> >>> affected. (2) the total size of L3 memory you need to touch is also
> >>> decreased.
> >>>
> >>> I think you are hitting case (2).  The Intel CPU have a cool feature
> >>> called DDIO (Data-Direct IO) or DCA (Direct Cache Access), which can
> >>> deliver packet data into L3 cache memory (if NIC is directly PCIe
> >>> connected to CPU).  The CPU is in charge when this feature is enabled,
> >>> and it will try to avoid L3 trashing and disable it in certain cases.
> >>> When you reduce the size of the rx rings, then you are also needing
> >>> less L3 cache memory, to the CPU will allow this DDIO feature.
> >>>
> >>> You can use the 'perf stat' tool to check if this is happening, by
> >>> monitoring L3 (and L2) cache usage.  
> >>
> >> What events should I monitor? LLC-load-misses/LLC-loads?  
> > 
> > Looking at my own results from xdp-paper[1], it looks like that it
> > results in real 'cache-misses' (perf stat -e cache-misses).
> > 
> > E.g I ran:
> >   sudo ~/perf stat -C3 -e cycles -e  instructions -e cache-references -e cache-misses -r 3 sleep 1
> > 
> > Notice how the 'insn per cycle' gets less efficient when we experience
> > these cache-misses.
> > 
> > Also how RX-size of queues affect XDP-redirect in [2].
> > 
> > 
> > [1] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench01_baseline.org
> > [2] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench05_xdp_redirect.org
> >  
> Hi Jesper, sorry for the late reply, this are the cache refs/misses for 
> 4 flows and different rx ring sizes:
> 
> RX 512 (9.4 Mpps dropped):
> Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
>    23771011  cache-references                                (+-  0.04% )
>     8865698  cache-misses      # 37.296 % of all cache refs  (+-  0.04% )
> 
> RX 128 (39.4 Mpps dropped):
> Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
>    68177470  cache-references                               ( +-  0.01% )
>       23898  cache-misses      # 0.035 % of all cache refs  ( +-  3.23% )
> 
> Reducing the size of the rx ring brings to a huge decrease in cache 
> misses, is this the effect of DDIO turning on?

Yes, exactly.

A cache-miss rate of 37.296 % of all cache refs is very high.
The number of cache-misses, 8,865,698, is close to your reported 9.4
Mpps.  This correlates with the idea that DDIO is not taking effect,
as you have roughly one miss per packet.
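
As a quick sanity check of that correlation, here is a small Python
sketch using the counters you posted.  It assumes the perf counters
cover a one-second interval and that the Mpps figures come from the same
period; neither is stated in your output, so treat the per-packet
numbers as approximate.

```python
def miss_ratio(cache_misses, cache_references):
    """Fraction of cache references that missed."""
    return cache_misses / cache_references


def misses_per_packet(cache_misses, packets_per_sec, interval_sec=1.0):
    """Average cache misses per packet; interval_sec is an assumption."""
    return cache_misses / (packets_per_sec * interval_sec)


# Counter values and packet rates as reported earlier in this thread.
runs = {
    "RX 512": {"refs": 23771011, "misses": 8865698, "pps": 9.4e6},
    "RX 128": {"refs": 68177470, "misses": 23898, "pps": 39.4e6},
}

for name, r in runs.items():
    ratio = 100 * miss_ratio(r["misses"], r["refs"])
    per_pkt = misses_per_packet(r["misses"], r["pps"])
    print(f"{name}: {ratio:.3f}% misses, {per_pkt:.4f} misses/packet")
```

Under those assumptions RX 512 comes out at a bit under one miss per
packet, while RX 128 is far below one miss per thousand packets.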

I can see that you have selected a subset of the CPUs (0,1,2,13); it
is important that these are the active CPUs.  I usually select only a
single CPU, to make sure I can reason about the numbers.  I've seen
before that some CPUs get the DDIO effect and others do not, so watch
out for this.

If you add the HW counters -e instructions -e cycles to your perf stat
command, you will also see the instructions-per-cycle calculation.  You
should notice that the cache-misses also cause this number to drop:
when the CPU stalls, it cannot keep the pipeline full/busy.
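
To illustrate why stalls pull the number down, here is a deliberately
simplified Python model.  It assumes every cache miss costs a fixed
number of stall cycles with no overlap, which real CPUs do not strictly
obey; the function name and all the numbers are hypothetical and not
measurements from this thread.

```python
def effective_ipc(instructions, base_ipc, cache_misses, miss_penalty_cycles):
    """Instructions per cycle when each miss adds a fixed stall.

    Simplified model: total cycles = cycles needed without any misses
    plus a flat penalty per miss (no memory-level parallelism).
    """
    cycles = instructions / base_ipc + cache_misses * miss_penalty_cycles
    return instructions / cycles


# Hypothetical workload: 1000 instructions at an ideal IPC of 2.0.
print(effective_ipc(1000, 2.0, 0, 200))   # no misses: ideal IPC
print(effective_ipc(1000, 2.0, 10, 200))  # 10 misses at 200 cycles each
```

Even a handful of misses dominates the cycle count in this model, which
is the same effect you should see in the 'insn per cycle' column.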

What kind of CPU are you using?
Specifically, what are the cache sizes?  (Use dmidecode and look for
"Cache Information".)

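The cache size matters because the RX rings have to fit in whatever part
of L3 the CPU lets DDIO use.  A back-of-envelope Python sketch follows.
The 2048-byte buffer per descriptor, the 15 MiB L3 (taken from my own
CPU, not yours) and the 10 % DDIO share are all assumptions made for
illustration, so plug in your own values.

```python
MIB = 1024 * 1024


def rx_footprint_bytes(num_queues, ring_size, buf_bytes=2048):
    """Packet-buffer memory behind all RX rings (buf_bytes is assumed)."""
    return num_queues * ring_size * buf_bytes


def fits_ddio_budget(footprint_bytes, llc_bytes, ddio_fraction=0.10):
    """True if the footprint fits in the assumed DDIO share of the LLC."""
    return footprint_bytes <= llc_bytes * ddio_fraction


# 4 active queues as in this thread; LLC size is a placeholder value.
for ring_size in (512, 128):
    footprint = rx_footprint_bytes(4, ring_size)
    ok = fits_ddio_budget(footprint, 15 * MIB)
    print(f"RX {ring_size}: {footprint / MIB:.1f} MiB, fits: {ok}")
```

With these placeholder values the 512-entry rings exceed the assumed
budget and the 128-entry rings fit, but the real threshold depends on
your CPU.
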
The performance drop, 39.4 Mpps -> 9.4 Mpps, is a little too large.

If I were you, I would measure the speed of the memory, using the
'lat_mem_rd' command from the lmbench-3.0 tool.

 /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_mem_rd 2000 128

The output is the latency in nanoseconds of accessing increasing sizes
of memory.  The jumps/increases in latency should be fairly clear and
show the latency of the different cache levels.  For my CPU E5-1650 v4
@ 3.60GHz with 15MB L3 cache, I see L1=1.055ns, L2=5.521ns, L3=17.569ns.
(I could not find a tool that tells me the cost of accessing main-memory,
but maybe it is the 17.569ns, as the tool's measurements jump from 12MB
(5.933ns) to 16MB (12.334ns) and I know L3 is 15MB, so I don't get an
accurate L3 measurement.)
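
If you want to spot the jumps automatically, here is a minimal Python
sketch.  It assumes the output is a header line followed by
"size-in-MB latency-in-ns" pairs, and it uses an arbitrary 1.5x
threshold to flag a jump.  The sample text is illustrative: the
latencies are the ones I quoted above, but the sizes on the first three
rows are made up.

```python
def parse_lat_mem_rd(text):
    """Return (size_mb, latency_ns) pairs, skipping any other lines."""
    samples = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # header such as "stride=128" or a blank line
        try:
            samples.append((float(parts[0]), float(parts[1])))
        except ValueError:
            continue  # two fields, but not numeric
    return samples


def find_latency_jumps(samples, factor=1.5):
    """List (prev_size, size, prev_ns, ns) where latency grows by > factor."""
    jumps = []
    for (prev_size, prev_ns), (size, ns) in zip(samples, samples[1:]):
        if ns > prev_ns * factor:
            jumps.append((prev_size, size, prev_ns, ns))
    return jumps


# Illustrative sample only; not a complete or verbatim tool output.
SAMPLE = """\
"stride=128
0.03125 1.055
0.06250 5.521
8.00000 5.600
12.00000 5.933
16.00000 12.334
"""

for prev_size, size, prev_ns, ns in find_latency_jumps(parse_lat_mem_rd(SAMPLE)):
    print(f"{prev_size} MB -> {size} MB: {prev_ns} ns -> {ns} ns")
```

Each reported jump marks a working-set size that no longer fits in the
previous cache level.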

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
