From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Federico Parola <fede.parola@hotmail.it>
Cc: xdp-newbies@vger.kernel.org, brouer@redhat.com
Subject: Re: Multi-core scalability problems
Date: Mon, 26 Oct 2020 09:14:48 +0100
Message-ID: <20201026091448.1a407c86@carbon>
In-Reply-To: <VI1PR04MB31048EB4062CAE0C9245B87F9E1B0@VI1PR04MB3104.eurprd04.prod.outlook.com>

On Sat, 24 Oct 2020 15:57:50 +0200
Federico Parola <fede.parola@hotmail.it> wrote:

> On 19/10/20 20:26, Jesper Dangaard Brouer wrote:
> > On Mon, 19 Oct 2020 17:23:18 +0200
> > Federico Parola <fede.parola@hotmail.it> wrote:  
> >>
> >> [...]
> >>
> >> Hi Jesper, sorry for the late reply, these are the cache refs/misses for
> >> 4 flows and different rx ring sizes:
> >>
> >> RX 512 (9.4 Mpps dropped):
> >> Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
> >>     23771011  cache-references                                (+-  0.04% )
> >>      8865698  cache-misses      # 37.296 % of all cache refs  (+-  0.04% )
> >>
> >> RX 128 (39.4 Mpps dropped):
> >> Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
> >>     68177470  cache-references                               ( +-  0.01% )
> >>        23898  cache-misses      # 0.035 % of all cache refs  ( +-  3.23% )
> >>
> >> Reducing the size of the rx ring leads to a huge decrease in cache
> >> misses; is this the effect of DDIO turning on?
> > 
> > Yes, exactly.
> > 
> > It is very high that 37.296 % of all cache refs are cache-misses.
> > The number of cache-misses, 8,865,698, is close to your reported 9.4
> > Mpps.  Thus, it seems to correlate with the idea that this is
> > DDIO-missing, as you have a miss per packet.
> > 
> > I can see that you have selected a subset of the CPUs (0,1,2,13); it
> > is important that these are the active CPUs.  I usually only select a
> > single/individual CPU to make sure I can reason about the numbers.
> > I've seen before that some CPUs get the DDIO effect and others do not,
> > so watch out for this.
> > 
> > If you add the HW-counters -e instructions -e cycles to your perf stat
> > command, you will also see the instructions per cycle calculation.  You
> > should notice that the cache-misses also cause this number to drop:
> > when the CPU stalls, it cannot keep its pipeline full/busy.
> > 
> > What kind of CPU are you using?
> > Specifically cache-sizes (use dmidecode look for "Cache Information")
> >   
> I'm using an Intel Xeon Gold 5120, L1: 896 KiB, L2: 14 MiB, L3: 19.25 MiB.

Is this a NUMA system?

The numbers you report are for all cores together.  Looking at [1] and
[2], I can see this is a 14-core CPU. According to [3] the cache is:

Level 1 cache size:
	14 x 32 KB 8-way set associative instruction caches
	14 x 32 KB 8-way set associative data caches

Level 2 cache size:
 	14 x 1 MB 16-way set associative caches

Level 3 cache size:
	19.25 MB 11-way set associative non-inclusive shared cache
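
To make the per-core versus all-cores distinction concrete, here is a
minimal sketch that sums the per-core sizes from the listing above.  The
constant and function names are made up for illustration; only the sizes
and the core count come from [3].

```python
# Per-core cache sizes for the Xeon Gold 5120, taken from the listing above.
# All names here are illustrative, not from any tool or library.
CORES = 14
L1I_KIB_PER_CORE = 32   # instruction cache
L1D_KIB_PER_CORE = 32   # data cache
L2_MIB_PER_CORE = 1
L3_MIB_SHARED = 19.25   # shared by all cores, so not multiplied


def cache_totals(cores=CORES):
    """Return the package-wide totals obtained by summing over all cores."""
    return {
        "l1_total_kib": cores * (L1I_KIB_PER_CORE + L1D_KIB_PER_CORE),
        "l2_total_mib": cores * L2_MIB_PER_CORE,
        "l3_total_mib": L3_MIB_SHARED,
    }


totals = cache_totals()
print(totals)
```

The summed L1 and L2 values match the 896 KiB and 14 MiB reported
earlier, but a single core only has 32 KB of L1 data cache and 1 MB of L2
to itself.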

One thing that catches my eye is the "non-inclusive" cache, and that [4]
states "rearchitected cache hierarchy designed for server workloads".



[1] https://en.wikichip.org/wiki/intel/xeon_gold/5120
[2] https://ark.intel.com/content/www/us/en/ark/products/120474/intel-xeon-gold-5120-processor-19-25m-cache-2-20-ghz.html
[3] https://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%205120.html
[4] https://en.wikichip.org/wiki/intel/xeon_gold

> > The performance drop is a little too large 39.4 Mpps -> 9.4 Mpps.
> > 
> > If I were you, I would measure the speed of the memory, using the
> > lmbench-3.0 tool 'lat_mem_rd'.
> > 
> >   /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_mem_rd 2000 128
> > 
> > The output is the nanosec latency of accessing increasing sizes of
> > memory.  The jumps/increases in latency should be fairly clear and
> > show the latency of the different cache levels.  For my CPU E5-1650 v4
> > @ 3.60GHz with 15MB L3 cache, I see L1=1.055ns, L2=5.521ns, L3=17.569ns.
> > (I could not find a tool that tells me the cost of accessing main-memory,
> > but maybe it is the 17.569ns, as the tool measurement jumps from 12MB
> > (5.933ns) to 16MB (12.334ns) and I know L3 is 15MB, so I don't get an
> > accurate L3 measurement.)
> >   
> I ran the benchmark. I can see two distinct jumps (L1 and L2 cache, I
> guess) at 1.543ns and 5.400ns, but then the latency grows gradually:

I guess you left out some numbers below for the 1.543ns measurement you
mention in the text.  There is a plateau at 5.508ns, and another
plateau at 8.629ns, which could be L3?

> 0.25000 5.400
> 0.37500 5.508
> 0.50000 5.508
> 0.75000 6.603
> 1.00000 8.247
> 1.50000 8.616
> 2.00000 8.747
> 3.00000 8.629
> 4.00000 8.629
> 6.00000 8.676
> 8.00000 8.800
> 12.00000 9.119
> 16.00000 10.840
> 24.00000 16.650
> 32.00000 19.888
> 48.00000 21.582
> 64.00000 22.519
> 96.00000 23.473
> 128.00000 24.125
> 192.00000 24.777
> 256.00000 25.124
> 384.00000 25.445
> 512.00000 25.642
> 768.00000 25.775
> 1024.00000 25.869
> 1536.00000 25.942
> I can't really tell where L3 cache and main memory start.

I guess the plateau around 25.445ns is the main memory speed. 
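
As a rough cross-check of the plateaus read off by eye, here is a small
sketch that groups consecutive lat_mem_rd samples whose latency changes
only slightly.  The 2.5% threshold and the minimum of three samples per
plateau are arbitrary assumptions of mine, and the function names are
made up; the data is the output quoted above.

```python
# (size in MB, latency in ns) pairs copied from the lat_mem_rd output above.
SAMPLES = [
    (0.25, 5.400), (0.375, 5.508), (0.5, 5.508), (0.75, 6.603),
    (1.0, 8.247), (1.5, 8.616), (2.0, 8.747), (3.0, 8.629),
    (4.0, 8.629), (6.0, 8.676), (8.0, 8.800), (12.0, 9.119),
    (16.0, 10.840), (24.0, 16.650), (32.0, 19.888), (48.0, 21.582),
    (64.0, 22.519), (96.0, 23.473), (128.0, 24.125), (192.0, 24.777),
    (256.0, 25.124), (384.0, 25.445), (512.0, 25.642), (768.0, 25.775),
    (1024.0, 25.869), (1536.0, 25.942),
]


def summarize(run):
    """Return (first_size_mb, last_size_mb, mean_latency_ns) for one run."""
    mean_ns = sum(latency for _, latency in run) / len(run)
    return (run[0][0], run[-1][0], round(mean_ns, 3))


def find_plateaus(samples, rel_tol=0.025, min_points=3):
    """Group consecutive samples whose latency step is at most rel_tol.

    rel_tol and min_points are assumed values, not something lmbench defines.
    """
    plateaus = []
    run = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        if abs(cur[1] - prev[1]) / prev[1] <= rel_tol:
            run.append(cur)          # still on the same plateau
        else:
            if len(run) >= min_points:
                plateaus.append(summarize(run))
            run = [cur]              # latency jumped: start a new run
    if len(run) >= min_points:
        plateaus.append(summarize(run))
    return plateaus


plateaus = find_plateaus(SAMPLES)
for first_mb, last_mb, mean_ns in plateaus:
    print(f"{first_mb:g}-{last_mb:g} MB: ~{mean_ns} ns")
```

With these assumed thresholds the sketch finds three flat regions: one
around 5.5ns, one around 8.7ns, and one from 192 MB upwards around 25.5ns.
Because it only compares neighbouring samples, a slow drift can stay in a
single group, so treat the boundaries as approximate.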

The latency difference is very large, but the performance drop is still
too large: 39.4 Mpps -> 9.4 Mpps.  Back-of-envelope calc: 8.629ns to
25.445ns is approx a factor 3 (25.445/8.629=2.948).  9.4 Mpps x factor
is 27.7 Mpps, and 39.4 Mpps div factor is 13.36 Mpps.  Neither matches
the other measurement, meaning the latency increase alone doesn't add up
to explain this difference.
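
The same back-of-envelope calculation as a small sketch, so the numbers
are easy to re-run with other measurements.  The variable names are mine,
and treating 8.629ns as L3 and 25.445ns as main memory is the guess from
above, not a confirmed fact.

```python
# Latencies read from the lat_mem_rd output; which cache level they belong
# to is an assumption, as discussed in the text.
L3_LATENCY_NS = 8.629
DRAM_LATENCY_NS = 25.445

# Packet rates reported for RX ring size 128 and 512.
FAST_MPPS = 39.4
SLOW_MPPS = 9.4

latency_factor = DRAM_LATENCY_NS / L3_LATENCY_NS   # approx 2.95
observed_factor = FAST_MPPS / SLOW_MPPS            # approx 4.19

# What each rate would be if the latency factor alone explained the drop.
predicted_slow_mpps = FAST_MPPS / latency_factor
predicted_fast_mpps = SLOW_MPPS * latency_factor

print(f"latency factor:  {latency_factor:.3f}")
print(f"observed factor: {observed_factor:.3f}")
print(f"predicted slow:  {predicted_slow_mpps:.2f} Mpps (measured {SLOW_MPPS})")
print(f"predicted fast:  {predicted_fast_mpps:.2f} Mpps (measured {FAST_MPPS})")
```

The observed slowdown (about 4.2x) is clearly larger than the latency
ratio (about 2.9x), which is the mismatch described above.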


> One thing I forgot to mention is that I experience the same performance
> drop even without specifying the --readmem flag of the bpf sample
> (no_touch mode). If I'm not wrong, without the flag the ebpf program
> should not access the packet buffer, and therefore DDIO should
> have no effect.

I was going to ask you to swap between --readmem flag and no_touch
mode, and then measure if perf-stat cache-misses stay the same.  It
sounds like you already did this?

DDIO/DCA is something the CPU chooses to do, based on a proprietary
design by Intel.  Thus, it is hard to say why DDIO is acting like this,
e.g. still causing a cache-miss even when using no_touch mode.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

