Date: Mon, 26 Oct 2020 09:14:48 +0100
From: Jesper Dangaard Brouer
To: Federico Parola
Cc: xdp-newbies@vger.kernel.org, brouer@redhat.com
Subject: Re: Multi-core scalability problems
Message-ID: <20201026091448.1a407c86@carbon>
References: <87r1q29ita.fsf@toke.dk>
 <20201014111529.6e464b4b@carbon>
 <20201014162636.39c2ba14@carbon>
 <20201015152252.4360cf9a@carbon>
 <20201019202621.690aaacd@carbon>

On Sat, 24 Oct 2020 15:57:50 +0200 Federico Parola wrote:

> On 19/10/20 20:26, Jesper Dangaard Brouer wrote:
> > On Mon, 19 Oct 2020 17:23:18 +0200
> > Federico Parola wrote:
> >>
> >> [...]
> >>
> >> Hi Jesper, sorry for the late reply, these are the cache refs/misses
> >> for 4 flows and different rx ring sizes:
> >>
> >> RX 512 (9.4 Mpps dropped):
> >>  Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
> >>     23771011   cache-references                               ( +- 0.04% )
> >>      8865698   cache-misses    # 37.296 % of all cache refs   ( +- 0.04% )
> >>
> >> RX 128 (39.4 Mpps dropped):
> >>  Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
> >>     68177470   cache-references                               ( +- 0.01% )
> >>        23898   cache-misses    #  0.035 % of all cache refs   ( +- 3.23% )
> >>
> >> Reducing the size of the rx ring leads to a huge decrease in cache
> >> misses; is this the effect of DDIO turning on?
> >
> > Yes, exactly.
> >
> > It is very high that 37.296 % of all cache refs are cache-misses.
> > The number of cache-misses, 8,865,698, is close to your reported 9.4
> > Mpps. Thus, it seems to correlate with the idea that this is DDIO
> > missing, as you have a miss per packet.
> >
> > I can see that you have selected a subset of the CPUs (0,1,2,13); it
> > is important that these are the active CPUs. I usually select only a
> > single/individual CPU, to make sure I can reason about the numbers.
> > I've seen before that some CPUs get the DDIO effect and others don't,
> > so watch out for this.
> >
> > If you add the HW counters -e instructions -e cycles to your perf stat
> > command, you will also see the instructions-per-cycle calculation. You
> > should notice that the cache-misses also cause this number to drop,
> > as the CPU stalls and cannot keep its pipeline full/busy.
> >
> > What kind of CPU are you using?
> > Specifically the cache sizes (use dmidecode and look for "Cache
> > Information").
>
> I'm using an Intel Xeon Gold 5120, L1: 896 KiB, L2: 14 MiB, L3: 19.25 MiB.

Is this a NUMA system?

The numbers you report are for all cores together. Looking at [1] and
[2], I can see this is a 14-core CPU. According to [3] the cache is:

 Level 1 cache size:
  14 x 32 KB 8-way set associative instruction caches
  14 x 32 KB 8-way set associative data caches
 Level 2 cache size:
  14 x 1 MB 16-way set associative caches
 Level 3 cache size:
  19.25 MB 11-way set associative non-inclusive shared cache

One thing that catches my eye is the "non-inclusive" cache. And that
[4] states "rearchitected cache hierarchy designed for server
workloads".

[1] https://en.wikichip.org/wiki/intel/xeon_gold/5120
[2] https://ark.intel.com/content/www/us/en/ark/products/120474/intel-xeon-gold-5120-processor-19-25m-cache-2-20-ghz.html
[3] https://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%205120.html
[4] https://en.wikichip.org/wiki/intel/xeon_gold

> > The performance drop is a little too large: 39.4 Mpps -> 9.4 Mpps.
> >
> > If I were you, I would measure the speed of the memory, using the
> > lmbench-3.0 tool 'lat_mem_rd':
> >
> >  /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_mem_rd 2000 128
> >
> > The output is the nanosec latency of accessing increasing sizes of
> > memory. The jumps/increases in latency should be fairly clear and
> > show the latency of the different cache levels. For my CPU E5-1650 v4
> > @ 3.60GHz with 15MB L3 cache, I see L1=1.055ns, L2=5.521ns, L3=17.569ns.
> > (I could not find a tool that tells me the cost of accessing main
> > memory, but maybe it is the 17.569ns, as the tool's measurement jumps
> > from 12MB (5.933ns) to 16MB (12.334ns) and I know L3 is 15MB, so I
> > don't get an accurate L3 measurement.)
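(Side note, to be concrete about the perf stat command I referred to
further up: the invocation I have in mind looks roughly like the one
below. The CPU number, the 10 repetitions and the 1 second measurement
window are just examples, adjust them to the core(s) that actually do
the RX processing on your box:

  sudo perf stat -a -C 2 -r 10 \
       -e cache-references -e cache-misses \
       -e instructions -e cycles \
       -- sleep 1

With instructions and cycles included, perf also prints the
insn-per-cycle ratio, which should drop when the cache-miss rate goes
up.)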
>
> I ran the benchmark, I can see two well distinct jumps (L1 and L2
> cache I guess) of 1.543ns and 5.400ns, but then the latency grows
> gradually:

I guess you left out some numbers below for the 1.543ns measurement you
mention in the text. There is a plateau at 5.508ns, and another plateau
at 8.629ns, which could be L3?

> 0.25000 5.400
> 0.37500 5.508
> 0.50000 5.508
> 0.75000 6.603
> 1.00000 8.247
> 1.50000 8.616
> 2.00000 8.747
> 3.00000 8.629
> 4.00000 8.629
> 6.00000 8.676
> 8.00000 8.800
> 12.00000 9.119
> 16.00000 10.840
> 24.00000 16.650
> 32.00000 19.888
> 48.00000 21.582
> 64.00000 22.519
> 96.00000 23.473
> 128.00000 24.125
> 192.00000 24.777
> 256.00000 25.124
> 384.00000 25.445
> 512.00000 25.642
> 768.00000 25.775
> 1024.00000 25.869
> 1536.00000 25.942
>
> I can't really tell where L3 cache and main memory start.

I guess the plateau around 25.445ns is the main memory speed.

The latency difference is very large, but the performance drop is still
too large: 39.4 Mpps -> 9.4 Mpps.

Back-of-envelope calc: going from 8.629ns to 25.445ns is approx a
factor 3 (25.445/8.629 = 2.948). 9.4 Mpps times that factor is 27.7
Mpps, and 39.4 Mpps divided by the factor is 13.36 Mpps. Meaning it
doesn't add up to explain this difference.

> One thing I forgot to mention is that I experience the same performance
> drop even without specifying the --readmem flag of the bpf sample
> (no_touch mode). If I'm not wrong, without the flag the eBPF program
> should not access the packet buffer, and therefore DDIO should have no
> effect.

I was going to ask you to swap between the --readmem flag and no_touch
mode, and then measure if the perf stat cache-misses stay the same. It
sounds like you already did this?

The DDIO/DCA is something the CPU chooses to do, based on proprietary
design by Intel. Thus, it is hard to say why DDIO is acting like this,
e.g. still causing a cache-miss even when using no_touch mode.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer