Subject: Re: Optimizing kernel compilation / alignments for network performance
Message-ID: <391ca2d1-6977-0c9b-588c-31ad9bb68c82@gmail.com>
Date: Tue, 10 May 2022 12:29:32 +0200
From: Rafał Miłecki
To: Andrew Lunn
Cc: Felix Fietkau, Arnd Bergmann, Alexander Lobakin, Network Development, linux-arm-kernel, Russell King, openwrt-devel@lists.openwrt.org, Florian Fainelli
X-Mailing-List: netdev@vger.kernel.org

On 6.05.2022 14:42, Andrew Lunn wrote:
>>> I just took a quick look at the driver. It allocates and maps rx
>>> buffers that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE =
>>> 9724. This seems rather excessive, especially since most people are
>>> going to use an MTU of 1500.
>>> My proposal would be to add support for making the rx buffer size
>>> dependent on the MTU, reallocating the ring on MTU changes.
>>> This should significantly reduce the time spent on flushing caches.
>>
>> Oh, that's important too, it was changed by commit 8c7da63978f1 ("bgmac:
>> configure MTU and add support for frames beyond 8192 byte size"):
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03
>>
>> It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps).
>>
>> I do all my testing with
>> #define BGMAC_RX_MAX_FRAME_SIZE 1536
>
> That helps show that cache operations are part of your bottleneck.
>
> Taking a quick look at the driver, on the receive side:
>
> 	/* Unmap buffer to make it accessible to the CPU */
> 	dma_unmap_single(dma_dev, dma_addr,
> 			 BGMAC_RX_BUF_SIZE, DMA_FROM_DEVICE);
>
> Here the data is unmapped, ready for the CPU to use.
>
> 	/* Get info from the header */
> 	len = le16_to_cpu(rx->len);
> 	flags = le16_to_cpu(rx->flags);
>
> 	/* Check for poison and drop or pass the packet */
> 	if (len == 0xdead && flags == 0xbeef) {
> 		netdev_err(bgmac->net_dev, "Found poisoned packet at slot %d, DMA issue!\n",
> 			   ring->start);
> 		put_page(virt_to_head_page(buf));
> 		bgmac->net_dev->stats.rx_errors++;
> 		break;
> 	}
>
> 	if (len > BGMAC_RX_ALLOC_SIZE) {
> 		netdev_err(bgmac->net_dev, "Found oversized packet at slot %d, DMA issue!\n",
> 			   ring->start);
> 		put_page(virt_to_head_page(buf));
> 		bgmac->net_dev->stats.rx_length_errors++;
> 		bgmac->net_dev->stats.rx_errors++;
> 		break;
> 	}
>
> 	/* Omit CRC. */
> 	len -= ETH_FCS_LEN;
>
> 	skb = build_skb(buf, BGMAC_RX_ALLOC_SIZE);
> 	if (unlikely(!skb)) {
> 		netdev_err(bgmac->net_dev, "build_skb failed\n");
> 		put_page(virt_to_head_page(buf));
> 		bgmac->net_dev->stats.rx_errors++;
> 		break;
> 	}
> 	skb_put(skb, BGMAC_RX_FRAME_OFFSET +
> 		BGMAC_RX_BUF_OFFSET + len);
> 	skb_pull(skb, BGMAC_RX_FRAME_OFFSET +
> 		 BGMAC_RX_BUF_OFFSET);
>
> 	skb_checksum_none_assert(skb);
> 	skb->protocol = eth_type_trans(skb, bgmac->net_dev);
>
> and this is the first access of the actual data.
> You can make the cache actually work for you, rather than against
> you, by adding a call to
>
> 	prefetch(buf);
>
> just after the dma_unmap_single(). That will start getting the frame
> header from DRAM into cache, so hopefully it is available by the time
> eth_type_trans() is called and you don't have a cache miss.

I don't think that analysis is correct. Please take a look at the
following lines:

	struct bgmac_rx_header *rx = slot->buf + BGMAC_RX_BUF_OFFSET;
	void *buf = slot->buf;

The first thing we do after the dma_unmap_single() call is read
rx->len, and rx points straight at the DMA data. There is nothing we
could keep the CPU busy with while prefetching the data.

FWIW I tried adding prefetch(buf); anyway. It didn't change NAT speed
by a single Mb/s. Speed was exactly the same as without the prefetch()
call.
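
[Editor's note: for illustration, this is roughly where the suggested
hint would sit in the rx path. It is a minimal sketch pieced together
from the excerpt quoted above, not the driver as committed, and it
makes Rafał's point visible: rx->len is read immediately after the
unmap, so the prefetch has nothing to overlap with.]

	#include <linux/prefetch.h>

	/* Per-slot rx handling, following the structure quoted above */
	struct bgmac_rx_header *rx = slot->buf + BGMAC_RX_BUF_OFFSET;
	void *buf = slot->buf;

	/* Unmap buffer to make it accessible to the CPU */
	dma_unmap_single(dma_dev, dma_addr,
			 BGMAC_RX_BUF_SIZE, DMA_FROM_DEVICE);

	/* Andrew's suggestion: start pulling the frame header from DRAM
	 * into cache. The very next statement reads rx->len, however, so
	 * there is no independent work to overlap with the prefetch;
	 * consistent with the measurement above showing no speedup. */
	prefetch(buf);

	/* Get info from the header */
	len = le16_to_cpu(rx->len);
	flags = le16_to_cpu(rx->flags);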
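
[Editor's note: for completeness, a rough sketch of what Felix's
MTU-dependent rx buffer proposal could look like. bgmac_rx_buf_size()
and bgmac_dma_rx_rebuild() are hypothetical names invented for this
example, and this bgmac_change_mtu() is a hypothetical variant, not
the one the commit cited above added.]

	#include <linux/etherdevice.h>
	#include <linux/if_vlan.h>
	#include <linux/skbuff.h>

	/* Hypothetical: size rx buffers for the current MTU instead of
	 * always allocating the BGMAC_RX_MAX_FRAME_SIZE (9724) worst case. */
	static unsigned int bgmac_rx_buf_size(unsigned int mtu)
	{
		unsigned int len = BGMAC_RX_FRAME_OFFSET + BGMAC_RX_BUF_OFFSET +
				   ETH_HLEN + VLAN_HLEN + mtu + ETH_FCS_LEN;

		/* build_skb() needs room for struct skb_shared_info at
		 * the end of the buffer */
		return SKB_DATA_ALIGN(len) +
		       SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
	}

	static int bgmac_change_mtu(struct net_device *net_dev, int new_mtu)
	{
		struct bgmac *bgmac = netdev_priv(net_dev);
		int err;

		/* Hypothetical helper: free the rx ring and re-allocate
		 * and re-map its buffers at the new, smaller size. */
		err = bgmac_dma_rx_rebuild(bgmac, bgmac_rx_buf_size(new_mtu));
		if (err)
			return err;

		net_dev->mtu = new_mtu;
		return 0;
	}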