From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 01590C433B4 for ; Thu, 1 Apr 2021 18:28:06 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D124B601FC for ; Thu, 1 Apr 2021 18:28:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238044AbhDAS2D (ORCPT ); Thu, 1 Apr 2021 14:28:03 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36920 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239494AbhDASQd (ORCPT ); Thu, 1 Apr 2021 14:16:33 -0400 Received: from mail-ej1-x62b.google.com (mail-ej1-x62b.google.com [IPv6:2a00:1450:4864:20::62b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 11314C02FEA4; Thu, 1 Apr 2021 09:09:47 -0700 (PDT) Received: by mail-ej1-x62b.google.com with SMTP id jy13so3695952ejc.2; Thu, 01 Apr 2021 09:09:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:content-transfer-encoding:in-reply-to; bh=us+3Yp591gmxYbF9mhS2gMKfe8LxyfhDKgXk1LVCWx4=; b=iFnOKtd9w2BlsxxD++8UE77w+uiRs8+99uEAykGyQ93TTtc2v1tkfBZhEzQOK/v34H 9MJK/+S7fPFlWD8j3CTm5Jxp/Hp1W4rRnAyj3ifUL1BFItCqHPoC5J7hBgGRh2ixrulA +rgFpSwS3z+/AoZGJH540lxqLURMYpy4xDJQLF5BoMjhV9dQxeB1s7u9WED94LE6c0B/ ybEmgo95jcAo9tRFy8hP4HqdcoKJaYDZYCjykWjikLoz/GZc8ambESkqHZ6qush8TazO 8j2gZ0nf6CDliR2dk5RuKK93Ck3d+G0XiFbw8G8MIGmcQAHcNjxP+kpU4UQl1pB/WG6u 9tFA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:content-transfer-encoding :in-reply-to; bh=us+3Yp591gmxYbF9mhS2gMKfe8LxyfhDKgXk1LVCWx4=; b=R14i4dueZnoubT/HmGZI20kOE5FMAyXCLd/Rc4FRSfR4qZdLhrhLz+plwr1y0HV4xH QuftmJq/4WmqrfDK+o77ZMIPZCJZrCPQUMVYCfijrcsObcp9T+DiFZVHVXsONTtVs9gd K2Y0QYYud9tQnnf9OP2CB0cZsRnDlehvDn27XQMguURqbnH6AESYqRJAD7x0fNV++CWr +ml3fR4M4qcww3GUhPhoI1Qb9wOAy8pPRYmKZKuY39lF+znTzPgMaZFcYYIeN4Vf5djs pJok3pA4noYwU6fbJcZM/ot3s76wQOXxi1Xbm/lQiW9cnQsW9MvrlF2rjFwLlz9WKN4v 6sgg== X-Gm-Message-State: AOAM530j8IqqhDdyRoDk4FJlEYM6awyVB+OP47jbljeK6hhbRtHJPgPE pYA/ZCMPUU5NW0kEwJQbfRA= X-Google-Smtp-Source: ABdhPJyrwyPfRMr4KqfP6OorQAfwCEQQlVGm7Sxql3/HJLyN8t1gJQzvMgwEIaQA6QuO6HlcywSVEQ== X-Received: by 2002:a17:906:fcb2:: with SMTP id qw18mr9578431ejb.434.1617293385778; Thu, 01 Apr 2021 09:09:45 -0700 (PDT) Received: from skbuf (5-12-16-165.residential.rdsnet.ro. [5.12.16.165]) by smtp.gmail.com with ESMTPSA id l12sm3617631edb.39.2021.04.01.09.09.43 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 01 Apr 2021 09:09:45 -0700 (PDT) Date: Thu, 1 Apr 2021 19:09:43 +0300 From: Vladimir Oltean To: Toke =?utf-8?Q?H=C3=B8iland-J=C3=B8rgensen?= Cc: Jakub Kicinski , Alexei Starovoitov , Daniel Borkmann , Jesper Dangaard Brouer , John Fastabend , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , KP Singh , "David S. Miller" , netdev@vger.kernel.org, bpf@vger.kernel.org, Alexander Duyck , Ioana Ciornei , Alex Marginean , Claudiu Manoil , Ilias Apalodimas , Vladimir Oltean Subject: Re: [PATCH net-next 9/9] net: enetc: add support for XDP_REDIRECT Message-ID: <20210401160943.frw7l3rio7spr33n@skbuf> References: <20210331200857.3274425-1-olteanv@gmail.com> <20210331200857.3274425-10-olteanv@gmail.com> <87blaynt4l.fsf@toke.dk> <20210401113133.vzs3uxkp52k2ctla@skbuf> <875z16nsiu.fsf@toke.dk> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <875z16nsiu.fsf@toke.dk> Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org On Thu, Apr 01, 2021 at 01:39:05PM +0200, Toke Høiland-Jørgensen wrote: > Vladimir Oltean writes: > > > On Thu, Apr 01, 2021 at 01:26:02PM +0200, Toke Høiland-Jørgensen wrote: > >> > +int enetc_xdp_xmit(struct net_device *ndev, int num_frames, > >> > + struct xdp_frame **frames, u32 flags) > >> > +{ > >> > + struct enetc_tx_swbd xdp_redirect_arr[ENETC_MAX_SKB_FRAGS] = {0}; > >> > + struct enetc_ndev_priv *priv = netdev_priv(ndev); > >> > + struct enetc_bdr *tx_ring; > >> > + int xdp_tx_bd_cnt, i, k; > >> > + int xdp_tx_frm_cnt = 0; > >> > + > >> > + tx_ring = priv->tx_ring[smp_processor_id()]; > >> > >> What mechanism guarantees that this won't overflow the array? :) > > > > Which array, the array of TX rings? > > Yes. > The problem isn't even accessing an out-of-bounds element in the TX ring array. As it turns out, I had a relatively superficial understanding of how things are organized, but let me try to explain. The number of TX rings is a configurable resource (between PFs and VFs) and we read the capability at probe time: enetc_get_si_caps: val = enetc_rd(hw, ENETC_SICAPR0); si->num_rx_rings = (val >> 16) & 0xff; si->num_tx_rings = val & 0xff; enetc_init_si_rings_params: priv->num_tx_rings = si->num_tx_rings; In any case, the TX array is declared as: struct enetc_ndev_priv { struct enetc_bdr *tx_ring[16]; struct enetc_bdr *rx_ring[16]; }; because that's the maximum hardware capability. The priv->tx_ring array is populated in: enetc_alloc_msix: /* # of tx rings per int vector */ v_tx_rings = priv->num_tx_rings / priv->bdr_int_num; for (i = 0; i < priv->bdr_int_num; i++) { for (j = 0; j < v_tx_rings; j++) { if (priv->bdr_int_num == ENETC_MAX_BDR_INT) idx = 2 * j + i; /* 2 CPUs */ else idx = j + i * v_tx_rings; /* default */ priv->tx_ring[idx] = bdr; } } priv->bdr_int_num is set to "num_online_cpus()". On LS1028A, it can be either 1 or 2 (and the ENETC_MAX_BDR_INT macro is equal to 2). Otherwise said, the convoluted logic above does the following: - It affines an MSI interrupt vector per CPU - It affines an RX ring per MSI vector, hence per CPU - It balances the fixed number of TX rings (say 8) among the available MSI vectors, hence CPUs (say 2). It does this by iterating with i through the RX MSI interrupt vectors, and with j through the number of TX rings per MSI vector. This logic maps: - the even TX rings to CPU 0 and the odd TX rings to CPU 1, if 2 CPUs are used - all TX rings to CPU 0, if 1 CPU is used This is done because we have this logic in enetc_poll: for (i = 0; i < v->count_tx_rings; i++) if (!enetc_clean_tx_ring(&v->tx_ring[i], budget)) complete = false; for processing the TX completions of a given group of TX rings in the RX MSI interrupt handler of a certain CPU. Otherwise said, priv->tx_ring[i] is always BD ring i, and that mapping never changes. All 8 TX rings are enabled and available for use. What I knew about tc-taprio and tc-mqprio is that they only enqueue to TX queues [0, num_tc-1] because of this, as it turns out: enetc_xmit: tx_ring = priv->tx_ring[skb->queue_mapping]; where skb->queue_mapping is given by: err = netif_set_real_num_tx_queues(ndev, priv->num_tx_rings); and by this, respectively, from the mqprio code path: netif_set_real_num_tx_queues(ndev, num_tc); As for why XDP works, and priv->tx_ring[smp_processor_id()] is: - TX ring 0 for CPU 0 and TX ring 1 for CPU 1, if 2 CPUs are used - TX ring 0, if 1 CPU is used The TX completions in the first case are handled by: - CPU 0 for TX ring 0 (because it is even) and CPU 1 for TX ring 1 (because it is odd), if 2 CPUs are used, due to the mapping I talked about earlier - CPU 0 if only 1 CPU is used > > You mean that it's possible to receive a TC_SETUP_QDISC_MQPRIO or > > TC_SETUP_QDISC_TAPRIO with num_tc == 1, and we have 2 CPUs? > > Not just that, this ndo can be called on arbitrary CPUs after a > redirect. The code just calls through from the XDP receive path so which > CPU it ends up on depends on the RSS+IRQ config of the other device, > which may not even be the same driver; i.e., you have no control over > that... :) > What do you mean by "arbitrary" CPU? You can't plug CPUs in, it's a dual core system... Why does the source ifindex matter at all? I'm using the TX ring affined to the CPU that ndo_xdp_xmit is currently running on. > > Well, yeah, I don't know what's the proper way to deal with that. Ideas? > > Well the obvious one is just: > > tx_ring = priv->tx_ring[smp_processor_id() % num_ring_ids]; > > and then some kind of locking to deal with multiple CPUs accessing the > same TX ring... By multiple CPUs accessing the same TX ring, you mean locking between ndo_xdp_xmit and ndo_start_xmit? Can that even happen if the hardware architecture is to have at least as many TX rings as CPUs? Because otherwise, I see that ndo_xdp_xmit is only called from xdp_do_flush, which is in softirq context, which to my very rudimentary knowledge run with bottom halves, thus preemption, disabled? So I don't think it's possible for ndo_xdp_xmit and ndo_xmit, or even two ndo_xdp_xmit instances, to access the same TX ring? Sorry, I'm sure these are trivial questions, but I would like to really understand what I need to change and why :D