Date: Tue, 1 Oct 2024 09:31:05 -0700
From: Stephen Hemminger
To: Akihiko Odaki
Cc: Jason Wang, Jonathan Corbet, Willem de Bruijn, "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, "Michael S. Tsirkin", Xuan Zhuo, Shuah Khan, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, kvm@vger.kernel.org, virtualization@lists.linux-foundation.org, linux-kselftest@vger.kernel.org, Yuri Benditovich, Andrew Melnychenko, gur.stavi@huawei.com
Subject: Re: [PATCH RFC v4 0/9] tun: Introduce virtio-net hashing feature
Message-ID: <20241001093105.126dacd6@hermes.local>
References: <20240924-rss-v4-0-84e932ec0e6c@daynix.com> <6c101c08-4364-4211-a883-cb206d57303d@daynix.com> <447dca19-58c5-4c01-b60e-cfe5e601961a@daynix.com> <20240929083314.02d47d69@hermes.local>
X-Mailing-List: virtualization@lists.linux.dev

On Tue, 1 Oct 2024 14:54:29 +0900
Akihiko Odaki wrote:

> On 2024/09/30 0:33, Stephen Hemminger wrote:
> > On Sun, 29 Sep 2024 16:10:47 +0900
> > Akihiko Odaki wrote:
> >
> >> On 2024/09/29 11:07, Jason Wang wrote:
> >>> On Fri, Sep 27, 2024 at 3:51 PM Akihiko Odaki wrote:
> >>>>
> >>>> On 2024/09/27 13:31, Jason Wang wrote:
> >>>>> On Fri, Sep 27, 2024 at 10:11 AM Akihiko Odaki wrote:
> >>>>>>
> >>>>>> On 2024/09/25 12:30, Jason Wang wrote:
> >>>>>>> On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki wrote:
> >>>>>>>>
> >>>>>>>> virtio-net has two uses for hashes: one is RSS and the other is hash
> >>>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
> >>>>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>>>> purpose of RSS.
> >>>>>>>>
> >>>>>>>> Another approach is to use an eBPF steering program. This approach has
> >>>>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>>>> restrictive nature of eBPF.
> >>>>>>>>
> >>>>>>>> Introduce code to compute hashes in the kernel in order to overcome
> >>>>>>>> these challenges.
> >>>>>>>>
> >>>>>>>> An alternative solution is to extend the eBPF steering program so that it
> >>>>>>>> will be able to report to the userspace, but it is based on context
> >>>>>>>> rewrites, which are in feature freeze. We can adopt kfuncs, but they will
> >>>>>>>> not be UAPIs. We opt for an ioctl to align with other relevant UAPIs (KVM
> >>>>>>>> and vhost_net).
> >>>>>>>>
> >>>>>>>
> >>>>>>> I wonder if we could clone the skb and reuse some fields to store the hash,
> >>>>>>> then the steering eBPF program can access these fields without
> >>>>>>> introducing full RSS in the kernel?
> >>>>>>
> >>>>>> I don't get how cloning the skb can solve the issue.
> >>>>>>
> >>>>>> We can certainly implement the Toeplitz function in the kernel or even with
> >>>>>> tc-bpf to store a hash value that can be used for the eBPF steering program
> >>>>>> and virtio hash reporting. However, we don't have a means of storing a
> >>>>>> hash type, which is specific to virtio hash reporting and lacks a
> >>>>>> corresponding skb field.
> >>>>>
> >>>>> I may be missing something, but looking at sk_filter_is_valid_access(), it
> >>>>> looks to me like we can make use of skb->cb[0..4]?
> >>>>
> >>>> I opted not to use cb. Below is the rationale:
> >>>>
> >>>> cb is for tail calls, so it means we reuse the field for a different
> >>>> purpose. The context rewrite allows adding a field without increasing
> >>>> the size of the underlying storage (the real sk_buff), so we should add a
> >>>> new field instead of reusing an existing field to avoid confusion.
> >>>>
> >>>> We are however no longer allowed to add a new field. In my
> >>>> understanding, this is because it is a UAPI, and eBPF maintainers found
> >>>> it difficult to maintain its stability.
> >>>>
> >>>> Reusing cb for hash reporting is a workaround to avoid having a new
> >>>> field, but it does not solve the underlying problem (i.e., keeping eBPF
> >>>> as stable as UAPI is unreasonably hard).
> >>>> In my opinion, adding an ioctl
> >>>> is a reasonable option to keep the API as stable as other virtualization
> >>>> UAPIs while respecting the underlying intention of the context rewrite
> >>>> feature freeze.
> >>>
> >>> Fair enough.
> >>>
> >>> Btw, I remember DPDK implements tuntap RSS via eBPF as well (probably
> >>> via cls or other). It might be worth seeing if we missed anything here.
> >>
> >> Thanks for the information. I wonder why they used cls instead of a
> >> steering program. Perhaps it is due to compatibility with macvtap
> >> and ipvtap, which don't support steering programs.
> >>
> >> Their RSS implementation looks cleaner so I will improve my RSS
> >> implementation accordingly.
> >>
> >
> > DPDK needs to support flow rules. The specific case is where packets
> > are classified by a flow, then RSS is done across a subset of the queues.
> > The support for flow in TUN driver is more academic than useful,
> > I fixed it for current BPF, but doubt anyone is using it really.
> >
> > A full steering program would be good, but would require much more
> > complexity to take a general set of flow rules then communicate that
> > to the steering program.
> >
>
> It reminded me of RSS context and flow filter. Some physical NICs
> support using a dedicated RSS context for packets matched by a flow
> filter, and virtio is also gaining corresponding features.
>
> RSS context: https://github.com/oasis-tcs/virtio-spec/issues/178
> Flow filter: https://github.com/oasis-tcs/virtio-spec/issues/179
>
> I considered the possibility of supporting these features with tc
> instead of adding ioctls to tuntap, but it does not seem appropriate for
> the virtualization use case.
>
> In a virtualization use case, tuntap is configured according to requests
> of guests, and the code processing these requests needs to have minimal
> permissions for security.
> This goal is achieved by passing a file
> descriptor that represents a tuntap from a privileged process (e.g.,
> libvirt) to the process handling guest requests (e.g., QEMU).
>
> However, tc is configured with rtnetlink, which does not seem to have an
> interface to delegate a permission for one particular device to another
> process.
>
> For now I'll continue working on the current approach that is based on
> ioctl and lacks RSS context and flow filter features. Eventually they
> are also likely to require new ioctls if they are to be supported with
> vhost_net.

The DPDK flow handling (rte_flow) was started by Mellanox and many of
the features are to support what that NIC can do. Would be good to have
a tc way to configure that (or devlink).
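
For readers following the thread, the case described above (classify a
packet by a flow rule, then do RSS across a subset of the queues) can be
sketched roughly as below. This is only an illustration in Python, not code
from the tun patches or from DPDK. The key, the port-443 rule, and the names
toeplitz_hash, flow_tuple and steer are all hypothetical, and the modulo
mapping stands in for a real indirection table.

```python
import socket
import struct

# Hypothetical 40-byte key for illustration; a real key is supplied by the guest.
RSS_KEY = bytes(range(1, 41))


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Return the 32-bit Toeplitz hash of data under key."""
    if len(key) < len(data) + 4:
        raise ValueError("key must be at least 4 bytes longer than the input")
    key_bits = int.from_bytes(key, "big")
    key_len = len(key) * 8
    result = 0
    for i in range(len(data) * 8):
        # Walk the input most significant bit first.
        if data[i // 8] & (0x80 >> (i % 8)):
            # XOR in the 32-bit window of the key that starts at key bit i.
            result ^= (key_bits >> (key_len - 32 - i)) & 0xFFFFFFFF
    return result


def flow_tuple(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> bytes:
    """Pack an IPv4 4-tuple in network byte order as hash input."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))


def steer(src_ip: str, dst_ip: str, src_port: int, dst_port: int):
    """Apply a flow rule first, then RSS over the queues that rule allows."""
    # Hypothetical flow rule: port 443 traffic is confined to queues 2 and 3.
    queues = [2, 3] if dst_port == 443 else [0, 1, 2, 3]
    value = toeplitz_hash(RSS_KEY, flow_tuple(src_ip, dst_ip, src_port, dst_port))
    # Simplification: modulo instead of an indirection table lookup.
    return value, queues[value % len(queues)]


if __name__ == "__main__":
    value, queue = steer("192.0.2.1", "192.0.2.2", 40000, 443)
    print(f"hash={value:#010x} queue={queue}")
```

steer() returns the hash together with the queue because the thread needs
the same value twice: once for queue selection and once for hash reporting
to the guest. The hash type discussed above is not modeled here.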