Date: Tue, 1 Oct 2024 09:31:05 -0700
From: Stephen Hemminger
To: Akihiko Odaki
Cc: Jason Wang, Jonathan Corbet, Willem de Bruijn, "David S. Miller",
    Eric Dumazet, Jakub Kicinski, Paolo Abeni, "Michael S. Tsirkin",
    Xuan Zhuo, Shuah Khan, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
    kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
    linux-kselftest@vger.kernel.org, Yuri Benditovich,
    Andrew Melnychenko, gur.stavi@huawei.com
Subject: Re: [PATCH RFC v4 0/9] tun: Introduce virtio-net hashing feature
Message-ID: <20241001093105.126dacd6@hermes.local>
References: <20240924-rss-v4-0-84e932ec0e6c@daynix.com>
    <6c101c08-4364-4211-a883-cb206d57303d@daynix.com>
    <447dca19-58c5-4c01-b60e-cfe5e601961a@daynix.com>
    <20240929083314.02d47d69@hermes.local>
X-Mailing-List: linux-doc@vger.kernel.org

On Tue, 1 Oct 2024 14:54:29 +0900
Akihiko Odaki wrote:

> On 2024/09/30 0:33, Stephen Hemminger wrote:
> > On Sun, 29 Sep 2024 16:10:47 +0900
> > Akihiko Odaki wrote:
> >
> >> On 2024/09/29 11:07, Jason Wang wrote:
> >>> On Fri, Sep 27, 2024 at 3:51 PM Akihiko Odaki wrote:
> >>>>
> >>>> On 2024/09/27 13:31, Jason Wang wrote:
> >>>>> On Fri, Sep 27, 2024 at 10:11 AM Akihiko Odaki wrote:
> >>>>>>
> >>>>>> On 2024/09/25 12:30, Jason Wang wrote:
> >>>>>>> On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki wrote:
> >>>>>>>>
> >>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
> >>>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
> >>>>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>>>> purpose of RSS.
> >>>>>>>>
> >>>>>>>> Another approach is to use eBPF steering program. This approach has
> >>>>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>>>> restrictive nature of eBPF.
> >>>>>>>>
> >>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome
> >>>>>>>> thse challenges.
> >>>>>>>>
> >>>>>>>> An alternative solution is to extend the eBPF steering program so that it
> >>>>>>>> will be able to report to the userspace, but it is based on context
> >>>>>>>> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
> >>>>>>>> not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
> >>>>>>>> and vhost_net).
> >>>>>>>>
> >>>>>>>
> >>>>>>> I wonder if we could clone the skb and reuse some to store the hash,
> >>>>>>> then the steering eBPF program can access these fields without
> >>>>>>> introducing full RSS in the kernel?
> >>>>>>
> >>>>>> I don't get how cloning the skb can solve the issue.
> >>>>>>
> >>>>>> We can certainly implement Toeplitz function in the kernel or even with
> >>>>>> tc-bpf to store a hash value that can be used for eBPF steering program
> >>>>>> and virtio hash reporting. However we don't have a means of storing a
> >>>>>> hash type, which is specific to virtio hash reporting and lacks a
> >>>>>> corresponding skb field.
> >>>>>
> >>>>> I may miss something but looking at sk_filter_is_valid_access(). It
> >>>>> looks to me we can make use of skb->cb[0..4]?
> >>>>
> >>>> I didn't opt to using cb. Below is the rationale:
> >>>>
> >>>> cb is for tail call so it means we reuse the field for a different
> >>>> purpose. The context rewrite allows adding a field without increasing
> >>>> the size of the underlying storage (the real sk_buff) so we should add a
> >>>> new field instead of reusing an existing field to avoid confusion.
> >>>>
> >>>> We are however no longer allowed to add a new field. In my
> >>>> understanding, this is because it is an UAPI, and eBPF maintainers found
> >>>> it is difficult to maintain its stability.
> >>>>
> >>>> Reusing cb for hash reporting is a workaround to avoid having a new
> >>>> field, but it does not solve the underlying problem (i.e., keeping eBPF
> >>>> as stable as UAPI is unreasonably hard). In my opinion, adding an ioctl
> >>>> is a reasonable option to keep the API as stable as other virtualization
> >>>> UAPIs while respecting the underlying intention of the context rewrite
> >>>> feature freeze.
> >>>
> >>> Fair enough.
> >>>
> >>> Btw, I remember DPDK implements tuntap RSS via eBPF as well (probably
> >>> via cls or other). It might worth to see if anything we miss here.
> >>
> >> Thanks for the information. I wonder why they used cls instead of
> >> steering program. Perhaps it may be due to compatibility with macvtap
> >> and ipvtap, which don't steering program.
> >>
> >> Their RSS implementation looks cleaner so I will improve my RSS
> >> implementation accordingly.
> >>
> >
> > DPDK needs to support flow rules. The specific case is where packets
> > are classified by a flow, then RSS is done across a subset of the queues.
> > The support for flow in TUN driver is more academic than useful,
> > I fixed it for current BPF, but doubt anyone is using it really.
> >
> > A full steering program would be good, but would require much more
> > complexity to take a general set of flow rules then communicate that
> > to the steering program.
> >
>
> It reminded me of RSS context and flow filter. Some physical NICs
> support to use a dedicated RSS context for packets matched with flow
> filter, and virtio is also gaining corresponding features.
>
> RSS context: https://github.com/oasis-tcs/virtio-spec/issues/178
> Flow filter: https://github.com/oasis-tcs/virtio-spec/issues/179
>
> I considered about the possibility of supporting these features with tc
> instead of adding ioctls to tuntap, but it seems not appropriate for
> virtualization use case.
>
> In a virtualization use case, tuntap is configured according to requests
> of guests, and the code processing these requests need to have minimal
> permissions for security. This goal is achieved by passing a file
> descriptor that represents a tuntap from a privileged process (e.g.,
> libvirt) to the process handling guest requests (e.g., QEMU).
>
> However, tc is configured with rtnetlink, which does not seem to have an
> interface to delegate a permission for one particular device to another
> process.
>
> For now I'll continue working on the current approach that is based on
> ioctl and lacks RSS context and flow filter features. Eventually they
> are also likely to require new ioctls if they are to be supported with
> vhost_net.

The DPDK flow handling (rte_flow) was started by Mellanox and many
of the features are to support what that NIC can do.
Would be good to have a tc way to configure that (or devlink).
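
For readers following the thread, here is a rough user-space sketch of
the two steps discussed above: computing a Toeplitz hash over a flow
tuple, then steering to a queue through an indirection table. It is
not the code from the patch series. The helper names (toeplitz_hash,
ipv4_tcp_input, select_queue) are made up for illustration, the key is
the widely published 40-byte RSS sample key, and the modulo lookup is
an assumption (real devices typically mask with a power-of-two table
size).

```python
import socket
import struct

# Widely published 40-byte sample RSS key; any key with
# len(key) >= len(input) + 4 works for this sketch.
RSS_KEY = bytes.fromhex(
    "6d5a56da255b0ed24167253d43a38fb0"
    "d0ca2bcbae7b30b477cb2da38030f20c"
    "6a42b73bbeac01fa"
)


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Return the 32-bit Toeplitz hash of data under key."""
    if len(key) < len(data) + 4:
        raise ValueError("key must be at least 4 bytes longer than input")
    key_bits = int.from_bytes(key, "big")
    key_len_bits = len(key) * 8
    result = 0
    for i in range(len(data) * 8):
        byte_index, bit_index = divmod(i, 8)
        if data[byte_index] & (0x80 >> bit_index):
            # XOR in the 32-bit key window that starts at bit i (MSB first).
            result ^= (key_bits >> (key_len_bits - 32 - i)) & 0xFFFFFFFF
    return result


def ipv4_tcp_input(src: str, dst: str, sport: int, dport: int) -> bytes:
    """Hypothetical helper: pack an IPv4/TCP 4-tuple in network byte order."""
    return (socket.inet_aton(src) + socket.inet_aton(dst)
            + struct.pack("!HH", sport, dport))


def select_queue(hash_value: int, indirection_table: list) -> int:
    """Assumed steering rule: index the indirection table by the hash."""
    return indirection_table[hash_value % len(indirection_table)]


if __name__ == "__main__":
    flow = ipv4_tcp_input("66.9.149.187", "161.142.100.80", 2794, 1766)
    h = toeplitz_hash(RSS_KEY, flow)
    print(f"hash={h:#010x} queue={select_queue(h, [0, 1, 2, 3])}")
```

Note that the function only yields the 32-bit hash value. The hash
type that virtio hash reporting also needs (which headers were hashed)
has to be carried separately, which is the missing-skb-field problem
raised earlier in the thread.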