From: Stephen Hemminger <stephen@networkplumber.org>
To: Akihiko Odaki <akihiko.odaki@daynix.com>
Cc: Jason Wang <jasowang@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Willem de Bruijn <willemdebruijn.kernel@gmail.com>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Xuan Zhuo <xuanzhuo@linux.alibaba.com>,
	Shuah Khan <shuah@kernel.org>,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, kvm@vger.kernel.org,
	virtualization@lists.linux-foundation.org,
	linux-kselftest@vger.kernel.org,
	Yuri Benditovich <yuri.benditovich@daynix.com>,
	Andrew Melnychenko <andrew@daynix.com>,
	gur.stavi@huawei.com
Subject: Re: [PATCH RFC v4 0/9] tun: Introduce virtio-net hashing feature
Date: Tue, 1 Oct 2024 09:31:05 -0700
Message-ID: <20241001093105.126dacd6@hermes.local>
In-Reply-To: <f437d2d6-e4a2-4539-bd30-f312bbf0eac8@daynix.com>

On Tue, 1 Oct 2024 14:54:29 +0900
Akihiko Odaki <akihiko.odaki@daynix.com> wrote:

> On 2024/09/30 0:33, Stephen Hemminger wrote:
> > On Sun, 29 Sep 2024 16:10:47 +0900
> > Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >   
> >> On 2024/09/29 11:07, Jason Wang wrote:  
> >>> On Fri, Sep 27, 2024 at 3:51 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:  
> >>>>
> >>>> On 2024/09/27 13:31, Jason Wang wrote:  
> >>>>> On Fri, Sep 27, 2024 at 10:11 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:  
> >>>>>>
> >>>>>> On 2024/09/25 12:30, Jason Wang wrote:  
> >>>>>>> On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:  
> >>>>>>>>
> >>>>>>>> virtio-net has two uses for hashes: one is RSS and the other is hash
> >>>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
> >>>>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>>>> purpose of RSS.
> >>>>>>>>
> >>>>>>>> Another approach is to use an eBPF steering program. This approach has
> >>>>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>>>> restrictive nature of eBPF.
> >>>>>>>>
> >>>>>>>> Introduce code to compute hashes in the kernel in order to overcome
> >>>>>>>> these challenges.
> >>>>>>>>
> >>>>>>>> An alternative solution is to extend the eBPF steering program so that it
> >>>>>>>> will be able to report to the userspace, but it is based on context
> >>>>>>>> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
> >>>>>>>> not be UAPIs. We opt for ioctl to align with other relevant UAPIs (KVM
> >>>>>>>> and vhost_net).
> >>>>>>>>     
> >>>>>>>
> >>>>>>> I wonder if we could clone the skb and reuse some to store the hash,
> >>>>>>> then the steering eBPF program can access these fields without
> >>>>>>> introducing full RSS in the kernel?  
> >>>>>>
> >>>>>> I don't get how cloning the skb can solve the issue.
> >>>>>>
> >>>>>> We can certainly implement Toeplitz function in the kernel or even with
> >>>>>> tc-bpf to store a hash value that can be used for eBPF steering program
> >>>>>> and virtio hash reporting. However we don't have a means of storing a
> >>>>>> hash type, which is specific to virtio hash reporting and lacks a
> >>>>>> corresponding skb field.  
> >>>>>
> >>>>> I may be missing something, but looking at sk_filter_is_valid_access(), it
> >>>>> looks to me like we can make use of skb->cb[0..4]?  
> >>>>
> >>>> I didn't opt to use cb. Below is the rationale:
> >>>>
> >>>> cb is for tail call so it means we reuse the field for a different
> >>>> purpose. The context rewrite allows adding a field without increasing
> >>>> the size of the underlying storage (the real sk_buff) so we should add a
> >>>> new field instead of reusing an existing field to avoid confusion.
> >>>>
> >>>> We are however no longer allowed to add a new field. In my
> >>>> understanding, this is because it is a UAPI, and eBPF maintainers found
> >>>> it difficult to maintain its stability.
> >>>>
> >>>> Reusing cb for hash reporting is a workaround to avoid having a new
> >>>> field, but it does not solve the underlying problem (i.e., keeping eBPF
> >>>> as stable as UAPI is unreasonably hard). In my opinion, adding an ioctl
> >>>> is a reasonable option to keep the API as stable as other virtualization
> >>>> UAPIs while respecting the underlying intention of the context rewrite
> >>>> feature freeze.  
> >>>
> >>> Fair enough.
> >>>
> >>> Btw, I remember DPDK implements tuntap RSS via eBPF as well (probably
> >>> via cls or other). It might be worth checking whether we missed anything here.  
> >>
> >> Thanks for the information. I wonder why they used cls instead of
> >> a steering program. Perhaps it is for compatibility with macvtap
> >> and ipvtap, which don't support steering programs.
> >>
> >> Their RSS implementation looks cleaner so I will improve my RSS
> >> implementation accordingly.
> >>  
> > 
> > DPDK needs to support flow rules. The specific case is where packets
> > are classified by a flow, then RSS is done across a subset of the queues.
> > The support for flow in TUN driver is more academic than useful,
> > I fixed it for current BPF, but doubt anyone is using it really.
> > 
> > A full steering program would be good, but would require much more
> > complexity to take a general set of flow rules then communicate that
> > to the steering program.
> >   
> 
> It reminded me of RSS context and flow filter. Some physical NICs 
> support using a dedicated RSS context for packets matched by a flow 
> filter, and virtio is also gaining corresponding features.
> 
> RSS context: https://github.com/oasis-tcs/virtio-spec/issues/178
> Flow filter: https://github.com/oasis-tcs/virtio-spec/issues/179
> 
> I considered the possibility of supporting these features with tc 
> instead of adding ioctls to tuntap, but it does not seem appropriate for 
> the virtualization use case.
> 
> In a virtualization use case, tuntap is configured according to requests 
> of guests, and the code processing these requests needs to have minimal 
> permissions for security. This goal is achieved by passing a file 
> descriptor that represents a tuntap from a privileged process (e.g., 
> libvirt) to the process handling guest requests (e.g., QEMU).
> 
> However, tc is configured with rtnetlink, which does not seem to have an 
> interface to delegate a permission for one particular device to another 
> process.
> 
> For now I'll continue working on the current approach that is based on 
> ioctl and lacks RSS context and flow filter features. Eventually they 
> are also likely to require new ioctls if they are to be supported with 
> vhost_net.
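
For readers unfamiliar with the delegation model described in the quoted text: one common Unix way to hand an already-open descriptor from a privileged process to an unprivileged one is an SCM_RIGHTS control message over an AF_UNIX socket. The following is only a minimal sketch of that general mechanism. It is not taken from libvirt or QEMU, and the names send_fd(), recv_fd() and union fd_cmsg are hypothetical helpers chosen for illustration.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send the open descriptor fd over the connected AF_UNIX socket sock.
 * Returns 0 on success, -1 on failure. */
static int send_fd(int sock, int fd)
{
    char dummy = 0; /* one byte of ordinary data carries the control message */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_cmsg ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock. Returns the new descriptor, or -1. */
static int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_cmsg ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver ends up with its own descriptor for the same open file, so it can issue ioctls on that one device without holding any broader privilege. rtnetlink, as noted above, does not seem to offer a comparable per-device handle.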

The DPDK flow handling (rte_flow) was started by Mellanox and many of
the features are to support what that NIC can do. Would be good to have
a tc way to configure that (or devlink).
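
As background for the Toeplitz function mentioned earlier in the quoted thread: the hash used for RSS XORs together 32-bit windows of a secret key, one window per set bit of the input. The sketch below shows that general technique only. It is not the code from the patch series, and the function name toeplitz_hash() and its signature are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Toeplitz hash: for every set bit of the input, taken most significant
 * bit first, XOR in the 32-bit window of the key that starts at the same
 * bit offset. The key must be at least len + 4 bytes long; this sketch
 * returns 0 if it is shorter. */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                              const uint8_t *data, size_t len)
{
    uint32_t window;
    uint32_t hash = 0;
    size_t i;
    int bit;

    if (key_len < len + 4)
        return 0;

    /* The first window is the first four key bytes in big-endian order. */
    window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
             ((uint32_t)key[2] << 8) | (uint32_t)key[3];

    for (i = 0; i < len; i++) {
        for (bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit))
                hash ^= window;
            /* Advance the window by one bit along the key. */
            window = (window << 1) | ((uint32_t)(key[i + 4] >> bit) & 1u);
        }
    }
    return hash;
}
```

For RSS the input is typically the concatenation of source address, destination address and, when a transport header is hashed, source and destination port, all in network byte order. The resulting value is then used to index an indirection table that selects the queue.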
