From: wangyunjian <wangyunjian@huawei.com>
To: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
Paolo Abeni <pabeni@redhat.com>,
"willemdebruijn.kernel@gmail.com"
<willemdebruijn.kernel@gmail.com>,
"kuba@kernel.org" <kuba@kernel.org>,
"bjorn@kernel.org" <bjorn@kernel.org>,
"magnus.karlsson@intel.com" <magnus.karlsson@intel.com>,
"maciej.fijalkowski@intel.com" <maciej.fijalkowski@intel.com>,
"jonathan.lemon@gmail.com" <jonathan.lemon@gmail.com>,
"davem@davemloft.net" <davem@davemloft.net>,
"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"virtualization@lists.linux.dev" <virtualization@lists.linux.dev>,
xudingke <xudingke@huawei.com>,
"liwei (DT)" <liwei395@huawei.com>
Subject: RE: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
Date: Mon, 11 Mar 2024 13:27:54 +0000 [thread overview]
Message-ID: <d44e32c757014b38bb64d92af683b128@huawei.com> (raw)
In-Reply-To: <CACGkMEsFtJTMFVHt8pJ39Ge8nTJcsX=R_dYghz_93+_Yn--ZDQ@mail.gmail.com>
> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Monday, March 11, 2024 12:01 PM
> To: wangyunjian <wangyunjian@huawei.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>; Paolo Abeni <pabeni@redhat.com>;
> willemdebruijn.kernel@gmail.com; kuba@kernel.org; bjorn@kernel.org;
> magnus.karlsson@intel.com; maciej.fijalkowski@intel.com;
> jonathan.lemon@gmail.com; davem@davemloft.net; bpf@vger.kernel.org;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; kvm@vger.kernel.org;
> virtualization@lists.linux.dev; xudingke <xudingke@huawei.com>; liwei (DT)
> <liwei395@huawei.com>
> Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
>
> On Mon, Mar 4, 2024 at 9:45 PM wangyunjian <wangyunjian@huawei.com>
> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > > Sent: Friday, March 1, 2024 7:53 PM
> > > To: wangyunjian <wangyunjian@huawei.com>
> > > Cc: Paolo Abeni <pabeni@redhat.com>;
> > > willemdebruijn.kernel@gmail.com; jasowang@redhat.com;
> > > kuba@kernel.org; bjorn@kernel.org; magnus.karlsson@intel.com;
> > > maciej.fijalkowski@intel.com; jonathan.lemon@gmail.com;
> > > davem@davemloft.net; bpf@vger.kernel.org; netdev@vger.kernel.org;
> > > linux-kernel@vger.kernel.org; kvm@vger.kernel.org;
> > > virtualization@lists.linux.dev; xudingke <xudingke@huawei.com>;
> > > liwei (DT) <liwei395@huawei.com>
> > > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy
> > > support
> > >
> > > On Fri, Mar 01, 2024 at 11:45:52AM +0000, wangyunjian wrote:
> > > > > -----Original Message-----
> > > > > From: Paolo Abeni [mailto:pabeni@redhat.com]
> > > > > Sent: Thursday, February 29, 2024 7:13 PM
> > > > > To: wangyunjian <wangyunjian@huawei.com>; mst@redhat.com;
> > > > > willemdebruijn.kernel@gmail.com; jasowang@redhat.com;
> > > > > kuba@kernel.org; bjorn@kernel.org; magnus.karlsson@intel.com;
> > > > > maciej.fijalkowski@intel.com; jonathan.lemon@gmail.com;
> > > > > davem@davemloft.net
> > > > > Cc: bpf@vger.kernel.org; netdev@vger.kernel.org;
> > > > > linux-kernel@vger.kernel.org; kvm@vger.kernel.org;
> > > > > virtualization@lists.linux.dev; xudingke <xudingke@huawei.com>;
> > > > > liwei (DT) <liwei395@huawei.com>
> > > > > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy
> > > > > support
> > > > >
> > > > > On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> > > > > > @@ -2661,6 +2776,54 @@ static int tun_ptr_peek_len(void *ptr)
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > +static void tun_peek_xsk(struct tun_file *tfile) {
> > > > > > + struct xsk_buff_pool *pool;
> > > > > > + u32 i, batch, budget;
> > > > > > + void *frame;
> > > > > > +
> > > > > > + if (!ptr_ring_empty(&tfile->tx_ring))
> > > > > > + return;
> > > > > > +
> > > > > > + spin_lock(&tfile->pool_lock);
> > > > > > + pool = tfile->xsk_pool;
> > > > > > + if (!pool) {
> > > > > > + spin_unlock(&tfile->pool_lock);
> > > > > > + return;
> > > > > > + }
> > > > > > +
> > > > > > + if (tfile->nb_descs) {
> > > > > > + xsk_tx_completed(pool, tfile->nb_descs);
> > > > > > + if (xsk_uses_need_wakeup(pool))
> > > > > > + xsk_set_tx_need_wakeup(pool);
> > > > > > + }
> > > > > > +
> > > > > > + spin_lock(&tfile->tx_ring.producer_lock);
> > > > > > + budget = min_t(u32, tfile->tx_ring.size,
> > > > > > + TUN_XDP_BATCH);
> > > > > > +
> > > > > > + batch = xsk_tx_peek_release_desc_batch(pool, budget);
> > > > > > + if (!batch) {
> > > > >
> > > > > This branch looks like an unneeded "optimization". The generic
> > > > > loop below should have the same effect with no measurable perf
> > > > > delta - and
> > > smaller code.
> > > > > Just remove this.
> > > > >
> > > > > > + tfile->nb_descs = 0;
> > > > > > + spin_unlock(&tfile->tx_ring.producer_lock);
> > > > > > + spin_unlock(&tfile->pool_lock);
> > > > > > + return;
> > > > > > + }
> > > > > > +
> > > > > > + tfile->nb_descs = batch;
> > > > > > + for (i = 0; i < batch; i++) {
> > > > > > + /* Encode the XDP DESC flag into lowest bit
> > > > > > + for consumer to
> > > differ
> > > > > > + * XDP desc from XDP buffer and sk_buff.
> > > > > > + */
> > > > > > + frame = tun_xdp_desc_to_ptr(&pool->tx_descs[i]);
> > > > > > + /* The budget must be less than or equal to
> tx_ring.size,
> > > > > > + * so enqueuing will not fail.
> > > > > > + */
> > > > > > + __ptr_ring_produce(&tfile->tx_ring, frame);
> > > > > > + }
> > > > > > + spin_unlock(&tfile->tx_ring.producer_lock);
> > > > > > + spin_unlock(&tfile->pool_lock);
> > > > >
> > > > > More related to the general design: it looks wrong. What if
> > > > > get_rx_bufs() will fail (ENOBUF) after successful peeking? With
> > > > > no more incoming packets, later peek will return 0 and it looks
> > > > > like that the half-processed packets will stay in the ring forever???
> > > > >
> > > > > I think the 'ring produce' part should be moved into tun_do_read().
> > > >
> > > > Currently, the vhost-net obtains a batch descriptors/sk_buffs from
> > > > the ptr_ring and enqueue the batch descriptors/sk_buffs to the
> > > > virtqueue'queue, and then consumes the descriptors/sk_buffs from
> > > > the virtqueue'queue in sequence. As a result, TUN does not know
> > > > whether the batch descriptors have been used up, and thus does not
> > > > know when to
> > > return the batch descriptors.
> > > >
> > > > So, I think it's reasonable that when vhost-net checks ptr_ring is
> > > > empty, it calls peek_len to get new xsk's descs and return the descriptors.
> > > >
> > > > Thanks
> > >
> > > What you need to think about is that if you peek, another call in
> > > parallel can get the same value at the same time.
> >
> > Thank you. I have identified a problem. The tx_descs array was created within
> xsk's pool.
> > When xsk is freed, the pool and tx_descs are also freed. Howerver,
> > some descs may remain in the virtqueue'queue, which could lead to a
> use-after-free scenario.
>
> This can probably solving by when xsk pool is disabled, signal the vhost_net to
> drop those descriptors.
I think TUN can notify vhost_net to drop these descriptors through netdev events.
However, there is a potential concurrency problem. When handling netdev events
and packets, vhost_net preempts the 'vq->mutex_lock', leading to unstable performance.
Thanks
>
> Thanks
>
> > Currently,
> > I do not have an idea to solve this concurrency problem and believe
> > this scenario may not be appropriate for reusing the ptr_ring.
> >
> > Thanks
> >
> > >
> > >
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Paolo
> > > >
> >
next prev parent reply other threads:[~2024-03-11 13:27 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-28 11:05 [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support Yunjian Wang
2024-02-29 10:24 ` kernel test robot
2024-02-29 11:12 ` Paolo Abeni
2024-02-29 13:15 ` wangyunjian
2024-03-01 11:45 ` wangyunjian
2024-03-01 11:53 ` Michael S. Tsirkin
2024-03-04 13:45 ` wangyunjian
2024-03-11 4:00 ` Jason Wang
2024-03-11 13:27 ` wangyunjian [this message]
2024-03-12 6:07 ` Jason Wang
2024-02-29 15:44 ` kernel test robot
2024-03-01 14:11 ` Maciej Fijalkowski
2024-03-01 18:40 ` Willem de Bruijn
2024-03-06 5:32 ` Jason Wang
2024-03-04 6:55 ` Jason Wang
2024-03-04 11:23 ` wangyunjian
2024-03-06 2:11 ` Jason Wang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d44e32c757014b38bb64d92af683b128@huawei.com \
--to=wangyunjian@huawei.com \
--cc=bjorn@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=davem@davemloft.net \
--cc=jasowang@redhat.com \
--cc=jonathan.lemon@gmail.com \
--cc=kuba@kernel.org \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=liwei395@huawei.com \
--cc=maciej.fijalkowski@intel.com \
--cc=magnus.karlsson@intel.com \
--cc=mst@redhat.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=virtualization@lists.linux.dev \
--cc=willemdebruijn.kernel@gmail.com \
--cc=xudingke@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).