Re: [PATCH v3] vhost: Add indirect descriptors support to the TX path

From: "Michael S. Tsirkin" <mst@redhat.com>
To: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: yuanhan.liu@linux.intel.com, huawei.xie@intel.com, dev@dpdk.org,
	vkaplans@redhat.com, stephen@networkplumber.org
Subject: Re: [PATCH v3] vhost: Add indirect descriptors support to the TX path
Date: Fri, 23 Sep 2016 21:22:23 +0300
Message-ID: <20160923212116-mutt-send-email-mst@kernel.org>
In-Reply-To: <425573ad-216f-54e7-f4ee-998a4f87e189@redhat.com>

On Fri, Sep 23, 2016 at 08:16:09PM +0200, Maxime Coquelin wrote:
> 
> 
> On 09/23/2016 08:06 PM, Michael S. Tsirkin wrote:
> > On Fri, Sep 23, 2016 at 08:02:27PM +0200, Maxime Coquelin wrote:
> > > 
> > > 
> > > On 09/23/2016 05:49 PM, Michael S. Tsirkin wrote:
> > > > On Fri, Sep 23, 2016 at 10:28:23AM +0200, Maxime Coquelin wrote:
> > > > > Indirect descriptors are usually supported by virtio-net devices,
> > > > > allowing a larger number of requests to be dispatched.
> > > > > 
> > > > > When the virtio device sends a packet using indirect descriptors,
> > > > > only one slot is used in the ring, even for large packets.
> > > > > 
> > > > > The main effect is to improve the 0% packet loss benchmark.
> > > > > A PVP benchmark using Moongen (64 bytes) on the TE, and testpmd
> > > > > (fwd io for host, macswap for VM) on DUT shows a +50% gain for
> > > > > zero loss.
> > > > > 
> > > > > On the downside, a micro-benchmark using testpmd txonly in the VM and
> > > > > rxonly on the host shows a loss between 1 and 4%. But depending on
> > > > > the needs, the feature can be disabled at VM boot time by passing the
> > > > > indirect_desc=off argument to the vhost-user device in QEMU.
> > > > 
> > > > Even better, change guest pmd to only use indirect
> > > > descriptors when this makes sense (e.g. sufficiently
> > > > large packets).
> > > With the micro-benchmark, the degradation is fairly constant regardless
> > > of the packet size.
> > > 
> > > For PVP, I could not test with packets larger than 64 bytes, as I don't
> > > have a 40G interface,
> > 
> > Don't 64 byte packets fit in a single slot anyway?
> No, indirect is used. I didn't check in detail, but I think this is
> because there is no headroom reserved in the mbuf.
> 
> This is the condition to meet to fit in a single slot:
> /* optimize ring usage */
> if (vtpci_with_feature(hw, VIRTIO_F_ANY_LAYOUT) &&
> 	rte_mbuf_refcnt_read(txm) == 1 &&
> 	RTE_MBUF_DIRECT(txm) &&
> 	txm->nb_segs == 1 &&
> 	rte_pktmbuf_headroom(txm) >= hdr_size &&
> 	rte_is_aligned(rte_pktmbuf_mtod(txm, char *),
> 		__alignof__(struct virtio_net_hdr_mrg_rxbuf)))
>     can_push = 1;
> else if (vtpci_with_feature(hw, VIRTIO_RING_F_INDIRECT_DESC) &&
> 	txm->nb_segs < VIRTIO_MAX_TX_INDIRECT)
>     use_indirect = 1;
> 
> I will check in more detail next week.

Two thoughts, then:
1. Can some headroom be reserved?
2. How about using indirect with 3 or more s/g entries,
   but direct with 2 or fewer?
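
A rough sketch of what point 2 could look like in the guest PMD TX path,
not a tested patch. All names here (choose_tx_layout, enum tx_layout,
TX_INDIRECT_MIN_ENTRIES) are hypothetical and do not exist in the driver.
It assumes the virtio-net header costs one extra s/g entry whenever it
cannot be pushed into the mbuf headroom, and it keeps the existing upper
bound on the indirect table size from the condition quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: chains with at least this many s/g entries go indirect. */
#define TX_INDIRECT_MIN_ENTRIES 3u

enum tx_layout { TX_DIRECT, TX_INDIRECT };

/*
 * nb_segs      - number of data segments in the packet (must be > 0)
 * can_push     - true if the virtio-net header fits in the mbuf headroom,
 *                so no separate descriptor is needed for it
 * has_indirect - true if the indirect descriptor feature was negotiated
 * max_indirect - capacity limit of the indirect table (assumed parameter)
 */
static enum tx_layout
choose_tx_layout(uint16_t nb_segs, bool can_push, bool has_indirect,
		 uint16_t max_indirect)
{
	unsigned int entries;

	assert(nb_segs > 0);

	/* One entry per data segment, plus one for the header if not pushed. */
	entries = (unsigned int)nb_segs + (can_push ? 0u : 1u);

	/* Short chains stay direct; only longer ones pay the indirection cost. */
	if (has_indirect &&
	    entries >= TX_INDIRECT_MIN_ENTRIES &&
	    nb_segs < max_indirect)
		return TX_INDIRECT;

	return TX_DIRECT;
}
```

With this, a single-segment packet with no headroom (header plus data,
2 entries) would stay direct, and only chains of 3 or more entries would
use the indirect table. Whether 3 is the right cut-off is something the
benchmarks would have to show.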


> > Why would there be an effect with that?
> > 
> > > and line rate with 10G is reached rapidly.
> > 
> > Right, but focus on packet loss. You can have that at any rate.
> > 
> > > 
> > > > I would be very interested to know when it makes
> > > > sense.
> > > > 
> > > > The feature is there, it's up to guest whether to
> > > > use it.
> > > Do you mean the PMD should detect dynamically whether to use indirect,
> > > or have an option at device init time to enable or disable the feature?
> > 
> > The guest PMD should not use indirect where it slows things down.
> > 
> > > > 
> > > > 
> > > > > Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > > > ---
> > > > > Changes since v2:
> > > > > =================
> > > > >  - Revert back to not checking feature flag to be aligned with
> > > > > kernel implementation
> > > > >  - Ensure we don't have nested indirect descriptors
> > > > >  - Ensure the indirect desc address is valid, to protect against
> > > > > malicious guests
> > > > > 
> > > > > Changes since RFC:
> > > > > =================
> > > > >  - Enrich commit message with figures
> > > > >  - Rebased on top of dpdk-next-virtio's master
> > > > >  - Add feature check to ensure we don't receive an indirect desc
> > > > > if not supported by the virtio driver
> > > > > 
> > > > >  lib/librte_vhost/vhost.c      |  3 ++-
> > > > >  lib/librte_vhost/virtio_net.c | 41 +++++++++++++++++++++++++++++++----------
> > > > >  2 files changed, 33 insertions(+), 11 deletions(-)
> > > > > 
> > > > > diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> > > > > index 46095c3..30bb0ce 100644
> > > > > --- a/lib/librte_vhost/vhost.c
> > > > > +++ b/lib/librte_vhost/vhost.c
> > > > > @@ -65,7 +65,8 @@
> > > > >  				(1ULL << VIRTIO_NET_F_CSUM)    | \
> > > > >  				(1ULL << VIRTIO_NET_F_GUEST_CSUM) | \
> > > > >  				(1ULL << VIRTIO_NET_F_GUEST_TSO4) | \
> > > > > -				(1ULL << VIRTIO_NET_F_GUEST_TSO6))
> > > > > +				(1ULL << VIRTIO_NET_F_GUEST_TSO6) | \
> > > > > +				(1ULL << VIRTIO_RING_F_INDIRECT_DESC))
> > > > > 
> > > > >  uint64_t VHOST_FEATURES = VHOST_SUPPORTED_FEATURES;
> > > > > 
> > > > > diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> > > > > index 8a151af..2e0a587 100644
> > > > > --- a/lib/librte_vhost/virtio_net.c
> > > > > +++ b/lib/librte_vhost/virtio_net.c
> > > > > @@ -679,8 +679,8 @@ make_rarp_packet(struct rte_mbuf *rarp_mbuf, const struct ether_addr *mac)
> > > > >  }
> > > > > 
> > > > >  static inline int __attribute__((always_inline))
> > > > > -copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > > > > -		  struct rte_mbuf *m, uint16_t desc_idx,
> > > > > +copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
> > > > > +		  uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
> > > > >  		  struct rte_mempool *mbuf_pool)
> > > > >  {
> > > > >  	struct vring_desc *desc;
> > > > > @@ -693,8 +693,9 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > > > >  	/* A counter to avoid desc dead loop chain */
> > > > >  	uint32_t nr_desc = 1;
> > > > > 
> > > > > -	desc = &vq->desc[desc_idx];
> > > > > -	if (unlikely(desc->len < dev->vhost_hlen))
> > > > > +	desc = &descs[desc_idx];
> > > > > +	if (unlikely((desc->len < dev->vhost_hlen)) ||
> > > > > +			(desc->flags & VRING_DESC_F_INDIRECT))
> > > > >  		return -1;
> > > > > 
> > > > >  	desc_addr = gpa_to_vva(dev, desc->addr);
> > > > > @@ -711,7 +712,9 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > > > >  	 */
> > > > >  	if (likely((desc->len == dev->vhost_hlen) &&
> > > > >  		   (desc->flags & VRING_DESC_F_NEXT) != 0)) {
> > > > > -		desc = &vq->desc[desc->next];
> > > > > +		desc = &descs[desc->next];
> > > > > +		if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
> > > > > +			return -1;
> > > > > 
> > > > >  		desc_addr = gpa_to_vva(dev, desc->addr);
> > > > >  		if (unlikely(!desc_addr))
> > > > 
> > > > 
> > > > Just to make sure, does this still allow a chain of
> > > > direct descriptors ending with an indirect one?
> > > > This is legal as per spec.
> > > > 
> > > > > @@ -747,10 +750,12 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > > > >  			if ((desc->flags & VRING_DESC_F_NEXT) == 0)
> > > > >  				break;
> > > > > 
> > > > > -			if (unlikely(desc->next >= vq->size ||
> > > > > -				     ++nr_desc > vq->size))
> > > > > +			if (unlikely(desc->next >= max_desc ||
> > > > > +				     ++nr_desc > max_desc))
> > > > > +				return -1;
> > > > > +			desc = &descs[desc->next];
> > > > > +			if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
> > > > >  				return -1;
> > > > > -			desc = &vq->desc[desc->next];
> > > > > 
> > > > >  			desc_addr = gpa_to_vva(dev, desc->addr);
> > > > >  			if (unlikely(!desc_addr))
> > > > > @@ -878,19 +883,35 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
> > > > >  	/* Prefetch descriptor index. */
> > > > >  	rte_prefetch0(&vq->desc[desc_indexes[0]]);
> > > > >  	for (i = 0; i < count; i++) {
> > > > > +		struct vring_desc *desc;
> > > > > +		uint16_t sz, idx;
> > > > >  		int err;
> > > > > 
> > > > >  		if (likely(i + 1 < count))
> > > > >  			rte_prefetch0(&vq->desc[desc_indexes[i + 1]]);
> > > > > 
> > > > > +		if (vq->desc[desc_indexes[i]].flags & VRING_DESC_F_INDIRECT) {
> > > > > +			desc = (struct vring_desc *)gpa_to_vva(dev,
> > > > > +					vq->desc[desc_indexes[i]].addr);
> > > > > +			if (unlikely(!desc))
> > > > > +				break;
> > > > > +
> > > > > +			rte_prefetch0(desc);
> > > > > +			sz = vq->desc[desc_indexes[i]].len / sizeof(*desc);
> > > > > +			idx = 0;
> > > > > +		} else {
> > > > > +			desc = vq->desc;
> > > > > +			sz = vq->size;
> > > > > +			idx = desc_indexes[i];
> > > > > +		}
> > > > > +
> > > > >  		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
> > > > >  		if (unlikely(pkts[i] == NULL)) {
> > > > >  			RTE_LOG(ERR, VHOST_DATA,
> > > > >  				"Failed to allocate memory for mbuf.\n");
> > > > >  			break;
> > > > >  		}
> > > > > -		err = copy_desc_to_mbuf(dev, vq, pkts[i], desc_indexes[i],
> > > > > -					mbuf_pool);
> > > > > +		err = copy_desc_to_mbuf(dev, desc, sz, pkts[i], idx, mbuf_pool);
> > > > >  		if (unlikely(err)) {
> > > > >  			rte_pktmbuf_free(pkts[i]);
> > > > >  			break;
> > > > > --
> > > > > 2.7.4
> > > > 
> > > > Something that I'm missing here: it's legal for guest
> > > > to add indirect descriptors for RX.
> > > > I don't see the handling of RX here though.
> > > > I think it's required for spec compliance.
> > > >
