Netdev List
 help / color / mirror / Atom feed
* Re: [RFC v2] virtio: support packed ring
From: Tiwei Bie @ 2018-04-24  1:49 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180424044250-mutt-send-email-mst@kernel.org>

On Tue, Apr 24, 2018 at 04:43:22AM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 24, 2018 at 09:37:47AM +0800, Tiwei Bie wrote:
> > On Tue, Apr 24, 2018 at 04:29:51AM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Apr 24, 2018 at 09:16:38AM +0800, Tiwei Bie wrote:
> > > > On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> > > > > > 
> > > > > > 
> > > > > > On 2018年04月23日 17:29, Tiwei Bie wrote:
> > > > > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > > > > > Hello everyone,
> > > > > > > > > 
> > > > > > > > > This RFC implements packed ring support for virtio driver.
> > > > > > > > > 
> > > > > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > > > > > > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > > > > > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > > > > > > > > 
> > > > > > > > > TODO:
> > > > > > > > > - Refinements and bug fixes;
> > > > > > > > > - Split into small patches;
> > > > > > > > > - Test indirect descriptor support;
> > > > > > > > > - Test/fix event suppression support;
> > > > > > > > > - Test devices other than net;
> > > > > > > > > 
> > > > > > > > > RFC v1 -> RFC v2:
> > > > > > > > > - Add indirect descriptor support - compile test only;
> > > > > > > > > - Add event suppression supprt - compile test only;
> > > > > > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > > > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > > > > > > - Avoid using '%' operator (Jason);
> > > > > > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > > > > > > - Some other refinements and bug fixes;
> > > > > > > > > 
> > > > > > > > > Thanks!
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > > > > > > ---
> > > > > > > > >    drivers/virtio/virtio_ring.c       | 1094 +++++++++++++++++++++++++++++-------
> > > > > > > > >    include/linux/virtio_ring.h        |    8 +-
> > > > > > > > >    include/uapi/linux/virtio_config.h |   12 +-
> > > > > > > > >    include/uapi/linux/virtio_ring.h   |   61 ++
> > > > > > > > >    4 files changed, 980 insertions(+), 195 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > > > @@ -58,14 +58,15 @@
> > > > > > > > [...]
> > > > > > > > 
> > > > > > > > > +
> > > > > > > > > +	if (vq->indirect) {
> > > > > > > > > +		u32 len;
> > > > > > > > > +
> > > > > > > > > +		desc = vq->desc_state[head].indir_desc;
> > > > > > > > > +		/* Free the indirect table, if any, now that it's unmapped. */
> > > > > > > > > +		if (!desc)
> > > > > > > > > +			goto out;
> > > > > > > > > +
> > > > > > > > > +		len = virtio32_to_cpu(vq->vq.vdev,
> > > > > > > > > +				      vq->vring_packed.desc[head].len);
> > > > > > > > > +
> > > > > > > > > +		BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > > > > > > +			 cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> > > > > > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
> > > > > > > > can safely remove this BUG_ON() here.
> > > > > > > > 
> > > > > > > > > +		BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
> > > > > > > > Len could be ignored for used descriptor according to the spec, so we need
> > > > > > > > remove this BUG_ON() too.
> > > > > > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > > > > > > And I think something related to this in the spec isn't very
> > > > > > > clear currently.
> > > > > > > 
> > > > > > > In the spec, there are below words:
> > > > > > > 
> > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > > > > > > """
> > > > > > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > > > > > > is reserved and is ignored by the device.
> > > > > > > """
> > > > > > > 
> > > > > > > So when device writes back an used descriptor in this case,
> > > > > > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > > > > > > is reserved and should be ignored.
> > > > > > > 
> > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> > > > > > > """
> > > > > > > Element Length is reserved for used descriptors without the
> > > > > > > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> > > > > > > """
> > > > > > > 
> > > > > > > And this is the way how driver ignores the `len` in an used
> > > > > > > descriptor.
> > > > > > > 
> > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> > > > > > > """
> > > > > > > To increase ring capacity the driver can store a (read-only
> > > > > > > by the device) table of indirect descriptors anywhere in memory,
> > > > > > > and insert a descriptor in the main virtqueue (with \field{Flags}
> > > > > > > bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> > > > > > > containing this indirect descriptor table;
> > > > > > > """
> > > > > > > 
> > > > > > > So the indirect descriptors in the table are read-only by
> > > > > > > the device. And the only descriptor which is writeable by
> > > > > > > the device is the descriptor in the main virtqueue (with
> > > > > > > Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
> > > > > > > `len` in this descriptor, we won't be able to get the
> > > > > > > length of the data written by the device.
> > > > > > > 
> > > > > > > So I think the `len` in this descriptor will carry the
> > > > > > > length of the data written by the device (if the buffers
> > > > > > > are writable to the device) even if the VIRTQ_DESC_F_WRITE
> > > > > > > isn't set by the device. How do you think?
> > > > > > 
> > > > > > Yes I think so. But we'd better need clarification from Michael.
> > > > > 
> > > > > I think if you use a descriptor, and you want to supply len
> > > > > to guest, you set VIRTQ_DESC_F_WRITE in the used descriptor.
> > > > > Spec also says you must not set VIRTQ_DESC_F_INDIRECT then.
> > > > > If that's a problem we could look at relaxing that last requirement -
> > > > > does driver want INDIRECT in used descriptor to match
> > > > > the value in the avail descriptor for some reason?
> > > > 
> > > > For indirect, driver needs some way to get the length
> > > > of the data written by the driver. And the descriptors
> > > > in the indirect table is read-only, so the only place
> > > > device could put this value is the descriptor with the
> > > > VIRTQ_DESC_F_INDIRECT flag set.
> > > 
> > > when writing out used descriptor, device should set VIRTQ_DESC_F_WRITE
> > > (and clear VIRTQ_DESC_F_INDIRECT).
> > 
> > So the spec allows device to set VIRTQ_DESC_F_WRITE bit
> > when writing out an used descriptor even if the corresponding
> > descriptors it just parsed don't have the VIRTQ_DESC_F_WRITE
> > bit set?
> > 
> > Best regards,
> > Tiwei Bie
> 
> I think so. In a used descriptor, VIRTQ_DESC_F_WRITE just means length
> is valid.

Got it. It's very neat. Thanks! :)

Best regards,
Tiwei Bie

> 
> -- 
> MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC v2] virtio: support packed ring
From: Michael S. Tsirkin @ 2018-04-24  1:43 UTC (permalink / raw)
  To: Tiwei Bie
  Cc: Jason Wang, wexu, virtualization, linux-kernel, netdev, jfreimann
In-Reply-To: <20180424013747.hmr4ytqe7gdqfhdt@debian>

On Tue, Apr 24, 2018 at 09:37:47AM +0800, Tiwei Bie wrote:
> On Tue, Apr 24, 2018 at 04:29:51AM +0300, Michael S. Tsirkin wrote:
> > On Tue, Apr 24, 2018 at 09:16:38AM +0800, Tiwei Bie wrote:
> > > On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> > > > > 
> > > > > 
> > > > > On 2018年04月23日 17:29, Tiwei Bie wrote:
> > > > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > > > > Hello everyone,
> > > > > > > > 
> > > > > > > > This RFC implements packed ring support for virtio driver.
> > > > > > > > 
> > > > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > > > > > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > > > > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > > > > > > > 
> > > > > > > > TODO:
> > > > > > > > - Refinements and bug fixes;
> > > > > > > > - Split into small patches;
> > > > > > > > - Test indirect descriptor support;
> > > > > > > > - Test/fix event suppression support;
> > > > > > > > - Test devices other than net;
> > > > > > > > 
> > > > > > > > RFC v1 -> RFC v2:
> > > > > > > > - Add indirect descriptor support - compile test only;
> > > > > > > > - Add event suppression supprt - compile test only;
> > > > > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > > > > > - Avoid using '%' operator (Jason);
> > > > > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > > > > > - Some other refinements and bug fixes;
> > > > > > > > 
> > > > > > > > Thanks!
> > > > > > > > 
> > > > > > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > > > > > ---
> > > > > > > >    drivers/virtio/virtio_ring.c       | 1094 +++++++++++++++++++++++++++++-------
> > > > > > > >    include/linux/virtio_ring.h        |    8 +-
> > > > > > > >    include/uapi/linux/virtio_config.h |   12 +-
> > > > > > > >    include/uapi/linux/virtio_ring.h   |   61 ++
> > > > > > > >    4 files changed, 980 insertions(+), 195 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > > @@ -58,14 +58,15 @@
> > > > > > > [...]
> > > > > > > 
> > > > > > > > +
> > > > > > > > +	if (vq->indirect) {
> > > > > > > > +		u32 len;
> > > > > > > > +
> > > > > > > > +		desc = vq->desc_state[head].indir_desc;
> > > > > > > > +		/* Free the indirect table, if any, now that it's unmapped. */
> > > > > > > > +		if (!desc)
> > > > > > > > +			goto out;
> > > > > > > > +
> > > > > > > > +		len = virtio32_to_cpu(vq->vq.vdev,
> > > > > > > > +				      vq->vring_packed.desc[head].len);
> > > > > > > > +
> > > > > > > > +		BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > > > > > +			 cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> > > > > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
> > > > > > > can safely remove this BUG_ON() here.
> > > > > > > 
> > > > > > > > +		BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
> > > > > > > Len could be ignored for used descriptor according to the spec, so we need
> > > > > > > remove this BUG_ON() too.
> > > > > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > > > > > And I think something related to this in the spec isn't very
> > > > > > clear currently.
> > > > > > 
> > > > > > In the spec, there are below words:
> > > > > > 
> > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > > > > > """
> > > > > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > > > > > is reserved and is ignored by the device.
> > > > > > """
> > > > > > 
> > > > > > So when device writes back an used descriptor in this case,
> > > > > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > > > > > is reserved and should be ignored.
> > > > > > 
> > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> > > > > > """
> > > > > > Element Length is reserved for used descriptors without the
> > > > > > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> > > > > > """
> > > > > > 
> > > > > > And this is the way how driver ignores the `len` in an used
> > > > > > descriptor.
> > > > > > 
> > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> > > > > > """
> > > > > > To increase ring capacity the driver can store a (read-only
> > > > > > by the device) table of indirect descriptors anywhere in memory,
> > > > > > and insert a descriptor in the main virtqueue (with \field{Flags}
> > > > > > bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> > > > > > containing this indirect descriptor table;
> > > > > > """
> > > > > > 
> > > > > > So the indirect descriptors in the table are read-only by
> > > > > > the device. And the only descriptor which is writeable by
> > > > > > the device is the descriptor in the main virtqueue (with
> > > > > > Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
> > > > > > `len` in this descriptor, we won't be able to get the
> > > > > > length of the data written by the device.
> > > > > > 
> > > > > > So I think the `len` in this descriptor will carry the
> > > > > > length of the data written by the device (if the buffers
> > > > > > are writable to the device) even if the VIRTQ_DESC_F_WRITE
> > > > > > isn't set by the device. How do you think?
> > > > > 
> > > > > Yes I think so. But we'd better need clarification from Michael.
> > > > 
> > > > I think if you use a descriptor, and you want to supply len
> > > > to guest, you set VIRTQ_DESC_F_WRITE in the used descriptor.
> > > > Spec also says you must not set VIRTQ_DESC_F_INDIRECT then.
> > > > If that's a problem we could look at relaxing that last requirement -
> > > > does driver want INDIRECT in used descriptor to match
> > > > the value in the avail descriptor for some reason?
> > > 
> > > For indirect, driver needs some way to get the length
> > > of the data written by the driver. And the descriptors
> > > in the indirect table is read-only, so the only place
> > > device could put this value is the descriptor with the
> > > VIRTQ_DESC_F_INDIRECT flag set.
> > 
> > when writing out used descriptor, device should set VIRTQ_DESC_F_WRITE
> > (and clear VIRTQ_DESC_F_INDIRECT).
> 
> So the spec allows device to set VIRTQ_DESC_F_WRITE bit
> when writing out an used descriptor even if the corresponding
> descriptors it just parsed don't have the VIRTQ_DESC_F_WRITE
> bit set?
> 
> Best regards,
> Tiwei Bie

I think so. In a used descriptor, VIRTQ_DESC_F_WRITE just means length
is valid.

-- 
MST

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Michael S. Tsirkin @ 2018-04-24  1:42 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Siwei Liu, Jiri Pirko, Sridhar Samudrala, David Miller, Netdev,
	virtualization, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski, Jason Wang
In-Reply-To: <20180423182503.353180c9@xeon-e3>

On Mon, Apr 23, 2018 at 06:25:03PM -0700, Stephen Hemminger wrote:
> On Mon, 23 Apr 2018 12:44:39 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
> 
> > On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:  
> > >> On Mon, 23 Apr 2018 20:24:56 +0300
> > >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >>  
> > >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:  
> > >> > > > >
> > >> > > > >I will NAK patches to change to common code for netvsc especially the
> > >> > > > >three device model.  MS worked hard with distro vendors to support transparent
> > >> > > > >mode, ans we really can't have a new model; or do backport.
> > >> > > > >
> > >> > > > >Plus, DPDK is now dependent on existing model.  
> > >> > > >
> > >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.  
> > >> > >
> > >> > > The network device model is a userspace API, and DPDK is a userspace application.  
> > >> >
> > >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> > >> > AFAIK it's normally banging device registers directly.
> > >> >  
> > >> > > You can't go breaking userspace even if you don't like the application.  
> > >> >
> > >> > Could you please explain how is the proposed patchset breaking
> > >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
> > >> > API at all.
> > >> >  
> > >>
> > >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
> > >> to look for Linux netvsc device and the paired VF device and setup the
> > >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
> > >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
> > >> VF device.
> > >>
> > >> So it depends on existing 2 device model. You can't go to a 3 device model
> > >> or start hiding devices from userspace.  
> > >
> > > Okay so how does the existing patch break that? IIUC does not go to
> > > a 3 device model since netvsc calls failover_register directly.
> > >  
> > >> Also, I am working on associating netvsc and VF device based on serial number
> > >> rather than MAC address. The serial number is how Windows works now, and it makes
> > >> sense for Linux and Windows to use the same mechanism if possible.  
> > >
> > > Maybe we should support same for virtio ...
> > > Which serial do you mean? From vpd?
> > >
> > > I guess you will want to keep supporting MAC for old hypervisors?
> 
> The serial number has always been in the hypervisor since original support of SR-IOV
> in WS2008.  So no backward compatibility special cases would be needed.

Is that a serial from real hardware or a hypervisor thing?


-- 
MST

^ permalink raw reply

* [PATCH 1/1] Revert "rds: ib: add error handle"
From: Zhu Yanjun @ 2018-04-24  1:39 UTC (permalink / raw)
  To: santosh.shilimkar, davem, netdev, linux-rdma, rds-devel

This reverts commit 3b12f73a5c2977153f28a224392fd4729b50d1dc.

After long time discussion and investigations, it seems that there
is no mem leak. So this patch is reverted.

Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
---
 net/rds/ib_cm.c | 47 +++++++++++------------------------------------
 1 file changed, 11 insertions(+), 36 deletions(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index eea1d86..d64bfaf 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -443,7 +443,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 		ic->i_send_cq = NULL;
 		ibdev_put_vector(rds_ibdev, ic->i_scq_vector);
 		rdsdebug("ib_create_cq send failed: %d\n", ret);
-		goto rds_ibdev_out;
+		goto out;
 	}
 
 	ic->i_rcq_vector = ibdev_get_unused_vector(rds_ibdev);
@@ -457,19 +457,19 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 		ic->i_recv_cq = NULL;
 		ibdev_put_vector(rds_ibdev, ic->i_rcq_vector);
 		rdsdebug("ib_create_cq recv failed: %d\n", ret);
-		goto send_cq_out;
+		goto out;
 	}
 
 	ret = ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
 	if (ret) {
 		rdsdebug("ib_req_notify_cq send failed: %d\n", ret);
-		goto recv_cq_out;
+		goto out;
 	}
 
 	ret = ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED);
 	if (ret) {
 		rdsdebug("ib_req_notify_cq recv failed: %d\n", ret);
-		goto recv_cq_out;
+		goto out;
 	}
 
 	/* XXX negotiate max send/recv with remote? */
@@ -495,7 +495,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	ret = rdma_create_qp(ic->i_cm_id, ic->i_pd, &attr);
 	if (ret) {
 		rdsdebug("rdma_create_qp failed: %d\n", ret);
-		goto recv_cq_out;
+		goto out;
 	}
 
 	ic->i_send_hdrs = ib_dma_alloc_coherent(dev,
@@ -505,7 +505,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	if (!ic->i_send_hdrs) {
 		ret = -ENOMEM;
 		rdsdebug("ib_dma_alloc_coherent send failed\n");
-		goto qp_out;
+		goto out;
 	}
 
 	ic->i_recv_hdrs = ib_dma_alloc_coherent(dev,
@@ -515,7 +515,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	if (!ic->i_recv_hdrs) {
 		ret = -ENOMEM;
 		rdsdebug("ib_dma_alloc_coherent recv failed\n");
-		goto send_hdrs_dma_out;
+		goto out;
 	}
 
 	ic->i_ack = ib_dma_alloc_coherent(dev, sizeof(struct rds_header),
@@ -523,7 +523,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	if (!ic->i_ack) {
 		ret = -ENOMEM;
 		rdsdebug("ib_dma_alloc_coherent ack failed\n");
-		goto recv_hdrs_dma_out;
+		goto out;
 	}
 
 	ic->i_sends = vzalloc_node(ic->i_send_ring.w_nr * sizeof(struct rds_ib_send_work),
@@ -531,7 +531,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	if (!ic->i_sends) {
 		ret = -ENOMEM;
 		rdsdebug("send allocation failed\n");
-		goto ack_dma_out;
+		goto out;
 	}
 
 	ic->i_recvs = vzalloc_node(ic->i_recv_ring.w_nr * sizeof(struct rds_ib_recv_work),
@@ -539,7 +539,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	if (!ic->i_recvs) {
 		ret = -ENOMEM;
 		rdsdebug("recv allocation failed\n");
-		goto sends_out;
+		goto out;
 	}
 
 	rds_ib_recv_init_ack(ic);
@@ -547,33 +547,8 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	rdsdebug("conn %p pd %p cq %p %p\n", conn, ic->i_pd,
 		 ic->i_send_cq, ic->i_recv_cq);
 
-	return ret;
-
-sends_out:
-	vfree(ic->i_sends);
-ack_dma_out:
-	ib_dma_free_coherent(dev, sizeof(struct rds_header),
-			     ic->i_ack, ic->i_ack_dma);
-recv_hdrs_dma_out:
-	ib_dma_free_coherent(dev, ic->i_recv_ring.w_nr *
-					sizeof(struct rds_header),
-					ic->i_recv_hdrs, ic->i_recv_hdrs_dma);
-send_hdrs_dma_out:
-	ib_dma_free_coherent(dev, ic->i_send_ring.w_nr *
-					sizeof(struct rds_header),
-					ic->i_send_hdrs, ic->i_send_hdrs_dma);
-qp_out:
-	rdma_destroy_qp(ic->i_cm_id);
-recv_cq_out:
-	if (!ib_destroy_cq(ic->i_recv_cq))
-		ic->i_recv_cq = NULL;
-send_cq_out:
-	if (!ib_destroy_cq(ic->i_send_cq))
-		ic->i_send_cq = NULL;
-rds_ibdev_out:
-	rds_ib_remove_conn(rds_ibdev, conn);
+out:
 	rds_ib_dev_put(rds_ibdev);
-
 	return ret;
 }
 
-- 
2.7.4

^ permalink raw reply related

* Re: [RFC v2] virtio: support packed ring
From: Tiwei Bie @ 2018-04-24  1:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, wexu, virtualization, linux-kernel, netdev, jfreimann
In-Reply-To: <20180424042913-mutt-send-email-mst@kernel.org>

On Tue, Apr 24, 2018 at 04:29:51AM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 24, 2018 at 09:16:38AM +0800, Tiwei Bie wrote:
> > On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> > > > 
> > > > 
> > > > On 2018年04月23日 17:29, Tiwei Bie wrote:
> > > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > > > Hello everyone,
> > > > > > > 
> > > > > > > This RFC implements packed ring support for virtio driver.
> > > > > > > 
> > > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > > > > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > > > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > > > > > > 
> > > > > > > TODO:
> > > > > > > - Refinements and bug fixes;
> > > > > > > - Split into small patches;
> > > > > > > - Test indirect descriptor support;
> > > > > > > - Test/fix event suppression support;
> > > > > > > - Test devices other than net;
> > > > > > > 
> > > > > > > RFC v1 -> RFC v2:
> > > > > > > - Add indirect descriptor support - compile test only;
> > > > > > > - Add event suppression supprt - compile test only;
> > > > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > > > > - Avoid using '%' operator (Jason);
> > > > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > > > > - Some other refinements and bug fixes;
> > > > > > > 
> > > > > > > Thanks!
> > > > > > > 
> > > > > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > > > > ---
> > > > > > >    drivers/virtio/virtio_ring.c       | 1094 +++++++++++++++++++++++++++++-------
> > > > > > >    include/linux/virtio_ring.h        |    8 +-
> > > > > > >    include/uapi/linux/virtio_config.h |   12 +-
> > > > > > >    include/uapi/linux/virtio_ring.h   |   61 ++
> > > > > > >    4 files changed, 980 insertions(+), 195 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > @@ -58,14 +58,15 @@
> > > > > > [...]
> > > > > > 
> > > > > > > +
> > > > > > > +	if (vq->indirect) {
> > > > > > > +		u32 len;
> > > > > > > +
> > > > > > > +		desc = vq->desc_state[head].indir_desc;
> > > > > > > +		/* Free the indirect table, if any, now that it's unmapped. */
> > > > > > > +		if (!desc)
> > > > > > > +			goto out;
> > > > > > > +
> > > > > > > +		len = virtio32_to_cpu(vq->vq.vdev,
> > > > > > > +				      vq->vring_packed.desc[head].len);
> > > > > > > +
> > > > > > > +		BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > > > > +			 cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> > > > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
> > > > > > can safely remove this BUG_ON() here.
> > > > > > 
> > > > > > > +		BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
> > > > > > Len could be ignored for used descriptor according to the spec, so we need
> > > > > > remove this BUG_ON() too.
> > > > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > > > > And I think something related to this in the spec isn't very
> > > > > clear currently.
> > > > > 
> > > > > In the spec, there are below words:
> > > > > 
> > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > > > > """
> > > > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > > > > is reserved and is ignored by the device.
> > > > > """
> > > > > 
> > > > > So when device writes back an used descriptor in this case,
> > > > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > > > > is reserved and should be ignored.
> > > > > 
> > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> > > > > """
> > > > > Element Length is reserved for used descriptors without the
> > > > > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> > > > > """
> > > > > 
> > > > > And this is the way how driver ignores the `len` in an used
> > > > > descriptor.
> > > > > 
> > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> > > > > """
> > > > > To increase ring capacity the driver can store a (read-only
> > > > > by the device) table of indirect descriptors anywhere in memory,
> > > > > and insert a descriptor in the main virtqueue (with \field{Flags}
> > > > > bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> > > > > containing this indirect descriptor table;
> > > > > """
> > > > > 
> > > > > So the indirect descriptors in the table are read-only by
> > > > > the device. And the only descriptor which is writeable by
> > > > > the device is the descriptor in the main virtqueue (with
> > > > > Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
> > > > > `len` in this descriptor, we won't be able to get the
> > > > > length of the data written by the device.
> > > > > 
> > > > > So I think the `len` in this descriptor will carry the
> > > > > length of the data written by the device (if the buffers
> > > > > are writable to the device) even if the VIRTQ_DESC_F_WRITE
> > > > > isn't set by the device. How do you think?
> > > > 
> > > > Yes I think so. But we'd better need clarification from Michael.
> > > 
> > > I think if you use a descriptor, and you want to supply len
> > > to guest, you set VIRTQ_DESC_F_WRITE in the used descriptor.
> > > Spec also says you must not set VIRTQ_DESC_F_INDIRECT then.
> > > If that's a problem we could look at relaxing that last requirement -
> > > does driver want INDIRECT in used descriptor to match
> > > the value in the avail descriptor for some reason?
> > 
> > For indirect, driver needs some way to get the length
> > of the data written by the driver. And the descriptors
> > in the indirect table is read-only, so the only place
> > device could put this value is the descriptor with the
> > VIRTQ_DESC_F_INDIRECT flag set.
> 
> when writing out used descriptor, device should set VIRTQ_DESC_F_WRITE
> (and clear VIRTQ_DESC_F_INDIRECT).

So the spec allows device to set VIRTQ_DESC_F_WRITE bit
when writing out an used descriptor even if the corresponding
descriptors it just parsed don't have the VIRTQ_DESC_F_WRITE
bit set?

Best regards,
Tiwei Bie

^ permalink raw reply

* Re: [RFC v2] virtio: support packed ring
From: Michael S. Tsirkin @ 2018-04-24  1:29 UTC (permalink / raw)
  To: Tiwei Bie
  Cc: Jason Wang, wexu, virtualization, linux-kernel, netdev, jfreimann
In-Reply-To: <20180424011638.vf6f22dnl7ufnnij@debian>

On Tue, Apr 24, 2018 at 09:16:38AM +0800, Tiwei Bie wrote:
> On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> > > 
> > > 
> > > On 2018年04月23日 17:29, Tiwei Bie wrote:
> > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > > Hello everyone,
> > > > > > 
> > > > > > This RFC implements packed ring support for virtio driver.
> > > > > > 
> > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > > > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > > > > > 
> > > > > > TODO:
> > > > > > - Refinements and bug fixes;
> > > > > > - Split into small patches;
> > > > > > - Test indirect descriptor support;
> > > > > > - Test/fix event suppression support;
> > > > > > - Test devices other than net;
> > > > > > 
> > > > > > RFC v1 -> RFC v2:
> > > > > > - Add indirect descriptor support - compile test only;
> > > > > > - Add event suppression supprt - compile test only;
> > > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > > > - Avoid using '%' operator (Jason);
> > > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > > > - Some other refinements and bug fixes;
> > > > > > 
> > > > > > Thanks!
> > > > > > 
> > > > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > > > ---
> > > > > >    drivers/virtio/virtio_ring.c       | 1094 +++++++++++++++++++++++++++++-------
> > > > > >    include/linux/virtio_ring.h        |    8 +-
> > > > > >    include/uapi/linux/virtio_config.h |   12 +-
> > > > > >    include/uapi/linux/virtio_ring.h   |   61 ++
> > > > > >    4 files changed, 980 insertions(+), 195 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > @@ -58,14 +58,15 @@
> > > > > [...]
> > > > > 
> > > > > > +
> > > > > > +	if (vq->indirect) {
> > > > > > +		u32 len;
> > > > > > +
> > > > > > +		desc = vq->desc_state[head].indir_desc;
> > > > > > +		/* Free the indirect table, if any, now that it's unmapped. */
> > > > > > +		if (!desc)
> > > > > > +			goto out;
> > > > > > +
> > > > > > +		len = virtio32_to_cpu(vq->vq.vdev,
> > > > > > +				      vq->vring_packed.desc[head].len);
> > > > > > +
> > > > > > +		BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > > > +			 cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> > > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
> > > > > can safely remove this BUG_ON() here.
> > > > > 
> > > > > > +		BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
> > > > > Len could be ignored for used descriptor according to the spec, so we need
> > > > > remove this BUG_ON() too.
> > > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > > > And I think something related to this in the spec isn't very
> > > > clear currently.
> > > > 
> > > > In the spec, there are below words:
> > > > 
> > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > > > """
> > > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > > > is reserved and is ignored by the device.
> > > > """
> > > > 
> > > > So when device writes back an used descriptor in this case,
> > > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > > > is reserved and should be ignored.
> > > > 
> > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> > > > """
> > > > Element Length is reserved for used descriptors without the
> > > > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> > > > """
> > > > 
> > > > And this is the way how driver ignores the `len` in an used
> > > > descriptor.
> > > > 
> > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> > > > """
> > > > To increase ring capacity the driver can store a (read-only
> > > > by the device) table of indirect descriptors anywhere in memory,
> > > > and insert a descriptor in the main virtqueue (with \field{Flags}
> > > > bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> > > > containing this indirect descriptor table;
> > > > """
> > > > 
> > > > So the indirect descriptors in the table are read-only by
> > > > the device. And the only descriptor which is writeable by
> > > > the device is the descriptor in the main virtqueue (with
> > > > Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
> > > > `len` in this descriptor, we won't be able to get the
> > > > length of the data written by the device.
> > > > 
> > > > So I think the `len` in this descriptor will carry the
> > > > length of the data written by the device (if the buffers
> > > > are writable to the device) even if the VIRTQ_DESC_F_WRITE
> > > > isn't set by the device. How do you think?
> > > 
> > > Yes I think so. But we'd better need clarification from Michael.
> > 
> > I think if you use a descriptor, and you want to supply len
> > to guest, you set VIRTQ_DESC_F_WRITE in the used descriptor.
> > Spec also says you must not set VIRTQ_DESC_F_INDIRECT then.
> > If that's a problem we could look at relaxing that last requirement -
> > does driver want INDIRECT in used descriptor to match
> > the value in the avail descriptor for some reason?
> 
> For indirect, driver needs some way to get the length
> of the data written by the driver. And the descriptors
> in the indirect table is read-only, so the only place
> device could put this value is the descriptor with the
> VIRTQ_DESC_F_INDIRECT flag set.

when writing out used descriptor, device should set VIRTQ_DESC_F_WRITE
(and clear VIRTQ_DESC_F_INDIRECT).

> > 
> > > > 
> > > > 
> > > > > The reason is we don't touch descriptor ring in the case of split, so
> > > > > BUG_ON()s may help there.
> > > > > 
> > > > > > +
> > > > > > +		for (j = 0; j < len / sizeof(struct vring_packed_desc); j++)
> > > > > > +			vring_unmap_one_packed(vq, &desc[j]);
> > > > > > +
> > > > > > +		kfree(desc);
> > > > > > +		vq->desc_state[head].indir_desc = NULL;
> > > > > > +	} else if (ctx) {
> > > > > > +		*ctx = vq->desc_state[head].indir_desc;
> > > > > > +	}
> > > > > > +
> > > > > > +out:
> > > > > > +	return vq->desc_state[head].num;
> > > > > > +}
> > > > > > +
> > > > > > +static inline bool more_used_split(const struct vring_virtqueue *vq)
> > > > > >    {
> > > > > >    	return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
> > > > > >    }
> > > > > > +static inline bool more_used_packed(const struct vring_virtqueue *vq)
> > > > > > +{
> > > > > > +	u16 last_used, flags;
> > > > > > +	bool avail, used;
> > > > > > +
> > > > > > +	if (vq->vq.num_free == vq->vring_packed.num)
> > > > > > +		return false;
> > > > > > +
> > > > > > +	last_used = vq->last_used_idx;
> > > > > > +	flags = virtio16_to_cpu(vq->vq.vdev,
> > > > > > +				vq->vring_packed.desc[last_used].flags);
> > > > > > +	avail = flags & VRING_DESC_F_AVAIL(1);
> > > > > > +	used = flags & VRING_DESC_F_USED(1);
> > > > > > +
> > > > > > +	return avail == used;
> > > > > > +}
> > > > > This looks interesting, spec said:
> > > > > 
> > > > > "
> > > > > Thus VIRTQ_DESC_F_AVAIL and VIRTQ_DESC_F_USED bits are different for an
> > > > > available descriptor and
> > > > > equal for a used descriptor.
> > > > > Note that this observation is mostly useful for sanity-checking as these are
> > > > > necessary but not sufficient
> > > > > conditions - for example, all descriptors are zero-initialized. To detect
> > > > > used and available descriptors it is
> > > > > possible for drivers and devices to keep track of the last observed value of
> > > > > VIRTQ_DESC_F_USED/VIRTQ_-
> > > > > DESC_F_AVAIL. Other techniques to detect
> > > > > VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
> > > > > might also be possible.
> > > > > "
> > > > > 
> > > > > So it looks to me it was not sufficient, looking at the example codes in
> > > > > spec, do we need to track last seen used_wrap_counter here?
> > > > I don't think we have to track used_wrap_counter in
> > > > driver. There was a discussion on this:
> > > > 
> > > > https://lists.oasis-open.org/archives/virtio-dev/201802/msg00177.html
> > > > 
> > > > And after that, below sentence was added (it's also
> > > > in the above words you quoted):
> > > > 
> > > > """
> > > > Other techniques to detect
> > > > VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
> > > > might also be possible.
> > > > """
> > > > 
> > > > Best regards,
> > > > Tiwei Bie
> > > 
> > > I see, the extra condition "if (vq->vq.num_free == vq->vring_packed.num)"
> > > help in this case.
> > > 
> > > Thanks
> > 
> > I still think tracking a wrap counter is better.
> 
> >From my understanding, wrap counter is only needed when
> one side just want to update parts of the status bit(s),
> it's something like the "report status" or "write back"
> feature [1] in the hardware NIC. And in the driver, all
> the status must be updated, and that's why I don't want
> to track the usedwrap counter.
> 
> [1] https://github.com/btw616/dpdk-virtio1.1/commit/ca837865bd10
> 
> Best regards,
> Tiwei Bie
> 
> > 
> > > > 
> > > > > Thanks

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Stephen Hemminger @ 2018-04-24  1:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Siwei Liu, Jiri Pirko, Sridhar Samudrala, David Miller, Netdev,
	virtualization, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski, Jason Wang
In-Reply-To: <20180423230037-mutt-send-email-mst@kernel.org>

On Mon, 23 Apr 2018 23:06:55 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
> > On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:  
> > > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:  
> > >> On Mon, 23 Apr 2018 20:24:56 +0300
> > >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >>  
> > >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:  
> > >> > > > >
> > >> > > > >I will NAK patches to change to common code for netvsc especially the
> > >> > > > >three device model.  MS worked hard with distro vendors to support transparent
> > >> > > > >mode, ans we really can't have a new model; or do backport.
> > >> > > > >
> > >> > > > >Plus, DPDK is now dependent on existing model.  
> > >> > > >
> > >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.  
> > >> > >
> > >> > > The network device model is a userspace API, and DPDK is a userspace application.  
> > >> >
> > >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> > >> > AFAIK it's normally banging device registers directly.
> > >> >  
> > >> > > You can't go breaking userspace even if you don't like the application.  
> > >> >
> > >> > Could you please explain how is the proposed patchset breaking
> > >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
> > >> > API at all.
> > >> >  
> > >>
> > >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
> > >> to look for Linux netvsc device and the paired VF device and setup the
> > >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
> > >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
> > >> VF device.
> > >>
> > >> So it depends on existing 2 device model. You can't go to a 3 device model
> > >> or start hiding devices from userspace.  
> > >
> > > Okay so how does the existing patch break that? IIUC does not go to
> > > a 3 device model since netvsc calls failover_register directly.
> > >  
> > >> Also, I am working on associating netvsc and VF device based on serial number
> > >> rather than MAC address. The serial number is how Windows works now, and it makes
> > >> sense for Linux and Windows to use the same mechanism if possible.  
> > >
> > > Maybe we should support same for virtio ...
> > > Which serial do you mean? From vpd?
> > >
> > > I guess you will want to keep supporting MAC for old hypervisors?
> > >
> > > It all seems like a reasonable thing to support in the generic core.  
> > 
> > That's the reason why I chose explicit identifier rather than rely on
> > MAC address to bind/pair a device. MAC address can change. Even if it
> > can't, malicious guest user can fake MAC address to skip binding.
> > 
> > -Siwei  
> 
> Address should be sampled at device creation to prevent this
> kind of hack. Not that it buys the malicious user much:
> if you can poke at MAC addresses you probably already can
> break networking.

On Hyper-V guest can't really change MAC address if SR-IOV is enabled.
Also, Linux has permanent ether address in netdev which is what should
be used to avoid user messing with device.

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Stephen Hemminger @ 2018-04-24  1:25 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Michael S. Tsirkin, Jiri Pirko, Sridhar Samudrala, David Miller,
	Netdev, virtualization, virtio-dev, Brandeburg, Jesse,
	Alexander Duyck, Jakub Kicinski, Jason Wang
In-Reply-To: <CADGSJ20ge75T+ddxtUBT4d9m1i3=HLOAHJEoS7Cg0bqnXrutwA@mail.gmail.com>

On Mon, 23 Apr 2018 12:44:39 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:  
> >> On Mon, 23 Apr 2018 20:24:56 +0300
> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>  
> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:  
> >> > > > >
> >> > > > >I will NAK patches to change to common code for netvsc especially the
> >> > > > >three device model.  MS worked hard with distro vendors to support transparent
> >> > > > >mode, ans we really can't have a new model; or do backport.
> >> > > > >
> >> > > > >Plus, DPDK is now dependent on existing model.  
> >> > > >
> >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.  
> >> > >
> >> > > The network device model is a userspace API, and DPDK is a userspace application.  
> >> >
> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> >> > AFAIK it's normally banging device registers directly.
> >> >  
> >> > > You can't go breaking userspace even if you don't like the application.  
> >> >
> >> > Could you please explain how is the proposed patchset breaking
> >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
> >> > API at all.
> >> >  
> >>
> >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
> >> to look for Linux netvsc device and the paired VF device and setup the
> >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
> >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
> >> VF device.
> >>
> >> So it depends on existing 2 device model. You can't go to a 3 device model
> >> or start hiding devices from userspace.  
> >
> > Okay so how does the existing patch break that? IIUC does not go to
> > a 3 device model since netvsc calls failover_register directly.
> >  
> >> Also, I am working on associating netvsc and VF device based on serial number
> >> rather than MAC address. The serial number is how Windows works now, and it makes
> >> sense for Linux and Windows to use the same mechanism if possible.  
> >
> > Maybe we should support same for virtio ...
> > Which serial do you mean? From vpd?
> >
> > I guess you will want to keep supporting MAC for old hypervisors?

The serial number has always been in the hypervisor since original support of SR-IOV
in WS2008.  So no backward compatibility special cases would be needed.

^ permalink raw reply

* Re: [PATCH net 0/3] amd-xgbe: AMD XGBE driver fixes 2018-04-23
From: David Miller @ 2018-04-24  1:24 UTC (permalink / raw)
  To: thomas.lendacky; +Cc: netdev
In-Reply-To: <20180423164258.18740.98574.stgit@tlendack-t1.amdoffice.net>

From: Tom Lendacky <thomas.lendacky@amd.com>
Date: Mon, 23 Apr 2018 11:42:58 -0500

> This patch series addresses some issues in the AMD XGBE driver.
> 
> The following fixes are included in this driver update series:
> 
> - Improve KR auto-negotiation and training (2 patches)
>   - Add pre and post auto-negotiation hooks
>   - Use the pre and post auto-negotiation hooks to disable CDR tracking
>     during auto-negotiation page exchange in KR mode
> - Check for SFP tranceiver signal support and only use the signal if the
>   SFP indicates that it is supported
> 
> This patch series is based on net.
> 
> ---
> 
> Please queue this patch series up to stable releases 4.14 and above.

Series applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [RFC v2] virtio: support packed ring
From: Tiwei Bie @ 2018-04-24  1:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, wexu, virtualization, linux-kernel, netdev, jfreimann
In-Reply-To: <20180424040258-mutt-send-email-mst@kernel.org>

On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> > 
> > 
> > On 2018年04月23日 17:29, Tiwei Bie wrote:
> > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > Hello everyone,
> > > > > 
> > > > > This RFC implements packed ring support for virtio driver.
> > > > > 
> > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > > > > 
> > > > > TODO:
> > > > > - Refinements and bug fixes;
> > > > > - Split into small patches;
> > > > > - Test indirect descriptor support;
> > > > > - Test/fix event suppression support;
> > > > > - Test devices other than net;
> > > > > 
> > > > > RFC v1 -> RFC v2:
> > > > > - Add indirect descriptor support - compile test only;
> > > > > - Add event suppression supprt - compile test only;
> > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > > - Avoid using '%' operator (Jason);
> > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > > - Some other refinements and bug fixes;
> > > > > 
> > > > > Thanks!
> > > > > 
> > > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > > ---
> > > > >    drivers/virtio/virtio_ring.c       | 1094 +++++++++++++++++++++++++++++-------
> > > > >    include/linux/virtio_ring.h        |    8 +-
> > > > >    include/uapi/linux/virtio_config.h |   12 +-
> > > > >    include/uapi/linux/virtio_ring.h   |   61 ++
> > > > >    4 files changed, 980 insertions(+), 195 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -58,14 +58,15 @@
> > > > [...]
> > > > 
> > > > > +
> > > > > +	if (vq->indirect) {
> > > > > +		u32 len;
> > > > > +
> > > > > +		desc = vq->desc_state[head].indir_desc;
> > > > > +		/* Free the indirect table, if any, now that it's unmapped. */
> > > > > +		if (!desc)
> > > > > +			goto out;
> > > > > +
> > > > > +		len = virtio32_to_cpu(vq->vq.vdev,
> > > > > +				      vq->vring_packed.desc[head].len);
> > > > > +
> > > > > +		BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > > +			 cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
> > > > can safely remove this BUG_ON() here.
> > > > 
> > > > > +		BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
> > > > Len could be ignored for used descriptor according to the spec, so we need
> > > > remove this BUG_ON() too.
> > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > > And I think something related to this in the spec isn't very
> > > clear currently.
> > > 
> > > In the spec, there are below words:
> > > 
> > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > > """
> > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > > is reserved and is ignored by the device.
> > > """
> > > 
> > > So when device writes back an used descriptor in this case,
> > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > > is reserved and should be ignored.
> > > 
> > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> > > """
> > > Element Length is reserved for used descriptors without the
> > > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> > > """
> > > 
> > > And this is the way how driver ignores the `len` in an used
> > > descriptor.
> > > 
> > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> > > """
> > > To increase ring capacity the driver can store a (read-only
> > > by the device) table of indirect descriptors anywhere in memory,
> > > and insert a descriptor in the main virtqueue (with \field{Flags}
> > > bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> > > containing this indirect descriptor table;
> > > """
> > > 
> > > So the indirect descriptors in the table are read-only by
> > > the device. And the only descriptor which is writeable by
> > > the device is the descriptor in the main virtqueue (with
> > > Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
> > > `len` in this descriptor, we won't be able to get the
> > > length of the data written by the device.
> > > 
> > > So I think the `len` in this descriptor will carry the
> > > length of the data written by the device (if the buffers
> > > are writable to the device) even if the VIRTQ_DESC_F_WRITE
> > > isn't set by the device. How do you think?
> > 
> > Yes I think so. But we'd better need clarification from Michael.
> 
> I think if you use a descriptor, and you want to supply len
> to guest, you set VIRTQ_DESC_F_WRITE in the used descriptor.
> Spec also says you must not set VIRTQ_DESC_F_INDIRECT then.
> If that's a problem we could look at relaxing that last requirement -
> does driver want INDIRECT in used descriptor to match
> the value in the avail descriptor for some reason?

For indirect, driver needs some way to get the length
of the data written by the driver. And the descriptors
in the indirect table is read-only, so the only place
device could put this value is the descriptor with the
VIRTQ_DESC_F_INDIRECT flag set.

> 
> > > 
> > > 
> > > > The reason is we don't touch descriptor ring in the case of split, so
> > > > BUG_ON()s may help there.
> > > > 
> > > > > +
> > > > > +		for (j = 0; j < len / sizeof(struct vring_packed_desc); j++)
> > > > > +			vring_unmap_one_packed(vq, &desc[j]);
> > > > > +
> > > > > +		kfree(desc);
> > > > > +		vq->desc_state[head].indir_desc = NULL;
> > > > > +	} else if (ctx) {
> > > > > +		*ctx = vq->desc_state[head].indir_desc;
> > > > > +	}
> > > > > +
> > > > > +out:
> > > > > +	return vq->desc_state[head].num;
> > > > > +}
> > > > > +
> > > > > +static inline bool more_used_split(const struct vring_virtqueue *vq)
> > > > >    {
> > > > >    	return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
> > > > >    }
> > > > > +static inline bool more_used_packed(const struct vring_virtqueue *vq)
> > > > > +{
> > > > > +	u16 last_used, flags;
> > > > > +	bool avail, used;
> > > > > +
> > > > > +	if (vq->vq.num_free == vq->vring_packed.num)
> > > > > +		return false;
> > > > > +
> > > > > +	last_used = vq->last_used_idx;
> > > > > +	flags = virtio16_to_cpu(vq->vq.vdev,
> > > > > +				vq->vring_packed.desc[last_used].flags);
> > > > > +	avail = flags & VRING_DESC_F_AVAIL(1);
> > > > > +	used = flags & VRING_DESC_F_USED(1);
> > > > > +
> > > > > +	return avail == used;
> > > > > +}
> > > > This looks interesting, spec said:
> > > > 
> > > > "
> > > > Thus VIRTQ_DESC_F_AVAIL and VIRTQ_DESC_F_USED bits are different for an
> > > > available descriptor and
> > > > equal for a used descriptor.
> > > > Note that this observation is mostly useful for sanity-checking as these are
> > > > necessary but not sufficient
> > > > conditions - for example, all descriptors are zero-initialized. To detect
> > > > used and available descriptors it is
> > > > possible for drivers and devices to keep track of the last observed value of
> > > > VIRTQ_DESC_F_USED/VIRTQ_-
> > > > DESC_F_AVAIL. Other techniques to detect
> > > > VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
> > > > might also be possible.
> > > > "
> > > > 
> > > > So it looks to me it was not sufficient, looking at the example codes in
> > > > spec, do we need to track last seen used_wrap_counter here?
> > > I don't think we have to track used_wrap_counter in
> > > driver. There was a discussion on this:
> > > 
> > > https://lists.oasis-open.org/archives/virtio-dev/201802/msg00177.html
> > > 
> > > And after that, below sentence was added (it's also
> > > in the above words you quoted):
> > > 
> > > """
> > > Other techniques to detect
> > > VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
> > > might also be possible.
> > > """
> > > 
> > > Best regards,
> > > Tiwei Bie
> > 
> > I see, the extra condition "if (vq->vq.num_free == vq->vring_packed.num)"
> > help in this case.
> > 
> > Thanks
> 
> I still think tracking a wrap counter is better.

>From my understanding, wrap counter is only needed when
one side just want to update parts of the status bit(s),
it's something like the "report status" or "write back"
feature [1] in the hardware NIC. And in the driver, all
the status must be updated, and that's why I don't want
to track the usedwrap counter.

[1] https://github.com/btw616/dpdk-virtio1.1/commit/ca837865bd10

Best regards,
Tiwei Bie

> 
> > > 
> > > > Thanks

^ permalink raw reply

* Re: [RFC v2] virtio: support packed ring
From: Jason Wang @ 2018-04-24  1:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tiwei Bie, wexu, virtualization, linux-kernel, netdev, jfreimann
In-Reply-To: <20180424040258-mutt-send-email-mst@kernel.org>



On 2018年04月24日 09:05, Michael S. Tsirkin wrote:
>>>>> +	if (vq->indirect) {
>>>>> +		u32 len;
>>>>> +
>>>>> +		desc = vq->desc_state[head].indir_desc;
>>>>> +		/* Free the indirect table, if any, now that it's unmapped. */
>>>>> +		if (!desc)
>>>>> +			goto out;
>>>>> +
>>>>> +		len = virtio32_to_cpu(vq->vq.vdev,
>>>>> +				      vq->vring_packed.desc[head].len);
>>>>> +
>>>>> +		BUG_ON(!(vq->vring_packed.desc[head].flags &
>>>>> +			 cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
>>>> It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
>>>> can safely remove this BUG_ON() here.
>>>>
>>>>> +		BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
>>>> Len could be ignored for used descriptor according to the spec, so we need
>>>> remove this BUG_ON() too.
>>> Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
>>> And I think something related to this in the spec isn't very
>>> clear currently.
>>>
>>> In the spec, there are below words:
>>>
>>> https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
>>> """
>>> In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
>>> is reserved and is ignored by the device.
>>> """
>>>
>>> So when device writes back an used descriptor in this case,
>>> device may not set the VIRTQ_DESC_F_WRITE flag as the flag
>>> is reserved and should be ignored.
>>>
>>> https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
>>> """
>>> Element Length is reserved for used descriptors without the
>>> VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
>>> """
>>>
>>> And this is the way how driver ignores the `len` in an used
>>> descriptor.
>>>
>>> https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
>>> """
>>> To increase ring capacity the driver can store a (read-only
>>> by the device) table of indirect descriptors anywhere in memory,
>>> and insert a descriptor in the main virtqueue (with \field{Flags}
>>> bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
>>> containing this indirect descriptor table;
>>> """
>>>
>>> So the indirect descriptors in the table are read-only by
>>> the device. And the only descriptor which is writeable by
>>> the device is the descriptor in the main virtqueue (with
>>> Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
>>> `len` in this descriptor, we won't be able to get the
>>> length of the data written by the device.
>>>
>>> So I think the `len` in this descriptor will carry the
>>> length of the data written by the device (if the buffers
>>> are writable to the device) even if the VIRTQ_DESC_F_WRITE
>>> isn't set by the device. How do you think?
>> Yes I think so. But we'd better need clarification from Michael.
> I think if you use a descriptor, and you want to supply len
> to guest, you set VIRTQ_DESC_F_WRITE in the used descriptor.
> Spec also says you must not set VIRTQ_DESC_F_INDIRECT then.
> If that's a problem we could look at relaxing that last requirement -
> does driver want INDIRECT in used descriptor to match
> the value in the avail descriptor for some reason?
>

Looks not, so what I get it:

- device and set VIRTQ_DESC_F_WRITE flag for used descriptor when needed
- there no need to keep INDIRECT flag in used descriptor

So for the above case, we can just have a used descriptor with _F_WRITE 
but without INDIRECT flag.

Thanks

^ permalink raw reply

* Re: [PATCH net] pppoe: check sockaddr length in pppoe_connect()
From: David Miller @ 2018-04-24  1:12 UTC (permalink / raw)
  To: g.nault; +Cc: netdev, mostrows
In-Reply-To: <387ca48810af36f2626049008a795d1adc375cb8.1524494257.git.g.nault@alphalink.fr>

From: Guillaume Nault <g.nault@alphalink.fr>
Date: Mon, 23 Apr 2018 16:38:27 +0200

> We must validate sockaddr_len, otherwise userspace can pass fewer data
> than we expect and we end up accessing invalid data.
> 
> Fixes: 224cf5ad14c0 ("ppp: Move the PPP drivers")
> Reported-by: syzbot+4f03bdf92fdf9ef5ddab@syzkaller.appspotmail.com
> Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>

Applied and queued up for -stable, thank you.

^ permalink raw reply

* Re: [PATCH net] l2tp: check sockaddr length in pppol2tp_connect()
From: David Miller @ 2018-04-24  1:11 UTC (permalink / raw)
  To: g.nault; +Cc: netdev, jchapman
In-Reply-To: <e76f65e9826fe7591dc8bba561461c993e5643e0.1524492877.git.g.nault@alphalink.fr>

From: Guillaume Nault <g.nault@alphalink.fr>
Date: Mon, 23 Apr 2018 16:15:14 +0200

> Check sockaddr_len before dereferencing sp->sa_protocol, to ensure that
> it actually points to valid data.
> 
> Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
> Reported-by: syzbot+a70ac890b23b1bf29f5c@syzkaller.appspotmail.com
> Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>

Applied and queued up for -stable.

I guess you can completely remove the "bad socket address" -EINVAL else
clause later in the function as a cleanup in net-next.

^ permalink raw reply

* Re: [PATCH] selftests: net: update .gitignore with missing test
From: David Miller @ 2018-04-24  1:07 UTC (permalink / raw)
  To: anders.roxell; +Cc: shuah, netdev, linux-kselftest, linux-kernel
In-Reply-To: <20180423140050.4508-1-anders.roxell@linaro.org>

From: Anders Roxell <anders.roxell@linaro.org>
Date: Mon, 23 Apr 2018 16:00:50 +0200

> Fixes: 192dc405f308 ("selftests: net: add tcp_mmap program")
> Signed-off-by: Anders Roxell <anders.roxell@linaro.org>

Applied, thanks.

^ permalink raw reply

* Re: [RFC v2] virtio: support packed ring
From: Michael S. Tsirkin @ 2018-04-24  1:05 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tiwei Bie, wexu, virtualization, linux-kernel, netdev, jfreimann
In-Reply-To: <e08c55df-6ea2-cf7f-d7c9-91e55913ab16@redhat.com>

On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> 
> 
> On 2018年04月23日 17:29, Tiwei Bie wrote:
> > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > Hello everyone,
> > > > 
> > > > This RFC implements packed ring support for virtio driver.
> > > > 
> > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > > > 
> > > > TODO:
> > > > - Refinements and bug fixes;
> > > > - Split into small patches;
> > > > - Test indirect descriptor support;
> > > > - Test/fix event suppression support;
> > > > - Test devices other than net;
> > > > 
> > > > RFC v1 -> RFC v2:
> > > > - Add indirect descriptor support - compile test only;
> > > > - Add event suppression supprt - compile test only;
> > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > - Avoid using '%' operator (Jason);
> > > > - Rename free_head -> next_avail_idx (Jason);
> > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > - Some other refinements and bug fixes;
> > > > 
> > > > Thanks!
> > > > 
> > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > ---
> > > >    drivers/virtio/virtio_ring.c       | 1094 +++++++++++++++++++++++++++++-------
> > > >    include/linux/virtio_ring.h        |    8 +-
> > > >    include/uapi/linux/virtio_config.h |   12 +-
> > > >    include/uapi/linux/virtio_ring.h   |   61 ++
> > > >    4 files changed, 980 insertions(+), 195 deletions(-)
> > > > 
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index 71458f493cf8..0515dca34d77 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -58,14 +58,15 @@
> > > [...]
> > > 
> > > > +
> > > > +	if (vq->indirect) {
> > > > +		u32 len;
> > > > +
> > > > +		desc = vq->desc_state[head].indir_desc;
> > > > +		/* Free the indirect table, if any, now that it's unmapped. */
> > > > +		if (!desc)
> > > > +			goto out;
> > > > +
> > > > +		len = virtio32_to_cpu(vq->vq.vdev,
> > > > +				      vq->vring_packed.desc[head].len);
> > > > +
> > > > +		BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > +			 cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
> > > can safely remove this BUG_ON() here.
> > > 
> > > > +		BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
> > > Len could be ignored for used descriptor according to the spec, so we need
> > > remove this BUG_ON() too.
> > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > And I think something related to this in the spec isn't very
> > clear currently.
> > 
> > In the spec, there are below words:
> > 
> > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > """
> > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > is reserved and is ignored by the device.
> > """
> > 
> > So when device writes back an used descriptor in this case,
> > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > is reserved and should be ignored.
> > 
> > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> > """
> > Element Length is reserved for used descriptors without the
> > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> > """
> > 
> > And this is the way how driver ignores the `len` in an used
> > descriptor.
> > 
> > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> > """
> > To increase ring capacity the driver can store a (read-only
> > by the device) table of indirect descriptors anywhere in memory,
> > and insert a descriptor in the main virtqueue (with \field{Flags}
> > bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> > containing this indirect descriptor table;
> > """
> > 
> > So the indirect descriptors in the table are read-only by
> > the device. And the only descriptor which is writeable by
> > the device is the descriptor in the main virtqueue (with
> > Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
> > `len` in this descriptor, we won't be able to get the
> > length of the data written by the device.
> > 
> > So I think the `len` in this descriptor will carry the
> > length of the data written by the device (if the buffers
> > are writable to the device) even if the VIRTQ_DESC_F_WRITE
> > isn't set by the device. How do you think?
> 
> Yes I think so. But we'd better need clarification from Michael.

I think if you use a descriptor, and you want to supply len
to guest, you set VIRTQ_DESC_F_WRITE in the used descriptor.
Spec also says you must not set VIRTQ_DESC_F_INDIRECT then.
If that's a problem we could look at relaxing that last requirement -
does driver want INDIRECT in used descriptor to match
the value in the avail descriptor for some reason?

> > 
> > 
> > > The reason is we don't touch descriptor ring in the case of split, so
> > > BUG_ON()s may help there.
> > > 
> > > > +
> > > > +		for (j = 0; j < len / sizeof(struct vring_packed_desc); j++)
> > > > +			vring_unmap_one_packed(vq, &desc[j]);
> > > > +
> > > > +		kfree(desc);
> > > > +		vq->desc_state[head].indir_desc = NULL;
> > > > +	} else if (ctx) {
> > > > +		*ctx = vq->desc_state[head].indir_desc;
> > > > +	}
> > > > +
> > > > +out:
> > > > +	return vq->desc_state[head].num;
> > > > +}
> > > > +
> > > > +static inline bool more_used_split(const struct vring_virtqueue *vq)
> > > >    {
> > > >    	return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
> > > >    }
> > > > +static inline bool more_used_packed(const struct vring_virtqueue *vq)
> > > > +{
> > > > +	u16 last_used, flags;
> > > > +	bool avail, used;
> > > > +
> > > > +	if (vq->vq.num_free == vq->vring_packed.num)
> > > > +		return false;
> > > > +
> > > > +	last_used = vq->last_used_idx;
> > > > +	flags = virtio16_to_cpu(vq->vq.vdev,
> > > > +				vq->vring_packed.desc[last_used].flags);
> > > > +	avail = flags & VRING_DESC_F_AVAIL(1);
> > > > +	used = flags & VRING_DESC_F_USED(1);
> > > > +
> > > > +	return avail == used;
> > > > +}
> > > This looks interesting, spec said:
> > > 
> > > "
> > > Thus VIRTQ_DESC_F_AVAIL and VIRTQ_DESC_F_USED bits are different for an
> > > available descriptor and
> > > equal for a used descriptor.
> > > Note that this observation is mostly useful for sanity-checking as these are
> > > necessary but not sufficient
> > > conditions - for example, all descriptors are zero-initialized. To detect
> > > used and available descriptors it is
> > > possible for drivers and devices to keep track of the last observed value of
> > > VIRTQ_DESC_F_USED/VIRTQ_-
> > > DESC_F_AVAIL. Other techniques to detect
> > > VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
> > > might also be possible.
> > > "
> > > 
> > > So it looks to me it was not sufficient, looking at the example codes in
> > > spec, do we need to track last seen used_wrap_counter here?
> > I don't think we have to track used_wrap_counter in
> > driver. There was a discussion on this:
> > 
> > https://lists.oasis-open.org/archives/virtio-dev/201802/msg00177.html
> > 
> > And after that, below sentence was added (it's also
> > in the above words you quoted):
> > 
> > """
> > Other techniques to detect
> > VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
> > might also be possible.
> > """
> > 
> > Best regards,
> > Tiwei Bie
> 
> I see, the extra condition "if (vq->vq.num_free == vq->vring_packed.num)"
> help in this case.
> 
> Thanks

I still think tracking a wrap counter is better.

> > 
> > > Thanks

^ permalink raw reply

* Re: [RFC V3 PATCH 0/8] Packed ring for vhost
From: Jason Wang @ 2018-04-24  1:04 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Michael S. Tsirkin
  Cc: kvm, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180423201151.GD5215@char.us.oracle.com>



On 2018年04月24日 04:11, Konrad Rzeszutek Wilk wrote:
> On Mon, Apr 23, 2018 at 10:59:43PM +0300, Michael S. Tsirkin wrote:
>> On Mon, Apr 23, 2018 at 03:31:20PM -0400, Konrad Rzeszutek Wilk wrote:
>>> On Mon, Apr 23, 2018 at 01:34:52PM +0800, Jason Wang wrote:
>>>> Hi all:
>>>>
>>>> This RFC implement packed ring layout. The code were tested with
>>>> Tiwei's RFC V2 a thttps://lkml.org/lkml/2018/4/1/48. Some fixups and
>>>> tweaks were needed on top of Tiwei's code to make it run. TCP stream
>>>> and pktgen does not show obvious difference compared with split ring.
>>> I have to ask then - what is the benefit of this?
>> You can use this with dpdk within guest.
>> The DPDK version did see a performance improvement so hopefully with
> Is there a link to this performance improvement document?
>

You probably want to this:

https://www.mail-archive.com/dev@dpdk.org/msg97735.html

Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH] dca: make function dca_common_get_tag static
From: David Miller @ 2018-04-24  1:03 UTC (permalink / raw)
  To: colin.king; +Cc: netdev, kernel-janitors, linux-kernel
In-Reply-To: <20180423124938.26070-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Mon, 23 Apr 2018 13:49:38 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> Function dca_common_get_tag is local to the source and does not need to be
> in global scope, so make it static.
> 
> Cleans up sparse warning:
> drivers/dca/dca-core.c:273:4: warning: symbol 'dca_common_get_tag' was
> not declared. Should it be static?
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied to net-next, thanks.

^ permalink raw reply

* Re: [PATCH V6 net-next 08/15] net/tls: Support TLS device offload with IPv6
From: David Miller @ 2018-04-24  0:56 UTC (permalink / raw)
  To: borisp; +Cc: netdev, saeedm, davejwatson, ktkhai, ilyal
In-Reply-To: <1524410397-108425-9-git-send-email-borisp@mellanox.com>

From: Boris Pismenny <borisp@mellanox.com>
Date: Sun, 22 Apr 2018 18:19:50 +0300

> @@ -97,13 +102,57 @@ static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
>  	spin_unlock_irqrestore(&tls_device_lock, flags);
>  }
>  
> +#if IS_ENABLED(CONFIG_IPV6)
> +static struct net_device *ipv6_get_netdev(struct sock *sk)
> +{
> +	struct net_device *dev = NULL;
> +	struct inet_sock *inet = inet_sk(sk);
> +	struct ipv6_pinfo *np = inet6_sk(sk);
> +	struct flowi6 _fl6, *fl6 = &_fl6;
> +	struct dst_entry *dst;

Ugh, please use sk->sk_dst_cache->dev and avoid all of the unnecessary
work.

Thank you.

^ permalink raw reply

* Re: [PATCH V6 net-next 07/15] net/tls: Add generic NIC offload infrastructure
From: David Miller @ 2018-04-24  0:55 UTC (permalink / raw)
  To: borisp; +Cc: netdev, saeedm, davejwatson, ktkhai, ilyal, aviadye
In-Reply-To: <1524410397-108425-8-git-send-email-borisp@mellanox.com>

From: Boris Pismenny <borisp@mellanox.com>
Date: Sun, 22 Apr 2018 18:19:49 +0300

> +/* We assume that the socket is already connected */
> +static struct net_device *get_netdev_for_sock(struct sock *sk)
> +{
> +	struct inet_sock *inet = inet_sk(sk);
> +	struct net_device *netdev = NULL;
> +
> +	netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
> +
> +	return netdev;
> +}

Please do this more directly by looking at sk->sk_dst_cache and taking
the device from dst->dev if non-NULL.

That avoids the dev_get_by_index() demux work, and also if
sk->sk_dst_cache is non-NULL then the socket is indeed connected.

Thanks.

^ permalink raw reply

* Re: [RFC v2] virtio: support packed ring
From: Jason Wang @ 2018-04-24  0:54 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180423092908.77rii3gi7dcaf7o6@debian>



On 2018年04月23日 17:29, Tiwei Bie wrote:
> On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
>> On 2018年04月01日 22:12, Tiwei Bie wrote:
>>> Hello everyone,
>>>
>>> This RFC implements packed ring support for virtio driver.
>>>
>>> The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
>>> by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
>>> Minor changes are needed for the vhost code, e.g. to kick the guest.
>>>
>>> TODO:
>>> - Refinements and bug fixes;
>>> - Split into small patches;
>>> - Test indirect descriptor support;
>>> - Test/fix event suppression support;
>>> - Test devices other than net;
>>>
>>> RFC v1 -> RFC v2:
>>> - Add indirect descriptor support - compile test only;
>>> - Add event suppression supprt - compile test only;
>>> - Move vring_packed_init() out of uapi (Jason, MST);
>>> - Merge two loops into one in virtqueue_add_packed() (Jason);
>>> - Split vring_unmap_one() for packed ring and split ring (Jason);
>>> - Avoid using '%' operator (Jason);
>>> - Rename free_head -> next_avail_idx (Jason);
>>> - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
>>> - Some other refinements and bug fixes;
>>>
>>> Thanks!
>>>
>>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
>>> ---
>>>    drivers/virtio/virtio_ring.c       | 1094 +++++++++++++++++++++++++++++-------
>>>    include/linux/virtio_ring.h        |    8 +-
>>>    include/uapi/linux/virtio_config.h |   12 +-
>>>    include/uapi/linux/virtio_ring.h   |   61 ++
>>>    4 files changed, 980 insertions(+), 195 deletions(-)
>>>
>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>> index 71458f493cf8..0515dca34d77 100644
>>> --- a/drivers/virtio/virtio_ring.c
>>> +++ b/drivers/virtio/virtio_ring.c
>>> @@ -58,14 +58,15 @@
>> [...]
>>
>>> +
>>> +	if (vq->indirect) {
>>> +		u32 len;
>>> +
>>> +		desc = vq->desc_state[head].indir_desc;
>>> +		/* Free the indirect table, if any, now that it's unmapped. */
>>> +		if (!desc)
>>> +			goto out;
>>> +
>>> +		len = virtio32_to_cpu(vq->vq.vdev,
>>> +				      vq->vring_packed.desc[head].len);
>>> +
>>> +		BUG_ON(!(vq->vring_packed.desc[head].flags &
>>> +			 cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
>> It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
>> can safely remove this BUG_ON() here.
>>
>>> +		BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
>> Len could be ignored for used descriptor according to the spec, so we need
>> remove this BUG_ON() too.
> Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> And I think something related to this in the spec isn't very
> clear currently.
>
> In the spec, there are below words:
>
> https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> """
> In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> is reserved and is ignored by the device.
> """
>
> So when device writes back an used descriptor in this case,
> device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> is reserved and should be ignored.
>
> https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> """
> Element Length is reserved for used descriptors without the
> VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> """
>
> And this is the way how driver ignores the `len` in an used
> descriptor.
>
> https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> """
> To increase ring capacity the driver can store a (read-only
> by the device) table of indirect descriptors anywhere in memory,
> and insert a descriptor in the main virtqueue (with \field{Flags}
> bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> containing this indirect descriptor table;
> """
>
> So the indirect descriptors in the table are read-only by
> the device. And the only descriptor which is writeable by
> the device is the descriptor in the main virtqueue (with
> Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
> `len` in this descriptor, we won't be able to get the
> length of the data written by the device.
>
> So I think the `len` in this descriptor will carry the
> length of the data written by the device (if the buffers
> are writable to the device) even if the VIRTQ_DESC_F_WRITE
> isn't set by the device. How do you think?

Yes I think so. But we'd better need clarification from Michael.

>
>
>> The reason is we don't touch descriptor ring in the case of split, so
>> BUG_ON()s may help there.
>>
>>> +
>>> +		for (j = 0; j < len / sizeof(struct vring_packed_desc); j++)
>>> +			vring_unmap_one_packed(vq, &desc[j]);
>>> +
>>> +		kfree(desc);
>>> +		vq->desc_state[head].indir_desc = NULL;
>>> +	} else if (ctx) {
>>> +		*ctx = vq->desc_state[head].indir_desc;
>>> +	}
>>> +
>>> +out:
>>> +	return vq->desc_state[head].num;
>>> +}
>>> +
>>> +static inline bool more_used_split(const struct vring_virtqueue *vq)
>>>    {
>>>    	return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
>>>    }
>>> +static inline bool more_used_packed(const struct vring_virtqueue *vq)
>>> +{
>>> +	u16 last_used, flags;
>>> +	bool avail, used;
>>> +
>>> +	if (vq->vq.num_free == vq->vring_packed.num)
>>> +		return false;
>>> +
>>> +	last_used = vq->last_used_idx;
>>> +	flags = virtio16_to_cpu(vq->vq.vdev,
>>> +				vq->vring_packed.desc[last_used].flags);
>>> +	avail = flags & VRING_DESC_F_AVAIL(1);
>>> +	used = flags & VRING_DESC_F_USED(1);
>>> +
>>> +	return avail == used;
>>> +}
>> This looks interesting, spec said:
>>
>> "
>> Thus VIRTQ_DESC_F_AVAIL and VIRTQ_DESC_F_USED bits are different for an
>> available descriptor and
>> equal for a used descriptor.
>> Note that this observation is mostly useful for sanity-checking as these are
>> necessary but not sufficient
>> conditions - for example, all descriptors are zero-initialized. To detect
>> used and available descriptors it is
>> possible for drivers and devices to keep track of the last observed value of
>> VIRTQ_DESC_F_USED/VIRTQ_-
>> DESC_F_AVAIL. Other techniques to detect
>> VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
>> might also be possible.
>> "
>>
>> So it looks to me it was not sufficient, looking at the example codes in
>> spec, do we need to track last seen used_wrap_counter here?
> I don't think we have to track used_wrap_counter in
> driver. There was a discussion on this:
>
> https://lists.oasis-open.org/archives/virtio-dev/201802/msg00177.html
>
> And after that, below sentence was added (it's also
> in the above words you quoted):
>
> """
> Other techniques to detect
> VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
> might also be possible.
> """
>
> Best regards,
> Tiwei Bie

I see, the extra condition "if (vq->vq.num_free == 
vq->vring_packed.num)" help in this case.

Thanks

>
>> Thanks

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Mikulas Patocka @ 2018-04-24  0:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, David Miller, Andrew Morton, linux-mm,
	eric.dumazet, edumazet, netdev, linux-kernel, mst, jasowang,
	virtualization, dm-devel, Vlastimil Babka
In-Reply-To: <20180423151545.GU17484@dhcp22.suse.cz>



On Mon, 23 Apr 2018, Michal Hocko wrote:

> On Mon 23-04-18 10:06:08, Mikulas Patocka wrote:
> 
> > > > He didn't want to fix vmalloc(GFP_NOIO)
> > > 
> > > I don't remember that conversation, so I don't know whether I agree with
> > > his reasoning or not.  But we are supposed to be moving away from GFP_NOIO
> > > towards marking regions with memalloc_noio_save() / restore.  If you do
> > > that, you won't need vmalloc(GFP_NOIO).
> > 
> > He said the same thing a year ago. And there was small progress. 6 out of 
> > 27 __vmalloc calls were converted to memalloc_noio_save in a year - 5 in 
> > infiniband and 1 in btrfs. (the whole discussion is here 
> > http://lkml.iu.edu/hypermail/linux/kernel/1706.3/04681.html )
> 
> Well this is not that easy. It requires a cooperation from maintainers.
> I can only do as much. I've posted patches in the past and actively
> bringing up this topic at LSFMM last two years...

You're right - but you have chosen the uneasy path. Fixing __vmalloc code 
is easy and it doesn't require cooperation with maintainers.

> > He refuses 15-line patch to fix GFP_NOIO bug because he believes that in 4 
> > years, the kernel will be refactored and GFP_NOIO will be eliminated. Why 
> > does he have veto over this part of the code? I'd much rather argue with 
> > people who have constructive comments about fixing bugs than with him.
> 
> I didn't NACK the patch AFAIR. I've said it is not a good idea longterm.
> I would be much more willing to change my mind if you would back your
> patch by a real bug report. Hacks are acceptable when we have a real
> issue in hands. But if we want to fix potential issue then better make
> it properly.

Developers should fix bugs in advance, not to wait until a crash hapens, 
is analyzed and reported.

What's the problem with 15-line hack? Is the problem that kernel 
developers would feel depressed when looking the source code? Other than 
harming developers' feelings, I don't see what kind of damange could that 
piece of code do.

Mikulas

^ permalink raw reply

* Re: [PATCH bpf-next v4 1/2] bpf: extend stackmap to save binary_build_id+offset instead of address
From: Yonghong Song @ 2018-04-24  0:06 UTC (permalink / raw)
  To: Peter Zijlstra, Song Liu
  Cc: netdev, ast, daniel, kernel-team, hannes, qinteng
In-Reply-To: <20180312220132.GH4043@hirez.programming.kicks-ass.net>


Hi, Peter,

I have a question regarding to one of your comments below.

On 3/12/18 3:01 PM, Peter Zijlstra wrote:
> On Mon, Mar 12, 2018 at 01:39:56PM -0700, Song Liu wrote:
>> +static void stack_map_get_build_id_offset(struct bpf_map *map,
>> +					  struct stack_map_bucket *bucket,
>> +					  u64 *ips, u32 trace_nr)
>> +{
>> +	int i;
>> +	struct vm_area_struct *vma;
>> +	struct bpf_stack_build_id *id_offs;
>> +
>> +	bucket->nr = trace_nr;
>> +	id_offs = (struct bpf_stack_build_id *)bucket->data;
>> +
>> +	if (!current || !current->mm ||
>> +	    down_read_trylock(&current->mm->mmap_sem) == 0) {
> 
> You probably want an in_nmi() before the down_read_trylock(). Doing
> up_read() is an absolute no-no from NMI context.

The below is the final code from Song:

         /*
          * We cannot do up_read() in nmi context, so build_id lookup is
          * only supported for non-nmi events. If at some point, it is
          * possible to run find_vma() without taking the semaphore, we
          * would like to allow build_id lookup in nmi context.
          *
          * Same fallback is used for kernel stack (!user) on a stackmap
          * with build_id.
          */
         if (!user || !current || !current->mm || in_nmi() ||
             down_read_trylock(&current->mm->mmap_sem) == 0) {
                 /* cannot access current->mm, fall back to ips */
                 for (i = 0; i < trace_nr; i++) {
                         id_offs[i].status = BPF_STACK_BUILD_ID_IP;
                         id_offs[i].ip = ips[i];
                 }
                 return;
         }

         ....

> 
> And IIUC its 'trivial' to use this stuff with hardware counters.

Here, you mentioned that it was 'trivial' to use buildid thing with 
hardware counters, if I interpreted correctly. However, the hardware 
counter overflow will trigger NMI. Based on the above logic,
it will default to old IP only behavior.

Could you explain a little more how to get buildid with hardware
counter overflow events?

Thanks!

> 
>> +		/* cannot access current->mm, fall back to ips */
>> +		for (i = 0; i < trace_nr; i++) {
>> +			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
>> +			id_offs[i].ip = ips[i];
>> +		}
>> +		return;
>> +	}
>> +
>> +	for (i = 0; i < trace_nr; i++) {
>> +		vma = find_vma(current->mm, ips[i]);
>> +		if (!vma || stack_map_get_build_id(vma, id_offs[i].build_id)) {
>> +			/* per entry fall back to ips */
>> +			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
>> +			id_offs[i].ip = ips[i];
>> +			continue;
>> +		}
>> +		id_offs[i].offset = (vma->vm_pgoff << PAGE_SHIFT) + ips[i]
>> +			- vma->vm_start;
>> +		id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
>> +	}
>> +	up_read(&current->mm->mmap_sem);
>> +}
> 

^ permalink raw reply

* [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-24  0:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, David Miller, Andrew Morton, linux-mm,
	eric.dumazet, edumazet, netdev, linux-kernel, mst, jasowang,
	virtualization, dm-devel, Vlastimil Babka
In-Reply-To: <20180423151545.GU17484@dhcp22.suse.cz>

The kvmalloc function tries to use kmalloc and falls back to vmalloc if
kmalloc fails.

Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
uses DMA-API on the returned memory or frees it with kfree. Such bugs were
found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
code.

These bugs are hard to reproduce because kvmalloc falls back to vmalloc
only if memory is fragmented.

In order to detect these bugs reliably I submit this patch that changes
kvmalloc to fall back to vmalloc with 1/2 probability if CONFIG_DEBUG_SG
is turned on. CONFIG_DEBUG_SG is used, because it makes the DMA API layer
verify the addresses passed to it, and so the user will get a reliable
stacktrace.

Some bugs (such as buffer overflows) are better detected
with kmalloc code, so we must test the kmalloc path too.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 mm/util.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: linux-2.6/mm/util.c
===================================================================
--- linux-2.6.orig/mm/util.c	2018-04-23 00:12:05.000000000 +0200
+++ linux-2.6/mm/util.c	2018-04-23 17:57:02.000000000 +0200
@@ -14,6 +14,7 @@
 #include <linux/hugetlb.h>
 #include <linux/vmalloc.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/random.h>
 
 #include <asm/sections.h>
 #include <linux/uaccess.h>
@@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
 	 */
 	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
 
+#ifdef CONFIG_DEBUG_SG
+	/* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
+	if (!(prandom_u32_max(2) & 1))
+		goto do_vmalloc;
+#endif
+
 	/*
 	 * We want to attempt a large physically contiguous block first because
 	 * it is less likely to fragment multiple larger blocks and therefore
@@ -427,6 +434,9 @@ void *kvmalloc_node(size_t size, gfp_t f
 	if (ret || size <= PAGE_SIZE)
 		return ret;
 
+#ifdef CONFIG_DEBUG_SG
+do_vmalloc:
+#endif
 	return __vmalloc_node_flags_caller(size, node, flags,
 			__builtin_return_address(0));
 }

^ permalink raw reply

* Re: [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap
From: Willem de Bruijn @ 2018-04-23 23:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, Daniel Borkmann, Network Development,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z
In-Reply-To: <20180424021712-mutt-send-email-mst@kernel.org>

On Mon, Apr 23, 2018 at 7:21 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Apr 23, 2018 at 03:56:07PM +0200, Björn Töpel wrote:
>> From: Magnus Karlsson <magnus.karlsson@intel.com>
>>
>> Here, we add another setsockopt for registered user memory (umem)
>> called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
>> ask the kernel to allocate a queue (ring buffer) and also mmap it
>> (XDP_UMEM_PGOFF_FILL_QUEUE) into the process.
>>
>> The queue is used to explicitly pass ownership of umem frames from the
>> user process to the kernel. These frames will in a later patch be
>> filled in with Rx packet data by the kernel.
>>
>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>> ---
>>  include/uapi/linux/if_xdp.h | 15 +++++++++++
>>  net/xdp/Makefile            |  2 +-
>>  net/xdp/xdp_umem.c          |  5 ++++
>>  net/xdp/xdp_umem.h          |  2 ++
>>  net/xdp/xsk.c               | 62 ++++++++++++++++++++++++++++++++++++++++++++-
>>  net/xdp/xsk_queue.c         | 58 ++++++++++++++++++++++++++++++++++++++++++
>>  net/xdp/xsk_queue.h         | 38 +++++++++++++++++++++++++++
>>  7 files changed, 180 insertions(+), 2 deletions(-)
>>  create mode 100644 net/xdp/xsk_queue.c
>>  create mode 100644 net/xdp/xsk_queue.h
>>
>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>> index 41252135a0fe..975661e1baca 100644
>> --- a/include/uapi/linux/if_xdp.h
>> +++ b/include/uapi/linux/if_xdp.h
>> @@ -23,6 +23,7 @@
>>
>>  /* XDP socket options */
>>  #define XDP_UMEM_REG                 3
>> +#define XDP_UMEM_FILL_RING           4
>>
>>  struct xdp_umem_reg {
>>       __u64 addr; /* Start of packet data area */
>> @@ -31,4 +32,18 @@ struct xdp_umem_reg {
>>       __u32 frame_headroom; /* Frame head room */
>>  };
>>
>> +/* Pgoff for mmaping the rings */
>> +#define XDP_UMEM_PGOFF_FILL_RING     0x100000000
>> +
>> +struct xdp_ring {
>> +     __u32 producer __attribute__((aligned(64)));
>> +     __u32 consumer __attribute__((aligned(64)));
>> +};
>
> Why 64? And do you still need these guys in uapi?

I was just about to ask the same. You mean cacheline_aligned?

>> +static int xsk_mmap(struct file *file, struct socket *sock,
>> +                 struct vm_area_struct *vma)
>> +{
>> +     unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
>> +     unsigned long size = vma->vm_end - vma->vm_start;
>> +     struct xdp_sock *xs = xdp_sk(sock->sk);
>> +     struct xsk_queue *q;
>> +     unsigned long pfn;
>> +     struct page *qpg;
>> +
>> +     if (!xs->umem)
>> +             return -EINVAL;
>> +
>> +     if (offset == XDP_UMEM_PGOFF_FILL_RING)
>> +             q = xs->umem->fq;
>> +     else
>> +             return -EINVAL;
>> +
>> +     qpg = virt_to_head_page(q->ring);

Is it assured that q is initialized with a call to setsockopt
XDP_UMEM_FILL_RING before the call the mmap?

In general, with such an extensive new API, it might be worthwhile to
run syzkaller locally on a kernel with these patches. It is pretty
easy to set up (https://github.com/google/syzkaller/blob/master/docs/linux/setup.md),
though it also needs to be taught about any new APIs.

^ permalink raw reply

* Re: [PATCH bpf-next 15/15] samples/bpf: sample application for AF_XDP sockets
From: Michael S. Tsirkin @ 2018-04-23 23:31 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev, michael.lundkvist, jesse.brandeburg, anjali.singhai,
	qi.z.zhang, Björn Töpel
In-Reply-To: <20180423135619.7179-16-bjorn.topel@gmail.com>

On Mon, Apr 23, 2018 at 03:56:19PM +0200, Björn Töpel wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> This is a sample application for AF_XDP sockets. The application
> supports three different modes of operation: rxdrop, txonly and l2fwd.
> 
> To show-case a simple round-robin load-balancing between a set of
> sockets in an xskmap, set the RR_LB compile time define option to 1 in
> "xdpsock.h".
> 
> Co-authored-by: Björn Töpel <bjorn.topel@intel.com>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  samples/bpf/Makefile       |   4 +
>  samples/bpf/xdpsock.h      |  11 +
>  samples/bpf/xdpsock_kern.c |  56 +++
>  samples/bpf/xdpsock_user.c | 947 +++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 1018 insertions(+)
>  create mode 100644 samples/bpf/xdpsock.h
>  create mode 100644 samples/bpf/xdpsock_kern.c
>  create mode 100644 samples/bpf/xdpsock_user.c
> 
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index aa8c392e2e52..d0ddc1abf20d 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -45,6 +45,7 @@ hostprogs-y += xdp_rxq_info
>  hostprogs-y += syscall_tp
>  hostprogs-y += cpustat
>  hostprogs-y += xdp_adjust_tail
> +hostprogs-y += xdpsock
>  
>  # Libbpf dependencies
>  LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
> @@ -97,6 +98,7 @@ xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
>  syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
>  cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
>  xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
> +xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o
>  
>  # Tell kbuild to always build the programs
>  always := $(hostprogs-y)
> @@ -151,6 +153,7 @@ always += xdp2skb_meta_kern.o
>  always += syscall_tp_kern.o
>  always += cpustat_kern.o
>  always += xdp_adjust_tail_kern.o
> +always += xdpsock_kern.o
>  
>  HOSTCFLAGS += -I$(objtree)/usr/include
>  HOSTCFLAGS += -I$(srctree)/tools/lib/
> @@ -197,6 +200,7 @@ HOSTLOADLIBES_xdp_rxq_info += -lelf
>  HOSTLOADLIBES_syscall_tp += -lelf
>  HOSTLOADLIBES_cpustat += -lelf
>  HOSTLOADLIBES_xdp_adjust_tail += -lelf
> +HOSTLOADLIBES_xdpsock += -lelf -pthread
>  
>  # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
>  #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
> diff --git a/samples/bpf/xdpsock.h b/samples/bpf/xdpsock.h
> new file mode 100644
> index 000000000000..533ab81adfa1
> --- /dev/null
> +++ b/samples/bpf/xdpsock.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef XDPSOCK_H_
> +#define XDPSOCK_H_
> +
> +/* Power-of-2 number of sockets */
> +#define MAX_SOCKS 4
> +
> +/* Round-robin receive */
> +#define RR_LB 0
> +
> +#endif /* XDPSOCK_H_ */
> diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c
> new file mode 100644
> index 000000000000..d8806c41362e
> --- /dev/null
> +++ b/samples/bpf/xdpsock_kern.c
> @@ -0,0 +1,56 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define KBUILD_MODNAME "foo"
> +#include <uapi/linux/bpf.h>
> +#include "bpf_helpers.h"
> +
> +#include "xdpsock.h"
> +
> +struct bpf_map_def SEC("maps") qidconf_map = {
> +	.type		= BPF_MAP_TYPE_ARRAY,
> +	.key_size	= sizeof(int),
> +	.value_size	= sizeof(int),
> +	.max_entries	= 1,
> +};
> +
> +struct bpf_map_def SEC("maps") xsks_map = {
> +	.type = BPF_MAP_TYPE_XSKMAP,
> +	.key_size = sizeof(int),
> +	.value_size = sizeof(int),
> +	.max_entries = 4,
> +};
> +
> +struct bpf_map_def SEC("maps") rr_map = {
> +	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
> +	.key_size = sizeof(int),
> +	.value_size = sizeof(unsigned int),
> +	.max_entries = 1,
> +};
> +
> +SEC("xdp_sock")
> +int xdp_sock_prog(struct xdp_md *ctx)
> +{
> +	int *qidconf, key = 0, idx;
> +	unsigned int *rr;
> +
> +	qidconf = bpf_map_lookup_elem(&qidconf_map, &key);
> +	if (!qidconf)
> +		return XDP_ABORTED;
> +
> +	if (*qidconf != ctx->rx_queue_index)
> +		return XDP_PASS;
> +
> +#if RR_LB /* NB! RR_LB is configured in xdpsock.h */
> +	rr = bpf_map_lookup_elem(&rr_map, &key);
> +	if (!rr)
> +		return XDP_ABORTED;
> +
> +	*rr = (*rr + 1) & (MAX_SOCKS - 1);
> +	idx = *rr;
> +#else
> +	idx = 0;
> +#endif
> +
> +	return bpf_redirect_map(&xsks_map, idx, 0);
> +}
> +
> +char _license[] SEC("license") = "GPL";
> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
> new file mode 100644
> index 000000000000..690bac1a0ab7
> --- /dev/null
> +++ b/samples/bpf/xdpsock_user.c
> @@ -0,0 +1,947 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2017 - 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <assert.h>
> +#include <errno.h>
> +#include <getopt.h>
> +#include <libgen.h>
> +#include <linux/bpf.h>
> +#include <linux/if_link.h>
> +#include <linux/if_xdp.h>
> +#include <linux/if_ether.h>
> +#include <net/if.h>
> +#include <signal.h>
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <net/ethernet.h>
> +#include <sys/resource.h>
> +#include <sys/socket.h>
> +#include <sys/mman.h>
> +#include <time.h>
> +#include <unistd.h>
> +#include <pthread.h>
> +#include <locale.h>
> +#include <sys/types.h>
> +#include <poll.h>
> +
> +#include "bpf_load.h"
> +#include "bpf_util.h"
> +#include "libbpf.h"
> +
> +#include "xdpsock.h"
> +
> +#ifndef SOL_XDP
> +#define SOL_XDP 283
> +#endif
> +
> +#ifndef AF_XDP
> +#define AF_XDP 44
> +#endif
> +
> +#ifndef PF_XDP
> +#define PF_XDP AF_XDP
> +#endif
> +
> +#define NUM_FRAMES 131072
> +#define FRAME_HEADROOM 0
> +#define FRAME_SIZE 2048
> +#define NUM_DESCS 1024
> +#define BATCH_SIZE 16
> +
> +#define FQ_NUM_DESCS 1024
> +#define CQ_NUM_DESCS 1024
> +
> +#define DEBUG_HEXDUMP 0
> +
> +typedef __u32 u32;
> +
> +static unsigned long prev_time;
> +
> +enum benchmark_type {
> +	BENCH_RXDROP = 0,
> +	BENCH_TXONLY = 1,
> +	BENCH_L2FWD = 2,
> +};
> +
> +static enum benchmark_type opt_bench = BENCH_RXDROP;
> +static u32 opt_xdp_flags;
> +static const char *opt_if = "";
> +static int opt_ifindex;
> +static int opt_queue;
> +static int opt_poll;
> +static int opt_shared_packet_buffer;
> +static int opt_interval = 1;
> +
> +struct xdp_umem_uqueue {
> +	u32 cached_prod;
> +	u32 cached_cons;
> +	u32 mask;
> +	u32 size;
> +	struct xdp_umem_ring *ring;
> +};
> +
> +struct xdp_umem {
> +	char (*frames)[FRAME_SIZE];
> +	struct xdp_umem_uqueue fq;
> +	struct xdp_umem_uqueue cq;
> +	int fd;
> +};
> +
> +struct xdp_uqueue {
> +	u32 cached_prod;
> +	u32 cached_cons;
> +	u32 mask;
> +	u32 size;
> +	struct xdp_rxtx_ring *ring;
> +};
> +
> +struct xdpsock {
> +	struct xdp_uqueue rx;
> +	struct xdp_uqueue tx;
> +	int sfd;
> +	struct xdp_umem *umem;
> +	u32 outstanding_tx;
> +	unsigned long rx_npkts;
> +	unsigned long tx_npkts;
> +	unsigned long prev_rx_npkts;
> +	unsigned long prev_tx_npkts;
> +};
> +
> +#define MAX_SOCKS 4
> +static int num_socks;
> +struct xdpsock *xsks[MAX_SOCKS];
> +
> +static unsigned long get_nsecs(void)
> +{
> +	struct timespec ts;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &ts);
> +	return ts.tv_sec * 1000000000UL + ts.tv_nsec;
> +}
> +
> +static void dump_stats(void);
> +
> +#define lassert(expr)							\
> +	do {								\
> +		if (!(expr)) {						\
> +			fprintf(stderr, "%s:%s:%i: Assertion failed: "	\
> +				#expr ": errno: %d/\"%s\"\n",		\
> +				__FILE__, __func__, __LINE__,		\
> +				errno, strerror(errno));		\
> +			dump_stats();					\
> +			exit(EXIT_FAILURE);				\
> +		}							\
> +	} while (0)
> +
> +#define barrier() __asm__ __volatile__("": : :"memory")
> +#define u_smp_rmb() barrier()
> +#define u_smp_wmb() barrier()
> +#define likely(x) __builtin_expect(!!(x), 1)
> +#define unlikely(x) __builtin_expect(!!(x), 0)
> +
> +static const char pkt_data[] =
> +	"\x3c\xfd\xfe\x9e\x7f\x71\xec\xb1\xd7\x98\x3a\xc0\x08\x00\x45\x00"
> +	"\x00\x2e\x00\x00\x00\x00\x40\x11\x88\x97\x05\x08\x07\x08\xc8\x14"
> +	"\x1e\x04\x10\x92\x10\x92\x00\x1a\x6d\xa3\x34\x33\x1f\x69\x40\x6b"
> +	"\x54\x59\xb6\x14\x2d\x11\x44\xbf\xaf\xd9\xbe\xaa";
> +
> +static inline u32 umem_nb_free(struct xdp_umem_uqueue *q, u32 nb)
> +{
> +	u32 free_entries = q->size - (q->cached_prod - q->cached_cons);
> +
> +	if (free_entries >= nb)
> +		return free_entries;
> +
> +	/* Refresh the local tail pointer */
> +	q->cached_cons = q->ring->ptrs.consumer;
> +
> +	return q->size - (q->cached_prod - q->cached_cons);
> +}
> +
> +static inline u32 xq_nb_free(struct xdp_uqueue *q, u32 ndescs)
> +{
> +	u32 free_entries = q->cached_cons - q->cached_prod;
> +
> +	if (free_entries >= ndescs)
> +		return free_entries;
> +
> +	/* Refresh the local tail pointer */
> +	q->cached_cons = q->ring->ptrs.consumer + q->size;
> +	return q->cached_cons - q->cached_prod;
> +}
> +
> +static inline u32 umem_nb_avail(struct xdp_umem_uqueue *q, u32 nb)
> +{
> +	u32 entries = q->cached_prod - q->cached_cons;
> +
> +	if (entries == 0)
> +		q->cached_prod = q->ring->ptrs.producer;
> +
> +	entries = q->cached_prod - q->cached_cons;
> +
> +	return (entries > nb) ? nb : entries;
> +}
> +
> +static inline u32 xq_nb_avail(struct xdp_uqueue *q, u32 ndescs)
> +{
> +	u32 entries = q->cached_prod - q->cached_cons;
> +
> +	if (entries == 0)
> +		q->cached_prod = q->ring->ptrs.producer;
> +
> +	entries = q->cached_prod - q->cached_cons;
> +	return (entries > ndescs) ? ndescs : entries;
> +}
> +
> +static inline int umem_fill_to_kernel_ex(struct xdp_umem_uqueue *fq,
> +					 struct xdp_desc *d,
> +					 size_t nb)
> +{
> +	u32 i;
> +
> +	if (umem_nb_free(fq, nb) < nb)
> +		return -ENOSPC;
> +
> +	for (i = 0; i < nb; i++) {
> +		u32 idx = fq->cached_prod++ & fq->mask;
> +
> +		fq->ring->desc[idx] = d[i].idx;
> +	}
> +
> +	u_smp_wmb();
> +
> +	fq->ring->ptrs.producer = fq->cached_prod;
> +
> +	return 0;
> +}
> +
> +static inline int umem_fill_to_kernel(struct xdp_umem_uqueue *fq, u32 *d,
> +				      size_t nb)
> +{
> +	u32 i;
> +
> +	if (umem_nb_free(fq, nb) < nb)
> +		return -ENOSPC;
> +
> +	for (i = 0; i < nb; i++) {
> +		u32 idx = fq->cached_prod++ & fq->mask;
> +
> +		fq->ring->desc[idx] = d[i];
> +	}
> +
> +	u_smp_wmb();
> +
> +	fq->ring->ptrs.producer = fq->cached_prod;
> +
> +	return 0;
> +}
> +
> +static inline size_t umem_complete_from_kernel(struct xdp_umem_uqueue *cq,
> +					       u32 *d, size_t nb)
> +{
> +	u32 idx, i, entries = umem_nb_avail(cq, nb);
> +
> +	u_smp_rmb();
> +
> +	for (i = 0; i < entries; i++) {
> +		idx = cq->cached_cons++ & cq->mask;
> +		d[i] = cq->ring->desc[idx];
> +	}
> +
> +	if (entries > 0) {
> +		u_smp_wmb();
> +
> +		cq->ring->ptrs.consumer = cq->cached_cons;
> +	}
> +
> +	return entries;
> +}
> +
> +static inline void *xq_get_data(struct xdpsock *xsk, __u32 idx, __u32 off)
> +{
> +	lassert(idx < NUM_FRAMES);
> +	return &xsk->umem->frames[idx][off];
> +}
> +
> +static inline int xq_enq(struct xdp_uqueue *uq,
> +			 const struct xdp_desc *descs,
> +			 unsigned int ndescs)
> +{
> +	struct xdp_rxtx_ring *r = uq->ring;
> +	unsigned int i;
> +
> +	if (xq_nb_free(uq, ndescs) < ndescs)
> +		return -ENOSPC;
> +
> +	for (i = 0; i < ndescs; i++) {
> +		u32 idx = uq->cached_prod++ & uq->mask;
> +
> +		r->desc[idx].idx = descs[i].idx;
> +		r->desc[idx].len = descs[i].len;
> +		r->desc[idx].offset = descs[i].offset;
> +	}
> +
> +	u_smp_wmb();
> +
> +	r->ptrs.producer = uq->cached_prod;
> +	return 0;
> +}
> +
> +static inline int xq_enq_tx_only(struct xdp_uqueue *uq,
> +				 __u32 idx, unsigned int ndescs)
> +{
> +	struct xdp_rxtx_ring *q = uq->ring;
> +	unsigned int i;
> +
> +	if (xq_nb_free(uq, ndescs) < ndescs)
> +		return -ENOSPC;
> +
> +	for (i = 0; i < ndescs; i++) {
> +		u32 idx = uq->cached_prod++ & uq->mask;
> +
> +		q->desc[idx].idx	= idx + i;
> +		q->desc[idx].len	= sizeof(pkt_data) - 1;
> +		q->desc[idx].offset	= 0;
> +	}
> +
> +	u_smp_wmb();
> +
> +	q->ptrs.producer = uq->cached_prod;
> +	return 0;
> +}
> +
> +static inline int xq_deq(struct xdp_uqueue *uq,
> +			 struct xdp_desc *descs,
> +			 int ndescs)
> +{
> +	struct xdp_rxtx_ring *r = uq->ring;
> +	unsigned int idx;
> +	int i, entries;
> +
> +	entries = xq_nb_avail(uq, ndescs);
> +
> +	u_smp_rmb();
> +
> +	for (i = 0; i < entries; i++) {
> +		idx = uq->cached_cons++ & uq->mask;
> +		descs[i] = r->desc[idx];
> +	}
> +
> +	if (entries > 0) {
> +		u_smp_wmb();
> +
> +		r->ptrs.consumer = uq->cached_cons;
> +	}
> +
> +	return entries;
> +}

Interesting, I was under the impression that you were
planning to get rid of consumer/producer counters
and validate the descriptors instead.

That's the ptr_ring design.

You can then drop all the code around synchronising
counter caches, as well as smp_rmb barriers.


> +
> +static void swap_mac_addresses(void *data)
> +{
> +	struct ether_header *eth = (struct ether_header *)data;
> +	struct ether_addr *src_addr = (struct ether_addr *)&eth->ether_shost;
> +	struct ether_addr *dst_addr = (struct ether_addr *)&eth->ether_dhost;
> +	struct ether_addr tmp;
> +
> +	tmp = *src_addr;
> +	*src_addr = *dst_addr;
> +	*dst_addr = tmp;
> +}
> +
> +#if DEBUG_HEXDUMP
> +static void hex_dump(void *pkt, size_t length, const char *prefix)
> +{
> +	int i = 0;
> +	const unsigned char *address = (unsigned char *)pkt;
> +	const unsigned char *line = address;
> +	size_t line_size = 32;
> +	unsigned char c;
> +
> +	printf("length = %zu\n", length);
> +	printf("%s | ", prefix);
> +	while (length-- > 0) {
> +		printf("%02X ", *address++);
> +		if (!(++i % line_size) || (length == 0 && i % line_size)) {
> +			if (length == 0) {
> +				while (i++ % line_size)
> +					printf("__ ");
> +			}
> +			printf(" | ");	/* right close */
> +			while (line < address) {
> +				c = *line++;
> +				printf("%c", (c < 33 || c == 255) ? 0x2E : c);
> +			}
> +			printf("\n");
> +			if (length > 0)
> +				printf("%s | ", prefix);
> +		}
> +	}
> +	printf("\n");
> +}
> +#endif
> +
> +static size_t gen_eth_frame(char *frame)
> +{
> +	memcpy(frame, pkt_data, sizeof(pkt_data) - 1);
> +	return sizeof(pkt_data) - 1;
> +}
> +
> +static struct xdp_umem *xdp_umem_configure(int sfd)
> +{
> +	int fq_size = FQ_NUM_DESCS, cq_size = CQ_NUM_DESCS;
> +	struct xdp_umem_reg mr;
> +	struct xdp_umem *umem;
> +	void *bufs;
> +
> +	umem = calloc(1, sizeof(*umem));
> +	lassert(umem);
> +
> +	lassert(posix_memalign(&bufs, getpagesize(), /* PAGE_SIZE aligned */
> +			       NUM_FRAMES * FRAME_SIZE) == 0);
> +
> +	mr.addr = (__u64)bufs;
> +	mr.len = NUM_FRAMES * FRAME_SIZE;
> +	mr.frame_size = FRAME_SIZE;
> +	mr.frame_headroom = FRAME_HEADROOM;
> +
> +	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr)) == 0);
> +	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_FILL_RING, &fq_size,
> +			   sizeof(int)) == 0);
> +	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &cq_size,
> +			   sizeof(int)) == 0);
> +
> +	umem->fq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
> +			     FQ_NUM_DESCS * sizeof(u32),
> +			     PROT_READ | PROT_WRITE,
> +			     MAP_SHARED | MAP_POPULATE, sfd,
> +			     XDP_UMEM_PGOFF_FILL_RING);
> +	lassert(umem->fq.ring != MAP_FAILED);
> +
> +	umem->fq.mask = FQ_NUM_DESCS - 1;
> +	umem->fq.size = FQ_NUM_DESCS;
> +
> +	umem->cq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
> +			     CQ_NUM_DESCS * sizeof(u32),
> +			     PROT_READ | PROT_WRITE,
> +			     MAP_SHARED | MAP_POPULATE, sfd,
> +			     XDP_UMEM_PGOFF_COMPLETION_RING);
> +	lassert(umem->cq.ring != MAP_FAILED);
> +
> +	umem->cq.mask = CQ_NUM_DESCS - 1;
> +	umem->cq.size = CQ_NUM_DESCS;
> +
> +	umem->frames = (char (*)[FRAME_SIZE])bufs;
> +	umem->fd = sfd;
> +
> +	if (opt_bench == BENCH_TXONLY) {
> +		int i;
> +
> +		for (i = 0; i < NUM_FRAMES; i++)
> +			(void)gen_eth_frame(&umem->frames[i][0]);
> +	}
> +
> +	return umem;
> +}
> +
> +static struct xdpsock *xsk_configure(struct xdp_umem *umem)
> +{
> +	struct sockaddr_xdp sxdp = {};
> +	int sfd, ndescs = NUM_DESCS;
> +	struct xdpsock *xsk;
> +	bool shared = true;
> +	u32 i;
> +
> +	sfd = socket(PF_XDP, SOCK_RAW, 0);
> +	lassert(sfd >= 0);
> +
> +	xsk = calloc(1, sizeof(*xsk));
> +	lassert(xsk);
> +
> +	xsk->sfd = sfd;
> +	xsk->outstanding_tx = 0;
> +
> +	if (!umem) {
> +		shared = false;
> +		xsk->umem = xdp_umem_configure(sfd);
> +	} else {
> +		xsk->umem = umem;
> +	}
> +
> +	lassert(setsockopt(sfd, SOL_XDP, XDP_RX_RING,
> +			   &ndescs, sizeof(int)) == 0);
> +	lassert(setsockopt(sfd, SOL_XDP, XDP_TX_RING,
> +			   &ndescs, sizeof(int)) == 0);
> +
> +	/* Rx */
> +	xsk->rx.ring = mmap(NULL,
> +			    sizeof(struct xdp_ring) +
> +			    NUM_DESCS * sizeof(struct xdp_desc),
> +			    PROT_READ | PROT_WRITE,
> +			    MAP_SHARED | MAP_POPULATE, sfd,
> +			    XDP_PGOFF_RX_RING);
> +	lassert(xsk->rx.ring != MAP_FAILED);
> +
> +	if (!shared) {
> +		for (i = 0; i < NUM_DESCS / 2; i++)
> +			lassert(umem_fill_to_kernel(&xsk->umem->fq, &i, 1)
> +				== 0);
> +	}
> +
> +	/* Tx */
> +	xsk->tx.ring = mmap(NULL,
> +			 sizeof(struct xdp_ring) +
> +			 NUM_DESCS * sizeof(struct xdp_desc),
> +			 PROT_READ | PROT_WRITE,
> +			 MAP_SHARED | MAP_POPULATE, sfd,
> +			 XDP_PGOFF_TX_RING);
> +	lassert(xsk->tx.ring != MAP_FAILED);
> +
> +	xsk->rx.mask = NUM_DESCS - 1;
> +	xsk->rx.size = NUM_DESCS;
> +
> +	xsk->tx.mask = NUM_DESCS - 1;
> +	xsk->tx.size = NUM_DESCS;
> +
> +	sxdp.sxdp_family = PF_XDP;
> +	sxdp.sxdp_ifindex = opt_ifindex;
> +	sxdp.sxdp_queue_id = opt_queue;
> +	if (shared) {
> +		sxdp.sxdp_flags = XDP_SHARED_UMEM;
> +		sxdp.sxdp_shared_umem_fd = umem->fd;
> +	}
> +
> +	lassert(bind(sfd, (struct sockaddr *)&sxdp, sizeof(sxdp)) == 0);
> +
> +	return xsk;
> +}
> +
> +static void print_benchmark(bool running)
> +{
> +	const char *bench_str = "INVALID";
> +
> +	if (opt_bench == BENCH_RXDROP)
> +		bench_str = "rxdrop";
> +	else if (opt_bench == BENCH_TXONLY)
> +		bench_str = "txonly";
> +	else if (opt_bench == BENCH_L2FWD)
> +		bench_str = "l2fwd";
> +
> +	printf("%s:%d %s ", opt_if, opt_queue, bench_str);
> +	if (opt_xdp_flags & XDP_FLAGS_SKB_MODE)
> +		printf("xdp-skb ");
> +	else if (opt_xdp_flags & XDP_FLAGS_DRV_MODE)
> +		printf("xdp-drv ");
> +	else
> +		printf("	");
> +
> +	if (opt_poll)
> +		printf("poll() ");
> +
> +	if (running) {
> +		printf("running...");
> +		fflush(stdout);
> +	}
> +}
> +
> +static void dump_stats(void)
> +{
> +	unsigned long now = get_nsecs();
> +	long dt = now - prev_time;
> +	int i;
> +
> +	prev_time = now;
> +
> +	for (i = 0; i < num_socks; i++) {
> +		char *fmt = "%-15s %'-11.0f %'-11lu\n";
> +		double rx_pps, tx_pps;
> +
> +		rx_pps = (xsks[i]->rx_npkts - xsks[i]->prev_rx_npkts) *
> +			 1000000000. / dt;
> +		tx_pps = (xsks[i]->tx_npkts - xsks[i]->prev_tx_npkts) *
> +			 1000000000. / dt;
> +
> +		printf("\n sock%d@", i);
> +		print_benchmark(false);
> +		printf("\n");
> +
> +		printf("%-15s %-11s %-11s %-11.2f\n", "", "pps", "pkts",
> +		       dt / 1000000000.);
> +		printf(fmt, "rx", rx_pps, xsks[i]->rx_npkts);
> +		printf(fmt, "tx", tx_pps, xsks[i]->tx_npkts);
> +
> +		xsks[i]->prev_rx_npkts = xsks[i]->rx_npkts;
> +		xsks[i]->prev_tx_npkts = xsks[i]->tx_npkts;
> +	}
> +}
> +
> +static void *poller(void *arg)
> +{
> +	(void)arg;
> +	for (;;) {
> +		sleep(opt_interval);
> +		dump_stats();
> +	}
> +
> +	return NULL;
> +}
> +
> +static void int_exit(int sig)
> +{
> +	(void)sig;
> +	dump_stats();
> +	bpf_set_link_xdp_fd(opt_ifindex, -1, opt_xdp_flags);
> +	exit(EXIT_SUCCESS);
> +}
> +
> +static struct option long_options[] = {
> +	{"rxdrop", no_argument, 0, 'r'},
> +	{"txonly", no_argument, 0, 't'},
> +	{"l2fwd", no_argument, 0, 'l'},
> +	{"interface", required_argument, 0, 'i'},
> +	{"queue", required_argument, 0, 'q'},
> +	{"poll", no_argument, 0, 'p'},
> +	{"shared-buffer", no_argument, 0, 's'},
> +	{"xdp-skb", no_argument, 0, 'S'},
> +	{"xdp-native", no_argument, 0, 'N'},
> +	{"interval", required_argument, 0, 'n'},
> +	{0, 0, 0, 0}
> +};
> +
> +static void usage(const char *prog)
> +{
> +	const char *str =
> +		"  Usage: %s [OPTIONS]\n"
> +		"  Options:\n"
> +		"  -r, --rxdrop		Discard all incoming packets (default)\n"
> +		"  -t, --txonly		Only send packets\n"
> +		"  -l, --l2fwd		MAC swap L2 forwarding\n"
> +		"  -i, --interface=n	Run on interface n\n"
> +		"  -q, --queue=n	Use queue n (default 0)\n"
> +		"  -p, --poll		Use poll syscall\n"
> +		"  -s, --shared-buffer	Use shared packet buffer\n"
> +		"  -S, --xdp-skb=n	Use XDP skb-mod\n"
> +		"  -N, --xdp-native=n	Enfore XDP native mode\n"
> +		"  -n, --interval=n	Specify statistics update interval (default 1 sec).\n"
> +		"\n";
> +	fprintf(stderr, str, prog);
> +	exit(EXIT_FAILURE);
> +}
> +
> +static void parse_command_line(int argc, char **argv)
> +{
> +	int option_index, c;
> +
> +	opterr = 0;
> +
> +	for (;;) {
> +		c = getopt_long(argc, argv, "rtli:q:psSNn:", long_options,
> +				&option_index);
> +		if (c == -1)
> +			break;
> +
> +		switch (c) {
> +		case 'r':
> +			opt_bench = BENCH_RXDROP;
> +			break;
> +		case 't':
> +			opt_bench = BENCH_TXONLY;
> +			break;
> +		case 'l':
> +			opt_bench = BENCH_L2FWD;
> +			break;
> +		case 'i':
> +			opt_if = optarg;
> +			break;
> +		case 'q':
> +			opt_queue = atoi(optarg);
> +			break;
> +		case 's':
> +			opt_shared_packet_buffer = 1;
> +			break;
> +		case 'p':
> +			opt_poll = 1;
> +			break;
> +		case 'S':
> +			opt_xdp_flags |= XDP_FLAGS_SKB_MODE;
> +			break;
> +		case 'N':
> +			opt_xdp_flags |= XDP_FLAGS_DRV_MODE;
> +			break;
> +		case 'n':
> +			opt_interval = atoi(optarg);
> +			break;
> +		default:
> +			usage(basename(argv[0]));
> +		}
> +	}
> +
> +	opt_ifindex = if_nametoindex(opt_if);
> +	if (!opt_ifindex) {
> +		fprintf(stderr, "ERROR: interface \"%s\" does not exist\n",
> +			opt_if);
> +		usage(basename(argv[0]));
> +	}
> +}
> +
> +static void kick_tx(int fd)
> +{
> +	int ret;
> +
> +	ret = sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
> +	if (ret >= 0 || errno == ENOBUFS || errno == EAGAIN)
> +		return;
> +	lassert(0);
> +}
> +
> +static inline void complete_tx_l2fwd(struct xdpsock *xsk)
> +{
> +	u32 descs[BATCH_SIZE];
> +	unsigned int rcvd;
> +	size_t ndescs;
> +
> +	if (!xsk->outstanding_tx)
> +		return;
> +
> +	kick_tx(xsk->sfd);
> +	ndescs = (xsk->outstanding_tx > BATCH_SIZE) ? BATCH_SIZE :
> +		 xsk->outstanding_tx;
> +
> +	/* re-add completed Tx buffers */
> +	rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, ndescs);
> +	if (rcvd > 0) {
> +		umem_fill_to_kernel(&xsk->umem->fq, descs, rcvd);
> +		xsk->outstanding_tx -= rcvd;
> +		xsk->tx_npkts += rcvd;
> +	}
> +}
> +
> +static inline void complete_tx_only(struct xdpsock *xsk)
> +{
> +	u32 descs[BATCH_SIZE];
> +	unsigned int rcvd;
> +
> +	if (!xsk->outstanding_tx)
> +		return;
> +
> +	kick_tx(xsk->sfd);
> +
> +	rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, BATCH_SIZE);
> +	if (rcvd > 0) {
> +		xsk->outstanding_tx -= rcvd;
> +		xsk->tx_npkts += rcvd;
> +	}
> +}
> +
> +static void rx_drop(struct xdpsock *xsk)
> +{
> +	struct xdp_desc descs[BATCH_SIZE];
> +	unsigned int rcvd, i;
> +
> +	rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
> +	if (!rcvd)
> +		return;
> +
> +	for (i = 0; i < rcvd; i++) {
> +		u32 idx = descs[i].idx;
> +
> +		lassert(idx < NUM_FRAMES);
> +#if DEBUG_HEXDUMP
> +		char *pkt;
> +		char buf[32];
> +
> +		pkt = xq_get_data(xsk, idx, descs[i].offset);
> +		sprintf(buf, "idx=%d", idx);
> +		hex_dump(pkt, descs[i].len, buf);
> +#endif
> +	}
> +
> +	xsk->rx_npkts += rcvd;
> +
> +	umem_fill_to_kernel_ex(&xsk->umem->fq, descs, rcvd);
> +}
> +
> +static void rx_drop_all(void)
> +{
> +	struct pollfd fds[MAX_SOCKS + 1];
> +	int i, ret, timeout, nfds = 1;
> +
> +	memset(fds, 0, sizeof(fds));
> +
> +	for (i = 0; i < num_socks; i++) {
> +		fds[i].fd = xsks[i]->sfd;
> +		fds[i].events = POLLIN;
> +		timeout = 1000; /* 1sn */
> +	}
> +
> +	for (;;) {
> +		if (opt_poll) {
> +			ret = poll(fds, nfds, timeout);
> +			if (ret <= 0)
> +				continue;
> +		}
> +
> +		for (i = 0; i < num_socks; i++)
> +			rx_drop(xsks[i]);
> +	}
> +}
> +
> +static void tx_only(struct xdpsock *xsk)
> +{
> +	int timeout, ret, nfds = 1;
> +	struct pollfd fds[nfds + 1];
> +	unsigned int idx = 0;
> +
> +	memset(fds, 0, sizeof(fds));
> +	fds[0].fd = xsk->sfd;
> +	fds[0].events = POLLOUT;
> +	timeout = 1000; /* 1sn */
> +
> +	for (;;) {
> +		if (opt_poll) {
> +			ret = poll(fds, nfds, timeout);
> +			if (ret <= 0)
> +				continue;
> +
> +			if (fds[0].fd != xsk->sfd ||
> +			    !(fds[0].revents & POLLOUT))
> +				continue;
> +		}
> +
> +		if (xq_nb_free(&xsk->tx, BATCH_SIZE) >= BATCH_SIZE) {
> +			lassert(xq_enq_tx_only(&xsk->tx, idx, BATCH_SIZE) == 0);
> +
> +			xsk->outstanding_tx += BATCH_SIZE;
> +			idx += BATCH_SIZE;
> +			idx %= NUM_FRAMES;
> +		}
> +
> +		complete_tx_only(xsk);
> +	}
> +}
> +
> +static void l2fwd(struct xdpsock *xsk)
> +{
> +	for (;;) {
> +		struct xdp_desc descs[BATCH_SIZE];
> +		unsigned int rcvd, i;
> +		int ret;
> +
> +		for (;;) {
> +			complete_tx_l2fwd(xsk);
> +
> +			rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
> +			if (rcvd > 0)
> +				break;
> +		}
> +
> +		for (i = 0; i < rcvd; i++) {
> +			char *pkt = xq_get_data(xsk, descs[i].idx,
> +						descs[i].offset);
> +
> +			swap_mac_addresses(pkt);
> +#if DEBUG_HEXDUMP
> +			char buf[32];
> +			u32 idx = descs[i].idx;
> +
> +			sprintf(buf, "idx=%d", idx);
> +			hex_dump(pkt, descs[i].len, buf);
> +#endif
> +		}
> +
> +		xsk->rx_npkts += rcvd;
> +
> +		ret = xq_enq(&xsk->tx, descs, rcvd);
> +		lassert(ret == 0);
> +		xsk->outstanding_tx += rcvd;
> +	}
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> +	char xdp_filename[256];
> +	int i, ret, key = 0;
> +	pthread_t pt;
> +
> +	parse_command_line(argc, argv);
> +
> +	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
> +		fprintf(stderr, "ERROR: setrlimit(RLIMIT_MEMLOCK) \"%s\"\n",
> +			strerror(errno));
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	snprintf(xdp_filename, sizeof(xdp_filename), "%s_kern.o", argv[0]);
> +
> +	if (load_bpf_file(xdp_filename)) {
> +		fprintf(stderr, "ERROR: load_bpf_file %s\n", bpf_log_buf);
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	if (!prog_fd[0]) {
> +		fprintf(stderr, "ERROR: load_bpf_file: \"%s\"\n",
> +			strerror(errno));
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	if (bpf_set_link_xdp_fd(opt_ifindex, prog_fd[0], opt_xdp_flags) < 0) {
> +		fprintf(stderr, "ERROR: link set xdp fd failed\n");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	ret = bpf_map_update_elem(map_fd[0], &key, &opt_queue, 0);
> +	if (ret) {
> +		fprintf(stderr, "ERROR: bpf_map_update_elem qidconf\n");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	/* Create sockets... */
> +	xsks[num_socks++] = xsk_configure(NULL);
> +
> +#if RR_LB
> +	for (i = 0; i < MAX_SOCKS - 1; i++)
> +		xsks[num_socks++] = xsk_configure(xsks[0]->umem);
> +#endif
> +
> +	/* ...and insert them into the map. */
> +	for (i = 0; i < num_socks; i++) {
> +		key = i;
> +		ret = bpf_map_update_elem(map_fd[1], &key, &xsks[i]->sfd, 0);
> +		if (ret) {
> +			fprintf(stderr, "ERROR: bpf_map_update_elem %d\n", i);
> +			exit(EXIT_FAILURE);
> +		}
> +	}
> +
> +	signal(SIGINT, int_exit);
> +	signal(SIGTERM, int_exit);
> +	signal(SIGABRT, int_exit);
> +
> +	setlocale(LC_ALL, "");
> +
> +	ret = pthread_create(&pt, NULL, poller, NULL);
> +	lassert(ret == 0);
> +
> +	prev_time = get_nsecs();
> +
> +	if (opt_bench == BENCH_RXDROP)
> +		rx_drop_all();
> +	else if (opt_bench == BENCH_TXONLY)
> +		tx_only(xsks[0]);
> +	else
> +		l2fwd(xsks[0]);
> +
> +	return 0;
> +}
> -- 
> 2.14.1

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox