* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 21:28 UTC (permalink / raw)
To: therbert; +Cc: netdev
In-Reply-To: <BANLkTin3hRwB39LGfOSxet61Cdm3F1MsKg@mail.gmail.com>
From: Tom Herbert <therbert@google.com>
Date: Tue, 17 May 2011 14:27:10 -0700
>> Actually, I think it won't work. Even Linux emits fragments last to
>> first, so we won't see the UDP header until the last packet where it's
>> no longer useful.
>>
> I remember observing this a while back, what's the rationale for it?
That's the cheapest way to build the fragments.
Regardless of the reason we have to handle it forever.
^ permalink raw reply
* Re: [PATCH V5 4/6 net-next] vhost: vhost TX zero-copy support
From: Michael S. Tsirkin @ 2011-05-17 21:28 UTC (permalink / raw)
To: Shirley Ma
Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
kvm, linux-kernel
In-Reply-To: <1305665419.10756.33.camel@localhost.localdomain>
On Tue, May 17, 2011 at 01:50:19PM -0700, Shirley Ma wrote:
> Resubmit the patch with most update. This patch passed some
> live-migration test against RHEL6.2. I will run more stress test w/i
> live migration.
>
> Signed-off-by: Shirley Ma <xma@us.ibm.com>
Cool. cleanup path needs a fix - are you use you can
not use kobj or some other existing refcounting?
Is perf regressiion caused by tx ring overrun gone now?
I added some comments about how we might be aqble
to complete requests out of order but it's not a must.
> ---
>
> drivers/vhost/net.c | 37 +++++++++++++++++++++++++++++++-
> drivers/vhost/vhost.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++-
> drivers/vhost/vhost.h | 12 ++++++++++
> 3 files changed, 101 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 2f7c76a..6bd6e28 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -32,6 +32,9 @@
> * Using this limit prevents one virtqueue from starving others. */
> #define VHOST_NET_WEIGHT 0x80000
>
> +/* MAX number of TX used buffers for outstanding zerocopy */
> +#define VHOST_MAX_ZEROCOPY_PEND 128
> +
> enum {
> VHOST_NET_VQ_RX = 0,
> VHOST_NET_VQ_TX = 1,
> @@ -129,6 +132,7 @@ static void handle_tx(struct vhost_net *net)
> int err, wmem;
> size_t hdr_size;
> struct socket *sock;
> + struct skb_ubuf_info pend;
>
> /* TODO: check that we are running from vhost_worker? */
> sock = rcu_dereference_check(vq->private_data, 1);
> @@ -151,6 +155,10 @@ static void handle_tx(struct vhost_net *net)
> hdr_size = vq->vhost_hlen;
>
> for (;;) {
> + /* Release DMAs done buffers first */
> + if (atomic_read(&vq->refcnt) > VHOST_MAX_ZEROCOPY_PEND)
> + vhost_zerocopy_signal_used(vq, false);
> +
> head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> ARRAY_SIZE(vq->iov),
> &out, &in,
> @@ -166,6 +174,13 @@ static void handle_tx(struct vhost_net *net)
> set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
> break;
> }
> + /* If more outstanding DMAs, queue the work */
> + if (sock_flag(sock->sk, SOCK_ZEROCOPY) &&
> + (atomic_read(&vq->refcnt) > VHOST_MAX_ZEROCOPY_PEND)) {
> + tx_poll_start(net, sock);
> + set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
> + break;
> + }
> if (unlikely(vhost_enable_notify(vq))) {
> vhost_disable_notify(vq);
> continue;
> @@ -188,17 +203,35 @@ static void handle_tx(struct vhost_net *net)
> iov_length(vq->hdr, s), hdr_size);
> break;
> }
> + /* use msg_control to pass vhost zerocopy ubuf info to skb */
> + if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
> + vq->heads[vq->upend_idx].id = head;
> + if (len <= 128)
I thought we have a constant for that?
> + vq->heads[vq->upend_idx].len = VHOST_DMA_DONE_LEN;
> + else {
> + vq->heads[vq->upend_idx].len = len;
> + pend.callback = vhost_zerocopy_callback;
> + pend.arg = vq;
> + pend.desc = vq->upend_idx;
> + msg.msg_control = &pend;
> + msg.msg_controllen = sizeof(pend);
> + }
> + atomic_inc(&vq->refcnt);
> + vq->upend_idx = (vq->upend_idx + 1) % UIO_MAXIOV;
Ok, so we deal with a cyclic ring apparently? What guarantees we don't
overrun it?
> + }
> /* TODO: Check specific error and bomb out unless ENOBUFS? */
> err = sock->ops->sendmsg(NULL, sock, &msg, len);
> if (unlikely(err < 0)) {
> - vhost_discard_vq_desc(vq, 1);
> + if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> + vhost_discard_vq_desc(vq, 1);
How are errors handled with zerocopy?
> tx_poll_start(net, sock);
> break;
> }
> if (err != len)
> pr_debug("Truncated TX packet: "
> " len %d != %zd\n", err, len);
> - vhost_add_used_and_signal(&net->dev, vq, head, 0);
> + if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> + vhost_add_used_and_signal(&net->dev, vq, head, 0);
> total_len += len;
> if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> vhost_poll_queue(&vq->poll);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 2ab2912..ce799d6 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -174,6 +174,9 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> vq->call_ctx = NULL;
> vq->call = NULL;
> vq->log_ctx = NULL;
> + vq->upend_idx = 0;
> + vq->done_idx = 0;
> + atomic_set(&vq->refcnt, 0);
> }
>
> static int vhost_worker(void *data)
> @@ -230,7 +233,7 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
> UIO_MAXIOV, GFP_KERNEL);
> dev->vqs[i].log = kmalloc(sizeof *dev->vqs[i].log * UIO_MAXIOV,
> GFP_KERNEL);
> - dev->vqs[i].heads = kmalloc(sizeof *dev->vqs[i].heads *
> + dev->vqs[i].heads = kzalloc(sizeof *dev->vqs[i].heads *
> UIO_MAXIOV, GFP_KERNEL);
Which fields need to be initialized actually?
>
> if (!dev->vqs[i].indirect || !dev->vqs[i].log ||
> @@ -385,6 +388,38 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
> return 0;
> }
>
> +/*
> + comments
> +*/
Hmm.
> +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown)
> +{
> + int i, j = 0;
> +
> + i = vq->done_idx;
> + while (i != vq->upend_idx) {
A for loop might be clearer.
> + if ((vq->heads[i].len == VHOST_DMA_DONE_LEN) || shutdown) {
On shutdown, we signal all buffers used to the guest?
Why?
> + /* reset len = 0 */
comment not very helpful.
Could you explain what this does instead?
Or better use some constant instead of 0 ...
> + vq->heads[i].len = 0;
> + i = (i + 1) % UIO_MAXIOV;
> + ++j;
> + } else
> + break;
Hmm so if the 1st entry does not complete, you do not signal anything?
> + }
Looking at this loop, done idx is the consumer and used idx
is the producer, right?
> + if (j) {
> + /* comments */
Yes?
> + if (i > vq->done_idx)
> + vhost_add_used_n(vq, &vq->heads[vq->done_idx], j);
> + else {
> + vhost_add_used_n(vq, &vq->heads[vq->done_idx],
> + UIO_MAXIOV - vq->done_idx);
> + vhost_add_used_n(vq, vq->heads, i);
> + }
> + vq->done_idx = i;
> + vhost_signal(vq->dev, vq);
> + atomic_sub(j, &vq->refcnt);
Code will likely be simpler if you call vhost_add_used once for
each head in the first loop. Possibly add_used_signal might be
a good idea too.
> + }
> +}
> +
> /* Caller should have device mutex */
> void vhost_dev_cleanup(struct vhost_dev *dev)
> {
> @@ -395,6 +430,11 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> vhost_poll_stop(&dev->vqs[i].poll);
> vhost_poll_flush(&dev->vqs[i].poll);
> }
> + /* wait for all lower device DMAs done, then notify guest */
> + if (atomic_read(&dev->vqs[i].refcnt)) {
> + msleep(1000);
> + vhost_zerocopy_signal_used(&dev->vqs[i], true);
> + }
This needs to be fixed somehow. Use a completion object and wait
on it?
> if (dev->vqs[i].error_ctx)
> eventfd_ctx_put(dev->vqs[i].error_ctx);
> if (dev->vqs[i].error)
> @@ -603,6 +643,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
>
> mutex_lock(&vq->mutex);
>
> + /* force all lower device DMAs done */
> + if (atomic_read(&vq->refcnt))
> + vhost_zerocopy_signal_used(vq, true);
> +
> switch (ioctl) {
> case VHOST_SET_VRING_NUM:
> /* Resizing ring with an active backend?
> @@ -1416,3 +1460,12 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
> vq_err(vq, "Failed to enable notification at %p: %d\n",
> &vq->used->flags, r);
> }
> +
> +void vhost_zerocopy_callback(struct sk_buff *skb)
> +{
> + int idx = skb_shinfo(skb)->ubuf.desc;
> + struct vhost_virtqueue *vq = skb_shinfo(skb)->ubuf.arg;
> +
> + /* set len = 1 to mark this desc buffers done DMA */
this comment can now go.
> + vq->heads[idx].len = VHOST_DMA_DONE_LEN;
> +}
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index b3363ae..8e3ecc7 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -13,6 +13,10 @@
> #include <linux/virtio_ring.h>
> #include <asm/atomic.h>
>
> +/* This is for zerocopy, used buffer len is set to 1 when lower device DMA
> + * done */
> +#define VHOST_DMA_DONE_LEN 1
> +
> struct vhost_device;
>
> struct vhost_work;
> @@ -108,6 +112,12 @@ struct vhost_virtqueue {
> /* Log write descriptors */
> void __user *log_base;
> struct vhost_log *log;
> + /* vhost zerocopy support */
> + atomic_t refcnt; /* num of outstanding zerocopy DMAs */
future enhancement idea: this is used apparently under vq lock
so no need for an atomic?
> + /* copy of avail idx to monitor outstanding DMA zerocopy buffers */
looking at code upend_idx seems to be calculated independently
of guest avail idx - could you clarify pls?
> + int upend_idx;
> + /* copy of used idx to monintor DMA done zerocopy buffers */
monitor
> + int done_idx;
I think in reality these are just producer and consumer
in the head structure which for zero copy is used
> };
>
> struct vhost_dev {
> @@ -154,6 +164,8 @@ bool vhost_enable_notify(struct vhost_virtqueue *);
>
> int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
> unsigned int log_num, u64 len);
> +void vhost_zerocopy_callback(struct sk_buff *skb);
> +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown);
>
> #define vq_err(vq, fmt, ...) do { \
> pr_debug(pr_fmt(fmt), ##__VA_ARGS__); \
>
^ permalink raw reply
* Re: [PATCH net-next-2.6] sfc: Replace enum efx_fc_type with a 'bitwise' type
From: David Miller @ 2011-05-17 21:28 UTC (permalink / raw)
To: joe; +Cc: bhutchings, netdev
In-Reply-To: <1305667378.1722.65.camel@Joe-Laptop>
From: Joe Perches <joe@perches.com>
Date: Tue, 17 May 2011 14:22:58 -0700
> On Tue, 2011-05-17 at 17:14 -0400, David Miller wrote:
>> Accept that the compiler currently doesn't want to allow enums to be
>> used as bit-masks, don't paper around it.
>
> A patch applied yesterday using enums as bitmasks:
> http://patchwork.ozlabs.org/patch/95802/
Thanks, fixed:
--------------------
ipv4: Don't use enums as bitmasks in ip_fragment.c
Noticed by Joe Perches.
Signed-off-by: David S. Miller <davem@davemloft.net>
---
net/ipv4/ip_fragment.c | 10 ++++------
1 files changed, 4 insertions(+), 6 deletions(-)
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 9e1458d..0ad6035 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -81,12 +81,10 @@ struct ipq {
* We want to check ECN values of all fragments, do detect invalid combinations.
* In ipq->ecn, we store the OR value of each ip4_frag_ecn() fragment value.
*/
-enum {
- IPFRAG_ECN_NOT_ECT = 0x01, /* one frag had ECN_NOT_ECT */
- IPFRAG_ECN_ECT_1 = 0x02, /* one frag had ECN_ECT_1 */
- IPFRAG_ECN_ECT_0 = 0x04, /* one frag had ECN_ECT_0 */
- IPFRAG_ECN_CE = 0x08, /* one frag had ECN_CE */
-};
+#define IPFRAG_ECN_NOT_ECT 0x01 /* one frag had ECN_NOT_ECT */
+#define IPFRAG_ECN_ECT_1 0x02 /* one frag had ECN_ECT_1 */
+#define IPFRAG_ECN_ECT_0 0x04 /* one frag had ECN_ECT_0 */
+#define IPFRAG_ECN_CE 0x08 /* one frag had ECN_CE */
static inline u8 ip4_frag_ecn(u8 tos)
{
--
1.7.4.4
^ permalink raw reply related
* Re: small RPS cache for fragments?
From: Eric Dumazet @ 2011-05-17 21:27 UTC (permalink / raw)
To: Ben Hutchings; +Cc: David Miller, therbert, netdev
In-Reply-To: <1305666822.2848.51.camel@bwh-desktop>
Le mardi 17 mai 2011 à 22:13 +0100, Ben Hutchings a écrit :
> On Tue, 2011-05-17 at 17:10 -0400, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Tue, 17 May 2011 23:00:50 +0200
> >
> > > Le mardi 17 mai 2011 à 16:49 -0400, David Miller a écrit :
> > >> From: Tom Herbert <therbert@google.com>
> > >> Date: Tue, 17 May 2011 13:02:25 -0700
> > >>
> > >> > I like it! And this sounds like the sort of algorithm that NICs might
> > >> > be able to implement to solve the UDP/RSS unpleasantness, so even
> > >> > better.
> > >>
> > >> Actually, I think it won't work. Even Linux emits fragments last to
> > >> first, so we won't see the UDP header until the last packet where it's
> > >> no longer useful.
> > >>
> > >> Back to the drawing board. :-/
> > >
> > > Well, we could just use the iph->id in the rxhash computation for frags.
> > >
> > > At least all frags of a given datagram should be reassembled on same
> > > cpu, so we get RPS (but not RFS)
> >
> > That's true, but one could also argue that in the existing code at least
> > one of the packets (the one with the UDP header) would make it to the
> > proper flow cpu.
>
> No, we ignore the layer-4 header when either MF or OFFSET is non-zero.
Exactly
As is, RPS (based on our software rxhash computation) should be working
fine with frags, unless we receive different flows with same
(src_addr,dst_addr) pair.
This is why I asked David if real workloads could hit one cpu instead of
many ones.
^ permalink raw reply
* Re: small RPS cache for fragments?
From: Tom Herbert @ 2011-05-17 21:27 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20110517.164929.1737248436066795381.davem@davemloft.net>
> Actually, I think it won't work. Even Linux emits fragments last to
> first, so we won't see the UDP header until the last packet where it's
> no longer useful.
>
I remember observing this a while back, what's the rationale for it?
> Back to the drawing board. :-/
>
^ permalink raw reply
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 21:26 UTC (permalink / raw)
To: bhutchings; +Cc: eric.dumazet, therbert, netdev
In-Reply-To: <1305666822.2848.51.camel@bwh-desktop>
From: Ben Hutchings <bhutchings@solarflare.com>
Date: Tue, 17 May 2011 22:13:42 +0100
> On Tue, 2011-05-17 at 17:10 -0400, David Miller wrote:
>> That's true, but one could also argue that in the existing code at least
>> one of the packets (the one with the UDP header) would make it to the
>> proper flow cpu.
>
> No, we ignore the layer-4 header when either MF or OFFSET is non-zero.
That's right and I now remember we had quite a discussion about this
in the past.
So IP/saddr/daddr keying is out of the question due to reordering
concerns.
The idea to do RFS post fragmentation is interesting, it's sort of
another form of GRO. We would need to re-fragment (like GRO does)
in the forwarding case.
But it would be nice since it would reduce the number of calls into
the stack (and thus route lookups, etc.) per fragmented frame.
There is of course the issue of fragmentation queue timeouts, and
what semantics of that means when we are not the final destination
and those fragments would have been forwarded rather than consumed
by us.
^ permalink raw reply
* Re: [PATCH net-next-2.6] sfc: Replace enum efx_fc_type with a 'bitwise' type
From: Joe Perches @ 2011-05-17 21:22 UTC (permalink / raw)
To: David Miller; +Cc: bhutchings, netdev
In-Reply-To: <20110517.171412.1017451005914294196.davem@davemloft.net>
On Tue, 2011-05-17 at 17:14 -0400, David Miller wrote:
> Accept that the compiler currently doesn't want to allow enums to be
> used as bit-masks, don't paper around it.
A patch applied yesterday using enums as bitmasks:
http://patchwork.ozlabs.org/patch/95802/
^ permalink raw reply
* Re: [PATCH net-next-2.6] sfc: Replace enum efx_fc_type with a 'bitwise' type
From: David Miller @ 2011-05-17 21:14 UTC (permalink / raw)
To: bhutchings; +Cc: netdev
In-Reply-To: <1305666379.2848.45.camel@bwh-desktop>
From: Ben Hutchings <bhutchings@solarflare.com>
Date: Tue, 17 May 2011 22:06:19 +0100
> But that seems to result in discarding all type information for the
> flags. I would prefer to improve type checking.
That's why I don't want a solution like your cast patch, as that
takes the typing information away.
Accept that the compiler currently doesn't want to allow enums to be
used as bit-masks, don't paper around it.
I'm not applying this patch either, you think all of this "__force"
casting stuff is better? No way.
^ permalink raw reply
* RFC: Add WARN_RATELIMIT to bug.h (was: Re: [PATCH] sk-filter: Rate-limit WARNing, print dbg info.)
From: Joe Perches @ 2011-05-17 21:13 UTC (permalink / raw)
To: Ben Greear, Arnd Bergmann; +Cc: netdev, LKML, linux-arch
In-Reply-To: <4DD2DB9D.6050306@candelatech.com>
On Tue, 2011-05-17 at 13:33 -0700, Ben Greear wrote:
> On 05/17/2011 12:53 PM, Joe Perches wrote:
> > Another option is to add a WARN_RATELIMIT
> > (there is already a WARN_ON_RATELIMIT) to avoid
> > missing other messages.
> > Something like:
> > include/asm-generic/bug.h | 16 ++++++++++++++++
> > net/core/filter.c | 4 +++-
> > 2 files changed, 19 insertions(+), 1 deletions(-)
> > diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
> > index e5a3f58..12b250c 100644
> > --- a/include/asm-generic/bug.h
> > +++ b/include/asm-generic/bug.h
> > @@ -165,6 +165,22 @@ extern void warn_slowpath_null(const char *file, const int line);
> > #define WARN_ON_RATELIMIT(condition, state) \
> > WARN_ON((condition)&& __ratelimit(state))
> >
> > +#define __WARN_RATELIMIT(condition, state, format...) \
> > +({ \
> > + int rtn = 0; \
> > + if (unlikely(__ratelimit(state))) \
> > + rtn = WARN(condition, format); \
> > + rtn; \
> > +})
> > +
> > +#define WARN_RATELIMIT(condition, format...) \
> > +({ \
> > + static DEFINE_RATELIMIT_STATE(_rs, \
> > + DEFAULT_RATELIMIT_INTERVAL, \
> > + DEFAULT_RATELIMIT_BURST); \
> > + __WARN_RATELIMIT(condition,&_rs, format); \
> > +})
> > +
> > /*
> > * WARN_ON_SMP() is for cases that the warning is either
> > * meaningless for !SMP or may even cause failures.
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 0eb8c44..0e3622f 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -350,7 +350,9 @@ load_b:
> > continue;
> > }
> > default:
> > - WARN_ON(1);
> > + WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
> > + fentry->code, fentry->jt,
> > + fentry->jf, fentry->k);
> > return 0;
> > }
> > }
> >
> Sounds fine to me. You want to just send an official
> patch with all this?
I added a some cc's for wider exposure.
original post: http://www.spinics.net/lists/netdev/msg164521.html
Let's wait to see if David has anything to say.
The biggest negative I see is adding RATELIMIT_STATE
to asm-generic/bug.h, though it already has a use of
__ratelimit in WARN_ON_RATELIMIT there.
WARN_ON_RATELIMIT is unused today.
The only user seems to have been RCU_PREEMPT
which was deleted in:
commit 6b3ef48adf847f7adf11c870e3ffacac150f1564
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date: Sat Aug 22 13:56:53 2009 -0700
rcu: Remove CONFIG_PREEMPT_RCU
Maybe the old definition should be removed instead.
If there are no comments after a day or two, I'll
sign and send this as 2 patches with the filter
one marked as:
Original-patch-by: Ben Greear <greearb@candelatech.com>
cheers, Joe
^ permalink raw reply
* Re: small RPS cache for fragments?
From: Ben Hutchings @ 2011-05-17 21:13 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, therbert, netdev
In-Reply-To: <20110517.171000.1166144155994185790.davem@davemloft.net>
On Tue, 2011-05-17 at 17:10 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 17 May 2011 23:00:50 +0200
>
> > Le mardi 17 mai 2011 à 16:49 -0400, David Miller a écrit :
> >> From: Tom Herbert <therbert@google.com>
> >> Date: Tue, 17 May 2011 13:02:25 -0700
> >>
> >> > I like it! And this sounds like the sort of algorithm that NICs might
> >> > be able to implement to solve the UDP/RSS unpleasantness, so even
> >> > better.
> >>
> >> Actually, I think it won't work. Even Linux emits fragments last to
> >> first, so we won't see the UDP header until the last packet where it's
> >> no longer useful.
> >>
> >> Back to the drawing board. :-/
> >
> > Well, we could just use the iph->id in the rxhash computation for frags.
> >
> > At least all frags of a given datagram should be reassembled on same
> > cpu, so we get RPS (but not RFS)
>
> That's true, but one could also argue that in the existing code at least
> one of the packets (the one with the UDP header) would make it to the
> proper flow cpu.
No, we ignore the layer-4 header when either MF or OFFSET is non-zero.
Ben.
> That could be as much as half of the packets.
>
> So I don't yet see it as a clear win.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-05-17 21:13 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, therbert, netdev
In-Reply-To: <20110517.171000.1166144155994185790.davem@davemloft.net>
On Tue, 2011-05-17 at 17:10 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 17 May 2011 23:00:50 +0200
>
> > Le mardi 17 mai 2011 à 16:49 -0400, David Miller a écrit :
> >> From: Tom Herbert <therbert@google.com>
> >> Date: Tue, 17 May 2011 13:02:25 -0700
> >>
> >> > I like it! And this sounds like the sort of algorithm that NICs might
> >> > be able to implement to solve the UDP/RSS unpleasantness, so even
> >> > better.
> >>
> >> Actually, I think it won't work. Even Linux emits fragments last to
> >> first, so we won't see the UDP header until the last packet where it's
> >> no longer useful.
> >>
> >> Back to the drawing board. :-/
> >
> > Well, we could just use the iph->id in the rxhash computation for frags.
> >
> > At least all frags of a given datagram should be reassembled on same
> > cpu, so we get RPS (but not RFS)
>
> That's true, but one could also argue that in the existing code at least
> one of the packets (the one with the UDP header) would make it to the
> proper flow cpu.
>
> That could be as much as half of the packets.
>
> So I don't yet see it as a clear win.
How heinous would it be to do post-reassembly RFS?
rick
^ permalink raw reply
* Re: small RPS cache for fragments?
From: Ben Hutchings @ 2011-05-17 21:11 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, therbert, netdev
In-Reply-To: <1305666050.2691.4.camel@edumazet-laptop>
On Tue, 2011-05-17 at 23:00 +0200, Eric Dumazet wrote:
> Le mardi 17 mai 2011 à 16:49 -0400, David Miller a écrit :
> > From: Tom Herbert <therbert@google.com>
> > Date: Tue, 17 May 2011 13:02:25 -0700
> >
> > > I like it! And this sounds like the sort of algorithm that NICs might
> > > be able to implement to solve the UDP/RSS unpleasantness, so even
> > > better.
> >
> > Actually, I think it won't work. Even Linux emits fragments last to
> > first, so we won't see the UDP header until the last packet where it's
> > no longer useful.
> >
> > Back to the drawing board. :-/
>
> Well, we could just use the iph->id in the rxhash computation for frags.
But then each datagram lands on a different CPU, and reordering is
liable to happen far more often than it does now.
> At least all frags of a given datagram should be reassembled on same
> cpu, so we get RPS (but not RFS)
You could still do RPS with just IP addresses (same as RSS using
Toeplitz hashes).
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-05-17 21:11 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, therbert, netdev
In-Reply-To: <1305666050.2691.4.camel@edumazet-laptop>
On Tue, 2011-05-17 at 23:00 +0200, Eric Dumazet wrote:
> Le mardi 17 mai 2011 à 16:49 -0400, David Miller a écrit :
> > From: Tom Herbert <therbert@google.com>
> > Date: Tue, 17 May 2011 13:02:25 -0700
> >
> > > I like it! And this sounds like the sort of algorithm that NICs might
> > > be able to implement to solve the UDP/RSS unpleasantness, so even
> > > better.
> >
> > Actually, I think it won't work. Even Linux emits fragments last to
> > first, so we won't see the UDP header until the last packet where it's
> > no longer useful.
> >
> > Back to the drawing board. :-/
>
> Well, we could just use the iph->id in the rxhash computation for frags.
>
> At least all frags of a given datagram should be reassembled on same
> cpu, so we get RPS (but not RFS)
Won't that just scatter the fragments of a given flow across processors?
Instead of then going back and forth between two caches - where
reassembly happens and then where the app is running, it will go back
and forth between the app's cache and pretty much nearly every other
cache in the system (or at least configured to take RPS traffic).
rick
^ permalink raw reply
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 21:10 UTC (permalink / raw)
To: eric.dumazet; +Cc: therbert, netdev
In-Reply-To: <1305666050.2691.4.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 17 May 2011 23:00:50 +0200
> Le mardi 17 mai 2011 à 16:49 -0400, David Miller a écrit :
>> From: Tom Herbert <therbert@google.com>
>> Date: Tue, 17 May 2011 13:02:25 -0700
>>
>> > I like it! And this sounds like the sort of algorithm that NICs might
>> > be able to implement to solve the UDP/RSS unpleasantness, so even
>> > better.
>>
>> Actually, I think it won't work. Even Linux emits fragments last to
>> first, so we won't see the UDP header until the last packet where it's
>> no longer useful.
>>
>> Back to the drawing board. :-/
>
> Well, we could just use the iph->id in the rxhash computation for frags.
>
> At least all frags of a given datagram should be reassembled on same
> cpu, so we get RPS (but not RFS)
That's true, but one could also argue that in the existing code at least
one of the packets (the one with the UDP header) would make it to the
proper flow cpu.
That could be as much as half of the packets.
So I don't yet see it as a clear win.
^ permalink raw reply
* [PATCH net-next-2.6] sfc: Replace enum efx_fc_type with a 'bitwise' type
From: Ben Hutchings @ 2011-05-17 21:06 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20110517.152651.1785846817320651943.davem@davemloft.net>
gcc 4.5 doesn't like enumerators as flags, and neither does davem.
Replace the enumerated type with a 'bitwise' type alias which makes it
clear where these flags are being used and allows sparse to validate
their use.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
You wrote:
> What part of "get rid of the enum" is so hard to understand?
>
> I'll say it again: Please get rid of the enum efx_fc_type
OK, done.
> You can define bit positions if you like, because those will
> be used as a proper enum, as unique bit positions within
> a set of bits. Then you can define macros as (1 << XXX)
But that seems to result in discarding all type information for the
flags. I would prefer to improve type checking.
Ben.
drivers/net/sfc/efx.c | 2 +-
drivers/net/sfc/efx.h | 2 +-
drivers/net/sfc/ethtool.c | 2 +-
drivers/net/sfc/mdio_10g.c | 13 ++++++++-----
drivers/net/sfc/mdio_10g.h | 2 +-
drivers/net/sfc/net_driver.h | 15 +++++++--------
6 files changed, 19 insertions(+), 17 deletions(-)
diff --git a/drivers/net/sfc/efx.c b/drivers/net/sfc/efx.c
index 05502b3..f840879 100644
--- a/drivers/net/sfc/efx.c
+++ b/drivers/net/sfc/efx.c
@@ -833,7 +833,7 @@ void efx_link_set_advertising(struct efx_nic *efx, u32 advertising)
}
}
-void efx_link_set_wanted_fc(struct efx_nic *efx, enum efx_fc_type wanted_fc)
+void efx_link_set_wanted_fc(struct efx_nic *efx, efx_fc_mode wanted_fc)
{
efx->wanted_fc = wanted_fc;
if (efx->link_advertising) {
diff --git a/drivers/net/sfc/efx.h b/drivers/net/sfc/efx.h
index 3d83a1f..242d68b 100644
--- a/drivers/net/sfc/efx.h
+++ b/drivers/net/sfc/efx.h
@@ -142,6 +142,6 @@ static inline void efx_schedule_channel(struct efx_channel *channel)
extern void efx_link_status_changed(struct efx_nic *efx);
extern void efx_link_set_advertising(struct efx_nic *efx, u32);
-extern void efx_link_set_wanted_fc(struct efx_nic *efx, enum efx_fc_type);
+extern void efx_link_set_wanted_fc(struct efx_nic *efx, efx_fc_mode);
#endif /* EFX_EFX_H */
diff --git a/drivers/net/sfc/ethtool.c b/drivers/net/sfc/ethtool.c
index 348437a..114829d 100644
--- a/drivers/net/sfc/ethtool.c
+++ b/drivers/net/sfc/ethtool.c
@@ -698,7 +698,7 @@ static int efx_ethtool_set_pauseparam(struct net_device *net_dev,
struct ethtool_pauseparam *pause)
{
struct efx_nic *efx = netdev_priv(net_dev);
- enum efx_fc_type wanted_fc, old_fc;
+ efx_fc_mode wanted_fc, old_fc;
u32 old_adv;
bool reset;
int rc = 0;
diff --git a/drivers/net/sfc/mdio_10g.c b/drivers/net/sfc/mdio_10g.c
index 7115914..9be5da6 100644
--- a/drivers/net/sfc/mdio_10g.c
+++ b/drivers/net/sfc/mdio_10g.c
@@ -284,18 +284,21 @@ void efx_mdio_an_reconfigure(struct efx_nic *efx)
efx_mdio_write(efx, MDIO_MMD_AN, MDIO_CTRL1, reg);
}
-enum efx_fc_type efx_mdio_get_pause(struct efx_nic *efx)
+efx_fc_mode efx_mdio_get_pause(struct efx_nic *efx)
{
- BUILD_BUG_ON(EFX_FC_AUTO & (EFX_FC_RX | EFX_FC_TX));
+ BUILD_BUG_ON((__force u8)EFX_FC_TX != FLOW_CTRL_TX);
+ BUILD_BUG_ON((__force u8)EFX_FC_RX != FLOW_CTRL_RX);
if (!(efx->wanted_fc & EFX_FC_AUTO))
return efx->wanted_fc;
WARN_ON(!(efx->mdio.mmds & MDIO_DEVS_AN));
- return mii_resolve_flowctrl_fdx(
- mii_advertise_flowctrl(efx->wanted_fc),
- efx_mdio_read(efx, MDIO_MMD_AN, MDIO_AN_LPA));
+ return (__force efx_fc_mode)
+ mii_resolve_flowctrl_fdx(
+ mii_advertise_flowctrl((__force int)
+ (efx->wanted_fc & ~EFX_FC_AUTO)),
+ efx_mdio_read(efx, MDIO_MMD_AN, MDIO_AN_LPA));
}
int efx_mdio_test_alive(struct efx_nic *efx)
diff --git a/drivers/net/sfc/mdio_10g.h b/drivers/net/sfc/mdio_10g.h
index df07039..1616b13 100644
--- a/drivers/net/sfc/mdio_10g.h
+++ b/drivers/net/sfc/mdio_10g.h
@@ -92,7 +92,7 @@ extern void efx_mdio_an_reconfigure(struct efx_nic *efx);
/* Get pause parameters from AN if available (otherwise return
* requested pause parameters)
*/
-enum efx_fc_type efx_mdio_get_pause(struct efx_nic *efx);
+extern efx_fc_mode efx_mdio_get_pause(struct efx_nic *efx);
/* Wait for specified MMDs to exit reset within a timeout */
extern int efx_mdio_wait_reset_mmds(struct efx_nic *efx,
diff --git a/drivers/net/sfc/net_driver.h b/drivers/net/sfc/net_driver.h
index ce9697b..1c29775 100644
--- a/drivers/net/sfc/net_driver.h
+++ b/drivers/net/sfc/net_driver.h
@@ -448,12 +448,11 @@ enum nic_state {
/* Forward declaration */
struct efx_nic;
-/* Pseudo bit-mask flow control field */
-enum efx_fc_type {
- EFX_FC_RX = FLOW_CTRL_RX,
- EFX_FC_TX = FLOW_CTRL_TX,
- EFX_FC_AUTO = 4,
-};
+/* Flow control flags */
+typedef unsigned int __bitwise efx_fc_mode;
+#define EFX_FC_TX ((__force efx_fc_mode) 1)
+#define EFX_FC_RX ((__force efx_fc_mode) 2)
+#define EFX_FC_AUTO ((__force efx_fc_mode) 4)
/**
* struct efx_link_state - Current state of the link
@@ -465,7 +464,7 @@ enum efx_fc_type {
struct efx_link_state {
bool up;
bool fd;
- enum efx_fc_type fc;
+ efx_fc_mode fc;
unsigned int speed;
};
@@ -784,7 +783,7 @@ struct efx_nic {
bool promiscuous;
union efx_multicast_hash multicast_hash;
- enum efx_fc_type wanted_fc;
+ efx_fc_mode wanted_fc;
atomic_t rx_reset;
enum efx_loopback_mode loopback_mode;
--
1.7.4
--
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply related
* Re: [PATCH V5 4/6 net-next] vhost: vhost TX zero-copy support
From: Shirley Ma @ 2011-05-17 21:01 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
kvm, linux-kernel
In-Reply-To: <20110517205827.GB7589@redhat.com>
On Tue, 2011-05-17 at 23:58 +0300, Michael S. Tsirkin wrote:
> On Tue, May 17, 2011 at 01:50:19PM -0700, Shirley Ma wrote:
> > Resubmit the patch with most update. This patch passed some
> > live-migration test against RHEL6.2. I will run more stress test w/i
> > live migration.
>
> What changed from the last version?
Sorry, I forgot to mention: add clean up pending before set_vring.
Thanks
Shirley
^ permalink raw reply
* Re: small RPS cache for fragments?
From: Eric Dumazet @ 2011-05-17 21:00 UTC (permalink / raw)
To: David Miller; +Cc: therbert, netdev
In-Reply-To: <20110517.164929.1737248436066795381.davem@davemloft.net>
Le mardi 17 mai 2011 à 16:49 -0400, David Miller a écrit :
> From: Tom Herbert <therbert@google.com>
> Date: Tue, 17 May 2011 13:02:25 -0700
>
> > I like it! And this sounds like the sort of algorithm that NICs might
> > be able to implement to solve the UDP/RSS unpleasantness, so even
> > better.
>
> Actually, I think it won't work. Even Linux emits fragments last to
> first, so we won't see the UDP header until the last packet where it's
> no longer useful.
>
> Back to the drawing board. :-/
Well, we could just use the iph->id in the rxhash computation for frags.
At least all frags of a given datagram should be reassembled on same
cpu, so we get RPS (but not RFS)
^ permalink raw reply
* Re: [PATCH V5 4/6 net-next] vhost: vhost TX zero-copy support
From: Michael S. Tsirkin @ 2011-05-17 20:58 UTC (permalink / raw)
To: Shirley Ma
Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
kvm, linux-kernel
In-Reply-To: <1305665419.10756.33.camel@localhost.localdomain>
On Tue, May 17, 2011 at 01:50:19PM -0700, Shirley Ma wrote:
> Resubmit the patch with most update. This patch passed some
> live-migration test against RHEL6.2. I will run more stress test w/i
> live migration.
What changed from the last version?
> Signed-off-by: Shirley Ma <xma@us.ibm.com>
> ---
>
> drivers/vhost/net.c | 37 +++++++++++++++++++++++++++++++-
> drivers/vhost/vhost.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++-
> drivers/vhost/vhost.h | 12 ++++++++++
> 3 files changed, 101 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 2f7c76a..6bd6e28 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -32,6 +32,9 @@
> * Using this limit prevents one virtqueue from starving others. */
> #define VHOST_NET_WEIGHT 0x80000
>
> +/* MAX number of TX used buffers for outstanding zerocopy */
> +#define VHOST_MAX_ZEROCOPY_PEND 128
> +
> enum {
> VHOST_NET_VQ_RX = 0,
> VHOST_NET_VQ_TX = 1,
> @@ -129,6 +132,7 @@ static void handle_tx(struct vhost_net *net)
> int err, wmem;
> size_t hdr_size;
> struct socket *sock;
> + struct skb_ubuf_info pend;
>
> /* TODO: check that we are running from vhost_worker? */
> sock = rcu_dereference_check(vq->private_data, 1);
> @@ -151,6 +155,10 @@ static void handle_tx(struct vhost_net *net)
> hdr_size = vq->vhost_hlen;
>
> for (;;) {
> + /* Release DMAs done buffers first */
> + if (atomic_read(&vq->refcnt) > VHOST_MAX_ZEROCOPY_PEND)
> + vhost_zerocopy_signal_used(vq, false);
> +
> head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> ARRAY_SIZE(vq->iov),
> &out, &in,
> @@ -166,6 +174,13 @@ static void handle_tx(struct vhost_net *net)
> set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
> break;
> }
> + /* If more outstanding DMAs, queue the work */
> + if (sock_flag(sock->sk, SOCK_ZEROCOPY) &&
> + (atomic_read(&vq->refcnt) > VHOST_MAX_ZEROCOPY_PEND)) {
> + tx_poll_start(net, sock);
> + set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
> + break;
> + }
> if (unlikely(vhost_enable_notify(vq))) {
> vhost_disable_notify(vq);
> continue;
> @@ -188,17 +203,35 @@ static void handle_tx(struct vhost_net *net)
> iov_length(vq->hdr, s), hdr_size);
> break;
> }
> + /* use msg_control to pass vhost zerocopy ubuf info to skb */
> + if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
> + vq->heads[vq->upend_idx].id = head;
> + if (len <= 128)
> + vq->heads[vq->upend_idx].len = VHOST_DMA_DONE_LEN;
> + else {
> + vq->heads[vq->upend_idx].len = len;
> + pend.callback = vhost_zerocopy_callback;
> + pend.arg = vq;
> + pend.desc = vq->upend_idx;
> + msg.msg_control = &pend;
> + msg.msg_controllen = sizeof(pend);
> + }
> + atomic_inc(&vq->refcnt);
> + vq->upend_idx = (vq->upend_idx + 1) % UIO_MAXIOV;
> + }
> /* TODO: Check specific error and bomb out unless ENOBUFS? */
> err = sock->ops->sendmsg(NULL, sock, &msg, len);
> if (unlikely(err < 0)) {
> - vhost_discard_vq_desc(vq, 1);
> + if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> + vhost_discard_vq_desc(vq, 1);
> tx_poll_start(net, sock);
> break;
> }
> if (err != len)
> pr_debug("Truncated TX packet: "
> " len %d != %zd\n", err, len);
> - vhost_add_used_and_signal(&net->dev, vq, head, 0);
> + if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> + vhost_add_used_and_signal(&net->dev, vq, head, 0);
> total_len += len;
> if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> vhost_poll_queue(&vq->poll);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 2ab2912..ce799d6 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -174,6 +174,9 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> vq->call_ctx = NULL;
> vq->call = NULL;
> vq->log_ctx = NULL;
> + vq->upend_idx = 0;
> + vq->done_idx = 0;
> + atomic_set(&vq->refcnt, 0);
> }
>
> static int vhost_worker(void *data)
> @@ -230,7 +233,7 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
> UIO_MAXIOV, GFP_KERNEL);
> dev->vqs[i].log = kmalloc(sizeof *dev->vqs[i].log * UIO_MAXIOV,
> GFP_KERNEL);
> - dev->vqs[i].heads = kmalloc(sizeof *dev->vqs[i].heads *
> + dev->vqs[i].heads = kzalloc(sizeof *dev->vqs[i].heads *
> UIO_MAXIOV, GFP_KERNEL);
>
> if (!dev->vqs[i].indirect || !dev->vqs[i].log ||
> @@ -385,6 +388,38 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
> return 0;
> }
>
> +/*
> + comments
> +*/
> +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown)
> +{
> + int i, j = 0;
> +
> + i = vq->done_idx;
> + while (i != vq->upend_idx) {
> + if ((vq->heads[i].len == VHOST_DMA_DONE_LEN) || shutdown) {
> + /* reset len = 0 */
> + vq->heads[i].len = 0;
> + i = (i + 1) % UIO_MAXIOV;
> + ++j;
> + } else
> + break;
> + }
> + if (j) {
> + /* comments */
> + if (i > vq->done_idx)
> + vhost_add_used_n(vq, &vq->heads[vq->done_idx], j);
> + else {
> + vhost_add_used_n(vq, &vq->heads[vq->done_idx],
> + UIO_MAXIOV - vq->done_idx);
> + vhost_add_used_n(vq, vq->heads, i);
> + }
> + vq->done_idx = i;
> + vhost_signal(vq->dev, vq);
> + atomic_sub(j, &vq->refcnt);
> + }
> +}
> +
> /* Caller should have device mutex */
> void vhost_dev_cleanup(struct vhost_dev *dev)
> {
> @@ -395,6 +430,11 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> vhost_poll_stop(&dev->vqs[i].poll);
> vhost_poll_flush(&dev->vqs[i].poll);
> }
> + /* wait for all lower device DMAs done, then notify guest */
> + if (atomic_read(&dev->vqs[i].refcnt)) {
> + msleep(1000);
> + vhost_zerocopy_signal_used(&dev->vqs[i], true);
> + }
> if (dev->vqs[i].error_ctx)
> eventfd_ctx_put(dev->vqs[i].error_ctx);
> if (dev->vqs[i].error)
> @@ -603,6 +643,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
>
> mutex_lock(&vq->mutex);
>
> + /* force all lower device DMAs done */
> + if (atomic_read(&vq->refcnt))
> + vhost_zerocopy_signal_used(vq, true);
> +
> switch (ioctl) {
> case VHOST_SET_VRING_NUM:
> /* Resizing ring with an active backend?
> @@ -1416,3 +1460,12 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
> vq_err(vq, "Failed to enable notification at %p: %d\n",
> &vq->used->flags, r);
> }
> +
> +void vhost_zerocopy_callback(struct sk_buff *skb)
> +{
> + int idx = skb_shinfo(skb)->ubuf.desc;
> + struct vhost_virtqueue *vq = skb_shinfo(skb)->ubuf.arg;
> +
> + /* set len = 1 to mark this desc buffers done DMA */
> + vq->heads[idx].len = VHOST_DMA_DONE_LEN;
> +}
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index b3363ae..8e3ecc7 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -13,6 +13,10 @@
> #include <linux/virtio_ring.h>
> #include <asm/atomic.h>
>
> +/* This is for zerocopy, used buffer len is set to 1 when lower device DMA
> + * done */
> +#define VHOST_DMA_DONE_LEN 1
> +
> struct vhost_device;
>
> struct vhost_work;
> @@ -108,6 +112,12 @@ struct vhost_virtqueue {
> /* Log write descriptors */
> void __user *log_base;
> struct vhost_log *log;
> + /* vhost zerocopy support */
> + atomic_t refcnt; /* num of outstanding zerocopy DMAs */
> + /* copy of avail idx to monitor outstanding DMA zerocopy buffers */
> + int upend_idx;
> + /* copy of used idx to monintor DMA done zerocopy buffers */
> + int done_idx;
> };
>
> struct vhost_dev {
> @@ -154,6 +164,8 @@ bool vhost_enable_notify(struct vhost_virtqueue *);
>
> int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
> unsigned int log_num, u64 len);
> +void vhost_zerocopy_callback(struct sk_buff *skb);
> +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown);
>
> #define vq_err(vq, fmt, ...) do { \
> pr_debug(pr_fmt(fmt), ##__VA_ARGS__); \
>
^ permalink raw reply
* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Shirley Ma @ 2011-05-17 20:53 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Ben Hutchings, David Miller, Eric Dumazet, Avi Kivity,
Arnd Bergmann, netdev, kvm, linux-kernel
In-Reply-To: <20110517062111.GD26989@redhat.com>
On Tue, 2011-05-17 at 09:21 +0300, Michael S. Tsirkin wrote:
> Problem is, in your patch there are a set of restrictions on what the
> device can do with the skb that we need to enforce somehow.
> Also, how do we know it's a 'real NIC' and not a software device?
I checked macvtap newlink, it doesn't seem to block software device.
So we need to work on the wider feature bit first.
Shirley
^ permalink raw reply
* Re: [TEST PATCH net-next] vhost: accumulate multiple used and sigal in vhost TX test
From: Michael S. Tsirkin @ 2011-05-17 20:52 UTC (permalink / raw)
To: Shirley Ma
Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
kvm, linux-kernel
In-Reply-To: <1305665181.10756.29.camel@localhost.localdomain>
On Tue, May 17, 2011 at 01:46:21PM -0700, Shirley Ma wrote:
> Hello Michael,
>
> Here is the patch I used to test out of order before: add used in a pend
> array, and swap the last two ids.
>
> I used to hit an issue, but now it seems working well.
Aha, so if I apply this guest will *not* crash? :)
> This won't impact zero-copy patch since we need to maintain the pend
> used ids anyway.
Yes, but now we can mark them used immediately as they are
completed - and I am guessing this will relieve the
pressure on tx ring that you see?
> Signed-off-by: Shirley Ma <xma@us.ibm.com>
> ---
> drivers/vhost/net.c | 24 +++++++++++++++++++++++-
> drivers/vhost/vhost.c | 11 +++++++++++
> drivers/vhost/vhost.h | 1 +
> 3 files changed, 35 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 2f7c76a..19e1baa 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -32,6 +32,8 @@
> * Using this limit prevents one virtqueue from starving others. */
> #define VHOST_NET_WEIGHT 0x80000
>
> +#define VHOST_MAX_PEND 128
> +
> enum {
> VHOST_NET_VQ_RX = 0,
> VHOST_NET_VQ_TX = 1,
> @@ -198,13 +200,33 @@ static void handle_tx(struct vhost_net *net)
> if (err != len)
> pr_debug("Truncated TX packet: "
> " len %d != %zd\n", err, len);
> - vhost_add_used_and_signal(&net->dev, vq, head, 0);
> + vq->heads[vq->pend_idx].id = head;
> + vq->heads[vq->pend_idx].len = 0;
> + ++vq->pend_idx;
> + if (vq->pend_idx >= VHOST_MAX_PEND) {
> + int id;
> + id = vq->heads[vq->pend_idx-1].id;
> + vq->heads[vq->pend_idx-1].id = vq->heads[vq->pend_idx-2].id;
> + vq->heads[vq->pend_idx-2].id = id;
> + vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
> + vq->pend_idx);
> + vq->pend_idx = 0;
> + }
> total_len += len;
> if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> vhost_poll_queue(&vq->poll);
> break;
> }
> }
> + if (vq->pend_idx >= VHOST_MAX_PEND) {
> + int id;
> + id = vq->heads[vq->pend_idx-1].id;
> + vq->heads[vq->pend_idx-1].id = vq->heads[vq->pend_idx-2].id;
> + vq->heads[vq->pend_idx-2].id = id;
> + vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
> + vq->pend_idx);
> + vq->pend_idx = 0;
> + }
>
> mutex_unlock(&vq->mutex);
> }
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 2ab2912..7eea6b3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -174,6 +174,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> vq->call_ctx = NULL;
> vq->call = NULL;
> vq->log_ctx = NULL;
> + vq->pend_idx = 0;
> }
>
> static int vhost_worker(void *data)
> @@ -395,6 +396,11 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> vhost_poll_stop(&dev->vqs[i].poll);
> vhost_poll_flush(&dev->vqs[i].poll);
> }
> + if (dev->vqs[i].pend_idx != 0) {
> + vhost_add_used_and_signal_n(dev, &dev->vqs[i],
> + dev->vqs[i].heads, dev->vqs[i].pend_idx);
> + dev->vqs[i].pend_idx = 0;
> + }
> if (dev->vqs[i].error_ctx)
> eventfd_ctx_put(dev->vqs[i].error_ctx);
> if (dev->vqs[i].error)
> @@ -603,6 +609,11 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
>
> mutex_lock(&vq->mutex);
>
> + if (vq->pend_idx != 0) {
> + vhost_add_used_and_signal_n(d, vq, vq->heads, vq->pend_idx);
> + vq->pend_idx = 0;
> + }
> +
> switch (ioctl) {
> case VHOST_SET_VRING_NUM:
> /* Resizing ring with an active backend?
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index b3363ae..44a412d 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -108,6 +108,7 @@ struct vhost_virtqueue {
> /* Log write descriptors */
> void __user *log_base;
> struct vhost_log *log;
> + int pend_idx;
> };
>
> struct vhost_dev {
>
^ permalink raw reply
* Re: [PATCH] net: ethtool: fix IPV6 checksum feature name string
From: David Miller @ 2011-05-17 20:50 UTC (permalink / raw)
To: mirq-linux; +Cc: netdev, bhutchings
In-Reply-To: <20110517204015.2F4C013A6A@rere.qmqm.pl>
From: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Date: Tue, 17 May 2011 22:40:15 +0200 (CEST)
> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH V5 4/6 net-next] vhost: vhost TX zero-copy support
From: Shirley Ma @ 2011-05-17 20:50 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
kvm, linux-kernel
In-Reply-To: <1305646444.10756.16.camel@localhost.localdomain>
Resubmit the patch with most update. This patch passed some
live-migration test against RHEL6.2. I will run more stress test w/i
live migration.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
drivers/vhost/net.c | 37 +++++++++++++++++++++++++++++++-
drivers/vhost/vhost.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++-
drivers/vhost/vhost.h | 12 ++++++++++
3 files changed, 101 insertions(+), 3 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2f7c76a..6bd6e28 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,9 @@
* Using this limit prevents one virtqueue from starving others. */
#define VHOST_NET_WEIGHT 0x80000
+/* MAX number of TX used buffers for outstanding zerocopy */
+#define VHOST_MAX_ZEROCOPY_PEND 128
+
enum {
VHOST_NET_VQ_RX = 0,
VHOST_NET_VQ_TX = 1,
@@ -129,6 +132,7 @@ static void handle_tx(struct vhost_net *net)
int err, wmem;
size_t hdr_size;
struct socket *sock;
+ struct skb_ubuf_info pend;
/* TODO: check that we are running from vhost_worker? */
sock = rcu_dereference_check(vq->private_data, 1);
@@ -151,6 +155,10 @@ static void handle_tx(struct vhost_net *net)
hdr_size = vq->vhost_hlen;
for (;;) {
+ /* Release DMAs done buffers first */
+ if (atomic_read(&vq->refcnt) > VHOST_MAX_ZEROCOPY_PEND)
+ vhost_zerocopy_signal_used(vq, false);
+
head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
ARRAY_SIZE(vq->iov),
&out, &in,
@@ -166,6 +174,13 @@ static void handle_tx(struct vhost_net *net)
set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
break;
}
+ /* If more outstanding DMAs, queue the work */
+ if (sock_flag(sock->sk, SOCK_ZEROCOPY) &&
+ (atomic_read(&vq->refcnt) > VHOST_MAX_ZEROCOPY_PEND)) {
+ tx_poll_start(net, sock);
+ set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+ break;
+ }
if (unlikely(vhost_enable_notify(vq))) {
vhost_disable_notify(vq);
continue;
@@ -188,17 +203,35 @@ static void handle_tx(struct vhost_net *net)
iov_length(vq->hdr, s), hdr_size);
break;
}
+ /* use msg_control to pass vhost zerocopy ubuf info to skb */
+ if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
+ vq->heads[vq->upend_idx].id = head;
+ if (len <= 128)
+ vq->heads[vq->upend_idx].len = VHOST_DMA_DONE_LEN;
+ else {
+ vq->heads[vq->upend_idx].len = len;
+ pend.callback = vhost_zerocopy_callback;
+ pend.arg = vq;
+ pend.desc = vq->upend_idx;
+ msg.msg_control = &pend;
+ msg.msg_controllen = sizeof(pend);
+ }
+ atomic_inc(&vq->refcnt);
+ vq->upend_idx = (vq->upend_idx + 1) % UIO_MAXIOV;
+ }
/* TODO: Check specific error and bomb out unless ENOBUFS? */
err = sock->ops->sendmsg(NULL, sock, &msg, len);
if (unlikely(err < 0)) {
- vhost_discard_vq_desc(vq, 1);
+ if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
+ vhost_discard_vq_desc(vq, 1);
tx_poll_start(net, sock);
break;
}
if (err != len)
pr_debug("Truncated TX packet: "
" len %d != %zd\n", err, len);
- vhost_add_used_and_signal(&net->dev, vq, head, 0);
+ if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
+ vhost_add_used_and_signal(&net->dev, vq, head, 0);
total_len += len;
if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
vhost_poll_queue(&vq->poll);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ab2912..ce799d6 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -174,6 +174,9 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->call_ctx = NULL;
vq->call = NULL;
vq->log_ctx = NULL;
+ vq->upend_idx = 0;
+ vq->done_idx = 0;
+ atomic_set(&vq->refcnt, 0);
}
static int vhost_worker(void *data)
@@ -230,7 +233,7 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
UIO_MAXIOV, GFP_KERNEL);
dev->vqs[i].log = kmalloc(sizeof *dev->vqs[i].log * UIO_MAXIOV,
GFP_KERNEL);
- dev->vqs[i].heads = kmalloc(sizeof *dev->vqs[i].heads *
+ dev->vqs[i].heads = kzalloc(sizeof *dev->vqs[i].heads *
UIO_MAXIOV, GFP_KERNEL);
if (!dev->vqs[i].indirect || !dev->vqs[i].log ||
@@ -385,6 +388,38 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
return 0;
}
+/*
+ comments
+*/
+void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown)
+{
+ int i, j = 0;
+
+ i = vq->done_idx;
+ while (i != vq->upend_idx) {
+ if ((vq->heads[i].len == VHOST_DMA_DONE_LEN) || shutdown) {
+ /* reset len = 0 */
+ vq->heads[i].len = 0;
+ i = (i + 1) % UIO_MAXIOV;
+ ++j;
+ } else
+ break;
+ }
+ if (j) {
+ /* comments */
+ if (i > vq->done_idx)
+ vhost_add_used_n(vq, &vq->heads[vq->done_idx], j);
+ else {
+ vhost_add_used_n(vq, &vq->heads[vq->done_idx],
+ UIO_MAXIOV - vq->done_idx);
+ vhost_add_used_n(vq, vq->heads, i);
+ }
+ vq->done_idx = i;
+ vhost_signal(vq->dev, vq);
+ atomic_sub(j, &vq->refcnt);
+ }
+}
+
/* Caller should have device mutex */
void vhost_dev_cleanup(struct vhost_dev *dev)
{
@@ -395,6 +430,11 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
vhost_poll_stop(&dev->vqs[i].poll);
vhost_poll_flush(&dev->vqs[i].poll);
}
+ /* wait for all lower device DMAs done, then notify guest */
+ if (atomic_read(&dev->vqs[i].refcnt)) {
+ msleep(1000);
+ vhost_zerocopy_signal_used(&dev->vqs[i], true);
+ }
if (dev->vqs[i].error_ctx)
eventfd_ctx_put(dev->vqs[i].error_ctx);
if (dev->vqs[i].error)
@@ -603,6 +643,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
mutex_lock(&vq->mutex);
+ /* force all lower device DMAs done */
+ if (atomic_read(&vq->refcnt))
+ vhost_zerocopy_signal_used(vq, true);
+
switch (ioctl) {
case VHOST_SET_VRING_NUM:
/* Resizing ring with an active backend?
@@ -1416,3 +1460,12 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
vq_err(vq, "Failed to enable notification at %p: %d\n",
&vq->used->flags, r);
}
+
+void vhost_zerocopy_callback(struct sk_buff *skb)
+{
+ int idx = skb_shinfo(skb)->ubuf.desc;
+ struct vhost_virtqueue *vq = skb_shinfo(skb)->ubuf.arg;
+
+ /* set len = 1 to mark this desc buffers done DMA */
+ vq->heads[idx].len = VHOST_DMA_DONE_LEN;
+}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b3363ae..8e3ecc7 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -13,6 +13,10 @@
#include <linux/virtio_ring.h>
#include <asm/atomic.h>
+/* This is for zerocopy, used buffer len is set to 1 when lower device DMA
+ * done */
+#define VHOST_DMA_DONE_LEN 1
+
struct vhost_device;
struct vhost_work;
@@ -108,6 +112,12 @@ struct vhost_virtqueue {
/* Log write descriptors */
void __user *log_base;
struct vhost_log *log;
+ /* vhost zerocopy support */
+ atomic_t refcnt; /* num of outstanding zerocopy DMAs */
+ /* copy of avail idx to monitor outstanding DMA zerocopy buffers */
+ int upend_idx;
+ /* copy of used idx to monintor DMA done zerocopy buffers */
+ int done_idx;
};
struct vhost_dev {
@@ -154,6 +164,8 @@ bool vhost_enable_notify(struct vhost_virtqueue *);
int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
unsigned int log_num, u64 len);
+void vhost_zerocopy_callback(struct sk_buff *skb);
+void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown);
#define vq_err(vq, fmt, ...) do { \
pr_debug(pr_fmt(fmt), ##__VA_ARGS__); \
^ permalink raw reply related
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 20:49 UTC (permalink / raw)
To: therbert; +Cc: netdev
In-Reply-To: <BANLkTikyH=q_6uOvFh3_Z_xwPST3zVijZw@mail.gmail.com>
From: Tom Herbert <therbert@google.com>
Date: Tue, 17 May 2011 13:02:25 -0700
> I like it! And this sounds like the sort of algorithm that NICs might
> be able to implement to solve the UDP/RSS unpleasantness, so even
> better.
Actually, I think it won't work. Even Linux emits fragments last to
first, so we won't see the UDP header until the last packet where it's
no longer useful.
Back to the drawing board. :-/
^ permalink raw reply
* Re: small RPS cache for fragments?
From: Rick Jones @ 2011-05-17 20:41 UTC (permalink / raw)
To: Tom Herbert; +Cc: David Miller, netdev
In-Reply-To: <1305663434.8149.936.camel@tardy>
On Tue, 2011-05-17 at 13:17 -0700, Rick Jones wrote:
> On Tue, 2011-05-17 at 13:02 -0700, Tom Herbert wrote:
> > I like it! And this sounds like the sort of algorithm that NICs might
> > be able to implement to solve the UDP/RSS unpleasantness, so even
> > better.
>
> Do (m)any devices take "shortcuts" with UDP datagrams these days? By
> that I mean that back in the day, the HP-PB and "Slider" FDDI
> cards/drivers did checksum offload for fragmented UDP datagrams by
> sending the first fragment, the one with the UDP header and thus
> checksum, last. It did that to save space on the card and make use of
> the checksum accumulator.
Even if no devices (mis)behave like that today, ordering of fragments
sent via a mode-rr bond is far from a sure thing.
rick
>
> rick jones
>
> >
> > Tom
> >
> > On Tue, May 17, 2011 at 11:33 AM, David Miller <davem@davemloft.net> wrote:
> > >
> > > It seems to me that we can solve the UDP fragmentation problem for
> > > flow steering very simply by creating a (saddr/daddr/IPID) entry in a
> > > table that maps to the corresponding RPS flow entry.
> > >
> > > When we see the initial frag with the UDP header, we create the
> > > saddr/daddr/IPID mapping, and we tear it down when we hit the
> > > saddr/daddr/IPID mapping and the packet has the IP_MF bit clear.
> > >
> > > We only inspect the saddr/daddr/IPID cache when iph->frag_off is
> > > non-zero.
> > >
> > > It's best effort and should work quite well.
> > >
> > > Even a one-behind cache, per-NAPI instance, would do a lot better than
> > > what happens at the moment. Especially since the IP fragments mostly
> > > arrive as one packet train.
^ permalink raw reply
* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-17 20:47 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <1305663288.2691.2.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 17 May 2011 22:14:48 +0200
> OK but do we have workloads actually needing this optimization at all ?
Yes, I've seen performance graphs where RPS/RFS falls off the cliff
when datagram sizes go from 1024 to 2048 bytes.
Wrt. defrag queue overhead, it still is minor compared to the cost of
processing 1/2 of all packets on one cpu on a 24 core system.
BTW, if we can steer reliably, we could make per-cpu defrag queue if
you worry about it so much :-)
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox