Netdev List
 help / color / mirror / Atom feed
* linux-next: manual merge of the net-next tree with the arm-soc tree
From: Stephen Rothwell @ 2016-12-01  1:31 UTC (permalink / raw)
  To: David Miller, Networking, Olof Johansson, Arnd Bergmann, ARM
  Cc: Rob Rice, Florian Fainelli, linux-next, linux-kernel, Jon Mason

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  arch/arm64/boot/dts/broadcom/ns2.dtsi

between commit:

  e79249143f46 ("arm64: dts: Add Broadcom Northstar2 device tree entries for PDC driver.")

from the arm-soc tree and commit:

  dddc3c9d7d02 ("arm64: dts: NS2: add AMAC ethernet support")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc arch/arm64/boot/dts/broadcom/ns2.dtsi
index 863503d78f57,773ed593da4d..000000000000
--- a/arch/arm64/boot/dts/broadcom/ns2.dtsi
+++ b/arch/arm64/boot/dts/broadcom/ns2.dtsi
@@@ -197,42 -191,18 +197,54 @@@
  
  		#include "ns2-clock.dtsi"
  
 +		pdc0: iproc-pdc0@612c0000 {
 +			compatible = "brcm,iproc-pdc-mbox";
 +			reg = <0x612c0000 0x445>;  /* PDC FS0 regs */
 +			interrupts = <GIC_SPI 187 IRQ_TYPE_LEVEL_HIGH>;
 +			#mbox-cells = <1>;
 +			brcm,rx-status-len = <32>;
 +			brcm,use-bcm-hdr;
 +		};
 +
 +		pdc1: iproc-pdc1@612e0000 {
 +			compatible = "brcm,iproc-pdc-mbox";
 +			reg = <0x612e0000 0x445>;  /* PDC FS1 regs */
 +			interrupts = <GIC_SPI 189 IRQ_TYPE_LEVEL_HIGH>;
 +			#mbox-cells = <1>;
 +			brcm,rx-status-len = <32>;
 +			brcm,use-bcm-hdr;
 +		};
 +
 +		pdc2: iproc-pdc2@61300000 {
 +			compatible = "brcm,iproc-pdc-mbox";
 +			reg = <0x61300000 0x445>;  /* PDC FS2 regs */
 +			interrupts = <GIC_SPI 191 IRQ_TYPE_LEVEL_HIGH>;
 +			#mbox-cells = <1>;
 +			brcm,rx-status-len = <32>;
 +			brcm,use-bcm-hdr;
 +		};
 +
 +		pdc3: iproc-pdc3@61320000 {
 +			compatible = "brcm,iproc-pdc-mbox";
 +			reg = <0x61320000 0x445>;  /* PDC FS3 regs */
 +			interrupts = <GIC_SPI 193 IRQ_TYPE_LEVEL_HIGH>;
 +			#mbox-cells = <1>;
 +			brcm,rx-status-len = <32>;
 +			brcm,use-bcm-hdr;
 +		};
 +
+ 		enet: ethernet@61000000 {
+ 			compatible = "brcm,ns2-amac";
+ 			reg = <0x61000000 0x1000>,
+ 			      <0x61090000 0x1000>,
+ 			      <0x61030000 0x100>;
+ 			reg-names = "amac_base", "idm_base", "nicpm_base";
+ 			interrupts = <GIC_SPI 341 IRQ_TYPE_LEVEL_HIGH>;
+ 			phy-handle = <&gphy0>;
+ 			phy-mode = "rgmii";
+ 			status = "disabled";
+ 		};
+ 
  		dma0: dma@61360000 {
  			compatible = "arm,pl330", "arm,primecell";
  			reg = <0x61360000 0x1000>;

^ permalink raw reply

* linux-next: manual merge of the net-next tree with the net tree
From: Stephen Rothwell @ 2016-12-01  1:36 UTC (permalink / raw)
  To: David Miller, Networking
  Cc: linux-next, linux-kernel, Cyrille Pitchen, Zach Brown

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  drivers/net/ethernet/cadence/macb.c

between commit:

  a0b44eea372b ("net: macb: fix the RX queue reset in macb_rx()")

from the net tree and commit:

  b410d13e10db ("net: macb: Use variables with defaults for tx/rx ring sizes instead of hardcoded values")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/net/ethernet/cadence/macb.c
index ec09fcece711,0e489bb82456..000000000000
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@@ -974,8 -990,7 +990,8 @@@ static inline void macb_init_rx_ring(st
  		bp->rx_ring[i].ctrl = 0;
  		addr += bp->rx_buffer_size;
  	}
- 	bp->rx_ring[RX_RING_SIZE - 1].addr |= MACB_BIT(RX_WRAP);
+ 	bp->rx_ring[bp->rx_ring_size - 1].addr |= MACB_BIT(RX_WRAP);
 +	bp->rx_tail = 0;
  }
  
  static int macb_rx(struct macb *bp, int budget)
@@@ -1617,7 -1735,9 +1737,7 @@@ static void macb_init_rings(struct mac
  	}
  	bp->queues[0].tx_head = 0;
  	bp->queues[0].tx_tail = 0;
- 	bp->queues[0].tx_ring[TX_RING_SIZE - 1].ctrl |= MACB_BIT(TX_WRAP);
+ 	bp->queues[0].tx_ring[bp->tx_ring_size - 1].ctrl |= MACB_BIT(TX_WRAP);
 -
 -	bp->rx_tail = 0;
  }
  
  static void macb_reset_hw(struct macb *bp)

^ permalink raw reply

* linux-next: manual merge of the net-next tree with the net tree
From: Stephen Rothwell @ 2016-12-01  1:41 UTC (permalink / raw)
  To: David Miller, Networking; +Cc: linux-next, linux-kernel, Jiri Pirko, Roi Dayan

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  net/sched/cls_flower.c

between commit:

  725cbb62e7ad ("sched: cls_flower: remove from hashtable only in case skip sw flag is not set")

from the net tree and commit:

  13fa876ebd03 ("net/sched: cls_flower: merge filter delete/destroy common code")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc net/sched/cls_flower.c
index 904442421db3,e8dd09af0d0c..000000000000
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@@ -273,24 -272,14 +276,32 @@@ static void fl_hw_update_stats(struct t
  	dev->netdev_ops->ndo_setup_tc(dev, tp->q->handle, tp->protocol, &tc);
  }
  
 +static void fl_destroy_sleepable(struct work_struct *work)
 +{
 +	struct cls_fl_head *head = container_of(work, struct cls_fl_head,
 +						work);
 +	if (head->mask_assigned)
 +		rhashtable_destroy(&head->ht);
 +	kfree(head);
 +	module_put(THIS_MODULE);
 +}
 +
 +static void fl_destroy_rcu(struct rcu_head *rcu)
 +{
 +	struct cls_fl_head *head = container_of(rcu, struct cls_fl_head, rcu);
 +
 +	INIT_WORK(&head->work, fl_destroy_sleepable);
 +	schedule_work(&head->work);
 +}
 +
+ static void __fl_delete(struct tcf_proto *tp, struct cls_fl_filter *f)
+ {
+ 	list_del_rcu(&f->list);
+ 	fl_hw_destroy_filter(tp, (unsigned long)f);
+ 	tcf_unbind_filter(tp, &f->res);
+ 	call_rcu(&f->rcu, fl_destroy_filter);
+ }
+ 
  static bool fl_destroy(struct tcf_proto *tp, bool force)
  {
  	struct cls_fl_head *head = rtnl_dereference(tp->root);
@@@ -299,14 -288,12 +310,11 @@@
  	if (!force && !list_empty(&head->filters))
  		return false;
  
- 	list_for_each_entry_safe(f, next, &head->filters, list) {
- 		fl_hw_destroy_filter(tp, (unsigned long)f);
- 		list_del_rcu(&f->list);
- 		call_rcu(&f->rcu, fl_destroy_filter);
- 	}
+ 	list_for_each_entry_safe(f, next, &head->filters, list)
+ 		__fl_delete(tp, f);
 -	RCU_INIT_POINTER(tp->root, NULL);
 -	if (head->mask_assigned)
 -		rhashtable_destroy(&head->ht);
 -	kfree_rcu(head, rcu);
 +
 +	__module_get(THIS_MODULE);
 +	call_rcu(&head->rcu, fl_destroy_rcu);
  	return true;
  }
  
@@@ -761,13 -782,9 +804,10 @@@ static int fl_delete(struct tcf_proto *
  	struct cls_fl_head *head = rtnl_dereference(tp->root);
  	struct cls_fl_filter *f = (struct cls_fl_filter *) arg;
  
 -	rhashtable_remove_fast(&head->ht, &f->ht_node,
 -			       head->ht_params);
 +	if (!tc_skip_sw(f->flags))
 +		rhashtable_remove_fast(&head->ht, &f->ht_node,
 +				       head->ht_params);
- 	list_del_rcu(&f->list);
- 	fl_hw_destroy_filter(tp, (unsigned long)f);
- 	tcf_unbind_filter(tp, &f->res);
- 	call_rcu(&f->rcu, fl_destroy_filter);
+ 	__fl_delete(tp, f);
  	return 0;
  }
  

^ permalink raw reply

* [PATCH net v2 3/3] Revert: "ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"
From: Eli Cooper @ 2016-12-01  2:05 UTC (permalink / raw)
  To: netdev, David S . Miller; +Cc: Eric Dumazet
In-Reply-To: <20161201020512.21661-1-elicooper@gmx.com>

This reverts commit ae148b085876fa771d9ef2c05f85d4b4bf09ce0d
("ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()").

skb->protocol is now set in __ip_local_out() and __ip6_local_out() before
dst_output() is called. It is no longer necessary to do it for each tunnel.

Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper <elicooper@gmx.com>
---
 net/ipv6/ip6_tunnel.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 0a4759b..d76674e 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1181,7 +1181,6 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
 	if (err)
 		return err;
 
-	skb->protocol = htons(ETH_P_IPV6);
 	skb_push(skb, sizeof(struct ipv6hdr));
 	skb_reset_network_header(skb);
 	ipv6h = ipv6_hdr(skb);
-- 
2.10.2

^ permalink raw reply related

* [PATCH net v2 1/3] ipv4: Set skb->protocol properly for local output
From: Eli Cooper @ 2016-12-01  2:05 UTC (permalink / raw)
  To: netdev, David S . Miller; +Cc: Eric Dumazet

When xfrm is applied to TSO/GSO packets, it follows this path:

    xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()

where skb_gso_segment() relies on skb->protocol to function properly.

This patch sets skb->protocol to ETH_P_IP before dst_output() is called,
fixing a bug where GSO packets sent through a sit tunnel are dropped
when xfrm is involved.

Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper <elicooper@gmx.com>
---
v2: place the assignment before the netfilter hook

 net/ipv4/ip_output.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 105908d..877bdb0 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -107,6 +107,8 @@ int __ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
 	if (unlikely(!skb))
 		return 0;
 
+	skb->protocol = htons(ETH_P_IP);
+
 	return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT,
 		       net, sk, skb, NULL, skb_dst(skb)->dev,
 		       dst_output);
-- 
2.10.2

^ permalink raw reply related

* [PATCH net v2 2/3] ipv6: Set skb->protocol properly for local output
From: Eli Cooper @ 2016-12-01  2:05 UTC (permalink / raw)
  To: netdev, David S . Miller; +Cc: Eric Dumazet
In-Reply-To: <20161201020512.21661-1-elicooper@gmx.com>

When xfrm is applied to TSO/GSO packets, it follows this path:

    xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()

where skb_gso_segment() relies on skb->protocol to function properly.

This patch sets skb->protocol to ETH_P_IPV6 before dst_output() is called,
fixing a bug where GSO packets sent through an ipip6 tunnel are dropped
when xfrm is involved.

Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper <elicooper@gmx.com>
---
v2: place the assignment before the netfilter hook

 net/ipv6/output_core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c
index 7cca8ac..cd42523 100644
--- a/net/ipv6/output_core.c
+++ b/net/ipv6/output_core.c
@@ -155,6 +155,8 @@ int __ip6_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
 	if (unlikely(!skb))
 		return 0;
 
+	skb->protocol = htons(ETH_P_IPV6);
+
 	return nf_hook(NFPROTO_IPV6, NF_INET_LOCAL_OUT,
 		       net, sk, skb, NULL, skb_dst(skb)->dev,
 		       dst_output);
-- 
2.10.2

^ permalink raw reply related

* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-12-01  2:32 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
	Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
	Achiad Shochat
In-Reply-To: <CALx6S35UZcD+kXTTGeML4DdDYaUoArSfEPeUoTuFEeRWKsM7sQ@mail.gmail.com>

On Wed, 2016-11-30 at 17:16 -0800, Tom Herbert wrote:
> On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Another issue I found during my tests last days, is a problem with BQL,
> > and more generally when driver stops/starts the queue.
> >
> > When under stress and BQL stops the queue, driver TX completion does a
> > lot of work, and servicing CPU also takes over further qdisc_run().
> >
> Hmm, hard to say if this is problem or a feature I think ;-). One way
> to "solve" this would be to use IRQ round robin, that would spread the
> load of networking across CPUs, but that gives us no additional
> parallelism and reduces locality-- it's generally considered a bad
> idea. The question might be: is it better to continuously ding one CPU
> with all the networking work or try to artificially spread it out?
> Note this is orthogonal to MQ also, so we should already have multiple
> CPUs doing netif_schedule_queue for queues they manage.
> 
> Do you have a test or application that shows this is causing pain?

Yes, just launch enough TCP senders (more than 10,000) to fully utilize
the NIC, with small messages.

super_netperf is not good for that, because you would need 10,000
processes and would spend too much cycles just dealing with an enormous
working set, you would not activate BQL.


> 
> > The work-flow is :
> >
> > 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
> > unmap things, queue skbs for freeing.
> >
> > 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
> >
> > if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
> >      netif_schedule_queue(dev_queue);
> >
> > This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
> > (They absolutely have no chance to grab it)
> >
> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,
> > and RX work.
> >
> > This problem is magnified when XPS is used, if one mono-threaded application deals with
> > thousands of TCP sockets.
> >
> Do you know of an application doing this? The typical way XPS and most
> of the other steering mechanisms are configured assume that workloads
> tend towards a normal distribution. Such mechanisms can become
> problematic under asymmetric loads, but then I would expect these are
> likely dedicated servers so that the mechanisms can be tuned
> accordingly. For instance, XPS can be configured to map more than one
> queue to a CPU. Alternatively, IIRC Windows has some functionality to
> tune networking for the load (spin up queues, reconfigure things
> similar to XPS/RPS, etc.)-- that's promising up the point that we need
> a lot of heuristics and measurement; but then we lose determinism and
> risk edge case where we get completely unsatisfied results (one of my
> concerns with the recent attempt to put configuration in the kernel).

We have thousands of applications, and some of them 'kind of multicast'
events to a broad number of TCP sockets.

Very often, applications writers use a single thread for doing this,
when all they need is to send small packets to 10,000 sockets, and they
do not really care of doing this very fast. They also do not want to
hurt other applications sharing the NIC.

Very often, process scheduler will also run this single thread in a
single cpu, ie avoiding expensive migrations if they are not needed.

Problem is this behavior locks one TX queue for the duration of the
multicast, since XPS will force all the TX packets to go to one TX
queue.

Other flows that would share the locked CPU experience high P99
latencies.


> 
> > We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
> > to allow another cpu to service the qdisc and spare us.
> >
> Wouldn't this need to be an active operation? That is to queue the
> qdisc on another CPU's output_queue?

I simply suggest we try to queue the qdisc for further servicing as we
do today, from net_tx_action(), but we might use a different bit, so
that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
before we grab it from net_tx_action(), maybe 100 usec later (time to
flush all skbs queued in napi_consume_skb() and maybe RX processing,
since most NIC handle TX completion before doing RX processing from thei
napi poll handler.

Should be doable with few changes in __netif_schedule() and
net_tx_action(), plus some control paths that will need to take care of
the new bit at dismantle time, right ?

^ permalink raw reply

* Re: Initial thoughts on TXDP
From: Florian Westphal @ 2016-12-01  2:44 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Kernel Network Developers
In-Reply-To: <CALx6S34qPqXa7s1eHmk9V-k6xb=36dfiQvx3JruaNnqg4v8r9g@mail.gmail.com>

Tom Herbert <tom@herbertland.com> wrote:
> Posting for discussion....

Warning: You are not going to like this reply...

> Now that XDP seems to be nicely gaining traction

Yes, I regret to see that.  XDP seems useful to create impressive
benchmark numbers (and little else).

I will send a separate email to keep that flamebait part away from
this thread though.

[..]

> addresses the performance gap for stateless packet processing). The
> problem statement is analogous to that which we had for XDP, namely
> can we create a mode in the kernel that offer the same performance
> that is seen with L4 protocols over kernel bypass

Why?  If you want to bypass the kernel, then DO IT.

There is nothing wrong with DPDK.  The ONLY problem is that the kernel
does not offer a userspace fastpath like Windows RIO or FreeBSDs netmap.

But even without that its not difficult to get DPDK running.

(T)XDP seems born from spite, not technical rationale.
IMO everyone would be better off if we'd just have something netmap-esqe
in the network core (also see below).

> I imagine there are a few reasons why userspace TCP stacks can get
> good performance:
> 
> - Spin polling (we already can do this in kernel)
> - Lockless, I would assume that threads typically have exclusive
> access to a queue pair for a connection
> - Minimal TCP/IP stack code
> - Zero copy TX/RX
> - Light weight structures for queuing
> - No context switches
> - Fast data path for in order, uncongested flows
> - Silo'ing between application and device queues

I only see two cases:

1. Many applications running (standard Os model) that need to
send/receive data
-> Linux Network Stack

2. Single dedicated application that does all rx/tx

-> no queueing needed (can block network rx completely if receiver
is slow)
-> no allocations needed at runtime at all
-> no locking needed (single produce, single consumer)

If you have #2 and you need to be fast etc then full userspace
bypass is fine.  We will -- no matter what we do in kernel -- never
be able to keep up with the speed you can get with that
because we have to deal with #1.  (Plus the ease of use/freedom of doing
userspace programming).  And yes, I think that #2 is something we
should address solely by providing netmap or something similar.

But even considering #1 there are ways to speed stack up:

I'd kill RPF/RPS so we don't have IPI anymore and skb stays
on same cpu up to where it gets queued (ofo or rx queue).

Then we could tell driver what happened with the skb it gave us, e.g.
we can tell driver it can do immediate page/dma reuse, for example
in pure ack case as opposed to skb sitting in ofo or receive queue.

(RPS/RFS functionality could still be provided via one of the gazillion
 hooks we now have in the stack for those that need/want it), so I do
not think we would lose functionality.

>   - Call into TCP/IP stack with page data directly from driver-- no
> skbuff allocation or interface. This is essentially provided by the
> XDP API although we would need to generalize the interface to call
> stack functions (I previously posted patches for that). We will also
> need a new action, XDP_HELD?, that indicates the XDP function held the
> packet (put on a socket for instance).

Seems this will not work at all with the planned page pool thing when
pages start to be held indefinitely.

You can also never get even close to userspace offload stacks once you
need/do this; allocations in hotpath are too expensive.

[..]

>   - When we transmit, it would be nice to go straight from TCP
> connection to an XDP device queue and in particular skip the qdisc
> layer. This follows the principle of low latency being first criteria.

It will never be lower than userspace offloads so anyone with serious
"low latency" requirement (trading) will use that instead.

Whats your target audience?

> longer latencies in effect which likely means TXDP isn't appropriate
> in such a cases. BQL is also out, however we would want the TX
> batching of XDP.

Right, congestion control and buffer bloat are totally overrated .. 8-(

So far I haven't seen anything that would need XDP at all.

What makes it technically impossible to apply these miracles to the
stack...?

E.g. "mini-skb": Even if we assume that this provides a speedup
(where does that come from? should make no difference if a 32 or
 320 byte buffer gets allocated).

If we assume that its the zeroing of sk_buff (but iirc it made little
to no difference), could add

unsigned long skb_extensions[1];

to sk_buff, then move everything not needed for tcp fastpath
(e.g. secpath, conntrack, nf_bridge, tunnel encap, tc, ...)
below that

Then convert accesses to accessors and init it on demand.

One could probably also split cb[] into a smaller fastpath area
and another one at the end that won't be touched at allocation time.

> Miscellaneous

> contemplating that connections/sockets can be bound to particularly
> CPUs and that any operations (socket operations, timers, receive
> processing) must occur on that CPU. The CPU would be the one where RX
> happens. Note this implies perfect silo'ing, everything for driver RX
> to application processing happens inline on the CPU. The stack would
> not cross CPUs for a connection while in this mode.

Again don't see how this relates to xdp.  Could also be done with
current stack if we make rps/rfs pluggable since nothing else
currently pushes skb to another cpu (except when scheduler is involved
via tc mirred, netfilter userspace queueing etc) but that is always
explicit (i.e. skb leaves softirq protection).

Can we please fix and improve what we already have rather than creating
yet another NIH thing that will have to be maintained forever?

Thanks.

^ permalink raw reply

* RE: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: wangyunjian @ 2016-12-01  2:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang@redhat.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, caihe
In-Reply-To: <20161130152004-mutt-send-email-mst@kernel.org>


>-----Original Message-----
>From: Michael S. Tsirkin [mailto:mst@redhat.com] 
>Sent: Wednesday, November 30, 2016 9:41 PM
>To: wangyunjian
>Cc: jasowang@redhat.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
>Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
>
>On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
>> When we meet an error(err=-EBADFD) recvmsg,
>
>How do you get EBADFD? Won't vhost_net_rx_peek_head_len
>return 0 in this case, breaking the loop?

We started many guest VMs while attaching/detaching some virtio-net nics for loop. 
The soft lockup might happened. The err is -EBADFD.

meesage log:
kernel:[609608.510180]BUG: soft lockup - CPU#18 stuck for 23s! [vhost-60898:126093]
call trace:
[<fffffffa0132967>]vhost_get_vq_desc+0x1e7/0x984 [vhost]
[<fffffffa02037e6>]handle_rx+0x226/0x810 [vhost_net]
[<fffffffa0203de5>]handle_rx_net+0x15/0x20 [vhost_net]
[<fffffffa013160b>]vhost_worker+0xfb/0x1e0 [vhost]
[<fffffffa0131510>]? vhost_dev_reset_owner+0x50/0x50 [vhost]
[<fffffff810a5c7f>]kthread+0xcf/0xe0
[<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
[<fffffff81648898>]ret_from_fork+0x58/0x90
[<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140

>> the error handling in vhost
>> handle_rx() will continue. This will cause a soft CPU lockup in vhost thread.
>> 
>> Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
>> ---
>>  drivers/vhost/net.c | 3 +++
>>  1 file changed, 3 insertions(+)
>> 
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 5dc128a..edc470b 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
>>  			pr_debug("Discarded rx packet: "
>>  				 " len %d, expected %zd\n", err, sock_len);
>>  			vhost_discard_vq_desc(vq, headcount);
>> +			/* Don't continue to do, when meet errors. */
>> +			if (err < 0)
>> +				goto out;
>
>You might get e.g. EAGAIN and I think you need to retry
>in this case.
>
>>  			continue;
>>  		}
>>  		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
>> -- 
>> 1.9.5.msysgit.1
>>

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-12-01  2:50 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
	Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
	Achiad Shochat
In-Reply-To: <1480559566.18162.253.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, 2016-11-30 at 18:32 -0800, Eric Dumazet wrote:

> I simply suggest we try to queue the qdisc for further servicing as we
> do today, from net_tx_action(), but we might use a different bit, so
> that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
> before we grab it from net_tx_action(), maybe 100 usec later (time to
> flush all skbs queued in napi_consume_skb() and maybe RX processing,
> since most NIC handle TX completion before doing RX processing from thei
> napi poll handler.
> 
> Should be doable with few changes in __netif_schedule() and
> net_tx_action(), plus some control paths that will need to take care of
> the new bit at dismantle time, right ?

Hmm... this is silly. Code already implements a different bit.

qdisc_run() seems to run more often from net_tx_action(), I have to find
out why.

^ permalink raw reply

* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Jason Wang @ 2016-12-01  3:13 UTC (permalink / raw)
  To: wangyunjian, Michael S. Tsirkin
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, caihe
In-Reply-To: <34EFBCA9F01B0748BEB6B629CE643AE60B0A7B68@szxeml561-mbx.china.huawei.com>



On 2016年12月01日 10:48, wangyunjian wrote:
>> -----Original Message-----
>> From: Michael S. Tsirkin [mailto:mst@redhat.com]
>> Sent: Wednesday, November 30, 2016 9:41 PM
>> To: wangyunjian
>> Cc: jasowang@redhat.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
>> Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
>>
>> On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
>>> When we meet an error(err=-EBADFD) recvmsg,
>> How do you get EBADFD? Won't vhost_net_rx_peek_head_len
>> return 0 in this case, breaking the loop?
> We started many guest VMs while attaching/detaching some virtio-net nics for loop.
> The soft lockup might happened. The err is -EBADFD.

How did you do the attaching/detaching? AFAIK, the -EBADFD can only 
happens when you deleting tun device during vhost_net transmission.

>
> meesage log:
> kernel:[609608.510180]BUG: soft lockup - CPU#18 stuck for 23s! [vhost-60898:126093]
> call trace:
> [<fffffffa0132967>]vhost_get_vq_desc+0x1e7/0x984 [vhost]
> [<fffffffa02037e6>]handle_rx+0x226/0x810 [vhost_net]
> [<fffffffa0203de5>]handle_rx_net+0x15/0x20 [vhost_net]
> [<fffffffa013160b>]vhost_worker+0xfb/0x1e0 [vhost]
> [<fffffffa0131510>]? vhost_dev_reset_owner+0x50/0x50 [vhost]
> [<fffffff810a5c7f>]kthread+0xcf/0xe0
> [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
> [<fffffff81648898>]ret_from_fork+0x58/0x90
> [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
>
>>> the error handling in vhost
>>> handle_rx() will continue. This will cause a soft CPU lockup in vhost thread.
>>>
>>> Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
>>> ---
>>>   drivers/vhost/net.c | 3 +++
>>>   1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>>> index 5dc128a..edc470b 100644
>>> --- a/drivers/vhost/net.c
>>> +++ b/drivers/vhost/net.c
>>> @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
>>>   			pr_debug("Discarded rx packet: "
>>>   				 " len %d, expected %zd\n", err, sock_len);
>>>   			vhost_discard_vq_desc(vq, headcount);
>>> +			/* Don't continue to do, when meet errors. */
>>> +			if (err < 0)
>>> +				goto out;
>> You might get e.g. EAGAIN and I think you need to retry
>> in this case.
>>
>>>   			continue;
>>>   		}
>>>   		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
>>> -- 
>>> 1.9.5.msysgit.1
>>>

^ permalink raw reply

* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Jason Wang @ 2016-12-01  3:15 UTC (permalink / raw)
  To: Michael S. Tsirkin, Yunjian Wang; +Cc: netdev, linux-kernel, caihe
In-Reply-To: <20161130152004-mutt-send-email-mst@kernel.org>



On 2016年11月30日 21:40, Michael S. Tsirkin wrote:
> On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
>> When we meet an error(err=-EBADFD) recvmsg,
> How do you get EBADFD? Won't vhost_net_rx_peek_head_len
> return 0 in this case, breaking the loop?
>
>> the error handling in vhost
>> handle_rx() will continue. This will cause a soft CPU lockup in vhost thread.
>>
>> Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
>> ---
>>   drivers/vhost/net.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 5dc128a..edc470b 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
>>   			pr_debug("Discarded rx packet: "
>>   				 " len %d, expected %zd\n", err, sock_len);
>>   			vhost_discard_vq_desc(vq, headcount);
>> +			/* Don't continue to do, when meet errors. */
>> +			if (err < 0)
>> +				goto out;
> You might get e.g. EAGAIN and I think you need to retry
> in this case.

Yes.

>>   			continue;
>>   		}
>>   		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
>> -- 
>> 1.9.5.msysgit.1
>>

^ permalink raw reply

* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Michael S. Tsirkin @ 2016-12-01  3:21 UTC (permalink / raw)
  To: wangyunjian
  Cc: jasowang@redhat.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, caihe
In-Reply-To: <34EFBCA9F01B0748BEB6B629CE643AE60B0A7B68@szxeml561-mbx.china.huawei.com>

On Thu, Dec 01, 2016 at 02:48:59AM +0000, wangyunjian wrote:
> 
> >-----Original Message-----
> >From: Michael S. Tsirkin [mailto:mst@redhat.com] 
> >Sent: Wednesday, November 30, 2016 9:41 PM
> >To: wangyunjian
> >Cc: jasowang@redhat.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
> >Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
> >
> >On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
> >> When we meet an error(err=-EBADFD) recvmsg,
> >
> >How do you get EBADFD? Won't vhost_net_rx_peek_head_len
> >return 0 in this case, breaking the loop?
> 
> We started many guest VMs while attaching/detaching some virtio-net nics for loop. 
> The soft lockup might happened. The err is -EBADFD.
> 

OK, I'd like to figure out what happened here. why don't
we get 0 when we peek at the head?

EBADFD is from here:
        struct tun_struct *tun = __tun_get(tfile);
...
        if (!tun)
                return -EBADFD;

but then:
static int tun_peek_len(struct socket *sock)
{       

...

        struct tun_struct *tun;
...
        
        tun = __tun_get(tfile);
        if (!tun) 
                return 0;


so peek len should return 0.

then while will exit:
        while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk)))
...


> meesage log:
> kernel:[609608.510180]BUG: soft lockup - CPU#18 stuck for 23s! [vhost-60898:126093]
> call trace:
> [<fffffffa0132967>]vhost_get_vq_desc+0x1e7/0x984 [vhost]
> [<fffffffa02037e6>]handle_rx+0x226/0x810 [vhost_net]
> [<fffffffa0203de5>]handle_rx_net+0x15/0x20 [vhost_net]
> [<fffffffa013160b>]vhost_worker+0xfb/0x1e0 [vhost]
> [<fffffffa0131510>]? vhost_dev_reset_owner+0x50/0x50 [vhost]
> [<fffffff810a5c7f>]kthread+0xcf/0xe0
> [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
> [<fffffff81648898>]ret_from_fork+0x58/0x90
> [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140

So somehow you keep seeing something in tun when we peek.
IMO this is the problem we should try to fix.
Could you try debugging the root cause here?

> >> the error handling in vhost
> >> handle_rx() will continue. This will cause a soft CPU lockup in vhost thread.
> >> 
> >> Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
> >> ---
> >>  drivers/vhost/net.c | 3 +++
> >>  1 file changed, 3 insertions(+)
> >> 
> >> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> >> index 5dc128a..edc470b 100644
> >> --- a/drivers/vhost/net.c
> >> +++ b/drivers/vhost/net.c
> >> @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
> >>  			pr_debug("Discarded rx packet: "
> >>  				 " len %d, expected %zd\n", err, sock_len);
> >>  			vhost_discard_vq_desc(vq, headcount);
> >> +			/* Don't continue to do, when meet errors. */
> >> +			if (err < 0)
> >> +				goto out;
> >
> >You might get e.g. EAGAIN and I think you need to retry
> >in this case.
> >
> >>  			continue;
> >>  		}
> >>  		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
> >> -- 
> >> 1.9.5.msysgit.1
> >>

^ permalink raw reply

* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Jason Wang @ 2016-12-01  3:26 UTC (permalink / raw)
  To: Michael S. Tsirkin, wangyunjian
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, caihe
In-Reply-To: <20161201051207-mutt-send-email-mst@kernel.org>



On 2016年12月01日 11:21, Michael S. Tsirkin wrote:
> On Thu, Dec 01, 2016 at 02:48:59AM +0000, wangyunjian wrote:
>>> -----Original Message-----
>>> From: Michael S. Tsirkin [mailto:mst@redhat.com]
>>> Sent: Wednesday, November 30, 2016 9:41 PM
>>> To: wangyunjian
>>> Cc: jasowang@redhat.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
>>> Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
>>>
>>> On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
>>>> When we meet an error(err=-EBADFD) recvmsg,
>>> How do you get EBADFD? Won't vhost_net_rx_peek_head_len
>>> return 0 in this case, breaking the loop?
>> We started many guest VMs while attaching/detaching some virtio-net nics for loop.
>> The soft lockup might happened. The err is -EBADFD.
>>
> OK, I'd like to figure out what happened here. why don't
> we get 0 when we peek at the head?
>
> EBADFD is from here:
>          struct tun_struct *tun = __tun_get(tfile);
> ...
>          if (!tun)
>                  return -EBADFD;
>
> but then:
> static int tun_peek_len(struct socket *sock)
> {
>
> ...
>
>          struct tun_struct *tun;
> ...
>          
>          tun = __tun_get(tfile);
>          if (!tun)
>                  return 0;
>
>
> so peek len should return 0.
>
> then while will exit:
>          while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk)))
> ...
>

Consider this case: user do ip link del link tap0 before recvmsg() but 
after tun_peek_len() ?

>> meesage log:
>> kernel:[609608.510180]BUG: soft lockup - CPU#18 stuck for 23s! [vhost-60898:126093]
>> call trace:
>> [<fffffffa0132967>]vhost_get_vq_desc+0x1e7/0x984 [vhost]
>> [<fffffffa02037e6>]handle_rx+0x226/0x810 [vhost_net]
>> [<fffffffa0203de5>]handle_rx_net+0x15/0x20 [vhost_net]
>> [<fffffffa013160b>]vhost_worker+0xfb/0x1e0 [vhost]
>> [<fffffffa0131510>]? vhost_dev_reset_owner+0x50/0x50 [vhost]
>> [<fffffff810a5c7f>]kthread+0xcf/0xe0
>> [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
>> [<fffffff81648898>]ret_from_fork+0x58/0x90
>> [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
> So somehow you keep seeing something in tun when we peek.
> IMO this is the problem we should try to fix.
> Could you try debugging the root cause here?
>
>>>> the error handling in vhost
>>>> handle_rx() will continue. This will cause a soft CPU lockup in vhost thread.
>>>>
>>>> Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
>>>> ---
>>>>   drivers/vhost/net.c | 3 +++
>>>>   1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>>>> index 5dc128a..edc470b 100644
>>>> --- a/drivers/vhost/net.c
>>>> +++ b/drivers/vhost/net.c
>>>> @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
>>>>   			pr_debug("Discarded rx packet: "
>>>>   				 " len %d, expected %zd\n", err, sock_len);
>>>>   			vhost_discard_vq_desc(vq, headcount);
>>>> +			/* Don't continue to do, when meet errors. */
>>>> +			if (err < 0)
>>>> +				goto out;
>>> You might get e.g. EAGAIN and I think you need to retry
>>> in this case.
>>>
>>>>   			continue;
>>>>   		}
>>>>   		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
>>>> -- 
>>>> 1.9.5.msysgit.1
>>>>

^ permalink raw reply

* Re: DSA vs envelope frames
From: Florian Fainelli @ 2016-12-01  3:26 UTC (permalink / raw)
  To: Andrew Lunn, Nikita Yushchenko
  Cc: Toshiaki Makita, Andy Duan, David S. Miller, Troy Kisky,
	Eric Nelson, Philippe Reynes, Johannes Berg,
	netdev@vger.kernel.org, Chris Healy, Fabio Estevam,
	linux-kernel@vger.kernel.org, Vivien Didelot
In-Reply-To: <20161130151028.GD21645@lunn.ch>



On 11/30/2016 07:10 AM, Andrew Lunn wrote:
>> What is not really clear - what if several tagging protocols are used
>> together. AFAIU, things may be more complex that simple appending of
>> tags, e.g. EDSA tag can carry VLAN id inside.
> 
> Hi Nikita
> 
> At least for all current tagging protocols, the size of the tag is
> constant. And you cannot run different tagging protocols on the same
> master interface at the same time.

I am not sure if using envelope frames is entirely appropriate here,
because there are existing switch tagging protocols that:

- don't have a specific Ethernet type allocated (Broadcom tags, DSA)
- could be appended at the end of the frame instead of pre-pended

Alexander Duyck suggested a while ago that we may be able to use the
headers_ops to implement the DSA tag pop/push, as well as get an
appropriate MTU adjustment, can you see if that would work?

> 
> However, i think Florian tried something funky with the SF2 and B53
> driver. He has a b53 hanging off a sf2. So i think he used nested
> tagging protocols!

Well, this actually did not work, because the SF2 and B53 switches
essentially terminate switch tags when they ingress their switch ports,
so the tag inserted by B53 ingressing into the SF2 port does not get
sent all the way to the CPU hanging off the SF2... With the B53 hanging
off the SF2, we essentially have to disable Broadcom tags in the B53
device, because the tag and switches were never designed to be cascaded
in the first place (at least not these specific cores).
-- 
Florian

^ permalink raw reply

* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Michael S. Tsirkin @ 2016-12-01  3:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: wangyunjian, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	caihe
In-Reply-To: <ee70cb92-3fd9-4148-77ba-739af30b04b6@redhat.com>

On Thu, Dec 01, 2016 at 11:26:21AM +0800, Jason Wang wrote:
> 
> 
> On 2016年12月01日 11:21, Michael S. Tsirkin wrote:
> > On Thu, Dec 01, 2016 at 02:48:59AM +0000, wangyunjian wrote:
> > > > -----Original Message-----
> > > > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > > > Sent: Wednesday, November 30, 2016 9:41 PM
> > > > To: wangyunjian
> > > > Cc: jasowang@redhat.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
> > > > Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
> > > > 
> > > > On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
> > > > > When we meet an error(err=-EBADFD) recvmsg,
> > > > How do you get EBADFD? Won't vhost_net_rx_peek_head_len
> > > > return 0 in this case, breaking the loop?
> > > We started many guest VMs while attaching/detaching some virtio-net nics for loop.
> > > The soft lockup might happened. The err is -EBADFD.
> > > 
> > OK, I'd like to figure out what happened here. why don't
> > we get 0 when we peek at the head?
> > 
> > EBADFD is from here:
> >          struct tun_struct *tun = __tun_get(tfile);
> > ...
> >          if (!tun)
> >                  return -EBADFD;
> > 
> > but then:
> > static int tun_peek_len(struct socket *sock)
> > {
> > 
> > ...
> > 
> >          struct tun_struct *tun;
> > ...
> >          tun = __tun_get(tfile);
> >          if (!tun)
> >                  return 0;
> > 
> > 
> > so peek len should return 0.
> > 
> > then while will exit:
> >          while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk)))
> > ...
> > 
> 
> Consider this case: user do ip link del link tap0 before recvmsg() but after
> tun_peek_len() ?

Sure, this can happen, but I think we'll just exit on the next loop,
won't we?

> > > meesage log:
> > > kernel:[609608.510180]BUG: soft lockup - CPU#18 stuck for 23s! [vhost-60898:126093]
> > > call trace:
> > > [<fffffffa0132967>]vhost_get_vq_desc+0x1e7/0x984 [vhost]
> > > [<fffffffa02037e6>]handle_rx+0x226/0x810 [vhost_net]
> > > [<fffffffa0203de5>]handle_rx_net+0x15/0x20 [vhost_net]
> > > [<fffffffa013160b>]vhost_worker+0xfb/0x1e0 [vhost]
> > > [<fffffffa0131510>]? vhost_dev_reset_owner+0x50/0x50 [vhost]
> > > [<fffffff810a5c7f>]kthread+0xcf/0xe0
> > > [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
> > > [<fffffff81648898>]ret_from_fork+0x58/0x90
> > > [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
> > So somehow you keep seeing something in tun when we peek.
> > IMO this is the problem we should try to fix.
> > Could you try debugging the root cause here?
> > 
> > > > > the error handling in vhost
> > > > > handle_rx() will continue. This will cause a soft CPU lockup in vhost thread.
> > > > > 
> > > > > Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
> > > > > ---
> > > > >   drivers/vhost/net.c | 3 +++
> > > > >   1 file changed, 3 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > > > index 5dc128a..edc470b 100644
> > > > > --- a/drivers/vhost/net.c
> > > > > +++ b/drivers/vhost/net.c
> > > > > @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
> > > > >   			pr_debug("Discarded rx packet: "
> > > > >   				 " len %d, expected %zd\n", err, sock_len);
> > > > >   			vhost_discard_vq_desc(vq, headcount);
> > > > > +			/* Don't continue to do, when meet errors. */
> > > > > +			if (err < 0)
> > > > > +				goto out;
> > > > You might get e.g. EAGAIN and I think you need to retry
> > > > in this case.
> > > > 
> > > > >   			continue;
> > > > >   		}
> > > > >   		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
> > > > > -- 
> > > > > 1.9.5.msysgit.1
> > > > > 

^ permalink raw reply

* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Jason Wang @ 2016-12-01  3:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: wangyunjian, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	caihe
In-Reply-To: <20161201052657-mutt-send-email-mst@kernel.org>



On 2016年12月01日 11:27, Michael S. Tsirkin wrote:
> On Thu, Dec 01, 2016 at 11:26:21AM +0800, Jason Wang wrote:
>> >
>> >
>> >On 2016年12月01日 11:21, Michael S. Tsirkin wrote:
>>> > >On Thu, Dec 01, 2016 at 02:48:59AM +0000, wangyunjian wrote:
>>>>> > > > >-----Original Message-----
>>>>> > > > >From: Michael S. Tsirkin [mailto:mst@redhat.com]
>>>>> > > > >Sent: Wednesday, November 30, 2016 9:41 PM
>>>>> > > > >To: wangyunjian
>>>>> > > > >Cc:jasowang@redhat.com;netdev@vger.kernel.org;linux-kernel@vger.kernel.org; caihe
>>>>> > > > >Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
>>>>> > > > >
>>>>> > > > >On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
>>>>>> > > > > >When we meet an error(err=-EBADFD) recvmsg,
>>>>> > > > >How do you get EBADFD? Won't vhost_net_rx_peek_head_len
>>>>> > > > >return 0 in this case, breaking the loop?
>>>> > > >We started many guest VMs while attaching/detaching some virtio-net nics for loop.
>>>> > > >The soft lockup might happened. The err is -EBADFD.
>>>> > > >
>>> > >OK, I'd like to figure out what happened here. why don't
>>> > >we get 0 when we peek at the head?
>>> > >
>>> > >EBADFD is from here:
>>> > >          struct tun_struct *tun = __tun_get(tfile);
>>> > >...
>>> > >          if (!tun)
>>> > >                  return -EBADFD;
>>> > >
>>> > >but then:
>>> > >static int tun_peek_len(struct socket *sock)
>>> > >{
>>> > >
>>> > >...
>>> > >
>>> > >          struct tun_struct *tun;
>>> > >...
>>> > >          tun = __tun_get(tfile);
>>> > >          if (!tun)
>>> > >                  return 0;
>>> > >
>>> > >
>>> > >so peek len should return 0.
>>> > >
>>> > >then while will exit:
>>> > >          while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk)))
>>> > >...
>>> > >
>> >
>> >Consider this case: user do ip link del link tap0 before recvmsg() but after
>> >tun_peek_len() ?
> Sure, this can happen, but I think we'll just exit on the next loop,
> won't we?
>

Right, this is the only case I can image for -EBADFD, let's wait for the 
author to the steps.

^ permalink raw reply

* Re: [PATCH] tcp_bbr: fix Kconfig to be able to make BBR the default congestion algorithm
From: David Miller @ 2016-12-01  3:43 UTC (permalink / raw)
  To: berny156; +Cc: linux-kernel, netdev, ncardwell
In-Reply-To: <b37e4220-b857-4853-f2e2-b77bb08115e5@gmx.de>

From: Bernhard Held <berny156@gmx.de>
Date: Wed, 30 Nov 2016 23:31:29 +0100

> Add missing line to be able to make BBR the default
> congestion algorithm.
> 
> Signed-off-by: Bernhard Held <berny156@gmx.de>

The current tree already has this fix.

^ permalink raw reply

* linux-next: manual merge of the tip tree with the net-next tree
From: Stephen Rothwell @ 2016-12-01  3:46 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
	David Miller, Networking
  Cc: linux-next, linux-kernel, Edward Cree, Nicolas Pitre

Hi all,

Today's linux-next merge of the tip tree got a conflict in:

  drivers/net/ethernet/sfc/Kconfig

between commit:

  5a6681e22c14 ("sfc: separate out SFC4000 ("Falcon") support into new sfc-falcon driver")

from the net-next tree and commit:

  d1cbfd771ce8 ("ptp_clock: Allow for it to be optional")

from the tip tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/net/ethernet/sfc/Kconfig
index 0e16197e4f18,83f4766a1da0..000000000000
--- a/drivers/net/ethernet/sfc/Kconfig
+++ b/drivers/net/ethernet/sfc/Kconfig
@@@ -5,11 -5,11 +5,11 @@@ config SF
  	select CRC32
  	select I2C
  	select I2C_ALGOBIT
- 	select PTP_1588_CLOCK
 +	select SFC_FALCON
+ 	imply PTP_1588_CLOCK
  	---help---
  	  This driver supports 10/40-gigabit Ethernet cards based on
 -	  the Solarflare SFC4000, SFC9000-family and SFC9100-family
 -	  controllers.
 +	  the Solarflare SFC9000-family and SFC9100-family controllers.
  
  	  To compile this driver as a module, choose M here.  The module
  	  will be called sfc.

^ permalink raw reply

* Re: [net-next PATCH v3 3/6] virtio_net: Add XDP support
From: John Fastabend @ 2016-12-01  4:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: eric.dumazet, daniel, shm, davem, tgraf, alexei.starovoitov,
	john.r.fastabend, netdev, bblanco, brouer
In-Reply-To: <20161130203851-mutt-send-email-mst@kernel.org>

On 16-11-30 10:54 AM, Michael S. Tsirkin wrote:
> On Tue, Nov 29, 2016 at 12:10:21PM -0800, John Fastabend wrote:
>> From: John Fastabend <john.fastabend@gmail.com>
>>
>> This adds XDP support to virtio_net. Some requirements must be
>> met for XDP to be enabled depending on the mode. First it will
>> only be supported with LRO disabled so that data is not pushed
>> across multiple buffers. Second the MTU must be less than a page
>> size to avoid having to handle XDP across multiple pages.
>>
>> If mergeable receive is enabled this patch only supports the case
>> where header and data are in the same buf which we can check when
>> a packet is received by looking at num_buf. If the num_buf is
>> greater than 1 and a XDP program is loaded the packet is dropped
>> and a warning is thrown. When any_header_sg is set this does not
>> happen and both header and data is put in a single buffer as expected
>> so we check this when XDP programs are loaded.  Subsequent patches
>> will process the packet in a degraded mode to ensure connectivity
>> and correctness is not lost even if backend pushes packets into
>> multiple buffers.
>>
>> If big packets mode is enabled and MTU/LRO conditions above are
>> met then XDP is allowed.
>>
>> This patch was tested with qemu with vhost=on and vhost=off where
>> mergable and big_packet modes were forced via hard coding feature
>> negotiation. Multiple buffers per packet was forced via a small
>> test patch to vhost.c in the vhost=on qemu mode.
>>
>> Suggested-by: Shrijeet Mukherjee <shrijeet@gmail.com>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>> ---

[...]

>>  
>> +static u32 do_xdp_prog(struct virtnet_info *vi,
>> +		       struct bpf_prog *xdp_prog,
>> +		       struct page *page, int offset, int len)
>> +{
>> +	int hdr_padded_len;
>> +	struct xdp_buff xdp;
>> +	u32 act;
>> +	u8 *buf;
>> +
>> +	buf = page_address(page) + offset;
>> +
>> +	if (vi->mergeable_rx_bufs)
>> +		hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>> +	else
>> +		hdr_padded_len = sizeof(struct padded_vnet_hdr);
>> +
>> +	xdp.data = buf + hdr_padded_len;
>> +	xdp.data_end = xdp.data + (len - vi->hdr_len);
> 
> so header seems to be ignored completely.
> but the packet could be from the time when
> e.g. checksum offloading was on, and
> so it might gave DATA_VALID (from CHECKSUM_UNNECESSARY
> in host).
> 
> I think you want to verify that flags and gso type
> are 0.

Ah yes agree. I think this might be possible in all the driver
implementations as well I wonder if there are bugs there as well.
The race between setting MTU, LRO, etc. and pushing the XDP prog
into the datapath seems unlikely but possible.

> 
> 
>> +
>> +	act = bpf_prog_run_xdp(xdp_prog, &xdp);
>> +	switch (act) {
>> +	case XDP_PASS:
>> +		return XDP_PASS;
>> +	default:
>> +		bpf_warn_invalid_xdp_action(act);
>> +	case XDP_TX:
>> +	case XDP_ABORTED:
>> +	case XDP_DROP:
>> +		return XDP_DROP;
>> +	}
>> +}
> 
> do we really want this switch just to warn?
> How about doing != XDP_PASS in the caller?

Well in later patches XDP_TX case gets handled. So its really
three cases, PASS, XDP_TX, {default|ABORT|DROP}. I don't have
a strong preference if its case statement or if/else it should
not matter except for style.

> 
>> +
>>  static struct sk_buff *receive_small(struct virtnet_info *vi, void *buf, unsigned int len)
>>  {
>>  	struct sk_buff * skb = buf;
>> @@ -340,14 +375,28 @@ static struct sk_buff *receive_big(struct net_device *dev,
>>  				   void *buf,
>>  				   unsigned int len)
>>  {
>> +	struct bpf_prog *xdp_prog;
>>  	struct page *page = buf;
>> -	struct sk_buff *skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
>> +	struct sk_buff *skb;
>>  
>> +	rcu_read_lock();
>> +	xdp_prog = rcu_dereference(rq->xdp_prog);
>> +	if (xdp_prog) {
>> +		u32 act = do_xdp_prog(vi, xdp_prog, page, 0, len);
>> +
>> +		if (act == XDP_DROP)
>> +			goto err_xdp;
>> +	}
>> +	rcu_read_unlock();
>> +
>> +	skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
>>  	if (unlikely(!skb))
>>  		goto err;
>>  
>>  	return skb;
>>  
>> +err_xdp:
>> +	rcu_read_unlock();
>>  err:
>>  	dev->stats.rx_dropped++;
>>  	give_pages(rq, page);
>> @@ -366,10 +415,27 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>  	struct page *page = virt_to_head_page(buf);
>>  	int offset = buf - page_address(page);
>>  	unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
> 
> This is some useless computation when XDP is used, isn't it?
> 

Seems to be nice catch.

>> +	struct sk_buff *head_skb, *curr_skb;
>> +	struct bpf_prog *xdp_prog;
>>  
>> -	struct sk_buff *head_skb = page_to_skb(vi, rq, page, offset, len,
>> -					       truesize);
>> -	struct sk_buff *curr_skb = head_skb;
>> +	rcu_read_lock();
>> +	xdp_prog = rcu_dereference(rq->xdp_prog);
>> +	if (xdp_prog) {
>> +		u32 act;
>> +
>> +		if (num_buf > 1) {
>> +			bpf_warn_invalid_xdp_buffer();
>> +			goto err_xdp;
>> +		}
>> +
>> +		act = do_xdp_prog(vi, xdp_prog, page, offset, len);
>> +		if (act == XDP_DROP)
>> +			goto err_xdp;
>> +	}
>> +	rcu_read_unlock();
>> +
>> +	head_skb = page_to_skb(vi, rq, page, offset, len, truesize);
>> +	curr_skb = head_skb;
>>  
>>  	if (unlikely(!curr_skb))
>>  		goto err_skb;
> 
> I'm confused. Did the requirement to have a page per packet go away?
> I don't think this mode is doing it here.
> 

Correct its not a page per packet but there seems no reason to enforce
that here. XDP works just fine without contiguous buffers. I believe
the page per packet argument came out of the mlx driver implementation.

iirc DaveM had the most push back on the multiple packets per page
thing. Maybe he can comment again. (sorry Dave I'm dragging this up again).

> 
>> @@ -423,6 +489,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>  	ewma_pkt_len_add(&rq->mrg_avg_pkt_len, head_skb->len);
>>  	return head_skb;
>>  
>> +err_xdp:
>> +	rcu_read_unlock();
>>  err_skb:
>>  	put_page(page);
>>  	while (--num_buf) {
>> @@ -1328,6 +1396,13 @@ static int virtnet_set_channels(struct net_device *dev,
>>  	if (queue_pairs > vi->max_queue_pairs || queue_pairs == 0)
>>  		return -EINVAL;
>>  
>> +	/* For now we don't support modifying channels while XDP is loaded
>> +	 * also when XDP is loaded all RX queues have XDP programs so we only
>> +	 * need to check a single RX queue.
>> +	 */
>> +	if (vi->rq[0].xdp_prog)
>> +		return -EINVAL;
>> +
>>  	get_online_cpus();
>>  	err = virtnet_set_queues(vi, queue_pairs);
>>  	if (!err) {
>> @@ -1454,6 +1529,68 @@ static int virtnet_set_features(struct net_device *netdev,
>>  	return 0;
>>  }
>>  
>> +static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
>> +{
>> +	struct virtnet_info *vi = netdev_priv(dev);
>> +	struct bpf_prog *old_prog;
>> +	int i;
>> +
>> +	if ((dev->features & NETIF_F_LRO) && prog) {
>> +		netdev_warn(dev, "can't set XDP while LRO is on, disable LRO first\n");
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (vi->mergeable_rx_bufs && !vi->any_header_sg) {
>> +		netdev_warn(dev, "XDP expects header/data in single page\n");
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (dev->mtu > PAGE_SIZE) {
>> +		netdev_warn(dev, "XDP requires MTU less than %lu\n", PAGE_SIZE);
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (prog) {
>> +		prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
>> +		if (IS_ERR(prog))
>> +			return PTR_ERR(prog);
>> +	}
>> +
>> +	for (i = 0; i < vi->max_queue_pairs; i++) {
>> +		old_prog = rtnl_dereference(vi->rq[i].xdp_prog);
>> +		rcu_assign_pointer(vi->rq[i].xdp_prog, prog);
>> +		if (old_prog)
>> +			bpf_prog_put(old_prog);
> 
> don't we need to sync before put?
> 

No should be handled by core infrastructure in bpf_prog_put(). Also this
aligns with other driver implementations.

> 
>> +	}
>> +
>> +	return 0;
>> +}
>> +

Thanks,
John

^ permalink raw reply

* Re: [net-next PATCH v3 0/6] XDP for virtio_net
From: John Fastabend @ 2016-12-01  4:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: tgraf, shm, alexei.starovoitov, daniel, davem, john.r.fastabend,
	netdev, bblanco, brouer
In-Reply-To: <20161130205527-mutt-send-email-mst@kernel.org>

On 16-11-30 10:56 AM, Michael S. Tsirkin wrote:
> On Wed, Nov 30, 2016 at 10:41:04AM -0800, John Fastabend wrote:
>> On 16-11-30 10:35 AM, Michael S. Tsirkin wrote:
>>> On Tue, Nov 29, 2016 at 12:05:20PM -0800, John Fastabend wrote:
>>>> This implements virtio_net for the mergeable buffers and big_packet
>>>> modes. I tested this with vhost_net running on qemu and did not see
>>>> any issues. For testing num_buf > 1 I added a hack to vhost driver
>>>> to only but 100 bytes per buffer.
>>>>
>>>> There are some restrictions for XDP to be enabled and work well
>>>> (see patch 3) for more details.
>>>>
>>>>   1. LRO must be off
>>>>   2. MTU must be less than PAGE_SIZE
>>>>   3. queues must be available to dedicate to XDP
>>>>   4. num_bufs received in mergeable buffers must be 1
>>>>   5. big_packet mode must have all data on single page
>>>>
>>>> Please review any comments/feedback welcome as always.
>>>>
>>>> v2, fixes rcu usage throughout thanks to Eric and the use of
>>>> num_online_cpus() usage thanks to Jakub.
>>>>
>>>> v3, add slowpath patch to handle num_bufs > 1
>>>>
>>>> Thanks,
>>>> John
>>>
>>> BTW this is threaded incorrectly: patch 1/6 isn't a reply to 0/6,
>>> patches 2 and on are replies to patch 1.
>>>
>>
>> Ah yep, if you mangle the command line git will send the
>> cover letter even if you have mangled 'to' email addresses but when
>> it hits a real patch it aborts. At least on my version of git.
>>
>>> I'm busy until end of week, I'll review Monday. Sorry about the delay.
>>
>> In the meantime I'll post a v4 with better commit message (Alexei) and
>> address a corner cases Jakub pointed out.
> 
> I did a quick look and found some too, but a detailed review will
> have to wait till next week.

Thanks! I'll address all the comments and have a v4 by then.

^ permalink raw reply

* RE: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: wangyunjian @ 2016-12-01  4:41 UTC (permalink / raw)
  To: Jason Wang, Michael S. Tsirkin
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, caihe
In-Reply-To: <936954dd-c8a5-c0f6-c3b0-84a9d67329f5@redhat.com>

>-----Original Message-----
>From: Jason Wang [mailto:jasowang@redhat.com] 
>Sent: Thursday, December 01, 2016 11:37 AM
>To: Michael S. Tsirkin
>Cc: wangyunjian; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
>Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
>
>
>
>On 2016年12月01日 11:27, Michael S. Tsirkin wrote:
>> On Thu, Dec 01, 2016 at 11:26:21AM +0800, Jason Wang wrote:
>>> >
>>> >
>>> >On 2016年12月01日 11:21, Michael S. Tsirkin wrote:
>>>> > >On Thu, Dec 01, 2016 at 02:48:59AM +0000, wangyunjian wrote:
>>>>>> > > > >-----Original Message-----
>>>>>> > > > >From: Michael S. Tsirkin [mailto:mst@redhat.com]
>>>>>> > > > >Sent: Wednesday, November 30, 2016 9:41 PM
>>>>>> > > > >To: wangyunjian
>>>>>> > > > >Cc:jasowang@redhat.com;netdev@vger.kernel.org;linux-kernel@
>>>>>> > > > >vger.kernel.org; caihe
>>>>>> > > > >Subject: Re: [PATCH net] vhost_net: don't continue to call 
>>>>>> > > > >the recvmsg when meet errors
>>>>>> > > > >
>>>>>> > > > >On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
>>>>>>> > > > > >When we meet an error(err=-EBADFD) recvmsg,
>>>>>> > > > >How do you get EBADFD? Won't vhost_net_rx_peek_head_len 
>>>>>> > > > >return 0 in this case, breaking the loop?
>>>>> > > >We started many guest VMs while attaching/detaching some virtio-net nics for loop.
>>>>> > > >The soft lockup might happened. The err is -EBADFD.
>>>>> > > >
>>>> > >OK, I'd like to figure out what happened here. why don't we get 0 
>>>> > >when we peek at the head?
>>>> > >
>>>> > >EBADFD is from here:
>>>> > >          struct tun_struct *tun = __tun_get(tfile); ...
>>>> > >          if (!tun)
>>>> > >                  return -EBADFD;
>>>> > >
>>>> > >but then:
>>>> > >static int tun_peek_len(struct socket *sock) {
>>>> > >
>>>> > >...
>>>> > >
>>>> > >          struct tun_struct *tun; ...
>>>> > >          tun = __tun_get(tfile);
>>>> > >          if (!tun)
>>>> > >                  return 0;
>>>> > >
>>>> > >
>>>> > >so peek len should return 0.
>>>> > >
>>>> > >then while will exit:
>>>> > >          while ((sock_len = vhost_net_rx_peek_head_len(net, 
>>>> > >sock->sk))) ...
>>>> > >
>>> >
>>> >Consider this case: user do ip link del link tap0 before recvmsg() 
>>> >but after
>>> >tun_peek_len() ?
>> Sure, this can happen, but I think we'll just exit on the next loop, 
>> won't we?
>>
>
>Right, this is the only case I can image for -EBADFD, let's wait for the author to the steps.
>

Thanks, I understand it don't happen in the latest kernel version.
My problem happened using kernel version 3.10.0-xx

The peek len willn't return 0.

static int peek_head_len(struct sock *sk)
{
	struct sk_buff *head;
	int len = 0;
	unsigned long flags;

	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
	head = skb_peek(&sk->sk_receive_queue);
	if (likely(head)) {
		len = head->len;
		if (skb_vlan_tag_present(head))
			len += VLAN_HLEN;
	}

	spin_unlock_irqrestore(&sk->sk_receive_queue.lock, flags);
	return len;
}

^ permalink raw reply

* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Michael S. Tsirkin @ 2016-12-01  4:56 UTC (permalink / raw)
  To: wangyunjian
  Cc: Jason Wang, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	caihe
In-Reply-To: <34EFBCA9F01B0748BEB6B629CE643AE60B0A7C38@szxeml561-mbx.china.huawei.com>

On Thu, Dec 01, 2016 at 04:41:40AM +0000, wangyunjian wrote:
> >-----Original Message-----
> >From: Jason Wang [mailto:jasowang@redhat.com] 
> >Sent: Thursday, December 01, 2016 11:37 AM
> >To: Michael S. Tsirkin
> >Cc: wangyunjian; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
> >Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
> >
> >
> >
> >On 2016年12月01日 11:27, Michael S. Tsirkin wrote:
> >> On Thu, Dec 01, 2016 at 11:26:21AM +0800, Jason Wang wrote:
> >>> >
> >>> >
> >>> >On 2016年12月01日 11:21, Michael S. Tsirkin wrote:
> >>>> > >On Thu, Dec 01, 2016 at 02:48:59AM +0000, wangyunjian wrote:
> >>>>>> > > > >-----Original Message-----
> >>>>>> > > > >From: Michael S. Tsirkin [mailto:mst@redhat.com]
> >>>>>> > > > >Sent: Wednesday, November 30, 2016 9:41 PM
> >>>>>> > > > >To: wangyunjian
> >>>>>> > > > >Cc:jasowang@redhat.com;netdev@vger.kernel.org;linux-kernel@
> >>>>>> > > > >vger.kernel.org; caihe
> >>>>>> > > > >Subject: Re: [PATCH net] vhost_net: don't continue to call 
> >>>>>> > > > >the recvmsg when meet errors
> >>>>>> > > > >
> >>>>>> > > > >On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
> >>>>>>> > > > > >When we meet an error(err=-EBADFD) recvmsg,
> >>>>>> > > > >How do you get EBADFD? Won't vhost_net_rx_peek_head_len 
> >>>>>> > > > >return 0 in this case, breaking the loop?
> >>>>> > > >We started many guest VMs while attaching/detaching some virtio-net nics for loop.
> >>>>> > > >The soft lockup might happened. The err is -EBADFD.
> >>>>> > > >
> >>>> > >OK, I'd like to figure out what happened here. why don't we get 0 
> >>>> > >when we peek at the head?
> >>>> > >
> >>>> > >EBADFD is from here:
> >>>> > >          struct tun_struct *tun = __tun_get(tfile); ...
> >>>> > >          if (!tun)
> >>>> > >                  return -EBADFD;
> >>>> > >
> >>>> > >but then:
> >>>> > >static int tun_peek_len(struct socket *sock) {
> >>>> > >
> >>>> > >...
> >>>> > >
> >>>> > >          struct tun_struct *tun; ...
> >>>> > >          tun = __tun_get(tfile);
> >>>> > >          if (!tun)
> >>>> > >                  return 0;
> >>>> > >
> >>>> > >
> >>>> > >so peek len should return 0.
> >>>> > >
> >>>> > >then while will exit:
> >>>> > >          while ((sock_len = vhost_net_rx_peek_head_len(net, 
> >>>> > >sock->sk))) ...
> >>>> > >
> >>> >
> >>> >Consider this case: user do ip link del link tap0 before recvmsg() 
> >>> >but after
> >>> >tun_peek_len() ?
> >> Sure, this can happen, but I think we'll just exit on the next loop, 
> >> won't we?
> >>
> >
> >Right, this is the only case I can image for -EBADFD, let's wait for the author to the steps.
> >
> 
> Thanks, I understand it don't happen in the latest kernel version.
> My problem happened using kernel version 3.10.0-xx
> The peek len willn't return 0.
> 
> static int peek_head_len(struct sock *sk)
> {
> 	struct sk_buff *head;
> 	int len = 0;
> 	unsigned long flags;
> 
> 	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
> 	head = skb_peek(&sk->sk_receive_queue);

Should return NULL, should it not?
Maybe sk_receive_queue was not purged on detach back then.


> 	if (likely(head)) {
> 		len = head->len;
> 		if (skb_vlan_tag_present(head))
> 			len += VLAN_HLEN;
> 	}
> 
> 	spin_unlock_irqrestore(&sk->sk_receive_queue.lock, flags);
> 	return len;
> }

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Tom Herbert @ 2016-12-01  5:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
	Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
	Achiad Shochat
In-Reply-To: <1480559566.18162.253.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Nov 30, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 17:16 -0800, Tom Herbert wrote:
>> On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >
>> > Another issue I found during my tests last days, is a problem with BQL,
>> > and more generally when driver stops/starts the queue.
>> >
>> > When under stress and BQL stops the queue, driver TX completion does a
>> > lot of work, and servicing CPU also takes over further qdisc_run().
>> >
>> Hmm, hard to say if this is problem or a feature I think ;-). One way
>> to "solve" this would be to use IRQ round robin, that would spread the
>> load of networking across CPUs, but that gives us no additional
>> parallelism and reduces locality-- it's generally considered a bad
>> idea. The question might be: is it better to continuously ding one CPU
>> with all the networking work or try to artificially spread it out?
>> Note this is orthogonal to MQ also, so we should already have multiple
>> CPUs doing netif_schedule_queue for queues they manage.
>>
>> Do you have a test or application that shows this is causing pain?
>
> Yes, just launch enough TCP senders (more than 10,000) to fully utilize
> the NIC, with small messages.
>
> super_netperf is not good for that, because you would need 10,000
> processes and would spend too much cycles just dealing with an enormous
> working set, you would not activate BQL.
>
>
>>
>> > The work-flow is :
>> >
>> > 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
>> > unmap things, queue skbs for freeing.
>> >
>> > 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
>> >
>> > if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
>> >      netif_schedule_queue(dev_queue);
>> >
>> > This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
>> > (They absolutely have no chance to grab it)
>> >
>> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,
>> > and RX work.
>> >
>> > This problem is magnified when XPS is used, if one mono-threaded application deals with
>> > thousands of TCP sockets.
>> >
>> Do you know of an application doing this? The typical way XPS and most
>> of the other steering mechanisms are configured assume that workloads
>> tend towards a normal distribution. Such mechanisms can become
>> problematic under asymmetric loads, but then I would expect these are
>> likely dedicated servers so that the mechanisms can be tuned
>> accordingly. For instance, XPS can be configured to map more than one
>> queue to a CPU. Alternatively, IIRC Windows has some functionality to
>> tune networking for the load (spin up queues, reconfigure things
>> similar to XPS/RPS, etc.)-- that's promising up the point that we need
>> a lot of heuristics and measurement; but then we lose determinism and
>> risk edge case where we get completely unsatisfied results (one of my
>> concerns with the recent attempt to put configuration in the kernel).
>
> We have thousands of applications, and some of them 'kind of multicast'
> events to a broad number of TCP sockets.
>
> Very often, applications writers use a single thread for doing this,
> when all they need is to send small packets to 10,000 sockets, and they
> do not really care of doing this very fast. They also do not want to
> hurt other applications sharing the NIC.
>
> Very often, process scheduler will also run this single thread in a
> single cpu, ie avoiding expensive migrations if they are not needed.
>
> Problem is this behavior locks one TX queue for the duration of the
> multicast, since XPS will force all the TX packets to go to one TX
> queue.
>
The fact that XPS is forcing TX packets to go over one CPU is a result
of how you've configured XPS. There are other potentially
configurations that present different tradeoffs. For instance, TX
queues are plentiful now days to the point that we could map a number
of queues to each CPU while respecting NUMA locality between the
sending CPU and where TX completions occur. If the set is sufficiently
large this would also helps to address device lock contention. The
other thing I'm wondering is if Willem's concepts distributing DOS
attacks in RPS might be applicable in XPS. If we could detect that a
TX queue is "under attack" maybe we can automatically backoff to
distributing the load to more queues in XPS somehow.

Tom

> Other flows that would share the locked CPU experience high P99
> latencies.
>
>
>>
>> > We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
>> > to allow another cpu to service the qdisc and spare us.
>> >
>> Wouldn't this need to be an active operation? That is to queue the
>> qdisc on another CPU's output_queue?
>
> I simply suggest we try to queue the qdisc for further servicing as we
> do today, from net_tx_action(), but we might use a different bit, so
> that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
> before we grab it from net_tx_action(), maybe 100 usec later (time to
> flush all skbs queued in napi_consume_skb() and maybe RX processing,
> since most NIC handle TX completion before doing RX processing from thei
> napi poll handler.
>
> Should be doable with few changes in __netif_schedule() and
> net_tx_action(), plus some control paths that will need to take care of
> the new bit at dismantle time, right ?
>
>
>

^ permalink raw reply

* Re: [v5,1/5] soc: qcom: smem_state: Fix include for ERR_PTR()
From: Valo, Kalle @ 2016-12-01  5:13 UTC (permalink / raw)
  To: Bjorn Andersson
  Cc: k.eugene.e@gmail.com, Andy Gross, wcn36xx@lists.infradead.org,
	linux-wireless@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arm-msm@vger.kernel.org
In-Reply-To: <878ts0u7z3.fsf@kamboji.qca.qualcomm.com>

Kalle Valo <kvalo@qca.qualcomm.com> writes:

> "Valo, Kalle" <kvalo@qca.qualcomm.com> writes:
>
>> Bjorn Andersson <bjorn.andersson@linaro.org> writes:
>>
>>> On Wed 16 Nov 10:49 PST 2016, Kalle Valo wrote:
>>>
>>>> Bjorn Andersson <bjorn.andersson@linaro.org> wrote:
>>>> > The correct include file for getting errno constants and ERR_PTR() is
>>>> > linux/err.h, rather than linux/errno.h, so fix the include.
>>>> > 
>>>> > Fixes: e8b123e60084 ("soc: qcom: smem_state: Add stubs for disabled smem_state")
>>>> > Acked-by: Andy Gross <andy.gross@linaro.org>
>>>> > Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
>>>> 
>>>> For some reason this fails to compile now. Can you take a look, please?
>>>> 
>>>> ERROR: "qcom_wcnss_open_channel" [drivers/net/wireless/ath/wcn36xx/wcn36xx.ko] undefined!
>>>> make[1]: *** [__modpost] Error 1
>>>> make: *** [modules] Error 2
>>>> 
>>>> 5 patches set to Changes Requested.
>>>> 
>>>> 9429045 [v5,1/5] soc: qcom: smem_state: Fix include for ERR_PTR()
>>>> 9429047 [v5,2/5] wcn36xx: Transition driver to SMD client
>>>
>>> This patch was updated with the necessary depends in Kconfig to catch
>>> this exact issue and when I pull in your .config (which has QCOM_SMD=n,
>>> QCOM_WCNSS_CTRL=n and WCN36XX=y) I can build this just fine.
>>>
>>> I've tested the various combinations and it seems to work fine. Do you
>>> have any other patches in your tree?
>>
>> This was with the pending branch of my ath.git tree. There are other
>> wireless patches (ath10k etc) but I would guess they don't affect here.
>>
>>> Any stale objects?
>>
>> Not sure what you mean with this question, but I didn't run 'make clean'
>> if that's what you are asking.
>>
>>> Would you mind retesting this, before I invest more time in trying to
>>> reproduce the issue you're seeing?
>>
>> Sure, I'll take a look but that might take few days.
>
> I didn't find enough time to look at this in detail. I applied this to
> my ath.git pending branch, let's see what the kbuild bot finds.

It found the same problem. Interestingly I'm also building x86 with 32
bit, maybe it's related?

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git pending
head:   1ea16a1c457939b4564643f7637d5cc639a8d3b7
commit: 5eb09c672b01460804fd49b1c9cc7d1072a102f0 [96/99] wcn36xx: Transition driver to SMD client
config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        git checkout 5eb09c672b01460804fd49b1c9cc7d1072a102f0
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

>> ERROR: "qcom_wcnss_open_channel" [drivers/net/wireless/ath/wcn36xx/wcn36xx.ko] undefined!

-- 
Kalle Valo

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox