Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v2 net 0/2] Alow tg3/bnx recevie full sized 802.1ad frames
From: David Miller @ 2014-10-01 20:45 UTC (permalink / raw)
  To: vyasevich
  Cc: netdev, prashant, mchan, sony.chacko, Dept-HSGLinuxNICDev,
	vyasevic
In-Reply-To: <1412120377-11125-1-git-send-email-vyasevic@redhat.com>

From: Vladislav Yasevich <vyasevich@gmail.com>
Date: Tue, 30 Sep 2014 19:39:35 -0400

> bnx2 and tg3 drivers drop packets that exceed device mtu and that
> do not have a vlan tag.  The idea is to catch ethernet frames
> that are too long.  This also ends up dropping 802.1ad tagged
> frames since the encapsulation protocol is different.
> We should not be dropping this packets.
> 
> v2:  rebased and consolidated tg3 and bnx patches.

Applied, thanks Vlad.

^ permalink raw reply

* [net-next PATCH V6 2/2] qdisc: dequeue bulking also pickup GSO/TSO packets
From: Jesper Dangaard Brouer @ 2014-10-01 20:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
	Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
	Daniel Borkmann
  Cc: Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
	toke
In-Reply-To: <20141001203345.3321.99675.stgit@dragon>

The TSO and GSO segmented packets already benefit from bulking
on their own.

The TSO packets have always taken advantage of the only updating
the tailptr once for a large packet.

The GSO segmented packets have recently taken advantage of
bulking xmit_more API, via merge commit 53fda7f7f9e8 ("Merge
branch 'xmit_list'"), specifically via commit 7f2e870f2a4 ("net:
Move main gso loop out of dev_hard_start_xmit() into helper.")
allowing qdisc requeue of remaining list.  And via commit
ce93718fb7cd ("net: Don't keep around original SKB when we
software segment GSO frames.").

This patch allow further bulking of TSO/GSO packets together,
when dequeueing from the qdisc.

Testing:
 Measuring HoL (Head-of-Line) blocking for TSO and GSO, with
netperf-wrapper. Bulking several TSO show no performance regressions
(requeues were in the area 32 requeues/sec).

Bulking several GSOs does show small regression or very small
improvement (requeues were in the area 8000 requeues/sec).

 Using ixgbe 10Gbit/s with GSO bulking, we can measure some additional
latency. Base-case, which is "normal" GSO bulking, sees varying
high-prio queue delay between 0.38ms to 0.47ms.  Bulking several GSOs
together, result in a stable high-prio queue delay of 0.50ms.

 Using igb at 100Mbit/s with GSO bulking, shows an improvement.
Base-case sees varying high-prio queue delay between 2.23ms to 2.35ms
diff of 0.12ms corrosponding to 1500 bytes at 100Mbit/s. Bulking
several GSOs together, result in a stable high-prio queue delay of
2.23ms.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---

 net/sched/sch_generic.c |   12 +++---------
 1 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index c2e87e6..797ebef 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -63,10 +63,6 @@ static struct sk_buff *try_bulk_dequeue_skb(struct Qdisc *q,
 	struct sk_buff *skb, *tail_skb = head_skb;

 	while (bytelimit > 0) {
-		/* For now, don't bulk dequeue GSO (or GSO segmented) pkts */
-		if (tail_skb->next || skb_is_gso(tail_skb))
-			break;
-
 		skb = q->dequeue(q);
 		if (!skb)
 			break;
@@ -76,11 +72,9 @@ static struct sk_buff *try_bulk_dequeue_skb(struct Qdisc *q,
 		if (!skb)
 			break;

-		/* "skb" can be a skb list after validate call above
-		 * (GSO segmented), but it is okay to append it to
-		 * current tail_skb->next, because next round will exit
-		 * in-case "tail_skb->next" is a skb list.
-		 */
+		while (tail_skb->next) /* GSO list goto tail */
+			tail_skb = tail_skb->next;
+
 		tail_skb->next = skb;
 		tail_skb = skb;
 	}

^ permalink raw reply related

* [net-next PATCH V6 1/2] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
From: Jesper Dangaard Brouer @ 2014-10-01 20:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
	Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
	Daniel Borkmann
  Cc: Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
	toke
In-Reply-To: <20141001203345.3321.99675.stgit@dragon>

Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
sending/processing an entire skb list.

This patch implements qdisc bulk dequeue, by allowing multiple packets
to be dequeued in dequeue_skb().

The optimization principle for this is two fold, (1) to amortize
locking cost and (2) avoid expensive tailptr update for notifying HW.
 (1) Several packets are dequeued while holding the qdisc root_lock,
amortizing locking cost over several packet.  The dequeued SKB list is
processed under the TXQ lock in dev_hard_start_xmit(), thus also
amortizing the cost of the TXQ lock.
 (2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more
API to delay HW tailptr update, which also reduces the cost per
packet.

One restriction of the new API is that every SKB must belong to the
same TXQ.  This patch takes the easy way out, by restricting bulk
dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the
qdisc only have attached a single TXQ.

Some detail about the flow; dev_hard_start_xmit() will process the skb
list, and transmit packets individually towards the driver (see
xmit_one()).  In case the driver stops midway in the list, the
remaining skb list is returned by dev_hard_start_xmit().  In
sch_direct_xmit() this returned list is requeued by dev_requeue_skb().

To avoid overshooting the HW limits, which results in requeuing, the
patch limits the amount of bytes dequeued, based on the drivers BQL
limits.  In-effect bulking will only happen for BQL enabled drivers.

Small amounts for extra HoL blocking (2x MTU/0.24ms) were
measured at 100Mbit/s, with bulking 8 packets, but the
oscillating nature of the measurement indicate something, like
sched latency might be causing this effect. More comparisons
show, that this oscillation goes away occationally. Thus, we
disregard this artifact completely and remove any "magic" bulking
limit.

For now, as a conservative approach, stop bulking when seeing TSO and
segmented GSO packets.  They already benefit from bulking on their own.
A followup patch add this, to allow easier bisect-ability for finding
regressions.

Jointed work with Hannes, Daniel and Florian.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>

---
V6:
 - Remove packet "budget" limit, rely 100% on BQL

V5:
 - Changed "budget" limit to 7 packets (from 8)
 - Choosing to use skb->len, because BQL uses that
 - Added testing results to patch desc

V4:
- Patch rewritten in the Red Hat Neuchatel office jointed work with
  Hannes, Daniel and Florian.
- Conservative approach of only using on BQL enabled drivers
- No user tunable parameter, but limit bulking to 8 packets.
- For now, avoid bulking GSO packets packets.

V3:
 - Correct use of BQL
 - Some minor adjustments based on feedback.
 - Default setting only bulk dequeue 1 extra packet (2 packets).

V2:
 - Restruct functions, split out functionality
 - Use BQL bytelimit to avoid overshooting driver limits
---

 include/net/sch_generic.h |   16 ++++++++++++++++
 net/sched/sch_generic.c   |   46 +++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f126698..d17ed6f 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -7,6 +7,7 @@
 #include <linux/pkt_sched.h>
 #include <linux/pkt_cls.h>
 #include <linux/percpu.h>
+#include <linux/dynamic_queue_limits.h>
 #include <net/gen_stats.h>
 #include <net/rtnetlink.h>

@@ -119,6 +120,21 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
 	qdisc->__state &= ~__QDISC___STATE_RUNNING;
 }

+static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
+{
+	return qdisc->flags & TCQ_F_ONETXQUEUE;
+}
+
+static inline int qdisc_avail_bulklimit(const struct netdev_queue *txq)
+{
+#ifdef CONFIG_BQL
+	/* Non-BQL migrated drivers will return 0, too. */
+	return dql_avail(&txq->dql);
+#else
+	return 0;
+#endif
+}
+
 static inline bool qdisc_is_throttled(const struct Qdisc *qdisc)
 {
 	return test_bit(__QDISC_STATE_THROTTLED, &qdisc->state) ? true : false;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 7c8e5d7..c2e87e6 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,41 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 	return 0;
 }

+static struct sk_buff *try_bulk_dequeue_skb(struct Qdisc *q,
+					    struct sk_buff *head_skb,
+					    int bytelimit)
+{
+	struct sk_buff *skb, *tail_skb = head_skb;
+
+	while (bytelimit > 0) {
+		/* For now, don't bulk dequeue GSO (or GSO segmented) pkts */
+		if (tail_skb->next || skb_is_gso(tail_skb))
+			break;
+
+		skb = q->dequeue(q);
+		if (!skb)
+			break;
+
+		bytelimit -= skb->len; /* covers GSO len */
+		skb = validate_xmit_skb(skb, qdisc_dev(q));
+		if (!skb)
+			break;
+
+		/* "skb" can be a skb list after validate call above
+		 * (GSO segmented), but it is okay to append it to
+		 * current tail_skb->next, because next round will exit
+		 * in-case "tail_skb->next" is a skb list.
+		 */
+		tail_skb->next = skb;
+		tail_skb = skb;
+	}
+
+	return head_skb;
+}
+
+/* Note that dequeue_skb can possibly return a SKB list (via skb->next).
+ * A requeued skb (via q->gso_skb) can also be a SKB list.
+ */
 static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
 {
 	struct sk_buff *skb = q->gso_skb;
@@ -70,10 +105,17 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
 		} else
 			skb = NULL;
 	} else {
-		if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) {
+		if (!(q->flags & TCQ_F_ONETXQUEUE) ||
+		    !netif_xmit_frozen_or_stopped(txq)) {
+			int bytelimit = qdisc_avail_bulklimit(txq);
+
 			skb = q->dequeue(q);
-			if (skb)
+			if (skb) {
+				bytelimit -= skb->len;
 				skb = validate_xmit_skb(skb, qdisc_dev(q));
+			}
+			if (skb && qdisc_may_bulk(q))
+				skb = try_bulk_dequeue_skb(q, skb, bytelimit);
 		}
 	}

^ permalink raw reply related

* Re: [PATCH V3 next] r8169: add support for Byte Queue Limits
From: David Miller @ 2014-10-01 20:36 UTC (permalink / raw)
  To: fw; +Cc: netdev, romieu, hayeswang
In-Reply-To: <1412163483-24560-1-git-send-email-fw@strlen.de>

From: Florian Westphal <fw@strlen.de>
Date: Wed,  1 Oct 2014 13:38:03 +0200

> tested on RTL8168d/8111d model using 'super_netperf 40' with TCP/UDP_STREAM.
> 
> Output of
> while true; do
>     for n in inflight limit; do
>           echo -n $n\ ; cat $n;
>     done;
>     sleep 1;
> done
 ...
> [end of test]
> 
> Cc: Francois Romieu <romieu@fr.zoreil.com>
> Cc: Hayes Wang <hayeswang@realtek.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> Acked-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Tom Herbert <therbert@google.com>
> ---
>  Change since v2: updated commit message.

Applied, thank you.

^ permalink raw reply

* [net-next PATCH V6 0/2] qdisc: bulk dequeue support
From: Jesper Dangaard Brouer @ 2014-10-01 20:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
	Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
	Daniel Borkmann
  Cc: Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
	toke

This patchset uses DaveM's recent API changes to dev_hard_start_xmit(),
from the qdisc layer, to implement dequeue bulking.

Patch01: "qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE"
 - Implement basic qdisc dequeue bulking
 - This time, 100% relying on BQL limits, no magic safe-guard constants

Patch02: "qdisc: dequeue bulking also pickup GSO/TSO packets"
 - Extend bulking to bulk several GSO/TSO packets
 - Seperate patch, as it introduce a small regression, see test section.

We do have a patch03, which exports a userspace tunable as a BQL
tunable, that can byte-cap or disable the bulking/bursting.  But we
could not agree on it internally, thus not sending it now.  We
basically strive to avoid adding any new userspace tunable.

Testing patch01:
================
 Demonstrating the performance improvement of qdisc dequeue bulking, is
tricky because the effect only "kicks-in" once the qdisc system have a
backlog. Thus, for a backlog to form, we need either 1) to exceed wirespeed
of the link or 2) exceed the capability of the device driver.

For practical use-cases, the measureable effect of this will be a
reduction in CPU usage

01-TCP_STREAM:
--------------
Testing effect for TCP involves disabling TSO and GSO, because TCP
already benefit from bulking, via TSO and especially for GSO segmented
packets.  This patch view TSO/GSO as a seperate kind of bulking, and
avoid further bulking of these packet types.

The measured perf diff benefit (at 10Gbit/s) for a single netperf
TCP_STREAM were 9.24% less CPU used on calls to _raw_spin_lock()
(mostly from sch_direct_xmit).

If my E5-2695v2(ES) CPU is tuned according to:
 http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html
Then it is possible that a single netperf TCP_STREAM, with GSO and TSO
disabled, can utilize all bandwidth on a 10Gbit/s link.  This will
then cause a standing backlog queue at the qdisc layer.

Trying to pressure the system some more CPU util wise, I'm starting
24x TCP_STREAMs and monitoring the overall CPU utilization.  This
confirms bulking saves CPU cycles when it "kicks-in".

Tool mpstat, while stressing the system with netperf 24x TCP_STREAM, shows:
 * Disabled bulking: sys:2.58%  soft:8.50%  idle:88.78%
 * Enabled  bulking: sys:2.43%  soft:7.66%  idle:89.79%

02-UDP_STREAM
-------------
The measured perf diff benefit for UDP_STREAM were 6.41% less CPU used
on calls to _raw_spin_lock().  24x UDP_STREAM with packet size -m 1472 (to
avoid sending UDP/IP fragments).

03-trafgen driver test
----------------------
The performance of the 10Gbit/s ixgbe driver is limited due to
updating the HW ring-queue tail-pointer on every packet.  As
previously demonstrated with pktgen.

Using trafgen to send RAW frames from userspace (via AF_PACKET), and
forcing it through qdisc path (with option --qdisc-path and -t0),
sending with 12 CPUs.

I can demonstrate this driver layer limitation:
 * 12.8 Mpps with no qdisc bulking
 * 14.8 Mpps with qdisc bulking (full 10G-wirespeed)

Testing patch02:
================
Testing Bulking several GSO/TSO packets:

Measuring HoL (Head-of-Line) blocking for TSO and GSO, with
netperf-wrapper. Bulking several TSO show no performance regressions
(requeues were in the area 32 requeues/sec for 10G while transmitting
approx 813Kpps).

Bulking several GSOs does show small regression or very small
improvement (requeues were in the area 8000 requeues/sec, for 10G
while transmitting approx 813Kpps).

 Using ixgbe 10Gbit/s with GSO bulking, we can measure some additional
latency. Base-case, which is "normal" GSO bulking, sees varying
high-prio queue delay between 0.38ms to 0.47ms.  Bulking several GSOs
together, result in a stable high-prio queue delay of 0.50ms.

Corrosponding to:
 (10000*10^6)*((0.50-0.47)/10^3)/8 = 37500 bytes
 (10000*10^6)*((0.50-0.38)/10^3)/8 = 150000 bytes
 37500/1500  = 25 pkts
 150000/1500 = 100 pkts

 Using igb at 100Mbit/s with GSO bulking, shows an improvement.
Base-case sees varying high-prio queue delay between 2.23ms to 2.35ms
diff of 0.12ms corrosponding to 1500 bytes at 100Mbit/s. Bulking
several GSOs together, result in a stable high-prio queue delay of
2.23ms.

---

Jesper Dangaard Brouer (2):
      qdisc: dequeue bulking also pickup GSO/TSO packets
      qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE

 include/net/sch_generic.h |   16 ++++++++++++++++
 net/sched/sch_generic.c   |   40 ++++++++++++++++++++++++++++++++++++++--
 2 files changed, 54 insertions(+), 2 deletions(-)

-- 

^ permalink raw reply

* Re: [PATCH v2 net-next] net: cleanup and document skb fclone layout
From: David Miller @ 2014-10-01 20:34 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1412022555.30721.40.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 29 Sep 2014 13:29:15 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> Lets use a proper structure to clearly document and implement
> skb fast clones.
> 
> Then, we might experiment more easily alternative layouts.
> 
> This patch adds a new skb_fclone_busy() helper, used by tcp and xfrm,
> to stop leaking of implementation details.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> v2: rebased on latest net-next

Applied, thank you.

^ permalink raw reply

* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
From: Jesper Dangaard Brouer @ 2014-10-01 20:32 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Tom Herbert, David Miller, Linux Netdev List, Eric Dumazet,
	Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
	Alexander Duyck, John Fastabend, Dave Taht,
	Toke Høiland-Jørgensen, brouer
In-Reply-To: <542C5E8B.7070204@mojatatu.com>

On Wed, 01 Oct 2014 16:05:31 -0400
Jamal Hadi Salim <jhs@mojatatu.com> wrote:

> On 10/01/14 15:47, Jesper Dangaard Brouer wrote:
> 
> >
> > Answer is yes.  It is very easy with simple netperf TCP_STREAM to cause
> > queueing >1 packet in the qdisc layer.
> 
> If that is the case, I withdraw any doubts i had.
> Can you please specify this in your commit logs for patch 0?

I'll try to make it more explicit.
Will resubmit patchset shortly...

Notice it is not difficult cause a queue to form, but it is tricky (not
difficult) to correctly test this patchset.  Perhaps you misread my
statement earlier as "it was difficult to test and cause a queue to form"?

> > If tuned (according to my blog,
> > unloading netfilter etc.) then a single netperf TCP_STREAM will max out
> > 10Gbit/s and cause a standing queue.
> >
> 
> You should describe such tuning in the patch log (hard to read
> blogs for more than 30 seconds; write a paper if you want to provide
> more details).

I think you could read this blog in 30 sec:
 http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html

My cover letter and testing section... will take you longer that 30
sec, it have grown quite large (and Eric will not even read it :-P ;-))

Believe or not, I've actually restricted and reduced the testing
section.  If you want the hole verbose version of my testing for the
upcoming V6 patch, look at this:

 http://people.netfilter.org/hawk/qdisc/measure12_internal_V6_patch/
 http://people.netfilter.org/hawk/qdisc/measure13_V6_patch_NObulk/

And use netperf-wrapper to dive into the data.
A quick setup guide:
 http://netoptimizer.blogspot.dk/2014/09/mini-tutorial-for-netperf-wrapper-setup.html

> > I'm monitoring backlog of qdiscs, and I always see >1 backlog, I never
> > saw a standing queue of 1 packet in my testing.  Either the backlog
> > area is high 100-200 packets, or 0 backlog.  (With fake pktgen/trafgen
> > style tests, it's possible to cause 1000 backlog).
> 
> It would be nice to actually collect such stats. Monitoring the backlog
> via dumping qdisc stats is a good start - but actually keeping traces
> of average bulk size is more useful.

I usually also monitors the BQL limits during these tests.

 grep -H . /sys/class/net/eth4/queues/tx-*/byte_queue_limits/{inflight,limit}

To Toke:
 Perhaps we could convince Toke, to add a netperf-wrapper recorder for
the BQL inflight and limit?  (It would be really cool to plot together)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net-next v2] tcp: abort orphan sockets stalling on zero window probes
From: David Miller @ 2014-10-01 20:28 UTC (permalink / raw)
  To: ycheng; +Cc: edumazet, andrey.dmitrov, ncardwell, netdev
In-Reply-To: <1412022038-14408-1-git-send-email-ycheng@google.com>

From: Yuchung Cheng <ycheng@google.com>
Date: Mon, 29 Sep 2014 13:20:38 -0700

> Currently we have two different policies for orphan sockets
> that repeatedly stall on zero window ACKs. If a socket gets
> a zero window ACK when it is transmitting data, the RTO is
> used to probe the window. The socket is aborted after roughly
> tcp_orphan_retries() retries (as in tcp_write_timeout()).
> 
> But if the socket was idle when it received the zero window ACK,
> and later wants to send more data, we use the probe timer to
> probe the window. If the receiver always returns zero window ACKs,
> icsk_probes keeps getting reset in tcp_ack() and the orphan socket
> can stall forever until the system reaches the orphan limit (as
> commented in tcp_probe_timer()). This opens up a simple attack
> to create lots of hanging orphan sockets to burn the memory
> and the CPU, as demonstrated in the recent netdev post "TCP
> connection will hang in FIN_WAIT1 after closing if zero window is
> advertised." http://www.spinics.net/lists/netdev/msg296539.html
> 
> This patch follows the design in RTO-based probe: we abort an orphan
> socket stalling on zero window when the probe timer reaches both
> the maximum backoff and the maximum RTO. For example, an 100ms RTT
> connection will timeout after roughly 153 seconds (0.3 + 0.6 +
> .... + 76.8) if the receiver keeps the window shut. If the orphan
> socket passes this check, but the system already has too many orphans
> (as in tcp_out_of_resources()), we still abort it but we'll also
> send an RST packet as the connection may still be active.
> 
> In addition, we change TCP_USER_TIMEOUT to cover (life or dead)
> sockets stalled on zero-window probes. This changes the semantics
> of TCP_USER_TIMEOUT slightly because it previously only applies
> when the socket has pending transmission.
> 
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> Reported-by: Andrey Dmitrov <andrey.dmitrov@oktetlabs.ru>

Applied, thanks a lot.

^ permalink raw reply

* Re: [PATCH net-next 0/2] sunvnet: Packet processing in non-interrupt context.
From: David Miller @ 2014-10-01 20:25 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: raghuram.kothakota, netdev
In-Reply-To: <20141001202315.GN17706@oracle.com>

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Wed, 1 Oct 2014 16:23:15 -0400

> On (10/01/14 16:15), David Miller wrote:
>> > 
>> > If I make this a NAPI driver that uses napi_gro_receive, I would
>> > still have to deal with a budget, right?
>> 
>> Absolutely, and YOU MUST, because the budget keeps one device from
>> hogging the RX packet input path from other devices on a given cpu.
> 
> yes, but limiting the budget of sk_buffs read mid-way during descriptor
> read is deadly to perf because of the ensuing LDC stop/start exchange -
> ends up being even worse than the baseline.
> 
> Doesnt the netif_rx/process_backlog infra already do a napi_schedule,
> thus avoiding the above concern? 

The limit is by default 64 packets, it won't matter.

I think you're overplaying the things that block use of NAPI, how
about implementing it properly, using netif_gso_receive() and proper
RCU accesses, and coming back with some real performance measurements?

^ permalink raw reply

* [PATCH net-next 2/2] vnet_start_xmit() must hold a refcnt on port.
From: Sowmini Varadhan @ 2014-10-01 20:25 UTC (permalink / raw)
  To: davem, raghuram.kothakota, sowmini.varadhan, david.stevens; +Cc: netdev


A vnet_port_remove could be triggered as a result of an ldm-unbind
operation
by the peer, module unload, or other changes to the inter-vnet-link
configuration.  When this is concurrent with vnet_start_xmit(),
the following sequence could occur

  thread 1                                    thread 2
vnet_start_xmit
 -> tx_port_find
    spin_lock_irqsave(&vp->lock..)
    ret = __tx_port_find(..)
    spin_lock_irqrestore(&vp->lock..)
                                           vio_remove -> ..
                                               ->vnet_port_remove
                                           spin_lock_irqsave(&vp->lock..)
                                           cleanup
                                           spin_lock_irqrestore(&vp->lock..)
                                           kfree(port)
 /* attempt to use ret will bomb */

This patch avoids the problem by holding/releasing a refcnt on
the vnet_port.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Raghuram Kothakota <raghuram.kothakota@oracle.com>
---
changes since v1: reverted switch_port caching in `struct vnet'.

 drivers/net/ethernet/sun/sunvnet.c | 46 +++++++++++++++++++++++++++++++++++++-
 drivers/net/ethernet/sun/sunvnet.h |  1 +
 2 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index e2aacf5..62264cb 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -47,6 +47,42 @@ MODULE_VERSION(DRV_MODULE_VERSION);
 
 static int __vnet_tx_trigger(struct vnet_port *port, u32 start);
 
+static inline int vnet_refcnt_read(const struct vnet_port *port)
+{
+	return atomic_read(&port->refcnt);
+}
+
+static inline void vnet_hold(struct vnet_port *port)
+{
+	atomic_inc(&port->refcnt);
+	BUG_ON(vnet_refcnt_read(port) == 0);
+}
+
+static void vnet_put(struct vnet_port *port)
+{
+	atomic_dec(&port->refcnt);
+}
+
+static void vnet_wait_allrefs(struct vnet_port *port)
+{
+	unsigned long warning_time;
+	int refcnt = vnet_refcnt_read(port);
+
+	warning_time = jiffies;
+	while (refcnt != 0) {
+		msleep(250);
+		refcnt = vnet_refcnt_read(port);
+		if (time_after(jiffies, warning_time + 10 * HZ)) {
+			pr_emerg("vnet_wait_allrefs: waiting for port to "
+				 "%x:%x:%x:%x:%x:%x. Usage count = %d\n",
+				port->raddr[0], port->raddr[1],
+				port->raddr[2], port->raddr[3],
+				port->raddr[4], port->raddr[5], refcnt);
+			warning_time = jiffies;
+		}
+	}
+}
+
 /* Ordered from largest major to lowest */
 static struct vio_version vnet_versions[] = {
 	{ .major = 1, .minor = 6 },
@@ -773,14 +809,17 @@ struct vnet_port *__tx_port_find(struct vnet *vp, struct sk_buff *skb)
 	hlist_for_each_entry(port, hp, hash) {
 		if (!port_is_up(port))
 			continue;
-		if (ether_addr_equal(port->raddr, skb->data))
+		if (ether_addr_equal(port->raddr, skb->data)) {
+			vnet_hold(port);
 			return port;
+		}
 	}
 	list_for_each_entry(port, &vp->port_list, list) {
 		if (!port->switch_port)
 			continue;
 		if (!port_is_up(port))
 			continue;
+		vnet_hold(port);
 		return port;
 	}
 	return NULL;
@@ -974,6 +1013,7 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			dev->stats.tx_errors++;
 		}
 		spin_unlock_irqrestore(&port->vio.lock, flags);
+		vnet_put(port);
 		return NETDEV_TX_BUSY;
 	}
 
@@ -1071,6 +1111,7 @@ ldc_start_done:
 	vnet_free_skbs(freeskbs);
 
 	(void)mod_timer(&port->clean_timer, jiffies + VNET_CLEAN_TIMEOUT);
+	vnet_put(port);
 
 	return NETDEV_TX_OK;
 
@@ -1087,6 +1128,8 @@ out_dropped:
 	else
 		del_timer(&port->clean_timer);
 	dev->stats.tx_dropped++;
+	if (port)
+		vnet_put(port);
 	return NETDEV_TX_OK;
 }
 
@@ -1636,6 +1679,7 @@ static int vnet_port_remove(struct vio_dev *vdev)
 		spin_unlock_irqrestore(&vp->lock, flags);
 
 		vnet_workq_disable(port);
+		vnet_wait_allrefs(port);
 
 		vnet_port_free_tx_bufs(port);
 		vio_ldc_free(&port->vio);
diff --git a/drivers/net/ethernet/sun/sunvnet.h b/drivers/net/ethernet/sun/sunvnet.h
index 1182ec6..5c35474 100644
--- a/drivers/net/ethernet/sun/sunvnet.h
+++ b/drivers/net/ethernet/sun/sunvnet.h
@@ -53,6 +53,7 @@ struct vnet_port {
 	u32			stop_rx_idx;
 	bool			stop_rx;
 	bool			start_cons;
+	atomic_t		refcnt;
 
 	struct timer_list	clean_timer;
 
-- 
1.8.4.2

^ permalink raw reply related

* [PATCHv2 net-next 1/2] sunvnet: Process Rx data packets in a BH handler
From: Sowmini Varadhan @ 2014-10-01 20:25 UTC (permalink / raw)
  To: davem, raghuram.kothakota, sowmini.varadhan, david.stevens; +Cc: netdev


Move VIO DATA processing out of interrupt context,
and into a bottom-half handler (vnet_event_bh())

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Raghuram Kothakota <raghuram.kothakota@oracle.com>
---
 drivers/net/ethernet/sun/sunvnet.c | 126 ++++++++++++++++++++++++-------------
 drivers/net/ethernet/sun/sunvnet.h |  10 ++-
 2 files changed, 91 insertions(+), 45 deletions(-)

diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index 1262697..e2aacf5 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -274,6 +274,7 @@ static struct sk_buff *alloc_and_align_skb(struct net_device *dev,
 	return skb;
 }
 
+/* reads in exactly one sk_buff */
 static int vnet_rx_one(struct vnet_port *port, unsigned int len,
 		       struct ldc_trans_cookie *cookies, int ncookies)
 {
@@ -311,9 +312,8 @@ static int vnet_rx_one(struct vnet_port *port, unsigned int len,
 
 	dev->stats.rx_packets++;
 	dev->stats.rx_bytes += len;
-
-	netif_rx(skb);
-
+	/* BH context cannot call netif_receive_skb */
+	netif_rx_ni(skb);
 	return 0;
 
 out_free_skb:
@@ -534,7 +534,10 @@ static int vnet_ack(struct vnet_port *port, void *msgbuf)
 	struct net_device *dev;
 	struct vnet *vp;
 	u32 end;
+	unsigned long flags;
 	struct vio_net_desc *desc;
+	bool need_trigger = false;
+
 	if (unlikely(pkt->tag.stype_env != VIO_DRING_DATA))
 		return 0;
 
@@ -545,21 +548,17 @@ static int vnet_ack(struct vnet_port *port, void *msgbuf)
 	/* sync for race conditions with vnet_start_xmit() and tell xmit it
 	 * is time to send a trigger.
 	 */
+	spin_lock_irqsave(&port->vio.lock, flags);
 	dr->cons = next_idx(end, dr);
 	desc = vio_dring_entry(dr, dr->cons);
-	if (desc->hdr.state == VIO_DESC_READY && port->start_cons) {
-		/* vnet_start_xmit() just populated this dring but missed
-		 * sending the "start" LDC message to the consumer.
-		 * Send a "start" trigger on its behalf.
-		 */
-		if (__vnet_tx_trigger(port, dr->cons) > 0)
-			port->start_cons = false;
-		else
-			port->start_cons = true;
-	} else {
-		port->start_cons = true;
-	}
+	if (desc->hdr.state == VIO_DESC_READY && !port->start_cons)
+		need_trigger = true;
+	else
+		port->start_cons = true; /* vnet_start_xmit will send trigger */
+	spin_unlock_irqrestore(&port->vio.lock, flags);
 
+	if (need_trigger && __vnet_tx_trigger(port, dr->cons) <= 0)
+		port->start_cons = true;
 
 	vp = port->vp;
 	dev = vp->dev;
@@ -617,33 +616,13 @@ static void maybe_tx_wakeup(unsigned long param)
 	netif_tx_unlock(dev);
 }
 
-static void vnet_event(void *arg, int event)
+static void vnet_event_bh(struct work_struct *work)
 {
-	struct vnet_port *port = arg;
+	struct vnet_port *port = container_of(work, struct vnet_port, rx_work);
 	struct vio_driver_state *vio = &port->vio;
-	unsigned long flags;
 	int tx_wakeup, err;
 
-	spin_lock_irqsave(&vio->lock, flags);
-
-	if (unlikely(event == LDC_EVENT_RESET ||
-		     event == LDC_EVENT_UP)) {
-		vio_link_state_change(vio, event);
-		spin_unlock_irqrestore(&vio->lock, flags);
-
-		if (event == LDC_EVENT_RESET) {
-			port->rmtu = 0;
-			vio_port_up(vio);
-		}
-		return;
-	}
-
-	if (unlikely(event != LDC_EVENT_DATA_READY)) {
-		pr_warn("Unexpected LDC event %d\n", event);
-		spin_unlock_irqrestore(&vio->lock, flags);
-		return;
-	}
-
+	mutex_lock(&port->vnet_rx_mutex);
 	tx_wakeup = err = 0;
 	while (1) {
 		union {
@@ -691,14 +670,41 @@ static void vnet_event(void *arg, int event)
 		if (err == -ECONNRESET)
 			break;
 	}
-	spin_unlock(&vio->lock);
-	/* Kick off a tasklet to wake the queue.  We cannot call
-	 * maybe_tx_wakeup directly here because we could deadlock on
-	 * netif_tx_lock() with dev_watchdog()
-	 */
 	if (unlikely(tx_wakeup && err != -ECONNRESET))
 		tasklet_schedule(&port->vp->vnet_tx_wakeup);
+	mutex_unlock(&port->vnet_rx_mutex);
+	vio_set_intr(vio->vdev->rx_ino, HV_INTR_ENABLED);
+}
+
+static void vnet_event(void *arg, int event)
+{
+	struct vnet_port *port = arg;
+	struct vio_driver_state *vio = &port->vio;
+	unsigned long flags;
 
+	spin_lock_irqsave(&vio->lock, flags);
+
+	if (unlikely(event == LDC_EVENT_RESET ||
+		     event == LDC_EVENT_UP)) {
+		vio_link_state_change(vio, event);
+		spin_unlock_irqrestore(&vio->lock, flags);
+
+		if (event == LDC_EVENT_RESET)
+			vio_port_up(vio);
+		return;
+	}
+
+	if (unlikely(event != LDC_EVENT_DATA_READY)) {
+		pr_warn("Unexpected LDC event %d\n", event);
+		spin_unlock_irqrestore(&vio->lock, flags);
+		return;
+	}
+
+	if ((port->flags & VNET_PORT_DEAD) == 0) {
+		vio_set_intr(vio->vdev->rx_ino, HV_INTR_DISABLED);
+		queue_work(port->rx_workq, &port->rx_work);
+	}
+	spin_unlock(&vio->lock);
 	local_irq_restore(flags);
 }
 
@@ -750,6 +756,11 @@ static inline bool port_is_up(struct vnet_port *vnet)
 {
 	struct vio_driver_state *vio = &vnet->vio;
 
+	/* Should never hit a DEAD port here: we are holding the vnet lock,
+	 * and the list cleanup and VNET_PORT_DEAD marking gets done
+	 * under the vnet lock as well.
+	 */
+	BUG_ON(vnet->flags & VNET_PORT_DEAD);
 	return !!(vio->hs_state & VIO_HS_COMPLETE);
 }
 
@@ -1487,6 +1498,23 @@ static void print_version(void)
 	printk_once(KERN_INFO "%s", version);
 }
 
+static int vnet_workq_enable(struct vnet_port *port)
+{
+	port->rx_workq = alloc_workqueue(dev_name(&port->vio.vdev->dev),
+					 WQ_HIGHPRI|WQ_UNBOUND, 1);
+	if (!port->rx_workq)
+		return -ENOMEM;
+	mutex_init(&port->vnet_rx_mutex);
+	INIT_WORK(&port->rx_work, vnet_event_bh);
+	return 0;
+}
+
+static void vnet_workq_disable(struct vnet_port *port)
+{
+	flush_workqueue(port->rx_workq);
+	destroy_workqueue(port->rx_workq);
+}
+
 const char *remote_macaddr_prop = "remote-mac-address";
 
 static int vnet_port_probe(struct vio_dev *vdev, const struct vio_device_id *id)
@@ -1536,6 +1564,10 @@ static int vnet_port_probe(struct vio_dev *vdev, const struct vio_device_id *id)
 	if (err)
 		goto err_out_free_port;
 
+	err = vnet_workq_enable(port);
+	if (err)
+		goto err_out_free_port;
+
 	err = vnet_port_alloc_tx_bufs(port);
 	if (err)
 		goto err_out_free_ldc;
@@ -1572,6 +1604,7 @@ static int vnet_port_probe(struct vio_dev *vdev, const struct vio_device_id *id)
 
 err_out_free_ldc:
 	vio_ldc_free(&port->vio);
+	destroy_workqueue(port->rx_workq);
 
 err_out_free_port:
 	kfree(port);
@@ -1589,14 +1622,21 @@ static int vnet_port_remove(struct vio_dev *vdev)
 		struct vnet *vp = port->vp;
 		unsigned long flags;
 
+		vio_set_intr(port->vio.vdev->rx_ino, HV_INTR_DISABLED);
 		del_timer_sync(&port->vio.timer);
 		del_timer_sync(&port->clean_timer);
 
+		/* VNET_PORT_DEAD disallows any more vnet_event_bh
+		 * scheduling and prevents new refs to the port
+		 */
 		spin_lock_irqsave(&vp->lock, flags);
+		port->flags |= VNET_PORT_DEAD;
 		list_del(&port->list);
 		hlist_del(&port->hash);
 		spin_unlock_irqrestore(&vp->lock, flags);
 
+		vnet_workq_disable(port);
+
 		vnet_port_free_tx_bufs(port);
 		vio_ldc_free(&port->vio);
 
diff --git a/drivers/net/ethernet/sun/sunvnet.h b/drivers/net/ethernet/sun/sunvnet.h
index c911045..1182ec6 100644
--- a/drivers/net/ethernet/sun/sunvnet.h
+++ b/drivers/net/ethernet/sun/sunvnet.h
@@ -2,7 +2,7 @@
 #define _SUNVNET_H
 
 #include <linux/interrupt.h>
-
+#include <linux/workqueue.h>
 #define DESC_NCOOKIES(entry_size)	\
 	((entry_size) - sizeof(struct vio_net_desc))
 
@@ -41,7 +41,8 @@ struct vnet_port {
 	struct hlist_node	hash;
 	u8			raddr[ETH_ALEN];
 	u8			switch_port;
-	u8			__pad;
+	u8			flags;
+#define	VNET_PORT_DEAD	0x01
 
 	struct vnet		*vp;
 
@@ -56,6 +57,11 @@ struct vnet_port {
 	struct timer_list	clean_timer;
 
 	u64			rmtu;
+
+	struct mutex            vnet_rx_mutex; /* serializes rx_workq */
+	struct work_struct      rx_work;
+	struct workqueue_struct *rx_workq;
+
 };
 
 static inline struct vnet_port *to_vnet_port(struct vio_driver_state *vio)
-- 
1.8.4.2

^ permalink raw reply related

* [PATCH net-next 0/2] sunvnet: Packet processing in non-interrupt context
From: Sowmini Varadhan @ 2014-10-01 20:24 UTC (permalink / raw)
  To: davem, raghuram.kothakota, sowmini.varadhan, david.stevens; +Cc: netdev

The existing sunvnet implementation does all the packet interception
in LDC interrupt context. Patch 1 of this series moves the data
processing to a bottom-half handler.

Some context for the reasons for choosing a BH handler over NAPI:
   A NAPI (or simple tasklet) based implementation provides softirq
   context, which allows the driver to safely invoke netif_receive_skb()
   to deliver the packet to the IP stack. But in the case of sunvnet,
   we are already receiving multiple packets for a single ldc_rx
   interrupt, so the budget-based softirq-vs-polling infra does not
   provide a significant optimization. Rather, it can get in the way,
   if we constrain the vnet-rx path to a poorly chosen budget, and force
   ourselves to send a STOPPED/START ldc exchange needlessly.

   A BH Rx handler is a simpler way to  avoid the weaknesses of processing
   packets in LDC interrupt context, and also provides Rx load-spreading
   across multiple CPUs.

Note that PATCH 1 is dependant on the functions added as part of the
sparc-next commit "sparc64: Add vio_set_intr() to enable/disable Rx interrupts"
(Cf: http://www.spinics.net/lists/sparclinux/msg12647.html)

PATCH 2 of this series fixes a race-condition between vnet_port_remove()
and vnet_start_xmit().

changes since v1 to Patch 2: revert switch_port caching in struct vnet

Sowmini Varadhan (2):
  Process Rx data packets in a BH handler
  vnet_start_xmit() must hold a refcnt on port.

 drivers/net/ethernet/sun/sunvnet.c | 187 +++++++++++++++++++++++++++----------
 drivers/net/ethernet/sun/sunvnet.h |  12 ++-
 2 files changed, 146 insertions(+), 53 deletions(-)

-- 
1.8.4.2

^ permalink raw reply

* Re: [PATCH net-next 2/2] sunvnet: vnet_start_xmit() must hold a refcnt on port.
From: David Miller @ 2014-10-01 20:24 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: eric.dumazet, raghuram.kothakota, netdev
In-Reply-To: <542C61DD.8010504@oracle.com>

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Wed, 01 Oct 2014 16:19:41 -0400

> On 10/01/2014 04:15 PM, Eric Dumazet wrote:
>> On Wed, 2014-10-01 at 15:44 -0400, Sowmini Varadhan wrote:
>>> On (10/01/14 12:05), Eric Dumazet wrote:
>>>>
>>>> Hmpff... This calls for rcu protection here !
>>>>
>>>
>>> I did consider that, but given that the lists containing the ports are
>>> accessed in multiple contexts, some of which can sleep, and given that
>>> the vnet port is similar in spirit to the net_device, I followed the
>>> net_device model of dev_put etc.
>>
>> dev_put() uses per cpu variables, an order of magnitude faster than a
>> atomic put/get :(
> 
> I could make this a per-cpu variable, it would not be too hard to
> change
> vnet_put/hold etc. That's not a problem.

If you went with NAPI you could use RCU.

All of the negative and complicated aspects of your changes strictly
have to do with not going with that implementation route.

^ permalink raw reply

* Re: [PATCH net-next 0/2] sunvnet: Packet processing in non-interrupt context.
From: Sowmini Varadhan @ 2014-10-01 20:23 UTC (permalink / raw)
  To: David Miller; +Cc: raghuram.kothakota, netdev
In-Reply-To: <20141001.161508.1823792090990427608.davem@davemloft.net>

On (10/01/14 16:15), David Miller wrote:
> > 
> > If I make this a NAPI driver that uses napi_gro_receive, I would
> > still have to deal with a budget, right?
> 
> Absolutely, and YOU MUST, because the budget keeps one device from
> hogging the RX packet input path from other devices on a given cpu.

yes, but limiting the budget of sk_buffs read mid-way during descriptor
read is deadly to perf because of the ensuing LDC stop/start exchange -
ends up being even worse than the baseline.

Doesnt the netif_rx/process_backlog infra already do a napi_schedule,
thus avoiding the above concern? 

--Sowmini

^ permalink raw reply

* Re: [PATCH net-next 2/2] sunvnet: vnet_start_xmit() must hold a refcnt on port.
From: Sowmini Varadhan @ 2014-10-01 20:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, raghuram.kothakota, netdev
In-Reply-To: <1412194526.16704.61.camel@edumazet-glaptop2.roam.corp.google.com>

On 10/01/2014 04:15 PM, Eric Dumazet wrote:
> On Wed, 2014-10-01 at 15:44 -0400, Sowmini Varadhan wrote:
>> On (10/01/14 12:05), Eric Dumazet wrote:
>>>
>>> Hmpff... This calls for rcu protection here !
>>>
>>
>> I did consider that, but given that the lists containing the ports are
>> accessed in multiple contexts, some of which can sleep, and given that
>> the vnet port is similar in spirit to the net_device, I followed the
>> net_device model of dev_put etc.
>
> dev_put() uses per cpu variables, an order of magnitude faster than a
> atomic put/get :(

I could make this a per-cpu variable, it would not be too hard to change
vnet_put/hold etc. That's not a problem.

--Sowmini

^ permalink raw reply

* Re: [PATCH net-next 2/2] sunvnet: vnet_start_xmit() must hold a refcnt on port.
From: Eric Dumazet @ 2014-10-01 20:15 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: davem, raghuram.kothakota, netdev
In-Reply-To: <20141001194403.GM17706@oracle.com>

On Wed, 2014-10-01 at 15:44 -0400, Sowmini Varadhan wrote:
> On (10/01/14 12:05), Eric Dumazet wrote:
> > 
> > Hmpff... This calls for rcu protection here !
> > 
> 
> I did consider that, but given that the lists containing the ports are
> accessed in multiple contexts, some of which can sleep, and given that
> the vnet port is similar in spirit to the net_device, I followed the
> net_device model of dev_put etc.

dev_put() uses per cpu variables, an order of magnitude faster than a
atomic put/get :(

^ permalink raw reply

* Re: [PATCH net-next 0/2] sunvnet: Packet processing in non-interrupt context.
From: David Miller @ 2014-10-01 20:15 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: raghuram.kothakota, netdev
In-Reply-To: <542C5C37.1040409@oracle.com>

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Wed, 01 Oct 2014 15:55:35 -0400

> On 10/01/2014 03:50 PM, David Miller wrote:
> 
>>>
>>>     A BH Rx handler is a simpler way to avoid the weaknesses of processing
>>>     packets in LDC interrupt context, and also provides Rx load-spreading
>>>     across multiple CPUs.
>>
>> I think you want to re-evaluate this considering napi_gro_receive()
>> which
>> is what you should really be calling in a NAPI driver.
> 
> 
> Sorry I dont follow the suggestion.
> 
> If I make this a NAPI driver that uses napi_gro_receive, I would
> still have to deal with a budget, right?

Absolutely, and YOU MUST, because the budget keeps one device from
hogging the RX packet input path from other devices on a given cpu.

^ permalink raw reply

* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
From: Jamal Hadi Salim @ 2014-10-01 20:05 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Tom Herbert, David Miller, Linux Netdev List, Eric Dumazet,
	Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
	Alexander Duyck, John Fastabend, Dave Taht,
	Toke Høiland-Jørgensen
In-Reply-To: <20141001214700.18b16387@redhat.com>

On 10/01/14 15:47, Jesper Dangaard Brouer wrote:

>
> Answer is yes.  It is very easy with simple netperf TCP_STREAM to cause
> queueing >1 packet in the qdisc layer.

If that is the case, I withdraw any doubts i had.
Can you please specify this in your commit logs for patch 0?

> If tuned (according to my blog,
> unloading netfilter etc.) then a single netperf TCP_STREAM will max out
> 10Gbit/s and cause a standing queue.
>

You should describe such tuning in the patch log (hard to read
blogs for more than 30 seconds; write a paper if you want to provide
more details).

> I'm monitoring backlog of qdiscs, and I always see >1 backlog, I never
> saw a standing queue of 1 packet in my testing.  Either the backlog
> area is high 100-200 packets, or 0 backlog.  (With fake pktgen/trafgen
> style tests, it's possible to cause 1000 backlog).

It would be nice to actually collect such stats. Monitoring the backlog
via dumping qdisc stats is a good start - but actually keeping traces
of average bulk size is more useful.

cheers,
jamal

^ permalink raw reply

* Re: [PATCH net-next 0/2] sunvnet: Packet processing in non-interrupt context.
From: Sowmini Varadhan @ 2014-10-01 19:55 UTC (permalink / raw)
  To: David Miller; +Cc: raghuram.kothakota, netdev
In-Reply-To: <20141001.155009.2277009117294992988.davem@davemloft.net>

On 10/01/2014 03:50 PM, David Miller wrote:

>>
>>     A BH Rx handler is a simpler way to  avoid the weaknesses of processing
>>     packets in LDC interrupt context, and also provides Rx load-spreading
>>     across multiple CPUs.
>
> I think you want to re-evaluate this considering napi_gro_receive() which
> is what you should really be calling in a NAPI driver.

Sorry I dont follow the suggestion.

If I make this a NAPI driver that uses napi_gro_receive, I would
still have to deal with a budget, right?

--Sowmini

^ permalink raw reply

* Re: [PATCH net-next 2/2] sunvnet: vnet_start_xmit() must hold a refcnt on port.
From: David Miller @ 2014-10-01 19:52 UTC (permalink / raw)
  To: david.stevens; +Cc: sowmini.varadhan, raghuram.kothakota, netdev
In-Reply-To: <542C56A5.4070805@oracle.com>

From: David L Stevens <david.stevens@oracle.com>
Date: Wed, 01 Oct 2014 15:31:49 -0400

> 
> 
> On 10/01/2014 03:23 PM, Sowmini Varadhan wrote:
>> On (10/01/14 15:06), David L Stevens wrote:
>>>
>>> This "vp->switch_port" addition doesn't appear to be related to the port refcnt
>>> change, and doesn't allow for multiple switch ports.
>> 
>> The switch_port is the connection to Dom0. Do you envision us having more than
>> one switch_port? How?
> 
> While Dom0 might only create one port with the "switch" flag, the flag just means
> "I can reach anybody" and is not inherently unique. I don't think an attached
> VM should assume there is always only one; it prevents multipath load balancing
> kinds of things in the future.
> 
> Also, there is the broader point that this sort of change should be a separate patch.
> It isn't required for fixing the dangling reference -- it is an independent change.

Multiple switch ports are absolutely allowed by the protocol spec and can
provide the suggested facilities David mentioned, don't prevent them from
being used.

^ permalink raw reply

* Re: [PATCH net-next 0/2] sunvnet: Packet processing in non-interrupt context.
From: David Miller @ 2014-10-01 19:50 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: raghuram.kothakota, netdev
In-Reply-To: <20141001185604.GG17706@oracle.com>

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Wed, 1 Oct 2014 14:56:04 -0400

> 
> The existing sunvnet implementation does all the packet interception
> in LDC interrupt context. Patch 1 of this series moves the data
> processing to a bottom-half handler.
> 
> Some context for the reasons for choosing a BH handler over NAPI:
>    A NAPI (or simple tasklet) based implementation provides softirq
>    context, which allows the driver to safely invoke netif_receive_skb()
>    to deliver the packet to the IP stack. But in the case of sunvnet,
>    we are already receiving multiple packets for a single ldc_rx
>    interrupt, so the budget-based softirq-vs-polling infra does not
>    provide a significant optimization. Rather, it can get in the way,
>    if we constrain the vnet-rx path to a poorly chosen budget, and force
>    ourselves to send a STOPPED/START ldc exchange needlessly.
> 
>    A BH Rx handler is a simpler way to  avoid the weaknesses of processing
>    packets in LDC interrupt context, and also provides Rx load-spreading
>    across multiple CPUs.

I think you want to re-evaluate this considering napi_gro_receive() which
is what you should really be calling in a NAPI driver.

^ permalink raw reply

* Re: [PATCH 1/1 net-next] cipso: add __init to cipso_v4_cache_init
From: David Miller @ 2014-10-01 19:47 UTC (permalink / raw)
  To: fabf; +Cc: linux-kernel, kuznet, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <1412184603-17294-1-git-send-email-fabf@skynet.be>

From: Fabian Frederick <fabf@skynet.be>
Date: Wed,  1 Oct 2014 19:30:03 +0200

> cipso_v4_cache_init is only called by __init cipso_v4_init
> 
> Signed-off-by: Fabian Frederick <fabf@skynet.be>

Applied.

^ permalink raw reply

* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
From: David Miller @ 2014-10-01 19:47 UTC (permalink / raw)
  To: brouer
  Cc: jhs, therbert, netdev, eric.dumazet, hannes, fw, dborkman,
	alexander.duyck, john.r.fastabend, dave.taht, toke
In-Reply-To: <20141001192840.5679a671@redhat.com>

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 1 Oct 2014 19:28:40 +0200

> Thus, code is activated only when q->qlen is >= 1. And I have already
> shown that we see a win with just bulking 2 packets:
>  "Subject: qdisc/UDP_STREAM: measuring effect of qdisc bulk dequeue"
>  http://thread.gmane.org/gmane.linux.network/331152/focus=331154
> 
> I actually believe, that I have already measured and shown that going
> though the bulk dequeue is a win, this should show direct_xmit vs.
> bulking:
>  "Subject: qdisc/trafgen: Measuring effect of qdisc bulk dequeue, with trafgen"
>  http://thread.gmane.org/gmane.linux.network/331152

+1 Amen...

^ permalink raw reply

* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
From: Jesper Dangaard Brouer @ 2014-10-01 19:47 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Tom Herbert, David Miller, Linux Netdev List, Eric Dumazet,
	Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
	Alexander Duyck, John Fastabend, Dave Taht,
	Toke Høiland-Jørgensen, brouer
In-Reply-To: <542C4E0D.4050404@mojatatu.com>

On Wed, 01 Oct 2014 14:55:09 -0400
Jamal Hadi Salim <jhs@mojatatu.com> wrote:

> On 10/01/14 13:28, Jesper Dangaard Brouer wrote:
> 
> > Thus, code is activated only when q->qlen is >= 1. And I have already
> > shown that we see a win with just bulking 2 packets:
> 
> If you can get 2 packets, indeed you win. If you can on average get >1
> over a long period, you still win.
> You have clearly demonstrated you can do that with traffic
> generators (udp or in kernel pktgen). I was more worried about the
> common use case scenario (handwaved as 1-24 TCP streams).
> 
> The key here is: *if you never hit bulking* then the cost is
> _per packet_ for sch_direct_xmit bypass.
> Question is what is that cost for the common case as defined above?
> Can you hit a bulk level >1 on 1-24 TCP streams?
> I would be happy if your answer is *yes*. 

Answer is yes.  It is very easy with simple netperf TCP_STREAM to cause
queueing >1 packet in the qdisc layer.  If tuned (according to my blog,
unloading netfilter etc.) then a single netperf TCP_STREAM will max out
10Gbit/s and cause a standing queue.

I'm monitoring backlog of qdiscs, and I always see >1 backlog, I never
saw a standing queue of 1 packet in my testing.  Either the backlog
area is high 100-200 packets, or 0 backlog.  (With fake pktgen/trafgen
style tests, it's possible to cause 1000 backlog).


> If your answer is no (since
> it is hard to achieve) - then how far off is it from before your
> patches (since now you have added at minimal a branch check).
> I think it is fair for you to quantify that, no?
> Feature is still useful for the other cases.
> 
> Note:
> This is what i referred to as the "no animals were hurt during the
> making of these patches" statement. I am sorry again for raining on
> the parade.

I'm starting to think this patch is the most carefully tested patch on
netdev for a very very long time... I'm getting tired of this testing
going on for so long.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH 1/1 net-next] inet: frags: add __init to ip4_frags_ctl_register
From: David Miller @ 2014-10-01 19:46 UTC (permalink / raw)
  To: fabf; +Cc: linux-kernel, kuznet, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <1412183937-16055-1-git-send-email-fabf@skynet.be>

From: Fabian Frederick <fabf@skynet.be>
Date: Wed,  1 Oct 2014 19:18:57 +0200

> ip4_frags_ctl_register is only called by __init ipfrag_init
> 
> Signed-off-by: Fabian Frederick <fabf@skynet.be>

Applied.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox