Netdev List
* [PATCH] pkt_sched: fq: rate limiting improvements
From: Eric Dumazet @ 2013-10-01 16:10 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Steinar H. Gunderson

From: Eric Dumazet <edumazet@google.com>

FQ rate limiting suffers from two problems, reported
by Steinar:

1) FQ enforces a delay when flow quantum is exhausted in order
to reduce cpu overhead. But if packets are small, current
delay computation is slightly wrong, and observed rates can
be too high.

Steinar had this problem because he disabled TSO and GSO,
and default FQ quantum is 2*1514.

(Of course, I hope the recent TSO auto sizing changes will remove
the need to disable TSO in the first place)

2) maxrate was not used for forwarded flows (skbs not attached
to a socket)
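To put numbers on point 1, here is a minimal userspace sketch of the delay computation as changed by the patch. It is not the kernel code: the function name is made up for illustration, and the rate is assumed to be in bytes per second, as in the diff.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC  1000000000ULL
#define NSEC_PER_MSEC 1000000ULL

/* Delay, in nanoseconds, before a flow may send again once its credit
 * is exhausted. rate is in bytes per second and must be non-zero.
 * Charging at least one quantum (instead of only the last packet's
 * length) is what keeps small packets from exceeding the rate.
 */
uint64_t fq_throttle_delay_ns(uint32_t pkt_len, uint32_t quantum,
                              uint32_t rate)
{
    uint32_t plen = pkt_len > quantum ? pkt_len : quantum;
    uint64_t delay;

    assert(rate != 0);
    delay = (uint64_t)plen * NSEC_PER_SEC / rate;

    /* Same 125 ms clamp as in the patch. */
    if (delay > 125 * NSEC_PER_MSEC)
        delay = 125 * NSEC_PER_MSEC;
    return delay;
}
```

With maxrate 8Mbit (1000000 bytes per second) and the default quantum of 3028 bytes, a 1514-byte packet gives a delay of 3028000 ns, which is in line with the pairs of segments spaced roughly 3.1 ms apart in the tcpdump output below.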

Tested:

tc qdisc add dev eth0 root est 1sec 4sec fq maxrate 8Mbit
netperf -H lpq84 -l 1000 &
sleep 10 ; tc -s qdisc show dev eth0
qdisc fq 8003: root refcnt 32 limit 10000p flow_limit 100p buckets 1024
 quantum 3028 initial_quantum 15140 maxrate 8000Kbit 
 Sent 16819357 bytes 11258 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 7831Kbit 653pps backlog 7570b 5p requeues 0 
  44 flows (43 inactive, 1 throttled), next packet delay 2977352 ns
  0 gc, 0 highprio, 5545 throttled

lpq83:~# tcpdump -p -i eth0 host lpq84 -c 12
09:02:52.079484 IP lpq83 > lpq84: . 1389536928:1389538376(1448) ack 3808678021 win 457 <nop,nop,timestamp 961812 572609068>
09:02:52.079499 IP lpq83 > lpq84: . 1448:2896(1448) ack 1 win 457 <nop,nop,timestamp 961812 572609068>
09:02:52.079906 IP lpq84 > lpq83: . ack 2896 win 16384 <nop,nop,timestamp 572609080 961812>
09:02:52.082568 IP lpq83 > lpq84: . 2896:4344(1448) ack 1 win 457 <nop,nop,timestamp 961815 572609071>
09:02:52.082581 IP lpq83 > lpq84: . 4344:5792(1448) ack 1 win 457 <nop,nop,timestamp 961815 572609071>
09:02:52.083017 IP lpq84 > lpq83: . ack 5792 win 16384 <nop,nop,timestamp 572609083 961815>
09:02:52.085678 IP lpq83 > lpq84: . 5792:7240(1448) ack 1 win 457 <nop,nop,timestamp 961818 572609074>
09:02:52.085693 IP lpq83 > lpq84: . 7240:8688(1448) ack 1 win 457 <nop,nop,timestamp 961818 572609074>
09:02:52.086117 IP lpq84 > lpq83: . ack 8688 win 16384 <nop,nop,timestamp 572609086 961818>
09:02:52.088792 IP lpq83 > lpq84: . 8688:10136(1448) ack 1 win 457 <nop,nop,timestamp 961821 572609077>
09:02:52.088806 IP lpq83 > lpq84: . 10136:11584(1448) ack 1 win 457 <nop,nop,timestamp 961821 572609077>
09:02:52.089217 IP lpq84 > lpq83: . ack 11584 win 16384 <nop,nop,timestamp 572609090 961821>

Reported-by: Steinar H. Gunderson <sesse@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/sched/sch_fq.c |   45 ++++++++++++++++++++++++-------------------
 1 file changed, 26 insertions(+), 19 deletions(-)

diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index fc6de56..a2fef8b 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -420,6 +420,7 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
 	struct fq_flow_head *head;
 	struct sk_buff *skb;
 	struct fq_flow *f;
+	u32 rate;
 
 	skb = fq_dequeue_head(sch, &q->internal);
 	if (skb)
@@ -468,28 +469,34 @@ begin:
 	f->time_next_packet = now;
 	f->credit -= qdisc_pkt_len(skb);
 
-	if (f->credit <= 0 &&
-	    q->rate_enable &&
-	    skb->sk && skb->sk->sk_state != TCP_TIME_WAIT) {
-		u32 rate = skb->sk->sk_pacing_rate ?: q->flow_default_rate;
+	if (f->credit > 0 || !q->rate_enable)
+		goto out;
 
-		rate = min(rate, q->flow_max_rate);
-		if (rate) {
-			u64 len = (u64)qdisc_pkt_len(skb) * NSEC_PER_SEC;
-
-			do_div(len, rate);
-			/* Since socket rate can change later,
-			 * clamp the delay to 125 ms.
-			 * TODO: maybe segment the too big skb, as in commit
-			 * e43ac79a4bc ("sch_tbf: segment too big GSO packets")
-			 */
-			if (unlikely(len > 125 * NSEC_PER_MSEC)) {
-				len = 125 * NSEC_PER_MSEC;
-				q->stat_pkts_too_long++;
-			}
+	if (skb->sk && skb->sk->sk_state != TCP_TIME_WAIT) {
+		rate = skb->sk->sk_pacing_rate ?: q->flow_default_rate;
 
-			f->time_next_packet = now + len;
+		rate = min(rate, q->flow_max_rate);
+	} else {
+		rate = q->flow_max_rate;
+		if (rate == ~0U)
+			goto out;
+	}
+	if (rate) {
+		u32 plen = max(qdisc_pkt_len(skb), q->quantum);
+		u64 len = (u64)plen * NSEC_PER_SEC;
+
+		do_div(len, rate);
+		/* Since socket rate can change later,
+		 * clamp the delay to 125 ms.
+		 * TODO: maybe segment the too big skb, as in commit
+		 * e43ac79a4bc ("sch_tbf: segment too big GSO packets")
+		 */
+		if (unlikely(len > 125 * NSEC_PER_MSEC)) {
+			len = 125 * NSEC_PER_MSEC;
+			q->stat_pkts_too_long++;
 		}
+
+		f->time_next_packet = now + len;
 	}
 out:
 	qdisc_bstats_update(sch, skb);

* Re: [PATCH v2] ll_temac: Reset dma descriptors indexes on ndo_open
From: David Miller @ 2013-10-01 16:32 UTC (permalink / raw)
  To: ricardo.ribalda; +Cc: joe, jg1.han, gregkh, wfp5p, netdev, linux-kernel
In-Reply-To: <1380608230-29183-1-git-send-email-ricardo.ribalda@gmail.com>

From: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>
Date: Tue,  1 Oct 2013 08:17:10 +0200

> The DMA descriptor indexes are only initialized in the probe function.
> 
> If a packet is in the buffer when temac_stop is called, the DMA
> descriptor indexes can be left in an incorrect state where no other
> packet can be sent.
> 
> So an interface could be left in an unusable state after ifdown/ifup.
> 
> This patch makes sure that the descriptor indexes are in a proper
> state when the device is opened.
> 
> Signed-off-by: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>

Applied and queued up for -stable, thanks.

* Re: Established sockets remain open after iface down or address lost
From: Alexey Kuznetsov @ 2013-10-01 16:33 UTC (permalink / raw)
  To: Chris Verges; +Cc: David Miller, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <20130926060433.GA9170@cverges-dev-lnx.sentient-energy.com>

Hello!

> P.S.  I apologize in advance if I missed this answer in the netdev archives.

FYI, googling e.g. "netdev tcp remove local address" instantly finds
all that you want to know. Namely, the thread with subject: Re: "TCP shutdown
behaviour when deleting local IP addresses"

* Re: [PATCH 1/3] net: ethernet: cpsw: Remove redundant of_match_ptr
From: David Miller @ 2013-10-01 16:34 UTC (permalink / raw)
  To: sachin.kamat; +Cc: netdev
In-Reply-To: <1380515114-2823-1-git-send-email-sachin.kamat@linaro.org>

From: Sachin Kamat <sachin.kamat@linaro.org>
Date: Mon, 30 Sep 2013 09:55:12 +0530

> The data structure of_match_ptr() protects is always compiled in.
> Hence of_match_ptr() is not needed.
> 
> Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>

Applied.

* Re: [PATCH 2/3] net: ethernet: cpsw-phy-sel: Remove redundant of_match_ptr
From: David Miller @ 2013-10-01 16:34 UTC (permalink / raw)
  To: sachin.kamat; +Cc: netdev
In-Reply-To: <1380515114-2823-2-git-send-email-sachin.kamat@linaro.org>

From: Sachin Kamat <sachin.kamat@linaro.org>
Date: Mon, 30 Sep 2013 09:55:13 +0530

> The data structure of_match_ptr() protects is always compiled in.
> Hence of_match_ptr() is not needed.
> 
> Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>

Applied.

* Re: [PATCH 3/3] net: can: c_can_platform: Remove redundant of_match_ptr
From: David Miller @ 2013-10-01 16:34 UTC (permalink / raw)
  To: sachin.kamat; +Cc: netdev, mkl, linux-can
In-Reply-To: <1380515114-2823-3-git-send-email-sachin.kamat@linaro.org>

From: Sachin Kamat <sachin.kamat@linaro.org>
Date: Mon, 30 Sep 2013 09:55:14 +0530

> The data structure of_match_ptr() protects is always compiled in.
> Hence of_match_ptr() is not needed.
> 
> Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>

Applied.

* Re: [PATCH net 1/1] qlcnic: Fix SR-IOV configuration
From: David Miller @ 2013-10-01 16:35 UTC (permalink / raw)
  To: manish.chopra; +Cc: netdev, rajesh.borundia, Dept-HSGLinuxNICDev
In-Reply-To: <1380608628-8391-1-git-send-email-manish.chopra@qlogic.com>

From: Manish Chopra <manish.chopra@qlogic.com>
Date: Tue, 1 Oct 2013 02:23:48 -0400

> o Interface needs to be brought down and up while configuring SR-IOV.
>   Protect interface up/down using rtnl_lock()/rtnl_unlock()
> 
> Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>

Applied, thanks.

* Re: [PATCH v2] l2tp: add support for IPv4-mapped IPv6 addresses
From: David Miller @ 2013-10-01 16:38 UTC (permalink / raw)
  To: f.cachereul; +Cc: jchapman, netdev
In-Reply-To: <524A7FB7.40304@init-sys.com>

From: François Cachereul <f.cachereul@init-sys.com>
Date: Tue, 01 Oct 2013 09:54:31 +0200

> @@ -1620,6 +1621,8 @@ int l2tp_tunnel_create(struct net *net, int fd, int version, u32 tunnel_id, u32
>  	int err;
>  	struct socket *sock = NULL;
>  	struct sock *sk = NULL;
> +#if IS_ENABLED(CONFIG_IPV6)
> +#endif
>  	struct l2tp_net *pn;
>  	enum l2tp_encap_type encap = L2TP_ENCAPTYPE_UDP;
>  

Please get rid of this.

* Re: [PATCH 0/6] Netfilter/IPVS fixes for net
From: David Miller @ 2013-10-01 16:39 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <1380618511-6109-1-git-send-email-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Tue,  1 Oct 2013 11:08:25 +0200

> The following patchset contains Netfilter/IPVS fixes for your net
> tree, they are:
> 
> * Fix BUG_ON splat due to malformed TCP packets seen by synproxy, from
>   Patrick McHardy.
> 
> * Fix possible weight overflow in lblc and lblcr schedulers due to
>   32-bit arithmetic, from Simon Kirby.
> 
> * Fix possible memory access race in the lblc and lblcr schedulers,
>   introduced when it was converted to use RCU, two patches from
>   Julian Anastasov.
> 
> * Fix hard dependency on CPU 0 when reading per-cpu stats in the
>   rate estimator, from Julian Anastasov.
> 
> * Fix race that may lead to object use after release, when invoking
>   ipvsadm -C && ipvsadm -R, introduced when adding RCU, from Julian
>   Anastasov.
> 
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git master

Pulled, thanks Pablo.

* Re: [PATCH net v2 1/4] ip_tunnel: Fix a memory corruption in ip_tunnel_xmit
From: David Miller @ 2013-10-01 16:43 UTC (permalink / raw)
  To: steffen.klassert; +Cc: pshelar, netdev
In-Reply-To: <20131001093359.GG7660@secunet.com>

From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Tue, 1 Oct 2013 11:33:59 +0200

> We might extend the used area of a skb beyond the total
> headroom when we install the ipip header. Fix this by
> calling skb_cow_head() unconditionally.
> 
> Bug was introduced with commit c544193214
> ("GRE: Refactor GRE tunneling code.")
> 
> Cc: Pravin Shelar <pshelar@nicira.com>
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Applied and queued up for -stable.

* Re: [PATCH net v2 2/4] ip_tunnel: Add fallback tunnels to the hash lists
From: David Miller @ 2013-10-01 16:43 UTC (permalink / raw)
  To: steffen.klassert; +Cc: pshelar, netdev
In-Reply-To: <20131001093448.GH7660@secunet.com>

From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Tue, 1 Oct 2013 11:34:48 +0200

> Currently we can not update the tunnel parameters of
> the fallback tunnels because we don't find them in the
> hash lists. Fix this by adding them on initialization.
> 
> Bug was introduced with commit c544193214
> ("GRE: Refactor GRE tunneling code.")
> 
> Cc: Pravin Shelar <pshelar@nicira.com>
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Applied and queued up for -stable.

* Re: [PATCH net 3/4] ip_tunnel_core: Change __skb_push back to skb_push
From: David Miller @ 2013-10-01 16:43 UTC (permalink / raw)
  To: steffen.klassert; +Cc: pshelar, netdev
In-Reply-To: <20131001093551.GI7660@secunet.com>

From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Tue, 1 Oct 2013 11:35:51 +0200

> Git commit 0e6fbc5b ("ip_tunnels: extend iptunnel_xmit()")
> moved the IP header installation to iptunnel_xmit() and
> changed skb_push() to __skb_push(). This makes possible
> bugs hard to track down, so change it back to skb_push().
> 
> Cc: Pravin Shelar <pshelar@nicira.com>
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Applied and queued up for -stable.
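
For readers unfamiliar with the two helpers: as far as I understand them, skb_push() checks that the new data pointer does not end up in front of the buffer head and panics if it does, while __skb_push() skips that check. The toy userspace sketch below illustrates the difference. The struct and function names are hypothetical, and the checked variant returns NULL where the kernel helper would panic.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an skb: data sits somewhere after head, and the gap
 * between the two is the headroom. All names here are hypothetical.
 */
struct toy_buf {
    unsigned char *head;   /* start of the allocation */
    unsigned char *data;   /* start of the current payload */
    size_t len;            /* payload length */
};

/* Unchecked variant: trusts the caller to have reserved enough headroom.
 * Pushing too much silently moves data in front of head.
 */
unsigned char *toy_push_unchecked(struct toy_buf *b, size_t n)
{
    assert(b != NULL);
    b->data -= n;
    b->len += n;
    return b->data;
}

/* Checked variant: refuses to move data in front of head, so a missing
 * headroom reservation is caught at the point where it happens.
 */
unsigned char *toy_push_checked(struct toy_buf *b, size_t n)
{
    assert(b != NULL);
    if ((size_t)(b->data - b->head) < n)
        return NULL;
    return toy_push_unchecked(b, n);
}
```

With 20 bytes of headroom, a checked push of 20 bytes succeeds and a further push of 1 byte is rejected, whereas the unchecked variant would quietly write in front of the buffer.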

* Re: [PATCH net 4/4] ip_tunnel: Remove double unregister of the fallback device
From: David Miller @ 2013-10-01 16:43 UTC (permalink / raw)
  To: steffen.klassert; +Cc: pshelar, nicolas.dichtel, netdev
In-Reply-To: <20131001093737.GJ7660@secunet.com>

From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Tue, 1 Oct 2013 11:37:37 +0200

> When queueing the netdevices for removal, we queue the
> fallback device twice in ip_tunnel_destroy(). The first
> time when we queue all netdevices in the namespace and
> then again explicitly. Fix this by removing the explicit
> queueing of the fallback device.
> 
> Bug was introduced when network namespace support was added
> with commit 6c742e714d8 ("ipip: add x-netns support").
> 
> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Applied and queued up for -stable.

Thanks!

* Re: [PATCH next 0/6] be2net: patch set
From: David Miller @ 2013-10-01 16:46 UTC (permalink / raw)
  To: sathya.perla; +Cc: netdev
In-Reply-To: <1380623401-15630-1-git-send-email-sathya.perla@emulex.com>

From: Sathya Perla <sathya.perla@emulex.com>
Date: Tue, 1 Oct 2013 15:59:55 +0530

> Pls apply the following patches to the net-next tree. Thanks.

All applied, thanks.

* Re: [PATCH] ath6kl: fix compilation warning in ath6kl_htc_pipe_conn_service
From: Kalle Valo @ 2013-10-01 16:46 UTC (permalink / raw)
  To: Vladimir Murzin; +Cc: netdev, linville, linux-wireless, ath6kl-devel
In-Reply-To: <1380462871-2649-1-git-send-email-murzin.v@gmail.com>

Vladimir Murzin <murzin.v@gmail.com> writes:

> Fix the warning
>
> drivers/net/wireless/ath/ath6kl/htc_pipe.c: In function
> 'ath6kl_htc_pipe_conn_service':
> drivers/net/wireless/ath/ath6kl/htc_pipe.c:1293:26: warning: integer overflow
> in expression [-Woverflow]
>
> by giving a hint to compiler about unsigned nature of
> HTC_CONN_FLGS_SET_RECV_ALLOC_MASK
>
> Signed-off-by: Vladimir Murzin <murzin.v@gmail.com>

Thanks, applied to ath.git.

-- 
Kalle Valo

* Big performance loss from 3.4.63 to 3.10.13 when routing ipv4
From: Wolfgang Walter @ 2013-10-01 16:39 UTC (permalink / raw)
  To: netdev

Hello,

I tried to upgrade one of our routers to 3.10.13 from 3.4.63 and I see a 
dramatic performance loss. I tried 3.11.2 and it is still there.

*** Symptoms:

All network traffic over the router becomes slow and sluggish. If one pings
the router there is packet loss. After about 2 minutes the traffic completely
stalls for about 1 minute. Then it works again as in the beginning, only to
stall again, and so on.

This happens even with rather moderate traffic. While it is still routing, the
CPU utilization is higher than with 3.4.63, but only moderately.

When it stalls, no network traffic seems possible (except to loopback). If one
tries to ping any target from the router (even one on an interface with no
traffic at all), one gets:

	ping: sendmsg: No buffer space available

As the router has about 15G of free memory, this probably means that an
internal table is full.

The CPU utilization is low during that period.


I can trigger it easily when I copy about 50 big files via scp over 50
different IPsec tunnels:

* boot router

* wait until all ipsec tunnels are established

* start copying:

H <--1G--> Router <---1G--->.......<-- >=100MBit --> Xn <---100Mbit----> Rn

So there is an IPsec tunnel between Router and Xn for all n=1 to 50. I copy
files from Rn to H. I start the copy from H, so the TCP connections get
established from H to Rn.

The same test works just fine with 3.4.63. All cores are used but none
reaches its limit. The router neither drops pings nor has problems pinging
other targets.

I tested 3.8.13. It seems not to have this issue if I increase

	net.ipv4.inet_peer_threshold

(I tried 6566400, and didn't try smaller values besides the default one).

If I use the default one, 3.8.13 behaves badly.

But 3.8.13 seems to have other issues. Basically: routing stalls later on, and
for much longer (up to 6 minutes or so).



*** Environment:

It's an 8-core machine (with AES-NI). It establishes a lot of IPsec tunnels. It
uses stateful packet filtering (but no NAT). The network cards are Intel
cards (drivers: igb and ixgbe). No IPv6. No Ethernet flow control enabled (but
that doesn't matter). No traffic shaping (that is, no tc). igb/ixgbe interfaces:
nothing modified with ethtool except flow control (autoneg off tx off rx off).


Any idea?


Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts

* Re: [net-next 0/9][pull request] Intel Wired LAN Driver Updates
From: David Miller @ 2013-10-01 16:51 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, sassmann
In-Reply-To: <1380627236-3190-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue,  1 Oct 2013 04:33:47 -0700

> This series contains updates to ixgbevf, ixgbe and igb.
> 
> Don provides 3 patches for ixgbevf where he cleans up a redundant
> read mailbox failure check, adds a new function to wait for receive
> queues to be disabled before disabling NAPI, and moves the API
> negotiation so that it occurs in the reset path.  This will allow
> the PF to be informed of the API version earlier.
> 
> Jacob provides an ixgbevf and an ixgbe patch.  His ixgbevf patch removes
> the use of hw_dbg when the ixgbe_get_regs function is called in ethtool.
> The ixgbe patch renames the LL_EXTENDED_STATS and some of the functions
> required to implement busy polling in order to remove the marketing
> "low latency" blurb which hides what the code actually does.
> 
> Leonardo provides an ixgbe patch to add support for DCB registers dump
> using ethtool for 82599 and x540 ethernet controllers.
> 
> I (Jeff) provide an ixgbe patch to clean up whitespace issues seen in a
> code review.
> 
> Todd provides an igb patch to add support for i354 in the ethtool offline
> tests.
> 
> Laura provides an igb patch to add the ethtool callbacks necessary to
> configure the number of RSS queues.

Pulled, thanks Jeff.

Please address Ben Hutchings's concerns about the state of the device
after a failure to configure the number of channels with follow-on
changes, if necessary.

Thanks.

* Re: [PATCH 2/2] net, mellanox mlx4 Fix compile warnings [v4]
From: David Miller @ 2013-10-01 16:57 UTC (permalink / raw)
  To: prarit; +Cc: jackm, netdev, dledford, amirv, ogerlitz
In-Reply-To: <524AE968.608@redhat.com>

From: Prarit Bhargava <prarit@redhat.com>
Date: Tue, 01 Oct 2013 11:25:28 -0400

> 
> 
> On 10/01/2013 11:17 AM, Jack Morgenstein wrote:
>> Since we are in the middle of submitting patches which touch the file
>> "resource_tracker.c", I would really like to hold off on these warning
>> fixes for a bit, and I'll handle the changes for all the functions at
>> once to conform (correctly!) to the format suggested by Dave Miller.
>> 
>>  -Jack
> 
> Hey Jack, np.  Thanks for looking.

Please repost both patches when you've sorted this out, thanks!

* Re: [PATCH net 1/2] sit: allow to use rtnl ops on fb tunnel
From: David Miller @ 2013-10-01 16:59 UTC (permalink / raw)
  To: nicolas.dichtel; +Cc: netdev, steffen.klassert, pshelar
In-Reply-To: <1380643500-5018-1-git-send-email-nicolas.dichtel@6wind.com>

From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Tue,  1 Oct 2013 18:04:59 +0200

> rtnl ops were introduced by ba3e3f50a0e5 ("sit: advertise tunnel param via
> rtnl"), but I forgot to assign rtnl ops to fb tunnels.
> 
> Now that it is done, we must remove the explicit call to
> unregister_netdevice_queue(), because the fallback tunnel is added to the queue
> in sit_destroy_tunnels() when checking rtnl_link_ops of all netdevices (this
> is valid since commit 5e6700b3bf98 ("sit: add support of x-netns")).
> 
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>

Applied and queued up for -stable.

But I imagine since the x-netns changes aren't in various -stable
branches this will need to be adjusted a bit?

* Re: [PATCH net 2/2] ip6tnl: allow to use rtnl ops on fb tunnel
From: David Miller @ 2013-10-01 16:59 UTC (permalink / raw)
  To: nicolas.dichtel; +Cc: netdev, steffen.klassert, pshelar
In-Reply-To: <1380643500-5018-2-git-send-email-nicolas.dichtel@6wind.com>

From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Tue,  1 Oct 2013 18:05:00 +0200

> rtnl ops were introduced by c075b13098b3 ("ip6tnl: advertise tunnel param via
> rtnl"), but I forgot to assign rtnl ops to fb tunnels.
> 
> Now that it is done, we must remove the explicit call to
> unregister_netdevice_queue(), because the fallback tunnel is added to the queue
> in ip6_tnl_destroy_tunnels() when checking rtnl_link_ops of all netdevices (this
> is valid since commit 0bd8762824e7 ("ip6tnl: add x-netns support")).
> 
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>

Applied and queued up for -stable, and I have similar concerns about
the backport issues.

Thanks.

* Re: [PATCH] pkt_sched: fq: rate limiting improvements
From: David Miller @ 2013-10-01 17:01 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, sesse
In-Reply-To: <1380643816.19002.29.camel@edumazet-glaptop.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 01 Oct 2013 09:10:16 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> FQ rate limiting suffers from two problems, reported
> by Steinar:
> 
> 1) FQ enforces a delay when flow quantum is exhausted in order
> to reduce cpu overhead. But if packets are small, current
> delay computation is slightly wrong, and observed rates can
> be too high.
> 
> Steinar had this problem because he disabled TSO and GSO,
> and default FQ quantum is 2*1514.
> 
> (Of course, I hope the recent TSO auto sizing changes will remove
> the need to disable TSO in the first place)
> 
> 2) maxrate was not used for forwarded flows (skbs not attached
> to a socket)
> 
> Tested:
> 
> tc qdisc add dev eth0 root est 1sec 4sec fq maxrate 8Mbit
> netperf -H lpq84 -l 1000 &
> sleep 10 ; tc -s qdisc show dev eth0
> qdisc fq 8003: root refcnt 32 limit 10000p flow_limit 100p buckets 1024
>  quantum 3028 initial_quantum 15140 maxrate 8000Kbit 
>  Sent 16819357 bytes 11258 pkt (dropped 0, overlimits 0 requeues 0) 
>  rate 7831Kbit 653pps backlog 7570b 5p requeues 0 
>   44 flows (43 inactive, 1 throttled), next packet delay 2977352 ns
>   0 gc, 0 highprio, 5545 throttled
> 
> lpq83:~# tcpdump -p -i eth0 host lpq84 -c 12
> 09:02:52.079484 IP lpq83 > lpq84: . 1389536928:1389538376(1448) ack 3808678021 win 457 <nop,nop,timestamp 961812 572609068>
> 09:02:52.079499 IP lpq83 > lpq84: . 1448:2896(1448) ack 1 win 457 <nop,nop,timestamp 961812 572609068>
> 09:02:52.079906 IP lpq84 > lpq83: . ack 2896 win 16384 <nop,nop,timestamp 572609080 961812>
> 09:02:52.082568 IP lpq83 > lpq84: . 2896:4344(1448) ack 1 win 457 <nop,nop,timestamp 961815 572609071>
> 09:02:52.082581 IP lpq83 > lpq84: . 4344:5792(1448) ack 1 win 457 <nop,nop,timestamp 961815 572609071>
> 09:02:52.083017 IP lpq84 > lpq83: . ack 5792 win 16384 <nop,nop,timestamp 572609083 961815>
> 09:02:52.085678 IP lpq83 > lpq84: . 5792:7240(1448) ack 1 win 457 <nop,nop,timestamp 961818 572609074>
> 09:02:52.085693 IP lpq83 > lpq84: . 7240:8688(1448) ack 1 win 457 <nop,nop,timestamp 961818 572609074>
> 09:02:52.086117 IP lpq84 > lpq83: . ack 8688 win 16384 <nop,nop,timestamp 572609086 961818>
> 09:02:52.088792 IP lpq83 > lpq84: . 8688:10136(1448) ack 1 win 457 <nop,nop,timestamp 961821 572609077>
> 09:02:52.088806 IP lpq83 > lpq84: . 10136:11584(1448) ack 1 win 457 <nop,nop,timestamp 961821 572609077>
> 09:02:52.089217 IP lpq84 > lpq83: . ack 11584 win 16384 <nop,nop,timestamp 572609090 961821>
> 
> Reported-by: Steinar H. Gunderson <sesse@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks a lot Eric.

* Re: Established sockets remain open after iface down or address lost
From: Rick Jones @ 2013-10-01 17:06 UTC (permalink / raw)
  To: Chris Verges
  Cc: Eric Dumazet, davem, kuznet, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <20131001160825.GA8784@cverges-dev-lnx.sentient-energy.com>

On 10/01/2013 09:08 AM, Chris Verges wrote:
> On Tue, Oct 01, 2013 at 08:44:17AM -0700, Rick Jones wrote:
>> The protocol between client and server needs to have an
>> application-layer "keepalive" mechanism added, and then the server
>> will be able to detect a dangling connection without need of any
>> further kernel modifications.
>>
>> If that is not possible, the server can/should set SO_KEEPALIVE and
>> perhaps tweak the TCP keepalive settings.  Not as good (IMO) as an
>> application-layer keepalive because it only shows that the connection
>> is good as far as TCP, but I suppose it could do in a pinch.
>
> I agree that some form of keepalives would solve the problem where
> blocking reads need to be interrupted.  However, this creates traffic
> across the link -- in proportion to how often keepalives are sent.
>
> The underlying physical layer is such that we pay for all traffic going
> across it -- including any keepalives at either the application or TCP
> layers.  Paying for this keepalive traffic when the link is operational
> is not desired.

Pick your poison :)
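
For reference, the SO_KEEPALIVE route from the quoted text might look like the sketch below. It is only an illustration: the helper name is made up, the interval values are left to the caller, and TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux-specific socket options.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable TCP keepalive on fd (hypothetical helper).
 * idle_s:  seconds of silence before the first probe is sent
 * intvl_s: seconds between unanswered probes
 * cnt:     unanswered probes before the connection is dropped
 * Returns 0 on success, -1 with errno set by setsockopt().
 */
int enable_tcp_keepalive(int fd, int idle_s, int intvl_s, int cnt)
{
    int on = 1;

    assert(fd >= 0);
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof(intvl_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0)
        return -1;
    return 0;
}
```

A long idle time keeps probe traffic low on a metered link, at the cost of detecting a dead peer later.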

If the server application is in a "I know there should be (more) data 
arriving on this connection" mode, then you can simply have an 
application-layer timeout in the server code that does not rely on 
active probing the connection.
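
A minimal sketch of such a passive timeout, assuming the server simply gives up when nothing arrives within a caller-chosen interval. The function name is hypothetical, and nothing is sent on the wire.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Wait up to timeout_ms for data on fd, then receive it.
 * Returns the number of bytes received, 0 if the peer closed the
 * connection, or -1 with errno set (ETIMEDOUT when nothing arrived
 * in time). No probe is ever sent, so an idle link costs nothing.
 */
ssize_t recv_with_idle_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready;

    assert(timeout_ms >= 0);
    do {
        ready = poll(&pfd, 1, timeout_ms);
    } while (ready < 0 && errno == EINTR);  /* simplification: restarts the full timeout */

    if (ready < 0)
        return -1;
    if (ready == 0) {
        errno = ETIMEDOUT;
        return -1;
    }
    return recv(fd, buf, len, 0);
}
```

On ETIMEDOUT the server can close the socket and release whatever state it was holding for that client.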

Otherwise, even if you do get some sort of "nuke connections using a 
source IP matching an interface we just brought down" option into the 
kernel, you will still have the small matter of something else between 
the client and server going down that neither can see directly.

rick jones

* Re: Established sockets remain open after iface down or address lost
From: Chris Verges @ 2013-10-01 17:07 UTC (permalink / raw)
  To: Alexey Kuznetsov; +Cc: David Miller, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <CAMrnHh6Lr9q27yiSmJ-EbJMTeUVyf9650vK+6QmbfTVfz_omxg@mail.gmail.com>

On Tue, Oct 01, 2013 at 08:33:04PM +0400, Alexey Kuznetsov wrote:
> > P.S.  I apologize in advance if I missed this answer in the netdev
> > archives.
> 
> FYI, googling e.g. "netdev tcp remove local address" instantly finds
> all that you want to know. Namely, the thread with subject: Re: "TCP shutdown
> behaviour when deleting local IP addresses"

Outstanding.  Thank you!  I really appreciate the pointer to bring me up
to speed on the history.  It sounds like Mikael and I share a similar
desire and rationale for why such a thing would be useful.

Your comment regarding the tcp hash table scanning is what I was
planning to code up, though it sounds like some amount of effort was
made in this attempt and abandoned.  Would you happen to know where any
out-of-kernel code related to this might be parked if one wanted to
continue this effort?

I do agree with Mikael that offloading this from the kernel to some kind
of connection manager would be ideal.  I'll ponder this some more....

Again, many thanks for the help.

Chris

* [PATCH] tcp: sndbuf autotuning improvements
From: Eric Dumazet @ 2013-10-01 17:23 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Neal Cardwell, Yuchung Cheng, Maciej Żenczykowski

From: Eric Dumazet <edumazet@google.com>

tcp_fixup_sndbuf() is underestimating initial send buffer requirements.

It was not noticed because big GSO packets were escaping the limitation,
but with smaller TSO packets (or TSO/GSO/SG off), application hits
sk_sndbuf before having a chance to fill enough packets in socket write
queue.

- initial cwnd can be bigger than 10 for specific routes

- SKB_TRUESIZE() is a bit under real needs in some cases,
  because of power-of-two rounding in kmalloc()

- Fast Recovery (RFC 5681 3.2) : Cubic needs 70% factor

- Extra cushion (application might react slowly to POLLOUT)

tcp_v4_conn_req_fastopen() needs to call tcp_init_metrics() before
calling tcp_init_buffer_space()

Then we realize tcp_new_space() should call tcp_fixup_sndbuf()
instead of duplicating this stuff.

Rename tcp_fixup_sndbuf() to tcp_sndbuf_expand() to be more
descriptive.
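
As a rough illustration of the new estimate, here is a userspace sketch of the arithmetic in tcp_sndbuf_expand(). It is not the kernel code: MAX_TCP_HEADER and the aligned sizes of struct skb_shared_info and struct sk_buff depend on kernel version and configuration, so they are plain inputs here, and the struct and function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Inputs to the estimate. The last three fields stand in for
 * MAX_TCP_HEADER and the aligned struct sizes (assumed values).
 */
struct sndbuf_inputs {
    uint32_t mss_clamp;
    uint32_t mss_cache;
    uint32_t snd_cwnd;
    uint32_t reordering;
    uint32_t init_cwnd;       /* TCP_INIT_CWND */
    uint32_t max_tcp_header;
    uint32_t shinfo_size;
    uint32_t skb_size;
};

static uint32_t max_u32(uint32_t a, uint32_t b)
{
    return a > b ? a : b;
}

/* Smallest power of two >= x, for 1 <= x <= 2^31. */
uint32_t roundup_pow2(uint32_t x)
{
    uint32_t p = 1;

    assert(x >= 1 && x <= (1u << 31));
    while (p < x)
        p <<= 1;
    return p;
}

/* Estimated send buffer in bytes, before capping at tcp_wmem[2]. */
uint64_t sndbuf_estimate(const struct sndbuf_inputs *in)
{
    uint32_t per_mss, nr_segs;

    /* Worst case: one skb per MSS, head rounded up by kmalloc(). */
    per_mss = max_u32(in->mss_clamp, in->mss_cache) +
              in->max_tcp_header + in->shinfo_size;
    per_mss = roundup_pow2(per_mss) + in->skb_size;

    nr_segs = max_u32(in->init_cwnd, in->snd_cwnd);
    nr_segs = max_u32(nr_segs, in->reordering + 1);

    /* Factor 2: fast recovery plus extra cushion. */
    return 2ULL * nr_segs * per_mss;
}
```

With illustrative inputs (MSS 1448, cwnd 10, reordering 3, and assumed sizes of 320, 320 and 256 bytes for the three overheads), this gives 2 * 10 * (4096 + 256) = 87040 bytes. Under the same assumptions the old formula, SKB_TRUESIZE(mss_clamp + MAX_TCP_HEADER) * TCP_INIT_CWND, gives 23440 bytes.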

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
---
 net/ipv4/tcp_input.c |   38 +++++++++++++++++++++++++-------------
 net/ipv4/tcp_ipv4.c  |    2 +-
 2 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 66aa816..cd65674 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -267,11 +267,31 @@ static bool TCP_ECN_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr
  * 1. Tuning sk->sk_sndbuf, when connection enters established state.
  */
 
-static void tcp_fixup_sndbuf(struct sock *sk)
+static void tcp_sndbuf_expand(struct sock *sk)
 {
-	int sndmem = SKB_TRUESIZE(tcp_sk(sk)->rx_opt.mss_clamp + MAX_TCP_HEADER);
+	const struct tcp_sock *tp = tcp_sk(sk);
+	int sndmem, per_mss;
+	u32 nr_segs;
+
+	/* Worst case is non GSO/TSO : each frame consumes one skb
+	 * and skb->head is kmalloced using power of two area of memory
+	 */
+	per_mss = max_t(u32, tp->rx_opt.mss_clamp, tp->mss_cache) +
+		  MAX_TCP_HEADER +
+		  SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
+	per_mss = roundup_pow_of_two(per_mss) +
+		  SKB_DATA_ALIGN(sizeof(struct sk_buff));
+
+	nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
+	nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
+
+	/* Fast Recovery (RFC 5681 3.2) :
+	 * Cubic needs 1.7 factor, rounded to 2 to include
+	 * extra cushion (application might react slowly to POLLOUT)
+	 */
+	sndmem = 2 * nr_segs * per_mss;
 
-	sndmem *= TCP_INIT_CWND;
 	if (sk->sk_sndbuf < sndmem)
 		sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
 }
@@ -376,7 +396,7 @@ void tcp_init_buffer_space(struct sock *sk)
 	if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK))
 		tcp_fixup_rcvbuf(sk);
 	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
-		tcp_fixup_sndbuf(sk);
+		tcp_sndbuf_expand(sk);
 
 	tp->rcvq_space.space = tp->rcv_wnd;
 	tp->rcvq_space.time = tcp_time_stamp;
@@ -4723,15 +4743,7 @@ static void tcp_new_space(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 
 	if (tcp_should_expand_sndbuf(sk)) {
-		int sndmem = SKB_TRUESIZE(max_t(u32,
-						tp->rx_opt.mss_clamp,
-						tp->mss_cache) +
-					  MAX_TCP_HEADER);
-		int demanded = max_t(unsigned int, tp->snd_cwnd,
-				     tp->reordering + 1);
-		sndmem *= 2 * demanded;
-		if (sndmem > sk->sk_sndbuf)
-			sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
+		tcp_sndbuf_expand(sk);
 		tp->snd_cwnd_stamp = tcp_time_stamp;
 	}
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index b14266b..5d6b1a6 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1410,8 +1410,8 @@ static int tcp_v4_conn_req_fastopen(struct sock *sk,
 	inet_csk(child)->icsk_af_ops->rebuild_header(child);
 	tcp_init_congestion_control(child);
 	tcp_mtup_init(child);
-	tcp_init_buffer_space(child);
 	tcp_init_metrics(child);
+	tcp_init_buffer_space(child);
 
 	/* Queue the data carried in the SYN packet. We need to first
 	 * bump skb's refcnt because the caller will attempt to free it.

