Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 2/2] ucc_geth: Add 16 bytes to max TX frame for VLANs
From: David Miller @ 2012-05-03  0:12 UTC (permalink / raw)
  To: Joakim.Tjernlund; +Cc: netdev
In-Reply-To: <1335775015-14718-2-git-send-email-Joakim.Tjernlund@transmode.se>

From: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Date: Mon, 30 Apr 2012 10:36:55 +0200

> Creating a VLAN interface on top of ucc_geth adds 4 bytes
> to the frame and the HW controller is not prepared to
> TX a frame bigger than 1518 bytes which is 4 bytes too
> small for a full VLAN frame. Add 16 bytes which will handle
> the a simple VLAN and leaves 12 bytes for future expansion.
> 
> Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>

Applied.

^ permalink raw reply

* Re: [PATCH 3/3] usbnet: fix skb traversing races during unlink(v1)
From: David Miller @ 2012-05-03  0:12 UTC (permalink / raw)
  To: tom.leiming-Re5JQEeQqe8AvxtiuMwx3w
  Cc: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-usb-u79uwXL29TY76Z2rM5mHXA,
	huajun.li.lee-Re5JQEeQqe8AvxtiuMwx3w, oneukum-l3A5Bk7waGM,
	stable-DgEjT+Ai2ygdnm+yROfE0A
In-Reply-To: <1335775864-4873-4-git-send-email-tom.leiming-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

From: Ming Lei <tom.leiming-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Mon, 30 Apr 2012 16:51:04 +0800

> This patch depends on the usb_block_urb/usb_unblock_urb patch from
> Oliver.

Therefore I cannot apply it to any of my trees.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/2] net: ucc_geth, increase no. of HW RX descriptors
From: David Miller @ 2012-05-03  0:10 UTC (permalink / raw)
  To: Joakim.Tjernlund; +Cc: netdev
In-Reply-To: <1335775015-14718-1-git-send-email-Joakim.Tjernlund@transmode.se>

From: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Date: Mon, 30 Apr 2012 10:36:54 +0200

> In a busy network we see ucc_geth is dropping RX pkgs every now
> and then. Increase the RX queues HW descriptors from
> 16 to 32 to deal with this.
> 
> Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] bnx2x: fix handling single MSIX mode for 57710/57711
From: David Miller @ 2012-05-03  0:04 UTC (permalink / raw)
  To: dmitry; +Cc: netdev, eilong
In-Reply-To: <1335957393-21804-1-git-send-email-dmitry@broadcom.com>

From: "Dmitry Kravkov" <dmitry@broadcom.com>
Date: Wed, 2 May 2012 14:16:33 +0300

> commit 30a5de7723a8a4211be02e94236e9167a424fd07 added
> ability to use single MSI-X vector, but lack proper
> handling for 57710/57711 HW
> 
> Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
> Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Applied, thanks.

^ permalink raw reply

* [PATCH] net: davinci_emac: Add pre_open, post_stop platform callbacks
From: Mark A. Greer @ 2012-05-02 23:47 UTC (permalink / raw)
  To: netdev; +Cc: linux-omap, linux-arm-kernel

From: "Mark A. Greer" <mgreer@animalcreek.com>

The davinci EMAC driver has been incorporated into the am35x
family of SoC's which is OMAP-based.  The incorporation is
incomplete in that the EMAC cannot unblock the [ARM] core if
its blocked on a 'wfi' instruction.  This is an issue with
the cpu_idle code because it has the core execute a 'wfi'
instruction.

To work around this issue, add platform data callbacks which
are called at the beginning of the open routine and at the
end of the stop routine of the davinci_emac driver.  The
callbacks allow the platform code to issue disable_hlt() and
enable_hlt() calls appropriately.  Calling disable_hlt()
prevents cpu_idle from issuing the 'wfi' instruction.

It is not sufficient to simply call disable_hlt() when
there is an EMAC present because it could be present but
not actually used in which case, we do want the 'wfi' to
be executed.

Signed-off-by: Mark A. Greer <mgreer@animalcreek.com>
---

I know adding platform_data callbacks are frowned upon
and I really don't want to add them but I don't see
any other way to accomplish what needs to be accomplished.

Any suggestions?

Thanks, Mark.

 drivers/net/ethernet/ti/davinci_emac.c |   14 ++++++++++++++
 include/linux/davinci_emac.h           |    2 ++
 2 files changed, 16 insertions(+)

diff --git a/drivers/net/ethernet/ti/davinci_emac.c b/drivers/net/ethernet/ti/davinci_emac.c
index 174a334..141a888 100644
--- a/drivers/net/ethernet/ti/davinci_emac.c
+++ b/drivers/net/ethernet/ti/davinci_emac.c
@@ -344,6 +344,8 @@ struct emac_priv {
 	/*platform specific members*/
 	void (*int_enable) (void);
 	void (*int_disable) (void);
+	void (*pre_open) (struct net_device *ndev);
+	void (*post_stop) (struct net_device *ndev);
 };

 /* clock frequency for EMAC */
@@ -1534,6 +1536,9 @@ static int emac_dev_open(struct net_device *ndev)
 	int k = 0;
 	struct emac_priv *priv = netdev_priv(ndev);

+	if (priv->pre_open)
+		priv->pre_open(ndev);
+
 	netif_carrier_off(ndev);
 	for (cnt = 0; cnt < ETH_ALEN; cnt++)
 		ndev->dev_addr[cnt] = priv->mac_addr[cnt];
@@ -1644,6 +1649,10 @@ rollback:
 		res = platform_get_resource(priv->pdev, IORESOURCE_IRQ, k-1);
 		m = res->end;
 	}
+
+	if (priv->post_stop)
+		priv->post_stop(ndev);
+
 	return -EBUSY;
 }

@@ -1686,6 +1695,9 @@ static int emac_dev_stop(struct net_device *ndev)
 	if (netif_msg_drv(priv))
 		dev_notice(emac_dev, "DaVinci EMAC: %s stopped\n", ndev->name);

+	if (priv->post_stop)
+		priv->post_stop(ndev);
+
 	return 0;
 }

@@ -1817,6 +1829,8 @@ static int __devinit davinci_emac_probe(struct platform_device *pdev)
 	priv->version = pdata->version;
 	priv->int_enable = pdata->interrupt_enable;
 	priv->int_disable = pdata->interrupt_disable;
+	priv->pre_open = pdata->pre_open;
+	priv->post_stop = pdata->post_stop;

 	priv->coal_intvl = 0;
 	priv->bus_freq_mhz = (u32)(emac_bus_frequency / 1000000);
diff --git a/include/linux/davinci_emac.h b/include/linux/davinci_emac.h
index 5428885..b61e6de 100644
--- a/include/linux/davinci_emac.h
+++ b/include/linux/davinci_emac.h
@@ -39,6 +39,8 @@ struct emac_platform_data {
 	bool no_bd_ram;
 	void (*interrupt_enable) (void);
 	void (*interrupt_disable) (void);
+	void (*pre_open) (struct net_device *ndev);
+	void (*post_stop) (struct net_device *ndev);
 };

 enum {
-- 
1.7.9.4

^ permalink raw reply related

* Re: [RFC][PATCH] net: ipv4: ipconfig: decrease CONF_CARRIER_TIMEOUT
From: David Miller @ 2012-05-02 23:43 UTC (permalink / raw)
  To: c.hemp; +Cc: kuznet, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <1335972259-20975-1-git-send-email-c.hemp@phytec.de>

From: Christian Hemp <c.hemp@phytec.de>
Date: Wed, 2 May 2012 17:24:19 +0200

> A timeout of two minutes is pretty anoying if _no_ ethernet cable
> is attached by purpose.  This patch decreases the timeout of
> CONF_CARRIER_TIMEOUT to an accaptable value of 10 secounds.
> 
> Signed-off-by: Christian Hemp <c.hemp@phytec.de>

It was increased to 2 minutes intentionally, therefore you better go
look into the history of why this was done and you better explain in
your commit message why those issues don't matter.

Otherwise your patch will be completely ignored.

^ permalink raw reply

* Re: [net-next v2 0/7][pull request] Intel Wired LAN Driver Updates
From: David Miller @ 2012-05-02 23:45 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, sassmann, bhutchings
In-Reply-To: <1335950025-13294-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed,  2 May 2012 02:13:38 -0700

> This series of patches contains updates for e1000e and ixgbe.
> Patch 4 & 5 integrate the hwmon interface for ixgbe as Ben Hutchings
> has done for other drivers.
> 
> v2: dropped the ixgbe patch which added the sysfs interface
> 
> The following are changes since commit e4ae004b84b315dd4b762e474f97403eac70f76a:
>   netem: add ECN capability
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next master

Pulled, thanks Jeff.

^ permalink raw reply

* Re: bonding: don't increase rx_dropped after processing LACPDUs
From: David Miller @ 2012-05-02 23:41 UTC (permalink / raw)
  To: jbohac; +Cc: eric.dumazet, fubar, andy, netdev
In-Reply-To: <20120502205118.GB25355@midget.suse.cz>

From: Jiri Bohac <jbohac@suse.cz>
Date: Wed, 2 May 2012 22:51:18 +0200

> On Wed, May 02, 2012 at 10:36:49PM +0200, Eric Dumazet wrote:
>> > +			if (ret == RX_HANDLER_CONSUMED)
>> > +				kfree_skb(skb);
>> 
>> After this point, you have use after free :
>> 
>> if (bond_should_deliver_exact_match(skb, slave, bond)) { 
>> 	...
>> }
>> skb->dev = bond->dev;
> 
> Thanks for spotting this! Let's just return immediately at that
> point. Fixed version below:
> 
> Signed-off-by: Jiri Bohac <jbohac@suse.cz>

Please don't do this.

When you post a fixed version of a patch, post it with the full
proper commit message and signoff.

I'm not going to go back to your original posting and put that
commit message from there into the fixed patch.  That's your
job as a patch submitter, not mine.

^ permalink raw reply

* Re: [net-next PATCH v4 0/8] Managing the forwarding database(FDB)
From: Sridhar Samudrala @ 2012-05-02 23:36 UTC (permalink / raw)
  To: John Fastabend
  Cc: Michael S. Tsirkin, shemminger, bhutchings, hadi,
	jeffrey.t.kirsher, netdev, gregory.v.rose, krkumar2, roprabhu
In-Reply-To: <4FA1ACA1.2080808@intel.com>

On 5/2/2012 2:52 PM, John Fastabend wrote:
> On 5/2/2012 8:08 AM, Michael S. Tsirkin wrote:
>> On Sun, Apr 15, 2012 at 01:06:37PM -0400, David Miller wrote:
>>> From: John Fastabend<john.r.fastabend@intel.com>
>>> Date: Sun, 15 Apr 2012 09:43:51 -0700
>>>
>>>> The following series is a submission for net-next to allow
>>>> embedded switches and other stacked devices other then the
>>>> Linux bridge to manage a forwarding database.
>>>>
>>>> Previously discussed here,
>>>>
>>>> http://lists.openwall.net/netdev/2012/03/19/26
>>>>
>>>> v4: propagate return codes correctly for ndo_dflt_Fdb_dump()
>>>>
>>>> v3: resolve the macvlan patch 8/8 to fix a dev_set_promiscuity()
>>>>      error and add the flags field to change and get link routines.
>>>>
>>>> v2: addressed feedback from Ben Hutchings resolving a typo in the
>>>>      multicast add/del routines and improving the error handling
>>>>      when both NTF_SELF and NTF_MASTER are set.
>>>>
>>>> I've tested this with 'br' tool published by Stephen Hemminger
>>>> soon to be renamed 'bridge' I believe and various traffic
>>>> generators mostly pktgen, ping, and netperf.
>>> All applied, if we need any more tweaks we can just add them
>>> on top of this work.
>>>
>>> Thanks John.
>> John, do you plan to update kvm userspace to use this interface?
>>
> No immediate plans. I would really appreciate it if you or one
> of the IBM developers working in this space took it on. Of course
> if no one steps up I guess I can eventually get at it but it will
> be sometime. For now I've been doing this manually with the bridge
> tool yet to be published.
>
>
Does this mean that when we add an interface to a bridge, it need not be 
put in promiscuous mode and
add/delete fdb entries dynamically?
Or are we talking only about VMs attached to macvtap?

Thanks
Sridhar

^ permalink raw reply

* Question: Section mismatch warnings in drivers/net
From: H Hartley Sweeten @ 2012-05-02 23:30 UTC (permalink / raw)
  To: Linux Kernel; +Cc: netdev

Hello all,

I noticed a number of section mismatch warnings in various
drivers/net/ drivers. I was going to submit some patches to
clean them up but then saw these commits:

commit 948252cb9e01d65a89ecadf67be5018351eee15e
Author: David S. Miller <davem@davemloft.net>
Date:   Tue May 31 19:27:48 2011 -0700

    Revert "net: fix section mismatches"
    
    This reverts commit e5cb966c0838e4da43a3b0751bdcac7fe719f7b4.
    
    It causes new build regressions with gcc-4.2 which is
    pretty common on non-x86 platforms.
    
    Reported-by: James Bottomley <James.Bottomley@HansenPartnership.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit e5cb966c0838e4da43a3b0751bdcac7fe719f7b4
Author: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Date:   Mon Apr 18 13:31:20 2011 +0000

    net: fix section mismatches
    
    Fix build warnings like the following:
    
    WARNING: drivers/net/built-in.o(.data+0x12434): Section mismatch in reference from the variable madge
    
    And add some consts to EISA device ID tables along the way.
    
    Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Was the issue with the build regressions ever worked out? Would
patches to fix the section mismatches be appropriate now?

Regards,
Hartley

^ permalink raw reply

* [PATCH v4 3/3] tcp: early retransmit: delayed fast retransmit
From: Yuchung Cheng @ 2012-05-02 23:30 UTC (permalink / raw)
  To: davem, ilpo.jarvinen; +Cc: ncardwell, nanditad, netdev, Yuchung Cheng
In-Reply-To: <1336001404-21879-1-git-send-email-ycheng@google.com>

Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2).
Delays the fast retransmit by an interval of RTT/4. We borrow the
RTO timer to implement the delay. If we receive another ACK or send
a new packet, the timer is cancelled and restored to original RTO
value offset by time elapsed.  When the delayed-ER timer fires,
we enter fast recovery and perform fast retransmit.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
ChangeLog in v2:
 - Set sysctl_tcp_early_retrans default to 2
ChangeLog in v3:
  - use separate u8 for early retrans stats in tcp_sock
  - disable ER if detects any reordering
ChangeLog in v4:
  - move disabling ER code on reordering into part2

 include/linux/tcp.h   |    3 +-
 include/net/tcp.h     |    3 ++
 net/ipv4/tcp_input.c  |   69 ++++++++++++++++++++++++++++++++++++++++++++-----
 net/ipv4/tcp_output.c |    5 +--
 net/ipv4/tcp_timer.c  |    5 +++
 5 files changed, 74 insertions(+), 11 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 7859b41..d9b42c5be 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -372,7 +372,8 @@ struct tcp_sock {
 		repair      : 1,
 		unused      : 1;
 	u8	repair_queue;
-	u8	do_early_retrans:1;/* Enable RFC5827 early-retransmit  */
+	u8	do_early_retrans:1,/* Enable RFC5827 early-retransmit  */
+		early_retrans_delayed:1; /* Delayed ER timer installed */
 
 /* RTT measurement */
 	u32	srtt;		/* smoothed round trip time << 3	*/
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 685437a..5283aa4 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -500,6 +500,8 @@ extern void tcp_send_delayed_ack(struct sock *sk);
 
 /* tcp_input.c */
 extern void tcp_cwnd_application_limited(struct sock *sk);
+extern void tcp_resume_early_retransmit(struct sock *sk);
+extern void tcp_rearm_rto(struct sock *sk);
 
 /* tcp_timer.c */
 extern void tcp_init_xmit_timers(struct sock *);
@@ -805,6 +807,7 @@ static inline void tcp_enable_early_retrans(struct tcp_sock *tp)
 {
 	tp->do_early_retrans = sysctl_tcp_early_retrans &&
 		!sysctl_tcp_thin_dupack && sysctl_tcp_reordering == 3;
+	tp->early_retrans_delayed = 0;
 }
 
 static inline void tcp_disable_early_retrans(struct tcp_sock *tp)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 82194c7..6c0bc6f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2344,6 +2344,27 @@ static inline int tcp_dupack_heuristics(const struct tcp_sock *tp)
 	return tcp_is_fack(tp) ? tp->fackets_out : tp->sacked_out + 1;
 }
 
+static bool tcp_pause_early_retransmit(struct sock *sk, int flag)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	unsigned long delay;
+
+	/* Delay early retransmit and entering fast recovery for
+	 * max(RTT/4, 2msec) unless ack has ECE mark, no RTT samples
+	 * available, or RTO is scheduled to fire first.
+	 */
+	if (sysctl_tcp_early_retrans < 2 || (flag & FLAG_ECE) || !tp->srtt)
+		return false;
+
+	delay = max_t(unsigned long, (tp->srtt >> 5), msecs_to_jiffies(2));
+	if (!time_after(inet_csk(sk)->icsk_timeout, (jiffies + delay)))
+		return false;
+
+	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, delay, TCP_RTO_MAX);
+	tp->early_retrans_delayed = 1;
+	return true;
+}
+
 static inline int tcp_skb_timedout(const struct sock *sk,
 				   const struct sk_buff *skb)
 {
@@ -2451,7 +2472,7 @@ static inline int tcp_head_timedout(const struct sock *sk)
  * Main question: may we further continue forward transmission
  * with the same cwnd?
  */
-static int tcp_time_to_recover(struct sock *sk)
+static int tcp_time_to_recover(struct sock *sk, int flag)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	__u32 packets_out;
@@ -2505,7 +2526,7 @@ static int tcp_time_to_recover(struct sock *sk)
 	if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
 	    (tp->packets_out == (tp->sacked_out + 1) && tp->packets_out < 4) &&
 	    !tcp_may_send_now(sk))
-		return 1;
+		return !tcp_pause_early_retransmit(sk, flag);
 
 	return 0;
 }
@@ -3172,7 +3193,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
 		if (icsk->icsk_ca_state <= TCP_CA_Disorder)
 			tcp_try_undo_dsack(sk);
 
-		if (!tcp_time_to_recover(sk)) {
+		if (!tcp_time_to_recover(sk, flag)) {
 			tcp_try_to_open(sk, flag);
 			return;
 		}
@@ -3271,16 +3292,47 @@ static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
 /* Restart timer after forward progress on connection.
  * RFC2988 recommends to restart timer to now+rto.
  */
-static void tcp_rearm_rto(struct sock *sk)
+void tcp_rearm_rto(struct sock *sk)
 {
-	const struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
 
 	if (!tp->packets_out) {
 		inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS);
 	} else {
-		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
-					  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
+		u32 rto = inet_csk(sk)->icsk_rto;
+		/* Offset the time elapsed after installing regular RTO */
+		if (tp->early_retrans_delayed) {
+			struct sk_buff *skb = tcp_write_queue_head(sk);
+			const u32 rto_time_stamp = TCP_SKB_CB(skb)->when + rto;
+			s32 delta = (s32)(rto_time_stamp - tcp_time_stamp);
+			/* delta may not be positive if the socket is locked
+			 * when the delayed ER timer fires and is rescheduled.
+			 */
+			if (delta > 0)
+				rto = delta;
+		}
+		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
+					  TCP_RTO_MAX);
 	}
+	tp->early_retrans_delayed = 0;
+}
+
+/* This function is called when the delayed ER timer fires. TCP enters
+ * fast recovery and performs fast-retransmit.
+ */
+void tcp_resume_early_retransmit(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	tcp_rearm_rto(sk);
+
+	/* Stop if ER is disabled after the delayed ER timer is scheduled */
+	if (!tp->do_early_retrans)
+		return;
+
+	tcp_enter_recovery(sk, false);
+	tcp_update_scoreboard(sk, 1);
+	tcp_xmit_retransmit_queue(sk);
 }
 
 /* If we get here, the whole TSO packet has not been acked. */
@@ -3729,6 +3781,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	if (after(ack, tp->snd_nxt))
 		goto invalid_ack;
 
+	if (tp->early_retrans_delayed)
+		tcp_rearm_rto(sk);
+
 	if (after(ack, prior_snd_una))
 		flag |= FLAG_SND_UNA_ADVANCED;
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 834e89f..d947330 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -78,9 +78,8 @@ static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
 		tp->frto_counter = 3;
 
 	tp->packets_out += tcp_skb_pcount(skb);
-	if (!prior_packets)
-		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
-					  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
+	if (!prior_packets || tp->early_retrans_delayed)
+		tcp_rearm_rto(sk);
 }
 
 /* SND.NXT, if window was not shrunk.
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 34d4a02..e911e6c 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -319,6 +319,11 @@ void tcp_retransmit_timer(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
+	if (tp->early_retrans_delayed) {
+		tcp_resume_early_retransmit(sk);
+		return;
+	}
+
 	if (!tp->packets_out)
 		goto out;
 
-- 
1.7.7.3

^ permalink raw reply related

* [PATCH v4 1/3] tcp: early retransmit: tcp_enter_recovery()
From: Yuchung Cheng @ 2012-05-02 23:30 UTC (permalink / raw)
  To: davem, ilpo.jarvinen; +Cc: ncardwell, nanditad, netdev, Yuchung Cheng

This a prepartion patch that refactors the code to enter recovery
into a new function tcp_enter_recovery(). It's needed to implement
the delayed fast retransmit in ER.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
ChangeLog since v1:
 - swaped with part 1 and part2
ChangeLog since v2:
  - removed RFC in commit message
ChangeLog since v3:
  - nothing

 net/ipv4/tcp_input.c |   61 +++++++++++++++++++++++++++----------------------
 1 files changed, 34 insertions(+), 27 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c93b0cb..22df826 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3022,6 +3022,38 @@ static void tcp_update_cwnd_in_recovery(struct sock *sk, int newly_acked_sacked,
 	tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
 }
 
+static void tcp_enter_recovery(struct sock *sk, bool ece_ack)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	int mib_idx;
+
+	if (tcp_is_reno(tp))
+		mib_idx = LINUX_MIB_TCPRENORECOVERY;
+	else
+		mib_idx = LINUX_MIB_TCPSACKRECOVERY;
+
+	NET_INC_STATS_BH(sock_net(sk), mib_idx);
+
+	tp->high_seq = tp->snd_nxt;
+	tp->prior_ssthresh = 0;
+	tp->undo_marker = tp->snd_una;
+	tp->undo_retrans = tp->retrans_out;
+
+	if (inet_csk(sk)->icsk_ca_state < TCP_CA_CWR) {
+		if (!ece_ack)
+			tp->prior_ssthresh = tcp_current_ssthresh(sk);
+		tp->snd_ssthresh = inet_csk(sk)->icsk_ca_ops->ssthresh(sk);
+		TCP_ECN_queue_cwr(tp);
+	}
+
+	tp->bytes_acked = 0;
+	tp->snd_cwnd_cnt = 0;
+	tp->prior_cwnd = tp->snd_cwnd;
+	tp->prr_delivered = 0;
+	tp->prr_out = 0;
+	tcp_set_ca_state(sk, TCP_CA_Recovery);
+}
+
 /* Process an event, which can update packets-in-flight not trivially.
  * Main goal of this function is to calculate new estimate for left_out,
  * taking into account both packets sitting in receiver's buffer and
@@ -3041,7 +3073,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
 	struct tcp_sock *tp = tcp_sk(sk);
 	int do_lost = is_dupack || ((flag & FLAG_DATA_SACKED) &&
 				    (tcp_fackets_out(tp) > tp->reordering));
-	int fast_rexmit = 0, mib_idx;
+	int fast_rexmit = 0;
 
 	if (WARN_ON(!tp->packets_out && tp->sacked_out))
 		tp->sacked_out = 0;
@@ -3142,32 +3174,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
 		}
 
 		/* Otherwise enter Recovery state */
-
-		if (tcp_is_reno(tp))
-			mib_idx = LINUX_MIB_TCPRENORECOVERY;
-		else
-			mib_idx = LINUX_MIB_TCPSACKRECOVERY;
-
-		NET_INC_STATS_BH(sock_net(sk), mib_idx);
-
-		tp->high_seq = tp->snd_nxt;
-		tp->prior_ssthresh = 0;
-		tp->undo_marker = tp->snd_una;
-		tp->undo_retrans = tp->retrans_out;
-
-		if (icsk->icsk_ca_state < TCP_CA_CWR) {
-			if (!(flag & FLAG_ECE))
-				tp->prior_ssthresh = tcp_current_ssthresh(sk);
-			tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
-			TCP_ECN_queue_cwr(tp);
-		}
-
-		tp->bytes_acked = 0;
-		tp->snd_cwnd_cnt = 0;
-		tp->prior_cwnd = tp->snd_cwnd;
-		tp->prr_delivered = 0;
-		tp->prr_out = 0;
-		tcp_set_ca_state(sk, TCP_CA_Recovery);
+		tcp_enter_recovery(sk, (flag & FLAG_ECE));
 		fast_rexmit = 1;
 	}
 
-- 
1.7.7.3

^ permalink raw reply related

* [PATCH v4 2/3] tcp: early retransmit
From: Yuchung Cheng @ 2012-05-02 23:30 UTC (permalink / raw)
  To: davem, ilpo.jarvinen; +Cc: ncardwell, nanditad, netdev, Yuchung Cheng
In-Reply-To: <1336001404-21879-1-git-send-email-ycheng@google.com>

This patch implements RFC 5827 early retransmit (ER) for TCP.
It reduces DUPACK threshold (dupthresh) if outstanding packets are
less than 4 to recover losses by fast recovery instead of timeout.

While the algorithm is simple, small but frequent network reordering
makes this feature dangerous: the connection repeatedly enter
false recovery and degrade performance. Therefore we implement
a mitigation suggested in the appendix of the RFC that delays
entering fast recovery by a small interval, i.e., RTT/4. Currently
ER is conservative and is disabled for the rest of the connection
after the first reordering event. A large scale web server
experiment on the performance impact of ER is summarized in
section 6 of the paper "Proportional Rate Reduction for TCP”,
IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf

Note that Linux has a similar feature called THIN_DUPACK. The
differences are THIN_DUPACK do not mitigate reorderings and is only
used after slow start. Currently ER is disabled if THIN_DUPACK is
enabled. I would be happy to merge THIN_DUPACK feature with ER if
people think it's a good idea.

ER is enabled by sysctl_tcp_early_retrans:
  0: Disables ER

  1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.

  2: (Default) reduce dupthresh like mode 1. In addition, delay
     entering fast recovery by RTT/4.

Note: mode 2 is implemented in the third part of this patch series.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
ChangeLog in v2:
 - swapped part1 and part2
ChangeLog in v3:
 - trivial text fixes in documentation.
ChangeLog in v4:
 - disables ER if any reordering. shuffle early retransmit vars in tcp_sock.

 Documentation/networking/ip-sysctl.txt |   14 ++++++++++++++
 include/linux/tcp.h                    |    1 +
 include/net/tcp.h                      |   15 +++++++++++++++
 net/ipv4/sysctl_net_ipv4.c             |   10 ++++++++++
 net/ipv4/tcp.c                         |    3 +++
 net/ipv4/tcp_input.c                   |   15 +++++++++++++++
 net/ipv4/tcp_minisocks.c               |    1 +
 7 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 9b569a2..34916e7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -190,6 +190,20 @@ tcp_cookie_size - INTEGER
 tcp_dsack - BOOLEAN
 	Allows TCP to send "duplicate" SACKs.
 
+tcp_early_retrans - INTEGER
+	Enable Early Retransmit (ER), per RFC 5827. ER lowers the threshold
+	for triggering fast retransmit when the amount of outstanding data is
+	small and when no previously unsent data can be transmitted (such
+	that limited transmit could be used).
+	Possible values:
+		0 disables ER
+		1 enables ER
+		2 enables ER but delays fast recovery and fast retransmit
+		  by a fourth of RTT. This mitigates connection falsely
+		  recovers when network has a small degree of reordering
+		  (less than 3 packets).
+	Default: 2
+
 tcp_ecn - INTEGER
 	Enable Explicit Congestion Notification (ECN) in TCP. ECN is only
 	used when both ends of the TCP flow support it. It is useful to
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 278af9e..7859b41 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -372,6 +372,7 @@ struct tcp_sock {
 		repair      : 1,
 		unused      : 1;
 	u8	repair_queue;
+	u8	do_early_retrans:1;/* Enable RFC5827 early-retransmit  */
 
 /* RTT measurement */
 	u32	srtt;		/* smoothed round trip time << 3	*/
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0fb84de..685437a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -252,6 +252,7 @@ extern int sysctl_tcp_max_ssthresh;
 extern int sysctl_tcp_cookie_size;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
+extern int sysctl_tcp_early_retrans;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
@@ -797,6 +798,20 @@ static inline void tcp_enable_fack(struct tcp_sock *tp)
 	tp->rx_opt.sack_ok |= TCP_FACK_ENABLED;
 }
 
+/* TCP early-retransmit (ER) is similar to but more conservative than
+ * the thin-dupack feature.  Enable ER only if thin-dupack is disabled.
+ */
+static inline void tcp_enable_early_retrans(struct tcp_sock *tp)
+{
+	tp->do_early_retrans = sysctl_tcp_early_retrans &&
+		!sysctl_tcp_thin_dupack && sysctl_tcp_reordering == 3;
+}
+
+static inline void tcp_disable_early_retrans(struct tcp_sock *tp)
+{
+	tp->do_early_retrans = 0;
+}
+
 static inline unsigned int tcp_left_out(const struct tcp_sock *tp)
 {
 	return tp->sacked_out + tp->lost_out;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 33417f8..ef32956 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -27,6 +27,7 @@
 #include <net/tcp_memcontrol.h>
 
 static int zero;
+static int two = 2;
 static int tcp_retr1_max = 255;
 static int ip_local_port_range_min[] = { 1, 1 };
 static int ip_local_port_range_max[] = { 65535, 65535 };
@@ -677,6 +678,15 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler   = proc_dointvec
 	},
 	{
+		.procname	= "tcp_early_retrans",
+		.data		= &sysctl_tcp_early_retrans,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &two,
+	},
+	{
 		.procname	= "udp_mem",
 		.data		= &sysctl_udp_mem,
 		.maxlen		= sizeof(sysctl_udp_mem),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9670af3..6802c89 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -395,6 +395,7 @@ void tcp_init_sock(struct sock *sk)
 	tp->mss_cache = TCP_MSS_DEFAULT;
 
 	tp->reordering = sysctl_tcp_reordering;
+	tcp_enable_early_retrans(tp);
 	icsk->icsk_ca_ops = &tcp_init_congestion_ops;
 
 	sk->sk_state = TCP_CLOSE;
@@ -2495,6 +2496,8 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 			err = -EINVAL;
 		else
 			tp->thin_dupack = val;
+			if (tp->thin_dupack)
+				tcp_disable_early_retrans(tp);
 		break;
 
 	case TCP_REPAIR:
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 22df826..82194c7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -99,6 +99,7 @@ int sysctl_tcp_thin_dupack __read_mostly;
 
 int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
 int sysctl_tcp_abc __read_mostly;
+int sysctl_tcp_early_retrans __read_mostly = 2;
 
 #define FLAG_DATA		0x01 /* Incoming frame contained data.		*/
 #define FLAG_WIN_UPDATE		0x02 /* Incoming ACK was a window update.	*/
@@ -906,6 +907,7 @@ static void tcp_init_metrics(struct sock *sk)
 	if (dst_metric(dst, RTAX_REORDERING) &&
 	    tp->reordering != dst_metric(dst, RTAX_REORDERING)) {
 		tcp_disable_fack(tp);
+		tcp_disable_early_retrans(tp);
 		tp->reordering = dst_metric(dst, RTAX_REORDERING);
 	}
 
@@ -988,6 +990,9 @@ static void tcp_update_reordering(struct sock *sk, const int metric,
 #endif
 		tcp_disable_fack(tp);
 	}
+
+	if (metric > 0)
+		tcp_disable_early_retrans(tp);
 }
 
 /* This must be called before lost_out is incremented */
@@ -2492,6 +2497,16 @@ static int tcp_time_to_recover(struct sock *sk)
 	    tcp_is_sack(tp) && !tcp_send_head(sk))
 		return 1;
 
+	/* Trick#6: TCP early retransmit, per RFC5827.  To avoid spurious
+	 * retransmissions due to small network reorderings, we implement
+	 * Mitigation A.3 in the RFC and delay the retransmission for a short
+	 * interval if appropriate.
+	 */
+	if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
+	    (tp->packets_out == (tp->sacked_out + 1) && tp->packets_out < 4) &&
+	    !tcp_may_send_now(sk))
+		return 1;
+
 	return 0;
 }
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 3cabafb..6f6a918 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -482,6 +482,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		newtp->sacked_out = 0;
 		newtp->fackets_out = 0;
 		newtp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
+		tcp_enable_early_retrans(newtp);
 
 		/* So many TCP implementations out there (incorrectly) count the
 		 * initial SYN frame in their delayed-ACK and congestion control
-- 
1.7.7.3

^ permalink raw reply related

* Re: WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0x277/0x280()
From: Alex Villacís Lasso @ 2012-05-02 23:28 UTC (permalink / raw)
  To: Francois Romieu; +Cc: netdev
In-Reply-To: <20120502225227.GA12309@electric-eye.fr.zoreil.com>

El 02/05/12 17:52, Francois Romieu escribió:
> Alex Villacís Lasso<avillaci@fiec.espol.edu.ec>  :
> [...]
>> Still present in 3.4-rc5.
> Is it a 8168evl ?
>
I am currently away from the target computer. How should I check for this? lspci?

^ permalink raw reply

* Re: bonding: don't increase rx_dropped after processing LACPDUs
From: Jay Vosburgh @ 2012-05-02 23:10 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: Eric Dumazet, Andy Gospodarek, netdev
In-Reply-To: <20120502205118.GB25355@midget.suse.cz>

Jiri Bohac <jbohac@suse.cz> wrote:

>On Wed, May 02, 2012 at 10:36:49PM +0200, Eric Dumazet wrote:
>> > +			if (ret == RX_HANDLER_CONSUMED)
>> > +				kfree_skb(skb);
>> 
>> After this point, you have use after free :
>> 
>> if (bond_should_deliver_exact_match(skb, slave, bond)) { 
>> 	...
>> }
>> skb->dev = bond->dev;
>
>Thanks for spotting this! Let's just return immediately at that
>point. Fixed version below:

	Won't this make it impossible to bind a PF_PACKET socket to
sll_protocol == ETH_P_SLOW and see the LACPDUs, but only when bonding is
running 802.3ad?  This because the ptype_all check in
__netif_receive_skb happens before the rx_handler, but the ptype_base
check (bound packet socket, for example) happens after.  Currently,
libpcap looks to bind to ETH_P_ALL, so it won't be affected.

	If so, is that something we care about?

	-J


>Signed-off-by: Jiri Bohac <jbohac@suse.cz>
>
>diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>--- a/drivers/net/bonding/bond_3ad.c
>+++ b/drivers/net/bonding/bond_3ad.c
>@@ -2173,9 +2173,10 @@ re_arm:
>  * received frames (loopback). Since only the payload is given to this
>  * function, it check for loopback.
>  */
>-static void bond_3ad_rx_indication(struct lacpdu *lacpdu, struct slave *slave, u16 length)
>+static int bond_3ad_rx_indication(struct lacpdu *lacpdu, struct slave *slave, u16 length)
> {
> 	struct port *port;
>+	int ret = RX_HANDLER_ANOTHER;
>
> 	if (length >= sizeof(struct lacpdu)) {
>
>@@ -2184,11 +2185,12 @@ static void bond_3ad_rx_indication(struct lacpdu *lacpdu, struct slave *slave, u
> 		if (!port->slave) {
> 			pr_warning("%s: Warning: port of slave %s is uninitialized\n",
> 				   slave->dev->name, slave->dev->master->name);
>-			return;
>+			return ret;
> 		}
>
> 		switch (lacpdu->subtype) {
> 		case AD_TYPE_LACPDU:
>+			ret = RX_HANDLER_CONSUMED;
> 			pr_debug("Received LACPDU on port %d\n",
> 				 port->actor_port_number);
> 			/* Protect against concurrent state machines */
>@@ -2198,6 +2200,7 @@ static void bond_3ad_rx_indication(struct lacpdu *lacpdu, struct slave *slave, u
> 			break;
>
> 		case AD_TYPE_MARKER:
>+			ret = RX_HANDLER_CONSUMED;
> 			// No need to convert fields to Little Endian since we don't use the marker's fields.
>
> 			switch (((struct bond_marker *)lacpdu)->tlv_type) {
>@@ -2219,6 +2222,7 @@ static void bond_3ad_rx_indication(struct lacpdu *lacpdu, struct slave *slave, u
> 			}
> 		}
> 	}
>+	return ret;
> }
>
> /**
>@@ -2456,18 +2460,20 @@ out:
> 	return NETDEV_TX_OK;
> }
>
>-void bond_3ad_lacpdu_recv(struct sk_buff *skb, struct bonding *bond,
>+int bond_3ad_lacpdu_recv(struct sk_buff *skb, struct bonding *bond,
> 			  struct slave *slave)
> {
>+	int ret = RX_HANDLER_ANOTHER;
> 	if (skb->protocol != PKT_TYPE_LACPDU)
>-		return;
>+		return ret;
>
> 	if (!pskb_may_pull(skb, sizeof(struct lacpdu)))
>-		return;
>+		return ret;
>
> 	read_lock(&bond->lock);
>-	bond_3ad_rx_indication((struct lacpdu *) skb->data, slave, skb->len);
>+	ret = bond_3ad_rx_indication((struct lacpdu *) skb->data, slave, skb->len);
> 	read_unlock(&bond->lock);
>+	return ret;
> }
>
> /*
>diff --git a/drivers/net/bonding/bond_3ad.h b/drivers/net/bonding/bond_3ad.h
>index 235b2cc..5ee7e3c 100644
>--- a/drivers/net/bonding/bond_3ad.h
>+++ b/drivers/net/bonding/bond_3ad.h
>@@ -274,7 +274,7 @@ void bond_3ad_adapter_duplex_changed(struct slave *slave);
> void bond_3ad_handle_link_change(struct slave *slave, char link);
> int  bond_3ad_get_active_agg_info(struct bonding *bond, struct ad_info *ad_info);
> int bond_3ad_xmit_xor(struct sk_buff *skb, struct net_device *dev);
>-void bond_3ad_lacpdu_recv(struct sk_buff *skb, struct bonding *bond,
>+int bond_3ad_lacpdu_recv(struct sk_buff *skb, struct bonding *bond,
> 			  struct slave *slave);
> int bond_3ad_set_carrier(struct bonding *bond);
> void bond_3ad_update_lacp_rate(struct bonding *bond);
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 62d2409..0a0f4a6 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1444,8 +1444,9 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
> 	struct sk_buff *skb = *pskb;
> 	struct slave *slave;
> 	struct bonding *bond;
>-	void (*recv_probe)(struct sk_buff *, struct bonding *,
>+	int (*recv_probe)(struct sk_buff *, struct bonding *,
> 				struct slave *);
>+	int ret = RX_HANDLER_ANOTHER;
>
> 	skb = skb_share_check(skb, GFP_ATOMIC);
> 	if (unlikely(!skb))
>@@ -1464,8 +1465,12 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
> 		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>
> 		if (likely(nskb)) {
>-			recv_probe(nskb, bond, slave);
>+			ret = recv_probe(nskb, bond, slave);
> 			dev_kfree_skb(nskb);
>+			if (ret == RX_HANDLER_CONSUMED) {
>+				kfree_skb(skb);
>+				return ret;
>+			}
> 		}
> 	}
>
>@@ -1487,7 +1492,7 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
> 		memcpy(eth_hdr(skb)->h_dest, bond->dev->dev_addr, ETH_ALEN);
> 	}
>
>-	return RX_HANDLER_ANOTHER;
>+	return ret;
> }
>
> /* enslave device <slave> to bond device <master> */
>@@ -2723,7 +2728,7 @@ static void bond_validate_arp(struct bonding *bond, struct slave *slave, __be32
> 	}
> }
>
>-static void bond_arp_rcv(struct sk_buff *skb, struct bonding *bond,
>+static int bond_arp_rcv(struct sk_buff *skb, struct bonding *bond,
> 			 struct slave *slave)
> {
> 	struct arphdr *arp;
>@@ -2731,7 +2736,7 @@ static void bond_arp_rcv(struct sk_buff *skb, struct bonding *bond,
> 	__be32 sip, tip;
>
> 	if (skb->protocol != __cpu_to_be16(ETH_P_ARP))
>-		return;
>+		return RX_HANDLER_ANOTHER;
>
> 	read_lock(&bond->lock);
>
>@@ -2776,6 +2781,7 @@ static void bond_arp_rcv(struct sk_buff *skb, struct bonding *bond,
>
> out_unlock:
> 	read_unlock(&bond->lock);
>+	return RX_HANDLER_ANOTHER;
> }
>
> /*
>
>
>-- 
>Jiri Bohac <jbohac@suse.cz>
>SUSE Labs, SUSE CZ

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0x277/0x280()
From: Francois Romieu @ 2012-05-02 22:52 UTC (permalink / raw)
  To: Alex Villacís Lasso; +Cc: netdev
In-Reply-To: <4FA18919.8010503@fiec.espol.edu.ec>

Alex Villacís Lasso <avillaci@fiec.espol.edu.ec> :
[...]
> Still present in 3.4-rc5.

Is it a 8168evl ?

-- 
Ueimor

^ permalink raw reply

* Re: [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes
From: Eric W. Biederman @ 2012-05-02 21:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, jasowang, eric.dumazet, netdev, linux-kernel,
	Basil Gor
In-Reply-To: <20120502213128.GB8266@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> writes:

> On Wed, May 02, 2012 at 12:40:55PM -0700, Eric W. Biederman wrote:
>> 
>> Side question.  Are you aware that macvtap/vhost net is broken
>> in the presence of  vlan accelleration and that vlan accelleration
>> is not optional if you are using vlan headers?
>> 
>> Eric
>
> I didn't know. Could you explain please?

It is worth looking at the netdev history for some recent patches by
Basil Gor, as he has been hit by this issue and has been trying to come
up with a clean fix.  I was burned by this issue in other parts of the
networking stack and so have been doing some basic review.

The short version is that on any normal path through the networking
stack we implement vlan header accelleration in hardware or emulation
in software.  The result is that the ethernet vlan header is not
on the packet and is instead in the skb->tci field.

When coming on out the pf_packet sockets we don't include the vlan
header in the packet but instead the vlan header is put in aux data.

The result of all of this is that vhost/net.c looses the vlan
header when coming from pf_packet socket.

Furthermore macvtap_recvmsg does not act like pf_packet sockets
when there is a vlan header persent so weirdness ensues.  Especially
when the packet is coming from a path where the vlan header has
been stripped and placed in skb->tci.

Eric

^ permalink raw reply

* Re: [net-next PATCH v4 0/8] Managing the forwarding database(FDB)
From: John Fastabend @ 2012-05-02 21:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: shemminger, bhutchings, sri, hadi, jeffrey.t.kirsher, netdev,
	gregory.v.rose, krkumar2, roprabhu
In-Reply-To: <20120502150830.GA2976@redhat.com>

On 5/2/2012 8:08 AM, Michael S. Tsirkin wrote:
> On Sun, Apr 15, 2012 at 01:06:37PM -0400, David Miller wrote:
>> From: John Fastabend <john.r.fastabend@intel.com>
>> Date: Sun, 15 Apr 2012 09:43:51 -0700
>>
>>> The following series is a submission for net-next to allow
>>> embedded switches and other stacked devices other then the
>>> Linux bridge to manage a forwarding database.
>>>
>>> Previously discussed here,
>>>
>>> http://lists.openwall.net/netdev/2012/03/19/26
>>>
>>> v4: propagate return codes correctly for ndo_dflt_Fdb_dump()
>>>
>>> v3: resolve the macvlan patch 8/8 to fix a dev_set_promiscuity()
>>>     error and add the flags field to change and get link routines.
>>>
>>> v2: addressed feedback from Ben Hutchings resolving a typo in the
>>>     multicast add/del routines and improving the error handling
>>>     when both NTF_SELF and NTF_MASTER are set.
>>>
>>> I've tested this with 'br' tool published by Stephen Hemminger
>>> soon to be renamed 'bridge' I believe and various traffic
>>> generators mostly pktgen, ping, and netperf.
>>
>> All applied, if we need any more tweaks we can just add them
>> on top of this work.
>>
>> Thanks John.
> 
> John, do you plan to update kvm userspace to use this interface?
> 

No immediate plans. I would really appreciate it if you or one
of the IBM developers working in this space took it on. Of course
if no one steps up I guess I can eventually get at it but it will
be sometime. For now I've been doing this manually with the bridge
tool yet to be published.

.John

^ permalink raw reply

* Re: [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes
From: Michael S. Tsirkin @ 2012-05-02 21:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, jasowang, eric.dumazet, netdev, linux-kernel
In-Reply-To: <87pqammkvs.fsf@xmission.com>

On Wed, May 02, 2012 at 12:40:55PM -0700, Eric W. Biederman wrote:
> 
> Side question.  Are you aware that macvtap/vhost net is broken
> in the presence of  vlan accelleration and that vlan accelleration
> is not optional if you are using vlan headers?
> 
> Eric

I didn't know. Could you explain please?

^ permalink raw reply

* Re: [PATCH v3 3/3] tcp: early retransmit: delayed fast retransmit
From: Yuchung Cheng @ 2012-05-02 21:17 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: davem, ilpo.jarvinen, nanditad, netdev
In-Reply-To: <CADVnQyn6K8xH=wwi7pN0wsKS4wEhksi2pB0KbTaSDqLGWmnBaw@mail.gmail.com>

On Wed, May 2, 2012 at 12:34 PM, Neal Cardwell <ncardwell@google.com> wrote:
> On Wed, May 2, 2012 at 2:46 PM, Yuchung Cheng <ycheng@google.com> wrote:
>> Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2).
>> Delays the fast retransmit by an interval of RTT/4. We borrow the
>> RTO timer to implement the delay. If we receive another ACK or send
>> a new packet, the timer is cancelled and restored to original RTO
>> value offset by time elapsed.  When the delayed-ER timer fires,
>> we enter fast recovery and perform fast retransmit.
>>
>> Signed-off-by: Yuchung Cheng <ycheng@google.com>
>> ---
>> ChangeLog in v2:
>>  - Set sysctl_tcp_early_retrans default to 2
>> ChangeLog in v3:
>>  - use separate u8 for early retrans stats in tcp_sock
>>  - disable ER if detects any reordering
>
> After reading patch 3 of the series, I see that patch 3 incorporates
> most of those suggestions from Monday. I think it would be quite a bit
> cleaner to just have patch 2 of the series put the
> tcp_disable_early_retrans() call and tcp_sock fields in the ultimately
> desired place, rather than having patch 2 put them somewhere and patch
> 3 move them, but maybe that's just me.
>
> When all the patches in the series are applied, the one issue I still
> see is that frto_counter is in a new place, a bit further away from
> frto_highmark than it used to be before the patch series. I think it
> would be good  to keep frto_counter in its original location, back up
> nearer to frto_highmark.
Sorry for keep mis-placing the changes in different patch parts. Will fix asap.

>
> neal

^ permalink raw reply

* Re: [PATCH] tcp: change tcp_adv_win_scale and tcp_rmem[2]
From: Rick Jones @ 2012-05-02 21:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Neal Cardwell, netdev, Tom Herbert, Yuchung Cheng
In-Reply-To: <1335961721.22133.562.camel@edumazet-glaptop>

On 05/02/2012 05:28 AM, Eric Dumazet wrote:
> We should adjust the len/truesize ratio to 50% instead of 75%

As an added bonus, it would be consistent with the code behind 
setsockopt(SO_[SND|RCV]BUF) doubling what the user passes-in. (modulo 
the net.core.[rw]mem_max setting)

rick jones

^ permalink raw reply

* Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
From: Alexander Duyck @ 2012-05-02 20:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell, Tom Herbert,
	Jeff Kirsher, Michael Chan, Matt Carlson, Herbert Xu,
	Ben Hutchings, Ilpo Järvinen, Maciej Żenczykowski
In-Reply-To: <1335982515.22133.610.camel@edumazet-glaptop>

On 05/02/2012 11:15 AM, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 11:05 -0700, Alexander Duyck wrote:
>
>> You're correct about the fragstolen case, I actually was thinking of the
>> first patch you sent, not this second one.
>>
>> However we still have a problem.  What we end up with now is a case of
>> sharing in which the clone skb no longer knows that it is sharing the
>> head with another skb.  The dataref will drop to 1 when we call
>> __kfree_skb.  This means that any other function out there that tries to
>> see if the skb is shared would return false.  This could lead to issues
>> if there is anything out there that manipulates the data in head based
>> on the false assumption that it is not cloned.  What we would probably
>> need to do in this case is tweak the logic for skb_cloned.  If you are
>> using a head_frag you should probably add a check that returns true if
>> cloned is true and page_count is greater than 1.  We should be safe in
>> the case of skb_header_cloned since we already dropped are dataref when
>> we stole the page and freed the skb.
> I really dont understand this concern.
>
> When skb is cloned, we copy in head_frag __skb_clone()
>
> So both skbs have the bit set, and dataref = 2.
>
> first skb is freed, dataref becomes 1 and nothing special  happen
>
> >From this point, skb->head is not 'shared' anymore (taken your own
> words). And we are free to do whatever we want.
>
> second skb is freed, dataref becomes 0 and we call the right destructor.
The problem is that the stack will not be able to detect sharing.  As
long as page_count is greater than 2 and skb->cloned is set we should be
telling any callers to skb_cloned that the head is cloned.  Otherwise we
can run into issues elsewhere with well meaning code checking and not
detecting sharing, and then mangling the header.

Also I am not sure if the big monolithic changes are really the best way
to approach this.  It would be nice if we could fix this incrementally
instead of trying to do it all at once since there are multiple issues
that need to be addressed.

I will try to submit a few patches from my end later today.  I still
need to look over all of the changes from the past couple of weeks that
were based on the assumption that the IP stack completely owned the skb.

Thanks,

Alex

^ permalink raw reply

* Re: bonding: don't increase rx_dropped after processing LACPDUs
From: Jiri Bohac @ 2012-05-02 20:51 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jiri Bohac, Jay Vosburgh, Andy Gospodarek, netdev
In-Reply-To: <1335991009.22133.639.camel@edumazet-glaptop>

On Wed, May 02, 2012 at 10:36:49PM +0200, Eric Dumazet wrote:
> > +			if (ret == RX_HANDLER_CONSUMED)
> > +				kfree_skb(skb);
> 
> After this point, you have use after free :
> 
> if (bond_should_deliver_exact_match(skb, slave, bond)) { 
> 	...
> }
> skb->dev = bond->dev;

Thanks for spotting this! Let's just return immediately at that
point. Fixed version below:

Signed-off-by: Jiri Bohac <jbohac@suse.cz>

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -2173,9 +2173,10 @@ re_arm:
  * received frames (loopback). Since only the payload is given to this
  * function, it check for loopback.
  */
-static void bond_3ad_rx_indication(struct lacpdu *lacpdu, struct slave *slave, u16 length)
+static int bond_3ad_rx_indication(struct lacpdu *lacpdu, struct slave *slave, u16 length)
 {
 	struct port *port;
+	int ret = RX_HANDLER_ANOTHER;
 
 	if (length >= sizeof(struct lacpdu)) {
 
@@ -2184,11 +2185,12 @@ static void bond_3ad_rx_indication(struct lacpdu *lacpdu, struct slave *slave, u
 		if (!port->slave) {
 			pr_warning("%s: Warning: port of slave %s is uninitialized\n",
 				   slave->dev->name, slave->dev->master->name);
-			return;
+			return ret;
 		}
 
 		switch (lacpdu->subtype) {
 		case AD_TYPE_LACPDU:
+			ret = RX_HANDLER_CONSUMED;
 			pr_debug("Received LACPDU on port %d\n",
 				 port->actor_port_number);
 			/* Protect against concurrent state machines */
@@ -2198,6 +2200,7 @@ static void bond_3ad_rx_indication(struct lacpdu *lacpdu, struct slave *slave, u
 			break;
 
 		case AD_TYPE_MARKER:
+			ret = RX_HANDLER_CONSUMED;
 			// No need to convert fields to Little Endian since we don't use the marker's fields.
 
 			switch (((struct bond_marker *)lacpdu)->tlv_type) {
@@ -2219,6 +2222,7 @@ static void bond_3ad_rx_indication(struct lacpdu *lacpdu, struct slave *slave, u
 			}
 		}
 	}
+	return ret;
 }
 
 /**
@@ -2456,18 +2460,20 @@ out:
 	return NETDEV_TX_OK;
 }
 
-void bond_3ad_lacpdu_recv(struct sk_buff *skb, struct bonding *bond,
+int bond_3ad_lacpdu_recv(struct sk_buff *skb, struct bonding *bond,
 			  struct slave *slave)
 {
+	int ret = RX_HANDLER_ANOTHER;
 	if (skb->protocol != PKT_TYPE_LACPDU)
-		return;
+		return ret;
 
 	if (!pskb_may_pull(skb, sizeof(struct lacpdu)))
-		return;
+		return ret;
 
 	read_lock(&bond->lock);
-	bond_3ad_rx_indication((struct lacpdu *) skb->data, slave, skb->len);
+	ret = bond_3ad_rx_indication((struct lacpdu *) skb->data, slave, skb->len);
 	read_unlock(&bond->lock);
+	return ret;
 }
 
 /*
diff --git a/drivers/net/bonding/bond_3ad.h b/drivers/net/bonding/bond_3ad.h
index 235b2cc..5ee7e3c 100644
--- a/drivers/net/bonding/bond_3ad.h
+++ b/drivers/net/bonding/bond_3ad.h
@@ -274,7 +274,7 @@ void bond_3ad_adapter_duplex_changed(struct slave *slave);
 void bond_3ad_handle_link_change(struct slave *slave, char link);
 int  bond_3ad_get_active_agg_info(struct bonding *bond, struct ad_info *ad_info);
 int bond_3ad_xmit_xor(struct sk_buff *skb, struct net_device *dev);
-void bond_3ad_lacpdu_recv(struct sk_buff *skb, struct bonding *bond,
+int bond_3ad_lacpdu_recv(struct sk_buff *skb, struct bonding *bond,
 			  struct slave *slave);
 int bond_3ad_set_carrier(struct bonding *bond);
 void bond_3ad_update_lacp_rate(struct bonding *bond);
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 62d2409..0a0f4a6 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1444,8 +1444,9 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
 	struct sk_buff *skb = *pskb;
 	struct slave *slave;
 	struct bonding *bond;
-	void (*recv_probe)(struct sk_buff *, struct bonding *,
+	int (*recv_probe)(struct sk_buff *, struct bonding *,
 				struct slave *);
+	int ret = RX_HANDLER_ANOTHER;
 
 	skb = skb_share_check(skb, GFP_ATOMIC);
 	if (unlikely(!skb))
@@ -1464,8 +1465,12 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
 		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
 
 		if (likely(nskb)) {
-			recv_probe(nskb, bond, slave);
+			ret = recv_probe(nskb, bond, slave);
 			dev_kfree_skb(nskb);
+			if (ret == RX_HANDLER_CONSUMED) {
+				kfree_skb(skb);
+				return ret;
+			}
 		}
 	}
 
@@ -1487,7 +1492,7 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
 		memcpy(eth_hdr(skb)->h_dest, bond->dev->dev_addr, ETH_ALEN);
 	}
 
-	return RX_HANDLER_ANOTHER;
+	return ret;
 }
 
 /* enslave device <slave> to bond device <master> */
@@ -2723,7 +2728,7 @@ static void bond_validate_arp(struct bonding *bond, struct slave *slave, __be32
 	}
 }
 
-static void bond_arp_rcv(struct sk_buff *skb, struct bonding *bond,
+static int bond_arp_rcv(struct sk_buff *skb, struct bonding *bond,
 			 struct slave *slave)
 {
 	struct arphdr *arp;
@@ -2731,7 +2736,7 @@ static void bond_arp_rcv(struct sk_buff *skb, struct bonding *bond,
 	__be32 sip, tip;
 
 	if (skb->protocol != __cpu_to_be16(ETH_P_ARP))
-		return;
+		return RX_HANDLER_ANOTHER;
 
 	read_lock(&bond->lock);
 
@@ -2776,6 +2781,7 @@ static void bond_arp_rcv(struct sk_buff *skb, struct bonding *bond,
 
 out_unlock:
 	read_unlock(&bond->lock);
+	return RX_HANDLER_ANOTHER;
 }
 
 /*


-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ

^ permalink raw reply related

* Re: bonding: don't increase rx_dropped after processing LACPDUs
From: Eric Dumazet @ 2012-05-02 20:36 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: Jay Vosburgh, Andy Gospodarek, netdev
In-Reply-To: <20120502202309.GA25355@midget.suse.cz>

On Wed, 2012-05-02 at 22:23 +0200, Jiri Bohac wrote:
> Since commit 3aba891d, bonding processes LACP frames (802.3ad
> mode) with bond_handle_frame(). Currently a copy of the skb is
> made and the original is left to be processed by other
> rx_handlers and the rest of the network stack by returning
> RX_HANDLER_ANOTHER.  As there is no protocol handler for
> PKT_TYPE_LACPDU, the frame is dropped and dev->rx_dropped
> increased.
> 
> Fix this by making bond_handle_frame() return RX_HANDLER_CONSUMED
> if bonding has processed the LACP frame.  
> 
> Signed-off-by: Jiri Bohac <jbohac@suse.cz>
> 

Nice idea but...

> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -1444,8 +1444,9 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
>  	struct sk_buff *skb = *pskb;
>  	struct slave *slave;
>  	struct bonding *bond;
> -	void (*recv_probe)(struct sk_buff *, struct bonding *,
> +	int (*recv_probe)(struct sk_buff *, struct bonding *,
>  				struct slave *);
> +	int ret = RX_HANDLER_ANOTHER;
>  
>  	skb = skb_share_check(skb, GFP_ATOMIC);
>  	if (unlikely(!skb))
> @@ -1464,8 +1465,10 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
>  		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>  
>  		if (likely(nskb)) {
> -			recv_probe(nskb, bond, slave);
> +			ret = recv_probe(nskb, bond, slave);
>  			dev_kfree_skb(nskb);
> +			if (ret == RX_HANDLER_CONSUMED)
> +				kfree_skb(skb);

After this point, you have use after free :

if (bond_should_deliver_exact_match(skb, slave, bond)) { 
	...
}
skb->dev = bond->dev;
...


>  		}
>  	}
>  
> @@ -1487,7 +1490,7 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
>  		memcpy(eth_hdr(skb)->h_dest, bond->dev->dev_addr, ETH_ALEN);
>  	}
>  
> -	return RX_HANDLER_ANOTHER;
> +	return ret;
>  }
>  

^ permalink raw reply

* Re: [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv()
From: Joe Perches @ 2012-05-02 20:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Alexander Duyck, Alexander Duyck, netdev,
	Neal Cardwell, Tom Herbert, Jeff Kirsher, Michael Chan,
	Matt Carlson, Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski
In-Reply-To: <1335990187.22133.636.camel@edumazet-glaptop>

On Wed, 2012-05-02 at 22:23 +0200, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 13:11 -0700, Joe Perches wrote:
> > It might be useful to comment that this is/can be initialized to
> > false in tcp_try_coalesce via tcp_recv_queue.
> > Otherwise this looks like a possibly uninitialized test
> > of fragstolen.  Maybe there's a path where tail is null
> > in tcp_recv_queue and it's an uninitialized test anyway.
> 
> If tcp_queue_rcv() returns 1, fragstolen is initialized in
> tcp_try_coalesce().
> 
> If tcp_queue_rcv() returns 0, fragstolen content is undefined and we
> dont care.
> 
> If a compiler or static checker complains, its only their problem.

True, but it's code that's a bit fragile and
I think as such it could be improved for any
human reader by a descriptive comment.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox