Netdev List


* [net-2.6 PATCH] e1000e: swap max hw supported frame size between 82574 and 82583
From: Jeff Kirsher @ 2009-10-02 22:30 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, rjw, Alexander Duyck, Jeff Kirsher

From: Alexander Duyck <alexander.h.duyck@intel.com>

There appears to have been a mixup in the max supported jumbo frame size
between 82574 and 82583 which ended up disabling jumbo frames on the 82574
as a result.  This patch swaps the two so that this issue is resolved.

This patch fixes http://bugzilla.kernel.org/show_bug.cgi?id=14261

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/e1000e/82571.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/e1000e/82571.c b/drivers/net/e1000e/82571.c
index b53b40b..d1e0563 100644
--- a/drivers/net/e1000e/82571.c
+++ b/drivers/net/e1000e/82571.c
@@ -1803,7 +1803,7 @@ struct e1000_info e1000_82574_info = {
 				  | FLAG_HAS_AMT
 				  | FLAG_HAS_CTRLEXT_ON_LOAD,
 	.pba			= 20,
-	.max_hw_frame_size	= ETH_FRAME_LEN + ETH_FCS_LEN,
+	.max_hw_frame_size	= DEFAULT_JUMBO,
 	.get_variants		= e1000_get_variants_82571,
 	.mac_ops		= &e82571_mac_ops,
 	.phy_ops		= &e82_phy_ops_bm,
@@ -1820,7 +1820,7 @@ struct e1000_info e1000_82583_info = {
 				  | FLAG_HAS_AMT
 				  | FLAG_HAS_CTRLEXT_ON_LOAD,
 	.pba			= 20,
-	.max_hw_frame_size	= DEFAULT_JUMBO,
+	.max_hw_frame_size	= ETH_FRAME_LEN + ETH_FCS_LEN,
 	.get_variants		= e1000_get_variants_82571,
 	.mac_ops		= &e82571_mac_ops,
 	.phy_ops		= &e82_phy_ops_bm,
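
To illustrate why the swapped values disabled jumbo frames on the 82574, here is a minimal user-space sketch of an MTU check against `max_hw_frame_size`. The numeric constants (in particular `DEFAULT_JUMBO` = 9234), the `frame_limit_info` struct and the `mtu_fits()` helper are assumptions made for illustration; they are not taken from the patch and only loosely model what the driver does.

```c
#include <stdbool.h>

/* Values assumed from kernel headers of that era; check your own tree. */
#define ETH_HLEN       14    /* Ethernet header length */
#define ETH_FRAME_LEN  1514  /* maximum frame length without FCS */
#define ETH_FCS_LEN    4     /* frame check sequence length */
#define DEFAULT_JUMBO  9234  /* assumed e1000e jumbo frame limit */

/* Hypothetical stand-in for the one e1000_info field the patch touches. */
struct frame_limit_info {
	unsigned int max_hw_frame_size;
};

/* After the patch: the 82574 gets the jumbo limit, the 82583 does not. */
static const struct frame_limit_info info_82574 = {
	.max_hw_frame_size = DEFAULT_JUMBO,
};

static const struct frame_limit_info info_82583 = {
	.max_hw_frame_size = ETH_FRAME_LEN + ETH_FCS_LEN,
};

/* Return true if a frame carrying this MTU fits the hardware limit. */
static bool mtu_fits(const struct frame_limit_info *info, unsigned int mtu)
{
	unsigned int max_frame = mtu + ETH_HLEN + ETH_FCS_LEN;

	return max_frame <= info->max_hw_frame_size;
}
```

With these assumed values, an MTU of 9000 is accepted for the 82574 entry and rejected for the 82583 entry, while the standard 1500 fits both.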


^ permalink raw reply related

* Re: [PATCH] TCPCT-1: adding a sysctl
From: David Miller @ 2009-10-02 22:48 UTC (permalink / raw)
  To: william.allen.simpson; +Cc: netdev
In-Reply-To: <4AC674A4.2040900@gmail.com>

From: William Allen Simpson <william.allen.simpson@gmail.com>
Date: Fri, 02 Oct 2009 17:46:12 -0400

> Andi Kleen wrote:
>> William Allen Simpson <william.allen.simpson@gmail.com> writes:
>>> Any suggestions for improvement?  Or general approval?
>> The patch seems incomplete, can't find callers for most of the new
>> functions.
>> 
> Ummm, I was following the suggested practice of breaking it into smaller
> pieces for review.  This is just the control functions and headers.  I've
> actually completed most of the port, and am champing at the bit.

We can't review the helper functions and infrastructure properly until
we can see how they are actually used.

Seeing how they are used shows us how well they are designed.

Otherwise asking for a review is absolutely pointless as we have no context
in which to judge the code you're showing us.

^ permalink raw reply

* tc: indirect shaping of incoming traffic
From: Hauke Laging @ 2009-10-02 22:46 UTC (permalink / raw)
  To: netdev

Hello,

I am not a programmer, so I cannot try to solve this problem myself, but I
have thought about using traffic shaping for incoming traffic:

http://www.hauke-laging.de/ideen/shape-incoming-ip-traffic/index.en.html

Maybe somebody with the necessary capabilities finds this interesting 
enough to give it a try...

CU

Hauke

^ permalink raw reply

* [PATCH 1/1] net: mark net_proto_ops as const
From: Stephen Hemminger @ 2009-10-02 23:25 UTC (permalink / raw)
  To: David S. Miller
  Cc: Bernard Pidoux F6BVP, Dan Carpenter, Andy Grover, Qinghuang Feng,
	Christine Caulfield, Ursula Braun, James Morris, David Howells,
	Hendrik Brueckner, Matthias Urlichs, Denis V. Lunev,
	Stephen Hemminger, linux-afs-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Alan Cox, Harvey Harrison, Chas Williams,
	linux-s390-u79uwXL29TY76Z2rM5mHXA, Mark Smith,
	Pekka Savola (ipv6), Eric Dumazet, Huang Weiyi, James Chapman,
	Henner 
In-Reply-To: <20091002232520.925496630@vyatta.com>

[-- Attachment #1: cons-proto-family.patch --]
[-- Type: text/plain, Size: 16409 bytes --]

All instances of struct net_proto_family should be declared const.

---
 drivers/isdn/mISDN/socket.c       |    3 +--
 drivers/net/pppox.c               |    2 +-
 include/net/bluetooth/bluetooth.h |    2 +-
 net/appletalk/ddp.c               |    2 +-
 net/atm/pvc.c                     |    2 +-
 net/atm/svc.c                     |    2 +-
 net/ax25/af_ax25.c                |    2 +-
 net/bluetooth/af_bluetooth.c      |    4 ++--
 net/bluetooth/bnep/sock.c         |    2 +-
 net/bluetooth/cmtp/sock.c         |    2 +-
 net/bluetooth/hci_sock.c          |    2 +-
 net/bluetooth/hidp/sock.c         |    2 +-
 net/bluetooth/l2cap.c             |    2 +-
 net/bluetooth/rfcomm/sock.c       |    2 +-
 net/bluetooth/sco.c               |    2 +-
 net/can/af_can.c                  |    2 +-
 net/decnet/af_decnet.c            |    2 +-
 net/econet/af_econet.c            |    2 +-
 net/ieee802154/af_ieee802154.c    |    2 +-
 net/ipv4/af_inet.c                |    2 +-
 net/ipv6/af_inet6.c               |    2 +-
 net/ipx/af_ipx.c                  |    2 +-
 net/irda/af_irda.c                |    2 +-
 net/iucv/af_iucv.c                |    2 +-
 net/key/af_key.c                  |    2 +-
 net/llc/af_llc.c                  |    2 +-
 net/netlink/af_netlink.c          |    2 +-
 net/netrom/af_netrom.c            |    2 +-
 net/packet/af_packet.c            |    2 +-
 net/phonet/af_phonet.c            |    2 +-
 net/rds/af_rds.c                  |    2 +-
 net/rose/af_rose.c                |    2 +-
 net/rxrpc/af_rxrpc.c              |    2 +-
 net/unix/af_unix.c                |    2 +-
 net/x25/af_x25.c                  |    2 +-
 35 files changed, 36 insertions(+), 37 deletions(-)

--- a/drivers/isdn/mISDN/socket.c	2009-10-02 16:20:02.320291655 -0700
+++ b/drivers/isdn/mISDN/socket.c	2009-10-02 16:20:43.210280312 -0700
@@ -808,8 +808,7 @@ mISDN_sock_create(struct net *net, struc
 	return err;
 }
 
-static struct
-net_proto_family mISDN_sock_family_ops = {
+static const struct net_proto_family mISDN_sock_family_ops = {
 	.owner  = THIS_MODULE,
 	.family = PF_ISDN,
 	.create = mISDN_sock_create,
--- a/drivers/net/pppox.c	2009-10-01 19:03:16.918349768 -0700
+++ b/drivers/net/pppox.c	2009-10-02 16:20:43.210280312 -0700
@@ -125,7 +125,7 @@ out:
 	return rc;
 }
 
-static struct net_proto_family pppox_proto_family = {
+static const struct net_proto_family pppox_proto_family = {
 	.family	= PF_PPPOX,
 	.create	= pppox_create,
 	.owner	= THIS_MODULE,
--- a/include/net/bluetooth/bluetooth.h	2009-10-01 19:03:17.038350923 -0700
+++ b/include/net/bluetooth/bluetooth.h	2009-10-02 16:20:43.210280312 -0700
@@ -121,7 +121,7 @@ struct bt_sock_list {
 	rwlock_t          lock;
 };
 
-int  bt_sock_register(int proto, struct net_proto_family *ops);
+int  bt_sock_register(int proto, const struct net_proto_family *ops);
 int  bt_sock_unregister(int proto);
 void bt_sock_link(struct bt_sock_list *l, struct sock *s);
 void bt_sock_unlink(struct bt_sock_list *l, struct sock *s);
--- a/net/appletalk/ddp.c	2009-10-01 19:03:16.634348458 -0700
+++ b/net/appletalk/ddp.c	2009-10-02 16:20:43.210280312 -0700
@@ -1821,7 +1821,7 @@ static int atalk_compat_ioctl(struct soc
 #endif
 
 
-static struct net_proto_family atalk_family_ops = {
+static const struct net_proto_family atalk_family_ops = {
 	.family		= PF_APPLETALK,
 	.create		= atalk_create,
 	.owner		= THIS_MODULE,
--- a/net/atm/pvc.c	2009-10-02 16:20:02.440301882 -0700
+++ b/net/atm/pvc.c	2009-10-02 16:20:43.210280312 -0700
@@ -137,7 +137,7 @@ static int pvc_create(struct net *net, s
 }
 
 
-static struct net_proto_family pvc_family_ops = {
+static const struct net_proto_family pvc_family_ops = {
 	.family = PF_ATMPVC,
 	.create = pvc_create,
 	.owner = THIS_MODULE,
--- a/net/atm/svc.c	2009-10-02 16:20:02.440301882 -0700
+++ b/net/atm/svc.c	2009-10-02 16:20:43.210280312 -0700
@@ -666,7 +666,7 @@ static int svc_create(struct net *net, s
 }
 
 
-static struct net_proto_family svc_family_ops = {
+static const struct net_proto_family svc_family_ops = {
 	.family = PF_ATMSVC,
 	.create = svc_create,
 	.owner = THIS_MODULE,
--- a/net/ax25/af_ax25.c	2009-10-02 16:20:02.440301882 -0700
+++ b/net/ax25/af_ax25.c	2009-10-02 16:20:43.210280312 -0700
@@ -1961,7 +1961,7 @@ static const struct file_operations ax25
 
 #endif
 
-static struct net_proto_family ax25_family_ops = {
+static const struct net_proto_family ax25_family_ops = {
 	.family =	PF_AX25,
 	.create =	ax25_create,
 	.owner	=	THIS_MODULE,
--- a/net/bluetooth/af_bluetooth.c	2009-10-01 19:03:16.854348407 -0700
+++ b/net/bluetooth/af_bluetooth.c	2009-10-02 16:20:43.210280312 -0700
@@ -45,7 +45,7 @@
 
 /* Bluetooth sockets */
 #define BT_MAX_PROTO	8
-static struct net_proto_family *bt_proto[BT_MAX_PROTO];
+static const struct net_proto_family *bt_proto[BT_MAX_PROTO];
 static DEFINE_RWLOCK(bt_proto_lock);
 
 static struct lock_class_key bt_lock_key[BT_MAX_PROTO];
@@ -86,7 +86,7 @@ static inline void bt_sock_reclassify_lo
 				bt_key_strings[proto], &bt_lock_key[proto]);
 }
 
-int bt_sock_register(int proto, struct net_proto_family *ops)
+int bt_sock_register(int proto, const struct net_proto_family *ops)
 {
 	int err = 0;
 
--- a/net/bluetooth/bnep/sock.c	2009-10-01 19:03:16.898350369 -0700
+++ b/net/bluetooth/bnep/sock.c	2009-10-02 16:20:43.210280312 -0700
@@ -222,7 +222,7 @@ static int bnep_sock_create(struct net *
 	return 0;
 }
 
-static struct net_proto_family bnep_sock_family_ops = {
+static const struct net_proto_family bnep_sock_family_ops = {
 	.family = PF_BLUETOOTH,
 	.owner	= THIS_MODULE,
 	.create = bnep_sock_create
--- a/net/bluetooth/cmtp/sock.c	2009-10-01 19:03:16.874349134 -0700
+++ b/net/bluetooth/cmtp/sock.c	2009-10-02 16:20:43.220342414 -0700
@@ -217,7 +217,7 @@ static int cmtp_sock_create(struct net *
 	return 0;
 }
 
-static struct net_proto_family cmtp_sock_family_ops = {
+static const struct net_proto_family cmtp_sock_family_ops = {
 	.family	= PF_BLUETOOTH,
 	.owner	= THIS_MODULE,
 	.create	= cmtp_sock_create
--- a/net/bluetooth/hci_sock.c	2009-10-02 16:20:02.440301882 -0700
+++ b/net/bluetooth/hci_sock.c	2009-10-02 16:20:43.220342414 -0700
@@ -687,7 +687,7 @@ static int hci_sock_dev_event(struct not
 	return NOTIFY_DONE;
 }
 
-static struct net_proto_family hci_sock_family_ops = {
+static const struct net_proto_family hci_sock_family_ops = {
 	.family	= PF_BLUETOOTH,
 	.owner	= THIS_MODULE,
 	.create	= hci_sock_create,
--- a/net/bluetooth/hidp/sock.c	2009-10-01 19:03:16.794397698 -0700
+++ b/net/bluetooth/hidp/sock.c	2009-10-02 16:20:43.220342414 -0700
@@ -268,7 +268,7 @@ static int hidp_sock_create(struct net *
 	return 0;
 }
 
-static struct net_proto_family hidp_sock_family_ops = {
+static const struct net_proto_family hidp_sock_family_ops = {
 	.family	= PF_BLUETOOTH,
 	.owner	= THIS_MODULE,
 	.create	= hidp_sock_create
--- a/net/bluetooth/l2cap.c	2009-10-02 16:20:02.440301882 -0700
+++ b/net/bluetooth/l2cap.c	2009-10-02 16:20:43.220342414 -0700
@@ -3916,7 +3916,7 @@ static const struct proto_ops l2cap_sock
 	.getsockopt	= l2cap_sock_getsockopt
 };
 
-static struct net_proto_family l2cap_sock_family_ops = {
+static const struct net_proto_family l2cap_sock_family_ops = {
 	.family	= PF_BLUETOOTH,
 	.owner	= THIS_MODULE,
 	.create	= l2cap_sock_create,
--- a/net/bluetooth/rfcomm/sock.c	2009-10-02 16:20:02.440301882 -0700
+++ b/net/bluetooth/rfcomm/sock.c	2009-10-02 16:20:43.220342414 -0700
@@ -1101,7 +1101,7 @@ static const struct proto_ops rfcomm_soc
 	.mmap		= sock_no_mmap
 };
 
-static struct net_proto_family rfcomm_sock_family_ops = {
+static const struct net_proto_family rfcomm_sock_family_ops = {
 	.family		= PF_BLUETOOTH,
 	.owner		= THIS_MODULE,
 	.create		= rfcomm_sock_create
--- a/net/bluetooth/sco.c	2009-10-02 16:20:02.440301882 -0700
+++ b/net/bluetooth/sco.c	2009-10-02 16:20:43.220342414 -0700
@@ -993,7 +993,7 @@ static const struct proto_ops sco_sock_o
 	.getsockopt	= sco_sock_getsockopt
 };
 
-static struct net_proto_family sco_sock_family_ops = {
+static const struct net_proto_family sco_sock_family_ops = {
 	.family	= PF_BLUETOOTH,
 	.owner	= THIS_MODULE,
 	.create	= sco_sock_create,
--- a/net/can/af_can.c	2009-10-01 19:03:16.437350580 -0700
+++ b/net/can/af_can.c	2009-10-02 16:20:43.220342414 -0700
@@ -842,7 +842,7 @@ static struct packet_type can_packet __r
 	.func = can_rcv,
 };
 
-static struct net_proto_family can_family_ops __read_mostly = {
+static const struct net_proto_family can_family_ops __read_mostly = {
 	.family = PF_CAN,
 	.create = can_create,
 	.owner  = THIS_MODULE,
--- a/net/decnet/af_decnet.c	2009-10-02 16:20:02.450290721 -0700
+++ b/net/decnet/af_decnet.c	2009-10-02 16:20:43.220342414 -0700
@@ -2325,7 +2325,7 @@ static const struct file_operations dn_s
 };
 #endif
 
-static struct net_proto_family	dn_family_ops = {
+static const struct net_proto_family	dn_family_ops = {
 	.family =	AF_DECnet,
 	.create =	dn_create,
 	.owner	=	THIS_MODULE,
--- a/net/econet/af_econet.c	2009-10-01 19:03:16.381348558 -0700
+++ b/net/econet/af_econet.c	2009-10-02 16:20:43.220342414 -0700
@@ -742,7 +742,7 @@ static int econet_ioctl(struct socket *s
 	return 0;
 }
 
-static struct net_proto_family econet_family_ops = {
+static const struct net_proto_family econet_family_ops = {
 	.family =	PF_ECONET,
 	.create =	econet_create,
 	.owner	=	THIS_MODULE,
--- a/net/ieee802154/af_ieee802154.c	2009-10-01 19:03:16.534350765 -0700
+++ b/net/ieee802154/af_ieee802154.c	2009-10-02 16:20:43.220342414 -0700
@@ -285,7 +285,7 @@ out:
 	return rc;
 }
 
-static struct net_proto_family ieee802154_family_ops = {
+static const struct net_proto_family ieee802154_family_ops = {
 	.family		= PF_IEEE802154,
 	.create		= ieee802154_create,
 	.owner		= THIS_MODULE,
--- a/net/ipv4/af_inet.c	2009-10-02 16:20:02.450290721 -0700
+++ b/net/ipv4/af_inet.c	2009-10-02 16:20:43.220342414 -0700
@@ -931,7 +931,7 @@ static const struct proto_ops inet_sockr
 #endif
 };
 
-static struct net_proto_family inet_family_ops = {
+static const struct net_proto_family inet_family_ops = {
 	.family = PF_INET,
 	.create = inet_create,
 	.owner	= THIS_MODULE,
--- a/net/ipv6/af_inet6.c	2009-10-01 19:03:16.558349136 -0700
+++ b/net/ipv6/af_inet6.c	2009-10-02 16:20:43.220342414 -0700
@@ -552,7 +552,7 @@ const struct proto_ops inet6_dgram_ops =
 #endif
 };
 
-static struct net_proto_family inet6_family_ops = {
+static const struct net_proto_family inet6_family_ops = {
 	.family = PF_INET6,
 	.create = inet6_create,
 	.owner	= THIS_MODULE,
--- a/net/ipx/af_ipx.c	2009-10-02 16:20:02.460301560 -0700
+++ b/net/ipx/af_ipx.c	2009-10-02 16:20:43.220342414 -0700
@@ -1927,7 +1927,7 @@ static int ipx_compat_ioctl(struct socke
  * Socket family declarations
  */
 
-static struct net_proto_family ipx_family_ops = {
+static const struct net_proto_family ipx_family_ops = {
 	.family		= PF_IPX,
 	.create		= ipx_create,
 	.owner		= THIS_MODULE,
--- a/net/irda/af_irda.c	2009-10-02 16:20:02.460301560 -0700
+++ b/net/irda/af_irda.c	2009-10-02 16:20:43.220342414 -0700
@@ -2463,7 +2463,7 @@ bed:
 	return 0;
 }
 
-static struct net_proto_family irda_family_ops = {
+static const struct net_proto_family irda_family_ops = {
 	.family = PF_IRDA,
 	.create = irda_create,
 	.owner	= THIS_MODULE,
--- a/net/iucv/af_iucv.c	2009-10-02 16:20:02.460301560 -0700
+++ b/net/iucv/af_iucv.c	2009-10-02 16:20:43.220342414 -0700
@@ -1715,7 +1715,7 @@ static const struct proto_ops iucv_sock_
 	.getsockopt	= iucv_sock_getsockopt,
 };
 
-static struct net_proto_family iucv_sock_family_ops = {
+static const struct net_proto_family iucv_sock_family_ops = {
 	.family	= AF_IUCV,
 	.owner	= THIS_MODULE,
 	.create	= iucv_sock_create,
--- a/net/key/af_key.c	2009-10-01 19:03:16.477349034 -0700
+++ b/net/key/af_key.c	2009-10-02 16:20:43.230306460 -0700
@@ -3644,7 +3644,7 @@ static const struct proto_ops pfkey_ops 
 	.recvmsg	=	pfkey_recvmsg,
 };
 
-static struct net_proto_family pfkey_family_ops = {
+static const struct net_proto_family pfkey_family_ops = {
 	.family	=	PF_KEY,
 	.create	=	pfkey_create,
 	.owner	=	THIS_MODULE,
--- a/net/llc/af_llc.c	2009-10-02 16:20:02.460301560 -0700
+++ b/net/llc/af_llc.c	2009-10-02 16:20:43.230306460 -0700
@@ -1092,7 +1092,7 @@ out:
 	return rc;
 }
 
-static struct net_proto_family llc_ui_family_ops = {
+static const struct net_proto_family llc_ui_family_ops = {
 	.family = PF_LLC,
 	.create = llc_ui_create,
 	.owner	= THIS_MODULE,
--- a/net/netlink/af_netlink.c	2009-10-02 16:20:02.460301560 -0700
+++ b/net/netlink/af_netlink.c	2009-10-02 16:20:43.230306460 -0700
@@ -2050,7 +2050,7 @@ static const struct proto_ops netlink_op
 	.sendpage =	sock_no_sendpage,
 };
 
-static struct net_proto_family netlink_family_ops = {
+static const struct net_proto_family netlink_family_ops = {
 	.family = PF_NETLINK,
 	.create = netlink_create,
 	.owner	= THIS_MODULE,	/* for consistency 8) */
--- a/net/netrom/af_netrom.c	2009-10-02 16:20:02.460301560 -0700
+++ b/net/netrom/af_netrom.c	2009-10-02 16:20:43.230306460 -0700
@@ -1372,7 +1372,7 @@ static const struct file_operations nr_i
 };
 #endif	/* CONFIG_PROC_FS */
 
-static struct net_proto_family nr_family_ops = {
+static const struct net_proto_family nr_family_ops = {
 	.family		=	PF_NETROM,
 	.create		=	nr_create,
 	.owner		=	THIS_MODULE,
--- a/net/packet/af_packet.c	2009-10-02 16:20:02.460301560 -0700
+++ b/net/packet/af_packet.c	2009-10-02 16:20:43.230306460 -0700
@@ -2363,7 +2363,7 @@ static const struct proto_ops packet_ops
 	.sendpage =	sock_no_sendpage,
 };
 
-static struct net_proto_family packet_family_ops = {
+static const struct net_proto_family packet_family_ops = {
 	.family =	PF_PACKET,
 	.create =	packet_create,
 	.owner	=	THIS_MODULE,
--- a/net/phonet/af_phonet.c	2009-10-01 19:03:16.674395303 -0700
+++ b/net/phonet/af_phonet.c	2009-10-02 16:20:43.230306460 -0700
@@ -118,7 +118,7 @@ out:
 	return err;
 }
 
-static struct net_proto_family phonet_proto_family = {
+static const struct net_proto_family phonet_proto_family = {
 	.family = PF_PHONET,
 	.create = pn_socket_create,
 	.owner = THIS_MODULE,
--- a/net/rds/af_rds.c	2009-10-02 16:20:02.460301560 -0700
+++ b/net/rds/af_rds.c	2009-10-02 16:20:43.230306460 -0700
@@ -431,7 +431,7 @@ void rds_sock_put(struct rds_sock *rs)
 	sock_put(rds_rs_to_sk(rs));
 }
 
-static struct net_proto_family rds_family_ops = {
+static const struct net_proto_family rds_family_ops = {
 	.family =	AF_RDS,
 	.create =	rds_create,
 	.owner	=	THIS_MODULE,
--- a/net/rose/af_rose.c	2009-10-02 16:20:02.460301560 -0700
+++ b/net/rose/af_rose.c	2009-10-02 16:20:43.230306460 -0700
@@ -1509,7 +1509,7 @@ static const struct file_operations rose
 };
 #endif	/* CONFIG_PROC_FS */
 
-static struct net_proto_family rose_family_ops = {
+static const struct net_proto_family rose_family_ops = {
 	.family		=	PF_ROSE,
 	.create		=	rose_create,
 	.owner		=	THIS_MODULE,
--- a/net/rxrpc/af_rxrpc.c	2009-10-02 16:20:02.460301560 -0700
+++ b/net/rxrpc/af_rxrpc.c	2009-10-02 16:20:43.230306460 -0700
@@ -777,7 +777,7 @@ static struct proto rxrpc_proto = {
 	.max_header	= sizeof(struct rxrpc_header),
 };
 
-static struct net_proto_family rxrpc_family_ops = {
+static const struct net_proto_family rxrpc_family_ops = {
 	.family	= PF_RXRPC,
 	.create = rxrpc_create,
 	.owner	= THIS_MODULE,
--- a/net/unix/af_unix.c	2009-10-01 19:03:16.518348870 -0700
+++ b/net/unix/af_unix.c	2009-10-02 16:20:43.230306460 -0700
@@ -2214,7 +2214,7 @@ static const struct file_operations unix
 
 #endif
 
-static struct net_proto_family unix_family_ops = {
+static const struct net_proto_family unix_family_ops = {
 	.family = PF_UNIX,
 	.create = unix_create,
 	.owner	= THIS_MODULE,
--- a/net/x25/af_x25.c	2009-10-02 16:20:02.470281669 -0700
+++ b/net/x25/af_x25.c	2009-10-02 16:20:43.230306460 -0700
@@ -1476,7 +1476,7 @@ static int x25_ioctl(struct socket *sock
 	return rc;
 }
 
-static struct net_proto_family x25_family_ops = {
+static const struct net_proto_family x25_family_ops = {
 	.family =	AF_X25,
 	.create =	x25_create,
 	.owner	=	THIS_MODULE,
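
The pattern applied throughout this diff can be sketched in plain C as follows. This is a simplified user-space analogue, not kernel code: `struct proto_family`, `family_register()`, `family_create()` and the demo values are hypothetical names introduced only for illustration. The point is that once the registration function takes a pointer to const, every ops table can itself be declared const.

```c
#include <stddef.h>

#define MAX_FAMILIES 4  /* hypothetical table size */

/* Simplified stand-in for struct net_proto_family. */
struct proto_family {
	int family;
	int (*create)(int protocol);
};

/* The registry holds pointers to const, so it cannot modify an entry. */
static const struct proto_family *families[MAX_FAMILIES];

/* Hypothetical analogue of a register call that accepts a const pointer. */
static int family_register(const struct proto_family *ops)
{
	if (ops == NULL || ops->family < 0 || ops->family >= MAX_FAMILIES)
		return -1;
	if (families[ops->family] != NULL)
		return -2;	/* already registered */
	families[ops->family] = ops;
	return 0;
}

/* Look up a registered family and call its create hook. */
static int family_create(int family, int protocol)
{
	if (family < 0 || family >= MAX_FAMILIES || families[family] == NULL)
		return -1;
	return families[family]->create(protocol);
}

static int demo_create(int protocol)
{
	return protocol + 100;
}

/* Declared const: the compiler may place this table in read-only data. */
static const struct proto_family demo_family_ops = {
	.family = 2,
	.create = demo_create,
};
```

Registering `&demo_family_ops` compiles without a cast only because the parameter is const-qualified, which is why the diff also changes the `bt_sock_register()` prototype and the `bt_proto[]` array.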


^ permalink raw reply

* Re: [PATCH 1/1] net: mark net_proto_ops as const
From: Alexey Dobriyan @ 2009-10-03  0:00 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Bernard Pidoux F6BVP, Dan Carpenter, Andy Grover, Henner Eisen,
	Christine Caulfield, Ursula Braun, James Morris, David Howells,
	Hendrik Brueckner, Matthias Urlichs, Denis V. Lunev,
	linux-afs-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Alan Cox,
	Harvey Harrison, Chas Williams, linux-s390-u79uwXL29TY76Z2rM5mHXA,
	Mark Smith, Pekka Savola (ipv6), Eric Dumazet, Huang Weiyi,
	James Chapman, Johan Hedberg, Arnaldo 
In-Reply-To: <20091002232721.385245919-ZtmgI6mnKB3QT0dZR+AlfA@public.gmane.org>

On Fri, Oct 02, 2009 at 04:25:21PM -0700, Stephen Hemminger wrote:
> --- a/net/can/af_can.c
> +++ b/net/can/af_can.c
> @@ -842,7 +842,7 @@ static struct packet_type can_packet __r
>  	.func = can_rcv,
>  };
>  
> -static struct net_proto_family can_family_ops __read_mostly = {
> +static const struct net_proto_family can_family_ops __read_mostly = {
						       ^^^^^^^^^^^^^
>  	.family = PF_CAN,
>  	.create = can_create,
>  	.owner  = THIS_MODULE,

ACK, except this chunk: const already means read-only.


^ permalink raw reply

* Re: [PATCH] TCPCT-1: adding a sysctl
From: William Allen Simpson @ 2009-10-03  0:32 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20091002.154808.137771153.davem@davemloft.net>

David Miller wrote:
> From: William Allen Simpson <william.allen.simpson@gmail.com>
>> Ummm, I was following the suggested practice of breaking it into smaller
>> pieces for review.  This is just the control functions and headers.  I've
>> actually completed most of the port, and am champing at the bit.
> 
> We can't review the helper functions and infrastructure properly until
> we can see how they are actually used.
> 
> Seeing how they are used shows us how well they are designed.
> 
> Otherwise asking for a review is absolutely pointless as we have no context
> in which to judge the code you're showing us.
> 
Thanks.  I'd hand-split my code into much smaller patches for review.
Now, I know there are patches that are *too* small....

I've merged the several things you've mentioned, and will post it soon
(after making sure it compiles and runs separately).

^ permalink raw reply

* Re: [PATCH 1/3] bonding: allow previous slave to be used when re-balancing traffic on tlb/alb interfaces
From: Jay Vosburgh @ 2009-10-03  1:13 UTC (permalink / raw)
  To: Andy Gospodarek; +Cc: netdev, David Miller
In-Reply-To: <1254269731-7341-2-git-send-email-fubar@us.ibm.com>

Jay Vosburgh <fubar@us.ibm.com> wrote:

>From: Andy Gospodarek <andy@greyhouse.net>
>
>When using tlb (mode 5) or alb (mode 6) bonding, a task runs every 10s
>and re-balances the output devices based on load.  I was trying to
>diagnose some connectivity issues and realized that a high-traffic host
>would often switch output interfaces every 10s.  I discovered this
>happened because the 'least loaded interface' was chosen as the next
>output interface for any given stream and quite often some lower load
>traffic would slip in and take the interface previously used by our
>stream.  This meant the 'least loaded interface' was no longer the one
>we used during the last interval.
>
>The switching of streams to another interface was not extremely helpful
>as it would force the destination host or router to update its ARP
>tables and produce some additional ARP traffic as the destination host
>verified that it was using the MAC address it expected.  Having the
>destination MAC for a given IP change every 10s seems undesirable.
>
>The decision was made to use the same slave during this interval if the
>current load on that interface was < 10.  A load of < 10 indicates that
>during the last 10s sample, roughly 100 bytes were sent by all streams
>currently assigned to that interface.  This essentially means the
>interface is unloaded, but allows for a few frames that will probably
>have minimal impact to slip into the same interface we were using in the
>past.
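
The threshold arithmetic described above can be checked with a small sketch. It reuses the `load_history` expression visible in the quoted diff context; the interval value of 10 is an assumption based on the "every 10s" description, and both function names are hypothetical.

```c
/* Assumed value: the quoted text says the rebalance task runs every 10s. */
#define REBALANCE_INTERVAL 10

/* Mirrors the load_history expression shown in the quoted diff context. */
static unsigned int load_history(unsigned int tx_bytes)
{
	return 1 + tx_bytes / REBALANCE_INTERVAL;
}

/* The patch treats a slave as essentially unloaded when its load is < 10. */
static int is_essentially_unloaded(unsigned int load)
{
	return load < 10;
}
```

Under these assumptions a single stream stays below the threshold as long as it sent fewer than 90 bytes in the interval, which is consistent with the "roughly 100 bytes" figure in the description.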

	Andy, I've been doing some further testing with this patch, and
I'm seeing some panics that I believe are related to this patch.  It
appears that the last_slave isn't cleared (or isn't cleared soon enough)
when a slave is released, and concurrent transmit activity is getting
into tlb_get_best_slave() and finding a last_slave pointer that is stale
(points to no slave currently on the slave list).

	This seems to reproduce fairly consistently when I set up alb
mode with two slaves, change the active slave so that alb mode moves the
MACs around, then release the inactive slave.  I run a concurrent "ping
-f" of some remote host.

	I added some code to tlb_clear_slave to set last_slave to NULL if
save_load is 0, but the problem still happened.  I think the race is
that bond_alb_deinit_slave is called with the bond->lock released, but
the slave has already been detached in bond_release, and concurrent
transmit activity gets in and looks up last_slave.
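
One way to picture the stale-pointer problem is a table walk that drops every reference to a departing slave, sketched below in user-space C. The struct and function names are hypothetical simplifications and this is not the fix that was eventually applied; in the kernel such a walk would presumably also have to run under whatever lock the transmit path takes, which is exactly the part the race above is about.

```c
#include <stddef.h>

#define HASH_SIZE 8  /* hypothetical; the real hash table is larger */

struct slave {
	int id;
};

/* Simplified stand-in for struct tlb_client_info. */
struct client_info {
	struct slave *tx_slave;
	struct slave *last_slave;
};

/* Clear both the current and the remembered reference to a slave. */
static void clear_slave_refs(struct client_info *tbl, size_t n,
			     const struct slave *gone)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tbl[i].tx_slave == gone)
			tbl[i].tx_slave = NULL;
		if (tbl[i].last_slave == gone)
			tbl[i].last_slave = NULL;
	}
}

/* Count the references to a slave that remain in the table. */
static size_t count_slave_refs(const struct client_info *tbl, size_t n,
			       const struct slave *s)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++) {
		if (tbl[i].tx_slave == s)
			count++;
		if (tbl[i].last_slave == s)
			count++;
	}
	return count;
}
```

If any entry still holds the old pointer after the slave is freed, a later lookup of last_slave dereferences freed memory, which matches the panics described here.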

	I'm out of time for today, so I'll look at this more on Monday
if I haven't heard back from you.

	-J

>Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
>Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
>
>---
> drivers/net/bonding/bond_alb.c |   21 ++++++++++++++++++++-
> drivers/net/bonding/bond_alb.h |    4 ++++
> 2 files changed, 24 insertions(+), 1 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index 9b5936f..cf2842e 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
>@@ -150,6 +150,7 @@ static inline void tlb_init_table_entry(struct tlb_client_info *entry, int save_
> 		entry->load_history = 1 + entry->tx_bytes /
> 				      BOND_TLB_REBALANCE_INTERVAL;
> 		entry->tx_bytes = 0;
>+		entry->last_slave = entry->tx_slave;
> 	}
>
> 	entry->tx_slave = NULL;
>@@ -270,6 +271,24 @@ static struct slave *tlb_get_least_loaded_slave(struct bonding *bond)
> 	return least_loaded;
> }
>
>+/* Caller must hold bond lock for read and hashtbl lock */
>+static struct slave *tlb_get_best_slave(struct bonding *bond, u32 hash_index)
>+{
>+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+	struct tlb_client_info *tx_hash_table = bond_info->tx_hashtbl;
>+	struct slave *last_slave = tx_hash_table[hash_index].last_slave;
>+	struct slave *next_slave = NULL;
>+
>+	if (last_slave && SLAVE_IS_OK(last_slave)) {
>+		/* Use the last slave listed in the tx hashtbl if:
>+		   the last slave currently is essentially unloaded. */
>+		if (SLAVE_TLB_INFO(last_slave).load < 10)
>+			next_slave = last_slave;
>+	}
>+
>+	return next_slave ? next_slave : tlb_get_least_loaded_slave(bond);
>+}
>+
> /* Caller must hold bond lock for read */
> static struct slave *tlb_choose_channel(struct bonding *bond, u32 hash_index, u32 skb_len)
> {
>@@ -282,7 +301,7 @@ static struct slave *tlb_choose_channel(struct bonding *bond, u32 hash_index, u3
> 	hash_table = bond_info->tx_hashtbl;
> 	assigned_slave = hash_table[hash_index].tx_slave;
> 	if (!assigned_slave) {
>-		assigned_slave = tlb_get_least_loaded_slave(bond);
>+		assigned_slave = tlb_get_best_slave(bond, hash_index);
>
> 		if (assigned_slave) {
> 			struct tlb_slave_info *slave_info =
>diff --git a/drivers/net/bonding/bond_alb.h b/drivers/net/bonding/bond_alb.h
>index 50968f8..b65fd29 100644
>--- a/drivers/net/bonding/bond_alb.h
>+++ b/drivers/net/bonding/bond_alb.h
>@@ -36,6 +36,10 @@ struct tlb_client_info {
> 				 * packets to a Client that the Hash function
> 				 * gave this entry index.
> 				 */
>+	struct slave *last_slave; /* Pointer to last slave used for transmitting
>+				 * packets to a Client that the Hash function
>+				 * gave this entry index.
>+				 */
> 	u32 tx_bytes;		/* Each Client acumulates the BytesTx that
> 				 * were tranmitted to it, and after each
> 				 * CallBack the LoadHistory is devided
>-- 
>1.6.0.2
>

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* clownix_spy: qdisc monitor + generic kernel variable plotter
From: clownix @ 2009-10-02 18:15 UTC (permalink / raw)
  To: netdev

At http://clownix.net, there is a qdisc monitor based on a sched qdisc
named "spy" and a module that periodically sends the enqueue, dequeue,
drop, queue-size and delay figures to user space (through a netlink
socket), where they are plotted with GTK.

With a few lines written in a module, any kernel variable can be
plotted.

Note that the name of this software package is "clownix_spy", and not
cloonix_net which is another project on the same site.
Regards 
Vincent Perrier

^ permalink raw reply

* [PATCH] pktgen: Fix multiqueue handling
From: Eric Dumazet @ 2009-10-03  6:24 UTC (permalink / raw)
  To: David S. Miller; +Cc: Robert Olsson, Linux Netdev List, Stephen Hemminger

Note: I could not really test this patch; I don't have multiqueue hardware yet.

I found this by code inspection; please double-check. Thanks.

[PATCH] pktgen: Fix multiqueue handling

It is not currently possible to instruct pktgen to use one selected tx queue.

When Robert added multiqueue support in commit 45b270f8, he added
an interval (queue_map_min, queue_map_max), and his code doesn't take
into account the case of min = max, to select one tx queue exactly.

I suspect a high performance setup on an eight txqueue device wants
to use exactly eight cpus, and assign one tx queue to each sender.

This patch makes pktgen select the right tx queue, not the first one.

Also updates Documentation to reflect Robert's changes.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 Documentation/networking/pktgen.txt |    8 ++++++++
 net/core/pktgen.c                   |    2 +-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt
index c6cf4a3..61bb645 100644
--- a/Documentation/networking/pktgen.txt
+++ b/Documentation/networking/pktgen.txt
@@ -90,6 +90,11 @@ Examples:
  pgset "dstmac 00:00:00:00:00:00"    sets MAC destination address
  pgset "srcmac 00:00:00:00:00:00"    sets MAC source address
 
+ pgset "queue_map_min 0" Sets the min value of tx queue interval
+ pgset "queue_map_max 7" Sets the max value of tx queue interval, for multiqueue devices
+                         To select queue 1 of a given device,
+                         use queue_map_min=1 and queue_map_max=1
+
  pgset "src_mac_count 1" Sets the number of MACs we'll range through.  
                          The 'minimum' MAC is what you set with srcmac.
 
@@ -101,6 +106,9 @@ Examples:
                               IPDST_RND, UDPSRC_RND,
                               UDPDST_RND, MACSRC_RND, MACDST_RND 
                               MPLS_RND, VID_RND, SVID_RND
+                              QUEUE_MAP_RND # queue map random
+                              QUEUE_MAP_CPU # queue map mirrors smp_processor_id()
+
 
  pgset "udp_src_min 9"   set UDP source port min, If < udp_src_max, then
                          cycle through the port range.
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index b694552..421857c 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -2212,7 +2212,7 @@ static void set_cur_queue_map(struct pktgen_dev *pkt_dev)
 	if (pkt_dev->flags & F_QUEUE_MAP_CPU)
 		pkt_dev->cur_queue_map = smp_processor_id();
 
-	else if (pkt_dev->queue_map_min < pkt_dev->queue_map_max) {
+	else if (pkt_dev->queue_map_min <= pkt_dev->queue_map_max) {
 		__u16 t;
 		if (pkt_dev->flags & F_QUEUE_MAP_RND) {
 			t = random32() %

^ permalink raw reply related
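
A minimal userspace sketch of the queue selection discussed in the pktgen
patch above, assuming round-robin mode only (no QUEUE_MAP_RND or
QUEUE_MAP_CPU). struct toy_pktgen, toy_next_queue() and the "inclusive"
switch are hypothetical names for illustration; only the "<" versus "<="
comparison comes from the patch. The lower-bound check is an addition of
this sketch, not something the patch shows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the three pktgen_dev fields involved (hypothetical). */
struct toy_pktgen {
	uint16_t queue_map_min;
	uint16_t queue_map_max;
	uint16_t cur_queue_map;
};

/*
 * Advance the round-robin tx queue selection and return the new queue.
 * inclusive == 0 models the old "min < max" test,
 * inclusive != 0 models the patched "min <= max" test.
 */
static uint16_t toy_next_queue(struct toy_pktgen *p, int inclusive)
{
	int in_range;

	assert(p != NULL);

	in_range = inclusive ? (p->queue_map_min <= p->queue_map_max)
			     : (p->queue_map_min < p->queue_map_max);
	if (in_range) {
		uint16_t t = (uint16_t)(p->cur_queue_map + 1);

		/* Wrap back to min when leaving the interval.  The lower
		 * bound check is an assumption added by this sketch. */
		if (t > p->queue_map_max || t < p->queue_map_min)
			t = p->queue_map_min;
		p->cur_queue_map = t;
	}
	return p->cur_queue_map;
}
```

With queue_map_min = queue_map_max = 1 and a starting value of 0, the strict
comparison leaves the selection at queue 0, while the inclusive one moves it
to queue 1 and keeps it there.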

* Re: [PATCH] TCPCT-1: adding a sysctl
From: David Miller @ 2009-10-03  6:26 UTC (permalink / raw)
  To: william.allen.simpson; +Cc: netdev
In-Reply-To: <4AC69B89.5010604@gmail.com>

From: William Allen Simpson <william.allen.simpson@gmail.com>
Date: Fri, 02 Oct 2009 20:32:09 -0400

> David Miller wrote:
>> From: William Allen Simpson <william.allen.simpson@gmail.com>
>>> Ummm, I was following the suggested practice of breaking it into
>>> smaller
>>> pieces for review.  This is just the control functions and headers.
>>> I've
>>> actually completed most of the port, and am champing at the bit.
>> We can't review the helper functions and infrastructure properly until
>> we can see how they are actually used.
>> Seeing how they are used shows us how well they are designed.
>> Otherwise asking for a review is absolutely pointless as we have no context
>> in which to judge the code you're showing us.
>> 
> Thanks.  I'd hand-split my code into much smaller patches for review.
> Now, I know there are patches that are *too* small....

It's not that the patches are too small, you totally misunderstand me.

The problem is that you have to post all of the patches as a set which
can be reviewed as a unit.  The ones that use the new functions as well
as the ones that add them.

It also helps if you mention in the commit message things like
"These helper functions will be used in a subsequent change which
does ..."

^ permalink raw reply

* Re: clownix_spy: qdisc monitor + generic kernel variable plotter
From: David Miller @ 2009-10-03  6:30 UTC (permalink / raw)
  To: clownix; +Cc: netdev
In-Reply-To: <1254507348.5236.3.camel@localhost>

Please stop posting this message over and over again.  I've seen this
posting 4 or 5 times already, including both your linux-kernel and
netdev posts.

^ permalink raw reply

* Re: [BUG net-2.6] bluetooth/rfcomm : sleeping function called from invalid context at mm/slub.c:1719
From: Dave Young @ 2009-10-03  7:06 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Marcel Holtmann, Linux Netdev List, linux-bluetooth
In-Reply-To: <4AC6247E.7050308@hartkopp.net>

On Fri, Oct 02, 2009 at 06:04:14PM +0200, Oliver Hartkopp wrote:
> Dave Young wrote:
> > On Fri, Oct 2, 2009 at 2:28 PM, Oliver Hartkopp <oliver@hartkopp.net> wrote:
> >> Hello Marcel,
> >>
> >> with current net-2.6 tree ...
> >>
> >> While starting my PPP Bluetooth dialup networking, i got this:
> > 
> > Hi, oliver
> > 
> > please try following patch:
> > http://patchwork.kernel.org/patch/51326/
> 
> Hi Dave,
> 
> that fixed it at ppp startup!
> 
> Tested-by: Oliver Hartkopp <oliver@hartkopp.net>
> 
> Btw. when shutting down the ppp connection i still get this:
> 
> [  361.996887] INFO: trying to register non-static key.
> [  361.996897] the code is fine but needs lockdep annotation.
> [  361.996902] turning off the locking correctness validator.
> [  361.996912] Pid: 0, comm: swapper Not tainted 2.6.31-08939-gdb8abec-dirty #22
> [  361.996919] Call Trace:
> [  361.996933]  [<c12e4fb2>] ? printk+0xf/0x11
> [  361.996947]  [<c1042214>] register_lock_class+0x5a/0x295
> [  361.996957]  [<c1043af2>] __lock_acquire+0x9b/0xc03
> [  361.996967]  [<c104464b>] ? __lock_acquire+0xbf4/0xc03
> [  361.996985]  [<fa59a168>] ? l2cap_get_chan_by_scid+0x35/0x43 [l2cap]
> [  361.996995]  [<c104491f>] ? lock_release_non_nested+0x17b/0x1db
> [  361.997008]  [<fa59a168>] ? l2cap_get_chan_by_scid+0x35/0x43 [l2cap]
> [  361.997018]  [<c10426fd>] ? trace_hardirqs_off+0xb/0xd
> [  361.997028]  [<c10446b6>] lock_acquire+0x5c/0x73
> [  361.997039]  [<c124cd14>] ? skb_dequeue+0x12/0x4c
> [  361.997049]  [<c12e6e23>] _spin_lock_irqsave+0x24/0x34
> [  361.997058]  [<c124cd14>] ? skb_dequeue+0x12/0x4c
> [  361.997066]  [<c124cd14>] skb_dequeue+0x12/0x4c
> [  361.997075]  [<c124d579>] skb_queue_purge+0x14/0x1b
> [  361.997088]  [<fa59ce3f>] l2cap_recv_frame+0xe9e/0x129a [l2cap]
> [  361.997099]  [<c10421d1>] ? register_lock_class+0x17/0x295
> [  361.997110]  [<c104464b>] ? __lock_acquire+0xbf4/0xc03
> [  361.997128]  [<c104464b>] ? __lock_acquire+0xbf4/0xc03
> [  361.997139]  [<c120de74>] ? uhci_giveback_urb+0xf2/0x162
> [  361.997163]  [<f8bb4c45>] ? hci_rx_task+0xfe/0x1f8 [bluetooth]
> [  361.997177]  [<fa59d2e4>] l2cap_recv_acldata+0xa9/0x1be [l2cap]
> [  361.997190]  [<fa59d23b>] ? l2cap_recv_acldata+0x0/0x1be [l2cap]
> [  361.997208]  [<f8bb4c77>] hci_rx_task+0x130/0x1f8 [bluetooth]
> [  361.997219]  [<c102a098>] tasklet_action+0x6b/0xb2
> [  361.997228]  [<c102a46b>] __do_softirq+0x82/0x101
> [  361.997237]  [<c102a515>] do_softirq+0x2b/0x43
> [  361.997246]  [<c102a619>] irq_exit+0x35/0x68
> [  361.997256]  [<c1004513>] do_IRQ+0x80/0x96
> [  361.997265]  [<c10030ae>] common_interrupt+0x2e/0x34
> [  361.997275]  [<c104007b>] ? tick_device_uses_broadcast+0x71/0x7c
> [  361.997286]  [<c11747a8>] ? acpi_idle_enter_simple+0x103/0x12e
> [  361.997296]  [<c1174515>] acpi_idle_enter_bm+0xc3/0x253
> [  361.997306]  [<c1238b6f>] cpuidle_idle_call+0x60/0x91
> [  361.997315]  [<c1001d44>] cpu_idle+0x49/0x65
> [  361.997324]  [<c12e2f0e>] start_secondary+0x190/0x195
> 
> 
> Thanks,
> Oliver
> 

Oliver, does the following patch fix the non-static lock problem?
--

Currently the l2cap conn locks are initialized after the l2cap conn timer
is set up, which introduces the following problem:

[  361.996887] INFO: trying to register non-static key.
[  361.996897] the code is fine but needs lockdep annotation.
[  361.996902] turning off the locking correctness validator.
[  361.996912] Pid: 0, comm: swapper Not tainted 2.6.31-08939-gdb8abec-dirty #22
[  361.996919] Call Trace:
[  361.996933]  [<c12e4fb2>] ? printk+0xf/0x11
[  361.996947]  [<c1042214>] register_lock_class+0x5a/0x295
[  361.996957]  [<c1043af2>] __lock_acquire+0x9b/0xc03
[  361.996967]  [<c104464b>] ? __lock_acquire+0xbf4/0xc03
[  361.996985]  [<fa59a168>] ? l2cap_get_chan_by_scid+0x35/0x43 [l2cap]
[  361.996995]  [<c104491f>] ? lock_release_non_nested+0x17b/0x1db
[  361.997008]  [<fa59a168>] ? l2cap_get_chan_by_scid+0x35/0x43 [l2cap]
[  361.997018]  [<c10426fd>] ? trace_hardirqs_off+0xb/0xd
[  361.997028]  [<c10446b6>] lock_acquire+0x5c/0x73
[  361.997039]  [<c124cd14>] ? skb_dequeue+0x12/0x4c
[  361.997049]  [<c12e6e23>] _spin_lock_irqsave+0x24/0x34
[  361.997058]  [<c124cd14>] ? skb_dequeue+0x12/0x4c
[  361.997066]  [<c124cd14>] skb_dequeue+0x12/0x4c
[  361.997075]  [<c124d579>] skb_queue_purge+0x14/0x1b
[  361.997088]  [<fa59ce3f>] l2cap_recv_frame+0xe9e/0x129a [l2cap]
[  361.997099]  [<c10421d1>] ? register_lock_class+0x17/0x295
[  361.997110]  [<c104464b>] ? __lock_acquire+0xbf4/0xc03
[  361.997128]  [<c104464b>] ? __lock_acquire+0xbf4/0xc03
[  361.997139]  [<c120de74>] ? uhci_giveback_urb+0xf2/0x162
[  361.997163]  [<f8bb4c45>] ? hci_rx_task+0xfe/0x1f8 [bluetooth]
[  361.997177]  [<fa59d2e4>] l2cap_recv_acldata+0xa9/0x1be [l2cap]
[  361.997190]  [<fa59d23b>] ? l2cap_recv_acldata+0x0/0x1be [l2cap]
[  361.997208]  [<f8bb4c77>] hci_rx_task+0x130/0x1f8 [bluetooth]
[  361.997219]  [<c102a098>] tasklet_action+0x6b/0xb2
[  361.997228]  [<c102a46b>] __do_softirq+0x82/0x101
[  361.997237]  [<c102a515>] do_softirq+0x2b/0x43
[  361.997246]  [<c102a619>] irq_exit+0x35/0x68
[  361.997256]  [<c1004513>] do_IRQ+0x80/0x96
[  361.997265]  [<c10030ae>] common_interrupt+0x2e/0x34
[  361.997275]  [<c104007b>] ? tick_device_uses_broadcast+0x71/0x7c
[  361.997286]  [<c11747a8>] ? acpi_idle_enter_simple+0x103/0x12e
[  361.997296]  [<c1174515>] acpi_idle_enter_bm+0xc3/0x253
[  361.997306]  [<c1238b6f>] cpuidle_idle_call+0x60/0x91
[  361.997315]  [<c1001d44>] cpu_idle+0x49/0x65
[  361.997324]  [<c12e2f0e>] start_secondary+0x190/0x195

Move the lock initialization before setup_timer to avoid using
uninitialized locks.

Reported-by: Oliver Hartkopp <oliver@hartkopp.net>
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
---
net/bluetooth/l2cap.c |    6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

--- linux-2.6.31.orig/net/bluetooth/l2cap.c	2009-09-30 16:36:10.000000000 +0800
+++ linux-2.6.31/net/bluetooth/l2cap.c	2009-10-03 14:44:51.000000000 +0800
@@ -555,12 +555,12 @@ static struct l2cap_conn *l2cap_conn_add
 
 	conn->feat_mask = 0;
 
-	setup_timer(&conn->info_timer, l2cap_info_timeout,
-						(unsigned long) conn);
-
 	spin_lock_init(&conn->lock);
 	rwlock_init(&conn->chan_list.lock);
 
+	setup_timer(&conn->info_timer, l2cap_info_timeout,
+						(unsigned long) conn);
+
 	conn->disc_reason = 0x13;
 
 	return conn;

^ permalink raw reply

* Re: [Bug #14301] WARNING: at net/ipv4/af_inet.c:154
From: Eric Dumazet @ 2009-10-03  8:36 UTC (permalink / raw)
  To: Rafael J. Wysocki, Ralf Hildebrandt
  Cc: Linux Kernel Mailing List, Kernel Testers List, Herbert Xu,
	Linux Netdev List, Wei Yongjun, David S. Miller
In-Reply-To: <COE24pZSBH.A.mdH.sMTxKB@chimera>

Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.30 and 2.6.31.
> 
> The following bug entry is on the current list of known regressions
> introduced between 2.6.30 and 2.6.31.  Please verify if it still should
> be listed and let me know (either way).
> 
> 
> Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=14301
> Subject		: WARNING: at net/ipv4/af_inet.c:154
> Submitter	: Ralf Hildebrandt <Ralf.Hildebrandt-jq1tPX9l7E6ELgA04lAiVw@public.gmane.org>
> Date		: 2009-09-30 12:24 (2 days old)
> References	: http://marc.info/?l=linux-kernel&m=125431350218137&w=4
> 
> 

If commit d99927f4d93f36553699573b279e0ff98ad7dea6
(net: Fix sock_wfree() race) doesn't fix this problem, then
maybe we should take a look at an old patch.

< data mining... running... output results to lkml/netdev >

Random guesses

 1) : commit d55d87fdff8252d0e2f7c28c2d443aee17e9d70f
(net: Move rx skb_orphan call to where needed)

A similar problem on SCTP was fixed by commit 
1bc4ee4088c9a502db0e9c87f675e61e57fa1734
(sctp: fix warning at inet_sock_destruct() while release sctp socket)

2) CORK and UDP sockets
  It seems we can leave a UDP socket with a frame in sk_write_queue.
  This queue is purged by udp_flush_pending_frames(),
   which calls ip_flush_pending_frames().
   But that function only calls kfree_skb(), not sk_wmem_free_skb()...


Could you try the following patch?

Thanks

[PATCH] net: UDP should not use ip_flush_pending_frames()

Now that transmitted UDP messages are charged, we must take care to call
the right skb freeing function.

If a close() is performed on a socket where a CORKED frame
is still queued in sk_write_queue, calling ip_flush_pending_frames()
leads to a sk_forward_alloc leak.

Reported-by: Ralf Hildebrandt <Ralf.Hildebrandt-jq1tPX9l7E6ELgA04lAiVw@public.gmane.org>
Signed-off-by: Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 include/net/sock.h  |   10 ++++++++++
 include/net/tcp.h   |   10 ----------
 net/ipv4/tcp.c      |    2 +-
 net/ipv4/tcp_ipv4.c |    2 +-
 net/ipv4/udp.c      |    2 +-
 5 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 1621935..7c80fec 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -882,6 +882,16 @@ static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
 	__kfree_skb(skb);
 }
 
+/* write queue abstraction */
+static inline void sk_write_queue_purge(struct sock *sk)
+{
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(&sk->sk_write_queue)) != NULL)
+		sk_wmem_free_skb(sk, skb);
+	sk_mem_reclaim(sk);
+}
+
 /* Used by processes to "lock" a socket state, so that
  * interrupts and bottom half handlers won't change it
  * from under us. It essentially blocks any incoming
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 03a49c7..4c7036a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1220,16 +1220,6 @@ static inline void		tcp_put_md5sig_pool(void)
 	put_cpu();
 }
 
-/* write queue abstraction */
-static inline void tcp_write_queue_purge(struct sock *sk)
-{
-	struct sk_buff *skb;
-
-	while ((skb = __skb_dequeue(&sk->sk_write_queue)) != NULL)
-		sk_wmem_free_skb(sk, skb);
-	sk_mem_reclaim(sk);
-}
-
 static inline struct sk_buff *tcp_write_queue_head(struct sock *sk)
 {
 	return skb_peek(&sk->sk_write_queue);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 64d0af6..0124f5b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1992,7 +1992,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 
 	tcp_clear_xmit_timers(sk);
 	__skb_queue_purge(&sk->sk_receive_queue);
-	tcp_write_queue_purge(sk);
+	sk_write_queue_purge(sk);
 	__skb_queue_purge(&tp->out_of_order_queue);
 #ifdef CONFIG_NET_DMA
 	__skb_queue_purge(&sk->sk_async_wait_queue);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7cda24b..76e59df 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1845,7 +1845,7 @@ void tcp_v4_destroy_sock(struct sock *sk)
 	tcp_cleanup_congestion_control(sk);
 
 	/* Cleanup up the write buffer. */
-	tcp_write_queue_purge(sk);
+	sk_write_queue_purge(sk);
 
 	/* Cleans up our, hopefully empty, out_of_order_queue. */
 	__skb_queue_purge(&tp->out_of_order_queue);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 6ec6a8a..58007d1 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -464,7 +464,7 @@ void udp_flush_pending_frames(struct sock *sk)
 	if (up->pending) {
 		up->len = 0;
 		up->pending = 0;
-		ip_flush_pending_frames(sk);
+		sk_write_queue_purge(sk);
 	}
 }
 EXPORT_SYMBOL(udp_flush_pending_frames);

^ permalink raw reply related

* Re: [Bug #14301] WARNING: at net/ipv4/af_inet.c:154
From: Eric Dumazet @ 2009-10-03  8:52 UTC (permalink / raw)
  To: Rafael J. Wysocki, Ralf Hildebrandt
  Cc: Linux Kernel Mailing List, Kernel Testers List, Herbert Xu,
	Linux Netdev List, Wei Yongjun, David S. Miller
In-Reply-To: <4AC70D20.4060009-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Eric Dumazet wrote:
> Rafael J. Wysocki wrote:
>> This message has been generated automatically as a part of a report
>> of regressions introduced between 2.6.30 and 2.6.31.
>>
>> The following bug entry is on the current list of known regressions
>> introduced between 2.6.30 and 2.6.31.  Please verify if it still should
>> be listed and let me know (either way).
>>
>>
>> Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=14301
>> Subject		: WARNING: at net/ipv4/af_inet.c:154
>> Submitter	: Ralf Hildebrandt <Ralf.Hildebrandt-jq1tPX9l7E6ELgA04lAiVw@public.gmane.org>
>> Date		: 2009-09-30 12:24 (2 days old)
>> References	: http://marc.info/?l=linux-kernel&m=125431350218137&w=4
>>
>>
> 
> If commit d99927f4d93f36553699573b279e0ff98ad7dea6
> (net: Fix sock_wfree() race) doesnt fix this problem, then
> maybe we should take a look at an old patch.
> 
> < data mining... running... output results to lkml/netdev >
> 
> Random guesses
> 
>  1) : commit d55d87fdff8252d0e2f7c28c2d443aee17e9d70f
> (net: Move rx skb_orphan call to where needed)
> 
> A similar problem on SCTP was fixed by commit 
> 1bc4ee4088c9a502db0e9c87f675e61e57fa1734
> (sctp: fix warning at inet_sock_destruct() while release sctp socket)
> 
> 2) CORK and UDP sockets
>   It seems we can leave an UDP socket with a frame in sk_write_queue
>   Purge of this queue is done by udp_flush_pending_frames()
>    This calls ip_flush_pending_frames()
>    But this function only calls kfree_skb(), not sk_wmem_free_skb()...
> 
> 
> Could you try following patch ?
> 

Hmm, I missed the ip_cork_release(), here is an updated version.


[PATCH] net: UDP should not use ip_flush_pending_frames()

Now that transmitted UDP messages are charged, we must take care to call
the right skb freeing function.

If a close() is performed on a socket where a CORKED frame
is still queued in sk_write_queue, calling ip_flush_pending_frames()
leads to a sk_forward_alloc leak.

Fix this by calling sk_write_queue_purge() and ip_cork_release()
instead.

Reported-by: Ralf Hildebrandt <Ralf.Hildebrandt-jq1tPX9l7E6ELgA04lAiVw@public.gmane.org>
Signed-off-by: Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 include/net/ip.h    |    1 +
 include/net/sock.h  |   10 ++++++++++
 include/net/tcp.h   |   10 ----------
 net/ipv4/tcp.c      |    2 +-
 net/ipv4/tcp_ipv4.c |    2 +-
 net/ipv4/udp.c      |    3 ++-
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 2f47e54..c8d8828 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -117,6 +117,7 @@ extern int		ip_generic_getfrag(void *from, char *to, int offset, int len, int od
 extern ssize_t		ip_append_page(struct sock *sk, struct page *page,
 				int offset, size_t size, int flags);
 extern int		ip_push_pending_frames(struct sock *sk);
+extern void     ip_cork_release(struct inet_sock *);
 extern void		ip_flush_pending_frames(struct sock *sk);
 
 /* datagram.c */
diff --git a/include/net/sock.h b/include/net/sock.h
index 1621935..7c80fec 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -882,6 +882,16 @@ static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
 	__kfree_skb(skb);
 }
 
+/* write queue abstraction */
+static inline void sk_write_queue_purge(struct sock *sk)
+{
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(&sk->sk_write_queue)) != NULL)
+		sk_wmem_free_skb(sk, skb);
+	sk_mem_reclaim(sk);
+}
+
 /* Used by processes to "lock" a socket state, so that
  * interrupts and bottom half handlers won't change it
  * from under us. It essentially blocks any incoming
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 03a49c7..4c7036a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1220,16 +1220,6 @@ static inline void		tcp_put_md5sig_pool(void)
 	put_cpu();
 }
 
-/* write queue abstraction */
-static inline void tcp_write_queue_purge(struct sock *sk)
-{
-	struct sk_buff *skb;
-
-	while ((skb = __skb_dequeue(&sk->sk_write_queue)) != NULL)
-		sk_wmem_free_skb(sk, skb);
-	sk_mem_reclaim(sk);
-}
-
 static inline struct sk_buff *tcp_write_queue_head(struct sock *sk)
 {
 	return skb_peek(&sk->sk_write_queue);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 64d0af6..0124f5b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1992,7 +1992,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 
 	tcp_clear_xmit_timers(sk);
 	__skb_queue_purge(&sk->sk_receive_queue);
-	tcp_write_queue_purge(sk);
+	sk_write_queue_purge(sk);
 	__skb_queue_purge(&tp->out_of_order_queue);
 #ifdef CONFIG_NET_DMA
 	__skb_queue_purge(&sk->sk_async_wait_queue);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7cda24b..76e59df 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1845,7 +1845,7 @@ void tcp_v4_destroy_sock(struct sock *sk)
 	tcp_cleanup_congestion_control(sk);
 
 	/* Cleanup up the write buffer. */
-	tcp_write_queue_purge(sk);
+	sk_write_queue_purge(sk);
 
 	/* Cleans up our, hopefully empty, out_of_order_queue. */
 	__skb_queue_purge(&tp->out_of_order_queue);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 6ec6a8a..b6370d0 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -464,7 +464,8 @@ void udp_flush_pending_frames(struct sock *sk)
 	if (up->pending) {
 		up->len = 0;
 		up->pending = 0;
-		ip_flush_pending_frames(sk);
+		sk_write_queue_purge(sk);
+		ip_cork_release(inet_sk(sk));
 	}
 }
 EXPORT_SYMBOL(udp_flush_pending_frames);

^ permalink raw reply related
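
A minimal userspace sketch of the accounting argument made in the patch
description above: a buffer that was charged to a socket has to be uncharged
when it is freed, otherwise the socket's counters stay unbalanced. All names
here (struct toy_sock, struct toy_skb, toy_queue_charged(),
toy_purge_plain(), toy_purge_accounted()) are hypothetical, and the
bookkeeping is deliberately simplified; it is not the kernel's
sk_mem_charge()/sk_mem_uncharge() implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins; these are NOT the kernel's struct sk_buff / struct sock. */
struct toy_skb {
	struct toy_skb *next;
	int truesize;
};

struct toy_sock {
	struct toy_skb *write_queue;
	int forward_alloc;	/* simplified accounting balance */
	int wmem_queued;	/* bytes held by queued buffers */
};

/* Queue a buffer and charge its size to the socket.  Returns 0 on success. */
static int toy_queue_charged(struct toy_sock *sk, int truesize)
{
	struct toy_skb *skb;

	assert(truesize > 0);

	skb = malloc(sizeof(*skb));
	if (!skb)
		return -1;
	skb->truesize = truesize;
	skb->next = sk->write_queue;
	sk->write_queue = skb;
	sk->wmem_queued += truesize;
	sk->forward_alloc -= truesize;
	return 0;
}

/* Frees the buffers but skips the accounting (models a plain free). */
static void toy_purge_plain(struct toy_sock *sk)
{
	struct toy_skb *skb;

	while ((skb = sk->write_queue) != NULL) {
		sk->write_queue = skb->next;
		free(skb);
	}
}

/* Uncharges each buffer before freeing it (models an accounted free). */
static void toy_purge_accounted(struct toy_sock *sk)
{
	struct toy_skb *skb;

	while ((skb = sk->write_queue) != NULL) {
		sk->write_queue = skb->next;
		sk->wmem_queued -= skb->truesize;
		sk->forward_alloc += skb->truesize;
		free(skb);
	}
}
```

In this toy model, queueing one 512-byte buffer and purging it with
toy_purge_plain() leaves forward_alloc non-zero, which is the kind of
residue a destruct-time sanity check would flag; toy_purge_accounted()
brings both counters back to zero.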

* Re: [BUG net-2.6] bluetooth/rfcomm : sleeping function called from invalid context at mm/slub.c:1719
From: Oliver Hartkopp @ 2009-10-03  9:43 UTC (permalink / raw)
  To: Dave Young
  Cc: Marcel Holtmann, Linux Netdev List,
	linux-bluetooth-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20091003070622.GA4110@darkstar>

Dave Young wrote:
> On Fri, Oct 02, 2009 at 06:04:14PM +0200, Oliver Hartkopp wrote:
>> Dave Young wrote:
>>> On Fri, Oct 2, 2009 at 2:28 PM, Oliver Hartkopp <oliver-fJ+pQTUTwRTk1uMJSBkQmQ@public.gmane.org> wrote:
>>>> Hello Marcel,
>>>>
>>>> with current net-2.6 tree ...
>>>>
>>>> While starting my PPP Bluetooth dialup networking, i got this:
>>> Hi, oliver
>>>
>>> please try following patch:
>>> http://patchwork.kernel.org/patch/51326/
>> Hi Dave,
>>
>> that fixed it at ppp startup!
>>
>> Tested-by: Oliver Hartkopp <oliver-fJ+pQTUTwRTk1uMJSBkQmQ@public.gmane.org>
>>
>> Btw. when shutting down the ppp connection i still get this:
>>
>> [  361.996887] INFO: trying to register non-static key.
>> [  361.996897] the code is fine but needs lockdep annotation.
>> [  361.996902] turning off the locking correctness validator.
>> [  361.996912] Pid: 0, comm: swapper Not tainted 2.6.31-08939-gdb8abec-dirty #22
>> [  361.996919] Call Trace:
>> [  361.996933]  [<c12e4fb2>] ? printk+0xf/0x11
>> [  361.996947]  [<c1042214>] register_lock_class+0x5a/0x295
>> [  361.996957]  [<c1043af2>] __lock_acquire+0x9b/0xc03
>> [  361.996967]  [<c104464b>] ? __lock_acquire+0xbf4/0xc03
>> [  361.996985]  [<fa59a168>] ? l2cap_get_chan_by_scid+0x35/0x43 [l2cap]
>> [  361.996995]  [<c104491f>] ? lock_release_non_nested+0x17b/0x1db
>> [  361.997008]  [<fa59a168>] ? l2cap_get_chan_by_scid+0x35/0x43 [l2cap]
>> [  361.997018]  [<c10426fd>] ? trace_hardirqs_off+0xb/0xd
>> [  361.997028]  [<c10446b6>] lock_acquire+0x5c/0x73
>> [  361.997039]  [<c124cd14>] ? skb_dequeue+0x12/0x4c
>> [  361.997049]  [<c12e6e23>] _spin_lock_irqsave+0x24/0x34
>> [  361.997058]  [<c124cd14>] ? skb_dequeue+0x12/0x4c
>> [  361.997066]  [<c124cd14>] skb_dequeue+0x12/0x4c
>> [  361.997075]  [<c124d579>] skb_queue_purge+0x14/0x1b
>> [  361.997088]  [<fa59ce3f>] l2cap_recv_frame+0xe9e/0x129a [l2cap]
>> [  361.997099]  [<c10421d1>] ? register_lock_class+0x17/0x295
>> [  361.997110]  [<c104464b>] ? __lock_acquire+0xbf4/0xc03
>> [  361.997128]  [<c104464b>] ? __lock_acquire+0xbf4/0xc03
>> [  361.997139]  [<c120de74>] ? uhci_giveback_urb+0xf2/0x162
>> [  361.997163]  [<f8bb4c45>] ? hci_rx_task+0xfe/0x1f8 [bluetooth]
>> [  361.997177]  [<fa59d2e4>] l2cap_recv_acldata+0xa9/0x1be [l2cap]
>> [  361.997190]  [<fa59d23b>] ? l2cap_recv_acldata+0x0/0x1be [l2cap]
>> [  361.997208]  [<f8bb4c77>] hci_rx_task+0x130/0x1f8 [bluetooth]
>> [  361.997219]  [<c102a098>] tasklet_action+0x6b/0xb2
>> [  361.997228]  [<c102a46b>] __do_softirq+0x82/0x101
>> [  361.997237]  [<c102a515>] do_softirq+0x2b/0x43
>> [  361.997246]  [<c102a619>] irq_exit+0x35/0x68
>> [  361.997256]  [<c1004513>] do_IRQ+0x80/0x96
>> [  361.997265]  [<c10030ae>] common_interrupt+0x2e/0x34
>> [  361.997275]  [<c104007b>] ? tick_device_uses_broadcast+0x71/0x7c
>> [  361.997286]  [<c11747a8>] ? acpi_idle_enter_simple+0x103/0x12e
>> [  361.997296]  [<c1174515>] acpi_idle_enter_bm+0xc3/0x253
>> [  361.997306]  [<c1238b6f>] cpuidle_idle_call+0x60/0x91
>> [  361.997315]  [<c1001d44>] cpu_idle+0x49/0x65
>> [  361.997324]  [<c12e2f0e>] start_secondary+0x190/0x195
>>
>>
>> Thanks,
>> Oliver
>>
> 
> Oliver, does following patch fix the non-static lock problem?
> --
> 
> now l2cap conn locks will be initialized after setup l2cap conn timer,
> it will introduce following problem:
> 
> [  361.996887] INFO: trying to register non-static key.
> [  361.996897] the code is fine but needs lockdep annotation.
> [  361.996902] turning off the locking correctness validator.
> [  361.996912] Pid: 0, comm: swapper Not tainted 2.6.31-08939-gdb8abec-dirty #22
> [  361.996919] Call Trace:
> [  361.996933]  [<c12e4fb2>] ? printk+0xf/0x11
> [  361.996947]  [<c1042214>] register_lock_class+0x5a/0x295
> [  361.996957]  [<c1043af2>] __lock_acquire+0x9b/0xc03
> [  361.996967]  [<c104464b>] ? __lock_acquire+0xbf4/0xc03
> [  361.996985]  [<fa59a168>] ? l2cap_get_chan_by_scid+0x35/0x43 [l2cap]
> [  361.996995]  [<c104491f>] ? lock_release_non_nested+0x17b/0x1db
> [  361.997008]  [<fa59a168>] ? l2cap_get_chan_by_scid+0x35/0x43 [l2cap]
> [  361.997018]  [<c10426fd>] ? trace_hardirqs_off+0xb/0xd
> [  361.997028]  [<c10446b6>] lock_acquire+0x5c/0x73
> [  361.997039]  [<c124cd14>] ? skb_dequeue+0x12/0x4c
> [  361.997049]  [<c12e6e23>] _spin_lock_irqsave+0x24/0x34
> [  361.997058]  [<c124cd14>] ? skb_dequeue+0x12/0x4c
> [  361.997066]  [<c124cd14>] skb_dequeue+0x12/0x4c
> [  361.997075]  [<c124d579>] skb_queue_purge+0x14/0x1b
> [  361.997088]  [<fa59ce3f>] l2cap_recv_frame+0xe9e/0x129a [l2cap]
> [  361.997099]  [<c10421d1>] ? register_lock_class+0x17/0x295
> [  361.997110]  [<c104464b>] ? __lock_acquire+0xbf4/0xc03
> [  361.997128]  [<c104464b>] ? __lock_acquire+0xbf4/0xc03
> [  361.997139]  [<c120de74>] ? uhci_giveback_urb+0xf2/0x162
> [  361.997163]  [<f8bb4c45>] ? hci_rx_task+0xfe/0x1f8 [bluetooth]
> [  361.997177]  [<fa59d2e4>] l2cap_recv_acldata+0xa9/0x1be [l2cap]
> [  361.997190]  [<fa59d23b>] ? l2cap_recv_acldata+0x0/0x1be [l2cap]
> [  361.997208]  [<f8bb4c77>] hci_rx_task+0x130/0x1f8 [bluetooth]
> [  361.997219]  [<c102a098>] tasklet_action+0x6b/0xb2
> [  361.997228]  [<c102a46b>] __do_softirq+0x82/0x101
> [  361.997237]  [<c102a515>] do_softirq+0x2b/0x43
> [  361.997246]  [<c102a619>] irq_exit+0x35/0x68
> [  361.997256]  [<c1004513>] do_IRQ+0x80/0x96
> [  361.997265]  [<c10030ae>] common_interrupt+0x2e/0x34
> [  361.997275]  [<c104007b>] ? tick_device_uses_broadcast+0x71/0x7c
> [  361.997286]  [<c11747a8>] ? acpi_idle_enter_simple+0x103/0x12e
> [  361.997296]  [<c1174515>] acpi_idle_enter_bm+0xc3/0x253
> [  361.997306]  [<c1238b6f>] cpuidle_idle_call+0x60/0x91
> [  361.997315]  [<c1001d44>] cpu_idle+0x49/0x65
> [  361.997324]  [<c12e2f0e>] start_secondary+0x190/0x195
> 
> Here move lock init things before setup_timer to avoid misuse
> uninitialized locks.
> 
> Reported-by: Oliver Hartkopp <oliver-fJ+pQTUTwRTk1uMJSBkQmQ@public.gmane.org>
> Signed-off-by: Dave Young <hidave.darkstar-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> ---
> net/bluetooth/l2cap.c |    6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
> 
> --- linux-2.6.31.orig/net/bluetooth/l2cap.c	2009-09-30 16:36:10.000000000 +0800
> +++ linux-2.6.31/net/bluetooth/l2cap.c	2009-10-03 14:44:51.000000000 +0800
> @@ -555,12 +555,12 @@ static struct l2cap_conn *l2cap_conn_add
>  
>  	conn->feat_mask = 0;
>  
> -	setup_timer(&conn->info_timer, l2cap_info_timeout,
> -						(unsigned long) conn);
> -
>  	spin_lock_init(&conn->lock);
>  	rwlock_init(&conn->chan_list.lock);
>  
> +	setup_timer(&conn->info_timer, l2cap_info_timeout,
> +						(unsigned long) conn);
> +
>  	conn->disc_reason = 0x13;
>  
>  	return conn;

No, it does not have any effect.

As the lockdep annotation only appears when shutting down the ppp connection,
I wonder whether changing things in a _conn_add() function could help.
:-)

Or didn't I make it clear before that this annotation only happens at ppp
shutdown?

Best regards,
Oliver

^ permalink raw reply

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Avi Kivity @ 2009-10-03 10:00 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ira W. Snyder, Michael S. Tsirkin, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4AC501EB.8090608@gmail.com>

On 10/01/2009 09:24 PM, Gregory Haskins wrote:
>
>> Virtualization is about not doing that.  Sometimes it's necessary (when
>> you have made unfixable design mistakes), but just to replace a bus,
>> with no advantages to the guest that has to be changed (other
>> hypervisors or hypervisorless deployment scenarios aren't).
>>      
> The problem is that your continued assertion that there is no advantage
> to the guest is a completely unsubstantiated claim.  As it stands right
> now, I have a public git tree that, to my knowledge, is the fastest KVM
> PV networking implementation around.  It also has capabilities that are
> demonstrably not found elsewhere, such as the ability to render generic
> shared-memory interconnects (scheduling, timers), interrupt-priority
> (qos), and interrupt-coalescing (exit-ratio reduction).  I designed each
> of these capabilities after carefully analyzing where KVM was coming up
> short.
>
> Those are facts.
>
> I can't easily prove which of my new features alone are what makes it
> special per se, because I don't have unit tests for each part that
> breaks it down.  What I _can_ state is that it's the fastest and most
> feature rich KVM-PV tree that I am aware of, and others may download and
> test it themselves to verify my claims.
>    

If you wish to introduce a feature which has downsides (and to me, vbus 
has downsides) then you must prove it is necessary on its own merits.  
venet is pretty cool but I need proof before I believe its performance 
is due to vbus and not to venet-host.

> The disproof, on the other hand, would be in a counter example that
> still meets all the performance and feature criteria under all the same
> conditions while maintaining the existing ABI.  To my knowledge, this
> doesn't exist.
>    

mst is working on it and we should have it soon.

> Therefore, if you believe my work is irrelevant, show me a git tree that
> accomplishes the same feats in a binary compatible way, and I'll rethink
> my position.  Until then, complaining about lack of binary compatibility
> is pointless since it is not an insurmountable proposition, and the one
> and only available solution declares it a required casualty.
>    

Fine, let's defer it until vhost-net is up and running.

>> Well, Xen requires pre-translation (since the guest has to give the host
>> (which is just another guest) permissions to access the data).
>>      
> Actually I am not sure that it does require pre-translation.  You might
> be able to use the memctx->copy_to/copy_from scheme in post translation
> as well, since those would be able to communicate to something like the
> xen kernel.  But I suppose either method would result in extra exits, so
> there is no distinct benefit using vbus there..as you say below "they're
> just different".
>
> The biggest difference is that my proposed model gets around the notion
> that the entire guest address space can be represented by an arbitrary
> pointer.  For instance, the copy_to/copy_from routines take a GPA, but
> may use something indirect like a DMA controller to access that GPA.  On
> the other hand, virtio fully expects a viable pointer to come out of the
> interface iiuc.  This is in part what makes vbus more adaptable to non-virt.
>    

No, virtio doesn't expect a pointer (this is what makes Xen possible).  
vhost does; but nothing prevents an interested party from adapting it.

>>> An interesting thing here is that you don't even need a fancy
>>> multi-homed setup to see the effects of my exit-ratio reduction work:
>>> even single port configurations suffer from the phenomenon since many
>>> devices have multiple signal-flows (e.g. network adapters tend to have
>>> at least 3 flows: rx-ready, tx-complete, and control-events (link-state,
>>> etc)).  What's worse is that the flows often are indirectly related (for
>>> instance, many host adapters will free tx skbs during rx operations, so
>>> you tend to get bursts of tx-completes at the same time as rx-ready).  If
>>> the flows map 1:1 with IDT, they will suffer the same problem.
>>>
>>>        
>> You can simply use the same vector for both rx and tx and poll both at
>> every interrupt.
>>      
> Yes, but that has its own problems: e.g. additional exits or at least
> additional overhead figuring out what happens each time.

If you're just coalescing tx and rx, it's an additional memory read 
(which you have anyway in the vbus interrupt queue).

> This is even
> more important as we scale out to MQ which may have dozens of queue
> pairs.  You really want finer grained signal-path decode if you want
> peak performance.
>    

MQ definitely wants per-queue or per-queue-pair vectors, and it 
definitely doesn't want all interrupts to be serviced by a single 
interrupt queue (you could/should make the queue per-vcpu).

>>> Its important to note here that we are actually looking at the interrupt
>>> rate, not the exit rate (which is usually a multiple of the interrupt
>>> rate, since you have to factor in as many as three exits per interrupt
>>> (IPI, window, EOI).  Therefore we saved about 18k interrupts in this 10
>>> second burst, but we may have actually saved up to 54k exits in the
>>> process. This is only over a 10 second window at GigE rates, so YMMV.
>>> These numbers get even more dramatic on higher end hardware, but I
>>> haven't had a chance to generate new numbers yet.
>>>
>>>        
>> (irq window exits should only be required on a small percentage of
>> interrupt injections, since the guest will try to disable interrupts for
>> short periods only)
>>      
> Good point. You are probably right. Certainly the other 2 remain, however.
>
>    

You can easily eliminate most of the EOI exits by patching 
ack_APIC_irq() to do the following:

     if (atomic_inc_return(&vapic->eoi_count) < 0)
         null_hypercall();

Where eoi_count is a per-cpu shared counter that indicates how many EOIs 
were performed by the guest, with the sign bit a signal from the 
hypervisor that a lower-priority interrupt is pending.
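
As a rough illustration of the counter protocol described here (a
user-space sketch only, not kvm code: struct vapic_sketch,
EOI_PENDING_FLAG, host_mark_pending() and host_collect() are hypothetical
names, and the exit is simulated by a plain counter):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout: bit 31 is the "sign bit" set by the host, the low
 * 31 bits count EOIs performed by the guest. */
#define EOI_PENDING_FLAG 0x80000000u

struct vapic_sketch {
	_Atomic uint32_t eoi_count;	/* shared guest/host counter */
	unsigned int exits;		/* simulated hypercall exits */
};

/* Stand-in for null_hypercall(): just records that an exit happened. */
static void null_hypercall_sketch(struct vapic_sketch *v)
{
	v->exits++;
}

/* Guest side: count the EOI; exit only if the host flagged pending work. */
static void guest_ack_irq(struct vapic_sketch *v)
{
	uint32_t newval = atomic_fetch_add(&v->eoi_count, 1) + 1;

	if (newval & EOI_PENDING_FLAG)
		null_hypercall_sketch(v);
}

/* Host side: request an exit at the guest's next EOI. */
static void host_mark_pending(struct vapic_sketch *v)
{
	atomic_fetch_or(&v->eoi_count, EOI_PENDING_FLAG);
}

/* Host side: read and reset the counter, returning the number of EOIs. */
static uint32_t host_collect(struct vapic_sketch *v)
{
	return atomic_exchange(&v->eoi_count, 0) & ~EOI_PENDING_FLAG;
}
```

In this model the guest only pays for an exit on the EOI that follows
host_mark_pending(); every other EOI is a single atomic increment.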

We do something similar for the TPR, which is heavily exercised by 
Windows XP.

Note that svm provides a mechanism to queue interrupts without requiring 
the interrupt window; we don't use it in kvm (primarily because only a 
small fraction of injections would benefit).

> Ultimately, the fastest exit is the one you do not take.  That is what I
> am trying to achieve.
>    

The problem is that all those paravirtualizations bring their own 
problems and are quickly obsoleted by hardware advances.  Intel and AMD 
also see what you're seeing.  Sure, it takes hardware a long time to 
propagate to the field, but the same holds for software.

>
>>> The even worse news for 1:1 models is that the ratio of
>>> exits-per-interrupt climbs with load (exactly when it hurts the most)
>>> since that is when the probability that the vcpu will need all three
>>> exits is the highest.
>>>
>>>        
>> Requiring all three exits means the guest is spending most of its time
>> with interrupts disabled; that's unlikely.
>>      
> (see "softirqs" above)
>    

There are no softirqs above, please clarify.

>> Thanks for the numbers.  Are those 11% attributable to rx/tx
>> piggybacking from the same interface?
>>      
> Its hard to tell, since I am not instrumented to discern the difference
> in this run.  I do know from previous traces on the 10GE rig that the
> chelsio T3 that I am running reaps the pending-tx ring at the same time
> as a rx polling, so its very likely that both events are often
> coincident at least there.
>    

I assume you had only two active interrupts?  In that case, tx and rx 
mitigation should have prevented the same interrupt from coalescing with 
itself, so that leaves rx/tx coalescing as the only option?

>> Also, 170K interupts ->  17K interrupts/sec ->  55kbit/interrupt ->
>> 6.8kB/interrupt.  Ignoring interrupt merging and assuming equal rx/tx
>> distribution, that's about 13kB/interrupt.  Seems rather low for a
>> saturated link.
>>      
> I am not following: Do you suspect that I have too few interrupts to
> represent 940Mb/s, or that I have too little data/interrupt and this
> ratio should be improved?
>    

Too few bits/interrupt.  With tso, your "packets" should be 64KB at 
least, and you should expect multiple packets per tx interrupt.  Maybe 
these are all acks?
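
For reference, the arithmetic behind that estimate can be reproduced as
follows.  This is only a sketch: bytes_per_interrupt() is a made-up
helper, and the inputs are the figures quoted in this thread (940Mb/s,
170K interrupts, 10 second window).

```c
/* Average payload per interrupt, in bytes, for a link running at
 * link_bps bits/sec that raised nr_irqs interrupts in window_sec. */
static double bytes_per_interrupt(double link_bps, double nr_irqs,
				  double window_sec)
{
	double irqs_per_sec = nr_irqs / window_sec;
	double bits_per_irq = link_bps / irqs_per_sec;

	return bits_per_irq / 8.0;
}
```

bytes_per_interrupt(940e6, 170e3, 10.0) comes out at about 6900 bytes
(roughly 55 kbit) per interrupt, which is far below a 64KB tso segment.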

>>>>
>>>>          
>>> Everyone is of course entitled to an opinion, but the industry as a
>>> whole would disagree with you.  Signal path routing (1:1, aggregated,
>>> etc) is at the discretion of the bus designer.  Most buses actually do
>>> _not_ support 1:1 with IDT (think USB, SCSI, IDE, etc).
>>>
>>>        
>> With standard PCI, they do not.  But all modern host adapters support
>> MSI and they will happily give you one interrupt per queue.
>>      
> While MSI is a good technological advancement for PCI, I was referring
> to signal:IDT ratio.  MSI would still classify as 1:1.
>    

I meant, a multiqueue SCSI or network adapter is not N:1 but N:M since 
each queue would get its own interrupt.  So it looks like modern cards 
try to disaggregate, not aggregate.  Previously, non-MSI PCI forced them 
to aggregate by providing a small number of irq pins.

>> Let's do that then.  Please reserve the corresponding comparisons from
>> your side as well.
>>      
> That is quite the odd request.  My graphs are all built using readily
> available code and open tools and do not speculate as to what someone
> else may come up with in the future.  They reflect what is available
> today.  Do you honestly think I should wait indefinitely for a competing
> idea to try to catch up before I talk about my results?  That's
> certainly an interesting perspective.
>    

Your results are excellent and I'm not asking you to hide them.  But you
can't compare (more) complete code to incomplete code and state that 
this proves you are right, or to use results for an entire stack as 
proof that one component is what made it possible.

> With all due respect, the only red-herring is your unsubstantiated
> claims that my results do not matter.
>    

My claim is that your results are mostly due to venet-host.  I don't 
have a proof but you don't have a counterproof.  That is why I ask you 
to wait for vhost-net, it will give us more data so we can see what's what.

>>> This is not to mention
>>> that vhost-net does nothing to address our other goals, like scheduler
>>> coordination and non-802.x fabrics.
>>>
>>>        
>> What are scheduler coordination and non-802.x fabrics?
>>      
> We are working on real-time, IB and QOS, for examples, in addition to
> the now well known 802.x venet driver.
>    

Won't QoS require a departure from aggregated interrupts?  Suppose a
low priority interrupt arrives and the guest starts processing, then a 
high priority interrupt.  Don't you need a real (IDT) interrupt to make 
the guest process the high-priority event?

>>>> Right, when you ignore the points where they don't fit, it's a perfect
>>>> mesh.
>>>>
>>>>          
>>> Where doesn't it fit?
>>>
>>>        
>> (avoiding infinite loop)
>>      
> I'm serious.  Where doesn't it fit?  Point me at a URL if its already
> discussed.
>    

Sorry, I lost the context; also my original comment wasn't very 
constructive, consider it retracted.

>>> Citation please.  Afaict, the one use case that we looked at for vhost
>>> outside of KVM failed to adapt properly, so I do not see how this is
>>> true.
>>>
>>>        
>> I think Ira said he can make vhost work?
>>
>>      
> Not exactly.  It kind of works for 802.x only (albeit awkwardly) because
> there is no strong distinction between "resource" and "consumer" with
> ethernet.  So you can run it inverted without any serious consequences
> (at least, not from consequences of the inversion).  Since the x86
> boards are the actual resource providers in his system, other device
> types will fail to map to the vhost model properly, like disk-io or
> consoles for instance.
>    

In that case vhost will have to be adapted or they will have to use 
something else.



>> virtio-net over pci is deployed.  Replacing the backend with vhost-net
>> will require no guest modifications.
>>      
> That _is_ a nice benefit, I agree.  I just do not agree its a hard
> requirement.
>    

Consider a cloud where the hypervisor is updated without the knowledge 
of the guest admins.  Either we break the guests and require the guest 
admins to login (without networking) to upgrade their drivers during 
production and then look for a new cloud, or we maintain both device 
models and ask the guest admins to upgrade their drivers so we can drop 
support for the old device, a request which they will rightly ignore.

>> Obviously virtio-net isn't deployed in non-virt.  But if we adopt vbus,
>> we have to migrate guests.
>>      
> As a first step, lets just shoot for "support" instead of "adopt".
>    

"support" means eventually "adopt", it isn't viable to maintain two 
models in parallel.

> Ill continue to push patches to you that help interfacing with the guest
> in a vbus neutral way (like irqfd/ioeventfd) and we can go from there.
> Are you open to this work assuming it passes normal review cycles, etc?
>   It would presumably be of use to others that want to interface to a
> guest (e.g. vhost) as well.
>    

Neutral interfaces are great, and I've already received feedback from 
third parties that they ought to work well for their uses.  I don't 
really like xinterface since I think it's too intrusive locking wise, 
especially when there's currently churn in kvm memory management.  But 
feel free to post your ideas or patches, maybe we can work something out.

>>> And once those events are fed, you still need a
>>> PV layer to actually handle the bus interface in a high-performance
>>> manner so its not like you really have a "native" stack in either case.
>>>
>>>        
>> virtio-net doesn't use any pv layer.
>>      
> Well, it does when you really look closely at how it works.  For one, it
> has the virtqueues library that would be (or at least _should be_)
> common for all virtio-X adapters, etc etc.  Even if this layer is
> collapsed into each driver on the Windows platform, its still there
> nonetheless.
>    

By "pv layer" I meant something that is visible along the guest/host 
interface.  virtio devices are completely independent from one another 
and (using virtio-pci) only talk through interfaces exposed by the 
relevant card.  If you wanted to, you could implement a virtio-pci card 
in silicon.

Practically the only difference between ordinary NICs and virtio-net is 
that interrupt status and enable/disable are stored in memory instead of 
NIC registers, but a real NIC could have done it the virtio way.

>>>> that doesn't need to be retrofitted.
>>>>
>>>>          
>>> No, that is incorrect.  You have to heavily modify the pci model with
>>> layers on top to get any kind of performance out of it.  Otherwise, we
>>> would just use realtek emulation, which is technically the native PCI
>>> you are apparently so enamored with.
>>>
>>>        
>> virtio-net doesn't modify the PCI model.
>>      
> Sure it does.  It doesn't use MMIO/PIO bars for registers, it uses
> vq->kick().

Which translates to a BAR register.

> It doesn't use pci-config-space, it uses virtio->features.
>    

Which translates to a BAR.

>   It doesn't use PCI interrupts, it uses a callback on the vq etc, etc.
> You would never use raw "registers", as the exit rate would crush you.
> You would never use raw interrupts, as you need a shared-memory based
> mitigation scheme.
>
> IOW: Virtio has a device model layer that tunnels over PCI.  It doesn't
> actually use PCI directly.  This is in fact what allows the linux
> version to work over lguest, s390 and vbus in addition to PCI.
>    

That's just a nice way to reuse the driver across multiple busses.  Kind 
of like isa/pci drivers that might even still exist in the source tree.  
On x86, virtio doesn't bypass PCI, just adds a layer above it.

>> You can have dynamic MSI/queue routing with virtio, and each MSI can be
>> routed to a vcpu at will.
>>      
> Can you arbitrarily create a new MSI/queue on a per-device basis on the
> fly?   We want to do this for some upcoming designs.  Or do you need to
> predeclare the vectors when the device is hot-added?
>    

You need to predeclare the number of vectors, but queue/interrupt 
assignment is runtime.


>>> priority, and coalescing, etc.
>>>
>>>        
>> Do you mean interrupt priority?  Well, apic allows interrupt priorities
>> and Windows uses them; Linux doesn't.  I don't see a reason to provide
>> more than native hardware.
>>      
> The APIC model is not optimal for PV given the exits required for a
> basic operation like an interrupt injection, and has scaling/flexibility
> issues with its 16:16 priority mapping.
>
> OTOH, you don't necessarily want to rip it out because of all the
> additional features it has like the IPI facility and the handling of
> many low-performance data-paths.  Therefore, I am of the opinion that
> the optimal placement for advanced signal handling is directly at the
> bus that provides the high-performance resources.  I could be convinced
> otherwise with a compelling argument, but I think this is the path of
> least resistance.
>    

With EOI PV you can reduce the cost of interrupt injection to slightly 
more than one exit/interrupt.  vbus might reduce it to slightly less 
than one exit/interrupt.

wrt priority, if you have 12 or fewer realtime interrupt sources you can 
map them to available priorities.  If you have more then you take extra 
interrupts, but at a ratio of 12:1 (so 24 realtime interrupts mean you 
may take a single extra exit).  The advantage of this is that all
interrupts (not just vbus) are prioritized, and bare metal benefits as well.

If 12 is too low for you, pressure Intel to increase the TPR to 8 r/w 
bits, too bad they missed a chance with x2apic (which btw reduces the 
apic exit costs significantly).



>> N:1 breaks down on large guests since one vcpu will have to process all
>> events.
>>      
> Well, first of all that is not necessarily true.  Some high performance
> buses like SCSI and FC work fine with an aggregated model, so its not a
> foregone conclusion that aggregation kills SMP IO performance.  This is
> especially true when you adding coalescing on top, like AlacrityVM does.
>    

Nevertheless, the high performance adaptors provide multiqueue and MSI; 
one of the reasons is to distribute processing.

> I do agree that other subsystems, like networking for instance, may
> sometimes benefit from flexible signal-routing because of multiqueue,
> etc, for particularly large guests.  However, the decision to make the
> current kvm-connector used in AlacrityVM aggregate one priority FIFO per
> IRQ was an intentional design tradeoff.  My experience with my target
> user base is that these data-centers are typically deploying 1-4 vcpu
> guests, so I optimized for that.  YMMV, so we can design a different
> connector, or a different mode of the existing connector, to accommodate
> large guests as well if that was something desirable.
>
>    
>> You could do N:M, with commands to change routings, but where's
>> your userspace interface?
>>      
> Well, we should be able to add that when/if its needed.  I just don't
> think the need is there yet.  KVM tops out at 16 IIUC anyway.
>    

My feeling is that 16 will definitely need multiqueue, and perhaps even 4.

(we can probably up the 16, certainly with Marcelo's srcu work).

>> you can't tell from /proc/interrupts which
>> vbus interupts are active
>>      
> This should be trivial to add some kind of *fs display.  I will fix this
> shortly.
>    

And update irqbalance and other tools?  What about the Windows 
equivalent?  What happens when (say) Linux learns to migrate interrupts 
to where they're actually used?

This should really be done at the irqchip level, but before that, we 
need to be 100% certain it's worthwhile.

>> The larger your installed base, the more difficult it is.  Of course
>> it's doable, but I prefer not doing it and instead improving things in a
>> binary backwards compatible manner.  If there is no choice we will bow
>> to the inevitable and make our users upgrade.  But at this point there
>> is a choice, and I prefer to stick with vhost-net until it is proven
>> that it won't work.
>>      
> Fair enough.  But note you are likely going to need to respin your
> existing drivers anyway to gain peak performance, since there are known
> shortcomings in the virtio-pci ABI today (like queue identification in
> the interrupt hotpath) as it stands.  So that pain is coming one way or
> the other.
>    

We'll update the drivers but we won't require users to update.  The 
majority will not notice an upgrade; those who are interested in getting 
more performance will update their drivers (at their own schedule).

>> One of the benefits of virtualization is that the guest model is
>> stable.  You can live-migrate guests and upgrade the hardware
>> underneath.  You can have a single guest image that you clone to
>> provision new guests.  If you switch to a new model, you give up those
>> benefits, or you support both models indefinitely.
>>      
> I understand what you are saying, but I don't buy it.  If you add a new
> feature to an existing model even without something as drastic as a new
> bus, you still have the same exact dilemma:  The migration target needs
> feature parity with consumed features in the guest.  Its really the same
> no matter what unless you never add guest-visible features.
>    

When you upgrade your data center, you start upgrading your hypervisors 
(one by one, with live migration making it transparent) and certainly 
not exposing new features to running guests.  Once you are done you can 
expose the new features, and guests which are interested in them can
upgrade their drivers and see them.

>> Note even hardware nowadays is binary compatible.  One e1000 driver
>> supports a ton of different cards, and I think (not sure) newer cards
>> will work with older drivers, just without all their features.
>>      
> Noted, but that is not really the same thing.  Thats more like adding a
> feature bit to virtio, not replacing GigE with 10GE.
>    

Right, and that's what virtio-net changes look like.

>>> If and when that becomes a priority concern, that would be a function
>>> transparently supported in the BIOS shipped with the hypervisor, and
>>> would thus be invisible to the user.
>>>
>>>        
>> No, you have to update the driver in your initrd (for Linux)
>>      
> Thats fine, the distros generally do this automatically when you load
> the updated KMP package.
>    

So it's not invisible to the user.  You update your hypervisor and now 
need to tell your users to add the new driver to their initrd and 
reboot.  They're not going to like you.


>> or properly install the new driver (for Windows).  It's especially
>> difficult for Windows.
>>      
> What is difficult here?  I never seem to have any problems and I have
> all kinds of guests from XP to Win7.
>    

If you accidentally reboot before you install the new driver, you won't 
boot; and there are issues with loading a new driver without the 
hardware present (not sure what exactly).

>> I don't want to support both virtio and vbus in parallel.  There's
>> enough work already.
>>      
> Until I find some compelling reason that indicates I was wrong about all
> of this, I will continue building a community around the vbus code base
> and developing support for its components anyway.  So that effort is
> going to happen in parallel regardless.
>
> This is purely a question about whether you will work with me to make
> vbus an available option in upstream KVM or not.
>    

Without xinterface there's no need for vbus support in kvm, so nothing's 
blocking you there.  I'm open to extending the host-side kvm interfaces 
to improve kernel integration.  However I still think vbus is the wrong 
design and shouldn't be merged.

>>   If we adopt vbus, we'll have to deprecate and eventually kill off virtio.
>>      
> Thats more hyperbole.  virtio is technically fine and complementary as
> it is.  No one says you have to do anything drastic w.r.t. virtio.  If
> you _did_ adopt vbus, perhaps you would want to optionally deprecate
> vhost or possibly the virtio-pci adapter, but that is about it.  The
> rest of the infrastructure should be preserved if it was designed properly.
>    

virtio-pci is what makes existing guests work (and vhost-net will 
certainly need to be killed off).  But I really don't see the point of 
layering virtio on top of vbus.

>> PCI is continuously updated, with MSI, MSI-X, and IOMMU support being
>> some recent updates.  I'd like to ride on top of that instead of having
>> to clone it for every guest I support.
>>      
> While a noble goal, one of the points I keep making though, as someone
> who has built the stack both ways, is almost none of the PCI stack is
> actually needed to get the PV job done.  The part you do need is
> primarily a function of the generic OS stack and trivial to interface
> with anyway.
>    

PCI doesn't stand in the way of pv, and allows us to have a uniform 
interface to purely emulated, pv, and assigned devices, with minimal 
changes to the guest.  To me that's the path of least resistance.

>>
>>>>> As an added bonus, its device-model is modular.  A developer can
>>>>> write a
>>>>> new device model, compile it, insmod it to the host kernel, hotplug it
>>>>> to the running guest with mkdir/ln, and the come back out again
>>>>> (hotunplug with rmdir, rmmod, etc).  They may do this all without
>>>>> taking
>>>>> the guest down, and while eating QEMU based IO solutions for breakfast
>>>>> performance wise.
>>>>>
>>>>> Afaict, qemu can't do either of those things.
>>>>>
>>>>>
>>>>>            
>>>> We've seen that herring before,
>>>>
>>>>          
>>> Citation?
>>>
>>>        
>> It's the compare venet-in-kernel to virtio-in-userspace thing again.
>>      
> No, you said KVM has "userspace hotplug".  I retorted that vbus not only
> has hotplug, it also has a modular architecture.  You then countered
> that this feature is a red-herring.  If this was previously discussed
> and rejected for some reason, I would like to know the history.  Or did
> I misunderstand you?
>    

I was talking about your breakfast (the performance comparison again).

> For one, we have the common layer of shm-signal, and IOQ.  These
> libraries were designed to be reused on both sides of the link.
> Generally shm-signal has no counterpart in the existing model, though
> its functionality is integrated into the virtqueue.

I agree that ioq/shm separation is a nice feature.


>  From there, going down the stack, it looks like
>
>      (guest-side)
> |-------------------------
> | venet (competes with virtio-net)
> |-------------------------
> | vbus-proxy (competes with pci-bus, config+hotplug, sync/async)
> |-------------------------
> | vbus-pcibridge (interrupt coalescing + priority, fastpath)
> |-------------------------
>             |
> |-------------------------
> | vbus-kvmconnector (interrupt coalescing + priority, fast-path)
> |-------------------------
> | vbus-core (hotplug, address decoding, etc)
> |-------------------------
> | venet-device (ioq frame/deframe to tap/macvlan/vmdq, etc)
> |-------------------------
>
> If you want to use virtio, insert a virtio layer between the "driver"
> and "device" components at the outer edges of the stack.
>    

But then it adds no value.  It's just another shim.

>> To me, compatible means I can live migrate an image to a new system
>> without the user knowing about the change.  You'll be able to do that
>> with vhost-net.
>>      
> As soon as you add any new guest-visible feature, you are in the same
> exact boat.
>    

No.  You support two-way migration while hiding new features.  You 
support one-way migration if you expose new features (suitable for data 
center upgrade).  You don't support any migration if you switch models.

>>> No, that is incorrect.  For one, vhost uses them on a per-signal path
>>> basis, whereas vbus only has one channel for the entire guest->host.
>>>
>>>        
>> You'll probably need to change that as you start running smp guests.
>>      
> The hypercall channel is already SMP optimized over a single PIO path,
> so I think we are covered there.  See "fastcall" in my code for details:
>
> http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=drivers/vbus/pci-bridge.c;h=81f7cdd2167ae2f53406850ebac448a2183842f2;hb=fd1c156be7735f8b259579f18268a756beccfc96#l102
>
> It just passes the cpuid into the PIO write so we can have parallel,
> lockless "hypercalls".  This forms the basis of our guest scheduler
> support, for instance.
>    

This is... weird.  Scheduler support should be part of kvm core and done
using ordinary hypercalls, not as part of a bus model.

>>   You could implement virtio-net hardware if you wanted to.
>>      
> Technically you could build vbus in hardware too, I suppose, since the
> bridge is PCI compliant.  I would never advocate it, however, since many
> of our tricks do not matter if its real hardware (e.g. they are
> optimized for the costs associated with VM).
>    

No, you can't.  You won't get the cpuid in your pio writes, for one.  
And if multiple vbus cards are plugged into different PCI slots, they 
either lose inter-card interrupt coalescing, or you have to connect them
in a side-channel.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply

* [PATCH] pktgen: restore nanosec delays
From: Eric Dumazet @ 2009-10-03 11:39 UTC (permalink / raw)
  To: David S. Miller; +Cc: Stephen Hemminger, Linux Netdev List

Commit fd29cf72 (pktgen: convert to use ktime_t)
inadvertently converted the "delay" parameter from nanosec to microsec.
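
A minimal sketch of the unit handling, for illustration only
(delay_before_fix() and delay_after_fix() are hypothetical helpers that
mirror the two sides of the hunk below, they are not functions in
pktgen.c, and UINT64_MAX stands in for ULLONG_MAX):

```c
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

/* Pre-fix behaviour: the user-supplied value is scaled as if it were
 * given in microseconds. */
static uint64_t delay_before_fix(unsigned long value)
{
	if (value == 0x7FFFFFFF)
		return UINT64_MAX;
	return (uint64_t)value * NSEC_PER_USEC;
}

/* Post-fix behaviour: the value is stored as nanoseconds, unchanged. */
static uint64_t delay_after_fix(unsigned long value)
{
	if (value == 0x7FFFFFFF)
		return UINT64_MAX;
	return (uint64_t)value;
}
```

With "delay 500" the first helper yields 500000 ns, where the user asked
for 500 ns.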

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index b694552..227ba31 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -964,7 +964,7 @@ static ssize_t pktgen_if_write(struct file *file,
 		if (value == 0x7FFFFFFF)
 			pkt_dev->delay = ULLONG_MAX;
 		else
-			pkt_dev->delay = (u64)value * NSEC_PER_USEC;
+			pkt_dev->delay = (u64)value;

 		sprintf(pg_result, "OK: delay=%llu",
 			(unsigned long long) pkt_dev->delay);

^ permalink raw reply related

* [PATCH RFC] isdn/capi: fix up CAPI subsystem workaround locking a bit
From: Tilman Schmidt @ 2009-10-03 12:06 UTC (permalink / raw)
  To: i4ldeveloper
  Cc: Michael Buesch, Carsten Paeth, Karsten Keil, Karsten Keil,
	Armin Schindler, isdn4linux, netdev, linux-kernel

Move calls to handle_minor_send() and handle_minor_recv() out of
the sections locked by workaround_lock.
- handle_minor_send() may call another CAPI function via the card
  driver, deadlocking by trying to take workaround_lock again.
- handle_minor_recv() calls the receive_buf method of the active
  line discipline which may sleep.

This fixes Bugzilla entries 11687 and 14305 but may reenlarge the
window of vulnerability for the races that were not-quite-fixed by
commit 053b47ff249b9e0a634dae807f81465205e7c228. To avoid one
specific race, read the mp->tty member of the capiminor structure
only once in handle_recv_skb().

Signed-off-by: Tilman Schmidt <tilman@imap.cc>
---
I wasn't able to get any information on the nature of the problem fixed
by commit 053b47ff249b9e0a634dae807f81465205e7c228 from its author, nor
did my search of the LKML archives yield anything on it, so I went for
a minimally invasive approach. It works on my test machine, but a
complete overhaul of locking in capi.ko would of course be better.
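
To illustrate the reordering (a single-threaded toy model, not the
capi.c code: struct lock_sketch, send_sketch() and the two write_*()
helpers are invented for this note, and a boolean stands in for the
spinlock so that re-entry can be detected instead of hanging):

```c
#include <stdbool.h>

struct lock_sketch {
	bool held;
};

/* Non-recursive lock: taking it twice fails here, where a real
 * spinlock would deadlock. */
static bool lock_sketch_take(struct lock_sketch *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void lock_sketch_release(struct lock_sketch *l)
{
	l->held = false;
}

/* Models handle_minor_send() ending up in another function that takes
 * the same lock.  Returns false where the real code would deadlock. */
static bool send_sketch(struct lock_sketch *l)
{
	if (!lock_sketch_take(l))
		return false;
	lock_sketch_release(l);
	return true;
}

/* Old ordering: call out while still holding the lock. */
static bool write_old(struct lock_sketch *l)
{
	bool ok;

	lock_sketch_take(l);
	/* ... queue the skb ... */
	ok = send_sketch(l);
	lock_sketch_release(l);
	return ok;
}

/* New ordering: drop the lock first, then call out. */
static bool write_new(struct lock_sketch *l)
{
	lock_sketch_take(l);
	/* ... queue the skb ... */
	lock_sketch_release(l);
	return send_sketch(l);
}
```

In this model write_old() reports the re-entry failure and write_new()
does not; the price of the new ordering is that the queue is no longer
protected while the send or receive handler runs.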

 drivers/isdn/capi/capi.c |   33 +++++++++++++++++++--------------
 1 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/drivers/isdn/capi/capi.c b/drivers/isdn/capi/capi.c
index 65bf91e..f348df2 100644
--- a/drivers/isdn/capi/capi.c
+++ b/drivers/isdn/capi/capi.c
@@ -452,18 +452,19 @@ static int handle_recv_skb(struct capiminor *mp, struct sk_buff *skb)
 	struct sk_buff *nskb;
 	int datalen;
 	u16 errcode, datahandle;
+	struct tty_struct *tty;
 	struct tty_ldisc *ld;
 	
 	datalen = skb->len - CAPIMSG_LEN(skb->data);
-	if (mp->tty == NULL)
-	{
+	tty = mp->tty;
+	if (tty == NULL) {
 #ifdef _DEBUG_DATAFLOW
 		printk(KERN_DEBUG "capi: currently no receiver\n");
 #endif
 		return -1;
 	}
 	
-	ld = tty_ldisc_ref(mp->tty);
+	ld = tty_ldisc_ref(tty);
 	if (ld == NULL)
 		return -1;
 	if (ld->ops->receive_buf == NULL) {
@@ -478,7 +479,7 @@ static int handle_recv_skb(struct capiminor *mp, struct sk_buff *skb)
 #endif
 		goto bad;
 	}
-	if (mp->tty->receive_room < datalen) {
+	if (tty->receive_room < datalen) {
 #if defined(_DEBUG_DATAFLOW) || defined(_DEBUG_TTYFUNCS)
 		printk(KERN_DEBUG "capi: no room in tty\n");
 #endif
@@ -501,7 +502,7 @@ static int handle_recv_skb(struct capiminor *mp, struct sk_buff *skb)
 	printk(KERN_DEBUG "capi: DATA_B3_RESP %u len=%d => ldisc\n",
 				datahandle, skb->len);
 #endif
-	ld->ops->receive_buf(mp->tty, skb->data, NULL, skb->len);
+	ld->ops->receive_buf(tty, skb->data, NULL, skb->len);
 	kfree_skb(skb);
 	tty_ldisc_deref(ld);
 	return 0;
@@ -653,7 +654,9 @@ static void capi_recv_message(struct capi20_appl *ap, struct sk_buff *skb)
 #endif
 		skb_queue_tail(&mp->inqueue, skb);
 		mp->inbytes += skb->len;
+		spin_unlock_irqrestore(&workaround_lock, flags);
 		handle_minor_recv(mp);
+		return;
 
 	} else if (CAPIMSG_SUBCOMMAND(skb->data) == CAPI_CONF) {
 
@@ -667,7 +670,9 @@ static void capi_recv_message(struct capi20_appl *ap, struct sk_buff *skb)
 		(void)capiminor_del_ack(mp, datahandle);
 		if (mp->tty)
 			tty_wakeup(mp->tty);
-		(void)handle_minor_send(mp);
+		spin_unlock_irqrestore(&workaround_lock, flags);
+		handle_minor_send(mp);
+		return;
 
 	} else {
 		/* ups, let capi application handle it :-) */
@@ -1042,8 +1047,8 @@ static int capinc_tty_open(struct tty_struct * tty, struct file * file)
 #ifdef _DEBUG_REFCOUNT
 	printk(KERN_DEBUG "capinc_tty_open ocount=%d\n", atomic_read(&mp->ttyopencount));
 #endif
-	handle_minor_recv(mp);
 	spin_unlock_irqrestore(&workaround_lock, flags);
+	handle_minor_recv(mp);
 	return 0;
 }
 
@@ -1110,9 +1115,9 @@ static int capinc_tty_write(struct tty_struct * tty,
 
 	skb_queue_tail(&mp->outqueue, skb);
 	mp->outbytes += skb->len;
-	(void)handle_minor_send(mp);
-	(void)handle_minor_recv(mp);
 	spin_unlock_irqrestore(&workaround_lock, flags);
+	handle_minor_send(mp);
+	handle_minor_recv(mp);
 	return count;
 }
 
@@ -1145,7 +1150,6 @@ static int capinc_tty_put_char(struct tty_struct *tty, unsigned char ch)
 		mp->ttyskb = NULL;
 		skb_queue_tail(&mp->outqueue, skb);
 		mp->outbytes += skb->len;
-		(void)handle_minor_send(mp);
 	}
 	skb = alloc_skb(CAPI_DATA_B3_REQ_LEN+CAPI_MAX_BLKSIZE, GFP_ATOMIC);
 	if (skb) {
@@ -1157,6 +1161,7 @@ static int capinc_tty_put_char(struct tty_struct *tty, unsigned char ch)
 		ret = 0;
 	}
 	spin_unlock_irqrestore(&workaround_lock, flags);
+	handle_minor_send(mp);
 	return ret;
 }
 
@@ -1183,10 +1188,10 @@ static void capinc_tty_flush_chars(struct tty_struct *tty)
 		mp->ttyskb = NULL;
 		skb_queue_tail(&mp->outqueue, skb);
 		mp->outbytes += skb->len;
-		(void)handle_minor_send(mp);
 	}
-	(void)handle_minor_recv(mp);
 	spin_unlock_irqrestore(&workaround_lock, flags);
+	handle_minor_send(mp);
+	handle_minor_recv(mp);
 }
 
 static int capinc_tty_write_room(struct tty_struct *tty)
@@ -1264,8 +1269,8 @@ static void capinc_tty_unthrottle(struct tty_struct * tty)
 	if (mp) {
 		spin_lock_irqsave(&workaround_lock, flags);
 		mp->ttyinstop = 0;
-		handle_minor_recv(mp);
 		spin_unlock_irqrestore(&workaround_lock, flags);
+		handle_minor_recv(mp);
 	}
 }
 
@@ -1290,8 +1295,8 @@ static void capinc_tty_start(struct tty_struct *tty)
 	if (mp) {
 		spin_lock_irqsave(&workaround_lock, flags);
 		mp->ttyoutstop = 0;
-		(void)handle_minor_send(mp);
 		spin_unlock_irqrestore(&workaround_lock, flags);
+		handle_minor_send(mp);
 	}
 }
 
-- 
1.6.2.1.214.ge986c

^ permalink raw reply related

* Re: [PATCH] net: Fix wrong sizeof
From: Jan Ceuleers @ 2009-10-03 15:38 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20091002.095402.42770342.davem@davemloft.net>

David Miller wrote:
> Any time you see "&" in a sizeof() expression, it's almost
> certainly a bug.  Something for the folks with automated
> tools to look for if they haven't already :-)

Your remark prompted me to look, and I found four more instances of this bug (none of them in the networking code). I have submitted patches.
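
For readers unfamiliar with the pattern, here is a minimal sketch of the bug class. The struct and function names are hypothetical and are not taken from any of the submitted patches; the only point is that sizeof(&x) yields the size of a pointer, not the size of the object.

```c
#include <stddef.h>

/* Hypothetical 64-byte object, for illustration only. */
struct example_hdr {
	unsigned char bytes[64];
};

/* Buggy form: sizeof(&h) is the size of a pointer to the struct. */
static size_t example_len_buggy(void)
{
	struct example_hdr h;

	h.bytes[0] = 0;
	return sizeof(&h);
}

/* Intended form: sizeof(h) is the size of the struct itself. */
static size_t example_len_fixed(void)
{
	struct example_hdr h;

	h.bytes[0] = 0;
	return sizeof(h);
}
```

Passing the buggy length to memset() or memcpy() would touch only the first pointer-sized part of the object, which is why the stray "&" is worth grepping for.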

Thank you.

Jan

^ permalink raw reply

* Re: [Bug #14301] WARNING: at net/ipv4/af_inet.c:154
From: Eric Dumazet @ 2009-10-03 17:53 UTC (permalink / raw)
  Cc: Rafael J. Wysocki, Ralf Hildebrandt, Linux Kernel Mailing List,
	Kernel Testers List, Herbert Xu, Linux Netdev List, Wei Yongjun,
	David S. Miller
In-Reply-To: <4AC710DF.5070705@gmail.com>

Eric Dumazet a écrit :
> Eric Dumazet a écrit :
>> Rafael J. Wysocki a écrit :
>>> This message has been generated automatically as a part of a report
>>> of regressions introduced between 2.6.30 and 2.6.31.
>>>
>>> The following bug entry is on the current list of known regressions
>>> introduced between 2.6.30 and 2.6.31.  Please verify if it still should
>>> be listed and let me know (either way).
>>>
>>>
>>> Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=14301
>>> Subject		: WARNING: at net/ipv4/af_inet.c:154
>>> Submitter	: Ralf Hildebrandt <Ralf.Hildebrandt@charite.de>
>>> Date		: 2009-09-30 12:24 (2 days old)
>>> References	: http://marc.info/?l=linux-kernel&m=125431350218137&w=4
>>>
>>>
>> If commit d99927f4d93f36553699573b279e0ff98ad7dea6
>> (net: Fix sock_wfree() race) doesn't fix this problem, then
>> maybe we should take a look at an old patch.
>>
>> < data mining... running... output results to lkml/netdev >
>>
>> Random guesses
>>
>>  1) : commit d55d87fdff8252d0e2f7c28c2d443aee17e9d70f
>> (net: Move rx skb_orphan call to where needed)
>>
>> A similar problem on SCTP was fixed by commit 
>> 1bc4ee4088c9a502db0e9c87f675e61e57fa1734
>> (sctp: fix warning at inet_sock_destruct() while release sctp socket)
>>
>> 2) CORK and UDP sockets
>>   It seems we can leave an UDP socket with a frame in sk_write_queue
>>   Purge of this queue is done by udp_flush_pending_frames()
>>    This calls ip_flush_pending_frames()
>>    But this function only calls kfree_skb(), not sk_wmem_free_skb()...
>>
>>
>> Could you try following patch ?
>>
> 
> Hmm, I missed the ip_cork_release(), here is an updated version.
> 

Please ignore this patch; I was wrong. sk_forward_alloc is not used on the
transmit side for UDP, only on the receive side, so CORK/UDP should be fine.

Investigation still needed...



^ permalink raw reply

* Re: [PATCH RFC] isdn/capi: fix up CAPI subsystem workaround locking a bit
From: Michael Buesch @ 2009-10-03 18:26 UTC (permalink / raw)
  To: Tilman Schmidt
  Cc: i4ldeveloper, Carsten Paeth, Karsten Keil, Karsten Keil,
	Armin Schindler, isdn4linux, netdev, linux-kernel
In-Reply-To: <20091003120657.2228911186C@xenon.ts.pxnet.com>

On Saturday 03 October 2009 14:06:57 Tilman Schmidt wrote:
> Move calls to handle_minor_send() and handle_minor_recv() out of
> the sections locked by workaround_lock.
> - handle_minor_send() may call another CAPI function via the card
>   driver, deadlocking by trying to take workaround_lock again.
> - handle_minor_recv() calls the receive_buf method of the active
>   line discipline which may sleep.

I remember that handle_minor_send() and/or handle_minor_recv() showed up
in the crash backtraces. So if you move them out of the critical
section, you can as well remove the lock completely.
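
The re-entry problem in the quoted description can be pictured with a user-space stand-in. This is only a sketch under stated assumptions: the flag ex_lock_held models a non-recursive lock, and ex_handler_send() is a hypothetical handler that takes the same lock again; neither is the driver's actual code.

```c
#include <stdbool.h>

static bool ex_lock_held;	/* stand-in for a non-recursive lock */
static int ex_outbytes;		/* stand-in for queued byte count */

static bool ex_try_lock(void)
{
	if (ex_lock_held)
		return false;
	ex_lock_held = true;
	return true;
}

static void ex_unlock(void)
{
	ex_lock_held = false;
}

/* Hypothetical handler that re-enters code taking the same lock. */
static bool ex_handler_send(void)
{
	if (!ex_try_lock())
		return false;	/* a real spinlock would deadlock here */
	ex_outbytes = 0;
	ex_unlock();
	return true;
}

/* Old ordering: handler called inside the locked section. */
static bool ex_write_old(int len)
{
	bool ok;

	ex_try_lock();
	ex_outbytes += len;
	ok = ex_handler_send();
	ex_unlock();
	return ok;
}

/* New ordering: handler called only after the lock is dropped. */
static bool ex_write_new(int len)
{
	ex_try_lock();
	ex_outbytes += len;
	ex_unlock();
	return ex_handler_send();
}
```

In this model ex_write_old() always reports the failed re-entry, while ex_write_new() succeeds. It says nothing about whether the remaining locked section still protects anything useful, which is the question raised above.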

-- 
Greetings, Michael.

^ permalink raw reply

* [PATCH] TCPCT+1: initial SYN exchange with SYNACK data
From: William Allen Simpson @ 2009-10-03 18:33 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1608 bytes --]

This is a straightforward re-implementation of an earlier (year-old) patch
that no longer applies cleanly, with permission of the original author
(Adam Langley).  The patch was previously reviewed:

   http://thread.gmane.org/gmane.linux.network/102586

The principal difference is using a TCP option to carry the cookie nonce,
instead of a user-configured offset in the data.  This is more flexible and
less subject to user configuration error.  Such a cookie option has been
suggested for many years, and is also useful without SYN data, allowing
several related concepts to use the same extension option.

   "Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
   http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html

   "Re: what a new TCP header might look like", May 12, 1998.
   ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail
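
As a rough illustration of the option layout, the sketch below maps a received option length to the cookie length it carries. The numeric values mirror TCP_COOKIE_MIN, TCP_COOKIE_MAX and TCPOLEN_COOKIE_BASE in the attached patch; the EX_ names and helper functions are hypothetical. Note one deliberate simplification: the patch applies the even-length check in the setsockopt path and only the range check when parsing, whereas this sketch applies both in one place.

```c
#include <stdbool.h>

/* Values as in the patch; the EX_ names are illustrative only. */
#define EX_COOKIE_MIN	8u	/* 64 bits */
#define EX_COOKIE_MAX	16u	/* 128 bits */
#define EX_COOKIE_BASE	2u	/* kind + length bytes */

/* A cookie must be an even number of bytes within [MIN, MAX]. */
static bool ex_cookie_size_ok(unsigned int size)
{
	return !(size & 1u) && size >= EX_COOKIE_MIN && size <= EX_COOKIE_MAX;
}

/* Returns the cookie length carried by an option of opsize bytes,
 * or -1 if the option cannot hold a valid cookie.
 */
static int ex_cookie_size_from_opsize(unsigned int opsize)
{
	unsigned int size;

	if (opsize < EX_COOKIE_BASE)
		return -1;
	size = opsize - EX_COOKIE_BASE;
	return ex_cookie_size_ok(size) ? (int)size : -1;
}
```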

As suggested, the CONFIG_ option was replaced by a sysctl (tcp_cookie_size)
that sets the global default cookie size; zero leaves the option off by default.
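
Because the generic sysctl integer handler accepts any value, the effective size has to be sanitized at the point of use. The following user-space sketch restates that logic as I read it from tcp_cookie_size_check() in the attached patch; the function name and signature here are hypothetical, and the constants 8 and 16 stand for TCP_COOKIE_MIN and TCP_COOKIE_MAX.

```c
/* Sketch of the default-size selection, assuming the rules in the patch:
 * a per-socket value wins; otherwise the sysctl value is clamped to
 * [8, 16] and rounded up to an even number of bytes.  Zero means "off".
 */
static unsigned int ex_effective_cookie_size(unsigned int desired,
					     int sysctl_value)
{
	if (desired > 0)
		return desired;		/* previously specified per socket */
	if (sysctl_value == 0)
		return 0;		/* no global default */
	if (sysctl_value < 8)
		return 8;		/* clamp to minimum */
	if (sysctl_value > 16)
		return 16;		/* clamp to maximum */
	if (sysctl_value & 1)
		return (unsigned int)sysctl_value + 1;	/* odd: round up */
	return (unsigned int)sysctl_value;
}
```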

These functions will also be used in subsequent patches that implement
additional features.
---
  include/linux/tcp.h        |   92 +++++++++++++++++--
  include/net/tcp.h          |   48 +++++++++-
  net/ipv4/sysctl_net_ipv4.c |    8 ++
  net/ipv4/tcp.c             |   97 +++++++++++++++++++-
  net/ipv4/tcp_input.c       |   38 ++++++++-
  net/ipv4/tcp_ipv4.c        |   54 ++++++++++-
  net/ipv4/tcp_minisocks.c   |   32 +++++-
  net/ipv4/tcp_output.c      |  220 +++++++++++++++++++++++++++++++++++++++++---
  net/ipv6/tcp_ipv6.c        |   35 +++++++-
  9 files changed, 580 insertions(+), 44 deletions(-)


[-- Attachment #2: tcpct+1.patch --]
[-- Type: text/plain, Size: 33275 bytes --]

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 61723a7..bdd1a7f 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -96,6 +96,7 @@ enum {
 #define TCP_QUICKACK		12	/* Block/reenable quick acks */
 #define TCP_CONGESTION		13	/* Congestion control algorithm */
 #define TCP_MD5SIG		14	/* TCP MD5 Signature (RFC2385) */
+#define TCP_COOKIE_DATA		15	/* TCP Cookie Transactions extension */
 
 #define TCPI_OPT_TIMESTAMPS	1
 #define TCPI_OPT_SACK		2
@@ -170,6 +171,33 @@ struct tcp_md5sig {
 	__u8	tcpm_key[TCP_MD5SIG_MAXKEYLEN];		/* key (binary) */
 };
 
+/* for TCP_COOKIE_DATA socket option */
+#define TCP_COOKIE_MAX		16		/* 128-bits */
+#define TCP_COOKIE_MIN		 8		/*  64-bits */
+#define TCP_COOKIE_PAIR_SIZE	(2*TCP_COOKIE_MAX)
+
+#define TCP_S_DATA_MAX		64U		/* after TCP+IP options */
+#define TCP_S_DATA_MSS_DEFAULT	536U		/* default MSS (RFC1122) */
+
+/* Flags for both getsockopt and setsockopt */
+#define TCP_COOKIE_IN_ALWAYS	(1 << 0)	/* Discard SYN without cookie */
+#define TCP_COOKIE_OUT_NEVER	(1 << 1)	/* Prohibit outgoing cookies.
+						   Supersedes the others. */
+
+/* Flags for getsockopt */
+#define TCP_S_DATA_IN		(1 << 2)	/* Was data received? */
+#define TCP_S_DATA_OUT		(1 << 3)	/* Was data sent? */
+
+/* TCP Cookie Transactions data */
+struct tcp_cookie_data {
+	__u16	tcpcd_flags;			/* see above */
+	__u8	__tcpcd_pad1;			/* zero */
+	__u8	tcpcd_cookie_desired;		/* bytes */
+	__u16	tcpcd_s_data_desired;		/* bytes of variable data */
+	__u16	tcpcd_used;			/* bytes in value */
+	__u8	tcpcd_value[TCP_S_DATA_MSS_DEFAULT];
+};
+
 #ifdef __KERNEL__
 
 #include <linux/skbuff.h>
@@ -210,33 +238,53 @@ struct tcp_options_received {
 	u32	ts_recent;	/* Time stamp to echo next		*/
 	u32	rcv_tsval;	/* Time stamp value             	*/
 	u32	rcv_tsecr;	/* Time stamp echo reply        	*/
-	u16 	saw_tstamp : 1,	/* Saw TIMESTAMP on last packet		*/
+	u32 	saw_tstamp : 1,	/* Saw TIMESTAMP on last packet		*/
 		tstamp_ok : 1,	/* TIMESTAMP seen on SYN packet		*/
 		dsack : 1,	/* D-SACK is scheduled			*/
 		wscale_ok : 1,	/* Wscale seen on SYN packet		*/
 		sack_ok : 4,	/* SACK seen on SYN packet		*/
 		snd_wscale : 4,	/* Window scaling received from sender	*/
-		rcv_wscale : 4;	/* Window scaling to send to receiver	*/
-/*	SACKs data	*/
+		rcv_wscale : 4,	/* Window scaling to send to receiver	*/
+		extend_ok:1;	/* Cookie{less,pair} option seen	*/
+	u8	*cookie_copy;	/* temporary pointer			*/
+	u8	cookie_size;	/* bytes in copy			*/
 	u8	num_sacks;	/* Number of SACK blocks		*/
-	u16	user_mss;  	/* mss requested by user in ioctl */
+	u16	user_mss;	/* mss requested by user in ioctl	*/
 	u16	mss_clamp;	/* Maximal mss, negotiated at connection setup */
 };
 
+static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
+{
+	rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
+	rx_opt->wscale_ok = rx_opt->snd_wscale = 0;
+	rx_opt->cookie_size = rx_opt->extend_ok = 0;
+}
+
 /* This is the max number of SACKS that we'll generate and process. It's safe
  * to increse this, although since:
  *   size = TCPOLEN_SACK_BASE_ALIGNED (4) + n * TCPOLEN_SACK_PERBLOCK (8)
  * only four options will fit in a standard TCP header */
 #define TCP_NUM_SACKS 4
 
+struct tcp_cookie_pair;
+struct tcp_s_data_payload;
+
 struct tcp_request_sock {
 	struct inet_request_sock 	req;
 #ifdef CONFIG_TCP_MD5SIG
 	/* Only used by TCP MD5 Signature so far. */
 	const struct tcp_request_sock_ops *af_specific;
 #endif
-	u32			 	rcv_isn;
-	u32			 	snt_isn;
+	u32				rcv_isn;
+	u32				snt_isn;
+
+	/* Cookie Transactions */
+	u8				*cookie_copy;	/* temporary pointer */
+	u8				cookie_size;	/* bytes in copy */
+	u8				s_data_in:1,
+					s_data_out:1,
+					cookie_in_always:1,
+					cookie_out_never:1;
 };
 
 static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
@@ -406,6 +454,32 @@ struct tcp_sock {
 /* TCP MD5 Signature Option information */
 	struct tcp_md5sig_info	*md5sig_info;
 #endif
+
+	/* If s_data_desired > 0 and s_data_payload is non-NULL, then this
+	 * object holds a reference to it (s_data_payload->kref)
+	 */
+	struct tcp_s_data_payload	*s_data_payload;
+
+	/* When the cookie options are generated and exchanged, then this
+	 * object holds a reference to them (cookie_pair->kref)
+	 */
+	struct tcp_cookie_pair	  	*cookie_pair;
+
+	/* If s_data_payload is non-NULL, then this holds a copy of
+	 * s_data_payload->tsdpl_size.  Otherwise, this holds the user
+	 * specified tcpcd_s_data_desired (variable data).
+	 */
+	u16				s_data_desired;	/* bytes */
+
+	/* Initially, this holds the user specified tcpcd_cookie_desired.
+	 * Zero indicates default (sysctl_tcp_cookie_size).  After the
+	 * option has been exchanged, this holds the actual size.
+	 */
+	u8				cookie_desired;	/* bytes */
+	u8				s_data_in:1,
+					s_data_out:1,
+					cookie_in_always:1,
+					cookie_out_never:1;
 };
 
 static inline struct tcp_sock *tcp_sk(const struct sock *sk)
@@ -424,6 +498,10 @@ struct tcp_timewait_sock {
 	u16			  tw_md5_keylen;
 	u8			  tw_md5_key[TCP_MD5SIG_MAXKEYLEN];
 #endif
+	/* Few sockets in timewait have cookies; when one does, this
+	 * object holds a reference to the pair (tw_cookie_pair->kref)
+	 */
+	struct tcp_cookie_pair	  *tw_cookie_pair;
 };
 
 static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
@@ -431,6 +509,6 @@ static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
 	return (struct tcp_timewait_sock *)sk;
 }
 
-#endif
+#endif	/* __KERNEL__ */
 
 #endif	/* _LINUX_TCP_H */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 03a49c7..7fb2456 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -30,6 +30,7 @@
 #include <linux/dmaengine.h>
 #include <linux/crypto.h>
 #include <linux/cryptohash.h>
+#include <linux/kref.h>
 
 #include <net/inet_connection_sock.h>
 #include <net/inet_timewait_sock.h>
@@ -167,6 +168,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOPT_SACK             5       /* SACK Block */
 #define TCPOPT_TIMESTAMP	8	/* Better RTT estimations/PAWS */
 #define TCPOPT_MD5SIG		19	/* MD5 Signature (RFC2385) */
+#define TCPOPT_COOKIE		253	/* Cookie extension (experimental) */
 
 /*
  *     TCP option lengths
@@ -177,6 +179,10 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOLEN_SACK_PERM      2
 #define TCPOLEN_TIMESTAMP      10
 #define TCPOLEN_MD5SIG         18
+#define TCPOLEN_COOKIE_BASE    2	/* Cookie-less header extension */
+#define TCPOLEN_COOKIE_PAIR    3	/* Cookie pair header extension */
+#define TCPOLEN_COOKIE_MAX     (TCPOLEN_COOKIE_BASE+TCP_COOKIE_MAX)
+#define TCPOLEN_COOKIE_MIN     (TCPOLEN_COOKIE_BASE+TCP_COOKIE_MIN)
 
 /* But this is what stacks really send out. */
 #define TCPOLEN_TSTAMP_ALIGNED		12
@@ -237,6 +243,7 @@ extern int sysctl_tcp_base_mss;
 extern int sysctl_tcp_workaround_signed_windows;
 extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_max_ssthresh;
+extern int sysctl_tcp_cookie_size;
 
 extern atomic_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
@@ -343,11 +350,6 @@ static inline void tcp_dec_quickack_mode(struct sock *sk,
 
 extern void tcp_enter_quickack_mode(struct sock *sk);
 
-static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
-{
- 	rx_opt->tstamp_ok = rx_opt->sack_ok = rx_opt->wscale_ok = rx_opt->snd_wscale = 0;
-}
-
 #define	TCP_ECN_OK		1
 #define	TCP_ECN_QUEUE_CWR	2
 #define	TCP_ECN_DEMAND_CWR	4
@@ -1480,6 +1482,42 @@ struct tcp_request_sock_ops {
 #endif
 };
 
+/**
+ * This structure contains variable data that is to be included in the
+ * cookie option and compared with later incoming segments.
+ *
+ * A tcp_sock contains a pointer to the current value, and this is cloned to
+ * the tcp_timewait_sock.
+ */
+struct tcp_cookie_pair {
+	struct kref	kref;
+	/* 32-bit aligned for faster comparisons? */
+	u8		tcpcp_data[TCP_COOKIE_PAIR_SIZE];
+	u8		tcpcp_size;	/* of the cookie pair */
+};
+
+static inline void tcp_cookie_pair_release(struct kref *kref)
+{
+	kfree(container_of(kref, struct tcp_cookie_pair, kref));
+}
+
+/**
+ * This structure contains constant data that is to be included in the
+ * payload of SYN or SYNACK segments when the cookie option is present.
+ *
+ * This structure is immutable (save for the reference counter) once created.
+ */
+struct tcp_s_data_payload {
+	struct kref	kref;
+	u16		tsdpl_size;	/* of the trailing payload */
+	u8		tsdpl_data[0];	/* trailing payload */
+};
+
+static inline void tcp_s_data_payload_release(struct kref *kref)
+{
+	kfree(container_of(kref, struct tcp_s_data_payload, kref));
+}
+
 extern void tcp_v4_init(void);
 extern void tcp_init(void);
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 2dcf04d..3422c54 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -714,6 +714,14 @@ static struct ctl_table ipv4_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "tcp_cookie_size",
+		.data		= &sysctl_tcp_cookie_size,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "udp_mem",
 		.data		= &sysctl_udp_mem,
 		.maxlen		= sizeof(sysctl_udp_mem),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5a15e76..87f4939 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2039,8 +2039,8 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 	int val;
 	int err = 0;
 
-	/* This is a string value all the others are int's */
-	if (optname == TCP_CONGESTION) {
+	/* These are data/string values, all the others are ints */
+	if (TCP_CONGESTION == optname) {
 		char name[TCP_CA_NAME_MAX];
 
 		if (optlen < 1)
@@ -2056,6 +2056,61 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		err = tcp_set_congestion_control(sk, name);
 		release_sock(sk);
 		return err;
+	} else if (TCP_COOKIE_DATA == optname) {
+		struct tcp_cookie_data tcd;
+		struct tcp_s_data_payload *tsdplp;
+
+		if (optlen < sizeof(tcd)) {
+			return -EINVAL;
+		}
+		if (copy_from_user(&tcd, optval, sizeof(tcd))) {
+			return -EFAULT;
+		}
+		if (0 == tcd.tcpcd_cookie_desired) {
+			/* default to global value */
+		} else if ((0x1 & tcd.tcpcd_cookie_desired)
+			|| TCP_COOKIE_MAX < tcd.tcpcd_cookie_desired
+			|| TCP_COOKIE_MIN > tcd.tcpcd_cookie_desired) {
+			return -EINVAL;
+		}
+
+		lock_sock(sk);
+		tp->cookie_in_always = (TCP_COOKIE_IN_ALWAYS & tcd.tcpcd_flags);
+		tp->cookie_out_never = (TCP_COOKIE_OUT_NEVER & tcd.tcpcd_flags);
+		tp->cookie_desired = tcd.tcpcd_cookie_desired;
+
+		/* If there's no constant data, save tcpcd_s_data_desired.
+		 * Otherwise, copy the length of the constant data instead.
+		 */
+		if (0 == tcd.tcpcd_used) {
+			if (NULL != tp->s_data_payload) {
+				kref_put(&tp->s_data_payload->kref,
+					 tcp_s_data_payload_release);
+				tp->s_data_payload = NULL;
+			}
+			tp->s_data_desired = tcd.tcpcd_s_data_desired;
+		} else if (sizeof(tcd.tcpcd_value) < tcd.tcpcd_used) {
+			err = -EINVAL;
+		} else if (NULL != (tsdplp =
+				    kmalloc(sizeof(struct tcp_s_data_payload)
+					    + tcd.tcpcd_used,
+					    GFP_ATOMIC))) {
+			if (unlikely(tp->s_data_payload)) {
+				kref_put(&tp->s_data_payload->kref,
+					 tcp_s_data_payload_release);
+			}
+			kref_init(&tsdplp->kref);
+			memcpy(tsdplp->tsdpl_data, tcd.tcpcd_value,
+			       tcd.tcpcd_used);
+			tsdplp->tsdpl_size =
+			tp->s_data_desired = tcd.tcpcd_used;
+			tp->s_data_payload = tsdplp;
+		} else {
+			err = -ENOMEM;
+		}
+
+		release_sock(sk);
+		return err;
 	}
 
 	if (optlen < sizeof(int))
@@ -2318,6 +2373,44 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	if (get_user(len, optlen))
 		return -EFAULT;
 
+	/* These are data/string values, all the others are ints */
+	if (TCP_COOKIE_DATA == optname) {
+		struct tcp_cookie_data tcd;
+		struct tcp_cookie_pair *tcpcpp = tp->cookie_pair;
+
+		if (len < sizeof(tcd)) {
+			return -EINVAL;
+		}
+
+		memset(&tcd, 0, sizeof(tcd));
+		tcd.tcpcd_flags = (tp->s_data_in ? TCP_S_DATA_IN : 0)
+				| (tp->s_data_out ? TCP_S_DATA_OUT : 0)
+				| (tp->cookie_in_always ? TCP_COOKIE_IN_ALWAYS : 0)
+				| (tp->cookie_out_never ? TCP_COOKIE_OUT_NEVER : 0);
+
+		tcd.tcpcd_cookie_desired = tp->cookie_desired;
+		tcd.tcpcd_s_data_desired = tp->s_data_desired;
+
+		if (NULL != tcpcpp) {
+			/* Cookie(s) saved, return as nonce */
+			if (sizeof(tcd.tcpcd_value) < tcpcpp->tcpcp_size) {
+				/* impossible? */
+				return -EINVAL;
+			}
+			memcpy(&tcd.tcpcd_value[0], &tcpcpp->tcpcp_data[0],
+			       tcpcpp->tcpcp_size);
+			tcd.tcpcd_used = tcpcpp->tcpcp_size;
+		}
+
+		if (copy_to_user(optval, &tcd, sizeof(tcd))) {
+			return -EFAULT;
+		}
+		if (put_user(sizeof(tcd), optlen)) {
+			return -EFAULT;
+		}
+		return 0;
+	}
+
 	len = min_t(unsigned int, len, sizeof(int));
 
 	if (len < 0)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d86784b..88ffca9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3782,6 +3782,21 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
 				 */
 				break;
 #endif
+			case TCPOPT_COOKIE:
+				/* This option carries 3 different lengths.
+				 */
+				if (TCPOLEN_COOKIE_MAX >= opsize
+				 && TCPOLEN_COOKIE_MIN <= opsize) {
+					opt_rx->cookie_size =
+						opsize - TCPOLEN_COOKIE_BASE;
+					opt_rx->cookie_copy = ptr;
+					opt_rx->extend_ok = 1;
+				} else if (TCPOLEN_COOKIE_PAIR == opsize) {
+					/* not yet implemented */
+				} else if (TCPOLEN_COOKIE_BASE == opsize) {
+					/* not yet implemented */
+				}
+				break;
 			}
 
 			ptr += opsize-2;
@@ -5364,6 +5379,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	int saved_clamp = tp->rx_opt.mss_clamp;
+	bool s_data_queued = false;
 
 	tcp_parse_options(skb, &tp->rx_opt, 0);
 
@@ -5462,6 +5478,23 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 		 * Change state from SYN-SENT only after copied_seq
 		 * is initialized. */
 		tp->copied_seq = tp->rcv_nxt;
+
+		/* If the cookie extension option is present, and there's
+		 * some incoming transaction data, queue it.
+		 */
+		if (tp->rx_opt.extend_ok
+		 && skb->len > (th->doff << 2)) {
+			__skb_pull(skb, th->doff << 2);
+			__skb_queue_tail(&sk->sk_receive_queue, skb);
+			skb_set_owner_r(skb, sk);
+			sk->sk_data_ready(sk, 0);
+			s_data_queued = true;
+			tp->s_data_in = 1; /* true */
+			tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+			tp->rcv_wup = TCP_SKB_CB(skb)->end_seq;
+			tp->copied_seq = TCP_SKB_CB(skb)->seq + 1;
+		}
+
 		smp_mb();
 		tcp_set_state(sk, TCP_ESTABLISHED);
 
@@ -5513,11 +5546,14 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 						  TCP_DELACK_MAX, TCP_RTO_MAX);
 
 discard:
-			__kfree_skb(skb);
+			if (!s_data_queued)
+				__kfree_skb(skb);
 			return 0;
 		} else {
 			tcp_send_ack(sk);
 		}
+		if (s_data_queued)
+			return 0;
 		return -1;
 	}
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7cda24b..67eb529 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -217,7 +217,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (inet->opt)
 		inet_csk(sk)->icsk_ext_hdr_len = inet->opt->optlen;
 
-	tp->rx_opt.mss_clamp = 536;
+	tp->rx_opt.mss_clamp = TCP_MIN_RCVMSS;
 
 	/* Socket identity is still unknown (sport may be zero).
 	 * However we set state to SYN-SENT and not releasing socket
@@ -1210,9 +1210,11 @@ static struct timewait_sock_ops tcp_timewait_sock_ops = {
 
 int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 {
-	struct inet_request_sock *ireq;
+	u8 bakery[TCP_COOKIE_MAX];
 	struct tcp_options_received tmp_opt;
+	struct inet_request_sock *ireq;
 	struct request_sock *req;
+	struct tcp_sock *tp = tcp_sk(sk);
 	__be32 saddr = ip_hdr(skb)->saddr;
 	__be32 daddr = ip_hdr(skb)->daddr;
 	__u32 isn = TCP_SKB_CB(skb)->when;
@@ -1257,16 +1259,37 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 #endif
 
 	tcp_clear_options(&tmp_opt);
-	tmp_opt.mss_clamp = 536;
-	tmp_opt.user_mss  = tcp_sk(sk)->rx_opt.user_mss;
+	tmp_opt.mss_clamp = TCP_MIN_RCVMSS;
+	tmp_opt.user_mss  = tp->rx_opt.user_mss;
 
 	tcp_parse_options(skb, &tmp_opt, 0);
 
+	if (tmp_opt.extend_ok
+	 && tmp_opt.saw_tstamp
+	 && !tp->cookie_out_never
+	 && (0 < tp->cookie_desired || 0 < sysctl_tcp_cookie_size)) {
+#ifdef CONFIG_SYN_COOKIES
+		want_cookie = 0;	/* not our kind of cookie */
+#endif
+		tcp_rsk(req)->cookie_out_never = 0;
+		tcp_rsk(req)->cookie_copy = bakery;
+		tcp_rsk(req)->cookie_size = tmp_opt.cookie_size;
+
+		/* secret recipe not yet implemented */
+		get_random_bytes(bakery, tmp_opt.cookie_size);
+	} else if (!tp->cookie_in_always) {
+		/* redundant indications, but ensure initialization. */
+		tcp_rsk(req)->cookie_out_never = 1;
+		tcp_rsk(req)->cookie_size = 0;
+	} else {
+		goto drop_and_free;
+	}
+	tcp_rsk(req)->cookie_in_always = tp->cookie_in_always;
+
 	if (want_cookie && !tmp_opt.saw_tstamp)
 		tcp_clear_options(&tmp_opt);
 
 	tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
-
 	tcp_openreq_init(req, &tmp_opt, skb);
 
 	ireq = inet_rsk(req);
@@ -1810,7 +1833,7 @@ static int tcp_v4_init_sock(struct sock *sk)
 	 */
 	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
 	tp->snd_cwnd_clamp = ~0;
-	tp->mss_cache = 536;
+	tp->mss_cache = TCP_MIN_RCVMSS;
 
 	tp->reordering = sysctl_tcp_reordering;
 	icsk->icsk_ca_ops = &tcp_init_congestion_ops;
@@ -1826,6 +1849,14 @@ static int tcp_v4_init_sock(struct sock *sk)
 	tp->af_specific = &tcp_sock_ipv4_specific;
 #endif
 
+/* For grep, in order of appearance:
+ *	tp->s_data_payload = NULL;
+ *	tp->cookie_pair = NULL;
+ *	tp->s_data_desired = tp->cookie_desired = 0;
+ *	tp->s_data_in = tp->s_data_out = 0;
+ *	tp->cookie_in_always = tp->cookie_out_never = 0;
+ */
+
 	sk->sk_sndbuf = sysctl_tcp_wmem[1];
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
@@ -1879,6 +1910,17 @@ void tcp_v4_destroy_sock(struct sock *sk)
 		sk->sk_sndmsg_page = NULL;
 	}
 
+	if (NULL != tp->s_data_payload) {
+		kref_put(&tp->s_data_payload->kref,
+			 tcp_s_data_payload_release);
+		tp->s_data_payload = NULL;
+	}
+	if (NULL != tp->cookie_pair) {
+		kref_put(&tp->cookie_pair->kref,
+			 tcp_cookie_pair_release);
+		tp->cookie_pair = NULL;
+	}
+
 	percpu_counter_dec(&tcp_sockets_allocated);
 }
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 624c3c9..c38e901 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -375,6 +375,13 @@ static inline void TCP_ECN_openreq_child(struct tcp_sock *tp,
 	tp->ecn_flags = inet_rsk(req)->ecn_ok ? TCP_ECN_OK : 0;
 }
 
+static inline int tcp_s_data_size(const struct tcp_sock *tp)
+{
+	return (0 < tp->s_data_desired && NULL != tp->s_data_payload)
+		? tp->s_data_payload->tsdpl_size
+		: 0;
+}
+
 /* This is not only more efficient than what we used to do, it eliminates
  * a lot of code duplication between IPv4/IPv6 SYN recv processing. -DaveM
  *
@@ -394,9 +401,12 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		/* Now setup tcp_sock */
 		newtp = tcp_sk(newsk);
 		newtp->pred_flags = 0;
-		newtp->rcv_wup = newtp->copied_seq = newtp->rcv_nxt = treq->rcv_isn + 1;
-		newtp->snd_sml = newtp->snd_una = newtp->snd_nxt = treq->snt_isn + 1;
-		newtp->snd_up = treq->snt_isn + 1;
+
+		newtp->rcv_wup = newtp->copied_seq =
+		newtp->rcv_nxt = treq->rcv_isn + 1;
+
+		newtp->snd_sml = newtp->snd_una = newtp->snd_nxt =
+		newtp->snd_up = treq->snt_isn + 1 + tcp_s_data_size(tcp_sk(sk));
 
 		tcp_prequeue_init(newtp);
 
@@ -429,8 +439,17 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		tcp_set_ca_state(newsk, TCP_CA_Open);
 		tcp_init_xmit_timers(newsk);
 		skb_queue_head_init(&newtp->out_of_order_queue);
-		newtp->write_seq = treq->snt_isn + 1;
-		newtp->pushed_seq = newtp->write_seq;
+		newtp->write_seq = newtp->pushed_seq =
+			treq->snt_isn + 1 + tcp_s_data_size(tcp_sk(sk));
+
+		newtp->s_data_payload = NULL;
+		newtp->cookie_pair = NULL;
+		newtp->s_data_desired = 0;
+		newtp->cookie_desired = treq->cookie_size;
+		newtp->s_data_in = treq->s_data_in;
+		newtp->s_data_out = treq->s_data_out;
+		newtp->cookie_in_always = treq->cookie_in_always;
+		newtp->cookie_out_never = treq->cookie_out_never;
 
 		newtp->rx_opt.saw_tstamp = 0;
 
@@ -596,7 +615,8 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	 * Invalid ACK: reset will be sent by listening socket
 	 */
 	if ((flg & TCP_FLAG_ACK) &&
-	    (TCP_SKB_CB(skb)->ack_seq != tcp_rsk(req)->snt_isn + 1))
+	    (TCP_SKB_CB(skb)->ack_seq != tcp_rsk(req)->snt_isn + 1 +
+					 tcp_s_data_size(tcp_sk(sk))))
 		return sk;
 
 	/* Also, it would be not so bad idea to check rcv_tsecr, which
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5200aab..cd6d388 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -59,6 +59,14 @@ int sysctl_tcp_base_mss __read_mostly = 512;
 /* By default, RFC2861 behavior.  */
 int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
 
+#ifdef CONFIG_SYSCTL
+/* By default, let the user enable it. */
+int sysctl_tcp_cookie_size __read_mostly = 0;
+#else
+int sysctl_tcp_cookie_size __read_mostly = TCP_COOKIE_MAX;
+#endif
+
+
 /* Account for new data that has been sent to the network. */
 static void tcp_event_new_data_sent(struct sock *sk, struct sk_buff *skb)
 {
@@ -361,6 +369,8 @@ static inline int tcp_urg_mode(const struct tcp_sock *tp)
 #define OPTION_SACK_ADVERTISE	(1 << 0)
 #define OPTION_TS		(1 << 1)
 #define OPTION_MD5		(1 << 2)
+#define OPTION_WSCALE		(1 << 3)
+#define OPTION_COOKIE_EXTENSION	(1 << 4)
 
 struct tcp_out_options {
 	u8 options;		/* bit field of OPTION_* */
@@ -368,8 +378,35 @@ struct tcp_out_options {
 	u8 num_sack_blocks;	/* number of SACK blocks to include */
 	u16 mss;		/* 0 to disable */
 	__u32 tsval, tsecr;	/* need to include OPTION_TS */
+	u8	*cookie_copy;	/* temporary pointer */
+	u8	cookie_size;	/* bytes in copy */
 };
 
+/* The sysctl int routines are generic, so check consistency here.
+ */
+static u8 tcp_cookie_size_check(u8 desired)
+{
+	if (0 < desired) {
+		/* previously specified */
+		return desired;
+	}
+	if (0 == sysctl_tcp_cookie_size) {
+		/* no default specified */
+		return 0;
+	}
+	if (TCP_COOKIE_MIN > sysctl_tcp_cookie_size) {
+		return TCP_COOKIE_MIN;
+	}
+	if (TCP_COOKIE_MAX < sysctl_tcp_cookie_size) {
+		return TCP_COOKIE_MAX;
+	}
+	if (0x1 & sysctl_tcp_cookie_size) {
+		/* 8-bit multiple, illegal, fix it */
+		return (u8)(sysctl_tcp_cookie_size + 0x1);
+	}
+	return (u8)sysctl_tcp_cookie_size;
+}
+
 /* Write previously computed TCP options to the packet.
  *
  * Beware: Something in the Internet is very sensitive to the ordering of
@@ -386,11 +423,22 @@ struct tcp_out_options {
 static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 			      const struct tcp_out_options *opts,
 			      __u8 **md5_hash) {
-	if (unlikely(OPTION_MD5 & opts->options)) {
-		*ptr++ = htonl((TCPOPT_NOP << 24) |
-			       (TCPOPT_NOP << 16) |
-			       (TCPOPT_MD5SIG << 8) |
-			       TCPOLEN_MD5SIG);
+	u8 options = opts->options;	/* mungable copy */
+
+	if (unlikely(OPTION_MD5 & options)) {
+		if (unlikely(OPTION_COOKIE_EXTENSION & options)) {
+			*ptr++ = htonl((TCPOPT_COOKIE << 24) |
+				       (TCPOLEN_COOKIE_BASE << 16) |
+				       (TCPOPT_MD5SIG << 8) |
+				       TCPOLEN_MD5SIG);
+		} else {
+			*ptr++ = htonl((TCPOPT_NOP << 24) |
+				       (TCPOPT_NOP << 16) |
+				       (TCPOPT_MD5SIG << 8) |
+				       TCPOLEN_MD5SIG);
+		}
+		/* larger cookies are incompatible */
+		options &= ~OPTION_COOKIE_EXTENSION;
 		*md5_hash = (__u8 *)ptr;
 		ptr += 4;
 	} else {
@@ -403,12 +451,13 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 			       opts->mss);
 	}
 
-	if (likely(OPTION_TS & opts->options)) {
-		if (unlikely(OPTION_SACK_ADVERTISE & opts->options)) {
+	if (likely(OPTION_TS & options)) {
+		if (unlikely(OPTION_SACK_ADVERTISE & options)) {
 			*ptr++ = htonl((TCPOPT_SACK_PERM << 24) |
 				       (TCPOLEN_SACK_PERM << 16) |
 				       (TCPOPT_TIMESTAMP << 8) |
 				       TCPOLEN_TIMESTAMP);
+			options &= ~OPTION_SACK_ADVERTISE;
 		} else {
 			*ptr++ = htonl((TCPOPT_NOP << 24) |
 				       (TCPOPT_NOP << 16) |
@@ -419,15 +468,48 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 		*ptr++ = htonl(opts->tsecr);
 	}
 
-	if (unlikely(OPTION_SACK_ADVERTISE & opts->options &&
-		     !(OPTION_TS & opts->options))) {
+	/* specification requires following timestamp, so do it now.
+	 */
+	if (unlikely(OPTION_COOKIE_EXTENSION & options)) {
+		u8 *cookie_copy = opts->cookie_copy;
+		u8 cookie_size = opts->cookie_size;
+
+		if (unlikely(0x1 & cookie_size)) {
+			/* 8-bit multiple, illegal, ignore */
+			cookie_size = 0;
+		} else if (likely(0x2 & cookie_size)) {
+			__u8 *p = (__u8 *)ptr;
+
+			/* 16-bit multiple */
+			*p++ = TCPOPT_COOKIE;
+			*p++ = TCPOLEN_COOKIE_BASE + cookie_size;
+			*p++ = *cookie_copy++;
+			*p++ = *cookie_copy++;
+			ptr++;
+			cookie_size -= 2;
+		} else {
+			/* 32-bit multiple */
+			*ptr++ = htonl(((TCPOPT_NOP << 24) |
+				        (TCPOPT_NOP << 16) |
+				        (TCPOPT_COOKIE << 8) |
+				        TCPOLEN_COOKIE_BASE) +
+				       cookie_size);
+		}
+
+		if (0 < cookie_size) {
+			memcpy(ptr, cookie_copy, cookie_size);
+			ptr += (cookie_size >> 2);
+		}
+	}
+
+	if (unlikely(OPTION_SACK_ADVERTISE & options)) {
 		*ptr++ = htonl((TCPOPT_NOP << 24) |
 			       (TCPOPT_NOP << 16) |
 			       (TCPOPT_SACK_PERM << 8) |
 			       TCPOLEN_SACK_PERM);
 	}
 
-	if (unlikely(opts->ws)) {
+	if (unlikely(OPTION_WSCALE & options)) {
 		*ptr++ = htonl((TCPOPT_NOP << 24) |
 			       (TCPOPT_WINDOW << 16) |
 			       (TCPOLEN_WINDOW << 8) |
@@ -463,10 +545,16 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 				struct tcp_md5sig_key **md5) {
 	struct tcp_sock *tp = tcp_sk(sk);
 	unsigned size = 0;
+	u8 cookie_size = !tp->cookie_out_never
+			 ? tcp_cookie_size_check(tp->cookie_desired)
+			 : 0;
 
 #ifdef CONFIG_TCP_MD5SIG
 	*md5 = tp->af_specific->md5_lookup(sk, sk);
 	if (*md5) {
+		if (0 != cookie_size) {
+			opts->options |= OPTION_COOKIE_EXTENSION;
+		}
 		opts->options |= OPTION_MD5;
 		size += TCPOLEN_MD5SIG_ALIGNED;
 	}
@@ -494,8 +582,8 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 	}
 	if (likely(sysctl_tcp_window_scaling)) {
 		opts->ws = tp->rx_opt.rcv_wscale;
-		if (likely(opts->ws))
-			size += TCPOLEN_WSCALE_ALIGNED;
+		opts->options |= OPTION_WSCALE;
+		size += TCPOLEN_WSCALE_ALIGNED;
 	}
 	if (likely(sysctl_tcp_sack)) {
 		opts->options |= OPTION_SACK_ADVERTISE;
@@ -503,6 +591,61 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 			size += TCPOLEN_SACKPERM_ALIGNED;
 	}
 
+	/* Having both authentication and cookies for security is redundant,
+	 * and there's certainly not enough room.  Instead, the cookie-less
+	 * variant is proposed above.
+	 *
+	 * Consider the pessimal case with authentication.  The options
+	 * could look like:
+	 *   COOKIE|MD5(20) + MSS(4) + WSCALE(4) + SACK|TS(12) == 40
+	 *
+	 * (Currently, the timestamps && *MD5 test above prevents this.)
+	 *
+	 * Note that timestamps are required by the specification.
+	 *
+	 * Odd numbers of bytes are prohibited by the specification, ensuring
+	 * that the cookie is 16-bit aligned, and the resulting cookie pair is
+	 * 32-bit aligned.
+	 */
+	if (NULL == *md5
+	 && (OPTION_TS & opts->options)
+	 && 0 != cookie_size) {
+		int need = TCPOLEN_COOKIE_BASE + cookie_size;
+		int remaining = MAX_TCP_OPTION_SPACE - size;
+
+		if (!(0x2 & cookie_size)) {
+			/* 32-bit multiple */
+			need += 2; /* NOPs */
+
+			if (need > remaining) {
+				/* try shrinking cookie to fit */
+				cookie_size -= 2;
+				need -= 4;
+			}
+		}
+		while (need > remaining && TCP_COOKIE_MIN <= cookie_size) {
+			cookie_size -= 4;
+			need -= 4;
+		}
+		if (TCP_COOKIE_MIN <= cookie_size) {
+			if (NULL == tp->cookie_pair
+			 && NULL != (tp->cookie_pair =
+				     kmalloc(sizeof(struct tcp_cookie_pair),
+					     GFP_ATOMIC))) {
+				kref_init(&tp->cookie_pair->kref);
+				tp->cookie_pair->tcpcp_size = cookie_size;
+				get_random_bytes(&tp->cookie_pair->tcpcp_data[0],
+						 cookie_size);
+			}
+			if (NULL != tp->cookie_pair) {
+				opts->options |= OPTION_COOKIE_EXTENSION;
+				opts->cookie_copy = &tp->cookie_pair->tcpcp_data[0];
+				opts->cookie_size = cookie_size;
+				tp->cookie_desired = cookie_size; /* remember */
+				size += need;
+			}
+		}
+	}
 	return size;
 }
 
@@ -512,13 +655,19 @@ static unsigned tcp_synack_options(struct sock *sk,
 				   unsigned mss, struct sk_buff *skb,
 				   struct tcp_out_options *opts,
 				   struct tcp_md5sig_key **md5) {
-	unsigned size = 0;
 	struct inet_request_sock *ireq = inet_rsk(req);
+	unsigned size = 0;
+	u8 cookie_size = !tcp_rsk(req)->cookie_out_never
+			 ? tcp_rsk(req)->cookie_size
+			 : 0;
 	char doing_ts;
 
 #ifdef CONFIG_TCP_MD5SIG
 	*md5 = tcp_rsk(req)->af_specific->md5_lookup(sk, req);
 	if (*md5) {
+		if (0 != cookie_size) {
+			opts->options |= OPTION_COOKIE_EXTENSION;
+		}
 		opts->options |= OPTION_MD5;
 		size += TCPOLEN_MD5SIG_ALIGNED;
 	}
@@ -537,8 +686,8 @@ static unsigned tcp_synack_options(struct sock *sk,
 
 	if (likely(ireq->wscale_ok)) {
 		opts->ws = ireq->rcv_wscale;
-		if (likely(opts->ws))
-			size += TCPOLEN_WSCALE_ALIGNED;
+		opts->options |= OPTION_WSCALE;
+		size += TCPOLEN_WSCALE_ALIGNED;
 	}
 	if (likely(doing_ts)) {
 		opts->options |= OPTION_TS;
@@ -552,6 +701,29 @@ static unsigned tcp_synack_options(struct sock *sk,
 			size += TCPOLEN_SACKPERM_ALIGNED;
 	}
 
+	/* Similar rationale to tcp_syn_options() applies here, too.
+	 * If the <SYN> options fit, the same options should fit now!
+	 */
+	if (NULL == *md5
+	 && doing_ts
+	 && 0 != cookie_size) {
+		int need = TCPOLEN_COOKIE_BASE + cookie_size;
+		int remaining = MAX_TCP_OPTION_SPACE - size;
+
+		if (!(0x2 & cookie_size)) {
+			/* 32-bit multiple */
+			need += 2; /* NOPs */
+		}
+		if (need <= remaining) {
+			opts->options |= OPTION_COOKIE_EXTENSION;
+			opts->cookie_copy = tcp_rsk(req)->cookie_copy;
+			opts->cookie_size = cookie_size;
+			size += need;
+		} else {
+			/* There's no error return, so flag it. */
+			tcp_rsk(req)->cookie_out_never = 1;
+		}
+	}
 	return size;
 }
 
@@ -2283,6 +2455,24 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 	 */
 	tcp_init_nondata_skb(skb, tcp_rsk(req)->snt_isn,
 			     TCPCB_FLAG_SYN | TCPCB_FLAG_ACK);
+
+	/* If cookies are active, and constant data is available, copy it
+	 * directly from the listening socket.
+	 */
+	if (!tcp_rsk(req)->cookie_out_never
+	 && 0 < tcp_rsk(req)->cookie_size
+	 && 0 < tp->s_data_desired) {
+		const struct tcp_s_data_payload *tsdplp =
+			tp->s_data_payload;
+
+		if (NULL != tsdplp) {
+			u8 *buf = skb_put(skb, tsdplp->tsdpl_size);
+
+			memcpy(buf, tsdplp->tsdpl_data, tsdplp->tsdpl_size);
+			TCP_SKB_CB(skb)->end_seq += tsdplp->tsdpl_size;
+		}
+	}
+
 	th->seq = htonl(TCP_SKB_CB(skb)->seq);
 	th->ack_seq = htonl(tcp_rsk(req)->rcv_isn + 1);
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 21d100b..af33758 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1159,11 +1159,12 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk,struct sk_buff *skb)
  */
 static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 {
+	u8 bakery[TCP_COOKIE_MAX];
+	struct tcp_options_received tmp_opt;
 	struct inet6_request_sock *treq;
 	struct ipv6_pinfo *np = inet6_sk(sk);
-	struct tcp_options_received tmp_opt;
-	struct tcp_sock *tp = tcp_sk(sk);
 	struct request_sock *req = NULL;
+	struct tcp_sock *tp = tcp_sk(sk);
 	__u32 isn = TCP_SKB_CB(skb)->when;
 #ifdef CONFIG_SYN_COOKIES
 	int want_cookie = 0;
@@ -1205,6 +1206,28 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 
 	tcp_parse_options(skb, &tmp_opt, 0);
 
+	if (tmp_opt.extend_ok
+	 && tmp_opt.saw_tstamp
+	 && !tp->cookie_out_never
+	 && (0 < tp->cookie_desired || 0 < sysctl_tcp_cookie_size)) {
+#ifdef CONFIG_SYN_COOKIES
+		want_cookie = 0;	/* not our kind of cookie */
+#endif
+		tcp_rsk(req)->cookie_out_never = 0;
+		tcp_rsk(req)->cookie_copy = bakery;
+		tcp_rsk(req)->cookie_size = tmp_opt.cookie_size;
+
+		/* secret recipe not yet implemented */
+		get_random_bytes(bakery, tmp_opt.cookie_size);
+	} else if (!tp->cookie_in_always) {
+		/* redundant indications, but ensure initialization. */
+		tcp_rsk(req)->cookie_out_never = 1;
+		tcp_rsk(req)->cookie_size = 0;
+	} else {
+		goto drop;
+	}
+	tcp_rsk(req)->cookie_in_always = tp->cookie_in_always;
+
 	if (want_cookie && !tmp_opt.saw_tstamp)
 		tcp_clear_options(&tmp_opt);
 
@@ -1864,6 +1887,14 @@ static int tcp_v6_init_sock(struct sock *sk)
 	tp->af_specific = &tcp_sock_ipv6_specific;
 #endif
 
+/* For grep, in order of appearance:
+ *	tp->s_data_payload = NULL;
+ *	tp->cookie_pair = NULL;
+ *	tp->s_data_desired = tp->cookie_desired = 0;
+ *	tp->s_data_in = tp->s_data_out = 0;
+ *	tp->cookie_in_always = tp->cookie_out_never = 0;
+ */
+
 	sk->sk_sndbuf = sysctl_tcp_wmem[1];
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
-- 
1.6.0.4
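
The sizing logic in tcp_syn_options() above can be read in isolation. The following is a minimal user-space sketch of it, not the patch's code: cookie_fit() is a hypothetical helper, and the three constants are assumed values because the hunks shown here do not define them.

```c
/* Assumed values: the hunks above use these names but do not define them. */
#define MAX_TCP_OPTION_SPACE	40
#define TCPOLEN_COOKIE_BASE	2
#define TCP_COOKIE_MIN		8

/*
 * Hypothetical helper mirroring the shrink-to-fit steps in the patch.
 * 'used' is the option space already consumed, 'desired' the requested
 * cookie size in bytes.  On success the chosen size is stored in
 * *cookie_size and the option bytes needed are returned; 0 means no
 * acceptable cookie fits.
 */
static int cookie_fit(int used, int desired, int *cookie_size)
{
	int size = desired;
	int need = TCPOLEN_COOKIE_BASE + size;
	int remaining = MAX_TCP_OPTION_SPACE - used;

	if (size <= 0 || (size & 0x1))
		return 0;		/* odd sizes are prohibited */

	if (!(size & 0x2)) {
		/* 32-bit multiple: kind + length need two NOPs of padding */
		need += 2;
		if (need > remaining) {
			/* drop 2 cookie bytes and the 2 NOPs */
			size -= 2;
			need -= 4;
		}
	}
	while (need > remaining && size >= TCP_COOKIE_MIN) {
		size -= 4;
		need -= 4;
	}
	if (size < TCP_COOKIE_MIN)
		return 0;

	*cookie_size = size;
	return need;
}
```

With MSS(4) + WSCALE(4) + SACK|TS(12) already counted, used is 20, so a 16-byte cookie plus its 4 bytes of header and padding exactly fills the remaining 20 bytes.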



* Re: [PATCH RFC] isdn/capi: fix up CAPI subsystem workaround locking a bit
From: Michael Buesch @ 2009-10-03 18:35 UTC (permalink / raw)
  To: Tilman Schmidt
  Cc: i4ldeveloper, Carsten Paeth, Karsten Keil, Karsten Keil,
	Armin Schindler, isdn4linux, netdev, linux-kernel
In-Reply-To: <200910032026.24451.mb@bu3sch.de>

On Saturday 03 October 2009 20:26:22 Michael Buesch wrote:
> On Saturday 03 October 2009 14:06:57 Tilman Schmidt wrote:
> > Move calls to handle_minor_send() and handle_minor_recv() out of
> > the sections locked by workaround_lock.
> > - handle_minor_send() may call another CAPI function via the card
> >   driver, deadlocking by trying to take workaround_lock again.
> > - handle_minor_recv() calls the receive_buf method of the active
> >   line discipline which may sleep.
> 
> I remember that handle_minor_send() and/or handle_minor_recv() showed up
> in the crash backtraces. So if you move them out of the critical
> section, you can as well remove the lock completely.
> 

Here's my original mail:
http://lkml.indiana.edu/hypermail/linux/kernel/0605.0/0455.html

Note the patch in that mail does _not_ fix the issue, as it turned out later.
Then I did the workaround-lock patch, which _did_ fix it.

But in that mail you can see the original crash backtraces, which might be
useful for finding out what _really_ happened.

-- 
Greetings, Michael.


* [PATCH] pktgen: Fix multiqueue handling
From: Robert Olsson @ 2009-10-03 19:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Robert Olsson, Linux Netdev List,
	Stephen Hemminger
In-Reply-To: <4AC6EE3B.6050407@gmail.com>



Thanks, yes, it seems right. With the patch we can choose a single arbitrary TX queue.
BTW, I noticed you and Stephen discussed reducing the dirtying of skb->users; maybe the idea
of bumping skb->users up to clone_skb is not so bad. I think the code will be pretty
straightforward.

Cheers
						--ro

 
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>

Eric Dumazet writes:
 > Note: I could not really test this patch; I don't have multiqueue hardware yet.
 > I found this by code inspection, so please double-check. Thanks.
 > [PATCH] pktgen: Fix multiqueue handling
 > 
 > It is not currently possible to instruct pktgen to use one selected tx queue.
 > 
 > When Robert added multiqueue support in commit 45b270f8, he added
 > an interval (queue_map_min, queue_map_max), and his code doesn't take
 > into account the case of min = max, to select one tx queue exactly.
 > 
 > I suspect a high performance setup on an eight txqueue device wants
 > to use exactly eight cpus, and assign one tx queue to each sender.
 > 
 > This patch makes pktgen select the right tx queue, not the first one.
 > 
 > Also updates Documentation to reflect Robert's changes.
 > 
 > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
 > ---
 >  Documentation/networking/pktgen.txt |    8 ++++++++
 >  net/core/pktgen.c                   |    2 +-
 >  2 files changed, 9 insertions(+), 1 deletion(-)
 > 
 > diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt
 > index c6cf4a3..61bb645 100644
 > --- a/Documentation/networking/pktgen.txt
 > +++ b/Documentation/networking/pktgen.txt
 > @@ -90,6 +90,11 @@ Examples:
 >   pgset "dstmac 00:00:00:00:00:00"    sets MAC destination address
 >   pgset "srcmac 00:00:00:00:00:00"    sets MAC source address
 >  
 > + pgset "queue_map_min 0" Sets the min value of tx queue interval
 > + pgset "queue_map_max 7" Sets the max value of tx queue interval, for multiqueue devices
 > +                         To select queue 1 of a given device,
 > +                         use queue_map_min=1 and queue_map_max=1
 > +
 >   pgset "src_mac_count 1" Sets the number of MACs we'll range through.  
 >                           The 'minimum' MAC is what you set with srcmac.
 >  
 > @@ -101,6 +106,9 @@ Examples:
 >                                IPDST_RND, UDPSRC_RND,
 >                                UDPDST_RND, MACSRC_RND, MACDST_RND 
 >                                MPLS_RND, VID_RND, SVID_RND
 > +                              QUEUE_MAP_RND # queue map random
 > +                              QUEUE_MAP_CPU # queue map mirrors smp_processor_id()
 > +
 >  
 >   pgset "udp_src_min 9"   set UDP source port min, If < udp_src_max, then
 >                           cycle through the port range.
 > diff --git a/net/core/pktgen.c b/net/core/pktgen.c
 > index b694552..421857c 100644
 > --- a/net/core/pktgen.c
 > +++ b/net/core/pktgen.c
 > @@ -2212,7 +2212,7 @@ static void set_cur_queue_map(struct pktgen_dev *pkt_dev)
 >  	if (pkt_dev->flags & F_QUEUE_MAP_CPU)
 >  		pkt_dev->cur_queue_map = smp_processor_id();
 >  
 > -	else if (pkt_dev->queue_map_min < pkt_dev->queue_map_max) {
 > +	else if (pkt_dev->queue_map_min <= pkt_dev->queue_map_max) {
 >  		__u16 t;
 >  		if (pkt_dev->flags & F_QUEUE_MAP_RND) {
 >  			t = random32() %


* [PATCH] ibmtr: possible Read buffer overflow?
From: Roel Kluin @ 2009-10-03 21:26 UTC (permalink / raw)
  To: David S. Miller, netdev, Andrew Morton

Prevent read outside array bounds.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
---
Is this maybe required?

build tested

diff --git a/drivers/net/tokenring/ibmtr.c b/drivers/net/tokenring/ibmtr.c
index 525bbc5..6a3c751 100644
--- a/drivers/net/tokenring/ibmtr.c
+++ b/drivers/net/tokenring/ibmtr.c
@@ -1143,9 +1143,16 @@ static void dir_open_adapter (struct net_device *dev)
                 } else {
 			char **prphase = printphase;
 			char **prerror = printerror;
+			int pnr = err / 16 - 1;
+			int enr = err % 16 - 1;
 			DPRINTK("TR Adapter misc open failure, error code = ");
-			printk("0x%x, Phase: %s, Error: %s\n",
-				err, prphase[err/16 -1], prerror[err%16 -1]);
+			if (pnr < 0 || pnr >= ARRAY_SIZE(printphase) ||
+					enr < 0 ||
+					enr >= ARRAY_SIZE(printerror))
+				printk("0x%x, invalid Phase/Error.", err);
+			else
+				printk("0x%x, Phase: %s, Error: %s\n", err,
+						prphase[pnr], prerror[enr]);
 			printk(" retrying after %ds delay...\n",
 					TR_RETRY_INTERVAL/HZ);
                 }

