Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next-2.6] bnx2x: Remove two prefetch()
From: Eric Dumazet @ 2010-04-28 12:33 UTC (permalink / raw)
  To: hadi; +Cc: David Miller, xiaosuo, therbert, shemminger, netdev,
	Eilon Greenstein
In-Reply-To: <1272454432.14068.4.camel@bigi>

Le mercredi 28 avril 2010 à 07:33 -0400, jamal a écrit :
> On Wed, 2010-04-28 at 00:18 +0200, Eric Dumazet wrote:
> 
> > Thanks David, I was about to resubmit the cumulative patch ;)
> 
> Hrm, i never got the email with your patch on top of Changlis
> (the fscking ISP has creative ways of reordering, delaying and also
> occassionaly loosing my emails). So all my tests from last
> week did not include the extra patch. I will try to make time today
> to test with latest net-next which seems to have some extra goodies.
> If there is any other patch you want me to try let me know...
> 
> cheers,
> jamal

If you wait a bit, I have another patch to speedup udp receive path ;)



^ permalink raw reply

* Re: [PATCH net-next-2.6] bnx2x: Remove two prefetch()
From: jamal @ 2010-04-28 12:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, xiaosuo, therbert, shemminger, netdev,
	Eilon Greenstein
In-Reply-To: <1272458001.2267.0.camel@edumazet-laptop>

On Wed, 2010-04-28 at 14:33 +0200, Eric Dumazet wrote:

> If you wait a bit, I have another patch to speedup udp receive path ;)

Shoot whenever you are ready ;-> I will test with and without your
patch..

cheers,
jamal


^ permalink raw reply

* [PATCH] bonding: fix arp_validate on bonds inside a bridge
From: Jiri Bohac @ 2010-04-28 12:59 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: bonding-devel, netdev

Hi,

bonding with arp_validate does not currently work when the
bonding master is part of a bridge. This is because
bond_arp_rcv() is registered as a packet type handler for ARP,
but before netif_receive_skb() processes the ptype_base hash
table, handle_bridge() is called and changes the skb->dev to
point to the bridge device.

This patch makes bonding_should_drop() call the bonding ARP
handler directly if a IFF_MASTER_NEEDARP flag is set on the
bonding master.  bond_register_arp() now only needs to set the
IFF_MASTER_NEEDARP flag.

We ne longer need special ARP handling for inactive slaves, hence
IFF_SLAVE_NEEDARP is not needed.

skb_reset_network_header() and skb_reset_transport_header() need
to be called before the call to bonding_should_drop() because
bond_handle_arp() needs the offsets initialized.

P.S.: bonding_should_drop() should probably be renamed to
handle_bonding() -- we already have handle_bridge() and
handle_macvlan(), and bonding_should_drop() has long been doing
other stuff than deciding which packets to drop...



Signed-off-by: Jiri Bohac <jbohac@suse.cz>

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 0075514..cafd404 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1940,8 +1940,7 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
 	}
 
 	slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB |
-				   IFF_SLAVE_INACTIVE | IFF_BONDING |
-				   IFF_SLAVE_NEEDARP);
+				   IFF_SLAVE_INACTIVE | IFF_BONDING);
 
 	kfree(slave);
 
@@ -2612,11 +2611,12 @@ static void bond_validate_arp(struct bonding *bond, struct slave *slave, __be32
 	}
 }
 
-static int bond_arp_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
+static void bond_handle_arp(struct sk_buff *skb)
 {
 	struct arphdr *arp;
 	struct slave *slave;
 	struct bonding *bond;
+	struct net_device *dev = skb->dev->master, *orig_dev = skb->dev;
 	unsigned char *arp_ptr;
 	__be32 sip, tip;
 
@@ -2637,9 +2637,8 @@ static int bond_arp_rcv(struct sk_buff *skb, struct net_device *dev, struct pack
 	bond = netdev_priv(dev);
 	read_lock(&bond->lock);
 
-	pr_debug("bond_arp_rcv: bond %s skb->dev %s orig_dev %s\n",
-		 bond->dev->name, skb->dev ? skb->dev->name : "NULL",
-		 orig_dev ? orig_dev->name : "NULL");
+	pr_debug("bond_handle_arp: bond: %s, master: %s, slave: %s\n",
+		bond->dev->name, dev->name, orig_dev->name);
 
 	slave = bond_get_slave_by_dev(bond, orig_dev);
 	if (!slave || !slave_do_arp_validate(bond, slave))
@@ -2684,8 +2683,7 @@ static int bond_arp_rcv(struct sk_buff *skb, struct net_device *dev, struct pack
 out_unlock:
 	read_unlock(&bond->lock);
 out:
-	dev_kfree_skb(skb);
-	return NET_RX_SUCCESS;
+	return;
 }
 
 /*
@@ -3567,23 +3565,12 @@ static void bond_unregister_lacpdu(struct bonding *bond)
 
 void bond_register_arp(struct bonding *bond)
 {
-	struct packet_type *pt = &bond->arp_mon_pt;
-
-	if (pt->type)
-		return;
-
-	pt->type = htons(ETH_P_ARP);
-	pt->dev = bond->dev;
-	pt->func = bond_arp_rcv;
-	dev_add_pack(pt);
+	bond->dev->priv_flags |= IFF_MASTER_NEEDARP;
 }
 
 void bond_unregister_arp(struct bonding *bond)
 {
-	struct packet_type *pt = &bond->arp_mon_pt;
-
-	dev_remove_pack(pt);
-	pt->type = 0;
+	bond->dev->priv_flags &= ~IFF_MASTER_NEEDARP;
 }
 
 /*---------------------------- Hashing Policies -----------------------------*/
@@ -5041,6 +5028,7 @@ static int __init bonding_init(void)
 	register_netdevice_notifier(&bond_netdev_notifier);
 	register_inetaddr_notifier(&bond_inetaddr_notifier);
 	bond_register_ipv6_notifier();
+	bond_handle_arp_hook = bond_handle_arp;
 out:
 	return res;
 err:
@@ -5061,6 +5049,7 @@ static void __exit bonding_exit(void)
 
 	rtnl_link_unregister(&bond_link_ops);
 	unregister_pernet_subsys(&bond_net_ops);
+	bond_handle_arp_hook = NULL;
 }
 
 module_init(bonding_init);
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 257a7a4..57adfe5 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -212,7 +212,6 @@ struct bonding {
 	struct   bond_params params;
 	struct   list_head vlan_list;
 	struct   vlan_group *vlgrp;
-	struct   packet_type arp_mon_pt;
 	struct   workqueue_struct *wq;
 	struct   delayed_work mii_work;
 	struct   delayed_work arp_work;
@@ -292,14 +291,12 @@ static inline void bond_set_slave_inactive_flags(struct slave *slave)
 	if (!bond_is_lb(bond))
 		slave->state = BOND_STATE_BACKUP;
 	slave->dev->priv_flags |= IFF_SLAVE_INACTIVE;
-	if (slave_do_arp_validate(bond, slave))
-		slave->dev->priv_flags |= IFF_SLAVE_NEEDARP;
 }
 
 static inline void bond_set_slave_active_flags(struct slave *slave)
 {
 	slave->state = BOND_STATE_ACTIVE;
-	slave->dev->priv_flags &= ~(IFF_SLAVE_INACTIVE | IFF_SLAVE_NEEDARP);
+	slave->dev->priv_flags &= ~IFF_SLAVE_INACTIVE;
 }
 
 static inline void bond_set_master_3ad_flags(struct bonding *bond)
diff --git a/include/linux/if.h b/include/linux/if.h
index 3a9f410..84ab2c8 100644
--- a/include/linux/if.h
+++ b/include/linux/if.h
@@ -63,7 +63,7 @@
 #define IFF_MASTER_8023AD	0x8	/* bonding master, 802.3ad. 	*/
 #define IFF_MASTER_ALB	0x10		/* bonding master, balance-alb.	*/
 #define IFF_BONDING	0x20		/* bonding master or slave	*/
-#define IFF_SLAVE_NEEDARP 0x40		/* need ARPs for validation	*/
+#define IFF_MASTER_NEEDARP 0x40		/* need ARPs for validation	*/
 #define IFF_ISATAP	0x80		/* ISATAP interface (RFC4214)	*/
 #define IFF_MASTER_ARPMON 0x100		/* bonding master, ARP mon in use */
 #define IFF_WAN_HDLC	0x200		/* WAN HDLC device		*/
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fa8b476..9f82fc6 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2055,6 +2055,8 @@ static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
 	}
 }
 
+extern void (*bond_handle_arp_hook)(struct sk_buff *skb);
+
 /* On bonding slaves other than the currently active slave, suppress
  * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
  * ARP on active-backup slaves with arp_validate enabled.
@@ -2076,11 +2078,13 @@ static inline int skb_bond_should_drop(struct sk_buff *skb,
 			skb_bond_set_mac_by_master(skb, master);
 		}
 
-		if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
-			if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
-			    skb->protocol == __cpu_to_be16(ETH_P_ARP))
-				return 0;
+		/* pass ARP frames directly to bonding
+		   before bridging or other hooks change them */
+		if ((master->priv_flags & IFF_MASTER_NEEDARP) &&
+		    skb->protocol == __cpu_to_be16(ETH_P_ARP))
+			bond_handle_arp_hook(skb);
 
+		if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
 			if (master->priv_flags & IFF_MASTER_ALB) {
 				if (skb->pkt_type != PACKET_BROADCAST &&
 				    skb->pkt_type != PACKET_MULTICAST)
diff --git a/net/core/dev.c b/net/core/dev.c
index f769098..98d85a8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2314,6 +2314,9 @@ static inline int deliver_skb(struct sk_buff *skb,
 	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 }
 
+void (*bond_handle_arp_hook)(struct sk_buff *skb);
+EXPORT_SYMBOL_GPL(bond_handle_arp_hook);
+
 #if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
 
 #if defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE)
@@ -2507,6 +2510,10 @@ int netif_receive_skb(struct sk_buff *skb)
 	if (!skb->skb_iif)
 		skb->skb_iif = skb->dev->ifindex;
 
+	skb_reset_network_header(skb);
+	skb_reset_transport_header(skb);
+	skb->mac_len = skb->network_header - skb->mac_header;
+
 	null_or_orig = NULL;
 	orig_dev = skb->dev;
 	master = ACCESS_ONCE(orig_dev->master);
@@ -2519,10 +2526,6 @@ int netif_receive_skb(struct sk_buff *skb)
 
 	__get_cpu_var(netdev_rx_stat).total++;
 
-	skb_reset_network_header(skb);
-	skb_reset_transport_header(skb);
-	skb->mac_len = skb->network_header - skb->mac_header;
-
 	pt_prev = NULL;
 
 	rcu_read_lock();


-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply related

* Re: [net-next-2.6 PATCH 1/2] Add netdev port-profile support (take III, was iovnl)
From: Arnd Bergmann @ 2010-04-28 13:13 UTC (permalink / raw)
  To: Scott Feldman; +Cc: davem, netdev, chrisw
In-Reply-To: <20100428044235.8646.61943.stgit@savbu-pc100.cisco.com>

On Wednesday 28 April 2010, Scott Feldman wrote:
> From: Scott Feldman <scofeldm@cisco.com>
> 
> Add new netdev ops ndo_{set|get}_port_profile to allow setting of port-profile
> on a netdev interface.  Extends RTM_SETLINK/RTM_GETLINK with new sub cmd called
> IFLA_PORT_PROFILE (added to end of IFLA_cmd list).  The port-profile cmd
> arguments are (as seen from modified iproute2 cmdline):
>
>        ip link set DEVICE [ { up | down } ]
>                           [ arp { on | off } ]
>                           [ dynamic { on | off } ]
>                           [ multicast { on | off } ]
>                           ...
>                           [ vf NUM [ mac LLADDR ]
>                                    [ vlan VLANID [ qos VLAN-QOS ] ]
>                                    [ rate TXRATE ] ] 
>                           [ port_profile [ PORT-PROFILE
>                                    [ mac LLADDR ]
>                                    [ host_uuid HOST_UUID ]
>                                    [ client_uuid CLIENT_UUID ]
>                                    [ client_name CLIENT_NAME ] ] ]
>        ip link show [ DEVICE ]

We will need a few more options to cover draft VDP in addition to the protocol
your NIC is using. I still think it's possible to use the same interface
for both, but the differences are obviously showing.

The missing bits that I can see so far are:

- You only have 'get' and 'set'. We will also need a 'unset' or 'delete'
  option in order to get rid of a port profile association.
- VDP has three different ways to 'set' a port profile: 'associate',
  'pre-associate with resource reservation' and 'pre-associate without
  resource reservation'. This could become an extra option flag.
- Instead of a port profile name, VDP specifies a tuple like
   struct vsi_associate {
	unsigned char VSI_Mgr_ID;       /* VSI manager ID */
	unsigned char VSI_Type_ID[3];   /* 24 bit VSI Type ID */
	unsigned char VSI_Type_Version; /* VSI Type version */
   };
   I'm not sure how to deal with that best, but there needs to be
   some parsing of these numbers.
- VDP requires a vlan ID to be part of the association, in addition to
  the MAC address. In theory, we could have multiple tuples of MAC+VLAN
  addresses, but we can probably just associate each tuple separately
  and ignore that part of the standard.
- we have a set of possible error conditions that can be returned by
  the switch (invalid format, insufficient resources, unknown VTID,
  VTID violation, VTID verison violation, out of sync). It should be
  possible to return each of these to the user with 'get'.

> Since we're using netlink sockets, the receiver of the RTM_SETLINK msg can
> be in kernel- or user-space.  For kernel-space recipient, rtnetlink.c, the
> new ndo_set_port_profile netdev op is called to set the port-profile.
> User-space recipients can decide how they propagate the msg to the switch.
> There is also a RTM_GETLINK cmd to to return port-profile setting of an
> interface and to also return the status of the last port-profile.

More on a stylistic note, I'm not convinced that using RTM_SETLINK/GETLINK
is the right interface for this, unlike the 'ip link set DEV vf ...' stuff,
because it seems to suggest that this is an option of the adapter itself.
I actually liked the iovnl family better in this regard, because it kept
the protocols separate.

What I could imagine to unify this is something like

   ip port_profile set DEVICE [ { pre_associate | pre_associate_rr } ]
                              { name PORT-PROFILE | vsi MGR:VTID:VER }
                                    mac LLADDR
				    [ vlan VID ]
                                    [ host_uuid HOST_UUID ]
                                    [ client_uuid CLIENT_UUID ]
                                    [ client_name CLIENT_NAME ]
   ip port_profile del DEVICE [ mac LLADDR [ vlan VID ] ]
   ip port_profile show DEVICE

	Arnd

^ permalink raw reply

* Re: [PATCH net-next-2.6] bnx2x: Remove two prefetch()
From: Eilon Greenstein @ 2010-04-28 13:14 UTC (permalink / raw)
  To: David Miller
  Cc: vladz, eliezert, eric.dumazet@gmail.com, xiaosuo@gmail.com,
	hadi@cyberus.ca, therbert@google.com, shemminger@vyatta.com,
	netdev@vger.kernel.org
In-Reply-To: <20100427.151937.62344362.davem@davemloft.net>

On Tue, 2010-04-27 at 15:19 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 28 Apr 2010 00:18:13 +0200
> 
> > [PATCH net-next-2.6] bnx2x: Remove two prefetch()
> > 
> > 1) Even on 64bit arches, sizeof(struct sk_buff) < 256
> > 2) No need to prefetch same pointer twice.
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > CC: Eilon Greenstein <eilong@broadcom.com>
> 
> Eilon please review and ACK/NACK

Vlad ran few benchmarks, and we couldn't find any justification for
those prefetch calls. After consulting with Eliezer Tamir (the original
author) we are glad to Ack this patch.

Thanks Eric!
Acked-by: <eilong@broadcom.com>




^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/2] add ndo_set_port_profile op support for enic dynamic vnics
From: Arnd Bergmann @ 2010-04-28 13:32 UTC (permalink / raw)
  To: Scott Feldman; +Cc: davem, netdev, chrisw
In-Reply-To: <20100428044240.8646.27412.stgit@savbu-pc100.cisco.com>

On Wednesday 28 April 2010, Scott Feldman wrote:
> +static int enic_set_port_profile(struct net_device *netdev,
> +       struct ifla_port_profile *ipp)
> +{
> +       struct enic *enic = netdev_priv(netdev);
> +       struct vic_provinfo *vp;
> +       u8 oui[3] = VIC_PROVINFO_CISCO_OUI;
> +       u8 *mac = ipp->mac;
> +       int err;
> +
> +       memset(&enic->port_profile, 0, sizeof(enic->port_profile));
> +
> +       if (!enic_is_dynamic(enic))
> +               return -EOPNOTSUPP;

Not sure I understand how this fits together. You said in an earlier mail:

> > Anything that ties port profiles to VFs seems fundamentally flawed AFAICT,
> > at least when we want to extend this to adapters that don't do it in firmware.
>
> Ya, I tend I agree.  Let's just make port-profile a setting of any netdev,
> an eth, macvtap, eth.x, bond, etc.  That's probably what I should have done
> in the first place.  Something like:

I thought you had meant that we can do the association of attached interfaces
through any interface, rather than tying it to the slave interface. While I'm
not sure I read your code correctly, it seems like you now only talk to the
slave interface, not to the master at all!

At least the check above should be 'if (enic_is_dynamic(enic)) return -EOPNOTSUPP',
not the other way round.
Moreover, if the netdev is the master here, you only allow a single slave, which
is not enough for larger setups (n > 1), though that could be a limitation of your
first version.

Passing just the slave device however would not work in the general case, as I
tried to point out in the mail you replied to. If the slave interface is owned
by a guest using PCI passthrough, or it sits below a stack of nested interfaces
(vlan, bridge, tap, vhost, ...), it's impossible to know what interface is
responsible for setting up the slave. Note that you cannot perform the association
through the slave interface itself because the remote switch would discard any
traffic originating from an unassociated interface.

> +static int enic_get_port_profile(struct net_device *netdev,
> +       struct ifla_port_profile *ipp)
> +{
> +       struct enic *enic = netdev_priv(netdev);
> +       int done, err, error;
> +
> +       enic->port_profile.status = IFLA_PORT_PROFILE_STATUS_UNKNOWN;
> +
> +       spin_lock(&enic->devcmd_lock);
> +       err = vnic_dev_init_done(enic->vdev, &done, &error);
> +       spin_unlock(&enic->devcmd_lock);
> +
> +       if (err || error)
> +               enic->port_profile.status = IFLA_PORT_PROFILE_STATUS_ERROR;
> +
> +       if (!done)
> +               enic->port_profile.status = IFLA_PORT_PROFILE_STATUS_INPROGRESS;
> +
> +       if (!error)
> +               enic->port_profile.status = IFLA_PORT_PROFILE_STATUS_SUCCESS;
> +
> +       memcpy(ipp, &enic->port_profile, sizeof(enic->port_profile));
> +
> +       return 0;
> +}
> +

Similarly, this interface only passes back a single port profile association,
where it should have a way to return all of the slave ports.

	Arnd

^ permalink raw reply

* Re: [PATCH 2/4] [RFC] Add sock_create_kern_net()
From: Dan Smith @ 2010-04-28 13:38 UTC (permalink / raw)
  To: hadi; +Cc: containers, netdev, Daniel Lezcano, Eric W. Biederman
In-Reply-To: <1272455094.14068.15.camel@bigi>

j> So ... how does user space know what "other_netns" is?

That's the point, userspace doesn't know about and can't use this
interface.  This is a way for the kernel to open a socket in another
netns to talk to that netns' RTNETLINK.  I realize in its current form
it could be used for something more nefarious, but it would be kernel
code doing it.

j> Also note Eric's recent patches introduced another way of opening a
j> socket in a different namespace - are you using those in the
j> abstraction to find what netns is?

No.  The process doing the checkpoint already has pointers to the
tasks and thus their netns pointers.  Eric's interface is an
abstraction to allow userspace to do something similar, I think that
using it from the kernel would be rather messy.

-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com

^ permalink raw reply

* [PATCH]: sctp: Fix skb_over_panic resulting from multiple invalid parameter errors (CVE-2010-1173)
From: Neil Horman @ 2010-04-28 13:47 UTC (permalink / raw)
  To: vladislav.yasevich, sri, linux-sctp
  Cc: eteo, netdev, davem, security, nhorman

Hey-
	Recently, it was reported to me that the kernel could oops in the
following way:

<5> kernel BUG at net/core/skbuff.c:91!
<5> invalid operand: 0000 [#1]
<5> Modules linked in: sctp netconsole nls_utf8 autofs4 sunrpc iptable_filter
ip_tables cpufreq_powersave parport_pc lp parport vmblock(U) vsock(U) vmci(U)
vmxnet(U) vmmemctl(U) vmhgfs(U) acpiphp dm_mirror dm_mod button battery ac md5
ipv6 uhci_hcd ehci_hcd snd_ens1371 snd_rawmidi snd_seq_device snd_pcm_oss
snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_ac97_codec snd soundcore
pcnet32 mii floppy ext3 jbd ata_piix libata mptscsih mptsas mptspi mptscsi
mptbase sd_mod scsi_mod
<5> CPU:    0
<5> EIP:    0060:[<c02bff27>]    Not tainted VLI
<5> EFLAGS: 00010216   (2.6.9-89.0.25.EL) 
<5> EIP is at skb_over_panic+0x1f/0x2d
<5> eax: 0000002c   ebx: c033f461   ecx: c0357d96   edx: c040fd44
<5> esi: c033f461   edi: df653280   ebp: 00000000   esp: c040fd40
<5> ds: 007b   es: 007b   ss: 0068
<5> Process swapper (pid: 0, threadinfo=c040f000 task=c0370be0)
<5> Stack: c0357d96 e0c29478 00000084 00000004 c033f461 df653280 d7883180
e0c2947d 
<5>        00000000 00000080 df653490 00000004 de4f1ac0 de4f1ac0 00000004
df653490 
<5>        00000001 e0c2877a 08000800 de4f1ac0 df653490 00000000 e0c29d2e
00000004 
<5> Call Trace:
<5>  [<e0c29478>] sctp_addto_chunk+0xb0/0x128 [sctp]
<5>  [<e0c2947d>] sctp_addto_chunk+0xb5/0x128 [sctp]
<5>  [<e0c2877a>] sctp_init_cause+0x3f/0x47 [sctp]
<5>  [<e0c29d2e>] sctp_process_unk_param+0xac/0xb8 [sctp]
<5>  [<e0c29e90>] sctp_verify_init+0xcc/0x134 [sctp]
<5>  [<e0c20322>] sctp_sf_do_5_1B_init+0x83/0x28e [sctp]
<5>  [<e0c25333>] sctp_do_sm+0x41/0x77 [sctp]
<5>  [<c01555a4>] cache_grow+0x140/0x233
<5>  [<e0c26ba1>] sctp_endpoint_bh_rcv+0xc5/0x108 [sctp]
<5>  [<e0c2b863>] sctp_inq_push+0xe/0x10 [sctp]
<5>  [<e0c34600>] sctp_rcv+0x454/0x509 [sctp]
<5>  [<e084e017>] ipt_hook+0x17/0x1c [iptable_filter]
<5>  [<c02d005e>] nf_iterate+0x40/0x81
<5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
<5>  [<c02e0c7f>] ip_local_deliver_finish+0xc6/0x151
<5>  [<c02d0362>] nf_hook_slow+0x83/0xb5
<5>  [<c02e0bb2>] ip_local_deliver+0x1a2/0x1a9
<5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
<5>  [<c02e103e>] ip_rcv+0x334/0x3b4
<5>  [<c02c66fd>] netif_receive_skb+0x320/0x35b
<5>  [<e0a0928b>] init_stall_timer+0x67/0x6a [uhci_hcd]
<5>  [<c02c67a4>] process_backlog+0x6c/0xd9
<5>  [<c02c690f>] net_rx_action+0xfe/0x1f8
<5>  [<c012a7b1>] __do_softirq+0x35/0x79
<5>  [<c0107efb>] handle_IRQ_event+0x0/0x4f
<5>  [<c01094de>] do_softirq+0x46/0x4d

Its an skb_over_panic BUG halt that results from processing an init chunk in
which too many of its variable length parameters are in some way malformed.

The problem is in sctp_process_unk_param:
if (NULL == *errp)
	*errp = sctp_make_op_error_space(asoc, chunk,
					 ntohs(chunk->chunk_hdr->length));

	if (*errp) {
		sctp_init_cause(*errp, SCTP_ERROR_UNKNOWN_PARAM,
				 WORD_ROUND(ntohs(param.p->length)));
		sctp_addto_chunk(*errp,
			WORD_ROUND(ntohs(param.p->length)),
				  param.v);

When we allocate an error chunk, we assume that the worst case scenario requires
that we have chunk_hdr->length data allocated, which would be correct nominally,
given that we call sctp_addto_chunk for the violating parameter.  Unfortunately,
we also, in sctp_init_cause insert a sctp_errhdr_t structure into the error
chunk, so the worst case situation in which all parameters are in violation
requires chunk_hdr->length+(sizeof(sctp_errhdr_t)*param_count) bytes of data.

The result of this error is that a deliberately malformed packet sent to a
listening host can cause a remote DOS, described in CVE-2010-1173:
http://cve.mitre.org/cgi-bin/cvename.cgi?name=2010-1173

I've tested the below fix and confirmed that it fixes the issue.  It
pre-allocates the error chunk in sctp_verify_init, where we are able to count
the total number of variable length parameters, so we know how many error
headers we might need.  Then we simply use that chunk, if we find an error, or
discard/free it if all the parameters are valid.  Applies on top of the
lksctp-dev tree

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>


 sm_make_chunk.c |   24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)


diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index f592163..990457b 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -2134,6 +2134,8 @@ int sctp_verify_init(const struct sctp_association *asoc,
 	union sctp_params param;
 	int has_cookie = 0;
 	int result;
+	unsigned int param_cnt;
+	unsigned int len;
 
 	/* Verify stream values are non-zero. */
 	if ((0 == peer_init->init_hdr.num_outbound_streams) ||
@@ -2149,6 +2151,7 @@ int sctp_verify_init(const struct sctp_association *asoc,
 
 		if (SCTP_PARAM_STATE_COOKIE == param.p->type)
 			has_cookie = 1;
+		param_cnt++;
 
 	} /* for (loop through all parameters) */
 
@@ -2169,6 +2172,20 @@ int sctp_verify_init(const struct sctp_association *asoc,
 		return sctp_process_missing_param(asoc, SCTP_PARAM_STATE_COOKIE,
 						  chunk, errp);
 
+	if (!*errp) {
+		/*
+		 * Pre-allocate the error packet here
+		 * we do this as we need to reserve space
+		 * for the worst case scenario in which 
+		 * every parameter is in error and needs 
+		 * an errhdr attached to it
+		 */
+		len = ntohs(chunk->chunk_hdr->length);
+		len += sizeof(sctp_errhdr_t)*param_cnt;
+
+		*errp = sctp_make_op_error_space(asoc, chunk, len);
+	}
+
 	/* Verify all the variable length parameters */
 	sctp_walk_params(param, peer_init, init_hdr.params) {
 
@@ -2176,9 +2193,11 @@ int sctp_verify_init(const struct sctp_association *asoc,
 		switch (result) {
 		    case SCTP_IERROR_ABORT:
 		    case SCTP_IERROR_NOMEM:
-				return 0;
 		    case SCTP_IERROR_ERROR:
-				return 1;
+				len = ntohs((*errp)->chunk_hdr->length);
+				if ((*errp) && (len == sizeof(sctp_chunkhdr_t)))
+					sctp_chunk_free(*errp);
+				return (result == SCTP_IERROR_ERROR) ? 1 : 0;
 		    case SCTP_IERROR_NO_ERROR:
 		    default:
 				break;
@@ -2186,6 +2205,7 @@ int sctp_verify_init(const struct sctp_association *asoc,
 
 	} /* for (loop through all parameters) */
 
+	sctp_chunk_free(*errp);
 	return 1;
 }
 

^ permalink raw reply related

* Re: [PATCH 0/3] [RFC] ptp: IEEE 1588 clock support
From: Wolfgang Grandegger @ 2010-04-28 13:50 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev
In-Reply-To: <20100428054706.GA4516@riccoc20.at.omicron.at>

Richard Cochran wrote:
> On Tue, Apr 27, 2010 at 06:20:25PM +0200, Wolfgang Grandegger wrote:
>> Do you have also a patch adding support for hardware timestamping to ptpd?
> 
> Yes, I do:
> 
>    https://sourceforge.net/tracker/index.php?func=detail&aid=2992847&group_id=139814&atid=744634

Thanks.

> I should have mentioned, you also need the gianfar HW time stamping
> patches, recently posted to netdev by Manfred Rudigier.

I'm aware of these patches. I'm actually using the net-next-2.6 git tree.

I got ptpd working but I do not yet see the PPS-Signals on my scope. At
a first glance, the PPS-Signal seems to be configured by the gianfar_ptp
driver (setting the fiper1 and timer1 registers) but I might have missed
something.

Wolfgang.

^ permalink raw reply

* Re: [PATCH]: sctp: Fix skb_over_panic resulting from multiple invalid parameter errors (CVE-2010-1173)
From: Vlad Yasevich @ 2010-04-28 14:00 UTC (permalink / raw)
  To: Neil Horman; +Cc: sri, linux-sctp, eteo, netdev, davem, security
In-Reply-To: <20100428134748.GA4818@hmsreliant.think-freely.org>

I have this patch and a few others already queued.

I was planning on sending these today for stable.

Here is the full list of stable patches I have:

sctp: Fix oops when sending queued ASCONF chunks
sctp: fix to calc the INIT/INIT-ACK chunk length correctly is set
sctp: per_cpu variables should be in bh_disabled section
sctp: fix potential reference of a freed pointer
sctp: avoid irq lock inversion while call sk->sk_data_ready()

-vlad

Neil Horman wrote:
> Hey-
> 	Recently, it was reported to me that the kernel could oops in the
> following way:
> 
> <5> kernel BUG at net/core/skbuff.c:91!
> <5> invalid operand: 0000 [#1]
> <5> Modules linked in: sctp netconsole nls_utf8 autofs4 sunrpc iptable_filter
> ip_tables cpufreq_powersave parport_pc lp parport vmblock(U) vsock(U) vmci(U)
> vmxnet(U) vmmemctl(U) vmhgfs(U) acpiphp dm_mirror dm_mod button battery ac md5
> ipv6 uhci_hcd ehci_hcd snd_ens1371 snd_rawmidi snd_seq_device snd_pcm_oss
> snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_ac97_codec snd soundcore
> pcnet32 mii floppy ext3 jbd ata_piix libata mptscsih mptsas mptspi mptscsi
> mptbase sd_mod scsi_mod
> <5> CPU:    0
> <5> EIP:    0060:[<c02bff27>]    Not tainted VLI
> <5> EFLAGS: 00010216   (2.6.9-89.0.25.EL) 
> <5> EIP is at skb_over_panic+0x1f/0x2d
> <5> eax: 0000002c   ebx: c033f461   ecx: c0357d96   edx: c040fd44
> <5> esi: c033f461   edi: df653280   ebp: 00000000   esp: c040fd40
> <5> ds: 007b   es: 007b   ss: 0068
> <5> Process swapper (pid: 0, threadinfo=c040f000 task=c0370be0)
> <5> Stack: c0357d96 e0c29478 00000084 00000004 c033f461 df653280 d7883180
> e0c2947d 
> <5>        00000000 00000080 df653490 00000004 de4f1ac0 de4f1ac0 00000004
> df653490 
> <5>        00000001 e0c2877a 08000800 de4f1ac0 df653490 00000000 e0c29d2e
> 00000004 
> <5> Call Trace:
> <5>  [<e0c29478>] sctp_addto_chunk+0xb0/0x128 [sctp]
> <5>  [<e0c2947d>] sctp_addto_chunk+0xb5/0x128 [sctp]
> <5>  [<e0c2877a>] sctp_init_cause+0x3f/0x47 [sctp]
> <5>  [<e0c29d2e>] sctp_process_unk_param+0xac/0xb8 [sctp]
> <5>  [<e0c29e90>] sctp_verify_init+0xcc/0x134 [sctp]
> <5>  [<e0c20322>] sctp_sf_do_5_1B_init+0x83/0x28e [sctp]
> <5>  [<e0c25333>] sctp_do_sm+0x41/0x77 [sctp]
> <5>  [<c01555a4>] cache_grow+0x140/0x233
> <5>  [<e0c26ba1>] sctp_endpoint_bh_rcv+0xc5/0x108 [sctp]
> <5>  [<e0c2b863>] sctp_inq_push+0xe/0x10 [sctp]
> <5>  [<e0c34600>] sctp_rcv+0x454/0x509 [sctp]
> <5>  [<e084e017>] ipt_hook+0x17/0x1c [iptable_filter]
> <5>  [<c02d005e>] nf_iterate+0x40/0x81
> <5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
> <5>  [<c02e0c7f>] ip_local_deliver_finish+0xc6/0x151
> <5>  [<c02d0362>] nf_hook_slow+0x83/0xb5
> <5>  [<c02e0bb2>] ip_local_deliver+0x1a2/0x1a9
> <5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
> <5>  [<c02e103e>] ip_rcv+0x334/0x3b4
> <5>  [<c02c66fd>] netif_receive_skb+0x320/0x35b
> <5>  [<e0a0928b>] init_stall_timer+0x67/0x6a [uhci_hcd]
> <5>  [<c02c67a4>] process_backlog+0x6c/0xd9
> <5>  [<c02c690f>] net_rx_action+0xfe/0x1f8
> <5>  [<c012a7b1>] __do_softirq+0x35/0x79
> <5>  [<c0107efb>] handle_IRQ_event+0x0/0x4f
> <5>  [<c01094de>] do_softirq+0x46/0x4d
> 
> Its an skb_over_panic BUG halt that results from processing an init chunk in
> which too many of its variable length parameters are in some way malformed.
> 
> The problem is in sctp_process_unk_param:
> if (NULL == *errp)
> 	*errp = sctp_make_op_error_space(asoc, chunk,
> 					 ntohs(chunk->chunk_hdr->length));
> 
> 	if (*errp) {
> 		sctp_init_cause(*errp, SCTP_ERROR_UNKNOWN_PARAM,
> 				 WORD_ROUND(ntohs(param.p->length)));
> 		sctp_addto_chunk(*errp,
> 			WORD_ROUND(ntohs(param.p->length)),
> 				  param.v);
> 
> When we allocate an error chunk, we assume that the worst case scenario requires
> that we have chunk_hdr->length data allocated, which would be correct nominally,
> given that we call sctp_addto_chunk for the violating parameter.  Unfortunately,
> we also, in sctp_init_cause insert a sctp_errhdr_t structure into the error
> chunk, so the worst case situation in which all parameters are in violation
> requires chunk_hdr->length+(sizeof(sctp_errhdr_t)*param_count) bytes of data.
> 
> The result of this error is that a deliberately malformed packet sent to a
> listening host can cause a remote DOS, described in CVE-2010-1173:
> http://cve.mitre.org/cgi-bin/cvename.cgi?name=2010-1173
> 
> I've tested the below fix and confirmed that it fixes the issue.  It
> pre-allocates the error chunk in sctp_verify_init, where we are able to count
> the total number of variable length parameters, so we know how many error
> headers we might need.  Then we simply use that chunk, if we find an error, or
> discard/free it if all the parameters are valid.  Applies on top of the
> lksctp-dev tree
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> 
> 
>  sm_make_chunk.c |   24 ++++++++++++++++++++++--
>  1 file changed, 22 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> index f592163..990457b 100644
> --- a/net/sctp/sm_make_chunk.c
> +++ b/net/sctp/sm_make_chunk.c
> @@ -2134,6 +2134,8 @@ int sctp_verify_init(const struct sctp_association *asoc,
>  	union sctp_params param;
>  	int has_cookie = 0;
>  	int result;
> +	unsigned int param_cnt;
> +	unsigned int len;
>  
>  	/* Verify stream values are non-zero. */
>  	if ((0 == peer_init->init_hdr.num_outbound_streams) ||
> @@ -2149,6 +2151,7 @@ int sctp_verify_init(const struct sctp_association *asoc,
>  
>  		if (SCTP_PARAM_STATE_COOKIE == param.p->type)
>  			has_cookie = 1;
> +		param_cnt++;
>  
>  	} /* for (loop through all parameters) */
>  
> @@ -2169,6 +2172,20 @@ int sctp_verify_init(const struct sctp_association *asoc,
>  		return sctp_process_missing_param(asoc, SCTP_PARAM_STATE_COOKIE,
>  						  chunk, errp);
>  
> +	if (!*errp) {
> +		/*
> +		 * Pre-allocate the error packet here
> +		 * we do this as we need to reserve space
> +		 * for the worst case scenario in which 
> +		 * every parameter is in error and needs 
> +		 * an errhdr attached to it
> +		 */
> +		len = ntohs(chunk->chunk_hdr->length);
> +		len += sizeof(sctp_errhdr_t)*param_cnt;
> +
> +		*errp = sctp_make_op_error_space(asoc, chunk, len);
> +	}
> +
>  	/* Verify all the variable length parameters */
>  	sctp_walk_params(param, peer_init, init_hdr.params) {
>  
> @@ -2176,9 +2193,11 @@ int sctp_verify_init(const struct sctp_association *asoc,
>  		switch (result) {
>  		    case SCTP_IERROR_ABORT:
>  		    case SCTP_IERROR_NOMEM:
> -				return 0;
>  		    case SCTP_IERROR_ERROR:
> -				return 1;
> +				len = ntohs((*errp)->chunk_hdr->length);
> +				if ((*errp) && (len == sizeof(sctp_chunkhdr_t)))
> +					sctp_chunk_free(*errp);
> +				return (result == SCTP_IERROR_ERROR) ? 1 : 0;
>  		    case SCTP_IERROR_NO_ERROR:
>  		    default:
>  				break;
> @@ -2186,6 +2205,7 @@ int sctp_verify_init(const struct sctp_association *asoc,
>  
>  	} /* for (loop through all parameters) */
>  
> +	sctp_chunk_free(*errp);
>  	return 1;
>  }
>  
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-04-28 14:06 UTC (permalink / raw)
  To: hadi
  Cc: David Miller, xiaosuo, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272458174.14068.16.camel@bigi>

Le mercredi 28 avril 2010 à 08:36 -0400, jamal a écrit :
> On Wed, 2010-04-28 at 14:33 +0200, Eric Dumazet wrote:
> 
> > If you wait a bit, I have another patch to speedup udp receive path ;)
> 
> Shoot whenever you are ready ;-> I will test with and without your
> patch..
> 

Here it is ;)

Thanks

[PATCH net-next-2.6] net: speedup udp receive path

Since commit 95766fff ([UDP]: Add memory accounting.), 
each received packet needs one extra sock_lock()/sock_release() pair.

This added latency because of possible backlog handling. Then later,
ticket spinlocks added yet another latency source in case of DDOS.

This patch introduces lock_sock_bh() and unlock_sock_bh()
synchronization primitives, avoiding one atomic operation and backlog
processing.

skb_free_datagram_locked() uses them instead of full blown
lock_sock()/release_sock(). skb is orphaned inside locked section for
proper socket memory reclaim, and finally freed outside of it.

UDP receive path now take the socket spinlock only once.

Signed-off-by: Eric DUmazet <eric.dumazet@gmail.com>
---
 include/net/sock.h  |   10 ++++++++++
 net/core/datagram.c |   10 +++++++---
 net/ipv4/udp.c      |   12 ++++++------
 net/ipv6/udp.c      |    4 ++--
 4 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index cf12b1e..d361c77 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1021,6 +1021,16 @@ extern void release_sock(struct sock *sk);
 				SINGLE_DEPTH_NESTING)
 #define bh_unlock_sock(__sk)	spin_unlock(&((__sk)->sk_lock.slock))
 
+static inline void lock_sock_bh(struct sock *sk)
+{
+	spin_lock_bh(&sk->sk_lock.slock);
+}
+
+static inline void unlock_sock_bh(struct sock *sk)
+{
+	spin_unlock_bh(&sk->sk_lock.slock);
+}
+
 extern struct sock		*sk_alloc(struct net *net, int family,
 					  gfp_t priority,
 					  struct proto *prot);
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 5574a5d..95b851f 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -229,9 +229,13 @@ EXPORT_SYMBOL(skb_free_datagram);
 
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
 {
-	lock_sock(sk);
-	skb_free_datagram(sk, skb);
-	release_sock(sk);
+	lock_sock_bh(sk);
+	skb_orphan(skb);
+	sk_mem_reclaim_partial(sk);
+	unlock_sock_bh(sk);
+
+	/* skb is now orphaned, might be freed outside of locked section */
+	consume_skb(skb);
 }
 EXPORT_SYMBOL(skb_free_datagram_locked);
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 63eb56b..1f86965 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1062,10 +1062,10 @@ static unsigned int first_packet_length(struct sock *sk)
 	spin_unlock_bh(&rcvq->lock);
 
 	if (!skb_queue_empty(&list_kill)) {
-		lock_sock(sk);
+		lock_sock_bh(sk);
 		__skb_queue_purge(&list_kill);
 		sk_mem_reclaim_partial(sk);
-		release_sock(sk);
+		unlock_sock_bh(sk);
 	}
 	return res;
 }
@@ -1196,10 +1196,10 @@ out:
 	return err;
 
 csum_copy_err:
-	lock_sock(sk);
+	lock_sock_bh(sk);
 	if (!skb_kill_datagram(sk, skb, flags))
 		UDP_INC_STATS_USER(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
-	release_sock(sk);
+	unlock_sock_bh(sk);
 
 	if (noblock)
 		return -EAGAIN;
@@ -1624,9 +1624,9 @@ int udp_rcv(struct sk_buff *skb)
 
 void udp_destroy_sock(struct sock *sk)
 {
-	lock_sock(sk);
+	lock_sock_bh(sk);
 	udp_flush_pending_frames(sk);
-	release_sock(sk);
+	unlock_sock_bh(sk);
 }
 
 /*
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 3ead20a..91c60f0 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -424,7 +424,7 @@ out:
 	return err;
 
 csum_copy_err:
-	lock_sock(sk);
+	lock_sock_bh(sk);
 	if (!skb_kill_datagram(sk, skb, flags)) {
 		if (is_udp4)
 			UDP_INC_STATS_USER(sock_net(sk),
@@ -433,7 +433,7 @@ csum_copy_err:
 			UDP6_INC_STATS_USER(sock_net(sk),
 					UDP_MIB_INERRORS, is_udplite);
 	}
-	release_sock(sk);
+	unlock_sock_bh(sk);
 
 	if (flags & MSG_DONTWAIT)
 		return -EAGAIN;



^ permalink raw reply related

* Re: [PATCH]: sctp: Fix skb_over_panic resulting from multiple invalid parameter errors (CVE-2010-1173)
From: Vlad Yasevich @ 2010-04-28 14:17 UTC (permalink / raw)
  To: Neil Horman; +Cc: sri, linux-sctp, eteo, netdev, davem, security
In-Reply-To: <4BD83F85.8090308@hp.com>



Vlad Yasevich wrote:
> I have this patch and a few others already queued.

Scratch that.  I totally misread the description and the patch.

-vlad
> 
> I was planning on sending these today for stable.
> 
> Here is the full list of stable patches I have:
> 
> sctp: Fix oops when sending queued ASCONF chunks
> sctp: fix to calc the INIT/INIT-ACK chunk length correctly is set
> sctp: per_cpu variables should be in bh_disabled section
> sctp: fix potential reference of a freed pointer
> sctp: avoid irq lock inversion while call sk->sk_data_ready()
> 
> -vlad
> 
> Neil Horman wrote:
>> Hey-
>> 	Recently, it was reported to me that the kernel could oops in the
>> following way:
>>
>> <5> kernel BUG at net/core/skbuff.c:91!
>> <5> invalid operand: 0000 [#1]
>> <5> Modules linked in: sctp netconsole nls_utf8 autofs4 sunrpc iptable_filter
>> ip_tables cpufreq_powersave parport_pc lp parport vmblock(U) vsock(U) vmci(U)
>> vmxnet(U) vmmemctl(U) vmhgfs(U) acpiphp dm_mirror dm_mod button battery ac md5
>> ipv6 uhci_hcd ehci_hcd snd_ens1371 snd_rawmidi snd_seq_device snd_pcm_oss
>> snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_ac97_codec snd soundcore
>> pcnet32 mii floppy ext3 jbd ata_piix libata mptscsih mptsas mptspi mptscsi
>> mptbase sd_mod scsi_mod
>> <5> CPU:    0
>> <5> EIP:    0060:[<c02bff27>]    Not tainted VLI
>> <5> EFLAGS: 00010216   (2.6.9-89.0.25.EL) 
>> <5> EIP is at skb_over_panic+0x1f/0x2d
>> <5> eax: 0000002c   ebx: c033f461   ecx: c0357d96   edx: c040fd44
>> <5> esi: c033f461   edi: df653280   ebp: 00000000   esp: c040fd40
>> <5> ds: 007b   es: 007b   ss: 0068
>> <5> Process swapper (pid: 0, threadinfo=c040f000 task=c0370be0)
>> <5> Stack: c0357d96 e0c29478 00000084 00000004 c033f461 df653280 d7883180
>> e0c2947d 
>> <5>        00000000 00000080 df653490 00000004 de4f1ac0 de4f1ac0 00000004
>> df653490 
>> <5>        00000001 e0c2877a 08000800 de4f1ac0 df653490 00000000 e0c29d2e
>> 00000004 
>> <5> Call Trace:
>> <5>  [<e0c29478>] sctp_addto_chunk+0xb0/0x128 [sctp]
>> <5>  [<e0c2947d>] sctp_addto_chunk+0xb5/0x128 [sctp]
>> <5>  [<e0c2877a>] sctp_init_cause+0x3f/0x47 [sctp]
>> <5>  [<e0c29d2e>] sctp_process_unk_param+0xac/0xb8 [sctp]
>> <5>  [<e0c29e90>] sctp_verify_init+0xcc/0x134 [sctp]
>> <5>  [<e0c20322>] sctp_sf_do_5_1B_init+0x83/0x28e [sctp]
>> <5>  [<e0c25333>] sctp_do_sm+0x41/0x77 [sctp]
>> <5>  [<c01555a4>] cache_grow+0x140/0x233
>> <5>  [<e0c26ba1>] sctp_endpoint_bh_rcv+0xc5/0x108 [sctp]
>> <5>  [<e0c2b863>] sctp_inq_push+0xe/0x10 [sctp]
>> <5>  [<e0c34600>] sctp_rcv+0x454/0x509 [sctp]
>> <5>  [<e084e017>] ipt_hook+0x17/0x1c [iptable_filter]
>> <5>  [<c02d005e>] nf_iterate+0x40/0x81
>> <5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
>> <5>  [<c02e0c7f>] ip_local_deliver_finish+0xc6/0x151
>> <5>  [<c02d0362>] nf_hook_slow+0x83/0xb5
>> <5>  [<c02e0bb2>] ip_local_deliver+0x1a2/0x1a9
>> <5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
>> <5>  [<c02e103e>] ip_rcv+0x334/0x3b4
>> <5>  [<c02c66fd>] netif_receive_skb+0x320/0x35b
>> <5>  [<e0a0928b>] init_stall_timer+0x67/0x6a [uhci_hcd]
>> <5>  [<c02c67a4>] process_backlog+0x6c/0xd9
>> <5>  [<c02c690f>] net_rx_action+0xfe/0x1f8
>> <5>  [<c012a7b1>] __do_softirq+0x35/0x79
>> <5>  [<c0107efb>] handle_IRQ_event+0x0/0x4f
>> <5>  [<c01094de>] do_softirq+0x46/0x4d
>>
>> Its an skb_over_panic BUG halt that results from processing an init chunk in
>> which too many of its variable length parameters are in some way malformed.
>>
>> The problem is in sctp_process_unk_param:
>> if (NULL == *errp)
>> 	*errp = sctp_make_op_error_space(asoc, chunk,
>> 					 ntohs(chunk->chunk_hdr->length));
>>
>> 	if (*errp) {
>> 		sctp_init_cause(*errp, SCTP_ERROR_UNKNOWN_PARAM,
>> 				 WORD_ROUND(ntohs(param.p->length)));
>> 		sctp_addto_chunk(*errp,
>> 			WORD_ROUND(ntohs(param.p->length)),
>> 				  param.v);
>>
>> When we allocate an error chunk, we assume that the worst case scenario requires
>> that we have chunk_hdr->length data allocated, which would be correct nominally,
>> given that we call sctp_addto_chunk for the violating parameter.  Unfortunately,
>> we also, in sctp_init_cause insert a sctp_errhdr_t structure into the error
>> chunk, so the worst case situation in which all parameters are in violation
>> requires chunk_hdr->length+(sizeof(sctp_errhdr_t)*param_count) bytes of data.
>>
>> The result of this error is that a deliberately malformed packet sent to a
>> listening host can cause a remote DOS, described in CVE-2010-1173:
>> http://cve.mitre.org/cgi-bin/cvename.cgi?name=2010-1173
>>
>> I've tested the below fix and confirmed that it fixes the issue.  It
>> pre-allocates the error chunk in sctp_verify_init, where we are able to count
>> the total number of variable length parameters, so we know how many error
>> headers we might need.  Then we simply use that chunk, if we find an error, or
>> discard/free it if all the parameters are valid.  Applies on top of the
>> lksctp-dev tree
>>
>> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>>
>>
>>  sm_make_chunk.c |   24 ++++++++++++++++++++++--
>>  1 file changed, 22 insertions(+), 2 deletions(-)
>>
>>
>> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
>> index f592163..990457b 100644
>> --- a/net/sctp/sm_make_chunk.c
>> +++ b/net/sctp/sm_make_chunk.c
>> @@ -2134,6 +2134,8 @@ int sctp_verify_init(const struct sctp_association *asoc,
>>  	union sctp_params param;
>>  	int has_cookie = 0;
>>  	int result;
>> +	unsigned int param_cnt;
>> +	unsigned int len;
>>  
>>  	/* Verify stream values are non-zero. */
>>  	if ((0 == peer_init->init_hdr.num_outbound_streams) ||
>> @@ -2149,6 +2151,7 @@ int sctp_verify_init(const struct sctp_association *asoc,
>>  
>>  		if (SCTP_PARAM_STATE_COOKIE == param.p->type)
>>  			has_cookie = 1;
>> +		param_cnt++;
>>  
>>  	} /* for (loop through all parameters) */
>>  
>> @@ -2169,6 +2172,20 @@ int sctp_verify_init(const struct sctp_association *asoc,
>>  		return sctp_process_missing_param(asoc, SCTP_PARAM_STATE_COOKIE,
>>  						  chunk, errp);
>>  
>> +	if (!*errp) {
>> +		/*
>> +		 * Pre-allocate the error packet here
>> +		 * we do this as we need to reserve space
>> +		 * for the worst case scenario in which 
>> +		 * every parameter is in error and needs 
>> +		 * an errhdr attached to it
>> +		 */
>> +		len = ntohs(chunk->chunk_hdr->length);
>> +		len += sizeof(sctp_errhdr_t)*param_cnt;
>> +
>> +		*errp = sctp_make_op_error_space(asoc, chunk, len);
>> +	}
>> +
>>  	/* Verify all the variable length parameters */
>>  	sctp_walk_params(param, peer_init, init_hdr.params) {
>>  
>> @@ -2176,9 +2193,11 @@ int sctp_verify_init(const struct sctp_association *asoc,
>>  		switch (result) {
>>  		    case SCTP_IERROR_ABORT:
>>  		    case SCTP_IERROR_NOMEM:
>> -				return 0;
>>  		    case SCTP_IERROR_ERROR:
>> -				return 1;
>> +				len = ntohs((*errp)->chunk_hdr->length);
>> +				if ((*errp) && (len == sizeof(sctp_chunkhdr_t)))
>> +					sctp_chunk_free(*errp);
>> +				return (result == SCTP_IERROR_ERROR) ? 1 : 0;
>>  		    case SCTP_IERROR_NO_ERROR:
>>  		    default:
>>  				break;
>> @@ -2186,6 +2205,7 @@ int sctp_verify_init(const struct sctp_association *asoc,
>>  
>>  	} /* for (loop through all parameters) */
>>  
>> +	sctp_chunk_free(*errp);
>>  	return 1;
>>  }
>>  
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-04-28 14:19 UTC (permalink / raw)
  To: hadi
  Cc: David Miller, xiaosuo, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272463605.2267.70.camel@edumazet-laptop>

Le mercredi 28 avril 2010 à 16:06 +0200, Eric Dumazet a écrit :
> Le mercredi 28 avril 2010 à 08:36 -0400, jamal a écrit :
> > On Wed, 2010-04-28 at 14:33 +0200, Eric Dumazet wrote:
> > 
> > > If you wait a bit, I have another patch to speedup udp receive path ;)
> > 
> > Shoot whenever you are ready ;-> I will test with and without your
> > patch..
> > 
> 
> Here it is ;)
> 
> Thanks

I forgot to say that with my previous DDOS test/bench (16 cpus trying to
feed one udp socket), my receiver can now process 420.000 pps instead of
200.000 ;)




^ permalink raw reply

* Re: [PATCH]: sctp: Fix skb_over_panic resulting from multiple invalid parameter errors (CVE-2010-1173)
From: Neil Horman @ 2010-04-28 14:21 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: sri, linux-sctp, eteo, netdev, davem, security
In-Reply-To: <4BD83F85.8090308@hp.com>

On Wed, Apr 28, 2010 at 10:00:37AM -0400, Vlad Yasevich wrote:
> I have this patch and a few others already queued.
> 
> I was planning on sending these today for stable.
> 
> Here is the full list of stable patches I have:
> 
> sctp: Fix oops when sending queued ASCONF chunks
> sctp: fix to calc the INIT/INIT-ACK chunk length correctly is set
> sctp: per_cpu variables should be in bh_disabled section
> sctp: fix potential reference of a freed pointer
> sctp: avoid irq lock inversion while call sk->sk_data_ready()
> 
> -vlad
> 
Are you sure?  this oops looks _very_ simmilar to the INIT/INIT-ACK length
calculation oops described above, but is in fact different, and requires this
patch, from what I can see.  The right fix might be in the ASCONF chunk patch
you list above, but I don't see that in your tree at the moment, so I can't be
sure.

Neil

> Neil Horman wrote:
> > Hey-
> > 	Recently, it was reported to me that the kernel could oops in the
> > following way:
> > 
> > <5> kernel BUG at net/core/skbuff.c:91!
> > <5> invalid operand: 0000 [#1]
> > <5> Modules linked in: sctp netconsole nls_utf8 autofs4 sunrpc iptable_filter
> > ip_tables cpufreq_powersave parport_pc lp parport vmblock(U) vsock(U) vmci(U)
> > vmxnet(U) vmmemctl(U) vmhgfs(U) acpiphp dm_mirror dm_mod button battery ac md5
> > ipv6 uhci_hcd ehci_hcd snd_ens1371 snd_rawmidi snd_seq_device snd_pcm_oss
> > snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_ac97_codec snd soundcore
> > pcnet32 mii floppy ext3 jbd ata_piix libata mptscsih mptsas mptspi mptscsi
> > mptbase sd_mod scsi_mod
> > <5> CPU:    0
> > <5> EIP:    0060:[<c02bff27>]    Not tainted VLI
> > <5> EFLAGS: 00010216   (2.6.9-89.0.25.EL) 
> > <5> EIP is at skb_over_panic+0x1f/0x2d
> > <5> eax: 0000002c   ebx: c033f461   ecx: c0357d96   edx: c040fd44
> > <5> esi: c033f461   edi: df653280   ebp: 00000000   esp: c040fd40
> > <5> ds: 007b   es: 007b   ss: 0068
> > <5> Process swapper (pid: 0, threadinfo=c040f000 task=c0370be0)
> > <5> Stack: c0357d96 e0c29478 00000084 00000004 c033f461 df653280 d7883180
> > e0c2947d 
> > <5>        00000000 00000080 df653490 00000004 de4f1ac0 de4f1ac0 00000004
> > df653490 
> > <5>        00000001 e0c2877a 08000800 de4f1ac0 df653490 00000000 e0c29d2e
> > 00000004 
> > <5> Call Trace:
> > <5>  [<e0c29478>] sctp_addto_chunk+0xb0/0x128 [sctp]
> > <5>  [<e0c2947d>] sctp_addto_chunk+0xb5/0x128 [sctp]
> > <5>  [<e0c2877a>] sctp_init_cause+0x3f/0x47 [sctp]
> > <5>  [<e0c29d2e>] sctp_process_unk_param+0xac/0xb8 [sctp]
> > <5>  [<e0c29e90>] sctp_verify_init+0xcc/0x134 [sctp]
> > <5>  [<e0c20322>] sctp_sf_do_5_1B_init+0x83/0x28e [sctp]
> > <5>  [<e0c25333>] sctp_do_sm+0x41/0x77 [sctp]
> > <5>  [<c01555a4>] cache_grow+0x140/0x233
> > <5>  [<e0c26ba1>] sctp_endpoint_bh_rcv+0xc5/0x108 [sctp]
> > <5>  [<e0c2b863>] sctp_inq_push+0xe/0x10 [sctp]
> > <5>  [<e0c34600>] sctp_rcv+0x454/0x509 [sctp]
> > <5>  [<e084e017>] ipt_hook+0x17/0x1c [iptable_filter]
> > <5>  [<c02d005e>] nf_iterate+0x40/0x81
> > <5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
> > <5>  [<c02e0c7f>] ip_local_deliver_finish+0xc6/0x151
> > <5>  [<c02d0362>] nf_hook_slow+0x83/0xb5
> > <5>  [<c02e0bb2>] ip_local_deliver+0x1a2/0x1a9
> > <5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
> > <5>  [<c02e103e>] ip_rcv+0x334/0x3b4
> > <5>  [<c02c66fd>] netif_receive_skb+0x320/0x35b
> > <5>  [<e0a0928b>] init_stall_timer+0x67/0x6a [uhci_hcd]
> > <5>  [<c02c67a4>] process_backlog+0x6c/0xd9
> > <5>  [<c02c690f>] net_rx_action+0xfe/0x1f8
> > <5>  [<c012a7b1>] __do_softirq+0x35/0x79
> > <5>  [<c0107efb>] handle_IRQ_event+0x0/0x4f
> > <5>  [<c01094de>] do_softirq+0x46/0x4d
> > 
> > Its an skb_over_panic BUG halt that results from processing an init chunk in
> > which too many of its variable length parameters are in some way malformed.
> > 
> > The problem is in sctp_process_unk_param:
> > if (NULL == *errp)
> > 	*errp = sctp_make_op_error_space(asoc, chunk,
> > 					 ntohs(chunk->chunk_hdr->length));
> > 
> > 	if (*errp) {
> > 		sctp_init_cause(*errp, SCTP_ERROR_UNKNOWN_PARAM,
> > 				 WORD_ROUND(ntohs(param.p->length)));
> > 		sctp_addto_chunk(*errp,
> > 			WORD_ROUND(ntohs(param.p->length)),
> > 				  param.v);
> > 
> > When we allocate an error chunk, we assume that the worst case scenario requires
> > that we have chunk_hdr->length data allocated, which would be correct nominally,
> > given that we call sctp_addto_chunk for the violating parameter.  Unfortunately,
> > we also, in sctp_init_cause insert a sctp_errhdr_t structure into the error
> > chunk, so the worst case situation in which all parameters are in violation
> > requires chunk_hdr->length+(sizeof(sctp_errhdr_t)*param_count) bytes of data.
> > 
> > The result of this error is that a deliberately malformed packet sent to a
> > listening host can cause a remote DOS, described in CVE-2010-1173:
> > http://cve.mitre.org/cgi-bin/cvename.cgi?name=2010-1173
> > 
> > I've tested the below fix and confirmed that it fixes the issue.  It
> > pre-allocates the error chunk in sctp_verify_init, where we are able to count
> > the total number of variable length parameters, so we know how many error
> > headers we might need.  Then we simply use that chunk, if we find an error, or
> > discard/free it if all the parameters are valid.  Applies on top of the
> > lksctp-dev tree
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > 
> > 
> >  sm_make_chunk.c |   24 ++++++++++++++++++++++--
> >  1 file changed, 22 insertions(+), 2 deletions(-)
> > 
> > 
> > diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> > index f592163..990457b 100644
> > --- a/net/sctp/sm_make_chunk.c
> > +++ b/net/sctp/sm_make_chunk.c
> > @@ -2134,6 +2134,8 @@ int sctp_verify_init(const struct sctp_association *asoc,
> >  	union sctp_params param;
> >  	int has_cookie = 0;
> >  	int result;
> > +	unsigned int param_cnt;
> > +	unsigned int len;
> >  
> >  	/* Verify stream values are non-zero. */
> >  	if ((0 == peer_init->init_hdr.num_outbound_streams) ||
> > @@ -2149,6 +2151,7 @@ int sctp_verify_init(const struct sctp_association *asoc,
> >  
> >  		if (SCTP_PARAM_STATE_COOKIE == param.p->type)
> >  			has_cookie = 1;
> > +		param_cnt++;
> >  
> >  	} /* for (loop through all parameters) */
> >  
> > @@ -2169,6 +2172,20 @@ int sctp_verify_init(const struct sctp_association *asoc,
> >  		return sctp_process_missing_param(asoc, SCTP_PARAM_STATE_COOKIE,
> >  						  chunk, errp);
> >  
> > +	if (!*errp) {
> > +		/*
> > +		 * Pre-allocate the error packet here
> > +		 * we do this as we need to reserve space
> > +		 * for the worst case scenario in which 
> > +		 * every parameter is in error and needs 
> > +		 * an errhdr attached to it
> > +		 */
> > +		len = ntohs(chunk->chunk_hdr->length);
> > +		len += sizeof(sctp_errhdr_t)*param_cnt;
> > +
> > +		*errp = sctp_make_op_error_space(asoc, chunk, len);
> > +	}
> > +
> >  	/* Verify all the variable length parameters */
> >  	sctp_walk_params(param, peer_init, init_hdr.params) {
> >  
> > @@ -2176,9 +2193,11 @@ int sctp_verify_init(const struct sctp_association *asoc,
> >  		switch (result) {
> >  		    case SCTP_IERROR_ABORT:
> >  		    case SCTP_IERROR_NOMEM:
> > -				return 0;
> >  		    case SCTP_IERROR_ERROR:
> > -				return 1;
> > +				len = ntohs((*errp)->chunk_hdr->length);
> > +				if ((*errp) && (len == sizeof(sctp_chunkhdr_t)))
> > +					sctp_chunk_free(*errp);
> > +				return (result == SCTP_IERROR_ERROR) ? 1 : 0;
> >  		    case SCTP_IERROR_NO_ERROR:
> >  		    default:
> >  				break;
> > @@ -2186,6 +2205,7 @@ int sctp_verify_init(const struct sctp_association *asoc,
> >  
> >  	} /* for (loop through all parameters) */
> >  
> > +	sctp_chunk_free(*errp);
> >  	return 1;
> >  }
> >  
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 

^ permalink raw reply

* Re: Checkpoint and Restart of INET routing information
From: Daniel Lezcano @ 2010-04-28 14:24 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers, netdev
In-Reply-To: <1272034539-19899-1-git-send-email-danms@us.ibm.com>

Dan Smith wrote:
> This set extends the existing network socket, device, and namespace support
> in the checkpoint tree to cover routing information.  It does so by making
> heavy use of RTNETLINK to dump and insert routes much like userspace would.
> Because the task doing the checkpointing or restarting needs to examine
> or setup resources for tasks in network namespaces other than its own, an
> additional kernel socket setup call is added.  It provides us the ability
> to talk to RTNETLINK in a foreign netns.
>
> The support added in this set allows me to configure various inet4 and inet6
> routes in a container and have them saved and restored successfully during
> a checkpoint/restart process.
>   

Why do you need to do that from the kernel ? Same remark for ipv4/6 
addresses.
What prevents you to do 'ip route show'  and use these informations to 
restore the routes later ?
Will we end up by moving all the network userspace tools in the kernel ? :)

If you use the Eric's setns patchset, you will be able to do that easily 
from userspace, no ?



^ permalink raw reply

* Re: [PATCH 0/3] [RFC] ptp: IEEE 1588 clock support
From: Wolfgang Grandegger @ 2010-04-28 14:31 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev
In-Reply-To: <4BD83D37.4060301@grandegger.com>

Wolfgang Grandegger wrote:
> Richard Cochran wrote:
>> On Tue, Apr 27, 2010 at 06:20:25PM +0200, Wolfgang Grandegger wrote:
>>> Do you have also a patch adding support for hardware timestamping to ptpd?
>> Yes, I do:
>>
>>    https://sourceforge.net/tracker/index.php?func=detail&aid=2992847&group_id=139814&atid=744634
> 
> Thanks.
> 
>> I should have mentioned, you also need the gianfar HW time stamping
>> patches, recently posted to netdev by Manfred Rudigier.
> 
> I'm aware of these patches. I'm actually using the net-next-2.6 git tree.
> 
> I got ptpd working but I do not yet see the PPS-Signals on my scope. At
> a first glance, the PPS-Signal seems to be configured by the gianfar_ptp
> driver (setting the fiper1 and timer1 registers) but I might have missed
> something.

That's because some 1588_PPS related bits are not yet setup in the
platform code of mainline kernel.

Wolfgang.

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-04-28 14:34 UTC (permalink / raw)
  To: hadi
  Cc: David Miller, xiaosuo, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272464368.2267.72.camel@edumazet-laptop>

Le mercredi 28 avril 2010 à 16:19 +0200, Eric Dumazet a écrit :

> I forgot to say that with my previous DDOS test/bench (16 cpus trying to
> feed one udp socket), my receiver can now process 420.000 pps instead of
> 200.000 ;)

And perf top of the cpu dedicated to the thread doing the recvmsg() is :
(after patch)

----------------------------------------------------------------------------------------------------------------------------------------------
   PerfTop:    1001 irqs/sec  kernel:98.0% [1000Hz cycles],  (all, cpu: 1)
----------------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ____________________________

             5463.00 45.5% _raw_spin_lock_bh             vmlinux                     
              761.00  6.3% copy_user_generic_string      vmlinux                     
              662.00  5.5% sock_recv_ts_and_drops        vmlinux                     
              645.00  5.4% kfree                         vmlinux                     
              568.00  4.7% _raw_spin_lock                vmlinux                     
              494.00  4.1% __skb_recv_datagram           vmlinux                     
              488.00  4.1% skb_copy_datagram_iovec       vmlinux                     
              467.00  3.9% __slab_free                   vmlinux                     
              176.00  1.5% udp_recvmsg                   vmlinux                     
              168.00  1.4% ia32_sysenter_target          vmlinux                     
              161.00  1.3% kmem_cache_free               vmlinux                     
              161.00  1.3% _raw_spin_lock_irqsave        vmlinux                     
              151.00  1.3% memcpy_toiovec                vmlinux                     
              131.00  1.1% fget_light                    vmlinux                     
              130.00  1.1% sock_rfree                    vmlinux                     
              104.00  0.9% inet_recvmsg                  vmlinux                     
               99.00  0.8% dst_release                   vmlinux                     
               98.00  0.8% skb_release_head_state        vmlinux                     
               83.00  0.7% __sk_mem_reclaim              vmlinux                     
               75.00  0.6% sys_recvfrom                  vmlinux                     
               61.00  0.5% sysexit_from_sys_call         vmlinux                     
               59.00  0.5% fput                          vmlinux                     
               56.00  0.5% schedule                      vmlinux                     
               56.00  0.5% sock_recvmsg                  vmlinux                     
               54.00  0.4% move_addr_to_user             vmlinux                     
               51.00  0.4% compat_sys_socketcall         vmlinux                     
               48.00  0.4% _raw_spin_unlock_bh           vmlinux                    



^ permalink raw reply

* Re: [PATCH]: sctp: Fix skb_over_panic resulting from multiple invalid parameter errors (CVE-2010-1173)
From: Vlad Yasevich @ 2010-04-28 14:37 UTC (permalink / raw)
  To: Neil Horman; +Cc: sri, linux-sctp, eteo, netdev, davem, security
In-Reply-To: <20100428142147.GB4818@hmsreliant.think-freely.org>



Neil Horman wrote:
> On Wed, Apr 28, 2010 at 10:00:37AM -0400, Vlad Yasevich wrote:
>> I have this patch and a few others already queued.
>>
>> I was planning on sending these today for stable.
>>
>> Here is the full list of stable patches I have:
>>
>> sctp: Fix oops when sending queued ASCONF chunks
>> sctp: fix to calc the INIT/INIT-ACK chunk length correctly is set
>> sctp: per_cpu variables should be in bh_disabled section
>> sctp: fix potential reference of a freed pointer
>> sctp: avoid irq lock inversion while call sk->sk_data_ready()
>>
>> -vlad
>>
> Are you sure?  this oops looks _very_ simmilar to the INIT/INIT-ACK length
> calculation oops described above, but is in fact different, and requires this
> patch, from what I can see.  The right fix might be in the ASCONF chunk patch
> you list above, but I don't see that in your tree at the moment, so I can't be
> sure.

As I said, I totally goofed when reading the description and I apologize.
However, I do one comment regarding the patch.

If the bad packet is REALLY long (I mean close to 65K IP limit), then
we'll end up allocating a supper huge skb in this case and potentially exceed
the IP length limitation.  Section 11.4 of rfc 4960 allows us to omit some
errors and limit the size of the packet.

I would recommend limiting this to MTU worth of potentiall errors.  This is
on top of what the INIT-ACK is going to carry, so at most we'll sent 2 MTUs
worth.  That's still a potential by amplification attack, but it's somewhat
mitigated.

Of course now we have to handle the case of checking for space before adding
an error cause. :)

-vlad

> 
> Neil
> 
>> Neil Horman wrote:
>>> Hey-
>>> 	Recently, it was reported to me that the kernel could oops in the
>>> following way:
>>>
>>> <5> kernel BUG at net/core/skbuff.c:91!
>>> <5> invalid operand: 0000 [#1]
>>> <5> Modules linked in: sctp netconsole nls_utf8 autofs4 sunrpc iptable_filter
>>> ip_tables cpufreq_powersave parport_pc lp parport vmblock(U) vsock(U) vmci(U)
>>> vmxnet(U) vmmemctl(U) vmhgfs(U) acpiphp dm_mirror dm_mod button battery ac md5
>>> ipv6 uhci_hcd ehci_hcd snd_ens1371 snd_rawmidi snd_seq_device snd_pcm_oss
>>> snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_ac97_codec snd soundcore
>>> pcnet32 mii floppy ext3 jbd ata_piix libata mptscsih mptsas mptspi mptscsi
>>> mptbase sd_mod scsi_mod
>>> <5> CPU:    0
>>> <5> EIP:    0060:[<c02bff27>]    Not tainted VLI
>>> <5> EFLAGS: 00010216   (2.6.9-89.0.25.EL) 
>>> <5> EIP is at skb_over_panic+0x1f/0x2d
>>> <5> eax: 0000002c   ebx: c033f461   ecx: c0357d96   edx: c040fd44
>>> <5> esi: c033f461   edi: df653280   ebp: 00000000   esp: c040fd40
>>> <5> ds: 007b   es: 007b   ss: 0068
>>> <5> Process swapper (pid: 0, threadinfo=c040f000 task=c0370be0)
>>> <5> Stack: c0357d96 e0c29478 00000084 00000004 c033f461 df653280 d7883180
>>> e0c2947d 
>>> <5>        00000000 00000080 df653490 00000004 de4f1ac0 de4f1ac0 00000004
>>> df653490 
>>> <5>        00000001 e0c2877a 08000800 de4f1ac0 df653490 00000000 e0c29d2e
>>> 00000004 
>>> <5> Call Trace:
>>> <5>  [<e0c29478>] sctp_addto_chunk+0xb0/0x128 [sctp]
>>> <5>  [<e0c2947d>] sctp_addto_chunk+0xb5/0x128 [sctp]
>>> <5>  [<e0c2877a>] sctp_init_cause+0x3f/0x47 [sctp]
>>> <5>  [<e0c29d2e>] sctp_process_unk_param+0xac/0xb8 [sctp]
>>> <5>  [<e0c29e90>] sctp_verify_init+0xcc/0x134 [sctp]
>>> <5>  [<e0c20322>] sctp_sf_do_5_1B_init+0x83/0x28e [sctp]
>>> <5>  [<e0c25333>] sctp_do_sm+0x41/0x77 [sctp]
>>> <5>  [<c01555a4>] cache_grow+0x140/0x233
>>> <5>  [<e0c26ba1>] sctp_endpoint_bh_rcv+0xc5/0x108 [sctp]
>>> <5>  [<e0c2b863>] sctp_inq_push+0xe/0x10 [sctp]
>>> <5>  [<e0c34600>] sctp_rcv+0x454/0x509 [sctp]
>>> <5>  [<e084e017>] ipt_hook+0x17/0x1c [iptable_filter]
>>> <5>  [<c02d005e>] nf_iterate+0x40/0x81
>>> <5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
>>> <5>  [<c02e0c7f>] ip_local_deliver_finish+0xc6/0x151
>>> <5>  [<c02d0362>] nf_hook_slow+0x83/0xb5
>>> <5>  [<c02e0bb2>] ip_local_deliver+0x1a2/0x1a9
>>> <5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
>>> <5>  [<c02e103e>] ip_rcv+0x334/0x3b4
>>> <5>  [<c02c66fd>] netif_receive_skb+0x320/0x35b
>>> <5>  [<e0a0928b>] init_stall_timer+0x67/0x6a [uhci_hcd]
>>> <5>  [<c02c67a4>] process_backlog+0x6c/0xd9
>>> <5>  [<c02c690f>] net_rx_action+0xfe/0x1f8
>>> <5>  [<c012a7b1>] __do_softirq+0x35/0x79
>>> <5>  [<c0107efb>] handle_IRQ_event+0x0/0x4f
>>> <5>  [<c01094de>] do_softirq+0x46/0x4d
>>>
>>> Its an skb_over_panic BUG halt that results from processing an init chunk in
>>> which too many of its variable length parameters are in some way malformed.
>>>
>>> The problem is in sctp_process_unk_param:
>>> if (NULL == *errp)
>>> 	*errp = sctp_make_op_error_space(asoc, chunk,
>>> 					 ntohs(chunk->chunk_hdr->length));
>>>
>>> 	if (*errp) {
>>> 		sctp_init_cause(*errp, SCTP_ERROR_UNKNOWN_PARAM,
>>> 				 WORD_ROUND(ntohs(param.p->length)));
>>> 		sctp_addto_chunk(*errp,
>>> 			WORD_ROUND(ntohs(param.p->length)),
>>> 				  param.v);
>>>
>>> When we allocate an error chunk, we assume that the worst case scenario requires
>>> that we have chunk_hdr->length data allocated, which would be correct nominally,
>>> given that we call sctp_addto_chunk for the violating parameter.  Unfortunately,
>>> we also, in sctp_init_cause insert a sctp_errhdr_t structure into the error
>>> chunk, so the worst case situation in which all parameters are in violation
>>> requires chunk_hdr->length+(sizeof(sctp_errhdr_t)*param_count) bytes of data.
>>>
>>> The result of this error is that a deliberately malformed packet sent to a
>>> listening host can cause a remote DOS, described in CVE-2010-1173:
>>> http://cve.mitre.org/cgi-bin/cvename.cgi?name=2010-1173
>>>
>>> I've tested the below fix and confirmed that it fixes the issue.  It
>>> pre-allocates the error chunk in sctp_verify_init, where we are able to count
>>> the total number of variable length parameters, so we know how many error
>>> headers we might need.  Then we simply use that chunk, if we find an error, or
>>> discard/free it if all the parameters are valid.  Applies on top of the
>>> lksctp-dev tree
>>>
>>> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>>>
>>>
>>>  sm_make_chunk.c |   24 ++++++++++++++++++++++--
>>>  1 file changed, 22 insertions(+), 2 deletions(-)
>>>
>>>
>>> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
>>> index f592163..990457b 100644
>>> --- a/net/sctp/sm_make_chunk.c
>>> +++ b/net/sctp/sm_make_chunk.c
>>> @@ -2134,6 +2134,8 @@ int sctp_verify_init(const struct sctp_association *asoc,
>>>  	union sctp_params param;
>>>  	int has_cookie = 0;
>>>  	int result;
>>> +	unsigned int param_cnt;
>>> +	unsigned int len;
>>>  
>>>  	/* Verify stream values are non-zero. */
>>>  	if ((0 == peer_init->init_hdr.num_outbound_streams) ||
>>> @@ -2149,6 +2151,7 @@ int sctp_verify_init(const struct sctp_association *asoc,
>>>  
>>>  		if (SCTP_PARAM_STATE_COOKIE == param.p->type)
>>>  			has_cookie = 1;
>>> +		param_cnt++;
>>>  
>>>  	} /* for (loop through all parameters) */
>>>  
>>> @@ -2169,6 +2172,20 @@ int sctp_verify_init(const struct sctp_association *asoc,
>>>  		return sctp_process_missing_param(asoc, SCTP_PARAM_STATE_COOKIE,
>>>  						  chunk, errp);
>>>  
>>> +	if (!*errp) {
>>> +		/*
>>> +		 * Pre-allocate the error packet here
>>> +		 * we do this as we need to reserve space
>>> +		 * for the worst case scenario in which 
>>> +		 * every parameter is in error and needs 
>>> +		 * an errhdr attached to it
>>> +		 */
>>> +		len = ntohs(chunk->chunk_hdr->length);
>>> +		len += sizeof(sctp_errhdr_t)*param_cnt;
>>> +
>>> +		*errp = sctp_make_op_error_space(asoc, chunk, len);
>>> +	}
>>> +
>>>  	/* Verify all the variable length parameters */
>>>  	sctp_walk_params(param, peer_init, init_hdr.params) {
>>>  
>>> @@ -2176,9 +2193,11 @@ int sctp_verify_init(const struct sctp_association *asoc,
>>>  		switch (result) {
>>>  		    case SCTP_IERROR_ABORT:
>>>  		    case SCTP_IERROR_NOMEM:
>>> -				return 0;
>>>  		    case SCTP_IERROR_ERROR:
>>> -				return 1;
>>> +				len = ntohs((*errp)->chunk_hdr->length);
>>> +				if ((*errp) && (len == sizeof(sctp_chunkhdr_t)))
>>> +					sctp_chunk_free(*errp);
>>> +				return (result == SCTP_IERROR_ERROR) ? 1 : 0;
>>>  		    case SCTP_IERROR_NO_ERROR:
>>>  		    default:
>>>  				break;
>>> @@ -2186,6 +2205,7 @@ int sctp_verify_init(const struct sctp_association *asoc,
>>>  
>>>  	} /* for (loop through all parameters) */
>>>  
>>> +	sctp_chunk_free(*errp);
>>>  	return 1;
>>>  }
>>>  
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
> 

^ permalink raw reply

* Re: [PATCH 1/2] netfilter: xtables: inclusion of xt_SYSRQ
From: John Haxby @ 2010-04-28 14:43 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Patrick McHardy, Netfilter Developer Mailing List,
	Linux Netdev List
In-Reply-To: <alpine.LSU.2.01.1004211533410.21020@obet.zrqbmnf.qr>

On 21/04/10 14:35, Jan Engelhardt wrote:
> On Wednesday 2010-04-21 15:17, Patrick McHardy wrote:
>    
>> Jan Engelhardt wrote:
>>      
>>> On Wednesday 2010-04-21 14:59, Patrick McHardy wrote:
>>>
>>>        
>>>> Jan Engelhardt wrote:
>>>>          
>>>>> The SYSRQ target will allow to remotely invoke sysrq on the local
>>>>> machine. Authentication is by means of a pre-shared key that can
>>>>> either be transmitted plaintext or digest-secured.
>>>>>            
>>>> I really think this is pushing what netfilter is meant for a bit
>>>> far. Its basically abusing the firewall ruleset to offer a network
>>>> service.
>>>>
>>>> I can see that its useful to have this in the kernel instead of
>>>> userspace, but why isn't this implemented as a stand-alone module?
>>>> That seems like a better design to me and also makes it more useful
>>>> by not depending on netfilter.
>>>>          
>>> That sort of diverts from the earlier what-seemed-to-be-consensus.
>>>
>>> Oh well, I would not mind holding the single commit up as long as the
>>> rest isn't blocked too :-)
>>>        
>> Then lets skip this one for now.
>>      
> Well you raised the concern before -- namely that kdboe would have
> the very same feature. And yet, kdboe was not part of the kernel.
> Neither is the magical stand-alone module.
> I really prefer to have it in rather than out, because I know
> that's going to mess up maintenance-here-and-there. I'm already
> having a big time with xtables-addons that still carries
> xt_condition and SYSRQ for a while, and it does have some different
> code lines than the kernel copy.
>    

I have to agree with Jan here, but I'd like to raise some additional points.

kdboe (or kgdboe) isn't part of the kernel and I don't think it 
necessarily fits all the use cases for xt_SYSRQ.  The one I have in mind 
is where there is a non-kernel hacker whose machine has got into 
trouble.  The poor harrassed sys admin (in this case) has configured 
netconsole and knows that sysrq-t and sysrq-m are useful as a first 
attempt at passing useful information to someone who knows what might be 
going on and that sysrq-c to get a crash dump will also be useful.   
(This represents quite a few of the better sys admins that I come 
across.)   xt_SYSRQ is likewise easy to set up and easy to use.   It's 
true that k(g)dboe would provide this kind of information provided that 
the debuginfo was present on the target machine and the environment was 
such that any sort of debugging over netconsole was sufficiently secure 
... (is it at least as secure as the xt_SYSRQ controls?)

I was running over the design of a standalone module in my head on the 
way in this morning.   It seems fairly straightforward, but as I started 
adding in necessary requirements like limited IP addresses (which I know 
are not actually secure), limited interfaces (which are more secure in a 
controlled physical environment), user-space control and so on the more 
it was sounding as though it would just be a cut-down iptables.   And 
then, of course, that begs the question "why don't you leave all that 
extra stuff to iptables?"

My own interest in getting xt_SYSRQ into the mainline kernel is that it 
would then be easier to get it accepted in production kernels where it 
would make the poor beleaguered sys admin's life a little easier.  That 
is, _some_ useful information or even a crash dump could be extracted 
from the machine before it's big red button time.

jch

^ permalink raw reply

* Re: [Bugme-new] [Bug 15868] New: Deleting IP address from interface doesn't prevent sending a data.
From: Andrew Morton @ 2010-04-28 11:42 UTC (permalink / raw)
  To: netdev; +Cc: bugzilla-daemon, bugme-daemon, Yurij.Plotnikov
In-Reply-To: <bug-15868-10286@https.bugzilla.kernel.org/>


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Wed, 28 Apr 2010 08:11:02 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=15868
> 
>            Summary: Deleting IP address from interface doesn't prevent
>                     sending a data.
>            Product: Networking
>            Version: 2.5
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: IPV4
>         AssignedTo: shemminger@linux-foundation.org
>         ReportedBy: Yurij.Plotnikov@oktetlabs.ru
>         Regression: No
> 
> 
> Starting from 2.6.26, Linux kernel has strange behavior for the interface with
> two IPv4 addresses.
> 
>  Let A and B are hosts with directly connected interfaces ethA (on host A) and
> ethB (on host B). Let 10.10.0.1/24 and 10.10.0.3/24 addresses are assigned to
> ethA and 10.10.0.2/24 address is assigned to ethB. Let there is established TCP
> connection between host A and host B with sockets sock_A and sock_B that are
> bound to 10.10.0.3 and 10.10.0.2 addresses respectively. Then if someone
> deletes 10.10.0.3 address from ethA interface and after that send some data
> from sock_A socket then the data will be delivered to sock_B socket and someone
> can read it from this socket.
> 
>  There is the same picture for UDP sockets. With previous definitions if there
> are UDP sockets sock_A on host A and sock_B on host B and they are bound to
> 10.10.0.3 and 10.10.0.2 addresses respectively and they are connected to
> 10.10.0.2 and 10.10.0.3 addresses respectively then if someone deletes
> 10.10.0.3 address from ethA interface and after that send some data using
> send() function from sock_A then the data will be delivered to sock_B.
> 
> The data will not be sent in both cases if there are no addresses assigned to
> the interface after address removing.


^ permalink raw reply

* Re: [PATCH 1/2] netfilter: xtables: inclusion of xt_SYSRQ
From: John Haxby @ 2010-04-28 14:54 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Patrick McHardy, Netfilter Developer Mailing List,
	Linux Netdev List
In-Reply-To: <4BD84992.5030909@oracle.com>

On 28/04/10 15:43, John Haxby wrote:
>
> kdboe (or kgdboe) isn't part of the kernel and I don't think it 
> necessarily fits all the use cases for xt_SYSRQ.  The one I have in 
> mind is where there is a non-kernel hacker whose machine has got into 
> trouble.  The poor harrassed sys admin (in this case) has configured 
> netconsole and knows that sysrq-t and sysrq-m are useful as a first 
> attempt at passing useful information to someone who knows what might 
> be going on and that sysrq-c to get a crash dump will also be 
> useful.   (This represents quite a few of the better sys admins that I 
> come across.)   xt_SYSRQ is likewise easy to set up and easy to use.   
> It's true that k(g)dboe would provide this kind of information 
> provided that the debuginfo was present on the target machine and the 
> environment was such that any sort of debugging over netconsole was 
> sufficiently secure ... (is it at least as secure as the xt_SYSRQ 
> controls?)
>

I really must read what I've written more carefully.   I should have 
gone on to say that I don't see that k(g)dboe will be viable in this use 
case although for someone actually debugging a kernel on a machine that 
they have access to xt_SYSRQ leaves an awful lot to be desired :-)   But 
that isn't the common use-case I see -- the one I see is where the sys 
admins used to have a "crash trolley" which was a console and PS/2 
keyboard which they could plug into a machine to get some information, 
but as many rack machines no longer have anything PS/2 and USB hot plug 
is unlikely to work on a sick machine we need a sufficiently light 
mechanism that it will work in most cases (xt_SYSRQ is careful to 
pre-allocate most of the resources it will need).

And then I should have said that moving on to the possibility of a 
standalone module and that ...
> I was running over the design of a standalone module in my head on the 
> way in this morning.   It seems fairly straightforward, but as I 
> started adding in necessary requirements like limited IP addresses 
> (which I know are not actually secure), limited interfaces (which are 
> more secure in a controlled physical environment), user-space control 
> and so on the more it was sounding as though it would just be a 
> cut-down iptables.   And then, of course, that begs the question "why 
> don't you leave all that extra stuff to iptables?"

So unless I'm missing something obvious and different, I don't see that 
a standalone module is going to be lightweight enough to be acceptable.

Sorry for not making filling this parts in earlier.

jch

^ permalink raw reply

* Re: [PATCH 1/2] netfilter: xtables: inclusion of xt_SYSRQ
From: Jan Engelhardt @ 2010-04-28 15:03 UTC (permalink / raw)
  To: John Haxby
  Cc: Patrick McHardy, Netfilter Developer Mailing List,
	Linux Netdev List
In-Reply-To: <4BD84C23.2000301@oracle.com>


On Wednesday 2010-04-28 16:54, John Haxby wrote:
>
> use-case I see -- the one I see is where the sys admins used to have a "crash
> trolley" which was a console and PS/2 keyboard which they could plug into a
> machine to get some information, but as many rack machines no longer have
> anything PS/2 and USB hot plug is unlikely to work on a sick machine

Oh I can tell you stories... sometimes it's so dead in the water that
the console unblanking would not work any more, rendering even any PS/2
useless. Stupid southbridge chipsets blowing up DMA :-)

^ permalink raw reply

* Re: [PATCH 2/4] [RFC] Add sock_create_kern_net()
From: Dan Smith @ 2010-04-28 15:06 UTC (permalink / raw)
  To: David Miller; +Cc: containers, netdev
In-Reply-To: <20100427.171844.77354120.davem@davemloft.net>

Hi,

DM> If you can create netlink sockets in a remote NS you can also make
DM> changes there, and the whole point is to disallow changes.

DM> So maybe you won't be making changes, but others will think about
DM> using this and doing so.

I would be making changes on restart, because I insert routes.  As has
been pointed out, Eric's setns() patches allow this sort of violation
from userspace even :)

Following that example, I could have the checkpointing task stash the
current nsproxy and temporarily jump to the destination netns to do
the checkpoint.  I'll cook up something to look at...

Thanks Dave!

-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: sk_add_backlog() take rmem_alloc into account
From: Brian Bloniarz @ 2010-04-28 15:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, therbert, netdev, rick.jones2
In-Reply-To: <1272399662.2343.12.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le mardi 27 avril 2010 à 19:37 +0200, Eric Dumazet a écrit :
> 
>> We might use the ticket spinlock paradigm to let writers go in parallel
>> and let the user the socket lock
>>
>> Instead of having the bh_lock_sock() to protect receive_queue *and*
>> backlog, writers get a unique slot in a table, that 'user' can handle
>> later.
>>
>> Or serialize writers (before they try to bh_lock_sock()) with a
>> dedicated lock, so that user has 50% chances to get the sock lock,
>> contending with at most one writer.
> 
> Following patch fixes the issue for me, with little performance hit on
> fast path.
> 
> Under huge stress from a multiqueue/RPS enabled NIC, a single flow udp
> receiver can now process ~200.000 pps (instead of ~100 pps before the
> patch) on my dev machine.
> 
> Thanks !
> 
> [PATCH net-next-2.6] net: sk_add_backlog() take rmem_alloc into account
> 
> Current socket backlog limit is not enough to really stop DDOS attacks,
> because user thread spend many time to process a full backlog each
> round, and user might crazy spin on socket lock.
> 
> We should add backlog size and receive_queue size (aka rmem_alloc) to
> pace writers, and let user run without being slow down too much.
> 
> Introduce a sk_rcvqueues_full() helper, to avoid taking socket lock in
> stress situations.
> 
> Under huge stress from a multiqueue/RPS enabled NIC, a single flow udp
> receiver can now process ~200.000 pps (instead of ~100 pps before the
> patch) on a 8 core machine.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Wow that was awesome.

> ---
>  include/net/sock.h |   13 +++++++++++--
>  net/core/sock.c    |    5 ++++-
>  net/ipv4/udp.c     |    4 ++++
>  net/ipv6/udp.c     |    8 ++++++++
>  4 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 86a8ca1..4b0097d 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -255,7 +255,6 @@ struct sock {
>  		struct sk_buff *head;
>  		struct sk_buff *tail;
>  		int len;
> -		int limit;
>  	} sk_backlog;
>  	wait_queue_head_t	*sk_sleep;
>  	struct dst_entry	*sk_dst_cache;
> @@ -604,10 +603,20 @@ static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
>  	skb->next = NULL;
>  }
>  
> +/*
> + * Take into account size of receive queue and backlog queue
> + */
> +static inline bool sk_rcvqueues_full(const struct sock *sk, const struct sk_buff *skb)
> +{
> +	unsigned int qsize = sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc);
> +
> +	return qsize + skb->truesize > sk->sk_rcvbuf;
> +}
> +

Reading sk_backlog.len without the socket lock held seems to
contradict the comment in sock.h:
  *	@sk_backlog: always used with the per-socket spinlock held
  ...
struct sock {

        ...
	/*
	 * The backlog queue is special, it is always used with
	 * the per-socket spinlock held and requires low latency
	 * access. Therefore we special case it's implementation.
	 */
	struct { ... } sk_backlog;

Is this just a doc mismatch or does sk_backlog.len need to use
atomic_read & atomic_set?

>  /* The per-socket spinlock must be held here. */
>  static inline __must_check int sk_add_backlog(struct sock *sk, struct sk_buff *skb)
>  {
> -	if (sk->sk_backlog.len >= max(sk->sk_backlog.limit, sk->sk_rcvbuf << 1))
> +	if (sk_rcvqueues_full(sk, skb))
>  		return -ENOBUFS;
>  
>  	__sk_add_backlog(sk, skb);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 58ebd14..5104175 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -327,6 +327,10 @@ int sk_receive_skb(struct sock *sk, struct sk_buff *skb, const int nested)
>  
>  	skb->dev = NULL;
>  
> +	if (sk_rcvqueues_full(sk, skb)) {
> +		atomic_inc(&sk->sk_drops);
> +		goto discard_and_relse;
> +	}
>  	if (nested)
>  		bh_lock_sock_nested(sk);
>  	else
> @@ -1885,7 +1889,6 @@ void sock_init_data(struct socket *sock, struct sock *sk)
>  	sk->sk_allocation	=	GFP_KERNEL;
>  	sk->sk_rcvbuf		=	sysctl_rmem_default;
>  	sk->sk_sndbuf		=	sysctl_wmem_default;
> -	sk->sk_backlog.limit	=	sk->sk_rcvbuf << 1;
>  	sk->sk_state		=	TCP_CLOSE;
>  	sk_set_socket(sk, sock);
>  
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 1e18f9c..776c844 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1372,6 +1372,10 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>  			goto drop;
>  	}
>  
> +
> +	if (sk_rcvqueues_full(sk, skb))
> +		goto drop;
> +
>  	rc = 0;
>  
>  	bh_lock_sock(sk);
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 2850e35..3ead20a 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -584,6 +584,10 @@ static void flush_stack(struct sock **stack, unsigned int count,
>  
>  		sk = stack[i];
>  		if (skb1) {
> +			if (sk_rcvqueues_full(sk, skb)) {
> +				kfree_skb(skb1);
> +				goto drop;
> +			}
>  			bh_lock_sock(sk);
>  			if (!sock_owned_by_user(sk))
>  				udpv6_queue_rcv_skb(sk, skb1);
> @@ -759,6 +763,10 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>  
>  	/* deliver */
>  
> +	if (sk_rcvqueues_full(sk, skb)) {
> +		sock_put(sk);
> +		goto discard;
> +	}
>  	bh_lock_sock(sk);
>  	if (!sock_owned_by_user(sk))
>  		udpv6_queue_rcv_skb(sk, skb);
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: [PATCH net-next-2.6] bnx2x: Remove two prefetch()
From: Eliezer Tamir @ 2010-04-28 15:44 UTC (permalink / raw)
  To: eilong
  Cc: David Miller, vladz, eric.dumazet@gmail.com, xiaosuo@gmail.com,
	hadi@cyberus.ca, therbert@google.com, shemminger@vyatta.com,
	netdev@vger.kernel.org
In-Reply-To: <1272460455.30392.24.camel@lb-tlvb-eilong.il.broadcom.com>

On Wed, Apr 28, 2010 at 4:14 PM, Eilon Greenstein <eilong@broadcom.com> wrote:
>
> On Tue, 2010-04-27 at 15:19 -0700, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Wed, 28 Apr 2010 00:18:13 +0200
> >
> > > [PATCH net-next-2.6] bnx2x: Remove two prefetch()
> > >
> > > 1) Even on 64bit arches, sizeof(struct sk_buff) < 256
> > > 2) No need to prefetch same pointer twice.
> > >
> > > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > > CC: Eilon Greenstein <eilong@broadcom.com>
> >
> > Eilon please review and ACK/NACK
>
> Vlad ran few benchmarks, and we couldn't find any justification for
> those prefetch calls. After consulting with Eliezer Tamir (the original
> author) we are glad to Ack this patch.
>
> Thanks Eric!
> Acked-by: <eilong@broadcom.com>
>
>
Normally, I would not have said anything but since Eilon asked.
Acked-by: <eliezer@tamir.org.il>
(this time in plain text)

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox