Netdev List

Netdev List
 help / color / mirror / Atom feed

* Congratulations!
From: Yahoo Team @ 2010-11-12 14:32 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain, Size: 24 bytes --]

Please view the attached

[-- Attachment #2: XS.pdf --]
[-- Type: application/pdf, Size: 85075 bytes --]

^ permalink raw reply

* Re: [PATCH 2/10] Fix leaking of kernel heap addresses in net/
From: Ben Hutchings @ 2010-11-12 15:09 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: David S. Miller, Oliver Hartkopp, Alexey Kuznetsov, Urs Thuermann,
	Hideaki YOSHIFUJI, Patrick McHardy, James Morris,
	Remi Denis-Courmont, Pekka Savola (ipv6), Sridhar Samudrala,
	Vlad Yasevich, Tejun Heo, Eric Dumazet, Li Zefan, Joe Perches,
	Stephen Hemminger, Jamal Hadi Salim, Eric W. Biederman,
	Alexey Dobriyan, Jiri Pirko, Johannes Berg, Daniel Lezcano,
	Pavel 
In-Reply-To: <1289524019.5167.66.camel@dan>

On Thu, 2010-11-11 at 20:06 -0500, Dan Rosenberg wrote:
> diff --git a/net/can/bcm.c b/net/can/bcm.c
> index 08ffe9e..5960ad7 100644
> --- a/net/can/bcm.c
> +++ b/net/can/bcm.c
[...]
> +		seq_printf(m, ">>> socket %lu", sock_i_ino(sk));

Why decimal here...

[...]
> +		/* unique socket inode as filename */
> +		sprintf(bo->procname, "%lx", sock_i_ino(sk));

...and hexadecimal here?

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH 3/10] Fix leaking of kernel heap addresses in net/
From: Ben Hutchings @ 2010-11-12 15:11 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: David S. Miller, Oliver Hartkopp, Alexey Kuznetsov, Urs Thuermann,
	Hideaki YOSHIFUJI, Patrick McHardy, James Morris,
	Remi Denis-Courmont, Pekka Savola (ipv6), Sridhar Samudrala,
	Vlad Yasevich, Tejun Heo, Eric Dumazet, Li Zefan, Joe Perches,
	Stephen Hemminger, Jamal Hadi Salim, Eric W. Biederman,
	Alexey Dobriyan, Jiri Pirko, Johannes Berg, Daniel Lezcano,
	Pavel 
In-Reply-To: <1289524023.5167.67.camel@dan>

On Thu, 2010-11-11 at 20:07 -0500, Dan Rosenberg wrote:
> diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
> index 1f85ef2..0ac8ff2 100644
> --- a/net/ipv4/raw.c
> +++ b/net/ipv4/raw.c
[...]
> +	/* Only expose kernel addresses to privileged readers */
> +	if (capable(CAP_NET_ADMIN))
> +		seq_printf(seq, "%4d: %08X:%04X %08X:%04X"
> +			" %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %p %d\n",
> +			i, src, srcp, dest, destp, sp->sk_state,
> +			sk_wmem_alloc_get(sp),
> +			sk_rmem_alloc_get(sp),
> +			0, 0L, 0, sock_i_uid(sp), 0, sock_i_ino(sp),
> +			atomic_read(&sp->sk_refcnt),
> +			sp, atomic_read(&sp->sk_drops));
> +	else
> +		seq_printf(seq, "%4d: %08X:%04X %08X:%04X"
> +			" %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %d %d\n",
> +			i, src, srcp, dest, destp, sp->sk_state,
> +			sk_wmem_alloc_get(sp),
> +			sk_rmem_alloc_get(sp),
> +			0, 0L, 0, sock_i_uid(sp), 0, sock_i_ino(sp),
> +			atomic_read(&sp->sk_refcnt),
> +			0, atomic_read(&sp->sk_drops));
[...]

This could just be written as:

	seq_printf(seq, "%4d: %08X:%04X %08X:%04X"
		" %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %lx %d\n",
		i, src, srcp, dest, destp, sp->sk_state,
		sk_wmem_alloc_get(sp),
		sk_rmem_alloc_get(sp),
		0, 0L, 0, sock_i_uid(sp), 0, sock_i_ino(sp),
		atomic_read(&sp->sk_refcnt),
		capable(CAP_NET_ADMIN) ? (unsigned long)sp : 0UL,
		atomic_read(&sp->sk_drops));

Similarly for other formats that you want to make conditional on
CAP_NET_ADMIN.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH 2/10] Fix leaking of kernel heap addresses in net/
From: Dan Rosenberg @ 2010-11-12 15:11 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David S. Miller, Oliver Hartkopp, Alexey Kuznetsov, Urs Thuermann,
	Hideaki YOSHIFUJI, Patrick McHardy, James Morris,
	Remi Denis-Courmont, Pekka Savola (ipv6), Sridhar Samudrala,
	Vlad Yasevich, Tejun Heo, Eric Dumazet, Li Zefan, Joe Perches,
	Stephen Hemminger, Jamal Hadi Salim, Eric W. Biederman,
	Alexey Dobriyan, Jiri Pirko, Johannes Berg, Daniel Lezcano,
	Pavel 
In-Reply-To: <1289574562.2247.1.camel@achroite.uk.solarflarecom.com>


> On Thu, 2010-11-11 at 20:06 -0500, Dan Rosenberg wrote:
> > diff --git a/net/can/bcm.c b/net/can/bcm.c
> > index 08ffe9e..5960ad7 100644
> > --- a/net/can/bcm.c
> > +++ b/net/can/bcm.c
> [...]
> > +		seq_printf(m, ">>> socket %lu", sock_i_ino(sk));
> 
> Why decimal here...
> 
> [...]
> > +		/* unique socket inode as filename */
> > +		sprintf(bo->procname, "%lx", sock_i_ino(sk));
> 
> ...and hexadecimal here?
> 
> Ben.
> 

You're right, it should be consistent.  I avoided decimal in the /proc
filename because it may be too long - the next version will do the same
for the seq_print output.

-Dan


^ permalink raw reply

* Re: [PATCH 3/10] Fix leaking of kernel heap addresses in net/
From: Dan Rosenberg @ 2010-11-12 15:14 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David S. Miller, Oliver Hartkopp, Alexey Kuznetsov, Urs Thuermann,
	Hideaki YOSHIFUJI, Patrick McHardy, James Morris,
	Remi Denis-Courmont, Pekka Savola (ipv6), Sridhar Samudrala,
	Vlad Yasevich, Tejun Heo, Eric Dumazet, Li Zefan, Joe Perches,
	Stephen Hemminger, Jamal Hadi Salim, Eric W. Biederman,
	Alexey Dobriyan, Jiri Pirko, Johannes Berg, Daniel Lezcano,
	Pavel 
In-Reply-To: <1289574706.2247.4.camel@achroite.uk.solarflarecom.com>


> 
> Similarly for other formats that you want to make conditional on
> CAP_NET_ADMIN.

I wanted to avoid a %p output of "(null)" in case userspace tools
couldn't handle that, but I agree this is much nicer looking.  I'm
currently putting together a new format specifier instead, as per the
suggestions of Eric and David.

-Dan


^ permalink raw reply

* [PATCH 2/2] netfilter: ipv6: fix overlap check for fragments
From: kaber @ 2010-11-12 15:19 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev, Shan Wei, Patrick McHardy
In-Reply-To: <1289575172-7272-1-git-send-email-kaber@trash.net>

From: Shan Wei <shanwei@cn.fujitsu.com>

The type of FRAG6_CB(prev)->offset is int, skb->len is *unsigned* int,
and offset is int.

Without this patch, type conversion occurred to this expression, when
(FRAG6_CB(prev)->offset + prev->len) is less than offset.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/ipv6/netfilter/nf_conntrack_reasm.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 3a3f129..79d43aa 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -286,7 +286,7 @@ found:
 
 	/* Check for overlap with preceding fragment. */
 	if (prev &&
-	    (NFCT_FRAG6_CB(prev)->offset + prev->len) - offset > 0)
+	    (NFCT_FRAG6_CB(prev)->offset + prev->len) > offset)
 		goto discard_fq;
 
 	/* Look for overlap with succeeding segment. */
-- 
1.7.3.2


^ permalink raw reply related

* [PATCH 0/2] netfilter: netfilter fixes
From: kaber @ 2010-11-12 15:19 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev

Hi Dave,

The following two patches fix some netfilter bugs:

- missing parentheses in NF_HOOK_COND, breaking error propagation for
  dropped packets. From Eric Paris.

- incorrect checking for overlapping fragments in IPv6 conntrack
  reassembly. From Shan Wei

Please apply or pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6.git

Thanks!

^ permalink raw reply

* [PATCH 1/2] netfilter: NF_HOOK_COND has wrong conditional
From: kaber @ 2010-11-12 15:19 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev, Eric Paris, stable, Patrick McHardy
In-Reply-To: <1289575172-7272-1-git-send-email-kaber@trash.net>

From: Eric Paris <eparis@redhat.com>

The NF_HOOK_COND returns 0 when it shouldn't due to what I believe to be an
error in the code as the order of operations is not what was intended.  C will
evalutate == before =.  Which means ret is getting set to the bool result,
rather than the return value of the function call.  The code says

if (ret = function() == 1)
when it meant to say:
if ((ret = function()) == 1)

Normally the compiler would warn, but it doesn't notice it because its
a actually complex conditional and so the wrong code is wrapped in an explict
set of () [exactly what the compiler wants you to do if this was intentional].
Fixing this means that errors when netfilter denies a packet get propagated
back up the stack rather than lost.

Problem introduced by commit 2249065f (netfilter: get rid of the grossness
in netfilter.h).

Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 include/linux/netfilter.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 89341c3..03317c8 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -215,7 +215,7 @@ NF_HOOK_COND(uint8_t pf, unsigned int hook, struct sk_buff *skb,
 	int ret;

 	if (!cond ||
-	    (ret = nf_hook_thresh(pf, hook, skb, in, out, okfn, INT_MIN) == 1))
+	    ((ret = nf_hook_thresh(pf, hook, skb, in, out, okfn, INT_MIN)) == 1))
 		ret = okfn(skb);
 	return ret;
 }
-- 
1.7.3.2

^ permalink raw reply related

* [PATCH net-next-2.6 V2] igmp: RCU conversion of in_dev->mc_list
From: Eric Dumazet @ 2010-11-12 15:46 UTC (permalink / raw)
  To: Américo Wang; +Cc: Cypher Wu, linux-kernel, netdev, David Miller
In-Reply-To: <1289571974.3185.254.camel@edumazet-laptop>

Le vendredi 12 novembre 2010 à 15:26 +0100, Eric Dumazet a écrit :
> Le vendredi 12 novembre 2010 à 14:34 +0100, Eric Dumazet a écrit :
> > Le vendredi 12 novembre 2010 à 10:22 +0100, Eric Dumazet a écrit :
> > > Le vendredi 12 novembre 2010 à 16:19 +0800, Américo Wang a écrit :
> > > > On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
> > > 
> > > > >A RCU conversion is far more complex.
> > > > >
> > > > 
> > > > Yup.
> > > 
> > > 
> > > Well, actually this is easy in this case.
> > > 
> > > I'll post a patch to do this RCU conversion.
> > > 
> > > 
> > 
> > Note : compile tested only, I'll appreciate if someone can test it ;)
> > 
> > Note: one patch from net-2.6 is not yet included in net-next-2.6, so
> > please make sure you have it before testing ;)
> > 
> > ( http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commitdiff;h=18943d292facbc70e6a36fc62399ae833f64671b )
> > 
> > 
> > Thanks
> > 
> > [PATCH net-next-2.6] igmp: RCU conversion of in_dev->mc_list
> > 
> > in_dev->mc_list is protected by one rwlock (in_dev->mc_list_lock).
> > 
> > This can easily be converted to a RCU protection.
> > 
> > Writers hold RTNL, so mc_list_lock is removed, not replaced by a
> > spinlock.
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > Cc: Cypher Wu <cypher.w@gmail.com>
> > Cc: Américo Wang <xiyou.wangcong@gmail.com>
> > ---
> 
> ...
> 
> >  void ip_mc_up(struct in_device *in_dev)
> >  {
> > -	struct ip_mc_list *i;
> > +	struct ip_mc_list *pmc;
> >  
> >  	ASSERT_RTNL();
> >  
> >  	ip_mc_inc_group(in_dev, IGMP_ALL_HOSTS);
> >  
> > -	for (i=in_dev->mc_list; i; i=i->next)
> > -		igmp_group_added(i);
> > +	for_each_pmc_rtnl(in_dev, pmc);
> > +		igmp_group_added(pmc);
> >  }
> 
> 
> Oops there is an extra ; after the for_each_pmc_rtnl(in_dev, pmc)
> 
> should be
> 
> 	for_each_pmc_rtnl(in_dev, pmc)
> 		igmp_group_added(pmc);
> 
> 

I confirm that with above fix, the patch works for me.
(Tested with CONFIG_PROVE_RCU=y )

Here is an updated version.

[PATCH net-next-2.6 V2] igmp: RCU conversion of in_dev->mc_list

in_dev->mc_list is protected by one rwlock (in_dev->mc_list_lock).

This can easily be converted to a RCU protection.

Writers hold RTNL, so mc_list_lock is removed, not replaced by a
spinlock.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Cypher Wu <cypher.w@gmail.com>
Cc: Américo Wang <xiyou.wangcong@gmail.com>
---
 include/linux/igmp.h       |   12 +
 include/linux/inetdevice.h |    5 
 include/net/inet_sock.h    |    2 
 net/ipv4/igmp.c            |  223 ++++++++++++++++-------------------
 4 files changed, 115 insertions(+), 127 deletions(-)

diff --git a/include/linux/igmp.h b/include/linux/igmp.h
index 93fc244..7d16467 100644
--- a/include/linux/igmp.h
+++ b/include/linux/igmp.h
@@ -167,10 +167,10 @@ struct ip_sf_socklist {
  */
 
 struct ip_mc_socklist {
-	struct ip_mc_socklist	*next;
+	struct ip_mc_socklist __rcu *next_rcu;
 	struct ip_mreqn		multi;
 	unsigned int		sfmode;		/* MCAST_{INCLUDE,EXCLUDE} */
-	struct ip_sf_socklist	*sflist;
+	struct ip_sf_socklist __rcu	*sflist;
 	struct rcu_head		rcu;
 };
 
@@ -186,11 +186,14 @@ struct ip_sf_list {
 struct ip_mc_list {
 	struct in_device	*interface;
 	__be32			multiaddr;
+	unsigned int		sfmode;
 	struct ip_sf_list	*sources;
 	struct ip_sf_list	*tomb;
-	unsigned int		sfmode;
 	unsigned long		sfcount[2];
-	struct ip_mc_list	*next;
+	union {
+		struct ip_mc_list *next;
+		struct ip_mc_list __rcu *next_rcu;
+	};
 	struct timer_list	timer;
 	int			users;
 	atomic_t		refcnt;
@@ -201,6 +204,7 @@ struct ip_mc_list {
 	char			loaded;
 	unsigned char		gsquery;	/* check source marks? */
 	unsigned char		crcount;
+	struct rcu_head		rcu;
 };
 
 /* V3 exponential field decoding */
diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index ccd5b07..380ba6b 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -52,9 +52,8 @@ struct in_device {
 	atomic_t		refcnt;
 	int			dead;
 	struct in_ifaddr	*ifa_list;	/* IP ifaddr chain		*/
-	rwlock_t		mc_list_lock;
-	struct ip_mc_list	*mc_list;	/* IP multicast filter chain    */
-	int			mc_count;	          /* Number of installed mcasts	*/
+	struct ip_mc_list __rcu	*mc_list;	/* IP multicast filter chain    */
+	int			mc_count;	/* Number of installed mcasts	*/
 	spinlock_t		mc_tomb_lock;
 	struct ip_mc_list	*mc_tomb;
 	unsigned long		mr_v1_seen;
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 1989cfd..8945f9f 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -141,7 +141,7 @@ struct inet_sock {
 				nodefrag:1;
 	int			mc_index;
 	__be32			mc_addr;
-	struct ip_mc_socklist	*mc_list;
+	struct ip_mc_socklist __rcu	*mc_list;
 	struct {
 		unsigned int		flags;
 		unsigned int		fragsize;
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 08d0d81..6f49d6c 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -149,11 +149,17 @@ static void ip_mc_clear_src(struct ip_mc_list *pmc);
 static int ip_mc_add_src(struct in_device *in_dev, __be32 *pmca, int sfmode,
 			 int sfcount, __be32 *psfsrc, int delta);
 
+
+static void ip_mc_list_reclaim(struct rcu_head *head)
+{
+	kfree(container_of(head, struct ip_mc_list, rcu));
+}
+
 static void ip_ma_put(struct ip_mc_list *im)
 {
 	if (atomic_dec_and_test(&im->refcnt)) {
 		in_dev_put(im->interface);
-		kfree(im);
+		call_rcu(&im->rcu, ip_mc_list_reclaim);
 	}
 }
 
@@ -163,7 +169,7 @@ static void ip_ma_put(struct ip_mc_list *im)
  *	Timer management
  */
 
-static __inline__ void igmp_stop_timer(struct ip_mc_list *im)
+static void igmp_stop_timer(struct ip_mc_list *im)
 {
 	spin_lock_bh(&im->lock);
 	if (del_timer(&im->timer))
@@ -496,14 +502,24 @@ empty_source:
 	return skb;
 }
 
+#define for_each_pmc_rcu(in_dev, pmc)				\
+	for (pmc = rcu_dereference(in_dev->mc_list);		\
+	     pmc != NULL;					\
+	     pmc = rcu_dereference(pmc->next_rcu))
+
+#define for_each_pmc_rtnl(in_dev, pmc)				\
+	for (pmc = rtnl_dereference(in_dev->mc_list);		\
+	     pmc != NULL;					\
+	     pmc = rtnl_dereference(pmc->next_rcu))
+
 static int igmpv3_send_report(struct in_device *in_dev, struct ip_mc_list *pmc)
 {
 	struct sk_buff *skb = NULL;
 	int type;
 
 	if (!pmc) {
-		read_lock(&in_dev->mc_list_lock);
-		for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+		rcu_read_lock();
+		for_each_pmc_rcu(in_dev, pmc) {
 			if (pmc->multiaddr == IGMP_ALL_HOSTS)
 				continue;
 			spin_lock_bh(&pmc->lock);
@@ -514,7 +530,7 @@ static int igmpv3_send_report(struct in_device *in_dev, struct ip_mc_list *pmc)
 			skb = add_grec(skb, pmc, type, 0, 0);
 			spin_unlock_bh(&pmc->lock);
 		}
-		read_unlock(&in_dev->mc_list_lock);
+		rcu_read_unlock();
 	} else {
 		spin_lock_bh(&pmc->lock);
 		if (pmc->sfcount[MCAST_EXCLUDE])
@@ -556,7 +572,7 @@ static void igmpv3_send_cr(struct in_device *in_dev)
 	struct sk_buff *skb = NULL;
 	int type, dtype;
 
-	read_lock(&in_dev->mc_list_lock);
+	rcu_read_lock();
 	spin_lock_bh(&in_dev->mc_tomb_lock);
 
 	/* deleted MCA's */
@@ -593,7 +609,7 @@ static void igmpv3_send_cr(struct in_device *in_dev)
 	spin_unlock_bh(&in_dev->mc_tomb_lock);
 
 	/* change recs */
-	for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+	for_each_pmc_rcu(in_dev, pmc) {
 		spin_lock_bh(&pmc->lock);
 		if (pmc->sfcount[MCAST_EXCLUDE]) {
 			type = IGMPV3_BLOCK_OLD_SOURCES;
@@ -616,7 +632,7 @@ static void igmpv3_send_cr(struct in_device *in_dev)
 		}
 		spin_unlock_bh(&pmc->lock);
 	}
-	read_unlock(&in_dev->mc_list_lock);
+	rcu_read_unlock();
 
 	if (!skb)
 		return;
@@ -813,14 +829,14 @@ static void igmp_heard_report(struct in_device *in_dev, __be32 group)
 	if (group == IGMP_ALL_HOSTS)
 		return;
 
-	read_lock(&in_dev->mc_list_lock);
-	for (im=in_dev->mc_list; im!=NULL; im=im->next) {
+	rcu_read_lock();
+	for_each_pmc_rcu(in_dev, im) {
 		if (im->multiaddr == group) {
 			igmp_stop_timer(im);
 			break;
 		}
 	}
-	read_unlock(&in_dev->mc_list_lock);
+	rcu_read_unlock();
 }
 
 static void igmp_heard_query(struct in_device *in_dev, struct sk_buff *skb,
@@ -906,8 +922,8 @@ static void igmp_heard_query(struct in_device *in_dev, struct sk_buff *skb,
 	 * - Use the igmp->igmp_code field as the maximum
 	 *   delay possible
 	 */
-	read_lock(&in_dev->mc_list_lock);
-	for (im=in_dev->mc_list; im!=NULL; im=im->next) {
+	rcu_read_lock();
+	for_each_pmc_rcu(in_dev, im) {
 		int changed;
 
 		if (group && group != im->multiaddr)
@@ -925,7 +941,7 @@ static void igmp_heard_query(struct in_device *in_dev, struct sk_buff *skb,
 		if (changed)
 			igmp_mod_timer(im, max_delay);
 	}
-	read_unlock(&in_dev->mc_list_lock);
+	rcu_read_unlock();
 }
 
 /* called in rcu_read_lock() section */
@@ -1110,8 +1126,8 @@ static void igmpv3_clear_delrec(struct in_device *in_dev)
 		kfree(pmc);
 	}
 	/* clear dead sources, too */
-	read_lock(&in_dev->mc_list_lock);
-	for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+	rcu_read_lock();
+	for_each_pmc_rcu(in_dev, pmc) {
 		struct ip_sf_list *psf, *psf_next;
 
 		spin_lock_bh(&pmc->lock);
@@ -1123,7 +1139,7 @@ static void igmpv3_clear_delrec(struct in_device *in_dev)
 			kfree(psf);
 		}
 	}
-	read_unlock(&in_dev->mc_list_lock);
+	rcu_read_unlock();
 }
 #endif
 
@@ -1209,7 +1225,7 @@ void ip_mc_inc_group(struct in_device *in_dev, __be32 addr)
 
 	ASSERT_RTNL();
 
-	for (im=in_dev->mc_list; im; im=im->next) {
+	for_each_pmc_rtnl(in_dev, im) {
 		if (im->multiaddr == addr) {
 			im->users++;
 			ip_mc_add_src(in_dev, &addr, MCAST_EXCLUDE, 0, NULL, 0);
@@ -1217,7 +1233,7 @@ void ip_mc_inc_group(struct in_device *in_dev, __be32 addr)
 		}
 	}
 
-	im = kmalloc(sizeof(*im), GFP_KERNEL);
+	im = kzalloc(sizeof(*im), GFP_KERNEL);
 	if (!im)
 		goto out;
 
@@ -1227,26 +1243,18 @@ void ip_mc_inc_group(struct in_device *in_dev, __be32 addr)
 	im->multiaddr = addr;
 	/* initial mode is (EX, empty) */
 	im->sfmode = MCAST_EXCLUDE;
-	im->sfcount[MCAST_INCLUDE] = 0;
 	im->sfcount[MCAST_EXCLUDE] = 1;
-	im->sources = NULL;
-	im->tomb = NULL;
-	im->crcount = 0;
 	atomic_set(&im->refcnt, 1);
 	spin_lock_init(&im->lock);
 #ifdef CONFIG_IP_MULTICAST
-	im->tm_running = 0;
 	setup_timer(&im->timer, &igmp_timer_expire, (unsigned long)im);
 	im->unsolicit_count = IGMP_Unsolicited_Report_Count;
-	im->reporter = 0;
-	im->gsquery = 0;
 #endif
-	im->loaded = 0;
-	write_lock_bh(&in_dev->mc_list_lock);
-	im->next = in_dev->mc_list;
-	in_dev->mc_list = im;
+
+	im->next_rcu = in_dev->mc_list;
 	in_dev->mc_count++;
-	write_unlock_bh(&in_dev->mc_list_lock);
+	rcu_assign_pointer(in_dev->mc_list, im);
+
 #ifdef CONFIG_IP_MULTICAST
 	igmpv3_del_delrec(in_dev, im->multiaddr);
 #endif
@@ -1287,17 +1295,18 @@ EXPORT_SYMBOL(ip_mc_rejoin_group);
 
 void ip_mc_dec_group(struct in_device *in_dev, __be32 addr)
 {
-	struct ip_mc_list *i, **ip;
+	struct ip_mc_list *i;
+	struct ip_mc_list __rcu **ip;
 
 	ASSERT_RTNL();
 
-	for (ip=&in_dev->mc_list; (i=*ip)!=NULL; ip=&i->next) {
+	for (ip = &in_dev->mc_list;
+	     (i = rtnl_dereference(*ip)) != NULL;
+	     ip = &i->next_rcu) {
 		if (i->multiaddr == addr) {
 			if (--i->users == 0) {
-				write_lock_bh(&in_dev->mc_list_lock);
-				*ip = i->next;
+				*ip = i->next_rcu;
 				in_dev->mc_count--;
-				write_unlock_bh(&in_dev->mc_list_lock);
 				igmp_group_dropped(i);
 
 				if (!in_dev->dead)
@@ -1316,34 +1325,34 @@ EXPORT_SYMBOL(ip_mc_dec_group);
 
 void ip_mc_unmap(struct in_device *in_dev)
 {
-	struct ip_mc_list *i;
+	struct ip_mc_list *pmc;
 
 	ASSERT_RTNL();
 
-	for (i = in_dev->mc_list; i; i = i->next)
-		igmp_group_dropped(i);
+	for_each_pmc_rtnl(in_dev, pmc)
+		igmp_group_dropped(pmc);
 }
 
 void ip_mc_remap(struct in_device *in_dev)
 {
-	struct ip_mc_list *i;
+	struct ip_mc_list *pmc;
 
 	ASSERT_RTNL();
 
-	for (i = in_dev->mc_list; i; i = i->next)
-		igmp_group_added(i);
+	for_each_pmc_rtnl(in_dev, pmc)
+		igmp_group_added(pmc);
 }
 
 /* Device going down */
 
 void ip_mc_down(struct in_device *in_dev)
 {
-	struct ip_mc_list *i;
+	struct ip_mc_list *pmc;
 
 	ASSERT_RTNL();
 
-	for (i=in_dev->mc_list; i; i=i->next)
-		igmp_group_dropped(i);
+	for_each_pmc_rtnl(in_dev, pmc)
+		igmp_group_dropped(pmc);
 
 #ifdef CONFIG_IP_MULTICAST
 	in_dev->mr_ifc_count = 0;
@@ -1374,7 +1383,6 @@ void ip_mc_init_dev(struct in_device *in_dev)
 	in_dev->mr_qrv = IGMP_Unsolicited_Report_Count;
 #endif
 
-	rwlock_init(&in_dev->mc_list_lock);
 	spin_lock_init(&in_dev->mc_tomb_lock);
 }
 
@@ -1382,14 +1390,14 @@ void ip_mc_init_dev(struct in_device *in_dev)
 
 void ip_mc_up(struct in_device *in_dev)
 {
-	struct ip_mc_list *i;
+	struct ip_mc_list *pmc;
 
 	ASSERT_RTNL();
 
 	ip_mc_inc_group(in_dev, IGMP_ALL_HOSTS);
 
-	for (i=in_dev->mc_list; i; i=i->next)
-		igmp_group_added(i);
+	for_each_pmc_rtnl(in_dev, pmc)
+		igmp_group_added(pmc);
 }
 
 /*
@@ -1405,17 +1413,13 @@ void ip_mc_destroy_dev(struct in_device *in_dev)
 	/* Deactivate timers */
 	ip_mc_down(in_dev);
 
-	write_lock_bh(&in_dev->mc_list_lock);
-	while ((i = in_dev->mc_list) != NULL) {
-		in_dev->mc_list = i->next;
+	while ((i = rtnl_dereference(in_dev->mc_list)) != NULL) {
+		in_dev->mc_list = i->next_rcu;
 		in_dev->mc_count--;
-		write_unlock_bh(&in_dev->mc_list_lock);
+
 		igmp_group_dropped(i);
 		ip_ma_put(i);
-
-		write_lock_bh(&in_dev->mc_list_lock);
 	}
-	write_unlock_bh(&in_dev->mc_list_lock);
 }
 
 /* RTNL is locked */
@@ -1513,18 +1517,18 @@ static int ip_mc_del_src(struct in_device *in_dev, __be32 *pmca, int sfmode,
 
 	if (!in_dev)
 		return -ENODEV;
-	read_lock(&in_dev->mc_list_lock);
-	for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+	rcu_read_lock();
+	for_each_pmc_rcu(in_dev, pmc) {
 		if (*pmca == pmc->multiaddr)
 			break;
 	}
 	if (!pmc) {
 		/* MCA not found?? bug */
-		read_unlock(&in_dev->mc_list_lock);
+		rcu_read_unlock();
 		return -ESRCH;
 	}
 	spin_lock_bh(&pmc->lock);
-	read_unlock(&in_dev->mc_list_lock);
+	rcu_read_unlock();
 #ifdef CONFIG_IP_MULTICAST
 	sf_markstate(pmc);
 #endif
@@ -1685,18 +1689,18 @@ static int ip_mc_add_src(struct in_device *in_dev, __be32 *pmca, int sfmode,
 
 	if (!in_dev)
 		return -ENODEV;
-	read_lock(&in_dev->mc_list_lock);
-	for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+	rcu_read_lock();
+	for_each_pmc_rcu(in_dev, pmc) {
 		if (*pmca == pmc->multiaddr)
 			break;
 	}
 	if (!pmc) {
 		/* MCA not found?? bug */
-		read_unlock(&in_dev->mc_list_lock);
+		rcu_read_unlock();
 		return -ESRCH;
 	}
 	spin_lock_bh(&pmc->lock);
-	read_unlock(&in_dev->mc_list_lock);
+	rcu_read_unlock();
 
 #ifdef CONFIG_IP_MULTICAST
 	sf_markstate(pmc);
@@ -1793,7 +1797,7 @@ int ip_mc_join_group(struct sock *sk , struct ip_mreqn *imr)
 
 	err = -EADDRINUSE;
 	ifindex = imr->imr_ifindex;
-	for (i = inet->mc_list; i; i = i->next) {
+	for_each_pmc_rtnl(inet, i) {
 		if (i->multi.imr_multiaddr.s_addr == addr &&
 		    i->multi.imr_ifindex == ifindex)
 			goto done;
@@ -1807,7 +1811,7 @@ int ip_mc_join_group(struct sock *sk , struct ip_mreqn *imr)
 		goto done;
 
 	memcpy(&iml->multi, imr, sizeof(*imr));
-	iml->next = inet->mc_list;
+	iml->next_rcu = inet->mc_list;
 	iml->sflist = NULL;
 	iml->sfmode = MCAST_EXCLUDE;
 	rcu_assign_pointer(inet->mc_list, iml);
@@ -1821,17 +1825,14 @@ EXPORT_SYMBOL(ip_mc_join_group);
 
 static void ip_sf_socklist_reclaim(struct rcu_head *rp)
 {
-	struct ip_sf_socklist *psf;
-
-	psf = container_of(rp, struct ip_sf_socklist, rcu);
+	kfree(container_of(rp, struct ip_sf_socklist, rcu));
 	/* sk_omem_alloc should have been decreased by the caller*/
-	kfree(psf);
 }
 
 static int ip_mc_leave_src(struct sock *sk, struct ip_mc_socklist *iml,
 			   struct in_device *in_dev)
 {
-	struct ip_sf_socklist *psf = iml->sflist;
+	struct ip_sf_socklist *psf = rtnl_dereference(iml->sflist);
 	int err;
 
 	if (psf == NULL) {
@@ -1851,11 +1852,8 @@ static int ip_mc_leave_src(struct sock *sk, struct ip_mc_socklist *iml,
 
 static void ip_mc_socklist_reclaim(struct rcu_head *rp)
 {
-	struct ip_mc_socklist *iml;
-
-	iml = container_of(rp, struct ip_mc_socklist, rcu);
+	kfree(container_of(rp, struct ip_mc_socklist, rcu));
 	/* sk_omem_alloc should have been decreased by the caller*/
-	kfree(iml);
 }
 
 
@@ -1866,7 +1864,8 @@ static void ip_mc_socklist_reclaim(struct rcu_head *rp)
 int ip_mc_leave_group(struct sock *sk, struct ip_mreqn *imr)
 {
 	struct inet_sock *inet = inet_sk(sk);
-	struct ip_mc_socklist *iml, **imlp;
+	struct ip_mc_socklist *iml;
+	struct ip_mc_socklist __rcu **imlp;
 	struct in_device *in_dev;
 	struct net *net = sock_net(sk);
 	__be32 group = imr->imr_multiaddr.s_addr;
@@ -1876,7 +1875,9 @@ int ip_mc_leave_group(struct sock *sk, struct ip_mreqn *imr)
 	rtnl_lock();
 	in_dev = ip_mc_find_dev(net, imr);
 	ifindex = imr->imr_ifindex;
-	for (imlp = &inet->mc_list; (iml = *imlp) != NULL; imlp = &iml->next) {
+	for (imlp = &inet->mc_list;
+	     (iml = rtnl_dereference(*imlp)) != NULL;
+	     imlp = &iml->next_rcu) {
 		if (iml->multi.imr_multiaddr.s_addr != group)
 			continue;
 		if (ifindex) {
@@ -1888,7 +1889,7 @@ int ip_mc_leave_group(struct sock *sk, struct ip_mreqn *imr)
 
 		(void) ip_mc_leave_src(sk, iml, in_dev);
 
-		rcu_assign_pointer(*imlp, iml->next);
+		*imlp = iml->next_rcu;
 
 		if (in_dev)
 			ip_mc_dec_group(in_dev, group);
@@ -1934,7 +1935,7 @@ int ip_mc_source(int add, int omode, struct sock *sk, struct
 	}
 	err = -EADDRNOTAVAIL;
 
-	for (pmc=inet->mc_list; pmc; pmc=pmc->next) {
+	for_each_pmc_rtnl(inet, pmc) {
 		if ((pmc->multi.imr_multiaddr.s_addr ==
 		     imr.imr_multiaddr.s_addr) &&
 		    (pmc->multi.imr_ifindex == imr.imr_ifindex))
@@ -1958,7 +1959,7 @@ int ip_mc_source(int add, int omode, struct sock *sk, struct
 		pmc->sfmode = omode;
 	}
 
-	psl = pmc->sflist;
+	psl = rtnl_dereference(pmc->sflist);
 	if (!add) {
 		if (!psl)
 			goto done;	/* err = -EADDRNOTAVAIL */
@@ -2077,7 +2078,7 @@ int ip_mc_msfilter(struct sock *sk, struct ip_msfilter *msf, int ifindex)
 		goto done;
 	}
 
-	for (pmc=inet->mc_list; pmc; pmc=pmc->next) {
+	for_each_pmc_rtnl(inet, pmc) {
 		if (pmc->multi.imr_multiaddr.s_addr == msf->imsf_multiaddr &&
 		    pmc->multi.imr_ifindex == imr.imr_ifindex)
 			break;
@@ -2107,7 +2108,7 @@ int ip_mc_msfilter(struct sock *sk, struct ip_msfilter *msf, int ifindex)
 		(void) ip_mc_add_src(in_dev, &msf->imsf_multiaddr,
 				     msf->imsf_fmode, 0, NULL, 0);
 	}
-	psl = pmc->sflist;
+	psl = rtnl_dereference(pmc->sflist);
 	if (psl) {
 		(void) ip_mc_del_src(in_dev, &msf->imsf_multiaddr, pmc->sfmode,
 			psl->sl_count, psl->sl_addr, 0);
@@ -2155,7 +2156,7 @@ int ip_mc_msfget(struct sock *sk, struct ip_msfilter *msf,
 	}
 	err = -EADDRNOTAVAIL;
 
-	for (pmc=inet->mc_list; pmc; pmc=pmc->next) {
+	for_each_pmc_rtnl(inet, pmc) {
 		if (pmc->multi.imr_multiaddr.s_addr == msf->imsf_multiaddr &&
 		    pmc->multi.imr_ifindex == imr.imr_ifindex)
 			break;
@@ -2163,7 +2164,7 @@ int ip_mc_msfget(struct sock *sk, struct ip_msfilter *msf,
 	if (!pmc)		/* must have a prior join */
 		goto done;
 	msf->imsf_fmode = pmc->sfmode;
-	psl = pmc->sflist;
+	psl = rtnl_dereference(pmc->sflist);
 	rtnl_unlock();
 	if (!psl) {
 		len = 0;
@@ -2208,7 +2209,7 @@ int ip_mc_gsfget(struct sock *sk, struct group_filter *gsf,
 
 	err = -EADDRNOTAVAIL;
 
-	for (pmc=inet->mc_list; pmc; pmc=pmc->next) {
+	for_each_pmc_rtnl(inet, pmc) {
 		if (pmc->multi.imr_multiaddr.s_addr == addr &&
 		    pmc->multi.imr_ifindex == gsf->gf_interface)
 			break;
@@ -2216,7 +2217,7 @@ int ip_mc_gsfget(struct sock *sk, struct group_filter *gsf,
 	if (!pmc)		/* must have a prior join */
 		goto done;
 	gsf->gf_fmode = pmc->sfmode;
-	psl = pmc->sflist;
+	psl = rtnl_dereference(pmc->sflist);
 	rtnl_unlock();
 	count = psl ? psl->sl_count : 0;
 	copycount = count < gsf->gf_numsrc ? count : gsf->gf_numsrc;
@@ -2257,7 +2258,7 @@ int ip_mc_sf_allow(struct sock *sk, __be32 loc_addr, __be32 rmt_addr, int dif)
 		goto out;
 
 	rcu_read_lock();
-	for (pmc=rcu_dereference(inet->mc_list); pmc; pmc=rcu_dereference(pmc->next)) {
+	for_each_pmc_rcu(inet, pmc) {
 		if (pmc->multi.imr_multiaddr.s_addr == loc_addr &&
 		    pmc->multi.imr_ifindex == dif)
 			break;
@@ -2265,7 +2266,7 @@ int ip_mc_sf_allow(struct sock *sk, __be32 loc_addr, __be32 rmt_addr, int dif)
 	ret = inet->mc_all;
 	if (!pmc)
 		goto unlock;
-	psl = pmc->sflist;
+	psl = rcu_dereference(pmc->sflist);
 	ret = (pmc->sfmode == MCAST_EXCLUDE);
 	if (!psl)
 		goto unlock;
@@ -2300,10 +2301,10 @@ void ip_mc_drop_socket(struct sock *sk)
 		return;
 
 	rtnl_lock();
-	while ((iml = inet->mc_list) != NULL) {
+	while ((iml = rtnl_dereference(inet->mc_list)) != NULL) {
 		struct in_device *in_dev;
-		rcu_assign_pointer(inet->mc_list, iml->next);
 
+		inet->mc_list = iml->next_rcu;
 		in_dev = inetdev_by_index(net, iml->multi.imr_ifindex);
 		(void) ip_mc_leave_src(sk, iml, in_dev);
 		if (in_dev != NULL) {
@@ -2323,8 +2324,8 @@ int ip_check_mc(struct in_device *in_dev, __be32 mc_addr, __be32 src_addr, u16 p
 	struct ip_sf_list *psf;
 	int rv = 0;
 
-	read_lock(&in_dev->mc_list_lock);
-	for (im=in_dev->mc_list; im; im=im->next) {
+	rcu_read_lock();
+	for_each_pmc_rcu(in_dev, im) {
 		if (im->multiaddr == mc_addr)
 			break;
 	}
@@ -2345,7 +2346,7 @@ int ip_check_mc(struct in_device *in_dev, __be32 mc_addr, __be32 src_addr, u16 p
 		} else
 			rv = 1; /* unspecified source; tentatively allow */
 	}
-	read_unlock(&in_dev->mc_list_lock);
+	rcu_read_unlock();
 	return rv;
 }
 
@@ -2371,13 +2372,11 @@ static inline struct ip_mc_list *igmp_mc_get_first(struct seq_file *seq)
 		in_dev = __in_dev_get_rcu(state->dev);
 		if (!in_dev)
 			continue;
-		read_lock(&in_dev->mc_list_lock);
-		im = in_dev->mc_list;
+		im = rcu_dereference(in_dev->mc_list);
 		if (im) {
 			state->in_dev = in_dev;
 			break;
 		}
-		read_unlock(&in_dev->mc_list_lock);
 	}
 	return im;
 }
@@ -2385,11 +2384,9 @@ static inline struct ip_mc_list *igmp_mc_get_first(struct seq_file *seq)
 static struct ip_mc_list *igmp_mc_get_next(struct seq_file *seq, struct ip_mc_list *im)
 {
 	struct igmp_mc_iter_state *state = igmp_mc_seq_private(seq);
-	im = im->next;
-	while (!im) {
-		if (likely(state->in_dev != NULL))
-			read_unlock(&state->in_dev->mc_list_lock);
 
+	im = rcu_dereference(im->next_rcu);
+	while (!im) {
 		state->dev = next_net_device_rcu(state->dev);
 		if (!state->dev) {
 			state->in_dev = NULL;
@@ -2398,8 +2395,7 @@ static struct ip_mc_list *igmp_mc_get_next(struct seq_file *seq, struct ip_mc_li
 		state->in_dev = __in_dev_get_rcu(state->dev);
 		if (!state->in_dev)
 			continue;
-		read_lock(&state->in_dev->mc_list_lock);
-		im = state->in_dev->mc_list;
+		im = rcu_dereference(state->in_dev->mc_list);
 	}
 	return im;
 }
@@ -2435,10 +2431,8 @@ static void igmp_mc_seq_stop(struct seq_file *seq, void *v)
 	__releases(rcu)
 {
 	struct igmp_mc_iter_state *state = igmp_mc_seq_private(seq);
-	if (likely(state->in_dev != NULL)) {
-		read_unlock(&state->in_dev->mc_list_lock);
-		state->in_dev = NULL;
-	}
+
+	state->in_dev = NULL;
 	state->dev = NULL;
 	rcu_read_unlock();
 }
@@ -2460,7 +2454,7 @@ static int igmp_mc_seq_show(struct seq_file *seq, void *v)
 		querier = "NONE";
 #endif
 
-		if (state->in_dev->mc_list == im) {
+		if (rcu_dereference(state->in_dev->mc_list) == im) {
 			seq_printf(seq, "%d\t%-10s: %5d %7s\n",
 				   state->dev->ifindex, state->dev->name, state->in_dev->mc_count, querier);
 		}
@@ -2519,8 +2513,7 @@ static inline struct ip_sf_list *igmp_mcf_get_first(struct seq_file *seq)
 		idev = __in_dev_get_rcu(state->dev);
 		if (unlikely(idev == NULL))
 			continue;
-		read_lock(&idev->mc_list_lock);
-		im = idev->mc_list;
+		im = rcu_dereference(idev->mc_list);
 		if (likely(im != NULL)) {
 			spin_lock_bh(&im->lock);
 			psf = im->sources;
@@ -2531,7 +2524,6 @@ static inline struct ip_sf_list *igmp_mcf_get_first(struct seq_file *seq)
 			}
 			spin_unlock_bh(&im->lock);
 		}
-		read_unlock(&idev->mc_list_lock);
 	}
 	return psf;
 }
@@ -2545,9 +2537,6 @@ static struct ip_sf_list *igmp_mcf_get_next(struct seq_file *seq, struct ip_sf_l
 		spin_unlock_bh(&state->im->lock);
 		state->im = state->im->next;
 		while (!state->im) {
-			if (likely(state->idev != NULL))
-				read_unlock(&state->idev->mc_list_lock);
-
 			state->dev = next_net_device_rcu(state->dev);
 			if (!state->dev) {
 				state->idev = NULL;
@@ -2556,8 +2545,7 @@ static struct ip_sf_list *igmp_mcf_get_next(struct seq_file *seq, struct ip_sf_l
 			state->idev = __in_dev_get_rcu(state->dev);
 			if (!state->idev)
 				continue;
-			read_lock(&state->idev->mc_list_lock);
-			state->im = state->idev->mc_list;
+			state->im = rcu_dereference(state->idev->mc_list);
 		}
 		if (!state->im)
 			break;
@@ -2603,10 +2591,7 @@ static void igmp_mcf_seq_stop(struct seq_file *seq, void *v)
 		spin_unlock_bh(&state->im->lock);
 		state->im = NULL;
 	}
-	if (likely(state->idev != NULL)) {
-		read_unlock(&state->idev->mc_list_lock);
-		state->idev = NULL;
-	}
+	state->idev = NULL;
 	state->dev = NULL;
 	rcu_read_unlock();
 }

^ permalink raw reply related

* Re: [PATCH net-next-2.6] vlan: remove ndo_select_queue() logic
From: Patrick McHardy @ 2010-11-12 15:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jesse Gross, David Miller, netdev
In-Reply-To: <1289504565.17691.1710.camel@edumazet-laptop>

Am 11.11.2010 20:42, schrieb Eric Dumazet:
> [PATCH] vlan: remove ndo_select_queue() logic
> 
> Now vlan are lockless, we dont need special ndo_select_queue() logic.
> dev_pick_tx() will do the multiqueue stuff on the real device transmit.
> 
> Suggested-by: Jesse Gross <jesse@nicira.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Both patches look good to me.

Acked-by: Patrick McHardy <kaber@trash.net>

^ permalink raw reply

* RE: [RFC PATCH] network: return errors if we know tcp_connect failed
From: Eric Paris @ 2010-11-12 16:08 UTC (permalink / raw)
  To: Hua Zhong
  Cc: netdev, linux-kernel, davem, kuznet, pekkas, jmorris, yoshfuji,
	kaber, eric.dumazet, paul.moore
In-Reply-To: <00c201cb81eb$84e18160$8ea48420$@com>

On Thu, 2010-11-11 at 13:58 -0800, Hua Zhong wrote:
> > Yes, I realize this is little different than if the
> > SYN was dropped in the first network device, but it is different
> > because we know what happened!  We know that connect() call failed
> > and that there isn't anything coming back.
> 
> I would argue that -j DROP should behave exactly as the packet is dropped in the network, while -j REJECT should signal the failure to the application as soon as possible (which it doesn't seem to do).
> 
> It does not only make sense, but also is a highly useful testing technique that we use -j DROP in OUTPUT to emulate network losses and see how the application behaves.

I guess I can be a bit more descriptive of my specific situation,
although I'm not sure it matters.  I don't actually plan to drop packets
with -j REJECT or -j DROP, that's just a simple example everyone can see
on their own machine.  I plan to have the packets drop in the selinux
netfilter hook.  The SELinux hook uses NF_DROP/NF_ACCEPT just like any
other netfilter hook.  Maybe the answer is that I need to duplicate the
-j REJECT type operations in the SELinux hook.  -j REJECT doesn't do
what I want today, but if that's the right way forward tell me and I'll
look down that path.

But the path I first started looking down rules in 2 distinct questions:

1) What should netfilter pass back up the stack.  From my looking at
this I see that nf_hook_slow() will convert NF_DROP into -EPERM and pass
that back up the stack.  Is this wrong?  Should it more intelligently
pass errors back up the stack?  Maybe it needs an NF_REJECT as well as
NF_DROP?  NF_DROP returns 0 maybe and NF_REJECT return EPERM?

2) What should the generic TCP code (tcp_connect()) do if the skb failed
to send.  Should it return error codes back up the stack somehow or
should they continue to be ignored?  Obviously continuing to just ignore
information we have doesn't make me happy (otherwise I wouldn't have
started scratching this itch).  But the point about ENOBUFS is well
taken.  Maybe I should make tcp_connect(), or the caller to
tcp_connect() more intelligent about specific error codes?

I'm looking for a path forward.  If SELinux is rejecting the SYN packets
on connect() I want to pass that info to userspace rather than just
hanging.  What's the best way to accomplish that?

-Eric

^ permalink raw reply

* RE: [RFC PATCH] network: return errors if we know tcp_connect failed
From: Eric Dumazet @ 2010-11-12 16:15 UTC (permalink / raw)
  To: Eric Paris
  Cc: Hua Zhong, netdev, linux-kernel, davem, kuznet, pekkas, jmorris,
	yoshfuji, kaber, paul.moore
In-Reply-To: <1289578108.3083.95.camel@localhost.localdomain>

Le vendredi 12 novembre 2010 à 11:08 -0500, Eric Paris a écrit :

> 2) What should the generic TCP code (tcp_connect()) do if the skb failed
> to send.  Should it return error codes back up the stack somehow or
> should they continue to be ignored?  Obviously continuing to just ignore
> information we have doesn't make me happy (otherwise I wouldn't have
> started scratching this itch).  But the point about ENOBUFS is well
> taken.  Maybe I should make tcp_connect(), or the caller to
> tcp_connect() more intelligent about specific error codes?
> 
> I'm looking for a path forward.  If SELinux is rejecting the SYN packets
> on connect() I want to pass that info to userspace rather than just
> hanging.  What's the best way to accomplish that?
> 

Eric, if you can differentiate a permanent reject, instead of a
temporary one (congestion, or rate limiting, or ENOBUF, or ...), then
yes, you could make tcp_connect() report to user the permanent error,
and ignore the temporary one.

^ permalink raw reply

* Re: [PATCH 4/10] Fix leaking of kernel heap addresses in net/
From: Stephen Hemminger @ 2010-11-12 16:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Dan Rosenberg, David Miller, socketcan, kuznet, urs.thuermann,
	yoshfuji, kaber, jmorris, remi.denis-courmont, pekkas, sri,
	vladislav.yasevich, tj, lizf, joe, hadi, ebiederm, adobriyan,
	jpirko, johannes.berg, daniel.lezcano, xemul, socketcan-core,
	netdev, linux-sctp, torvalds
In-Reply-To: <1289546610.17691.1770.camel@edumazet-laptop>

On Fri, 12 Nov 2010 08:23:30 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le jeudi 11 novembre 2010 à 21:34 -0500, Dan Rosenberg a écrit :
> > > I want whatever you replace it with to be equivalent for
> > > object tracking purposes.
> > 
> > In nearly all of the cases I fixed, the socket inode is already
> > provided, which serves as a perfectly good unique identifier.  Would you
> > prefer I include that information twice?
> 
> Oh well. Please read this answer carefuly.
> 
> Some facts to feed your next patch. I am very pleased you changed your
> mind and that we keep useful information in kernel log.
> 
> 1) Inode numbers are not guaranteed to be unique. Its a 32bit seq
> number, and we dont check another socket inode use the same inode number
> (after 2^32 allocations it can happens)
> 
> 2) /proc/net/ files can deliver same "line" of information several
> times, because of their implementation.
> 
> 3) Because of SLAB_DESTROY_BY_RCU, same 'kernel socket pointer' can be
> seen several times in /proc/net/tcp & /proc/net/udp, but really on
> different "sockets"
> 
> 4) Some good applications use both the socket pointer and inode number
> (tuple) to filter out the [2] problem. Dont break them, please ?
> Anything that might break an application must be at the very least
> tunable.
> 
> In my opinion, a good thing would be :
> 
> - Use a special printf() format , aka "secure pointer", as Thomas
> suggested.
> 
> - Make sure you print different opaque values for two different kernel
> pointers. This is mandatory.
> 
> - Make sure the NULL pointer stay as a NULL pointer to not let the
> hostile user know your secret, and to ease debugging stuff.
> 
> - Have security experts advice to chose a nice crypto function, maybe
> jenkin hash. Not too slow would be nice.
> 
> 
> static unsigned long securize_kpointers_rnd;
> 
> At boot time, stick a random value in this variable.
> (Maybe make sure the 5 low order bits are 0)
> 
> unsigned long opacify_kptr(unsigned long ptr)
> {
> 	if (ptr == 0)
> 		return ptr;
> 	if (capable(CAP_NET_ADMIN))
> 		return ptr;
> 
> 	return some_crypto_hash(ptr, &securize_kpointers_rnd);
> }
> 
> At least, use a central point, so that we can easily add/change the
> logic if needed.
> 
> Please provide this patch in kernel/printk.c for initial review, then
> once everybody is OK, you can send one patch for net tree.
> 
> No need to send 10 patches if we dont agree on the general principle.

Also, the whole idea needs to be under a config option, so only
the paranoid idiots turn it on.


-- 

^ permalink raw reply

* Re: request_threaded_irq()
From: Stephen Hemminger @ 2010-11-12 16:35 UTC (permalink / raw)
  To: Mark Ryden; +Cc: netdev
In-Reply-To: <AANLkTikCQRiW3oWYSXK7yTxLWd3us+8wdwfuOs1k71VD@mail.gmail.com>

On Fri, 12 Nov 2010 14:27:11 +0200
Mark Ryden <markryde@gmail.com> wrote:

> Hello netdev,
> 
> grepping under net-next-2.6/drivers/net for request_threaded_irq() ,
> shows that it appears only in 3 drivers:
> 
> can/mcp251x.c
> wireless/b43/main.c
> wireless/rt2x00/rt2x00pci.c
> 
> I was wondering: when thinking about performance, is it worthwhile to use this
> API instead of ordinary request_irq() . It seems to me that
> request_threaded_irq()  might
> be better in some cases than NAPI polling forr network drivers (or at
> list it might be so in some systems, maybe multicore ?)

Threaded irq is not needed for properly written NAPI driver.
A NAPI driver should do minimum work in real irq and the whole NAPI processing
will happen in soft-irq.

There has been discussion of moving NAPI softirq into a high
priority thread. But last time I tried it, the performance was noticably
worse on network intensive benchmarks.


^ permalink raw reply

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
From: David Lamparter @ 2010-11-12 16:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Paris, Hua Zhong, netdev, linux-kernel, davem, kuznet,
	pekkas, jmorris, yoshfuji, kaber, paul.moore
In-Reply-To: <1289578532.3185.265.camel@edumazet-laptop>

On Fri, Nov 12, 2010 at 05:15:32PM +0100, Eric Dumazet wrote:
> Le vendredi 12 novembre 2010 à 11:08 -0500, Eric Paris a écrit :
> 
> > 2) What should the generic TCP code (tcp_connect()) do if the skb failed
> > to send.  Should it return error codes back up the stack somehow or
> > should they continue to be ignored?  Obviously continuing to just ignore
> > information we have doesn't make me happy (otherwise I wouldn't have
> > started scratching this itch).  But the point about ENOBUFS is well
> > taken.  Maybe I should make tcp_connect(), or the caller to
> > tcp_connect() more intelligent about specific error codes?
> > 
> > I'm looking for a path forward.  If SELinux is rejecting the SYN packets
> > on connect() I want to pass that info to userspace rather than just
> > hanging.  What's the best way to accomplish that?
> > 
> 
> Eric, if you can differentiate a permanent reject, instead of a
> temporary one (congestion, or rate limiting, or ENOBUF, or ...), then
> yes, you could make tcp_connect() report to user the permanent error,
> and ignore the temporary one.

If the netfilter targets DROP/REJECT match the NF_DROP/NF_REJECT
counterparts, which i guess they do but i didn't read the source ;),
then SELinux should use NF_REJECT in my opinion.

NF_DROP does exactly what the name says, it drops the packet aka
basically puts it in /dev/null. As with writing to /dev/null, you don't
get an error for that. Even more, if in the meantime the DROP rule does
not match anymore, the 2nd or 3rd SYN from the connect() can come
through and establish a connection (think of "-m statistic" & co.)

This is very different from REJECT.

If REJECT doesn't immediately get reported to the application, that *is*
a bug, but last time i checked i got EPERM immediately. I would fix
SELinux to use the same mechanism.

-David

^ permalink raw reply

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
From: Eric Paris @ 2010-11-12 16:53 UTC (permalink / raw)
  To: David Lamparter
  Cc: Eric Dumazet, Hua Zhong, netdev, linux-kernel, davem, kuznet,
	pekkas, jmorris, yoshfuji, kaber, paul.moore
In-Reply-To: <20101112163543.GB122902@jupiter.n2.diac24.net>

On Fri, 2010-11-12 at 17:35 +0100, David Lamparter wrote:
> On Fri, Nov 12, 2010 at 05:15:32PM +0100, Eric Dumazet wrote:
> > Le vendredi 12 novembre 2010 à 11:08 -0500, Eric Paris a écrit :
> > 
> > > 2) What should the generic TCP code (tcp_connect()) do if the skb failed
> > > to send.  Should it return error codes back up the stack somehow or
> > > should they continue to be ignored?  Obviously continuing to just ignore
> > > information we have doesn't make me happy (otherwise I wouldn't have
> > > started scratching this itch).  But the point about ENOBUFS is well
> > > taken.  Maybe I should make tcp_connect(), or the caller to
> > > tcp_connect() more intelligent about specific error codes?
> > > 
> > > I'm looking for a path forward.  If SELinux is rejecting the SYN packets
> > > on connect() I want to pass that info to userspace rather than just
> > > hanging.  What's the best way to accomplish that?
> > > 
> > 
> > Eric, if you can differentiate a permanent reject, instead of a
> > temporary one (congestion, or rate limiting, or ENOBUF, or ...), then
> > yes, you could make tcp_connect() report to user the permanent error,
> > and ignore the temporary one.
> 
> If the netfilter targets DROP/REJECT match the NF_DROP/NF_REJECT
> counterparts, which i guess they do but i didn't read the source ;),
> then SELinux should use NF_REJECT in my opinion.

As it stands today there is no NF_REJECT.  NF_DROP is the only (related)
permitted return value from a netfilter hook.  Maybe I need to change
that fact though.

> NF_DROP does exactly what the name says, it drops the packet aka
> basically puts it in /dev/null. As with writing to /dev/null, you don't
> get an error for that. Even more, if in the meantime the DROP rule does
> not match anymore, the 2nd or 3rd SYN from the connect() can come
> through and establish a connection (think of "-m statistic" & co.)
> 
> This is very different from REJECT.
> 
> If REJECT doesn't immediately get reported to the application, that *is*
> a bug, but last time i checked i got EPERM immediately. I would fix
> SELinux to use the same mechanism.

I haven't looked at what -j REJECT does (or was intended to do) but it
most certainly does not return an error to sys_connect().  Try it out.

iptables -A OUTPUT -p tcp --dport 80 -j REJECT
links www.google.com

it just hangs on 'making connection'  (exact same for -j DROP)

If everyone agrees that's the wrong behavior (for -j REJECT) I'll work
on fixing that (however is appropriate) and will change the SELinux code
if needed after we've fixed the -j REJECT code.  Obviously there's
problems with my original way to fix the lack of error returns (namely
that I would immediately EACCES for DROP as well as REJECT).

I'm glad to hear that others seem to believe the current code is buggy
and I'm not completely off my rocker to think that applications should
be able to learn somehow that things fell down...

-Eric

^ permalink raw reply

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
From: Patrick McHardy @ 2010-11-12 16:54 UTC (permalink / raw)
  To: David Lamparter
  Cc: Eric Dumazet, Eric Paris, Hua Zhong, netdev, linux-kernel, davem,
	kuznet, pekkas, jmorris, yoshfuji, paul.moore
In-Reply-To: <20101112163543.GB122902@jupiter.n2.diac24.net>

Am 12.11.2010 17:35, schrieb David Lamparter:
> On Fri, Nov 12, 2010 at 05:15:32PM +0100, Eric Dumazet wrote:
>> Le vendredi 12 novembre 2010 à 11:08 -0500, Eric Paris a écrit :
>>
>>> 2) What should the generic TCP code (tcp_connect()) do if the skb failed
>>> to send.  Should it return error codes back up the stack somehow or
>>> should they continue to be ignored?  Obviously continuing to just ignore
>>> information we have doesn't make me happy (otherwise I wouldn't have
>>> started scratching this itch).  But the point about ENOBUFS is well
>>> taken.  Maybe I should make tcp_connect(), or the caller to
>>> tcp_connect() more intelligent about specific error codes?
>>>
>>> I'm looking for a path forward.  If SELinux is rejecting the SYN packets
>>> on connect() I want to pass that info to userspace rather than just
>>> hanging.  What's the best way to accomplish that?
>>>
>>
>> Eric, if you can differentiate a permanent reject, instead of a
>> temporary one (congestion, or rate limiting, or ENOBUF, or ...), then
>> yes, you could make tcp_connect() report to user the permanent error,
>> and ignore the temporary one.

Indeed. We could even make the NF_DROP return value configurable
by encoding it in the verdict.

> If the netfilter targets DROP/REJECT match the NF_DROP/NF_REJECT
> counterparts, which i guess they do but i didn't read the source ;),
> then SELinux should use NF_REJECT in my opinion.

There is no NF_REJECT.

> NF_DROP does exactly what the name says, it drops the packet aka
> basically puts it in /dev/null. As with writing to /dev/null, you don't
> get an error for that. Even more, if in the meantime the DROP rule does
> not match anymore, the 2nd or 3rd SYN from the connect() can come
> through and establish a connection (think of "-m statistic" & co.)
> 
> This is very different from REJECT.

Returning NF_DROP results in -EPERM getting reported back. As Eric
noticed, this is ignored for SYN packets.

> If REJECT doesn't immediately get reported to the application, that *is*
> a bug, but last time i checked i got EPERM immediately. I would fix
> SELinux to use the same mechanism.

NF_DROP returns -EPERM, the REJECT targets send packets to reject
a connection. Whether this is reported immediately depends on the
error and the protocol in question. Using a TCP reset immediately
resets the connection.

^ permalink raw reply

* Re: [PATCH 4/10] Fix leaking of kernel heap addresses in net/
From: Dan Rosenberg @ 2010-11-12 17:24 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Eric Dumazet, David Miller, socketcan, kuznet, urs.thuermann,
	yoshfuji, kaber, jmorris, remi.denis-courmont, pekkas, sri,
	vladislav.yasevich, tj, lizf, joe, hadi, ebiederm, adobriyan,
	jpirko, johannes.berg, daniel.lezcano, xemul, socketcan-core,
	netdev, linux-sctp, torvalds
In-Reply-To: <20101112083315.096dfaa3@nehalam>

> 
> Also, the whole idea needs to be under a config option, so only
> the paranoid idiots turn it on.

If that's what's necessary to get it accepted, I'm willing to do that.
But when a solution does not negatively impact usability or performance
and improves security, even in a small way, why should it not be enabled
by default?  Of course it's my responsibility to first propose a
solution that is acceptable from a usability/debugging standpoint, but
assuming that can be achieved, I don't really see what the problem is.
There's a difference between being a "paranoid idiot" and wanting to
protect users from unnecessary exposure.

-Dan

^ permalink raw reply

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
From: Alexey Kuznetsov @ 2010-11-12 17:46 UTC (permalink / raw)
  To: Eric Paris; +Cc: netdev, linux-kernel, davem, pekkas, jmorris, yoshfuji, kaber
In-Reply-To: <20101111210341.31350.86916.stgit@paris.rdu.redhat.com>

Hello!

On Thu, Nov 11, 2010 at 04:03:41PM -0500, Eric Paris wrote:
> immediately when it calls connect().  Is this wrong?  Is this bad to tell
> userspace more quickly what happened?  Does passing this error code back up
> the stack here break something else?  Why do some functions seem to pay
> attention to tcp_transmit_skb() return codes and some functions just ignore
> it?

Essentially, return value of tcp_transmit_skb() is always ignored.
It is used only for accounting and for some optimization of retransmission behaviour.
Generally, tcp does not react on errors coming outside of tcp protocol.

The only loophole is ICMP error in the same case as yours. In _violation_ of specs
linux immediately aborts unestablished connect on an icmp error. IMHO that thing
which you suggest is correct (of course, provided you filter out transient errors and react only
to EPERM or something like this). It was not done because it was expected
firewall rule prescribing immediate abort is configured with "--reject-with icmp-port-unreachable",
otherwise the rule orders real blackhole.

Alexey

^ permalink raw reply

* a problem tcp_v4_err()
From: Alexey Kuznetsov @ 2010-11-12 17:57 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: David Lamparter, Eric Dumazet, Eric Paris, Hua Zhong, netdev,
	linux-kernel, davem, pekkas, jmorris, yoshfuji, paul.moore
In-Reply-To: <4CDD7145.8070606@trash.net>

Hello!

I looked at tcp_v4_err() and found something strange. Quite non-trivial operations
are performed on unlocked sockets. It looks like at least this BUG_ON():

                skb = tcp_write_queue_head(sk);
                BUG_ON(!skb);

can be easily triggered.

Do I miss something?

Alexey

^ permalink raw reply

* Re: a problem tcp_v4_err()
From: Eric Dumazet @ 2010-11-12 18:12 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Patrick McHardy, David Lamparter, Eric Paris, Hua Zhong, netdev,
	linux-kernel, davem, pekkas, jmorris, yoshfuji, paul.moore
In-Reply-To: <20101112175715.GB16544@ms2.inr.ac.ru>

Le vendredi 12 novembre 2010 à 20:57 +0300, Alexey Kuznetsov a écrit :
> Hello!
> 
> I looked at tcp_v4_err() and found something strange. Quite non-trivial operations
> are performed on unlocked sockets. It looks like at least this BUG_ON():
> 
>                 skb = tcp_write_queue_head(sk);
>                 BUG_ON(!skb);
> 
> can be easily triggered.
> 
> Do I miss something?
> 

Hi Alexey !

I see socket is locked around line 368,

        bh_lock_sock(sk);
        /* If too many ICMPs get dropped on busy
         * servers this needs to be solved differently.
         */
        if (sock_owned_by_user(sk))
                NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);


Hmm, maybe some goto is missing ;)

^ permalink raw reply

* Re: a problem tcp_v4_err()
From: Eric Dumazet @ 2010-11-12 18:21 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Patrick McHardy, David Lamparter, Eric Paris, Hua Zhong, netdev,
	linux-kernel, davem, pekkas, jmorris, yoshfuji, paul.moore
In-Reply-To: <1289585578.3185.268.camel@edumazet-laptop>

Le vendredi 12 novembre 2010 à 19:12 +0100, Eric Dumazet a écrit :
> Le vendredi 12 novembre 2010 à 20:57 +0300, Alexey Kuznetsov a écrit :
> > Hello!
> > 
> > I looked at tcp_v4_err() and found something strange. Quite non-trivial operations
> > are performed on unlocked sockets. It looks like at least this BUG_ON():
> > 
> >                 skb = tcp_write_queue_head(sk);
> >                 BUG_ON(!skb);
> > 
> > can be easily triggered.
> > 
> > Do I miss something?
> > 
> 
> Hi Alexey !
> 
> I see socket is locked around line 368,
> 
>         bh_lock_sock(sk);
>         /* If too many ICMPs get dropped on busy
>          * servers this needs to be solved differently.
>          */
>         if (sock_owned_by_user(sk))
>                 NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
> 
> 
> Hmm, maybe some goto is missing ;)
> 

Well, goto is not missing.

Why do you think BUG_ON(!skb) can be triggered ?

We test before :

	if (seq != tp->snd_una  || !icsk->icsk_retransmits ||
		!icsk->icsk_backoff)
		break;

So a concurrent user only can add new skb(s) in the (non empty) queue ?

^ permalink raw reply

* Re: a problem tcp_v4_err()
From: Eric Dumazet @ 2010-11-12 18:27 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Patrick McHardy, David Lamparter, Eric Paris, Hua Zhong, netdev,
	linux-kernel, davem, pekkas, jmorris, yoshfuji, paul.moore,
	Damian Lukowski
In-Reply-To: <1289586114.3185.271.camel@edumazet-laptop>

Le vendredi 12 novembre 2010 à 19:21 +0100, Eric Dumazet a écrit :
> Le vendredi 12 novembre 2010 à 19:12 +0100, Eric Dumazet a écrit :
> > Le vendredi 12 novembre 2010 à 20:57 +0300, Alexey Kuznetsov a écrit :
> > > Hello!
> > > 
> > > I looked at tcp_v4_err() and found something strange. Quite non-trivial operations
> > > are performed on unlocked sockets. It looks like at least this BUG_ON():
> > > 
> > >                 skb = tcp_write_queue_head(sk);
> > >                 BUG_ON(!skb);
> > > 
> > > can be easily triggered.
> > > 
> > > Do I miss something?
> > > 
> > 
> > Hi Alexey !
> > 
> > I see socket is locked around line 368,
> > 
> >         bh_lock_sock(sk);
> >         /* If too many ICMPs get dropped on busy
> >          * servers this needs to be solved differently.
> >          */
> >         if (sock_owned_by_user(sk))
> >                 NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
> > 
> > 
> > Hmm, maybe some goto is missing ;)
> > 
> 
> Well, goto is not missing.
> 
> Why do you think BUG_ON(!skb) can be triggered ?
> 
> We test before :
> 
> 	if (seq != tp->snd_una  || !icsk->icsk_retransmits ||
> 		!icsk->icsk_backoff)
> 		break;
> 
> So a concurrent user only can add new skb(s) in the (non empty) queue ?
> 
> 

Oh well, it seems you are right (backlog processing)

Bug was introduced in commit f1ecd5d9e736660 (Revert Backoff [v3]:
Revert RTO on ICMP destination unreachable) from Damian Lukowski

^ permalink raw reply

* Re: a problem tcp_v4_err()
From: Alexey Kuznetsov @ 2010-11-12 18:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, David Lamparter, Eric Paris, Hua Zhong, netdev,
	linux-kernel, davem, pekkas, jmorris, yoshfuji, paul.moore
In-Reply-To: <1289585578.3185.268.camel@edumazet-laptop>

Hello!

On Fri, Nov 12, 2010 at 07:12:58PM +0100, Eric Dumazet wrote:
> I see socket is locked around line 368,
> 
>         bh_lock_sock(sk);
>         /* If too many ICMPs get dropped on busy
>          * servers this needs to be solved differently.
>          */
>         if (sock_owned_by_user(sk))
>                 NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
> 
> 
> Hmm, maybe some goto is missing ;)

It is not missing, sock_owned_by_user() is checked later when some operation which
cannot be done without lock is required. It was done to save error in sk_err_soft even
when socket is locked.

This code also _understands_ this: look at

                } else if (sock_owned_by_user(sk)) {
                        /* RTO revert clocked out retransmission,
                         * but socket is locked. Will defer. */
                        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                                                  HZ/20, TCP_RTO_MAX);

but somehow it considers the manipulations with rto/backoff/write_queue as safe.
Seems, they are not.

Alexey

^ permalink raw reply

* Re: a problem tcp_v4_err()
From: Alexey Kuznetsov @ 2010-11-12 18:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, David Lamparter, Eric Paris, Hua Zhong, netdev,
	linux-kernel, davem, pekkas, jmorris, yoshfuji, paul.moore,
	Damian Lukowski
In-Reply-To: <1289586477.3185.273.camel@edumazet-laptop>

Hello!

> Oh well, it seems you are right (backlog processing)

Exactly.

Alexey

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox