Netdev List
* [patch 22/30] netfilter: ebt_ulog: fix checkentry return value
From: Greg KH @ 2009-10-01 23:31 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: stable-review, torvalds, akpm, alan, netdev, netfilter-devel,
	Patrick McHardy, davem
In-Reply-To: <20091001233504.GA17709@kroah.com>

[-- Attachment #1: netfilter-ebt_ulog-fix-checkentry-return-value.patch --]
[-- Type: text/plain, Size: 900 bytes --]


2.6.30-stable review patch.  If anyone has any objections, please let us know.

------------------
From: Patrick McHardy <kaber@trash.net>

netfilter: ebt_ulog: fix checkentry return value

Upstream commit 8a56df0a:

Commit 19eda87 (netfilter: change return types of check functions for
Ebtables extensions) broke the ebtables ulog module by missing a return
value conversion.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 net/bridge/netfilter/ebt_ulog.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/bridge/netfilter/ebt_ulog.c
+++ b/net/bridge/netfilter/ebt_ulog.c
@@ -266,7 +266,7 @@ static bool ebt_ulog_tg_check(const stru
 	if (uloginfo->qthreshold > EBT_ULOG_MAX_QLEN)
 		uloginfo->qthreshold = EBT_ULOG_MAX_QLEN;
 
-	return 0;
+	return true;
 }
 
 static struct xt_target ebt_ulog_tg_reg __read_mostly = {
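
A minimal userspace sketch of why the unconverted return value broke the module, assuming the caller treats the bool result as "entry accepted". The names check_old(), check_new(), register_target() and the QLEN_MAX value are made up for illustration and are not the kernel's:

```c
#include <stdbool.h>

#define QLEN_MAX 50U	/* stand-in for EBT_ULOG_MAX_QLEN, value assumed */

/* Pre-conversion style: 0 used to mean "success", but read as a bool
 * it is false, so the entry is rejected. */
bool check_old(unsigned int *qthreshold)
{
	if (*qthreshold > QLEN_MAX)
		*qthreshold = QLEN_MAX;
	return 0;
}

/* Converted style: true means the entry is accepted. */
bool check_new(unsigned int *qthreshold)
{
	if (*qthreshold > QLEN_MAX)
		*qthreshold = QLEN_MAX;
	return true;
}

/* Hypothetical caller: maps the bool result to an errno-style code. */
int register_target(bool (*check)(unsigned int *), unsigned int q)
{
	return check(&q) ? 0 : -22;	/* -22 stands in for -EINVAL */
}
```

With the old return value every rule using the target would be refused by such a caller, even though the parameters were fine.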



^ permalink raw reply

* [patch 23/30] netfilter: nf_nat: fix inverted logic for persistent NAT mappings
From: Greg KH @ 2009-10-01 23:31 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: stable-review, torvalds, akpm, alan, netdev, netfilter-devel,
	Patrick McHardy, davem, Maximilian Engelhardt
In-Reply-To: <20091001233504.GA17709@kroah.com>

[-- Attachment #1: netfilter-nf_nat-fix-inverted-logic-for-persistent-nat-mappings.patch --]
[-- Type: text/plain, Size: 1554 bytes --]


2.6.30-stable review patch.  If anyone has any objections, please let us know.

------------------
From: Patrick McHardy <kaber@trash.net>

netfilter: nf_nat: fix inverted logic for persistent NAT mappings

Upstream commit cce5a5c3:

Kernel 2.6.30 introduced a patch [1] for the persistent option in the
netfilter SNAT target. This is exactly what we need here so I had a quick look
at the code and noticed that the patch is wrong. The logic is simply inverted.
The patch below fixes this.

Also note that because of this the default behavior of the SNAT target has
changed since kernel 2.6.30 as it now ignores the destination IP in choosing
the source IP for nating (which should only be the case if the persistent
option is set).

[1] http://git.eu.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=98d500d66cb7940747b424b245fc6a51ecfbf005

Signed-off-by: Maximilian Engelhardt <maxi@daemonizer.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 net/ipv4/netfilter/nf_nat_core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -212,7 +212,7 @@ find_best_ips_proto(struct nf_conntrack_
 	maxip = ntohl(range->max_ip);
 	j = jhash_2words((__force u32)tuple->src.u3.ip,
 			 range->flags & IP_NAT_RANGE_PERSISTENT ?
-				(__force u32)tuple->dst.u3.ip : 0, 0);
+				0 : (__force u32)tuple->dst.u3.ip, 0);
 	j = ((u64)j * (maxip - minip + 1)) >> 32;
 	*var_ipp = htonl(minip + j);
 }
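
A rough userspace model of the selection logic after the fix, assuming a toy mixing function in place of jhash_2words() (toy_hash_2words() and pick_ip() are hypothetical names, and addresses are plain host-order integers here). It only shows that with the persistent flag the destination no longer influences the chosen source address:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy 2-word mixer; NOT the kernel's jhash_2words(). */
uint32_t toy_hash_2words(uint32_t a, uint32_t b)
{
	uint32_t h = a * 2654435761u;

	h ^= b + 0x9e3779b9u + (h << 6) + (h >> 2);
	return h * 2246822519u;
}

/* Pick an address in [minip, maxip]; persistent mappings hash on the
 * source address only, non-persistent ones also mix in the destination. */
uint32_t pick_ip(uint32_t src, uint32_t dst, bool persistent,
		 uint32_t minip, uint32_t maxip)
{
	uint32_t j = toy_hash_2words(src, persistent ? 0 : dst);

	/* Scale the 32-bit hash into the size of the range. */
	j = (uint32_t)(((uint64_t)j * (maxip - minip + 1)) >> 32);
	return minip + j;
}
```

Before the fix the two arms of the conditional were swapped, so the default (non-persistent) case was the one that ignored the destination.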



^ permalink raw reply

* [082/136] netfilter: nf_conntrack: netns fix re reliable conntrack event delivery
From: Greg KH @ 2009-10-02  1:17 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: stable-review, torvalds, akpm, alan, netdev, netfilter-devel,
	Patrick McHardy, davem, Alexey Dobriyan, Pablo Neira Ayuso
In-Reply-To: <20091002012911.GA18542@kroah.com>

[-- Attachment #1: netfilter-nf_conntrack-netns-fix-re-reliable-conntrack-event-delivery.patch --]
[-- Type: text/plain, Size: 1572 bytes --]


2.6.31-stable review patch.  If anyone has any objections, please let us know.

------------------
From: Patrick McHardy <kaber@trash.net>

netfilter: nf_conntrack: netns fix re reliable conntrack event delivery

Upstream commit ee254fa4:

Conntracks on the dying list of any netns other than init_net were never killed.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 net/netfilter/nf_conntrack_core.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1089,14 +1089,14 @@ void nf_conntrack_flush_report(struct ne
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_flush_report);
 
-static void nf_ct_release_dying_list(void)
+static void nf_ct_release_dying_list(struct net *net)
 {
 	struct nf_conntrack_tuple_hash *h;
 	struct nf_conn *ct;
 	struct hlist_nulls_node *n;
 
 	spin_lock_bh(&nf_conntrack_lock);
-	hlist_nulls_for_each_entry(h, n, &init_net.ct.dying, hnnode) {
+	hlist_nulls_for_each_entry(h, n, &net->ct.dying, hnnode) {
 		ct = nf_ct_tuplehash_to_ctrack(h);
 		/* never fails to remove them, no listeners at this point */
 		nf_ct_kill(ct);
@@ -1115,7 +1115,7 @@ static void nf_conntrack_cleanup_net(str
 {
  i_see_dead_people:
 	nf_ct_iterate_cleanup(net, kill_all, NULL);
-	nf_ct_release_dying_list();
+	nf_ct_release_dying_list(net);
 	if (atomic_read(&net->ct.count) != 0) {
 		schedule();
 		goto i_see_dead_people;

^ permalink raw reply

* [083/136] netfilter: bridge: refcount fix
From: Greg KH @ 2009-10-02  1:17 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: stable-review, torvalds, akpm, alan, netdev, netfilter-devel,
	Patrick McHardy, davem, Eric Dumazet
In-Reply-To: <20091002012911.GA18542@kroah.com>

[-- Attachment #1: netfilter-bridge-refcount-fix.patch --]
[-- Type: text/plain, Size: 1097 bytes --]


2.6.31-stable review patch.  If anyone has any objections, please let us know.

------------------
From: Patrick McHardy <kaber@trash.net>

netfilter: bridge: refcount fix

Upstream commit f3abc9b9:

commit f216f082b2b37c4943f1e7c393e2786648d48f6f
([NETFILTER]: bridge netfilter: deal with martians correctly)
added a refcount leak on in_dev.

Instead of using in_dev_get(), we can use __in_dev_get_rcu(),
as netfilter hooks are running under rcu_read_lock(), as pointed
out by Patrick.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 net/bridge/br_netfilter.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -359,7 +359,7 @@ static int br_nf_pre_routing_finish(stru
 				},
 				.proto = 0,
 			};
-			struct in_device *in_dev = in_dev_get(dev);
+			struct in_device *in_dev = __in_dev_get_rcu(dev);
 
 			/* If err equals -EHOSTUNREACH the error is due to a
 			 * martian destination or due to the fact that

^ permalink raw reply

* [084/136] netfilter: ebt_ulog: fix checkentry return value
From: Greg KH @ 2009-10-02  1:17 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: stable-review, torvalds, akpm, alan, netdev, netfilter-devel,
	Patrick McHardy, davem
In-Reply-To: <20091002012911.GA18542@kroah.com>

[-- Attachment #1: netfilter-ebt_ulog-fix-checkentry-return-value.patch --]
[-- Type: text/plain, Size: 900 bytes --]


2.6.31-stable review patch.  If anyone has any objections, please let us know.

------------------
From: Patrick McHardy <kaber@trash.net>

netfilter: ebt_ulog: fix checkentry return value

Upstream commit 8a56df0a:

Commit 19eda87 (netfilter: change return types of check functions for
Ebtables extensions) broke the ebtables ulog module by missing a return
value conversion.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 net/bridge/netfilter/ebt_ulog.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/bridge/netfilter/ebt_ulog.c
+++ b/net/bridge/netfilter/ebt_ulog.c
@@ -266,7 +266,7 @@ static bool ebt_ulog_tg_check(const stru
 	if (uloginfo->qthreshold > EBT_ULOG_MAX_QLEN)
 		uloginfo->qthreshold = EBT_ULOG_MAX_QLEN;
 
-	return 0;
+	return true;
 }
 
 static struct xt_target ebt_ulog_tg_reg __read_mostly = {



^ permalink raw reply

* [081/136] netfilter: nf_nat: fix inverted logic for persistent NAT mappings
From: Greg KH @ 2009-10-02  1:17 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: stable-review, torvalds, akpm, alan, netdev, netfilter-devel,
	Patrick McHardy, davem, Maximilian Engelhardt
In-Reply-To: <20091002012911.GA18542@kroah.com>

[-- Attachment #1: netfilter-nf_nat-fix-inverted-logic-for-persistent-nat-mappings.patch --]
[-- Type: text/plain, Size: 1554 bytes --]


2.6.31-stable review patch.  If anyone has any objections, please let us know.

------------------
From: Patrick McHardy <kaber@trash.net>

netfilter: nf_nat: fix inverted logic for persistent NAT mappings

Upstream commit cce5a5c3:

Kernel 2.6.30 introduced a patch [1] for the persistent option in the
netfilter SNAT target. This is exactly what we need here so I had a quick look
at the code and noticed that the patch is wrong. The logic is simply inverted.
The patch below fixes this.

Also note that because of this the default behavior of the SNAT target has
changed since kernel 2.6.30 as it now ignores the destination IP in choosing
the source IP for nating (which should only be the case if the persistent
option is set).

[1] http://git.eu.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=98d500d66cb7940747b424b245fc6a51ecfbf005

Signed-off-by: Maximilian Engelhardt <maxi@daemonizer.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 net/ipv4/netfilter/nf_nat_core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -212,7 +212,7 @@ find_best_ips_proto(struct nf_conntrack_
 	maxip = ntohl(range->max_ip);
 	j = jhash_2words((__force u32)tuple->src.u3.ip,
 			 range->flags & IP_NAT_RANGE_PERSISTENT ?
-				(__force u32)tuple->dst.u3.ip : 0, 0);
+				0 : (__force u32)tuple->dst.u3.ip, 0);
 	j = ((u64)j * (maxip - minip + 1)) >> 32;
 	*var_ipp = htonl(minip + j);
 }



^ permalink raw reply

* Re: [PATCH 1/7] mlx4: Added interrupts test support
From: Roland Dreier @ 2009-10-02  3:32 UTC (permalink / raw)
  To: Yevgeny Petrilin; +Cc: davem, netdev
In-Reply-To: <4AC4BD9A.2050101@mellanox.co.il>

This feels like a pretty risky thing to do while the device might be
handling all sorts of other traffic at the same time.  Are you sure
there are no races you expose here?  Have you actually seen cases where
the interrupt test during initialization works but then this test
catches a problem?  (My experience has been that if any MSI-X interrupts
work from a device, then they'll all work)

 > +/* A test that verifies that we can accept interrupts on all
 > + * the irq vectors of the device.
 > + * Interrupts are checked using the NOP command.
 > + */
 > +int mlx4_test_interrupts(struct mlx4_dev *dev)
 > +{
 > +	struct mlx4_priv *priv = mlx4_priv(dev);
 > +	int i;
 > +	int err;
 > +
 > +	err = mlx4_NOP(dev);
 > +	/* When not in MSI_X, there is only one irq to check */
 > +	if (!(dev->flags & MLX4_FLAG_MSI_X))
 > +		return err;
 > +
 > +	/* A loop over all completion vectors, for each vector we will check
 > +	 * whether it works by mapping command completions to that vector
 > +	 * and performing a NOP command
 > +	 */
 > +	for (i = 0; !err && (i < dev->caps.num_comp_vectors); ++i) {
 > +		/* Temporary use polling for command completions */

you want the adverb form here: "Temporarily"

 > +		mlx4_cmd_use_polling(dev);
 > +
 > +		/* Map the new eq to handle all asyncronous events */

"asynchronous"

 > +		err = mlx4_MAP_EQ(dev, MLX4_ASYNC_EVENT_MASK, 0,
 > +				  priv->eq_table.eq[i].eqn);
 > +		if (err) {
 > +			mlx4_warn(dev, "Failed mapping eq for interrupt test\n");
 > +			mlx4_cmd_use_events(dev);
 > +			break;
 > +		}
 > +
 > +		/* Go back to using events */
 > +		mlx4_cmd_use_events(dev);
 > +		err = mlx4_NOP(dev);

You could simplify the code a bit by moving the mlx4_cmd_use_events() to
before where you test err, i.e.:

		err = mlx4_MAP_EQ(...);
		mlx4_cmd_use_events(dev);
		if (err)
			mlx4_warn(dev, ...);
		else
			err = mlx4_NOP(dev);

 > +	}
 > +
 > +	/* Return to default */
 > +	mlx4_MAP_EQ(dev, MLX4_ASYNC_EVENT_MASK, 0,
 > +		    priv->eq_table.eq[dev->caps.num_comp_vectors].eqn);
 > +	return err;
 > +}
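
To make the suggested restructuring concrete, here is a small userspace sketch with stubs, assuming nothing about the real driver: use_polling(), use_events(), map_eq(), nop() and test_one_vector() are hypothetical stand-ins. The point is only that restoring event mode once, before testing err, covers both the error and the success path:

```c
#include <stdio.h>

int events_enabled = 1;		/* 1 = event mode, 0 = polling mode */

void use_polling(void) { events_enabled = 0; }
void use_events(void)  { events_enabled = 1; }

/* Stubs: map_eq() fails on request, nop() always succeeds. */
int map_eq(int fail) { return fail ? -5 : 0; }
int nop(void) { return 0; }

/* One iteration of the per-vector loop, restructured as suggested. */
int test_one_vector(int fail_map)
{
	int err;

	use_polling();
	err = map_eq(fail_map);
	use_events();		/* restored on both paths */
	if (err)
		fprintf(stderr, "failed mapping eq for interrupt test\n");
	else
		err = nop();
	return err;
}
```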

^ permalink raw reply

* query: adding a sysctl
From: William Allen Simpson @ 2009-10-02  4:00 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 403 bytes --]

[My first post here, hopefully not a FAQ, as I've googled it, but cannot find
the definitive answer.]

I've been trying to add a sysctl, and I've noticed this message:

sysctl table check failed: /net/ipv4/tcp_cookie_size .3.5.126 Unknown sysctl binary path

I modeled the code on sysctl_tcp_syncookies, and apparently I'm missing some
additional magic?  Or does something need to be done other than C?

[-- Attachment #2: sysctl_tcp_cookie_size.txt --]
[-- Type: text/plain, Size: 1993 bytes --]

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index e76d3b2..8c74bec 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -435,6 +435,7 @@ enum
 	NET_TCP_ALLOWED_CONG_CONTROL=123,
 	NET_TCP_MAX_SSTHRESH=124,
 	NET_TCP_FRTO_RESPONSE=125,
+	NET_TCP_COOKIE_SIZE=126,
 };
 
 enum {
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 4710d21..e6174c9 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -340,6 +340,16 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec_jiffies,
 		.strategy	= sysctl_jiffies
 	},
+#ifdef CONFIG_TCP_OPT_COOKIE_EXTENSION
+	{
+		.ctl_name	= NET_TCP_COOKIE_SIZE,
+		.procname	= "tcp_cookie_size",
+		.data		= &sysctl_tcp_cookie_size,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+#endif
 #ifdef CONFIG_SYN_COOKIES
 	{
 		.ctl_name	= NET_TCP_SYNCOOKIES,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 56b7602..a53b2a8 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -208,6 +214,7 @@ extern int sysctl_tcp_synack_retries;
 extern int sysctl_tcp_retries1;
 extern int sysctl_tcp_retries2;
 extern int sysctl_tcp_orphan_retries;
+extern int sysctl_tcp_cookie_size;
 extern int sysctl_tcp_syncookies;
 extern int sysctl_tcp_retrans_collapse;
 extern int sysctl_tcp_stdurg;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5200aab..afbdc30 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -59,6 +59,14 @@ int sysctl_tcp_base_mss __read_mostly = 512;
 /* By default, RFC2861 behavior.  */
 int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
 
+#ifdef CONFIG_SYSCTL
+/* By default, let the user enable it. */
+int sysctl_tcp_cookie_size __read_mostly = 0;
+#else
+int sysctl_tcp_cookie_size __read_mostly = TCP_COOKIE_MAX;
+#endif
+
+
 /* Account for new data that has been sent to the network. */
 static void tcp_event_new_data_sent(struct sock *sk, struct sk_buff *skb)
 {

^ permalink raw reply related

* Re: [PATCH] tg3: Remove prev_vlan_tag from struct tx_ring_info
From: Eric Dumazet @ 2009-10-02  4:16 UTC (permalink / raw)
  To: David Miller; +Cc: mcarlson, netdev, mchan
In-Reply-To: <20091001.143859.53379358.davem@davemloft.net>

David Miller a écrit :
> 
> Applied, thanks.
> 
> Eric, I had to apply this by hand because:
> 
>>> @@ -2412,7 +2412,6 @@ struct ring_info {
>>>  
>>>  struct tx_ring_info {
>>>  	struct sk_buff                  *skb;
>>> -	u32                             prev_vlan_tag;
>>>  };
> 
> Your email client changed tabs into spaces.

Oops, I'm sorry Dave, I did a copy/paste and forgot about tabs.

Thanks

^ permalink raw reply

* Re: [PATCH 04/31] mm: tag reseve pages
From: Neil Brown @ 2009-10-02  4:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: Suresh Jayaraman, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, netdev, Miklos Szeredi, Wouter Verhelst, Peter Zijlstra,
	trond.myklebust
In-Reply-To: <alpine.DEB.1.00.0910011407390.32006@chino.kir.corp.google.com>

On Thursday October 1, rientjes@google.com wrote:
> On Thu, 1 Oct 2009, Suresh Jayaraman wrote:
> 
> > Index: mmotm/mm/page_alloc.c
> > ===================================================================
> > --- mmotm.orig/mm/page_alloc.c
> > +++ mmotm/mm/page_alloc.c
> > @@ -1501,8 +1501,10 @@ zonelist_scan:
> >  try_this_zone:
> >  		page = buffered_rmqueue(preferred_zone, zone, order,
> >  						gfp_mask, migratetype);
> > -		if (page)
> > +		if (page) {
> > +			page->reserve = !!(alloc_flags & ALLOC_NO_WATERMARKS);
> >  			break;
> > +		}
> >  this_zone_full:
> >  		if (NUMA_BUILD)
> >  			zlc_mark_zone_full(zonelist, z);
> 
> page->reserve won't necessarily indicate that access to reserves was 
> _necessary_ for the allocation to succeed, though.  This will mark any 
> page being allocated under PF_MEMALLOC as reserve when all zones may be 
> well above their min watermarks.

Normally if zones are above their watermarks, page->reserve will not
be set.
This is because __alloc_pages_nodemask (which seems to be the main
non-inline entrypoint) first calls get_page_from_freelist with
alloc_flags set to ALLOC_WMARK_LOW|ALLOC_CPUSET.
Only if this fails does __alloc_pages_nodemask call
__alloc_pages_slowpath which potentially sets ALLOC_NO_WATERMARKS in
alloc_flags.

So page->reserve being set actually tells us:
  PF_MEMALLOC or GFP_MEMALLOC were used, and
  a WMARK_LOW allocation attempt failed very recently

which is close enough to "the emergency reserves were used" I think.
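
A toy model of that two-step behaviour, as a sketch only: the flag values, struct toy_page, and the toy_* functions are invented for illustration and heavily simplify the real allocator (one zone, no cpusets, no reclaim).

```c
#include <stdbool.h>

#define TOY_ALLOC_WMARK_LOW	0x1	/* values assumed, not the kernel's */
#define TOY_ALLOC_NO_WATERMARKS	0x2

struct toy_page {
	bool reserve;
};

/* Succeeds if a page is free and either the low watermark is respected
 * or the caller is allowed to ignore watermarks. */
bool toy_get_page_from_freelist(int free, int wmark_low, int alloc_flags,
				struct toy_page *page)
{
	if (free <= 0)
		return false;
	if (!(alloc_flags & TOY_ALLOC_NO_WATERMARKS) && free <= wmark_low)
		return false;
	page->reserve = !!(alloc_flags & TOY_ALLOC_NO_WATERMARKS);
	return true;
}

bool toy_alloc_page(int free, int wmark_low, bool memalloc,
		    struct toy_page *page)
{
	/* Fast path: always tried first, with watermarks enforced. */
	if (toy_get_page_from_freelist(free, wmark_low,
				       TOY_ALLOC_WMARK_LOW, page))
		return true;
	/* Slow path: only memalloc contexts may dip into the reserves. */
	if (memalloc)
		return toy_get_page_from_freelist(free, wmark_low,
						  TOY_ALLOC_NO_WATERMARKS,
						  page);
	return false;
}
```

In this model a memalloc caller only ever sees reserve set when the watermark-respecting attempt has just failed.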

Thanks,
NeilBrown

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 30/31] Fix use of uninitialized variable in cache_grow()
From: Neil Brown @ 2009-10-02  4:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: Suresh Jayaraman, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, netdev, Miklos Szeredi, Wouter Verhelst, Peter Zijlstra,
	trond.myklebust
In-Reply-To: <alpine.DEB.1.00.0910011341280.27559@chino.kir.corp.google.com>

On Thursday October 1, rientjes@google.com wrote:
> On Thu, 1 Oct 2009, Suresh Jayaraman wrote:
> 
> > From: Miklos Szeredi <mszeredi@suse.cz>
> > 
> > This fixes a bug in reserve-slub.patch.
> > 
> > If cache_grow() was called with objp != NULL then the 'reserve' local
> > variable wasn't initialized. This resulted in ac->reserve being set to
> > a rubbish value.  Due to this in some circumstances huge amounts of
> > slab pages were allocated (due to slab_force_alloc() returning true),
> > which caused atomic page allocation failures and slowdown of the
> > system.
> > 
> > Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
> > Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
> > ---
> >  mm/slab.c |    5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> > 
> > Index: mmotm/mm/slab.c
> > ===================================================================
> > --- mmotm.orig/mm/slab.c
> > +++ mmotm/mm/slab.c
> > @@ -2760,7 +2760,7 @@ static int cache_grow(struct kmem_cache
> >  	size_t offset;
> >  	gfp_t local_flags;
> >  	struct kmem_list3 *l3;
> > -	int reserve;
> > +	int reserve = -1;
> >  
> >  	/*
> >  	 * Be lazy and only check for valid flags here,  keeping it out of the
> > @@ -2816,7 +2816,8 @@ static int cache_grow(struct kmem_cache
> >  	if (local_flags & __GFP_WAIT)
> >  		local_irq_disable();
> >  	check_irq_off();
> > -	slab_set_reserve(cachep, reserve);
> > +	if (reserve != -1)
> > +		slab_set_reserve(cachep, reserve);
> >  	spin_lock(&l3->list_lock);
> >  
> >  	/* Make slab active. */
> 
> Given the patch description, shouldn't this be a test for objp != NULL 
> instead, then?

In between those two patch hunks, cache_grow contains the code:
	if (!objp)
		objp = kmem_getpages(cachep, local_flags, nodeid, &reserve);
	if (!objp)
		goto failed;

We can no longer test if objp was NULL on entry to the function.
We could take a copy of objp on entry to the function, and test it
here.  But initialising 'reserve' to an invalid value is easier.
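
A userspace sketch of the sentinel approach, assuming stand-in helpers (toy_getpages(), toy_set_reserve() and toy_cache_grow() are hypothetical and only mimic the control flow described above):

```c
#include <stddef.h>

int last_reserve;	/* last value recorded by toy_set_reserve() */
int set_calls;		/* how often toy_set_reserve() ran */

void toy_set_reserve(int reserve)
{
	last_reserve = reserve;
	set_calls++;
}

/* Stand-in for the page allocation: always succeeds and reports that
 * the page came from the reserves. */
void *toy_getpages(int *reserve)
{
	static char page[64];

	*reserve = 1;
	return page;
}

int toy_cache_grow(void *objp)
{
	int reserve = -1;	/* sentinel: not set by toy_getpages() */

	if (!objp)
		objp = toy_getpages(&reserve);
	if (!objp)
		return 0;
	/* objp may have been overwritten above, so test the sentinel
	 * rather than the pointer. */
	if (reserve != -1)
		toy_set_reserve(reserve);
	return 1;
}
```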



> 
> If so, it doesn't make sense because reserve will only be initialized when 
> objp == NULL in the call to kmem_getpages() from cache_grow().
> 
> 
> The title of the patch suggests this is just dealing with an uninitialized 
> auto variable so the anticipated change would be from "int reserve" to 
> "int uninitialized_var(reserve)".

That change is only appropriate when the compiler is issuing a
warning that the variable is used before it is initialised, but we
know that not to be the case.
In this situation, we know it *is* being used before it is
initialised, and so we need to initialise it to something.

Thanks,
NeilBrown


^ permalink raw reply

* Re: [PATCH 03/31] mm: expose gfp_to_alloc_flags()
From: Neil Brown @ 2009-10-02  5:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Suresh Jayaraman, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, netdev, Miklos Szeredi, Wouter Verhelst, Peter Zijlstra,
	trond.myklebust
In-Reply-To: <alpine.DEB.1.00.0910011355230.32006@chino.kir.corp.google.com>

On Thursday October 1, rientjes@google.com wrote:
> On Thu, 1 Oct 2009, Suresh Jayaraman wrote:
> 
> > From: Peter Zijlstra <a.p.zijlstra@chello.nl> 
> > 
> > Expose the gfp to alloc_flags mapping, so we can use it in other parts
> > of the vm.
> > 
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
> 
> Nack, these flags are internal to the page allocator and exporting them to 
> generic VM code is unnecessary.
> 
> The only bit you actually use in your patchset is ALLOC_NO_WATERMARKS to 
> determine whether a particular allocation can use memory reserves.  I'd 
> suggest adding a bool function that returns whether the current context is 
> given access to reserves including your new __GFP_MEMALLOC flag and 
> exporting that instead.

That sounds like a very appropriate suggestion, thanks.

So something like this?
Then change every occurrence of
+		if (!(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
to
+		if (!(gfp_has_no_watermarks(gfpflags)))

??

Thanks,
NeilBrown



diff --git a/mm/internal.h b/mm/internal.h
index 22ec8d2..7ff78d6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -195,6 +195,8 @@ static inline struct page *mem_map_next(struct page *iter,
 #define __paginginit __init
 #endif
 
+int gfp_has_no_watermarks(gfp_t gfp_mask);
+
 /* Memory initialisation debug and verification */
 enum mminit_level {
 	MMINIT_WARNING,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bf72055..4b4292a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1782,6 +1782,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	return alloc_flags;
 }
 
+int gfp_has_no_watermarks(gfp_t gfp_mask)
+{
+	return (gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,


^ permalink raw reply related

* Re: [PATCH] Use sk_mark for routing lookup in more places
From: Eric Dumazet @ 2009-10-02  5:14 UTC (permalink / raw)
  To: David Miller; +Cc: atis, panther, netdev
In-Reply-To: <20091001.151823.263194343.davem@davemloft.net>

Here is a followup on this area, thanks.

[RFC] af_packet: fill skb->mark at xmit

skb->mark may be used by classifiers, so fill it in when the user has
set the SO_MARK option on the socket.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/packet/af_packet.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index d7ecca0..610f150 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -490,6 +490,7 @@ static int packet_sendmsg_spkt(struct kiocb *iocb, struct socket *sock,
 	skb->protocol = proto;
 	skb->dev = dev;
 	skb->priority = sk->sk_priority;
+	skb->mark = sk->sk_mark;
 	if (err)
 		goto out_free;
 
@@ -856,6 +857,7 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 	skb->protocol = proto;
 	skb->dev = dev;
 	skb->priority = po->sk.sk_priority;
+	skb->mark = po->sk.sk_mark;
 	skb_shinfo(skb)->destructor_arg = ph.raw;
 
 	switch (po->tp_version) {
@@ -1125,6 +1127,7 @@ static int packet_snd(struct socket *sock,
 	skb->protocol = proto;
 	skb->dev = dev;
 	skb->priority = sk->sk_priority;
+	skb->mark = sk->sk_mark;
 
 	/*
 	 *	Now send it


^ permalink raw reply related

* Re: [PATCH 01/31] mm: serialize access to min_free_kbytes
From: Neil Brown @ 2009-10-02  5:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: Suresh Jayaraman, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, netdev, Miklos Szeredi, Wouter Verhelst, Peter Zijlstra,
	trond.myklebust
In-Reply-To: <alpine.DEB.1.00.0910011330430.27559@chino.kir.corp.google.com>

On Thursday October 1, rientjes@google.com wrote:
> On Thu, 1 Oct 2009, Suresh Jayaraman wrote:
> 
> > From: Peter Zijlstra <a.p.zijlstra@chello.nl> 
> > 
> > There is a small race between the procfs caller and the memory hotplug caller
> > of setup_per_zone_wmarks(). Not a big deal, but the next patch will add yet
> > another caller. Time to close the gap.
> > 
> 
> By "next patch," you mean "mm: emergency pool" (patch 08/31)?

:-)  It is always safer to say "a subsequent patch", isn't it....

> 
> If so, can't you eliminate var_free_mutex entirely from that patch and 
> take min_free_lock in adjust_memalloc_reserve() instead?

adjust_memalloc_reserve does a test alloc/free cycle under a lock.
That cannot be done under a spin-lock, it must be a mutex.
So I don't think you can eliminate var_free_mutex.

Thanks,
NeilBrown

> 
>  [ __adjust_memalloc_reserve() would call __setup_per_zone_wmarks()
>    under lock instead, now. ]


^ permalink raw reply

* Re: [PATCH 1/7] mlx4: Added interrupts test support
From: David Miller @ 2009-10-02  5:27 UTC (permalink / raw)
  To: rdreier; +Cc: yevgenyp, netdev
In-Reply-To: <adaeipmqxmv.fsf@cisco.com>

From: Roland Dreier <rdreier@cisco.com>
Date: Thu, 01 Oct 2009 20:32:08 -0700

> This feels like a pretty risky thing to do while the device might be
> handling all sorts of other traffic at the same time.  Are you sure
> there are no races you expose here?  Have you actually seen cases where
> the interrupt test during initialization works but then this test
> catches a problem?  (My experience has been that if any MSI-X interrupts
> work from a device, then they'll all work)

I would suggest only allowing the test while the interface is down.
That way the test has exclusive control of the IRQ.

^ permalink raw reply

* Re: [PATCH 2/5] Implement loss counting on TFRC-SP receiver
From: Gerrit Renker @ 2009-10-01 20:40 UTC (permalink / raw)
  To: Ivo Calado; +Cc: dccp, netdev
In-Reply-To: <cb00fa210909231843q7f13b2c3i32672e883a017b7b@mail.gmail.com>

| >> The following code would be correct then?
| >>
| >>	 if ((len <= 0) ||
| >>	     (!tfrc_lh_closed_check(cur, cong_evt->tfrchrx_ccval)))
| > {
| >> +		 cur->li_losses += rh->num_losses;
| >> + 		 rh->num_losses  = 0;
| >> 		 return false;
| >> With this change I suppose the problem could be fixed. With that,
| >> rh->num_losses couldn't be added twice. Am I correct?
| >>
| >>
| > The function tfrc_lh_interval_add() is called when
| >  * __two_after_loss() returns true (a new loss is detected) or
| >  * a data packet is ECN-CE marked.
| >
| > I am still not sure about the 'len <= 0' case; this would be true
| > if an ECN-marked packet arrives whose sequence number is 'before'
| > the start of the current loss interval, or if a loss is detected
| > which is older than the start of the current loss interval.
| >
| > The other case (tfrc_lh_closed_check) returns 1 if the current loss
| > interval is 'closed' according to RFC 4342, 10.2.
| >
| > Intuitively, in the first case it refers to the preceding loss
| > interval (i.e. not cur->...), in the second case it seems correct.
| >
| > Doing the first case is complicated due to going back in history.
| > The simplest solution I can think of at the moment is to ignore
| > the exception-case of reordered packets and do something like
| >
| >  if (len <= 0) {
| >     /* FIXME: this belongs into the previous loss interval */
| >     tfrc_pr_debug("Warning: ignoring loss due to reordering");
| > 	return false;
| > }
| >  if (!tfrc_lh_closed_check(...)) {
| >     // your code from above
| > }
| 
| Okay, I'll add your suggestion. But I don't know how this would be fixed at all.
|
If it doesn't we will just do another iteration and fix it.
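
For reference, the combined check discussed above could look roughly like the following userspace sketch. It assumes simplified stand-in types (struct toy_interval, struct toy_hist) and passes the result of the "interval closed" test in as a flag instead of calling tfrc_lh_closed_check(); it is not the TFRC library code:

```c
#include <stdbool.h>

struct toy_interval {
	unsigned int li_losses;		/* losses counted in this interval */
};

struct toy_hist {
	unsigned int num_losses;	/* losses not yet accounted for */
};

/* Returns true when the caller should open a new loss interval. */
bool toy_interval_add(struct toy_interval *cur, struct toy_hist *rh,
		      long len, bool closed)
{
	if (len <= 0) {
		/* Reordering: the loss belongs to the previous interval;
		 * ignored here, as in the FIXME above. */
		return false;
	}
	if (!closed) {
		/* Still inside the current interval: fold the pending
		 * losses in once and clear them, so they cannot be
		 * added twice. */
		cur->li_losses += rh->num_losses;
		rh->num_losses = 0;
		return false;
	}
	return true;
}
```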



| > So it is necessary to decide whether to go the full way, which means
| >  * support Loss Intervals and Dropped Packets alike
| >  * modify TFRC library (it will be a redesign)
| >  * modify receiver code
| >  * modify sender code,
| >    or to use the present approach where
| >  * the receiver computes the Loss Rate and
| >  * a Mandatory Send Loss Event Rate feature is present during feature
| >    negotiation, to avoid problems with incompatible senders
| >   (there is a comment explaining this, in net/dccp/feat.c).
| >
| > Thoughts?
| 
<snip>

| I believe that the first way is better (to "support Loss Intervals and
| Dropped Packets alike..."), because RFC requires loss intervals option
| to be sent. And so, proceed and implement dropped packets option for
| TFRC-SP. You are right, this would need a redesign and rewrite of
| sender and receiver code.
| 
Agree, then let's do that. It requires some coordination on how to arrange
the patches, but we can simplify the process by using the test tree to 
store all intermediate results (i.e. use a separate tree for the rewrite
until it is sufficiently stable/useful).

^ permalink raw reply

* Re: [PATCH 00/31] Swap over NFS -v20
From: Neil Brown @ 2009-10-02  5:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Suresh Jayaraman, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, netdev, Miklos Szeredi, Wouter Verhelst, Peter Zijlstra,
	trond.myklebust
In-Reply-To: <20091001174201.GA30068@infradead.org>

On Thursday October 1, hch@infradead.org wrote:
> 
> The other really big one is adding a proper method for safe, page-backed
> kernelspace I/O on files.  That is not something like the grotty
> swap-tied address_space operations in this patch, but more something in
> the direction of the kernel direct I/O patches that Jens Axboe did
> for use in the loop driver.  But even those aren't complete as they
> don't touch the locking issue yet.

Do you have a problem with the proposed address_space operations apart
from their names including the word "swap"?  Would something like:
  direct_on, direct_off, direct_read, direct_write
be better?
Semantics being that the read and write:
  - bypass the page cache (invalidation is up to caller)
  - must not make a blocking non-emergency memory allocation
direct_on does any pre-allocation and pre-reading to ensure those
semantics can be provided.

I have wondered if an extra flag along the lines of "I don't care
about this data after a crash" would be useful.
It would be set for swap, but not set for other users.  Thus
e.g. RAID1 could easily avoid resyncing an area that was used only for
swap.

The only thing of Jens' that I could find used bmap - is there
something more recent I should look for?

> 
> Especially the latter is an absolutely essential step to make any
> progress here, and an excellent patch series of its own as there are
> multiple users for this, like making swap safe on btrfs files, making
> the MD bitmap code actually safe or improving the loop driver.

100% agree.

Thanks,
NeilBrown


^ permalink raw reply

* Re: query: adding a sysctl
From: Stephen Hemminger @ 2009-10-02  5:57 UTC (permalink / raw)
  To: William Allen Simpson; +Cc: netdev
In-Reply-To: <4AC57AC5.3080703@gmail.com>

On Fri, 02 Oct 2009 00:00:05 -0400
William Allen Simpson <william.allen.simpson@gmail.com> wrote:

> [My first post here, hopefully not a FAQ, as I've googled it, but cannot find
> the definitive answer.]
> 
> I've been trying to add a sysctl, and I've noticed this message:
> 
> sysctl table check failed: /net/ipv4/tcp_cookie_size .3.5.126 Unknown sysctl binary path
> 
> I modeled the code on sysctl_tcp_syncookies, and apparently I'm missing some
> additional magic?  Or does something need to be done other than C?

The sysctl table check code is in kernel/sysctl_check.c; it maps numerical
sysctl values to /proc paths so that the permission checks on the numeric
sysctl match those of the /proc file involved.

Hint: the easiest way to find things out is to use git grep
to see how any related sysctl is implemented.

BUT numbered sysctl values are deprecated and should no longer be added.
The current way is to use CTL_UNNUMBERED instead; if you use CTL_UNNUMBERED,
the table does not need to be changed.
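
For illustration, an unnumbered entry in the ipv4 sysctl table of this
kernel era might look roughly like the fragment below. It is a sketch under
assumptions, not a tested patch: the backing variable
sysctl_tcp_cookie_size is a hypothetical name (the original post only gives
the /proc path), the knob is assumed to be a plain int, and the fragment is
one element of an existing ctl_table array, so it does not compile on its
own.

```c
/*
 * Hypothetical element of the ipv4 ctl_table array.
 * sysctl_tcp_cookie_size is an assumed variable name, not taken from the
 * original post.
 */
{
	.ctl_name	= CTL_UNNUMBERED,	/* no binary sysctl number */
	.procname	= "tcp_cookie_size",	/* /proc/sys/net/ipv4/tcp_cookie_size */
	.data		= &sysctl_tcp_cookie_size,
	.maxlen		= sizeof(int),
	.mode		= 0644,
	.proc_handler	= proc_dointvec,	/* generic int read/write handler */
},
```

Because no binary number is assigned, there is no binary path for the
table check to look up.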

-- 

^ permalink raw reply

* Re: [PATCH] Use sk_mark for routing lookup in more places
From: Eric Dumazet @ 2009-10-02  6:08 UTC (permalink / raw)
  Cc: David Miller, atis, panther, netdev
In-Reply-To: <4AC58C46.8080408@gmail.com>

Eric Dumazet wrote:
> Here is a followup on this area, thanks.
> 
> [RFC] af_packet: fill skb->mark at xmit
> 
> skb->mark may be used by classifiers, so fill it in case user 
> set a SO_MARK option on socket.
> 

Maybe a more generic way to handle this for various protocols
would be to fill skb->mark in sock_alloc_send_pskb().



^ permalink raw reply

* [PATCH] cnic: Fix NETDEV_UP event processing.
From: Michael Chan @ 2009-10-02  6:17 UTC (permalink / raw)
  To: davem; +Cc: netdev, michaelc, Michael Chan, Benjamin Li

This fixes the problem of not handling the NETDEV_UP event properly
during hot-plug or modprobe of bnx2 after cnic.  The handling was
skipped by mistakenly using "else if" to check for the event.

Also update version to 2.0.1.

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: Benjamin Li <benli@broadcom.com>
---
 drivers/net/cnic.c    |    3 ++-
 drivers/net/cnic_if.h |    4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/cnic.c b/drivers/net/cnic.c
index 211c8e9..46c87ec 100644
--- a/drivers/net/cnic.c
+++ b/drivers/net/cnic.c
@@ -2733,7 +2733,8 @@ static int cnic_netdev_event(struct notifier_block *this, unsigned long event,
 			cnic_ulp_init(dev);
 		else if (event == NETDEV_UNREGISTER)
 			cnic_ulp_exit(dev);
-		else if (event == NETDEV_UP) {
+
+		if (event == NETDEV_UP) {
 			if (cnic_register_netdev(dev) != 0) {
 				cnic_put(dev);
 				goto done;
diff --git a/drivers/net/cnic_if.h b/drivers/net/cnic_if.h
index a492357..d8b09ef 100644
--- a/drivers/net/cnic_if.h
+++ b/drivers/net/cnic_if.h
@@ -12,8 +12,8 @@
 #ifndef CNIC_IF_H
 #define CNIC_IF_H
 
-#define CNIC_MODULE_VERSION	"2.0.0"
-#define CNIC_MODULE_RELDATE	"May 21, 2009"
+#define CNIC_MODULE_VERSION	"2.0.1"
+#define CNIC_MODULE_RELDATE	"Oct 01, 2009"
 
 #define CNIC_ULP_RDMA		0
 #define CNIC_ULP_ISCSI		1
-- 
1.6.4.GIT
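
The effect of the "else if" described in the changelog can be sketched in
plain userspace C. This is only an illustration under assumptions: the
condition guarding cnic_ulp_init() is not visible in the hunk, so a flag
named is_new_dev stands in for it, and all demo_* names, handle_before(),
handle_after() and the result struct are hypothetical, not cnic code.

```c
#include <assert.h>

/* Hypothetical stand-ins for the netdev notifier events. */
enum demo_event { DEMO_REGISTER, DEMO_UNREGISTER, DEMO_UP };

/* Records which branches ran for one event. */
struct demo_result {
	int ulp_init;
	int ulp_exit;
	int up_handled;
};

static void demo_check_event(enum demo_event ev)
{
	assert(ev == DEMO_REGISTER || ev == DEMO_UNREGISTER || ev == DEMO_UP);
}

/* Before the patch: the UP handling is the last arm of one else-if chain. */
static struct demo_result handle_before(int is_new_dev, enum demo_event ev)
{
	struct demo_result r = { 0, 0, 0 };

	demo_check_event(ev);
	if (is_new_dev)
		r.ulp_init = 1;
	else if (ev == DEMO_UNREGISTER)
		r.ulp_exit = 1;
	else if (ev == DEMO_UP)
		r.up_handled = 1;	/* never reached when is_new_dev is set */
	return r;
}

/* After the patch: the UP check is an independent if statement. */
static struct demo_result handle_after(int is_new_dev, enum demo_event ev)
{
	struct demo_result r = { 0, 0, 0 };

	demo_check_event(ev);
	if (is_new_dev)
		r.ulp_init = 1;
	else if (ev == DEMO_UNREGISTER)
		r.ulp_exit = 1;

	if (ev == DEMO_UP)
		r.up_handled = 1;	/* runs regardless of the chain above */
	return r;
}
```

With is_new_dev set and a DEMO_UP event, handle_before() runs only the init
branch, while handle_after() runs the init branch and then the UP handling.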



^ permalink raw reply related

* [Question]: reqsk table size limited to 16?
From: Gerrit Renker @ 2009-10-02  6:11 UTC (permalink / raw)
  To: netdev

Can someone please have a look, it may be that I am missing something?

It seems that in the following the maximum number of table entries is set
to always 16, despite sysctl_max_syn_backlog (tcp_max_syn_backlog), 
overriding the 'backlog' parameter to listen(2).

net/core/request_sock.c
-----------------------

int reqsk_queue_alloc(struct request_sock_queue *queue,
                      unsigned int nr_table_entries)
{
        size_t lopt_size = sizeof(struct listen_sock);
        struct listen_sock *lopt;

	nr_table_entries = min_t(u32, nr_table_entries, sysctl_max_syn_backlog);
        nr_table_entries = max_t(u32, nr_table_entries, 8);
        nr_table_entries = roundup_pow_of_two(nr_table_entries + 1);

	//...
	for (lopt->max_qlen_log = 3;
             (1 << lopt->max_qlen_log) < nr_table_entries;
             lopt->max_qlen_log++);

 	//...
	lopt->nr_table_entries = nr_table_entries;
	
	//...
	return 0;
}

The function is called with an argument 'nr_table_entries', which is then clamped as

   sysctl_max_syn_backlog <= nr_table_entries <= 8

If nr_table_entries = 8, then roundup_pow_of_two(8 + 1) = 16.

The sysctl value is set to a much higher value (default 128 or 1024, net/ipv4/tcp.c).

The reqsk_queue_alloc() gets 'nr_table_entries' passed directly from inet_csk_listen_start(),
which in turn gets its 'nr_table_entries' as the 'backlog' argument to listen(2) via
 * net/dccp/proto.c   (dccp_listen_start) or
 * net/ipv4/af_inet.c (inet_listen).

^ permalink raw reply

* Re: [PATCH] connector: Fix regression introduced by sid connector
From: Christian Borntraeger @ 2009-10-02  6:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: oleg, scott, zbr, linux-kernel, matthltc, davem, netdev
In-Reply-To: <20091001141426.2c1a0139.akpm@linux-foundation.org>

Sorry about that. I don't know how this escaped.  It was probably hidden among
all the sparse warnings I get in kernel/*, combined with my lack of knowledge
in this area.
Here is a new version:



Since commit 02b51df1b07b4e9ca823c89284e704cadb323cd1 (proc connector: add event
for process becoming session leader) we have the following warning:
Badness at kernel/softirq.c:143
[...]
Krnl PSW : 0404c00180000000 00000000001481d4 (local_bh_enable+0xb0/0xe0)
[...]
Call Trace:
([<000000013fe04100>] 0x13fe04100)
 [<000000000048a946>] sk_filter+0x9a/0xd0
 [<000000000049d938>] netlink_broadcast+0x2c0/0x53c
 [<00000000003ba9ae>] cn_netlink_send+0x272/0x2b0
 [<00000000003baef0>] proc_sid_connector+0xc4/0xd4
 [<0000000000142604>] __set_special_pids+0x58/0x90
 [<0000000000159938>] sys_setsid+0xb4/0xd8
 [<00000000001187fe>] sysc_noemu+0x10/0x16
 [<00000041616cb266>] 0x41616cb266

The warning is
--->    WARN_ON_ONCE(in_irq() || irqs_disabled());

The network code must not be called with interrupts disabled, but
sys_setsid holds the tasklist_lock with write_lock_irq while calling
the connector.
After a discussion we agreed that we can move proc_sid_connector
from __set_special_pids to sys_setsid.
We also agreed that it is sufficient to change the check from
task_session(curr) != pid into err > 0, since if we don't change the
session, this means we were already the leader and return -EPERM.

One last thing:
There is also daemonize(), and some people might want to get a
notification in that case. Since daemonize() is only needed when a
user-space process creates a kernel thread via kernel_thread(), this does
not look important (and there seems to be no consensus on whether this
connector should be called in daemonize()). If we really want this, we can
add proc_sid_connector to daemonize() in an additional patch (Scott?)

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
CCed: Scott James Remnant <scott@ubuntu.com>
CCed: Matt Helsley <matthltc@us.ibm.com>
CCed: David S. Miller <davem@davemloft.net>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
---
 kernel/exit.c |    4 +---
 kernel/sys.c  |    2 ++
 2 files changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/exit.c
===================================================================
--- linux-2.6.orig/kernel/exit.c
+++ linux-2.6/kernel/exit.c
@@ -359,10 +359,8 @@ void __set_special_pids(struct pid *pid)
 {
 	struct task_struct *curr = current->group_leader;
 
-	if (task_session(curr) != pid) {
+	if (task_session(curr) != pid)
 		change_pid(curr, PIDTYPE_SID, pid);
-		proc_sid_connector(curr);
-	}
 
 	if (task_pgrp(curr) != pid)
 		change_pid(curr, PIDTYPE_PGID, pid);
Index: linux-2.6/kernel/sys.c
===================================================================
--- linux-2.6.orig/kernel/sys.c
+++ linux-2.6/kernel/sys.c
@@ -1110,6 +1110,8 @@ SYSCALL_DEFINE0(setsid)
 	err = session;
 out:
 	write_unlock_irq(&tasklist_lock);
+	if (err > 0)
+		proc_sid_connector(group_leader);
 	return err;
 }
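
The shape of the fix (decide under the lock, notify only after dropping it)
can be sketched in userspace C. This is a loose analogy under assumptions,
not kernel code: a pthread mutex stands in for tasklist_lock taken with
write_lock_irq, -1 stands in for -EPERM, and demo_setsid(),
notify_session_change() and the counters are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static int lock_held;		/* set while table_lock is held */
static int is_session_leader;	/* state protected by table_lock */
static int notify_count;	/* number of notifications sent */

/* Stand-in for proc_sid_connector(): must not run with the lock held. */
static void notify_session_change(void)
{
	assert(!lock_held);
	notify_count++;
}

/* Stand-in for sys_setsid(): returns the new session id or -1. */
static int demo_setsid(int pid)
{
	int err = -1;		/* stands in for -EPERM */

	pthread_mutex_lock(&table_lock);
	lock_held = 1;
	if (!is_session_leader) {
		is_session_leader = 1;
		err = pid;	/* session actually changed */
	}
	lock_held = 0;
	pthread_mutex_unlock(&table_lock);

	/* Notify only after the lock is dropped, and only on success. */
	if (err > 0)
		notify_session_change();
	return err;
}
```

A first call such as demo_setsid(42) changes the session and sends one
notification; a second call fails and sends none, which mirrors the
err > 0 check in the patch.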
 

^ permalink raw reply

* Re: [Question]: reqsk table size limited to 16?
From: Gerrit Renker @ 2009-10-02  6:25 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20091002061134.GC5646@gerrit.erg.abdn.ac.uk>

Please forget the posting, this is correct; the clamping is

  8 <= nr_table_entries <=  sysctl_max_syn_backlog,

i.e. the minimum table size is 16.

Quoting Gerrit:
| Can someone please have a look, it may be that I am missing something?
| 
| It seems that in the following the maximum number of table entries is set
| to always 16, despite sysctl_max_syn_backlog (tcp_max_syn_backlog), 
| overriding the 'backlog' parameter to listen(2).
| 
| net/core/request_sock.c
| -----------------------
| 
| int reqsk_queue_alloc(struct request_sock_queue *queue,
|                       unsigned int nr_table_entries)
| {
|         size_t lopt_size = sizeof(struct listen_sock);
|         struct listen_sock *lopt;
| 
| 	nr_table_entries = min_t(u32, nr_table_entries, sysctl_max_syn_backlog);
|         nr_table_entries = max_t(u32, nr_table_entries, 8);
|         nr_table_entries = roundup_pow_of_two(nr_table_entries + 1);
| 
| 	//...
| 	for (lopt->max_qlen_log = 3;
|              (1 << lopt->max_qlen_log) < nr_table_entries;
|              lopt->max_qlen_log++);
| 
|  	//...
| 	lopt->nr_table_entries = nr_table_entries;
| 	
| 	//...
| 	return 0
| }
| 
| The function is called with an argument 'nr_table_entries', which is then clamped as
| 
|    sysctl_max_syn_backlog <= nr_table_entries <= 8
| 
| If nr_table_entries = 8, then round_pow_of_two(8 + 1) = 16.
| 
| The sysctl value is set to a much higher value (default 128 or 1024, net/ipv4/tcp.c).
| 
| The reqsk_queue_alloc() gets 'nr_table_entries' passed directly from inet_csk_listen_start(),
| which in turn gets its 'nr_table_entries' as the 'backlog' argument to listen(2) via
|  * net/dccp/proto.c   (dccp_listen_start) or
|  * net/ipv4/af_inet.c (inet_listen).

-- 

^ permalink raw reply

* [BUG net-2.6] bluetooth/rfcomm : sleeping function called from invalid context at mm/slub.c:1719
From: Oliver Hartkopp @ 2009-10-02  6:28 UTC (permalink / raw)
  To: Marcel Holtmann; +Cc: Linux Netdev List, linux-bluetooth-u79uwXL29TY76Z2rM5mHXA

Hello Marcel,

with current net-2.6 tree ...

While starting my PPP Bluetooth dialup networking, I got this:

[  722.461549] PPP generic driver version 2.4.2
[  722.477519] BUG: sleeping function called from invalid context at mm/slub.c:1719
[  722.477530] in_atomic(): 1, irqs_disabled(): 0, pid: 4677, name: pppd
[  722.477537] 3 locks held by pppd/4677:
[  722.477542]  #0:  (rfcomm_mutex){+.+.+.}, at: [<fa5df2a1>] rfcomm_dlc_open+0x28/0x2d6 [rfcomm]
[  722.477568]  #1:  (sk_lock-AF_BLUETOOTH-BTPROTO_L2CAP){+.+.+.}, at: [<fa5414f8>] l2cap_sock_connect+0x62/0x2c6 [l2cap]
[  722.477589]  #2:  (&hdev->lock){+...+.}, at: [<fa5415b4>] l2cap_sock_connect+0x11e/0x2c6 [l2cap]
[  722.477613] Pid: 4677, comm: pppd Not tainted 2.6.31-08939-gdb8abec-dirty #21
[  722.477619] Call Trace:
[  722.477633]  [<c1042a2b>] ? __debug_show_held_locks+0x1e/0x20
[  722.477644]  [<c10212a1>] __might_sleep+0xc9/0xce
[  722.477655]  [<c1078b62>] __kmalloc+0x6d/0xfb
[  722.477666]  [<c119e739>] ? kzalloc+0xb/0xd
[  722.477674]  [<c119e739>] kzalloc+0xb/0xd
[  722.477683]  [<c119ef1a>] device_private_init+0x15/0x3d
[  722.477693]  [<c11a0e1b>] dev_set_drvdata+0x18/0x26
[  722.477718]  [<f8b7ca1b>] hci_conn_init_sysfs+0x3d/0xc7 [bluetooth]
[  722.477737]  [<f8b791b3>] hci_conn_add+0x1c0/0x1d5 [bluetooth]
[  722.477756]  [<f8b79360>] hci_connect+0x71/0x17d [bluetooth]
[  722.477769]  [<fa54162c>] l2cap_sock_connect+0x196/0x2c6 [l2cap]
[  722.477782]  [<c1246e3d>] kernel_connect+0xd/0x12
[  722.477795]  [<fa5df3c3>] rfcomm_dlc_open+0x14a/0x2d6 [rfcomm]
[  722.477810]  [<fa5e10fa>] ? rfcomm_tty_open+0x73/0x227 [rfcomm]
[  722.477825]  [<fa5e1130>] rfcomm_tty_open+0xa9/0x227 [rfcomm]
[  722.477836]  [<c1022e3f>] ? default_wake_function+0x0/0xd
[  722.477847]  [<c1180c79>] tty_open+0x29e/0x399
[  722.477858]  [<c107e9bd>] chrdev_open+0x13f/0x156
[  722.477868]  [<c107b0d3>] __dentry_open+0x11b/0x20f
[  722.477878]  [<c107b261>] nameidata_to_filp+0x2c/0x43
[  722.477888]  [<c107e87e>] ? chrdev_open+0x0/0x156
[  722.477898]  [<c1084e9e>] do_filp_open+0x3c6/0x70a
[  722.477910]  [<c108d3e4>] ? alloc_fd+0xc8/0xd2
[  722.477920]  [<c108d3e4>] ? alloc_fd+0xc8/0xd2
[  722.477930]  [<c107aebc>] do_sys_open+0x4a/0xe7
[  722.477940]  [<c1002acc>] ? restore_all_notrace+0x0/0x18
[  722.477950]  [<c107af9b>] sys_open+0x1e/0x26
[  722.477959]  [<c1002a18>] sysenter_do_call+0x12/0x36
[  729.658613] PPP BSD Compression module registered
[  729.684789] PPP Deflate Compression module registered

Any idea?

Regards,
Oliver

^ permalink raw reply

* Re: [Question]: reqsk table size limited to 16?
From: Eric Dumazet @ 2009-10-02  6:49 UTC (permalink / raw)
  To: Gerrit Renker, netdev
In-Reply-To: <20091002061134.GC5646@gerrit.erg.abdn.ac.uk>

Gerrit Renker wrote:
> Can someone please have a look, it may be that I am missing something?
> 
> It seems that in the following the maximum number of table entries is set
> to always 16, despite sysctl_max_syn_backlog (tcp_max_syn_backlog), 
> overriding the 'backlog' parameter to listen(2).

False alarm ;)

> 
> net/core/request_sock.c
> -----------------------
> 
> int reqsk_queue_alloc(struct request_sock_queue *queue,
>                       unsigned int nr_table_entries)
> {
>         size_t lopt_size = sizeof(struct listen_sock);
>         struct listen_sock *lopt;
> 
> 	nr_table_entries = min_t(u32, nr_table_entries, sysctl_max_syn_backlog);

Here we take the _minimum_ value.
If you have  nr_table_entries=4096 and sysctl_max_syn_backlog=1024,
result is 1024

>         nr_table_entries = max_t(u32, nr_table_entries, 8);

Here we take the _maximum_ value of nr_table_entries and 8

-> 1024

The deal is: we want at least 8 slots, even if the user called listen(fd, 1);

(Later, the user can change their mind and call listen(fd, 1024).)

We don't resize the hash table yet, so we guarantee at least 8 slots for pathological cases.

>         nr_table_entries = roundup_pow_of_two(nr_table_entries + 1);
> 
> 	//...
> 	for (lopt->max_qlen_log = 3;
>              (1 << lopt->max_qlen_log) < nr_table_entries;
>              lopt->max_qlen_log++);
> 
>  	//...
> 	lopt->nr_table_entries = nr_table_entries;
> 	
> 	//...
> 	return 0
> }
> 
> The function is called with an argument 'nr_table_entries', which is then clamped as
> 
>    sysctl_max_syn_backlog <= nr_table_entries <= 8
> 
> If nr_table_entries = 8, then round_pow_of_two(8 + 1) = 16.
> 
> The sysctl value is set to a much higher value (default 128 or 1024, net/ipv4/tcp.c).
> 
> The reqsk_queue_alloc() gets 'nr_table_entries' passed directly from inet_csk_listen_start(),
> which in turn gets its 'nr_table_entries' as the 'backlog' argument to listen(2) via
>  * net/dccp/proto.c   (dccp_listen_start) or
>  * net/ipv4/af_inet.c (inet_listen).
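
The arithmetic walked through above can be checked with a small userspace
sketch. It mirrors the three quoted statements and the max_qlen_log loop,
under these assumptions: roundup_pow2() is a simple stand-in for the
kernel's roundup_pow_of_two(), and reqsk_table_entries() and
reqsk_max_qlen_log() are hypothetical helper names that do not exist in
net/core/request_sock.c.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for roundup_pow_of_two(): smallest power of two >= n. */
static uint32_t roundup_pow2(uint32_t n)
{
	uint32_t p = 1;

	assert(n != 0 && n <= 0x80000000u);
	while (p < n)
		p <<= 1;
	return p;
}

/* Table size for a given listen() backlog and sysctl limit. */
static uint32_t reqsk_table_entries(uint32_t backlog, uint32_t max_syn_backlog)
{
	uint32_t n = backlog;

	if (n > max_syn_backlog)	/* min_t(u32, n, sysctl_max_syn_backlog) */
		n = max_syn_backlog;
	if (n < 8)			/* max_t(u32, n, 8) */
		n = 8;
	return roundup_pow2(n + 1);
}

/* Mirrors the max_qlen_log loop from the quoted function. */
static unsigned int reqsk_max_qlen_log(uint32_t nr_table_entries)
{
	unsigned int log = 3;

	while ((UINT32_C(1) << log) < nr_table_entries)
		log++;
	return log;
}
```

With a sysctl limit of 1024, reqsk_table_entries(1, 1024) gives 16 (the
minimum table size), while reqsk_table_entries(4096, 1024) gives 2048,
because 1024 + 1 is rounded up to the next power of two.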


^ permalink raw reply

