Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: Changli Gao @ 2010-04-20 23:35 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, franco, therbert, netdev
In-Reply-To: <20100420.144106.118596093.davem@davemloft.net>

On Wed, Apr 21, 2010 at 5:41 AM, David Miller <davem@davemloft.net> wrote:
> Eric, do you remember that "TCP friends" rough patch I sent you last
> year that essentailly made TCP sockets over loopback behave like
> AF_UNIX ones and just queue the SKBs directly to the destination
> socket without doing any protocol work?

I think it will break some benchmark tools. The loopback device is for
testing networking protocol stacks, so we shouldn't bypass the
protocol processing. And anyone who has a performance problem of
loopback device should turn to UNIX domain socket.

For routers, how about letting users choose whether RPS mixes layer 4 info in?

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: David Miller @ 2010-04-20 23:38 UTC (permalink / raw)
  To: xiaosuo; +Cc: eric.dumazet, franco, therbert, netdev
In-Reply-To: <v2s412e6f7f1004201635m5258e777oac49fea8f9625bcf@mail.gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Wed, 21 Apr 2010 07:35:48 +0800

> On Wed, Apr 21, 2010 at 5:41 AM, David Miller <davem@davemloft.net> wrote:
>> Eric, do you remember that "TCP friends" rough patch I sent you last
>> year that essentailly made TCP sockets over loopback behave like
>> AF_UNIX ones and just queue the SKBs directly to the destination
>> socket without doing any protocol work?
> 
> I think it will break some benchmark tools.

Other systems already do this optimization, so if things break, this
breakage is already pervasive.

We should be able to tell people that they can use TCP solely in their
applications and it will perform optimally regardless of transport.

People already code their applications this way, and ignoring
this issue would just makes us stupid.

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: sk_sleep() helper
From: David Miller @ 2010-04-20 23:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1271804631.7895.519.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 21 Apr 2010 01:03:51 +0200

> [PATCH net-next-2.6] net: sk_sleep() helper

Looks good, applied, thanks Eric.

^ permalink raw reply

* crash with bridge and inconsistent handling of NETDEV_TX_OK
From: Mikulas Patocka @ 2010-04-20 23:40 UTC (permalink / raw)
  To: lists.linux-foundation.org, netdev; +Cc: kaber, davem

Hi

I got this crash on 2.6.34-rc4 when using it as a bridge. The crash is not 
reproducible. There are two interfaces, eth0 (sun-hme) and eth1 (tg3) and 
there is a bridge between them.

The crash was in inlined function skb_drop_list, inlined from 
skb_drop_fraglist from skb_release_data. The "list" pointer was 0x18.

This is backtrace:
skb_release_data+7c/e0
__kfree_skb+c/c0
dev_hard_start_xmit+288/3c0
sch_direct_xmit+12c/1c0
dev_queue_xmit+3fc/520
br_dev_queue_push_xmit+60/80
br_handle_frame_finish+100/1e00
br_handle_frame+168/240
netif_receive_skb+274/480
process_backlog+64/c0
net_rx_action+cc/180
__do_softirq+94/120

Settings for for eth0 and eth1 are:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off

For br0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: off
udp-fragmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off

(there is no netfilter compiled on that machine)

I'm curious --- this code path in dev_hard_start_xmit is taken only if GSO 
is used. From the trace, it can be seen that a packet was received by one 
nic and forwarded by the bridge to the other nic. How could GSO be used in 
this scenario?

---

Reviewing the code further, I found one very weird commit 
572a9d7b6fc7f20f573664063324c086be310c42 committed to 2.6.33. What it 
does, it changes the semantics of ndo_hard_start_xmit(). Prior to the 
patch, the meaning was --- return zero (NETDEV_TX_OK) --- the skb is 
consumed by the driver. Returns non-zero --- the skb is left owned by the 
caller. The patch changes it to return other flags in bits 4-7 and changes 
the consumed/returned logic.

The problem is that there is still plenty of code that compares it against 
NETDEV_TX_OK to find out if the skb was consumed.

I don't know if this caused my crash (it is not reproducible), but the 
patch is buggy and dangerous.

Handling of the return codes is inconsistent:
* most callers use dev_xmit_complete(int rc) { return rc < 
NET_XMIT_MASK(0x0f); } to test if the skb was consumed. This seems to be 
the method to be used intended by the developers. (why not <= 0x0f ?)

* dev_hard_start_xmit uses rc == NETDEV_TX_OK || rc & ~NETDEV_TX_MASK 
(~0xf0) to test if the skb is consumed. This is differs from 
dev_xmit_complete for some values.

net/core/netpoll.c : at two places, it compares the return value of 
ndo_start_xmit with NETDEV_TX_OK
net/sched/sch_teql.c : compares with NETDEV_TX_OK
drivers/net/wan/dlci.c : ignores the return value of ndo_start_xmit

--- the above inconsistencies lead to situations where both the caller and 
the callee think that they own the skb, and the result is memory 
corruption.


If we grep for places that assume that the packet was consumed if the 
return code is NETDEV_TX_OK, there are even more:
drivers/net/qla3xxx.c
drivers/net/chelsio/sge.c
drivers/net/wireless/hostap/hostap_80211_tx.c
drivers/net/wireless/ipw2x00/libipw_tx.c
drivers/net/sfc/selftest.c
net/mac80211/tx.c
--- Not all these cases are bugs, because sometimes the return value is 
self-generated by the driver. But it is just confusing and it may trigger 
bugs in the future.

---

I'd recommend to revert 572a9d7b6fc7f20f573664063324c086be310c42 because 
it made several bugs (the code that compared the return value against 
NETDEV_TX_OK was correct before and is buggy after the patch). 
Furthermore, there may be hidden bugs, if someone compares the return 
value with 0 and frees skb based on this comparison, it is impossible to 
find it with grep, yet changing the meaning of return values would make a 
bug here.

Mikulas

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Eric Dumazet @ 2010-04-20 23:42 UTC (permalink / raw)
  To: Gaspar Chilingarov; +Cc: netdev
In-Reply-To: <v2n46c8cb3e1004201618s37e9f0c6j7527e59673a34322@mail.gmail.com>

Le mercredi 21 avril 2010 à 04:18 +0500, Gaspar Chilingarov a écrit :
> >
> > Its doable, only if you bind() your sockets before connect()
> >
> > For each socket, you'll need to chose an (local IP, local port) not
> > already in use.
> >
> > kernel wont magically select one source IP from the pool you have.
> >
> 
> I expect that I should select source IP and kernel should find some
> unused local_port for me.
> 
> I'm binding before the connect, of course ;)
> 
> 

Yes, you bind the IP, but let kernel choose the port.

In this case, kernel is a bit dumb (or too smart ?)

If you want to check source, its in file net/ipv4/inet_connection_sock.c

function inet_csk_get_port()

you can remove the 

if (atomic_read(&hashinfo->bsockets) > (high - low) + 1) { 
	spin_unlock(&head->lock); 
	snum = smallest_rover; 
	goto have_snum;
}

It will solve your problem (but bind() will probably be slow)




^ permalink raw reply

* Re: crash with bridge and inconsistent handling of NETDEV_TX_OK
From: David Miller @ 2010-04-20 23:45 UTC (permalink / raw)
  To: mpatocka; +Cc: lists.linux-foundation.org, netdev, kaber
In-Reply-To: <Pine.LNX.4.64.1004201834190.29017@hs20-bc2-1.build.redhat.com>

From: Mikulas Patocka <mpatocka@redhat.com>
Date: Tue, 20 Apr 2010 19:40:56 -0400 (EDT)

> Reviewing the code further, I found one very weird commit
> 572a9d7b6fc7f20f573664063324c086be310c42 committed to 2.6.33. What
> it does, it changes the semantics of ndo_hard_start_xmit(). Prior to
> the patch, the meaning was --- return zero (NETDEV_TX_OK) --- the
> skb is consumed by the driver. Returns non-zero --- the skb is left
> owned by the caller.  The patch changes it to return other flags in
> bits 4-7 and changes the consumed/returned logic.
> 
> The problem is that there is still plenty of code that compares it
> against NETDEV_TX_OK to find out if the skb was consumed.

Drivers are not supposed to return those new flag bits, the new flag
bits as return values exist only in the packet scheduler path.

We're not reverting the commit you mention, we're going to fix
whatever bug exists instead.

^ permalink raw reply

* Re: crash with bridge and inconsistent handling of NETDEV_TX_OK
From: David Miller @ 2010-04-20 23:48 UTC (permalink / raw)
  To: mpatocka; +Cc: lists.linux-foundation.org, netdev, kaber
In-Reply-To: <20100420.164505.34362726.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Tue, 20 Apr 2010 16:45:05 -0700 (PDT)

> From: Mikulas Patocka <mpatocka@redhat.com>
> Date: Tue, 20 Apr 2010 19:40:56 -0400 (EDT)
> 
>> Reviewing the code further, I found one very weird commit
>> 572a9d7b6fc7f20f573664063324c086be310c42 committed to 2.6.33. What
>> it does, it changes the semantics of ndo_hard_start_xmit(). Prior to
>> the patch, the meaning was --- return zero (NETDEV_TX_OK) --- the
>> skb is consumed by the driver. Returns non-zero --- the skb is left
>> owned by the caller.  The patch changes it to return other flags in
>> bits 4-7 and changes the consumed/returned logic.
>> 
>> The problem is that there is still plenty of code that compares it
>> against NETDEV_TX_OK to find out if the skb was consumed.
> 
> Drivers are not supposed to return those new flag bits, the new flag
> bits as return values exist only in the packet scheduler path.

And BTW, NETDEV_TX_OK is only ever returned by itself, the
flag bits only get set when a non-NETDEV_TX_OK value is returned.

So we really haven't changed semantics at all, NETDEV_TX_OK (which is
zero) and non-zero are the two valid return value cases.

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Ben Greear @ 2010-04-20 23:49 UTC (permalink / raw)
  To: Gaspar Chilingarov; +Cc: netdev
In-Reply-To: <y2p46c8cb3e1004201635s8f84672dq90d3115dca04a1ed@mail.gmail.com>

On 04/20/2010 04:35 PM, Gaspar Chilingarov wrote:
> sysctl -a | grep local_port_range

[root@ct503-10G-09 ~]# sysctl -a | grep local_port_range
net.ipv4.ip_local_port_range = 10000	61000

I'm explicitly binding to local ports as well as local IPs, btw.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Gaspar Chilingarov @ 2010-04-20 23:57 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev
In-Reply-To: <4BCE3D8D.3030500@candelatech.com>

Ohhh, see my bug report again, please :)

The problem arises when used port count is close to or more than the
number of local ports available on the system (in your case 40000
client connections is smaller than 51000 of available ports).

I think that if you choose to go with default settings and try to bind
half of your ports to one local IP and half of your ports - to another
one -- you will hit this bug as well.

If you explicitly bind to the local ports as well - it _may_ solve a
problem - I will this today (it's just a workaround, but in fact not
the solution :), but if you let kernel automatically find local ports
- you will get the errors.

/Gaspar


2010/4/21 Ben Greear <greearb@candelatech.com>:
> On 04/20/2010 04:35 PM, Gaspar Chilingarov wrote:
>>
>> sysctl -a | grep local_port_range
>
> [root@ct503-10G-09 ~]# sysctl -a | grep local_port_range
> net.ipv4.ip_local_port_range = 10000    61000
>
> I'm explicitly binding to local ports as well as local IPs, btw.
>
> Thanks,
> Ben
>
> --
> Ben Greear <greearb@candelatech.com>
> Candela Technologies Inc  http://www.candelatech.com
>



-- 
Gaspar Chilingarov

^ permalink raw reply

* Re: crash with bridge and inconsistent handling of NETDEV_TX_OK
From: Mikulas Patocka @ 2010-04-20 23:57 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, kaber
In-Reply-To: <20100420.164505.34362726.davem@davemloft.net>

On Tue, 20 Apr 2010, David Miller wrote:

> From: Mikulas Patocka <mpatocka@redhat.com>
> Date: Tue, 20 Apr 2010 19:40:56 -0400 (EDT)
> 
> > Reviewing the code further, I found one very weird commit
> > 572a9d7b6fc7f20f573664063324c086be310c42 committed to 2.6.33. What
> > it does, it changes the semantics of ndo_hard_start_xmit(). Prior to
> > the patch, the meaning was --- return zero (NETDEV_TX_OK) --- the
> > skb is consumed by the driver. Returns non-zero --- the skb is left
> > owned by the caller.  The patch changes it to return other flags in
> > bits 4-7 and changes the consumed/returned logic.
> > 
> > The problem is that there is still plenty of code that compares it
> > against NETDEV_TX_OK to find out if the skb was consumed.
> 
> Drivers are not supposed to return those new flag bits, the new flag
> bits as return values exist only in the packet scheduler path.

If they are not supposed to return it, why do you test it in 
dev_hard_start_xmit?

Define clearly what the return status is allowed (and maybe enforce it 
with BUG()s) and follow that definition.

Currently, there is no specification and the code is unintelligible crap: 
it first masks the value with ~0xf0 to test if the skb is consumed 
(dev_hard_start_xmit) and then compares it with 0x0f for the same test 
(dev_queue_xmit).

> We're not reverting the commit you mention, we're going to fix
> whatever bug exists instead.

So have a fun searching for them :-/

Mikulas

^ permalink raw reply

* Re: crash with bridge and inconsistent handling of NETDEV_TX_OK
From: David Miller @ 2010-04-21  0:00 UTC (permalink / raw)
  To: mpatocka; +Cc: netdev, kaber
In-Reply-To: <Pine.LNX.4.64.1004201947470.29017@hs20-bc2-1.build.redhat.com>

From: Mikulas Patocka <mpatocka@redhat.com>
Date: Tue, 20 Apr 2010 19:57:45 -0400 (EDT)

>> We're not reverting the commit you mention, we're going to fix
>> whatever bug exists instead.
> 
> So have a fun searching for them :-/

When you report a problem and your knee-jerk reaction is "revert this
commit", expect some push back.

We'll instead analyze this bug and take the time to figure out the
most appropriate course of action.

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Eric Dumazet @ 2010-04-21  0:05 UTC (permalink / raw)
  To: Ben Greear, David Miller; +Cc: Gaspar Chilingarov, netdev
In-Reply-To: <4BCE3D8D.3030500@candelatech.com>

Le mardi 20 avril 2010 à 16:49 -0700, Ben Greear a écrit :
> On 04/20/2010 04:35 PM, Gaspar Chilingarov wrote:
> > sysctl -a | grep local_port_range
> 
> [root@ct503-10G-09 ~]# sysctl -a | grep local_port_range
> net.ipv4.ip_local_port_range = 10000	61000
> 
> I'm explicitly binding to local ports as well as local IPs, btw.
> 

I believe the bsockets 'optimization' is a bug, we should remove it.

This is a stable candidate (2.6.30+)

[PATCH net-next-2.6] tcp: remove bsockets count

Counting number of bound sockets to avoid a loop is buggy, since we cant
know how many IP addresses are in use. When threshold is reached, we try
5 random slots and can fail while there are plenty available ports.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/inet_hashtables.h   |    2 --
 net/ipv4/inet_connection_sock.c |    5 -----
 net/ipv4/inet_hashtables.c      |    5 -----
 3 files changed, 12 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 74358d1..e0f3a05 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -150,8 +150,6 @@ struct inet_hashinfo {
 	 */
 	struct inet_listen_hashbucket	listening_hash[INET_LHTABLE_SIZE]
 					____cacheline_aligned_in_smp;
-
-	atomic_t			bsockets;
 };
 
 static inline struct inet_ehash_bucket *inet_ehash_bucket(
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 8da6429..0bbfd00 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -119,11 +119,6 @@ again:
 					    (tb->num_owners < smallest_size || smallest_size == -1)) {
 						smallest_size = tb->num_owners;
 						smallest_rover = rover;
-						if (atomic_read(&hashinfo->bsockets) > (high - low) + 1) {
-							spin_unlock(&head->lock);
-							snum = smallest_rover;
-							goto have_snum;
-						}
 					}
 					goto next;
 				}
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 2b79377..4bc921f 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -62,8 +62,6 @@ void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
 {
 	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
 
-	atomic_inc(&hashinfo->bsockets);
-
 	inet_sk(sk)->inet_num = snum;
 	sk_add_bind_node(sk, &tb->owners);
 	tb->num_owners++;
@@ -81,8 +79,6 @@ static void __inet_put_port(struct sock *sk)
 	struct inet_bind_hashbucket *head = &hashinfo->bhash[bhash];
 	struct inet_bind_bucket *tb;
 
-	atomic_dec(&hashinfo->bsockets);
-
 	spin_lock(&head->lock);
 	tb = inet_csk(sk)->icsk_bind_hash;
 	__sk_del_bind_node(sk);
@@ -551,7 +547,6 @@ void inet_hashinfo_init(struct inet_hashinfo *h)
 {
 	int i;
 
-	atomic_set(&h->bsockets, 0);
 	for (i = 0; i < INET_LHTABLE_SIZE; i++) {
 		spin_lock_init(&h->listening_hash[i].lock);
 		INIT_HLIST_NULLS_HEAD(&h->listening_hash[i].head,



^ permalink raw reply related

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Gaspar Chilingarov @ 2010-04-21  0:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Greear, David Miller, netdev
In-Reply-To: <1271808314.7895.614.camel@edumazet-laptop>

2010/4/21 Eric Dumazet <eric.dumazet@gmail.com>:
> Le mardi 20 avril 2010 à 16:49 -0700, Ben Greear a écrit :
>> On 04/20/2010 04:35 PM, Gaspar Chilingarov wrote:
>> > sysctl -a | grep local_port_range
>>
>> [root@ct503-10G-09 ~]# sysctl -a | grep local_port_range
>> net.ipv4.ip_local_port_range = 10000  61000
>>
>> I'm explicitly binding to local ports as well as local IPs, btw.
>>
>
> I believe the bsockets 'optimization' is a bug, we should remove it.
>
> This is a stable candidate (2.6.30+)
>
> [PATCH net-next-2.6] tcp: remove bsockets count
>
> Counting number of bound sockets to avoid a loop is buggy, since we cant
> know how many IP addresses are in use. When threshold is reached, we try
> 5 random slots and can fail while there are plenty available ports.
>

Thank you a lot for the patch - I will try it.

In FreeBSD I was able to add about 32 C classes (8192 ips) on the
single interface (never tried to do that in Linux yet :) - so you
really never know how much IP's are there available. Tens and even up
to hundred IPs on the single machine are not that usual in hosting
environment at all.

/Gaspar

^ permalink raw reply

* Re: crash with bridge and inconsistent handling of NETDEV_TX_OK
From: Mikulas Patocka @ 2010-04-21  0:12 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, kaber
In-Reply-To: <20100420.170038.130590455.davem@davemloft.net>



On Tue, 20 Apr 2010, David Miller wrote:

> From: Mikulas Patocka <mpatocka@redhat.com>
> Date: Tue, 20 Apr 2010 19:57:45 -0400 (EDT)
> 
> >> We're not reverting the commit you mention, we're going to fix
> >> whatever bug exists instead.
> > 
> > So have a fun searching for them :-/
> 
> When you report a problem and your knee-jerk reaction is "revert this
> commit", expect some push back.

I tried to analyze skb handling and couldn't check correctness of this 
commit. I really think it should be reverted and applied later, when the 
author does it correctly (for example, one bit in the return value to 
indicate consumed/returned skb, other unimportant informational bits).

> We'll instead analyze this bug and take the time to figure out the
> most appropriate course of action.

And do you have any idea what caused it? Why is it using GSO on bridging?

Mikulas

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: David Miller @ 2010-04-21  0:14 UTC (permalink / raw)
  To: eric.dumazet; +Cc: greearb, gasparch, netdev
In-Reply-To: <1271808314.7895.614.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 21 Apr 2010 02:05:14 +0200

> [PATCH net-next-2.6] tcp: remove bsockets count
> 
> Counting number of bound sockets to avoid a loop is buggy, since we cant
> know how many IP addresses are in use. When threshold is reached, we try
> 5 random slots and can fail while there are plenty available ports.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Hmmm, yes indeed it seems there are improper assumptions made by
this scheme.

This is a tricky area, so I'll wait for some test results from the
reporter and study the code over a few times myself to make sure
we get this right.

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Eric Dumazet @ 2010-04-21  0:14 UTC (permalink / raw)
  To: Gaspar Chilingarov; +Cc: Ben Greear, netdev
In-Reply-To: <n2g46c8cb3e1004201657t50599d7em5535f2f12af96b59@mail.gmail.com>

Le mercredi 21 avril 2010 à 04:57 +0500, Gaspar Chilingarov a écrit :
> Ohhh, see my bug report again, please :)
> 
> The problem arises when used port count is close to or more than the
> number of local ports available on the system (in your case 40000
> client connections is smaller than 51000 of available ports).
> 
> I think that if you choose to go with default settings and try to bind
> half of your ports to one local IP and half of your ports - to another
> one -- you will hit this bug as well.
> 
> If you explicitly bind to the local ports as well - it _may_ solve a
> problem - I will this today (it's just a workaround, but in fact not
> the solution :), but if you let kernel automatically find local ports
> - you will get the errors.

Yes, I posted a preliminar patch to solve this problem, please test
it ;)

But beware bind(port=0) time might be long. This is why high perf
programs prefer to give the exact port at bind(IP, port) time.

Algo will scan the whole range to find the port used by minimum number
of connection. Then it'll check if the selected IP was already in use.

If not -> ok, port found

If yes -> redo five time the same loop (I guess this redo should be
changed too, to use something else.




^ permalink raw reply

* Re: crash with bridge and inconsistent handling of NETDEV_TX_OK
From: David Miller @ 2010-04-21  0:15 UTC (permalink / raw)
  To: mpatocka; +Cc: netdev, kaber
In-Reply-To: <Pine.LNX.4.64.1004202003010.13651@hs20-bc2-1.build.redhat.com>

From: Mikulas Patocka <mpatocka@redhat.com>
Date: Tue, 20 Apr 2010 20:12:45 -0400 (EDT)

> Why is it using GSO on bridging?

Unlike LRO, GRO and GSO are completely valid in bridging and
routing situations.

In fact, in virtualization environments it is essential for
good performance.

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Evgeniy Polyakov @ 2010-04-21  0:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <1271808314.7895.614.camel@edumazet-laptop>

On Wed, Apr 21, 2010 at 02:05:14AM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> I believe the bsockets 'optimization' is a bug, we should remove it.
> 
> This is a stable candidate (2.6.30+)
> 
> [PATCH net-next-2.6] tcp: remove bsockets count
> 
> Counting number of bound sockets to avoid a loop is buggy, since we cant
> know how many IP addresses are in use. When threshold is reached, we try
> 5 random slots and can fail while there are plenty available ports.

To return back to exponential bind() times you need to revert the whole
original patch including magic 5 number, not only bsockets.

But actual problem is not in this digit, but in a deeper logic.
Previously we scanned the whole table, now we have 5 attempts to
find out at least one bucket (without conflict) we will insert
new socket into. Apparently for large number of addresses it is possible
that all 5 times we will randomly select those buckets which conflicts.
As dumb solution we can increase 'attempt' number to infinite one, or
fallback to whole-table-search after several random attempts, which is a
bit more clever I think.

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Evgeniy Polyakov @ 2010-04-21  0:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Gaspar Chilingarov, netdev
In-Reply-To: <1271806945.7895.581.camel@edumazet-laptop>

Hi.

On Wed, Apr 21, 2010 at 01:42:25AM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> If you want to check source, its in file net/ipv4/inet_connection_sock.c
> 
> function inet_csk_get_port()
> 
> you can remove the 
> 
> if (atomic_read(&hashinfo->bsockets) > (high - low) + 1) { 
> 	spin_unlock(&head->lock); 
> 	snum = smallest_rover; 
> 	goto have_snum;
> }
> 
> It will solve your problem (but bind() will probably be slow)

Netcat uses SO_REUSEADDR, so its the code path which generates an error,
but this part actually returns a port number not en error. I suppose
what happend is 'attempts' check faired and thus system failed to find
a small enough bucket which does not conflict. To test this we can
remove 'smallest_size = tb->num_owners;' assignment, but bind() can be
damn slow in this case indeed.

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: crash with bridge and inconsistent handling of NETDEV_TX_OK
From: Mikulas Patocka @ 2010-04-21  0:23 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, kaber
In-Reply-To: <20100420.171525.200010343.davem@davemloft.net>



On Tue, 20 Apr 2010, David Miller wrote:

> From: Mikulas Patocka <mpatocka@redhat.com>
> Date: Tue, 20 Apr 2010 20:12:45 -0400 (EDT)
> 
> > Why is it using GSO on bridging?
> 
> Unlike LRO, GRO and GSO are completely valid in bridging and
> routing situations.
> 
> In fact, in virtualization environments it is essential for
> good performance.

I know it may be used for bridging. But it doesn't explain how it happened 
in my case.

I have two NICs, each with 1500 MTU. The stack trace indicates that a 
packet was received by one NIC and bridged to the other. The stack trace 
also indicates that it went through GSO code path. The question is why? 
How could a forwarded packet be so large to use GSO?

Mikulas

^ permalink raw reply

* Re: crash with bridge and inconsistent handling of NETDEV_TX_OK
From: David Miller @ 2010-04-21  1:02 UTC (permalink / raw)
  To: mpatocka; +Cc: netdev, kaber
In-Reply-To: <Pine.LNX.4.64.1004202018100.16302@hs20-bc2-1.build.redhat.com>

From: Mikulas Patocka <mpatocka@redhat.com>
Date: Tue, 20 Apr 2010 20:23:57 -0400 (EDT)

> I have two NICs, each with 1500 MTU. The stack trace indicates that a 
> packet was received by one NIC and bridged to the other. The stack trace 
> also indicates that it went through GSO code path. The question is why? 
> How could a forwarded packet be so large to use GSO?

GRO automatically accumulates packets together, accumulating them into
larger super-packets.  This is done regardless of input device feeding
it.

Can you understand this now?  In software, we accumulate all incoming
packets for a TCP stream into larger super-packets.  Just because it's
a bridging scenerio doesn't mean we disable GRO.

These super-packets are being bridged, then forwarded out your
destination device and GSO has to de-segment these GRO frames.

GRO is done unconditionally, all the time, for all packets.

^ permalink raw reply

* Re: pull request: wireless-2.6 2010-04-20
From: David Miller @ 2010-04-21  1:04 UTC (permalink / raw)
  To: linville; +Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <20100420183013.GC6472@tuxdriver.com>

From: "John W. Linville" <linville@tuxdriver.com>
Date: Tue, 20 Apr 2010 14:30:14 -0400

> Here are a few more intended for 2.6.34, mostly from the iwlwifi team.
> They fix a couple of potential crashes, an incorrect EEPROM offset
> related to regulatory information, and a connectivity failure relating
> to 802.11n and the 4965.  The "iwlwifi: fix scan races" seems a little
> long, but much of it is the fallout from renaming a goto label.

Pulled, thanks John.

^ permalink raw reply

* Re: [PATCH] can: Fix possible NULL pointer dereference in ems_usb.c
From: David Miller @ 2010-04-21  1:05 UTC (permalink / raw)
  To: wg; +Cc: hjk, netdev, socketcan-core
In-Reply-To: <4BCE04D9.2030601@grandegger.com>

From: Wolfgang Grandegger <wg@grandegger.com>
Date: Tue, 20 Apr 2010 21:47:37 +0200

> Hans J. Koch wrote:
>> In ems_usb_probe(), a pointer is dereferenced after making sure it is NULL...
>> 
>> This patch replaces dev_err() with printk() to avoid this.
>> 
>> Signed-off-by: "Hans J. Koch" <hjk@linutronix.de>
>> ---
>>  drivers/net/can/usb/ems_usb.c |    2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> Index: linux-2.6.34-rc/drivers/net/can/usb/ems_usb.c
>> ===================================================================
>> --- linux-2.6.34-rc.orig/drivers/net/can/usb/ems_usb.c	2010-04-20 15:32:25.000000000 +0200
>> +++ linux-2.6.34-rc/drivers/net/can/usb/ems_usb.c	2010-04-20 15:33:20.000000000 +0200
>> @@ -1006,7 +1006,7 @@
>>  
>>  	netdev = alloc_candev(sizeof(struct ems_usb), MAX_TX_URBS);
>>  	if (!netdev) {
>> -		dev_err(netdev->dev.parent, "Couldn't alloc candev\n");
>> +		printk(KERN_ERR "ems_usb: Couldn't alloc candev\n");
>>  		return -ENOMEM;
>>  	}
> 
> I think "dev_err(&intf->dev, ...)" should be used before
> SET_NETDEV_DEV(netdev, &intf->dev) is called. I see two "dev_err()"
> calls which need to be fixed.

Agreed.

^ permalink raw reply

* Re: [PATCH] gianfar: Wait for both RX and TX to stop
From: David Miller @ 2010-04-21  1:06 UTC (permalink / raw)
  To: timur.tabi; +Cc: galak, afleming, netdev
In-Reply-To: <o2oed82fe3e1004200801n98683cdalb2898c879b93f8fd@mail.gmail.com>

From: Timur Tabi <timur.tabi@gmail.com>
Date: Tue, 20 Apr 2010 10:01:48 -0500

> On Mon, Apr 19, 2010 at 11:43 PM, Kumar Gala <galak@kernel.crashing.org> wrote:
> 
>> spin_event_timeout doesn't make sense for this.  The patch is fine.
> 
> Can you please elaborate on that?  I don't understand why you think
> that.  spin_event_timeout() takes an expression and a timeout, and
> loops over the expression calling cpu_relax(), just like this loop
> does.

Indeed it does, Kumar this request seems reasonable.

^ permalink raw reply

* Re: crash with bridge and inconsistent handling of NETDEV_TX_OK
From: Mikulas Patocka @ 2010-04-21  1:10 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, kaber
In-Reply-To: <20100420.180253.159346294.davem@davemloft.net>

On Tue, 20 Apr 2010, David Miller wrote:

> From: Mikulas Patocka <mpatocka@redhat.com>
> Date: Tue, 20 Apr 2010 20:23:57 -0400 (EDT)
> 
> > I have two NICs, each with 1500 MTU. The stack trace indicates that a 
> > packet was received by one NIC and bridged to the other. The stack trace 
> > also indicates that it went through GSO code path. The question is why? 
> > How could a forwarded packet be so large to use GSO?
> 
> GRO automatically accumulates packets together, accumulating them into
> larger super-packets.  This is done regardless of input device feeding
> it.
> 
> Can you understand this now?  In software, we accumulate all incoming
> packets for a TCP stream into larger super-packets.  Just because it's
> a bridging scenerio doesn't mean we disable GRO.
> 
> These super-packets are being bridged, then forwarded out your
> destination device and GSO has to de-segment these GRO frames.
> 
> GRO is done unconditionally, all the time, for all packets.

I see, but GRO is turned off on my interfaces, according to ethtool.

Mikulas

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox