Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next-2.6] cleanup: remove two unnecessary exports (skbuff).
From: David Miller @ 2010-04-21  5:40 UTC (permalink / raw)
  To: ramirose; +Cc: netdev
In-Reply-To: <g2teb3ff54b1004202220va58efc7evdb91e2cf38b062d4@mail.gmail.com>

From: Rami Rosen <ramirose@gmail.com>
Date: Wed, 21 Apr 2010 08:20:31 +0300

> There is no need to export skb_under_panic() and skb_over_panic() in
> skbuff.c, since these methods are used only in skbuff.c ; this patch
> removes these
> two exports. It also marks these functions as 'static' and removeS the extern
> declarations of them from include/linux/skbuff.h
> 
> Signed-off-by: Rami Rosen <ramirose@gmail.com>

Applied, thanks.

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Eric Dumazet @ 2010-04-21  5:46 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <20100421003022.GA3107@ioremap.net>

Le mercredi 21 avril 2010 à 04:30 +0400, Evgeniy Polyakov a écrit :
> On Wed, Apr 21, 2010 at 02:05:14AM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> > I believe the bsockets 'optimization' is a bug, we should remove it.
> > 
> > This is a stable candidate (2.6.30+)
> > 
> > [PATCH net-next-2.6] tcp: remove bsockets count
> > 
> > Counting number of bound sockets to avoid a loop is buggy, since we cant
> > know how many IP addresses are in use. When threshold is reached, we try
> > 5 random slots and can fail while there are plenty available ports.
> 
> To return back to exponential bind() times you need to revert the whole
> original patch including magic 5 number, not only bsockets.
> 
> But actual problem is not in this digit, but in a deeper logic.
> Previously we scanned the whole table, now we have 5 attempts to
> find out at least one bucket (without conflict) we will insert
> new socket into. Apparently for large number of addresses it is possible
> that all 5 times we will randomly select those buckets which conflicts.
> As dumb solution we can increase 'attempt' number to infinite one, or
> fallback to whole-table-search after several random attempts, which is a
> bit more clever I think.
> 

Hmm, maybe I am blind, but on the case the threshold is reached, we dont
have 5 attempts "to find out at least one bucket (without conflict)"

We just take the first entry from the random starting point, _without_
checking we have a conflict.

if (net_eq(ib_net(tb), net) && tb->port == rover) {
	if (tb->fastreuse > 0 &&
	    sk->sk_reuse &&
	    sk->sk_state != TCP_LISTEN &&
	    (tb->num_owners < smallest_size || smallest_size == -1)) {
		smallest_size = tb->num_owners;
		smallest_rover = rover;
		if (atomic_read(&hashinfo->bsockets) > (high-low)+1) {
			spin_unlock(&head->lock);
			snum = smallest_rover;   // We select this, without checking for
conflicts.
			goto have_snum;
		}
	}


Then we goto to "have_snum" label

Then we realize (selected_IP, randomport) is already in use.
End of first try.

We redo the thing 5 times, so we only look at 5 slots out of
32000-64000.

Maybe the fix would need to check if there is a conflict before doing
the "goto have_snum"

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index e0a3e35..0498daf 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -120,9 +120,11 @@ again:
 						smallest_size = tb->num_owners;
 						smallest_rover = rover;
 						if (atomic_read(&hashinfo->bsockets) > (high - low) + 1) {
-							spin_unlock(&head->lock);
-							snum = smallest_rover;
-							goto have_snum;
+							if (!inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb))
+								spin_unlock(&head->lock);
+								snum = smallest_rover;
+								goto have_snum;
+							}
 						}
 					}
 					goto next;



^ permalink raw reply related

* Re: [PATCH net-next-2.6] net: Introduce skb_orphan_try()
From: Eric Dumazet @ 2010-04-21  6:08 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100418.024601.85853662.davem@davemloft.net>

Le dimanche 18 avril 2010 à 02:46 -0700, David Miller a écrit :

> Looks good, applied, thanks Eric.

Hmm, looking at the GSO stuff, I believe we should not call
skb_orphan_try() on gso skbs ?

Thanks !

[PATCH net-next-2.6] net: Dont call skb_orphan_try() for GSO

At this point, skb->destructor is not the original one (stored in
DEV_GSO_CB(skb)->destructor)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index b31d5d6..200d1e8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1937,7 +1937,6 @@ gso:
 		if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
 			skb_dst_drop(nskb);
 
-		skb_orphan_try(nskb);
 		rc = ops->ndo_start_xmit(nskb, dev);
 		if (unlikely(rc != NETDEV_TX_OK)) {
 			if (rc & ~NETDEV_TX_MASK)



^ permalink raw reply related

* [PATCH] ethernet: print protocol in host byte order
From: Johannes Berg @ 2010-04-21  7:06 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet

Eric's recent patch added __force, but this
place would seem to require actually doing
a byte order conversion so the printk is
consistent across architectures.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
---
 net/ethernet/eth.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 3584696..0c0d272 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -136,7 +136,7 @@ int eth_rebuild_header(struct sk_buff *skb)
 	default:
 		printk(KERN_DEBUG
 		       "%s: unable to resolve type %X addresses.\n",
-		       dev->name, (__force int)eth->h_proto);
+		       dev->name, ntohs(eth->h_proto));
 
 		memcpy(eth->h_source, dev->dev_addr, ETH_ALEN);
 		break;



^ permalink raw reply related

* ipv6: Fix tcp_v6_send_response checksum
From: Herbert Xu @ 2010-04-21  7:07 UTC (permalink / raw)
  To: David S. Miller, netdev

Hi:

ipv6: Fix tcp_v6_send_response checksum

My recent patch to remove the open-coded checksum sequence in
tcp_v6_send_response broke it as we did not set the transport
header pointer on the new packet.

Instead we had set the transport header on the original packet,
which is unnecessary and unexpected.

So this patch removes that and instead sets the transport header
on the new packet.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index c92ebe8..075f540 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1015,7 +1015,7 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
 	skb_reserve(buff, MAX_HEADER + sizeof(struct ipv6hdr) + tot_len);

 	t1 = (struct tcphdr *) skb_push(buff, tot_len);
-	skb_reset_transport_header(skb);
+	skb_reset_transport_header(buff);

 	/* Swap the send and the receive. */
 	memset(t1, 0, sizeof(*t1));

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* Re: [PATCH] net: ipv6 bind to device issue
From: Jiri Olsa @ 2010-04-21  7:21 UTC (permalink / raw)
  To: Brian Haley
  Cc: davem, kuznet, pekkas, jmorris, yoshfuji, kaber, eric.dumazet,
	netdev
In-Reply-To: <4BCDEED3.7040901@hp.com>

On Tue, Apr 20, 2010 at 02:13:39PM -0400, Brian Haley wrote:
> Jiri Olsa wrote:
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index c2438e8..7bf7717 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -815,7 +815,7 @@ struct dst_entry * ip6_route_output(struct net *net, struct sock *sk,
> >  {
> >  	int flags = 0;
> >  
> > -	if (rt6_need_strict(&fl->fl6_dst))
> > +	if (rt6_need_strict(&fl->fl6_dst) || fl->oif)
> >  		flags |= RT6_LOOKUP_F_IFACE;
> >  
> >  	if (!ipv6_addr_any(&fl->fl6_src))
> 
> Actually, looking at this again, we might want to swap the order
> here since fl->oif should be filled-in for most link-local and
> multicast requests calling this:
> 
> 	if (fl->oif || rt6_need_strict(&fl->fl6_dst))
> 
> Just a thought, but it potentially saves a call to determine
> the scope of the address.
> 
> -Brian

I think it's a good idea, attaching the changed patch

thanks,
jirka
---

The issue raises when having 2 NICs both assigned the same
IPv6 global address.

If a sender binds to a particular NIC (SO_BINDTODEVICE),
the outgoing traffic is being sent via the first found.
The bonded device is thus not taken into an account during the
routing.


>From the ip6_route_output function:

If the binding address is multicast, linklocal or loopback,
the RT6_LOOKUP_F_IFACE bit is set, but not for global address.

So binding global address will neglect SO_BINDTODEVICE-binded device,
because the fib6_rule_lookup function path won't check for the
flowi::oif field and take first route that fits.

Following patch should handle the issue.

wbr,
jirka


Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Scott Otto <scott.otto@alcatel-lucent.com>
---
 net/ipv6/route.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index c2438e8..05ebd78 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -815,7 +815,7 @@ struct dst_entry * ip6_route_output(struct net *net, struct sock *sk,
 {
 	int flags = 0;
 
-	if (rt6_need_strict(&fl->fl6_dst))
+	if (fl->oif || rt6_need_strict(&fl->fl6_dst))
 		flags |= RT6_LOOKUP_F_IFACE;
 
 	if (!ipv6_addr_any(&fl->fl6_src))


^ permalink raw reply related

* Re: ipv6: Fix tcp_v6_send_response checksum
From: David Miller @ 2010-04-21  7:49 UTC (permalink / raw)
  To: herbert; +Cc: netdev, cratiu
In-Reply-To: <20100421070737.GA30517@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 21 Apr 2010 15:07:37 +0800

> ipv6: Fix tcp_v6_send_response checksum

I put this into net-2.6 and modified the commit message since, as we
found, this incorrect transport header reset was added there to fix
IPSEC.

I'm convinced that Cosmin didn't test the patch he actually sent out
:-)

Thanks!

--------------------
ipv6: Fix tcp_v6_send_response transport header setting.

My recent patch to remove the open-coded checksum sequence in
tcp_v6_send_response broke it as we did not set the transport
header pointer on the new packet.

Actually, there is code there trying to set the transport
header properly, but it sets it for the wrong skb ('skb'
instead of 'buff').

This bug was introduced by commit
a8fdf2b331b38d61fb5f11f3aec4a4f9fb2dedcb ("ipv6: Fix
tcp_v6_send_response(): it didn't set skb transport header")

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv6/tcp_ipv6.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index c92ebe8..075f540 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1015,7 +1015,7 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
 	skb_reserve(buff, MAX_HEADER + sizeof(struct ipv6hdr) + tot_len);
 
 	t1 = (struct tcphdr *) skb_push(buff, tot_len);
-	skb_reset_transport_header(skb);
+	skb_reset_transport_header(buff);
 
 	/* Swap the send and the receive. */
 	memset(t1, 0, sizeof(*t1));
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH] ipv4: handle GARPs specially when updating neighbors
From: Sasha Levin @ 2010-04-21  8:02 UTC (permalink / raw)
  To: netdev@vger.kernel.org

From: Sasha Levin <sasha@comsleep.com>

We are currently testing IP fail-over on storage devices, and have observed an issue with the IP transfer from one device to another.

Assuming we have 2 storage devices A and B, and a server C which uses the storage, the scenario is:

1. Device A sends an ARP request which server C sees – server C updates it’s ARP table with the MAC of device A.
2. Device A fails, Device B takes over the IP and sends out a GARP.
3. Even though device C sees the GARP, it ignores it and keeps trying to communicate with device A until the entry is removed from its cache and a new ARP request is generated.

The code which causes this is located in arp_process@/net/ipv4/arp.c:

override = time_after(jiffies, n->updated + n->parms->locktime);

/* Broadcast replies and request packets
   do not assert neighbour reachability.
 */
if (arp->ar_op != htons(ARPOP_REPLY) ||
    skb->pkt_type != PACKET_HOST)
        state = NUD_STALE;
neigh_update(n, sha, state, override ? NEIGH_UPDATE_F_OVERRIDE : 0);
neigh_release(n);

According to the code, this scenario happens because the kernel ignores any ARP updates which happened in a short period after the previous ARP update. The reason which was stated in the comments is  “If several different ARP replies follows back-to-back, use the FIRST one. It is possible, if several proxy agents are active. Taking the first reply prevents arp trashing and chooses the fastest router.”.

This, however, doesn’t take into account GARPs which are not being sent by ARP proxies anyway and just ignores them too – causing a loss of communication for over a minute until the ARP cache refreshes.

Signed-off-by: Sasha Levin <sasha@comsleep.com>
---
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 1a9dd66..caa2093 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -876,8 +876,11 @@ static int arp_process(struct sk_buff *skb)
 		   use the FIRST one. It is possible, if several proxy
 		   agents are active. Taking the first reply prevents
 		   arp trashing and chooses the fastest router.
+
+		   GARPs are always updating the cache since they can
+		   originate from different devices with the same IP.
 		 */
-		override = time_after(jiffies, n->updated + n->parms->locktime);
+		override = (sip == tip) || time_after(jiffies, n->updated + n->parms->locktime);

 		/* Broadcast replies and request packets
 		   do not assert neighbour reachability.

^ permalink raw reply related

* Re: [PATCH] NET: Fix an RCU warning in dev_pick_tx()
From: David Miller @ 2010-04-21  8:10 UTC (permalink / raw)
  To: eric.dumazet; +Cc: dhowells, netdev
In-Reply-To: <1271779414.7895.35.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Apr 2010 18:03:34 +0200

> Le mardi 20 avril 2010 à 11:25 +0100, David Howells a écrit :
>> Fix the following RCU warning in dev_pick_tx():
 ...
> Absolutely right, thanks David
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

I'll apply this, thanks everyone.

> This might conflict with following commit in net-next-2.6, where I chose
> the rcu_dereference_check() alternative

Yead I'll be mindful of this when I do a merge right after
this, thanks for the heads up.

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Evgeniy Polyakov @ 2010-04-21  8:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <1271828799.7895.1287.camel@edumazet-laptop>

On Wed, Apr 21, 2010 at 07:46:39AM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> 		if (atomic_read(&hashinfo->bsockets) > (high-low)+1) {
> 			spin_unlock(&head->lock);
> 			snum = smallest_rover;   // We select this, without checking for
> conflicts.
> 			goto have_snum;
> 		}
> 	}
> 
> 
> Then we goto to "have_snum" label
> 
> Then we realize (selected_IP, randomport) is already in use.
> End of first try.
> 
> We redo the thing 5 times, so we only look at 5 slots out of
> 32000-64000.

We only break out of the loop in above case when number of sockets is
already more than our range limit. If we would just try 5 random times
out of 1000 in Gaspar's case, we would not be able to select all 1000
sockets.

> Maybe the fix would need to check if there is a conflict before doing
> the "goto have_snum"

I believe this is a useful patch, but it addresses a different issue.
This path should not fire up when we bind to single address.

> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index e0a3e35..0498daf 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -120,9 +120,11 @@ again:
>  						smallest_size = tb->num_owners;
>  						smallest_rover = rover;
>  						if (atomic_read(&hashinfo->bsockets) > (high - low) + 1) {
> -							spin_unlock(&head->lock);
> -							snum = smallest_rover;
> -							goto have_snum;
> +							if (!inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb))
> +								spin_unlock(&head->lock);
> +								snum = smallest_rover;
> +								goto have_snum;
> +							}
>  						}
>  					}
>  					goto next;
> 

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-21  8:35 UTC (permalink / raw)
  To: Xin, Xiaohui
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu,
	jdike@linux.intel.com, davem@davemloft.net
In-Reply-To: <F2E9EB7348B8264F86B6AB8151CE2D79026FAB1E5F@shsmsx502.ccr.corp.intel.com>

On Tue, Apr 20, 2010 at 10:21:55AM +0800, Xin, Xiaohui wrote:
> Michael,
> 
> >>>>>> What we have not done yet:
> >>>>>> 	packet split support
> >>>>>> 
> >>>>>What does this mean, exactly?
> >>>> We can support 1500MTU, but for jumbo frame, since vhost driver before don't 
> >>>>support mergeable buffer, we cannot try it for multiple sg.
> >>>> 
> >>>I do not see why, vhost currently supports 64K buffers with indirect
> >>>descriptors.
> >>> 
> >> The receive_skb() in guest virtio-net driver will merge the multiple sg to skb frags, how >>can indirect descriptors to that?
> 
> >See add_recvbuf_big.
> 
> I don't mean this, it's for buffer submission. I mean when packet is received, in receive_buf(), mergeable buffer knows which pages received can be hooked in skb frags, it's receive_mergeable() which do this.
> 
> When a NIC driver supports packet split mode, then each ring descriptor contains a skb and a page. When packet is received, if the status is not EOP, then hook the page of the next descriptor to the prev skb. We don't how many frags belongs to one skb. So when guest submit buffers, it should submit multiple pages, and when receive, the guest should know which pages are belongs to one skb and hook them together. I think receive_mergeable() can do this, but I don't see how big->packets handle this. May I miss something here?
> 
> Thanks
> Xiaohui 


Yes, I think this packet split mode probably maps well to mergeable buffer
support. Note that
1. Not all devices support large packets in this way, others might map
   to indirect buffers better
   So we have to figure out how migration is going to work
2. It's up to guest driver whether to enable features such as
   mergeable buffers and indirect buffers
   So we have to figure out how to notify guest which mode
   is optimal for a given device
3. We don't want to depend on jumbo frames for decent performance
   So we probably should support GSO/GRO

-- 
MST

^ permalink raw reply

* Re: 2.6.34-rc5: Reported regressions from 2.6.33
From: Jerome Glisse @ 2010-04-21  8:57 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nick Bowler, Kernel Testers List, Linux SCSI List,
	Network Development, Linux Wireless List,
	Linux Kernel Mailing List, Linux ACPI, Andrew Morton, DRI,
	Linus Torvalds, Linux PM List, Maciej Rutecki
In-Reply-To: <201004210715.38621.rjw@sisk.pl>

On Wed, Apr 21, 2010 at 07:15:38AM +0200, Rafael J. Wysocki wrote:
> On Tuesday 20 April 2010, Nick Bowler wrote:
> > On 05:15 Tue 20 Apr     , Rafael J. Wysocki wrote:
> > > If you know of any other unresolved regressions from 2.6.33, please let us
> > > know either and we'll add them to the list.  Also, please let us know
> > > if any of the entries below are invalid.
> > 
> > Please list these two similar regressions from 2.6.33 in the r600 DRM:
> > 
> >  * r600 CS checker rejects GL_DEPTH_TEST w/o depth buffer:
> >            https://bugs.freedesktop.org/show_bug.cgi?id=27571
> > 
> >  * r600 CS checker rejects narrow FBO renderbuffers:
> >            https://bugs.freedesktop.org/show_bug.cgi?id=27609
> 
> Do you want to me to add them as one entry or as two separate bugs?
> 
> Rafael
> 

First one is userspace bug, i need to look into the second one.
ie we were lucky the hw didn't lockup without depth buffer and
depth test enabled.

Cheers,
Jerome

^ permalink raw reply

* Re: ipv6: Fix tcp_v6_send_response checksum
From: David Miller @ 2010-04-21  8:58 UTC (permalink / raw)
  To: herbert; +Cc: netdev, cratiu
In-Reply-To: <20100421.004922.193694715.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Wed, 21 Apr 2010 00:49:22 -0700 (PDT)

> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Wed, 21 Apr 2010 15:07:37 +0800
> 
>> ipv6: Fix tcp_v6_send_response checksum
> 
> I put this into net-2.6 and modified the commit message since, as we
> found, this incorrect transport header reset was added there to fix
> IPSEC.

Ok, even with this pulled into net-next-2.6 the ipv6 tcp response
checksums are still bad.  The following fix is necessary as
well:

--------------------
tcp: Fix ipv6 checksumming on response packets for real.

Commit 6651ffc8e8bdd5fb4b7d1867c6cfebb4f309512c
("ipv6: Fix tcp_v6_send_response transport header setting.")
fixed one half of why ipv6 tcp response checksums were
invalid, but it's not the whole story.

If we're going to use CHECKSUM_PARTIAL for these things (which we are
since commit 2e8e18ef52e7dd1af0a3bd1f7d990a1d0b249586 "tcp: Set
CHECKSUM_UNNECESSARY in tcp_init_nondata_skb"), we can't be setting
buff->csum as we always have been here in tcp_v6_send_response.  We
need to leave it at zero.

Kill that line and checksums are good again.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv6/tcp_ipv6.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 78480f4..5d2e430 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1050,8 +1050,6 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
 	}
 #endif

-	buff->csum = csum_partial(t1, tot_len, 0);
-
 	memset(&fl, 0, sizeof(fl));
 	ipv6_addr_copy(&fl.fl6_dst, &ipv6_hdr(skb)->saddr);
 	ipv6_addr_copy(&fl.fl6_src, &ipv6_hdr(skb)->daddr);
-- 
1.7.0.4

^ permalink raw reply related

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Eric Dumazet @ 2010-04-21  9:02 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <20100421082559.GA32475@ioremap.net>

Le mercredi 21 avril 2010 à 12:25 +0400, Evgeniy Polyakov a écrit :

> I believe this is a useful patch, but it addresses a different issue.
> This path should not fire up when we bind to single address.

Well, the real problem is that following sequence can happen :

socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
setsockopt(5, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(34000), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6
setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(6, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(6, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
setsockopt(7, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(7, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(7, {sa_family=AF_INET, sin_port=htons(34001), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
setsockopt(8, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(8, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(8, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9
setsockopt(9, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(9, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(9, {sa_family=AF_INET, sin_port=htons(34000), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
setsockopt(10, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(10, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(10, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 11
setsockopt(11, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(11, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(11, {sa_family=AF_INET, sin_port=htons(34001), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0


Note ports are given several times for different sockets.

So several sockets are 'bound' to same IP:port values

At connect() time, we refuse and say address is not available.



Following program to demonstrate the problem.

First time, launch it with an extra agument to setup ip aliases and ip_local_port_range




#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <errno.h>

int listenfd;
int on = 1;

void listener()
{
	if (fork())
		return;

	while (1) {
		struct sockaddr_in addr;
		socklen_t len = sizeof(addr);
		int fd = accept(listenfd, (struct sockaddr *)&addr, &len);
	}
}

int main(int argc, char *argv[])
{
	int i, port, total = 0;
	char cmd[128];
	struct sockaddr_in addr;
	socklen_t len;

	if (argc > 1) {
		for (i = 2; i < 8; i++) {
			sprintf(cmd, "ip addr add 127.0.0.%d/8 dev lo 2>/dev/null", i);
			system(cmd);
		}
		system("echo '34000 34002' >/proc/sys/net/ipv4/ip_local_port_range");
	}
	listenfd = socket(AF_INET, SOCK_STREAM, 0);
	setsockopt(listenfd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(4444);
	if (bind(listenfd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
		perror("bind");
		return 1;
	}
	listen(listenfd, 10);
	listener();

	for (i = 2; i < 8; i++) {
		for (port = 34000; port < 34010; port++) {
			int fd = socket(AF_INET, SOCK_STREAM, 0);
			if (fd == -1) {
				fprintf(stderr, "Could not open socket, errno=%d\n", errno);
				goto end;
			}
			setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
			addr.sin_addr.s_addr = htonl(0x7f000000 + i);
//			addr.sin_port = htons(port);
			addr.sin_port = 0;
			if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
				fprintf(stderr, "Could not bind()\n");
				goto end;
			}
			len = sizeof(addr);
			getsockname(fd, (struct sockaddr *)&addr, &len);
#if 0
			addr.sin_addr.s_addr = htonl(0x7f000001);
			addr.sin_port = htons(4444);
			if ((total < 10) && (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1)) {
				len = sizeof(addr);
				getsockname(fd, (struct sockaddr *)&addr, &len);
				fprintf(stderr, "Could not connect()\n");
				goto end;
			}
#endif
			total++;
		}
	}
end:
	printf("i=127.0.0.%d port=%d (total=%d)\n", i, port, total);
	pause();
}



^ permalink raw reply

* [PATCH 1/2][RESEND] ehea: error handling improvement
From: Thomas Klein @ 2010-04-21  9:10 UTC (permalink / raw)
  To: davem; +Cc: netdev, linuxppc-dev, linux-kernel, themann

Reset a port's resources only if they're actually in an error state

Signed-off-by: Thomas Klein <tklein@de.ibm.com>
---

Patch created against net-2.6

diff -Nurp net-2.6.orig/drivers/net/ehea/ehea_main.c net-2.6/drivers/net/ehea/ehea_main.c
--- net-2.6.orig/drivers/net/ehea/ehea_main.c	2010-04-21 10:23:21.000000000 +0200
+++ net-2.6/drivers/net/ehea/ehea_main.c	2010-04-21 10:41:21.000000000 +0200
@@ -791,11 +791,17 @@ static struct ehea_cqe *ehea_proc_cqes(s
 		cqe_counter++;
 		rmb();
 		if (cqe->status & EHEA_CQE_STAT_ERR_MASK) {
-			ehea_error("Send Completion Error: Resetting port");
+			ehea_error("Bad send completion status=0x%04X",
+				   cqe->status);
+
 			if (netif_msg_tx_err(pr->port))
 				ehea_dump(cqe, sizeof(*cqe), "Send CQE");
-			ehea_schedule_port_reset(pr->port);
-			break;
+
+			if (cqe->status & EHEA_CQE_STAT_RESET_MASK) {
+				ehea_error("Resetting port");
+				ehea_schedule_port_reset(pr->port);
+				break;
+			}
 		}
 
 		if (netif_msg_tx_done(pr->port))
@@ -901,6 +907,8 @@ static irqreturn_t ehea_qp_aff_irq_handl
 	struct ehea_eqe *eqe;
 	struct ehea_qp *qp;
 	u32 qp_token;
+	u64 resource_type, aer, aerr;
+	int reset_port = 0;
 
 	eqe = ehea_poll_eq(port->qp_eq);
 
@@ -910,11 +918,24 @@ static irqreturn_t ehea_qp_aff_irq_handl
 			   eqe->entry, qp_token);
 
 		qp = port->port_res[qp_token].qp;
-		ehea_error_data(port->adapter, qp->fw_handle);
+
+		resource_type = ehea_error_data(port->adapter, qp->fw_handle,
+						&aer, &aerr);
+
+		if (resource_type == EHEA_AER_RESTYPE_QP) {
+			if ((aer & EHEA_AER_RESET_MASK) ||
+			    (aerr & EHEA_AERR_RESET_MASK))
+				 reset_port = 1;
+		} else
+			reset_port = 1;   /* Reset in case of CQ or EQ error */
+
 		eqe = ehea_poll_eq(port->qp_eq);
 	}
 
-	ehea_schedule_port_reset(port);
+	if (reset_port) {
+		ehea_error("Resetting port");
+		ehea_schedule_port_reset(port);
+	}
 
 	return IRQ_HANDLED;
 }
diff -Nurp net-2.6.orig/drivers/net/ehea/ehea_qmr.c net-2.6/drivers/net/ehea/ehea_qmr.c
--- net-2.6.orig/drivers/net/ehea/ehea_qmr.c	2010-04-21 10:23:21.000000000 +0200
+++ net-2.6/drivers/net/ehea/ehea_qmr.c	2010-04-21 10:41:21.000000000 +0200
@@ -229,14 +229,14 @@ u64 ehea_destroy_cq_res(struct ehea_cq *
 
 int ehea_destroy_cq(struct ehea_cq *cq)
 {
-	u64 hret;
+	u64 hret, aer, aerr;
 	if (!cq)
 		return 0;
 
 	hcp_epas_dtor(&cq->epas);
 	hret = ehea_destroy_cq_res(cq, NORMAL_FREE);
 	if (hret == H_R_STATE) {
-		ehea_error_data(cq->adapter, cq->fw_handle);
+		ehea_error_data(cq->adapter, cq->fw_handle, &aer, &aerr);
 		hret = ehea_destroy_cq_res(cq, FORCE_FREE);
 	}
 
@@ -357,7 +357,7 @@ u64 ehea_destroy_eq_res(struct ehea_eq *
 
 int ehea_destroy_eq(struct ehea_eq *eq)
 {
-	u64 hret;
+	u64 hret, aer, aerr;
 	if (!eq)
 		return 0;
 
@@ -365,7 +365,7 @@ int ehea_destroy_eq(struct ehea_eq *eq)
 
 	hret = ehea_destroy_eq_res(eq, NORMAL_FREE);
 	if (hret == H_R_STATE) {
-		ehea_error_data(eq->adapter, eq->fw_handle);
+		ehea_error_data(eq->adapter, eq->fw_handle, &aer, &aerr);
 		hret = ehea_destroy_eq_res(eq, FORCE_FREE);
 	}
 
@@ -540,7 +540,7 @@ u64 ehea_destroy_qp_res(struct ehea_qp *
 
 int ehea_destroy_qp(struct ehea_qp *qp)
 {
-	u64 hret;
+	u64 hret, aer, aerr;
 	if (!qp)
 		return 0;
 
@@ -548,7 +548,7 @@ int ehea_destroy_qp(struct ehea_qp *qp)
 
 	hret = ehea_destroy_qp_res(qp, NORMAL_FREE);
 	if (hret == H_R_STATE) {
-		ehea_error_data(qp->adapter, qp->fw_handle);
+		ehea_error_data(qp->adapter, qp->fw_handle, &aer, &aerr);
 		hret = ehea_destroy_qp_res(qp, FORCE_FREE);
 	}
 
@@ -986,42 +986,45 @@ void print_error_data(u64 *data)
 	if (length > EHEA_PAGESIZE)
 		length = EHEA_PAGESIZE;
 
-	if (type == 0x8) /* Queue Pair */
+	if (type == EHEA_AER_RESTYPE_QP)
 		ehea_error("QP (resource=%llX) state: AER=0x%llX, AERR=0x%llX, "
 			   "port=%llX", resource, data[6], data[12], data[22]);
-
-	if (type == 0x4) /* Completion Queue */
+	else if (type == EHEA_AER_RESTYPE_CQ)
 		ehea_error("CQ (resource=%llX) state: AER=0x%llX", resource,
 			   data[6]);
-
-	if (type == 0x3) /* Event Queue */
+	else if (type == EHEA_AER_RESTYPE_EQ)
 		ehea_error("EQ (resource=%llX) state: AER=0x%llX", resource,
 			   data[6]);
 
 	ehea_dump(data, length, "error data");
 }
 
-void ehea_error_data(struct ehea_adapter *adapter, u64 res_handle)
+u64 ehea_error_data(struct ehea_adapter *adapter, u64 res_handle,
+		    u64 *aer, u64 *aerr)
 {
 	unsigned long ret;
 	u64 *rblock;
+	u64 type = 0;
 
 	rblock = (void *)get_zeroed_page(GFP_KERNEL);
 	if (!rblock) {
 		ehea_error("Cannot allocate rblock memory.");
-		return;
+		goto out;
 	}
 
-	ret = ehea_h_error_data(adapter->handle,
-				res_handle,
-				rblock);
+	ret = ehea_h_error_data(adapter->handle, res_handle, rblock);
 
-	if (ret == H_R_STATE)
-		ehea_error("No error data is available: %llX.", res_handle);
-	else if (ret == H_SUCCESS)
+	if (ret == H_SUCCESS) {
+		type = EHEA_BMASK_GET(ERROR_DATA_TYPE, rblock[2]);
+		*aer = rblock[6];
+		*aerr = rblock[12];
 		print_error_data(rblock);
-	else
+	} else if (ret == H_R_STATE) {
+		ehea_error("No error data available: %llX.", res_handle);
+	} else
 		ehea_error("Error data could not be fetched: %llX", res_handle);
 
 	free_page((unsigned long)rblock);
+out:
+	return type;
 }
diff -Nurp net-2.6.orig/drivers/net/ehea/ehea_qmr.h net-2.6/drivers/net/ehea/ehea_qmr.h
--- net-2.6.orig/drivers/net/ehea/ehea_qmr.h	2010-04-21 10:23:21.000000000 +0200
+++ net-2.6/drivers/net/ehea/ehea_qmr.h	2010-04-21 10:41:21.000000000 +0200
@@ -154,6 +154,9 @@ struct ehea_rwqe {
 #define EHEA_CQE_STAT_ERR_IP       0x2000
 #define EHEA_CQE_STAT_ERR_CRC      0x1000
 
+/* Defines which bad send cqe stati lead to a port reset */
+#define EHEA_CQE_STAT_RESET_MASK   0x0002
+
 struct ehea_cqe {
 	u64 wr_id;		/* work request ID from WQE */
 	u8 type;
@@ -187,6 +190,14 @@ struct ehea_cqe {
 #define EHEA_EQE_SM_MECH_NUMBER  EHEA_BMASK_IBM(48, 55)
 #define EHEA_EQE_SM_PORT_NUMBER  EHEA_BMASK_IBM(56, 63)
 
+#define EHEA_AER_RESTYPE_QP  0x8
+#define EHEA_AER_RESTYPE_CQ  0x4
+#define EHEA_AER_RESTYPE_EQ  0x3
+
+/* Defines which affiliated errors lead to a port reset */
+#define EHEA_AER_RESET_MASK   0xFFFFFFFFFEFFFFFFULL
+#define EHEA_AERR_RESET_MASK  0xFFFFFFFFFFFFFFFFULL
+
 struct ehea_eqe {
 	u64 entry;
 };
@@ -379,7 +390,8 @@ int ehea_gen_smr(struct ehea_adapter *ad
 
 int ehea_rem_mr(struct ehea_mr *mr);
 
-void ehea_error_data(struct ehea_adapter *adapter, u64 res_handle);
+u64 ehea_error_data(struct ehea_adapter *adapter, u64 res_handle,
+		    u64 *aer, u64 *aerr);
 
 int ehea_add_sect_bmap(unsigned long pfn, unsigned long nr_pages);
 int ehea_rem_sect_bmap(unsigned long pfn, unsigned long nr_pages);

^ permalink raw reply

* [PATCH 2/2][RESEND] ehea: fix possible DLPAR/mem deadlock
From: Thomas Klein @ 2010-04-21  9:11 UTC (permalink / raw)
  To: davem; +Cc: netdev, linuxppc-dev, linux-kernel, themann

Force serialization of userspace-triggered DLPAR/mem operations

Signed-off-by: Thomas Klein <tklein@de.ibm.com>
---

Patch created against net-2.6

diff -Nurp net-2.6.orig/drivers/net/ehea/ehea.h net-2.6/drivers/net/ehea/ehea.h
--- net-2.6.orig/drivers/net/ehea/ehea.h	2010-04-21 10:23:21.000000000 +0200
+++ net-2.6/drivers/net/ehea/ehea.h	2010-04-21 10:42:14.000000000 +0200
@@ -40,7 +40,7 @@
 #include <asm/io.h>
 
 #define DRV_NAME	"ehea"
-#define DRV_VERSION	"EHEA_0102"
+#define DRV_VERSION	"EHEA_0103"
 
 /* eHEA capability flags */
 #define DLPAR_PORT_ADD_REM 1
diff -Nurp net-2.6.orig/drivers/net/ehea/ehea_main.c net-2.6/drivers/net/ehea/ehea_main.c
--- net-2.6.orig/drivers/net/ehea/ehea_main.c	2010-04-21 10:43:14.000000000 +0200
+++ net-2.6/drivers/net/ehea/ehea_main.c	2010-04-21 10:42:14.000000000 +0200
@@ -2889,7 +2889,6 @@ static void ehea_rereg_mrs(struct work_s
 	int ret, i;
 	struct ehea_adapter *adapter;
 
-	mutex_lock(&dlpar_mem_lock);
 	ehea_info("LPAR memory changed - re-initializing driver");
 
 	list_for_each_entry(adapter, &adapter_list, list)
@@ -2959,7 +2958,6 @@ static void ehea_rereg_mrs(struct work_s
 		}
 	ehea_info("re-initializing driver complete");
 out:
-	mutex_unlock(&dlpar_mem_lock);
 	return;
 }
 
@@ -3542,7 +3540,14 @@ void ehea_crash_handler(void)
 static int ehea_mem_notifier(struct notifier_block *nb,
                              unsigned long action, void *data)
 {
+	int ret = NOTIFY_BAD;
 	struct memory_notify *arg = data;
+
+	if (!mutex_trylock(&dlpar_mem_lock)) {
+		ehea_info("ehea_mem_notifier must not be called parallelized");
+		goto out;
+	}
+
 	switch (action) {
 	case MEM_CANCEL_OFFLINE:
 		ehea_info("memory offlining canceled");
@@ -3551,14 +3556,14 @@ static int ehea_mem_notifier(struct noti
 		ehea_info("memory is going online");
 		set_bit(__EHEA_STOP_XFER, &ehea_driver_flags);
 		if (ehea_add_sect_bmap(arg->start_pfn, arg->nr_pages))
-			return NOTIFY_BAD;
+			goto out_unlock;
 		ehea_rereg_mrs(NULL);
 		break;
 	case MEM_GOING_OFFLINE:
 		ehea_info("memory is going offline");
 		set_bit(__EHEA_STOP_XFER, &ehea_driver_flags);
 		if (ehea_rem_sect_bmap(arg->start_pfn, arg->nr_pages))
-			return NOTIFY_BAD;
+			goto out_unlock;
 		ehea_rereg_mrs(NULL);
 		break;
 	default:
@@ -3566,8 +3571,12 @@ static int ehea_mem_notifier(struct noti
 	}
 
 	ehea_update_firmware_handles();
+	ret = NOTIFY_OK;
 
-	return NOTIFY_OK;
+out_unlock:
+	mutex_unlock(&dlpar_mem_lock);
+out:
+	return ret;
 }
 
 static struct notifier_block ehea_mem_nb = {

^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: Franco Fichtner @ 2010-04-21  9:29 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Eric Dumazet, Changli Gao, David Miller, netdev
In-Reply-To: <i2q65634d661004200809ydba56c35g785ba51900c379f7@mail.gmail.com>

Tom Herbert wrote:
>> I thought about this for some time...
>>
>> Do we really need the port numbers here at all? A simple
>> addr1^addr2 can provide a good enough pointer for
>> distribution amongst CPUs.
>>
> 
> What about a server behind a TCP proxy?  Also, need to minimize
> collisions for RPS to be effective

What about routers? What about loopback? This all boils down to
the same issue of obscuring IP data by "magical" means and then
reattaching functionality by reaching for upper layer information.
It is necessary in some cases, but it can cripple performance
for other cases.

The interesting thing is you don't need to deal with collisions
while distributing amonst cpus at all. You just need to make sure
the distribution algorithm keeps every single flow attached to
the correct cpu.

All of the actual flow hashing, tracking and whatever else the
traffic needs to go through can be done locally by cpu x which
helps a lot with load distribution and cache issues in mind. It
also helps locking because there is no global flow lookup table.
Oh, and it also reduces collisions with every cpu you add for
receiving.

I work with a lot of plain office and ISP traffic in mind daily,
so please don't misunderstand my motivation here. I'd hate to
see poor performance in scenarios in which there is a lot of
potential improvement.

Franco

^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: Eric Dumazet @ 2010-04-21  9:39 UTC (permalink / raw)
  To: Franco Fichtner; +Cc: Tom Herbert, Changli Gao, David Miller, netdev
In-Reply-To: <4BCEC593.4000007@lastsummer.de>

Le mercredi 21 avril 2010 à 11:29 +0200, Franco Fichtner a écrit :
> Tom Herbert wrote:
> >> I thought about this for some time...
> >>
> >> Do we really need the port numbers here at all? A simple
> >> addr1^addr2 can provide a good enough pointer for
> >> distribution amongst CPUs.
> >>
> > 
> > What about a server behind a TCP proxy?  Also, need to minimize
> > collisions for RPS to be effective
> 
> What about routers? What about loopback? This all boils down to
> the same issue of obscuring IP data by "magical" means and then
> reattaching functionality by reaching for upper layer information.
> It is necessary in some cases, but it can cripple performance
> for other cases.
> 
> The interesting thing is you don't need to deal with collisions
> while distributing amonst cpus at all. You just need to make sure
> the distribution algorithm keeps every single flow attached to
> the correct cpu.
> 
> All of the actual flow hashing, tracking and whatever else the
> traffic needs to go through can be done locally by cpu x which
> helps a lot with load distribution and cache issues in mind. It
> also helps locking because there is no global flow lookup table.
> Oh, and it also reduces collisions with every cpu you add for
> receiving.
> 
> I work with a lot of plain office and ISP traffic in mind daily,
> so please don't misunderstand my motivation here. I'd hate to
> see poor performance in scenarios in which there is a lot of
> potential improvement.
> 

I am a bit lost by this conversation.

Are you saying something is wrong with current schem ?

What are exactly your suggestions ?

Tom replied to you that a hash derived from (addr1 ^ addr2) would not
work in situations where all flows goes from machine A to machine B
(all hashes would be the same)

Current hash is probably more than enough to cover all situations.






^ permalink raw reply

* Re: ipv6: Fix tcp_v6_send_response checksum
From: Cosmin Ratiu @ 2010-04-21  9:42 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, netdev
In-Reply-To: <20100421.004922.193694715.davem@davemloft.net>

On Wednesday 21 April 2010 10:49:22 David Miller wrote:
> I put this into net-2.6 and modified the commit message since, as we
> found, this incorrect transport header reset was added there to fix
> IPSEC.
> 
> I'm convinced that Cosmin didn't test the patch he actually sent out
> 
> :-)

I apologize. I've tested the patch on 2.6.7 (tcp_v6_send_ack then) and then 
redid the patch half-asleep against net-next.

Cosmin.

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Evgeniy Polyakov @ 2010-04-21  9:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <1271840535.7895.1612.camel@edumazet-laptop>

On Wed, Apr 21, 2010 at 11:02:15AM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> Le mercredi 21 avril 2010 à 12:25 +0400, Evgeniy Polyakov a écrit :
> 
> > I believe this is a useful patch, but it addresses a different issue.
> > This path should not fire up when we bind to single address.
> 
> Well, the real problem is that following sequence can happen :
> 
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
> setsockopt(5, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(5, {sa_family=AF_INET, sin_port=htons(34000), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6
> setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(6, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(6, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
> setsockopt(7, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(7, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(7, {sa_family=AF_INET, sin_port=htons(34001), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
> setsockopt(8, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(8, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(8, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9
> setsockopt(9, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(9, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(9, {sa_family=AF_INET, sin_port=htons(34000), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
> setsockopt(10, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(10, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(10, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 11
> setsockopt(11, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(11, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(11, {sa_family=AF_INET, sin_port=htons(34001), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> 
> 
> Note ports are given several times for different sockets.
> 
> So several sockets are 'bound' to same IP:port values

That's kind of weird, since address is the same. I'm curious why bind
conflict does not resolve that. Likely because socket is not yet
connected.

> At connect() time, we refuse and say address is not available.

Yup. But isn't it a different problem from what Gaspar observed?
Netcat connects after bind so it should not be an issue here.

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: [PATCH] can: Fix possible NULL pointer dereference in ems_usb.c
From: Hans J. Koch @ 2010-04-21 10:18 UTC (permalink / raw)
  To: David Miller
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA, wg-5Yr1BZd7O62+XT7JhA+gdA
In-Reply-To: <20100420.180501.08643239.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

On Tue, Apr 20, 2010 at 06:05:01PM -0700, David Miller wrote:
> From: Wolfgang Grandegger <wg-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>
> > I think "dev_err(&intf->dev, ...)" should be used before
> > SET_NETDEV_DEV(netdev, &intf->dev) is called. I see two "dev_err()"
> > calls which need to be fixed.
> 
> Agreed.

Then it should probably look like this:


From: "Hans J. Koch" <hjk-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
To: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org,
	Sebastian Haas <haas-zsNKPWJ8Pib6hrUXjxyGrA@public.gmane.org>
Subject: [PATCH] can: Fix possible NULL pointer dereference in ems_usb.c

In ems_usb_probe(), a pointer is dereferenced after making sure it is NULL...

This patch replaces netdev->dev.parent with &intf->dev in dev_err() calls to
avoid this.

Signed-off-by: "Hans J. Koch" <hjk-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
---
 drivers/net/can/usb/ems_usb.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: net-next-2.6/drivers/net/can/usb/ems_usb.c
===================================================================
--- net-next-2.6.orig/drivers/net/can/usb/ems_usb.c	2010-04-13 11:27:33.000000000 +0200
+++ net-next-2.6/drivers/net/can/usb/ems_usb.c	2010-04-21 11:59:04.000000000 +0200
@@ -1006,7 +1006,7 @@
 
 	netdev = alloc_candev(sizeof(struct ems_usb), MAX_TX_URBS);
 	if (!netdev) {
-		dev_err(netdev->dev.parent, "Couldn't alloc candev\n");
+		dev_err(&intf->dev, "ems_usb: Couldn't alloc candev\n");
 		return -ENOMEM;
 	}
 
@@ -1036,20 +1036,20 @@
 
 	dev->intr_urb = usb_alloc_urb(0, GFP_KERNEL);
 	if (!dev->intr_urb) {
-		dev_err(netdev->dev.parent, "Couldn't alloc intr URB\n");
+		dev_err(&intf->dev, "Couldn't alloc intr URB\n");
 		goto cleanup_candev;
 	}
 
 	dev->intr_in_buffer = kzalloc(INTR_IN_BUFFER_SIZE, GFP_KERNEL);
 	if (!dev->intr_in_buffer) {
-		dev_err(netdev->dev.parent, "Couldn't alloc Intr buffer\n");
+		dev_err(&intf->dev, "Couldn't alloc Intr buffer\n");
 		goto cleanup_intr_urb;
 	}
 
 	dev->tx_msg_buffer = kzalloc(CPC_HEADER_SIZE +
 				     sizeof(struct ems_cpc_msg), GFP_KERNEL);
 	if (!dev->tx_msg_buffer) {
-		dev_err(netdev->dev.parent, "Couldn't alloc Tx buffer\n");
+		dev_err(&intf->dev, "Couldn't alloc Tx buffer\n");
 		goto cleanup_intr_in_buffer;
 	}

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Eric Dumazet @ 2010-04-21 10:21 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <20100421095812.GA14778@ioremap.net>

Le mercredi 21 avril 2010 à 13:58 +0400, Evgeniy Polyakov a écrit :
> On Wed, Apr 21, 2010 at 11:02:15AM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> > Le mercredi 21 avril 2010 à 12:25 +0400, Evgeniy Polyakov a écrit :
> > 
> > > I believe this is a useful patch, but it addresses a different issue.
> > > This path should not fire up when we bind to single address.
> > 
> > Well, the real problem is that following sequence can happen :
> > 
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
> > setsockopt(5, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(5, {sa_family=AF_INET, sin_port=htons(34000), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6
> > setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(6, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(6, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
> > setsockopt(7, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(7, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(7, {sa_family=AF_INET, sin_port=htons(34001), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
> > setsockopt(8, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(8, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(8, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9
> > setsockopt(9, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(9, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(9, {sa_family=AF_INET, sin_port=htons(34000), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
> > setsockopt(10, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(10, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(10, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 11
> > setsockopt(11, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(11, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(11, {sa_family=AF_INET, sin_port=htons(34001), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > 
> > 
> > Note ports are given several times for different sockets.
> > 
> > So several sockets are 'bound' to same IP:port values
> 
> That's kind of weird, since address is the same. I'm curious why bind
> conflict does not resolve that. Likely because socket is not yet
> connected.
> 
> > At connect() time, we refuse and say address is not available.
> 
> Yup. But isn't it a different problem from what Gaspar observed?
> Netcat connects after bind so it should not be an issue here.
> 

Well, just replace the "#if 0" in program, and i'll connect, and
reproduce same problem

conflict only checks :

                        if (!reuse || !sk2->sk_reuse ||
                            sk2->sk_state == TCP_LISTEN) {
                                const __be32 sk2_rcv_saddr =
inet_rcv_saddr(sk2);
                                if (!sk2_rcv_saddr || !sk_rcv_saddr ||
                                    sk2_rcv_saddr == sk_rcv_saddr)
                                        break;
                        }


ie it wont re-allocate a port only if a listener uses this port.

If sk2 is connected or close, it doesnt matter, and port can be reused.




^ permalink raw reply

* Re: [PATCH] can: Fix possible NULL pointer dereference in ems_usb.c
From: Wolfgang Grandegger @ 2010-04-21 10:27 UTC (permalink / raw)
  To: Hans J. Koch
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA, David Miller
In-Reply-To: <20100421101805.GA1995-hikPBsva6T+Nj9Bq2fkWzw@public.gmane.org>

Hans J. Koch wrote:
> On Tue, Apr 20, 2010 at 06:05:01PM -0700, David Miller wrote:
>> From: Wolfgang Grandegger <wg-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>
>>> I think "dev_err(&intf->dev, ...)" should be used before
>>> SET_NETDEV_DEV(netdev, &intf->dev) is called. I see two "dev_err()"
>>> calls which need to be fixed.
>> Agreed.
> 
> Then it should probably look like this:

There are even more dev_err's to be fixed...

> From: "Hans J. Koch" <hjk-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
> To: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org,
> 	Sebastian Haas <haas-zsNKPWJ8Pib6hrUXjxyGrA@public.gmane.org>
> Subject: [PATCH] can: Fix possible NULL pointer dereference in ems_usb.c
> 
> In ems_usb_probe(), a pointer is dereferenced after making sure it is NULL...
> 
> This patch replaces netdev->dev.parent with &intf->dev in dev_err() calls to
> avoid this.
> 
> Signed-off-by: "Hans J. Koch" <hjk-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
> ---
>  drivers/net/can/usb/ems_usb.c |    8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> Index: net-next-2.6/drivers/net/can/usb/ems_usb.c
> ===================================================================
> --- net-next-2.6.orig/drivers/net/can/usb/ems_usb.c	2010-04-13 11:27:33.000000000 +0200
> +++ net-next-2.6/drivers/net/can/usb/ems_usb.c	2010-04-21 11:59:04.000000000 +0200
> @@ -1006,7 +1006,7 @@
>  
>  	netdev = alloc_candev(sizeof(struct ems_usb), MAX_TX_URBS);
>  	if (!netdev) {
> -		dev_err(netdev->dev.parent, "Couldn't alloc candev\n");
> +		dev_err(&intf->dev, "ems_usb: Couldn't alloc candev\n");
>  		return -ENOMEM;
>  	}
>  
> @@ -1036,20 +1036,20 @@
>  
>  	dev->intr_urb = usb_alloc_urb(0, GFP_KERNEL);
>  	if (!dev->intr_urb) {
> -		dev_err(netdev->dev.parent, "Couldn't alloc intr URB\n");
> +		dev_err(&intf->dev, "Couldn't alloc intr URB\n");
>  		goto cleanup_candev;
>  	}
>  
>  	dev->intr_in_buffer = kzalloc(INTR_IN_BUFFER_SIZE, GFP_KERNEL);
>  	if (!dev->intr_in_buffer) {
> -		dev_err(netdev->dev.parent, "Couldn't alloc Intr buffer\n");
> +		dev_err(&intf->dev, "Couldn't alloc Intr buffer\n");
>  		goto cleanup_intr_urb;
>  	}
>  
>  	dev->tx_msg_buffer = kzalloc(CPC_HEADER_SIZE +
>  				     sizeof(struct ems_cpc_msg), GFP_KERNEL);
>  	if (!dev->tx_msg_buffer) {
> -		dev_err(netdev->dev.parent, "Couldn't alloc Tx buffer\n");
> +		dev_err(&intf->dev, "Couldn't alloc Tx buffer\n");
>  		goto cleanup_intr_in_buffer;
>  	}

Acked-by: Wolfgang Grandegger <wg-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>

^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: Franco Fichtner @ 2010-04-21 11:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tom Herbert, Changli Gao, David Miller, netdev
In-Reply-To: <1271842773.7895.1692.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le mercredi 21 avril 2010 à 11:29 +0200, Franco Fichtner a écrit :
>> Tom Herbert wrote:
>>>> I thought about this for some time...
>>>>
>>>> Do we really need the port numbers here at all? A simple
>>>> addr1^addr2 can provide a good enough pointer for
>>>> distribution amongst CPUs.
>>>>
>>> What about a server behind a TCP proxy?  Also, need to minimize
>>> collisions for RPS to be effective
>> What about routers? What about loopback? This all boils down to
>> the same issue of obscuring IP data by "magical" means and then
>> reattaching functionality by reaching for upper layer information.
>> It is necessary in some cases, but it can cripple performance
>> for other cases.
>>
>> The interesting thing is you don't need to deal with collisions
>> while distributing amonst cpus at all. You just need to make sure
>> the distribution algorithm keeps every single flow attached to
>> the correct cpu.
>>
>> All of the actual flow hashing, tracking and whatever else the
>> traffic needs to go through can be done locally by cpu x which
>> helps a lot with load distribution and cache issues in mind. It
>> also helps locking because there is no global flow lookup table.
>> Oh, and it also reduces collisions with every cpu you add for
>> receiving.
>>
>> I work with a lot of plain office and ISP traffic in mind daily,
>> so please don't misunderstand my motivation here. I'd hate to
>> see poor performance in scenarios in which there is a lot of
>> potential improvement.
>>
> 
> I am a bit lost by this conversation.
> 
> Are you saying something is wrong with current schem ?
> 
> What are exactly your suggestions ?
> 

Hashing for cpu distribution should be as minimal as it could possibly
be with the least number operations needed to compute a hash, which 
normally involves touching one cold cache line (ip header). If you add 
the ports to your mix you have the luxury of solving static ip mappings,
but only for protocols that support it. Usage of the destination port
may also prove to be more or less pointless with a lot of http traffic,
because it's most likely static. And you add another potential cold
cache line access. For a lot of traffic scenarios, we'll have a bunch of
internal ips and the internet on the other side, so having a simple hash
based on a flavor if internal/external ip is more than enough to work
with for distribution. If the network card can provide a complete hash
all the better. Then this part of my point is void.

But then, hashing for cpu distribution should have nothing todo with
real flow tracking with lookup tables for let's say a firewall or dpi
application, because that data is only needed by local cpu and can
be gathered after distribution. Simply put, the lookup for the flow, if
it is needed, does not belong to distribution. It can be outsourced to
the destination cpu or just simply be ignored, if the application
doesn't care.

> Tom replied to you that a hash derived from (addr1 ^ addr2) would not
> work in situations where all flows goes from machine A to machine B
> (all hashes would be the same)
> 

Yes, I see the point. And all I'm just asking if it's wise to optimize
for this particular scenario.

If you spin this idea further beyond flow tracking, maybe an application
also needs to do some kind of user tracking by ip. Wouldn't it make
sense to have user based flows on a more local basis, not a global one
because ports will get in the way?

> Current hash is probably more than enough to cover all situations.
> 

I agree with this, but would like to point out the phrasing "probably
more than enough". :)

Franco

^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: Eric Dumazet @ 2010-04-21 11:16 UTC (permalink / raw)
  To: Franco Fichtner; +Cc: Tom Herbert, Changli Gao, David Miller, netdev
In-Reply-To: <4BCEDC29.6040303@lastsummer.de>

Le mercredi 21 avril 2010 à 13:06 +0200, Franco Fichtner a écrit :
> > 
> 
> Hashing for cpu distribution should be as minimal as it could possibly
> be with the least number operations needed to compute a hash, which 
> normally involves touching one cold cache line (ip header). If you add 
> the ports to your mix you have the luxury of solving static ip mappings,
> but only for protocols that support it. Usage of the destination port
> may also prove to be more or less pointless with a lot of http traffic,
> because it's most likely static. And you add another potential cold
> cache line access. For a lot of traffic scenarios, we'll have a bunch of
> internal ips and the internet on the other side, so having a simple hash
> based on a flavor if internal/external ip is more than enough to work
> with for distribution. If the network card can provide a complete hash
> all the better. Then this part of my point is void.
> 

But we already have to bring into our cpu cache one cache line, needed
in eth_type_trans() : (12+2 bytes of ethernet header)

TCP/UDP tuples are included into this cache line (64 bytes on current
popular arches)

Cost of rxhash is absolute noise into the picture.
A device provided hash, to be effective, would also make
eth_type_trans() call not done.




^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox