Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] tun: orphan an skb on tx
From: Jan Kiszka @ 2010-04-21 11:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: stable@kernel.org, David S. Miller, Herbert Xu, Paul Moore,
	David Woodhouse, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, qemu-devel
In-Reply-To: <20100421113557.GA31606@redhat.com>

Michael S. Tsirkin wrote:
> On Tue, Apr 13, 2010 at 05:59:44PM +0300, Michael S. Tsirkin wrote:
>> The following situation was observed in the field:
>> tap1 sends packets, tap2 does not consume them, as a result
>> tap1 can not be closed. This happens because
>> tun/tap devices can hang on to skbs undefinitely.
>>
>> As noted by Herbert, possible solutions include a timeout followed by a
>> copy/change of ownership of the skb, or always copying/changing
>> ownership if we're going into a hostile device.
>>
>> This patch implements the second approach.
>>
>> Note: one issue still remaining is that since skbs
>> keep reference to tun socket and tun socket has a
>> reference to tun device, we won't flush backlog,
>> instead simply waiting for all skbs to get transmitted.
>> At least this is not user-triggerable, and
>> this was not reported in practice, my assumption is
>> other devices besides tap complete an skb
>> within finite time after it has been queued.
>>
>> A possible solution for the second issue
>> would not to have socket reference the device,
>> instead, implement dev->destructor for tun, and
>> wait for all skbs to complete there, but this
>> needs some thought, probably too risky for 2.6.34.
>>
>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> Tested-by: Yan Vugenfirer <yvugenfi@redhat.com>
>>
>> ---
>>
>> Please review the below, and consider for 2.6.34,
>> and stable trees.
>>
>>  drivers/net/tun.c |    4 ++++
>>  1 files changed, 4 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 96c39bd..4326520 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>  		}
>>  	}
>>  
>> +	/* Orphan the skb - required as we might hang on to it
>> +	 * for indefinite time. */
>> +	skb_orphan(skb);
>> +
>>  	/* Enqueue packet */
>>  	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
>>  	dev->trans_start = jiffies;
>> -- 
>> 1.7.0.2.280.gc6f05
> 
> This is commit 0110d6f22f392f976e84ab49da1b42f85b64a3c5 in net-2.6
> Please cherry-pick this fix in stable kernels (2.6.32 and 2.6.33).
> 
> Thanks!
> 

Before I forget again:

Tested-by: Jan Kiszka <jan.kiszka@siemens.com>
(namely the field test of our customer)

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply

* Re: [PATCH] tun: orphan an skb on tx
From: Michael S. Tsirkin @ 2010-04-21 11:35 UTC (permalink / raw)
  To: stable
  Cc: David S. Miller, Herbert Xu, Paul Moore, David Woodhouse, netdev,
	linux-kernel, Jan Kiszka, qemu-devel
In-Reply-To: <20100413145944.GA7716@redhat.com>

On Tue, Apr 13, 2010 at 05:59:44PM +0300, Michael S. Tsirkin wrote:
> The following situation was observed in the field:
> tap1 sends packets, tap2 does not consume them, as a result
> tap1 can not be closed. This happens because
> tun/tap devices can hang on to skbs undefinitely.
> 
> As noted by Herbert, possible solutions include a timeout followed by a
> copy/change of ownership of the skb, or always copying/changing
> ownership if we're going into a hostile device.
> 
> This patch implements the second approach.
> 
> Note: one issue still remaining is that since skbs
> keep reference to tun socket and tun socket has a
> reference to tun device, we won't flush backlog,
> instead simply waiting for all skbs to get transmitted.
> At least this is not user-triggerable, and
> this was not reported in practice, my assumption is
> other devices besides tap complete an skb
> within finite time after it has been queued.
> 
> A possible solution for the second issue
> would not to have socket reference the device,
> instead, implement dev->destructor for tun, and
> wait for all skbs to complete there, but this
> needs some thought, probably too risky for 2.6.34.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Tested-by: Yan Vugenfirer <yvugenfi@redhat.com>
> 
> ---
> 
> Please review the below, and consider for 2.6.34,
> and stable trees.
> 
>  drivers/net/tun.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 96c39bd..4326520 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>  		}
>  	}
>  
> +	/* Orphan the skb - required as we might hang on to it
> +	 * for indefinite time. */
> +	skb_orphan(skb);
> +
>  	/* Enqueue packet */
>  	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
>  	dev->trans_start = jiffies;
> -- 
> 1.7.0.2.280.gc6f05

This is commit 0110d6f22f392f976e84ab49da1b42f85b64a3c5 in net-2.6
Please cherry-pick this fix in stable kernels (2.6.32 and 2.6.33).

Thanks!

-- 
MST

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Eric Dumazet @ 2010-04-21 11:27 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <20100421095812.GA14778@ioremap.net>

Here is the patch I use now and my test application is now able to open
and connect 1000000 sockets (ulimit -n 1000000)

Trick is bind_conflict() must refuse a socket to bind to a port on a non
null IP if another socket already uses same port on same IP.

Plus the previous patch sent (check a conflict before exiting the search
loop)

What do you think ?

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index e0a3e35..78cbc39 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -70,13 +70,17 @@ int inet_csk_bind_conflict(const struct sock *sk,
 		    (!sk->sk_bound_dev_if ||
 		     !sk2->sk_bound_dev_if ||
 		     sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
+			const __be32 sk2_rcv_saddr = inet_rcv_saddr(sk2);
+
 			if (!reuse || !sk2->sk_reuse ||
 			    sk2->sk_state == TCP_LISTEN) {
-				const __be32 sk2_rcv_saddr = inet_rcv_saddr(sk2);
 				if (!sk2_rcv_saddr || !sk_rcv_saddr ||
 				    sk2_rcv_saddr == sk_rcv_saddr)
 					break;
-			}
+			} else if (reuse && sk2->sk_reuse &&
+				   sk2_rcv_saddr &&
+				   sk2_rcv_saddr == sk_rcv_saddr)
+				break;
 		}
 	}
 	return node != NULL;
@@ -120,9 +124,11 @@ again:
 						smallest_size = tb->num_owners;
 						smallest_rover = rover;
 						if (atomic_read(&hashinfo->bsockets) > (high - low) + 1) {
-							spin_unlock(&head->lock);
-							snum = smallest_rover;
-							goto have_snum;
+							if (!inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
+								spin_unlock(&head->lock);
+								snum = smallest_rover;
+								goto have_snum;
+							}
 						}
 					}
 					goto next;



^ permalink raw reply related

* Re: [net-next,1/2] add iovnl netlink support
From: Arnd Bergmann @ 2010-04-21 11:26 UTC (permalink / raw)
  To: Scott Feldman; +Cc: davem, netdev, chrisw
In-Reply-To: <C7F354F4.29CC8%scofeldm@cisco.com>

On Tuesday 20 April 2010, Scott Feldman wrote:
> On 4/20/10 6:48 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
> 
> > On Monday 19 April 2010, Scott Feldman wrote:
> > 
> >> IOV netlink (IOVNL) adds I/O Virtualization control support to a master
> >> device (MD) netdev interface.  The MD (e.g. SR-IOV PF) will set/get
> >> control settings on behalf of a slave netdevice (e.g. SR-IOV VF).  The
> >> design allows for the case where master and slave are the
> >> same netdev interface.
> > 
> > What is the reason for controlling the slave device through the master,
> > rather than talking to the slave directly? The kernel always knows
> > the master for each slave, so it seems to me that this information
> > is redundant.
> 
> The interface would allow talking to the slave directly. In fact, that's the
> example with enic port-profile in patch 2/2.  But, it would be nice not to
> rule out the case where the master proxies slave control and the master is
> under exclusively controlled by hypervisor.

Not sure I understand. Do you mean the case where this code runs in
the hypervisor (e.g. KVM), or a different scerario with the setup being
done in a guest driver?

So far, I have assumed that we would always do the setup on the host
side, which always has access to both the master, and a slave proxy.
In particular, your interface requires access to the slave AFAICT,
because otherwise the VF IFNAME does not have any significance.

Take the case where you use network namespaces and put the VF into
a separate namespace. With your interface, the PF is still in the
root namespace, but passing both interface names in this interface
won't help you because they are never visible in the same namespace
(e.g. both might be named eth0 in their respective containers).

> > Is this new interface only for the case that you have a switch integrated
> > in the NIC, or also for the case where you do an LLDP and EDP exchange
> > with an adjacent bridge and put the device into VEPA mode?
> 
> All of the above.  Basing this on netlink give us flexibility to work with
> user-space mgmt tools or directly with kernel netdev as in the enic case.
> Not trying to make assumptions about where (user-space, kernel) and by which
> entity sources or sinks the netlink msg.

ok.

> > Did you consider making this code an extension to the DCB interface
> > instead of a separate one? What was the reason for your decision
> > to keep it separate?
> 
> Considered it but DCB interface is well defined for DCB and it didn't seem
> right gluing on interfaces not specified within DCB.  I agree that there is
> some overlap in the sense that both interface are used to configure a netdev
> with some properties interesting for the data center, but the DCB interface
> is for local setting of the properties on the host whereas iovnl is about
> pushing the setting of those properties to the network for policy-based
> control.
>
> > Also, do you expect your interface to be supported by dcbd/lldpad,
> > or is there a good reason to create a new tool for iovnl?
> 
> Lldpad supporting this interface would seem right, for those cases where
> lldpad is responsible for configuring the netdev.

I believe we meant different things here, because I misunderstood the
intention of the code. My question was whether lldpad would send the
netlink messages to iovnl, but from what you and Chris write, the
real idea was that both lldpad and kernel/iovnl can receive the
same messages, right?

> >> + * @IOV_ATTR_CLIENT_NAME: client name (NLA_NUL_STRING)
> >> + * @IOV_ATTR_HOST_UUID: host UUID (NLA_NUL_STRING)
> > 
> > Can you elaborate more on what these do? Who is the 'client' and the 'host'
> > in this case, and why do you need to identify them?
> 
> Those are optional and useful, for example, by the network mgmt tool for
> presenting a view such as:
> 
>     - blade 1/2                     // know by host uuid
>         - vm-rhel5-eth0             // client name
>             - port-profile: xyz
> 
> Something like that.

Hmm, but how do they get from the device driver to the the network
management tool then? Also, these are similar to the attributes
that are passed in the 802.1Qbg VDP protocol, but not compatible.
If the idea is use the same netlink protocol for both your internal
representation and for the standard based protocol, I think we should
make them compatible.

Instead of a string identifying the port profile, this needs to pass
a four byte field for a VSI type (3 bytes) and VSI manager ID (1 byte).

There is also a UUID in VDP, but it identifies the guest, not the host,
so this is really confusing.

VDP also needs a list of MAC addresses and VLAN IDs (normally only
one of each), but that would be separate from what you tell the adapter,
see below: 

> >> + * @IOV_ATTR_MAC_ADDR: device station MAC address (NLA_U8[6])
> > 
> > Just one mac address? What happens if we want to assign multiple mac
> > addresses to the VF later? Also, how is this defined specifically?
> > Will a SIOCSIFHWADDR with a different MAC address on the VF fail
> > later, or is this just the default value?
> 
> Depends on how the VF wants to handle this.  For our use-case with enic we
> only need the port-profile op so I'm not sure what the best design is for
> mac+vlan on a VF.  Looking for advise from folks like yourself.  If it's not
> needed, let's scratch it.

In order to make VEPA work, it's absolutely required to impose a hard limit
on what MAC+VLAN IDs are visible to the VF, because the switch identifies
the guest by those and forwards any frames to/from that address according
to the VSI type.

However, I feel that we should strictly separate the steps of configuring
the adapter from talking to the switch. When we do the VDP association
in user land, we still need to set up the VLAN and MAC configuration for
the VF through a kernel interface. If we ignore the port profile stuff
for a moment, your netlink interface looks like a good fit for that.

Since it seems what you really want to do is to do the exchange with the
switch from here, maybe the hardware configuration part should be moved
the DCB interface?

	Arnd

^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: Eric Dumazet @ 2010-04-21 11:16 UTC (permalink / raw)
  To: Franco Fichtner; +Cc: Tom Herbert, Changli Gao, David Miller, netdev
In-Reply-To: <4BCEDC29.6040303@lastsummer.de>

Le mercredi 21 avril 2010 à 13:06 +0200, Franco Fichtner a écrit :
> > 
> 
> Hashing for cpu distribution should be as minimal as it could possibly
> be with the least number operations needed to compute a hash, which 
> normally involves touching one cold cache line (ip header). If you add 
> the ports to your mix you have the luxury of solving static ip mappings,
> but only for protocols that support it. Usage of the destination port
> may also prove to be more or less pointless with a lot of http traffic,
> because it's most likely static. And you add another potential cold
> cache line access. For a lot of traffic scenarios, we'll have a bunch of
> internal ips and the internet on the other side, so having a simple hash
> based on a flavor if internal/external ip is more than enough to work
> with for distribution. If the network card can provide a complete hash
> all the better. Then this part of my point is void.
> 

But we already have to bring into our cpu cache one cache line, needed
in eth_type_trans() : (12+2 bytes of ethernet header)

TCP/UDP tuples are included into this cache line (64 bytes on current
popular arches)

Cost of rxhash is absolute noise into the picture.
A device provided hash, to be effective, would also make
eth_type_trans() call not done.




^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: Franco Fichtner @ 2010-04-21 11:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tom Herbert, Changli Gao, David Miller, netdev
In-Reply-To: <1271842773.7895.1692.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le mercredi 21 avril 2010 à 11:29 +0200, Franco Fichtner a écrit :
>> Tom Herbert wrote:
>>>> I thought about this for some time...
>>>>
>>>> Do we really need the port numbers here at all? A simple
>>>> addr1^addr2 can provide a good enough pointer for
>>>> distribution amongst CPUs.
>>>>
>>> What about a server behind a TCP proxy?  Also, need to minimize
>>> collisions for RPS to be effective
>> What about routers? What about loopback? This all boils down to
>> the same issue of obscuring IP data by "magical" means and then
>> reattaching functionality by reaching for upper layer information.
>> It is necessary in some cases, but it can cripple performance
>> for other cases.
>>
>> The interesting thing is you don't need to deal with collisions
>> while distributing amonst cpus at all. You just need to make sure
>> the distribution algorithm keeps every single flow attached to
>> the correct cpu.
>>
>> All of the actual flow hashing, tracking and whatever else the
>> traffic needs to go through can be done locally by cpu x which
>> helps a lot with load distribution and cache issues in mind. It
>> also helps locking because there is no global flow lookup table.
>> Oh, and it also reduces collisions with every cpu you add for
>> receiving.
>>
>> I work with a lot of plain office and ISP traffic in mind daily,
>> so please don't misunderstand my motivation here. I'd hate to
>> see poor performance in scenarios in which there is a lot of
>> potential improvement.
>>
> 
> I am a bit lost by this conversation.
> 
> Are you saying something is wrong with current schem ?
> 
> What are exactly your suggestions ?
> 

Hashing for cpu distribution should be as minimal as it could possibly
be with the least number operations needed to compute a hash, which 
normally involves touching one cold cache line (ip header). If you add 
the ports to your mix you have the luxury of solving static ip mappings,
but only for protocols that support it. Usage of the destination port
may also prove to be more or less pointless with a lot of http traffic,
because it's most likely static. And you add another potential cold
cache line access. For a lot of traffic scenarios, we'll have a bunch of
internal ips and the internet on the other side, so having a simple hash
based on a flavor if internal/external ip is more than enough to work
with for distribution. If the network card can provide a complete hash
all the better. Then this part of my point is void.

But then, hashing for cpu distribution should have nothing todo with
real flow tracking with lookup tables for let's say a firewall or dpi
application, because that data is only needed by local cpu and can
be gathered after distribution. Simply put, the lookup for the flow, if
it is needed, does not belong to distribution. It can be outsourced to
the destination cpu or just simply be ignored, if the application
doesn't care.

> Tom replied to you that a hash derived from (addr1 ^ addr2) would not
> work in situations where all flows goes from machine A to machine B
> (all hashes would be the same)
> 

Yes, I see the point. And all I'm just asking if it's wise to optimize
for this particular scenario.

If you spin this idea further beyond flow tracking, maybe an application
also needs to do some kind of user tracking by ip. Wouldn't it make
sense to have user based flows on a more local basis, not a global one
because ports will get in the way?

> Current hash is probably more than enough to cover all situations.
> 

I agree with this, but would like to point out the phrasing "probably
more than enough". :)

Franco

^ permalink raw reply

* Re: [PATCH] can: Fix possible NULL pointer dereference in ems_usb.c
From: Wolfgang Grandegger @ 2010-04-21 10:27 UTC (permalink / raw)
  To: Hans J. Koch
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA, David Miller
In-Reply-To: <20100421101805.GA1995-hikPBsva6T+Nj9Bq2fkWzw@public.gmane.org>

Hans J. Koch wrote:
> On Tue, Apr 20, 2010 at 06:05:01PM -0700, David Miller wrote:
>> From: Wolfgang Grandegger <wg-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>
>>> I think "dev_err(&intf->dev, ...)" should be used before
>>> SET_NETDEV_DEV(netdev, &intf->dev) is called. I see two "dev_err()"
>>> calls which need to be fixed.
>> Agreed.
> 
> Then it should probably look like this:

There are even more dev_err's to be fixed...

> From: "Hans J. Koch" <hjk-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
> To: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org,
> 	Sebastian Haas <haas-zsNKPWJ8Pib6hrUXjxyGrA@public.gmane.org>
> Subject: [PATCH] can: Fix possible NULL pointer dereference in ems_usb.c
> 
> In ems_usb_probe(), a pointer is dereferenced after making sure it is NULL...
> 
> This patch replaces netdev->dev.parent with &intf->dev in dev_err() calls to
> avoid this.
> 
> Signed-off-by: "Hans J. Koch" <hjk-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
> ---
>  drivers/net/can/usb/ems_usb.c |    8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> Index: net-next-2.6/drivers/net/can/usb/ems_usb.c
> ===================================================================
> --- net-next-2.6.orig/drivers/net/can/usb/ems_usb.c	2010-04-13 11:27:33.000000000 +0200
> +++ net-next-2.6/drivers/net/can/usb/ems_usb.c	2010-04-21 11:59:04.000000000 +0200
> @@ -1006,7 +1006,7 @@
>  
>  	netdev = alloc_candev(sizeof(struct ems_usb), MAX_TX_URBS);
>  	if (!netdev) {
> -		dev_err(netdev->dev.parent, "Couldn't alloc candev\n");
> +		dev_err(&intf->dev, "ems_usb: Couldn't alloc candev\n");
>  		return -ENOMEM;
>  	}
>  
> @@ -1036,20 +1036,20 @@
>  
>  	dev->intr_urb = usb_alloc_urb(0, GFP_KERNEL);
>  	if (!dev->intr_urb) {
> -		dev_err(netdev->dev.parent, "Couldn't alloc intr URB\n");
> +		dev_err(&intf->dev, "Couldn't alloc intr URB\n");
>  		goto cleanup_candev;
>  	}
>  
>  	dev->intr_in_buffer = kzalloc(INTR_IN_BUFFER_SIZE, GFP_KERNEL);
>  	if (!dev->intr_in_buffer) {
> -		dev_err(netdev->dev.parent, "Couldn't alloc Intr buffer\n");
> +		dev_err(&intf->dev, "Couldn't alloc Intr buffer\n");
>  		goto cleanup_intr_urb;
>  	}
>  
>  	dev->tx_msg_buffer = kzalloc(CPC_HEADER_SIZE +
>  				     sizeof(struct ems_cpc_msg), GFP_KERNEL);
>  	if (!dev->tx_msg_buffer) {
> -		dev_err(netdev->dev.parent, "Couldn't alloc Tx buffer\n");
> +		dev_err(&intf->dev, "Couldn't alloc Tx buffer\n");
>  		goto cleanup_intr_in_buffer;
>  	}

Acked-by: Wolfgang Grandegger <wg-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Eric Dumazet @ 2010-04-21 10:21 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <20100421095812.GA14778@ioremap.net>

Le mercredi 21 avril 2010 à 13:58 +0400, Evgeniy Polyakov a écrit :
> On Wed, Apr 21, 2010 at 11:02:15AM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> > Le mercredi 21 avril 2010 à 12:25 +0400, Evgeniy Polyakov a écrit :
> > 
> > > I believe this is a useful patch, but it addresses a different issue.
> > > This path should not fire up when we bind to single address.
> > 
> > Well, the real problem is that following sequence can happen :
> > 
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
> > setsockopt(5, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(5, {sa_family=AF_INET, sin_port=htons(34000), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6
> > setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(6, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(6, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
> > setsockopt(7, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(7, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(7, {sa_family=AF_INET, sin_port=htons(34001), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
> > setsockopt(8, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(8, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(8, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9
> > setsockopt(9, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(9, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(9, {sa_family=AF_INET, sin_port=htons(34000), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
> > setsockopt(10, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(10, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(10, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 11
> > setsockopt(11, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(11, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> > getsockname(11, {sa_family=AF_INET, sin_port=htons(34001), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> > 
> > 
> > Note ports are given several times for different sockets.
> > 
> > So several sockets are 'bound' to same IP:port values
> 
> That's kind of weird, since address is the same. I'm curious why bind
> conflict does not resolve that. Likely because socket is not yet
> connected.
> 
> > At connect() time, we refuse and say address is not available.
> 
> Yup. But isn't it a different problem from what Gaspar observed?
> Netcat connects after bind so it should not be an issue here.
> 

Well, just replace the "#if 0" in program, and i'll connect, and
reproduce same problem

conflict only checks :

                        if (!reuse || !sk2->sk_reuse ||
                            sk2->sk_state == TCP_LISTEN) {
                                const __be32 sk2_rcv_saddr =
inet_rcv_saddr(sk2);
                                if (!sk2_rcv_saddr || !sk_rcv_saddr ||
                                    sk2_rcv_saddr == sk_rcv_saddr)
                                        break;
                        }


ie it wont re-allocate a port only if a listener uses this port.

If sk2 is connected or close, it doesnt matter, and port can be reused.




^ permalink raw reply

* Re: [PATCH] can: Fix possible NULL pointer dereference in ems_usb.c
From: Hans J. Koch @ 2010-04-21 10:18 UTC (permalink / raw)
  To: David Miller
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA, wg-5Yr1BZd7O62+XT7JhA+gdA
In-Reply-To: <20100420.180501.08643239.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

On Tue, Apr 20, 2010 at 06:05:01PM -0700, David Miller wrote:
> From: Wolfgang Grandegger <wg-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>
> > I think "dev_err(&intf->dev, ...)" should be used before
> > SET_NETDEV_DEV(netdev, &intf->dev) is called. I see two "dev_err()"
> > calls which need to be fixed.
> 
> Agreed.

Then it should probably look like this:


From: "Hans J. Koch" <hjk-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
To: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org,
	Sebastian Haas <haas-zsNKPWJ8Pib6hrUXjxyGrA@public.gmane.org>
Subject: [PATCH] can: Fix possible NULL pointer dereference in ems_usb.c

In ems_usb_probe(), a pointer is dereferenced after making sure it is NULL...

This patch replaces netdev->dev.parent with &intf->dev in dev_err() calls to
avoid this.

Signed-off-by: "Hans J. Koch" <hjk-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
---
 drivers/net/can/usb/ems_usb.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: net-next-2.6/drivers/net/can/usb/ems_usb.c
===================================================================
--- net-next-2.6.orig/drivers/net/can/usb/ems_usb.c	2010-04-13 11:27:33.000000000 +0200
+++ net-next-2.6/drivers/net/can/usb/ems_usb.c	2010-04-21 11:59:04.000000000 +0200
@@ -1006,7 +1006,7 @@
 
 	netdev = alloc_candev(sizeof(struct ems_usb), MAX_TX_URBS);
 	if (!netdev) {
-		dev_err(netdev->dev.parent, "Couldn't alloc candev\n");
+		dev_err(&intf->dev, "ems_usb: Couldn't alloc candev\n");
 		return -ENOMEM;
 	}
 
@@ -1036,20 +1036,20 @@
 
 	dev->intr_urb = usb_alloc_urb(0, GFP_KERNEL);
 	if (!dev->intr_urb) {
-		dev_err(netdev->dev.parent, "Couldn't alloc intr URB\n");
+		dev_err(&intf->dev, "Couldn't alloc intr URB\n");
 		goto cleanup_candev;
 	}
 
 	dev->intr_in_buffer = kzalloc(INTR_IN_BUFFER_SIZE, GFP_KERNEL);
 	if (!dev->intr_in_buffer) {
-		dev_err(netdev->dev.parent, "Couldn't alloc Intr buffer\n");
+		dev_err(&intf->dev, "Couldn't alloc Intr buffer\n");
 		goto cleanup_intr_urb;
 	}
 
 	dev->tx_msg_buffer = kzalloc(CPC_HEADER_SIZE +
 				     sizeof(struct ems_cpc_msg), GFP_KERNEL);
 	if (!dev->tx_msg_buffer) {
-		dev_err(netdev->dev.parent, "Couldn't alloc Tx buffer\n");
+		dev_err(&intf->dev, "Couldn't alloc Tx buffer\n");
 		goto cleanup_intr_in_buffer;
 	}

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Evgeniy Polyakov @ 2010-04-21  9:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <1271840535.7895.1612.camel@edumazet-laptop>

On Wed, Apr 21, 2010 at 11:02:15AM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> Le mercredi 21 avril 2010 à 12:25 +0400, Evgeniy Polyakov a écrit :
> 
> > I believe this is a useful patch, but it addresses a different issue.
> > This path should not fire up when we bind to single address.
> 
> Well, the real problem is that following sequence can happen :
> 
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
> setsockopt(5, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(5, {sa_family=AF_INET, sin_port=htons(34000), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6
> setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(6, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(6, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
> setsockopt(7, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(7, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(7, {sa_family=AF_INET, sin_port=htons(34001), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
> setsockopt(8, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(8, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(8, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9
> setsockopt(9, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(9, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(9, {sa_family=AF_INET, sin_port=htons(34000), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
> setsockopt(10, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(10, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(10, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 11
> setsockopt(11, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(11, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
> getsockname(11, {sa_family=AF_INET, sin_port=htons(34001), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
> 
> 
> Note ports are given several times for different sockets.
> 
> So several sockets are 'bound' to same IP:port values

That's kind of weird, since address is the same. I'm curious why bind
conflict does not resolve that. Likely because socket is not yet
connected.

> At connect() time, we refuse and say address is not available.

Yup. But isn't it a different problem from what Gaspar observed?
Netcat connects after bind so it should not be an issue here.

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: ipv6: Fix tcp_v6_send_response checksum
From: Cosmin Ratiu @ 2010-04-21  9:42 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, netdev
In-Reply-To: <20100421.004922.193694715.davem@davemloft.net>

On Wednesday 21 April 2010 10:49:22 David Miller wrote:
> I put this into net-2.6 and modified the commit message since, as we
> found, this incorrect transport header reset was added there to fix
> IPSEC.
> 
> I'm convinced that Cosmin didn't test the patch he actually sent out
> 
> :-)

I apologize. I've tested the patch on 2.6.7 (tcp_v6_send_ack then) and then 
redid the patch half-asleep against net-next.

Cosmin.

^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: Eric Dumazet @ 2010-04-21  9:39 UTC (permalink / raw)
  To: Franco Fichtner; +Cc: Tom Herbert, Changli Gao, David Miller, netdev
In-Reply-To: <4BCEC593.4000007@lastsummer.de>

Le mercredi 21 avril 2010 à 11:29 +0200, Franco Fichtner a écrit :
> Tom Herbert wrote:
> >> I thought about this for some time...
> >>
> >> Do we really need the port numbers here at all? A simple
> >> addr1^addr2 can provide a good enough pointer for
> >> distribution amongst CPUs.
> >>
> > 
> > What about a server behind a TCP proxy?  Also, need to minimize
> > collisions for RPS to be effective
> 
> What about routers? What about loopback? This all boils down to
> the same issue of obscuring IP data by "magical" means and then
> reattaching functionality by reaching for upper layer information.
> It is necessary in some cases, but it can cripple performance
> for other cases.
> 
> The interesting thing is you don't need to deal with collisions
> while distributing amonst cpus at all. You just need to make sure
> the distribution algorithm keeps every single flow attached to
> the correct cpu.
> 
> All of the actual flow hashing, tracking and whatever else the
> traffic needs to go through can be done locally by cpu x which
> helps a lot with load distribution and cache issues in mind. It
> also helps locking because there is no global flow lookup table.
> Oh, and it also reduces collisions with every cpu you add for
> receiving.
> 
> I work with a lot of plain office and ISP traffic in mind daily,
> so please don't misunderstand my motivation here. I'd hate to
> see poor performance in scenarios in which there is a lot of
> potential improvement.
> 

I am a bit lost by this conversation.

Are you saying something is wrong with current schem ?

What are exactly your suggestions ?

Tom replied to you that a hash derived from (addr1 ^ addr2) would not
work in situations where all flows goes from machine A to machine B
(all hashes would be the same)

Current hash is probably more than enough to cover all situations.






^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: Franco Fichtner @ 2010-04-21  9:29 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Eric Dumazet, Changli Gao, David Miller, netdev
In-Reply-To: <i2q65634d661004200809ydba56c35g785ba51900c379f7@mail.gmail.com>

Tom Herbert wrote:
>> I thought about this for some time...
>>
>> Do we really need the port numbers here at all? A simple
>> addr1^addr2 can provide a good enough pointer for
>> distribution amongst CPUs.
>>
> 
> What about a server behind a TCP proxy?  Also, need to minimize
> collisions for RPS to be effective

What about routers? What about loopback? This all boils down to
the same issue of obscuring IP data by "magical" means and then
reattaching functionality by reaching for upper layer information.
It is necessary in some cases, but it can cripple performance
for other cases.

The interesting thing is you don't need to deal with collisions
while distributing amonst cpus at all. You just need to make sure
the distribution algorithm keeps every single flow attached to
the correct cpu.

All of the actual flow hashing, tracking and whatever else the
traffic needs to go through can be done locally by cpu x which
helps a lot with load distribution and cache issues in mind. It
also helps locking because there is no global flow lookup table.
Oh, and it also reduces collisions with every cpu you add for
receiving.

I work with a lot of plain office and ISP traffic in mind daily,
so please don't misunderstand my motivation here. I'd hate to
see poor performance in scenarios in which there is a lot of
potential improvement.

Franco

^ permalink raw reply

* [PATCH 2/2][RESEND] ehea: fix possible DLPAR/mem deadlock
From: Thomas Klein @ 2010-04-21  9:11 UTC (permalink / raw)
  To: davem; +Cc: netdev, linuxppc-dev, linux-kernel, themann

Force serialization of userspace-triggered DLPAR/mem operations

Signed-off-by: Thomas Klein <tklein@de.ibm.com>
---

Patch created against net-2.6

diff -Nurp net-2.6.orig/drivers/net/ehea/ehea.h net-2.6/drivers/net/ehea/ehea.h
--- net-2.6.orig/drivers/net/ehea/ehea.h	2010-04-21 10:23:21.000000000 +0200
+++ net-2.6/drivers/net/ehea/ehea.h	2010-04-21 10:42:14.000000000 +0200
@@ -40,7 +40,7 @@
 #include <asm/io.h>
 
 #define DRV_NAME	"ehea"
-#define DRV_VERSION	"EHEA_0102"
+#define DRV_VERSION	"EHEA_0103"
 
 /* eHEA capability flags */
 #define DLPAR_PORT_ADD_REM 1
diff -Nurp net-2.6.orig/drivers/net/ehea/ehea_main.c net-2.6/drivers/net/ehea/ehea_main.c
--- net-2.6.orig/drivers/net/ehea/ehea_main.c	2010-04-21 10:43:14.000000000 +0200
+++ net-2.6/drivers/net/ehea/ehea_main.c	2010-04-21 10:42:14.000000000 +0200
@@ -2889,7 +2889,6 @@ static void ehea_rereg_mrs(struct work_s
 	int ret, i;
 	struct ehea_adapter *adapter;
 
-	mutex_lock(&dlpar_mem_lock);
 	ehea_info("LPAR memory changed - re-initializing driver");
 
 	list_for_each_entry(adapter, &adapter_list, list)
@@ -2959,7 +2958,6 @@ static void ehea_rereg_mrs(struct work_s
 		}
 	ehea_info("re-initializing driver complete");
 out:
-	mutex_unlock(&dlpar_mem_lock);
 	return;
 }
 
@@ -3542,7 +3540,14 @@ void ehea_crash_handler(void)
 static int ehea_mem_notifier(struct notifier_block *nb,
                              unsigned long action, void *data)
 {
+	int ret = NOTIFY_BAD;
 	struct memory_notify *arg = data;
+
+	if (!mutex_trylock(&dlpar_mem_lock)) {
+		ehea_info("ehea_mem_notifier must not be called parallelized");
+		goto out;
+	}
+
 	switch (action) {
 	case MEM_CANCEL_OFFLINE:
 		ehea_info("memory offlining canceled");
@@ -3551,14 +3556,14 @@ static int ehea_mem_notifier(struct noti
 		ehea_info("memory is going online");
 		set_bit(__EHEA_STOP_XFER, &ehea_driver_flags);
 		if (ehea_add_sect_bmap(arg->start_pfn, arg->nr_pages))
-			return NOTIFY_BAD;
+			goto out_unlock;
 		ehea_rereg_mrs(NULL);
 		break;
 	case MEM_GOING_OFFLINE:
 		ehea_info("memory is going offline");
 		set_bit(__EHEA_STOP_XFER, &ehea_driver_flags);
 		if (ehea_rem_sect_bmap(arg->start_pfn, arg->nr_pages))
-			return NOTIFY_BAD;
+			goto out_unlock;
 		ehea_rereg_mrs(NULL);
 		break;
 	default:
@@ -3566,8 +3571,12 @@ static int ehea_mem_notifier(struct noti
 	}
 
 	ehea_update_firmware_handles();
+	ret = NOTIFY_OK;
 
-	return NOTIFY_OK;
+out_unlock:
+	mutex_unlock(&dlpar_mem_lock);
+out:
+	return ret;
 }
 
 static struct notifier_block ehea_mem_nb = {

^ permalink raw reply

* [PATCH 1/2][RESEND] ehea: error handling improvement
From: Thomas Klein @ 2010-04-21  9:10 UTC (permalink / raw)
  To: davem; +Cc: netdev, linuxppc-dev, linux-kernel, themann

Reset a port's resources only if they're actually in an error state

Signed-off-by: Thomas Klein <tklein@de.ibm.com>
---

Patch created against net-2.6

diff -Nurp net-2.6.orig/drivers/net/ehea/ehea_main.c net-2.6/drivers/net/ehea/ehea_main.c
--- net-2.6.orig/drivers/net/ehea/ehea_main.c	2010-04-21 10:23:21.000000000 +0200
+++ net-2.6/drivers/net/ehea/ehea_main.c	2010-04-21 10:41:21.000000000 +0200
@@ -791,11 +791,17 @@ static struct ehea_cqe *ehea_proc_cqes(s
 		cqe_counter++;
 		rmb();
 		if (cqe->status & EHEA_CQE_STAT_ERR_MASK) {
-			ehea_error("Send Completion Error: Resetting port");
+			ehea_error("Bad send completion status=0x%04X",
+				   cqe->status);
+
 			if (netif_msg_tx_err(pr->port))
 				ehea_dump(cqe, sizeof(*cqe), "Send CQE");
-			ehea_schedule_port_reset(pr->port);
-			break;
+
+			if (cqe->status & EHEA_CQE_STAT_RESET_MASK) {
+				ehea_error("Resetting port");
+				ehea_schedule_port_reset(pr->port);
+				break;
+			}
 		}
 
 		if (netif_msg_tx_done(pr->port))
@@ -901,6 +907,8 @@ static irqreturn_t ehea_qp_aff_irq_handl
 	struct ehea_eqe *eqe;
 	struct ehea_qp *qp;
 	u32 qp_token;
+	u64 resource_type, aer, aerr;
+	int reset_port = 0;
 
 	eqe = ehea_poll_eq(port->qp_eq);
 
@@ -910,11 +918,24 @@ static irqreturn_t ehea_qp_aff_irq_handl
 			   eqe->entry, qp_token);
 
 		qp = port->port_res[qp_token].qp;
-		ehea_error_data(port->adapter, qp->fw_handle);
+
+		resource_type = ehea_error_data(port->adapter, qp->fw_handle,
+						&aer, &aerr);
+
+		if (resource_type == EHEA_AER_RESTYPE_QP) {
+			if ((aer & EHEA_AER_RESET_MASK) ||
+			    (aerr & EHEA_AERR_RESET_MASK))
+				 reset_port = 1;
+		} else
+			reset_port = 1;   /* Reset in case of CQ or EQ error */
+
 		eqe = ehea_poll_eq(port->qp_eq);
 	}
 
-	ehea_schedule_port_reset(port);
+	if (reset_port) {
+		ehea_error("Resetting port");
+		ehea_schedule_port_reset(port);
+	}
 
 	return IRQ_HANDLED;
 }
diff -Nurp net-2.6.orig/drivers/net/ehea/ehea_qmr.c net-2.6/drivers/net/ehea/ehea_qmr.c
--- net-2.6.orig/drivers/net/ehea/ehea_qmr.c	2010-04-21 10:23:21.000000000 +0200
+++ net-2.6/drivers/net/ehea/ehea_qmr.c	2010-04-21 10:41:21.000000000 +0200
@@ -229,14 +229,14 @@ u64 ehea_destroy_cq_res(struct ehea_cq *
 
 int ehea_destroy_cq(struct ehea_cq *cq)
 {
-	u64 hret;
+	u64 hret, aer, aerr;
 	if (!cq)
 		return 0;
 
 	hcp_epas_dtor(&cq->epas);
 	hret = ehea_destroy_cq_res(cq, NORMAL_FREE);
 	if (hret == H_R_STATE) {
-		ehea_error_data(cq->adapter, cq->fw_handle);
+		ehea_error_data(cq->adapter, cq->fw_handle, &aer, &aerr);
 		hret = ehea_destroy_cq_res(cq, FORCE_FREE);
 	}
 
@@ -357,7 +357,7 @@ u64 ehea_destroy_eq_res(struct ehea_eq *
 
 int ehea_destroy_eq(struct ehea_eq *eq)
 {
-	u64 hret;
+	u64 hret, aer, aerr;
 	if (!eq)
 		return 0;
 
@@ -365,7 +365,7 @@ int ehea_destroy_eq(struct ehea_eq *eq)
 
 	hret = ehea_destroy_eq_res(eq, NORMAL_FREE);
 	if (hret == H_R_STATE) {
-		ehea_error_data(eq->adapter, eq->fw_handle);
+		ehea_error_data(eq->adapter, eq->fw_handle, &aer, &aerr);
 		hret = ehea_destroy_eq_res(eq, FORCE_FREE);
 	}
 
@@ -540,7 +540,7 @@ u64 ehea_destroy_qp_res(struct ehea_qp *
 
 int ehea_destroy_qp(struct ehea_qp *qp)
 {
-	u64 hret;
+	u64 hret, aer, aerr;
 	if (!qp)
 		return 0;
 
@@ -548,7 +548,7 @@ int ehea_destroy_qp(struct ehea_qp *qp)
 
 	hret = ehea_destroy_qp_res(qp, NORMAL_FREE);
 	if (hret == H_R_STATE) {
-		ehea_error_data(qp->adapter, qp->fw_handle);
+		ehea_error_data(qp->adapter, qp->fw_handle, &aer, &aerr);
 		hret = ehea_destroy_qp_res(qp, FORCE_FREE);
 	}
 
@@ -986,42 +986,45 @@ void print_error_data(u64 *data)
 	if (length > EHEA_PAGESIZE)
 		length = EHEA_PAGESIZE;
 
-	if (type == 0x8) /* Queue Pair */
+	if (type == EHEA_AER_RESTYPE_QP)
 		ehea_error("QP (resource=%llX) state: AER=0x%llX, AERR=0x%llX, "
 			   "port=%llX", resource, data[6], data[12], data[22]);
-
-	if (type == 0x4) /* Completion Queue */
+	else if (type == EHEA_AER_RESTYPE_CQ)
 		ehea_error("CQ (resource=%llX) state: AER=0x%llX", resource,
 			   data[6]);
-
-	if (type == 0x3) /* Event Queue */
+	else if (type == EHEA_AER_RESTYPE_EQ)
 		ehea_error("EQ (resource=%llX) state: AER=0x%llX", resource,
 			   data[6]);
 
 	ehea_dump(data, length, "error data");
 }
 
-void ehea_error_data(struct ehea_adapter *adapter, u64 res_handle)
+u64 ehea_error_data(struct ehea_adapter *adapter, u64 res_handle,
+		    u64 *aer, u64 *aerr)
 {
 	unsigned long ret;
 	u64 *rblock;
+	u64 type = 0;
 
 	rblock = (void *)get_zeroed_page(GFP_KERNEL);
 	if (!rblock) {
 		ehea_error("Cannot allocate rblock memory.");
-		return;
+		goto out;
 	}
 
-	ret = ehea_h_error_data(adapter->handle,
-				res_handle,
-				rblock);
+	ret = ehea_h_error_data(adapter->handle, res_handle, rblock);
 
-	if (ret == H_R_STATE)
-		ehea_error("No error data is available: %llX.", res_handle);
-	else if (ret == H_SUCCESS)
+	if (ret == H_SUCCESS) {
+		type = EHEA_BMASK_GET(ERROR_DATA_TYPE, rblock[2]);
+		*aer = rblock[6];
+		*aerr = rblock[12];
 		print_error_data(rblock);
-	else
+	} else if (ret == H_R_STATE) {
+		ehea_error("No error data available: %llX.", res_handle);
+	} else
 		ehea_error("Error data could not be fetched: %llX", res_handle);
 
 	free_page((unsigned long)rblock);
+out:
+	return type;
 }
diff -Nurp net-2.6.orig/drivers/net/ehea/ehea_qmr.h net-2.6/drivers/net/ehea/ehea_qmr.h
--- net-2.6.orig/drivers/net/ehea/ehea_qmr.h	2010-04-21 10:23:21.000000000 +0200
+++ net-2.6/drivers/net/ehea/ehea_qmr.h	2010-04-21 10:41:21.000000000 +0200
@@ -154,6 +154,9 @@ struct ehea_rwqe {
 #define EHEA_CQE_STAT_ERR_IP       0x2000
 #define EHEA_CQE_STAT_ERR_CRC      0x1000
 
+/* Defines which bad send cqe stati lead to a port reset */
+#define EHEA_CQE_STAT_RESET_MASK   0x0002
+
 struct ehea_cqe {
 	u64 wr_id;		/* work request ID from WQE */
 	u8 type;
@@ -187,6 +190,14 @@ struct ehea_cqe {
 #define EHEA_EQE_SM_MECH_NUMBER  EHEA_BMASK_IBM(48, 55)
 #define EHEA_EQE_SM_PORT_NUMBER  EHEA_BMASK_IBM(56, 63)
 
+#define EHEA_AER_RESTYPE_QP  0x8
+#define EHEA_AER_RESTYPE_CQ  0x4
+#define EHEA_AER_RESTYPE_EQ  0x3
+
+/* Defines which affiliated errors lead to a port reset */
+#define EHEA_AER_RESET_MASK   0xFFFFFFFFFEFFFFFFULL
+#define EHEA_AERR_RESET_MASK  0xFFFFFFFFFFFFFFFFULL
+
 struct ehea_eqe {
 	u64 entry;
 };
@@ -379,7 +390,8 @@ int ehea_gen_smr(struct ehea_adapter *ad
 
 int ehea_rem_mr(struct ehea_mr *mr);
 
-void ehea_error_data(struct ehea_adapter *adapter, u64 res_handle);
+u64 ehea_error_data(struct ehea_adapter *adapter, u64 res_handle,
+		    u64 *aer, u64 *aerr);
 
 int ehea_add_sect_bmap(unsigned long pfn, unsigned long nr_pages);
 int ehea_rem_sect_bmap(unsigned long pfn, unsigned long nr_pages);

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Eric Dumazet @ 2010-04-21  9:02 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <20100421082559.GA32475@ioremap.net>

Le mercredi 21 avril 2010 à 12:25 +0400, Evgeniy Polyakov a écrit :

> I believe this is a useful patch, but it addresses a different issue.
> This path should not fire up when we bind to single address.

Well, the real problem is that following sequence can happen :

socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
setsockopt(5, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(34000), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6
setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(6, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(6, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
setsockopt(7, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(7, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(7, {sa_family=AF_INET, sin_port=htons(34001), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
setsockopt(8, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(8, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(8, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9
setsockopt(9, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(9, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(9, {sa_family=AF_INET, sin_port=htons(34000), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
setsockopt(10, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(10, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(10, {sa_family=AF_INET, sin_port=htons(34002), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 11
setsockopt(11, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(11, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(11, {sa_family=AF_INET, sin_port=htons(34001), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0


Note ports are given several times for different sockets.

So several sockets are 'bound' to same IP:port values

At connect() time, we refuse and say address is not available.



Following program to demonstrate the problem.

First time, launch it with an extra agument to setup ip aliases and ip_local_port_range




#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <errno.h>

int listenfd;
int on = 1;

void listener()
{
	if (fork())
		return;

	while (1) {
		struct sockaddr_in addr;
		socklen_t len = sizeof(addr);
		int fd = accept(listenfd, (struct sockaddr *)&addr, &len);
	}
}

int main(int argc, char *argv[])
{
	int i, port, total = 0;
	char cmd[128];
	struct sockaddr_in addr;
	socklen_t len;

	if (argc > 1) {
		for (i = 2; i < 8; i++) {
			sprintf(cmd, "ip addr add 127.0.0.%d/8 dev lo 2>/dev/null", i);
			system(cmd);
		}
		system("echo '34000 34002' >/proc/sys/net/ipv4/ip_local_port_range");
	}
	listenfd = socket(AF_INET, SOCK_STREAM, 0);
	setsockopt(listenfd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(4444);
	if (bind(listenfd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
		perror("bind");
		return 1;
	}
	listen(listenfd, 10);
	listener();

	for (i = 2; i < 8; i++) {
		for (port = 34000; port < 34010; port++) {
			int fd = socket(AF_INET, SOCK_STREAM, 0);
			if (fd == -1) {
				fprintf(stderr, "Could not open socket, errno=%d\n", errno);
				goto end;
			}
			setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
			addr.sin_addr.s_addr = htonl(0x7f000000 + i);
//			addr.sin_port = htons(port);
			addr.sin_port = 0;
			if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
				fprintf(stderr, "Could not bind()\n");
				goto end;
			}
			len = sizeof(addr);
			getsockname(fd, (struct sockaddr *)&addr, &len);
#if 0
			addr.sin_addr.s_addr = htonl(0x7f000001);
			addr.sin_port = htons(4444);
			if ((total < 10) && (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1)) {
				len = sizeof(addr);
				getsockname(fd, (struct sockaddr *)&addr, &len);
				fprintf(stderr, "Could not connect()\n");
				goto end;
			}
#endif
			total++;
		}
	}
end:
	printf("i=127.0.0.%d port=%d (total=%d)\n", i, port, total);
	pause();
}



^ permalink raw reply

* Re: ipv6: Fix tcp_v6_send_response checksum
From: David Miller @ 2010-04-21  8:58 UTC (permalink / raw)
  To: herbert; +Cc: netdev, cratiu
In-Reply-To: <20100421.004922.193694715.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Wed, 21 Apr 2010 00:49:22 -0700 (PDT)

> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Wed, 21 Apr 2010 15:07:37 +0800
> 
>> ipv6: Fix tcp_v6_send_response checksum
> 
> I put this into net-2.6 and modified the commit message since, as we
> found, this incorrect transport header reset was added there to fix
> IPSEC.

Ok, even with this pulled into net-next-2.6 the ipv6 tcp response
checksums are still bad.  The following fix is necessary as
well:

--------------------
tcp: Fix ipv6 checksumming on response packets for real.

Commit 6651ffc8e8bdd5fb4b7d1867c6cfebb4f309512c
("ipv6: Fix tcp_v6_send_response transport header setting.")
fixed one half of why ipv6 tcp response checksums were
invalid, but it's not the whole story.

If we're going to use CHECKSUM_PARTIAL for these things (which we are
since commit 2e8e18ef52e7dd1af0a3bd1f7d990a1d0b249586 "tcp: Set
CHECKSUM_UNNECESSARY in tcp_init_nondata_skb"), we can't be setting
buff->csum as we always have been here in tcp_v6_send_response.  We
need to leave it at zero.

Kill that line and checksums are good again.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv6/tcp_ipv6.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 78480f4..5d2e430 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1050,8 +1050,6 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
 	}
 #endif

-	buff->csum = csum_partial(t1, tot_len, 0);
-
 	memset(&fl, 0, sizeof(fl));
 	ipv6_addr_copy(&fl.fl6_dst, &ipv6_hdr(skb)->saddr);
 	ipv6_addr_copy(&fl.fl6_src, &ipv6_hdr(skb)->daddr);
-- 
1.7.0.4

^ permalink raw reply related

* Re: 2.6.34-rc5: Reported regressions from 2.6.33
From: Jerome Glisse @ 2010-04-21  8:57 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nick Bowler, Kernel Testers List, Linux SCSI List,
	Network Development, Linux Wireless List,
	Linux Kernel Mailing List, Linux ACPI, Andrew Morton, DRI,
	Linus Torvalds, Linux PM List, Maciej Rutecki
In-Reply-To: <201004210715.38621.rjw@sisk.pl>

On Wed, Apr 21, 2010 at 07:15:38AM +0200, Rafael J. Wysocki wrote:
> On Tuesday 20 April 2010, Nick Bowler wrote:
> > On 05:15 Tue 20 Apr     , Rafael J. Wysocki wrote:
> > > If you know of any other unresolved regressions from 2.6.33, please let us
> > > know either and we'll add them to the list.  Also, please let us know
> > > if any of the entries below are invalid.
> > 
> > Please list these two similar regressions from 2.6.33 in the r600 DRM:
> > 
> >  * r600 CS checker rejects GL_DEPTH_TEST w/o depth buffer:
> >            https://bugs.freedesktop.org/show_bug.cgi?id=27571
> > 
> >  * r600 CS checker rejects narrow FBO renderbuffers:
> >            https://bugs.freedesktop.org/show_bug.cgi?id=27609
> 
> Do you want to me to add them as one entry or as two separate bugs?
> 
> Rafael
> 

First one is userspace bug, i need to look into the second one.
ie we were lucky the hw didn't lockup without depth buffer and
depth test enabled.

Cheers,
Jerome

^ permalink raw reply

* Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-21  8:35 UTC (permalink / raw)
  To: Xin, Xiaohui
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu,
	jdike@linux.intel.com, davem@davemloft.net
In-Reply-To: <F2E9EB7348B8264F86B6AB8151CE2D79026FAB1E5F@shsmsx502.ccr.corp.intel.com>

On Tue, Apr 20, 2010 at 10:21:55AM +0800, Xin, Xiaohui wrote:
> Michael,
> 
> >>>>>> What we have not done yet:
> >>>>>> 	packet split support
> >>>>>> 
> >>>>>What does this mean, exactly?
> >>>> We can support 1500MTU, but for jumbo frame, since vhost driver before don't 
> >>>>support mergeable buffer, we cannot try it for multiple sg.
> >>>> 
> >>>I do not see why, vhost currently supports 64K buffers with indirect
> >>>descriptors.
> >>> 
> >> The receive_skb() in guest virtio-net driver will merge the multiple sg to skb frags, how >>can indirect descriptors to that?
> 
> >See add_recvbuf_big.
> 
> I don't mean this, it's for buffer submission. I mean when packet is received, in receive_buf(), mergeable buffer knows which pages received can be hooked in skb frags, it's receive_mergeable() which do this.
> 
> When a NIC driver supports packet split mode, then each ring descriptor contains a skb and a page. When packet is received, if the status is not EOP, then hook the page of the next descriptor to the prev skb. We don't how many frags belongs to one skb. So when guest submit buffers, it should submit multiple pages, and when receive, the guest should know which pages are belongs to one skb and hook them together. I think receive_mergeable() can do this, but I don't see how big->packets handle this. May I miss something here?
> 
> Thanks
> Xiaohui 


Yes, I think this packet split mode probably maps well to mergeable buffer
support. Note that
1. Not all devices support large packets in this way, others might map
   to indirect buffers better
   So we have to figure out how migration is going to work
2. It's up to guest driver whether to enable features such as
   mergeable buffers and indirect buffers
   So we have to figure out how to notify guest which mode
   is optimal for a given device
3. We don't want to depend on jumbo frames for decent performance
   So we probably should support GSO/GRO

-- 
MST

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Evgeniy Polyakov @ 2010-04-21  8:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <1271828799.7895.1287.camel@edumazet-laptop>

On Wed, Apr 21, 2010 at 07:46:39AM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> 		if (atomic_read(&hashinfo->bsockets) > (high-low)+1) {
> 			spin_unlock(&head->lock);
> 			snum = smallest_rover;   // We select this, without checking for
> conflicts.
> 			goto have_snum;
> 		}
> 	}
> 
> 
> Then we goto to "have_snum" label
> 
> Then we realize (selected_IP, randomport) is already in use.
> End of first try.
> 
> We redo the thing 5 times, so we only look at 5 slots out of
> 32000-64000.

We only break out of the loop in above case when number of sockets is
already more than our range limit. If we would just try 5 random times
out of 1000 in Gaspar's case, we would not be able to select all 1000
sockets.

> Maybe the fix would need to check if there is a conflict before doing
> the "goto have_snum"

I believe this is a useful patch, but it addresses a different issue.
This path should not fire up when we bind to single address.

> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index e0a3e35..0498daf 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -120,9 +120,11 @@ again:
>  						smallest_size = tb->num_owners;
>  						smallest_rover = rover;
>  						if (atomic_read(&hashinfo->bsockets) > (high - low) + 1) {
> -							spin_unlock(&head->lock);
> -							snum = smallest_rover;
> -							goto have_snum;
> +							if (!inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb))
> +								spin_unlock(&head->lock);
> +								snum = smallest_rover;
> +								goto have_snum;
> +							}
>  						}
>  					}
>  					goto next;
> 

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: [PATCH] NET: Fix an RCU warning in dev_pick_tx()
From: David Miller @ 2010-04-21  8:10 UTC (permalink / raw)
  To: eric.dumazet; +Cc: dhowells, netdev
In-Reply-To: <1271779414.7895.35.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Apr 2010 18:03:34 +0200

> Le mardi 20 avril 2010 à 11:25 +0100, David Howells a écrit :
>> Fix the following RCU warning in dev_pick_tx():
 ...
> Absolutely right, thanks David
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

I'll apply this, thanks everyone.

> This might conflict with following commit in net-next-2.6, where I chose
> the rcu_dereference_check() alternative

Yead I'll be mindful of this when I do a merge right after
this, thanks for the heads up.

^ permalink raw reply

* [PATCH] ipv4: handle GARPs specially when updating neighbors
From: Sasha Levin @ 2010-04-21  8:02 UTC (permalink / raw)
  To: netdev@vger.kernel.org

From: Sasha Levin <sasha@comsleep.com>

We are currently testing IP fail-over on storage devices, and have observed an issue with the IP transfer from one device to another.

Assuming we have 2 storage devices A and B, and a server C which uses the storage, the scenario is:

1. Device A sends an ARP request which server C sees – server C updates it’s ARP table with the MAC of device A.
2. Device A fails, Device B takes over the IP and sends out a GARP.
3. Even though device C sees the GARP, it ignores it and keeps trying to communicate with device A until the entry is removed from its cache and a new ARP request is generated.

The code which causes this is located in arp_process@/net/ipv4/arp.c:

override = time_after(jiffies, n->updated + n->parms->locktime);

/* Broadcast replies and request packets
   do not assert neighbour reachability.
 */
if (arp->ar_op != htons(ARPOP_REPLY) ||
    skb->pkt_type != PACKET_HOST)
        state = NUD_STALE;
neigh_update(n, sha, state, override ? NEIGH_UPDATE_F_OVERRIDE : 0);
neigh_release(n);

According to the code, this scenario happens because the kernel ignores any ARP updates which happened in a short period after the previous ARP update. The reason which was stated in the comments is  “If several different ARP replies follows back-to-back, use the FIRST one. It is possible, if several proxy agents are active. Taking the first reply prevents arp trashing and chooses the fastest router.”.

This, however, doesn’t take into account GARPs which are not being sent by ARP proxies anyway and just ignores them too – causing a loss of communication for over a minute until the ARP cache refreshes.

Signed-off-by: Sasha Levin <sasha@comsleep.com>
---
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 1a9dd66..caa2093 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -876,8 +876,11 @@ static int arp_process(struct sk_buff *skb)
 		   use the FIRST one. It is possible, if several proxy
 		   agents are active. Taking the first reply prevents
 		   arp trashing and chooses the fastest router.
+
+		   GARPs are always updating the cache since they can
+		   originate from different devices with the same IP.
 		 */
-		override = time_after(jiffies, n->updated + n->parms->locktime);
+		override = (sip == tip) || time_after(jiffies, n->updated + n->parms->locktime);

 		/* Broadcast replies and request packets
 		   do not assert neighbour reachability.

^ permalink raw reply related

* Re: ipv6: Fix tcp_v6_send_response checksum
From: David Miller @ 2010-04-21  7:49 UTC (permalink / raw)
  To: herbert; +Cc: netdev, cratiu
In-Reply-To: <20100421070737.GA30517@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 21 Apr 2010 15:07:37 +0800

> ipv6: Fix tcp_v6_send_response checksum

I put this into net-2.6 and modified the commit message since, as we
found, this incorrect transport header reset was added there to fix
IPSEC.

I'm convinced that Cosmin didn't test the patch he actually sent out
:-)

Thanks!

--------------------
ipv6: Fix tcp_v6_send_response transport header setting.

My recent patch to remove the open-coded checksum sequence in
tcp_v6_send_response broke it as we did not set the transport
header pointer on the new packet.

Actually, there is code there trying to set the transport
header properly, but it sets it for the wrong skb ('skb'
instead of 'buff').

This bug was introduced by commit
a8fdf2b331b38d61fb5f11f3aec4a4f9fb2dedcb ("ipv6: Fix
tcp_v6_send_response(): it didn't set skb transport header")

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv6/tcp_ipv6.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index c92ebe8..075f540 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1015,7 +1015,7 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
 	skb_reserve(buff, MAX_HEADER + sizeof(struct ipv6hdr) + tot_len);
 
 	t1 = (struct tcphdr *) skb_push(buff, tot_len);
-	skb_reset_transport_header(skb);
+	skb_reset_transport_header(buff);
 
 	/* Swap the send and the receive. */
 	memset(t1, 0, sizeof(*t1));
-- 
1.7.0.4


^ permalink raw reply related

* Re: [PATCH] net: ipv6 bind to device issue
From: Jiri Olsa @ 2010-04-21  7:21 UTC (permalink / raw)
  To: Brian Haley
  Cc: davem, kuznet, pekkas, jmorris, yoshfuji, kaber, eric.dumazet,
	netdev
In-Reply-To: <4BCDEED3.7040901@hp.com>

On Tue, Apr 20, 2010 at 02:13:39PM -0400, Brian Haley wrote:
> Jiri Olsa wrote:
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index c2438e8..7bf7717 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -815,7 +815,7 @@ struct dst_entry * ip6_route_output(struct net *net, struct sock *sk,
> >  {
> >  	int flags = 0;
> >  
> > -	if (rt6_need_strict(&fl->fl6_dst))
> > +	if (rt6_need_strict(&fl->fl6_dst) || fl->oif)
> >  		flags |= RT6_LOOKUP_F_IFACE;
> >  
> >  	if (!ipv6_addr_any(&fl->fl6_src))
> 
> Actually, looking at this again, we might want to swap the order
> here since fl->oif should be filled-in for most link-local and
> multicast requests calling this:
> 
> 	if (fl->oif || rt6_need_strict(&fl->fl6_dst))
> 
> Just a thought, but it potentially saves a call to determine
> the scope of the address.
> 
> -Brian

I think it's a good idea, attaching the changed patch

thanks,
jirka
---

The issue raises when having 2 NICs both assigned the same
IPv6 global address.

If a sender binds to a particular NIC (SO_BINDTODEVICE),
the outgoing traffic is being sent via the first found.
The bonded device is thus not taken into an account during the
routing.


>From the ip6_route_output function:

If the binding address is multicast, linklocal or loopback,
the RT6_LOOKUP_F_IFACE bit is set, but not for global address.

So binding global address will neglect SO_BINDTODEVICE-binded device,
because the fib6_rule_lookup function path won't check for the
flowi::oif field and take first route that fits.

Following patch should handle the issue.

wbr,
jirka


Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Scott Otto <scott.otto@alcatel-lucent.com>
---
 net/ipv6/route.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index c2438e8..05ebd78 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -815,7 +815,7 @@ struct dst_entry * ip6_route_output(struct net *net, struct sock *sk,
 {
 	int flags = 0;
 
-	if (rt6_need_strict(&fl->fl6_dst))
+	if (fl->oif || rt6_need_strict(&fl->fl6_dst))
 		flags |= RT6_LOOKUP_F_IFACE;
 
 	if (!ipv6_addr_any(&fl->fl6_src))


^ permalink raw reply related

* ipv6: Fix tcp_v6_send_response checksum
From: Herbert Xu @ 2010-04-21  7:07 UTC (permalink / raw)
  To: David S. Miller, netdev

Hi:

ipv6: Fix tcp_v6_send_response checksum

My recent patch to remove the open-coded checksum sequence in
tcp_v6_send_response broke it as we did not set the transport
header pointer on the new packet.

Instead we had set the transport header on the original packet,
which is unnecessary and unexpected.

So this patch removes that and instead sets the transport header
on the new packet.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index c92ebe8..075f540 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1015,7 +1015,7 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
 	skb_reserve(buff, MAX_HEADER + sizeof(struct ipv6hdr) + tot_len);

 	t1 = (struct tcphdr *) skb_push(buff, tot_len);
-	skb_reset_transport_header(skb);
+	skb_reset_transport_header(buff);

 	/* Swap the send and the receive. */
 	memset(t1, 0, sizeof(*t1));

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox