Netdev List


* Re: [net-next,1/2] add iovnl netlink support
From: Scott Feldman @ 2010-04-21 20:25 UTC (permalink / raw)
  To: Arnd Bergmann, Chris Wright; +Cc: davem, netdev
In-Reply-To: <201004212139.22421.arnd@arndb.de>

On 4/21/10 12:39 PM, "Arnd Bergmann" <arnd@arndb.de> wrote:

>>> 1. Setting up the slave device
>>>  a) create an SR-IOV VF to assign to a guest
>>>  b) create a macvtap device to pass to qemu or vhost
>>>  c) attach a tap device to a bridge
>>>  d) create a macvlan device and put it into a container
>>>  e) create a virtual interface for a VMDq adapter
>> 
>> OK, but iovnl isn't doing this.
> 
> The set_mac_vlan that Scott's patch adds seems to implement 1a), as far
> as I can tell. Interestingly, this is not actually implemented in
> the enic driver in patch 2/2. So if we all agree that this is out of the
> scope of iovnl, let's just remove it from the interface and find another
> way for it (ethtool, iplink, ..., as listed above).

You're right, not needed for enic since mac addr is included with
port-profile push and vlan membership is implied by port-profile.  So I put
set_mac_vlan in there basically to elicit feedback.

There really wouldn't be much difference between iplink and iovnl since
they're both rtnetlink...seems we should keep IOV-related APIs in one place.
Maybe there are other IOV APIs to add to iovnl in the future like:

    vf <- add_vf(pf)
    del_vf(pf, vf)

Ethtool doesn't seem the right place for this.

-scott


* IPv6 duplicate address detection erroneously marking address as duplicate when a host receives its own multicast packets?
From: Sam Cannell @ 2010-04-21 20:13 UTC (permalink / raw)
  To: netdev


[c&p from my email to lkml; I was asked to forward it here too.  This
occurs on the stock kernels in Debian Lenny and Ubuntu Karmic, as well
as a 2.6.33 I built myself]

Hi,

I've been having some trouble with IPv6 duplicate address detection in a
Linux VM (under XVM on OpenSolaris).  It seems that the Ethernet bridge
in XVM sends a host's own multicast packets back to it, which leads the
duplicate address detection code in Linux to decide that another host on
the network is using the same address.

For instance, running:

router4:~ # ip addr add fe80::216:36ff:fe4e:461c/64 dev eth0

I get the following output in tcpdump:

router4:~ # tcpdump -nevpi eth0 ip6 
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 96
bytes
12:33:03.755897 00:16:36:4e:46:1c > 33:33:00:00:00:16, ethertype IPv6
(0x86dd), length 90: (hlim 1, next-header Options (0) payload length:
36) :: > ff02::16: HBH (rtalert: 0x0000) (padn)[icmp6 sum ok] ICMP6,
multicast listener report v2, length 28, 1 group record(s) [gaddr
ff02::1:ff4e:461c to_ex, 0 source(s)]
12:33:04.551772 00:16:36:4e:46:1c > 33:33:ff:4e:46:1c, ethertype IPv6
(0x86dd), length 78: (hlim 255, next-header ICMPv6 (58) payload length:
24) :: > ff02::1:ff4e:461c: [icmp6 sum ok] ICMP6, neighbor solicitation,
length 24, who has fe80::216:36ff:fe4e:461c
12:33:04.551998 00:16:36:4e:46:1c > 33:33:ff:4e:46:1c, ethertype IPv6
(0x86dd), length 78: (hlim 255, next-header ICMPv6 (58) payload length:
24) :: > ff02::1:ff4e:461c: [icmp6 sum ok] ICMP6, neighbor solicitation,
length 24, who has fe80::216:36ff:fe4e:461c
^C
3 packets captured
3 packets received by filter
0 packets dropped by kernel

And dmesg says:

router4:~ # dmesg
[  371.024287] eth0: IPv6 duplicate address fe80::216:36ff:fe4e:461c
detected!

And the address sits in 'tentative' mode:

router4:~ # ip addr show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
state UP qlen 1000
    link/ether 00:16:36:4e:46:1c brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.128/24 brd 192.168.2.255 scope global eth0
    inet6 fe80::216:36ff:fe4e:461c/64 scope link tentative flags 08 
       valid_lft forever preferred_lft forever

This happens for link-local and global scope addresses, both when they try
to auto-configure and when set by hand:

[  463.500328] eth0: IPv6 duplicate address
2404:130:0:1000:216:36ff:fe4e:461c detected!
[  732.428312] eth0: IPv6 duplicate address
2404:130:0:1000:216:36ff:fe4e:461c detected!
[  883.812328] eth0: IPv6 duplicate address 2404:130::3:2:1 detected!

I'd happily put this down to a failing in XVM, however the stateless
autoconfiguration RFC (4862) states that the stack shouldn't decide an
address is duplicate based on receipt of a neighbor solicitation message
that it sent itself:

5.4.3.  Receiving Neighbor Solicitation Messages
[...]
If the source address of the Neighbor Solicitation is the unspecified
   address, the solicitation is from a node performing Duplicate Address
   Detection.  If the solicitation is from another node, the tentative
   address is a duplicate and should not be used (by either node).  If
   the solicitation is from the node itself (because the node loops back
   multicast packets), the solicitation does not indicate the presence
   of a duplicate address.

Assuming my understanding of the RFC is correct, this suggests to me
that duplicate address detection in Linux is being a little too hasty to
mark the address as invalid.  Thoughts?
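
The rule in the RFC text quoted above can be sketched as a small
decision function.  This is only an illustration, not the kernel's
code: struct ns_info, enum dad_verdict and dad_check_ns() are
hypothetical names, and recognising a looped-back solicitation by
comparing the frame's link-layer source with the interface's own MAC
is an assumption made for the example (the quoted text does not say how
a node should recognise its own packets).

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of a received Neighbor Solicitation. */
struct ns_info {
	uint8_t src_ip[16];	/* IPv6 source address of the NS */
	uint8_t target[16];	/* address being solicited */
	uint8_t src_mac[6];	/* link-layer source of the frame */
};

enum dad_verdict { DAD_IGNORE, DAD_DUPLICATE };

static bool ipv6_is_unspecified(const uint8_t addr[16])
{
	static const uint8_t zero[16];

	return memcmp(addr, zero, sizeof(zero)) == 0;
}

/*
 * Decide what an incoming NS means for an address that is still
 * tentative on the receiving interface.
 */
enum dad_verdict dad_check_ns(const struct ns_info *ns,
			      const uint8_t tentative[16],
			      const uint8_t own_mac[6])
{
	/* Not about our tentative address at all. */
	if (memcmp(ns->target, tentative, 16) != 0)
		return DAD_IGNORE;

	/* Only a solicitation sent from :: is a DAD probe. */
	if (!ipv6_is_unspecified(ns->src_ip))
		return DAD_IGNORE;

	/* Assumption: same source MAC means our own probe was looped back. */
	if (memcmp(ns->src_mac, own_mac, 6) == 0)
		return DAD_IGNORE;

	/* A DAD probe from another node: the address is a duplicate. */
	return DAD_DUPLICATE;
}
```

With the capture above, both solicitations carry source MAC
00:16:36:4e:46:1c, which is the interface's own address, so a check
along these lines would ignore them rather than flag a duplicate.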

Thanks,

Sam Cannell



* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Evgeniy Polyakov @ 2010-04-21 20:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <1271877975.7895.3171.camel@edumazet-laptop>

On Wed, Apr 21, 2010 at 09:26:15PM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> Le mercredi 21 avril 2010 à 22:58 +0400, Evgeniy Polyakov a écrit :
> 
> > Damn it, I tried multiple times :)
> > You are right of course!
> > 
> 
> Here is a formal patch then :)

Ack. Thank you Eric!

-- 
	Evgeniy Polyakov


* Re: [net-next,1/2] add iovnl netlink support
From: Arnd Bergmann @ 2010-04-21 19:39 UTC (permalink / raw)
  To: Chris Wright; +Cc: Scott Feldman, davem, netdev
In-Reply-To: <20100421181021.GC25928@x200.localdomain>

On Wednesday 21 April 2010, Chris Wright wrote:
> * Arnd Bergmann (arnd@arndb.de) wrote:
> > On Wednesday 21 April 2010, Chris Wright wrote:
> > > * Arnd Bergmann (arnd@arndb.de) wrote:
> > > > Since it seems what you really want to do is to do the exchange with the
> > > > switch from here, maybe the hardware configuration part should be moved
> > > > the DCB interface?
> > > 
> > > I suppose this would work  (although it's a bit odd being out of scope
> > > of DCB spec).
> > 
> > It could be anywhere, it doesn't have to be the DCB interface, but could
> > be anything ranging from ethtool to iplink I guess. And we should define
> > it in a way that works for any SR-IOV card, whether it's using Cisco's
> > protocol in firmware, 802.1Qbg VDP in firmware, lldpad to do VDP or
> > none of the above and just provides an internal switch like all the
> > existing NICs.
> 
> Heh, that's exactly what iovnl does ;-)

No, according to what you write below, it's exactly what iovnl does *not* do,
i.e. part 1 in my list.

> > 1. Setting up the slave device
> >  a) create an SR-IOV VF to assign to a guest
> >  b) create a macvtap device to pass to qemu or vhost
> >  c) attach a tap device to a bridge
> >  d) create a macvlan device and put it into a container
> >  e) create a virtual interface for a VMDq adapter
> 
> OK, but iovnl isn't doing this.

The set_mac_vlan that Scott's patch adds seems to implement 1a), as far
as I can tell. Interestingly, this is not actually implemented in
the enic driver in patch 2/2. So if we all agree that this is out of the
scope of iovnl, let's just remove it from the interface and find another
way for it (ethtool, iplink, ..., as listed above).

Note that we still need to pass the MAC address and VLAN ID (or a list
of these) to the external switch; my point is just that this should be
separate from enforcing it in the hypervisor.

> > 2) Registering the slave with the switch
> >  a) use Cisco protocol in enic firmware (see patch 2/2)
> >  b) use standard VDP in lldpad
> >  c) use reverse-engineered cisco protocol in some user tool for
> >     non-enic adapters.
> >  d) use standard VDP in firmware (hopefully this never happens)
> >  e) do nothing at all (as we do today)
> 
> And this is the step that is the main purpose of iovnl.
> 
> Here's the simplest snippet of libvirt to show this.  It sends
> set_port_profile netlink messages and then creates macvtap.  As simple
> as it gets...
> 
> --- a/src/qemu/qemu_conf.c
> +++ b/src/qemu/qemu_conf.c
> @@ -1470,6 +1470,11 @@ qemudPhysIfaceConnect(virConnectPtr conn,
>          net->model && STREQ(net->model, "virtio"))
>          vnet_hdr = 1;
>  
> +    setPortProfileId(net->data.direct.linkdev,
> +                      net->data.direct.mode,
> +                      net->data.direct.profileid,
> +                      net->mac);
> +
>      rc = openMacvtapTap(net->ifname, net->mac, linkdev, brmode,
>                          &res_ifname, vnet_hdr);

Ok. In case of VDP, I guess this needs to be extended with the vlan ID
that has been configured, and possibly with a UUID, because we need to
pass the same one on the target machine if we migrate it.

Alternatively, the setPortProfileId could figure out the MAC address and
VLAN ID from the device itself, but then we don't need to pass either of
them.

	Arnd


* Re: Bug#577640: linux-image-2.6.33-2-amd64: Kernel warnings in netns  thread
From: Eric W. Biederman @ 2010-04-21 19:36 UTC (permalink / raw)
  To: Martín Ferrari
  Cc: Ben Hutchings, 577640, netdev, Eric W. Biederman, Alexey Dobriyan,
	Mathieu Lacage
In-Reply-To: <o2jb9800b71004210819t6ca61099y17be143527ccbc95@mail.gmail.com>

Martín Ferrari <martin.ferrari@gmail.com> writes:

> I'm not starting a new thread/bug, as this is probably related...
>
> I just discovered that in 2.6.33, if I create a veth inside a
> namespace and then move one of the halves into the main namespace,
> when I kill the namespace, I get one of these warnings followed by an
> oops. This does not happen if the veth is created from the main ns and
> then moved, nor in 2.6.32. This happens both in Qemu and on real
> hardware (both amd64)
>
> To reproduce:
>
> $ sudo ./startns bash
> # ip l a type veth
> # ip l s veth0 netns 1
> # exit

Nasty weird. I did a quick test here, and I'm not seeing that.
Does the 2.6.33 experimental kernel have any patches applied?

The sysfs_add_one path looks like somehow we hit a duplicate name,
which is a bug of an entirely different kind.  From there it appears
that we later destroy the device not realizing it isn't in sysfs.
Which causes everything to explode.

The sysctl issue appears to be that somewhere we have the ordering of
creates and/or deletes wrong.  It is just possible the changes for
batch deletion might be exposing a bug under load.

The sysctl error appears to be in the class of things that should never
happen but that we should handle correctly anyway.

Eric


* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Eric Dumazet @ 2010-04-21 19:26 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <20100421185837.GB21249@ioremap.net>

Le mercredi 21 avril 2010 à 22:58 +0400, Evgeniy Polyakov a écrit :

> Damn it, I tried multiple times :)
> You are right of course!
> 

Here is a formal patch then :)

[PATCH] tcp: bind() fix when many ports are bound

Port autoselection done by the kernel only works when the number of bound
sockets is under a threshold (typically 30000).

When this threshold is exceeded, we must check whether there is a conflict
before exiting the first loop in inet_csk_get_port().

Change inet_csk_bind_conflict() to forbid two reuse-enabled sockets to
bind on same (address,port) tuple (with a non ANY address)

Same change for inet6_csk_bind_conflict()

Reported-by: Gaspar Chilingarov <gasparch@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/inet_connection_sock.c  |   16 +++++++++++-----
 net/ipv6/inet6_connection_sock.c |   15 ++++++++++-----
 2 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index e0a3e35..78cbc39 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -70,13 +70,17 @@ int inet_csk_bind_conflict(const struct sock *sk,
 		    (!sk->sk_bound_dev_if ||
 		     !sk2->sk_bound_dev_if ||
 		     sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
+			const __be32 sk2_rcv_saddr = inet_rcv_saddr(sk2);
+
 			if (!reuse || !sk2->sk_reuse ||
 			    sk2->sk_state == TCP_LISTEN) {
-				const __be32 sk2_rcv_saddr = inet_rcv_saddr(sk2);
 				if (!sk2_rcv_saddr || !sk_rcv_saddr ||
 				    sk2_rcv_saddr == sk_rcv_saddr)
 					break;
-			}
+			} else if (reuse && sk2->sk_reuse &&
+				   sk2_rcv_saddr &&
+				   sk2_rcv_saddr == sk_rcv_saddr)
+				break;
 		}
 	}
 	return node != NULL;
@@ -120,9 +124,11 @@ again:
 						smallest_size = tb->num_owners;
 						smallest_rover = rover;
 						if (atomic_read(&hashinfo->bsockets) > (high - low) + 1) {
-							spin_unlock(&head->lock);
-							snum = smallest_rover;
-							goto have_snum;
+							if (!inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
+								spin_unlock(&head->lock);
+								snum = smallest_rover;
+								goto have_snum;
+							}
 						}
 					}
 					goto next;
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 0c5e3c3..fb6959c 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -42,11 +42,16 @@ int inet6_csk_bind_conflict(const struct sock *sk,
 		if (sk != sk2 &&
 		    (!sk->sk_bound_dev_if ||
 		     !sk2->sk_bound_dev_if ||
-		     sk->sk_bound_dev_if == sk2->sk_bound_dev_if) &&
-		    (!sk->sk_reuse || !sk2->sk_reuse ||
-		     sk2->sk_state == TCP_LISTEN) &&
-		     ipv6_rcv_saddr_equal(sk, sk2))
-			break;
+		     sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
+			if ((!sk->sk_reuse || !sk2->sk_reuse ||
+			     sk2->sk_state == TCP_LISTEN) &&
+			     ipv6_rcv_saddr_equal(sk, sk2))
+				break;
+			else if (sk->sk_reuse && sk2->sk_reuse &&
+				!ipv6_addr_any(inet6_rcv_saddr(sk2)) &&
+				ipv6_rcv_saddr_equal(sk, sk2))
+				break;
+		}
 	}
 
 	return node != NULL;
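
For readers who want to poke at the new rule outside the kernel, the
per-pair test from the IPv4 hunk above can be modelled in plain C.
This is a simplified sketch, not the kernel function: struct
model_sock and pair_conflicts() are hypothetical names, the hash-chain
walk is left out, and only the fields visible in the diff are
modelled.

```c
#include <stdbool.h>
#include <stdint.h>

enum { ST_OTHER = 0, ST_LISTEN = 1 };

/* Hypothetical stand-in for the few struct sock fields the test uses. */
struct model_sock {
	int bound_dev_if;	/* 0 = not bound to a device */
	bool reuse;		/* SO_REUSEADDR set */
	int state;		/* ST_LISTEN or ST_OTHER */
	uint32_t rcv_saddr;	/* 0 = wildcard (ANY) address */
};

/* Would sk conflict with sk2, which already owns the port? */
bool pair_conflicts(const struct model_sock *sk,
		    const struct model_sock *sk2)
{
	if (sk == sk2)
		return false;

	/* Sockets bound to different devices never conflict. */
	if (sk->bound_dev_if && sk2->bound_dev_if &&
	    sk->bound_dev_if != sk2->bound_dev_if)
		return false;

	if (!sk->reuse || !sk2->reuse || sk2->state == ST_LISTEN)
		/* Old rule: conflict if either is ANY or both match. */
		return !sk2->rcv_saddr || !sk->rcv_saddr ||
		       sk2->rcv_saddr == sk->rcv_saddr;

	/* New rule: two reuse sockets on the same non-ANY address. */
	return sk2->rcv_saddr && sk2->rcv_saddr == sk->rcv_saddr;
}
```

In this model, two reuse-enabled sockets on the same specific address
now conflict, while a reuse-enabled pair where the existing socket is
bound to ANY still does not.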




* Re: cxgb4: Use ntohs() on __be16 value instead of htons()
From: Dimitris Michailidis @ 2010-04-21 19:17 UTC (permalink / raw)
  To: Roland Dreier; +Cc: David S. Miller, netdev
In-Reply-To: <ada1ve8r832.fsf@roland-alpha.cisco.com>

On 04/21/2010 11:09 AM, Roland Dreier wrote:
> Use the correct direction of byte-swapping function to fix a mistake
> shown by sparse endianness checking -- c.fl0id is __be16.
>
> Signed-off-by: Roland Dreier<rolandd@cisco.com>

Yes, thanks.

Acked-by: Dimitris Michailidis <dm@chelsio.com>

> ---
>   drivers/net/cxgb4/sge.c |    2 +-
>   1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/net/cxgb4/sge.c b/drivers/net/cxgb4/sge.c
> index 14adc58..70bf2b2 100644
> --- a/drivers/net/cxgb4/sge.c
> +++ b/drivers/net/cxgb4/sge.c
> @@ -2047,7 +2047,7 @@ int t4_sge_alloc_rxq(struct adapter *adap, struct sge_rspq *iq, bool fwevtq,
>   	adap->sge.ingr_map[iq->cntxt_id] = iq;
>
>   	if (fl) {
> -		fl->cntxt_id = htons(c.fl0id);
> +		fl->cntxt_id = ntohs(c.fl0id);
>   		fl->avail = fl->pend_cred = 0;
>   		fl->pidx = fl->cidx = 0;
>   		fl->alloc_failed = fl->large_alloc_failed = fl->starving = 0;
>
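
As a userspace illustration of why the direction matters for
annotation rather than for generated code: on the usual
implementations htons() and ntohs() do the same thing (a swap on
little-endian hosts, nothing on big-endian ones), so the fix above
mainly tells sparse and the reader that a big-endian value is being
converted to host order.  The helper name fl_id_from_wire() below is
hypothetical, and the example uses the POSIX functions from
<arpa/inet.h>, not the kernel's.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/*
 * Read a 16-bit big-endian ("network order") field from a raw buffer
 * and return it in host order.  ntohs() is the right direction here:
 * the input is the wire format, the result is for the CPU.
 */
uint16_t fl_id_from_wire(const uint8_t wire[2])
{
	uint16_t be;

	memcpy(&be, wire, sizeof(be));	/* bytes as they arrived */
	return ntohs(be);		/* big-endian -> host order */
}
```

Passing the bytes 0x12, 0x34 yields 0x1234 on both little-endian and
big-endian hosts.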



* Re: [stable] [PATCH] tun: orphan an skb on tx
From: Greg KH @ 2010-04-21 19:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: stable, Paul Moore, David Woodhouse, netdev, linux-kernel,
	qemu-devel, Herbert Xu, Jan Kiszka, David S. Miller
In-Reply-To: <20100421113557.GA31606@redhat.com>

On Wed, Apr 21, 2010 at 02:35:57PM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 13, 2010 at 05:59:44PM +0300, Michael S. Tsirkin wrote:
> > The following situation was observed in the field:
> > tap1 sends packets, tap2 does not consume them, as a result
> > tap1 can not be closed. This happens because
> > tun/tap devices can hang on to skbs undefinitely.
> > 
> > As noted by Herbert, possible solutions include a timeout followed by a
> > copy/change of ownership of the skb, or always copying/changing
> > ownership if we're going into a hostile device.
> > 
> > This patch implements the second approach.
> > 
> > Note: one issue still remaining is that since skbs
> > keep reference to tun socket and tun socket has a
> > reference to tun device, we won't flush backlog,
> > instead simply waiting for all skbs to get transmitted.
> > At least this is not user-triggerable, and
> > this was not reported in practice, my assumption is
> > other devices besides tap complete an skb
> > within finite time after it has been queued.
> > 
> > A possible solution for the second issue
> > would not to have socket reference the device,
> > instead, implement dev->destructor for tun, and
> > wait for all skbs to complete there, but this
> > needs some thought, probably too risky for 2.6.34.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Tested-by: Yan Vugenfirer <yvugenfi@redhat.com>
> > 
> > ---
> > 
> > Please review the below, and consider for 2.6.34,
> > and stable trees.
> > 
> >  drivers/net/tun.c |    4 ++++
> >  1 files changed, 4 insertions(+), 0 deletions(-)
> > 
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index 96c39bd..4326520 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >  		}
> >  	}
> >  
> > +	/* Orphan the skb - required as we might hang on to it
> > +	 * for indefinite time. */
> > +	skb_orphan(skb);
> > +
> >  	/* Enqueue packet */
> >  	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
> >  	dev->trans_start = jiffies;
> > -- 
> > 1.7.0.2.280.gc6f05
> 
> This is commit 0110d6f22f392f976e84ab49da1b42f85b64a3c5 in net-2.6
> Please cherry-pick this fix in stable kernels (2.6.32 and 2.6.33).

David Miller queues up the patches for the network subsystem for the
stable trees, and then forwards them to me when he feels they are ready.

So I'll defer to him on this one and wait for it to come from him.
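
For anyone following along without the skb code at hand, the effect
of the skb_orphan() call in the quoted patch can be shown with a toy
model.  Everything below is hypothetical (toy_owner, toy_buf and the
helpers are made up for illustration); it only mirrors the general
idea that orphaning runs the buffer's destructor and drops the link to
its owner, so the owner no longer waits on a buffer that sits in
someone else's queue.

```c
#include <stddef.h>

/* Toy stand-in for a sending socket that counts buffers in flight. */
struct toy_owner {
	int in_flight;
};

/* Toy stand-in for an skb: remembers its owner and how to release it. */
struct toy_buf {
	struct toy_owner *owner;
	void (*destructor)(struct toy_buf *buf);
};

static void toy_owner_release(struct toy_buf *buf)
{
	buf->owner->in_flight--;
}

/* Charge the buffer to an owner, as happens when a socket sends it. */
void toy_buf_set_owner(struct toy_buf *buf, struct toy_owner *owner)
{
	buf->owner = owner;
	buf->destructor = toy_owner_release;
	owner->in_flight++;
}

/* Drop the owner link now instead of when the buffer is freed. */
void toy_buf_orphan(struct toy_buf *buf)
{
	if (buf->destructor)
		buf->destructor(buf);
	buf->destructor = NULL;
	buf->owner = NULL;
}
```

In this model, once toy_buf_orphan() has run the owner's in_flight
count is back to zero even though the buffer itself is still queued,
which is the property the tun fix relies on.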

thanks,

greg k-h


* Re: [PATCH] gianfar: Wait for both RX and TX to stop
From: Kumar Gala @ 2010-04-21 19:13 UTC (permalink / raw)
  To: Timur Tabi; +Cc: David Miller, afleming, netdev
In-Reply-To: <z2ged82fe3e1004210733v8fa40902k664549aa9620b13@mail.gmail.com>


On Apr 21, 2010, at 9:33 AM, Timur Tabi wrote:

> On Wed, Apr 21, 2010 at 7:17 AM, Kumar Gala <galak@kernel.crashing.org> wrote:
> 
>> I understand, its more a sense that we are saying we want to time out for what I consider a catastrophic HW failure.
> 
> And how else will you detect and recover from such a failure without a
> timeout?  And are you absolutely certain that there will never be a
> programming failure that will cause this loop to spin forever?
> 
> If you're really opposed to a timeout, you can still use
> spin_event_timeout() by just setting the timeout to -1 and adding a
> comment explaining why.

I'm not opposed, I'm just asking if we are saying we shouldn't be using cpu_relax() for spinning on HW status registers ever.

If we are suggesting that cpu_relax() shouldn't be used in these scenarios going forward, I'm OK with the change you suggest and with starting to convert other cpu_relax() calls to use spin_event_timeout().
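
To make the trade-off concrete, here is a userspace sketch of a
bounded poll, as opposed to an open-ended cpu_relax() loop.  It is not
the kernel's spin_event_timeout(): poll_until(), cond_fn, struct
fake_reg and fake_reg_idle() are hypothetical, and the budget is a
number of polls rather than a time, purely to keep the example
self-contained.

```c
#include <stdbool.h>

/* Hypothetical condition callback: returns true once the event occurred. */
typedef bool (*cond_fn)(void *ctx);

/*
 * Poll cond() at most max_polls times.  Returns true if the condition
 * became true within the budget, false if we gave up.
 */
bool poll_until(cond_fn cond, void *ctx, unsigned long max_polls)
{
	for (unsigned long i = 0; i < max_polls; i++) {
		if (cond(ctx))
			return true;
	}
	return false;	/* caller decides how to recover */
}

/* Fake "status register" that reports idle after a few reads. */
struct fake_reg {
	unsigned int reads_until_idle;
};

bool fake_reg_idle(void *ctx)
{
	struct fake_reg *reg = ctx;

	if (reg->reads_until_idle == 0)
		return true;
	reg->reads_until_idle--;
	return false;
}
```

A register that goes idle after three reads is reported as done; one
that never settles makes poll_until() return false instead of spinning
forever, and the caller can then log and recover.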

- k


* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: Tom Herbert @ 2010-04-21 19:12 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, franco, xiaosuo, netdev
In-Reply-To: <20100420.144106.118596093.davem@davemloft.net>

On Tue, Apr 20, 2010 at 2:41 PM, David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 20 Apr 2010 16:57:01 +0200
>
>> I know many applications using TCP on loopback, they are real :)
>
> This is all true and I support your hashing patch and all of that.
>
> But if we really want TCP over loopback to go fast, there are much
> better ways to do this.
>
> Eric, do you remember that "TCP friends" rough patch I sent you last
> year that essentailly made TCP sockets over loopback behave like
> AF_UNIX ones and just queue the SKBs directly to the destination
> socket without doing any protocol work?
>

This sounds very interesting!  Could you post a patch? :-)

> If we ever got that working, tbench performance would become
> impressive :)
>


* Re: [PATCH] cxgb3: fix linkup issue
From: Divy Le Ray @ 2010-04-21 19:12 UTC (permalink / raw)
  To: Hiroshi Shimamoto; +Cc: netdev, linux-kernel@vger.kernel.org
In-Reply-To: <4BCD0424.8030501@ct.jp.nec.com>

Hiroshi Shimamoto wrote:
> From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
>
> I encountered an issue that not to link up on cxgb3 fabric.
> I bisected and found that this regression was introduced by
> 0f07c4ee8c800923ae7918c231532a9256233eed.
>
> Correct to pass phy_addr to cphy_init() at t3_xaui_direct_phy_prep().
>
> Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
>   

Sorry for the review delay, I just came back from some time off.
Acked-by: Divy Le Ray <divy@chelsio.com>

> ---
>  drivers/net/cxgb3/ael1002.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/net/cxgb3/ael1002.c b/drivers/net/cxgb3/ael1002.c
> index 5248f9e..35cd367 100644
> --- a/drivers/net/cxgb3/ael1002.c
> +++ b/drivers/net/cxgb3/ael1002.c
> @@ -934,7 +934,7 @@ static struct cphy_ops xaui_direct_ops = {
>  int t3_xaui_direct_phy_prep(struct cphy *phy, struct adapter *adapter,
>  			    int phy_addr, const struct mdio_ops *mdio_ops)
>  {
> -	cphy_init(phy, adapter, MDIO_PRTAD_NONE, &xaui_direct_ops, mdio_ops,
> +	cphy_init(phy, adapter, phy_addr, &xaui_direct_ops, mdio_ops,
>  		  SUPPORTED_10000baseT_Full | SUPPORTED_AUI | SUPPORTED_TP,
>  		  "10GBASE-CX4");
>  	return 0;
>   



* Re: cxgb4: Make unnecessarily global functions static
From: Roland Dreier @ 2010-04-21 19:03 UTC (permalink / raw)
  To: Dimitris Michailidis; +Cc: David S. Miller, netdev
In-Reply-To: <adapr1spr7e.fsf@roland-alpha.cisco.com>

By the way, this results in the following improvement in size in my
x86-64 build, so it's not just source cleanup:

add/remove: 0/4 grow/shrink: 3/6 up/down: 186/-346 (-160)
function                                     old     new   delta
static.T                                   18293   18412    +119
t4_ethrx_handler                             649     698     +49
handle_trace_pkt                             127     145     +18
t4_wr_mbox_meat                              728     722      -6
t4_read_rss                                  147     140      -7
t4_mc_read                                   201     187     -14
sf1_write                                    243     228     -15
sf1_read                                     279     262     -17
t4_edc_read                                  240     222     -18
t4_pktgl_free                                 41       -     -41
t4_read_indirect                              46       -     -46
t4_write_indirect                             52       -     -52
t4_wait_op_done_val                          130       -    -130
-- 
Roland Dreier <rolandd@cisco.com> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Narendra Choyal @ 2010-04-21 19:03 UTC (permalink / raw)
  To: Gaspar Chilingarov; +Cc: netdev
In-Reply-To: <g2z46c8cb3e1004201517i5641a75cze2ec5bd33e81fb0f@mail.gmail.com>

Use the following command;
this will help you get more connections:

#ip link set eth0 mtu 69

and increase the size of buffer[].

I think this will help you (I am still a student).


* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-21 19:01 UTC (permalink / raw)
  To: hadi; +Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271853570.4032.21.camel@bigi>

Le mercredi 21 avril 2010 à 08:39 -0400, jamal a écrit :
> On Tue, 2010-04-20 at 15:13 +0200, Eric Dumazet wrote:
> 
> 
> > I think your tests are very interesting, maybe could you publish them
> > somehow ? (I forgot to thank you about the previous report and nice
> > graph)
> > perf reports would be good too to help to spot hot points.
> 
> Ok ;->
> Let me explain my test setup (which some app types may gasp at;->):
> 
> SUT(system under test) was a nehalem single processor (4 cores, 2 SMT
> threads per core). 
> SUT runs a udp sink server i wrote (with apologies to Rick Jones[1])
> which forks at most a process per detected cpu and binds to a different
> udp port on each processor.
> Traffic generator sent to SUT upto 750Kpps of udp packets round-robbin
> and varied the destination port to select a different flow on each of
> the outgoing packets. I could further increment the number of flows by
> varying the source address and source port number but in the end i 
> settled down to fixed srcip/srcport/destinationip and just varied the
> port number in order to simplify results collection.
> For rps i selected mask "ee" and bound interrupt to cpu0. ee leaves
> out cpu0 and cpu4 from the set of target cpus. Because Nehalem has SMT
> threads, cpu0 and cpu4 are SMT threads that reside on core0 and they
> steal execution cycles from each other - so i didnt want that to happen
> and instead tried to have as many of those cycles as possible for
> demuxing incoming packets.
> 
> Overall, in best case scenario rps had 5-7% better throughput than
> nonrps setup. It had upto 10% more cpu use and about 2-5% more latency.
> I am attaching some visualization of the way 8 flows were distributed
> around the different cpus. The diagrams show some samples - but what you
> see there was a good reflection of what i saw in many runs of the tests.
> Essentially, for localization is better with rps which gets better if
> you can somehow map the target cpus as selected by rps to what the app
> binds to.
> Ive also attached a small annotated perf output - sorry i didnt have
> time to dig deeper into the code; maybe later this week. I think my
> biggest problem in this setup was the sky2 driver or hardware poor
> ability to handle lots of traffic.
> 
> 
> cheers,
> jamal
> 
> [1] I want to hump on the SUT with tons of traffic and count packets;
> too complex to do with netperf

Thanks a lot Jamal, this is really useful

Drawback of using a fixed src ip from your generator is that all flows
share the same struct dst entry on SUT. This might explain some glitches
you noticed (ip_route_input + ip_rcv at high level on slave/application
cpus)
Also note your test is one way. If some data was replied we would see
much use of the 'flows'

I notice epoll_ctl() used a lot, are you re-arming epoll each time you
receive a datagram ?

I see slave/application cpus hit _raw_spin_lock_irqsave() and  
_raw_spin_unlock_irqrestore().

Maybe a ring buffer could help (instead of a doubly linked queue) for
backlog, or the double queue trick, if Changli wants to respin his
patch.
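
As a side note on the rps mask "ee" quoted above, a few lines of C
show which CPUs such a mask selects.  rps_mask_to_cpus() is a
hypothetical helper written for this note; it assumes a single hex
word with no comma-separated groups, which is enough for an 8-CPU box
like the one described.

```c
#include <stdlib.h>

/*
 * Expand a hex CPU mask such as "ee" into a list of CPU numbers.
 * Bit N set in the mask means CPU N is a target.  Returns the number
 * of CPUs written to cpus[] (at most max_cpus).
 */
int rps_mask_to_cpus(const char *hex_mask, int *cpus, int max_cpus)
{
	unsigned long mask = strtoul(hex_mask, NULL, 16);
	int n = 0;

	for (int cpu = 0; mask != 0 && n < max_cpus; cpu++, mask >>= 1) {
		if (mask & 1UL)
			cpus[n++] = cpu;
	}
	return n;
}
```

For "ee" (binary 11101110) this gives CPUs 1, 2, 3, 5, 6 and 7, which
matches the description that cpu0 and cpu4 are left out of the target
set.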







* cxgb4: Make unnecessarily global functions static
From: Roland Dreier @ 2010-04-21 18:59 UTC (permalink / raw)
  To: Dimitris Michailidis, David S. Miller; +Cc: netdev

Also put t4_write_indirect() inside "#if 0" to avoid a "defined but not
used" compile warning.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
---
 drivers/net/cxgb4/cxgb4.h |    2 --
 drivers/net/cxgb4/sge.c   |    2 +-
 drivers/net/cxgb4/t4_hw.c |   22 ++++++++++++----------
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/net/cxgb4/cxgb4.h b/drivers/net/cxgb4/cxgb4.h
index 3d8ff48..4b35dc7 100644
--- a/drivers/net/cxgb4/cxgb4.h
+++ b/drivers/net/cxgb4/cxgb4.h
@@ -651,8 +651,6 @@ int t4_link_start(struct adapter *adap, unsigned int mbox, unsigned int port,
 		  struct link_config *lc);
 int t4_restart_aneg(struct adapter *adap, unsigned int mbox, unsigned int port);
 int t4_seeprom_wp(struct adapter *adapter, bool enable);
-int t4_read_flash(struct adapter *adapter, unsigned int addr,
-		  unsigned int nwords, u32 *data, int byte_oriented);
 int t4_load_fw(struct adapter *adapter, const u8 *fw_data, unsigned int size);
 int t4_check_fw_version(struct adapter *adapter);
 int t4_prep_adapter(struct adapter *adapter);
diff --git a/drivers/net/cxgb4/sge.c b/drivers/net/cxgb4/sge.c
index 14adc58..04e5710 100644
--- a/drivers/net/cxgb4/sge.c
+++ b/drivers/net/cxgb4/sge.c
@@ -1471,7 +1471,7 @@ EXPORT_SYMBOL(cxgb4_pktgl_to_skb);
  *	Releases the pages of a packet gather list.  We do not own the last
  *	page on the list and do not free it.
  */
-void t4_pktgl_free(const struct pkt_gl *gl)
+static void t4_pktgl_free(const struct pkt_gl *gl)
 {
 	int n;
 	const skb_frag_t *p;
diff --git a/drivers/net/cxgb4/t4_hw.c b/drivers/net/cxgb4/t4_hw.c
index a814a3a..cadead5 100644
--- a/drivers/net/cxgb4/t4_hw.c
+++ b/drivers/net/cxgb4/t4_hw.c
@@ -53,8 +53,8 @@
  *	at the time it indicated completion is stored there.  Returns 0 if the
  *	operation completes and	-EAGAIN	otherwise.
  */
-int t4_wait_op_done_val(struct adapter *adapter, int reg, u32 mask,
-			int polarity, int attempts, int delay, u32 *valp)
+static int t4_wait_op_done_val(struct adapter *adapter, int reg, u32 mask,
+			       int polarity, int attempts, int delay, u32 *valp)
 {
 	while (1) {
 		u32 val = t4_read_reg(adapter, reg);
@@ -109,9 +109,9 @@ void t4_set_reg_field(struct adapter *adapter, unsigned int addr, u32 mask,
  *	Reads registers that are accessed indirectly through an address/data
  *	register pair.
  */
-void t4_read_indirect(struct adapter *adap, unsigned int addr_reg,
-		      unsigned int data_reg, u32 *vals, unsigned int nregs,
-		      unsigned int start_idx)
+static void t4_read_indirect(struct adapter *adap, unsigned int addr_reg,
+			     unsigned int data_reg, u32 *vals,
+			     unsigned int nregs, unsigned int start_idx)
 {
 	while (nregs--) {
 		t4_write_reg(adap, addr_reg, start_idx);
@@ -120,6 +120,7 @@ void t4_read_indirect(struct adapter *adap, unsigned int addr_reg,
 	}
 }
 
+#if 0
 /**
  *	t4_write_indirect - write indirectly addressed registers
  *	@adap: the adapter
@@ -132,15 +133,16 @@ void t4_read_indirect(struct adapter *adap, unsigned int addr_reg,
  *	Writes a sequential block of registers that are accessed indirectly
  *	through an address/data register pair.
  */
-void t4_write_indirect(struct adapter *adap, unsigned int addr_reg,
-		       unsigned int data_reg, const u32 *vals,
-		       unsigned int nregs, unsigned int start_idx)
+static void t4_write_indirect(struct adapter *adap, unsigned int addr_reg,
+			      unsigned int data_reg, const u32 *vals,
+			      unsigned int nregs, unsigned int start_idx)
 {
 	while (nregs--) {
 		t4_write_reg(adap, addr_reg, start_idx++);
 		t4_write_reg(adap, data_reg, *vals++);
 	}
 }
+#endif
 
 /*
  * Get the reply to a mailbox command and store it in @rpl in big-endian order.
@@ -537,8 +539,8 @@ static int flash_wait_op(struct adapter *adapter, int attempts, int delay)
  *	(i.e., big-endian), otherwise as 32-bit words in the platform's
  *	natural endianess.
  */
-int t4_read_flash(struct adapter *adapter, unsigned int addr,
-		  unsigned int nwords, u32 *data, int byte_oriented)
+static int t4_read_flash(struct adapter *adapter, unsigned int addr,
+			 unsigned int nwords, u32 *data, int byte_oriented)
 {
 	int ret;
 

-- 
Roland Dreier <rolandd@cisco.com> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply related

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Evgeniy Polyakov @ 2010-04-21 18:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <1271875416.7895.3033.camel@edumazet-laptop>

On Wed, Apr 21, 2010 at 08:43:36PM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> On Wednesday 21 April 2010 at 22:27 +0400, Evgeniy Polyakov wrote:
> > On Wed, Apr 21, 2010 at 01:27:33PM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> > > Here is the patch I use now and my test application is now able to open
> > > and connect 1000000 sockets (ulimit -n 1000000)
> > > 
> > > Trick is bind_conflict() must refuse a socket to bind to a port on a non
> > > null IP if another socket already uses same port on same IP.
> > > 
> > > Plus the previous patch sent (check a conflict before exiting the search
> > > loop)
> > > 
> > > What do you think ?
> > 
> > Looks good, but do we want to check only reused socket's address there?
> > What if one of the sockets does not have reuse option turned on, will it
> > break?
> > 
> 
> Well, if one socket doesn't have the reuse option turned on, the previous
> test already works?
> 
> if (!reuse || !sk2->sk_reuse || sk2->sk_state == TCP_LISTEN) {
> 	if (!sk2_rcv_saddr || !sk_rcv_saddr ||
> 	    sk2_rcv_saddr == sk_rcv_saddr)
> 		break;
> } else if (reuse && sk2->sk_reuse &&
>            sk2_rcv_saddr &&
>            sk2_rcv_saddr == sk_rcv_saddr)
> 	break;
> 
> I failed to factorize this complex test :(

Damn it, I tried multiple times :)
You are right of course!

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Eric Dumazet @ 2010-04-21 18:43 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <20100421182723.GA17202@ioremap.net>

On Wednesday 21 April 2010 at 22:27 +0400, Evgeniy Polyakov wrote:
> On Wed, Apr 21, 2010 at 01:27:33PM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> > Here is the patch I use now and my test application is now able to open
> > and connect 1000000 sockets (ulimit -n 1000000)
> > 
> > Trick is bind_conflict() must refuse a socket to bind to a port on a non
> > null IP if another socket already uses same port on same IP.
> > 
> > Plus the previous patch sent (check a conflict before exiting the search
> > loop)
> > 
> > What do you think ?
> 
> Looks good, but do we want to check only reused socket's address there?
> What if one of the sockets does not have reuse option turned on, will it
> break?
> 

Well, if one socket doesn't have the reuse option turned on, the previous
test already works?

if (!reuse || !sk2->sk_reuse || sk2->sk_state == TCP_LISTEN) {
	if (!sk2_rcv_saddr || !sk_rcv_saddr ||
	    sk2_rcv_saddr == sk_rcv_saddr)
		break;
} else if (reuse && sk2->sk_reuse &&
           sk2_rcv_saddr &&
           sk2_rcv_saddr == sk_rcv_saddr)
	break;

I failed to factorize this complex test :(
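
One possible factoring, offered only as a sketch: the else branch is reached
only when both sockets have reuse set and sk2 is not listening, so its
"reuse && sk2->sk_reuse" part is redundant. The test then seems to reduce to
"same non-null address, or a wildcard address on either side while in the
strict case". The user-space model below uses plain integers in place of the
kernel fields; the function names and the "listening" flag are illustrative
assumptions, not kernel API. It compares both forms over every flag
combination.

```c
#include <assert.h>
#include <stdint.h>

/* The test as quoted above; returns 1 where the kernel loop would "break". */
int conflict_original(int reuse, int reuse2, int listening,
                      uint32_t saddr, uint32_t saddr2)
{
    if (!reuse || !reuse2 || listening) {
        if (!saddr2 || !saddr || saddr2 == saddr)
            return 1;
    } else if (reuse && reuse2 && saddr2 && saddr2 == saddr) {
        return 1;
    }
    return 0;
}

/* Hypothetical factored form (an assumption, not the kernel's code). */
int conflict_factored(int reuse, int reuse2, int listening,
                      uint32_t saddr, uint32_t saddr2)
{
    int strict = !reuse || !reuse2 || listening;
    int same_addr = saddr && saddr == saddr2;   /* same non-null IP */
    int wildcard = !saddr || !saddr2;           /* either side is 0.0.0.0 */

    return same_addr || (strict && wildcard);
}

/* Assert that both forms agree for all flags and a few addresses. */
void check_equivalence(void)
{
    static const uint32_t addrs[] = { 0, 0x0a000001, 0x0a000002 };

    for (int flags = 0; flags < 8; flags++) {
        int r = flags & 1, r2 = (flags >> 1) & 1, l = (flags >> 2) & 1;

        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                assert(conflict_original(r, r2, l, addrs[i], addrs[j]) ==
                       conflict_factored(r, r2, l, addrs[i], addrs[j]));
    }
}
```

Whether the shorter form is easier to read than the original is a matter of
taste; the check only shows that the two agree on this small model.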




^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Evgeniy Polyakov @ 2010-04-21 18:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <1271849253.7895.1929.camel@edumazet-laptop>

On Wed, Apr 21, 2010 at 01:27:33PM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> Here is the patch I use now and my test application is now able to open
> and connect 1000000 sockets (ulimit -n 1000000)
> 
> Trick is bind_conflict() must refuse a socket to bind to a port on a non
> null IP if another socket already uses same port on same IP.
> 
> Plus the previous patch sent (check a conflict before exiting the search
> loop)
> 
> What do you think ?

Looks good, but do we want to check only reused socket's address there?
What if one of the sockets does not have reuse option turned on, will it
break?

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Chris Wright @ 2010-04-21 18:10 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Chris Wright, Scott Feldman, davem, netdev
In-Reply-To: <201004211952.29145.arnd@arndb.de>

* Arnd Bergmann (arnd@arndb.de) wrote:
> On Wednesday 21 April 2010, Chris Wright wrote:
> > * Arnd Bergmann (arnd@arndb.de) wrote:
> > > Since it seems what you really want to do is to do the exchange with the
> > > switch from here, maybe the hardware configuration part should be moved
> > > to the DCB interface?
> > 
> > I suppose this would work  (although it's a bit odd being out of scope
> > of DCB spec).
> 
> It could be anywhere, it doesn't have to be the DCB interface, but could
> be anything ranging from ethtool to iplink I guess. And we should define
> it in a way that works for any SR-IOV card, whether it's using Cisco's
> protocol in firmware, 802.1Qbg VDP in firmware, lldpad to do VDP or
> none of the above and just provides an internal switch like all the
> existing NICs.

Heh, that's exactly what iovnl does ;-)

> > I don't expect mgmt app to care about the implementation
> > specifics of an adapter, so it will always send this and iovnl message
> > too.  All as part of same setup.
> 
> Why? I really see these things as separate. Obviously a management
> tool like libvirt would need to do both these things eventually, but
> each of them has multiple options that can be combined in various
> ways:
> 
> 1. Setting up the slave device
>  a) create an SR-IOV VF to assign to a guest
>  b) create a macvtap device to pass to qemu or vhost
>  c) attach a tap device to a bridge
>  d) create a macvlan device and put it into a container
>  e) create a virtual interface for a VMDq adapter

OK, but iovnl isn't doing this.

> 2. Registering the slave with the switch
>  a) use Cisco protocol in enic firmware (see patch 2/2)
>  b) use standard VDP in lldpad
>  c) use reverse-engineered cisco protocol in some user tool for
>     non-enic adapters.
>  d) use standard VDP in firmware (hopefully this never happens)
>  e) do nothing at all (as we do today)

And this is the step that is the main purpose of iovnl.

Here's the simplest snippet of libvirt to show this.  It sends
set_port_profile netlink messages and then creates macvtap.  As simple
as it gets...

--- a/src/qemu/qemu_conf.c
+++ b/src/qemu/qemu_conf.c
@@ -1470,6 +1470,11 @@ qemudPhysIfaceConnect(virConnectPtr conn,
         net->model && STREQ(net->model, "virtio"))
         vnet_hdr = 1;
 
+    setPortProfileId(net->data.direct.linkdev,
+                      net->data.direct.mode,
+                      net->data.direct.profileid,
+                      net->mac);
+
     rc = openMacvtapTap(net->ifname, net->mac, linkdev, brmode,
                         &res_ifname, vnet_hdr);

^ permalink raw reply

* cxgb4: Use ntohs() on __be16 value instead of htons()
From: Roland Dreier @ 2010-04-21 18:09 UTC (permalink / raw)
  To: Dimitris Michailidis, David S. Miller; +Cc: netdev

Use the correct direction of byte-swapping function to fix a mistake
shown by sparse endianness checking -- c.fl0id is __be16.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
---
 drivers/net/cxgb4/sge.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/cxgb4/sge.c b/drivers/net/cxgb4/sge.c
index 14adc58..70bf2b2 100644
--- a/drivers/net/cxgb4/sge.c
+++ b/drivers/net/cxgb4/sge.c
@@ -2047,7 +2047,7 @@ int t4_sge_alloc_rxq(struct adapter *adap, struct sge_rspq *iq, bool fwevtq,
 	adap->sge.ingr_map[iq->cntxt_id] = iq;
 
 	if (fl) {
-		fl->cntxt_id = htons(c.fl0id);
+		fl->cntxt_id = ntohs(c.fl0id);
 		fl->avail = fl->pend_cred = 0;
 		fl->pidx = fl->cidx = 0;
 		fl->alloc_failed = fl->large_alloc_failed = fl->starving = 0;
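
As background for the change above: on any given host htons() and ntohs()
perform the same byte swap (or both do nothing), so the old code computed the
same value. The fix is about stating the direction, big-endian wire value to
host order, so that the sparse annotations line up. A minimal user-space
sketch follows; the helper name wire_to_host16() is made up for illustration
and is not part of the driver.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode a 16-bit big-endian field from a wire buffer into host order. */
uint16_t wire_to_host16(const unsigned char *wire)
{
    uint16_t be;

    assert(wire != NULL);
    memcpy(&be, wire, sizeof(be));  /* raw bytes, still big-endian */
    return ntohs(be);               /* network (big-endian) to host */
}
```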

-- 
Roland Dreier <rolandd@cisco.com> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply related

* Re: [net-next,1/2] add iovnl netlink support
From: Arnd Bergmann @ 2010-04-21 18:04 UTC (permalink / raw)
  To: Scott Feldman; +Cc: Chris Wright, davem, netdev
In-Reply-To: <C7F475A7.2A1D9%scofeldm@cisco.com>

On Wednesday 21 April 2010, Scott Feldman wrote:
> On 4/21/10 6:17 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
> > More importantly, the card cannot possibly do the protocol by itself,
> > because the information that gets exchanged is specific to the hypervisor and
> > the guest, not to the hardware. What you have implemented is another protocol
> > between the hypervisor and the NIC that exchanges the exact same data that
> > then gets sent to the switch. We already need to have an implementation that
> > sends this data to the switch from user space for all cards that don't do
> > it in firmware, so doing an alternative path in the adapter really creates
> > more work for the users, and means that when we fix bugs or add features
> > to the common code, you don't get them ;-).
> 
> But the point of iovnl was to provide a single mechanism for both types of
> adapters (w/ or w/o firmware assist) to exchange this data with the switch,
> therefore making the difference in the adapters transparent to the user.  So
> I'm missing your point about more work for the users.

It creates an extra step: Normally we'd simply implement the network protocol
in user space, e.g. in lldpad and have other code use the lldptool command
line interface to start the negotiation.

Now we have a user protocol based on netlink that is about as complex as the
wire protocol itself, at least if you want to implement both the standard
VDP and the Cisco variant, and do all the interesting parts like guest migration
and synchronously waiting for the negotiation to complete.

	Arnd

^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Arnd Bergmann @ 2010-04-21 17:52 UTC (permalink / raw)
  To: Chris Wright; +Cc: Scott Feldman, davem, netdev
In-Reply-To: <20100421161842.GB25928@x200.localdomain>

On Wednesday 21 April 2010, Chris Wright wrote:
> * Arnd Bergmann (arnd@arndb.de) wrote:
> > On Tuesday 20 April 2010, Scott Feldman wrote:
> > I believe we meant different things here, because I misunderstood the
> > intention of the code. My question was whether lldpad would send the
> > netlink messages to iovnl, but from what you and Chris write, the
> > real idea was that both lldpad and kernel/iovnl can receive the
> > same messages, right?
> 
> Correct.  An example set of steps for initiating host to switch
> negotiation and subsequently launching a VM would be (expect user below
> to be a mgmt tool like libvirt):
> 
> 1) user sends netlink message w/ relevant host interface and port profile id
> 2) recipient picks this up (enic, lldpad, whatever)
> 3) recipient does negotiation w/ adjacent switch
> 4) user creates macvtap associated w/ relevant host interface
> 5) user launches guest

I'd move point 4 before 1, but otherwise it makes sense and it would still
work either way.

> > If the idea is use the same netlink protocol for both your internal
> > representation and for the standard based protocol, I think we should
> > make them compatible.
> 
> Indeed, that's my expectation.
>
> [...]
>
> > Instead of a string identifying the port profile, this needs to pass
> > a four byte field for a VSI type (3 bytes) and VSI manager ID (1 byte).
> 
> I think we just need a u8 array, 4 bytes for VDP, some maxlen that is
> at least as large as enic expects.
> 
> > There is also a UUID in VDP, but it identifies the guest, not the host,
> > so this is really confusing.
> 
> Yes, I had same confusion.  I expected guest, enic wants to send host as
> well.

So given all these differences, how compatible can we make them?

With the current definition, most of the fields are at least slightly
different. The differences seem to stem mostly from the fact that
Cisco switches use a nonstandard protocol, rather than the difference
between the firmware and userland implementations of the protocol,
and of course we shouldn't confuse the two.
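
To make the size difference concrete, here is a sketch of packing the
four-byte VDP identifier discussed above (1-byte VSI manager ID plus 3-byte
VSI type) into a u8 array. The byte order (manager ID first, then the type
with its most significant byte first) and all names are assumptions for
illustration; they are not taken from the VDP draft or from the iovnl patch.

```c
#include <assert.h>
#include <stdint.h>

#define VSI_ID_LEN 4    /* 1 byte manager ID + 3 bytes VSI type */

/* Pack manager ID and 24-bit VSI type into a 4-byte array (assumed layout). */
void vsi_pack(uint8_t out[VSI_ID_LEN], uint8_t mgr_id, uint32_t type_id)
{
    assert(type_id <= 0xffffff);    /* the type must fit in 3 bytes */
    out[0] = mgr_id;
    out[1] = (uint8_t)(type_id >> 16);
    out[2] = (uint8_t)(type_id >> 8);
    out[3] = (uint8_t)type_id;
}

/* Recover the 24-bit VSI type from the packed array. */
uint32_t vsi_type(const uint8_t in[VSI_ID_LEN])
{
    return ((uint32_t)in[1] << 16) | ((uint32_t)in[2] << 8) | in[3];
}
```

A string-based port profile would need a longer, variable-length attribute,
which is why a u8 array with a maximum length was suggested as the common
container.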

> > In order to make VEPA work, it's absolutely required to impose a hard limit
> > on what MAC+VLAN IDs are visible to the VF, because the switch identifies
> > the guest by those and forwards any frames to/from that address according
> > to the VSI type.
> > 
> > However, I feel that we should strictly separate the steps of configuring
> > the adapter from talking to the switch. When we do the VDP association
> > in user land, we still need to set up the VLAN and MAC configuration for
> > the VF through a kernel interface. If we ignore the port profile stuff
> > for a moment, your netlink interface looks like a good fit for that.
> > 
> > Since it seems what you really want to do is to do the exchange with the
> > switch from here, maybe the hardware configuration part should be moved
> > to the DCB interface?
> 
> I suppose this would work  (although it's a bit odd being out of scope
> of DCB spec).

It could be anywhere, it doesn't have to be the DCB interface, but could
be anything ranging from ethtool to iplink I guess. And we should define
it in a way that works for any SR-IOV card, whether it's using Cisco's
protocol in firmware, 802.1Qbg VDP in firmware, lldpad to do VDP or
none of the above and just provides an internal switch like all the
existing NICs.

> I don't expect mgmt app to care about the implementation
> specifics of an adapter, so it will always send this and iovnl message
> too.  All as part of same setup.

Why? I really see these things as separate. Obviously a management
tool like libvirt would need to do both these things eventually, but
each of them has multiple options that can be combined in various
ways:

1. Setting up the slave device
 a) create an SR-IOV VF to assign to a guest
 b) create a macvtap device to pass to qemu or vhost
 c) attach a tap device to a bridge
 d) create a macvlan device and put it into a container
 e) create a virtual interface for a VMDq adapter

2. Registering the slave with the switch
 a) use Cisco protocol in enic firmware (see patch 2/2)
 b) use standard VDP in lldpad
 c) use reverse-engineered cisco protocol in some user tool for
    non-enic adapters.
 d) use standard VDP in firmware (hopefully this never happens)
 e) do nothing at all (as we do today)

Some of the cases can be treated identically, e.g. 1d) and 1e), or
2a) and 2c), but in general the management app needs to have some
idea of which combination it's going to set up.

	Arnd

^ permalink raw reply

* [PATCH] Socket filter access to hatype
From: Paul LeoNerd Evans @ 2010-04-21 17:25 UTC (permalink / raw)
  To: netdev

When capturing packets on a PF_PACKET/SOCK_RAW socket bound to all
interfaces, there doesn't appear to be a way for the filter program to
actually find out the underlying hardware type the packet was captured
on, such as is reported by the sll_hatype field of the struct sockaddr_ll
when the packet is sent up to userland.

Unless I've managed to miss a trick somewhere, this would seem to put a
fairly fundamental blocker on actually being able to filter such
packets. Granted, there's the SKF_NET_OFF area to inspect at e.g. the IPv4
level, but this makes it impossible to do anything at e.g. the Ethernet
level.

See below for a patch to add an SKF_AD_HATYPE field, up among the other
special access fields around SKF_AD_OFF.

diff -ur linux-2.6.33.2.orig/include/linux/filter.h linux-2.6.33.2/include/linux/filter.h
--- linux-2.6.33.2.orig/include/linux/filter.h	2010-04-02 00:02:33.000000000 +0100
+++ linux-2.6.33.2/include/linux/filter.h	2010-04-20 22:40:25.000000000 +0100
@@ -123,7 +123,8 @@
 #define SKF_AD_NLATTR_NEST	16
 #define SKF_AD_MARK 	20
 #define SKF_AD_QUEUE	24
-#define SKF_AD_MAX	28
+#define SKF_AD_HATYPE	28
+#define SKF_AD_MAX	32
 #define SKF_NET_OFF   (-0x100000)
 #define SKF_LL_OFF    (-0x200000)

diff -ur linux-2.6.33.2.orig/net/core/filter.c linux-2.6.33.2/net/core/filter.c
--- linux-2.6.33.2.orig/net/core/filter.c	2010-04-02 00:02:33.000000000 +0100
+++ linux-2.6.33.2/net/core/filter.c	2010-04-20 22:41:01.000000000 +0100
@@ -309,6 +309,9 @@
 		case SKF_AD_QUEUE:
 			A = skb->queue_mapping;
 			continue;
+		case SKF_AD_HATYPE:
+			A = skb->dev->type;
+			continue;
 		case SKF_AD_NLATTR: {
 			struct nlattr *nla;

-- 
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/


^ permalink raw reply

* Re: 2.6.34-rc5: Reported regressions from 2.6.33
From: Nick Bowler @ 2010-04-21 16:57 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Rafael J. Wysocki, DRI, Linux SCSI List, Network Development,
	Linux Wireless List, Linux Kernel Mailing List, Linux ACPI,
	Andrew Morton, Kernel Testers List, Linus Torvalds, Linux PM List,
	Maciej Rutecki
In-Reply-To: <20100421085739.GA2576@barney.localdomain>

On Wed, Apr 21, 2010 at 07:15:38AM +0200, Rafael J. Wysocki wrote:
> On Tuesday 20 April 2010, Nick Bowler wrote:
> > Please list these two similar regressions from 2.6.33 in the r600 DRM:
> > 
> >  * r600 CS checker rejects GL_DEPTH_TEST w/o depth buffer:
> >            https://bugs.freedesktop.org/show_bug.cgi?id=27571
> > 
> >  * r600 CS checker rejects narrow FBO renderbuffers:
> >            https://bugs.freedesktop.org/show_bug.cgi?id=27609
> 
> Do you want me to add them as one entry or as two separate bugs?

As upstream doesn't consider the first to be a kernel issue, I guess you
should just list the second.

On 10:57 Wed 21 Apr, Jerome Glisse wrote:
> First one is userspace bug, i need to look into the second one.
> ie we were lucky the hw didn't lockup without depth buffer and
> depth test enabled.

OK, if the failure is due to userspace doing Very Bad Things(tm),
catching that seems reasonable.

Nevertheless, even if it happened by luck, the result was (ostensibly)
working programs that suddenly break once one "upgrades" to the latest
kernel.  If userspace can't be fixed before 2.6.34 is released, perhaps
a less cryptic log message would be appropriate?

-- 
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: George B. @ 2010-04-21 16:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Evgeniy Polyakov, Ben Greear, David Miller, Gaspar Chilingarov,
	netdev
In-Reply-To: <1271849253.7895.1929.camel@edumazet-laptop>

On Wed, Apr 21, 2010 at 4:27 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Here is the patch I use now and my test application is now able to open
> and connect 1000000 sockets (ulimit -n 1000000)

I believe we hit this very issue yesterday in our test lab.  We had a stress
test running of one of our applications with about a dozen instances
of it running on the box.  Suddenly dns requests began failing with
the complaint that it couldn't make a request out because there were
no sockets.

root@champagne:/proc/sys/net/ipv4> host gh
host: isc_socket_bind: address in use

Netstat showed 61580 total sockets (UDP and TCP) on the address being
used by the above dns request. (local port range 1025 65535).  That
dns request should not have been failing.

I noticed that the number of UDP sockets was close to the maximum
allowed by the port range, but they were across different IP
addresses, no one IP address had too many and there should have been
available ports on all IP addresses.

Further, the number of udp sockets in use seemed to hit the wall at a
little above 64,000 and I never got above that number.
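
For reference, the local port range quoted above (1025 to 65535) holds 64511
ports, which is close to the ceiling observed. That would be consistent with
ports being used up across all addresses together rather than per address,
though this is only an inference from the numbers. A tiny sketch of the
arithmetic, with a made-up helper name:

```c
#include <assert.h>

/* Number of ports in an inclusive local port range [low, high]. */
int port_range_size(int low, int high)
{
    assert(low >= 0 && high <= 65535 && low <= high);
    return high - low + 1;
}
```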

If that is the normal behavior of the kernel, it could be a big
problem for scaling the application.

^ permalink raw reply
