Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH linux-next 1/2] irq: Add CPU mask affinity hint callback framework
From: Ben Hutchings @ 2010-04-21 12:59 UTC (permalink / raw)
  To: Peter P Waskiewicz Jr; +Cc: tglx, davem, arjan, netdev, linux-kernel
In-Reply-To: <20100420180112.1276.11906.stgit@ppwaskie-hc2.jf.intel.com>

On Tue, 2010-04-20 at 11:01 -0700, Peter P Waskiewicz Jr wrote:
> This patch adds a callback function pointer to the irq_desc
> structure, along with a registration function and a read-only
> proc entry for each interrupt.
> 
> This affinity_hint handle for each interrupt can be used by
> underlying drivers that need a better mechanism to control
> interrupt affinity.  The underlying driver can register a
> callback for the interrupt, which will allow the driver to
> provide the CPU mask for the interrupt to anything that
> requests it.  The intent is to extend the userspace daemon,
> irqbalance, to help hint to it a preferred CPU mask to balance
> the interrupt into.

Doesn't it make more sense to have the driver follow affinity decisions
made from user-space?  I realise that reallocating queues is disruptive
and we probably don't want irqbalance to trigger that, but there should
be a mechanism for the administrator to trigger it.

Looking at your patch for ixgbe:

[...]
> diff --git a/drivers/net/ixgbe/ixgbe_main.c
> b/drivers/net/ixgbe/ixgbe_main.c
> index 1b1419c..3e00d41 100644
> --- a/drivers/net/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ixgbe/ixgbe_main.c
[...]
> @@ -1083,6 +1113,16 @@ static void ixgbe_configure_msix(struct ixgbe_adapter *adapter)
>                         q_vector->eitr = adapter->rx_eitr_param;
>  
>                 ixgbe_write_eitr(q_vector);
> +
> +               /*
> +                * Allocate the affinity_hint cpumask, assign the mask for
> +                * this vector, and register our affinity_hint callback.
> +                */
> +               alloc_cpumask_var(&q_vector->affinity_mask, GFP_KERNEL);
> +               cpumask_set_cpu(v_idx, q_vector->affinity_mask);
> +               irq_register_affinity_hint(adapter->msix_entries[v_idx].vector,
> +                                          adapter,
> +                                          &ixgbe_irq_affinity_callback);
>         }
>  
>         if (adapter->hw.mac.type == ixgbe_mac_82598EB)
[...]

This just assigns IRQs to the first n CPU threads.  Depending on the
enumeration order, this might result in assigning an IRQ to each of 2
threads on a core while leaving other cores unused!

irqbalance can already find the various IRQs associated with a single
net device by looking at the handler names.  So it can do at least as
well as this without such a hint.  Unless drivers have *useful* hints to
give, I don't see the point in adding this mechanism.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH 1/2] netfilter: xtables: inclusion of xt_SYSRQ
From: Jan Engelhardt @ 2010-04-21 13:07 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Netfilter Developer Mailing List, Linux Netdev List, John Haxby
In-Reply-To: <4BCEF6B4.8090105@trash.net>

On Wednesday 2010-04-21 14:59, Patrick McHardy wrote:

>Jan Engelhardt wrote:
>> The SYSRQ target will allow to remotely invoke sysrq on the local
>> machine. Authentication is by means of a pre-shared key that can
>> either be transmitted plaintext or digest-secured.
>
>I really think this is pushing what netfilter is meant for a bit
>far. Its basically abusing the firewall ruleset to offer a network
>service.
>
>I can see that its useful to have this in the kernel instead of
>userspace, but why isn't this implemented as a stand-alone module?
>That seems like a better design to me and also makes it more useful
>by not depending on netfilter.

That sort of diverts from the earlier what-seemed-to-be-consensus.

Oh well, I would not mind holding the single commit up as long as the 
rest isn't blocked too :-)


^ permalink raw reply

* Re: ipv6: Fix tcp_v6_send_response checksum
From: Herbert Xu @ 2010-04-21 13:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, cratiu
In-Reply-To: <20100421.015823.266647388.davem@davemloft.net>

On Wed, Apr 21, 2010 at 01:58:23AM -0700, David Miller wrote:
> 
> If we're going to use CHECKSUM_PARTIAL for these things (which we are
> since commit 2e8e18ef52e7dd1af0a3bd1f7d990a1d0b249586 "tcp: Set
> CHECKSUM_UNNECESSARY in tcp_init_nondata_skb"), we can't be setting
> buff->csum as we always have been here in tcp_v6_send_response.  We
> need to leave it at zero.
> 
> Kill that line and checksums are good again.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>

That line is needed for the CHECKSUM_NONE case.  In fact, if we're
in the CHECKSUM_PARTIAL case, then that buff->csum calculation
should be a noop because it immediately gets overwritten when we
subsequently set csum_start/csum_offset.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 1/2] netfilter: xtables: inclusion of xt_SYSRQ
From: Patrick McHardy @ 2010-04-21 13:17 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Netfilter Developer Mailing List, Linux Netdev List, John Haxby
In-Reply-To: <alpine.LSU.2.01.1004211506280.10840@obet.zrqbmnf.qr>

Jan Engelhardt wrote:
> On Wednesday 2010-04-21 14:59, Patrick McHardy wrote:
> 
>> Jan Engelhardt wrote:
>>> The SYSRQ target will allow to remotely invoke sysrq on the local
>>> machine. Authentication is by means of a pre-shared key that can
>>> either be transmitted plaintext or digest-secured.
>> I really think this is pushing what netfilter is meant for a bit
>> far. Its basically abusing the firewall ruleset to offer a network
>> service.
>>
>> I can see that its useful to have this in the kernel instead of
>> userspace, but why isn't this implemented as a stand-alone module?
>> That seems like a better design to me and also makes it more useful
>> by not depending on netfilter.
> 
> That sort of diverts from the earlier what-seemed-to-be-consensus.
> 
> Oh well, I would not mind holding the single commit up as long as the 
> rest isn't blocked too :-)

Then lets skip this one for now.

^ permalink raw reply

* Re: crash with bridge and inconsistent handling of NETDEV_TX_OK
From: Mikulas Patocka @ 2010-04-21 13:21 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, kaber
In-Reply-To: <20100420.190127.66303903.davem@davemloft.net>



On Tue, 20 Apr 2010, David Miller wrote:

> 
> I looked more at your crash report.
> 
> You shouldn't even be in this code path for other reasons, namely
> skb->next should be NULL.  But it's not in your case.  skb->next would
> only be non-NULL for GSO frames, which we've established we should not
> be seeing here.
> 
> Given that skb->next is non-NULL and the fraglists of this SKB are
> corrupted (next pointer is 0x18), I think we're getting memory
> corruption from somewhere else.  This also jives with the fact that
> this is not readily reproducable.

The crash happened just a few days after I started to use the machine for 
bridging. There were no unexplained crashes before. So I suspect that the 
cause is bridging or tg3.

> The whole ->ndo_start_xmit() return value stuff is unrelated to this
> issue, we shouldn't even be in this code path.  In fact, if reverting
> that TX flags handling commit makes your crashes go away it would be a
> huge surprise.

I thought that if some weird ->ndo_start_xmit() return values appeared, 
this would lead to misunderstanding who owns the skb, using of already 
deallocated skb and mentioned memory corruption. But I can't prove it.

Mikulas

^ permalink raw reply

* Re: [PATCH 1/2] netfilter: xtables: inclusion of xt_SYSRQ
From: Jan Engelhardt @ 2010-04-21 13:35 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Netfilter Developer Mailing List, Linux Netdev List, John Haxby
In-Reply-To: <4BCEFACC.7020804@trash.net>


On Wednesday 2010-04-21 15:17, Patrick McHardy wrote:
>Jan Engelhardt wrote:
>> On Wednesday 2010-04-21 14:59, Patrick McHardy wrote:
>> 
>>> Jan Engelhardt wrote:
>>>> The SYSRQ target will allow to remotely invoke sysrq on the local
>>>> machine. Authentication is by means of a pre-shared key that can
>>>> either be transmitted plaintext or digest-secured.
>>> I really think this is pushing what netfilter is meant for a bit
>>> far. Its basically abusing the firewall ruleset to offer a network
>>> service.
>>>
>>> I can see that its useful to have this in the kernel instead of
>>> userspace, but why isn't this implemented as a stand-alone module?
>>> That seems like a better design to me and also makes it more useful
>>> by not depending on netfilter.
>> 
>> That sort of diverts from the earlier what-seemed-to-be-consensus.
>> 
>> Oh well, I would not mind holding the single commit up as long as the 
>> rest isn't blocked too :-)
>
>Then lets skip this one for now.

Well you raised the concern before -- namely that kdboe would have
the very same feature. And yet, kdboe was not part of the kernel.
Neither is the magical stand-alone module.
I really prefer to have it in rather than out, because I know
that's going to mess up maintenance-here-and-there. I'm already
having a big time with xtables-addons that still carries
xt_condition and SYSRQ for a while, and it does have some different
code lines than the kernel copy.

^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Arnd Bergmann @ 2010-04-21 13:17 UTC (permalink / raw)
  To: Scott Feldman; +Cc: Chris Wright, davem, netdev
In-Reply-To: <C7F35C03.29CF4%scofeldm@cisco.com>

On Tuesday 20 April 2010, Scott Feldman wrote:
> On 4/20/10 9:19 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
>
> >> In the case of devices that can do adjacent switch negotiations directly.
> > 
> > I thought the idea to deal with those devices was to beat sense into
> > the respective developers until they do the negotiation in software 8-)
> 
> When the device can do the negotiation directly with the switch, why does it
> make sense to bypass that and use software on the host?  I don't think we'd
> want to give up on link speed/duplex auto-negotiation and punt those setting
> back to the user/host like in the old days.

For the link negotiation, the card is the right place because it's necessary
to get the link working before the OS can talk to the switch.
For VDP, that's different because the hypervisor needs to talk to the switch
before the guest can communicate, so there is no interdependency.

More importantly, the card cannot possibly do the protocol by itself,
because the information that gets exchanged is specific to the hypervisor and
the guest, not to the hardware. What you have implemented is another protocol
between the hypervisor and the NIC that exchanges the exact same data that
then gets sent to the switch. We already need to have an implementation that
sends this data to the switch from user space for all cards that don't do
it in firmware, so doing an alternative path in the adapter really creates
more work for the users, and means that when we fix bugs or add features
to the common code, you don't get them ;-).

	Arnd


^ permalink raw reply

* Re: [PATCH net-next-2.6 v4 10/14] l2tp: Convert rwlock to RCU
From: James Chapman @ 2010-04-21 13:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1271782553.7895.62.camel@edumazet-laptop>

Eric Dumazet wrote:
> Hi James
> 
> I started a while ago patching l2tp but I wont be able to finish and
> test the thing...
> 
> There is a fundamental problem with this kind of construct :
> (this was wrong even better your RCU conversion)
> 
> rcu_read_lock_bh()
> hlist_for_each_entry_rcu(session, walk, session_list, global_hlist) {
> 	if (session->session_id == session_id) {
> 		rcu_read_unlock_bh();
> 		return session;
> 	}
> }
> rcu_read_unlock_bh();
> 
> 
> While the lookup _is_ protected, the result is not.
> 
> As soon as you call rcu_read_unlock_bh(); and before the "return
> session;", current thread could be preempted and an other thread frees
> session under first thread. Unexpected things can then happen.
> 
> Therefore, you need either to :
> 
> 1) Take a refcount on session (or tunnel) before the return
> 2) Or move the rcu_read_lock_bh()/rcu_read_unlock_bh() at callers.
> 3) Or all callers use a stronger lock. But then, why use RCU ;)
> 
> Here is a preliminary patch, obviously not finished, nor compiled, nor
> tested, to give possible ways to handle this problem.
> 
> (I added the ref parameter to make sure to change function signatures,
> maybe its not necessary and we should always take references)

Thanks Eric. I'll take a look at this.

-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development


^ permalink raw reply

* Re: [PATCH] drivers/net/usb: Add new driver ipheth
From: Diego Giagio @ 2010-04-21 14:15 UTC (permalink / raw)
  To: L. Alberto Giménez
  Cc: linux-kernel, dborca, David S. Miller, James Bottomley,
	Ralf Baechle, Greg Kroah-Hartman, Jonas Sjöquist,
	Torgny Johansson, Steve Glendinning, David Brownell,
	Omar Laazimani, Rémi Denis-Courmont, netdev, linux-usb
In-Reply-To: <1271615719-14928-1-git-send-email-agimenez@sysvalve.es>

On Sun, Apr 18, 2010 at 3:35 PM, L. Alberto Giménez
<agimenez@sysvalve.es> wrote:
> From: Diego Giagio <diego@giagio.com>
>
> Add new driver to use tethering with an iPhone device. After initial submission,
> apply fixes to fit the new driver into the kernel standards.
>
> There are still a couple of minor (almost cosmetic-level) issues, but the driver
> is fully functional right now.
>

Signed-off-by: Diego Giagio <diego@giagio.com>
Cc: Daniel Borca <dborca@yahoo.com>

>
>  drivers/net/Makefile     |    1 +
>  drivers/net/usb/Kconfig  |   12 +
>  drivers/net/usb/Makefile |    1 +
>  drivers/net/usb/ipheth.c |  568 ++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 582 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/net/usb/ipheth.c
>
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index a583b50..12b280a 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -273,6 +273,7 @@ obj-$(CONFIG_USB_RTL8150)       += usb/
>  obj-$(CONFIG_USB_HSO)          += usb/
>  obj-$(CONFIG_USB_USBNET)        += usb/
>  obj-$(CONFIG_USB_ZD1201)        += usb/
> +obj-$(CONFIG_USB_IPHETH)        += usb/
>
>  obj-y += wireless/
>  obj-$(CONFIG_NET_TULIP) += tulip/
> diff --git a/drivers/net/usb/Kconfig b/drivers/net/usb/Kconfig
> index ba56ce4..63be4ca 100644
> --- a/drivers/net/usb/Kconfig
> +++ b/drivers/net/usb/Kconfig
> @@ -385,4 +385,16 @@ config USB_CDC_PHONET
>          cellular modem, as found on most Nokia handsets with the
>          "PC suite" USB profile.
>
> +config USB_IPHETH
> +       tristate "Apple iPhone USB Ethernet driver"
> +       default n
> +       ---help---
> +         Module used to share Internet connection (tethering) from your
> +         iPhone (Original, 3G and 3GS) to your system.
> +         Note that you need userspace libraries and programs that are needed
> +         to pair your device with your system and that understand the iPhone
> +         protocol.
> +
> +         For more information: http://giagio.com/wiki/moin.cgi/iPhoneEthernetDriver
> +
>  endmenu
> diff --git a/drivers/net/usb/Makefile b/drivers/net/usb/Makefile
> index 82ea629..edb09c0 100644
> --- a/drivers/net/usb/Makefile
> +++ b/drivers/net/usb/Makefile
> @@ -23,4 +23,5 @@ obj-$(CONFIG_USB_NET_MCS7830) += mcs7830.o
>  obj-$(CONFIG_USB_USBNET)       += usbnet.o
>  obj-$(CONFIG_USB_NET_INT51X1)  += int51x1.o
>  obj-$(CONFIG_USB_CDC_PHONET)   += cdc-phonet.o
> +obj-$(CONFIG_USB_IPHETH)       += ipheth.o
>
> diff --git a/drivers/net/usb/ipheth.c b/drivers/net/usb/ipheth.c
> new file mode 100644
> index 0000000..fd10331
> --- /dev/null
> +++ b/drivers/net/usb/ipheth.c
> @@ -0,0 +1,568 @@
> +/*
> + * ipheth.c - Apple iPhone USB Ethernet driver
> + *
> + * Copyright (c) 2009 Diego Giagio <diego@giagio.com>
> + * All rights reserved.
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + * 3. Neither the name of GIAGIO.COM nor the names of its contributors
> + *    may be used to endorse or promote products derived from this software
> + *    without specific prior written permission.
> + *
> + * Alternatively, provided that this notice is retained in full, this
> + * software may be distributed under the terms of the GNU General
> + * Public License ("GPL") version 2, in which case the provisions of the
> + * GPL apply INSTEAD OF those given above.
> + *
> + * The provided data structures and external interfaces from this code
> + * are not restricted to be used by modules with a GPL compatible license.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
> + * DAMAGE.
> + *
> + *
> + * Attention: iPhone device must be paired, otherwise it won't respond to our
> + * driver. For more info: http://giagio.com/wiki/moin.cgi/iPhoneEthernetDriver
> + *
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/errno.h>
> +#include <linux/init.h>
> +#include <linux/slab.h>
> +#include <linux/module.h>
> +#include <linux/netdevice.h>
> +#include <linux/etherdevice.h>
> +#include <linux/ethtool.h>
> +#include <linux/usb.h>
> +#include <linux/workqueue.h>
> +
> +#define USB_VENDOR_APPLE        0x05ac
> +#define USB_PRODUCT_IPHONE      0x1290
> +#define USB_PRODUCT_IPHONE_3G   0x1292
> +#define USB_PRODUCT_IPHONE_3GS  0x1294
> +
> +#define IPHETH_USBINTF_CLASS    255
> +#define IPHETH_USBINTF_SUBCLASS 253
> +#define IPHETH_USBINTF_PROTO    1
> +
> +#define IPHETH_BUF_SIZE         1516
> +#define IPHETH_TX_TIMEOUT       (5 * HZ)
> +
> +#define IPHETH_INTFNUM          2
> +#define IPHETH_ALT_INTFNUM      1
> +
> +#define IPHETH_CTRL_ENDP        0x00
> +#define IPHETH_CTRL_BUF_SIZE    0x40
> +#define IPHETH_CTRL_TIMEOUT     (5 * HZ)
> +
> +#define IPHETH_CMD_GET_MACADDR   0x00
> +#define IPHETH_CMD_CARRIER_CHECK 0x45
> +
> +#define IPHETH_CARRIER_CHECK_TIMEOUT round_jiffies_relative(1 * HZ)
> +#define IPHETH_CARRIER_ON       0x04
> +
> +static struct usb_device_id ipheth_table[] = {
> +       { USB_DEVICE_AND_INTERFACE_INFO(
> +               USB_VENDOR_APPLE, USB_PRODUCT_IPHONE,
> +               IPHETH_USBINTF_CLASS, IPHETH_USBINTF_SUBCLASS,
> +               IPHETH_USBINTF_PROTO) },
> +       { USB_DEVICE_AND_INTERFACE_INFO(
> +               USB_VENDOR_APPLE, USB_PRODUCT_IPHONE_3G,
> +               IPHETH_USBINTF_CLASS, IPHETH_USBINTF_SUBCLASS,
> +               IPHETH_USBINTF_PROTO) },
> +       { USB_DEVICE_AND_INTERFACE_INFO(
> +               USB_VENDOR_APPLE, USB_PRODUCT_IPHONE_3GS,
> +               IPHETH_USBINTF_CLASS, IPHETH_USBINTF_SUBCLASS,
> +               IPHETH_USBINTF_PROTO) },
> +       { }
> +};
> +MODULE_DEVICE_TABLE(usb, ipheth_table);
> +
> +struct ipheth_device {
> +       struct usb_device *udev;
> +       struct usb_interface *intf;
> +       struct net_device *net;
> +       struct sk_buff *tx_skb;
> +       struct urb *tx_urb;
> +       struct urb *rx_urb;
> +       unsigned char *tx_buf;
> +       unsigned char *rx_buf;
> +       unsigned char *ctrl_buf;
> +       u8 bulk_in;
> +       u8 bulk_out;
> +       struct delayed_work carrier_work;
> +};
> +
> +static int ipheth_rx_submit(struct ipheth_device *dev, gfp_t mem_flags);
> +
> +static int ipheth_alloc_urbs(struct ipheth_device *iphone)
> +{
> +       struct urb *tx_urb = NULL;
> +       struct urb *rx_urb = NULL;
> +       u8 *tx_buf = NULL;
> +       u8 *rx_buf = NULL;
> +
> +       tx_urb = usb_alloc_urb(0, GFP_KERNEL);
> +       if (tx_urb == NULL)
> +               goto error;
> +
> +       rx_urb = usb_alloc_urb(0, GFP_KERNEL);
> +       if (rx_urb == NULL)
> +               goto error;
> +
> +       tx_buf = usb_buffer_alloc(iphone->udev,
> +                                 IPHETH_BUF_SIZE,
> +                                 GFP_KERNEL,
> +                                 &tx_urb->transfer_dma);
> +       if (tx_buf == NULL)
> +               goto error;
> +
> +       rx_buf = usb_buffer_alloc(iphone->udev,
> +                                 IPHETH_BUF_SIZE,
> +                                 GFP_KERNEL,
> +                                 &rx_urb->transfer_dma);
> +       if (rx_buf == NULL)
> +               goto error;
> +
> +
> +       iphone->tx_urb = tx_urb;
> +       iphone->rx_urb = rx_urb;
> +       iphone->tx_buf = tx_buf;
> +       iphone->rx_buf = rx_buf;
> +       return 0;
> +
> +error:
> +       usb_buffer_free(iphone->udev, IPHETH_BUF_SIZE, rx_buf,
> +                       rx_urb->transfer_dma);
> +       usb_buffer_free(iphone->udev, IPHETH_BUF_SIZE, tx_buf,
> +                       tx_urb->transfer_dma);
> +       usb_free_urb(rx_urb);
> +       usb_free_urb(tx_urb);
> +       return -ENOMEM;
> +}
> +
> +static void ipheth_free_urbs(struct ipheth_device *iphone)
> +{
> +       usb_buffer_free(iphone->udev, IPHETH_BUF_SIZE, iphone->rx_buf,
> +                       iphone->rx_urb->transfer_dma);
> +       usb_buffer_free(iphone->udev, IPHETH_BUF_SIZE, iphone->tx_buf,
> +                       iphone->tx_urb->transfer_dma);
> +       usb_free_urb(iphone->rx_urb);
> +       usb_free_urb(iphone->tx_urb);
> +}
> +
> +static void ipheth_kill_urbs(struct ipheth_device *dev)
> +{
> +       usb_kill_urb(dev->tx_urb);
> +       usb_kill_urb(dev->rx_urb);
> +}
> +
> +static void ipheth_rcvbulk_callback(struct urb *urb)
> +{
> +       struct ipheth_device *dev;
> +       struct sk_buff *skb;
> +       int status;
> +       char *buf;
> +       int len;
> +
> +       dev = urb->context;
> +       if (dev == NULL)
> +               return;
> +
> +       status = urb->status;
> +       switch (status) {
> +       case -ENOENT:
> +       case -ECONNRESET:
> +       case -ESHUTDOWN:
> +               return;
> +       case 0:
> +               break;
> +       default:
> +               err("%s: urb status: %d", __func__, urb->status);
> +               return;
> +       }
> +
> +       len = urb->actual_length;
> +       buf = urb->transfer_buffer;
> +
> +       skb = dev_alloc_skb(NET_IP_ALIGN + len);
> +       if (!skb) {
> +               err("%s: dev_alloc_skb: -ENOMEM", __func__);
> +               dev->net->stats.rx_dropped++;
> +               return;
> +       }
> +
> +       skb_reserve(skb, NET_IP_ALIGN);
> +       memcpy(skb_put(skb, len), buf + NET_IP_ALIGN, len - NET_IP_ALIGN);
> +       skb->dev = dev->net;
> +       skb->protocol = eth_type_trans(skb, dev->net);
> +
> +       dev->net->stats.rx_packets++;
> +       dev->net->stats.rx_bytes += len;
> +
> +       netif_rx(skb);
> +       ipheth_rx_submit(dev, GFP_ATOMIC);
> +}
> +
> +static void ipheth_sndbulk_callback(struct urb *urb)
> +{
> +       struct ipheth_device *dev;
> +
> +       dev = urb->context;
> +       if (dev == NULL)
> +               return;
> +
> +       if (urb->status != 0 &&
> +           urb->status != -ENOENT &&
> +           urb->status != -ECONNRESET &&
> +           urb->status != -ESHUTDOWN)
> +               err("%s: urb status: %d", __func__, urb->status);
> +
> +       dev_kfree_skb_irq(dev->tx_skb);
> +       netif_wake_queue(dev->net);
> +}
> +
> +static int ipheth_carrier_set(struct ipheth_device *dev)
> +{
> +       struct usb_device *udev = dev->udev;
> +       int retval;
> +
> +       retval = usb_control_msg(udev,
> +                       usb_rcvctrlpipe(udev, IPHETH_CTRL_ENDP),
> +                       IPHETH_CMD_CARRIER_CHECK, /* request */
> +                       0xc0, /* request type */
> +                       0x00, /* value */
> +                       0x02, /* index */
> +                       dev->ctrl_buf, IPHETH_CTRL_BUF_SIZE,
> +                       IPHETH_CTRL_TIMEOUT);
> +       if (retval < 0) {
> +               err("%s: usb_control_msg: %d", __func__, retval);
> +               return retval;
> +       }
> +
> +       if (dev->ctrl_buf[0] == IPHETH_CARRIER_ON)
> +               netif_carrier_on(dev->net);
> +       else
> +               netif_carrier_off(dev->net);
> +
> +       return 0;
> +}
> +
> +static void ipheth_carrier_check_work(struct work_struct *work)
> +{
> +       struct ipheth_device *dev = container_of(work, struct ipheth_device,
> +                                                carrier_work.work);
> +
> +       ipheth_carrier_set(dev);
> +       schedule_delayed_work(&dev->carrier_work, IPHETH_CARRIER_CHECK_TIMEOUT);
> +}
> +
> +static int ipheth_get_macaddr(struct ipheth_device *dev)
> +{
> +       struct usb_device *udev = dev->udev;
> +       struct net_device *net = dev->net;
> +       int retval;
> +
> +       retval = usb_control_msg(udev,
> +                                usb_rcvctrlpipe(udev, IPHETH_CTRL_ENDP),
> +                                IPHETH_CMD_GET_MACADDR, /* request */
> +                                0xc0, /* request type */
> +                                0x00, /* value */
> +                                0x02, /* index */
> +                                dev->ctrl_buf,
> +                                IPHETH_CTRL_BUF_SIZE,
> +                                IPHETH_CTRL_TIMEOUT);
> +       if (retval < 0) {
> +               err("%s: usb_control_msg: %d", __func__, retval);
> +       } else if (retval < ETH_ALEN) {
> +               err("%s: usb_control_msg: short packet: %d bytes",
> +                       __func__, retval);
> +               retval = -EINVAL;
> +       } else {
> +               memcpy(net->dev_addr, dev->ctrl_buf, ETH_ALEN);
> +               retval = 0;
> +       }
> +
> +       return retval;
> +}
> +
> +static int ipheth_rx_submit(struct ipheth_device *dev, gfp_t mem_flags)
> +{
> +       struct usb_device *udev = dev->udev;
> +       int retval;
> +
> +       usb_fill_bulk_urb(dev->rx_urb, udev,
> +                         usb_rcvbulkpipe(udev, dev->bulk_in),
> +                         dev->rx_buf, IPHETH_BUF_SIZE,
> +                         ipheth_rcvbulk_callback,
> +                         dev);
> +       dev->rx_urb->transfer_flags |= URB_NO_TRANSFER_DMA_MAP;
> +
> +       retval = usb_submit_urb(dev->rx_urb, mem_flags);
> +       if (retval)
> +               err("%s: usb_submit_urb: %d", __func__, retval);
> +       return retval;
> +}
> +
> +static int ipheth_open(struct net_device *net)
> +{
> +       struct ipheth_device *dev = netdev_priv(net);
> +       struct usb_device *udev = dev->udev;
> +       int retval = 0;
> +
> +       usb_set_interface(udev, IPHETH_INTFNUM, IPHETH_ALT_INTFNUM);
> +
> +       retval = ipheth_carrier_set(dev);
> +       if (retval)
> +               return retval;
> +
> +       retval = ipheth_rx_submit(dev, GFP_KERNEL);
> +       if (retval)
> +               return retval;
> +
> +       schedule_delayed_work(&dev->carrier_work, IPHETH_CARRIER_CHECK_TIMEOUT);
> +       netif_start_queue(net);
> +       return retval;
> +}
> +
> +static int ipheth_close(struct net_device *net)
> +{
> +       struct ipheth_device *dev = netdev_priv(net);
> +
> +       cancel_delayed_work_sync(&dev->carrier_work);
> +       netif_stop_queue(net);
> +       return 0;
> +}
> +
> +static int ipheth_tx(struct sk_buff *skb, struct net_device *net)
> +{
> +       struct ipheth_device *dev = netdev_priv(net);
> +       struct usb_device *udev = dev->udev;
> +       int retval;
> +
> +       /* Paranoid */
> +       if (skb->len > IPHETH_BUF_SIZE) {
> +               WARN(1, "%s: skb too large: %d bytes", __func__, skb->len);
> +               dev->net->stats.tx_dropped++;
> +               dev_kfree_skb_irq(skb);
> +               return NETDEV_TX_OK;
> +       }
> +
> +       memcpy(dev->tx_buf, skb->data, skb->len);
> +       if (skb->len < IPHETH_BUF_SIZE)
> +               memset(dev->tx_buf + skb->len, 0, IPHETH_BUF_SIZE - skb->len);
> +
> +       usb_fill_bulk_urb(dev->tx_urb, udev,
> +                         usb_sndbulkpipe(udev, dev->bulk_out),
> +                         dev->tx_buf, IPHETH_BUF_SIZE,
> +                         ipheth_sndbulk_callback,
> +                         dev);
> +       dev->tx_urb->transfer_flags |= URB_NO_TRANSFER_DMA_MAP;
> +
> +       retval = usb_submit_urb(dev->tx_urb, GFP_ATOMIC);
> +       if (retval) {
> +               err("%s: usb_submit_urb: %d", __func__, retval);
> +               dev->net->stats.tx_errors++;
> +               dev_kfree_skb_irq(skb);
> +       } else {
> +               dev->tx_skb = skb;
> +
> +               dev->net->stats.tx_packets++;
> +               dev->net->stats.tx_bytes += skb->len;
> +               netif_stop_queue(net);
> +       }
> +
> +       return NETDEV_TX_OK;
> +}
> +
> +static void ipheth_tx_timeout(struct net_device *net)
> +{
> +       struct ipheth_device *dev = netdev_priv(net);
> +
> +       err("%s: TX timeout", __func__);
> +       dev->net->stats.tx_errors++;
> +       usb_unlink_urb(dev->tx_urb);
> +}
> +
> +static struct net_device_stats *ipheth_stats(struct net_device *net)
> +{
> +       struct ipheth_device *dev = netdev_priv(net);
> +       return &dev->net->stats;
> +}
> +
> +static u32 ipheth_ethtool_op_get_link(struct net_device *net)
> +{
> +       struct ipheth_device *dev = netdev_priv(net);
> +       return netif_carrier_ok(dev->net);
> +}
> +
> +static struct ethtool_ops ops = {
> +       .get_link = ipheth_ethtool_op_get_link
> +};
> +
> +static const struct net_device_ops ipheth_netdev_ops = {
> +       .ndo_open = &ipheth_open,
> +       .ndo_stop = &ipheth_close,
> +       .ndo_start_xmit = &ipheth_tx,
> +       .ndo_tx_timeout = &ipheth_tx_timeout,
> +       .ndo_get_stats = &ipheth_stats,
> +};
> +
> +static struct device_type ipheth_type = {
> +       .name   = "wwan",
> +};
> +
> +static int ipheth_probe(struct usb_interface *intf,
> +                       const struct usb_device_id *id)
> +{
> +       struct usb_device *udev = interface_to_usbdev(intf);
> +       struct usb_host_interface *hintf;
> +       struct usb_endpoint_descriptor *endp;
> +       struct ipheth_device *dev;
> +       struct net_device *netdev;
> +       int i;
> +       int retval;
> +
> +       netdev = alloc_etherdev(sizeof(struct ipheth_device));
> +       if (!netdev)
> +               return -ENOMEM;
> +
> +       netdev->netdev_ops = &ipheth_netdev_ops;
> +       netdev->watchdog_timeo = IPHETH_TX_TIMEOUT;
> +       strcpy(netdev->name, "wwan%d");
> +
> +       dev = netdev_priv(netdev);
> +       dev->udev = udev;
> +       dev->net = netdev;
> +       dev->intf = intf;
> +
> +       /* Set up endpoints */
> +       hintf = usb_altnum_to_altsetting(intf, IPHETH_ALT_INTFNUM);
> +       if (hintf == NULL) {
> +               retval = -ENODEV;
> +               err("Unable to find alternate settings interface");
> +               goto err_endpoints;
> +       }
> +
> +       for (i = 0; i < hintf->desc.bNumEndpoints; i++) {
> +               endp = &hintf->endpoint[i].desc;
> +               if (usb_endpoint_is_bulk_in(endp))
> +                       dev->bulk_in = endp->bEndpointAddress;
> +               else if (usb_endpoint_is_bulk_out(endp))
> +                       dev->bulk_out = endp->bEndpointAddress;
> +       }
> +       if (!(dev->bulk_in && dev->bulk_out)) {
> +               retval = -ENODEV;
> +               err("Unable to find endpoints");
> +               goto err_endpoints;
> +       }
> +
> +       dev->ctrl_buf = kmalloc(IPHETH_CTRL_BUF_SIZE, GFP_KERNEL);
> +       if (dev->ctrl_buf == NULL) {
> +               retval = -ENOMEM;
> +               goto err_alloc_ctrl_buf;
> +       }
> +
> +       retval = ipheth_get_macaddr(dev);
> +       if (retval)
> +               goto err_get_macaddr;
> +
> +       INIT_DELAYED_WORK(&dev->carrier_work, ipheth_carrier_check_work);
> +
> +       retval = ipheth_alloc_urbs(dev);
> +       if (retval) {
> +               err("error allocating urbs: %d", retval);
> +               goto err_alloc_urbs;
> +       }
> +
> +       usb_set_intfdata(intf, dev);
> +
> +       SET_NETDEV_DEV(netdev, &intf->dev);
> +       SET_ETHTOOL_OPS(netdev, &ops);
> +       SET_NETDEV_DEVTYPE(netdev, &ipheth_type);
> +
> +       retval = register_netdev(netdev);
> +       if (retval) {
> +               err("error registering netdev: %d", retval);
> +               retval = -EIO;
> +               goto err_register_netdev;
> +       }
> +
> +       dev_info(&intf->dev, "Apple iPhone USB Ethernet device attached\n");
> +       return 0;
> +
> +err_register_netdev:
> +       ipheth_free_urbs(dev);
> +err_alloc_urbs:
> +err_get_macaddr:
> +err_alloc_ctrl_buf:
> +       kfree(dev->ctrl_buf);
> +err_endpoints:
> +       free_netdev(netdev);
> +       return retval;
> +}
> +
> +static void ipheth_disconnect(struct usb_interface *intf)
> +{
> +       struct ipheth_device *dev;
> +
> +       dev = usb_get_intfdata(intf);
> +       if (dev != NULL) {
> +               unregister_netdev(dev->net);
> +               ipheth_kill_urbs(dev);
> +               ipheth_free_urbs(dev);
> +               kfree(dev->ctrl_buf);
> +               free_netdev(dev->net);
> +       }
> +       usb_set_intfdata(intf, NULL);
> +       dev_info(&intf->dev, "Apple iPhone USB Ethernet now disconnected\n");
> +}
> +
> +static struct usb_driver ipheth_driver = {
> +       .name =         "ipheth",
> +       .probe =        ipheth_probe,
> +       .disconnect =   ipheth_disconnect,
> +       .id_table =     ipheth_table,
> +};
> +
> +static int __init ipheth_init(void)
> +{
> +       int retval;
> +
> +       retval = usb_register(&ipheth_driver);
> +       if (retval) {
> +               err("usb_register failed: %d", retval);
> +               return retval;
> +       }
> +       return 0;
> +}
> +
> +static void __exit ipheth_exit(void)
> +{
> +       usb_deregister(&ipheth_driver);
> +}
> +
> +module_init(ipheth_init);
> +module_exit(ipheth_exit);
> +
> +MODULE_AUTHOR("Diego Giagio <diego@giagio.com>");
> +MODULE_DESCRIPTION("Apple iPhone USB Ethernet driver");
> +MODULE_LICENSE("Dual BSD/GPL");
> --
> 1.7.0
>
>



-- 
Diego Giagio
diego@giagio.com / diego@FreeBSD.org

^ permalink raw reply

* Re: [PATCH] gianfar: Wait for both RX and TX to stop
From: Timur Tabi @ 2010-04-21 14:33 UTC (permalink / raw)
  To: Kumar Gala; +Cc: David Miller, afleming, netdev
In-Reply-To: <10AE27BE-1830-4AAD-83E6-20001BC430D8@kernel.crashing.org>

On Wed, Apr 21, 2010 at 7:17 AM, Kumar Gala <galak@kernel.crashing.org> wrote:

> I understand, its more a sense that we are saying we want to time out for what I consider a catastrophic HW failure.

And how else will you detect and recover from such a failure without a
timeout?  And are you absolutely certain that there will never be a
programming failure that will cause this loop to spin forever?

If you're really opposed to a timeout, you can still use
spin_event_timeout() by just setting the timeout to -1 and adding a
comment explaining why.

-- 
Timur Tabi
Linux kernel developer at Freescale

^ permalink raw reply

* Re: Bug#577640: linux-image-2.6.33-2-amd64: Kernel warnings in netns thread
From: Martín Ferrari @ 2010-04-21 15:19 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: 577640, netdev, Eric W. Biederman, Alexey Dobriyan,
	Mathieu Lacage
In-Reply-To: <1271612005.3679.248.camel@localhost>

I'm not starting a new thread/bug, as this is probably related...

I just discovered that in 2.6.33, if I create a veth inside a
namespace and then move one of the halves into the main namespace,
when I kill the namespace, I get one of these warnings followed by an
oops. This does not happen if the veth is created from the main ns and
then moved, nor in 2.6.32. This happens both in Qemu and on real
hardware (both amd64)

To reproduce:

$ sudo ./startns bash
# ip l a type veth
# ip l s veth0 netns 1
# exit

The full log:


[  280.655876] ------------[ cut here ]------------
[  280.655916] WARNING: at
/build/mattems-linux-2.6_2.6.33-1~experimental.4-amd64-ieqSsa/linux-2.6-2.6.33-1~experimental.4/debian/build/source_amd64_none/fs/sysfs/dir.c:487
sysfs_add_one+0xcc/0xe4()
[  280.655920] Hardware name:
[  280.655922] sysfs: cannot create duplicate filename
'/devices/virtual/net/veth0/statistics'
[  280.655924] Modules linked in: veth loop parport_pc parport snd_pcm
evdev snd_timer tpm_tis button serio_raw snd i2c_piix4 tpm soundcore
tpm_bios processor snd_page_alloc psmouse i2c_core pcspkr ext3 jbd
mbcache sg sr_mod cdrom sd_mod crc_t10dif ata_generic ata_piix libata
thermal floppy thermal_sys 8139cp 8139too mii scsi_mod [last unloaded:
scsi_wait_scan]
[  280.655954] Pid: 1276, comm: ip Not tainted 2.6.33-2-amd64 #1
[  280.655956] Call Trace:
[  280.655961]  [<ffffffff8113a399>] ? sysfs_add_one+0xcc/0xe4
[  280.655964]  [<ffffffff8113a399>] ? sysfs_add_one+0xcc/0xe4
[  280.655987]  [<ffffffff81046b81>] ? warn_slowpath_common+0x77/0xa3
[  280.655991]  [<ffffffff81046c09>] ? warn_slowpath_fmt+0x51/0x59
[  280.655994]  [<ffffffff8113a2c5>] ? sysfs_pathname+0x35/0x3d
[  280.655997]  [<ffffffff8113a2c5>] ? sysfs_pathname+0x35/0x3d
[  280.656000]  [<ffffffff8113a2c5>] ? sysfs_pathname+0x35/0x3d
[  280.656007]  [<ffffffff8113a2c5>] ? sysfs_pathname+0x35/0x3d
[  280.656031]  [<ffffffff8113a399>] ? sysfs_add_one+0xcc/0xe4
[  280.656035]  [<ffffffff8113aac5>] ? create_dir+0x4f/0x7c
[  280.656038]  [<ffffffff8113baac>] ? internal_create_group+0x51/0x15a
[  280.656062]  [<ffffffff8120f3ba>] ? device_add_groups+0x22/0x5d
[  280.656066]  [<ffffffff8120faba>] ? device_add+0x2d1/0x54a
[  280.656085]  [<ffffffff81243d22>] ? dev_change_net_namespace+0x1a1/0x1fc
[  280.656097]  [<ffffffff8124b625>] ? do_setlink+0x67/0x356
[  280.656101]  [<ffffffff8124c452>] ? rtnl_newlink+0x2f4/0x4f3
[  280.656104]  [<ffffffff8124c203>] ? rtnl_newlink+0xa5/0x4f3
[  280.656107]  [<ffffffff8123cdcf>] ? __skb_recv_datagram+0x11a/0x24f
[  280.656116]  [<ffffffff8123c040>] ? memcpy_toiovec+0x3d/0x6d
[  280.656124]  [<ffffffff8125d39d>] ? netlink_sendmsg+0x15f/0x252
[  280.656127]  [<ffffffff8124bf69>] ? rtnetlink_rcv_msg+0x0/0x1f5
[  280.656130]  [<ffffffff8125cf55>] ? netlink_rcv_skb+0x34/0x7c
[  280.656133]  [<ffffffff8124bf63>] ? rtnetlink_rcv+0x1f/0x25
[  280.656136]  [<ffffffff8125cd49>] ? netlink_unicast+0xe2/0x148
[  280.656139]  [<ffffffff8125d47d>] ? netlink_sendmsg+0x23f/0x252
[  280.656142]  [<ffffffff81232ea3>] ? sock_sendmsg+0x83/0x9b
[  280.656162]  [<ffffffff810b5122>] ? __alloc_pages_nodemask+0x10f/0x5e1
[  280.656166]  [<ffffffff8123bd03>] ? copy_from_user+0x13/0x25
[  280.656169]  [<ffffffff8123c0b9>] ? verify_iovec+0x49/0x84
[  280.656172]  [<ffffffff81233178>] ? sys_sendmsg+0x225/0x2af
[  280.656176]  [<ffffffff812343c7>] ? sys_recvmsg+0x48/0x56
[  280.656188]  [<ffffffff81008ac2>] ? system_call_fastpath+0x16/0x1b
[  280.656191] ---[ end trace f31191528d9325da ]---
[  280.658197] ------------[ cut here ]------------
[  280.658202] WARNING: at
/build/mattems-linux-2.6_2.6.33-1~experimental.4-amd64-ieqSsa/linux-2.6-2.6.33-1~experimental.4/debian/build/source_amd64_none/net/core/dev.c:5597
dev_change_net_namespace+0x1b6/0x1fc()
[  280.658205] Hardware name:
[  280.658207] Modules linked in: veth loop parport_pc parport snd_pcm
evdev snd_timer tpm_tis button serio_raw snd i2c_piix4 tpm soundcore
tpm_bios processor snd_page_alloc psmouse i2c_core pcspkr ext3 jbd
mbcache sg sr_mod cdrom sd_mod crc_t10dif ata_generic ata_piix libata
thermal floppy thermal_sys 8139cp 8139too mii scsi_mod [last unloaded:
scsi_wait_scan]
[  280.658229] Pid: 1276, comm: ip Tainted: G        W  2.6.33-2-amd64 #1
[  280.658231] Call Trace:
[  280.658235]  [<ffffffff81243d37>] ? dev_change_net_namespace+0x1b6/0x1fc
[  280.658238]  [<ffffffff81243d37>] ? dev_change_net_namespace+0x1b6/0x1fc
[  280.658241]  [<ffffffff81046b81>] ? warn_slowpath_common+0x77/0xa3
[  280.658245]  [<ffffffff81243d37>] ? dev_change_net_namespace+0x1b6/0x1fc
[  280.658248]  [<ffffffff8124b625>] ? do_setlink+0x67/0x356
[  280.658251]  [<ffffffff8124c452>] ? rtnl_newlink+0x2f4/0x4f3
[  280.658254]  [<ffffffff8124c203>] ? rtnl_newlink+0xa5/0x4f3
[  280.658257]  [<ffffffff8123cdcf>] ? __skb_recv_datagram+0x11a/0x24f
[  280.658261]  [<ffffffff8123c040>] ? memcpy_toiovec+0x3d/0x6d
[  280.658264]  [<ffffffff8125d39d>] ? netlink_sendmsg+0x15f/0x252
[  280.658267]  [<ffffffff8124bf69>] ? rtnetlink_rcv_msg+0x0/0x1f5
[  280.658270]  [<ffffffff8125cf55>] ? netlink_rcv_skb+0x34/0x7c
[  280.658273]  [<ffffffff8124bf63>] ? rtnetlink_rcv+0x1f/0x25
[  280.658275]  [<ffffffff8125cd49>] ? netlink_unicast+0xe2/0x148
[  280.658278]  [<ffffffff8125d47d>] ? netlink_sendmsg+0x23f/0x252
[  280.658281]  [<ffffffff81232ea3>] ? sock_sendmsg+0x83/0x9b
[  280.658285]  [<ffffffff810b5122>] ? __alloc_pages_nodemask+0x10f/0x5e1
[  280.658288]  [<ffffffff8123bd03>] ? copy_from_user+0x13/0x25
[  280.658292]  [<ffffffff8123c0b9>] ? verify_iovec+0x49/0x84
[  280.658294]  [<ffffffff81233178>] ? sys_sendmsg+0x225/0x2af
[  280.658298]  [<ffffffff812343c7>] ? sys_recvmsg+0x48/0x56
[  280.658301]  [<ffffffff81008ac2>] ? system_call_fastpath+0x16/0x1b
[  280.658304] ---[ end trace f31191528d9325db ]---
[  304.280782] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000008
[  304.282171] IP: [<ffffffff81215f17>] device_pm_remove+0x2c/0x4f
[  304.283021] PGD 0
[  304.283665] Oops: 0002 [#1] SMP
[  304.284016] last sysfs file: /sys/devices/virtual/net/lo/operstate
[  304.284016] CPU 0
[  304.284016] Pid: 9, comm: netns Tainted: G        W  2.6.33-2-amd64 #1 /
[  304.284016] RIP: 0010:[<ffffffff81215f17>]  [<ffffffff81215f17>]
device_pm_remove+0x2c/0x4f
[  304.284016] RSP: 0018:ffff88007fb9bd60  EFLAGS: 00010286
[  304.284016] RAX: 0000000000000000 RBX: ffff880037ccdc38 RCX: ffff880037ccdcd8
[  304.284016] RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffffffff816762a0
[  304.284016] RBP: ffff880037cc9000 R08: 0000000000000000 R09: ffffffff812ee659
[  304.284016] R10: ffff88007fb56200 R11: 00000000000003c0 R12: 0000000000000000
[  304.284016] R13: ffff88007fb9be00 R14: ffff88007fb9bdc0 R15: ffff88007fb63800
[  304.284016] FS:  0000000000000000(0000) GS:ffff880001a00000(0000)
knlGS:0000000000000000
[  304.284016] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  304.284016] CR2: 0000000000000008 CR3: 0000000001636000 CR4: 00000000000006f0
[  304.284016] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  304.284016] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  304.284016] Process netns (pid: 9, threadinfo ffff88007fb9a000,
task ffff88007fb63800)
[  304.284016] Stack:
[  304.284016]  ffff880037ccdc38 ffffffff8120f5fa ffff880037ccd800
ffff880037cc9000
[  304.284016] <0> ffff88007fb9bdc0 ffffffff812436aa ffff88007fb9bdc0
ffff88007fb9bdd8
[  304.284016] <0> ffff88007d1c0070 ffffffff8124371f ffff88007d1c0020
ffffffff812438a3
[  304.284016] Call Trace:
[  304.284016]  [<ffffffff8120f5fa>] ? device_del+0x33/0x1a0
[  304.284016]  [<ffffffff812436aa>] ? rollback_registered_many+0x135/0x19c
[  304.284016]  [<ffffffff8124371f>] ? unregister_netdevice_many+0xe/0x57
[  304.284016]  [<ffffffff812438a3>] ? default_device_exit_batch+0x92/0xa3
[  304.284016]  [<ffffffff8123e97a>] ? cleanup_net+0xfd/0x1af
[  304.284016]  [<ffffffff8105bd0d>] ? worker_thread+0x188/0x21d
[  304.284016]  [<ffffffff8123e87d>] ? cleanup_net+0x0/0x1af
[  304.284016]  [<ffffffff8105f2d2>] ? autoremove_wake_function+0x0/0x2e
[  304.284016]  [<ffffffff8105bb85>] ? worker_thread+0x0/0x21d
[  304.284016]  [<ffffffff8105ee99>] ? kthread+0x79/0x81
[  304.284016]  [<ffffffff810098e4>] ? kernel_thread_helper+0x4/0x10
[  304.284016]  [<ffffffff8105ee20>] ? kthread+0x0/0x81
[  304.284016]  [<ffffffff810098e0>] ? kernel_thread_helper+0x0/0x10
[  304.284016] Code: 48 89 fb 48 c7 c7 a0 62 67 81 e8 81 7b 0d 00 48
8b 93 a0 00 00 00 48 8b 83 a8 00 00 00 48 8d 8b a0 00 00 00 48 c7 c7
a0 62 67 81 <48> 89 42 08 48 89 10 48 89 8b a0 00 00 00 48 89 8b a8 00
00 00
[  304.284016] RIP  [<ffffffff81215f17>] device_pm_remove+0x2c/0x4f
[  304.284016]  RSP <ffff88007fb9bd60>
[  304.284016] CR2: 0000000000000008
[  304.330835] ---[ end trace f31191528d9325dc ]---


On Sun, Apr 18, 2010 at 19:33, Ben Hutchings <ben@decadent.org.uk> wrote:
> On Tue, 2010-04-13 at 13:20 +0200, Martin Ferrari wrote:
>> Package: linux-2.6
>> Version: 2.6.33-1~experimental.4
>> Severity: normal
>> Tags: experimental
>>
>> Firstly, please note that I'm running this inside a Qemu, but I
>> imagine that it should not change things much.
>>
>> I installed 2.6.33 to try out the new improvements regarding network
>> namespaces, and while creating and killing hundreds of them, I got
>> many warnings from the kernel that might indicate a bug somewhere.
>> Please see the log already included by reportbug.
>
> I'm forwarding this to the upstream maintainers.
>
> Ben.
>
> [...]
>> [ 6696.035331] ------------[ cut here ]------------
>> [ 6696.035334] WARNING: at /build/mattems-linux-2.6_2.6.33-1~experimental.4-amd64-ieqSsa/linux-2.6-2.6.33-1~experimental.4/debian/build/source_amd64_none/kernel/sysctl.c:1894 unregister_sysctl_table+0xa6/0xd1()
>> [ 6696.035336] Hardware name:
>> [ 6696.035337] Modules linked in: veth loop parport_pc parport snd_pcm tpm_tis evdev snd_timer i2c_piix4 tpm tpm_bios button processor serio_raw i2c_core snd soundcore snd_page_alloc pcspkr psmouse ext3 jbd mbcache sg sr_mod cdrom sd_mod crc_t10dif ata_generic ata_piix libata floppy 8139cp thermal thermal_sys 8139too mii scsi_mod [last unloaded: veth]
>> [ 6696.035354] Pid: 9, comm: netns Tainted: G        W  2.6.33-2-amd64 #1
>> [ 6696.035355] Call Trace:
>> [ 6696.035357]  [<ffffffff8104e303>] ? unregister_sysctl_table+0xa6/0xd1
>> [ 6696.035360]  [<ffffffff8104e303>] ? unregister_sysctl_table+0xa6/0xd1
>> [ 6696.035362]  [<ffffffff81046b81>] ? warn_slowpath_common+0x77/0xa3
>> [ 6696.035364]  [<ffffffff8104e303>] ? unregister_sysctl_table+0xa6/0xd1
>> [ 6696.035367]  [<ffffffff812b1205>] ? addrconf_ifdown+0x26f/0x2cc
>> [ 6696.035369]  [<ffffffff81247edc>] ? neigh_sysctl_unregister+0x1a/0x31
>> [ 6696.035371]  [<ffffffff812b1211>] ? addrconf_ifdown+0x27b/0x2cc
>> [ 6696.035374]  [<ffffffff812b2b0c>] ? addrconf_notify+0x714/0x7ea
>> [ 6696.035376]  [<ffffffff811eb2e7>] ? extract_entropy+0x6a/0x125
>> [ 6696.035379]  [<ffffffff81053aaf>] ? lock_timer_base+0x26/0x4b
>> [ 6696.035382]  [<ffffffff8123951c>] ? skb_dequeue+0x50/0x58
>> [ 6696.035384]  [<ffffffff812482c8>] ? pneigh_queue_purge+0x25/0x2f
>> [ 6696.035386]  [<ffffffff81249a91>] ? neigh_ifdown+0xba/0xc9
>> [ 6696.035389]  [<ffffffff81062f6d>] ? notifier_call_chain+0x29/0x4c
>> [ 6696.035392]  [<ffffffff81243662>] ? rollback_registered_many+0xed/0x19c
>> [ 6696.035394]  [<ffffffff8124371f>] ? unregister_netdevice_many+0xe/0x57
>> [ 6696.035397]  [<ffffffff812438a3>] ? default_device_exit_batch+0x92/0xa3
>> [ 6696.035399]  [<ffffffff8123e97a>] ? cleanup_net+0xfd/0x1af
>> [ 6696.035402]  [<ffffffff8105bd0d>] ? worker_thread+0x188/0x21d
>> [ 6696.035404]  [<ffffffff8123e87d>] ? cleanup_net+0x0/0x1af
>> [ 6696.035406]  [<ffffffff8105f2d2>] ? autoremove_wake_function+0x0/0x2e
>> [ 6696.035409]  [<ffffffff8105bb85>] ? worker_thread+0x0/0x21d
>> [ 6696.035411]  [<ffffffff8105ee99>] ? kthread+0x79/0x81
>> [ 6696.035414]  [<ffffffff810098e4>] ? kernel_thread_helper+0x4/0x10
>> [ 6696.035416]  [<ffffffff8105ee20>] ? kthread+0x0/0x81
>> [ 6696.035418]  [<ffffffff810098e0>] ? kernel_thread_helper+0x0/0x10
>> [ 6696.035419] ---[ end trace ef7b93cb006e989e ]---
> [...]
>
> --
> Ben Hutchings
> Once a job is fouled up, anything done to improve it makes it worse.
>



-- 
Martín Ferrari

^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Chris Wright @ 2010-04-21 16:18 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Scott Feldman, davem, netdev, chrisw
In-Reply-To: <201004211326.00974.arnd@arndb.de>

* Arnd Bergmann (arnd@arndb.de) wrote:
> On Tuesday 20 April 2010, Scott Feldman wrote:
> > On 4/20/10 6:48 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
> > > Also, do you expect your interface to be supported by dcbd/lldpad,
> > > or is there a good reason to create a new tool for iovnl?
> > 
> > Lldpad supporting this interface would seem right, for those cases where
> > lldpad is responsible for configuring the netdev.
> 
> I believe we meant different things here, because I misunderstood the
> intention of the code. My question was whether lldpad would send the
> netlink messages to iovnl, but from what you and Chris write, the
> real idea was that both lldpad and kernel/iovnl can receive the
> same messages, right?

Correct.  An example set of steps for initiating host to switch
negotiation and subsequently launching a VM would be (expect user below
to be a mgmt tool like libvirt):

1) user sends netlink message w/ relevant host interface and port profile id
2) recipient picks this up (enic, lldpad, whatever)
3) recipient does negotiation w/ adjacent switch
4) user creates macvtap associated w/ relevant host interface
5) user launches guest

> > >> + * @IOV_ATTR_CLIENT_NAME: client name (NLA_NUL_STRING)
> > >> + * @IOV_ATTR_HOST_UUID: host UUID (NLA_NUL_STRING)
> > > 
> > > Can you elaborate more on what these do? Who is the 'client' and the 'host'
> > > in this case, and why do you need to identify them?
> > 
> > Those are optional and useful, for example, by the network mgmt tool for
> > presenting a view such as:
> > 
> >     - blade 1/2                     // know by host uuid
> >         - vm-rhel5-eth0             // client name
> >             - port-profile: xyz
> > 
> > Something like that.
> 
> Hmm, but how do they get from the device driver to the the network
> management tool then? Also, these are similar to the attributes
> that are passed in the 802.1Qbg VDP protocol, but not compatible.
> If the idea is use the same netlink protocol for both your internal
> representation and for the standard based protocol, I think we should
> make them compatible.

Indeed, that's my expectation.

> Instead of a string identifying the port profile, this needs to pass
> a four byte field for a VSI type (3 bytes) and VSI manager ID (1 byte).

I think we just need a u8 array, 4 bytes for VDP, some maxlen that is
at least as large as enic expects.

> There is also a UUID in VDP, but it identifies the guest, not the host,
> so this is really confusing.

Yes, I had same confusion.  I expected guest, enic wants to send host as
well.

> VDP also needs a list of MAC addresses and VLAN IDs (normally only
> one of each), but that would be separate from what you tell the adapter,
> see below: 
> 
> > >> + * @IOV_ATTR_MAC_ADDR: device station MAC address (NLA_U8[6])
> > > 
> > > Just one mac address? What happens if we want to assign multiple mac
> > > addresses to the VF later? Also, how is this defined specifically?
> > > Will a SIOCSIFHWADDR with a different MAC address on the VF fail
> > > later, or is this just the default value?
> > 
> > Depends on how the VF wants to handle this.  For our use-case with enic we
> > only need the port-profile op so I'm not sure what the best design is for
> > mac+vlan on a VF.  Looking for advise from folks like yourself.  If it's not
> > needed, let's scratch it.
> 
> In order to make VEPA work, it's absolutely required to impose a hard limit
> on what MAC+VLAN IDs are visible to the VF, because the switch identifies
> the guest by those and forwards any frames to/from that address according
> to the VSI type.
> 
> However, I feel that we should strictly separate the steps of configuring
> the adapter from talking to the switch. When we do the VDP association
> in user land, we still need to set up the VLAN and MAC configuration for
> the VF through a kernel interface. If we ignore the port profile stuff
> for a moment, your netlink interface looks like a good fit for that.
> 
> Since it seems what you really want to do is to do the exchange with the
> switch from here, maybe the hardware configuration part should be moved
> the DCB interface?

I suppose this would work  (although it's a bit odd being out of scope
of DCB spec).  I don't expect mgmt app to care about the implementation
specifics of an adapter, so it will always send this and iovnl message
too.  All as part of same setup.

thanks,
-chris

^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Scott Feldman @ 2010-04-21 16:28 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Chris Wright, davem, netdev
In-Reply-To: <201004211517.58062.arnd@arndb.de>

On 4/21/10 6:17 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:

> On Tuesday 20 April 2010, Scott Feldman wrote:
>> On 4/20/10 9:19 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
>> 
>>>> In the case of devices that can do adjacent switch negotiations directly.
>>> 
>>> I thought the idea to deal with those devices was to beat sense into
>>> the respective developers until they do the negotiation in software 8-)
>> 
>> When the device can do the negotiation directly with the switch, why does it
>> make sense to bypass that and use software on the host?  I don't think we'd
>> want to give up on link speed/duplex auto-negotiation and punt those setting
>> back to the user/host like in the old days.
> 
> For the link negotiation, the card is the right place because it's necessary
> to get the link working before the OS can talk to the switch.
> For VDP, that's different because the hypervisor needs to talk to the switch
> before the guest can communicate, so there is no interdependency.
> 
> More importantly, the card cannot possibly do the protocol by itself,
> because the information that gets exchanged is specific to the hypervisor and
> the guest, not to the hardware. What you have implemented is another protocol
> between the hypervisor and the NIC that exchanges the exact same data that
> then gets sent to the switch. We already need to have an implementation that
> sends this data to the switch from user space for all cards that don't do
> it in firmware, so doing an alternative path in the adapter really creates
> more work for the users, and means that when we fix bugs or add features
> to the common code, you don't get them ;-).

But the point of iovnl was to provide a single mechanism for both types of
adapters (w/ or w/o firmware assist) to exchange this data with the switch,
therefore making the difference in the adapters transparent to the user.  So
I'm missing your point about more work for the users.

-scott


^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: George B. @ 2010-04-21 16:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Evgeniy Polyakov, Ben Greear, David Miller, Gaspar Chilingarov,
	netdev
In-Reply-To: <1271849253.7895.1929.camel@edumazet-laptop>

On Wed, Apr 21, 2010 at 4:27 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Here is the patch I use now and my test application is now able to open
> and connect 1000000 sockets (ulimit -n 1000000)

I believe we hit this very yesterday in our test lab.  We had a stress
test running of one of our applications with about a dozen instances
of it running on the box.  Suddenly dns requests began failing with
the complaint that it couldn't make a request out because there were
no sockets.

root@champagne:/proc/sys/net/ipv4> host gh
host: isc_socket_bind: address in use

Netstat showed 61580 total sockets (UDP and TCP) on the address being
used by the above dns request. (local port range 1025 65535).  That
dns request should not have been failing.

I noticed that the number of UDP sockets was close to the maximum
allowed by the port range, but they were across different IP
addresses, no one IP address had too many and there should have been
available ports on all IP addresses.

Further, the number of udp sockets in use seemed to hit the wall at a
little above 64,000 and I never got above that number.

If that is the normal behavior of the kernel, it could be a big
problem for scaling the application.

^ permalink raw reply

* Re: 2.6.34-rc5: Reported regressions from 2.6.33
From: Nick Bowler @ 2010-04-21 16:57 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Rafael J. Wysocki, DRI, Linux SCSI List, Network Development,
	Linux Wireless List, Linux Kernel Mailing List, Linux ACPI,
	Andrew Morton, Kernel Testers List, Linus Torvalds, Linux PM List,
	Maciej Rutecki
In-Reply-To: <20100421085739.GA2576@barney.localdomain>

On Wed, Apr 21, 2010 at 07:15:38AM +0200, Rafael J. Wysocki wrote:
> On Tuesday 20 April 2010, Nick Bowler wrote:
> > Please list these two similar regressions from 2.6.33 in the r600 DRM:
> > 
> >  * r600 CS checker rejects GL_DEPTH_TEST w/o depth buffer:
> >            https://bugs.freedesktop.org/show_bug.cgi?id=27571
> > 
> >  * r600 CS checker rejects narrow FBO renderbuffers:
> >            https://bugs.freedesktop.org/show_bug.cgi?id=27609
> 
> Do you want to me to add them as one entry or as two separate bugs?

As upstream doesn't consider the first to be a kernel issue, I guess you
should just list the second.

On 10:57 Wed 21 Apr     , Jerome Glisse wrote:
> First one is userspace bug, i need to look into the second one.
> ie we were lucky the hw didn't lockup without depth buffer and
> depth test enabled.

OK, if the failure is due to userspace is doing Very Bad Things(tm),
catching that seems reasonable.

Nevertheless, even if it happened by luck, the result was (ostensibly)
working programs that suddenly break once one "upgrades" to the latest
kernel.  If userspace can't be fixed before 2.6.34 is released, perhaps
a less cryptic log message would be appropriate?

-- 
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

^ permalink raw reply

* [PATCH] Socket filter access to hatype
From: Paul LeoNerd Evans @ 2010-04-21 17:25 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1829 bytes --]

When capturing packets on a PF_PACKET/SOCK_RAW socket bound to all
interfaces, there doesn't appear to be a way for the filter program to
actually find out the underlying hardware type the packet was captured
on, such as is reported by the sll_hatype field of the struct sockaddr_ll
when the packet is sent up to userland.

Unless I've managed to miss a trick somewhere, this would seem to put a
fairly fundamental blocker on actually being able to filter in such
packets. Granted there's the SKF_OFF_NET area to inspect at the e.g. IPv4
level, but this makes it impossible to do anything on e.g. the Ethernet
level.

See below for a patch to add an SKF_AD_HATYPE field, up among the other
special access fields around SKF_AD_OFF.


diff -ur linux-2.6.33.2.orig/include/linux/filter.h linux-2.6.33.2/include/linux/filter.h
--- linux-2.6.33.2.orig/include/linux/filter.h	2010-04-02 00:02:33.000000000 +0100
+++ linux-2.6.33.2/include/linux/filter.h	2010-04-20 22:40:25.000000000 +0100
@@ -123,7 +123,8 @@
 #define SKF_AD_NLATTR_NEST	16
 #define SKF_AD_MARK 	20
 #define SKF_AD_QUEUE	24
-#define SKF_AD_MAX	28
+#define SKF_AD_HATYPE	28
+#define SKF_AD_MAX	32
 #define SKF_NET_OFF   (-0x100000)
 #define SKF_LL_OFF    (-0x200000)
 
diff -ur linux-2.6.33.2.orig/net/core/filter.c linux-2.6.33.2/net/core/filter.c
--- linux-2.6.33.2.orig/net/core/filter.c	2010-04-02 00:02:33.000000000 +0100
+++ linux-2.6.33.2/net/core/filter.c	2010-04-20 22:41:01.000000000 +0100
@@ -309,6 +309,9 @@
 		case SKF_AD_QUEUE:
 			A = skb->queue_mapping;
 			continue;
+		case SKF_AD_HATYPE:
+			A = skb->dev->type;
+			continue;
 		case SKF_AD_NLATTR: {
 			struct nlattr *nla;
 

-- 
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Arnd Bergmann @ 2010-04-21 17:52 UTC (permalink / raw)
  To: Chris Wright; +Cc: Scott Feldman, davem, netdev
In-Reply-To: <20100421161842.GB25928@x200.localdomain>

On Wednesday 21 April 2010, Chris Wright wrote:
> * Arnd Bergmann (arnd@arndb.de) wrote:
> > On Tuesday 20 April 2010, Scott Feldman wrote:
> > I believe we meant different things here, because I misunderstood the
> > intention of the code. My question was whether lldpad would send the
> > netlink messages to iovnl, but from what you and Chris write, the
> > real idea was that both lldpad and kernel/iovnl can receive the
> > same messages, right?
> 
> Correct.  An example set of steps for initiating host to switch
> negotiation and subsequently launching a VM would be (expect user below
> to be a mgmt tool like libvirt):
> 
> 1) user sends netlink message w/ relevant host interface and port profile id
> 2) recipient picks this up (enic, lldpad, whatever)
> 3) recipient does negotiation w/ adjacent switch
> 4) user creates macvtap associated w/ relevant host interface
> 5) user launches guest

I'd move point 4 before 1, but otherwise it makes sense and it would still
work either way.

> > If the idea is use the same netlink protocol for both your internal
> > representation and for the standard based protocol, I think we should
> > make them compatible.
> 
> Indeed, that's my expectation.
>
> [...]
>
> > Instead of a string identifying the port profile, this needs to pass
> > a four byte field for a VSI type (3 bytes) and VSI manager ID (1 byte).
> 
> I think we just need a u8 array, 4 bytes for VDP, some maxlen that is
> at least as large as enic expects.
> 
> > There is also a UUID in VDP, but it identifies the guest, not the host,
> > so this is really confusing.
> 
> Yes, I had same confusion.  I expected guest, enic wants to send host as
> well.

So given all these differences, how compatible can we make them?

With the current definition, most of fields are at least slightly
different. The differences seem to stem mostly from the fact that
Cisco switches use a nonstandard protocol, rather than the difference
between the firmware and userland implementations of the protocol,
and of course we shouldn't confuse the two.

> > In order to make VEPA work, it's absolutely required to impose a hard limit
> > on what MAC+VLAN IDs are visible to the VF, because the switch identifies
> > the guest by those and forwards any frames to/from that address according
> > to the VSI type.
> > 
> > However, I feel that we should strictly separate the steps of configuring
> > the adapter from talking to the switch. When we do the VDP association
> > in user land, we still need to set up the VLAN and MAC configuration for
> > the VF through a kernel interface. If we ignore the port profile stuff
> > for a moment, your netlink interface looks like a good fit for that.
> > 
> > Since it seems what you really want to do is to do the exchange with the
> > switch from here, maybe the hardware configuration part should be moved
> > the DCB interface?
> 
> I suppose this would work  (although it's a bit odd being out of scope
> of DCB spec).

It could be anywhere, it doesn't have to be the DCB interface, but could
be anything ranging from ethtool to iplink I guess. And we should define
it in a way that works for any SR-IOV card, whether it's using Cisco's
protocol in firmware, 802.1Qbg VDP in firmware, lldpad to do VDP or
none of the above and just provides an internal switch like all the
existing NICs.

> I don't expect mgmt app to care about the implementation
> specifics of an adapter, so it will always send this and iovnl message
> too.  All as part of same setup.

Why? I really see these things as separate. Obviously a management
tool like libvirt would need to do both these things eventually, but
each of them has multiple options that can be combined in various
ways:

1. Setting up the slave device
 a) create an SR-IOV VF to assign to a guest
 b) create a macvtap device to pass to qemu or vhost
 c) attach a tap device to a bridge
 d) create a macvlan device and put it into a container
 e) create a virtual interface for a VMDq adapter

2) Registering the slave with the switch
 a) use Cisco protocol in enic firmware (see patch 2/2)
 b) use standard VDP in lldpad
 c) use reverse-engineered cisco protocol in some user tool for
    non-enic adapters.
 d) use standard VDP in firmware (hopefully this never happens)
 e) do nothing at all (as we do today)

Some of the cases can be treated identically, e.g. 1d) and 1e), or
2a) and 2c), but in general the management app needs to have some
idea of which combination it's going to set up.

	Arnd

^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Arnd Bergmann @ 2010-04-21 18:04 UTC (permalink / raw)
  To: Scott Feldman; +Cc: Chris Wright, davem, netdev
In-Reply-To: <C7F475A7.2A1D9%scofeldm@cisco.com>

On Wednesday 21 April 2010, Scott Feldman wrote:
> On 4/21/10 6:17 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
> > More importantly, the card cannot possibly do the protocol by itself,
> > because the information that gets exchanged is specific to the hypervisor and
> > the guest, not to the hardware. What you have implemented is another protocol
> > between the hypervisor and the NIC that exchanges the exact same data that
> > then gets sent to the switch. We already need to have an implementation that
> > sends this data to the switch from user space for all cards that don't do
> > it in firmware, so doing an alternative path in the adapter really creates
> > more work for the users, and means that when we fix bugs or add features
> > to the common code, you don't get them ;-).
> 
> But the point of iovnl was to provide a single mechanism for both types of
> adapters (w/ or w/o firmware assist) to exchange this data with the switch,
> therefore making the difference in the adapters transparent to the user.  So
> I'm missing your point about more work for the users.

It creates an extra step: Normally we'd simply implement the network protocol
in user space, e.g. in lldpad and have other code use the lldptool command
line interface to start the negotiation.

Now we have a user protocol based on netlink that is about as complex as the
wire protocol itself, at least if you want to implement both the standard
VDP and the Cisco variant, and do all the interesting parts like guest migration
and synchronously waiting for the negotiation to complete.

	Arnd

^ permalink raw reply

* cxgb4: Use ntohs() on __be16 value instead of htons()
From: Roland Dreier @ 2010-04-21 18:09 UTC (permalink / raw)
  To: Dimitris Michailidis, David S. Miller; +Cc: netdev

Use the correct direction of byte-swapping function to fix a mistake
shown by sparse endianness checking -- c.fl0id is __be16.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
---
 drivers/net/cxgb4/sge.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/cxgb4/sge.c b/drivers/net/cxgb4/sge.c
index 14adc58..70bf2b2 100644
--- a/drivers/net/cxgb4/sge.c
+++ b/drivers/net/cxgb4/sge.c
@@ -2047,7 +2047,7 @@ int t4_sge_alloc_rxq(struct adapter *adap, struct sge_rspq *iq, bool fwevtq,
 	adap->sge.ingr_map[iq->cntxt_id] = iq;
 
 	if (fl) {
-		fl->cntxt_id = htons(c.fl0id);
+		fl->cntxt_id = ntohs(c.fl0id);
 		fl->avail = fl->pend_cred = 0;
 		fl->pidx = fl->cidx = 0;
 		fl->alloc_failed = fl->large_alloc_failed = fl->starving = 0;

-- 
Roland Dreier <rolandd@cisco.com> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply related

* Re: [net-next,1/2] add iovnl netlink support
From: Chris Wright @ 2010-04-21 18:10 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Chris Wright, Scott Feldman, davem, netdev
In-Reply-To: <201004211952.29145.arnd@arndb.de>

* Arnd Bergmann (arnd@arndb.de) wrote:
> On Wednesday 21 April 2010, Chris Wright wrote:
> > * Arnd Bergmann (arnd@arndb.de) wrote:
> > > Since it seems what you really want to do is to do the exchange with the
> > > switch from here, maybe the hardware configuration part should be moved
> > > the DCB interface?
> > 
> > I suppose this would work  (although it's a bit odd being out of scope
> > of DCB spec).
> 
> It could be anywhere, it doesn't have to be the DCB interface, but could
> be anything ranging from ethtool to iplink I guess. And we should define
> it in a way that works for any SR-IOV card, whether it's using Cisco's
> protocol in firmware, 802.1Qbg VDP in firmware, lldpad to do VDP or
> none of the above and just provides an internal switch like all the
> existing NICs.

Heh, that's exactly what iovnl does ;-)

> > I don't expect mgmt app to care about the implementation
> > specifics of an adapter, so it will always send this and iovnl message
> > too.  All as part of same setup.
> 
> Why? I really see these things as separate. Obviously a management
> tool like libvirt would need to do both these things eventually, but
> each of them has multiple options that can be combined in various
> ways:
> 
> 1. Setting up the slave device
>  a) create an SR-IOV VF to assign to a guest
>  b) create a macvtap device to pass to qemu or vhost
>  c) attach a tap device to a bridge
>  d) create a macvlan device and put it into a container
>  e) create a virtual interface for a VMDq adapter

OK, but iovnl isn't doing this.

> 2) Registering the slave with the switch
>  a) use Cisco protocol in enic firmware (see patch 2/2)
>  b) use standard VDP in lldpad
>  c) use reverse-engineered cisco protocol in some user tool for
>     non-enic adapters.
>  d) use standard VDP in firmware (hopefully this never happens)
>  e) do nothing at all (as we do today)

And this is the step that is the main purpose of iovnl.

Here's the simplest snippet of libvirt to show this.  It sends
set_port_profile netlink messages and then creates macvtap.  As simple
as it gets...

--- a/src/qemu/qemu_conf.c
+++ b/src/qemu/qemu_conf.c
@@ -1470,6 +1470,11 @@ qemudPhysIfaceConnect(virConnectPtr conn,
         net->model && STREQ(net->model, "virtio"))
         vnet_hdr = 1;
 
+    setPortProfileId(net->data.direct.linkdev,
+                      net->data.direct.mode,
+                      net->data.direct.profileid,
+                      net->mac);
+
     rc = openMacvtapTap(net->ifname, net->mac, linkdev, brmode,
                         &res_ifname, vnet_hdr);

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Evgeniy Polyakov @ 2010-04-21 18:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <1271849253.7895.1929.camel@edumazet-laptop>

On Wed, Apr 21, 2010 at 01:27:33PM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> Here is the patch I use now and my test application is now able to open
> and connect 1000000 sockets (ulimit -n 1000000)
> 
> Trick is bind_conflict() must refuse a socket to bind to a port on a non
> null IP if another socket already uses same port on same IP.
> 
> Plus the previous patch sent (check a conflict before exiting the search
> loop)
> 
> What do you think ?

Looks good, but do we want to check only reused socket's address there?
What if one of the sockets does not have reuse option turned on, will it
break?

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Eric Dumazet @ 2010-04-21 18:43 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <20100421182723.GA17202@ioremap.net>

Le mercredi 21 avril 2010 à 22:27 +0400, Evgeniy Polyakov a écrit :
> On Wed, Apr 21, 2010 at 01:27:33PM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> > Here is the patch I use now and my test application is now able to open
> > and connect 1000000 sockets (ulimit -n 1000000)
> > 
> > Trick is bind_conflict() must refuse a socket to bind to a port on a non
> > null IP if another socket already uses same port on same IP.
> > 
> > Plus the previous patch sent (check a conflict before exiting the search
> > loop)
> > 
> > What do you think ?
> 
> Looks good, but do we want to check only reused socket's address there?
> What if one of the sockets does not have reuse option turned on, will it
> break?
> 

Well, if one socket doesnt have reuse option turned on, the previous
test already works ?

if (!reuse || !sk2->sk_reuse || sk2->sk_state == TCP_LISTEN) {
	if (!sk2_rcv_saddr || !sk_rcv_saddr ||
	    sk2_rcv_saddr == sk_rcv_saddr)
		break;
} else if (reuse && sk2->sk_reuse &&
           sk2_rcv_saddr &&
           sk2_rcv_saddr == sk_rcv_saddr)
	break;

I failed to factorize this complex test :(




^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Evgeniy Polyakov @ 2010-04-21 18:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Greear, David Miller, Gaspar Chilingarov, netdev
In-Reply-To: <1271875416.7895.3033.camel@edumazet-laptop>

On Wed, Apr 21, 2010 at 08:43:36PM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> Le mercredi 21 avril 2010 à 22:27 +0400, Evgeniy Polyakov a écrit :
> > On Wed, Apr 21, 2010 at 01:27:33PM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> > > Here is the patch I use now and my test application is now able to open
> > > and connect 1000000 sockets (ulimit -n 1000000)
> > > 
> > > Trick is bind_conflict() must refuse a socket to bind to a port on a non
> > > null IP if another socket already uses same port on same IP.
> > > 
> > > Plus the previous patch sent (check a conflict before exiting the search
> > > loop)
> > > 
> > > What do you think ?
> > 
> > Looks good, but do we want to check only reused socket's address there?
> > What if one of the sockets does not have reuse option turned on, will it
> > break?
> > 
> 
> Well, if one socket doesnt have reuse option turned on, the previous
> test already works ?
> 
> if (!reuse || !sk2->sk_reuse || sk2->sk_state == TCP_LISTEN) {
> 	if (!sk2_rcv_saddr || !sk_rcv_saddr ||
> 	    sk2_rcv_saddr == sk_rcv_saddr)
> 		break;
> } else if (reuse && sk2->sk_reuse &&
>            sk2_rcv_saddr &&
>            sk2_rcv_saddr == sk_rcv_saddr)
> 	break;
> 
> I failed to factorize this complex test :(

Damn it, I tried multiple times :)
You are right of course!

-- 
	Evgeniy Polyakov

^ permalink raw reply

* cxgb4: Make unnecessarily global functions static
From: Roland Dreier @ 2010-04-21 18:59 UTC (permalink / raw)
  To: Dimitris Michailidis, David S. Miller; +Cc: netdev

Also put t4_write_indirect() inside "#if 0" to avoid a "defined but not
used" compile warning.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
---
 drivers/net/cxgb4/cxgb4.h |    2 --
 drivers/net/cxgb4/sge.c   |    2 +-
 drivers/net/cxgb4/t4_hw.c |   22 ++++++++++++----------
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/net/cxgb4/cxgb4.h b/drivers/net/cxgb4/cxgb4.h
index 3d8ff48..4b35dc7 100644
--- a/drivers/net/cxgb4/cxgb4.h
+++ b/drivers/net/cxgb4/cxgb4.h
@@ -651,8 +651,6 @@ int t4_link_start(struct adapter *adap, unsigned int mbox, unsigned int port,
 		  struct link_config *lc);
 int t4_restart_aneg(struct adapter *adap, unsigned int mbox, unsigned int port);
 int t4_seeprom_wp(struct adapter *adapter, bool enable);
-int t4_read_flash(struct adapter *adapter, unsigned int addr,
-		  unsigned int nwords, u32 *data, int byte_oriented);
 int t4_load_fw(struct adapter *adapter, const u8 *fw_data, unsigned int size);
 int t4_check_fw_version(struct adapter *adapter);
 int t4_prep_adapter(struct adapter *adapter);
diff --git a/drivers/net/cxgb4/sge.c b/drivers/net/cxgb4/sge.c
index 14adc58..04e5710 100644
--- a/drivers/net/cxgb4/sge.c
+++ b/drivers/net/cxgb4/sge.c
@@ -1471,7 +1471,7 @@ EXPORT_SYMBOL(cxgb4_pktgl_to_skb);
  *	Releases the pages of a packet gather list.  We do not own the last
  *	page on the list and do not free it.
  */
-void t4_pktgl_free(const struct pkt_gl *gl)
+static void t4_pktgl_free(const struct pkt_gl *gl)
 {
 	int n;
 	const skb_frag_t *p;
diff --git a/drivers/net/cxgb4/t4_hw.c b/drivers/net/cxgb4/t4_hw.c
index a814a3a..cadead5 100644
--- a/drivers/net/cxgb4/t4_hw.c
+++ b/drivers/net/cxgb4/t4_hw.c
@@ -53,8 +53,8 @@
  *	at the time it indicated completion is stored there.  Returns 0 if the
  *	operation completes and	-EAGAIN	otherwise.
  */
-int t4_wait_op_done_val(struct adapter *adapter, int reg, u32 mask,
-			int polarity, int attempts, int delay, u32 *valp)
+static int t4_wait_op_done_val(struct adapter *adapter, int reg, u32 mask,
+			       int polarity, int attempts, int delay, u32 *valp)
 {
 	while (1) {
 		u32 val = t4_read_reg(adapter, reg);
@@ -109,9 +109,9 @@ void t4_set_reg_field(struct adapter *adapter, unsigned int addr, u32 mask,
  *	Reads registers that are accessed indirectly through an address/data
  *	register pair.
  */
-void t4_read_indirect(struct adapter *adap, unsigned int addr_reg,
-		      unsigned int data_reg, u32 *vals, unsigned int nregs,
-		      unsigned int start_idx)
+static void t4_read_indirect(struct adapter *adap, unsigned int addr_reg,
+			     unsigned int data_reg, u32 *vals,
+			     unsigned int nregs, unsigned int start_idx)
 {
 	while (nregs--) {
 		t4_write_reg(adap, addr_reg, start_idx);
@@ -120,6 +120,7 @@ void t4_read_indirect(struct adapter *adap, unsigned int addr_reg,
 	}
 }
 
+#if 0
 /**
  *	t4_write_indirect - write indirectly addressed registers
  *	@adap: the adapter
@@ -132,15 +133,16 @@ void t4_read_indirect(struct adapter *adap, unsigned int addr_reg,
  *	Writes a sequential block of registers that are accessed indirectly
  *	through an address/data register pair.
  */
-void t4_write_indirect(struct adapter *adap, unsigned int addr_reg,
-		       unsigned int data_reg, const u32 *vals,
-		       unsigned int nregs, unsigned int start_idx)
+static void t4_write_indirect(struct adapter *adap, unsigned int addr_reg,
+			      unsigned int data_reg, const u32 *vals,
+			      unsigned int nregs, unsigned int start_idx)
 {
 	while (nregs--) {
 		t4_write_reg(adap, addr_reg, start_idx++);
 		t4_write_reg(adap, data_reg, *vals++);
 	}
 }
+#endif
 
 /*
  * Get the reply to a mailbox command and store it in @rpl in big-endian order.
@@ -537,8 +539,8 @@ static int flash_wait_op(struct adapter *adapter, int attempts, int delay)
  *	(i.e., big-endian), otherwise as 32-bit words in the platform's
  *	natural endianess.
  */
-int t4_read_flash(struct adapter *adapter, unsigned int addr,
-		  unsigned int nwords, u32 *data, int byte_oriented)
+static int t4_read_flash(struct adapter *adapter, unsigned int addr,
+			 unsigned int nwords, u32 *data, int byte_oriented)
 {
 	int ret;
 

-- 
Roland Dreier <rolandd@cisco.com> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-21 19:01 UTC (permalink / raw)
  To: hadi; +Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271853570.4032.21.camel@bigi>

Le mercredi 21 avril 2010 à 08:39 -0400, jamal a écrit :
> On Tue, 2010-04-20 at 15:13 +0200, Eric Dumazet wrote:
> 
> 
> > I think your tests are very interesting, maybe could you publish them
> > somehow ? (I forgot to thank you about the previous report and nice
> > graph)
> > perf reports would be good too to help to spot hot points.
> 
> Ok ;->
> Let me explain my test setup (which some app types may gasp at;->):
> 
> SUT(system under test) was a nehalem single processor (4 cores, 2 SMT
> threads per core). 
> SUT runs a udp sink server i wrote (with apologies to Rick Jones[1])
> which forks at most a process per detected cpu and binds to a different
> udp port on each processor.
> Traffic generator sent to SUT upto 750Kpps of udp packets round-robbin
> and varied the destination port to select a different flow on each of
> the outgoing packets. I could further increment the number of flows by
> varying the source address and source port number but in the end i 
> settled down to fixed srcip/srcport/destinationip and just varied the
> port number in order to simplify results collection.
> For rps i selected mask "ee" and bound interrupt to cpu0. ee leaves
> out cpu0 and cpu4 from the set of target cpus. Because Nehalem has SMT
> threads, cpu0 and cpu4 are SMT threads that reside on core0 and they
> steal execution cycles from each other - so i didnt want that to happen
> and instead tried to have as many of those cycles as possible for
> demuxing incoming packets.
> 
> Overall, in best case scenario rps had 5-7% better throughput than
> nonrps setup. It had upto 10% more cpu use and about 2-5% more latency.
> I am attaching some visualization of the way 8 flows were distributed
> around the different cpus. The diagrams show some samples - but what you
> see there was a good reflection of what i saw in many runs of the tests.
> Essentially, for localization is better with rps which gets better if
> you can somehow map the target cpus as selected by rps to what the app
> binds to.
> Ive also attached a small annotated perf output - sorry i didnt have
> time to dig deeper into the code; maybe later this week. I think my
> biggest problem in this setup was the sky2 driver or hardware poor
> ability to handle lots of traffic.
> 
> 
> cheers,
> jamal
> 
> [1] I want to hump on the SUT with tons of traffic and count packets;
> too complex to do with netperf

Thanks a lot Jamal, this is really useful

Drawback of using a fixed src ip from your generator is that all flows
share the same struct dst entry on SUT. This might explain some glitches
you noticed (ip_route_input + ip_rcv at high level on slave/application
cpus)
Also note your test is one way. If some data was replied we would see
much use of the 'flows'

I notice epoll_ctl() used a lot, are you re-arming epoll each time you
receive a datagram ?

I see slave/application cpus hit _raw_spin_lock_irqsave() and  
_raw_spin_unlock_irqrestore().

Maybe a ring buffer could help (instead of a double linked queue) for
backlog, or the double queue trick, if Changli wants to respin his
patch.






^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox