Netdev List


* Re: [PATCH 0/2] PCI runtime power management for r8169 and e1000e
From: Markus Feldmann @ 2010-04-22 20:19 UTC
  To: netdev
In-Reply-To: <201003150131.59619.rjw@sisk.pl>

Rafael J. Wysocki wrote:
> Hi,
> 
> The following two patches add basic PCI runtime power management to the r8169
> and e1000e drivers.
> 
> It works so that the adapter is put into PCI D3 if there's no network cable
> attached to it and back into PCI D0 once the cable has been detected.
> 
> The feature is disabled by default and it can be enabled by writing "auto" to
> the device's /sys/devices/.../power/control file.  Writing "on" to this file
> disables the feature again.
> 
> The patches have been tested on MSI Wind U100 (r8169) and Toshiba Portege
> R500 (e1000e).
> 
> Thanks,
> Rafael
> 
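
The sysfs toggle described in the quoted text can be sketched from user space as follows. This is only a minimal illustration, assuming a kernel with these patches applied; the helper name set_runtime_pm_policy() is hypothetical, and the caller must supply the full path to the device's power/control file, because the quoted text elides the device part of that path.

```c
#include <stdio.h>
#include <string.h>

/*
 * Write a runtime-PM policy ("auto" or "on") to a device's power/control
 * file. control_path is supplied by the caller; nothing is hard-coded here
 * because the device part of the path is elided in the description.
 * Returns 0 on success, -1 on any error.
 */
static int set_runtime_pm_policy(const char *control_path, const char *policy)
{
	FILE *f;
	int ok;

	/* Only the two values named in the quoted description are accepted. */
	if (strcmp(policy, "auto") != 0 && strcmp(policy, "on") != 0)
		return -1;

	f = fopen(control_path, "w");
	if (!f)
		return -1;

	ok = (fputs(policy, f) >= 0);
	/* fclose() flushes; a failed flush means the write did not land. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

Passing "auto" enables the feature for that device and passing "on" disables it again, matching the behaviour described in the quoted text.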

Hi Rafael,

Which kernel do I need for this feature? I have a mini-ITX board
combined with a daughterboard that adds three RJ45 ports, so I have four
network devices:
...
01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. 
RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)
04:04.0 Ethernet controller: Realtek Semiconductor Co., Ltd. 
RTL-8110SC/8169SC Gigabit Ethernet (rev 10)
04:06.0 Ethernet controller: Realtek Semiconductor Co., Ltd. 
RTL-8110SC/8169SC Gigabit Ethernet (rev 10)
04:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. 
RTL-8110SC/8169SC Gigabit Ethernet (rev 10)

Not all of them are in use at the same time. So how can I put the unused
ones into a low-power state?

regards Markus


* Re: Subject: re-submit4 [ANNOUNCEMENT] NET: usb: sierra_net.c driver
From: Greg KH @ 2010-04-22 19:44 UTC
  To: Elina Pasheva; +Cc: dbrownell, davem, rfiler, linux-usb, netdev
In-Reply-To: <1271963973.10035.6.camel@Linuxdev4-laptop>

On Thu, Apr 22, 2010 at 12:19:33PM -0700, Elina Pasheva wrote:
> +static void sierra_net_send_cmd(struct usbnet *dev,
> +		u8 *cmd, int cmdlen, const char * cmd_name)
> +{
> +	struct sierra_net_data *priv = sierra_net_get_private(dev);
> +	int  status;
> +
> +	status = usb_control_msg(dev->udev, usb_sndctrlpipe(dev->udev, 0),
> +			USB_CDC_SEND_ENCAPSULATED_COMMAND,
> +			USB_DIR_OUT|USB_TYPE_CLASS|USB_RECIP_INTERFACE,	0,
> +			priv->ifnum, cmd, cmdlen, 0);

No timeout?

> +		ifnum = priv->ifnum;
> +		len = usb_control_msg(dev->udev, usb_rcvctrlpipe(dev->udev, 0),
> +				USB_CDC_GET_ENCAPSULATED_RESPONSE,
> +				USB_DIR_IN|USB_TYPE_CLASS|USB_RECIP_INTERFACE,
> +				0, ifnum, buf, SIERRA_NET_USBCTL_BUF_LEN, 0);

No timeout?

> +		if (unlikely(len < 0)) {
> +			netdev_err(dev->net,
> +				"usb_control_msg failed, status %d\n", len);

You don't need "unlikely", this is an extremely slow path here.
Also, what happens for a "short read"?  You don't handle that properly.
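
For reference, the last argument of usb_control_msg() is a timeout in milliseconds, and a value of 0 means the call waits forever. The short-read remark can be illustrated with a small user-space sketch that separates a failed transfer from one that returned fewer bytes than the caller needs. The enum and the helper classify_ctrl_read() are hypothetical names, not part of the patch or of the USB core, and min_len is whatever the caller considers the smallest usable reply (the patch's 4-byte HIP header would be one candidate).

```c
#include <stddef.h>

enum ctrl_read_result {
	CTRL_READ_OK = 0,
	CTRL_READ_ERROR,	/* negative status from the transfer */
	CTRL_READ_SHORT,	/* fewer bytes than the caller requires */
};

/*
 * Classify the return value of a control read. len is the value returned
 * by the transfer (negative on error, otherwise the byte count), min_len
 * is the smallest reply the caller can make sense of.
 */
static enum ctrl_read_result classify_ctrl_read(int len, size_t min_len)
{
	if (len < 0)
		return CTRL_READ_ERROR;
	if ((size_t)len < min_len)
		return CTRL_READ_SHORT;
	return CTRL_READ_OK;
}
```

Keeping the two failure cases apart lets the caller log a short reply distinctly instead of passing a truncated buffer on to the parser.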

thanks,

greg k-h


* Re: [net-next,1/2] add iovnl netlink support
From: Arnd Bergmann @ 2010-04-22 19:36 UTC
  To: Chris Wright; +Cc: Scott Feldman, davem, netdev, Mitch Williams
In-Reply-To: <20100422190230.GL28829@x200.localdomain>

On Thursday 22 April 2010 21:02:30 Chris Wright wrote:
> * Arnd Bergmann (arnd@arndb.de) wrote:
> > On Thursday 22 April 2010 19:47:29 Chris Wright wrote:
> > > OK, wasn't clear if you meant that or simply 100% dedicating the interface
> > > via something like virtio.  The add_vf() idea, while neat, doesn't really
> > > match how VF's are allocated.
> > 
> > But we still need something like that for allocating queues in VMDq
> > and similar cases where we do not have pass-through, right?
> 
> Iff we care about VMDq w/out SR-IOV (since SR-IOV hardware is VMDq
> capable and already has a queue-pair + interrupt + net_dev), yes.
> 
> And it's not just VMDq, it's any multi-queue card that can do mac/vlan
> filter in hw + header/data split (for direct data DMA to guest buffers).

Right, that's what I meant by VMDq. Do we have a better term to describe
this class of devices, i.e. VMDq and other cards that also have the
features you listed?

	Arnd


* Subject: re-submit4 [ANNOUNCEMENT] NET: usb: sierra_net.c driver
From: Elina Pasheva @ 2010-04-22 19:19 UTC
  To: dbrownell, davem; +Cc: epasheva, rfiler, linux-usb, netdev

Subject: re-submit4 [ANNOUNCEMENT] NET: usb: sierra_net.c driver
From: Elina Pasheva <epasheva@sierrawireless.com>

The following is a new Linux driver which exposes certain models of Sierra
Wireless modems to the operating system as Network Interface Cards (NICs).

This driver requires a version of the sierra.c driver which supports
blacklisting to work properly. The blacklist in sierra.c rejects the interfaces
claimed by sierra_net.c. Likewise, the sierra_net.c driver only accepts
(i.e. whitelists) the interface(s) used for USB-to-WWAN traffic.
For older kernels, the version of sierra.c which supports blacklisting is
available from the Sierra Wireless knowledge base page. It is also included in
the Linux kernel starting from version 2.6.31.
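
The division of interfaces between the two drivers can be sketched in plain C as below. This is a simplified user-space model, not the code of either driver: struct iface_list, iface_listed(), serial_side_claims() and net_side_claims() are hypothetical names, and any interface numbers a caller puts on the list are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* A list of USB interface numbers (hypothetical user-space model). */
struct iface_list {
	size_t count;
	const uint8_t *numbers;
};

/* Return 1 if ifnum appears on the list, 0 otherwise (or if list is NULL). */
static int iface_listed(uint8_t ifnum, const struct iface_list *list)
{
	size_t i;

	if (!list)
		return 0;
	for (i = 0; i < list->count; i++) {
		if (list->numbers[i] == ifnum)
			return 1;
	}
	return 0;
}

/* Serial side: claim every interface that is NOT on its blacklist. */
static int serial_side_claims(uint8_t ifnum, const struct iface_list *blacklist)
{
	return !iface_listed(ifnum, blacklist);
}

/* Network side: claim only interfaces that ARE on its whitelist. */
static int net_side_claims(uint8_t ifnum, const struct iface_list *whitelist)
{
	return iface_listed(ifnum, whitelist);
}
```

When both sides are given the same list, every interface is claimed by exactly one of them, which is the property the description above relies on.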

This driver works with all Sierra Wireless devices configured with PID=68A3,
such as the USB305 and USB306, provided the firmware version is I2.0 or later
(for USB305) or M3.0 or later (for USB306).
The driver will not work with earlier firmware versions. In that case it
issues an error message indicating incompatibility and does not serve the
device's USB-to-WWAN interface.

Sierra_net.c sits atop a pre-existing Linux driver called usbnet.c.
Sierra_net.c provides a series of hook functions which usbnet.c calls
in response to particular conditions such as receipt or transmission
of a data packet. As such, usbnet.c does most of the work of making
a modem appear to the system as a network device and of properly exchanging
traffic between the USB subsystem and the network card interface.
Sierra_net.c is concerned with managing the data exchanged between the
USB-to-WWAN interface and the upper layers of the operating system.
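
The hook arrangement described above can be illustrated with a small user-space model. It is only a sketch of the pattern: struct frame, struct minidriver_hooks, core_receive() and demo_rx_fixup() are hypothetical names, and the real usbnet hooks operate on struct usbnet and struct sk_buff with more fields and callbacks than shown here. The 6-byte header length mirrors the extended HIP header length used in the patch.

```c
#include <stddef.h>

/* A received buffer as seen by the generic core (simplified model). */
struct frame {
	const unsigned char *data;
	size_t len;
};

/* Hooks a device-specific driver hands to the generic core. */
struct minidriver_hooks {
	/* Return nonzero to accept the frame, 0 to drop it. */
	int (*rx_fixup)(struct frame *f);
};

/* Generic receive path: let the hook adjust or reject the frame first. */
static int core_receive(const struct minidriver_hooks *h, struct frame *f)
{
	if (h->rx_fixup && !h->rx_fixup(f))
		return 0;	/* dropped by the device-specific hook */
	return 1;		/* would be passed up the network stack */
}

/* Example hook: strip a fixed 6-byte device header before delivery. */
static int demo_rx_fixup(struct frame *f)
{
	if (f->len < 6)
		return 0;
	f->data += 6;
	f->len -= 6;
	return 1;
}
```

The generic core stays unaware of the device framing; only the hook knows how many bytes to strip and when a buffer is too short to be valid.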

The version number of sierra_net.c driver is set to 2.0.

Thanks to Dan Williams for his generous help in generating this patch.
This patch has been checked against the net-2.6 tree.
Signed-off-by: Elina Pasheva <epasheva@sierrawireless.com>
Signed-off-by: Rory Filer <rfiler@sierrawireless.com>
---

 drivers/net/usb/Kconfig      |   10 +
 drivers/net/usb/Makefile     |    2 +-
 drivers/net/usb/sierra_net.c |  987 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 998 insertions(+), 1 deletions(-)
 create mode 100644 drivers/net/usb/sierra_net.c

diff --git a/drivers/net/usb/Kconfig b/drivers/net/usb/Kconfig
index ba56ce4..dcba201 100644
--- a/drivers/net/usb/Kconfig
+++ b/drivers/net/usb/Kconfig
@@ -385,4 +385,14 @@ config USB_CDC_PHONET
 	  cellular modem, as found on most Nokia handsets with the
 	  "PC suite" USB profile.
 
+config USB_SIERRA_NET
+	tristate "USB-to-WWAN Driver for Sierra Wireless modems"
+	depends on USB_USBNET
+	default y
+	help
+	  Choose this option if you have a Sierra Wireless USB-to-WWAN device.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called sierra_net.
+
 endmenu
diff --git a/drivers/net/usb/Makefile b/drivers/net/usb/Makefile
index 82ea629..07605be 100644
--- a/drivers/net/usb/Makefile
+++ b/drivers/net/usb/Makefile
@@ -23,4 +23,4 @@ obj-$(CONFIG_USB_NET_MCS7830)	+= mcs7830.o
 obj-$(CONFIG_USB_USBNET)	+= usbnet.o
 obj-$(CONFIG_USB_NET_INT51X1)	+= int51x1.o
 obj-$(CONFIG_USB_CDC_PHONET)	+= cdc-phonet.o
-
+obj-$(CONFIG_USB_SIERRA_NET)	+= sierra_net.o
diff --git a/drivers/net/usb/sierra_net.c b/drivers/net/usb/sierra_net.c
new file mode 100644
index 0000000..eece76e
--- /dev/null
+++ b/drivers/net/usb/sierra_net.c
@@ -0,0 +1,987 @@
+/*
+ * USB-to-WWAN Driver for Sierra Wireless modems
+ *
+ * Copyright (C) 2008, 2009, 2010 Paxton Smith, Matthew Safar, Rory Filer
+ *                          <linux@sierrawireless.com>
+ *
+ * Portions of this based on the cdc_ether driver by David Brownell (2003-2005)
+ * and Ole Andre Vadla Ravnas (ActiveSync) (2006).
+ *
+ * IMPORTANT DISCLAIMER: This driver is not commercially supported by
+ * Sierra Wireless. Use at your own risk.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#define DRIVER_VERSION "v.2.0"
+#define DRIVER_AUTHOR "Paxton Smith, Matthew Safar, Rory Filer"
+#define DRIVER_DESC "USB-to-WWAN Driver for Sierra Wireless modems"
+static const char driver_name[] = "sierra_net";
+
+/* if defined debug messages enabled */
+/*#define	DEBUG*/
+
+#include <linux/module.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <linux/mii.h>
+#include <linux/sched.h>
+#include <linux/timer.h>
+#include <linux/usb.h>
+#include <linux/usb/cdc.h>
+#include <net/ip.h>
+#include <net/udp.h>
+#include <asm/unaligned.h>
+#include <linux/usb/usbnet.h>
+
+#define SWI_USB_REQUEST_GET_FW_ATTR	0x06
+#define SWI_GET_FW_ATTR_MASK		0x08
+
+/* atomic counter partially included in MAC address to make sure 2 devices
+ * do not end up with the same MAC - concept breaks in case of > 255 ifaces
+ */
+static	atomic_t iface_counter = ATOMIC_INIT(0);
+
+/*
+ * SYNC Timer Delay definition used to set the expiry time
+ */
+#define SIERRA_NET_SYNCDELAY (2*HZ)
+
+/* Max. MTU supported. The modem buffers are limited to 1500 */
+#define SIERRA_NET_MAX_SUPPORTED_MTU	1500
+
+/* The SIERRA_NET_USBCTL_BUF_LEN defines a buffer size allocated for control
+ * message reception ... and thus the max. received packet.
+ * (May be the cause for parse_hip returning -3)
+ */
+#define SIERRA_NET_USBCTL_BUF_LEN	1024
+
+/* list of interface numbers - used for constructing interface lists */
+struct sierra_net_iface_info {
+	const u32 infolen;	/* number of interface numbers on list */
+	const u8  *ifaceinfo;	/* pointer to the array holding the numbers */
+};
+
+struct sierra_net_info_data {
+	u16 rx_urb_size;
+	struct sierra_net_iface_info whitelist;
+};
+
+/* Private data structure */
+struct sierra_net_data {
+
+	u8 ethr_hdr_tmpl[ETH_HLEN]; /* ethernet header template for rx'd pkts */
+
+	u16 link_up;		/* air link up or down */
+	u8 tx_hdr_template[4];	/* part of HIP hdr for tx'd packets */
+
+	u8 sync_msg[4];		/* SYNC message */
+	u8 shdwn_msg[4];	/* Shutdown message */
+
+	/* Backpointer to the container */
+	struct usbnet *usbnet;
+
+	u8 ifnum;	/* interface number */
+
+/* Bit masks, must be a power of 2 */
+#define SIERRA_NET_EVENT_RESP_AVAIL    0x01
+#define SIERRA_NET_TIMER_EXPIRY        0x02
+	unsigned long kevent_flags;
+	struct work_struct sierra_net_kevent;
+	struct timer_list sync_timer; /* For retrying SYNC sequence */
+};
+
+struct param {
+	int is_present;
+	union {
+		void  *ptr;
+		u32    dword;
+		u16    word;
+		u8     byte;
+	};
+};
+
+/* HIP message type */
+#define SIERRA_NET_HIP_EXTENDEDID	0x7F
+#define SIERRA_NET_HIP_HSYNC_ID		0x60	/* Modem -> host */
+#define SIERRA_NET_HIP_RESTART_ID	0x62	/* Modem -> host */
+#define SIERRA_NET_HIP_MSYNC_ID		0x20	/* Host -> modem */
+#define SIERRA_NET_HIP_SHUTD_ID		0x26	/* Host -> modem */
+
+#define SIERRA_NET_HIP_EXT_IP_IN_ID   0x0202
+#define SIERRA_NET_HIP_EXT_IP_OUT_ID  0x0002
+
+/* 3G UMTS Link Sense Indication definitions */
+#define SIERRA_NET_HIP_LSI_UMTSID	0x78
+
+/* Reverse Channel Grant Indication HIP message */
+#define SIERRA_NET_HIP_RCGI		0x64
+
+/* LSI Protocol types */
+#define SIERRA_NET_PROTOCOL_UMTS      0x01
+/* LSI Coverage */
+#define SIERRA_NET_COVERAGE_NONE      0x00
+#define SIERRA_NET_COVERAGE_NOPACKET  0x01
+
+/* LSI Session */
+#define SIERRA_NET_SESSION_IDLE       0x00
+/* LSI Link types */
+#define SIERRA_NET_AS_LINK_TYPE_IPv4  0x00
+
+struct lsi_umts {
+	u8 protocol;
+	u8 unused1;
+	__be16 length;
+	/* eventually use a union for the rest - assume umts for now */
+	u8 coverage;
+	u8 unused2[41];
+	u8 session_state;
+	u8 unused3[33];
+	u8 link_type;
+	u8 pdp_addr_len; /* NW-supplied PDP address len */
+	u8 pdp_addr[16]; /* NW-supplied PDP address (bigendian) */
+	u8 unused4[23];
+	u8 dns1_addr_len; /* NW-supplied 1st DNS address len */
+	u8 dns1_addr[16]; /* NW-supplied 1st DNS address (bigendian) */
+	u8 dns2_addr_len; /* NW-supplied 2nd DNS address len */
+	u8 dns2_addr[16]; /* NW-supplied 2nd DNS address (bigendian) */
+	u8 wins1_addr_len; /* NW-supplied 1st Wins address len */
+	u8 wins1_addr[16]; /* NW-supplied 1st Wins address (bigendian) */
+	u8 wins2_addr_len; /* NW-supplied 2nd Wins address len */
+	u8 wins2_addr[16]; /* NW-supplied 2nd Wins address (bigendian) */
+	u8 unused5[4];
+	u8 gw_addr_len; /* NW-supplied GW address len */
+	u8 gw_addr[16]; /* NW-supplied GW address (bigendian) */
+	u8 reserved[8];
+} __attribute__ ((packed));
+
+#define SIERRA_NET_LSI_COMMON_LEN      4
+#define SIERRA_NET_LSI_UMTS_LEN        (sizeof(struct lsi_umts))
+#define SIERRA_NET_LSI_UMTS_STATUS_LEN \
+	(SIERRA_NET_LSI_UMTS_LEN - SIERRA_NET_LSI_COMMON_LEN)
+
+/* Forward definitions */
+static void sierra_sync_timer(unsigned long syncdata);
+static int sierra_net_change_mtu(struct net_device *net, int new_mtu);
+
+/* Our own net device operations structure */
+static const struct net_device_ops sierra_net_device_ops = {
+	.ndo_open               = usbnet_open,
+	.ndo_stop               = usbnet_stop,
+	.ndo_start_xmit         = usbnet_start_xmit,
+	.ndo_tx_timeout         = usbnet_tx_timeout,
+	.ndo_change_mtu         = sierra_net_change_mtu,
+	.ndo_set_mac_address    = eth_mac_addr,
+	.ndo_validate_addr      = eth_validate_addr,
+};
+
+/* get private data associated with passed in usbnet device */
+static inline struct sierra_net_data *sierra_net_get_private(struct usbnet *dev)
+{
+	return (struct sierra_net_data *)dev->data[0];
+}
+
+/* set private data associated with passed in usbnet device */
+static inline void sierra_net_set_private(struct usbnet *dev,
+			struct sierra_net_data *priv)
+{
+	dev->data[0] = (unsigned long)priv;
+}
+
+/* is packet IPv4 */
+static inline int is_ip(struct sk_buff *skb)
+{
+	return (skb->protocol == cpu_to_be16(ETH_P_IP));
+}
+
+/*
+ * check passed in packet and make sure that:
+ *  - it is linear (no scatter/gather)
+ *  - it is ethernet (mac_header properly set)
+ */
+static int check_ethip_packet(struct sk_buff *skb, struct usbnet *dev)
+{
+	skb_reset_mac_header(skb); /* ethernet header */
+
+	if (skb_is_nonlinear(skb)) {
+		netdev_err(dev->net, "Non linear buffer-dropping\n");
+		return 0;
+	}
+
+	if (!pskb_may_pull(skb, ETH_HLEN))
+		return 0;
+	skb->protocol = eth_hdr(skb)->h_proto;
+
+	return 1;
+}
+
+static const u8 *save16bit(struct param *p, const u8 *datap)
+{
+	p->is_present = 1;
+	p->word = get_unaligned_be16(datap);
+	return datap + sizeof(p->word);
+}
+
+static const u8 *save8bit(struct param *p, const u8 *datap)
+{
+	p->is_present = 1;
+	p->byte = *datap;
+	return datap + sizeof(p->byte);
+}
+
+/*----------------------------------------------------------------------------*
+ *                              BEGIN HIP                                     *
+ *----------------------------------------------------------------------------*/
+/* HIP header */
+#define SIERRA_NET_HIP_HDR_LEN 4
+/* Extended HIP header */
+#define SIERRA_NET_HIP_EXT_HDR_LEN 6
+
+struct hip_hdr {
+	int    hdrlen;
+	struct param payload_len;
+	struct param msgid;
+	struct param msgspecific;
+	struct param extmsgid;
+};
+
+static int parse_hip(const u8 *buf, const u32 buflen, struct hip_hdr *hh)
+{
+	const u8 *curp = buf;
+	int    padded;
+
+	if (buflen < SIERRA_NET_HIP_HDR_LEN)
+		return -1;
+
+	curp = save16bit(&hh->payload_len, curp);
+	curp = save8bit(&hh->msgid, curp);
+	curp = save8bit(&hh->msgspecific, curp);
+
+	padded = hh->msgid.byte & 0x80;
+	hh->msgid.byte &= 0x7F;			/* 7 bits */
+
+	hh->extmsgid.is_present = (hh->msgid.byte == SIERRA_NET_HIP_EXTENDEDID);
+	if (hh->extmsgid.is_present) {
+		if (buflen < SIERRA_NET_HIP_EXT_HDR_LEN)
+			return -2;
+
+		hh->payload_len.word &= 0x3FFF; /* 14 bits */
+
+		curp = save16bit(&hh->extmsgid, curp);
+		hh->extmsgid.word &= 0x03FF;	/* 10 bits */
+
+		hh->hdrlen = SIERRA_NET_HIP_EXT_HDR_LEN;
+	} else {
+		hh->payload_len.word &= 0x07FF;	/* 11 bits */
+		hh->hdrlen = SIERRA_NET_HIP_HDR_LEN;
+	}
+
+	if (padded) {
+		hh->hdrlen++;
+		hh->payload_len.word--;
+	}
+
+	/* if real packet shorter than the claimed length */
+	if (buflen < (hh->hdrlen + hh->payload_len.word))
+		return -3;
+
+	return 0;
+}
+
+static void build_hip(u8 *buf, const u16 payloadlen,
+		struct sierra_net_data *priv)
+{
+	/* the following doesn't have the full functionality. We
+	 * currently build only one kind of header, so it is faster this way
+	 */
+	put_unaligned_be16(payloadlen, buf);
+	memcpy(buf+2, priv->tx_hdr_template, sizeof(priv->tx_hdr_template));
+}
+/*----------------------------------------------------------------------------*
+ *                              END HIP                                       *
+ *----------------------------------------------------------------------------*/
+
+static void sierra_net_send_cmd(struct usbnet *dev,
+		u8 *cmd, int cmdlen, const char * cmd_name)
+{
+	struct sierra_net_data *priv = sierra_net_get_private(dev);
+	int  status;
+
+	status = usb_control_msg(dev->udev, usb_sndctrlpipe(dev->udev, 0),
+			USB_CDC_SEND_ENCAPSULATED_COMMAND,
+			USB_DIR_OUT|USB_TYPE_CLASS|USB_RECIP_INTERFACE,	0,
+			priv->ifnum, cmd, cmdlen, 0);
+
+	if (status != cmdlen && status != -ENODEV)
+		netdev_err(dev->net, "Submit %s failed %d\n", cmd_name, status);
+}
+
+static void sierra_net_send_sync(struct usbnet *dev)
+{
+	struct sierra_net_data *priv = sierra_net_get_private(dev);
+
+	dev_dbg(&dev->udev->dev, "%s", __func__);
+
+	sierra_net_send_cmd(dev, priv->sync_msg,
+			sizeof(priv->sync_msg), "SYNC");
+}
+
+static void sierra_net_send_shutdown(struct usbnet *dev)
+{
+	struct sierra_net_data *priv = sierra_net_get_private(dev);
+
+	dev_dbg(&dev->udev->dev, "%s", __func__);
+
+	sierra_net_send_cmd(dev, priv->shdwn_msg,
+			sizeof(priv->shdwn_msg), "Shutdown");
+}
+
+static void sierra_net_set_ctx_index(struct sierra_net_data *priv, u8 ctx_ix)
+{
+	dev_dbg(&(priv->usbnet->udev->dev), "%s %d", __func__, ctx_ix);
+	priv->tx_hdr_template[0] = 0x3F;
+	priv->tx_hdr_template[1] = ctx_ix;
+	*((u16 *)&priv->tx_hdr_template[2]) =
+		cpu_to_be16(SIERRA_NET_HIP_EXT_IP_OUT_ID);
+}
+
+static inline int sierra_net_is_valid_addrlen(u8 len)
+{
+	return (len == sizeof(struct in_addr));
+}
+
+static int sierra_net_parse_lsi(struct usbnet *dev, char *data, int datalen)
+{
+	struct lsi_umts *lsi = (struct lsi_umts *)data;
+
+	if (lsi->length != cpu_to_be16(SIERRA_NET_LSI_UMTS_STATUS_LEN)) {
+		netdev_err(dev->net, "%s: LSI_UMTS_STATUS_LEN %d, exp %u\n",
+				__func__, be16_to_cpu(lsi->length),
+				(u32)SIERRA_NET_LSI_UMTS_STATUS_LEN);
+		return -1;
+	}
+
+	/* Validate the protocol  - only support UMTS for now */
+	if (lsi->protocol != SIERRA_NET_PROTOCOL_UMTS) {
+		netdev_err(dev->net, "Protocol unsupported, 0x%02x\n",
+			lsi->protocol);
+		return -1;
+	}
+
+	/* Validate the link type */
+	if (lsi->link_type != SIERRA_NET_AS_LINK_TYPE_IPv4) {
+		netdev_err(dev->net, "Link type unsupported: 0x%02x\n",
+			lsi->link_type);
+		return -1;
+	}
+
+	/* Validate the coverage */
+	if (lsi->coverage == SIERRA_NET_COVERAGE_NONE
+	   || lsi->coverage == SIERRA_NET_COVERAGE_NOPACKET) {
+		netdev_err(dev->net, "No coverage, 0x%02x\n", lsi->coverage);
+		return 0;
+	}
+
+	/* Validate the session state */
+	if (lsi->session_state == SIERRA_NET_SESSION_IDLE) {
+		netdev_err(dev->net, "Session idle, 0x%02x\n",
+			lsi->session_state);
+		return 0;
+	}
+
+	/* Set link_sense true */
+	return 1;
+}
+
+static void sierra_net_handle_lsi(struct usbnet *dev, char *data,
+		struct hip_hdr	*hh)
+{
+	struct sierra_net_data *priv = sierra_net_get_private(dev);
+	int link_up;
+
+	link_up = sierra_net_parse_lsi(dev, data + hh->hdrlen,
+					hh->payload_len.word);
+	if (link_up < 0) {
+		netdev_err(dev->net, "Invalid LSI\n");
+		return;
+	}
+	if (link_up) {
+		sierra_net_set_ctx_index(priv, hh->msgspecific.byte);
+		priv->link_up = 1;
+		netif_carrier_on(dev->net);
+	} else {
+		priv->link_up = 0;
+		netif_carrier_off(dev->net);
+	}
+}
+
+static void sierra_net_dosync(struct usbnet *dev)
+{
+	struct sierra_net_data *priv = sierra_net_get_private(dev);
+
+	dev_dbg(&dev->udev->dev, "%s", __func__);
+
+	/* tell modem we are ready */
+	sierra_net_send_sync(dev);
+	sierra_net_send_sync(dev);
+
+	/* Now, start a timer and make sure we get the Restart Indication */
+	priv->sync_timer.function = sierra_sync_timer;
+	priv->sync_timer.data = (unsigned long) dev;
+	priv->sync_timer.expires = jiffies + SIERRA_NET_SYNCDELAY;
+	add_timer(&priv->sync_timer);
+}
+
+static void sierra_net_kevent(struct work_struct *work)
+{
+	struct sierra_net_data *priv =
+		container_of(work, struct sierra_net_data, sierra_net_kevent);
+	struct usbnet *dev = priv->usbnet;
+	int  len;
+	int  err;
+	u8  *buf;
+	u8   ifnum;
+
+	if (test_bit(SIERRA_NET_EVENT_RESP_AVAIL, &priv->kevent_flags)) {
+		clear_bit(SIERRA_NET_EVENT_RESP_AVAIL, &priv->kevent_flags);
+
+		/* Query the modem for the LSI message */
+		buf = kzalloc(SIERRA_NET_USBCTL_BUF_LEN, GFP_KERNEL);
+		if (!buf) {
+			netdev_err(dev->net,
+				"failed to allocate buf for LS msg\n");
+			return;
+		}
+		ifnum = priv->ifnum;
+		len = usb_control_msg(dev->udev, usb_rcvctrlpipe(dev->udev, 0),
+				USB_CDC_GET_ENCAPSULATED_RESPONSE,
+				USB_DIR_IN|USB_TYPE_CLASS|USB_RECIP_INTERFACE,
+				0, ifnum, buf, SIERRA_NET_USBCTL_BUF_LEN, 0);
+
+		if (unlikely(len < 0)) {
+			netdev_err(dev->net,
+				"usb_control_msg failed, status %d\n", len);
+		} else {
+			struct hip_hdr	hh;
+
+			dev_dbg(&dev->udev->dev, "%s: Received status message,"
+				" %04x bytes", __func__, len);
+
+			err = parse_hip(buf, len, &hh);
+			if (err) {
+				netdev_err(dev->net, "%s: Bad packet,"
+					" parse result %d\n", __func__, err);
+				kfree(buf);
+				return;
+			}
+
+			/* Validate packet length */
+			if (len != hh.hdrlen + hh.payload_len.word) {
+				netdev_err(dev->net, "%s: Bad packet, received"
+					" %d, expected %d\n",	__func__, len,
+					hh.hdrlen + hh.payload_len.word);
+				kfree(buf);
+				return;
+			}
+
+			/* Switch on received message types */
+			switch (hh.msgid.byte) {
+			case SIERRA_NET_HIP_LSI_UMTSID:
+				dev_dbg(&dev->udev->dev, "LSI for ctx:%d",
+					hh.msgspecific.byte);
+				sierra_net_handle_lsi(dev, buf, &hh);
+				break;
+			case SIERRA_NET_HIP_RESTART_ID:
+				dev_dbg(&dev->udev->dev, "Restart reported: %d,"
+						" stopping sync timer",
+						hh.msgspecific.byte);
+				/* Got sync resp - stop timer & clear mask */
+				del_timer_sync(&priv->sync_timer);
+				clear_bit(SIERRA_NET_TIMER_EXPIRY,
+					  &priv->kevent_flags);
+				break;
+			case SIERRA_NET_HIP_HSYNC_ID:
+				dev_dbg(&dev->udev->dev, "SYNC received");
+				sierra_net_send_sync(dev);
+				break;
+			case SIERRA_NET_HIP_EXTENDEDID:
+				netdev_err(dev->net, "Unrecognized HIP msg, "
+					"extmsgid 0x%04x\n", hh.extmsgid.word);
+				break;
+			case SIERRA_NET_HIP_RCGI:
+				/* Ignored */
+				break;
+			default:
+				netdev_err(dev->net, "Unrecognized HIP msg, "
+					"msgid 0x%02x\n", hh.msgid.byte);
+				break;
+			}
+		}
+		kfree(buf);
+	}
+	/* The sync timer bit might be set */
+	if (test_bit(SIERRA_NET_TIMER_EXPIRY, &priv->kevent_flags)) {
+		clear_bit(SIERRA_NET_TIMER_EXPIRY, &priv->kevent_flags);
+		dev_dbg(&dev->udev->dev, "Deferred sync timer expiry");
+		sierra_net_dosync(priv->usbnet);
+	}
+
+	if (priv->kevent_flags)
+		dev_dbg(&dev->udev->dev, "sierra_net_kevent done, "
+			"kevent_flags = 0x%lx", priv->kevent_flags);
+}
+
+static void sierra_net_defer_kevent(struct usbnet *dev, int work)
+{
+	struct sierra_net_data *priv = sierra_net_get_private(dev);
+
+	set_bit(work, &priv->kevent_flags);
+	schedule_work(&priv->sierra_net_kevent);
+}
+
+/*
+ * Sync Retransmit Timer Handler. On expiry, kick the work queue
+ */
+static void sierra_sync_timer(unsigned long syncdata)
+{
+	struct usbnet *dev = (struct usbnet *)syncdata;
+
+	dev_dbg(&dev->udev->dev, "%s", __func__);
+	/* Kick the work queue */
+	sierra_net_defer_kevent(dev, SIERRA_NET_TIMER_EXPIRY);
+}
+
+static void sierra_net_status(struct usbnet *dev, struct urb *urb)
+{
+	struct usb_cdc_notification *event;
+
+	dev_dbg(&dev->udev->dev, "%s", __func__);
+
+	if (urb->actual_length < sizeof *event)
+		return;
+
+	/* Add cases to handle other standard notifications. */
+	event = urb->transfer_buffer;
+	switch (event->bNotificationType) {
+	case USB_CDC_NOTIFY_NETWORK_CONNECTION:
+	case USB_CDC_NOTIFY_SPEED_CHANGE:
+		/* USB 305 sends those */
+		break;
+	case USB_CDC_NOTIFY_RESPONSE_AVAILABLE:
+		sierra_net_defer_kevent(dev, SIERRA_NET_EVENT_RESP_AVAIL);
+		break;
+	default:
+		netdev_err(dev->net, ": unexpected notification %02x!\n",
+				event->bNotificationType);
+		break;
+	}
+}
+
+static void sierra_net_get_drvinfo(struct net_device *net,
+		struct ethtool_drvinfo *info)
+{
+	/* Inherit standard device info */
+	usbnet_get_drvinfo(net, info);
+	strncpy(info->driver, driver_name, sizeof info->driver);
+	strncpy(info->version, DRIVER_VERSION, sizeof info->version);
+}
+
+static u32 sierra_net_get_link(struct net_device *net)
+{
+	struct usbnet *dev = netdev_priv(net);
+	/* Report link is down whenever the interface is down */
+	return sierra_net_get_private(dev)->link_up && netif_running(net);
+}
+
+static struct ethtool_ops sierra_net_ethtool_ops = {
+	.get_drvinfo = sierra_net_get_drvinfo,
+	.get_link = sierra_net_get_link,
+	.get_msglevel = usbnet_get_msglevel,
+	.set_msglevel = usbnet_set_msglevel,
+	.get_settings = usbnet_get_settings,
+	.set_settings = usbnet_set_settings,
+	.nway_reset = usbnet_nway_reset,
+};
+
+/* MTU can not be more than 1500 bytes, enforce it. */
+static int sierra_net_change_mtu(struct net_device *net, int new_mtu)
+{
+	if (new_mtu > SIERRA_NET_MAX_SUPPORTED_MTU)
+		return -EINVAL;
+
+	return usbnet_change_mtu(net, new_mtu);
+}
+
+static int is_whitelisted(const u8 ifnum,
+			const struct sierra_net_iface_info *whitelist)
+{
+	if (whitelist) {
+		const u8 *list = whitelist->ifaceinfo;
+		int i;
+
+		for (i = 0; i < whitelist->infolen; i++) {
+			if (list[i] == ifnum)
+				return 1;
+		}
+	}
+	return 0;
+}
+
+static int sierra_net_get_fw_attr(struct usbnet *dev, u16 *datap)
+{
+	int result = 0;
+	u16 *attrdata;
+
+	attrdata = kmalloc(sizeof(*attrdata), GFP_KERNEL);
+	if (!attrdata)
+		return -ENOMEM;
+
+	result = usb_control_msg(
+			dev->udev,
+			usb_rcvctrlpipe(dev->udev, 0),
+			/* __u8 vendor specific request */
+			SWI_USB_REQUEST_GET_FW_ATTR,
+			USB_DIR_IN | USB_TYPE_VENDOR,	/* __u8 request type */
+			0x0000,		/* __u16 value not used */
+			0x0000,		/* __u16 index  not used */
+			attrdata,	/* char *data */
+			sizeof(*attrdata),		/* __u16 size */
+			USB_CTRL_SET_TIMEOUT);	/* int timeout */
+
+	if (result < 0) {
+		kfree(attrdata);
+		return -EIO;
+	}
+
+	*datap = *attrdata;
+
+	kfree(attrdata);
+	return result;
+}
+
+/*
+ * Collects the bulk endpoints and the status endpoint.
+ */
+static int sierra_net_bind(struct usbnet *dev, struct usb_interface *intf)
+{
+	u8	ifacenum;
+	u8	numendpoints;
+	u16	fwattr = 0;
+	int	status;
+	struct ethhdr *eth;
+	struct sierra_net_data *priv;
+	static const u8 sync_tmplate[sizeof(priv->sync_msg)] = {
+		0x00, 0x00, SIERRA_NET_HIP_MSYNC_ID, 0x00};
+	static const u8 shdwn_tmplate[sizeof(priv->shdwn_msg)] = {
+		0x00, 0x00, SIERRA_NET_HIP_SHUTD_ID, 0x00};
+
+	struct sierra_net_info_data *data =
+			(struct sierra_net_info_data *)dev->driver_info->data;
+
+	dev_dbg(&dev->udev->dev, "%s", __func__);
+
+	ifacenum = intf->cur_altsetting->desc.bInterfaceNumber;
+	/* We only accept certain interfaces */
+	if (!is_whitelisted(ifacenum, &data->whitelist)) {
+		dev_dbg(&dev->udev->dev, "Ignoring interface: %d", ifacenum);
+		return -ENODEV;
+	}
+	numendpoints = intf->cur_altsetting->desc.bNumEndpoints;
+	/* We have three endpoints, bulk in and out, and a status */
+	if (numendpoints != 3) {
+		dev_err(&dev->udev->dev, "Expected 3 endpoints, found: %d",
+			numendpoints);
+		return -ENODEV;
+	}
+	/* Status endpoint set in usbnet_get_endpoints() */
+	dev->status = NULL;
+	status = usbnet_get_endpoints(dev, intf);
+	if (status < 0) {
+		dev_err(&dev->udev->dev, "Error in usbnet_get_endpoints (%d)",
+			status);
+		return -ENODEV;
+	}
+	/* Initialize sierra private data */
+	priv = kzalloc(sizeof *priv, GFP_KERNEL);
+	if (!priv) {
+		dev_err(&dev->udev->dev, "No memory");
+		return -ENOMEM;
+	}
+
+	priv->usbnet = dev;
+	priv->ifnum = ifacenum;
+	dev->net->netdev_ops = &sierra_net_device_ops;
+
+	/* change MAC addr to include ifacenum and to be unique */
+	dev->net->dev_addr[ETH_ALEN-2] = atomic_inc_return(&iface_counter);
+	dev->net->dev_addr[ETH_ALEN-1] = ifacenum;
+
+	/* we will have to manufacture ethernet headers, prepare template */
+	eth = (struct ethhdr *)priv->ethr_hdr_tmpl;
+	memcpy(&eth->h_dest, dev->net->dev_addr, ETH_ALEN);
+	eth->h_proto = cpu_to_be16(ETH_P_IP);
+
+	/* prepare shutdown message template */
+	memcpy(priv->shdwn_msg, shdwn_tmplate, sizeof(priv->shdwn_msg));
+	/* set context index initially to 0 - prepares tx hdr template */
+	sierra_net_set_ctx_index(priv, 0);
+
+	/* decrease the rx_urb_size and max_tx_size to 4k on USB 1.1 */
+	dev->rx_urb_size  = data->rx_urb_size;
+	if (dev->udev->speed != USB_SPEED_HIGH)
+		dev->rx_urb_size  = min_t(size_t, 4096, data->rx_urb_size);
+
+	dev->net->hard_header_len += SIERRA_NET_HIP_EXT_HDR_LEN;
+	dev->hard_mtu = dev->net->mtu + dev->net->hard_header_len;
+
+	/* Set up the netdev */
+	dev->net->flags |= IFF_NOARP;
+	dev->net->ethtool_ops = &sierra_net_ethtool_ops;
+	netif_carrier_off(dev->net);
+
+	sierra_net_set_private(dev, priv);
+
+	priv->kevent_flags = 0;
+
+	/* Use the shared workqueue */
+	INIT_WORK(&priv->sierra_net_kevent, sierra_net_kevent);
+
+	/* Only need to do this once */
+	init_timer(&priv->sync_timer);
+
+	/* verify fw attributes */
+	status = sierra_net_get_fw_attr(dev, &fwattr);
+	dev_dbg(&dev->udev->dev, "Fw attr: %x\n", fwattr);
+
+	/* test whether firmware supports DHCP */
+	if (!(status == sizeof(fwattr) && (fwattr & SWI_GET_FW_ATTR_MASK))) {
+		/* found incompatible firmware version */
+		dev_err(&dev->udev->dev, "Incompatible driver and firmware"
+			" versions\n");
+		kfree(priv);
+		return -ENODEV;
+	}
+	/* prepare sync message from template */
+	memcpy(priv->sync_msg, sync_tmplate, sizeof(priv->sync_msg));
+
+	/* initiate the sync sequence */
+	sierra_net_dosync(dev);
+
+	return 0;
+}
+
+static void sierra_net_unbind(struct usbnet *dev, struct usb_interface *intf)
+{
+	struct sierra_net_data *priv = sierra_net_get_private(dev);
+
+	dev_dbg(&dev->udev->dev, "%s", __func__);
+
+	/* Kill the timer then flush the work queue */
+	del_timer_sync(&priv->sync_timer);
+
+	flush_scheduled_work();
+
+	/* tell modem we are going away */
+	sierra_net_send_shutdown(dev);
+
+	sierra_net_set_private(dev, NULL);
+
+	kfree(priv);
+}
+
+static struct sk_buff *sierra_net_skb_clone(struct usbnet *dev,
+		struct sk_buff *skb, int len)
+{
+	struct sk_buff *new_skb;
+
+	/* clone skb */
+	new_skb = skb_clone(skb, GFP_ATOMIC);
+
+	/* remove len bytes from original */
+	skb_pull(skb, len);
+
+	/* trim next packet to its length */
+	if (new_skb) {
+		skb_trim(new_skb, len);
+	} else {
+		if (netif_msg_rx_err(dev))
+			netdev_err(dev->net, "failed to get skb\n");
+		dev->net->stats.rx_dropped++;
+	}
+
+	return new_skb;
+}
+
+/* ---------------------------- Receive data path ----------------------*/
+static int sierra_net_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
+{
+	int err;
+	struct hip_hdr  hh;
+	struct sk_buff *new_skb;
+
+	dev_dbg(&dev->udev->dev, "%s\n", __func__);
+
+	/* could contain multiple packets */
+	while (likely(skb->len)) {
+		err = parse_hip(skb->data, skb->len, &hh);
+		if (err) {
+			if (netif_msg_rx_err(dev))
+				netdev_err(dev->net, "Invalid HIP header %d\n",
+					err);
+			/* dev->net->stats.rx_errors incremented by caller */
+			dev->net->stats.rx_length_errors++;
+			return 0;
+		}
+
+		/* Validate Extended HIP header */
+		if (!hh.extmsgid.is_present
+		    || hh.extmsgid.word != SIERRA_NET_HIP_EXT_IP_IN_ID) {
+			if (netif_msg_rx_err(dev))
+				netdev_err(dev->net, "HIP/ETH: Invalid pkt\n");
+
+			dev->net->stats.rx_frame_errors++;
+			/* dev->net->stats.rx_errors incremented by caller */
+			return 0;
+		}
+
+		skb_pull(skb, hh.hdrlen);
+
+		/* We are going to accept this packet, prepare it */
+		memcpy(skb->data, sierra_net_get_private(dev)->ethr_hdr_tmpl,
+			ETH_HLEN);
+
+		/* Last packet in batch handled by usbnet */
+		if (hh.payload_len.word == skb->len)
+			return 1;
+
+		new_skb = sierra_net_skb_clone(dev, skb, hh.payload_len.word);
+		if (new_skb)
+			usbnet_skb_return(dev, new_skb);
+
+	} /* while */
+
+	return 0;
+}
+
+/* ---------------------------- Transmit data path ----------------------*/
+static struct sk_buff *sierra_net_tx_fixup(struct usbnet *dev,
+		struct sk_buff *skb, gfp_t flags)
+{
+	struct sierra_net_data *priv = sierra_net_get_private(dev);
+	u16 len;
+	bool need_tail;
+
+	dev_dbg(&dev->udev->dev, "%s\n", __func__);
+	if (priv->link_up && check_ethip_packet(skb, dev) && is_ip(skb)) {
+		/* enough head room as is? */
+		if (SIERRA_NET_HIP_EXT_HDR_LEN <= skb_headroom(skb)) {
+			/* Save the Eth/IP length and set up HIP hdr */
+			len = skb->len;
+			skb_push(skb, SIERRA_NET_HIP_EXT_HDR_LEN);
+			/* Handle ZLP issue */
+			need_tail = ((len + SIERRA_NET_HIP_EXT_HDR_LEN)
+				% dev->maxpacket == 0);
+			if (need_tail) {
+				if (unlikely(skb_tailroom(skb) == 0)) {
+					netdev_err(dev->net, "tx_fixup: "
+						"no room for packet\n");
+					dev_kfree_skb_any(skb);
+					return NULL;
+				} else {
+					skb->data[skb->len] = 0;
+					__skb_put(skb, 1);
+					len = len + 1;
+				}
+			}
+			build_hip(skb->data, len, priv);
+			return skb;
+		} else {
+			/*
+			 * compensate in the future if necessary
+			 */
+			netdev_err(dev->net, "tx_fixup: no room for HIP\n");
+		} /* headroom */
+	}
+
+	if (!priv->link_up)
+		dev->net->stats.tx_carrier_errors++;
+
+	/* tx_dropped incremented by usbnet */
+
+	/* filter the packet out, release it  */
+	dev_kfree_skb_any(skb);
+	return NULL;
+}
+
+static const u8 sierra_net_ifnum_list[] = { 7, 10, 11 };
+static const struct sierra_net_info_data sierra_net_info_data_68A3 = {
+	.rx_urb_size = 8 * 1024,
+	.whitelist = {
+		.infolen = ARRAY_SIZE(sierra_net_ifnum_list),
+		.ifaceinfo = sierra_net_ifnum_list
+	}
+};
+
+static const struct driver_info sierra_net_info_68A3 = {
+	.description = "Sierra Wireless USB-to-WWAN Modem",
+	.flags = FLAG_WWAN | FLAG_SEND_ZLP,
+	.bind = sierra_net_bind,
+	.unbind = sierra_net_unbind,
+	.status = sierra_net_status,
+	.rx_fixup = sierra_net_rx_fixup,
+	.tx_fixup = sierra_net_tx_fixup,
+	.data = (unsigned long)&sierra_net_info_data_68A3,
+};
+
+static const struct usb_device_id products[] = {
+	{USB_DEVICE(0x1199, 0x68A3), /* Sierra Wireless USB-to-WWAN modem */
+	.driver_info = (unsigned long) &sierra_net_info_68A3},
+
+	{}, /* last item */
+};
+MODULE_DEVICE_TABLE(usb, products);
+
+/* We are based on usbnet, so let it handle the USB driver specifics */
+static struct usb_driver sierra_net_driver = {
+	.name = "sierra_net",
+	.id_table = products,
+	.probe = usbnet_probe,
+	.disconnect = usbnet_disconnect,
+	.suspend = usbnet_suspend,
+	.resume = usbnet_resume,
+	.no_dynamic_id = 1,
+};
+
+static int __init sierra_net_init(void)
+{
+	BUILD_BUG_ON(FIELD_SIZEOF(struct usbnet, data)
+				< sizeof(struct cdc_state));
+
+	return usb_register(&sierra_net_driver);
+}
+
+static void __exit sierra_net_exit(void)
+{
+	usb_deregister(&sierra_net_driver);
+}
+
+module_exit(sierra_net_exit);
+module_init(sierra_net_init);
+
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+
-- 
1.5.4.3
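
The "Handle ZLP issue" step in sierra_net_tx_fixup() above can be sketched
in user space as follows. This is only an illustration of the arithmetic,
not the driver code: HIP_EXT_HDR_LEN is an assumed stand-in for
SIERRA_NET_HIP_EXT_HDR_LEN (its real value is not shown in this excerpt),
and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed header length, for illustration only; the patch defines the
 * real SIERRA_NET_HIP_EXT_HDR_LEN elsewhere. */
#define HIP_EXT_HDR_LEN 6

/* True when header + payload would end exactly on a USB packet boundary,
 * which is the case the patch avoids by appending one zero byte. */
static bool needs_pad_byte(size_t payload_len, size_t maxpacket)
{
	assert(maxpacket != 0);
	return (payload_len + HIP_EXT_HDR_LEN) % maxpacket == 0;
}

/* Number of bytes handed to the USB layer, including the optional pad. */
static size_t transfer_len(size_t payload_len, size_t maxpacket)
{
	size_t total = payload_len + HIP_EXT_HDR_LEN;

	return needs_pad_byte(payload_len, maxpacket) ? total + 1 : total;
}
```

With a 512-byte maxpacket and the assumed 6-byte header, a 506-byte
payload would be padded to 513 bytes, while a 100-byte payload goes out
unpadded as 106 bytes. Note that in the patch the pad byte is also counted
in the length passed to build_hip().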




* Re: [PATCH] WAN: flush tx_queue in hdlc_ppp to prevent panic on rmmod hw_driver.
From: Michael Barkowski @ 2010-04-22 19:17 UTC (permalink / raw)
  To: Krzysztof Halasa; +Cc: David Miller, netdev
In-Reply-To: <m3mxx5mv8v.fsf@intrepid.localdomain>

Krzysztof Halasa wrote:
> tx_queue is used as a temporary queue when not allowed to queue skb
> directly to the hw device driver (which may sleep). Most paths flush
> it before returning, but ppp_start() currently cannot. Make sure we
> don't leave skbs pointing to a non-existent device.
> 
> Thanks to Michael Barkowski for reporting this problem.

Great - thanks.  Will this be going into -stable?

-- 
Michael Barkowski


* Re: [net-next,1/2] add iovnl netlink support
From: Chris Wright @ 2010-04-22 19:02 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Chris Wright, Scott Feldman, davem, netdev, Mitch Williams
In-Reply-To: <201004222048.53950.arnd@arndb.de>

* Arnd Bergmann (arnd@arndb.de) wrote:
> On Thursday 22 April 2010 19:47:29 Chris Wright wrote:
> > OK, wasn't clear if you meant that or simply 100% dedicating the interface
> > via something like virtio.  The add_vf() idea, while neat, doesn't really
> > match how VF's are allocated.
> 
> But we still need something like that for allocating queues in VMDq
> and similar cases where we do not have pass-through, right?

Iff we care about VMDq w/out SR-IOV (since SR-IOV hardware is VMDq
capable and already has a queue-pair + interrupt + net_dev), yes.

And it's not just VMDq, it's any multi-queue card that can do mac/vlan
filter in hw + header/data split (for direct data DMA to guest buffers).

> As far as I can tell we don't have an interface for that yet, but
> we have drivers for a number of cards that could do this.
> 
> > > I don't have an SR-IOV card available for testing yet. How is this
> > > configured now?
> > 
> > The device shows up in the host as a normal network device, so mgmt tools
> > currently treat it as if it's no different from a PF.  So that's just
> > plain old:
> > 
> > SIOCSIFHWADDR or RTM_SETLINK (i.e. normal ->ndo_set_mac_addr)
> 
> Ok, but that only works for a fixed number of VFs and you can only
> configure the VF before it's assigned to the guest, right?

Depends on what "assign" means.

If it means the VF is still visible in the host, but only one guest is
using it via virtio (e.g. vhost-net), then no, it can be changed at any
time (although it's not typically changed during the VM lifecycle).

If it means direct PCI device assignment of the VF to the guest, then
yes, it can only be configured while the device has a driver in the host.

> Both are not serious limitations, but it would be nice to
> have an easy way around them. In particular, for assigning
> the mac address and vlan id (VF in access mode), there needs
> to be some interface that allows the host but not the guest
> to change the settings after assigning the card to the guest.
> 
> This is a fundamental requirement for VEPA, because the switch
> applies its forwarding rules based on the mac address and trusts
> the hypervisor to make sure it cannot be faked by the guest.

Sure, but the VF (when directly assigned to the guest) is going to (at
least it should, for security reasons) always trap to privileged code if
the guest tries to do something like set mac or vlan id.  All the SR-IOV
cards I've seen do this.  The "set VF mac addr" is really a message to
the PF.

thanks,
-chris


* Re: [net-next,1/2] add iovnl netlink support
From: Arnd Bergmann @ 2010-04-22 18:48 UTC (permalink / raw)
  To: Chris Wright; +Cc: Scott Feldman, davem, netdev, Mitch Williams
In-Reply-To: <20100422174729.GK28829@x200.localdomain>

On Thursday 22 April 2010 19:47:29 Chris Wright wrote:
> OK, wasn't clear if you meant that or simply 100% dedicating the interface
> via something like virtio.  The add_vf() idea, while neat, doesn't really
> match how VF's are allocated.

But we still need something like that for allocating queues in VMDq
and similar cases where we do not have pass-through, right?

As far as I can tell we don't have an interface for that yet, but
we have drivers for a number of cards that could do this.

> > I don't have an SR-IOV card available for testing yet. How is this
> > configured now?
> 
> The device shows up in the host as a normal network device, so mgmt tools
> currently treat it as if it's no different from a PF.  So that's just
> plain old:
> 
> SIOCSIFHWADDR or RTM_SETLINK (i.e. normal ->ndo_set_mac_addr)

Ok, but that only works for a fixed number of VFs and you can only
configure the VF before it's assigned to the guest, right?

Both are not serious limitations, but it would be nice to
have an easy way around them. In particular, for assigning
the mac address and vlan id (VF in access mode), there needs
to be some interface that allows the host but not the guest
to change the settings after assigning the card to the guest.

This is a fundamental requirement for VEPA, because the switch
applies its forwarding rules based on the mac address and trusts
the hypervisor to make sure it cannot be faked by the guest.

	Arnd


* Re: [net-next,1/2] add iovnl netlink support
From: Chris Wright @ 2010-04-22 17:47 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Chris Wright, Scott Feldman, davem, netdev, Mitch Williams
In-Reply-To: <201004220851.05756.arnd@arndb.de>

* Arnd Bergmann (arnd@arndb.de) wrote:
> On Thursday 22 April 2010, Chris Wright wrote:
> > > 
> > > ip link add link eth0 type macvlan    # for a container
> > > ip link add link eth0 type macvtap    # for qemu/vhost
> > > ip link add link eth0 type vf         # for device assignment
> > 
> > BTW, what do you mean by device assignment?
> 
> I mean giving an SR-IOV VF to the guest as a native PCI device
> rather than having qemu or vhost present a virtio-net to the
> guest.

OK, wasn't clear if you meant that or simply 100% dedicating the interface
via something like virtio.  The add_vf() idea, while neat, doesn't really
match how VF's are allocated.

> > > There are obviously significant differences between these three, but
> > > they also share enough of their properties to let us treat them
> > > in similar ways.
> > > 
> > > If we integrate the iovnl client into iproute2, the sequence for setting
> > > up an enic VF and associating it to the port profile could be
> > > 
> > > # create vf0, pass mac and vlan id to HW, no association yet
> > > ip link add link eth0 name vf0 type vf mac fe:dc:ba:12:34:56 vlan 78
> > 
> > Just to clarify...right now, the normal SR-IOV VF is already there.
> > And, of course, can have its mac addr/vlan set already.
> 
> I don't have an SR-IOV card available for testing yet. How is this
> configured now?

The device shows up in the host as a normal network device, so mgmt tools
currently treat it as if it's no different from a PF.  So that's just
plain old:

SIOCSIFHWADDR or RTM_SETLINK (i.e. normal ->ndo_set_mac_addr)

There's also the possibility of configuring through the PF (although
this isn't really widely used ATM, and has the disadvantage of exposing
the VF number to userspace in a way that's difficult to use).  This is
also done via RTM_SETLINK (on the PF this time), and will result in
->ndo_set_vf_mac().

> > > # associate vf with port profile, mac address must match the one assigned
> > > #  to the interface before.
> > > ip iov assoc eth0 port-profile "general" host-uuid "dcf2a873-f5ee-41dd-a7ad-802a544e48c2" \
> > >        mac fe:dc:ba:12:34:56
> > 
> > At that point you could just do s/mac fe:.*/link vf0/
> 
> My point was that this information should be irrelevant to the code doing the
> association with the switch. It sort of makes sense when the receiver is enic,
> but when we send the same data to lldpad, it doesn't care about the slave device
> name but only about the mac address. Especially since the slave device might not
> be in the root name space any more, meaning we have no way to find it.

Yeah, w/ namespaces I think you'd normally do all setup before handing
the device into a new namespace.

thanks,
-chris


* Re: [patch] rdma: potential ERR_PTR dereference
From: Andy Grover @ 2010-04-22 17:28 UTC (permalink / raw)
  To: Dan Carpenter; +Cc: David S. Miller, rds-devel, netdev, kernel-janitors
In-Reply-To: <20100422095527.GQ29647@bicker>

Dan Carpenter wrote:
> In the original code, the "goto out" calls "rdma_destroy_id(cm_id);"
> That isn't needed here and would cause problems because "cm_id" is an 
> ERR_PTR.  The new code just returns directly.
> 
> Signed-off-by: Dan Carpenter <error27@gmail.com>

Thanks, Dan.

Acked-by: Andy Grover <andy.grover@oracle.com>

-- Andy


* [PATCH v2] tcp: fix outsegs stat for TSO segments
From: Tom Herbert @ 2010-04-22 17:00 UTC (permalink / raw)
  To: davem, netdev

Account for TSO segments of an skb in TCP_MIB_OUTSEGS counter.  Without
doing this, the counter can be off by orders of magnitude from the
actual number of segments sent.

Signed-off-by: Tom Herbert <therbert@google.com>
---
diff --git a/include/net/snmp.h b/include/net/snmp.h
index 884fdbb..92456f1 100644
--- a/include/net/snmp.h
+++ b/include/net/snmp.h
@@ -133,6 +133,8 @@ struct linux_xfrm_mib {
 			__this_cpu_add(mib[0]->mibs[field], addend)
 #define SNMP_ADD_STATS_USER(mib, field, addend)	\
 			this_cpu_add(mib[1]->mibs[field], addend)
+#define SNMP_ADD_STATS(mib, field, addend)	\
+			this_cpu_add(mib[0]->mibs[field], addend)
 /*
  * Use "__typeof__(*mib[0]) *ptr" instead of "__typeof__(mib[0]) ptr"
  * to make @ptr a non-percpu pointer.
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 70c5159..91640fe 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -294,6 +294,7 @@ extern struct proto tcp_prot;
 #define TCP_INC_STATS_BH(net, field)	SNMP_INC_STATS_BH((net)->mib.tcp_statistics, field)
 #define TCP_DEC_STATS(net, field)	SNMP_DEC_STATS((net)->mib.tcp_statistics, field)
 #define TCP_ADD_STATS_USER(net, field, val) SNMP_ADD_STATS_USER((net)->mib.tcp_statistics, field, val)
+#define TCP_ADD_STATS(net, field, val)	SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
 
 extern void			tcp_v4_err(struct sk_buff *skb, u32);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2b7d71f..8ce0f99 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -888,7 +888,8 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 		tcp_event_data_sent(tp, skb, sk);
 
 	if (after(tcb->end_seq, tp->snd_nxt) || tcb->seq == tcb->end_seq)
-		TCP_INC_STATS(sock_net(sk), TCP_MIB_OUTSEGS);
+		TCP_ADD_STATS(sock_net(sk), TCP_MIB_OUTSEGS,
+			      tcp_skb_pcount(skb));
 
 	err = icsk->icsk_af_ops->queue_xmit(skb);
 	if (likely(err <= 0))
@@ -2503,7 +2504,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 	th->window = htons(min(req->rcv_wnd, 65535U));
 	tcp_options_write((__be32 *)(th + 1), tp, &opts);
 	th->doff = (tcp_header_size >> 2);
-	TCP_INC_STATS(sock_net(sk), TCP_MIB_OUTSEGS);
+	TCP_ADD_STATS(sock_net(sk), TCP_MIB_OUTSEGS, tcp_skb_pcount(skb));
 
 #ifdef CONFIG_TCP_MD5SIG
 	/* Okay, we have all we need - do the md5 hash if needed */
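
To see why counting one OUTSEGS increment per skb can be off by a large
factor, here is a small user-space sketch. It assumes that a TSO skb's
segment count is the payload length divided by the MSS, rounded up, with a
minimum of one; the helper and struct names are hypothetical and this is
not the kernel's tcp_skb_pcount() itself.

```c
#include <assert.h>
#include <stdint.h>

/* Segments a payload of len bytes is split into for a given MSS.
 * A payload that fits in one MSS (including a pure ACK) counts as one. */
static uint32_t seg_count(uint32_t len, uint32_t mss)
{
	assert(mss != 0);
	if (len <= mss)
		return 1;
	return (len + mss - 1) / mss;
}

/* Two ways of accounting the same transmissions. */
struct outsegs {
	uint64_t per_skb;	/* old behaviour: +1 per transmitted skb */
	uint64_t per_segment;	/* patched behaviour: +pcount per skb */
};

static void account_xmit(struct outsegs *c, uint32_t len, uint32_t mss)
{
	c->per_skb += 1;
	c->per_segment += seg_count(len, mss);
}
```

For example, a single 65160-byte TSO skb with an MSS of 1448 is 45
segments on the wire, so the per-skb counter under-reports by a factor of
45 for that transmission.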


* Re: DDoS attack causing bad effect on conntrack searches
From: Paul E. McKenney @ 2010-04-22 16:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, Changli Gao, hawk, Linux Kernel Network Hackers,
	netfilter-devel
In-Reply-To: <1271952128.7895.5851.camel@edumazet-laptop>

On Thu, Apr 22, 2010 at 06:02:08PM +0200, Eric Dumazet wrote:
> Le jeudi 22 avril 2010 à 08:51 -0700, Paul E. McKenney a écrit :
> > On Thu, Apr 22, 2010 at 04:53:49PM +0200, Eric Dumazet wrote:
> > > Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :
> > > 
> > > > If one hash slot is under attack, then there is a bug somewhere.
> > > > 
> > > > If we cannot avoid this, we can fall back to a secure mode at the second
> > > > retry, and take the spinlock.
> > > > 
> > > > This way, most lookups stay lockless (one pass), and some might take
> > > > the slot lock to avoid the possibility of a loop.
> > > > 
> > > > I suspect a bug elsewhere, quite frankly!
> > > > 
> > > > We have a chain that has an end pointer that doesn't match the expected
> > > > one.
> > > > 
> > > 
> > > In the normal situation, we always finish the lookup:
> > > 
> > > 1) If we find the thing we were looking for.
> > > 
> > > 2) We get the list end (item not found), we then check if it is the
> > > expected end.
> > > 
> > > It is _not_ the expected end only if some writer deleted/inserted an
> > > element in _this_ chain during our lookup.
> > 
> > So this situation uses SLAB_DESTROY_BY_RCU to quickly recycle deleted
> > elements?  (Not obvious from the code, but my ignorance of the networking
> > code is such that many things in that part of the kernel are not obvious
> > to me, I am afraid.)
> 
> Yes, this uses SLAB_DESTROY_BY_RCU, like tcp/udp lookups.

OK, that will do it!!!  ;-)

One way of throttling the bad effects of updates on readers is to
periodically force updates through a grace period.  But this seems
to be a very big hammer, and likely to have little practical effect.

Another approach would be to have multiple list pointers per element,
so that a given element could be reused a small number of times without
messing up concurrent readers (sort of like Herbert's resizable hash
table).

But as you say, if some other bug is really behind this, better to fix
that bug than to work around it.

> > Otherwise, of course you would simply allow deleted elements to continue
> > pointing where they did previously, so that concurrent readers would not
> > miss anything.
> 
> > Of course, the same potential might arise on insertion, but it is usually
> > OK to miss an element that was inserted after you started searching.
> > 
> > > Because our lookup is lockless, we then have to redo it because we might
> > > miss the object we are looking for.
> > 
> > Ah...  Is there also a resize operation?  Herbert did do a resizable
> > hash table recently, but I was under the impression that (1) it was in
> > some other part of the networking stack and (2) it avoided the need to
> > restart readers.
> > 
> > > If we can do the 'retry' 10 times, it means the attacker was really
> > > clever enough to inject new packets (new conntracks) at the right
> > > moment, in the right hash chain, and this sounds so highly incredible
> > > that I cannot believe it at all :)
> > 
> > Or maybe the DoS attack is injecting so many new conntracks that a large
> > fraction of the hash chains are being modified at any given time?
> 
> maybe hash table has one slot :)

;-) ;-) ;-)

							Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 1/3] IPv6: Generic TTL Security Mechanism (original version)
From: Stephen Hemminger @ 2010-04-22 16:23 UTC (permalink / raw)
  To: davem; +Cc: Pekka Savola, YOSHIFUJI Hideaki, Nick Hilliard, netdev
In-Reply-To: <20100403232922.489187907@vyatta.com>

On Sat, 03 Apr 2010 16:21:04 -0700
Stephen Hemminger <shemminger@vyatta.com> wrote:

> This patch adds IPv6 support for RFC5082 Generalized TTL
> Security Mechanism.  
> 
> This is the originally proposed code; the IPv6 and IPv4 socket options
> are separate. With this method, the server has to deal with both IPv4
> and IPv6 socket options, and the client has to handle the difference for
> each family.
> 
> On client:
> 	int ttl = 255;
> 	getaddrinfo(argv[1], argv[2], &hint, &result);
> 
> 	for (rp = result; rp != NULL; rp = rp->ai_next) {
> 		s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
> 		if (s < 0) continue;
> 
> 		if (rp->ai_family == AF_INET) {
> 			setsockopt(s, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl));
> 		} else if (rp->ai_family == AF_INET6) {
> 			setsockopt(s, IPPROTO_IPV6, IPV6_UNICAST_HOPS,
> 					&ttl, sizeof(ttl));
> 		}
> 			
> 		if (connect(s, rp->ai_addr, rp->ai_addrlen) == 0) {
> 		   ...
> 
> On server:
> 	int minttl = 255 - maxhops;
>    
> 	getaddrinfo(NULL, port, &hints, &result);
> 	for (rp = result; rp != NULL; rp = rp->ai_next) {
> 		s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
> 		if (s < 0) continue;
> 
> 		if (rp->ai_family == AF_INET6)
> 			setsockopt(s, IPPROTO_IPV6,  IPV6_MINHOPCOUNT,
> 					&minttl, sizeof(minttl));
> 		setsockopt(s, IPPROTO_IP, IP_MINTTL, &minttl, sizeof(minttl));
> 			
> 		if (bind(s, rp->ai_addr, rp->ai_addrlen) == 0)
> 			break;
> ..
> 
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Dave: Yoshifuji and I agree this is the best solution. How come the patch
hasn't been applied?
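
For reference, the per-family choice made in the quoted server fragment
can be wrapped in a small helper. This is a sketch only: the function
names are hypothetical, and the fallback values for IP_MINTTL and
IPV6_MINHOPCOUNT are the Linux numbers, assumed here in case the libc
headers do not provide them.

```c
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IP_MINTTL
#define IP_MINTTL 21		/* Linux value, assumed if headers lack it */
#endif
#ifndef IPV6_MINHOPCOUNT
#define IPV6_MINHOPCOUNT 73	/* Linux value, assumed if headers lack it */
#endif

/* GTSM: the sender uses TTL 255, so anything that crossed more than
 * maxhops routers arrives with a TTL below this threshold. */
static int gtsm_min_ttl(int maxhops)
{
	return 255 - maxhops;
}

/* Pick the socket option for an address family; -1 if unsupported. */
static int gtsm_sockopt_for(int family, int *level, int *optname)
{
	switch (family) {
	case AF_INET:
		*level = IPPROTO_IP;
		*optname = IP_MINTTL;
		return 0;
	case AF_INET6:
		*level = IPPROTO_IPV6;
		*optname = IPV6_MINHOPCOUNT;
		return 0;
	default:
		return -1;
	}
}

/* Apply the minimum TTL / hop count to socket s. */
static int gtsm_set_min_ttl(int s, int family, int maxhops)
{
	int level, optname;
	int minttl = gtsm_min_ttl(maxhops);

	if (gtsm_sockopt_for(family, &level, &optname) < 0)
		return -1;
	return setsockopt(s, level, optname, &minttl, sizeof(minttl));
}
```

A server looping over getaddrinfo() results would call
gtsm_set_min_ttl(s, rp->ai_family, maxhops) before bind(). Unlike the
quoted fragment, this helper sets only one option per socket; a
dual-stack listener may still need both.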


* Re: [PATCH] NIU support for skb->rxhash
From: Stephen Hemminger @ 2010-04-22 16:21 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100422.042157.99869295.davem@davemloft.net>

On Thu, 22 Apr 2010 04:21:57 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> 
> But it turns out using it is largely pointless since the only way to
> get the hash value(s) is through a structure which is prepended to the
> packet data (so we take a cache miss on the packet data anyways)
> instead of being able to fetch it out of the RX descriptors :-/
> 
> If anyone out there is trying to design sane hardware, please put the
> following into your RX descriptors:
> 
> 1) ethernet protocol type (u16)
> 2) a flag bit indicating if the packet destination matched one
>    of the programmed unicast MAC addresses
> 3) a flag bit indicating "multicast"
> 4) a flag bit indicating "broadcast"
> 5) at least 32-bits of the computed flow hash (u32)
> 
> kthx, bye!
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>

Could you make this configurable via ethtool, like I did for sky2?

P.S.: Where is that patch? It seems to be lost in patchwork.
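
As a rough illustration of the RX descriptor wishlist quoted above, the
five items fit in eight bytes. Everything here is hypothetical: the
struct layout, the flag names and the helpers do not describe NIU or any
real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits matching items 2-4 of the list. */
#define RXD_F_UCAST_MATCH	(1u << 0)	/* matched a programmed MAC */
#define RXD_F_MULTICAST		(1u << 1)
#define RXD_F_BROADCAST		(1u << 2)

/* Hypothetical per-packet metadata carried in the RX descriptor. */
struct rx_desc_meta {
	uint16_t eth_proto;	/* item 1: ethernet protocol type */
	uint16_t flags;		/* items 2-4 */
	uint32_t flow_hash;	/* item 5: computed flow hash */
};

/* Decide whether the packet is addressed to this host without reading
 * any packet data. */
static bool rxd_accept(const struct rx_desc_meta *d)
{
	return (d->flags & (RXD_F_UCAST_MATCH | RXD_F_MULTICAST |
			    RXD_F_BROADCAST)) != 0;
}

/* Steer by flow hash, again using only the descriptor. */
static uint32_t rxd_pick_queue(const struct rx_desc_meta *d, uint32_t nqueues)
{
	assert(nqueues != 0);
	return d->flow_hash % nqueues;
}
```

The point of such a layout is that both decisions can be made from the
descriptor alone, so the cache miss on the packet data is deferred.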


* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Stephen Hemminger @ 2010-04-22 16:17 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: Herbert Xu, David Miller, yoshfuji, netdev
In-Reply-To: <20100422154908.GA31568@midget.suse.cz>

On Thu, 22 Apr 2010 17:49:08 +0200
Jiri Bohac <jbohac@suse.cz> wrote:

> On Thu, Apr 22, 2010 at 10:25:06PM +0800, Herbert Xu wrote:
> > This patch fixes this by using the DADFAILED bit to synchronise
> > the two paths while holding the ifp lock.  It relies on the fact
> > that the TENTATIVE bit is always set during DAD, and that the
> > DADFAILED bit is only set on failure.
> 
> But the addr_dad_failure()->...->ipv6_del_addr() path will
> still race with any other path calling ipv6_del_addr() (e.g. a
> manual address removal). Won't it?
> 
> I still don't see why __ipv6_ifa_notify() needs to call
> dst_free(). Shouldn't that be dst_release() instead, to drop the
> reference obtained by dst_hold(&ifp->rt->u.dst)?

Yes, some more locking and race condition management is needed.

Something like the following (untested):

--- a/net/ipv6/addrconf.c	2010-04-22 09:11:54.594827858 -0700
+++ b/net/ipv6/addrconf.c	2010-04-22 09:15:59.224631752 -0700
@@ -720,13 +720,18 @@ static void ipv6_del_addr(struct inet6_i
 
 	hash = ipv6_addr_hash(&ifp->addr);
 
+	write_lock_bh(&idev->lock);
+	if (ifp->dead) {
+		write_unlock_bh(&idev->lock); /* lost race with DAD */
+		return;
+	}
+
 	ifp->dead = 1;
 
-	spin_lock_bh(&addrconf_hash_lock);
+	spin_lock(&addrconf_hash_lock);
 	hlist_del_init_rcu(&ifp->addr_lst);
-	spin_unlock_bh(&addrconf_hash_lock);
+	spin_unlock(&addrconf_hash_lock);
 
-	write_lock_bh(&idev->lock);
 #ifdef CONFIG_IPV6_PRIVACY
 	if (ifp->flags&IFA_F_TEMPORARY) {
 		list_del(&ifp->tmp_list);





* Re: DDoS attack causing bad effect on conntrack searches
From: Eric Dumazet @ 2010-04-22 16:02 UTC (permalink / raw)
  To: paulmck
  Cc: Patrick McHardy, Changli Gao, hawk, Linux Kernel Network Hackers,
	netfilter-devel
In-Reply-To: <20100422155123.GA2524@linux.vnet.ibm.com>

Le jeudi 22 avril 2010 à 08:51 -0700, Paul E. McKenney a écrit :
> On Thu, Apr 22, 2010 at 04:53:49PM +0200, Eric Dumazet wrote:
> > Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :
> > 
> > > If one hash slot is under attack, then there is a bug somewhere.
> > > 
> > > If we cannot avoid this, we can fall back to a secure mode at the second
> > > retry, and take the spinlock.
> > > 
> > > This way, most lookups stay lockless (one pass), and some might take
> > > the slot lock to avoid the possibility of a loop.
> > > 
> > > I suspect a bug elsewhere, quite frankly!
> > > 
> > > We have a chain that has an end pointer that doesn't match the expected
> > > one.
> > > 
> > 
> > In the normal situation, we always finish the lookup:
> > 
> > 1) If we find the thing we were looking for.
> > 
> > 2) We get the list end (item not found), we then check if it is the
> > expected end.
> > 
> > It is _not_ the expected end only if some writer deleted/inserted an
> > element in _this_ chain during our lookup.
> 
> So this situation uses SLAB_DESTROY_BY_RCU to quickly recycle deleted
> elements?  (Not obvious from the code, but my ignorance of the networking
> code is such that many things in that part of the kernel are not obvious
> to me, I am afraid.)
> 

Yes, this uses SLAB_DESTROY_BY_RCU, like tcp/udp lookups.

> Otherwise, of course you would simply allow deleted elements to continue
> pointing where they did previously, so that concurrent readers would not
> miss anything.
> 




> Of course, the same potential might arise on insertion, but it is usually
> OK to miss an element that was inserted after you started searching.
> 
> > Because our lookup is lockless, we then have to redo it because we might
> > miss the object we are looking for.
> 
> Ah...  Is there also a resize operation?  Herbert did do a resizable
> hash table recently, but I was under the impression that (1) it was in
> some other part of the networking stack and (2) it avoided the need to
> restart readers.
> 
> > If we can do the 'retry' 10 times, it means the attacker was really
> > clever enough to inject new packets (new conntracks) at the right
> > moment, in the right hash chain, and this sounds so highly incredible
> > that I cannot believe it at all :)
> 
> Or maybe the DoS attack is injecting so many new conntracks that a large
> fraction of the hash chains are being modified at any given time?
> 
> 							Thanx, Paul

maybe hash table has one slot :)
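
The lookup scheme discussed in this thread (walk the chain locklessly,
check that the end marker belongs to the expected slot, restart
otherwise, and fall back to the slot lock after too many restarts) can be
sketched in single-threaded user-space C. This is only a model: the names
are hypothetical, the RCU accessors and the key re-check after taking a
reference are left out, and the "take the lock" step is reduced to a
return code.

```c
#include <stddef.h>
#include <stdint.h>

/* A chain link is either a node pointer (low bit clear) or an end
 * marker with the low bit set and the slot number in the upper bits. */
struct node {
	uintptr_t next;
	uint32_t key;
};

#define NULLS_MARKER(slot)	((((uintptr_t)(slot)) << 1) | 1u)

static int is_nulls(uintptr_t p)
{
	return (int)(p & 1u);
}

static uint32_t nulls_value(uintptr_t p)
{
	return (uint32_t)(p >> 1);
}

enum lookup_result { LOOKUP_FOUND, LOOKUP_NOT_FOUND, LOOKUP_NEED_LOCK };

/* Lockless lookup of key in the chain at *head, which belongs to slot.
 * Restarts when the walk ends on another slot's marker, meaning an
 * element was recycled into a different chain under us. */
static enum lookup_result lookup(const uintptr_t *head, uint32_t slot,
				 uint32_t key, int max_retries,
				 struct node **out)
{
	for (int attempt = 0; attempt <= max_retries; attempt++) {
		uintptr_t p = *head;

		while (!is_nulls(p)) {
			struct node *n = (struct node *)p;

			if (n->key == key) {
				*out = n;
				return LOOKUP_FOUND;
			}
			p = n->next;
		}
		if (nulls_value(p) == slot)
			return LOOKUP_NOT_FOUND;	/* expected end */
		/* unexpected end marker: redo the lookup */
	}
	return LOOKUP_NEED_LOCK;	/* caller retries under the slot lock */
}
```

With max_retries set to 1, this matches the "secure mode at the second
retry" idea: the common case stays a single lockless pass, and only a
chain that keeps ending on the wrong marker pays for the lock.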




* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Paul E. McKenney @ 2010-04-22 16:01 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Miles Lane, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan
In-Reply-To: <20100422145640.GB3228@redhat.com>

On Thu, Apr 22, 2010 at 10:56:40AM -0400, Vivek Goyal wrote:
> On Wed, Apr 21, 2010 at 02:35:43PM -0700, Paul E. McKenney wrote:
> 
> [..]
> > > [    3.116754] [ INFO: suspicious rcu_dereference_check() usage. ]
> > > [    3.116754] ---------------------------------------------------
> > > [    3.116754] kernel/cgroup.c:4432 invoked rcu_dereference_check() without protection!
> > > [    3.116754]
> > > [    3.116754] other info that might help us debug this:
> > > [    3.116754]
> > > [    3.116754]
> > > [    3.116754] rcu_scheduler_active = 1, debug_locks = 1
> > > [    3.116754] 2 locks held by async/1/666:
> > > [    3.116754]  #0:  (&shost->scan_mutex){+.+.+.}, at: [<ffffffff812df0a0>] __scsi_add_device+0x83/0xe4
> > > [    3.116754]  #1:  (&(&blkcg->lock)->rlock){......}, at: [<ffffffff811f2e8d>] blkiocg_add_blkio_group+0x29/0x7f
> > > [    3.116754]
> > > [    3.116754] stack backtrace:
> > > [    3.116754] Pid: 666, comm: async/1 Not tainted 2.6.34-rc5 #18
> > > [    3.116754] Call Trace:
> > > [    3.116754]  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> > > [    3.116754]  [<ffffffff8107f9b1>] css_id+0x3f/0x51
> > > [    3.116754]  [<ffffffff811f2e9c>] blkiocg_add_blkio_group+0x38/0x7f
> > > [    3.116754]  [<ffffffff811f4e64>] cfq_init_queue+0xdf/0x2dc
> > > [    3.116754]  [<ffffffff811e3445>] elevator_init+0xba/0xf5
> > > [    3.116754]  [<ffffffff812dc02a>] ? scsi_request_fn+0x0/0x451
> > > [    3.116754]  [<ffffffff811e696b>] blk_init_queue_node+0x12f/0x135
> > > [    3.116754]  [<ffffffff811e697d>] blk_init_queue+0xc/0xe
> > > [    3.116754]  [<ffffffff812dc49c>] __scsi_alloc_queue+0x21/0x111
> > > [    3.116754]  [<ffffffff812dc5a4>] scsi_alloc_queue+0x18/0x64
> > > [    3.116754]  [<ffffffff812de5a0>] scsi_alloc_sdev+0x19e/0x256
> > > [    3.116754]  [<ffffffff812de73e>] scsi_probe_and_add_lun+0xe6/0x9c5
> > > [    3.116754]  [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> > > [    3.116754]  [<ffffffff813ce0d6>] ? __mutex_lock_common+0x3e4/0x43a
> > > [    3.116754]  [<ffffffff812df0a0>] ? __scsi_add_device+0x83/0xe4
> > > [    3.116754]  [<ffffffff812d0a5c>] ? transport_setup_classdev+0x0/0x17
> > > [    3.116754]  [<ffffffff812df0a0>] ? __scsi_add_device+0x83/0xe4
> > > [    3.116754]  [<ffffffff812df0d5>] __scsi_add_device+0xb8/0xe4
> > > [    3.116754]  [<ffffffff812ea9c5>] ata_scsi_scan_host+0x74/0x16e
> > > [    3.116754]  [<ffffffff81057685>] ? autoremove_wake_function+0x0/0x34
> > > [    3.116754]  [<ffffffff812e8e64>] async_port_probe+0xab/0xb7
> > > [    3.116754]  [<ffffffff8105e1b5>] ? async_thread+0x0/0x1f4
> > > [    3.116754]  [<ffffffff8105e2ba>] async_thread+0x105/0x1f4
> > > [    3.116754]  [<ffffffff81033d79>] ? default_wake_function+0x0/0xf
> > > [    3.116754]  [<ffffffff8105e1b5>] ? async_thread+0x0/0x1f4
> > > [    3.116754]  [<ffffffff8105713e>] kthread+0x89/0x91
> > > [    3.116754]  [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> > > [    3.116754]  [<ffffffff81003994>] kernel_thread_helper+0x4/0x10
> > > [    3.116754]  [<ffffffff813cfcc0>] ? restore_args+0x0/0x30
> > > [    3.116754]  [<ffffffff810570b5>] ? kthread+0x0/0x91
> > > [    3.116754]  [<ffffffff81003990>] ? kernel_thread_helper+0x0/0x10
> > 
> > I cannot convince myself that the above access is safe.  Vivek, Nauman,
> > thoughts?
> 
> Hi Paul,
> 
> blkiocg_add_blkio_group() is called from two paths.
> 
> First one is following. This path should be safe as it takes rcu read
> lock.
> 
> cfq_get_cfqg()
> 	rcu_read_lock()
> 	cfq_find_alloc_cfqg()
> 		blkiocg_add_blkio_group()
> 	rcu_read_unlock()
> 
> Second one is as shown in above backtrace.
> 
> cfq_init_queue()
> 	blkiocg_add_blkio_group().
> 
> This path is called at request queue and cfq initialization time and
> we access only root cgroup (root blkio_cgroup). As root cgroup can't
> go away, do we have to protect that call also using rcu_read_lock()?

You are correct, if the root cgroup cannot go away and if we only access
the root cgroup, then rcu_read_lock() is not required.

> So I guess it is not unsafe, but we probably need to fix the warning.
> Should I wrap the second call to blkiocg_add_blkio_group() in an
> rcu_read_lock()/rcu_read_unlock() pair?

That would work very well!

							Thanx, Paul


* Re: DDoS attack causing bad effect on conntrack searches
From: Paul E. McKenney @ 2010-04-22 15:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, Changli Gao, hawk, Linux Kernel Network Hackers,
	netfilter-devel
In-Reply-To: <1271948029.7895.5707.camel@edumazet-laptop>

On Thu, Apr 22, 2010 at 04:53:49PM +0200, Eric Dumazet wrote:
> Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :
> 
> > If one hash slot is under attack, then there is a bug somewhere.
> > 
> > If we cannot avoid this, we can fall back to a secure mode at the second
> > retry, and take the spinlock.
> > 
> > This way, most lookups stay lockless (one pass), and some might take
> > the slot lock to avoid the possibility of a loop.
> > 
> > I suspect a bug elsewhere, quite frankly!
> > 
> > We have a chain that has an end pointer that doesn't match the expected
> > one.
> > 
> 
> In the normal situation, we always finish the lookup:
> 
> 1) If we found the thing we were looking for.
> 
> 2) We reach the list end (item not found), and we then check if it is the
> expected end.
> 
> It is _not_ the expected end only if some writer deleted/inserted an
> element in _this_ chain during our lookup.

So this situation uses SLAB_DESTROY_BY_RCU to quickly recycle deleted
elements?  (Not obvious from the code, but my ignorance of the networking
code is such that many things in that part of the kernel are not obvious
to me, I am afraid.)

Otherwise, of course you would simply allow deleted elements to continue
pointing where they did previously, so that concurrent readers would not
miss anything.

Of course, the same potential might arise on insertion, but it is usually
OK to miss an element that was inserted after you started searching.

> Because our lookup is lockless, we then have to redo it because we might
> miss the object we are looking for.

Ah...  Is there also a resize operation?  Herbert did do a resizable
hash table recently, but I was under the impression that (1) it was in
some other part of the networking stack and (2) it avoided the need to
restart readers.

> If we can do the 'retry' 10 times, it means the attacker was really
> clever enough to inject new packets (new conntracks) at the right
> moment, in the right hash chain, and this sounds so highly incredible
> that I cannot believe it at all :)

Or maybe the DoS attack is injecting so many new conntracks that a large
fraction of the hash chains are being modified at any given time?

							Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Jiri Bohac @ 2010-04-22 15:49 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, jbohac, yoshfuji, netdev, shemminger
In-Reply-To: <20100422142506.GA15858@gondor.apana.org.au>

On Thu, Apr 22, 2010 at 10:25:06PM +0800, Herbert Xu wrote:
> This patch fixes this by using the DADFAILED bit to synchronise
> the two paths while holding the ifp lock.  It relies on the fact
> that the TENTATIVE bit is always set during DAD, and that the
> DADFAILED bit is only set on failure.

But the addr_dad_failure()->...->ipv6_del_addr() path will
still race with any other path calling ipv6_del_addr() (e.g. a
manual address removal). Won't it?

I still don't see why __ipv6_ifa_notify() needs to call
dst_free(). Shouldn't that be dst_release() instead, to drop the
reference obtained by dst_hold(&ifp->rt->u.dst)?

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ

^ permalink raw reply

* Re: [PATCH linux-next 1/2] irq: Add CPU mask affinity hint callback framework
From: Ben Hutchings @ 2010-04-22 15:41 UTC (permalink / raw)
  To: Peter P Waskiewicz Jr
  Cc: tglx@linutronix.de, davem@davemloft.net, arjan@linux.jf.intel.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <Pine.WNT.4.64.1004220459110.3324@PPWASKIE-MOBL2.amr.corp.intel.com>

On Thu, 2010-04-22 at 05:11 -0700, Peter P Waskiewicz Jr wrote:
> On Wed, 21 Apr 2010, Ben Hutchings wrote:
> 
> > On Tue, 2010-04-20 at 11:01 -0700, Peter P Waskiewicz Jr wrote:
> >> This patch adds a callback function pointer to the irq_desc
> >> structure, along with a registration function and a read-only
> >> proc entry for each interrupt.
> >>
> >> This affinity_hint handle for each interrupt can be used by
> >> underlying drivers that need a better mechanism to control
> >> interrupt affinity.  The underlying driver can register a
> >> callback for the interrupt, which will allow the driver to
> >> provide the CPU mask for the interrupt to anything that
> >> requests it.  The intent is to extend the userspace daemon,
> >> irqbalance, to help hint to it a preferred CPU mask to balance
> >> the interrupt into.
> >
> > Doesn't it make more sense to have the driver follow affinity decisions
> > made from user-space?  I realise that reallocating queues is disruptive
> > and we probably don't want irqbalance to trigger that, but there should
> > be a mechanism for the administrator to trigger it.
> 
> The driver here would be assisting userspace (irqbalance) to provide 
> better details how the HW is laid out with respect to flows.  As it stands 
> today, irqbalance is almost guaranteed to move interrupts to CPUs that are
> not aligned with where applications are running for network adapters. 
> This is very apparent when running at speeds in the 10 Gigabit range, or 
> even multiple 1 Gigabit ports running at the same time.

I'm well aware that irqbalance isn't making good decisions at the
moment.  The question is whether this will really help irqbalance to do
better.

[...]
> > This just assigns IRQs to the first n CPU threads.  Depending on the
> > enumeration order, this might result in assigning an IRQ to each of 2
> > threads on a core while leaving other cores unused!
> 
> This ixgbe patch is only meant to be an example of how you could use it. 
> I didn't hammer out all the corner cases of interrupt alignment in it yet. 
> However, ixgbe is already aligning Tx flows onto the CPU/queue pair the Tx 
> occurred (i.e. Tx session from CPU 4 will be queued on Tx queue 4),
[...]

OK, now I remember ixgbe has this odd select_queue() implementation.
But this behaviour can result in reordering whenever a user thread
migrates, and in any case Dave discourages people from setting
select_queue().  So I see that these changes would be useful for ixgbe
(together with an update to irqbalance), but they don't seem to fit the
general direction of multiqueue networking on Linux.

(Actually, the hints seem to be incomplete.  If there are more than 16
CPU threads then multiple CPU threads can map to the same queues, but it
looks like you only include the first in the queue's hint.)

An alternate approach is to use the RX queue index to drive TX queue
selection.  I posted a patch to do that earlier this week.  However I
haven't yet had a chance to try that on a suitably large system.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Vivek Goyal @ 2010-04-22 14:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Miles Lane, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan
In-Reply-To: <20100421213543.GO2563@linux.vnet.ibm.com>

On Wed, Apr 21, 2010 at 02:35:43PM -0700, Paul E. McKenney wrote:

[..]
> > [    3.116754] [ INFO: suspicious rcu_dereference_check() usage. ]
> > [    3.116754] ---------------------------------------------------
> > [    3.116754] kernel/cgroup.c:4432 invoked rcu_dereference_check() without protection!
> > [    3.116754]
> > [    3.116754] other info that might help us debug this:
> > [    3.116754]
> > [    3.116754]
> > [    3.116754] rcu_scheduler_active = 1, debug_locks = 1
> > [    3.116754] 2 locks held by async/1/666:
> > [    3.116754]  #0:  (&shost->scan_mutex){+.+.+.}, at: [<ffffffff812df0a0>] __scsi_add_device+0x83/0xe4
> > [    3.116754]  #1:  (&(&blkcg->lock)->rlock){......}, at: [<ffffffff811f2e8d>] blkiocg_add_blkio_group+0x29/0x7f
> > [    3.116754]
> > [    3.116754] stack backtrace:
> > [    3.116754] Pid: 666, comm: async/1 Not tainted 2.6.34-rc5 #18
> > [    3.116754] Call Trace:
> > [    3.116754]  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> > [    3.116754]  [<ffffffff8107f9b1>] css_id+0x3f/0x51
> > [    3.116754]  [<ffffffff811f2e9c>] blkiocg_add_blkio_group+0x38/0x7f
> > [    3.116754]  [<ffffffff811f4e64>] cfq_init_queue+0xdf/0x2dc
> > [    3.116754]  [<ffffffff811e3445>] elevator_init+0xba/0xf5
> > [    3.116754]  [<ffffffff812dc02a>] ? scsi_request_fn+0x0/0x451
> > [    3.116754]  [<ffffffff811e696b>] blk_init_queue_node+0x12f/0x135
> > [    3.116754]  [<ffffffff811e697d>] blk_init_queue+0xc/0xe
> > [    3.116754]  [<ffffffff812dc49c>] __scsi_alloc_queue+0x21/0x111
> > [    3.116754]  [<ffffffff812dc5a4>] scsi_alloc_queue+0x18/0x64
> > [    3.116754]  [<ffffffff812de5a0>] scsi_alloc_sdev+0x19e/0x256
> > [    3.116754]  [<ffffffff812de73e>] scsi_probe_and_add_lun+0xe6/0x9c5
> > [    3.116754]  [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> > [    3.116754]  [<ffffffff813ce0d6>] ? __mutex_lock_common+0x3e4/0x43a
> > [    3.116754]  [<ffffffff812df0a0>] ? __scsi_add_device+0x83/0xe4
> > [    3.116754]  [<ffffffff812d0a5c>] ? transport_setup_classdev+0x0/0x17
> > [    3.116754]  [<ffffffff812df0a0>] ? __scsi_add_device+0x83/0xe4
> > [    3.116754]  [<ffffffff812df0d5>] __scsi_add_device+0xb8/0xe4
> > [    3.116754]  [<ffffffff812ea9c5>] ata_scsi_scan_host+0x74/0x16e
> > [    3.116754]  [<ffffffff81057685>] ? autoremove_wake_function+0x0/0x34
> > [    3.116754]  [<ffffffff812e8e64>] async_port_probe+0xab/0xb7
> > [    3.116754]  [<ffffffff8105e1b5>] ? async_thread+0x0/0x1f4
> > [    3.116754]  [<ffffffff8105e2ba>] async_thread+0x105/0x1f4
> > [    3.116754]  [<ffffffff81033d79>] ? default_wake_function+0x0/0xf
> > [    3.116754]  [<ffffffff8105e1b5>] ? async_thread+0x0/0x1f4
> > [    3.116754]  [<ffffffff8105713e>] kthread+0x89/0x91
> > [    3.116754]  [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> > [    3.116754]  [<ffffffff81003994>] kernel_thread_helper+0x4/0x10
> > [    3.116754]  [<ffffffff813cfcc0>] ? restore_args+0x0/0x30
> > [    3.116754]  [<ffffffff810570b5>] ? kthread+0x0/0x91
> > [    3.116754]  [<ffffffff81003990>] ? kernel_thread_helper+0x0/0x10
> 
> I cannot convince myself that the above access is safe.  Vivek, Nauman,
> thoughts?

Hi Paul,

blkiocg_add_blkio_group() is called from two paths.

First one is following. This path should be safe as it takes rcu read
lock.

cfq_get_cfqg()
	rcu_read_lock()
	cfq_find_alloc_cfqg()
		blkiocg_add_blkio_group()
	rcu_read_unlock()

Second one is as shown in above backtrace.

cfq_init_queue()
	blkiocg_add_blkio_group().

This path is called at request queue and cfq initialization time and
we access only root cgroup (root blkio_cgroup). As root cgroup can't
go away, do we have to protect that call also using rcu_read_lock()?

So I guess it is not unsafe, but we probably need to fix the warning. Should
I wrap the second call to blkiocg_add_blkio_group() in an
rcu_read_lock()/rcu_read_unlock() pair?
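
A minimal userspace sketch of the wrapping proposed above. Everything in it is a stand-in invented for illustration: mock_rcu_read_lock(), mock_rcu_read_unlock(), css_id_checked() and the init_queue_*() helpers are hypothetical and only mimic the idea that a checked dereference complains when no read-side critical section is active. It is not the kernel code and not the eventual fix.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins, NOT the kernel primitives. */
static int rcu_depth;      /* nesting depth of the mock read-side section */
static int rcu_warnings;   /* how many "suspicious usage" reports fired */

static void mock_rcu_read_lock(void)
{
	rcu_depth++;
}

static void mock_rcu_read_unlock(void)
{
	assert(rcu_depth > 0);
	rcu_depth--;
}

struct blkio_cgroup_sketch {
	int id;
};

/* The root group never goes away, which is why the access itself is safe. */
static struct blkio_cgroup_sketch root_blkcg = { .id = 1 };

/* Mimics a checked dereference: report when called outside a read-side section. */
static int css_id_checked(const struct blkio_cgroup_sketch *blkcg)
{
	if (rcu_depth == 0)
		rcu_warnings++;
	return blkcg->id;
}

/* Init-time path as in the backtrace: no read-side section, so it warns. */
static int init_queue_unprotected(void)
{
	return css_id_checked(&root_blkcg);
}

/* Same path with the call wrapped; the result is identical, the warning is gone. */
static int init_queue_protected(void)
{
	int id;

	mock_rcu_read_lock();
	id = css_id_checked(&root_blkcg);
	mock_rcu_read_unlock();
	return id;
}
```

Both helpers return the same id; only the unprotected one bumps rcu_warnings, which matches the point that the wrapper silences the report without changing behaviour.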

Thanks
Vivek

^ permalink raw reply

* Re: [PATCH v5] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-22 14:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, jamal, Tom Herbert, netdev
In-Reply-To: <1271946805.7895.5658.camel@edumazet-laptop>

On Thu, Apr 22, 2010 at 10:33 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> It does make a difference, damn it.
>
> I am really starting to think you don't read what I write, or you don't
> care.

I misunderstood it. Sorry.

>
> Damn, can't you update everything at once, taking this lock only
> once?
>
> You focus on having an ultra-precise count of pkt_queue.len, but we don't
> care at all! We only want a _limit_, or else the box can be killed by a
> DoS.
>
> If in practice this limit can be 2*limit, that's OK.
>
> Can't you understand this?
>
>
> We need one limit. Not two limits.
>
> I already told you how to do it, but you ignored me and started yet
> another convoluted thing.
>
>
> process_backlog() transfers the queue to its own queue and resets pkt_len
> to 0 (only once).
>
> End of story.
>
> The maximum number of packets queued to this cpu's softnet_data will be
> 2 * old_limit.
>
> So what ?
>

Now I think I really understand. We don't need a precise limit, so
only one additional queue is enough.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Eric Dumazet @ 2010-04-22 14:53 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Changli Gao, hawk, Linux Kernel Network Hackers, netfilter-devel,
	Paul E McKenney
In-Reply-To: <1271946961.7895.5665.camel@edumazet-laptop>

Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :

> If one hash slot is under attack, then there is a bug somewhere.
> 
> If we cannot avoid this, we can fall back to a secure mode at the second
> retry, and take the spinlock.
> 
> This way, most lookups stay lockless (one pass), and some might take
> the slot lock to avoid the possibility of a loop.
> 
> I suspect a bug elsewhere, quite frankly!
> 
> We have a chain that has an end pointer that doesn't match the expected
> one.
> 

In the normal situation, we always finish the lookup:

1) If we found the thing we were looking for.

2) We reach the list end (item not found), and we then check if it is the
expected end.

It is _not_ the expected end only if some writer deleted/inserted an
element in _this_ chain during our lookup.

Because our lookup is lockless, we then have to redo it because we might
miss the object we are looking for.

If we can do the 'retry' 10 times, it means the attacker was really
clever enough to inject new packets (new conntracks) at the right
moment, in the right hash chain, and this sounds so highly incredible
that I cannot believe it at all :)
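
The termination rule described above, plus the suggested fallback to the slot lock, can be sketched in single-threaded userspace C. This is only an illustration under stated assumptions: struct node, walk(), lookup(), writer_hook and demo_writer() are hypothetical names, the "lock" is a counter rather than a real spinlock, and the sketch falls back on the first mismatch rather than on the second retry. The end marker stores the bucket number with the low bit set, in the spirit of the nulls lists used by __nf_conntrack_find().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 8

struct node {
	uintptr_t next;     /* next node, or an end marker with the low bit set */
	unsigned int key;
};

static uintptr_t table[NBUCKETS];
static unsigned int slot_lock_taken[NBUCKETS]; /* stand-in for the slot lock */
static void (*writer_hook)(struct node *);     /* test-only simulated writer */

static uintptr_t make_nulls(unsigned int v) { return ((uintptr_t)v << 1) | 1; }
static int is_nulls(uintptr_t p) { return (int)(p & 1); }
static unsigned int nulls_value(uintptr_t p) { return (unsigned int)(p >> 1); }

static void table_init(void)
{
	for (unsigned int i = 0; i < NBUCKETS; i++)
		table[i] = make_nulls(i);   /* an empty chain ends in its own index */
	writer_hook = NULL;
}

static void insert_head(struct node *n, unsigned int hash)
{
	assert(((uintptr_t)n & 1) == 0);    /* low bit is reserved for the marker */
	n->next = table[hash];
	table[hash] = (uintptr_t)n;
}

enum walk_result { WALK_FOUND, WALK_ABSENT, WALK_RESTART };

/* One pass over a chain; RESTART means we ended on another chain's marker. */
static enum walk_result walk(unsigned int hash, unsigned int key,
			     struct node **out)
{
	uintptr_t p = table[hash];

	while (!is_nulls(p)) {
		struct node *n = (struct node *)p;

		if (n->key == key) {
			*out = n;
			return WALK_FOUND;
		}
		if (writer_hook)
			writer_hook(n);     /* a writer may move n under our feet */
		p = n->next;
	}
	return nulls_value(p) == hash ? WALK_ABSENT : WALK_RESTART;
}

/* Lockless first pass; on an unexpected end, redo the walk "under the lock". */
static struct node *lookup(unsigned int hash, unsigned int key)
{
	struct node *n = NULL;
	enum walk_result r = walk(hash, key, &n);

	if (r == WALK_RESTART) {
		slot_lock_taken[hash]++;    /* real code would lock the slot here */
		r = walk(hash, key, &n);    /* writers excluded, chain is stable */
	}
	return r == WALK_FOUND ? n : NULL;
}

/* Simulated writer: once, move the head of bucket 3 over to bucket 5. */
static void demo_writer(struct node *visited)
{
	writer_hook = NULL;
	if (table[3] != (uintptr_t)visited)
		return;
	table[3] = visited->next;
	insert_head(visited, 5);
}
```

With two nodes in bucket 3 and demo_writer installed, the first pass follows the moved node into bucket 5, sees the wrong end marker, and only then pays for the locked pass. A lookup that is never disturbed stays lockless.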




^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Eric Dumazet @ 2010-04-22 14:36 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Changli Gao, hawk, Linux Kernel Network Hackers, netfilter-devel,
	Paul E McKenney
In-Reply-To: <4BD04C74.9020402@trash.net>

Le jeudi 22 avril 2010 à 15:17 +0200, Patrick McHardy a écrit :
> Changli Gao wrote:
> > On Thu, Apr 22, 2010 at 8:58 PM, Jesper Dangaard Brouer <hawk@comx.dk> wrote:
> >> At an unnamed ISP, we experienced a DDoS attack against one of our
> >> customers.  This attack also caused problems for one of our Linux
> >> based routers.
> >>
> >> The attack was "only" generating 300 kpps (packets per sec), which
> >> usually isn't a problem for this (fairly old) Linux Router.  But the
> >> conntracking system choked and reduced pps processing power to
> >> 40kpps.
> >>
> >> I do extensive RRD/graph monitoring of the machines.  The IP conntrack
> >> searches in the period exploded, to a stunning 700.000 searches per
> >> sec.
> >>
> >> http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_searches001.png
> >>
> >> At first I thought it might be caused by bad hashing, but after reading
> >> the kernel code (func: __nf_conntrack_find()), I think it's caused by
> >> the loop restart (goto begin) of the conntrack search, running under
> >> local_bh_disable().  These RCU changes to conntrack were introduced in
> >> ea781f19 by Eric Dumazet.
> >>
> >> Code: net/netfilter/nf_conntrack_core.c
> >> Func: __nf_conntrack_find()
> >>
> >> struct nf_conntrack_tuple_hash *
> >> __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
> >> {
> >>        struct nf_conntrack_tuple_hash *h;
> >>        struct hlist_nulls_node *n;
> >>        unsigned int hash = hash_conntrack(tuple);
> >>
> >>        /* Disable BHs the entire time since we normally need to disable them
> >>         * at least once for the stats anyway.
> >>         */
> >>        local_bh_disable();
> >> begin:
> >>        hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
> >>                if (nf_ct_tuple_equal(tuple, &h->tuple)) {
> >>                        NF_CT_STAT_INC(net, found);
> >>                        local_bh_enable();
> >>                        return h;
> >>                }
> >>                NF_CT_STAT_INC(net, searched);
> >>        }
> >>        /*
> >>         * if the nulls value we got at the end of this lookup is
> >>         * not the expected one, we must restart lookup.
> >>         * We probably met an item that was moved to another chain.
> >>         */
> >>        if (get_nulls_value(n) != hash)
> >>                goto begin;
> >>        local_bh_enable();
> >>
> > 
> > We should add a retry limit there.
> 
> We can't do that since that would allow false negatives.

If one hash slot is under attack, then there is a bug somewhere.

If we cannot avoid this, we can fall back to a secure mode at the second
retry, and take the spinlock.

This way, most lookups stay lockless (one pass), and some might take
the slot lock to avoid the possibility of a loop.

I suspect a bug elsewhere, quite frankly!

We have a chain that has an end pointer that doesn't match the expected
one.





^ permalink raw reply

* Re: [PATCH v5] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-22 14:33 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, jamal, Tom Herbert, netdev
In-Reply-To: <y2n412e6f7f1004220606id324dc9bj2cc04cfbad50a101@mail.gmail.com>

Le jeudi 22 avril 2010 à 21:06 +0800, Changli Gao a écrit :
> On Thu, Apr 22, 2010 at 7:37 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Please reorder things better.
> >
> > Most likely this function is called for one packet.
> >
> > In your version you take twice the rps_lock()/rps_unlock() path, so
> > it'll be slower.
> >
> > Once to 'transfer' one list to the process list
> >
> > Once to be able to do the 'label out:' post processing.
> >
> 
> It doesn't make any difference. We have to hold rps_lock to update
> input_pkt_queue_len, if we don't use another variable to record the
> length of the process queue, or atomic variable.
> 

It does make a difference, damn it.

I am really starting to think you don't read what I write, or you don't
care.

Damn, can't you update everything at once, taking this lock only
once?

You focus on having an ultra-precise count of pkt_queue.len, but we don't
care at all! We only want a _limit_, or else the box can be killed by a
DoS.

If in practice this limit can be 2*limit, that's OK.

Can't you understand this?

> I think it is better to use another variable to record the length
> of the process queue, and to update it before process_backlog()
> returns. For one packet, there is only one locking/unlocking. There is
> only one issue you were concerned about: cache misses due to summing the
> two queues' lengths. I'll change softnet_data to:
> 
> struct softnet_data {
>         struct Qdisc            *output_queue;
>         struct list_head        poll_list;
>         struct sk_buff          *completion_queue;
>         struct sk_buff_head     process_queue;
> 
> #ifdef CONFIG_RPS
>         struct softnet_data     *rps_ipi_list;
> 
>         /* Elements below can be accessed between CPUs for RPS */
>         struct call_single_data csd ____cacheline_aligned_in_smp;
>         struct softnet_data     *rps_ipi_next;
>         unsigned int            cpu;
>         unsigned int            input_queue_head;
> #endif
>         unsigned int            process_queue_len;
>         struct sk_buff_head     input_pkt_queue;
>         struct napi_struct      backlog;
> };
> 
> For one packet, we have to update process_queue_len in any case. For
> more packets, we only change process_queue_len just before
> process_backlog() returns. It means that process_queue_len change is
> batched.
> 

We need one limit. Not two limits.

I already told you how to do it, but you ignored me and started yet
another convoluted thing.

process_backlog() transfers the queue to its own queue and resets pkt_len
to 0 (only once).

End of story.

The maximum number of packets queued to this cpu's softnet_data will be
2 * old_limit.

So what ?
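
A userspace sketch of the single-limit scheme described above, under stated assumptions: struct backlog, enqueue(), take_batch() and dequeue_private() are hypothetical names, and lock()/unlock() are counting stubs standing in for rps_lock()/rps_unlock(). The point it illustrates is that the producer checks only the shared queue's length, the consumer splices the whole shared queue onto a private one under one lock acquisition, and the worst-case backlog is therefore twice the limit.

```c
#include <assert.h>
#include <stddef.h>

struct pkt {
	struct pkt *next;
};

struct queue {
	struct pkt *head, *tail;
	unsigned int len;
};

struct backlog {
	struct queue input;       /* shared with producers, needs the lock */
	struct queue process;     /* private to the consumer, no lock needed */
	unsigned int limit;       /* the one and only limit */
	unsigned int lock_taken;  /* counts acquisitions of the stub lock */
};

/* Stubs standing in for the real lock; they only count acquisitions. */
static void lock(struct backlog *b) { b->lock_taken++; }
static void unlock(struct backlog *b) { (void)b; }

/* Producer side: accept the packet only while the shared queue is below limit. */
static int enqueue(struct backlog *b, struct pkt *p)
{
	int ok = 0;

	lock(b);
	if (b->input.len < b->limit) {
		p->next = NULL;
		if (b->input.tail)
			b->input.tail->next = p;
		else
			b->input.head = p;
		b->input.tail = p;
		b->input.len++;
		ok = 1;
	}
	unlock(b);
	return ok;
}

/* Consumer side: move everything at once and reset the shared length to 0. */
static void take_batch(struct backlog *b)
{
	lock(b);
	if (b->input.head) {
		if (b->process.tail)
			b->process.tail->next = b->input.head;
		else
			b->process.head = b->input.head;
		b->process.tail = b->input.tail;
		b->process.len += b->input.len;
		b->input.head = b->input.tail = NULL;
		b->input.len = 0;
	}
	unlock(b);
}

/* Consumer side: pop from the private queue without touching the lock. */
static struct pkt *dequeue_private(struct backlog *b)
{
	struct pkt *p = b->process.head;

	if (!p)
		return NULL;
	assert(b->process.len > 0);
	b->process.head = p->next;
	if (!b->process.head)
		b->process.tail = NULL;
	b->process.len--;
	return p;
}
```

After a splice the shared queue is empty again, so producers can refill it up to the limit while the consumer is still draining its private batch. That is where the 2 * old_limit bound comes from.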

^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Herbert Xu @ 2010-04-22 14:25 UTC (permalink / raw)
  To: David Miller; +Cc: jbohac, yoshfuji, netdev, shemminger
In-Reply-To: <20100422.004324.67422011.davem@davemloft.net>

On Thu, Apr 22, 2010 at 12:43:24AM -0700, David Miller wrote:
>
> Thanks Herbert.

No worries :)

BTW similar races exist in other NDISC receive functions, but
it's too late today so I'll look at this tomorrow unless someone
else wants to have a go at this.

ipv6: Prevent DAD races

The NDISC receive path has not been written in a way that handles
unexpected packets properly.

For example, if we get two identical simultaneous NA/NS packets
that result in a DAD failure, we may try to delete the same address
twice.

A similar problem occurs when we get a DAD failure just as we're
about to mark an address as having passed DAD.

This patch fixes this by using the DADFAILED bit to synchronise
the two paths while holding the ifp lock.  It relies on the fact
that the TENTATIVE bit is always set during DAD, and that the
DADFAILED bit is only set on failure.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index de7a194..1d15d5e 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1401,6 +1401,16 @@ static void addrconf_dad_stop(struct inet6_ifaddr *ifp, int dad_failed)
 void addrconf_dad_failure(struct inet6_ifaddr *ifp)
 {
 	struct inet6_dev *idev = ifp->idev;
+	int ignore;
+
+	spin_lock(&ifp->lock);
+	ignore = (ifp->flags & (IFA_F_DADFAILED|IFA_F_TENTATIVE)) ^
+		 IFA_F_TENTATIVE;
+	ifp->flags |= IFA_F_DADFAILED;
+	spin_unlock(&ifp->lock);
+
+	if (ignore)
+		return;

 	if (net_ratelimit())
 		printk(KERN_INFO "%s: IPv6 duplicate address %pI6c detected!\n",
@@ -2789,7 +2799,10 @@ static void addrconf_dad_start(struct inet6_ifaddr *ifp, u32 flags)
 	read_lock_bh(&idev->lock);
 	if (ifp->dead)
 		goto out;
+
 	spin_lock_bh(&ifp->lock);
+	if (ifp->flags & IFA_F_DADFAILED)
+		goto unlock_ifp;

 	if (dev->flags&(IFF_NOARP|IFF_LOOPBACK) ||
 	    idev->cnf.accept_dad < 1 ||
@@ -2824,6 +2837,7 @@ static void addrconf_dad_start(struct inet6_ifaddr *ifp, u32 flags)
 		ip6_ins_rt(ifp->rt);

 	addrconf_dad_kick(ifp);
+unlock_ifp:
 	spin_unlock_bh(&ifp->lock);
 out:
 	read_unlock_bh(&idev->lock);
@@ -2841,6 +2855,11 @@ static void addrconf_dad_timer(unsigned long data)
 		goto out;
 	}
 	spin_lock_bh(&ifp->lock);
+	if (ifp->flags & IFA_F_DADFAILED) {
+		spin_unlock_bh(&ifp->lock);
+		read_unlock_bh(&idev->lock);
+		goto out;
+	}
 	if (ifp->probes == 0) {
 		/*
 		 * DAD was successful
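
The flag test in addrconf_dad_failure() above is compact, so here is a small userspace sketch of what the XOR computes. The struct and function names are hypothetical, and the two flag values are copied in locally for illustration. The expression is zero only when TENTATIVE is set and DADFAILED is clear, so only the first failure report for an address still undergoing DAD is acted on.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies of the two flag bits, defined here for illustration only. */
#define IFA_F_DADFAILED 0x08
#define IFA_F_TENTATIVE 0x40

struct ifaddr_sketch {
	unsigned int flags;
};

/*
 * Returns nonzero when a DAD failure report should be ignored.
 * As in the patch, DADFAILED is set unconditionally; in the kernel
 * this would happen while holding the ifp lock.
 */
static int dad_failure_should_ignore(struct ifaddr_sketch *ifp)
{
	unsigned int ignore;

	assert(ifp != NULL);
	ignore = (ifp->flags & (IFA_F_DADFAILED | IFA_F_TENTATIVE)) ^
		 IFA_F_TENTATIVE;
	ifp->flags |= IFA_F_DADFAILED;
	return ignore != 0;
}
```

For an address with only TENTATIVE set, the first call returns 0 and the second returns 1, which is the "two identical simultaneous packets" case from the commit message. An address that is no longer tentative is ignored on the first call.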

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related
