Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] myri10ge: improve port type reporting in ethtool output
From: Ben Hutchings @ 2009-10-19 14:17 UTC (permalink / raw)
  To: Andrew Gallatin
  Cc: Brice Goglin, David S. Miller, Linux Network Development list
In-Reply-To: <4ADC69FF.5000109@myri.com>

On Mon, 2009-10-19 at 09:30 -0400, Andrew Gallatin wrote:
> Ben Hutchings wrote:
> > On Mon, 2009-10-19 at 08:34 -0400, Andrew Gallatin wrote:
> >> Ben Hutchings wrote:
> >>
> >>> Lying about link modes is not an improvement.
> >> OK, so we're probably doing something wrong. I suspect we're not
> >> alone.  At least we don't set SUPPORTED_TP for CX4, like I've
> >> seen some NICs do.
> >>
> >> Can somebody suggest how we can tell ethtool that
> >> the NIC supports 10Gb only (no autoneg down to 1Gb or lower)
> >> for copper (10Gbase-CX4)?   How about for fiber (10Gbase-{S,L})R?
> > 
> > What's wrong with what you already do?  Customers expect to see
> > something on the supported line?
> 
> Exactly.  One has complained because drivers for
> other vendors NICs show this, even if they are fibre NICs
> or CX4 NICs, and don't actually support 10GbaseT.

Let's fix the other drivers then.  Labelling these NICs as supporting
10GBASE-T is liable to confuse more people (and tools) in the long run.

> I'm happy to back this part out, and resubmit the patch without
> it. There is still some fairly valuable stuff in the patch
> -- mainly updating the NIC detection logic for new NICs to
> detect fibre vs copper.

Sure.

You should also set port = PORT_OTHER for CX4 or KX4.  Currently it
looks like you don't set port, so it appears as 0 == PORT_TP.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH 0/2] Reduce number of GFP_ATOMIC allocation failures
From: Mel Gorman @ 2009-10-19 14:13 UTC (permalink / raw)
  To: Karol Lewandowski
  Cc: Andrew Morton, stable, Rafael J. Wysocki, David Miller, Frans Pop,
	reinette chatre, Kalle Valo, John W. Linville, Pekka Enberg,
	Bartlomiej Zolnierkiewicz, netdev, linux-kernel,
	linux-mm@kvack.org
In-Reply-To: <20091017183421.GA3370@bizet.domek.prywatny>

On Sat, Oct 17, 2009 at 08:34:21PM +0200, Karol Lewandowski wrote:
> On Fri, Oct 16, 2009 at 11:37:24AM +0100, Mel Gorman wrote:
> > The following two patches against 2.6.32-rc4 should reduce allocation
> > failure reports for GFP_ATOMIC allocations that have being cropping up
> > since 2.6.31-rc1.
> ...
> > The patches should also help the following bugs as well and testing there
> > would be appreciated.
> > 
> > [Bug #14265] ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100
> > 
> > It might also have helped the following bug
> 
> These patches actually made situation kind-of "worse" for this
> particular issue.
> 
> I've tried patches with post 2.6.32-rc4 kernel and after second
> suspend-resume cycle I got typical "order:5" failure.  However, this
> time when I manually tried to bring interface up ("ifup eth0") it
> failed for 4 consecutive times with "Can't allocate memory".  Before
> applying these patches this never occured -- kernel sometimes failed
> to allocate memory during resume, but it *never* failed afterwards.
> 

I'm hoping the patch + the revert which I asked for in another mail will
help. It's been clear for a while that more than one thing went wrong
during this cycle.

> I'll go now for another round of bisecting... and hopefully this time
> I'll be able to trigger this problem on different/faster computer with
> e100-based card.
> 
> 
> > although that driver has already been fixed by not making high-order
> > atomic allocations.
> 
> Driver has been fixed?  The one patch that I saw (by davem[1]) didn't
> fix this issue.  As of 2.6.32-rc5 I see no fixes to e100.c in
> mainline, has there been another than this[1] fix posted somewhere?
> 
> [1] http://lkml.org/lkml/2009/10/12/169
> 

The driver that was fixed was for the ipw2200, not the e100.

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [rft]aggressive power management for cdc-ether
From: Oliver Neukum @ 2009-10-19 13:51 UTC (permalink / raw)
  To: David Brownell, linux-usb, netdev

Hi,

this implements usb autosuspend for online cdc-ether devices
that support remote wakeup? What do you think?

	Regards
		Oliver

--

commit 956c214d266fc1764ceb931b039c7aadded4eb24
Author: Oliver Neukum <oliver@neukum.org>
Date:   Mon Oct 19 15:07:54 2009 +0200

    usb:usbnet&cdc-ether:full aggressive autosuspend
    
    autosuspend for cdc-ether devices while online if
    the device supports remote wakeup

diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
index 4a6aff5..8ee5bd7 100644
--- a/drivers/net/usb/cdc_ether.c
+++ b/drivers/net/usb/cdc_ether.c
@@ -411,6 +411,12 @@ static int cdc_bind(struct usbnet *dev, struct usb_interface *intf)
 	return 0;
 }
 
+static int cdc_manage_power(struct usbnet *dev, int on)
+{
+	dev->intf->needs_remote_wakeup = on;
+	return 0;
+}
+
 static const struct driver_info	cdc_info = {
 	.description =	"CDC Ethernet Device",
 	.flags =	FLAG_ETHER,
@@ -418,6 +424,7 @@ static const struct driver_info	cdc_info = {
 	.bind =		cdc_bind,
 	.unbind =	usbnet_cdc_unbind,
 	.status =	cdc_status,
+	.manage_power =	cdc_manage_power,
 };
 
 /*-------------------------------------------------------------------------*/
@@ -570,6 +577,7 @@ static struct usb_driver cdc_driver = {
 	.disconnect =	usbnet_disconnect,
 	.suspend =	usbnet_suspend,
 	.resume =	usbnet_resume,
+	.supports_autosuspend = 1,
 };
 
 
diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index ca5ca5a..c9938d5 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -353,7 +353,8 @@ static void rx_submit (struct usbnet *dev, struct urb *urb, gfp_t flags)
 
 	if (netif_running (dev->net)
 			&& netif_device_present (dev->net)
-			&& !test_bit (EVENT_RX_HALT, &dev->flags)) {
+			&& !test_bit (EVENT_RX_HALT, &dev->flags)
+			&& !test_bit (EVENT_DEV_ASLEEP, &dev->flags)) {
 		switch (retval = usb_submit_urb (urb, GFP_ATOMIC)) {
 		case -EPIPE:
 			usbnet_defer_kevent (dev, EVENT_RX_HALT);
@@ -611,15 +612,36 @@ EXPORT_SYMBOL_GPL(usbnet_unlink_rx_urbs);
 /*-------------------------------------------------------------------------*/
 
 // precondition: never called in_interrupt
+static void usbnet_terminate_urbs(struct usbnet *dev)
+{
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK (unlink_wakeup);
+	DECLARE_WAITQUEUE (wait, current);
+	int temp;
+
+	/* ensure there are no more active urbs */
+	add_wait_queue(&unlink_wakeup, &wait);
+	dev->wait = &unlink_wakeup;
+	temp = unlink_urbs(dev, &dev->txq) +
+		unlink_urbs(dev, &dev->rxq);
+
+	/* maybe wait for deletions to finish. */
+	while (!skb_queue_empty(&dev->rxq)
+		&& !skb_queue_empty(&dev->txq)
+		&& !skb_queue_empty(&dev->done)) {
+			schedule_timeout(UNLINK_TIMEOUT_MS);
+			if (netif_msg_ifdown(dev))
+				devdbg(dev, "waited for %d urb completions",
+					temp);
+	}
+	dev->wait = NULL;
+	remove_wait_queue(&unlink_wakeup, &wait);
+}
 
 int usbnet_stop (struct net_device *net)
 {
 	struct usbnet		*dev = netdev_priv(net);
 	struct driver_info	*info = dev->driver_info;
-	int			temp;
 	int			retval;
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK (unlink_wakeup);
-	DECLARE_WAITQUEUE (wait, current);
 
 	netif_stop_queue (net);
 
@@ -641,25 +663,8 @@ int usbnet_stop (struct net_device *net)
 				info->description);
 	}
 
-	if (!(info->flags & FLAG_AVOID_UNLINK_URBS)) {
-		/* ensure there are no more active urbs */
-		add_wait_queue(&unlink_wakeup, &wait);
-		dev->wait = &unlink_wakeup;
-		temp = unlink_urbs(dev, &dev->txq) +
-			unlink_urbs(dev, &dev->rxq);
-
-		/* maybe wait for deletions to finish. */
-		while (!skb_queue_empty(&dev->rxq)
-				&& !skb_queue_empty(&dev->txq)
-				&& !skb_queue_empty(&dev->done)) {
-			msleep(UNLINK_TIMEOUT_MS);
-			if (netif_msg_ifdown(dev))
-				devdbg(dev, "waited for %d urb completions",
-					temp);
-		}
-		dev->wait = NULL;
-		remove_wait_queue(&unlink_wakeup, &wait);
-	}
+	if (!(info->flags & FLAG_AVOID_UNLINK_URBS))
+		usbnet_terminate_urbs(dev);
 
 	usb_kill_urb(dev->interrupt);
 
@@ -672,7 +677,10 @@ int usbnet_stop (struct net_device *net)
 	dev->flags = 0;
 	del_timer_sync (&dev->delay);
 	tasklet_kill (&dev->bh);
-	usb_autopm_put_interface(dev->intf);
+	if (info->manage_power)
+		info->manage_power(dev, 0);
+	else
+		usb_autopm_put_interface(dev->intf);
 
 	return 0;
 }
@@ -753,6 +761,12 @@ int usbnet_open (struct net_device *net)
 
 	// delay posting reads until we're fully open
 	tasklet_schedule (&dev->bh);
+	if (info->manage_power) {
+		retval = info->manage_power(dev, 1);
+		if (retval < 0)
+			goto done;
+		usb_autopm_put_interface(dev->intf);
+	}
 	return retval;
 done:
 	usb_autopm_put_interface(dev->intf);
@@ -882,6 +896,7 @@ kevent (struct work_struct *work)
 	if (test_bit (EVENT_TX_HALT, &dev->flags)) {
 		unlink_urbs (dev, &dev->txq);
 		status = usb_clear_halt (dev->udev, dev->out);
+		usb_autopm_put_interface(dev->intf);
 		if (status < 0
 				&& status != -EPIPE
 				&& status != -ESHUTDOWN) {
@@ -953,17 +968,20 @@ static void tx_complete (struct urb *urb)
 	if (urb->status == 0) {
 		dev->net->stats.tx_packets++;
 		dev->net->stats.tx_bytes += entry->length;
+		usb_autopm_put_interface_async(dev->intf);
 	} else {
 		dev->net->stats.tx_errors++;
 
 		switch (urb->status) {
 		case -EPIPE:
+			/* we do not allow autosuspension */
 			usbnet_defer_kevent (dev, EVENT_TX_HALT);
 			break;
 
 		/* software-driven interface shutdown */
 		case -ECONNRESET:		// async unlink
 		case -ESHUTDOWN:		// hardware gone
+			usb_autopm_put_interface_async(dev->intf);
 			break;
 
 		// like rx, tx gets controller i/o faults during khubd delays
@@ -971,6 +989,7 @@ static void tx_complete (struct urb *urb)
 		case -EPROTO:
 		case -ETIME:
 		case -EILSEQ:
+			usb_mark_last_busy(dev->udev);
 			if (!timer_pending (&dev->delay)) {
 				mod_timer (&dev->delay,
 					jiffies + THROTTLE_JIFFIES);
@@ -979,8 +998,10 @@ static void tx_complete (struct urb *urb)
 							urb->status);
 			}
 			netif_stop_queue (dev->net);
+			usb_autopm_put_interface_async(dev->intf);
 			break;
 		default:
+			usb_autopm_put_interface_async(dev->intf);
 			if (netif_msg_tx_err (dev))
 				devdbg (dev, "tx err %d", entry->urb->status);
 			break;
@@ -1058,6 +1079,23 @@ netdev_tx_t usbnet_start_xmit (struct sk_buff *skb,
 	}
 
 	spin_lock_irqsave (&dev->txq.lock, flags);
+	retval = usb_autopm_get_interface_async(dev->intf);
+	if (retval < 0) {
+		spin_unlock_irqrestore (&dev->txq.lock, flags);
+		goto drop;
+	}
+
+#ifdef CONFIG_PM
+	/* if this triggers the device is still a sleep */
+	if (test_bit(EVENT_DEV_ASLEEP, &dev->flags)) {
+		/* transmission will be done in resume */
+		dev->deferred = urb;
+		/* no use to process more packets */
+		netif_stop_queue(net);
+		spin_unlock_irqrestore(&dev->txq.lock, flags);
+		goto deferred;
+	}
+#endif
 
 	switch ((retval = usb_submit_urb (urb, GFP_ATOMIC))) {
 	case -EPIPE:
@@ -1088,6 +1126,7 @@ drop:
 		devdbg (dev, "> tx, len %d, type 0x%x",
 			length, skb->protocol);
 	}
+deferred:
 	return NETDEV_TX_OK;
 }
 EXPORT_SYMBOL_GPL(usbnet_start_xmit);
@@ -1363,13 +1402,23 @@ int usbnet_suspend (struct usb_interface *intf, pm_message_t message)
 	struct usbnet		*dev = usb_get_intfdata(intf);
 
 	if (!dev->suspend_count++) {
+		spin_lock_irq(&dev->txq.lock);
+		/* don't autosuspend while transmitting */
+		if (dev->txq.qlen && (message.event & PM_EVENT_AUTO)) {
+			spin_unlock_irq(&dev->txq.lock);
+			return -EBUSY;
+		} else {
+			set_bit(EVENT_DEV_ASLEEP, &dev->flags);
+			spin_unlock_irq(&dev->txq.lock);
+		}
 		/*
 		 * accelerate emptying of the rx and queues, to avoid
 		 * having everything error out.
 		 */
 		netif_device_detach (dev->net);
-		(void) unlink_urbs (dev, &dev->rxq);
-		(void) unlink_urbs (dev, &dev->txq);
+		usbnet_terminate_urbs(dev);
+		usb_kill_urb(dev->interrupt);
+		
 		/*
 		 * reattach so runtime management can use and
 		 * wake the device
@@ -1383,10 +1432,32 @@ EXPORT_SYMBOL_GPL(usbnet_suspend);
 int usbnet_resume (struct usb_interface *intf)
 {
 	struct usbnet		*dev = usb_get_intfdata(intf);
-
-	if (!--dev->suspend_count)
+	struct sk_buff          *skb;
+	struct urb              *res;
+	int                     retval;
+	
+	if (!--dev->suspend_count) {
+		spin_lock_irq(&dev->txq.lock);
+		res = dev->deferred;
+		dev->deferred = NULL;
+		clear_bit(EVENT_DEV_ASLEEP, &dev->flags);
+		spin_unlock_irq(&dev->txq.lock);
+		if (res) {
+			retval = usb_submit_urb(res, GFP_NOIO);
+			if (retval < 0) {
+				usb_free_urb(res);
+				netif_start_queue(dev->net);
+				usb_autopm_put_interface_async(dev->intf);
+			} else {
+				skb = (struct sk_buff *)res->context;
+				dev->net->trans_start = jiffies;
+				__skb_queue_tail (&dev->txq, skb);
+				if (!(dev->txq.qlen >= TX_QLEN(dev)))
+					netif_start_queue(dev->net);
+			}
+		}
 		tasklet_schedule (&dev->bh);
-
+	}
 	return 0;
 }
 EXPORT_SYMBOL_GPL(usbnet_resume);
diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
index f814730..e418c0b 100644
--- a/include/linux/usb/usbnet.h
+++ b/include/linux/usb/usbnet.h
@@ -55,6 +55,7 @@ struct usbnet {
 	struct sk_buff_head	done;
 	struct sk_buff_head	rxq_pause;
 	struct urb		*interrupt;
+	struct urb		*deferred;
 	struct tasklet_struct	bh;
 
 	struct work_struct	kevent;
@@ -65,6 +66,8 @@ struct usbnet {
 #		define EVENT_STS_SPLIT	3
 #		define EVENT_LINK_RESET	4
 #		define EVENT_RX_PAUSED	5
+#             define EVENT_DEV_WAKING 6 
+#             define EVENT_DEV_ASLEEP 7 
 };
 
 static inline struct usb_driver *driver_of(struct usb_interface *intf)
@@ -107,6 +110,9 @@ struct driver_info {
 	/* see if peer is connected ... can sleep */
 	int	(*check_connect)(struct usbnet *);
 
+	/* (dis)activate runtime power management */
+	int	(*manage_power)(struct usbnet *, int);
+
 	/* for status polling */
 	void	(*status)(struct usbnet *, struct urb *);
 


^ permalink raw reply related

* Re: [PATCH 3/9] pcmcia: use pcmcia_loop_config in misc pcmcia drivers
From: Jiri Kosina @ 2009-10-19 13:45 UTC (permalink / raw)
  To: Dominik Brodowski
  Cc: linux-pcmcia, David S. Miller, John W. Linville, David Sterba,
	netdev, linux-wireless
In-Reply-To: <1255907255-28297-3-git-send-email-linux@dominikbrodowski.net>

On Mon, 19 Oct 2009, Dominik Brodowski wrote:

> Use pcmcia_loop_config() in a few drivers missed during the first
> round. On fmvj18x_cs.c it -- strangely -- only requries us to set
> conf.ConfigIndex, which is done by the core, so include an empty
> loop function which returns 0 unconditionally.
> 
> CC: David S. Miller <davem@davemloft.net>
> CC: John W. Linville <linville@tuxdriver.com>
> CC: Jiri Kosina <jkosina@suse.cz>
> CC: David Sterba <dsterba@suse.cz>
> CC: netdev@vger.kernel.org
> CC: linux-wireless@vger.kernel.org
> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
> ---
>  drivers/char/pcmcia/ipwireless/main.c |  103 +++++++--------------------------

For the ipwireless part

	Acked-by: Jiri Kosina <jkosina@suse.cz>

Thanks,

-- 
Jiri Kosina
SUSE Labs, Novell Inc.

^ permalink raw reply

* power management for zaurus
From: Oliver Neukum @ 2009-10-19 13:40 UTC (permalink / raw)
  To: pavel-+ZI9xUNit7I, David Brownell,
	linux-usb-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 3202 bytes --]

Hi,

could somebody with a zaurus test this patch?
It introduces aggressive usb autosuspend for the devices.
It depends on the attached basic support for usbnet.

	Regards
		Oliver

--

commit ce0be29fc149b0e178a47a9a2380ef2be52ea7c6
Author: Oliver Neukum <oliver-GvhC2dPhHPQdnm+yROfE0A@public.gmane.org>
Date:   Mon Oct 19 13:55:56 2009 +0200

    zaurus & rndis_host autosuspend

diff --git a/drivers/net/usb/rndis_host.c b/drivers/net/usb/rndis_host.c
index 0caa800..4630703 100644
--- a/drivers/net/usb/rndis_host.c
+++ b/drivers/net/usb/rndis_host.c
@@ -571,12 +571,19 @@ fill:
 }
 EXPORT_SYMBOL_GPL(rndis_tx_fixup);
 
+static int rndis_manage_power(struct usbnet *dev, int on)
+{
+	dev->intf->needs_remote_wakeup = on;
+	return 0;
+}
+
 
 static const struct driver_info	rndis_info = {
 	.description =	"RNDIS device",
 	.flags =	FLAG_ETHER | FLAG_FRAMING_RN | FLAG_NO_SETINT,
 	.bind =		rndis_bind,
 	.unbind =	rndis_unbind,
+	.manage_power =	rndis_manage_power,
 	.status =	rndis_status,
 	.rx_fixup =	rndis_rx_fixup,
 	.tx_fixup =	rndis_tx_fixup,
@@ -609,6 +616,7 @@ static struct usb_driver rndis_driver = {
 	.disconnect =	usbnet_disconnect,
 	.suspend =	usbnet_suspend,
 	.resume =	usbnet_resume,
+	.supports_autosuspend = 1,
 };
 
 static int __init rndis_init(void)
diff --git a/drivers/net/usb/zaurus.c b/drivers/net/usb/zaurus.c
index 04882c8..015c49d 100644
--- a/drivers/net/usb/zaurus.c
+++ b/drivers/net/usb/zaurus.c
@@ -94,6 +94,12 @@ static int zaurus_bind(struct usbnet *dev, struct usb_interface *intf)
 	return usbnet_generic_cdc_bind(dev, intf);
 }
 
+static int zaurus_manage_power(struct usbnet *dev, int on)
+{
+	dev->intf->needs_remote_wakeup = on;
+	return 0;
+}
+
 /* PDA style devices are always connected if present */
 static int always_connected (struct usbnet *dev)
 {
@@ -106,6 +112,7 @@ static const struct driver_info	zaurus_sl5x00_info = {
 	.check_connect = always_connected,
 	.bind =		zaurus_bind,
 	.unbind =	usbnet_cdc_unbind,
+	.manage_power =	zaurus_manage_power,
 	.tx_fixup =	zaurus_tx_fixup,
 };
 #define	ZAURUS_STRONGARM_INFO	((unsigned long)&zaurus_sl5x00_info)
@@ -116,6 +123,7 @@ static const struct driver_info	zaurus_pxa_info = {
 	.check_connect = always_connected,
 	.bind =		zaurus_bind,
 	.unbind =	usbnet_cdc_unbind,
+	.manage_power =	zaurus_manage_power,
 	.tx_fixup =	zaurus_tx_fixup,
 };
 #define	ZAURUS_PXA_INFO		((unsigned long)&zaurus_pxa_info)
@@ -126,6 +134,7 @@ static const struct driver_info	olympus_mxl_info = {
 	.check_connect = always_connected,
 	.bind =		zaurus_bind,
 	.unbind =	usbnet_cdc_unbind,
+	.manage_power =	zaurus_manage_power,
 	.tx_fixup =	zaurus_tx_fixup,
 };
 #define	OLYMPUS_MXL_INFO	((unsigned long)&olympus_mxl_info)
@@ -262,6 +271,7 @@ static const struct driver_info	bogus_mdlm_info = {
 	.check_connect = always_connected,
 	.tx_fixup =	zaurus_tx_fixup,
 	.bind =		blan_mdlm_bind,
+	.manage_power = zaurus_manage_power
 };
 
 static const struct usb_device_id	products [] = {
@@ -370,6 +380,7 @@ static struct usb_driver zaurus_driver = {
 	.disconnect =	usbnet_disconnect,
 	.suspend =	usbnet_suspend,
 	.resume =	usbnet_resume,
+	.supports_autosuspend = 1,
 };
 
 static int __init zaurus_init(void)


[-- Attachment #2: usbnet_tested_auto.diff --]
[-- Type: text/x-patch, Size: 9652 bytes --]

commit 956c214d266fc1764ceb931b039c7aadded4eb24
Author: Oliver Neukum <oliver-GvhC2dPhHPQdnm+yROfE0A@public.gmane.org>
Date:   Mon Oct 19 15:07:54 2009 +0200

    usb:usbnet&cdc-ether:full aggressive autosuspend
    
    autosuspend for cdc-ether devices while online if
    the device supports remote wakeup

diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
index 4a6aff5..8ee5bd7 100644
--- a/drivers/net/usb/cdc_ether.c
+++ b/drivers/net/usb/cdc_ether.c
@@ -411,6 +411,12 @@ static int cdc_bind(struct usbnet *dev, struct usb_interface *intf)
 	return 0;
 }
 
+static int cdc_manage_power(struct usbnet *dev, int on)
+{
+	dev->intf->needs_remote_wakeup = on;
+	return 0;
+}
+
 static const struct driver_info	cdc_info = {
 	.description =	"CDC Ethernet Device",
 	.flags =	FLAG_ETHER,
@@ -418,6 +424,7 @@ static const struct driver_info	cdc_info = {
 	.bind =		cdc_bind,
 	.unbind =	usbnet_cdc_unbind,
 	.status =	cdc_status,
+	.manage_power =	cdc_manage_power,
 };
 
 /*-------------------------------------------------------------------------*/
@@ -570,6 +577,7 @@ static struct usb_driver cdc_driver = {
 	.disconnect =	usbnet_disconnect,
 	.suspend =	usbnet_suspend,
 	.resume =	usbnet_resume,
+	.supports_autosuspend = 1,
 };
 
 
diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index ca5ca5a..c9938d5 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -353,7 +353,8 @@ static void rx_submit (struct usbnet *dev, struct urb *urb, gfp_t flags)
 
 	if (netif_running (dev->net)
 			&& netif_device_present (dev->net)
-			&& !test_bit (EVENT_RX_HALT, &dev->flags)) {
+			&& !test_bit (EVENT_RX_HALT, &dev->flags)
+			&& !test_bit (EVENT_DEV_ASLEEP, &dev->flags)) {
 		switch (retval = usb_submit_urb (urb, GFP_ATOMIC)) {
 		case -EPIPE:
 			usbnet_defer_kevent (dev, EVENT_RX_HALT);
@@ -611,15 +612,36 @@ EXPORT_SYMBOL_GPL(usbnet_unlink_rx_urbs);
 /*-------------------------------------------------------------------------*/
 
 // precondition: never called in_interrupt
+static void usbnet_terminate_urbs(struct usbnet *dev)
+{
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK (unlink_wakeup);
+	DECLARE_WAITQUEUE (wait, current);
+	int temp;
+
+	/* ensure there are no more active urbs */
+	add_wait_queue(&unlink_wakeup, &wait);
+	dev->wait = &unlink_wakeup;
+	temp = unlink_urbs(dev, &dev->txq) +
+		unlink_urbs(dev, &dev->rxq);
+
+	/* maybe wait for deletions to finish. */
+	while (!skb_queue_empty(&dev->rxq)
+		&& !skb_queue_empty(&dev->txq)
+		&& !skb_queue_empty(&dev->done)) {
+			schedule_timeout(UNLINK_TIMEOUT_MS);
+			if (netif_msg_ifdown(dev))
+				devdbg(dev, "waited for %d urb completions",
+					temp);
+	}
+	dev->wait = NULL;
+	remove_wait_queue(&unlink_wakeup, &wait);
+}
 
 int usbnet_stop (struct net_device *net)
 {
 	struct usbnet		*dev = netdev_priv(net);
 	struct driver_info	*info = dev->driver_info;
-	int			temp;
 	int			retval;
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK (unlink_wakeup);
-	DECLARE_WAITQUEUE (wait, current);
 
 	netif_stop_queue (net);
 
@@ -641,25 +663,8 @@ int usbnet_stop (struct net_device *net)
 				info->description);
 	}
 
-	if (!(info->flags & FLAG_AVOID_UNLINK_URBS)) {
-		/* ensure there are no more active urbs */
-		add_wait_queue(&unlink_wakeup, &wait);
-		dev->wait = &unlink_wakeup;
-		temp = unlink_urbs(dev, &dev->txq) +
-			unlink_urbs(dev, &dev->rxq);
-
-		/* maybe wait for deletions to finish. */
-		while (!skb_queue_empty(&dev->rxq)
-				&& !skb_queue_empty(&dev->txq)
-				&& !skb_queue_empty(&dev->done)) {
-			msleep(UNLINK_TIMEOUT_MS);
-			if (netif_msg_ifdown(dev))
-				devdbg(dev, "waited for %d urb completions",
-					temp);
-		}
-		dev->wait = NULL;
-		remove_wait_queue(&unlink_wakeup, &wait);
-	}
+	if (!(info->flags & FLAG_AVOID_UNLINK_URBS))
+		usbnet_terminate_urbs(dev);
 
 	usb_kill_urb(dev->interrupt);
 
@@ -672,7 +677,10 @@ int usbnet_stop (struct net_device *net)
 	dev->flags = 0;
 	del_timer_sync (&dev->delay);
 	tasklet_kill (&dev->bh);
-	usb_autopm_put_interface(dev->intf);
+	if (info->manage_power)
+		info->manage_power(dev, 0);
+	else
+		usb_autopm_put_interface(dev->intf);
 
 	return 0;
 }
@@ -753,6 +761,12 @@ int usbnet_open (struct net_device *net)
 
 	// delay posting reads until we're fully open
 	tasklet_schedule (&dev->bh);
+	if (info->manage_power) {
+		retval = info->manage_power(dev, 1);
+		if (retval < 0)
+			goto done;
+		usb_autopm_put_interface(dev->intf);
+	}
 	return retval;
 done:
 	usb_autopm_put_interface(dev->intf);
@@ -882,6 +896,7 @@ kevent (struct work_struct *work)
 	if (test_bit (EVENT_TX_HALT, &dev->flags)) {
 		unlink_urbs (dev, &dev->txq);
 		status = usb_clear_halt (dev->udev, dev->out);
+		usb_autopm_put_interface(dev->intf);
 		if (status < 0
 				&& status != -EPIPE
 				&& status != -ESHUTDOWN) {
@@ -953,17 +968,20 @@ static void tx_complete (struct urb *urb)
 	if (urb->status == 0) {
 		dev->net->stats.tx_packets++;
 		dev->net->stats.tx_bytes += entry->length;
+		usb_autopm_put_interface_async(dev->intf);
 	} else {
 		dev->net->stats.tx_errors++;
 
 		switch (urb->status) {
 		case -EPIPE:
+			/* we do not allow autosuspension */
 			usbnet_defer_kevent (dev, EVENT_TX_HALT);
 			break;
 
 		/* software-driven interface shutdown */
 		case -ECONNRESET:		// async unlink
 		case -ESHUTDOWN:		// hardware gone
+			usb_autopm_put_interface_async(dev->intf);
 			break;
 
 		// like rx, tx gets controller i/o faults during khubd delays
@@ -971,6 +989,7 @@ static void tx_complete (struct urb *urb)
 		case -EPROTO:
 		case -ETIME:
 		case -EILSEQ:
+			usb_mark_last_busy(dev->udev);
 			if (!timer_pending (&dev->delay)) {
 				mod_timer (&dev->delay,
 					jiffies + THROTTLE_JIFFIES);
@@ -979,8 +998,10 @@ static void tx_complete (struct urb *urb)
 							urb->status);
 			}
 			netif_stop_queue (dev->net);
+			usb_autopm_put_interface_async(dev->intf);
 			break;
 		default:
+			usb_autopm_put_interface_async(dev->intf);
 			if (netif_msg_tx_err (dev))
 				devdbg (dev, "tx err %d", entry->urb->status);
 			break;
@@ -1058,6 +1079,23 @@ netdev_tx_t usbnet_start_xmit (struct sk_buff *skb,
 	}
 
 	spin_lock_irqsave (&dev->txq.lock, flags);
+	retval = usb_autopm_get_interface_async(dev->intf);
+	if (retval < 0) {
+		spin_unlock_irqrestore (&dev->txq.lock, flags);
+		goto drop;
+	}
+
+#ifdef CONFIG_PM
+	/* if this triggers the device is still a sleep */
+	if (test_bit(EVENT_DEV_ASLEEP, &dev->flags)) {
+		/* transmission will be done in resume */
+		dev->deferred = urb;
+		/* no use to process more packets */
+		netif_stop_queue(net);
+		spin_unlock_irqrestore(&dev->txq.lock, flags);
+		goto deferred;
+	}
+#endif
 
 	switch ((retval = usb_submit_urb (urb, GFP_ATOMIC))) {
 	case -EPIPE:
@@ -1088,6 +1126,7 @@ drop:
 		devdbg (dev, "> tx, len %d, type 0x%x",
 			length, skb->protocol);
 	}
+deferred:
 	return NETDEV_TX_OK;
 }
 EXPORT_SYMBOL_GPL(usbnet_start_xmit);
@@ -1363,13 +1402,23 @@ int usbnet_suspend (struct usb_interface *intf, pm_message_t message)
 	struct usbnet		*dev = usb_get_intfdata(intf);
 
 	if (!dev->suspend_count++) {
+		spin_lock_irq(&dev->txq.lock);
+		/* don't autosuspend while transmitting */
+		if (dev->txq.qlen && (message.event & PM_EVENT_AUTO)) {
+			spin_unlock_irq(&dev->txq.lock);
+			return -EBUSY;
+		} else {
+			set_bit(EVENT_DEV_ASLEEP, &dev->flags);
+			spin_unlock_irq(&dev->txq.lock);
+		}
 		/*
 		 * accelerate emptying of the rx and queues, to avoid
 		 * having everything error out.
 		 */
 		netif_device_detach (dev->net);
-		(void) unlink_urbs (dev, &dev->rxq);
-		(void) unlink_urbs (dev, &dev->txq);
+		usbnet_terminate_urbs(dev);
+		usb_kill_urb(dev->interrupt);
+		
 		/*
 		 * reattach so runtime management can use and
 		 * wake the device
@@ -1383,10 +1432,32 @@ EXPORT_SYMBOL_GPL(usbnet_suspend);
 int usbnet_resume (struct usb_interface *intf)
 {
 	struct usbnet		*dev = usb_get_intfdata(intf);
-
-	if (!--dev->suspend_count)
+	struct sk_buff          *skb;
+	struct urb              *res;
+	int                     retval;
+	
+	if (!--dev->suspend_count) {
+		spin_lock_irq(&dev->txq.lock);
+		res = dev->deferred;
+		dev->deferred = NULL;
+		clear_bit(EVENT_DEV_ASLEEP, &dev->flags);
+		spin_unlock_irq(&dev->txq.lock);
+		if (res) {
+			retval = usb_submit_urb(res, GFP_NOIO);
+			if (retval < 0) {
+				usb_free_urb(res);
+				netif_start_queue(dev->net);
+				usb_autopm_put_interface_async(dev->intf);
+			} else {
+				skb = (struct sk_buff *)res->context;
+				dev->net->trans_start = jiffies;
+				__skb_queue_tail (&dev->txq, skb);
+				if (!(dev->txq.qlen >= TX_QLEN(dev)))
+					netif_start_queue(dev->net);
+			}
+		}
 		tasklet_schedule (&dev->bh);
-
+	}
 	return 0;
 }
 EXPORT_SYMBOL_GPL(usbnet_resume);
diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
index f814730..e418c0b 100644
--- a/include/linux/usb/usbnet.h
+++ b/include/linux/usb/usbnet.h
@@ -55,6 +55,7 @@ struct usbnet {
 	struct sk_buff_head	done;
 	struct sk_buff_head	rxq_pause;
 	struct urb		*interrupt;
+	struct urb		*deferred;
 	struct tasklet_struct	bh;
 
 	struct work_struct	kevent;
@@ -65,6 +66,8 @@ struct usbnet {
 #		define EVENT_STS_SPLIT	3
 #		define EVENT_LINK_RESET	4
 #		define EVENT_RX_PAUSED	5
+#             define EVENT_DEV_WAKING 6 
+#             define EVENT_DEV_ASLEEP 7 
 };
 
 static inline struct usb_driver *driver_of(struct usb_interface *intf)
@@ -107,6 +110,9 @@ struct driver_info {
 	/* see if peer is connected ... can sleep */
 	int	(*check_connect)(struct usbnet *);
 
+	/* (dis)activate runtime power management */
+	int	(*manage_power)(struct usbnet *, int);
+
 	/* for status polling */
 	void	(*status)(struct usbnet *, struct urb *);
 

^ permalink raw reply related

* [net-next PATCH 1/1] qlge: Size RX buffers based on MTU.
From: Ron Mercer @ 2009-10-19 13:32 UTC (permalink / raw)
  To: davem; +Cc: netdev, ron.mercer
In-Reply-To: <20091019132902.GB14919@linux-ox1b.qlogic.org>

Change RX large buffer size based on MTU. If pages are larger
than the MTU the page is divided up into multiple chunks and passed to
the hardware.  When pages are smaller than MTU each RX buffer can
contain be comprised of up to 2 pages.

Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
---
 drivers/net/qlge/qlge.h      |   15 ++-
 drivers/net/qlge/qlge_main.c |  273 +++++++++++++++++++++++++++++++-----------
 2 files changed, 214 insertions(+), 74 deletions(-)

diff --git a/drivers/net/qlge/qlge.h b/drivers/net/qlge/qlge.h
index fd47691..9cdf8ff 100644
--- a/drivers/net/qlge/qlge.h
+++ b/drivers/net/qlge/qlge.h
@@ -56,7 +56,8 @@
 		MAX_DB_PAGES_PER_BQ(NUM_LARGE_BUFFERS) * sizeof(u64))
 #define SMALL_BUFFER_SIZE 512
 #define SMALL_BUF_MAP_SIZE (SMALL_BUFFER_SIZE / 2)
-#define LARGE_BUFFER_SIZE	PAGE_SIZE
+#define LARGE_BUFFER_MAX_SIZE 8192
+#define LARGE_BUFFER_MIN_SIZE 2048
 #define MAX_SPLIT_SIZE 1023
 #define QLGE_SB_PAD 32
 
@@ -1201,9 +1202,17 @@ struct tx_ring_desc {
 	struct tx_ring_desc *next;
 };
 
+struct page_chunk {
+	struct page *page;	/* master page */
+	char *va;		/* virt addr for this chunk */
+	u64 map;		/* mapping for master */
+	unsigned int offset;	/* offset for this chunk */
+	unsigned int last_flag; /* flag set for last chunk in page */
+};
+
 struct bq_desc {
 	union {
-		struct page *lbq_page;
+		struct page_chunk pg_chunk;
 		struct sk_buff *skb;
 	} p;
 	__le64 *addr;
@@ -1272,6 +1281,7 @@ struct rx_ring {
 	dma_addr_t lbq_base_dma;
 	void *lbq_base_indirect;
 	dma_addr_t lbq_base_indirect_dma;
+	struct page_chunk pg_chunk; /* current page for chunks */
 	struct bq_desc *lbq;	/* array of control blocks */
 	void __iomem *lbq_prod_idx_db_reg;	/* PCI doorbell mem area + 0x18 */
 	u32 lbq_prod_idx;	/* current sw prod idx */
@@ -1526,6 +1536,7 @@ struct ql_adapter {
 
 	struct rx_ring rx_ring[MAX_RX_RINGS];
 	struct tx_ring tx_ring[MAX_TX_RINGS];
+	unsigned int lbq_buf_order;
 
 	int rx_csum;
 	u32 default_rx_queue;
diff --git a/drivers/net/qlge/qlge_main.c b/drivers/net/qlge/qlge_main.c
index 9eefb11..4935710 100644
--- a/drivers/net/qlge/qlge_main.c
+++ b/drivers/net/qlge/qlge_main.c
@@ -1025,6 +1025,11 @@ end:
 	return status;
 }
 
+static inline unsigned int ql_lbq_block_size(struct ql_adapter *qdev)
+{
+	return PAGE_SIZE << qdev->lbq_buf_order;
+}
+
 /* Get the next large buffer. */
 static struct bq_desc *ql_get_curr_lbuf(struct rx_ring *rx_ring)
 {
@@ -1036,6 +1041,28 @@ static struct bq_desc *ql_get_curr_lbuf(struct rx_ring *rx_ring)
 	return lbq_desc;
 }
 
+static struct bq_desc *ql_get_curr_lchunk(struct ql_adapter *qdev,
+		struct rx_ring *rx_ring)
+{
+	struct bq_desc *lbq_desc = ql_get_curr_lbuf(rx_ring);
+
+	pci_dma_sync_single_for_cpu(qdev->pdev,
+					pci_unmap_addr(lbq_desc, mapaddr),
+				    rx_ring->lbq_buf_size,
+					PCI_DMA_FROMDEVICE);
+
+	/* If it's the last chunk of our master page then
+	 * we unmap it.
+	 */
+	if ((lbq_desc->p.pg_chunk.offset + rx_ring->lbq_buf_size)
+					== ql_lbq_block_size(qdev))
+		pci_unmap_page(qdev->pdev,
+				lbq_desc->p.pg_chunk.map,
+				ql_lbq_block_size(qdev),
+				PCI_DMA_FROMDEVICE);
+	return lbq_desc;
+}
+
 /* Get the next small buffer. */
 static struct bq_desc *ql_get_curr_sbuf(struct rx_ring *rx_ring)
 {
@@ -1063,6 +1090,53 @@ static void ql_write_cq_idx(struct rx_ring *rx_ring)
 	ql_write_db_reg(rx_ring->cnsmr_idx, rx_ring->cnsmr_idx_db_reg);
 }
 
+static int ql_get_next_chunk(struct ql_adapter *qdev, struct rx_ring *rx_ring,
+						struct bq_desc *lbq_desc)
+{
+	if (!rx_ring->pg_chunk.page) {
+		u64 map;
+		rx_ring->pg_chunk.page = alloc_pages(__GFP_COLD | __GFP_COMP |
+						GFP_ATOMIC,
+						qdev->lbq_buf_order);
+		if (unlikely(!rx_ring->pg_chunk.page)) {
+			QPRINTK(qdev, DRV, ERR,
+				"page allocation failed.\n");
+			return -ENOMEM;
+		}
+		rx_ring->pg_chunk.offset = 0;
+		map = pci_map_page(qdev->pdev, rx_ring->pg_chunk.page,
+					0, ql_lbq_block_size(qdev),
+					PCI_DMA_FROMDEVICE);
+		if (pci_dma_mapping_error(qdev->pdev, map)) {
+			__free_pages(rx_ring->pg_chunk.page,
+					qdev->lbq_buf_order);
+			QPRINTK(qdev, DRV, ERR,
+				"PCI mapping failed.\n");
+			return -ENOMEM;
+		}
+		rx_ring->pg_chunk.map = map;
+		rx_ring->pg_chunk.va = page_address(rx_ring->pg_chunk.page);
+	}
+
+	/* Copy the current master pg_chunk info
+	 * to the current descriptor.
+	 */
+	lbq_desc->p.pg_chunk = rx_ring->pg_chunk;
+
+	/* Adjust the master page chunk for next
+	 * buffer get.
+	 */
+	rx_ring->pg_chunk.offset += rx_ring->lbq_buf_size;
+	if (rx_ring->pg_chunk.offset == ql_lbq_block_size(qdev)) {
+		rx_ring->pg_chunk.page = NULL;
+		lbq_desc->p.pg_chunk.last_flag = 1;
+	} else {
+		rx_ring->pg_chunk.va += rx_ring->lbq_buf_size;
+		get_page(rx_ring->pg_chunk.page);
+		lbq_desc->p.pg_chunk.last_flag = 0;
+	}
+	return 0;
+}
 /* Process (refill) a large buffer queue. */
 static void ql_update_lbq(struct ql_adapter *qdev, struct rx_ring *rx_ring)
 {
@@ -1072,39 +1146,28 @@ static void ql_update_lbq(struct ql_adapter *qdev, struct rx_ring *rx_ring)
 	u64 map;
 	int i;
 
-	while (rx_ring->lbq_free_cnt > 16) {
+	while (rx_ring->lbq_free_cnt > 32) {
 		for (i = 0; i < 16; i++) {
 			QPRINTK(qdev, RX_STATUS, DEBUG,
 				"lbq: try cleaning clean_idx = %d.\n",
 				clean_idx);
 			lbq_desc = &rx_ring->lbq[clean_idx];
-			if (lbq_desc->p.lbq_page == NULL) {
-				QPRINTK(qdev, RX_STATUS, DEBUG,
-					"lbq: getting new page for index %d.\n",
-					lbq_desc->index);
-				lbq_desc->p.lbq_page = alloc_page(GFP_ATOMIC);
-				if (lbq_desc->p.lbq_page == NULL) {
-					rx_ring->lbq_clean_idx = clean_idx;
-					QPRINTK(qdev, RX_STATUS, ERR,
-						"Couldn't get a page.\n");
-					return;
-				}
-				map = pci_map_page(qdev->pdev,
-						   lbq_desc->p.lbq_page,
-						   0, PAGE_SIZE,
-						   PCI_DMA_FROMDEVICE);
-				if (pci_dma_mapping_error(qdev->pdev, map)) {
-					rx_ring->lbq_clean_idx = clean_idx;
-					put_page(lbq_desc->p.lbq_page);
-					lbq_desc->p.lbq_page = NULL;
-					QPRINTK(qdev, RX_STATUS, ERR,
-						"PCI mapping failed.\n");
+			if (ql_get_next_chunk(qdev, rx_ring, lbq_desc)) {
+				QPRINTK(qdev, IFUP, ERR,
+					"Could not get a page chunk.\n");
 					return;
 				}
+
+			map = lbq_desc->p.pg_chunk.map +
+				lbq_desc->p.pg_chunk.offset;
 				pci_unmap_addr_set(lbq_desc, mapaddr, map);
-				pci_unmap_len_set(lbq_desc, maplen, PAGE_SIZE);
+			pci_unmap_len_set(lbq_desc, maplen,
+					rx_ring->lbq_buf_size);
 				*lbq_desc->addr = cpu_to_le64(map);
-			}
+
+			pci_dma_sync_single_for_device(qdev->pdev, map,
+						rx_ring->lbq_buf_size,
+						PCI_DMA_FROMDEVICE);
 			clean_idx++;
 			if (clean_idx == rx_ring->lbq_len)
 				clean_idx = 0;
@@ -1480,27 +1543,24 @@ static struct sk_buff *ql_build_rx_skb(struct ql_adapter *qdev,
 			 * chain it to the header buffer's skb and let
 			 * it rip.
 			 */
-			lbq_desc = ql_get_curr_lbuf(rx_ring);
-			pci_unmap_page(qdev->pdev,
-				       pci_unmap_addr(lbq_desc,
-						      mapaddr),
-				       pci_unmap_len(lbq_desc, maplen),
-				       PCI_DMA_FROMDEVICE);
+			lbq_desc = ql_get_curr_lchunk(qdev, rx_ring);
 			QPRINTK(qdev, RX_STATUS, DEBUG,
-				"Chaining page to skb.\n");
-			skb_fill_page_desc(skb, 0, lbq_desc->p.lbq_page,
-					   0, length);
+				"Chaining page at offset = %d,"
+				"for %d bytes  to skb.\n",
+				lbq_desc->p.pg_chunk.offset, length);
+			skb_fill_page_desc(skb, 0, lbq_desc->p.pg_chunk.page,
+						lbq_desc->p.pg_chunk.offset,
+						length);
 			skb->len += length;
 			skb->data_len += length;
 			skb->truesize += length;
-			lbq_desc->p.lbq_page = NULL;
 		} else {
 			/*
 			 * The headers and data are in a single large buffer. We
 			 * copy it to a new skb and let it go. This can happen with
 			 * jumbo mtu on a non-TCP/UDP frame.
 			 */
-			lbq_desc = ql_get_curr_lbuf(rx_ring);
+			lbq_desc = ql_get_curr_lchunk(qdev, rx_ring);
 			skb = netdev_alloc_skb(qdev->ndev, length);
 			if (skb == NULL) {
 				QPRINTK(qdev, PROBE, DEBUG,
@@ -1515,13 +1575,14 @@ static struct sk_buff *ql_build_rx_skb(struct ql_adapter *qdev,
 			skb_reserve(skb, NET_IP_ALIGN);
 			QPRINTK(qdev, RX_STATUS, DEBUG,
 				"%d bytes of headers and data in large. Chain page to new skb and pull tail.\n", length);
-			skb_fill_page_desc(skb, 0, lbq_desc->p.lbq_page,
-					   0, length);
+			skb_fill_page_desc(skb, 0,
+						lbq_desc->p.pg_chunk.page,
+						lbq_desc->p.pg_chunk.offset,
+						length);
 			skb->len += length;
 			skb->data_len += length;
 			skb->truesize += length;
 			length -= length;
-			lbq_desc->p.lbq_page = NULL;
 			__pskb_pull_tail(skb,
 				(ib_mac_rsp->flags2 & IB_MAC_IOCB_RSP_V) ?
 				VLAN_ETH_HLEN : ETH_HLEN);
@@ -1538,8 +1599,7 @@ static struct sk_buff *ql_build_rx_skb(struct ql_adapter *qdev,
 		 *         frames.  If the MTU goes up we could
 		 *          eventually be in trouble.
 		 */
-		int size, offset, i = 0;
-		__le64 *bq, bq_array[8];
+		int size, i = 0;
 		sbq_desc = ql_get_curr_sbuf(rx_ring);
 		pci_unmap_single(qdev->pdev,
 				 pci_unmap_addr(sbq_desc, mapaddr),
@@ -1558,37 +1618,25 @@ static struct sk_buff *ql_build_rx_skb(struct ql_adapter *qdev,
 			QPRINTK(qdev, RX_STATUS, DEBUG,
 				"%d bytes of headers & data in chain of large.\n", length);
 			skb = sbq_desc->p.skb;
-			bq = &bq_array[0];
-			memcpy(bq, skb->data, sizeof(bq_array));
 			sbq_desc->p.skb = NULL;
 			skb_reserve(skb, NET_IP_ALIGN);
-		} else {
-			QPRINTK(qdev, RX_STATUS, DEBUG,
-				"Headers in small, %d bytes of data in chain of large.\n", length);
-			bq = (__le64 *)sbq_desc->p.skb->data;
 		}
 		while (length > 0) {
-			lbq_desc = ql_get_curr_lbuf(rx_ring);
-			pci_unmap_page(qdev->pdev,
-				       pci_unmap_addr(lbq_desc,
-						      mapaddr),
-				       pci_unmap_len(lbq_desc,
-						     maplen),
-				       PCI_DMA_FROMDEVICE);
-			size = (length < PAGE_SIZE) ? length : PAGE_SIZE;
-			offset = 0;
+			lbq_desc = ql_get_curr_lchunk(qdev, rx_ring);
+			size = (length < rx_ring->lbq_buf_size) ? length :
+				rx_ring->lbq_buf_size;
 
 			QPRINTK(qdev, RX_STATUS, DEBUG,
 				"Adding page %d to skb for %d bytes.\n",
 				i, size);
-			skb_fill_page_desc(skb, i, lbq_desc->p.lbq_page,
-					   offset, size);
+			skb_fill_page_desc(skb, i,
+						lbq_desc->p.pg_chunk.page,
+						lbq_desc->p.pg_chunk.offset,
+						size);
 			skb->len += size;
 			skb->data_len += size;
 			skb->truesize += size;
 			length -= size;
-			lbq_desc->p.lbq_page = NULL;
-			bq++;
 			i++;
 		}
 		__pskb_pull_tail(skb, (ib_mac_rsp->flags2 & IB_MAC_IOCB_RSP_V) ?
@@ -2304,20 +2352,29 @@ err:
 
 static void ql_free_lbq_buffers(struct ql_adapter *qdev, struct rx_ring *rx_ring)
 {
-	int i;
 	struct bq_desc *lbq_desc;
 
-	for (i = 0; i < rx_ring->lbq_len; i++) {
-		lbq_desc = &rx_ring->lbq[i];
-		if (lbq_desc->p.lbq_page) {
+	uint32_t  curr_idx, clean_idx;
+
+	curr_idx = rx_ring->lbq_curr_idx;
+	clean_idx = rx_ring->lbq_clean_idx;
+	while (curr_idx != clean_idx) {
+		lbq_desc = &rx_ring->lbq[curr_idx];
+
+		if (lbq_desc->p.pg_chunk.last_flag) {
 			pci_unmap_page(qdev->pdev,
-				       pci_unmap_addr(lbq_desc, mapaddr),
-				       pci_unmap_len(lbq_desc, maplen),
+				lbq_desc->p.pg_chunk.map,
+				ql_lbq_block_size(qdev),
 				       PCI_DMA_FROMDEVICE);
-
-			put_page(lbq_desc->p.lbq_page);
-			lbq_desc->p.lbq_page = NULL;
+			lbq_desc->p.pg_chunk.last_flag = 0;
 		}
+
+		put_page(lbq_desc->p.pg_chunk.page);
+		lbq_desc->p.pg_chunk.page = NULL;
+
+		if (++curr_idx == rx_ring->lbq_len)
+			curr_idx = 0;
+
 	}
 }
 
@@ -2615,6 +2672,7 @@ static int ql_start_rx_ring(struct ql_adapter *qdev, struct rx_ring *rx_ring)
 	/* Set up the shadow registers for this ring. */
 	rx_ring->prod_idx_sh_reg = shadow_reg;
 	rx_ring->prod_idx_sh_reg_dma = shadow_reg_dma;
+	*rx_ring->prod_idx_sh_reg = 0;
 	shadow_reg += sizeof(u64);
 	shadow_reg_dma += sizeof(u64);
 	rx_ring->lbq_base_indirect = shadow_reg;
@@ -3495,6 +3553,10 @@ static int ql_configure_rings(struct ql_adapter *qdev)
 	struct rx_ring *rx_ring;
 	struct tx_ring *tx_ring;
 	int cpu_cnt = min(MAX_CPUS, (int)num_online_cpus());
+	unsigned int lbq_buf_len = (qdev->ndev->mtu > 1500) ?
+		LARGE_BUFFER_MAX_SIZE : LARGE_BUFFER_MIN_SIZE;
+
+	qdev->lbq_buf_order = get_order(lbq_buf_len);
 
 	/* In a perfect world we have one RSS ring for each CPU
 	 * and each has it's own vector.  To do that we ask for
@@ -3542,7 +3604,10 @@ static int ql_configure_rings(struct ql_adapter *qdev)
 			rx_ring->lbq_len = NUM_LARGE_BUFFERS;
 			rx_ring->lbq_size =
 			    rx_ring->lbq_len * sizeof(__le64);
-			rx_ring->lbq_buf_size = LARGE_BUFFER_SIZE;
+			rx_ring->lbq_buf_size = (u16)lbq_buf_len;
+			QPRINTK(qdev, IFUP, DEBUG,
+				"lbq_buf_size %d, order = %d\n",
+				rx_ring->lbq_buf_size, qdev->lbq_buf_order);
 			rx_ring->sbq_len = NUM_SMALL_BUFFERS;
 			rx_ring->sbq_size =
 			    rx_ring->sbq_len * sizeof(__le64);
@@ -3592,14 +3657,63 @@ error_up:
 	return err;
 }
 
+static int ql_change_rx_buffers(struct ql_adapter *qdev)
+{
+	struct rx_ring *rx_ring;
+	int i, status;
+	u32 lbq_buf_len;
+
+	/* Wait for an oustanding reset to complete. */
+	if (!test_bit(QL_ADAPTER_UP, &qdev->flags)) {
+		int i = 3;
+		while (i-- && !test_bit(QL_ADAPTER_UP, &qdev->flags)) {
+			QPRINTK(qdev, IFUP, ERR,
+				 "Waiting for adapter UP...\n");
+			ssleep(1);
+		}
+
+		if (!i) {
+			QPRINTK(qdev, IFUP, ERR,
+			 "Timed out waiting for adapter UP\n");
+			return -ETIMEDOUT;
+		}
+	}
+
+	status = ql_adapter_down(qdev);
+	if (status)
+		goto error;
+
+	/* Get the new rx buffer size. */
+	lbq_buf_len = (qdev->ndev->mtu > 1500) ?
+		LARGE_BUFFER_MAX_SIZE : LARGE_BUFFER_MIN_SIZE;
+	qdev->lbq_buf_order = get_order(lbq_buf_len);
+
+	for (i = 0; i < qdev->rss_ring_count; i++) {
+		rx_ring = &qdev->rx_ring[i];
+		/* Set the new size. */
+		rx_ring->lbq_buf_size = lbq_buf_len;
+	}
+
+	status = ql_adapter_up(qdev);
+	if (status)
+		goto error;
+
+	return status;
+error:
+	QPRINTK(qdev, IFUP, ALERT,
+		"Driver up/down cycle failed, closing device.\n");
+	set_bit(QL_ADAPTER_UP, &qdev->flags);
+	dev_close(qdev->ndev);
+	return status;
+}
+
 static int qlge_change_mtu(struct net_device *ndev, int new_mtu)
 {
 	struct ql_adapter *qdev = netdev_priv(ndev);
+	int status;
 
 	if (ndev->mtu == 1500 && new_mtu == 9000) {
 		QPRINTK(qdev, IFUP, ERR, "Changing to jumbo MTU.\n");
-		queue_delayed_work(qdev->workqueue,
-				&qdev->mpi_port_cfg_work, 0);
 	} else if (ndev->mtu == 9000 && new_mtu == 1500) {
 		QPRINTK(qdev, IFUP, ERR, "Changing to normal MTU.\n");
 	} else if ((ndev->mtu == 1500 && new_mtu == 1500) ||
@@ -3607,8 +3721,23 @@ static int qlge_change_mtu(struct net_device *ndev, int new_mtu)
 		return 0;
 	} else
 		return -EINVAL;
+
+	queue_delayed_work(qdev->workqueue,
+			&qdev->mpi_port_cfg_work, 3*HZ);
+
+	if (!netif_running(qdev->ndev)) {
+		ndev->mtu = new_mtu;
+		return 0;
+	}
+
 	ndev->mtu = new_mtu;
-	return 0;
+	status = ql_change_rx_buffers(qdev);
+	if (status) {
+		QPRINTK(qdev, IFUP, ERR,
+			"Changing MTU failed.\n");
+	}
+
+	return status;
 }
 
 static struct net_device_stats *qlge_get_stats(struct net_device
-- 
1.6.0.2


^ permalink raw reply related

* Re: [net-next PATCH 0/3] qlge: Size RX buffers based on MTU.
From: Ron Mercer @ 2009-10-19 13:29 UTC (permalink / raw)
  To: David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <20091017.153751.193720868.davem@davemloft.net>

> 
> You should send them as patches that actually compile at each
> step of the way and therefore don't break bisection.

The changes are functionally a single patch which I will send shortly.
Thanks.

^ permalink raw reply

* Re: [PATCH] myri10ge: improve port type reporting in ethtool output
From: Andrew Gallatin @ 2009-10-19 13:30 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Brice Goglin, David S. Miller, Linux Network Development list
In-Reply-To: <1255957384.2782.2.camel@achroite>

Ben Hutchings wrote:
> On Mon, 2009-10-19 at 08:34 -0400, Andrew Gallatin wrote:
>> Ben Hutchings wrote:
>>
>>> Lying about link modes is not an improvement.
>> OK, so we're probably doing something wrong. I suspect we're not
>> alone.  At least we don't set SUPPORTED_TP for CX4, like I've
>> seen some NICs do.
>>
>> Can somebody suggest how we can tell ethtool that
>> the NIC supports 10Gb only (no autoneg down to 1Gb or lower)
>> for copper (10Gbase-CX4)?   How about for fiber (10Gbase-{S,L})R?
> 
> What's wrong with what you already do?  Customers expect to see
> something on the supported line?

Exactly.  One has complained because drivers for
other vendors NICs show this, even if they are fibre NICs
or CX4 NICs, and don't actually support 10GbaseT.

I'm happy to back this part out, and resubmit the patch without
it. There is still some fairly valuable stuff in the patch
-- mainly updating the NIC detection logic for new NICs to
detect fibre vs copper.

Drew

^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Michal Ostrowski @ 2009-10-19 13:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Denys Fedoryschenko, netdev, linux-ppp, paulus, mostrows,
	Cyrill Gorcunov
In-Reply-To: <4ADC5D3B.8010006@gmail.com>

The entire scheme for managing net namespaces seems unsafe.  We depend
on synchronization via pn->hash_lock, but have no guarantee of the
existence of the "net" object -- hence no way to ensure the existence
of the lock itself.  This should be relatively easy to fix though as
we should be able to get/put the net namespace as we add remove
objects to/from the pppoe hash.

Once you solve this existence issue, the flush_lock can be eliminated
altogether since all of the relevant code paths already depend on a
write_lock_bh(&pn->hash_lock), and that's the lock that should be use
to protect the pppoe_dev field.

Another patch to follow later...

--
Michal Ostrowski
mostrows@gmail.com



On Mon, Oct 19, 2009 at 7:36 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Michal Ostrowski a écrit :
>> Here's my theory on this after an inital look...
>>
>> Looking at the oops report and disassembly of the actual module binary
>> that caused the oops, one can deduce that:
>>
>> Execution was in pppoe_flush_dev().  %ebx contained the pointer "struct
>> pppox_sock *po", which is what we faulted on, excuting "cmp %eax, 0x190(%ebx)".
>> %ebx value was 0xffffffff (hence we got "NULL pointer dereference at 0x18f").
>>
>> At this point "i" (stored in %esi) is 15 (valid), meaning that we got a value
>> of 0xffffffff in pn->hash_table[i].
>>
>>>From this I'd hypothesize that the combination of dev_put() and release_sock()
>> may have allowed us to free "pn".  At the bottom of the loop we alreayd
>> recognize that since locks are dropped we're responsible for handling
>> invalidation of objects, and perhaps that should be extended to "pn" as well.
>> --
>> Michal Ostrowski
>> mostrows@gmail.com
>>
>>
>
> Looking at this stuff, I do believe flush_lock protection is not
> properly done.
>
> At the end of pppoe_connect() for example we can find :
>
> err_put:
>        if (po->pppoe_dev) {
>                dev_put(po->pppoe_dev);
>                po->pppoe_dev = NULL;
>        }
>
> This is done without any protection, and can therefore clash with
> pppoe_flush_dev() :
>
>        spin_lock(&flush_lock);
>        po->pppoe_dev = NULL; /* ppoe_dev can already be NULL before this point */
>        spin_unlock(&flush_lock);
>
>        dev_put(dev);    /* oops */
>

^ permalink raw reply

* Re: [PATCH] AF_UNIX: Fix deadlock on connecting to shutdown socket
From: David Miller @ 2009-10-19 13:14 UTC (permalink / raw)
  To: jarkao2
  Cc: tomoki.sekiyama.qu, linux-kernel, netdev, alan, satoshi.oshima.fk,
	hidehiro.kawai.ez, hideo.aoki.tk
In-Reply-To: <20091019115713.GB6869@ff.dom.local>

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon, 19 Oct 2009 11:57:13 +0000

> Isn't the shutdown call expected to change sk_state to TCP_CLOSE?

No, because the send side is still up and operational, it's
only a half duplex close.

^ permalink raw reply

* Re: [PATCH] myri10ge: improve port type reporting in ethtool output
From: Andrew Gallatin @ 2009-10-19 12:34 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Brice Goglin, David S. Miller, Linux Network Development list
In-Reply-To: <1255939929.3916.13.camel@localhost>

Ben Hutchings wrote:

> Lying about link modes is not an improvement.

OK, so we're probably doing something wrong. I suspect we're not
alone.  At least we don't set SUPPORTED_TP for CX4, like I've
seen some NICs do.

Can somebody suggest how we can tell ethtool that
the NIC supports 10Gb only (no autoneg down to 1Gb or lower)
for copper (10Gbase-CX4)?   How about for fiber (10Gbase-{S,L})R?

Thanks,

Drew

^ permalink raw reply

* Re: [PATCH] myri10ge: improve port type reporting in ethtool output
From: Ben Hutchings @ 2009-10-19 13:03 UTC (permalink / raw)
  To: Andrew Gallatin
  Cc: Brice Goglin, David S. Miller, Linux Network Development list
In-Reply-To: <4ADC5CB8.4010801@myri.com>

On Mon, 2009-10-19 at 08:34 -0400, Andrew Gallatin wrote:
> Ben Hutchings wrote:
> 
> > Lying about link modes is not an improvement.
> 
> OK, so we're probably doing something wrong. I suspect we're not
> alone.  At least we don't set SUPPORTED_TP for CX4, like I've
> seen some NICs do.
> 
> Can somebody suggest how we can tell ethtool that
> the NIC supports 10Gb only (no autoneg down to 1Gb or lower)
> for copper (10Gbase-CX4)?   How about for fiber (10Gbase-{S,L})R?

What's wrong with what you already do?  Customers expect to see
something on the supported line?

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Eric Dumazet @ 2009-10-19 12:36 UTC (permalink / raw)
  To: Michal Ostrowski; +Cc: Denys Fedoryschenko, netdev, linux-ppp, paulus, mostrows
In-Reply-To: <e6d1cecd0910182034t9d24859mc6f392875b36ad17@mail.gmail.com>

Michal Ostrowski a écrit :
> Here's my theory on this after an inital look...
> 
> Looking at the oops report and disassembly of the actual module binary
> that caused the oops, one can deduce that:
> 
> Execution was in pppoe_flush_dev().  %ebx contained the pointer "struct
> pppox_sock *po", which is what we faulted on, excuting "cmp %eax, 0x190(%ebx)".
> %ebx value was 0xffffffff (hence we got "NULL pointer dereference at 0x18f").
> 
> At this point "i" (stored in %esi) is 15 (valid), meaning that we got a value
> of 0xffffffff in pn->hash_table[i].
> 
>>From this I'd hypothesize that the combination of dev_put() and release_sock()
> may have allowed us to free "pn".  At the bottom of the loop we alreayd
> recognize that since locks are dropped we're responsible for handling
> invalidation of objects, and perhaps that should be extended to "pn" as well.
> --
> Michal Ostrowski
> mostrows@gmail.com
> 
> 

Looking at this stuff, I do believe flush_lock protection is not
properly done.

At the end of pppoe_connect() for example we can find :

err_put:
        if (po->pppoe_dev) {
                dev_put(po->pppoe_dev);
                po->pppoe_dev = NULL;
        }

This is done without any protection, and can therefore clash with 
pppoe_flush_dev() :

	spin_lock(&flush_lock);
	po->pppoe_dev = NULL; /* ppoe_dev can already be NULL before this point */
	spin_unlock(&flush_lock);

	dev_put(dev);    /* oops */

^ permalink raw reply

* [PATCH]: ingress socket filter by mark
From: jamal @ 2009-10-19 12:17 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: Eric Dumazet, Maciej Żenczykowski

[-- Attachment #1: Type: text/plain, Size: 70 bytes --]

apps can specify mark that they want to accept/reject.

cheers,
jamal

[-- Attachment #2: filt-sock-m-3 --]
[-- Type: text/plain, Size: 1099 bytes --]

commit ec187e3028db866161b881c5ac9eeea4e9bb0f1f
Author: Jamal Hadi Salim <hadi@cyberus.ca>
Date:   Mon Oct 19 08:12:46 2009 -0400

    [PATCH]: ingress socket filter by mark
    
    Allow bpf to set a filter to drop packets that dont
    match a specific mark
    
    Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1354aaf..909193e 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -123,7 +123,8 @@ struct sock_fprog	/* Required for SO_ATTACH_FILTER. */
 #define SKF_AD_IFINDEX 	8
 #define SKF_AD_NLATTR	12
 #define SKF_AD_NLATTR_NEST	16
-#define SKF_AD_MAX	20
+#define SKF_AD_MARK 	20
+#define SKF_AD_MAX	24
 #define SKF_NET_OFF   (-0x100000)
 #define SKF_LL_OFF    (-0x200000)
 
diff --git a/net/core/filter.c b/net/core/filter.c
index d1d779c..e3987e1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -303,6 +303,9 @@ load_b:
 		case SKF_AD_IFINDEX:
 			A = skb->dev->ifindex;
 			continue;
+		case SKF_AD_MARK:
+			A = skb->mark;
+			continue;
 		case SKF_AD_NLATTR: {
 			struct nlattr *nla;
 

^ permalink raw reply related

* Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled
From: Oleg Nesterov @ 2009-10-19 12:13 UTC (permalink / raw)
  To: Anirban Sinha; +Cc: linux-kernel, David Miller, netdev, Anirban Sinha
In-Reply-To: <4ADB7856.7000803@anirban.org>

Hi Anirban,

On 10/18, Anirban Sinha wrote:
>
> I have a question for you. The queue_work() routine which is called from
> schedule_work() does a put_cpu() which in turn does a enable_preempt(). Is
> this an attempt to trigger the scheduler?

No. please note that queue_work() does get_cpu() + put_cpu() to protect
against cpu_down() in between.

This can trigger the scheduler of course, but everything should be OK.

> One of the side affects of
> this enable_preempt() is the crash that we see below. What is happening
> is that  a timer callback routine, in  this case inet_twdr_hangman(),
> tries a bunch of cleanup until a threshold is reached. If further cleanups
> needs to be done beyond the threshold, it queues a work function. Now when
> the timer callback is run in __run_timers(), the routine grabs the value
> of preempt_count before and after the callback function call. If the two
> counts do not match, it calls BUG() (line 1037 in kernel/timer.c).

Yes, but I can't see how queue_work() can be involved, it doesn't change
->preempt_count. Note again it does put after get.

> Is is
> it illegal to schedule a work function from within a timer callback?

Yes sure.

I'd suppose that this unbalance comes from inet_twdr_hangman() pathes.

Could you verify this?

Oleg.


^ permalink raw reply

* Re: [PATCH][RFC]: ingress socket filter by mark
From: jamal @ 2009-10-19 12:12 UTC (permalink / raw)
  To: Maciej Żenczykowski; +Cc: Eric Dumazet, netdev, David Miller, Atis Elsts
In-Reply-To: <55a4f86e0910181609o6b21d667g8e65638667a1d687@mail.gmail.com>

On Sun, 2009-10-18 at 16:09 -0700, Maciej Żenczykowski wrote:
> 
> I agree that being able to filter on mark in bpf makes a lot of sense.

I agree as well - i posted a patch yesterday; i just tested it and it
works so i will formally post it shortly.

> I wonder if we're not hitting the filters potentially before the mark
> is set though (on receive at least)...
> I'm nowhere near sure but I think packets get diverted/cloned to
> tcpdump before they hit the ip stack (and thus potentially get marked
> by ip(6)table mangle rules)

There are many ways to mark the packets before they get to the socket.
tc ingress provides at least two ways (ipt action and recently posted
patch by me on skbedit); iptables as well.

cheers,
jamal


^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with  "alive" pppoe interfaces
From: Denys Fedoryschenko @ 2009-10-19 12:01 UTC (permalink / raw)
  To: Michal Ostrowski; +Cc: netdev, linux-ppp, paulus, mostrows
In-Reply-To: <e6d1cecd0910182034t9d24859mc6f392875b36ad17@mail.gmail.com>

Applied patch manually, still panic (maybe different now):

[   42.596904]
[   42.596904] Pid: 0, comm: swapper Not tainted (2.6.31.4-build-0047 #12)
[   42.596904] EIP: 0060:[<f886865e>] EFLAGS: 00010286 CPU: 0
[   42.596904] EIP is at pppoe_rcv+0x153/0x1be [pppoe]
[   42.596904] EAX: 00003100 EBX: ffffffff ECX: 00000002 EDX: f74007cb
[   42.596904] ESI: f79b4b00 EDI: f6332f00 EBP: c1806edc ESP: c1806eb8
[   42.596904]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[   42.596904] Process swapper (pid: 0, ti=c1806000 task=c03f6de0 
task.ti=c03f1000)
[   42.596904] Stack:
[   42.596904]  f7445000
00000004
f6fc0028
31000030
f6fc0030
f6332f40
f79b4b00
c0418140
Oct 19 15:00:46 194.146.153.99
[   42.596904] <0>
f886a444
c1806f14
c029e702
f7445000
f79b4bb0
f79b4bb0
c0418160
f7445000
Oct 19 15:00:46 194.146.153.99
[   42.596904] <0>
00000000
00000001
64886f14
c02ac3ce
f79b4b00
00000000
f753105c
c1806f58
Oct 19 15:00:46 194.146.153.99
[   42.596904] Call Trace:
[   42.596904]  [<c029e702>] ? netif_receive_skb+0x43b/0x45a
[   42.596904]  [<c02ac3ce>] ? eth_type_trans+0x25/0xa9
[   42.596904]  [<f840f736>] ? rtl8169_rx_interrupt+0x343/0x3f3 [r8169]
[   42.596904]  [<f8412191>] ? rtl8169_poll+0x2f/0x1b2 [r8169]
[   42.596904]  [<c01413e5>] ? hrtimer_run_pending+0x2d/0xa7
[   42.596904]  [<c029ec95>] ? net_rx_action+0x93/0x177
[   42.596904]  [<c0131bdd>] ? __do_softirq+0xa7/0x144
[   42.596904]  [<c0131b36>] ? __do_softirq+0x0/0x144
[   42.596904]  <IRQ>
Oct 19 15:00:46 194.146.153.99
[   42.596904]  [<c0131957>] ? irq_exit+0x29/0x5c
[   42.596904]  [<c0104393>] ? do_IRQ+0x80/0x96
[   42.596904]  [<c0102f49>] ? common_interrupt+0x29/0x30
[   42.596904]  [<c0108a3e>] ? mwait_idle+0x8a/0xb9
[   42.596904]  [<c0149613>] ? tick_nohz_restart_sched_tick+0x27/0x12f
[   42.596904]  [<c0101bf0>] ? cpu_idle+0x44/0x60
[   42.596904]  [<c02ef8c7>] ? rest_init+0x53/0x55
[   42.596904]  [<c04197df>] ? start_kernel+0x2b9/0x2be
[   42.596904]  [<c041906a>] ? i386_start_kernel+0x6a/0x6f
[   42.596904] Code:
53
0a
32
53
0b
31
c2
c1
e8
08
31
c2
eb
08
0f
b6
c2
c1
f8
04
31
c2
d1
e9
83
f9
04
74
f1
89
d0
83
e0
0f
8b
1c
87
eb
35
66
8b
45
ea
Oct 19 15:00:46 194.146.153.99
39
83
98
01
00
00
75
22
8b
55
e4
8d
83
9a
01
00
00
b9
06
00
Oct 19 15:00:46 194.146.153.99
[   42.596904] EIP: [<f886865e>]
pppoe_rcv+0x153/0x1be [pppoe]
SS:ESP 0068:c1806eb8
[   42.596904] CR2: 0000000000000197
[   42.606148] ---[ end trace b477ff4ee072d9b9 ]---
[   42.606209] Kernel panic - not syncing: Fatal exception in interrupt
[   42.606273] Pid: 0, comm: swapper Tainted: G      D    2.6.31.4-build-0047 
#12
[   42.606351] Call Trace:
[   42.606413]  [<c02fc496>] ? printk+0xf/0x11
[   42.606474]  [<c02fc3e7>] panic+0x39/0xd9
[   42.606535]  [<c01059b7>] oops_end+0x8b/0x9a
[   42.606597]  [<c0118f6d>] no_context+0x13d/0x147
[   42.606658]  [<c011908a>] __bad_area_nosemaphore+0x113/0x11b
[   42.606722]  [<c0121dc1>] ? check_preempt_wakeup+0x34/0x141
[   42.606862]  [<c01294bb>] ? try_to_wake_up+0x1aa/0x1b4
[   42.606930]  [<c0209541>] ? cpumask_next_and+0x26/0x37
[   42.607003]  [<c01256f5>] ? find_busiest_group+0x291/0x885
[   42.607067]  [<c019cf47>] ? pollwake+0x5a/0x63
[   42.607127]  [<c011909f>] bad_area_nosemaphore+0xd/0x10
[   42.607189]  [<c0119301>] do_page_fault+0x114/0x26f
[   42.607251]  [<c01191ed>] ? do_page_fault+0x0/0x26f
[   42.607313]  [<c02fe506>] error_code+0x66/0x6c
[   42.607375]  [<c02900d8>] ? pcibios_set_master+0x89/0x8d
[   42.607436]  [<c01191ed>] ? do_page_fault+0x0/0x26f
[   42.607501]  [<f886865e>] ? pppoe_rcv+0x153/0x1be [pppoe]
[   42.607564]  [<c029e702>] netif_receive_skb+0x43b/0x45a
[   42.607625]  [<c02ac3ce>] ? eth_type_trans+0x25/0xa9
[   42.607691]  [<f840f736>] rtl8169_rx_interrupt+0x343/0x3f3 [r8169]
[   42.607759]  [<f8412191>] rtl8169_poll+0x2f/0x1b2 [r8169]
[   42.607824]  [<c01413e5>] ? hrtimer_run_pending+0x2d/0xa7
[   42.607886]  [<c029ec95>] net_rx_action+0x93/0x177
[   42.607948]  [<c0131bdd>] __do_softirq+0xa7/0x144
[   42.608022]  [<c0131b36>] ? __do_softirq+0x0/0x144
[   42.608082]  <IRQ>
[<c0131957>] ? irq_exit+0x29/0x5c
[   42.608183]  [<c0104393>] ? do_IRQ+0x80/0x96
[   42.608183]  [<c0104393>] ? do_IRQ+0x80/0x96
[   42.608245]  [<c0102f49>] ? common_interrupt+0x29/0x30
[   42.608307]  [<c0108a3e>] ? mwait_idle+0x8a/0xb9
[   42.608369]  [<c0149613>] ? tick_nohz_restart_sched_tick+0x27/0x12f
[   42.608431]  [<c0101bf0>] ? cpu_idle+0x44/0x60
[   42.608493]  [<c02ef8c7>] ? rest_init+0x53/0x55
[   42.608553]  [<c04197df>] ? start_kernel+0x2b9/0x2be
[   42.608616]  [<c041906a>] ? i386_start_kernel+0x6a/0x6f
[   42.608682] Rebooting in 5 seconds..



On Monday 19 October 2009 06:34:06 Michal Ostrowski wrote:
> Here's my theory on this after an inital look...
>
> Looking at the oops report and disassembly of the actual module binary
> that caused the oops, one can deduce that:
>
> Execution was in pppoe_flush_dev().  %ebx contained the pointer "struct
> pppox_sock *po", which is what we faulted on, excuting "cmp %eax,
> 0x190(%ebx)". %ebx value was 0xffffffff (hence we got "NULL pointer
> dereference at 0x18f").
>
> At this point "i" (stored in %esi) is 15 (valid), meaning that we got a
> value of 0xffffffff in pn->hash_table[i].
>
> From this I'd hypothesize that the combination of dev_put() and
> release_sock() may have allowed us to free "pn".  At the bottom of the loop
> we alreayd recognize that since locks are dropped we're responsible for
> handling invalidation of objects, and perhaps that should be extended to
> "pn" as well. --
> Michal Ostrowski
> mostrows@gmail.com
>
>
> ---
>  drivers/net/pppoe.c |   86 ++++++++++++++++++++++++++----
> --------------------
>  1 files changed, 45 insertions(+), 41 deletions(-)
>
> diff --git a/drivers/net/pppoe.c b/drivers/net/pppoe.c
> index 7cbf6f9..720c4ea 100644
> --- a/drivers/net/pppoe.c
> +++ b/drivers/net/pppoe.c
> @@ -296,6 +296,7 @@ static void pppoe_flush_dev(struct net_device *dev)
>
>        BUG_ON(dev == NULL);
>
> +restart:
>        pn = pppoe_pernet(dev_net(dev));
>        if (!pn) /* already freed */
>                return;
> @@ -303,48 +304,51 @@ static void pppoe_flush_dev(struct net_device *dev)
>        write_lock_bh(&pn->hash_lock);
>        for (i = 0; i < PPPOE_HASH_SIZE; i++) {
>                struct pppox_sock *po = pn->hash_table[i];
> +                struct sock *sk;
>
> -               while (po != NULL) {
> -                       struct sock *sk;
> -                       if (po->pppoe_dev != dev) {
> -                               po = po->next;
> -                               continue;
> -                       }
> -                       sk = sk_pppox(po);
> -                       spin_lock(&flush_lock);
> -                       po->pppoe_dev = NULL;
> -                       spin_unlock(&flush_lock);
> -                       dev_put(dev);
> -
> -                       /* We always grab the socket lock, followed by the
> -                        * hash_lock, in that order.  Since we should
> -                        * hold the sock lock while doing any unbinding,
> -                        * we need to release the lock we're holding.
> -                        * Hold a reference to the sock so it doesn't
> disappear -                        * as we're jumping between locks.
> -                        */
> +                while (po && po->pppoe_dev != dev) {
> +                        po = po->next;
> +                }
>
> -                       sock_hold(sk);
> +                if (po == NULL) {
> +                        continue;
> +                }
>
> -                       write_unlock_bh(&pn->hash_lock);
> -                       lock_sock(sk);
> +                sk = sk_pppox(po);
>
> -                       if (sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND))
> { -                               pppox_unbind_sock(sk);
> -                               sk->sk_state = PPPOX_ZOMBIE;
> -                               sk->sk_state_change(sk);
> -                       }
> +                spin_lock(&flush_lock);
> +                po->pppoe_dev = NULL;
> +                spin_unlock(&flush_lock);
>
> -                       release_sock(sk);
> -                       sock_put(sk);
> +                dev_put(dev);
>
> -                       /* Restart scan at the beginning of this hash
> chain. -                        * While the lock was dropped the chain
> contents may -                        * have changed.
> -                        */
> -                       write_lock_bh(&pn->hash_lock);
> -                       po = pn->hash_table[i];
> -               }
> +                /* We always grab the socket lock, followed by the
> +                 * hash_lock, in that order.  Since we should
> +                 * hold the sock lock while doing any unbinding,
> +                 * we need to release the lock we're holding.
> +                 * Hold a reference to the sock so it doesn't disappear
> +                 * as we're jumping between locks.
> +                 */
> +
> +                sock_hold(sk);
> +
> +                write_unlock_bh(&pn->hash_lock);
> +                lock_sock(sk);
> +
> +                if (sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND)) {
> +                        pppox_unbind_sock(sk);
> +                        sk->sk_state = PPPOX_ZOMBIE;
> +                        sk->sk_state_change(sk);
> +                }
> +
> +                release_sock(sk);
> +                sock_put(sk);
> +
> +                /* Restart the flush process from the beginning.  While
> +                 * the lock was dropped the chain contents may have
> +                 * changed, and sock_put may have made things go away.
> +                 */
> +                goto restart;
>        }
>        write_unlock_bh(&pn->hash_lock);
>  }
> --
> 1.6.3.3
>
> On Sun, Oct 18, 2009 at 4:02 PM, Denys Fedoryschenko <denys@visp.net.lb> 
wrote:
> > I have server running as pppoe NAS.
> > Tried to rename customers without dropping pppd connections first, got
> > panic after few seconds.
> > Panic triggerable at 2.6.30.4 and 2.6.31.4
> > pppoe users running on eth2
> > pppoe flags:
> >  1457 root     /usr/sbin/pppoe-server -I eth2 -k -L 172.16.1.1 -R
> > 172.16.1.2 -N 253 -C gpzone -S gpzone
> >
> >
> > Commands sequence that i think triggered that:
> >
> > ip link set eth0 down
> > ip link set eth1 down
> > ip link set eth2 down
> > nameif etherx 00:16:76:8D:83:BA
> > nameif eth0 00:19:e0:72:4a:37
> > nameif eth1 00:19:e0:72:4a:4b
> >
> > ip addr flush dev eth0
> > ip addr flush dev eth1
> > ip addr add X.X.X.X/29 dev eth0
> > ip addr add 192.168.2.177/24 dev eth0
> > ip addr add 192.168.0.1/32 dev eth1
> > ip addr add 127.0.0.0/8 dev lo
> > #ip link set eth0 up
> > ip link set eth0 up
> > ip link set eth1 up
> > ip link set lo up
> > ip route add 0.0.0.0/0 via X.X.X.X
> >
> >
> > [  103.428591] r8169: eth0: link up
> > [  103.430360] r8169: eth1: link up
> > [  113.361528] BUG: unable to handle kernel
> > NULL pointer dereference
> > at 0000018f
> > [  113.361717] IP:
> > [<f8868269>] pppoe_device_event+0x80/0x12c [pppoe]
> > [  113.361853] *pdpt = 000000003411a001
> > *pde = 0000000000000000
> > Oct 18 23:59:40 194.146.153.93
> > [  113.362012] Oops: 0000 [#1]
> > SMP
> > Oct 18 23:59:40 194.146.153.93
> > [  113.362166] last sysfs file: /sys/devices/virtual/vc/vcs3/dev
> > [  113.362246] Modules linked in:
> > netconsole
> > configfs
> > act_skbedit
> > sch_ingress
> > sch_prio
> > cls_flow
> > cls_u32
> > em_meta
> > cls_basic
> > xt_dscp
> > xt_DSCP
> > ipt_REJECT
> > ts_bm
> > xt_string
> > xt_hl
> > ifb
> > cls_fw
> > sch_tbf
> > sch_htb
> > act_ipt
> > act_mirred
> > xt_MARK
> > pppoe
> > pppox
> > ppp_generic
> > slhc
> > xt_TCPMSS
> > xt_mark
> > xt_tcpudp
> > iptable_mangle
> > iptable_nat
> > nf_nat
> > rtc_cmos
> > nf_conntrack_ipv4
> > rtc_core
> > nf_conntrack
> > rtc_lib
> > nf_defrag_ipv4
> > iptable_filter
> > ip_tables
> > x_tables
> > 8021q
> > garp
> > stp
> > llc
> > loop
> > sata_sil
> > pata_atiixp
> > pata_acpi
> > ata_generic
> > libata
> > 8139cp
> > usb_storage
> > mtdblock
> > mtd_blkdevs
> > mtd
> > sr_mod
> > cdrom
> > tulip
> > r8169
> > sky2
> > via_velocity
> > via_rhine
> > sis900
> > ne2k_pci
> > 8390
> > skge
> > tg3
> > libphy
> > 8139too
> > e1000
> > e100
> > usbhid
> > ohci_hcd
> > uhci_hcd
> > ehci_hcd
> > usbcore
> > nls_base
> > Oct 18 23:59:40 194.146.153.93
> > [  113.362344]
> > [  113.362344] Pid: 2858, comm: pppd Not tainted (2.6.31.4-build-0047 #7)
> > [  113.362344] EIP: 0060:[<f8868269>] EFLAGS: 00010286 CPU: 0
> > [  113.362344] EIP is at pppoe_device_event+0x80/0x12c [pppoe]
> > [  113.362344] EAX: f4fbe000 EBX: ffffffff ECX: f6cea5a0 EDX: f7403680
> > [  113.362344] ESI: 0000000f EDI: f6cea5e0 EBP: f4145e34 ESP: f4145e1c
> > [  113.362344]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> > [  113.362344] Process pppd (pid: 2858, ti=f4145000 task=f4112ff0
> > task.ti=f4145000)
> > [  113.362344] Stack:
> > [  113.362344]  f4fbe220
> > f4fbe000
> > f6cea5a0
> > f886a430
> > fffffff5
> > 00000000
> > f4145e54
> > c01422b3
> > Oct 18 23:59:40 194.146.153.93
> > [  113.362344] <0>
> > f4fbe000
> > 00000009
> > f8a457d8
> > f4fbe000
> > f8850190
> > 00001091
> > f4145e64
> > c0142361
> > Oct 18 23:59:40 194.146.153.93
> > [  113.362344] <0>
> > ffffffff
> > 00000000
> > f4145e74
> > c029ffbf
> > f4fbe000
> > 000010d0
> > f4145e90
> > c029fa70
> > Oct 18 23:59:40 194.146.153.93
> > [  113.362344] Call Trace:
> > [  113.362344]  [<c01422b3>] ? notifier_call_chain+0x2b/0x4a
> > [  113.362344]  [<c0142361>] ? raw_notifier_call_chain+0xc/0xe
> > [  113.362344]  [<c029ffbf>] ? dev_close+0x4c/0x8c
> > [  113.362344]  [<c029fa70>] ? dev_change_flags+0xa5/0x158
> > [  113.362344]  [<c02da633>] ? devinet_ioctl+0x21a/0x503
> > [  113.362344]  [<c02db693>] ? inet_ioctl+0x8d/0xa6
> > [  113.362344]  [<c0292b21>] ? sock_ioctl+0x1c8/0x1ec
> > [  113.362344]  [<c0292959>] ? sock_ioctl+0x0/0x1ec
> > [  113.362344]  [<c019af2b>] ? vfs_ioctl+0x22/0x69
> > [  113.362344]  [<c019b435>] ? do_vfs_ioctl+0x41f/0x459
> > [  113.362344]  [<c02934eb>] ? sys_send+0x18/0x1a
> > [  113.362344]  [<c011942f>] ? do_page_fault+0x242/0x26f
> > [  113.362344]  [<c019b49b>] ? sys_ioctl+0x2c/0x45
> > [  113.362344]  [<c0102975>] ? syscall_call+0x7/0xb
> > [  113.362344] Code:
> > c9
> > 00
> > 00
> > 00
> > 89
> > c7
> > 31
> > f6
> > 83
> > c7
> > 40
> > 89
> > f8
> > e8
> > cc
> > 60
> > a9
> > c7
> > 8b
> > 45
> > ec
> > 05
> > 20
> > 02
> > 00
> > 00
> > 89
> > 45
> > e8
> > 8b
> > 4d
> > f0
> > 8b
> > 1c
> > b1
> > e9
> > 8c
> > 00
> > 00
> > 00
> > 8b
> > 45
> > ec
> > Oct 18 23:59:40 194.146.153.93
> > 83
> > 90
> > 01
> > 00
> > 00
> > 74
> > 08
> > 8b
> > 9b
> > 8c
> > 01
> > 00
> > 00
> > eb
> > 79
> > b8
> > c0
> > a6
> > 86
> > f8
> > Oct 18 23:59:40 194.146.153.93
> > [  113.362344] EIP: [<f8868269>]
> > pppoe_device_event+0x80/0x12c [pppoe]
> > SS:ESP 0068:f4145e1c
> > [  113.362344] CR2: 000000000000018f
> > [  113.373124] ---[ end trace f6fe64a307e97f3b ]---
> > [  113.373203] Kernel panic - not syncing: Fatal exception in interrupt
> > [  113.373286] Pid: 2858, comm: pppd Tainted: G      D  
> >  2.6.31.4-build-0047 #7
> > [  113.373379] Call Trace:
> > [  113.373479]  [<c02fc496>] ? printk+0xf/0x11
> > [  113.373561]  [<c02fc3e7>] panic+0x39/0xd9
> > [  113.373656]  [<c01059b7>] oops_end+0x8b/0x9a
> > [  113.373727]  [<c0118f6d>] no_context+0x13d/0x147
> > [  113.373800]  [<c011908a>] __bad_area_nosemaphore+0x113/0x11b
> > [  113.373881]  [<c02953b3>] ? sock_alloc_send_pskb+0x8b/0x24a
> > [  113.373959]  [<c0121801>] ? __wake_up_sync_key+0x3b/0x45
> > [  113.374030]  [<c0131967>] ? irq_exit+0x39/0x5c
> > [  113.374107]  [<c0104393>] ? do_IRQ+0x80/0x96
> > [  113.374183]  [<c0102f49>] ? common_interrupt+0x29/0x30
> > [  113.374259]  [<c011909f>] bad_area_nosemaphore+0xd/0x10
> > [  113.374348]  [<c0119301>] do_page_fault+0x114/0x26f
> > [  113.374526]  [<c01191ed>] ? do_page_fault+0x0/0x26f
> > [  113.374605]  [<c02fe506>] error_code+0x66/0x6c
> > [  113.374683]  [<c02d007b>] ? tcp_v4_send_ack+0xa3/0x10e
> > [  113.374764]  [<c01191ed>] ? do_page_fault+0x0/0x26f
> > [  113.374850]  [<f8868269>] ? pppoe_device_event+0x80/0x12c [pppoe]
> > [  113.374928]  [<c01422b3>] notifier_call_chain+0x2b/0x4a
> > [  113.375012]  [<c0142361>] raw_notifier_call_chain+0xc/0xe
> > [  113.375097]  [<c029ffbf>] dev_close+0x4c/0x8c
> > [  113.375169]  [<c029fa70>] dev_change_flags+0xa5/0x158
> > [  113.375239]  [<c02da633>] devinet_ioctl+0x21a/0x503
> > [  113.375318]  [<c02db693>] inet_ioctl+0x8d/0xa6
> > [  113.375411]  [<c0292b21>] sock_ioctl+0x1c8/0x1ec
> > [  113.375491]  [<c0292959>] ? sock_ioctl+0x0/0x1ec
> > [  113.375574]  [<c019af2b>] vfs_ioctl+0x22/0x69
> > [  113.375653]  [<c019b435>] do_vfs_ioctl+0x41f/0x459
> > [  113.375734]  [<c02934eb>] ? sys_send+0x18/0x1a
> > [  113.375813]  [<c011942f>] ? do_page_fault+0x242/0x26f
> > [  113.375884]  [<c019b49b>] sys_ioctl+0x2c/0x45
> > [  113.375960]  [<c0102975>] syscall_call+0x7/0xb
> > [  113.376041] Rebooting in 5 seconds..



^ permalink raw reply

* Re: [PATCH] AF_UNIX: Fix deadlock on connecting to shutdown socket
From: Jarek Poplawski @ 2009-10-19 11:57 UTC (permalink / raw)
  To: Tomoki Sekiyama
  Cc: linux-kernel, netdev, alan, davem, satoshi.oshima.fk,
	hidehiro.kawai.ez, hideo.aoki.tk
In-Reply-To: <4ADC010C.5070809@hitachi.com>

On 19-10-2009 08:02, Tomoki Sekiyama wrote:
...
> Why this happens:
>  Error checks between unix_socket_connect() and unix_wait_for_peer() are
>  inconsistent. The former calls the latter to wait until the backlog is
>  processed. Despite the latter returns without doing anything when the
>  socket is shutdown, the former doesn't check the shutdown state and
>  just retries calling the latter forever.
> 
> Patch:
>  The patch below adds shutdown check into unix_socket_connect(), so
>  connect(2) to the shutdown socket will return -ECONREFUSED.
> 
> Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
> Signed-off-by: Masanori Yoshida <masanori.yoshida.tv@hitachi.com>
> ---
>  net/unix/af_unix.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 51ab497..fc820cd 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -1074,6 +1074,8 @@ restart:
>  	err = -ECONNREFUSED;
>  	if (other->sk_state != TCP_LISTEN)
>  		goto out_unlock;
> +	if (other->sk_shutdown & RCV_SHUTDOWN)
> +		goto out_unlock;

Isn't the shutdown call expected to change sk_state to TCP_CLOSE?

Jarek P.

^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with  "alive" pppoe interfaces
From: Denys Fedoryschenko @ 2009-10-19 11:36 UTC (permalink / raw)
  To: Michal Ostrowski; +Cc: netdev, linux-ppp, paulus, mostrows
In-Reply-To: <e6d1cecd0910182034t9d24859mc6f392875b36ad17@mail.gmail.com>

Can you send me patch as attachment please?

On Monday 19 October 2009 06:34:06 Michal Ostrowski wrote:
> Here's my theory on this after an inital look...
>
> Looking at the oops report and disassembly of the actual module binary
> that caused the oops, one can deduce that:
>
> Execution was in pppoe_flush_dev().  %ebx contained the pointer "struct
> pppox_sock *po", which is what we faulted on, excuting "cmp %eax,
> 0x190(%ebx)". %ebx value was 0xffffffff (hence we got "NULL pointer
> dereference at 0x18f").
>
> At this point "i" (stored in %esi) is 15 (valid), meaning that we got a
> value of 0xffffffff in pn->hash_table[i].
>
> From this I'd hypothesize that the combination of dev_put() and
> release_sock() may have allowed us to free "pn".  At the bottom of the loop
> we alreayd recognize that since locks are dropped we're responsible for
> handling invalidation of objects, and perhaps that should be extended to
> "pn" as well. --
> Michal Ostrowski
> mostrows@gmail.com
>
>
> ---
>  drivers/net/pppoe.c |   86 ++++++++++++++++++++++++++----
> --------------------
>  1 files changed, 45 insertions(+), 41 deletions(-)
>
> diff --git a/drivers/net/pppoe.c b/drivers/net/pppoe.c
> index 7cbf6f9..720c4ea 100644
> --- a/drivers/net/pppoe.c
> +++ b/drivers/net/pppoe.c
> @@ -296,6 +296,7 @@ static void pppoe_flush_dev(struct net_device *dev)
>
>        BUG_ON(dev == NULL);
>
> +restart:
>        pn = pppoe_pernet(dev_net(dev));
>        if (!pn) /* already freed */
>                return;
> @@ -303,48 +304,51 @@ static void pppoe_flush_dev(struct net_device *dev)
>        write_lock_bh(&pn->hash_lock);
>        for (i = 0; i < PPPOE_HASH_SIZE; i++) {
>                struct pppox_sock *po = pn->hash_table[i];
> +                struct sock *sk;
>
> -               while (po != NULL) {
> -                       struct sock *sk;
> -                       if (po->pppoe_dev != dev) {
> -                               po = po->next;
> -                               continue;
> -                       }
> -                       sk = sk_pppox(po);
> -                       spin_lock(&flush_lock);
> -                       po->pppoe_dev = NULL;
> -                       spin_unlock(&flush_lock);
> -                       dev_put(dev);
> -
> -                       /* We always grab the socket lock, followed by the
> -                        * hash_lock, in that order.  Since we should
> -                        * hold the sock lock while doing any unbinding,
> -                        * we need to release the lock we're holding.
> -                        * Hold a reference to the sock so it doesn't
> disappear -                        * as we're jumping between locks.
> -                        */
> +                while (po && po->pppoe_dev != dev) {
> +                        po = po->next;
> +                }
>
> -                       sock_hold(sk);
> +                if (po == NULL) {
> +                        continue;
> +                }
>
> -                       write_unlock_bh(&pn->hash_lock);
> -                       lock_sock(sk);
> +                sk = sk_pppox(po);
>
> -                       if (sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND))
> { -                               pppox_unbind_sock(sk);
> -                               sk->sk_state = PPPOX_ZOMBIE;
> -                               sk->sk_state_change(sk);
> -                       }
> +                spin_lock(&flush_lock);
> +                po->pppoe_dev = NULL;
> +                spin_unlock(&flush_lock);
>
> -                       release_sock(sk);
> -                       sock_put(sk);
> +                dev_put(dev);
>
> -                       /* Restart scan at the beginning of this hash
> chain. -                        * While the lock was dropped the chain
> contents may -                        * have changed.
> -                        */
> -                       write_lock_bh(&pn->hash_lock);
> -                       po = pn->hash_table[i];
> -               }
> +                /* We always grab the socket lock, followed by the
> +                 * hash_lock, in that order.  Since we should
> +                 * hold the sock lock while doing any unbinding,
> +                 * we need to release the lock we're holding.
> +                 * Hold a reference to the sock so it doesn't disappear
> +                 * as we're jumping between locks.
> +                 */
> +
> +                sock_hold(sk);
> +
> +                write_unlock_bh(&pn->hash_lock);
> +                lock_sock(sk);
> +
> +                if (sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND)) {
> +                        pppox_unbind_sock(sk);
> +                        sk->sk_state = PPPOX_ZOMBIE;
> +                        sk->sk_state_change(sk);
> +                }
> +
> +                release_sock(sk);
> +                sock_put(sk);
> +
> +                /* Restart the flush process from the beginning.  While
> +                 * the lock was dropped the chain contents may have
> +                 * changed, and sock_put may have made things go away.
> +                 */
> +                goto restart;
>        }
>        write_unlock_bh(&pn->hash_lock);
>  }
> --
> 1.6.3.3
>
> On Sun, Oct 18, 2009 at 4:02 PM, Denys Fedoryschenko <denys@visp.net.lb> 
wrote:
> > I have server running as pppoe NAS.
> > Tried to rename customers without dropping pppd connections first, got
> > panic after few seconds.
> > Panic triggerable at 2.6.30.4 and 2.6.31.4
> > pppoe users running on eth2
> > pppoe flags:
> >  1457 root     /usr/sbin/pppoe-server -I eth2 -k -L 172.16.1.1 -R
> > 172.16.1.2 -N 253 -C gpzone -S gpzone
> >
> >
> > Commands sequence that i think triggered that:
> >
> > ip link set eth0 down
> > ip link set eth1 down
> > ip link set eth2 down
> > nameif etherx 00:16:76:8D:83:BA
> > nameif eth0 00:19:e0:72:4a:37
> > nameif eth1 00:19:e0:72:4a:4b
> >
> > ip addr flush dev eth0
> > ip addr flush dev eth1
> > ip addr add X.X.X.X/29 dev eth0
> > ip addr add 192.168.2.177/24 dev eth0
> > ip addr add 192.168.0.1/32 dev eth1
> > ip addr add 127.0.0.0/8 dev lo
> > #ip link set eth0 up
> > ip link set eth0 up
> > ip link set eth1 up
> > ip link set lo up
> > ip route add 0.0.0.0/0 via X.X.X.X
> >
> >
> > [  103.428591] r8169: eth0: link up
> > [  103.430360] r8169: eth1: link up
> > [  113.361528] BUG: unable to handle kernel
> > NULL pointer dereference
> > at 0000018f
> > [  113.361717] IP:
> > [<f8868269>] pppoe_device_event+0x80/0x12c [pppoe]
> > [  113.361853] *pdpt = 000000003411a001
> > *pde = 0000000000000000
> > Oct 18 23:59:40 194.146.153.93
> > [  113.362012] Oops: 0000 [#1]
> > SMP
> > Oct 18 23:59:40 194.146.153.93
> > [  113.362166] last sysfs file: /sys/devices/virtual/vc/vcs3/dev
> > [  113.362246] Modules linked in:
> > netconsole
> > configfs
> > act_skbedit
> > sch_ingress
> > sch_prio
> > cls_flow
> > cls_u32
> > em_meta
> > cls_basic
> > xt_dscp
> > xt_DSCP
> > ipt_REJECT
> > ts_bm
> > xt_string
> > xt_hl
> > ifb
> > cls_fw
> > sch_tbf
> > sch_htb
> > act_ipt
> > act_mirred
> > xt_MARK
> > pppoe
> > pppox
> > ppp_generic
> > slhc
> > xt_TCPMSS
> > xt_mark
> > xt_tcpudp
> > iptable_mangle
> > iptable_nat
> > nf_nat
> > rtc_cmos
> > nf_conntrack_ipv4
> > rtc_core
> > nf_conntrack
> > rtc_lib
> > nf_defrag_ipv4
> > iptable_filter
> > ip_tables
> > x_tables
> > 8021q
> > garp
> > stp
> > llc
> > loop
> > sata_sil
> > pata_atiixp
> > pata_acpi
> > ata_generic
> > libata
> > 8139cp
> > usb_storage
> > mtdblock
> > mtd_blkdevs
> > mtd
> > sr_mod
> > cdrom
> > tulip
> > r8169
> > sky2
> > via_velocity
> > via_rhine
> > sis900
> > ne2k_pci
> > 8390
> > skge
> > tg3
> > libphy
> > 8139too
> > e1000
> > e100
> > usbhid
> > ohci_hcd
> > uhci_hcd
> > ehci_hcd
> > usbcore
> > nls_base
> > Oct 18 23:59:40 194.146.153.93
> > [  113.362344]
> > [  113.362344] Pid: 2858, comm: pppd Not tainted (2.6.31.4-build-0047 #7)
> > [  113.362344] EIP: 0060:[<f8868269>] EFLAGS: 00010286 CPU: 0
> > [  113.362344] EIP is at pppoe_device_event+0x80/0x12c [pppoe]
> > [  113.362344] EAX: f4fbe000 EBX: ffffffff ECX: f6cea5a0 EDX: f7403680
> > [  113.362344] ESI: 0000000f EDI: f6cea5e0 EBP: f4145e34 ESP: f4145e1c
> > [  113.362344]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> > [  113.362344] Process pppd (pid: 2858, ti=f4145000 task=f4112ff0
> > task.ti=f4145000)
> > [  113.362344] Stack:
> > [  113.362344]  f4fbe220
> > f4fbe000
> > f6cea5a0
> > f886a430
> > fffffff5
> > 00000000
> > f4145e54
> > c01422b3
> > Oct 18 23:59:40 194.146.153.93
> > [  113.362344] <0>
> > f4fbe000
> > 00000009
> > f8a457d8
> > f4fbe000
> > f8850190
> > 00001091
> > f4145e64
> > c0142361
> > Oct 18 23:59:40 194.146.153.93
> > [  113.362344] <0>
> > ffffffff
> > 00000000
> > f4145e74
> > c029ffbf
> > f4fbe000
> > 000010d0
> > f4145e90
> > c029fa70
> > Oct 18 23:59:40 194.146.153.93
> > [  113.362344] Call Trace:
> > [  113.362344]  [<c01422b3>] ? notifier_call_chain+0x2b/0x4a
> > [  113.362344]  [<c0142361>] ? raw_notifier_call_chain+0xc/0xe
> > [  113.362344]  [<c029ffbf>] ? dev_close+0x4c/0x8c
> > [  113.362344]  [<c029fa70>] ? dev_change_flags+0xa5/0x158
> > [  113.362344]  [<c02da633>] ? devinet_ioctl+0x21a/0x503
> > [  113.362344]  [<c02db693>] ? inet_ioctl+0x8d/0xa6
> > [  113.362344]  [<c0292b21>] ? sock_ioctl+0x1c8/0x1ec
> > [  113.362344]  [<c0292959>] ? sock_ioctl+0x0/0x1ec
> > [  113.362344]  [<c019af2b>] ? vfs_ioctl+0x22/0x69
> > [  113.362344]  [<c019b435>] ? do_vfs_ioctl+0x41f/0x459
> > [  113.362344]  [<c02934eb>] ? sys_send+0x18/0x1a
> > [  113.362344]  [<c011942f>] ? do_page_fault+0x242/0x26f
> > [  113.362344]  [<c019b49b>] ? sys_ioctl+0x2c/0x45
> > [  113.362344]  [<c0102975>] ? syscall_call+0x7/0xb
> > [  113.362344] Code:
> > c9
> > 00
> > 00
> > 00
> > 89
> > c7
> > 31
> > f6
> > 83
> > c7
> > 40
> > 89
> > f8
> > e8
> > cc
> > 60
> > a9
> > c7
> > 8b
> > 45
> > ec
> > 05
> > 20
> > 02
> > 00
> > 00
> > 89
> > 45
> > e8
> > 8b
> > 4d
> > f0
> > 8b
> > 1c
> > b1
> > e9
> > 8c
> > 00
> > 00
> > 00
> > 8b
> > 45
> > ec
> > Oct 18 23:59:40 194.146.153.93
> > 83
> > 90
> > 01
> > 00
> > 00
> > 74
> > 08
> > 8b
> > 9b
> > 8c
> > 01
> > 00
> > 00
> > eb
> > 79
> > b8
> > c0
> > a6
> > 86
> > f8
> > Oct 18 23:59:40 194.146.153.93
> > [  113.362344] EIP: [<f8868269>]
> > pppoe_device_event+0x80/0x12c [pppoe]
> > SS:ESP 0068:f4145e1c
> > [  113.362344] CR2: 000000000000018f
> > [  113.373124] ---[ end trace f6fe64a307e97f3b ]---
> > [  113.373203] Kernel panic - not syncing: Fatal exception in interrupt
> > [  113.373286] Pid: 2858, comm: pppd Tainted: G      D  
> >  2.6.31.4-build-0047 #7
> > [  113.373379] Call Trace:
> > [  113.373479]  [<c02fc496>] ? printk+0xf/0x11
> > [  113.373561]  [<c02fc3e7>] panic+0x39/0xd9
> > [  113.373656]  [<c01059b7>] oops_end+0x8b/0x9a
> > [  113.373727]  [<c0118f6d>] no_context+0x13d/0x147
> > [  113.373800]  [<c011908a>] __bad_area_nosemaphore+0x113/0x11b
> > [  113.373881]  [<c02953b3>] ? sock_alloc_send_pskb+0x8b/0x24a
> > [  113.373959]  [<c0121801>] ? __wake_up_sync_key+0x3b/0x45
> > [  113.374030]  [<c0131967>] ? irq_exit+0x39/0x5c
> > [  113.374107]  [<c0104393>] ? do_IRQ+0x80/0x96
> > [  113.374183]  [<c0102f49>] ? common_interrupt+0x29/0x30
> > [  113.374259]  [<c011909f>] bad_area_nosemaphore+0xd/0x10
> > [  113.374348]  [<c0119301>] do_page_fault+0x114/0x26f
> > [  113.374526]  [<c01191ed>] ? do_page_fault+0x0/0x26f
> > [  113.374605]  [<c02fe506>] error_code+0x66/0x6c
> > [  113.374683]  [<c02d007b>] ? tcp_v4_send_ack+0xa3/0x10e
> > [  113.374764]  [<c01191ed>] ? do_page_fault+0x0/0x26f
> > [  113.374850]  [<f8868269>] ? pppoe_device_event+0x80/0x12c [pppoe]
> > [  113.374928]  [<c01422b3>] notifier_call_chain+0x2b/0x4a
> > [  113.375012]  [<c0142361>] raw_notifier_call_chain+0xc/0xe
> > [  113.375097]  [<c029ffbf>] dev_close+0x4c/0x8c
> > [  113.375169]  [<c029fa70>] dev_change_flags+0xa5/0x158
> > [  113.375239]  [<c02da633>] devinet_ioctl+0x21a/0x503
> > [  113.375318]  [<c02db693>] inet_ioctl+0x8d/0xa6
> > [  113.375411]  [<c0292b21>] sock_ioctl+0x1c8/0x1ec
> > [  113.375491]  [<c0292959>] ? sock_ioctl+0x0/0x1ec
> > [  113.375574]  [<c019af2b>] vfs_ioctl+0x22/0x69
> > [  113.375653]  [<c019b435>] do_vfs_ioctl+0x41f/0x459
> > [  113.375734]  [<c02934eb>] ? sys_send+0x18/0x1a
> > [  113.375813]  [<c011942f>] ? do_page_fault+0x242/0x26f
> > [  113.375884]  [<c019b49b>] sys_ioctl+0x2c/0x45
> > [  113.375960]  [<c0102975>] syscall_call+0x7/0xb
> > [  113.376041] Rebooting in 5 seconds..



^ permalink raw reply

* Re: [PATCH] Add sk_mark route lookup support for IPv4 listening sockets, and for IPv4 multicast forwarding
From: Atis Elsts @ 2009-10-19 11:38 UTC (permalink / raw)
  To: steve
  Cc: Maciej Żenczykowski, David Miller, netdev, panther,
	eric.dumazet, brian.haley
In-Reply-To: <20091019082033.GB27230@fogou.chygwyn.com>

On Monday 19 October 2009 11:20:33 steve@chygwyn.com wrote:
>
> Another potential use case would be to segregate traffic into different
> routing domains (and thus being able to change the mark when moving from
> one routing domain to another might be useful).

I agree. Actually, one of our  users recenlty requested adding matcher in 
firewall that would match outgoing the packets by the routing table that was 
used to route them. (For now we found a workaround using tclassid, but that 
requires manual configuration.) So yes, it's an useful feature even excluding 
the tunnel cases.

I don't like the idea of using skb->mark for storing that information though, 
because I think these multiple uses of the same field would be too confusing 
for users, even if the default behavior remained the same as now. Also, 
consider the case when someone watch to match packets in post routing chain 
*both* by the mark that was set in prerouting chain, and routing table used 
to route the packet?

There already is free space (padding fieds) in struct dst_entry, so why not 
use this space to store the routing table? Speed is also not an issue, 
because the field only needs to be filled in slowpath routing lookup, and 
will be used only
1) if iptables are explicitly configured to match by it;
2) (?) in tunnel routing lookups. (no idea which is the best option here)

I see that struct rt6_info already has field
    struct fib6_table		*rt6i_table
so this matcher already can be made for IPv6 firewall. But IPv4 still is more 
imporant at the moment :)

Atis

^ permalink raw reply

* RE: PATCH: Network Device Naming mechanism and policy
From: Narendra_K @ 2009-10-19 11:30 UTC (permalink / raw)
  To: dannf, bhutchings
  Cc: netdev, linux-hotplug, Matt_Domsch, Jordan_Hargrave, Charles_Rose
In-Reply-To: <20091016214024.GA10091@ldl.fc.hp.com>

>> > > And how would the regular file look like in terms of holding 
>> > > ifindex of the interface, which can be passed to libnetdevname.
>> > 
>> > I can't think of anything we need to store in the regular file. If 
>> > we have the kernel name for the device, we can look up the ifindex 
>> > in /sys. Correct me if I'm wrong, but storing it ourselves seems 
>> > redundant.
>> 
>> But the name of a netdev can change whereas its ifindex never does.
>> Identifying netdevs by name would require additional work to update 
>> the links when a netdev is renamed and would still be prone to race 
>> conditions.  This is why Narendra and Matt were proposing to 
>store the 
>> ifindex in the node all along...
>
>Matt, Ben and I talked about a few other possibilities on IRC.
>The one I like the most at the moment is an idea Ben had to 
>creat dummy files named after the ifindex. Then, use symlinks 
>for the kernel name and the various by-$property 
>subdirectories. This means the KOBJ events will need to expose 
>the ifindex.
>

I suppose the KOBJ events already expose the ifindex of a network
interface. The file "/sys/class/net/ethN/uevent" contains INTERFACE=ethN
and IFINDEX=n already. But it looks like udev doesn't use it in any way.
For example, with the kernel patch the "/sys/class/net/ethN/uevent"
contains in addition to the above details, MAJOR=M and MINOR=m which the
udev knows how to make use of with a rule like 

SUBSYSTEM=="net", KERNEL!="tun", NAME="netdev/%k", MODE="0600".

>I'm a novice at net programming, but I'm told that ifindex is 
>the information apps ultimately require here.

Yes. The minor number of the device node is retreived by libnetdevname
by "stat"ing the pathname which happens to be ifindex of the device and
it is mapped to corresponding kernel name by "if_indextoname"  call.

With regards,
Narendra K  

^ permalink raw reply

* Re: [PATCH] Add sk_mark route lookup support for IPv4 listening sockets, and for IPv4 multicast forwarding
From: steve @ 2009-10-19  8:20 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: David Miller, atis, netdev, panther, eric.dumazet, brian.haley
In-Reply-To: <55a4f86e0910141133y4decdeb4v43d9168687bbb724@mail.gmail.com>

Hi,

On Wed, Oct 14, 2009 at 11:33:39AM -0700, Maciej Żenczykowski wrote:
> > I don't agree. There are two route lookups with a tunnel, the
> > internal one and the tunnel one. Here is an example of what I'm
> > thinking:
> >
> > 1. Look up a route which points at a remote ip addres via a tunnel device.
> >   The "setmark" on this route sets the skb mark
> 
> imho, this is much better done by having the mark setting performed
> explicitly by the tunnel device itself.
> That's also were we set ttl and qos (or inherit) on the outgoing packet).
>
Yes, I think (having looked at the code a bit more in the mean time)
that there is an argument for doing that. Although I still think that
setting the mark via the routing table would be a useful feature too.

 
> > 2. Look up a route on the tunnel itself (i.e. the tunnel endpoint not
> >   the socket endpoint) using the mark from the initial lookup. This
> >   route can depend on the previous lookup (if there are multiple
> >   routes for multiple marks) and also set the mark to use.
> 
> we would get the mark set by the tunneling module here.
> 
> > The default would be to inherit the mark over a route lookup, in
> > case that no "setmark" had been specified for that route. In
> > other words, it would be the same as it is now.
> 
> I'm not saying your solution wouldn't work, but I think it's less
> clean.  I don't think marking should be inherited (in the general
> case) in case of packet wrapping (whether via gre, ipip, sit, or other
> methods).
>
I guess we could say that inheriting the mark would not be the default
if the packet has gone through a device (whether virtual or physical)
then. That still seems ok to me since its basically what happens currently
I think.
 
> > The mark is supposed to be a generic thing, not just for routing
> > lookups, it can be used for classification, etc as well. I would
> > expect to see such a thing used for maybe specifying a VLAN or
> > a reference to an MPLS label stack, or something similar too,
> 
> Right, the mark can currently (as far as I know) be set in one of two
> ways - either from the mangle table (and it can also be matched on in
> netfilter) or by using setsockopt(SO_MARK).
> 
> Imagine a situation where you have a machine with routing already
> configured (pretty complex setup, tunnels, firewalls, etc) and you
> want to run a user space application that verifies (health-checks)
> some remote host (or something).  As part of the health check you want
> to verify a particular route to the destination.  This requires
> per-socket routing, which can (almost) be achieved by having proper
> routing (on fwmark) setup and using setsockopt(SO_MARK) on the health
> check socket in order to force specific routing.  These health checks
> may then of course be feedback into the routing system (ie. if they
> fail the routing rules get modified).  Note, that in particular we may
> want to be healthchecking routes that aren't even available in the
> default routing table (because they've currently been removed from the
> default table, because previous health checks failed).
> 
> Maciej

Yes, thats a good use case. I think there are a lot of other potential
use cases too though. A while back when I was looking into MPLS I wondered
about using the mark to index into a set of outgoing label stacks. That
was the original reason that I thought setting the mark via the routing
table would be useful. I've not really had the time to continue my
MPLS investigations recently though :(

Another potential use case would be to segregate traffic into different
routing domains (and thus being able to change the mark when moving from
one routing domain to another might be useful).

Steve.


^ permalink raw reply

* Re: [PATCH] AF_UNIX: Fix deadlock on connecting to shutdown socket
From: Américo Wang @ 2009-10-19  9:06 UTC (permalink / raw)
  To: Tomoki Sekiyama
  Cc: linux-kernel, netdev, alan, davem, satoshi.oshima.fk,
	hidehiro.kawai.ez, hideo.aoki.tk, masanori.yoshida.tv
In-Reply-To: <4ADC2A1D.2090303@hitachi.com>

On Mon, Oct 19, 2009 at 4:58 PM, Tomoki Sekiyama
<tomoki.sekiyama.qu@hitachi.com> wrote:
> Hi, thanks for testing!
>
> Américo Wang wrote:
>> On Mon, Oct 19, 2009 at 2:02 PM, Tomoki Sekiyama
>> <tomoki.sekiyama.qu@hitachi.com> wrote:
>>> Hi,
>>> I found a deadlock bug in UNIX domain socket, which makes able to DoS
>>> attack against the local machine by non-root users.
>>>
>>> How to reproduce:
>>> 1. Make a listening AF_UNIX/SOCK_STREAM socket with an abstruct
>>>    namespace(*), and shutdown(2) it.
>>>  2. Repeat connect(2)ing to the listening socket from the other sockets
>>>    until the connection backlog is full-filled.
>>>  3. connect(2) takes the CPU forever. If every core is taken, the
>>>    system hangs.
>>>
>>> PoC code: (Run as many times as cores on SMP machines.)
>
> Sorry for my ambiguous explanation ...
>
>> Interesting...
>>
>> I tried this with the following command:
>>
>> % for i in `seq 1 $(grep processor -c /proc/cpuinfo)`;
>> do ./unix-socket-dos-exploit; echo "=====$i====";done
> <snip>
>> My system doesn't hang at all.
>>
>> Am I missing something?
>>
>> Thanks!
>
> You should run the ./unix-socket-dos-exploit concurrently, like below:
>
> for i in {1..4} ; do ./unix-socket-dos-exploit & done
>
> # For safety reason, the PoC code stops in 15 seconds by alarm(15).

Hmm, you are right.

My system hangs for 10 or more seconds after I did what you said.
Confirmed.

Thanks!

^ permalink raw reply

* Re: [PATCH] AF_UNIX: Fix deadlock on connecting to shutdown socket
From: Tomoki Sekiyama @ 2009-10-19  8:58 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: alan, davem, satoshi.oshima.fk, hidehiro.kawai.ez, hideo.aoki.tk,
	masanori.yoshida.tv
In-Reply-To: <2375c9f90910190002m372edafq9a4c95d754640487@mail.gmail.com>

Hi, thanks for testing!

Américo Wang wrote:
> On Mon, Oct 19, 2009 at 2:02 PM, Tomoki Sekiyama
> <tomoki.sekiyama.qu@hitachi.com> wrote:
>> Hi,
>> I found a deadlock bug in UNIX domain socket, which makes able to DoS
>> attack against the local machine by non-root users.
>>
>> How to reproduce:
>> 1. Make a listening AF_UNIX/SOCK_STREAM socket with an abstruct
>>    namespace(*), and shutdown(2) it.
>>  2. Repeat connect(2)ing to the listening socket from the other sockets
>>    until the connection backlog is full-filled.
>>  3. connect(2) takes the CPU forever. If every core is taken, the
>>    system hangs.
>>
>> PoC code: (Run as many times as cores on SMP machines.)

Sorry for my ambiguous explanation ...

> Interesting...
> 
> I tried this with the following command:
> 
> % for i in `seq 1 $(grep processor -c /proc/cpuinfo)`;
> do ./unix-socket-dos-exploit; echo "=====$i====";done
<snip>
> My system doesn't hang at all.
> 
> Am I missing something?
>
> Thanks!

You should run the ./unix-socket-dos-exploit concurrently, like below:

for i in {1..4} ; do ./unix-socket-dos-exploit & done

# For safety reason, the PoC code stops in 15 seconds by alarm(15).

-- 
Tomoki Sekiyama
Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: tomoki.sekiyama.qu@hitachi.com

^ permalink raw reply

* Re: [PATCH] AF_UNIX: Fix deadlock on connecting to shutdown socket
From: Tomoki Sekiyama @ 2009-10-19  8:54 UTC (permalink / raw)
  To: xiyou.wangcong
  Cc: linux-kernel, netdev, alan, davem, satoshi.oshima.fk,
	hidehiro.kawai.ez, hideo.aoki.tk
In-Reply-To: <2375c9f90910190002m372edafq9a4c95d754640487@mail.gmail.com>

Hi, thanks for testing!

Américo Wang wrote:
> On Mon, Oct 19, 2009 at 2:02 PM, Tomoki Sekiyama
> <tomoki.sekiyama.qu@hitachi.com> wrote:
>> Hi,
>> I found a deadlock bug in UNIX domain socket, which makes able to DoS
>> attack against the local machine by non-root users.
>>
>> How to reproduce:
>> 1. Make a listening AF_UNIX/SOCK_STREAM socket with an abstruct
>>    namespace(*), and shutdown(2) it.
>>  2. Repeat connect(2)ing to the listening socket from the other sockets
>>    until the connection backlog is full-filled.
>>  3. connect(2) takes the CPU forever. If every core is taken, the
>>    system hangs.
>>
>> PoC code: (Run as many times as cores on SMP machines.)

Sorry for my ambiguous explanation ...

> Interesting...
> 
> I tried this with the following command:
> 
> % for i in `seq 1 $(grep processor -c /proc/cpuinfo)`;
> do ./unix-socket-dos-exploit; echo "=====$i====";done
<snip>
> My system doesn't hang at all.
> 
> Am I missing something?
>
> Thanks!

You should run the ./unix-socket-dos-exploit concurrently, like below:

for i in {1..4} ; do ./unix-socket-dos-exploit & done

# For safety reason, the PoC code stops in 15 seconds by alarm(15).

-- 
Tomoki Sekiyama
Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: tomoki.sekiyama.qu@hitachi.com

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox