Netdev List
* Re: Fw: [Bug 15907] New: IP_ADD_SOURCE_MEMBERSHIP after IP_ADD_MEMBERSHIP join on same multicast-group dont return EINVAL
From: David Stevens @ 2010-05-05 17:09 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Herbert Xu, netdev, netdev-owner
In-Reply-To: <20100505082138.6cb93aa1@nehalam>

This particular failure may be just a matter of translating the
EADDRINUSE check in IP_ADD_SOURCE_MEMBERSHIP to
return EINVAL rather than ignoring it. The more general change of
segregating SSM and ASM should go further than that (e.g., a boolean
to tell you which way the membership was added and checking in
all the operations).
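
The "boolean" idea above can be pictured with a small userspace model. This is only a sketch under assumptions: `struct membership`, `join_asm()` and `join_ssm()` are hypothetical names and not the kernel's data structures, and the EINVAL results follow the behaviour the bug report asks for rather than what the current code does.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-socket, per-group state; not a kernel structure. */
struct membership {
    bool joined;   /* a membership exists for this group/interface */
    bool is_ssm;   /* true if it was created by a source-specific join */
};

/* Any-source join (models IP_ADD_MEMBERSHIP). */
int join_asm(struct membership *m)
{
    if (m->joined)
        /* Assumption of this model: mixing modes is refused with EINVAL,
         * a repeated ASM join reports EADDRINUSE. */
        return m->is_ssm ? -EINVAL : -EADDRINUSE;
    m->joined = true;
    m->is_ssm = false;
    return 0;
}

/* Source-specific join (models IP_ADD_SOURCE_MEMBERSHIP). */
int join_ssm(struct membership *m)
{
    if (m->joined && !m->is_ssm)
        return -EINVAL;   /* group was already joined the ASM way */
    m->joined = true;
    m->is_ssm = true;
    return 0;             /* first source, or one more source */
}
```

With a flag like this, every operation can check how the membership was created, at the cost of refusing the ASM/SSM switching that the existing code allows.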

The code predates this informational RFC and allows you to
change back and forth between ASM and SSM where it makes
sense (a more liberal interpretation).

Of course, if existing applications are mixing them already, they
would break, and I'm not sure I agree it's a good thing to have
to destroy an existing membership and recreate it if you want to
switch from ASM to SSM.

I can look at this, but not for a few days at least; can review if
someone else does before I do.

                                        +-DLS


netdev-owner@vger.kernel.org wrote on 05/05/2010 08:21:38 AM:

> 
> 
> Begin forwarded message:
> 
> Date: Wed, 5 May 2010 09:48:40 GMT
> From: bugzilla-daemon@bugzilla.kernel.org
> To: shemminger@linux-foundation.org
> Subject: [Bug 15907] New: IP_ADD_SOURCE_MEMBERSHIP after IP_ADD_MEMBERSHIP
> join on same multicast-group dont return EINVAL
> 
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=15907
> 
>            Summary: IP_ADD_SOURCE_MEMBERSHIP after IP_ADD_MEMBERSHIP join
>                     on same multicast-group dont return EINVAL
>            Product: Networking
>            Version: 2.5
>     Kernel Version: 2.6.34-rc6
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: IPV4
>         AssignedTo: shemminger@linux-foundation.org
>         ReportedBy: mail@fholler.de
>         Regression: No
> 
> 
> Created an attachment (id=26225)
>  --> (https://bugzilla.kernel.org/attachment.cgi?id=26225)
> asm+ssm join test program
> 
> When an SSM IP_ADD_SOURCE_MEMBERSHIP is done after an ASM IP_ADD_MEMBERSHIP
> join on the same group (and same interface), the setsockopt operation should
> return EINVAL.
> 
> The Linux implementation returns success.
> 
> 
> 
> https://www3.tools.ietf.org/html/rfc3678#section-4.1.3
> 
> I attached a simple C test program.
> 
> -- 
> Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are the assignee for the bug.
> 
> 



* RE: [RFC PATCH net-next] drivers/net/ks*: Use netdev_<level>, netif_<level> and pr_<level>
From: Ha, Tristram @ 2010-05-05 17:06 UTC (permalink / raw)
  To: David Miller; +Cc: ben, Choi, David, richard.rojfors.ext, netdev, joe
In-Reply-To: <20100316.213133.102673436.davem@davemloft.net>

David Miller wrote:
> From: Joe Perches <joe@perches.com>
> Date: Sat, 27 Feb 2010 16:43:51 -0800
> 
>> I'm not sure this is correct.
>> 
>> It changes logging macros from:
>> 	dev_<level>(&ks->spidev->dev,
>> to
>> 	netdev_<level>(ks->netdev,
> 
> I'm just applying this, more than a week to get a review is more than
> enough.
> 
> Thanks Joe.

I have no objection to the patch, as it just switches to the latest kernel
logging calls.  Is this new code available in the current 2.6.34 release, or
will it be in the next version?

I generally download the latest 2.6.34-rc release to generate patches
for submission.


* RTL-8110SC lockup with r8169
From: Pádraig Brady @ 2010-05-05 16:05 UTC (permalink / raw)
  To: netdev; +Cc: Francois Romieu, Glen Gray

[-- Attachment #1: Type: text/plain, Size: 1812 bytes --]

Hi,

We're having an issue with the r8169 driver, where very often
(about 1 in 10 boots) it locks up and our netboot system hangs.
On this hardware we previously used FC5 with the r1000 driver without issue.

The interesting (problematic) thing is that the system needs to
be _power cycled_ to get the link detected again.
A reboot does not suffice. The symptoms sound like:

http://kerneltrap.org/mailarchive/linux-netdev/2009/3/6/5110184
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.32.y.git;a=commitdiff;h=ea8dbdd17099a9a5864ebd4c87e01e657b19c7ab

However the above code wasn't in the 2.6.32.10-90.fc12 driver we used.
Also I've back-ported the latest r8169 driver from git to our kernel
and it still has the same issue. The inconsequential diff
between our driver and the latest in git is attached just in case.

In the following dmesg you can see a successful boot.
Why does the link go down twice even in that case?
On lockup one does not see the "link up" message and
a power cycle is required, as the link is not detected
by the netboot ROM on a reboot.

r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
r8169 0000:01:04.0: PCI INT A -> Link[LNKA] -> GSI 11 (level, low) -> IRQ 11
r8169 0000:01:04.0: no PCI Express capability
eth0: RTL8169sc/8110sc at 0xde87c000, 00:60:ef:08:49:f5, XID 98000000 IRQ 11
r8169: eth0: link down
r8169: eth0: link down
r8169: eth0: link up

I'm in a good position to test any patches/ideas you may have.

cheers,
Pádraig.

Ancillary version details are...

# dmesg | grep 8169
# lspci -n | grep -v 8086:
01:04.0 0200: 10ec:8167 (rev 10)

# ethtool -i eth0
driver: r8169
version: 2.3LK-NAPI
firmware-version:
bus-info: 0000:01:04.0

# uname -a
Linux Unit-00:60:ef:08:49:f5 2.6.32.10-90.fc12.i686 #1 SMP Tue Mar 23 10:21:29 UTC 2010 i686 i686 i386 GNU/Linux

[-- Attachment #2: r8169-unimportant.diff --]
[-- Type: text/x-patch, Size: 11726 bytes --]

--- /home/padraig/kernel/r8169.c.latest	2010-05-05 11:42:54.345452634 +0100
+++ /home/padraig/kernel/r8169.c.new	2010-05-05 16:36:36.805627050 +0100
@@ -168,7 +168,7 @@
 static void rtl_hw_start_8168(struct net_device *);
 static void rtl_hw_start_8101(struct net_device *);
 
-static DEFINE_PCI_DEVICE_TABLE(rtl8169_pci_tbl) = {
+static struct pci_device_id rtl8169_pci_tbl[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_REALTEK,	0x8129), 0, 0, RTL_CFG_0 },
 	{ PCI_DEVICE(PCI_VENDOR_ID_REALTEK,	0x8136), 0, 0, RTL_CFG_2 },
 	{ PCI_DEVICE(PCI_VENDOR_ID_REALTEK,	0x8167), 0, 0, RTL_CFG_0 },
@@ -749,10 +749,12 @@
 	spin_lock_irqsave(&tp->lock, flags);
 	if (tp->link_ok(ioaddr)) {
 		netif_carrier_on(dev);
-		netif_info(tp, ifup, dev, "link up\n");
+		if (netif_msg_ifup(tp))
+			printk(KERN_INFO PFX "%s: link up\n", dev->name);
 	} else {
+		if (netif_msg_ifdown(tp))
+			printk(KERN_INFO PFX "%s: link down\n", dev->name);
 		netif_carrier_off(dev);
-		netif_info(tp, ifdown, dev, "link down\n");
 	}
 	spin_unlock_irqrestore(&tp->lock, flags);
 }
@@ -865,8 +867,11 @@
 	} else if (autoneg == AUTONEG_ENABLE)
 		RTL_W32(TBICSR, reg | TBINwEnable | TBINwRestart);
 	else {
-		netif_warn(tp, link, dev,
-			   "incorrect speed setting refused in TBI mode\n");
+		if (netif_msg_link(tp)) {
+			printk(KERN_WARNING "%s: "
+			       "incorrect speed setting refused in TBI mode\n",
+			       dev->name);
+		}
 		ret = -EOPNOTSUPP;
 	}
 
@@ -901,9 +906,9 @@
 		    (tp->mac_version != RTL_GIGA_MAC_VER_15) &&
 		    (tp->mac_version != RTL_GIGA_MAC_VER_16)) {
 			giga_ctrl |= ADVERTISE_1000FULL | ADVERTISE_1000HALF;
-		} else {
-			netif_info(tp, link, dev,
-				   "PHY does not support 1000Mbps\n");
+		} else if (netif_msg_link(tp)) {
+			printk(KERN_INFO "%s: PHY does not support 1000Mbps.\n",
+			       dev->name);
 		}
 
 		bmcr = BMCR_ANENABLE | BMCR_ANRESTART;
@@ -2705,7 +2710,8 @@
 	if (tp->link_ok(ioaddr))
 		goto out_unlock;
 
-	netif_warn(tp, link, dev, "PHY reset until link up\n");
+	if (netif_msg_link(tp))
+		printk(KERN_WARNING "%s: PHY reset until link up\n", dev->name);
 
 	tp->phy_reset_enable(ioaddr);
 
@@ -2776,7 +2782,8 @@
 			return;
 		msleep(1);
 	}
-	netif_err(tp, link, dev, "PHY reset failed\n");
+	if (netif_msg_link(tp))
+		printk(KERN_ERR "%s: PHY reset failed.\n", dev->name);
 }
 
 static void rtl8169_init_phy(struct net_device *dev, struct rtl8169_private *tp)
@@ -2810,8 +2817,8 @@
 	 */
 	rtl8169_set_speed(dev, AUTONEG_ENABLE, SPEED_1000, DUPLEX_FULL);
 
-	if (RTL_R8(PHYstatus) & TBI_Enable)
-		netif_info(tp, link, dev, "TBI auto-negotiating\n");
+	if ((RTL_R8(PHYstatus) & TBI_Enable) && netif_msg_link(tp))
+		printk(KERN_INFO PFX "%s: TBI auto-negotiating\n", dev->name);
 }
 
 static void rtl_rar_set(struct rtl8169_private *tp, u8 *addr)
@@ -3016,33 +3023,41 @@
 	/* enable device (incl. PCI PM wakeup and hotplug setup) */
 	rc = pci_enable_device(pdev);
 	if (rc < 0) {
-		netif_err(tp, probe, dev, "enable failure\n");
+		if (netif_msg_probe(tp))
+			dev_err(&pdev->dev, "enable failure\n");
 		goto err_out_free_dev_1;
 	}
 
 	if (pci_set_mwi(pdev) < 0)
-		netif_info(tp, probe, dev, "Mem-Wr-Inval unavailable\n");
+		if (netif_msg_probe(tp)) {
+			dev_err(&pdev->dev, "Mem-Wr-Inval unavailable\n");
+		}
 
 	/* make sure PCI base addr 1 is MMIO */
 	if (!(pci_resource_flags(pdev, region) & IORESOURCE_MEM)) {
-		netif_err(tp, probe, dev,
-			  "region #%d not an MMIO resource, aborting\n",
-			  region);
+		if (netif_msg_probe(tp)) {
+			dev_err(&pdev->dev,
+				"region #%d not an MMIO resource, aborting\n",
+				region);
+		}
 		rc = -ENODEV;
 		goto err_out_mwi_2;
 	}
 
 	/* check for weird/broken PCI region reporting */
 	if (pci_resource_len(pdev, region) < R8169_REGS_SIZE) {
-		netif_err(tp, probe, dev,
-			  "Invalid PCI region size(s), aborting\n");
+		if (netif_msg_probe(tp)) {
+			dev_err(&pdev->dev,
+				"Invalid PCI region size(s), aborting\n");
+		}
 		rc = -ENODEV;
 		goto err_out_mwi_2;
 	}
 
 	rc = pci_request_regions(pdev, MODULENAME);
 	if (rc < 0) {
-		netif_err(tp, probe, dev, "could not request regions\n");
+		if (netif_msg_probe(tp))
+			dev_err(&pdev->dev, "could not request regions.\n");
 		goto err_out_mwi_2;
 	}
 
@@ -3055,7 +3070,10 @@
 	} else {
 		rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
 		if (rc < 0) {
-			netif_err(tp, probe, dev, "DMA configuration failed\n");
+			if (netif_msg_probe(tp)) {
+				dev_err(&pdev->dev,
+					"DMA configuration failed.\n");
+			}
 			goto err_out_free_res_3;
 		}
 	}
@@ -3063,14 +3081,15 @@
 	/* ioremap MMIO region */
 	ioaddr = ioremap(pci_resource_start(pdev, region), R8169_REGS_SIZE);
 	if (!ioaddr) {
-		netif_err(tp, probe, dev, "cannot remap MMIO, aborting\n");
+		if (netif_msg_probe(tp))
+			dev_err(&pdev->dev, "cannot remap MMIO, aborting\n");
 		rc = -EIO;
 		goto err_out_free_res_3;
 	}
 
 	tp->pcie_cap = pci_find_capability(pdev, PCI_CAP_ID_EXP);
-	if (!tp->pcie_cap)
-		netif_info(tp, probe, dev, "no PCI Express capability\n");
+	if (!tp->pcie_cap && netif_msg_probe(tp))
+		dev_info(&pdev->dev, "no PCI Express capability\n");
 
 	RTL_W16(IntrMask, 0x0000);
 
@@ -3093,8 +3112,10 @@
 
 	/* Use appropriate default if unknown */
 	if (tp->mac_version == RTL_GIGA_MAC_NONE) {
-		netif_notice(tp, probe, dev,
-			     "unknown MAC, using family default\n");
+		if (netif_msg_probe(tp)) {
+			dev_notice(&pdev->dev,
+				   "unknown MAC, using family default\n");
+		}
 		tp->mac_version = cfg->default_ver;
 	}
 
@@ -3176,10 +3197,19 @@
 
 	pci_set_drvdata(pdev, dev);
 
-	netif_info(tp, probe, dev, "%s at 0x%lx, %pM, XID %08x IRQ %d\n",
-		   rtl_chip_info[tp->chipset].name,
-		   dev->base_addr, dev->dev_addr,
-		   (u32)(RTL_R32(TxConfig) & 0x9cf0f8ff), dev->irq);
+	if (netif_msg_probe(tp)) {
+		u32 xid = RTL_R32(TxConfig) & 0x9cf0f8ff;
+
+		printk(KERN_INFO "%s: %s at 0x%lx, "
+		       "%2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x, "
+		       "XID %08x IRQ %d\n",
+		       dev->name,
+		       rtl_chip_info[tp->chipset].name,
+		       dev->base_addr,
+		       dev->dev_addr[0], dev->dev_addr[1],
+		       dev->dev_addr[2], dev->dev_addr[3],
+		       dev->dev_addr[4], dev->dev_addr[5], xid, dev->irq);
+	}
 
 	rtl8169_init_phy(dev, tp);
 
@@ -3231,8 +3261,8 @@
 	unsigned int max_frame = mtu + VLAN_ETH_HLEN + ETH_FCS_LEN;
 
 	if (max_frame != 16383)
-		printk(KERN_WARNING PFX "WARNING! Changing of MTU on this "
-			"NIC may lead to frame reception errors!\n");
+		printk(KERN_WARNING "WARNING! Changing of MTU on this NIC "
+			"May lead to frame reception errors!\n");
 
 	tp->rx_buf_sz = (max_frame > RX_BUF_SIZE) ? max_frame : RX_BUF_SIZE;
 }
@@ -4131,10 +4161,10 @@
 
 	ret = rtl8169_open(dev);
 	if (unlikely(ret < 0)) {
-		if (net_ratelimit())
-			netif_err(tp, drv, dev,
-				  "reinit failure (status = %d). Rescheduling\n",
-				  ret);
+		if (net_ratelimit() && netif_msg_drv(tp)) {
+			printk(KERN_ERR PFX "%s: reinit failure (status = %d)."
+			       " Rescheduling.\n", dev->name, ret);
+		}
 		rtl8169_schedule_work(dev, rtl8169_reinit_task);
 	}
 
@@ -4164,8 +4194,10 @@
 		netif_wake_queue(dev);
 		rtl8169_check_link_status(dev, tp, tp->mmio_addr);
 	} else {
-		if (net_ratelimit())
-			netif_emerg(tp, intr, dev, "Rx buffers shortage\n");
+		if (net_ratelimit() && netif_msg_intr(tp)) {
+			printk(KERN_EMERG PFX "%s: Rx buffers shortage\n",
+			       dev->name);
+		}
 		rtl8169_schedule_work(dev, rtl8169_reset_task);
 	}
 
@@ -4253,7 +4285,11 @@
 	u32 opts1;
 
 	if (unlikely(TX_BUFFS_AVAIL(tp) < skb_shinfo(skb)->nr_frags)) {
-		netif_err(tp, drv, dev, "BUG! Tx Ring full when queue awake!\n");
+		if (netif_msg_drv(tp)) {
+			printk(KERN_ERR
+			       "%s: BUG! Tx Ring full when queue awake!\n",
+			       dev->name);
+		}
 		goto err_stop;
 	}
 
@@ -4315,8 +4351,11 @@
 	pci_read_config_word(pdev, PCI_COMMAND, &pci_cmd);
 	pci_read_config_word(pdev, PCI_STATUS, &pci_status);
 
-	netif_err(tp, intr, dev, "PCI error (cmd = 0x%04x, status = 0x%04x)\n",
-		  pci_cmd, pci_status);
+	if (netif_msg_intr(tp)) {
+		printk(KERN_ERR
+		       "%s: PCI error (cmd = 0x%04x, status = 0x%04x).\n",
+		       dev->name, pci_cmd, pci_status);
+	}
 
 	/*
 	 * The recovery sequence below admits a very elaborated explanation:
@@ -4340,7 +4379,8 @@
 
 	/* The infamous DAC f*ckup only happens at boot time */
 	if ((tp->cp_cmd & PCIDAC) && !tp->dirty_rx && !tp->cur_rx) {
-		netif_info(tp, intr, dev, "disabling PCI DAC\n");
+		if (netif_msg_intr(tp))
+			printk(KERN_INFO "%s: disabling PCI DAC.\n", dev->name);
 		tp->cp_cmd &= ~PCIDAC;
 		RTL_W16(CPlusCmd, tp->cp_cmd);
 		dev->features &= ~NETIF_F_HIGHDMA;
@@ -4432,12 +4472,13 @@
 	if (pkt_size >= rx_copybreak)
 		goto out;
 
-	skb = netdev_alloc_skb_ip_align(tp->dev, pkt_size);
+	skb = netdev_alloc_skb(tp->dev, pkt_size + NET_IP_ALIGN);
 	if (!skb)
 		goto out;
 
 	pci_dma_sync_single_for_cpu(tp->pci_dev, addr, pkt_size,
 				    PCI_DMA_FROMDEVICE);
+	skb_reserve(skb, NET_IP_ALIGN);
 	skb_copy_from_linear_data(*sk_buff, skb->data, pkt_size);
 	*sk_buff = skb;
 	done = true;
@@ -4475,8 +4516,11 @@
 		if (status & DescOwn)
 			break;
 		if (unlikely(status & RxRES)) {
-			netif_info(tp, rx_err, dev, "Rx ERROR. status = %08x\n",
-				   status);
+			if (netif_msg_rx_err(tp)) {
+				printk(KERN_INFO
+				       "%s: Rx ERROR. status = %08x\n",
+				       dev->name, status);
+			}
 			dev->stats.rx_errors++;
 			if (status & (RxRWT | RxRUNT))
 				dev->stats.rx_length_errors++;
@@ -4543,8 +4587,8 @@
 	tp->cur_rx = cur_rx;
 
 	delta = rtl8169_rx_fill(tp, dev, tp->dirty_rx, tp->cur_rx);
-	if (!delta && count)
-		netif_info(tp, intr, dev, "no Rx buffer allocated\n");
+	if (!delta && count && netif_msg_intr(tp))
+		printk(KERN_INFO "%s: no Rx buffer allocated\n", dev->name);
 	tp->dirty_rx += delta;
 
 	/*
@@ -4554,8 +4598,8 @@
 	 *   after refill ?
 	 * - how do others driver handle this condition (Uh oh...).
 	 */
-	if (tp->dirty_rx + NUM_RX_DESC == tp->cur_rx)
-		netif_emerg(tp, intr, dev, "Rx buffers exhausted\n");
+	if ((tp->dirty_rx + NUM_RX_DESC == tp->cur_rx) && netif_msg_intr(tp))
+		printk(KERN_EMERG "%s: Rx buffers exhausted\n", dev->name);
 
 	return count;
 }
@@ -4610,9 +4654,10 @@
 
 			if (likely(napi_schedule_prep(&tp->napi)))
 				__napi_schedule(&tp->napi);
-			else
-				netif_info(tp, intr, dev,
-					   "interrupt %04x in poll\n", status);
+			else if (netif_msg_intr(tp)) {
+				printk(KERN_INFO "%s: interrupt %04x in poll\n",
+				dev->name, status);
+			}
 		}
 
 		/* We only get a new MSI interrupt when all active irq
@@ -4748,22 +4793,27 @@
 
 	if (dev->flags & IFF_PROMISC) {
 		/* Unconditionally log net taps. */
-		netif_notice(tp, link, dev, "Promiscuous mode enabled\n");
+		if (netif_msg_link(tp)) {
+			printk(KERN_NOTICE "%s: Promiscuous mode enabled.\n",
+			       dev->name);
+		}
 		rx_mode =
 		    AcceptBroadcast | AcceptMulticast | AcceptMyPhys |
 		    AcceptAllPhys;
 		mc_filter[1] = mc_filter[0] = 0xffffffff;
-	} else if ((netdev_mc_count(dev) > multicast_filter_limit) ||
-		   (dev->flags & IFF_ALLMULTI)) {
+	} else if ((dev->mc_count > multicast_filter_limit)
+		   || (dev->flags & IFF_ALLMULTI)) {
 		/* Too many to filter perfectly -- accept all multicasts. */
 		rx_mode = AcceptBroadcast | AcceptMulticast | AcceptMyPhys;
 		mc_filter[1] = mc_filter[0] = 0xffffffff;
 	} else {
 		struct dev_mc_list *mclist;
+		unsigned int i;
 
 		rx_mode = AcceptBroadcast | AcceptMyPhys;
 		mc_filter[1] = mc_filter[0] = 0;
-		netdev_for_each_mc_addr(mclist, dev) {
+		for (i = 0, mclist = dev->mc_list; mclist && i < dev->mc_count;
+		     i++, mclist = mclist->next) {
 			int bit_nr = ether_crc(ETH_ALEN, mclist->dmi_addr) >> 26;
 			mc_filter[bit_nr >> 5] |= 1 << (bit_nr & 31);
 			rx_mode |= AcceptMulticast;


* Re: linux kernel's IPV6_MULTICAST_HOPS default is 64; should be 1?
From: Brian Haley @ 2010-05-05 15:36 UTC (permalink / raw)
  To: David Miller; +Cc: dlstevens, enh, netdev, netdev-owner
In-Reply-To: <20100504.144647.157477097.davem@davemloft.net>

David Miller wrote:
> From: Brian Haley <brian.haley@hp.com>
> Date: Tue, 04 May 2010 10:40:58 -0400
> 
>> Specifying -1 for setsockopt(IPV6_MULTICAST_HOPS) should set the socket
>> value back to the system default value of IPV6_DEFAULT_MCASTHOPS (1).
>>
>> Signed-off-by: Brian Haley <brian.haley@hp.com>
> 
> In case it wasn't clear from my other reply, I'm not applying this
> patch because I intentionally left this behavior there based upon
> some comments from Elliot in that this lets developers get the
> old default by asking for "-1" explicitly with a setsockopt.

I now see that in Elliot's email, but I think it's incorrect.  The RFC
says that setting it to -1 should get you the kernel default, which is
now 1.  Without this change, setting it to -1 will get you 64, the
old behavior.  If the user wants to, they can always just set it to
64 themselves, that's better than assuming when you set it to -1
you're going to get 64.
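
The value handling argued for here can be written down as a small model. This is a sketch, not kernel code: `resolve_mcast_hops()` and `DEFAULT_MCAST_HOPS` are hypothetical names, and the range rule is taken from the socket API RFC (assumed here to be RFC 3493, which the message does not name explicitly).

```c
#include <errno.h>

#define DEFAULT_MCAST_HOPS 1   /* assumed system default, as stated in this thread */

/*
 * Map the value passed to setsockopt(IPV6_MULTICAST_HOPS) to the hop
 * limit that would be stored on the socket.
 *   requested < -1 or > 255 : rejected with EINVAL
 *   requested == -1         : system default
 *   0 <= requested <= 255   : used as given
 */
int resolve_mcast_hops(int requested, int *hops)
{
    if (requested < -1 || requested > 255)
        return -EINVAL;
    *hops = (requested == -1) ? DEFAULT_MCAST_HOPS : requested;
    return 0;
}
```

Under this rule an application that wants the old hop limit asks for 64 explicitly, and -1 always means "whatever the system default is".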

I'm just trying to make this follow the RFC and behave like other OSes
for consistency.

-Brian


* Fw: [Bug 15907] New: IP_ADD_SOURCE_MEMBERSHIP after IP_ADD_MEMBERSHIP join on same multicast-group dont return EINVAL
From: Stephen Hemminger @ 2010-05-05 15:21 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev



Begin forwarded message:

Date: Wed, 5 May 2010 09:48:40 GMT
From: bugzilla-daemon@bugzilla.kernel.org
To: shemminger@linux-foundation.org
Subject: [Bug 15907] New: IP_ADD_SOURCE_MEMBERSHIP after IP_ADD_MEMBERSHIP join on same multicast-group dont return EINVAL


https://bugzilla.kernel.org/show_bug.cgi?id=15907

           Summary: IP_ADD_SOURCE_MEMBERSHIP after IP_ADD_MEMBERSHIP join
                    on same multicast-group dont return EINVAL
           Product: Networking
           Version: 2.5
    Kernel Version: 2.6.34-rc6
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: IPV4
        AssignedTo: shemminger@linux-foundation.org
        ReportedBy: mail@fholler.de
        Regression: No


Created an attachment (id=26225)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=26225)
asm+ssm join test program

When an SSM IP_ADD_SOURCE_MEMBERSHIP is done after an ASM IP_ADD_MEMBERSHIP
join on the same group (and same interface), the setsockopt operation should
return EINVAL.

The Linux implementation returns success.



https://www3.tools.ietf.org/html/rfc3678#section-4.1.3

I attached a simple C test program.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


-- 


* Re: [PATCH v2] net/gianfar: drop recycled skbs on MTU change
From: Andy Fleming @ 2010-05-05 15:18 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: David Miller, afleming, netdev
In-Reply-To: <20100505083047.GA4398@Chamillionaire.breakpoint.cc>


On May 5, 2010, at 3:30 AM, Sebastian Andrzej Siewior wrote:

> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> 
> The size of the skbs added to the recycled list is the current
> descriptor size, which follows the current MTU. gfar_new_skb() is also
> using this size. So after changing, or at least increasing, the MTU, all
> recycled skbs should be dropped.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>>> I think we should probably do this in free_skb_resources.  And remove
>>> the call from gfar_close().
>> 
>> Ok, Sebastian please rework your patch as requested by Andy.
> 
> This has the side effect of dropping them on reset which is not
> required.

Sure, but that's true of the buffers in the BD ring, too.  My theory, here, is just that it's best to treat it the same as the other "skb resources".  If we want to avoid reallocation during a reset, that's a separate patch.  :)
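
The "treat it like the other skb resources" argument can be sketched with a toy buffer pool. Everything here is hypothetical (the `recycle_pool` type and the `pool_*` and `free_resources()` names are not gianfar code); the point is only that one teardown path purges the recycle list, so an MTU change and a reset behave the same way.

```c
#include <stdlib.h>

/* Hypothetical recycle pool: a stack of buffers that all have buf_size bytes. */
struct recycle_pool {
    void  *bufs[16];
    size_t count;
    size_t buf_size;
};

/* Drop every cached buffer; returns how many were freed. */
size_t pool_purge(struct recycle_pool *p)
{
    size_t freed = p->count;

    while (p->count > 0)
        free(p->bufs[--p->count]);
    return freed;
}

/* Keep a used buffer for reuse if there is room, else free it. */
void pool_put(struct recycle_pool *p, void *buf)
{
    if (p->count < sizeof(p->bufs) / sizeof(p->bufs[0]))
        p->bufs[p->count++] = buf;
    else
        free(buf);
}

/* Reuse a cached buffer or allocate a fresh one of the current size. */
void *pool_get(struct recycle_pool *p)
{
    return p->count ? p->bufs[--p->count] : malloc(p->buf_size);
}

/* Single teardown path, used for MTU change and reset alike. */
void free_resources(struct recycle_pool *p, size_t new_buf_size)
{
    pool_purge(p);               /* stale-sized buffers must not survive */
    p->buf_size = new_buf_size;
}
```

Purging on reset costs a few reallocations, which is the trade-off discussed above; avoiding that would need a separate size check in the reset path.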

Acked-by: Andy Fleming <afleming@freescale.com>



* Re: [PATCH 2/6] netns: Teach network device kobjects which namespace they are in.
From: Serge E. Hallyn @ 2010-05-05 15:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Kay Sievers, linux-kernel, Tejun Heo,
	Cornelia Huck, Eric Dumazet, Benjamin LaHaise, netdev,
	David Miller
In-Reply-To: <1273019809-16472-2-git-send-email-ebiederm@xmission.com>

Quoting Eric W. Biederman (ebiederm@xmission.com):
> diff --git a/net/Kconfig b/net/Kconfig
> index 041c35e..265e33b 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -45,6 +45,14 @@ config COMPAT_NETLINK_MESSAGES
> 
>  menu "Networking options"
> 
> +config NET_NS
> +	bool "Network namespace support"
> +	default n
> +	depends on EXPERIMENTAL && NAMESPACES
> +	help
> +	  Allow user space to create what appear to be multiple instances
> +	  of the network stack.
> +

Hi Eric,

I'm confused - NET_NS is defined in init/Kconfig right now.  Is the tree
you're working from very different from mine, or is this the unfortunate
result of the patches sitting so long?

>  source "net/packet/Kconfig"
>  source "net/unix/Kconfig"
>  source "net/xfrm/Kconfig"
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 099c753..1b98e36 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -13,7 +13,9 @@
>  #include <linux/kernel.h>
>  #include <linux/netdevice.h>
>  #include <linux/if_arp.h>
> +#include <linux/nsproxy.h>
>  #include <net/sock.h>
> +#include <net/net_namespace.h>
>  #include <linux/rtnetlink.h>
>  #include <linux/wireless.h>
>  #include <net/wext.h>
> @@ -466,6 +468,37 @@ static struct attribute_group wireless_group = {
>  };
>  #endif
> 
> +static const void *net_current_ns(void)
> +{
> +	return current->nsproxy->net_ns;
> +}
> +
> +static const void *net_initial_ns(void)
> +{
> +	return &init_net;
> +}
> +
> +static const void *net_netlink_ns(struct sock *sk)
> +{
> +	return sock_net(sk);
> +}
> +
> +static struct kobj_ns_type_operations net_ns_type_operations = {
> +	.type = KOBJ_NS_TYPE_NET,
> +	.current_ns = net_current_ns,
> +	.netlink_ns = net_netlink_ns,
> +	.initial_ns = net_initial_ns,
> +};
> +
> +static void net_kobj_ns_exit(struct net *net)
> +{
> +	kobj_ns_exit(KOBJ_NS_TYPE_NET, net);
> +}
> +
> +static struct pernet_operations sysfs_net_ops = {
> +	.exit = net_kobj_ns_exit,
> +};
> +
>  #endif /* CONFIG_SYSFS */

...

>  int netdev_kobject_init(void)
>  {
> +	kobj_ns_type_register(&net_ns_type_operations);
> +#ifdef CONFIG_SYSFS
> +	register_pernet_subsys(&sysfs_net_ops);
> +#endif
>  	return class_register(&net_class);

I think the kobj_ns_type_register() needs to be under
ifdef CONFIG_SYSFS as well, because net_ns_type_operations is defined
under ifdef CONFIG_SYSFS.
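
The build problem can be reproduced in miniature. This is a standalone sketch with hypothetical stand-ins (`struct ns_type_ops`, `ns_type_register()`, `kobject_init_model()`), not the kernel code: an object that only exists under a config symbol may only be referenced under the same symbol.

```c
#include <stddef.h>

#define CONFIG_SYSFS 1   /* comment this out to model a build without sysfs */

struct ns_type_ops { const char *name; };      /* hypothetical stand-in */

static const struct ns_type_ops *registered;   /* last registered ops, if any */

static void ns_type_register(const struct ns_type_ops *ops)
{
    registered = ops;
}

#ifdef CONFIG_SYSFS
/* Only defined when sysfs support is configured in. */
static const struct ns_type_ops net_ns_ops = { .name = "net" };
#endif

/* Returns 1 if an ops structure was registered, 0 otherwise. */
int kobject_init_model(void)
{
#ifdef CONFIG_SYSFS
    /* Safe: net_ns_ops exists whenever this line is compiled. */
    ns_type_register(&net_ns_ops);
#endif
    return registered != NULL;
}
```

Moving the `ns_type_register()` call outside the `#ifdef` would leave a reference to an undeclared `net_ns_ops` when the symbol is not defined, which is the failure being pointed out.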

-serge


* [PATCH] nf_conntrack_core.c: fix for dead connection after flushing conntrack cache
From: Joerg Marx @ 2010-05-05 14:46 UTC (permalink / raw)
  To: netdev; +Cc: kaber

Hi,

We encountered a weird problem of a 'stalled' connection when using
'conntrack -F' on a box with heavy network load. 'conntrack -L' sometimes
showed an [UNREPLIED] entry for the traffic in question, but there was no
traffic flow and no matching of packets in or out; only the timer counted down
from 600 to 0 (it was ESP traffic, so the default generic timeout of 600
seconds applied). After the entry vanished by timeout (or after another
'conntrack -F'), everything worked normally again.

The cause we finally found is a race window in 'nf_conntrack_confirm'
when it calls '__nf_conntrack_confirm':

'nf_conntrack_confirm' checks (without holding a lock) whether the entry
to be confirmed is dying: !nf_ct_is_dying(ct).
If it is not, __nf_conntrack_confirm does some sanity checking, grabs a
spin_lock_bh and inserts the 'ct' into the lookup cache.

Now consider the following scenario:

1. a connection has seen the first packet already -> state is UNREPLIED
2. now the answer is to be sent, conntrack wants to confirm the connection
3. the !nf_ct_is_dying(ct) check is passed, __nf_conntrack_confirm is just
started
4. in a user context a 'conntrack -F' command is running right now e.g. on
another CPU
5. this will flag all unconfirmed connections as 'dying' in
get_next_corpse(...), including the entry going to be confirmed!
6. now the already 'dying' entry is included into the hash cache in
__nf_conntrack_confirm - BOOM!

After this step the connection in question is dead, because no packets are
forwarded until the entry is purged from hash cache. This was a big blocker
for us, because each dead IPsec tunnel is a dead branch network for 10
minutes...

For every packet from now on, 'nf_conntrack_find_get' will ignore the entry
because it is dying, and because __nf_conntrack_confirm finds the hash in
the cache already, it will NF_DROP the packet.

The key for finding this was 'NF_CT_STAT_INC(net, insert_failed)' in
__nf_conntrack_confirm.

The suggested solution is to check for '!nf_ct_is_dying(ct)' again, _after_
the spin_lock_bh is grabbed in __nf_conntrack_confirm, so that it is clear
no other softirq or user context can set that 'evil' dying flag ;-)
The return value in this case should be NF_ACCEPT, so that we lose no packets;
this is also important for us.
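
The check, lock, re-check pattern can be modelled in userspace. This is a sketch under assumptions: `struct entry`, `flush_entry()`, `confirm_entry()` and `confirm_locked()` are hypothetical stand-ins for the conntrack entry, the flush path, nf_conntrack_confirm and __nf_conntrack_confirm, and a pthread mutex plays the role of the spinlock.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* VERDICT_DROP is listed for completeness; this model never produces it. */
enum verdict { VERDICT_DROP, VERDICT_ACCEPT };

/* Hypothetical stand-in for a conntrack entry. */
struct entry {
    atomic_bool dying;   /* set by the flusher while holding table_lock */
    bool hashed;         /* true once the entry is in the lookup table */
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Flusher side: mark the entry as dying under the lock. */
void flush_entry(struct entry *e)
{
    pthread_mutex_lock(&table_lock);
    atomic_store(&e->dying, true);
    pthread_mutex_unlock(&table_lock);
}

/* Locked part of the confirm path, with the proposed re-check. */
enum verdict confirm_locked(struct entry *e)
{
    pthread_mutex_lock(&table_lock);
    if (atomic_load(&e->dying)) {
        /* The flusher won the race: accept the packet, do not hash. */
        pthread_mutex_unlock(&table_lock);
        return VERDICT_ACCEPT;
    }
    e->hashed = true;
    pthread_mutex_unlock(&table_lock);
    return VERDICT_ACCEPT;
}

/* Outer confirm path: the unlocked check is only a hint. */
enum verdict confirm_entry(struct entry *e)
{
    if (atomic_load(&e->dying))
        return VERDICT_ACCEPT;
    /* Race window: flush_entry() may run here on another CPU. */
    return confirm_locked(e);
}
```

Calling `confirm_locked()` directly on an entry that has already been flushed simulates the race window: the entry stays out of the table and the packet is still accepted.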


---
 net/netfilter/nf_conntrack_core.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 1374179..e2c8bfe 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -413,6 +413,11 @@ __nf_conntrack_confirm(struct sk_buff *skb)

 	spin_lock_bh(&nf_conntrack_lock);

+	if (unlikely(nf_ct_is_dying(ct))) {
+		spin_unlock_bh(&nf_conntrack_lock);
+		return NF_ACCEPT;
+	}
+
 	/* See if there's one in the list already, including reverse:
 	   NAT could have grabbed it without realizing, since we're
 	   not in the hash.  If there is, we lost race. */
-- 1.5.6.5

Best regards
Joerg.


-- 


* Kernel 2.6.33-2 - ip -6 addr list does not show any addresses
From: Samuel Suter @ 2010-05-05 14:43 UTC (permalink / raw)
  To: netdev

Hello all

I am repeatedly hitting a problem on two of my servers running 
kernel.org 2.6.33-2 on Debian 5.0 x86_64 using the Debian 5.0 stable 
iproute package (iproute2-ss080725).

There is a clear discrepancy between the information returned by 
'ifconfig' and 'ip addr' (or 'ip -6 addr'). This server has 26 physical 
interfaces and ifconfig lists all of them with all their assigned 
addresses, but 'ip addr' only lists my loopback and one of the interfaces.

The actual problem emerges when I try and flush all addresses and add 
them again:

# ip -6 addr flush dev eth24
Nothing to flush.
# ip -6 addr add 2001:470:921b:7845::3d/64 dev eth24
RTNETLINK answers: File exists
# ifconfig eth24
eth24     Link encap:Ethernet  HWaddr 00:22:19:c4:7a:1d 
          inet addr:10.7.24.61  Bcast:0.0.0.0  Mask:255.255.255.128
          inet6 addr: 2001:470:921b:7845::77/64 Scope:Global
          inet6 addr: 2001:470:921b:7845::66/64 Scope:Global
          inet6 addr: 2001:470:921b:7845::55/64 Scope:Global
          inet6 addr: 2001:470:921b:7845::44/64 Scope:Global
          inet6 addr: 2001:470:921b:7845::76/64 Scope:Global
          inet6 addr: 2001:470:921b:7845::67/64 Scope:Global
          .
          .
          .
# ip -6 addr list dev eth24
#

This shows that 'ifconfig' knows about these addresses, but 'ip' does 
not. All the IPv6 addresses show up in /proc/net/if_inet6.
# grep eth24 /proc/net/if_inet6 | wc -l
57
# ifconfig eth24 | grep inet6 | wc -l
57
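
The same count can be taken programmatically. The sketch below assumes the usual /proc/net/if_inet6 line layout (32 hex digits of address, then interface index, prefix length, scope and flags in hex, then the device name); `count_inet6_addrs()` is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>

/* Count the lines in an if_inet6-style stream that belong to `dev`. */
int count_inet6_addrs(FILE *fp, const char *dev)
{
    char addr[33], name[32];
    unsigned int ifindex, plen, scope, flags;
    int count = 0;

    /* Each record has six whitespace-separated fields. */
    while (fscanf(fp, "%32s %x %x %x %x %31s",
                  addr, &ifindex, &plen, &scope, &flags, name) == 6) {
        if (strcmp(name, dev) == 0)
            count++;
    }
    return count;
}
```

Opening the file with `fopen("/proc/net/if_inet6", "r")` and passing "eth24" should give the same number as the grep above, which makes it easy to compare against what a netlink dump returns.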

This leaves me in a situation where I am unable to flush the IP 
addresses on a device.

This problem seems to happen more often on a server running the kernel.org 
2.6.33-2 kernel than on the Debian-patched 2.6.26, though it did 
happen with the Debian kernel.

Regards

Samuel Suter


* Re: [PATCH -next 2/3] bnx2: Add prefetches to rx path.
From: Michael Chan @ 2010-05-05 15:01 UTC (permalink / raw)
  To: 'Eric Dumazet'; +Cc: davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <1273038330.2304.10.camel@edumazet-laptop>

Eric Dumazet wrote:

> > @@ -3097,7 +3099,11 @@ bnx2_rx_int(struct bnx2 *bp, struct bnx2_napi *bnapi, int budget)
> >
> >             rx_buf = &rxr->rx_buf_ring[sw_ring_cons];
> >             skb = rx_buf->skb;
> > +           prefetch(skb);
>
> why not a prefetchw() ?

Yes, didn't know there was a prefetchw() before you pointed it
out.
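
For reference, the difference between the two hints can be shown with the compiler builtin. This is a userspace sketch, not bnx2 code: `struct rx_slot` and `drain_ring()` are hypothetical, and it assumes GCC or Clang, where the second argument of `__builtin_prefetch()` selects a read (0) or write (1) hint.

```c
#include <stddef.h>

struct rx_slot { int len; };   /* hypothetical ring entry */

/*
 * Sum and clear the lengths in a ring. Each slot is about to be
 * modified, so the next one is prefetched with a write hint, which is
 * the case prefetchw() is meant for. A slot that is only read would
 * use __builtin_prefetch(ptr, 0) instead.
 */
long drain_ring(struct rx_slot *ring, size_t n)
{
    long total = 0;

    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n)
            __builtin_prefetch(&ring[i + 1], 1);   /* write hint */
        total += ring[i].len;
        ring[i].len = 0;                           /* the slot is written */
    }
    return total;
}
```

The hint never changes the result; it only asks the CPU to fetch the cache line in a state suitable for a later store.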

>
> >
> > +           next_rx_buf =
> > +                   &rxr->rx_buf_ring[RX_RING_IDX(NEXT_RX_BD(sw_cons))];
> > +           prefetch(next_rx_buf->desc);
>
> So cpu is allowed to start a memory transaction on next_skb->data, while
> not yet DMA unmapped ?

Very good point.  The prefetch() will not work and will be
wasted on systems that have pci_dma_sync_...() defined.  I
think we can skip the prefetch if pci_dma_sync_...() is
defined.  The logic to determine if the next descriptor is
ready, dma_sync it, and prefetch it will be complicated and
we may end up not gaining any performance in the end.

What do you think?  Thanks.




* [RFC 5/5] net/ucc_geth: use generic recycling infrastructure
From: Sebastian Andrzej Siewior @ 2010-05-05 14:47 UTC (permalink / raw)
  To: netdev; +Cc: tglx, Sebastian Andrzej Siewior
In-Reply-To: <1273070870-7821-1-git-send-email-sebastian@breakpoint.cc>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/ucc_geth.c |   24 +++++++-----------------
 drivers/net/ucc_geth.h |    2 --
 2 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c
index 081f76b..d5af4f1 100644
--- a/drivers/net/ucc_geth.c
+++ b/drivers/net/ucc_geth.c
@@ -210,10 +210,7 @@ static struct sk_buff *get_new_skb(struct ucc_geth_private *ugeth,
 {
 	struct sk_buff *skb = NULL;
 
-	skb = __skb_dequeue(&ugeth->rx_recycle);
-	if (!skb)
-		skb = dev_alloc_skb(ugeth->ug_info->uf_info.max_rx_buf_length +
-				    UCC_GETH_RX_DATA_BUF_ALIGNMENT);
+	skb = net_recycle_get(ugeth->ndev);
 	if (skb == NULL)
 		return NULL;
 
@@ -1992,8 +1989,6 @@ static void ucc_geth_memclean(struct ucc_geth_private *ugeth)
 		iounmap(ugeth->ug_regs);
 		ugeth->ug_regs = NULL;
 	}
-
-	skb_queue_purge(&ugeth->rx_recycle);
 }
 
 static void ucc_geth_set_multi(struct net_device *dev)
@@ -2069,6 +2064,7 @@ static void ucc_geth_stop(struct ucc_geth_private *ugeth)
 	ugeth->phydev = NULL;
 
 	ucc_geth_memclean(ugeth);
+	net_recycle_cleanup(ugeth->ndev);
 }
 
 static int ucc_struct_init(struct ucc_geth_private *ugeth)
@@ -2205,9 +2201,6 @@ static int ucc_struct_init(struct ucc_geth_private *ugeth)
 			ugeth_err("%s: Failed to ioremap regs.", __func__);
 		return -ENOMEM;
 	}
-
-	skb_queue_head_init(&ugeth->rx_recycle);
-
 	return 0;
 }
 
@@ -3217,7 +3210,7 @@ static int ucc_geth_rx(struct ucc_geth_private *ugeth, u8 rxQ, int rx_work_limit
 					   __func__, __LINE__, (u32) skb);
 			if (skb) {
 				skb->data = skb->head + NET_SKB_PAD;
-				__skb_queue_head(&ugeth->rx_recycle, skb);
+				net_recycle_add(dev, skb);
 			}
 
 			ugeth->rx_skbuff[rxQ][ugeth->skb_currx[rxQ]] = NULL;
@@ -3288,13 +3281,7 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
 
 		dev->stats.tx_packets++;
 
-		if (skb_queue_len(&ugeth->rx_recycle) < RX_BD_RING_LEN &&
-			     skb_recycle_check(skb,
-				    ugeth->ug_info->uf_info.max_rx_buf_length +
-				    UCC_GETH_RX_DATA_BUF_ALIGNMENT))
-			__skb_queue_head(&ugeth->rx_recycle, skb);
-		else
-			dev_kfree_skb(skb);
+		net_recycle_add(dev, skb);
 
 		ugeth->tx_skbuff[txQ][ugeth->skb_dirtytx[txQ]] = NULL;
 		ugeth->skb_dirtytx[txQ] =
@@ -3915,6 +3902,9 @@ static int ucc_geth_probe(struct of_device* ofdev, const struct of_device_id *ma
 	netif_napi_add(dev, &ugeth->napi, ucc_geth_poll, 64);
 	dev->mtu = 1500;
 
+	net_recycle_init(dev, RX_BD_RING_LEN, ug_info->uf_info.max_rx_buf_length
+			+ UCC_GETH_RX_DATA_BUF_ALIGNMENT);
+
 	ugeth->msg_enable = netif_msg_init(debug.msg_enable, UGETH_MSG_DEFAULT);
 	ugeth->phy_interface = phy_interface;
 	ugeth->max_speed = max_speed;
diff --git a/drivers/net/ucc_geth.h b/drivers/net/ucc_geth.h
index ef1fbeb..55708c5 100644
--- a/drivers/net/ucc_geth.h
+++ b/drivers/net/ucc_geth.h
@@ -1213,8 +1213,6 @@ struct ucc_geth_private {
 	/* index of the first skb which hasn't been transmitted yet. */
 	u16 skb_dirtytx[NUM_TX_QUEUES];
 
-	struct sk_buff_head rx_recycle;
-
 	struct ugeth_mii_info *mii_info;
 	struct phy_device *phydev;
 	phy_interface_t phy_interface;
-- 
1.6.6.1


^ permalink raw reply related

* [RFC 4/5] net/stmmac: use generic recycling infrastructure
From: Sebastian Andrzej Siewior @ 2010-05-05 14:47 UTC (permalink / raw)
  To: netdev; +Cc: tglx, Sebastian Andrzej Siewior
In-Reply-To: <1273070870-7821-1-git-send-email-sebastian@breakpoint.cc>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/stmmac/stmmac.h      |    1 -
 drivers/net/stmmac/stmmac_main.c |   26 +++++++-------------------
 2 files changed, 7 insertions(+), 20 deletions(-)

diff --git a/drivers/net/stmmac/stmmac.h b/drivers/net/stmmac/stmmac.h
index ebebc64..dbf9f95 100644
--- a/drivers/net/stmmac/stmmac.h
+++ b/drivers/net/stmmac/stmmac.h
@@ -44,7 +44,6 @@ struct stmmac_priv {
 	unsigned int dirty_rx;
 	struct sk_buff **rx_skbuff;
 	dma_addr_t *rx_skbuff_dma;
-	struct sk_buff_head rx_recycle;
 
 	struct net_device *dev;
 	int is_gmac;
diff --git a/drivers/net/stmmac/stmmac_main.c b/drivers/net/stmmac/stmmac_main.c
index 7ac6dde..14c4e25 100644
--- a/drivers/net/stmmac/stmmac_main.c
+++ b/drivers/net/stmmac/stmmac_main.c
@@ -646,18 +646,7 @@ static void stmmac_tx(struct stmmac_priv *priv)
 			p->des3 = 0;
 
 		if (likely(skb != NULL)) {
-			/*
-			 * If there's room in the queue (limit it to size)
-			 * we add this skb back into the pool,
-			 * if it's the right size.
-			 */
-			if ((skb_queue_len(&priv->rx_recycle) <
-				priv->dma_rx_size) &&
-				skb_recycle_check(skb, priv->dma_buf_sz))
-				__skb_queue_head(&priv->rx_recycle, skb);
-			else
-				dev_kfree_skb(skb);
-
+			net_recycle_add(priv->dev, skb);
 			priv->tx_skbuff[entry] = NULL;
 		}
 
@@ -860,6 +849,9 @@ static int stmmac_open(struct net_device *dev)
 	priv->dma_buf_sz = STMMAC_ALIGN(buf_sz);
 	init_dma_desc_rings(dev);
 
+	net_recycle_init(priv->dev, priv->dma_rx_size, priv->dma_buf_sz +
+			NET_IP_ALIGN);
+
 	/* DMA initialization and SW reset */
 	if (unlikely(priv->hw->dma->init(ioaddr, priv->pbl, priv->dma_tx_phy,
 					 priv->dma_rx_phy) < 0)) {
@@ -911,7 +903,6 @@ static int stmmac_open(struct net_device *dev)
 		phy_start(priv->phydev);
 
 	napi_enable(&priv->napi);
-	skb_queue_head_init(&priv->rx_recycle);
 	netif_start_queue(dev);
 	return 0;
 }
@@ -942,7 +933,7 @@ static int stmmac_release(struct net_device *dev)
 		kfree(priv->tm);
 #endif
 	napi_disable(&priv->napi);
-	skb_queue_purge(&priv->rx_recycle);
+	net_recycle_cleanup(priv->dev);
 
 	/* Free the IRQ lines */
 	free_irq(dev->irq, dev);
@@ -1174,13 +1165,10 @@ static inline void stmmac_rx_refill(struct stmmac_priv *priv)
 		if (likely(priv->rx_skbuff[entry] == NULL)) {
 			struct sk_buff *skb;
 
-			skb = __skb_dequeue(&priv->rx_recycle);
-			if (skb == NULL)
-				skb = netdev_alloc_skb_ip_align(priv->dev,
-								bfsize);
-
+			skb = net_recycle_get(priv->dev);
 			if (unlikely(skb == NULL))
 				break;
+			skb_reserve(skb, NET_IP_ALIGN);
 
 			priv->rx_skbuff[entry] = skb;
 			priv->rx_skbuff_dma[entry] =
-- 
1.6.6.1


^ permalink raw reply related

* [RFC 2/5] net/gianfar: use generic recycling infrastructure
From: Sebastian Andrzej Siewior @ 2010-05-05 14:47 UTC (permalink / raw)
  To: netdev; +Cc: tglx, Sebastian Andrzej Siewior
In-Reply-To: <1273070870-7821-1-git-send-email-sebastian@breakpoint.cc>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/gianfar.c |   28 ++++++++--------------------
 drivers/net/gianfar.h |    2 --
 2 files changed, 8 insertions(+), 22 deletions(-)

diff --git a/drivers/net/gianfar.c b/drivers/net/gianfar.c
index c6df2ba..af98675 100644
--- a/drivers/net/gianfar.c
+++ b/drivers/net/gianfar.c
@@ -1102,6 +1102,9 @@ static int gfar_probe(struct of_device *ofdev,
 		priv->rx_queue[i]->rxic = DEFAULT_RXIC;
 	}
 
+	net_recycle_init(dev, DEFAULT_RX_RING_SIZE,
+			priv->rx_buffer_size + RXBUF_ALIGNMENT);
+
 	/* enable filer if using multiple RX queues*/
 	if(priv->num_rx_queues > 1)
 		priv->rx_filer_enable = 1;
@@ -1705,7 +1708,7 @@ static void free_skb_resources(struct gfar_private *priv)
 			sizeof(struct rxbd8) * priv->total_rx_ring_size,
 			priv->tx_queue[0]->tx_bd_base,
 			priv->tx_queue[0]->tx_bd_dma_base);
-	skb_queue_purge(&priv->rx_recycle);
+	net_recycle_cleanup(priv->ndev);
 }
 
 void gfar_start(struct net_device *dev)
@@ -1886,8 +1889,6 @@ static int gfar_enet_open(struct net_device *dev)
 
 	enable_napi(priv);
 
-	skb_queue_head_init(&priv->rx_recycle);
-
 	/* Initialize a bunch of registers */
 	init_registers(dev);
 
@@ -2291,6 +2292,7 @@ static int gfar_change_mtu(struct net_device *dev, int new_mtu)
 		stop_gfar(dev);
 
 	priv->rx_buffer_size = tempsize;
+	net_recycle_size(dev, tempsize + RXBUF_ALIGNMENT);
 
 	dev->mtu = new_mtu;
 
@@ -2422,16 +2424,7 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
 			bdp = next_txbd(bdp, base, tx_ring_size);
 		}
 
-		/*
-		 * If there's room in the queue (limit it to rx_buffer_size)
-		 * we add this skb back into the pool, if it's the right size
-		 */
-		if (skb_queue_len(&priv->rx_recycle) < rx_queue->rx_ring_size &&
-				skb_recycle_check(skb, priv->rx_buffer_size +
-					RXBUF_ALIGNMENT))
-			__skb_queue_head(&priv->rx_recycle, skb);
-		else
-			dev_kfree_skb_any(skb);
+		net_recycle_add(dev, skb);
 
 		tx_queue->tx_skbuff[skb_dirtytx] = NULL;
 
@@ -2497,14 +2490,9 @@ static void gfar_new_rxbdp(struct gfar_priv_rx_q *rx_queue, struct rxbd8 *bdp,
 struct sk_buff * gfar_new_skb(struct net_device *dev)
 {
 	unsigned int alignamount;
-	struct gfar_private *priv = netdev_priv(dev);
 	struct sk_buff *skb = NULL;
 
-	skb = __skb_dequeue(&priv->rx_recycle);
-	if (!skb)
-		skb = netdev_alloc_skb(dev,
-				priv->rx_buffer_size + RXBUF_ALIGNMENT);
-
+	skb = net_recycle_get(dev);
 	if (!skb)
 		return NULL;
 
@@ -2673,7 +2661,7 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit)
 				 * recycle list.
 				 */
 				skb_reserve(skb, -GFAR_CB(skb)->alignamount);
-				__skb_queue_head(&priv->rx_recycle, skb);
+				net_recycle_add(dev, skb);
 			}
 		} else {
 			/* Increment the number of packets */
diff --git a/drivers/net/gianfar.h b/drivers/net/gianfar.h
index ac4a92e..99f5a9b 100644
--- a/drivers/net/gianfar.h
+++ b/drivers/net/gianfar.h
@@ -1061,8 +1061,6 @@ struct gfar_private {
 
 	u32 cur_filer_idx;
 
-	struct sk_buff_head rx_recycle;
-
 	struct vlan_group *vlgrp;
 
 
-- 
1.6.6.1


^ permalink raw reply related

* [RFC 0/5] generic rx recycling
From: Sebastian Andrzej Siewior @ 2010-05-05 14:47 UTC (permalink / raw)
  To: netdev; +Cc: tglx

This series merges the rx recycling code, trying to come up with generic
code. Recycling skbs from the tx path for incoming rx skips the memory
allocator and improves latency during memory pressure.
This is now used by just the four drivers in the tree which were doing
this on their own.
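
The recycling idea can be sketched in userspace C as a bounded free list. This is only a rough model under stated assumptions: `struct buf`, `struct recycle_pool` and the `pool_*()` helpers are hypothetical names standing in for skbs and the net_recycle_*() helpers, malloc()/free() stand in for the skb allocator, and there is no locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an skb. */
struct buf {
	struct buf *next;
	size_t size;		/* usable payload size */
};

struct recycle_pool {
	struct buf *head;	/* LIFO list of recycled buffers */
	unsigned int len;	/* buffers currently held */
	unsigned int max;	/* upper bound, e.g. the rx ring size */
	size_t buf_size;	/* minimum size a recycled buffer must have */
};

void pool_init(struct recycle_pool *p, unsigned int max, size_t buf_size)
{
	p->head = NULL;
	p->len = 0;
	p->max = max;
	p->buf_size = buf_size;
}

struct buf *buf_alloc(size_t size)
{
	struct buf *b = malloc(sizeof(*b) + size);

	if (b) {
		b->next = NULL;
		b->size = size;
	}
	return b;
}

/* tx completion: keep the buffer if there is room and it is big enough. */
void pool_add(struct recycle_pool *p, struct buf *b)
{
	assert(b != NULL);
	if (p->len < p->max && b->size >= p->buf_size) {
		b->next = p->head;
		p->head = b;
		p->len++;
	} else {
		free(b);
	}
}

/* rx refill: reuse a recycled buffer first, allocate only if none is left. */
struct buf *pool_get(struct recycle_pool *p)
{
	struct buf *b = p->head;

	if (b) {
		p->head = b->next;
		b->next = NULL;
		p->len--;
		return b;
	}
	return buf_alloc(p->buf_size);
}

/* Device stop: drop everything still sitting in the pool. */
void pool_cleanup(struct recycle_pool *p)
{
	while (p->head) {
		struct buf *b = p->head;

		p->head = b->next;
		free(b);
	}
	p->len = 0;
}
```

The allocator is only hit when the pool is empty, which is where the latency benefit under memory pressure would come from.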

Sebastian


^ permalink raw reply

* [RFC 1/5] net: implement generic rx recycling
From: Sebastian Andrzej Siewior @ 2010-05-05 14:47 UTC (permalink / raw)
  To: netdev; +Cc: tglx, Sebastian Andrzej Siewior
In-Reply-To: <1273070870-7821-1-git-send-email-sebastian@breakpoint.cc>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/netdevice.h |   70 +++++++++++++++++++++++++++++++++++++--------
 1 files changed, 58 insertions(+), 12 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 98112fb..70d385d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -588,6 +588,18 @@ struct netdev_rx_queue {
 } ____cacheline_aligned_in_smp;
 #endif /* CONFIG_RPS */
 
+/* Use this variant when it is known for sure that it
+ * is executing from hardware interrupt context or with hardware interrupts
+ * disabled.
+ */
+extern void dev_kfree_skb_irq(struct sk_buff *skb);
+
+/* Use this variant in places where it could be invoked
+ * from either hardware interrupt or other context, with hardware interrupts
+ * either disabled or enabled.
+ */
+extern void dev_kfree_skb_any(struct sk_buff *skb);
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -1044,9 +1056,55 @@ struct net_device {
 #endif
 	/* n-tuple filter list attached to this device */
 	struct ethtool_rx_ntuple_list ethtool_ntuple_list;
+	struct sk_buff_head rx_recycle;
+	u32 rx_rec_skbs_max;
+	u32 rx_rec_skb_size;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
+static inline void net_recycle_init(struct net_device *dev, u32 qlen, u32 size)
+{
+	skb_queue_head_init(&dev->rx_recycle);
+	dev->rx_rec_skbs_max = qlen;
+	dev->rx_rec_skb_size = size;
+}
+
+static inline void net_recycle_cleanup(struct net_device *dev)
+{
+	skb_queue_purge(&dev->rx_recycle);
+}
+
+static inline void net_recycle_add(struct net_device *dev, struct sk_buff *skb)
+{
+	if (skb_queue_len(&dev->rx_recycle) < dev->rx_rec_skbs_max &&
+			skb_recycle_check(skb, dev->rx_rec_skb_size))
+		skb_queue_head(&dev->rx_recycle, skb);
+	else
+		dev_kfree_skb_any(skb);
+}
+
+static inline struct sk_buff *net_recycle_get(struct net_device *dev)
+{
+	struct sk_buff *skb;
+
+	skb = skb_dequeue(&dev->rx_recycle);
+	if (skb)
+		return skb;
+	return netdev_alloc_skb(dev, dev->rx_rec_skb_size);
+}
+
+static inline void net_recycle_size(struct net_device *dev, u32 size)
+{
+	if (dev->rx_rec_skb_size < size)
+		net_recycle_cleanup(dev);
+	dev->rx_rec_skb_size = size;
+}
+
+static inline void net_recycle_qlen(struct net_device *dev, u32 qlen)
+{
+	dev->rx_rec_skbs_max = qlen;
+}
+
 #define	NETDEV_ALIGN		32
 
 static inline
@@ -1635,18 +1693,6 @@ static inline int netif_is_multiqueue(const struct net_device *dev)
 	return (dev->num_tx_queues > 1);
 }
 
-/* Use this variant when it is known for sure that it
- * is executing from hardware interrupt context or with hardware interrupts
- * disabled.
- */
-extern void dev_kfree_skb_irq(struct sk_buff *skb);
-
-/* Use this variant in places where it could be invoked
- * from either hardware interrupt or other context, with hardware interrupts
- * either disabled or enabled.
- */
-extern void dev_kfree_skb_any(struct sk_buff *skb);
-
 #define HAVE_NETIF_RX 1
 extern int		netif_rx(struct sk_buff *skb);
 extern int		netif_rx_ni(struct sk_buff *skb);
-- 
1.6.6.1


^ permalink raw reply related

* [RFC 3/5] net/mv643xx: use generic recycling infrastructure
From: Sebastian Andrzej Siewior @ 2010-05-05 14:47 UTC (permalink / raw)
  To: netdev; +Cc: tglx, Sebastian Andrzej Siewior
In-Reply-To: <1273070870-7821-1-git-send-email-sebastian@breakpoint.cc>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/net/mv643xx_eth.c |   27 +++++++++------------------
 1 files changed, 9 insertions(+), 18 deletions(-)

diff --git a/drivers/net/mv643xx_eth.c b/drivers/net/mv643xx_eth.c
index 4ee9d04..f56aaac 100644
--- a/drivers/net/mv643xx_eth.c
+++ b/drivers/net/mv643xx_eth.c
@@ -404,8 +404,6 @@ struct mv643xx_eth_private {
 	u8 work_rx_refill;
 
 	int skb_size;
-	struct sk_buff_head rx_recycle;
-
 	/*
 	 * RX state.
 	 */
@@ -649,6 +647,7 @@ err:
 static int rxq_refill(struct rx_queue *rxq, int budget)
 {
 	struct mv643xx_eth_private *mp = rxq_to_mp(rxq);
+	struct net_device *dev = mp->dev;
 	int refilled;
 
 	refilled = 0;
@@ -658,10 +657,7 @@ static int rxq_refill(struct rx_queue *rxq, int budget)
 		struct rx_desc *rx_desc;
 		int size;
 
-		skb = __skb_dequeue(&mp->rx_recycle);
-		if (skb == NULL)
-			skb = dev_alloc_skb(mp->skb_size);
-
+		skb = net_recycle_get(dev);
 		if (skb == NULL) {
 			mp->oom = 1;
 			goto oom;
@@ -922,6 +918,7 @@ out:
 static int txq_reclaim(struct tx_queue *txq, int budget, int force)
 {
 	struct mv643xx_eth_private *mp = txq_to_mp(txq);
+	struct net_device *dev = mp->dev;
 	struct netdev_queue *nq = netdev_get_tx_queue(mp->dev, txq->index);
 	int reclaimed;
 
@@ -968,14 +965,8 @@ static int txq_reclaim(struct tx_queue *txq, int budget, int force)
 				       desc->byte_cnt, DMA_TO_DEVICE);
 		}
 
-		if (skb != NULL) {
-			if (skb_queue_len(&mp->rx_recycle) <
-					mp->rx_ring_size &&
-			    skb_recycle_check(skb, mp->skb_size))
-				__skb_queue_head(&mp->rx_recycle, skb);
-			else
-				dev_kfree_skb(skb);
-		}
+		if (skb)
+			net_recycle_add(dev, skb);
 	}
 
 	__netif_tx_unlock(nq);
@@ -1564,7 +1555,7 @@ mv643xx_eth_set_ringparam(struct net_device *dev, struct ethtool_ringparam *er)
 
 	mp->rx_ring_size = er->rx_pending < 4096 ? er->rx_pending : 4096;
 	mp->tx_ring_size = er->tx_pending < 4096 ? er->tx_pending : 4096;
-
+	net_recycle_qlen(dev, mp->rx_ring_size);
 	if (netif_running(dev)) {
 		mv643xx_eth_stop(dev);
 		if (mv643xx_eth_open(dev)) {
@@ -2340,9 +2331,9 @@ static int mv643xx_eth_open(struct net_device *dev)
 
 	mv643xx_eth_recalc_skb_size(mp);
 
-	napi_enable(&mp->napi);
+	net_recycle_init(mp->dev, mp->rx_ring_size, mp->skb_size);
 
-	skb_queue_head_init(&mp->rx_recycle);
+	napi_enable(&mp->napi);
 
 	mp->int_mask = INT_EXT;
 
@@ -2438,7 +2429,7 @@ static int mv643xx_eth_stop(struct net_device *dev)
 	mib_counters_update(mp);
 	del_timer_sync(&mp->mib_counters_timer);
 
-	skb_queue_purge(&mp->rx_recycle);
+	net_recycle_cleanup(dev);
 
 	for (i = 0; i < mp->rxq_count; i++)
 		rxq_deinit(mp->rxq + i);
-- 
1.6.6.1


^ permalink raw reply related

* Re: [PATCH] bonding: fix arp_validate on bonds inside a bridge
From: Jiri Bohac @ 2010-05-05 13:33 UTC (permalink / raw)
  To: David Miller; +Cc: fubar, jbohac, bonding-devel, netdev
In-Reply-To: <20100504.161815.71114704.davem@davemloft.net>

On Tue, May 04, 2010 at 04:18:15PM -0700, David Miller wrote:
> From: Jay Vosburgh <fubar@us.ibm.com>
> > 	Tested and it looks to work as advertised.  I see only one minor
> > nit, there's a pr_debug that missed being renamed to the new function
> > name; here's the whole patch with that fixed.
> 
> I don't think you need the ugly arp hook.
> 
> Instead, it's much cleaner to provide a way for packet type taps to
> see the packet before bridge et al. decapsulation.  In fact this makes
> a lot of sense, wanting to see the device as __netif_receive_skb() saw
> it, with no changes whatsoever.
> 
> In fact ptype_all runs before bridging, ING, and MACVLAN decap the
> thing, so we could have a 'ptype_base_predecap[]' that we run over
> right after those.

I was considering exactly this, but I thought it would be
rejected because of the overhead for all packets received.

In fact, bonding could register the ARP handler in the ptype_all
list and check itself whether the packets were ARPs. This
would require no changes to __netif_receive_skb() at all, but
would cause an extra function call and a condition for _every_
packet once a bond with arp_validate would be up.

Having a ptype_base_predecap[] hashtable would still cause at
least a comparison for _every_ packet, even without bonding
being loaded (!).

The current patch causes an extra comparison only for packets
arriving on bonding slaves.

If either of the ptype_all or ptype_base_predecap[] method is
preferred, I'll be happy to re-work the patch. I just thought
performance had bigger priority here.

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply

* Re: 3 packet TCP window limit?
From: Brian Bloniarz @ 2010-05-05 13:26 UTC (permalink / raw)
  To: dormando; +Cc: netdev
In-Reply-To: <alpine.LNX.2.00.1005050210230.8544@d>

dormando wrote:
> Hey,
> 
> Noticed in Linux that no matter what sysctl variable I twiddle, or what
> TCP congestion algorithm is running, TCP will wait for remote acks after
> sending the first 3 packets. After that it's normal.
> 
> Apologies, it's hard to describe:
> 
> Linux server listening.
> 
> Remote -> SYN
> (RTT wait)
> Linux -> SYN/ACK
> Remote -> ACK
> Remote -> Packet (small HTTP request)
> (RTT wait)
> Linux -> Packet (x 3)
> Remote -> (returning acks per packet)
> (RTT wait)
> Linux -> More packets (up to window size)
> 
> If the request response fits in 3 packets or less, that third RTT wait
> never happens. The remote client gets all its data, and sends back all the
> FIN/ACK packets for closing the connection.
> 
> What's bizarre is that this 3 packet/4 packet barrier is regardless of how
> much data there is to send. I can cause the extra RTT to flip on or off by
> sending exactly +/- 1 byte to cause an extra packet.
> 
> Holding the connection open and repeating the request any number of times
> runs just fine, after the initial request.
> 
> You can pretty easily see this by:
> tc qdisc add dev eth0 root netem delay 100ms
> ... then fetching a 3k file, then 4k file from an http server running
> linux. Well, at least I can see this easily. I tried on a half dozen boxes
> (2.6.11 through 2.6.32).
> 
> I'm trying to track down where in the code this is, or why my sysctl
> tuning isn't affecting it. I can't discern its purpose. The lag it causes
> is pretty awful for far away clients; adding 300ms of latency will make a
> small request take a full second, instead of 600ms.
> 
> I'm slugging through the code but any insight would be greatly
> appreciated!

This sounds like TCP slow start.

http://en.wikipedia.org/wiki/Slow-start

As far as tunables go, you might want to play with the initcwnd route
flag (see "ip route help")
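
To make the 3 packet / 4 packet boundary concrete, here is a rough model of the arithmetic, assuming an initial congestion window of 3 segments (as the trace above suggests) and an idealized slow start in which the window doubles every round trip, with no loss, no delayed ACKs and no receiver window limit. `slow_start_rounds()` is a made-up helper, not kernel code.

```c
#include <assert.h>
#include <limits.h>

/*
 * Number of round trips needed to send `segments` full-sized segments
 * when the sender starts with a window of `initcwnd` segments and the
 * window doubles after every round trip.
 */
unsigned int slow_start_rounds(unsigned int segments, unsigned int initcwnd)
{
	unsigned int cwnd = initcwnd;
	unsigned int rounds = 0;

	assert(initcwnd > 0);
	while (segments > 0) {
		unsigned int burst = segments < cwnd ? segments : cwnd;

		segments -= burst;
		if (cwnd <= UINT_MAX / 2)	/* avoid wrapping to zero */
			cwnd *= 2;
		rounds++;
	}
	return rounds;
}
```

Under these assumptions slow_start_rounds(3, 3) is 1 while slow_start_rounds(4, 3) is 2, which matches the extra RTT seen when the response grows by one packet. A larger initcwnd moves that boundary out.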

> 
> -Dormando
> 


^ permalink raw reply

* Re: [PATCH net-next-2.6] net: __alloc_skb() speedup
From: Eric Dumazet @ 2010-05-05 12:00 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, hadi, therbert
In-Reply-To: <20100505.012647.260083711.davem@davemloft.net>

Le mercredi 05 mai 2010 à 01:26 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 05 May 2010 10:22:14 +0200
> 
> > You mean memset() won't be inlined by compiler to plain memory writes, but
> > use the custom kernel memset()  ?
> 
> I hope memset() is never inlined for a 202 byte piece of memory on
> sparc64 or powerpc.  What happens and makes sense on x86 is x86's
> business :-)
> 
> Especially since that elides the cache invalidate optimizations, and
> for anything >= 64 bytes those are absolutely critical on Niagara.
> -

Sorry, I was thinking about the shinfo part :

memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));

offsetof(struct skb_shared_info, dataref) is small enough and we don't
dirty a full cache line, so maybe I can keep prefetchw(data + size) ?

If not, in which cases can we use prefetchw() in kernel, if some arches
don't handle it well ?

Note1 : Without prefetchw(skb) (I removed it in this v2 patch), some
packets are dropped again...

Note2: If NET_SKB_PAD changed to 64, cpu0 has about 2% of free cpu
cycles (as noticed by a user application cycles burner)

-----------------------------------------------------------------------------------------------------------------------------------------
   PerfTop:    1001 irqs/sec  kernel:99.8% [1000Hz cycles],  (all, cpu: 0)
-----------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ___________
             1018.00 16.8% eth_type_trans                
              960.00 15.9% __alloc_skb                   
              757.00 12.5% __netdev_alloc_skb            
              681.00 11.3% _raw_spin_lock                
              479.00  7.9% nommu_map_page                
              424.00  7.0% tg3_poll_work                 
              209.00  3.5% get_rps_cpu                   
              205.00  3.4% _raw_spin_lock_irqsave        
              188.00  3.1% __kmalloc                    
              164.00  2.7% enqueue_to_backlog            
              119.00  2.0% tg3_alloc_rx_skb              
              112.00  1.9% kmem_cache_alloc              

Thanks !

[PATCH v2 net-next-2.6] net: __alloc_skb() speedup

With following patch I can reach maximum rate of my pktgen+udpsink
simulator :
- 'old' machine : dual quad core E5450  @3.00GHz
- 64 UDP rx flows (only differ by destination port)
- RPS enabled, NIC interrupts serviced on cpu0
- rps dispatched on 7 other cores. (~130.000 IPI per second)
- SLAB allocator (faster than SLUB in this workload)
- tg3 NIC [BCM5715S Gigabit Ethernet (rev a3)]
- 1.080.000 pps with few drops (~150 packets per second) at NIC level.
- 32bit kernel

Idea is to add one prefetchw() call in __alloc_skb() to hint cpu we are
about to clear part of skb_shared_info.

Also using one memset() to initialize all skb_shared_info fields instead
of one by one to reduce number of instructions, using long word moves.

All skb_shared_info fields before 'dataref' are cleared in 
__alloc_skb().
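
A minimal userspace sketch of the "one memset() up to dataref" layout trick, for illustration. `struct shared_info` and `shared_info_init()` are hypothetical stand-ins, not the real skb_shared_info; the field set is abbreviated and a plain int replaces atomic_t.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, abbreviated stand-in for skb_shared_info. */
struct shared_info {
	unsigned short nr_frags;
	unsigned short gso_size;
	unsigned short gso_segs;
	unsigned short gso_type;
	void *frag_list;
	/* all fields before dataref are cleared in shared_info_init() */
	int dataref;
	unsigned char frags[64];	/* deliberately left untouched */
};

void shared_info_init(struct shared_info *si)
{
	assert(si != NULL);
	/* One clear covers every field laid out before dataref. */
	memset(si, 0, offsetof(struct shared_info, dataref));
	si->dataref = 1;
}
```

The scheme depends on field order: anything that must start out zeroed has to sit before dataref, and anything after it keeps its previous contents.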

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/skbuff.h |    7 ++++++-
 net/core/skbuff.c      |   21 +++++----------------
 2 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 746a652..f32ccc9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -187,7 +187,6 @@ union skb_shared_tx {
  * the end of the header data, ie. at skb->end.
  */
 struct skb_shared_info {
-	atomic_t	dataref;
 	unsigned short	nr_frags;
 	unsigned short	gso_size;
 	/* Warning: this field is not always filled in (UFO)! */
@@ -197,6 +196,12 @@ struct skb_shared_info {
 	union skb_shared_tx tx_flags;
 	struct sk_buff	*frag_list;
 	struct skb_shared_hwtstamps hwtstamps;
+
+	/*
+	 * Warning : all fields before dataref are cleared in __alloc_skb()
+	 */
+	atomic_t	dataref;
+	
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 	/* Intermediate layers must ensure that destructor_arg
 	 * remains valid until skb destructor */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 8b9c109..7cafe50 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -187,6 +187,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 			gfp_mask, node);
 	if (!data)
 		goto nodata;
+	/* prepare shinfo initialization */
+	prefetchw(data + size);
 
 	/*
 	 * Only clear those fields we need to clear, not those that we will
@@ -208,15 +210,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 
 	/* make sure we initialize shinfo sequentially */
 	shinfo = skb_shinfo(skb);
+	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
 	atomic_set(&shinfo->dataref, 1);
-	shinfo->nr_frags  = 0;
-	shinfo->gso_size = 0;
-	shinfo->gso_segs = 0;
-	shinfo->gso_type = 0;
-	shinfo->ip6_frag_id = 0;
-	shinfo->tx_flags.flags = 0;
-	skb_frag_list_init(skb);
-	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
 
 	if (fclone) {
 		struct sk_buff *child = skb + 1;
@@ -505,16 +500,10 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
 		return 0;
 
 	skb_release_head_state(skb);
+
 	shinfo = skb_shinfo(skb);
+	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
 	atomic_set(&shinfo->dataref, 1);
-	shinfo->nr_frags = 0;
-	shinfo->gso_size = 0;
-	shinfo->gso_segs = 0;
-	shinfo->gso_type = 0;
-	shinfo->ip6_frag_id = 0;
-	shinfo->tx_flags.flags = 0;
-	skb_frag_list_init(skb);
-	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
 
 	memset(skb, 0, offsetof(struct sk_buff, tail));
 	skb->data = skb->head + NET_SKB_PAD;



^ permalink raw reply related

* [Patch 3/3] net: reserve ports for applications using fixed port numbers
From: Amerigo Wang @ 2010-05-05 10:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Octavian Purdila, Eric Dumazet, penguin-kernel, netdev,
	Neil Horman, Amerigo Wang, xiaosuo, David Miller, adobriyan,
	ebiederm
In-Reply-To: <20100505103033.5600.77502.sendpatchset@localhost.localdomain>


(Dropped the infiniband part, because Tetsuo modified the related code,
I will send a separate patch for it once this is accepted.)

This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which
allows users to reserve ports for third-party applications.

The reserved ports will not be used by automatic port assignments
(e.g. when calling connect() or bind() with port number 0). Explicit
port allocation behavior is unchanged.
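
The selection rule can be sketched in userspace C as a bitmap test inside the port search loop. This is only an illustration: `reserve_port()`, `is_reserved_port()` and `pick_local_port()` are made-up names, and the "is the port already bound" check that the real code performs is left out.

```c
#include <assert.h>
#include <limits.h>

#define NPORTS 65536
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* One bit per port number; a set bit means "reserved". */
static unsigned long reserved[NPORTS / BITS_PER_WORD];

void reserve_port(int port)
{
	assert(port >= 0 && port < NPORTS);
	reserved[port / BITS_PER_WORD] |= 1UL << (port % BITS_PER_WORD);
}

int is_reserved_port(int port)
{
	assert(port >= 0 && port < NPORTS);
	return (reserved[port / BITS_PER_WORD] >> (port % BITS_PER_WORD)) & 1UL;
}

/*
 * Automatic assignment: walk [low, high] once, starting at `start` and
 * wrapping around, and return the first port that is not reserved.
 * Returns -1 if every port in the range is reserved.
 */
int pick_local_port(int low, int high, int start)
{
	int remaining = high - low + 1;
	int rover = start;

	assert(0 <= low && low <= high && high < NPORTS);
	assert(low <= start && start <= high);
	do {
		if (!is_reserved_port(rover))
			return rover;
		if (++rover > high)
			rover = low;
	} while (--remaining > 0);
	return -1;
}
```

An explicit request for a reserved port never goes through pick_local_port(), which mirrors the statement that explicit allocation is unchanged.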

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---

Index: linux-2.6/Documentation/networking/ip-sysctl.txt
===================================================================
--- linux-2.6.orig/Documentation/networking/ip-sysctl.txt
+++ linux-2.6/Documentation/networking/ip-sysctl.txt
@@ -588,6 +588,37 @@ ip_local_port_range - 2 INTEGERS
 	(i.e. by default) range 1024-4999 is enough to issue up to
 	2000 connections per second to systems supporting timestamps.
 
+ip_local_reserved_ports - list of comma separated ranges
+	Specify the ports which are reserved for known third-party
+	applications. These ports will not be used by automatic port
+	assignments (e.g. when calling connect() or bind() with port
+	number 0). Explicit port allocation behavior is unchanged.
+
+	The format used for both input and output is a comma separated
+	list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and
+	10). Writing to the file will clear all previously reserved
+	ports and update the current list with the one given in the
+	input.
+
+	Note that ip_local_port_range and ip_local_reserved_ports
+	settings are independent and both are considered by the kernel
+	when determining which ports are available for automatic port
+	assignments.
+
+	You can reserve ports which are not in the current
+	ip_local_port_range, e.g.:
+
+	$ cat /proc/sys/net/ipv4/ip_local_port_range
+	32000	61000
+	$ cat /proc/sys/net/ipv4/ip_local_reserved_ports
+	8080,9148
+
+	although this is redundant. However such a setting is useful
+	if later the port range is changed to a value that will
+	include the reserved ports.
+
+	Default: Empty
+
 ip_nonlocal_bind - BOOLEAN
 	If set, allows processes to bind() to non-local IP addresses,
 	which can be quite useful - but may break some applications.
Index: linux-2.6/include/net/ip.h
===================================================================
--- linux-2.6.orig/include/net/ip.h
+++ linux-2.6/include/net/ip.h
@@ -184,6 +184,12 @@ extern struct local_ports {
 } sysctl_local_ports;
 extern void inet_get_local_port_range(int *low, int *high);
 
+extern unsigned long *sysctl_local_reserved_ports;
+static inline int inet_is_reserved_local_port(int port)
+{
+	return test_bit(port, sysctl_local_reserved_ports);
+}
+
 extern int sysctl_ip_default_ttl;
 extern int sysctl_ip_nonlocal_bind;
 
Index: linux-2.6/net/ipv4/af_inet.c
===================================================================
--- linux-2.6.orig/net/ipv4/af_inet.c
+++ linux-2.6/net/ipv4/af_inet.c
@@ -1552,9 +1552,13 @@ static int __init inet_init(void)
 
 	BUILD_BUG_ON(sizeof(struct inet_skb_parm) > sizeof(dummy_skb->cb));
 
+	sysctl_local_reserved_ports = kzalloc(65536 / 8, GFP_KERNEL);
+	if (!sysctl_local_reserved_ports)
+		goto out;
+
 	rc = proto_register(&tcp_prot, 1);
 	if (rc)
-		goto out;
+		goto out_free_reserved_ports;
 
 	rc = proto_register(&udp_prot, 1);
 	if (rc)
@@ -1653,6 +1657,8 @@ out_unregister_udp_proto:
 	proto_unregister(&udp_prot);
 out_unregister_tcp_proto:
 	proto_unregister(&tcp_prot);
+out_free_reserved_ports:
+	kfree(sysctl_local_reserved_ports);
 	goto out;
 }
 
Index: linux-2.6/net/ipv4/inet_connection_sock.c
===================================================================
--- linux-2.6.orig/net/ipv4/inet_connection_sock.c
+++ linux-2.6/net/ipv4/inet_connection_sock.c
@@ -37,6 +37,9 @@ struct local_ports sysctl_local_ports __
 	.range = { 32768, 61000 },
 };
 
+unsigned long *sysctl_local_reserved_ports;
+EXPORT_SYMBOL(sysctl_local_reserved_ports);
+
 void inet_get_local_port_range(int *low, int *high)
 {
 	unsigned seq;
@@ -108,6 +111,8 @@ again:
 
 		smallest_size = -1;
 		do {
+			if (inet_is_reserved_local_port(rover))
+				goto next_nolock;
 			head = &hashinfo->bhash[inet_bhashfn(net, rover,
 					hashinfo->bhash_size)];
 			spin_lock(&head->lock);
@@ -130,6 +135,7 @@ again:
 			break;
 		next:
 			spin_unlock(&head->lock);
+		next_nolock:
 			if (++rover > high)
 				rover = low;
 		} while (--remaining > 0);
Index: linux-2.6/net/ipv4/inet_hashtables.c
===================================================================
--- linux-2.6.orig/net/ipv4/inet_hashtables.c
+++ linux-2.6/net/ipv4/inet_hashtables.c
@@ -456,6 +456,8 @@ int __inet_hash_connect(struct inet_time
 		local_bh_disable();
 		for (i = 1; i <= remaining; i++) {
 			port = low + (i + offset) % remaining;
+			if (inet_is_reserved_local_port(port))
+				continue;
 			head = &hinfo->bhash[inet_bhashfn(net, port,
 					hinfo->bhash_size)];
 			spin_lock(&head->lock);
Index: linux-2.6/net/ipv4/sysctl_net_ipv4.c
===================================================================
--- linux-2.6.orig/net/ipv4/sysctl_net_ipv4.c
+++ linux-2.6/net/ipv4/sysctl_net_ipv4.c
@@ -299,6 +299,13 @@ static struct ctl_table ipv4_table[] = {
 		.mode		= 0644,
 		.proc_handler	= ipv4_local_port_range,
 	},
+	{
+		.procname	= "ip_local_reserved_ports",
+		.data		= NULL, /* initialized in sysctl_ipv4_init */
+		.maxlen		= 65536,
+		.mode		= 0644,
+		.proc_handler	= proc_do_large_bitmap,
+	},
 #ifdef CONFIG_IP_MULTICAST
 	{
 		.procname	= "igmp_max_memberships",
@@ -736,6 +743,16 @@ static __net_initdata struct pernet_oper
 static __init int sysctl_ipv4_init(void)
 {
 	struct ctl_table_header *hdr;
+	struct ctl_table *i;
+
+	for (i = ipv4_table; i->procname; i++) {
+		if (strcmp(i->procname, "ip_local_reserved_ports") == 0) {
+			i->data = sysctl_local_reserved_ports;
+			break;
+		}
+	}
+	if (!i->procname)
+		return -EINVAL;
 
 	hdr = register_sysctl_paths(net_ipv4_ctl_path, ipv4_table);
 	if (hdr == NULL)
Index: linux-2.6/net/ipv4/udp.c
===================================================================
--- linux-2.6.orig/net/ipv4/udp.c
+++ linux-2.6/net/ipv4/udp.c
@@ -233,7 +233,8 @@ int udp_lib_get_port(struct sock *sk, un
 			 */
 			do {
 				if (low <= snum && snum <= high &&
-				    !test_bit(snum >> udptable->log, bitmap))
+				    !test_bit(snum >> udptable->log, bitmap) &&
+				    !inet_is_reserved_local_port(snum))
 					goto found;
 				snum += rand;
 			} while (snum != first);
Index: linux-2.6/net/sctp/socket.c
===================================================================
--- linux-2.6.orig/net/sctp/socket.c
+++ linux-2.6/net/sctp/socket.c
@@ -5436,6 +5436,8 @@ static long sctp_get_port_local(struct s
 			rover++;
 			if ((rover < low) || (rover > high))
 				rover = low;
+			if (inet_is_reserved_local_port(rover))
+				continue;
 			index = sctp_phashfn(rover);
 			head = &sctp_port_hashtable[index];
 			sctp_spin_lock(&head->lock);

* [Patch 2/3] sysctl: add proc_do_large_bitmap
From: Amerigo Wang @ 2010-05-05 10:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Octavian Purdila, Eric Dumazet, penguin-kernel, netdev,
	Neil Horman, Amerigo Wang, ebiederm, xiaosuo, adobriyan,
	David Miller
In-Reply-To: <20100505103033.5600.77502.sendpatchset@localhost.localdomain>

From: Octavian Purdila <opurdila@ixiacom.com>

The new function can be used to read/write large bitmaps via /proc. A
comma separated range format is used for compact output and input
(e.g. 1,3-4,10-10).

Writing into the file will first reset the bitmap then update it
based on the given input.
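
For illustration only, the range format can be sketched in user space as
below. This is not the kernel implementation: parse_ranges(),
set_bit_ul(), test_bit_ul() and the NBITS/NWORDS macros are hypothetical
names, and unlike the patch this sketch parses straight into the
destination bitmap, so the bitmap may be left partially filled when the
input is rejected.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define NBITS 65536	/* assumed bitmap size, one bit per port */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define NWORDS ((NBITS + BITS_PER_WORD - 1) / BITS_PER_WORD)

static void set_bit_ul(unsigned long *map, unsigned long n)
{
	map[n / BITS_PER_WORD] |= 1UL << (n % BITS_PER_WORD);
}

static int test_bit_ul(const unsigned long *map, unsigned long n)
{
	return (int)((map[n / BITS_PER_WORD] >> (n % BITS_PER_WORD)) & 1UL);
}

/*
 * Clear map, then set every bit named by a list such as "1,3-4,10-10".
 * Returns 0 on success and -1 on malformed or out-of-range input.
 */
static int parse_ranges(const char *s, unsigned long *map)
{
	assert(s != NULL && map != NULL);
	memset(map, 0, NWORDS * sizeof(unsigned long));

	while (*s && *s != '\n') {
		char *end;
		unsigned long a, b;

		if (*s < '0' || *s > '9')
			return -1;
		a = strtoul(s, &end, 10);
		b = a;
		if (*end == '-') {		/* "a-b" range */
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -1;
			b = strtoul(s, &end, 10);
		}
		if (a > b || b >= NBITS)
			return -1;
		while (a <= b)
			set_bit_ul(map, a++);
		if (*end == ',')
			end++;			/* next item follows */
		else if (*end && *end != '\n')
			return -1;		/* unexpected trailer */
		s = end;
	}
	return 0;
}
```

Parsing "1,3-4,10-10" with this sketch sets bits 1, 3, 4 and 10 and
leaves everything else clear; "4-3" and "65536" are rejected.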

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---

Index: linux-2.6/include/linux/sysctl.h
===================================================================
--- linux-2.6.orig/include/linux/sysctl.h
+++ linux-2.6/include/linux/sysctl.h
@@ -980,6 +980,8 @@ extern int proc_doulongvec_minmax(struct
 				  void __user *, size_t *, loff_t *);
 extern int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int,
 				      void __user *, size_t *, loff_t *);
+extern int proc_do_large_bitmap(struct ctl_table *, int,
+				void __user *, size_t *, loff_t *);
 
 /*
  * Register a set of sysctl names by calling register_sysctl_table
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -2049,6 +2049,16 @@ static size_t proc_skip_spaces(char **bu
 	return ret;
 }
 
+static void proc_skip_char(char **buf, size_t *size, const char v)
+{
+	while (*size) {
+		if (**buf != v)
+			break;
+		(*size)--;
+		(*buf)++;
+	}
+}
+
 #define TMPBUFLEN 22
 /**
  * proc_get_long - reads an ASCII formated integer from a user buffer
@@ -2675,6 +2685,157 @@ static int proc_do_cad_pid(struct ctl_ta
 	return 0;
 }
 
+/**
+ * proc_do_large_bitmap - read/write from/to a large bitmap
+ * @table: the sysctl table
+ * @write: %TRUE if this is a write to the sysctl file
+ * @buffer: the user buffer
+ * @lenp: the size of the user buffer
+ * @ppos: file position
+ *
+ * The bitmap is stored at table->data and the bitmap length (in bits)
+ * in table->maxlen.
+ *
+ * We use a range comma separated format (e.g. 1,3-4,10-10) so that
+ * large bitmaps may be represented in a compact manner. Writing into
+ * the file will clear the bitmap then update it with the given input.
+ *
+ * Returns 0 on success.
+ */
+int proc_do_large_bitmap(struct ctl_table *table, int write,
+			 void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int err = 0;
+	bool first = 1;
+	size_t left = *lenp;
+	unsigned long bitmap_len = table->maxlen;
+	unsigned long *bitmap = (unsigned long *) table->data;
+	unsigned long *tmp_bitmap = NULL;
+	char tr_a[] = { '-', ',', '\n' }, tr_b[] = { ',', '\n', 0 }, c;
+
+	if (!bitmap_len || !left || (*ppos && !write)) {
+		*lenp = 0;
+		return 0;
+	}
+
+	if (write) {
+		unsigned long page = 0;
+		char *kbuf;
+
+		if (left > PAGE_SIZE - 1)
+			left = PAGE_SIZE - 1;
+
+		page = __get_free_page(GFP_TEMPORARY);
+		kbuf = (char *) page;
+		if (!kbuf)
+			return -ENOMEM;
+		if (copy_from_user(kbuf, buffer, left)) {
+			free_page(page);
+			return -EFAULT;
+		}
+		kbuf[left] = 0;
+
+		tmp_bitmap = kzalloc(BITS_TO_LONGS(bitmap_len) * sizeof(unsigned long),
+				     GFP_KERNEL);
+		if (!tmp_bitmap) {
+			free_page(page);
+			return -ENOMEM;
+		}
+		proc_skip_char(&kbuf, &left, '\n');
+		while (!err && left) {
+			unsigned long val_a, val_b;
+			bool neg;
+
+			err = proc_get_long(&kbuf, &left, &val_a, &neg, tr_a,
+					     sizeof(tr_a), &c);
+			if (err)
+				break;
+			if (val_a >= bitmap_len || neg) {
+				err = -EINVAL;
+				break;
+			}
+
+			val_b = val_a;
+			if (left) {
+				kbuf++;
+				left--;
+			}
+
+			if (c == '-') {
+				err = proc_get_long(&kbuf, &left, &val_b,
+						     &neg, tr_b, sizeof(tr_b),
+						     &c);
+				if (err)
+					break;
+				if (val_b >= bitmap_len || neg ||
+				    val_a > val_b) {
+					err = -EINVAL;
+					break;
+				}
+				if (left) {
+					kbuf++;
+					left--;
+				}
+			}
+
+			while (val_a <= val_b)
+				set_bit(val_a++, tmp_bitmap);
+
+			first = 0;
+			proc_skip_char(&kbuf, &left, '\n');
+		}
+		free_page(page);
+	} else {
+		unsigned long bit_a, bit_b = 0;
+
+		while (left) {
+			bit_a = find_next_bit(bitmap, bitmap_len, bit_b);
+			if (bit_a >= bitmap_len)
+				break;
+			bit_b = find_next_zero_bit(bitmap, bitmap_len,
+						   bit_a + 1) - 1;
+
+			if (!first) {
+				err = proc_put_char(&buffer, &left, ',');
+				if (err)
+					break;
+			}
+			err = proc_put_long(&buffer, &left, bit_a, false);
+			if (err)
+				break;
+			if (bit_a != bit_b) {
+				err = proc_put_char(&buffer, &left, '-');
+				if (err)
+					break;
+				err = proc_put_long(&buffer, &left, bit_b, false);
+				if (err)
+					break;
+			}
+
+			first = 0; bit_b++;
+		}
+		if (!err)
+			err = proc_put_char(&buffer, &left, '\n');
+	}
+
+	if (!err) {
+		if (write) {
+			if (*ppos)
+				bitmap_or(bitmap, bitmap, tmp_bitmap, bitmap_len);
+			else
+				memcpy(bitmap, tmp_bitmap,
+					BITS_TO_LONGS(bitmap_len) * sizeof(unsigned long));
+		}
+		kfree(tmp_bitmap);
+		*lenp -= left;
+		*ppos += *lenp;
+		return 0;
+	} else {
+		kfree(tmp_bitmap);
+		return err;
+	}
+}
+
 #else /* CONFIG_PROC_FS */
 
 int proc_dostring(struct ctl_table *table, int write,

* [Patch 1/3] sysctl: refactor integer handling proc code
From: Amerigo Wang @ 2010-05-05 10:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Octavian Purdila, Eric Dumazet, penguin-kernel, netdev,
	Neil Horman, ebiederm, xiaosuo, David Miller, adobriyan,
	Amerigo Wang
In-Reply-To: <20100505103033.5600.77502.sendpatchset@localhost.localdomain>

(Based on Octavian's work, with substantial modifications by me.)

As we are about to add another integer handling proc function a little
bit of cleanup is in order: add a few helper functions to improve code
readability and decrease code duplication.

In the process a bug is also fixed: if the user specifies a number
with more than 20 digits it will be interpreted as two integers
(e.g. 10000...13 will be interpreted as 100.... and 13).
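
A user-space sketch of the idea behind the fix, under stated
assumptions: get_ulong_bounded() is a hypothetical name, and it uses
strtoul() with base 10 where the patch uses simple_strtoul() with base
0. The number is parsed from a bounded copy of the input, and a token
that fills the whole copy is rejected rather than split in two.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define TMPBUFLEN 22

/*
 * Parse one unsigned decimal number from *buf (at most *size bytes).
 * On success returns 0, stores the value in *val and advances
 * *buf / *size past the digits. Returns -1 on invalid input or when
 * the digits fill the whole temporary buffer, because then we cannot
 * tell whether the number continues beyond it.
 */
static int get_ulong_bounded(const char **buf, size_t *size,
			     unsigned long *val)
{
	char tmp[TMPBUFLEN], *end;
	unsigned long v;
	size_t len = *size;

	if (!len)
		return -1;
	if (len > TMPBUFLEN - 1)
		len = TMPBUFLEN - 1;
	memcpy(tmp, *buf, len);
	tmp[len] = '\0';

	if (!isdigit((unsigned char)tmp[0]))
		return -1;
	v = strtoul(tmp, &end, 10);
	len = (size_t)(end - tmp);

	/* Digits ran to the end of tmp: too long, refuse to split it. */
	if (len == TMPBUFLEN - 1)
		return -1;

	*val = v;
	*buf += len;
	*size -= len;
	return 0;
}
```

With this sketch, "123 456" yields 123 and leaves " 456" for the next
call, while a 25-digit string is rejected as a whole instead of being
read as two numbers.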

EFAULT handling changes as well. Before this patch, when an EFAULT
occurred in the middle of a write operation, some of the elements had
already been set, but that was not acknowledged to the user (by
shortening the write and returning the number of bytes accepted).
EFAULT is now treated like any other error: the number of bytes
accepted is acknowledged.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---

Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -2040,8 +2040,122 @@ int proc_dostring(struct ctl_table *tabl
 			       buffer, lenp, ppos);
 }
 
+static size_t proc_skip_spaces(char **buf)
+{
+	size_t ret;
+	char *tmp = skip_spaces(*buf);
+	ret = tmp - *buf;
+	*buf = tmp;
+	return ret;
+}
+
+#define TMPBUFLEN 22
+/**
+ * proc_get_long - reads an ASCII formated integer from a user buffer
+ *
+ * @buf - a kernel buffer
+ * @size - size of the kernel buffer
+ * @val - this is where the number will be stored
+ * @neg - set to %TRUE if number is negative
+ * @perm_tr - a vector which contains the allowed trailers
+ * @perm_tr_len - size of the perm_tr vector
+ * @tr - pointer to store the trailer character
+ *
+ * In case of success 0 is returned and buf and size are updated with
+ * the amount of bytes read. If tr is non NULL and a trailing
+ * character exist (size is non zero after returning from this
+ * function) tr is updated with the trailing character.
+ */
+static int proc_get_long(char **buf, size_t *size,
+			  unsigned long *val, bool *neg,
+			  const char *perm_tr, unsigned perm_tr_len, char *tr)
+{
+	int len;
+	char *p, tmp[TMPBUFLEN];
+
+	if (!*size)
+		return -EINVAL;
+
+	len = *size;
+	if (len > TMPBUFLEN - 1)
+		len = TMPBUFLEN - 1;
+
+	memcpy(tmp, *buf, len);
+
+	tmp[len] = 0;
+	p = tmp;
+	if (*p == '-' && *size > 1) {
+		*neg = true;
+		p++;
+	} else
+		*neg = false;
+	if (!isdigit(*p))
+		return -EINVAL;
+
+	*val = simple_strtoul(p, &p, 0);
 
-static int do_proc_dointvec_conv(int *negp, unsigned long *lvalp,
+	len = p - tmp;
+
+	/* We don't know if the next char is whitespace thus we may accept
+	 * invalid integers (e.g. 1234...a) or two integers instead of one
+	 * (e.g. 123...1). So lets not allow such large numbers. */
+	if (len == TMPBUFLEN - 1)
+		return -EINVAL;
+
+	if (len < *size && perm_tr_len && !memchr(perm_tr, *p, perm_tr_len))
+		return -EINVAL;
+
+	if (tr && (len < *size))
+		*tr = *p;
+
+	*buf += len;
+	*size -= len;
+
+	return 0;
+}
+
+/**
+ * proc_put_long - coverts an integer to a decimal ASCII formated string
+ *
+ * @buf - the user buffer
+ * @size - the size of the user buffer
+ * @val - the integer to be converted
+ * @neg - sign of the number, %TRUE for negative
+ *
+ * In case of success 0 is returned and buf and size are updated with
+ * the amount of bytes read.
+ */
+static int proc_put_long(void __user **buf, size_t *size, unsigned long val,
+			  bool neg)
+{
+	int len;
+	char tmp[TMPBUFLEN], *p = tmp;
+
+	sprintf(p, "%s%lu", neg ? "-" : "", val);
+	len = strlen(tmp);
+	if (len > *size)
+		len = *size;
+	if (copy_to_user(*buf, tmp, len))
+		return -EFAULT;
+	*size -= len;
+	*buf += len;
+	return 0;
+}
+#undef TMPBUFLEN
+
+static int proc_put_char(void __user **buf, size_t *size, char c)
+{
+	if (*size) {
+		char __user **buffer = (char __user **)buf;
+		if (put_user(c, *buffer))
+			return -EFAULT;
+		(*size)--, (*buffer)++;
+		*buf = *buffer;
+	}
+	return 0;
+}
+
+static int do_proc_dointvec_conv(bool *negp, unsigned long *lvalp,
 				 int *valp,
 				 int write, void *data)
 {
@@ -2050,33 +2164,31 @@ static int do_proc_dointvec_conv(int *ne
 	} else {
 		int val = *valp;
 		if (val < 0) {
-			*negp = -1;
+			*negp = true;
 			*lvalp = (unsigned long)-val;
 		} else {
-			*negp = 0;
+			*negp = false;
 			*lvalp = (unsigned long)val;
 		}
 	}
 	return 0;
 }
 
+static const char proc_wspace_sep[] = { ' ', '\t', '\n' };
+
 static int __do_proc_dointvec(void *tbl_data, struct ctl_table *table,
 		  int write, void __user *buffer,
 		  size_t *lenp, loff_t *ppos,
-		  int (*conv)(int *negp, unsigned long *lvalp, int *valp,
+		  int (*conv)(bool *negp, unsigned long *lvalp, int *valp,
 			      int write, void *data),
 		  void *data)
 {
-#define TMPBUFLEN 21
-	int *i, vleft, first = 1, neg;
-	unsigned long lval;
-	size_t left, len;
+	int *i, vleft, first = 1, err = 0;
+	unsigned long page = 0;
+	size_t left;
+	char *kbuf;
 	
-	char buf[TMPBUFLEN], *p;
-	char __user *s = buffer;
-	
-	if (!tbl_data || !table->maxlen || !*lenp ||
-	    (*ppos && !write)) {
+	if (!tbl_data || !table->maxlen || !*lenp || (*ppos && !write)) {
 		*lenp = 0;
 		return 0;
 	}
@@ -2088,89 +2200,69 @@ static int __do_proc_dointvec(void *tbl_
 	if (!conv)
 		conv = do_proc_dointvec_conv;
 
+	if (write) {
+		if (left > PAGE_SIZE - 1)
+			left = PAGE_SIZE - 1;
+		page = __get_free_page(GFP_TEMPORARY);
+		kbuf = (char *) page;
+		if (!kbuf)
+			return -ENOMEM;
+		if (copy_from_user(kbuf, buffer, left)) {
+			err = -EFAULT;
+			goto free;
+		}
+		kbuf[left] = 0;
+	}
+
 	for (; left && vleft--; i++, first=0) {
-		if (write) {
-			while (left) {
-				char c;
-				if (get_user(c, s))
-					return -EFAULT;
-				if (!isspace(c))
-					break;
-				left--;
-				s++;
-			}
-			if (!left)
-				break;
-			neg = 0;
-			len = left;
-			if (len > sizeof(buf) - 1)
-				len = sizeof(buf) - 1;
-			if (copy_from_user(buf, s, len))
-				return -EFAULT;
-			buf[len] = 0;
-			p = buf;
-			if (*p == '-' && left > 1) {
-				neg = 1;
-				p++;
-			}
-			if (*p < '0' || *p > '9')
-				break;
+		unsigned long lval;
+		bool neg;
 
-			lval = simple_strtoul(p, &p, 0);
+		if (write) {
+			left -= proc_skip_spaces(&kbuf);
 
-			len = p-buf;
-			if ((len < left) && *p && !isspace(*p))
+			err = proc_get_long(&kbuf, &left, &lval, &neg,
+					     proc_wspace_sep,
+					     sizeof(proc_wspace_sep), NULL);
+			if (err)
 				break;
-			s += len;
-			left -= len;
-
-			if (conv(&neg, &lval, i, 1, data))
+			if (conv(&neg, &lval, i, 1, data)) {
+				err = -EINVAL;
 				break;
+			}
 		} else {
-			p = buf;
+			if (conv(&neg, &lval, i, 0, data)) {
+				err = -EINVAL;
+				break;
+			}
 			if (!first)
-				*p++ = '\t';
-	
-			if (conv(&neg, &lval, i, 0, data))
+				err = proc_put_char(&buffer, &left, '\t');
+			if (err)
+				break;
+			err = proc_put_long(&buffer, &left, lval, neg);
+			if (err)
 				break;
-
-			sprintf(p, "%s%lu", neg ? "-" : "", lval);
-			len = strlen(buf);
-			if (len > left)
-				len = left;
-			if(copy_to_user(s, buf, len))
-				return -EFAULT;
-			left -= len;
-			s += len;
 		}
 	}
 
-	if (!write && !first && left) {
-		if(put_user('\n', s))
-			return -EFAULT;
-		left--, s++;
-	}
+	if (!write && !first && left && !err)
+		err = proc_put_char(&buffer, &left, '\n');
+	if (write && !err)
+		left -= proc_skip_spaces(&kbuf);
+free:
 	if (write) {
-		while (left) {
-			char c;
-			if (get_user(c, s++))
-				return -EFAULT;
-			if (!isspace(c))
-				break;
-			left--;
-		}
+		free_page(page);
+		if (first)
+			return err ? : -EINVAL;
 	}
-	if (write && first)
-		return -EINVAL;
 	*lenp -= left;
 	*ppos += *lenp;
-	return 0;
-#undef TMPBUFLEN
+	return err;
 }
 
 static int do_proc_dointvec(struct ctl_table *table, int write,
 		  void __user *buffer, size_t *lenp, loff_t *ppos,
-		  int (*conv)(int *negp, unsigned long *lvalp, int *valp,
+		  int (*conv)(bool *negp, unsigned long *lvalp, int *valp,
 			      int write, void *data),
 		  void *data)
 {
@@ -2238,8 +2330,8 @@ struct do_proc_dointvec_minmax_conv_para
 	int *max;
 };
 
-static int do_proc_dointvec_minmax_conv(int *negp, unsigned long *lvalp, 
-					int *valp, 
+static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
+					int *valp,
 					int write, void *data)
 {
 	struct do_proc_dointvec_minmax_conv_param *param = data;
@@ -2252,10 +2344,10 @@ static int do_proc_dointvec_minmax_conv(
 	} else {
 		int val = *valp;
 		if (val < 0) {
-			*negp = -1;
+			*negp = true;
 			*lvalp = (unsigned long)-val;
 		} else {
-			*negp = 0;
+			*negp = false;
 			*lvalp = (unsigned long)val;
 		}
 	}
@@ -2295,102 +2387,78 @@ static int __do_proc_doulongvec_minmax(v
 				     unsigned long convmul,
 				     unsigned long convdiv)
 {
-#define TMPBUFLEN 21
-	unsigned long *i, *min, *max, val;
-	int vleft, first=1, neg;
-	size_t len, left;
-	char buf[TMPBUFLEN], *p;
-	char __user *s = buffer;
-	
-	if (!data || !table->maxlen || !*lenp ||
-	    (*ppos && !write)) {
+	unsigned long *i, *min, *max;
+	int vleft, first = 1, err = 0;
+	unsigned long page = 0;
+	size_t left;
+	char *kbuf;
+
+	if (!data || !table->maxlen || !*lenp || (*ppos && !write)) {
 		*lenp = 0;
 		return 0;
 	}
-	
+
 	i = (unsigned long *) data;
 	min = (unsigned long *) table->extra1;
 	max = (unsigned long *) table->extra2;
 	vleft = table->maxlen / sizeof(unsigned long);
 	left = *lenp;
-	
+
+	if (write) {
+		if (left > PAGE_SIZE - 1)
+			left = PAGE_SIZE - 1;
+		page = __get_free_page(GFP_TEMPORARY);
+		kbuf = (char *) page;
+		if (!kbuf)
+			return -ENOMEM;
+		if (copy_from_user(kbuf, buffer, left)) {
+			err = -EFAULT;
+			goto free;
+		}
+		kbuf[left] = 0;
+	}
+
 	for (; left && vleft--; i++, min++, max++, first=0) {
+		unsigned long val;
+
 		if (write) {
-			while (left) {
-				char c;
-				if (get_user(c, s))
-					return -EFAULT;
-				if (!isspace(c))
-					break;
-				left--;
-				s++;
-			}
-			if (!left)
-				break;
-			neg = 0;
-			len = left;
-			if (len > TMPBUFLEN-1)
-				len = TMPBUFLEN-1;
-			if (copy_from_user(buf, s, len))
-				return -EFAULT;
-			buf[len] = 0;
-			p = buf;
-			if (*p == '-' && left > 1) {
-				neg = 1;
-				p++;
-			}
-			if (*p < '0' || *p > '9')
-				break;
-			val = simple_strtoul(p, &p, 0) * convmul / convdiv ;
-			len = p-buf;
-			if ((len < left) && *p && !isspace(*p))
+			bool neg;
+
+			left -= proc_skip_spaces(&kbuf);
+
+			err = proc_get_long(&kbuf, &left, &val, &neg,
+					     proc_wspace_sep,
+					     sizeof(proc_wspace_sep), NULL);
+			if (err)
 				break;
 			if (neg)
-				val = -val;
-			s += len;
-			left -= len;
-
-			if(neg)
 				continue;
 			if ((min && val < *min) || (max && val > *max))
 				continue;
 			*i = val;
 		} else {
-			p = buf;
+			val = convdiv * (*i) / convmul;
 			if (!first)
-				*p++ = '\t';
-			sprintf(p, "%lu", convdiv * (*i) / convmul);
-			len = strlen(buf);
-			if (len > left)
-				len = left;
-			if(copy_to_user(s, buf, len))
-				return -EFAULT;
-			left -= len;
-			s += len;
+				err = proc_put_char(&buffer, &left, '\t');
+			err = proc_put_long(&buffer, &left, val, false);
+			if (err)
+				break;
 		}
 	}
 
-	if (!write && !first && left) {
-		if(put_user('\n', s))
-			return -EFAULT;
-		left--, s++;
-	}
+	if (!write && !first && left && !err)
+		err = proc_put_char(&buffer, &left, '\n');
+	if (write && !err)
+		left -= proc_skip_spaces(&kbuf);
+free:
 	if (write) {
-		while (left) {
-			char c;
-			if (get_user(c, s++))
-				return -EFAULT;
-			if (!isspace(c))
-				break;
-			left--;
-		}
+		free_page(page);
+		if (first)
+			return err ? : -EINVAL;
 	}
-	if (write && first)
-		return -EINVAL;
 	*lenp -= left;
 	*ppos += *lenp;
-	return 0;
-#undef TMPBUFLEN
+	return err;
 }
 
 static int do_proc_doulongvec_minmax(struct ctl_table *table, int write,
@@ -2451,7 +2519,7 @@ int proc_doulongvec_ms_jiffies_minmax(st
 }
 
 
-static int do_proc_dointvec_jiffies_conv(int *negp, unsigned long *lvalp,
+static int do_proc_dointvec_jiffies_conv(bool *negp, unsigned long *lvalp,
 					 int *valp,
 					 int write, void *data)
 {
@@ -2463,10 +2531,10 @@ static int do_proc_dointvec_jiffies_conv
 		int val = *valp;
 		unsigned long lval;
 		if (val < 0) {
-			*negp = -1;
+			*negp = true;
 			lval = (unsigned long)-val;
 		} else {
-			*negp = 0;
+			*negp = false;
 			lval = (unsigned long)val;
 		}
 		*lvalp = lval / HZ;
@@ -2474,7 +2542,7 @@ static int do_proc_dointvec_jiffies_conv
 	return 0;
 }
 
-static int do_proc_dointvec_userhz_jiffies_conv(int *negp, unsigned long *lvalp,
+static int do_proc_dointvec_userhz_jiffies_conv(bool *negp, unsigned long *lvalp,
 						int *valp,
 						int write, void *data)
 {
@@ -2486,10 +2554,10 @@ static int do_proc_dointvec_userhz_jiffi
 		int val = *valp;
 		unsigned long lval;
 		if (val < 0) {
-			*negp = -1;
+			*negp = true;
 			lval = (unsigned long)-val;
 		} else {
-			*negp = 0;
+			*negp = false;
 			lval = (unsigned long)val;
 		}
 		*lvalp = jiffies_to_clock_t(lval);
@@ -2497,7 +2565,7 @@ static int do_proc_dointvec_userhz_jiffi
 	return 0;
 }
 
-static int do_proc_dointvec_ms_jiffies_conv(int *negp, unsigned long *lvalp,
+static int do_proc_dointvec_ms_jiffies_conv(bool *negp, unsigned long *lvalp,
 					    int *valp,
 					    int write, void *data)
 {
@@ -2507,10 +2575,10 @@ static int do_proc_dointvec_ms_jiffies_c
 		int val = *valp;
 		unsigned long lval;
 		if (val < 0) {
-			*negp = -1;
+			*negp = true;
 			lval = (unsigned long)-val;
 		} else {
-			*negp = 0;
+			*negp = false;
 			lval = (unsigned long)val;
 		}
 		*lvalp = jiffies_to_msecs(lval);

* [Patch v10 0/3] net: reserve ports for applications using fixed port numbers
From: Amerigo Wang @ 2010-05-05 10:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Octavian Purdila, ebiederm, Eric Dumazet, penguin-kernel, netdev,
	Neil Horman, Amerigo Wang, xiaosuo, adobriyan, David Miller


Changes from the previous version:
- Use 'true' and 'false' for bool's;
- Fix some coding style problems;
- Allow appending lines to bitmap proc file so that it will be
  easier to add new bits.

------------------>

This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which
allows users to reserve ports for third-party applications.

The reserved ports will not be used by automatic port assignments
(e.g. when calling connect() or bind() with port number 0). Explicit
port allocation behavior is unchanged.
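
As a rough user-space model of that rule (not the kernel code): the
automatic search walks the local port range and skips any port whose
bit is set in the reserved bitmap. pick_port(), mark_port(),
test_port() and the two global bitmaps are hypothetical names used
only for this sketch.

```c
#include <assert.h>
#include <limits.h>

#define NPORTS 65536
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define NWORDS ((NPORTS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical stand-ins for the reserved bitmap and the bind table. */
static unsigned long reserved[NWORDS];
static unsigned long in_use[NWORDS];

static void mark_port(unsigned long *map, unsigned int port)
{
	map[port / BITS_PER_WORD] |= 1UL << (port % BITS_PER_WORD);
}

static int test_port(const unsigned long *map, unsigned int port)
{
	return (int)((map[port / BITS_PER_WORD] >> (port % BITS_PER_WORD)) & 1UL);
}

/*
 * Automatic port selection: scan [low, high] once, starting at start
 * and wrapping around, and return the first port that is neither
 * reserved nor in use. Returns -1 if no such port exists.
 */
static int pick_port(unsigned int low, unsigned int high, unsigned int start)
{
	unsigned int remaining, rover = start;

	assert(low <= high && high < NPORTS);
	assert(start >= low && start <= high);

	remaining = high - low + 1;
	while (remaining--) {
		if (!test_port(reserved, rover) && !test_port(in_use, rover))
			return (int)rover;
		if (++rover > high)
			rover = low;
	}
	return -1;
}
```

An explicit bind to a reserved port is not modeled here; as stated
above, that path does not consult the reserved bitmap.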

There is still some misbehavior in proc parsing for odd invalid input
(for "40000\0-40001" everything is acknowledged but only 40000 is
accepted), but it is not easy to fix without changing the current
"acknowledge how much we accepted" behavior.

Because of that, and because the same issues are present in the
existing proc_dointvec code as well, I don't think it's worth holding
back the actual feature (port reservation) over such petty error
recovery issues.

* Re: [Patch 2/3] sysctl: add proc_do_large_bitmap
From: Cong Wang @ 2010-05-05  9:20 UTC (permalink / raw)
  To: Changli Gao
  Cc: linux-kernel, Octavian Purdila, Eric Dumazet, penguin-kernel,
	netdev, Neil Horman, ebiederm, adobriyan, David Miller
In-Reply-To: <4BE0E27C.7040200@redhat.com>

Cong Wang wrote:
> Changli Gao wrote:
>>                      add the following lines to let "echo 1-10 >>
>> /proc/..." work as normal.
> 
> Hmm, I haven't tested this, what did you see if we append
> lines into it?
> 
> Also, do we need appending lines to this /proc file when design it?
> Octavian? Eric?
> 

Hmm, currently this behaves like other /proc files, IOW,
echo 'foo' >> /proc/XXX is the same as echo 'foo' > /proc/XXX.

I think it is reasonable for bitmap /proc files to have
echo 'foo' >> /proc/XXX behave as it does for non-proc files, that is,
append numbers to the file, as Changli mentioned.
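
A minimal sketch of the two behaviors, for illustration: the v10
posting of proc_do_large_bitmap tells them apart by the file position,
OR-ing the newly parsed bits into the bitmap when the position is
non-zero and replacing the bitmap otherwise. apply_bitmap() and its
parameters are hypothetical names for a user-space model of that
choice.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Apply a freshly parsed bitmap to dst.
 * pos == 0: overwrite, the old contents are replaced.
 * pos != 0: append, the new bits are OR-ed into the old contents.
 */
static void apply_bitmap(unsigned long *dst, const unsigned long *parsed,
			 size_t nwords, long long pos)
{
	size_t i;

	assert(dst != NULL && parsed != NULL);
	for (i = 0; i < nwords; i++)
		dst[i] = pos ? (dst[i] | parsed[i]) : parsed[i];
}
```

Starting from bits {0, 2}, an overwrite with {1} leaves {1}, while an
append with {1} leaves {0, 1, 2}.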

Any objections?

Thanks.

* 3 packet TCP window limit?
From: dormando @ 2010-05-05  9:10 UTC (permalink / raw)
  To: netdev

Hey,

Noticed in Linux that no matter what sysctl variable I twiddle, or what
TCP congestion algorithm is running, TCP will wait for remote acks after
sending the first 3 packets. After that it's normal.

Apologies, it's hard to describe:

Linux server listening.

Remote -> SYN
(RTT wait)
Linux -> SYN/ACK
Remote -> ACK
Remote -> Packet (small HTTP request)
(RTT wait)
Linux -> Packet (x 3)
Remote -> (returning acks per packet)
(RTT wait)
Linux -> More packets (up to window size)

If the request response fits in 3 packets or less, that third RTT wait
never happens. The remote client gets all its data, and sends back all the
FIN/ACK packets for closing the connection.

What's bizarre is that this 3 packet/4 packet barrier is regardless of how
much data there is to send. I can cause the extra RTT to flip on or off by
sending exactly +/- 1 byte to cause an extra packet.

Holding the connection open and repeating the request any number of times
runs just fine, after the initial request.

You can pretty easily see this by:
tc qdisc add dev eth0 root netem delay 100ms
... then fetching a 3k file, then 4k file from an http server running
linux. Well, at least I can see this easily. I tried on a half dozen boxes
(2.6.11 through 2.6.32).

I'm trying to track down where in the code this is, or why my sysctl
tuning isn't affecting it. I can't discern its purpose. The lag it causes
is pretty awful for far away clients; adding 300ms of latency will make a
small request take a full second, instead of 600ms.
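
The arithmetic can be sketched with a simple model. The model itself is
an assumption and is not established anywhere in this thread: the
sender's first flight is limited to init_window segments and each later
flight is twice the previous one. flights_needed() and fetch_time_ms()
are hypothetical names, and transmission time is ignored.

```c
#include <assert.h>
#include <limits.h>

/*
 * Number of flights needed to deliver `segments` segments when the
 * first flight carries at most init_window segments and every later
 * flight is twice as large as the one before (assumed model).
 */
static unsigned int flights_needed(unsigned int segments,
				   unsigned int init_window)
{
	unsigned int flights = 0, window = init_window, sent = 0;

	assert(init_window > 0);
	while (sent < segments) {
		unsigned int room = segments - sent;

		sent += window < room ? window : room;
		if (window <= UINT_MAX / 2)
			window *= 2;
		flights++;
	}
	return flights;
}

/*
 * Time from the client's SYN until the last segment arrives:
 * one RTT for the handshake, then one RTT per flight (the first
 * flight answers the request sent along with the final ACK).
 */
static unsigned int fetch_time_ms(unsigned int segments,
				  unsigned int init_window,
				  unsigned int rtt_ms)
{
	return rtt_ms * (1 + flights_needed(segments, init_window));
}
```

Under this model, with a 300ms RTT and a first flight of 3 segments, a
3-segment response takes 600ms and a 4-segment response takes 900ms,
which is in line with the numbers above.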

I'm slogging through the code but any insight would be greatly
appreciated!

-Dormando

