Netdev List
 help / color / mirror / Atom feed
* [net-next PATCH v2] net/ethernet: remove useless is_valid_ether_addr from drivers ndo_open
From: Joachim Eastwood @ 2012-11-16 14:47 UTC (permalink / raw)
  To: davem, nicolas.ferre, shemminger, steve.glendinning, stigge,
	msink
  Cc: netdev, Joachim Eastwood

If ndo_validate_addr is set to the generic eth_validate_addr
function there is no point in calling is_valid_ether_addr
from driver ndo_open if ndo_open is not used elsewhere in
the driver.

With this change is_valid_ether_addr will be called from the
generic eth_validate_addr function. So there should be no change
in the actual behavior.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
---
Hi,

v2: Audit changed drivers to ensure ndo_open functions is
only called from net core and not used elsewhere in the
drivers.

net/core/dev.c __dev_open does
 1165        if (ops->ndo_validate_addr)
 1166                ret = ops->ndo_validate_addr(dev);
 1167
 1168        if (!ret && ops->ndo_open)
 1169                ret = ops->ndo_open(dev);

so there shouldn't be a need for a is_valid_ether_addr in
ndo_open if the open function is not used elsewhere in the
driver.

ndo_validate_addr is set to eth_validate_addr in all
changed drivers.

regards
Joachim Eastwood

 drivers/net/ethernet/8390/etherh.c        |  6 ------
 drivers/net/ethernet/cadence/at91_ether.c |  3 ---
 drivers/net/ethernet/cadence/macb.c       |  3 ---
 drivers/net/ethernet/dnet.c               |  3 ---
 drivers/net/ethernet/i825xx/ether1.c      |  6 ------
 drivers/net/ethernet/micrel/ks8695net.c   |  3 ---
 drivers/net/ethernet/nxp/lpc_eth.c        |  4 +---
 drivers/net/ethernet/seeq/ether3.c        |  6 ------
 drivers/net/ethernet/smsc/smc911x.c       | 10 ----------
 drivers/net/ethernet/smsc/smc91x.c        | 10 ----------
 drivers/net/ethernet/smsc/smsc911x.c      |  5 -----
 drivers/net/ethernet/wiznet/w5100.c       |  2 --
 drivers/net/ethernet/wiznet/w5300.c       |  2 --
 13 files changed, 1 insertion(+), 62 deletions(-)

diff --git a/drivers/net/ethernet/8390/etherh.c b/drivers/net/ethernet/8390/etherh.c
index 8322c54..6414e84 100644
--- a/drivers/net/ethernet/8390/etherh.c
+++ b/drivers/net/ethernet/8390/etherh.c
@@ -463,12 +463,6 @@ etherh_open(struct net_device *dev)
 {
 	struct ei_device *ei_local = netdev_priv(dev);
 
-	if (!is_valid_ether_addr(dev->dev_addr)) {
-		printk(KERN_WARNING "%s: invalid ethernet MAC address\n",
-			dev->name);
-		return -EINVAL;
-	}
-
 	if (request_irq(dev->irq, __ei_interrupt, 0, dev->name, dev))
 		return -EAGAIN;
 
diff --git a/drivers/net/ethernet/cadence/at91_ether.c b/drivers/net/ethernet/cadence/at91_ether.c
index e7a476c..716cc01 100644
--- a/drivers/net/ethernet/cadence/at91_ether.c
+++ b/drivers/net/ethernet/cadence/at91_ether.c
@@ -97,9 +97,6 @@ static int at91ether_open(struct net_device *dev)
 	u32 ctl;
 	int ret;
 
-	if (!is_valid_ether_addr(dev->dev_addr))
-		return -EADDRNOTAVAIL;
-
 	/* Clear internal statistics */
 	ctl = macb_readl(lp, NCR);
 	macb_writel(lp, NCR, ctl | MACB_BIT(CLRSTAT));
diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
index edb2aba..d556c52 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -1218,9 +1218,6 @@ static int macb_open(struct net_device *dev)
 	if (!bp->phy_dev)
 		return -EAGAIN;
 
-	if (!is_valid_ether_addr(dev->dev_addr))
-		return -EADDRNOTAVAIL;
-
 	err = macb_alloc_consistent(bp);
 	if (err) {
 		netdev_err(dev, "Unable to allocate DMA memory (error %d)\n",
diff --git a/drivers/net/ethernet/dnet.c b/drivers/net/ethernet/dnet.c
index 290b26f..feb5095 100644
--- a/drivers/net/ethernet/dnet.c
+++ b/drivers/net/ethernet/dnet.c
@@ -664,9 +664,6 @@ static int dnet_open(struct net_device *dev)
 	if (!bp->phy_dev)
 		return -EAGAIN;
 
-	if (!is_valid_ether_addr(dev->dev_addr))
-		return -EADDRNOTAVAIL;
-
 	napi_enable(&bp->napi);
 	dnet_init_hw(bp);
 
diff --git a/drivers/net/ethernet/i825xx/ether1.c b/drivers/net/ethernet/i825xx/ether1.c
index 067db3f..7b9609d 100644
--- a/drivers/net/ethernet/i825xx/ether1.c
+++ b/drivers/net/ethernet/i825xx/ether1.c
@@ -638,12 +638,6 @@ ether1_txalloc (struct net_device *dev, int size)
 static int
 ether1_open (struct net_device *dev)
 {
-	if (!is_valid_ether_addr(dev->dev_addr)) {
-		printk(KERN_WARNING "%s: invalid ethernet MAC address\n",
-			dev->name);
-		return -EINVAL;
-	}
-
 	if (request_irq(dev->irq, ether1_interrupt, 0, "ether1", dev))
 		return -EAGAIN;
 
diff --git a/drivers/net/ethernet/micrel/ks8695net.c b/drivers/net/ethernet/micrel/ks8695net.c
index dccae1d..e62c312 100644
--- a/drivers/net/ethernet/micrel/ks8695net.c
+++ b/drivers/net/ethernet/micrel/ks8695net.c
@@ -1249,9 +1249,6 @@ ks8695_open(struct net_device *ndev)
 	struct ks8695_priv *ksp = netdev_priv(ndev);
 	int ret;
 
-	if (!is_valid_ether_addr(ndev->dev_addr))
-		return -EADDRNOTAVAIL;
-
 	ks8695_reset(ksp);
 
 	ks8695_update_mac(ksp);
diff --git a/drivers/net/ethernet/nxp/lpc_eth.c b/drivers/net/ethernet/nxp/lpc_eth.c
index af8b414..db6e101 100644
--- a/drivers/net/ethernet/nxp/lpc_eth.c
+++ b/drivers/net/ethernet/nxp/lpc_eth.c
@@ -1219,9 +1219,6 @@ static int lpc_eth_open(struct net_device *ndev)
 	if (netif_msg_ifup(pldat))
 		dev_dbg(&pldat->pdev->dev, "enabling %s\n", ndev->name);
 
-	if (!is_valid_ether_addr(ndev->dev_addr))
-		return -EADDRNOTAVAIL;
-
 	__lpc_eth_clock_enable(pldat, true);
 
 	/* Reset and initialize */
@@ -1301,6 +1298,7 @@ static const struct net_device_ops lpc_netdev_ops = {
 	.ndo_set_rx_mode	= lpc_eth_set_multicast_list,
 	.ndo_do_ioctl		= lpc_eth_ioctl,
 	.ndo_set_mac_address	= lpc_set_mac_address,
+	.ndo_validate_addr	= eth_validate_addr,
 	.ndo_change_mtu		= eth_change_mtu,
 };
 
diff --git a/drivers/net/ethernet/seeq/ether3.c b/drivers/net/ethernet/seeq/ether3.c
index 6a40dd0..72a0174 100644
--- a/drivers/net/ethernet/seeq/ether3.c
+++ b/drivers/net/ethernet/seeq/ether3.c
@@ -399,12 +399,6 @@ ether3_probe_bus_16(struct net_device *dev, int val)
 static int
 ether3_open(struct net_device *dev)
 {
-	if (!is_valid_ether_addr(dev->dev_addr)) {
-		printk(KERN_WARNING "%s: invalid ethernet MAC address\n",
-			dev->name);
-		return -EINVAL;
-	}
-
 	if (request_irq(dev->irq, ether3_interrupt, 0, "ether3", dev))
 		return -EAGAIN;
 
diff --git a/drivers/net/ethernet/smsc/smc911x.c b/drivers/net/ethernet/smsc/smc911x.c
index 8d15f7a..990f574 100644
--- a/drivers/net/ethernet/smsc/smc911x.c
+++ b/drivers/net/ethernet/smsc/smc911x.c
@@ -1400,16 +1400,6 @@ smc911x_open(struct net_device *dev)
 
 	DBG(SMC_DEBUG_FUNC, "%s: --> %s\n", dev->name, __func__);
 
-	/*
-	 * Check that the address is valid.  If its not, refuse
-	 * to bring the device up.	 The user must specify an
-	 * address using ifconfig eth0 hw ether xx:xx:xx:xx:xx:xx
-	 */
-	if (!is_valid_ether_addr(dev->dev_addr)) {
-		PRINTK("%s: no valid ethernet hw addr\n", __func__);
-		return -EINVAL;
-	}
-
 	/* reset the hardware */
 	smc911x_reset(dev);
 
diff --git a/drivers/net/ethernet/smsc/smc91x.c b/drivers/net/ethernet/smsc/smc91x.c
index 318adc9..f516e5a 100644
--- a/drivers/net/ethernet/smsc/smc91x.c
+++ b/drivers/net/ethernet/smsc/smc91x.c
@@ -1474,16 +1474,6 @@ smc_open(struct net_device *dev)
 
 	DBG(2, "%s: %s\n", dev->name, __func__);
 
-	/*
-	 * Check that the address is valid.  If its not, refuse
-	 * to bring the device up.  The user must specify an
-	 * address using ifconfig eth0 hw ether xx:xx:xx:xx:xx:xx
-	 */
-	if (!is_valid_ether_addr(dev->dev_addr)) {
-		PRINTK("%s: no valid ethernet hw addr\n", __func__);
-		return -EINVAL;
-	}
-
 	/* Setup the default Register Modes */
 	lp->tcr_cur_mode = TCR_DEFAULT;
 	lp->rcr_cur_mode = RCR_DEFAULT;
diff --git a/drivers/net/ethernet/smsc/smsc911x.c b/drivers/net/ethernet/smsc/smsc911x.c
index 62d1baf..a088c4f 100644
--- a/drivers/net/ethernet/smsc/smsc911x.c
+++ b/drivers/net/ethernet/smsc/smsc911x.c
@@ -1463,11 +1463,6 @@ static int smsc911x_open(struct net_device *dev)
 		return -EAGAIN;
 	}
 
-	if (!is_valid_ether_addr(dev->dev_addr)) {
-		SMSC_WARN(pdata, hw, "dev_addr is not a valid MAC address");
-		return -EADDRNOTAVAIL;
-	}
-
 	/* Reset the LAN911x */
 	if (smsc911x_soft_reset(pdata)) {
 		SMSC_WARN(pdata, hw, "soft reset failed");
diff --git a/drivers/net/ethernet/wiznet/w5100.c b/drivers/net/ethernet/wiznet/w5100.c
index 2c08bf6..7daf92e 100644
--- a/drivers/net/ethernet/wiznet/w5100.c
+++ b/drivers/net/ethernet/wiznet/w5100.c
@@ -580,8 +580,6 @@ static int w5100_open(struct net_device *ndev)
 	struct w5100_priv *priv = netdev_priv(ndev);
 
 	netif_info(priv, ifup, ndev, "enabling\n");
-	if (!is_valid_ether_addr(ndev->dev_addr))
-		return -EINVAL;
 	w5100_hw_start(priv);
 	napi_enable(&priv->napi);
 	netif_start_queue(ndev);
diff --git a/drivers/net/ethernet/wiznet/w5300.c b/drivers/net/ethernet/wiznet/w5300.c
index 88943d9..bd9eec6 100644
--- a/drivers/net/ethernet/wiznet/w5300.c
+++ b/drivers/net/ethernet/wiznet/w5300.c
@@ -500,8 +500,6 @@ static int w5300_open(struct net_device *ndev)
 	struct w5300_priv *priv = netdev_priv(ndev);
 
 	netif_info(priv, ifup, ndev, "enabling\n");
-	if (!is_valid_ether_addr(ndev->dev_addr))
-		return -EINVAL;
 	w5300_hw_start(priv);
 	napi_enable(&priv->napi);
 	netif_start_queue(ndev);
-- 
1.8.0

^ permalink raw reply related

* Re: [PATCH 0/4] Implement persistent grant in xen-netfront/netback
From: Konrad Rzeszutek Wilk @ 2012-11-16 15:18 UTC (permalink / raw)
  To: Annie Li; +Cc: xen-devel, netdev, Ian.Campbell
In-Reply-To: <1352962987-541-1-git-send-email-annie.li@oracle.com>

On Thu, Nov 15, 2012 at 03:03:07PM +0800, Annie Li wrote:
> This patch implements persistent grants for xen-netfront/netback. This
> mechanism maintains page pools in netback/netfront, these page pools is used to
> save grant pages which are mapped. This way improve performance which is wasted
> when doing grant operations.
> 
> Current netback/netfront does map/unmap grant operations frequently when
> transmitting/receiving packets, and grant operations costs much cpu clock. In
> this patch, netfront/netback maps grant pages when needed and then saves them
> into a page pool for future use. All these pages will be unmapped when
> removing/releasing the net device.
> 
> In netfront, two pools are maintained for transmitting and receiving packets.
> When new grant pages are needed, the driver gets grant pages from this pool
> first. If no free grant page exists, it allocates new page, maps it and then
> saves it into the pool. The pool size for transmit/receive is exactly tx/rx
> ring size. The driver uses memcpy(not grantcopy) to copy data grant pages.
> Here, memcpy is copying the whole page size data. I tried to copy len size data
> from offset, but network does not seem work well. I am trying to find the root
> cause now.
> 
> In netback, it also maintains two page pools for tx/rx. When netback gets a
> request, it does a search first to find out whether the grant reference of
> this request is already mapped into its page pool. If the grant ref is mapped,
> the address of this mapped page is gotten and memcpy is used to copy data
> between grant pages. However, if the grant ref is not mapped, a new page is
> allocated, mapped with this grant ref, and then saved into page pool for
> future use. Similarly, memcpy replaces grant copy to copy data between grant
> pages. In this implementation, two arrays(gnttab_tx_vif,gnttab_rx_vif) are
> used to save vif pointer for every request because current netback is not
> per-vif based. This would be changed after implementing 1:1 model in netback.
> 
> This patch supports both persistent-grant and non persistent grant. A new
> xenstore key "feature-persistent-grants" is used to represent this feature.
> 
> This patch is based on linux3.4-rc3. I hit netperf/netserver failure on
> linux latest version v3.7-rc1, v3.7-rc2 and v3.7-rc4. Not sure whether this
> netperf/netserver failure connects compound page commit in v3.7-rc1, but I did
> hit BUG_ON with debug patch from thread
> http://lists.xen.org/archives/html/xen-devel/2012-10/msg00893.html

FYI, I get this:

 477.814511] BUG: sleeping function called from invalid context at /home/konrad/ssd/linux/mm/page_alloc.c:2487
[  477.815281] in_atomic(): 1, irqs_disabled(): 1, pid: 3017, name: netperf
[  477.815281] Pid: 3017, comm: netperf Not tainted 3.5.0upstream-00004-g69047bb #1
[  477.815281] Call Trace:
[  477.815281]  [<ffffffff810b990a>] __might_sleep+0xda/0x100
[  477.815281]  [<ffffffff81142e93>] __alloc_pages_nodemask+0x223/0x920
[  477.815281]  [<ffffffff81158439>] ? zone_statistics+0x99/0xc0
[  477.815281]  [<ffffffff81076e79>] ? default_spin_lock_flags+0x9/0x10
[  477.815281]  [<ffffffff81615e3a>] ? _raw_spin_lock_irqsave+0x3a/0x50
[  477.815281]  [<ffffffff81076e79>] ? default_spin_lock_flags+0x9/0x10
[  477.815281]  [<ffffffff81098977>] ? lock_timer_base+0x37/0x70
[  477.815281]  [<ffffffff8109a03d>] ? mod_timer_pending+0x11d/0x230
[  477.815281]  [<ffffffff81616144>] ? _raw_spin_unlock_bh+0x24/0x30
[  477.815281]  [<ffffffff8117e7e1>] alloc_pages_current+0xb1/0x110
[  477.815281]  [<ffffffffa0034238>] xennet_alloc_tx_ref+0x78/0x1c0 [xen_netfront]
[  477.815281]  [<ffffffffa00344eb>] xennet_start_xmit+0x16b/0x9f0 [xen_netfront]
[  477.815281]  [<ffffffff814c69eb>] dev_hard_start_xmit+0x2fb/0x6f0
[  477.815281]  [<ffffffff814e4566>] sch_direct_xmit+0x116/0x1e0
[  477.815281]  [<ffffffff814c6f6a>] dev_queue_xmit+0x18a/0x6b0
[  477.815281]  [<ffffffff8151264e>] ip_finish_output+0x18e/0x300
[  477.815281]  [<ffffffff81512821>] ip_output+0x61/0xa0
[  477.815281]  [<ffffffff81511b82>] ? __ip_local_out+0xa2/0xb0
[  477.815281]  [<ffffffff81511bb4>] ip_local_out+0x24/0x30
[  477.815281]  [<ffffffff81511ffe>] ip_queue_xmit+0x15e/0x410
[  477.815281]  [<ffffffff81528354>] tcp_transmit_skb+0x424/0x8f0
[  477.815281]  [<ffffffff8152a8c2>] tcp_write_xmit+0x1f2/0x9c0
[  477.815281]  [<ffffffff81182194>] ? ksize+0x14/0x70
[  477.815281]  [<ffffffff8152b711>] __tcp_push_pending_frames+0x21/0x90
[  477.815281]  [<ffffffff8151db23>] tcp_sendmsg+0x983/0xcd0
[  477.815281]  [<ffffffff81540daf>] inet_sendmsg+0x7f/0xd0
[  477.815281]  [<ffffffff81290dde>] ? selinux_socket_sendmsg+0x1e/0x20
[  477.815281]  [<ffffffff814aed13>] sock_sendmsg+0xf3/0x120
[  477.815281]  [<ffffffff81076f48>] ? pvclock_clocksource_read+0x58/0xd0
[  477.815281]  [<ffffffff812de7c0>] ? timerqueue_add+0x60/0xb0
[  477.815281]  [<ffffffff810b0b85>] ? enqueue_hrtimer+0x25/0xb0
[  477.815281]  [<ffffffff814af4d4>] sys_sendto+0x104/0x140
[  477.815281]  [<ffffffff81041279>] ? xen_clocksource_read+0x39/0x50
[  477.815281]  [<ffffffff81041419>] ? xen_clocksource_get_cycles+0x9/0x10
[  477.815281]  [<ffffffff810d3242>] ? getnstimeofday+0x52/0xe0
[  477.815281]  [<ffffffff8161dfb9>] system_call_fastpath+0x16/0x1b

> 
> 
> Annie Li (4):
>   xen/netback: implements persistent grant with one page pool.
>   xen/netback: Split one page pool into two(tx/rx) page pool.
>   Xen/netfront: Implement persistent grant in netfront.
>   fix code indent issue in xen-netfront.
> 
>  drivers/net/xen-netback/common.h    |   24 ++-
>  drivers/net/xen-netback/interface.c |   26 +++
>  drivers/net/xen-netback/netback.c   |  215 ++++++++++++++++++--
>  drivers/net/xen-netback/xenbus.c    |   14 ++-
>  drivers/net/xen-netfront.c          |  378 +++++++++++++++++++++++++++++------
>  5 files changed, 570 insertions(+), 87 deletions(-)
> 
> -- 
> 1.7.3.4

^ permalink raw reply

* Re: [Xen-devel] [PATCH 0/4] Implement persistent grant in xen-netfront/netback
From: Konrad Rzeszutek Wilk @ 2012-11-16 15:21 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: ANNIE LI, Pasi Kärkkäinen, netdev@vger.kernel.org,
	xen-devel@lists.xensource.com, Ian Campbell
In-Reply-To: <50A4CA51.8080208@citrix.com>

On Thu, Nov 15, 2012 at 11:56:17AM +0100, Roger Pau Monné wrote:
> On 15/11/12 09:38, ANNIE LI wrote:
> > 
> > 
> > On 2012-11-15 15:40, Pasi Kärkkäinen wrote:
> >> Hello,
> >>
> >> On Thu, Nov 15, 2012 at 03:03:07PM +0800, Annie Li wrote:
> >>> This patch implements persistent grants for xen-netfront/netback. This
> >>> mechanism maintains page pools in netback/netfront, these page pools is used to
> >>> save grant pages which are mapped. This way improve performance which is wasted
> >>> when doing grant operations.
> >>>
> >>> Current netback/netfront does map/unmap grant operations frequently when
> >>> transmitting/receiving packets, and grant operations costs much cpu clock. In
> >>> this patch, netfront/netback maps grant pages when needed and then saves them
> >>> into a page pool for future use. All these pages will be unmapped when
> >>> removing/releasing the net device.
> >>>
> >> Do you have performance numbers available already? with/without persistent grants?
> > I have some simple netperf/netserver test result with/without persistent 
> > grants,
> > 
> > Following is result of with persistent grant patch,
> > 
> > Guests, Sum,      Avg,     Min,     Max
> >   1,  15106.4,  15106.4, 15106.36, 15106.36
> >   2,  13052.7,  6526.34,  6261.81,  6790.86
> >   3,  12675.1,  6337.53,  6220.24,  6454.83
> >   4,  13194,  6596.98,  6274.70,  6919.25
> > 
> > 
> > Following are result of without persistent patch
> > 
> > Guests, Sum,     Avg,    Min,        Max
> >   1,  10864.1,  10864.1, 10864.10, 10864.10
> >   2,  10898.5,  5449.24,  4862.08,  6036.40
> >   3,  10734.5,  5367.26,  5261.43,  5473.08
> >   4,  10924,    5461.99,  5314.84,  5609.14
> 
> In the block case, performance improvement is seen when using a large
> number of guests, could you perform the same benchmark increasing the
> number of guests to 15?

Keep in mind that one of the things that is limiting these numbers is
that netback is very CPU intensive. So I think it could get much much
faster - but netback pegs at 100%. With Wei Liu's patches the CPU usage
did drop by 40% (this is when I tested the old netback with Wei's
netback patches)- so we should see even a further speed increase.

^ permalink raw reply

* Re: [Xen-devel] [PATCH 0/4] Implement persistent grant in xen-netfront/netback
From: Konrad Rzeszutek Wilk @ 2012-11-16 15:23 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Roger Pau Monne, ANNIE LI, Pasi Kärkkäinen,
	netdev@vger.kernel.org, xen-devel@lists.xensource.com
In-Reply-To: <1353006667.26243.6.camel@dagon.hellion.org.uk>

On Thu, Nov 15, 2012 at 07:11:07PM +0000, Ian Campbell wrote:
> On Thu, 2012-11-15 at 18:29 +0000, Konrad Rzeszutek Wilk wrote:
> > On Thu, Nov 15, 2012 at 11:15:06AM +0000, Ian Campbell wrote:
> > > On Thu, 2012-11-15 at 10:56 +0000, Roger Pau Monne wrote:
> > > > On 15/11/12 09:38, ANNIE LI wrote:
> > > > > 
> > > > > 
> > > > > On 2012-11-15 15:40, Pasi Kärkkäinen wrote:
> > > > >> Hello,
> > > > >>
> > > > >> On Thu, Nov 15, 2012 at 03:03:07PM +0800, Annie Li wrote:
> > > > >>> This patch implements persistent grants for xen-netfront/netback. This
> > > > >>> mechanism maintains page pools in netback/netfront, these page pools is used to
> > > > >>> save grant pages which are mapped. This way improve performance which is wasted
> > > > >>> when doing grant operations.
> > > > >>>
> > > > >>> Current netback/netfront does map/unmap grant operations frequently when
> > > > >>> transmitting/receiving packets, and grant operations costs much cpu clock. In
> > > > >>> this patch, netfront/netback maps grant pages when needed and then saves them
> > > > >>> into a page pool for future use. All these pages will be unmapped when
> > > > >>> removing/releasing the net device.
> > > > >>>
> > > > >> Do you have performance numbers available already? with/without persistent grants?
> > > > > I have some simple netperf/netserver test result with/without persistent 
> > > > > grants,
> > > > > 
> > > > > Following is result of with persistent grant patch,
> > > > > 
> > > > > Guests, Sum,      Avg,     Min,     Max
> > > > >   1,  15106.4,  15106.4, 15106.36, 15106.36
> > > > >   2,  13052.7,  6526.34,  6261.81,  6790.86
> > > > >   3,  12675.1,  6337.53,  6220.24,  6454.83
> > > > >   4,  13194,  6596.98,  6274.70,  6919.25
> > > > > 
> > > > > 
> > > > > Following are result of without persistent patch
> > > > > 
> > > > > Guests, Sum,     Avg,    Min,        Max
> > > > >   1,  10864.1,  10864.1, 10864.10, 10864.10
> > > > >   2,  10898.5,  5449.24,  4862.08,  6036.40
> > > > >   3,  10734.5,  5367.26,  5261.43,  5473.08
> > > > >   4,  10924,    5461.99,  5314.84,  5609.14
> > > > 
> > > > In the block case, performance improvement is seen when using a large
> > > > number of guests, could you perform the same benchmark increasing the
> > > > number of guests to 15?
> > > 
> > > It would also be nice to see some analysis of the numbers which justify
> > > why this change is a good one without every reviewer having to evaluate
> > > the raw data themselves. In fact this should really be part of the
> > > commit message.
> > 
> > You mean like a nice graph, eh?
> 
> Together with an analysis of what it means and why it is a good thing,
> yes.

OK, lets put that on the TODO list for the next posting. In the meantime -
it sounds like you (the maintainer) are happy with the direction this is going.

The other things we want to do _after_ these patches is to look at the Wei
Liu patches and try to address the different reviewers comments.

The neat thing about them is that they have a concept of a page pool system.
And with persistent pages in both blkback and netback this gets more exciting.


> 
> Ian.
> 
> > 
> > I will run these patches on my 32GB box and see if I can give you
> > a nice PDF/jpg.
> > 
> > > 
> > > Ian.
> > > 
> 

^ permalink raw reply

* Re: [PATCH] tcp: handle tcp_net_metrics_init() order-5 memory allocation failures
From: Eric Dumazet @ 2012-11-16 15:31 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, jln
In-Reply-To: <20121116.013940.813652515905883288.davem@davemloft.net>

On Fri, 2012-11-16 at 01:39 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 15 Nov 2012 15:41:04 -0800
> 
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > order-5 allocations can fail with current kernels, we should
> > try to reduce allocation sizes to allow network namespace
> > creation.
> > 
> > Reported-by: Julien Tinnes <jln@google.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> 
> Indeed, this has to be done better.
> 
> But this kind of retry solution results in non-deterministic behavior.
> Yes the tcp metrics cache is best effort, but it's size can influence
> behavior in a substantial way depending upon the workload.
> 
> I would suggest that we instead use different limits, ones which the
> page allocator will satisfy for us always with GFP_KERNEL.
> 
> 1) include linux/mmzone.h
> 
> 2) Make the two limits based upon PAGE_ALLOC_COSTLY_ORDER.
> 
> That is, make the larger table size PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER
> and the smaller one PAGE_SIZE << (PAGE_ALLOC_COSTLY_ORDER - 1).

Well, we dont really know what the size needs to be, and your proposal
reduces the size by a 4 factor, even for the initial namespace.

Julien report was about Chrome browser own netns, on a suspend/resume
cycle (or something like that)

If size can influence behavior, we could try a vmalloc() if kmalloc()
fails...

Thanks

[PATCH v3] tcp: handle tcp_net_metrics_init() order-5 memory allocation failures

order-5 allocations can fail with current kernels, we should
try vmalloc() as well.

Reported-by: Julien Tinnes <jln@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_metrics.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 53bc584..f696d7c 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -1,7 +1,6 @@
 #include <linux/rcupdate.h>
 #include <linux/spinlock.h>
 #include <linux/jiffies.h>
-#include <linux/bootmem.h>
 #include <linux/module.h>
 #include <linux/cache.h>
 #include <linux/slab.h>
@@ -9,6 +8,7 @@
 #include <linux/tcp.h>
 #include <linux/hash.h>
 #include <linux/tcp_metrics.h>
+#include <linux/vmalloc.h>
 
 #include <net/inet_connection_sock.h>
 #include <net/net_namespace.h>
@@ -1034,7 +1034,10 @@ static int __net_init tcp_net_metrics_init(struct net *net)
 	net->ipv4.tcp_metrics_hash_log = order_base_2(slots);
 	size = sizeof(struct tcpm_hash_bucket) << net->ipv4.tcp_metrics_hash_log;
 
-	net->ipv4.tcp_metrics_hash = kzalloc(size, GFP_KERNEL);
+	net->ipv4.tcp_metrics_hash = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+	if (!net->ipv4.tcp_metrics_hash)
+		net->ipv4.tcp_metrics_hash = vzalloc(size);
+
 	if (!net->ipv4.tcp_metrics_hash)
 		return -ENOMEM;
 
@@ -1055,7 +1058,10 @@ static void __net_exit tcp_net_metrics_exit(struct net *net)
 			tm = next;
 		}
 	}
-	kfree(net->ipv4.tcp_metrics_hash);
+	if (is_vmalloc_addr(net->ipv4.tcp_metrics_hash))
+		vfree(net->ipv4.tcp_metrics_hash);
+	else
+		kfree(net->ipv4.tcp_metrics_hash);
 }
 
 static __net_initdata struct pernet_operations tcp_net_metrics_ops = {

^ permalink raw reply related

* Re: [Xen-devel] [PATCH 0/4] Implement persistent grant in xen-netfront/netback
From: Konrad Rzeszutek Wilk @ 2012-11-16 15:34 UTC (permalink / raw)
  To: Wei Liu
  Cc: ANNIE LI, Pasi Kärkkäinen, netdev,
	xen-devel@lists.xensource.com, Ian Campbell
In-Reply-To: <CAOsiSVU+9fkGQhVhnrx=xxUD8hej55XXJGpKGWAxe1USPEhiEQ@mail.gmail.com>

On Thu, Nov 15, 2012 at 05:35:13PM +0800, Wei Liu wrote:
> On Thu, Nov 15, 2012 at 4:38 PM, ANNIE LI <annie.li@oracle.com> wrote:
> 
> >
> >
> > On 2012-11-15 15:40, Pasi Kärkkäinen wrote:
> >
> >> Hello,
> >>
> >> On Thu, Nov 15, 2012 at 03:03:07PM +0800, Annie Li wrote:
> >>
> >>> This patch implements persistent grants for xen-netfront/netback. This
> >>> mechanism maintains page pools in netback/netfront, these page pools is
> >>> used to
> >>> save grant pages which are mapped. This way improve performance which is
> >>> wasted
> >>> when doing grant operations.
> >>>
> >>> Current netback/netfront does map/unmap grant operations frequently when
> >>> transmitting/receiving packets, and grant operations costs much cpu
> >>> clock. In
> >>> this patch, netfront/netback maps grant pages when needed and then saves
> >>> them
> >>> into a page pool for future use. All these pages will be unmapped when
> >>> removing/releasing the net device.
> >>>
> >>>  Do you have performance numbers available already? with/without
> >> persistent grants?
> >>
> > I have some simple netperf/netserver test result with/without persistent
> > grants,
> >
> > Following is result of with persistent grant patch,
> >
> > Guests, Sum,      Avg,     Min,     Max
> >  1,  15106.4,  15106.4, 15106.36, 15106.36
> >  2,  13052.7,  6526.34,  6261.81,  6790.86
> >  3,  12675.1,  6337.53,  6220.24,  6454.83
> >  4,  13194,  6596.98,  6274.70,  6919.25
> >
> >
> > Following are result of without persistent patch
> >
> > Guests, Sum,     Avg,    Min,        Max
> >  1,  10864.1,  10864.1, 10864.10, 10864.10
> >  2,  10898.5,  5449.24,  4862.08,  6036.40
> >  3,  10734.5,  5367.26,  5261.43,  5473.08
> >  4,  10924,    5461.99,  5314.84,  5609.14
> >
> >
> >
> Interesting results. Have you tested how good it is on a 10G nic, i.e.
> guest sending packets
> through physical network to another host.

Not yet. This was done with two guests pounding each other. I am
setting two machines up for Annie so she can do that type of testing
and also with more guests.

^ permalink raw reply

* [PATCH] openvswitch: Make IPv6 packet parsing dependent on IPv6 config
From: Vlad Yasevich @ 2012-11-16 15:43 UTC (permalink / raw)
  To: netdev
In-Reply-To: <50a5c2e5.lgIvZwesNLp78CVD%fengguang.wu@intel.com>

Openvswitch attempts to use IPv6 packet parsing functions without
any dependency on IPv6 (unlike every other place in kernel).  Pull
the IPv6 code in openvswitch togeter and put a conditional that's
dependent on CONFIG_IPV6.

Resolves:
net/built-in.o: In function `ovs_flow_extract':
(.text+0xbf5d5): undefined reference to `ipv6_skip_exthdr'

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
---
 net/openvswitch/flow.c |  168 ++++++++++++++++++++++++-----------------------
 1 files changed, 86 insertions(+), 82 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 98c7063..6dfaf60 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -124,6 +124,7 @@ u64 ovs_flow_used_time(unsigned long flow_jiffies)
 	(offsetof(struct sw_flow_key, field) +	\
 	 FIELD_SIZEOF(struct sw_flow_key, field))
 
+#if IS_ENABLED(CONFIG_IPV6)
 static int parse_ipv6hdr(struct sk_buff *skb, struct sw_flow_key *key,
 			 int *key_lenp)
 {
@@ -175,6 +176,89 @@ static bool icmp6hdr_ok(struct sk_buff *skb)
 				  sizeof(struct icmp6hdr));
 }
 
+static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key,
+			int *key_lenp, int nh_len)
+{
+	struct icmp6hdr *icmp = icmp6_hdr(skb);
+	int error = 0;
+	int key_len;
+
+	/* The ICMPv6 type and code fields use the 16-bit transport port
+	 * fields, so we need to store them in 16-bit network byte order.
+	 */
+	key->ipv6.tp.src = htons(icmp->icmp6_type);
+	key->ipv6.tp.dst = htons(icmp->icmp6_code);
+	key_len = SW_FLOW_KEY_OFFSET(ipv6.tp);
+
+	if (icmp->icmp6_code == 0 &&
+	    (icmp->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION ||
+	     icmp->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT)) {
+		int icmp_len = skb->len - skb_transport_offset(skb);
+		struct nd_msg *nd;
+		int offset;
+
+		key_len = SW_FLOW_KEY_OFFSET(ipv6.nd);
+
+		/* In order to process neighbor discovery options, we need the
+		 * entire packet.
+		 */
+		if (unlikely(icmp_len < sizeof(*nd)))
+			goto out;
+		if (unlikely(skb_linearize(skb))) {
+			error = -ENOMEM;
+			goto out;
+		}
+
+		nd = (struct nd_msg *)skb_transport_header(skb);
+		key->ipv6.nd.target = nd->target;
+		key_len = SW_FLOW_KEY_OFFSET(ipv6.nd);
+
+		icmp_len -= sizeof(*nd);
+		offset = 0;
+		while (icmp_len >= 8) {
+			struct nd_opt_hdr *nd_opt =
+				 (struct nd_opt_hdr *)(nd->opt + offset);
+			int opt_len = nd_opt->nd_opt_len * 8;
+
+			if (unlikely(!opt_len || opt_len > icmp_len))
+				goto invalid;
+
+			/* Store the link layer address if the appropriate
+			 * option is provided.  It is considered an error if
+			 * the same link layer option is specified twice.
+			 */
+			if (nd_opt->nd_opt_type == ND_OPT_SOURCE_LL_ADDR
+			    && opt_len == 8) {
+				if (unlikely(!is_zero_ether_addr(key->ipv6.nd.sll)))
+					goto invalid;
+				memcpy(key->ipv6.nd.sll,
+				    &nd->opt[offset+sizeof(*nd_opt)], ETH_ALEN);
+			} else if (nd_opt->nd_opt_type == ND_OPT_TARGET_LL_ADDR
+				   && opt_len == 8) {
+				if (unlikely(!is_zero_ether_addr(key->ipv6.nd.tll)))
+					goto invalid;
+				memcpy(key->ipv6.nd.tll,
+				    &nd->opt[offset+sizeof(*nd_opt)], ETH_ALEN);
+			}
+
+			icmp_len -= opt_len;
+			offset += opt_len;
+		}
+	}
+
+	goto out;
+
+invalid:
+	memset(&key->ipv6.nd.target, 0, sizeof(key->ipv6.nd.target));
+	memset(key->ipv6.nd.sll, 0, sizeof(key->ipv6.nd.sll));
+	memset(key->ipv6.nd.tll, 0, sizeof(key->ipv6.nd.tll));
+
+out:
+	*key_lenp = key_len;
+	return error;
+}
+#endif
+
 #define TCP_FLAGS_OFFSET 13
 #define TCP_FLAG_MASK 0x3f
 
@@ -487,88 +571,6 @@ static __be16 parse_ethertype(struct sk_buff *skb)
 	return llc->ethertype;
 }
 
-static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key,
-			int *key_lenp, int nh_len)
-{
-	struct icmp6hdr *icmp = icmp6_hdr(skb);
-	int error = 0;
-	int key_len;
-
-	/* The ICMPv6 type and code fields use the 16-bit transport port
-	 * fields, so we need to store them in 16-bit network byte order.
-	 */
-	key->ipv6.tp.src = htons(icmp->icmp6_type);
-	key->ipv6.tp.dst = htons(icmp->icmp6_code);
-	key_len = SW_FLOW_KEY_OFFSET(ipv6.tp);
-
-	if (icmp->icmp6_code == 0 &&
-	    (icmp->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION ||
-	     icmp->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT)) {
-		int icmp_len = skb->len - skb_transport_offset(skb);
-		struct nd_msg *nd;
-		int offset;
-
-		key_len = SW_FLOW_KEY_OFFSET(ipv6.nd);
-
-		/* In order to process neighbor discovery options, we need the
-		 * entire packet.
-		 */
-		if (unlikely(icmp_len < sizeof(*nd)))
-			goto out;
-		if (unlikely(skb_linearize(skb))) {
-			error = -ENOMEM;
-			goto out;
-		}
-
-		nd = (struct nd_msg *)skb_transport_header(skb);
-		key->ipv6.nd.target = nd->target;
-		key_len = SW_FLOW_KEY_OFFSET(ipv6.nd);
-
-		icmp_len -= sizeof(*nd);
-		offset = 0;
-		while (icmp_len >= 8) {
-			struct nd_opt_hdr *nd_opt =
-				 (struct nd_opt_hdr *)(nd->opt + offset);
-			int opt_len = nd_opt->nd_opt_len * 8;
-
-			if (unlikely(!opt_len || opt_len > icmp_len))
-				goto invalid;
-
-			/* Store the link layer address if the appropriate
-			 * option is provided.  It is considered an error if
-			 * the same link layer option is specified twice.
-			 */
-			if (nd_opt->nd_opt_type == ND_OPT_SOURCE_LL_ADDR
-			    && opt_len == 8) {
-				if (unlikely(!is_zero_ether_addr(key->ipv6.nd.sll)))
-					goto invalid;
-				memcpy(key->ipv6.nd.sll,
-				    &nd->opt[offset+sizeof(*nd_opt)], ETH_ALEN);
-			} else if (nd_opt->nd_opt_type == ND_OPT_TARGET_LL_ADDR
-				   && opt_len == 8) {
-				if (unlikely(!is_zero_ether_addr(key->ipv6.nd.tll)))
-					goto invalid;
-				memcpy(key->ipv6.nd.tll,
-				    &nd->opt[offset+sizeof(*nd_opt)], ETH_ALEN);
-			}
-
-			icmp_len -= opt_len;
-			offset += opt_len;
-		}
-	}
-
-	goto out;
-
-invalid:
-	memset(&key->ipv6.nd.target, 0, sizeof(key->ipv6.nd.target));
-	memset(key->ipv6.nd.sll, 0, sizeof(key->ipv6.nd.sll));
-	memset(key->ipv6.nd.tll, 0, sizeof(key->ipv6.nd.tll));
-
-out:
-	*key_lenp = key_len;
-	return error;
-}
-
 /**
  * ovs_flow_extract - extracts a flow key from an Ethernet frame.
  * @skb: sk_buff that contains the frame, with skb->data pointing to the
@@ -712,6 +714,7 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key,
 				key_len = SW_FLOW_KEY_OFFSET(ipv4.arp);
 			}
 		}
+#if IS_ENABLED(CONFIG_IPV6)
 	} else if (key->eth.type == htons(ETH_P_IPV6)) {
 		int nh_len;             /* IPv6 Header + Extensions */
 
@@ -752,6 +755,7 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key,
 					goto out;
 			}
 		}
+#endif
 	}
 
 out:
-- 
1.7.7.6

^ permalink raw reply related

* [PATCH net-next 2/4] sit: allow to configure 6rd tunnels via netlink
From: Nicolas Dichtel @ 2012-11-16 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem, Nicolas Dichtel
In-Reply-To: <1353082456-21234-1-git-send-email-nicolas.dichtel@6wind.com>

This patch add the support of 6RD tunnels management via netlink.
Note that netdev_state_change() is now called when 6RD parameters are updated.

6RD parameters are updated only if there is at least one 6RD attribute.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/uapi/linux/if_tunnel.h |   4 ++
 net/ipv6/sit.c                 | 149 ++++++++++++++++++++++++++++++++++-------
 2 files changed, 128 insertions(+), 25 deletions(-)

diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
index 5ab0c8d..aee73d0 100644
--- a/include/uapi/linux/if_tunnel.h
+++ b/include/uapi/linux/if_tunnel.h
@@ -49,6 +49,10 @@ enum {
 	IFLA_IPTUN_FLAGS,
 	IFLA_IPTUN_PROTO,
 	IFLA_IPTUN_PMTUDISC,
+	IFLA_IPTUN_6RD_PREFIX,
+	IFLA_IPTUN_6RD_RELAY_PREFIX,
+	IFLA_IPTUN_6RD_PREFIXLEN,
+	IFLA_IPTUN_6RD_RELAY_PREFIXLEN,
 	__IFLA_IPTUN_MAX,
 };
 #define IFLA_IPTUN_MAX	(__IFLA_IPTUN_MAX - 1)
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index ca6c2c8..504422d 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -936,6 +936,38 @@ static void ipip6_tunnel_update(struct ip_tunnel *t, struct ip_tunnel_parm *p)
 	netdev_state_change(t->dev);
 }
 
+#ifdef CONFIG_IPV6_SIT_6RD
+static int ipip6_tunnel_update_6rd(struct ip_tunnel *t,
+				   struct ip_tunnel_6rd *ip6rd)
+{
+	struct in6_addr prefix;
+	__be32 relay_prefix;
+
+	if (ip6rd->relay_prefixlen > 32 ||
+	    ip6rd->prefixlen + (32 - ip6rd->relay_prefixlen) > 64)
+		return -EINVAL;
+
+	ipv6_addr_prefix(&prefix, &ip6rd->prefix, ip6rd->prefixlen);
+	if (!ipv6_addr_equal(&prefix, &ip6rd->prefix))
+		return -EINVAL;
+	if (ip6rd->relay_prefixlen)
+		relay_prefix = ip6rd->relay_prefix &
+			       htonl(0xffffffffUL <<
+				     (32 - ip6rd->relay_prefixlen));
+	else
+		relay_prefix = 0;
+	if (relay_prefix != ip6rd->relay_prefix)
+		return -EINVAL;
+
+	t->ip6rd.prefix = prefix;
+	t->ip6rd.relay_prefix = relay_prefix;
+	t->ip6rd.prefixlen = ip6rd->prefixlen;
+	t->ip6rd.relay_prefixlen = ip6rd->relay_prefixlen;
+	netdev_state_change(t->dev);
+	return 0;
+}
+#endif
+
 static int
 ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 {
@@ -1105,31 +1137,9 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 		t = netdev_priv(dev);
 
 		if (cmd != SIOCDEL6RD) {
-			struct in6_addr prefix;
-			__be32 relay_prefix;
-
-			err = -EINVAL;
-			if (ip6rd.relay_prefixlen > 32 ||
-			    ip6rd.prefixlen + (32 - ip6rd.relay_prefixlen) > 64)
-				goto done;
-
-			ipv6_addr_prefix(&prefix, &ip6rd.prefix,
-					 ip6rd.prefixlen);
-			if (!ipv6_addr_equal(&prefix, &ip6rd.prefix))
+			err = ipip6_tunnel_update_6rd(t, &ip6rd);
+			if (err < 0)
 				goto done;
-			if (ip6rd.relay_prefixlen)
-				relay_prefix = ip6rd.relay_prefix &
-					       htonl(0xffffffffUL <<
-						     (32 - ip6rd.relay_prefixlen));
-			else
-				relay_prefix = 0;
-			if (relay_prefix != ip6rd.relay_prefix)
-				goto done;
-
-			t->ip6rd.prefix = prefix;
-			t->ip6rd.relay_prefix = relay_prefix;
-			t->ip6rd.prefixlen = ip6rd.prefixlen;
-			t->ip6rd.relay_prefixlen = ip6rd.relay_prefixlen;
 		} else
 			ipip6_tunnel_clone_6rd(dev, sitn);
 
@@ -1261,11 +1271,53 @@ static void ipip6_netlink_parms(struct nlattr *data[],
 		parms->i_flags = nla_get_be16(data[IFLA_IPTUN_FLAGS]);
 }
 
+#ifdef CONFIG_IPV6_SIT_6RD
+/* This function returns true when 6RD attributes are present in the nl msg */
+static bool ipip6_netlink_6rd_parms(struct nlattr *data[],
+				    struct ip_tunnel_6rd *ip6rd)
+{
+	bool ret = false;
+	memset(ip6rd, 0, sizeof(*ip6rd));
+
+	if (!data)
+		return ret;
+
+	if (data[IFLA_IPTUN_6RD_PREFIX]) {
+		ret = true;
+		nla_memcpy(&ip6rd->prefix, data[IFLA_IPTUN_6RD_PREFIX],
+			   sizeof(struct in6_addr));
+	}
+
+	if (data[IFLA_IPTUN_6RD_RELAY_PREFIX]) {
+		ret = true;
+		ip6rd->relay_prefix =
+			nla_get_be32(data[IFLA_IPTUN_6RD_RELAY_PREFIX]);
+	}
+
+	if (data[IFLA_IPTUN_6RD_PREFIXLEN]) {
+		ret = true;
+		ip6rd->prefixlen = nla_get_u16(data[IFLA_IPTUN_6RD_PREFIXLEN]);
+	}
+
+	if (data[IFLA_IPTUN_6RD_RELAY_PREFIXLEN]) {
+		ret = true;
+		ip6rd->relay_prefixlen =
+			nla_get_u16(data[IFLA_IPTUN_6RD_RELAY_PREFIXLEN]);
+	}
+
+	return ret;
+}
+#endif
+
 static int ipip6_newlink(struct net *src_net, struct net_device *dev,
 			 struct nlattr *tb[], struct nlattr *data[])
 {
 	struct net *net = dev_net(dev);
 	struct ip_tunnel *nt;
+#ifdef CONFIG_IPV6_SIT_6RD
+	struct ip_tunnel_6rd ip6rd;
+#endif
+	int err;
 
 	nt = netdev_priv(dev);
 	ipip6_netlink_parms(data, &nt->parms);
@@ -1273,7 +1325,16 @@ static int ipip6_newlink(struct net *src_net, struct net_device *dev,
 	if (ipip6_tunnel_locate(net, &nt->parms, 0))
 		return -EEXIST;
 
-	return ipip6_tunnel_create(dev);
+	err = ipip6_tunnel_create(dev);
+	if (err < 0)
+		return err;
+
+#ifdef CONFIG_IPV6_SIT_6RD
+	if (ipip6_netlink_6rd_parms(data, &ip6rd))
+		err = ipip6_tunnel_update_6rd(nt, &ip6rd);
+#endif
+
+	return err;
 }
 
 static int ipip6_changelink(struct net_device *dev, struct nlattr *tb[],
@@ -1283,6 +1344,9 @@ static int ipip6_changelink(struct net_device *dev, struct nlattr *tb[],
 	struct ip_tunnel_parm p;
 	struct net *net = dev_net(dev);
 	struct sit_net *sitn = net_generic(net, sit_net_id);
+#ifdef CONFIG_IPV6_SIT_6RD
+	struct ip_tunnel_6rd ip6rd;
+#endif
 
 	if (dev == sitn->fb_tunnel_dev)
 		return -EINVAL;
@@ -1302,6 +1366,12 @@ static int ipip6_changelink(struct net_device *dev, struct nlattr *tb[],
 		t = netdev_priv(dev);
 
 	ipip6_tunnel_update(t, &p);
+
+#ifdef CONFIG_IPV6_SIT_6RD
+	if (ipip6_netlink_6rd_parms(data, &ip6rd))
+		return ipip6_tunnel_update_6rd(t, &ip6rd);
+#endif
+
 	return 0;
 }
 
@@ -1322,6 +1392,16 @@ static size_t ipip6_get_size(const struct net_device *dev)
 		nla_total_size(1) +
 		/* IFLA_IPTUN_FLAGS */
 		nla_total_size(2) +
+#ifdef CONFIG_IPV6_SIT_6RD
+		/* IFLA_IPTUN_6RD_PREFIX */
+		nla_total_size(sizeof(struct in6_addr)) +
+		/* IFLA_IPTUN_6RD_RELAY_PREFIX */
+		nla_total_size(4) +
+		/* IFLA_IPTUN_6RD_PREFIXLEN */
+		nla_total_size(2) +
+		/* IFLA_IPTUN_6RD_RELAY_PREFIXLEN */
+		nla_total_size(2) +
+#endif
 		0;
 }
 
@@ -1339,6 +1419,19 @@ static int ipip6_fill_info(struct sk_buff *skb, const struct net_device *dev)
 		       !!(parm->iph.frag_off & htons(IP_DF))) ||
 	    nla_put_be16(skb, IFLA_IPTUN_FLAGS, parm->i_flags))
 		goto nla_put_failure;
+
+#ifdef CONFIG_IPV6_SIT_6RD
+	if (nla_put(skb, IFLA_IPTUN_6RD_PREFIX, sizeof(struct in6_addr),
+		    &tunnel->ip6rd.prefix) ||
+	    nla_put_be32(skb, IFLA_IPTUN_6RD_RELAY_PREFIX,
+			 tunnel->ip6rd.relay_prefix) ||
+	    nla_put_u16(skb, IFLA_IPTUN_6RD_PREFIXLEN,
+			tunnel->ip6rd.prefixlen) ||
+	    nla_put_u16(skb, IFLA_IPTUN_6RD_RELAY_PREFIXLEN,
+			tunnel->ip6rd.relay_prefixlen))
+		goto nla_put_failure;
+#endif
+
 	return 0;
 
 nla_put_failure:
@@ -1353,6 +1446,12 @@ static const struct nla_policy ipip6_policy[IFLA_IPTUN_MAX + 1] = {
 	[IFLA_IPTUN_TOS]		= { .type = NLA_U8 },
 	[IFLA_IPTUN_PMTUDISC]		= { .type = NLA_U8 },
 	[IFLA_IPTUN_FLAGS]		= { .type = NLA_U16 },
+#ifdef CONFIG_IPV6_SIT_6RD
+	[IFLA_IPTUN_6RD_PREFIX]		= { .len = sizeof(struct in6_addr) },
+	[IFLA_IPTUN_6RD_RELAY_PREFIX]	= { .type = NLA_U32 },
+	[IFLA_IPTUN_6RD_PREFIXLEN]	= { .type = NLA_U16 },
+	[IFLA_IPTUN_6RD_RELAY_PREFIXLEN] = { .type = NLA_U16 },
+#endif
 };
 
 static struct rtnl_link_ops sit_link_ops __read_mostly = {
-- 
1.7.12

^ permalink raw reply related

* [PATCH net-next 3/4] sit: allow to deactivate the creation of fb device
From: Nicolas Dichtel @ 2012-11-16 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem, Nicolas Dichtel
In-Reply-To: <1353082456-21234-1-git-send-email-nicolas.dichtel@6wind.com>

Now that tunnels can be configured via rtnetlink, this device is not mandatory.
The default is conservative.

The fb device was also used by 6RD as a template for default 6RD parameters.
User was able to update this template for each netns, hence when the fb device
is not created, this option does not exist. However, user can now set 6RD
parameters in the same netlink message that create the tunnel (before two
ioctl were needed) thus user can directly set the right value.

Last point is about ISATAP. The potential routers list (prl) management has not
been converted to netlink, but the ioctl is anyway performed on the related
device directly and not on the fb device.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv6/sit.c | 62 +++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 40 insertions(+), 22 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 504422d..d789a55 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -65,6 +65,15 @@
 #define HASH_SIZE  16
 #define HASH(addr) (((__force u32)addr^((__force u32)addr>>4))&0xF)
 
+static bool setup_fb = true;
+module_param(setup_fb, bool, 0644);
+MODULE_PARM_DESC(setup_fb,
+		 "Setup the fb device to configure tunnel via IOCTL");
+
+#ifdef CONFIG_IPV6_SIT_6RD
+static struct ip_tunnel_6rd_parm ip6rd_template;
+#endif
+
 static int ipip6_tunnel_init(struct net_device *dev);
 static void ipip6_tunnel_setup(struct net_device *dev);
 static void ipip6_dev_free(struct net_device *dev);
@@ -204,12 +213,9 @@ static void ipip6_tunnel_clone_6rd(struct net_device *dev, struct sit_net *sitn)
 #ifdef CONFIG_IPV6_SIT_6RD
 	struct ip_tunnel *t = netdev_priv(dev);
 
-	if (t->dev == sitn->fb_tunnel_dev) {
-		ipv6_addr_set(&t->ip6rd.prefix, htonl(0x20020000), 0, 0, 0);
-		t->ip6rd.relay_prefix = 0;
-		t->ip6rd.prefixlen = 16;
-		t->ip6rd.relay_prefixlen = 0;
-	} else {
+	if (t->dev == sitn->fb_tunnel_dev || setup_fb == false)
+		memcpy(&t->ip6rd, &ip6rd_template, sizeof(t->ip6rd));
+	else {
 		struct ip_tunnel *t0 = netdev_priv(sitn->fb_tunnel_dev);
 		memcpy(&t->ip6rd, &t0->ip6rd, sizeof(t->ip6rd));
 	}
@@ -1501,26 +1507,30 @@ static int __net_init sit_init_net(struct net *net)
 	sitn->tunnels[2] = sitn->tunnels_r;
 	sitn->tunnels[3] = sitn->tunnels_r_l;
 
-	sitn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel), "sit0",
-					   ipip6_tunnel_setup);
-	if (!sitn->fb_tunnel_dev) {
-		err = -ENOMEM;
-		goto err_alloc_dev;
-	}
-	dev_net_set(sitn->fb_tunnel_dev, net);
+	if (setup_fb) {
+		sitn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel),
+						   "sit0", ipip6_tunnel_setup);
+		if (!sitn->fb_tunnel_dev) {
+			err = -ENOMEM;
+			goto err_alloc_dev;
+		}
+		dev_net_set(sitn->fb_tunnel_dev, net);
 
-	err = ipip6_fb_tunnel_init(sitn->fb_tunnel_dev);
-	if (err)
-		goto err_dev_free;
+		err = ipip6_fb_tunnel_init(sitn->fb_tunnel_dev);
+		if (err)
+			goto err_dev_free;
 
-	ipip6_tunnel_clone_6rd(sitn->fb_tunnel_dev, sitn);
+		ipip6_tunnel_clone_6rd(sitn->fb_tunnel_dev, sitn);
 
-	if ((err = register_netdev(sitn->fb_tunnel_dev)))
-		goto err_reg_dev;
+		err = register_netdev(sitn->fb_tunnel_dev);
+		if (err)
+			goto err_reg_dev;
 
-	t = netdev_priv(sitn->fb_tunnel_dev);
+		t = netdev_priv(sitn->fb_tunnel_dev);
 
-	strcpy(t->parms.name, sitn->fb_tunnel_dev->name);
+		strcpy(t->parms.name, sitn->fb_tunnel_dev->name);
+	} else
+		sitn->fb_tunnel_dev = NULL;
 	return 0;
 
 err_reg_dev:
@@ -1538,7 +1548,8 @@ static void __net_exit sit_exit_net(struct net *net)
 
 	rtnl_lock();
 	sit_destroy_tunnels(sitn, &list);
-	unregister_netdevice_queue(sitn->fb_tunnel_dev, &list);
+	if (setup_fb)
+		unregister_netdevice_queue(sitn->fb_tunnel_dev, &list);
 	unregister_netdevice_many(&list);
 	rtnl_unlock();
 }
@@ -1565,6 +1576,13 @@ static int __init sit_init(void)
 
 	pr_info("IPv6 over IPv4 tunneling driver\n");
 
+#ifdef CONFIG_IPV6_SIT_6RD
+	ipv6_addr_set(&ip6rd_template.prefix, htonl(0x20020000), 0, 0, 0);
+	ip6rd_template.relay_prefix = 0;
+	ip6rd_template.prefixlen = 16;
+	ip6rd_template.relay_prefixlen = 0;
+#endif
+
 	err = register_pernet_device(&sit_net_ops);
 	if (err < 0)
 		return err;
-- 
1.7.12

^ permalink raw reply related

* [PATCH net-next 4/4] ip6tnl: allow to deactivate the creation of fb dev
From: Nicolas Dichtel @ 2012-11-16 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem, Nicolas Dichtel
In-Reply-To: <1353082456-21234-1-git-send-email-nicolas.dichtel@6wind.com>

Now that tunnels can be configured via rtnetlink, this device is not mandatory.
The default is conservative.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv6/ip6_tunnel.c | 43 +++++++++++++++++++++++++++----------------
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index bf3a549..fe2028e 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -62,6 +62,11 @@ MODULE_DESCRIPTION("IPv6 tunneling device");
 MODULE_LICENSE("GPL");
 MODULE_ALIAS_NETDEV("ip6tnl0");
 
+static bool setup_fb = true;
+module_param(setup_fb, bool, 0644);
+MODULE_PARM_DESC(setup_fb,
+		 "Setup the fb device to configure tunnel via IOCTL");
+
 #ifdef IP6_TNL_DEBUG
 #define IP6_TNL_TRACE(x...) pr_debug("%s:" x "\n", __func__)
 #else
@@ -1711,8 +1716,10 @@ static void __net_exit ip6_tnl_destroy_tunnels(struct ip6_tnl_net *ip6n)
 	}
 
 	t = rtnl_dereference(ip6n->tnls_wc[0]);
-	unregister_netdevice_queue(t->dev, &list);
-	unregister_netdevice_many(&list);
+	if (t) {
+		unregister_netdevice_queue(t->dev, &list);
+		unregister_netdevice_many(&list);
+	}
 }
 
 static int __net_init ip6_tnl_init_net(struct net *net)
@@ -1724,25 +1731,29 @@ static int __net_init ip6_tnl_init_net(struct net *net)
 	ip6n->tnls[0] = ip6n->tnls_wc;
 	ip6n->tnls[1] = ip6n->tnls_r_l;
 
-	err = -ENOMEM;
-	ip6n->fb_tnl_dev = alloc_netdev(sizeof(struct ip6_tnl), "ip6tnl0",
-				      ip6_tnl_dev_setup);
+	if (setup_fb) {
+		err = -ENOMEM;
+		ip6n->fb_tnl_dev = alloc_netdev(sizeof(struct ip6_tnl),
+						"ip6tnl0", ip6_tnl_dev_setup);
 
-	if (!ip6n->fb_tnl_dev)
-		goto err_alloc_dev;
-	dev_net_set(ip6n->fb_tnl_dev, net);
+		if (!ip6n->fb_tnl_dev)
+			goto err_alloc_dev;
+		dev_net_set(ip6n->fb_tnl_dev, net);
 
-	err = ip6_fb_tnl_dev_init(ip6n->fb_tnl_dev);
-	if (err < 0)
-		goto err_register;
+		err = ip6_fb_tnl_dev_init(ip6n->fb_tnl_dev);
+		if (err < 0)
+			goto err_register;
 
-	err = register_netdev(ip6n->fb_tnl_dev);
-	if (err < 0)
-		goto err_register;
+		err = register_netdev(ip6n->fb_tnl_dev);
+		if (err < 0)
+			goto err_register;
+
+		t = netdev_priv(ip6n->fb_tnl_dev);
 
-	t = netdev_priv(ip6n->fb_tnl_dev);
+		strcpy(t->parms.name, ip6n->fb_tnl_dev->name);
+	} else
+		ip6n->fb_tnl_dev = NULL;
 
-	strcpy(t->parms.name, ip6n->fb_tnl_dev->name);
 	return 0;
 
 err_register:
-- 
1.7.12

^ permalink raw reply related

* [PATCH net-next 1/4] ipip: allow to deactivate the creation of fb dev
From: Nicolas Dichtel @ 2012-11-16 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem, Nicolas Dichtel
In-Reply-To: <1353082456-21234-1-git-send-email-nicolas.dichtel@6wind.com>

Now that tunnels can be configured via rtnetlink, this device is not mandatory.
The default is conservative.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv4/ipip.c | 42 ++++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index c26c171..9e11633 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -124,6 +124,11 @@ static bool log_ecn_error = true;
 module_param(log_ecn_error, bool, 0644);
 MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN");
 
+static bool setup_fb = true;
+module_param(setup_fb, bool, 0644);
+MODULE_PARM_DESC(setup_fb,
+		 "Setup the fb device to configure tunnel via IOCTL");
+
 static int ipip_net_id __read_mostly;
 struct ipip_net {
 	struct ip_tunnel __rcu *tunnels_r_l[HASH_SIZE];
@@ -1022,25 +1027,29 @@ static int __net_init ipip_init_net(struct net *net)
 	ipn->tunnels[2] = ipn->tunnels_r;
 	ipn->tunnels[3] = ipn->tunnels_r_l;
 
-	ipn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel),
-					   "tunl0",
-					   ipip_tunnel_setup);
-	if (!ipn->fb_tunnel_dev) {
-		err = -ENOMEM;
-		goto err_alloc_dev;
-	}
-	dev_net_set(ipn->fb_tunnel_dev, net);
+	if (setup_fb) {
+		ipn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel),
+						  "tunl0",
+						  ipip_tunnel_setup);
+		if (!ipn->fb_tunnel_dev) {
+			err = -ENOMEM;
+			goto err_alloc_dev;
+		}
+		dev_net_set(ipn->fb_tunnel_dev, net);
 
-	err = ipip_fb_tunnel_init(ipn->fb_tunnel_dev);
-	if (err)
-		goto err_reg_dev;
+		err = ipip_fb_tunnel_init(ipn->fb_tunnel_dev);
+		if (err)
+			goto err_reg_dev;
 
-	if ((err = register_netdev(ipn->fb_tunnel_dev)))
-		goto err_reg_dev;
+		err = register_netdev(ipn->fb_tunnel_dev);
+		if (err)
+			goto err_reg_dev;
 
-	t = netdev_priv(ipn->fb_tunnel_dev);
+		t = netdev_priv(ipn->fb_tunnel_dev);
 
-	strcpy(t->parms.name, ipn->fb_tunnel_dev->name);
+		strcpy(t->parms.name, ipn->fb_tunnel_dev->name);
+	} else
+		ipn->fb_tunnel_dev = NULL;
 	return 0;
 
 err_reg_dev:
@@ -1057,7 +1066,8 @@ static void __net_exit ipip_exit_net(struct net *net)
 
 	rtnl_lock();
 	ipip_destroy_tunnels(ipn, &list);
-	unregister_netdevice_queue(ipn->fb_tunnel_dev, &list);
+	if (setup_fb)
+		unregister_netdevice_queue(ipn->fb_tunnel_dev, &list);
 	unregister_netdevice_many(&list);
 	rtnl_unlock();
 }
-- 
1.7.12

^ permalink raw reply related

* [PATCH net-next 0/4] Allow to deactivate fb tunnels device
From: Nicolas Dichtel @ 2012-11-16 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem

This serie proposes a module option to avoid the creation of the fb tunnels
device for sit/ipip/ip6tnl.
It also add netlink management for 6RD tunnels. The last info that still
requires ioctl is the management of the potential routers list (prl) for isatap
(note that this ioctl is not performed on the fb device).

As usual, the patch against iproute2 will be sent once the patches are included and
net-next merged. I can send it on demand.

 include/uapi/linux/if_tunnel.h |   4 +
 net/ipv4/ipip.c                |  42 ++++----
 net/ipv6/ip6_tunnel.c          |  43 +++++----
 net/ipv6/sit.c                 | 211 ++++++++++++++++++++++++++++++++---------
 4 files changed, 221 insertions(+), 79 deletions(-)

Comments are welcome.

Regards,
Nicolas

^ permalink raw reply

* Re: [PATCH iproute2 3/3] ip/ip6tunnel: fix update of tclass and flowlabel
From: Stephen Hemminger @ 2012-11-16 16:16 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev
In-Reply-To: <1352906966-12932-3-git-send-email-nicolas.dichtel@6wind.com>

On Wed, 14 Nov 2012 16:29:26 +0100
Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:

> When tclass or flowlabel field were updated, we only performed an OR with the
> new value. For example, it was not possible to reset tclass:
>   ip -6 tunnel change ip6tnl2 tclass 0
> 
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> ---
>  ip/ip6tunnel.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/ip/ip6tunnel.c b/ip/ip6tunnel.c
> index 7aaac61..fcc9f33 100644
> --- a/ip/ip6tunnel.c
> +++ b/ip/ip6tunnel.c
> @@ -173,6 +173,7 @@ static int parse_args(int argc, char **argv, int cmd, struct ip6_tnl_parm *p)
>  			   matches(*argv, "dsfield") == 0) {
>  			__u8 uval;
>  			NEXT_ARG();
> +			p->flowinfo &= ~IP6_FLOWINFO_TCLASS;
>  			if (strcmp(*argv, "inherit") == 0)
>  				p->flags |= IP6_TNL_F_USE_ORIG_TCLASS;
>  			else {
> @@ -185,6 +186,7 @@ static int parse_args(int argc, char **argv, int cmd, struct ip6_tnl_parm *p)
>  			   strcmp(*argv, "fl") == 0) {
>  			__u32 uval;
>  			NEXT_ARG();
> +			p->flowinfo &= ~IP6_FLOWINFO_FLOWLABEL;
>  			if (strcmp(*argv, "inherit") == 0)
>  				p->flags |= IP6_TNL_F_USE_ORIG_FLOWLABEL;
>  			else {

All applied thanks.

^ permalink raw reply

* Re: [PATCH net-next 1/4] ipip: allow to deactivate the creation of fb dev
From: Stephen Hemminger @ 2012-11-16 16:29 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev, davem
In-Reply-To: <1353082456-21234-2-git-send-email-nicolas.dichtel@6wind.com>

On Fri, 16 Nov 2012 17:14:13 +0100
Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:

> Now that tunnels can be configured via rtnetlink, this device is not mandatory.
> The default is conservative.
> 
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>


Although I am in favor of reducing clutter, and we even have to put in special case
code to ignore these stub devices in the Vyatta scripts. Module parameters are bit of a nuisance to deal with, but maybe
the only way for this kind of thing and keep the required ABI.

Not sure if I can fully endorse this. The device may still have uses.
It is still useful for capturing "none of the above" packets
and is used to auto-load module via module aliases.

^ permalink raw reply

* Re: [PATCH v2 net-next] sctp: Add support to per-association statistics via a new SCTP_GET_ASSOC_STATS call
From: Neil Horman @ 2012-11-16 16:39 UTC (permalink / raw)
  To: Michele Baldessari
  Cc: linux-sctp, Thomas Graf, Vlad Yasevich, netdev, David S. Miller
In-Reply-To: <1352991680-12289-1-git-send-email-michele@acksyn.org>

On Thu, Nov 15, 2012 at 04:01:20PM +0100, Michele Baldessari wrote:
> The current SCTP stack is lacking a mechanism to have per association
> statistics. This is an implementation modeled after OpenSolaris'
> SCTP_GET_ASSOC_STATS.
> 
> Userspace part will follow on lksctp if/when there is a general ACK on
> this.
> 
> V2)
>   - Implement partial retrieval of stat struct to cope for future expansion
>   - Kill the rtxpackets counter as it cannot be precise anyway
>   - Rename outseqtsns to outofseqtsns to make it clearer that these are out
>     of sequence unexpected TSNs
>   - Move asoc->ipackets++ under a lock to avoid potential miscounts
>   - Fold asoc->opackets++ into the already existing asoc check
>   - Kill unneeded (q->asoc) test when increasing rtxchunks
>   - Do not count octrlchunks if sending failed (SCTP_XMIT_OK != 0)
>   - Don't count SHUTDOWNs as SACKs
>   - Move SCTP_GET_ASSOC_STATS to the private space API
>   - Adjust the len check in sctp_getsockopt_assoc_stats() to allow for
>     future struct growth
>   - Move association statistics in their own struct
>   - Update idupchunks when we send a SACK with dup TSNs
>   - return min_rto in max_rto when RTO has not changed. Also return the
>     transport when max_rto last changed.
> 
> Signed-off: Michele Baldessari <michele@acksyn.org>
> Acked-by: Thomas Graf <tgraf@suug.ch>

Yes, I think this is good, I still don't like the idea of having to do these via
an ioctl, but I suppose it fits well enough.
Neil

Acked-by: Neil Horman <nhorman@tuxdriver.com>

^ permalink raw reply

* Re: [PATCH net-next 1/4] ipip: allow to deactivate the creation of fb dev
From: Nicolas Dichtel @ 2012-11-16 16:46 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, davem
In-Reply-To: <20121116082926.1c6cccd2@nehalam.linuxnetplumber.net>

Le 16/11/2012 17:29, Stephen Hemminger a écrit :
> On Fri, 16 Nov 2012 17:14:13 +0100
> Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:
>
>> Now that tunnels can be configured via rtnetlink, this device is not mandatory.
>> The default is conservative.
>>
>> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>
>
> Although I am in favor of reducing clutter, and we even have to put in special case
> code to ignore these stub devices in the Vyatta scripts. Module parameters are bit of a nuisance to deal with, but maybe
> the only way for this kind of thing and keep the required ABI.
>
> Not sure if I can fully endorse this. The device may still have uses.
> It is still useful for capturing "none of the above" packets
If you need to capture these packets, you can still create a tunnel with local 
any and remote any, even if the fb_device has not been created.

> and is used to auto-load module via module aliases.
Right, but if user uses netlink, the problem exists without these patches too.

By default, the fb device is created, so there is no change if you don't set 
explicitly setup_fb to 0.

^ permalink raw reply

* [RFC PATCH] tcp: introduce raw access to experimental options
From: elelueck @ 2012-11-16 16:54 UTC (permalink / raw)
  To: netdev; +Cc: frankbla, raspl, ubacher, samudrala, Einar Lueck, davem

From: Einar Lueck <elelueck@linux.vnet.ibm.com>

This patch adds means for raw acces to TCP expirimental options
253 and 254. The intention of this is to enable user space
applications to implement communication behaviour that depends
on experimental options. For that, new (set|get)sockopts are
introduced:

TCP_EXPOPTS (get & set): TCP experimental options to be added to
			 packets
TCP_RECV_EXPOPTS (get):  experimental options received with last
			 packet
TCP_RECV_SYN_EXPOPTS (get): experimental options received with
			 SYN packet

TCP experimental options 253 and 254 configured via TCP_EXPOPTS on
any TCP socket are appended to every packet that is sent as long
as there is enough room left. If there is not enough room left they
are silently dropped.

Listening sockets reply to SYN packets with SYN ACK packets containing
TCP experimental options 253 and 254 as configured via TCP_EXPOPTS, too.
If a TCP connection gets established the configured experimental options
are the defaults for the new socket, too. Thus, a getsockopt on the
resulting accept socket for TCP_EXPOPTS returns the same stuff configured
on the listening socket.

As mentioned above, even after the 3whs is complete, experimental options
are sent with every packet. To enable user space applications to distinguish
between what has been advertized via SYN and what has been received with the
last packet the aforementioned TCP_RECV_SYN_EXPOPTS and TCP_RECV_EXPOPTS are
introduced.

Today, experimental option 253 (COOKIE) and 254 (FASTOPEN) are already
exploited. For co-existence the following approach has been taken:

General remarks:
* Interface to COOKIE and FASTOPEN stays the same
Sender side:
1. COOKIE and FASTPATH code adds own options first (if applicable)
2. Finally, if enough room is left, TCP_EXPOPTS experimental options are
   appended
Receiver side:
1. ALL 253 and 254 experimental options are made available via
   TCP_RECV(_SYN)_EXPOPTS
2. COOKIE and FASTOPEN code check if there is any option relevant for them

References:
http://tools.ietf.org/html/draft-ietf-tcpm-experimental-options-02

Signed-off-by: Einar Lueck <elelueck@linux.vnet.ibm.com>
---
 include/linux/tcp.h      |  25 ++++++++++
 include/net/tcp.h        |   3 ++
 net/ipv4/tcp.c           | 110 +++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp_input.c     | 119 +++++++++++++++++++++++++++++++----------------
 net/ipv4/tcp_ipv4.c      |  14 ++++++
 net/ipv4/tcp_minisocks.c |  17 +++++++
 net/ipv4/tcp_output.c    |  37 ++++++++++++---
 7 files changed, 279 insertions(+), 46 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index eb125a4..b2a6451 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -110,6 +110,10 @@ enum {
 #define TCP_REPAIR_QUEUE	20
 #define TCP_QUEUE_SEQ		21
 #define TCP_REPAIR_OPTIONS	22
+#define TCP_EXPOPTS		23	/* TCP exp. options (configured) */
+#define TCP_RECV_EXPOPTS	24	/* TCP exp. options (received) */
+#define TCP_RECV_SYN_EXPOPTS	25	/* TCP exp. options
+					   (received with syn)) */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
@@ -269,6 +273,8 @@ struct tcp_sack_block {
 #define TCP_FACK_ENABLED  (1 << 1)   /*1 = FACK is enabled locally*/
 #define TCP_DSACK_SEEN    (1 << 2)   /*1 = DSACK was received from peer*/
 
+#define TCP_EXPOP_MAXLEN	40
+
 struct tcp_options_received {
 /*	PAWS/RTTM data	*/
 	long	ts_recent_stamp;/* Time we stored ts_recent (for aging) */
@@ -288,6 +294,9 @@ struct tcp_options_received {
 	u8	num_sacks;	/* Number of SACK blocks		*/
 	u16	user_mss;	/* mss requested by user in ioctl	*/
 	u16	mss_clamp;	/* Maximal mss, negotiated at connection setup */
+	u8	exp_opts_len;	/* length of buffer containing all exp
+				   options in format: kind length data */
+	u8	exp_opts[TCP_EXPOP_MAXLEN];	/* experimental options */
 };
 
 static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
@@ -295,6 +304,7 @@ static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
 	rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
 	rx_opt->wscale_ok = rx_opt->snd_wscale = 0;
 	rx_opt->cookie_plus = 0;
+	rx_opt->exp_opts_len = 0;
 }
 
 /* This is the max number of SACKS that we'll generate and process. It's safe
@@ -315,6 +325,10 @@ struct tcp_request_sock {
 	u32				rcv_isn;
 	u32				snt_isn;
 	u32				snt_synack; /* synack sent time */
+
+	u8 syn_expopts[TCP_EXPOP_MAXLEN];	/* experimental options
+						   received with SYNACK */
+	u8 syn_expopts_len;
 };
 
 static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
@@ -406,6 +420,17 @@ struct tcp_sock {
 	u32	snd_up;		/* Urgent pointer		*/
 
 	u8	keepalive_probes; /* num of allowed keep alive probes	*/
+
+	/* for raw acces to experimental options */
+	struct {
+		u8 *conf;	/* lazy allocation of TCP_EXPOP_MAXLEN bytes
+				   for raw access to experimental options */
+		u8 conf_len;	/* bytes actually used for experimental opts */
+		u8 *syn;	/* experimental options received with SYN,
+				   allocated only if received */
+		u8 syn_len;	/* bytes of experimental options actually
+				   received with SYN */
+	} exp_opts;
 /*
  *      Options received (usually on last packet, some only on SYN packets).
  */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 1f000ff..b63d5c9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -170,6 +170,8 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOPT_TIMESTAMP	8	/* Better RTT estimations/PAWS */
 #define TCPOPT_MD5SIG		19	/* MD5 Signature (RFC2385) */
 #define TCPOPT_COOKIE		253	/* Cookie extension (experimental) */
+#define TCPOPT_EXP253		253	/* TCP experimental option 253 */
+#define TCPOPT_EXP254		254	/* TCP experimental option 254 */
 #define TCPOPT_EXP		254	/* Experimental */
 /* Magic number to be after the option value for sharing TCP
  * experimental options. See draft-ietf-tcpm-experimental-options-00.txt
@@ -180,6 +182,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
  *     TCP option lengths
  */
 
+#define TCPOLEN_MAX_ANYEXP     40
 #define TCPOLEN_MSS            4
 #define TCPOLEN_WINDOW         3
 #define TCPOLEN_SACK_PERM      2
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5f64193..e7e4947 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -423,6 +423,12 @@ void tcp_init_sock(struct sock *sk)
 	sk->sk_sndbuf = sysctl_tcp_wmem[1];
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
+	/* memory for raw access to experimental options is allocated lazy */
+	tp->exp_opts.conf = NULL;
+	tp->exp_opts.conf_len = 0;
+	tp->exp_opts.syn = NULL;
+	tp->exp_opts.syn_len = 0;
+
 	local_bh_disable();
 	sock_update_memcg(sk);
 	sk_sockets_allocated_inc(sk);
@@ -2376,6 +2382,53 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 
 	/* These are data/string values, all the others are ints */
 	switch (optname) {
+	case TCP_EXPOPTS: {
+		u8 conf[TCP_EXPOP_MAXLEN];
+
+		if (optlen > TCP_EXPOP_MAXLEN || (optlen < 4 && optlen > 0) ||
+		    (optlen % 4 > 0))
+			return -EINVAL;
+		if (optlen > 0 && !optval)
+			return -EINVAL;
+
+		/* filter for raw access to supported options */
+		if (optlen) {
+			u8 i;
+
+			if (copy_from_user(conf, optval, optlen))
+				return -EFAULT;
+
+			i = 0;
+			while (i < optlen) {
+				if (conf[i] != TCPOPT_EXP253 &&
+				    conf[i] != TCPOPT_EXP254)
+					return -EINVAL;
+
+				if (i + 1 < optlen) {
+					i += conf[i+1];
+					if (i > optlen)
+						return -EINVAL;
+				} else {
+					return -EINVAL;
+				}
+			}
+		}
+
+		lock_sock(sk);
+		if (!optlen) {
+			tp->exp_opts.conf_len = 0;
+			release_sock(sk);
+			return 0;
+		}
+		if (!tp->exp_opts.conf) {
+			tp->exp_opts.conf = kzalloc(TCP_EXPOP_MAXLEN,
+						    sk->sk_allocation);
+		}
+		memcpy(tp->exp_opts.conf, conf, optlen);
+		tp->exp_opts.conf_len = optlen;
+		release_sock(sk);
+		return err;
+	}
 	case TCP_CONGESTION: {
 		char name[TCP_CA_NAME_MAX];
 
@@ -2947,6 +3000,63 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_USER_TIMEOUT:
 		val = jiffies_to_msecs(icsk->icsk_user_timeout);
 		break;
+	case TCP_EXPOPTS: {
+		u8 exp_opts_len;
+
+		if (get_user(len, optlen))
+			return -EFAULT;
+		if (len < 0)
+			return -EINVAL;
+
+		exp_opts_len = tp->exp_opts.conf_len;
+
+		if (exp_opts_len > len)
+			return -EINVAL;
+		if (put_user(exp_opts_len, optlen))
+			return -EFAULT;
+		if (exp_opts_len && copy_to_user(optval, tp->exp_opts.conf,
+						 exp_opts_len))
+			return -EFAULT;
+		return 0;
+	}
+	case TCP_RECV_EXPOPTS:
+		if (get_user(len, optlen))
+			return -EFAULT;
+		if (len < 0)
+			return -EINVAL;
+
+		if (len < tp->rx_opt.exp_opts_len)
+			return -EINVAL;
+
+		if (put_user(tp->rx_opt.exp_opts_len, optlen))
+			return -EFAULT;
+		if (copy_to_user(optval, tp->rx_opt.exp_opts,
+				 tp->rx_opt.exp_opts_len))
+			return -EFAULT;
+		return 0;
+	case TCP_RECV_SYN_EXPOPTS: {
+		u8 exp_opts_len;
+
+		if (get_user(len, optlen))
+			return -EFAULT;
+		if (len < 0)
+			return -EINVAL;
+
+		if (!tp->exp_opts.syn)
+			exp_opts_len = 0;
+		else
+			exp_opts_len = tp->exp_opts.syn_len;
+
+		if (exp_opts_len > len)
+			return -EINVAL;
+		if (put_user(exp_opts_len, optlen))
+			return -EFAULT;
+		if (exp_opts_len && copy_to_user(optval, tp->exp_opts.syn,
+						 exp_opts_len)) {
+			return -EFAULT;
+		}
+		return 0;
+	}
 	default:
 		return -ENOPROTOOPT;
 	}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d377f48..130d4f4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3726,11 +3726,32 @@ old_ack:
 	return 0;
 }
 
+static inline void tcp_parse_fastopen_cookie(int opcode,
+		int opsize,
+		const unsigned char *ptr,
+		struct tcp_fastopen_cookie *foc,
+		const struct tcphdr *th) {
+	/* Fast Open option shares code 254 using a 16 bits magic number. It's
+	 * valid only in SYN or SYN-ACK with an even size.
+	 */
+	if (opsize < TCPOLEN_EXP_FASTOPEN_BASE ||
+	    get_unaligned_be16(ptr) != TCPOPT_FASTOPEN_MAGIC || foc == NULL ||
+	    !th->syn || (opsize & 1))
+		return;
+	foc->len = opsize - TCPOLEN_EXP_FASTOPEN_BASE;
+	if (foc->len >= TCP_FASTOPEN_COOKIE_MIN &&
+	    foc->len <= TCP_FASTOPEN_COOKIE_MAX)
+		memcpy(foc->val, ptr + 2, foc->len);
+	else if (foc->len != 0)
+		foc->len = -1;
+}
+
 /* Look for tcp options. Normally only called on SYN and SYNACK packets.
  * But, this can also be called on packets in the established flow when
  * the fast version below fails.
  */
-void tcp_parse_options(const struct sk_buff *skb, struct tcp_options_received *opt_rx,
+void tcp_parse_options(const struct sk_buff *skb,
+		       struct tcp_options_received *opt_rx,
 		       const u8 **hvpp, int estab,
 		       struct tcp_fastopen_cookie *foc)
 {
@@ -3740,6 +3761,7 @@ void tcp_parse_options(const struct sk_buff *skb, struct tcp_options_received *o
 
 	ptr = (const unsigned char *)(th + 1);
 	opt_rx->saw_tstamp = 0;
+	opt_rx->exp_opts_len = 0;
 
 	while (length > 0) {
 		int opcode = *ptr++;
@@ -3815,48 +3837,56 @@ void tcp_parse_options(const struct sk_buff *skb, struct tcp_options_received *o
 				 */
 				break;
 #endif
-			case TCPOPT_COOKIE:
-				/* This option is variable length.
+			case TCPOPT_EXP253:
+			case TCPOPT_EXP254:
+				/* First parse options into raw access area for
+				 * experimental options. Then handle
+				 * potential exploitations
 				 */
-				switch (opsize) {
-				case TCPOLEN_COOKIE_BASE:
-					/* not yet implemented */
-					break;
-				case TCPOLEN_COOKIE_PAIR:
-					/* not yet implemented */
-					break;
-				case TCPOLEN_COOKIE_MIN+0:
-				case TCPOLEN_COOKIE_MIN+2:
-				case TCPOLEN_COOKIE_MIN+4:
-				case TCPOLEN_COOKIE_MIN+6:
-				case TCPOLEN_COOKIE_MAX:
-					/* 16-bit multiple */
-					opt_rx->cookie_plus = opsize;
-					*hvpp = ptr;
-					break;
-				default:
-					/* ignore option */
-					break;
+				if (opsize <= TCPOLEN_MAX_ANYEXP &&
+				    opsize >= 2 &&
+				    (opt_rx->exp_opts_len + opsize <=
+				     TCPOLEN_MAX_ANYEXP)) {
+					opt_rx->exp_opts[
+						opt_rx->exp_opts_len] = opcode;
+					opt_rx->exp_opts[
+						opt_rx->exp_opts_len + 1] =
+						opsize;
+					memcpy(opt_rx->exp_opts +
+						opt_rx->exp_opts_len + 2, ptr,
+						opsize - 2);
+					opt_rx->exp_opts_len += opsize;
 				}
-				break;
 
-			case TCPOPT_EXP:
-				/* Fast Open option shares code 254 using a
-				 * 16 bits magic number. It's valid only in
-				 * SYN or SYN-ACK with an even size.
-				 */
-				if (opsize < TCPOLEN_EXP_FASTOPEN_BASE ||
-				    get_unaligned_be16(ptr) != TCPOPT_FASTOPEN_MAGIC ||
-				    foc == NULL || !th->syn || (opsize & 1))
-					break;
-				foc->len = opsize - TCPOLEN_EXP_FASTOPEN_BASE;
-				if (foc->len >= TCP_FASTOPEN_COOKIE_MIN &&
-				    foc->len <= TCP_FASTOPEN_COOKIE_MAX)
-					memcpy(foc->val, ptr + 2, foc->len);
-				else if (foc->len != 0)
-					foc->len = -1;
+				/* handle potential exploitations */
+				if (opcode == TCPOPT_COOKIE) {
+					/* This option is variable length. */
+					switch (opsize) {
+					case TCPOLEN_COOKIE_BASE:
+						/* not yet implemented */
+						break;
+					case TCPOLEN_COOKIE_PAIR:
+						/* not yet implemented */
+						break;
+					case TCPOLEN_COOKIE_MIN+0:
+					case TCPOLEN_COOKIE_MIN+2:
+					case TCPOLEN_COOKIE_MIN+4:
+					case TCPOLEN_COOKIE_MIN+6:
+					case TCPOLEN_COOKIE_MAX:
+						/* 16-bit multiple */
+						opt_rx->cookie_plus = opsize;
+						*hvpp = ptr;
+						break;
+					default:
+						/* ignore option */
+						break;
+					}
+				} else {
+					tcp_parse_fastopen_cookie(opcode,
+								  opsize, ptr,
+								  foc, th);
+				}
 				break;
-
 			}
 			ptr += opsize-2;
 			length -= opsize;
@@ -3888,6 +3918,9 @@ static bool tcp_fast_parse_options(const struct sk_buff *skb,
 				   const struct tcphdr *th,
 				   struct tcp_sock *tp, const u8 **hvpp)
 {
+	/* required if exp options are not used anymore by the counter part */
+	tp->rx_opt.exp_opts_len = 0;
+
 	/* In the spirit of fast parsing, compare doff directly to constant
 	 * values.  Because equality is used, short doff can be ignored here.
 	 */
@@ -5806,6 +5839,14 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 			}
 		}
 
+		if (unlikely(tp->rx_opt.exp_opts_len > 0)) {
+			tp->exp_opts.syn = kzalloc(tp->rx_opt.exp_opts_len,
+						   sk->sk_allocation);
+			tp->exp_opts.syn_len = tp->rx_opt.exp_opts_len;
+			memcpy(tp->exp_opts.syn, &tp->rx_opt.exp_opts,
+			       tp->rx_opt.exp_opts_len);
+		}
+
 		smp_mb();
 
 		tcp_finish_connect(sk, skb);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 00a748d..2f66bd5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1321,6 +1321,16 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	tmp_opt.user_mss  = tp->rx_opt.user_mss;
 	tcp_parse_options(skb, &tmp_opt, &hash_location, 0, NULL);
 
+	/* for raw access to experimental options in SYN packet */
+	tcp_rsk(req)->syn_expopts_len = tmp_opt.exp_opts_len;
+	if (tcp_rsk(req)->syn_expopts_len) {
+		/* transport experimental options via request socket to big
+		 * socket
+		 */
+		memcpy(tcp_rsk(req)->syn_expopts, tmp_opt.exp_opts,
+		       tcp_rsk(req)->syn_expopts_len);
+	}
+
 	if (tmp_opt.cookie_plus > 0 &&
 	    tmp_opt.saw_tstamp &&
 	    !tp->rx_opt.cookie_out_never &&
@@ -1978,6 +1988,10 @@ void tcp_v4_destroy_sock(struct sock *sk)
 		tp->cookie_values = NULL;
 	}
 
+	/* buffers for raw access to experimental options */
+	kfree(tp->exp_opts.conf);
+	kfree(tp->exp_opts.syn);
+
 	/* If socket is aborted during connect operation */
 	tcp_free_fastopen_req(tp);
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 6ff7f10..dc25875 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -466,6 +466,23 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 
 		newtp->urg_data = 0;
 
+		if (tcp_rsk(req)->syn_expopts_len) {
+			newtp->exp_opts.syn_len =
+					tcp_rsk(req)->syn_expopts_len;
+			newtp->exp_opts.syn = kzalloc(newtp->exp_opts.syn_len,
+						      GFP_ATOMIC);
+			memcpy(newtp->exp_opts.syn, tcp_rsk(req)->syn_expopts,
+			       newtp->exp_opts.syn_len);
+		}
+
+		if (oldtp->exp_opts.conf_len > 0) {
+			newtp->exp_opts.conf_len = oldtp->exp_opts.conf_len;
+			newtp->exp_opts.conf = kzalloc(TCP_EXPOP_MAXLEN,
+						       GFP_ATOMIC);
+			memcpy(newtp->exp_opts.conf, oldtp->exp_opts.conf,
+			       oldtp->exp_opts.conf_len);
+		}
+
 		if (sock_flag(newsk, SOCK_KEEPOPEN))
 			inet_csk_reset_keepalive_timer(newsk,
 						       keepalive_time_when(newtp));
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d046326..8d7cf51 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -385,6 +385,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
 #define OPTION_MD5		(1 << 2)
 #define OPTION_WSCALE		(1 << 3)
 #define OPTION_COOKIE_EXTENSION	(1 << 4)
+#define OPTION_EXP		(1 << 5)
 #define OPTION_FAST_OPEN_COOKIE	(1 << 8)
 
 struct tcp_out_options {
@@ -581,6 +582,12 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 		}
 		ptr += (foc->len + 3) >> 2;
 	}
+	if (unlikely(OPTION_EXP & options && tp->exp_opts.conf_len > 0)) {
+		__u8 *p = (__u8 *) ptr;
+		memcpy(ptr, tp->exp_opts.conf, tp->exp_opts.conf_len);
+		p += tp->exp_opts.conf_len;
+		ptr = (__be32 *) p;
+	}
 }
 
 /* Compute TCP options for SYN packets. This is not the final
@@ -693,6 +700,11 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 			remaining -= need;
 		}
 	}
+	if (unlikely(tp->exp_opts.conf_len > 0 &&
+		     tp->exp_opts.conf_len <= remaining)) {
+		opts->options |= OPTION_EXP;
+		remaining -= tp->exp_opts.conf_len;
+	}
 	return MAX_TCP_OPTION_SPACE - remaining;
 }
 
@@ -747,6 +759,11 @@ static unsigned int tcp_synack_options(struct sock *sk,
 		if (unlikely(!ireq->tstamp_ok))
 			remaining -= TCPOLEN_SACKPERM_ALIGNED;
 	}
+	if (unlikely(tcp_sk(sk)->exp_opts.conf_len > 0 &&
+		     tcp_sk(sk)->exp_opts.conf_len <= remaining)) {
+		opts->options |= OPTION_EXP;
+		remaining -= tcp_sk(sk)->exp_opts.conf_len;
+	}
 
 	/* Similar rationale to tcp_syn_options() applies here, too.
 	 * If the <SYN> options fit, the same options should fit now!
@@ -782,38 +799,44 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 {
 	struct tcp_skb_cb *tcb = skb ? TCP_SKB_CB(skb) : NULL;
 	struct tcp_sock *tp = tcp_sk(sk);
-	unsigned int size = 0;
+	unsigned remaining = MAX_TCP_OPTION_SPACE;
 	unsigned int eff_sacks;
 
 #ifdef CONFIG_TCP_MD5SIG
 	*md5 = tp->af_specific->md5_lookup(sk, sk);
 	if (unlikely(*md5)) {
 		opts->options |= OPTION_MD5;
-		size += TCPOLEN_MD5SIG_ALIGNED;
+		remaining -= TCPOLEN_MD5SIG_ALIGNED;
 	}
 #else
 	*md5 = NULL;
 #endif
 
-	if (likely(tp->rx_opt.tstamp_ok)) {
+	if (likely(tp->rx_opt.tstamp_ok &&
+		   remaining >= TCPOLEN_TSTAMP_ALIGNED)) {
 		opts->options |= OPTION_TS;
 		opts->tsval = tcb ? tcb->when : 0;
 		opts->tsecr = tp->rx_opt.ts_recent;
-		size += TCPOLEN_TSTAMP_ALIGNED;
+		remaining -= TCPOLEN_TSTAMP_ALIGNED;
 	}
 
 	eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
 	if (unlikely(eff_sacks)) {
-		const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
 		opts->num_sack_blocks =
 			min_t(unsigned int, eff_sacks,
 			      (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
 			      TCPOLEN_SACK_PERBLOCK);
-		size += TCPOLEN_SACK_BASE_ALIGNED +
+		remaining -= TCPOLEN_SACK_BASE_ALIGNED +
 			opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
 	}
 
-	return size;
+	if (unlikely(tp->exp_opts.conf_len > 0 &&
+		     tp->exp_opts.conf_len <= remaining)) {
+		opts->options |= OPTION_EXP;
+		remaining -= tp->exp_opts.conf_len;
+	}
+
+	return MAX_TCP_OPTION_SPACE - remaining;
 }
 
 
-- 
1.7.12.4

^ permalink raw reply related

* Re: [PATCH] openvswitch: Make IPv6 packet parsing dependent on IPv6 config
From: Jesse Gross @ 2012-11-16 17:33 UTC (permalink / raw)
  To: Vlad Yasevich
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1353080434-14165-1-git-send-email-vyasevic-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>


[-- Attachment #1.1: Type: text/plain, Size: 990 bytes --]

On Fri, Nov 16, 2012 at 7:40 AM, Vlad Yasevich <vyasevic-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> Openvswitch attempts to use IPv6 packet parsing functions without
> any dependency on IPv6 (unlike every other place in kernel).  Pull
> the IPv6 code in openvswitch togeter and put a conditional that's
> dependent on CONFIG_IPV6.
>
> Resolves:
> net/built-in.o: In function `ovs_flow_extract':
> (.text+0xbf5d5): undefined reference to `ipv6_skip_exthdr'
>
> Signed-off-by: Vlad Yasevich <vyasevic-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>


Doesn't this move in the opposite direction of your patches to make IPv6
GSO/GRO always available?  The packets being processed here
are generally created by the guest but with Open vSwitch running on the
host.  Also, ipv6_skip_exthdr() is in exthdrs_core.c, so it actually is
always available.  I suspect that the real problem is that the dependency
on the ipv6 directory changed to CONFIG_INET and Open vSwitch should now
depend on this.

[-- Attachment #1.2: Type: text/html, Size: 1531 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply

* Re: [PATCH] openvswitch: Make IPv6 packet parsing dependent on IPv6 config
From: Jesse Gross @ 2012-11-16 17:36 UTC (permalink / raw)
  To: Vlad Yasevich
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1353080434-14165-1-git-send-email-vyasevic-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Fri, Nov 16, 2012 at 7:40 AM, Vlad Yasevich <vyasevic-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Openvswitch attempts to use IPv6 packet parsing functions without
> any dependency on IPv6 (unlike every other place in kernel).  Pull
> the IPv6 code in openvswitch togeter and put a conditional that's
> dependent on CONFIG_IPV6.
>
> Resolves:
> net/built-in.o: In function `ovs_flow_extract':
> (.text+0xbf5d5): undefined reference to `ipv6_skip_exthdr'
>
> Signed-off-by: Vlad Yasevich <vyasevic-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

(Sorry for duplicates received, the original message had a typo in the
mailing list address.)

Doesn't this move in the opposite direction of your patches to make
IPv6 GSO/GRO always available?  The packets being processed here are
generally created by the guest but with Open vSwitch running on the
host.  Also, ipv6_skip_exthdr() is in exthdrs_core.c, so it actually
is always available.  I suspect that the real problem is that the
dependency on the ipv6 directory changed to CONFIG_INET and Open
vSwitch should now depend on this.

^ permalink raw reply

* Re: [PATCH] openvswitch: Make IPv6 packet parsing dependent on IPv6 config
From: Vlad Yasevich @ 2012-11-16 17:43 UTC (permalink / raw)
  To: Jesse Gross
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <CAEP_g=9ge1dq8ahinM075hjdfdmXaou4a9fqPyLPJQinths6pQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 11/16/2012 12:26 PM, Jesse Gross wrote:
> On Fri, Nov 16, 2012 at 7:40 AM, Vlad Yasevich <vyasevic-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
>> Openvswitch attempts to use IPv6 packet parsing functions without
>> any dependency on IPv6 (unlike every other place in kernel).  Pull
>> the IPv6 code in openvswitch togeter and put a conditional that's
>> dependent on CONFIG_IPV6.
>>
>> Resolves:
>> net/built-in.o: In function `ovs_flow_extract':
>> (.text+0xbf5d5): undefined reference to `ipv6_skip_exthdr'
>>
>> Signed-off-by: Vlad Yasevich <vyasevic-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>
>
> Doesn't this move in the opposite direction of your patches to make IPv6
> GSO/GRO always available?  The packets being processed here
> are generally created by the guest but with Open vSwitch running on the
> host.  Also, ipv6_skip_exthdr() is in exthdrs_core.c, so it actually is
> always available.  I suspect that the real problem is that the dependency
> on the ipv6 directory changed to CONFIG_INET and Open vSwitch should now
> depend on this.
>

Yes and no... :)  IPv6 uses a bunch of IPv4 code all over.  IPv4 is 
enabled with CONFIG_INET and IPv6 with CONFIG_NET.  So creates a strange 
imbalance.  By shifting IPv6 to CONFIG_INET (which is where it
lives and what enables its selection during config process), we now have 
a dependency with openvswitch.

All other users of ipv6_skip_exthdr have it either under the IS_ENABLED 
conditional or through some other means that don't build it when INET is
completely turned off.  This patch does the same for openvswitch.

I see 2 alternatives to this:
  1) Make openvswitch depend on CONFIG_INET.
  2) Pull a ton of code out of CONFIG_INET (v4 and v6) and into 
CONFIG_NET.  This could start with IPv6 header parsing and maybe even
include GSO/TSO (but not sure how much sense that would be).

What's your take?

-vlad

^ permalink raw reply

* Re: [PATCH 0/4] netfilter updates for nf-next (try 2)
From: David Miller @ 2012-11-16 18:00 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <1353069653-3231-1-git-send-email-pablo@netfilter.org>

From: pablo@netfilter.org
Date: Fri, 16 Nov 2012 13:40:49 +0100

> This is the second try to include the following four patches that contain
> updates for your net-next tree, they are:
> 
> * Little cleanup for IPVS the use of a strange notation to assign the
>   conntrack object, from Alan Cox.
> 
> * Another little cleanup for nf_nat to save a couple of lines by using
>   PTR_RET, from Wu Fengguan
> 
> * getsockopt support to obtain the original IPv6 address after NAT,
>   similar to the one that IPv4 provides, from Florian Westphal.
> 
> * Follow-up patch pointed out by YOSHIFUJI Hideaki to only provide
>   the scope_id in case that link is local, again from Florian Westphal.
> 
> You can pull these changes from:
> 
> git://1984.lsi.us.es/nf-next master

Pulled, thanks.

There was a conflict I had to resolve because the net-next tree
had a series of IS_ENABLED conversions that overlapped with some
of the changes in your tree.

^ permalink raw reply

* [PATCH net-next] net: use right lock in __dev_remove_offload
From: Eric Dumazet @ 2012-11-16 18:08 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Vlad Yasevich

From: Eric Dumazet <edumazet@google.com>

offload_base is protected by offload_lock, not ptype_lock

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
---
 net/core/dev.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index cf105e8..2705a2a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -513,7 +513,7 @@ void __dev_remove_offload(struct packet_offload *po)
 	struct list_head *head = &offload_base;
 	struct packet_offload *po1;
 
-	spin_lock(&ptype_lock);
+	spin_lock(&offload_lock);
 
 	list_for_each_entry(po1, head, list) {
 		if (po == po1) {
@@ -524,7 +524,7 @@ void __dev_remove_offload(struct packet_offload *po)
 
 	pr_warn("dev_remove_offload: %p not found\n", po);
 out:
-	spin_unlock(&ptype_lock);
+	spin_unlock(&offload_lock);
 }
 EXPORT_SYMBOL(__dev_remove_offload);
 

^ permalink raw reply related

* Re: [PATCH net-next] net: use right lock in __dev_remove_offload
From: Vlad Yasevich @ 2012-11-16 18:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1353089303.10798.38.camel@edumazet-glaptop>

On 11/16/2012 01:08 PM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> offload_base is protected by offload_lock, not ptype_lock
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Vlad Yasevich <vyasevic@redhat.com>
> ---
>   net/core/dev.c |    4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index cf105e8..2705a2a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -513,7 +513,7 @@ void __dev_remove_offload(struct packet_offload *po)
>   	struct list_head *head = &offload_base;
>   	struct packet_offload *po1;
>
> -	spin_lock(&ptype_lock);
> +	spin_lock(&offload_lock);
>
>   	list_for_each_entry(po1, head, list) {
>   		if (po == po1) {
> @@ -524,7 +524,7 @@ void __dev_remove_offload(struct packet_offload *po)
>
>   	pr_warn("dev_remove_offload: %p not found\n", po);
>   out:
> -	spin_unlock(&ptype_lock);
> +	spin_unlock(&offload_lock);
>   }
>   EXPORT_SYMBOL(__dev_remove_offload);
>

Acked-by: Vlad Yasevich <vyasevic@redhat.com>

I distinctly remember changing that, but just look at my patches and 
it's there.... Sorry...

-vlad

^ permalink raw reply

* Re: [PATCH 2/5] drivers/net/wireless/ti/wlcore/main.c: eliminate possible double power off
From: Luciano Coelho @ 2012-11-16 18:18 UTC (permalink / raw)
  To: Julia Lawall
  Cc: kernel-janitors, John W. Linville, linux-wireless, netdev,
	linux-kernel
In-Reply-To: <1350816727-1381-3-git-send-email-Julia.Lawall@lip6.fr>

On Sun, 2012-10-21 at 12:52 +0200, Julia Lawall wrote:
> From: Julia Lawall <Julia.Lawall@lip6.fr>
> 
> The function wl12xx_set_power_on is only called twice, once in
> wl12xx_chip_wakeup and once in wl12xx_get_hw_info.  On the failure of the
> call in wl12xx_chip_wakeup, the containing function just returns, but on
> the failure of the call in wl12xx_get_hw_info, the containing function
> calls wl1271_power_off.  This does not seem necessary, because if
> wl12xx_set_power_on has set the power on and then fails, it has already
> turned the power off.

[...]

Applied and pushed, thanks!

--
Luca.

^ permalink raw reply

* Re: [PATCH 14/14] wlcore: Remove redundant check on unsigned variable
From: Luciano Coelho @ 2012-11-16 18:24 UTC (permalink / raw)
  To: Tushar Behera; +Cc: linux-kernel, patches, linux-wireless, netdev
In-Reply-To: <1353048646-10935-15-git-send-email-tushar.behera@linaro.org>

On Fri, 2012-11-16 at 12:20 +0530, Tushar Behera wrote:
> No need to check whether unsigned variable is less than 0.
> 
> CC: Luciano Coelho <coelho@ti.com>
> CC: linux-wireless@vger.kernel.org
> CC: netdev@vger.kernel.org
> Signed-off-by: Tushar Behera <tushar.behera@linaro.org>
> ---

Applied in the wl12xx.git tree.  Thanks!

--
Luca.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox