Netdev List

* Re: [PATCH 1/3] net: lpc_eth: Replace WARN() trace with simple pr_warn()
From: Roland Stigge @ 2012-06-11  8:36 UTC
  To: Eric Dumazet
  Cc: davem, netdev, linux-kernel, kevin.wells, srinivas.bakki,
	aletes.xgr, linux-arm-kernel
In-Reply-To: <1339403108.6001.1697.camel@edumazet-glaptop>

Hi Dave and Eric,

thanks for your feedback!

On 06/11/2012 10:25 AM, Eric Dumazet wrote:
> On Mon, 2012-06-11 at 10:03 +0200, Roland Stigge wrote:
>> A WARN() trace indicating a "BUG!" was identified as a "normal" case in the
>> xmit function in case all TX descriptors are occupied already. In this case,
>> NETDEV_TX_BUSY is returned, nothing buggy at all.
>>
>> Signed-off-by: Roland Stigge <stigge@antcom.de>
>> Tested-by: Alexandre Pereira da Silva <aletes.xgr@gmail.com>
>>
>> ---
>>  drivers/net/ethernet/nxp/lpc_eth.c |    2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> --- linux-2.6.orig/drivers/net/ethernet/nxp/lpc_eth.c
>> +++ linux-2.6/drivers/net/ethernet/nxp/lpc_eth.c
>> @@ -1114,7 +1114,7 @@ static int lpc_eth_hard_start_xmit(struc
>>  		   buffers */
>>  		netif_stop_queue(ndev);
>>  		spin_unlock_irq(&pldat->lock);
>> -		WARN(1, "BUG! TX request when no free TX buffers!\n");
>> +		pr_warn("Note: TX request when no free TX buffers.\n");
>>  		return NETDEV_TX_BUSY;
>>  	}
>>  
> 
> Entering this path is a bug, don't hide it...
> 
> Please share with us how this bug was identified as a "normal case" ?

I encountered cases where this happened for me on a custom board under
heavy load.

I discussed this with Kevin Wells, the original driver author. We
identified the case of xmit()'s TX request (from .ndo_start_xmit) with
full TX driver buffers as valid when ethernet is busy.

But maybe this is wrong. Can you please give me a hint how the net
subsystem makes sure that this doesn't happen under normal circumstances?

Thanks in advance!

Roland

* Re: [PATCH 1/3] net: lpc_eth: Replace WARN() trace with simple pr_warn()
From: Eric Dumazet @ 2012-06-11  8:39 UTC
  To: Roland Stigge
  Cc: davem, netdev, linux-kernel, kevin.wells, srinivas.bakki,
	aletes.xgr, linux-arm-kernel
In-Reply-To: <1339403108.6001.1697.camel@edumazet-glaptop>

On Mon, 2012-06-11 at 10:25 +0200, Eric Dumazet wrote:
> On Mon, 2012-06-11 at 10:03 +0200, Roland Stigge wrote:
> > A WARN() trace indicating a "BUG!" was identified as a "normal" case in the
> > xmit function in case all TX descriptors are occupied already. In this case,
> > NETDEV_TX_BUSY is returned, nothing buggy at all.
> > 
> > Signed-off-by: Roland Stigge <stigge@antcom.de>
> > Tested-by: Alexandre Pereira da Silva <aletes.xgr@gmail.com>
> > 
> > ---
> >  drivers/net/ethernet/nxp/lpc_eth.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > --- linux-2.6.orig/drivers/net/ethernet/nxp/lpc_eth.c
> > +++ linux-2.6/drivers/net/ethernet/nxp/lpc_eth.c
> > @@ -1114,7 +1114,7 @@ static int lpc_eth_hard_start_xmit(struc
> >  		   buffers */
> >  		netif_stop_queue(ndev);
> >  		spin_unlock_irq(&pldat->lock);
> > -		WARN(1, "BUG! TX request when no free TX buffers!\n");
> > +		pr_warn("Note: TX request when no free TX buffers.\n");
> >  		return NETDEV_TX_BUSY;
> >  	}
> >  
> 
> Entering this path is a bug, don't hide it...
> 
> Please share with us how this bug was identified as a "normal case" ?
> 
> 


There is an skb leak in this driver, maybe it's the real problem.

diff --git a/drivers/net/ethernet/nxp/lpc_eth.c b/drivers/net/ethernet/nxp/lpc_eth.c
index 8d2666f..0d0f4cb 100644
--- a/drivers/net/ethernet/nxp/lpc_eth.c
+++ b/drivers/net/ethernet/nxp/lpc_eth.c
@@ -946,10 +946,8 @@ static void __lpc_handle_xmit(struct net_device *ndev)
 			/* Update stats */
 			ndev->stats.tx_packets++;
 			ndev->stats.tx_bytes += skb->len;
-
-			/* Free buffer */
-			dev_kfree_skb_irq(skb);
 		}
+		dev_kfree_skb_irq(skb);
 
 		txcidx = readl(LPC_ENET_TXCONSUMEINDEX(pldat->net_base));
 	}

* Re: [PATCH 1/3] net: lpc_eth: Replace WARN() trace with simple pr_warn()
From: Eric Dumazet @ 2012-06-11  8:53 UTC
  To: Roland Stigge
  Cc: davem, netdev, linux-kernel, kevin.wells, srinivas.bakki,
	aletes.xgr, linux-arm-kernel
In-Reply-To: <4FD5AE1D.9030807@antcom.de>

On Mon, 2012-06-11 at 10:36 +0200, Roland Stigge wrote:

> I encountered cases where this happened for me on a custom board under
> heavy load.
> 
> I discussed this with Kevin Wells, the original driver author. We
> identified the case of xmit()'s TX request (from .ndo_start_xmit) with
> full TX driver buffers as valid when ethernet is busy.
> 
> But maybe this is wrong. Can you please give me a hint how the net
> subsystem makes sure that this doesn't happen under normal circumstances?

When the TX ring is about to be filled, the driver's
lpc_eth_hard_start_xmit() calls netif_stop_queue(ndev);

So the network stack should not call lpc_eth_hard_start_xmit() again.

I would say the bug(s) come from __lpc_handle_xmit(), since it does:

if (netif_queue_stopped(ndev))
	netif_wake_queue(ndev);

without making sure some room is available in the TX ring.

cumulative patch :

diff --git a/drivers/net/ethernet/nxp/lpc_eth.c b/drivers/net/ethernet/nxp/lpc_eth.c
index 8d2666f..59b37c8 100644
--- a/drivers/net/ethernet/nxp/lpc_eth.c
+++ b/drivers/net/ethernet/nxp/lpc_eth.c
@@ -946,16 +946,16 @@ static void __lpc_handle_xmit(struct net_device *ndev)
 			/* Update stats */
 			ndev->stats.tx_packets++;
 			ndev->stats.tx_bytes += skb->len;
-
-			/* Free buffer */
-			dev_kfree_skb_irq(skb);
 		}
+		dev_kfree_skb_irq(skb);
 
 		txcidx = readl(LPC_ENET_TXCONSUMEINDEX(pldat->net_base));
 	}
 
-	if (netif_queue_stopped(ndev))
-		netif_wake_queue(ndev);
+	if (pldat->num_used_tx_buffs <= ENET_TX_DESC/2) { 
+		if (netif_queue_stopped(ndev))
+			netif_wake_queue(ndev);
+	}
 }
 
 static int __lpc_handle_recv(struct net_device *ndev, int budget)
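
The stop/wake contract described above can be modelled in user space. The
following is only a minimal sketch under stated assumptions: struct tx_ring,
ring_xmit(), ring_complete() and ring_fill() are hypothetical names, not
driver code, and the ring size of 16 is the TX buffer count mentioned
elsewhere in this thread. It stops the queue when the ring fills, frees
every completed buffer whether or not the transmit reported an error, and
wakes the queue only once half of the ring is free again.

```c
#include <assert.h>
#include <stdbool.h>

#define TX_DESC 16	/* assumed ring size, taken from the thread */

/* Hypothetical user-space stand-in for the driver's private TX state. */
struct tx_ring {
	int  used;	/* descriptors currently owned by the hardware */
	bool stopped;	/* models netif_stop_queue()/netif_wake_queue() */
	int  freed;	/* buffers released by the completion handler */
};

enum tx_result { TX_OK, TX_BUSY };

/* Models the xmit hook: the stack must not call it while stopped. */
static enum tx_result ring_xmit(struct tx_ring *r)
{
	if (r->used >= TX_DESC) {
		/* Unreachable when stop and wake are paired correctly. */
		r->stopped = true;
		return TX_BUSY;
	}
	r->used++;
	/* Stop the queue as soon as the ring becomes full. */
	if (r->used == TX_DESC)
		r->stopped = true;
	return TX_OK;
}

/* Models the TX completion handler for one finished descriptor. */
static void ring_complete(struct tx_ring *r, bool tx_error)
{
	if (r->used == 0)
		return;
	r->used--;
	/* Only the statistics depend on tx_error; the buffer is always freed. */
	(void)tx_error;
	r->freed++;
	/* Wake only once a useful amount of room is available. */
	if (r->stopped && r->used <= TX_DESC / 2)
		r->stopped = false;
}

/* Drives the model the way the stack would: xmit only while not stopped. */
static int ring_fill(struct tx_ring *r)
{
	int busy = 0;

	while (!r->stopped) {
		if (ring_xmit(r) == TX_BUSY)
			busy++;
	}
	return busy;
}
```

With this pairing, ring_fill() never observes TX_BUSY, which is the reason
the busy path in the real xmit function is treated as a bug rather than as
a normal condition.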

* [PATCH net 0/3] correct behavior when modify primary via sysfs
From: Weiping Pan @ 2012-06-11  9:00 UTC
  To: netdev

When the primary slave is set with module parameters, the bond always
uses that primary slave as the active slave.

But when the primary slave is modified via sysfs, the code calls
bond_should_change_active(), which takes primary_reselect into account.

I think we should use the new primary slave as the new active slave
regardless of the value of primary_reselect.
The behavior is then the same as with module parameters and meets the
administrator's expectation.

Weiping Pan (3):
  bonding:record primary when modify it via sysfs
  bonding:check mode when modify primary_reselect
  bonding:force to use primary slave

 drivers/net/bonding/bond_sysfs.c |   16 ++++++++++++++--
 1 files changed, 14 insertions(+), 2 deletions(-)

-- 
1.7.4

* [PATCH net 1/3] bonding:record primary when modify it via sysfs
From: Weiping Pan @ 2012-06-11  9:00 UTC
  To: netdev
In-Reply-To: <cover.1339404887.git.wpan@redhat.com>

If we modify primary via sysfs and it is not a valid slave,
we should record it for future use; this matches the behavior of
bond_check_params().

Signed-off-by: Weiping Pan <wpan@redhat.com>
---
 drivers/net/bonding/bond_sysfs.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index aef42f0..485bedb 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1082,8 +1082,12 @@ static ssize_t bonding_store_primary(struct device *d,
 		}
 	}
 
-	pr_info("%s: Unable to set %.*s as primary slave.\n",
-		bond->dev->name, (int)strlen(buf) - 1, buf);
+	strncpy(bond->params.primary, ifname, IFNAMSIZ);
+	bond->params.primary[IFNAMSIZ - 1] = 0;
+
+	pr_info("%s: Recording %s as primary, "
+		"but it has not been enslaved to %s yet.\n",
+		bond->dev->name, ifname, bond->dev->name);
 out:
 	write_unlock_bh(&bond->curr_slave_lock);
 	read_unlock(&bond->lock);
-- 
1.7.4
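
The copy-then-terminate idiom used in the patch above can be exercised on
its own. This is a minimal sketch: record_ifname() is a hypothetical helper,
and IFNAMSIZ is redefined locally as 16 so the example builds outside the
kernel. strncpy() does not write a terminator when the source fills the
whole buffer, which is why the explicit store to the last byte is needed.

```c
#include <assert.h>
#include <string.h>

#define IFNAMSIZ 16	/* interface name buffer size, redefined for user space */

/* Copies name into dst, truncating if needed; dst is always NUL-terminated. */
static void record_ifname(char dst[IFNAMSIZ], const char *name)
{
	strncpy(dst, name, IFNAMSIZ);
	dst[IFNAMSIZ - 1] = '\0';
}
```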

* [PATCH net 2/3] bonding:check mode when modify primary_reselect
From: Weiping Pan @ 2012-06-11  9:00 UTC
  To: netdev
In-Reply-To: <cover.1339404887.git.wpan@redhat.com>

Using a primary_reselect only makes sense in active backup, TLB or ALB modes.

Signed-off-by: Weiping Pan <wpan@redhat.com>
---
 drivers/net/bonding/bond_sysfs.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 485bedb..1b0f3cd 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1123,6 +1123,13 @@ static ssize_t bonding_store_primary_reselect(struct device *d,
 	if (!rtnl_trylock())
 		return restart_syscall();
 
+	if (!USES_PRIMARY(bond->params.mode)) {
+		pr_err("%s: Unable to set primary_reselect; %s is in mode %d\n",
+			bond->dev->name, bond->dev->name, bond->params.mode);
+		ret = -EINVAL;
+		goto out;
+	}
+
 	new_value = bond_parse_parm(buf, pri_reselect_tbl);
 	if (new_value < 0)  {
 		pr_err("%s: Ignoring invalid primary_reselect value %.*s.\n",
-- 
1.7.4
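
The mode check added above can be sketched outside the kernel as follows.
The names uses_primary() and store_primary_reselect() are hypothetical, and
the mode numbers are the ones given in the bonding documentation, repeated
here as an assumption. The point is only that the setting is rejected
before the new value is even parsed when the mode has no notion of a
primary slave.

```c
#include <assert.h>
#include <stdbool.h>

/* Bonding mode numbers as listed in the bonding documentation (assumed). */
enum bond_mode {
	MODE_ROUNDROBIN   = 0,
	MODE_ACTIVEBACKUP = 1,
	MODE_XOR          = 2,
	MODE_BROADCAST    = 3,
	MODE_8023AD       = 4,
	MODE_TLB          = 5,
	MODE_ALB          = 6,
};

/* Only these three modes have the notion of a primary slave. */
static bool uses_primary(enum bond_mode mode)
{
	return mode == MODE_ACTIVEBACKUP ||
	       mode == MODE_TLB ||
	       mode == MODE_ALB;
}

/* Returns 0 and updates *setting, or -1 if the mode cannot use it. */
static int store_primary_reselect(enum bond_mode mode, int *setting,
				  int new_value)
{
	if (!uses_primary(mode))
		return -1;
	*setting = new_value;
	return 0;
}
```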

* [PATCH net 3/3] bonding:force to use primary slave
From: Weiping Pan @ 2012-06-11  9:00 UTC
  To: netdev
In-Reply-To: <cover.1339404887.git.wpan@redhat.com>

When the primary slave is set with module parameters, the bond always
uses that primary slave as the active slave.

But when the primary slave is modified via sysfs, the code calls
bond_should_change_active(), which takes primary_reselect into account.

I think we should use the new primary slave as the new active slave
regardless of the value of primary_reselect.
The behavior is then the same as with module parameters and meets the
administrator's expectation.

Signed-off-by: Weiping Pan <wpan@redhat.com>
---
 drivers/net/bonding/bond_sysfs.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 1b0f3cd..7256ae4 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1077,6 +1077,7 @@ static ssize_t bonding_store_primary(struct device *d,
 				bond->dev->name, slave->dev->name);
 			bond->primary_slave = slave;
 			strcpy(bond->params.primary, slave->dev->name);
+			bond->force_primary = true;
 			bond_select_active_slave(bond);
 			goto out;
 		}
-- 
1.7.4
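
Why setting force_primary bypasses primary_reselect can be shown with a
simplified model of the decision made in bond_should_change_active(). This
is a sketch under assumptions, not the driver's code: the struct and enum
names are hypothetical, and the "better" policy compares link speed only,
while the real function may consider more than that.

```c
#include <assert.h>
#include <stdbool.h>

enum pri_reselect { RESELECT_ALWAYS, RESELECT_BETTER, RESELECT_FAILURE };

/* Hypothetical, reduced view of a slave: link state and speed only. */
struct slave_model {
	bool link_up;
	int  speed;
};

struct bond_model {
	const struct slave_model *primary;
	const struct slave_model *active;
	enum pri_reselect reselect;
	bool force_primary;
};

/* Returns true when the active slave should be switched to the primary. */
static bool should_change_active(struct bond_model *b)
{
	if (!b->primary || !b->active || !b->active->link_up)
		return true;
	if (b->force_primary) {
		b->force_primary = false;	/* one-shot override */
		return true;
	}
	if (b->reselect == RESELECT_BETTER &&
	    b->primary->speed <= b->active->speed)
		return false;
	if (b->reselect == RESELECT_FAILURE)
		return false;
	return true;
}
```

In this model the flag is consumed by the first decision after it is set,
so later reselection events fall back to the configured primary_reselect
policy.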

* Re: [PATCH 1/3] net: lpc_eth: Replace WARN() trace with simple pr_warn()
From: David Miller @ 2012-06-11  9:03 UTC
  To: stigge
  Cc: eric.dumazet, netdev, linux-kernel, kevin.wells, srinivas.bakki,
	aletes.xgr, linux-arm-kernel
In-Reply-To: <4FD5AE1D.9030807@antcom.de>

From: Roland Stigge <stigge@antcom.de>
Date: Mon, 11 Jun 2012 10:36:45 +0200

> But maybe this is wrong. Can you please give me a hint how the net
> subsystem makes sure that this doesn't happen under normal circumstances?

Well if you are asking this question then you didn't read my feedback,
because I explained exactly what prevents this.

* [PATCH] lpc_eth: add missing ndo_change_mtu()
From: Eric Dumazet @ 2012-06-11  9:24 UTC
  To: David Miller; +Cc: netdev, stigge, kevin.wells, aletes.xgr, srinivas.bakki

From: Eric Dumazet <edumazet@google.com>

lpc_eth copies transmitted skbs into the DMA area without checking
skb lengths, so it can trigger buffer overflows:

memcpy(pldat->tx_buff_v + txidx * ENET_MAXF_SIZE, skb->data, len);

One way to get bigger skbs is to allow MTU changes above the 1500 limit.

Calling eth_change_mtu() in ndo_change_mtu() makes sure this cannot
happen.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Roland Stigge <stigge@antcom.de>
Cc: Kevin Wells <kevin.wells@nxp.com>
---
diff --git a/drivers/net/ethernet/nxp/lpc_eth.c b/drivers/net/ethernet/nxp/lpc_eth.c
index 8d2666f..10febdc 100644
--- a/drivers/net/ethernet/nxp/lpc_eth.c
+++ b/drivers/net/ethernet/nxp/lpc_eth.c
@@ -1320,6 +1320,7 @@ static const struct net_device_ops lpc_netdev_ops = {
 	.ndo_set_rx_mode	= lpc_eth_set_multicast_list,
 	.ndo_do_ioctl		= lpc_eth_ioctl,
 	.ndo_set_mac_address	= lpc_set_mac_address,
+	.ndo_change_mtu		= eth_change_mtu,
 };
 
 static int lpc_eth_drv_probe(struct platform_device *pdev)
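
The two safeguards involved here can be sketched in user space. Everything
below is an illustration under assumptions: check_mtu() and copy_to_slot()
are hypothetical helpers, SLOT_SIZE is an assumed stand-in for the driver's
ENET_MAXF_SIZE, and the 68..1500 range is the check that eth_change_mtu()
performed in kernels of this era as far as I recall. The patch relies on
the first safeguard only; the bounded copy is shown as a complementary
defence, not as something the patch adds.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE 1536	/* assumed per-descriptor DMA slot size */
#define MIN_MTU     68	/* assumed lower bound of the MTU check */
#define MAX_MTU   1500	/* standard Ethernet payload size */

/* Rejects MTU values outside the standard Ethernet range. */
static int check_mtu(int new_mtu)
{
	if (new_mtu < MIN_MTU || new_mtu > MAX_MTU)
		return -EINVAL;
	return 0;
}

/* Copies a frame into slot idx, refusing anything that would overflow it. */
static int copy_to_slot(unsigned char *area, size_t idx,
			const void *data, size_t len)
{
	if (len > SLOT_SIZE)
		return -EMSGSIZE;
	memcpy(area + idx * SLOT_SIZE, data, len);
	return 0;
}
```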

* Re: [PATCH 1/3] net: lpc_eth: Replace WARN() trace with simple pr_warn()
From: Roland Stigge @ 2012-06-11  9:26 UTC
  To: David Miller
  Cc: eric.dumazet, netdev, linux-kernel, kevin.wells, srinivas.bakki,
	aletes.xgr, linux-arm-kernel
In-Reply-To: <20120611.020352.1962768244524496467.davem@davemloft.net>

Hi!

On 06/11/2012 11:03 AM, David Miller wrote:
> From: Roland Stigge <stigge@antcom.de>
> Date: Mon, 11 Jun 2012 10:36:45 +0200
> 
>> But maybe this is wrong. Can you please give me a hint how the net
>> subsystem makes sure that this doesn't happen under normal circumstances?
> 
> Well if you are asking this question then you didn't read my feedback,
> because I explained exactly what prevents this.

Re-reading your feedback, you are right, sorry!

My question was based on the assumption that the driver was behaving
correctly, which was wrong.

Thank you and Eric for clarifying!

Eric's second (cumulative) patch works fine for now, and I can't
reproduce the issue. Will do more test runs now and will reply back
later with an updated patch set.

Is it sensible at this point to increase the number of TX buffers anyway?
For different reasons, of course: we have enough SRAM available, and TX
buffers (16->32) would still be fewer than RX buffers (48).

Roland

* [PATCH 0/5] Inetpeer roots in FIB tables
From: David Miller @ 2012-06-11  9:28 UTC
  To: netdev

This patch series should fix the problem in the bugzilla Stephen
forwarded last week, in that we won't cache metrics properly for
source based routes.

Committed to net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>

* [PATCH 1/5] inet: Hide route peer accesses behind helpers.
From: David Miller @ 2012-06-11  9:29 UTC
  To: netdev


We encode the pointer(s) into an unsigned long with one state bit.

The state bit is used so we can store the inetpeer tree root to use
when resolving the peer later.

Later the peer roots will be per-FIB table, and this change works to
facilitate that.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/inetpeer.h  |   54 +++++++++++++++++++++++++++++++++++++++++++++
 include/net/ip6_fib.h   |   32 ++++++++++++++++++++++++++-
 include/net/ip6_route.h |    6 ++---
 include/net/route.h     |   42 +++++++++++++++++++++++++++++++----
 net/ipv4/route.c        |   56 ++++++++++++++++++++++++++++-------------------
 net/ipv4/xfrm4_policy.c |   10 ++++-----
 net/ipv6/route.c        |   42 ++++++++++++++++++++---------------
 net/ipv6/xfrm6_policy.c |   10 ++++-----
 8 files changed, 193 insertions(+), 59 deletions(-)

diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index b84b32f..d432489 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -71,6 +71,60 @@ struct inet_peer_base {
 	int			total;
 };
 
+#define INETPEER_BASE_BIT	0x1UL
+
+static inline struct inet_peer *inetpeer_ptr(unsigned long val)
+{
+	BUG_ON(val & INETPEER_BASE_BIT);
+	return (struct inet_peer *) val;
+}
+
+static inline struct inet_peer_base *inetpeer_base_ptr(unsigned long val)
+{
+	if (!(val & INETPEER_BASE_BIT))
+		return NULL;
+	val &= ~INETPEER_BASE_BIT;
+	return (struct inet_peer_base *) val;
+}
+
+static inline bool inetpeer_ptr_is_peer(unsigned long val)
+{
+	return !(val & INETPEER_BASE_BIT);
+}
+
+static inline void __inetpeer_ptr_set_peer(unsigned long *val, struct inet_peer *peer)
+{
+	/* This implicitly clears INETPEER_BASE_BIT */
+	*val = (unsigned long) peer;
+}
+
+static inline bool inetpeer_ptr_set_peer(unsigned long *ptr, struct inet_peer *peer)
+{
+	unsigned long val = (unsigned long) peer;
+	unsigned long orig = *ptr;
+
+	if (!(orig & INETPEER_BASE_BIT) || !val ||
+	    cmpxchg(ptr, orig, val) != orig)
+		return false;
+	return true;
+}
+
+static inline void inetpeer_init_ptr(unsigned long *ptr, struct inet_peer_base *base)
+{
+	*ptr = (unsigned long) base | INETPEER_BASE_BIT;
+}
+
+static inline void inetpeer_transfer_peer(unsigned long *to, unsigned long *from)
+{
+	unsigned long val = *from;
+
+	*to = val;
+	if (inetpeer_ptr_is_peer(val)) {
+		struct inet_peer *peer = inetpeer_ptr(val);
+		atomic_inc(&peer->refcnt);
+	}
+}
+
 extern void inet_peer_base_init(struct inet_peer_base *);
 
 void			inet_initpeers(void) __init;
diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 0ae759a..3ac5f15 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -107,7 +107,7 @@ struct rt6_info {
 	u32				rt6i_peer_genid;
 
 	struct inet6_dev		*rt6i_idev;
-	struct inet_peer		*rt6i_peer;
+	unsigned long			_rt6i_peer;
 
 #ifdef CONFIG_XFRM
 	u32				rt6i_flow_cache_genid;
@@ -118,6 +118,36 @@ struct rt6_info {
 	u8				rt6i_protocol;
 };
 
+static inline struct inet_peer *rt6_peer_ptr(struct rt6_info *rt)
+{
+	return inetpeer_ptr(rt->_rt6i_peer);
+}
+
+static inline bool rt6_has_peer(struct rt6_info *rt)
+{
+	return inetpeer_ptr_is_peer(rt->_rt6i_peer);
+}
+
+static inline void __rt6_set_peer(struct rt6_info *rt, struct inet_peer *peer)
+{
+	__inetpeer_ptr_set_peer(&rt->_rt6i_peer, peer);
+}
+
+static inline bool rt6_set_peer(struct rt6_info *rt, struct inet_peer *peer)
+{
+	return inetpeer_ptr_set_peer(&rt->_rt6i_peer, peer);
+}
+
+static inline void rt6_init_peer(struct rt6_info *rt, struct inet_peer_base *base)
+{
+	inetpeer_init_ptr(&rt->_rt6i_peer, base);
+}
+
+static inline void rt6_transfer_peer(struct rt6_info *rt, struct rt6_info *ort)
+{
+	inetpeer_transfer_peer(&rt->_rt6i_peer, &ort->_rt6i_peer);
+}
+
 static inline struct inet6_dev *ip6_dst_idev(struct dst_entry *dst)
 {
 	return ((struct rt6_info *)dst)->rt6i_idev;
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 73d7502..f88a85c 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -57,11 +57,11 @@ extern void rt6_bind_peer(struct rt6_info *rt, int create);
 
 static inline struct inet_peer *__rt6_get_peer(struct rt6_info *rt, int create)
 {
-	if (rt->rt6i_peer)
-		return rt->rt6i_peer;
+	if (rt6_has_peer(rt))
+		return rt6_peer_ptr(rt);
 
 	rt6_bind_peer(rt, create);
-	return rt->rt6i_peer;
+	return rt6_peer_ptr(rt);
 }
 
 static inline struct inet_peer *rt6_get_peer(struct rt6_info *rt)
diff --git a/include/net/route.h b/include/net/route.h
index 433fc6c..6340c37 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -67,10 +67,44 @@ struct rtable {
 	/* Miscellaneous cached information */
 	__be32			rt_spec_dst; /* RFC1122 specific destination */
 	u32			rt_peer_genid;
-	struct inet_peer	*peer; /* long-living peer info */
+	unsigned long		_peer; /* long-living peer info */
 	struct fib_info		*fi; /* for client ref to shared metrics */
 };
 
+static inline struct inet_peer *rt_peer_ptr(struct rtable *rt)
+{
+	return inetpeer_ptr(rt->_peer);
+}
+
+static inline bool rt_has_peer(struct rtable *rt)
+{
+	return inetpeer_ptr_is_peer(rt->_peer);
+}
+
+static inline void __rt_set_peer(struct rtable *rt, struct inet_peer *peer)
+{
+	__inetpeer_ptr_set_peer(&rt->_peer, peer);
+}
+
+static inline bool rt_set_peer(struct rtable *rt, struct inet_peer *peer)
+{
+	return inetpeer_ptr_set_peer(&rt->_peer, peer);
+}
+
+static inline void rt_init_peer(struct rtable *rt, struct inet_peer_base *base)
+{
+	inetpeer_init_ptr(&rt->_peer, base);
+}
+
+static inline void rt_transfer_peer(struct rtable *rt, struct rtable *ort)
+{
+	rt->_peer = ort->_peer;
+	if (rt_has_peer(ort)) {
+		struct inet_peer *peer = rt_peer_ptr(ort);
+		atomic_inc(&peer->refcnt);
+	}
+}
+
 static inline bool rt_is_input_route(const struct rtable *rt)
 {
 	return rt->rt_route_iif != 0;
@@ -298,11 +332,11 @@ extern void rt_bind_peer(struct rtable *rt, __be32 daddr, int create);
 
 static inline struct inet_peer *__rt_get_peer(struct rtable *rt, __be32 daddr, int create)
 {
-	if (rt->peer)
-		return rt->peer;
+	if (rt_has_peer(rt))
+		return rt_peer_ptr(rt);
 
 	rt_bind_peer(rt, daddr, create);
-	return rt->peer;
+	return rt_peer_ptr(rt);
 }
 
 static inline struct inet_peer *rt_get_peer(struct rtable *rt, __be32 daddr)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 2aa663a..03e5b61 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -677,7 +677,7 @@ static inline int rt_fast_clean(struct rtable *rth)
 static inline int rt_valuable(struct rtable *rth)
 {
 	return (rth->rt_flags & (RTCF_REDIRECTED | RTCF_NOTIFY)) ||
-		(rth->peer && rth->peer->pmtu_expires);
+		(rt_has_peer(rth) && rt_peer_ptr(rth)->pmtu_expires);
 }
 
 static int rt_may_expire(struct rtable *rth, unsigned long tmo1, unsigned long tmo2)
@@ -1325,12 +1325,16 @@ static u32 rt_peer_genid(void)
 
 void rt_bind_peer(struct rtable *rt, __be32 daddr, int create)
 {
-	struct net *net = dev_net(rt->dst.dev);
+	struct inet_peer_base *base;
 	struct inet_peer *peer;
 
-	peer = inet_getpeer_v4(net->ipv4.peers, daddr, create);
+	base = inetpeer_base_ptr(rt->_peer);
+	if (!base)
+		return;
+
+	peer = inet_getpeer_v4(base, daddr, create);
 
-	if (peer && cmpxchg(&rt->peer, NULL, peer) != NULL)
+	if (!rt_set_peer(rt, peer))
 		inet_putpeer(peer);
 	else
 		rt->rt_peer_genid = rt_peer_genid();
@@ -1533,8 +1537,10 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst)
 						rt_genid(dev_net(dst->dev)));
 			rt_del(hash, rt);
 			ret = NULL;
-		} else if (rt->peer && peer_pmtu_expired(rt->peer)) {
-			dst_metric_set(dst, RTAX_MTU, rt->peer->pmtu_orig);
+		} else if (rt_has_peer(rt)) {
+			struct inet_peer *peer = rt_peer_ptr(rt);
+			if (peer_pmtu_expired(peer))
+				dst_metric_set(dst, RTAX_MTU, peer->pmtu_orig);
 		}
 	}
 	return ret;
@@ -1796,14 +1802,13 @@ static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
 static void ipv4_dst_destroy(struct dst_entry *dst)
 {
 	struct rtable *rt = (struct rtable *) dst;
-	struct inet_peer *peer = rt->peer;
 
 	if (rt->fi) {
 		fib_info_put(rt->fi);
 		rt->fi = NULL;
 	}
-	if (peer) {
-		rt->peer = NULL;
+	if (rt_has_peer(rt)) {
+		struct inet_peer *peer = rt_peer_ptr(rt);
 		inet_putpeer(peer);
 	}
 }
@@ -1816,8 +1821,11 @@ static void ipv4_link_failure(struct sk_buff *skb)
 	icmp_send(skb, ICMP_DEST_UNREACH, ICMP_HOST_UNREACH, 0);
 
 	rt = skb_rtable(skb);
-	if (rt && rt->peer && peer_pmtu_cleaned(rt->peer))
-		dst_metric_set(&rt->dst, RTAX_MTU, rt->peer->pmtu_orig);
+	if (rt && rt_has_peer(rt)) {
+		struct inet_peer *peer = rt_peer_ptr(rt);
+		if (peer_pmtu_cleaned(peer))
+			dst_metric_set(&rt->dst, RTAX_MTU, peer->pmtu_orig);
+	}
 }
 
 static int ip_rt_bug(struct sk_buff *skb)
@@ -1919,7 +1927,7 @@ static unsigned int ipv4_mtu(const struct dst_entry *dst)
 static void rt_init_metrics(struct rtable *rt, const struct flowi4 *fl4,
 			    struct fib_info *fi)
 {
-	struct net *net = dev_net(rt->dst.dev);
+	struct inet_peer_base *base;
 	struct inet_peer *peer;
 	int create = 0;
 
@@ -1929,8 +1937,12 @@ static void rt_init_metrics(struct rtable *rt, const struct flowi4 *fl4,
 	if (fl4 && (fl4->flowi4_flags & FLOWI_FLAG_PRECOW_METRICS))
 		create = 1;
 
-	rt->peer = peer = inet_getpeer_v4(net->ipv4.peers, rt->rt_dst, create);
+	base = inetpeer_base_ptr(rt->_peer);
+	BUG_ON(!base);
+
+	peer = inet_getpeer_v4(base, rt->rt_dst, create);
 	if (peer) {
+		__rt_set_peer(rt, peer);
 		rt->rt_peer_genid = rt_peer_genid();
 		if (inet_metrics_new(peer))
 			memcpy(peer->metrics, fi->fib_metrics,
@@ -2046,7 +2058,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	rth->rt_gateway	= daddr;
 	rth->rt_spec_dst= spec_dst;
 	rth->rt_peer_genid = 0;
-	rth->peer = NULL;
+	rt_init_peer(rth, dev_net(dev)->ipv4.peers);
 	rth->fi = NULL;
 	if (our) {
 		rth->dst.input= ip_local_deliver;
@@ -2174,7 +2186,7 @@ static int __mkroute_input(struct sk_buff *skb,
 	rth->rt_gateway	= daddr;
 	rth->rt_spec_dst= spec_dst;
 	rth->rt_peer_genid = 0;
-	rth->peer = NULL;
+	rt_init_peer(rth, dev_net(rth->dst.dev)->ipv4.peers);
 	rth->fi = NULL;
 
 	rth->dst.input = ip_forward;
@@ -2357,7 +2369,7 @@ local_input:
 	rth->rt_gateway	= daddr;
 	rth->rt_spec_dst= spec_dst;
 	rth->rt_peer_genid = 0;
-	rth->peer = NULL;
+	rt_init_peer(rth, net->ipv4.peers);
 	rth->fi = NULL;
 	if (res.type == RTN_UNREACHABLE) {
 		rth->dst.input= ip_error;
@@ -2561,7 +2573,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	rth->rt_gateway = fl4->daddr;
 	rth->rt_spec_dst= fl4->saddr;
 	rth->rt_peer_genid = 0;
-	rth->peer = NULL;
+	rt_init_peer(rth, dev_net(dev_out)->ipv4.peers);
 	rth->fi = NULL;
 
 	RT_CACHE_STAT_INC(out_slow_tot);
@@ -2898,9 +2910,7 @@ struct dst_entry *ipv4_blackhole_route(struct net *net, struct dst_entry *dst_or
 		rt->rt_src = ort->rt_src;
 		rt->rt_gateway = ort->rt_gateway;
 		rt->rt_spec_dst = ort->rt_spec_dst;
-		rt->peer = ort->peer;
-		if (rt->peer)
-			atomic_inc(&rt->peer->refcnt);
+		rt_transfer_peer(rt, ort);
 		rt->fi = ort->fi;
 		if (rt->fi)
 			atomic_inc(&rt->fi->fib_clntref);
@@ -2938,7 +2948,6 @@ static int rt_fill_info(struct net *net,
 	struct rtmsg *r;
 	struct nlmsghdr *nlh;
 	unsigned long expires = 0;
-	const struct inet_peer *peer = rt->peer;
 	u32 id = 0, ts = 0, tsage = 0, error;
 
 	nlh = nlmsg_put(skb, pid, seq, event, sizeof(*r), flags);
@@ -2994,8 +3003,9 @@ static int rt_fill_info(struct net *net,
 		goto nla_put_failure;
 
 	error = rt->dst.error;
-	if (peer) {
-		inet_peer_refcheck(rt->peer);
+	if (rt_has_peer(rt)) {
+		const struct inet_peer *peer = rt_peer_ptr(rt);
+		inet_peer_refcheck(peer);
 		id = atomic_read(&peer->ip_id_count) & 0xffff;
 		if (peer->tcp_ts_stamp) {
 			ts = peer->tcp_ts;
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index 0d3426c..8855d82 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -90,9 +90,7 @@ static int xfrm4_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
 	xdst->u.dst.dev = dev;
 	dev_hold(dev);
 
-	xdst->u.rt.peer = rt->peer;
-	if (rt->peer)
-		atomic_inc(&rt->peer->refcnt);
+	rt_transfer_peer(&xdst->u.rt, rt);
 
 	/* Sheit... I remember I did this right. Apparently,
 	 * it was magically lost, so this code needs audit */
@@ -212,8 +210,10 @@ static void xfrm4_dst_destroy(struct dst_entry *dst)
 
 	dst_destroy_metrics_generic(dst);
 
-	if (likely(xdst->u.rt.peer))
-		inet_putpeer(xdst->u.rt.peer);
+	if (rt_has_peer(&xdst->u.rt)) {
+		struct inet_peer *peer = rt_peer_ptr(&xdst->u.rt);
+		inet_putpeer(peer);
+	}
 
 	xfrm_dst_destroy(xdst);
 }
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 8fc41d5..17a9b86 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -258,16 +258,18 @@ static struct rt6_info ip6_blk_hole_entry_template = {
 #endif
 
 /* allocate dst with ip6_dst_ops */
-static inline struct rt6_info *ip6_dst_alloc(struct dst_ops *ops,
+static inline struct rt6_info *ip6_dst_alloc(struct net *net,
 					     struct net_device *dev,
 					     int flags)
 {
-	struct rt6_info *rt = dst_alloc(ops, dev, 0, 0, flags);
+	struct rt6_info *rt = dst_alloc(&net->ipv6.ip6_dst_ops, dev,
+					0, 0, flags);
 
-	if (rt)
+	if (rt) {
 		memset(&rt->rt6i_table, 0,
 		       sizeof(*rt) - sizeof(struct dst_entry));
-
+		rt6_init_peer(rt, net->ipv6.peers);
+	}
 	return rt;
 }
 
@@ -275,7 +277,6 @@ static void ip6_dst_destroy(struct dst_entry *dst)
 {
 	struct rt6_info *rt = (struct rt6_info *)dst;
 	struct inet6_dev *idev = rt->rt6i_idev;
-	struct inet_peer *peer = rt->rt6i_peer;
 
 	if (!(rt->dst.flags & DST_HOST))
 		dst_destroy_metrics_generic(dst);
@@ -288,8 +289,8 @@ static void ip6_dst_destroy(struct dst_entry *dst)
 	if (!(rt->rt6i_flags & RTF_EXPIRES) && dst->from)
 		dst_release(dst->from);
 
-	if (peer) {
-		rt->rt6i_peer = NULL;
+	if (rt6_has_peer(rt)) {
+		struct inet_peer *peer = rt6_peer_ptr(rt);
 		inet_putpeer(peer);
 	}
 }
@@ -303,11 +304,15 @@ static u32 rt6_peer_genid(void)
 
 void rt6_bind_peer(struct rt6_info *rt, int create)
 {
-	struct net *net = dev_net(rt->dst.dev);
+	struct inet_peer_base *base;
 	struct inet_peer *peer;
 
-	peer = inet_getpeer_v6(net->ipv6.peers, &rt->rt6i_dst.addr, create);
-	if (peer && cmpxchg(&rt->rt6i_peer, NULL, peer) != NULL)
+	base = inetpeer_base_ptr(rt->_rt6i_peer);
+	if (!base)
+		return;
+
+	peer = inet_getpeer_v6(base, &rt->rt6i_dst.addr, create);
+	if (!rt6_set_peer(rt, peer))
 		inet_putpeer(peer);
 	else
 		rt->rt6i_peer_genid = rt6_peer_genid();
@@ -950,6 +955,7 @@ struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_ori
 	rt = dst_alloc(&ip6_dst_blackhole_ops, ort->dst.dev, 1, 0, 0);
 	if (rt) {
 		memset(&rt->rt6i_table, 0, sizeof(*rt) - sizeof(struct dst_entry));
+		rt6_init_peer(rt, net->ipv6.peers);
 
 		new = &rt->dst;
 
@@ -994,7 +1000,7 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 
 	if (rt->rt6i_node && (rt->rt6i_node->fn_sernum == cookie)) {
 		if (rt->rt6i_peer_genid != rt6_peer_genid()) {
-			if (!rt->rt6i_peer)
+			if (!rt6_has_peer(rt))
 				rt6_bind_peer(rt, 0);
 			rt->rt6i_peer_genid = rt6_peer_genid();
 		}
@@ -1108,7 +1114,7 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
 	if (unlikely(!idev))
 		return ERR_PTR(-ENODEV);
 
-	rt = ip6_dst_alloc(&net->ipv6.ip6_dst_ops, dev, 0);
+	rt = ip6_dst_alloc(net, dev, 0);
 	if (unlikely(!rt)) {
 		in6_dev_put(idev);
 		dst = ERR_PTR(-ENOMEM);
@@ -1290,7 +1296,7 @@ int ip6_route_add(struct fib6_config *cfg)
 	if (!table)
 		goto out;
 
-	rt = ip6_dst_alloc(&net->ipv6.ip6_dst_ops, NULL, DST_NOCOUNT);
+	rt = ip6_dst_alloc(net, NULL, DST_NOCOUNT);
 
 	if (!rt) {
 		err = -ENOMEM;
@@ -1812,8 +1818,7 @@ static struct rt6_info *ip6_rt_copy(struct rt6_info *ort,
 				    const struct in6_addr *dest)
 {
 	struct net *net = dev_net(ort->dst.dev);
-	struct rt6_info *rt = ip6_dst_alloc(&net->ipv6.ip6_dst_ops,
-					    ort->dst.dev, 0);
+	struct rt6_info *rt = ip6_dst_alloc(net, ort->dst.dev, 0);
 
 	if (rt) {
 		rt->dst.input = ort->dst.input;
@@ -2097,8 +2102,7 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
 				    bool anycast)
 {
 	struct net *net = dev_net(idev->dev);
-	struct rt6_info *rt = ip6_dst_alloc(&net->ipv6.ip6_dst_ops,
-					    net->loopback_dev, 0);
+	struct rt6_info *rt = ip6_dst_alloc(net, net->loopback_dev, 0);
 	int err;
 
 	if (!rt) {
@@ -2519,7 +2523,9 @@ static int rt6_fill_node(struct net *net,
 	else
 		expires = INT_MAX;
 
-	peer = rt->rt6i_peer;
+	peer = NULL;
+	if (rt6_has_peer(rt))
+		peer = rt6_peer_ptr(rt);
 	ts = tsage = 0;
 	if (peer && peer->tcp_ts_stamp) {
 		ts = peer->tcp_ts;
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index 8625fba..d749484 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -99,9 +99,7 @@ static int xfrm6_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
 	if (!xdst->u.rt6.rt6i_idev)
 		return -ENODEV;
 
-	xdst->u.rt6.rt6i_peer = rt->rt6i_peer;
-	if (rt->rt6i_peer)
-		atomic_inc(&rt->rt6i_peer->refcnt);
+	rt6_transfer_peer(&xdst->u.rt6, rt);
 
 	/* Sheit... I remember I did this right. Apparently,
 	 * it was magically lost, so this code needs audit */
@@ -223,8 +221,10 @@ static void xfrm6_dst_destroy(struct dst_entry *dst)
 	if (likely(xdst->u.rt6.rt6i_idev))
 		in6_dev_put(xdst->u.rt6.rt6i_idev);
 	dst_destroy_metrics_generic(dst);
-	if (likely(xdst->u.rt6.rt6i_peer))
-		inet_putpeer(xdst->u.rt6.rt6i_peer);
+	if (rt6_has_peer(&xdst->u.rt6)) {
+		struct inet_peer *peer = rt6_peer_ptr(&xdst->u.rt6);
+		inet_putpeer(peer);
+	}
 	xfrm_dst_destroy(xdst);
 }
 
-- 
1.7.10
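
The encoding described in the commit message, a pointer stored in an integer
with one state bit, can be tried out in user space. The sketch below is
only an analogue of the helpers in this patch: the slot_*() names and the
two structs are hypothetical, uintptr_t replaces unsigned long, and C11
atomic_compare_exchange_strong() stands in for the kernel's cmpxchg(). It
relies on both structs being at least 2-byte aligned, so the low bit of a
valid pointer is always zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BASE_BIT ((uintptr_t)0x1)	/* set: slot holds a tree root */

struct peer      { int refcnt; };	/* hypothetical stand-ins */
struct peer_base { int total; };

/* Encodes a tree root into a slot value, marked with the state bit. */
static uintptr_t slot_init(struct peer_base *base)
{
	return (uintptr_t)base | BASE_BIT;
}

static bool slot_is_peer(uintptr_t val)
{
	return !(val & BASE_BIT);
}

/* Returns the tree root, or NULL once a peer has been installed. */
static struct peer_base *slot_base(uintptr_t val)
{
	if (!(val & BASE_BIT))
		return NULL;
	return (struct peer_base *)(val & ~BASE_BIT);
}

static struct peer *slot_peer(uintptr_t val)
{
	assert(slot_is_peer(val));
	return (struct peer *)val;
}

/* Installs peer exactly once; fails if a peer is already present. */
static bool slot_set_peer(_Atomic uintptr_t *slot, struct peer *peer)
{
	uintptr_t orig = atomic_load(slot);
	uintptr_t val = (uintptr_t)peer;

	if (!(orig & BASE_BIT) || !val)
		return false;
	return atomic_compare_exchange_strong(slot, &orig, val);
}
```

A caller that loses the race in slot_set_peer() gets false back and is
expected to drop its own reference, which corresponds to the
inet_putpeer() call on the failure path of rt_bind_peer() above.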

* [PATCH 2/5] ipv4: Kill ip_rt_frag_needed().
From: David Miller @ 2012-06-11  9:29 UTC
  To: netdev


There is zero point to this function.

Its only real substance is to perform an extremely outdated BSD4.2
ICMP check, which we can safely remove.  If you really have a MTU
limited link being routed by a BSD4.2 derived system, here's a nickel
go buy yourself a real router.

The other actions of ip_rt_frag_needed(), checking and conditionally
updating the peer, are done by the per-protocol handlers of the ICMP
event.

TCP, UDP, et al. have a handler which will receive this event and
transmit it back into the associated route via dst_ops->update_pmtu().

This simplification is important, because it eliminates the one place
where we do not have a proper route context in which to make an
inetpeer lookup.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/route.h  |    2 --
 net/ipv4/icmp.c      |    4 +---
 net/ipv4/route.c     |   61 --------------------------------------------------
 net/rxrpc/ar-error.c |    4 ----
 4 files changed, 1 insertion(+), 70 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 6340c37..cc693a5 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -215,8 +215,6 @@ static inline int ip_route_input_noref(struct sk_buff *skb, __be32 dst, __be32 s
 	return ip_route_input_common(skb, dst, src, tos, devin, true);
 }
 
-extern unsigned short	ip_rt_frag_needed(struct net *net, const struct iphdr *iph,
-					  unsigned short new_mtu, struct net_device *dev);
 extern void		ip_rt_send_redirect(struct sk_buff *skb);
 
 extern unsigned int		inet_addr_type(struct net *net, __be32 addr);
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 0c78ef1..e1caa1a 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -673,9 +673,7 @@ static void icmp_unreach(struct sk_buff *skb)
 				LIMIT_NETDEBUG(KERN_INFO pr_fmt("%pI4: fragmentation needed and DF set\n"),
 					       &iph->daddr);
 			} else {
-				info = ip_rt_frag_needed(net, iph,
-							 ntohs(icmph->un.frag.mtu),
-							 skb->dev);
+				info = ntohs(icmph->un.frag.mtu);
 				if (!info)
 					goto out;
 			}
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 03e5b61..4f5834c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1664,67 +1664,6 @@ out:	kfree_skb(skb);
 	return 0;
 }
 
-/*
- *	The last two values are not from the RFC but
- *	are needed for AMPRnet AX.25 paths.
- */
-
-static const unsigned short mtu_plateau[] =
-{32000, 17914, 8166, 4352, 2002, 1492, 576, 296, 216, 128 };
-
-static inline unsigned short guess_mtu(unsigned short old_mtu)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(mtu_plateau); i++)
-		if (old_mtu > mtu_plateau[i])
-			return mtu_plateau[i];
-	return 68;
-}
-
-unsigned short ip_rt_frag_needed(struct net *net, const struct iphdr *iph,
-				 unsigned short new_mtu,
-				 struct net_device *dev)
-{
-	unsigned short old_mtu = ntohs(iph->tot_len);
-	unsigned short est_mtu = 0;
-	struct inet_peer *peer;
-
-	peer = inet_getpeer_v4(net->ipv4.peers, iph->daddr, 1);
-	if (peer) {
-		unsigned short mtu = new_mtu;
-
-		if (new_mtu < 68 || new_mtu >= old_mtu) {
-			/* BSD 4.2 derived systems incorrectly adjust
-			 * tot_len by the IP header length, and report
-			 * a zero MTU in the ICMP message.
-			 */
-			if (mtu == 0 &&
-			    old_mtu >= 68 + (iph->ihl << 2))
-				old_mtu -= iph->ihl << 2;
-			mtu = guess_mtu(old_mtu);
-		}
-
-		if (mtu < ip_rt_min_pmtu)
-			mtu = ip_rt_min_pmtu;
-		if (!peer->pmtu_expires || mtu < peer->pmtu_learned) {
-			unsigned long pmtu_expires;
-
-			pmtu_expires = jiffies + ip_rt_mtu_expires;
-			if (!pmtu_expires)
-				pmtu_expires = 1UL;
-
-			est_mtu = mtu;
-			peer->pmtu_learned = mtu;
-			peer->pmtu_expires = pmtu_expires;
-			atomic_inc(&__rt_peer_genid);
-		}
-
-		inet_putpeer(peer);
-	}
-	return est_mtu ? : new_mtu;
-}
-
 static void check_peer_pmtu(struct dst_entry *dst, struct inet_peer *peer)
 {
 	unsigned long expires = ACCESS_ONCE(peer->pmtu_expires);
diff --git a/net/rxrpc/ar-error.c b/net/rxrpc/ar-error.c
index 5d6b572..a920608 100644
--- a/net/rxrpc/ar-error.c
+++ b/net/rxrpc/ar-error.c
@@ -81,10 +81,6 @@ void rxrpc_UDP_error_report(struct sock *sk)
 			_net("I/F MTU %u", mtu);
 		}
 
-		/* ip_rt_frag_needed() may have eaten the info */
-		if (mtu == 0)
-			mtu = ntohs(icmp_hdr(skb)->un.frag.mtu);
-
 		if (mtu == 0) {
 			/* they didn't give us a size, estimate one */
 			if (mtu > 1500) {
-- 
1.7.10

^ permalink raw reply related

* [PATCH 3/5] inet: Add family scope inetpeer flushes.
From: David Miller @ 2012-06-11  9:29 UTC (permalink / raw)
  To: netdev


This implementation can deal with having many inetpeer roots, which is
a necessary prerequisite for per-FIB table rooted peer tables.

Each family (AF_INET, AF_INET6) has a sequence number which we bump
when we get a family invalidation request.

Each peer lookup cheaply checks whether the flush sequence of the
root we are using is out of date, and if so flushes it and updates
the sequence number.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/inetpeer.h |    2 ++
 net/ipv4/inetpeer.c    |   28 ++++++++++++++++++++++++++++
 net/ipv4/route.c       |    2 +-
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index d432489..e15c086 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -68,6 +68,7 @@ struct inet_peer {
 struct inet_peer_base {
 	struct inet_peer __rcu	*root;
 	seqlock_t		lock;
+	u32			flush_seq;
 	int			total;
 };
 
@@ -168,6 +169,7 @@ extern void inet_putpeer(struct inet_peer *p);
 extern bool inet_peer_xrlim_allow(struct inet_peer *peer, int timeout);
 
 extern void inetpeer_invalidate_tree(struct inet_peer_base *);
+extern void inetpeer_invalidate_family(int family);
 
 /*
  * temporary check to make sure we dont access rid, ip_id_count, tcp_ts,
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index e4cba56..cac02ad 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -86,10 +86,36 @@ void inet_peer_base_init(struct inet_peer_base *bp)
 {
 	bp->root = peer_avl_empty_rcu;
 	seqlock_init(&bp->lock);
+	bp->flush_seq = ~0U;
 	bp->total = 0;
 }
 EXPORT_SYMBOL_GPL(inet_peer_base_init);
 
+static atomic_t v4_seq = ATOMIC_INIT(0);
+static atomic_t v6_seq = ATOMIC_INIT(0);
+
+static atomic_t *inetpeer_seq_ptr(int family)
+{
+	return (family == AF_INET ? &v4_seq : &v6_seq);
+}
+
+static inline void flush_check(struct inet_peer_base *base, int family)
+{
+	atomic_t *fp = inetpeer_seq_ptr(family);
+
+	if (unlikely(base->flush_seq != atomic_read(fp))) {
+		inetpeer_invalidate_tree(base);
+		base->flush_seq = atomic_read(fp);
+	}
+}
+
+void inetpeer_invalidate_family(int family)
+{
+	atomic_t *fp = inetpeer_seq_ptr(family);
+
+	atomic_inc(fp);
+}
+
 #define PEER_MAXDEPTH 40 /* sufficient for about 2^27 nodes */
 
 /* Exported for sysctl_net_ipv4.  */
@@ -437,6 +463,8 @@ struct inet_peer *inet_getpeer(struct inet_peer_base *base,
 	unsigned int sequence;
 	int invalidated, gccnt = 0;
 
+	flush_check(base, daddr->family);
+
 	/* Attempt a lockless lookup first.
 	 * Because of a concurrent writer, we might not find an existing entry.
 	 */
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 4f5834c..456a947 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -935,7 +935,7 @@ static void rt_cache_invalidate(struct net *net)
 
 	get_random_bytes(&shuffle, sizeof(shuffle));
 	atomic_add(shuffle + 1U, &net->ipv4.rt_genid);
-	inetpeer_invalidate_tree(net->ipv4.peers);
+	inetpeer_invalidate_family(AF_INET);
 }
 
 /*
-- 
1.7.10

^ permalink raw reply related

* [PATCH 4/5] inet: Add inetpeer tree roots to the FIB tables.
From: David Miller @ 2012-06-11  9:29 UTC (permalink / raw)
  To: netdev


Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/ip6_fib.h |    1 +
 include/net/ip_fib.h  |   12 +++++++-----
 net/ipv4/fib_trie.c   |    3 +++
 net/ipv6/ip6_fib.c    |    5 +++++
 4 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 3ac5f15..a192f78 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -237,6 +237,7 @@ struct fib6_table {
 	u32			tb6_id;
 	rwlock_t		tb6_lock;
 	struct fib6_node	tb6_root;
+	struct inet_peer_base	tb6_peers;
 };
 
 #define RT6_TABLE_UNSPEC	RT_TABLE_UNSPEC
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 78df0866..4b347c0 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -19,6 +19,7 @@
 #include <net/flow.h>
 #include <linux/seq_file.h>
 #include <net/fib_rules.h>
+#include <net/inetpeer.h>
 
 struct fib_config {
 	u8			fc_dst_len;
@@ -157,11 +158,12 @@ extern __be32 fib_info_update_nh_saddr(struct net *net, struct fib_nh *nh);
 					 FIB_RES_SADDR(net, res))
 
 struct fib_table {
-	struct hlist_node tb_hlist;
-	u32		tb_id;
-	int		tb_default;
-	int		tb_num_default;
-	unsigned long	tb_data[0];
+	struct hlist_node	tb_hlist;
+	u32			tb_id;
+	int			tb_default;
+	int			tb_num_default;
+	struct inet_peer_base	tb_peers;
+	unsigned long		tb_data[0];
 };
 
 extern int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 18cbc15..9b0f259 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1843,6 +1843,8 @@ int fib_table_flush(struct fib_table *tb)
 	if (ll && hlist_empty(&ll->list))
 		trie_leaf_remove(t, ll);
 
+	inetpeer_invalidate_tree(&tb->tb_peers);
+
 	pr_debug("trie_flush found=%d\n", found);
 	return found;
 }
@@ -1991,6 +1993,7 @@ struct fib_table *fib_trie_table(u32 id)
 	tb->tb_id = id;
 	tb->tb_default = -1;
 	tb->tb_num_default = 0;
+	inet_peer_base_init(&tb->tb_peers);
 
 	t = (struct trie *) tb->tb_data;
 	memset(t, 0, sizeof(*t));
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 0c220a4..7ef0743 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -197,6 +197,7 @@ static struct fib6_table *fib6_alloc_table(struct net *net, u32 id)
 		table->tb6_id = id;
 		table->tb6_root.leaf = net->ipv6.ip6_null_entry;
 		table->tb6_root.fn_flags = RTN_ROOT | RTN_TL_ROOT | RTN_RTINFO;
+		inet_peer_base_init(&table->tb6_peers);
 	}
 
 	return table;
@@ -1633,6 +1634,7 @@ static int __net_init fib6_net_init(struct net *net)
 	net->ipv6.fib6_main_tbl->tb6_root.leaf = net->ipv6.ip6_null_entry;
 	net->ipv6.fib6_main_tbl->tb6_root.fn_flags =
 		RTN_ROOT | RTN_TL_ROOT | RTN_RTINFO;
+	inet_peer_base_init(&net->ipv6.fib6_main_tbl->tb6_peers);
 
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
 	net->ipv6.fib6_local_tbl = kzalloc(sizeof(*net->ipv6.fib6_local_tbl),
@@ -1643,6 +1645,7 @@ static int __net_init fib6_net_init(struct net *net)
 	net->ipv6.fib6_local_tbl->tb6_root.leaf = net->ipv6.ip6_null_entry;
 	net->ipv6.fib6_local_tbl->tb6_root.fn_flags =
 		RTN_ROOT | RTN_TL_ROOT | RTN_RTINFO;
+	inet_peer_base_init(&net->ipv6.fib6_local_tbl->tb6_peers);
 #endif
 	fib6_tables_init(net);
 
@@ -1666,8 +1669,10 @@ static void fib6_net_exit(struct net *net)
 	del_timer_sync(&net->ipv6.ip6_fib_timer);
 
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
+	inetpeer_invalidate_tree(&net->ipv6.fib6_local_tbl->tb6_peers);
 	kfree(net->ipv6.fib6_local_tbl);
 #endif
+	inetpeer_invalidate_tree(&net->ipv6.fib6_main_tbl->tb6_peers);
 	kfree(net->ipv6.fib6_main_tbl);
 	kfree(net->ipv6.fib_table_hash);
 	kfree(net->ipv6.rt6_stats);
-- 
1.7.10

^ permalink raw reply related

* [PATCH 5/5] inet: Use FIB table peer roots in routes.
From: David Miller @ 2012-06-11  9:29 UTC (permalink / raw)
  To: netdev


Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/route.c |    8 ++++++--
 net/ipv6/route.c |   14 ++++++++------
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 456a947..4c33ce3 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2125,7 +2125,7 @@ static int __mkroute_input(struct sk_buff *skb,
 	rth->rt_gateway	= daddr;
 	rth->rt_spec_dst= spec_dst;
 	rth->rt_peer_genid = 0;
-	rt_init_peer(rth, dev_net(rth->dst.dev)->ipv4.peers);
+	rt_init_peer(rth, &res->table->tb_peers);
 	rth->fi = NULL;
 
 	rth->dst.input = ip_forward;
@@ -2512,7 +2512,9 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	rth->rt_gateway = fl4->daddr;
 	rth->rt_spec_dst= fl4->saddr;
 	rth->rt_peer_genid = 0;
-	rt_init_peer(rth, dev_net(dev_out)->ipv4.peers);
+	rt_init_peer(rth, (res->table ?
+			   &res->table->tb_peers :
+			   dev_net(dev_out)->ipv4.peers));
 	rth->fi = NULL;
 
 	RT_CACHE_STAT_INC(out_slow_tot);
@@ -2561,6 +2563,7 @@ static struct rtable *ip_route_output_slow(struct net *net, struct flowi4 *fl4)
 	int orig_oif;
 
 	res.fi		= NULL;
+	res.table	= NULL;
 #ifdef CONFIG_IP_MULTIPLE_TABLES
 	res.r		= NULL;
 #endif
@@ -2666,6 +2669,7 @@ static struct rtable *ip_route_output_slow(struct net *net, struct flowi4 *fl4)
 
 	if (fib_lookup(net, fl4, &res)) {
 		res.fi = NULL;
+		res.table = NULL;
 		if (fl4->flowi4_oif) {
 			/* Apparently, routing tables are wrong. Assume,
 			   that the destination is on link.
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 17a9b86..d9ba480 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -260,7 +260,8 @@ static struct rt6_info ip6_blk_hole_entry_template = {
 /* allocate dst with ip6_dst_ops */
 static inline struct rt6_info *ip6_dst_alloc(struct net *net,
 					     struct net_device *dev,
-					     int flags)
+					     int flags,
+					     struct fib6_table *table)
 {
 	struct rt6_info *rt = dst_alloc(&net->ipv6.ip6_dst_ops, dev,
 					0, 0, flags);
@@ -268,7 +269,7 @@ static inline struct rt6_info *ip6_dst_alloc(struct net *net,
 	if (rt) {
 		memset(&rt->rt6i_table, 0,
 		       sizeof(*rt) - sizeof(struct dst_entry));
-		rt6_init_peer(rt, net->ipv6.peers);
+		rt6_init_peer(rt, table ? &table->tb6_peers : net->ipv6.peers);
 	}
 	return rt;
 }
@@ -1114,7 +1115,7 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
 	if (unlikely(!idev))
 		return ERR_PTR(-ENODEV);
 
-	rt = ip6_dst_alloc(net, dev, 0);
+	rt = ip6_dst_alloc(net, dev, 0, NULL);
 	if (unlikely(!rt)) {
 		in6_dev_put(idev);
 		dst = ERR_PTR(-ENOMEM);
@@ -1296,7 +1297,7 @@ int ip6_route_add(struct fib6_config *cfg)
 	if (!table)
 		goto out;
 
-	rt = ip6_dst_alloc(net, NULL, DST_NOCOUNT);
+	rt = ip6_dst_alloc(net, NULL, DST_NOCOUNT, table);
 
 	if (!rt) {
 		err = -ENOMEM;
@@ -1818,7 +1819,8 @@ static struct rt6_info *ip6_rt_copy(struct rt6_info *ort,
 				    const struct in6_addr *dest)
 {
 	struct net *net = dev_net(ort->dst.dev);
-	struct rt6_info *rt = ip6_dst_alloc(net, ort->dst.dev, 0);
+	struct rt6_info *rt = ip6_dst_alloc(net, ort->dst.dev, 0,
+					    ort->rt6i_table);
 
 	if (rt) {
 		rt->dst.input = ort->dst.input;
@@ -2102,7 +2104,7 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
 				    bool anycast)
 {
 	struct net *net = dev_net(idev->dev);
-	struct rt6_info *rt = ip6_dst_alloc(net, net->loopback_dev, 0);
+	struct rt6_info *rt = ip6_dst_alloc(net, net->loopback_dev, 0, NULL);
 	int err;
 
 	if (!rt) {
-- 
1.7.10

^ permalink raw reply related

* Re: [PATCH] lpc_eth: add missing ndo_change_mtu()
From: Roland Stigge @ 2012-06-11  9:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, kevin.wells, aletes.xgr, srinivas.bakki
In-Reply-To: <1339406640.6001.1896.camel@edumazet-glaptop>

On 06/11/2012 11:24 AM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> lpc_eth does a copy of transmitted skbs to DMA area, without checking
> skb lengths, so can trigger buffer overflows :
> 
> memcpy(pldat->tx_buff_v + txidx * ENET_MAXF_SIZE, skb->data, len);
> 
> One way to get bigger skbs is to allow MTU changes above the 1500 limit.
> 
> Calling eth_change_mtu() in ndo_change_mtu() makes sure this cannot
> happen.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Acked-by: Roland Stigge <stigge@antcom.de>

> Cc: Roland Stigge <stigge@antcom.de>
> Cc: Kevin Wells <kevin.wells@nxp.com>
> ---
> diff --git a/drivers/net/ethernet/nxp/lpc_eth.c b/drivers/net/ethernet/nxp/lpc_eth.c
> index 8d2666f..10febdc 100644
> --- a/drivers/net/ethernet/nxp/lpc_eth.c
> +++ b/drivers/net/ethernet/nxp/lpc_eth.c
> @@ -1320,6 +1320,7 @@ static const struct net_device_ops lpc_netdev_ops = {
>  	.ndo_set_rx_mode	= lpc_eth_set_multicast_list,
>  	.ndo_do_ioctl		= lpc_eth_ioctl,
>  	.ndo_set_mac_address	= lpc_set_mac_address,
> +	.ndo_change_mtu		= eth_change_mtu,
>  };
>  
>  static int lpc_eth_drv_probe(struct platform_device *pdev)
> 
> 

^ permalink raw reply

* Re: [PATCH rfc net] Allow the autoconfigured network interface to be renamed.
From: scott @ 2012-06-11  9:58 UTC (permalink / raw)
  To: David Miller; +Cc: scott, netdev, scott.parlane
In-Reply-To: <20120610.202515.138701034400919651.davem@davemloft.net>

> From: Scott Parlane <scott@scottnz.com>
> Date: Sat,  9 Jun 2012 19:48:07 +1200
>
>> From: Scott Parlane <scott.parlane@alliedtelesis.co.nz>
>>
>> if IP_PNP_RENAME_DEV is set, the first interface to be configured
>> automatically by the kernel during boot will be renamed.
>>
>> IP_PNP_DEV_NEWNAME is the name to give the autoconfigured device.
>>
>> No changes will be made to any interface that is not autoconfigured.
>>
>> This allows the assurance of the boot device name, without the need
>> for an initramfs.
>>
>> Signed-off-by: Scott Parlane <scott.parlane@alliedtelesis.co.nz>
>
> Making this a compile time option makes absolutely no sense at all.
>
> Assuming this feature is desirable at all (which is a big IF), it
> should be a kernel command line option.

My comment about how many devices was a reference to the level of testing,
not the usage. However, I see it was probably moot, given that anyone with
sufficient knowledge of the area would see it works as described
(unless I did something I haven't seen yet).

[background of this solution]
In our configuration we load the same kernel onto a number of testboxes,
each of which has several ethernet interfaces and serial ports
(for communicating with our devices under test).
They use nfsroot, which prevents udev rules from working correctly
(because you can't take down the interface to rename it).
Previously we were using an initramfs that would rename the interface
just before mounting the real root.
However, a recent change to glibc prevents busybox's nfs utils from working,
so I made this patch to get the kernel to do it and remove the need for
the initramfs (which just slows our boot process anyway).

In our case, we run a custom kernel, and have this option turned on, with a
suitable name so that we can identify the boot interface.

While I agree that being able to configure it from the command line would
be useful, I would rather be able (if I wanted) to configure the default at
compile time, purely because it means there is one less thing my admins need
to worry about or can change in the pxelinux config. (Every testbox has a
unique cmdline because of how our nfsroot mount points work.)

Please let me know if you want me to extend it to support command line
configuration, and what configuration you would like to trigger it.

Kind Regards,
Scott

^ permalink raw reply

* R: Re: BUG (?) multicast loopback (IP6SKB_FORWARDED)
From: maxd @ 2012-06-11 10:09 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev

Hi Eric,
thanks for the quick reply. It seems that reverting the patch fixes the issue, 
and I have not observed any unintended behaviour so far.
Do you know what was the motivation for the patch?

Regards,
Massimiliano

>----Original message----
>From: eric.dumazet@gmail.com
>Date: 08/06/2012 19.23
>To: "maxd@inwind.it"<maxd@inwind.it>
>Cc: <netdev@vger.kernel.org>
>Subject: Re: BUG (?) multicast loopback (IP6SKB_FORWARDED)
>
>On Fri, 2012-06-08 at 18:48 +0200, maxd@inwind.it wrote:
>> Hi guys,
>> I found a probably wrong behaviour while doing some tests with multicast
>> routing on IPv6 with kernel 2.6.29. I will try to describe what's wrong in
>> the code in the following. I will use the latest kernel sources (3.5-rc1)
>> as reference source code (line numbers are taken there).
>> Let's assume a scenario with a node with two network interfaces acting as
>> a multicast router. The router receives the message on one interface and
>> needs to forward it on the other interface. Looking at the packet flow
>> inside the kernel, we notice that
>>
>> in ip6mr.c, line 2282, a flag is set:
>> IP6CB(skb)->flags |= IP6SKB_FORWARDED;
>>
>> After this, a multicast packet can be looped back (see line 124 in
>> ip6_output.c where function ip6_dev_loopback_xmit is called).
>> The packet is hence reinjected in the stack.
>>
>> The packet is processed by function ipv6_rcv (ip6_input.c), and then by
>> ipv6_mc_input (ip6_input.c).
>>
>> In ipv6_rcv, line 82, the previously set flag is cleared
>> memset(IP6CB(skb), 0, sizeof(struct inet6_skb_parm));
>>
>> In ipv6_mc_input, line 268, the flag is checked to determine if the
>> packet has been already forwarded. Since the flag has been cleared, the
>> kernel cannot determine that the packet has been looped back, and will
>> hence try to forward it again.
>>
>> Trying to forward a looped back packet determines a wrong behaviour of the
>> multicast routing protocol (PIM): the kernel believes that a multicast
>> message has been received from a wrong interface (line 1993 in ip6mr.c),
>> discards the message (this explains why the packet does not loop forever)
>> and triggers the transmission of an ASSERT message. Basically, the node
>> ends up sending an ASSERT message because of a looped back packet.
>>
>> WDYT? Is my analysis correct? Which is the best way to fix this issue?
>
>I guess your analysis is correct, try to revert commit
>6b7fdc3ae18a0598a999156b62d55ea55220e00f ([IPV6]: Clean skb cb on IPv6
>input) ?
>
>
>
>--
>To unsubscribe from this list: send the line "unsubscribe netdev" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: [PATCH rfc net] Allow the autoconfigured network interface to be renamed.
From: David Miller @ 2012-06-11 10:10 UTC (permalink / raw)
  To: scott; +Cc: netdev, scott.parlane
In-Reply-To: <f8f3fa05ddbe3a2e1613588754f151ac.squirrel@scottnz.com>

From: scott@scottnz.com
Date: Mon, 11 Jun 2012 21:58:45 +1200

> They use nfsroot which prevents udev rules from working correctly
> (because you cant take down the interface to rename it)

This is why I hate nfsroot as implemented in the kernel.

Do this right and use an initial ramdisk, then you won't have
this huge disconnect between different device names due to
lack of udev.

Thanks, you've confirmed that this patch is totally inappropriate.

^ permalink raw reply

* IPv6 tc filters
From: Dragos Ilie @ 2012-06-11 10:31 UTC (permalink / raw)
  To: netdev

Hi!

Do IPv6 tc filters work? According to Linux Advanced Routing & Traffic
Control HOWTO, IPv6 does not hook into the Routing Policy Database
(RPDB), thus causing the filters to fail. I would appreciate it if
someone on this list can confirm or refute this claim. If filters have
been fixed, since which kernel/iproute2 version do they work?

Regards,
Dragos

^ permalink raw reply

* Re: IPv6 tc filters
From: Thomas Graf @ 2012-06-11 10:39 UTC (permalink / raw)
  To: Dragos Ilie; +Cc: netdev
In-Reply-To: <CAOLNa-fyvic2xzrUwz2WsEap_CgGmU6OwAf=jMPCUVyXbrimVg@mail.gmail.com>

On Mon, Jun 11, 2012 at 12:31:27PM +0200, Dragos Ilie wrote:
> Do IPv6 tc filters work? According to Linux Advanced Routing & Traffic
> Control HOWTO, IPv6 does not hook into the Routing Policy Database
> (RPDB), thus causing the filters to fail. I would appreciate it if
> someone on this list can confirm or refute this claim. If filters have
> been fixed, since which kernel/iproute2 version do they work?

They do work, tc rules apply to any kind of traffic.

Some specific selectors only work with IPv4 so you need to craft
IPv6 variations yourself.

^ permalink raw reply

* Re: [PATCH rfc net] Allow the autoconfigured network interface to be renamed.
From: scott @ 2012-06-11 10:44 UTC (permalink / raw)
  To: David Miller; +Cc: scott, netdev, scott.parlane
In-Reply-To: <20120611.031053.1433253186495303724.davem@davemloft.net>

> From: scott@scottnz.com
> Date: Mon, 11 Jun 2012 21:58:45 +1200
>
>> They use nfsroot which prevents udev rules from working correctly
>> (because you cant take down the interface to rename it)
>
> This is why I hate nfsroot as implemented in the kernel.
>
The same is true for root over nfs as configured by the initramfs.
Conveniently, there is a small window where you can interfere with it
(at the initramfs stage), or I should say there was;
it's not available anymore, specifically because glibc dropped the sunrpc
code. (This is probably an artifact of gentoo's initramfs more than anything,
but it was the last distro that I could find to network boot without fully
RYO.)

> Do this right and use an initial ramdisk, then you won't have
> this huge disconnect between different device names due to
> lack of udev.
>
> Thanks, you've confirmed that this patch is totally inappropriate.
So be it. We will continue to run it, because it's the simplest way to
get what we want done (and most of our software developers can deal
with kernel code more readily than libc).

Regards,
Scott

^ permalink raw reply

* Re: [PATCH 1/5] inet: Hide route peer accesses behind helpers.
From: Eric Dumazet @ 2012-06-11 10:51 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120611.022908.953732817435232845.davem@davemloft.net>

From: Eric Dumazet <edumazet@google.com>

On Mon, 2012-06-11 at 02:29 -0700, David Miller wrote:
> We encode the pointer(s) into an unsigned long with one state bit.
> 
> The state bit is used so we can store the inetpeer tree root to use
> when resolving the peer later.
> 
> Later the peer roots will be per-FIB table, and this change works to
> facilitate that.

...

> +static inline bool inetpeer_ptr_set_peer(unsigned long *ptr, struct inet_peer *peer)
> +{
> +	unsigned long val = (unsigned long) peer;
> +	unsigned long orig = *ptr;
> +
> +	if (!(orig & INETPEER_BASE_BIT) || !val ||
> +	    cmpxchg(ptr, orig, val) != orig)
> +		return false;
> +	return true;
> +}

If peer is NULL here, we return false;



So we might have a NULL deref later :

>  
>  void rt_bind_peer(struct rtable *rt, __be32 daddr, int create)
>  {
> -	struct net *net = dev_net(rt->dst.dev);
> +	struct inet_peer_base *base;
>  	struct inet_peer *peer;
>  
> -	peer = inet_getpeer_v4(net->ipv4.peers, daddr, create);
> +	base = inetpeer_base_ptr(rt->_peer);
> +	if (!base)
> +		return;
> +
> +	peer = inet_getpeer_v4(base, daddr, create);
>  

Here, peer can be NULL

> -	if (peer && cmpxchg(&rt->peer, NULL, peer) != NULL)
> +	if (!rt_set_peer(rt, peer))
>  		inet_putpeer(peer); << CRASH >>
>  	else
>  		rt->rt_peer_genid = rt_peer_genid();



and in :

>  void rt6_bind_peer(struct rt6_info *rt, int create)
>  {
> -	struct net *net = dev_net(rt->dst.dev);
> +	struct inet_peer_base *base;
>  	struct inet_peer *peer;
>  
> -	peer = inet_getpeer_v6(net->ipv6.peers, &rt->rt6i_dst.addr, create);
> -	if (peer && cmpxchg(&rt->rt6i_peer, NULL, peer) != NULL)
> +	base = inetpeer_base_ptr(rt->_rt6i_peer);
> +	if (!base)
> +		return;
> +
> +	peer = inet_getpeer_v6(base, &rt->rt6i_dst.addr, create);

peer can be NULL

> +	if (!rt6_set_peer(rt, peer))
>  		inet_putpeer(peer);


[PATCH net-next] net: allow NULL param in inet_putpeer()

inet_putpeer() can be called with NULL peer, we must take care of it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/inetpeer.c      |    2 ++
 net/ipv4/ip_fragment.c   |    3 +--
 net/ipv4/tcp_minisocks.c |    3 +--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index cac02ad..cf1e7aa 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -527,6 +527,8 @@ EXPORT_SYMBOL_GPL(inet_getpeer);
 
 void inet_putpeer(struct inet_peer *p)
 {
+	if (!p)
+		return;
 	p->dtime = (__u32)jiffies;
 	smp_mb__before_atomic_dec();
 	atomic_dec(&p->refcnt);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 8d07c97..3bd3ed5 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -192,8 +192,7 @@ static __inline__ void ip4_frag_free(struct inet_frag_queue *q)
 	struct ipq *qp;
 
 	qp = container_of(q, struct ipq, q);
-	if (qp->peer)
-		inet_putpeer(qp->peer);
+	inet_putpeer(qp->peer);
 }
 
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index cb01531..b2d89b0 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -410,8 +410,7 @@ void tcp_twsk_destructor(struct sock *sk)
 {
 	struct tcp_timewait_sock *twsk = tcp_twsk(sk);
 
-	if (twsk->tw_peer)
-		inet_putpeer(twsk->tw_peer);
+	inet_putpeer(twsk->tw_peer);
 #ifdef CONFIG_TCP_MD5SIG
 	if (twsk->tw_md5_key) {
 		tcp_free_md5sig_pool();

^ permalink raw reply related

* Re: [PATCH 2/5] ipv4: Kill ip_rt_frag_needed().
From: Steffen Klassert @ 2012-06-11 11:16 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120611.022911.885347106959530782.davem@davemloft.net>

On Mon, Jun 11, 2012 at 02:29:11AM -0700, David Miller wrote:
> 
> -unsigned short ip_rt_frag_needed(struct net *net, const struct iphdr *iph,
> -				 unsigned short new_mtu,
> -				 struct net_device *dev)
> -{
> -	unsigned short old_mtu = ntohs(iph->tot_len);
> -	unsigned short est_mtu = 0;
> -	struct inet_peer *peer;
> -
> -	peer = inet_getpeer_v4(net->ipv4.peers, iph->daddr, 1);
> -	if (peer) {
> -		unsigned short mtu = new_mtu;
> -
> -		if (new_mtu < 68 || new_mtu >= old_mtu) {
> -			/* BSD 4.2 derived systems incorrectly adjust
> -			 * tot_len by the IP header length, and report
> -			 * a zero MTU in the ICMP message.
> -			 */
> -			if (mtu == 0 &&
> -			    old_mtu >= 68 + (iph->ihl << 2))
> -				old_mtu -= iph->ihl << 2;
> -			mtu = guess_mtu(old_mtu);
> -		}
> -
> -		if (mtu < ip_rt_min_pmtu)
> -			mtu = ip_rt_min_pmtu;
> -		if (!peer->pmtu_expires || mtu < peer->pmtu_learned) {
> -			unsigned long pmtu_expires;
> -
> -			pmtu_expires = jiffies + ip_rt_mtu_expires;
> -			if (!pmtu_expires)
> -				pmtu_expires = 1UL;
> -
> -			est_mtu = mtu;
> -			peer->pmtu_learned = mtu;
> -			peer->pmtu_expires = pmtu_expires;
> -			atomic_inc(&__rt_peer_genid);
> -		}
> -
> -		inet_putpeer(peer);
> -	}
> -	return est_mtu ? : new_mtu;
> -}
> -

It seems that we don't cache the learned pmtu information
in some cases with ip_rt_frag_needed() removed.

At least when doing a simple ping test on a network that has
a router with mtu 1300 along the path, the following happens:

bash-3.00# ping -c 4 -s 1400 192.168.40.2
PING 192.168.40.2 (192.168.40.2) 1400(1428) bytes of data.
From 10.2.2.2 icmp_seq=1 Frag needed and DF set (mtu = 1300)
From 10.2.2.2 icmp_seq=2 Frag needed and DF set (mtu = 1300)
From 10.2.2.2 icmp_seq=3 Frag needed and DF set (mtu = 1300)
From 10.2.2.2 icmp_seq=4 Frag needed and DF set (mtu = 1300)

--- 192.168.40.2 ping statistics ---
4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3005ms

We should learn the pmtu information with the first packet, and
all further packets should get fragmented according to
the learned information. Unfortunately, we don't cache
this information:

bash-3.00# ip r g 192.168.40.2
192.168.40.2 via 192.168.20.1 dev eth0  src 192.168.20.2
    cache

^ permalink raw reply
