Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next] ipv4: tcp: dont cache output dst for syncookies
From: Jesper Dangaard Brouer @ 2012-06-22 11:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Hans Schillstrom
In-Reply-To: <1340204539.4604.1122.camel@edumazet-glaptop>

On Wed, 2012-06-20 at 17:02 +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Don't cache output dst for syncookies, as this adds pressure on IP route
> cache and rcu subsystem for no gain.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>

Looks good to me.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

^ permalink raw reply

* Re: [PATCH 10/17] netvm: Allow skb allocation to use PFMEMALLOC reserves
From: Sebastian Andrzej Siewior @ 2012-06-22 11:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson
In-Reply-To: <20120622105451.GC8271@suse.de>

On Fri, Jun 22, 2012 at 11:54:51AM +0100, Mel Gorman wrote:
> > This is mostly used by nic to refil their RX skb pool. You add the
> > __GFP_MEMALLOC to the allocation to rise the change of a successfull refill
> > for the swap case.
> > A few drivers use build_skb() to create the skb. __netdev_alloc_skb()
> > shouldn't be affected since the allocation happens with GFP_ATOMIC. Looking at
> > TG3 it uses build_skb() and get_pages() / kmalloc(). Shouldn't this be some
> > considered?
> > 
> 
> While TG3 is not exactly as you describe after rebasing build_skb should
> make a similar check to __alloc_skb. As it is always used for RX allocation
> from the skbuff_head_cache cache the following should be suitable. Thanks.

As Eric pointed out you end up in netdev_alloc_frag() which is using
alloc_page(). This is also used by __netdev_alloc_skb().

Sebastian

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH] smsc911x.c: encapsulate enable irq calls
From: Matthias Brugger @ 2012-06-22 11:10 UTC (permalink / raw)
  To: Steve Glendinning; +Cc: netdev, Matthias Brugger

We encapsulate enbale irq functionality in a function call.
As on probe the interrupts will be disabled twice, we delete
one.

Signed-off-by: Matthias Brugger <mbrugger@iseebcn.com>
---
 drivers/net/ethernet/smsc/smsc911x.c |   17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/smsc/smsc911x.c b/drivers/net/ethernet/smsc/smsc911x.c
index 1466e5d..54ca99d 100644
--- a/drivers/net/ethernet/smsc/smsc911x.c
+++ b/drivers/net/ethernet/smsc/smsc911x.c
@@ -1442,6 +1442,14 @@ smsc911x_set_hw_mac_address(struct smsc911x_data *pdata, u8 dev_addr[6])
 	smsc911x_mac_write(pdata, ADDRL, mac_low32);
 }
 
+static void smsc911x_disable_irq_chip(struct net_device *dev)
+{
+	struct smsc911x_data *pdata = netdev_priv(dev);
+
+	smsc911x_reg_write(pdata, INT_EN, 0);
+	smsc911x_reg_write(pdata, INT_STS, 0xFFFFFFFF);
+}
+
 static int smsc911x_open(struct net_device *dev)
 {
 	struct smsc911x_data *pdata = netdev_priv(dev);
@@ -1494,8 +1502,7 @@ static int smsc911x_open(struct net_device *dev)
 	spin_unlock_irq(&pdata->mac_lock);
 
 	/* Initialise irqs, but leave all sources disabled */
-	smsc911x_reg_write(pdata, INT_EN, 0);
-	smsc911x_reg_write(pdata, INT_STS, 0xFFFFFFFF);
+	smsc911x_disable_irq_chip(dev);
 
 	/* Set interrupt deassertion to 100uS */
 	intcfg = ((10 << 24) | INT_CFG_IRQ_EN_);
@@ -2215,9 +2222,6 @@ static int __devinit smsc911x_init(struct net_device *dev)
 	if (smsc911x_soft_reset(pdata))
 		return -ENODEV;
 
-	/* Disable all interrupt sources until we bring the device up */
-	smsc911x_reg_write(pdata, INT_EN, 0);
-
 	ether_setup(dev);
 	dev->flags |= IFF_MULTICAST;
 	netif_napi_add(dev, &pdata->napi, smsc911x_poll, SMSC_NAPI_WEIGHT);
@@ -2434,8 +2438,7 @@ static int __devinit smsc911x_drv_probe(struct platform_device *pdev)
 	smsc911x_reg_write(pdata, INT_CFG, intcfg);
 
 	/* Ensure interrupts are globally disabled before connecting ISR */
-	smsc911x_reg_write(pdata, INT_EN, 0);
-	smsc911x_reg_write(pdata, INT_STS, 0xFFFFFFFF);
+	smsc911x_disable_irq_chip(dev);
 
 	retval = request_irq(dev->irq, smsc911x_irqhandler,
 			     irq_flags | IRQF_SHARED, dev->name, dev);
-- 
1.7.9.5

^ permalink raw reply related

* Re: [PATCH 10/17] netvm: Allow skb allocation to use PFMEMALLOC reserves
From: Mel Gorman @ 2012-06-22 10:54 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson
In-Reply-To: <20120621163029.GB6045@breakpoint.cc>

On Thu, Jun 21, 2012 at 06:30:29PM +0200, Sebastian Andrzej Siewior wrote:
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 1d6ecc8..9a58dcc 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -167,14 +206,19 @@ static void skb_under_panic(struct sk_buff *skb, int sz, void *here)
> >   *	%GFP_ATOMIC.
> >   */
> >  struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > -			    int fclone, int node)
> > +			    int flags, int node)
> >  {
> >  	struct kmem_cache *cache;
> >  	struct skb_shared_info *shinfo;
> >  	struct sk_buff *skb;
> >  	u8 *data;
> > +	bool pfmemalloc;
> >  
> > -	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
> > +	cache = (flags & SKB_ALLOC_FCLONE)
> > +		? skbuff_fclone_cache : skbuff_head_cache;
> > +
> > +	if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX))
> > +		gfp_mask |= __GFP_MEMALLOC;
> >  
> >  	/* Get the HEAD */
> >  	skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
> 
> This is mostly used by nic to refil their RX skb pool. You add the
> __GFP_MEMALLOC to the allocation to rise the change of a successfull refill
> for the swap case.
> A few drivers use build_skb() to create the skb. __netdev_alloc_skb()
> shouldn't be affected since the allocation happens with GFP_ATOMIC. Looking at
> TG3 it uses build_skb() and get_pages() / kmalloc(). Shouldn't this be some
> considered?
> 

While TG3 is not exactly as you describe after rebasing build_skb should
make a similar check to __alloc_skb. As it is always used for RX allocation
from the skbuff_head_cache cache the following should be suitable. Thanks.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9832001..063830c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -310,8 +310,12 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	unsigned int size = frag_size ? : ksize(data);
+	gfp_t gfp_mask = GFP_ATOMIC;
 
-	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+	if (sk_memalloc_socks())
+		gfp_mask |= __GFP_MEMALLOC;
+
+	skb = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
 	if (!skb)
 		return NULL;
 

^ permalink raw reply related

* Re: [PATCH 10/17] netvm: Allow skb allocation to use PFMEMALLOC reserves
From: Mel Gorman @ 2012-06-22 10:00 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson
In-Reply-To: <20120621160902.GA6045@breakpoint.cc>

On Thu, Jun 21, 2012 at 06:09:02PM +0200, Sebastian Andrzej Siewior wrote:
> > <SNIP>
> >
> If merge this chunk
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 6510a5d..2acfec9 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -510,7 +510,7 @@ struct sk_buff {
>  #define SKB_ALLOC_RX		0x02
>  
>  /* Returns true if the skb was allocated from PFMEMALLOC reserves */
> -static inline bool skb_pfmemalloc(struct sk_buff *skb)
> +static inline bool skb_pfmemalloc(const struct sk_buff *skb)
>  {
>  	return unlikely(skb->pfmemalloc);
>  }
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index c44ab68..6ce94b5 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -852,7 +852,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
>  
>  static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
>  {
> -	if (skb_pfmemalloc((struct sk_buff *)skb))
> +	if (skb_pfmemalloc(skb))
>  		return SKB_ALLOC_RX;
>  	return 0;
>  }
> 
> 
> Then you should be able to drop the case in skb_alloc_rx_flag() without adding
> a warning.
> 

You're right. Thanks.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 01/17] mm: sl[au]b: Add knowledge of PFMEMALLOC reserve pages
From: Mel Gorman @ 2012-06-22  9:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior
In-Reply-To: <4FE39290.8020609@redhat.com>

On Thu, Jun 21, 2012 at 05:30:56PM -0400, Rik van Riel wrote:
> On 06/20/2012 07:43 AM, Mel Gorman wrote:
> 
> >+/* Clears ac->pfmemalloc if no slabs have pfmalloc set */
> >+static void check_ac_pfmemalloc(struct kmem_cache *cachep,
> >+						struct array_cache *ac)
> >+{
> 
> >+	pfmemalloc_active = false;
> >+out:
> >+	spin_unlock_irqrestore(&l3->list_lock, flags);
> >+}
> 
> The comment and the function do not seem to match.
> 

Well spotted. There used to be ac->pfmemalloc and obviously I failed to
update the comment when it was removed.

> Otherwise, the patch looks reasonable.

Thanks.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH net] net: qmi_wwan: fix Oops while disconnecting
From: Bjørn Mork @ 2012-06-22  9:11 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-usb-u79uwXL29TY76Z2rM5mHXA, Marius Bjørnstad Kotsbak,
	Bjørn Mork

usbnet_disconnect() will set intfdata to NULL before calling
the minidriver unbind function.  The cdc_wdm subdriver cannot
know that it is disconnecting until the qmi_wwan unbind
function has called its disconnect function.  This means that
we must be able to support the cdc_wdm subdriver operating
normally while usbnet_disconnect() is running, and in
particular that intfdata may be NULL.

The only place this matters is in qmi_wwan_cdc_wdm_manage_power
which is called from cdc_wdm.  Simply testing for NULL
intfdata there is sufficient to allow it to continue working
at all times.

Fixes this Oops where a cdc-wdm device was closed while the
USB device was disconnecting, causing wdm_release to call
qmi_wwan_cdc_wdm_manage_power after intfdata was set to
NULL by usbnet_disconnect:

[41819.087460] BUG: unable to handle kernel NULL pointer dereference at 00000080
[41819.087815] IP: [<f8640458>] qmi_wwan_manage_power+0x68/0x90 [qmi_wwan]
[41819.088028] *pdpt = 000000000314f001 *pde = 0000000000000000
[41819.088028] Oops: 0002 [#1] SMP
[41819.088028] Modules linked in: qmi_wwan option usb_wwan usbserial usbnet
cdc_wdm nls_iso8859_1 nls_cp437 vfat fat usb_storage bnep rfcomm bluetooth
parport_pc ppdev binfmt_misc iptable_nat nf_nat nf_conntrack_ipv4
nf_conntrack nf_defrag_ipv4 iptable_mangle iptable_filter ip_tables
x_tables dm_crypt uvcvideo snd_hda_codec_realtek snd_hda_intel
videobuf2_core snd_hda_codec joydev videodev videobuf2_vmalloc
hid_multitouch snd_hwdep arc4 videobuf2_memops snd_pcm snd_seq_midi
snd_rawmidi snd_seq_midi_event ath9k mac80211 snd_seq ath9k_common ath9k_hw
ath snd_timer snd_seq_device sparse_keymap dm_multipath scsi_dh coretemp
mac_hid snd soundcore cfg80211 snd_page_alloc psmouse serio_raw microcode
lp parport dm_mirror dm_region_hash dm_log usbhid hid i915 drm_kms_helper
drm r8169 i2c_algo_bit wmi video [last unloaded: qmi_wwan]
[41819.088028]
[41819.088028] Pid: 23292, comm: qmicli Not tainted 3.4.0-5-generic #11-Ubuntu GIGABYTE T1005/T1005
[41819.088028] EIP: 0060:[<f8640458>] EFLAGS: 00010246 CPU: 1
[41819.088028] EIP is at qmi_wwan_manage_power+0x68/0x90 [qmi_wwan]
[41819.088028] EAX: 00000000 EBX: 00000000 ECX: 000000c3 EDX: 00000000
[41819.088028] ESI: c3b27658 EDI: 00000000 EBP: c298bea4 ESP: c298be98
[41819.088028]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[41819.088028] CR0: 8005003b CR2: 00000080 CR3: 3605e000 CR4: 000007f0
[41819.088028] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[41819.088028] DR6: ffff0ff0 DR7: 00000400
[41819.088028] Process qmicli (pid: 23292, ti=c298a000 task=f343b280 task.ti=c298a000)
[41819.088028] Stack:
[41819.088028]  00000000 c3b27658 e2a80d00 c298beb0 f864051a c3b27600 c298bec0 f9027099
[41819.088028]  c2fd6000 00000008 c298bef0 c1147f96 00000001 00000000 00000000 f4e54790
[41819.088028]  ecf43a00 ecf43a00 c2fd6008 c2fd6000 ebbd7600 ffffffb9 c298bf08 c1144474
[41819.088028] Call Trace:
[41819.088028]  [<f864051a>] qmi_wwan_cdc_wdm_manage_power+0x1a/0x20 [qmi_wwan]
[41819.088028]  [<f9027099>] wdm_release+0x69/0x70 [cdc_wdm]
[41819.088028]  [<c1147f96>] fput+0xe6/0x210
[41819.088028]  [<c1144474>] filp_close+0x54/0x80
[41819.088028]  [<c1046a65>] put_files_struct+0x75/0xc0
[41819.088028]  [<c1046b56>] exit_files+0x46/0x60
[41819.088028]  [<c1046f81>] do_exit+0x141/0x780
[41819.088028]  [<c107248f>] ? wake_up_state+0xf/0x20
[41819.088028]  [<c1053f48>] ? signal_wake_up+0x28/0x40
[41819.088028]  [<c1054f3b>] ? zap_other_threads+0x6b/0x80
[41819.088028]  [<c1047864>] do_group_exit+0x34/0xa0
[41819.088028]  [<c10478e8>] sys_exit_group+0x18/0x20
[41819.088028]  [<c15bb7df>] sysenter_do_call+0x12/0x28
[41819.088028] Code: 04 83 e7 01 c1 e7 03 0f b6 42 18 83 e0 f7 09 f8 88 42
18 8b 43 04 e8 48 9a dd c8 89 f0 8b 5d f4 8b 75 f8 8b 7d fc 89 ec 5d c3 90
<f0> ff 88 80 00 00 00 0f 94 c0 84 c0 75 b7 31 f6 8b 5d f4 89 f0
[41819.088028] EIP: [<f8640458>] qmi_wwan_manage_power+0x68/0x90 [qmi_wwan] SS:ESP 0068:c298be98
[41819.088028] CR2: 0000000000000080
[41819.149492] ---[ end trace 0944479ff8257f55 ]---

Reported-by: Marius Bjørnstad Kotsbak <marius.kotsbak-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org> # v3.4
Signed-off-by: Bjørn Mork <bjorn-yOkvZcmFvRU@public.gmane.org>
---
 drivers/net/usb/qmi_wwan.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 3767a12..b01960f 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -197,6 +197,10 @@ err:
 static int qmi_wwan_cdc_wdm_manage_power(struct usb_interface *intf, int on)
 {
 	struct usbnet *dev = usb_get_intfdata(intf);
+
+	/* can be called while disconnecting */
+	if (!dev)
+		return 0;
 	return qmi_wwan_manage_power(dev, on);
 }
 
-- 
1.7.10

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
From: Eric Dumazet @ 2012-06-22  8:29 UTC (permalink / raw)
  To: Josh Hunt
  Cc: davem@davemloft.net, kaber@trash.net, Debabrata Banerjee,
	netdev@vger.kernel.org, yoshfuji@linux-ipv6.org,
	jmorris@namei.org, pekkas@netcore.fi, kuznet@ms2.inr.ac.ru,
	linux-kernel@vger.kernel.org
In-Reply-To: <4FE41570.4090803@akamai.com>

On Fri, 2012-06-22 at 01:49 -0500, Josh Hunt wrote:
> On 06/21/2012 03:27 PM, Eric Dumazet wrote:
> > On Thu, 2012-06-21 at 14:35 -0500, Josh Hunt wrote:
> > 
> >> Can anyone provide details of the crash which was intended to be fixed
> >> by 2bec5a369ee79576a3eea2c23863325089785a2c? With this patch in and
> >> doing concurrent adds/deletes and dumping the table via netlink causes
> >> duplicate entries to be reported. Reverting this patch causes those
> >> problems to go away. We can provide a more detailed test if that is
> >> needed, but so far our testing has been unable to reproduce the crash
> >> mentioned in the above commit with it reverted.
> > 
> > A mere revert wont be enough.
> > 
> > Looking at this code, it lacks proper synchronization
> > between tree updaters and tree walkers.
> > 
> > fib6_walker_lock rwlock is not enough to prevent races.
> > 
> > Are you willing to fix this yourself ?
> > 
> 
> Looking through the code a bit more it seems like we would need to have
> a lock in fib6_walker_t to protect its contents. Mainly for when we
> update the pointers in fib6_del_route and fib6_repair_tree. Right now
> there is the fib6_walker_lock, but that appears to only be protecting
> the elements of the list, not their contents. Is this what you had in
> mind? I just coded up something along these lines and it works for the
> most part, but I also got a message about unsafe lock ordering when I
> stressed it so I am messing something up. If this sounds like it's on
> the right track I can work out the kinks in the morning.

Hmm, it seems tb6_lock is held by a writer, so its safe :

a tree walker can run only holding a read_lock on tb6_lock

^ permalink raw reply

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
From: Josh Hunt @ 2012-06-22  6:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem@davemloft.net, kaber@trash.net, Debabrata Banerjee,
	netdev@vger.kernel.org, yoshfuji@linux-ipv6.org,
	jmorris@namei.org, pekkas@netcore.fi, kuznet@ms2.inr.ac.ru,
	linux-kernel@vger.kernel.org
In-Reply-To: <1340310469.4604.6702.camel@edumazet-glaptop>

On 06/21/2012 03:27 PM, Eric Dumazet wrote:
> On Thu, 2012-06-21 at 14:35 -0500, Josh Hunt wrote:
> 
>> Can anyone provide details of the crash which was intended to be fixed
>> by 2bec5a369ee79576a3eea2c23863325089785a2c? With this patch in and
>> doing concurrent adds/deletes and dumping the table via netlink causes
>> duplicate entries to be reported. Reverting this patch causes those
>> problems to go away. We can provide a more detailed test if that is
>> needed, but so far our testing has been unable to reproduce the crash
>> mentioned in the above commit with it reverted.
> 
> A mere revert wont be enough.
> 
> Looking at this code, it lacks proper synchronization
> between tree updaters and tree walkers.
> 
> fib6_walker_lock rwlock is not enough to prevent races.
> 
> Are you willing to fix this yourself ?
> 

Looking through the code a bit more it seems like we would need to have
a lock in fib6_walker_t to protect its contents. Mainly for when we
update the pointers in fib6_del_route and fib6_repair_tree. Right now
there is the fib6_walker_lock, but that appears to only be protecting
the elements of the list, not their contents. Is this what you had in
mind? I just coded up something along these lines and it works for the
most part, but I also got a message about unsafe lock ordering when I
stressed it so I am messing something up. If this sounds like it's on
the right track I can work out the kinks in the morning.

Josh

^ permalink raw reply

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
From: Gao feng @ 2012-06-22  3:34 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Eric Dumazet, Josh Hunt, davem@davemloft.net, kaber@trash.net,
	Debabrata Banerjee, netdev@vger.kernel.org,
	yoshfuji@linux-ipv6.org, jmorris@namei.org, pekkas@netcore.fi,
	linux-kernel@vger.kernel.org
In-Reply-To: <20120621215056.GA24908@ms2.inr.ac.ru>

于 2012年06月22日 05:50, Alexey Kuznetsov 写道:
> On Thu, Jun 21, 2012 at 10:27:49PM +0200, Eric Dumazet wrote:
>> Looking at this code, it lacks proper synchronization
>> between tree updaters and tree walkers.
>>
>> fib6_walker_lock rwlock is not enough to prevent races.
> 
> Hmm. As author of this weird code, I must say I honestly believed it was correct.
> At least I tried. :-)
> 
> 
> What's about 2bec5a336.., it does not look reasonable.
> The idea was that when you change tree, you fixup sleeping walkers, moving
> their location in tree to correct point. So, walkers must not have any stale pointers
> at any times (except when you under table write lock) and no skips/counts are required.
> I remember how damn difficult was it to make this right (well, sorry, if it is not yet :-)),
> so that understand that if some update is forgotten or done incorrectly, it is not so easy to find,
> but it is definitely worth of efforts.


Actually, I spent two months to try to reproduce this crash four months ago,
But finally I give up, I don't think there was any stale pointers,
we already correct it when we change the tree.

^ permalink raw reply

* [PATCH] ipv4: Add sysctl knob to control early socket demux
From: Alexander Duyck @ 2012-06-21 23:58 UTC (permalink / raw)
  To: netdev; +Cc: jeffrey.t.kirsher, David S. Miller, Eric Dumazet, Alexander Duyck

This change is meant to add a control for disabling early socket demux.
The main motivation behind this patch is to provide an option to disable
the feature as it adds an additional cost to routing that reduces overall
throughput by up to 5%.  For example one of my systems went from 12.1Mpps
to 11.6 after the early socket demux was added.  It looks like the reason
for the regression is that we are now having to perform two lookups, first
the one for an established socket, and then the one for the routing table.

By adding this patch and toggling the value for ip_early_demux to 0 I am
able to get back to the 12.1Mpps I was previously seeing.

Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

I am open to any comments or suggestions on this patch.  I had seen the
earlier discussions and saw mention of adding a control for disabling the
early demux feature so I figured I would just code it up real quick once I
ran into the regression.  I am assuming it is okay to disable the early
demux code since I suspect there is other code in place that will still
handle the demux for the TCP sockets later.  Also I wasn't sure about the
sysctl since I haven't set one up before.

 include/linux/sysctl.h     |    1 +
 include/net/ip.h           |    3 +++
 kernel/sysctl_binary.c     |    2 ++
 net/ipv4/ip_input.c        |   19 +++++++++++--------
 net/ipv4/sysctl_net_ipv4.c |    7 +++++++
 5 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index c34b4c8..20825e5 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -425,6 +425,7 @@ enum
 	NET_TCP_ALLOWED_CONG_CONTROL=123,
 	NET_TCP_MAX_SSTHRESH=124,
 	NET_TCP_FRTO_RESPONSE=125,
+	NET_IPV4_EARLY_DEMUX=126,
 };
 
 enum {
diff --git a/include/net/ip.h b/include/net/ip.h
index 83e0619..50841bd 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -210,6 +210,9 @@ extern int inet_peer_threshold;
 extern int inet_peer_minttl;
 extern int inet_peer_maxttl;
 
+/* From ip_input.c */
+extern int sysctl_ip_early_demux;
+
 /* From ip_output.c */
 extern int sysctl_ip_dynaddr;
 
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index a650694..6a3cf82 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -415,6 +415,8 @@ static const struct bin_table bin_net_ipv4_table[] = {
 	{ CTL_INT,	NET_IPV4_IPFRAG_SECRET_INTERVAL,	"ipfrag_secret_interval" },
 	/* NET_IPV4_IPFRAG_MAX_DIST "ipfrag_max_dist" no longer used */
 
+	{ CTL_INT,	NET_IPV4_EARLY_DEMUX,			"ip_early_demux" },
+
 	{ CTL_INT,	2088 /* NET_IPQ_QMAX */,		"ip_queue_maxlen" },
 
 	/* NET_TCP_DEFAULT_WIN_SCALE unused */
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 93b092c..07de38d 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -313,6 +313,8 @@ drop:
 	return true;
 }
 
+int sysctl_ip_early_demux __read_mostly = 1;
+
 static int ip_rcv_finish(struct sk_buff *skb)
 {
 	const struct iphdr *iph = ip_hdr(skb);
@@ -325,14 +327,15 @@ static int ip_rcv_finish(struct sk_buff *skb)
 	if (skb_dst(skb) == NULL) {
 		const struct net_protocol *ipprot;
 		int protocol = iph->protocol;
-		int err;
-
-		rcu_read_lock();
-		ipprot = rcu_dereference(inet_protos[protocol]);
-		err = -ENOENT;
-		if (ipprot && ipprot->early_demux)
-			err = ipprot->early_demux(skb);
-		rcu_read_unlock();
+		int err = -ENOENT;
+
+		if (sysctl_ip_early_demux) {
+			rcu_read_lock();
+			ipprot = rcu_dereference(inet_protos[protocol]);
+			if (ipprot && ipprot->early_demux)
+				err = ipprot->early_demux(skb);
+			rcu_read_unlock();
+		}
 
 		if (err) {
 			err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index ef32956..12aa0c5 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -301,6 +301,13 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec
 	},
 	{
+		.procname	= "ip_early_demux",
+		.data		= &sysctl_ip_early_demux,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+	{
 		.procname	= "ip_dynaddr",
 		.data		= &sysctl_ip_dynaddr,
 		.maxlen		= sizeof(int),

^ permalink raw reply related

* Re: [PATCH][RESEND] bonding: delete migrated IP addresses from the rlb hash table
From: Jay Vosburgh @ 2012-06-21 23:05 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: Andy Gospodarek, netdev
In-Reply-To: <20120620203731.GA4957@midget.suse.cz>

Jiri Bohac <jbohac@suse.cz> wrote:

>Hi, this is a resend of the patch discussed here:
>	http://thread.gmane.org/gmane.linux.network/228076
>It has been updated to apply to the lastest net-next.
[...]
>The hash table is hashed by ip_dst. To be able to do the above
>check efficiently (not walking the whole hash table), we need a
>reverse mapping (by ip_src).
>
>I added three new members in struct rlb_client_info:
>   rx_hashtbl[x].src_first will point to the start of a list of
>      entries for which hash(ip_src) == x.
>   The list is linked with src_next and src_prev.
>
>When an incoming ARP packet arrives at rlb_arp_recv()
>rlb_purge_src_ip() can quickly walk only the entries on the
>corresponding lists, i.e. the entries that are likely to contain
>the offending IP address.
>
>To avoid confusion, I renamed these existing fields of struct 
>rlb_client_info:
>	next -> used_next
>	prev -> used_prev
>	rx_hashtbl_head -> rx_hashtbl_used_head
>
>(The current linked list is _not_ a list of hash table
>entries with colliding ip_dst. It's a list of entries that are
>being used; its purpose is to avoid walking the whole hash table
>when looking for used entries.)

	Just a note that I'm doing some testing with this patch.  Seems
to be ok for the "direct" case (wherein the IP in question is assigned
to the local system); I haven't tried the "bridge" case yet.  I've
extended some of the debugfs stuff to dump the new information, and I'm
trying some of the corner cases (e.g., breaking the linkages in the
middle) to see if it all hangs together.

	I am thinking that the layout of the "hash"-ish table is now
sufficiently complicated that there should be a comment block somewhere
describing what's going on (because I didn't really quite get it until I
dumped the whole thing and looked at it).  With this patch, there is one
"used" linkage for all of the elements in use, plus some number of "src"
linkages, one for each active source hash.  The "src" linkages are also
notable in that they are separate from the "assigned" state.

>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index e15cc11..8505a24 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
[...]
>@@ -354,6 +357,17 @@ static int rlb_arp_recv(const struct sk_buff *skb, struct bonding *bond,
> 	if (!arp)
> 		goto out;
>
>+	/* We received an ARP from arp->ip_src.
>+	 * We might have used this IP address previously (on the bonding host
>+	 * itself or on a system that is bridged together with the bond).
>+	 * However, if arp->mac_src is different than what is stored in
>+	 * rx_hashtbl, some other host is now using the IP and we must prevent
>+	 * sending out client updates with this IP address and the old MAC address.
>+	 * Clean up all hash table entries that have this address as ip_src but
>+	 * have a dirrerent mac_src.

	Typo here; should be "different."

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: [PATCH] net: dcb: fix small regression in __dcbnl_pg_setcfg()
From: David Miller @ 2012-06-21 22:06 UTC (permalink / raw)
  To: tgraf; +Cc: john.r.fastabend, netdev, lucy.liu, alexander.h.duyck
In-Reply-To: <20120621105245.GI27921@canuck.infradead.org>

From: Thomas Graf <tgraf@suug.ch>
Date: Thu, 21 Jun 2012 06:52:45 -0400

> On Wed, Jun 20, 2012 at 10:56:21PM -0700, John Fastabend wrote:
>> A small regression was introduced in the reply command of
>> dcbnl_pg_setcfg(). User space apps may be expecting the
>> DCB_ATTR_PG_CFG attribute to be returned with the patch
>> below TX or RX variants are returned.
>> 
>> commit 7be994138b188387691322921c08e19bddf6d3c5
>> Author: Thomas Graf <tgraf@suug.ch>
>> Date:   Wed Jun 13 02:54:55 2012 +0000
>> 
>>     dcbnl: Shorten all command handling functions
>> 
>> This patch reverts this behavior and returns DCB_ATTR_PG_CFG
>> 
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> 
> Acked-by: Thomas Graf <tgraf@suug.ch>

Applied, thanks.

^ permalink raw reply

* Re: [net] ixgbe: simplify padding and length checks
From: David Miller @ 2012-06-21 22:04 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: shemminger, netdev, gospo, sassmann
In-Reply-To: <1340314089.2033.16.camel@jtkirshe-mobl>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Thu, 21 Jun 2012 14:28:09 -0700

> On Thu, 2012-06-21 at 13:38 -0700, David Miller wrote:
>> From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>> Date: Thu, 21 Jun 2012 05:15:10 -0700
>> 
>> > From: Stephen Hemminger <shemminger@vyatta.com>
>> > 
>> > The check for length <= 0 is bogus because length is unsigned, and network
>> > stack never sends zero length packets (unless it is totally broken).
>> > 
>> > The check for really small packets can be optimized (using unlikely)
>> > and calling skb_pad directly.
>> > 
>> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>> > Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
>> > Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>> 
>> Not really fixing anything and more of a cleanup, so maybe 'net-next'
>> instead of 'net' for this guy instead?
> 
> Yeah, net-next is fine.  I just verified that the patch applies cleanly
> to net-next as well.

Great, applied, thanks.

^ permalink raw reply

* [PATCH] tcp: Validate route interface in early demux.
From: David Miller @ 2012-06-21 22:03 UTC (permalink / raw)
  To: netdev


Otherwise we might violate reverse path filtering.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/tcp_ipv4.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 13857df..21e22a0 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1676,6 +1676,7 @@ int tcp_v4_early_demux(struct sk_buff *skb)
 	struct net *net = dev_net(skb->dev);
 	const struct iphdr *iph;
 	const struct tcphdr *th;
+	struct net_device *dev;
 	struct sock *sk;
 	int err;
 
@@ -1695,10 +1696,11 @@ int tcp_v4_early_demux(struct sk_buff *skb)
 	if (!pskb_may_pull(skb, ip_hdrlen(skb) + th->doff * 4))
 		goto out_err;
 
+	dev = skb->dev;
 	sk = __inet_lookup_established(net, &tcp_hashinfo,
 				       iph->saddr, th->source,
 				       iph->daddr, th->dest,
-				       skb->dev->ifindex);
+				       dev->ifindex);
 	if (sk) {
 		skb->sk = sk;
 		skb->destructor = sock_edemux;
@@ -1707,8 +1709,12 @@ int tcp_v4_early_demux(struct sk_buff *skb)
 			if (dst)
 				dst = dst_check(dst, 0);
 			if (dst) {
-				skb_dst_set_noref(skb, dst);
-				err = 0;
+				struct rtable *rt = (struct rtable *) dst;
+
+				if (rt->rt_iif == dev->ifindex) {
+					skb_dst_set_noref(skb, dst);
+					err = 0;
+				}
 			}
 		}
 	}
-- 
1.7.10

^ permalink raw reply related

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
From: Alexey Kuznetsov @ 2012-06-21 21:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Josh Hunt, davem@davemloft.net, kaber@trash.net,
	Debabrata Banerjee, netdev@vger.kernel.org,
	yoshfuji@linux-ipv6.org, jmorris@namei.org, pekkas@netcore.fi,
	linux-kernel@vger.kernel.org
In-Reply-To: <1340310469.4604.6702.camel@edumazet-glaptop>

On Thu, Jun 21, 2012 at 10:27:49PM +0200, Eric Dumazet wrote:
> Looking at this code, it lacks proper synchronization
> between tree updaters and tree walkers.
> 
> fib6_walker_lock rwlock is not enough to prevent races.

Hmm. As author of this weird code, I must say I honestly believed it was correct.
At least I tried. :-)

What's about 2bec5a336.., it does not look reasonable.
The idea was that when you change tree, you fixup sleeping walkers, moving
their location in tree to correct point. So, walkers must not have any stale pointers
at any times (except when you under table write lock) and no skips/counts are required.
I remember how damn difficult was it to make this right (well, sorry, if it is not yet :-)),
so that understand that if some update is forgotten or done incorrectly, it is not so easy to find,
but it is definitely worth of efforts.

Alexey

^ permalink raw reply

* Re: [PATCH 01/17] mm: sl[au]b: Add knowledge of PFMEMALLOC reserve pages
From: Rik van Riel @ 2012-06-21 21:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior
In-Reply-To: <1340192652-31658-2-git-send-email-mgorman@suse.de>

On 06/20/2012 07:43 AM, Mel Gorman wrote:

> +/* Clears ac->pfmemalloc if no slabs have pfmalloc set */
> +static void check_ac_pfmemalloc(struct kmem_cache *cachep,
> +						struct array_cache *ac)
> +{

> +	pfmemalloc_active = false;
> +out:
> +	spin_unlock_irqrestore(&l3->list_lock, flags);
> +}

The comment and the function do not seem to match.

Otherwise, the patch looks reasonable.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [net] ixgbe: simplify padding and length checks
From: Jeff Kirsher @ 2012-06-21 21:28 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, netdev, gospo, sassmann
In-Reply-To: <20120621.133807.59441426155221033.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 887 bytes --]

On Thu, 2012-06-21 at 13:38 -0700, David Miller wrote:
> From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Date: Thu, 21 Jun 2012 05:15:10 -0700
> 
> > From: Stephen Hemminger <shemminger@vyatta.com>
> > 
> > The check for length <= 0 is bogus because length is unsigned, and network
> > stack never sends zero length packets (unless it is totally broken).
> > 
> > The check for really small packets can be optimized (using unlikely)
> > and calling skb_pad directly.
> > 
> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> > Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
> > Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> 
> Not really fixing anything and more of a cleanup, so maybe 'net-next'
> instead of 'net' for this guy instead?

Yeah, net-next is fine.  I just verified that the patch applies cleanly
to net-next as well.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* 3.4.2 net/ipv6/mip6.c destopt regression
From: Andras Takacs @ 2012-06-21 20:41 UTC (permalink / raw)
  To: netdev

Dear All,

I'm working with Mobile IPv6 systems, and I'm setting up a new MIP6 environment. I would like to use the latest stable kernel, so I'm using 3.4.2. Unfortunately I have some serious problems with destination option XFRM processing. I have done the following tests to find the issue:

First case: No XFRM policies and states.
Sending MH messages without destopt header.
In this case the message format is OK, I have tested it with tcpdump and wireshark.

21:33:58.817130 IP6 2001:470:7210:10::11 > 2001:470:7210:10::1000: mobility: BU seq#=1 lifetime=8
	0x0000:  6000 0000 0020 8740 2001 0470 7210 0010  `......@...pr...
	0x0010:  0000 0000 0000 0011 2001 0470 7210 0010  ...........pr...
	0x0020:  0000 0000 0000 1000 3b03 0500 1c46 0001  ........;....F..
	0x0030:  0000 0002 0100 0310 2001 0470 7210 0011  ...........pr...
	0x0040:  020c 29ff fe46 a0e3                      ..)..F..

Second case: Adding destopt XFRM policy and state:

ip -6 xfrm policy add src 2001:470:7210:10::11 dst 2001:470:7210:10::1000 proto 135 type 5 dir out priority 2 ptype sub tmpl src 2001:470:7210:10::11 dst 2001:470:7210:10::1000 proto hao reqid 0 mode ro level use
ip -6 xfrm state add src 2001:470:7210:10::11 dst 2001:470:7210:10::1000 proto hao reqid 0 mode ro replay-window 0 coa 2001:470:7210:11:20c:29ff:fe46:a0e3 sel src 2001:470:7210:10::11 dst 2001:470:7210:10::1000

In this case, the message format is corrupted:

21:30:42.350315 IP6 2001:470:7210:11:20c:29ff:fe46:a0e3 > 2001:470:7210:10::1000: DSTOPT mobility: type-#41 len=12
	0x0000:  6000 0000 0020 3c40 2001 0470 7210 0011  `.....<@...pr...
	0x0010:  020c 29ff fe46 a0e3 2001 0470 7210 0010  ..)..F.....pr...
	0x0020:  0000 0000 0000 1000 8702 0102 0000 c910  ................
	0x0030:  2001 0470 7210 0010 0000 0000 0000 0011  ...pr...........
	0x0040:  020c 29ff fe46 a0e3 

As you can see, the IPv6 header is OK. Next, the destination option header is OK. Finally, the following part of the packet isn't OK. If you compare the two dump carefully, you will see, that the last 8 bytes are identical. The mip6_destopt_output function adds the destination option header correctly, but overwrites the existing MH header, and doesn't shift it after the destopt header.

I'm not familiar with the XFRM framework enough to fix the problem. :(
Maybe, could anyone help to me to fix this issue?

The last environment, which worked fine was built on 2.6.35 version. The problem happened between 2.6.35 and 3.4.2. Sorry, I know, it is a quite big interval. :(

Thanks!

Best regards,
András Takács

^ permalink raw reply

* Re: [net] ixgbe: simplify padding and length checks
From: David Miller @ 2012-06-21 20:38 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: shemminger, netdev, gospo, sassmann
In-Reply-To: <1340280910-22651-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Thu, 21 Jun 2012 05:15:10 -0700

> From: Stephen Hemminger <shemminger@vyatta.com>
> 
> The check for length <= 0 is bogus because length is unsigned, and network
> stack never sends zero length packets (unless it is totally broken).
> 
> The check for really small packets can be optimized (using unlikely)
> and calling skb_pad directly.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Not really fixing anything and more of a cleanup, so maybe 'net-next'
instead of 'net' for this guy instead?

^ permalink raw reply

* Re: pull request: batman-adv 2012-06-21
From: David Miller @ 2012-06-21 20:36 UTC (permalink / raw)
  To: ordex-GaUfNO9RBHfsrOwW+9ziJQ
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r
In-Reply-To: <1340297876-29923-1-git-send-email-ordex-GaUfNO9RBHfsrOwW+9ziJQ@public.gmane.org>

From: Antonio Quartulli <ordex-GaUfNO9RBHfsrOwW+9ziJQ@public.gmane.org>
Date: Thu, 21 Jun 2012 18:57:36 +0200

> here is another set of changes intended for net-next/linux-3.6.
> From 1/20 to 19/20 you have our first set of changes aiming to rename the
> exported symbols, as you suggested some time ago.
> 
> Patch 20/20 restyles all the comments in our code in order to follow the new
> guidelines.
 ...
>   git://git.open-mesh.org/linux-merge.git tags/batman-adv-for-davem

Pulled, thanks.

^ permalink raw reply

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
From: Eric Dumazet @ 2012-06-21 20:27 UTC (permalink / raw)
  To: Josh Hunt
  Cc: davem@davemloft.net, kaber@trash.net, Debabrata Banerjee,
	netdev@vger.kernel.org, yoshfuji@linux-ipv6.org,
	jmorris@namei.org, pekkas@netcore.fi, kuznet@ms2.inr.ac.ru,
	linux-kernel@vger.kernel.org
In-Reply-To: <4FE37783.9000409@akamai.com>

On Thu, 2012-06-21 at 14:35 -0500, Josh Hunt wrote:

> Can anyone provide details of the crash which was intended to be fixed
> by 2bec5a369ee79576a3eea2c23863325089785a2c? With this patch in and
> doing concurrent adds/deletes and dumping the table via netlink causes
> duplicate entries to be reported. Reverting this patch causes those
> problems to go away. We can provide a more detailed test if that is
> needed, but so far our testing has been unable to reproduce the crash
> mentioned in the above commit with it reverted.

A mere revert wont be enough.

Looking at this code, it lacks proper synchronization
between tree updaters and tree walkers.

fib6_walker_lock rwlock is not enough to prevent races.

Are you willing to fix this yourself ?

^ permalink raw reply

* Re: PTP + H/W time stamping + CONFIG_PREEMPT_RT_FULL == Oops in slab
From: Richard Cochran @ 2012-06-21 19:54 UTC (permalink / raw)
  To: satpal parmar; +Cc: netdev
In-Reply-To: <CAGgcXb662=PbdJaMff1QfBhW06mYXgNvM1o6snunZMe=Q36Q7A@mail.gmail.com>

I suspect this isn't a time stamping bug, but rather that the time
stamping code just triggers the issue. My own experience has been that
preempt_rt on arm is quite shaky.

On Thu, Jun 21, 2012 at 08:09:11PM +0530, satpal parmar wrote:
> I am not sure if this is the right mailing list. If anyone know better
> forum please direct me to that.  Appreciate your patience and time.

Maybe lkml or rt-users would be better.

However, there are really too many variables, and I recommend first
narrowing the problem down. Here are a few ideas.

1. use a mainline kernel plus rt patch (without vendor/custom changes)
2. try to trigger issue without using ptpd2 (like stress testing)
3. write a simple test case (like sending a single PTP sync) and
   instrument your kernel

Good luck,
Richard

^ permalink raw reply

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
From: Josh Hunt @ 2012-06-21 19:35 UTC (permalink / raw)
  To: davem@davemloft.net, kaber@trash.net
  Cc: Debabrata Banerjee, netdev@vger.kernel.org,
	yoshfuji@linux-ipv6.org, jmorris@namei.org, pekkas@netcore.fi,
	kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org,
	eric.dumazet@gmail.com
In-Reply-To: <CAATkVEzTxT6d5nzt_qwCuFWM-WcN8hdhjGxFRPXJ4=L35wKbGw@mail.gmail.com>

On 06/12/2012 12:22 PM, Debabrata Banerjee wrote:
> Looks like commit 2bec5a369ee79576a3eea2c23863325089785a2c "ipv6: fib:
> fix crash when changing large fib while dumping" is the culprit. The
> result of this code is that if there is a tree addition while a dump
> has suspended because the netlink skb is full, it will simply go back
> to the top of the tree and you end up with duplicate/triplicate/etc
> routes. It looks like the code attempts to count nodes, but it's a
> linear count and the data structure is a tree so that's a big problem.
> The net result is potentially DOSable, since if route table updates
> happen often enough in proportion to table size, a dump will attempt
> to return an infinite amount of routes (observed). So this commit
> should be reverted. However I am interested in the problem that commit
> tried to solve, if anyone has more information on that. My assumption
> is the fib tree gets corrupted and eventually it crashes in
> fib6_dump_table(), which I assume can still happen.
> 
> I can easily demonstrate the bug by adding cloned/cache routes while I
> check the results of fib6_dump_table:
> 
> root@a172-25-43-12.deploy.akamaitechnologies.com:~# ip -6 -o route
> show table cache |tee tmp | wc -l; sort tmp | uniq -u | wc -l
> 593
> 189
> root@a172-25-43-12.deploy.akamaitechnologies.com:~# ip -6 -o route
> show table cache |tee tmp | wc -l; sort tmp | uniq -u | wc -l
> 884
> 16
> root@a172-25-43-12.deploy.akamaitechnologies.com:~# ip -6 -o route
> show table cache |tee tmp | wc -l; sort tmp | uniq -u | wc -l
> 888
> 78
> root@a172-25-43-12.deploy.akamaitechnologies.com:~# ip -6 -o route
> show table cache |tee tmp | wc -l; sort tmp | uniq -u | wc -l
> 507
> 507
> root@a172-25-43-12.deploy.akamaitechnologies.com:~# ip -6 -o route
> show table cache |tee tmp | wc -l; sort tmp | uniq -u | wc -l
> 533
> 533
> root@a172-25-43-12.deploy.akamaitechnologies.com:~# ip -6 -o route
> show table cache |tee tmp | wc -l; sort tmp | uniq -u | wc -l
> 571
> 571
> 
> Thanks,
> Debabrata

Ping?

Can anyone provide details of the crash which was intended to be fixed
by 2bec5a369ee79576a3eea2c23863325089785a2c? With this patch in and
doing concurrent adds/deletes and dumping the table via netlink causes
duplicate entries to be reported. Reverting this patch causes those
problems to go away. We can provide a more detailed test if that is
needed, but so far our testing has been unable to reproduce the crash
mentioned in the above commit with it reverted.

Thanks
Josh

^ permalink raw reply

* RE: [PATCH net-next (v2) 2/6] bnx2x: link cleanup
From: Yuval Mintz @ 2012-06-21 17:52 UTC (permalink / raw)
  To: Joe Perches
  Cc: netdev@vger.kernel.org, davem@davemloft.net, Eilon Greenstein,
	Yaniv Rosner
In-Reply-To: <1340298385.29885.76.camel@joe2Laptop>

> > This patch does several things:
> []
> >  3. Change msleep(small) --> usleep_range(small, small*2).
> >     Also correct existing calls to usleep_range.
> 
> Just fyi:
> 
> There are some more of these duplicated arg
> usleep_range uses in other files too
> 
> cheers, Joe

Hi Joe,

Yes - after you've awakened my awareness of the issue I've
taken a look at our driver and found the other occurrences as well.

However, since this patch series was meant to add some functionality
to our link code and its cleanup is a byproduct, I didn't want to add a 
new (unrelated) patch to the series.

Hopefully, next patch series we'll send will remedy all of these 
occurrences in our driver.

Thanks,
Yuval

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox