Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [stable] [PATCH net-2.6/stable] tg3: Restrict phy ioctl access
From: Matt Carlson @ 2011-02-17  0:11 UTC (permalink / raw)
  To: Greg KH
  Cc: Matthew Carlson, David Miller, stable@kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20110217000035.GB6296@kroah.com>

On Wed, Feb 16, 2011 at 04:00:35PM -0800, Greg KH wrote:
> On Wed, Feb 16, 2011 at 03:52:48PM -0800, Matt Carlson wrote:
> > On Wed, Feb 16, 2011 at 03:11:03PM -0800, David Miller wrote:
> > > From: "Matt Carlson" <mcarlson@broadcom.com>
> > > Date: Wed, 16 Feb 2011 15:06:13 -0800
> > > 
> > > > On Wed, Feb 16, 2011 at 02:39:35PM -0800, Greg KH wrote:
> > > >> On Tue, Feb 15, 2011 at 02:51:10PM -0800, Matt Carlson wrote:
> > > >> > If management firmware is present and the device is down, the firmware
> > > >> > will assume control of the phy.  If a phy access were allowed from the
> > > >> > host, it will collide with firmware phy accesses, resulting in
> > > >> > unpredictable behavior.  This patch fixes the problem by disallowing phy
> > > >> > accesses during the problematic condition.
> > > >> > 
> > > >> > Upstream commit ID f746a3136a61ae535c5d0b49a9418fa21edc61b5
> > > >> 
> > > >> There is no such upstream git commit id in Linus's tree.  What am I
> > > >> doing wrong here?
> > > > 
> > > > The commit is in Dave Miller's net-next-2.6 tree.
> > > > 
> > > 
> > > If it wasn't appropriate for net-2.6, it absolutely it not appropriate
> > > for -stable.
> > 
> > net-2.6 was the target tree for the patch.  The stable_kernel_rules.txt
> > seemed to suggest that I could just CC stable@kernel.org with the
> > commit ID, and Greg would pull it in as the process dictates.  If that
> > isn't correct, what is the preferred way to expedite the integration of
> > a patch?
> 
> Keep reading that file, it says to put the Cc: in the signed-off-by area
> of the original patch.

Ah.  Yes.  I see that now.

> Also, that file says the patch has to be in Linus's tree, otherwise
> sending me a git commit id of some other tree isn't going to help at
> all.

I see.  Thanks for the tips.


^ permalink raw reply

* Re: [RFC, PATCH 1/4] net: sh_eth: modify the definitions of register
From: Nobuhiro Iwamatsu @ 2011-02-17  0:21 UTC (permalink / raw)
  To: Yoshihiro Shimoda; +Cc: netdev, SH-Linux
In-Reply-To: <4D5A67CC.80107@renesas.com>

2011/2/15 Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>:
> The previous code cannot handle the ETHER and GETHER both as same time
> because the definitions of register was hardcoded.
>
> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> ---
>  arch/sh/include/asm/sh_eth.h |    7 +
>  drivers/net/sh_eth.c         |  322 +++++++++++-----------
>  drivers/net/sh_eth.h         |  623 ++++++++++++++++++++++++------------------
>  3 files changed, 537 insertions(+), 415 deletions(-)
>

<snip>

>
> +static const u16 *sh_eth_get_register_offset(int register_type)
> +{
> +       const u16 *reg_offset = NULL;
> +
> +       switch (register_type) {
> +       case SH_ETH_REG_GIGABIT:
> +               reg_offset = sh_eth_offset_gigabit;
> +               break;
> +       case SH_ETH_REG_FAST_SH4:
> +               reg_offset = sh_eth_offset_fast_sh4;
> +               break;
> +       case SH_ETH_REG_FAST_SH3_SH2:
> +               reg_offset = sh_eth_offset_fast_sh3_sh2;
> +               break;
> +       case SH_ETH_REG_DEFAULT:
> +#if defined(CONFIG_CPU_SUBTYPE_SH7763)
> +               reg_offset = sh_eth_offset_gigabit;
> +#elif defined(CONFIG_CPU_SH4)
> +               reg_offset = sh_eth_offset_fast_sh4;
> +#else
> +               reg_offset = sh_eth_offset_fast_sh3_sh2;
> +#endif
> +               break;
> +       default:
> +               printk(KERN_ERR "Unknown register type (%d)\n", register_type);
> +               break;
> +       }
> +
> +       return reg_offset;
> +}
> +

Is the handling of SH_ETH_REG_DEFAULT necessary?

>
> +static inline void sh_eth_write(struct net_device *ndev, unsigned long data,
> +                               int enum_index)
> +{
> +       struct sh_eth_private *mdp = netdev_priv(ndev);
> +
> +       writel(data, ndev->base_addr + mdp->reg_offset[enum_index]);
> +}
> +
> +static inline unsigned long sh_eth_read(struct net_device *ndev,
> +                                       int enum_index)
> +{
> +       struct sh_eth_private *mdp = netdev_priv(ndev);
> +
> +       return readl(ndev->base_addr + mdp->reg_offset[enum_index]);
> +}
> +
> +static inline void sh_eth_tsu_write(struct sh_eth_private *mdp,
> +                               unsigned long data, int enum_index)
> +{
> +       writel(data, mdp->tsu_addr + mdp->reg_offset[enum_index]);
> +}
> +
> +static inline unsigned long sh_eth_tsu_read(struct sh_eth_private *mdp,
> +                                       int enum_index)
> +{
> +       return readl(mdp->tsu_addr + mdp->reg_offset[enum_index]);
> +}
> +
>  #endif /* #ifndef __SH_ETH_H__ */

I do not think that a new function is necessary.
I think it use MACRO.

For example,
#define sh_eth_write(data,reg) \
{\
    writel(data, ndev->base_addr + mdp->reg_offset[reg]); \
}

Best regards,
  Nobuhiro
-- 
Nobuhiro Iwamatsu
   iwamatsu at {nigauri.org / debian.org}
   GPG ID: 40AD1FA6

^ permalink raw reply

* Re: [RFC, PATCH 2/4] net: sh_eth: remove the SH_TSU_ADDR
From: Nobuhiro Iwamatsu @ 2011-02-17  0:24 UTC (permalink / raw)
  To: Yoshihiro Shimoda; +Cc: netdev, SH-Linux
In-Reply-To: <4D5A67CF.3040406@renesas.com>

2011/2/15 Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>:
> The defination is hardcoded in this driver for some CPUs. This patch
> modifies to get resource of TSU address from platform_device.
>
> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> ---
>  drivers/net/sh_eth.c |   16 ++++++++++++----
>  drivers/net/sh_eth.h |   15 ---------------
>  2 files changed, 12 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/net/sh_eth.c b/drivers/net/sh_eth.c
> index 3b6d545..0593f29 100644
> --- a/drivers/net/sh_eth.c
> +++ b/drivers/net/sh_eth.c
> @@ -1446,7 +1446,7 @@ static const struct net_device_ops sh_eth_netdev_ops = {
>  static int sh_eth_drv_probe(struct platform_device *pdev)
>  {
>        int ret, devno = 0;
> -       struct resource *res;
> +       struct resource *res, *res_tsu;
>        struct net_device *ndev = NULL;
>        struct sh_eth_private *mdp;
>        struct sh_eth_plat_data *pd;
> @@ -1520,9 +1520,13 @@ static int sh_eth_drv_probe(struct platform_device *pdev)
>                        mdp->cd->chip_reset(ndev);
>
>  #if defined(SH_ETH_HAS_TSU)
> -               /* TSU init (Init only)*/
> -               mdp->tsu_addr = SH_TSU_ADDR;
> -               sh_eth_tsu_init(mdp);
> +               res_tsu = platform_get_resource(pdev, IORESOURCE_MEM, 1);
> +               if (res_tsu) {
> +                       mdp->tsu_addr = ioremap(res_tsu->start,
> +                                               resource_size(res_tsu));
> +                       /* TSU init (Init only)*/
> +                       sh_eth_tsu_init(mdp);
> +               }
>  #endif
>        }
>
> @@ -1549,6 +1553,8 @@ out_unregister:
>
>  out_release:
>        /* net_dev free */
> +       if (mdp->tsu_addr)
> +               iounmap(mdp->tsu_addr);
>        if (ndev)
>                free_netdev(ndev);
>
> @@ -1559,7 +1565,9 @@ out:
>  static int sh_eth_drv_remove(struct platform_device *pdev)
>  {
>        struct net_device *ndev = platform_get_drvdata(pdev);
> +       struct sh_eth_private *mdp = netdev_priv(ndev);
>
> +       iounmap(mdp->tsu_addr);
>        sh_mdio_release(ndev);
>        unregister_netdev(ndev);
>        pm_runtime_disable(&pdev->dev);

You forget to fix for ARSTR.

Best regards,
  Nobuhiro
-- 
Nobuhiro Iwamatsu
   iwamatsu at {nigauri.org / debian.org}
   GPG ID: 40AD1FA6

^ permalink raw reply

* Re: [stable] [PATCH net-2.6/stable] tg3: Restrict phy ioctl access
From: Matt Carlson @ 2011-02-17  0:39 UTC (permalink / raw)
  To: David Miller
  Cc: Matthew Carlson, greg@kroah.com, netdev@vger.kernel.org,
	stable@kernel.org
In-Reply-To: <20110216.161025.59672084.davem@davemloft.net>

On Wed, Feb 16, 2011 at 04:10:25PM -0800, David Miller wrote:
> From: "Matt Carlson" <mcarlson@broadcom.com>
> Date: Wed, 16 Feb 2011 15:52:48 -0800
> 
> > On Wed, Feb 16, 2011 at 03:11:03PM -0800, David Miller wrote:
> >> From: "Matt Carlson" <mcarlson@broadcom.com>
> >> Date: Wed, 16 Feb 2011 15:06:13 -0800
> >> 
> >> > On Wed, Feb 16, 2011 at 02:39:35PM -0800, Greg KH wrote:
> >> >> On Tue, Feb 15, 2011 at 02:51:10PM -0800, Matt Carlson wrote:
> >> >> > If management firmware is present and the device is down, the firmware
> >> >> > will assume control of the phy.  If a phy access were allowed from the
> >> >> > host, it will collide with firmware phy accesses, resulting in
> >> >> > unpredictable behavior.  This patch fixes the problem by disallowing phy
> >> >> > accesses during the problematic condition.
> >> >> > 
> >> >> > Upstream commit ID f746a3136a61ae535c5d0b49a9418fa21edc61b5
> >> >> 
> >> >> There is no such upstream git commit id in Linus's tree.  What am I
> >> >> doing wrong here?
> >> > 
> >> > The commit is in Dave Miller's net-next-2.6 tree.
> >> > 
> >> 
> >> If it wasn't appropriate for net-2.6, it absolutely it not appropriate
> >> for -stable.
> > 
> > net-2.6 was the target tree for the patch.  The stable_kernel_rules.txt
> > seemed to suggest that I could just CC stable@kernel.org with the
> > commit ID, and Greg would pull it in as the process dictates.  If that
> > isn't correct, what is the preferred way to expedite the integration of
> > a patch?
> 
> You are posting a commit ID for the net-next-2.6 tree, that's what triggered
> my response.
> 
> Unless it also went into the net-2.6 tree (in which case you should
> give Greg the net-2.6 commit ID, which is also what the commit ID must
> be in Linus's tree right now), the change is not appropriate for
> -stable submission.

So the proper thing to do here is recall the patch, submit a new patch
to net-2.6 with a CC: stabel@kernel.org in the signed-off-by section.

Would I do the exact same thing if I were posting to net-next-2.6?
(i.e. the CC line tells you I want this patch to go to net-next-2.6,
net-2.6, then Linus's tree, then stable?)  Or would you rather I posted
a completely different patchset against net-2.6?


^ permalink raw reply

* Re: [stable] [PATCH net-2.6/stable] tg3: Restrict phy ioctl access
From: David Miller @ 2011-02-17  0:56 UTC (permalink / raw)
  To: mcarlson; +Cc: greg, netdev, stable
In-Reply-To: <20110217003947.GA11185@mcarlson.broadcom.com>

From: "Matt Carlson" <mcarlson@broadcom.com>
Date: Wed, 16 Feb 2011 16:39:47 -0800

> So the proper thing to do here is recall the patch, submit a new patch
> to net-2.6 with a CC: stabel@kernel.org in the signed-off-by section.

No.

I have not applied your patch yet, but when I do and I also get my
tree pulled next time into Linus's tree, you can ask stable to apply
it.

Because only at that point will it exist as a commit in Linus's tree.

Before that happens you cannot ask Greg to apply it to his tree.

^ permalink raw reply

* Re: [RFC, PATCH 1/4] net: sh_eth: modify the definitions of register
From: Nobuhiro Iwamatsu @ 2011-02-17  0:56 UTC (permalink / raw)
  To: Yoshihiro Shimoda; +Cc: netdev, SH-Linux
In-Reply-To: <AANLkTimvaLOuBXcqJ=co1LakTxJqGEUQC3Cif+wBoJhm@mail.gmail.com>

2011/2/17 Nobuhiro Iwamatsu <iwamatsu@nigauri.org>:
> 2011/2/15 Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>:
>> The previous code cannot handle the ETHER and GETHER both as same time
>> because the definitions of register was hardcoded.
>>
>> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
>> ---
>>  arch/sh/include/asm/sh_eth.h |    7 +
>>  drivers/net/sh_eth.c         |  322 +++++++++++-----------
>>  drivers/net/sh_eth.h         |  623 ++++++++++++++++++++++++------------------
>>  3 files changed, 537 insertions(+), 415 deletions(-)
>>
>
> <snip>
>
>>
>> +static const u16 *sh_eth_get_register_offset(int register_type)
>> +{
>> +       const u16 *reg_offset = NULL;
>> +
>> +       switch (register_type) {
>> +       case SH_ETH_REG_GIGABIT:
>> +               reg_offset = sh_eth_offset_gigabit;
>> +               break;
>> +       case SH_ETH_REG_FAST_SH4:
>> +               reg_offset = sh_eth_offset_fast_sh4;
>> +               break;
>> +       case SH_ETH_REG_FAST_SH3_SH2:
>> +               reg_offset = sh_eth_offset_fast_sh3_sh2;
>> +               break;
>> +       case SH_ETH_REG_DEFAULT:
>> +#if defined(CONFIG_CPU_SUBTYPE_SH7763)
>> +               reg_offset = sh_eth_offset_gigabit;
>> +#elif defined(CONFIG_CPU_SH4)
>> +               reg_offset = sh_eth_offset_fast_sh4;
>> +#else
>> +               reg_offset = sh_eth_offset_fast_sh3_sh2;
>> +#endif
>> +               break;
>> +       default:
>> +               printk(KERN_ERR "Unknown register type (%d)\n", register_type);
>> +               break;
>> +       }
>> +
>> +       return reg_offset;
>> +}
>> +
>
> Is the handling of SH_ETH_REG_DEFAULT necessary?
>
>>
>> +static inline void sh_eth_write(struct net_device *ndev, unsigned long data,
>> +                               int enum_index)
>> +{
>> +       struct sh_eth_private *mdp = netdev_priv(ndev);
>> +
>> +       writel(data, ndev->base_addr + mdp->reg_offset[enum_index]);
>> +}
>> +
>> +static inline unsigned long sh_eth_read(struct net_device *ndev,
>> +                                       int enum_index)
>> +{
>> +       struct sh_eth_private *mdp = netdev_priv(ndev);
>> +
>> +       return readl(ndev->base_addr + mdp->reg_offset[enum_index]);
>> +}
>> +
>> +static inline void sh_eth_tsu_write(struct sh_eth_private *mdp,
>> +                               unsigned long data, int enum_index)
>> +{
>> +       writel(data, mdp->tsu_addr + mdp->reg_offset[enum_index]);
>> +}
>> +
>> +static inline unsigned long sh_eth_tsu_read(struct sh_eth_private *mdp,
>> +                                       int enum_index)
>> +{
>> +       return readl(mdp->tsu_addr + mdp->reg_offset[enum_index]);
>> +}
>> +
>>  #endif /* #ifndef __SH_ETH_H__ */
>
> I do not think that a new function is necessary.
> I think it use MACRO.
>
> For example,
> #define sh_eth_write(data,reg) \
> {\
>    writel(data, ndev->base_addr + mdp->reg_offset[reg]); \
> }
>

Oh sorry, this is bad coding style.
Please ignore this comment.

Nobuhiro


-- 
Nobuhiro Iwamatsu
   iwamatsu at {nigauri.org / debian.org}
   GPG ID: 40AD1FA6

^ permalink raw reply

* ixgbe: 82599 and Westmere with HT
From: Andrew Dickinson @ 2011-02-17  1:21 UTC (permalink / raw)
  To: netdev

Hi,

I've got a dual Westmere board (X5675) with an X520 card with dual
10G.  I see 24-cores exposed to me and the ixgbe driver exposes 24 tx
and 24 rx interrupts per NIC.  I then pin the interrupts to cores for
each NIC (each interrupt gets its own core, standard stuff).

Anyway.... I'm only seeing RX interrupts on 16 of the 24 cores (random
src/dest pairs across a /16 each, so I should be getting good flow
hashing).  Did I miss some magic somewhere?

I'm running 2.6.32.4, perhaps this has been fixed upstream.  If not,
any thoughts on how to make this work?

-A

^ permalink raw reply

* Fwd: IGMP and rwlock: Dead ocurred again on TILEPro
From: Cypher Wu @ 2011-02-17  2:17 UTC (permalink / raw)
  To: linux-kernel, Chris Metcalf, Américo Wang, Eric Dumazet,
	netdev

---------- Forwarded message ----------
From: Cypher Wu <cypher.w@gmail.com>
Date: Wed, Feb 16, 2011 at 5:58 PM
Subject: GMP and rwlock: Dead ocurred again on TILEPro
To: linux-kernel@vger.kernel.org


The rwlock and spinlock of TILEPro platform use TNS instruction to
test the value of lock, but if interrupt is not masked, read_lock()
have another chance to deadlock while read_lock() called in bh of
interrupt.

The original code:
void __raw_read_lock_slow(raw_
rwlock_t *rwlock, u32 val)
{
   u32 iterations = 0;
   do {
       if (!(val & 1))
           rwlock->lock = val;
       delay_backoff(iterations++);
       val = __insn_tns((int *)&rwlock->lock);
   } while ((val << RD_COUNT_WIDTH) != 0);
   rwlock->lock = val + (1 << RD_COUNT_SHIFT);
}

I've modified it to get some information:
void __raw_read_lock_slow(raw_rwlock_t *rwlock, u32 val)
{
   u32 iterations = 0;
   do {
       if (!(val & 1))
       {
           rwlock->lock = val;
           iterations = 0;
       }
       delay_backoff(iterations++);
       if (iterations > 0x1000000)
       {
           dump_stack();
           iterations = 0;
       }

       val = __insn_tns((int *)&rwlock->lock);
   } while ((val << RD_COUNT_WIDTH) != 0);
   rwlock->lock = val + (1 << RD_COUNT_SHIFT);
}

And this is the stack info:

Starting stack dump of tid 837, pid 837 (ff0) on cpu 55 at cycle 10180633928773
 frame 0: 0xfd3bfbe0 dump_stack+0x0/0x20 (sp 0xe4b5f9d8)
 frame 1: 0xfd3c0b50 __raw_read_lock_slow.cold+0x50/0x90 (sp 0xe4b5f9d8)
 frame 2: 0xfd184a58 igmpv3_send_cr+0x60/0x440 (sp 0xe4b5f9f0)
 frame 3: 0xfd3bd928 igmp_ifc_timer_expire+0x30/0x90 (sp 0xe4b5fa20)
 frame 4: 0xfd047698 run_timer_softirq+0x258/0x3c8 (sp 0xe4b5fa30)
 frame 5: 0xfd0563f8 __do_softirq+0x138/0x220 (sp 0xe4b5fa70)
 frame 6: 0xfd097d48 do_softirq+0x88/0x110 (sp 0xe4b5fa98)
 frame 7: 0xfd1871f8 irq_exit+0xf8/0x120 (sp 0xe4b5faa8)
 frame 8: 0xfd1afda0 do_timer_interrupt+0xa0/0xf8 (sp 0xe4b5fab0)
 frame 9: 0xfd187b98 handle_interrupt+0x2d8/0x2e0 (sp 0xe4b5fac0)
 <interrupt 25 while in kernel mode>
 frame 10: 0xfd0241c8 _read_lock+0x8/0x40 (sp 0xe4b5fc38)
 frame 11: 0xfd1bb008 ip_mc_del_src+0xc8/0x378 (sp 0xe4b5fc40)
 frame 12: 0xfd2681e8 ip_mc_leave_group+0xf8/0x1e0 (sp 0xe4b5fc70)
 frame 13: 0xfd0a4d70 do_ip_setsockopt+0xe48/0x1560 (sp 0xe4b5fc90)
 frame 14: 0xfd2b4168 sys_setsockopt+0x150/0x170 (sp 0xe4b5fe98)
 frame 15: 0xfd14e550 handle_syscall+0x2d0/0x320 (sp 0xe4b5fec0)
 <syscall while in user mode>
 frame 16: 0x3342a0 (sp 0xbfddfb00)
 frame 17: 0x16130 (sp 0xbfddfb08)
 frame 18: 0x16640 (sp 0xbfddfb38)
 frame 19: 0x16ee8 (sp 0xbfddfc58)
 frame 20: 0x345a08 (sp 0xbfddfc90)
 frame 21: 0x10218 (sp 0xbfddfe48)
Stack dump complete

I don't know the clear definition of rwlock & spinlock in Linux, but
the implementation of other platforms
like x86, PowerPC, ARM don't have that issue. The use of TNS cause a
race condition between system
call and interrupt.

Through the call tree of packet sending, there are also some other
rwlock will be tried, say
read_lock(&fib_hash_lock) in fn_hash_lookup() which is called in
ip_route_output_slow(). I've seen deadlock
on fib_hash_lock, but haven't reproduced with that debug information yet.

Maybe IGMP is not the only one, TCP timer will retransmit data and
will also call read_lock(&fib_hash_lock).

--
Cyberman Wu



-- 
Cyberman Wu

^ permalink raw reply

* Re: state of rtcache removal...
From: Tom Herbert @ 2011-02-17  2:59 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110216.160838.39164069.davem@davemloft.net>

On Wed, Feb 16, 2011 at 4:08 PM, David Miller <davem@davemloft.net> wrote:
>
> So I've been testing out the routing cache removal patch to see
> what the impact is on performance.
>
Interesting results.

I assume that this test is purposely using sento on a connected socket
to force sendmsg to go through the route lookup :-), so this is
showing what the benefits of rtcache are is when cache hit rate is
100%.  For comparison, it might interesting to see what the
performance is when rate is < 100%.  For instance, we often see hit
rates < 20% on front end servers.  This could be done flooding to
random addresses in 10/8 or even 0/0...  I'm hoping that without the
rtcache performance actually improves in that case!

Tom

^ permalink raw reply

* Re: state of rtcache removal...
From: David Miller @ 2011-02-17  3:27 UTC (permalink / raw)
  To: therbert; +Cc: netdev
In-Reply-To: <AANLkTi=W=1ZiP4JOJwo=FccNA9aK=1bpRn=zkepXXMzi@mail.gmail.com>

From: Tom Herbert <therbert@google.com>
Date: Wed, 16 Feb 2011 18:59:35 -0800

> On Wed, Feb 16, 2011 at 4:08 PM, David Miller <davem@davemloft.net> wrote:
>>
>> So I've been testing out the routing cache removal patch to see
>> what the impact is on performance.
>>
> Interesting results.
> 
> I assume that this test is purposely using sento on a connected socket
> to force sendmsg to go through the route lookup :-), so this is
> showing what the benefits of rtcache are is when cache hit rate is
> 100%.  For comparison, it might interesting to see what the
> performance is when rate is < 100%.  For instance, we often see hit
> rates < 20% on front end servers.  This could be done flooding to
> random addresses in 10/8 or even 0/0...  I'm hoping that without the
> rtcache performance actually improves in that case!

We know that the performance will be higher in the "closer to %0"
situation, ie.  for DoS workloads.  Because all of the route cache
management overhead goes away.

Anyways I'm working on some ideas to make the high hit rate case
perform amicably again.

^ permalink raw reply

* Fwd: IGMP and rwlock: Dead ocurred again on TILEPro
From: Cypher Wu @ 2011-02-17  3:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: Chris Metcalf, Américo Wang, Eric Dumazet, netdev

---------- Forwarded message ----------
From: Cypher Wu <cypher.w@gmail.com>
Date: Wed, Feb 16, 2011 at 5:58 PM
Subject: GMP and rwlock: Dead ocurred again on TILEPro
To: linux-kernel@vger.kernel.org


The rwlock and spinlock of TILEPro platform use TNS instruction to
test the value of lock, but if interrupt is not masked, read_lock()
have another chance to deadlock while read_lock() called in bh of
interrupt.

The original code:
void __raw_read_lock_slow(raw_
rwlock_t *rwlock, u32 val)
{
   u32 iterations = 0;
   do {
       if (!(val & 1))
           rwlock->lock = val;
       delay_backoff(iterations++);
       val = __insn_tns((int *)&rwlock->lock);
   } while ((val << RD_COUNT_WIDTH) != 0);
   rwlock->lock = val + (1 << RD_COUNT_SHIFT);
}

I've modified it to get some information:
void __raw_read_lock_slow(raw_rwlock_t *rwlock, u32 val)
{
   u32 iterations = 0;
   do {
       if (!(val & 1))
       {
           rwlock->lock = val;
           iterations = 0;
       }
       delay_backoff(iterations++);
       if (iterations > 0x1000000)
       {
           dump_stack();
           iterations = 0;
       }

       val = __insn_tns((int *)&rwlock->lock);
   } while ((val << RD_COUNT_WIDTH) != 0);
   rwlock->lock = val + (1 << RD_COUNT_SHIFT);
}

And this is the stack info:

Starting stack dump of tid 837, pid 837 (ff0) on cpu 55 at cycle 10180633928773
 frame 0: 0xfd3bfbe0 dump_stack+0x0/0x20 (sp 0xe4b5f9d8)
 frame 1: 0xfd3c0b50 __raw_read_lock_slow.cold+0x50/0x90 (sp 0xe4b5f9d8)
 frame 2: 0xfd184a58 igmpv3_send_cr+0x60/0x440 (sp 0xe4b5f9f0)
 frame 3: 0xfd3bd928 igmp_ifc_timer_expire+0x30/0x90 (sp 0xe4b5fa20)
 frame 4: 0xfd047698 run_timer_softirq+0x258/0x3c8 (sp 0xe4b5fa30)
 frame 5: 0xfd0563f8 __do_softirq+0x138/0x220 (sp 0xe4b5fa70)
 frame 6: 0xfd097d48 do_softirq+0x88/0x110 (sp 0xe4b5fa98)
 frame 7: 0xfd1871f8 irq_exit+0xf8/0x120 (sp 0xe4b5faa8)
 frame 8: 0xfd1afda0 do_timer_interrupt+0xa0/0xf8 (sp 0xe4b5fab0)
 frame 9: 0xfd187b98 handle_interrupt+0x2d8/0x2e0 (sp 0xe4b5fac0)
 <interrupt 25 while in kernel mode>
 frame 10: 0xfd0241c8 _read_lock+0x8/0x40 (sp 0xe4b5fc38)
 frame 11: 0xfd1bb008 ip_mc_del_src+0xc8/0x378 (sp 0xe4b5fc40)
 frame 12: 0xfd2681e8 ip_mc_leave_group+0xf8/0x1e0 (sp 0xe4b5fc70)
 frame 13: 0xfd0a4d70 do_ip_setsockopt+0xe48/0x1560 (sp 0xe4b5fc90)
 frame 14: 0xfd2b4168 sys_setsockopt+0x150/0x170 (sp 0xe4b5fe98)
 frame 15: 0xfd14e550 handle_syscall+0x2d0/0x320 (sp 0xe4b5fec0)
 <syscall while in user mode>
 frame 16: 0x3342a0 (sp 0xbfddfb00)
 frame 17: 0x16130 (sp 0xbfddfb08)
 frame 18: 0x16640 (sp 0xbfddfb38)
 frame 19: 0x16ee8 (sp 0xbfddfc58)
 frame 20: 0x345a08 (sp 0xbfddfc90)
 frame 21: 0x10218 (sp 0xbfddfe48)
Stack dump complete

I don't know the clear definition of rwlock & spinlock in Linux, but
the implementation of other platforms
like x86, PowerPC, ARM don't have that issue. The use of TNS cause a
race condition between system
call and interrupt.

Through the call tree of packet sending, there are also some other
rwlock will be tried, say
read_lock(&fib_hash_lock) in fn_hash_lookup() which is called in
ip_route_output_slow(). I've seen deadlock
on fib_hash_lock, but haven't reproduced with that debug information yet.

Maybe IGMP is not the only one, TCP timer will retransmit data and
will also call read_lock(&fib_hash_lock).

--
Cyberman Wu



-- 
Cyberman Wu

^ permalink raw reply

* Re: Fwd: IGMP and rwlock: Dead ocurred again on TILEPro
From: Américo Wang @ 2011-02-17  4:49 UTC (permalink / raw)
  To: Cypher Wu
  Cc: linux-kernel, Chris Metcalf, Américo Wang, Eric Dumazet,
	netdev
In-Reply-To: <AANLkTimj+_8cBz0gcEyF1889xK2HdF4SQDSK8DiY-4Gs@mail.gmail.com>

On Thu, Feb 17, 2011 at 11:39:22AM +0800, Cypher Wu wrote:
>---------- Forwarded message ----------
>From: Cypher Wu <cypher.w@gmail.com>
>Date: Wed, Feb 16, 2011 at 5:58 PM
>Subject: GMP and rwlock: Dead ocurred again on TILEPro
>To: linux-kernel@vger.kernel.org
>
>
>The rwlock and spinlock of TILEPro platform use TNS instruction to
>test the value of lock, but if interrupt is not masked, read_lock()
>have another chance to deadlock while read_lock() called in bh of
>interrupt.
>


In this case, you should call read_lock_bh() instead of read_lock().


> frame 0: 0xfd3bfbe0 dump_stack+0x0/0x20 (sp 0xe4b5f9d8)
> frame 1: 0xfd3c0b50 __raw_read_lock_slow.cold+0x50/0x90 (sp 0xe4b5f9d8)
> frame 2: 0xfd184a58 igmpv3_send_cr+0x60/0x440 (sp 0xe4b5f9f0)
> frame 3: 0xfd3bd928 igmp_ifc_timer_expire+0x30/0x90 (sp 0xe4b5fa20)
> frame 4: 0xfd047698 run_timer_softirq+0x258/0x3c8 (sp 0xe4b5fa30)
> frame 5: 0xfd0563f8 __do_softirq+0x138/0x220 (sp 0xe4b5fa70)
> frame 6: 0xfd097d48 do_softirq+0x88/0x110 (sp 0xe4b5fa98)
> frame 7: 0xfd1871f8 irq_exit+0xf8/0x120 (sp 0xe4b5faa8)
> frame 8: 0xfd1afda0 do_timer_interrupt+0xa0/0xf8 (sp 0xe4b5fab0)
> frame 9: 0xfd187b98 handle_interrupt+0x2d8/0x2e0 (sp 0xe4b5fac0)
> <interrupt 25 while in kernel mode>
> frame 10: 0xfd0241c8 _read_lock+0x8/0x40 (sp 0xe4b5fc38)
> frame 11: 0xfd1bb008 ip_mc_del_src+0xc8/0x378 (sp 0xe4b5fc40)
> frame 12: 0xfd2681e8 ip_mc_leave_group+0xf8/0x1e0 (sp 0xe4b5fc70)
> frame 13: 0xfd0a4d70 do_ip_setsockopt+0xe48/0x1560 (sp 0xe4b5fc90)
> frame 14: 0xfd2b4168 sys_setsockopt+0x150/0x170 (sp 0xe4b5fe98)
> frame 15: 0xfd14e550 handle_syscall+0x2d0/0x320 (sp 0xe4b5fec0)
> <syscall while in user mode>
> frame 16: 0x3342a0 (sp 0xbfddfb00)
> frame 17: 0x16130 (sp 0xbfddfb08)
> frame 18: 0x16640 (sp 0xbfddfb38)
> frame 19: 0x16ee8 (sp 0xbfddfc58)
> frame 20: 0x345a08 (sp 0xbfddfc90)
> frame 21: 0x10218 (sp 0xbfddfe48)
>Stack dump complete
>
>I don't know the clear definition of rwlock & spinlock in Linux, but
>the implementation of other platforms
>like x86, PowerPC, ARM don't have that issue. The use of TNS cause a
>race condition between system
>call and interrupt.
>

Have you turned CONFIG_LOCKDEP on?

I think Eric already converted that rwlock into RCU lock, thus
this problem should disappear. Could you try a new kernel?

Thanks.

^ permalink raw reply

* BUG? racy code at net/atm/br2684.c
From: 홍신 shin hong @ 2011-02-17  4:49 UTC (permalink / raw)
  To: netdev

Hi. I am reporting a suspected race bug at br2684_push()
in /net/atm/br2684.c.

Please examine the report and let me know your opinion.
I reported the same issue several month ago and got feedback
however, I found that the code is still not changed.

In br2684_push() accesses &brdev->brvcss without devs_lock holding at line 334.
But it seems that all accesses to &brdev->brvcss are synchronized by devs_lock.

So that br2684_push() may result race condition in concurrent execution
with other functions that manipulate &brdev->brvcss.

It seems that it is better to guard list_empty(&brdev->brvccs) by devs_lock.
Or check list_empty(&brdev->brvccs) once again after devs_lock holding.

Thank you.

Sincerely
Shin Hong

^ permalink raw reply

* Re: Fwd: IGMP and rwlock: Dead ocurred again on TILEPro
From: Cypher Wu @ 2011-02-17  5:04 UTC (permalink / raw)
  To: Américo Wang; +Cc: linux-kernel, Chris Metcalf, Eric Dumazet, netdev
In-Reply-To: <20110217044917.GA2653@cr0.nay.redhat.com>

On Thu, Feb 17, 2011 at 12:49 PM, Américo Wang <xiyou.wangcong@gmail.com> wrote:
> On Thu, Feb 17, 2011 at 11:39:22AM +0800, Cypher Wu wrote:
>>---------- Forwarded message ----------
>>From: Cypher Wu <cypher.w@gmail.com>
>>Date: Wed, Feb 16, 2011 at 5:58 PM
>>Subject: GMP and rwlock: Dead ocurred again on TILEPro
>>To: linux-kernel@vger.kernel.org
>>
>>
>>The rwlock and spinlock of TILEPro platform use TNS instruction to
>>test the value of lock, but if interrupt is not masked, read_lock()
>>have another chance to deadlock while read_lock() called in bh of
>>interrupt.
>>
>
>
> In this case, you should call read_lock_bh() instead of read_lock().
>
>
>> frame 0: 0xfd3bfbe0 dump_stack+0x0/0x20 (sp 0xe4b5f9d8)
>> frame 1: 0xfd3c0b50 __raw_read_lock_slow.cold+0x50/0x90 (sp 0xe4b5f9d8)
>> frame 2: 0xfd184a58 igmpv3_send_cr+0x60/0x440 (sp 0xe4b5f9f0)
>> frame 3: 0xfd3bd928 igmp_ifc_timer_expire+0x30/0x90 (sp 0xe4b5fa20)
>> frame 4: 0xfd047698 run_timer_softirq+0x258/0x3c8 (sp 0xe4b5fa30)
>> frame 5: 0xfd0563f8 __do_softirq+0x138/0x220 (sp 0xe4b5fa70)
>> frame 6: 0xfd097d48 do_softirq+0x88/0x110 (sp 0xe4b5fa98)
>> frame 7: 0xfd1871f8 irq_exit+0xf8/0x120 (sp 0xe4b5faa8)
>> frame 8: 0xfd1afda0 do_timer_interrupt+0xa0/0xf8 (sp 0xe4b5fab0)
>> frame 9: 0xfd187b98 handle_interrupt+0x2d8/0x2e0 (sp 0xe4b5fac0)
>> <interrupt 25 while in kernel mode>
>> frame 10: 0xfd0241c8 _read_lock+0x8/0x40 (sp 0xe4b5fc38)
>> frame 11: 0xfd1bb008 ip_mc_del_src+0xc8/0x378 (sp 0xe4b5fc40)
>> frame 12: 0xfd2681e8 ip_mc_leave_group+0xf8/0x1e0 (sp 0xe4b5fc70)
>> frame 13: 0xfd0a4d70 do_ip_setsockopt+0xe48/0x1560 (sp 0xe4b5fc90)
>> frame 14: 0xfd2b4168 sys_setsockopt+0x150/0x170 (sp 0xe4b5fe98)
>> frame 15: 0xfd14e550 handle_syscall+0x2d0/0x320 (sp 0xe4b5fec0)
>> <syscall while in user mode>
>> frame 16: 0x3342a0 (sp 0xbfddfb00)
>> frame 17: 0x16130 (sp 0xbfddfb08)
>> frame 18: 0x16640 (sp 0xbfddfb38)
>> frame 19: 0x16ee8 (sp 0xbfddfc58)
>> frame 20: 0x345a08 (sp 0xbfddfc90)
>> frame 21: 0x10218 (sp 0xbfddfe48)
>>Stack dump complete
>>
>>I don't know the clear definition of rwlock & spinlock in Linux, but
>>the implementation of other platforms
>>like x86, PowerPC, ARM don't have that issue. The use of TNS cause a
>>race condition between system
>>call and interrupt.
>>
>
> Have you turned CONFIG_LOCKDEP on?
>
> I think Eric already converted that rwlock into RCU lock, thus
> this problem should disappear. Could you try a new kernel?
>
> Thanks.
>

I haven't turned CONFIG_LOCKDEP on for test since I didn't get too
much information when we tried to figured out the former deadlock.

IGMP used read_lock() instead of read_lock_bh() since usually
read_lock() can be called recursively, and today I've read the
implementation of MIPS, it's should also works fine in that situation.
The implementation of TILEPro cause problem since after it use TNS set
the lock-val to 1 and hold the original value and before it re-set
lock-val a new value, it a race condition window.

It's not practical to upgrade the kernel.

Thanks.

-- 
Cyberman Wu

^ permalink raw reply

* Re: Fwd: IGMP and rwlock: Dead ocurred again on TILEPro
From: Américo Wang @ 2011-02-17  5:42 UTC (permalink / raw)
  To: Cypher Wu
  Cc: Américo Wang, linux-kernel, Chris Metcalf, Eric Dumazet,
	netdev
In-Reply-To: <AANLkTikkro-_9pv5uJ4+SqYh9Fr480_96Wh1Pv6Pvtez@mail.gmail.com>

On Thu, Feb 17, 2011 at 01:04:14PM +0800, Cypher Wu wrote:
>>
>> Have you turned CONFIG_LOCKDEP on?
>>
>> I think Eric already converted that rwlock into RCU lock, thus
>> this problem should disappear. Could you try a new kernel?
>>
>> Thanks.
>>
>
>I haven't turned CONFIG_LOCKDEP on for test since I didn't get too
>much information when we tried to figured out the former deadlock.
>
>IGMP used read_lock() instead of read_lock_bh() since usually
>read_lock() can be called recursively, and today I've read the
>implementation of MIPS, it's should also works fine in that situation.
>The implementation of TILEPro cause problem since after it use TNS set
>the lock-val to 1 and hold the original value and before it re-set
>lock-val a new value, it a race condition window.
>

I see no reason why you can't call read_lock_bh() recursively,
read_lock_bh() is roughly equalent to local_bh_disable() + read_lock(),
both can be recursive.

But I may miss something here. :-/

^ permalink raw reply

* Re: IGMP and rwlock: Dead ocurred again on TILEPro
From: David Miller @ 2011-02-17  5:46 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: cypher.w, linux-kernel, cmetcalf, eric.dumazet, netdev
In-Reply-To: <20110217054237.GB2653@cr0.nay.redhat.com>

From: Américo Wang <xiyou.wangcong@gmail.com>
Date: Thu, 17 Feb 2011 13:42:37 +0800

> On Thu, Feb 17, 2011 at 01:04:14PM +0800, Cypher Wu wrote:
>>>
>>> Have you turned CONFIG_LOCKDEP on?
>>>
>>> I think Eric already converted that rwlock into RCU lock, thus
>>> this problem should disappear. Could you try a new kernel?
>>>
>>> Thanks.
>>>
>>
>>I haven't turned CONFIG_LOCKDEP on for test since I didn't get too
>>much information when we tried to figured out the former deadlock.
>>
>>IGMP used read_lock() instead of read_lock_bh() since usually
>>read_lock() can be called recursively, and today I've read the
>>implementation of MIPS, it's should also works fine in that situation.
>>The implementation of TILEPro cause problem since after it use TNS set
>>the lock-val to 1 and hold the original value and before it re-set
>>lock-val a new value, it a race condition window.
>>
> 
> I see no reason why you can't call read_lock_bh() recursively,
> read_lock_bh() is roughly equalent to local_bh_disable() + read_lock(),
> both can be recursive.
> 
> But I may miss something here. :-/

IGMP is doing this so that taking the read lock does not stop packet
processing.

TILEPro's rwlock implementation is simply buggy and needs to be fixed.

^ permalink raw reply

* Re: state of rtcache removal...
From: Eric Dumazet @ 2011-02-17  6:25 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110216.160838.39164069.davem@davemloft.net>

Le mercredi 16 février 2011 à 16:08 -0800, David Miller a écrit :
> So I've been testing out the routing cache removal patch to see
> what the impact is on performance.
> 
> I'm using a UDP flood to a single IP address over a dummy interface
> with hard coded ARP entries, so that pretty much just the main IP
> output and routing paths are being exercised.
> 
> The UDP flood tool I cooked up based upon a description sent to me by
> Eric Dumazet of a similar utility he uses for testing.  I've included
> the code to this tool at the end of this email, as well as the dummy
> interface setup script.   Basically, you go:
> 
> bash# ./udpflood_setup.sh
> bash# time ./udpflood -l 10000 10.2.2.11
> 
> The IP output path is about twice as slow with the routing cache
> removed entirely.  Here are the numbers I have:
> 
> net-next-2.6, rt_cache on:
> 
> davem@maramba:~$ time udpflood -l 10000000 10.2.2.11
> real		 1m47.012s
> user		 0m8.670s
> sys		 1m38.370s
> 
> net-next-2.6, rt_cache turned off via sysctl:
> 
> davem@maramba:~$ time udpflood -l 10000000 10.2.2.11
> real		 3m12.662s
> user		 0m9.490s
> sys		 3m3.220s
> 
> net-next-2.6 + "BONUS" rt_cache deletion patch:
> 
> maramba:/home/davem# time ./bin/udpflood -l 10000000 10.2.2.11
> real		     3m9.921s
> user		     0m9.520s
> sys		     3m0.440s
> 
> I then worked on some simplifications of the code in net/ipv4/route.c
> that remains after the cache removal.  I'll post those patches after
> I've chewed on them some more, but they knock a couple seconds back off
> of the benchmark:
> 
> The profile output is what you'd expect, with fib_table_lookup() topping
> the charts taking ~%10 of the time.
> 
> What might not be initially apparent is that each output route lookup
> results in two calls to fib_table_lookup() and thus two trie lookups.
> Why?  Because we have two routing tables (3 with IP_MULTIPLE_TABLES
> enabled) that get searched, first the LOCAL then the MAIN table (then
> with mutliple-tables enabled, the DEFAULT).  And most external
> outgoing routes sit in the MAIN table.
> 
> We do this so we can store all the interface address network,
> broadcast, loopback network, et al. routes in the LOCAL table, then all
> globally visible routes in the MAIN table.
> 
> Anyways, the long and short of this is that route lookups take two
> trie lookups instead of just one.  On input there are even more, for
> source address validation done by fib_validate_source().  That can be
> up to 4 more fib_table_lookup() invocations.
> 
> Add in another level of complexity if you have a series of FIB rules
> installed.
> 
> So, to me, this means that spending time micro-optiming fib_trie is
> not going to help much.  Getting rid of that multiplier somehow, on
> the other hand, might.
> 
> I plan to play with some ideas, such as sticking fib_alias entries into
> the flow cache and consulting/populating the flow cache on fib_lookup()
> calls.
> 
> -------------------- udpflood.c --------------------
> /* An adaptation of Eric Dumazet's udpflood tool.  */
> 
> #include <stdio.h>
> #include <stddef.h>
> #include <malloc.h>
> #include <string.h>
> #include <errno.h>
> 
> #include <sys/types.h>
> #include <sys/socket.h>
> #include <netinet/in.h>
> #include <arpa/inet.h>
> 
> #define _GNU_SOURCE
> #include <getopt.h>
> 
> static int usage(void)
> {
> 	printf("usage: udpflood [ -l count ] [ -m message_size ] IP_ADDRESS\n");
> 	return -1;
> }
> 
> static int send_packets(in_addr_t addr, int port, int count, int msg_sz)
> {
> 	char *msg = malloc(msg_sz);
> 	struct sockaddr_in saddr;
> 	int fd, i, err;
> 
> 	if (!msg)
> 		return -ENOMEM;
> 
> 	memset(msg, 0, msg_sz);
> 
> 	memset(&saddr, 0, sizeof(saddr));
> 	saddr.sin_family = AF_INET;
> 	saddr.sin_port = port;
> 	saddr.sin_addr.s_addr = addr;
> 
> 	fd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
> 	if (fd < 0) {
> 		perror("socket");
> 		err = fd;
> 		goto out_nofd;
> 	}
> 	err = connect(fd, (struct sockaddr *) &saddr, sizeof(saddr));
> 	if (err < 0) {
> 		perror("connect");
> 		close(fd);
> 		goto out;
> 	}
> 	for (i = 0; i < count; i++) {
> 		err = sendto(fd, msg, msg_sz, 0,
> 			     (struct sockaddr *) &saddr, sizeof(saddr));
> 		if (err < 0) {
> 			perror("sendto");
> 			goto out;
> 		}
> 	}
> 
> 	err = 0;
> out:
> 	close(fd);
> out_nofd:
> 	free(msg);
> 	return err;
> }
> 
> int main(int argc, char **argv, char **envp)
> {
> 	int port, msg_sz, count, ret;
> 	in_addr_t addr;
> 
> 	port = 6000;
> 	msg_sz = 32;
> 	count = 10000000;
> 
> 	while ((ret = getopt(argc, argv, "l:s:p:")) >= 0) {
> 		switch (ret) {
> 		case 'l':
> 			sscanf(optarg, "%d", &count);
> 			break;
> 		case 's':
> 			sscanf(optarg, "%d", &msg_sz);
> 			break;
> 		case 'p':
> 			sscanf(optarg, "%d", &port);
> 			break;
> 		case '?':
> 			return usage();
> 		}
> 	}
> 
> 	if (!argv[optind])
> 		return usage();
> 
> 	addr = inet_addr(argv[optind]);
> 	if (addr == INADDR_NONE)
> 		return usage();
> 
> 	return send_packets(addr, port, count, msg_sz);
> }
> 
> -------------------- udpflood_setup.sh --------------------
> #!/bin/sh
> modprobe dummy
> ifconfig dummy0 10.2.2.254 netmask 255.255.255.0 up
> 
> for f in $(seq 11 26)
> do
>  arp -H ether -i dummy0 -s 10.2.2.$f 00:00:0c:07:ac:$f
> done
> --

Thanks David for this work in progress.

If I remember my works in last October/November, I also know fib_hash
was a bit faster than fib_trie (around 20%)...





^ permalink raw reply

* Re: state of rtcache removal...
From: David Miller @ 2011-02-17  6:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1297923915.2645.24.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 17 Feb 2011 07:25:15 +0100

> Thanks David for this work in progress.

It was my pleasure :-)

> If I remember my works in last October/November, I also know fib_hash
> was a bit faster than fib_trie (around 20%)...

Right, if table is small hash can be faster.

I just wrote a hack that puts fib_lookup() results into the the
flow cache, and hooked it only into the one fib_lookup() call
that happens in ip_route_output_slow().

This pointed out another costly thing we do when resolving output
routes.  If the flow key's source is not INADDR_ANY, we validate the
source address is our's by doing a fib table lookup in the local
table, with the address in the flow key's destination field.

So even on output we were doing 3 fib lookups :-/

Therefore, the flow cache hack gets rid of 2 of those 3.

^ permalink raw reply

* Re: IGMP and rwlock: Dead ocurred again on TILEPro
From: Eric Dumazet @ 2011-02-17  6:39 UTC (permalink / raw)
  To: David Miller; +Cc: xiyou.wangcong, cypher.w, linux-kernel, cmetcalf, netdev
In-Reply-To: <20110216.214625.189707123.davem@davemloft.net>

Le mercredi 16 février 2011 à 21:46 -0800, David Miller a écrit :
> From: Américo Wang <xiyou.wangcong@gmail.com>
> Date: Thu, 17 Feb 2011 13:42:37 +0800
> 
> > On Thu, Feb 17, 2011 at 01:04:14PM +0800, Cypher Wu wrote:
> >>>
> >>> Have you turned CONFIG_LOCKDEP on?
> >>>
> >>> I think Eric already converted that rwlock into RCU lock, thus
> >>> this problem should disappear. Could you try a new kernel?
> >>>
> >>> Thanks.
> >>>
> >>
> >>I haven't turned CONFIG_LOCKDEP on for test since I didn't get too
> >>much information when we tried to figured out the former deadlock.
> >>
> >>IGMP used read_lock() instead of read_lock_bh() since usually
> >>read_lock() can be called recursively, and today I've read the
> >>implementation of MIPS, it's should also works fine in that situation.
> >>The implementation of TILEPro cause problem since after it use TNS set
> >>the lock-val to 1 and hold the original value and before it re-set
> >>lock-val a new value, it a race condition window.
> >>
> > 
> > I see no reason why you can't call read_lock_bh() recursively,
> > read_lock_bh() is roughly equalent to local_bh_disable() + read_lock(),
> > both can be recursive.
> > 
> > But I may miss something here. :-/
> 
> IGMP is doing this so that taking the read lock does not stop packet
> processing.
> 
> TILEPro's rwlock implementation is simply buggy and needs to be fixed.

Yep. Finding all recursive readlocks in kernel and convert them to
another locking model is probably more expensive than fixing TILEPro
rwlock implementation.

^ permalink raw reply

* Re: Fwd: IGMP and rwlock: Dead ocurred again on TILEPro
From: Cypher Wu @ 2011-02-17  6:42 UTC (permalink / raw)
  To: Américo Wang; +Cc: linux-kernel, Chris Metcalf, Eric Dumazet, netdev
In-Reply-To: <20110217054237.GB2653@cr0.nay.redhat.com>

On Thu, Feb 17, 2011 at 1:42 PM, Américo Wang <xiyou.wangcong@gmail.com> wrote:
> On Thu, Feb 17, 2011 at 01:04:14PM +0800, Cypher Wu wrote:
>>>
>>> Have you turned CONFIG_LOCKDEP on?
>>>
>>> I think Eric already converted that rwlock into RCU lock, thus
>>> this problem should disappear. Could you try a new kernel?
>>>
>>> Thanks.
>>>
>>
>>I haven't turned CONFIG_LOCKDEP on for test since I didn't get too
>>much information when we tried to figured out the former deadlock.
>>
>>IGMP used read_lock() instead of read_lock_bh() since usually
>>read_lock() can be called recursively, and today I've read the
>>implementation of MIPS, it's should also works fine in that situation.
>>The implementation of TILEPro cause problem since after it use TNS set
>>the lock-val to 1 and hold the original value and before it re-set
>>lock-val a new value, it a race condition window.
>>
>
> I see no reason why you can't call read_lock_bh() recursively,
> read_lock_bh() is roughly equalent to local_bh_disable() + read_lock(),
> both can be recursive.
>
> But I may miss something here. :-/
>

Of course read_lock_bh() can be called recursively, but read_lock() is
already enough for IGMP, the only reason for that deadlock is because
using TNS instruction set the value to 1 cause another race condition.

-- 
Cyberman Wu

^ permalink raw reply

* Re: [RFC, PATCH 1/4] net: sh_eth: modify the definitions of register
From: Yoshihiro Shimoda @ 2011-02-17  6:51 UTC (permalink / raw)
  To: Nobuhiro Iwamatsu; +Cc: netdev, SH-Linux
In-Reply-To: <AANLkTikxYtG620D4z09k15v7=r5CZ7xCcDWenbJmM_sP@mail.gmail.com>

2011/02/17 9:56, Nobuhiro Iwamatsu wrote:
> 2011/2/17 Nobuhiro Iwamatsu <iwamatsu@nigauri.org>:
>>> +static const u16 *sh_eth_get_register_offset(int register_type)
>>> +{
>>> +       const u16 *reg_offset = NULL;
>>> +
>>> +       switch (register_type) {
>>> +       case SH_ETH_REG_GIGABIT:
>>> +               reg_offset = sh_eth_offset_gigabit;
>>> +               break;
>>> +       case SH_ETH_REG_FAST_SH4:
>>> +               reg_offset = sh_eth_offset_fast_sh4;
>>> +               break;
>>> +       case SH_ETH_REG_FAST_SH3_SH2:
>>> +               reg_offset = sh_eth_offset_fast_sh3_sh2;
>>> +               break;
>>> +       case SH_ETH_REG_DEFAULT:
>>> +#if defined(CONFIG_CPU_SUBTYPE_SH7763)
>>> +               reg_offset = sh_eth_offset_gigabit;
>>> +#elif defined(CONFIG_CPU_SH4)
>>> +               reg_offset = sh_eth_offset_fast_sh4;
>>> +#else
>>> +               reg_offset = sh_eth_offset_fast_sh3_sh2;
>>> +#endif
>>> +               break;
>>> +       default:
>>> +               printk(KERN_ERR "Unknown register type (%d)\n", register_type);
>>> +               break;
>>> +       }
>>> +
>>> +       return reg_offset;
>>> +}
>>> +
>>
>> Is the handling of SH_ETH_REG_DEFAULT necessary?
Thank you for your review!
No, it isn't necessary. I will remove it and I will add NULL checking of mdp->reg_offset.

Best regards,
Yoshihiro Shimoda

^ permalink raw reply

* Re: [RFC, PATCH 2/4] net: sh_eth: remove the SH_TSU_ADDR
From: Yoshihiro Shimoda @ 2011-02-17  6:51 UTC (permalink / raw)
  To: Nobuhiro Iwamatsu; +Cc: netdev, SH-Linux
In-Reply-To: <AANLkTimJM0yEOU6RgoAs+Njv5b7ixVRnPubnUg=F72GX@mail.gmail.com>

2011/02/17 9:24, Nobuhiro Iwamatsu wrote:
>> @@ -1559,7 +1565,9 @@ out:
>>  static int sh_eth_drv_remove(struct platform_device *pdev)
>>  {
>>        struct net_device *ndev = platform_get_drvdata(pdev);
>> +       struct sh_eth_private *mdp = netdev_priv(ndev);
>>
>> +       iounmap(mdp->tsu_addr);
>>        sh_mdio_release(ndev);
>>        unregister_netdev(ndev);
>>        pm_runtime_disable(&pdev->dev);
> 
> You forget to fix for ARSTR.
Thank you for your point! I will fix it.

Best regards,
Yoshihiro Shimoda

^ permalink raw reply

* Re: [PATCH] drivers/net: Call netif_carrier_off at the end of the probe
From: Ivan Vecera @ 2011-02-17  8:24 UTC (permalink / raw)
  To: Francois Romieu, davem; +Cc: netdev, aabdulla, Ben Hutchings
In-Reply-To: <483985247.35721.1297792863958.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com>

On Tue, 2011-02-15 at 13:01 -0500, Ivan Vecera wrote:
> ----- Original Message -----
> > On Tue, 2011-02-15 at 16:22 +0100, Francois Romieu wrote:
> > > Stated this way it sounds like a core dev layer issue.
> > 
> > ...
> > > I am not completely sure after reading some history. Namely:
> > > - (37e8273cd30592d3a82bcb70cbb1bdc4eaeb6b71 ?)
> > > - c276e098d3ee33059b4a1c747354226cec58487c
> > > - 22604c866889c4b2e12b73cbf1683bda1b72a313
> > > - b47300168e770b60ab96c8924854c3b0eb4260eb
> > >
> > > I am confused.
> > 
> > Drivers that can report carrier state should do so initially some time
> > between registering a device and bringing it up (either in the bus
> > probe
> > function or the ndo_open function). It generally seems to be safe to
> > assume that the link is down initially, and then to rely on
> > notifications from the hardware. However, that does depend on the
> > behaviour of the hardware.
> > 
> Yes,that's true... forcedeth and r8169 are the drivers that detect link
> state when device is opened and call netif_carrier_on(off) appropriately.
> 
> Ivan
Dave, Francois, is it acceptable?

Ivan


^ permalink raw reply

* Re: [PATCH v2] bnx2x: Support for managing RX indirection table
From: Eric Dumazet @ 2011-02-17  9:09 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, eilong, netdev
In-Reply-To: <alpine.DEB.2.00.1102161210190.31937@pokey.mtv.corp.google.com>

Le mercredi 16 février 2011 à 12:27 -0800, Tom Herbert a écrit :
> Support fetching and retrieving RX indirection table via ethtool.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
>  drivers/net/bnx2x/bnx2x.h         |    2 +
>  drivers/net/bnx2x/bnx2x_ethtool.c |   56 +++++++++++++++++++++++++++++++++++++
>  drivers/net/bnx2x/bnx2x_main.c    |   22 +++++++++++---
>  3 files changed, 75 insertions(+), 5 deletions(-)

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>


As this is a bit unusual stuff (ethtool -X), I guess some hints can be
useful to testers and admins :

I tested this patch on my dev machine

Broadcom Corporation NetXtreme II BCM57711E 10Gigabit PCIe
(Hewlett-Packard Company NC532i Dual Port 10GbE Multifunction BL-C
Adapter)


# ethtool -X eth1 weight 0 1 0 2 0 4 0 8 0 16 0 32
# ethtool -x eth1 
RX flow hash indirection table for eth1 with 16 RX ring(s):
    0:      1     1     3     3     3     3     5     5
    8:      5     5     5     5     5     5     7     7
   16:      7     7     7     7     7     7     7     7
   24:      7     7     7     7     7     7     9     9
   32:      9     9     9     9     9     9     9     9
   40:      9     9     9     9     9     9     9     9
   48:      9     9     9     9     9     9     9     9
   56:      9     9     9     9     9     9    11    11
   64:     11    11    11    11    11    11    11    11
   72:     11    11    11    11    11    11    11    11
   80:     11    11    11    11    11    11    11    11
   88:     11    11    11    11    11    11    11    11
   96:     11    11    11    11    11    11    11    11
  104:     11    11    11    11    11    11    11    11
  112:     11    11    11    11    11    11    11    11
  120:     11    11    11    11    11    11    11    11


After some (distributed) trafic on eth1 I get :

[root@svivoipvnx021 ~]# ethtool -S eth1|grep rx_ucast_packets
     [0]: rx_ucast_packets: 62
     [1]: rx_ucast_packets: 15870
     [2]: rx_ucast_packets: 9
     [3]: rx_ucast_packets: 31507
     [4]: rx_ucast_packets: 0
     [5]: rx_ucast_packets: 63106
     [6]: rx_ucast_packets: 0
     [7]: rx_ucast_packets: 122051
     [8]: rx_ucast_packets: 1
     [9]: rx_ucast_packets: 248071
     [10]: rx_ucast_packets: 0
     [11]: rx_ucast_packets: 519864
     [12]: rx_ucast_packets: 0
     [13]: rx_ucast_packets: 0
     [14]: rx_ucast_packets: 0
     [15]: rx_ucast_packets: 0
     rx_ucast_packets: 1000541



^ permalink raw reply

* Re: [PATCH 1/1] ARC VMAC ethernet driver.
From: Andreas Fenkart @ 2011-02-17  9:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev
In-Reply-To: <1297764149.3201.4.camel@edumazet-laptop>

fixed, pls find new version in reply to this mail.

2011/2/15 Eric Dumazet <eric.dumazet@gmail.com>:
> Le mardi 15 février 2011 à 10:31 +0100, Andreas Fenkart a écrit :
>> Signed-off-by: Andreas Fenkart <afenkart@gmail.com>
>
>
>> +             processed++;
>> +             skb->dev = dev;
>        eth_type_trans() already sets "skb->dev = dev;"
>
>> +             skb->protocol = eth_type_trans(skb, dev);
>> +             ap->stats.rx_packets++;
> Hmm, why dont you use dev->stats internal structure ? No need to
> maintain a shadow in ap->stats.
>
>> +             ap->stats.rx_bytes += skb->len;
>> +             dev->last_rx = jiffies;
>        /* last_rx = jiffies; not needed anymore  */
>
>
>> +             netif_rx(skb);
>
> A NAPI driver should use netif_receive_skb(), not netif_rx()
>
>
>
>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox