Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] bonding: fix to rejoin multicast groups immediately
From: David Miller @ 2010-10-05 22:09 UTC (permalink / raw)
  To: fbl; +Cc: netdev
In-Reply-To: <20101005220436.GA19931@redhat.com>

From: Flavio Leitner <fbl@redhat.com>
Date: Tue, 5 Oct 2010 19:04:36 -0300

> Sure. I'll post replying to another thread so that the patch
> can be applied in correct order.
> 
> thanks for reviewing it

No problem, thank you.

^ permalink raw reply

* Re: [PATCH v2] bonding: rejoin multicast groups on VLANs
From: Flavio Leitner @ 2010-10-05 22:07 UTC (permalink / raw)
  To: Andy Gospodarek; +Cc: netdev, bonding-devel, Jay Vosburgh
In-Reply-To: <20101004132409.GA7497@gospo.rdu.redhat.com>

On Mon, Oct 04, 2010 at 09:24:10AM -0400, Andy Gospodarek wrote:
> On Thu, Sep 30, 2010 at 05:45:24PM -0300, Flavio Leitner wrote:
> > It fixes bonding to rejoin multicast groups added
> > to VLAN devices on top of bonding when a failover
> > happens.
> > 
> > Signed-off-by: Flavio Leitner <fleitner@redhat.com>
> > ---
> >  drivers/net/bonding/bond_main.c |   61 +++++++++++++++++++++++++++++++++-----
> >  drivers/net/bonding/bonding.h   |    1 +
> >  2 files changed, 54 insertions(+), 8 deletions(-)
> > 
> [...]
> > @@ -944,7 +979,9 @@ static void bond_mc_swap(struct bonding *bond, struct slave *new_active,
> >  
> >  		netdev_for_each_mc_addr(ha, bond->dev)
> >  			dev_mc_add(new_active->dev, ha->addr);
> > -		bond_resend_igmp_join_requests(bond);
> > +
> > +		/* rejoin multicast groups */
> > +		queue_delayed_work(bond->wq, &bond->mcast_work, 0);
> >  	}
> >  }
> >  
> 
> I was hoping that one patch would just make the changes so
> retransmission was supported on VLANs and the second patch would queue
> the work and add the tunable for multiple retransmissions, but I guess
> it wasn't clear enough.
> 
> I felt like that would divide the patches up into the bug-fix (VLANs +
> multicast not working) and the new feature (multiple retransmissions
> from the workqueue).

I'm not seeing how to that because the first patch changes to send
the membership directly and not using timers anymore, so the call
trace usually is:

 write_lock_bh(&bond->curr_slave_lock); <-- hold lock
 bond_select_active_slave() 
  ->bond_change_active_slave()
     ->bond_mc_swap()
        ->tx membership reports
 

then trying to send the IGMP packet directly, it will end up at:
  bond_start_xmit()
   ->bond_xmit_activebackup() (mode=1, for example)
      ->read_lock(&bond->lock);
      ->read_lock(&bond->curr_slave_lock); <-- deadlock


> I will test these out this morning and make sure things look good.

I have a better patch sequence which I'm planning post replying to
this thread as soon as I finish up testing them. It is still using
workqueue though.

--
Flavio

^ permalink raw reply

* Re: [PATCH] bonding: fix to rejoin multicast groups immediately
From: Flavio Leitner @ 2010-10-05 22:04 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20101005.132525.186293128.davem@davemloft.net>

On Tue, Oct 05, 2010 at 01:25:25PM -0700, David Miller wrote:
> From: Flavio Leitner <fbl@redhat.com>
> Date: Tue, 5 Oct 2010 11:34:30 -0300
> 
> > On Tue, Oct 05, 2010 at 12:13:38AM -0700, David Miller wrote:
> >> From: Flavio Leitner <fleitner@redhat.com>
> >> Date: Wed, 29 Sep 2010 04:12:07 -0300
> >> 
> >> > It should rejoin multicast groups immediately when
> >> > the failover happens to restore the multicast traffic.
> >> > 
> >> > Signed-off-by: Flavio Leitner <fleitner@redhat.com>
> >> 
> >> I suspect the IGMPv3 handling via a delayed action, as is currently
> >> implemented, is on purpose and is done so to follow the specification
> >> of the IGMPv3 RFCs.
> >> 
> >> Therefore you have to explain why your new behavior is so desirable
> >> and in particular why something as undesirable as violating the RFCs
> >> is therefore warranted.
> > 
> > That patch only changes the behavior for bonding during a link
> > failure, so if we have a bonding in active-backup or any other mode
> > with current-active-slave, the initialization will happen just fine
> > following IGMP specs.
> > 
> > However, neither the backup slave interface nor the backup switch
> > connected to backup slave knows about mcast. Thus when a link failure
> > happens, we shouldn't rely on timers to not stay out of the mcast
> > group losing traffic.
> > 
> > E.g. The V1 specs says that we shouldn't send any membership report
> > if it has been one in the last minute because that means the switch
> > is notified and the system will receive mcast traffic for that group. 
> > Therefore, if it sees one and a link failure happens right after that,
> > the backup slave will send another membership report only one minute
> > later. During this time the system loses traffic.
> 
> All of this valuable information belongs in the commit log message.
> 
> Please resubmit your patch with all of the necessary contextual
> information and reasoning explaining in the commit message.

Sure. I'll post replying to another thread so that the patch
can be applied in correct order.

thanks for reviewing it
-- 
Flavio

^ permalink raw reply

* Re: [PATCH] SIW: Module initialization
From: Stephen Hemminger @ 2010-10-05 22:02 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Bernard Metzler, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <AANLkTik5=i+A5_OpU0rVyfYi=ibgS9UWMu82vMCmM=PN-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, 5 Oct 2010 12:57:21 +0200
Bart Van Assche <bvanassche-HInyCGIudOg@public.gmane.org> wrote:

> > + * TODO: Dynamic device management (network device registration/removal).  
> 
> The current implementation is such that one siw device is created for
> each network device found at kernel module load time. That means that
> you force the user to load the siw kernel module after all other
> kernel modules that register a network device. I'm not sure that's a
> good idea.

Then device should be controlled by a netlink (rtnl_link_ops) style
interface see vlan_netlink.c. Using netlink is extensible and provides
a cleaner interface than all these other parameterization methods.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] skge: add quirk to limit DMA
From: Stephen Hemminger @ 2010-10-05 21:58 UTC (permalink / raw)
  To: David Miller; +Cc: sgruszka, netdev, luya
In-Reply-To: <20101005.001800.13732127.davem@davemloft.net>

On Tue, 05 Oct 2010 00:18:00 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Stanislaw Gruszka <sgruszka@redhat.com>
> Date: Wed, 29 Sep 2010 11:33:23 +0200
> 
> > Skge devices installed on some Gigabyte motherboards are not able to
> > perform 64 dma correctly due to board PCI implementation, so limit
> > DMA to 32bit if such boards are detected.
> > 
> > Bug was reported here:
> > https://bugzilla.redhat.com/show_bug.cgi?id=447489
> > 
> > Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
> > Tested-by: Luya Tshimbalanga <luya@fedoraproject.org>
> 
> Stephen?  Can I get an ACK or some kind of other status on this?

I was hoping to find the hardware somewhere to dig deeper into
this. But until I know more please apply the patch. There are
two possibilities that still exist, 1) it is true for all devices
on this motherboard (in which cases it should be a PCI quirk),
2) it is a driver bug. The test was going to be putting a skge
pci card in a slot on the MB.


^ permalink raw reply

* Re: [PATCH] AF_UNIX: Implement SO_TIMESTAMP and SO_TIMETAMPNS on Unix sockets
From: David Miller @ 2010-10-05 21:55 UTC (permalink / raw)
  To: eric.dumazet
  Cc: alban.crequy, lennart, shemminger, gorcunov, adobriyan,
	linux-kernel, netdev
In-Reply-To: <1286220052.2381.70.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 04 Oct 2010 21:20:52 +0200

> Le lundi 04 octobre 2010 à 19:48 +0100, Alban Crequy a écrit :
>> Userspace applications can already request to receive timestamps with:
>> setsockopt(sockfd, SOL_SOCKET, SO_TIMESTAMP, ...)
>> 
>> Although setsockopt() returns zero (success), timestamps are not added to the
>> ancillary data. This patch fixes that on SOCK_DGRAM and SOCK_SEQPACKET Unix
>> sockets.
>> 
>> Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] net neigh: RCU conversion of neigh hash table
From: David Miller @ 2010-10-05 21:55 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1286208944.18293.349.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 04 Oct 2010 18:15:44 +0200

> [PATCH net-next] net neigh: RCU conversion of neigh hash table
> 
> Instead of storing hash_buckets, hash_mask and hash_rnd in "struct
> neigh_table", a new structure is defined :
> 
> struct neigh_hash_table {
>        struct neighbour        **hash_buckets;
>        unsigned int            hash_mask;
>        __u32                   hash_rnd;
>        struct rcu_head         rcu;
> };
> 
> And "struct neigh_table" has an RCU protected pointer to such a
> neigh_hash_table.
> 
> This means the signature of (*hash)() function changed: We need to add a
> third parameter with the actual hash_rnd value, since this is not
> anymore a neigh_table field.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Also applied.

Looking good so far! :)

^ permalink raw reply

* Re: [PATCH net-next] net neigh: neigh_delete() and neigh_add() changes
From: David Miller @ 2010-10-05 21:55 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1286202456.18293.293.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 04 Oct 2010 16:27:36 +0200

> neigh_delete() and neigh_add() dont need to touch device refcount,
> we hold RTNL when calling them, so device cannot disappear under us.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [Patch] Limit sysctl_tcp_mem and sysctl_udp_mem initializers to prevent integer overflows.
From: David Miller @ 2010-10-05 21:50 UTC (permalink / raw)
  To: eric.dumazet
  Cc: holt, akpm, w, linux-kernel, netdev, kuznet, pekkas, jmorris,
	yoshfuji, kaber
In-Reply-To: <1286025736.2582.1827.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 02 Oct 2010 15:22:16 +0200

> [PATCH] net: avoid limits overflow
> 
> Robin Holt tried to boot a 16TB machine and found some limits were
> reached : sysctl_tcp_mem[2], sysctl_udp_mem[2]
> 
> We can switch infrastructure to use long "instead" of "int", now
> atomic_long_t primitives are available for free.
> 
> Reported-by: Robin Holt <holt@sgi.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Eric please resubmit this when the sysctl fix is resolved.

Thanks.

^ permalink raw reply

* Re: [Patch] Limit sysctl_tcp_mem and sysctl_udp_mem initializers to prevent integer overflows.
From: David Miller @ 2010-10-05 21:50 UTC (permalink / raw)
  To: holt; +Cc: w, linux-kernel, netdev, kuznet, pekkas, jmorris, yoshfuji, kaber
In-Reply-To: <20101002112419.248437367@gulag1.americas.sgi.com>

From: Robin Holt <holt@sgi.com>
Date: Sat, 02 Oct 2010 06:24:06 -0500

> Subject: [Patch] Limit sysctl_tcp_mem and sysctl_udp_mem initializers to prevent integer overflows.
> 
> On a 16TB x86_64 machine, sysctl_tcp_mem[2], sysctl_udp_mem[2], and
> sysctl_sctp_mem[2] can integer overflow.  Set limit such that they are
> maximized without overflowing.
> 
> Signed-off-by: Robin Holt <holt@sgi.com>

Robin please resubmit this with the SCTP bits included.

^ permalink raw reply

* Re: [PATCH net-next] net: add a core netdev->rx_dropped counter
From: David Miller @ 2010-10-05 21:48 UTC (permalink / raw)
  To: eric.dumazet; +Cc: jesse, rl, netdev, kaber
In-Reply-To: <1285916815.2705.152.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 01 Oct 2010 09:06:55 +0200

> [PATCH net-next] net: add a core netdev->rx_dropped counter
> 
> In various situations, a device provides a packet to our stack and we
> drop it before it enters protocol stack :
> - softnet backlog full (accounted in /proc/net/softnet_stat)
> - bad vlan tag (not accounted)
> - unknown/unregistered protocol (not accounted)
> 
> We can handle a per-device counter of such dropped frames at core level,
> and automatically adds it to the device provided stats (rx_dropped), so
> that standard tools can be used (ifconfig, ip link, cat /proc/net/dev)
> 
> This is a generalization of commit 8990f468a (net: rx_dropped
> accounting), thus reverting it.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Looks good to me, applied, thanks.

^ permalink raw reply

* Re: BUG - qdev - partial loss of network connectivity
From: Leszek Urbanski @ 2010-10-05 21:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Anthony Liguori, qemu-devel-qX2TKyscuCcdnm+yROfE0A,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
In-Reply-To: <20100927213203.GA28089-4cL5GMJPxME@public.gmane.org>

<20100927213203.GA28089-4cL5GMJPxME@public.gmane.org>; from Leszek Urbanski on Mon, Sep 27, 2010 at 23:32:03 +0200

> > > > >It's vanilla 2.6.32.22, but I also reproduced this on Debian's 2.6.32-23
> > > > >(based on 2.6.32.21).
> > > > >
> > > > >If offload is the only difference, I'll play with different offload
> > > > >options and check which one causes it.
> > > > >   
> > > > 
> > > > It's not technically the only difference but it's the most likely 
> > > > culprit IMHO.
> > > 
> > > udp fragmentation offload is definitely the culprit.
> > 
> > I see. Most likely guest bug - won't be the first bug around UFO.
> > If so pls copy netdev linux-nfs and virtualization.
> > Do you see anything in dmesg? Can try 2.6.36-rc5?
> 
> (for reference: first post is at:
> http://lists.nongnu.org/archive/html/qemu-devel/2010-09/msg01685.html )
> 
> I can't reproduce it on 2.6.36-rc5. Do you have an idea which patch may have

This one definitely fixes it: http://patchwork.ozlabs.org/patch/55643/

(Herbert Xu: udp: Fix bogus UFO packet generation)

The patch works with 2.6.32.24 too - it should probably be backported to
2.6.32.x.


-- 
Leszek "Tygrys" Urbanski, SCSA, SCNA
 "Unix-to-Unix Copy Program;" said PDP-1. "You will never find a more
  wretched hive of bugs and flamers. We must be cautious." -- DECWARS
     http://cygnus.moo.pl/ -- Cygnus High Altitude Balloon
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [net-next PATCH] igb: update adapter stats when reading /proc/net/dev.
From: Jesper Dangaard Brouer @ 2010-10-05 21:16 UTC (permalink / raw)
  To: Eric Dumazet, Alexander Duyck
  Cc: Jesper Dangaard Brouer, David S. Miller, netdev, Jeff Kirsher
In-Reply-To: <1286312479.2593.35.camel@edumazet-laptop>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1133 bytes --]

On Tue, 5 Oct 2010, Eric Dumazet wrote:

> Le mardi 05 octobre 2010 à 17:19 +0200, Eric Dumazet a écrit :
>> Le mardi 05 octobre 2010 à 16:53 +0200, Jesper Dangaard Brouer a écrit :
>>
>>> Its already racy, because "ethtool -S" reads out the stats immediately,
>>> and thus races with the timer.
>>>
>>> See: igb_ethtool.c
>>>  igb_get_ethtool_stats() invoke igb_update_stats(adapter);
>>>
>>
>> You would be surprised how many bugs are waiting to be found and
>> fixed ;)
>
> I took a look at igb stats, and it appears they also are racy on 32bit
> arches. igb uses u64 counters, with no synchronization between
> producers(writers) and consumers(readers).

Are you saying, that we need more than a simple mutex in 
igb_update_stats()?

> Some fixes are needed ;)

The question is then if Intel wants to fix it, or let it be up to you or 
me?

Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply

* Re: [net-next PATCH] igb: update adapter stats when reading /proc/net/dev.
From: Eric Dumazet @ 2010-10-05 21:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David S. Miller, netdev, Jeff Kirsher
In-Reply-To: <1286291947.2796.387.camel@edumazet-laptop>

Le mardi 05 octobre 2010 à 17:19 +0200, Eric Dumazet a écrit :
> Le mardi 05 octobre 2010 à 16:53 +0200, Jesper Dangaard Brouer a écrit :
> 
> > Its already racy, because "ethtool -S" reads out the stats immediately,
> > and thus races with the timer.
> > 
> > See: igb_ethtool.c
> >  igb_get_ethtool_stats() invoke igb_update_stats(adapter);
> > 
> 
> You would be surprised how many bugs are waiting to be found and
> fixed ;)
> 
> 

I took a look at igb stats, and it appears they also are racy on 32bit
arches. igb uses u64 counters, with no synchronization between
producers(writers) and consumers(readers).

Some fixes are needed ;)

^ permalink raw reply

* [PATCH net-next] fib: RCU conversion of fib_lookup()
From: Eric Dumazet @ 2010-10-05 20:41 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

fib_lookup() converted to be called in RCU protected context, no
reference taken and released on a contended cache line (fib_clntref)

fib_table_lookup() and fib_semantic_match() get an additional parameter.

struct fib_info gets an rcu_head field, and is freed after an rcu grace
period.

Stress test :
(Sending 160.000.000 UDP frames on same neighbour,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_HASH) (about same results for FIB_TRIE)

Before patch :

real	1m31.199s
user	0m13.761s
sys	23m24.780s

After patch:

real	1m5.375s
user	0m14.997s
sys	15m50.115s

Before patch Profile :

13044.00 15.4% __ip_route_output_key vmlinux
 8438.00 10.0% dst_destroy           vmlinux
 5983.00  7.1% fib_semantic_match    vmlinux
 5410.00  6.4% fib_rules_lookup      vmlinux
 4803.00  5.7% neigh_lookup          vmlinux
 4420.00  5.2% _raw_spin_lock        vmlinux
 3883.00  4.6% rt_set_nexthop        vmlinux
 3261.00  3.9% _raw_read_lock        vmlinux
 2794.00  3.3% fib_table_lookup      vmlinux
 2374.00  2.8% neigh_resolve_output  vmlinux
 2153.00  2.5% dst_alloc             vmlinux
 1502.00  1.8% _raw_read_lock_bh     vmlinux
 1484.00  1.8% kmem_cache_alloc      vmlinux
 1407.00  1.7% eth_header            vmlinux
 1406.00  1.7% ipv4_dst_destroy      vmlinux
 1298.00  1.5% __copy_from_user_ll   vmlinux
 1174.00  1.4% dev_queue_xmit        vmlinux
 1000.00  1.2% ip_output             vmlinux

After patch Profile :

13712.00 15.8% dst_destroy             vmlinux
 8548.00  9.9% __ip_route_output_key   vmlinux
 7017.00  8.1% neigh_lookup            vmlinux
 4554.00  5.3% fib_semantic_match      vmlinux
 4067.00  4.7% _raw_read_lock          vmlinux
 3491.00  4.0% dst_alloc               vmlinux
 3186.00  3.7% neigh_resolve_output    vmlinux
 3103.00  3.6% fib_table_lookup        vmlinux
 2098.00  2.4% _raw_read_lock_bh       vmlinux
 2081.00  2.4% kmem_cache_alloc        vmlinux
 2013.00  2.3% _raw_spin_lock          vmlinux
 1763.00  2.0% __copy_from_user_ll     vmlinux
 1763.00  2.0% ip_output               vmlinux
 1761.00  2.0% ipv4_dst_destroy        vmlinux
 1631.00  1.9% eth_header              vmlinux
 1440.00  1.7% _raw_read_unlock_bh     vmlinux

Reference results, if IP route cache is enabled :

real	0m29.718s
user	0m10.845s
sys	7m37.341s


25213.00 29.5% __ip_route_output_key   vmlinux
 9011.00 10.5% dst_release             vmlinux
 4817.00  5.6% ip_push_pending_frames  vmlinux
 4232.00  5.0% ip_finish_output        vmlinux
 3940.00  4.6% udp_sendmsg             vmlinux
 3730.00  4.4% __copy_from_user_ll     vmlinux
 3716.00  4.4% ip_route_output_flow    vmlinux
 2451.00  2.9% __xfrm_lookup           vmlinux
 2221.00  2.6% ip_append_data          vmlinux
 1718.00  2.0% _raw_spin_lock_bh       vmlinux
 1655.00  1.9% __alloc_skb             vmlinux
 1572.00  1.8% sock_wfree              vmlinux
 1345.00  1.6% kfree                   vmlinux


Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/fib_rules.h  |    2 +
 include/net/ip_fib.h     |   17 ++--------
 net/core/fib_rules.c     |    3 +
 net/ipv4/fib_frontend.c  |   27 ++++++++--------
 net/ipv4/fib_hash.c      |    5 +--
 net/ipv4/fib_lookup.h    |    2 -
 net/ipv4/fib_rules.c     |    3 +
 net/ipv4/fib_semantics.c |   21 ++++++++++---
 net/ipv4/fib_trie.c      |   10 +++---
 net/ipv4/route.c         |   59 +++++++++++++++----------------------
 10 files changed, 72 insertions(+), 77 deletions(-)

diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index ac2fd00..106f309 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -31,6 +31,8 @@ struct fib_lookup_arg {
 	void			*lookup_ptr;
 	void			*result;
 	struct fib_rule		*rule;
+	int			flags;
+#define FIB_LOOKUP_NOREF	1
 };
 
 struct fib_rules_ops {
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index c93f94e..ba3666d 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -86,6 +86,7 @@ struct fib_info {
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	int			fib_power;
 #endif
+	struct rcu_head		rcu;
 	struct fib_nh		fib_nh[0];
 #define fib_dev		fib_nh[0].nh_dev
 };
@@ -148,7 +149,7 @@ struct fib_table {
 };
 
 extern int fib_table_lookup(struct fib_table *tb, const struct flowi *flp,
-			    struct fib_result *res);
+			    struct fib_result *res, int fib_flags);
 extern int fib_table_insert(struct fib_table *, struct fib_config *);
 extern int fib_table_delete(struct fib_table *, struct fib_config *);
 extern int fib_table_dump(struct fib_table *table, struct sk_buff *skb,
@@ -185,11 +186,11 @@ static inline int fib_lookup(struct net *net, const struct flowi *flp,
 	struct fib_table *table;
 
 	table = fib_get_table(net, RT_TABLE_LOCAL);
-	if (!fib_table_lookup(table, flp, res))
+	if (!fib_table_lookup(table, flp, res, FIB_LOOKUP_NOREF))
 		return 0;
 
 	table = fib_get_table(net, RT_TABLE_MAIN);
-	if (!fib_table_lookup(table, flp, res))
+	if (!fib_table_lookup(table, flp, res, FIB_LOOKUP_NOREF))
 		return 0;
 	return -ENETUNREACH;
 }
@@ -254,16 +255,6 @@ static inline void fib_info_put(struct fib_info *fi)
 		free_fib_info(fi);
 }
 
-static inline void fib_res_put(struct fib_result *res)
-{
-	if (res->fi)
-		fib_info_put(res->fi);
-#ifdef CONFIG_IP_MULTIPLE_TABLES
-	if (res->r)
-		fib_rule_put(res->r);
-#endif
-}
-
 #ifdef CONFIG_PROC_FS
 extern int __net_init  fib_proc_init(struct net *net);
 extern void __net_exit fib_proc_exit(struct net *net);
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index cfb7d25..21698f8 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -225,7 +225,8 @@ jumped:
 			err = ops->action(rule, fl, flags, arg);
 
 		if (err != -EAGAIN) {
-			if (likely(atomic_inc_not_zero(&rule->refcnt))) {
+			if ((arg->flags & FIB_LOOKUP_NOREF) ||
+			    likely(atomic_inc_not_zero(&rule->refcnt))) {
 				arg->rule = rule;
 				goto out;
 			}
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index b05c23b..919f2ad 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -168,8 +168,11 @@ struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref)
 	struct fib_result res = { 0 };
 	struct net_device *dev = NULL;
 
-	if (fib_lookup(net, &fl, &res))
+	rcu_read_lock();
+	if (fib_lookup(net, &fl, &res)) {
+		rcu_read_unlock();
 		return NULL;
+	}
 	if (res.type != RTN_LOCAL)
 		goto out;
 	dev = FIB_RES_DEV(res);
@@ -177,7 +180,7 @@ struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref)
 	if (dev && devref)
 		dev_hold(dev);
 out:
-	fib_res_put(&res);
+	rcu_read_unlock();
 	return dev;
 }
 EXPORT_SYMBOL(__ip_dev_find);
@@ -207,11 +210,12 @@ static inline unsigned __inet_dev_addr_type(struct net *net,
 	local_table = fib_get_table(net, RT_TABLE_LOCAL);
 	if (local_table) {
 		ret = RTN_UNICAST;
-		if (!fib_table_lookup(local_table, &fl, &res)) {
+		rcu_read_lock();
+		if (!fib_table_lookup(local_table, &fl, &res, FIB_LOOKUP_NOREF)) {
 			if (!dev || dev == res.fi->fib_dev)
 				ret = res.type;
-			fib_res_put(&res);
 		}
+		rcu_read_unlock();
 	}
 	return ret;
 }
@@ -235,6 +239,7 @@ EXPORT_SYMBOL(inet_dev_addr_type);
  * - figure out what "logical" interface this packet arrived
  *   and calculate "specific destination" address.
  * - check, that packet arrived from expected physical interface.
+ * called with rcu_read_lock()
  */
 int fib_validate_source(__be32 src, __be32 dst, u8 tos, int oif,
 			struct net_device *dev, __be32 *spec_dst,
@@ -259,7 +264,6 @@ int fib_validate_source(__be32 src, __be32 dst, u8 tos, int oif,
 	struct net *net;
 
 	no_addr = rpf = accept_local = 0;
-	rcu_read_lock();
 	in_dev = __in_dev_get_rcu(dev);
 	if (in_dev) {
 		no_addr = in_dev->ifa_list == NULL;
@@ -268,7 +272,6 @@ int fib_validate_source(__be32 src, __be32 dst, u8 tos, int oif,
 		if (mark && !IN_DEV_SRC_VMARK(in_dev))
 			fl.mark = 0;
 	}
-	rcu_read_unlock();
 
 	if (in_dev == NULL)
 		goto e_inval;
@@ -278,7 +281,7 @@ int fib_validate_source(__be32 src, __be32 dst, u8 tos, int oif,
 		goto last_resort;
 	if (res.type != RTN_UNICAST) {
 		if (res.type != RTN_LOCAL || !accept_local)
-			goto e_inval_res;
+			goto e_inval;
 	}
 	*spec_dst = FIB_RES_PREFSRC(res);
 	fib_combine_itag(itag, &res);
@@ -299,10 +302,8 @@ int fib_validate_source(__be32 src, __be32 dst, u8 tos, int oif,
 #endif
 	if (dev_match) {
 		ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST;
-		fib_res_put(&res);
 		return ret;
 	}
-	fib_res_put(&res);
 	if (no_addr)
 		goto last_resort;
 	if (rpf == 1)
@@ -315,7 +316,6 @@ int fib_validate_source(__be32 src, __be32 dst, u8 tos, int oif,
 			*spec_dst = FIB_RES_PREFSRC(res);
 			ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST;
 		}
-		fib_res_put(&res);
 	}
 	return ret;
 
@@ -326,8 +326,6 @@ last_resort:
 	*itag = 0;
 	return 0;
 
-e_inval_res:
-	fib_res_put(&res);
 e_inval:
 	return -EINVAL;
 e_rpf:
@@ -873,15 +871,16 @@ static void nl_fib_lookup(struct fib_result_nl *frn, struct fib_table *tb)
 		local_bh_disable();
 
 		frn->tb_id = tb->tb_id;
-		frn->err = fib_table_lookup(tb, &fl, &res);
+		rcu_read_lock();
+		frn->err = fib_table_lookup(tb, &fl, &res, FIB_LOOKUP_NOREF);
 
 		if (!frn->err) {
 			frn->prefixlen = res.prefixlen;
 			frn->nh_sel = res.nh_sel;
 			frn->type = res.type;
 			frn->scope = res.scope;
-			fib_res_put(&res);
 		}
+		rcu_read_unlock();
 		local_bh_enable();
 	}
 }
diff --git a/net/ipv4/fib_hash.c b/net/ipv4/fib_hash.c
index 4ed7e0d..83cca68 100644
--- a/net/ipv4/fib_hash.c
+++ b/net/ipv4/fib_hash.c
@@ -244,7 +244,8 @@ fn_new_zone(struct fn_hash *table, int z)
 }
 
 int fib_table_lookup(struct fib_table *tb,
-		     const struct flowi *flp, struct fib_result *res)
+		     const struct flowi *flp, struct fib_result *res,
+		     int fib_flags)
 {
 	int err;
 	struct fn_zone *fz;
@@ -264,7 +265,7 @@ int fib_table_lookup(struct fib_table *tb,
 
 			err = fib_semantic_match(&f->fn_alias,
 						 flp, res,
-						 fz->fz_order);
+						 fz->fz_order, fib_flags);
 			if (err <= 0)
 				goto out;
 		}
diff --git a/net/ipv4/fib_lookup.h b/net/ipv4/fib_lookup.h
index 637b133..b9c9a9f 100644
--- a/net/ipv4/fib_lookup.h
+++ b/net/ipv4/fib_lookup.h
@@ -22,7 +22,7 @@ struct fib_alias {
 /* Exported by fib_semantics.c */
 extern int fib_semantic_match(struct list_head *head,
 			      const struct flowi *flp,
-			      struct fib_result *res, int prefixlen);
+			      struct fib_result *res, int prefixlen, int fib_flags);
 extern void fib_release_info(struct fib_info *);
 extern struct fib_info *fib_create_info(struct fib_config *cfg);
 extern int fib_nh_match(struct fib_config *cfg, struct fib_info *fi);
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index 3230052..7981a24 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -57,6 +57,7 @@ int fib_lookup(struct net *net, struct flowi *flp, struct fib_result *res)
 {
 	struct fib_lookup_arg arg = {
 		.result = res,
+		.flags = FIB_LOOKUP_NOREF,
 	};
 	int err;
 
@@ -94,7 +95,7 @@ static int fib4_rule_action(struct fib_rule *rule, struct flowi *flp,
 	if (!tbl)
 		goto errout;
 
-	err = fib_table_lookup(tbl, flp, (struct fib_result *) arg->result);
+	err = fib_table_lookup(tbl, flp, (struct fib_result *) arg->result, arg->flags);
 	if (err > 0)
 		err = -EAGAIN;
 errout:
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index ba52f39..0f80dfc 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -148,6 +148,13 @@ static const struct
 
 /* Release a nexthop info record */
 
+static void free_fib_info_rcu(struct rcu_head *head)
+{
+	struct fib_info *fi = container_of(head, struct fib_info, rcu);
+
+	kfree(fi);
+}
+
 void free_fib_info(struct fib_info *fi)
 {
 	if (fi->fib_dead == 0) {
@@ -161,7 +168,7 @@ void free_fib_info(struct fib_info *fi)
 	} endfor_nexthops(fi);
 	fib_info_cnt--;
 	release_net(fi->fib_net);
-	kfree(fi);
+	call_rcu(&fi->rcu, free_fib_info_rcu);
 }
 
 void fib_release_info(struct fib_info *fi)
@@ -553,6 +560,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 			nh->nh_scope = RT_SCOPE_LINK;
 			return 0;
 		}
+		rcu_read_lock();
 		{
 			struct flowi fl = {
 				.nl_u = {
@@ -568,8 +576,10 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 			if (fl.fl4_scope < RT_SCOPE_LINK)
 				fl.fl4_scope = RT_SCOPE_LINK;
 			err = fib_lookup(net, &fl, &res);
-			if (err)
+			if (err) {
+				rcu_read_unlock();
 				return err;
+			}
 		}
 		err = -EINVAL;
 		if (res.type != RTN_UNICAST && res.type != RTN_LOCAL)
@@ -585,7 +595,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 			goto out;
 		err = 0;
 out:
-		fib_res_put(&res);
+		rcu_read_unlock();
 		return err;
 	} else {
 		struct in_device *in_dev;
@@ -879,7 +889,7 @@ failure:
 
 /* Note! fib_semantic_match intentionally uses  RCU list functions. */
 int fib_semantic_match(struct list_head *head, const struct flowi *flp,
-		       struct fib_result *res, int prefixlen)
+		       struct fib_result *res, int prefixlen, int fib_flags)
 {
 	struct fib_alias *fa;
 	int nh_sel = 0;
@@ -943,7 +953,8 @@ out_fill_res:
 	res->type = fa->fa_type;
 	res->scope = fa->fa_scope;
 	res->fi = fa->fa_info;
-	atomic_inc(&res->fi->fib_clntref);
+	if (!(fib_flags & FIB_LOOKUP_NOREF))
+		atomic_inc(&res->fi->fib_clntref);
 	return 0;
 }
 
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index a96e5ec..271c89b 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1342,7 +1342,7 @@ err:
 /* should be called with rcu_read_lock */
 static int check_leaf(struct trie *t, struct leaf *l,
 		      t_key key,  const struct flowi *flp,
-		      struct fib_result *res)
+		      struct fib_result *res, int fib_flags)
 {
 	struct leaf_info *li;
 	struct hlist_head *hhead = &l->list;
@@ -1356,7 +1356,7 @@ static int check_leaf(struct trie *t, struct leaf *l,
 		if (l->key != (key & ntohl(mask)))
 			continue;
 
-		err = fib_semantic_match(&li->falh, flp, res, plen);
+		err = fib_semantic_match(&li->falh, flp, res, plen, fib_flags);
 
 #ifdef CONFIG_IP_FIB_TRIE_STATS
 		if (err <= 0)
@@ -1372,7 +1372,7 @@ static int check_leaf(struct trie *t, struct leaf *l,
 }
 
 int fib_table_lookup(struct fib_table *tb, const struct flowi *flp,
-		     struct fib_result *res)
+		     struct fib_result *res, int fib_flags)
 {
 	struct trie *t = (struct trie *) tb->tb_data;
 	int ret;
@@ -1399,7 +1399,7 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi *flp,
 
 	/* Just a leaf? */
 	if (IS_LEAF(n)) {
-		ret = check_leaf(t, (struct leaf *)n, key, flp, res);
+		ret = check_leaf(t, (struct leaf *)n, key, flp, res, fib_flags);
 		goto found;
 	}
 
@@ -1424,7 +1424,7 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi *flp,
 		}
 
 		if (IS_LEAF(n)) {
-			ret = check_leaf(t, (struct leaf *)n, key, flp, res);
+			ret = check_leaf(t, (struct leaf *)n, key, flp, res, fib_flags);
 			if (ret > 0)
 				goto backtrace;
 			goto found;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 04e0df8..7864d0c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1773,12 +1773,15 @@ void ip_rt_get_source(u8 *addr, struct rtable *rt)
 
 	if (rt->fl.iif == 0)
 		src = rt->rt_src;
-	else if (fib_lookup(dev_net(rt->dst.dev), &rt->fl, &res) == 0) {
-		src = FIB_RES_PREFSRC(res);
-		fib_res_put(&res);
-	} else
-		src = inet_select_addr(rt->dst.dev, rt->rt_gateway,
+	else {
+		rcu_read_lock();
+		if (fib_lookup(dev_net(rt->dst.dev), &rt->fl, &res) == 0)
+			src = FIB_RES_PREFSRC(res);
+		else
+			src = inet_select_addr(rt->dst.dev, rt->rt_gateway,
 					RT_SCOPE_UNIVERSE);
+		rcu_read_unlock();
+	}
 	memcpy(addr, &src, 4);
 }
 
@@ -2081,6 +2084,7 @@ static int ip_mkroute_input(struct sk_buff *skb,
  *	Such approach solves two big problems:
  *	1. Not simplex devices are handled properly.
  *	2. IP spoofing attempts are filtered with 100% of guarantee.
+ *	called with rcu_read_lock()
  */
 
 static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
@@ -2102,7 +2106,6 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	unsigned	hash;
 	__be32		spec_dst;
 	int		err = -EINVAL;
-	int		free_res = 0;
 	struct net    * net = dev_net(dev);
 
 	/* IP on this device is disabled. */
@@ -2134,12 +2137,12 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	/*
 	 *	Now we are ready to route packet.
 	 */
-	if ((err = fib_lookup(net, &fl, &res)) != 0) {
+	err = fib_lookup(net, &fl, &res);
+	if (err != 0) {
 		if (!IN_DEV_FORWARD(in_dev))
 			goto e_hostunreach;
 		goto no_route;
 	}
-	free_res = 1;
 
 	RT_CACHE_STAT_INC(in_slow_tot);
 
@@ -2148,8 +2151,8 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 
 	if (res.type == RTN_LOCAL) {
 		err = fib_validate_source(saddr, daddr, tos,
-					     net->loopback_dev->ifindex,
-					     dev, &spec_dst, &itag, skb->mark);
+					  net->loopback_dev->ifindex,
+					  dev, &spec_dst, &itag, skb->mark);
 		if (err < 0)
 			goto martian_source_keep_err;
 		if (err)
@@ -2164,9 +2167,6 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 		goto martian_destination;
 
 	err = ip_mkroute_input(skb, &res, &fl, in_dev, daddr, saddr, tos);
-done:
-	if (free_res)
-		fib_res_put(&res);
 out:	return err;
 
 brd_input:
@@ -2226,7 +2226,7 @@ local_input:
 	rth->rt_type	= res.type;
 	hash = rt_hash(daddr, saddr, fl.iif, rt_genid(net));
 	err = rt_intern_hash(hash, rth, NULL, skb, fl.iif);
-	goto done;
+	goto out;
 
 no_route:
 	RT_CACHE_STAT_INC(in_no_route);
@@ -2249,21 +2249,21 @@ martian_destination:
 
 e_hostunreach:
 	err = -EHOSTUNREACH;
-	goto done;
+	goto out;
 
 e_inval:
 	err = -EINVAL;
-	goto done;
+	goto out;
 
 e_nobufs:
 	err = -ENOBUFS;
-	goto done;
+	goto out;
 
 martian_source:
 	err = -EINVAL;
 martian_source_keep_err:
 	ip_handle_martian_source(dev, in_dev, skb, daddr, saddr);
-	goto done;
+	goto out;
 }
 
 int ip_route_input_common(struct sk_buff *skb, __be32 daddr, __be32 saddr,
@@ -2349,6 +2349,7 @@ skip_cache:
 }
 EXPORT_SYMBOL(ip_route_input_common);
 
+/* called with rcu_read_lock() */
 static int __mkroute_output(struct rtable **result,
 			    struct fib_result *res,
 			    const struct flowi *fl,
@@ -2373,18 +2374,13 @@ static int __mkroute_output(struct rtable **result,
 	if (dev_out->flags & IFF_LOOPBACK)
 		flags |= RTCF_LOCAL;
 
-	rcu_read_lock();
 	in_dev = __in_dev_get_rcu(dev_out);
-	if (!in_dev) {
-		rcu_read_unlock();
+	if (!in_dev)
 		return -EINVAL;
-	}
+
 	if (res->type == RTN_BROADCAST) {
 		flags |= RTCF_BROADCAST | RTCF_LOCAL;
-		if (res->fi) {
-			fib_info_put(res->fi);
-			res->fi = NULL;
-		}
+		res->fi = NULL;
 	} else if (res->type == RTN_MULTICAST) {
 		flags |= RTCF_MULTICAST | RTCF_LOCAL;
 		if (!ip_check_mc(in_dev, oldflp->fl4_dst, oldflp->fl4_src,
@@ -2394,10 +2390,8 @@ static int __mkroute_output(struct rtable **result,
 		 * default one, but do not gateway in this case.
 		 * Yes, it is hack.
 		 */
-		if (res->fi && res->prefixlen < 4) {
-			fib_info_put(res->fi);
+		if (res->fi && res->prefixlen < 4)
 			res->fi = NULL;
-		}
 	}
 
 
@@ -2467,6 +2461,7 @@ static int __mkroute_output(struct rtable **result,
 	return 0;
 }
 
+/* called with rcu_read_lock() */
 static int ip_mkroute_output(struct rtable **rp,
 			     struct fib_result *res,
 			     const struct flowi *fl,
@@ -2509,7 +2504,6 @@ static int ip_route_output_slow(struct net *net, struct rtable **rp,
 	struct fib_result res;
 	unsigned int flags = 0;
 	struct net_device *dev_out = NULL;
-	int free_res = 0;
 	int err;
 
 
@@ -2636,15 +2630,12 @@ static int ip_route_output_slow(struct net *net, struct rtable **rp,
 		err = -ENETUNREACH;
 		goto out;
 	}
-	free_res = 1;
 
 	if (res.type == RTN_LOCAL) {
 		if (!fl.fl4_src)
 			fl.fl4_src = fl.fl4_dst;
 		dev_out = net->loopback_dev;
 		fl.oif = dev_out->ifindex;
-		if (res.fi)
-			fib_info_put(res.fi);
 		res.fi = NULL;
 		flags |= RTCF_LOCAL;
 		goto make_route;
@@ -2668,8 +2659,6 @@ static int ip_route_output_slow(struct net *net, struct rtable **rp,
 make_route:
 	err = ip_mkroute_output(rp, &res, &fl, oldflp, dev_out, flags);
 
-	if (free_res)
-		fib_res_put(&res);
 out:	return err;
 }
 



^ permalink raw reply related

* Re: [PATCH 3/3] e1000e.txt: Add e1000e documentation
From: David Miller @ 2010-10-05 20:28 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: rdunlap, netdev, linux-doc, gospo, bphilips
In-Reply-To: <20101005111726.23000.73997.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 05 Oct 2010 04:17:27 -0700

> Adds documentation for the e1000e networking driver.
> 
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [PATCH 2/3] e1000.txt: Update e1000 documentation
From: David Miller @ 2010-10-05 20:28 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: rdunlap, netdev, linux-doc, gospo, bphilips
In-Reply-To: <20101005111704.23000.60972.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 05 Oct 2010 04:17:05 -0700

> Updated the e1000 networking driver documentation.
> 
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [PATCH 1/3] ixgbevf.txt: Update ixgbevf documentation
From: David Miller @ 2010-10-05 20:28 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: rdunlap, netdev, linux-doc, gospo, bphilips
In-Reply-To: <20101005111643.23000.38976.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 05 Oct 2010 04:16:44 -0700

> Update the documentation for the ixgbevf (ixgbe virtual
> function driver).
> 
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-2.6 PATCH] MAINTAINERS: update Intel LAN Ethernet info
From: David Miller @ 2010-10-05 20:25 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, linux-kernel, gospo, bphilips
In-Reply-To: <20101005111438.22968.15130.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 05 Oct 2010 04:15:17 -0700

> - Add ixgbevf and docs files to the maintainers file
> 
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [PATCH] bonding: fix to rejoin multicast groups immediately
From: David Miller @ 2010-10-05 20:25 UTC (permalink / raw)
  To: fbl; +Cc: netdev
In-Reply-To: <20101005143430.GA13811@redhat.com>

From: Flavio Leitner <fbl@redhat.com>
Date: Tue, 5 Oct 2010 11:34:30 -0300

> On Tue, Oct 05, 2010 at 12:13:38AM -0700, David Miller wrote:
>> From: Flavio Leitner <fleitner@redhat.com>
>> Date: Wed, 29 Sep 2010 04:12:07 -0300
>> 
>> > It should rejoin multicast groups immediately when
>> > the failover happens to restore the multicast traffic.
>> > 
>> > Signed-off-by: Flavio Leitner <fleitner@redhat.com>
>> 
>> I suspect the IGMPv3 handling via a delayed action, as is currently
>> implemented, is on purpose and is done so to follow the specification
>> of the IGMPv3 RFCs.
>> 
>> Therefore you have to explain why your new behavior is so desirable
>> and in particular why something as undesirable as violating the RFCs
>> is therefore warranted.
> 
> That patch only changes the behavior for bonding during a link
> failure, so if we have a bonding in active-backup or any other mode
> with current-active-slave, the initialization will happen just fine
> following IGMP specs.
> 
> However, neither the backup slave interface nor the backup switch
> connected to backup slave knows about mcast. Thus when a link failure
> happens, we shouldn't rely on timers to not stay out of the mcast
> group losing traffic.
> 
> E.g. The V1 specs says that we shouldn't send any membership report
> if it has been one in the last minute because that means the switch
> is notified and the system will receive mcast traffic for that group. 
> Therefore, if it sees one and a link failure happens right after that,
> the backup slave will send another membership report only one minute
> later. During this time the system loses traffic.

All of this valuable information belongs in the commit log message.

Please resubmit your patch with all of the necessary contextual
information and reasoning explaining in the commit message.

Thanks.

^ permalink raw reply

* Re: [MeeGo-Dev][PATCH v3] Topcliff: Update PCH_CAN driver to 2.6.35
From: David Miller @ 2010-10-05 18:45 UTC (permalink / raw)
  To: masa-korg-ECg8zkTtlr0C6LszWs/t0g
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w,
	sameo-VuQAYsv1563Yd54FQh9/CA,
	margie.foster-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, wg-5Yr1BZd7O62+XT7JhA+gdA,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w, mkl-bIcnvbaLZ9MEGnE8C9+IrQ,
	chripell-VaTbYqLCNhc, morinaga526-ECg8zkTtlr0C6LszWs/t0g,
	meego-dev-WXzIur8shnEAvxtiuMwx3w,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w,
	joel.clark-ral2JQCrhuEAvxtiuMwx3w, qi.wang-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <002e01cb6486$2ed72cc0$66f8800a-a06+6cuVnkTSQfdrb5gaxUEOCMrvLtNR@public.gmane.org>

From: "Masayuki Ohtake" <masa-korg-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>
Date: Tue, 5 Oct 2010 21:09:30 +0900

> On Tuesday, October 05, 2010 8:08 PM, Marc Kleine-Budde wrote:
>> If FIFO is working you might also think about NAPI.
> 
> I think NAPI isn't necessary for our CAN driver.
> NAPI is for high-speed networking.
> CAN is NOT high-speed.
> 
> In fact, some accepted CAN drivers don't have NAPI.

NAPI is not just for performance concerns.

It greatly simplifies the locking since all packet processing paths
run only in software interrupt context, never in hardware interrupt
context.

We encourage all new drivers that can use NAPI to do so, just because
recent driver submissions do not do this doesn't mean we should
continue such a mistake ;-)

^ permalink raw reply

* Re: [patch 2/2] drivers/net/stmmac/: add HAS_IOMEM dependency
From: David Miller @ 2010-10-05 18:41 UTC (permalink / raw)
  To: peppe.cavallaro; +Cc: akpm, netdev, schwidefsky, heiko.carstens
In-Reply-To: <4CAB0219.40007@st.com>

From: Peppe CAVALLARO <peppe.cavallaro@st.com>
Date: Tue, 5 Oct 2010 12:46:49 +0200

> This dependency was removed in driver included in net-next Git (because
> the CPU_SUBTYPE_ST40 is a dead option).

Oh yes, I missed that.  Thanks for noticing.

> I've seen that other drivers depend on HAS_IOMEM that is always  defined
> for STM and ARM targets (where the I know the stmmac is used); No
> problem for me to add this new dependency to in HAS_IOMEM the stmmac's
> Kconfig if actually needed.

I'll add the patch Andrew posted, it is needed now.

Thanks again.

^ permalink raw reply

* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Michael S. Tsirkin @ 2010-10-05 18:23 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: anthony, arnd, avi, davem, kvm, netdev, rusty
In-Reply-To: <OF13594229.1A55A20C-ON652577B3.00393C8D-652577B3.003A54C9@in.ibm.com>

On Tue, Oct 05, 2010 at 04:10:00PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 09/19/2010 06:14:43 PM:
> 
> > Could you document how exactly do you measure multistream bandwidth:
> > netperf flags, etc?
> 
> All results were without any netperf flags or system tuning:
>     for i in $list
>     do
>         netperf -c -C -l 60 -H 192.168.122.1 > /tmp/netperf.$$.$i &
>     done
>     wait
> Another script processes the result files.  It also displays the
> start time/end time of each iteration to make sure skew due to
> parallel netperfs is minimal.
> 
> I changed the vhost functionality once more to try to get the
> best model, the new model being:
> 1. #numtxqs=1 -> #vhosts=1, this thread handles both RX/TX.
> 2. #numtxqs>1 -> vhost[0] handles RX and vhost[1-MAX] handles
>    TX[0-n], where MAX is 4.  Beyond numtxqs=4, the remaining TX
>    queues are handled by vhost threads in round-robin fashion.
> 
> Results from here on are with these changes, and only "tuning" is
> to set each vhost's affinity to CPUs[0-3] ("taskset -p f <vhost-pids>").
> 
> > Any idea where does this come from?
> > Do you see more TX interrupts? RX interrupts? Exits?
> > Do interrupts bounce more between guest CPUs?
> > 4. Identify reasons for single netperf BW regression.
> 
> After testing various combinations of #txqs, #vhosts, #netperf
> sessions, I think the drop for 1 stream is due to TX and RX for
> a flow being processed on different cpus.

Right. Can we fix it?

>  I did two more tests:
>     1. Pin vhosts to same CPU:
>         - BW drop is much lower for 1 stream case (- 5 to -8% range)
>         - But performance is not so high for more sessions.
>     2. Changed vhost to be single threaded:
>           - No degradation for 1 session, and improvement for upto
> 	      8, sometimes 16 streams (5-12%).
>           - BW degrades after that, all the way till 128 netperf sessions.
>           - But overall CPU utilization improves.
>             Summary of the entire run (for 1-128 sessions):
>                 txq=4:  BW: (-2.3)      CPU: (-16.5)    RCPU: (-5.3)
>                 txq=16: BW: (-1.9)      CPU: (-24.9)    RCPU: (-9.6)
> 
> I don't see any reasons mentioned above.  However, for higher
> number of netperf sessions, I see a big increase in retransmissions:

Hmm, ok, and do you see any errors?

> _______________________________________
> #netperf      ORG           NEW
>             BW (#retr)    BW (#retr)
> _______________________________________
> 1          70244 (0)     64102 (0)
> 4          21421 (0)     36570 (416)
> 8          21746 (0)     38604 (148)
> 16         21783 (0)     40632 (464)
> 32         22677 (0)     37163 (1053)
> 64         23648 (4)     36449 (2197)
> 128        23251 (2)     31676 (3185)
> _______________________________________
> 
> Single netperf case didn't have any retransmissions so that is not
> the cause for drop.  I tested ixgbe (MQ):
> ___________________________________________________________
> #netperf      ixgbe             ixgbe (pin intrs to cpu#0 on
>                                        both server/client)
>             BW (#retr)          BW (#retr)
> ___________________________________________________________
> 1           3567 (117)          6000 (251)
> 2           4406 (477)          6298 (725)
> 4           6119 (1085)         7208 (3387)
> 8           6595 (4276)         7381 (15296)
> 16          6651 (11651)        6856 (30394)

Interesting.
You are saying we get much more retransmissions with physical nic as
well?

> ___________________________________________________________
> 
> > 5. Test perf in more scenarious:
> >    small packets
> 
> 512 byte packets - BW drop for upto 8 (sometimes 16) netperf sessions,
> but increases with #sessions:
> _______________________________________________________________________________
> #       BW1     BW2 (%)         CPU1    CPU2 (%)        RCPU1   RCPU2 (%)
> _______________________________________________________________________________
> 1       4043    3800 (-6.0)     50      50 (0)          86      98 (13.9)
> 2       8358    7485 (-10.4)    153     178 (16.3)      230     264 (14.7)
> 4       20664   13567 (-34.3)   448     490 (9.3)       530     624 (17.7)
> 8       25198   17590 (-30.1)   967     1021 (5.5)      1085    1257 (15.8)
> 16      23791   24057 (1.1)     1904    2220 (16.5)     2156    2578 (19.5)
> 24      23055   26378 (14.4)    2807    3378 (20.3)     3225    3901 (20.9)
> 32      22873   27116 (18.5)    3748    4525 (20.7)     4307    5239 (21.6)
> 40      22876   29106 (27.2)    4705    5717 (21.5)     5388    6591 (22.3)
> 48      23099   31352 (35.7)    5642    6986 (23.8)     6475    8085 (24.8)
> 64      22645   30563 (34.9)    7527    9027 (19.9)     8619    10656 (23.6)
> 80      22497   31922 (41.8)    9375    11390 (21.4)    10736   13485 (25.6)
> 96      22509   32718 (45.3)    11271   13710 (21.6)    12927   16269 (25.8)
> 128     22255   32397 (45.5)    15036   18093 (20.3)    17144   21608 (26.0)
> _______________________________________________________________________________
> SUM:    BW: (16.7)      CPU: (20.6)     RCPU: (24.3)
> _______________________________________________________________________________
> 
> > host -> guest
> _______________________________________________________________________________
> #       BW1     BW2 (%)         CPU1    CPU2 (%)        RCPU1   RCPU2 (%)
> _______________________________________________________________________________
> *1      70706   90398 (27.8)    300     327 (9.0)       140     175 (25.0)
> 2       20951   21937 (4.7)     188     196 (4.2)       93      103 (10.7)
> 4       19952   25281 (26.7)    397     496 (24.9)      210     304 (44.7)
> 8       18559   24992 (34.6)    802     1010 (25.9)     439     659 (50.1)
> 16      18882   25608 (35.6)    1642    2082 (26.7)     953     1454 (52.5)
> 24      19012   26955 (41.7)    2465    3153 (27.9)     1452    2254 (55.2)
> 32      19846   26894 (35.5)    3278    4238 (29.2)     1914    3081 (60.9)
> 40      19704   27034 (37.2)    4104    5303 (29.2)     2409    3866 (60.4)
> 48      19721   26832 (36.0)    4924    6418 (30.3)     2898    4701 (62.2)
> 64      19650   26849 (36.6)    6595    8611 (30.5)     3975    6433 (61.8)
> 80      19432   26823 (38.0)    8244    10817 (31.2)    4985    8165 (63.7)
> 96      20347   27886 (37.0)    9913    13017 (31.3)    5982    9860 (64.8)
> 128     19108   27715 (45.0)    13254   17546 (32.3)    8153    13589 (66.6)
> _______________________________________________________________________________
> SUM:    BW: (32.4)      CPU: (30.4)     RCPU: (62.6)
> _______________________________________________________________________________
> *: Sum over 7 iterations, remaining test cases are sum over 2 iterations
> 
> > guest <-> external
> 
> I haven't done this right now since I don't have a setup.  I guess
> it would be limited by wire speed and gains may not be there.  I
> will try to do this later when I get the setup.

OK but at least need to check that it does not hurt things.

> > in last case:
> > find some other way to measure host CPU utilization,
> > try multiqueue and single queue devices
> > 6. Use above to figure out what is a sane default for numtxqs
> 
> A. Summary for default I/O (16K):
> #txqs=2 (#vhost=3):       BW: (37.6)      CPU: (69.2)     RCPU: (40.8)
> #txqs=4 (#vhost=5):       BW: (36.9)      CPU: (60.9)     RCPU: (25.2)
> #txqs=8 (#vhost=5):       BW: (41.8)      CPU: (50.0)     RCPU: (15.2)
> #txqs=16 (#vhost=5):      BW: (40.4)      CPU: (49.9)     RCPU: (10.0)
> 
> B. Summary for 512 byte I/O:
> #txqs=2 (#vhost=3):       BW: (31.6)      CPU: (35.7)     RCPU: (28.6)
> #txqs=4 (#vhost=5):       BW: (5.7)       CPU: (27.2)     RCPU: (22.7)
> #txqs=8 (#vhost=5):       BW: (-.6)       CPU: (25.1)     RCPU: (22.5)
> #txqs=16 (#vhost=5):      BW: (-6.6)      CPU: (24.7)     RCPU: (21.7)
> 
> Summary:
> 
> 1. Average BW increase for regular I/O is best for #txq=16 with the
>    least CPU utilization increase.
> 2. The average BW for 512 byte I/O is best for lower #txq=2. For higher
>    #txqs, BW increased only after a particular #netperf sessions - in
>    my testing that limit was 32 netperf sessions.
> 3. Multiple txq for guest by itself doesn't seem to have any issues.
>    Guest CPU% increase is slightly higher than BW improvement.  I
>    think it is true for all mq drivers since more paths run in parallel
>    upto the device instead of sleeping and allowing one thread to send
>    all packets via qdisc_restart.
> 4. Having high number of txqs gives better gains and reduces cpu util
>    on the guest and the host.
> 5. MQ is intended for server loads.  MQ should probably not be explicitly
>    specified for client systems.
> 6. No regression with numtxqs=1 (or if mq option is not used) in any
>    testing scenario.

Of course txq=1 can be considered a kind of fix, but if we know the
issue is TX/RX flows getting bounced between CPUs, can we fix this?
Workload-specific optimizations can only get us this far.

> 
> I will send the v3 patch within a day after some more testing.
> 
> Thanks,
> 
> - KK

^ permalink raw reply

* [GIT PULL net-next-2.6] vhost-net patchset for 2.6.37
From: Michael S. Tsirkin @ 2010-10-05 18:27 UTC (permalink / raw)
  To: David Miller; +Cc: kvm, virtualization, netdev, linux-kernel, Jason Wang

It looks like it was a quiet cycle for vhost-net:
probably because most of energy was spent on bugfixes
that went in for 2.6.36.
People are working on multiqueue, tracing but I'm not
sure it'll get done in time for 2.6.37 - so here's
a tree with a single patch that helps windows guests
which we definitely want in the next kernel.

Please merge for 2.6.37.
Thanks!

The following changes since commit a00eac0c459abecb539fb2a2abd3122dd7ca5d4a:

  ppp: Use a real SKB control block in fragmentation engine. (2010-10-05 01:36:52 -0700)

are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net-next

Jason Wang (1):
      vhost: max s/g to match qemu

 drivers/vhost/net.c   |    2 +-
 drivers/vhost/vhost.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/vhost/vhost.h |   18 ++++++++----------
 3 files changed, 57 insertions(+), 12 deletions(-)

^ permalink raw reply

* Re: [net-next-2.6 PATCH] net: netif_set_real_num_rx_queues may cap num_rx_queues at init time
From: John Fastabend @ 2010-10-05 17:45 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Eric Dumazet, netdev@vger.kernel.org, therbert@google.com
In-Reply-To: <1286296476.2307.5.camel@achroite.uk.solarflarecom.com>

On 10/5/2010 9:34 AM, Ben Hutchings wrote:
> On Tue, 2010-10-05 at 09:08 -0700, John Fastabend wrote:
>> On 10/4/2010 10:35 PM, Eric Dumazet wrote:
>>> Le lundi 04 octobre 2010 à 15:00 -0700, John Fastabend a écrit :
>>>> The logic for netif_set_real_num_rx_queues is the following,
>>>>
>>>> netif_set_real_num_rx_queues(dev, rxq)
>>>> {
>>>> 	...
>>>> 	if (dev->reg_state == NETREG_REGISTERED) {
>>>> 		...
>>>> 	} else {
>>>> 		dev->num_rx_queues = rxq;
>>>> 	}
>>>>
>>>> 	dev->real_num_rx_queues = rxq;
>>>> 	return 0;
>>>> }
>>>>
>>>> Some drivers init path looks like the following,
>>>>
>>>> alloc_etherdev_mq(priv_sz, max_num_queues_ever);
>>>> ...
>>>> netif_set_real_num_rx_queues(dev, queues_to_use_now);
>>>> ...
>>>> register_netdev(dev);
>>>> ...
>>>>
>>>> Because netif_set_real_num_rx_queues sets num_rx_queues if the
>>>> reg state is not NETREG_REGISTERED we end up with the incorrect
>>>> max number of rx queues. This patch proposes to remove the else
>>>> clause above so this does not occur.  Also just reading the
>>>> function set_real_num it seems a bit unexpected that num_rx_queues
>>>> gets set.
>>>>
>>>
>>> You dont tell why its "incorrect".
>>>
>>
>> OK that is a poor description.
>>
>>> Why should we keep num_rx_queues > real_num_rx_queues ?
>>>
>>
>> If we do not ever need them then we should not keep them I agree.
>> But having netif_set_real_num_rx_queues set something other then
>> 'real_num_rx_queues' does not seem right to me at least. Also
>> netif_set_real_num_tx_queues and netif_set_real_num_rx_queues have
>> different behavior. It would be nice if this weren't the case but
>> they allocate queues in two places.
> [...]
> 
> I only did this to satisfy Eric's desire to reduce memory usage.
> However, I believe that there are currently no drivers that dynamically
> increase numbers of RX or TX queues.  Until there are, there is not much
> point in removing this assignment to num_rx_queues.
> 
> Ben.
> 

ixgbe increases the real_num_[rx|tx]_queues when FCoE or DCB is enabled.
Also many of the drivers could increase the number of queues if they were
given more interrupt vectors at some point. 

But, it is easy enough to patch ixgbe to not set the number of RX queues
currently in use until after the device is registered. Which brings it
inline with many of the other drivers. And then the drivers that are using
it to explicitly set num_rx_queues do not need to change.

Thanks,
John



^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox