Netdev List


* [PATCH net-next-2.6 0/5 v2] SCTP updates for net-next-2.6
From: Wei Yongjun @ 2011-04-27  7:35 UTC
  To: David Miller; +Cc: netdev@vger.kernel.org, lksctp

Hi David

Here is a set of SCTP patches for net-next-2.6, the last part
from Vlad's lksctp-dev tree. They update SCTP IPv6 routing and
address IPsec issues. Please apply.

Changelog:
  - redid the intermediate build tests and functional tests
  - removed the useless ->dst_saddr member of sctp_pf
  - merged some fixes into the original patches

Vlad Yasevich (4):
      sctp: cache the ipv6 source after route lookup
      sctp: make sctp over IPv6 work with IPsec
      sctp: remove useless arguments from get_saddr() call
      sctp: clean up route lookup calls

Weixing Shi (1):
      sctp: fix sctp to work with ipv6 source address routing

 include/net/sctp/structs.h |   18 ++---
 net/sctp/ipv6.c            |  183 +++++++++++++++++++++++++------------------
 net/sctp/protocol.c        |   54 ++++++-------
 net/sctp/socket.c          |    2 +-
 net/sctp/transport.c       |   28 ++++---
 5 files changed, 153 insertions(+), 132 deletions(-)



* Re: [PATCH] Applying inappropriate ioctl operation on socket should return ENOTTY
From: Eric Dumazet @ 2011-04-27  6:57 UTC
  To: Lifeng Sun; +Cc: linux-kernel, netdev
In-Reply-To: <20110427063730.GA20313@md5.ntu.edu.sg>

On Wednesday 27 April 2011 at 14:37 +0800, Lifeng Sun wrote:
> On 07:58 Wed 04/27/11 Apr, Eric Dumazet wrote:
> > Really ?
> > 
> > EINVAL is ok too : Request or argp is not valid.
> 
> I'm afraid not. SUSv4 specifies, say,
> 
>   int tcsetattr(int fildes, int optional_actions,
>          const struct termios *termios_p);
> 
>  ERROR:
>   [EINVAL]
>       The optional_actions argument is not a supported value, or an
>       attempt was made to change an attribute represented in the
>       termios structure to an unsupported value.
> 
>   [ENOTTY]
>       The file associated with fildes is not a terminal.
> 
> which means when we apply tcsetattr (implemented by ioctl) to _any_
> non-terminal file descriptor, it should set errno to ENOTTY rather
> than EINVAL.
> 

You are quoting the man page for a library call, not a system call.

If you feel your glibc doesn't implement this well, please complain to
the glibc maintainers.

* Re: [PATCH] Applying inappropriate ioctl operation on socket should return ENOTTY
From: Eric Dumazet @ 2011-04-27  6:55 UTC
  To: Lifeng Sun; +Cc: linux-kernel, netdev
In-Reply-To: <20110427063730.GA20313@md5.ntu.edu.sg>

On Wednesday 27 April 2011 at 14:37 +0800, Lifeng Sun wrote:
> On 07:58 Wed 04/27/11 Apr, Eric Dumazet wrote:
> > Really ?
> > 
> > EINVAL is ok too : Request or argp is not valid.
> 
> I'm afraid not. SUSv4 specifies, say,
> 
>   int tcsetattr(int fildes, int optional_actions,
>          const struct termios *termios_p);
> 
>  ERROR:
>   [EINVAL]
>       The optional_actions argument is not a supported value, or an
>       attempt was made to change an attribute represented in the
>       termios structure to an unsupported value.
> 
>   [ENOTTY]
>       The file associated with fildes is not a terminal.
> 
> which means when we apply tcsetattr (implemented by ioctl) to _any_
> non-terminal file descriptor, it should set errno to ENOTTY rather
> than EINVAL.

It's not that simple. This is a known and documented artifact.

In the old days, ioctl() was meaningful (mostly) for TTYs.



man isatty

ERRORS
       EBADF  fd is not a valid file descriptor.

       EINVAL fd refers to a file other than a terminal.  POSIX.1-2001 specifies the error ENOTTY for this case.


The fact that POSIX changed the rules does not mean we must change the kernel and break applications.

Conformant applications call isatty(fd) and test whether the result is 1 or not.

This way, they work with Linux 1.0, 2.0, 2.2, 2.4, ... and other OSes as well.
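
As an illustration of the portable check described above, here is a minimal C sketch. It assumes only the standard isatty() and socket() calls; the helper names fd_is_terminal() and socket_is_terminal() are made up for this example. It relies on the return value alone and never inspects errno, so it should behave the same whether the kernel reports EINVAL or ENOTTY.

```c
#include <stdbool.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Returns true only when fd refers to a terminal. isatty() returns 1 for a
 * terminal and 0 otherwise; the errno left behind in the 0 case differs
 * between systems, so it is deliberately not examined here. */
bool fd_is_terminal(int fd)
{
    return isatty(fd) == 1;
}

/* Hypothetical helper for the example: opens a TCP socket, asks whether it
 * is a terminal, then closes it. Returns 1 or 0 for the answer, or -1 if
 * the socket could not be created. */
int socket_is_terminal(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    bool tty = fd_is_terminal(fd);
    close(fd);
    return tty ? 1 : 0;
}
```

A caller that only needs a yes/no answer can use fd_is_terminal() directly and skip any errno comparison.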

* Re: [PATCH] Applying inappropriate ioctl operation on socket should return ENOTTY
From: Lifeng Sun @ 2011-04-27  6:37 UTC
  To: Eric Dumazet; +Cc: linux-kernel, netdev
In-Reply-To: <1303883910.2699.53.camel@edumazet-laptop>

On 07:58 Wed 04/27/11 Apr, Eric Dumazet wrote:
> Really ?
> 
> EINVAL is ok too : Request or argp is not valid.

I'm afraid not. SUSv4 specifies, say,

  int tcsetattr(int fildes, int optional_actions,
         const struct termios *termios_p);

 ERROR:
  [EINVAL]
      The optional_actions argument is not a supported value, or an
      attempt was made to change an attribute represented in the
      termios structure to an unsupported value.

  [ENOTTY]
      The file associated with fildes is not a terminal.

which means when we apply tcsetattr (implemented by ioctl) to _any_
non-terminal file descriptor, it should set errno to ENOTTY rather
than EINVAL.

> I would say it's not a bug as you claim.
> 
> It's really too late to make such a change and risk regressions.
> 
> isatty(fd) performs well. Please use it instead.
> 
> Also, networking patches should be sent to netdev@vger.kernel.org and
> David Miller, as mentioned in the MAINTAINERS file.

Thank you.

* Re: [PATCH] net: tun: convert to hw_features
From: Rusty Russell @ 2011-04-27  4:59 UTC
  To: David Miller, mirq-linux; +Cc: netdev
In-Reply-To: <20110420.013216.28818575.davem@davemloft.net>

On Wed, 20 Apr 2011 01:32:16 -0700 (PDT), David Miller <davem@davemloft.net> wrote:
> From: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> Date: Tue, 19 Apr 2011 18:13:10 +0200 (CEST)
> 
> > This changes offload setting behaviour to what I think is correct:
> >  - offloads set via ethtool mean what admin wants to use (by default
> >    he wants 'em all)
> >  - offloads set via ioctl() mean what userspace is expecting to get
> >    (this limits which admin wishes are granted)
> >  - TUN_NOCHECKSUM is ignored, as it might cause broken packets when
> >    forwarded (ip_summed == CHECKSUM_UNNECESSARY means that checksum
> >    was verified, not that it can be ignored)
> > 
> > If TUN_NOCHECKSUM is implemented, it should set skb->csum_* and
> > skb->ip_summed (= CHECKSUM_PARTIAL) for known protocols and let others
> > be verified by kernel when necessary.
> > 
> > TUN_NOCHECKSUM handling was introduced by commit
> > f43798c27684ab925adde7d8acc34c78c6e50df8:
> > 
> >     tun: Allow GSO using virtio_net_hdr
> >     
> > Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> 
> Applied.

Dave, you just removed a feature that has been in Linux since before
git.  It *probably* just means we go slower in cases we don't really
care about.  But does removing it break qemu?  Has anyone tested?

Thanks,
Rusty.


* Re: [PATCH] Applying inappropriate ioctl operation on socket should return ENOTTY
From: Eric Dumazet @ 2011-04-27  5:58 UTC
  To: Lifeng Sun; +Cc: linux-kernel, netdev
In-Reply-To: <1303882625-28115-1-git-send-email-lifongsun@gmail.com>

On Wednesday 27 April 2011 at 13:37 +0800, Lifeng Sun wrote:
> ioctl() calls against a socket with an inappropriate ioctl operation
> are incorrectly returning EINVAL rather than ENOTTY:
> 
>   [ENOTTY]
>       Inappropriate I/O control operation.
> 
> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=33992
> 
> This bug is not limited to socket, it also occurs in a lot of, maybe
> some hundred, other ioctl operations, while in the patch I only fixed
> about a dozen of additional ones in pipe, fifo and character device
> drivers.

Really?

EINVAL is OK too: request or argp is not valid.

I would say it's not a bug as you claim.

It's really too late to make such a change and risk regressions.

isatty(fd) performs well. Please use it instead.

Also, networking patches should be sent to netdev@vger.kernel.org and
David Miller, as mentioned in the MAINTAINERS file.




* Re: [PATCH 3/4] ipv4: Remove erroneous check in igmpv3_newpack() and igmp_send_report().
From: Eric Dumazet @ 2011-04-27  4:50 UTC
  To: David Miller; +Cc: netdev
In-Reply-To: <20110426.151204.70183859.davem@davemloft.net>

On Tuesday 26 April 2011 at 15:12 -0700, David Miller wrote:
> Output route resolution never returns a route with rt_src set to zero
> (which is INADDR_ANY).
> 
> Even if the flow key for the output route lookup specifies INADDR_ANY
> for the source address, the output route resolution chooses a real
> source address to use in the final route.
> 
> This test has existed forever in igmp_send_report() and David Stevens
> simply copied over the erroneous test when implementing support for
> IGMPv3.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>

Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>



* linux-next: ibmveth runtime errors
From: Stephen Rothwell @ 2011-04-27  4:49 UTC
  To: ppc-dev
  Cc: "Michał Mirosław", David Miller, netdev,
	Santiago Leon, linux-next, LKML

Hi all,

For the last couple of days, booting linux-next on a few of our Power
partitions (but not all) has produced this error (over and over):

ibmveth 3000000b: eth0: tx: h_send_logical_lan failed with rc=-4

Linus' tree seems to boot fine on these partitions.  The only commit
directly affecting ibmveth in linux-next is b9367bf3ee6d ("net: ibmveth:
convert to hw_features"), which first appeared in next-20110421; that is
also the first linux-next release that failed.

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

* Re: [PATCH 2/4] ipv4: Sanitize and simplify ip_route_{connect,newports}()
From: Eric Dumazet @ 2011-04-27  4:47 UTC
  To: David Miller; +Cc: netdev
In-Reply-To: <20110426.151202.245387083.davem@davemloft.net>

On Tuesday 26 April 2011 at 15:12 -0700, David Miller wrote:
> These functions are used together as a unit for route resolution
> during connect().  They address the chicken-and-egg problem that
> exists when ports need to be allocated during connect() processing,
> yet such port allocations require addressing information from the
> routing code.
> 
> It's currently more heavy handed than it needs to be, and in
> particular we allocate and initialize a flow object twice.
> 
> Let the callers provide the on-stack flow object.  That way we only
> need to initialize it once in the ip_route_connect() call.
> 
> Later, if ip_route_newports() needs to do anything, it re-uses that
> flow object as-is except for the ports which it updates before the
> route re-lookup.
> 
> Also, describe why this set of facilities are needed and how it works
> in a big comment.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>

Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>



* Re: Kernel crash after using new Intel NIC (igb)
From: Eric Dumazet @ 2011-04-27  4:32 UTC
  To: Maximilian Engelhardt; +Cc: linux-kernel, netdev, StuStaNet Vorstand
In-Reply-To: <1303878240.2699.41.camel@edumazet-laptop>

On Wednesday 27 April 2011 at 06:24 +0200, Eric Dumazet wrote:

> We had similar reports in the past that disappeared when adding
> "slab_nomerge" to boot parameters. We suspect a memory corruption from
> another part of the kernel on 64-byte kmemcache objects.
> 
> In 2.6.37, inetpeer code uses 64-byte objects. Using slab_nomerge and
> the SLUB allocator (as you already do) makes sure the inetpeer kmemcache
> won't be shared with other 64-byte objects in the kernel.
> 

Of course, the right option name is slub_nomerge

vi +2293 Documentation/kernel-parameters.txt

        slub_nomerge    [MM, SLUB]
                        Disable merging of slabs with similar size. May be
                        necessary if there is some reason to distinguish
                        allocs to different slabs. Debug options disable
                        merging on their own.
                        For more information see Documentation/vm/slub.txt.
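
To confirm that the parameter actually reached the kernel, one option is to look for it in the contents of /proc/cmdline. The following is a minimal sketch, not part of any existing tool; cmdline_has_param() is a hypothetical helper that does a whole-word match, so that slab_nomerge is not mistaken for slub_nomerge.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true when `param` appears as a whole space-separated word in
 * `cmdline` (for example, text read from /proc/cmdline). A trailing
 * newline after the last word is tolerated. */
bool cmdline_has_param(const char *cmdline, const char *param)
{
    size_t len = strlen(param);
    const char *p = cmdline;

    if (len == 0)
        return false;

    while ((p = strstr(p, param)) != NULL) {
        bool starts_word = (p == cmdline) || (p[-1] == ' ');
        bool ends_word = (p[len] == '\0') || (p[len] == ' ') || (p[len] == '\n');

        if (starts_word && ends_word)
            return true;
        p++; /* keep searching after this partial match */
    }
    return false;
}
```

A caller would read /proc/cmdline into a buffer and pass that buffer as the first argument.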

* Re: Kernel crash after using new Intel NIC (igb)
From: Eric Dumazet @ 2011-04-27  4:24 UTC
  To: Maximilian Engelhardt; +Cc: linux-kernel, netdev, StuStaNet Vorstand
In-Reply-To: <201104250033.03401.maxi@daemonizer.de>

On Monday 25 April 2011 at 00:32 +0200, Maximilian Engelhardt wrote:
> Hello,
> 
> some time ago we switched some of our servers to a new networking card that 
> uses the Intel igb driver. Since that time we see regular kernel crashes.
> The crashes happen at very irregular intervals, sometimes after a week uptime, 
> sometimes after a month or even more. They seem to be independent of the 
> server load as they also happen in the night when there is low traffic.
> 
> The affected server is used as a NAT device with some iptables rules and serves 
> about 2000 people.
> 
> Attached are two logs of the crashes as well as the output of dmesg, lspci, 
> and /proc/interrupts as well as the used kernel config.
> 
> I have no idea what might be wrong but I think it is a kernel bug. Perhaps 
> someone with more knowledge has a clue.
> 
> If needed I can provide additional information or build different kernels.
> 
> Greetings,
> Maxi

Hello Maximilian

We had similar reports in the past that disappeared when adding
"slab_nomerge" to boot parameters. We suspect a memory corruption from
another part of the kernel on 64-byte kmemcache objects.

In 2.6.37, inetpeer code uses 64-byte objects. Using slab_nomerge and
the SLUB allocator (as you already do) makes sure the inetpeer kmemcache
won't be shared with other 64-byte objects in the kernel.

In 2.6.38 and up, inetpeer objects are larger, so you could also try
the latest linux-2.6 tree, just to make sure the inetpeer code is not at fault.

Thanks

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff8145ea9f>] cleanup_once+0x3f/0xa0
PGD 12d82a067 PUD 12ea49067 PMD 0
Oops: 0002 [#1] PREEMPT SMP
last sysfs file: /sys/devices/virtual/vc/vcsa5/uevent
CPU 0
Pid: 0, comm: swapper Not tainted 2.6.37.1 #1 Supermicro X7SB4/E/X7SB4/E
RIP: 0010:[<ffffffff8145ea9f>]  [<ffffffff8145ea9f>] cleanup_once+0x3f/0xa0
RSP: 0018:ffff8800cfc03e40  EFLAGS: 00010202
RAX: ffff880128167798 RBX: ffff880128167780 RCX: 0000000000000000
RDX: c398112e00026cf7 RSI: 00000000000001a2 RDI: ffffffff8166ce10
RBP: 0000000000024702 R08: 00000000003d0900 R09: 00040ea8ea5b7700
R10: ffffffff814f312d R11: 0000000000000010 R12: ffffffff8161ffd8
R13: 0000000000000102 R14: ffffffff8174b4e0 R15: ffffffff8161ffd8
FS:  0000000000000000(0000) GS:ffff8800cfc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000012fe67000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff8161e000, task ffffffff81638020)
Stack:
 ffff8800cfc11f00 0000000111034f87 0000000000024702 ffffffff8145ed68
 ffffffff8174a4c0 ffffffff8174a4c0 ffff8800cfc03eb0 ffffffff81044cb8
 ffffffff81034079 ffffffff8145ed30 0000000000000000 ffffffff8174b8e0
Call Trace:
 <IRQ>
 [<ffffffff8145ed68>] ? peer_check_expire+0x38/0x110
 [<ffffffff81044cb8>] ? run_timer_softirq+0x138/0x250
 [<ffffffff81034079>] ? scheduler_tick+0xd9/0x2e0
 [<ffffffff8145ed30>] ? peer_check_expire+0x0/0x110
 [<ffffffff8103eb0d>] ? __do_softirq+0x9d/0x130
 [<ffffffff8100320c>] ? call_softirq+0x1c/0x30
 [<ffffffff8100531d>] ? do_softirq+0x4d/0x80
 [<ffffffff8103e9cd>] ? irq_exit+0x8d/0x90
 [<ffffffff8101d5ea>] ? smp_apic_timer_interrupt+0x6a/0xa0
 [<ffffffff81002cd3>] ? apic_timer_interrupt+0x13/0x20
 <EOI>
 [<ffffffff8100a93a>] ? mwait_idle+0x6a/0x80
 [<ffffffff81001528>] ? cpu_idle+0x58/0xb0
 [<ffffffff81698dd3>] ? start_kernel+0x334/0x33f
 [<ffffffff8169840d>] ? x86_64_start_kernel+0xf3/0xf7
Code: 00 48 8b 05 84 e3 20 00 48 3d 00 ce 66 81 74 5c 48 8d 58 e8 48 8b 15 31 5e 22 00 2b 53 28 48 39 ea 72 49 48 8b 4b 18 48 8b 53 20 <48> 89 51 08 48 89 0a 48 89 43 18 48 89 43 20 f0 ff 40 14 48 c7
RIP  [<ffffffff8145ea9f>] cleanup_once+0x3f/0xa0
 RSP <ffff8800cfc03e40>
CR2: 0000000000000008
---[ end trace 904f16191de0663c ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: swapper Tainted: G      D     2.6.37.1 #1
Call Trace:
 <IRQ>  [<ffffffff814e4152>] ? panic+0xa1/0x19e
 [<ffffffff810068eb>] ? oops_end+0x9b/0xa0
 [<ffffffff81024523>] ? no_context+0x103/0x270
 [<ffffffff81024d10>] ? do_page_fault+0x290/0x430
 [<ffffffff813eabd2>] ? __alloc_skb+0x72/0x160
 [<ffffffff81262f40>] ? swiotlb_dma_mapping_error+0x10/0x20
 [<ffffffff8133e168>] ? igb_alloc_rx_buffers_adv+0x208/0x3a0
 [<ffffffff814e780f>] ? page_fault+0x1f/0x30
 [<ffffffff8145ea9f>] ? cleanup_once+0x3f/0xa0
 [<ffffffff8145ed68>] ? peer_check_expire+0x38/0x110
 [<ffffffff81044cb8>] ? run_timer_softirq+0x138/0x250
 [<ffffffff81034079>] ? scheduler_tick+0xd9/0x2e0
 [<ffffffff8145ed30>] ? peer_check_expire+0x0/0x110
 [<ffffffff8103eb0d>] ? __do_softirq+0x9d/0x130
 [<ffffffff8100320c>] ? call_softirq+0x1c/0x30
 [<ffffffff8100531d>] ? do_softirq+0x4d/0x80
 [<ffffffff8103e9cd>] ? irq_exit+0x8d/0x90
 [<ffffffff8101d5ea>] ? smp_apic_timer_interrupt+0x6a/0xa0
 [<ffffffff81002cd3>] ? apic_timer_interrupt+0x13/0x20
 <EOI>  [<ffffffff8100a93a>] ? mwait_idle+0x6a/0x80
 [<ffffffff81001528>] ? cpu_idle+0x58/0xb0
 [<ffffffff81698dd3>] ? start_kernel+0x334/0x33f
 [<ffffffff8169840d>] ? x86_64_start_kernel+0xf3/0xf7
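
A note on reading the oops above: the faulting address (CR2 = 0x8) together with RCX = 0 appears consistent with a doubly linked list unlink in which the entry's next pointer was NULL. The bytes marked <48> 89 51 08 in the Code: line decode to a 64-bit store of RDX at offset 8 from RCX. The sketch below uses a stand-in structure (list_node, a hypothetical name mirroring the two-pointer layout of a kernel list head) rather than the actual inetpeer code, and only computes the address such a store would hit.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a two-pointer list head, with `next` first and `prev`
 * second. The name is hypothetical; only the layout matters here. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/* Address that the unlink step "next->prev = prev" would write to, given
 * the value of the entry's next pointer. This is pure arithmetic; nothing
 * is dereferenced. */
uintptr_t unlink_store_address(uintptr_t next)
{
    return next + offsetof(struct list_node, prev);
}
```

With a NULL next pointer on a 64-bit machine the result is 8, which matches the address reported in the trace. This says nothing about what corrupted the pointer in the first place.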

* Re: [PATCH] netfilter/IPv6: initialize TOS field in REJECT target module
From: Fernando Luis Vazquez Cao @ 2011-04-27  4:21 UTC
  To: Pablo Neira Ayuso
  Cc: David Miller, eric.dumazet, netfilter-devel, netdev, yoshfuji,
	jengelh, Patrick McHardy
In-Reply-To: <4DB6E647.5060608@netfilter.org>

On Tue, 2011-04-26 at 17:35 +0200, Pablo Neira Ayuso wrote:
> On 26/04/11 17:34, Pablo Neira Ayuso wrote:
> > On 26/04/11 07:25, Fernando Luis Vazquez Cao wrote:
> >> Pablo, could you pull in the two patches below? They have already been
> >> acked by Eric. It would be great it we could get them merged for the
> >> next -rc and stable releases.
> >>
> >> [PATCH] netfilter/IPv6: fix DSCP mangle code
> >> [PATCH] netfilter/IPv6: initialize TOS field in REJECT target module
> > 
> > Patrick is the primary link to take patches, I'm including him in this
> > CC. If he experiences any problem, I'll make sure that these hit -rc, so
> > never mind.
>   ^^^^^^^^^^
> 
> Sorry, I meant to say, "don't worry" :-)

Thank you, Pablo. I really appreciate it.

- Fernando


* Re: Strange igb bug, out-of-tree driver seems to work fine.
From: Ben Greear @ 2011-04-27  4:18 UTC
  To: Wyborny, Carolyn; +Cc: netdev
In-Reply-To: <EDC0E76513226749BFBC9C3FB031318F016B41256F@orsmsx508.amr.corp.intel.com>

On 04/26/2011 04:23 PM, Wyborny, Carolyn wrote:

> Hello,
>
> I'm sorry for the delay in responding.  I'm really scratching my head on this one as we don't do much in the driver that affects what we get on receive.  I've seen situations where some switches end up transmitting more of these and then we record more of them, but I'm guessing you're testing with the same equipment, just a different driver version.  Let me know if I'm mistaken there.
>
> So, to answer your question, I believe my patches are there, but I did review them again and I'm not sure they will make any difference.  My latest batch of patches was to add features to the i350 device specifically.
>
> Give it a try though and let me know if you see any difference with 2.6.39-rc4+.

We reproduced this with stock 2.6.38.4 today, but I didn't get a chance to really
dig into it.

We only seem to have problems when the nics are associated with a kernel bridge
(some ports are connected to a pair of veth devices through a user-space bridge
that uses packet sockets to bridge the packets, and one of the veth interfaces
is in the kernel bridge).

We did run the same igb system to itself sending layer-3 traffic and it ran
fine, so it appears to be a fairly tricky bug.  It *almost* looks like issues
with the bridge or how we set things up, but we can reliably reproduce it
on in-kernel igb driver systems, and e1000e systems never see the problem.

I'll try to get some better debug info tomorrow, and if time allows,
we'll try on the stock linus top-of-tree kernel as well.  If top-of-tree
does work, I should be able to bisect the problem since we have a reliable
test case. It would be interesting to see where the issue lies.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

* Re: r8169 :  always copying the rx buffer to new skb
From: Eric Dumazet @ 2011-04-27  3:57 UTC
  To: John Lumby; +Cc: Francois Romieu, netdev, Ben Hutchings, nic_swsd
In-Reply-To: <4DB77D03.9070507@hotmail.com>

On Tuesday 26 April 2011 at 22:18 -0400, John Lumby wrote:
> Anyone have any further thoughts on the proposal to avoid memcpy'ing?
> (see earlier post)
> 
> I also have a question concerning NAPI. I've found that much of the
> CPU saved from not memcpy'ing is burned in extra rx_interrupt'ing, and
> much of that seems to be wasted (no new packets). So the actual
> benefit is rather less than I think should be possible.
> 
> I've tried some tinkering with the napi weight but can't find any
> setting which really improves the ratio of rx packets to hard interrupts
> significantly. The problem seems to be that each successive
> rtl8169_poll() is driven too soon after the last one (in this
> particular workload). The napi weight doesn't directly influence that.
> 
> So, question:
> is there any way, when returning from rtl8169_poll, to tell napi
> something like:
>     "finish this interrupt context and let something else run on this
>     CPU (always CPU0 on my machine) BUT reschedule another napi poll on
>     this same device at some time after that"
> the point being that rtl8169_poll will, for this case, NOT re-enable
> the NIC's napi interrupts, in the hope that maybe some user work can be
> dispatched, so something else will have to schedule the next napi
> poll for it. Conceptually, if rtl8169_poll finds no rx work done
> on this call, it wants to cause a yield() and then try again.
> Except it can't from within the interrupt.
> 
> I appreciate this could lead to delays in handling new work so might be
> dangerous, but it seems to me to be in line with NAPI objectives so I
> wanted to try it. But don't know how. Any hints or thoughts
> appreciated.

The answer is no. There is no such facility in the NAPI infrastructure.

What you want to introduce is timer-based polling. Some old pre-NAPI
drivers did that. It's OK when you have one device to handle, but it can
be a nightmare when you mix several devices.
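
The contract of a NAPI poll routine can be modelled in userspace roughly as follows. This is a simplified sketch with hypothetical names (model_dev, model_poll), not kernel code: a poll either uses its whole budget and stays scheduled, or handles less than its budget, completes, and re-enables interrupts. There is no third outcome of the kind asked about, in which the poll yields yet is revisited later without an interrupt.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a device with a receive ring. */
struct model_dev {
    int pending;       /* packets waiting in the rx ring */
    bool irq_enabled;  /* whether the device may raise rx interrupts */
    bool scheduled;    /* whether a poll is still queued for this device */
};

/* Models one poll call: handle at most `budget` packets. If fewer than
 * `budget` were handled, the ring is empty, so the poll is completed and
 * interrupts are re-enabled. Otherwise the device stays scheduled with
 * interrupts left off, and it will be polled again. */
int model_poll(struct model_dev *dev, int budget)
{
    int work_done = 0;

    while (work_done < budget && dev->pending > 0) {
        dev->pending--;
        work_done++;
    }

    if (work_done < budget) {
        dev->scheduled = false;
        dev->irq_enabled = true;
    }
    return work_done;
}
```

In this model a poll that finds no work always ends with interrupts enabled again, which is the behaviour the question hoped to avoid; deferring the next poll would need a separate timer, as the reply says.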




* Re: r8169 :  always copying the rx buffer to new skb
From: John Lumby @ 2011-04-27  2:18 UTC
  To: Francois Romieu; +Cc: netdev, Ben Hutchings, nic_swsd
In-Reply-To: <4DAFA9F9.5080909@hotmail.com>

Anyone have any further thoughts on the proposal to avoid memcpy'ing?
(see earlier post)

I also have a question concerning NAPI. I've found that much of the
CPU saved from not memcpy'ing is burned in extra rx_interrupt'ing, and
much of that seems to be wasted (no new packets). So the actual
benefit is rather less than I think should be possible.

I've tried some tinkering with the napi weight but can't find any
setting which really improves the ratio of rx packets to hard interrupts
significantly. The problem seems to be that each successive
rtl8169_poll() is driven too soon after the last one (in this
particular workload). The napi weight doesn't directly influence that.

So, question:
is there any way, when returning from rtl8169_poll, to tell napi
something like:
    "finish this interrupt context and let something else run on this
    CPU (always CPU0 on my machine) BUT reschedule another napi poll on
    this same device at some time after that"
the point being that rtl8169_poll will, for this case, NOT re-enable
the NIC's napi interrupts, in the hope that maybe some user work can be
dispatched, so something else will have to schedule the next napi
poll for it. Conceptually, if rtl8169_poll finds no rx work done
on this call, it wants to cause a yield() and then try again.
Except it can't from within the interrupt.

I appreciate this could lead to delays in handling new work so might be
dangerous, but it seems to me to be in line with NAPI objectives so I
wanted to try it. But don't know how. Any hints or thoughts
appreciated.

John

* Re: [RFC PATCH 1/1] bna: Generic Netlink Interface to collect FW trace
From: David Miller @ 2011-04-27  2:17 UTC
  To: debdut; +Cc: shemminger, rmody, netdev, huangj, amathur, ddutt
In-Reply-To: <BANLkTinUsJ2foWKBVb_9nVFA2V_vKb11rA@mail.gmail.com>

From: Debashis Dutt <debdut@gmail.com>
Date: Tue, 26 Apr 2011 19:16:29 -0700

> However, since the generic netlink is a more generic interface, we could use
> this infrastructure in the driver for commands which are not part of
> other standard tools.

You aren't the only device in the world that might want to provide
a facility to fetch firmware traces.

This isn't really a niche thing specific to your device at all.

* Re: [RFC PATCH 1/1] bna: Generic Netlink Interface to collect FW trace
From: Debashis Dutt @ 2011-04-27  2:16 UTC
  To: Stephen Hemminger
  Cc: Rasesh Mody, netdev, davem, huangj, amathur, Debashis Dutt
In-Reply-To: <20110426185917.3d033636@nehalam>

On Tue, Apr 26, 2011 at 6:59 PM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
> On Tue, 26 Apr 2011 18:51:57 -0700
> Rasesh Mody <rmody@brocade.com> wrote:
>
>> This is an RFC patch to the Brocade BNA 10G Ethernet driver. It adds a generic
>> netlink communication interface to the BNA driver to collect firmware traces
>> using the in-kernel generic netlink infrastructure. The driver uses the
>> "dumpit" handler provided by the generic netlink layer to accomplish this. The
>> driver can extend this interface later if required.
>>
>> As of today, there seems to be no standard mechanism to collect debug
>> information such as a firmware trace for a given piece of hardware. Generic
>> Netlink seems to provide a suitable way to do so, without any further
>> addition/modification to the existing kernel implementation.
>>
>> This is an RFC patch inviting suggestions/opinions for improvement/modification
>> and requesting consideration for possible inclusion in the net-next tree.
>>
>> Signed-off-by: Debashis Dutt <ddutt@brocade.com>
>> Signed-off-by: Rasesh Mody <rmody@brocade.com>
>
> Seems like a lot of work for a debug interface. What about debugfs?
> Or is that out of favor now, I can never keep track...

Stephen,

Well, I think debugfs isn't always available, but there could always
be a driver configuration option for that.

However, since the generic netlink is a more generic interface, we could use
this infrastructure in the driver for commands which are not part of
other standard tools.

Thanks
--Debashis

* Re: [PATCH net-next-2.6] ipv4,ipv6,bonding: Restore control over number of peer notifications
From: Ben Hutchings @ 2011-04-27  2:14 UTC
  To: Brian Haley
  Cc: Jay Vosburgh, Andy Gospodarek, David Miller, Patrick McHardy,
	netdev
In-Reply-To: <4DB77AC2.5070207@hp.com>

On Tue, 2011-04-26 at 22:09 -0400, Brian Haley wrote:
> On 04/26/2011 09:25 PM, Ben Hutchings wrote:
> > For backward compatibility, we should retain the module parameters and
> > sysfs attributes to control the number of peer notifications
> > (gratuitous ARPs and unsolicited NAs) sent after bonding failover.
> > Also, it is possible for failover to take place even though the new
> > active slave does not have link up, and in that case the peer
> > notification should be deferred until it does.
> > 
> > Change ipv4 and ipv6 so they do not automatically send peer
> > notifications on bonding failover.
> > 
> > Change the bonding driver to send separate NETDEV_NOTIFY_PEERS
> > notifications when the link is up, as many times as requested.  Since
> > it does not directly control which protocols send notifications, make
> > num_grat_arp and num_unsol_na aliases for a single parameter.  Bump
> > the bonding version number and update its documentation.
> > 
> > Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> 
> Signed-off-by: Brian Haley <brian.haley@hp.com>

I'm not sure what you mean by this.  You didn't write any of it and
you're not a maintainer with your own repository.  Did you mean to say
'Reviewed-by' or 'Acked-by'?

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: [PATCH net-next-2.6] ipv4,ipv6,bonding: Restore control over number of peer notifications
From: Brian Haley @ 2011-04-27  2:09 UTC
  To: Ben Hutchings
  Cc: Jay Vosburgh, Andy Gospodarek, David Miller, Patrick McHardy,
	netdev
In-Reply-To: <1303867552.2850.39.camel@bwh-desktop>

On 04/26/2011 09:25 PM, Ben Hutchings wrote:
> For backward compatibility, we should retain the module parameters and
> sysfs attributes to control the number of peer notifications
> (gratuitous ARPs and unsolicited NAs) sent after bonding failover.
> Also, it is possible for failover to take place even though the new
> active slave does not have link up, and in that case the peer
> notification should be deferred until it does.
> 
> Change ipv4 and ipv6 so they do not automatically send peer
> notifications on bonding failover.
> 
> Change the bonding driver to send separate NETDEV_NOTIFY_PEERS
> notifications when the link is up, as many times as requested.  Since
> it does not directly control which protocols send notifications, make
> num_grat_arp and num_unsol_na aliases for a single parameter.  Bump
> the bonding version number and update its documentation.
> 
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>

Signed-off-by: Brian Haley <brian.haley@hp.com>

* Re: [RFC PATCH 1/1] bna: Generic Netlink Interface to collect FW trace
From: Stephen Hemminger @ 2011-04-27  1:59 UTC
  To: Rasesh Mody; +Cc: netdev, davem, huangj, amathur, Debashis Dutt
In-Reply-To: <1303869117-780-1-git-send-email-rmody@brocade.com>

On Tue, 26 Apr 2011 18:51:57 -0700
Rasesh Mody <rmody@brocade.com> wrote:

> This is an RFC patch to the Brocade BNA 10G Ethernet driver. It adds a generic
> netlink communication interface to the BNA driver to collect firmware traces
> using the in-kernel generic netlink infrastructure. The driver uses the
> "dumpit" handler provided by the generic netlink layer to accomplish this. The
> driver can extend this interface later if required.
> 
> As of today, there seems to be no standard mechanism to collect debug
> information such as a firmware trace for a given piece of hardware. Generic
> Netlink seems to provide a suitable way to do so, without any further
> addition/modification to the existing kernel implementation.
> 
> This is an RFC patch inviting suggestions/opinions for improvement/modification
> and requesting consideration for possible inclusion in the net-next tree.
> 
> Signed-off-by: Debashis Dutt <ddutt@brocade.com>
> Signed-off-by: Rasesh Mody <rmody@brocade.com>

Seems like a lot of work for a debug interface. What about debugfs?
Or is that out of favor now, I can never keep track...

-- 

^ permalink raw reply

* [RFC PATCH 1/1] bna: Generic Netlink Interface to collect FW trace
From: Rasesh Mody @ 2011-04-27  1:51 UTC (permalink / raw)
  To: netdev, davem; +Cc: huangj, amathur, Rasesh Mody, Debashis Dutt

This is an RFC patch to the Brocade BNA 10G Ethernet driver. It adds a generic
netlink communication interface to the BNA driver to collect firmware traces
using the in-kernel generic netlink infrastructure. The driver uses the
"dumpit" handler provided by the generic netlink layer to accomplish this. The
driver can extend this interface later if required.

As of today, there seems to be no standard mechanism to collect debug
information such as firmware trace for a given hardware. Generic Netlink seems
to provide a suitable option to do the same, without any further
addition/modification to the existing kernel implementation.

This is an RFC patch inviting suggestions/opinions for improvement/modification
and requesting consideration for possible inclusion in net-next tree.

Signed-off-by: Debashis Dutt <ddutt@brocade.com>
Signed-off-by: Rasesh Mody <rmody@brocade.com>
---
 drivers/net/bna/bfa_ioc.c   |   59 +++++++++++
 drivers/net/bna/bfa_ioc.h   |    4 +
 drivers/net/bna/bfi.h       |    2 +
 drivers/net/bna/bnad.c      |   39 ++++++++
 drivers/net/bna/bnad.h      |    9 ++-
 drivers/net/bna/bnad_genl.c |  228 +++++++++++++++++++++++++++++++++++++++++++
 drivers/net/bna/bnad_genl.h |   63 ++++++++++++
 7 files changed, 402 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/bna/bnad_genl.c
 create mode 100644 drivers/net/bna/bnad_genl.h

diff --git a/drivers/net/bna/bfa_ioc.c b/drivers/net/bna/bfa_ioc.c
index fcb9bb3..766e48b 100644
--- a/drivers/net/bna/bfa_ioc.c
+++ b/drivers/net/bna/bfa_ioc.c
@@ -2209,6 +2209,65 @@ bfa_nw_ioc_get_mac(struct bfa_ioc *ioc)
 	return ioc->attr->mac;
 }
 
+static int
+bfa_nw_ioc_smem_read(struct bfa_ioc *ioc, void *tbuf, u32 soff, u32 sz)
+{
+	u32 pgnum, loff;
+	__be32 r32;
+	int i, len;
+	u32 *buf = tbuf;
+
+	pgnum = PSS_SMEM_PGNUM(ioc->ioc_regs.smem_pg0, soff);
+	loff = PSS_SMEM_PGOFF(soff);
+
+	/*
+	 *  Hold semaphore to serialize pll init and fwtrc.
+	 */
+	if (!(bfa_nw_ioc_sem_get(ioc->ioc_regs.ioc_init_sem_reg)))
+		return 1;
+
+	writel(pgnum, ioc->ioc_regs.host_page_num_fn);
+
+	len = sz/sizeof(u32);
+	for (i = 0; i < len; i++) {
+		r32 = swab32(readl(ioc->ioc_regs.smem_page_start + loff));
+		buf[i] = be32_to_cpu(r32);
+		loff += sizeof(u32);
+
+		/*
+		 * handle page offset wrap around
+		 */
+		loff = PSS_SMEM_PGOFF(loff);
+		if (loff == 0) {
+			pgnum++;
+			writel(pgnum, ioc->ioc_regs.host_page_num_fn);
+		}
+	}
+	writel(PSS_SMEM_PGNUM(ioc->ioc_regs.smem_pg0, 0),
+			ioc->ioc_regs.host_page_num_fn);
+	/*
+	 *  release semaphore.
+	 */
+	writel(1, ioc->ioc_regs.ioc_init_sem_reg);
+
+	return 0;
+}
+
+int
+bfa_nw_ioc_debug_fwtrc(struct bfa_ioc *ioc, void *trcdata, int *trclen)
+{
+	u32 loff = (BFI_IOC_TRC_OFF + BFA_DBG_FWTRC_LEN * (ioc->port_id));
+	int tlen, status = 0;
+
+	tlen = *trclen;
+	if (tlen > BFA_DBG_FWTRC_LEN)
+		tlen = BFA_DBG_FWTRC_LEN;
+
+	status = bfa_nw_ioc_smem_read(ioc, trcdata, loff, tlen);
+	*trclen = tlen;
+	return status;
+}
+
 /**
  * Firmware failure detected. Start recovery actions.
  */
diff --git a/drivers/net/bna/bfa_ioc.h b/drivers/net/bna/bfa_ioc.h
index bd48abe..383cd59 100644
--- a/drivers/net/bna/bfa_ioc.h
+++ b/drivers/net/bna/bfa_ioc.h
@@ -23,6 +23,9 @@
 #include "bfi.h"
 #include "cna.h"
 
+#define BFA_DBG_FWTRC_LEN	(BFI_IOC_TRC_ENTS * BFI_IOC_TRC_ENT_SZ + \
+				BFI_IOC_TRC_HDR_SZ)
+
 #define BFA_IOC_TOV		3000	/* msecs */
 #define BFA_IOC_HWSEM_TOV	500	/* msecs */
 #define BFA_IOC_HB_TOV		500	/* msecs */
@@ -269,6 +272,7 @@ void bfa_nw_ioc_hbfail_register(struct bfa_ioc *ioc,
 bool bfa_nw_ioc_sem_get(void __iomem *sem_reg);
 void bfa_nw_ioc_sem_release(void __iomem *sem_reg);
 void bfa_nw_ioc_hw_sem_release(struct bfa_ioc *ioc);
+int bfa_nw_ioc_debug_fwtrc(struct bfa_ioc *ioc, void *trcdata, int *trclen);
 void bfa_nw_ioc_fwver_get(struct bfa_ioc *ioc,
 			struct bfi_ioc_image_hdr *fwhdr);
 bool bfa_nw_ioc_fwver_cmp(struct bfa_ioc *ioc,
diff --git a/drivers/net/bna/bfi.h b/drivers/net/bna/bfi.h
index 6050379..ee73b6f 100644
--- a/drivers/net/bna/bfi.h
+++ b/drivers/net/bna/bfi.h
@@ -277,6 +277,8 @@ struct bfi_ioc_getattr_reply {
  */
 #define BFI_IOC_TRC_OFF		(0x4b00)
 #define BFI_IOC_TRC_ENTS	256
+#define BFI_IOC_TRC_ENT_SZ	16
+#define BFI_IOC_TRC_HDR_SZ	32
 
 #define BFI_IOC_FW_SIGNATURE	(0xbfadbfad)
 #define BFI_IOC_MD5SUM_SZ	4
diff --git a/drivers/net/bna/bnad.c b/drivers/net/bna/bnad.c
index e588511..d8d0da0 100644
--- a/drivers/net/bna/bnad.c
+++ b/drivers/net/bna/bnad.c
@@ -25,6 +25,7 @@
 #include <linux/ip.h>
 
 #include "bnad.h"
+#include "bnad_genl.h"
 #include "bna.h"
 #include "cna.h"
 
@@ -44,8 +45,12 @@ MODULE_PARM_DESC(bnad_ioc_auto_recover, "Enable / Disable auto recovery");
 /*
  * Global variables
  */
+u32 bna_id;
 u32 bnad_rxqs_per_cq = 2;
 
+struct mutex bnad_list_mutex;
+LIST_HEAD(bnad_list);
+
 static const u8 bnad_bcast_addr[] =  {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};
 
 /*
@@ -72,6 +77,23 @@ do {								\
 
 #define BNAD_TXRX_SYNC_MDELAY	250	/* 250 msecs */
 
+static void
+bnad_add_to_list(struct bnad *bnad)
+{
+	mutex_lock(&bnad_list_mutex);
+	list_add_tail(&bnad->list_entry, &bnad_list);
+	bna_id++;
+	mutex_unlock(&bnad_list_mutex);
+}
+
+static void
+bnad_remove_from_list(struct bnad *bnad)
+{
+	mutex_lock(&bnad_list_mutex);
+	list_del(&bnad->list_entry);
+	mutex_unlock(&bnad_list_mutex);
+}
+
 /*
  * Reinitialize completions in CQ, once Rx is taken down
  */
@@ -3087,6 +3109,11 @@ bnad_pci_probe(struct pci_dev *pdev,
 	}
 	bnad = netdev_priv(netdev);
 
+	mutex_lock(&bnad_list_mutex);
+	bnad->id = bna_id;
+	mutex_unlock(&bnad_list_mutex);
+
+	bnad_add_to_list(bnad);
 	/*
 	 * PCI initialization
 	 * 	Output : using_dac = 1 for 64 bit DMA
@@ -3189,6 +3216,7 @@ disable_device:
 	bnad_res_free(bnad);
 	bnad_disable_msix(bnad);
 pci_uninit:
+	bnad_remove_from_list(bnad);
 	bnad_pci_uninit(pdev);
 	bnad_lock_uninit(bnad);
 	bnad_uninit(bnad);
@@ -3226,6 +3254,7 @@ bnad_pci_remove(struct pci_dev *pdev)
 
 	bnad_res_free(bnad);
 	bnad_disable_msix(bnad);
+	bnad_remove_from_list(bnad);
 	bnad_pci_uninit(pdev);
 	bnad_lock_uninit(bnad);
 	bnad_uninit(bnad);
@@ -3257,6 +3286,7 @@ bnad_module_init(void)
 
 	pr_info("Brocade 10G Ethernet driver\n");
 
+	mutex_init(&bnad_list_mutex);
 	bfa_nw_ioc_auto_recover(bnad_ioc_auto_recover);
 
 	err = pci_register_driver(&bnad_pci_driver);
@@ -3266,6 +3296,10 @@ bnad_module_init(void)
 		return err;
 	}
 
+	/* Register with generic netlink */
+	if (bnad_genl_init())
+		pr_err("bna: Generic Netlink Register failed\n");
+
 	return 0;
 }
 
@@ -3273,6 +3307,11 @@ static void __exit
 bnad_module_exit(void)
 {
 	pci_unregister_driver(&bnad_pci_driver);
+	mutex_destroy(&bnad_list_mutex);
+
+	/* Unregister with generic netlink */
+	if (bnad_genl_uninit())
+		pr_err("bna: Generic Netlink Unregister failed\n");
 
 	if (bfi_fw)
 		release_firmware(bfi_fw);
diff --git a/drivers/net/bna/bnad.h b/drivers/net/bna/bnad.h
index ccdabad..2afec2b 100644
--- a/drivers/net/bna/bnad.h
+++ b/drivers/net/bna/bnad.h
@@ -279,13 +279,18 @@ struct bnad {
 	char			adapter_name[BNAD_NAME_LEN];
 	char 			port_name[BNAD_NAME_LEN];
 	char			mbox_irq_name[BNAD_NAME_LEN];
+
+	int			id;
+	struct list_head	list_entry;
 };
 
 /*
  * EXTERN VARIABLES
  */
-extern struct firmware *bfi_fw;
-extern u32 		bnad_rxqs_per_cq;
+extern struct firmware		*bfi_fw;
+extern struct mutex		bnad_list_mutex;
+extern struct list_head		bnad_list;
+extern u32			bnad_rxqs_per_cq;
 
 /*
  * EXTERN PROTOTYPES
diff --git a/drivers/net/bna/bnad_genl.c b/drivers/net/bna/bnad_genl.c
new file mode 100644
index 0000000..d8e49ad
--- /dev/null
+++ b/drivers/net/bna/bnad_genl.c
@@ -0,0 +1,228 @@
+/*
+ * Linux network driver for Brocade Converged Network Adapter.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License (GPL) Version 2 as
+ * published by the Free Software Foundation
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+/*
+ * Copyright (c) 2005-2011 Brocade Communications Systems, Inc.
+ * All rights reserved
+ * www.brocade.com
+ */
+#include "bnad.h"
+#include "bnad_genl.h"
+#include "bna.h"
+
+static struct genl_family bnad_genl_family = {
+	.id = GENL_ID_GENERATE,
+	.name = "BNAD_GENL",
+	.version = BNAD_GENL_VERSION,
+	.hdrsize = 0,
+	.maxattr = BNAD_GENL_ATTR_MAX,
+};
+
+struct bnad_fw_debug_info {
+	char	*debug_buffer;
+	u32	buffer_len;
+};
+
+static struct bnad *
+bnad_get_bnadev(int bna_id)
+{
+	struct bnad *bnad;
+
+	mutex_lock(&bnad_list_mutex);
+	list_for_each_entry(bnad, &bnad_list, list_entry) {
+		if (bnad->id == bna_id) {
+			mutex_unlock(&bnad_list_mutex);
+			return bnad;
+		}
+	}
+	mutex_unlock(&bnad_list_mutex);
+	return NULL;
+}
+
+static int
+bnad_genl_fwtrc_msg_fill(struct sk_buff *skb, u32 pid, u32 seq, int flags,
+			struct bnad_fw_debug_info *fw_debug, long *offset)
+{
+	void *genl_msg_hdr = NULL;
+	int sk_buff_len, err = 0;
+
+	/* genlmsg_put */
+	genl_msg_hdr = genlmsg_put(skb, pid, seq, &bnad_genl_family, flags,
+							BNAD_GENL_CMD_FWTRC);
+	if (!genl_msg_hdr) {
+		pr_err("bna: Failed to get the genl_msg_header\n");
+		return -ENOMEM;
+	}
+
+	/* NLA_PUT
+	 * sk_buff_len - available length for attribute payload
+	 * offset - offset in fwtrc buffer
+	 */
+	sk_buff_len = skb_tailroom(skb) - NLA_HDRLEN;
+	if ((fw_debug->buffer_len - *offset) < sk_buff_len)
+		sk_buff_len = fw_debug->buffer_len - *offset;
+
+	NLA_PUT(skb, BNAD_GENL_ATTR_FWTRC, sk_buff_len,
+				((fw_debug->debug_buffer) + *offset));
+
+	*offset += sk_buff_len;
+
+	/* genlmsg_end */
+	err = genlmsg_end(skb, genl_msg_hdr);
+	if (err < 0) {
+		pr_err("bna: Failed to do genlmsg_end\n");
+		return -ENOMEM;
+	}
+
+	return 0;
+
+nla_put_failure:
+	pr_err("bna: Failed to do NLA_PUT\n");
+	genlmsg_cancel(skb, genl_msg_hdr);
+	return -EMSGSIZE;
+}
+
+static int
+bnad_genl_fwtrc_get(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	/*
+	 * cb->args[0] firmware trace offset
+	 * cb->args[1] firmware buffer pointer
+	 */
+	long offset = cb->args[0];
+	int err;
+	struct bnad *bnad = NULL;
+	struct bnad_fw_debug_info *fw_debug = NULL;
+	struct bnad_genl_debug *iocmd = NULL;
+
+	if (offset >= (long)BFA_DBG_FWTRC_LEN) {
+		fw_debug = (struct bnad_fw_debug_info *)cb->args[1];
+		vfree(fw_debug->debug_buffer);
+		fw_debug->debug_buffer = NULL;
+		kfree(fw_debug);
+		fw_debug = NULL;
+		return skb->len;
+	}
+
+	if (offset == 0) {
+		/* Get the driver instance */
+		iocmd = (struct bnad_genl_debug *)
+			(((char *)NLMSG_DATA(cb->nlh)) +
+			GENL_HDRLEN + NLA_HDRLEN);
+		bnad = bnad_get_bnadev(iocmd->inst_no);
+		if (!bnad) {
+			pr_warn("bna: Failed to get driver instance\n");
+			return -EINVAL;
+		}
+
+		/* Allocate memory for fwtrc structure and buffer */
+		fw_debug = kzalloc(sizeof(struct bnad_fw_debug_info),
+								GFP_KERNEL);
+		if (!fw_debug)
+			return -ENOMEM;
+
+		fw_debug->buffer_len = iocmd->bufsz;
+
+		fw_debug->debug_buffer = vzalloc(fw_debug->buffer_len);
+		if (!fw_debug->debug_buffer) {
+			kfree(fw_debug);
+			fw_debug = NULL;
+			pr_err("bnad[%d]: Failed to allocate fwtrc buffer\n",
+								bnad->id);
+			return -ENOMEM;
+		}
+
+		/* Get the firmware trace into the buffer */
+		err = bfa_nw_ioc_debug_fwtrc(&bnad->bna.device.ioc,
+				fw_debug->debug_buffer, &fw_debug->buffer_len);
+		if (err) {
+			vfree(fw_debug->debug_buffer);
+			fw_debug->debug_buffer = NULL;
+			kfree(fw_debug);
+			fw_debug = NULL;
+			pr_err("bnad[%d]: Failed to collect fwtrc\n", bnad->id);
+			return -ENOMEM;
+		}
+
+		cb->args[1] = (long)fw_debug;
+	} else
+		fw_debug = (struct bnad_fw_debug_info *)cb->args[1];
+
+	err = bnad_genl_fwtrc_msg_fill(skb, NETLINK_CB(cb->skb).pid,
+			cb->nlh->nlmsg_seq, NLM_F_MULTI, fw_debug, &offset);
+	if (err < 0)
+		return err;
+
+	cb->args[0] = offset;
+	return skb->len;
+}
+
+static struct genl_ops bnad_genl_ops[] = {
+	{
+	.cmd = BNAD_GENL_CMD_FWTRC,
+	.flags = 0,
+	.policy = NULL,
+	.doit = NULL,
+	.dumpit = bnad_genl_fwtrc_get,
+	},
+};
+
+int
+bnad_genl_init(void)
+{
+	int i, err = 0;
+
+	/* Register family */
+	err = genl_register_family(&bnad_genl_family);
+	if (err) {
+		pr_err("bna: failed to register with Netlink\n");
+		return err;
+	}
+	pr_info("bna: registered with Netlink\n");
+
+	/* Register ops */
+	for (i = 0; i < sizeof(bnad_genl_ops) / sizeof(bnad_genl_ops[0]); i++) {
+		err = genl_register_ops(&bnad_genl_family, &bnad_genl_ops[i]);
+		if (err)
+			pr_err("bna: failed to register netlink op %u\n",
+				bnad_genl_ops[i].cmd);
+		else
+			pr_info("bna: registered netlink op %u\n",
+				bnad_genl_ops[i].cmd);
+	}
+
+	return err;
+}
+
+int
+bnad_genl_uninit(void)
+{
+	int i, err = 0;
+
+	for (i = 0; i < sizeof(bnad_genl_ops) / sizeof(bnad_genl_ops[0]); i++) {
+		err = genl_unregister_ops(&bnad_genl_family, &bnad_genl_ops[i]);
+		if (err)
+			pr_err("bna: failed to unregister netlink op %u\n",
+				bnad_genl_ops[i].cmd);
+		else
+			pr_info("bna: unregistered netlink op %u\n",
+				bnad_genl_ops[i].cmd);
+	}
+
+	err = genl_unregister_family(&bnad_genl_family);
+	if (err)
+		pr_err("bna: failed to unregister with Netlink\n");
+	else
+		pr_info("bna: unregistered with Netlink\n");
+
+	return err;
+}
diff --git a/drivers/net/bna/bnad_genl.h b/drivers/net/bna/bnad_genl.h
new file mode 100644
index 0000000..fb75a6e
--- /dev/null
+++ b/drivers/net/bna/bnad_genl.h
@@ -0,0 +1,63 @@
+/*
+ * Linux network driver for Brocade Converged Network Adapter.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License (GPL) Version 2 as
+ * published by the Free Software Foundation
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+/*
+ * Copyright (c) 2005-2011 Brocade Communications Systems, Inc.
+ * All rights reserved
+ * www.brocade.com
+ */
+#ifndef __BNAD_GENL_H__
+#define __BNAD_GENL_H__
+
+#include <linux/device.h>
+#include <linux/netdevice.h>
+#include <linux/gfp.h>
+#include <linux/skbuff.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <net/genetlink.h>
+
+/* Attributes */
+enum {
+	BNAD_GENL_ATTR_UNSPEC,
+	BNAD_GENL_ATTR_FWTRC,
+	__BNAD_GENL_ATTR_MAX
+};
+
+/* Effectively a single attribute */
+#define BNAD_GENL_ATTR_MAX (__BNAD_GENL_ATTR_MAX - 1)
+
+enum {
+	BNAD_GENL_VERSION = 1,
+};
+
+/* Commands/Responses */
+enum {
+	BNAD_GENL_CMD_UNSPEC,
+	BNAD_GENL_CMD_FWTRC,
+	__BNAD_GENL_CMD_MAX,
+};
+
+struct bnad_genl_debug {
+	int		status;
+	u16		bnad_num;
+	u16		rsvd;
+	u32		bufsz;
+	int		inst_no;
+	u64		buf_ptr;
+	u64		offset;
+};
+
+extern int bnad_genl_init(void);
+extern int bnad_genl_uninit(void);
+
+#endif /* __BNAD_GENL_H__ */
-- 
1.7.1
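
The dumpit handler in the patch above is called repeatedly, and each call
hands out as much of the captured trace as fits in the current skb while
cb->args[0] remembers the offset. A minimal userspace sketch of that resume
logic follows. The names dump_state and dump_next_chunk are hypothetical and
do not appear in the patch. The sketch also ends the dump against the actual
buffer length rather than BFA_DBG_FWTRC_LEN, which appears to matter when the
requested size is smaller than the full trace; that choice is an assumption,
not something the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the per-dump state kept in cb->args[]. */
struct dump_state {
	const unsigned char *buf;	/* captured trace buffer */
	size_t len;			/* number of valid bytes in buf */
	size_t offset;			/* next byte to hand out */
};

/*
 * Copy the next chunk of the trace into out[], at most room bytes.
 * Returns the number of bytes copied; 0 means the dump is complete.
 */
static size_t dump_next_chunk(struct dump_state *st, unsigned char *out,
			      size_t room)
{
	size_t left, n;

	assert(st->offset <= st->len);
	left = st->len - st->offset;
	n = left < room ? left : room;
	memcpy(out, st->buf + st->offset, n);
	st->offset += n;
	return n;
}
```

A caller would keep invoking dump_next_chunk() with whatever room the next
message offers and stop at the first call that returns 0.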


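bfa_nw_ioc_smem_read() in the patch above reads shared memory through a
paged window: it selects a page, reads word by word, and moves to the next
page when the in-page offset wraps to zero. The sketch below models only that
address arithmetic. The 32 KiB page size and all SKETCH_/smem_pos names are
assumptions for illustration; the real split is whatever PSS_SMEM_PGNUM and
PSS_SMEM_PGOFF define in the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed window size (32 KiB); the driver's macros define the real one. */
#define SKETCH_PAGE_SHIFT 15u
#define SKETCH_PAGE_MASK  ((1u << SKETCH_PAGE_SHIFT) - 1u)

struct smem_pos {
	uint32_t pgnum;	/* page currently selected */
	uint32_t loff;	/* byte offset inside that page */
};

/* Split a flat shared-memory offset into page number and in-page offset. */
static struct smem_pos smem_pos_from_offset(uint32_t pg0, uint32_t soff)
{
	struct smem_pos p;

	p.pgnum = pg0 + (soff >> SKETCH_PAGE_SHIFT);
	p.loff = soff & SKETCH_PAGE_MASK;
	return p;
}

/* Advance by one 32-bit word; returns 1 when a new page must be selected. */
static int smem_pos_next_word(struct smem_pos *p)
{
	assert((p->loff & 3u) == 0);	/* offsets are word aligned */
	p->loff = (p->loff + (uint32_t)sizeof(uint32_t)) & SKETCH_PAGE_MASK;
	if (p->loff == 0) {
		p->pgnum++;
		return 1;
	}
	return 0;
}
```

In the driver, the "new page" case corresponds to the writel() of pgnum to
host_page_num_fn inside the read loop.
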
^ permalink raw reply related

* Re: [PATCH net-next-2.6 0/7] SCTP updates for net-next-2.6
From: David Miller @ 2011-04-27  1:47 UTC (permalink / raw)
  To: yjwei; +Cc: netdev, linux-sctp
In-Reply-To: <4DB76A64.3020708@cn.fujitsu.com>

From: Wei Yongjun <yjwei@cn.fujitsu.com>
Date: Wed, 27 Apr 2011 08:59:16 +0800

> And I have a stupid question about the rules for backporting.  Since those
> patches have existed for such a long time, when I backport them, should I
> fix the bug in the original patch, or create a new patch to fix it? Also, what
> about things that need improvement, like the ->dst_saddr() method?

I think, since these are going in as patches, you should make any
corrections to those patches by fixing the patch.

Adding new patches is just going to keep the bug in there if people
try to bisect through these changes, and that's no good.

Thanks.

^ permalink raw reply

* Re: [PATCH net-next-2.6] ipv4,ipv6,bonding: Restore control over number of peer notifications
From: Jay Vosburgh @ 2011-04-27  1:44 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Andy Gospodarek, David Miller, Patrick McHardy, netdev,
	Brian Haley
In-Reply-To: <1303867552.2850.39.camel@bwh-desktop>

Ben Hutchings <bhutchings@solarflare.com> wrote:

>For backward compatibility, we should retain the module parameters and
>sysfs attributes to control the number of peer notifications
>(gratuitous ARPs and unsolicited NAs) sent after bonding failover.
>Also, it is possible for failover to take place even though the new
>active slave does not have link up, and in that case the peer
>notification should be deferred until it does.
>
>Change ipv4 and ipv6 so they do not automatically send peer
>notifications on bonding failover.
>
>Change the bonding driver to send separate NETDEV_NOTIFY_PEERS
>notifications when the link is up, as many times as requested.  Since
>it does not directly control which protocols send notifications, make
>num_grat_arp and num_unsol_na aliases for a single parameter.  Bump
>the bonding version number and update its documentation.
>
>Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>

Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>

>---
> Documentation/networking/bonding.txt |   34 +++++++++----------
> drivers/net/bonding/bond_main.c      |   59 ++++++++++++++++++++++++++++++++++
> drivers/net/bonding/bond_sysfs.c     |   26 +++++++++++++++
> drivers/net/bonding/bonding.h        |    6 ++-
> net/ipv4/devinet.c                   |    1 -
> net/ipv6/ndisc.c                     |    1 -
> 6 files changed, 105 insertions(+), 22 deletions(-)
>
>diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
>index e27202b..1f45bd8 100644
>--- a/Documentation/networking/bonding.txt
>+++ b/Documentation/networking/bonding.txt
>@@ -1,7 +1,7 @@
>
> 		Linux Ethernet Bonding Driver HOWTO
>
>-		Latest update: 23 September 2009
>+		Latest update: 27 April 2011
>
> Initial release : Thomas Davis <tadavis at lbl.gov>
> Corrections, HA extensions : 2000/10/03-15 :
>@@ -585,25 +585,23 @@ mode
> 		chosen.
>
> num_grat_arp
>-
>-	Specifies the number of gratuitous ARPs to be issued after a
>-	failover event.  One gratuitous ARP is issued immediately after
>-	the failover, subsequent ARPs are sent at a rate of one per link
>-	monitor interval (arp_interval or miimon, whichever is active).
>-
>-	The valid range is 0 - 255; the default value is 1.  This option
>-	affects only the active-backup mode.  This option was added for
>-	bonding version 3.3.0.
>-
> num_unsol_na
>
>-	Specifies the number of unsolicited IPv6 Neighbor Advertisements
>-	to be issued after a failover event.  One unsolicited NA is issued
>-	immediately after the failover.
>-
>-	The valid range is 0 - 255; the default value is 1.  This option
>-	affects only the active-backup mode.  This option was added for
>-	bonding version 3.4.0.
>+	Specify the number of peer notifications (gratuitous ARPs and
>+	unsolicited IPv6 Neighbor Advertisements) to be issued after a
>+	failover event.  As soon as the link is up on the new slave
>+	(possibly immediately) a peer notification is sent on the
>+	bonding device and each VLAN sub-device.  This is repeated at
>+	each link monitor interval (arp_interval or miimon, whichever
>+	is active) if the number is greater than 1.
>+
>+	The valid range is 0 - 255; the default value is 1.  These options
>+	affect only the active-backup mode.  These options were added for
>+	bonding versions 3.3.0 and 3.4.0 respectively.
>+
>+	From Linux 2.6.40 and bonding version 3.7.1, these notifications
>+	are generated by the ipv4 and ipv6 code and the numbers of
>+	repetitions cannot be set independently.
>
> primary
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 66d9dc6..22bd03b 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -89,6 +89,7 @@
>
> static int max_bonds	= BOND_DEFAULT_MAX_BONDS;
> static int tx_queues	= BOND_DEFAULT_TX_QUEUES;
>+static int num_peer_notif = 1;
> static int miimon	= BOND_LINK_MON_INTERV;
> static int updelay;
> static int downdelay;
>@@ -111,6 +112,10 @@ module_param(max_bonds, int, 0);
> MODULE_PARM_DESC(max_bonds, "Max number of bonded devices");
> module_param(tx_queues, int, 0);
> MODULE_PARM_DESC(tx_queues, "Max number of transmit queues (default = 16)");
>+module_param_named(num_grat_arp, num_peer_notif, int, 0644);
>+MODULE_PARM_DESC(num_grat_arp, "Number of peer notifications to send on failover event (alias of num_unsol_na)");
>+module_param_named(num_unsol_na, num_peer_notif, int, 0644);
>+MODULE_PARM_DESC(num_unsol_na, "Number of peer notifications to send on failover event (alias of num_grat_arp)");
> module_param(miimon, int, 0);
> MODULE_PARM_DESC(miimon, "Link check interval in milliseconds");
> module_param(updelay, int, 0);
>@@ -1082,6 +1087,21 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
> 	return bestslave;
> }
>
>+static bool bond_should_notify_peers(struct bonding *bond)
>+{
>+	struct slave *slave = bond->curr_active_slave;
>+
>+	pr_debug("bond_should_notify_peers: bond %s slave %s\n",
>+		 bond->dev->name, slave ? slave->dev->name : "NULL");
>+
>+	if (!slave || !bond->send_peer_notif ||
>+	    test_bit(__LINK_STATE_LINKWATCH_PENDING, &slave->dev->state))
>+		return false;
>+
>+	bond->send_peer_notif--;
>+	return true;
>+}
>+
> /**
>  * change_active_interface - change the active slave into the specified one
>  * @bond: our bonding struct
>@@ -1149,16 +1169,28 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
> 			bond_set_slave_inactive_flags(old_active);
>
> 		if (new_active) {
>+			bool should_notify_peers = false;
>+
> 			bond_set_slave_active_flags(new_active);
>
> 			if (bond->params.fail_over_mac)
> 				bond_do_fail_over_mac(bond, new_active,
> 						      old_active);
>
>+			if (netif_running(bond->dev)) {
>+				bond->send_peer_notif =
>+					bond->params.num_peer_notif;
>+				should_notify_peers =
>+					bond_should_notify_peers(bond);
>+			}
>+
> 			write_unlock_bh(&bond->curr_slave_lock);
> 			read_unlock(&bond->lock);
>
> 			netdev_bonding_change(bond->dev, NETDEV_BONDING_FAILOVER);
>+			if (should_notify_peers)
>+				netdev_bonding_change(bond->dev,
>+						      NETDEV_NOTIFY_PEERS);
>
> 			read_lock(&bond->lock);
> 			write_lock_bh(&bond->curr_slave_lock);
>@@ -2556,6 +2588,7 @@ void bond_mii_monitor(struct work_struct *work)
> {
> 	struct bonding *bond = container_of(work, struct bonding,
> 					    mii_work.work);
>+	bool should_notify_peers = false;
>
> 	read_lock(&bond->lock);
> 	if (bond->kill_timers)
>@@ -2564,6 +2597,8 @@ void bond_mii_monitor(struct work_struct *work)
> 	if (bond->slave_cnt == 0)
> 		goto re_arm;
>
>+	should_notify_peers = bond_should_notify_peers(bond);
>+
> 	if (bond_miimon_inspect(bond)) {
> 		read_unlock(&bond->lock);
> 		rtnl_lock();
>@@ -2582,6 +2617,12 @@ re_arm:
> 				   msecs_to_jiffies(bond->params.miimon));
> out:
> 	read_unlock(&bond->lock);
>+
>+	if (should_notify_peers) {
>+		rtnl_lock();
>+		netdev_bonding_change(bond->dev, NETDEV_NOTIFY_PEERS);
>+		rtnl_unlock();
>+	}
> }
>
> static __be32 bond_glean_dev_ip(struct net_device *dev)
>@@ -3154,6 +3195,7 @@ void bond_activebackup_arp_mon(struct work_struct *work)
> {
> 	struct bonding *bond = container_of(work, struct bonding,
> 					    arp_work.work);
>+	bool should_notify_peers = false;
> 	int delta_in_ticks;
>
> 	read_lock(&bond->lock);
>@@ -3166,6 +3208,8 @@ void bond_activebackup_arp_mon(struct work_struct *work)
> 	if (bond->slave_cnt == 0)
> 		goto re_arm;
>
>+	should_notify_peers = bond_should_notify_peers(bond);
>+
> 	if (bond_ab_arp_inspect(bond, delta_in_ticks)) {
> 		read_unlock(&bond->lock);
> 		rtnl_lock();
>@@ -3185,6 +3229,12 @@ re_arm:
> 		queue_delayed_work(bond->wq, &bond->arp_work, delta_in_ticks);
> out:
> 	read_unlock(&bond->lock);
>+
>+	if (should_notify_peers) {
>+		rtnl_lock();
>+		netdev_bonding_change(bond->dev, NETDEV_NOTIFY_PEERS);
>+		rtnl_unlock();
>+	}
> }
>
> /*-------------------------- netdev event handling --------------------------*/
>@@ -3494,6 +3544,8 @@ static int bond_close(struct net_device *bond_dev)
>
> 	write_lock_bh(&bond->lock);
>
>+	bond->send_peer_notif = 0;
>+
> 	/* signal timers not to re-arm */
> 	bond->kill_timers = 1;
>
>@@ -4571,6 +4623,12 @@ static int bond_check_params(struct bond_params *params)
> 		use_carrier = 1;
> 	}
>
>+	if (num_peer_notif < 0 || num_peer_notif > 255) {
>+		pr_warning("Warning: num_grat_arp/num_unsol_na (%d) not in range 0-255 so it was reset to 1\n",
>+			   num_peer_notif);
>+		num_peer_notif = 1;
>+	}
>+
> 	/* reset values for 802.3ad */
> 	if (bond_mode == BOND_MODE_8023AD) {
> 		if (!miimon) {
>@@ -4760,6 +4818,7 @@ static int bond_check_params(struct bond_params *params)
> 	params->mode = bond_mode;
> 	params->xmit_policy = xmit_hashtype;
> 	params->miimon = miimon;
>+	params->num_peer_notif = num_peer_notif;
> 	params->arp_interval = arp_interval;
> 	params->arp_validate = arp_validate_value;
> 	params->updelay = updelay;
>diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
>index 935406a..4059bfc 100644
>--- a/drivers/net/bonding/bond_sysfs.c
>+++ b/drivers/net/bonding/bond_sysfs.c
>@@ -869,6 +869,30 @@ static DEVICE_ATTR(ad_select, S_IRUGO | S_IWUSR,
> 		   bonding_show_ad_select, bonding_store_ad_select);
>
> /*
>+ * Show and set the number of peer notifications to send after a failover event.
>+ */
>+static ssize_t bonding_show_num_peer_notif(struct device *d,
>+					   struct device_attribute *attr,
>+					   char *buf)
>+{
>+	struct bonding *bond = to_bond(d);
>+	return sprintf(buf, "%d\n", bond->params.num_peer_notif);
>+}
>+
>+static ssize_t bonding_store_num_peer_notif(struct device *d,
>+					    struct device_attribute *attr,
>+					    const char *buf, size_t count)
>+{
>+	struct bonding *bond = to_bond(d);
>+	int err = kstrtou8(buf, 10, &bond->params.num_peer_notif);
>+	return err ? err : count;
>+}
>+static DEVICE_ATTR(num_grat_arp, S_IRUGO | S_IWUSR,
>+		   bonding_show_num_peer_notif, bonding_store_num_peer_notif);
>+static DEVICE_ATTR(num_unsol_na, S_IRUGO | S_IWUSR,
>+		   bonding_show_num_peer_notif, bonding_store_num_peer_notif);
>+
>+/*
>  * Show and set the MII monitor interval.  There are two tricky bits
>  * here.  First, if MII monitoring is activated, then we must disable
>  * ARP monitoring.  Second, if the timer isn't running, we must
>@@ -1566,6 +1590,8 @@ static struct attribute *per_bond_attrs[] = {
> 	&dev_attr_lacp_rate.attr,
> 	&dev_attr_ad_select.attr,
> 	&dev_attr_xmit_hash_policy.attr,
>+	&dev_attr_num_grat_arp.attr,
>+	&dev_attr_num_unsol_na.attr,
> 	&dev_attr_miimon.attr,
> 	&dev_attr_primary.attr,
> 	&dev_attr_primary_reselect.attr,
>diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
>index 85fb822..d08362e 100644
>--- a/drivers/net/bonding/bonding.h
>+++ b/drivers/net/bonding/bonding.h
>@@ -24,8 +24,8 @@
> #include "bond_3ad.h"
> #include "bond_alb.h"
>
>-#define DRV_VERSION	"3.7.0"
>-#define DRV_RELDATE	"June 2, 2010"
>+#define DRV_VERSION	"3.7.1"
>+#define DRV_RELDATE	"April 27, 2011"
> #define DRV_NAME	"bonding"
> #define DRV_DESCRIPTION	"Ethernet Channel Bonding Driver"
>
>@@ -149,6 +149,7 @@ struct bond_params {
> 	int mode;
> 	int xmit_policy;
> 	int miimon;
>+	u8 num_peer_notif;
> 	int arp_interval;
> 	int arp_validate;
> 	int use_carrier;
>@@ -231,6 +232,7 @@ struct bonding {
> 	rwlock_t lock;
> 	rwlock_t curr_slave_lock;
> 	s8       kill_timers;
>+	u8	 send_peer_notif;
> 	s8	 setup_by_slave;
> 	s8       igmp_retrans;
> #ifdef CONFIG_PROC_FS
>diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
>index acf553f..5345b0b 100644
>--- a/net/ipv4/devinet.c
>+++ b/net/ipv4/devinet.c
>@@ -1203,7 +1203,6 @@ static int inetdev_event(struct notifier_block *this, unsigned long event,
> 			break;
> 		/* fall through */
> 	case NETDEV_NOTIFY_PEERS:
>-	case NETDEV_BONDING_FAILOVER:
> 		/* Send gratuitous ARP to notify of link change */
> 		inetdev_send_gratuitous_arp(dev, in_dev);
> 		break;
>diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
>index 69aacd1..7596f07 100644
>--- a/net/ipv6/ndisc.c
>+++ b/net/ipv6/ndisc.c
>@@ -1747,7 +1747,6 @@ static int ndisc_netdev_event(struct notifier_block *this, unsigned long event,
> 		fib6_run_gc(~0UL, net);
> 		break;
> 	case NETDEV_NOTIFY_PEERS:
>-	case NETDEV_BONDING_FAILOVER:
> 		ndisc_send_unsol_na(dev);
> 		break;
> 	default:
>-- 
>1.7.4
>
>
>-- 
>Ben Hutchings, Senior Software Engineer, Solarflare
>Not speaking for my employer; that's the marketing department's job.
>They asked us to note that Solarflare product names are trademarked.

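The countdown that the quoted patch adds can be modelled in a few lines of
userspace C. This is only a sketch of the decision made by
bond_should_notify_peers(): the struct and function names below are
hypothetical, and the linkwatch_pending flag stands in for the
__LINK_STATE_LINKWATCH_PENDING test on the active slave.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state consulted by bond_should_notify_peers(). */
struct bond_model {
	bool has_active_slave;		/* curr_active_slave != NULL */
	bool linkwatch_pending;		/* link change not yet processed */
	unsigned char send_peer_notif;	/* notifications still owed */
};

/* Arm the countdown on failover, as bond_change_active_slave() does. */
static void model_failover(struct bond_model *b, unsigned char num_peer_notif)
{
	b->has_active_slave = true;
	b->send_peer_notif = num_peer_notif;
}

/* One decision per failover or monitor tick: true means notify peers now. */
static bool model_should_notify(struct bond_model *b)
{
	if (!b->has_active_slave || !b->send_peer_notif ||
	    b->linkwatch_pending)
		return false;

	b->send_peer_notif--;
	return true;
}
```

With num_peer_notif set to 2 and the link still pending at failover, the
first call defers without consuming a notification, and the two owed
notifications go out on later monitor ticks once the link state has settled.
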
^ permalink raw reply

* [PATCH net-next-2.6] ipv4,ipv6,bonding: Restore control over number of peer notifications
From: Ben Hutchings @ 2011-04-27  1:25 UTC (permalink / raw)
  To: Jay Vosburgh, Andy Gospodarek
  Cc: David Miller, Patrick McHardy, netdev, Brian Haley

For backward compatibility, we should retain the module parameters and
sysfs attributes to control the number of peer notifications
(gratuitous ARPs and unsolicited NAs) sent after bonding failover.
Also, it is possible for failover to take place even though the new
active slave does not have link up, and in that case the peer
notification should be deferred until it does.

Change ipv4 and ipv6 so they do not automatically send peer
notifications on bonding failover.

Change the bonding driver to send separate NETDEV_NOTIFY_PEERS
notifications when the link is up, as many times as requested.  Since
it does not directly control which protocols send notifications, make
num_grat_arp and num_unsol_na aliases for a single parameter.  Bump
the bonding version number and update its documentation.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 Documentation/networking/bonding.txt |   34 +++++++++----------
 drivers/net/bonding/bond_main.c      |   59 ++++++++++++++++++++++++++++++++++
 drivers/net/bonding/bond_sysfs.c     |   26 +++++++++++++++
 drivers/net/bonding/bonding.h        |    6 ++-
 net/ipv4/devinet.c                   |    1 -
 net/ipv6/ndisc.c                     |    1 -
 6 files changed, 105 insertions(+), 22 deletions(-)

diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index e27202b..1f45bd8 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -1,7 +1,7 @@
 
 		Linux Ethernet Bonding Driver HOWTO
 
-		Latest update: 23 September 2009
+		Latest update: 27 April 2011
 
 Initial release : Thomas Davis <tadavis at lbl.gov>
 Corrections, HA extensions : 2000/10/03-15 :
@@ -585,25 +585,23 @@ mode
 		chosen.
 
 num_grat_arp
-
-	Specifies the number of gratuitous ARPs to be issued after a
-	failover event.  One gratuitous ARP is issued immediately after
-	the failover, subsequent ARPs are sent at a rate of one per link
-	monitor interval (arp_interval or miimon, whichever is active).
-
-	The valid range is 0 - 255; the default value is 1.  This option
-	affects only the active-backup mode.  This option was added for
-	bonding version 3.3.0.
-
 num_unsol_na
 
-	Specifies the number of unsolicited IPv6 Neighbor Advertisements
-	to be issued after a failover event.  One unsolicited NA is issued
-	immediately after the failover.
-
-	The valid range is 0 - 255; the default value is 1.  This option
-	affects only the active-backup mode.  This option was added for
-	bonding version 3.4.0.
+	Specify the number of peer notifications (gratuitous ARPs and
+	unsolicited IPv6 Neighbor Advertisements) to be issued after a
+	failover event.  As soon as the link is up on the new slave
+	(possibly immediately) a peer notification is sent on the
+	bonding device and each VLAN sub-device.  This is repeated at
+	each link monitor interval (arp_interval or miimon, whichever
+	is active) if the number is greater than 1.
+
+	The valid range is 0 - 255; the default value is 1.  These options
+	affect only the active-backup mode.  These options were added for
+	bonding versions 3.3.0 and 3.4.0 respectively.
+
+	From Linux 2.6.40 and bonding version 3.7.1, these notifications
+	are generated by the ipv4 and ipv6 code and the numbers of
+	repetitions cannot be set independently.
 
 primary
 
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 66d9dc6..22bd03b 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -89,6 +89,7 @@
 
 static int max_bonds	= BOND_DEFAULT_MAX_BONDS;
 static int tx_queues	= BOND_DEFAULT_TX_QUEUES;
+static int num_peer_notif = 1;
 static int miimon	= BOND_LINK_MON_INTERV;
 static int updelay;
 static int downdelay;
@@ -111,6 +112,10 @@ module_param(max_bonds, int, 0);
 MODULE_PARM_DESC(max_bonds, "Max number of bonded devices");
 module_param(tx_queues, int, 0);
 MODULE_PARM_DESC(tx_queues, "Max number of transmit queues (default = 16)");
+module_param_named(num_grat_arp, num_peer_notif, int, 0644);
+MODULE_PARM_DESC(num_grat_arp, "Number of peer notifications to send on failover event (alias of num_unsol_na)");
+module_param_named(num_unsol_na, num_peer_notif, int, 0644);
+MODULE_PARM_DESC(num_unsol_na, "Number of peer notifications to send on failover event (alias of num_grat_arp)");
 module_param(miimon, int, 0);
 MODULE_PARM_DESC(miimon, "Link check interval in milliseconds");
 module_param(updelay, int, 0);
@@ -1082,6 +1087,21 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
 	return bestslave;
 }
 
+static bool bond_should_notify_peers(struct bonding *bond)
+{
+	struct slave *slave = bond->curr_active_slave;
+
+	pr_debug("bond_should_notify_peers: bond %s slave %s\n",
+		 bond->dev->name, slave ? slave->dev->name : "NULL");
+
+	if (!slave || !bond->send_peer_notif ||
+	    test_bit(__LINK_STATE_LINKWATCH_PENDING, &slave->dev->state))
+		return false;
+
+	bond->send_peer_notif--;
+	return true;
+}
+
 /**
  * change_active_interface - change the active slave into the specified one
  * @bond: our bonding struct
@@ -1149,16 +1169,28 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
 			bond_set_slave_inactive_flags(old_active);
 
 		if (new_active) {
+			bool should_notify_peers = false;
+
 			bond_set_slave_active_flags(new_active);
 
 			if (bond->params.fail_over_mac)
 				bond_do_fail_over_mac(bond, new_active,
 						      old_active);
 
+			if (netif_running(bond->dev)) {
+				bond->send_peer_notif =
+					bond->params.num_peer_notif;
+				should_notify_peers =
+					bond_should_notify_peers(bond);
+			}
+
 			write_unlock_bh(&bond->curr_slave_lock);
 			read_unlock(&bond->lock);
 
 			netdev_bonding_change(bond->dev, NETDEV_BONDING_FAILOVER);
+			if (should_notify_peers)
+				netdev_bonding_change(bond->dev,
+						      NETDEV_NOTIFY_PEERS);
 
 			read_lock(&bond->lock);
 			write_lock_bh(&bond->curr_slave_lock);
@@ -2556,6 +2588,7 @@ void bond_mii_monitor(struct work_struct *work)
 {
 	struct bonding *bond = container_of(work, struct bonding,
 					    mii_work.work);
+	bool should_notify_peers = false;
 
 	read_lock(&bond->lock);
 	if (bond->kill_timers)
@@ -2564,6 +2597,8 @@ void bond_mii_monitor(struct work_struct *work)
 	if (bond->slave_cnt == 0)
 		goto re_arm;
 
+	should_notify_peers = bond_should_notify_peers(bond);
+
 	if (bond_miimon_inspect(bond)) {
 		read_unlock(&bond->lock);
 		rtnl_lock();
@@ -2582,6 +2617,12 @@ re_arm:
 				   msecs_to_jiffies(bond->params.miimon));
 out:
 	read_unlock(&bond->lock);
+
+	if (should_notify_peers) {
+		rtnl_lock();
+		netdev_bonding_change(bond->dev, NETDEV_NOTIFY_PEERS);
+		rtnl_unlock();
+	}
 }
 
 static __be32 bond_glean_dev_ip(struct net_device *dev)
@@ -3154,6 +3195,7 @@ void bond_activebackup_arp_mon(struct work_struct *work)
 {
 	struct bonding *bond = container_of(work, struct bonding,
 					    arp_work.work);
+	bool should_notify_peers = false;
 	int delta_in_ticks;
 
 	read_lock(&bond->lock);
@@ -3166,6 +3208,8 @@ void bond_activebackup_arp_mon(struct work_struct *work)
 	if (bond->slave_cnt == 0)
 		goto re_arm;
 
+	should_notify_peers = bond_should_notify_peers(bond);
+
 	if (bond_ab_arp_inspect(bond, delta_in_ticks)) {
 		read_unlock(&bond->lock);
 		rtnl_lock();
@@ -3185,6 +3229,12 @@ re_arm:
 		queue_delayed_work(bond->wq, &bond->arp_work, delta_in_ticks);
 out:
 	read_unlock(&bond->lock);
+
+	if (should_notify_peers) {
+		rtnl_lock();
+		netdev_bonding_change(bond->dev, NETDEV_NOTIFY_PEERS);
+		rtnl_unlock();
+	}
 }
 
 /*-------------------------- netdev event handling --------------------------*/
@@ -3494,6 +3544,8 @@ static int bond_close(struct net_device *bond_dev)
 
 	write_lock_bh(&bond->lock);
 
+	bond->send_peer_notif = 0;
+
 	/* signal timers not to re-arm */
 	bond->kill_timers = 1;
 
@@ -4571,6 +4623,12 @@ static int bond_check_params(struct bond_params *params)
 		use_carrier = 1;
 	}
 
+	if (num_peer_notif < 0 || num_peer_notif > 255) {
+		pr_warning("Warning: num_grat_arp/num_unsol_na (%d) not in range 0-255 so it was reset to 1\n",
+			   num_peer_notif);
+		num_peer_notif = 1;
+	}
+
 	/* reset values for 802.3ad */
 	if (bond_mode == BOND_MODE_8023AD) {
 		if (!miimon) {
@@ -4760,6 +4818,7 @@ static int bond_check_params(struct bond_params *params)
 	params->mode = bond_mode;
 	params->xmit_policy = xmit_hashtype;
 	params->miimon = miimon;
+	params->num_peer_notif = num_peer_notif;
 	params->arp_interval = arp_interval;
 	params->arp_validate = arp_validate_value;
 	params->updelay = updelay;
diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 935406a..4059bfc 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -869,6 +869,30 @@ static DEVICE_ATTR(ad_select, S_IRUGO | S_IWUSR,
 		   bonding_show_ad_select, bonding_store_ad_select);
 
 /*
+ * Show and set the number of peer notifications to send after a failover event.
+ */
+static ssize_t bonding_show_num_peer_notif(struct device *d,
+					   struct device_attribute *attr,
+					   char *buf)
+{
+	struct bonding *bond = to_bond(d);
+	return sprintf(buf, "%d\n", bond->params.num_peer_notif);
+}
+
+static ssize_t bonding_store_num_peer_notif(struct device *d,
+					    struct device_attribute *attr,
+					    const char *buf, size_t count)
+{
+	struct bonding *bond = to_bond(d);
+	int err = kstrtou8(buf, 10, &bond->params.num_peer_notif);
+	return err ? err : count;
+}
+static DEVICE_ATTR(num_grat_arp, S_IRUGO | S_IWUSR,
+		   bonding_show_num_peer_notif, bonding_store_num_peer_notif);
+static DEVICE_ATTR(num_unsol_na, S_IRUGO | S_IWUSR,
+		   bonding_show_num_peer_notif, bonding_store_num_peer_notif);
+
+/*
  * Show and set the MII monitor interval.  There are two tricky bits
  * here.  First, if MII monitoring is activated, then we must disable
  * ARP monitoring.  Second, if the timer isn't running, we must
@@ -1566,6 +1590,8 @@ static struct attribute *per_bond_attrs[] = {
 	&dev_attr_lacp_rate.attr,
 	&dev_attr_ad_select.attr,
 	&dev_attr_xmit_hash_policy.attr,
+	&dev_attr_num_grat_arp.attr,
+	&dev_attr_num_unsol_na.attr,
 	&dev_attr_miimon.attr,
 	&dev_attr_primary.attr,
 	&dev_attr_primary_reselect.attr,
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 85fb822..d08362e 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -24,8 +24,8 @@
 #include "bond_3ad.h"
 #include "bond_alb.h"
 
-#define DRV_VERSION	"3.7.0"
-#define DRV_RELDATE	"June 2, 2010"
+#define DRV_VERSION	"3.7.1"
+#define DRV_RELDATE	"April 27, 2011"
 #define DRV_NAME	"bonding"
 #define DRV_DESCRIPTION	"Ethernet Channel Bonding Driver"
 
@@ -149,6 +149,7 @@ struct bond_params {
 	int mode;
 	int xmit_policy;
 	int miimon;
+	u8 num_peer_notif;
 	int arp_interval;
 	int arp_validate;
 	int use_carrier;
@@ -231,6 +232,7 @@ struct bonding {
 	rwlock_t lock;
 	rwlock_t curr_slave_lock;
 	s8       kill_timers;
+	u8	 send_peer_notif;
 	s8	 setup_by_slave;
 	s8       igmp_retrans;
 #ifdef CONFIG_PROC_FS
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index acf553f..5345b0b 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1203,7 +1203,6 @@ static int inetdev_event(struct notifier_block *this, unsigned long event,
 			break;
 		/* fall through */
 	case NETDEV_NOTIFY_PEERS:
-	case NETDEV_BONDING_FAILOVER:
 		/* Send gratuitous ARP to notify of link change */
 		inetdev_send_gratuitous_arp(dev, in_dev);
 		break;
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 69aacd1..7596f07 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -1747,7 +1747,6 @@ static int ndisc_netdev_event(struct notifier_block *this, unsigned long event,
 		fib6_run_gc(~0UL, net);
 		break;
 	case NETDEV_NOTIFY_PEERS:
-	case NETDEV_BONDING_FAILOVER:
 		ndisc_send_unsol_na(dev);
 		break;
 	default:
-- 
1.7.4


-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* Re: [PATCH net-next-2.6 0/7] SCTP updates for net-next-2.6
From: Wei Yongjun @ 2011-04-27  0:59 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-sctp
In-Reply-To: <20110426.145120.28826019.davem@davemloft.net>

> Wei, while you are re-spinning this patch set I want to bring up
> something I just noticed in the SCTP code.
>
> The ->dst_saddr() method is not used by anything, it appears.
>
> The ipv4 variant, sctp_v4_dst_saddr() is called internally by the
> ipv4 specific code, but that's it.
>
> So I think the ->dst_saddr member of sctp_pf can be completely
> removed, as can sctp_v6_dst_saddr().
>
> The sctp_v4_dst_saddr() function, of course, will need to be retained.

David, thanks for noticing this. I will clean it up.

I also have a question about the rules for backporting.  Since those
patches have existed for such a long time, when I backport them, should
I fix bugs in the original patch, or create a new patch to fix them?
And what about things that need improvement, like the ->dst_saddr()
method?

^ permalink raw reply
