* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Segher Boessenkool @ 2007-08-20 22:04 UTC (permalink / raw)
To: Chris Snook
Cc: Christoph Lameter, Paul Mackerras, heiko.carstens, horms,
linux-kernel, Paul E. McKenney, ak, netdev, cfriesen, akpm,
rpjday, Nick Piggin, linux-arch, jesper.juhl, satyam, zlynx,
schwidefsky, Herbert Xu, davem, Linus Torvalds, wensong, wjiang
In-Reply-To: <46C997B1.1010800@redhat.com>
>> Right. ROTFL... volatile actually breaks atomic_t instead of making
>> it safe. x++ becomes a register load, increment and a register store.
>> Without volatile we can increment the memory directly. It seems that
>> volatile requires that the variable is loaded into a register first
>> and then operated upon. Understandable when you think about volatile
>> being used to access memory mapped I/O registers where a RMW
>> operation could be problematic.
>
> So, if we want consistent behavior, we're pretty much screwed unless
> we use inline assembler everywhere?
Nah, this whole argument is flawed -- "without volatile" we still
*cannot* "increment the memory directly". On x86, you need a lock
prefix; on other archs, some other mechanism to make the memory
increment an *atomic* memory increment.
And no, RMW on MMIO isn't "problematic" at all, either.
An RMW op is a read op, a modify op, and a write op, all rolled
into one opcode. But three actual operations.
The advantages of asm code for atomic_{read,set} are:
1) all the other atomic ops are implemented that way already;
2) you have full control over the asm insns selected, in particular,
you can guarantee you *do* get an atomic op;
3) you don't need to use "volatile <data>" which generates
not-all-that-good code on archs like x86, and we want to get rid
of it anyway since it is problematic in many ways;
4) you don't need to use *(volatile <type>*)&<data>, which a) doesn't
exist in C; b) isn't documented or supported in GCC; c) has a recent
history of bugginess; d) _still uses volatile objects_; e) _still_
is problematic in almost all those same ways as in 3);
5) you can mix atomic and non-atomic accesses to the atomic_t, which
you cannot with the other alternatives.
The only disadvantage I know of is potentially slightly worse
instruction scheduling. This is a generic asm() problem: GCC
cannot see what actual insns are inside the asm() block.
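As a rough illustration of the distinction above (not the kernel's
actual implementation), the sketch below uses GCC's __atomic builtins
as a stand-in for per-arch asm. The type my_atomic_t and all helper
names are hypothetical. The point is only that atomic read/set and
atomic increment are explicit operations, while a plain access to the
same field stays possible, with no volatile qualifier on the data.

```c
#include <assert.h>

/* Hypothetical stand-in for atomic_t; note: no volatile on the field. */
typedef struct { int counter; } my_atomic_t;

/* Plain, non-atomic access: the compiler is free to optimize it. */
static inline int my_plain_read(my_atomic_t *v)
{
    return v->counter;
}

/* Atomic load, requested explicitly instead of via a volatile cast. */
static inline int my_atomic_read(my_atomic_t *v)
{
    return __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
}

/* Atomic store. */
static inline void my_atomic_set(my_atomic_t *v, int i)
{
    __atomic_store_n(&v->counter, i, __ATOMIC_RELAXED);
}

/* Atomic read-modify-write: still a read, a modify and a write,
 * but performed as one indivisible operation. */
static inline int my_atomic_inc_return(my_atomic_t *v)
{
    return __atomic_add_fetch(&v->counter, 1, __ATOMIC_SEQ_CST);
}

/* Usage example mixing atomic and plain accesses on one object. */
void my_atomic_example(void)
{
    my_atomic_t v = { 0 };

    my_atomic_set(&v, 1);
    assert(my_atomic_inc_return(&v) == 2);
    assert(my_atomic_read(&v) == 2);
    assert(my_plain_read(&v) == 2);
}
```

The builtins hide the instruction selection that hand-written asm would
make explicit, so this shows the shape of the interface rather than the
"full control over the insns" argument.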
Segher
^ permalink raw reply
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
From: Patrick Geoffray @ 2007-08-20 20:33 UTC (permalink / raw)
To: Felix Marti
Cc: Evgeniy Polyakov, jeff, netdev, rdreier, linux-kernel, general,
David Miller
In-Reply-To: <8A71B368A89016469F72CD08050AD334018E2115@maui.asicdesigners.com>
Felix Marti wrote:
> Yes, the app will take the cache hits when accessing the data. However,
> the fact remains that if there is a copy in the receive path, you
> require and additional 3x memory BW (which is very significant at these
> high rates and most likely the bottleneck for most current systems)...
> and somebody always has to take the cache miss be it the copy_to_user or
> the app.
The cache miss is going to cost you half the memory bandwidth of a full
copy. If the data is already in cache, then the copy is cheaper.
However, removing the copy removes the kernel from the picture on the
receive side, so you lose demultiplexing, asynchronism, security,
accounting, flow-control, swapping, etc. If it's ok with you to not use
the kernel stack, then why expect to fit in the existing infrastructure
anyway ?
> Yes, RDMA support is there... but we could make it better and easier to
What do you need from the kernel for RDMA support beyond HW drivers ? A
fast way to pin and translate user memory (ie registration). That is
pretty much the sandbox that David referred to.
Eventually, it would be useful to be able to track the VM space to
implement a registration cache instead of using ugly hacks in user-space
to hijack malloc, but this is completely independent from the net stack.
> use. We have a problem today with port sharing and there was a proposal
The port spaces are either totally separate and there is no issue, or
completely identical and you should then run your connection manager in
user-space or fix your middlewares.
> and not for technical reasons. I believe this email threads shows in
> detail how RDMA (a network technology) is treated as bastard child by
> the network folks, well at least by one of them.
I don't think it's fair. This thread actually shows how pushy some RDMA
folks are about not acknowledging that the current infrastructure is
here for a reason, and about confusing zero-copy with RDMA.
This is a similar argument to the TOE discussion, and it was
definitely a good decision to not mess up the Linux stack with TOEs.
Patrick
^ permalink raw reply
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
From: Andi Kleen @ 2007-08-20 20:33 UTC (permalink / raw)
To: Thomas Graf
Cc: Evgeniy Polyakov, jeff, netdev, rdreier, linux-kernel, Andi Kleen,
general, David Miller
In-Reply-To: <20070820201808.GM32236@postel.suug.ch>
> GPUs have almost no influence on system security,
Unless you use direct rendering from user space.
-Andi
^ permalink raw reply
* [PATCH 0/2] qla3xxx: receive path bugfixes.
From: Ron Mercer @ 2007-08-20 20:32 UTC (permalink / raw)
To: ron.mercer, jeff; +Cc: netdev
The following two patches fix:
- An undocumented "feature" where the 4032 chip sets bit-7
  of the opcode for an inbound completion if it's for a VLAN.
- The access of stale data on a completion entry.
These patches were built and tested on 2.6.23-rc1.
Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
^ permalink raw reply
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
From: Thomas Graf @ 2007-08-20 20:18 UTC (permalink / raw)
To: Felix Marti
Cc: Andi Kleen, Evgeniy Polyakov, jeff, netdev, rdreier, linux-kernel,
general, David Miller
In-Reply-To: <8A71B368A89016469F72CD08050AD334018E2149@maui.asicdesigners.com>
* Felix Marti <felix@chelsio.com> 2007-08-20 12:02
> These graphic adapters provide a wealth of features that you can take
> advantage of to bring these amazing graphics to life. General purpose
> CPUs cannot keep up. Chelsio offload devices do the same thing in the
> realm of networking. - Will there be things you can't do, probably yes,
> but as I said, there are lots of knobs to turn (and the latest and
> greatest feature that gets hyped up might not always be the best thing
> since sliced bread anyway; what happened to BIC love? ;)
GPUs have almost no influence on system security; the network stack, OTOH,
is probably the most vulnerable part of an operating system. Even if all
vendors implemented all the features collected over the last years
properly, which seems unlikely, having such an essential and critical
part depend on the vendor of my network card without being able to even
verify it properly is truly frightening.
^ permalink raw reply
* Re: [PATCH] phy layer: fix genphy_setup_forced (don't reset)
From: Andy Fleming @ 2007-08-20 19:43 UTC (permalink / raw)
To: Domen Puncer; +Cc: jeff, netdev
In-Reply-To: <20070817065445.GH13994@moe.telargo.com>
On Aug 17, 2007, at 01:54, Domen Puncer wrote:
> Writing BMCR_RESET bit will reset MII_BMCR to default values. This is
> clearly not what we want.
>
>
> Signed-off-by: Domen Puncer <domen.puncer@telargo.com>
Acked-by: Andy Fleming <afleming@freescale.com>
I could have sworn there was a patch that did this, already, but it
must have lost steam somewhere.
Andy
^ permalink raw reply
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
From: Rick Jones @ 2007-08-20 19:16 UTC (permalink / raw)
To: Andi Kleen; +Cc: jeff, netdev, rdreier, linux-kernel, general, David Miller
In-Reply-To: <p733aye1n39.fsf@bingen.suse.de>
Andi Kleen wrote:
> TSO is beneficial for the software again. The linux code currently
> takes several locks and does quite a few function calls for each
> packet and using larger packets lowers this overhead. At least with
> 10GbE saving CPU cycles is still quite important.
Some quick netperf TCP_RR tests between a pair of dual-core rx6600's running
2.6.23-rc3. The NICs are dual-core e1000's connected back-to-back with the
interrupt throttle disabled. I like using TCP_RR to tickle path-length
questions because it rarely runs into bandwidth limitations regardless of the
link-type.
First, with TSO enabled on both sides, then with it disabled, netperf/netserver
bound to the same CPU as takes interrupts, which is the "best" place to be for a
TCP_RR test (although not always for a TCP_STREAM test...):
:~# netperf -T 1 -t TCP_RR -H 192.168.2.105 -I 99,1 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.105
(192.168.2.105) port 0 AF_INET : +/-0.5% @ 99% conf. : first burst 0 : cpu bind
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput : 0.3%
!!! Local CPU util : 39.3%
!!! Remote CPU util : 40.6%
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.    CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate      local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec   % S    % S    us/Tr   us/Tr
16384  87380  1       1      10.01   18611.32  20.96  22.35  22.522  24.017
16384  87380
:~# ethtool -K eth2 tso off
e1000: eth2: e1000_set_tso: TSO is Disabled
:~# netperf -T 1 -t TCP_RR -H 192.168.2.105 -I 99,1 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.105
(192.168.2.105) port 0 AF_INET : +/-0.5% @ 99% conf. : first burst 0 : cpu bind
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput : 0.4%
!!! Local CPU util : 21.0%
!!! Remote CPU util : 25.2%
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.    CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate      local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec   % S    % S    us/Tr   us/Tr
16384  87380  1       1      10.01   19812.51  17.81  17.19  17.983  17.358
16384  87380
While the confidence intervals for CPU util weren't hit, I suspect the
differences in service demand were still real. On throughput we are talking
about +/- 0.2%, for CPU util we are talking about +/- 20% (percent not
percentage points) in the first test and 12.5% in the second.
So, in broad handwaving terms, TSO increased the per-transaction service demand
by something along the lines of (23.27 - 17.67)/17.67 or ~30% and the
transaction rate decreased by ~6%.
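The handwaving above can be reproduced with a few lines of C. This is
only a sketch: the helper names mean2() and pct_change() are made up
here, and averaging the local and remote service demand is an
assumption about how the 23.27 and 17.67 figures were obtained.

```c
#include <assert.h>

/* Mean of the local and remote figures for one run. */
static inline double mean2(double local, double remote)
{
    return (local + remote) / 2.0;
}

/* Change of `value` relative to `baseline`, in percent. */
static inline double pct_change(double value, double baseline)
{
    assert(baseline != 0.0);
    return (value - baseline) / baseline * 100.0;
}

/* Service demand (us/Tr) with TSO on, relative to TSO off. */
double tso_service_demand_change(void)
{
    double tso_on  = mean2(22.522, 24.017);
    double tso_off = mean2(17.983, 17.358);

    return pct_change(tso_on, tso_off);
}

/* Transaction rate (per sec) with TSO on, relative to TSO off. */
double tso_transaction_rate_change(void)
{
    return pct_change(18611.32, 19812.51);
}
```

With the numbers from the two runs, the service demand change comes out
a little under 32% and the transaction rate change at roughly -6%,
which matches the broad-strokes figures quoted.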
rick jones
bitrate blindless is a constant concern
^ permalink raw reply
* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
From: Felix Marti @ 2007-08-20 19:02 UTC (permalink / raw)
To: Andi Kleen
Cc: Evgeniy Polyakov, jeff, netdev, rdreier, linux-kernel, general,
David Miller
In-Reply-To: <p73y7g6yt3j.fsf@bingen.suse.de>
> -----Original Message-----
> From: ak@suse.de [mailto:ak@suse.de] On Behalf Of Andi Kleen
> Sent: Monday, August 20, 2007 11:11 AM
> To: Felix Marti
> Cc: Evgeniy Polyakov; jeff@garzik.org; netdev@vger.kernel.org;
> rdreier@cisco.com; linux-kernel@vger.kernel.org;
> general@lists.openfabrics.org; David Miller
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
>
> "Felix Marti" <felix@chelsio.com> writes:
>
> > What I was referring to is that TSO(/LRO) have their own
> > issues, some alluded to by Roland and me. In fact, customers working on
> > the LSR couldn't use TSO due to the burstiness it introduces
>
> That was in old kernels where TSO didn't honor the initial cwnd correctly,
> right? I assume it's long fixed.
>
> If not please clarify what the problem was.
The problem is that Ethernet is about the only technology that
discloses 'useable' throughput while everybody else talks about
signaling rates ;) - OC-192 can carry about 9.128Gbps (or close to that
number) and hence 10Gbps Ethernet was overwhelming the OC-192 network.
The customer needed to schedule packets at about 98% of OC-192
throughput in order to avoid packet drop. The scheduling needed to be
done on a per packet basis and not per 'burst of packets' basis in order
to avoid packet drop.
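To put a rough number on per-packet versus per-burst scheduling, here
is a back-of-the-envelope sketch. It assumes 1500-byte packets, takes
the ~9.128Gbps figure at face value, ignores all framing overhead, and
the helper name pacing_interval_ns() is made up for illustration.

```c
#include <assert.h>

/* Time, in nanoseconds, that `bytes` bytes occupy at `rate_bps`. */
static inline double pacing_interval_ns(double bytes, double rate_bps)
{
    assert(rate_bps > 0.0);
    return bytes * 8.0 / rate_bps * 1e9;
}

/* Spacing between 1500-byte packets paced at 98% of ~9.128Gbps. */
double paced_packet_gap_ns(void)
{
    return pacing_interval_ns(1500.0, 0.98 * 9.128e9);
}

/* Duration of one 64KB burst sent back-to-back at 10Gbps. */
double tso_burst_duration_ns(void)
{
    return pacing_interval_ns(65536.0, 10e9);
}
```

Under these assumptions a paced packet goes out about every 1.34us,
while a single 64KB burst at line rate lasts about 52us, during which
the sender exceeds the rate the OC-192 side can carry.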
>
> > have a look at graphics.
> > Graphics used to be done by the host CPU and now we have dedicated
> > graphics adapters that do a much better job...
>
> Is your off load device as programable as a modern GPU?
It has a lot of knobs to turn.
>
> > farfetched that offload devices can do a better job at a data-flow
> > problem?
>
> One big difference is that there is no potentially adverse and
> always varying internet between the graphics card and your monitor.
These graphic adapters provide a wealth of features that you can take
advantage of to bring these amazing graphics to life. General purpose
CPUs cannot keep up. Chelsio offload devices do the same thing in the
realm of networking. - Will there be things you can't do, probably yes,
but as I said, there are lots of knobs to turn (and the latest and
greatest feature that gets hyped up might not always be the best thing
since sliced bread anyway; what happened to BIC love? ;)
>
> -Andi
^ permalink raw reply
* Re: [RFT] r8169 changes against 2.6.23-rc3
From: Chuck Lever @ 2007-08-20 18:58 UTC (permalink / raw)
To: Francois Romieu; +Cc: netdev ML
In-Reply-To: <20070818100701.GA20703@electric-eye.fr.zoreil.com>
Francois Romieu wrote:
> The latest serie of r8169 changes is available against 2.6.23-rc3 as:
> http://www.fr.zoreil.com/people/francois/misc/20070818-2.6.23-rc3-r8169-test.patch
>
> or (tarball sits one level higher):
>
> http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.23-rc3/r8169-20070818/
>
> or (rebase prone branch)
>
> git://electric-eye.fr.zoreil.com/home/romieu/linux-2.6.git#r8169
>
> Please do not clone your whole git kernel tree from here, thanks.
>
> Changes (most recent first):
>
> - eeprom read support
> - phy init cleanup
> - PHY init for the 8168
> - make room for more PHY init changes
> - remove dead wood
> - add MAC identifiers
> - MSI support
> - correct phy parameters for the 8110SC
>
> The first patch of the serie ("correct phy parameters for the 8110SC") has
> been elaborated with Edward Hsu from Realtek and it should help some owners
> of 8169 chipsets. If there is no report of regression for it on any
> chispet and it is reported to fix someone's problems, I will send it to
> Jeff Garzik for inclusion in 2.6.23 as a bugfix.
>
> Anything else in this serie has not been tested on a wide scale nor acked
> by the manufacturer: I consider it post 2.6.23 material. That being said,
> the MSI changes seem fine and the "PHY init for the 8168" patch could make
> a difference for the users of the 8168 whose link is not properly
> negotiated.
>
> Success and failure reports or patches will be welcome. Please Cc: netdev
> and include "r8169" in the Subject.
Tested 2.6.23-rc3 plus your patch on my dual-R8169 mini-ITX Jetway
J7F4K1G2E mainboard. No problems to report.
^ permalink raw reply
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
From: Andi Kleen @ 2007-08-20 18:10 UTC (permalink / raw)
To: Felix Marti
Cc: Evgeniy Polyakov, jeff, netdev, rdreier, linux-kernel, general,
David Miller
In-Reply-To: <8A71B368A89016469F72CD08050AD334018E2115@maui.asicdesigners.com>
"Felix Marti" <felix@chelsio.com> writes:
> What I was referring to is that TSO(/LRO) have their own
> issues, some alluded to by Roland and me. In fact, customers working on
> the LSR couldn't use TSO due to the burstiness it introduces
That was in old kernels where TSO didn't honor the initial cwnd correctly,
right? I assume it's long fixed.
If not please clarify what the problem was.
> have a look at graphics.
> Graphics used to be done by the host CPU and now we have dedicated
> graphics adapters that do a much better job...
Is your offload device as programmable as a modern GPU?
> farfetched that offload devices can do a better job at a data-flow
> problem?
One big difference is that there is no potentially adverse and
always varying internet between the graphics card and your monitor.
-Andi
^ permalink raw reply
* Re: skb_pull_rcsum - Fatal exception in interrupt
From: Alan J. Wylie @ 2007-08-20 17:04 UTC (permalink / raw)
To: Brandeburg, Jesse; +Cc: e1000-devel, Linux Network Development list
In-Reply-To: <36D9DB17C6DE9E40B059440DB8D95F52032A44D8@orsmsx418.amr.corp.intel.com>
On Mon, 20 Aug 2007 09:21:54 -0700, "Brandeburg, Jesse" <jesse.brandeburg@intel.com> said:
> Hi Alan, I work on the team that supports e1000, I'd be interested
> in seeing the dmesg output from the machine before it crashes, maybe
> you can add that to your web collection of data below?
Don't worry - it's definitely not an e1000 problem. I'm in contact
with the netdev guys, who have produced a patch.
Thanks anyway
Alan.
--
Alan J. Wylie http://www.wylie.me.uk/
"Perfection [in design] is achieved not when there is nothing left to add,
but rather when there is nothing left to take away."
-- Antoine de Saint-Exupery
^ permalink raw reply
* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
From: Felix Marti @ 2007-08-20 16:53 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: jeff, netdev, rdreier, linux-kernel, general, David Miller
In-Reply-To: <20070820094317.GA14817@2ka.mipt.ru>
> -----Original Message-----
> From: Evgeniy Polyakov [mailto:johnpol@2ka.mipt.ru]
> Sent: Monday, August 20, 2007 2:43 AM
> To: Felix Marti
> Cc: David Miller; sean.hefty@intel.com; netdev@vger.kernel.org;
> rdreier@cisco.com; general@lists.openfabrics.org;
> linux-kernel@vger.kernel.org; jeff@garzik.org
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
>
> On Sun, Aug 19, 2007 at 05:47:59PM -0700, Felix Marti
> (felix@chelsio.com) wrote:
> > [Felix Marti] David and Herbert, so you agree that the user<>kernel
> > space memory copy overhead is a significant overhead and we want to
> > enable zero-copy in both the receive and transmit path? - Yes, copy
>
> It depends. If you need to access that data after received, you will get
> cache miss and performance will not be much better (if any) than with copy.
Yes, the app will take the cache hits when accessing the data. However,
the fact remains that if there is a copy in the receive path, you
require an additional 3x memory BW (which is very significant at these
high rates and most likely the bottleneck for most current systems)...
and somebody always has to take the cache miss be it the copy_to_user or
the app.
>
> > avoidance is mainly an API issue and unfortunately the so widely used
> > (synchronous) sockets API doesn't make copy avoidance easy, which is one
> > area where protocol offload can help. Yes, some apps can resort to
> > sendfile() but there are many apps which seem to have trouble switching
> > to that API... and what about the receive path?
>
> There is number of implementations, and all they are suitable for is
> to have recvfile(), since this is likely the only case, which can work
> without cache.
>
> And actually RDMA stack exist and no one said it should be thrown away
> _until_ it messes with main stack. It started to steal ports. What will
> happen when it gets all port space and no new legal network connection
> can be opened, although there is no way to show to user who got it?
> What will happen if hardware RDMA connection got terminated and software
> could not free the port? Will RDMA request to export connection reset
> functions out of stack to drop network connections which are on the ports
> which are supposed to be used by new RDMA connections?
Yes, RDMA support is there... but we could make it better and easier to
use. We have a problem today with port sharing and there was a proposal
to address the issue by tighter integration (see the beginning of the
thread) but the proposal got shot down immediately... because it is RDMA
and not for technical reasons. I believe this email thread shows in
detail how RDMA (a network technology) is treated as a bastard child by
the network folks, well at least by one of them.
>
> RDMA is not a problem, but how it influence to the network stack is.
> Let's better think about how to work correctly with network stack (since
> we already have that cr^Wdifferent hardware) instead of saying that
> others do bad work and do not allow shiny new feature to exist.
By no means did I want to imply that others do bad work; are you
referring to me using TSO implementation issues as an example? - If so,
let me clarify: I understand that the TSO implementation took some time
to get right. What I was referring to is that TSO(/LRO) have their own
issues, some alluded to by Roland and me. In fact, customers working on
the LSR couldn't use TSO due to the burstiness it introduces and had to
fall back to our fine-grained packet scheduling done in the offload
device. I am for variety, let us support new technologies that solve
real problems (lots of folks are buying this stuff for a reason) instead
of the 'ah, it's brain-dead and has no future' attitude... there is
precedent for offloading the host CPUs: have a look at graphics.
Graphics used to be done by the host CPU and now we have dedicated
graphics adapters that do a much better job... so, why is it so
farfetched that offload devices can do a better job at a data-flow
problem?
>
> --
> Evgeniy Polyakov
^ permalink raw reply
* Re: 2.6.23-rc3 and SKY2 driver issue
From: Stephen Hemminger @ 2007-08-20 16:47 UTC (permalink / raw)
To: James Corey; +Cc: Michal Piotrowski, linux-kernel, Netdev
In-Reply-To: <624025.57129.qm@web90408.mail.mud.yahoo.com>
On Mon, 20 Aug 2007 09:23:46 -0700 (PDT)
James Corey <ploversegg@yahoo.com> wrote:
>
> --- Stephen Hemminger
> <shemminger@linux-foundation.org> wrote:
>
> > On Mon, 20 Aug 2007 08:42:21 -0700 (PDT)
> > James Corey <ploversegg@yahoo.com> wrote:
> >
> > >
> > > --- Stephen Hemminger
> > > <shemminger@linux-foundation.org> wrote:
> > >
> > > > On Thu, 16 Aug 2007 10:25:45 +0200
> > > > "Michal Piotrowski"
> > >
> > > > Please reproduce with a more recent kernel?
> > >
> > > Um, I thought 2.6.23rc WAS pretty recent. :-)
> > >
> > > I'll check if there is something newer in the
> > > repository now.
> > >
> >
> > What is the chip version? Please send console log:
> > "dmesg | grep sky2"
> >
> >
> > --
> > Stephen Hemminger <shemminger@linux-foundation.org>
> >
>
>
> Ah ... details.
>
> Machine:
>
> Dell Optiplex 745
>
> Kernel:
>
> 2.6.23-rc3 #1 SMP Tue Aug 14 19:44:07 EDT 2007 x86_64
> x86_64 x86_64 GNU/Linux
>
> Card:
>
> D-Link DGE-550SX
>
> # dmesg | grep sky2
> sky2 0000:04:00.0: v1.16 addr 0xdf9fc000 irq 16
> Yukon-XL (0xb3) rev 3
> sky2 eth1: addr 00:17:9a:73:87:60
> sky2 eth1: enabling interface
> sky2 eth1: ram buffer 96K
> sky2 eth1: Link is up at 1000 Mbps, full duplex, flow
> control none
Okay, this is a fiber-based card. Does the error happen right away (i.e. all packets
have bad sum), or is it sporadic (i.e. some magic packet or race in hardware)?
Also, are you using regular (1500) or jumbo (9000) MTU?
--
Stephen Hemminger <shemminger@linux-foundation.org>
^ permalink raw reply
* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
From: Felix Marti @ 2007-08-20 16:26 UTC (permalink / raw)
To: Andi Kleen
Cc: David Miller, sean.hefty, netdev, rdreier, general, linux-kernel,
jeff
In-Reply-To: <p733aye1n39.fsf@bingen.suse.de>
> -----Original Message-----
> From: ak@suse.de [mailto:ak@suse.de] On Behalf Of Andi Kleen
> Sent: Monday, August 20, 2007 4:07 AM
> To: Felix Marti
> Cc: David Miller; sean.hefty@intel.com; netdev@vger.kernel.org;
> rdreier@cisco.com; general@lists.openfabrics.org;
> linux-kernel@vger.kernel.org; jeff@garzik.org
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
>
> "Felix Marti" <felix@chelsio.com> writes:
> > > avoidance gains of TSO and LRO are still a very worthwhile savings.
> > So, i.e. with TSO, your saving about 16 headers (let us say 14 + 20 +
> > 20), 864B, when moving ~64KB of payload - looks like very much in the
> > noise to me.
>
> TSO is beneficial for the software again. The linux code currently
> takes several locks and does quite a few function calls for each
> packet and using larger packets lowers this overhead. At least with
> 10GbE saving CPU cycles is still quite important.
>
> > an option to get 'high performance'
>
> Shouldn't you qualify that?
>
> It is unlikely you really duplicated all the tuning for corner cases
> that went over many years into good software TCP stacks in your
> hardware. So e.g. for wide area networks with occasional packet loss
> the software might well perform better.
Yes, it used to be sufficient to submit performance data to show that a
technology makes 'sense'. In fact, I believe it was Alan Cox who once
said that linux will have a look at offload once an offload device holds
the land speed record (probably assuming that the day never comes ;).
For the last few years it has been Chelsio offload devices that have
been improving their own LSRs (as IO bus speeds have been increasing).
It is worthwhile to point out that OC-192 doesn't offer full 10Gbps BW
and the fine-grained (per packet and not per TSO-burst) packet scheduler
in the offload device played a crucial part in pushing performance to
the limits of what OC-192 can do. Most other customers use our offload
products in low-latency cluster environments. - The problem with offload
devices is that they are not all born equal and there have been a lot of
poor implementations giving the technology a bad name. I can only speak
for Chelsio and do claim that we have a solid implementation that scales
from low-latency cluster environments to LFNs.
Andi, I could present performance numbers, i.e. throughput and CPU
utilization as a function of IO size, number of connections, ... in a
back-to-back environment and/or in a cluster environment... but what
will it get me? I'd still get hit by the 'not integrated' hammer :(
>
> -Andi
^ permalink raw reply
* Re: Linksys Gigabit USB2.0 adapter (asix) regression
From: Erik Slagter @ 2007-08-20 16:23 UTC (permalink / raw)
To: David Hollis; +Cc: netdev
In-Reply-To: <1186603993.3078.16.camel@dhollis-lnx.sunera.com>
David Hollis wrote:
> It's a bit of a longshot, but I notice that EEPROM index 0x17 returns
> 0x580 for you, 0x180 for my devices. Based on that, my devices go
> through the "gpio phymode == 1 path" GPIO init sequence, and yours goes
> through the other path ( if ((eeprom >> 8) != 1) { ). Comment out the
> if() else portion so that you go through the "phymode == 1" path and see
> if that makes a difference. That segment should look something like
> this:
>
> /*
> if ((eeprom >> 8) != 1) {
> asix_write_gpio(dev, 0x003c, 30);
> asix_write_gpio(dev, 0x001c, 300);
> asix_write_gpio(dev, 0x003c, 30);
> } else {
> */
> dbg("gpio phymode == 1 path");
> asix_write_gpio(dev, AX_GPIO_GPO1EN, 30);
> asix_write_gpio(dev, AX_GPIO_GPO1EN | AX_GPIO_GPO_1,
> 30);
> // }
Tried, but now it doesn't work at all, no LEDs and no traffic.
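For reference, the path selection discussed above depends only on the
high byte of the EEPROM word read from index 0x17. The sketch below is
a minimal illustration, not the asix driver code: the helper name
takes_phymode1_path() is hypothetical, and only the (eeprom >> 8) test
and the two EEPROM values come from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero when the high byte of the EEPROM word equals 1, i.e. when
 * the quoted condition ((eeprom >> 8) != 1) is false and the
 * "gpio phymode == 1 path" would be taken. */
static inline int takes_phymode1_path(uint16_t eeprom)
{
    return (eeprom >> 8) == 1;
}

/* The two EEPROM words reported in this thread. */
void check_reported_eeprom_words(void)
{
    assert(!takes_phymode1_path(0x0580)); /* high byte 5: other path */
    assert(takes_phymode1_path(0x0180));  /* high byte 1: phymode == 1 */
}
```

So 0x580 and 0x180 differ only in bits 10 of the word, yet that is
enough to send the two devices through different GPIO init sequences.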
^ permalink raw reply
* Re: 2.6.23-rc3 and SKY2 driver issue
From: James Corey @ 2007-08-20 16:23 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Michal Piotrowski, linux-kernel, Netdev
In-Reply-To: <20070820090227.6ceb2934@freepuppy.rosehill.hemminger.net>
--- Stephen Hemminger
<shemminger@linux-foundation.org> wrote:
> On Mon, 20 Aug 2007 08:42:21 -0700 (PDT)
> James Corey <ploversegg@yahoo.com> wrote:
>
> >
> > --- Stephen Hemminger
> > <shemminger@linux-foundation.org> wrote:
> >
> > > On Thu, 16 Aug 2007 10:25:45 +0200
> > > "Michal Piotrowski"
> >
> > > Please reproduce with a more recent kernel?
> >
> > Um, I thought 2.6.23rc WAS pretty recent. :-)
> >
> > I'll check if there is something newer in the
> > repository now.
> >
>
> What is the chip version? Please send console log:
> "dmesg | grep sky2"
>
>
> --
> Stephen Hemminger <shemminger@linux-foundation.org>
>
Ah ... details.
Machine:
Dell Optiplex 745
Kernel:
2.6.23-rc3 #1 SMP Tue Aug 14 19:44:07 EDT 2007 x86_64
x86_64 x86_64 GNU/Linux
Card:
D-Link DGE-550SX
# dmesg | grep sky2
sky2 0000:04:00.0: v1.16 addr 0xdf9fc000 irq 16
Yukon-XL (0xb3) rev 3
sky2 eth1: addr 00:17:9a:73:87:60
sky2 eth1: enabling interface
sky2 eth1: ram buffer 96K
sky2 eth1: Link is up at 1000 Mbps, full duplex, flow
control none
^ permalink raw reply
* RE: skb_pull_rcsum - Fatal exception in interrupt
From: Brandeburg, Jesse @ 2007-08-20 16:21 UTC (permalink / raw)
To: Alan J. Wylie; +Cc: e1000-devel, Linux Network Development list
In-Reply-To: <18115.5803.588114.372952@wylie.me.uk>
Alan J. Wylie wrote:
> We have been shipping Linux based servers to customers for several
> years now, with few problems. Recently, however, a single customer has
> been seeing kernel panics. Unfortunately, the customer is about 200
> miles away, so physical access is limited. There are two ethernet
> interfaces, one should be plugged into a local RFC1918 network, the
> other is connected to the internet. If eth0 is plugged into the local
> network, a short time later the system panics.
>
> Hardware: Intel S5000VSA server
>
> Network cards: Intel e1000
> Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper)
Hi Alan, I work on the team that supports e1000, I'd be interested in
seeing the dmesg output from the machine before it crashes, maybe you
can add that to your web collection of data below?
Many of the 5000 series machines have BMCs; it's possible that you could
set up the remote management so you could reboot it remotely, but that
may not be worth the extra effort. It could, however, give you the
ability to have a serial console over ethernet, which would get us the
full panic message, but see below.
> # CONFIG_E1000_DISABLE_PACKET_SPLIT is not set
Can you try setting CONFIG_E1000_DISABLE_PACKET_SPLIT=y?
This will prevent the driver from splitting the header from the packet
data, which could be exacerbating this problem.
It's not immediately obvious whether this is a kernel or driver problem.
I hope you don't mind that I cc'd e1000-devel, since this is possibly relevant
to other e1000 users and developers.
> We shipped a second system, and this displayed identical symptoms. We
> have tested with several recent 2.6 kernels, including
>
> 2.6.22
> 2.6.17.14
> 2.6.20.15
>
> all of which crash.
>
> We have a couple of photographs showing the tail end of the messages
> on the screen.
>
> The last two lines are:
>
> EIP: [<c02b6fb2>] skb_pull_rcsum+0x6d/0x71 SS:ESP 09068:c03e1ea4
> Kernel panic - not syncing: Fatal exception in interrupt
Can you boot with vga=0x318 appended to the kernel options? This might
help you get more on the screen. You could also look into netconsole, but
because this is a networking crash I don't know whether you'll get data
out of netconsole, and I don't know whether you can use netconsole over
the 'net', as I've only used it for local logging.
Jesse
* Re: 2.6.23-rc3 and SKY2 driver issue
From: Stephen Hemminger @ 2007-08-20 16:02 UTC (permalink / raw)
To: James Corey; +Cc: Michal Piotrowski, linux-kernel, Netdev
In-Reply-To: <370212.9287.qm@web90407.mail.mud.yahoo.com>
On Mon, 20 Aug 2007 08:42:21 -0700 (PDT)
James Corey <ploversegg@yahoo.com> wrote:
>
> --- Stephen Hemminger
> <shemminger@linux-foundation.org> wrote:
>
> > On Thu, 16 Aug 2007 10:25:45 +0200
> > "Michal Piotrowski"
>
> > Please reproduce with a more recent kernel?
>
> Um, I thought 2.6.23rc WAS pretty recent. :-)
>
> I'll check if there is something newer in the
> repository now.
>
What is the chip version? Please send console log: "dmesg | grep sky2"
--
Stephen Hemminger <shemminger@linux-foundation.org>
* [1/2] 2.6.23-rc3: known regressions with patches v2
From: Michal Piotrowski @ 2007-08-20 16:02 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, LKML, Gabriel C, Satyam Sharma, Vitaly Bordug,
Pierre Ossman, Christian Casteyde, Alan Stern, linux-mtd,
David Woodhouse, Ingo Molnar, Netdev, Stephen Hemminger,
Daniel K., Florian Lohoff
Hi all,
Here is a list of some known regressions in 2.6.23-rc3
with patches available.
Feel free to add new regressions/remove fixed etc.
http://kernelnewbies.org/known_regressions
List of Aces
Name                Regressions fixed since 21-Jun-2007
Adrian Bunk         9
Andi Kleen          5
Linus Torvalds      5
Andrew Morton       4
Al Viro             3
Cornelia Huck       3
Jens Axboe          3
Tejun Heo           3
Unclassified
Subject : Oops while modprobing phy fixed module
References : http://lkml.org/lkml/2007/7/14/63
Last known good : ?
Submitter : Gabriel C <nix.or.die@googlemail.com>
Caused-By : ?
Handled-By : Satyam Sharma <satyam.sharma@gmail.com>
Vitaly Bordug <vitb@kernel.crashing.org>
Patch1 : http://lkml.org/lkml/2007/7/18/506
Status : patch available
MMC
Subject : Unable to access memory card reader anymore
References : http://bugzilla.kernel.org/show_bug.cgi?id=8885
Last known good : ?
Submitter : Christian Casteyde <casteyde.christian@free.fr>
Caused-By : ?
Handled-By : Alan Stern <stern@rowland.harvard.edu>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=12438
Status : patch available
MTD
Subject : error: implicit declaration of function 'cfi_interleave'
References : http://lkml.org/lkml/2007/8/6/272
Last known good : ?
Submitter : Ingo Molnar <mingo@elte.hu>
Caused-By : ?
Handled-By : David Woodhouse <dwmw2@infradead.org>
Patch : http://lkml.org/lkml/2007/8/9/586
Status : patch available
Networking
Subject : BUG: when using 'brctl stp'
References : http://lkml.org/lkml/2007/8/10/441
Last known good : 2.6.23-rc1
Submitter : Daniel K. <daniel@cluded.net>
Caused-By : ?
Handled-By : Stephen Hemminger <shemminger@osdl.org>
Status : fix applied by David Miller
Subject : sky2 boot crash in sky2_mac_intr
References : http://lkml.org/lkml/2007/7/24/91
Last known good : ?
Submitter : Florian Lohoff <flo@rfc822.org>
Caused-By :
Handled-By : Stephen Hemminger <shemminger@osdl.org>
Patch : http://marc.info/?l=linux-netdev&m=118651402523966&w=2
Status : patch available
Regards,
Michal
--
LOG
http://www.stardust.webpages.pl/log/
* Re: [3/4] 2.6.23-rc3: known regressions v2
From: Michal Piotrowski @ 2007-08-20 16:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, LKML, Netdev, Stephen Hemminger, Thomas Meyer,
Uwe Bugla, Shish, Karl Meyer, Francois Romieu, Parag Warudkar,
Zachary Amsden
In-Reply-To: <46C9B6C7.5070704@googlemail.com>
Hi all,
Here is a list of some known regressions in 2.6.23-rc3.
Feel free to add new regressions/remove fixed etc.
http://kernelnewbies.org/known_regressions
List of Aces
Name                Regressions fixed since 21-Jun-2007
Adrian Bunk         9
Andi Kleen          5
Linus Torvalds      5
Andrew Morton       4
Al Viro             3
Cornelia Huck       3
Jens Axboe          3
Tejun Heo           3
Networking
Subject : NETDEV WATCHDOG: eth0: transmit timed out
References : http://lkml.org/lkml/2007/8/13/737
Last known good : ?
Submitter : Karl Meyer <adhocrocker@gmail.com>
Caused-By : ?
Handled-By : Francois Romieu <romieu@fr.zoreil.com>
Status : problem is being debugged
Subject : Weird network problems with 2.6.23-rc2
References : http://lkml.org/lkml/2007/8/11/40
Last known good : ?
Submitter : Shish <shish@shishnet.org>
Caused-By : ?
Handled-By : ?
Status : unknown
Subject : IP v4 routing is broken
References : http://www.stardust.webpages.pl/files/tbf/bugs/bug_report01.txt
Last known good : 2.6.22-git2
Submitter : Uwe Bugla <uwe.bugla@gmx.de>
Caused-By : ?
Handled-By : ?
Status : unknown
Subject : New wake ups from sky2
References : http://lkml.org/lkml/2007/7/20/386
Last known good : ?
Submitter : Thomas Meyer <thomas@m3y3r.de>
Caused-By : Stephen Hemminger <shemminger@osdl.org>
commit eb35cf60e462491249166182e3e755d3d5d91a28
Handled-By : Stephen Hemminger <shemminger@osdl.org>
Status : unknown
Virtualization
Subject : CONFIG_VMI broken
References : http://lkml.org/lkml/2007/8/14/203
Last known good : ?
Submitter : Parag Warudkar <parag.warudkar@gmail.com>
Caused-By : ?
Handled-By : Zachary Amsden <zach@vmware.com>
Status : problem is being debugged
Regards,
Michal
--
LOG
http://www.stardust.webpages.pl/log/
* [PATCH V4 10/10] net/bonding: Destroy bonding master when last slave is gone
From: Moni Shoua @ 2007-08-20 15:58 UTC (permalink / raw)
To: rdreier, davem, fubar; +Cc: netdev, general
In-Reply-To: <46C9B474.5020202@voltaire.com>
When bonding enslaves non-Ethernet devices, it takes pointers to functions
in the module that owns the slaves. In this case it becomes unsafe
to keep the bonding master registered after the last slave has been released,
because we don't know whether the pointers are still valid. Destroying the
bond when slave_cnt is zero ensures that these functions are not used anymore.
Signed-off-by: Moni Shoua <monis@voltaire.com>
---
drivers/net/bonding/bond_main.c | 45 +++++++++++++++++++++++++++++++++++++++-
drivers/net/bonding/bonding.h | 3 ++
2 files changed, 47 insertions(+), 1 deletion(-)
Index: net-2.6/drivers/net/bonding/bond_main.c
===================================================================
--- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-20 14:43:17.123702132 +0300
+++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-20 14:43:17.850571535 +0300
@@ -1256,6 +1256,7 @@ static int bond_compute_features(struct
static void bond_setup_by_slave(struct net_device *bond_dev,
struct net_device *slave_dev)
{
+ struct bonding *bond = bond_dev->priv;
bond_dev->hard_header = slave_dev->hard_header;
bond_dev->rebuild_header = slave_dev->rebuild_header;
bond_dev->hard_header_cache = slave_dev->hard_header_cache;
@@ -1270,6 +1271,7 @@ static void bond_setup_by_slave(struct n
memcpy(bond_dev->broadcast, slave_dev->broadcast,
slave_dev->addr_len);
+ bond->setup_by_slave = 1;
}
/* enslave device <slave> to bond device <master> */
@@ -1838,6 +1840,35 @@ int bond_release(struct net_device *bond
}
/*
+* Destroy a bonding device.
+* Must be under rtnl_lock when this function is called.
+*/
+void bond_destroy(struct bonding *bond)
+{
+ bond_deinit(bond->dev);
+ bond_destroy_sysfs_entry(bond);
+ unregister_netdevice(bond->dev);
+}
+
+/*
+* First release a slave and then destroy the bond if no more slaves are left.
+* Must be under rtnl_lock when this function is called.
+*/
+int bond_release_and_destroy(struct net_device *bond_dev, struct net_device *slave_dev)
+{
+ struct bonding *bond = bond_dev->priv;
+ int ret;
+
+ ret = bond_release(bond_dev, slave_dev);
+ if ((ret == 0) && (bond->slave_cnt == 0)) {
+ printk(KERN_INFO DRV_NAME " %s: destroying bond.\n",
+ bond_dev->name);
+ bond_destroy(bond);
+ }
+ return ret;
+}
+
+/*
* This function releases all slaves.
*/
static int bond_release_all(struct net_device *bond_dev)
@@ -3322,7 +3353,11 @@ static int bond_slave_netdev_event(unsig
switch (event) {
case NETDEV_UNREGISTER:
if (bond_dev) {
- bond_release(bond_dev, slave_dev);
+ dprintk("slave %s unregisters\n", slave_dev->name);
+ if (bond->setup_by_slave)
+ bond_release_and_destroy(bond_dev, slave_dev);
+ else
+ bond_release(bond_dev, slave_dev);
}
break;
case NETDEV_CHANGE:
@@ -3331,6 +3366,13 @@ static int bond_slave_netdev_event(unsig
* sets up a hierarchical bond, then rmmod's
* one of the slave bonding devices?
*/
+ if (slave_dev->priv_flags & IFF_SLAVE_DETACH) {
+ dprintk("slave %s detaching\n", slave_dev->name);
+ if (bond->setup_by_slave)
+ bond_release_and_destroy(bond_dev, slave_dev);
+ else
+ bond_release(bond_dev, slave_dev);
+ }
break;
case NETDEV_DOWN:
/*
@@ -4311,6 +4353,7 @@ static int bond_init(struct net_device *
bond->primary_slave = NULL;
bond->dev = bond_dev;
bond->send_grat_arp = 0;
+ bond->setup_by_slave = 0;
INIT_LIST_HEAD(&bond->vlan_list);
/* Initialize the device entry points */
Index: net-2.6/drivers/net/bonding/bonding.h
===================================================================
--- net-2.6.orig/drivers/net/bonding/bonding.h 2007-08-20 14:43:17.123702132 +0300
+++ net-2.6/drivers/net/bonding/bonding.h 2007-08-20 14:47:52.845180870 +0300
@@ -188,6 +188,7 @@ struct bonding {
s8 kill_timers;
s8 do_set_mac_addr;
s8 send_grat_arp;
+ s8 setup_by_slave;
struct net_device_stats stats;
#ifdef CONFIG_PROC_FS
struct proc_dir_entry *proc_entry;
@@ -295,6 +296,8 @@ static inline void bond_unset_master_alb
struct vlan_entry *bond_next_vlan(struct bonding *bond, struct vlan_entry *curr);
int bond_dev_queue_xmit(struct bonding *bond, struct sk_buff *skb, struct net_device *slave_dev);
int bond_create(char *name, struct bond_params *params, struct bonding **newbond);
+void bond_destroy(struct bonding *bond);
+int bond_release_and_destroy(struct net_device *bond_dev, struct net_device *slave_dev);
void bond_deinit(struct net_device *bond_dev);
int bond_create_sysfs(void);
void bond_destroy_sysfs(void);
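For readers following the control flow of this patch, here is a minimal user-space sketch of the release-then-destroy logic. It is not kernel code: `struct toy_bond`, `toy_release()`, `toy_release_and_destroy()` and `toy_slave_gone()` are hypothetical stand-ins for `struct bonding`, `bond_release()`, `bond_release_and_destroy()` and the dispatch added to the notifier and sysfs paths. Locking (rtnl_lock) and the actual unregistering are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct bonding; not the kernel structure. */
struct toy_bond {
	int slave_cnt;        /* number of enslaved devices */
	bool setup_by_slave;  /* set when the bond copied callbacks from a slave */
	bool registered;      /* false once the bond has been destroyed */
};

/* Models bond_release(): drop one slave, return 0 on success. */
int toy_release(struct toy_bond *bond)
{
	if (bond->slave_cnt == 0)
		return -1;    /* nothing to release */
	bond->slave_cnt--;
	return 0;
}

/* Models bond_release_and_destroy(): destroy the bond after the last slave. */
int toy_release_and_destroy(struct toy_bond *bond)
{
	int ret;

	assert(bond->registered);
	ret = toy_release(bond);
	if (ret == 0 && bond->slave_cnt == 0) {
		printf("destroying bond\n");
		bond->registered = false;   /* models bond_destroy() */
	}
	return ret;
}

/* Models the choice made where a slave goes away. */
int toy_slave_gone(struct toy_bond *bond)
{
	if (bond->setup_by_slave)
		return toy_release_and_destroy(bond);
	return toy_release(bond);
}
```

In this model an Ethernet-style bond (setup_by_slave clear) survives with zero slaves, while a bond that borrowed callbacks from its slaves is torn down as soon as the last one is released.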
* [ofa-general] [PATCH V4 9/10] net/bonding: Delay sending of gratuitous ARP to avoid failure
From: Moni Shoua @ 2007-08-20 15:53 UTC (permalink / raw)
To: rdreier, davem, fubar; +Cc: netdev, general
In-Reply-To: <46C9B474.5020202@voltaire.com>
Delay sending the gratuitous ARP while the __LINK_STATE_LINKWATCH_PENDING bit
in the dev->state field of the active slave is set. This improves the chances
that the ARP packet is actually transmitted.
Signed-off-by: Moni Shoua <monis@voltaire.com>
---
drivers/net/bonding/bond_main.c | 24 +++++++++++++++++++++---
drivers/net/bonding/bonding.h | 1 +
2 files changed, 22 insertions(+), 3 deletions(-)
Index: net-2.6/drivers/net/bonding/bond_main.c
===================================================================
--- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-15 10:56:33.000000000 +0300
+++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 11:04:37.221123652 +0300
@@ -1102,8 +1102,14 @@ void bond_change_active_slave(struct bon
if (new_active && !bond->do_set_mac_addr)
memcpy(bond->dev->dev_addr, new_active->dev->dev_addr,
new_active->dev->addr_len);
-
- bond_send_gratuitous_arp(bond);
+ if (bond->curr_active_slave &&
+ test_bit(__LINK_STATE_LINKWATCH_PENDING,
+ &bond->curr_active_slave->dev->state)) {
+ dprintk("delaying gratuitous arp on %s\n",
+ bond->curr_active_slave->dev->name);
+ bond->send_grat_arp = 1;
+ } else
+ bond_send_gratuitous_arp(bond);
}
}
@@ -2083,6 +2089,17 @@ void bond_mii_monitor(struct net_device
* program could monitor the link itself if needed.
*/
+ if (bond->send_grat_arp) {
+ if (bond->curr_active_slave && test_bit(__LINK_STATE_LINKWATCH_PENDING,
+ &bond->curr_active_slave->dev->state))
+ dprintk("Needs to send gratuitous arp but not yet\n");
+ else {
+ dprintk("sending delayed gratuitous arp on %s\n",
+ bond->curr_active_slave->dev->name);
+ bond_send_gratuitous_arp(bond);
+ bond->send_grat_arp = 0;
+ }
+ }
read_lock(&bond->curr_slave_lock);
oldcurrent = bond->curr_active_slave;
read_unlock(&bond->curr_slave_lock);
@@ -2484,7 +2501,7 @@ static void bond_send_gratuitous_arp(str
if (bond->master_ip) {
bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip,
- bond->master_ip, 0);
+ bond->master_ip, 0);
}
list_for_each_entry(vlan, &bond->vlan_list, vlan_list) {
@@ -4293,6 +4310,7 @@ static int bond_init(struct net_device *
bond->current_arp_slave = NULL;
bond->primary_slave = NULL;
bond->dev = bond_dev;
+ bond->send_grat_arp = 0;
INIT_LIST_HEAD(&bond->vlan_list);
/* Initialize the device entry points */
Index: net-2.6/drivers/net/bonding/bonding.h
===================================================================
--- net-2.6.orig/drivers/net/bonding/bonding.h 2007-08-15 10:56:33.000000000 +0300
+++ net-2.6/drivers/net/bonding/bonding.h 2007-08-15 11:05:41.516451497 +0300
@@ -187,6 +187,7 @@ struct bonding {
struct timer_list arp_timer;
s8 kill_timers;
s8 do_set_mac_addr;
+ s8 send_grat_arp;
struct net_device_stats stats;
#ifdef CONFIG_PROC_FS
struct proc_dir_entry *proc_entry;
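The deferral scheme in this patch can be sketched in user space as follows. This is only a model under stated assumptions: `struct toy_slave`, `struct toy_bond`, `toy_change_active_slave()` and `toy_monitor_tick()` are hypothetical names, the `linkwatch_pending` flag stands in for the __LINK_STATE_LINKWATCH_PENDING bit, and sending an ARP is reduced to incrementing a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel structures. */
struct toy_slave {
	bool linkwatch_pending;   /* models __LINK_STATE_LINKWATCH_PENDING */
};

struct toy_bond {
	struct toy_slave *curr_active_slave;
	bool send_grat_arp;       /* true while an ARP is owed */
	int arps_sent;            /* counts "transmitted" gratuitous ARPs */
};

/* True while the active slave still has a link event queued. */
bool toy_link_event_pending(const struct toy_bond *bond)
{
	return bond->curr_active_slave != NULL &&
	       bond->curr_active_slave->linkwatch_pending;
}

/* Models the hunk in bond_change_active_slave(): send now or defer. */
void toy_change_active_slave(struct toy_bond *bond)
{
	if (toy_link_event_pending(bond))
		bond->send_grat_arp = true;
	else
		bond->arps_sent++;
}

/* Models the hunk in bond_mii_monitor(): retry on every monitor pass. */
void toy_monitor_tick(struct toy_bond *bond)
{
	if (!bond->send_grat_arp)
		return;
	if (toy_link_event_pending(bond))
		return;               /* still too early, try again next tick */
	bond->arps_sent++;
	bond->send_grat_arp = false;
}
```

The point of the design is that the ARP is never dropped: it is either sent immediately or remembered in a flag that the periodic monitor clears once the link event has been processed.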
* [PATCH V4 8/10] net/bonding: Handle wrong assumptions that slave is always an Ethernet device
From: Moni Shoua @ 2007-08-20 15:52 UTC (permalink / raw)
To: rdreier, davem, fubar; +Cc: netdev, general
In-Reply-To: <46C9B474.5020202@voltaire.com>
bonding sometimes uses Ethernet constants (such as MTU and address length) which
are not appropriate when it enslaves non-Ethernet devices (such as InfiniBand).
Signed-off-by: Moni Shoua <monis@voltaire.com>
---
drivers/net/bonding/bond_main.c | 3 ++-
drivers/net/bonding/bond_sysfs.c | 19 +++++++++++++------
drivers/net/bonding/bonding.h | 1 +
3 files changed, 16 insertions(+), 7 deletions(-)
Index: net-2.6/drivers/net/bonding/bond_main.c
===================================================================
--- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-15 10:55:48.000000000 +0300
+++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-20 14:29:11.911298577 +0300
@@ -1224,7 +1224,8 @@ static int bond_compute_features(struct
struct slave *slave;
struct net_device *bond_dev = bond->dev;
unsigned long features = bond_dev->features;
- unsigned short max_hard_header_len = ETH_HLEN;
+ unsigned short max_hard_header_len = max((u16)ETH_HLEN,
+ bond_dev->hard_header_len);
int i;
features &= ~(NETIF_F_ALL_CSUM | BOND_VLAN_FEATURES);
Index: net-2.6/drivers/net/bonding/bond_sysfs.c
===================================================================
--- net-2.6.orig/drivers/net/bonding/bond_sysfs.c 2007-08-15 10:55:48.000000000 +0300
+++ net-2.6/drivers/net/bonding/bond_sysfs.c 2007-08-15 12:14:41.152469089 +0300
@@ -164,9 +164,7 @@ static ssize_t bonding_store_bonds(struc
printk(KERN_INFO DRV_NAME
": %s is being deleted...\n",
bond->dev->name);
- bond_deinit(bond->dev);
- bond_destroy_sysfs_entry(bond);
- unregister_netdevice(bond->dev);
+ bond_destroy(bond);
rtnl_unlock();
goto out;
}
@@ -260,6 +258,7 @@ static ssize_t bonding_store_slaves(stru
char command[IFNAMSIZ + 1] = { 0, };
char *ifname;
int i, res, found, ret = count;
+ u32 original_mtu;
struct slave *slave;
struct net_device *dev = NULL;
struct bonding *bond = to_bond(d);
@@ -325,6 +324,7 @@ static ssize_t bonding_store_slaves(stru
}
/* Set the slave's MTU to match the bond */
+ original_mtu = dev->mtu;
if (dev->mtu != bond->dev->mtu) {
if (dev->change_mtu) {
res = dev->change_mtu(dev,
@@ -339,6 +339,9 @@ static ssize_t bonding_store_slaves(stru
}
rtnl_lock();
res = bond_enslave(bond->dev, dev);
+ bond_for_each_slave(bond, slave, i)
+ if (strnicmp(slave->dev->name, ifname, IFNAMSIZ) == 0)
+ slave->original_mtu = original_mtu;
rtnl_unlock();
if (res) {
ret = res;
@@ -351,13 +354,17 @@ static ssize_t bonding_store_slaves(stru
bond_for_each_slave(bond, slave, i)
if (strnicmp(slave->dev->name, ifname, IFNAMSIZ) == 0) {
dev = slave->dev;
+ original_mtu = slave->original_mtu;
break;
}
if (dev) {
printk(KERN_INFO DRV_NAME ": %s: Removing slave %s\n",
bond->dev->name, dev->name);
rtnl_lock();
- res = bond_release(bond->dev, dev);
+ if (bond->setup_by_slave)
+ res = bond_release_and_destroy(bond->dev, dev);
+ else
+ res = bond_release(bond->dev, dev);
rtnl_unlock();
if (res) {
ret = res;
@@ -365,9 +372,9 @@ static ssize_t bonding_store_slaves(stru
}
/* set the slave MTU to the default */
if (dev->change_mtu) {
- dev->change_mtu(dev, 1500);
+ dev->change_mtu(dev, original_mtu);
} else {
- dev->mtu = 1500;
+ dev->mtu = original_mtu;
}
}
else {
Index: net-2.6/drivers/net/bonding/bonding.h
===================================================================
--- net-2.6.orig/drivers/net/bonding/bonding.h 2007-08-15 10:55:34.000000000 +0300
+++ net-2.6/drivers/net/bonding/bonding.h 2007-08-20 14:29:11.912298402 +0300
@@ -156,6 +156,7 @@ struct slave {
s8 link; /* one of BOND_LINK_XXXX */
s8 state; /* one of BOND_STATE_XXXX */
u32 original_flags;
+ u32 original_mtu;
u32 link_failure_count;
u16 speed;
u8 duplex;
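The MTU handling in this patch replaces the hard-coded 1500 with a value remembered at enslave time. A minimal user-space sketch of that idea, with hypothetical names (`struct toy_dev`, `struct toy_slave`, `toy_enslave()`, `toy_release()`) and without the change_mtu() callback or any locking:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel structures. */
struct toy_dev {
	unsigned int mtu;
};

struct toy_slave {
	struct toy_dev *dev;
	unsigned int original_mtu;   /* MTU the device had before enslavement */
};

/* Remember the device's own MTU, then align it with the bond's MTU. */
void toy_enslave(struct toy_slave *slave, struct toy_dev *dev,
		 unsigned int bond_mtu)
{
	assert(dev != NULL);
	slave->dev = dev;
	slave->original_mtu = dev->mtu;
	dev->mtu = bond_mtu;
}

/* Restore the remembered MTU instead of assuming the Ethernet default. */
void toy_release(struct toy_slave *slave)
{
	assert(slave->dev != NULL);
	slave->dev->mtu = slave->original_mtu;
	slave->dev = NULL;
}
```

With a device whose native MTU is not 1500, releasing it puts back its own value rather than the Ethernet default.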
* [PATCH V4 7/10] net/bonding: Enable IP multicast for bonding IPoIB devices
From: Moni Shoua @ 2007-08-20 15:51 UTC (permalink / raw)
To: rdreier, davem, fubar; +Cc: netdev, general
In-Reply-To: <46C9B474.5020202@voltaire.com>
Allow enslaving devices when the bonding device is not up. In the discussion
of the previous post this seemed to be the cleanest way to go, and it is not
expected to cause instabilities.
Normally, the bonding device is UP before any enslavement takes place.
Once a netdevice is UP, the network stack has it join some multicast groups
(e.g. the all-hosts group 224.0.0.1). Since ether_setup() has set the bonding
device type to ARPHRD_ETHER and the address length to ETH_ALEN, the net core
code computes a wrong multicast link address: ip_eth_mc_map() is called for
these early joins, whereas for multicast joins taking place after the
enslavement another ip_xxx_mc_map() is called (e.g. ip_ib_mc_map() when the
bond type is ARPHRD_INFINIBAND).
Signed-off-by: Moni Shoua <monis@voltaire.com>
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
---
drivers/net/bonding/bond_main.c | 5 +++--
drivers/net/bonding/bond_sysfs.c | 6 ++----
2 files changed, 5 insertions(+), 6 deletions(-)
Index: net-2.6/drivers/net/bonding/bond_main.c
===================================================================
--- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-15 10:54:41.000000000 +0300
+++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 10:55:48.431862446 +0300
@@ -1285,8 +1285,9 @@ int bond_enslave(struct net_device *bond
/* bond must be initialized by bond_open() before enslaving */
if (!(bond_dev->flags & IFF_UP)) {
- dprintk("Error, master_dev is not up\n");
- return -EPERM;
+ printk(KERN_WARNING DRV_NAME
+ " %s: master_dev is not up in bond_enslave\n",
+ bond_dev->name);
}
/* already enslaved */
Index: net-2.6/drivers/net/bonding/bond_sysfs.c
===================================================================
--- net-2.6.orig/drivers/net/bonding/bond_sysfs.c 2007-08-15 10:08:58.000000000 +0300
+++ net-2.6/drivers/net/bonding/bond_sysfs.c 2007-08-15 10:55:48.432862269 +0300
@@ -266,11 +266,9 @@ static ssize_t bonding_store_slaves(stru
/* Quick sanity check -- is the bond interface up? */
if (!(bond->dev->flags & IFF_UP)) {
- printk(KERN_ERR DRV_NAME
- ": %s: Unable to update slaves because interface is down.\n",
+ printk(KERN_WARNING DRV_NAME
+ ": %s: doing slave updates when interface is down.\n",
bond->dev->name);
- ret = -EPERM;
- goto out;
}
/* Note: We can't hold bond->lock here, as bond_create grabs it. */
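To illustrate why the early joins get the wrong link address, here is a user-space sketch of the standard IPv4-to-Ethernet multicast mapping (prefix 01:00:5e followed by the low 23 bits of the group address), which is what an Ethernet-typed device uses. The function name `toy_eth_mc_map()` is hypothetical and takes the group address in host byte order. The result is a 6-byte Ethernet address, which cannot be a valid link-layer address for an IPoIB device, whose hardware address is 20 bytes long; the InfiniBand mapping is different and is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map an IPv4 multicast group (host byte order) to its Ethernet
 * multicast MAC address: 01:00:5e plus the low 23 bits of the group.
 * Hypothetical user-space helper, written for illustration only.
 */
void toy_eth_mc_map(uint32_t group, unsigned char mac[6])
{
	/* 224.0.0.0/4: the top four bits of a multicast address are 1110 */
	assert((group >> 28) == 0xE);

	mac[0] = 0x01;
	mac[1] = 0x00;
	mac[2] = 0x5e;
	mac[3] = (group >> 16) & 0x7F;   /* only 7 bits of the second octet */
	mac[4] = (group >> 8) & 0xFF;
	mac[5] = group & 0xFF;
}
```

For the all-hosts group 224.0.0.1 this yields 01:00:5e:00:00:01.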
* [PATCH V4 6/10] net/bonding: Enable bonding to enslave netdevices not supporting set_mac_address()
From: Moni Shoua @ 2007-08-20 15:49 UTC (permalink / raw)
To: rdreier, davem, fubar; +Cc: netdev, general
In-Reply-To: <46C9B474.5020202@voltaire.com>
This patch allows enslaving netdevices that do not support
the set_mac_address() function. In that case the bond MAC address is that
of the active slave, and remote peers are notified of the MAC address
(neighbour) change by the gratuitous ARP that bonding sends when fail-over
occurs (this is already done by the bonding code).
Signed-off-by: Moni Shoua <monis@voltaire.com>
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
---
drivers/net/bonding/bond_main.c | 87 +++++++++++++++++++++++++++-------------
drivers/net/bonding/bonding.h | 1
2 files changed, 60 insertions(+), 28 deletions(-)
Index: net-2.6/drivers/net/bonding/bond_main.c
===================================================================
--- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-08-15 10:54:13.000000000 +0300
+++ net-2.6/drivers/net/bonding/bond_main.c 2007-08-15 10:54:41.971632881 +0300
@@ -1095,6 +1095,14 @@ void bond_change_active_slave(struct bon
if (new_active) {
bond_set_slave_active_flags(new_active);
}
+
+ /* when bonding does not set the slave MAC address, the bond MAC
+ * address is the one of the active slave.
+ */
+ if (new_active && !bond->do_set_mac_addr)
+ memcpy(bond->dev->dev_addr, new_active->dev->dev_addr,
+ new_active->dev->addr_len);
+
bond_send_gratuitous_arp(bond);
}
}
@@ -1351,13 +1359,22 @@ int bond_enslave(struct net_device *bond
}
if (slave_dev->set_mac_address == NULL) {
- printk(KERN_ERR DRV_NAME
- ": %s: Error: The slave device you specified does "
- "not support setting the MAC address. "
- "Your kernel likely does not support slave "
- "devices.\n", bond_dev->name);
- res = -EOPNOTSUPP;
- goto err_undo_flags;
+ if (bond->slave_cnt == 0) {
+ printk(KERN_WARNING DRV_NAME
+ ": %s: Warning: The first slave device you "
+ "specified does not support setting the MAC "
+ "address. This bond MAC address would be that "
+ "of the active slave.\n", bond_dev->name);
+ bond->do_set_mac_addr = 0;
+ } else if (bond->do_set_mac_addr) {
+ printk(KERN_ERR DRV_NAME
+ ": %s: Error: The slave device you specified "
+ "does not support setting the MAC address, "
+ "but this bond uses this practice.\n",
+ bond_dev->name);
+ res = -EOPNOTSUPP;
+ goto err_undo_flags;
+ }
}
new_slave = kzalloc(sizeof(struct slave), GFP_KERNEL);
@@ -1378,16 +1395,18 @@ int bond_enslave(struct net_device *bond
*/
memcpy(new_slave->perm_hwaddr, slave_dev->dev_addr, ETH_ALEN);
- /*
- * Set slave to master's mac address. The application already
- * set the master's mac address to that of the first slave
- */
- memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len);
- addr.sa_family = slave_dev->type;
- res = dev_set_mac_address(slave_dev, &addr);
- if (res) {
- dprintk("Error %d calling set_mac_address\n", res);
- goto err_free;
+ if (bond->do_set_mac_addr) {
+ /*
+ * Set slave to master's mac address. The application already
+ * set the master's mac address to that of the first slave
+ */
+ memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len);
+ addr.sa_family = slave_dev->type;
+ res = dev_set_mac_address(slave_dev, &addr);
+ if (res) {
+ dprintk("Error %d calling set_mac_address\n", res);
+ goto err_free;
+ }
}
res = netdev_set_master(slave_dev, bond_dev);
@@ -1612,9 +1631,11 @@ err_close:
dev_close(slave_dev);
err_restore_mac:
- memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN);
- addr.sa_family = slave_dev->type;
- dev_set_mac_address(slave_dev, &addr);
+ if (bond->do_set_mac_addr) {
+ memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN);
+ addr.sa_family = slave_dev->type;
+ dev_set_mac_address(slave_dev, &addr);
+ }
err_free:
kfree(new_slave);
@@ -1792,10 +1813,12 @@ int bond_release(struct net_device *bond
/* close slave before restoring its mac address */
dev_close(slave_dev);
- /* restore original ("permanent") mac address */
- memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN);
- addr.sa_family = slave_dev->type;
- dev_set_mac_address(slave_dev, &addr);
+ if (bond->do_set_mac_addr) {
+ /* restore original ("permanent") mac address */
+ memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN);
+ addr.sa_family = slave_dev->type;
+ dev_set_mac_address(slave_dev, &addr);
+ }
slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB |
IFF_SLAVE_INACTIVE | IFF_BONDING |
@@ -1882,10 +1905,12 @@ static int bond_release_all(struct net_d
/* close slave before restoring its mac address */
dev_close(slave_dev);
- /* restore original ("permanent") mac address*/
- memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN);
- addr.sa_family = slave_dev->type;
- dev_set_mac_address(slave_dev, &addr);
+ if (bond->do_set_mac_addr) {
+ /* restore original ("permanent") mac address*/
+ memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN);
+ addr.sa_family = slave_dev->type;
+ dev_set_mac_address(slave_dev, &addr);
+ }
slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB |
IFF_SLAVE_INACTIVE);
@@ -3922,6 +3947,9 @@ static int bond_set_mac_address(struct n
dprintk("bond=%p, name=%s\n", bond, (bond_dev ? bond_dev->name : "None"));
+ if (!bond->do_set_mac_addr)
+ return -EOPNOTSUPP;
+
if (!is_valid_ether_addr(sa->sa_data)) {
return -EADDRNOTAVAIL;
}
@@ -4312,6 +4340,9 @@ static int bond_init(struct net_device *
bond_create_proc_entry(bond);
#endif
+ /* set do_set_mac_addr to true on startup */
+ bond->do_set_mac_addr = 1;
+
list_add_tail(&bond->bond_list, &bond_dev_list);
return 0;
Index: net-2.6/drivers/net/bonding/bonding.h
===================================================================
--- net-2.6.orig/drivers/net/bonding/bonding.h 2007-08-15 10:08:58.000000000 +0300
+++ net-2.6/drivers/net/bonding/bonding.h 2007-08-15 10:55:34.359354833 +0300
@@ -185,6 +185,7 @@ struct bonding {
struct timer_list mii_timer;
struct timer_list arp_timer;
s8 kill_timers;
+ s8 do_set_mac_addr;
struct net_device_stats stats;
#ifdef CONFIG_PROC_FS
struct proc_dir_entry *proc_entry;
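The address handling introduced by this patch can be modelled in user space as below. The names (`struct toy_dev`, `struct toy_bond`, `toy_change_active_slave()`, `TOY_MAX_ADDR_LEN`) are hypothetical, and the sketch covers only the fail-over step: when the bond does not program slave MAC addresses, the bond takes over the address of whichever slave becomes active.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_ADDR_LEN 32   /* illustrative upper bound, not a kernel constant */

/* Hypothetical stand-ins; not the kernel structures. */
struct toy_dev {
	unsigned char dev_addr[TOY_MAX_ADDR_LEN];
	unsigned char addr_len;
};

struct toy_bond {
	struct toy_dev dev;
	bool do_set_mac_addr;   /* true: bond programs its MAC into the slaves */
};

/*
 * Models the hunk in bond_change_active_slave(): if the bond does not set
 * slave MAC addresses, the bond address follows the new active slave.
 */
void toy_change_active_slave(struct toy_bond *bond,
			     const struct toy_dev *new_active)
{
	if (new_active && !bond->do_set_mac_addr) {
		assert(new_active->addr_len <= TOY_MAX_ADDR_LEN);
		memcpy(bond->dev.dev_addr, new_active->dev_addr,
		       new_active->addr_len);
	}
	/* The real code sends a gratuitous ARP here so peers learn the change. */
}
```

When do_set_mac_addr is true the bond address is left alone, which matches the pre-existing Ethernet behaviour where every slave is given the bond's address instead.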