* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Segher Boessenkool @ 2007-08-20 22:04 UTC (permalink / raw)
To: Chris Snook
Cc: Christoph Lameter, Paul Mackerras, heiko.carstens, horms,
linux-kernel, Paul E. McKenney, ak, netdev, cfriesen, akpm,
rpjday, Nick Piggin, linux-arch, jesper.juhl, satyam, zlynx,
schwidefsky, Herbert Xu, davem, Linus Torvalds, wensong, wjiang
In-Reply-To: <46C997B1.1010800@redhat.com>
>> Right. ROTFL... volatile actually breaks atomic_t instead of making
>> it safe. x++ becomes a register load, increment and a register store.
>> Without volatile we can increment the memory directly. It seems that
>> volatile requires that the variable is loaded into a register first
>> and then operated upon. Understandable when you think about volatile
>> being used to access memory mapped I/O registers where a RMW
>> operation could be problematic.
>
> So, if we want consistent behavior, we're pretty much screwed unless
> we use inline assembler everywhere?
Nah, this whole argument is flawed -- "without volatile" we still
*cannot* "increment the memory directly". On x86, you need a lock
prefix; on other archs, some other mechanism to make the memory
increment an *atomic* memory increment.
And no, RMW on MMIO isn't "problematic" at all, either.
An RMW op is a read op, a modify op, and a write op, all rolled
into one opcode. But three actual operations.
The advantages of asm code for atomic_{read,set} are:
1) all the other atomic ops are implemented that way already;
2) you have full control over the asm insns selected, in particular,
you can guarantee you *do* get an atomic op;
3) you don't need to use "volatile <data>" which generates
not-all-that-good code on archs like x86, and we want to get rid
of it anyway since it is problematic in many ways;
4) you don't need to use *(volatile <type>*)&<data>, which a) doesn't
exist in C; b) isn't documented or supported in GCC; c) has a recent
history of bugginess; d) _still uses volatile objects_; e) _still_
is problematic in almost all those same ways as in 3);
5) you can mix atomic and non-atomic accesses to the atomic_t, which
you cannot with the other alternatives.
The only disadvantage I know of is potentially slightly worse
instruction scheduling. This is a generic asm() problem: GCC
cannot see what actual insns are inside the asm() block.
Segher
^ permalink raw reply
* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Segher Boessenkool @ 2007-08-20 22:07 UTC (permalink / raw)
To: Chris Snook
Cc: Christoph Lameter, Paul Mackerras, heiko.carstens, Stefan Richter,
horms, Satyam Sharma, Ilpo Jarvinen, Linux Kernel Mailing List,
David Miller, Paul E. McKenney, ak, Netdev, cfriesen, rpjday,
jesper.juhl, linux-arch, Andrew Morton, zlynx, schwidefsky,
Herbert Xu, Linus Torvalds, wensong, Nick Piggin, wjiang
In-Reply-To: <46C99945.9080303@redhat.com>
> Such code generally doesn't care precisely when it gets the update,
> just that the update is atomic, and it doesn't loop forever.
Yes, it _does_ care that it gets the update _at all_, and preferably
as early as possible.
> Regardless, I'm convinced we just need to do it all in assembly.
So do you want "volatile asm" or "plain asm", for atomic_read()?
The asm version has two ways to go about it too...
Segher
^ permalink raw reply
* Re: [PATCH] spidernet: enable poll() before registering interrupts
From: Linas Vepstas @ 2007-08-20 22:20 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: cbe-oss-dev, netdev, Christian Krafft
In-Reply-To: <200707120119.12589.arnd@arndb.de>
On Thu, Jul 12, 2007 at 01:19:11AM +0200, Arnd Bergmann wrote:
> Index: linux-2.6/drivers/net/spider_net.c
Sorry, this one got lost in my mailbox. Will attend to it shortly.
--linas
^ permalink raw reply
* Re: [PATCH] spidernet: enable poll() before registering interrupts
From: Linas Vepstas @ 2007-08-20 22:29 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: cbe-oss-dev, netdev, Christian Krafft
In-Reply-To: <200707120119.12589.arnd@arndb.de>
On Thu, Jul 12, 2007 at 01:19:11AM +0200, Arnd Bergmann wrote:
> We must not call netif_poll_enable after enabling interrupts,
> because an interrupt might come in and set the __LINK_STATE_RX_SCHED
> bit before we get to clear that bit again. If that happens,
> the next call to the ->poll() function will oops.
>
> Signed-off-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
> ---
> This was found during testing with the fedora kernel,
> with all patches from netdev-2.6.git applied.
>
> It may not be the right fix, but this is currently the
> only way I can get that kernel to boot.
>
> One part I don't understand at the moment is that Christian
> Krafft reported the same problem with tg3, but that driver
> has all interrupts disabled at the device while calling
> the request_irq() function, which seems to be the best
> solution for avoiding the bug in the first place.
It apears that this patch does not apply cleanly any more,
and I think that's a good thing!
An intervening patch changed the init so that the
hardware interrupts aren't enabled until after the
request_irq, and after the poll_enable(). Thus,
it seems this pach is no longer needed, right?
I'll pursue with Kou Ishizaki, who pointed out that
I'd missed your email.
--linas
^ permalink raw reply
* Re: [PATCH] spidernet: fix interrupt reason recognition
From: Linas Vepstas @ 2007-08-20 22:38 UTC (permalink / raw)
To: Ishizaki Kou; +Cc: netdev, cbe-oss-dev, Arnd Bergmann
In-Reply-To: <20070820.221327.-1300518791.kouish@swc.toshiba.co.jp>
On Mon, Aug 20, 2007 at 10:13:27PM +0900, Ishizaki Kou wrote:
> Please apply this to 2.6.23.
I'll review and forward shortly. Kick me if you don't see a formal
reply in a few days.
> And also, please apply the following Arnd-san's patch to fix a problem
> that spidernet driver sometimes causes a BUG_ON at open.
>
> http://patchwork.ozlabs.org/cbe-oss-dev/patch?id=12211
Are you sure? This patch no longer applies cleanly, in part because
your patch "[PATCH] spidernet: improve interrupt handling"
from Mon, 09 Jul 2007 added a spider_net_enable_interrupts(card);
at the end of spider_net_open(). Because of this, it seems like
Arnd's patch is no longer needed, right?
--linas
^ permalink raw reply
* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Russell King @ 2007-08-20 22:48 UTC (permalink / raw)
To: Segher Boessenkool
Cc: Chris Snook, Christoph Lameter, Paul Mackerras, heiko.carstens,
horms, linux-kernel, Paul E. McKenney, ak, netdev, cfriesen, akpm,
rpjday, Nick Piggin, linux-arch, jesper.juhl, satyam, zlynx,
schwidefsky, Herbert Xu, davem, Linus Torvalds, wensong, wjiang
In-Reply-To: <417ebba299a7ad3c4b7a31c4f860a814@kernel.crashing.org>
On Tue, Aug 21, 2007 at 12:04:17AM +0200, Segher Boessenkool wrote:
> And no, RMW on MMIO isn't "problematic" at all, either.
>
> An RMW op is a read op, a modify op, and a write op, all rolled
> into one opcode. But three actual operations.
Maybe for some CPUs, but not all. ARM for instance can't use the
load exclusive and store exclusive instructions to MMIO space.
This means placing atomic_t or bitops into MMIO space is a definite
no-go on ARM. It breaks.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
^ permalink raw reply
* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Segher Boessenkool @ 2007-08-20 23:02 UTC (permalink / raw)
To: Russell King
Cc: Christoph Lameter, Paul Mackerras, heiko.carstens, horms,
linux-kernel, Paul E. McKenney, ak, netdev, cfriesen, akpm,
rpjday, Nick Piggin, linux-arch, jesper.juhl, satyam, zlynx,
schwidefsky, Chris Snook, Herbert Xu, davem, Linus Torvalds,
wensong, wjiang
In-Reply-To: <20070820224859.GA16162@flint.arm.linux.org.uk>
>> And no, RMW on MMIO isn't "problematic" at all, either.
>>
>> An RMW op is a read op, a modify op, and a write op, all rolled
>> into one opcode. But three actual operations.
>
> Maybe for some CPUs, but not all. ARM for instance can't use the
> load exclusive and store exclusive instructions to MMIO space.
Sure, your CPU doesn't have RMW instructions -- how to emulate
those if you don't have them is a totally different thing.
Segher
^ permalink raw reply
* Re: [PATCH] spidernet: enable poll() before registering interrupts
From: Arnd Bergmann @ 2007-08-20 23:06 UTC (permalink / raw)
To: Linas Vepstas; +Cc: cbe-oss-dev, netdev, Christian Krafft
In-Reply-To: <20070820222945.GJ4261@austin.ibm.com>
On Tuesday 21 August 2007, Linas Vepstas wrote:
>
> An intervening patch changed the init so that the
> hardware interrupts aren't enabled until after the
> request_irq, and after the poll_enable(). Thus,
> it seems this pach is no longer needed, right?
Right, the other patch that you already applied is a better fix.
Arnd <><
^ permalink raw reply
* [PATCH] DM9000: fix interface hang under load
From: Florian Westphal @ 2007-08-20 23:33 UTC (permalink / raw)
To: akpm, jgarzik; +Cc: netdev
When transferring data at full speed, the DM9000 network interface
sometimes stops sending/receiving data. Worse, ksoftirqd consumes
100% cpu and the net tx watchdog never triggers.
Fix by spin_lock_irqsave() in dm9000_start_xmit() to prevent the
interrupt handler from interfering.
Signed-off-by: Florian Westphal <fw@strlen.de>
---
Actually the comments ('Disable all interrupts, iow(db, DM9000_IMR, IMR_PAR) etc)
give the impression that the interrupt handler cannot run during dm9000_start_xmit(),
however this isn't correct (perhaps the chipset has some weird timing issues?).
The interface lockup usually occurs between 30 and 360 seconds after starting transmitting
data (netcat /dev/zero) at full speed; with this patch applied I haven't been able
to reproduce hangs yet (ran for > 2h).
FTR: This is a dm9000 on XScale-PXA255 rev 6 (ARMv5TE)/Compulab CM-x255, i.e.
a module not supported by the vanilla kernel. Tested on (patched) 2.6.18.
dm9000.c | 25 +++++++------------------
1 file changed, 7 insertions(+), 18 deletions(-)
diff --git a/drivers/net/dm9000.c b/drivers/net/dm9000.c
index c3de81b..738aa59 100644
--- a/drivers/net/dm9000.c
+++ b/drivers/net/dm9000.c
@@ -700,6 +700,7 @@ dm9000_init_dm9000(struct net_device *dev)
static int
dm9000_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
+ unsigned long flags;
board_info_t *db = (board_info_t *) dev->priv;
PRINTK3("dm9000_start_xmit\n");
@@ -707,10 +708,7 @@ dm9000_start_xmit(struct sk_buff *skb, struct net_device *dev)
if (db->tx_pkt_cnt > 1)
return 1;
- netif_stop_queue(dev);
-
- /* Disable all interrupts */
- iow(db, DM9000_IMR, IMR_PAR);
+ spin_lock_irqsave(&db->lock, flags);
/* Move data to DM9000 TX RAM */
writeb(DM9000_MWCMD, db->io_addr);
@@ -718,12 +716,9 @@ dm9000_start_xmit(struct sk_buff *skb, struct net_device *dev)
(db->outblk)(db->io_data, skb->data, skb->len);
db->stats.tx_bytes += skb->len;
+ db->tx_pkt_cnt++;
/* TX control: First packet immediately send, second packet queue */
- if (db->tx_pkt_cnt == 0) {
-
- /* First Packet */
- db->tx_pkt_cnt++;
-
+ if (db->tx_pkt_cnt == 1) {
/* Set TX length to DM9000 */
iow(db, DM9000_TXPLL, skb->len & 0xff);
iow(db, DM9000_TXPLH, (skb->len >> 8) & 0xff);
@@ -732,23 +727,17 @@ dm9000_start_xmit(struct sk_buff *skb, struct net_device *dev)
iow(db, DM9000_TCR, TCR_TXREQ); /* Cleared after TX complete */
dev->trans_start = jiffies; /* save the time stamp */
-
} else {
/* Second packet */
- db->tx_pkt_cnt++;
db->queue_pkt_len = skb->len;
+ netif_stop_queue(dev);
}
+ spin_unlock_irqrestore(&db->lock, flags);
+
/* free this SKB */
dev_kfree_skb(skb);
- /* Re-enable resource check */
- if (db->tx_pkt_cnt == 1)
- netif_wake_queue(dev);
-
- /* Re-enable interrupt */
- iow(db, DM9000_IMR, IMR_PAR | IMR_PTM | IMR_PRM);
-
return 0;
}
^ permalink raw reply related
* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Paul E. McKenney @ 2007-08-21 0:05 UTC (permalink / raw)
To: Segher Boessenkool
Cc: Russell King, Christoph Lameter, Paul Mackerras, heiko.carstens,
horms, linux-kernel, ak, netdev, cfriesen, akpm, rpjday,
Nick Piggin, linux-arch, jesper.juhl, satyam, zlynx, schwidefsky,
Chris Snook, Herbert Xu, davem, Linus Torvalds, wensong, wjiang
In-Reply-To: <2bdb04581125f22122f1d230e991ea92@kernel.crashing.org>
On Tue, Aug 21, 2007 at 01:02:01AM +0200, Segher Boessenkool wrote:
> >>And no, RMW on MMIO isn't "problematic" at all, either.
> >>
> >>An RMW op is a read op, a modify op, and a write op, all rolled
> >>into one opcode. But three actual operations.
> >
> >Maybe for some CPUs, but not all. ARM for instance can't use the
> >load exclusive and store exclusive instructions to MMIO space.
>
> Sure, your CPU doesn't have RMW instructions -- how to emulate
> those if you don't have them is a totally different thing.
I thought that ARM's load exclusive and store exclusive instructions
were its equivalent of LL and SC, which RISC machines typically use to
build atomic sequences of instructions -- and which normally cannot be
applied to MMIO space.
Thanx, Paul
^ permalink raw reply
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
From: Roland Dreier @ 2007-08-21 1:16 UTC (permalink / raw)
To: David Miller; +Cc: tom, jeff, swise, mshefty, netdev, linux-kernel, general
In-Reply-To: <20070817.234405.66176298.davem@davemloft.net>
[TSO / LRO discussion snipped -- it's not the main point so no sense
spending energy arguing about it]
> Just be realistic and accept that RDMA is a point in time solution,
> and like any other such technology takes flexibility away from users.
>
> Horizontal scaling of cpus up to huge arity cores, network devices
> using large numbers of transmit and receive queues and classification
> based queue selection, are all going to work to make things like RDMA
> even more irrelevant than they already are.
To me there is a real fundamental difference between RDMA and
traditional SOCK_STREAM / SOCK_DATAGRAM networking, namely that
messages can carry the address where they're supposed to be
delivered (what the IETF calls "direct data placement"). And on top
of that you can build one-sided operations aka put/get aka RDMA.
And direct data placement really does give you a factor of two at
least, because otherwise you're stuck receiving the data in one
buffer, looking at some of the data at least, and then figuring out
where to copy it. And memory bandwidth is if anything becoming more
valuable; maybe LRO + header splitting + page remapping tricks can get
you somewhere but as NCPUS grows then it seems the TLB shootdown cost
of page flipping is only going to get worse.
Don't get too hung up on the fact that current iWARP (RDMA over IP)
implementations are using TCP offload -- to me that is just a side
effect of doing enough processing on the NIC side of the PCI bus to be
able to do direct data placement. InfiniBand with competely different
transport, link and physical layers is one way to implement RDMA
without TCP offload and I'm sure there will be others -- eg Intel's
IOAT stuff could probably evolve to the point where you could
implement iWARP with software TCP and the data placement offloaded to
some DMA engine.
- R.
^ permalink raw reply
* Re: [PATCH 2/4 - rev2] Add new timeval_to_sec function
From: Varun Chandramohan @ 2007-08-21 3:39 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: davem, netdev, kaber, socketcan, krkumar2, varuncha
In-Reply-To: <20070820075325.43bd8f60@freepuppy.rosehill.hemminger.net>
Stephen Hemminger wrote:
> On Mon, 20 Aug 2007 13:45:36 +0530
> Varun Chandramohan <varunc@linux.vnet.ibm.com> wrote:
>
>
>> A new function for converting timeval to time_t is added in time.h. Its a common function used in different
>> places.
>>
>> Signed-off-by: Varun Chandramohan <varunc@linux.vnet.ibm.com>
>> ---
>> include/linux/time.h | 12 ++++++++++++
>> 1 files changed, 12 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/time.h b/include/linux/time.h
>> index 6a5f503..1faf65c 100644
>> --- a/include/linux/time.h
>> +++ b/include/linux/time.h
>> @@ -149,6 +149,18 @@ static inline s64 timeval_to_ns(const st
>> }
>>
>> /**
>> + * timeval_to_sec - Convert timeval to seconds
>> + * @tv: pointer to the timeval variable to be converted
>> + *
>> + * Returns the seconds representation of timeval parameter.
>> + * Note : Here we round up the value. We dont need accuracy.
>> + */
>> +static inline time_t timeval_to_sec(const struct timeval *tv)
>> +{
>> + return (tv->tv_sec + (tv->tv_usec ? 1 : 0));
>> +}
>> +
>>
>
> Why roundup? Unless there is a requirement in the standard, please just
> use the timeval seconds. In which case the inline is unneeded.
>
>
>
Thanks for the reply stephen. As you might be aware that this discussion
took place sometime ago when i posted my first patch set.
Initially it was like this:
return (tv->tv_sec + (tv->tv_usec + 500000)/1000000);
Then i got some comments from patrick and oliver. They wanted me to
round it up.
So what about rounding up with
return (tv->tv_sec + (tv->tv_usec + 999999)/1000000);
Then on second revision the above was changed to
return tv->tv_sec + (tv->tv_usec ? 1 : 0);
as it would be much faster. Since the timeval is meant for stats purpose
we decided not really bother about accuracy. My initial patch actually
took only sec value into account, but i was adviced to round up usec to
give a better o/p. Is that ok??? Or you still think we should consider
only secs?
Regards,
Varun
^ permalink raw reply
* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
From: Felix Marti @ 2007-08-21 4:21 UTC (permalink / raw)
To: Patrick Geoffray
Cc: Evgeniy Polyakov, jeff, netdev, rdreier, linux-kernel, general,
David Miller
In-Reply-To: <46C9FAB4.5020609@myri.com>
> -----Original Message-----
> From: Patrick Geoffray [mailto:patrick@myri.com]
> Sent: Monday, August 20, 2007 1:34 PM
> To: Felix Marti
> Cc: Evgeniy Polyakov; David Miller; sean.hefty@intel.com;
> netdev@vger.kernel.org; rdreier@cisco.com;
> general@lists.openfabrics.org; linux-kernel@vger.kernel.org;
> jeff@garzik.org
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
>
> Felix Marti wrote:
> > Yes, the app will take the cache hits when accessing the data.
> However,
> > the fact remains that if there is a copy in the receive path, you
> > require and additional 3x memory BW (which is very significant at
> these
> > high rates and most likely the bottleneck for most current
> systems)...
> > and somebody always has to take the cache miss be it the
copy_to_user
> or
> > the app.
>
> The cache miss is going to cost you half the memory bandwidth of a
full
> copy. If the data is already in cache, then the copy is cheaper.
>
> However, removing the copy removes the kernel from the picture on the
> receive side, so you lose demultiplexing, asynchronism, security,
> accounting, flow-control, swapping, etc. If it's ok with you to not
use
> the kernel stack, then why expect to fit in the existing
infrastructure
> anyway ?
Many of the things you're referring to are moved to the offload adapter
but from an ease of use point of view, it would be great if the user
could still collect stats the same way, i.e. netstat reports the 4-tuple
in use and other network stats. In addition, security features and
packet scheduling could be integrated so that the user configures them
the same way as the network stack.
>
> > Yes, RDMA support is there... but we could make it better and easier
> to
>
> What do you need from the kernel for RDMA support beyond HW drivers ?
A
> fast way to pin and translate user memory (ie registration). That is
> pretty much the sandbox that David referred to.
>
> Eventually, it would be useful to be able to track the VM space to
> implement a registration cache instead of using ugly hacks in user-
> space
> to hijack malloc, but this is completely independent from the net
> stack.
>
> > use. We have a problem today with port sharing and there was a
> proposal
>
> The port spaces are either totally separate and there is no issue, or
> completely identical and you should then run your connection manager
in
> user-space or fix your middlewares.
When running on an iWarp device (and hence on top of TCP) I believe that
the port space should shared and i.e. netstat reports the 4-tuple in
use.
>
> > and not for technical reasons. I believe this email threads shows in
> > detail how RDMA (a network technology) is treated as bastard child
by
> > the network folks, well at least by one of them.
>
> I don't think it's fair. This thread actually show how pushy some RDMA
> folks are about not acknowledging that the current infrastructure is
> here for a reason, and about mistaking zero-copy and RDMA.
Zero-copy and RDMA are not the same but in the context of this
discussion I referred to RDMA as a superset (zero-copy is implied).
>
> This is a similar argument than the TOE discussion, and it was
> definitively a good decision to not mess up the Linux stack with TOEs.
>
> Patrick
^ permalink raw reply
* New User Support
From: Joke-A-Day @ 2007-08-21 5:28 UTC (permalink / raw)
To: netdev
New Member,
Welcome To Joke-A-Day.
Confirmation Number: 85757736
Login ID: user7849
Temp Password ID: xe694
This Login Info will expire in 24 hours. Please Change it.
Use this link to change your Login info: http://24.242.63.49/
Welcome,
New Member Services
Joke-A-Day
^ permalink raw reply
* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Linus Torvalds @ 2007-08-21 5:46 UTC (permalink / raw)
To: Chris Snook
Cc: Nick Piggin, Satyam Sharma, Herbert Xu, Paul Mackerras,
Christoph Lameter, Ilpo Jarvinen, Paul E. McKenney,
Stefan Richter, Linux Kernel Mailing List, linux-arch, Netdev,
Andrew Morton, ak, heiko.carstens, David Miller, schwidefsky,
wensong, horms, wjiang, cfriesen, zlynx, rpjday, jesper.juhl,
segher
In-Reply-To: <46C993DF.4080400@redhat.com>
On Mon, 20 Aug 2007, Chris Snook wrote:
>
> What about barrier removal? With consistent semantics we could optimize a
> fair amount of code. Whether or not that constitutes "premature" optimization
> is open to debate, but there's no question we could reduce our register wiping
> in some places.
Why do people think that barriers are expensive? They really aren't.
Especially the regular compiler barrier is basically zero cost. Any
reasonable compiler will just flush the stuff it holds in registers that
isn't already automatic local variables, and for regular kernel code, that
tends to basically be nothing at all.
Ie a "barrier()" is likely _cheaper_ than the code generation downside
from using "volatile".
Linus
^ permalink raw reply
* Re: [NET]: Share correct feature code between bridging and bonding
From: Herbert Xu @ 2007-08-21 6:22 UTC (permalink / raw)
To: David Miller; +Cc: netdev, stable
In-Reply-To: <20070810.154817.04442540.davem@davemloft.net>
Hi:
Here's the back-port for 2.6.22.
[NET]: Share correct feature code between bridging and bonding
http://bugzilla.kernel.org/show_bug.cgi?id=8797 shows that the
bonding driver may produce bogus combinations of the checksum
flags and SG/TSO.
For example, if you bond devices with NETIF_F_HW_CSUM and
NETIF_F_IP_CSUM you'll end up with a bonding device that
has neither flag set. If both have TSO then this produces
an illegal combination.
The bridge device on the other hand has the correct code to
deal with this.
In fact, the same code can be used for both. So this patch
moves that logic into net/core/dev.c and uses it for both
bonding and bridging.
In the process I've made small adjustments such as only
setting GSO_ROBUST if at least one constituent device
supports it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 6287ffb..0af7bc8 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1233,43 +1233,31 @@ int bond_sethwaddr(struct net_device *bond_dev, struct net_device *slave_dev)
return 0;
}
-#define BOND_INTERSECT_FEATURES \
- (NETIF_F_SG | NETIF_F_ALL_CSUM | NETIF_F_TSO | NETIF_F_UFO)
+#define BOND_VLAN_FEATURES \
+ (NETIF_F_VLAN_CHALLENGED | NETIF_F_HW_VLAN_RX | NETIF_F_HW_VLAN_TX | \
+ NETIF_F_HW_VLAN_FILTER)
/*
* Compute the common dev->feature set available to all slaves. Some
- * feature bits are managed elsewhere, so preserve feature bits set on
- * master device that are not part of the examined set.
+ * feature bits are managed elsewhere, so preserve those feature bits
+ * on the master device.
*/
static int bond_compute_features(struct bonding *bond)
{
- unsigned long features = BOND_INTERSECT_FEATURES;
struct slave *slave;
struct net_device *bond_dev = bond->dev;
+ unsigned long features = bond_dev->features & ~BOND_VLAN_FEATURES;
unsigned short max_hard_header_len = ETH_HLEN;
int i;
bond_for_each_slave(bond, slave, i) {
- features &= (slave->dev->features & BOND_INTERSECT_FEATURES);
+ features = netdev_compute_features(features,
+ slave->dev->features);
if (slave->dev->hard_header_len > max_hard_header_len)
max_hard_header_len = slave->dev->hard_header_len;
}
- if ((features & NETIF_F_SG) &&
- !(features & NETIF_F_ALL_CSUM))
- features &= ~NETIF_F_SG;
-
- /*
- * features will include NETIF_F_TSO (NETIF_F_UFO) iff all
- * slave devices support NETIF_F_TSO (NETIF_F_UFO), which
- * implies that all slaves also support scatter-gather
- * (NETIF_F_SG), which implies that features also includes
- * NETIF_F_SG. So no need to check whether we have an
- * illegal combination of NETIF_F_{TSO,UFO} and
- * !NETIF_F_SG
- */
-
- features |= (bond_dev->features & ~BOND_INTERSECT_FEATURES);
+ features |= (bond_dev->features & BOND_VLAN_FEATURES);
bond_dev->features = features;
bond_dev->hard_header_len = max_hard_header_len;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3a70f55..ab210be 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1032,6 +1032,8 @@ extern void dev_seq_stop(struct seq_file *seq, void *v);
extern void linkwatch_run_queue(void);
+extern int netdev_compute_features(unsigned long all, unsigned long one);
+
static inline int net_gso_ok(int features, int gso_type)
{
int feature = gso_type << NETIF_F_GSO_SHIFT;
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 5e1892d..c326602 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -179,5 +179,6 @@ void br_dev_setup(struct net_device *dev)
dev->priv_flags = IFF_EBRIDGE;
dev->features = NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HIGHDMA |
- NETIF_F_TSO | NETIF_F_NO_CSUM | NETIF_F_GSO_ROBUST;
+ NETIF_F_GSO_SOFTWARE | NETIF_F_NO_CSUM |
+ NETIF_F_GSO_ROBUST | NETIF_F_LLTX;
}
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 849deaf..fefd7c1 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -360,35 +360,15 @@ int br_min_mtu(const struct net_bridge *br)
void br_features_recompute(struct net_bridge *br)
{
struct net_bridge_port *p;
- unsigned long features, checksum;
+ unsigned long features;
- checksum = br->feature_mask & NETIF_F_ALL_CSUM ? NETIF_F_NO_CSUM : 0;
- features = br->feature_mask & ~NETIF_F_ALL_CSUM;
+ features = br->feature_mask;
list_for_each_entry(p, &br->port_list, list) {
- unsigned long feature = p->dev->features;
-
- if (checksum & NETIF_F_NO_CSUM && !(feature & NETIF_F_NO_CSUM))
- checksum ^= NETIF_F_NO_CSUM | NETIF_F_HW_CSUM;
- if (checksum & NETIF_F_HW_CSUM && !(feature & NETIF_F_HW_CSUM))
- checksum ^= NETIF_F_HW_CSUM | NETIF_F_IP_CSUM;
- if (!(feature & NETIF_F_IP_CSUM))
- checksum = 0;
-
- if (feature & NETIF_F_GSO)
- feature |= NETIF_F_GSO_SOFTWARE;
- feature |= NETIF_F_GSO;
-
- features &= feature;
+ features = netdev_compute_features(features, p->dev->features);
}
- if (!(checksum & NETIF_F_ALL_CSUM))
- features &= ~NETIF_F_SG;
- if (!(features & NETIF_F_SG))
- features &= ~NETIF_F_GSO_MASK;
-
- br->dev->features = features | checksum | NETIF_F_LLTX |
- NETIF_F_GSO_ROBUST;
+ br->dev->features = features;
}
/* called with RTNL */
diff --git a/net/core/dev.c b/net/core/dev.c
index ee051bb..1561f61 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3635,6 +3635,44 @@ static int __init netdev_dma_register(void)
static int __init netdev_dma_register(void) { return -ENODEV; }
#endif /* CONFIG_NET_DMA */
+/**
+ * netdev_compute_feature - compute conjunction of two feature sets
+ * @all: first feature set
+ * @one: second feature set
+ *
+ * Computes a new feature set after adding a device with feature set
+ * @one to the master device with current feature set @all. Returns
+ * the new feature set.
+ */
+int netdev_compute_features(unsigned long all, unsigned long one)
+{
+ /* if device needs checksumming, downgrade to hw checksumming */
+ if (all & NETIF_F_NO_CSUM && !(one & NETIF_F_NO_CSUM))
+ all ^= NETIF_F_NO_CSUM | NETIF_F_HW_CSUM;
+
+ /* if device can't do all checksum, downgrade to ipv4 */
+ if (all & NETIF_F_HW_CSUM && !(one & NETIF_F_HW_CSUM))
+ all ^= NETIF_F_HW_CSUM | NETIF_F_IP_CSUM;
+
+ if (one & NETIF_F_GSO)
+ one |= NETIF_F_GSO_SOFTWARE;
+ one |= NETIF_F_GSO;
+
+ /* If even one device supports robust GSO, enable it for all. */
+ if (one & NETIF_F_GSO_ROBUST)
+ all |= NETIF_F_GSO_ROBUST;
+
+ all &= one | NETIF_F_LLTX;
+
+ if (!(all & NETIF_F_ALL_CSUM))
+ all &= ~NETIF_F_SG;
+ if (!(all & NETIF_F_SG))
+ all &= ~NETIF_F_GSO_MASK;
+
+ return all;
+}
+EXPORT_SYMBOL(netdev_compute_features);
+
/*
* Initialize the DEV module. At boot time this walks the device list and
* unhooks any devices that fail to initialise (normally hardware not
^ permalink raw reply related
* [patch 03/20] sky2: restore workarounds for lost interrupts
From: Greg KH @ 2007-08-21 6:53 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap,
Dave Jones, Chuck Wolber, Chris Wedgwood, Michael Krufky,
Chuck Ebbert, Domenico Andreoli, torvalds, akpm, alan, netdev,
Stephen Hemminger, Greg Kroah-Hartman
In-Reply-To: <20070821065210.GA5275@kroah.com>
[-- Attachment #1: sky2-lost-irq.patch --]
[-- Type: text/plain, Size: 1567 bytes --]
-stable review patch. If anyone has any objections, please let us know.
------------------
From: Stephen Hemminger <shemminger@linux-foundation.org>
Backport of commit c59697e06058fc2361da8cefcfa3de85ac107582
This patch restores a couple of workarounds from 2.6.16:
* restart transmit moderation timer in case it expires during IRQ routine
* default to having 10 HZ watchdog timer.
At this point it more important not to hang than to worry about the
power cost.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
---
drivers/net/sky2.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
--- a/drivers/net/sky2.c
+++ b/drivers/net/sky2.c
@@ -96,7 +96,7 @@ static int disable_msi = 0;
module_param(disable_msi, int, 0);
MODULE_PARM_DESC(disable_msi, "Disable Message Signaled Interrupt (MSI)");
-static int idle_timeout = 0;
+static int idle_timeout = 100;
module_param(idle_timeout, int, 0);
MODULE_PARM_DESC(idle_timeout, "Watchdog timer for lost interrupts (ms)");
@@ -2442,6 +2442,13 @@ static int sky2_poll(struct net_device *
work_done = sky2_status_intr(hw, work_limit);
if (work_done < work_limit) {
+ /* Bug/Errata workaround?
+ * Need to kick the TX irq moderation timer.
+ */
+ if (sky2_read8(hw, STAT_TX_TIMER_CTRL) == TIM_START) {
+ sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_STOP);
+ sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_START);
+ }
netif_rx_complete(dev0);
/* end of interrupt, re-enables also acts as I/O synchronization */
--
^ permalink raw reply
* [patch 04/20] sky2: carrier management
From: Greg KH @ 2007-08-21 6:54 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap,
Dave Jones, Chuck Wolber, Chris Wedgwood, Michael Krufky,
Chuck Ebbert, Domenico Andreoli, torvalds, akpm, alan, netdev,
Stephen Hemminger, Greg Kroah-Hartman
In-Reply-To: <20070821065210.GA5275@kroah.com>
[-- Attachment #1: sky2-carrier-mgmt.patch --]
[-- Type: text/plain, Size: 2132 bytes --]
-stable review patch. If anyone has any objections, please let us know.
------------------
From: Stephen Hemminger <shemminger@linux-foundation.org>
backport of commit 55d7b4e6ed6ad3ec5e5e30b3b4515a0a6a53e344
Make sky2 handle carrier similar to other drivers,
eliminate some possible races in carrier state transistions.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
---
drivers/net/sky2.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
--- a/drivers/net/sky2.c
+++ b/drivers/net/sky2.c
@@ -1234,6 +1234,8 @@ static int sky2_up(struct net_device *de
if (netif_msg_ifup(sky2))
printk(KERN_INFO PFX "%s: enabling interface\n", dev->name);
+ netif_carrier_off(dev);
+
/* must be power of 2 */
sky2->tx_le = pci_alloc_consistent(hw->pdev,
TX_RING_SIZE *
@@ -1573,7 +1575,6 @@ static int sky2_down(struct net_device *
/* Stop more packets from being queued */
netif_stop_queue(dev);
- netif_carrier_off(dev);
/* Disable port IRQ */
imask = sky2_read32(hw, B0_IMSK);
@@ -1625,6 +1626,8 @@ static int sky2_down(struct net_device *
sky2_phy_power(hw, port, 0);
+ netif_carrier_off(dev);
+
/* turn off LED's */
sky2_write16(hw, B0_Y2LED, LED_STAT_OFF);
@@ -1689,7 +1692,6 @@ static void sky2_link_up(struct sky2_por
gm_phy_write(hw, port, PHY_MARV_INT_MASK, PHY_M_DEF_MSK);
netif_carrier_on(sky2->netdev);
- netif_wake_queue(sky2->netdev);
/* Turn on link LED */
sky2_write8(hw, SK_REG(port, LNK_LED_REG),
@@ -1741,7 +1743,6 @@ static void sky2_link_down(struct sky2_p
gma_write16(hw, port, GM_GP_CTRL, reg);
netif_carrier_off(sky2->netdev);
- netif_stop_queue(sky2->netdev);
/* Turn on link LED */
sky2_write8(hw, SK_REG(port, LNK_LED_REG), LINKLED_OFF);
@@ -3493,10 +3494,6 @@ static __devinit struct net_device *sky2
memcpy_fromio(dev->dev_addr, hw->regs + B2_MAC_1 + port * 8, ETH_ALEN);
memcpy(dev->perm_addr, dev->dev_addr, dev->addr_len);
- /* device is off until link detection */
- netif_carrier_off(dev);
- netif_stop_queue(dev);
-
return dev;
}
--
^ permalink raw reply
* [patch 05/20] sky2: check for more work before leaving NAPI
From: Greg KH @ 2007-08-21 6:54 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap,
Dave Jones, Chuck Wolber, Chris Wedgwood, Michael Krufky,
Chuck Ebbert, Domenico Andreoli, torvalds, akpm, alan, netdev,
Stephen Hemminger, Greg Kroah-Hartman
In-Reply-To: <20070821065210.GA5275@kroah.com>
[-- Attachment #1: sky2-napi-fill.patch --]
[-- Type: text/plain, Size: 2291 bytes --]
-stable review patch. If anyone has any objections, please let us know.
------------------
From: Stephen Hemminger <shemminger@linux-foundation.org>
Backport of commit 5c11ce700f77fada15b6264417d72462da4bbb1c
This patch avoids generating another IRQ if more packets
arrive while in the NAPI poll routine. Before marking device as
finished, it rechecks that the status ring is empty.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
---
drivers/net/sky2.c | 35 +++++++++++++++++------------------
1 file changed, 17 insertions(+), 18 deletions(-)
--- a/drivers/net/sky2.c
+++ b/drivers/net/sky2.c
@@ -2428,8 +2428,7 @@ static void sky2_err_intr(struct sky2_hw
static int sky2_poll(struct net_device *dev0, int *budget)
{
struct sky2_hw *hw = ((struct sky2_port *) netdev_priv(dev0))->hw;
- int work_limit = min(dev0->quota, *budget);
- int work_done = 0;
+ int work_done;
u32 status = sky2_read32(hw, B0_Y2_SP_EISR);
if (unlikely(status & Y2_IS_ERROR))
@@ -2441,25 +2440,25 @@ static int sky2_poll(struct net_device *
if (status & Y2_IS_IRQ_PHY2)
sky2_phy_intr(hw, 1);
- work_done = sky2_status_intr(hw, work_limit);
- if (work_done < work_limit) {
- /* Bug/Errata workaround?
- * Need to kick the TX irq moderation timer.
- */
- if (sky2_read8(hw, STAT_TX_TIMER_CTRL) == TIM_START) {
- sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_STOP);
- sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_START);
- }
- netif_rx_complete(dev0);
+ work_done = sky2_status_intr(hw, min(dev0->quota, *budget));
+ *budget -= work_done;
+ dev0->quota -= work_done;
- /* end of interrupt, re-enables also acts as I/O synchronization */
- sky2_read32(hw, B0_Y2_SP_LISR);
- return 0;
- } else {
- *budget -= work_done;
- dev0->quota -= work_done;
+ /* More work? */
+ if (hw->st_idx != sky2_read16(hw, STAT_PUT_IDX))
return 1;
+
+ /* Bug/Errata workaround?
+ * Need to kick the TX irq moderation timer.
+ */
+ if (sky2_read8(hw, STAT_TX_TIMER_CTRL) == TIM_START) {
+ sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_STOP);
+ sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_START);
}
+ netif_rx_complete(dev0);
+
+ sky2_read32(hw, B0_Y2_SP_LISR);
+ return 0;
}
static irqreturn_t sky2_intr(int irq, void *dev_id)
--
^ permalink raw reply
* [patch 06/20] sky2: check drop truncated packets
From: Greg KH @ 2007-08-21 6:54 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Justin Forbes, Zwane Mwaikambo, Theodore Ts'o, Randy Dunlap,
Dave Jones, Chuck Wolber, Chris Wedgwood, Michael Krufky,
Chuck Ebbert, Domenico Andreoli, torvalds, akpm, alan, netdev,
Stephen Hemminger, Greg Kroah-Hartman
In-Reply-To: <20070821065210.GA5275@kroah.com>
[-- Attachment #1: sky2-stable-trunc.patch --]
[-- Type: text/plain, Size: 1105 bytes --]
-stable review patch. If anyone has any objections, please let us know.
------------------
From: Stephen Hemminger <shemminger@linux-foundation.org>
Backport of commit 71749531f2d1954137a1a77422ef4ff29eb102dd
If packet larger than MTU is received, the driver uses hardware to
truncate the packet. Use the status registers to catch/drop them.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
---
drivers/net/sky2.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/drivers/net/sky2.c
+++ b/drivers/net/sky2.c
@@ -2065,6 +2065,9 @@ static struct sk_buff *sky2_receive(stru
if (!(status & GMR_FS_RX_OK))
goto resubmit;
+ if (status >> 16 != length)
+ goto len_mismatch;
+
if (length < copybreak)
skb = receive_copy(sky2, re, length);
else
@@ -2074,6 +2077,11 @@ resubmit:
return skb;
+len_mismatch:
+ /* Truncation of overlength packets
+ causes PHY length to not match MAC length */
+ ++sky2->net_stats.rx_length_errors;
+
error:
++sky2->net_stats.rx_errors;
if (status & GMR_FS_RX_FF_OV) {
--
^ permalink raw reply
* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
From: David Miller @ 2007-08-21 6:58 UTC (permalink / raw)
To: rdreier; +Cc: jeff, netdev, linux-kernel, general
In-Reply-To: <adalkc5u1o9.fsf@cisco.com>
From: Roland Dreier <rdreier@cisco.com>
Date: Mon, 20 Aug 2007 18:16:54 -0700
> And direct data placement really does give you a factor of two at
> least, because otherwise you're stuck receiving the data in one
> buffer, looking at some of the data at least, and then figuring out
> where to copy it. And memory bandwidth is if anything becoming more
> valuable; maybe LRO + header splitting + page remapping tricks can get
> you somewhere but as NCPUS grows then it seems the TLB shootdown cost
> of page flipping is only going to get worse.
As Herbert has said already, people can code for this just like
they have to code for RDMA.
There is no fundamental difference from converting an application
to sendfile or similar.
The only thing this needs is a
"recvmsg_I_dont_care_where_the_data_is()" call. There are no alignment
issues unless you are trying to push this data directly into the
page cache.
Couple this with a card that makes sure that on a per-page basis, only
data for a particular flow (or group of flows) will accumulate.
People already make cards that can do stuff like this, it can be done
statelessly with an on-chip dynamically maintained flow table.
And best yet it doesn't turn off every feature in the networking nor
bypass it for the actual protocol processing.
^ permalink raw reply
* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: David Miller @ 2007-08-21 7:04 UTC (permalink / raw)
To: torvalds
Cc: csnook, piggin, satyam, herbert, paulus, clameter, ilpo.jarvinen,
paulmck, stefanr, linux-kernel, linux-arch, netdev, akpm, ak,
heiko.carstens, schwidefsky, wensong, horms, wjiang, cfriesen,
zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <alpine.LFD.0.999.0708202244520.30176@woody.linux-foundation.org>
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 20 Aug 2007 22:46:47 -0700 (PDT)
> Ie a "barrier()" is likely _cheaper_ than the code generation downside
> from using "volatile".
Assuming GCC were ever better about the code generation badness
with volatile that has been discussed here, I much prefer
we tell GCC "this memory piece changed" rather than "every
piece of memory has changed" which is what the barrier() does.
I happened to have been scanning a lot of assembler lately to
track down a gcc-4.2 miscompilation on sparc64, and the barriers
do hurt quite a bit in some places. Instead of keeping unrelated
variables around cached in local registers, it reloads everything.
^ permalink raw reply
* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Russell King @ 2007-08-21 7:05 UTC (permalink / raw)
To: Segher Boessenkool
Cc: Christoph Lameter, Paul Mackerras, heiko.carstens, horms,
linux-kernel, Paul E. McKenney, ak, netdev, cfriesen, akpm,
rpjday, Nick Piggin, linux-arch, jesper.juhl, satyam, zlynx,
schwidefsky, Chris Snook, Herbert Xu, davem, Linus Torvalds,
wensong, wjiang
In-Reply-To: <2bdb04581125f22122f1d230e991ea92@kernel.crashing.org>
On Tue, Aug 21, 2007 at 01:02:01AM +0200, Segher Boessenkool wrote:
> >>And no, RMW on MMIO isn't "problematic" at all, either.
> >>
> >>An RMW op is a read op, a modify op, and a write op, all rolled
> >>into one opcode. But three actual operations.
> >
> >Maybe for some CPUs, but not all. ARM for instance can't use the
> >load exclusive and store exclusive instructions to MMIO space.
>
> Sure, your CPU doesn't have RMW instructions -- how to emulate
> those if you don't have them is a totally different thing.
Let me say it more clearly: On ARM, it is impossible to perform atomic
operations on MMIO space.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
^ permalink raw reply
* Re: [NET]: Share correct feature code between bridging and bonding
From: David Miller @ 2007-08-21 7:07 UTC (permalink / raw)
To: herbert; +Cc: netdev, stable
In-Reply-To: <20070821062255.GA21799@gondor.apana.org.au>
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 21 Aug 2007 14:22:55 +0800
> Hi:
>
> Here's the back-port for 2.6.22.
>
> [NET]: Share correct feature code between bridging and bonding
>
> http://bugzilla.kernel.org/show_bug.cgi?id=8797 shows that the
> bonding driver may produce bogus combinations of the checksum
> flags and SG/TSO.
>
> For example, if you bond devices with NETIF_F_HW_CSUM and
> NETIF_F_IP_CSUM you'll end up with a bonding device that
> has neither flag set. If both have TSO then this produces
> an illegal combination.
>
> The bridge device on the other hand has the correct code to
> deal with this.
>
> In fact, the same code can be used for both. So this patch
> moves that logic into net/core/dev.c and uses it for both
> bonding and bridging.
>
> In the process I've made small adjustments such as only
> setting GSO_ROBUST if at least one constituent device
> supports it.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
I was about to send this off, but you beat me to it :-)
Acked-by: David S. Miller <davem@davemloft.net>
^ permalink raw reply
* Re: [PATCH] IPv6: Fix kernel panic while send SCTP data with IP fragments
From: David Miller @ 2007-08-21 7:09 UTC (permalink / raw)
To: yjwei; +Cc: acme, netdev
In-Reply-To: <46C8FC18.6030006@cn.fujitsu.com>
From: Wei Yongjun <yjwei@cn.fujitsu.com>
Date: Mon, 20 Aug 2007 10:27:36 +0800
> Hi Arnaldo Carvalho de Melo:
> > Em Mon, Aug 20, 2007 at 09:28:27AM +0800, Wei Yongjun escreveu:
> >
> >> If ICMP6 message with "Packet Too Big" is received after send SCTP DATA,
> >> kernel panic will occur when SCTP DATA is send again.
> >>
> >> This is because of a bad dest address when call to skb_copy_bits().
> >>
> >> The messages sequence is like this:
> >>
> >> Endpoint A Endpoint B
> >> <------- SCTP DATA (size=1432)
> >> ICMP6 message ------->
> >> (Packet Too Big pmtu=1280)
> >> <------- Resend SCTP DATA (size=1432)
> >> ------------kernel panic---------------
> >>
> >
> > Thanks! I'm to blame for this one, problem was introduced in:
> >
> > b0e380b1d8a8e0aca215df97702f99815f05c094
> >
> > @@ -761,7 +762,7 @@ slow_path:
> > /*
> > * Copy a block of the IP datagram.
> > */
> > - if (skb_copy_bits(skb, ptr, frag->h.raw, len))
> > + if (skb_copy_bits(skb, ptr, skb_transport_header(skb),
> > len))
> > BUG();
> > left -= len;
> >
> > So please add:
> >
> > Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
> >
> > To this patch.
> >
> > - Arnaldo
> >
> >
> >
> >> printing eip:
> >> c05be62a
> >> *pde = 00000000
> >> Oops: 0002 [#1]
> >> SMP
> >> Modules linked in: scomm l2cap bluetooth ipv6 dm_mirror dm_mod video output sbs battery lp floppy sg i2c_piix4 i2c_core pcnet32 mii button ac parport_pc parport ide_cd cdrom serio_raw mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
> >> CPU: 0
> >> EIP: 0060:[<c05be62a>] Not tainted VLI
> >> EFLAGS: 00010282 (2.6.23-rc2 #1)
> >> EIP is at skb_copy_bits+0x4f/0x1ef
> >> eax: 000004d0 ebx: ce12a980 ecx: 00000134 edx: cfd5a880
> >> esi: c8246858 edi: 00000000 ebp: c0759b14 esp: c0759adc
> >> ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
> >> Process swapper (pid: 0, ti=c0759000 task=c06d0340 task.ti=c0713000)
> >> Stack: c0759b88 c0405867 ce12a980 c8bff838 c789c084 00000000 00000028 cfd5a880
> >> d09f1890 000005dc 0000007b ce12a980 cfd5a880 c8bff838 c0759b88 d09bc521
> >> 000004d0 fffff96c 00000200 00000100 c0759b50 cfd5a880 00000246 c0759bd4
> >> Call Trace:
> >> [<c0405e1d>] show_trace_log_lvl+0x1a/0x2f
> >> [<c0405ecd>] show_stack_log_lvl+0x9b/0xa3
> >> [<c040608d>] show_registers+0x1b8/0x289
> >> [<c0406271>] die+0x113/0x246
> >> [<c0625dbc>] do_page_fault+0x4ad/0x57e
> >> [<c0624642>] error_code+0x72/0x78
> >> [<d09bc521>] ip6_output+0x8e5/0xab2 [ipv6]
> >> [<d09bcec1>] ip6_xmit+0x2ea/0x3a3 [ipv6]
> >> [<d0a3f2ca>] sctp_v6_xmit+0x248/0x253 [sctp]
> >> [<d0a3c934>] sctp_packet_transmit+0x53f/0x5ae [sctp]
> >> [<d0a34bf8>] sctp_outq_flush+0x555/0x587 [sctp]
> >> [<d0a34d3c>] sctp_retransmit+0xf8/0x10f [sctp]
> >> [<d0a3d183>] sctp_icmp_frag_needed+0x57/0x5b [sctp]
> >> [<d0a3ece2>] sctp_v6_err+0xcd/0x148 [sctp]
> >> [<d09cf1ce>] icmpv6_notify+0xe6/0x167 [ipv6]
> >> [<d09d009a>] icmpv6_rcv+0x7d7/0x849 [ipv6]
> >> [<d09be240>] ip6_input+0x1dc/0x310 [ipv6]
> >> [<d09be965>] ipv6_rcv+0x294/0x2df [ipv6]
> >> [<c05c3789>] netif_receive_skb+0x2d2/0x335
> >> [<c05c5733>] process_backlog+0x7f/0xd0
> >> [<c05c58f6>] net_rx_action+0x96/0x17e
> >> [<c042e722>] __do_softirq+0x64/0xcd
> >> [<c0406f37>] do_softirq+0x5c/0xac
> >> =======================
> >> Code: 00 00 29 ca 89 d0 2b 45 e0 89 55 ec 85 c0 7e 35 39 45 08 8b 55 e4 0f 4e 45 08 8b 75 e0 8b 7d dc 89 c1 c1 e9 02 03 b2 a0 00 00 00 <f3> a5 89 c1 83 e1 03 74 02 f3 a4 29 45 08 0f 84 7b 01 00 00 01
> >> EIP: [<c05be62a>] skb_copy_bits+0x4f/0x1ef SS:ESP 0068:c0759adc
> >> Kernel panic - not syncing: Fatal exception in interrupt
> >>
> >> Following is the patch.
> >>
> Have changed. Thanks
>
> Regards
>
>
> Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
> Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Applied, thanks everyone.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox