Netdev List
 help / color / mirror / Atom feed
* Re: TG3, kvm, ipv6 & tso data corruption bug?
From: Matt Carlson @ 2009-10-28 16:32 UTC (permalink / raw)
  To: Rik van Riel
  Cc: netdev@vger.kernel.org, Linux kernel Mailing List,
	Matthew Carlson, Michael Chan, KVM list
In-Reply-To: <4AE8595F.1080404@redhat.com>

On Wed, Oct 28, 2009 at 07:46:55AM -0700, Rik van Riel wrote:
> I have been tracking down what I thought was a KVM related network
> issue for a while, however it appears it could be a hardware issue.
> 
> The symptom is that data in network packets gets corrupted, before
> the checksum is calculated.  This means the remote host can get
> corrupted data, with no way to calculate it (except application
> level checksums).  Luckily ssh has such checksums, so my rsync over
> ssh backup script discovered this issue.
> 
> On a very regular basis, I got this message from ssh:
> 
> 	Corrupted MAC on input.
> 
> I have played around a bit and narrowed it down to the following:
> 
> ipv4          => no problem
> ipv6 w/o tso  => no problem
> ipv6 with tso => occasional data corruption
> 
> Disabling tso with ethtool -K eth0 tso off makes the problem stop.
> 
> I am running Fedora 12's 2.6.31.1-56.fc12.x86_64 kernel, with the
> following hardware:
> 
> 05:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5761 
> Gigabit Ethernet PCIe (rev 10)
> 
> I do not know enough about the network layer to know whether this is
> fixable in software or whether TSO offloading for ipv6 should just
> be disabled on this model.

This problem sounds familiar.  There are chip bugs in this area, but as
far as I know, they should have been worked around.  Let me see if this
is indeed the same bug resurfacing.


^ permalink raw reply

* RE: [Pv-drivers] [PATCH] vmxnet3: remove duplicate #include
From: Bhavesh Davda @ 2009-10-28 16:32 UTC (permalink / raw)
  To: Shreyas Bhatewara, netdev@vger.kernel.org
  Cc: pv-drivers@vmware.com, weiyi.huang@gmail.com
In-Reply-To: <alpine.LRH.2.00.0910280910250.24555@sbhatewara-dev1.eng.vmware.com>

Correction:

< Signed-off-by: Bhavesh Davda <davda@vmware.com>
> Signed-off-by: Bhavesh Davda <bhavesh@vmware.com>

Thanks

- Bhavesh
 
Bhavesh P. Davda

> -----Original Message-----
> From: pv-drivers-bounces@vmware.com [mailto:pv-drivers-
> bounces@vmware.com] On Behalf Of Shreyas Bhatewara
> Sent: Wednesday, October 28, 2009 9:31 AM
> To: netdev@vger.kernel.org
> Cc: pv-drivers@vmware.com; weiyi.huang@gmail.com
> Subject: [Pv-drivers] [PATCH] vmxnet3: remove duplicate #include
> 
> 
> Remove duplicate headerfile includes from vmxnet3_int.h
> 
> Signed-off-by: Shreyas Bhatewara <sbhatewara@vmware.com>
> Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
> Signed-off-by: Bhavesh Davda <davda@vmware.com>
> 
> ---
> 
> diff --git a/drivers/net/vmxnet3/vmxnet3_int.h
> b/drivers/net/vmxnet3/vmxnet3_int.h
> index 3c0d70d..4450816 100644
> --- a/drivers/net/vmxnet3/vmxnet3_int.h
> +++ b/drivers/net/vmxnet3/vmxnet3_int.h
> @@ -27,16 +27,11 @@
>  #ifndef _VMXNET3_INT_H
>  #define _VMXNET3_INT_H
> 
> -#include <linux/types.h>
>  #include <linux/ethtool.h>
>  #include <linux/delay.h>
> -#include <linux/device.h>
>  #include <linux/netdevice.h>
>  #include <linux/pci.h>
> -#include <linux/ethtool.h>
>  #include <linux/compiler.h>
> -#include <linux/module.h>
> -#include <linux/moduleparam.h>
>  #include <linux/slab.h>
>  #include <linux/spinlock.h>
>  #include <linux/ioport.h>
> _______________________________________________
> Pv-drivers mailing list
> Pv-drivers@vmware.com
> http://mailman2.vmware.com/mailman/listinfo/pv-drivers

^ permalink raw reply

* phylib printouts
From: Kristoffer Glembo @ 2009-10-28 16:33 UTC (permalink / raw)
  To: netdev

Hi,

There are some printouts in the phy abstraction library using pr_info(). 
  Personally I would prefer if they used pr_debug() instead.

Especially the pr_info("%s: probed\n", bus->name) in mdiobus_register() 
since in some cases I can't tolerate a console printout while the link 
is down (we have an ethernet debug link and read the UART output through 
the link).

Of course this is a very specific case but in general in embedded 
systems (which I think is where the phylib is mainly used) I prefer the 
driver and other kernel code to be rather silent if not debugging.

Any opinions? :)

Best regards,
Kristoffer Glembo

^ permalink raw reply

* Re: [PATCH] vlan: allow VLAN ID 0 to be used
From: Patrick McHardy @ 2009-10-28 16:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Benny Amorsen, Gertjan Hofman, Matt Carlson,
	netdev@vger.kernel.org, David S. Miller
In-Reply-To: <4AE6C54B.7080805@gmail.com>

Eric Dumazet wrote:
> Benny Amorsen a écrit :
>> Eric Dumazet <eric.dumazet@gmail.com> writes:
>>
>>> Here is the patch I cooked that permitted VLAN 0 to be used with tg3
>>> (and other HW accelerated vlan nics I suppose)
>>>
>>> [PATCH] vlan: allow VLAN ID 0 to be used
>>>
>>> We currently use a 16 bit field (vlan_tci) to store VLAN ID on a skb.
>>>
>>> 0 value is used a special value, meaning VLAN ID not set.
>>> This forbids use of VLAN ID 0
>> Are you sure you actually want to do this?
>>
>> VLAN 0 IS special. Frames received on VLAN 0 should be treated just as
>> if they had no VLAN tag at all, except that they have an 802.1p value.
>>
>> Sending frames with VLAN 0 should have something to do with whether
>> the sender wants to use 802.1p, which doesn't really have much to do
>> with VLAN's at all...
>>
>> It would be nice if the unsuspecting user was at least warned that their
>> use of VLAN 0 is non-standard and may cause surprising results like
>> leakage into the "native" VLAN. That could be done in /sbin/ip or
>> /sbin/vconfig, of course.
>>
> 
> Quotting http://en.wikipedia.org/wiki/IEEE_802.1Q
> 
> VLAN Identifier (VID): a 12-bit field specifying the VLAN to which the frame belongs.
>  A value of 0 means that the frame doesn't belong to any VLAN; in this case the 802.1Q
>  tag specifies only a priority and is referred to as a priority tag.
> A value of hex FFF is reserved for implementation use.
> All other values may be used as VLAN identifiers, allowing up to 4094 VLANs
> 
> 
> So we expect to generate a 802.1Q frame, even with a VID=0 field.
> Before patch, device sends a non 802.1Q frame, which is not what was wanted by user.
> (Maybe he wants to check its device/network is able to transport 1522 bytes frames, who knows...)
> 
> To use non tagged frames, user selects eth0 device, and to send tagged frames, he selects eth0.0

I agree. Quoting IEEE802.1Q:

0 The null VLAN ID. Indicates that the tag header contains only priority
  information; no VLAN identifier is present in the frame. ...

which implies that its valid to use 0 as tag in the header.

> Now, maybe eth0 and eth0.0 should share same IP addresses, because incoming frame
> with ID=0 tag should be received by eth0 device, but I am not sure standard requires this.

I don't think the standard makes any demands regarding this.

The current behaviour seems better to me since its symetrical
to the output path, where the VLAN device is necessary to be
able to specify a priority mapping.



^ permalink raw reply

* Re: [Fwd: Re: [PATCH] Multicast packet reassembly can fail]
From: Steve Chen @ 2009-10-28 17:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <4AE86420.3040607@gmail.com>

On Wed, 2009-10-28 at 16:32 +0100, Eric Dumazet wrote:
> Steve Chen a écrit :
> > Following is the e-mail I replied to Rick Jones in case you haven't got
> > it yet.  I have not posted the test code yet.  The test code is a tar gz
> > file about 11k in size.  Lot of people will get very upset if I send it
> > to the mailing list.  If you can suggest a place to up load, I'll be
> > happy to do so.
> > 
> > Regards,
> 
> > Here is the scenario this patch tries to address
> > 
> > <src node> ---->  <switch>  ----> <eth0 dest node>
> >                             \--->  <eth1 dest node>
> 
> If each fragment is received twice on host, once by eth0, once by eth1,
> should we deliver datagram once or twice ?

The application received it once.  IIRC the duplicate packet is drop in
the routing code.

> 
> Once should be enough, even if in the non fragmented case, it will
> be delivered twice (kernel cannot detect duplicates, user app might do itself)

Routing code drops the duplicate packet for none-fragmented case as
well.

> 
> 
> > 
> > For this specific case, src/dst address, protocol, IP ID and fragment
> > offset are all identical.  The only difference is the ingress interface.
> > A good follow up question would be why would anyone in their right mind
> > multicast to the same destination?  well, I don't know.  I can not get
> > the people who reported the problem to tell me either.   Since someone
> > found the need to do this,  perhaps others may find it useful too.
> > 
> 
> Then, if a 2000 bytes message is fragmented in two packets, one coming
> from eth0, one coming from eth1, I suspect your patch drops the message,
> unless eth0/eth1 are part of a bonding device...

Actually, the patch tries to prevent packet drop for this exact
scenario.  Please consider the following scenarios
1.  Packet comes in the fragment reassemble code in the following order
(eth0 frag1), (eth0 frag2), (eth1 frag1), (eth1 frag2)
Packet from both interfaces get reassembled and gets further processed.

2. Packet can some times arrive in (perhaps other orders as well)
(eth0 frag1), (eth1 frag1), (eth0 frag2), (eth1 frag2)
Without this patch, eth0 frag 1/2 are overwritten by eth1 frag1/2, and
packet from eth1 is dropped in the routing code.

> 
> That would break common routing setups, using two links to aggregate bandwidth ?

I don't believe it would.  The aggregate bandwidth will work the same as
before.  The attributes (src/dst addr, protocol, interface, etc.) should
generate a unique hash key.  If hash collision should happen with the
addition of iif << 5, the code still compare the original src addr along
with interface number, so there should be no issues.

> Nothing in IP forces all packets of a datagram to follow an unique route.
> 
My understanding as well.

Regards,

Steve


^ permalink raw reply

* Re: [net-next-2.6 PATCH v4 3/3] TCPCT part 1c: initial SYN exchange with SYNACK data
From: William Allen Simpson @ 2009-10-28 17:14 UTC (permalink / raw)
  To: Linux Kernel Network Developers; +Cc: Eric Dumazet
In-Reply-To: <4AE8527E.4080603@gmail.com>

Eric Dumazet wrote:
> William Allen Simpson a écrit :
>> suggested for many years, and is also useful without SYN data, allowing
>> several related concepts to use the same extension option.
>>
>>    "Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
>>    http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html
> 
> Sorry this link might be interesting to you, but I found nothing that explains
> your patches.
> 
>>    "Re: what a new TCP header might look like", May 12, 1998.
>>    ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail
> 
> Same here....
> 
They explain the "several related concepts" -- indeed the entire scope of
the whole project.  That's why this patch series is labeled "part 1".


> I tried to find an RFC or document about this stuff and failed.
> 
This patch series was fairly clearly described in Adam's draft last year.
It's expired now, but I'll send you a copy privately.


> Before reading implementation code, I like to have english text that describes
> the new concept/design.
> 
Before writing text, I like to have implementation code.... :-)

In fact, it has long been my position that we shouldn't publish IETF RFCs
without running code (and I never have).  Even PPP over SOnet/SDH had
preliminary hardware before publication.  Unfortunately, those with
standards-bodies-itis have a sad history of the opposite.

The Usenix ;login: overview and other documents are embargoed until
(December) publication, but I'll send you some recent galleys privately.


> (BTW I found http://ttcplinux.sourceforge.net/theses/ETTCP.pdf and found it interesting,
> I wonder what happened to this)
> 
I found it too, and deliberately extended Adam's sockopt to encompass it.
It's one reason this was renamed TCP Cookie *Transactions* (TCPCT).  That
will come along later, probably about parts 4 or 5.

^ permalink raw reply

* Re: [PATCH] Multicast packet reassembly can fail
From: Rick Jones @ 2009-10-28 17:18 UTC (permalink / raw)
  To: Mark Huth; +Cc: Steve Chen, netdev
In-Reply-To: <4AE8776D.4020609@mvista.com>

>> It has been hours since my last good Emily Litella moment so I'll ask 
>> - isn't the combination of source and dest addr, protocol, IP ID and 
>> fragment offset supposed to take care of this?  How does the ingress 
>> interface have anything to do with it?
>>
>> rick jones
> 
> The problem we've seen arises only when there are multiple interfaces 
> each receiving the same multicast packets.  In that case there are 
> multiple packets with the same key.  Steve was able to track down a 
> packet loss due to re-assembly failure under certain arrival order 
> conditions.
> 
> The proposed fix eliminated the packet loss in this case.  There might 
> be a different problem in the re-assembly code that we have masked by 
> separating the packets into streams from each interface.  Now that you 
> mention it, the re-assembly code should be robust in the face of some 
> duplicated and mis-ordered packets.  We can look more closely at that code.

If I understand correctly, the idea here is to say that when multiple interfaces 
receive fragments of copies of the same  IP datagram that both copies will 
"survive" and flow up the stack?

I'm basing that on your description, and an email from Steve that reads:

> Actually, the patch tries to prevent packet drop for this exact
> scenario.  Please consider the following scenarios
> 1.  Packet comes in the fragment reassemble code in the following order
> (eth0 frag1), (eth0 frag2), (eth1 frag1), (eth1 frag2)
> Packet from both interfaces get reassembled and gets further processed.
> 
> 2. Packet can some times arrive in (perhaps other orders as well)
> (eth0 frag1), (eth1 frag1), (eth0 frag2), (eth1 frag2)
> Without this patch, eth0 frag 1/2 are overwritten by eth1 frag1/2, and
> packet from eth1 is dropped in the routing code.

Doesn't that rather fly in the face of the weak-end-system model followed by Linux?

I can see where scenario one leads to two IP datagrams making it up the stack, 
but I would have thought that was simply an "accident" of the situation that 
cannot reasonably be prevented, not justification to cause scenario two to send 
two datagrams up the stack.

rick jones

^ permalink raw reply

* Re: [PATCH] Multicast packet reassembly can fail
From: Mark Huth @ 2009-10-28 16:55 UTC (permalink / raw)
  To: Rick Jones; +Cc: Steve Chen, netdev
In-Reply-To: <4AE780CB.8070401@hp.com>

Rick Jones wrote:
> Steve Chen wrote:
>> Multicast packet reassembly can fail
>>
>> When multicast connections with multiple fragments are received by the 
>> same
>> node from more than one Ethernet ports, race condition between fragments
>> from each Ethernet port can cause fragment reassembly to fail leading to
>> packet drop.  This is because packets from each Ethernet port appears 
>> identical
>> to the the code that reassembles the Ethernet packet.
>>
>> The solution is evaluate the Ethernet interface number in addition to 
>> all other
>> parameters so that every packet can be uniquely identified.  The existing
>> iif field in struct ipq is now used to generate the hash key, and iif 
>> is also
>> used for comparison in case of hash collision.
>>
>> Please note that q->saddr ^ (q->iif << 5) is now being passed into
>> ipqhashfn to generate the hash key.  This is borrowed from the routing
>> code.
>>
>> Signed-off-by: Steve Chen <schen@mvista.com>
>> Signed-off-by: Mark Huth <mhuth@mvista.com>
> 
> It has been hours since my last good Emily Litella moment so I'll ask - 
> isn't the combination of source and dest addr, protocol, IP ID and 
> fragment offset supposed to take care of this?  How does the ingress 
> interface have anything to do with it?
> 
> rick jones
The problem we've seen arises only when there are multiple interfaces 
each receiving the same multicast packets.  In that case there are 
multiple packets with the same key.  Steve was able to track down a 
packet loss due to re-assembly failure under certain arrival order 
conditions.

The proposed fix eliminated the packet loss in this case.  There might 
be a different problem in the re-assembly code that we have masked by 
separating the packets into streams from each interface.  Now that you 
mention it, the re-assembly code should be robust in the face of some 
duplicated and mis-ordered packets.  We can look more closely at that code.

Mark Huth


^ permalink raw reply

* [PATCH net-next-2.6] be2net: Add the new PCI IDs to PCI_DEVICE_TABLE.
From: Ajit Khaparde @ 2009-10-28 17:23 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

This patch adds the PCI IDs for the next generation chip to the
PCI_DEVICE_ID table.

Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
---
 drivers/net/benet/be_main.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/benet/be_main.c b/drivers/net/benet/be_main.c
index 4520db7..43180dc 100644
--- a/drivers/net/benet/be_main.c
+++ b/drivers/net/benet/be_main.c
@@ -31,8 +31,10 @@ MODULE_PARM_DESC(rx_frag_size, "Size of a fragment that holds rcvd data.");
 
 static DEFINE_PCI_DEVICE_TABLE(be_dev_ids) = {
 	{ PCI_DEVICE(BE_VENDOR_ID, BE_DEVICE_ID1) },
+	{ PCI_DEVICE(BE_VENDOR_ID, BE_DEVICE_ID2) },
 	{ PCI_DEVICE(BE_VENDOR_ID, OC_DEVICE_ID1) },
 	{ PCI_DEVICE(BE_VENDOR_ID, OC_DEVICE_ID2) },
+	{ PCI_DEVICE(BE_VENDOR_ID, OC_DEVICE_ID3) },
 	{ 0 }
 };
 MODULE_DEVICE_TABLE(pci, be_dev_ids);
-- 
1.6.0.4


^ permalink raw reply related

* Re: [RFC] multiqueue changes
From: Patrick McHardy @ 2009-10-28 17:27 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Eric Dumazet, David S. Miller, Linux Netdev List
In-Reply-To: <20091008120039.GA8691@ff.dom.local>

Jarek Poplawski wrote:
> On Thu, Oct 08, 2009 at 09:03:44AM +0000, Jarek Poplawski wrote:
> ...
>> num_tx_queues. But it seems we should rather use more often
>> real_num_tx_queue in schedulers code like dumps and maybe more
>> (like e.g. sch_multiq does).
> 
> ...i.e. probably everywhere between dev_activate and dev_deactivate
> all qdisc operations could use real_num_tx_queues (including a test
> like: netif_is_real_multique), unless I miss something.

We don't seem to be supporting changing real_num_tx_queues for
registered devices currently (at least I couldn't find it).
So I guess it depends on how this would be implemented.

Simply changing the dev->real_num_tx_queues value while the
device is down would require qdisc operations to operate on
all possible queues since the amount of queues in use could
be changed after the qdisc is created/configured, but before
the device is set up. This approach has more complications
like switching between mq and non-mq root qdiscs, taking care
of non-default root qdisc (cloning them to the new queues), etc.

A simpler alternative would be to destroy the existing root
qdisc on any change to real_num_tx_queues and have dev_activate()
set it up from scratch. In this case, we could (as you suggested)
use real_num_tx_queues, which should fix the problem Eric reported.

^ permalink raw reply

* Re: [Fwd: Re: [PATCH] Multicast packet reassembly can fail]
From: Eric Dumazet @ 2009-10-28 17:26 UTC (permalink / raw)
  To: Steve Chen; +Cc: netdev
In-Reply-To: <1256749559.3153.447.camel@linux-1lbu>

Steve Chen a écrit :
> On Wed, 2009-10-28 at 16:32 +0100, Eric Dumazet wrote:
>> If each fragment is received twice on host, once by eth0, once by eth1,
>> should we deliver datagram once or twice ?
> 
> The application received it once.  IIRC the duplicate packet is drop in
> the routing code.
> 
>> Once should be enough, even if in the non fragmented case, it will
>> be delivered twice (kernel cannot detect duplicates, user app might do itself)
> 
> Routing code drops the duplicate packet for none-fragmented case as
> well.

Really ? How so ? Receiving two copies of the same packet is legal.

> 
>>
>>> For this specific case, src/dst address, protocol, IP ID and fragment
>>> offset are all identical.  The only difference is the ingress interface.
>>> A good follow up question would be why would anyone in their right mind
>>> multicast to the same destination?  well, I don't know.  I can not get
>>> the people who reported the problem to tell me either.   Since someone
>>> found the need to do this,  perhaps others may find it useful too.
>>>
>> Then, if a 2000 bytes message is fragmented in two packets, one coming
>> from eth0, one coming from eth1, I suspect your patch drops the message,
>> unless eth0/eth1 are part of a bonding device...
> 
> Actually, the patch tries to prevent packet drop for this exact
> scenario.  Please consider the following scenarios
> 1.  Packet comes in the fragment reassemble code in the following order
> (eth0 frag1), (eth0 frag2), (eth1 frag1), (eth1 frag2)
> Packet from both interfaces get reassembled and gets further processed.

Yes your patch does this, so each multicast application receives two copies of the
same datagram.

> 
> 2. Packet can some times arrive in (perhaps other orders as well)
> (eth0 frag1), (eth1 frag1), (eth0 frag2), (eth1 frag2)
> Without this patch, eth0 frag 1/2 are overwritten by eth1 frag1/2, and
> packet from eth1 is dropped in the routing code.

Really ? how so ? I dont see how it can happen, unless you use RPF ?

current situation should be :

(eth0 frag1) : We create a context, store frag1 in it
(eth1 frag1) : We find this context, and drop frag1 since we already have the data
                  (maybe the bug is here, if we cannot cope with a duplicate ?)
(eth0 frag2) : We find this context, store frag2 -> complete datagram and deliver it
(eth1 frag2) : We find context, drop frag2 since datagram was completed.

               (or maybe we create a new context that will timeout later, maybe this is your problem ?)

Net effect : We deliver the datagram correctly.


> 
>> That would break common routing setups, using two links to aggregate bandwidth ?
> 
> I don't believe it would.  The aggregate bandwidth will work the same as
> before.  The attributes (src/dst addr, protocol, interface, etc.) should
> generate a unique hash key.  If hash collision should happen with the
> addition of iif << 5, the code still compare the original src addr along
> with interface number, so there should be no issues.

What about the obvious :

(eth0 frag1),  (eth1 frag2)

Your patch creates two contexts since hashes are different,
that will timeout and no packet delivered at all


^ permalink raw reply

* [PATCH] AF_RAW: Augment raw_send_hdrinc to expand skb to fit iphdr->ihl
From: Neil Horman @ 2009-10-28 17:39 UTC (permalink / raw)
  To: netdev; +Cc: davem, nhorman, eric.dumazet

Augment raw_send_hdrinc to correct for incorrect ip header length values

A series of oopses was reported to me recently.  Apparently when using AF_RAW
sockets to send data to peers that were reachable via ipsec encapsulation,
people could panic or BUG halt their systems.

I've tracked the problem down to user space sending an invalid ip header over an
AF_RAW socket with IP_HDRINCL set to 1.

Basically what happens is that userspace sends down an ip frame that includes
only the header (no data), but sets the ip header ihl value to a large number,
one that is larger than the total amount of data passed to the sendmsg call.  In
raw_send_hdrincl, we allocate an skb based on the size of the data in the msghdr
that was passed in, but assume the data is all valid.  Later during ipsec
encapsulation, xfrm4_tranport_output moves the entire frame back in the skbuff
to provide headroom for the ipsec headers.  During this operation, the
skb->transport_header is repointed to a spot computed by
skb->network_header + the ip header length (ihl).  Since so little data was
passed in relative to the value of ihl provided by the raw socket, we point
transport header to an unknown location, resulting in various crashes.

So, what to do about this?  My first thought was to simply return -EINVAL, and
let user space sort it out.  I'm still thinking that might be the best way, but
I thought I'd try this first, just in case someone has reason to try to
send such a bogus frame through the kernel.  This solution simply checks the
value of ihl in raw_send_hdrinc and expands the skb to fit, filling the new
space with  IPOPT_NOOP options.  I've confirmed that it fixes the crashes that
were reported.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>


 raw.c |   42 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 35 insertions(+), 7 deletions(-)


diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 9ef8c08..412304f 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -351,13 +351,42 @@ static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
 	skb->ip_summed = CHECKSUM_NONE;
 
 	skb->transport_header = skb->network_header;
-	err = memcpy_fromiovecend((void *)iph, from, 0, length);
-	if (err)
-		goto error_fault;
+	err = -EFAULT;
+	if (memcpy_fromiovecend((void *)iph, from, 0, length))
+		goto error_free;
 
-	/* We don't modify invalid header */
 	iphlen = iph->ihl * 4;
-	if (iphlen >= sizeof(*iph) && iphlen <= length) {
+	/*
+	 * We don't modify invalid header, but we do want to be sure
+	 * that we have enough space in the skb to hold the header
+	 * and all its options, so if iph->ihl is greater than
+	 * the length of the iovec, we need to reallocate the skb, lest
+	 * odd things happen farther down the stack
+	 */
+	if (iphlen > length) {
+		size_t new_length;
+		int i;
+		char *pad;
+ 
+		/*
+		 * someone passed in a bogus ip header, in which the
+		 * the iph->ihl value was longer than the actual data
+		 * buffer.  We need to at least meet the ihl requirement
+		 * and since we don't mess with the ip header here
+		 * lets expand the skb
+		 */
+		new_length = iphlen - length;
+		err = -ENOMEM;
+		if (pskb_expand_head(skb, 0,
+		     new_length, GFP_KERNEL) < 0)
+			goto error_free;
+		pad = skb_put(skb, new_length);
+		for (i = 0; i < new_length; i++)
+			pad[i] = IPOPT_NOOP;
+ 
+	}
+ 
+	if (iphlen >= sizeof(*iph)) {
 		if (!iph->saddr)
 			iph->saddr = rt->rt_src;
 		iph->check   = 0;
@@ -380,8 +409,7 @@ static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
 out:
 	return 0;
 
-error_fault:
-	err = -EFAULT;
+error_free:
 	kfree_skb(skb);
 error:
 	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);

^ permalink raw reply related

* Re: [PATCH] Multicast packet reassembly can fail
From: Steve Chen @ 2009-10-28 17:50 UTC (permalink / raw)
  To: Rick Jones; +Cc: Mark Huth, netdev
In-Reply-To: <4AE87D03.4020708@hp.com>

On Wed, 2009-10-28 at 10:18 -0700, Rick Jones wrote:
> >> It has been hours since my last good Emily Litella moment so I'll ask 
> >> - isn't the combination of source and dest addr, protocol, IP ID and 
> >> fragment offset supposed to take care of this?  How does the ingress 
> >> interface have anything to do with it?
> >>
> >> rick jones
> > 
> > The problem we've seen arises only when there are multiple interfaces 
> > each receiving the same multicast packets.  In that case there are 
> > multiple packets with the same key.  Steve was able to track down a 
> > packet loss due to re-assembly failure under certain arrival order 
> > conditions.
> > 
> > The proposed fix eliminated the packet loss in this case.  There might 
> > be a different problem in the re-assembly code that we have masked by 
> > separating the packets into streams from each interface.  Now that you 
> > mention it, the re-assembly code should be robust in the face of some 
> > duplicated and mis-ordered packets.  We can look more closely at that code.
> 
> If I understand correctly, the idea here is to say that when multiple interfaces 
> receive fragments of copies of the same  IP datagram that both copies will 
> "survive" and flow up the stack?
> 
> I'm basing that on your description, and an email from Steve that reads:
> 
> > Actually, the patch tries to prevent packet drop for this exact
> > scenario.  Please consider the following scenarios
> > 1.  Packet comes in the fragment reassemble code in the following order
> > (eth0 frag1), (eth0 frag2), (eth1 frag1), (eth1 frag2)
> > Packet from both interfaces get reassembled and gets further processed.
> > 
> > 2. Packet can some times arrive in (perhaps other orders as well)
> > (eth0 frag1), (eth1 frag1), (eth0 frag2), (eth1 frag2)
> > Without this patch, eth0 frag 1/2 are overwritten by eth1 frag1/2, and
> > packet from eth1 is dropped in the routing code.
> 
> Doesn't that rather fly in the face of the weak-end-system model followed by Linux?
> 
> I can see where scenario one leads to two IP datagrams making it up the stack, 
> but I would have thought that was simply an "accident" of the situation that 
> cannot reasonably be prevented, not justification to cause scenario two to send 
> two datagrams up the stack.

For scenario 2, the routing code drops the 2nd packet.  As a result, no
packet make it to the application.  If someone is willing to suggest an
alternative, I can certainly rework the patch and retest.

Regards,

Steve


^ permalink raw reply

* Re: [PATCH] Multicast packet reassembly can fail
From: Rick Jones @ 2009-10-28 18:10 UTC (permalink / raw)
  To: Steve Chen; +Cc: Mark Huth, netdev
In-Reply-To: <1256752203.3153.461.camel@linux-1lbu>

>>If I understand correctly, the idea here is to say that when multiple interfaces 
>>receive fragments of copies of the same  IP datagram that both copies will 
>>"survive" and flow up the stack?
>>
>>I'm basing that on your description, and an email from Steve that reads:
>>
>>
>>>Actually, the patch tries to prevent packet drop for this exact
>>>scenario.  Please consider the following scenarios
>>>1.  Packet comes in the fragment reassemble code in the following order
>>>(eth0 frag1), (eth0 frag2), (eth1 frag1), (eth1 frag2)
>>>Packet from both interfaces get reassembled and gets further processed.
>>>
>>>2. Packet can some times arrive in (perhaps other orders as well)
>>>(eth0 frag1), (eth1 frag1), (eth0 frag2), (eth1 frag2)
>>>Without this patch, eth0 frag 1/2 are overwritten by eth1 frag1/2, and
>>>packet from eth1 is dropped in the routing code.
>>
>>Doesn't that rather fly in the face of the weak-end-system model followed by Linux?
>>
>>I can see where scenario one leads to two IP datagrams making it up the stack, 
>>but I would have thought that was simply an "accident" of the situation that 
>>cannot reasonably be prevented, not justification to cause scenario two to send 
>>two datagrams up the stack.
> 
> 
> For scenario 2, the routing code drops the 2nd packet.  As a result, no
> packet make it to the application.  If someone is willing to suggest an
> alternative, I can certainly rework the patch and retest.

I'll ask my next potentially Emily Litella question - don't multicast IP 
applications bind to multicast IP addresses and not interfaces?  That is to say, 
doesn't the first datagram completed get delivered to all applications on the 
host which have bound to the corresponding multicast IP (and port number...) ?

rick jones

^ permalink raw reply

* Re: [PATCH] AF_RAW: Augment raw_send_hdrinc to expand skb to fit iphdr->ihl
From: Eric Dumazet @ 2009-10-28 18:13 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, davem
In-Reply-To: <20091028173955.GB7422@hmsreliant.think-freely.org>

Neil Horman a écrit :
> Augment raw_send_hdrinc to correct for incorrect ip header length values
> 
> A series of oopses was reported to me recently.  Apparently when using AF_RAW
> sockets to send data to peers that were reachable via ipsec encapsulation,
> people could panic or BUG halt their systems.
> 
> I've tracked the problem down to user space sending an invalid ip header over an
> AF_RAW socket with IP_HDRINCL set to 1.
> 
> Basically what happens is that userspace sends down an ip frame that includes
> only the header (no data), but sets the ip header ihl value to a large number,
> one that is larger than the total amount of data passed to the sendmsg call.  In
> raw_send_hdrincl, we allocate an skb based on the size of the data in the msghdr
> that was passed in, but assume the data is all valid.  Later during ipsec
> encapsulation, xfrm4_tranport_output moves the entire frame back in the skbuff
> to provide headroom for the ipsec headers.  During this operation, the
> skb->transport_header is repointed to a spot computed by
> skb->network_header + the ip header length (ihl).  Since so little data was
> passed in relative to the value of ihl provided by the raw socket, we point
> transport header to an unknown location, resulting in various crashes.
> 
> So, what to do about this?  My first thought was to simply return -EINVAL, and
> let user space sort it out.  I'm still thinking that might be the best way, but
> I thought I'd try this first, just in case someone has reason to try to
> send such a bogus frame through the kernel.  This solution simply checks the
> value of ihl in raw_send_hdrinc and expands the skb to fit, filling the new
> space with  IPOPT_NOOP options.  I've confirmed that it fixes the crashes that
> were reported.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> 

Thanks a lot for this detailed info, I wish everything could be explained like this !

I believe we should drop the request, since padding it is not what was expected by user.

^ permalink raw reply

* Re: wanPMC-CxT1E1
From: Greg KH @ 2009-10-28 18:11 UTC (permalink / raw)
  To: Bob Beers; +Cc: netdev
In-Reply-To: <4f6ba3b0910271048n10ff37fek9af191b133892e1e@mail.gmail.com>

On Tue, Oct 27, 2009 at 01:48:53PM -0400, Bob Beers wrote:
> On Mon, Oct 26, 2009 at 4:41 PM, Greg KH <greg@kroah.com> wrote:
> > Getting it to build on 2.6.31 is more important than RHEL5, we can't do
> > anything with an old kernel like that.
> 
> ok, so where do I start, I have a system ready to start
>  git cloning, and creating patches. I googled for a while
>  but didn't find a nice recipe for participating in the -staging
>  process.

I'll try to get it to build in the kernel tree today and send you a copy
of the patch to try out.

thanks,

greg k-h

^ permalink raw reply

* Re: [Fwd: Re: [PATCH] Multicast packet reassembly can fail]
From: Steve Chen @ 2009-10-28 18:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: mhuth, netdev
In-Reply-To: <4AE87ECD.7080408@gmail.com>

On Wed, 2009-10-28 at 18:26 +0100, Eric Dumazet wrote:
> Steve Chen a écrit :
> > On Wed, 2009-10-28 at 16:32 +0100, Eric Dumazet wrote:
> >> If each fragment is received twice on host, once by eth0, once by eth1,
> >> should we deliver datagram once or twice ?
> > 
> > The application received it once.  IIRC the duplicate packet is drop in
> > the routing code.
> > 
> >> Once should be enough, even if in the non fragmented case, it will
> >> be delivered twice (kernel cannot detect duplicates, user app might do itself)
> > 
> > Routing code drops the duplicate packet for none-fragmented case as
> > well.
> 
> Really ? How so ? Receiving two copies of the same packet is legal.

I will have to double check exactly where the packet drop happens.  I
thought it was somewhere in routing, but it could be in netfilter.

> 
> > 
> >>
> >>> For this specific case, src/dst address, protocol, IP ID and fragment
> >>> offset are all identical.  The only difference is the ingress interface.
> >>> A good follow up question would be why would anyone in their right mind
> >>> multicast to the same destination?  well, I don't know.  I can not get
> >>> the people who reported the problem to tell me either.   Since someone
> >>> found the need to do this,  perhaps others may find it useful too.
> >>>
> >> Then, if a 2000 bytes message is fragmented in two packets, one coming
> >> from eth0, one coming from eth1, I suspect your patch drops the message,
> >> unless eth0/eth1 are part of a bonding device...
> > 
> > Actually, the patch tries to prevent packet drop for this exact
> > scenario.  Please consider the following scenarios
> > 1.  Packet comes in the fragment reassemble code in the following order
> > (eth0 frag1), (eth0 frag2), (eth1 frag1), (eth1 frag2)
> > Packet from both interfaces get reassembled and gets further processed.
> 
> Yes your patch does this, so each multicast application receives two copies of the
> same datagram.
> 
> > 
> > 2. Packet can some times arrive in (perhaps other orders as well)
> > (eth0 frag1), (eth1 frag1), (eth0 frag2), (eth1 frag2)
> > Without this patch, eth0 frag 1/2 are overwritten by eth1 frag1/2, and
> > packet from eth1 is dropped in the routing code.
> 
> Really ? how so ? I dont see how it can happen, unless you use RPF ?
> 
> current situation should be :
> 
> (eth0 frag1) : We create a context, store frag1 in it
> (eth1 frag1) : We find this context, and drop frag1 since we already have the data
>                   (maybe the bug is here, if we cannot cope with a duplicate ?)
> (eth0 frag2) : We find this context, store frag2 -> complete datagram and deliver it
> (eth1 frag2) : We find context, drop frag2 since datagram was completed.

Yes, this is exactly what is happening in the current code.

> 
>                (or maybe we create a new context that will timeout later, maybe this is your problem ?)
> 
> Net effect : We deliver the datagram correctly.
> 
> 
> > 
> >> That would break common routing setups, using two links to aggregate bandwidth ?
> > 
> > I don't believe it would.  The aggregate bandwidth will work the same as
> > before.  The attributes (src/dst addr, protocol, interface, etc.) should
> > generate a unique hash key.  If hash collision should happen with the
> > addition of iif << 5, the code still compare the original src addr along
> > with interface number, so there should be no issues.
> 
> What about the obvious :
> 
> (eth0 frag1),  (eth1 frag2)
> 
> Your patch creates two contexts since hashes are different,
> that will timeout and no packet delivered at all
> 
I see the point you are making.  I assumed, probably incorrectly, that
since eth0 and eth1 have different IP address.  I would get a complete
series of fragments for each interface.  Perhaps, I should really be
looking up the stack to see why packets were dropped.  Please correct me
if I'm mistaken.  The normal behavior is that application should be
receiving either 2 (scenario 1) or 1 (scenario 2) packets.

Regards,

Steve


^ permalink raw reply

* Re: [PATCH] Multicast packet reassembly can fail
From: Steve Chen @ 2009-10-28 18:40 UTC (permalink / raw)
  To: Rick Jones; +Cc: Mark Huth, netdev
In-Reply-To: <4AE88907.3030206@hp.com>

On Wed, 2009-10-28 at 11:10 -0700, Rick Jones wrote:
> >>If I understand correctly, the idea here is to say that when multiple interfaces 
> >>receive fragments of copies of the same  IP datagram that both copies will 
> >>"survive" and flow up the stack?
> >>
> >>I'm basing that on your description, and an email from Steve that reads:
> >>
> >>
> >>>Actually, the patch tries to prevent packet drop for this exact
> >>>scenario.  Please consider the following scenarios
> >>>1.  Packet comes in the fragment reassemble code in the following order
> >>>(eth0 frag1), (eth0 frag2), (eth1 frag1), (eth1 frag2)
> >>>Packet from both interfaces get reassembled and gets further processed.
> >>>
> >>>2. Packet can some times arrive in (perhaps other orders as well)
> >>>(eth0 frag1), (eth1 frag1), (eth0 frag2), (eth1 frag2)
> >>>Without this patch, eth0 frag 1/2 are overwritten by eth1 frag1/2, and
> >>>packet from eth1 is dropped in the routing code.
> >>
> >>Doesn't that rather fly in the face of the weak-end-system model followed by Linux?
> >>
> >>I can see where scenario one leads to two IP datagrams making it up the stack, 
> >>but I would have thought that was simply an "accident" of the situation that 
> >>cannot reasonably be prevented, not justification to cause scenario two to send 
> >>two datagrams up the stack.
> > 
> > 
> > For scenario 2, the routing code drops the 2nd packet.  As a result, no
> > packet make it to the application.  If someone is willing to suggest an
> > alternative, I can certainly rework the patch and retest.
> 
> I'll ask my next potentially Emily Litella question - don't multicast IP 
> applications bind to multicast IP addresses and not interfaces?  That is to say, 
> doesn't the first datagram completed get delivered to all applications on the 
> host which have bound to the corresponding multicast IP (and port number...) ?
I actually don't know who Emily Litella is until today.  This mailing
list is great not just for learning networking stuff :).  In the test
code I received, one of the step to setup is to configure the IP address
of the interface that the application is expecting the packet.  It
appears to bind on interface based on that casual observation.  I'll
have to study the code in detail to be able to say for sure.

Regards,

Steve



^ permalink raw reply

* [net-2.6 PATCH 2/2] qlge: Fix firmware mailbox command timeout.
From: Ron Mercer @ 2009-10-28 18:39 UTC (permalink / raw)
  To: davem; +Cc: netdev, ron.mercer
In-Reply-To: <1256755161-29606-1-git-send-email-ron.mercer@qlogic.com>

The mailbox command process would only process a maximum of 5 unrelated
firmware events while waiting for it's command completion status.
It should process an unlimited number of events while waiting for a maximum of 5 seconds.

Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
---
 drivers/net/qlge/qlge.h     |    1 +
 drivers/net/qlge/qlge_mpi.c |   23 ++++++++++++-----------
 2 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/drivers/net/qlge/qlge.h b/drivers/net/qlge/qlge.h
index e7285f0..c2383ad 100644
--- a/drivers/net/qlge/qlge.h
+++ b/drivers/net/qlge/qlge.h
@@ -95,6 +95,7 @@ enum {
 
 	/* Misc. stuff */
 	MAILBOX_COUNT = 16,
+	MAILBOX_TIMEOUT = 5,
 
 	PROC_ADDR_RDY = (1 << 31),
 	PROC_ADDR_R = (1 << 30),
diff --git a/drivers/net/qlge/qlge_mpi.c b/drivers/net/qlge/qlge_mpi.c
index 99e58e3..bcf13c9 100644
--- a/drivers/net/qlge/qlge_mpi.c
+++ b/drivers/net/qlge/qlge_mpi.c
@@ -470,7 +470,8 @@ end:
  */
 static int ql_mailbox_command(struct ql_adapter *qdev, struct mbox_params *mbcp)
 {
-	int status, count;
+	int status;
+	unsigned long count;
 
 
 	/* Begin polled mode for MPI */
@@ -491,9 +492,9 @@ static int ql_mailbox_command(struct ql_adapter *qdev, struct mbox_params *mbcp)
 	/* Wait for the command to complete. We loop
 	 * here because some AEN might arrive while
 	 * we're waiting for the mailbox command to
-	 * complete. If more than 5 arrive then we can
+	 * complete. If more than 5 seconds expire we can
 	 * assume something is wrong. */
-	count = 5;
+	count = jiffies + HZ * MAILBOX_TIMEOUT;
 	do {
 		/* Wait for the interrupt to come in. */
 		status = ql_wait_mbx_cmd_cmplt(qdev);
@@ -517,15 +518,15 @@ static int ql_mailbox_command(struct ql_adapter *qdev, struct mbox_params *mbcp)
 					MB_CMD_STS_GOOD) ||
 			((mbcp->mbox_out[0] & 0x0000f000) ==
 					MB_CMD_STS_INTRMDT))
-			break;
-	} while (--count);
+			goto done;
+	} while (time_before(jiffies, count));
 
-	if (!count) {
-		QPRINTK(qdev, DRV, ERR,
-			"Timed out waiting for mailbox complete.\n");
-		status = -ETIMEDOUT;
-		goto end;
-	}
+	QPRINTK(qdev, DRV, ERR,
+		"Timed out waiting for mailbox complete.\n");
+	status = -ETIMEDOUT;
+	goto end;
+
+done:
 
 	/* Now we can clear the interrupt condition
 	 * and look at our status.
-- 
1.6.0.2


^ permalink raw reply related

* [net-2.6 PATCH 1/2] qlge: Fix EEH handling.
From: Ron Mercer @ 2009-10-28 18:39 UTC (permalink / raw)
  To: davem; +Cc: netdev, ron.mercer
In-Reply-To: <1256755161-29606-1-git-send-email-ron.mercer@qlogic.com>

Clean up driver resources without touch the hardware. Add pci
save/restore state.

Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
---
 drivers/net/qlge/qlge_main.c |   78 ++++++++++++++++++++++++++++--------------
 1 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/drivers/net/qlge/qlge_main.c b/drivers/net/qlge/qlge_main.c
index 48b45df..cea7531 100644
--- a/drivers/net/qlge/qlge_main.c
+++ b/drivers/net/qlge/qlge_main.c
@@ -3916,6 +3916,7 @@ static int __devinit ql_init_device(struct pci_dev *pdev,
 		goto err_out;
 	}
 
+	pci_save_state(pdev);
 	qdev->reg_base =
 	    ioremap_nocache(pci_resource_start(pdev, 1),
 			    pci_resource_len(pdev, 1));
@@ -4070,6 +4071,33 @@ static void __devexit qlge_remove(struct pci_dev *pdev)
 	free_netdev(ndev);
 }
 
+/* Clean up resources without touching hardware. */
+static void ql_eeh_close(struct net_device *ndev)
+{
+	int i;
+	struct ql_adapter *qdev = netdev_priv(ndev);
+
+	if (netif_carrier_ok(ndev)) {
+		netif_carrier_off(ndev);
+		netif_stop_queue(ndev);
+	}
+
+	if (test_bit(QL_ADAPTER_UP, &qdev->flags))
+		cancel_delayed_work_sync(&qdev->asic_reset_work);
+	cancel_delayed_work_sync(&qdev->mpi_reset_work);
+	cancel_delayed_work_sync(&qdev->mpi_work);
+	cancel_delayed_work_sync(&qdev->mpi_idc_work);
+	cancel_delayed_work_sync(&qdev->mpi_port_cfg_work);
+
+	for (i = 0; i < qdev->rss_ring_count; i++)
+		netif_napi_del(&qdev->rx_ring[i].napi);
+
+	clear_bit(QL_ADAPTER_UP, &qdev->flags);
+	ql_tx_ring_clean(qdev);
+	ql_free_rx_buffers(qdev);
+	ql_release_adapter_resources(qdev);
+}
+
 /*
  * This callback is called by the PCI subsystem whenever
  * a PCI bus error is detected.
@@ -4078,17 +4106,21 @@ static pci_ers_result_t qlge_io_error_detected(struct pci_dev *pdev,
 					       enum pci_channel_state state)
 {
 	struct net_device *ndev = pci_get_drvdata(pdev);
-	struct ql_adapter *qdev = netdev_priv(ndev);
-
-	netif_device_detach(ndev);
 
-	if (state == pci_channel_io_perm_failure)
+	switch (state) {
+	case pci_channel_io_normal:
+		return PCI_ERS_RESULT_CAN_RECOVER;
+	case pci_channel_io_frozen:
+		netif_device_detach(ndev);
+		if (netif_running(ndev))
+			ql_eeh_close(ndev);
+		pci_disable_device(pdev);
+		return PCI_ERS_RESULT_NEED_RESET;
+	case pci_channel_io_perm_failure:
+		dev_err(&pdev->dev,
+			"%s: pci_channel_io_perm_failure.\n", __func__);
 		return PCI_ERS_RESULT_DISCONNECT;
-
-	if (netif_running(ndev))
-		ql_adapter_down(qdev);
-
-	pci_disable_device(pdev);
+	}
 
 	/* Request a slot reset. */
 	return PCI_ERS_RESULT_NEED_RESET;
@@ -4105,25 +4137,15 @@ static pci_ers_result_t qlge_io_slot_reset(struct pci_dev *pdev)
 	struct net_device *ndev = pci_get_drvdata(pdev);
 	struct ql_adapter *qdev = netdev_priv(ndev);
 
+	pdev->error_state = pci_channel_io_normal;
+
+	pci_restore_state(pdev);
 	if (pci_enable_device(pdev)) {
 		QPRINTK(qdev, IFUP, ERR,
 			"Cannot re-enable PCI device after reset.\n");
 		return PCI_ERS_RESULT_DISCONNECT;
 	}
-
 	pci_set_master(pdev);
-
-	netif_carrier_off(ndev);
-	ql_adapter_reset(qdev);
-
-	/* Make sure the EEPROM is good */
-	memcpy(ndev->perm_addr, ndev->dev_addr, ndev->addr_len);
-
-	if (!is_valid_ether_addr(ndev->perm_addr)) {
-		QPRINTK(qdev, IFUP, ERR, "After reset, invalid MAC address.\n");
-		return PCI_ERS_RESULT_DISCONNECT;
-	}
-
 	return PCI_ERS_RESULT_RECOVERED;
 }
 
@@ -4131,17 +4153,21 @@ static void qlge_io_resume(struct pci_dev *pdev)
 {
 	struct net_device *ndev = pci_get_drvdata(pdev);
 	struct ql_adapter *qdev = netdev_priv(ndev);
+	int err = 0;
 
-	pci_set_master(pdev);
-
+	if (ql_adapter_reset(qdev))
+		QPRINTK(qdev, DRV, ERR, "reset FAILED!\n");
 	if (netif_running(ndev)) {
-		if (ql_adapter_up(qdev)) {
+		err = qlge_open(ndev);
+		if (err) {
 			QPRINTK(qdev, IFUP, ERR,
 				"Device initialization failed after reset.\n");
 			return;
 		}
+	} else {
+		QPRINTK(qdev, IFUP, ERR,
+			"Device was not running prior to EEH.\n");
 	}
-
 	netif_device_attach(ndev);
 }
 
-- 
1.6.0.2


^ permalink raw reply related

* [net-2.6 PATCH 0/2] qlge: Fixes for qlge.
From: Ron Mercer @ 2009-10-28 18:39 UTC (permalink / raw)
  To: davem; +Cc: netdev, ron.mercer

Fix EEH and firmware mailbox command processor.




^ permalink raw reply

* Re: [PATCH] AF_RAW: Augment raw_send_hdrinc to expand skb to fit iphdr->ihl (v2)
From: Neil Horman @ 2009-10-28 18:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem, nhorman
In-Reply-To: <4AE889B5.4040301@gmail.com>

On Wed, Oct 28, 2009 at 07:13:09PM +0100, Eric Dumazet wrote:
> Neil Horman a écrit :
> > Augment raw_send_hdrinc to correct for incorrect ip header length values
> > 
> > A series of oopses was reported to me recently.  Apparently when using AF_RAW
> > sockets to send data to peers that were reachable via ipsec encapsulation,
> > people could panic or BUG halt their systems.
> > 
> > I've tracked the problem down to user space sending an invalid ip header over an
> > AF_RAW socket with IP_HDRINCL set to 1.
> > 
> > Basically what happens is that userspace sends down an ip frame that includes
> > only the header (no data), but sets the ip header ihl value to a large number,
> > one that is larger than the total amount of data passed to the sendmsg call.  In
> > raw_send_hdrincl, we allocate an skb based on the size of the data in the msghdr
> > that was passed in, but assume the data is all valid.  Later during ipsec
> > encapsulation, xfrm4_tranport_output moves the entire frame back in the skbuff
> > to provide headroom for the ipsec headers.  During this operation, the
> > skb->transport_header is repointed to a spot computed by
> > skb->network_header + the ip header length (ihl).  Since so little data was
> > passed in relative to the value of ihl provided by the raw socket, we point
> > transport header to an unknown location, resulting in various crashes.
> > 
> > So, what to do about this?  My first thought was to simply return -EINVAL, and
> > let user space sort it out.  I'm still thinking that might be the best way, but
> > I thought I'd try this first, just in case someone has reason to try to
> > send such a bogus frame through the kernel.  This solution simply checks the
> > value of ihl in raw_send_hdrinc and expands the skb to fit, filling the new
> > space with  IPOPT_NOOP options.  I've confirmed that it fixes the crashes that
> > were reported.
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > 
> 
> Thanks a lot for this detailed info, I wish everything could be explained like this !
> 
You're welcome, this was a fun one to track down :)

> I believe we should drop the request, since padding it is not what was expected by user.

Yeah, I had a feeling.  Ok, version 2, this time drop the invalid frame and
report it to user space, instead of expanding it:


    Augment raw_send_hdrinc to correct for incorrect ip header length values
    
    A series of oopses was reported to me recently.  Apparently when using AF_RAW
    sockets to send data to peers that were reachable via ipsec encapsulation,
    people could panic or BUG halt their systems.
    
    I've tracked the problem down to user space sending an invalid ip header over an
    AF_RAW socket with IP_HDRINCL set to 1.
    
    Basically what happens is that userspace sends down an ip frame that includes
    only the header (no data), but sets the ip header ihl value to a large number,
    one that is larger than the total amount of data passed to the sendmsg call.  In
    raw_send_hdrincl, we allocate an skb based on the size of the data in the msghdr
    that was passed in, but assume the data is all valid.  Later during ipsec
    encapsulation, xfrm4_tranport_output moves the entire frame back in the skbuff
    to provide headroom for the ipsec headers.  During this operation, the
    skb->transport_header is repointed to a spot computed by
    skb->network_header + the ip header length (ihl).  Since so little data was
    passed in relative to the value of ihl provided by the raw socket, we point
    transport header to an unknown location, resulting in various crashes.
    
    This fix for this is pretty straightforward, simply validate the value of of
    iph->ihl when sending over a raw socket.  If (iph->ihl*4U) > user data buffer
    size, drop the frame and return -EINVAL.  I just confirmed this fixes the
    reported crashes.
    
    Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

 raw.c |   24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)


diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 9ef8c08..4b15354 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -351,13 +351,24 @@ static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
 	skb->ip_summed = CHECKSUM_NONE;
 
 	skb->transport_header = skb->network_header;
-	err = memcpy_fromiovecend((void *)iph, from, 0, length);
-	if (err)
-		goto error_fault;
+	err = -EFAULT;
+	if (memcpy_fromiovecend((void *)iph, from, 0, length))
+		goto error_free;
 
-	/* We don't modify invalid header */
 	iphlen = iph->ihl * 4;
-	if (iphlen >= sizeof(*iph) && iphlen <= length) {
+
+	/*
+	 * We don't want to modify the ip header, but we do need to 
+	 * be sure that it won't cause problems later along the network
+	 * stack.  Specifically we want to make sure that iph->ihl is a 
+	 * sane value.  If ihl points beyond the length of the buffer passed
+	 * in, reject the frame as invalid
+	 */
+	err = -EINVAL;
+	if (iphlen > length)
+		goto error_free; 
+ 
+	if (iphlen >= sizeof(*iph)) {
 		if (!iph->saddr)
 			iph->saddr = rt->rt_src;
 		iph->check   = 0;
@@ -380,8 +391,7 @@ static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
 out:
 	return 0;
 
-error_fault:
-	err = -EFAULT;
+error_free:
 	kfree_skb(skb);
 error:
 	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);

^ permalink raw reply related

* Re: iproute uses too small of a receive buffer
From: Patrick McHardy @ 2009-10-28 19:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stephen Hemminger, Ben Greear, NetDev
In-Reply-To: <4AE7F859.7020105@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 753 bytes --]

Eric Dumazet wrote:
> Stephen Hemminger a écrit :
>> Just having larger buffer isn't guarantee of success. Allocating
>> a huge buffer is not going to work on embedded.
>>
> 
> Please note we do not allocate a big buffer, only allow more small skbs
> to be queued on socket receive queue.
> 
> If memory is not available, skb allocation will eventually fail
> and be reported as well, embedded or not.
> 
> I vote for allowing 1024*1024 bytes instead of 32768,
> and eventually user should be warned that it is capped by 
> /proc/sys/net/core/rmem_max

How about this? It will double the receive queue limit on ENOBUFS
up to 1024 * 1024b, then bail out with the normal error message on
further ENOBUFS.

Signed-off-by: Patrick McHardy <kaber@trash.net>

[-- Attachment #2: x --]
[-- Type: text/plain, Size: 894 bytes --]

diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index b68e2fd..e4fda40 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -25,6 +25,8 @@
 
 #include "libnetlink.h"
 
+static int rcvbuf = 32768;
+
 void rtnl_close(struct rtnl_handle *rth)
 {
 	if (rth->fd >= 0) {
@@ -38,7 +40,6 @@ int rtnl_open_byproto(struct rtnl_handle *rth, unsigned subscriptions,
 {
 	socklen_t addr_len;
 	int sndbuf = 32768;
-	int rcvbuf = 32768;
 
 	memset(rth, 0, sizeof(*rth));
 
@@ -407,6 +409,12 @@ int rtnl_listen(struct rtnl_handle *rtnl,
 		if (status < 0) {
 			if (errno == EINTR || errno == EAGAIN)
 				continue;
+			if (errno == ENOBUFS && rcvbuf < 1024 * 1024) {
+				rcvbuf *= 2;
+				if (setsockopt(rtnl->fd, SOL_SOCKET, SO_RCVBUF,
+					       &rcvbuf, sizeof(rcvbuf)) == 0)
+					continue;
+			}
 			fprintf(stderr, "netlink receive error %s (%d)\n",
 				strerror(errno), errno);
 			return -1;

^ permalink raw reply related

* RE: [PATCH] udev: create empty regular files to represent net interfaces
From: Narendra_K @ 2009-10-28 19:15 UTC (permalink / raw)
  To: kay.sievers, Matt_Domsch
  Cc: dannf, linux-hotplug, netdev, Jordan_Hargrave, Charles_Rose,
	bhutchings
In-Reply-To: <ac3eb2510910280123g3c0e3d95wb38a239238906027@mail.gmail.com>


>That all sounds very much like something which will hit us 
>back some day. I'm not sure, if udev should publish such dead 
>text files in /dev, it does not seem to fit the usual 
>APIs/assumptions where /sys and /dev match, and libudev 
>provides access to both. It all sounds more like a database 
>for a possible netdevname library, which does not need to be 
>public in /dev, right?

The char device nodes under /dev/netdev/ do seem to adhere to the
assumption of what is there under /sys and /dev match.

With regards,
Narendra K

^ permalink raw reply

* Re: iproute uses too small of a receive buffer
From: Ben Greear @ 2009-10-28 19:19 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Eric Dumazet, Stephen Hemminger, NetDev
In-Reply-To: <4AE895E8.60308@trash.net>

On 10/28/2009 12:05 PM, Patrick McHardy wrote:
> Eric Dumazet wrote:
>> Stephen Hemminger a écrit :
>>> Just having larger buffer isn't guarantee of success. Allocating
>>> a huge buffer is not going to work on embedded.
>>>
>>
>> Please note we do not allocate a big buffer, only allow more small skbs
>> to be queued on socket receive queue.
>>
>> If memory is not available, skb allocation will eventually fail
>> and be reported as well, embedded or not.
>>
>> I vote for allowing 1024*1024 bytes instead of 32768,
>> and eventually user should be warned that it is capped by
>> /proc/sys/net/core/rmem_max
>
> How about this? It will double the receive queue limit on ENOBUFS
> up to 1024 * 1024b, then bail out with the normal error message on
> further ENOBUFS.
>
> Signed-off-by: Patrick McHardy<kaber@trash.net>

First:  This still pretty much guarantees that messages will be lost when
the program starts (when messages are coming in too large of chunks for small buffers)
If you are debugging something tricky, having lost messages will be
very annoying!

Second:  Why bail on ENOBUFS at all?  I don't see how it helps the user
since they will probably just have to start it again, and will miss more
messages than keeping going would have.

And, even 1MB may not be enough for some scenarios.  So, probably best to
let users over-ride the initial setting on cmd-line.  If not, then use
a large value to start with.

Thanks,
Ben



-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox