netdev.vger.kernel.org archive mirror
* Re: [bug] e100: checksum mismatch on 82551ER rev10
       [not found] <Pine.LNX.4.61.0607311653360.24450@e-smith.charlieb.ott.istop.com>
@ 2006-08-02 16:50 ` Auke Kok
  2006-08-02 17:45   ` Charlie Brady
  2006-08-04 11:04   ` Molle Bestefich
  0 siblings, 2 replies; 8+ messages in thread
From: Auke Kok @ 2006-08-02 16:50 UTC (permalink / raw)
  To: Charlie Brady; +Cc: NetDev, Linux Kernel Mailing List, molle.bestefich

[cc-ing netdev]
[adding original thread authors back, please do not strip CC]

Charlie Brady wrote:
>> Molle Bestefich wrote:
>>> The NICs are working perfectly.
>> How can you tell? Do you know if jumbo frames work correctly? Is the
>> device properly checksumming? is flow control working properly? These
>> and many, many more settings are determined by the EEPROM. Seemingly it
>> may work correctly, but there is no guarantee whatsoever that it will 
>> work
>> correctly at all if the checksum is bad. Again, you can lose data, or
>> worse, you could corrupt memory in the system causing massive failure 
>> (DMA
>> timings, etc). Unlikely? sure, but not impossible.
> 
> Let's assume that these things are all true, and the NIC currently does 
> not work perfectly, just imperfectly, but acceptably. With the recent 
> driver change, it now does not work at all. That's surely a bug in the 
> driver.

There is no logic in that sentence at all. You're saying that the driver is 
broken because it doesn't fix an error in the EEPROM?

We're trying extremely hard to fix real errors here (especially when we find 
that hardware resellers send out hardware with EEPROM problems) and you are 
asking for a workaround that will (likely) introduce random errors and failure 
into your kernel. I do not want to accept responsibility for that and I also 
do not think any other kernel developer would like me to release such a risk 
into the kernel. I'd probably get whistled back instantly :)

If you want to edit your own kernel then I am fine with it. If you want to 
recalculate the checksum yourself and put it in the EEPROM then I am also fine 
with that. As long as you never ask for support for that NIC. But we can't 
support an option that allows all users to willingly enable a piece of 
non-properly-working hardware. Because that is what it is: Not properly 
configured hardware.

The bottom line is that your problem is that a specific hardware vendor is/was 
selling badly configured hardware, and you buy it from them, even after it's 
End Of Lifed for that vendor. Even though that vendor did buy the units 
properly configured and had all the tools needed to configure them properly. I 
can maybe fix your problem by seeing if we can get you an eeprom update, but I 
cannot break everyone else's kernel for that.

Auke





* Re: [bug] e100: checksum mismatch on 82551ER rev10
  2006-08-02 16:50 ` [bug] e100: checksum mismatch on 82551ER rev10 Auke Kok
@ 2006-08-02 17:45   ` Charlie Brady
  2006-08-02 18:30     ` Auke Kok
  2006-08-04 11:04   ` Molle Bestefich
  1 sibling, 1 reply; 8+ messages in thread
From: Charlie Brady @ 2006-08-02 17:45 UTC (permalink / raw)
  To: Auke Kok
  Cc: Charlie Brady, NetDev, Linux Kernel Mailing List, molle.bestefich


On Wed, 2 Aug 2006, Auke Kok wrote:

> [cc-ing netdev]
> [adding original thread authors back, please do not strip CC]

[There were no Cc's visible in the lkml archive I used as source of my 
quotes.]

> Charlie Brady wrote:
>> 
>> Let's assume that these things are all true, and the NIC currently does 
>> not work perfectly, just imperfectly, but acceptably. With the recent 
>> driver change, it now does not work at all. That's surely a bug in the 
>> driver.
>
> There is no logic in that sentence at all. You're saying that the driver is 
> broken because it doesn't fix an error in the EEPROM?

I am not asking the driver to fix errors in the EEPROM. I'm asking it to 
send and receive packets, as it has done in the past.

> We're trying extremely hard to fix real errors here (especially when we find 
> that hardware resellers send out hardware with EEPROM problems) ...

I do not expect the kernel to perform QA tests on my hardware, just work.

> and you are 
> asking for a workaround that will (likely) introduce random errors and 
> failure into your kernel. I do not want to accept responsibility for 
> that ...

You publish your code under the GPL. You explicitly disclaim any warranty.

> If you want to edit your own kernel then I am fine with it.

I suspect that if all/many T23 laptops perform as mine does then some 
major vendors will also edit their kernels. I'm sure they would rather not 
do that.

> If you want to recalculate the checksum yourself and put it in the 
> EEPROM then I am also fine with that.

Can you provide a reference as to how I might do that?

> As long as you never ask for support for that NIC. But we can't support 
> an option that allows all users to willingly enable a piece of 
> non-properly-working hardware. Because that is what it is: Not properly 
> configured hardware.

Which it may be. But it doesn't work at all with the new kernel, where it 
has in the past.

> The bottom line is that your problem is that a specific hardware vendor 
> is/was selling badly configured hardware, and you buy it from them, even 
> after it's End Of Lifed for that vendor. Even though that vendor did buy the 
> units properly configured and had all the tools needed to configure them 
> properly.

I don't think either of us knows that.

> I can maybe fix your problem by seeing if we can get you an eeprom 
> update...

That'd be great. Thanks!

Regards

---
Charlie


* Re: [bug] e100: checksum mismatch on 82551ER rev10
  2006-08-02 17:45   ` Charlie Brady
@ 2006-08-02 18:30     ` Auke Kok
  0 siblings, 0 replies; 8+ messages in thread
From: Auke Kok @ 2006-08-02 18:30 UTC (permalink / raw)
  To: Charlie Brady
  Cc: Auke Kok, NetDev, Linux Kernel Mailing List, molle.bestefich

Charlie Brady wrote:
>>> Let's assume that these things are all true, and the NIC currently 
>>> does not work perfectly, just imperfectly, but acceptably. With the 
>>> recent driver change, it now does not work at all. That's surely a 
>>> bug in the driver.
>>
>> There is no logic in that sentence at all. You're saying that the 
>> driver is broken because it doesn't fix an error in the EEPROM?
> 
> I am not asking the driver to fix errors in the EEPROM. I'm asking it to 
> send and receive packets, as it has done in the past.

Maybe you are confusing e100 with eepro100. e100 has done this since it made 
it into 2.6.4 or so.

Auke


* Re: e100: checksum mismatch on 82551ER rev10
  2006-08-02 16:50 ` [bug] e100: checksum mismatch on 82551ER rev10 Auke Kok
  2006-08-02 17:45   ` Charlie Brady
@ 2006-08-04 11:04   ` Molle Bestefich
  2006-08-04 11:20     ` David Miller
  1 sibling, 1 reply; 8+ messages in thread
From: Molle Bestefich @ 2006-08-04 11:04 UTC (permalink / raw)
  To: Auke Kok; +Cc: Charlie Brady, NetDev, Linux Kernel Mailing List

Auke Kok wrote:
> Charlie Brady wrote:
> > Let's assume that these things are all true, and the NIC currently does
> > not work perfectly, just imperfectly, but acceptably. With the recent
> > driver change, it now does not work at all. That's surely a bug in the
> > driver.
>
> There is no logic in that sentence at all. You're saying that the driver is
> broken because it doesn't fix an error in the EEPROM?

It's broken because it bails completely instead of just emitting a
warning message.

You wouldn't believe the number of hours people spend out there trying
to get a Linux box up when there's no network access.  Bailing out and
completely disabling the hardware on checksum errors is shooting those
people in the foot, because they'll need to try and debug the driver,
or the hardware, or do something else entirely, perhaps on an
embedded device, and you're basically telling them "We at Intel do not
want to allow you to even attempt to make your hardware work.".
By refusing to add an option to NOT bail, you're adding "And we're
happy to handicap any attempts you might make at it.".

> We're trying extremely hard to fix real errors here

You're not fixing anything, you're creating a problem for the user, sorry.

> (especially when we find that hardware resellers send out
> hardware with EEPROM problems) and you are asking for
> a workaround that will (likely) introduce random errors
> and failure into your kernel.

You've established yourself that the most likely cause of the error is
that the vendor forgot to run a checksumming tool.  That's hardly
random errors and failure.  You're trying to pull Linux end users into
a war between Intel and its vendors, so you can make end users scream
at the vendors when they forget to run the checksum tool.  Well,
perhaps you should drop that and instead make it so that the *tools*
bail when the checksum is wrong, not the end user's driver.

> If you want to recalculate the checksum yourself and
> put it in the EEPROM then I am also fine with that.

Could you please provide a method and/or tool to do that?

> But we can't support an option that allows all users to willingly enable
> a piece of non-properly-working hardware.

The tactful thing to do would be to put out a big fat error message
during boot, but not bail.
If you're worried that the end user might not see the message, then
bail, but provide an option to load anyway.
This is the only constructive and meaningful way forward.  There's no
point in holding the end user hostage.

> The bottom line is that your problem is that a specific hardware vendor
> is/was selling badly configured hardware, and you buy it from them, even
> after it's End Of Lifed for that vendor. Even though that vendor did buy the units
> properly configured and had all the tools needed to configure them properly.

There's no way for me to make Nokia do anything about this problem.
Please don't try to drag me into an Intel vs vendors war just for the
purpose of making me a number in their statistics.
(Maybe you could improve your tools so they'll want to fix the checksum.)

> I can maybe fix your problem by seeing if we can get you an eeprom update

Any chance you could get one of those for me?

(Yeah, I do realize that I'm criticizing and then asking for help.  Cocky :-D.)


* Re: e100: checksum mismatch on 82551ER rev10
  2006-08-04 11:04   ` Molle Bestefich
@ 2006-08-04 11:20     ` David Miller
  2006-08-04 11:28       ` David Miller
  0 siblings, 1 reply; 8+ messages in thread
From: David Miller @ 2006-08-04 11:20 UTC (permalink / raw)
  To: molle.bestefich; +Cc: auke-jan.h.kok, charlieb, netdev, linux-kernel

From: "Molle Bestefich" <molle.bestefich@gmail.com>
Date: Fri, 4 Aug 2006 13:04:07 +0200

> You're trying to pull Linux end users into a war between Intel and
> its vendors, so you can make end users scream at the vendors when
> they forget to run the checksum tool.

I totally agree, Intel driver maintainers generally act like complete
idiots in these kinds of situations.

If the EEPROM has a broken checksum, the user should have an option
that allows him to try and use the device anyways, end of story.

It is only self serving to not provide this option to the user.

People make errors, EEPROMs get shipped with bad checksums but the
device might still be usable.  That is life, get over it.


* Re: e100: checksum mismatch on 82551ER rev10
  2006-08-04 11:20     ` David Miller
@ 2006-08-04 11:28       ` David Miller
  2006-08-05 13:28         ` Molle Bestefich
  2006-08-05 17:21         ` Jason Lunz
  0 siblings, 2 replies; 8+ messages in thread
From: David Miller @ 2006-08-04 11:28 UTC (permalink / raw)
  To: molle.bestefich; +Cc: auke-jan.h.kok, charlieb, netdev, linux-kernel

From: David Miller <davem@davemloft.net>
Date: Fri, 04 Aug 2006 04:20:24 -0700 (PDT)

> I totally agree, Intel driver maintainers generally act like complete
> idiots in these kinds of situations.
> 
> If the EEPROM has a broken checksum, the user should have an option
> that allows him to try and use the device anyways, end of story.

And BTW I want to remind the entire world that the last time Intel
cried wolf to all of us about vendors using broken EEPROMs with their
networking chips it turned out to be a bug in one of the patches Intel
put into the Linux driver. :-)

Intel should really humble themselves and help users instead of hinder
them.  Putting the blame on other vendors does not help users, I don't
care how you spin it.  It only serves to make Intel look like a bunch
of control freaks, and that pisses off users to no end.

Please put the option into the e100 driver to allow trying to use the
device even if the EEPROM checksum is wrong.

If an Intel developer doesn't do it, I will.
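
[Illustration, not part of the original message.] The behaviour being
requested in this thread (refuse by default, but let the user opt in to
loading a device whose EEPROM checksum is bad) can be sketched in a few
lines. This is a minimal model in Python, not driver code: the names
probe, allow_bad_checksum and BadEepromChecksum are hypothetical, and the
0xBABA sum target is an assumption taken from my reading of the e100
driver source that should be checked against the driver and datasheet.

```python
import logging

log = logging.getLogger("e100-sketch")

# Assumption: all 16-bit EEPROM words, checksum word included,
# are expected to sum to this value modulo 2**16.
EEPROM_SUM_TARGET = 0xBABA


class BadEepromChecksum(Exception):
    """Raised when the checksum is bad and no override was given."""


def eeprom_sum_ok(words):
    """Return True if the 16-bit word sum matches the expected target."""
    return (sum(words) & 0xFFFF) == EEPROM_SUM_TARGET


def probe(words, allow_bad_checksum=False):
    """Model the probe-time decision discussed in the thread.

    Returns "ok" for a good checksum, "degraded" when the checksum is bad
    but the (hypothetical) override is set, and raises otherwise.
    """
    if eeprom_sum_ok(words):
        return "ok"
    if allow_bad_checksum:
        # Loud warning, but the device is still brought up.
        log.warning("EEPROM checksum bad; continuing at user's request")
        return "degraded"
    log.error("EEPROM checksum bad; refusing to load")
    raise BadEepromChecksum("EEPROM corrupted or misprogrammed")
```

The default stays strict, so nobody gets a possibly misconfigured device
without asking for it, while the override leaves a warning in the log for
anyone debugging later.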


* Re: e100: checksum mismatch on 82551ER rev10
  2006-08-04 11:28       ` David Miller
@ 2006-08-05 13:28         ` Molle Bestefich
  2006-08-05 17:21         ` Jason Lunz
  1 sibling, 0 replies; 8+ messages in thread
From: Molle Bestefich @ 2006-08-05 13:28 UTC (permalink / raw)
  To: David Miller; +Cc: auke-jan.h.kok, charlieb, netdev, linux-kernel

David Miller wrote:
> Please put the option into the e100 driver to allow trying to use the
> device even if the EEPROM checksum is wrong.

Whee, the users win! :-)

> If an Intel developer doesn't do it, I will.

I hope you don't piss off the nice guys at Intel who contribute source
code to the Linux kernel so much that they go away.

For what it's worth, a redistributable utility to fix the EEPROM
checksum would be just as fine a solution (for me)...  if only one was
available.


* Re: e100: checksum mismatch on 82551ER rev10
  2006-08-04 11:28       ` David Miller
  2006-08-05 13:28         ` Molle Bestefich
@ 2006-08-05 17:21         ` Jason Lunz
  1 sibling, 0 replies; 8+ messages in thread
From: Jason Lunz @ 2006-08-05 17:21 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

davem@davemloft.net said:
> And BTW I want to remind the entire world that the last time Intel
> cried wolf to all of us about vendors using broken EEPROMs with their
> networking chips it turned out to be a bug in one of the patches Intel
> put into the Linux driver. :-)
>
> Intel should really humble themselves and help users instead of hinder
> them.  Putting the blame on other vendors does not help users, I don't
> care how you spin it.  It only serves to make Intel look like a bunch
> of control freaks, and that pisses off users to no end.

The real problem here is neither Intel nor users. It's crappy vendor
QA.  I recently had to deal with a batch of e1000 cards that had the
*wrong* EEPROMs, with *correct* checksums.

So of course the driver didn't complain - never mind the fact that the
EEPROMs might claim you have a copper card when it's really fiber. And
that's best case, because it fails obviously. Far worse is when an
EEPROM is close enough to "work", but claims the wrong chipset revision
and causes the driver to do totally wrong things in strange
circumstances.

I think this is what Auke is worried about. If you can't trust the
EEPROM, all sorts of maddeningly subtle things can go wrong. And it
isn't likely to be properly diagnosed by an end user.

The sad thing is that the checksum can only protect against a subset of
EEPROM problems. But it does help. As a counterexample, a power failure
last weekend corrupted the EEPROM of the onboard e100 in one of my
servers, and this EEPROM check led to an immediate diagnosis of the
problem.

> Please put the option into the e100 driver to allow trying to use the
> device even if the EEPROM checksum is wrong.

There is already support for EEPROM read/write in ethtool. I used it to
fix the e1000 cards in question. If e100 implements ethtool -E, all
that's needed is documentation on where in the EEPROM the checksum is
stored and how to calculate it. I don't doubt the freely-available pdfs
for e100 chipsets cover this.

Jason
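
[Illustration, not part of the original message.] For readers wondering
what "how to calculate it" might look like: the sketch below assumes the
convention I believe the e100 driver uses, namely that the checksum is the
last 16-bit word of the EEPROM and is chosen so that all words sum to
0xBABA modulo 2**16. Treat both the location and the constant as
assumptions to verify against the driver source and the chipset datasheet
before writing anything to hardware. The function names are hypothetical.
A raw dump would typically come from something like "ethtool -e ethX raw
on", assuming the driver supports EEPROM reads.

```python
# Assumption: all 16-bit words, checksum word included, sum to this
# value modulo 2**16.
EEPROM_SUM_TARGET = 0xBABA


def words_from_bytes(raw):
    """Decode a raw EEPROM dump into 16-bit words (little-endian assumed)."""
    if len(raw) % 2:
        raise ValueError("EEPROM dump must contain an even number of bytes")
    return [int.from_bytes(raw[i:i + 2], "little")
            for i in range(0, len(raw), 2)]


def expected_checksum(words):
    """Return the checksum word that the image *should* carry.

    The last element of `words` is treated as the checksum slot and is
    ignored when summing.
    """
    partial = sum(words[:-1]) & 0xFFFF
    return (EEPROM_SUM_TARGET - partial) & 0xFFFF


def checksum_ok(words):
    """Return True if the stored checksum word matches the computed one."""
    return words[-1] == expected_checksum(words)


if __name__ == "__main__":
    # Tiny made-up image: two data words plus a wrong checksum word.
    image = [0x1234, 0x5678, 0x0000]
    print("stored checksum ok:", checksum_ok(image))
    print("checksum should be: 0x%04X" % expected_checksum(image))
```

Writing the corrected word back would then be a job for ethtool -E, with
the byte offset of the last word and whatever magic value the driver
requires; those details are deliberately left out here.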

