Sym53C8xx Driver Hardening

Linux SCSI subsystem development

* Sym53C8xx Driver Hardening
@ 2002-07-23 13:29 Isabelle, Francois
  2002-07-23 13:57 ` Rogier Wolff
  0 siblings, 1 reply; 9+ messages in thread
From: Isabelle, Francois @ 2002-07-23 13:29 UTC (permalink / raw)
  To: Linux SCSI list (E-mail)

Hi,
    Is anyone working on driver hardening as defined by the TLT/CGL
specifications?

TLT: Telecom Linux Technology
CGL: Open Source Development Lab - Carrier Grade Linux

These are two projects to define a "highly available" and "easily embeddable"
Linux to replace proprietary telecom platforms.
The specifications include a section on "driver hardening".

Some of the features:
 - Panic removal
 - Standardized event logging
 - Parameter checking ...

The white paper from Intel:
http://cedar.intel.com/cgi-bin/ids.dll/content/content.jsp?cntKey=Generic%20Editorial::linux_hardening&cntType=IDS_EDITORIAL&catCode=0

Here is a quote from A.C. on the linux-ha list:
> What I've seen so far on the hardening proposals varies between the
> comical and the well-intentioned but ugly intel code. I've not seen
> public montavista work so it would be inappropriate for me to comment on it.

As you can see, it is not unanimously accepted...
It's just not bazaar-like enough, I suppose!

Anyway, there is some good in such standardization, and if it produces better
drivers for Linux, why not?

So, to get back to the topic:
	has anyone started hardening the sym53c8xx driver?

It might already be foolproof, but it most certainly lacks standard event
logging.

Any comments on CGL/TLT are appreciated.
Thank you

Francois Isabelle
Kontron Canada Inc.


* Re: Sym53C8xx Driver Hardening
  2002-07-23 13:29 Sym53C8xx Driver Hardening Isabelle, Francois
@ 2002-07-23 13:57 ` Rogier Wolff
  2002-07-23 15:24   ` Alan Cox
  0 siblings, 1 reply; 9+ messages in thread
From: Rogier Wolff @ 2002-07-23 13:57 UTC (permalink / raw)
  To: Isabelle, Francois; +Cc: linux-scsi

Isabelle, Francois wrote:
> Some of the features:
>  - Panic removal
>  - Standardized event logging
>  - Parameter checking ...

Panic removal is NOT a good thing.

A panic should only happen when the driver/kernel/whatever CANNOT
continue because of some serious problem.

A panic should occur in cases like:

	b = 2 + 3;

	....

	if (b != 5) panic (); 

Now it won't be as easy as this. But for instance in my firestream
driver, you sometimes put a value in a register in the chip, and if
later on you read it back, you want the chip to have left it
unmodified, or to have it changed in a predictable way. If the value
is unexpected, a panic is the right "way out". 

If this is NOT done, the chip may be DMA-ing data into the wrong spot
in main memory. This will not panic your system, but crash it (*). A
panic can be configured to "do the right thing": reboot, and be back
in business as soon as possible. An operator should then be notified
and somehow take action.

Removing panics sounds like a manager's solution to the problem: a
system that panics is not reliable, so we should remove the panics.

Well, if the code panics for the right reason, then the panic is
preferable to continuing.

I'm convinced that most Linux code is well written in this respect.

			Rogier. 

(*) By "crash" I mean a state where the system ends up wedged, with
only a hardware watchdog able to get it out of that state.

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots. 
* There are also old, bald pilots. 


* Re: Sym53C8xx Driver Hardening
  2002-07-23 13:57 ` Rogier Wolff
@ 2002-07-23 15:24   ` Alan Cox
  2002-07-23 15:11     ` random1
                       ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Alan Cox @ 2002-07-23 15:24 UTC (permalink / raw)
  To: Rogier Wolff; +Cc: Isabelle, Francois, linux-scsi

On Tue, 2002-07-23 at 14:57, Rogier Wolff wrote:
> Now it won't be as easy as this. But for instance in my firestream
> driver, you sometimes put a value in a register in the chip, and if
> later on you read it back, you want the chip to have left it
> unmodified, or to have it changed in a predictable way. If the value
> is unexpected, a panic is the right "way out". 

The high reliability people take a different view. I actually agree with
them. It isn't about "oops didn't happen"; it is about controlling the
failure case.

Suppose your firestream driver reports a cataclysmic internal error
status. Their argument is not that you should pretend life is good but
that the driver should log a fault and shut off the chip as best it can.
So you might have a firestream_failed() function which did:

	Disable master bit
	Put board into D3
	Wait
	Put board into running state
	Try to reset and configure it
	If this fails shove it in D3 and give up

At this point the high reliability system is servicing the other links
it manages and flashing warning lights to the engineers, rather than
being completely down.
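
As a rough illustration of the sequence above (a userspace sketch, not code
from the firestream driver or from anyone in this thread), the steps could be
modelled in C like this. struct fake_dev, settle_delay(),
try_reset_and_configure() and device_failed() are hypothetical names made up
for the example; a real driver would go through the kernel's PCI and power
management interfaces instead of flipping fields in a struct.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical power states and device model; not a real kernel API. */
enum power_state { STATE_D0, STATE_D3 };

struct fake_dev {
    const char *name;
    bool bus_master;          /* the "master bit" */
    enum power_state power;
    bool reset_works;         /* simulates whether reset + configure succeeds */
    bool failed;              /* set once the driver has given up */
};

/* Stand-in for a real delay; a driver would sleep here. */
void settle_delay(void)
{
}

/* Simulated reset-and-configure step: 0 on success, -1 on failure. */
int try_reset_and_configure(struct fake_dev *dev)
{
    if (!dev->reset_works)
        return -1;
    dev->bus_master = true;
    return 0;
}

/* Returns 0 if the device recovered, -1 if it was parked in D3. */
int device_failed(struct fake_dev *dev, const char *why)
{
    fprintf(stderr, "%s: fault: %s\n", dev->name, why);  /* log a fault */

    dev->bus_master = false;   /* disable master bit */
    dev->power = STATE_D3;     /* put board into D3 */
    settle_delay();            /* wait */
    dev->power = STATE_D0;     /* put board into running state */

    if (try_reset_and_configure(dev) == 0)
        return 0;              /* reset and configure worked */

    /* Recovery failed: shove it in D3 and give up. */
    dev->bus_master = false;
    dev->power = STATE_D3;
    dev->failed = true;
    fprintf(stderr, "%s: recovery failed, device disabled\n", dev->name);
    return -1;
}
```

The caller gets a return value to act on (keep servicing the other links,
raise an alarm) instead of the whole machine going down.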



* Re: Sym53C8xx Driver Hardening
  2002-07-23 15:24   ` Alan Cox
@ 2002-07-23 15:11     ` random1
  2002-07-23 15:38     ` Rogier Wolff
  2002-07-24 23:11     ` Gérard Roudier
  2 siblings, 0 replies; 9+ messages in thread
From: random1 @ 2002-07-23 15:11 UTC (permalink / raw)
  To: linux-scsi

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
> The high reliability people take a different view. I actually agree with
> them. It isn't about "oops didn't happen"; it is about controlling the
> failure case.

"graceful degradation" -- it's been being taught in CS curricula for
*at least* twenty years.  I'm disturbed by all the places where people
have chosen, when faced with the unexpected, to just panic rather
than find a way to get back to some semblance of a running system.

I don't want some goofy UPS hanging off a USB bus, for instance, to
take down a critical storage server.


* Re: Sym53C8xx Driver Hardening
  2002-07-23 15:24   ` Alan Cox
  2002-07-23 15:11     ` random1
@ 2002-07-23 15:38     ` Rogier Wolff
  2002-07-24 23:11     ` Gérard Roudier
  2 siblings, 0 replies; 9+ messages in thread
From: Rogier Wolff @ 2002-07-23 15:38 UTC (permalink / raw)
  To: Alan Cox; +Cc: Rogier Wolff, Isabelle, Francois, linux-scsi

Alan Cox wrote:
> On Tue, 2002-07-23 at 14:57, Rogier Wolff wrote:
> > Now it won't be as easy as this. But for instance in my firestream
> > driver, you sometimes put a value in a register in the chip, and if
> > later on you read it back, you want the chip to have left it
> > unmodified, or to have it changed in a predictable way. If the value
> > is unexpected, a panic is the right "way out". 
> 
> The high reliability people take a different view. I actually agree with
> them. It isn't about "oops didn't happen"; it is about controlling the
> failure case.
> 
> Suppose your firestream driver reports a cataclysmic internal error
> status. Their argument is not that you should pretend life is good but
> that the driver should log a fault and shut off the chip as best it can.
> So you might have a firestream_failed() function which did:
> 
> 	Disable master bit
> 	Put board into D3
> 	Wait
> 	Put board into running state
> 	Try to reset and configure it
> 	If this fails shove it in D3 and give up
> 
> At this point the high reliability system is servicing the other links
> it manages and flashing warning lights to the engineers, rather than
> being completely down.

That might indeed be preferable. However, the "wild DMA" may have
corrupted users' data and/or the kernel's data structures. So
continuing may make a bad situation worse...

Maybe we want to generalize "panic" so that you pass it a pointer to a
"shutdown this hardware" routine, moving the "policy" decision about
what to do to a central, user-definable place.

Userspace would then be notified: "We shut down atm0 due to an
irrecoverable error". And userspace can then decide to kick the device
as you suggest above.

Or, I could configure it to do an immediate reboot, with/without
attempting to sync disks....
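
To make the idea a bit more concrete, here is a small userspace sketch of
such a generalized "panic". Nothing in it is an existing kernel interface:
device_panic(), enum fault_policy, active_policy, struct atm_port and
atm_port_shutdown() are all hypothetical names invented for illustration,
the userspace notification is only a message on stderr, and the "reboot" is
just a flag so that the example is safe to run.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical central policy; an administrator would choose the value. */
enum fault_policy { POLICY_SHUTDOWN_DEVICE, POLICY_REBOOT };
enum fault_policy active_policy = POLICY_SHUTDOWN_DEVICE;

/* Set instead of really rebooting, so the sketch is safe to run. */
bool reboot_requested = false;

/* Driver-supplied "shut down this hardware" routine. */
typedef void (*hw_shutdown_fn)(void *dev);

/* Toy device standing in for something like atm0. */
struct atm_port {
    const char *name;
    bool enabled;
};

void atm_port_shutdown(void *dev)
{
    struct atm_port *port = dev;
    port->enabled = false;
}

/*
 * Generalized "panic": always quiesce the failing hardware first, then let
 * the central policy decide what happens next.
 */
void device_panic(const char *name, hw_shutdown_fn shutdown, void *dev)
{
    if (shutdown)
        shutdown(dev);

    if (active_policy == POLICY_REBOOT) {
        fprintf(stderr, "%s: irrecoverable error, reboot requested\n", name);
        reboot_requested = true;
        return;
    }

    /* Stand-in for notifying userspace. */
    fprintf(stderr, "We shut down %s due to an irrecoverable error\n", name);
}
```

A driver would call device_panic(port->name, atm_port_shutdown, port) where
it calls panic() today; whether that ends in a disabled device or a reboot
is then decided in one place rather than in each driver.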

		Roger. 

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots. 
* There are also old, bald pilots. 


* Re: Sym53C8xx Driver Hardening
  2002-07-23 15:24   ` Alan Cox
  2002-07-23 15:11     ` random1
  2002-07-23 15:38     ` Rogier Wolff
@ 2002-07-24 23:11     ` Gérard Roudier
  2002-07-25 22:33       ` Jeremy Higdon
  2 siblings, 1 reply; 9+ messages in thread
From: Gérard Roudier @ 2002-07-24 23:11 UTC (permalink / raw)
  To: Alan Cox; +Cc: Rogier Wolff, Isabelle, Francois, linux-scsi

On 23 Jul 2002, Alan Cox wrote:

> On Tue, 2002-07-23 at 14:57, Rogier Wolff wrote:
> > Now it won't be as easy as this. But for instance in my firestream
> > driver, you sometimes put a value in a register in the chip, and if
> > later on you read it back, you want the chip to have left it
> > unmodified, or to have it changed in a predictable way. If the value
> > is unexpected, a panic is the right "way out".
>
> The high reliability people take a different view. I actually agree with
> them. It isn't about "oops didn't happen"; it is about controlling the
> failure case.
>
> Suppose your firestream driver reports a cataclysmic internal error
> status. Their argument is not that you should pretend life is good but
> that the driver should log a fault and shut off the chip as best it can.
> So you might have a firestream_failed() function which did:
>
> 	Disable master bit
> 	Put board into D3
> 	Wait
> 	Put board into running state
> 	Try to reset and configure it
> 	If this fails shove it in D3 and give up
>
> At this point the high reliability system is servicing the other links
> it manages and flashing warning lights to the engineers, rather than
> being completely down.

By the way, the sym53c8xx_2 driver (and probably version 1 too) never
intentionally panics the system on hardware failure detection.
The couple of calls to panic you can see in the driver are related to
unexpected software situations.
In my opinion, serious high reliability requires special hardware support.

When a serious hardware error is detected, the driver simply tries to reset
everything it can in order to have a chance to restart operations
properly. Maybe it should count such events and give up in some way after
some given number of retries. If by 'give up' you just mean disabling
the device, this should be doable, but it will not make upper layers
aware of the situation and will certainly not be what most users
expect. This MEANS that serious high reliability ALSO requires
special SOFTWARE support and USER utilities, and certainly CANNOT be
based ONLY on some questionable trivial tinkering in device drivers, in my
opinion.
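
The "count such events and give up" idea could look roughly like the
following. This is only a sketch and not code from sym53c8xx: struct
hba_state, hba_note_hw_error() and the MAX_HW_ERRORS threshold of 3 are
assumptions made up for the example.

```c
#include <stdbool.h>

/* Hypothetical limit on reset attempts before the driver gives up. */
#define MAX_HW_ERRORS 3

/* Hypothetical per-adapter bookkeeping. */
struct hba_state {
    unsigned int hw_errors;   /* serious hardware errors seen so far */
    bool disabled;            /* true once the driver has given up */
};

/*
 * Record one serious hardware error.
 * Returns true if the driver should reset and retry, false if it has
 * given up and the device should stay disabled.
 */
bool hba_note_hw_error(struct hba_state *hba)
{
    if (hba->disabled)
        return false;

    if (++hba->hw_errors > MAX_HW_ERRORS) {
        hba->disabled = true;
        return false;
    }
    return true;
}
```

The counter alone does nothing to tell upper layers or the user that the
device is gone, which is the part that needs the additional software
support mentioned above.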

  Gérard.

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Sym53C8xx Driver Hardening
  2002-07-24 23:11     ` Gérard Roudier
@ 2002-07-25 22:33       ` Jeremy Higdon
  2002-07-25 23:53         ` Alan Cox
  2002-07-26 22:25         ` Gérard Roudier
  0 siblings, 2 replies; 9+ messages in thread
From: Jeremy Higdon @ 2002-07-25 22:33 UTC (permalink / raw)
  To: Gérard Roudier, Alan Cox
  Cc: Rogier Wolff, Isabelle, Francois, linux-scsi

On Jul 25,  1:11am, Gérard Roudier wrote:
> 
> By the way, the sym53c8xx_2 driver (and probably version 1 too) never
> intentionally panics the system on hardware failure detection.
> The couple of calls to panic you can see in the driver are related to
> unexpected software situations.
> In my opinion, serious high reliability requires special hardware support.
> 
> When a serious hardware error is detected, the driver simply tries to reset
> everything it can in order to have a chance to restart operations
> properly. Maybe it should count such events and give up in some way after
> some given number of retries. If by 'give up' you just mean disabling
> the device, this should be doable, but it will not make upper layers
> aware of the situation and will certainly not be what most users
> expect. This MEANS that serious high reliability ALSO requires
> special SOFTWARE support and USER utilities, and certainly CANNOT be
> based ONLY on some questionable trivial tinkering in device drivers, in my
> opinion.
> 
>   Gérard.


Also, with some sorts of errors, one might suspect that the device has
DMA'd data into the wrong place in memory.

Generally, I think a crash is preferable to data corruption (i.e. no
answer is better than a wrong answer), so in such cases, the driver
might want to panic.

If hardware can place a fence around DMA (such that DMA to unintended
locations is impossible), then you might not want to panic in such
situations, assuming that you can retry.

jeremy


* Re: Sym53C8xx Driver Hardening
  2002-07-25 22:33       ` Jeremy Higdon
@ 2002-07-25 23:53         ` Alan Cox
  2002-07-26 22:25         ` Gérard Roudier
  1 sibling, 0 replies; 9+ messages in thread
From: Alan Cox @ 2002-07-25 23:53 UTC (permalink / raw)
  To: Jeremy Higdon
  Cc: Gérard Roudier, Rogier Wolff, Isabelle, Francois, linux-scsi

On Thu, 2002-07-25 at 23:33, Jeremy Higdon wrote:
> Also, with some sorts of errors, one might suspect that the device has
> DMA'd data into the wrong place in memory.

It depends on the environment. In telco stuff it's often better to sound
the alarms and pray. Gérard is right that policy must be configurable.
 
> If hardware can place a fence around DMA (such that DMA to unintended
> locations is impossible), then you might not want to panic in such
> situations, assuming that you can retry.

Doable on real computers with an IOMMU.



* Re: Sym53C8xx Driver Hardening
  2002-07-25 22:33       ` Jeremy Higdon
  2002-07-25 23:53         ` Alan Cox
@ 2002-07-26 22:25         ` Gérard Roudier
  1 sibling, 0 replies; 9+ messages in thread
From: Gérard Roudier @ 2002-07-26 22:25 UTC (permalink / raw)
  To: Jeremy Higdon; +Cc: Alan Cox, Rogier Wolff, Isabelle, Francois, linux-scsi



On Thu, 25 Jul 2002, Jeremy Higdon wrote:

> On Jul 25,  1:11am, Gérard Roudier wrote:
> >
> > By the way, the sym53c8xx_2 driver (and probably version 1 too) never
> > intentionally panics the system on hardware failure detection.
> > The couple of calls to panic you can see in the driver are related to
> > unexpected software situations.
> > In my opinion, serious high reliability requires special hardware support.
> >
> > When a serious hardware error is detected, the driver simply tries to reset
> > everything it can in order to have a chance to restart operations
> > properly. Maybe it should count such events and give up in some way after
> > some given number of retries. If by 'give up' you just mean disabling
> > the device, this should be doable, but it will not make upper layers
> > aware of the situation and will certainly not be what most users
> > expect. This MEANS that serious high reliability ALSO requires
> > special SOFTWARE support and USER utilities, and certainly CANNOT be
> > based ONLY on some questionable trivial tinkering in device drivers, in my
> > opinion.
> >
> >   Gérard.
>
>
> Also, with some sorts of errors, one might suspect that the device has
> DMA'd data into the wrong place in memory.
>
> Generally, I think a crash is preferable to data corruption (i.e. no
> answer is better than a wrong answer), so in such cases, the driver
> might want to panic.

The sym53c8xx driver ensures that all the path error checks it has under
its control are enabled. If the driver gets the error, then this ideally
means that it is not a fatal system error, or that the user or O/S does
not want it to be handled as one. Otherwise, some NMI should occur and the
system should halt. So, my opinion is that a hardware error that could be
handled as a system error but is not should be considered an invitation to
try to recover. Hence, the driver's response is the appropriate one, in
my opinion.

> If hardware can place a fence around DMA (such that DMA to unintended
> locations is impossible), then you might not want to panic in such
> situations, assuming that you can retry.

You may be dreaming there. At least, I have never heard of such magic in
the PCI world.

Regards,
  Gérard.


Thread overview: 9+ messages
2002-07-23 13:29 Sym53C8xx Driver Hardening Isabelle, Francois
2002-07-23 13:57 ` Rogier Wolff
2002-07-23 15:24   ` Alan Cox
2002-07-23 15:11     ` random1
2002-07-23 15:38     ` Rogier Wolff
2002-07-24 23:11     ` Gérard Roudier
2002-07-25 22:33       ` Jeremy Higdon
2002-07-25 23:53         ` Alan Cox
2002-07-26 22:25         ` Gérard Roudier
