Re: AIC7xxx kernel problem with 2.4.2[234] kernels

public inbox for linux-scsi@vger.kernel.org

* Re: AIC7xxx kernel problem with 2.4.2[234] kernels
@ 2004-01-19 13:32 Xose Vazquez Perez
  2004-01-19 17:21 ` James Bottomley
  0 siblings, 1 reply; 11+ messages in thread
From: Xose Vazquez Perez @ 2004-01-19 13:32 UTC (permalink / raw)
  To: linux-kernel, Tosatti, linux-scsi, Justin T. Gibbs

Marcelo Tosatti wrote:

> About the aic7xxx update, well, I believe aic7xxx 6.2.36 is pretty stable
> (I don't remember seeing any reliable bug report and I also can't find one
> in lkml archives) except this one (and a pair of "lockup on initialization
> with SMP").

Justin already put updates in BK, but James did not like the "new error recovery"
code. So the kernel driver is *SIX months* behind the ADAPTEC driver release.

This linux-scsi thread has more info on why the patch was not applied:
http://marc.theaimsgroup.com/?l=linux-scsi&m=107228516327580&w=2

It looks like the _kernel_ driver is going to be without a maintainer
unless somebody works on it, porting ADAPTEC fixes/features to the kernel driver.


> What bugs are you aware of in 2.4's aic7xxx ?

aic7xxx/aic79xx CHANGELOG has info about all bugs fixed:

o Adaptec Aic7xxx

Version History:

   6.3.4 (December 22nd, 2003)
        - Provide a better description string for the 2915/30LP.
        - Sniff sense information returned by targets for unit
          attention errors that may indicate that the device has
          been changed.  If we see such status for non Domain
          Validation related commands, start a DV scan for the
          target.  In the past, DV would only occur for hot-plugged
          devices if no target had been previously probed for a
          particular ID.  This change guarantees that the DV process
          will occur even if the user swaps devices without any
          intervening I/O to tell us that a device has gone missing.
          The old behavior, among other things, would fail to spin up
          drives that were hot-plugged since the Linux mid-layer
          will only spin-up drives on initial attach.

   6.3.3 (November 6th, 2003)
        - Support the 2.6.0-test9 kernel
        - Fix rare deadlock caused by using del_timer_sync from within
          a timer handler.

   6.3.2 (October 28th, 2003)
        - Enforce a bus settle delay for bus resets that the
          driver initiates.
        - Fall back to basic DV for U160 devices that lack an
          echo buffer.
        - Correctly detect that left over BIOS data has not
          been initialized when the CHPRST status bit is set
          during driver initialization.

   6.3.1 (October 21st, 2003)
        - Fix a compiler error when building with only EISA or PCI
          support compiled into the kernel.
        - Add chained dependencies to both the driver and aicasm Makefiles
          to avoid problems with parallel builds.
        - Move additional common routines to the aiclib OSM library
          to reduce code duplication.
        - Fix a bug in the testing of the AHC_TMODE_WIDEODD_BUG that
          could cause target mode operations to hang.
        - Leave removal of softcs from the global list of softcs to
          the OSM.  This allows us to avoid holding the list_lock during
          device destruction.

   6.3.0 (September 8th, 2003)
        - Move additional common routines to the aiclib OSM library
          to reduce code duplication.
        - Bump minor number to reflect change in error recovery strategy.

   6.2.38 (August 31st, 2003)
        - Avoid an inadvertent reset of the controller during the
          memory mapped I/O test should the controller be left in
          the reset state prior to driver initialization.  On some
          systems, this extra reset resulted in a system hang due
          to a chip access that occurred too soon after reset.
        - Move additional common routines to the aiclib OSM library
          to reduce code duplication.
        - Add magic sysrq handler that causes a card dump to be output
          to the console for each controller.

   6.2.37 (August 12th, 2003)
        - Perform timeout recovery within the driver instead of relying
          on the Linux SCSI mid-layer to perform this function.  The
          mid-layer does not know the full state of the SCSI bus and
          is therefore prone to looping for several minutes to effect
          recovery.  The new scheme recovers within 15 seconds of the
          failure.
        - Support writing 93c56/66 SEEPROM on newer cards.
        - Avoid clearing ENBUSFREE during single stepping to avoid
          spurious "unexpected busfree while idle" messages.
        - Enable the use of the "Auto-Access-Pause" feature on the
          aic7880 and aic7870 chips.  It was disabled due to an
          oversight.  Using this feature drastically reduces command
          delivery latency.

   6.2.36 **KERNEL DRIVER**


o Adaptec Aic79xx

Version History:

   2.0.5 (December 22nd, 2003)
        - Correct a bug preventing the driver from renegotiating
          during auto-request operations when a check condition
          occurred for a zero length command.
        - Sniff sense information returned by targets for unit
          attention errors that may indicate that the device has
          been changed.  If we see such status for non Domain
          Validation related commands, start a DV scan for the
          target.  In the past, DV would only occur for hot-plugged
          devices if no target had been previously probed for a
          particular ID.  This change guarantees that the DV process
          will occur even if the user swaps devices without any
          intervening I/O to tell us that a device has gone missing.
          The old behavior, among other things, would fail to spin up
          drives that were hot-plugged since the Linux mid-layer
          will only spin-up drives on initial attach.
        - Correct several issues in the rundown of the good status
          FIFO during error recovery.  The typical failure scenario
          evidenced by this defect was the loss of several commands
          under high load when several queue full conditions occurred
          back to back.

   2.0.4 (November 6th, 2003)
        - Support the 2.6.0-test9 kernel
        - Fix rare deadlock caused by using del_timer_sync from within
          a timer handler.

   2.0.3 (October 21st, 2003)
        - On 7902A4 hardware, use the slow slew rate for transfer
          rates slower than U320.  This behavior matches the Windows
          driver.
        - Fix some issues with the ahd_flush_qoutfifo() routine.
        - Add a delay in the loop waiting for selection activity
          to cease.  Otherwise we may exhaust the loop counter too
          quickly on fast machines.
        - Return to processing bad status completions through the
          qoutfifo.  This reduces the amount of time the controller
          is paused for these kinds of errors.
        - Move additional common routines to the aiclib OSM library
          to reduce code duplication.
        - Leave removal of softcs from the global list of softcs to
          the OSM.  This allows us to avoid holding the list_lock during
          device destruction.
        - Enforce a bus settle delay for bus resets that the
          driver initiates.
        - Fall back to basic DV for U160 devices that lack an
          echo buffer.

   2.0.2 (September 4th, 2003)
        - Move additional common routines to the aiclib OSM library
          to reduce code duplication.
        - Avoid an inadvertent reset of the controller during the
          memory mapped I/O test should the controller be left in
          the reset state prior to driver initialization.  On some
          systems, this extra reset resulted in a system hang due
          to a chip access that occurred too soon after reset.
        - Correct an endian bug in ahd_swap_with_next_hscb.  This
          corrects strong-arm support.
        - Reset the bus for transactions that timeout waiting for
          the bus to go free after a disconnect or command complete
          message.

   2.0.1 (August 26th, 2003)
        - Add magic sysrq handler that causes a card dump to be output
          to the console for each controller.
        - Avoid waking the mid-layer's error recovery handler during
          timeout recovery by returning DID_ERROR instead of DID_TIMEOUT
          for timed-out commands that have been aborted.
        - Move additional common routines to the aiclib OSM library
          to reduce code duplication.

   2.0.0 (August 20th, 2003)
        - Remove MMAPIO definition and allow memory mapped
          I/O for any platform that supports PCI.
        - Avoid clearing ENBUSFREE during single stepping to avoid
          spurious "unexpected busfree while idle" messages.
        - Correct deadlock in ahd_run_qoutfifo() processing.
        - Optimize support for the 7901B.
        - Correct a few cases where an explicit flush of pending
          register writes was required to ensure accuracy in delays.
        - Correct problems in manually flushing completed commands
          on the controller.  The FIFOs are now flushed to ensure
          that completed commands that are still draining to the
          host are completed correctly.
        - Correct incomplete CDB delivery detection on the 790XB.
        - Ignore the cmd->underflow field since userland applications
          using the legacy command pass-thru interface do not set
          it correctly.  Honoring this field led to spurious errors
          when users used the "scsi_unique_id" program.
        - Perform timeout recovery within the driver instead of relying
          on the Linux SCSI mid-layer to perform this function.  The
          mid-layer does not know the full state of the SCSI bus and
          is therefore prone to looping for several minutes to effect
          recovery.  The new scheme recovers within 15 seconds of the
          failure.
        - Correct support for manual termination settings.
        - Increase maximum wait time for serial eeprom writes allowing
          writes to function correctly.

   1.3.12 (August 11, 2003)
        - Implement new error recovery thread that supersedes the existing
          Linux SCSI error recovery code.
        - Fix termination logic for 29320ALP.
        - Fix SEEPROM delay to compensate for write ops taking longer.

   1.3.11 (July 11, 2003)
        - Fix several deadlock issues.
        - Add 29320ALP and 39320B IDs.

   1.3.10 **KERNEL DRIVER**





* Re: AIC7xxx kernel problem with 2.4.2[234] kernels
  2004-01-19 13:32 AIC7xxx kernel problem with 2.4.2[234] kernels Xose Vazquez Perez
@ 2004-01-19 17:21 ` James Bottomley
  2004-01-19 18:38   ` Justin T. Gibbs
  0 siblings, 1 reply; 11+ messages in thread
From: James Bottomley @ 2004-01-19 17:21 UTC (permalink / raw)
  To: Xose Vazquez Perez; +Cc: Linux Kernel, Tosatti, linux-scsi, Justin T. Gibbs

On Mon, 2004-01-19 at 08:32, Xose Vazquez Perez wrote:
> It looks like the _kernel_ driver is going to be without a maintainer
> unless somebody works on it, porting ADAPTEC fixes/features to the kernel driver.

As I told you in private email, this is *not* the way I see it.  At the
moment, Adaptec is the maintainer of that driver unless they choose
formally to relinquish it.

There is a glimmering of a resolution of the problem in an early
notification API for command timeouts.

Although throwing away successful completions when error recovery is in
progress isn't a bug (scsi commands are either idempotent or non
retryable), it's certainly not ideal.  I'm thinking about a better
framework where we would quiesce the device but pull back from
activating the eh thread if all commands return.  This would also fix
the tag starvation issue that many drivers tackle independently too.

James


* Re: AIC7xxx kernel problem with 2.4.2[234] kernels
  2004-01-19 17:21 ` James Bottomley
@ 2004-01-19 18:38   ` Justin T. Gibbs
  2004-01-20  0:50     ` James Bottomley
  0 siblings, 1 reply; 11+ messages in thread
From: Justin T. Gibbs @ 2004-01-19 18:38 UTC (permalink / raw)
  To: James Bottomley, Xose Vazquez Perez; +Cc: Linux Kernel, Tosatti, linux-scsi

> On Mon, 2004-01-19 at 08:32, Xose Vazquez Perez wrote:
>> It looks like the _kernel_ driver is going to be without a maintainer
>> unless somebody works on it, porting ADAPTEC fixes/features to the kernel driver.
> 
> As I told you in private email, this is *not* the way I see it.  At the
> moment, Adaptec is the maintainer of that driver unless they choose
> formally to relinquish it.

Can you provide your definition of "maintainer"?  I know that I am maintainer
of the drivers distributed from my website, but I don't feel I have ever
been maintainer of the drivers in the 2.4.X, 2.5.X, or 2.6.X trees.

> There is a glimmering of a resolution of the problem in an early
> notification API for command timeouts.

I'm open to ideas, but from this one line summary, this sounds like a
workaround and not a real solution.  Can you say more about your proposal?

In my mind, an easy resolution would be to:

1) Let me fix the SCSI layer so that the error recovery handler override
   already there will actually work - cleanly.

2) Let my drivers use that mechanism.

While working on 1, I would appreciate being able to "maintain" these
drivers with their current error recovery workaround in place.

> Although throwing away successful completions when error recovery is in
> progress isn't a bug (scsi commands are either idempotent or non
> retryable), it's certainly not ideal.

Most SCSI commands are only idempotent if replayed in the same order
as originally issued (consider FSes that rely on write ordering to
keep their meta-data coherent).  Some commands are retriable but only if
they have actually failed.  The mid-layer has no concept currently of these
issues, yet it acts on behalf of the peripheral drivers that can better
understand how the device they control behaves and act accordingly.

Bugs are defects that render non-ideal behavior.  The only question is
what types of non-ideal behaviors you are willing to tolerate.

> I'm thinking about a better
> framework where we would quiesce the device but pull back from
> activating the eh thread if all commands return.  This would also fix
> the tag starvation issue that many drivers tackle independently too.

That wouldn't help things.  For example, let's say that there is one command
active on the bus holding up the completion of 32 others.  "Waiting for a bit"
will never release the other 32 commands.  You must abort the bus hog.  Once
you abort the problem command, you get flooded with the completions of the
32 others.  The bus is recovered.  You can now safely go about your business.
An HBA watchdog handler can properly deal with this situation since it has
state that the mid-layer does not.
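
A toy model of the scenario above, offered as an illustration only and not as driver code. It assumes commands complete strictly in order, so a stuck head blocks everything queued behind it; the `SimulatedBus` class and its method names are hypothetical.

```python
from collections import deque


class SimulatedBus:
    """Toy bus: commands complete in order, so a stuck head blocks the rest."""

    def __init__(self, commands, hog):
        self.pending = deque(commands)
        self.hog = hog            # tag of the command that never completes by itself
        self.completed = []

    def wait(self):
        """Let time pass; return how many commands are still outstanding."""
        while self.pending and self.pending[0] != self.hog:
            self.completed.append(self.pending.popleft())
        return len(self.pending)

    def abort(self, tag):
        """Abort one command, then let whatever was stuck behind it drain."""
        self.pending.remove(tag)
        return self.wait()


bus = SimulatedBus(["hog"] + ["cmd%d" % i for i in range(32)], hog="hog")
stalled_first = bus.wait()    # waiting releases nothing
stalled_again = bus.wait()    # waiting longer releases nothing either
left_after_abort = bus.abort("hog")
```

In this model no amount of waiting changes the outstanding count; only aborting the hog lets the other 32 commands complete.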

As for tag starvation, just inserting a periodic ordered tag on devices
that show signs of starvation is a much better approach than shutting
down the flow of commands to the whole controller at the first sign of
trouble.  Luckily, most vendors stopped making drives with tag starvation
issues in the mid-90's.  For this reason, the tag starvation code in
my drivers is off by default, but can be enabled via a module or kernel
command line option.
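
A minimal sketch of the periodic ordered tag idea, assuming starvation has already been detected for one device by some other means. The function name and the interval of 8 are hypothetical choices for illustration, not values taken from the drivers.

```python
SIMPLE, ORDERED = "SIMPLE", "ORDERED"


def assign_tags(num_commands, starving, ordered_interval=8):
    """Pick a tag type for each queued command of one device.

    An ORDERED tag has to wait for every earlier command and is executed
    before every later one, so issuing one periodically forces the drive
    to finish old commands. Devices that are not starving keep SIMPLE tags.
    """
    tags = []
    for i in range(num_commands):
        if starving and (i + 1) % ordered_interval == 0:
            tags.append(ORDERED)
        else:
            tags.append(SIMPLE)
    return tags


healthy = assign_tags(16, starving=False)
starved = assign_tags(16, starving=True)
```

Only the device flagged as starving pays the cost of the ordered tags; command flow to every other device on the controller is untouched.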

--
Justin


* Re: AIC7xxx kernel problem with 2.4.2[234] kernels
  2004-01-19 18:38   ` Justin T. Gibbs
@ 2004-01-20  0:50     ` James Bottomley
  2004-01-20  2:02       ` Justin T. Gibbs
  0 siblings, 1 reply; 11+ messages in thread
From: James Bottomley @ 2004-01-20  0:50 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: Xose Vazquez Perez, Linux Kernel, Tosatti, linux-scsi

On Mon, 2004-01-19 at 13:38, Justin T. Gibbs wrote: 
> Can you provide your definition of "maintainer"?  I know that I am maintainer
> of the drivers distributed from my website, but I don't feel I have ever
> been maintainer of the drivers in the 2.4.X, 2.5.X, or 2.6.X trees.

A maintainer is a person who works with the kernel community to keep the
driver (or subsystem, filesystem or whatever) up to date.  Such a person
may or possibly may not have an entry in the MAINTAINERS file. 

If you want to maintain a reference driver and have someone else do the
legwork with the community, that's fine by me...do you have someone in
mind, or should I find this person? 

> I'm open to ideas, but from this one line summary, this sounds like a
> workaround and not a real solution.  Can you say more about your proposal?

It actually wasn't mine, it was Alan Cox's.  There was a thread about it
(on which you were cc'd), but I'm currently on a 'plane to NY for
LinuxWorld and don't have it handy. 

> In my mind, an easy resolution would be to:
> 
> 1) Let me fix the SCSI layer so that the error recovery handler override
>    already there will actually work - cleanly.
> 
> 2) Let my drivers use that mechanism.
> 
> While working on 1, I would appreciate being able to "maintain" these
> drivers with their current error recovery workaround in place.

Well, I'm thinking of having a class based error recovery scheme, built
upon an extension of the transport class patch that has been floating
around this list.  However, my problem is that the aic7xxx/79xx chips
are basically SPI, and therefore, even under the new scheme, should be
using the SPI recovery class.  Therefore, just providing an override
mechanism for all drivers to use isn't what I want.  What I want is a
robust SPI recovery mechanism usable by all. 

This is what "working with the kernel community" means.  If there's a
bug, I don't want it fixed by driver work arounds, I want it fixed in
the core code.  Having driver writers ignore the APIs and roll their own
will simply create problems. 

> Most SCSI commands are only idempotent if replayed in the same order
> as originally issued (consider FSes that rely on write ordering to
> keep their meta-data coherent).  Some commands are retriable but only if
> they have actually failed.  The mid-layer has no concept currently of these
> issues, yet it acts on behalf of the peripheral drivers that can better
> understand how the device they control behaves and act accordingly.
> 
> Bugs are defects that render non-ideal behavior.  The only question is
> what types of non-ideal behaviors you are willing to tolerate.

This is the old barrier debate.  The scsi subsystem does not advertise an
ordering property to the block layer and thus is not required to
preserve order over error recovery.  This problem, therefore, does not
exist in linux. 

We had this debate years ago...the upshot being that the performance
benefits of order preservation were uncertain at best so it was never
implemented.  Linux works just fine without it. 

> That wouldn't help things.  For example, let's say that there is one command
> active on the bus holding up the completion of 32 others.  "Waiting for a bit"
> will never release the other 32 commands.  You must abort the bus hog.  Once
> you abort the problem command, you get flooded with the completions of the
> 32 others.  The bus is recovered.  You can now safely go about your business.
> An HBA watchdog handler can properly deal with this situation since it has
> state that the mid-layer does not.

I don't understand this.  If by "active on the bus" you mean is holding
the bus in a busy state, then you cannot get access to the bus to
send an abort or a device reset...the only recourse is a bus
reset...which the mid layer will do. 

If the drive has actually freed the bus but lost the tag, then it's a
drive queueing bug, and the solution is usually to lower the TCQ depth
(we should probably have a blacklist for this).  This is where the mid
layer quiesce is good...if all the other commands complete, the bus is
free and we still don't get a response from the missing command, then
you know the drive firmware lost it, and the driver should adjust the
queue depth downwards. 
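
The quiesce decision described above can be sketched roughly as follows. This is an illustration under stated assumptions: the function name, the action labels, and the choice to halve the queue depth are all hypothetical, not part of any mid-layer API.

```python
def after_quiesce(timed_out_tag, outstanding, completed, queue_depth):
    """Decide what to do once a quiesced device has had time to drain.

    outstanding: tags that were in flight when the timeout fired
    completed:   tags that came back while the device was quiesced
    """
    missing = set(outstanding) - set(completed)
    if not missing:
        # Everything returned: no need to wake the eh thread.
        return ("resume", queue_depth)
    if missing == {timed_out_tag}:
        # Only the timed-out command is gone: assume the drive lost the
        # tag and lower the TCQ depth (halving is an arbitrary choice).
        return ("lost_tag", max(1, queue_depth // 2))
    # Several commands are stuck: fall back to full error recovery.
    return ("error_recovery", queue_depth)


in_flight = ["t1", "t5", "t9"]
all_back = after_quiesce("t5", in_flight, ["t1", "t5", "t9"], 32)
one_lost = after_quiesce("t5", in_flight, ["t1", "t9"], 32)
many_stuck = after_quiesce("t5", in_flight, ["t1"], 32)
```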

If the drive is just off servicing other tags, then it's tag starvation.

> As for tag starvation, just inserting a periodic ordered tag on devices
> that show signs of starvation is a much better approach than shutting
> down the flow of commands to the whole controller at the first sign of
> trouble.  Luckily, most vendors stopped making drives with tag starvation
> issues in the mid-90's.  For this reason, the tag starvation code in
> my drivers is off by default, but can be enabled via a module or kernel
> command line option.

Well, I have to deal with old hardware... 

James 


* Re: AIC7xxx kernel problem with 2.4.2[234] kernels
  2004-01-20  0:50     ` James Bottomley
@ 2004-01-20  2:02       ` Justin T. Gibbs
  2004-01-20  4:45         ` James Bottomley
  2004-01-20  7:15         ` Linus Torvalds
  0 siblings, 2 replies; 11+ messages in thread
From: Justin T. Gibbs @ 2004-01-20  2:02 UTC (permalink / raw)
  To: James Bottomley; +Cc: Xose Vazquez Perez, Linux Kernel, Tosatti, linux-scsi

> On Mon, 2004-01-19 at 13:38, Justin T. Gibbs wrote: 
>> Can you provide your definition of "maintainer"?  I know that I am maintainer
>> of the drivers distributed from my website, but I don't feel I have ever
>> been maintainer of the drivers in the 2.4.X, 2.5.X, or 2.6.X trees.
> 
> A maintainer is a person who works with the kernel community to keep the
> driver (or subsystem, filesystem or whatever) up to date.  Such a person
> may or possibly may not have an entry in the MAINTAINERS file. 

Does the maintainer have the ability to veto changes that harm the
code they maintain?  In other words, you claim that I am the maintainer
of the drivers in the kernel.org tree.  This has not prevented changes
from being made to these drivers without adequate review.  Even your last
update to the driver threw away all of the changelog state and left at
least the aic79xx driver in a worse state than it was in before (see
changelog entries for the driver versions after the one that you imported
for details - this was exactly why I didn't submit that particular revision).
You didn't even bother to ask me if importing 1.3.11 was appropriate.  This
is why I say I don't feel like a maintainer.  I'm not given adequate control
over the end product yet I'm supposed to take the blame when it doesn't work.

>> I'm open to ideas, but from this one line summary, this sounds like a
>> workaround and not a real solution.  Can you say more about your proposal?
> 
> It actually wasn't mine, it was Alan Cox's.  There was a thread about it
> (on which you were cc'd), but I'm currently on a 'plane to NY for
> LinuxWorld and don't have it handy. 

That proposal was to allow the timeout handler to be redirected.  This
is different than an early notification.  Allowing the timeout handler
to be redirected is a required step toward making the recovery code
work.

> Well, I'm thinking of having a class based error recovery scheme, built
> upon an extension of the transport class patch that has been floating
> around this list.

That's all fine for "status based" recovery.  All I'm trying to resolve
are issues with watchdog recovery.  Can we limit the discussion to that
area?

> However, my problem is that the aic7xxx/79xx chips
> are basically SPI, and therefore, even under the new scheme, should be
> using the SPI recovery class.  Therefore, just providing an override
> mechanism for all drivers to use isn't what I want.  What I want is a
> robust SPI recovery mechanism usable by all. 

I understand that, but it just isn't possible to do well for watchdog
recovery.  For status based recovery, sure.

> This is what "working with the kernel community" means.  If there's a
> bug, I don't want it fixed by driver work arounds, I want it fixed in
> the core code.  Having driver writers ignore the APIs and roll their own
> will simply create problems. 

In this case, the bug is that the mid-layer tries to handle watchdog
recovery on its own.  It will never, in my opinion, having reviewed
lots of systems that have tried to do it in a centralized way, work well.
The mid-layer just doesn't have the necessary state to make intelligent
decisions and exporting that state will always be cumbersome and incomplete.

>> Most SCSI commands are only idempotent if replayed in the same order
>> as originally issued (consider FSes that rely on write ordering to
>> keep their meta-data coherent).  Some commands are retriable but only if
>> they have actually failed.  The mid-layer has no concept currently of these
>> issues, yet it acts on behalf of the peripheral drivers that can better
>> understand how the device they control behaves and act accordingly.
>> 
>> Bugs are defects that render non-ideal behavior.  The only question is
>> what types of non-ideal behaviors you are willing to tolerate.
> 
> This is the old barrier debate.

Not entirely.  Tapes are allowed to accept multiple commands and some
FCTape drives do.  But even if you throw away that argument completely,
you still haven't resolved how to deal with retriable commands that
are only retriable if they have actually failed.

My feeling is that any situation where the mid-layer or HBA drivers fail
to provide complete and accurate state for the commands that are completed
is a bug.  The peripheral drivers cannot do their job if they aren't
given good information.

>> That wouldn't help things.  For example, let's say that there is one
>> command active on the bus holding up the completion of 32 others.
>> "Waiting for a bit" will never release the other 32 commands.  You must
>> abort the bus hog.  Once  you abort the problem command, you get flooded
>> with the completions of the 32 others.  The bus is recovered.  You can now
>> safely go about your business.  An HBA watchdog handler can properly deal
>> with this situation since it has state that the mid-layer does not.
> 
> I don't understand this.  If by "active on the bus" you mean is holding
> the bus in a busy state, then you cannot get access to the bus to
> send an abort or a device reset...the only recourse is a bus
> reset...which the mid layer will do. 

If we are talking SPI, then aborts, device resets, and lun resets
are all handled with either message bytes transmitted via a message phase,
or via command packets with the task management function set appropriately.
If you cannot send an abort message, you cannot send any message, so claiming
that a BDR request will resolve the problem doesn't make any sense if you
believe that a device active on the bus prevents aborts from working.  In any
event, just because a device is active on the bus doesn't mean that the
bus is hung and that you cannot abort a command.  By raising the ATN line,
the target may decide to change phase to accept your message byte.  It
may not, but if it doesn't, then your only recourse is to reset the
bus.  Looping through all the other commands that happen to be stalled
and asking the driver to abort them will only delay the inevitable.

> If the drive has actually freed the bus but lost the tag, then it's a
> drive queueing bug, and the solution is usually to lower the TCQ depth
> (we should probably have a blacklist for this).  This is where the mid
> layer quiesce is good...if all the other commands complete, the bus is
> free and we still don't get a response from the missing command, then
> you know the drive firmware lost it, and the driver should adjust the
> queue depth downwards. 

How does the mid-layer know that the "bus is free"?  What transports even
have this concept?  If one drive has lost a command, and the transport
is functioning normally, why are you penalizing the other devices attached
to the HBA while you "sort this out"?  There is no need to do that.

As for reducing the queue depth in response to repeated timeouts by a
device, this is easy enough to do with your "multi-layered", status based
recovery code.  All that is required is for the HBA to tell you that a
particular command was aborted due to timeout as well as indicate what
side-effects occurred because of the abort process (bus reset, device reset,
lun reset, LIP, etc).   Some of the latter is already provided for by the
reset and bus reset entry points, but a better solution would be to have
a single "async event" callback that can encompass any transport notifications
needed by SPI, FC, SAS, and any future transports without adding more
entry points.
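
A rough sketch of what a single "async event" callback could look like, modelled in Python rather than kernel C. Every name here (`AsyncEvent`, `DeviceState`, `async_event`, the threshold of 3) is hypothetical and chosen only for illustration.

```python
from enum import Enum, auto


class AsyncEvent(Enum):
    """Transport notifications reported through one entry point."""
    COMMAND_ABORTED_TIMEOUT = auto()
    BUS_RESET = auto()
    DEVICE_RESET = auto()
    LUN_RESET = auto()
    LIP = auto()


class DeviceState:
    def __init__(self, queue_depth):
        self.queue_depth = queue_depth
        self.timeouts = 0
        self.events = []


def async_event(dev, event, timeout_threshold=3):
    """Single callback: record the event, and lower the queue depth
    after repeated timeout aborts on the same device."""
    dev.events.append(event)
    if event is AsyncEvent.COMMAND_ABORTED_TIMEOUT:
        dev.timeouts += 1
        if dev.timeouts >= timeout_threshold and dev.queue_depth > 1:
            dev.queue_depth -= 1
            dev.timeouts = 0


dev = DeviceState(queue_depth=32)
for _ in range(3):
    async_event(dev, AsyncEvent.COMMAND_ABORTED_TIMEOUT)
async_event(dev, AsyncEvent.BUS_RESET)
```

A new transport would add members to the enum instead of new entry points, and the status based recovery code sees both the timeout aborts and their side effects in one place.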

> If the drive is just off servicing other tags, then it's tag starvation.
>> As for tag starvation, just inserting a periodic ordered tag on devices
>> that show signs of starvation is a much better approach than shutting
>> down the flow of commands to the whole controller at the first sign of
>> trouble.  Luckily, most vendors stopped making drives with tag starvation
>> issues in the mid-90's.  For this reason, the tag starvation code in
>> my drivers is off by default, but can be enabled via a module or kernel
>> command line option.
> 
> Well, I have to deal with old hardware... 

Sure, just don't penalize the other disks on the transport because you
have one disk out there that is affected by this issue.

--
Justin


* Re: AIC7xxx kernel problem with 2.4.2[234] kernels
  2004-01-20  2:02       ` Justin T. Gibbs
@ 2004-01-20  4:45         ` James Bottomley
  2004-01-20  5:43           ` Justin T. Gibbs
  2004-01-20 11:24           ` Chiaki
  2004-01-20  7:15         ` Linus Torvalds
  1 sibling, 2 replies; 11+ messages in thread
From: James Bottomley @ 2004-01-20  4:45 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: Xose Vazquez Perez, Linux Kernel, Tosatti, linux-scsi

On Mon, 2004-01-19 at 21:02, Justin T. Gibbs wrote:
> Does the maintainer have the ability to veto changes that harm the
> code they maintain?  In other words, you claim that I am the maintainer
> of the drivers in the kernel.org tree.  This has not prevented changes
> from being made to these drivers without adequate review.  Even your last
> update to the driver threw away all of the changelog state and left at
> least the aic79xx driver in a worse state than it was in before (see
> changelog entries for the driver versions after the one that you imported
> for details - this was exactly why I didn't submit that particular revision).

I said "works with the kernel community".  It's not about control, it's
about co-operation.  The control you seek simply does not exist in the
kernel development process.

> You didn't even bother to ask me if importing 1.3.11 was appropriate.  This
> is why I say I don't feel like a maintainer.  I'm not given adequate control
> over the end product yet I'm supposed to take the blame when it doesn't work.

In the previous thread about the driver you said "You can integrate the
driver at whatever revision suits you.", so I took you at your word; if
that wasn't what you meant, it's a little late to whine about it now. 
Small bug fixes would, as ever, be welcome...

As for blame, apart from the occasional flamewar, the community seems
generally welcoming of anyone who provides fixes.  We tend to be more
interested in fixing things than assigning blame.

> That proposal was to allow the timeout handler to be redirected.  This
> is different than an early notification.  Allowing the timeout handler
> to be redirected is a required step toward making the recovery code
> work.

The recovery code does work.  You may want it to work differently, and
that may make it work better, but that's an enhancement not a bug fix.

> In this case, the bug is that the mid-layer tries to handle watchdog
> recovery on its own.  It will never, in my opinion, having reviewed
> lots of systems that have tried to do it in a centralized way, work well.
> The mid-layer just doesn't have the necessary state to make intelligent
> decisions and exporting that state will always be cumbersome and incomplete.

But it does do it successfully.  Something that currently works but
could work better is an enhancement not a bug.

> How does the mid-layer know that the "bus is free"?  What transports even
> have this concept?  If one drive has lost a command, and the transport
> is functioning normally, why are you penalizing the other devices attached
> to the HBA while you "sort this out"?  There is no need to do that.

Again, this is a "could do better" case, not a required bug fix.

I'm not against enhancements, even at this late stage in the
stabilisation process.  However, they have to be small, self contained
and obviously correct.  If you have them, send them to the list and
they'll get reviewed.

James


* Re: AIC7xxx kernel problem with 2.4.2[234] kernels
  2004-01-20  4:45         ` James Bottomley
@ 2004-01-20  5:43           ` Justin T. Gibbs
  2004-01-22  5:14             ` James Bottomley
  2004-01-20 11:24           ` Chiaki
  1 sibling, 1 reply; 11+ messages in thread
From: Justin T. Gibbs @ 2004-01-20  5:43 UTC (permalink / raw)
  To: James Bottomley; +Cc: Xose Vazquez Perez, Linux Kernel, Tosatti, linux-scsi

> On Mon, 2004-01-19 at 21:02, Justin T. Gibbs wrote:
>> Does the maintainer have the ability to veto changes that harm the
>> code they maintain?  In other words, you claim that I am the maintainer
>> of the drivers in the kernel.org tree.  This has not prevented changes
>> from being made to these drivers without adequate review.  Even your last
>> update to the driver threw away all of the changelog state and left at
>> least the aic79xx driver in a worse state than it was in before (see
>> changelog entries for the driver versions after the one that you imported
>> for details - this was exactly why I didn't submit that particular revision).
> 
> I said "works with the kernel community".  It's not about control, it's
> about co-operation.  The control you seek simply does not exist in the
> kernel development process.

Then I ask again, what does it mean to be a maintainer?  It sounds like
I'm on equal footing with anyone who decides to post some patch to the
lists.  I've lost count of the number of occasions that some random
patch from some random individual was accepted without any consultation
with "the maintainer" of these drivers.  The end result was more email
in my mailbox complaining about "the broken driver that I maintain."

As for control, the type of control "I seek" does exist.  You have it.
You can also delegate some of that control if it suits you.

A maintainer takes on responsibility to ensure that something is maintained
and works.  Without some level of control, how can the maintainer fulfill
that responsibility?

>> You didn't even bother to ask me if importing 1.3.11 was appropriate.  This
>> is why I say I don't feel like a maintainer.  I'm not given adequate control
>> over the end product yet I'm supposed to take the blame when it doesn't work.
> 
> In the previous thread about the driver you said "You can integrate the
> driver at whatever revision suits you.", so I took you at your word; if
> that wasn't what you meant, it's a little late to whine about it now. 
> Small bug fixes would, as ever, be welcome...

I provided all of the information required for you to make a reasoned
decision of which change sets to integrate.  I had no idea that you
would completely disregard the wealth of information in the change sets
and change set comments when coming up with an integration point.  Your
actions show that you didn't review or understand the changes well enough
to submit them into the tree.  You probably didn't even test the resulting
driver on real hardware before you submitted the changes.

> The recovery code does work.  You may want it to work differently, and
> that may make it work better, but that's an enhancement not a bug fix.

No.  The recovery code doesn't work.  Many of the people that know this
don't bother complaining to you about it.  They complain to the HBA driver
authors and the tech support departments of the companies that make the HBAs.
The HBA driver authors then do what they must to ensure that the system
remains viable after recovery.  

I mean honestly.  Do you think I would have gone to all of the trouble
I did in doing my own watchdog recovery if the recovery code worked
correctly?  Or that I would stand so firm in my position if these issues
didn't have real customer impact?

--
Justin


* Re: AIC7xxx kernel problem with 2.4.2[234] kernels
  2004-01-20  5:43           ` Justin T. Gibbs
@ 2004-01-22  5:14             ` James Bottomley
  0 siblings, 0 replies; 11+ messages in thread
From: James Bottomley @ 2004-01-22  5:14 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: Xose Vazquez Perez, Linux Kernel, Tosatti, linux-scsi

On Tue, 2004-01-20 at 00:43, Justin T. Gibbs wrote: 
> As for control, the type of control "I seek" does exist.  You have it.
> You can also delegate some of that control if it suits you.

Well, as you have heard from the horse's mouth: I don't. 

> I provided all of the information required for you to make a reasoned
> decision of which change sets to integrate.  I had no idea that you
> would completely disregard the wealth of information in the change sets
> and change set comments when coming up with an integration point.  Your
> actions show that you didn't review or understand the changes well enough
> to submit them into the tree.  You probably didn't even test the resulting
> driver on real hardware before you submitted the changes.

Actually, I would have done nothing but for some 2.6 migration reports
of total lockups with the then in tree aic79xx driver.  The patch that
went into the tree was tested by the people reporting the lockups. 

> > The recovery code does work.  You may want it to work differently, and
> > that may make it work better, but that's an enhancement not a bug fix.
> 
> No.  The recovery code doesn't work.  Many of the people that know this
> don't bother complaining to you about it.  They complain to the HBA driver
> authors and the tech support departments of the companies that make the HBAs.
> The HBA driver authors then do what they must to ensure that the system
> remains viable after recovery.  

You haven't outlined any incorrect cases in your emails, just could do
better cases.  If you have all these bug reports that you haven't been
passing on, could you at least distil them to the mid layer failure
scenario that we can discuss fixing? 

> I mean honestly.  Do you think I would have gone to all of the trouble
> I did in doing my own watchdog recovery if the recovery code worked
> correctly?  Or that I would stand so firm in my position if these issues
> didn't have real customer impact?

Well, in coming up with the mid layer changes from 2.4 to 2.6 I did look
at what issues the main drivers had work arounds for. I used these work
arounds and an email you sent in September 2002 as the basis for a lot
of the mid-layer changes in 2.6.  None of the other drivers does this
timer interception and the issue wasn't mentioned in your email, so I am
dubious about the seriousness of the impact.

The way fixes get into Linux is that either lots of people complain or one
person sends a fix.  Neither has happened in this case, which again leads
me to suspect that it's not a huge problem.

The still outstanding question is, now that you have a clearer idea what
being a Maintainer entails, do you wish to be the maintainer for
aic7xxx?

James


* Re: AIC7xxx kernel problem with 2.4.2[234] kernels
  2004-01-20  4:45         ` James Bottomley
  2004-01-20  5:43           ` Justin T. Gibbs
@ 2004-01-20 11:24           ` Chiaki
  1 sibling, 0 replies; 11+ messages in thread
From: Chiaki @ 2004-01-20 11:24 UTC (permalink / raw)
  To: linux-scsi
  Cc: James Bottomley, Justin T. Gibbs, Xose Vazquez Perez,
	Linux Kernel, Tosatti

I have a feeling that Linus summed up
what being a maintainer means in the context
of Linux kernel development.

But I have a comment.

> The recovery code does work. 

Maybe. I have not tried a few problematic devices on
my PC lately. These devices
usually caused trouble under the 2.2.xx series, and even under the late
2.3.y series for a while.
The symptom was essentially a reset storm that made the system
unusable.
Given the various patches accumulating, maybe the symptom
is tolerable now, but I still see some mention of
unusable system response even today.
So I suspect the problem is still there for certain types
of hardware problems.

> You may want it to work differently, and
> that may make it work better, but that's an enhancement not a bug fix.

To people who have been bitten by such unusable-system symptoms,
the above statement simply won't pass.

It is essentially a "performance *bug*" and
should be corrected IMHO.

> But it does do it successfully.  Something that currently works but
> could work better is an enhancement not a bug.

Again, to those people this is a correctable (and should be corrected)
performance "bug".

> I'm not against enhancements, even at this late stage in the
> stabilisation process. 

I am a little confused here. Are we talking about the 2.4 series?
OK, the subject line states 2.4.2[234].

I believe that there is a large user base, especially people
who use server-type machines with a SCSI interface (and AIC chips
seem to be popular in these machines),
who would appreciate the enhanced (== performance
bug corrected) version of the SCSI subsystem.

I, for one, don't use AIC chips in my home PCs, but
do have some machines at the office which use them, and
I would appreciate an "enhanced" SCSI subsystem after all these years.

As for 2.6.zz, isn't there any chance of introducing hooks into the EH
framework? The previous discussion suggested that it needs
to wait for the 2.7 series. Too bad :-(

I feel that these error handling issues of the SCSI subsystem
will have to be solved once and for all, sooner or later, in the mainline.
Otherwise, as we see, the vendors of commercial distributions will probably
need to keep a separate tree for a long time to come (which they may
have to do anyway to deal with other quirks of the mainline kernel, etc.),
and this is rather a waste of man-power IMHO.

In any case, with all due respect,
I don't think that the discussion will go anywhere unless we
recognize that someone's "mere enhancement" is actually
other people's "serious performance bug correction".
I, for one, tend to see the topic discussed as
a performance "BUG" and so
am a little frustrated at the pace at which the
error handling scheme is being improved.

This is just a comment from a third party observer who,
unfortunately, doesn't have the time to dig into
the code and offer a patch. (Yes, I actually tried
once during the 2.2.xx time-frame but was then repelled by
the spaghetti code of the time and gave up.)

PS: One other thing is that this type of bug
is hard to trigger unless you have a controlled
facility or some seemingly working and yet
faulty devices which
generate a bad condition in a short time, say a few minutes
into the operation. So I agree that
not all people see such problems.

Intermittently faulty SCSI devices are rather rare, aren't they?
Usually a SCSI device such as a disk is either completely dead
or healthy. Finding a faulty device that triggers an error condition
from time to time is probably the key to observing the
problematic symptom being discussed. I wonder
if some disk manufacturers or someone could produce
a special firmware to generate an error condition every minute or so
and send such disks to SCSI subsystem developers :-)

PPS: Some would argue that if such devices are so rare
then we can ignore them. Heck, no!
I have seen Solaris log files where such faulty
behavior occurred from time to time and was dealt with
very gracefully without the system being rendered unusable.
So the ratio of such devices is small, but the sheer number
of computer installations today makes such incidents visible indeed.

-- 
int main(void){int j=2003;/*(c)2003 cishikawa. */
char t[] ="<CI> @abcdefghijklmnopqrstuvwxyz.,\n\"";
char *i ="g>qtCIuqivb,gCwe\np@.ietCIuqi\"tqkvv is>dnamz";
while(*i)((j+=strchr(t,*i++)-(int)t),(j%=sizeof t-1),
(putchar(t[j])));return 0;}/* under GPL */


* Re: AIC7xxx kernel problem with 2.4.2[234] kernels
  2004-01-20  2:02       ` Justin T. Gibbs
  2004-01-20  4:45         ` James Bottomley
@ 2004-01-20  7:15         ` Linus Torvalds
  2004-01-20  8:30           ` Andre Hedrick
  1 sibling, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2004-01-20  7:15 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: James Bottomley, Xose Vazquez Perez, Linux Kernel, Tosatti,
	linux-scsi

On Mon, 19 Jan 2004, Justin T. Gibbs wrote:
> 
> Does the maintainer have the ability to veto changes that harm the
> code they maintain?

Nope. Nobody has that right.

Even _I_ don't veto changes that the right people push (my motto:
"everybody is wrong sometimes: when enough people complain, even I am
wrong"). 

In particular, maintainers of "conceptually higher" generally always have
priority. If Al Viro says a filesystem is doing something wrong from a VFS
standpoint, then that filesystem is broken - regardless of whether the
filesystem maintainer agrees or not. Because the VFS layer requirements 
trump any low-level filesystem issues.

But perhaps more importantly (and it's the reason even _I_ don't have the 
right, regardless of how high up in the maintainership chain I am), nobody 
has veto-power over anything. That's to keep people honest: nobody should 
_ever_ think that they are "in control", and that nobody else can replace 
them. 

In other words: maintainership is not ownership. It's a stewardship.

End result: maintainership is a nasty and mostly unthankful job. It
doesn't really give many privileges, and most of what it does is just have
people complain to you about bugs. The satisfaction is there, of course, 
but 

And finally: maintainership is largely about working with people.  
There's some code in there too, but people tend to be more important.

		Linus


* Re: AIC7xxx kernel problem with 2.4.2[234] kernels
  2004-01-20  7:15         ` Linus Torvalds
@ 2004-01-20  8:30           ` Andre Hedrick
  0 siblings, 0 replies; 11+ messages in thread
From: Andre Hedrick @ 2004-01-20  8:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel, linux-scsi


Linus,

It would have been nice to list such rules at the top of the MAINTAINERS
file; this may have saved me some grief, well, maybe not ...

Andre Hedrick
LAD Storage Consulting Group

On Mon, 19 Jan 2004, Linus Torvalds wrote:

> 
> 
> On Mon, 19 Jan 2004, Justin T. Gibbs wrote:
> > 
> > Does the maintainer have the ability to veto changes that harm the
> > code they maintain?
> 
> Nope. Nobody has that right.
> 
> Even _I_ don't veto changes that the right people push (my motto:
> "everybody is wrong sometimes: when enough people complain, even I am
> wrong"). 
> 
> In particular, maintainers of "conceptually higher" generally always have
> priority. If Al Viro says a filesystem is doing something wrong from a VFS
> standpoint, then that filesystem is broken - regardless of whether the
> filesystem maintainer agrees or not. Because the VFS layer requirements 
> trump any low-level filesystem issues.
> 
> But perhaps more importantly (and it's the reason even _I_ don't have the 
> right, regardless of how high up in the maintainership chain I am), nobody 
> has veto-power over anything. That's to keep people honest: nobody should 
> _ever_ think that they are "in control", and that nobody else can replace 
> them. 
> 
> In other words: maintainership is not ownership. It's a stewardship.
> 
> End result: maintainership is a nasty and mostly unthankful job. It
> doesn't really give many privileges, and most of what it does is just have
> people complain to you about bugs. The satisfaction is there, of course, 
> but 
> 
> And finally: maintainership is largely about working with people.  
> There's some code in there too, but people tend to be more important.
> 
> 		Linus



end of thread, other threads:[~2004-01-22  5:14 UTC | newest]

Thread overview: 11+ messages
2004-01-19 13:32 AIC7xxx kernel problem with 2.4.2[234] kernels Xose Vazquez Perez
2004-01-19 17:21 ` James Bottomley
2004-01-19 18:38   ` Justin T. Gibbs
2004-01-20  0:50     ` James Bottomley
2004-01-20  2:02       ` Justin T. Gibbs
2004-01-20  4:45         ` James Bottomley
2004-01-20  5:43           ` Justin T. Gibbs
2004-01-22  5:14             ` James Bottomley
2004-01-20 11:24           ` Chiaki
2004-01-20  7:15         ` Linus Torvalds
2004-01-20  8:30           ` Andre Hedrick
