* Re: AIC7xxx kernel problem with 2.4.2[234] kernels
From: Xose Vazquez Perez @ 2004-01-19 13:32 UTC
To: linux-kernel, Tosatti, linux-scsi, Justin T. Gibbs

Marcelo Tosatti wrote:

> About the aic7xxx update, well, I believe aic7xxx 6.2.36 is pretty stable (I don't remember seeing any reliable bug report and I also can't find one in lkml archives) except this one (and a pair of "lockup on initialization with SMP").

Justin already put updates in BK, but James did not like the "new error recovery" code. So the kernel driver is *SIX months* behind the ADAPTEC driver release. There is more info in this linux-scsi thread on why the patch was not applied:

http://marc.theaimsgroup.com/?l=linux-scsi&m=107228516327580&w=2

It looks like the _kernel_ driver is going to be without a maintainer unless somebody works on it, porting ADAPTEC fixes/features to the kernel driver.

> What bugs are you aware of in 2.4's aic7xxx ?

The aic7xxx/aic79xx CHANGELOG has info about all bugs fixed:

o Adaptec Aic7xxx Version History:

6.3.4 (December 22nd, 2003)
    - Provide a better description string for the 2915/30LP.
    - Sniff sense information returned by targets for unit attention errors that may indicate that the device has been changed. If we see such status for non Domain Validation related commands, start a DV scan for the target. In the past, DV would only occur for hot-plugged devices if no target had been previously probed for a particular ID. This change guarantees that the DV process will occur even if the user swaps devices without any intervening I/O to tell us that a device has gone missing. The old behavior, among other things, would fail to spin up drives that were hot-plugged since the Linux mid-layer will only spin-up drives on initial attach.

6.3.3 (November 6th, 2003)
    - Support the 2.6.0-test9 kernel
    - Fix rare deadlock caused by using del_timer_sync from within a timer handler.

6.3.2 (October 28th, 2003)
    - Enforce a bus settle delay for bus resets that the driver initiates.
    - Fall back to basic DV for U160 devices that lack an echo buffer.
    - Correctly detect that left over BIOS data has not been initialized when the CHPRST status bit is set during driver initialization.

6.3.1 (October 21st, 2003)
    - Fix a compiler error when building with only EISA or PCI support compiled into the kernel.
    - Add chained dependencies to both the driver and aicasm Makefiles to avoid problems with parallel builds.
    - Move additional common routines to the aiclib OSM library to reduce code duplication.
    - Fix a bug in the testing of the AHC_TMODE_WIDEODD_BUG that could cause target mode operations to hang.
    - Leave removal of softcs from the global list of softcs to the OSM. This allows us to avoid holding the list_lock during device destruction.

6.3.0 (September 8th, 2003)
    - Move additional common routines to the aiclib OSM library to reduce code duplication.
    - Bump minor number to reflect change in error recovery strategy.

6.2.38 (August 31st, 2003)
    - Avoid an inadvertent reset of the controller during the memory mapped I/O test should the controller be left in the reset state prior to driver initialization. On some systems, this extra reset resulted in a system hang due to a chip access that occurred too soon after reset.
    - Move additional common routines to the aiclib OSM library to reduce code duplication.
    - Add magic sysrq handler that causes a card dump to be output to the console for each controller.

6.2.37 (August 12th, 2003)
    - Perform timeout recovery within the driver instead of relying on the Linux SCSI mid-layer to perform this function. The mid-layer does not know the full state of the SCSI bus and is therefore prone to looping for several minutes to effect recovery.
      The new scheme recovers within 15 seconds of the failure.
    - Support writing 93c56/66 SEEPROM on newer cards.
    - Avoid clearing ENBUSFREE during single stepping to avoid spurious "unexpected busfree while idle" messages.
    - Enable the use of the "Auto-Access-Pause" feature on the aic7880 and aic7870 chips. It was disabled due to an oversight. Using this feature drastically reduces command delivery latency.

6.2.36 **KERNEL DRIVER**

o Adaptec Aic79xx Version History:

2.0.5 (December 22nd, 2003)
    - Correct a bug preventing the driver from renegotiating during auto-request operations when a check condition occurred for a zero length command.
    - Sniff sense information returned by targets for unit attention errors that may indicate that the device has been changed. If we see such status for non Domain Validation related commands, start a DV scan for the target. In the past, DV would only occur for hot-plugged devices if no target had been previously probed for a particular ID. This change guarantees that the DV process will occur even if the user swaps devices without any intervening I/O to tell us that a device has gone missing. The old behavior, among other things, would fail to spin up drives that were hot-plugged since the Linux mid-layer will only spin-up drives on initial attach.
    - Correct several issues in the rundown of the good status FIFO during error recovery. The typical failure scenario evidenced by this defect was the loss of several commands under high load when several queue full conditions occurred back to back.

2.0.4 (November 6th, 2003)
    - Support the 2.6.0-test9 kernel
    - Fix rare deadlock caused by using del_timer_sync from within a timer handler.

2.0.3 (October 21st, 2003)
    - On 7902A4 hardware, use the slow slew rate for transfer rates slower than U320. This behavior matches the Windows driver.
    - Fix some issues with the ahd_flush_qoutfifo() routine.
    - Add a delay in the loop waiting for selection activity to cease.
      Otherwise we may exhaust the loop counter too quickly on fast machines.
    - Return to processing bad status completions through the qoutfifo. This reduces the amount of time the controller is paused for these kinds of errors.
    - Move additional common routines to the aiclib OSM library to reduce code duplication.
    - Leave removal of softcs from the global list of softcs to the OSM. This allows us to avoid holding the list_lock during device destruction.
    - Enforce a bus settle delay for bus resets that the driver initiates.
    - Fall back to basic DV for U160 devices that lack an echo buffer.

2.0.2 (September 4th, 2003)
    - Move additional common routines to the aiclib OSM library to reduce code duplication.
    - Avoid an inadvertent reset of the controller during the memory mapped I/O test should the controller be left in the reset state prior to driver initialization. On some systems, this extra reset resulted in a system hang due to a chip access that occurred too soon after reset.
    - Correct an endian bug in ahd_swap_with_next_hscb. This corrects strong-arm support.
    - Reset the bus for transactions that timeout waiting for the bus to go free after a disconnect or command complete message.

2.0.1 (August 26th, 2003)
    - Add magic sysrq handler that causes a card dump to be output to the console for each controller.
    - Avoid waking the mid-layer's error recovery handler during timeout recovery by returning DID_ERROR instead of DID_TIMEOUT for timed-out commands that have been aborted.
    - Move additional common routines to the aiclib OSM library to reduce code duplication.

2.0.0 (August 20th, 2003)
    - Remove MMAPIO definition and allow memory mapped I/O for any platform that supports PCI.
    - Avoid clearing ENBUSFREE during single stepping to avoid spurious "unexpected busfree while idle" messages.
    - Correct deadlock in ahd_run_qoutfifo() processing.
    - Optimize support for the 7901B.
    - Correct a few cases where an explicit flush of pending register writes was required to ensure accuracy in delays.
    - Correct problems in manually flushing completed commands on the controller. The FIFOs are now flushed to ensure that completed commands that are still draining to the host are completed correctly.
    - Correct incomplete CDB delivery detection on the 790XB.
    - Ignore the cmd->underflow field since userland applications using the legacy command pass-thru interface do not set it correctly. Honoring this field led to spurious errors when users used the "scsi_unique_id" program.
    - Perform timeout recovery within the driver instead of relying on the Linux SCSI mid-layer to perform this function. The mid-layer does not know the full state of the SCSI bus and is therefore prone to looping for several minutes to effect recovery. The new scheme recovers within 15 seconds of the failure.
    - Correct support for manual termination settings.
    - Increase maximum wait time for serial eeprom writes allowing writes to function correctly.

1.3.12 (August 11, 2003)
    - Implement new error recovery thread that supersedes the existing Linux SCSI error recovery code.
    - Fix termination logic for 29320ALP.
    - Fix SEEPROM delay to compensate for write ops taking longer.

1.3.11 (July 11, 2003)
    - Fix several deadlock issues.
    - Add 29320ALP and 39320B IDs.

1.3.10 **KERNEL DRIVER**

* Re: AIC7xxx kernel problem with 2.4.2[234] kernels
From: James Bottomley @ 2004-01-19 17:21 UTC
To: Xose Vazquez Perez
Cc: Linux Kernel, Tosatti, linux-scsi, Justin T. Gibbs

On Mon, 2004-01-19 at 08:32, Xose Vazquez Perez wrote:
> It looks like the _kernel_ driver is going to be without a maintainer unless somebody works on it, porting ADAPTEC fixes/features to the kernel driver.

As I told you in private email, this is *not* the way I see it. At the moment, Adaptec is the maintainer of that driver unless they choose formally to relinquish it.

There is a glimmering of a resolution of the problem in an early notification API for command timeouts.

Although throwing away successful completions when error recovery is in progress isn't a bug (scsi commands are either idempotent or non retryable), it's certainly not ideal. I'm thinking about a better framework where we would quiesce the device but pull back from activating the eh thread if all commands return. This would also fix the tag starvation issue that many drivers tackle independently too.

James

* Re: AIC7xxx kernel problem with 2.4.2[234] kernels
From: Justin T. Gibbs @ 2004-01-19 18:38 UTC
To: James Bottomley, Xose Vazquez Perez
Cc: Linux Kernel, Tosatti, linux-scsi

> On Mon, 2004-01-19 at 08:32, Xose Vazquez Perez wrote:
>> It looks like the _kernel_ driver is going to be without a maintainer unless somebody works on it, porting ADAPTEC fixes/features to the kernel driver.
>
> As I told you in private email, this is *not* the way I see it. At the moment, Adaptec is the maintainer of that driver unless they choose formally to relinquish it.

Can you provide your definition of "maintainer"? I know that I am maintainer of the drivers distributed from my website, but I don't feel I have ever been maintainer of the drivers in the 2.4.X, 2.5.X, or 2.6.X trees.

> There is a glimmering of a resolution of the problem in an early notification API for command timeouts.

I'm open to ideas, but from this one-line summary, this sounds like a workaround and not a real solution. Can you say more about your proposal? In my mind, an easy resolution would be to:

1) Let me fix the SCSI layer so that the error recovery handler override already there will actually work - cleanly.

2) Let my drivers use that mechanism.

While working on 1, I would appreciate being able to "maintain" these drivers with their current error recovery workaround in place.

> Although throwing away successful completions when error recovery is in progress isn't a bug (scsi commands are either idempotent or non retryable), it's certainly not ideal.

Most SCSI commands are only idempotent if replayed in the same order as originally issued (consider FSes that rely on write ordering to keep their meta-data coherent). Some commands are retriable but only if they have actually failed.
The mid-layer has no concept currently of these issues, yet it acts on behalf of the peripheral drivers that can better understand how the device they control behaves and act accordingly.

Bugs are defects that render non-ideal behavior. The only question is what types of non-ideal behaviors you are willing to tolerate.

> I'm thinking about a better framework where we would quiesce the device but pull back from activating the eh thread if all commands return. This would also fix the tag starvation issue that many drivers tackle independently too.

That wouldn't help things. For example, let's say that there is one command active on the bus holding up the completion of 32 others. "Waiting for a bit" will never release the other 32 commands. You must abort the bus hog. Once you abort the problem command, you get flooded with the completions of the 32 others. The bus is recovered. You can now safely go about your business. An HBA watchdog handler can properly deal with this situation since it has state that the mid-layer does not.

As for tag starvation, just inserting a periodic ordered tag on devices that show signs of starvation is a much better approach than shutting down the flow of commands to the whole controller at the first sign of trouble. Luckily, most vendors stopped making drives with tag starvation issues in the mid-90's. For this reason, the tag starvation code in my drivers is off by default, but can be enabled via a module or kernel command line option.

--
Justin
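
The "periodic ordered tag" idea in the message above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the aic7xxx or aic79xx drivers: the names `dev_tag_state`, `next_tag_type`, and `ordered_interval` are hypothetical, and the sketch issues an ordered tag every N commands once the workaround is switched on, rather than detecting "signs of starvation" as the real driver presumably does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tag types, named after the SCSI simple/ordered queue tags. */
enum tag_type { TAG_SIMPLE, TAG_ORDERED };

/* Hypothetical per-device bookkeeping for the starvation workaround. */
struct dev_tag_state {
    bool workaround_enabled;         /* off by default, as in the message */
    unsigned int ordered_interval;   /* assumed tunable: commands per ordered tag */
    unsigned int cmds_since_ordered; /* commands issued since the last ordered tag */
};

/*
 * Pick the tag type for the next command sent to one device.
 * Only this device is affected; other devices on the controller
 * keep receiving commands as usual.
 */
enum tag_type next_tag_type(struct dev_tag_state *dev)
{
    assert(dev != NULL);

    if (!dev->workaround_enabled || dev->ordered_interval == 0)
        return TAG_SIMPLE;

    dev->cmds_since_ordered++;
    if (dev->cmds_since_ordered >= dev->ordered_interval) {
        /* The ordered tag forces the drive to finish older commands first. */
        dev->cmds_since_ordered = 0;
        return TAG_ORDERED;
    }
    return TAG_SIMPLE;
}
```

With an interval of 3, every third command on that device would carry an ordered tag, and a device with the workaround disabled would only ever see simple tags.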

* Re: AIC7xxx kernel problem with 2.4.2[234] kernels
From: James Bottomley @ 2004-01-20 0:50 UTC
To: Justin T. Gibbs
Cc: Xose Vazquez Perez, Linux Kernel, Tosatti, linux-scsi

On Mon, 2004-01-19 at 13:38, Justin T. Gibbs wrote:
> Can you provide your definition of "maintainer"? I know that I am maintainer of the drivers distributed from my website, but I don't feel I have ever been maintainer of the drivers in the 2.4.X, 2.5.X, or 2.6.X trees.

A maintainer is a person who works with the kernel community to keep the driver (or subsystem, filesystem or whatever) up to date. Such a person may or possibly may not have an entry in the MAINTAINERS file.

If you want to maintain a reference driver and have someone else do the legwork with the community, that's fine by me...do you have someone in mind, or should I find this person?

> I'm open to ideas, but from this one line summary, this sounds like a workaround and not a real solution. Can you say more about your proposal?

It actually wasn't mine, it was Alan Cox's. There was a thread about it (on which you were cc'd), but I'm currently on a 'plane to NY for LinuxWorld and don't have it handy.

> In my mind, an easy resolution would be to:
>
> 1) Let me fix the SCSI layer so that the error recovery handler override already there will actually work - cleanly.
>
> 2) Let my drivers use that mechanism.
>
> While working on 1, I would appreciate being able to "maintain" these drivers with their current error recovery workaround in place.

Well, I'm thinking of having a class based error recovery scheme, built upon an extension of the transport class patch that has been floating around this list. However, my problem is that the aic7xxx/79xx chips are basically SPI, and therefore, even under the new scheme, should be using the SPI recovery class.
Therefore, just providing an override mechanism for all drivers to use isn't what I want. What I want is a robust SPI recovery mechanism usable by all.

This is what "working with the kernel community" means. If there's a bug, I don't want it fixed by driver workarounds, I want it fixed in the core code. Having driver writers ignore the APIs and roll their own will simply create problems.

> Most SCSI commands are only idempotent if replayed in the same order as originally issued (consider FSes that rely on write ordering to keep their meta-data coherent). Some commands are retriable but only if they have actually failed. The mid-layer has no concept currently of these issues, yet it acts on behalf of the peripheral drivers that can better understand how the device they control behaves and act accordingly.
>
> Bugs are defects that render non-ideal behavior. The only question is what types of non-ideal behaviors you are willing to tolerate.

This is the old barrier debate. The scsi subsystem does not advertise an ordering property to the block layer and thus is not required to preserve order over error recovery. This problem, therefore, does not exist in linux. We had this debate years ago...the upshot being that the performance benefits of order preservation were uncertain at best so it was never implemented. Linux works just fine without it.

> That wouldn't help things. For example, let's say that there is one command active on the bus holding up the completion of 32 others. "Waiting for a bit" will never release the other 32 commands. You must abort the bus hog. Once you abort the problem command, you get flooded with the completions of the 32 others. The bus is recovered. You can now safely go about your business. An HBA watchdog handler can properly deal with this situation since it has state that the mid-layer does not.

I don't understand this.
If by "active on the bus" you mean is holding the bus in a busy state, then you cannot get access to the bus to send an abort or a device reset...the only recourse is a bus reset...which the mid layer will do.

If the drive has actually freed the bus but lost the tag, then it's a drive queueing bug, and the solution is usually to lower the TCQ depth (we should probably have a blacklist for this). This is where the mid layer quiesce is good...if all the other commands complete, the bus is free and we still don't get a response from the missing command, then you know the drive firmware lost it, and the driver should adjust the queue depth downwards.

If the drive is just off servicing other tags, then it's tag starvation.

> As for tag starvation, just inserting a periodic ordered tag on devices that show signs of starvation is a much better approach than shutting down the flow of commands to the whole controller at the first sign of trouble. Luckily, most vendors stopped making drives with tag starvation issues in the mid-90's. For this reason, the tag starvation code in my drivers is off by default, but can be enabled via a module or kernel command line option.

Well, I have to deal with old hardware...

James
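
The three cases distinguished in the message above (bus held busy, tag lost by the drive, tag starvation) amount to a small decision procedure. The sketch below is a hedged illustration of that reasoning only; it is not mid-layer code, and every name in it (`timeout_snapshot`, `classify_timeout`, the `ACT_*` values) is hypothetical. It also assumes the caller can observe whether the bus is free, which is exactly the point disputed later in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of the bus at the time a command is found to be late. */
enum bus_state { BUS_HELD_BUSY, BUS_FREE };

/* Hypothetical outcomes, one per case described in the message. */
enum recovery_action {
    ACT_BUS_RESET,          /* bus is held busy: no message can be sent */
    ACT_LOWER_QUEUE_DEPTH,  /* drive lost the tag: reduce the TCQ depth */
    ACT_TAG_STARVATION      /* drive is busy servicing other tags */
};

/* Snapshot taken after the device has been quiesced for a while. */
struct timeout_snapshot {
    enum bus_state bus;
    unsigned int other_cmds_outstanding; /* commands besides the late one */
};

enum recovery_action classify_timeout(const struct timeout_snapshot *s)
{
    assert(s != NULL);

    /* A device holding the bus blocks aborts and device resets alike. */
    if (s->bus == BUS_HELD_BUSY)
        return ACT_BUS_RESET;

    /* Everything else completed, bus is free, one command never answered. */
    if (s->other_cmds_outstanding == 0)
        return ACT_LOWER_QUEUE_DEPTH;

    /* Other commands are still being serviced ahead of the late one. */
    return ACT_TAG_STARVATION;
}
```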

* Re: AIC7xxx kernel problem with 2.4.2[234] kernels
From: Justin T. Gibbs @ 2004-01-20 2:02 UTC
To: James Bottomley
Cc: Xose Vazquez Perez, Linux Kernel, Tosatti, linux-scsi

> On Mon, 2004-01-19 at 13:38, Justin T. Gibbs wrote:
>> Can you provide your definition of "maintainer"? I know that I am maintainer of the drivers distributed from my website, but I don't feel I have ever been maintainer of the drivers in the 2.4.X, 2.5.X, or 2.6.X trees.
>
> A maintainer is a person who works with the kernel community to keep the driver (or subsystem, filesystem or whatever) up to date. Such a person may or possibly may not have an entry in the MAINTAINERS file.

Does the maintainer have the ability to veto changes that harm the code they maintain? In other words, you claim that I am the maintainer of the drivers in the kernel.org tree. This has not prevented changes from being made to these drivers without adequate review. Even your last update to the driver threw away all of the changelog state and left at least the aic79xx driver in a worse state than it was in before (see changelog entries for the driver versions after the one that you imported for details - this was exactly why I didn't submit that particular revision). You didn't even bother to ask me if importing 1.3.11 was appropriate. This is why I say I don't feel like a maintainer. I'm not given adequate control over the end product yet I'm supposed to take the blame when it doesn't work.

>> I'm open to ideas, but from this one line summary, this sounds like a workaround and not a real solution. Can you say more about your proposal?
>
> It actually wasn't mine, it was Alan Cox's.
> There was a thread about it (on which you were cc'd), but I'm currently on a 'plane to NY for LinuxWorld and don't have it handy.

That proposal was to allow the timeout handler to be redirected. This is different than an early notification. Allowing the timeout handler to be redirected is a required step toward making the recovery code work.

> Well, I'm thinking of having a class based error recovery scheme, built upon an extension of the transport class patch that has been floating around this list.

That's all fine for "status based" recovery. All I'm trying to resolve are issues with watchdog recovery. Can we limit the discussion to that area?

> However, my problem is that the aic7xxx/79xx chips are basically SPI, and therefore, even under the new scheme, should be using the SPI recovery class. Therefore, just providing an override mechanism for all drivers to use isn't what I want. What I want is a robust SPI recovery mechanism usable by all.

I understand that, but it just isn't possible to do well for watchdog recovery. For status based recovery, sure.

> This is what "working with the kernel community" means. If there's a bug, I don't want it fixed by driver work arounds, I want it fixed in the core code. Having driver writers ignore the APIs and roll their own will simply create problems.

In this case, the bug is that the mid-layer tries to handle watchdog recovery on its own. It will never, in my opinion, having reviewed lots of systems that have tried to do it in a centralized way, work well. The mid-layer just doesn't have the necessary state to make intelligent decisions and exporting that state will always be cumbersome and incomplete.

>> Most SCSI commands are only idempotent if replayed in the same order as originally issued (consider FSes that rely on write ordering to keep their meta-data coherent). Some commands are retriable but only if they have actually failed.
>> The mid-layer has no concept currently of these issues, yet it acts on behalf of the peripheral drivers that can better understand how the device they control behaves and act accordingly.
>>
>> Bugs are defects that render non-ideal behavior. The only question is what types of non-ideal behaviors you are willing to tolerate.
>
> This is the old barrier debate.

Not entirely. Tapes are allowed to accept multiple commands and some FCTape drives do. But even if you throw away that argument completely, you still haven't resolved how to deal with retriable commands that are only retriable if they have actually failed. My feeling is that any situation where the mid-layer or HBA drivers fail to provide complete and accurate state for the commands that are completed is a bug. The peripheral drivers cannot do their job if they aren't given good information.

>> That wouldn't help things. For example, let's say that there is one command active on the bus holding up the completion of 32 others. "Waiting for a bit" will never release the other 32 commands. You must abort the bus hog. Once you abort the problem command, you get flooded with the completions of the 32 others. The bus is recovered. You can now safely go about your business. An HBA watchdog handler can properly deal with this situation since it has state that the mid-layer does not.
>
> I don't understand this. If by "active on the bus" you mean is holding the bus in a busy state, then you cannot get access to the bus to send an abort or a device reset...the only recourse is a bus reset...which the mid layer will do.

If we are talking SPI, then aborts, device resets, and lun resets are all handled with either message bytes transmitted via a message phase, or via command packets with the task management function set appropriately.
If you cannot send an abort message, you cannot send any message, so claiming that a BDR request will resolve the problem doesn't make any sense if you believe that a device active on the bus prevents aborts from working.

In any event, just because a device is active on the bus doesn't mean that the bus is hung and that you cannot abort a command. By raising the ATN line, the target may decide to change phase to accept your message byte. It may not, but if it doesn't, then your only recourse is to reset the bus. Looping through all the other commands that happen to be stalled and asking the driver to abort them will only delay the inevitable.

> If the drive has actually freed the bus but lost the tag, then it's a drive queueing bug, and the solution is usually to lower the TCQ depth (we should probably have a blacklist for this). This is where the mid layer quiesce is good...if all the other commands complete, the bus is free and we still don't get a response from the missing command, then you know the drive firmware lost it, and the driver should adjust the queue depth downwards.

How does the mid-layer know that the "bus is free"? What transports even have this concept? If one drive has lost a command, and the transport is functioning normally, why are you penalizing the other devices attached to the HBA while you "sort this out"? There is no need to do that.

As for reducing the queue depth in response to repeated timeouts by a device, this is easy enough to do with your "multi-layered", status based recovery code. All that is required is for the HBA to tell you that a particular command was aborted due to timeout as well as indicate what side-effects occurred because of the abort process (bus reset, device reset, lun reset, LIP, etc).
Some of the latter is already provided for by the reset and bus reset entry points, but a better solution would be to have a single "async event" callback that can encompass any transport notifications needed by SPI, FC, SAS, and any future transports without adding more entry points.

> If the drive is just off servicing other tags, then it's tag starvation.

>> As for tag starvation, just inserting a periodic ordered tag on devices that show signs of starvation is a much better approach than shutting down the flow of commands to the whole controller at the first sign of trouble. Luckily, most vendors stopped making drives with tag starvation issues in the mid-90's. For this reason, the tag starvation code in my drivers is off by default, but can be enabled via a module or kernel command line option.
>
> Well, I have to deal with old hardware...

Sure, just don't penalize the other disks on the transport because you have one disk out there that is affected by this issue.

--
Justin
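
The single "async event" callback proposed above could look something like the sketch below. This is an assumption-laden illustration, not an existing Linux SCSI interface: the event names, the `async_event_cb` signature, and the `hba` structure are all invented here to show how one entry point could carry the side effects listed in the message (bus reset, device reset, lun reset, LIP) instead of one entry point per event.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical transport events, taken from the list in the message. */
enum async_event {
    AEN_BUS_RESET,
    AEN_DEVICE_RESET,
    AEN_LUN_RESET,
    AEN_LIP,
    AEN_COUNT /* number of event kinds, not an event itself */
};

/* One callback for every transport notification (assumed signature). */
typedef void (*async_event_cb)(void *ctx, enum async_event ev,
                               int target, int lun);

/* Hypothetical HBA handle holding the registered callback. */
struct hba {
    async_event_cb cb;
    void *ctx;
};

/* Called by the HBA driver whenever recovery has a side effect. */
void hba_post_event(const struct hba *h, enum async_event ev,
                    int target, int lun)
{
    assert(h != NULL);
    if (h->cb != NULL)
        h->cb(h->ctx, ev, target, lun);
}

/* Example consumer: count events so upper layers can react to them. */
struct event_counts {
    unsigned int seen[AEN_COUNT];
};

void count_event(void *ctx, enum async_event ev, int target, int lun)
{
    struct event_counts *c = ctx;

    (void)target; /* unused in this minimal consumer */
    (void)lun;
    assert(c != NULL && ev < AEN_COUNT);
    c->seen[ev]++;
}
```

Adding a new transport event in this scheme means adding an enum value rather than a new entry point, which seems to be the design point the message is arguing for.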

* Re: AIC7xxx kernel problem with 2.4.2[234] kernels
From: James Bottomley @ 2004-01-20 4:45 UTC
To: Justin T. Gibbs
Cc: Xose Vazquez Perez, Linux Kernel, Tosatti, linux-scsi

On Mon, 2004-01-19 at 21:02, Justin T. Gibbs wrote:
> Does the maintainer have the ability to veto changes that harm the code they maintain? In other words, you claim that I am the maintainer of the drivers in the kernel.org tree. This has not prevented changes from being made to these drivers without adequate review. Even your last update to the driver threw away all of the changelog state and left at least the aic79xx driver in a worse state than it was in before (see changelog entries for the driver versions after the one that you imported for details - this was exactly why I didn't submit that particular revision).

I said "works with the kernel community". It's not about control, it's about co-operation. The control you seek simply does not exist in the kernel development process.

> You didn't even bother to ask me if importing 1.3.11 was appropriate. This is why I say I don't feel like a maintainer. I'm not given adequate control over the end product yet I'm supposed to take the blame when it doesn't work.

In the previous thread about the driver you said "You can integrate the driver at whatever revision suits you.", so I took you at your word; if that wasn't what you meant, it's a little late to whine about it now. Small bug fixes would, as ever, be welcome...

As for blame, apart from the occasional flamewar, the community seems generally welcoming of anyone who provides fixes. We tend to be more interested in fixing things than assigning blame.

> That proposal was to allow the timeout handler to be redirected.
> This is different than an early notification. Allowing the timeout handler to be redirected is a required step toward making the recovery code work.

The recovery code does work. You may want it to work differently, and that may make it work better, but that's an enhancement not a bug fix.

> In this case, the bug is that the mid-layer tries to handle watchdog recovery on its own. It will never, in my opinion, having reviewed lots of systems that have tried to do it in a centralized way, work well. The mid-layer just doesn't have the necessary state to make intelligent decisions and exporting that state will always be cumbersome and incomplete.

But it does do it successfully. Something that currently works but could work better is an enhancement not a bug.

> How does the mid-layer know that the "bus is free". What transports even have this concept? If one drive has lost a command, and the transport is functioning normally, why are you penalizing the other devices attached to the HBA while you "sort this out"? There is no need to do that.

Again, this is a "could do better", not a required bug fix.

I'm not against enhancements, even at this late stage in the stabilisation process. However, they have to be small, self contained and obviously correct. If you have them, send them to the list and they'll get reviewed.

James

* Re: AIC7xxx kernel problem with 2.4.2[234] kernels
From: Justin T. Gibbs @ 2004-01-20 5:43 UTC
To: James Bottomley
Cc: Xose Vazquez Perez, Linux Kernel, Tosatti, linux-scsi

> On Mon, 2004-01-19 at 21:02, Justin T. Gibbs wrote:
>> Does the maintainer have the ability to veto changes that harm the code they maintain? In other words, you claim that I am the maintainer of the drivers in the kernel.org tree. This has not prevented changes from being made to these drivers without adequate review. Even your last update to the driver threw away all of the changelog state and left at least the aic79xx driver in a worse state than it was in before (see changelog entries for the driver versions after the one that you imported for details - this was exactly why I didn't submit that particular revision).
>
> I said "works with the kernel community". It's not about control, it's about co-operation. The control you seek simply does not exist in the kernel development process.

Then I ask again, what does it mean to be a maintainer? It sounds like I'm on equal footing with anyone who decides to post some patch to the lists. I've lost count of the number of occasions that some random patch from some random individual was accepted without any consultation with "the maintainer" of these drivers. The end result was more email in my mailbox complaining about "the broken driver that I maintain."

As for control, the type of control "I seek" does exist. You have it. You can also delegate some of that control if it suits you. A maintainer takes on responsibility to ensure that something is maintained and works. Without some level of control, how can the maintainer fulfill that responsibility?

>> You didn't even bother to ask me if importing 1.3.11 was appropriate.
This >> is why I say I don't feel like a maintainer. I'm not given adequate control >> over the end product yet I'm supposed to take the blame when it doesn't work. > > In the previous thread about the driver you said "You can integrate the > driver at whatever revision suits you.", so I took you at your word; if > that wasn't what you meant, it's a little late to whine about it now. > Small bug fixes would, as ever, be welcome... I provided all of the information required for you to make a reasoned decision of which change sets to integrate. I had no idea that you would completely disregard the wealth of information in the change sets and change set comments when coming up with an integration point. Your actions show that you didn't review or understand the changes well enough to submit them into the tree. You probably didn't even test the resulting driver on real hardware before you submitted the changes. > The recovery code does work. You may want it to work differently, and > that may make it work better, but that's an enhancement not a bug fix. No. The recovery code doesn't work. Many of the people that know this don't bother complaining to you about it. They complain to the HBA driver authors and the tech support departments of the companies that make the HBAs. The HBA driver authors then do what they have to ensure that the system remains viable after recovery. I mean honestly. Do you think I would have gone to all of the trouble I did in doing my own watchdog recovery if the recovery code worked correctly? Or that I would stand so firm in my position if these issues didn't have real customer impact? -- Justin
* Re: AIC7xxx kernel problem with 2.4.2[234] kernels 2004-01-20 5:43 ` Justin T. Gibbs @ 2004-01-22 5:14 ` James Bottomley 0 siblings, 0 replies; 11+ messages in thread From: James Bottomley @ 2004-01-22 5:14 UTC To: Justin T. Gibbs; +Cc: Xose Vazquez Perez, Linux Kernel, Tosatti, linux-scsi On Tue, 2004-01-20 at 00:43, Justin T. Gibbs wrote: > As for control, the type of control "I seek" does exist. You have it. > You can also delegate some of that control if it suits you. Well, as you have heard from the horse's mouth: I don't. > I provided all of the information required for you to make a reasoned > decision of which change sets to integrate. I had no idea that you > would completely disregard the wealth of information in the change sets > and change set comments when coming up with an integration point. Your > actions show that you didn't review or understand the changes well enough > to submit them into the tree. You probably didn't even test the resulting > driver on real hardware before you submitted the changes. Actually, I would have done nothing but for some 2.6 migration reports of total lockups with the then in-tree aic79xx driver. The patch that went into the tree was tested by the people reporting the lockups. > > The recovery code does work. You may want it to work differently, and > > that may make it work better, but that's an enhancement not a bug fix. > > No. The recovery code doesn't work. Many of the people that know this > don't bother complaining to you about it. They complain to the HBA driver > authors and the tech support departments of the companies that make the HBAs. > The HBA driver authors then do what they have to ensure that the system > remains viable after recovery. You haven't outlined any incorrect cases in your emails, just "could do better" cases. If you have all these bug reports that you haven't been passing on, could you at least distil them to the mid-layer failure scenario that we can discuss fixing?
> I mean honestly. Do you think I would have gone to all of the trouble > I did in doing my own watchdog recovery if the recovery code worked > correctly? Or that I would stand so firm in my position if these issues > didn't have real customer impact? Well, in coming up with the mid-layer changes from 2.4 to 2.6 I did look at what issues the main drivers had workarounds for. I used these workarounds and an email you sent in September 2002 as the basis for a lot of the mid-layer changes in 2.6. None of the other drivers does this timer interception and the issue wasn't mentioned in your email, so I am dubious about the seriousness of the impact. The way fixes get into Linux is either lots of people complain, or one person sends a fix; neither has happened in this case, which again leads me to suspect that it's not a huge problem. The still outstanding question is, now that you have a clearer idea what being a Maintainer entails, do you wish to be the maintainer for aic7xxx? James
* Re: AIC7xxx kernel problem with 2.4.2[234] kernels 2004-01-20 4:45 ` James Bottomley 2004-01-20 5:43 ` Justin T. Gibbs @ 2004-01-20 11:24 ` Chiaki 1 sibling, 0 replies; 11+ messages in thread From: Chiaki @ 2004-01-20 11:24 UTC To: linux-scsi Cc: James Bottomley, Justin T. Gibbs, Xose Vazquez Perez, Linux Kernel, Tosatti I have a feeling that Linus summed up what the maintainer is like in the context of Linux kernel development. But I have a comment. > The recovery code does work. Maybe. I have not tried a few problematic devices under my PC lately. These devices usually caused trouble under the 2.2.xx series, and even under the late 2.3.y series for a while. The symptom was essentially a reset storm that made the system unusable. Given the various patches accumulating, maybe the symptom is tolerable now, but again I see some mention of unusable system response even today. So I suspect the problem is still there for certain types of hardware problems. > You may want it to work differently, and > that may make it work better, but that's an enhancement not a bug fix. To people who have been bitten by such unusable system symptoms the above statement simply won't pass. It is essentially a "performance *bug*" and should be corrected IMHO. > But it does do it successfully. Something that currently works but > could work better is an enhancement not a bug. Again, to those people this is a correctable (and should be corrected) performance "bug". > I'm not against enhancements, even at this late stage in the > stabilisation process. I am a little confused here. Are we talking about the 2.4 series? OK, the subject line states 2.4.2[234]. I believe that there is a large user base, especially people who use server-type machines with a SCSI interface (and AIC chips seem to be popular among these machines), who would appreciate the enhanced (== performance bug corrected) version of the SCSI subsystem.
I, for one, don't use AIC chips on my home PCs, but do have some machines at the office which use them and will appreciate an "enhanced" SCSI subsystem after all these years. As for 2.6.zz, isn't there any chance of introducing hooks into the EH framework? The previous discussion suggested that it needs to wait for the 2.7 series. Too bad :-( I feel that these error handling issues of the SCSI subsystem will have to be solved once and for all, sooner or later, in the mainline; otherwise the vendors of commercial distributions will probably need to keep a separate tree for a long time to come (which they may have to anyway, to deal with other quirks of the mainline kernel, etc.), and this is rather a waste of man-power resources IMHO. In any case, with all due respect I don't think that the discussion goes anywhere unless we recognize that someone's "mere enhancement" is actually other people's "serious performance bug correction". I, for one, tend to see the topic discussed as a performance "BUG" and so am a little frustrated at the pace at which the error handling scheme is being improved. This is just a comment from a third party observer who, unfortunately, doesn't have the time to dig into the code and offer a patch. (Yes, I actually tried once during the 2.2.xx time-frame but was then repulsed by the spaghetti code of the time and gave up.) PS: One other thing is that this type of bug is hard to trigger unless you have a controlled facility or some seemingly working and yet faulty devices which generate a bad condition in a short time, say a few minutes into the operation. So I agree that not all people see such problems. Intermittently faulty SCSI devices are rather rare, aren't they? A SCSI device such as a disk is either completely dead or healthy. Finding a faulty device that triggers an error condition from time to time is probably the key to observing the problematic symptom being discussed.
I wonder if some disk manufacturers or someone could produce a special firmware to generate an error condition every minute or so and send such disks to SCSI subsystem developers :-) PPS: Some would argue that if such devices are so rare then we can ignore them. Heck, no! I have seen Solaris log files where such faulty behavior occurred from time to time and was dealt with very gracefully without the system being rendered unusable. So the ratio of such devices is small, but the sheer number of computer installations today makes such incidents visible indeed. -- int main(void){int j=2003;/*(c)2003 cishikawa. */ char t[] ="<CI> @abcdefghijklmnopqrstuvwxyz.,\n\""; char *i ="g>qtCIuqivb,gCwe\np@.ietCIuqi\"tqkvv is>dnamz"; while(*i)((j+=strchr(t,*i++)-(int)t),(j%=sizeof t-1), (putchar(t[j])));return 0;}/* under GPL */
* Re: AIC7xxx kernel problem with 2.4.2[234] kernels 2004-01-20 2:02 ` Justin T. Gibbs 2004-01-20 4:45 ` James Bottomley @ 2004-01-20 7:15 ` Linus Torvalds 2004-01-20 8:30 ` Andre Hedrick 1 sibling, 1 reply; 11+ messages in thread From: Linus Torvalds @ 2004-01-20 7:15 UTC To: Justin T. Gibbs Cc: James Bottomley, Xose Vazquez Perez, Linux Kernel, Tosatti, linux-scsi On Mon, 19 Jan 2004, Justin T. Gibbs wrote: > > Does the maintainer have the ability to veto changes that harm the > code they maintain? Nope. Nobody has that right. Even _I_ don't veto changes that the right people push (my motto: "everybody is wrong sometimes: when enough people complain, even I am wrong"). In particular, maintainers of "conceptually higher" generally always have priority. If Al Viro says a filesystem is doing something wrong from a VFS standpoint, then that filesystem is broken - regardless of whether the filesystem maintainer agrees or not. Because the VFS layer requirements trump any low-level filesystem issues. But perhaps more importantly (and it's the reason even _I_ don't have the right, regardless of how high up in the maintainership chain I am), nobody has veto-power over anything. That's to keep people honest: nobody should _ever_ think that they are "in control", and that nobody else can replace them. In other words: maintainership is not ownership. It's a stewardship. End result: maintainership is a nasty and mostly unthankful job. It doesn't really give many privileges, and most of what it does is just have people complain to you about bugs. The satisfaction is there, of course, but And finally: maintainership is largely about working with people. There's some code in there too, but people tend to be more important. Linus
* Re: AIC7xxx kernel problem with 2.4.2[234] kernels 2004-01-20 7:15 ` Linus Torvalds @ 2004-01-20 8:30 ` Andre Hedrick 0 siblings, 0 replies; 11+ messages in thread From: Andre Hedrick @ 2004-01-20 8:30 UTC To: Linus Torvalds; +Cc: Linux Kernel, linux-scsi Linus, Would have been nice to list such rules at the top of the MAINTAINERS file, this may have saved me some grief, well maybe not ... Andre Hedrick LAD Storage Consulting Group On Mon, 19 Jan 2004, Linus Torvalds wrote: > > > On Mon, 19 Jan 2004, Justin T. Gibbs wrote: > > > > Does the maintainer have the ability to veto changes that harm the > > code they maintain? > > Nope. Nobody has that right. > > Even _I_ don't veto changes that the right people push (my motto: > "everybody is wrong sometimes: when enough people complain, even I am > wrong"). > > In particular, maintainers of "conceptually higher" generally always have > priority. If Al Viro says a filesystem is doing something wrong from a VFS > standpoint, then that filesystem is broken - regardless of whether the > filesystem maintainer agrees or not. Because the VFS layer requirements > trump any low-level filesystem issues. > > But perhaps more importantly (and it's the reason even _I_ don't have the > right, regardless of how high up in the maintainership chain I am), nobody > has veto-power over anything. That's to keep people honest: nobody should > _ever_ think that they are "in control", and that nobody else can replace > them. > > In other words: maintainership is not ownership. It's a stewardship. > > End result: maintainership is a nasty and mostly unthankful job. It > doesn't really give many privileges, and most of what it does is just have > people complain to you about bugs. The satisfaction is there, of course, > but > > And finally: maintainership is largely about working with people. > There's some code in there too, but people tend to be more important.
> > Linus