* AIC94XX discovery timeout problem details...
From: malahal @ 2006-09-15 7:20 UTC
To: linux-scsi

I chased the timeout problem and found that the PORTE_BYTES_DMAED port
event must be answered by a call to lldd_port_formed(), which will
update the PHY_IS_UP and port-links fields in DDB 0. Because a single
thread handles PHY/port events as well as discovery, we can't handle the
PORTE_BYTES_DMAED event until discovery is complete. As a result, SCSI
commands time out in the discovery thread no matter what I do! This
problem may be unique to the Vitesse expander; see the PS for details.

I tried using two threads, one for events and the other for discovery.
That avoided the timeout problems, but the discovery thread would die
after a few iterations because the event thread and the discovery thread
raced each other setting up and tearing down sysfs objects.

I also tried calling lldd_port_formed() with the appropriate phy_mask
from notify_port_event() itself. That worked fine, but it is just a hack
for now!

Any comments, suggestions?

Thanks, Malahal.

PS: I instrumented the code to use only a single PHY out of a 4-PHY wide
port by returning early from sas_form_port() for all PHYs except one. In
other words, I ignored the PORTE_BYTES_DMAED event for all PHYs but one.
There were SCSI timeouts every time I did that, except when the PHY I was
using was phy3. When I swapped cables, the PHY that worked changed too, so
I believe it is tied to a specific Vitesse expander PHY. Is it possible
that the Vitesse expander uses a specific PHY to respond to some SCSI
requests even though the adapter sent the request on a different PHY? Is
communicating on two different PHYs for a single SCSI command allowed by
the spec? Also note that the discovery commands (SMP) work just fine.
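A toy model may make the ordering problem easier to see. The sketch below is a minimal C simulation, not libsas or aic94xx code: `ddb0_port_links`, `toy_lldd_port_formed()` and the other names are hypothetical, DDB 0 is reduced to a bitmask, and it assumes (as the PS suspects) that the expander answers the SSP command on a second PHY. It compares the single-worker ordering with the hack of telling the LLDD directly from the notify path.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not libsas/aic94xx code. Bit i set means "phy i is marked
 * up in the LLDD's DDB 0 port-links field". */
static unsigned ddb0_port_links;

/* Hypothetical stand-in for lldd_port_formed(): record the phys now in the port. */
static void toy_lldd_port_formed(unsigned phy_mask)
{
	assert(phy_mask != 0);
	ddb0_port_links |= phy_mask;
}

/* An SSP command completes only if the phy the expander answers on is
 * already marked up in DDB 0. */
static bool toy_ssp_command_completes(unsigned answer_phy)
{
	return (ddb0_port_links & (1u << answer_phy)) != 0;
}

/*
 * Replays the failing sequence on one worker thread:
 *   1. phy 0's PORTE_BYTES_DMAED is handled and discovery starts inline;
 *   2. phy 1's PORTE_BYTES_DMAED arrives while discovery is still running;
 *   3. discovery sends an SSP command and the expander answers on phy 1.
 * notify_updates_lldd selects the hack: tell the LLDD from the notify
 * path instead of waiting for the busy worker.
 */
static bool toy_discovery_succeeds(bool notify_updates_lldd)
{
	ddb0_port_links = 0;

	toy_lldd_port_formed(1u << 0);           /* step 1, on the worker */

	if (notify_updates_lldd)                 /* step 2, notify path */
		toy_lldd_port_formed(1u << 1);

	bool ok = toy_ssp_command_completes(1);  /* step 3 */

	/* Without the hack, phy 1's queued event only runs now. */
	if (!notify_updates_lldd)
		toy_lldd_port_formed(1u << 1);
	return ok;
}
```

`toy_discovery_succeeds(false)` returns false and `toy_discovery_succeeds(true)` returns true. In both cases DDB 0 ends up with phys 0 and 1 marked; the only difference is whether phy 1 is marked before or after discovery needs it.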

* Re: AIC94XX discovery timeout problem details...
From: Luben Tuikov @ 2006-09-18 7:35 UTC
To: malahal, linux-scsi

--- malahal@us.ibm.com wrote:
> I chased the time out problem and found that the PORTE_BYTES_DMAED port
> event must be responded with a call to lldd_port_formed() which will
> update PHY_IS_UP and port-links fields in DDB 0. As there is a single
> thread handling PHY/port events as well as discovery, we really can't
> handle PORTE_BYTES_DMAED event until the discovery is complete. This
> results in SCSI commands timing out in the discovery thread no matter
> what I do! This problem may be unique to Vitesse expander, read the PS
> section for details.
>
> Tried using two threads, one for events and the other for discovery.

Malahal,

If you go back in the archives, and take a look at the SAS Stack
as I submitted it last year, you'll notice that this is exactly how
the code is: there is a separate event thread and a separate
discovery thread.

That is, from inception, my code has always had two separate threads,
one for events and one for discovery.

I wasn't aware that the code had been changed such that a single
thread handles events and discovery. This is a very naive approach,
and a regression over the original code.

> That avoided the timeout problems, but the discovery thread would die
> after few iterations due to the event thread and the discovery thread
> racing each other for setting up and tearing down of sysfs objects.

Indeed, the situation presented in such circumstances is tricky.
These "races" have been dealt with in my (original) code. Currently,
I don't experience any problems with my SAS Stack, as I maintain it.

If any of my original comments are still left in the code or the README
files, that would give you a hint of how such "races" are handled.

> I tried calling lldd_port_formed() with appropriate phy_mask from the
> notify_port_event() itself. That worked fine. It is just a hack for now!

This and your postscript below tell me that you're coming around full
circle to how the original code works. I'd suggest that you take a look
at the original code.

> Any comments, suggestions?

Good luck,
Luben

> Thanks, Malahal.
> PS: I instrumented the code to use only a single PHY out of 4-phy wide
> port by prematurely returning from sas_form_port() for all PHYs except
> one. In other words, I ignored the PORTE_BYTES_DMAED event for all PHYs
> but one. There were SCSI timeouts every time I did that except if the
> PHY I was using is phy3. When I swapped cables, the PHY that worked
> changed too. I believe, it goes with a specific Vitesse expander PHY. Is
> it possible that the Vitesse expander uses a specific PHY to respond to
> some SCSI requests even though the adapter uses a different PHY while
> sending the request? Is this behavior of communicating on two different
> PHYs for a single SCSI command is allowed in the spec? Also note that
> the discovery commands (SMP) just work fine.

* Re: AIC94XX discovery timeout problem details...
From: malahal @ 2006-09-19 20:59 UTC
To: Luben Tuikov; +Cc: linux-scsi

I am planning to process port/phy events in two phases. Phase I sets up
the asd_sas_port and asd_sas_phy lists and calls the LLDD with the
correct phy_mask. Phase II, which sets up the sysfs objects, is done in
thread context by queuing it to the SCSI work queue. There would still be
just one thread setting up and tearing down sysfs objects, which should
avoid any races.

I would appreciate any suggestions, or any issues you see with this
approach.

Thanks, Malahal.
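One way to read the two-phase plan, as a minimal sketch: phase I runs synchronously when the event is reported and only touches in-memory bookkeeping plus the LLDD call, while phase II (the sysfs work) is pushed onto a single FIFO drained by one worker, so sysfs setup and teardown cannot race each other. `phase1_notify()`, `worker_drain()`, the fixed-size ring and the `sysfs_present` flags are hypothetical stand-ins for the notify path, the SCSI work queue and real sysfs registration.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PHYS 8
#define QLEN 16

enum work_kind { SYSFS_ADD_PHY, SYSFS_DEL_PHY };
struct work { enum work_kind kind; unsigned phy; };

/* Phase I state: updated immediately, in the caller's context. */
static unsigned port_phy_mask;   /* stand-in for the asd_sas_port phy list */
static unsigned lldd_phy_mask;   /* what the LLDD was last told */

/* Phase II state: one FIFO, drained by one worker. */
static struct work queue[QLEN];
static unsigned q_head, q_tail;
static bool sysfs_present[MAX_PHYS];  /* stand-in for per-phy sysfs objects */

static void queue_work(enum work_kind kind, unsigned phy)
{
	assert(q_tail - q_head < QLEN);  /* ring must not overflow */
	queue[q_tail++ % QLEN] = (struct work){ kind, phy };
}

/* Phase I: bookkeeping plus the LLDD call; never touches sysfs. */
static void phase1_notify(unsigned phy, bool up)
{
	assert(phy < MAX_PHYS);
	if (up)
		port_phy_mask |= 1u << phy;
	else
		port_phy_mask &= ~(1u << phy);
	lldd_phy_mask = port_phy_mask;   /* "calls the LLDD with the correct phy_mask" */
	queue_work(up ? SYSFS_ADD_PHY : SYSFS_DEL_PHY, phy);
}

/* Phase II: the only code that touches the sysfs stand-ins, strictly in
 * arrival order. Returns the number of work items processed. */
static unsigned worker_drain(void)
{
	unsigned n = 0;

	while (q_head != q_tail) {
		struct work w = queue[q_head++ % QLEN];

		if (w.kind == SYSFS_ADD_PHY) {
			assert(!sysfs_present[w.phy]);  /* no double add */
			sysfs_present[w.phy] = true;
		} else {
			assert(sysfs_present[w.phy]);   /* no delete before add */
			sysfs_present[w.phy] = false;
		}
		n++;
	}
	return n;
}
```

The LLDD sees the full phy_mask as soon as the events are reported, before any sysfs work runs; because a single worker replays add/delete items in order, a delete can never overtake the add it undoes.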

* Re: AIC94XX discovery timeout problem details...
From: James Bottomley @ 2006-09-19 21:32 UTC
To: malahal; +Cc: Luben Tuikov, linux-scsi

On Tue, 2006-09-19 at 13:59 -0700, malahal@us.ibm.com wrote:
> I am planning to process port/phy events in two phases. Phase-I sets up
> asd_sas_port, asd_sas_phy lists and calls the LLDD with a correct
> phy_mask. Phase-II is done in the thread context by queuing to the scsi
> work queue that involves setting up sysfs objects. Still there would be
> just one thread setting up/tearing down sysfs objects that should avoid
> any races.

Erm ... just a minute, I don't understand what the problem is. The
reason you can handle both in a single event thread is that the
PORTE_BYTES_DMAED event either triggers discovery or wide port formation
(if it's the discovery of another phy in the port).

It sounds like the scenario you have is that discovery is kicked off
before the wide port is formed (I actually see this a bit in my LSI
expander as well). This shouldn't be an issue: we're perfectly entitled
to do discovery on a single phy if we so choose and not form the wide
port until after discovery is completed.

We really need to debug this. In your PS you wrote:

> PS: I instrumented the code to use only a single PHY out of 4-phy wide
> port by prematurely returning from sas_form_port() for all PHYs except
> one. In other words, I ignored the PORTE_BYTES_DMAED event for all PHYs
> but one. There were SCSI timeouts every time I did that except if the
> PHY I was using is phy3. When I swapped cables, the PHY that worked
> changed too. I believe, it goes with a specific Vitesse expander PHY. Is
> it possible that the Vitesse expander uses a specific PHY to respond to
> some SCSI requests even though the adapter uses a different PHY while
> sending the request? Is this behavior of communicating on two different
> PHYs for a single SCSI command is allowed in the spec? Also note that
> the discovery commands (SMP) just work fine.

However, SAS specs require us to open connections for SMP and receive
the responses through the connection we opened. The target is
specifically prohibited from responding outside of this protocol
boundary. The theory behind this is that the Host controls port
formation ... until we open multiple connections between the source and
target ports, the actual connection remains narrow. All of this should
support the current discovery model.

I suppose the only way to confirm what's going on would be with a SAS
analyser. Does the host send multiple OPENs even before the
PORTE_BYTES_DMAED is responded to?

James
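James's point, that PORTE_BYTES_DMAED either starts discovery or widens an existing port, can be sketched as follows. This is a simplified, hypothetical model, not the actual sas_form_port(): a PHY joins an existing port when its attached SAS address matches that port's, otherwise it gets a new narrow port and discovery is started on it. The struct, array size and counter are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PORTS 4

struct toy_port {
	uint64_t attached_sas_addr;  /* 0 means the slot is unused */
	unsigned phy_mask;
};

static struct toy_port ports[MAX_PORTS];
static unsigned discoveries_started;

/* Handle a BYTES_DMAED-style event for @phy whose link partner reported
 * @attached_addr. Returns the index of the port the phy ended up in. */
static int toy_form_port(unsigned phy, uint64_t attached_addr)
{
	int free_slot = -1;

	assert(attached_addr != 0);
	for (int i = 0; i < MAX_PORTS; i++) {
		if (ports[i].attached_sas_addr == attached_addr) {
			/* Same far end: widen the existing port, no new discovery. */
			ports[i].phy_mask |= 1u << phy;
			return i;
		}
		if (ports[i].attached_sas_addr == 0 && free_slot < 0)
			free_slot = i;
	}
	assert(free_slot >= 0);
	/* New far end: a new (narrow) port, and discovery is kicked off. */
	ports[free_slot].attached_sas_addr = attached_addr;
	ports[free_slot].phy_mask = 1u << phy;
	discoveries_started++;
	return free_slot;
}
```

Four PHYs reporting the same attached address end up in one port with discovery started once; in this model the later PHYs only widen the port, which is why discovery on a single PHY and later wide-port formation are compatible in principle.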

* Re: AIC94XX discovery timeout problem details...
From: malahal @ 2006-10-05 0:21 UTC
To: James Bottomley; +Cc: linux-scsi

James,

Jack Hammer from Adaptec gave us SAS analyser output. All SMP commands
work fine, but SSP commands are sent on a PHY that is up and already
updated in DDB 0. In response, the expander tries to open a connection
on a different PHY that is up but not yet updated in DDB 0. The HBA
sends OPEN_REJECT(retry), and the Vitesse expander retries the same path
forever!

Thanks, Malahal.
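The analyser trace describes a livelock rather than a crash. A minimal sketch of it, under stated assumptions: the HBA accepts an inbound OPEN only on a PHY marked in DDB 0 and otherwise answers OPEN_REJECT(retry), and the expander keeps retrying the same PHY. The function names and the `max_attempts` bound are hypothetical; the bound exists only so the model terminates, whereas the trace shows the expander retrying forever.

```c
#include <assert.h>

enum open_reply { OPEN_ACCEPT, OPEN_REJECT_RETRY };

/* HBA side: accept an OPEN only on a phy that DDB 0 records as part of the port. */
static enum open_reply hba_handle_open(unsigned ddb0_links, unsigned phy)
{
	return (ddb0_links & (1u << phy)) ? OPEN_ACCEPT : OPEN_REJECT_RETRY;
}

/* Expander side: retry the same path after every OPEN_REJECT(retry).
 * Returns how many OPENs were sent before one was accepted, or 0 if none
 * was accepted within max_attempts. */
static unsigned expander_send_response(unsigned ddb0_links, unsigned phy,
				       unsigned max_attempts)
{
	assert(max_attempts > 0);
	for (unsigned n = 1; n <= max_attempts; n++) {
		if (hba_handle_open(ddb0_links, phy) == OPEN_ACCEPT)
			return n;
		/* Same phy, same DDB 0 state: nothing changes between tries. */
	}
	return 0;
}
```

With only phy 0 recorded in DDB 0, a response routed to phy 1 never gets through however many retries are allowed; once phy 1 is recorded, the first OPEN is accepted. That is consistent with the observation that calling lldd_port_formed() promptly makes the timeouts go away.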

* Re: AIC94XX discovery timeout problem details...
From: Luben Tuikov @ 2006-09-20 14:55 UTC
To: malahal; +Cc: linux-scsi

--- malahal@us.ibm.com wrote:
> I am planning to process port/phy events in two phases. Phase-I sets up
> asd_sas_port, asd_sas_phy lists and calls the LLDD with a correct
> phy_mask. Phase-II is done in the thread context by queuing to the scsi
> work queue that involves setting up sysfs objects. Still there would be
> just one thread setting up/tearing down sysfs objects that should avoid
> any races.
>
> Would appreciate any suggestions or issues with this approach.

Malahal,

Your perseverance is commendable.

Unfortunately this approach is too simplistic and naive, no offence
intended. That is, such an approach seems to be the first thing that
"pops" into someone's head on how to do it, and a more elaborate
analysis of the protocol is warranted when coding such concepts.

That is, you want to handle the absolute general case of _more than one
type of event_. (This is not intuitive.)

You should take a look at the "patches" submitted by Bottomley and
Hellwig in the areas of event infra, event processing, and discovery,
since I posted the SAS Stack:

- The Priority Queue Without Duplication Implementation (sas_event.c),
  for queueing infinitely many events in finite storage, sorted by
  arrival time, where insertion is O(1) and ordered removal is O(1),
  has been completely removed. For some reason they called it "long
  queue" in the subject line of their message or in the body; I cannot
  remember which.
- Threaded event processing has been completely removed.
- Threaded discovery has been removed for the more naive and simplistic,
  do-it-all-now-at-once way.
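The properties listed for the removed sas_event.c queue (bounded storage, arrival order, O(1) insertion and removal, no duplicates) can be obtained with an intrusive FIFO list plus one preallocated node per distinct event: posting an event that is already pending is a no-op, so storage never exceeds the number of distinct events. The sketch below is a generic illustration of that technique, not a reconstruction of Luben's code; the event set and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event set; one node per distinct event bounds storage. */
enum ev { EV_PHY_UP, EV_PHY_DOWN, EV_BYTES_DMAED, EV_BROADCAST, EV_COUNT };

struct ev_node {
	struct ev_node *prev, *next;
	bool pending;
};

struct ev_queue {
	struct ev_node head;            /* sentinel of a circular list */
	struct ev_node nodes[EV_COUNT]; /* preallocated, never malloc'd */
};

static void evq_init(struct ev_queue *q)
{
	q->head.prev = q->head.next = &q->head;
	for (int i = 0; i < EV_COUNT; i++)
		q->nodes[i].pending = false;
}

/* O(1): append at the tail unless this event is already queued. */
static void evq_post(struct ev_queue *q, enum ev e)
{
	assert((unsigned)e < EV_COUNT);
	struct ev_node *n = &q->nodes[e];

	if (n->pending)
		return;  /* duplicate: keep the earlier arrival position */
	n->pending = true;
	n->prev = q->head.prev;
	n->next = &q->head;
	q->head.prev->next = n;
	q->head.prev = n;
}

/* O(1): remove and return the oldest pending event, or -1 if empty. */
static int evq_pop(struct ev_queue *q)
{
	struct ev_node *n = q->head.next;

	if (n == &q->head)
		return -1;
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->pending = false;
	return (int)(n - q->nodes);
}
```

A repeated event keeps the position of its first, still-pending arrival, and once popped it can be posted again at the tail. Whether the original code made the same choice for duplicates is not stated in this thread.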

Add to this the fact that some things shouldn't be done "right now",
even though they happened "right now", but should be done "way later"
because you want other things which depend on those things to ...
This isn't intuitive, but is the result of careful analysis of the
protocol.

That is, since my original posting of the SAS Stack, Bottomley's
refusal to accept it into the kernel has resulted in what you have
right now.

But look on the bright side: just take a look at the mailing list and
see how many people are now employed by this. Five, six people?

So from another angle, Bottomley's refusal to accept what is the
beginning of an enterprise-level SAS Stack written by one person, and
his subsequent "patches", have now created jobs for at least 5 people,
and keep another 1 or 2 busy with something to do.

Good luck! (Because you're going to need it.)

Luben

P.S. I even wrote a proposal for a presentation at OLS this year ('06)
to describe the design and architectural decisions of the SAS Stack and
SAS LLDD, as well as to talk about SATL and future SCSI directions.
While this proposal was initially approved, it was later silently
dropped. The details are in the archives of linux-scsi. Good luck!