* SCSI target and IO-throttling
From: Vladislav Bolkhovitin
Date: 2006-03-02 16:21 UTC
To: linux-scsi

Hello,

Could anyone advise how a SCSI target device can IO-throttle its
initiators, i.e. prevent them from queuing too many commands, please?

I suppose the best way to do this is to inform the initiators about
the maximum queue depth X of the target device, so that no initiator
sends more than X commands. But I have not found anything like that
in the INQUIRY or MODE SENSE pages. Have I missed something? Just
returning QUEUE FULL status doesn't seem correct, because it can lead
to out-of-order command execution.

Apparently, hardware SCSI targets don't suffer from queuing overflow
and don't return QUEUE FULL status all the time, so there must be a
way to do the throttling more elegantly.

Regards,
Vlad
* Re: SCSI target and IO-throttling
From: Steve Byan
Date: 2006-03-03 18:07 UTC
To: Vladislav Bolkhovitin; Cc: linux-scsi

On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:

> Could anyone advise how a SCSI target device can IO-throttle its
> initiators, i.e. prevent them from queuing too many commands, please?
>
> I suppose the best way to do this is to inform the initiators about
> the maximum queue depth X of the target device, so that no initiator
> sends more than X commands. But I have not found anything like that
> in the INQUIRY or MODE SENSE pages. Have I missed something? Just
> returning QUEUE FULL status doesn't seem correct, because it can
> lead to out-of-order command execution.

Returning QUEUE FULL status is correct, unless the initiator does not
have any pending commands on the LUN, in which case you should return
BUSY. Yes, this can lead to out-of-order execution. That's why tapes
have traditionally not used SCSI command queuing.

Look into the unit attention interlock feature, added to SCSI as a
result of uncovering this issue during the development of the iSCSI
standard.

> Apparently, hardware SCSI targets don't suffer from queuing overflow
> and don't return QUEUE FULL status all the time, so there must be a
> way to do the throttling more elegantly.

No, they just have big queues.

Regards,
-Steve
--
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125
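For concreteness, a minimal sketch of the rule Steve states, as it
might appear in a target's command-intake path. The types and helper
names here are hypothetical, not from any real target stack; only the
status codes (GOOD 00h, BUSY 08h, TASK SET FULL 28h) come from SAM.

    /* Sketch: accept a command, or reject it per the rule above.
     * TASK SET FULL if this initiator already holds queue slots,
     * BUSY if it has nothing outstanding on the LUN.
     */
    enum scsi_status {
        SAM_GOOD          = 0x00,
        SAM_BUSY          = 0x08,
        SAM_TASK_SET_FULL = 0x28,
    };

    struct tgt_lun {
        unsigned int queued;     /* commands currently in the task set */
        unsigned int max_depth;  /* queue resources for this LUN       */
    };

    enum scsi_status lun_accept_cmd(struct tgt_lun *lun,
                                    unsigned int initiator_pending)
    {
        if (lun->queued < lun->max_depth) {
            lun->queued++;
            return SAM_GOOD;     /* command enters the task set */
        }
        /* No room left in the task set. */
        return initiator_pending ? SAM_TASK_SET_FULL : SAM_BUSY;
    }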
* Re: SCSI target and IO-throttling
From: Stefan Richter
Date: 2006-03-03 18:47 UTC
To: Steve Byan; Cc: Vladislav Bolkhovitin, linux-scsi

Steve Byan wrote:
> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>> Could anyone advise how a SCSI target device can IO-throttle its
>> initiators, i.e. prevent them from queuing too many commands, please?
>>
>> I suppose the best way to do this is to inform the initiators about
>> the maximum queue depth X of the target device, [...]
> Returning QUEUE FULL status is correct, unless the initiator does not
> have any pending commands on the LUN, in which case you should return
> BUSY. Yes, this can lead to out-of-order execution. That's why tapes
> have traditionally not used SCSI command queuing.
>
> Look into the unit attention interlock feature, added to SCSI as a
> result of uncovering this issue during the development of the iSCSI
> standard.
>
>> Apparently, hardware SCSI targets don't suffer from queuing
>> overflow [...]
> No, they just have big queues.

Depending on the transport protocol, the problem of queue depth at the
target may not even exist in the first place. This is the case with
SBP-2, where the queue of command blocks resides at the initiator.
--
Stefan Richter
-=====-=-==- --== ---==
http://arcgraph.de/sr/
* Re: SCSI target and IO-throttling
From: Steve Byan
Date: 2006-03-03 20:24 UTC
To: Stefan Richter; Cc: Vladislav Bolkhovitin, linux-scsi

On Mar 3, 2006, at 1:47 PM, Stefan Richter wrote:
> Steve Byan wrote:
>> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>>> Apparently, hardware SCSI targets don't suffer from queuing
>>> overflow [...]
>> No, they just have big queues.
>
> Depending on the transport protocol, the problem of queue depth at
> the target may not even exist in the first place. This is the case
> with SBP-2, where the queue of command blocks resides at the
> initiator.

Yes, and that's a clever optimization in SBP-2 to support
resource-poor targets. Thanks for reminding us of it. Too bad SATA
drives didn't take advantage of the SATA first-party DMA to implement
SBP-2. The definition of the tag field for native command queuing
adopted by T13 essentially makes it infeasible to revisit this
decision.

Regards,
-Steve
* Re: SCSI target and IO-throttling
From: Bryan Henderson
Date: 2006-03-06 19:15 UTC
To: Steve Byan; Cc: linux-scsi, Vladislav Bolkhovitin

>On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>
>> Could anyone advise how a SCSI target device can IO-throttle its
>> initiators, i.e. prevent them from queuing too many commands,
>> please? [...]
>
>Returning QUEUE FULL status is correct, unless the initiator does not
>have any pending commands on the LUN, in which case you should return
>BUSY. Yes, this can lead to out-of-order execution. That's why tapes
>have traditionally not used SCSI command queuing.

I'm confused; Vladislav appears to be asking about flow control such
as is built into iSCSI, wherein the iSCSI target tells the initiator
how many tasks it's willing to work on at once, and the initiator
stops sending new ones when it has hit that limit and waits for one of
the previous ones to finish. And the target can continuously change
that number.

With the more primitive transports, I believe this is a manual
configuration step -- the target has a fixed maximum queue depth and
you tell the driver via some configuration parameter what it is.

As I understand it, any system in which QUEUE FULL (that's another
name for SCSI's Task Set Full, isn't it?) errors happen is one that is
not properly configured. I saw a broken iSCSI system that had QUEUE
FULLs happening, and it was a performance disaster.

>> Apparently, hardware SCSI targets don't suffer from queuing overflow
>> and don't return QUEUE FULL status all the time, so there must be a
>> way to do the throttling more elegantly.
>
>No, they just have big queues.

Big queues are another serious performance problem, when it means a
target accepts work faster than it can do it. I've seen that cause
initiators to send suboptimal requests (if the target appears to be
working at infinite speed, the initiator sends small chunks of work as
soon as each is ready, whereas if the initiator can tell that the
target is choked, the initiator combines and sorts work while it
waits, into a stream the target can handle more efficiently). When
systems substitute an oversized queue in a target for initiator-target
flow control, the initiator ends up having to compensate with
artificial schemes to withhold work from a willing target (e.g. Linux
"queue plugging").
--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Filesystems
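For reference, the iSCSI mechanism Bryan describes is the
CmdSN/MaxCmdSN sliding command window of RFC 3720: the target
advertises its current window in every response, and the initiator
holds new commands once the window is consumed. A minimal
initiator-side sketch; the session structure is hypothetical, and the
wraparound handling is simplified (real code uses RFC 1982
serial-number comparison).

    /* Sketch of the RFC 3720 command window check on the initiator. */
    #include <stdbool.h>
    #include <stdint.h>

    struct iscsi_session {
        uint32_t cmd_sn;      /* next CmdSN this initiator would assign */
        uint32_t max_cmd_sn;  /* highest CmdSN the target will accept   */
    };

    static bool window_open(const struct iscsi_session *s)
    {
        /* The target may close the window entirely by advertising
         * MaxCmdSN = ExpCmdSN - 1; the signed difference below is a
         * simplified serial-number compare. */
        return (int32_t)(s->max_cmd_sn - s->cmd_sn) >= 0;
    }

    bool iscsi_try_send(struct iscsi_session *s)
    {
        if (!window_open(s))
            return false;   /* hold the command locally; retry when a
                               response raises MaxCmdSN */
        s->cmd_sn++;        /* consume one slot of the target's window */
        return true;
    }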
* Re: SCSI target and IO-throttling
From: Steve Byan
Date: 2006-03-06 19:55 UTC
To: Bryan Henderson; Cc: linux-scsi, Vladislav Bolkhovitin

On Mar 6, 2006, at 2:15 PM, Bryan Henderson wrote:

> I'm confused; Vladislav appears to be asking about flow control such
> as is built into iSCSI, wherein the iSCSI target tells the initiator
> how many tasks it's willing to work on at once, and the initiator
> stops sending new ones when it has hit that limit and waits for one
> of the previous ones to finish. And the target can continuously
> change that number.
>
> With the more primitive transports,

Seems like a somewhat loaded description to me. Personally, I'd pick
something more neutral.

> I believe this is a manual configuration step -- the target has a
> fixed maximum queue depth and you tell the driver via some
> configuration parameter what it is.

Not true. Consider the case where multiple initiators share one
logical unit - there is no guarantee that a single initiator can queue
even a single command, since another initiator may have filled the
queue at the device.

Another case is a target that has multiple logical units; it is
conceivable that an implementation may share the device queue
resources among all logical units. In this case again, there is no
fixed number of commands that the target can guarantee to queue for a
logical unit.

> As I understand it, any system in which QUEUE FULL (that's another
> name for SCSI's Task Set Full, isn't it?)

Yes, you're correct. I should have written TASK SET FULL, which is the
correct name for the SCSI status value that we are discussing.

> errors happen is one that is not properly configured.

Absolutely untrue.

> I saw a broken iSCSI system that had QUEUE FULLs happening, and it
> was a performance disaster.

Was it a performance disaster because of the broken-ness, or solely
because of the TASK SET FULLs?

> Big queues are another serious performance problem, when it means a
> target accepts work faster than it can do it. I've seen that cause
> initiators to send suboptimal requests (if the target appears to be
> working at infinite speed, the initiator sends small chunks of work
> as soon as each is ready, whereas if the initiator can tell that the
> target is choked, the initiator combines and sorts work while it
> waits, into a stream the target can handle more efficiently).

1) Considering only first-order effects, who cares whether the
initiator sends sub-optimal requests and the target coalesces them, or
if the initiator does the coalescing itself?

2) If you care about performance, you don't try to fill the device
queue; you just want to have enough outstanding so that the device
doesn't go idle when there is work to do. The reason you do this has
more to do with the access scheduling algorithm in the target than
anything else; brain-damaged marketing values small average access
times more than a small variance in access times, so the device folks
do crazy shortest-access-time-first scheduling instead of something
more sane and less prone to spreading out the access-time
distribution, like CSCAN.

> When systems substitute an oversized queue in a target for
> initiator-target flow control, the initiator ends up having to
> compensate with artificial schemes to withhold work from a willing
> target (e.g. Linux "queue plugging").

1) The SCSI architectural standard does not prescribe any method for
initiator-target flow control other than TASK SET FULL and BUSY.
There's nothing wrong with XON and XOFF for flow control, especially
when you cannot deterministically calculate a window size.

2) Tell the device folks to switch from shortest-access-time-first
scheduling to something less aggressive like CSCAN, and then you might
be able to tolerate the device queuing better.

Regards,
-Steve
* Re: SCSI target and IO-throttling
From: Bryan Henderson
Date: 2006-03-07 23:32 UTC
To: Steve Byan; Cc: linux-scsi, Vladislav Bolkhovitin

>> With the more primitive transports,
>
>Seems like a somewhat loaded description to me. Personally, I'd pick
>something more neutral.

Unfortunately, it's exactly what I mean. I understand that some people
attach negative connotations to primitivity, but I can't let that get
in the way of clarity.

>> I believe this is a manual configuration step -- the target has a
>> fixed maximum queue depth and you tell the driver via some
>> configuration parameter what it is.
>
>Not true. Consider the case where multiple initiators share one
>logical unit - there is no guarantee that a single initiator can
>queue even a single command, since another initiator may have filled
>the queue at the device.

I'm not sure what it is that you're saying isn't true. You do give a
good explanation of why designers would want something more
sophisticated than this, but that doesn't mean every SCSI
implementation actually is that sophisticated. Are you saying there
are no SCSI targets so primitive that they have a fixed maximum queue
depth? That there are no systems where you manually set the maximum
requests-in-flight at the initiator in order to optimally drive such
targets?

>> I saw a broken iSCSI system that had QUEUE FULLs happening, and it
>> was a performance disaster.
>
>Was it a performance disaster because of the broken-ness, or solely
>because of the TASK SET FULLs?

Because of the broken-ness. Task Set Full is the symptom, not the
disease. I should add that in this system, there was no way to both
make it perform optimally and see Task Set Full regularly.

You mentioned in another email that FCP is designed to use Task Set
Full for normal flow control. I had heard that before, but didn't
believe it; I thought FCP was more advanced than that. But I believe
it now. So I was wrong to say that Task Set Full happening means a
system is misconfigured. But it's still the case that if you can
design a system in which Task Set Full never happens, it will perform
better than one in which it does. iSCSI flow control and manual
setting of queue sizes in initiators are two ways people do that.

>1) Considering only first-order effects, who cares whether the
>initiator sends sub-optimal requests and the target coalesces them,
>or if the initiator does the coalescing itself?

I don't know what a first-order effect is, so this may be out of
bounds, but here's a reason to care: the initiator may have more
resource available to do the work than the target. We're talking here
about a saturated target (which, rather than admit it's overwhelmed,
keeps accepting new tasks).

But it's really the wrong question, because the more important
question is: would you rather have the initiator do the coalescing, or
nobody? There exist targets that are not capable of combining or
ordering tasks, and still accept large queues of them. These are the
ones I have seen with improperly large queues. A target that can
actually make use of a large backlog of work, on the other hand, is
right to accept one.

I have seen people try to improve performance of a storage system by
increasing the queue depth in a target such as this. They note that
the queue is always full, so it must need more queue space. But this
degrades performance, because on one of these first-in-first-out
targets, the only way to get peak capacity is to keep the queue full
all the time so as to create backpressure and cause the initiator to
schedule the work. Increasing the queue depth increases the chance
that the initiator will not have the backlog necessary to do that
scheduling. The correct queue depth on this kind of target is the
number of requests the target can process within the initiator's (and
channel's) turnaround time.

>brain-damaged marketing values small average access times more than a
>small variance in access times, so the device folks do crazy
>shortest-access-time-first scheduling instead of something more sane
>and less prone to spreading out the access-time distribution, like
>CSCAN.

Since I'm talking about targets that don't do anything close to that
sophisticated with the stuff in their queue, this doesn't apply. But I
do have to point out that there are systems where throughput is
everything, and response time, including variability of it, is
nothing. In fact, the systems I work with are mostly that kind. For
that kind of system, you'd want the target to do that kind of
scheduling.

>2) If you care about performance, you don't try to fill the device
>queue; you just want to have enough outstanding so that the device
>doesn't go idle when there is work to do.

Why would the queue have a greater capacity than what is needed when
you care about performance? Is there some non-performance reason to
have a giant queue?

I still think having a giant queue is not a solution to any flow
control (or, in the words of the original problem, I/O throttling)
problem. I'm even skeptical that there's any size you can make one
that would avoid queue-full conditions. It would be like avoiding
difficult memory allocation algorithms by just having a whole lot of
memory.
--
Bryan Henderson
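Bryan's sizing rule is essentially Little's law: the useful queue
depth is the target's service rate times the round-trip turnaround
time; anything deeper only hides backpressure. A back-of-the-envelope
sketch, with numbers invented purely for illustration:

    /* Queue-depth sizing per Little's law: depth ~= rate * RTT. */
    #include <stdio.h>

    int main(void)
    {
        double iops   = 5000.0;  /* target service rate, commands/sec   */
        double rtt_ms = 2.0;     /* initiator + channel turnaround time */

        double depth = iops * (rtt_ms / 1000.0);

        /* ~10 outstanding commands keep this target from idling
         * between arrivals; a much deeper queue adds nothing. */
        printf("suggested queue depth: %.0f\n", depth);
        return 0;
    }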
* Re: SCSI target and IO-throttling
From: Vladislav Bolkhovitin
Date: 2006-03-08 15:35 UTC
To: Bryan Henderson; Cc: Steve Byan, linux-scsi

Bryan Henderson wrote:
[...]
> Why would the queue have a greater capacity than what is needed when
> you care about performance? Is there some non-performance reason to
> have a giant queue?
>
> I still think having a giant queue is not a solution to any flow
> control (or, in the words of the original problem, I/O throttling)
> problem. I'm even skeptical that there's any size you can make one
> that would avoid queue-full conditions. It would be like avoiding
> difficult memory allocation algorithms by just having a whole lot of
> memory.

Yes, you're correct. But can you formulate a practical, common rule,
working on any SCSI transport, including FC, by which a SCSI target
that knows some limit can tell it to an initiator, so that the
initiator will not try to queue too many commands? It looks like I
have no choice except a "giant" queue on the target, hoping that the
initiators are smart enough not to queue so many commands that they
start seeing timeouts.

Vlad
* Re: SCSI target and IO-throttling
From: Steve Byan
Date: 2006-03-08 15:56 UTC
To: Vladislav Bolkhovitin; Cc: Bryan Henderson, linux-scsi

On Mar 8, 2006, at 10:35 AM, Vladislav Bolkhovitin wrote:

> Yes, you're correct. But can you formulate a practical, common rule,
> working on any SCSI transport, including FC, by which a SCSI target
> that knows some limit can tell it to an initiator, so that the
> initiator will not try to queue too many commands? It looks like I
> have no choice except a "giant" queue on the target, hoping that the
> initiators are smart enough not to queue so many commands that they
> start seeing timeouts.

I still don't understand why you are reluctant to return TASK_SET_FULL
or BUSY in this case; it's what the SCSI standard supplies as the way
to say "don't queue too many commands, please".

If you don't want to return TASK_SET_FULL, then yes, an effectively
unbounded command queue is your only alternative.

Regards,
-Steve
* Re: SCSI target and IO-throttling
From: Vladislav Bolkhovitin
Date: 2006-03-08 17:49 UTC
To: Steve Byan; Cc: Bryan Henderson, linux-scsi

Steve Byan wrote:
> I still don't understand why you are reluctant to return
> TASK_SET_FULL or BUSY in this case; it's what the SCSI standard
> supplies as the way to say "don't queue too many commands, please".

I don't like the out-of-order execution that happens with practically
every such "rejected" command: the subsequent, already-queued commands
are not "rejected" along with it, so some of them can be executed
before the rejected command is retried. And the initiator (Linux with
an FC driver) is dumb enough to hit this TASK_SET_FULL again and again
until the queue is large enough. So I can see only one solution that
almost eliminates breaking the order - an unbounded command queue.
But maybe I should think/experiment more and ease the ordering
restriction...

Thanks,
Vlad
* Re: SCSI target and IO-throttling
From: Steve Byan
Date: 2006-03-08 18:09 UTC
To: Vladislav Bolkhovitin; Cc: Bryan Henderson, linux-scsi

On Mar 8, 2006, at 12:49 PM, Vladislav Bolkhovitin wrote:

> I don't like the out-of-order execution that happens with practically
> every such "rejected" command: the subsequent, already-queued
> commands are not "rejected" along with it, so some of them can be
> executed before the rejected command is retried.

I see, you care about order. So do tapes. The historical answer has
been to not support tagged command queuing when you care about
ordering. To dodge the performance problem due to lack of queuing, the
targets usually implement a read-ahead and write-behind cache, and
then perform queuing behind the scenes, after telling the initiator
that the command has completed. Of course, this has obvious data
integrity issues for disk-type logical units.

The solution, introduced for tapes concurrently with iSCSI (which
motivated the need for command queuing for tapes, since some
envisioned backing up to a tape drive located 3000 miles away), is
something called "unit attention interlock", or "UA interlock". Check
out page 287 of draft revision 23 of the SCSI Primary Commands - 3
(SPC-3) standard from T10.org. The UA_INTLCK_CTRL field can be set to
cause a persistent unit attention condition if a command was rejected
with TASK_SET_FULL or BUSY. This requires the cooperation of the
initiator.

Regards,
-Steve
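A rough sketch of how a target might implement the UA interlock Steve
points to. The structures are hypothetical, and the exact ASC/ASCQ
pair is an assumption taken from memory of the T10 ASC/ASCQ table; it
should be verified against SPC-3 before use.

    /* Sketch of SPC-3 unit attention interlock on the target side.
     * With UA_INTLCK_CTRL = 10b (or 11b) in the Control mode page,
     * returning TASK SET FULL also establishes a unit attention for
     * that initiator, so later commands cannot silently slip past the
     * rejected one.
     */
    #define UA_INTLCK_NONE 0x0   /* 00b: no interlock                    */
    #define UA_INTLCK_UA   0x2   /* 10b: establish UA on queue-full/busy */

    struct itl_nexus {               /* per initiator/LUN state    */
        unsigned int ua_intlck_ctrl; /* from the Control mode page */
        int ua_pending;
        unsigned char ua_asc, ua_ascq;
    };

    void reject_task_set_full(struct itl_nexus *n)
    {
        if (n->ua_intlck_ctrl != UA_INTLCK_NONE) {
            n->ua_pending = 1;
            n->ua_asc  = 0x2c;  /* assumed: PREVIOUS TASK SET FULL */
            n->ua_ascq = 0x08;  /* STATUS, 2Ch/08h - verify vs SPC-3 */
        }
        /* ...then return TASK SET FULL status for the command. */
    }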
* Re: SCSI target and IO-throttling
From: Vladislav Bolkhovitin
Date: 2006-03-09 18:37 UTC
To: Steve Byan; Cc: Bryan Henderson, linux-scsi

Steve Byan wrote:
> I see, you care about order. So do tapes. The historical answer has
> been to not support tagged command queuing when you care about
> ordering. To dodge the performance problem due to lack of queuing,
> the targets usually implement a read-ahead and write-behind cache,
> and then perform queuing behind the scenes, after telling the
> initiator that the command has completed. Of course, this has obvious
> data integrity issues for disk-type logical units.

Yes, tapes just can't work without strict ordering. SCST was
originally done for tapes, so I still keep some kind of tape-oriented
thinking :)

Actually, with current journaling file systems, ordering has become
more important for disks as well.

The data integrity problem in "behind the scenes" queuing could in
practice be easily solved by battery-backed power on the disks. In the
case of TASK_SET_FULL things are much worse, because the reordering
happens _between_ the target and the _initiator_: the initiator must
retry the "rejected" command explicitly. If the initiator crashes
before the command is retried, and the FS on it uses ordering barriers
to protect its integrity (Linux seems to do so, but I could be wrong),
the FS data could be written out of order with respect to its journal,
and the FS could be corrupted. Even worse, TASK_SET_FULL "rejects"
basically happen every queue-length'th command, i.e. very often. This
is why I prefer the "dumb" and "safe" way. But I could be
overestimating the problem, because it looks like nobody cares about
it...

> The solution, introduced for tapes concurrently with iSCSI [...], is
> something called "unit attention interlock", or "UA interlock". Check
> out page 287 of draft revision 23 of the SCSI Primary Commands - 3
> (SPC-3) standard from T10.org. The UA_INTLCK_CTRL field can be set to
> cause a persistent unit attention condition if a command was rejected
> with TASK_SET_FULL or BUSY.

Thanks, I'll take a look.

> This requires the cooperation of the initiator.

Which practically means that it will not work for at least several
years. I think I won't be wrong if I say that no Linux initiator uses
this feature or is going to...

BTW, it is also impossible to correctly process command errors (CHECK
CONDITIONs) in an async environment without using ACA (Auto Contingent
Allegiance). Again, I see no sign that it's used by Linux, or that
anybody is interested in using it in Linux. Have I missed something,
or is it just not important? (rather a rhetorical question)

Thanks,
Vlad
* Re: SCSI target and IO-throttling
From: Steve Byan
Date: 2006-03-09 19:32 UTC
To: Vladislav Bolkhovitin; Cc: Bryan Henderson, linux-scsi

On Mar 9, 2006, at 1:37 PM, Vladislav Bolkhovitin wrote:

> Yes, tapes just can't work without strict ordering. SCST was
> originally done for tapes, so I still keep some kind of tape-oriented
> thinking :)
>
> Actually, with current journaling file systems, ordering has become
> more important for disks as well.

Usually the workload from a journaling filesystem consists of a lot of
unordered writes (user data) and some partially-ordered writes
(metadata). The partially-ordered writes do not have a defined
ordering with respect to the unordered writes; they are ordered only
with respect to each other. Most systems today solve the TASK_SET_FULL
problem by only having one ordered write outstanding at any point in
time. You want to do it this way anyway, so that you can build up a
queue of commits and do a group commit with the next write to the
journal.

If you need write barriers between the metadata writes and the data
writes, the initiator should use the ORDERED task tag on that write,
and have only one ORDERED write outstanding at any point in time (I
mean to the same logical unit, of course).

> The data integrity problem in "behind the scenes" queuing could in
> practice be easily solved by battery-backed power on the disks. In
> the case of TASK_SET_FULL things are much worse, [...] This is why I
> prefer the "dumb" and "safe" way. But I could be overestimating the
> problem, because it looks like nobody cares about it...

See above. Since only one ordered write is ever pending, no file
system corruption occurs. And since you want to do group commits
anyway, you never need to have more than one ordered write pending.

> Which practically means that it will not work for at least several
> years.

Well, the feature was added back in 2001 or 2002; the initiators have
already had years to incorporate it. This might say something about
the state of the Linux SCSI subsystem (running and ducking for
cover :-). Seriously, I think this has more to do with either the lack
of need for command queuing for tapes or the lack of modern tape
support in Linux.

> I think I won't be wrong if I say that no Linux initiator uses this
> feature or is going to...

If you have an initiator that is sending queued SCSI commands with the
SIMPLE task attribute but which expects the target to maintain
ordering of those commands, the SCSI standard can't help you. The
initiator is broken.

If the initiator needs to send _queued_ SCSI commands with a task
attribute of ORDERED, then to preserve ordering it must set
UA_INTLCK_CTRL appropriately. The SCSI standard has no other mechanism
to offer such an initiator.

To the best of my knowledge, no current Linux initiator sends SCSI
commands with a task attribute other than SIMPLE, and you seem to be
concerned only about Linux initiators. Therefore your target does not
need to preserve order. QED.

> BTW, it is also impossible to correctly process command errors (CHECK
> CONDITIONs) in an async environment

When you say "async environment" I assume you are referring to queuing
SCSI commands using SCSI command queuing, as opposed to sending a
single SCSI command and synchronously awaiting its completion.

> without using ACA (Auto Contingent Allegiance). Again, I see no sign
> that it's used by Linux, or that anybody is interested in using it in
> Linux. Have I missed something, or is it just not important? (rather
> a rhetorical question)

ACA is not important if the command that got the error is idempotent
and independent of all other commands in flight. In the case of disks
(SBC command set) and CD-ROMs and DVD-ROMs (MMC command set) this
condition is true (given the restriction on the number of outstanding
ordered writes which I discussed above), and so ACA is not needed.

Tapes would need ACA if they did command queuing (which is why ACA was
invented), but the practice in tape-land seems to be to avoid SCSI
command queuing and instead asynchronously stage the operations behind
the target. This does lead to complications in error recovery, which
is why tape error handling is so problematic.

My advice to you is to either
a) follow the industry trend, which is to use command queuing only for
SBC (disk) targets and not for MMC (CD-ROM) and SSC (tape) targets, or
b) fix the initiator to handle ordered queuing (i.e. add support for
the ORDERED and ACA task attributes, ACA, and UA_INTLCK_CTRL).

Regards,
-Steve
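A minimal sketch of the group-commit discipline Steve describes: at
most one journal write in flight, with transactions that arrive in the
meantime batched into the next write. All names are hypothetical and
locking is elided.

    /* Group commit: one outstanding journal write commits a whole
     * batch of transactions; new arrivals join the next batch.
     */
    struct txn { struct txn *next; /* transaction payload elided */ };

    struct journal {
        struct txn *batch;    /* transactions awaiting the next write */
        int write_in_flight;  /* at most one journal write pending    */
    };

    void submit_ordered_write(struct txn *group); /* hypothetical I/O hook */
    static void journal_kick(struct journal *j);

    void journal_commit(struct journal *j, struct txn *t)
    {
        t->next = j->batch;   /* join the group now forming */
        j->batch = t;
        if (!j->write_in_flight)
            journal_kick(j);  /* start a write only if none is pending */
    }

    static void journal_kick(struct journal *j)
    {
        struct txn *group = j->batch;
        if (!group)
            return;           /* nothing accumulated */
        j->batch = NULL;
        j->write_in_flight = 1;
        submit_ordered_write(group); /* one write commits the group */
    }

    /* Completion callback: free the slot, commit the next group. */
    void journal_write_done(struct journal *j)
    {
        j->write_in_flight = 0;
        journal_kick(j);
    }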
* Re: SCSI target and IO-throttling
From: Vladislav Bolkhovitin
Date: 2006-03-10 18:46 UTC
To: Steve Byan; Cc: Bryan Henderson, linux-scsi

Steve Byan wrote:
> Usually the workload from a journaling filesystem consists of a lot
> of unordered writes (user data) and some partially-ordered writes
> (metadata). [...] Most systems today solve the TASK_SET_FULL problem
> by only having one ordered write outstanding at any point in time.
> You want to do it this way anyway, so that you can build up a queue
> of commits and do a group commit with the next write to the journal.
>
> If you need write barriers between the metadata writes and the data
> writes, the initiator should use the ORDERED task tag on that write,
> and have only one ORDERED write outstanding at any point in time (I
> mean to the same logical unit, of course).

I mean the barrier between journal writes and metadata writes, because
their order is essential for FS health. User data is almost never
journaled and so not protected.

Obviously, having only one ORDERED (i.e. journal) write and having to
wait for its completion before submitting subsequent commands creates
some performance bottleneck. I mean mostly latency, which is often
quite big in many SCSI transports. It would be much better to queue as
many such ORDERED commands as necessary and then, without waiting for
their completion, queue the metadata update (SIMPLE) commands, being
sure that no metadata command will be executed if any of the ORDERED
ones fails (see the sketch after this message). As far as I can see,
nothing prevents working that way right now, except that somebody
would have to implement it in both hardware and software.

> If you have an initiator that is sending queued SCSI commands with
> the SIMPLE task attribute but which expects the target to maintain
> ordering of those commands, the SCSI standard can't help you. The
> initiator is broken.

Sure.

> To the best of my knowledge, no current Linux initiator sends SCSI
> commands with a task attribute other than SIMPLE, and you seem to be
> concerned only about Linux initiators. Therefore your target does not
> need to preserve order. QED.

I prefer to be overinsured in such cases.

> When you say "async environment" I assume you are referring to
> queuing SCSI commands using SCSI command queuing, as opposed to
> sending a single SCSI command and synchronously awaiting its
> completion.

Yes.

> ACA is not important if the command that got the error is idempotent
> and independent of all other commands in flight. In the case of disks
> (SBC command set) and CD-ROMs and DVD-ROMs (MMC command set) this
> condition is true (given the restriction on the number of outstanding
> ordered writes which I discussed above), and so ACA is not needed.

Yes, when working as you described, ACA is not needed. But when
working as I described, ACA is essential.

> Tapes would need ACA if they did command queuing (which is why ACA
> was invented), but the practice in tape-land seems to be to avoid
> SCSI command queuing and instead asynchronously stage the operations
> behind the target. This does lead to complications in error recovery,
> which is why tape error handling is so problematic.

Could you please explain "asynchronously stage the operations behind
the target" more? I don't understand what you mean.

> My advice to you is to either
> a) follow the industry trend, which is to use command queuing only
> for SBC (disk) targets and not for MMC (CD-ROM) and SSC (tape)
> targets, or
> b) fix the initiator to handle ordered queuing (i.e. add support for
> the ORDERED and ACA task attributes, ACA, and UA_INTLCK_CTRL).

OK, thanks. Looks like (a) is easier :).

BTW, do you have any statistics on how many modern SCSI disks support
those features (ORDERED, ACA, UA_INTLCK_CTRL, etc.)? A few years ago
none of the SCSI hardware available to us, including tape libraries,
supported ACA. It was not very modern even for that time, though.

Regards,
Vlad
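A sketch of the pipelining Vlad proposes above: a journal write queued
with the ORDERED attribute, followed immediately by SIMPLE metadata
writes, with no wait in between. All the names below are hypothetical.

    /* Pipelined barrier: ORDERED journal write, then SIMPLE metadata
     * writes queued without waiting for the journal completion.
     */
    struct lun;                  /* hypothetical target LUN handle */
    struct io;                   /* hypothetical I/O descriptor    */
    enum task_attr { TASK_SIMPLE, TASK_ORDERED };
    void queue_cmd(struct lun *lu, struct io *cmd, enum task_attr attr);

    void commit_with_barrier(struct lun *lu, struct io *journal,
                             struct io **metadata, int n)
    {
        queue_cmd(lu, journal, TASK_ORDERED);  /* barrier write */
        for (int i = 0; i < n; i++)
            queue_cmd(lu, metadata[i], TASK_SIMPLE); /* no wait */
        /* SAM ordering keeps the SIMPLE commands from starting before
         * the ORDERED write completes; guaranteeing that none of them
         * runs if the ORDERED write *fails* additionally requires ACA
         * or UA interlock, which is Vlad's point. */
    }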
* Re: SCSI target and IO-throttling
From: Steve Byan
Date: 2006-03-10 19:47 UTC
To: Vladislav Bolkhovitin; Cc: Bryan Henderson, linux-scsi

On Mar 10, 2006, at 1:46 PM, Vladislav Bolkhovitin wrote:

> I mean the barrier between journal writes and metadata writes,
> because their order is essential for FS health.

I counted journal writes as metadata writes. If you want to make a
distinction, OK, we now have a common language.

> Obviously, having only one ORDERED (i.e. journal) write and having to
> wait for its completion before submitting subsequent commands creates
> some performance bottleneck.

It might be obvious, but it's not true.

You missed my point about group commits to the journal. That's why
there's no performance hit for only having one outstanding journal
write at a time; each journal write commits many transactions. Stated
another way, you don't want to eagerly initiate journal writes; you
want to execute one at a time, and group all transactions that arrive
while the one write is active into the next write.

See the seminal paper from Xerox PARC on group commit in the Cedar
filesystem. I'm working from memory so I can't give you a better
citation than that. It's an old paper, probably circa 1987 or 1988,
published I think in an ACM journal.

I've benchmarked metadata-intensive workloads on a journaling
filesystem, comparing a storage controller with NV-RAM arranged so
that all metadata and journal writes complete without any disk
activity against a vanilla controller. The lights on the disks on the
NV-RAM controller never came on; i.e. there was _no_ disk activity.
The lights on the disks attached to the vanilla controller were on
solid. The performance of the two systems was essentially the same
with respect to average response time and throughput.

> I mean mostly latency, which is often quite big in many SCSI
> transports. It would be much better to queue as many such ORDERED
> commands as necessary and then, without waiting for their completion,
> queue the metadata update (SIMPLE) commands, being sure that no
> metadata command will be executed if any of the ORDERED ones fails.
> As far as I can see, nothing prevents working that way right now,
> except that somebody would have to implement it in both hardware and
> software.

If you use group commits, there's little value in implementing this.

> I prefer to be overinsured in such cases.

Suit yourself. Just don't expect help from the SCSI standard; it's not
designed to do that.

> Yes, when working as you described, ACA is not needed. But when
> working as I described, ACA is essential.

As is unit attention interlock.

> Could you please explain "asynchronously stage the operations behind
> the target" more? I don't understand what you mean.

I mean they buffer the operations in memory after completing the SCSI
command and then (asynchronously to the execution of the SCSI command,
i.e. after it has been completed) queue them ("stage" them) and send
them on to the physical device.

I'm a bit hazy on the terminology, because I was never a tape guy and
it's been years since I thought about tapes, but I think the term the
industry used when streaming tapes first came out was "buffered
operation". The tape controller accepts the write command and
completes it with good status but doesn't write it to the media; it
waits until it has accumulated a sufficient number of records to keep
the tape streaming before starting to dump the buffer to the tape
media. This avoids the need for SCSI command queuing while still
keeping the tape streaming.

> BTW, do you have any statistics on how many modern SCSI disks support
> those features (ORDERED, ACA, UA_INTLCK_CTRL, etc.)? A few years ago
> none of the SCSI hardware available to us, including tape libraries,
> supported ACA.

I can't say with certainty, but I believe no SCSI disk supports ACA or
UA_INTLCK_CTRL. Some may support the ORDERED task tag, but I'd guess
it would be implemented in a low-performance path.

Storage controllers might be a different story; I have no data on what
they support in the way of task attributes, ACA, and unit attention
interlock.

As far as tapes go, I've got no data on modern SCSI tape controllers,
but judging by the squirming going on in T10 around command ordering
for Fibre Channel tapes, I'd guess very few if any have gotten command
queuing to work for tapes.

Regards,
-Steve
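A sketch of the "buffered operation" Steve describes for streaming
tapes: the target completes each WRITE as soon as the record is
buffered, and only starts the physical media write once enough data
has accumulated to keep the tape streaming. The threshold and helper
functions are invented for illustration.

    /* Buffered-mode tape write-behind: GOOD status before the media
     * write, physical write deferred until the drive can stream.
     */
    #include <stddef.h>

    #define STREAM_THRESHOLD (256 * 1024) /* invented: bytes to stream */

    struct tape_buf {
        size_t bytes;     /* records buffered but not yet on media */
        int streaming;    /* physical media write in progress      */
    };

    /* Hypothetical helpers standing in for the drive internals. */
    void buffer_record(struct tape_buf *t, size_t len);
    void complete_with_good_status(void);
    void start_media_write(struct tape_buf *t);

    void tape_write_cmd(struct tape_buf *t, size_t len)
    {
        buffer_record(t, len);        /* copy the record into RAM */
        t->bytes += len;
        complete_with_good_status();  /* GOOD before the media write */

        if (!t->streaming && t->bytes >= STREAM_THRESHOLD)
            start_media_write(t);     /* enough queued to stream */
    }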
* Re: SCSI target and IO-throttling 2006-03-10 19:47 ` Steve Byan @ 2006-03-13 17:35 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 25+ messages in thread From: Vladislav Bolkhovitin @ 2006-03-13 17:35 UTC (permalink / raw) To: Steve Byan; +Cc: Bryan Henderson, linux-scsi

Steve Byan wrote:
> On Mar 10, 2006, at 1:46 PM, Vladislav Bolkhovitin wrote:
>
>> Steve Byan wrote:
>>
>>> On Mar 9, 2006, at 1:37 PM, Vladislav Bolkhovitin wrote:
>
>> I mean the barrier between journal writes and metadata writes, because their order is essential for FS health.
>
> I counted journal writes as metadata writes. If you want to make a distinction, OK, we now have a common language.
>
>> Obviously, having only one ORDERED, i.e. journal, write and having to wait for its completion before submitting subsequent commands creates some performance bottleneck.
>
> It might be obvious, but it's not true.
>
> You missed my point about group commits to the journal. That's why there's no performance hit for having only one outstanding journal write at a time; each journal write commits many transactions. Stated another way, you don't want to eagerly initiate journal writes; you want to execute one at a time, and group all transactions that arrive while that one write is active into the next write.
>
> See the seminal paper from Xerox PARC on "Group Commits in the CEDAR Filesystem". I'm working from memory, so I can't give you a better citation than that. It's an old paper, probably circa 1987 or 1988, published I think in an ACM journal.

I didn't miss your point. I wrote that such journal updates have to be _synchronous_, i.e. even though the updates are combined into one command, it is still necessary to wait for their completion (as well as for _all_ previously queued commands, including SIMPLE ones). This is the (possible) performance bottleneck. Yes, the disk can simulate command completion with its write-back cache, but the cache is limited in size, so under some workloads it could fill up and stop helping. However, I don't have any numbers, and maybe this is not so noticeable in practice.

> I've benchmarked metadata-intensive workloads on a journaling filesystem, comparing a storage controller with NV-RAM arranged so that all metadata and journal writes complete without any disk activity against a vanilla controller. The lights on the disks behind the NV-RAM controller never came on, i.e. there was _no_ disk activity. The lights on the disks attached to the vanilla controller were on solid. The performance of the two systems was essentially the same with respect to average response time and throughput.

>> I mean mostly latency, which is often quite big in many SCSI transports. It would be much better to queue as many such ORDERED commands as necessary and then, without waiting for their completion, queue the metadata update (SIMPLE) commands, being sure that no metadata commands will be executed if any of the ORDERED ones fail. As far as I can see, nothing prevents working that way right now, except that somebody has to implement it in both hardware and software.
>
> If you use group commits, there's little value in implementing this.
>
>>> Tapes would need ACA if they did command queuing (which is why ACA was invented), but the practice in tape-land seems to be to avoid SCSI command queuing and instead asynchronously stage the operations behind the target. This does lead to complications in error recovery, which is why tape error handling is so problematic.
>>
>> Could you please explain "asynchronously stage the operations behind the target" more? I don't understand what you mean.
>
> I mean they buffer the operations in memory after completing the SCSI command and then (asynchronous to the execution of the SCSI command, i.e. after it has been completed) queue them ("stage" them) and send them on to the physical device.
>
> I'm a bit hazy on the terminology, because I was never a tape guy and it's been years since I thought about tapes, but I think the term the industry used when streaming tapes first came out was "buffered operation". The tape controller accepts the write command and completes it with good status but doesn't write it to the media; it waits until it has accumulated a sufficient number of records to keep the tape streaming before starting to dump the buffer to the tape media. This avoids the need for SCSI command queuing while still keeping the tape streaming.

I see.

>>> My advice to you is to either
>>> a) follow the industry trend, which is to use command queuing only for SBC (disk) targets and not for MMC (CD-ROM) and SSC (tape) targets, or
>>> b) fix the initiator to handle ordered queuing (i.e. add support for the ORDERED and ACA task tags, ACA, and UA_INTLCK_CTL).
>>
>> OK, thanks. Looks like (a) is easier :).
>>
>> BTW, do you have any statistics on how many modern SCSI disks support those features (ORDERED, ACA, UA_INTLCK_CTL, etc.)? A few years ago none of the SCSI hardware available to us, including tape libraries, supported ACA. It was not very modern even for that time, though.
>
> I can't say with certainty, but I believe no SCSI disk supports ACA or UA_INTLCK_CTL. Some may support the ORDERED task tag, but I guess it would be implemented in a low-performance path.

This is the point from which we should have started :). It's senseless to implement something which you can't use.

> Storage controllers might be a different story; I have no data on what they support in the way of task attributes, ACA, and unit attention interlock.
>
> As far as tapes go, I've got no data on modern SCSI tape controllers, but judging by the squirming going on in T10 around command ordering for Fibre Channel tapes, I'd guess very few if any have gotten command queuing to work for tapes.

Thanks,
Vlad

^ permalink raw reply [flat|nested] 25+ messages in thread
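To make the group-commit policy debated above concrete, here is a rough C sketch: at most one journal write is ever outstanding, and every transaction that arrives while it is in flight simply joins the batch committed by the next write. The structure and names are illustrative only, not taken from CEDAR or from any Linux filesystem:

    #include <stdbool.h>

    struct journal {
        unsigned batched;      /* transactions waiting for the next write */
        bool write_active;     /* at most one journal write in flight */
    };

    static void start_journal_write(struct journal *j)
    {
        j->write_active = true;
        /* issue one journal write covering all j->batched transactions;
         * the actual I/O submission is omitted in this sketch */
        j->batched = 0;
    }

    /* Called when a transaction wants to commit. */
    void txn_commit_request(struct journal *j)
    {
        j->batched++;
        if (!j->write_active)
            start_journal_write(j);  /* journal idle: write immediately */
        /* otherwise do nothing: this txn rides in the next group */
    }

    /* Completion callback for the in-flight journal write. */
    void journal_write_done(struct journal *j)
    {
        j->write_active = false;
        if (j->batched > 0)
            start_journal_write(j);  /* commit the group formed meanwhile */
    }

The key property is that the deeper the backlog, the larger each group becomes, so the single-outstanding-write restriction self-compensates under load instead of becoming a bottleneck.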
* Re: SCSI target and IO-throttling 2006-03-10 18:46 ` Vladislav Bolkhovitin 2006-03-10 19:47 ` Steve Byan @ 2006-03-14 20:54 ` Douglas Gilbert 2006-03-15 17:15 ` Vladislav Bolkhovitin 1 sibling, 1 reply; 25+ messages in thread From: Douglas Gilbert @ 2006-03-14 20:54 UTC (permalink / raw) To: Vladislav Bolkhovitin; +Cc: Steve Byan, Bryan Henderson, linux-scsi

Vladislav Bolkhovitin wrote:
> Steve Byan wrote:

<snip>

> BTW, do you have any statistics on how many modern SCSI disks support those features (ORDERED, ACA, UA_INTLCK_CTL, etc.)? A few years ago none of the SCSI hardware available to us, including tape libraries, supported ACA. It was not very modern even for that time, though.

Vlad,
Here is part of the control mode page from a recent SCSI disk (Cheetah 15k.4):

# sdparm -p co /dev/sdb -ll
/dev/sdb: SEAGATE  ST336754SS  0003
Direct access device specific parameters: WP=0  DPOFUA=1
Control mode page [PS=1]:
  TST        0  [cha: n, def: 0, sav: 0]  Task set type
    0: lu maintains one task set for all I_T nexuses
    1: lu maintains separate task sets for each I_T nexus
  TMF_ONLY   0  [cha: n, def: 0, sav: 0]  Task management functions only
  D_SENSE    0  [cha: n, def: 0, sav: 0]  Descriptor format sense data
  GLTSD      0  [cha: y, def: 1, sav: 0]  Global logging target save disable
  RLEC       0  [cha: y, def: 0, sav: 0]  Report log exception condition
  QAM        0  [cha: y, def: 0, sav: 0]  Queue algorithm modifier
    0: restricted re-ordering; 1: unrestricted
  QERR       0  [cha: n, def: 0, sav: 0]  Queue error management
    0: only affected task gets CC; 1: affected tasks aborted
    3: affected tasks aborted on same I_T nexus
  RAC        0  [cha: n, def: 0, sav: 0]  Report a check
  UA_INTLCK  0  [cha: n, def: 0, sav: 0]  Unit attention interlocks control
    0: unit attention cleared with check condition status
    2: unit attention not cleared with check condition status
    3: as 2 plus ua on busy, task set full or reservation conflict
  SWP        0  [cha: n, def: 0, sav: 0]  Software write protect
  ATO        0  [cha: n, def: 0, sav: 0]  Application tag owner
  TAS        0  [cha: n, def: 0, sav: 0]  Task aborted status
    0: tasks aborted without response to app client
    1: any other I_T nexuses receive task aborted

So it doesn't support UA_INTLCK ("cha: n" implies the user cannot change that value). QAM can be changed to allow unrestricted re-ordering (of tasks with the SIMPLE task attribute).

The NormACA bit in the standard INQUIRY response is 0, so it doesn't support ACA either.

Doug Gilbert

^ permalink raw reply [flat|nested] 25+ messages in thread
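As a side note, the NormACA check Doug mentions can be done programmatically: in the standard INQUIRY response defined by SPC, NormACA is bit 5 of byte 3. A minimal C sketch, assuming the INQUIRY data has already been fetched (e.g. via an SG_IO ioctl, which is omitted here):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* True if the device reports NormACA support in its standard INQUIRY
     * data (byte 3, bit 5 per SPC). 'inq'/'len' come from a prior INQUIRY. */
    bool supports_normaca(const uint8_t *inq, size_t len)
    {
        if (len < 4)
            return false;             /* response too short to tell */
        return (inq[3] & 0x20) != 0;  /* 0x20 selects bit 5 of byte 3 */
    }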
* Re: SCSI target and IO-throttling 2006-03-14 20:54 ` Douglas Gilbert @ 2006-03-15 17:15 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 25+ messages in thread From: Vladislav Bolkhovitin @ 2006-03-15 17:15 UTC (permalink / raw) To: dougg; +Cc: Steve Byan, Bryan Henderson, linux-scsi

Douglas Gilbert wrote:
> Vladislav Bolkhovitin wrote:
>
>> Steve Byan wrote:
>
> <snip>
>
>> BTW, do you have any statistics on how many modern SCSI disks support those features (ORDERED, ACA, UA_INTLCK_CTL, etc.)? A few years ago none of the SCSI hardware available to us, including tape libraries, supported ACA. It was not very modern even for that time, though.
>
> Vlad,
> Here is part of the control mode page from a recent SCSI disk (Cheetah 15k.4):
>
> # sdparm -p co /dev/sdb -ll
> /dev/sdb: SEAGATE  ST336754SS  0003
> Direct access device specific parameters: WP=0  DPOFUA=1
> Control mode page [PS=1]:
>   TST        0  [cha: n, def: 0, sav: 0]  Task set type
>     0: lu maintains one task set for all I_T nexuses
>     1: lu maintains separate task sets for each I_T nexus
>   TMF_ONLY   0  [cha: n, def: 0, sav: 0]  Task management functions only
>   D_SENSE    0  [cha: n, def: 0, sav: 0]  Descriptor format sense data
>   GLTSD      0  [cha: y, def: 1, sav: 0]  Global logging target save disable
>   RLEC       0  [cha: y, def: 0, sav: 0]  Report log exception condition
>   QAM        0  [cha: y, def: 0, sav: 0]  Queue algorithm modifier
>     0: restricted re-ordering; 1: unrestricted
>   QERR       0  [cha: n, def: 0, sav: 0]  Queue error management
>     0: only affected task gets CC; 1: affected tasks aborted
>     3: affected tasks aborted on same I_T nexus
>   RAC        0  [cha: n, def: 0, sav: 0]  Report a check
>   UA_INTLCK  0  [cha: n, def: 0, sav: 0]  Unit attention interlocks control
>     0: unit attention cleared with check condition status
>     2: unit attention not cleared with check condition status
>     3: as 2 plus ua on busy, task set full or reservation conflict
>   SWP        0  [cha: n, def: 0, sav: 0]  Software write protect
>   ATO        0  [cha: n, def: 0, sav: 0]  Application tag owner
>   TAS        0  [cha: n, def: 0, sav: 0]  Task aborted status
>     0: tasks aborted without response to app client
>     1: any other I_T nexuses receive task aborted
>
> So it doesn't support UA_INTLCK ("cha: n" implies the user cannot change that value). QAM can be changed to allow unrestricted re-ordering (of tasks with the SIMPLE task attribute).
>
> The NormACA bit in the standard INQUIRY response is 0, so it doesn't support ACA either.

Thanks! This is exactly what we've seen in our own small investigation. Perhaps those features are really not needed, if nobody uses them anymore.

Vlad

^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: SCSI target and IO-throttling 2006-03-07 23:32 ` Bryan Henderson 2006-03-08 15:35 ` Vladislav Bolkhovitin @ 2006-03-10 13:26 ` Steve Byan 1 sibling, 0 replies; 25+ messages in thread From: Steve Byan @ 2006-03-10 13:26 UTC (permalink / raw) To: Bryan Henderson; +Cc: linux-scsi, Vladislav Bolkhovitin On Mar 7, 2006, at 6:32 PM, Bryan Henderson wrote: >>> With the more primitive transports, >> >> Seems like a somewhat loaded description to me. Personally, I'd pick >> something more neutral. > > Unfortunately, it's exactly what I mean. I understand that some > people > attach negative connotations to primitivity, but I can't let that > get in > the way of clarity. > >>> I believe this is a manual >>> configuration step -- the target has a fixed maximum queue depth >>> and you >>> tell the driver via some configuration parameter what it is. >> >> Not true. Consider the case where multiple initiators share one >> logical unit - there is no guarantee that a single initiator can >> queue even a single command, since another initiator may have filled >> the queue at the device. > > I'm not sure what it is that you're saying isn't true. I'm saying that your blanket statement that "With the more primitive transports, I believe this is a manual configuration step -- the target has a fixed maximum queue depth and you tell the driver via some configuration parameter what it is." is not true. > You do give a good > explanation of why designers would want something more > sophisticated than > this, but that doesn't mean every SCSI implementation actually is. I didn't say every SCSI implementation did anything in particular. On the other hand, you did. > Are > you saying there are no SCSI targets so primitive that they have a > fixed > maximum queue depth? Of course I'm not saying that no such systems exist. I'm only refuting your claim that they all behave that way. > That there are no systems where you manually set the > maximum requests-in-flight at the initiator in order to optimally > drive > such targets? Of course I'm not saying that no such systems exist. I'm only refuting your claim that they all behave that way. > >>> I saw a broken ISCSI system that had QUEUE FULLs >>> happening, and it was a performance disaster. >> >> Was it a performance disaster because of the broken-ness, or solely >> because of the TASK SET FULLs? > > Because of the broken-ness. Task Set Full is the symptom, not the > disease. I should add that in this system, there was no way to > make it > perform optimally and also see Task Set Full regularly. > > You mentioned in another email that FCP is designed to use Task Set > Full > for normal flow control. I heard that before, but didn't believe > it; I > thought FCP was more advanced than that. But I believe it now. > So I was > wrong to say that Task Set Full happening means a system is > misconfigured. > But it's still the case that if you can design a system in which > Task Set > Full never happens, it will perform better than one in which it does. This is not necessarily true. TASK_SET_FULL does consume some initiator CPU resources and some bus bandwidth, so if one of those is your bottleneck, then yes, avoiding TASK_SET_FULL will improve performance. 
But if the performance bottleneck is the device server itself, then to a first approximation it makes no difference to performance whether the commands are queued on the initiator side of the interface or on the target side of the interface, assuming both the initiator and the target are capable of performing the same reordering optimizations. > ISCSI flow control and manual setting of queue sizes in initiators > are two > ways people do that. > >> 1) Considering only first-order effects, who cares whether the >> initiator sends sub-optimal requests and the target coalesces them, >> or if the initiator does the coalescing itself? > > I don't know what a first-order effect is, so this may be out of > bounds, > but here's a reason to care: the initiator may have more resource > available to do the work than the target. We're talking here about a > saturated target (which, rather than admit it's overwhelmed, keeps > accepting new tasks). Usually the target resource that is the bottleneck is the mechanical device, not the CPU. So it usually has the resources to devote to reordering the queue. Even disk drives with their $5 CPU have enough CPU bandwidth for this. > > But it's really the wrong question, because the more important > question is > would you rather have the initiator do the coalescing or nobody? There > exist targets that are not capable of combining or ordering tasks, and > still accept large queues of them. So no target should be able to accept large numbers of queued commands because some targets you've worked with are broken? Or we should have to manually configure the queue depth on every target because some of them are broken? This also doesn't seem pertinent to TASK_SET_FULL versus iSCSI-style windowing, since a broken target can accept a large queue of commands no matter what flow-control mechanism is used. I don't oppose including an option to an initiator that would manually set a maximum queue depth for a particular make and model of a SCSI target as a device-specific quirk; I just don't think it's mandatory, I don't think it's a good idea to have it be a global setting, and I also don't think it is the best general solution. > These are the ones I saw have > improperly large queues. A target that can actually make use of a > large > backlog of work, on the other hand, is right to accept one. Absolutely. And the ones that can't should be sending TASK_SET_FULL when they've reached their limit. > > I have seen people try to improve performance of a storage system by > increasing queue depth in the target such as this. They note that the > queue is always full, so it must need more queue space. But this > degrades > performance, because on one of these first-in-first-out targets, > the only > way to get peak capacity is to keep the queue full all the time so > as to > create backpressure and cause the initiator to schedule the work. > Increasing the queue depth increases the chance that the initiator > will > not have the backlog necessary to do that scheduling. The correct > queue > depth on this kind of target is the number of requests the target can > process within the initiator's (and channel's) turnaround time. > >> brain-damaged >> marketing values small average access times more than a small >> variance in access times, so the device folks do crazy shortest- >> access-time-first scheduling instead of something more sane and less >> prone to spreading out the access time distribution like CSCAN. 
> > Since I'm talking about targets that don't do anything close to that > sophisticated with the stuff in their queue, this doesn't apply. > > But I do have to point out that there are systems where throughput is > everything, and response time, including variability of it, is > nothing. In > fact, the systems I work with are mostly that kind. For that kind of > system, you'd want to target to do that kind of scheduling. Yep, for batch you want SATF scheduling. It's not appropriate as the default setting for mass-produced disk devices, however. > >> 2) If you care about performance, you don't try to fill the device >> queue; you just want to have enough outstanding so that the device >> doesn't go idle when there is work to do. > > Why would the queue have a greater capacity than what is needed > when you > care about performance? Is there some non-performance reason to > have a > giant queue? Benchmarks which measure whether the device can coalesce 256 512-byte sequential writes :-) Basically it is that for disk devices the optimal queue depth depends on the workload, so it's statically-sized for the worst-case. > I still think having a giant queue is not a solution to any flow > control > (or, in the words of the original problem, I/O throttling) problem. I did not suggest a giant queue as a "solution". I only replied to Vladislav's question as to how disk drives avoid sending TASK_SET_FULL all the time. They have queue sizes larger than the number of commands that the initiator usually tries to send. > I'm > even skeptical that there's any size you can make one that would avoid > queue full conditions. Well, if it's bigger than the number of SCSI command buffers allocated by the initiator, the target wins and never has to send TASK_SET_FULL (unless there are multiple initiators). > It would be like avoiding difficult memory > allocation algorithms by just having a whole lot of memory. Yep. That's a good practical solution, and one which the operating system on your desktop computer probably uses :-) I do take your point; arbitrarily large queues only postpone the point at which the target must reply TASK_SET_FULL. Usually that is good enough. Regards, -Steve -- Steve Byan <smb@egenera.com> Software Architect Egenera, Inc. 165 Forest Street Marlboro, MA 01752 (508) 858-3125 ^ permalink raw reply [flat|nested] 25+ messages in thread
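Bryan's sizing rule quoted above, that the correct queue depth for a first-in-first-out target is the number of requests it can process within the initiator's (and channel's) turnaround time, amounts to a simple bandwidth-delay product. A back-of-envelope helper, with inputs that are assumptions to be measured rather than numbers from this thread:

    /* depth ~= completions during one initiator+channel turnaround */
    double fifo_queue_depth(double target_cmds_per_sec, double turnaround_sec)
    {
        return target_cmds_per_sec * turnaround_sec;
    }
    /* e.g. a target finishing 10000 commands/s behind a 2 ms round trip
     * needs a depth of roughly 20 to stay busy; anything much beyond that
     * only adds queuing delay without adding throughput. */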
* Re: SCSI target and IO-throttling 2006-03-06 19:15 ` Bryan Henderson 2006-03-06 19:55 ` Steve Byan @ 2006-03-07 17:56 ` Vladislav Bolkhovitin 2006-03-07 18:38 ` Steve Byan 1 sibling, 1 reply; 25+ messages in thread From: Vladislav Bolkhovitin @ 2006-03-07 17:56 UTC (permalink / raw) To: Bryan Henderson; +Cc: Steve Byan, linux-scsi

Bryan Henderson wrote:
>> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>>
>>> Could anyone advise how a SCSI target device can IO-throttle its initiators, i.e. prevent them from queuing too many commands, please?
>>>
>>> I suppose the best way of doing this is to inform the initiators about the maximum queue depth X of the target device, so that none of the initiators will send more than X commands. But I have not found anything similar to that on the INQUIRY or MODE SENSE pages. Have I missed something? Just returning QUEUE FULL status doesn't look correct, because it can lead to out-of-order command execution.
>>
>> Returning QUEUE FULL status is correct, unless the initiator does not have any pending commands on the LUN, in which case you should return BUSY. Yes, this can lead to out-of-order execution. That's why tapes have traditionally not used SCSI command queuing.
>
> I'm confused, Vladislav appears to be asking about flow control such as is built into iSCSI, wherein the iSCSI target tells the initiator how many tasks it's willing to work on at once, and the initiator stops sending new ones when it has hit that limit and waits for one of the previous ones to finish. And the target can continuously change that number.

Yes, exactly.

> With the more primitive transports, I believe this is a manual configuration step -- the target has a fixed maximum queue depth and you tell the driver via some configuration parameter what it is.

We currently mostly deal with Fibre Channel, which seems to be a kind of "more primitive transport" without explicit flow control. Actually, I'm very surprised and can hardly believe that such an advanced and expensive technology doesn't have such a basic thing as good flow control. Although, precisely speaking, such flow control lives at the level above the transport (this is true for iSCSI as well), so this is a SCSI flaw, not an FC one.

> As I understand it, any system in which QUEUE FULL (that's another name for SCSI's Task Set Full, isn't it?) errors happen is one that is not properly configured. I saw a broken iSCSI system that had QUEUE FULLs happening, and it was a performance disaster.

That is what we observe too: too many QUEUE FULLs degrade performance considerably.

>>> Apparently, hardware SCSI targets don't suffer from queuing overflow and don't return QUEUE FULL status all the time, so there must be a way to do the throttling more elegantly.
>>
>> No, they just have big queues.
>
> Big queues are another serious performance problem, when it means a target accepts work faster than it can do it. I've seen that cause initiators to send suboptimal requests (if the target appears to be working at infinite speed, the initiator sends small chunks of work as soon as each is ready, whereas if the initiator can tell that the target is choked, the initiator combines and sorts work while it waits, into a stream the target can handle more efficiently). When systems substitute an oversized queue in a target for initiator-target flow control, the initiator ends up having to compensate with artificial schemes to withhold work from a willing target (e.g. Linux "queue plugging").

This is one reason why I don't like having an oversized queue on the target. Another is initiator-side timeouts when the queue is so big that it cannot be drained in time. I described it in the previous email.

Thanks,
Vlad

^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: SCSI target and IO-throttling 2006-03-07 17:56 ` Vladislav Bolkhovitin @ 2006-03-07 18:38 ` Steve Byan 0 siblings, 0 replies; 25+ messages in thread From: Steve Byan @ 2006-03-07 18:38 UTC (permalink / raw) To: Vladislav Bolkhovitin; +Cc: Bryan Henderson, linux-scsi

On Mar 7, 2006, at 12:56 PM, Vladislav Bolkhovitin wrote:

> Bryan Henderson wrote:
>>> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>>>
>>>> Could anyone advise how a SCSI target device can IO-throttle its initiators, i.e. prevent them from queuing too many commands, please?
>>>>
>>>> I suppose the best way of doing this is to inform the initiators about the maximum queue depth X of the target device, so that none of the initiators will send more than X commands. But I have not found anything similar to that on the INQUIRY or MODE SENSE pages. Have I missed something? Just returning QUEUE FULL status doesn't look correct, because it can lead to out-of-order command execution.
>>>
>>> Returning QUEUE FULL status is correct, unless the initiator does not have any pending commands on the LUN, in which case you should return BUSY. Yes, this can lead to out-of-order execution. That's why tapes have traditionally not used SCSI command queuing.
>>
>> I'm confused, Vladislav appears to be asking about flow control such as is built into iSCSI, wherein the iSCSI target tells the initiator how many tasks it's willing to work on at once, and the initiator stops sending new ones when it has hit that limit and waits for one of the previous ones to finish. And the target can continuously change that number.
>
> Yes, exactly.
>
>> With the more primitive transports, I believe this is a manual configuration step -- the target has a fixed maximum queue depth and you tell the driver via some configuration parameter what it is.
>
> We currently mostly deal with Fibre Channel, which seems to be a kind of "more primitive transport" without explicit flow control. Actually, I'm very surprised and can hardly believe that such an advanced and expensive technology doesn't have such a basic thing as good flow control. Although, precisely speaking, such flow control lives at the level above the transport (this is true for iSCSI as well), so this is a SCSI flaw, not an FC one.

It has X-ON and X-OFF flow control. Not bad considering it was designed in the early 1980's.

X-OFF is TASK_SET_FULL or BUSY.

X-ON is a command completing; or, if BUSY was received because the initiator did not have any outstanding commands at the target, then X-ON is implied after a short time delay.

Since an intelligently-designed initiator isn't going to dump every command to the device anyway (after all, the person writing the initiator driver wants to have some fun implementing I/O optimizations too; can't let those target folk have all the fun :-), the XON/XOFF flow control isn't often invoked.

>> As I understand it, any system in which QUEUE FULL (that's another name for SCSI's Task Set Full, isn't it?) errors happen is one that is not properly configured. I saw a broken iSCSI system that had QUEUE FULLs happening, and it was a performance disaster.
>
> That is what we observe too: too many QUEUE FULLs degrade performance considerably.

Sounds like a broken initiator.

>>>> Apparently, hardware SCSI targets don't suffer from queuing overflow and don't return QUEUE FULL status all the time, so there must be a way to do the throttling more elegantly.
>>>
>>> No, they just have big queues.
>>
>> Big queues are another serious performance problem, when it means a target accepts work faster than it can do it. I've seen that cause initiators to send suboptimal requests (if the target appears to be working at infinite speed, the initiator sends small chunks of work as soon as each is ready, whereas if the initiator can tell that the target is choked, the initiator combines and sorts work while it waits, into a stream the target can handle more efficiently). When systems substitute an oversized queue in a target for initiator-target flow control, the initiator ends up having to compensate with artificial schemes to withhold work from a willing target (e.g. Linux "queue plugging").
>
> This is one reason why I don't like having an oversized queue on the target.

This is just a matter of taste, of whether you prefer the optimization to be done on the initiator side or the target side. If you prefer it to be done on the initiator side, then don't queue large amounts of work at the target.

> Another is initiator-side timeouts when the queue is so big that it cannot be drained in time. I described it in the previous email.

This is just a bug in the initiator. It can observe the average service time, and it knows how many commands it has queued. If it sets its timeout anywhere close to the product of those two numbers, it is buggy.

Regards,
-Steve
--
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125

^ permalink raw reply [flat|nested] 25+ messages in thread
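Steve's timeout rule above reduces to simple arithmetic. A small illustrative helper; the 4x safety margin is an arbitrary choice in this sketch, not anything mandated by the SCSI standards:

    /* An initiator's command timeout must comfortably exceed the worst-case
     * queuing delay, roughly (commands queued x average service time);
     * setting it anywhere near that product invites spurious aborts. */
    double min_safe_timeout_sec(unsigned queued_cmds, double avg_service_sec)
    {
        return 4.0 * queued_cmds * avg_service_sec;  /* 4x margin: arbitrary */
    }
    /* e.g. 256 queued commands at 5 ms each is already 1.28 s of pure
     * queuing delay, so the timeout should be several seconds at minimum. */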
* Re: SCSI target and IO-throttling 2006-03-03 18:07 ` Steve Byan 2006-03-03 18:47 ` Stefan Richter 2006-03-06 19:15 ` Bryan Henderson @ 2006-03-07 17:53 ` Vladislav Bolkhovitin 2006-03-07 18:19 ` Steve Byan 2 siblings, 1 reply; 25+ messages in thread From: Vladislav Bolkhovitin @ 2006-03-07 17:53 UTC (permalink / raw) To: Steve Byan; +Cc: linux-scsi

Steve Byan wrote:
>
> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>
>> Could anyone advise how a SCSI target device can IO-throttle its initiators, i.e. prevent them from queuing too many commands, please?
>>
>> I suppose the best way of doing this is to inform the initiators about the maximum queue depth X of the target device, so that none of the initiators will send more than X commands. But I have not found anything similar to that on the INQUIRY or MODE SENSE pages. Have I missed something? Just returning QUEUE FULL status doesn't look correct, because it can lead to out-of-order command execution.
>
> Returning QUEUE FULL status is correct, unless the initiator does not have any pending commands on the LUN, in which case you should return BUSY. Yes, this can lead to out-of-order execution. That's why tapes have traditionally not used SCSI command queuing.
>
> Look into the unit attention interlock feature added to SCSI as a result of uncovering this issue during the development of the iSCSI standard.
>
>> Apparently, hardware SCSI targets don't suffer from queuing overflow and don't return QUEUE FULL status all the time, so there must be a way to do the throttling more elegantly.
>
> No, they just have big queues.

Thanks for the reply!

Things are getting clearer for me now, but there are still a few things that are not very clear to me. I hope they won't require too long answers. I'm asking because we in the SCST project (a SCSI target mid-level for Linux plus some target drivers, http://scst.sourceforge.net) must emulate correct SCSI target device behavior under any IO load, including extremely high ones.

- Can you estimate, please, how big the target's command queue should be so that initiators never receive QUEUE FULL status? Consider the case where the initiators are Linux-based and each has a separate and independent queue.

- The queue could be so big that the last command in it cannot be processed before the initiator's timeout; then, after the timeout is hit, the initiator would start issuing ABORTs for the timed-out command. Is that OK behavior? Or rather a misconfiguration (of whom, the initiator or the target)? Is the initiator in such a situation supposed to reissue the command after the preceding ones finish, or behave in some other way? Apparently, ABORTs must hit performance to a similar degree as too many QUEUE FULLs, if not more.

Seems we should set up the target queue with a virtually unlimited size and, if an initiator is dumb enough to queue so many commands that there will be timeouts, then it will be its problem and duty to handle the situation without performance loss. Does that look OK?

Thanks,
Vlad

> Regards,
> -Steve

^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: SCSI target and IO-throttling 2006-03-07 17:53 ` Vladislav Bolkhovitin @ 2006-03-07 18:19 ` Steve Byan 2006-03-07 18:46 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 25+ messages in thread From: Steve Byan @ 2006-03-07 18:19 UTC (permalink / raw) To: Vladislav Bolkhovitin; +Cc: linux-scsi

On Mar 7, 2006, at 12:53 PM, Vladislav Bolkhovitin wrote:

> Steve Byan wrote:
>> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>>> Could anyone advise how a SCSI target device can IO-throttle its initiators, i.e. prevent them from queuing too many commands, please?
>>>
>>> I suppose the best way of doing this is to inform the initiators about the maximum queue depth X of the target device, so that none of the initiators will send more than X commands. But I have not found anything similar to that on the INQUIRY or MODE SENSE pages. Have I missed something? Just returning QUEUE FULL status doesn't look correct, because it can lead to out-of-order command execution.
>> Returning QUEUE FULL status is correct, unless the initiator does not have any pending commands on the LUN, in which case you should return BUSY. Yes, this can lead to out-of-order execution. That's why tapes have traditionally not used SCSI command queuing.
>> Look into the unit attention interlock feature added to SCSI as a result of uncovering this issue during the development of the iSCSI standard.
>>> Apparently, hardware SCSI targets don't suffer from queuing overflow and don't return QUEUE FULL status all the time, so there must be a way to do the throttling more elegantly.
>> No, they just have big queues.
>
> Thanks for the reply!
>
> Things are getting clearer for me now, but there are still a few things that are not very clear to me. I hope they won't require too long answers. I'm asking because we in the SCST project (a SCSI target mid-level for Linux plus some target drivers, http://scst.sourceforge.net) must emulate correct SCSI target device behavior under any IO load, including extremely high ones.
>
> - Can you estimate, please, how big the target's command queue should be so that initiators never receive QUEUE FULL status? Consider the case where the initiators are Linux-based and each has a separate and independent queue.

Do you have a per-target pool of resources for handling commands, or are the pools per logical unit?

I'm not sure you could size the queue so that TASK_SET_FULL is never returned. Just accept the fact that the target must return TASK_SET_FULL or BUSY sometimes.

As a data point, some modern SCSI disks support queue depths in the range of 128 to 256 commands.

> - The queue could be so big that the last command in it cannot be processed before the initiator's timeout; then, after the timeout is hit, the initiator would start issuing ABORTs for the timed-out command. Is that OK behavior?

Well, it's the behavior implied by the SCSI standard; that is, on a timeout, the initiator should abort the command. If an initiator sets its timeout to less than the queuing delay at the server, I wouldn't call that "OK behavior", but it's not the target's fault, it's the initiator's fault.

> Or rather a misconfiguration (of whom, the initiator or the target)? Is the initiator in such a situation supposed to reissue the command after the preceding ones finish, or behave in some other way?

I think it's up to the class driver to decide whether to retry a command after it times out.

> Apparently, ABORTs must hit performance to a similar degree as too many QUEUE FULLs, if not more.

Much worse, I would think.

> Seems we should set up the target queue with a virtually unlimited size and, if an initiator is dumb enough to queue so many commands that there will be timeouts, then it will be its problem and duty to handle the situation without performance loss. Does that look OK?

I don't think you need to pick an unlimited size. Something on the order of 128 to 512 commands should be sufficient. If you have multiple logical units, you could probably combine them in a common pool and somewhat reduce the number of command resources you allocate per logical unit, on the theory that they'll not all be fully utilized at the same time.

By the way, make sure you don't deadlock trying to obtain command resources to return TASK_SET_FULL or BUSY to a command in the case where the pool of command resources is exhausted. This is one of the tricky bits.

Regards,
-Steve
--
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125

^ permalink raw reply [flat|nested] 25+ messages in thread
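One common way to avoid the deadlock Steve warns about in his last paragraph is to reserve a response structure up front, outside the normal command pool, so that a TASK SET FULL or BUSY status can always be sent without allocating anything on the error path. A simplified, hypothetical C sketch; 0x28 and 0x08 are the SAM status codes for TASK SET FULL and BUSY:

    #include <stddef.h>

    #define STATUS_TASK_SET_FULL 0x28  /* SAM status codes */
    #define STATUS_BUSY          0x08

    struct cmd {
        int status;
        /* ... per-command state elided ... */
    };

    struct target_port {
        struct cmd *pool;       /* normal command pool (management elided) */
        struct cmd  reserved;   /* preallocated; never enters the pool */
    };

    void handle_incoming_cmd(struct target_port *tp)
    {
        struct cmd *c = tp->pool;  /* stand-in for a real pool allocation */
        if (!c) {
            /* Pool exhausted: answer from the reserved slot, so sending
             * TASK SET FULL never itself requires an allocation. */
            tp->reserved.status = STATUS_TASK_SET_FULL;
            /* send_status(&tp->reserved);  -- transport send elided */
            return;
        }
        /* ... normal command processing ... */
    }

A real implementation would need to serialize use of the reserved slot (or keep one per nexus), but the principle is the same: the failure path must never depend on the resource whose exhaustion it reports.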
* Re: SCSI target and IO-throttling 2006-03-07 18:19 ` Steve Byan @ 2006-03-07 18:46 ` Vladislav Bolkhovitin 2006-03-07 19:00 ` Steve Byan 0 siblings, 1 reply; 25+ messages in thread From: Vladislav Bolkhovitin @ 2006-03-07 18:46 UTC (permalink / raw) To: Steve Byan; +Cc: linux-scsi

Steve Byan wrote:
>
> On Mar 7, 2006, at 12:53 PM, Vladislav Bolkhovitin wrote:
>
>> Steve Byan wrote:
>>
>>> On Mar 2, 2006, at 11:21 AM, Vladislav Bolkhovitin wrote:
>>>
>>>> Could anyone advise how a SCSI target device can IO-throttle its initiators, i.e. prevent them from queuing too many commands, please?
>>>>
>>>> I suppose the best way of doing this is to inform the initiators about the maximum queue depth X of the target device, so that none of the initiators will send more than X commands. But I have not found anything similar to that on the INQUIRY or MODE SENSE pages. Have I missed something? Just returning QUEUE FULL status doesn't look correct, because it can lead to out-of-order command execution.
>>>
>>> Returning QUEUE FULL status is correct, unless the initiator does not have any pending commands on the LUN, in which case you should return BUSY. Yes, this can lead to out-of-order execution. That's why tapes have traditionally not used SCSI command queuing.
>>> Look into the unit attention interlock feature added to SCSI as a result of uncovering this issue during the development of the iSCSI standard.
>>>
>>>> Apparently, hardware SCSI targets don't suffer from queuing overflow and don't return QUEUE FULL status all the time, so there must be a way to do the throttling more elegantly.
>>>
>>> No, they just have big queues.
>>
>> Thanks for the reply!
>>
>> Things are getting clearer for me now, but there are still a few things that are not very clear to me. I hope they won't require too long answers. I'm asking because we in the SCST project (a SCSI target mid-level for Linux plus some target drivers, http://scst.sourceforge.net) must emulate correct SCSI target device behavior under any IO load, including extremely high ones.
>>
>> - Can you estimate, please, how big the target's command queue should be so that initiators never receive QUEUE FULL status? Consider the case where the initiators are Linux-based and each has a separate and independent queue.
>
> Do you have a per-target pool of resources for handling commands, or are the pools per logical unit?

The most limited resource is the memory allocated for command buffers. It is per-target. Other resources, like the internal command structures, are so small that they can be considered virtually unlimited. They are also global, but accounting is done per (session (nexus), LU).

> I'm not sure you could size the queue so that TASK_SET_FULL is never returned. Just accept the fact that the target must return TASK_SET_FULL or BUSY sometimes.

We have a relatively cheap method of queuing commands without allocating buffers for them. This way millions of commands can be queued on an average Linux box without problems. Only ABORTs and their influence on performance worry me.

> As a data point, some modern SCSI disks support queue depths in the range of 128 to 256 commands.

I was rather asking about the practical upper limit. From our observations a Linux initiator can easily send 128+ commands, but usually sends fewer. It looks like it depends on its available memory. I would be interested to know the exact rule.

>> - The queue could be so big that the last command in it cannot be processed before the initiator's timeout; then, after the timeout is hit, the initiator would start issuing ABORTs for the timed-out command. Is that OK behavior?
>
> Well, it's the behavior implied by the SCSI standard; that is, on a timeout, the initiator should abort the command. If an initiator sets its timeout to less than the queuing delay at the server, I wouldn't call that "OK behavior", but it's not the target's fault, it's the initiator's fault.
>
>> Or rather a misconfiguration (of whom, the initiator or the target)? Is the initiator in such a situation supposed to reissue the command after the preceding ones finish, or behave in some other way?
>
> I think it's up to the class driver to decide whether to retry a command after it times out.
>
>> Apparently, ABORTs must hit performance to a similar degree as too many QUEUE FULLs, if not more.
>
> Much worse, I would think.
>
>> Seems we should set up the target queue with a virtually unlimited size and, if an initiator is dumb enough to queue so many commands that there will be timeouts, then it will be its problem and duty to handle the situation without performance loss. Does that look OK?
>
> I don't think you need to pick an unlimited size. Something on the order of 128 to 512 commands should be sufficient. If you have multiple logical units, you could probably combine them in a common pool and somewhat reduce the number of command resources you allocate per logical unit, on the theory that they'll not all be fully utilized at the same time.

OK

> By the way, make sure you don't deadlock trying to obtain command resources to return TASK_SET_FULL or BUSY to a command in the case where the pool of command resources is exhausted. This is one of the tricky bits.

In our architecture there is no need to allocate any additional resources to reply with TASK_SET_FULL or BUSY, so we have already taken care of this.

Thanks,
Vlad

^ permalink raw reply [flat|nested] 25+ messages in thread
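Vlad's "relatively cheap method of queuing commands without allocating buffers" suggests a two-phase admission scheme. The following C sketch is a guess at the general shape under that assumption, not SCST's actual code: commands are admitted with only a small descriptor, and the scarce data buffer is allocated just before execution:

    #include <stdlib.h>

    struct queued_cmd {
        size_t data_len;
        void  *buf;           /* NULL while the command merely waits */
    };

    /* Phase 1, admission: only a small descriptor is allocated, so a
     * very large number of commands can sit in the queue cheaply. */
    struct queued_cmd *admit_cmd(size_t data_len)
    {
        struct queued_cmd *c = malloc(sizeof(*c));
        if (c) {
            c->data_len = data_len;
            c->buf = NULL;
        }
        return c;
    }

    /* Phase 2, just before execution: allocate the scarce data buffer. */
    int prepare_cmd(struct queued_cmd *c)
    {
        c->buf = malloc(c->data_len);
        return c->buf ? 0 : -1;   /* -1: memory pressure, retry later */
    }

The design choice is that the admission decision (and hence TASK_SET_FULL) is decoupled from data-buffer availability, so queue depth is bounded by descriptor memory rather than by the much larger buffer memory.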
* Re: SCSI target and IO-throttling 2006-03-07 18:46 ` Vladislav Bolkhovitin @ 2006-03-07 19:00 ` Steve Byan 0 siblings, 0 replies; 25+ messages in thread From: Steve Byan @ 2006-03-07 19:00 UTC (permalink / raw) To: Vladislav Bolkhovitin; +Cc: linux-scsi

On Mar 7, 2006, at 1:46 PM, Vladislav Bolkhovitin wrote:

> Steve Byan wrote:
>> As a data point, some modern SCSI disks support queue depths in the range of 128 to 256 commands.
>
> I was rather asking about the practical upper limit. From our observations a Linux initiator can easily send 128+ commands, but usually sends fewer. It looks like it depends on its available memory. I would be interested to know the exact rule.

I don't know the rule. Obviously, it could change over time, and be different for different OS's. Sounds to me like you might be trying to fix a busted initiator by changing the target behavior.

Regards,
-Steve
--
Steve Byan <smb@egenera.com>
Software Architect
Egenera, Inc.
165 Forest Street
Marlboro, MA 01752
(508) 858-3125

^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads: [~2006-03-15 17:15 UTC | newest]

Thread overview: 25+ messages
2006-03-02 16:21 SCSI target and IO-throttling Vladislav Bolkhovitin
2006-03-03 18:07 ` Steve Byan
2006-03-03 18:47 ` Stefan Richter
2006-03-03 20:24 ` Steve Byan
2006-03-06 19:15 ` Bryan Henderson
2006-03-06 19:55 ` Steve Byan
2006-03-07 23:32 ` Bryan Henderson
2006-03-08 15:35 ` Vladislav Bolkhovitin
2006-03-08 15:56 ` Steve Byan
2006-03-08 17:49 ` Vladislav Bolkhovitin
2006-03-08 18:09 ` Steve Byan
2006-03-09 18:37 ` Vladislav Bolkhovitin
2006-03-09 19:32 ` Steve Byan
2006-03-10 18:46 ` Vladislav Bolkhovitin
2006-03-10 19:47 ` Steve Byan
2006-03-13 17:35 ` Vladislav Bolkhovitin
2006-03-14 20:54 ` Douglas Gilbert
2006-03-15 17:15 ` Vladislav Bolkhovitin
2006-03-10 13:26 ` Steve Byan
2006-03-07 17:56 ` Vladislav Bolkhovitin
2006-03-07 18:38 ` Steve Byan
2006-03-07 17:53 ` Vladislav Bolkhovitin
2006-03-07 18:19 ` Steve Byan
2006-03-07 18:46 ` Vladislav Bolkhovitin
2006-03-07 19:00 ` Steve Byan