From mboxrd@z Thu Jan 1 00:00:00 1970
From: Doug Ledford
Subject: Re: PATCH: scsi device queue depth adjustability patch
Date: Wed, 2 Oct 2002 18:18:37 -0400
Sender: linux-scsi-owner@vger.kernel.org
Message-ID: <20021002221837.GB30503@redhat.com>
References: <20021002002854.GF28265@redhat.com> <200210022141.g92LfCh01941@localhost.localdomain>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
Content-Disposition: inline
In-Reply-To: <200210022141.g92LfCh01941@localhost.localdomain>
List-Id: linux-scsi@vger.kernel.org
To: James Bottomley
Cc: linux-scsi@vger.kernel.org

On Wed, Oct 02, 2002 at 05:41:12PM -0400, James Bottomley wrote:
> dledford@redhat.com said:
> > This patch makes it possible to adjust the queue depth of a scsi
> > device after it has been in use some time and you have a better idea
> > of what the optimal queue depth should be.  For the most part this
> > should work, but my 2.5.40 test machine is blowing chunks on the
> > serverworks IDE support right now so it isn't tested :-(
>
> I note that there's a lot more than dynamic queue depth adjustment in this
> patch (PPA inquiry, slave attach etc.).

The slave attach is required for the queue depth adjustment stuff to work.
The goal is to remove the select_queue_depths() function entirely and make
slave_attach() -> adjust_queue_depth() the replacement.  The PPR message
ability flag is just a simple bit that goes along with slave attach.
There's more to be done there, but it's in the patch as a sample.

> How do HBA's that don't support this now work?

The mid layer calls into the HBA driver's select_queue_depths() function
and the driver sets the queue depth on each device from there.

> You've taken any dependency on
> scsi_host.cmd_per_lun out of the code (thus rendering it useless) so every HBA
> driver now is forced to use an initial queue depth adjustment just to start
> tagged command queueing.

This is true both before and after my patch.
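Just to make the shape of the new call-in point concrete, here's a toy
user-space model (every name, field, and number below is invented for
illustration; this is *not* the actual patch or mid layer code):

```c
#include <assert.h>

/* Toy model of the slave_attach() idea: one hook that sets all the
 * post-INQUIRY, drive-specific items at once.  Hypothetical names. */

struct toy_sdev {
	int tagged_supported;	/* from INQUIRY: device does TCQ      */
	int spi3_supported;	/* from INQUIRY: device speaks SPI-3  */
	int queue_depth;	/* commands allocated for this device */
	int allow_ppr;		/* may we negotiate with PPR messages */
};

#define TOY_TCQ_DEPTH      32	/* driver's preferred tagged depth   */
#define TOY_UNTAGGED_DEPTH  2	/* stand-in for cmd_per_lun          */

/* What a low level driver's attach hook would do, instead of the
 * old one-shot select_queue_depths() callback. */
void toy_slave_attach(struct toy_sdev *sdev)
{
	sdev->queue_depth = sdev->tagged_supported ?
			    TOY_TCQ_DEPTH : TOY_UNTAGGED_DEPTH;
	sdev->allow_ppr = sdev->spi3_supported;
}
```

The point is simply that one generic hook can set several post-INQUIRY
items (depth, negotiation protocols, whatever comes next), where
select_queue_depths() could only ever set the depth.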
If the HBA driver doesn't implement the select_queue_depths() function,
then it will never get tagged queueing.  Besides, cmd_per_lun is *not* for
tagged queueing; it's the queue depth the scsi mid layer is supposed to
keep for *non*-tagged devices.

> Can't we at least start with cmd_per_lun as the
> default depth?

The current mid layer has never enabled a default depth; all scsi drivers
must enable tagged queueing for each device before it happens.  I'm just
changing how that's done: from a specific select_queue_depths() call-in
point that does one and only one thing, to a generic slave_attach() call-in
point that sets various post-INQUIRY, drive-specific items -- queue depth
and acceptable speed negotiation message protocols being the two examples
in my patch.

> I'm not entirely happy with the idea that we control the queue depth by
> adjusting the number of the device's allocated commands.

It's the only safe way to do it.  You can't count on being able to allocate
commands later, because you may not have memory available, and you don't
want extras lying around, because that wastes memory.  It also allows my
next set of changes to go in, which includes switching to linked lists for
commands.  That way we can have a free list for each lun, and whenever
there is a command on the free list you know you are under your queue depth
count, because if all commands are in use, none are on the free list.

> I know the patch
> goes to great lengths to move these kmallocs out of the critical path, but
> there are certain environments (multi-initiator) where the queue depth can be
> nastily and randomly variable.

Go look at the QUEUE_FULL handler in the aic7xxx_old driver.  This is how
most/all reasonably well written drivers handle queue depth adjustments.
Trust me, they don't go around adjusting the depth all the time.
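The pattern those drivers follow can be modeled roughly like this (the
threshold and all names below are invented; aic7xxx_old's real logic
differs in detail, this is just the shape of it):

```c
#include <assert.h>

/* Toy model of QUEUE_FULL disposition: decide whether a QUEUE_FULL
 * at a given depth is a transient event or a hard device limit. */

struct toy_qf_state {
	int last_full_depth;	/* depth at the previous QUEUE_FULL   */
	int repeat_count;	/* how often we've seen that depth    */
	int locked_depth;	/* 0 until we decide it's a hard cap  */
};

#define TOY_REPEAT_LIMIT 3	/* same depth this often => hard limit */

/* Returns the new depth to lock the device to, or 0 for "transient:
 * just freeze the queue briefly and carry on at the current depth". */
int toy_handle_queue_full(struct toy_qf_state *st, int depth_at_full)
{
	if (st->locked_depth)
		return 0;		/* already capped, random event */

	if (depth_at_full == st->last_full_depth) {
		if (++st->repeat_count >= TOY_REPEAT_LIMIT) {
			st->locked_depth = depth_at_full;
			return depth_at_full;	/* lock in the max */
		}
	} else {
		st->last_full_depth = depth_at_full;
		st->repeat_count = 1;
	}
	return 0;			/* treat as a random QUEUE_FULL */
}
```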
Most of the time there will be one initial adjustment, then maybe one more
adjustment as we lock the depth down to the max upper limit when one
exists.  The rest of the time we just handle the occasional random
QUEUE_FULL messages as exactly that, and only temporarily freeze the queue
to let the drive get some work done.

> If the allocations were more lazy (wait a
> while before freeing a struct Scsi_Cmnd to see if the queue depth goes up
> again for instance) this would address some of these concerns (perhaps just
> moving to a slab allocator for command blocks would do it?)

Nope, this is a non-concern, because we don't call adjust_queue_depth() on
every QUEUE_FULL (or at least we shouldn't; there isn't any documentation
yet to tell people that, but that was the purpose of making the aic7xxx_old
driver implement the new method -- it serves as an example, and it does
exactly what I'm talking about).

> > Side note: I left the control of queue depth setting solely in the
> > hands of the low level drivers since they are the *only* ones that
> > can get an accurate queue depth reading at the time of any given
> > QUEUE_FULL message (see what the aic7xxx_old driver has to do in the
> > QUEUE_FULL handler to find out how many commands the drive has seen
> > at this exact point in time vs. how many we may have queued up to the
> > card, the difference in numbers can be significant).  For that
> > reason, the adjust_queue_depth call was made to defer the action
> > until later so that it was interrupt context safe.
>
> I appreciate that the HBA driver is the most exact counter of the queue depth,
> but would it make a significant difference if the adjustments were done
> globally in the mid-layer?

The mid layer simply does not have access to the info needed to do what you
are talking about.  Let me give you an example from the aic7xxx driver.  On
this card we have a sequencer that handles the scsi protocol for us.
When we get a command from the mid layer, it goes either to the card's
QINFIFO (the queue of commands that the sequencer hasn't started yet) or to
the device's waiting queue (the queue of commands that we can't send to the
sequencer yet for some reason).  Once it is on the card and the sequencer
starts the process of selecting the target device, the command is moved to
the waiting_q on the card (the queue of commands already started but for
which the target device has not yet responded).  Once the device is
selected and we get to send the command, then we usually go directly into
status phase and get our QUEUE_FULL.  At this point the sequencer pauses
all operations on the scsi bus and interrupts the kernel to handle the
QUEUE_FULL.  Prior to this command being selected, any number of commands
may have been placed in the card's QOUTFIFO (our queue of already completed
SCSI commands which the kernel interrupt driver hasn't processed yet).

So when the kernel driver gets the QUEUE_FULL interrupt, it must first plug
this device up.  It then must remove any commands it finds in the QINFIFO
and in the waiting_q on the card and requeue them back into the device's
waiting queue.  Each command it moves like this decrements the device's
active command count, since we increment the active count when we place a
command on the QINFIFO.  Then we have to pull any completed commands out of
the QOUTFIFO and place them on the done queue (commands we have queued to
go back to the mid layer), also decrementing the active count each time.
Finally, we decrement the active count for the device one more time to
accommodate our QUEUE_FULL command.  Only after we have done *all* of this
crap do we have an accurate queue depth count for the command that returned
a QUEUE_FULL.  As you can see, lots of this is very hardware specific.
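The arithmetic at the end of all that boils down to something like this
toy function (names invented, but term for term it follows the decrements
described above):

```c
#include <assert.h>

/* Toy model of the accurate-depth accounting: start from the
 * driver's active count and back out everything the drive had not
 * actually seen when it returned QUEUE_FULL. */
int toy_accurate_depth(int active_cnt,
		       int requeued_qinfifo,	/* pulled back from QINFIFO   */
		       int requeued_waiting,	/* pulled back from waiting_q */
		       int done_qoutfifo)	/* completed, not yet reaped  */
{
	int depth = active_cnt;

	depth -= requeued_qinfifo;	/* never reached the drive          */
	depth -= requeued_waiting;	/* selection begun, command unsent  */
	depth -= done_qoutfifo;		/* already finished by the drive    */
	depth -= 1;			/* the command that drew QUEUE_FULL */

	return depth;	/* what the drive was holding when it said "full" */
}
```

Only the low level driver can supply those four numbers, which is the
whole argument for keeping the counting down there.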
Controller cards that don't have a sequencer-like control engine won't be
starting commands off in a lazy fashion like this, and so don't need all
this stuff.  That's why the only place where we can get an accurate count
of the queue depth is in the low level driver.  So that code *must* stay
there, since getting an accurate count is how we avoid the random flip
flops of queue depth that you were worried about earlier.

Now, what can be moved to the mid layer, and is on my list to do, is a
generic interface for deciding on the proper action after a QUEUE_FULL.
Currently, each driver not only determines the real depth, but then also
does its own magic to tell whether it's a random event or a hard limit.  It
would be easy to add something like scsi_notify_queue_full(sdev, count),
where scsi_notify_queue_full() would keep track of the last queue full
depth, etc.  Let the low level driver come up with the accurate count, then
it can use this function to determine what to do about it.  On any change
to queue depth, the function can return the number of commands to
add/subtract, then the low level driver can adjust its own internal
structure counts and also call scsi_adjust_queue_depth() to have the mid
layer do likewise.

BTW, I'll change the adjust_queue_depth code to make it immediately adjust
the depth down when possible and do lazy increases, so that hitting a hard
limit frees up resources immediately.  But that will go with making the
commands linked list based, so that it can simply do:

	while (sdev->queue_depth > sdev->new_queue_depth &&
	       list_not_empty(sdev->free_list)) {
		kfree(get_list_head(sdev->free_list));
		sdev->queue_depth--;
	}

> The great advantage is that we would gain dynamic
> queue depth adjustments without having to add specific code to every driver,

You're never going to get around this and have a reasonably reliable
method :-(

> at the cost of not always being entirely accurate about the depth.
> Is there a
> good argument that this really, really must be done at the LLD level given the
> cost in terms of LLD modifications?  You could still add hooks for those HBAs
> that really want to do it themselves.

Well, all the device drivers that implement queue depth adjustments at all
already do it themselves, so my proposed method is better than what we
currently have: at the very least it allows them to move the
decide-the-disposition code into a library call, or they can keep doing it
themselves.

-- 
Doug Ledford  919-754-3700 x44233
Red Hat, Inc.
1801 Varsity Dr.
Raleigh, NC 27606