* scsi command slab allocation under memory pressure @ 2003-01-29 18:47 Patrick Mansfield 2003-01-29 19:40 ` Luben Tuikov 2003-01-29 22:53 ` James Bottomley 0 siblings, 2 replies; 15+ messages in thread From: Patrick Mansfield @ 2003-01-29 18:47 UTC (permalink / raw) To: linux-scsi James had a similar comment about this (free_list storing multiple commands). The linux-scsi.bkbits.net scsi-kmem_alloc-2.5 and scsi-combined-2.5 trees include the scsi command slab allocation (Luben's patch). How does the use of a single slab for all hosts and all devices allow for IO while under memory pressure? There is one extra scsi command pre-allocated per host, but don't we require at least one (and ideally maybe more) per device? The pre-slab (current mainline kernel) command allocation always had at least one command per device available, and usually more (because we allocated more commands during the scan and upper level init). That is - if we have swap on a separate disk and our command pool is small enough, IO to another disk could use the single per-host command under memory pressure, and we can fail to get a scsi command in order to write to the swap disk. scsi_put_command() re-fills the host->free_list if it is empty, but under high (or higher) IO loads, the disk/device that generated the scsi_put_command will immediately issue a scsi_get_command for the same device. If all command allocations are failing for a particular device (i.e. swap), we will wait a bit (device_blocked and device_busy == 0) and try again; we will not retry based on a scsi_put_command(). Even if we did retry based on a scsi_put_command, we can still race with the scsi_put_command caller. -- Patrick Mansfield ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-01-29 18:47 scsi command slab allocation under memory pressure Patrick Mansfield @ 2003-01-29 19:40 ` Luben Tuikov 2003-01-29 20:11 ` Patrick Mansfield 2003-01-29 22:53 ` James Bottomley 1 sibling, 1 reply; 15+ messages in thread From: Luben Tuikov @ 2003-01-29 19:40 UTC (permalink / raw) To: Patrick Mansfield; +Cc: linux-scsi Patrick Mansfield wrote: > James had a similar comment about this (free_list storing multiple > commands). > > The linux-scsi.bkbits.net scsi-kmem_alloc-2.5 and scsi-combined-2.5 trees > include the scsi command slab allocation (Luben's patch). > > How does the use of a single slab for all hosts and all devices allow for > IO while under memory pressure? > > There is one extra scsi command pre-allocated per host, but don't we > require at least one (and ideally maybe more) per device? The pre-slab > (current mainline kernel) command allocation always had at least one > command per device available, and usually more (because we allocated more > commands during the scan and upper level init). > > That is - if we have swap on a separate disk and our command pool is small > enough, IO to another disk could use the single per-host command under > memory pressure, and we can fail to get a scsi command in order to write > to the swap disk. > > scsi_put_command() re-fills the host->free_list if it is empty, but under > high (or higher) IO loads, the disk/device that generated the > scsi_put_command will immediately issue a scsi_get_command for the same > device. > > If all command allocations are failing for a particular device (i.e. > swap), we will wait a bit (device_blocked and device_busy == 0) and try > again; we will not retry based on a scsi_put_command(). Even if we did > retry based on a scsi_put_command, we can still race with the > scsi_put_command caller. Is this a question, narrative, comment or flame? I'll try to answer this anyway.
James had this comment, just because I put the mechanism there to allow for more than one command to be in the store of backup command structs. See my reply to his email in the archives. The reason for populating free_list with just one command on host init is quite obvious, but to elaborate, the choices are 1 and N, where N is a natural number greater than 1. The problem with N is that I do *not* have a heuristic which will tell me what a suitable value for N is. How about 5, hmm, what about 10, or maybe 1e10? Furthermore, N may be a constant or it may be a function of how much memory we currently have, how many commands have been queued into the host, etc, or it may just be N = can_queue - num_queued_commands + 1, which is dynamic, which is pointless (Homework: show why). We want to waste as little memory as possible (thus 1 per host), since SCSI Core is not the only subsystem running on the machine. Hint: see the flags the slabs are allocated with. Using the Central Limit Theorem, I hope that by the time we come under memory pressure, the scsi command cache pool size has settled*. Unless we started with very little memory, which would be quite unusual in this day and age. * Lots of assumptions here, but all valid for a *server* machine. Let's get some experience with this thing running and actually have a *natural* failing example, and we can twiddle with the initial value of N, and/or can develop a f(N) which would be computed occasionally and free_list varied upon scsi_put_command(). -- Luben P.S. In my own mini-scsi-core I haven't had any problems with this issue. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-01-29 19:40 ` Luben Tuikov @ 2003-01-29 20:11 ` Patrick Mansfield 2003-01-29 22:26 ` Luben Tuikov 2003-01-31 6:57 ` Andrew Morton 0 siblings, 2 replies; 15+ messages in thread From: Patrick Mansfield @ 2003-01-29 20:11 UTC (permalink / raw) To: Luben Tuikov; +Cc: linux-scsi On Wed, Jan 29, 2003 at 02:40:28PM -0500, Luben Tuikov wrote: > Patrick Mansfield wrote: > > Is this a question, narrative, comment or flame? I'll try to answer > this anyway. Yes, except for the flame part :) > James had this comment, just because I put the mechanism there to allow > for more than one command to be in the store of backup command structs. > See my reply to his email in the archives. > > The reason for populating free_list with just one command on host init, > is quite obvious, but to elaborate, the choices are 1 and N, where N is > a natural number greater than 1. > > The problem with N is that I do *not* have a heuristic which will tell me > what a suitable value for N is. How about 5, hmm, what about 10, or maybe > 1e10? We do have bounds on the number of commands. Assuming N is the minimum number of commands we want available: Currently (plain 2.5.59) we have N being the number of scsi_devices on the system, and we always have at least one command available for each scsi_device. It is not clear if N should be the number of scsi commands that might be outstanding on the system (what it used to be prior to Doug L's queue depth changes, and what we have in 2.4). So N should be at least the number of scsi_devices on the system, and at most the sum of all the commands that can potentially be used by all scsi_devices on the system. > Furthermore, N may be a constant or it may be a function of how much > memory we currently have, how many commands have been queued into the host, > etc, or it may just be N = can_queue - num_queued_commands + 1, which is > dynamic, which is pointless (Homework: show why). 
> > We want to waste as little memory as possible (thus 1 per host), since > SCSI Core is not the only subsystem running on the machine. Hint: > see the flags the slabs are allocated with. Using the Central Limit > Theorem, I hope that by the time we get low on memory pressure, the scsi > command cache pool size has settled*. Unless we started with very little > memory, which would be quite unusual in this day and age. If we must allocate enough space to ensure proper swapout behaviour, it is not a waste! IMO we need something similar to the request_queue_t free list, where we allocate a bunch of items up front out of a slab. > * Lots of assumptions here, but all valid for a *server* machine. > > Let's get some experience with this thing running and actually have a > *natural* failing example, and we can twiddle with the initial value > of N, and/or can develop a f(N) which would be computed occasionally > and free_list varied upon scsi_put_command(). Are you saying we will always be able to get a command under low memory situations? If not, the code should be changed, otherwise please explain how. We should not wait for a failure to occur to answer the question. -- Patrick Mansfield ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-01-29 20:11 ` Patrick Mansfield @ 2003-01-29 22:26 ` Luben Tuikov 2003-01-31 6:57 ` Andrew Morton 1 sibling, 0 replies; 15+ messages in thread From: Luben Tuikov @ 2003-01-29 22:26 UTC (permalink / raw) To: Patrick Mansfield; +Cc: linux-scsi Patrick Mansfield wrote: >>The problem with N is that I do *not* have a heuristic which will tell me >>what a suitable value for N is. How about 5, hmm, what about 10, or maybe >>1e10? > > > We do have bounds on the number of commands. Yes, and it is can_queue and it is per host -- we all know this. I was talking about N, which is a bit of a different matter. > Assuming N is the minimum number of commands we want available: > > Currently (plain 2.5.59) we have N being the number of scsi_devices on > the system, and we always have at least one command available for each > scsi_device. This has two problems: 1. ``currently'' and 2. number of scsi_devices on the system. 1. We should improve, ``currently'' may not be the best policy. (Thus the slab allocator.) 2. If I had to go this route, I'd say: the number of scsi_devices per host; but this number is dynamic as devices come and go. (So we get a hint at a dynamic heuristic.) > It is not clear if N should be the number of scsi commands that might be > outstanding on the system (what it used to be prior to Doug L's queue > depth changes, and what we have in 2.4). This has two problems: 1. ``It is not clear'' and 2. ``outstanding on the system''. 1. I think I did mention this in my previous reply. (Thus N = 1.) 2. This is a pickle and might I mention that Doug's queue depth changes were *for a reason*, so there's no point in saying ``the way it was before''. Furthermore ``the number of outstanding commands'' is ambiguous, and any way you count it, it is NOT enough. I.e. outstanding in the LLDD, outstanding free, etc. -- it just doesn't compute. 
> So N should be at least the number of scsi_devices on the system, and at > most the sum of all the commands that can potentially be used by all > scsi_devices on the system. How did you deduce this if one of your premises starts with ``It is not clear''? This has 2 problems: 1. ``number of scsi_devices on the system'' and 2. ``sum of all commands which can potentially be used by all scsi_devices on the system''. 1. This number is dynamic. N is *per host* and not per SCSI Core, i.e. it would've been more proper to say, ``number of scsi devices per host'', which would also not be sufficient, since this number is also dynamic. Furthermore, if N > 1, and the system is under memory pressure such that the cache allocator fails, it is true that a single device can starve the others of scsi commands, for any 1 < N <= can_queue ! So this doesn't quite do it either, and a per device back-up would be needed. Which would *not* be that helpful since we can get into the same argument.... how many is good enough. 2. In which case we *do not* need the cache allocator at all. I.e. you're saying that N = can_queue - number_cmnds_queued + 1, and I did comment on this in my previous mail and will say it again: in which case we do NOT need the cache allocator at all. (homework: show why) And the whole reason for this is to leave memory available to other subsystems. Hint again: see the flags with which the caches are created. >>Furthermore, N may be a constant or it may be a function of how much >>memory we currently have, how many commands have been queued into the host, >>etc, or it may just be N = can_queue - num_queued_commands + 1, which is >>dynamic, which is pointless (Homework: show why). >> >>We want to waste as little memory as possible (thus 1 per host), since >>SCSI Core is not the only subsystem running on the machine. Hint: >>see the flags the slabs are allocated with. 
Using the Central Limit >>Theorem, I hope that by the time we get low on memory pressure, the scsi >>command cache pool size has settled*. Unless we started with very little >>memory, which would be quite unusual in this day and age. > > > If we must allocate enough space to ensure proper swapout behaviour, it is > not a waste! Are we creating a problem with generalisms like this one? How much is ``enough''? Or how long is the long hair? I think I mentioned this conundrum in my previous email. > IMO we need something similar to the request_queue_t free list, where > we allocate a bunch of items up front out of a slab. Yes, this is a static heuristic. See below for comments on this policy. >>* Lots of assumptions here, but all valid for a *server* machine. >> >>Let's get some experience with this thing running and actually have a >>*natural* failing example, and we can twiddle with the initial value >>of N, and/or can develop a f(N) which would be computed occasionally >>and free_list varied upon scsi_put_command(). > > > Are you saying we will always be able to get a command under low memory > situations? The cache allocator page values should've settled after a few hours of intensive SCSI IO. (Cf. Central Limit Theorem) And those values are always *rounded up* by the lookaside cache, which means that the cache allocator would always have more than can_queue if in that time can_queue was ever reached. It would be *quite a rare* circumstance that the cache allocator would fail (again hint: see the flags with which they are created), and even if it did fail, chances are that other subsystems would free memory, or a scsi command would finish, etc. in which case SCSI Core could go on. It could fail if, e.g., an exorbitant number of scsi devices were plugged into the fabric *while* the system was experiencing memory pressure. Thus the above-mentioned average value would no longer be correct, as the number of devices has changed. 
*BUT* by having N=1, we wouldn't load the system as much as if we had a different initial N, say N=10. So this is tricky: it could be the case that in the exact same circumstances, with N=1 the system could survive but with N=10 we crash it. So the best policy is a dynamic heuristic. But I think that *YOU* should be making this argument, since it looks like you're not happy with the slab allocator being used for scsi commands. > If not, the code should be changed, otherwise please explain how. Looks like you have an agenda. If you have nothing else to do, please go ahead and rip the code apart and do your own thoughtful and smart changes. Go ahead, overengineer as you wish. > We should not wait for a failure to occur to answer the question. Another generalism. If you have a heuristic for N, either *static*, i.e. N = can_queue/10 + 1, or *dynamic*, i.e.: N_0 = can_queue/10 + 1, (no devices yet) N_i = host::num_devices + N_(i-1)/2, i > 0, *please* suggest it, show it, and prove it. (See bottom of text.) Just remember, the whole framework is one such that: 1. SCSI Core MUST NOT just blindly call the slab for more memory -- this *beats the whole purpose* of using the slab and we could just as well use kmalloc(). 2. Adjustment should be done at scsi_put_command(). I.e. we're *holding off* commands from the slab on scsi_put_command(), depending on the heuristic. (Which by definition tells us something about the *future* and this is the whole point.) So that in the future we can be ok, should memory be scarce. I.e. do we need to decrease or increase the commands in free_list, depending on the current value of N_i. In the example I gave above for a dynamic heuristic, it is easily shown that N_i = 2*(host::num_devices), i -> inf. To show this is important, as we don't want to drain the computer's memory. 
The above heuristic would settle for N being 2 times the number of devices, which is good, since if can_queue = 255, we have N_0 = 26, but after a few generations, we don't want to keep so many commands around, and given we have only one device, we settle to N_i = 2, i -> inf. This was just an example, of course. You can model your work similarly. BTW, if you are so ardent on this whole issue I can put code in for the heuristic, and you can supply N_0 and N_i. -- Luben ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-01-29 20:11 ` Patrick Mansfield 2003-01-29 22:26 ` Luben Tuikov @ 2003-01-31 6:57 ` Andrew Morton 2003-01-31 13:46 ` James Bottomley 1 sibling, 1 reply; 15+ messages in thread From: Andrew Morton @ 2003-01-31 6:57 UTC (permalink / raw) To: Patrick Mansfield; +Cc: luben, linux-scsi Patrick Mansfield <patmans@us.ibm.com> wrote: > > IMO we need something similiar to the request_queue_t free list, where > we allocate a bunch of items up front out of a slab. Please do not reinvent the mm/mempool.c functionality. 'twould be better to just use it ;) ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-01-31 6:57 ` Andrew Morton @ 2003-01-31 13:46 ` James Bottomley 2003-01-31 20:44 ` Andrew Morton 0 siblings, 1 reply; 15+ messages in thread From: James Bottomley @ 2003-01-31 13:46 UTC (permalink / raw) To: Andrew Morton; +Cc: Patrick Mansfield, luben, SCSI Mailing List On Fri, 2003-01-31 at 01:57, Andrew Morton wrote: > Please do not reinvent the mm/mempool.c functionality. > > 'twould be better to just use it ;) Unfortunately, in this instance, mempool is a slight overkill. The problem is that we need to guarantee that a command (or set of commands) be available to a given device regardless of what's going on in the rest of the system. Thus we might need a mempool for each active device, rather than a mempool for all devices and a mechanism for giving fine grained control to the pool depth per device. Mempool would fit all of the above, I was just concerned that it looks to be a rather heavy addition (in terms of structure size) per device. James ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-01-31 13:46 ` James Bottomley @ 2003-01-31 20:44 ` Andrew Morton 2003-02-01 2:46 ` Patrick Mansfield 2003-02-03 22:55 ` Doug Ledford 0 siblings, 2 replies; 15+ messages in thread From: Andrew Morton @ 2003-01-31 20:44 UTC (permalink / raw) To: James Bottomley; +Cc: patmans, luben, linux-scsi James Bottomley <James.Bottomley@steeleye.com> wrote: > > On Fri, 2003-01-31 at 01:57, Andrew Morton wrote: > > Please do not reinvent the mm/mempool.c functionality. > > > > 'twould be better to just use it ;) > > Unfortunately, in this instance, mempool is a slight overkill. The > problem is that we need to guarantee that a command (or set of commands) > be available to a given device regardless of what's going on in the rest > of the system. Thus we might need a mempool for each active device, > rather than a mempool for all devices and a mechanism for giving fine > grained control to the pool depth per device. A lot depends on the context of the allocation. Can the caller sleep? Is the caller using GFP_ATOMIC/__GFP_HIGH? (What file-n-line should I be looking at, anyway?) Bear in mind that on the swapout path, the calling process has PF_MEMALLOC set. This is a strong and successful mechanism - it allows the caller to dip into the final page reserves which are denied to even GFP_ATOMIC allocations. There's maybe a megabyte or two there. Could be that there's no problem to be solved here. It depends on whether these allocations are occurring in process context or not. > Mempool would fit all of > the above, I was just concerned that it looks to be a rather heavy > addition (in terms of structure size) per device. It's 40-odd bytes, plus 4*max_reservation bytes. Fairly lean. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-01-31 20:44 ` Andrew Morton @ 2003-02-01 2:46 ` Patrick Mansfield 2003-02-03 22:55 ` Doug Ledford 1 sibling, 0 replies; 15+ messages in thread From: Patrick Mansfield @ 2003-02-01 2:46 UTC (permalink / raw) To: Andrew Morton; +Cc: James Bottomley, luben, linux-scsi On Fri, Jan 31, 2003 at 12:44:12PM -0800, Andrew Morton wrote: > James Bottomley <James.Bottomley@steeleye.com> wrote: > > > > On Fri, 2003-01-31 at 01:57, Andrew Morton wrote: > > > Please do not reinvent the mm/mempool.c functionality. > > > > > > 'twould be better to just use it ;) > > > > Unfortunately, in this instance, mempool is a slight overkill. The > > problem is that we need to guarantee that a command (or set of commands) > > be available to a given device regardless of what's going on in the rest > > of the system. Thus we might need a mempool for each active device, > > rather than a mempool for all devices and a mechanism for giving fine > > grained control to the pool depth per device. > > A lot depends on the context of the allocation. Can the caller sleep? No (generally) - we are in our request function, and are called via soft irq and from anywhere that blk_run_queues is called. Calls outside of the request function can sleep (but they are not used for regular IO). > Is the caller using GFP_ATOMIC/__GFP_HIGH? Yes, normally GFP_ATOMIC. > (What file-n-line should I be looking at, anyway?) If you have bk, bk://linux-scsi.bkbits.net/scsi-combined-2.5, file is drivers/scsi/scsi_lib.c - the calls to scsi_getset_command. That tree also has other scsi changes, as well as changes related to the new allocation scheme. In 2.5.59 this was a call to scsi_allocate_device. The current 2.5 allocation algorithm used by scsi_allocate_device is poor. In the new code, scsi_allocate_device is replaced by a call to scsi_getset_command. scsi_getset_command effectively calls kmem_cache_alloc(some_cache, flags). 
My complaint that started the thread is that (generally) we don't have fairness across devices (scsi_device) on kmem_cache_alloc failure - a failure can put us in a timeout/poll-like mode, but anyone with an IO already in flight can allocate (from kmem_cache_alloc or a single scsi_cmnd saved per host adapter) and issue another IO before the device that had a kmem_cache_alloc failure. So, a swap disk could potentially be starved from issuing IO during low memory conditions. But, there is the extra scsi_cmnd per host adapter (not per device), and it is always refilled (if empty) when a scsi_cmnd completes for that adapter; this, combined with filling of the cache via continued IO use, might prevent most (and hopefully all) such failures. -- Patrick Mansfield ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-01-31 20:44 ` Andrew Morton 2003-02-01 2:46 ` Patrick Mansfield @ 2003-02-03 22:55 ` Doug Ledford 2003-02-03 22:59 ` Andrew Morton ` (2 more replies) 1 sibling, 3 replies; 15+ messages in thread From: Doug Ledford @ 2003-02-03 22:55 UTC (permalink / raw) To: Andrew Morton; +Cc: James Bottomley, patmans, luben, linux-scsi On Fri, Jan 31, 2003 at 12:44:12PM -0800, Andrew Morton wrote: > Bear in mind that on the swapout path, the calling process has PF_MEMALLOC > set. This is a strong and successful mechanism - it allows the caller to dip > into the final page reserves which are denied to even GFP_ATOMIC allocations. > There's maybe a megabyte or two there. > > Could be that there's no problem to be solved here. It depends on whether > these allocations are occurring in process context or not. I think the case is that there is no problem to be solved. One command per host is enough to keep each host running, and that's enough to keep the system running. If we are ever low enough on mem that we get down to failing scsi command allocations, the system is already hurting. The complaint was that a device doing something other than swap could starve a swap device. I don't buy that. If the device is doing constant reads then it's going to run out of mem eventually and block just like our allocations are, if it's writing then it very likely is freeing up just as many pages as the swap operation would be. In short, I think if we keep the disk subsystem running, even if crippled with just one command, the problem becomes self correcting and there isn't much for us to solve. Of course, that's just my 5 minute analysis, someone feel free to prove me wrong. -- Doug Ledford <dledford@redhat.com> 919-754-3700 x44233 Red Hat, Inc. 1801 Varsity Dr. Raleigh, NC 27606 ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-02-03 22:55 ` Doug Ledford @ 2003-02-03 22:59 ` Andrew Morton 2003-02-03 23:05 ` James Bottomley 2003-02-04 6:15 ` Andre Hedrick 2 siblings, 0 replies; 15+ messages in thread From: Andrew Morton @ 2003-02-03 22:59 UTC (permalink / raw) To: Doug Ledford; +Cc: James.Bottomley, patmans, luben, linux-scsi Doug Ledford <dledford@redhat.com> wrote: > > Of course, that's just my 5 minute analysis, someone feel free to prove me > wrong. I'd agree with that. Plus there's the PF_MEMALLOC thing which gives swapper-outers an extra megabyte or two. If _any_ of these allocations are happening in process context then they will benefit from this. If all the allocations are happening at interrupt/softirq time then some changes might be needed. But I haven't been able to break it, using aic7xxx, with quite harsh testing. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-02-03 22:55 ` Doug Ledford 2003-02-03 22:59 ` Andrew Morton @ 2003-02-03 23:05 ` James Bottomley 2003-02-03 23:19 ` Andrew Morton 2003-02-04 6:15 ` Andre Hedrick 2 siblings, 1 reply; 15+ messages in thread From: James Bottomley @ 2003-02-03 23:05 UTC (permalink / raw) To: Doug Ledford; +Cc: Andrew Morton, patmans, luben, SCSI Mailing List On Mon, 2003-02-03 at 17:55, Doug Ledford wrote: > I think the case is that there is no problem to be solved. One command > per host is enough to keep each host running, and that's enough to keep > the system running. If we are ever low enough on mem that we get down to > failing scsi command allocations, the system is already hurting. The > complaint was that a device doing something other than swap could starve a > swap device. I don't buy that. If the device is doing constant reads > then it's going to run out of mem eventually and block just like our > allocations are, if it's writing then it very likely is freeing up just as > many pages as the swap operation would be. In short, I think if we keep > the disk subsystem running, even if crippled with just one command, the > problem becomes self correcting and there isn't much for us to solve. Of > course, that's just my 5 minute analysis, someone feel free to prove me > wrong. I agree with the analysis: The system can make forward progress as long as we have only one guaranteed command. However, I do worry about the performance under memory pressure. I don't think only having one command pre-allocated per HBA is sufficient to ensure efficient swap-out behaviour under load. The question, of course, is what do we need to do to make it more efficient? Andrew, these patches are now in Linus' BK, so if you want to take a look and see how our loaded behaviour is, I'd be grateful. James ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-02-03 23:05 ` James Bottomley @ 2003-02-03 23:19 ` Andrew Morton 2003-02-04 18:04 ` Luben Tuikov 0 siblings, 1 reply; 15+ messages in thread From: Andrew Morton @ 2003-02-03 23:19 UTC (permalink / raw) To: James Bottomley; +Cc: dledford, patmans, luben, linux-scsi James Bottomley <James.Bottomley@steeleye.com> wrote: > > Andrew, these patches are now in Linus' BK, so if you want to take a > look and see how our loaded behaviour is, I'd be grateful. Looks pretty straightforward. Most of the allocations are GFP_KERNEL, which is a good sign. One could have designed it to support a pool of >1 command from the outset, but it's unlikely to be necessary. (linux-scsi@vger doesn't send messages to or from oneself. This upsets one's filing system. Does this irritate otherselves as much as this self?) ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-02-03 23:19 ` Andrew Morton @ 2003-02-04 18:04 ` Luben Tuikov 0 siblings, 0 replies; 15+ messages in thread From: Luben Tuikov @ 2003-02-04 18:04 UTC (permalink / raw) To: Andrew Morton; +Cc: James Bottomley, dledford, patmans, linux-scsi Andrew Morton wrote: > > One could have designed it to support a pool of >1 command from the outset, > but it's unlikely to be necessary. Yes, I also think that it's not likely to be necessary. The functionality of support for more than one is there (i.e. in the freeing-list code). As I mentioned in this thread before, I had *no* idea (and still have none) on _what_ number to settle if greater than one. OTOH, if the powers that be decide that more than one is nevertheless necessary, a mempool, I think, would be quite appropriate. -- Luben ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-02-03 22:55 ` Doug Ledford 2003-02-03 22:59 ` Andrew Morton 2003-02-03 23:05 ` James Bottomley @ 2003-02-04 6:15 ` Andre Hedrick 2 siblings, 0 replies; 15+ messages in thread From: Andre Hedrick @ 2003-02-04 6:15 UTC (permalink / raw) To: Doug Ledford; +Cc: Andrew Morton, James Bottomley, patmans, luben, linux-scsi Doug, I had argued some time ago for reserved and priority allocation for swap under block period. Regardless if this is scsi/ata/sas/sata the issue is fundamental. I spent a fair amount of time debating memory pressure against swap in combination with saturated device request queues, with Rik Riel. This is one layer above the LLDD and regardless if you reserve 1,2,N-1,N command slots in the queuedcommand list. If the request can not be obtained from block because all request slots are stuffed full, you have to deploy an out-of-bounds operation. Progress will not happen period. Elevator sorting on swap is silly! If it is still done, you can not get there from here. Any swap-io run via the elevator must be inserted in the front of the request queue, period. This will provide you N+1 jump without fracturing all the commands in process or in flight. Since it would be priority, one can stuff it with the SPECIAL marker and boost it to the head of the queuedcommand list. Maybe I am on crack, but so is the design deployed to date. (from top->bottom and not bottom->up) Cheers, Andre Hedrick LAD Storage Consulting Group On Mon, 3 Feb 2003, Doug Ledford wrote: > On Fri, Jan 31, 2003 at 12:44:12PM -0800, Andrew Morton wrote: > > Bear in mind that on the swapout path, the calling process has PF_MEMALLOC > > set. This is a strong and successful mechanism - it allows the caller to dip > > into the final page reserves which are denied to even GFP_ATOMIC allocations. > > There's maybe a megabyte or two there. > > > > Could be that there's no problem to be solved here. 
It depends on whether > > these allocations are occurring in process context or not. > > I think the case is that there is no problem to be solved. One command > per host is enough to keep each host running, and that's enough to keep > the system running. If we are ever low enough on mem that we get down to > failing scsi command allocations, the system is already hurting. The > complaint was that a device doing something other than swap could starve a > swap device. I don't buy that. If the device is doing constant reads > then it's going to run out of mem eventually and block just like our > allocations are, if it's writing then it very likely is freeing up just as > many pages as the swap operation would be. In short, I think if we keep > the disk subsystem running, even if crippled with just one command, the > problem becomes self correcting and there isn't much for us to solve. Of > course, that's just my 5 minute analysis, someone feel free to prove me > wrong. > > -- > Doug Ledford <dledford@redhat.com> 919-754-3700 x44233 > Red Hat, Inc. > 1801 Varsity Dr. > Raleigh, NC 27606 > > - > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure
  2003-01-29 18:47 scsi command slab allocation under memory pressure Patrick Mansfield
  2003-01-29 19:40 ` Luben Tuikov
@ 2003-01-29 22:53 ` James Bottomley
  1 sibling, 0 replies; 15+ messages in thread

From: James Bottomley @ 2003-01-29 22:53 UTC (permalink / raw)
To: Patrick Mansfield; +Cc: SCSI Mailing List

On Wed, 2003-01-29 at 13:47, Patrick Mansfield wrote:
> The linux-scsi.bkbits.net scsi-kmem_alloc-2.5 and scsi-combined-2.5
> trees include the scsi command slab allocation (Luben's patch).
>
> How does the use of a single slab for all hosts and all devices allow
> for IO while under memory pressure?

In essence, all we really need to guarantee under memory pressure is
that I/O being used to clear memory (i.e. for the swap device) will
eventually proceed. This is the weakest assumption necessary for the
system to make forward progress. Having a single command per device
(or even just a single available command) guarantees this, since if
that command is outstanding, it will eventually return and be re-used
for clearing memory, which is all that is required.

With a single command per host, there is a starvation issue if you
have heavy I/O to a device whose controller also contains the swap
device. However, the conditions would have to be fairly pathological
for heavy I/O to continue under memory pressure while starving the
swap device.

> There is one extra scsi command pre-allocated per host, but don't we
> require at least one (and ideally maybe more) per device? The
> pre-slab (current mainline kernel) command allocation always had at
> least one command per device available, and usually more (because we
> allocated more commands during the scan and upper level init).

Now we get into tuning: even if the system is making forward progress,
it might be doing so erratically, so how best do we ensure that the
memory-clearing I/O proceeds?
> That is - if we have swap on a separate disk and our command pool is
> small enough, IO to another disk could use the single per-host
> command under memory pressure, and we can fail to get a scsi command
> in order to write to the swap disk.

Right: a single command reserved per swap device would be sufficient
to assure a steady stream of memory-clearing I/O, which is probably
sufficient for most purposes.

> scsi_put_command() re-fills the host->free_list if it is empty, but
> under high (or higher) IO loads, the disk/device that generated the
> scsi_put_command will immediately issue a scsi_get_command for the
> same device.

That's true, but again, it's a system tuning issue. The optimal thing
to do for SCSI is to issue a new command for a device that just
returned one, because we know it has all the resources to hand. What
we do under memory pressure needs to be separated from what we do
ordinarily.

> If all command allocations are failing for a particular device (i.e.
> swap), we will wait a bit (device_blocked and device_busy == 0) and
> try again; we will not retry based on a scsi_put_command(). Even if
> we did retry based on a scsi_put_command, we could race with the
> scsi_put_command caller.

This is theoretically possible, but unlikely: all of the allocated
commands must eventually return. I can't think of any non-pathological
load scenario where the command queues can be loaded up so completely
from userland as to cause total starvation of the swap devices... but
doubtless somebody will come up with one.

James
end of thread, other threads:[~2003-02-04 18:04 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-01-29 18:47 scsi command slab allocation under memory pressure Patrick Mansfield
2003-01-29 19:40 ` Luben Tuikov
2003-01-29 20:11   ` Patrick Mansfield
2003-01-29 22:26     ` Luben Tuikov
2003-01-31  6:57       ` Andrew Morton
2003-01-31 13:46         ` James Bottomley
2003-01-31 20:44           ` Andrew Morton
2003-02-01  2:46             ` Patrick Mansfield
2003-02-03 22:55             ` Doug Ledford
2003-02-03 22:59               ` Andrew Morton
2003-02-03 23:05               ` James Bottomley
2003-02-03 23:19                 ` Andrew Morton
2003-02-04 18:04                   ` Luben Tuikov
2003-02-04  6:15             ` Andre Hedrick
2003-01-29 22:53 ` James Bottomley