* scsi command slab allocation under memory pressure @ 2003-01-29 18:47 Patrick Mansfield 2003-01-29 19:40 ` Luben Tuikov 2003-01-29 22:53 ` James Bottomley 0 siblings, 2 replies; 15+ messages in thread From: Patrick Mansfield @ 2003-01-29 18:47 UTC (permalink / raw) To: linux-scsi James had a similar comment about this (free_list storing multiple commands). The linux-scsi.bkbits.net scsi-kmem_alloc-2.5 and scsi-combined-2.5 trees include the scsi command slab allocation (Luben's patch). How does the use of a single slab for all hosts and all devices allow for IO while under memory pressure? There is one extra scsi command pre-allocated per host, but don't we require at least one (and ideally maybe more) per device? The pre-slab (current mainline kernel) command allocation always had at least one command per device available, and usually more (because we allocated more commands during the scan and upper level init). That is - if we have swap on a separate disk and our command pool is small enough, IO to another disk could use the single per-host command under memory pressure, and we can fail to get a scsi command in order to write to the swap disk. scsi_put_command() re-fills the host->free_list if it is empty, but under high (or higher) IO loads, the disk/device that generated the scsi_put_command will immediately issue a scsi_get_command for the same device. If all command allocations are failing for a particular device (i.e. swap), we will wait a bit (device_blocked and device_busy == 0) and try again; we will not retry based on a scsi_put_command(). Even if we did retry based on a scsi_put_command, we can still race with the scsi_put_command caller. -- Patrick Mansfield ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-01-29 18:47 scsi command slab allocation under memory pressure Patrick Mansfield @ 2003-01-29 19:40 ` Luben Tuikov 2003-01-29 20:11 ` Patrick Mansfield 2003-01-29 22:53 ` James Bottomley 1 sibling, 1 reply; 15+ messages in thread From: Luben Tuikov @ 2003-01-29 19:40 UTC (permalink / raw) To: Patrick Mansfield; +Cc: linux-scsi Patrick Mansfield wrote: > James had a similar comment about this (free_list storing multiple > commands). > > The linux-scsi.bkbits.net scsi-kmem_alloc-2.5 and scsi-combined-2.5 trees > include the scsi command slab allocation (Luben's patch). > > How does the use of a single slab for all hosts and all devices allow for > IO while under memory pressure? > > There is one extra scsi command pre-allocated per host, but don't we > require at least one (and ideally maybe more) per device? The pre-slab > (current mainline kernel) command allocation always had at least one > command per device available, and usually more (because we allocated more > commands during the scan and upper level init). > > That is - if we have swap on a separate disk and our command pool is small > enough, IO to another disk could use the single per-host command under > memory pressure, and we can fail to get a scsi command in order to write > to the swap disk. > > scsi_put_command() re-fills the host->free_list if it is empty, but under > high (or higher) IO loads, the disk/device that generated the > scsi_put_command will immediately issue a scsi_get_command for the same > device. > > If all command allocations are failing for a particular device (i.e. > swap), we will wait a bit (device_blocked and device_busy == 0) and try > again; we will not retry based on a scsi_put_command(). Even if we did > retry based on a scsi_put_command, we can still race with the > scsi_put_command caller. Is this a question, narrative, comment or flame? I'll try to answer this anyway.
James had this comment, just because I put the mechanism there to allow for more than one command to be in the store of backup command structs. See my reply to his email in the archives. The reason for populating free_list with just one command on host init is quite obvious, but to elaborate, the choices are 1 and N, where N is a natural number greater than 1. The problem with N is that I do *not* have a heuristic which will tell me what a suitable value for N is. How about 5, hmm, what about 10, or maybe 1e10? Furthermore, N may be a constant or it may be a function of how much memory we currently have, how many commands have been queued into the host, etc, or it may just be N = can_queue - num_queued_commands + 1, which is dynamic, which is pointless (Homework: show why). We want to waste as little memory as possible (thus 1 per host), since SCSI Core is not the only subsystem running on the machine. Hint: see the flags the slabs are allocated with. Using the Central Limit Theorem, I hope that by the time we come under memory pressure, the scsi command cache pool size has settled*. Unless we started with very little memory, which would be quite unusual in this day and age. * Lots of assumptions here, but all valid for a *server* machine. Let's get some experience with this thing running and actually have a *natural* failing example, and we can twiddle with the initial value of N, and/or can develop a f(N) which would be computed occasionally and free_list varied upon scsi_put_command(). -- Luben P.S. In my own mini-scsi-core I haven't had any problems with this issue. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-01-29 19:40 ` Luben Tuikov @ 2003-01-29 20:11 ` Patrick Mansfield 2003-01-29 22:26 ` Luben Tuikov 2003-01-31 6:57 ` Andrew Morton 0 siblings, 2 replies; 15+ messages in thread From: Patrick Mansfield @ 2003-01-29 20:11 UTC (permalink / raw) To: Luben Tuikov; +Cc: linux-scsi On Wed, Jan 29, 2003 at 02:40:28PM -0500, Luben Tuikov wrote: > Patrick Mansfield wrote: > > Is this a question, narrative, comment or flame? I'll try to answer > this anyway. Yes, except for the flame part :) > James had this comment, just because I put the mechanism there to allow > for more than one command to be in the store of backup command structs. > See my reply to his email in the archives. > > The reason for populating free_list with just one command on host init, > is quite obvious, but to elaborate, the choices are 1 and N, where N is > a natural number greater than 1. > > The problem with N is that I do *not* have a heuristic which will tell me > what a suitable value for N is. How about 5, hmm, what about 10, or maybe > 1e10? We do have bounds on the number of commands. Assuming N is the minimum number of commands we want available: Currently (plain 2.5.59) we have N being the number of scsi_devices on the system, and we always have at least one command available for each scsi_device. It is not clear if N should be the number of scsi commands that might be outstanding on the system (what it used to be prior to Doug L's queue depth changes, and what we have in 2.4). So N should be at least the number of scsi_devices on the system, and at most the sum of all the commands that can potentially be used by all scsi_devices on the system. > Furthermore, N may be a constant or it may be a function of how much > memory we currently have, how many commands have been queued into the host, > etc, or it may just be N = can_queue - num_queued_commands + 1, which is > dynamic, which is pointless (Homework: show why). 
> > We want to waste as little memory as possible (thus 1 per host), since > SCSI Core is not the only subsystem running on the machine. Hint: > see the flags the slabs are allocated with. Using the Central Limit > Theorem, I hope that by the time we get low on memory pressure, the scsi > command cache pool size has settled*. Unless we started with very little > memory, which would be quite unusual in this day and age. If we must allocate enough space to ensure proper swapout behaviour, it is not a waste! IMO we need something similar to the request_queue_t free list, where we allocate a bunch of items up front out of a slab. > * Lots of assumptions here, but all valid for a *server* machine. > > Let's get some experience with this thing running and actually have a > *natural* failing example, and we can twiddle with the initial value > of N, and/or can develop a f(N) which would be computed occasionally > and free_list varied upon scsi_put_command(). Are you saying we will always be able to get a command under low memory situations? If not, the code should be changed, otherwise please explain how. We should not wait for a failure to occur to answer the question. -- Patrick Mansfield ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-01-29 20:11 ` Patrick Mansfield @ 2003-01-29 22:26 ` Luben Tuikov 2003-01-31 6:57 ` Andrew Morton 1 sibling, 0 replies; 15+ messages in thread From: Luben Tuikov @ 2003-01-29 22:26 UTC (permalink / raw) To: Patrick Mansfield; +Cc: linux-scsi Patrick Mansfield wrote: >>The problem with N is that I do *not* have a heuristic which will tell me >>what a suitable value for N is. How about 5, hmm, what about 10, or maybe >>1e10? > > > We do have bounds on the number of commands. Yes, and it is can_queue and it is per host -- we all know this. I was talking about N, which is a bit of a different matter. > Assuming N is the minimum number of commands we want available: > > Currently (plain 2.5.59) we have N being the number of scsi_devices on > the system, and we always have at least one command available for each > scsi_device. This has two problems: 1. ``currently'' and 2. number of scsi_devices on the system. 1. We should improve, ``currently'' may not be the best policy. (Thus the slab allocator.) 2. If I had to go this route, I'd say: the number of scsi_devices per host; but this number is dynamic as devices come and go. (So we get a hint at a dynamic heuristic.) > It is not clear if N should be the number of scsi commands that might be > outstanding on the system (what it used to be prior to Doug L's queue > depth changes, and what we have in 2.4). This has two problems: 1. ``It is not clear'' and 2. ``outstanding on the system''. 1. I think I did mention this in my previous reply. (Thus N = 1.) 2. This is a pickle and might I mention that Doug's queue depth changes were *for a reason*, so there's no point in saying ``the way it was before''. Furthermore ``the number of outstanding commands'' is ambiguous, and any way you count it, it is NOT enough. I.e. outstanding in the LLDD, outstanding free, etc. -- it just doesn't compute. 
> So N should be at least the number of scsi_devices on the system, and at > most the sum of all the commands that can potentially be used by all > scsi_devices on the system. How did you deduce this if one of your premises starts with ``It is not clear''? This has 2 problems: 1. ``number of scsi_devices on the system'' and 2. ``sum of all commands which can potentially be used by all scsi_devices on the system''. 1. This number is dynamic. N is *per host* and not per SCSI Core, i.e. it would've been more proper to say, ``number of scsi devices per host'', which would also not be sufficient, since this number is also dynamic. Furthermore, if N > 1, and the system is under memory pressure such that the cache allocator fails, it is true that a single device can starve the others of scsi commands, for any 1 < N <= can_queue ! So this doesn't quite do it either, and a per device back-up would be needed. Which would *not* be that helpful since we can get into the same argument.... how many is good enough. 2. In which case we *do not* need the cache allocator at all. I.e. you're saying that N = can_queue - number_cmnds_queued + 1, and I did comment on this in my previous mail and will say it again: in which case we do NOT need the cache allocator at all. (homework: show why) And the whole reason for this is to leave memory available to other subsystems. Hint again: see the flags with which the caches are created. >>Furthermore, N may be a constant or it may be a function of how much >>memory we currently have, how many commands have been queued into the host, >>etc, or it may just be N = can_queue - num_queued_commands + 1, which is >>dynamic, which is pointless (Homework: show why). >> >>We want to waste as little memory as possible (thus 1 per host), since >>SCSI Core is not the only subsystem running on the machine. Hint: >>see the flags the slabs are allocated with. 
Using the Central Limit >>Theorem, I hope that by the time we get low on memory pressure, the scsi >>command cache pool size has settled*. Unless we started with very little >>memory, which would be quite unusual in this day and age. > > > If we must allocate enough space to ensure proper swapout behaviour, it is > not a waste! Are we creating a problem with generalisms like this one? How much is ``enough''? Or how long is the long hair? I think I mentioned this conundrum in my previous email. > IMO we need something similar to the request_queue_t free list, where > we allocate a bunch of items up front out of a slab. Yes, this is a static heuristic. See below for comments on this policy. >>* Lots of assumptions here, but all valid for a *server* machine. >> >>Let's get some experience with this thing running and actually have a >>*natural* failing example, and we can twiddle with the initial value >>of N, and/or can develop a f(N) which would be computed occasionally >>and free_list varied upon scsi_put_command(). > > > Are you saying we will always be able to get a command under low memory > situations? The cache allocator page values should've settled after a few hours of intensive SCSI IO. (Cf. Central Limit Theorem) And those values are always *rounded up* by the lookaside cache, which means that the cache allocator would always have more than can_queue if in that time can_queue was ever reached. It would be *quite a rare* circumstance that the cache allocator would fail (again hint: see the flags with which they are created), and even if it did fail, chances are that other subsystems would free memory, or a scsi command would finish, etc. in which case SCSI Core could go on. It could fail if, e.g., an exorbitant number of scsi devices were plugged into the fabric *while* the system was experiencing memory pressure. Thus the above-mentioned average value would no longer be correct, as the number of devices has changed. 
*BUT* by having N=1, we wouldn't load the system as much as if we had a different initial N, say N=10. So this is tricky: it could be the case that in the exact same circumstances, with N=1 the system could survive but with N=10 we crash it. So the best policy is a dynamic heuristic. But I think that *YOU* should be making this argument, since it looks like you're not happy with the slab allocator being used for scsi commands. > If not, the code should be changed, otherwise please explain how. Looks like you have an agenda. If you have nothing else to do, please go ahead and rip the code apart and do your own thoughtful and smart changes. Go ahead, overengineer as you wish. > We should not wait for a failure to occur to answer the question. Another generalism. If you have a heuristic for N, either *static*, i.e. N = can_queue/10 + 1, or *dynamic*, i.e.: N_0 = can_queue/10 + 1, (no devices yet) N_i = host::num_devices + N_(i-1)/2, i > 0, *please* suggest it, show it, and prove it. (See bottom of text.) Just remember, the whole framework is one such that: 1. SCSI Core MUST NOT just blindly call the slab for more memory -- this *beats the whole purpose* of using the slab and we could just as well use kmalloc(). 2. Adjustment should be done at scsi_put_command(). I.e. we're *holding off* commands from the slab on scsi_put_command(), depending on the heuristic. (Which by definition tells us something about the *future* and this is the whole point.) So that in the future we can be ok, should memory be scarce. I.e. do we need to decrease or increase the commands in free_list, depending on the current value of N_i. In the example I gave above for a dynamic heuristic, it is easily shown that N_i = 2*(host::num_devices), i -> inf. To show this is important, as we don't want to drain the computer's memory. 
The above heuristic would settle for N being 2 times the number of devices, which is good, since if can_queue = 255, we have N_0 = 26, but after a few generations, we don't want to keep so many commands around, and given we have only one device, we settle to N_i = 2, i -> inf. This was just an example, of course. You can model your work similarly. BTW, if you are so ardent on this whole issue I can put code in for the heuristic, and you can supply N_0 and N_i. -- Luben ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-01-29 20:11 ` Patrick Mansfield 2003-01-29 22:26 ` Luben Tuikov @ 2003-01-31 6:57 ` Andrew Morton 2003-01-31 13:46 ` James Bottomley 1 sibling, 1 reply; 15+ messages in thread From: Andrew Morton @ 2003-01-31 6:57 UTC (permalink / raw) To: Patrick Mansfield; +Cc: luben, linux-scsi Patrick Mansfield <patmans@us.ibm.com> wrote: > > IMO we need something similiar to the request_queue_t free list, where > we allocate a bunch of items up front out of a slab. Please do not reinvent the mm/mempool.c functionality. 'twould be better to just use it ;) ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-01-31 6:57 ` Andrew Morton @ 2003-01-31 13:46 ` James Bottomley 2003-01-31 20:44 ` Andrew Morton 0 siblings, 1 reply; 15+ messages in thread From: James Bottomley @ 2003-01-31 13:46 UTC (permalink / raw) To: Andrew Morton; +Cc: Patrick Mansfield, luben, SCSI Mailing List On Fri, 2003-01-31 at 01:57, Andrew Morton wrote: > Please do not reinvent the mm/mempool.c functionality. > > 'twould be better to just use it ;) Unfortunately, in this instance, mempool is a slight overkill. The problem is that we need to guarantee that a command (or set of commands) be available to a given device regardless of what's going on in the rest of the system. Thus we might need a mempool for each active device, rather than a mempool for all devices and a mechanism for giving fine grained control to the pool depth per device. Mempool would fit all of the above, I was just concerned that it looks to be a rather heavy addition (in terms of structure size) per device. James ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-01-31 13:46 ` James Bottomley @ 2003-01-31 20:44 ` Andrew Morton 2003-02-01 2:46 ` Patrick Mansfield 2003-02-03 22:55 ` Doug Ledford 0 siblings, 2 replies; 15+ messages in thread From: Andrew Morton @ 2003-01-31 20:44 UTC (permalink / raw) To: James Bottomley; +Cc: patmans, luben, linux-scsi James Bottomley <James.Bottomley@steeleye.com> wrote: > > On Fri, 2003-01-31 at 01:57, Andrew Morton wrote: > > Please do not reinvent the mm/mempool.c functionality. > > > > 'twould be better to just use it ;) > > Unfortunately, in this instance, mempool is a slight overkill. The > problem is that we need to guarantee that a command (or set of commands) > be available to a given device regardless of what's going on in the rest > of the system. Thus we might need a mempool for each active device, > rather than a mempool for all devices and a mechanism for giving fine > grained control to the pool depth per device. A lot depends on the context of the allocation. Can the caller sleep? Is the caller using GFP_ATOMIC/__GFP_HIGH? (What file-n-line should I be looking at, anyway?) Bear in mind that on the swapout path, the calling process has PF_MEMALLOC set. This is a strong and successful mechanism - it allows the caller to dip into the final page reserves which are denied to even GFP_ATOMIC allocations. There's maybe a megabyte or two there. Could be that there's no problem to be solved here. It depends on whether these allocations are occurring in process context or not. > Mempool would fit all of > the above, I was just concerned that it looks to be a rather heavy > addition (in terms of structure size) per device. It's 40-odd bytes, plus 4*max_reservation bytes. Fairly lean. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-01-31 20:44 ` Andrew Morton @ 2003-02-01 2:46 ` Patrick Mansfield 2003-02-03 22:55 ` Doug Ledford 1 sibling, 0 replies; 15+ messages in thread From: Patrick Mansfield @ 2003-02-01 2:46 UTC (permalink / raw) To: Andrew Morton; +Cc: James Bottomley, luben, linux-scsi On Fri, Jan 31, 2003 at 12:44:12PM -0800, Andrew Morton wrote: > James Bottomley <James.Bottomley@steeleye.com> wrote: > > > > On Fri, 2003-01-31 at 01:57, Andrew Morton wrote: > > > Please do not reinvent the mm/mempool.c functionality. > > > > > > 'twould be better to just use it ;) > > > > Unfortunately, in this instance, mempool is a slight overkill. The > > problem is that we need to guarantee that a command (or set of commands) > > be available to a given device regardless of what's going on in the rest > > of the system. Thus we might need a mempool for each active device, > > rather than a mempool for all devices and a mechanism for giving fine > > grained control to the pool depth per device. > > A lot depends on the context of the allocation. Can the caller sleep? No (generally) - we are in our request function, and are called via soft irq and from anywhere that blk_run_queues is called. Calls outside of the request function can sleep (but they are not used for regular IO). > Is the caller using GFP_ATOMIC/__GFP_HIGH? Yes, normally GFP_ATOMIC. > (What file-n-line should I be looking at, anyway?) If you have bk, bk://linux-scsi.bkbits.net/scsi-combined-2.5, file is drivers/scsi/scsi_lib.c - the calls to scsi_getset_command. That tree also has other scsi changes, as well as changes related to the new allocation scheme. In 2.5.59 this was a call to scsi_allocate_device. The current 2.5 allocation algorithm used by scsi_allocate_device is poor. In the new code, scsi_allocate_device is replaced by a call to scsi_getset_command. scsi_getset_command effectively calls kmem_cache_alloc(some_cache, flags). 
My complaint that started the thread is that (generally) we don't have fairness across devices (scsi_device) on kmem_cache_alloc failure - a failure can put us in a timeout/poll-like mode, but anyone with an IO already in flight can allocate (from kmem_cache_alloc or a single scsi_cmnd saved per host adapter) and issue another IO before the device that had a kmem_cache_alloc failure. So, a swap disk could potentially be starved from issuing IO during low memory conditions. But, there is the extra scsi_cmnd per host adapter (not per device), and it is always refilled (if empty) when a scsi_cmnd completes for that adapter; this, combined with filling of the cache via continued IO use, might prevent most (and hopefully all) such failures. -- Patrick Mansfield ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-01-31 20:44 ` Andrew Morton 2003-02-01 2:46 ` Patrick Mansfield @ 2003-02-03 22:55 ` Doug Ledford 2003-02-03 22:59 ` Andrew Morton ` (2 more replies) 1 sibling, 3 replies; 15+ messages in thread From: Doug Ledford @ 2003-02-03 22:55 UTC (permalink / raw) To: Andrew Morton; +Cc: James Bottomley, patmans, luben, linux-scsi On Fri, Jan 31, 2003 at 12:44:12PM -0800, Andrew Morton wrote: > Bear in mind that on the swapout path, the calling process has PF_MEMALLOC > set. This is a strong and successful mechanism - it allows the caller to dip > into the final page reserves which are denied to even GFP_ATOMIC allocations. > There's maybe a megabyte or two there. > > Could be that there's no problem to be solved here. It depends on whether > these allocations are occurring in process context or not. I think the case is that there is no problem to be solved. One command per host is enough to keep each host running, and that's enough to keep the system running. If we are ever low enough on mem that we get down to failing scsi command allocations, the system is already hurting. The complaint was that a device doing something other than swap could starve a swap device. I don't buy that. If the device is doing constant reads then it's going to run out of mem eventually and block just like our allocations are, if it's writing then it very likely is freeing up just as many pages as the swap operation would be. In short, I think if we keep the disk subsystem running, even if crippled with just one command, the problem becomes self correcting and there isn't much for us to solve. Of course, that's just my 5 minute analysis, someone feel free to prove me wrong. -- Doug Ledford <dledford@redhat.com> 919-754-3700 x44233 Red Hat, Inc. 1801 Varsity Dr. Raleigh, NC 27606 ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-02-03 22:55 ` Doug Ledford @ 2003-02-03 22:59 ` Andrew Morton 2003-02-03 23:05 ` James Bottomley 2003-02-04 6:15 ` Andre Hedrick 2 siblings, 0 replies; 15+ messages in thread From: Andrew Morton @ 2003-02-03 22:59 UTC (permalink / raw) To: Doug Ledford; +Cc: James.Bottomley, patmans, luben, linux-scsi Doug Ledford <dledford@redhat.com> wrote: > > Of course, that's just my 5 minute analysis, someone feel free to prove me > wrong. I'd agree with that. Plus there's the PF_MEMALLOC thing which gives swapper-outers an extra megabyte or two. If _any_ of these allocations are happening in process context then they will benefit from this. If all the allocations are happening at interrupt/softirq time then some changes might be needed. But I haven't been able to break it, using aic7xxx, with quite harsh testing. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-02-03 22:55 ` Doug Ledford 2003-02-03 22:59 ` Andrew Morton @ 2003-02-03 23:05 ` James Bottomley 2003-02-03 23:19 ` Andrew Morton 2003-02-04 6:15 ` Andre Hedrick 2 siblings, 1 reply; 15+ messages in thread From: James Bottomley @ 2003-02-03 23:05 UTC (permalink / raw) To: Doug Ledford; +Cc: Andrew Morton, patmans, luben, SCSI Mailing List On Mon, 2003-02-03 at 17:55, Doug Ledford wrote: > I think the case is that there is no problem to be solved. One command > per host is enough to keep each host running, and that's enough to keep > the system running. If we are ever low enough on mem that we get down to > failing scsi command allocations, the system is already hurting. The > complaint was that a device doing something other than swap could starve a > swap device. I don't buy that. If the device is doing constant reads > then it's going to run out of mem eventually and block just like our > allocations are, if it's writing then it very likely is freeing up just as > many pages as the swap operation would be. In short, I think if we keep > the disk subsystem running, even if crippled with just one command, the > problem becomes self correcting and there isn't much for us to solve. Of > course, that's just my 5 minute analysis, someone feel free to prove me > wrong. I agree with the analysis: The system can make forward progress as long as we have only one guaranteed command. However, I do worry about the performance under memory pressure. I don't think only having one command pre-allocated per HBA is sufficient to ensure efficient swap-out behaviour under load. The question, of course, is what do we need to do to make it more efficient? Andrew, these patches are now in Linus' BK, so if you want to take a look and see how our loaded behaviour is, I'd be grateful. James ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-02-03 23:05 ` James Bottomley @ 2003-02-03 23:19 ` Andrew Morton 2003-02-04 18:04 ` Luben Tuikov 0 siblings, 1 reply; 15+ messages in thread From: Andrew Morton @ 2003-02-03 23:19 UTC (permalink / raw) To: James Bottomley; +Cc: dledford, patmans, luben, linux-scsi James Bottomley <James.Bottomley@steeleye.com> wrote: > > Andrew, these patches are now in Linus' BK, so if you want to take a > look and see how our loaded behaviour is, I'd be grateful. Looks pretty straightforward. Most of the allocations are GFP_KERNEL, which is a good sign. One could have designed it to support a pool of >1 command from the outset, but it's unlikely to be necessary. (linux-scsi@vger doesn't send messages to or from oneself. This upsets one's filing system. Does this irritate otherselves as much as this self?) ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-02-03 23:19 ` Andrew Morton @ 2003-02-04 18:04 ` Luben Tuikov 0 siblings, 0 replies; 15+ messages in thread From: Luben Tuikov @ 2003-02-04 18:04 UTC (permalink / raw) To: Andrew Morton; +Cc: James Bottomley, dledford, patmans, linux-scsi Andrew Morton wrote: > > One could have designed it to support a pool of >1 command from the outset, > but it's unlikely to be necessary. Yes, I also think that it's not likely to be necessary. The functionality of support for more than one is there (i.e. in the freeing-list code). As I mentioned in this thread before, I had *no* idea (and still have none) on _what_ number to settle if greater than one. OTOH, if the powers that be decide that more than one is nevertheless necessary, a mempool, I think, would be quite appropriate. -- Luben ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure 2003-02-03 22:55 ` Doug Ledford 2003-02-03 22:59 ` Andrew Morton 2003-02-03 23:05 ` James Bottomley @ 2003-02-04 6:15 ` Andre Hedrick 2 siblings, 0 replies; 15+ messages in thread From: Andre Hedrick @ 2003-02-04 6:15 UTC (permalink / raw) To: Doug Ledford; +Cc: Andrew Morton, James Bottomley, patmans, luben, linux-scsi Doug, I had argued some time ago for reserved and priority allocation for swap under block period. Regardless if this is scsi/ata/sas/sata the issue is fundamental. I spent a fair amount of time debating memory pressure against swap in combination with saturated device request queues, with Rik Riel. This is one layer above the LLDD and regardless if you reserve 1,2,N-1,N command slots in the queuedcommand list. If the request can not be obtained from block because all request slots are stuffed full, you have to deploy an out-of-bounds operation. Progress will not happen period. Elevator sorting on swap is silly! If it is still done, you can not get there from here. Any swap-io run via the elevator must be inserted in the front of the request queue, period. This will provide you N+1 jump without fracturing all the commands in process or in flight. Since it would be priority, one can stuff it with the SPECIAL marker and boost it to the head of the queuedcommand list. Maybe I am on crack, but so is the design deployed to date. (from top->bottom and not bottom->up) Cheers, Andre Hedrick LAD Storage Consulting Group On Mon, 3 Feb 2003, Doug Ledford wrote: > On Fri, Jan 31, 2003 at 12:44:12PM -0800, Andrew Morton wrote: > > Bear in mind that on the swapout path, the calling process has PF_MEMALLOC > > set. This is a strong and successful mechanism - it allows the caller to dip > > into the final page reserves which are denied to even GFP_ATOMIC allocations. > > There's maybe a megabyte or two there. > > > > Could be that there's no problem to be solved here. 
It depends on whether > > these allocations are occurring in process context or not. > > I think the case is that there is no problem to be solved. One command > per host is enough to keep each host running, and that's enough to keep > the system running. If we are ever low enough on mem that we get down to > failing scsi command allocations, the system is already hurting. The > complaint was that a device doing something other than swap could starve a > swap device. I don't buy that. If the device is doing constant reads > then it's going to run out of mem eventually and block just like our > allocations are, if it's writing then it very likely is freeing up just as > many pages as the swap operation would be. In short, I think if we keep > the disk subsystem running, even if crippled with just one command, the > problem becomes self correcting and there isn't much for us to solve. Of > course, that's just my 5 minute analysis, someone feel free to prove me > wrong. > > -- > Doug Ledford <dledford@redhat.com> 919-754-3700 x44233 > Red Hat, Inc. > 1801 Varsity Dr. > Raleigh, NC 27606 > > - > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: scsi command slab allocation under memory pressure
  2003-01-29 18:47 scsi command slab allocation under memory pressure Patrick Mansfield
  2003-01-29 19:40 ` Luben Tuikov
@ 2003-01-29 22:53 ` James Bottomley
  1 sibling, 0 replies; 15+ messages in thread

From: James Bottomley @ 2003-01-29 22:53 UTC (permalink / raw)
To: Patrick Mansfield; +Cc: SCSI Mailing List

On Wed, 2003-01-29 at 13:47, Patrick Mansfield wrote:
> The linux-scsi.bkbits.net scsi-kmem_alloc-2.5 and scsi-combined-2.5
> trees include the scsi command slab allocation (Luben's patch).
>
> How does the use of a single slab for all hosts and all devices allow
> for IO while under memory pressure?

In essence, all we really need to guarantee under memory pressure is
that I/O being used to clear memory (i.e. for the swap device) will
eventually proceed. This is the weakest assumption necessary for the
system to make forward progress. Having a single command per device
(or even just a single available command) guarantees this, since if
that command is outstanding, it will eventually return and be re-used
for clearing memory, which is all that is required.

With a single command per host, there is a starvation issue if you
have heavy I/O to a device whose controller also contains the swap
device. However, the conditions would have to be fairly pathological
for heavy I/O to continue under memory pressure while starving the
swap device.

> There is one extra scsi command pre-allocated per host, but don't we
> require at least one (and ideally maybe more) per device? The
> pre-slab (current mainline kernel) command allocation always had at
> least one command per device available, and usually more (because we
> allocated more commands during the scan and upper level init).

Now we get into tuning: even if the system is making forward progress,
it might be doing so erratically, so how best do we ensure that the
memory-clearing I/O proceeds?
> That is - if we have swap on a separate disk and our command pool is
> small enough, IO to another disk could use the single per-host
> command under memory pressure, and we can fail to get a scsi command
> in order to write to the swap disk.

Right: a single command reserved per swap device would be sufficient
to assure a steady stream of memory-clearing I/O, which is probably
sufficient for most purposes.

> scsi_put_command() re-fills the host->free_list if it is empty, but
> under high (or higher) IO loads, the disk/device that generated the
> scsi_put_command will immediately issue a scsi_get_command for the
> same device.

That's true, but again, it's a system tuning issue. The optimal thing
to do for SCSI is to issue a new command for a device that just
returned one, because we know it has all the resources to hand. What
we do under memory pressure needs to be separated from what we do
ordinarily.

> If all command allocations are failing for a particular device (i.e.
> swap), we will wait a bit (device_blocked and device_busy == 0) and
> try again; we will not retry based on a scsi_put_command(). Even if
> we did retry based on a scsi_put_command, we could race with the
> scsi_put_command caller.

This is theoretically possible, but unlikely: all of the allocated
commands must eventually return. I can't think of any non-pathological
load scenario where the command queues can be loaded up so completely
from userland as to cause total starvation of the swap devices... but
doubtless somebody will come up with one.

James
end of thread, other threads:[~2003-02-04 18:04 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-01-29 18:47 scsi command slab allocation under memory pressure Patrick Mansfield
2003-01-29 19:40 ` Luben Tuikov
2003-01-29 20:11   ` Patrick Mansfield
2003-01-29 22:26     ` Luben Tuikov
2003-01-31  6:57       ` Andrew Morton
2003-01-31 13:46         ` James Bottomley
2003-01-31 20:44           ` Andrew Morton
2003-02-01  2:46             ` Patrick Mansfield
2003-02-03 22:55             ` Doug Ledford
2003-02-03 22:59               ` Andrew Morton
2003-02-03 23:05               ` James Bottomley
2003-02-03 23:19                 ` Andrew Morton
2003-02-04 18:04                   ` Luben Tuikov
2003-02-04  6:15             ` Andre Hedrick
2003-01-29 22:53 ` James Bottomley