[RFC]: performance improvement by coalescing requests?

public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC]: performance improvement by coalescing requests?
@ 2005-06-20 19:25 Salyzyn, Mark
  2005-06-20 20:24 ` Jens Axboe
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Salyzyn, Mark @ 2005-06-20 19:25 UTC (permalink / raw)
  To: linux-scsi

This is not a patch to be applied to any release, for discussion only.

We have managed to increase the performance of the I/O to the driver by
pushing back on the scsi_merge layer when we detect that we are issuing
sequential requests (patch enclosed below to demonstrate the technique
used to investigate). In the algorithm used, when we see that we have an
I/O that adjoins the previous request, we reduce the queue depth to a
value of 2 for the device. This allows the incoming I/O to be
scrutinized by the scsi_merge layer for a bit longer permitting them to
be merged together into a larger more efficient request.

By limiting the queue to a depth of two, we also do not delay the system
much since we keep one worker and one outstanding remaining in the
controller. This keeps the I/O's fed without delay.

The net result was instead of receiving, for example, 64 4K sequential
I/O requests to an eager controller more than willing to accept the
commands into it's domain, we instead see two 4K I/O requests, followed
by one 248KB I/O request.

I would like to hear from the luminaries about how we could move this
proposed policy to the scsi or block layers for a generalized increase
in Linux performance.

One should note that this kind of policy to deal with sequential I/O
activity is not new in high performance operating systems. It is simply
lacking in the Linux I/O layers.

Sincerely -- Mark Salyzyn

diff -ru a/drivers/scsi/aacraid/aachba.c b/drivers/scsi/aacraid/aachba.c
--- a/drivers/scsi/aacraid/aachba.c	Mon Jun 20 11:57:47 2005
+++ b/drivers/scsi/aacraid/aachba.c	Mon Jun 20 12:08:23 2005
@@ -154,6 +154,10 @@
 module_param(commit, int, 0);
 MODULE_PARM_DESC(commit, "Control whether a COMMIT_CONFIG is issued to
the adapter for foreign arrays.\nThis is typically needed in systems
that do not have a BIOS. 0=off, 1=on");

+static int coalescethreshold = 0;
+module_param(coalescethreshold, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(coalescethreshold, "Control the maximum block size of
sequential requests that are fed back to the\nscsi_merge layer for
coalescing. 0=off, 16 block (8KB) default.");
+
 int numacb = -1;
 module_param(numacb, int, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(numacb, "Request a limit to the number of adapter
control blocks (FIB) allocated. Valid\nvalues are 512 and down. Default
is to use suggestion from Firmware.");
@@ -878,6 +882,40 @@
 	aac_io_done(scsicmd);
 }

+static inline void aac_select_queue_depth(
+	struct scsi_cmnd * scsicmd,
+	int cid,
+	u64 lba,
+	u32 count)
+{
+	struct scsi_device *device = scsicmd->device;
+	struct aac_dev *dev;
+	unsigned depth;
+
+	if (!device->tagged_supported)
+		return;
+	dev = (struct aac_dev *)device->host->hostdata;
+	if (dev->fsa_dev[cid].queue_depth <= 2)
+		dev->fsa_dev[cid].queue_depth = device->queue_depth;
+	if (lba == dev->fsa_dev[cid].last) {
+		/*
+		 * If larger than coalescethreshold in size, coalescing
has
+		 * less effect on overall performance.  Also, if we are
+		 * coalescing right now, leave it alone if above the
threshold.
+		 */
+		if (count > coalescethreshold)
+			return;
+		depth = 2;
+	} else {
+		depth = dev->fsa_dev[cid].queue_depth;
+	}
+	scsi_adjust_queue_depth(device, MSG_ORDERED_TAG, depth);
+	dprintk((KERN_DEBUG "l=%llu %llu[%u] q=%u %lu\n",
+	  dev->fsa_dev[cid].last, lba, count, device->queue_depth,
+	  dev->queues->queue[AdapNormCmdQueue].numpending));
+	dev->fsa_dev[cid].last = lba + count;
+}
+
 static int aac_read(struct scsi_cmnd * scsicmd, int cid)
 {
 	u32 lba;
@@ -910,6 +948,10 @@
 	dprintk((KERN_DEBUG "aac_read[cpu %d]: lba = %u, t = %ld.\n",
 	  smp_processor_id(), (unsigned long long)lba, jiffies));
 	/*
+	 *	Are we in a sequential mode?
+	 */
+	aac_select_queue_depth(scsicmd, cid, lba, count);
+	/*
 	 *	Alocate and initialize a Fib
 	 */
 	if (!(cmd_fibcontext = fib_alloc(dev))) {
@@ -1016,6 +1058,10 @@
 	dprintk((KERN_DEBUG "aac_write[cpu %d]: lba = %u, t = %ld.\n",
 	  smp_processor_id(), (unsigned long long)lba, jiffies));
 	/*
+	 *	Are we in a sequential mode?
+	 */
+	aac_select_queue_depth(scsicmd, cid, lba, count);
+	/*
 	 *	Allocate and initialize a Fib then setup a BlockWrite
command
 	 */
 	if (!(cmd_fibcontext = fib_alloc(dev))) {

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC]: performance improvement by coalescing requests?
  2005-06-20 19:25 [RFC]: performance improvement by coalescing requests? Salyzyn, Mark
@ 2005-06-20 20:24 ` Jens Axboe
  2005-06-20 21:01 ` Jeff Garzik
  2005-06-20 23:21 ` Bryan Henderson
  2 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2005-06-20 20:24 UTC (permalink / raw)
  To: Salyzyn, Mark; +Cc: linux-scsi

On Mon, Jun 20 2005, Salyzyn, Mark wrote:
> This is not a patch to be applied to any release, for discussion only.
> 
> We have managed to increase the performance of the I/O to the driver by
> pushing back on the scsi_merge layer when we detect that we are issuing
> sequential requests (patch enclosed below to demonstrate the technique
> used to investigate). In the algorithm used, when we see that we have an
> I/O that adjoins the previous request, we reduce the queue depth to a
> value of 2 for the device. This allows the incoming I/O to be
> scrutinized by the scsi_merge layer for a bit longer permitting them to
> be merged together into a larger more efficient request.
> 
> By limiting the queue to a depth of two, we also do not delay the system
> much since we keep one worker and one outstanding remaining in the
> controller. This keeps the I/O's fed without delay.
> 
> The net result was instead of receiving, for example, 64 4K sequential
> I/O requests to an eager controller more than willing to accept the
> commands into it's domain, we instead see two 4K I/O requests, followed
> by one 248KB I/O request.
> 
> I would like to hear from the luminaries about how we could move this
> proposed policy to the scsi or block layers for a generalized increase
> in Linux performance.
> 
> One should note that this kind of policy to deal with sequential I/O
> activity is not new in high performance operating systems. It is simply
> lacking in the Linux I/O layers.

You say io, but I guess you mean writes in particular? If someone is
queuing large chunks of reads in 4kb sizes and starting a wait on the
first one immediately, that would defeat plugging and cause suboptimal
performance. That would be caller bug to be addressed though. So I'm
surprised that you see this happening, the plugging should handle this
case just fine. Or for any substantial amount of io, you would be
queueing it so fast that it should have plenty of time to be merged
until the drive sucks them in. For sequential io submitted in 4kb
chunks, by default we would not be invoking the request handler until we
have queued 4 (unplug_thresh) requests of 256kb (assuming that's the
adapter limit) or a few ms have passed. And a few ms should be enough
time to queue that amount many many times over.

Do you see this happening right from a queue depth of 1? I'm curious as
to what the testing scenario is that is able to provoke suboptimal
behavior in this way.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC]: performance improvement by coalescing requests?
  2005-06-20 19:25 [RFC]: performance improvement by coalescing requests? Salyzyn, Mark
  2005-06-20 20:24 ` Jens Axboe
@ 2005-06-20 21:01 ` Jeff Garzik
  2005-06-20 23:21 ` Bryan Henderson
  2 siblings, 0 replies; 8+ messages in thread
From: Jeff Garzik @ 2005-06-20 21:01 UTC (permalink / raw)
  To: Salyzyn, Mark; +Cc: linux-scsi

Salyzyn, Mark wrote:
> This is not a patch to be applied to any release, for discussion only.
> 
> We have managed to increase the performance of the I/O to the driver by
> pushing back on the scsi_merge layer when we detect that we are issuing
> sequential requests (patch enclosed below to demonstrate the technique
> used to investigate). In the algorithm used, when we see that we have an
> I/O that adjoins the previous request, we reduce the queue depth to a
> value of 2 for the device. This allows the incoming I/O to be
> scrutinized by the scsi_merge layer for a bit longer permitting them to
> be merged together into a larger more efficient request.
> 
> By limiting the queue to a depth of two, we also do not delay the system
> much since we keep one worker and one outstanding remaining in the
> controller. This keeps the I/O's fed without delay.
> 
> The net result was instead of receiving, for example, 64 4K sequential
> I/O requests to an eager controller more than willing to accept the
> commands into it's domain, we instead see two 4K I/O requests, followed
> by one 248KB I/O request.

Since you have

./drivers/scsi/aacraid/linit.c: .use_clustering  = ENABLE_CLUSTERING,

this smells like a bug or scheduler issue.  The block layer should 
-already- be coalescing requests.

What happens when you change I/O schedulers?  That may be a source of 
problems too.

	Jeff



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC]: performance improvement by coalescing requests?
  2005-06-20 19:25 [RFC]: performance improvement by coalescing requests? Salyzyn, Mark
  2005-06-20 20:24 ` Jens Axboe
  2005-06-20 21:01 ` Jeff Garzik
@ 2005-06-20 23:21 ` Bryan Henderson
  2 siblings, 0 replies; 8+ messages in thread
From: Bryan Henderson @ 2005-06-20 23:21 UTC (permalink / raw)
  To: Salyzyn, Mark; +Cc: linux-scsi

It's bad for a controller to suck in an arbitrarily large number of 
requests and queue them.  It makes it look to the initiator like the 
device is less busy than it is, which causes the initiator wrongly to 
conclude that coalescing would be a net loss.  The amount of I/O in flight 
from the initiator should be just enough to cover the reaction time 
between when the controller signals that it's (really) ready for more work 
and the initiator is able to deliver more, thus preventing the controller 
from running dry.

But I've run into this many times myself -- the controller looks like it 
can do I/O at channel speed for a while, and then it looks very slow for a 
while because of the tiny, poorly ordered chunks of work it was given. 
This causes people to propose various queue plugging algorithms where the 
initiator withholds work from a willing controller because it knows more 
about that controller's performance characteristics than the controller 
lets on.  I don't like any of these algorithms because they are so highly 
dependent on what goes on inside the controller and the work arrival 
pattern.  There's no heuristic you can choose that works for even all the 
common scenarios, let alone the scenarios that we haven't thought of yet. 
I generally override Linux block layer queue plugging (by explicitly 
unplugging every time I put something in the queue).

If you have a continual flow of work, you can solve the problem just by 
enlarging upstream queues to make sure you swamp whatever queue capacity 
the controller has.  I've done that by increasing the number of threads 
processing files, for example.  If you have a bursty 
response-time-sensitive workload (e.g. a small number of threads doing 
small synchronous reads, well below the capacity of the controller), it's 
much harder.

Of course, the cleanest thing to do is to reduce the size of the queue in 
the controller to the reaction time window I described above or have the 
controller adjust it dynamically.

But when you don't have that luxury, doing the same thing via the Linux 
driver queue depth (this patch) seems like a great substitute to me.

Why is it just for sequential?  If your pipeline (device driver -> 
controller -> device) is always full, what's to lose by backing up 
requests that can't be coalesced?

By the way, a disk controller in theory is a better place to do coalescing 
than a Linux queue -- the controller should know more about the ideal 
ordering and clustering of reads and writes.  Linux should have to 
coalesce only where small I/Os saturate the pipe to the controller.  But I 
assume we're talking about controllers that execute each received I/O 
separately.

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Filesystems 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [RFC]: performance improvement by coalescing requests?
@ 2005-06-20 20:48 Salyzyn, Mark
  2005-06-21  7:28 ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Salyzyn, Mark @ 2005-06-20 20:48 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-scsi

Jens Axboe [mailto:axboe@suse.de] writes:
> You say io, but I guess you mean writes in particular?

Read or writes. One of the test cases was:

dd if=/dev/sda of=/dev/null bs=512b

would break apart into 64 4K reads with no completion dependencies
between them.

> Or for any substantial amount of io, you would be queueing it so fast
> that it should have plenty of time to be merged
> until the drive sucks them in.

Did I mention that this problem started occurring when we increased the
aacraid adapter and driver performance last year? We managed to suck the
requests in faster. Sadly (from the perspective of Adaptec pride in our
hardware controllers ;-> ), the scsi_merge layer is more efficient at
coalescing the requests than the adapter's Firmware solely because of
the PCI bus bandwidth used.

I must admit that the last time I did this instrumented test was in the
2.6.3 timeframe with SL9.1. This 'plugging' you are talking about, when
did it make it into the scsi layer? Sounds like I need to retest,
certainly a good result of opening my mouth to start this thread.

> And a few ms should be enough time to queue that amount many many
times over.

The adapter can suck in 256 requests within a single ms.

Sincerely -- Mark Salyzyn

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC]: performance improvement by coalescing requests?
  2005-06-20 20:48 Salyzyn, Mark
@ 2005-06-21  7:28 ` Jens Axboe
  0 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2005-06-21  7:28 UTC (permalink / raw)
  To: Salyzyn, Mark; +Cc: linux-scsi

On Mon, Jun 20 2005, Salyzyn, Mark wrote:
> Jens Axboe [mailto:axboe@suse.de] writes:
> > You say io, but I guess you mean writes in particular?
> 
> Read or writes. One of the test cases was:
> 
> dd if=/dev/sda of=/dev/null bs=512b
> 
> would break apart into 64 4K reads with no completion dependencies
> between them.

That's a silly test case though, because you are intentionally issuing
io in a really small size. Do you have any real world cases?

If you do

dd if=/dev/zero of=/dev/sda bs=512b

and see lots of small requests, then that would be more strange. Can you
definitely verify this is what happens?

> > Or for any substantial amount of io, you would be queueing it so fast
> > that it should have plenty of time to be merged
> > until the drive sucks them in.
> 
> Did I mention that this problem started occurring when we increased the
> aacraid adapter and driver performance last year? We managed to suck the
> requests in faster. Sadly (from the perspective of Adaptec pride in our
> hardware controllers ;-> ), the scsi_merge layer is more efficient at
> coalescing the requests than the adapter's Firmware solely because of
> the PCI bus bandwidth used.
> 
> I must admit that the last time I did this instrumented test was in the
> 2.6.3 timeframe with SL9.1. This 'plugging' you are talking about, when
> did it make it into the scsi layer? Sounds like I need to retest,
> certainly a good result of opening my mouth to start this thread.

The plugging is a block layer property, it's been in use for ages (since
at least 2.0, I forget when it was originall introduced).

> > And a few ms should be enough time to queue that amount many many
> > times over.
> 
> The adapter can suck in 256 requests within a single ms.

I'm sure it can, I'm also sure that you can queue io orders of magnitude
faster than you can send them to hardware!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [RFC]: performance improvement by coalescing requests?
@ 2005-06-21 12:05 Salyzyn, Mark
  2005-06-21 12:34 ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Salyzyn, Mark @ 2005-06-21 12:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-scsi

Jens and Jeff, thanks for your feedback. It does indicate that I will
need to go further in my investigation and make sure I am not dealing
with a corner case.

Jens Axboe [mailto:axboe@suse.de] 
> On Mon, Jun 20 2005, Salyzyn, Mark wrote:
>> Jens Axboe [mailto:axboe@suse.de] writes:
>> > You say io, but I guess you mean writes in particular?
>> 
>> Read or writes. One of the test cases was:
>> 
>> dd if=/dev/sda of=/dev/null bs=512b
>> 
>> would break apart into 64 4K reads with no completion dependencies
>> between them.
> That's a silly test case though, because you are intentionally
> issuing io in a really small size.

The io size is 256KB (the 'b' in dd size operands is 'blocks'). This is
the worst case scenario (single thread, large enough i/o to stuff the
controller full with 4K requests, then stops waiting for them to
complete before issuing more).

> real world cases?

It is not 'real world'. iozone was hard pressed to find much of a
difference, real world is a mix of threaded, small, large, sequential
and random; Focused on single thread large sequential and a surgical
solution with zero affect on all other i/o styles.

> and see lots of small requests, then that would be more strange.
> Can you definitely verify this is what happens?

I can verify that OOB RHEL3 (2.4.21-4.EL) and SL9.1 (2.6.4-52) exhibited
this issue. I will regroup, re-instrument and report back if this is
still the case for a late model (distribution?) kernels.

> The plugging is a block layer property, it's been in use for ages
> (since at least 2.0, I forget when it was originall introduced).

Ok, so no recent changes that would affect my results. Regardless, I
will assume there are differences between RHEL3/SL9.1 and 2.6.12 that
may have an affect. Also, as Jeff pointed out, I should scrutinize the
i/o schedulers, the 'fix' may be in tuning the selection.

>> The adapter can suck in 256 requests within a single ms.
> I'm sure it can, I'm also sure that you can queue io orders of
> magnitude faster than you can send them to hardware!

With the recent 'interrupt mitigation' patch to the aacraid driver, we
don't even need to go to the hardware to queue the request after the
first two are added and triggered. We can put 512 requests queued the
controller in the time it takes to move each pointer, size and increment
a produced index on the main memory. Regardless, before the patch it was
one PCI write overhead between each, which 'only' adds 10us to that
process for each request.

Not sure if this is a flaw, or a feature ;-> But the fast disposition of
queuing may be the root cause and not the Linux I/O system.

Random i/o performance benefits, sequential i/o induced by the split up
of large i/o requests suffers (per-se, the controller will coalesce the
request, the difference is an unscientific 10% with the worst case
scenario)

Sincerely - Mark Salyzyn

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC]: performance improvement by coalescing requests?
  2005-06-21 12:05 Salyzyn, Mark
@ 2005-06-21 12:34 ` Jens Axboe
  0 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2005-06-21 12:34 UTC (permalink / raw)
  To: Salyzyn, Mark; +Cc: linux-scsi

On Tue, Jun 21 2005, Salyzyn, Mark wrote:
> Jens Axboe [mailto:axboe@suse.de] 
> > On Mon, Jun 20 2005, Salyzyn, Mark wrote:
> >> Jens Axboe [mailto:axboe@suse.de] writes:
> >> > You say io, but I guess you mean writes in particular?
> >> 
> >> Read or writes. One of the test cases was:
> >> 
> >> dd if=/dev/sda of=/dev/null bs=512b
> >> 
> >> would break apart into 64 4K reads with no completion dependencies
> >> between them.
> > That's a silly test case though, because you are intentionally
> > issuing io in a really small size.
> 
> The io size is 256KB (the 'b' in dd size operands is 'blocks'). This is

You are right, my mistake, I misread that as 'bytes'.

> the worst case scenario (single thread, large enough i/o to stuff the
> controller full with 4K requests, then stops waiting for them to
> complete before issuing more).

For 256kb issued reads, you really should not see 4kb reaching the
controller. What happens if you access a file on the file system
instead?

> > real world cases?
> 
> It is not 'real world'. iozone was hard pressed to find much of a
> difference, real world is a mix of threaded, small, large, sequential
> and random; Focused on single thread large sequential and a surgical
> solution with zero affect on all other i/o styles.

Ok

> > and see lots of small requests, then that would be more strange.
> > Can you definitely verify this is what happens?
> 
> I can verify that OOB RHEL3 (2.4.21-4.EL) and SL9.1 (2.6.4-52) exhibited
> this issue. I will regroup, re-instrument and report back if this is
> still the case for a late model (distribution?) kernels.

Both of those kernels are ancient. Please just test with 2.6.12, no need
to mix distro kernels into this.

> 
> > The plugging is a block layer property, it's been in use for ages
> > (since at least 2.0, I forget when it was originall introduced).
> 
> Ok, so no recent changes that would affect my results. Regardless, I

Given how old kernels you tested, there are probably a ton of things
that could affect this.

> will assume there are differences between RHEL3/SL9.1 and 2.6.12 that
> may have an affect. Also, as Jeff pointed out, I should scrutinize the
> i/o schedulers, the 'fix' may be in tuning the selection.

Well there are so many differences between 2.4 and 2.6. I would suggest
you forget about 2.4 completely, as this is marginal stuff. Focus in
2.6.latest and see if you can find performance issues there, if you can
we will fix it and if it's something obvious it will often go to distro
kernels as well.

> >> The adapter can suck in 256 requests within a single ms.
> > I'm sure it can, I'm also sure that you can queue io orders of
> > magnitude faster than you can send them to hardware!
> 
> With the recent 'interrupt mitigation' patch to the aacraid driver, we
> don't even need to go to the hardware to queue the request after the
> first two are added and triggered. We can put 512 requests queued the
> controller in the time it takes to move each pointer, size and increment
> a produced index on the main memory. Regardless, before the patch it was
> one PCI write overhead between each, which 'only' adds 10us to that
> process for each request.

You still need to be invoked first, which should not happen unless
someone higher up unplugged the queue manually. 512 requests is really
insane anyways, you will completely kill the io scheduler with such deep
queue depths. At the very least increase the block layer queue size to
be _at least_ twice the hardware depth, or you can very easily provoke
bad behaviour such as this.

> Not sure if this is a flaw, or a feature ;-> But the fast disposition of
> queuing may be the root cause and not the Linux I/O system.

Nah, I don't think so.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2005-06-21 12:33 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-06-20 19:25 [RFC]: performance improvement by coalescing requests? Salyzyn, Mark
2005-06-20 20:24 ` Jens Axboe
2005-06-20 21:01 ` Jeff Garzik
2005-06-20 23:21 ` Bryan Henderson
  -- strict thread matches above, loose matches on Subject: below --
2005-06-20 20:48 Salyzyn, Mark
2005-06-21  7:28 ` Jens Axboe
2005-06-21 12:05 Salyzyn, Mark
2005-06-21 12:34 ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox