public inbox for linux-scsi@vger.kernel.org
* [RFC]: performance improvement by coalescing requests?
@ 2005-06-20 19:25 Salyzyn, Mark
  2005-06-20 20:24 ` Jens Axboe
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Salyzyn, Mark @ 2005-06-20 19:25 UTC (permalink / raw)
  To: linux-scsi

This is not a patch to be applied to any release, for discussion only.

We have managed to increase I/O performance in the driver by
pushing back on the scsi_merge layer when we detect that we are issuing
sequential requests (patch enclosed below to demonstrate the technique
used to investigate). In the algorithm used, when we see an I/O that
adjoins the previous request, we reduce the queue depth to a value of 2
for the device. This allows the incoming I/Os to be scrutinized by the
scsi_merge layer for a bit longer, permitting them to be merged into
larger, more efficient requests.

By limiting the queue to a depth of two, we also do not delay the system
much, since we keep one request working and one outstanding in the
controller. This keeps the I/Os fed without delay.

The net result was that instead of receiving, for example, 64 4KB
sequential I/O requests to an eager controller more than willing to
accept the commands into its domain, we instead see two 4KB I/O
requests, followed by one 248KB I/O request.

I would like to hear from the luminaries about how we could move this
proposed policy to the scsi or block layers for a generalized increase
in Linux performance.

One should note that this kind of policy to deal with sequential I/O
activity is not new in high performance operating systems. It is simply
lacking in the Linux I/O layers.

Sincerely -- Mark Salyzyn

diff -ru a/drivers/scsi/aacraid/aachba.c b/drivers/scsi/aacraid/aachba.c
--- a/drivers/scsi/aacraid/aachba.c	Mon Jun 20 11:57:47 2005
+++ b/drivers/scsi/aacraid/aachba.c	Mon Jun 20 12:08:23 2005
@@ -154,6 +154,10 @@
 module_param(commit, int, 0);
 MODULE_PARM_DESC(commit, "Control whether a COMMIT_CONFIG is issued to the adapter for foreign arrays.\nThis is typically needed in systems that do not have a BIOS. 0=off, 1=on");
 
+static int coalescethreshold = 0;
+module_param(coalescethreshold, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(coalescethreshold, "Control the maximum block size of sequential requests that are fed back to the\nscsi_merge layer for coalescing. 0=off, 16 blocks (8KB) default.");
+
 int numacb = -1;
 module_param(numacb, int, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(numacb, "Request a limit to the number of adapter control blocks (FIB) allocated. Valid\nvalues are 512 and down. Default is to use suggestion from Firmware.");
@@ -878,6 +882,40 @@
 	aac_io_done(scsicmd);
 }
 
+static inline void aac_select_queue_depth(
+	struct scsi_cmnd * scsicmd,
+	int cid,
+	u64 lba,
+	u32 count)
+{
+	struct scsi_device *device = scsicmd->device;
+	struct aac_dev *dev;
+	unsigned depth;
+
+	if (!device->tagged_supported)
+		return;
+	dev = (struct aac_dev *)device->host->hostdata;
+	if (dev->fsa_dev[cid].queue_depth <= 2)
+		dev->fsa_dev[cid].queue_depth = device->queue_depth;
+	if (lba == dev->fsa_dev[cid].last) {
+		/*
+		 * If larger than coalescethreshold in size, coalescing has
+		 * less effect on overall performance.  Also, if we are
+		 * coalescing right now, leave it alone if above the threshold.
+		 */
+		if (count > coalescethreshold)
+			return;
+		depth = 2;
+	} else {
+		depth = dev->fsa_dev[cid].queue_depth;
+	}
+	scsi_adjust_queue_depth(device, MSG_ORDERED_TAG, depth);
+	dprintk((KERN_DEBUG "l=%llu %llu[%u] q=%u %lu\n",
+	  dev->fsa_dev[cid].last, lba, count, device->queue_depth,
+	  dev->queues->queue[AdapNormCmdQueue].numpending));
+	dev->fsa_dev[cid].last = lba + count;
+}
+
 static int aac_read(struct scsi_cmnd * scsicmd, int cid)
 {
 	u32 lba;
@@ -910,6 +948,10 @@
 	dprintk((KERN_DEBUG "aac_read[cpu %d]: lba = %llu, t = %ld.\n",
 	  smp_processor_id(), (unsigned long long)lba, jiffies));
 	/*
+	 *	Are we in a sequential mode?
+	 */
+	aac_select_queue_depth(scsicmd, cid, lba, count);
+	/*
 	 *	Allocate and initialize a Fib
 	 */
 	if (!(cmd_fibcontext = fib_alloc(dev))) {
@@ -1016,6 +1058,10 @@
 	dprintk((KERN_DEBUG "aac_write[cpu %d]: lba = %llu, t = %ld.\n",
 	  smp_processor_id(), (unsigned long long)lba, jiffies));
 	/*
+	 *	Are we in a sequential mode?
+	 */
+	aac_select_queue_depth(scsicmd, cid, lba, count);
+	/*
 	 *	Allocate and initialize a Fib then setup a BlockWrite command
 	 */
 	if (!(cmd_fibcontext = fib_alloc(dev))) {

^ permalink raw reply	[flat|nested] 8+ messages in thread
* RE: [RFC]: performance improvement by coalescing requests?
@ 2005-06-20 20:48 Salyzyn, Mark
  2005-06-21  7:28 ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Salyzyn, Mark @ 2005-06-20 20:48 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-scsi

Jens Axboe [mailto:axboe@suse.de] writes:
> You say io, but I guess you mean writes in particular?

Reads or writes. One of the test cases was:

dd if=/dev/sda of=/dev/null bs=512b

would break apart into 64 4K reads with no completion dependencies
between them.

> Or for any substantial amount of io, you would be queueing it so fast
> that it should have plenty of time to be merged
> until the drive sucks them in.

Did I mention that this problem started occurring when we increased the
aacraid adapter and driver performance last year? We managed to suck the
requests in faster. Sadly (from the perspective of Adaptec pride in our
hardware controllers ;-> ), the scsi_merge layer is more efficient at
coalescing the requests than the adapter's Firmware solely because of
the PCI bus bandwidth used.

I must admit that the last time I did this instrumented test was in the
2.6.3 timeframe with SL9.1. This 'plugging' you are talking about, when
did it make it into the scsi layer? Sounds like I need to retest,
certainly a good result of opening my mouth to start this thread.

> And a few ms should be enough time to queue that amount many many
> times over.

The adapter can suck in 256 requests within a single ms.

Sincerely -- Mark Salyzyn

* RE: [RFC]: performance improvement by coalescing requests?
@ 2005-06-21 12:05 Salyzyn, Mark
  2005-06-21 12:34 ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Salyzyn, Mark @ 2005-06-21 12:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-scsi

Jens and Jeff, thanks for your feedback. It does indicate that I will
need to go further in my investigation and make sure I am not dealing
with a corner case.

Jens Axboe [mailto:axboe@suse.de] 
> On Mon, Jun 20 2005, Salyzyn, Mark wrote:
>> Jens Axboe [mailto:axboe@suse.de] writes:
>> > You say io, but I guess you mean writes in particular?
>> 
>> Reads or writes. One of the test cases was:
>> 
>> dd if=/dev/sda of=/dev/null bs=512b
>> 
>> would break apart into 64 4K reads with no completion dependencies
>> between them.
> That's a silly test case though, because you are intentionally
> issuing io in a really small size.

The io size is 256KB (the 'b' in dd size operands means 'blocks'). This
is the worst case scenario: a single thread issues an i/o large enough
to stuff the controller full of 4K requests, then stops and waits for
them to complete before issuing more.

> real world cases?

It is not 'real world'. iozone was hard pressed to find much of a
difference; real world is a mix of threaded, small, large, sequential
and random i/o. I focused on single-thread large sequential i/o and a
surgical solution with zero effect on all other i/o styles.

> and see lots of small requests, then that would be more strange.
> Can you definitely verify this is what happens?

I can verify that OOB RHEL3 (2.4.21-4.EL) and SL9.1 (2.6.4-52) exhibited
this issue. I will regroup, re-instrument and report back whether this
is still the case for late model (distribution?) kernels.

> The plugging is a block layer property, it's been in use for ages
> (since at least 2.0, I forget when it was originally introduced).

Ok, so no recent changes that would affect my results. Regardless, I
will assume there are differences between RHEL3/SL9.1 and 2.6.12 that
may have an effect. Also, as Jeff pointed out, I should scrutinize the
i/o schedulers; the 'fix' may be in tuning the selection.

>> The adapter can suck in 256 requests within a single ms.
> I'm sure it can, I'm also sure that you can queue io orders of
> magnitude faster than you can send them to hardware!

With the recent 'interrupt mitigation' patch to the aacraid driver, we
don't even need to go to the hardware to queue a request after the
first two are added and triggered. We can queue 512 requests to the
controller in the time it takes to copy each pointer and size and
increment a producer index in main memory. Regardless, before the patch
it was one PCI write of overhead between each, which 'only' adds 10us
to that process for each request.

Not sure if this is a flaw or a feature ;-> But the fast disposition of
queuing may be the root cause rather than the Linux I/O system.

Random i/o performance benefits; sequential i/o induced by the split-up
of large i/o requests suffers (per se, the controller will still
coalesce the requests; the difference is an unscientific 10% in the
worst case scenario).

Sincerely - Mark Salyzyn


end of thread, other threads:[~2005-06-21 12:33 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-06-20 19:25 [RFC]: performance improvement by coalescing requests? Salyzyn, Mark
2005-06-20 20:24 ` Jens Axboe
2005-06-20 21:01 ` Jeff Garzik
2005-06-20 23:21 ` Bryan Henderson
  -- strict thread matches above, loose matches on Subject: below --
2005-06-20 20:48 Salyzyn, Mark
2005-06-21  7:28 ` Jens Axboe
2005-06-21 12:05 Salyzyn, Mark
2005-06-21 12:34 ` Jens Axboe
