Optimize scsi_cmnd initialization and bug fix

linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Optimize scsi_cmnd initialization and bug fix
@ 2005-11-23 22:50 Chen, Kenneth W
  2005-11-24  8:05 ` Arjan van de Ven
  2005-12-06 17:22 ` Christoph Hellwig
  0 siblings, 2 replies; 6+ messages in thread
From: Chen, Kenneth W @ 2005-11-23 22:50 UTC (permalink / raw)
  To: linux-scsi; +Cc: 'James Bottomley'

struct scsi_cmnd is a fairly large data structure and is used
to construct scsi command.  On x86-64 arch, it is 448 bytes.  

In scsi_get_command(), it is a bit too paranoid in zeroing the
data structure.  Since most of the member variables will be
initialized in various stage of I/O submission, i.e., in
scsi_prep_fn, scsi_init_io, sd_init_command, scsi_init_cmd_errh,
etc.  So instead of blindly zeroing the whole structure, initialize
to some other value later on.  All it needs in scsi_get_command
is to zero out a few member variables that aren't initialized in
the I/O path.

I'm proposing the following patch to optimize scsi_cmnd initialization.
The only part that I'm not 100% sure is struct scsi_pointer SCp and
eh_eflags.  It appears to be safe, but since I don't have every scsi
controllers in the world to test, I can't say so for sure.  I'm willing
to take on the responsibility to fix any hiccup people might have with
this patch.  Comments?

p.s. de-referencing jiffies_at_alloc ought be in the if(cmd!=NULL) block.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>

--- ./drivers/scsi/scsi.c.orig	2005-11-23 13:04:46.607198199 -0800
+++ ./drivers/scsi/scsi.c	2005-11-23 13:05:38.154072568 -0800
@@ -258,17 +258,18 @@ struct scsi_cmnd *scsi_get_command(struc
 	if (likely(cmd != NULL)) {
 		unsigned long flags;

-		memset(cmd, 0, sizeof(*cmd));
+		cmd->cmd_len = 0;
+		cmd->sc_request = 0;
+		cmd->retries = 0;
 		cmd->device = dev;
 		init_timer(&cmd->eh_timeout);
-		INIT_LIST_HEAD(&cmd->list);
 		spin_lock_irqsave(&dev->list_lock, flags);
 		list_add_tail(&cmd->list, &dev->cmd_list);
 		spin_unlock_irqrestore(&dev->list_lock, flags);
+		cmd->jiffies_at_alloc = jiffies;
 	} else
 		put_device(&dev->sdev_gendev);

-	cmd->jiffies_at_alloc = jiffies;
 	return cmd;
 }				
 EXPORT_SYMBOL(scsi_get_command);

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Optimize scsi_cmnd initialization and bug fix
  2005-11-23 22:50 Optimize scsi_cmnd initialization and bug fix Chen, Kenneth W
@ 2005-11-24  8:05 ` Arjan van de Ven
  2005-11-29  1:22   ` Chen, Kenneth W
  2005-12-06 17:22 ` Christoph Hellwig
  1 sibling, 1 reply; 6+ messages in thread
From: Arjan van de Ven @ 2005-11-24  8:05 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: 'James Bottomley', linux-scsi

On Wed, 2005-11-23 at 14:50 -0800, Chen, Kenneth W wrote:
> struct scsi_cmnd is a fairly large data structure and is used
> to construct scsi command.  On x86-64 arch, it is 448 bytes.  
> 
> In scsi_get_command(), it is a bit too paranoid in zeroing the
> data structure.  Since most of the member variables will be
> initialized in various stage of I/O submission, i.e., in
> scsi_prep_fn, scsi_init_io, sd_init_command, scsi_init_cmd_errh,
> etc.  So instead of blindly zeroing the whole structure, initialize
> to some other value later on.  All it needs in scsi_get_command
> is to zero out a few member variables that aren't initialized in
> the I/O path.


actually I question this optimisation. memset uses the rep stosl code
sequence, which mean that the cpu can avoid write-allocate on the
cachelines in question, and just plain zero them in cache. If you
initialize the parts one by one, the cpu will need to do write-allocate
on the cachelines, and thus has double the memory bandwidth needed than
the existing case.

(not sure if all cpus are smart enough to avoid write allocate for rep
stosl, but most of the newer ones are)


^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: Optimize scsi_cmnd initialization and bug fix
  2005-11-24  8:05 ` Arjan van de Ven
@ 2005-11-29  1:22   ` Chen, Kenneth W
  2005-11-29  7:37     ` Arjan van de Ven
  0 siblings, 1 reply; 6+ messages in thread
From: Chen, Kenneth W @ 2005-11-29  1:22 UTC (permalink / raw)
  To: 'Arjan van de Ven'; +Cc: 'James Bottomley', linux-scsi

Arjan van de Ven wrote on Thursday, November 24, 2005 12:06 AM
> On Wed, 2005-11-23 at 14:50 -0800, Chen, Kenneth W wrote:
> > struct scsi_cmnd is a fairly large data structure and is used
> > to construct scsi command.  On x86-64 arch, it is 448 bytes.  
> > 
> > In scsi_get_command(), it is a bit too paranoid in zeroing the
> > data structure.  Since most of the member variables will be
> > initialized in various stage of I/O submission, i.e., in
> > scsi_prep_fn, scsi_init_io, sd_init_command, scsi_init_cmd_errh,
> > etc.  So instead of blindly zeroing the whole structure, initialize
> > to some other value later on.  All it needs in scsi_get_command
> > is to zero out a few member variables that aren't initialized in
> > the I/O path.
> 
> actually I question this optimisation. memset uses the rep stosl code
> sequence, which mean that the cpu can avoid write-allocate on the
> cachelines in question, and just plain zero them in cache. If you
> initialize the parts one by one, the cpu will need to do write-allocate
> on the cachelines, and thus has double the memory bandwidth needed than
> the existing case.
> 
> (not sure if all cpus are smart enough to avoid write allocate for rep
> stosl, but most of the newer ones are)


30 out of 38 member variables are initialized to non-zero value, about 4
are initialized by the LLDD.  Another 4 got written in the I/O return
path, though these 4 are sprinkled in the structure.  Even though memset
doesn't write-allocate, there are enough code which will bring the cache
line into the cpu anyway.  Since this is arch independent code, I'm also
trying to optimize not just x86-64, but other arch like ia64 which doesn't
have the same memset cache behavior.

Our experiments with I/O intensive db workload showed that removing memset
call has a net performance gain for db workload.

- Ken


^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: Optimize scsi_cmnd initialization and bug fix
  2005-11-29  1:22   ` Chen, Kenneth W
@ 2005-11-29  7:37     ` Arjan van de Ven
  2005-11-30  2:06       ` Chen, Kenneth W
  0 siblings, 1 reply; 6+ messages in thread
From: Arjan van de Ven @ 2005-11-29  7:37 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: 'James Bottomley', linux-scsi

> 
> 30 out of 38 member variables are initialized to non-zero value, about 4
> are initialized by the LLDD.  Another 4 got written in the I/O return
> path, though these 4 are sprinkled in the structure.  Even though memset
> doesn't write-allocate, there are enough code which will bring the cache
> line into the cpu anyway.  

but.. it's ALREADY in cache after the memset.... That's the entire point
of that. You put zeros in the cache without needing to get the
overwritten-in-a-few-cycles data from ram, but make sure the data is in
cache so that the next uses of it are really cheap. Eg the only cache
traffic is writing the data back to ram eventually, which is
asynchronous. By avoiding the write allocate altogether you avoid 1)
having to wait for the ram and 2) the memory bandwidth needed for it.
Both are important, and both are avoided by a memset..

> Since this is arch independent code, I'm also
> trying to optimize not just x86-64, but other arch like ia64 which doesn't
> have the same memset cache behavior.

sounds like an unoptimized cpu to me; alpha and others I'm sure will
avoid this as well.. I'd be highly surprised if ppc64 didn't as well. 

> Our experiments with I/O intensive db workload showed that removing memset
> call has a net performance gain for db workload.

which architectures?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: Optimize scsi_cmnd initialization and bug fix
  2005-11-29  7:37     ` Arjan van de Ven
@ 2005-11-30  2:06       ` Chen, Kenneth W
  0 siblings, 0 replies; 6+ messages in thread
From: Chen, Kenneth W @ 2005-11-30  2:06 UTC (permalink / raw)
  To: 'Arjan van de Ven'; +Cc: 'James Bottomley', linux-scsi

Arjan van de Ven wrote on Thursday, November 24, 2005 12:06 AM
> actually I question this optimisation. memset uses the rep stosl code
> sequence, which mean that the cpu can avoid write-allocate on the
> cachelines in question, and just plain zero them in cache. If you
> initialize the parts one by one, the cpu will need to do write-allocate
> on the cachelines, and thus has double the memory bandwidth needed than
> the existing case.
> 
> (not sure if all cpus are smart enough to avoid write allocate for rep
> stosl, but most of the newer ones are)

Your earlier email prompt me to double check with hardware architects
whether "rep stosl" has any special cache behavior.  The answer is NO
(at least on all Intel Pentium 4 processors).  It behaves just like
stosl as far as the cache behavior is concerned. On write cache misses,
processor performs a cache line fill, write allocation. Then it writes
the operand into the cache line.


Arjan van de Ven wrote on Monday, November 28, 2005 11:38 PM
> > 30 out of 38 member variables are initialized to non-zero value, about 4
> > are initialized by the LLDD.  Another 4 got written in the I/O return
> > path, though these 4 are sprinkled in the structure.  Even though memset
> > doesn't write-allocate, there are enough code which will bring the cache
> > line into the cpu anyway.  
> 
> but.. it's ALREADY in cache after the memset.... That's the entire point
> of that. You put zeros in the cache without needing to get the
> overwritten-in-a-few-cycles data from ram, but make sure the data is in
> cache so that the next uses of it are really cheap. Eg the only cache
> traffic is writing the data back to ram eventually, which is
> asynchronous. By avoiding the write allocate altogether you avoid 1)
> having to wait for the ram and 2) the memory bandwidth needed for it.
> Both are important, and both are avoided by a memset..

The behavior you described here is not what Intel pentium 4 processor
would do.  So the assumption doesn't apply here.  By removing the memset,
we don't have to execute extra CPU cycles because at later point, a store
will effectively does the same thing.  Why spend the extra cycle to store
two values into the same location?

- Ken


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Optimize scsi_cmnd initialization and bug fix
  2005-11-23 22:50 Optimize scsi_cmnd initialization and bug fix Chen, Kenneth W
  2005-11-24  8:05 ` Arjan van de Ven
@ 2005-12-06 17:22 ` Christoph Hellwig
  1 sibling, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2005-12-06 17:22 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: linux-scsi, 'James Bottomley'

On Wed, Nov 23, 2005 at 02:50:00PM -0800, Chen, Kenneth W wrote:
> I'm proposing the following patch to optimize scsi_cmnd initialization.
> The only part that I'm not 100% sure is struct scsi_pointer SCp and
> eh_eflags.  It appears to be safe, but since I don't have every scsi
> controllers in the world to test, I can't say so for sure.  I'm willing
> to take on the responsibility to fix any hiccup people might have with
> this patch.  Comments?

SCp should be zeroed, it's the scratch area for controllers.  eh_ehflags
is under midlayer control and should start out as zero.

> p.s. de-referencing jiffies_at_alloc ought be in the if(cmd!=NULL) block.

that's fixed already in latest mainline.


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2005-12-06 17:22 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-11-23 22:50 Optimize scsi_cmnd initialization and bug fix Chen, Kenneth W
2005-11-24  8:05 ` Arjan van de Ven
2005-11-29  1:22   ` Chen, Kenneth W
2005-11-29  7:37     ` Arjan van de Ven
2005-11-30  2:06       ` Chen, Kenneth W
2005-12-06 17:22 ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).