From: Jens Axboe
Subject: Re: [PATCH] scsi, fcoe, libfc: drop scsi host_lock use from fc_queuecommand
Date: Sun, 26 Sep 2010 12:19:37 +0900
Message-ID: <4C9EBBC9.9070709@fusionio.com>
References: <20100903222715.6237.75737.stgit@localhost.localdomain>
 <4C9C47FC.5080304@fusionio.com>
To: Bart Van Assche
Cc: Vasu Dev, linux-scsi@vger.kernel.org

On 2010-09-26 01:55, Bart Van Assche wrote:
> On Fri, Sep 24, 2010 at 8:41 AM, Jens Axboe wrote:
>>
>> [ ... ]
>>
>> Bart, can you try with this patchset added:
>>
>> git://git.kernel.dk/linux-2.6-block.git blk-alloc-optimize
>>
>> It's a work in progress and not suitable for general consumption yet,
>> but it's tested working at least. There will be more built on top of
>> this, but even this simple stuff is already making a big difference
>> in my IOPS testing.
>
> Hello Jens,
>
> Thanks for the feedback. I see a nice 10% speedup after applying the
> four block layer optimization patches from the blk-alloc-optimize
> branch on an already patched 2.6.35.5 SRP initiator.

Great! Not bad at all for something that's still a WIP.

> Note: according to the output of perf record -g, most spinlock calls
> still originate from the block layer. This is what the perf tool
> reported for a fio run using libaio with small blocks (512 bytes):
>
> Event: cycles
> - 7.06%  fio  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>    - _raw_spin_lock_irqsave
>       + 19.51% blk_run_queue
>       + 13.71% blk_end_bidi_request
>       + 10.04% mlx4_ib_poll_cq
>       +  4.68% lock_timer_base
>       +  4.22% aio_complete
>       +  3.97% srp_send_completion
>       +  3.71% srp_queuecommand
>       +  3.55% dio_bio_end_aio
>       +  3.37% __srp_get_tx_iu
>       +  3.14% srp_recv_completion
>       +  3.00% scsi_device_unbusy
>       +  2.87% __scsi_put_command
>       +  2.82% __blockdev_direct_IO_newtrunc
>       +  2.76% scsi_put_command
>       +  2.69% scsi_run_queue
>       +  2.65% dio_bio_submit
>       +  2.54% srp_remove_req
>       +  2.46% mlx4_ib_post_send
>       +  2.33% scsi_get_command
>       +  1.95% mlx4_ib_post_recv

One piece of low-hanging fruit is reducing the number of queue runs.
SCSI currently runs the queue for every completed command to keep the
device queue full. I bet that if you try an experiment where you only
run the queue once a certain number of requests have completed, you
would greatly reduce the scsi_run_queue and blk_run_queue hits in the
above profile. A rough sketch of the idea is below my sig.

-- 
Jens Axboe
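
Untested and purely illustrative, against a 2.6.35-era scsi_lib.c.
Neither the completed_batch field nor the SDEV_RUN_BATCH threshold
exists; both are made-up names standing in for a new per-device
counter and a tunable limit:

#include <scsi/scsi_device.h>

/* Made-up threshold: completions to batch up before running the queue. */
#define SDEV_RUN_BATCH	4

/*
 * Would replace the unconditional scsi_run_queue() in the completion
 * path (e.g. scsi_next_command()), which is static to scsi_lib.c.
 * Runs the queue only every SDEV_RUN_BATCH completions, or as soon as
 * the device goes idle so queued requests are never stranded.
 */
static void scsi_run_queue_batched(struct scsi_device *sdev)
{
	/* completed_batch is a hypothetical new scsi_device member */
	if (++sdev->completed_batch >= SDEV_RUN_BATCH ||
	    sdev->device_busy == 0) {
		sdev->completed_batch = 0;
		scsi_run_queue(sdev->request_queue);
	}
}

The device_busy check matters: without it, a final batch shorter than
SDEV_RUN_BATCH at the tail of a burst would never trigger a queue run.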