From mboxrd@z Thu Jan 1 00:00:00 1970
From: Patrick Mansfield
Subject: scsi per-device lock [was Re: [RFC][PATCH] ... ]
Date: Thu, 6 Mar 2003 11:40:10 -0800
Sender: linux-scsi-owner@vger.kernel.org
Message-ID: <20030306114010.A24155@beaverton.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
Content-Disposition: inline
List-Id: linux-scsi@vger.kernel.org
To: James Bottomley
Cc: SCSI Mailing List

On Thu, Mar 06, 2003 at 12:04:04PM -0600, James Bottomley wrote:
> On Thu, 2003-03-06 at 11:41, Patrick Mansfield wrote:
> > Such a solution complicates moving to a per-device queue lock - we need a
> > per-host lock while pulling off the restart list, and we need a
> > per-queue-lock prior to calling __blk_run_queue(); but when starving in
> > the scsi_request_fn(), we have a per-queue-lock and need the per-host lock
> > to add to the restart list. So we have to allow dropping a lock in both
> > cases without leading to a hung request queue.
>
> Actually, I don't think it does. Any driver that can truly have a per
> device lock can't have a fixed size pool of resources that device
> requests are competing for (otherwise it would need a host lock to
> access these resources), thus can_queue should always be set to a number
> much greater than the aggregate of the local device queue depths (we'd
> need a comment in the code to this effect).

I don't quite follow the above - you mean a per-device lock in the actual
LLDD probably has no benefit? Yes, I agree with that.

But we do not need the host_lock while processing the request queue, or
modifying scsi_device values. Holding the host_lock in such cases not only
prevents IO from completing (i.e. host_busy decrement), but also blocks
other disks on the same adapter from prepping a request, allocating a
scsi_cmnd, or checking scsi_device limits.
An LLDD that uses host_lock to protect scsi_device or its request queue is
doing something wrong (it can reference invariant fields of scsi_device
such as host/chan/id/lun without a lock, and generally it can reference
scsi_cmnd fields without a lock).

> Do we actually have any LLDs in the tree today that can benefit from a
> device based locking approach? (i.e. they must manage their resource
> pools on a device basis and also have a mailbox scheme that doesn't
> require host based locking).
>
> James

I'm not proposing changing the LLDDs to use a per-device lock, rather that
scsi-core use a per-device lock to protect the request queue and the
scsi_device. The scsi core would still use the host_lock to protect
scsi_host and LLDD calls (the silly lock/call/unlock sequences).

Do you see any major problems with this approach?

I have a lock split patch, built on top of the previous pending_cmd queue
patch; it still needs code to handle the single_lun case, and it has a
lock hierarchy. I can post it as-is, but it is still in progress, and I
(now) have to change the pending_cmd to instead be a queue of starved
request queues.

I ran with and without the lock split on a 4-processor NUMAQ with 10
drives on a single qla 2300 adapter using the feral driver. I also ran
some tests on the same NUMAQ system with 8 procs and about 39 drives.

I used a simple program to try and maximize IOs/second - it uses O_DIRECT
and rereads the first 512 bytes of the disk. One of these is run per disk
for a fixed number of IOs.
The 4x with 10 drives, each drive rereading 20000 times, using the current
code gave (don't know how it got 103%):

0.51user 21.21system 0:21.08elapsed 103%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (6670major+2705minor)pagefaults 0swaps

Top oprofile users (low to high):

c0168180 1156   1.10736   dio_await_one
c0180105 1365   1.30757   .text.lock.inode
c01e0ab0 1512   1.44839   scsi_dispatch_cmd
c0115cb0 1545   1.48      try_to_wake_up
c01164c0 1621   1.5528    load_balance
c01e7087 1772   1.69745   .text.lock.scsi_lib
c0151cc9 3380   3.2378    .text.lock.block_dev
c0116d20 3491   3.34413   schedule
c01e1cf5 4148   3.97348   .text.lock.scsi
c01cb945 7584   7.26492   .text.lock.ll_rw_blk
c0107010 25712  24.6302   default_idle

With the lock split the same test gave:

0.43user 14.33system 0:20.55elapsed 71%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (6670major+2705minor)pagefaults 0swaps

Top oprofile users (low to high):

c01077a0 880    1.02679   __switch_to
c0180105 888    1.03612   .text.lock.inode
c01896b0 894    1.04313   start_this_handle
c01e6cc0 912    1.06413   scsi_check_shost
c01e07f0 965    1.12597   scsi_get_command
c0168180 1106   1.29049   dio_await_one
c0115cb0 1483   1.73037   try_to_wake_up
c01164c0 1489   1.73738   load_balance
c0151cc9 2706   3.15738   .text.lock.block_dev
c0116d20 3470   4.04882   schedule
c0107010 24072  28.0874   default_idle

Even though elapsed times were nearly equal with and without the lock
split, system time went from 21 to 14 seconds, a fairly significant drop
for this super-micro-mini benchmark.

I don't have the 8x numbers with 39 drives available, but the *elapsed*
time went from about 60 seconds to about 40 seconds.

Any suggestions for a simple multiple-disk benchmark?

-- 
Patrick Mansfield