Re: [RFC] hpsa: work in progress "lockless monster" patches

linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Tomas Henzl <thenzl@redhat.com>
To: Webb Scales <webbnh@hp.com>,
	scameron@beardog.cce.hp.com, linux-scsi@vger.kernel.org
Cc: jbottomley@parallels.com, martin.petersen@oracle.com,
	hare@suse.de, hch@infradead.org, bvanassche@acm.org,
	axboe@kernel.dk, elliott@hp.com, scott.teel@hp.com,
	joseph.t.handzik@hp.com, justin.lindley@hp.com
Subject: Re: [RFC] hpsa: work in progress "lockless monster" patches
Date: Mon, 04 Aug 2014 15:36:48 +0200	[thread overview]
Message-ID: <53DF8C70.2080306@redhat.com> (raw)
In-Reply-To: <53DBC2E0.9030009@hp.com>

On 08/01/2014 06:40 PM, Webb Scales wrote:
> On 7/31/14 9:56 AM, Tomas Henzl wrote:
>>>> In cmd_tagged_alloc "Thus, there should never be a collision here between
>>>> two requests" if this is true you don't need the refcount and just a simple
>>>> flag were sufficient for your other needs. (And maybe you get to ~971k IOPS..)
>>> :-)  The code previously had the reference count in there to close
>>> certain race conditions' asynchronous accesses to the command block
>>> (e.g., between an abort and a completing command).  We hope that, using
>>> the block layer tags, those races no longer occur, but there didn't seem
>>> to be any point in removing the reference count until we'd had more
>>> experience with the code...and, in a fit of healthy paranoia I added the
>>> check to which you refer.  Unfortunately, in some of Rob Elliott's
>>> recent tests, he managed to drive a case where the check triggers.  So,
>>> until we reconcile that, I'm inclined to leave the check in place
>>> (hopefully it's not costing a full 1k IOPS :-) ).
>> Let us assume that the block layer never sends duplicate tags to the driver,
>> I think there are some weaknesses in the way how it's implemented,
>> for example here - from fail_all_outstanding_cmds:
>>                  refcount = atomic_inc_return(&c->refcount);
>>                  if (refcount > 1) {
>>                  ...
>>                          finish_cmd(c); - this finishes the command
>> 			and the tag might be now reused,
>> 			when that happens you'll see a race
>>                          atomic_dec(&h->commands_outstanding);
>>                          failcount++;
>>                  }
>>                  else {what happens when right now comes a call from block layer?}
> The next call from the block layer should find that the controller has 
> been marked as locked-up and return immediately with a no-connect error 
> (even before trying to allocate an HPSA command block), so I don't think 
> there is a race or conflict.

 
OK, the 'locked-up' logic and when it's activated - I mean I don't understand every
part of your driver. I have just seen that in your current implementation 
of cmd_tagged_alloc that you only print a debug message and continue with the same block.
That might bring some issues later.
 

>
>
>> When references are used it usually means, that when refcount==0 that
>> a release function frees the resources, in this case it is the tag number.
>> In your implementation you should ensure that when you signal
>> to the upper layer that the tag is free, that nothing else holds the reference.
> Actually, in this case, the resource is the HPSA command block which 
> corresponds to the tag number.  But, you are correct that trying to 
> manage them separately is apt to cause problems.  Having the 
> cmd_[tagged_]free() routine make the call to scsi_done() -- only when 
> the reference count is dropping to zero -- is an interesting 
> suggestion...but I'll have to do considerable investigation before I can 
> proceed with that.

I'm not sure if this is doable (calling scsi_done when refcount drops to zero)
I just see a weakness of the code that you can have called the scsi_done  
and still having a refcount > 0.

There is work in progress on the code and I have found no more issues (better said
maybe issues), so thanks for sharing the code now.

--tomash

>
>
>> If this is true the conclusion is that you don't need this kind of references
>> and a simple flag just to debug not yet fixed races is enough.
>> I'm maybe wrong because I don't understand what you want to protect
>> in the above example, so that makes me to have some doubts if I understand
>> the code properly.
> Absent the block layer support, the HPSA code has to be able to defend 
> against various races where there may be more than two interested 
> parties looking at the command block -- so a simple flag will not 
> suffice.  On the other hand, with the block layer support in place, 
> we're hoping that we can get rid of the reference count altogether, but 
> we're not there, yet.
>
>
>                  Webb
>
>
>
>>>> On 07/25/2014 09:28 PM, scameron@beardog.cce.hp.com wrote:
>>>>> hpsa: Work In Progress: "lockless monster" patches
>>>>>
>>>>> To be clear, I am not suggesting that these patches be merged at this time.
>>>>>
>>>>> This patchset is vs. Christoph Hellwig's scsi-mq.4 branch which
>>>>> may be found here: git://git.infradead.org/users/hch/scsi.git
>>>>>
>>>>> We've been working for a long time on a patchset for hpsa to remove
>>>>> all the locks from the main i/o path in pursuit of high IOPS.  Some
>>>>> of that work is already upstream, but a lot more of it is not quite
>>>>> yet ready to be merged.  However, I think we've "gone dark" for a bit
>>>>> too long on this, and even though the patches aren't really ready to
>>>>> be merged just yet, I thought I should let other people who might be
>>>>> interested have a look anyway, as things are starting to be at least
>>>>> more stable than unstable.  Be warned though, there are still some
>>>>> problems, esp. around error recovery.
>>>>>
>>>>> That being said, with the right hardware (fast SAS SSDs, recent Smart
>>>>> Arrays e.g. P430, with up-to-date firmware, attached to recent disk enclosures)
>>>>> with these patches and the scsi-mq patches, it is possible to get around
>>>>> ~970k IOPS from a single logical drive on a single controller.
>>>>>
>>>>> There are about 150 patches in this set.  Rather than bomb the list
>>>>> with that, here is a link to a tarball of the patches in the form of
>>>>> an stgit patch series:
>>>>> https://github.com/smcameron/hpsa-lockless-patches-work-in-progress/blob/master/hpsa-lockless-vs-hch-scsi-mq.4-2014-07-25-1415CDT.tar.bz2?raw=true
>>>>>
>>>>> In some cases, I have probably erred on the side of having too many too
>>>>> small patches, on the theory that it is easier to bake a cake than to
>>>>> unbake a cake.  Before these are submitted "for reals", there will be
>>>>> some squashing of patches, along with other cleaning up.
>>>>>
>>>>> There are some patches in this set which are already upstream in
>>>>> James's tree which do not happen to be in Christoph's tree
>>>>> (most of which are named "*_sent_to_james").  There are also
>>>>> quite a few patches which are strictly for debugging and are not
>>>>> ever intended to be merged.
>>>>>
>>>>>
>>>>> Here is a list of all the patches sorted by author:
>>>>>
>>>>> Arnd Bergmann (1):
>>>>>     [SCSI] hpsa: fix non-x86 builds
>>>>>
>>>>> Christoph Hellwig (1):
>>>>>     reserve block tags in scsi host
>>>>>
>>>>> Joe Handzik (9):
>>>>>     hpsa: add masked physical devices into h->dev[] array
>>>>>     hpsa: add hpsa_bmic_id_physical_device function
>>>>>     hpsa: get queue depth for physical devices
>>>>>     hpsa: use ioaccel2 path to submit IOs to physical drives in HBA mode.
>>>>>     hpsa: Get queue depth from identify physical bmic for physical disks.
>>>>>     hpsa: add ioaccel sg chaining for the ioaccel2 path
>>>>>     Set the phys_disk value for a CommandList structure that is
>>>>>       submitted. Squash
>>>>>     hpsa: unmap ioaccel2 commands before, not after adding to resubmit
>>>>>       workqueue
>>>>>     hpsa: add more ioaccel2 error handling, including underrun statuses.
>>>>>
>>>>> Nicholas Bellinger (1):
>>>>>     hpsa: Convert SCSI LLD ->queuecommand() for host_lock less operation
>>>>>
>>>>> Robert Elliott (44):
>>>>>     Change scsi.c scsi_log_completion() to print strings for QUEUED,
>>>>>     Set scsi_logging_level to be more verbose to get better messages
>>>>>     hpsa: do not unconditionally copy sense data
>>>>>     hpsa: optimize cmd_alloc function by remembering last allocation
>>>>>     From a154642aeed291c7cfe4b9ea9da932156030f7d1 Mon Sep 17 00:00:00
>>>>>       2001
>>>>>     hpsa: Clean up host, channel, target, lun prints
>>>>>     hpsa: do not check cmd_alloc return value - it cannnot return NULL
>>>>>     hpsa: return -1 rather than -ENOMEM in hpsa_get_raid_map if fill_cmd
>>>>>       fails
>>>>>     hpsa: return -ENOMEM not 1 from ioaccel2_alloc_cmds_and_bft on
>>>>>       allocation failure
>>>>>     hpsa: return -ENOMEM not 1 from hpsa_alloc_ioaccel_cmd_and_bft on
>>>>>       allocation failure
>>>>>     hpsa: return -ENOMEM not -EFAULT from hpsa_passthru_ioctl on kmalloc
>>>>>       failure
>>>>>     hpsa: pass error from pci_set_consistent_dma_mask intact from
>>>>>       hpsa_message
>>>>>     hpsa: detect and report failures changing controller transport modes
>>>>>     hpsa: add hpsa_disable_interrupt_mode
>>>>>     hpsa: rename free_irqs to hpsa_free_irqs and move before
>>>>>       hpsa_request_irq
>>>>>     hpsa: rename hpsa_request_irq to hpsa_request_irqs
>>>>>     hpsa: on failure of request_irq, free the irqs that we did get
>>>>>     hpsa: make hpsa_pci_init disable interrupts and pci_disable_device on
>>>>>       critical failures
>>>>>     hpsa: fix a string constant broken into two lines
>>>>>     hpsa: avoid unneccesary calls to resource freeing functions in
>>>>>       hpsa_init_one
>>>>>     hpsa: add hpsa_free_cfgtables function to undo what
>>>>>       hpsa_find_cfgtables does
>>>>>     hpsa: report failure to ioremap config table
>>>>>     hpsa: add hpsa_free_pci_init function
>>>>>     hpsa: clean up error handling in hpsa_pci_init
>>>>>     hpsa: make hpsa_remove_one use hpsa_pci_free_init
>>>>>     hpsa: rename hpsa_alloc_ioaccel_cmd_and_bft to
>>>>>       hpsa_alloc_ioaccel1_cmd_and_bft
>>>>>     hpsa: rename hpsa_allocate_cmd_pool to hpsa_alloc_cmd_pool
>>>>>     hpsa: rename ioaccel2_alloc_cmds_and_bft to
>>>>>       hpsa_alloc_ioaccel2_cmd_and_bft
>>>>>     hpsa: fix memory leak in hpsa_alloc_cmd_pool
>>>>>     hpsa: return status not void from hpsa_put_ctlr_into_performant_mode
>>>>>     hpsa: clean up error handling in hpsa_init_one
>>>>>     hpsa: report allocation failures while allocating SG chain blocks
>>>>>     hpsa: rename hpsa_allocate_sg_chain_blocks to
>>>>>       hpsa_alloc_sg_chain_blocks
>>>>>     hpsa: improve allocation failure messages from hpsa_init_one
>>>>>     hpsa: fix minor if-statement style issue in hpsa_init_one
>>>>>     hpsa: fix wrong indent level in hpsa_init_one
>>>>>     Add hpsa_free_performant_mode and call it from hpsa_init_one
>>>>>     hpsa: shorten the wait for the CISS doorbell mode change
>>>>>       acknowledgement
>>>>>     hpsa: clean up some error reporting output in abort handler
>>>>>     hpsa: try to detect controller lockup in abort handler
>>>>>     hpsa: Return DID_NO_CONNECT rather than DID_ERR on lockup
>>>>>     hpsa: check for lockup on all simple_cmd submissions
>>>>>     hpsa: Support 64-bit lun values in printks
>>>>>     hpsa: do not print ioaccel2 warning messages about unusual
>>>>>       completions.
>>>>>
>>>>> Stephen M. Cameron (95):
>>>>>     scsi: break from queue depth adjusting loops when device found
>>>>>     hpsa: remove online devices from offline device list
>>>>>     hpsa: fix bad -ENOMEM return value in hpsa_big_passthru_ioctl
>>>>>     hpsa: make hpsa_init_one return -ENOMEM if allocation of
>>>>>       h->lockup_detected fails
>>>>>     hpsa: fix 6-byte READ/WRITE with 0 length data xfer
>>>>>     hpsa: remove spin lock around command allocation
>>>>>     hpsa: use atomics for commands_outstanding instead of spin locks
>>>>>     hpsa: reserve some commands for use by driver
>>>>>     hpsa: get rid of cmd_special_alloc and cmd_special_free
>>>>>     hpsa: do not queue commands internally in driver
>>>>>     hpsa: remove unused interrupt handler code
>>>>>     hpsa: remove direct lookup bit from command tags
>>>>>     hpsa: fix race between abort handler and main i/o path
>>>>>     hpsa: allow lockup detector to be turned off
>>>>>     hpsa: do not request device rescan on every ioaccel path error
>>>>>     hpsa: factor out hpsa_ciss_submit function
>>>>>     hpsa: use workqueue to resubmit failed ioaccel commands
>>>>>     hpsa: use driver instead of system work queue for command
>>>>>       resubmission
>>>>>     hpsa: remove 'action required' phrasing
>>>>>     hpsa: Count passthru cmds with atomics, not a spin locked int
>>>>>     hpsa: remove unused fifo_full function
>>>>>     hpsa: slightly optimize SA5_performant_completed
>>>>>     hpsa: do not check for msi(x) in interrupt_pending
>>>>>     hpsa: fix fail_all_outstanding_cmds to use reference counting
>>>>>     hpsa: fix endianness issue with scatter gather elements
>>>>>     hpsa: get rid of type/attribute/direction bit field where possible
>>>>>     hpsa: do not use function pointers in fast path command submission
>>>>>     hpsa: do not wait for aborted commands to complete after abort
>>>>>     hpsa: fix hpsa_drain_accel_commands to use reference counting instead
>>>>>       of bitmap
>>>>>     hpsa: fix race when updating raid map data.
>>>>>     hpsa: allow commands to be completed on specified reply queue
>>>>>     hpsa: fix completion race in abort handler
>>>>>     hpsa: fix aborting of reused commands
>>>>>     hpsa: lay groundwork for handling driver initiated command timeouts
>>>>>     hpsa: do not free command twice in hpsa_get_raid_map
>>>>>     hpsa: do not send aborts to logical devices that do not support
>>>>>       aborts
>>>>>     hpsa: remember whether devices support aborts across rescans
>>>>>     hpsa: do not send two aborts with swizzled tags.
>>>>>     hpsa: decrement h->commands_outstanding in fail_all_outstanding_cmds
>>>>>     hpsa: flush work queue hpsa_wq in fail_all_outstanding_cmds
>>>>>     hpsa: use per controller not per driver work queue for command
>>>>>       resubmission
>>>>>     hpsa: fix allocation sizes for CISS_REPORT_LUNs commands
>>>>>     hpsa: always use extended report physical LUNs for physical devices
>>>>>     hpsa: prevent manually adding masked physical devices
>>>>>     hpsa: fix values for expose status
>>>>>     hpsa: avoid unnecessary io in hpsa_get_pdisk_of_ioaccel2()
>>>>>     hpsa: do not be so noisy about check conditions
>>>>>     hpsa: handle descriptor format sense data where needed
>>>>>     hpsa: print CDBs instead of kernel virtual addresses for uncommon
>>>>>       errors
>>>>>     hpsa: handle Command Status TMF function status
>>>>>     hpsa: do not use a void pointer for scsi_cmd field of struct
>>>>>       CommandList
>>>>>     hpsa: return FAILED from abort handler if controller locked up.
>>>>>     hpsa: if controller locked up return failed from device reset handler
>>>>>     hpsa: check for ctlr lockup after command allocation in main io path
>>>>>     hpsa: mark masked devices as masked in output
>>>>>     hpsa: limit number of concurrent abort requests
>>>>>     hpsa: separate lockup detection and device rescanning
>>>>>     hpsa: factor out hpsa_init_cmd function
>>>>>     hpsa: reinitialize commands on resubmitting failed ioaccel commands
>>>>>     hpsa: fully initialize commands once at driver load not every time
>>>>>     hpsa: change driver version to 3.4.6
>>>>>     hpsa: allow lockup detected to be viewed via sysfs
>>>>>     hpsa: do not ignore return value of hpsa_register_scsi
>>>>>     hpsa: DEBUG ONLY - allow triggering command resubmitting via sysfs
>>>>>     hpsa: try resubmitting down raid path on task set full
>>>>>     hpsa: factor out hpsa_ioaccel_submit function
>>>>>     hpsa: try resubmitting down ioaccelerated path on task set full
>>>>>     hpsa: honor queue depth of physical devices
>>>>>     hpsa: use cancel_delayed_work_sync where needed
>>>>>     hpsa: clean up queue depth getting code
>>>>>     hpsa: try to fix ioaccel command accounting code
>>>>>     hpsa: fix hpsa_figure_phys_disk_ptrs to handle layout_map_count other
>>>>>       than 1
>>>>>     hpsa: close race in ioaccel_cmds_out accounting
>>>>>     hpsa: guard against overflowing raid map array
>>>>>     hpsa: test ioaccel_cmds_out vs. queue depth earlier to avoid wasting
>>>>>       time
>>>>>     hpsa: DEBUG ONLY - fix missing atomic-dec of ioaccel_cmds_out in
>>>>>       debug command resubmission code
>>>>>     hpsa: fix missing atomic_dec of dev->ioaccel_cmds_out
>>>>>     hpsa: do not ack controller events on controllers that do not support
>>>>>       it
>>>>>     hpsa: fix code that updates h->dev[]->phys_disk[]
>>>>>     hpsa: remove bug-on that triggers in hba mode by mistake
>>>>>     hpsa: remove unused temp64 variable from hpsa_cmd_init
>>>>>     hpsa bump driver version again
>>>>>     hpsa break hpsa_free_irqs_and_disable_msix into two functions
>>>>>     separate hpsa_free_cmd_pool into three separate function
>>>>>     hpsa: fix missing return in hpsa_command_resubmit_worker
>>>>>     hpsa: fix command leaks in hpsa_resubmit_command_worker
>>>>>     hpsa: fix bad interpretations of hpsa_ioaccel_submit return value
>>>>>     hpsa: do not touch phys_disk[] for ioaccel enabled logical drives
>>>>>       during rescan
>>>>>     hpsa: add change_queue_type function
>>>>>     hpsa: make hpsa_change_queue_depth handle reason SCSI_QDEPTH_QFULL
>>>>>     hpsa: initialize queue depth of logical drives reasonably
>>>>>     hpsa: attempt to fix queue depth calculations for logical drives
>>>>>     hpsa: add support sending aborts to physical devices via the ioaccel2
>>>>>       path
>>>>>     hpsa: fix problem with abort support checking happening too early
>>>>>     hpsa: remove incorrect bug on in hpsa_scsi_ioaccel_raid_map
>>>>>
>>>>> Webb Scales (3):
>>>>>     Initialize reference counts to zero on initial command entry
>>>>>       allocation
>>>>>     hpsa: make hpsa_send_reset_as_abort_ioaccel2() use the specified
>>>>>       reply queue.
>>>>>     hpsa:  use block layer tag for command allocation
>>>>>
>>>>>    drivers/scsi/hpsa.c       | 3642 ++++++++++++++++++++++++++++-----------------
>>>>>    drivers/scsi/hpsa.h       |  103 +-
>>>>>    drivers/scsi/hpsa_cmd.h   |  222 +++-
>>>>>    drivers/scsi/scsi.c       |   64 +-
>>>>>    drivers/scsi/scsi_error.c |    2 +
>>>>>    drivers/scsi/scsi_lib.c   |    1 +
>>>>>    include/scsi/scsi_host.h  |    9 +
>>>>>    7 files changed, 2561 insertions(+), 1482 deletions(-)
>>>>>
>>>>> -- steve
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

     prev parent reply	other threads:[~2014-08-04 13:38 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-07-25 19:28 [RFC] hpsa: work in progress "lockless monster" patches scameron
2014-07-26 15:32 ` Christoph Hellwig
2014-07-27 15:24 ` Hannes Reinecke
2014-07-31 16:35   ` scameron
2014-07-30 14:55 ` Tomas Henzl
2014-07-30 20:33   ` Webb Scales
2014-07-31 13:56     ` Tomas Henzl
2014-07-31 15:16       ` scameron
2014-08-04 13:13         ` Tomas Henzl
2014-08-01 16:40       ` Webb Scales
2014-08-04 13:36         ` Tomas Henzl [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=53DF8C70.2080306@redhat.com \
    --to=thenzl@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=bvanassche@acm.org \
    --cc=elliott@hp.com \
    --cc=hare@suse.de \
    --cc=hch@infradead.org \
    --cc=jbottomley@parallels.com \
    --cc=joseph.t.handzik@hp.com \
    --cc=justin.lindley@hp.com \
    --cc=linux-scsi@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=scameron@beardog.cce.hp.com \
    --cc=scott.teel@hp.com \
    --cc=webbnh@hp.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).