public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed
From: Martin Wilck <mwilck@suse.com>
To: "Martin K. Petersen" <martin.petersen@oracle.com>,
	Ranjan Kumar <ranjan.kumar@broadcom.com>,
	Don Brace <don.brace@microchip.com>,
	Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>,
	linux-scsi@vger.kernel.org, Lee Duncan <lduncan@suse.com>,
	Yihang Li <liyihang9@h-partners.com>,
	Jack Wang <jinpu.wang@cloud.ionos.com>,
	John Garry <john.g.garry@oracle.com>
Subject: Re: [PATCH v2 0/2] Fix SAS wildcard scan on smartpqi and other controllers
Date: Mon, 04 May 2026 18:58:48 +0200	[thread overview]
Message-ID: <61b4da9dcbee8fd71d1ecb2cfdca5c2408528bd7.camel@suse.com> (raw)
In-Reply-To: <yq1y0i4csyi.fsf@ca-mkp.ca.oracle.com>

Hi Martin, Ranjan, Don, Hannes, all,

I need some advice / guidance.

TL;DR:

- Does it make sense to pursue the idea of this patch by adding proper
locking and making some additional amendments?
- Otherwise, can we revert 37c4e72b0651 and find a different solution
for rescanning NVMe drives on mpt3sas and mpi3mr?

Long version:

Thanks for the link with the Gemini review. My current approach is
arguably too simplistic.

The question is, then, what to do about 37c4e72b0651 ("scsi: Fix
sas_user_scan() to handle wildcard and multi-channel scans"). As a
matter of fact, this commit introduces severe regressions (system
stall) at least for smartpqi and hisi_sas. 

Looking back at the discussion about this commit [2], it was submitted
to fix scanning of NVMe drives, which mpt3sas and mpi3mr may present as
SCSI targets in a certain mode of operation, in which these drivers use
the channel number to distinguish NVMe and SCSI devices from each
other. This use of the channel number is inconsistent with how other
SAS drivers use it. Therefore this commit broke SAS scanning for other
drivers. The obvious fix would  be to revert 37c4e72b0651. This is
actually what we (SUSE) have done in our downstream kernel so far.

This would mean that Broadcom would need to come up with a different
way to handle the scanning for mpt3sas and mpi3mr in "Tri-mode".

Other than that: I still believe that the general idea of using
scan_start() and scan_finished() not only for the initial host adapter
scan but also for rescan operations is reasonable. After all, a "dumb"
wildcard scan doesn't make a lot of sense for drivers "which find their
own targets", for which scan_start() and scan_finished were introduced
in 1aa8fab2acf1 ("[SCSI] Make scsi_scan_host work for drivers which
find their own targets").

But the issues pointed out by Gemini are real, and while I could fix
the locking, the problem that some drivers may reinitialize the
hardware in scan_start() is tricky. Some detailed replies on Gemini's
review are appended below.

Thanks,
Martin

[1] https://sashiko.dev/#/patchset/20260421202018.511388-1-mwilck%40suse.com
[2] https://lore.kernel.org/linux-scsi/CAFdVvOwjy+2ORJ6uJkspiLTPF05481U7gcS4QohFOFGPqAs8ig@mail.gmail.com/

On Thu, 2026-04-30 at 12:30 -0400, Martin K. Petersen wrote:
> 
> https://sashiko.dev/#/patchset/20260421202018.511388-1-mwilck%40suse.com


P.S: Detailed replies to Gemini's review:

> Does bypassing scsi_scan_host_selected() here skip necessary power
> management and serialization steps?
> 
> Typically, wildcard scans go through scsi_scan_host_selected(), which
> secures the host adapter via scsi_autopm_get_host(), checks
> scsi_host_scan_allowed(), and takes shost->scan_mutex.
>
> By calling do_scsi_scan_host() directly, it appears scan_start()
could be
> invoked while the host adapter is runtime-suspended, which might lead
to
> hardware faults.

The call to scsi_scan_host_selected() was introduced in commit
37c4e72b0651, which my patch is intended to fix; before that commit,
scan_mutex was not taken in the sas_user_scan() code path. It's still
explicitly not taken for the important case of scanning channel 0;
sas_user_scan() takes only sas_host->lock, which protects just the rphy
list.

So yes, sas_user_scan_with_scan_start() code path should acquire the
scan_mutex. I suppose that the mutex should be acquired in
sas_user_scan(), actually.

> Also, without holding scan_mutex and checking
scsi_host_scan_allowed(),
> is there a race condition if a sysfs scan is triggered concurrently
with
> driver removal? If scsi_remove_host() tears down the host
asynchronously,
> the while loop in do_scsi_scan_host() might poll shost->hostt-
>scan_finished()
> after the driver module has unloaded, potentially causing a use-
after-free.

True. Taking the scan_mutex should fix this.

> Similarly, multiple concurrent sysfs writes could execute
scan_start()
> simultaneously without scan_mutex protection. Can this corrupt
internal
> driver state?

Right. Again, this can be fixed by taking scan_mutex.

> Additionally, is it safe to repurpose scan_start() for runtime sysfs
> scans?
> 
> Historically, shost->hostt->scan_start() seems to be used exclusively
> for one-time asynchronous boot initialization in do_scan_async().
> 
> Many SAS drivers might implement this as a hardware initialization
routine
> that resets physical links.
> 
> If exposed to a runtime userspace sysfs trigger, could this
unintentionally
> bounce active SAS links, drop active I/O, or cause controller hangs
for
> drivers that assume scan_start() is only used during boot?

This is not the case for smartpqi, but other drivers (e.g. hisi_sas)
actually seem to enable PHYs in the scan_start() code path. Not sure
how this can be fixed in general. AFAICS it would require thorough
understanding of the hardware / firmware of all affected drivers, which
I don't have.

Would it make sense pass a flag to scan_start() indicating whether we
initiate a first-time scan or a rescan, and skip the hardware
initialization in the latter case?

      reply	other threads:[~2026-05-04 16:58 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-21 20:20 [PATCH v2 0/2] Fix SAS wildcard scan on smartpqi and other controllers Martin Wilck
2026-04-21 20:20 ` [PATCH v2 1/2] scsi: smartpqi: use shost_to_hba() in pqi_scan_finished() Martin Wilck
2026-04-21 20:20 ` [PATCH v2 2/2] scsi: sas_user_scan: use scan_start if available Martin Wilck
2026-04-30 16:30 ` [PATCH v2 0/2] Fix SAS wildcard scan on smartpqi and other controllers Martin K. Petersen
2026-05-04 16:58   ` Martin Wilck [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=61b4da9dcbee8fd71d1ecb2cfdca5c2408528bd7.camel@suse.com \
    --to=mwilck@suse.com \
    --cc=don.brace@microchip.com \
    --cc=hare@suse.de \
    --cc=hch@lst.de \
    --cc=jinpu.wang@cloud.ionos.com \
    --cc=john.g.garry@oracle.com \
    --cc=lduncan@suse.com \
    --cc=linux-scsi@vger.kernel.org \
    --cc=liyihang9@h-partners.com \
    --cc=martin.petersen@oracle.com \
    --cc=ranjan.kumar@broadcom.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox