From: John Garry <john.g.garry@oracle.com>
To: Benjamin Marzinski <bmarzins@redhat.com>,
Keith Busch <kbusch@kernel.org>
Cc: hch@lst.de, sagi@grimberg.me, axboe@fb.com,
martin.petersen@oracle.com,
james.bottomley@hansenpartnership.com, hare@suse.com,
jmeneghi@redhat.com, linux-nvme@lists.infradead.org,
linux-scsi@vger.kernel.org, michael.christie@oracle.com,
snitzer@kernel.org, dm-devel@lists.linux.dev,
linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 09/13] libmultipath: Add PR support
Date: Mon, 2 Mar 2026 10:45:00 +0000
Message-ID: <add7aabf-fa2c-472b-bedb-e8713176c692@oracle.com>
In-Reply-To: <aaHecneNg9Q8EtiS@redhat.com>
On 27/02/2026 18:12, Benjamin Marzinski wrote:
>> Instead of having the lower layer define new mp template functions, why
>> not use the existing pr_ops from mpath_device->disk->fops->pr_ops?
> I don't think that's the right answer. The regular scsi persistent
> reservation functions simply won't work on a multipath device. Even just
> a simple reservation fails.
>
> For example (with /dev/sda being multipath device 0):
> # echo round-robin > /sys/class/scsi_mpath_device/0/iopolicy
> # blkpr -c register -k 0x1 /dev/sda
> # blkpr -c reserve -k 0x1 -t exclusive-access-reg-only /dev/sda
> # dd if=/dev/sda of=/dev/null iflag=direct count=100
> dd: error reading '/dev/sda': Invalid exchange
> 1+0 records in
> 1+0 records out
> 512 bytes copied, 0.00871312 s, 58.8 kB/s
>
> Here are the kernel messages:
> [ 3494.660401] sd 7:0:1:0: reservation conflict
> [ 3494.661802] sd 7:0:1:0: [sda:1] tag#768 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> [ 3494.664848] sd 7:0:1:0: [sda:1] tag#768 CDB: Read(10) 28 00 00 00 00 01 00 00 01 00
> [ 3494.667092] reservation conflict error, dev sda:1, sector 1 op 0x0:(READ) flags 0x2800800 phys_seg 1 prio class 2
>
> If you don't have a multipathed scsi device to try this on, you can run:
>
> targetcli <<EOF
> /backstores/ramdisk create mptest 1G
> /loopback create naa.5001401111111111
> /loopback create naa.5001402222222222
> /loopback create naa.5001403333333333
> /loopback create naa.5001404444444444
> /loopback/naa.5001401111111111/luns create /backstores/ramdisk/mptest
> /loopback/naa.5001402222222222/luns create /backstores/ramdisk/mptest
> /loopback/naa.5001403333333333/luns create /backstores/ramdisk/mptest
> /loopback/naa.5001404444444444/luns create /backstores/ramdisk/mptest
> EOF
>
> to create one.
>
> Handling scsi Persistent Reservations on a multipath device is painful.
> Here is a non-exhaustive list of the problems with trying to make a
> multipath device act like a single scsi device for persistent
> reservation purposes:
>
> You need to register the key on all the I_T Nexuses. You can't just pick
> a single path. Otherwise, when you set up the reservation, you will only
> be able to do IO on one of the paths. That's what happened above.
OK, thanks for the pointer. This does not sound too difficult to
implement, but obviously it will require special handling (vs NVMe).
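
For anyone following along, the failure above can be reproduced with a toy
Python model. This is only an illustrative sketch, not kernel or
libmultipath code: ToyLun, round_robin_reads() and the path names are made
up, and the reservation is assumed to be of a registrants-only type, as in
the blkpr example.

```python
from itertools import cycle


class ToyLun:
    """Toy model of one LUN; not real SCSI or kernel code."""

    def __init__(self):
        self.keys = {}         # I_T nexus name -> registered key
        self.reserved = False  # True once a registrants-only reservation is held

    def register(self, nexus, key):
        self.keys[nexus] = key

    def reserve(self, nexus):
        if nexus not in self.keys:
            raise PermissionError("reservation conflict")
        self.reserved = True

    def read(self, nexus):
        # With a registrants-only reservation, only registered nexuses may do IO.
        if self.reserved and nexus not in self.keys:
            return "reservation conflict"
        return "ok"


def round_robin_reads(lun, paths, count):
    """Issue `count` reads, alternating over the paths like a round-robin policy."""
    picker = cycle(paths)
    return [lun.read(next(picker)) for _ in range(count)]


paths = ["path0", "path1"]

# Key registered through a single path only.
one_path = ToyLun()
one_path.register("path0", 0x1)
one_path.reserve("path0")
single = round_robin_reads(one_path, paths, 4)

# Key registered on every I_T nexus before reserving.
all_paths = ToyLun()
for p in paths:
    all_paths.register(p, 0x1)
all_paths.reserve("path0")
everywhere = round_robin_reads(all_paths, paths, 4)

print(single)
print(everywhere)
```

In the first case every second read hits the unregistered path and
conflicts, which is the same pattern as the dd run above (one record in,
then an error).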
>
> If a path is down when you do the reservation, you might not be able to
> register the key on that path. You certainly can't do it directly.
> Using the All Target Ports bit (assuming the device supports it) could
> let you extend a reservation from one target port to others, assuming
> your path isn't down because of a connection issue on the host side. But
> in general, you have to be able to handle the case where you can't
> register (or unregister) a key on your failed paths. If you don't do
> that (un)registration when the path comes up, before it can get selected
> for handling IO, you will fail when accessing a path you should be
> allowed to access, or succeed in accessing a path that you should not
> be allowed to access.
Understood
>
> The same is true when new paths are discovered. You need to register
> them.
>
> Except that a preempt can come and remove your registration at any time.
> You can't register the new (or newly active) path if the key has been
> preempted, and this preemption can happen at any moment, even after you
> check if the other paths are still registered. If this isn't handled
> correctly, paths can access storage that they should not be allowed to
> access.
right
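
The check-then-register window can be shown with a small Python sketch.
Again this is a toy model under loud assumptions: ToyLun and
add_path_racy() are hypothetical names, the callback merely stands in for
another host issuing a preempt at the worst moment, and the preempt here
is simplified to "drop every registration using that key".

```python
class ToyLun:
    """Toy model of the registrations on one LUN; not real SCSI code."""

    def __init__(self):
        self.keys = {}  # I_T nexus name -> registered key

    def register(self, nexus, key):
        self.keys[nexus] = key

    def preempt(self, victim_key):
        # Simplified: drop every registration that uses victim_key.
        self.keys = {n: k for n, k in self.keys.items() if k != victim_key}

    def is_registered(self, key):
        return key in self.keys.values()


def add_path_racy(lun, new_nexus, key, between_check_and_register=None):
    """Naive check-then-register for a new (or newly active) path."""
    if not lun.is_registered(key):
        return False  # key already preempted, do not register
    if between_check_and_register:
        between_check_and_register()  # another host acts inside the window
    lun.register(new_nexus, key)
    return True


lun = ToyLun()
lun.register("path0", 0x1)

# The preempt lands after the check but before the register.
added = add_path_racy(lun, "path1", 0x1, lambda: lun.preempt(0x1))

print(added, lun.keys)
```

The check passes, the preempt removes the key, and the new path is
registered anyway, so it ends up with access that it should not have.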
>
> Changing the reservation type (for instance from
> exclusive-access-reg-only to write-exclusive-reg-only) in scsi devices
> is done by preempting the existing reservation. This will remove the
> registered keys from every path except the one issuing the command. The
> key needs to be reregistered on all the other paths again. If any IO
> goes to these paths before they are reregistered, it will fail with a
> reservation conflict, so IO needs to be suspended during this time.
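
As a rough illustration of why IO has to be held across the type change,
here is a toy Python sketch of the sequence just described. It is not
dm-mpath or libmultipath code: ToyLun and change_type() are made-up names,
and the preempt is modelled only as far as the description above goes (the
key is dropped from every path except the issuer, then re-registered).

```python
class ToyLun:
    """Toy model of one LUN under a registrants-only reservation."""

    def __init__(self, paths, key, rtype):
        self.keys = {p: key for p in paths}  # I_T nexus name -> registered key
        self.rtype = rtype

    def preempt(self, issuer, key, new_type):
        # The key is removed from every path except the one issuing the
        # command, and the reservation type changes.
        self.keys = {n: k for n, k in self.keys.items()
                     if n == issuer or k != key}
        self.rtype = new_type

    def register(self, nexus, key):
        self.keys[nexus] = key

    def write(self, nexus):
        return "ok" if nexus in self.keys else "reservation conflict"


def change_type(lun, paths, issuer, key, new_type):
    """Return what writes would see mid-change and after re-registration."""
    lun.preempt(issuer, key, new_type)
    # Window in which IO must be suspended: only the issuer is registered.
    mid_change = {p: lun.write(p) for p in paths}
    for p in paths:
        if p != issuer:
            lun.register(p, key)
    after = {p: lun.write(p) for p in paths}
    return mid_change, after


paths = ["path0", "path1", "path2"]
lun = ToyLun(paths, 0x1, "exclusive-access-reg-only")
mid_change, after = change_type(lun, paths, "path0", 0x1,
                                "write-exclusive-reg-only")
print(mid_change)
print(after)
```

Any write routed to path1 or path2 inside that window would see a
reservation conflict, hence the need to suspend IO until the
re-registration is complete.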
>
> The path that is holding the reservation might be down. In this case,
> you aren't able to release the reservation from that path. The only way
> I figured out to handle this in dm-mpath was for the device to preempt
> its own key, to move the reservation to a working path. This causes the
> same issues as preempting the key to change the reservation type, where
> you need to reregister all the paths with IO suspended.
>
> An actual preemption can come in from another machine while you are
> doing this. In that case, you must not reregister the paths, and if you
> already started, you must unregister them.
>
> I can probably come up with more issues.
This all is becoming complicated... :)
>
> I think the best course of action for now is to just fail persistent
> reservations as non-supported for scsi devices. IMHO making them work
> correctly (where multipath device IO won't fail when it should succeed,
> and succeed when it should fail with a reservation conflict) dwarfs the
> amount of work necessary to support ALUA.
Yeah, that sounds reasonable, but I want to ensure that the libmultipath
API does not later change here in a way that disrupts NVMe support (if
indeed NVMe goes on to use libmultipath).
>
> dm-mpath previously did a pretty good job handling Persistent
> Reservations. But recently it became much better, because it became very
> clear that pretty good is not good enough for what people want to do
> with Persistent Reservations and multipath devices.
Thanks for the feedback. I'll check these details further now.