* [PATCH v2 1/5] scsi: generalize scsi_SG_IO_FROM_DEV() to scsi_SG_IO()
2026-01-26 19:47 [PATCH v2 0/5] scsi: persistent reservation live migration Stefan Hajnoczi
@ 2026-01-26 19:47 ` Stefan Hajnoczi
2026-01-26 19:47 ` [PATCH v2 2/5] scsi: add error reporting " Stefan Hajnoczi
` (4 subsequent siblings)
5 siblings, 0 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2026-01-26 19:47 UTC
To: qemu-devel
Cc: pkrempa, Paolo Bonzini, Alberto Faria, Hannes Reinecke, Zhao Liu,
Kevin Wolf, qemu-block, Fam Zheng, Philippe Mathieu-Daudé,
Eduardo Habkost, Qing Wang, Yanan Wang, Marcel Apfelbaum,
Stefan Hajnoczi
Add a direction argument so that scsi_SG_IO() can be used for
SG_DXFER_FROM_DEV and SG_DXFER_TO_DEV transfers.
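As a rough illustration of what the new argument selects, here is a minimal standalone sketch that fills in a Linux sg_io_hdr_t for either direction. The helper name example_fill_sg_hdr() and its parameter list are hypothetical and not part of this patch; only sg_io_hdr_t and the SG_DXFER_* constants come from <scsi/sg.h>.

```c
#include <stdint.h>
#include <string.h>
#include <scsi/sg.h>   /* Linux sg_io_hdr_t and SG_DXFER_* constants */

/*
 * Hypothetical helper (not from the patch): prepare an SG_IO header for a
 * transfer in the given direction. A caller would then issue
 * ioctl(fd, SG_IO, hdr) on an open SCSI device node.
 */
static void example_fill_sg_hdr(sg_io_hdr_t *hdr, int direction,
                                uint8_t *cmd, uint8_t cmd_size,
                                uint8_t *buf, uint8_t buf_size,
                                uint8_t *sense, uint8_t sense_size,
                                uint32_t timeout_ms)
{
    memset(hdr, 0, sizeof(*hdr));
    hdr->interface_id = 'S';            /* required by the sg v3 interface */
    hdr->dxfer_direction = direction;   /* SG_DXFER_FROM_DEV or SG_DXFER_TO_DEV */
    hdr->dxfer_len = buf_size;
    hdr->dxferp = buf;
    hdr->cmdp = cmd;
    hdr->cmd_len = cmd_size;
    hdr->sbp = sense;
    hdr->mx_sb_len = sense_size;
    hdr->timeout = timeout_ms;          /* sg timeouts are in milliseconds */
}
```

With SG_DXFER_FROM_DEV the buffer receives data from the device (for example an INQUIRY reply); with SG_DXFER_TO_DEV the buffer carries a parameter list to the device.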
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
include/hw/scsi/scsi.h | 4 ++--
hw/scsi/scsi-disk.c | 4 ++--
hw/scsi/scsi-generic.c | 18 ++++++++++--------
3 files changed, 14 insertions(+), 12 deletions(-)
diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index d26f1127bb..670c477e38 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -236,8 +236,8 @@ void scsi_device_report_change(SCSIDevice *dev, SCSISense sense);
void scsi_device_unit_attention_reported(SCSIDevice *dev);
void scsi_generic_read_device_inquiry(SCSIDevice *dev);
int scsi_device_get_sense(SCSIDevice *dev, uint8_t *buf, int len, bool fixed);
-int scsi_SG_IO_FROM_DEV(BlockBackend *blk, uint8_t *cmd, uint8_t cmd_size,
- uint8_t *buf, uint8_t buf_size, uint32_t timeout);
+int scsi_SG_IO(BlockBackend *blk, int direction, uint8_t *cmd, uint8_t cmd_size,
+ uint8_t *buf, uint8_t buf_size, uint32_t timeout);
SCSIDevice *scsi_device_find(SCSIBus *bus, int channel, int target, int lun);
SCSIDevice *scsi_device_get(SCSIBus *bus, int channel, int target, int lun);
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 0f896c27f4..97ae535a27 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2748,8 +2748,8 @@ static int get_device_type(SCSIDiskState *s)
cmd[0] = INQUIRY;
cmd[4] = sizeof(buf);
- ret = scsi_SG_IO_FROM_DEV(s->qdev.conf.blk, cmd, sizeof(cmd),
- buf, sizeof(buf), s->qdev.io_timeout);
+ ret = scsi_SG_IO(s->qdev.conf.blk, SG_DXFER_FROM_DEV, cmd, sizeof(cmd),
+ buf, sizeof(buf), s->qdev.io_timeout);
if (ret < 0) {
return -1;
}
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index 6acaf8831a..61511cf945 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -525,8 +525,9 @@ static int read_naa_id(const uint8_t *p, uint64_t *p_wwn)
return -EINVAL;
}
-int scsi_SG_IO_FROM_DEV(BlockBackend *blk, uint8_t *cmd, uint8_t cmd_size,
- uint8_t *buf, uint8_t buf_size, uint32_t timeout)
+int scsi_SG_IO(BlockBackend *blk, int direction, uint8_t *cmd,
+ uint8_t cmd_size, uint8_t *buf, uint8_t buf_size,
+ uint32_t timeout)
{
sg_io_hdr_t io_header;
uint8_t sensebuf[8];
@@ -534,7 +535,7 @@ int scsi_SG_IO_FROM_DEV(BlockBackend *blk, uint8_t *cmd, uint8_t cmd_size,
memset(&io_header, 0, sizeof(io_header));
io_header.interface_id = 'S';
- io_header.dxfer_direction = SG_DXFER_FROM_DEV;
+ io_header.dxfer_direction = direction;
io_header.dxfer_len = buf_size;
io_header.dxferp = buf;
io_header.cmdp = cmd;
@@ -574,8 +575,8 @@ static void scsi_generic_set_vpd_bl_emulation(SCSIDevice *s)
cmd[2] = 0x00;
cmd[4] = sizeof(buf);
- ret = scsi_SG_IO_FROM_DEV(s->conf.blk, cmd, sizeof(cmd),
- buf, sizeof(buf), s->io_timeout);
+ ret = scsi_SG_IO(s->conf.blk, SG_DXFER_FROM_DEV, cmd, sizeof(cmd),
+ buf, sizeof(buf), s->io_timeout);
if (ret < 0) {
/*
* Do not assume anything if we can't retrieve the
@@ -610,8 +611,8 @@ static void scsi_generic_read_device_identification(SCSIDevice *s)
cmd[2] = 0x83;
cmd[4] = sizeof(buf);
- ret = scsi_SG_IO_FROM_DEV(s->conf.blk, cmd, sizeof(cmd),
- buf, sizeof(buf), s->io_timeout);
+ ret = scsi_SG_IO(s->conf.blk, SG_DXFER_FROM_DEV, cmd, sizeof(cmd),
+ buf, sizeof(buf), s->io_timeout);
if (ret < 0) {
return;
}
@@ -662,7 +663,8 @@ static int get_stream_blocksize(BlockBackend *blk)
cmd[0] = MODE_SENSE;
cmd[4] = sizeof(buf);
- ret = scsi_SG_IO_FROM_DEV(blk, cmd, sizeof(cmd), buf, sizeof(buf), 6);
+ ret = scsi_SG_IO(blk, SG_DXFER_FROM_DEV, cmd, sizeof(cmd),
+ buf, sizeof(buf), 6);
if (ret < 0) {
return -1;
}
--
2.52.0
* [PATCH v2 2/5] scsi: add error reporting to scsi_SG_IO()
2026-01-26 19:47 [PATCH v2 0/5] scsi: persistent reservation live migration Stefan Hajnoczi
2026-01-26 19:47 ` [PATCH v2 1/5] scsi: generalize scsi_SG_IO_FROM_DEV() to scsi_SG_IO() Stefan Hajnoczi
@ 2026-01-26 19:47 ` Stefan Hajnoczi
2026-01-26 19:47 ` [PATCH v2 3/5] scsi: track SCSI reservation state for live migration Stefan Hajnoczi
` (3 subsequent siblings)
5 siblings, 0 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2026-01-26 19:47 UTC
To: qemu-devel
Cc: pkrempa, Paolo Bonzini, Alberto Faria, Hannes Reinecke, Zhao Liu,
Kevin Wolf, qemu-block, Fam Zheng, Philippe Mathieu-Daudé,
Eduardo Habkost, Qing Wang, Yanan Wang, Marcel Apfelbaum,
Stefan Hajnoczi
Report the details of the SG_IO ioctl failure if an Error pointer is
provided. This information aids troubleshooting and will be used by the
SCSI Persistent Reservations migration code.
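For readers unfamiliar with the sense buffer dump in the error message, the following is a minimal sketch of the same formatting idea written as a loop. The helper example_hex_dump() is hypothetical; the patch itself spells out the eight bytes in a single g_strdup_printf() call.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from the patch): render len bytes as lowercase
 * hex digits. The caller must provide at least 2 * len + 1 bytes in out.
 */
static void example_hex_dump(const uint8_t *data, size_t len, char *out)
{
    for (size_t i = 0; i < len; i++) {
        /* Each byte becomes two hex digits; snprintf adds the NUL itself */
        snprintf(out + 2 * i, 3, "%02x", data[i]);
    }
    out[2 * len] = '\0';   /* also terminates the string when len is 0 */
}
```

An 8-byte sense buffer therefore needs a 17-byte output buffer.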
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
include/hw/scsi/scsi.h | 2 +-
hw/scsi/scsi-disk.c | 2 +-
hw/scsi/scsi-generic.c | 33 ++++++++++++++++++++++++++++-----
3 files changed, 30 insertions(+), 7 deletions(-)
diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index 670c477e38..89b1ed6258 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -237,7 +237,7 @@ void scsi_device_unit_attention_reported(SCSIDevice *dev);
void scsi_generic_read_device_inquiry(SCSIDevice *dev);
int scsi_device_get_sense(SCSIDevice *dev, uint8_t *buf, int len, bool fixed);
int scsi_SG_IO(BlockBackend *blk, int direction, uint8_t *cmd, uint8_t cmd_size,
- uint8_t *buf, uint8_t buf_size, uint32_t timeout);
+ uint8_t *buf, uint8_t buf_size, uint32_t timeout, Error **errp);
SCSIDevice *scsi_device_find(SCSIBus *bus, int channel, int target, int lun);
SCSIDevice *scsi_device_get(SCSIBus *bus, int channel, int target, int lun);
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 97ae535a27..76fe5f085b 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2749,7 +2749,7 @@ static int get_device_type(SCSIDiskState *s)
cmd[4] = sizeof(buf);
ret = scsi_SG_IO(s->qdev.conf.blk, SG_DXFER_FROM_DEV, cmd, sizeof(cmd),
- buf, sizeof(buf), s->qdev.io_timeout);
+ buf, sizeof(buf), s->qdev.io_timeout, NULL);
if (ret < 0) {
return -1;
}
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index 61511cf945..2af8803644 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -527,10 +527,10 @@ static int read_naa_id(const uint8_t *p, uint64_t *p_wwn)
int scsi_SG_IO(BlockBackend *blk, int direction, uint8_t *cmd,
uint8_t cmd_size, uint8_t *buf, uint8_t buf_size,
- uint32_t timeout)
+ uint32_t timeout, Error **errp)
{
sg_io_hdr_t io_header;
- uint8_t sensebuf[8];
+ uint8_t sensebuf[8] = {};
int ret;
memset(&io_header, 0, sizeof(io_header));
@@ -550,6 +550,29 @@ int scsi_SG_IO(BlockBackend *blk, int direction, uint8_t *cmd,
io_header.driver_status || io_header.host_status) {
trace_scsi_generic_ioctl_sgio_done(cmd[0], ret, io_header.status,
io_header.host_status);
+ if (ret < 0) {
+ error_setg_errno(errp, -ret, "SG_IO ioctl failed");
+ } else {
+ g_autofree char *sensebuf_hex =
+ g_strdup_printf("%02x%02x%02x%02x%02x%02x%02x%02x",
+ sensebuf[0],
+ sensebuf[1],
+ sensebuf[2],
+ sensebuf[3],
+ sensebuf[4],
+ sensebuf[5],
+ sensebuf[6],
+ sensebuf[7]);
+
+ error_setg(errp, "SG_IO SCSI command failed with status=0x%x "
+ "driver_status=0x%x host_status=0x%x sensebuf=%s "
+ "sb_len_wr=%u",
+ io_header.status,
+ io_header.driver_status,
+ io_header.host_status,
+ sensebuf_hex,
+ io_header.sb_len_wr);
+ }
return -1;
}
return 0;
@@ -576,7 +599,7 @@ static void scsi_generic_set_vpd_bl_emulation(SCSIDevice *s)
cmd[4] = sizeof(buf);
ret = scsi_SG_IO(s->conf.blk, SG_DXFER_FROM_DEV, cmd, sizeof(cmd),
- buf, sizeof(buf), s->io_timeout);
+ buf, sizeof(buf), s->io_timeout, NULL);
if (ret < 0) {
/*
* Do not assume anything if we can't retrieve the
@@ -612,7 +635,7 @@ static void scsi_generic_read_device_identification(SCSIDevice *s)
cmd[4] = sizeof(buf);
ret = scsi_SG_IO(s->conf.blk, SG_DXFER_FROM_DEV, cmd, sizeof(cmd),
- buf, sizeof(buf), s->io_timeout);
+ buf, sizeof(buf), s->io_timeout, NULL);
if (ret < 0) {
return;
}
@@ -664,7 +687,7 @@ static int get_stream_blocksize(BlockBackend *blk)
cmd[4] = sizeof(buf);
ret = scsi_SG_IO(blk, SG_DXFER_FROM_DEV, cmd, sizeof(cmd),
- buf, sizeof(buf), 6);
+ buf, sizeof(buf), 6, NULL);
if (ret < 0) {
return -1;
}
--
2.52.0
* [PATCH v2 3/5] scsi: track SCSI reservation state for live migration
2026-01-26 19:47 [PATCH v2 0/5] scsi: persistent reservation live migration Stefan Hajnoczi
2026-01-26 19:47 ` [PATCH v2 1/5] scsi: generalize scsi_SG_IO_FROM_DEV() to scsi_SG_IO() Stefan Hajnoczi
2026-01-26 19:47 ` [PATCH v2 2/5] scsi: add error reporting " Stefan Hajnoczi
@ 2026-01-26 19:47 ` Stefan Hajnoczi
2026-01-26 19:47 ` [PATCH v2 4/5] scsi: save/load SCSI reservation state Stefan Hajnoczi
` (2 subsequent siblings)
5 siblings, 0 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2026-01-26 19:47 UTC
To: qemu-devel
Cc: pkrempa, Paolo Bonzini, Alberto Faria, Hannes Reinecke, Zhao Liu,
Kevin Wolf, qemu-block, Fam Zheng, Philippe Mathieu-Daudé,
Eduardo Habkost, Qing Wang, Yanan Wang, Marcel Apfelbaum,
Stefan Hajnoczi
SCSI Persistent Reservations are stateful and external to the guest. In
order to transparently move reservations to the destination host during
live migration, it is necessary to track the state built up on the
source host before migration. Only then can the destination host ensure
an equivalent state is restored upon migration.
Snoop on successful PERSISTENT RESERVE OUT commands and save the
reservation key and reservation type. This will allow registered keys
and reservations to be migrated.
Also patch PERSISTENT RESERVE IN replies with the REPORT CAPABILITIES
service action since features that involve the physical SCSI bus target
ports must not be exposed to the guest (it sees a virtual SCSI bus).
From the guest's point of view, a reservation is usually established as follows:
1. The guest invokes the REGISTER service action to register a
reservation key on its I_T nexus.
2. The guest invokes the RESERVE service action to create a reservation
using the previously-registered key.
This commit implements the snooping and stores the reservation key and
type (if any) for each LUN. The snooped PR state and the migrate_pr flag
to enable PR migration will be used in later commits.
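The snooping logic can be sketched as a small state machine. This is a simplified standalone illustration, not the code from the diff below: the EX_ names and ExamplePRState are hypothetical, locking is omitted, and the PREEMPT and REPLACE LOST RESERVATION cases are left out.

```c
#include <stdint.h>

/* Service action codes, matching the values added to constants.h */
enum {
    EX_PRO_REGISTER = 0x00,
    EX_PRO_RESERVE = 0x01,
    EX_PRO_RELEASE = 0x02,
    EX_PRO_CLEAR = 0x03,
    EX_PRO_REGISTER_AND_IGNORE_EXISTING_KEY = 0x06,
};

/* Hypothetical per-LUN state: 0 means "no key" / "no reservation" */
typedef struct {
    uint64_t key;
    uint8_t resv_type;
} ExamplePRState;

/* Update the tracked state after a successful PERSISTENT RESERVE OUT */
static void example_pr_snoop(ExamplePRState *st, uint8_t service_action,
                             uint8_t resv_type, uint64_t old_key,
                             uint64_t new_key)
{
    switch (service_action) {
    case EX_PRO_REGISTER:
    case EX_PRO_REGISTER_AND_IGNORE_EXISTING_KEY:
        if (service_action == EX_PRO_REGISTER &&
            old_key == 0 && new_key == 0) {
            break; /* nothing was registered or unregistered */
        }
        st->key = new_key;
        if (new_key == 0) {
            st->resv_type = 0; /* unregistering drops the reservation */
        }
        break;
    case EX_PRO_RESERVE:
        st->resv_type = resv_type;
        break;
    case EX_PRO_RELEASE:
        st->resv_type = 0;
        break;
    case EX_PRO_CLEAR:
        st->key = 0;
        st->resv_type = 0;
        break;
    default:
        break; /* other service actions are not covered by this sketch */
    }
}
```

Feeding the two steps above into this sketch (REGISTER with a new key, then RESERVE with a type) leaves both fields non-zero, which is the state that later patches migrate.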
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
include/hw/scsi/scsi.h | 10 +++
include/scsi/constants.h | 21 +++++
hw/scsi/scsi-bus.c | 3 +
hw/scsi/scsi-generic.c | 160 +++++++++++++++++++++++++++++++++++++++
hw/scsi/trace-events | 1 +
5 files changed, 195 insertions(+)
diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index 89b1ed6258..c5ec58089b 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -57,6 +57,13 @@ struct SCSIRequest {
QTAILQ_ENTRY(SCSIRequest) next;
};
+/* Per-SCSIDevice Persistent Reservation state */
+typedef struct {
+ QemuMutex mutex; /* protects all fields (e.g. from multiple IOThreads) */
+ uint64_t key; /* 0 if no registered key */
+ uint8_t resv_type; /* 0 if no reservation */
+} SCSIPRState;
+
#define TYPE_SCSI_DEVICE "scsi-device"
OBJECT_DECLARE_TYPE(SCSIDevice, SCSIDeviceClass, SCSI_DEVICE)
@@ -97,6 +104,9 @@ struct SCSIDevice
uint32_t io_timeout;
bool needs_vpd_bl_emulation;
bool hba_supports_iothread;
+
+ bool migrate_pr;
+ SCSIPRState pr_state;
};
extern const VMStateDescription vmstate_scsi_device;
diff --git a/include/scsi/constants.h b/include/scsi/constants.h
index 9b98451912..cb97bdb636 100644
--- a/include/scsi/constants.h
+++ b/include/scsi/constants.h
@@ -319,4 +319,25 @@
#define IDENT_DESCR_TGT_DESCR_SIZE 32
#define XCOPY_BLK2BLK_SEG_DESC_SIZE 28
+/*
+ * PERSISTENT RESERVE IN service action codes
+ */
+#define PRI_READ_KEYS 0x00
+#define PRI_READ_RESERVATION 0x01
+#define PRI_REPORT_CAPABILITIES 0x02
+#define PRI_READ_FULL_STATUS 0x03
+
+/*
+ * PERSISTENT RESERVE OUT service action codes
+ */
+#define PRO_REGISTER 0x00
+#define PRO_RESERVE 0x01
+#define PRO_RELEASE 0x02
+#define PRO_CLEAR 0x03
+#define PRO_PREEMPT 0x04
+#define PRO_PREEMPT_AND_ABORT 0x05
+#define PRO_REGISTER_AND_IGNORE_EXISTING_KEY 0x06
+#define PRO_REGISTER_AND_MOVE 0x07
+#define PRO_REPLACE_LOST_RESERVATION 0x08
+
#endif
diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index f310ddafb9..9b8656dd83 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -393,6 +393,7 @@ static void scsi_qdev_realize(DeviceState *qdev, Error **errp)
}
qemu_mutex_init(&dev->requests_lock);
+ qemu_mutex_init(&dev->pr_state.mutex);
QTAILQ_INIT(&dev->requests);
scsi_device_realize(dev, &local_err);
if (local_err) {
@@ -417,6 +418,8 @@ static void scsi_qdev_unrealize(DeviceState *qdev)
scsi_device_unrealize(dev);
+ qemu_mutex_destroy(&dev->pr_state.mutex);
+
blockdev_mark_auto_del(dev->conf.blk);
}
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index 2af8803644..392647e2b2 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -265,6 +265,160 @@ static int scsi_generic_emulate_block_limits(SCSIGenericReq *r, SCSIDevice *s)
return r->buflen;
}
+/*
+ * Patch persistent reservation capabilities that are not emulated.
+ */
+static void scsi_handle_persistent_reserve_in_reply(SCSIGenericReq *r,
+ SCSIDevice *s)
+{
+ uint8_t service_action = r->req.cmd.buf[1] & 0x1f;
+
+ if (!s->migrate_pr) {
+ return; /* when migration is disabled there is no need for patching */
+ }
+
+ if (service_action == PRI_REPORT_CAPABILITIES) {
+ assert(r->buflen >= 3);
+
+ /*
+ * Clear specify initiator ports capable (SIP_C) and all target ports
+ * capable (ATP_C).
+ *
+ * SPEC_I_PT is not supported because the guest sees an emulated SCSI
+ * bus and does not have the underlying transport IDs needed to use
+ * SPEC_I_PT.
+ *
+ * ALL_TG_PT is not supported because we only track the state of this
+ * emulated I_T nexus, not the underlying device's target ports.
+ */
+ r->buf[2] &= ~0xc;
+ }
+}
+
+static int scsi_generic_read_reservation(SCSIDevice *s, uint64_t *key,
+ uint8_t *resv_type, Error **errp)
+{
+ uint8_t cmd[10] = {};
+ uint8_t buf[24] = {};
+ uint32_t additional_length;
+ int ret;
+
+ *key = 0;
+ *resv_type = 0;
+
+ cmd[0] = PERSISTENT_RESERVE_IN;
+ cmd[1] = PRI_READ_RESERVATION;
+ cmd[8] = sizeof(buf);
+
+ ret = scsi_SG_IO(s->conf.blk, SG_DXFER_FROM_DEV, cmd, sizeof(cmd),
+ buf, sizeof(buf), s->io_timeout, errp);
+ if (ret < 0) {
+ return ret;
+ }
+
+ memcpy(&additional_length, &buf[4], sizeof(additional_length));
+ be32_to_cpus(&additional_length);
+
+ if (additional_length >= 0x10) {
+ memcpy(key, &buf[8], sizeof(*key));
+ be64_to_cpus(key);
+
+ *resv_type = buf[21] & 0xf;
+ }
+ return 0;
+}
+
+/*
+ * Snoop changes to registered keys and reservations so that this information
+ * can be transferred during live migration.
+ */
+static void scsi_handle_persistent_reserve_out_reply(
+ SCSIGenericReq *r,
+ SCSIDevice *s)
+{
+ SCSIPRState *pr_state = &s->pr_state;
+ uint8_t service_action = r->req.cmd.buf[1] & 0x1f;
+ uint8_t resv_type = r->req.cmd.buf[2] & 0xf;
+ uint64_t old_key;
+ uint64_t new_key;
+
+ assert(r->buflen >= 16);
+ memcpy(&old_key, &r->buf[0], sizeof(old_key));
+ memcpy(&new_key, &r->buf[8], sizeof(new_key));
+ be64_to_cpus(&old_key);
+ be64_to_cpus(&new_key);
+
+ trace_scsi_generic_persistent_reserve_out_reply(service_action, resv_type,
+ old_key, new_key);
+
+ switch (service_action) {
+ case PRO_REGISTER: /* fallthrough */
+ case PRO_REGISTER_AND_IGNORE_EXISTING_KEY:
+ if (service_action == PRO_REGISTER && old_key == 0 && new_key == 0) {
+ /* Do nothing */
+ } else {
+ WITH_QEMU_LOCK_GUARD(&pr_state->mutex) {
+ pr_state->key = new_key;
+ if (new_key == 0) {
+ pr_state->resv_type = 0; /* release reservation */
+ }
+ }
+ }
+ break;
+
+ case PRO_RESERVE:
+ WITH_QEMU_LOCK_GUARD(&pr_state->mutex) {
+ pr_state->resv_type = resv_type;
+ }
+ break;
+
+ case PRO_RELEASE:
+ WITH_QEMU_LOCK_GUARD(&pr_state->mutex) {
+ pr_state->resv_type = 0;
+ }
+ break;
+
+ case PRO_CLEAR:
+ WITH_QEMU_LOCK_GUARD(&pr_state->mutex) {
+ pr_state->key = 0;
+ pr_state->resv_type = 0;
+ }
+ break;
+
+ case PRO_REPLACE_LOST_RESERVATION:
+ WITH_QEMU_LOCK_GUARD(&pr_state->mutex) {
+ pr_state->key = new_key;
+ pr_state->resv_type = resv_type;
+ }
+ break;
+
+ case PRO_PREEMPT: /* fallthrough */
+ case PRO_PREEMPT_AND_ABORT: {
+ uint64_t dev_key;
+ uint8_t dev_resv_type;
+
+ /* Not enough information to know actual state, ask the device */
+ if (!scsi_generic_read_reservation(s, &dev_key, &dev_resv_type, NULL)) {
+ WITH_QEMU_LOCK_GUARD(&pr_state->mutex) {
+ if (pr_state->key == dev_key) {
+ pr_state->resv_type = dev_resv_type;
+ } else {
+ pr_state->resv_type = 0;
+ }
+ }
+ }
+ break;
+ }
+
+ /*
+ * PRO_REGISTER_AND_MOVE cannot be implemented since it involves the
+ * physical SCSI bus target ports.
+ */
+ default:
+ break; /* do nothing */
+ }
+}
+
static void scsi_read_complete(void * opaque, int ret)
{
SCSIGenericReq *r = (SCSIGenericReq *)opaque;
@@ -347,6 +501,9 @@ static void scsi_read_complete(void * opaque, int ret)
if (r->req.cmd.buf[0] == INQUIRY) {
len = scsi_handle_inquiry_reply(r, s, len);
}
+ if (r->req.cmd.buf[0] == PERSISTENT_RESERVE_IN) {
+ scsi_handle_persistent_reserve_in_reply(r, s);
+ }
req_complete:
scsi_req_data(&r->req, len);
@@ -396,6 +553,9 @@ static void scsi_write_complete(void * opaque, int ret)
s->blocksize = (r->buf[9] << 16) | (r->buf[10] << 8) | r->buf[11];
trace_scsi_generic_write_complete_blocksize(s->blocksize);
}
+ if (r->req.cmd.buf[0] == PERSISTENT_RESERVE_OUT) {
+ scsi_handle_persistent_reserve_out_reply(r, s);
+ }
scsi_command_complete_noio(r, ret);
}
diff --git a/hw/scsi/trace-events b/hw/scsi/trace-events
index 3e81f44dad..ff92fff7c5 100644
--- a/hw/scsi/trace-events
+++ b/hw/scsi/trace-events
@@ -390,3 +390,4 @@ scsi_generic_realize_blocksize(int blocksize) "block size %d"
scsi_generic_aio_sgio_command(uint32_t tag, uint8_t cmd, uint32_t timeout) "generic aio sgio: tag=0x%x cmd=0x%x timeout=%u"
scsi_generic_ioctl_sgio_command(uint8_t cmd, uint32_t timeout) "generic ioctl sgio: cmd=0x%x timeout=%u"
scsi_generic_ioctl_sgio_done(uint8_t cmd, int ret, uint8_t status, uint8_t host_status) "generic ioctl sgio: cmd=0x%x ret=%d status=0x%x host_status=0x%x"
+scsi_generic_persistent_reserve_out_reply(uint8_t service_action, uint8_t resv_type, uint64_t old_key, uint64_t new_key) "persistent reserve out reply service_action=%u resv_type=%u old_key=0x%" PRIx64 " new_key=0x%" PRIx64
--
2.52.0
* [PATCH v2 4/5] scsi: save/load SCSI reservation state
2026-01-26 19:47 [PATCH v2 0/5] scsi: persistent reservation live migration Stefan Hajnoczi
` (2 preceding siblings ...)
2026-01-26 19:47 ` [PATCH v2 3/5] scsi: track SCSI reservation state for live migration Stefan Hajnoczi
@ 2026-01-26 19:47 ` Stefan Hajnoczi
2026-01-26 19:59 ` Daniel P. Berrangé
2026-01-26 19:47 ` [PATCH v2 5/5] docs: add SCSI migrate-pr documentation Stefan Hajnoczi
2026-01-26 21:55 ` [PATCH v2 0/5] scsi: persistent reservation live migration Stefan Hajnoczi
5 siblings, 1 reply; 12+ messages in thread
From: Stefan Hajnoczi @ 2026-01-26 19:47 UTC
To: qemu-devel
Cc: pkrempa, Paolo Bonzini, Alberto Faria, Hannes Reinecke, Zhao Liu,
Kevin Wolf, qemu-block, Fam Zheng, Philippe Mathieu-Daudé,
Eduardo Habkost, Qing Wang, Yanan Wang, Marcel Apfelbaum,
Stefan Hajnoczi
Add a vmstate subsection to SCSIDiskState so that scsi-block devices can
transfer their reservation state during live migration. Upon loading the
subsection, the destination QEMU invokes the PERSISTENT RESERVE OUT
command's PREEMPT service action to atomically move the reservation from
the source I_T nexus to the destination I_T nexus. This results in
transparent live migration of SCSI reservations.
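To make the PREEMPT step concrete, here is a minimal standalone sketch of how the CDB and 24-byte parameter list could be laid out. The EX_ names and example_build_preempt() are hypothetical; the layout mirrors what scsi_generic_pr_preempt() in the diff below does, with the same key used as both the reservation key and the service action reservation key.

```c
#include <stdint.h>
#include <string.h>

#define EX_PERSISTENT_RESERVE_OUT 0x5f  /* SCSI opcode */
#define EX_PRO_PREEMPT            0x04  /* service action */

/* Store a 64-bit value in big-endian byte order, as SCSI requires */
static void example_put_be64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        p[i] = (uint8_t)(v >> (56 - 8 * i));
    }
}

/* Hypothetical helper: build a PREEMPT CDB and its parameter list */
static void example_build_preempt(uint8_t cmd[10], uint8_t params[24],
                                  uint64_t key, uint8_t resv_type)
{
    memset(cmd, 0, 10);
    memset(params, 0, 24);

    cmd[0] = EX_PERSISTENT_RESERVE_OUT;
    cmd[1] = EX_PRO_PREEMPT;
    cmd[2] = resv_type & 0xf;   /* TYPE in the low nibble, SCOPE left at 0 */
    cmd[8] = 24;                /* low byte of the parameter list length */

    example_put_be64(&params[0], key);  /* RESERVATION KEY */
    example_put_be64(&params[8], key);  /* SERVICE ACTION RESERVATION KEY */
}
```

Because the source and destination register the same key, preempting that key from the destination I_T nexus removes the source's registration and takes over its reservation in one command.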
This approach is incomplete since SCSI reservations are cooperative and
other hosts could interfere. Neither the source QEMU nor the destination
QEMU is aware of changes made by other hosts. The assumption is that the
reservation is not taken over by a third host without cooperation from
the source host.
I considered adding the vmstate subsection to SCSIDevice instead of
SCSIDiskState, since reservations are part of the SCSI Primary Commands
that other devices apart from disks could support. However, due to
fragility of migrating reservations, we will probably limit support to
scsi-block and maybe scsi-disk in the future. In the end, I think it
makes sense to place this within scsi-disk.c.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
include/hw/scsi/scsi.h | 1 +
hw/core/machine.c | 4 ++-
hw/scsi/scsi-disk.c | 81 ++++++++++++++++++++++++++++++++++++++++-
hw/scsi/scsi-generic.c | 82 ++++++++++++++++++++++++++++++++++++++++++
hw/scsi/trace-events | 1 +
5 files changed, 167 insertions(+), 2 deletions(-)
diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index c5ec58089b..a3e246dbd9 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -253,6 +253,7 @@ SCSIDevice *scsi_device_get(SCSIBus *bus, int channel, int target, int lun);
/* scsi-generic.c. */
extern const SCSIReqOps scsi_generic_req_ops;
+bool scsi_generic_pr_state_preempt(SCSIDevice *s, Error **errp);
/* scsi-disk.c */
#define SCSI_DISK_QUIRK_MODE_PAGE_APPLE_VENDOR 0
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 6411e68856..16134f8ce5 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -38,7 +38,9 @@
#include "hw/acpi/generic_event_device.h"
#include "qemu/audio.h"
-GlobalProperty hw_compat_10_2[] = {};
+GlobalProperty hw_compat_10_2[] = {
+ { "scsi-block", "migrate-pr", "off" },
+};
const size_t hw_compat_10_2_len = G_N_ELEMENTS(hw_compat_10_2);
GlobalProperty hw_compat_10_1[] = {
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 76fe5f085b..8845ab1192 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -28,6 +28,7 @@
#include "qemu/hw-version.h"
#include "qemu/memalign.h"
#include "hw/scsi/scsi.h"
+#include "migration/misc.h"
#include "migration/qemu-file-types.h"
#include "migration/vmstate.h"
#include "hw/scsi/emulation.h"
@@ -122,6 +123,7 @@ struct SCSIDiskState {
*/
uint16_t rotation_rate;
bool migrate_emulated_scsi_request;
+ NotifierWithReturn migration_notifier;
};
static void scsi_free_request(SCSIRequest *req)
@@ -2737,6 +2739,25 @@ static SCSIRequest *scsi_new_request(SCSIDevice *d, uint32_t tag, uint32_t lun,
}
#ifdef __linux__
+/*
+ * Preempt the SCSI Persistent Reservation on the source when migration fails,
+ * because the destination may have already preempted it and we need to get the
+ * reservation back.
+ */
+static int scsi_block_migration_notifier(NotifierWithReturn *notifier,
+ MigrationEvent *e, Error **errp)
+{
+ if (e->type == MIG_EVENT_PRECOPY_FAILED) {
+ SCSIDiskState *s =
+ container_of(notifier, SCSIDiskState, migration_notifier);
+ SCSIDevice *d = &s->qdev;
+
+ /* MIG_EVENT_PRECOPY_FAILED cannot fail, so ignore errors */
+ scsi_generic_pr_state_preempt(d, NULL);
+ }
+ return 0;
+}
+
static int get_device_type(SCSIDiskState *s)
{
uint8_t cmd[16];
@@ -2815,6 +2836,16 @@ static void scsi_block_realize(SCSIDevice *dev, Error **errp)
scsi_realize(&s->qdev, errp);
scsi_generic_read_device_inquiry(&s->qdev);
+
+ migration_add_notifier(&s->migration_notifier,
+ scsi_block_migration_notifier);
+}
+
+static void scsi_block_unrealize(SCSIDevice *dev)
+{
+ SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, dev);
+
+ migration_remove_notifier(&s->migration_notifier);
}
typedef struct SCSIBlockReq {
@@ -3209,6 +3240,46 @@ static const Property scsi_hd_properties[] = {
DEFINE_BLOCK_CHS_PROPERTIES(SCSIDiskState, qdev.conf),
};
+#ifdef __linux__
+static bool scsi_disk_pr_state_post_load_errp(void *opaque, int version_id, Error **errp)
+{
+ SCSIDiskState *s = opaque;
+ SCSIDevice *dev = &s->qdev;
+
+ return scsi_generic_pr_state_preempt(dev, errp);
+}
+
+static bool scsi_disk_pr_state_needed(void *opaque)
+{
+ SCSIDiskState *s = opaque;
+ SCSIPRState *pr_state = &s->qdev.pr_state;
+ bool ret;
+
+ if (!s->qdev.migrate_pr) {
+ return false;
+ }
+
+ /* A reservation requires a key, so checking this field is enough */
+ WITH_QEMU_LOCK_GUARD(&pr_state->mutex) {
+ ret = pr_state->key;
+ }
+ return ret;
+}
+
+static const VMStateDescription vmstate_scsi_disk_pr_state = {
+ .name = "scsi-disk/pr",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .post_load_errp = scsi_disk_pr_state_post_load_errp,
+ .needed = scsi_disk_pr_state_needed,
+ .fields = (const VMStateField[]) {
+ VMSTATE_UINT64(qdev.pr_state.key, SCSIDiskState),
+ VMSTATE_UINT8(qdev.pr_state.resv_type, SCSIDiskState),
+ VMSTATE_END_OF_LIST()
+ }
+};
+#endif /* __linux__ */
+
static const VMStateDescription vmstate_scsi_disk_state = {
.name = "scsi-disk",
.version_id = 1,
@@ -3221,7 +3292,13 @@ static const VMStateDescription vmstate_scsi_disk_state = {
VMSTATE_BOOL(tray_open, SCSIDiskState),
VMSTATE_BOOL(tray_locked, SCSIDiskState),
VMSTATE_END_OF_LIST()
- }
+ },
+ .subsections = (const VMStateDescription * const []) {
+#ifdef __linux__
+ &vmstate_scsi_disk_pr_state,
+#endif
+ NULL
+ },
};
static void scsi_hd_class_initfn(ObjectClass *klass, const void *data)
@@ -3301,6 +3378,7 @@ static const Property scsi_block_properties[] = {
-1),
DEFINE_PROP_UINT32("io_timeout", SCSIDiskState, qdev.io_timeout,
DEFAULT_IO_TIMEOUT),
+ DEFINE_PROP_BOOL("migrate-pr", SCSIDiskState, qdev.migrate_pr, true),
};
static void scsi_block_class_initfn(ObjectClass *klass, const void *data)
@@ -3310,6 +3388,7 @@ static void scsi_block_class_initfn(ObjectClass *klass, const void *data)
SCSIDiskClass *sdc = SCSI_DISK_BASE_CLASS(klass);
sc->realize = scsi_block_realize;
+ sc->unrealize = scsi_block_unrealize;
sc->alloc_req = scsi_block_new_request;
sc->parse_cdb = scsi_block_parse_cdb;
sdc->dma_readv = scsi_block_dma_readv;
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index 392647e2b2..e44979faeb 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -419,6 +419,88 @@ static void scsi_handle_persistent_reserve_out_reply(
}
}
+static bool scsi_generic_pr_register(SCSIDevice *s, uint64_t key, Error **errp)
+{
+ uint8_t cmd[10] = {};
+ uint8_t buf[24] = {};
+ uint64_t key_be = cpu_to_be64(key);
+ int ret;
+
+ cmd[0] = PERSISTENT_RESERVE_OUT;
+ cmd[1] = PRO_REGISTER;
+ cmd[8] = sizeof(buf);
+ memcpy(&buf[8], &key_be, sizeof(key_be));
+
+ ret = scsi_SG_IO(s->conf.blk, SG_DXFER_TO_DEV, cmd, sizeof(cmd),
+ buf, sizeof(buf), s->io_timeout, errp);
+ if (ret < 0) {
+ error_prepend(errp, "PERSISTENT RESERVE OUT with REGISTER: ");
+ return false;
+ }
+ return true;
+}
+
+static bool scsi_generic_pr_preempt(SCSIDevice *s, uint64_t key, uint8_t resv_type, Error **errp)
+{
+ uint8_t cmd[10] = {};
+ uint8_t buf[24] = {};
+ uint64_t key_be = cpu_to_be64(key);
+ int ret;
+
+ cmd[0] = PERSISTENT_RESERVE_OUT;
+ cmd[1] = PRO_PREEMPT;
+ cmd[2] = resv_type & 0xf;
+ cmd[8] = sizeof(buf);
+ memcpy(&buf[0], &key_be, sizeof(key_be));
+ memcpy(&buf[8], &key_be, sizeof(key_be));
+
+ ret = scsi_SG_IO(s->conf.blk, SG_DXFER_TO_DEV, cmd, sizeof(cmd),
+ buf, sizeof(buf), s->io_timeout, errp);
+ if (ret < 0) {
+ error_prepend(errp, "PERSISTENT RESERVE OUT with PREEMPT: ");
+ return false;
+ }
+ return true;
+}
+
+/* Register keys and preempt reservations after live migration */
+bool scsi_generic_pr_state_preempt(SCSIDevice *s, Error **errp)
+{
+ SCSIPRState *pr_state = &s->pr_state;
+ uint64_t key;
+ uint8_t resv_type;
+
+ WITH_QEMU_LOCK_GUARD(&pr_state->mutex) {
+ key = pr_state->key;
+ resv_type = pr_state->resv_type;
+ }
+
+ trace_scsi_generic_pr_state_preempt(key, resv_type);
+
+ if (key) {
+ if (!scsi_generic_pr_register(s, key, errp)) {
+ return false;
+ }
+
+ /*
+ * Two cases:
+ *
+ * 1. There is no reservation (resv_type is 0) and the other I_T nexus
+ * will be unregistered. This is important so the source host does
+ * not leak registered keys across live migration.
+ *
+ * 2. There is a reservation (resv_type is not 0) and the other I_T
+ * nexus will be unregistered and its reservation is atomically
+ * taken over by us. This is the scenario where a reservation is
+ * migrated along with the guest.
+ */
+ if (!scsi_generic_pr_preempt(s, key, resv_type, errp)) {
+ return false;
+ }
+ }
+ return true;
+}
+
static void scsi_read_complete(void * opaque, int ret)
{
SCSIGenericReq *r = (SCSIGenericReq *)opaque;
diff --git a/hw/scsi/trace-events b/hw/scsi/trace-events
index ff92fff7c5..a8ac1e7f1d 100644
--- a/hw/scsi/trace-events
+++ b/hw/scsi/trace-events
@@ -391,3 +391,4 @@ scsi_generic_aio_sgio_command(uint32_t tag, uint8_t cmd, uint32_t timeout) "gene
scsi_generic_ioctl_sgio_command(uint8_t cmd, uint32_t timeout) "generic ioctl sgio: cmd=0x%x timeout=%u"
scsi_generic_ioctl_sgio_done(uint8_t cmd, int ret, uint8_t status, uint8_t host_status) "generic ioctl sgio: cmd=0x%x ret=%d status=0x%x host_status=0x%x"
scsi_generic_persistent_reserve_out_reply(uint8_t service_action, uint8_t resv_type, uint64_t old_key, uint64_t new_key) "persistent reserve out reply service_action=%u resv_type=%u old_key=0x%" PRIx64 " new_key=0x%" PRIx64
+scsi_generic_pr_state_preempt(uint64_t key, uint8_t resv_type) "key=0x%" PRIx64 " resv_type=%u"
--
2.52.0
* Re: [PATCH v2 4/5] scsi: save/load SCSI reservation state
2026-01-26 19:47 ` [PATCH v2 4/5] scsi: save/load SCSI reservation state Stefan Hajnoczi
@ 2026-01-26 19:59 ` Daniel P. Berrangé
2026-01-26 22:13 ` Stefan Hajnoczi
0 siblings, 1 reply; 12+ messages in thread
From: Daniel P. Berrangé @ 2026-01-26 19:59 UTC
To: Stefan Hajnoczi
Cc: qemu-devel, pkrempa, Paolo Bonzini, Alberto Faria,
Hannes Reinecke, Zhao Liu, Kevin Wolf, qemu-block, Fam Zheng,
Philippe Mathieu-Daudé, Eduardo Habkost, Qing Wang,
Yanan Wang, Marcel Apfelbaum
On Mon, Jan 26, 2026 at 02:47:34PM -0500, Stefan Hajnoczi wrote:
> Add a vmstate subsection to SCSIDiskState so that scsi-block devices can
> transfer their reservation state during live migration. Upon loading the
> subsection, the destination QEMU invokes the PERSISTENT RESERVE OUT
> command's PREEMPT service action to atomically move the reservation from
> the source I_T nexus to the destination I_T nexus. This results in
> transparent live migration of SCSI reservations.
>
> This approach is incomplete since SCSI reservations are cooperative and
> other hosts could interfere. Neither the source QEMU nor the destination
> QEMU is aware of changes made by other hosts. The assumption is that the
> reservation is not taken over by a third host without cooperation from
> the source host.
>
> I considered adding the vmstate subsection to SCSIDevice instead of
> SCSIDiskState, since reservations are part of the SCSI Primary Commands
> that other devices apart from disks could support. However, due to
> fragility of migrating reservations, we will probably limit support to
> scsi-block and maybe scsi-disk in the future. In the end, I think it
> makes sense to place this within scsi-disk.c.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
> include/hw/scsi/scsi.h | 1 +
> hw/core/machine.c | 4 ++-
> hw/scsi/scsi-disk.c | 81 ++++++++++++++++++++++++++++++++++++++++-
> hw/scsi/scsi-generic.c | 82 ++++++++++++++++++++++++++++++++++++++++++
> hw/scsi/trace-events | 1 +
> 5 files changed, 167 insertions(+), 2 deletions(-)
>
> diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
> index c5ec58089b..a3e246dbd9 100644
> --- a/include/hw/scsi/scsi.h
> +++ b/include/hw/scsi/scsi.h
> @@ -253,6 +253,7 @@ SCSIDevice *scsi_device_get(SCSIBus *bus, int channel, int target, int lun);
>
> /* scsi-generic.c. */
> extern const SCSIReqOps scsi_generic_req_ops;
> +bool scsi_generic_pr_state_preempt(SCSIDevice *s, Error **errp);
>
> /* scsi-disk.c */
> #define SCSI_DISK_QUIRK_MODE_PAGE_APPLE_VENDOR 0
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 6411e68856..16134f8ce5 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -38,7 +38,9 @@
> #include "hw/acpi/generic_event_device.h"
> #include "qemu/audio.h"
>
> -GlobalProperty hw_compat_10_2[] = {};
> +GlobalProperty hw_compat_10_2[] = {
> + { "scsi-block", "migrate-pr", "off" },
> +};
> const size_t hw_compat_10_2_len = G_N_ELEMENTS(hw_compat_10_2);
>
> GlobalProperty hw_compat_10_1[] = {
> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> index 76fe5f085b..8845ab1192 100644
> --- a/hw/scsi/scsi-disk.c
> +++ b/hw/scsi/scsi-disk.c
> @@ -28,6 +28,7 @@
> #include "qemu/hw-version.h"
> #include "qemu/memalign.h"
> #include "hw/scsi/scsi.h"
> +#include "migration/misc.h"
> #include "migration/qemu-file-types.h"
> #include "migration/vmstate.h"
> #include "hw/scsi/emulation.h"
> @@ -122,6 +123,7 @@ struct SCSIDiskState {
> */
> uint16_t rotation_rate;
> bool migrate_emulated_scsi_request;
> + NotifierWithReturn migration_notifier;
> };
>
> static void scsi_free_request(SCSIRequest *req)
> @@ -2737,6 +2739,25 @@ static SCSIRequest *scsi_new_request(SCSIDevice *d, uint32_t tag, uint32_t lun,
> }
>
> #ifdef __linux__
> +/*
> + * Preempt on the SCSI Persistent Reservation on the source when migration
> + * fails because the destination may have already preempted and we need to get
> + * the reservation back.
> + */
> +static int scsi_block_migration_notifier(NotifierWithReturn *notifier,
> + MigrationEvent *e, Error **errp)
> +{
> + if (e->type == MIG_EVENT_PRECOPY_FAILED) {
> + SCSIDiskState *s =
> + container_of(notifier, SCSIDiskState, migration_notifier);
> + SCSIDevice *d = &s->qdev;
> +
> + /* MIG_EVENT_PRECOPY_FAILED cannot fail, so ignore errors */
> + scsi_generic_pr_state_preempt(d, NULL);
I feel like we ought to 'warn_report' any errors related to failing
to acquire persistent reservations.
In the unlikely event an error occurs, whoever has to deal with
the resulting support ticket will want to know something went wrong
from the QEMU logs.
> @@ -3209,6 +3240,46 @@ static const Property scsi_hd_properties[] = {
> DEFINE_BLOCK_CHS_PROPERTIES(SCSIDiskState, qdev.conf),
> };
>
> +#ifdef __linux__
> +static bool scsi_disk_pr_state_post_load_errp(void *opaque, int version_id, Error **errp)
> +{
> + SCSIDiskState *s = opaque;
> + SCSIDevice *dev = &s->qdev;
> +
> + return scsi_generic_pr_state_preempt(dev, errp);
> +}
What if there are multiple SCSI disks, and on the target host
we successfully acquire the reservation for the 1st, but fail
the second? Are we ignoring the failure of the second, or
are we discarding the reservation of the 1st and failing
migration? Should we warn_report here too?
> +
> +static bool scsi_disk_pr_state_needed(void *opaque)
> +{
> + SCSIDiskState *s = opaque;
> + SCSIPRState *pr_state = &s->qdev.pr_state;
> + bool ret;
> +
> + if (!s->qdev.migrate_pr) {
> + return false;
> + }
> +
> + /* A reservation requires a key, so checking this field is enough */
> + WITH_QEMU_LOCK_GUARD(&pr_state->mutex) {
> + ret = pr_state->key;
> + }
> + return ret;
> +}
> +
> +static const VMStateDescription vmstate_scsi_disk_pr_state = {
> + .name = "scsi-disk/pr",
> + .version_id = 1,
> + .minimum_version_id = 1,
> + .post_load_errp = scsi_disk_pr_state_post_load_errp,
> + .needed = scsi_disk_pr_state_needed,
> + .fields = (const VMStateField[]) {
> + VMSTATE_UINT64(qdev.pr_state.key, SCSIDiskState),
> + VMSTATE_UINT8(qdev.pr_state.resv_type, SCSIDiskState),
> + VMSTATE_END_OF_LIST()
> + }
> +};
> +#endif /* __linux__ */
> +
> static const VMStateDescription vmstate_scsi_disk_state = {
> .name = "scsi-disk",
> .version_id = 1,
> @@ -3221,7 +3292,13 @@ static const VMStateDescription vmstate_scsi_disk_state = {
> VMSTATE_BOOL(tray_open, SCSIDiskState),
> VMSTATE_BOOL(tray_locked, SCSIDiskState),
> VMSTATE_END_OF_LIST()
> - }
> + },
> + .subsections = (const VMStateDescription * const []) {
> +#ifdef __linux__
> + &vmstate_scsi_disk_pr_state,
> +#endif
> + NULL
> + },
> };
>
> static void scsi_hd_class_initfn(ObjectClass *klass, const void *data)
> @@ -3301,6 +3378,7 @@ static const Property scsi_block_properties[] = {
> -1),
> DEFINE_PROP_UINT32("io_timeout", SCSIDiskState, qdev.io_timeout,
> DEFAULT_IO_TIMEOUT),
> + DEFINE_PROP_BOOL("migrate-pr", SCSIDiskState, qdev.migrate_pr, true),
> };
>
> static void scsi_block_class_initfn(ObjectClass *klass, const void *data)
> @@ -3310,6 +3388,7 @@ static void scsi_block_class_initfn(ObjectClass *klass, const void *data)
> SCSIDiskClass *sdc = SCSI_DISK_BASE_CLASS(klass);
>
> sc->realize = scsi_block_realize;
> + sc->unrealize = scsi_block_unrealize;
> sc->alloc_req = scsi_block_new_request;
> sc->parse_cdb = scsi_block_parse_cdb;
> sdc->dma_readv = scsi_block_dma_readv;
> diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
> index 392647e2b2..e44979faeb 100644
> --- a/hw/scsi/scsi-generic.c
> +++ b/hw/scsi/scsi-generic.c
> @@ -419,6 +419,88 @@ static void scsi_handle_persistent_reserve_out_reply(
> }
> }
>
> +static bool scsi_generic_pr_register(SCSIDevice *s, uint64_t key, Error **errp)
> +{
> + uint8_t cmd[10] = {};
> + uint8_t buf[24] = {};
> + uint64_t key_be = cpu_to_be64(key);
> + int ret;
> +
> + cmd[0] = PERSISTENT_RESERVE_OUT;
> + cmd[1] = PRO_REGISTER;
> + cmd[8] = sizeof(buf);
> + memcpy(&buf[8], &key_be, sizeof(key_be));
> +
> + ret = scsi_SG_IO(s->conf.blk, SG_DXFER_TO_DEV, cmd, sizeof(cmd),
> + buf, sizeof(buf), s->io_timeout, errp);
> + if (ret < 0) {
> + error_prepend(errp, "PERSISTENT RESERVE OUT with REGISTER");
> + return false;
> + }
> + return true;
> +}
> +
> +static bool scsi_generic_pr_preempt(SCSIDevice *s, uint64_t key, uint8_t resv_type, Error **errp)
> +{
> + uint8_t cmd[10] = {};
> + uint8_t buf[24] = {};
> + uint64_t key_be = cpu_to_be64(key);
> + int ret;
> +
> + cmd[0] = PERSISTENT_RESERVE_OUT;
> + cmd[1] = PRO_PREEMPT;
> + cmd[2] = resv_type & 0xf;
> + cmd[8] = sizeof(buf);
> + memcpy(&buf[0], &key_be, sizeof(key_be));
> + memcpy(&buf[8], &key_be, sizeof(key_be));
> +
> + ret = scsi_SG_IO(s->conf.blk, SG_DXFER_TO_DEV, cmd, sizeof(cmd),
> + buf, sizeof(buf), s->io_timeout, errp);
> + if (ret < 0) {
> + error_prepend(errp, "PERSISTENT RESERVE OUT with PREEMPT");
> + return false;
> + }
> + return true;
> +}
> +
> +/* Register keys and preempt reservations after live migration */
> +bool scsi_generic_pr_state_preempt(SCSIDevice *s, Error **errp)
> +{
> + SCSIPRState *pr_state = &s->pr_state;
> + uint64_t key;
> + uint8_t resv_type;
> +
> + WITH_QEMU_LOCK_GUARD(&pr_state->mutex) {
> + key = pr_state->key;
> + resv_type = pr_state->resv_type;
> + }
> +
> + trace_scsi_generic_pr_state_preempt(key, resv_type);
> +
> + if (key) {
> + if (!scsi_generic_pr_register(s, key, errp)) {
> + return false;
> + }
> +
> + /*
> + * Two cases:
> + *
> + * 1. There is no reservation (resv_type is 0) and the other I_T nexus
> + * will be unregistered. This is important so the source host does
> + * not leak registered keys across live migration.
> + *
> + * 2. There is a reservation (resv_type is not 0) and the other I_T
> + * nexus will be unregistered and its reservation is atomically
> + * taken over by us. This is the scenario where a reservation is
> + * migrated along with the guest.
> + */
> + if (!scsi_generic_pr_preempt(s, key, resv_type, errp)) {
> + return false;
> + }
> + }
> + return true;
> +}
> +
> static void scsi_read_complete(void * opaque, int ret)
> {
> SCSIGenericReq *r = (SCSIGenericReq *)opaque;
> diff --git a/hw/scsi/trace-events b/hw/scsi/trace-events
> index ff92fff7c5..a8ac1e7f1d 100644
> --- a/hw/scsi/trace-events
> +++ b/hw/scsi/trace-events
> @@ -391,3 +391,4 @@ scsi_generic_aio_sgio_command(uint32_t tag, uint8_t cmd, uint32_t timeout) "gene
> scsi_generic_ioctl_sgio_command(uint8_t cmd, uint32_t timeout) "generic ioctl sgio: cmd=0x%x timeout=%u"
> scsi_generic_ioctl_sgio_done(uint8_t cmd, int ret, uint8_t status, uint8_t host_status) "generic ioctl sgio: cmd=0x%x ret=%d status=0x%x host_status=0x%x"
> scsi_generic_persistent_reserve_out_reply(uint8_t service_action, uint8_t resv_type, uint64_t old_key, uint64_t new_key) "persistent reserve out reply service_action=%u resv_type=%u old_key=0x%" PRIx64 " new_key=0x%" PRIx64
> +scsi_generic_pr_state_preempt(uint64_t key, uint8_t resv_type) "key=0x%" PRIx64 " resv_type=%u"
> --
> 2.52.0
>
>
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: [PATCH v2 4/5] scsi: save/load SCSI reservation state
2026-01-26 19:59 ` Daniel P. Berrangé
@ 2026-01-26 22:13 ` Stefan Hajnoczi
2026-01-27 8:54 ` Daniel P. Berrangé
0 siblings, 1 reply; 12+ messages in thread
From: Stefan Hajnoczi @ 2026-01-26 22:13 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: qemu-devel, pkrempa, Paolo Bonzini, Alberto Faria,
Hannes Reinecke, Zhao Liu, Kevin Wolf, qemu-block, Fam Zheng,
Philippe Mathieu-Daudé, Eduardo Habkost, Qing Wang,
Yanan Wang, Marcel Apfelbaum
On Mon, Jan 26, 2026 at 07:59:55PM +0000, Daniel P. Berrangé wrote:
> On Mon, Jan 26, 2026 at 02:47:34PM -0500, Stefan Hajnoczi wrote:
> > Add a vmstate subsection to SCSIDiskState so that scsi-block devices can
> > transfer their reservation state during live migration. Upon loading the
> > subsection, the destination QEMU invokes the PERSISTENT RESERVE OUT
> > command's PREEMPT service action to atomically move the reservation from
> > the source I_T nexus to the destination I_T nexus. This results in
> > transparent live migration of SCSI reservations.
> >
> > This approach is incomplete since SCSI reservations are cooperative and
> > other hosts could interfere. Neither the source QEMU nor the destination
> > QEMU are aware of changes made by other hosts. The assumption is that
> > reservation is not taken over by a third host without cooperation from
> > the source host.
> >
> > I considered adding the vmstate subsection to SCSIDevice instead of
> > SCSIDiskState, since reservations are part of the SCSI Primary Commands
> > that other devices apart from disks could support. However, due to
> > fragility of migrating reservations, we will probably limit support to
> > scsi-block and maybe scsi-disk in the future. In the end, I think it
> > makes sense to place this within scsi-disk.c.
> >
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> > include/hw/scsi/scsi.h | 1 +
> > hw/core/machine.c | 4 ++-
> > hw/scsi/scsi-disk.c | 81 ++++++++++++++++++++++++++++++++++++++++-
> > hw/scsi/scsi-generic.c | 82 ++++++++++++++++++++++++++++++++++++++++++
> > hw/scsi/trace-events | 1 +
> > 5 files changed, 167 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
> > index c5ec58089b..a3e246dbd9 100644
> > --- a/include/hw/scsi/scsi.h
> > +++ b/include/hw/scsi/scsi.h
> > @@ -253,6 +253,7 @@ SCSIDevice *scsi_device_get(SCSIBus *bus, int channel, int target, int lun);
> >
> > /* scsi-generic.c. */
> > extern const SCSIReqOps scsi_generic_req_ops;
> > +bool scsi_generic_pr_state_preempt(SCSIDevice *s, Error **errp);
> >
> > /* scsi-disk.c */
> > #define SCSI_DISK_QUIRK_MODE_PAGE_APPLE_VENDOR 0
> > diff --git a/hw/core/machine.c b/hw/core/machine.c
> > index 6411e68856..16134f8ce5 100644
> > --- a/hw/core/machine.c
> > +++ b/hw/core/machine.c
> > @@ -38,7 +38,9 @@
> > #include "hw/acpi/generic_event_device.h"
> > #include "qemu/audio.h"
> >
> > -GlobalProperty hw_compat_10_2[] = {};
> > +GlobalProperty hw_compat_10_2[] = {
> > + { "scsi-block", "migrate-pr", "off" },
> > +};
> > const size_t hw_compat_10_2_len = G_N_ELEMENTS(hw_compat_10_2);
> >
> > GlobalProperty hw_compat_10_1[] = {
> > diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> > index 76fe5f085b..8845ab1192 100644
> > --- a/hw/scsi/scsi-disk.c
> > +++ b/hw/scsi/scsi-disk.c
> > @@ -28,6 +28,7 @@
> > #include "qemu/hw-version.h"
> > #include "qemu/memalign.h"
> > #include "hw/scsi/scsi.h"
> > +#include "migration/misc.h"
> > #include "migration/qemu-file-types.h"
> > #include "migration/vmstate.h"
> > #include "hw/scsi/emulation.h"
> > @@ -122,6 +123,7 @@ struct SCSIDiskState {
> > */
> > uint16_t rotation_rate;
> > bool migrate_emulated_scsi_request;
> > + NotifierWithReturn migration_notifier;
> > };
> >
> > static void scsi_free_request(SCSIRequest *req)
> > @@ -2737,6 +2739,25 @@ static SCSIRequest *scsi_new_request(SCSIDevice *d, uint32_t tag, uint32_t lun,
> > }
> >
> > #ifdef __linux__
> > +/*
> > + * Preempt on the SCSI Persistent Reservation on the source when migration
> > + * fails because the destination may have already preempted and we need to get
> > + * the reservation back.
> > + */
> > +static int scsi_block_migration_notifier(NotifierWithReturn *notifier,
> > + MigrationEvent *e, Error **errp)
> > +{
> > + if (e->type == MIG_EVENT_PRECOPY_FAILED) {
> > + SCSIDiskState *s =
> > + container_of(notifier, SCSIDiskState, migration_notifier);
> > + SCSIDevice *d = &s->qdev;
> > +
> > + /* MIG_EVENT_PRECOPY_FAILED cannot fail, so ignore errors */
> > + scsi_generic_pr_state_preempt(d, NULL);
>
> I feel like we ought to 'warn_report' any errors related to failing
> to acquire persistent reservations.
>
> In the unlikely event an error occurs, whoever has to deal with
> the resulting support ticket will want to know something went wrong
> from the QEMU logs.
Good idea.
I'm also not sure how to best approach logging in general. Usually QEMU
does little logging when the VM is running, but it has become
increasingly difficult to get information out of QEMU via tracing or
monitor commands since nowadays QEMU might be running in a locked down
container. Debugging PR migration issues would be easiest if the trace
events introduced in this series were actually printfs to stderr, but
that's not traditionally how QEMU did things.
>
>
> > @@ -3209,6 +3240,46 @@ static const Property scsi_hd_properties[] = {
> > DEFINE_BLOCK_CHS_PROPERTIES(SCSIDiskState, qdev.conf),
> > };
> >
> > +#ifdef __linux__
> > +static bool scsi_disk_pr_state_post_load_errp(void *opaque, int version_id, Error **errp)
> > +{
> > + SCSIDiskState *s = opaque;
> > + SCSIDevice *dev = &s->qdev;
> > +
> > + return scsi_generic_pr_state_preempt(dev, errp);
> > +}
>
> What if there are multiple SCSI disks, and on the target host
> we successfully acquire the reservation for the 1st, but fail
> the second? Are we ignoring the failure of the second, or
> are we discarding the reservation of the 1st and failing
> migration? Should we warn_report here too?
Each disk is its own scsi-block device.
scsi_disk_pr_state_post_load_errp() will be invoked for each such
device. If one of these calls fails, the migration will abort and the
errp here will be reported.
The rollback on the source host doesn't care which devices
succeeded/failed, it performs idempotent PREEMPT operations for all
devices so we're sure all reservations are back on the source host after
a failed migration.
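As a toy model of that flow (not QEMU code): the mock_* names, the
holder enum and the "stop at the first failing device" ordering are
simplifying assumptions made only for illustration. PREEMPT is modeled
as "make this host the holder", which is why repeating it on a device
that never moved is harmless.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: which host currently holds a disk's reservation. */
enum holder { HOLDER_SRC, HOLDER_DST };

struct mock_disk {
    enum holder holder;     /* current reservation holder */
    bool dst_preempt_fails; /* simulate a failing PREEMPT on the destination */
};

/* PREEMPT modeled as "who becomes the holder"; repeating it changes nothing. */
static bool mock_preempt(struct mock_disk *d, enum holder who, bool fail)
{
    if (fail) {
        return false;
    }
    d->holder = who;
    return true;
}

/* Destination side: one post-load call per device, abort on first failure. */
static bool dst_load_all(struct mock_disk *disks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!mock_preempt(&disks[i], HOLDER_DST, disks[i].dst_preempt_fails)) {
            return false;
        }
    }
    return true;
}

/* Source side: after a failed migration, preempt on every device,
 * without tracking which ones the destination managed to take over. */
static void src_rollback_all(struct mock_disk *disks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        mock_preempt(&disks[i], HOLDER_SRC, false);
    }
}

/* Helper for checking the end state. */
static size_t count_held_by(const struct mock_disk *disks, size_t n,
                            enum holder who)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (disks[i].holder == who) {
            count++;
        }
    }
    return count;
}
```

With three disks where the second one fails on the destination, the
first ends up on the destination, the other two stay on the source, and
src_rollback_all() brings all three back to the source.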
>
> > +
> > +static bool scsi_disk_pr_state_needed(void *opaque)
> > +{
> > + SCSIDiskState *s = opaque;
> > + SCSIPRState *pr_state = &s->qdev.pr_state;
> > + bool ret;
> > +
> > + if (!s->qdev.migrate_pr) {
> > + return false;
> > + }
> > +
> > + /* A reservation requires a key, so checking this field is enough */
> > + WITH_QEMU_LOCK_GUARD(&pr_state->mutex) {
> > + ret = pr_state->key;
> > + }
> > + return ret;
> > +}
> > +
> > +static const VMStateDescription vmstate_scsi_disk_pr_state = {
> > + .name = "scsi-disk/pr",
> > + .version_id = 1,
> > + .minimum_version_id = 1,
> > + .post_load_errp = scsi_disk_pr_state_post_load_errp,
> > + .needed = scsi_disk_pr_state_needed,
> > + .fields = (const VMStateField[]) {
> > + VMSTATE_UINT64(qdev.pr_state.key, SCSIDiskState),
> > + VMSTATE_UINT8(qdev.pr_state.resv_type, SCSIDiskState),
> > + VMSTATE_END_OF_LIST()
> > + }
> > +};
> > +#endif /* __linux__ */
> > +
> > static const VMStateDescription vmstate_scsi_disk_state = {
> > .name = "scsi-disk",
> > .version_id = 1,
> > @@ -3221,7 +3292,13 @@ static const VMStateDescription vmstate_scsi_disk_state = {
> > VMSTATE_BOOL(tray_open, SCSIDiskState),
> > VMSTATE_BOOL(tray_locked, SCSIDiskState),
> > VMSTATE_END_OF_LIST()
> > - }
> > + },
> > + .subsections = (const VMStateDescription * const []) {
> > +#ifdef __linux__
> > + &vmstate_scsi_disk_pr_state,
> > +#endif
> > + NULL
> > + },
> > };
> >
> > static void scsi_hd_class_initfn(ObjectClass *klass, const void *data)
> > @@ -3301,6 +3378,7 @@ static const Property scsi_block_properties[] = {
> > -1),
> > DEFINE_PROP_UINT32("io_timeout", SCSIDiskState, qdev.io_timeout,
> > DEFAULT_IO_TIMEOUT),
> > + DEFINE_PROP_BOOL("migrate-pr", SCSIDiskState, qdev.migrate_pr, true),
> > };
> >
> > static void scsi_block_class_initfn(ObjectClass *klass, const void *data)
> > @@ -3310,6 +3388,7 @@ static void scsi_block_class_initfn(ObjectClass *klass, const void *data)
> > SCSIDiskClass *sdc = SCSI_DISK_BASE_CLASS(klass);
> >
> > sc->realize = scsi_block_realize;
> > + sc->unrealize = scsi_block_unrealize;
> > sc->alloc_req = scsi_block_new_request;
> > sc->parse_cdb = scsi_block_parse_cdb;
> > sdc->dma_readv = scsi_block_dma_readv;
> > diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
> > index 392647e2b2..e44979faeb 100644
> > --- a/hw/scsi/scsi-generic.c
> > +++ b/hw/scsi/scsi-generic.c
> > @@ -419,6 +419,88 @@ static void scsi_handle_persistent_reserve_out_reply(
> > }
> > }
> >
> > +static bool scsi_generic_pr_register(SCSIDevice *s, uint64_t key, Error **errp)
> > +{
> > + uint8_t cmd[10] = {};
> > + uint8_t buf[24] = {};
> > + uint64_t key_be = cpu_to_be64(key);
> > + int ret;
> > +
> > + cmd[0] = PERSISTENT_RESERVE_OUT;
> > + cmd[1] = PRO_REGISTER;
> > + cmd[8] = sizeof(buf);
> > + memcpy(&buf[8], &key_be, sizeof(key_be));
> > +
> > + ret = scsi_SG_IO(s->conf.blk, SG_DXFER_TO_DEV, cmd, sizeof(cmd),
> > + buf, sizeof(buf), s->io_timeout, errp);
> > + if (ret < 0) {
> > + error_prepend(errp, "PERSISTENT RESERVE OUT with REGISTER");
> > + return false;
> > + }
> > + return true;
> > +}
> > +
> > +static bool scsi_generic_pr_preempt(SCSIDevice *s, uint64_t key, uint8_t resv_type, Error **errp)
> > +{
> > + uint8_t cmd[10] = {};
> > + uint8_t buf[24] = {};
> > + uint64_t key_be = cpu_to_be64(key);
> > + int ret;
> > +
> > + cmd[0] = PERSISTENT_RESERVE_OUT;
> > + cmd[1] = PRO_PREEMPT;
> > + cmd[2] = resv_type & 0xf;
> > + cmd[8] = sizeof(buf);
> > + memcpy(&buf[0], &key_be, sizeof(key_be));
> > + memcpy(&buf[8], &key_be, sizeof(key_be));
> > +
> > + ret = scsi_SG_IO(s->conf.blk, SG_DXFER_TO_DEV, cmd, sizeof(cmd),
> > + buf, sizeof(buf), s->io_timeout, errp);
> > + if (ret < 0) {
> > + error_prepend(errp, "PERSISTENT RESERVE OUT with PREEMPT");
> > + return false;
> > + }
> > + return true;
> > +}
> > +
> > +/* Register keys and preempt reservations after live migration */
> > +bool scsi_generic_pr_state_preempt(SCSIDevice *s, Error **errp)
> > +{
> > + SCSIPRState *pr_state = &s->pr_state;
> > + uint64_t key;
> > + uint8_t resv_type;
> > +
> > + WITH_QEMU_LOCK_GUARD(&pr_state->mutex) {
> > + key = pr_state->key;
> > + resv_type = pr_state->resv_type;
> > + }
> > +
> > + trace_scsi_generic_pr_state_preempt(key, resv_type);
> > +
> > + if (key) {
> > + if (!scsi_generic_pr_register(s, key, errp)) {
> > + return false;
> > + }
> > +
> > + /*
> > + * Two cases:
> > + *
> > + * 1. There is no reservation (resv_type is 0) and the other I_T nexus
> > + * will be unregistered. This is important so the source host does
> > + * not leak registered keys across live migration.
> > + *
> > + * 2. There is a reservation (resv_type is not 0) and the other I_T
> > + * nexus will be unregistered and its reservation is atomically
> > + * taken over by us. This is the scenario where a reservation is
> > + * migrated along with the guest.
> > + */
> > + if (!scsi_generic_pr_preempt(s, key, resv_type, errp)) {
> > + return false;
> > + }
> > + }
> > + return true;
> > +}
> > +
> > static void scsi_read_complete(void * opaque, int ret)
> > {
> > SCSIGenericReq *r = (SCSIGenericReq *)opaque;
> > diff --git a/hw/scsi/trace-events b/hw/scsi/trace-events
> > index ff92fff7c5..a8ac1e7f1d 100644
> > --- a/hw/scsi/trace-events
> > +++ b/hw/scsi/trace-events
> > @@ -391,3 +391,4 @@ scsi_generic_aio_sgio_command(uint32_t tag, uint8_t cmd, uint32_t timeout) "gene
> > scsi_generic_ioctl_sgio_command(uint8_t cmd, uint32_t timeout) "generic ioctl sgio: cmd=0x%x timeout=%u"
> > scsi_generic_ioctl_sgio_done(uint8_t cmd, int ret, uint8_t status, uint8_t host_status) "generic ioctl sgio: cmd=0x%x ret=%d status=0x%x host_status=0x%x"
> > scsi_generic_persistent_reserve_out_reply(uint8_t service_action, uint8_t resv_type, uint64_t old_key, uint64_t new_key) "persistent reserve out reply service_action=%u resv_type=%u old_key=0x%" PRIx64 " new_key=0x%" PRIx64
> > +scsi_generic_pr_state_preempt(uint64_t key, uint8_t resv_type) "key=0x%" PRIx64 " resv_type=%u"
> > --
> > 2.52.0
> >
> >
>
> With regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>
* Re: [PATCH v2 4/5] scsi: save/load SCSI reservation state
2026-01-26 22:13 ` Stefan Hajnoczi
@ 2026-01-27 8:54 ` Daniel P. Berrangé
2026-01-27 14:41 ` Stefan Hajnoczi
0 siblings, 1 reply; 12+ messages in thread
From: Daniel P. Berrangé @ 2026-01-27 8:54 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: qemu-devel, pkrempa, Paolo Bonzini, Alberto Faria,
Hannes Reinecke, Zhao Liu, Kevin Wolf, qemu-block, Fam Zheng,
Philippe Mathieu-Daudé, Eduardo Habkost, Qing Wang,
Yanan Wang, Marcel Apfelbaum
On Mon, Jan 26, 2026 at 05:13:43PM -0500, Stefan Hajnoczi wrote:
> On Mon, Jan 26, 2026 at 07:59:55PM +0000, Daniel P. Berrangé wrote:
> > On Mon, Jan 26, 2026 at 02:47:34PM -0500, Stefan Hajnoczi wrote:
> > > Add a vmstate subsection to SCSIDiskState so that scsi-block devices can
> > > transfer their reservation state during live migration. Upon loading the
> > > subsection, the destination QEMU invokes the PERSISTENT RESERVE OUT
> > > command's PREEMPT service action to atomically move the reservation from
> > > the source I_T nexus to the destination I_T nexus. This results in
> > > transparent live migration of SCSI reservations.
> > >
> > > This approach is incomplete since SCSI reservations are cooperative and
> > > other hosts could interfere. Neither the source QEMU nor the destination
> > > QEMU are aware of changes made by other hosts. The assumption is that
> > > reservation is not taken over by a third host without cooperation from
> > > the source host.
> > >
> > > I considered adding the vmstate subsection to SCSIDevice instead of
> > > SCSIDiskState, since reservations are part of the SCSI Primary Commands
> > > that other devices apart from disks could support. However, due to
> > > fragility of migrating reservations, we will probably limit support to
> > > scsi-block and maybe scsi-disk in the future. In the end, I think it
> > > makes sense to place this within scsi-disk.c.
> > >
> > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > ---
> > > include/hw/scsi/scsi.h | 1 +
> > > hw/core/machine.c | 4 ++-
> > > hw/scsi/scsi-disk.c | 81 ++++++++++++++++++++++++++++++++++++++++-
> > > hw/scsi/scsi-generic.c | 82 ++++++++++++++++++++++++++++++++++++++++++
> > > hw/scsi/trace-events | 1 +
> > > 5 files changed, 167 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
> > > index c5ec58089b..a3e246dbd9 100644
> > > --- a/include/hw/scsi/scsi.h
> > > +++ b/include/hw/scsi/scsi.h
> > > @@ -253,6 +253,7 @@ SCSIDevice *scsi_device_get(SCSIBus *bus, int channel, int target, int lun);
> > >
> > > /* scsi-generic.c. */
> > > extern const SCSIReqOps scsi_generic_req_ops;
> > > +bool scsi_generic_pr_state_preempt(SCSIDevice *s, Error **errp);
> > >
> > > /* scsi-disk.c */
> > > #define SCSI_DISK_QUIRK_MODE_PAGE_APPLE_VENDOR 0
> > > diff --git a/hw/core/machine.c b/hw/core/machine.c
> > > index 6411e68856..16134f8ce5 100644
> > > --- a/hw/core/machine.c
> > > +++ b/hw/core/machine.c
> > > @@ -38,7 +38,9 @@
> > > #include "hw/acpi/generic_event_device.h"
> > > #include "qemu/audio.h"
> > >
> > > -GlobalProperty hw_compat_10_2[] = {};
> > > +GlobalProperty hw_compat_10_2[] = {
> > > + { "scsi-block", "migrate-pr", "off" },
> > > +};
> > > const size_t hw_compat_10_2_len = G_N_ELEMENTS(hw_compat_10_2);
> > >
> > > GlobalProperty hw_compat_10_1[] = {
> > > diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> > > index 76fe5f085b..8845ab1192 100644
> > > --- a/hw/scsi/scsi-disk.c
> > > +++ b/hw/scsi/scsi-disk.c
> > > @@ -28,6 +28,7 @@
> > > #include "qemu/hw-version.h"
> > > #include "qemu/memalign.h"
> > > #include "hw/scsi/scsi.h"
> > > +#include "migration/misc.h"
> > > #include "migration/qemu-file-types.h"
> > > #include "migration/vmstate.h"
> > > #include "hw/scsi/emulation.h"
> > > @@ -122,6 +123,7 @@ struct SCSIDiskState {
> > > */
> > > uint16_t rotation_rate;
> > > bool migrate_emulated_scsi_request;
> > > + NotifierWithReturn migration_notifier;
> > > };
> > >
> > > static void scsi_free_request(SCSIRequest *req)
> > > @@ -2737,6 +2739,25 @@ static SCSIRequest *scsi_new_request(SCSIDevice *d, uint32_t tag, uint32_t lun,
> > > }
> > >
> > > #ifdef __linux__
> > > +/*
> > > + * Preempt on the SCSI Persistent Reservation on the source when migration
> > > + * fails because the destination may have already preempted and we need to get
> > > + * the reservation back.
> > > + */
> > > +static int scsi_block_migration_notifier(NotifierWithReturn *notifier,
> > > + MigrationEvent *e, Error **errp)
> > > +{
> > > + if (e->type == MIG_EVENT_PRECOPY_FAILED) {
> > > + SCSIDiskState *s =
> > > + container_of(notifier, SCSIDiskState, migration_notifier);
> > > + SCSIDevice *d = &s->qdev;
> > > +
> > > + /* MIG_EVENT_PRECOPY_FAILED cannot fail, so ignore errors */
> > > + scsi_generic_pr_state_preempt(d, NULL);
> >
> > I feel like we ought to 'warn_report' any errors related to failing
> > to acquire persistent reservations.
> >
> > In the unlikely event an error occurs, whoever has to deal with
> > the resulting support ticket will want to know something went wrong
> > from the QEMU logs.
>
> Good idea.
>
> I'm also not sure how to best approach logging in general. Usually QEMU
> does little logging when the VM is running, but it has become
> increasingly difficult to get information out of QEMU via tracing or
> monitor commands since nowadays QEMU might be running in a locked down
> container. Debugging PR migration issues would be easiest if the trace
> events introduced in this series were actually printfs to stderr, but
> that's not traditionally how QEMU did things.
I wouldn't overthink this - just pass a local "Error *err" object
instead of NULL, and then warn_report_err on the result. It just needs
to be a marker in the logs that something has gone wrong, which we could
not propagate to the mgmt app in the normal manner since we have no
error path here.
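Roughly this shape, as a standalone sketch rather than a patch. The
mock_* helpers and the Error struct below are hypothetical stand-ins so
it compiles outside the QEMU tree; in the real notifier they would be
scsi_generic_pr_state_preempt() and warn_report_err(), and the
simulate_failure flag exists only to exercise the failure path.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for QEMU's Error object. */
typedef struct Error {
    char msg[128];
} Error;

static int warnings_emitted;

/* Stand-in for setting an error; a NULL errp means the caller ignores it. */
static void mock_error_set(Error **errp, const char *msg)
{
    if (!errp) {
        return;
    }
    *errp = malloc(sizeof(Error));
    if (*errp) {
        snprintf((*errp)->msg, sizeof((*errp)->msg), "%s", msg);
    }
}

/* Stand-in for warn_report_err(): print the warning, then free the error. */
static void mock_warn_report_err(Error *err)
{
    if (!err) {
        return;
    }
    fprintf(stderr, "warning: %s\n", err->msg);
    warnings_emitted++;
    free(err);
}

/* Stand-in for scsi_generic_pr_state_preempt(); fails on request. */
static bool mock_pr_state_preempt(bool simulate_failure, Error **errp)
{
    if (simulate_failure) {
        mock_error_set(errp, "PERSISTENT RESERVE OUT with PREEMPT failed");
        return false;
    }
    return true;
}

/* Shape of the notifier body for the failed-migration event: the error
 * cannot be propagated, so it is reported as a warning instead of being
 * dropped by passing NULL. */
static void on_precopy_failed(bool simulate_failure)
{
    Error *local_err = NULL;

    if (!mock_pr_state_preempt(simulate_failure, &local_err)) {
        mock_warn_report_err(local_err);
    }
}
```

The notifier still returns success either way; the only change is that
a failure leaves a line in the log.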
Debugging /anything/ in QEMU is easiest if the trace events were
actually prints, but that thinking leads to us enabling tracing
all the time, everywhere, which is impractical.
At the same time it is common for apps to have some level of
"always on" debugging collected and so we perhaps do have a
general conceptual gap in QEMU.
If we think of trace events as being equivalent to "DEBUG" level
log events, would it help if we could annotate a subset of
trace events as "INFO" level and do something with them by
default? E.g. perhaps we have a ring buffer tracing target that
collects "INFO" events continuously and can be asked to dump
out its state in error scenarios?
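To make the idea a bit more concrete, a minimal sketch of such a ring
buffer. All names here (info_ring, RING_SLOTS, MSG_LEN) are made up for
illustration and are not an existing tracing backend API; a real target
would also need locking and timestamps, which are left out.

```c
#include <stddef.h>
#include <stdio.h>

#define RING_SLOTS 4   /* hypothetical capacity: how many events are kept */
#define MSG_LEN    64  /* hypothetical maximum formatted event length */

struct info_ring {
    char msgs[RING_SLOTS][MSG_LEN];
    unsigned long next; /* total number of events recorded so far */
};

/* Record one "INFO" event, overwriting the oldest slot once full. */
static void info_ring_record(struct info_ring *r, const char *msg)
{
    snprintf(r->msgs[r->next % RING_SLOTS], MSG_LEN, "%s", msg);
    r->next++;
}

/* Fill out[] with the retained events, oldest first; returns the count. */
static size_t info_ring_dump(const struct info_ring *r,
                             const char *out[RING_SLOTS])
{
    unsigned long count = r->next < RING_SLOTS ? r->next : RING_SLOTS;
    unsigned long start = r->next - count;

    for (unsigned long i = 0; i < count; i++) {
        out[i] = r->msgs[(start + i) % RING_SLOTS];
    }
    return count;
}
```

An error path would call info_ring_dump() and print the returned
strings, so only the last few events before the failure are paid for.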
> > > @@ -3209,6 +3240,46 @@ static const Property scsi_hd_properties[] = {
> > > DEFINE_BLOCK_CHS_PROPERTIES(SCSIDiskState, qdev.conf),
> > > };
> > >
> > > +#ifdef __linux__
> > > +static bool scsi_disk_pr_state_post_load_errp(void *opaque, int version_id, Error **errp)
> > > +{
> > > + SCSIDiskState *s = opaque;
> > > + SCSIDevice *dev = &s->qdev;
> > > +
> > > + return scsi_generic_pr_state_preempt(dev, errp);
> > > +}
> >
> > What if there are multiple SCSI disks, and on the target host
> > we successfully acquire the reservation for the 1st, but fail
> > the second? Are we ignoring the failure of the second, or
> > are we discarding the reservation of the 1st and failing
> > migration? Should we warn_report here too?
>
> Each disk is its own scsi-block device.
> scsi_disk_pr_state_post_load_errp() will be invoked for each such
> device. If one of these calls fails, the migration will abort and the
> errp here will be reported.
Oh right, the errp is fine as that will go up to libvirt and thus
be recorded.
> The rollback on the source host doesn't care which devices
> succeeded/failed, it performs idempotent PREEMPT operations for all
> devices so we're sure all reservations are back on the source host after
> a failed migration.
Ok
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: [PATCH v2 4/5] scsi: save/load SCSI reservation state
2026-01-27 8:54 ` Daniel P. Berrangé
@ 2026-01-27 14:41 ` Stefan Hajnoczi
2026-01-27 14:47 ` Daniel P. Berrangé
0 siblings, 1 reply; 12+ messages in thread
From: Stefan Hajnoczi @ 2026-01-27 14:41 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: qemu-devel, pkrempa, Paolo Bonzini, Alberto Faria,
Hannes Reinecke, Zhao Liu, Kevin Wolf, qemu-block, Fam Zheng,
Philippe Mathieu-Daudé, Eduardo Habkost, Qing Wang,
Yanan Wang, Marcel Apfelbaum
On Tue, Jan 27, 2026 at 08:54:24AM +0000, Daniel P. Berrangé wrote:
> On Mon, Jan 26, 2026 at 05:13:43PM -0500, Stefan Hajnoczi wrote:
> > On Mon, Jan 26, 2026 at 07:59:55PM +0000, Daniel P. Berrangé wrote:
> > > On Mon, Jan 26, 2026 at 02:47:34PM -0500, Stefan Hajnoczi wrote:
> > > > Add a vmstate subsection to SCSIDiskState so that scsi-block devices can
> > > > transfer their reservation state during live migration. Upon loading the
> > > > subsection, the destination QEMU invokes the PERSISTENT RESERVE OUT
> > > > command's PREEMPT service action to atomically move the reservation from
> > > > the source I_T nexus to the destination I_T nexus. This results in
> > > > transparent live migration of SCSI reservations.
> > > >
> > > > This approach is incomplete since SCSI reservations are cooperative and
> > > > other hosts could interfere. Neither the source QEMU nor the destination
> > > > QEMU are aware of changes made by other hosts. The assumption is that
> > > > reservation is not taken over by a third host without cooperation from
> > > > the source host.
> > > >
> > > > I considered adding the vmstate subsection to SCSIDevice instead of
> > > > SCSIDiskState, since reservations are part of the SCSI Primary Commands
> > > > that other devices apart from disks could support. However, due to
> > > > fragility of migrating reservations, we will probably limit support to
> > > > scsi-block and maybe scsi-disk in the future. In the end, I think it
> > > > makes sense to place this within scsi-disk.c.
> > > >
> > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > ---
> > > > include/hw/scsi/scsi.h | 1 +
> > > > hw/core/machine.c | 4 ++-
> > > > hw/scsi/scsi-disk.c | 81 ++++++++++++++++++++++++++++++++++++++++-
> > > > hw/scsi/scsi-generic.c | 82 ++++++++++++++++++++++++++++++++++++++++++
> > > > hw/scsi/trace-events | 1 +
> > > > 5 files changed, 167 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
> > > > index c5ec58089b..a3e246dbd9 100644
> > > > --- a/include/hw/scsi/scsi.h
> > > > +++ b/include/hw/scsi/scsi.h
> > > > @@ -253,6 +253,7 @@ SCSIDevice *scsi_device_get(SCSIBus *bus, int channel, int target, int lun);
> > > >
> > > > /* scsi-generic.c. */
> > > > extern const SCSIReqOps scsi_generic_req_ops;
> > > > +bool scsi_generic_pr_state_preempt(SCSIDevice *s, Error **errp);
> > > >
> > > > /* scsi-disk.c */
> > > > #define SCSI_DISK_QUIRK_MODE_PAGE_APPLE_VENDOR 0
> > > > diff --git a/hw/core/machine.c b/hw/core/machine.c
> > > > index 6411e68856..16134f8ce5 100644
> > > > --- a/hw/core/machine.c
> > > > +++ b/hw/core/machine.c
> > > > @@ -38,7 +38,9 @@
> > > > #include "hw/acpi/generic_event_device.h"
> > > > #include "qemu/audio.h"
> > > >
> > > > -GlobalProperty hw_compat_10_2[] = {};
> > > > +GlobalProperty hw_compat_10_2[] = {
> > > > + { "scsi-block", "migrate-pr", "off" },
> > > > +};
> > > > const size_t hw_compat_10_2_len = G_N_ELEMENTS(hw_compat_10_2);
> > > >
> > > > GlobalProperty hw_compat_10_1[] = {
> > > > diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> > > > index 76fe5f085b..8845ab1192 100644
> > > > --- a/hw/scsi/scsi-disk.c
> > > > +++ b/hw/scsi/scsi-disk.c
> > > > @@ -28,6 +28,7 @@
> > > > #include "qemu/hw-version.h"
> > > > #include "qemu/memalign.h"
> > > > #include "hw/scsi/scsi.h"
> > > > +#include "migration/misc.h"
> > > > #include "migration/qemu-file-types.h"
> > > > #include "migration/vmstate.h"
> > > > #include "hw/scsi/emulation.h"
> > > > @@ -122,6 +123,7 @@ struct SCSIDiskState {
> > > > */
> > > > uint16_t rotation_rate;
> > > > bool migrate_emulated_scsi_request;
> > > > + NotifierWithReturn migration_notifier;
> > > > };
> > > >
> > > > static void scsi_free_request(SCSIRequest *req)
> > > > @@ -2737,6 +2739,25 @@ static SCSIRequest *scsi_new_request(SCSIDevice *d, uint32_t tag, uint32_t lun,
> > > > }
> > > >
> > > > #ifdef __linux__
> > > > +/*
> > > > + * Preempt on the SCSI Persistent Reservation on the source when migration
> > > > + * fails because the destination may have already preempted and we need to get
> > > > + * the reservation back.
> > > > + */
> > > > +static int scsi_block_migration_notifier(NotifierWithReturn *notifier,
> > > > + MigrationEvent *e, Error **errp)
> > > > +{
> > > > + if (e->type == MIG_EVENT_PRECOPY_FAILED) {
> > > > + SCSIDiskState *s =
> > > > + container_of(notifier, SCSIDiskState, migration_notifier);
> > > > + SCSIDevice *d = &s->qdev;
> > > > +
> > > > + /* MIG_EVENT_PRECOPY_FAILED cannot fail, so ignore errors */
> > > > + scsi_generic_pr_state_preempt(d, NULL);
> > >
> > > I feel like we ought to 'warn_report' any errors related to failing
> > > to acquire persistent reservations.
> > >
> > > In the unlikely event an error occurs, whomever has to deal with
> > > the resulting support ticket will want to know something went wrong
> > > from the QEMU logs.
> >
> > Good idea.
> >
> > I'm also not sure how to best approach logging in general. Usually QEMU
> > does little logging when the VM is running, but it has become
> > increasingly difficult to get information out of QEMU via tracing or
> > monitor commands since nowadays QEMU might be running in a locked down
> > container. Debugging PR migration issues would be easiest if the trace
> > events introduced in this series were actually printfs to stderr, but
> > that's not traditionally how QEMU did things.
>
> I wouldn't overthink this - just pass a local "Error *err" object
> instead of NULL, and then warn_report_err on the result. Just needs
> to be a marker that something has gone wrong, which we could
> not propagate to the mgmt app in the normal manner since we have no
> error path here.
Yes, sounds good.
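A minimal sketch of that pattern follows. It is not the actual patch: the
`*_stub` names are hypothetical stand-ins so that the example compiles on its
own. In QEMU the real calls would be scsi_generic_pr_state_preempt() from this
series and warn_report_err(), which reports the error and frees it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for QEMU's Error object (hypothetical, for illustration only). */
typedef struct Error {
    const char *msg;
} Error;

static int warnings_reported; /* counts calls to the warn stub */

/* Stand-in for error_setg(): allocate an error if the caller wants one. */
static void error_set_stub(Error **errp, const char *msg)
{
    if (errp) {
        Error *err = malloc(sizeof(*err));
        assert(err);
        err->msg = msg;
        *errp = err;
    }
}

/* Stand-in for warn_report_err(): print the warning and free the error. */
static void warn_report_err_stub(Error *err)
{
    fprintf(stderr, "warning: %s\n", err->msg);
    warnings_reported++;
    free(err);
}

/* Stand-in for scsi_generic_pr_state_preempt(); fails when ok is false. */
static bool pr_state_preempt_stub(bool ok, Error **errp)
{
    if (!ok) {
        error_set_stub(errp, "PREEMPT failed");
    }
    return ok;
}

/* The failure notifier cannot fail, so warn instead of propagating. */
static int migration_failed_notifier_stub(bool preempt_ok)
{
    Error *local_err = NULL;

    if (!pr_state_preempt_stub(preempt_ok, &local_err)) {
        warn_report_err_stub(local_err);
    }
    return 0;
}
```

The notifier still returns 0 in both cases; the only change in behavior is
that a failed preempt leaves a warning behind.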
>
> Debugging /anything/ in QEMU is easiest if the trace events were
> actually prints, but that thinking leads to us enabling tracing
> all the time, everywhere which is impractical
>
> At the same time it is common for apps to have some level of
> "always on" debugging collected and so we perhaps do have a
> general conceptual gap in QEMU.
>
> If we think of trace events as being equiv to "DEBUG" level
> log events, would it help if we could annotate a subset of
> trace events as "INFO" level and do something with them by
> default. eg perhaps we have a ring buffer tracing target that
> collects "INFO" events continuously and can be asked to dump
> out its state in error scenarios ?
I need to research SystemTap inside container scenarios more first.
Ideally we could rely on the DTrace probes, but it needs to be easy for
users to collect information from their containers. I suspect there will
be permission problems as well as multiple steps required to just find
the container where QEMU is running, which makes this hard for users.
Stefan
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 4/5] scsi: save/load SCSI reservation state
2026-01-27 14:41 ` Stefan Hajnoczi
@ 2026-01-27 14:47 ` Daniel P. Berrangé
0 siblings, 0 replies; 12+ messages in thread
From: Daniel P. Berrangé @ 2026-01-27 14:47 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: qemu-devel, pkrempa, Paolo Bonzini, Alberto Faria,
Hannes Reinecke, Zhao Liu, Kevin Wolf, qemu-block, Fam Zheng,
Philippe Mathieu-Daudé, Eduardo Habkost, Qing Wang,
Yanan Wang, Marcel Apfelbaum
On Tue, Jan 27, 2026 at 09:41:59AM -0500, Stefan Hajnoczi wrote:
> On Tue, Jan 27, 2026 at 08:54:24AM +0000, Daniel P. Berrangé wrote:
> > On Mon, Jan 26, 2026 at 05:13:43PM -0500, Stefan Hajnoczi wrote:
> > > On Mon, Jan 26, 2026 at 07:59:55PM +0000, Daniel P. Berrangé wrote:
> > > > On Mon, Jan 26, 2026 at 02:47:34PM -0500, Stefan Hajnoczi wrote:
> > > > > Add a vmstate subsection to SCSIDiskState so that scsi-block devices can
> > > > > transfer their reservation state during live migration. Upon loading the
> > > > > subsection, the destination QEMU invokes the PERSISTENT RESERVE OUT
> > > > > command's PREEMPT service action to atomically move the reservation from
> > > > > the source I_T nexus to the destination I_T nexus. This results in
> > > > > transparent live migration of SCSI reservations.
> > > > >
> > > > > This approach is incomplete since SCSI reservations are cooperative and
> > > > > other hosts could interfere. Neither the source QEMU nor the destination
> > > > > QEMU are aware of changes made by other hosts. The assumption is that
> > > > > reservation is not taken over by a third host without cooperation from
> > > > > the source host.
> > > > >
> > > > > I considered adding the vmstate subsection to SCSIDevice instead of
> > > > > SCSIDiskState, since reservations are part of the SCSI Primary Commands
> > > > > that other devices apart from disks could support. However, due to
> > > > > fragility of migrating reservations, we will probably limit support to
> > > > > scsi-block and maybe scsi-disk in the future. In the end, I think it
> > > > > makes sense to place this within scsi-disk.c.
> > > > >
> > > > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > ---
> > > > > include/hw/scsi/scsi.h | 1 +
> > > > > hw/core/machine.c | 4 ++-
> > > > > hw/scsi/scsi-disk.c | 81 ++++++++++++++++++++++++++++++++++++++++-
> > > > > hw/scsi/scsi-generic.c | 82 ++++++++++++++++++++++++++++++++++++++++++
> > > > > hw/scsi/trace-events | 1 +
> > > > > 5 files changed, 167 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
> > > > > index c5ec58089b..a3e246dbd9 100644
> > > > > --- a/include/hw/scsi/scsi.h
> > > > > +++ b/include/hw/scsi/scsi.h
> > > > > @@ -253,6 +253,7 @@ SCSIDevice *scsi_device_get(SCSIBus *bus, int channel, int target, int lun);
> > > > >
> > > > > /* scsi-generic.c. */
> > > > > extern const SCSIReqOps scsi_generic_req_ops;
> > > > > +bool scsi_generic_pr_state_preempt(SCSIDevice *s, Error **errp);
> > > > >
> > > > > /* scsi-disk.c */
> > > > > #define SCSI_DISK_QUIRK_MODE_PAGE_APPLE_VENDOR 0
> > > > > diff --git a/hw/core/machine.c b/hw/core/machine.c
> > > > > index 6411e68856..16134f8ce5 100644
> > > > > --- a/hw/core/machine.c
> > > > > +++ b/hw/core/machine.c
> > > > > @@ -38,7 +38,9 @@
> > > > > #include "hw/acpi/generic_event_device.h"
> > > > > #include "qemu/audio.h"
> > > > >
> > > > > -GlobalProperty hw_compat_10_2[] = {};
> > > > > +GlobalProperty hw_compat_10_2[] = {
> > > > > + { "scsi-block", "migrate-pr", "off" },
> > > > > +};
> > > > > const size_t hw_compat_10_2_len = G_N_ELEMENTS(hw_compat_10_2);
> > > > >
> > > > > GlobalProperty hw_compat_10_1[] = {
> > > > > diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> > > > > index 76fe5f085b..8845ab1192 100644
> > > > > --- a/hw/scsi/scsi-disk.c
> > > > > +++ b/hw/scsi/scsi-disk.c
> > > > > @@ -28,6 +28,7 @@
> > > > > #include "qemu/hw-version.h"
> > > > > #include "qemu/memalign.h"
> > > > > #include "hw/scsi/scsi.h"
> > > > > +#include "migration/misc.h"
> > > > > #include "migration/qemu-file-types.h"
> > > > > #include "migration/vmstate.h"
> > > > > #include "hw/scsi/emulation.h"
> > > > > @@ -122,6 +123,7 @@ struct SCSIDiskState {
> > > > > */
> > > > > uint16_t rotation_rate;
> > > > > bool migrate_emulated_scsi_request;
> > > > > + NotifierWithReturn migration_notifier;
> > > > > };
> > > > >
> > > > > static void scsi_free_request(SCSIRequest *req)
> > > > > @@ -2737,6 +2739,25 @@ static SCSIRequest *scsi_new_request(SCSIDevice *d, uint32_t tag, uint32_t lun,
> > > > > }
> > > > >
> > > > > #ifdef __linux__
> > > > > +/*
> > > > > + * Preempt on the SCSI Persistent Reservation on the source when migration
> > > > > + * fails because the destination may have already preempted and we need to get
> > > > > + * the reservation back.
> > > > > + */
> > > > > +static int scsi_block_migration_notifier(NotifierWithReturn *notifier,
> > > > > + MigrationEvent *e, Error **errp)
> > > > > +{
> > > > > + if (e->type == MIG_EVENT_PRECOPY_FAILED) {
> > > > > + SCSIDiskState *s =
> > > > > + container_of(notifier, SCSIDiskState, migration_notifier);
> > > > > + SCSIDevice *d = &s->qdev;
> > > > > +
> > > > > + /* MIG_EVENT_PRECOPY_FAILED cannot fail, so ignore errors */
> > > > > + scsi_generic_pr_state_preempt(d, NULL);
> > > >
> > > > I feel like we ought to 'warn_report' any errors related to failing
> > > > to acquire persistent reservations.
> > > >
> > > > In the unlikely event an error occurs, whomever has to deal with
> > > > the resulting support ticket will want to know something went wrong
> > > > from the QEMU logs.
> > >
> > > Good idea.
> > >
> > > I'm also not sure how to best approach logging in general. Usually QEMU
> > > does little logging when the VM is running, but it has become
> > > increasingly difficult to get information out of QEMU via tracing or
> > > monitor commands since nowadays QEMU might be running in a locked down
> > > container. Debugging PR migration issues would be easiest if the trace
> > > events introduced in this series were actually printfs to stderr, but
> > > that's not traditionally how QEMU did things.
> >
> > I wouldn't overthink this - just pass a local "Error *err" object
> > instead of NULL, and then warn_report_err on the result. Just needs
> > to be a marker that something has gone wrong, which we could
> > not propagate to the mgmt app in the normal manner since we have no
> > error path here.
>
> Yes, sounds good.
>
> >
> > Debugging /anything/ in QEMU is easiest if the trace events were
> > actually prints, but that thinking leads to us enabling tracing
> > all the time, everywhere which is impractical
> >
> > At the same time it is common for apps to have some level of
> > "always on" debugging collected and so we perhaps do have a
> > general conceptual gap in QEMU.
> >
> > If we think of trace events as being equiv to "DEBUG" level
> > log events, would it help if we could annotate a subset of
> > trace events as "INFO" level and do something with them by
> > default. eg perhaps we have a ring buffer tracing target that
> > collects "INFO" events continuously and can be asked to dump
> > out its state in error scenarios ?
>
> I need to research SystemTap inside container scenarios more first.
> Ideally we could rely on the DTrace probes, but it needs to be easy for
> users to collect information from their containers. I suspect there will
> be permission problems as well as multiple steps required to just find
> the container where QEMU is running, which makes this hard for users.
Agreed, I'd expect that using systemtap with a containerized QEMU would
generally involve an account outside the container on the host OS.
While it might be possible to open up the kernel / seccomp perms
in just the right way to get things working inside the container,
that's unlikely to be permitted in any public cloud solution, and
probably also blocked in many other places. I think this pushes
towards the need for a more traditional "logging" like facility.
We do have the "log" backend already of course, but it is not necessarily
enabled by distros; that's something they need to consider.
An always-on ring buffer would be more like a "flight recorder"
scenario to reduce the overheads compared to actually logging to
disk / external service until a failure scenario arrives.
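To make the idea concrete, here is a rough sketch of such a flight recorder.
Nothing like this exists in QEMU's tracing code today as far as this thread
is concerned; the `FlightRecorder` type, the `flight_rec_*` functions and the
slot sizes are all made up for illustration. Events overwrite the oldest slot
and are only written out when a dump is requested.

```c
#include <assert.h>
#include <stdio.h>

#define FLIGHT_REC_SLOTS   4   /* hypothetical capacity */
#define FLIGHT_REC_MSG_LEN 64  /* hypothetical max event length */

typedef struct {
    char msgs[FLIGHT_REC_SLOTS][FLIGHT_REC_MSG_LEN];
    unsigned long total; /* events recorded since startup */
} FlightRecorder;

/* Record one event, overwriting the oldest slot once the buffer is full. */
static void flight_rec_record(FlightRecorder *fr, const char *msg)
{
    assert(fr && msg);
    snprintf(fr->msgs[fr->total % FLIGHT_REC_SLOTS],
             FLIGHT_REC_MSG_LEN, "%s", msg);
    fr->total++;
}

/* Write the retained events, oldest first; returns how many were written. */
static unsigned flight_rec_dump(const FlightRecorder *fr, FILE *out)
{
    unsigned long n = fr->total < FLIGHT_REC_SLOTS ? fr->total
                                                   : FLIGHT_REC_SLOTS;

    for (unsigned long i = fr->total - n; i < fr->total; i++) {
        fprintf(out, "[%lu] %s\n", i, fr->msgs[i % FLIGHT_REC_SLOTS]);
    }
    return (unsigned)n;
}
```

An error path would call flight_rec_dump(&fr, stderr) to emit the last few
"INFO" events; in the common case nothing is written anywhere.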
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v2 5/5] docs: add SCSI migrate-pr documentation
2026-01-26 19:47 [PATCH v2 0/5] scsi: persistent reservation live migration Stefan Hajnoczi
` (3 preceding siblings ...)
2026-01-26 19:47 ` [PATCH v2 4/5] scsi: save/load SCSI reservation state Stefan Hajnoczi
@ 2026-01-26 19:47 ` Stefan Hajnoczi
2026-01-26 21:55 ` [PATCH v2 0/5] scsi: persistent reservation live migration Stefan Hajnoczi
5 siblings, 0 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2026-01-26 19:47 UTC (permalink / raw)
To: qemu-devel
Cc: pkrempa, Paolo Bonzini, Alberto Faria, Hannes Reinecke, Zhao Liu,
Kevin Wolf, qemu-block, Fam Zheng, Philippe Mathieu-Daudé,
Eduardo Habkost, Qing Wang, Yanan Wang, Marcel Apfelbaum,
Stefan Hajnoczi
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
docs/system/device-emulation.rst | 1 +
docs/system/devices/scsi/index.rst | 10 +++++
docs/system/devices/scsi/migrate-pr.rst | 54 +++++++++++++++++++++++++
3 files changed, 65 insertions(+)
create mode 100644 docs/system/devices/scsi/index.rst
create mode 100644 docs/system/devices/scsi/migrate-pr.rst
diff --git a/docs/system/device-emulation.rst b/docs/system/device-emulation.rst
index 971325527a..40054bb7df 100644
--- a/docs/system/device-emulation.rst
+++ b/docs/system/device-emulation.rst
@@ -95,6 +95,7 @@ Emulated Devices
devices/keyboard.rst
devices/net.rst
devices/nvme.rst
+ devices/scsi/index.rst
devices/usb-u2f.rst
devices/usb.rst
devices/vfio-user.rst
diff --git a/docs/system/devices/scsi/index.rst b/docs/system/devices/scsi/index.rst
new file mode 100644
index 0000000000..4f0929b0ca
--- /dev/null
+++ b/docs/system/devices/scsi/index.rst
@@ -0,0 +1,10 @@
+SCSI Devices
+============
+
+Several SCSI devices are available in QEMU. They are primarily used for block
+storage.
+
+.. toctree::
+ :maxdepth: 1
+
+ migrate-pr.rst
diff --git a/docs/system/devices/scsi/migrate-pr.rst b/docs/system/devices/scsi/migrate-pr.rst
new file mode 100644
index 0000000000..a8f2790a86
--- /dev/null
+++ b/docs/system/devices/scsi/migrate-pr.rst
@@ -0,0 +1,54 @@
+..
+ SPDX-License-Identifier: GPL-2.0-or-later
+
+.. _scsi_migrate_pr:
+
+SCSI Persistent Reservation Live Migration
+==========================================
+
+This document explains how to live migrate SCSI Persistent Reservations.
+
+The ``scsi-block`` device migrates SCSI Persistent Reservations when the
+``migrate-pr=on`` parameter is given. Migration is enabled by default in
+versioned machine types since QEMU 11.0. It is disabled by default on older
+machine types and needs to be explicitly enabled with ``--device
+scsi-block,migrate-pr=on,...``.
+
+When migration is enabled, QEMU snoops PERSISTENT RESERVE OUT commands and
+tracks the reservation key registered by the guest as well as reservations that
+the guest acquires. This information is migrated along with the guest and the
+destination QEMU submits a PERSISTENT RESERVE OUT command with the PREEMPT
+service action to atomically transfer the reservation to the destination before
+the guest starts running on the destination.
+
+The following persistent reservation capabilities reported by the PERSISTENT
+RESERVE IN command with the REPORT CAPABILITIES service action are masked
+from the guest by QEMU when migration is enabled:
+
+ * Specify Initiator Ports Capable (SIP_C)
+ * All Target Ports Capable (ATC_C)
+
+When migration is disabled, the ``scsi-block`` device is live migrated but
+reservations remain in place on the source. Usually this is not the intended
+behavior unless there is another mechanism to update reservations during
+migration. The PERSISTENT RESERVE IN command also does not mask
+capabilities reported to the guest when migration is disabled.
+
+Limitations
+-----------
+
+QEMU does not remember snooped reservation details across restart, so software
+inside the guest must acquire the reservation after boot in order for live
+migration to work. Similarly, if the reservation is acquired outside the guest
+then it will not live migrate along with the guest.
+
+Snooping only considers the PERSISTENT RESERVE OUT commands from the guest
+and does not track reservation changes made by other SCSI initiators. QEMU's
+snooped reservation details can become stale if another SCSI initiator
+makes changes to the reservation.
+
+Guests running on the same host share a single SCSI initiator identity unless
+Fibre Channel N_Port ID Virtualization is configured. As a consequence,
+multiple guests on the same host may observe unexpected behavior if they use
+the same physical LUN. From the LUN's perspective all guests are the same
+initiator and there is no way to distinguish between guests.
--
2.52.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v2 0/5] scsi: persistent reservation live migration
2026-01-26 19:47 [PATCH v2 0/5] scsi: persistent reservation live migration Stefan Hajnoczi
` (4 preceding siblings ...)
2026-01-26 19:47 ` [PATCH v2 5/5] docs: add SCSI migrate-pr documentation Stefan Hajnoczi
@ 2026-01-26 21:55 ` Stefan Hajnoczi
5 siblings, 0 replies; 12+ messages in thread
From: Stefan Hajnoczi @ 2026-01-26 21:55 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: qemu-devel, pkrempa, Paolo Bonzini, Alberto Faria,
Hannes Reinecke, Zhao Liu, Kevin Wolf, qemu-block, Fam Zheng,
Philippe Mathieu-Daudé, Eduardo Habkost, Qing Wang,
Yanan Wang, Marcel Apfelbaum
This series requires Peter Xu's "[PATCH v2 0/5] migration: Notifier
fixes for 11.0" to ensure that the rollback operation happens before
the guest resumes on the migration source.
https://lore.kernel.org/qemu-devel/20260126213614.3815900-1-peterx@redhat.com/T/#t
Stefan
^ permalink raw reply [flat|nested] 12+ messages in thread