* [PATCH v7 0/1] Implement the NVMe reservation feature
@ 2024-02-01 2:32 Guixin Liu
2024-02-01 2:32 ` [PATCH v7 1/1] nvmet: support " Guixin Liu
` (2 more replies)
0 siblings, 3 replies; 29+ messages in thread
From: Guixin Liu @ 2024-02-01 2:32 UTC (permalink / raw)
To: sagi, hch, kch; +Cc: linux-nvme
Hi guys,
I've implemented the NVMe reservation feature. Please review it, all
comments are welcome as usual.
Changes from v6 to v7:
- Handle "reservation notification mask" feature command to mask reservation
log.
- Add all the registrants that need to be freed to a temporary list fist,
and then after calling synchronize_rcu(), release all the registrants on the
temporary list.
- Fix the resv log page is random when there is no resv log page.
- Change nvmet_is_host_still_connected() to nvmet_is_host_connected().
- Remove nvmet_pr_set_rtype_and_holder() and change nvmet_pr_create_new_resv()
to nvmet_pr_create_new_reservation().
- Change nvmet_pr_find_registrant_by_hostid() to nvmet_pr_find_registrant().
- Change nvmet_pr_send_resv_released() to nvmet_pr_resv_released().
- Change __nvmet_pr_unregister_one() to nvmet_pr_unregister_one().
- In nvmet_pr_unreg_by_prkey(), nvmet_pr_unreg_by_prkey_except_hostid() and
nvmet_pr_unreg_except_hostid(), first do unregistering and then do event sending.
Changes from v5 to v6:
- Use synchronize_rcu() and kfree() to free registrant instead of kfree_rcu().
- Remove nvmet_pr_register_check_rkey(), put the check into pr_lock warp.
And refactor the nvmet_pr_register().
- Add the print fmt to the head.
- Add lockdep_is_held(&pr->pr_lock) condition to list_for_each_entry_rcu.
- Fix the bug in nvmet_pr_update_reg_attr(), when the change_attr hook
return fail, we should not replace the holder.
Changes from v4 to v5:
- Use rculist macros to handle registration_list instead of list macros
regardless of in mutex lock or not.
- Use goto statement instead of return in nvmet_is_host_still_connected
and __nvmet_pr_unregister_one.
- Add lockdep_assert_held and rcu_read_lock_held assert to many functions,
if it's necessary.
- Add a comment to nvmet_execute_get_log_page_resv to explain how lost_count
works.
- In nvmet_pr_clear, we should set holder to NULL first, I fixed this.
- Unify nvmet_pr_update_holder_rtype and __nvmet_pr_do_replace to
nvmet_pr_update_reg_attr.
- Fix wrong nr_pages in nvmet_execute_get_log_page_resv.
- Fix the deadlock issue of nvmet_pr_exit_ns, put it out of the subsys lock.
Changes from v3 to v4:
- Use kfifo to handle resv log page instead of list, and also limit the
resv log queue to 64.
- Change the function calling alignment style to:
nvmet_pr_send_event_by_hostid(pr, hostid,
NVME_PR_LOG_RESERVATOPM_PREEMPTED);
- Put kmalloc out of rcu_read_lock in nvmet_execute_pr_report().
- Remove the goto in __nvmet_pr_unregister_one().
- Change generation to atomic_t, and remove nvmet_pr_inc_generation().
- In addtion, the number2 patch "nvmet: unify aer type enum" is not
relate with this patch, so I will send it separately.
Changes from v2 to v3:
- Use rcu instead of rwlock to make IO path run faster, and put the rtype
into the struct nvmet_pr_registrant.
- Limit the resv_log_list to 128.
- Change generation to atomic64.
- Put register rkey check to a warpper.
- Change nr_avl_pages to nr_pages.
- Use NVME_SC_SUCCESS instead of 0.
- Change kmalloc param to let it not sleep in mutex lock.
Changes from v1 to v2:
- Implement the reservation notification report, includes registration
preempted, reservation released and reservation preempted.
And also handle the reservation log page available event and send get
reservation log page command to clear log page at host.
- Put the reservation check access after validate opcode. And remove
opcodes which nvmet not implement yet check.
Now there is no admin opcode nvmet implemented needs reservation check,
so I dont add reservation check to admin command path.
Next we need to do reservation check includes the situation of nsid is
0xffffffff at each admin command path, if it is needed.
- Add reservation commands support in nvmet_get_cmd_effects_nvm().
- From Chaitanya, change the local variable tree style to make it cleaner,
and add some comments about NVMe spec.
And also change others advice from chaitanya.
- Put the nvmet_pr_check_cmd_access and nvmet_parse_pr_cmd into reservation
enable check warp.
- Remove kmem_cache instead to use kmalloc and kfree.
- Change others advice from Sagi.
- Add a blktest test case, this patch will be sent before these series of
patches.
Guixin Liu (1):
nvmet: support reservation feature
drivers/nvme/target/Makefile | 2 +-
drivers/nvme/target/admin-cmd.c | 20 +-
drivers/nvme/target/configfs.c | 27 +
drivers/nvme/target/core.c | 51 +-
drivers/nvme/target/nvmet.h | 36 ++
drivers/nvme/target/pr.c | 1041 +++++++++++++++++++++++++++++++
include/linux/nvme.h | 54 ++
7 files changed, 1224 insertions(+), 7 deletions(-)
create mode 100644 drivers/nvme/target/pr.c
--
2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread* [PATCH v7 1/1] nvmet: support reservation feature 2024-02-01 2:32 [PATCH v7 0/1] Implement the NVMe reservation feature Guixin Liu @ 2024-02-01 2:32 ` Guixin Liu 2024-02-28 0:40 ` Sagi Grimberg 2024-02-18 2:12 ` [PATCH v7 0/1] Implement the NVMe " Guixin Liu 2024-02-29 2:57 ` Guixin Liu 2 siblings, 1 reply; 29+ messages in thread From: Guixin Liu @ 2024-02-01 2:32 UTC (permalink / raw) To: sagi, hch, kch; +Cc: linux-nvme This patch implements the reservation feature, includes: 1. reservation register(register, unregister and replace). 2. reservation acquire(acquire, preempt, preempt and abort). 3. reservation release(release and clear). 4. reservation report. And also make reservation configurable, one can set ns to support reservation before enable ns. The default of resv_enable is false. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> --- drivers/nvme/target/Makefile | 2 +- drivers/nvme/target/admin-cmd.c | 20 +- drivers/nvme/target/configfs.c | 27 + drivers/nvme/target/core.c | 51 +- drivers/nvme/target/nvmet.h | 36 ++ drivers/nvme/target/pr.c | 1041 +++++++++++++++++++++++++++++++ include/linux/nvme.h | 54 ++ 7 files changed, 1224 insertions(+), 7 deletions(-) create mode 100644 drivers/nvme/target/pr.c diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile index c66820102493..f9bfc904a5b3 100644 --- a/drivers/nvme/target/Makefile +++ b/drivers/nvme/target/Makefile @@ -10,7 +10,7 @@ obj-$(CONFIG_NVME_TARGET_FCLOOP) += nvme-fcloop.o obj-$(CONFIG_NVME_TARGET_TCP) += nvmet-tcp.o nvmet-y += core.o configfs.o admin-cmd.o fabrics-cmd.o \ - discovery.o io-cmd-file.o io-cmd-bdev.o + discovery.o io-cmd-file.o io-cmd-bdev.o pr.o nvmet-$(CONFIG_NVME_TARGET_PASSTHRU) += passthru.o nvmet-$(CONFIG_BLK_DEV_ZONED) += zns.o nvmet-$(CONFIG_NVME_TARGET_AUTH) += fabrics-cmd-auth.o auth.o diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c index 39cb570f833d..eb1d85701441 100644 --- a/drivers/nvme/target/admin-cmd.c +++ b/drivers/nvme/target/admin-cmd.c @@ -176,6 +176,10 @@ static void nvmet_get_cmd_effects_nvm(struct nvme_effects_log *log) log->iocs[nvme_cmd_read] = log->iocs[nvme_cmd_flush] = log->iocs[nvme_cmd_dsm] = + log->iocs[nvme_cmd_resv_acquire] = + log->iocs[nvme_cmd_resv_register] = + log->iocs[nvme_cmd_resv_release] = + log->iocs[nvme_cmd_resv_report] = cpu_to_le32(NVME_CMD_EFFECTS_CSUPP); log->iocs[nvme_cmd_write] = log->iocs[nvme_cmd_write_zeroes] = @@ -340,6 +344,8 @@ static void nvmet_execute_get_log_page(struct nvmet_req *req) return nvmet_execute_get_log_cmd_effects_ns(req); case NVME_LOG_ANA: return nvmet_execute_get_log_page_ana(req); + case NVME_LOG_RESERVATION: + return nvmet_execute_get_log_page_resv(req); } pr_debug("unhandled lid %d on qid %d\n", req->cmd->get_log_page.lid, req->sq->qid); @@ -550,7 +556,13 @@ static void nvmet_execute_identify_ns(struct nvmet_req *req) */ id->nmic = NVME_NS_NMIC_SHARED; id->anagrpid = cpu_to_le32(req->ns->anagrpid); - + id->rescap = NVME_PR_SUPPORT_WRITE_EXCLUSIVE | + NVME_PR_SUPPORT_EXCLUSIVE_ACCESS | + NVME_PR_SUPPORT_WRITE_EXCLUSIVE_REG_ONLY | + NVME_PR_SUPPORT_EXCLUSIVE_ACCESS_REG_ONLY | + NVME_PR_SUPPORT_WRITE_EXCLUSIVE_ALL_REGS | + NVME_PR_SUPPORT_EXCLUSIVE_ACCESS_ALL_REGS | + NVME_PR_SUPPORT_IEKEY_VER_1_3_DEF; memcpy(&id->nguid, &req->ns->nguid, sizeof(id->nguid)); id->lbaf[0].ds = req->ns->blksize_shift; @@ -851,6 +863,9 @@ void nvmet_execute_set_features(struct nvmet_req *req) case NVME_FEAT_WRITE_PROTECT: status = nvmet_set_feat_write_protect(req); break; + case NVME_FEAT_RESV_MASK: + status = nvmet_set_feat_resv_notif_mask(req, cdw11); + break; default: req->error_loc = offsetof(struct nvme_common_command, cdw10); status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; @@ -949,6 +964,9 @@ void nvmet_execute_get_features(struct nvmet_req *req) case NVME_FEAT_WRITE_PROTECT: status = nvmet_get_feat_write_protect(req); break; + case NVME_FEAT_RESV_MASK: + status = nvmet_get_feat_resv_notif_mask(req); + break; default: req->error_loc = offsetof(struct nvme_common_command, cdw10); diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c index 2482a0db2504..56aead89d21a 100644 --- a/drivers/nvme/target/configfs.c +++ b/drivers/nvme/target/configfs.c @@ -714,6 +714,32 @@ static ssize_t nvmet_ns_revalidate_size_store(struct config_item *item, CONFIGFS_ATTR_WO(nvmet_ns_, revalidate_size); +static ssize_t nvmet_ns_resv_enable_show(struct config_item *item, char *page) +{ + return sprintf(page, "%d\n", to_nvmet_ns(item)->pr.enable); +} + +static ssize_t nvmet_ns_resv_enable_store(struct config_item *item, + const char *page, size_t count) +{ + struct nvmet_ns *ns = to_nvmet_ns(item); + bool val; + + if (kstrtobool(page, &val)) + return -EINVAL; + + mutex_lock(&ns->subsys->lock); + if (ns->enabled) { + pr_err("the ns:%d is already enabled.\n", ns->nsid); + mutex_unlock(&ns->subsys->lock); + return -EINVAL; + } + ns->pr.enable = val; + mutex_unlock(&ns->subsys->lock); + return count; +} +CONFIGFS_ATTR(nvmet_ns_, resv_enable); + static struct configfs_attribute *nvmet_ns_attrs[] = { &nvmet_ns_attr_device_path, &nvmet_ns_attr_device_nguid, @@ -722,6 +748,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = { &nvmet_ns_attr_enable, &nvmet_ns_attr_buffered_io, &nvmet_ns_attr_revalidate_size, + &nvmet_ns_attr_resv_enable, #ifdef CONFIG_PCI_P2PDMA &nvmet_ns_attr_p2pmem, #endif diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index d26aa30f8702..5bdb7001b45a 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -591,6 +591,8 @@ int nvmet_ns_enable(struct nvmet_ns *ns) if (ns->nsid > subsys->max_nsid) subsys->max_nsid = ns->nsid; + nvmet_pr_init_ns(ns); + ret = xa_insert(&subsys->namespaces, ns->nsid, ns, GFP_KERNEL); if (ret) goto out_restore_subsys_maxnsid; @@ -646,6 +648,7 @@ void nvmet_ns_disable(struct nvmet_ns *ns) synchronize_rcu(); wait_for_completion(&ns->disable_done); percpu_ref_exit(&ns->ref); + nvmet_pr_exit_ns(ns); mutex_lock(&subsys->lock); @@ -904,18 +907,34 @@ static u16 nvmet_parse_io_cmd(struct nvmet_req *req) return ret; } + if (req->ns->pr.enable) { + ret = nvmet_parse_pr_cmd(req); + if (!ret) + return ret; + } + switch (req->ns->csi) { case NVME_CSI_NVM: if (req->ns->file) - return nvmet_file_parse_io_cmd(req); - return nvmet_bdev_parse_io_cmd(req); + ret = nvmet_file_parse_io_cmd(req); + else + ret = nvmet_bdev_parse_io_cmd(req); + break; case NVME_CSI_ZNS: if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) - return nvmet_bdev_zns_parse_io_cmd(req); - return NVME_SC_INVALID_IO_CMD_SET; + ret = nvmet_bdev_zns_parse_io_cmd(req); + else + ret = NVME_SC_INVALID_IO_CMD_SET; + break; default: - return NVME_SC_INVALID_IO_CMD_SET; + ret = NVME_SC_INVALID_IO_CMD_SET; } + if (ret) + return ret; + + if (req->ns->pr.enable) + ret = nvmet_pr_check_cmd_access(req); + return ret; } bool nvmet_req_init(struct nvmet_req *req, struct nvmet_cq *cq, @@ -1231,6 +1250,22 @@ static void nvmet_init_cap(struct nvmet_ctrl *ctrl) nvmet_passthrough_override_cap(ctrl); } +bool nvmet_is_host_connected(uuid_t *hostid, struct nvmet_subsys *subsys) +{ + struct nvmet_ctrl *ctrl; + bool ret = false; + + mutex_lock(&subsys->lock); + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) { + if (uuid_equal(hostid, &ctrl->hostid)) { + ret = true; + break; + } + } + mutex_unlock(&subsys->lock); + return ret; +} + struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn, const char *hostnqn, u16 cntlid, struct nvmet_req *req) @@ -1447,6 +1482,11 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn, ctrl->err_counter = 0; spin_lock_init(&ctrl->error_lock); + ctrl->pr_log_mgr.counter = 0; + ctrl->pr_log_mgr.lost_count = 0; + mutex_init(&ctrl->pr_log_mgr.lock); + INIT_KFIFO(ctrl->pr_log_mgr.log_queue); + nvmet_start_keep_alive_timer(ctrl); mutex_lock(&subsys->lock); @@ -1485,6 +1525,7 @@ static void nvmet_ctrl_free(struct kref *ref) cancel_work_sync(&ctrl->fatal_err_work); nvmet_destroy_auth(ctrl); + nvmet_ctrl_destroy_pr(ctrl); ida_free(&cntlid_ida, ctrl->cntlid); diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index 6c8acebe1a1a..4f8416e7353d 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -20,6 +20,7 @@ #include <linux/blkdev.h> #include <linux/radix-tree.h> #include <linux/t10-pi.h> +#include <linux/kfifo.h> #define NVMET_DEFAULT_VS NVME_VS(1, 3, 0) @@ -30,6 +31,7 @@ #define NVMET_MN_MAX_SIZE 40 #define NVMET_SN_MAX_SIZE 20 #define NVMET_FR_MAX_SIZE 8 +#define NVMET_PR_LOG_QUEUE_SIZE 64 /* * Supported optional AENs: @@ -56,6 +58,22 @@ #define IPO_IATTR_CONNECT_SQE(x) \ (cpu_to_le32(offsetof(struct nvmf_connect_command, x))) +struct nvmet_pr_registrant { + u64 rkey; + uuid_t hostid; + enum nvme_pr_type rtype; + struct list_head entry; +}; + +struct nvmet_pr { + bool enable; + unsigned long notify_mask; + atomic_t generation; + struct nvmet_pr_registrant __rcu *holder; + struct mutex pr_lock; + struct list_head registrant_list; +}; + struct nvmet_ns { struct percpu_ref ref; struct bdev_handle *bdev_handle; @@ -85,6 +103,7 @@ struct nvmet_ns { int pi_type; int metadata_size; u8 csi; + struct nvmet_pr pr; }; static inline struct nvmet_ns *to_nvmet_ns(struct config_item *item) @@ -190,6 +209,13 @@ static inline bool nvmet_port_secure_channel_required(struct nvmet_port *port) return nvmet_port_disc_addr_treq_secure_channel(port) == NVMF_TREQ_REQUIRED; } +struct nvmet_pr_log_mgr { + struct mutex lock; + u64 lost_count; + u64 counter; + DECLARE_KFIFO(log_queue, struct nvme_pr_log, NVMET_PR_LOG_QUEUE_SIZE); +}; + struct nvmet_ctrl { struct nvmet_subsys *subsys; struct nvmet_sq **sqs; @@ -243,6 +269,7 @@ struct nvmet_ctrl { u8 *dh_key; size_t dh_keysize; #endif + struct nvmet_pr_log_mgr pr_log_mgr; }; struct nvmet_subsys { @@ -750,4 +777,13 @@ static inline bool nvmet_has_auth(struct nvmet_ctrl *ctrl) static inline const char *nvmet_dhchap_dhgroup_name(u8 dhgid) { return NULL; } #endif +void nvmet_pr_init_ns(struct nvmet_ns *ns); +u16 nvmet_parse_pr_cmd(struct nvmet_req *req); +u16 nvmet_pr_check_cmd_access(struct nvmet_req *req); +void nvmet_ctrl_destroy_pr(struct nvmet_ctrl *ctrl); +void nvmet_pr_exit_ns(struct nvmet_ns *ns); +bool nvmet_is_host_connected(uuid_t *hostid, struct nvmet_subsys *subsys); +void nvmet_execute_get_log_page_resv(struct nvmet_req *req); +u16 nvmet_set_feat_resv_notif_mask(struct nvmet_req *req, u32 mask); +u16 nvmet_get_feat_resv_notif_mask(struct nvmet_req *req); #endif /* _NVMET_H */ diff --git a/drivers/nvme/target/pr.c b/drivers/nvme/target/pr.c new file mode 100644 index 000000000000..d3a25ff40365 --- /dev/null +++ b/drivers/nvme/target/pr.c @@ -0,0 +1,1041 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * NVMe over Fabrics Persist Reservation. + * Copyright (c) 2024 Guixin Liu, Alibaba Group. + * All rights reserved. + */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt +#include "nvmet.h" +#include <linux/slab.h> +#include <asm/unaligned.h> +#include <linux/lockdep.h> + +#define NVMET_PR_NOTIFI_MASK_ALL \ + (1 << NVME_PR_NOTIFY_BIT_REG_PREEMPTED | \ + 1 << NVME_PR_NOTIFY_BIT_RESV_RELEASED | \ + 1 << NVME_PR_NOTIFY_BIT_RESV_PREEMPTED) + +struct nvmet_pr_register_data { + __le64 crkey; + __le64 nrkey; +}; + +struct nvmet_pr_acquire_data { + __le64 crkey; + __le64 prkey; +}; + +struct nvmet_pr_release_data { + __le64 crkey; +}; + +static inline struct nvmet_ns *nvmet_pr_to_ns(struct nvmet_pr *pr) +{ + return container_of(pr, struct nvmet_ns, pr); +} + +static u16 nvmet_pr_validate_rtype(u8 rtype) +{ + if (rtype < NVME_PR_WRITE_EXCLUSIVE || + rtype > NVME_PR_EXCLUSIVE_ACCESS_ALL_REGS) + return NVME_SC_INVALID_FIELD | NVME_SC_DNR; + + return NVME_SC_SUCCESS; +} + +static struct nvmet_pr_registrant * +nvmet_pr_find_registrant(struct nvmet_pr *pr, uuid_t *hostid) +{ + struct nvmet_pr_registrant *reg; + + WARN_ON_ONCE(!rcu_read_lock_held() && + !lockdep_is_held(&pr->pr_lock)); + + list_for_each_entry_rcu(reg, &pr->registrant_list, entry, + rcu_read_lock_held() || + lockdep_is_held(&pr->pr_lock)) { + if (uuid_equal(®->hostid, hostid)) + return reg; + } + return NULL; +} + +u16 nvmet_set_feat_resv_notif_mask(struct nvmet_req *req, u32 mask) +{ + u32 nsid = le32_to_cpu(req->cmd->common.nsid); + struct nvmet_ctrl *ctrl = req->sq->ctrl; + struct nvmet_ns *ns; + unsigned long idx; + u16 status; + + if (mask & ~(NVMET_PR_NOTIFI_MASK_ALL)) { + req->error_loc = offsetof(struct nvme_common_command, cdw11); + return NVME_SC_INVALID_FIELD | NVME_SC_DNR; + } + + if (nsid != U32_MAX) { + status = nvmet_req_find_ns(req); + if (status) + return status; + if (!req->ns->pr.enable) + return NVME_SC_INVALID_FIELD | NVME_SC_DNR; + + WRITE_ONCE(req->ns->pr.notify_mask, mask); + goto success; + } + + xa_for_each(&ctrl->subsys->namespaces, idx, ns) { + if (ns->pr.enable) + WRITE_ONCE(ns->pr.notify_mask, mask); + } + +success: + nvmet_set_result(req, mask); + return NVME_SC_SUCCESS; +} + +u16 nvmet_get_feat_resv_notif_mask(struct nvmet_req *req) +{ + u16 status; + + status = nvmet_req_find_ns(req); + if (status) + return status; + + if (!req->ns->pr.enable) + return NVME_SC_INVALID_FIELD | NVME_SC_DNR; + + nvmet_set_result(req, READ_ONCE(req->ns->pr.notify_mask)); + return status; +} + +void nvmet_execute_get_log_page_resv(struct nvmet_req *req) +{ + struct nvmet_pr_log_mgr *log_mgr = &req->sq->ctrl->pr_log_mgr; + struct nvme_pr_log next_log = {0}; + struct nvme_pr_log log = {0}; + u16 status = NVME_SC_SUCCESS; + u64 lost_count; + u64 cur_count; + u64 next_count; + + mutex_lock(&log_mgr->lock); + if (!kfifo_get(&log_mgr->log_queue, &log)) + goto out; + + /* + * We can't get the last in kfifo. + * Utilize the current count and the count from the next log to + * calculate the number of lost logs, while also addressing cases + * of overflow. If there is no subsequent log, the number of lost + * logs is equal to the lost_count within the nvmet_pr_log_mgr. + */ + cur_count = le64_to_cpu(log.count); + if (kfifo_peek(&log_mgr->log_queue, &next_log)) { + next_count = le64_to_cpu(next_log.count); + if (next_count > cur_count) + lost_count = next_count - cur_count - 1; + else + lost_count = U64_MAX - cur_count + next_count - 1; + } else { + lost_count = log_mgr->lost_count; + } + + log.count = cpu_to_le64(cur_count + lost_count); + log_mgr->lost_count -= lost_count; + + log.nr_pages = kfifo_len(&log_mgr->log_queue); + +out: + status = nvmet_copy_to_sgl(req, 0, &log, sizeof(log)); + mutex_unlock(&log_mgr->lock); + nvmet_req_complete(req, status); +} + +static void nvmet_pr_add_resv_log(struct nvmet_ctrl *ctrl, u8 log_type, u32 nsid) +{ + struct nvmet_pr_log_mgr *log_mgr = &ctrl->pr_log_mgr; + struct nvme_pr_log log = {0}; + + mutex_lock(&log_mgr->lock); + log_mgr->counter++; + if (log_mgr->counter == 0) + log_mgr->counter = 1; + + log.count = cpu_to_le64(log_mgr->counter); + log.type = log_type; + log.nsid = cpu_to_le32(nsid); + + if (!kfifo_put(&log_mgr->log_queue, log)) { + pr_warn("a reservation log lost, cntlid:%d, log_type:%d, nsid:%d\n", + ctrl->cntlid, log_type, nsid); + log_mgr->lost_count++; + } + + mutex_unlock(&log_mgr->lock); +} + +static void nvmet_pr_resv_released(struct nvmet_pr *pr, uuid_t *hostid) +{ + struct nvmet_ns *ns = nvmet_pr_to_ns(pr); + struct nvmet_subsys *subsys = ns->subsys; + struct nvmet_ctrl *ctrl; + + if (test_bit(NVME_PR_NOTIFY_BIT_RESV_RELEASED, &pr->notify_mask)) + return; + + mutex_lock(&subsys->lock); + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) { + if (!uuid_equal(&ctrl->hostid, hostid) && + nvmet_pr_find_registrant(pr, &ctrl->hostid)) { + nvmet_pr_add_resv_log(ctrl, + NVME_PR_LOG_RESERVATION_RELEASED, ns->nsid); + nvmet_add_async_event(ctrl, NVME_AER_CSS, + NVME_AEN_RESV_LOG_PAGE_AVALIABLE, + NVME_LOG_RESERVATION); + } + } + mutex_unlock(&subsys->lock); +} + +static void nvmet_pr_send_event_by_hostid(struct nvmet_pr *pr, uuid_t *hostid, + u8 log_type) +{ + struct nvmet_ns *ns = nvmet_pr_to_ns(pr); + struct nvmet_subsys *subsys = ns->subsys; + struct nvmet_ctrl *ctrl; + + mutex_lock(&subsys->lock); + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) { + if (uuid_equal(hostid, &ctrl->hostid)) { + nvmet_pr_add_resv_log(ctrl, log_type, ns->nsid); + nvmet_add_async_event(ctrl, NVME_AER_CSS, + NVME_AEN_RESV_LOG_PAGE_AVALIABLE, + NVME_LOG_RESERVATION); + } + } + mutex_unlock(&subsys->lock); +} + +static void nvmet_pr_resv_preempted(struct nvmet_pr *pr, uuid_t *hostid) +{ + if (test_bit(NVME_PR_NOTIFY_BIT_RESV_PREEMPTED, &pr->notify_mask)) + return; + + nvmet_pr_send_event_by_hostid(pr, hostid, + NVME_PR_LOG_RESERVATOPM_PREEMPTED); +} + +static void nvmet_pr_registration_preempted(struct nvmet_pr *pr, + uuid_t *hostid) +{ + if (test_bit(NVME_PR_NOTIFY_BIT_REG_PREEMPTED, &pr->notify_mask)) + return; + + nvmet_pr_send_event_by_hostid(pr, hostid, + NVME_PR_LOG_REGISTRATION_PREEMPTED); +} + +static u16 nvmet_pr_create_new_reservation(struct nvmet_pr *pr, u8 new_rtype, + struct nvmet_pr_registrant *reg) +{ + u16 status; + + status = nvmet_pr_validate_rtype(new_rtype); + if (!status) { + reg->rtype = new_rtype; + rcu_assign_pointer(pr->holder, reg); + } + return status; +} + +static u16 nvmet_pr_register(struct nvmet_req *req, + struct nvmet_pr_register_data *d) +{ + struct nvmet_ctrl *ctrl = req->sq->ctrl; + struct nvmet_pr_registrant *new, *reg; + struct nvmet_pr *pr = &req->ns->pr; + u16 status = NVME_SC_SUCCESS; + + new = kmalloc(sizeof(*new), GFP_KERNEL); + if (!new) + return NVME_SC_INTERNAL; + + mutex_lock(&pr->pr_lock); + reg = nvmet_pr_find_registrant(pr, &ctrl->hostid); + if (reg) { + if (reg->rkey != le64_to_cpu(d->nrkey)) + status = NVME_SC_RESERVATION_CONFLICT | NVME_SC_DNR; + kfree(new); + goto out; + } + + memset(new, 0, sizeof(*new)); + INIT_LIST_HEAD(&new->entry); + new->rkey = le64_to_cpu(d->nrkey); + uuid_copy(&new->hostid, &ctrl->hostid); + list_add_tail_rcu(&new->entry, &pr->registrant_list); + +out: + mutex_unlock(&pr->pr_lock); + return status; +} + +static void nvmet_pr_unregister_one(struct nvmet_pr *pr, + struct nvmet_pr_registrant *reg) +{ + struct nvmet_pr_registrant *first_reg; + struct nvmet_pr_registrant *holder; + u8 original_rtype; + + lockdep_assert_held(&pr->pr_lock); + list_del_rcu(®->entry); + + holder = rcu_dereference_protected(pr->holder, + lockdep_is_held(&pr->pr_lock)); + if (reg != holder) + goto out; + + original_rtype = holder->rtype; + if (original_rtype == NVME_PR_WRITE_EXCLUSIVE_ALL_REGS || + original_rtype == NVME_PR_EXCLUSIVE_ACCESS_ALL_REGS) { + first_reg = list_first_or_null_rcu(&pr->registrant_list, + struct nvmet_pr_registrant, entry); + if (first_reg) + first_reg->rtype = original_rtype; + rcu_assign_pointer(pr->holder, first_reg); + } else { + rcu_assign_pointer(pr->holder, NULL); + + if (original_rtype == NVME_PR_WRITE_EXCLUSIVE_REG_ONLY || + original_rtype == NVME_PR_WRITE_EXCLUSIVE_REG_ONLY) + nvmet_pr_resv_released(pr, ®->hostid); + } +out: + synchronize_rcu(); + kfree(reg); +} + +static u16 nvmet_pr_unregister(struct nvmet_req *req, + struct nvmet_pr_register_data *d, + bool ignore_key) +{ + u16 status = NVME_SC_RESERVATION_CONFLICT | NVME_SC_DNR; + struct nvmet_ctrl *ctrl = req->sq->ctrl; + struct nvmet_pr *pr = &req->ns->pr; + struct nvmet_pr_registrant *reg; + + mutex_lock(&pr->pr_lock); + list_for_each_entry_rcu(reg, &pr->registrant_list, entry, + lockdep_is_held(&pr->pr_lock)) { + if (uuid_equal(®->hostid, &ctrl->hostid)) { + if (ignore_key || reg->rkey == le64_to_cpu(d->crkey)) { + status = NVME_SC_SUCCESS; + nvmet_pr_unregister_one(pr, reg); + } + break; + } + } + mutex_unlock(&pr->pr_lock); + + return status; +} + +static u16 nvmet_pr_update_reg_rkey(struct nvmet_pr_registrant *reg, + void *attr) +{ + reg->rkey = *(u64 *)attr; + return NVME_SC_SUCCESS; +} + +static u16 nvmet_pr_update_reg_attr(struct nvmet_pr *pr, + struct nvmet_pr_registrant *reg, + u16 (*change_attr)(struct nvmet_pr_registrant *reg, void *attr), + void *attr) +{ + struct nvmet_pr_registrant *holder; + struct nvmet_pr_registrant *new; + u16 status; + + lockdep_assert_held(&pr->pr_lock); + holder = rcu_dereference_protected(pr->holder, + lockdep_is_held(&pr->pr_lock)); + if (reg != holder) + return change_attr(reg, attr); + + new = kmalloc(sizeof(*new), GFP_ATOMIC); + if (!new) + return NVME_SC_INTERNAL; + + new->rkey = holder->rkey; + new->rtype = holder->rtype; + uuid_copy(&new->hostid, &holder->hostid); + INIT_LIST_HEAD(&new->entry); + + status = change_attr(new, attr); + if (!status) { + list_replace_rcu(&holder->entry, &new->entry); + rcu_assign_pointer(pr->holder, new); + synchronize_rcu(); + kfree(holder); + } else { + kfree(new); + } + + return status; +} + +static u16 nvmet_pr_replace(struct nvmet_req *req, + struct nvmet_pr_register_data *d, + bool ignore_key) +{ + u16 status = NVME_SC_RESERVATION_CONFLICT | NVME_SC_DNR; + struct nvmet_ctrl *ctrl = req->sq->ctrl; + struct nvmet_pr *pr = &req->ns->pr; + struct nvmet_pr_registrant *reg; + u64 nrkey = le64_to_cpu(d->nrkey); + + mutex_lock(&pr->pr_lock); + list_for_each_entry_rcu(reg, &pr->registrant_list, entry, + lockdep_is_held(&pr->pr_lock)) { + if (uuid_equal(®->hostid, &ctrl->hostid)) { + if (ignore_key || reg->rkey == le64_to_cpu(d->crkey)) + status = nvmet_pr_update_reg_attr(pr, reg, + nvmet_pr_update_reg_rkey, + &nrkey); + break; + } + } + mutex_unlock(&pr->pr_lock); + return status; +} + +static void nvmet_execute_pr_register(struct nvmet_req *req) +{ + u32 cdw10 = le32_to_cpu(req->cmd->common.cdw10); + bool ignore_key = (bool)((cdw10 >> 3) & 1); /* Ignore existing key, bit 03 */ + struct nvmet_pr_register_data *d; + u8 reg_act = cdw10 & 0x07; /* Reservation Register Action, bit 02:00 */ + u16 status; + + d = kmalloc(sizeof(*d), GFP_KERNEL); + if (!d) { + status = NVME_SC_INTERNAL; + goto out; + } + + status = nvmet_copy_from_sgl(req, 0, d, sizeof(*d)); + if (status) + goto free_data; + + switch (reg_act) { + case NVME_PR_REGISTER_ACT_REG: + status = nvmet_pr_register(req, d); + break; + case NVME_PR_REGISTER_ACT_UNREG: + status = nvmet_pr_unregister(req, d, ignore_key); + break; + case NVME_PR_REGISTER_ACT_REPLACE: + status = nvmet_pr_replace(req, d, ignore_key); + break; + default: + req->error_loc = offsetof(struct nvme_common_command, cdw10); + status = NVME_SC_INVALID_OPCODE | NVME_SC_DNR; + break; + } +free_data: + kfree(d); +out: + if (!status) + atomic_inc(&req->ns->pr.generation); + nvmet_req_complete(req, status); +} + +static u16 nvmet_pr_acquire(struct nvmet_req *req, + struct nvmet_pr_registrant *reg, + u8 rtype) +{ + struct nvmet_pr *pr = &req->ns->pr; + struct nvmet_pr_registrant *holder; + + lockdep_assert_held(&pr->pr_lock); + holder = rcu_dereference_protected(pr->holder, + lockdep_is_held(&pr->pr_lock)); + if (holder && reg != holder) + return NVME_SC_RESERVATION_CONFLICT | NVME_SC_DNR; + if (holder && reg == holder) { + if (holder->rtype == rtype) + return NVME_SC_SUCCESS; + return NVME_SC_RESERVATION_CONFLICT | NVME_SC_DNR; + } + + return nvmet_pr_create_new_reservation(pr, rtype, reg); +} + +static u16 nvmet_pr_unreg_by_prkey(struct nvmet_pr *pr, u64 prkey, + uuid_t *send_hostid) +{ + u16 status = NVME_SC_RESERVATION_CONFLICT | NVME_SC_DNR; + struct nvmet_pr_registrant *reg, *tmp; + uuid_t hostid; + + lockdep_assert_held(&pr->pr_lock); + + list_for_each_entry_safe(reg, tmp, &pr->registrant_list, entry) { + if (reg->rkey == prkey) { + status = NVME_SC_SUCCESS; + uuid_copy(&hostid, ®->hostid); + nvmet_pr_unregister_one(pr, reg); + if (!uuid_equal(®->hostid, send_hostid)) + nvmet_pr_registration_preempted(pr, &hostid); + } + } + return status; +} + +static u16 nvmet_pr_unreg_by_prkey_except_hostid(struct nvmet_pr *pr, u64 prkey, + uuid_t *send_hostid) +{ + u16 status = NVME_SC_RESERVATION_CONFLICT | NVME_SC_DNR; + struct nvmet_pr_registrant *reg, *tmp; + uuid_t hostid; + + lockdep_assert_held(&pr->pr_lock); + + list_for_each_entry_safe(reg, tmp, &pr->registrant_list, entry) { + if (reg->rkey == prkey && + !uuid_equal(®->hostid, send_hostid)) { + status = NVME_SC_SUCCESS; + uuid_copy(&hostid, ®->hostid); + nvmet_pr_unregister_one(pr, reg); + nvmet_pr_registration_preempted(pr, &hostid); + } + } + return status; +} + +static void nvmet_pr_unreg_except_hostid(struct nvmet_pr *pr, + uuid_t *send_hostid) +{ + struct nvmet_pr_registrant *reg, *tmp; + uuid_t hostid; + + lockdep_assert_held(&pr->pr_lock); + + list_for_each_entry_safe(reg, tmp, &pr->registrant_list, entry) { + if (!uuid_equal(®->hostid, send_hostid)) { + uuid_copy(&hostid, ®->hostid); + nvmet_pr_unregister_one(pr, reg); + nvmet_pr_registration_preempted(pr, &hostid); + } + } +} + +static u16 nvmet_pr_update_holder_rtype(struct nvmet_pr_registrant *reg, + void *attr) +{ + u8 new_rtype = *(u8 *)attr; + u16 status; + + status = nvmet_pr_validate_rtype(new_rtype); + if (status) + return status; + + reg->rtype = new_rtype; + return NVME_SC_SUCCESS; +} + +static u16 nvmet_pr_preempt(struct nvmet_req *req, + struct nvmet_pr_registrant *reg, + u8 rtype, + struct nvmet_pr_acquire_data *d) +{ + struct nvmet_ctrl *ctrl = req->sq->ctrl; + struct nvmet_pr *pr = &req->ns->pr; + struct nvmet_pr_registrant *holder; + enum nvme_pr_type original_rtype; + u16 status; + + lockdep_assert_held(&pr->pr_lock); + holder = rcu_dereference_protected(pr->holder, + lockdep_is_held(&pr->pr_lock)); + if (!holder) + return nvmet_pr_unreg_by_prkey(pr, le64_to_cpu(d->prkey), + &ctrl->hostid); + + original_rtype = holder->rtype; + if (original_rtype == NVME_PR_WRITE_EXCLUSIVE_ALL_REGS || + original_rtype == NVME_PR_EXCLUSIVE_ACCESS_ALL_REGS) { + if (!le64_to_cpu(d->prkey)) { + nvmet_pr_unreg_except_hostid(pr, &ctrl->hostid); + return nvmet_pr_create_new_reservation(pr, + rtype, reg); + } + return nvmet_pr_unreg_by_prkey(pr, le64_to_cpu(d->prkey), + &ctrl->hostid); + } + + if (holder == reg) { + status = nvmet_pr_update_reg_attr(pr, holder, + nvmet_pr_update_holder_rtype, &rtype); + if (!status && original_rtype != rtype) + nvmet_pr_resv_released(pr, ®->hostid); + return status; + } + + if (le64_to_cpu(d->prkey) == holder->rkey) { + status = nvmet_pr_unreg_by_prkey_except_hostid(pr, + le64_to_cpu(d->prkey), &ctrl->hostid); + if (status) + return status; + status = nvmet_pr_create_new_reservation(pr, rtype, reg); + if (!status && original_rtype != rtype) + nvmet_pr_resv_released(pr, ®->hostid); + return status; + } + + if (le64_to_cpu(d->prkey)) + return nvmet_pr_unreg_by_prkey(pr, le64_to_cpu(d->prkey), + &ctrl->hostid); + return NVME_SC_INVALID_FIELD | NVME_SC_DNR; +} + +static u16 __nvmet_execute_pr_acquire(struct nvmet_req *req, + struct nvmet_pr_registrant *reg, + u8 acquire_act, + u8 rtype, + struct nvmet_pr_acquire_data *d) +{ + u16 status; + + switch (acquire_act) { + case NVME_PR_ACQUIRE_ACT_ACQUIRE: + status = nvmet_pr_acquire(req, reg, rtype); + goto out; + case NVME_PR_ACQUIRE_ACT_PREEMPT: + status = nvmet_pr_preempt(req, reg, rtype, d); + goto inc_gen; + case NVME_PR_ACQUIRE_ACT_PREEMPT_AND_ABORT: + status = nvmet_pr_preempt(req, reg, rtype, d); + goto inc_gen; + default: + req->error_loc = offsetof(struct nvme_common_command, cdw10); + status = NVME_SC_INVALID_OPCODE | NVME_SC_DNR; + goto out; + } +inc_gen: + if (!status) + atomic_inc(&req->ns->pr.generation); +out: + return status; +} + +static void nvmet_execute_pr_acquire(struct nvmet_req *req) +{ + u32 cdw10 = le32_to_cpu(req->cmd->common.cdw10); + bool ignore_key = (bool)((cdw10 >> 3) & 1); /* Ignore existing key, bit 03 */ + u8 rtype = (u8)((cdw10 >> 8) & 0xff); /* Reservation type, bit 15:08 */ + u8 acquire_act = cdw10 & 0x07; /* Reservation acquire action, bit 02:00 */ + struct nvmet_ctrl *ctrl = req->sq->ctrl; + struct nvmet_pr_acquire_data *d = NULL; + struct nvmet_pr *pr = &req->ns->pr; + struct nvmet_pr_registrant *reg; + u16 status = NVME_SC_SUCCESS; + + if (ignore_key) { + status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; + goto out; + } + + d = kmalloc(sizeof(*d), GFP_KERNEL); + if (!d) { + status = NVME_SC_INTERNAL; + goto out; + } + + status = nvmet_copy_from_sgl(req, 0, d, sizeof(*d)); + if (status) + goto free_data; + + status = NVME_SC_RESERVATION_CONFLICT | NVME_SC_DNR; + mutex_lock(&pr->pr_lock); + list_for_each_entry_rcu(reg, &pr->registrant_list, entry, + lockdep_is_held(&pr->pr_lock)) { + if (uuid_equal(®->hostid, &ctrl->hostid) && + reg->rkey == le64_to_cpu(d->crkey)) { + status = __nvmet_execute_pr_acquire(req, reg, + acquire_act, rtype, d); + break; + } + } + mutex_unlock(&pr->pr_lock); + +free_data: + kfree(d); +out: + nvmet_req_complete(req, status); +} + +static u16 nvmet_pr_release(struct nvmet_req *req, + struct nvmet_pr_registrant *reg, + u8 rtype) +{ + struct nvmet_pr *pr = &req->ns->pr; + struct nvmet_pr_registrant *holder; + u8 original_rtype; + + lockdep_assert_held(&pr->pr_lock); + holder = rcu_dereference_protected(pr->holder, + lockdep_is_held(&pr->pr_lock)); + if (!holder || reg != holder) + return NVME_SC_SUCCESS; + + original_rtype = holder->rtype; + if (original_rtype != rtype) + return NVME_SC_RESERVATION_CONFLICT | NVME_SC_DNR; + + rcu_assign_pointer(pr->holder, NULL); + + if (original_rtype != NVME_PR_WRITE_EXCLUSIVE && + original_rtype != NVME_PR_EXCLUSIVE_ACCESS) + nvmet_pr_resv_released(pr, ®->hostid); + + return NVME_SC_SUCCESS; +} + +static void nvmet_pr_clear(struct nvmet_req *req) +{ + struct nvmet_pr_registrant *reg, *tmp; + struct nvmet_pr *pr = &req->ns->pr; + LIST_HEAD(free_list); + + lockdep_assert_held(&pr->pr_lock); + + rcu_assign_pointer(pr->holder, NULL); + + list_for_each_entry_safe(reg, tmp, &pr->registrant_list, entry) { + list_del_rcu(®->entry); + if (!uuid_equal(&req->sq->ctrl->hostid, ®->hostid)) + nvmet_pr_resv_preempted(pr, ®->hostid); + list_add(®->entry, &free_list); + } + synchronize_rcu(); + list_for_each_entry_safe(reg, tmp, &free_list, entry) { + kfree(reg); + } + + atomic_inc(&pr->generation); +} + +static u16 __nvmet_execute_pr_release(struct nvmet_req *req, + struct nvmet_pr_registrant *reg, + u8 release_act, u8 rtype) +{ + switch (release_act) { + case NVME_PR_RELEASE_ACT_RELEASE: + return nvmet_pr_release(req, reg, rtype); + case NVME_PR_RELEASE_ACT_CLEAR: + nvmet_pr_clear(req); + return NVME_SC_SUCCESS; + default: + req->error_loc = offsetof(struct nvme_common_command, cdw10); + return NVME_SC_INVALID_OPCODE | NVME_SC_DNR; + } +} + +static void nvmet_execute_pr_release(struct nvmet_req *req) +{ + u32 cdw10 = le32_to_cpu(req->cmd->common.cdw10); + bool ignore_key = (bool)((cdw10 >> 3) & 1); /* Ignore existing key, bit 03 */ + u8 rtype = (u8)((cdw10 >> 8) & 0xff); /* Reservation type, bit 15:08 */ + u8 release_act = cdw10 & 0x07; /* Reservation release action, bit 02:00 */ + struct nvmet_ctrl *ctrl = req->sq->ctrl; + struct nvmet_pr *pr = &req->ns->pr; + struct nvmet_pr_release_data *d; + struct nvmet_pr_registrant *reg; + u16 status; + + if (ignore_key) { + status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; + goto out; + } + + d = kmalloc(sizeof(*d), GFP_KERNEL); + if (!d) { + status = NVME_SC_INTERNAL; + goto out; + } + + status = nvmet_copy_from_sgl(req, 0, d, sizeof(*d)); + if (status) + goto free_data; + + status = NVME_SC_RESERVATION_CONFLICT | NVME_SC_DNR; + mutex_lock(&pr->pr_lock); + list_for_each_entry_rcu(reg, &pr->registrant_list, entry, + lockdep_is_held(&pr->pr_lock)) { + if (uuid_equal(®->hostid, &ctrl->hostid) && + reg->rkey == le64_to_cpu(d->crkey)) { + status = __nvmet_execute_pr_release(req, reg, + release_act, rtype); + break; + } + } + mutex_unlock(&pr->pr_lock); +free_data: + kfree(d); +out: + nvmet_req_complete(req, status); +} + +static void nvmet_execute_pr_report(struct nvmet_req *req) +{ + u32 cdw11 = le32_to_cpu(req->cmd->common.cdw11); + u32 cdw10 = le32_to_cpu(req->cmd->common.cdw10); + u32 num_bytes = 4 * (cdw10 + 1); /* cdw10 is number of dwords */ + u8 eds = cdw11 & 1; /* Extended data structure, bit 00 */ + struct nvmet_subsys *subsys = req->sq->ctrl->subsys; + struct nvme_registered_ctrl_ext *ctrl_eds; + struct nvme_reservation_status_ext *data; + struct nvmet_pr *pr = &req->ns->pr; + struct nvmet_pr_registrant *holder; + struct nvmet_pr_registrant *reg; + struct nvmet_ctrl *ctrl; + u16 num_ctrls = 0; + u16 status; + u8 rtype; + + /* nvmet hostid(uuid_t) is 128 bit. */ + if (!eds) { + req->error_loc = offsetof(struct nvme_common_command, cdw11); + status = NVME_SC_HOST_ID_INCONSIST | NVME_SC_DNR; + goto out; + } + + if (num_bytes < sizeof(struct nvme_reservation_status_ext)) { + req->error_loc = offsetof(struct nvme_common_command, cdw10); + status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; + goto out; + } + + mutex_lock(&subsys->lock); + num_ctrls = list_count_nodes(&subsys->ctrls); + mutex_unlock(&subsys->lock); + + num_bytes = min(num_bytes, + sizeof(struct nvme_reservation_status_ext) + + sizeof(struct nvme_registered_ctrl_ext) * num_ctrls); + + data = kmalloc(num_bytes, GFP_KERNEL); + if (!data) { + status = NVME_SC_INTERNAL; + goto out; + } + memset(data, 0, num_bytes); + data->gen = cpu_to_le32(atomic_read(&pr->generation)); + data->ptpls = 0; + ctrl_eds = data->regctl_eds; + + mutex_lock(&subsys->lock); + rcu_read_lock(); + holder = rcu_dereference(pr->holder); + rtype = holder ? holder->rtype : 0; + data->rtype = rtype; + num_ctrls = 0; + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) { + if ((void *)ctrl_eds >= (void *)(data + num_bytes)) + break; + reg = nvmet_pr_find_registrant(pr, &ctrl->hostid); + if (!reg) + continue; + ctrl_eds->cntlid = cpu_to_le16(ctrl->cntlid); + if (rtype == NVME_PR_WRITE_EXCLUSIVE_ALL_REGS || + rtype == NVME_PR_EXCLUSIVE_ACCESS_ALL_REGS) + ctrl_eds->rcsts = 1; + if (holder && uuid_equal(&ctrl->hostid, &holder->hostid)) + ctrl_eds->rcsts = 1; + uuid_copy((uuid_t *)&ctrl_eds->hostid, &ctrl->hostid); + ctrl_eds->rkey = cpu_to_le64(reg->rkey); + ctrl_eds++; + num_ctrls++; + } + rcu_read_unlock(); + mutex_unlock(&subsys->lock); + + put_unaligned_le16(num_ctrls, data->regctl); + num_bytes = sizeof(struct nvme_reservation_status_ext) + + sizeof(struct nvme_registered_ctrl_ext) * num_ctrls; + status = nvmet_copy_to_sgl(req, 0, data, num_bytes); + kfree(data); +out: + nvmet_req_complete(req, status); +} + +u16 nvmet_parse_pr_cmd(struct nvmet_req *req) +{ + struct nvme_command *cmd = req->cmd; + + switch (cmd->common.opcode) { + case nvme_cmd_resv_register: + req->execute = nvmet_execute_pr_register; + break; + case nvme_cmd_resv_acquire: + req->execute = nvmet_execute_pr_acquire; + break; + case nvme_cmd_resv_release: + req->execute = nvmet_execute_pr_release; + break; + case nvme_cmd_resv_report: + req->execute = nvmet_execute_pr_report; + break; + default: + return 1; + } + return NVME_SC_SUCCESS; +} + +static bool nvmet_is_req_write_cmd_group(struct nvmet_req *req) +{ + u8 opcode = req->cmd->common.opcode; + + if (req->sq->qid) { + switch (opcode) { + case nvme_cmd_flush: + case nvme_cmd_write: + case nvme_cmd_write_zeroes: + case nvme_cmd_dsm: + case nvme_cmd_zone_append: + case nvme_cmd_zone_mgmt_send: + return true; + default: + return false; + } + } + return false; +} + +static bool nvmet_is_req_read_cmd_group(struct nvmet_req *req) +{ + u8 opcode = req->cmd->common.opcode; + + if (req->sq->qid) { + switch (opcode) { + case nvme_cmd_read: + case nvme_cmd_zone_mgmt_recv: + return true; + default: + return false; + } + } + return false; +} + +u16 nvmet_pr_check_cmd_access(struct nvmet_req *req) +{ + struct nvmet_ctrl *ctrl = req->sq->ctrl; + struct nvmet_pr_registrant *holder; + struct nvmet_ns *ns = req->ns; + struct nvmet_pr *pr = &ns->pr; + u16 status = NVME_SC_SUCCESS; + + rcu_read_lock(); + holder = rcu_dereference(pr->holder); + if (!holder) + goto unlock; + if (uuid_equal(&ctrl->hostid, &holder->hostid)) + goto unlock; + + /* + * The Reservation command group is checked in executing, + * allow it here. + */ + switch (holder->rtype) { + case NVME_PR_WRITE_EXCLUSIVE: + if (nvmet_is_req_write_cmd_group(req)) + status = NVME_SC_RESERVATION_CONFLICT | NVME_SC_DNR; + break; + case NVME_PR_EXCLUSIVE_ACCESS: + if (nvmet_is_req_read_cmd_group(req) || + nvmet_is_req_write_cmd_group(req)) + status = NVME_SC_RESERVATION_CONFLICT | NVME_SC_DNR; + break; + case NVME_PR_WRITE_EXCLUSIVE_REG_ONLY: + case NVME_PR_WRITE_EXCLUSIVE_ALL_REGS: + if ((nvmet_is_req_write_cmd_group(req)) && + !nvmet_pr_find_registrant(pr, &ctrl->hostid)) + status = NVME_SC_RESERVATION_CONFLICT | NVME_SC_DNR; + break; + case NVME_PR_EXCLUSIVE_ACCESS_REG_ONLY: + case NVME_PR_EXCLUSIVE_ACCESS_ALL_REGS: + if ((nvmet_is_req_read_cmd_group(req) || + nvmet_is_req_write_cmd_group(req)) && + !nvmet_pr_find_registrant(pr, &ctrl->hostid)) + status = NVME_SC_RESERVATION_CONFLICT | NVME_SC_DNR; + break; + default: + pr_warn("the reservation type is set wrong, type:%d\n", + holder->rtype); + break; + } + +unlock: + rcu_read_unlock(); + if (status) + req->error_loc = offsetof(struct nvme_common_command, opcode); + return status; +} + +void nvmet_ctrl_destroy_pr(struct nvmet_ctrl *ctrl) +{ + struct nvmet_pr_registrant *reg; + struct nvmet_ns *ns; + unsigned long idx; + + kfifo_free(&ctrl->pr_log_mgr.log_queue); + mutex_destroy(&ctrl->pr_log_mgr.lock); + + if (nvmet_is_host_connected(&ctrl->hostid, ctrl->subsys)) + return; + + xa_for_each(&ctrl->subsys->namespaces, idx, ns) { + mutex_lock(&ns->pr.pr_lock); + list_for_each_entry_rcu(reg, &ns->pr.registrant_list, entry, + lockdep_is_held(&ns->pr.pr_lock)) { + if (uuid_equal(®->hostid, &ctrl->hostid)) { + nvmet_pr_unregister_one(&ns->pr, reg); + break; + } + } + mutex_unlock(&ns->pr.pr_lock); + } +} + +void nvmet_pr_exit_ns(struct nvmet_ns *ns) +{ + struct nvmet_pr_registrant *reg, *tmp; + struct nvmet_pr *pr = &ns->pr; + LIST_HEAD(free_list); + + mutex_lock(&pr->pr_lock); + rcu_assign_pointer(pr->holder, NULL); + list_for_each_entry_safe(reg, tmp, &pr->registrant_list, entry) { + list_del_rcu(®->entry); + list_add(®->entry, &free_list); + } + mutex_unlock(&pr->pr_lock); + synchronize_rcu(); + list_for_each_entry_safe(reg, tmp, &free_list, entry) { + kfree(reg); + } + + mutex_destroy(&pr->pr_lock); +} + +void nvmet_pr_init_ns(struct nvmet_ns *ns) +{ + ns->pr.holder = NULL; + atomic_set(&ns->pr.generation, 0); + mutex_init(&ns->pr.pr_lock); + INIT_LIST_HEAD(&ns->pr.registrant_list); + ns->pr.notify_mask = 0; +} diff --git a/include/linux/nvme.h b/include/linux/nvme.h index 462c21e0e417..f683d1b4ffad 100644 --- a/include/linux/nvme.h +++ b/include/linux/nvme.h @@ -745,6 +745,32 @@ enum { NVME_AEN_CFG_DISC_CHANGE = 1 << NVME_AEN_BIT_DISC_CHANGE, }; +enum { + NVME_AEN_RESV_LOG_PAGE_AVALIABLE = 0x00, +}; + +enum { + NVME_PR_LOG_EMPTY_LOG_PAGE = 0x00, + NVME_PR_LOG_REGISTRATION_PREEMPTED = 0x01, + NVME_PR_LOG_RESERVATION_RELEASED = 0x02, + NVME_PR_LOG_RESERVATOPM_PREEMPTED = 0x03, +}; + +enum { + NVME_PR_NOTIFY_BIT_REG_PREEMPTED = 1, + NVME_PR_NOTIFY_BIT_RESV_RELEASED = 2, + NVME_PR_NOTIFY_BIT_RESV_PREEMPTED = 3, +}; + +struct nvme_pr_log { + __le64 count; + __u8 type; + __u8 nr_pages; + __u8 rsvd1[2]; + __le32 nsid; + __u8 rsvd2[48]; +}; + struct nvme_lba_range_type { __u8 type; __u8 attributes; @@ -765,6 +791,17 @@ enum { NVME_LBART_ATTRIB_HIDE = 1 << 1, }; +enum nvme_pr_capabilities { + NVME_PR_SUPPORT_PTPL = 1, + NVME_PR_SUPPORT_WRITE_EXCLUSIVE = 1 << 1, + NVME_PR_SUPPORT_EXCLUSIVE_ACCESS = 1 << 2, + NVME_PR_SUPPORT_WRITE_EXCLUSIVE_REG_ONLY = 1 << 3, + NVME_PR_SUPPORT_EXCLUSIVE_ACCESS_REG_ONLY = 1 << 4, + NVME_PR_SUPPORT_WRITE_EXCLUSIVE_ALL_REGS = 1 << 5, + NVME_PR_SUPPORT_EXCLUSIVE_ACCESS_ALL_REGS = 1 << 6, + NVME_PR_SUPPORT_IEKEY_VER_1_3_DEF = 1 << 7, +}; + enum nvme_pr_type { NVME_PR_WRITE_EXCLUSIVE = 1, NVME_PR_EXCLUSIVE_ACCESS = 2, @@ -774,6 +811,23 @@ enum nvme_pr_type { NVME_PR_EXCLUSIVE_ACCESS_ALL_REGS = 6, }; +enum nvme_pr_register_action { + NVME_PR_REGISTER_ACT_REG = 0, + NVME_PR_REGISTER_ACT_UNREG = 1, + NVME_PR_REGISTER_ACT_REPLACE = 1 << 1, +}; + +enum nvme_pr_acquire_action { + NVME_PR_ACQUIRE_ACT_ACQUIRE = 0, + NVME_PR_ACQUIRE_ACT_PREEMPT = 1, + NVME_PR_ACQUIRE_ACT_PREEMPT_AND_ABORT = 1 << 1, +}; + +enum nvme_pr_release_action { + NVME_PR_RELEASE_ACT_RELEASE = 0, + NVME_PR_RELEASE_ACT_CLEAR = 1, +}; + enum nvme_eds { NVME_EXTENDED_DATA_STRUCT = 0x1, }; -- 2.43.0 ^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-02-01 2:32 ` [PATCH v7 1/1] nvmet: support " Guixin Liu @ 2024-02-28 0:40 ` Sagi Grimberg 2024-02-28 2:21 ` Guixin Liu 0 siblings, 1 reply; 29+ messages in thread From: Sagi Grimberg @ 2024-02-28 0:40 UTC (permalink / raw) To: Guixin Liu, hch, kch; +Cc: linux-nvme Hey Guixin, On 01/02/2024 4:32, Guixin Liu wrote: > This patch implements the reservation feature, includes: > 1. reservation register(register, unregister and replace). > 2. reservation acquire(acquire, preempt, preempt and abort). Overall patchset looks in decent shape to me now. I do have one more question though. You say that this patchset implements preempt and abort, however I don't see how this is actually implemented or even how can it be implemented at all at this point. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-02-28 0:40 ` Sagi Grimberg @ 2024-02-28 2:21 ` Guixin Liu 2024-02-28 3:21 ` Keith Busch 0 siblings, 1 reply; 29+ messages in thread From: Guixin Liu @ 2024-02-28 2:21 UTC (permalink / raw) To: Sagi Grimberg, hch, kch; +Cc: linux-nvme 在 2024/2/28 08:40, Sagi Grimberg 写道: > Hey Guixin, > > On 01/02/2024 4:32, Guixin Liu wrote: >> This patch implements the reservation feature, includes: >> 1. reservation register(register, unregister and replace). >> 2. reservation acquire(acquire, preempt, preempt and abort). > > > Overall patchset looks in decent shape to me now. > > I do have one more question though. My thanks for your patient review. > > You say that this patchset implements preempt and abort, however I > don't see how this is actually implemented or even how can it be > implemented at all at this point. Yes, the nvme target does not do any useful workto abort command, the comment of nvmet_execute_abort() says that "we are not required to do any useful work, and we couldn't really do a useful abort", so I don't know what to do about preempt and abort, and the NVMe spec says that the abort is best effor. My statement here may be a little inaccurate. Best Regards, Guixin Liu ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-02-28 2:21 ` Guixin Liu @ 2024-02-28 3:21 ` Keith Busch 2024-02-28 3:40 ` Guixin Liu 0 siblings, 1 reply; 29+ messages in thread From: Keith Busch @ 2024-02-28 3:21 UTC (permalink / raw) To: Guixin Liu; +Cc: Sagi Grimberg, hch, kch, linux-nvme On Wed, Feb 28, 2024 at 10:21:57AM +0800, Guixin Liu wrote: > > 在 2024/2/28 08:40, Sagi Grimberg 写道: > > Hey Guixin, > > > > On 01/02/2024 4:32, Guixin Liu wrote: > > > This patch implements the reservation feature, includes: > > > 1. reservation register(register, unregister and replace). > > > 2. reservation acquire(acquire, preempt, preempt and abort). > > > > > > Overall patchset looks in decent shape to me now. > > > > I do have one more question though. > My thanks for your patient review. > > > > You say that this patchset implements preempt and abort, however I don't > > see how this is actually implemented or even how can it be implemented > > at all at this point. > > Yes, the nvme target does not do any useful workto abort command, the > comment of nvmet_execute_abort() says that > > "we are not required to do any useful work, and we couldn't really do a > useful abort", > > so I don't know what to do about preempt and abort, and the NVMe spec says > that the abort is best effor. > > My statement here may be a little inaccurate. I think you may be mixing up the NVMe "Abort" command with a different command, "Reservation Acquire", where RACQA is set to value 010b, "Preempt and Abort". ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-02-28 3:21 ` Keith Busch @ 2024-02-28 3:40 ` Guixin Liu 2024-03-07 9:27 ` Sagi Grimberg 0 siblings, 1 reply; 29+ messages in thread From: Guixin Liu @ 2024-02-28 3:40 UTC (permalink / raw) To: Keith Busch; +Cc: Sagi Grimberg, hch, kch, linux-nvme 在 2024/2/28 11:21, Keith Busch 写道: > On Wed, Feb 28, 2024 at 10:21:57AM +0800, Guixin Liu wrote: >> 在 2024/2/28 08:40, Sagi Grimberg 写道: >>> Hey Guixin, >>> >>> On 01/02/2024 4:32, Guixin Liu wrote: >>>> This patch implements the reservation feature, includes: >>>> 1. reservation register(register, unregister and replace). >>>> 2. reservation acquire(acquire, preempt, preempt and abort). >>> >>> Overall patchset looks in decent shape to me now. >>> >>> I do have one more question though. >> My thanks for your patient review. >>> You say that this patchset implements preempt and abort, however I don't >>> see how this is actually implemented or even how can it be implemented >>> at all at this point. >> Yes, the nvme target does not do any useful workto abort command, the >> comment of nvmet_execute_abort() says that >> >> "we are not required to do any useful work, and we couldn't really do a >> useful abort", >> >> so I don't know what to do about preempt and abort, and the NVMe spec says >> that the abort is best effor. >> >> My statement here may be a little inaccurate. > I think you may be mixing up the NVMe "Abort" command with a different > command, "Reservation Acquire", where RACQA is set to value 010b, > "Preempt and Abort". Sorry, I just want to explain that I dont know how to implement "preempt and abort" either. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-02-28 3:40 ` Guixin Liu @ 2024-03-07 9:27 ` Sagi Grimberg 2024-03-07 9:42 ` Guixin Liu 2024-03-08 9:15 ` Guixin Liu 0 siblings, 2 replies; 29+ messages in thread From: Sagi Grimberg @ 2024-03-07 9:27 UTC (permalink / raw) To: Guixin Liu, Keith Busch; +Cc: hch, kch, linux-nvme On 28/02/2024 5:40, Guixin Liu wrote: > > 在 2024/2/28 11:21, Keith Busch 写道: >> On Wed, Feb 28, 2024 at 10:21:57AM +0800, Guixin Liu wrote: >>> 在 2024/2/28 08:40, Sagi Grimberg 写道: >>>> Hey Guixin, >>>> >>>> On 01/02/2024 4:32, Guixin Liu wrote: >>>>> This patch implements the reservation feature, includes: >>>>> 1. reservation register(register, unregister and replace). >>>>> 2. reservation acquire(acquire, preempt, preempt and abort). >>>> >>>> Overall patchset looks in decent shape to me now. >>>> >>>> I do have one more question though. >>> My thanks for your patient review. >>>> You say that this patchset implements preempt and abort, however I >>>> don't >>>> see how this is actually implemented or even how can it be implemented >>>> at all at this point. >>> Yes, the nvme target does not do any useful workto abort command, the >>> comment of nvmet_execute_abort() says that >>> >>> "we are not required to do any useful work, and we couldn't really do a >>> useful abort", >>> >>> so I don't know what to do about preempt and abort, and the NVMe >>> spec says >>> that the abort is best effor. >>> >>> My statement here may be a little inaccurate. >> I think you may be mixing up the NVMe "Abort" command with a different >> command, "Reservation Acquire", where RACQA is set to value 010b, >> "Preempt and Abort". > > Sorry, I just want to explain that I dont know how to implement > "preempt and abort" unlike abort, preempt-and-abort needs a semantic guarantee because the consumer may rely on this for fencing purposes. So it cannot be supported in "best effort" I think. A possible implementation would be not to abort as there is no such interface, but nvmet may wait for all pending ns IO to complete and disallowing new IO to come in (using percpu_ref_kill and percpu_ref_resurrect on ns->ref). This won't work very efficiently withALL_REGS reservations though. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-03-07 9:27 ` Sagi Grimberg @ 2024-03-07 9:42 ` Guixin Liu 2024-03-08 9:15 ` Guixin Liu 1 sibling, 0 replies; 29+ messages in thread From: Guixin Liu @ 2024-03-07 9:42 UTC (permalink / raw) To: Sagi Grimberg, Keith Busch; +Cc: hch, kch, linux-nvme 在 2024/3/7 17:27, Sagi Grimberg 写道: > unlike abort, preempt-and-abort needs a semantic guarantee because the > consumer > may rely on this for fencing purposes. So it cannot be supported in > "best effort" I think. > > A possible implementation would be not to abort as there is no such > interface, but > nvmet may wait for all pending ns IO to complete and disallowing new > IO to come in > (using percpu_ref_kill and percpu_ref_resurrect on ns->ref). This > won't work very efficiently > withALL_REGS reservations though. I will think deeply,and send a v9 patch. Best regards, Guixin Liu ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-03-07 9:27 ` Sagi Grimberg 2024-03-07 9:42 ` Guixin Liu @ 2024-03-08 9:15 ` Guixin Liu 2024-03-08 10:07 ` Sagi Grimberg 1 sibling, 1 reply; 29+ messages in thread From: Guixin Liu @ 2024-03-08 9:15 UTC (permalink / raw) To: Sagi Grimberg, Keith Busch; +Cc: hch, kch, linux-nvme > unlike abort, preempt-and-abort needs a semantic guarantee because the > consumer > may rely on this for fencing purposes. So it cannot be supported in > "best effort" I think. > > A possible implementation would be not to abort as there is no such > interface, but > nvmet may wait for all pending ns IO to complete and disallowing new > IO to come in > (using percpu_ref_kill and percpu_ref_resurrect on ns->ref). This > won't work very efficiently > withALL_REGS reservations though. Hi Sagi, I found that if we return an error when the call to percpu_ref_tryget_live(&ns->ref) fails, it might cause hosts that still have permissions to interrupt their IO. Additionally, preempt_and_abort itself holds an ns->ref, we cannot wait the ref to become to zero. The solution I can think of is to add a "per-namespace" percpu_ref to the controller for counting IO issued to a particular namespace by that controller. Then, during the execution of preempt_and_abort, we wait for the count of those preempted and unregistered controllers to drop to zero. The nsid is user-specified, so we can not use array to store the per-namespace percpu_ref, this will increase lookup overhead if we use xarray. What do you think Sagi? Or may be we can declare that preempt_and_abort is not supported, just like SPDK does. Best Regards, Guixin Liu ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-03-08 9:15 ` Guixin Liu @ 2024-03-08 10:07 ` Sagi Grimberg 2024-03-11 11:19 ` Guixin Liu 0 siblings, 1 reply; 29+ messages in thread From: Sagi Grimberg @ 2024-03-08 10:07 UTC (permalink / raw) To: Guixin Liu, Keith Busch; +Cc: hch, kch, linux-nvme On 08/03/2024 11:15, Guixin Liu wrote: > >> unlike abort, preempt-and-abort needs a semantic guarantee because >> the consumer >> may rely on this for fencing purposes. So it cannot be supported in >> "best effort" I think. >> >> A possible implementation would be not to abort as there is no such >> interface, but >> nvmet may wait for all pending ns IO to complete and disallowing new >> IO to come in >> (using percpu_ref_kill and percpu_ref_resurrect on ns->ref). This >> won't work very efficiently >> withALL_REGS reservations though. > > Hi Sagi, > > I found that if we return an error when the call to > percpu_ref_tryget_live(&ns->ref) fails, > > it might cause hosts that still have permissions to interrupt their > IO. Additionally, > > preempt_and_abort itself holds an ns->ref, we cannot wait the ref to > become to zero. > > The solution I can think of is to add a "per-namespace" percpu_ref to > the controller for > > counting IO issued to a particular namespace by that controller. Then, > during the execution > > of preempt_and_abort, we wait for the count of those preempted and > unregistered controllers > > to drop to zero. Yes, that is what I had in mind as well. Obviously the ns->ref cannot be used for this purpose. > > The nsid is user-specified, so we can not use array to store the > per-namespace percpu_ref, > > this will increase lookup overhead if we use xarray. Yes, that is tricky to get right. > > What do you think Sagi? Or may be we can declare that > preempt_and_abort is not supported, just > > like SPDK does. It can definitely come incrementally, but at the very least it should be incorrectly supported. Out of curiosity, doesn't your use-case need a fencing protection against inflight I/Os reordering during preemption? ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-03-08 10:07 ` Sagi Grimberg @ 2024-03-11 11:19 ` Guixin Liu 2024-03-12 21:31 ` Sagi Grimberg 0 siblings, 1 reply; 29+ messages in thread From: Guixin Liu @ 2024-03-11 11:19 UTC (permalink / raw) To: Sagi Grimberg, Keith Busch; +Cc: hch, kch, linux-nvme >> >> What do you think Sagi? Or may be we can declare that >> preempt_and_abort is not supported, just >> >> like SPDK does. > > It can definitely come incrementally, but at the very least it should > be incorrectly supported. > > Out of curiosity, doesn't your use-case need a fencing protection > against inflight I/Os reordering during > preemption? The cluster use stonith(shoot the other node in the head) to protect sources, and the backend storage(our storage system goes like this, so does disk firmware I think.) can process write I/O operations on a first-come,first-served basis if the range of data accessed by two I/O operation overlaps, for example, hostB preempt and abort hostA, then host B send I/Os, the backend storage will handle hostA's I/O first(by the timestamp). Best regards, Guixin Liu ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-03-11 11:19 ` Guixin Liu @ 2024-03-12 21:31 ` Sagi Grimberg 2024-03-13 3:42 ` Guixin Liu 0 siblings, 1 reply; 29+ messages in thread From: Sagi Grimberg @ 2024-03-12 21:31 UTC (permalink / raw) To: Guixin Liu, Keith Busch; +Cc: hch, kch, linux-nvme On 11/03/2024 13:19, Guixin Liu wrote: >>> >>> What do you think Sagi? Or may be we can declare that >>> preempt_and_abort is not supported, just >>> >>> like SPDK does. >> >> It can definitely come incrementally, but at the very least it should >> be incorrectly supported. >> >> Out of curiosity, doesn't your use-case need a fencing protection >> against inflight I/Os reordering during >> preemption? > > The cluster use stonith(shoot the other node in the head) to protect > sources, > > and the backend storage(our storage system goes like this, so does > disk firmware I think.) > > can process write I/O operations on a first-come,first-served basis if > the range of > > data accessed by two I/O operation overlaps, for example, hostB > preempt and abort > > hostA, then host B send I/Os, the backend storage will handle hostA's > I/O first(by the timestamp). So I understand from you that there is out-of-band mechanism for fence guarantee? In any event, Christoph reminded me that preempt and abort is mandatory, so we cannot get away without doing it. I suggest to introduce a per-ns-per-controller percpuref, with an xarray that is rcu protected. btw, I also think that reservations on nsid=0xffffffff should also work. IIRC there were at least two use-cases that would have benefited from reservations on all subsystem namespaces. It can also be a lot more lightweight to implement that (outside of conflicting reservations checks). ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-03-12 21:31 ` Sagi Grimberg @ 2024-03-13 3:42 ` Guixin Liu 2024-03-13 9:54 ` Sagi Grimberg 0 siblings, 1 reply; 29+ messages in thread From: Guixin Liu @ 2024-03-13 3:42 UTC (permalink / raw) To: Sagi Grimberg, Keith Busch; +Cc: hch, kch, linux-nvme 在 2024/3/13 05:31, Sagi Grimberg 写道: > So I understand from you that there is out-of-band mechanism for fence > guarantee? > > In any event, Christoph reminded me that preempt and abort is > mandatory, so we cannot > get away without doing it. > Agree. > I suggest to introduce a per-ns-per-controller percpuref, with an > xarray that is rcu protected. I will implement it in v9, and I also think that we should a time-limited wait. > > btw, I also think that reservations on nsid=0xffffffff should also > work. IIRC there were at least > two use-cases that would have benefited from reservations on all > subsystem namespaces. > It can also be a lot more lightweight to implement that (outside of > conflicting reservations checks). You mean one host can reserve all of namespaces by sending one reservation command? Well, Could you please tell me which two use-cases? I think this will make the target code too complicated, because the reservation situation on each namespace may different, we should handle it one-by-one, and if failed in middle, we should revert the handled namespaces. Best regards, Guixin Liu ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-03-13 3:42 ` Guixin Liu @ 2024-03-13 9:54 ` Sagi Grimberg 2024-03-13 11:56 ` Guixin Liu 0 siblings, 1 reply; 29+ messages in thread From: Sagi Grimberg @ 2024-03-13 9:54 UTC (permalink / raw) To: Guixin Liu, Keith Busch; +Cc: hch, kch, linux-nvme On 13/03/2024 5:42, Guixin Liu wrote: > > 在 2024/3/13 05:31, Sagi Grimberg 写道: >> So I understand from you that there is out-of-band mechanism for >> fence guarantee? >> >> In any event, Christoph reminded me that preempt and abort is >> mandatory, so we cannot >> get away without doing it. >> > Agree. >> I suggest to introduce a per-ns-per-controller percpuref, with an >> xarray that is rcu protected. > > I will implement it in v9, and I also think that we should a > time-limited wait. If it is time-limited, and you didn't manage to wait, so you need to fail the preempt and abort action (without DNR set so the reservation holder can retry if it chooses to). > >> >> btw, I also think that reservations on nsid=0xffffffff should also >> work. IIRC there were at least >> two use-cases that would have benefited from reservations on all >> subsystem namespaces. >> It can also be a lot more lightweight to implement that (outside of >> conflicting reservations checks). > > You mean one host can reserve all of namespaces by sending one > reservation command? Yes. The use-case is to have the reservation in the subsystem level. The spec seems to imply that this is a valid reservation request. > > Well, Could you please tell me which two use-cases? Don't remember for sure, but IIRC I've seen people run oracle rac with all namespaces in the nvmet subsystem is exposed to DB hosts, and there were many namespaces. > > I think this will make the target code too complicated, because the > reservation situation on each > > namespace may different, we should handle it one-by-one, and if failed > in middle, we should revert > > the handled namespaces. Not sure why you'd assume that, this would be a subsystem-wide reservation, which should simply cross-check individual namespace reservations for conflicts. Sure it adds complication, but I think implementing it on every individual namespace is a naive implementation. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-03-13 9:54 ` Sagi Grimberg @ 2024-03-13 11:56 ` Guixin Liu 2024-03-13 12:36 ` Sagi Grimberg 0 siblings, 1 reply; 29+ messages in thread From: Guixin Liu @ 2024-03-13 11:56 UTC (permalink / raw) To: Sagi Grimberg, Keith Busch; +Cc: hch, kch, linux-nvme 在 2024/3/13 17:54, Sagi Grimberg 写道: > > > On 13/03/2024 5:42, Guixin Liu wrote: >> >> 在 2024/3/13 05:31, Sagi Grimberg 写道: >>> So I understand from you that there is out-of-band mechanism for >>> fence guarantee? >>> >>> In any event, Christoph reminded me that preempt and abort is >>> mandatory, so we cannot >>> get away without doing it. >>> >> Agree. >>> I suggest to introduce a per-ns-per-controller percpuref, with an >>> xarray that is rcu protected. >> >> I will implement it in v9, and I also think that we should a >> time-limited wait. > > If it is time-limited, and you didn't manage to wait, so you need to > fail the preempt and abort > action (without DNR set so the reservation holder can retry if it > chooses to). > It can wait. >> >>> >>> btw, I also think that reservations on nsid=0xffffffff should also >>> work. IIRC there were at least >>> two use-cases that would have benefited from reservations on all >>> subsystem namespaces. >>> It can also be a lot more lightweight to implement that (outside of >>> conflicting reservations checks). >> >> You mean one host can reserve all of namespaces by sending one >> reservation command? > > Yes. The use-case is to have the reservation in the subsystem level. > The spec seems to imply that this is a valid reservation request. > >> >> Well, Could you please tell me which two use-cases? > > Don't remember for sure, but IIRC I've seen people run oracle rac with > all namespaces in the > nvmet subsystem is exposed to DB hosts, and there were many namespaces. > >> >> I think this will make the target code too complicated, because the >> reservation situation on each >> >> namespace may different, we should handle it one-by-one, and if >> failed in middle, we should revert >> >> the handled namespaces. > > Not sure why you'd assume that, this would be a subsystem-wide > reservation, which should simply > cross-check individual namespace reservations for conflicts. Sure it > adds complication, but I think implementing > it on every individual namespace is a naive implementation. Well, this is strange, a few years ago, when I tested SCSI reservations with Oracle RAC, I did not observe any "all LUN" reservation commands being issued. I also researched LIO and SPDK, and found that neither supports "all LUN/NS" reservation commands. When a host issued a reservation command without carrying a specific LUN or NS ID, both returned an error. Best regards, Guixin Liu ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-03-13 11:56 ` Guixin Liu @ 2024-03-13 12:36 ` Sagi Grimberg 2024-03-14 2:03 ` Guixin Liu 0 siblings, 1 reply; 29+ messages in thread From: Sagi Grimberg @ 2024-03-13 12:36 UTC (permalink / raw) To: Guixin Liu, Keith Busch; +Cc: hch, kch, linux-nvme On 13/03/2024 13:56, Guixin Liu wrote: > > 在 2024/3/13 17:54, Sagi Grimberg 写道: >> >> >> On 13/03/2024 5:42, Guixin Liu wrote: >>> >>> 在 2024/3/13 05:31, Sagi Grimberg 写道: >>>> So I understand from you that there is out-of-band mechanism for >>>> fence guarantee? >>>> >>>> In any event, Christoph reminded me that preempt and abort is >>>> mandatory, so we cannot >>>> get away without doing it. >>>> >>> Agree. >>>> I suggest to introduce a per-ns-per-controller percpuref, with an >>>> xarray that is rcu protected. >>> >>> I will implement it in v9, and I also think that we should a >>> time-limited wait. >> >> If it is time-limited, and you didn't manage to wait, so you need to >> fail the preempt and abort >> action (without DNR set so the reservation holder can retry if it >> chooses to). >> > It can wait. >>> >>>> >>>> btw, I also think that reservations on nsid=0xffffffff should also >>>> work. IIRC there were at least >>>> two use-cases that would have benefited from reservations on all >>>> subsystem namespaces. >>>> It can also be a lot more lightweight to implement that (outside of >>>> conflicting reservations checks). >>> >>> You mean one host can reserve all of namespaces by sending one >>> reservation command? >> >> Yes. The use-case is to have the reservation in the subsystem level. >> The spec seems to imply that this is a valid reservation request. >> >>> >>> Well, Could you please tell me which two use-cases? >> >> Don't remember for sure, but IIRC I've seen people run oracle rac >> with all namespaces in the >> nvmet subsystem is exposed to DB hosts, and there were many namespaces. >> >>> >>> I think this will make the target code too complicated, because the >>> reservation situation on each >>> >>> namespace may different, we should handle it one-by-one, and if >>> failed in middle, we should revert >>> >>> the handled namespaces. >> >> Not sure why you'd assume that, this would be a subsystem-wide >> reservation, which should simply >> cross-check individual namespace reservations for conflicts. Sure it >> adds complication, but I think implementing >> it on every individual namespace is a naive implementation. > > Well, this is strange, a few years ago, when I tested SCSI > reservations with Oracle RAC, > > I did not observe any "all LUN" reservation commands being issued. > > I also researched LIO and SPDK, and found that neither supports "all > LUN/NS" reservation commands. > > When a host issued a reservation command without carrying a specific > LUN or NS ID, both returned an error. Didn't say this is actually in use, but a use-case that would benefit. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-03-13 12:36 ` Sagi Grimberg @ 2024-03-14 2:03 ` Guixin Liu 2024-03-19 2:59 ` Chaitanya Kulkarni 0 siblings, 1 reply; 29+ messages in thread From: Guixin Liu @ 2024-03-14 2:03 UTC (permalink / raw) To: Sagi Grimberg, Keith Busch; +Cc: hch, kch, linux-nvme > Didn't say this is actually in use, but a use-case that would benefit. I will add a TODO-list to research main cluster softwares to see if this is actually a significant use-case, after my patch merged. I'll change my code then. Best regards, Guixin Liu ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-03-14 2:03 ` Guixin Liu @ 2024-03-19 2:59 ` Chaitanya Kulkarni 2024-03-19 3:19 ` Guixin Liu 0 siblings, 1 reply; 29+ messages in thread From: Chaitanya Kulkarni @ 2024-03-19 2:59 UTC (permalink / raw) To: Guixin Liu, Sagi Grimberg Cc: hch@lst.de, Chaitanya Kulkarni, linux-nvme@lists.infradead.org, Keith Busch On 3/13/24 19:03, Guixin Liu wrote: > > >> Didn't say this is actually in use, but a use-case that would benefit. > > I will add a TODO-list to research main cluster softwares > > to see if this is actually a significant use-case, after my patch merged. > > I'll change my code then. > > Best regards, > > Guixin Liu > so which one is the latest version I use for testing ? I want to run some tests so I can provide reviewed-by tag ... -ck ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-03-19 2:59 ` Chaitanya Kulkarni @ 2024-03-19 3:19 ` Guixin Liu 2024-03-20 1:59 ` hch 0 siblings, 1 reply; 29+ messages in thread From: Guixin Liu @ 2024-03-19 3:19 UTC (permalink / raw) To: Chaitanya Kulkarni, Sagi Grimberg Cc: hch@lst.de, linux-nvme@lists.infradead.org, Keith Busch 在 2024/3/19 10:59, Chaitanya Kulkarni 写道: > On 3/13/24 19:03, Guixin Liu wrote: >> >>> Didn't say this is actually in use, but a use-case that would benefit. >> I will add a TODO-list to research main cluster softwares >> >> to see if this is actually a significant use-case, after my patch merged. >> >> I'll change my code then. >> >> Best regards, >> >> Guixin Liu >> > so which one is the latest version I use for testing ? I want to run some > tests so I can provide reviewed-by tag ... > > -ck I am adding a per-controller percpu-ref to ns to implement preempt_and_abort, the v9 will be soon. Thanks, Guixin Liu > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-03-19 3:19 ` Guixin Liu @ 2024-03-20 1:59 ` hch 2024-03-20 9:16 ` Sagi Grimberg 0 siblings, 1 reply; 29+ messages in thread From: hch @ 2024-03-20 1:59 UTC (permalink / raw) To: Guixin Liu Cc: Chaitanya Kulkarni, Sagi Grimberg, hch@lst.de, linux-nvme@lists.infradead.org, Keith Busch On Tue, Mar 19, 2024 at 11:19:50AM +0800, Guixin Liu wrote: > I am adding a per-controller percpu-ref to ns to implement > preempt_and_abort, the v9 will be soon. Do we need that? For blk-mq based backends just doing a queue freeze would be enough. That doesn't work for bio based drivers yet, but I'd rather improve the block layer freezing for them rather than reimplenting the whole thing. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-03-20 1:59 ` hch @ 2024-03-20 9:16 ` Sagi Grimberg 2024-03-21 8:06 ` Chaitanya Kulkarni 2024-03-21 21:02 ` hch 0 siblings, 2 replies; 29+ messages in thread From: Sagi Grimberg @ 2024-03-20 9:16 UTC (permalink / raw) To: hch@lst.de, Guixin Liu Cc: Chaitanya Kulkarni, linux-nvme@lists.infradead.org, Keith Busch On 20/03/2024 3:59, hch@lst.de wrote: > On Tue, Mar 19, 2024 at 11:19:50AM +0800, Guixin Liu wrote: >> I am adding a per-controller percpu-ref to ns to implement >> preempt_and_abort, the v9 will be soon. > Do we need that? For blk-mq based backends just doing a queue freeze > would be enough. That doesn't work for bio based drivers yet, but I'd > rather improve the block layer freezing for them rather than reimplenting > the whole thing. That's a neat idea, we also have file to think about though. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-03-20 9:16 ` Sagi Grimberg @ 2024-03-21 8:06 ` Chaitanya Kulkarni 2024-03-21 21:02 ` hch 1 sibling, 0 replies; 29+ messages in thread From: Chaitanya Kulkarni @ 2024-03-21 8:06 UTC (permalink / raw) To: Sagi Grimberg, hch@lst.de Cc: linux-nvme@lists.infradead.org, Keith Busch, Guixin Liu On 3/20/2024 2:16 AM, Sagi Grimberg wrote: > > > On 20/03/2024 3:59, hch@lst.de wrote: >> On Tue, Mar 19, 2024 at 11:19:50AM +0800, Guixin Liu wrote: >>> I am adding a per-controller percpu-ref to ns to implement >>> preempt_and_abort, the v9 will be soon. >> Do we need that? For blk-mq based backends just doing a queue freeze >> would be enough. That doesn't work for bio based drivers yet, but I'd >> rather improve the block layer freezing for them rather than reimplenting >> the whole thing. > > That's a neat idea, we also have file to think about though. I failed to understand how this will work for non block device (passthru or bio mode) backend(s). Unless there is a plan to create similar API for non block device backhend(s) and if that turns out to be overkill then why not go with current per-controller percpu-ref ? -ck ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-03-20 9:16 ` Sagi Grimberg 2024-03-21 8:06 ` Chaitanya Kulkarni @ 2024-03-21 21:02 ` hch 2024-03-22 9:34 ` Chaitanya Kulkarni 1 sibling, 1 reply; 29+ messages in thread From: hch @ 2024-03-21 21:02 UTC (permalink / raw) To: Sagi Grimberg Cc: hch@lst.de, Guixin Liu, Chaitanya Kulkarni, linux-nvme@lists.infradead.org, Keith Busch On Wed, Mar 20, 2024 at 11:16:29AM +0200, Sagi Grimberg wrote: > That's a neat idea, we also have file to think about though. Indeed. It still would be good to avoid the extra overhead when not needed, though. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-03-21 21:02 ` hch @ 2024-03-22 9:34 ` Chaitanya Kulkarni 2024-03-23 20:41 ` Sagi Grimberg 0 siblings, 1 reply; 29+ messages in thread From: Chaitanya Kulkarni @ 2024-03-22 9:34 UTC (permalink / raw) To: hch@lst.de Cc: Guixin Liu, Sagi Grimberg, linux-nvme@lists.infradead.org, Keith Busch Christoph, On 3/21/2024 2:02 PM, hch@lst.de wrote: > On Wed, Mar 20, 2024 at 11:16:29AM +0200, Sagi Grimberg wrote: >> That's a neat idea, we also have file to think about though. > > Indeed. It still would be good to avoid the extra overhead when not > needed, though. I still don't understand how this solves the case for file back-end ? can you please explain in detail ? as far as I can there is no equivalent provision that we can use for file back-end as you have suggested for block (bio/blk-mq)back-end, if there is can you please provide some details ? In case there is no such provision, how do you see file backends to handle this case ? perhaps you are referring to use above suggested mechanism for block backend and implement percpu-ref to for file only backend so we can keep the code clean for bio and passthru back-end vs file back-end ? -ck ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 1/1] nvmet: support reservation feature 2024-03-22 9:34 ` Chaitanya Kulkarni @ 2024-03-23 20:41 ` Sagi Grimberg 0 siblings, 0 replies; 29+ messages in thread From: Sagi Grimberg @ 2024-03-23 20:41 UTC (permalink / raw) To: Chaitanya Kulkarni, hch@lst.de Cc: Guixin Liu, linux-nvme@lists.infradead.org, Keith Busch On 22/03/2024 11:34, Chaitanya Kulkarni wrote: > Christoph, > > On 3/21/2024 2:02 PM, hch@lst.de wrote: >> On Wed, Mar 20, 2024 at 11:16:29AM +0200, Sagi Grimberg wrote: >>> That's a neat idea, we also have file to think about though. >> Indeed. It still would be good to avoid the extra overhead when not >> needed, though. > I still don't understand how this solves the case for file back-end ? > can you please explain in detail ? > > as far as I can there is no equivalent provision that we can use for > file back-end as you have suggested for block (bio/blk-mq)back-end, > if there is can you please provide some details ? > > In case there is no such provision, how do you see file backends to > handle this case ? perhaps you are referring to use above suggested > mechanism for block backend and implement percpu-ref to for file only > backend so we can keep the code clean for bio and passthru back-end vs > file back-end ? Yes, the latter. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 0/1] Implement the NVMe reservation feature 2024-02-01 2:32 [PATCH v7 0/1] Implement the NVMe reservation feature Guixin Liu 2024-02-01 2:32 ` [PATCH v7 1/1] nvmet: support " Guixin Liu @ 2024-02-18 2:12 ` Guixin Liu 2024-02-26 6:33 ` Guixin Liu 2024-02-29 2:57 ` Guixin Liu 2 siblings, 1 reply; 29+ messages in thread From: Guixin Liu @ 2024-02-18 2:12 UTC (permalink / raw) To: sagi, hch, kch; +Cc: linux-nvme Hi, Gentle ping... Best regards. Guixin Liu 在 2024/2/1 10:32, Guixin Liu 写道: > Hi guys, > I've implemented the NVMe reservation feature. Please review it, all > comments are welcome as usual. > > Changes from v6 to v7: > - Handle "reservation notification mask" feature command to mask reservation > log. > > - Add all the registrants that need to be freed to a temporary list fist, > and then after calling synchronize_rcu(), release all the registrants on the > temporary list. > > - Fix the resv log page is random when there is no resv log page. > > - Change nvmet_is_host_still_connected() to nvmet_is_host_connected(). > > - Remove nvmet_pr_set_rtype_and_holder() and change nvmet_pr_create_new_resv() > to nvmet_pr_create_new_reservation(). > > - Change nvmet_pr_find_registrant_by_hostid() to nvmet_pr_find_registrant(). > > - Change nvmet_pr_send_resv_released() to nvmet_pr_resv_released(). > > - Change __nvmet_pr_unregister_one() to nvmet_pr_unregister_one(). > > - In nvmet_pr_unreg_by_prkey(), nvmet_pr_unreg_by_prkey_except_hostid() and > nvmet_pr_unreg_except_hostid(), first do unregistering and then do event sending. > > > Changes from v5 to v6: > - Use synchronize_rcu() and kfree() to free registrant instead of kfree_rcu(). > > - Remove nvmet_pr_register_check_rkey(), put the check into pr_lock warp. > And refactor the nvmet_pr_register(). > > - Add the print fmt to the head. > > - Add lockdep_is_held(&pr->pr_lock) condition to list_for_each_entry_rcu. > > - Fix the bug in nvmet_pr_update_reg_attr(), when the change_attr hook > return fail, we should not replace the holder. > > Changes from v4 to v5: > > - Use rculist macros to handle registration_list instead of list macros > regardless of in mutex lock or not. > > - Use goto statement instead of return in nvmet_is_host_still_connected > and __nvmet_pr_unregister_one. > > - Add lockdep_assert_held and rcu_read_lock_held assert to many functions, > if it's necessary. > > - Add a comment to nvmet_execute_get_log_page_resv to explain how lost_count > works. > > - In nvmet_pr_clear, we should set holder to NULL first, I fixed this. > > - Unify nvmet_pr_update_holder_rtype and __nvmet_pr_do_replace to > nvmet_pr_update_reg_attr. > > - Fix wrong nr_pages in nvmet_execute_get_log_page_resv. > > - Fix the deadlock issue of nvmet_pr_exit_ns, put it out of the subsys lock. > > > Changes from v3 to v4: > - Use kfifo to handle resv log page instead of list, and also limit the > resv log queue to 64. > > - Change the function calling alignment style to: > nvmet_pr_send_event_by_hostid(pr, hostid, > NVME_PR_LOG_RESERVATOPM_PREEMPTED); > > - Put kmalloc out of rcu_read_lock in nvmet_execute_pr_report(). > > - Remove the goto in __nvmet_pr_unregister_one(). > > - Change generation to atomic_t, and remove nvmet_pr_inc_generation(). > > - In addtion, the number2 patch "nvmet: unify aer type enum" is not > relate with this patch, so I will send it separately. > > > Changes from v2 to v3: > - Use rcu instead of rwlock to make IO path run faster, and put the rtype > into the struct nvmet_pr_registrant. > > - Limit the resv_log_list to 128. > > - Change generation to atomic64. > > - Put register rkey check to a warpper. > > - Change nr_avl_pages to nr_pages. > > - Use NVME_SC_SUCCESS instead of 0. > > - Change kmalloc param to let it not sleep in mutex lock. > > > Changes from v1 to v2: > - Implement the reservation notification report, includes registration > preempted, reservation released and reservation preempted. > And also handle the reservation log page available event and send get > reservation log page command to clear log page at host. > > - Put the reservation check access after validate opcode. And remove > opcodes which nvmet not implement yet check. > Now there is no admin opcode nvmet implemented needs reservation check, > so I dont add reservation check to admin command path. > Next we need to do reservation check includes the situation of nsid is > 0xffffffff at each admin command path, if it is needed. > > - Add reservation commands support in nvmet_get_cmd_effects_nvm(). > > - From Chaitanya, change the local variable tree style to make it cleaner, > and add some comments about NVMe spec. > And also change others advice from chaitanya. > > - Put the nvmet_pr_check_cmd_access and nvmet_parse_pr_cmd into reservation > enable check warp. > > - Remove kmem_cache instead to use kmalloc and kfree. > > - Change others advice from Sagi. > > - Add a blktest test case, this patch will be sent before these series of > patches. > > Guixin Liu (1): > nvmet: support reservation feature > > drivers/nvme/target/Makefile | 2 +- > drivers/nvme/target/admin-cmd.c | 20 +- > drivers/nvme/target/configfs.c | 27 + > drivers/nvme/target/core.c | 51 +- > drivers/nvme/target/nvmet.h | 36 ++ > drivers/nvme/target/pr.c | 1041 +++++++++++++++++++++++++++++++ > include/linux/nvme.h | 54 ++ > 7 files changed, 1224 insertions(+), 7 deletions(-) > create mode 100644 drivers/nvme/target/pr.c > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 0/1] Implement the NVMe reservation feature 2024-02-18 2:12 ` [PATCH v7 0/1] Implement the NVMe " Guixin Liu @ 2024-02-26 6:33 ` Guixin Liu 2024-02-26 6:43 ` Chaitanya Kulkarni 0 siblings, 1 reply; 29+ messages in thread From: Guixin Liu @ 2024-02-26 6:33 UTC (permalink / raw) To: sagi, hch, kch; +Cc: linux-nvme Hi Christoph and Chaitanya, I noticed that Sagi hasn't replied to any emails for a month, could you please review this patch? Best Regards, Guixin Liu 在 2024/2/18 10:12, Guixin Liu 写道: > Hi, > > Gentle ping... > > Best regards. > > Guixin Liu > > 在 2024/2/1 10:32, Guixin Liu 写道: >> Hi guys, >> I've implemented the NVMe reservation feature. Please review it, >> all >> comments are welcome as usual. >> >> Changes from v6 to v7: >> - Handle "reservation notification mask" feature command to mask >> reservation >> log. >> >> - Add all the registrants that need to be freed to a temporary list >> fist, >> and then after calling synchronize_rcu(), release all the registrants >> on the >> temporary list. >> >> - Fix the resv log page is random when there is no resv log page. >> >> - Change nvmet_is_host_still_connected() to nvmet_is_host_connected(). >> >> - Remove nvmet_pr_set_rtype_and_holder() and change >> nvmet_pr_create_new_resv() >> to nvmet_pr_create_new_reservation(). >> >> - Change nvmet_pr_find_registrant_by_hostid() to >> nvmet_pr_find_registrant(). >> >> - Change nvmet_pr_send_resv_released() to nvmet_pr_resv_released(). >> >> - Change __nvmet_pr_unregister_one() to nvmet_pr_unregister_one(). >> >> - In nvmet_pr_unreg_by_prkey(), >> nvmet_pr_unreg_by_prkey_except_hostid() and >> nvmet_pr_unreg_except_hostid(), first do unregistering and then do >> event sending. >> >> >> Changes from v5 to v6: >> - Use synchronize_rcu() and kfree() to free registrant instead of >> kfree_rcu(). >> >> - Remove nvmet_pr_register_check_rkey(), put the check into pr_lock >> warp. >> And refactor the nvmet_pr_register(). >> >> - Add the print fmt to the head. >> >> - Add lockdep_is_held(&pr->pr_lock) condition to >> list_for_each_entry_rcu. >> >> - Fix the bug in nvmet_pr_update_reg_attr(), when the change_attr hook >> return fail, we should not replace the holder. >> >> Changes from v4 to v5: >> >> - Use rculist macros to handle registration_list instead of list macros >> regardless of in mutex lock or not. >> >> - Use goto statement instead of return in nvmet_is_host_still_connected >> and __nvmet_pr_unregister_one. >> >> - Add lockdep_assert_held and rcu_read_lock_held assert to many >> functions, >> if it's necessary. >> >> - Add a comment to nvmet_execute_get_log_page_resv to explain how >> lost_count >> works. >> >> - In nvmet_pr_clear, we should set holder to NULL first, I fixed this. >> >> - Unify nvmet_pr_update_holder_rtype and __nvmet_pr_do_replace to >> nvmet_pr_update_reg_attr. >> >> - Fix wrong nr_pages in nvmet_execute_get_log_page_resv. >> >> - Fix the deadlock issue of nvmet_pr_exit_ns, put it out of the >> subsys lock. >> >> >> Changes from v3 to v4: >> - Use kfifo to handle resv log page instead of list, and also limit the >> resv log queue to 64. >> >> - Change the function calling alignment style to: >> nvmet_pr_send_event_by_hostid(pr, hostid, >> NVME_PR_LOG_RESERVATOPM_PREEMPTED); >> >> - Put kmalloc out of rcu_read_lock in nvmet_execute_pr_report(). >> >> - Remove the goto in __nvmet_pr_unregister_one(). >> >> - Change generation to atomic_t, and remove nvmet_pr_inc_generation(). >> >> - In addtion, the number2 patch "nvmet: unify aer type enum" is not >> relate with this patch, so I will send it separately. >> >> >> Changes from v2 to v3: >> - Use rcu instead of rwlock to make IO path run faster, and put the >> rtype >> into the struct nvmet_pr_registrant. >> >> - Limit the resv_log_list to 128. >> >> - Change generation to atomic64. >> >> - Put register rkey check to a warpper. >> >> - Change nr_avl_pages to nr_pages. >> >> - Use NVME_SC_SUCCESS instead of 0. >> >> - Change kmalloc param to let it not sleep in mutex lock. >> >> >> Changes from v1 to v2: >> - Implement the reservation notification report, includes registration >> preempted, reservation released and reservation preempted. >> And also handle the reservation log page available event and send get >> reservation log page command to clear log page at host. >> >> - Put the reservation check access after validate opcode. And remove >> opcodes which nvmet not implement yet check. >> Now there is no admin opcode nvmet implemented needs reservation >> check, >> so I dont add reservation check to admin command path. >> Next we need to do reservation check includes the situation of >> nsid is >> 0xffffffff at each admin command path, if it is needed. >> >> - Add reservation commands support in nvmet_get_cmd_effects_nvm(). >> >> - From Chaitanya, change the local variable tree style to make it >> cleaner, >> and add some comments about NVMe spec. >> And also change others advice from chaitanya. >> >> - Put the nvmet_pr_check_cmd_access and nvmet_parse_pr_cmd into >> reservation >> enable check warp. >> >> - Remove kmem_cache instead to use kmalloc and kfree. >> >> - Change others advice from Sagi. >> >> - Add a blktest test case, this patch will be sent before these >> series of >> patches. >> >> Guixin Liu (1): >> nvmet: support reservation feature >> >> drivers/nvme/target/Makefile | 2 +- >> drivers/nvme/target/admin-cmd.c | 20 +- >> drivers/nvme/target/configfs.c | 27 + >> drivers/nvme/target/core.c | 51 +- >> drivers/nvme/target/nvmet.h | 36 ++ >> drivers/nvme/target/pr.c | 1041 +++++++++++++++++++++++++++++++ >> include/linux/nvme.h | 54 ++ >> 7 files changed, 1224 insertions(+), 7 deletions(-) >> create mode 100644 drivers/nvme/target/pr.c >> ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 0/1] Implement the NVMe reservation feature 2024-02-26 6:33 ` Guixin Liu @ 2024-02-26 6:43 ` Chaitanya Kulkarni 0 siblings, 0 replies; 29+ messages in thread From: Chaitanya Kulkarni @ 2024-02-26 6:43 UTC (permalink / raw) To: Guixin Liu Cc: linux-nvme@lists.infradead.org, Chaitanya Kulkarni, sagi@grimberg.me, hch@lst.de On 2/25/24 22:33, Guixin Liu wrote: > Hi Christoph and Chaitanya, > > I noticed that Sagi hasn't replied to any emails for a month, could > > you please review this patch? > > Best Regards, > > Guixin Liu > Hi, You will have my review by end of this week worst case on weekend. -ck ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 0/1] Implement the NVMe reservation feature 2024-02-01 2:32 [PATCH v7 0/1] Implement the NVMe reservation feature Guixin Liu 2024-02-01 2:32 ` [PATCH v7 1/1] nvmet: support " Guixin Liu 2024-02-18 2:12 ` [PATCH v7 0/1] Implement the NVMe " Guixin Liu @ 2024-02-29 2:57 ` Guixin Liu 2 siblings, 0 replies; 29+ messages in thread From: Guixin Liu @ 2024-02-29 2:57 UTC (permalink / raw) To: sagi, hch, kch; +Cc: linux-nvme Hi guys, I will send a v8 patch to add me as the new file's maintainer. Best Regards, Guixin Liu 在 2024/2/1 10:32, Guixin Liu 写道: > Hi guys, > I've implemented the NVMe reservation feature. Please review it, all > comments are welcome as usual. > > Changes from v6 to v7: > - Handle "reservation notification mask" feature command to mask reservation > log. > > - Add all the registrants that need to be freed to a temporary list fist, > and then after calling synchronize_rcu(), release all the registrants on the > temporary list. > > - Fix the resv log page is random when there is no resv log page. > > - Change nvmet_is_host_still_connected() to nvmet_is_host_connected(). > > - Remove nvmet_pr_set_rtype_and_holder() and change nvmet_pr_create_new_resv() > to nvmet_pr_create_new_reservation(). > > - Change nvmet_pr_find_registrant_by_hostid() to nvmet_pr_find_registrant(). > > - Change nvmet_pr_send_resv_released() to nvmet_pr_resv_released(). > > - Change __nvmet_pr_unregister_one() to nvmet_pr_unregister_one(). > > - In nvmet_pr_unreg_by_prkey(), nvmet_pr_unreg_by_prkey_except_hostid() and > nvmet_pr_unreg_except_hostid(), first do unregistering and then do event sending. > > > Changes from v5 to v6: > - Use synchronize_rcu() and kfree() to free registrant instead of kfree_rcu(). > > - Remove nvmet_pr_register_check_rkey(), put the check into pr_lock warp. > And refactor the nvmet_pr_register(). > > - Add the print fmt to the head. > > - Add lockdep_is_held(&pr->pr_lock) condition to list_for_each_entry_rcu. > > - Fix the bug in nvmet_pr_update_reg_attr(), when the change_attr hook > return fail, we should not replace the holder. > > Changes from v4 to v5: > > - Use rculist macros to handle registration_list instead of list macros > regardless of in mutex lock or not. > > - Use goto statement instead of return in nvmet_is_host_still_connected > and __nvmet_pr_unregister_one. > > - Add lockdep_assert_held and rcu_read_lock_held assert to many functions, > if it's necessary. > > - Add a comment to nvmet_execute_get_log_page_resv to explain how lost_count > works. > > - In nvmet_pr_clear, we should set holder to NULL first, I fixed this. > > - Unify nvmet_pr_update_holder_rtype and __nvmet_pr_do_replace to > nvmet_pr_update_reg_attr. > > - Fix wrong nr_pages in nvmet_execute_get_log_page_resv. > > - Fix the deadlock issue of nvmet_pr_exit_ns, put it out of the subsys lock. > > > Changes from v3 to v4: > - Use kfifo to handle resv log page instead of list, and also limit the > resv log queue to 64. > > - Change the function calling alignment style to: > nvmet_pr_send_event_by_hostid(pr, hostid, > NVME_PR_LOG_RESERVATOPM_PREEMPTED); > > - Put kmalloc out of rcu_read_lock in nvmet_execute_pr_report(). > > - Remove the goto in __nvmet_pr_unregister_one(). > > - Change generation to atomic_t, and remove nvmet_pr_inc_generation(). > > - In addtion, the number2 patch "nvmet: unify aer type enum" is not > relate with this patch, so I will send it separately. > > > Changes from v2 to v3: > - Use rcu instead of rwlock to make IO path run faster, and put the rtype > into the struct nvmet_pr_registrant. > > - Limit the resv_log_list to 128. > > - Change generation to atomic64. > > - Put register rkey check to a warpper. > > - Change nr_avl_pages to nr_pages. > > - Use NVME_SC_SUCCESS instead of 0. > > - Change kmalloc param to let it not sleep in mutex lock. > > > Changes from v1 to v2: > - Implement the reservation notification report, includes registration > preempted, reservation released and reservation preempted. > And also handle the reservation log page available event and send get > reservation log page command to clear log page at host. > > - Put the reservation check access after validate opcode. And remove > opcodes which nvmet not implement yet check. > Now there is no admin opcode nvmet implemented needs reservation check, > so I dont add reservation check to admin command path. > Next we need to do reservation check includes the situation of nsid is > 0xffffffff at each admin command path, if it is needed. > > - Add reservation commands support in nvmet_get_cmd_effects_nvm(). > > - From Chaitanya, change the local variable tree style to make it cleaner, > and add some comments about NVMe spec. > And also change others advice from chaitanya. > > - Put the nvmet_pr_check_cmd_access and nvmet_parse_pr_cmd into reservation > enable check warp. > > - Remove kmem_cache instead to use kmalloc and kfree. > > - Change others advice from Sagi. > > - Add a blktest test case, this patch will be sent before these series of > patches. > > Guixin Liu (1): > nvmet: support reservation feature > > drivers/nvme/target/Makefile | 2 +- > drivers/nvme/target/admin-cmd.c | 20 +- > drivers/nvme/target/configfs.c | 27 + > drivers/nvme/target/core.c | 51 +- > drivers/nvme/target/nvmet.h | 36 ++ > drivers/nvme/target/pr.c | 1041 +++++++++++++++++++++++++++++++ > include/linux/nvme.h | 54 ++ > 7 files changed, 1224 insertions(+), 7 deletions(-) > create mode 100644 drivers/nvme/target/pr.c > ^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2024-03-23 20:41 UTC | newest] Thread overview: 29+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-02-01 2:32 [PATCH v7 0/1] Implement the NVMe reservation feature Guixin Liu 2024-02-01 2:32 ` [PATCH v7 1/1] nvmet: support " Guixin Liu 2024-02-28 0:40 ` Sagi Grimberg 2024-02-28 2:21 ` Guixin Liu 2024-02-28 3:21 ` Keith Busch 2024-02-28 3:40 ` Guixin Liu 2024-03-07 9:27 ` Sagi Grimberg 2024-03-07 9:42 ` Guixin Liu 2024-03-08 9:15 ` Guixin Liu 2024-03-08 10:07 ` Sagi Grimberg 2024-03-11 11:19 ` Guixin Liu 2024-03-12 21:31 ` Sagi Grimberg 2024-03-13 3:42 ` Guixin Liu 2024-03-13 9:54 ` Sagi Grimberg 2024-03-13 11:56 ` Guixin Liu 2024-03-13 12:36 ` Sagi Grimberg 2024-03-14 2:03 ` Guixin Liu 2024-03-19 2:59 ` Chaitanya Kulkarni 2024-03-19 3:19 ` Guixin Liu 2024-03-20 1:59 ` hch 2024-03-20 9:16 ` Sagi Grimberg 2024-03-21 8:06 ` Chaitanya Kulkarni 2024-03-21 21:02 ` hch 2024-03-22 9:34 ` Chaitanya Kulkarni 2024-03-23 20:41 ` Sagi Grimberg 2024-02-18 2:12 ` [PATCH v7 0/1] Implement the NVMe " Guixin Liu 2024-02-26 6:33 ` Guixin Liu 2024-02-26 6:43 ` Chaitanya Kulkarni 2024-02-29 2:57 ` Guixin Liu
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox