All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dmitry Bogdanov <d.bogdanov@yadro.com>
To: Guixin Liu <kanie@linux.alibaba.com>
Cc: <hch@lst.de>, <sagi@grimberg.me>, <kch@nvidia.com>,
	<linux-nvme@lists.infradead.org>
Subject: Re: [PATCH v9 1/1] nvmet: support reservation feature
Date: Tue, 24 Sep 2024 08:54:33 +0300	[thread overview]
Message-ID: <20240924055433.GE22571@yadro.com> (raw)
In-Reply-To: <114ce6b1-13e7-418e-bd23-8ab32a7b4c8f@linux.alibaba.com>

On Tue, Sep 24, 2024 at 10:49:58AM +0800, Guixin Liu wrote:
> 
> 在 2024/9/24 00:32, Dmitry Bogdanov 写道:
> > > This patch implements the reservation feature, includes:
> > > 1. reservation register(register, unregister and replace).
> > > 2. reservation acquire(acquire, preempt, preempt and abort).
> > > 3. reservation release(release and clear).
> > > 4. reservation report.
> > > 5. set feature and get feature of reservation notify mask.
> > > 6. get log page of reservation event.
> > > 
> > > And also make reservation configurable, one can set ns to support
> > > reservation before enable ns. The default of resv_enable is false.
> > > 
> > > Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
> > > ---
> > >   drivers/nvme/target/Makefile    |    2 +-
> > >   drivers/nvme/target/admin-cmd.c |   24 +-
> > >   drivers/nvme/target/configfs.c  |   27 +
> > >   drivers/nvme/target/core.c      |   58 +-
> > >   drivers/nvme/target/nvmet.h     |   49 ++
> > >   drivers/nvme/target/pr.c        | 1214 +++++++++++++++++++++++++++++++
> > >   include/linux/nvme.h            |   54 ++
> > >   7 files changed, 1419 insertions(+), 9 deletions(-)
> > >   create mode 100644 drivers/nvme/target/pr.c
> > ...
> > > +
> > > +static void nvmet_pr_unregister_one(struct nvmet_pr *pr,
> > > +                                   struct nvmet_pr_registrant *reg)
> > > +{
> > > +       struct nvmet_pr_registrant *first_reg;
> > > +       struct nvmet_pr_registrant *holder;
> > > +       u8 original_rtype;
> > > +
> > > +       lockdep_assert_held(&pr->pr_lock);
> > > +       list_del_rcu(&reg->entry);
> > > +
> > > +       holder = rcu_dereference_protected(pr->holder,
> > > +                       lockdep_is_held(&pr->pr_lock));
> > > +       if (reg != holder)
> > > +               goto out;
> > > +
> > > +       original_rtype = holder->rtype;
> > > +       if (original_rtype == NVME_PR_WRITE_EXCLUSIVE_ALL_REGS ||
> > > +           original_rtype == NVME_PR_EXCLUSIVE_ACCESS_ALL_REGS) {
> > > +               first_reg = list_first_or_null_rcu(&pr->registrant_list,
> > > +                               struct nvmet_pr_registrant, entry);
> > > +               if (first_reg)
> > > +                       first_reg->rtype = original_rtype;
> > > +               rcu_assign_pointer(pr->holder, first_reg);
> > > +       } else {
> > > +               rcu_assign_pointer(pr->holder, NULL);
> > > +
> > > +               if (original_rtype == NVME_PR_WRITE_EXCLUSIVE_REG_ONLY ||
> > > +                   original_rtype == NVME_PR_WRITE_EXCLUSIVE_REG_ONLY)
> > copy&past error?
> > The second check should be against NVME_PR_EXCLUSIVE_ACCESS_REG_ONLY?
> Indeed, my mistake, it will be fixed in v10.
> > > +                       nvmet_pr_resv_released(pr, &reg->hostid);
> > > +       }
> > > +out:
> > > +       synchronize_rcu();
> > > +       kfree(reg);
> > > +}
> > > +
> > > ......
> > > +
> > > +static u16 nvmet_pr_preempt(struct nvmet_req *req,
> > > +                           struct nvmet_pr_registrant *reg,
> > > +                           u8 rtype,
> > > +                           struct nvmet_pr_acquire_data *d,
> > > +                           bool abort)
> > > +{
> > > +       struct nvmet_ctrl *ctrl = req->sq->ctrl;
> > > +       struct nvmet_pr *pr = &req->ns->pr;
> > > +       struct nvmet_pr_registrant *holder;
> > > +       enum nvme_pr_type original_rtype;
> > > +       u64 prkey = le64_to_cpu(d->prkey);
> > > +       u16 status;
> > > +
> > > +       lockdep_assert_held(&pr->pr_lock);
> > > +       holder = rcu_dereference_protected(pr->holder,
> > > +                       lockdep_is_held(&pr->pr_lock));
> > > +       if (!holder)
> > > +               return nvmet_pr_unreg_all_host_by_prkey(req, prkey,
> > > +                                       &ctrl->hostid, abort);
> > > +
> > > +       original_rtype = holder->rtype;
> > > +       if (original_rtype == NVME_PR_WRITE_EXCLUSIVE_ALL_REGS ||
> > > +           original_rtype == NVME_PR_EXCLUSIVE_ACCESS_ALL_REGS) {
> > > +               if (!prkey) {
> > > +                       nvmet_pr_unreg_all_others(req, &ctrl->hostid, abort);
> > > +                       nvmet_pr_set_new_holder(pr, rtype, reg);
> > You didnot address the reservation atomicity here. You still remove an
> > old holder and then after a some time you set the new holder.
> > In this time gap an incomming WRITE command from a non holder will write
> > to the media.
> Right, I will change it in v10.
> > > +                       return NVME_SC_SUCCESS;
> > > +               }
> > > +               return nvmet_pr_unreg_all_host_by_prkey(req, prkey,
> > > +                               &ctrl->hostid, abort);
> > > +       }
> > > +
> > > +       if (holder == reg) {
> > > +               status = nvmet_pr_update_reg_attr(pr, holder,
> > > +                               nvmet_pr_update_holder_rtype, &rtype);
> > > +               if (!status && original_rtype != rtype)
> > > +                       nvmet_pr_resv_released(pr, &reg->hostid);
> > > +               return status;
> > > +       }
> > > +
> > > +       if (prkey == holder->rkey) {
> > > +               status = nvmet_pr_unreg_all_others_by_prkey(req, prkey,
> > > +                                       &ctrl->hostid, abort);
> > And here too that timegap with released reservation.
> I will fix this too, and also the
> 
> nvmet_pr_unreg_all_others_by_prkey() does not need a return value,
> 
> I remove it.
> 
> > > +               if (status)
> > > +                       return status;
> > > +
> > > +               nvmet_pr_set_new_holder(pr, rtype, reg);
> > > +               if (original_rtype != rtype)
> > > +                       nvmet_pr_resv_released(pr, &reg->hostid);
> > This function is(may be) already invoked in nvmet_pr_unregister_one.
> > 
> Not, here the original_rtype is not NVME_PR_WRITE_EXCLUSIVE_REG_ONLY or
> NVME_PR_EXCLUSIVE_ACCESS_ALL_REGS, so the nvmet_pr_unregister_one will
> not call nvmet_pr_resv_released.

nvmet_pr_unregister_one sends an event for original_rtype == *_REG_ONLY
Here you have only original_rtype != ALL_REGS, so _REG_ONLY is possible.

> > > +               return status;
> > > +       }
> > > +
> > > +       if (prkey)
> > > +               return nvmet_pr_unreg_all_host_by_prkey(req, prkey,
> > > +                                       &ctrl->hostid, abort);
> > > +       return NVME_SC_INVALID_FIELD | NVME_SC_DNR;
> > > +}
> > > +
> > > +static void nvmet_pr_confirm_ns_pc_ref(struct percpu_ref *ref)
> > > +{
> > > +       struct nvmet_pr_per_ctrl_ref *pc_ref =
> > > +               container_of(ref, struct nvmet_pr_per_ctrl_ref, ref);
> > > +
> > > +       complete(&pc_ref->confirm_done);
> > > +}
> > > +
> > > +static void nvmet_pr_do_abort(struct nvmet_req *req)
> > > +{
> > > +       struct nvmet_subsys *subsys = req->sq->ctrl->subsys;
> > > +       struct nvmet_pr_per_ctrl_ref *pc_ref;
> > > +       struct nvmet_ctrl *ctrl = NULL;
> > > +
> > > +find_next:
> > > +       mutex_lock(&subsys->lock);
> > > +       list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> > > +               if (ctrl->pr_abort)
> > > +                       goto do_abort;
> > > +       }
> > > +       mutex_unlock(&subsys->lock);
> > > +       return;
> > > +
> > > +do_abort:
> > > +       mutex_unlock(&subsys->lock);
> > > +       /*
> > > +        * The target dose not support abort, just wait ref to 0.
> > typo "dose"->"does"
> > 
> Sure.
> > > +        */
> > > +       pc_ref = xa_load(&req->ns->pr_per_ctrl_refs, ctrl->cntlid);
> > > +       percpu_ref_kill_and_confirm(&pc_ref->ref, nvmet_pr_confirm_ns_pc_ref);
> > > +       wait_for_completion(&pc_ref->confirm_done);
> > > +       wait_for_completion(&pc_ref->free_done);
> > > +       reinit_completion(&pc_ref->confirm_done);
> > > +       reinit_completion(&pc_ref->free_done);
> > > +       percpu_ref_resurrect(&pc_ref->ref);
> > > +
> > > +       ctrl->pr_abort = false;
> > pr_abort flag is per controller, but this function is per namespace.
> > The use case of preempt_and_abort is to preempt reservation of a "zomby" host
> > for the all namespaces. It means that preempt_and_abort command will be sent
> > for the each namespace and there will be several parallel abort processes for
> > the same controllers. And you should not mix them.
> > 
> > May be doing ref_kill instead of setting ctlr->pr_abort=true will more
> > accuratelly set the barier for incoming IO. You may use Mark feature in
> > XArrays to mark pc_ref to wait the completion for and use xa_for_each_marked
> > for going through such pc_refs.
> 
> Good idea, thanks very much, will changed in v10.
> 
> Best Regards,
> 
> Guixin Liu
> 
> > > +       nvmet_ctrl_put(ctrl);
> > > +       goto find_next;
> > > +}
> > > +
> > > +static u16 __nvmet_execute_pr_acquire(struct nvmet_req *req,
> > > +                                     struct nvmet_pr_registrant *reg,
> > > +                                     u8 acquire_act,
> > > +                                     u8 rtype,
> > > +                                     struct nvmet_pr_acquire_data *d)
> > > +{
> > > +       u16 status;
> > > +
> > > +       switch (acquire_act) {
> > > +       case NVME_PR_ACQUIRE_ACT_ACQUIRE:
> > > +               status = nvmet_pr_acquire(req, reg, rtype);
> > > +               goto out;
> > > +       case NVME_PR_ACQUIRE_ACT_PREEMPT:
> > > +               status = nvmet_pr_preempt(req, reg, rtype, d, false);
> > > +               goto inc_gen;
> > > +       case NVME_PR_ACQUIRE_ACT_PREEMPT_AND_ABORT:
> > > +               status = nvmet_pr_preempt(req, reg, rtype, d, true);
> > > +               if (!status)
> > > +                       nvmet_pr_do_abort(req);
> > > +               goto inc_gen;
> > > +       default:
> > > +               req->error_loc = offsetof(struct nvme_common_command, cdw10);
> > > +               status = NVME_SC_INVALID_OPCODE | NVME_SC_DNR;
> > > +               goto out;
> > > +       }
> > > +inc_gen:
> > > +       if (!status)
> > > +               atomic_inc(&req->ns->pr.generation);
> > > +out:
> > > +       return status;
> > > +}
> > > +


  reply	other threads:[~2024-09-24  5:55 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-23  9:47 [PATCH v9 0/1] Implement the NVMe reservation feature Guixin Liu
2024-09-23  9:47 ` [PATCH v9 1/1] nvmet: support " Guixin Liu
2024-09-23 16:32   ` Dmitry Bogdanov
2024-09-24  2:49     ` Guixin Liu
2024-09-24  5:54       ` Dmitry Bogdanov [this message]
2024-09-24  6:18         ` Guixin Liu
2024-09-24  6:38           ` Guixin Liu
2024-09-24  8:24             ` Dmitry Bogdanov
2024-09-24  9:45               ` Guixin Liu
2024-09-24 12:57                 ` Dmitry Bogdanov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240924055433.GE22571@yadro.com \
    --to=d.bogdanov@yadro.com \
    --cc=hch@lst.de \
    --cc=kanie@linux.alibaba.com \
    --cc=kch@nvidia.com \
    --cc=linux-nvme@lists.infradead.org \
    --cc=sagi@grimberg.me \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.