From: Nilay Shroff
To: Hannes Reinecke, linux-nvme@lists.infradead.org
Cc: kbusch@kernel.org, hch@lst.de, axboe@kernel.dk, dwagner@suse.de, gjoyce@ibm.com
Subject: Re: [RFC PATCHv2 2/4] nvme-multipath: add support for adaptive I/O policy
Date: Tue, 14 Oct 2025 14:27:59 +0530
Message-ID: <6049dd46-e841-4446-9d33-09b7072249d6@linux.ibm.com>
In-Reply-To: <24f76fe3-9508-4c05-976b-8be9ec9a4371@suse.de>
References: <20251009100608.1699550-1-nilay@linux.ibm.com>
 <20251009100608.1699550-3-nilay@linux.ibm.com>
 <24f76fe3-9508-4c05-976b-8be9ec9a4371@suse.de>

>> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
>> index c212fa952c0f..69d2f72d0e86 100644
>> --- a/drivers/nvme/host/ioctl.c
>> +++ b/drivers/nvme/host/ioctl.c
>> @@ -711,7 +711,7 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
>>           flags |= NVME_IOCTL_PARTITION;
>>
>>       srcu_idx = srcu_read_lock(&head->srcu);
>> -    ns = nvme_find_path(head);
>> +    ns = nvme_find_path(head, open_for_write ? WRITE : READ);
>>       if (!ns)
>>           goto out_unlock;
>>
>> @@ -742,7 +742,7 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>>       int srcu_idx, ret = -EWOULDBLOCK;
>>
>>       srcu_idx = srcu_read_lock(&head->srcu);
>> -    ns = nvme_find_path(head);
>> +    ns = nvme_find_path(head, open_for_write ? WRITE : READ);
>>       if (!ns)
>>           goto out_unlock;
>>
> Are we sure that we should account for non-IO commands here, too?
> Thing is, the I/O policy should be just that, directing I/O to the
> various paths.
> But what about commands on the admin queue? Should they be influenced by
> the same policy? And can we even define what a 'read' command is?
> Wouldn't it be better to pass in a third option here (NONE?) to make it
> clear that this is a non-I/O command?
>
Yeah, that makes sense: we should differentiate between read/write and
other (non-I/O) commands here. I'd implement this in the next patchset.
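Something along these lines, perhaps (a rough sketch only; the enum and
its names are placeholders, not the final interface):

    /* Direction hint passed to nvme_find_path(); names are hypothetical. */
    enum nvme_path_dir {
        NVME_PATH_READ,
        NVME_PATH_WRITE,
        NVME_PATH_NONE,     /* admin and other non-I/O commands */
    };

Non-I/O callers such as nvme_ns_head_ioctl() would then pass
NVME_PATH_NONE, and the adaptive selector could fall back to a cheap
default instead of consulting the per-direction latency stats.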
>> @@ -762,7 +762,8 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
>>       struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
>>       struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
>>       int srcu_idx = srcu_read_lock(&head->srcu);
>> -    struct nvme_ns *ns = nvme_find_path(head);
>> +    struct nvme_ns *ns = nvme_find_path(head,
>> +            ioucmd->file->f_mode & FMODE_WRITE ? WRITE : READ);
>>       int ret = -EINVAL;
>>
>>       if (ns)
>
> See above. I really think that we should be able to pass in a third
> option for admin commands.

Yep, as mentioned, I'll do this in the next patchset.

>
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index 3da980dc60d9..9ecdaca5e9a0 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -6,6 +6,8 @@
>>   #include 
>>   #include 
>>   #include 
>> +#include 
>> +#include 
>>   #include 
>>   #include "nvme.h"
>>
> Ouch.
> #include <linux/math64.h>
> always spells disaster :-)

This is included for div_u64(). We could replace it with the usual
division operator (i.e. dividend / divisor), but a plain 64-bit division
like that doesn't work well on 32-bit architectures, which is exactly
what div_u64() is there for.
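To make the fixed-point arithmetic concrete, here is a small userspace
stand-in (the sample latencies are made up, and the plain "/" below is
what div_u64() does in the kernel):

    #include <stdio.h>
    #include <stdint.h>

    #define NSEC_PER_SEC 1000000000ULL

    int main(void)
    {
        /* Invented smoothed latencies for two paths, in nanoseconds. */
        uint64_t slat_ns[2] = { 500000, 2000000 }; /* 0.5 ms and 2 ms */

        for (int i = 0; i < 2; i++) {
            /*
             * Score is the inverse of the smoothed latency, scaled
             * by NSEC_PER_SEC so it stays an integer; the kernel
             * cannot use floating point, hence fixed-point scaling.
             */
            uint64_t score = NSEC_PER_SEC / slat_ns[i];
            printf("path %d: slat=%llu ns, score=%llu\n", i,
                   (unsigned long long)slat_ns[i],
                   (unsigned long long)score);
        }
        return 0;
    }

The faster path (0.5 ms) scores 2000 and the slower one (2 ms) scores
500, so the score ratio directly mirrors the latency ratio.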
>> +static void nvme_mpath_weight_work(struct work_struct *weight_work)
>> +{
>> +    int cpu, srcu_idx;
>> +    u32 weight;
>> +    struct nvme_ns *ns;
>> +    struct nvme_path_stat *stat;
>> +    struct nvme_path_work *work = container_of(weight_work,
>> +            struct nvme_path_work, weight_work);
>> +    struct nvme_ns_head *head = work->ns->head;
>> +    int rw = work->rw;
>> +    u64 total_score = 0;
>> +
>> +    cpu = get_cpu();
>> +
>> +    srcu_idx = srcu_read_lock(&head->srcu);
>> +    list_for_each_entry_srcu(ns, &head->list, siblings,
>> +            srcu_read_lock_held(&head->srcu)) {
>> +
>> +        stat = &this_cpu_ptr(ns->info)[rw].stat;
>> +        if (!READ_ONCE(stat->slat_ns))
>> +            continue;
>> +        /*
>> +         * Compute the path score as the inverse of smoothed
>> +         * latency, scaled by NSEC_PER_SEC. Floating point
>> +         * math is unavailable in the kernel, so fixed-point
>> +         * scaling is used instead. NSEC_PER_SEC is chosen
>> +         * because valid latencies are always < 1 second; longer
>> +         * latencies are ignored.
>> +         */
>> +        stat->score = div_u64(NSEC_PER_SEC, READ_ONCE(stat->slat_ns));
>> +
>> +        /* Compute total score. */
>> +        total_score += stat->score;
>> +    }
>> +
>> +    if (!total_score)
>> +        goto out;
>> +
>> +    /*
>> +     * After computing the total score, we derive the per-path weight
>> +     * (normalized to the range 0–100). The weight represents the
>> +     * relative share of I/O the path should receive.
>
> Why do we use the range 0-100?
> While this is nice for human consumption, it doesn't lend itself to a
> nice computation. Maybe we should select a different scale to allow
> for more efficient computation (0-128? Maybe?)

Yes, we can normalize this to any range. Let's scale it to 0-128 as you
suggested; being a power of two, it also makes the computation cheaper.
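With a power-of-two scale, the normalization multiply becomes a shift.
A quick userspace sketch with invented scores (in the kernel the "/"
would again be div_u64()):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Invented per-path scores from the inverse-latency step. */
        uint64_t score[3] = { 2000, 500, 1500 };
        uint64_t total = 0;

        for (int i = 0; i < 3; i++)
            total += score[i];

        for (int i = 0; i < 3; i++) {
            /* Normalize to 0..128: "* 128" is a left shift by 7. */
            uint64_t weight = (score[i] << 7) / total;
            printf("path %d: weight=%llu/128\n", i,
                   (unsigned long long)weight);
        }
        return 0;
    }

This prints weights 64, 16 and 48, which sum to the full scale of 128
and preserve the 4:1:3 score ratio.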
>> +
>> +#define NVME_EWMA_SHIFT    3
>
> And we use this value why?

The reason is that we want to assign ~87.5% weight to the existing
(smoothed) latency and ~12.5% weight to the new sample; a shift of 3 is
what produces that split in ewma_update() below. I'd add a comment above
this macro to make the intent clear.

>
>> +static inline u64 ewma_update(u64 old, u64 new)
>> +{
>> +    return (old * ((1 << NVME_EWMA_SHIFT) - 1) + new) >> NVME_EWMA_SHIFT;
>> +}
>> +
>> +static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
>> +{
>> +    int cpu;
>> +    unsigned int rw;
>> +    struct nvme_path_info *info;
>> +    struct nvme_path_stat *stat;
>> +    u64 now, latency, slat_ns, avg_lat_ns;
>> +    struct nvme_ns_head *head = ns->head;
>> +
>> +    if (list_is_singular(&head->list))
>> +        return;
>> +
>> +    now = ktime_get_ns();
>> +    latency = now >= rq->io_start_time_ns ? now - rq->io_start_time_ns : 0;
>> +    if (!latency)
>> +        return;
>> +
>> +    /*
>> +     * As the completion code path is serialized (i.e. the same completion
>> +     * queue update code can't run simultaneously on multiple cpus) we can
>> +     * safely access the per-cpu nvme path stat here from another cpu (in
>> +     * case the completion cpu differs from the submission cpu).
>> +     * The only field which could be accessed simultaneously here is the
>> +     * path ->weight, which may be accessed by this function as well as
>> +     * the I/O submission path during path selection, and we protect
>> +     * ->weight using READ_ONCE/WRITE_ONCE. This may not be 100% accurate,
>> +     * but we don't need to be that accurate here, as the path credit is
>> +     * refilled, based on the path weight, once the path consumes all its
>> +     * credits. We also cap the path weight/credit at 100. Please also
>> +     * refer to nvme_adaptive_path().
>> +     */
>> +    cpu = blk_mq_rq_cpu(rq);
>> +    rw = rq_data_dir(rq);
>> +    info = &per_cpu_ptr(ns->info, cpu)[rw];
>> +    stat = &info->stat;
>> +
> Hmm. While the 'SAME_COMP' attribute should help here, I remain
> sceptical.
> Wouldn't it be possible to look at the hwq map (eg by using something
> like blk_mq_map_queue_type()) to figure out the hw context, and then
> use the first cpu from that mask?
> That way we are guaranteed to always be using the same per-cpu ptr.
> Might be overkill, though; but it would be good to have some
> checks in here in case we do run on the wrong CPU.
>
Okay, so I think you propose ensuring that the submitting cpu matches
the cpu on which the completion I/O sample is processed. That makes
sense, but here we use blk_mq_rq_cpu(), which yields the submitting cpu
of the request, and we then use that cpu number to get the per-cpu stat
via per_cpu_ptr(). As we don't rely on the current cpu on which the
completion sample happens to be processed, I think we're all good.
Agreed?

>> +    /*
>> +     * If latency > ~1s then ignore this sample to prevent EWMA from being
>> +     * skewed by pathological outliers (multi-second waits, controller
>> +     * timeouts etc.). This keeps path scores representative of normal
>> +     * performance and avoids instability from rare spikes. If such high
>> +     * latency is real, ANA state reporting or keep-alive error counters
>> +     * will mark the path unhealthy and remove it from the head node list,
>> +     * so we can safely skip such a sample here.
>> +     */
>> +    if (unlikely(latency > NSEC_PER_SEC)) {
>> +        stat->nr_ignored++;
>> +        return;
>> +    }
>
> I would even go so far as to indicate that EWMA is unusable here.
> Can we issue a warning / debug message here to indicate that the path
> selector is running suboptimally?
>
We can certainly do that, but as you can see above, we increment
->nr_ignored here, and the user can then read this value through
debugfs. Is that sufficient, or do you want an explicit WARN or debug
message to be printed as well?

>> +
>> +    /*
>> +     * Accumulate latency samples and increment the batch count for each
>> +     * ~15 second interval. When the interval expires, compute the simple
>> +     * average latency over that window, then update the smoothed (EWMA)
>> +     * latency. The path weight is recalculated based on this smoothed
>> +     * latency.
>> +     */
>
> Why 15 seconds? Can we make this changeable eg via debugfs?

Yes, we can certainly make this configurable. BTW, should it be done
through a new sysfs attribute instead of debugfs?

>
>> +    stat->batch += latency;
>> +    stat->batch_count++;
>> +    stat->nr_samples++;
>> +
>> +    if (now > stat->last_weight_ts &&
>> +            (now - stat->last_weight_ts) >= 15 * NSEC_PER_SEC) {
>> +
>
> At the very least make this a #define. But ideally one should be able
> to modify that.
>
Yes, I'd define a macro for the 15 second interval, and we'll also make
it configurable. I'd implement this in the next patchset.

>> +        stat->last_weight_ts = now;
>> +
>> +        /*
>> +         * Find the simple average latency for the last epoch (~15 sec
>> +         * interval).
>> +         */
>> +        avg_lat_ns = div_u64(stat->batch, stat->batch_count);
>> +
>> +        /*
>> +         * Calculate the smoothed/EWMA (Exponentially Weighted Moving
>> +         * Average) latency. EWMA is preferred over a simple average
>> +         * because it smooths naturally, reduces jitter from sudden
>> +         * spikes, and adapts faster to changing conditions. It also
>> +         * avoids storing historical samples, and works well for both
>> +         * slow and fast I/O rates.
>> +         * Formula:
>> +         * slat_ns = (prev_slat_ns * (WEIGHT - 1) + latency) / WEIGHT
>> +         * With WEIGHT = 8, this assigns 7/8 (~87.5%) weight to the
>> +         * existing latency and 1/8 (~12.5%) weight to the new latency.
>> +         */
>
> Similar comment to the WEIGHT. While 8 might be a nice choice, there is
> nothing indicating it's always the right choice.
> Please make it a #define and see if we cannot modify it via debugfs.
>
Yeah, sure. As discussed earlier, this value is derived from
NVME_EWMA_SHIFT, and I'd make it configurable in the next patchset.
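For the debugfs route, the wiring could be as simple as this (purely
illustrative: the variable names, file names, and the parent dentry
"nvme_dbg_dir" are all placeholders, not existing driver symbols):

    #include <linux/debugfs.h>

    /* Tunables with the current defaults; names are hypothetical. */
    static u32 adaptive_epoch_ms = 15000;  /* ~15 s weight-update epoch */
    static u32 adaptive_ewma_shift = 3;    /* 7/8 old, 1/8 new sample */

    static void nvme_mpath_debugfs_init(struct dentry *nvme_dbg_dir)
    {
        /* debugfs_create_u32() exposes a writable u32 knob. */
        debugfs_create_u32("adaptive_epoch_ms", 0644, nvme_dbg_dir,
                           &adaptive_epoch_ms);
        debugfs_create_u32("adaptive_ewma_shift", 0644, nvme_dbg_dir,
                           &adaptive_ewma_shift);
    }

A sysfs attribute would instead go through the usual DEVICE_ATTR_RW()
boilerplate and become part of the stable ABI, which is the real
trade-off between the two.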
>> +int nvme_alloc_ns_stat(struct nvme_ns *ns)
>> +{
>> +    int rw, cpu;
>> +    struct nvme_path_work *work;
>> +    gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
>> +
>> +    if (!ns->head->disk)
>> +        return 0;
>> +
>> +    ns->info = __alloc_percpu_gfp(2 * sizeof(struct nvme_path_info),
>> +            __alignof__(struct nvme_path_info), gfp);
>> +    if (!ns->info)
>> +        return -ENOMEM;
>> +
>> +    for_each_possible_cpu(cpu) {
>> +        for (rw = 0; rw < 2; rw++) {
>> +            work = &per_cpu_ptr(ns->info, cpu)[rw].work;
>> +            work->ns = ns;
>> +            work->rw = rw;
>> +            INIT_WORK(&work->weight_work, nvme_mpath_weight_work);
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>
> The more I think about it the more I like the idea of having a third
> direction (and not just READ and WRITE).
> The latency of I/O commands _will_ be different from the latency of
> admin commands, so the adaptive policy might come to different decisions
> for admin commands.

Yes, as I said above, I'll address this in the next patchset.

>> +static struct nvme_ns *nvme_adaptive_path(struct nvme_ns_head *head,
>> +        unsigned int rw)
>> +{
>> +    struct nvme_ns *ns, *found = NULL;
>> +    struct nvme_path_stat *stat;
>> +    u32 weight;
>> +    int refill = 0;
>> +
>> +    get_cpu();
>> +retry:
>> +    list_for_each_entry_srcu(ns, &head->list, siblings,
>> +            srcu_read_lock_held(&head->srcu)) {
>> +
>> +        if (nvme_path_is_disabled(ns) ||
>> +                !nvme_state_is_live(ns->ana_state))
>> +            continue;
>> +
>> +        stat = &this_cpu_ptr(ns->info)[rw].stat;
>> +
>> +        /*
>> +         * When the head path-list is singular, we skip calculating the
>> +         * weight of the only path as an optimization: there is no need
>> +         * to forward I/O to more than one path. The other possibility
>> +         * is that a path is newly added and we don't know its weight
>> +         * yet. So we go round-robin over each such path and forward
>> +         * I/O to it. Once we start getting responses for those I/Os,
>> +         * the path weight calculation kicks in and we start using
>> +         * path credits for forwarding I/O.
>> +         */
>> +        weight = READ_ONCE(stat->weight);
>> +        if (unlikely(!weight)) {
>> +            found = ns;
>> +            goto out;
>> +        }
>> +
>> +        /*
>> +         * To keep path selection logic simple, we don't distinguish
>> +         * between ANA optimized and non-optimized states. The non-
>> +         * optimized path is expected to have a lower weight, and
>> +         * therefore fewer credits. As a result, only a small number of
>> +         * I/Os will be forwarded to paths in the non-optimized state.
>> +         */
>> +        if (stat->credit > 0) {
>> +            --stat->credit;
>> +            found = ns;
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    if (!found && !list_empty(&head->list)) {
>> +        /*
>> +         * Refill credits and retry.
>> +         */
>> +        list_for_each_entry_srcu(ns, &head->list, siblings,
>> +                srcu_read_lock_held(&head->srcu)) {
>> +            if (nvme_path_is_disabled(ns) ||
>> +                    !nvme_state_is_live(ns->ana_state))
>> +                continue;
>> +
>> +            stat = &this_cpu_ptr(ns->info)[rw].stat;
>> +            weight = READ_ONCE(stat->weight);
>> +            stat->credit = weight;
>> +            refill = 1;
>> +        }
>> +        if (refill)
>> +            goto retry;
>> +    }
>
> Hmm. Loop within a loop. Not pretty.
> Can't we make do with just one loop?
> After all, if we end up here the credits for this path are used up, and
> need to be refilled.
> What would happen if we were to _just_ refill and retry?
> IE drop the second iteration and just execute the branch
> under the loop?
>
Yes, good point. I think we should be able to address this with one
loop. I'd fix this in the next patchset.
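For what it's worth, a well-known way to get down to a single pass with
no retry at all is the "smooth weighted round-robin" scheme (the one
nginx uses for upstream balancing): each selection adds every path's
weight to a per-path running counter, picks the largest counter, and
charges the winner the total weight. A rough, untested sketch against
the structures in this patch ("cur" would be a new s64 field in struct
nvme_path_stat; everything else follows the patch's naming):

    static struct nvme_ns *nvme_adaptive_path_swrr(struct nvme_ns_head *head,
            unsigned int rw)
    {
        struct nvme_ns *ns, *best = NULL;
        struct nvme_path_stat *stat, *best_stat = NULL;
        s64 total = 0;

        list_for_each_entry_srcu(ns, &head->list, siblings,
                srcu_read_lock_held(&head->srcu)) {
            if (nvme_path_is_disabled(ns) ||
                    !nvme_state_is_live(ns->ana_state))
                continue;

            stat = &this_cpu_ptr(ns->info)[rw].stat;
            /* Every live path earns its weight on each selection. */
            stat->cur += READ_ONCE(stat->weight);
            total += READ_ONCE(stat->weight);

            if (!best || stat->cur > best_stat->cur) {
                best = ns;
                best_stat = stat;
            }
        }

        /* The winner pays the total, which interleaves selections. */
        if (best)
            best_stat->cur -= total;

        return best;
    }

Over any window this converges to the same weighted shares as the
credit scheme, but without the second loop or the goto.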
>> @@ -581,7 +951,7 @@ static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector,
>>       int srcu_idx, ret = -EWOULDBLOCK;
>>
>>       srcu_idx = srcu_read_lock(&head->srcu);
>> -    ns = nvme_find_path(head);
>> +    ns = nvme_find_path(head, READ);
>>       if (ns)
>>           ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
>>       srcu_read_unlock(&head->srcu, srcu_idx);
>
> See the discussion about the third option and not just 'READ'/'WRITE'.

Yeah, noted!

>> diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
>> index 29430949ce2f..4f9607e9698a 100644
>> --- a/drivers/nvme/host/sysfs.c
>> +++ b/drivers/nvme/host/sysfs.c
>> @@ -194,7 +194,7 @@ static int ns_head_update_nuse(struct nvme_ns_head *head)
>>           return 0;
>>
>>       srcu_idx = srcu_read_lock(&head->srcu);
>> -    ns = nvme_find_path(head);
>> +    ns = nvme_find_path(head, READ);
>>       if (!ns)
>>           goto out_unlock;
>>
> Other than that it really looks good.

Thank you :)

--Nilay