Message-ID: <5334b2dd-23ab-4637-a762-e871f884db54@grimberg.me>
Date: Mon, 8 Jul 2024 17:25:03 +0300
Subject: Re: [PATCH 1/3] nvme-tcp: improve rx/tx fairness
To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig
Cc: Keith Busch, linux-nvme@lists.infradead.org
References: <20240708071013.69984-1-hare@kernel.org> <20240708071013.69984-2-hare@kernel.org> <96bb4107-abc0-4fd3-8e2b-35eafa6a5d4f@grimberg.me>
From: Sagi Grimberg

On 08/07/2024 16:21, Hannes Reinecke wrote:
> On 7/8/24 13:57, Sagi Grimberg wrote:
>> Hey Hannes, thanks for doing this.
>>
>> On 08/07/2024 10:10, Hannes Reinecke wrote:
>>> We need to restrict both sides, rx and tx, to only run for a certain
>>> time to ensure that we're not blocking the other side and inducing
>>> starvation. So pass in a 'deadline' value to nvme_tcp_send_all()
>>
>> Please split the addition of nvme_tcp_send_all() into a separate prep
>> patch.
>>
> Okay, no problem.
>
>>> and nvme_tcp_try_recv() and break out of the loop if the deadline
>>> is reached.
>>
>> I think we want to limit the rx/tx in PDUs/bytes. This will also
>> allow us to possibly do burst rx from data-ready.
>>
> PDUs are not the best scheduling boundary here, as each PDU can be of
> a different size, and the network interface most definitely is limited
> by the number of bytes transferred.

Well, PDUs are at least *a* proxy for bytes in terms of limits, but I
agree bytes may be more appropriate.

>
>>>
>>> As we now have a timestamp we can also use it to print out a warning
>>> if the actual time spent exceeds the deadline.
>>>
>>> Performance comparison:
>>>                 baseline   rx/tx fairness
>>> 4k seq write:   449MiB/s   480MiB/s
>>> 4k rand write:  410MiB/s   481MiB/s
>>> 4k seq read:    478MiB/s   481MiB/s
>>> 4k rand read:   547MiB/s   480MiB/s
>>>
>>> Random read is ever so disappointing, but that will be fixed with
>>> the later patches.
>>
>> That is a significant decline in relative perf. I'm counting 12.5%
>> (547MiB/s down to 480MiB/s on random read)...
>> Can you explain why that is?
>>
> Not really :-(
> But then fairness cuts both ways; so I am not surprised that some
> workloads suffer here.

Well, 12.5% is rather steep, don't you agree? I'm even more surprised
that seq/rand read make a difference...

>
>
>> How does this look for multiple controllers?
>>
> Haven't really checked (yet); it would make a rather weak case if
> we killed performance just to scale better ...
>
>>
>>
>>>
>>> Signed-off-by: Hannes Reinecke
>>> ---
>>>   drivers/nvme/host/tcp.c | 38 +++++++++++++++++++++++++++++---------
>>>   1 file changed, 29 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 0873b3949355..f621d3ba89b2 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -153,6 +153,7 @@ struct nvme_tcp_queue {
>>>       size_t            data_remaining;
>>>       size_t            ddgst_remaining;
>>>       unsigned int        nr_cqe;
>>> +    unsigned long        deadline;
>>
>> I don't see why you need to keep this in the queue struct. You could
>> have easily initialized it in the read_descriptor_t and tested
>> against it.
>>
> Because I wanted to interrupt the receive side for large data
> transfers, and haven't found a way to pass the deadline into
> ->read_sock().

desc.count is for you to interpret however you want, no?

>
>>>       /* send state */
>>>       struct nvme_tcp_request *request;
>>> @@ -359,14 +360,18 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
>>>       }
>>>   }
>>> -static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
>>> +static inline int nvme_tcp_send_all(struct nvme_tcp_queue *queue,
>>> +                    unsigned long deadline)
>>>   {
>>>       int ret;
>>>       /* drain the send queue as much as we can... */
>>>       do {
>>>           ret = nvme_tcp_try_send(queue);
>>> +        if (time_after(jiffies, deadline))
>>> +            break;
>>>       } while (ret > 0);
>>> +    return ret;
>>
>> I think you want a different interface, nvme_tcp_send_budgeted(queue,
>> budget).
>> I don't know what you pass here, but jiffies is a rather large
>> granularity...
>>
> Hmm. I've been using jiffies as the io_work loop had been counting
> in jiffies. But let me check what would happen if I move that over to
> PDU / size counting.

Cool.

>
>>>   }
>>>   static inline bool nvme_tcp_queue_has_pending(struct nvme_tcp_queue *queue)
>>> @@ -385,6 +390,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>>           bool sync, bool last)
>>>   {
>>>       struct nvme_tcp_queue *queue = req->queue;
>>> +    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>>       bool empty;
>>>       empty = llist_add(&req->lentry, &queue->req_list) &&
>>> @@ -397,7 +403,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>>        */
>>>       if (queue->io_cpu == raw_smp_processor_id() &&
>>>           sync && empty && mutex_trylock(&queue->send_mutex)) {
>>> -        nvme_tcp_send_all(queue);
>>> +        nvme_tcp_send_all(queue, deadline);
>>>           mutex_unlock(&queue->send_mutex);
>>>       }
>>
>> Umm, spend up to a millisecond in queue_request? Sounds like way
>> too much...
>> Did you ever see this deadline exceeded? Sends should be rather quick...
>>
> Of _course_ it's too long. That's kinda the point.
> But I'm seeing network latency up to 4000 msecs (!) on my test setup,
> so _that_ is the least of my worries ...

This has nothing to do with the network latency afaict. This is just a
matter of how long it takes to write to the socket buffer (or in our
case, construct skbs and add them to the socket, as TX is zero copy).
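Btw, to make the budget interface I'm suggesting above a bit more
concrete, something along these lines is what I have in mind. This is
only a sketch: nvme_tcp_send_budgeted() is a made-up name, the budget is
counted in requests (just a proxy for PDUs/bytes as discussed above),
and it assumes nvme_tcp_try_send() makes progress on at most one request
per call:

static inline int nvme_tcp_send_budgeted(struct nvme_tcp_queue *queue,
		unsigned int budget)
{
	int ret = 0;

	/*
	 * Drain the send queue, but give up after 'budget' requests so
	 * that the rx side gets a chance to run.  A positive return
	 * means there may still be more queued to send.
	 */
	while (budget--) {
		ret = nvme_tcp_try_send(queue);
		if (ret <= 0)
			break;
	}
	return ret;
}

Callers would then treat a positive return the same way they treat a
positive return from nvme_tcp_try_send() today: mark the queue as still
pending so io_work gets re-armed.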
>
>>> @@ -959,9 +965,14 @@ static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
>>>               nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>>>               return result;
>>>           }
>>> +        if (time_after(jiffies, queue->deadline)) {
>>> +            desc->count = 0;
>>> +            break;
>>> +        }
>>> +
>>
>> That is still not right.
>> You don't want to spend a full deadline reading from the socket, and
>> then spend a full deadline writing to the socket...
>>
> Yes, and no.
> Problem is that the current code serializes writes and reads.
> Essentially for each iteration we first do a write, and then a read.
> If we spend the full deadline on write we will need to reschedule,
> but we then _again_ start with writes. This leads to a heavy preference
> for writing, and a negative performance impact.

Yes, the assumption we should take is that the tx/rx limits allow
switching between the two at least once or twice in the io_work while
loop, and that they get a fair'ish share of the quota.

>
>> You want the io_work to take a full deadline, and give try_send and
>> try_recv budgets. And set them to sane counts. Say 8 PDUs, or
>> 64k bytes. We want to get to some magic value that presents sane
>> behavior, confidently fits inside a deadline, and is fair.
>>
> Easier said than done.
> Biggest problem is that most of the latency increase comes from the
> actual 'sendmsg()' and 'read_sock()' calls.

This is a bit surprising, because read_sock reads data that was already
processed by the tcp stack and is ready to simply be copied back to the
user pages. Same for sendmsg: the socket is non-blocking, so I would not
expect it to stall for unusually long periods. Maybe there is an issue
hiding in how we call send/recv?

> And the only way of inhibiting that would be to check _prior_ whether
> we can issue the call in the first place.
> (That's why I did the SOCK_NOSPACE tests in the previous patchsets).

The socket is non-blocking. I always thought that SOCK_NOSPACE is
serving blocking sockets... afaik, there shouldn't be a real difference
between checking NOSPACE vs. attempting sendmsg on a non-blocking
socket... But I could be wrong here..
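Just to illustrate the shape I have in mind for io_work (again only a
sketch: nvme_tcp_send_budgeted()/nvme_tcp_recv_budgeted() are
hypothetical budgeted variants, and the 8-request / 64k-byte values are
placeholders I have not tested):

static void nvme_tcp_io_work(struct work_struct *w)
{
	struct nvme_tcp_queue *queue =
		container_of(w, struct nvme_tcp_queue, io_work);
	unsigned long deadline = jiffies + msecs_to_jiffies(1);

	do {
		bool pending = false;
		int result;

		if (mutex_trylock(&queue->send_mutex)) {
			/* small tx budget, e.g. up to 8 requests */
			result = nvme_tcp_send_budgeted(queue, 8);
			mutex_unlock(&queue->send_mutex);
			if (result > 0)
				pending = true;
			else if (unlikely(result < 0))
				break;
		}

		/* small rx budget, e.g. up to 64k bytes */
		result = nvme_tcp_recv_budgeted(queue, SZ_64K);
		if (result > 0)
			pending = true;
		else if (unlikely(result < 0))
			return;

		if (!pending || !queue->rd_enabled)
			return;
	} while (!time_after(jiffies, deadline)); /* overall quota exhausted */

	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}

The point being that tx and rx alternate several times within the one
deadline, so neither side can monopolize the whole millisecond.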
>
>>>       }
>>> -    return consumed;
>>> +    return consumed - len;
>>>   }
>>>   static void nvme_tcp_data_ready(struct sock *sk)
>>> @@ -1258,7 +1269,7 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
>>>       return ret;
>>>   }
>>> -static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>>> +static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue, unsigned long deadline)
>>>   {
>>>       struct socket *sock = queue->sock;
>>>       struct sock *sk = sock->sk;
>>> @@ -1269,6 +1280,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>>>       rd_desc.count = 1;
>>>       lock_sock(sk);
>>>       queue->nr_cqe = 0;
>>> +    queue->deadline = deadline;
>>>       consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
>>>       release_sock(sk);
>>>       return consumed;
>>> @@ -1278,14 +1290,15 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>>   {
>>>       struct nvme_tcp_queue *queue =
>>>           container_of(w, struct nvme_tcp_queue, io_work);
>>> -    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>> +    unsigned long tx_deadline = jiffies + msecs_to_jiffies(1);
>>> +    unsigned long rx_deadline = tx_deadline + msecs_to_jiffies(1), overrun;
>>>       do {
>>>           bool pending = false;
>>>           int result;
>>>           if (mutex_trylock(&queue->send_mutex)) {
>>> -            result = nvme_tcp_try_send(queue);
>>> +            result = nvme_tcp_send_all(queue, tx_deadline);
>>>               mutex_unlock(&queue->send_mutex);
>>>               if (result > 0)
>>>                   pending = true;
>>> @@ -1293,7 +1306,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>>                   break;
>>>           }
>>> -        result = nvme_tcp_try_recv(queue);
>>> +        result = nvme_tcp_try_recv(queue, rx_deadline);
>>
>> I think you want a more frequent substitution of sends/receives. The
>> granularity of a 1ms budget may be too coarse?
>>
> The problem is not the granularity, the problem is the latency spikes
> I'm seeing when issuing 'sendmsg' or 'read_sock'.

I think it's possible that there is another underlying issue, because
these calls should NOT be blocked for a long period of time on
non-blocking sockets.

>
>>>           if (result > 0)
>>>               pending = true;
>>>           else if (unlikely(result < 0))
>>> @@ -1302,7 +1315,13 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>>           if (!pending || !queue->rd_enabled)
>>>               return;
>>> -    } while (!time_after(jiffies, deadline)); /* quota is exhausted */
>>> +    } while (!time_after(jiffies, rx_deadline)); /* quota is exhausted */
>>> +
>>> +    overrun = jiffies - rx_deadline;
>>> +    if (nvme_tcp_queue_id(queue) > 0 &&
>>> +        overrun > msecs_to_jiffies(10))
>>> +        dev_dbg(queue->ctrl->ctrl.device, "queue %d: queue stall (%u msecs)\n",
>>> +            nvme_tcp_queue_id(queue), jiffies_to_msecs(overrun));
>>
>> Umm, ok. Why 10? Why not 2? Or 3?
>> Do you expect io_work to spend more time executing?
>>
> Eg 4000 msecs like on my testbed?

If you see ANYTHING in the IO path stalling for 4 seconds then we have a
bigger issue here.

> Yes.
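And coming back to the desc.count point from further up: the rx budget
could live in the read_descriptor_t rather than in the queue struct,
along these lines (sketch only; nvme_tcp_try_recv_budgeted() is a
made-up name and the budget is in bytes, "bytes_consumed" below is a
placeholder for however much the recv helpers handled in that pass):

static int nvme_tcp_try_recv_budgeted(struct nvme_tcp_queue *queue,
		size_t budget)
{
	struct socket *sock = queue->sock;
	struct sock *sk = sock->sk;
	read_descriptor_t rd_desc;
	int consumed;

	rd_desc.arg.data = queue;
	rd_desc.count = budget;		/* interpreted by the recv_actor */
	lock_sock(sk);
	queue->nr_cqe = 0;
	consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
	release_sock(sk);
	return consumed;
}

/*
 * ...and nvme_tcp_recv_skb() would decrement desc->count by the bytes
 * it consumed and stop once it hits zero, e.g.:
 *
 *	desc->count -= min_t(size_t, desc->count, bytes_consumed);
 *	if (!desc->count)
 *		break;
 *
 * tcp_read_sock() already stops iterating once desc->count drops to 0,
 * which is the same mechanism the desc->count = 0 in the posted patch
 * relies on.
 */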
>
>>>       queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>>   }
>>> @@ -2666,6 +2685,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
>>>   {
>>>       struct nvme_tcp_queue *queue = hctx->driver_data;
>>>       struct sock *sk = queue->sock->sk;
>>> +    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>>       if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
>>>           return 0;
>>> @@ -2673,7 +2693,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
>>>       set_bit(NVME_TCP_Q_POLLING, &queue->flags);
>>>       if (sk_can_busy_loop(sk) && skb_queue_empty_lockless(&sk->sk_receive_queue))
>>>           sk_busy_loop(sk, true);
>>> -    nvme_tcp_try_recv(queue);
>>> +    nvme_tcp_try_recv(queue, deadline);
>>
>> Spend a millisecond in nvme_tcp_poll()??
>> Isn't that too long?
> Haven't tried with polling, so can't honestly answer.
>
> In the end, it all boils down to numbers.
> I have 2 controllers with 32 queues and 256 requests.
> Running on a 10GigE link one should be getting a net throughput
> of 1GB/s, or, with 4k requests, 256k IOPS.
> Or a latency of 0.25 milliseconds per command.
> In the optimal case. As I'm seeing a bandwidth of around
> 500MB/s, I'm looking at a latency of 0.5 milliseconds _in the ideal case_.
>
> So no, I don't think 1 millisecond is too long.

nvme_tcp_poll is NOT designed to poll for the entirety of an individual
IO lifetime. It's a non-selective polling interface wired to io_uring.
It should peek the socket, consume whatever is there, and break. It most
definitely should not spend 1ms in this check...
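For reference, this is roughly what the poll path looks like today
without the deadline (quoting the current code from memory, so give or
take): it consumes whatever is already queued on the socket and returns
immediately:

static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
{
	struct nvme_tcp_queue *queue = hctx->driver_data;
	struct sock *sk = queue->sock->sk;

	if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
		return 0;

	set_bit(NVME_TCP_Q_POLLING, &queue->flags);
	if (sk_can_busy_loop(sk) &&
	    skb_queue_empty_lockless(&sk->sk_receive_queue))
		sk_busy_loop(sk, true);
	/* consume whatever is already there, then return right away */
	nvme_tcp_try_recv(queue);
	clear_bit(NVME_TCP_Q_POLLING, &queue->flags);
	return queue->nr_cqe;
}

Each invocation is supposed to be short; the block layer / io_uring will
simply call it again if it wants to keep polling.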