Message-ID: <96bb4107-abc0-4fd3-8e2b-35eafa6a5d4f@grimberg.me>
Date: Mon, 8 Jul 2024 14:57:42 +0300
Subject: Re: [PATCH 1/3] nvme-tcp: improve rx/tx fairness
From: Sagi Grimberg
To: Hannes Reinecke, Christoph Hellwig
Cc: Keith Busch, linux-nvme@lists.infradead.org
In-Reply-To: <20240708071013.69984-2-hare@kernel.org>
References: <20240708071013.69984-1-hare@kernel.org>
 <20240708071013.69984-2-hare@kernel.org>

Hey Hannes, thanks for doing this.

On 08/07/2024 10:10, Hannes Reinecke wrote:
> We need to restrict both sides, rx and tx, to only run for a certain
> time to ensure that we're not blocking the other side and inducing
> starvation. So pass in a 'deadline' value to nvme_tcp_send_all()

Please split the nvme_tcp_send_all() change into a separate prep patch.

> and nvme_tcp_try_recv()
> and break out of the loop if the deadline is reached.

I think we want to limit the rx/tx in pdus/bytes. This will also allow
us to possibly do burst rx from data-ready.

>
> As we now have a timestamp we can also use it to print out a warning
> if the actual time spent exceeds the deadline.
>
> Performance comparison:
>                   baseline     rx/tx fairness
> 4k seq write:     449MiB/s     480MiB/s
> 4k rand write:    410MiB/s     481MiB/s
> 4k seq read:      478MiB/s     481MiB/s
> 4k rand read:     547MiB/s     480MiB/s
>
> Random read is ever so disappointing, but that will be fixed with the
> later patches.

That is a significant decline in relative perf, I'm counting 12.5%...
Can you explain why that is?

How does this look for multiple controllers?

>
> Signed-off-by: Hannes Reinecke
> ---
>  drivers/nvme/host/tcp.c | 38 +++++++++++++++++++++++++++++---------
>  1 file changed, 29 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 0873b3949355..f621d3ba89b2 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -153,6 +153,7 @@ struct nvme_tcp_queue {
>          size_t                  data_remaining;
>          size_t                  ddgst_remaining;
>          unsigned int            nr_cqe;
> +        unsigned long           deadline;

I don't see why you need to keep this in the queue struct. You could
have easily initialized it in the read_descriptor_t and tested against
it.

>
>          /* send state */
>          struct nvme_tcp_request *request;
>
> @@ -359,14 +360,18 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
>          }
>  }
>
> -static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
> +static inline int nvme_tcp_send_all(struct nvme_tcp_queue *queue,
> +                unsigned long deadline)
>  {
>          int ret;
>
>          /* drain the send queue as much as we can... */
>          do {
>                  ret = nvme_tcp_try_send(queue);
> +                if (time_after(jiffies, deadline))
> +                        break;
>          } while (ret > 0);
> +        return ret;

I think you want a different interface, nvme_tcp_send_budgeted(queue,
budget). I don't know what you pass here, but jiffies is a rather large
granularity...
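Something like this, perhaps (completely untested sketch; the name and
the pdu-based budget unit are mine, just to show the shape of it):

static inline int nvme_tcp_send_budgeted(struct nvme_tcp_queue *queue,
                unsigned int budget)
{
        int ret;

        /* drain the send queue, but send at most @budget pdus (budget > 0) */
        do {
                ret = nvme_tcp_try_send(queue);
        } while (ret > 0 && --budget);

        return ret;
}

That keeps the fairness bound deterministic (N pdus) instead of
depending on how far past the deadline a single send can run.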
>  }
>
>  static inline bool nvme_tcp_queue_has_pending(struct nvme_tcp_queue *queue)
>
> @@ -385,6 +390,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>                  bool sync, bool last)
>  {
>          struct nvme_tcp_queue *queue = req->queue;
> +        unsigned long deadline = jiffies + msecs_to_jiffies(1);
>          bool empty;
>
>          empty = llist_add(&req->lentry, &queue->req_list) &&
>
> @@ -397,7 +403,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>           */
>          if (queue->io_cpu == raw_smp_processor_id() &&
>              sync && empty && mutex_trylock(&queue->send_mutex)) {
> -                nvme_tcp_send_all(queue);
> +                nvme_tcp_send_all(queue, deadline);
>                  mutex_unlock(&queue->send_mutex);
>          }

Umm, spend up to a millisecond in queue_request? Sounds like way too
much... Did you ever see this deadline exceeded? Sends should be rather
quick...

>
> @@ -959,9 +965,14 @@ static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
>                          nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>                          return result;
>                  }
> +                if (time_after(jiffies, queue->deadline)) {
> +                        desc->count = 0;
> +                        break;
> +                }
> +

That is still not right. You don't want to spend a full deadline reading
from the socket, and then spend a full deadline writing to the socket...
You want the io_work to take a full deadline, with budgets for try_send
and try_recv set to sane counts, say 8 pdus or 64k bytes. We want to get
to some magic value that gives sane behavior, confidently fits inside a
deadline, and is fair.

>          }
>
> -        return consumed;
> +        return consumed - len;
>  }
>
>  static void nvme_tcp_data_ready(struct sock *sk)
>
> @@ -1258,7 +1269,7 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
>          return ret;
>  }
>
> -static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
> +static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue, unsigned long deadline)
>  {
>          struct socket *sock = queue->sock;
>          struct sock *sk = sock->sk;
>
> @@ -1269,6 +1280,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>          rd_desc.count = 1;
>          lock_sock(sk);
>          queue->nr_cqe = 0;
> +        queue->deadline = deadline;
>          consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
>          release_sock(sk);
>          return consumed;
>
> @@ -1278,14 +1290,15 @@ static void nvme_tcp_io_work(struct work_struct *w)
>  {
>          struct nvme_tcp_queue *queue =
>                  container_of(w, struct nvme_tcp_queue, io_work);
> -        unsigned long deadline = jiffies + msecs_to_jiffies(1);
> +        unsigned long tx_deadline = jiffies + msecs_to_jiffies(1);
> +        unsigned long rx_deadline = tx_deadline + msecs_to_jiffies(1), overrun;
>
>          do {
>                  bool pending = false;
>                  int result;
>
>                  if (mutex_trylock(&queue->send_mutex)) {
> -                        result = nvme_tcp_try_send(queue);
> +                        result = nvme_tcp_send_all(queue, tx_deadline);
>                          mutex_unlock(&queue->send_mutex);
>                          if (result > 0)
>                                  pending = true;
>
> @@ -1293,7 +1306,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
>                          break;
>                  }
>
> -                result = nvme_tcp_try_recv(queue);
> +                result = nvme_tcp_try_recv(queue, rx_deadline);

I think you want a more frequent alternation of sends/receives; the
granularity of a 1ms budget may be too coarse?
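In other words, something along these lines (a rough sketch, reusing the
hypothetical nvme_tcp_send_budgeted() from above plus a matching
nvme_tcp_try_recv_budgeted(); the per-round budget of 8 pdus is a number
we'd have to tune):

static void nvme_tcp_io_work(struct work_struct *w)
{
        struct nvme_tcp_queue *queue =
                container_of(w, struct nvme_tcp_queue, io_work);
        unsigned long deadline = jiffies + msecs_to_jiffies(1);

        do {
                bool pending = false;
                int result;

                if (mutex_trylock(&queue->send_mutex)) {
                        /* small tx burst, then give rx a chance */
                        result = nvme_tcp_send_budgeted(queue, 8);
                        mutex_unlock(&queue->send_mutex);
                        if (result > 0)
                                pending = true;
                        else if (unlikely(result < 0))
                                break;
                }

                /* small rx burst, then back to tx */
                result = nvme_tcp_try_recv_budgeted(queue, 8);
                if (result > 0)
                        pending = true;
                else if (unlikely(result < 0))
                        return;

                if (!pending || !queue->rd_enabled)
                        return;

                /* the 1ms deadline bounds the whole loop, not each side */
        } while (!time_after(jiffies, deadline));

        queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}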
>                  if (result > 0)
>                          pending = true;
>                  else if (unlikely(result < 0))
>
> @@ -1302,7 +1315,13 @@ static void nvme_tcp_io_work(struct work_struct *w)
>                  if (!pending || !queue->rd_enabled)
>                          return;
>
> -        } while (!time_after(jiffies, deadline)); /* quota is exhausted */
> +        } while (!time_after(jiffies, rx_deadline)); /* quota is exhausted */
> +
> +        overrun = jiffies - rx_deadline;
> +        if (nvme_tcp_queue_id(queue) > 0 &&
> +            overrun > msecs_to_jiffies(10))
> +                dev_dbg(queue->ctrl->ctrl.device, "queue %d: queue stall (%u msecs)\n",
> +                        nvme_tcp_queue_id(queue), jiffies_to_msecs(overrun));

Umm, ok. Why 10? Why not 2? Or 3? Do you expect io_work to spend more
time executing?

>
>          queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>  }
>
> @@ -2666,6 +2685,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
>  {
>          struct nvme_tcp_queue *queue = hctx->driver_data;
>          struct sock *sk = queue->sock->sk;
> +        unsigned long deadline = jiffies + msecs_to_jiffies(1);
>
>          if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
>                  return 0;
>
>          set_bit(NVME_TCP_Q_POLLING, &queue->flags);
>          if (sk_can_busy_loop(sk) && skb_queue_empty_lockless(&sk->sk_receive_queue))
>                  sk_busy_loop(sk, true);
> -        nvme_tcp_try_recv(queue);
> +        nvme_tcp_try_recv(queue, deadline);

Spend a millisecond in nvme_tcp_poll()? Isn't that too long?
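And for the record: the rx budget doesn't need a new queue member, it
can live in the read_descriptor_t itself. A sketch (untested; assumes
nvme_tcp_recv_skb decrements desc->count by the bytes it consumed, so
read_sock stops once the budget hits zero):

static int nvme_tcp_try_recv_budgeted(struct nvme_tcp_queue *queue,
                unsigned int budget)
{
        struct socket *sock = queue->sock;
        struct sock *sk = sock->sk;
        read_descriptor_t rd_desc;
        int consumed;

        rd_desc.arg.data = queue;
        /* bytes we're willing to consume in this pass */
        rd_desc.count = budget;
        lock_sock(sk);
        queue->nr_cqe = 0;
        consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
        release_sock(sk);
        return consumed;
}

nvme_tcp_poll() could then simply pass a small budget instead of a
deadline.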