From: Hannes Reinecke
Date: Mon, 8 Jul 2024 17:50:31 +0200
Subject: Re: [PATCH 1/3] nvme-tcp: improve rx/tx fairness
To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig
Cc: Keith Busch, linux-nvme@lists.infradead.org
Message-ID: <7d52a56b-2311-4f64-a0f4-383c5521e91c@suse.de>
In-Reply-To: <5334b2dd-23ab-4637-a762-e871f884db54@grimberg.me>
References: <20240708071013.69984-1-hare@kernel.org>
 <20240708071013.69984-2-hare@kernel.org>
 <96bb4107-abc0-4fd3-8e2b-35eafa6a5d4f@grimberg.me>
 <5334b2dd-23ab-4637-a762-e871f884db54@grimberg.me>

On 7/8/24 16:25, Sagi Grimberg wrote:
>
> On 08/07/2024 16:21, Hannes Reinecke wrote:
>> On 7/8/24 13:57, Sagi Grimberg wrote:
>>> Hey Hannes, thanks for doing this.
>>> [ .. ]
>>> That is a significant decline in relative perf. I'm counting 12.5%...
>>> Can you explain why that is?
>>>
>> Not really :-(
>> But then fairness cuts both ways; so I am not surprised that some
>> workloads suffer here.
>
> Well, 12.5% is rather steep, don't you agree?
> I'm even more surprised that seq/rand read make a difference...
>
Yeah; but I guess I found the issue now (we should interrupt data
transfers if the deadline expires, not just PDU transfers).

[ .. ]
>>>
>>> I don't see why you need to keep this in the queue struct. You could
>>> have easily initialized it in the read_descriptor_t and tested
>>> against it.
>>>
>> Because I wanted to interrupt the receive side for large data
>> transfers, and haven't found a way to pass the deadline into
>> ->read_sock().
>
> desc.count is for you to interpret however you want it to be, no?
>
Except for the magic 'desc.count == 0' value; but yeah, that should work.

>>
>>>>       /* send state */
>>>>       struct nvme_tcp_request *request;
>>>> @@ -359,14 +360,18 @@ static inline void nvme_tcp_advance_req(struct
>>>> nvme_tcp_request *req,
>>>>       }
>>>>   }
>>>> -static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
>>>> +static inline int nvme_tcp_send_all(struct nvme_tcp_queue *queue,
>>>> +                    unsigned long deadline)
>>>>   {
>>>>       int ret;
>>>>       /* drain the send queue as much as we can... */
>>>>       do {
>>>>           ret = nvme_tcp_try_send(queue);
>>>> +        if (time_after(jiffies, deadline))
>>>> +            break;
>>>>       } while (ret > 0);
>>>> +    return ret;
>>>
>>> I think you want a different interface: nvme_tcp_send_budgeted(queue,
>>> budget).
>>> I don't know what you pass here, but jiffies is a rather large
>>> granularity...
>>>
>> Hmm. I've been using jiffies as the io_work loop had been counting
>> in jiffies. But let me check what would happen if I move that over to
>> PDU / size counting.
>
> Cool.
>
>>
>>>>   }
>>>>   static inline bool nvme_tcp_queue_has_pending(struct
>>>> nvme_tcp_queue *queue)
>>>> @@ -385,6 +390,7 @@ static inline void nvme_tcp_queue_request(struct
>>>> nvme_tcp_request *req,
>>>>           bool sync, bool last)
>>>>   {
>>>>       struct nvme_tcp_queue *queue = req->queue;
>>>> +    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>>>       bool empty;
>>>>       empty = llist_add(&req->lentry, &queue->req_list) &&
>>>> @@ -397,7 +403,7 @@ static inline void nvme_tcp_queue_request(struct
>>>> nvme_tcp_request *req,
>>>>        */
>>>>       if (queue->io_cpu == raw_smp_processor_id() &&
>>>>           sync && empty && mutex_trylock(&queue->send_mutex)) {
>>>> -        nvme_tcp_send_all(queue);
>>>> +        nvme_tcp_send_all(queue, deadline);
>>>>           mutex_unlock(&queue->send_mutex);
>>>>       }
>>>
>>> Umm, spend up to a millisecond in queue_request? Sounds like way
>>> too much...
>>> Did you ever see this deadline exceeded? Sends should be rather
>>> quick...
>>>
>> Of _course_ it's too long. That's kinda the point.
>> But I'm seeing network latency up to 4000 msecs (!) on my test setup,
>> so _that_ is the least of my worries ...
>
> This has nothing to do with the network latency afaict. This is just a
> matter of how long it takes to write to the socket buffer (or, in our
> case, to construct skbs and add them to the socket, as TX is zero-copy).
>
Not the transfer itself? I would have thought so ...
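
Incidentally, for the nvme_tcp_send_budgeted() interface you suggested
above, this is roughly what I have in mind for PDU counting. Untested
sketch; it relies on nvme_tcp_try_send() returning > 0 whenever it made
progress (as in the current loop), and the budget value is arbitrary:

static inline int nvme_tcp_send_budgeted(struct nvme_tcp_queue *queue,
		unsigned int budget)
{
	int ret;

	/* drain the send queue, but give up after 'budget' PDUs */
	do {
		ret = nvme_tcp_try_send(queue);
	} while (ret > 0 && --budget > 0);
	return ret;
}

The caller would then pass a count that comfortably fits inside the
io_work deadline (say the 8 PDUs you suggested) instead of checking
jiffies on every iteration.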
>>
>>>> @@ -959,9 +965,14 @@ static int nvme_tcp_recv_skb(read_descriptor_t
>>>> *desc, struct sk_buff *skb,
>>>>               nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>>>>               return result;
>>>>           }
>>>> +        if (time_after(jiffies, queue->deadline)) {
>>>> +            desc->count = 0;
>>>> +            break;
>>>> +        }
>>>> +
>>>
>>> That is still not right.
>>> You don't want to spend a full deadline reading from the socket, and
>>> then spend a full deadline writing to the socket...
>>>
>> Yes, and no.
>> Problem is that the current code serializes writes and reads.
>> Essentially for each iteration we first do a write, and then a read.
>> If we spend the full deadline on writes we will need to reschedule,
>> but we then _again_ start with writes. This leads to a heavy
>> preference for writing, and a negative performance impact.
>
> Yes, the assumption we should take is that the tx/rx limits allow
> switching between the two at least once or twice in the io_work while
> loop, and that they get a fair'ish share of the quota.
>
>>
>>> You want the io_work to take a full deadline, and send budgets of
>>> try_send and try_recv. And set them to sane counts, say 8 PDUs or
>>> 64k bytes. We want to get to some magic value that presents a
>>> sane behavior, that confidently fits inside a deadline, and is fair.
>>>
>> Easier said than done.
>> Biggest problem is that most of the latency increase comes from the
>> actual 'sendmsg()' and 'read_sock()' calls.
>
> This is a bit surprising, because read_sock reads data that has
> already been processed by the TCP stack and is ready to simply be
> copied back to the user pages. Same for sendmsg: the socket is
> non-blocking, so I would not expect it to stall for unusually long
> periods. Maybe there is an issue hiding in how we call send/recv?
>
Hmm. That could be.

>> And the only way of inhibiting that would be to check _prior_ whether
>> we can issue the call in the first place.
>> (That's why I did the SOCK_NOSPACE tests in the previous patchsets.)
>
> The socket is non-blocking. I always thought that SOCK_NOSPACE is
> serving blocking sockets... afaik, there shouldn't be a real
> difference between checking NOSPACE vs. attempting sendmsg on a
> non-blocking socket... But I could be wrong here..
>
Weellll ... if 'SOCK_NOSPACE' is for blocking sockets, why do we even
get the 'write_space()' callback? It gets triggered quite often, and
checking for the SOCK_NOSPACE bit before sending drops the number of
invocations quite significantly.

[ .. ]
>>>>                   break;
>>>>           }
>>>> -        result = nvme_tcp_try_recv(queue);
>>>> +        result = nvme_tcp_try_recv(queue, rx_deadline);
>>>
>>> I think you want a more frequent substitution of sends/receives. The
>>> granularity of a 1ms budget may be too coarse?
>>>
>> The problem is not the granularity, the problem is the latency spikes
>> I'm seeing when issuing 'sendmsg' or 'read_sock'.
>
> I think it's possible that there is another underlying issue, because
> these calls should NOT be blocked for long periods of time on
> non-blocking sockets.
>
I did some instrumentation, and that certainly pointed in that
direction. But I will have to redo it to get some meaningful data here.

Another point really might be the granularity; jiffies isn't a good
measurement here (even with HZ=1000 we'll only have millisecond
granularity, so it might be that we're incurring fluctuations from the
jiffy calculation itself).
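
As an untested sketch of what I mean: the io_work deadline could be
tracked via ktime instead of jiffies, giving nanosecond resolution. The
1ms budget and the overall loop structure are kept from the current
code, with the actual send/recv calls elided:

static void nvme_tcp_io_work(struct work_struct *w)
{
	struct nvme_tcp_queue *queue =
		container_of(w, struct nvme_tcp_queue, io_work);
	/* 1ms budget, tracked with ns resolution instead of jiffies */
	ktime_t deadline = ktime_add_ms(ktime_get(), 1);
	bool pending;

	do {
		pending = false;
		/* ... try_send / try_recv as before, setting 'pending' ... */
		if (!pending || !queue->rd_enabled)
			return;
	} while (ktime_before(ktime_get(), deadline));

	/* quota exhausted, reschedule */
	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}

ktime_get() is more expensive than reading jiffies, but at one call per
loop iteration that should be lost in the noise compared to sendmsg()
and read_sock().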
>>
>>>>           if (result > 0)
>>>>               pending = true;
>>>>           else if (unlikely(result < 0))
>>>> @@ -1302,7 +1315,13 @@ static void nvme_tcp_io_work(struct
>>>> work_struct *w)
>>>>           if (!pending || !queue->rd_enabled)
>>>>               return;
>>>> -    } while (!time_after(jiffies, deadline)); /* quota is exhausted */
>>>> +    } while (!time_after(jiffies, rx_deadline)); /* quota is
>>>> exhausted */
>>>> +
>>>> +    overrun = jiffies - rx_deadline;
>>>> +    if (nvme_tcp_queue_id(queue) > 0 &&
>>>> +        overrun > msecs_to_jiffies(10))
>>>> +        dev_dbg(queue->ctrl->ctrl.device, "queue %d: queue stall
>>>> (%u msecs)\n",
>>>> +            nvme_tcp_queue_id(queue), jiffies_to_msecs(overrun));
>>>
>>> Umm, ok. Why 10? Why not 2, or 3?
>>> Do you expect io_work to spend more time executing?
>>>
>> E.g. 4000 msecs, like on my testbed?
>
> If you see ANYTHING in the IO path stalling for 4 seconds then we have
> a bigger issue here.
>
You don't say ...

>> Yes.
>>
>>>>       queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>>>   }
>>>> @@ -2666,6 +2685,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx
>>>> *hctx, struct io_comp_batch *iob)
>>>>   {
>>>>       struct nvme_tcp_queue *queue = hctx->driver_data;
>>>>       struct sock *sk = queue->sock->sk;
>>>> +    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>>>       if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
>>>>           return 0;
>>>> @@ -2673,7 +2693,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx
>>>> *hctx, struct io_comp_batch *iob)
>>>>       set_bit(NVME_TCP_Q_POLLING, &queue->flags);
>>>>       if (sk_can_busy_loop(sk) &&
>>>> skb_queue_empty_lockless(&sk->sk_receive_queue))
>>>>           sk_busy_loop(sk, true);
>>>> -    nvme_tcp_try_recv(queue);
>>>> +    nvme_tcp_try_recv(queue, deadline);
>>>
>>> Spend a millisecond in nvme_tcp_poll()??
>>> Isn't that too long?
>> Haven't tried with polling, so can't honestly answer.
>>
>> In the end, it all boils down to numbers.
>> I have 2 controllers with 32 queues each and 256 requests per queue.
>> Running on a 10GigE link one should be getting a net throughput of
>> 1GB/s, or, with 4k requests, 256k IOPS. Spread over the 64 queues
>> that is 4k IOPS per queue, or a latency of 0.25 milliseconds per
>> command. In the optimal case. As I'm seeing a bandwidth of around
>> 500MB/s, I'm looking at a latency of 0.5 milliseconds _in the ideal
>> case_.
>>
>> So no, I don't think 1 millisecond is too long.
>
> nvme_tcp_poll is NOT designed to poll for the entirety of an
> individual IO's lifetime. It's a non-selective polling interface wired
> to io_uring. It should peek the socket, consume whatever is there, and
> break. It most definitely should not spend 1ms in this check...
>
Oh, I know that. That's why I said: "Haven't tried with polling."
The calculation is independent of polling.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                   Kernel Storage Architect
hare@suse.de                                 +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich