Linux-NVME Archive on lore.kernel.org
From: Hannes Reinecke <hare@suse.de>
To: Sagi Grimberg <sagi@grimberg.me>,
	Hannes Reinecke <hare@kernel.org>, Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>, linux-nvme@lists.infradead.org
Subject: Re: [PATCH 1/3] nvme-tcp: improve rx/tx fairness
Date: Mon, 8 Jul 2024 15:21:24 +0200
Message-ID: <a54ac72e-f04f-4bb7-a9de-5a3d3d71d9c6@suse.de>
In-Reply-To: <96bb4107-abc0-4fd3-8e2b-35eafa6a5d4f@grimberg.me>

On 7/8/24 13:57, Sagi Grimberg wrote:
> Hey Hannes, thanks for doing this.
> 
> On 08/07/2024 10:10, Hannes Reinecke wrote:
>> We need to restrict both side, rx and tx, to only run for a certain time
>> to ensure that we're not blocking the other side and induce starvation.
>> So pass in a 'deadline' value to nvme_tcp_send_all()
> 
> Please split the addition of nvme_tcp_send_all() to a separate prep patch.
> 
Okay, no problem.

>>   and nvme_tcp_try_recv()
>> and break out of the loop if the deadline is reached.
> 
> I think we want to limit the rx/tx in pdus/bytes. This will also allow us
> to possibly do burst rx from data-ready.
> 
PDUs are not the best scheduling boundary here, as each PDU can be of a
different size, and the network interface is most definitely limited by
the number of bytes transferred.
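
To illustrate the point about PDU sizes, here is a minimal userspace sketch
(not driver code). The `pdu_queue` structure and both helper names are
hypothetical, and the PDU sizes used with it are only illustrative. It
compares how many bytes go out under a budget counted in PDUs versus a
budget counted in bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a send queue is just an array of PDU sizes in bytes. */
struct pdu_queue {
	const size_t *pdu_len;	/* size of each pending PDU */
	size_t nr_pdus;		/* number of pending PDUs */
};

/* Send at most 'budget' PDUs; returns the number of bytes sent. */
static size_t send_pdu_budget(const struct pdu_queue *q, size_t budget)
{
	size_t bytes = 0;

	for (size_t i = 0; i < q->nr_pdus && i < budget; i++)
		bytes += q->pdu_len[i];
	return bytes;
}

/*
 * Send whole PDUs until at least 'budget' bytes have gone out or the
 * queue is empty; returns the number of bytes sent. The last PDU may
 * overshoot the budget because PDUs are not split in this model.
 */
static size_t send_byte_budget(const struct pdu_queue *q, size_t budget)
{
	size_t bytes = 0;

	assert(budget > 0);
	for (size_t i = 0; i < q->nr_pdus && bytes < budget; i++)
		bytes += q->pdu_len[i];
	return bytes;
}
```

With eight pending PDUs of 72 bytes each, a budget of 8 PDUs moves 576
bytes; with eight PDUs of 128 KiB each, the same budget moves 1 MiB. A
byte budget of 64 KiB stops the second case after the first PDU.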

>>
>> As we now have a timestamp we can also use it to print out a warning
>> if the actual time spent exceeds the deadline.
>>
>> Performance comparison:
>>                 baseline rx/tx fairness
>> 4k seq write:  449MiB/s 480MiB/s
>> 4k rand write: 410MiB/s 481MiB/s
>> 4k seq read:   478MiB/s 481MiB/s
>> 4k rand read:  547MiB/s 480MiB/s
>>
>> Random read is ever so disappointing, but that will be fixed with the 
>> later
>> patches.
> 
> That is a significant decline in relative perf. I'm counting 12.5%...
> Can you explain why that is?
> 
Not really :-(
But then fairness cuts both ways; so I am not surprised that some 
workloads suffer here.

> How does this look for multiple controllers?
> 
Haven't really checked (yet); it would make a rather weak case if
we killed performance just to scale better ...

> 
> 
>>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>> ---
>>   drivers/nvme/host/tcp.c | 38 +++++++++++++++++++++++++++++---------
>>   1 file changed, 29 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 0873b3949355..f621d3ba89b2 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -153,6 +153,7 @@ struct nvme_tcp_queue {
>>       size_t            data_remaining;
>>       size_t            ddgst_remaining;
>>       unsigned int        nr_cqe;
>> +    unsigned long        deadline;
> 
> I don't see why you need to keep this in the queue struct. You could have
> easily initialize it in the read_descriptor_t and test against it.
> 
Because I wanted to interrupt the receive side for large data transfers, 
and haven't found a way to pass the deadline into ->read_sock().

>>       /* send state */
>>       struct nvme_tcp_request *request;
>> @@ -359,14 +360,18 @@ static inline void nvme_tcp_advance_req(struct 
>> nvme_tcp_request *req,
>>       }
>>   }
>> -static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
>> +static inline int nvme_tcp_send_all(struct nvme_tcp_queue *queue,
>> +                    unsigned long deadline)
>>   {
>>       int ret;
>>       /* drain the send queue as much as we can... */
>>       do {
>>           ret = nvme_tcp_try_send(queue);
>> +        if (time_after(jiffies, deadline))
>> +            break;
>>       } while (ret > 0);
>> +    return ret;
> 
> I think you want a different interface, nvme_tcp_send_budgeted(queue, 
> budget).
> I don't know what you pass here, but jiffies is a rather large 
> granularity...
> 
Hmm. I've been using jiffies as the io_work loop had been counting
in jiffies. But let me check what would happen if I move that over to
PDU / size counting.

>>   }
>>   static inline bool nvme_tcp_queue_has_pending(struct nvme_tcp_queue 
>> *queue)
>> @@ -385,6 +390,7 @@ static inline void nvme_tcp_queue_request(struct 
>> nvme_tcp_request *req,
>>           bool sync, bool last)
>>   {
>>       struct nvme_tcp_queue *queue = req->queue;
>> +    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>       bool empty;
>>       empty = llist_add(&req->lentry, &queue->req_list) &&
>> @@ -397,7 +403,7 @@ static inline void nvme_tcp_queue_request(struct 
>> nvme_tcp_request *req,
>>        */
>>       if (queue->io_cpu == raw_smp_processor_id() &&
>>           sync && empty && mutex_trylock(&queue->send_mutex)) {
>> -        nvme_tcp_send_all(queue);
>> +        nvme_tcp_send_all(queue, deadline);
>>           mutex_unlock(&queue->send_mutex);
>>       }
> 
> Umm, spend up to a millisecond in in queue_request ? Sounds like way too 
> much...
> Did you ever see this deadline exceeded? sends should be rather quick...
> 
Of _course_ it's too long. That's kinda the point.
But I'm seeing network latency up to 4000 msecs (!) on my test setup,
so _that_ is the least of my worries ...

>> @@ -959,9 +965,14 @@ static int nvme_tcp_recv_skb(read_descriptor_t 
>> *desc, struct sk_buff *skb,
>>               nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>>               return result;
>>           }
>> +        if (time_after(jiffies, queue->deadline)) {
>> +            desc->count = 0;
>> +            break;
>> +        }
>> +
> 
> That is still not right.
> You don't want to spend a full deadline reading from the socket, and 
> then spend a full deadline writing to the socket...
> 
Yes, and no.
Problem is that the current code serializes writes and reads.
Essentially for each iteration we first do a write, and then a read.
If we spend the full deadline on write we will need to reschedule,
but we then _again_ start with writes. This leads to a heavy preference
for writing, and a negative performance impact.
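
The ordering argument can be shown with a toy model. This is a sketch
only: the driver does not account time this way, and `io_share`,
`run_shared()` and `run_split()` are hypothetical names. Time is counted
in arbitrary ticks, and the send side is always served first.

```c
#include <assert.h>

/* Ticks of work actually performed by each side in one invocation. */
struct io_share {
	unsigned int tx;
	unsigned int rx;
};

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* One shared budget, send side served first: rx only gets the remainder. */
static struct io_share run_shared(unsigned int budget,
				  unsigned int tx_demand,
				  unsigned int rx_demand)
{
	struct io_share s;

	s.tx = min_u(tx_demand, budget);
	s.rx = min_u(rx_demand, budget - s.tx);
	assert(s.tx + s.rx <= budget);
	return s;
}

/* Separate budgets: a busy send side cannot eat the receive slice. */
static struct io_share run_split(unsigned int tx_budget,
				 unsigned int rx_budget,
				 unsigned int tx_demand,
				 unsigned int rx_demand)
{
	struct io_share s;

	s.tx = min_u(tx_demand, tx_budget);
	s.rx = min_u(rx_demand, rx_budget);
	assert(s.tx <= tx_budget && s.rx <= rx_budget);
	return s;
}
```

When both sides have more work than fits, the shared budget gives all of
its ticks to tx on every invocation, while the split budgets give each
side its own slice.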

> You want the io_work to take a full deadline, and send budgets of 
> try_send and try_recv. And set it to sane counts. Say 8 pdus, or
> 64k bytes. We want to get to some magic value that presents a
> sane behavior, that confidently fits inside a deadline, and is fair.
> 
Easier said than done.
Biggest problem is that most of the latency increase comes from the
actual 'sendmsg()' and 'read_sock()' calls.
And the only way of inhibiting that would be to check _prior_ whether
we can issue the call in the first place.
(That's why I did the SOCK_NOSPACE tests in the previous patchsets).

>>       }
>> -    return consumed;
>> +    return consumed - len;
>>   }
>>   static void nvme_tcp_data_ready(struct sock *sk)
>> @@ -1258,7 +1269,7 @@ static int nvme_tcp_try_send(struct 
>> nvme_tcp_queue *queue)
>>       return ret;
>>   }
>> -static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>> +static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue, unsigned 
>> long deadline)
>>   {
>>       struct socket *sock = queue->sock;
>>       struct sock *sk = sock->sk;
>> @@ -1269,6 +1280,7 @@ static int nvme_tcp_try_recv(struct 
>> nvme_tcp_queue *queue)
>>       rd_desc.count = 1;
>>       lock_sock(sk);
>>       queue->nr_cqe = 0;
>> +    queue->deadline = deadline;
>>       consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
>>       release_sock(sk);
>>       return consumed;
>> @@ -1278,14 +1290,15 @@ static void nvme_tcp_io_work(struct 
>> work_struct *w)
>>   {
>>       struct nvme_tcp_queue *queue =
>>           container_of(w, struct nvme_tcp_queue, io_work);
>> -    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>> +    unsigned long tx_deadline = jiffies + msecs_to_jiffies(1);
>> +    unsigned long rx_deadline = tx_deadline + msecs_to_jiffies(1), 
>> overrun;
>>       do {
>>           bool pending = false;
>>           int result;
>>           if (mutex_trylock(&queue->send_mutex)) {
>> -            result = nvme_tcp_try_send(queue);
>> +            result = nvme_tcp_send_all(queue, tx_deadline);
>>               mutex_unlock(&queue->send_mutex);
>>               if (result > 0)
>>                   pending = true;
>> @@ -1293,7 +1306,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>                   break;
>>           }
>> -        result = nvme_tcp_try_recv(queue);
>> +        result = nvme_tcp_try_recv(queue, rx_deadline);
> 
> I think you want a more frequent substitution of sends/receives. the 
> granularity of 1ms budget may be too coarse?
> 
The problem is not the granularity, the problem is the latency spikes
I'm seeing when issuing 'sendmsg' or 'read_sock'.

>>           if (result > 0)
>>               pending = true;
>>           else if (unlikely(result < 0))
>> @@ -1302,7 +1315,13 @@ static void nvme_tcp_io_work(struct work_struct 
>> *w)
>>           if (!pending || !queue->rd_enabled)
>>               return;
>> -    } while (!time_after(jiffies, deadline)); /* quota is exhausted */
>> +    } while (!time_after(jiffies, rx_deadline)); /* quota is 
>> exhausted */
>> +
>> +    overrun = jiffies - rx_deadline;
>> +    if (nvme_tcp_queue_id(queue) > 0 &&
>> +        overrun > msecs_to_jiffies(10))
>> +        dev_dbg(queue->ctrl->ctrl.device, "queue %d: queue stall (%u 
>> msecs)\n",
>> +            nvme_tcp_queue_id(queue), jiffies_to_msecs(overrun));
> 
> Umm, ok. why 10? why not 2? or 3?
> Do you expect io_work to spend more time executing?
> 
E.g. 4000 msecs like on my testbed?
Yes.

>>       queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>   }
>> @@ -2666,6 +2685,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx 
>> *hctx, struct io_comp_batch *iob)
>>   {
>>       struct nvme_tcp_queue *queue = hctx->driver_data;
>>       struct sock *sk = queue->sock->sk;
>> +    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>       if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
>>           return 0;
>> @@ -2673,7 +2693,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx 
>> *hctx, struct io_comp_batch *iob)
>>       set_bit(NVME_TCP_Q_POLLING, &queue->flags);
>>       if (sk_can_busy_loop(sk) && 
>> skb_queue_empty_lockless(&sk->sk_receive_queue))
>>           sk_busy_loop(sk, true);
>> -    nvme_tcp_try_recv(queue);
>> +    nvme_tcp_try_recv(queue, deadline);
> 
> spend a millisecond in nvme_tcp_poll() ??
> Isn't it too long?
> 
Haven't tried with polling, so I can't honestly answer.

In the end, it all boils down to numbers.
I have 2 controllers with 32 queues and 256 requests.
Running on a 10GigE link one should be getting a net throughput
of 1GB/s, or, with 4k requests, 256k IOPS.
Or a latency of 0.25 milliseconds per command.
In the optimal case. As I'm seeing a bandwidth of around
500MB/s, I'm looking at a latency of 0.5 milliseconds _in the ideal case_.

So no, I don't think 1 millisecond is too long.
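
One reading that reproduces the figures above is sketched here. It rests
on assumptions not spelled out in this mail: the throughput is spread
evenly over all 2 x 32 = 64 queues, the per-command latency is per queue,
and sizes are in binary units (1GB/s taken as 1 GiB/s, 4k as 4096 bytes).
The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Time between commands on a single queue, in microseconds, when the
 * total throughput is spread evenly across 'nr_queues' queues.
 * Integer division truncates the result.
 */
static uint64_t per_queue_latency_us(uint64_t bytes_per_sec,
				     uint64_t req_bytes,
				     uint64_t nr_queues)
{
	uint64_t iops, iops_per_queue;

	assert(req_bytes > 0 && nr_queues > 0);
	iops = bytes_per_sec / req_bytes;	/* commands per second overall */
	iops_per_queue = iops / nr_queues;	/* commands per second per queue */
	assert(iops_per_queue > 0);
	return 1000000 / iops_per_queue;
}
```

Under these assumptions, 1 GiB/s yields 262144 IOPS (the 256k above) and
244 microseconds per command per queue, roughly 0.25 milliseconds. At
512 MiB/s the result is 488 microseconds, roughly 0.5 milliseconds.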

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich




