From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mga01.intel.com (mga01.intel.com [192.55.52.88]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 44DEE2C9F for ; Wed, 27 Oct 2021 00:47:40 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10149"; a="253592369" X-IronPort-AV: E=Sophos;i="5.87,184,1631602800"; d="scan'208";a="253592369" Received: from fmsmga001.fm.intel.com ([10.253.24.23]) by fmsmga101.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Oct 2021 17:47:39 -0700 X-IronPort-AV: E=Sophos;i="5.87,184,1631602800"; d="scan'208";a="635468241" Received: from gacramer-mobl2.amr.corp.intel.com ([10.209.53.40]) by fmsmga001-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Oct 2021 17:47:39 -0700 Date: Tue, 26 Oct 2021 17:47:38 -0700 (PDT) From: Mat Martineau To: Paolo Abeni cc: mptcp Subject: Re: [PATCH mptcp-next] mptcp: enforce HoL-blocking estimation In-Reply-To: Message-ID: <414d1018-32fb-e080-683f-8e5ed461c1a@linux.intel.com> References: <22cda018a37459d99683e572e35ac61bbc43fcae.1634900440.git.pabeni@redhat.com> <6e3468c2-8e61-a05b-8cd1-4656e9217fcc@linux.intel.com> <049cc3893767d00b4511aafb8acb191dea7aaca8.camel@redhat.com> <88871581-6815-174-a75c-22618eb7c475@linux.intel.com> Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed On Tue, 26 Oct 2021, Paolo Abeni wrote: > On Mon, 2021-10-25 at 10:17 -0700, Mat Martineau wrote: >> On Mon, 25 Oct 2021, Paolo Abeni wrote: >> >>> On Fri, 2021-10-22 at 18:19 -0700, Mat Martineau wrote: >>>> On Fri, 22 Oct 2021, Paolo Abeni wrote: >>>> >>>>> The MPTCP packet scheduler has sub-optimal behavior with asymmetric >>>>> subflows: if the faster subflow-level cwin is closed, the packet >>>>> scheduler can enqueue "too much" data on a slower subflow. >>>>> >>>>> When all the data on the faster subflow is acked, if the mptcp-level >>>>> cwin is closed, and link utilization becomes suboptimal. >>>>> >>>>> The solution is implementing blest-like[1] HoL-blocking estimation, >>>>> transmitting only on the subflow with the shorter estimated time to >>>>> flush the queued memory. If such subflows cwin is closed, we wait >>>>> even if other subflows are available. >>>>> >>>>> This is quite simpler than the original blest implementation, as we >>>>> leverage the pacing rate provided by the TCP socket. To get a more >>>>> accurate estimation for the subflow linger-time, we maintain a >>>>> per-subflow weighted average of such info. >>>>> >>>>> Additionally drop magic numbers usage in favor of newly defined >>>>> macros. >>>>> >>>>> [1] http://dl.ifip.org/db/conf/networking/networking2016/1570234725.pdf >>>>> >>>>> Signed-off-by: Paolo Abeni >>>>> --- >>>>> notes: >>>>> - this apparently solves for good issue/137, with > 200 iterations with >>>>> no failures >>>>> - still to be investigated the impact on high-speed links, if any >>>>> --- >>>>> net/mptcp/protocol.c | 58 ++++++++++++++++++++++++++++++-------------- >>>>> net/mptcp/protocol.h | 1 + >>>>> 2 files changed, 41 insertions(+), 18 deletions(-) >>>>> >>>>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c >>>>> index 7803b0dbb1be..cc9d32cb7bc7 100644 >>>>> --- a/net/mptcp/protocol.c >>>>> +++ b/net/mptcp/protocol.c >>>>> @@ -1395,20 +1395,24 @@ bool mptcp_subflow_active(struct mptcp_subflow_context *subflow) >>>>> return __mptcp_subflow_active(subflow); >>>>> } >>>>> >>>>> +#define SSK_MODE_ACTIVE 0 >>>>> +#define SSK_MODE_BACKUP 1 >>>>> +#define SSK_MODE_MAX 2 >>>>> + >>>>> /* implement the mptcp packet scheduler; >>>>> * returns the subflow that will transmit the next DSS >>>>> * additionally updates the rtx timeout >>>>> */ >>>>> static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk) >>>>> { >>>>> - struct subflow_send_info send_info[2]; >>>>> + struct subflow_send_info send_info[SSK_MODE_MAX]; >>>>> struct mptcp_subflow_context *subflow; >>>>> struct sock *sk = (struct sock *)msk; >>>>> + u32 pace, burst, wmem; >>>>> int i, nr_active = 0; >>>>> struct sock *ssk; >>>>> long tout = 0; >>>>> u64 ratio; >>>>> - u32 pace; >>>>> >>>>> sock_owned_by_me(sk); >>>>> >>>>> @@ -1427,10 +1431,11 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk) >>>>> } >>>>> >>>>> /* pick the subflow with the lower wmem/wspace ratio */ >>>>> - for (i = 0; i < 2; ++i) { >>>>> + for (i = 0; i < SSK_MODE_MAX; ++i) { >>>>> send_info[i].ssk = NULL; >>>>> send_info[i].ratio = -1; >>>>> } >>>>> + >>>>> mptcp_for_each_subflow(msk, subflow) { >>>>> trace_mptcp_subflow_get_send(subflow); >>>>> ssk = mptcp_subflow_tcp_sock(subflow); >>>>> @@ -1439,12 +1444,13 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk) >>>>> >>>>> tout = max(tout, mptcp_timeout_from_subflow(subflow)); >>>>> nr_active += !subflow->backup; >>>>> - if (!sk_stream_memory_free(subflow->tcp_sock) || !tcp_sk(ssk)->snd_wnd) >>>>> - continue; >>>>> - >>>>> - pace = READ_ONCE(ssk->sk_pacing_rate); >>>>> - if (!pace) >>>>> - continue; >>>>> + pace = subflow->avg_pacing_rate; >>>>> + if (unlikely(!pace)) { >>>>> + /* init pacing rate from socket */ >>>>> + pace = subflow->avg_pacing_rate = READ_ONCE(ssk->sk_pacing_rate); >>>> >>>> Checkpatch complains about the double assignment. I did do a double-take >>>> to confirm that it didn't look like a '==' typo, and slightly lean toward >>>> keeping it checkpatch-clean. >>> >>> Yep, I'll fix the above in v2. >>> >>>>> + if (!pace) >>>>> + continue; >>>>> + } >>>>> >>>>> ratio = div_u64((u64)READ_ONCE(ssk->sk_wmem_queued) << 32, >>>>> pace); >>>>> @@ -1457,16 +1463,32 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk) >>>>> >>>>> /* pick the best backup if no other subflow is active */ >>>>> if (!nr_active) >>>>> - send_info[0].ssk = send_info[1].ssk; >>>>> - >>>>> - if (send_info[0].ssk) { >>>>> - msk->last_snd = send_info[0].ssk; >>>>> - msk->snd_burst = min_t(int, MPTCP_SEND_BURST_SIZE, >>>>> - tcp_sk(msk->last_snd)->snd_wnd); >>>>> - return msk->last_snd; >>>>> - } >>>>> + send_info[SSK_MODE_ACTIVE].ssk = send_info[SSK_MODE_BACKUP].ssk; >>>>> + >>>>> + /* According to the blest algorithm, to avoid HoL blocking for the >>>>> + * faster flow, we need to: >>>>> + * - estimate the faster flow linger time >>>>> + * - use the above to estimate the amount of byte transferred >>>>> + * by the faster flow >>>>> + * - check that the amount of queued data is greter than the above, >>>>> + * otherwise do not use the picked, slower, subflow >>>>> + * We select the subflow with the shorter estimated time to flush >>>>> + * the queued mem, which basically ensure the above. We just need >>>>> + * to check that subflow has a non empty cwin. >>>>> + */ >>>>> + ssk = send_info[SSK_MODE_ACTIVE].ssk; >>>>> + if (!ssk || !sk_stream_memory_free(ssk) || !tcp_sk(ssk)->snd_wnd) >>>>> + return NULL; >>>>> >>>>> - return NULL; >>>>> + burst = min_t(int, MPTCP_SEND_BURST_SIZE, tcp_sk(ssk)->snd_wnd); >>>>> + wmem = READ_ONCE(ssk->sk_wmem_queued); >>>>> + subflow = mptcp_subflow_ctx(ssk); >>>>> + subflow->avg_pacing_rate = div_u64((u64)subflow->avg_pacing_rate * wmem + >>>>> + READ_ONCE(ssk->sk_pacing_rate) * burst, >>>>> + burst + wmem); >>>> >>>> I see how this gives a smoothed-out version of the pacing rate, using >>>> avg_pacing_rate for the bytes already sent on this subflow (wmem) and >>>> ssk->sk_pacing_rate for the bytes that will probably be sent now. >>>> >>>> Is this a *better* estimation of the flow speed than the plain >>>> ssk->sk_pacing_rate? >>> >>> It looks like it is. I get a consistent (higher) failure rate using >>> plain sk_pacing_rate instead. 'sk_pacing_rate' is an 'instant' value >>> applied to the current pkt, can change in a relevant way in a little >>> amount of time. >>> >> >> Ok, sk_pacing_rate is even noisier than I realized. My main concern here >> is that whatever filtering we do on that noisy signal handles a variety of >> conditions well. This includes cases like a failing wireless signal, where >> the current and future throughput might be far lower than the average >> pacing rate would indicate - and in that case the filtering algorithm >> needs to respond appropriately. >> >> My first job was in test & measurement, where a lot of thought goes in to >> taking noisy A/D converter data and turning it in to a "good" measurement >> value. When using a weighted average like this, it's common to throw away >> the average if the new value is a significant change (like "X%" higher or >> lower). > > I think I do understand. > > Please note that here there are some difference WRT sampling more or > less noisy values: the stack enforces that the current sampled value is > applied to current packet and we record the pacing for each xmitted > chunk of data. > > Using the current pacing_rate to estimate the time to flush the ssk > xmit queue is subject to relevant errors, as it applies the "wrong" > pace to previous samples. > > The computed time could differ in a significant way from the real one > only if the TCP stack will produce multiple GSO segments for a single > burst, using very different pacing rate for the 'next ones'. > > Since the max burst size is below the max gso size, AFAICS the above > could happen only with GSO disabled, when performances are expected to > be bad. Additionally, throwing away the average on sudden pacing value > changes will not protect from the above scenario: a delta will be > caused by _future_ sudden change of the pacing rate. > > Not sure if the above is somewhat clear ;) > Yeah, I get what you're saying. Appreciate the explanation since you know the underlying pacing implementation better than I do. I'm still working on gathering some raw data to better understand what sk_pacing_rate, the values used to calculate sk_pacing_rate, and the weighted average look like over time. I want to have more confidence that the scheduling changes work well in a variety of conditions. Would you say the slow start / ca ratios in tcp_update_pacing_rate() are interfering with subflow selection at all? It looks like the various tcp_sock members that are used to generate sk_pacing_rate could be useful, but they require holding the subflow socket lock to read them :( > > p.s. just to move the thing forward I'll send a v2 with the > uncontroversial changes applied... Ok, thanks for sending. -- Mat Martineau Intel