Re: [PATCH v2 mptcp-next] mptcp: enforce HoL-blocking estimation

From: Mat Martineau <mathew.j.martineau@linux.intel.com>
To: Paolo Abeni <pabeni@redhat.com>
Cc: mptcp@lists.linux.dev
Subject: Re: [PATCH v2 mptcp-next] mptcp: enforce HoL-blocking estimation
Date: Tue, 2 Nov 2021 18:07:46 -0700 (PDT)
Message-ID: <a7127287-2a50-cd58-e82-4022ed399ad@linux.intel.com>
In-Reply-To: <8a4189c20285e73f1b4f51d4e442477003f40d1f.camel@redhat.com>

On Tue, 2 Nov 2021, Paolo Abeni wrote:

> Hello,
>
> On Tue, 2021-10-26 at 11:42 +0200, Paolo Abeni wrote:
>> The MPTCP packet scheduler has sub-optimal behavior with asymmetric
>> subflows: if the faster subflow-level cwin is closed, the packet
>> scheduler can enqueue "too much" data on a slower subflow.
>>
>> When all the data on the faster subflow is acked, if the mptcp-level
>> cwin is closed, link utilization becomes suboptimal.
>>
>> The solution is implementing blest-like[1] HoL-blocking estimation,
>> transmitting only on the subflow with the shorter estimated time to
>> flush the queued memory. If that subflow's cwin is closed, we wait
>> even if other subflows are available.
>>
>> This is much simpler than the original blest implementation, as we
>> leverage the pacing rate provided by the TCP socket. To get a more
>> accurate estimation for the subflow linger-time, we maintain a
>> per-subflow weighted average of such info.
>>
>> Additionally drop magic numbers usage in favor of newly defined
>> macros and use more meaningful names for status variable.
>>
>> [1] http://dl.ifip.org/db/conf/networking/networking2016/1570234725.pdf
>>
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> ---
>> v1 -> v2:
>>  - fix checkpatch issue (mat)
>>  - rename ratio as linger_time (mat)
>
> As this patch empirically improves the situation quite a bit, what
> about applying it on the export branch, and staging it for a bit?
>
> If we find issues or improvements we can revert/drop/update. Otherwise
> we can keep it. Meanwhile it will get a better testing...
>
> WDYT?

Hi Paolo -

I finally captured some data (sorry about the delay) to see what was 
going on with the pacing rate. I added a pr_debug() in 
mptcp_subflow_get_send() to dump the following:

timestamp
socket ptr
sk_pacing_rate
sk_pacing_shift
mss_cache
snd_cwnd
packets_out
srtt_us

I chose those values based on what is used to calculate sk_pacing_rate - I 
should have also included snd_wnd and sk_wmem_queued, since those are used 
in the weighted average in this patch. But I ran out of time to update 
everything today and still get this reply written.

The debug output didn't appear to interfere much with the simult_flows.sh 
tests; they still completed in the normal time (except for one that ran 
just a little long).

I observed some patterns with the output I captured during a 
simult_flows.sh run:

1. When snd_cwnd and srtt_us were small (when data had not been sent yet 
or there was a time gap), sk_pacing_rate had very large spikes, often 10x 
higher or more.

2. When packets were streaming out, sk_pacing_rate was quite consistent.

Some sockets just had a large initial value, then settled in to a stable 
range. An example is the attached normal-stable.png graph (socket ecab33d5 
in the attached data).

Here's the beginning of that "normal" case:

socket,timestamp,pacing_rate,pacing_shift,mss_cache,snd_cwnd,packets_out,srtt_us

00000000ecab33d5	1115.798277	73877551	10	1448	10	0	3136
00000000ecab33d5	1115.859702	5711316	10	1448	21	17	51112
00000000ecab33d5	1115.8793	2691321	10	1448	22	18	113631
00000000ecab33d5	1115.888531	2428742	10	1448	22	22	125916
00000000ecab33d5	1115.905641	2048124	10	1448	23	22	156103
00000000ecab33d5	1115.952846	1687326	10	1448	25	14	205959
00000000ecab33d5	1115.972365	1731531	10	1448	26	26	208729

The pacing rate settles in after the first few points. But while 
snd_cwnd and srtt_us are small, the pacing rate is not reliable.
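
The logged rates can be cross-checked against the other columns. The 
sketch below assumes sk_pacing_rate is derived roughly the way 
tcp_update_pacing_rate() does it: mss_cache * max(snd_cwnd, packets_out) 
divided by the smoothed RTT, scaled by 200% in slow start or 120% in 
congestion avoidance (the default tcp_pacing_ss_ratio and 
tcp_pacing_ca_ratio values), with srtt_us stored left-shifted by 3. Which 
ratio applied to each sample was not logged, so it is guessed per row, and 
the helper name pacing_rate() is made up for illustration.

```python
USEC_PER_SEC = 1_000_000

def pacing_rate(mss_cache, snd_cwnd, packets_out, srtt_us, ratio_pct):
    """Approximate sk_pacing_rate in bytes/sec using integer math.

    srtt_us is the raw logged value (smoothed RTT in usec, shifted left by 3).
    ratio_pct is an assumption: 200 in slow start, 120 in congestion avoidance.
    """
    rate = mss_cache * ((USEC_PER_SEC // 100) << 3)
    rate *= ratio_pct
    rate *= max(snd_cwnd, packets_out)
    # Guard against a zero RTT sample, in which case no division is done.
    return rate // srtt_us if srtt_us else rate

# Samples from socket ecab33d5:
# (snd_cwnd, packets_out, srtt_us, assumed ratio, logged sk_pacing_rate)
rows = [
    (10, 0, 3136, 200, 73877551),
    (21, 17, 51112, 120, 5711316),
]
for cwnd, out, srtt, ratio, logged in rows:
    print(pacing_rate(1448, cwnd, out, srtt, ratio), logged)
```

Since the rate is inversely proportional to srtt_us, a very small RTT 
sample combined with the slow-start ratio is enough to produce the large 
initial value, even with snd_cwnd at only 10.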

Other sockets, which seemed to encounter gaps where packets were not sent, 
had spikes in sk_pacing_rate each time sending resumed. pacing_spikes.png 
shows this.

An excerpt of that (socket 59324953):

socket,timestamp,pacing_rate,pacing_shift,mss_cache,snd_cwnd,packets_out,srtt_us

0000000059324953	1099.545071	1535124	10	1448	333	333	3015368
0000000059324953	1100.638369	28960000	10	1448	10	0	8000
0000000059324953	1108.17501	101614035	10	1448	10	0	2280
0000000059324953	1123.068772	212941176	10	1448	10	0	1088
0000000059324953	1123.277791	218360037	10	1448	10	7	1061
0000000059324953	1123.325756	5016147	10	1448	19	0	52653
0000000059324953	1123.406518	9498001	10	1448	20	1	29271
0000000059324953	1123.440064	4935750	10	1448	20	1	56327

Here, the pacing rate had settled to around 1.5 million, then suddenly 
jumped to ~29 million and kept climbing to ~218 million. Two orders of 
magnitude is a pretty huge change! If I had sampled sk_wmem_queued, I'm 
guessing that would show a small (or zero) value which would minimize the 
impact of these spikes on the weighted average.

If this data is representative of typical behavior of sk_pacing_rate (and 
simult_flows.sh is representative enough of real-world behavior!), this 
confirms that the way the existing upstream code uses sk_pacing_rate is 
likely to make some very sub-optimal choices when there are outliers in 
sk_pacing_rate.
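
A minimal sketch of how one outlier can flip a scheduling choice. It 
assumes the scheduler picks the subflow with the smallest 
queued-bytes / pacing-rate estimate. The Subflow type, the pick_subflow() 
helper, the queued byte counts and the 5000000 rate are hypothetical; the 
1535124 and 218360037 rates are taken from the excerpt above.

```python
from dataclasses import dataclass

@dataclass
class Subflow:
    name: str
    wmem_queued: int   # bytes already queued (hypothetical values)
    pacing_rate: int   # bytes/sec, as reported by sk_pacing_rate

def linger_time(sf):
    """Estimated seconds needed to flush the queued bytes."""
    return sf.wmem_queued / sf.pacing_rate

def pick_subflow(subflows):
    """Choose the subflow with the shortest estimated flush time."""
    return min(subflows, key=linger_time)

fast = Subflow("fast", wmem_queued=100_000, pacing_rate=5_000_000)
slow = Subflow("slow", wmem_queued=50_000, pacing_rate=1_535_124)
steady_choice = pick_subflow([fast, slow]).name

# Same queues, but the slow subflow reports a spiked rate after an idle gap.
slow_spiked = Subflow("slow", wmem_queued=50_000, pacing_rate=218_360_037)
spiked_choice = pick_subflow([fast, slow_spiked]).name

print(steady_choice, spiked_choice)
```

With the settled rate the faster subflow wins; with the spiked rate the 
slow subflow looks far quicker than it really is and gets selected.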

I think this also shows that we need to be careful about how these spikes 
affect a weighted average, because one sk_pacing_rate spike could end up 
influencing the scheduler for an even longer time depending on how it gets 
weighted. Before we merge a change here, we should develop a clear 
understanding of what the data going in to the scheduling decision looks 
like, and make sure that outliers in that input data are handled 
appropriately in a variety of scenarios.
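
To illustrate the weighting concern, the sketch below runs the 
sk_pacing_rate samples from socket 59324953 through a generic weighted 
running average with two different weightings. The weighted_avg() helper 
and the 7:1 and 1:7 weights are hypothetical and are not the weights used 
by the patch, which derives them from per-socket state that was not 
captured here.

```python
def weighted_avg(avg, sample, w_old, w_new):
    """Blend the previous average with a new sample using integer weights."""
    return (avg * w_old + sample * w_new) // (w_old + w_new)

def run(samples, w_old, w_new):
    """Return the running average after each sample, seeded with the first."""
    avg = samples[0]
    history = [avg]
    for sample in samples[1:]:
        avg = weighted_avg(avg, sample, w_old, w_new)
        history.append(avg)
    return history

# sk_pacing_rate values from the excerpt for socket 59324953
samples = [1535124, 28960000, 101614035, 212941176,
           218360037, 5016147, 9498001, 4935750]

heavy_history = run(samples, w_old=7, w_new=1)  # history dominates
light_history = run(samples, w_old=1, w_new=7)  # newest sample dominates

for sample, heavy, light in zip(samples, heavy_history, light_history):
    print(sample, heavy, light)
```

When history dominates, the peak is damped but the average is still far 
above the current rate three samples after the spike ends. When the newest 
sample dominates, the average recovers quickly but follows the spike almost 
all the way up. Neither weighting handles the outlier well by itself.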

--
Mat Martineau
Intel

[-- Attachment #2: Type: application/x-gzip, Size: 85783 bytes --]

[-- Attachment #3: Type: image/png, Size: 20634 bytes --]

[-- Attachment #4: Type: image/png, Size: 30747 bytes --]
