From: Mat Martineau <mathew.j.martineau@linux.intel.com>
To: Paolo Abeni <pabeni@redhat.com>
Cc: mptcp@lists.linux.dev
Subject: Re: [PATCH v2 mptcp-next] mptcp: enforce HoL-blocking estimation
Date: Tue, 2 Nov 2021 18:07:46 -0700 (PDT) [thread overview]
Message-ID: <a7127287-2a50-cd58-e82-4022ed399ad@linux.intel.com> (raw)
In-Reply-To: <8a4189c20285e73f1b4f51d4e442477003f40d1f.camel@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 5379 bytes --]
On Tue, 2 Nov 2021, Paolo Abeni wrote:
> Hello,
>
> On Tue, 2021-10-26 at 11:42 +0200, Paolo Abeni wrote:
>> The MPTCP packet scheduler has sub-optimal behavior with asymmetric
>> subflows: if the faster subflow-level cwin is closed, the packet
>> scheduler can enqueue "too much" data on a slower subflow.
>>
>> When all the data on the faster subflow is acked, if the mptcp-level
>> cwin is closed, and link utilization becomes suboptimal.
>>
>> The solution is implementing blest-like[1] HoL-blocking estimation,
>> transmitting only on the subflow with the shorter estimated time to
>> flush the queued memory. If such subflows cwin is closed, we wait
>> even if other subflows are available.
>>
>> This is quite simpler than the original blest implementation, as we
>> leverage the pacing rate provided by the TCP socket. To get a more
>> accurate estimation for the subflow linger-time, we maintain a
>> per-subflow weighted average of such info.
>>
>> Additionally drop magic numbers usage in favor of newly defined
>> macros and use more meaningful names for status variable.
>>
>> [1] http://dl.ifip.org/db/conf/networking/networking2016/1570234725.pdf
>>
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> ---
>> v1 -> v2:
>> - fix checkpatch issue (mat)
>> - rename ratio as linger_time (mat)
>
> As this patch empirically improves the situation quite a bit, what
> about applying it on the export branch, and staging it for a bit?
>
> If we find issues or improvements we can revert/drop/update. Otherwise
> we can keep it. Meanwhile it will get a better testing...
>
> WDYT?
Hi Paolo -
I finally (sorry about the delay) did capture some data to see what was
going on with the pacing rate. I added a pr_debug() in
mptcp_subflow_get_send() to dump the following:
timestamp
socket ptr
sk_pacing_rate
sk_pacing_shift
mss_cache
snd_cwnd
packets_out
srtt_us
I chose those values based on what is used to calculate sk_pacing_rate - I
should have also included snd_wnd and sk_wmem_queued, since those are used
in the weighted average in this patch. But I ran out of time to update
everything today and still get this reply written.
The debug output didn't appear to interfere much with the simult_flows.sh
tests, they still completed in the normal time (except for one that ran
just a little long).
I observed some patterns with the output I captured during a
simult_flows.sh run:
1. When snd_cwnd and srtt_us were small (when data had not been sent yet
or there was a time gap), sk_pacing_rate had very large spikes. Often 10x
higher or more.
2. When packets were streaming out, sk_pacing_rate was quite consistent
Some sockets just had a large initial value, then settled in to a stable
range. An example is the attached normal-stable.png graph (socket ecab33d5
in the attached data).
Here's the beginning of that "normal" case:
socket,timestamp,pacing_rate,pacing_shift,mss_cache,snd_cwnd,packets_out,srtt_us
00000000ecab33d5 1115.798277 73877551 10 1448 10 0 3136
00000000ecab33d5 1115.859702 5711316 10 1448 21 17 51112
00000000ecab33d5 1115.8793 2691321 10 1448 22 18 113631
00000000ecab33d5 1115.888531 2428742 10 1448 22 22 125916
00000000ecab33d5 1115.905641 2048124 10 1448 23 22 156103
00000000ecab33d5 1115.952846 1687326 10 1448 25 14 205959
00000000ecab33d5 1115.972365 1731531 10 1448 26 26 208729
The average pacing rate settles in after the first few points. But when
snd_cwnd and srtt_us are small, the pacing rate was not reliable.
Other sockets, which seemed to encounter gaps where packets were not sent,
had spikes in sk_pacing_rate each time sending resumed. pacing_spikes.png
shows this.
An excerpt of that (socket 59324953):
socket,timestamp,pacing_rate,pacing_shift,mss_cache,snd_cwnd,packets_out,srtt_us
0000000059324953 1099.545071 1535124 10 1448 333 333 3015368
0000000059324953 1100.638369 28960000 10 1448 10 0 8000
0000000059324953 1108.17501 101614035 10 1448 10 0 2280
0000000059324953 1123.068772 212941176 10 1448 10 0 1088
0000000059324953 1123.277791 218360037 10 1448 10 7 1061
0000000059324953 1123.325756 5016147 10 1448 19 0 52653
0000000059324953 1123.406518 9498001 10 1448 20 1 29271
0000000059324953 1123.440064 4935750 10 1448 20 1 56327
Here, the pacing rate had settled to around 1.5 million, then suddenly
jumped to ~29 million and ~218 million. Two orders of magnitude is a
pretty huge change! If I had sampled sk_wmem_queued, I'm guessing that
would show a small (or zero) value which would minimize the impact of
these spikes on the weighted average.
If this data is representative of typical behavior of sk_pacing_rate (and
simult_flows.sh is representative enough of real-world behavior!), this
does definitely confirm that the way the existing upstream code uses
sk_pacing_rate is likely making some very sub-optimal choices when there
are outliers in sk_pacing_rate.
I think this also shows that we need to be careful about how these spikes
affect a weighted average, because one sk_pacing_rate spike could end up
influencing the scheduler for an even longer time depending on how it gets
weighted. Before we merge a change here, we should develop a clear
understanding of what the data going in to the scheduling decision looks
like, and make sure that outliers in that input data are handled
appropriately in a variety of scenarios.
--
Mat Martineau
Intel
[-- Attachment #2: Type: application/x-gzip, Size: 85783 bytes --]
[-- Attachment #3: Type: image/png, Size: 20634 bytes --]
[-- Attachment #4: Type: image/png, Size: 30747 bytes --]
next prev parent reply other threads:[~2021-11-03 1:11 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-10-26 9:42 [PATCH v2 mptcp-next] mptcp: enforce HoL-blocking estimation Paolo Abeni
2021-11-02 18:00 ` Paolo Abeni
2021-11-03 1:07 ` Mat Martineau [this message]
2021-11-03 11:43 ` Paolo Abeni
2021-11-03 23:28 ` Mat Martineau
2021-11-04 14:05 ` Matthieu Baerts
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=a7127287-2a50-cd58-e82-4022ed399ad@linux.intel.com \
--to=mathew.j.martineau@linux.intel.com \
--cc=mptcp@lists.linux.dev \
--cc=pabeni@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.