From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mga14.intel.com (mga14.intel.com [192.55.52.115]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DCA2F2C96 for ; Wed, 3 Nov 2021 23:29:03 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10157"; a="231868505" X-IronPort-AV: E=Sophos;i="5.87,207,1631602800"; d="scan'208";a="231868505" Received: from orsmga002.jf.intel.com ([10.7.209.21]) by fmsmga103.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Nov 2021 16:28:58 -0700 X-IronPort-AV: E=Sophos;i="5.87,207,1631602800"; d="scan'208";a="468285959" Received: from crschrol-desk6.amr.corp.intel.com ([10.251.25.129]) by orsmga002-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Nov 2021 16:28:56 -0700 Date: Wed, 3 Nov 2021 16:28:56 -0700 (PDT) From: Mat Martineau To: Paolo Abeni cc: mptcp@lists.linux.dev Subject: Re: [PATCH v2 mptcp-next] mptcp: enforce HoL-blocking estimation In-Reply-To: Message-ID: <98a575e6-7784-eb6c-7286-df7b8e4f2e87@linux.intel.com> References: <8a4189c20285e73f1b4f51d4e442477003f40d1f.camel@redhat.com> Precedence: bulk X-Mailing-List: mptcp@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset=US-ASCII On Wed, 3 Nov 2021, Paolo Abeni wrote: > On Tue, 2021-11-02 at 18:07 -0700, Mat Martineau wrote: >> On Tue, 2 Nov 2021, Paolo Abeni wrote: >> >>> Hello, >>> >>> On Tue, 2021-10-26 at 11:42 +0200, Paolo Abeni wrote: >>>> The MPTCP packet scheduler has sub-optimal behavior with asymmetric >>>> subflows: if the faster subflow-level cwin is closed, the packet >>>> scheduler can enqueue "too much" data on a slower subflow. >>>> >>>> When all the data on the faster subflow is acked, if the mptcp-level >>>> cwin is closed, and link utilization becomes suboptimal. >>>> >>>> The solution is implementing blest-like[1] HoL-blocking estimation, >>>> transmitting only on the subflow with the shorter estimated time to >>>> flush the queued memory. If such subflows cwin is closed, we wait >>>> even if other subflows are available. >>>> >>>> This is quite simpler than the original blest implementation, as we >>>> leverage the pacing rate provided by the TCP socket. To get a more >>>> accurate estimation for the subflow linger-time, we maintain a >>>> per-subflow weighted average of such info. >>>> >>>> Additionally drop magic numbers usage in favor of newly defined >>>> macros and use more meaningful names for status variable. >>>> >>>> [1] http://dl.ifip.org/db/conf/networking/networking2016/1570234725.pdf >>>> >>>> Signed-off-by: Paolo Abeni >>>> --- >>>> v1 -> v2: >>>> - fix checkpatch issue (mat) >>>> - rename ratio as linger_time (mat) >>> >>> As this patch empirically improves the situation quite a bit, what >>> about applying it on the export branch, and staging it for a bit? >>> >>> If we find issues or improvements we can revert/drop/update. Otherwise >>> we can keep it. Meanwhile it will get a better testing... >>> >>> WDYT? >> >> Hi Paolo - >> >> I finally (sorry about the delay) did capture some data to see what was >> going on with the pacing rate. I added a pr_debug() in >> mptcp_subflow_get_send() to dump the following: >> >> timestamp >> socket ptr >> sk_pacing_rate >> sk_pacing_shift >> mss_cache >> snd_cwnd >> packets_out >> srtt_us >> >> I chose those values based on what is used to calculate sk_pacing_rate - I >> should have also included snd_wnd and sk_wmem_queued, since those are used >> in the weighted average in this patch. But I ran out of time to update >> everything today and still get this reply written. >> >> The debug output didn't appear to interfere much with the simult_flows.sh >> tests, they still completed in the normal time (except for one that ran >> just a little long). >> >> I observed some patterns with the output I captured during a >> simult_flows.sh run: >> >> 1. When snd_cwnd and srtt_us were small (when data had not been sent yet >> or there was a time gap), sk_pacing_rate had very large spikes. Often 10x >> higher or more. >> >> 2. When packets were streaming out, sk_pacing_rate was quite consistent >> >> >> Some sockets just had a large initial value, then settled in to a stable >> range. An example is the attached normal-stable.png graph (socket ecab33d5 >> in the attached data). >> >> Here's the beginning of that "normal" case: >> >> socket,timestamp,pacing_rate,pacing_shift,mss_cache,snd_cwnd,packets_out,srtt_us >> >> 00000000ecab33d5 1115.798277 73877551 10 1448 10 0 3136 >> 00000000ecab33d5 1115.859702 5711316 10 1448 21 17 51112 >> 00000000ecab33d5 1115.8793 2691321 10 1448 22 18 113631 >> 00000000ecab33d5 1115.888531 2428742 10 1448 22 22 125916 >> 00000000ecab33d5 1115.905641 2048124 10 1448 23 22 156103 >> 00000000ecab33d5 1115.952846 1687326 10 1448 25 14 205959 >> 00000000ecab33d5 1115.972365 1731531 10 1448 26 26 208729 >> >> The average pacing rate settles in after the first few points. But when >> snd_cwnd and srtt_us are small, the pacing rate was not reliable. >> >> >> Other sockets, which seemed to encounter gaps where packets were not sent, >> had spikes in sk_pacing_rate each time sending resumed. pacing_spikes.png >> shows this. >> >> An excerpt of that (socket 59324953): >> >> socket,timestamp,pacing_rate,pacing_shift,mss_cache,snd_cwnd,packets_out,srtt_us >> >> 0000000059324953 1099.545071 1535124 10 1448 333 333 3015368 >> 0000000059324953 1100.638369 28960000 10 1448 10 0 8000 >> 0000000059324953 1108.17501 101614035 10 1448 10 0 2280 >> 0000000059324953 1123.068772 212941176 10 1448 10 0 1088 >> 0000000059324953 1123.277791 218360037 10 1448 10 7 1061 >> 0000000059324953 1123.325756 5016147 10 1448 19 0 52653 >> 0000000059324953 1123.406518 9498001 10 1448 20 1 29271 >> 0000000059324953 1123.440064 4935750 10 1448 20 1 56327 >> >> Here, the pacing rate had settled to around 1.5 million, then suddenly >> jumped to ~29 million and ~218 million. Two orders of magnitude is a >> pretty huge change! If I had sampled sk_wmem_queued, I'm guessing that >> would show a small (or zero) value which would minimize the impact of >> these spikes on the weighted average. >> >> >> >> If this data is representative of typical behavior of sk_pacing_rate (and >> simult_flows.sh is representative enough of real-world behavior!), this >> does definitely confirm that the way the existing upstream code uses >> sk_pacing_rate is likely making some very sub-optimal choices when there >> are outliers in sk_pacing_rate. >> >> I think this also shows that we need to be careful about how these spikes >> affect a weighted average, because one sk_pacing_rate spike could end up >> influencing the scheduler for an even longer time depending on how it gets >> weighted. > > Thank you for the detailed analisys and all the info shared! > > I think there is an important point I was unable to clarify so far: the > average pacing rate is _NOT_ used to make any prediction or estimation > on the future. > > It computes an approximation of the time to flush the queued data. Note > due to pacing and the mptcp scheduler enqueuing data only if the cwin > is open, we could compute such time with absolute accuracy, e.g. if the > packets present into the tx queue have the following data: > > pkt len pacing > 1 2000 100K > 2 1000 1M > 3 500 10K > > The time to flush will be: > > 1000/ 10K + 2000/1M + 500/100K > > regardless of the high fluctuation in the pacing rate. If the TCP code > selects a suboptimal pacing rate value, the _TCP_ tput will be > suboptimal, but the above equation will still be exact. > > The average schema used here is an approximation of the above. Why > don't using the exact formula instead? it would require more overhead, > more accounting (we would need to store a timing or pace info in the CB > of each packet) and possibly even additional hooks (we would need to > update the accounting when a packet transmission completes). > > Empirical results showed that the avg approximation is a good enough > approximation. Thanks for the info and explanation. I do agree it's likely an improvement - I'll still collect some more data (this time with the avg, snd_wnd, and sk_wmem_queued) to see if more adjustments are needed before upstreaming. > >> Before we merge a change here, we should develop a clear >> understanding of what the data going in to the scheduling decision looks >> like, and make sure that outliers in that input data are handled >> appropriately in a variety of scenarios. > > Due to the above, I still propose to go ahead with this version and > ev., as needed, improve incrementally. I'm ok to provisionally add it to the export branch now to get some more test exposure beyond the tests I'm running locally. I haven't convinced myself it's ready for net-next yet but we can sort that part out later. -- Mat Martineau Intel