Message-ID: <1701916081.917355-1-xuanzhuo@linux.alibaba.com>
Subject: Re: [PATCH net-next] tcp: add tracepoints for data send/recv/acked
Date: Thu, 7 Dec 2023 10:28:01 +0800
From: Xuan Zhuo
To: Eric Dumazet
Cc: netdev@vger.kernel.org, rostedt@goodmis.org, mhiramat@kernel.org, mathieu.desnoyers@efficios.com, davem@davemloft.net, dsahern@kernel.org, kuba@kernel.org, pabeni@redhat.com, martin.lau@linux.dev, linux-trace-kernel@vger.kernel.org, bpf@vger.kernel.org, dust.li@linux.alibaba.com, alibuda@linux.alibaba.com, guwen@linux.alibaba.com, hengqi@linux.alibaba.com, Philo Lu
References: <20231204114322.9218-1-lulie@linux.alibaba.com> <1701740897.6795166-1-xuanzhuo@linux.alibaba.com>
X-Mailing-List: bpf@vger.kernel.org

On Tue, 5 Dec 2023 20:39:28 +0100, Eric Dumazet wrote:
> On Tue, Dec 5, 2023 at 3:11 AM Xuan Zhuo wrote:
> >
> > On Mon, 4 Dec 2023 13:28:21 +0100, Eric Dumazet wrote:
> > > On Mon, Dec 4, 2023 at 12:43 PM Philo Lu wrote:
> > > >
> > > > Add 3 tracepoints, namely tcp_data_send/tcp_data_recv/tcp_data_acked,
> > > > which will be called every time a tcp data packet is sent, received, and
> > > > acked.
> > > > tcp_data_send: called after a data packet is sent.
> > > > tcp_data_recv: called after a data packet is received.
> > > > tcp_data_acked: called after a valid ack packet is processed (some sent
> > > > data are acknowledged).
> > > >
> > > > We use these callbacks for fine-grained tcp monitoring, which collects
> > > > and analyzes information about every tcp request/response event. The
> > > > whole system has been described in SIGMOD'18 (see
> > > > https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
> > > > achieve this with bpf, we require hooks for data events that call a bpf
> > > > prog (1) when any data packet is sent/received/acked, and (2) after
> > > > critical tcp state variables have been updated (e.g., snd_una, snd_nxt,
> > > > rcv_nxt). However, existing bpf hooks cannot meet our requirements.
> > > > Besides, these tracepoints help to debug tcp when data is sent/received/acked.
> > >
> > > This I do not understand.
> > >
> > > >
> > > > Though kretprobe/fexit can also be used to collect this information,
> > > > they will not work if the kernel functions get inlined. Considering
> > > > stability, we prefer tracepoints as the solution.
> > >
> > > I dunno, this seems quite weak to me. I see many patches coming to add
> > > tracing in the stack, but no patches fixing any issues.
> >
> > We have implemented a mechanism that uses these hooks to split a TCP
> > connection into requests and responses. It handles various protocols such as
> > HTTP, HTTPS, Redis, and MySQL. This mechanism allows us to record important
> > information about each request and response, including the amount of data
> > uploaded, the time taken by the server to handle the request, and the time
> > taken for the client to receive the response. This mechanism has been running
> > internally for many years and has proven to be very useful.
> >
> > One of the main benefits of this mechanism is that it helps in locating the
> > source of any issues or problems that may arise.
> > For example, if there is a problem with the network, the application, or the
> > machine, we can use this mechanism to identify and isolate it.
> >
> > Tracking the transmission of data over a TCP connection has long been a
> > challenge. The application can only confirm that it has handed a certain
> > amount of data to the kernel; it has limited visibility into whether the
> > client has actually received that data. Our mechanism addresses this by
> > showing how much data the client has received and when it was received.
> > We can also detect packet loss or delays caused by the server.
> >
> > https://help-static-aliyun-doc.aliyuncs.com/assets/img/zh-CN/7912288961/9732df025beny.svg
> >
> > So we are not adding tracepoints for some unspecified debugging. We have a
> > clear goal; debugging is just an incidental capability.
> >
>
> We have powerful mechanisms in the stack already that ordinary (no
> privilege requested) applications can readily use.
>
> We have been using them for a while.
>
> If existing mechanisms are missing something you need, please expand them.
>
> For reference, start looking at tcp_get_timestamping_opt_stats() history.
>
> Sender side can for instance get precise timestamps.
>
> Combinations of these timestamps reveal different parts of the overall
> network latency:
>
> T0: sendmsg() enters TCP
> T1: first byte enters qdisc
> T2: first byte sent to the NIC
> T3: first byte ACKed in TCP
> T4: last byte sent to the NIC
> T5: last byte ACKed
> T1 - T0: how long the first byte was blocked in the TCP layer ("Head
> of Line Blocking" latency).
> T2 - T1: how long the first byte was blocked in the Linux traffic
> shaping layer (known as QDisc).
> T3 - T2: the network ‘distance’ (propagation delay + current queuing
> delay along the network path and at the receiver).
> T5 - T2: how fast the sent chunk was delivered.
> Message Size / (T5 - T0): goodput (from application’s perspective)

The key point is that with our mechanism, the application does not need to be
modified.

As long as the application's protocol is request/response, we can trace a TCP
connection at any time, split it into requests and responses, and record the
start and end times of each. The records also include the TTL and other
information.

Thanks.
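The interval arithmetic in the quoted timestamp list can be sketched in C as follows. This is only an illustration: `struct tx_timestamps`, `struct latency_breakdown`, and `breakdown()` are hypothetical names, all timestamps are assumed to be nanoseconds from one common clock and already ordered, and collecting them (for example from SO_TIMESTAMPING reports) is out of scope here. T4 is stored but, as in the quoted list, none of the derived values use it.

```c
#include <stdint.h>

/* Hypothetical container for the six sender-side timestamps, in ns. */
struct tx_timestamps {
	uint64_t t0; /* sendmsg() enters TCP */
	uint64_t t1; /* first byte enters qdisc */
	uint64_t t2; /* first byte sent to the NIC */
	uint64_t t3; /* first byte ACKed in TCP */
	uint64_t t4; /* last byte sent to the NIC (unused below) */
	uint64_t t5; /* last byte ACKed */
};

/* Derived intervals, named after the quoted descriptions. */
struct latency_breakdown {
	uint64_t hol_blocking_ns;    /* T1 - T0: blocked in the TCP layer */
	uint64_t qdisc_ns;           /* T2 - T1: blocked in the qdisc */
	uint64_t network_ns;         /* T3 - T2: network "distance" */
	uint64_t delivery_ns;        /* T5 - T2: chunk delivery time */
	double goodput_bytes_per_s;  /* size / (T5 - T0) */
};

/* Assumes t0 <= t1 <= t2 <= t3 <= t5 and t5 > t0 (no underflow checks). */
struct latency_breakdown breakdown(const struct tx_timestamps *t,
				   uint64_t msg_bytes)
{
	struct latency_breakdown b;

	b.hol_blocking_ns = t->t1 - t->t0;
	b.qdisc_ns = t->t2 - t->t1;
	b.network_ns = t->t3 - t->t2;
	b.delivery_ns = t->t5 - t->t2;
	/* Convert ns to s: bytes * 1e9 / elapsed_ns. */
	b.goodput_bytes_per_s =
		(double)msg_bytes * 1e9 / (double)(t->t5 - t->t0);
	return b;
}
```

With T0..T5 = 0, 2000, 5000, 40000, 900000, 1000000 ns and a 1000-byte message, this gives 2 µs of head-of-line blocking, 3 µs in the qdisc, 35 µs of network distance, 995 µs of delivery time, and a goodput of 10^6 bytes/s.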
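The request/response splitting described in this thread is not specified in detail, so the following is a minimal sketch of one plausible heuristic, not the authors' implementation. It assumes a server-side stream of hypothetical per-segment events (direction, payload bytes, timestamp), such as the proposed tcp_data_recv/tcp_data_send tracepoints might feed to a BPF program. A run of received data is treated as a request, the following run of sent data as its response, and received data arriving after a response starts the next exchange. `struct data_event`, `struct exchange`, and `split_exchanges()` are illustrative names only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical server-side data event, e.g. as a tracepoint might report. */
enum dir { DIR_RECV, DIR_SEND };

struct data_event {
	enum dir dir;   /* DIR_RECV: request bytes, DIR_SEND: response bytes */
	uint32_t bytes; /* payload length of this segment */
	uint64_t ts_us; /* timestamp in microseconds */
};

/* One request/response exchange reconstructed from the event stream. */
struct exchange {
	uint64_t req_start_us, req_end_us;   /* first/last request segment */
	uint64_t resp_start_us, resp_end_us; /* first/last response segment */
	uint64_t req_bytes, resp_bytes;
};

/*
 * Hypothetical direction-flip heuristic: received data after a response
 * (or at the very start) opens a new exchange. Returns the number of
 * exchanges written to out, at most max_out.
 */
size_t split_exchanges(const struct data_event *ev, size_t n,
		       struct exchange *out, size_t max_out)
{
	size_t count = 0;
	struct exchange *cur = NULL;

	for (size_t i = 0; i < n; i++) {
		int new_req = ev[i].dir == DIR_RECV &&
			      (cur == NULL || cur->resp_bytes > 0);

		if (new_req) {
			if (count == max_out)
				break; /* output buffer full */
			cur = &out[count++];
			*cur = (struct exchange){ .req_start_us = ev[i].ts_us };
		}
		if (cur == NULL)
			continue; /* sent data before any request: ignored */
		if (ev[i].dir == DIR_RECV) {
			cur->req_bytes += ev[i].bytes;
			cur->req_end_us = ev[i].ts_us;
		} else {
			if (cur->resp_bytes == 0)
				cur->resp_start_us = ev[i].ts_us;
			cur->resp_bytes += ev[i].bytes;
			cur->resp_end_us = ev[i].ts_us;
		}
	}
	return count;
}
```

For example, receiving 100 and 50 bytes, sending 1000 and 500 bytes, then receiving 80 and sending 300 bytes yields two exchanges: 150 request bytes with 1500 response bytes, then 80 with 300. Pipelined or server-initiated protocols would need more than this simple direction-flip rule.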