From: Li Yu
Subject: [RFC] skbtrace: A trace infrastructure for networking subsystem
Date: Tue, 10 Jul 2012 14:07:50 +0800
Message-ID: <4FFBC6B6.2000600@gmail.com>
To: Linux Netdev List

Hi,

This RFC introduces a tracing infrastructure for the networking subsystem, together with a workable prototype.

I have noticed that blktrace helps file system and block subsystem developers a great deal; it has even helped them find problems in the mm subsystem. Networking developers have not had such good luck. tcpdump is very useful, but they still often have to start an investigation from a limited set of exported statistics counters, then dig into the source code to guess at possible causes, then test their ideas, and, if luck does not arrive, start another investigate-guess-test loop. This is time-consuming and the results are hard to share as experience or as a problem report; many users do not understand the protocol stack internals well enough, and I have seen "detailed reports" that still carried no information useful for solving the problem.

Unfortunately, the networking subsystem is rather performance sensitive, so we cannot simply add very detailed counters to it. In fact, some folks have already tried to add more statistics counters for fine-grained performance measurement, e.g. RFC 4898 and its implementation, the Web10G project. Web10G is a great project for researchers and engineers working on the TCP stack; it exports per-connection details to userland through a procfs or netlink interface. However, it depends tightly on TCP and its implementation, so other protocols would need to duplicate the work to achieve the same goal, and it also has measurable overhead (5% - 10% in my simple netperf TCP_STREAM benchmark). I think such a powerful tracing or instrumentation feature should be switchable at runtime and have zero overhead when it is off.

So why don't we write a blktrace-like utility for our sweet networking subsystem? That is "skbtrace". I hope it can:

1. Provide an extendable tracing infrastructure that supports various protocols instead of a specific one.

2. Be enabled or disabled at runtime, with zero overhead when it is off. I think jump label optimized trace points are a good choice to implement this (a rough sketch follows below).

3. Provide tracing details at per-connection/per-skb level. Please note that skbtrace is not only for sk_buff tracing; it can also track socket events. This also means we need some form of filtering, otherwise we will get lost in tons of uninteresting trace data. I think BPF is a good choice here, but we need to extend BPF so it can handle data structures other than skb.

Above is my basic idea; below are the details of the current prototype implementation.

Like blktrace, skbtrace is based on the tracepoints infrastructure and the relay file system. However, I did not implement any tracers like blktrace does, since I want to keep the kernel side as simple (and, I hope, as fast) as possible.
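To make point 2 above concrete, here is a minimal sketch of the mechanism using the plain static key API. The names below are made up for illustration; the actual patch hooks in through the standard tracepoint infrastructure, which already uses static keys internally.

#include <linux/jump_label.h>
#include <linux/skbuff.h>

/* Hypothetical key; it is flipped only while a user has tracing enabled. */
static struct static_key skbtrace_key = STATIC_KEY_INIT_FALSE;

/* Hypothetical slow path: copy the interesting fields into the relay buffer. */
void __skbtrace_event_rps(struct sk_buff *skb);

static inline void skbtrace_event_rps(struct sk_buff *skb)
{
	/*
	 * static_key_false() compiles to a single no-op on the hot path and
	 * is patched into a jump only when the key is enabled, so the
	 * networking fast path pays essentially nothing while tracing is off.
	 */
	if (static_key_false(&skbtrace_key))
		__skbtrace_event_rps(skb);
}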
Basically, the trace points here are just optimized conditional statements; the slow path copies the traced data into ring buffers in the relay file system. The parameters of the relay file system can be tuned through files exported in the skbtrace directory.

There are three trace data files (channels) in the relay file system for each CPU; they are the ring buffers that hold trace data produced from different contexts:

(1) trace.hardirq.cpuN, for trace data produced in hardirq context.
(2) trace.softirq.cpuN, for trace data produced in softirq context.
(3) trace.syscall.cpuN, for trace data produced in process context.

Each trace record is written into one of the above channels, depending on the context in which the trace point was called. Each trace record is represented by a skbtrace_block struct; extended fields for a specific protocol can be appended at its end. To preserve the global order of trace data, this patch uses a 64-bit atomic variable to generate a sequence number for each record, so the userland utility is able to sort out-of-order trace data across different channels and/or CPUs.

For the trace filtering feature, I selected BPF as the core engine. So far it can only filter sk_buff-based traces; I plan to extend BPF to support other data structures. In fact, I once wrote a custom filter implementation for TCP/IPv4, but that approach requires refactoring every specific protocol implementation, so I did not like it and discarded it.

So far, I have implemented these skbtrace trace points:

(1) skb_rps_info. I have seen buggy drivers (or firmwares?) that always set up a zero skb->rx_hash, and it seems that RPS hashing does not work well in some corner cases.
(2) tcp_connection and icsk_connection. To track the basic TCP state transitions, e.g. TCP_LISTEN.
(3) tcp_sendlimit. Personally, I am interested in the reasons why tcp_write_xmit() exits.
(4) tcp_congestion. Oops, it is a cwnd killer, isn't it?

The userland utilities:

(1) skbtrace, records raw trace data into regular disk files.
(2) skbparse, parses raw trace data into human readable strings. This still needs a lot of work; it is just a rough (but workable) demo for TCP/IPv4 yet.

You can get the source code at github:

https://github.com/Rover-Yu/skbtrace-userland
https://github.com/Rover-Yu/skbtrace-kernel

The source code of skbtrace-kernel is based on the net-next tree.

Welcome for suggestions.

Thanks.

Yu
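P.S. To make the record format above a bit more concrete, here is a rough sketch of what a skbtrace_block header could look like. The field names and layout are purely illustrative and may not match the actual patch; please refer to the skbtrace-kernel tree for the real definition.

#include <linux/types.h>

/* Illustrative sketch only -- not the actual definition from the patch. */
struct skbtrace_block {
	__u64 seq;     /* global sequence number from the 64-bit atomic counter */
	__u64 ts;      /* timestamp, helps userland merge per-CPU, per-context streams */
	__u16 len;     /* total record length, including the protocol-specific tail */
	__u16 action;  /* which trace point produced this record, e.g. tcp_sendlimit */
	__u32 flags;   /* per-event flags */
	__u64 ptr;     /* identity of the traced skb or sock, for correlating records */
	/* protocol-specific extension fields are appended after the header */
};

Userland (skbtrace/skbparse) can then read the per-CPU, per-context channels independently and merge them into one ordered stream by sorting on seq.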